Abstract
The increasing importance of data sharing and research reproducibility has heightened the need for effective pseudonymization of personal data in various types of corpora. This is particularly crucial for spontaneous speech transcripts, social media data and learner corpora, where GDPR compliance requires careful de-identification of participants (Fink & Pallas 2020) while maintaining the natural flow and readability of the text. Current approaches to pseudonymization rely on regular expressions for fixed patterns (e.g., emails, dates and phone numbers) and dictionary-based methods or Named Entity Recognition (NER) for patterns that are more unpredictable and bound to the context (locations, institutions, work of arts, etc.) (Volodina et al. 2020).
Especially for these types of entities, though, identification error rates are high and extensive manual correction is required in order to successfully de-identify participants. Most importantly, current approaches do not provide context-sensitive and culturally appropriate substitutions that keep the natural flow of the text intact. Rather, they substitute personal information with tags that are often unnatural to read. In addition, if pseudonymization is performed as the first step of the text processing pipeline, the insertion of tags on multi-words spans may later prevent the identification of syntactic categories and dependencies holding between pseudonymized items.
This paper presents pseudollm, an open-source Python package that leverages Large Language Models (LLMs) to perform sophisticated, context-aware pseudonymization of texts. Unlike existing solutions, pseudollm generates both standard NER tags and culturally appropriate substitutions that preserve text coherence and readability. Using the technique of one-shot prompting, pseudollm is adaptable to any pseudonymization schema with very little manual annotations.
The tool can be installed locally and operated via a command-line interface. As for now, it runs via the OpenAI API, but extensions with open-source LLMs are possible. This presentation will include a live demonstration of the tool’s capabilities on various genres and languages, highlighting its potential for advancing reproducible research while maintaining participant privacy.
References
Finck, M., & Pallas, F. (2020). They who must not be identified—Distinguishing personal from non-personal data under the GDPR. International Data Privacy Law, 10(1), 11–36. https://doi.org/10.1093/idpl/ipz026
Volodina, E., Ali Mohammed, Y., Derbring, S., Matsson, A., & Megyesi, B. (2020). Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 357–369). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.32