Abstract
As language technologies like ChatGPT increasingly shape how we communicate, translate and access knowledge, they promise efficiency, accessibility and innovation, but they also introduce new forms of techno-linguistic bias and raise critical questions about the representation of linguistic diversity in digital spaces. UniTermGPT investigates how large language models (LLMs) process terminological variation within the German language, specifically university-related terminology from Austria, Germany, Switzerland and South Tyrol. These regional varieties exhibit considerable lexical and institutional differences, which are often flattened or misrepresented by generalized AI systems trained on non-transparent data. Though considered a single language, German encompasses significant internal variation that shapes its terminology. Higher education is a prime example: system-bound terms such as Bachelorstudium in Austria or Bachelorstudiengang in Germany and South Tyrol reflect regionally specific educational systems and linguistic traditions. However, language technologies like ChatGPT tend to flatten these distinctions, producing translations or texts that favour dominant norms and disregard linguistic nuance, even when specific prompts are used. UniTermGPT critically examines how ChatGPT processes and reproduces this kind of intra-linguistic diversity. By compiling a corpus of university-related texts from Austrian, German, Swiss and South Tyrolean higher education institutions, the project identifies key terminological differences and tests how ChatGPT handles them in multilingual translation. This includes contrastive analyses of AI-generated outputs against existing terminological databases, as well as annotations from domain experts such as university translators and terminologists.
Through this approach, UniTermGPT uncovers a consistent pattern: ChatGPT often fails to recognize regionally specific, system-bound terminology or generalizes it into standardized forms that may obscure or misrepresent local institutional practices. Such techno-linguistic biases reflect broader dynamics of linguistic homogenization, thereby shaping whose language, knowledge and identity are made visible or invisible in digital environments. The project situates these findings within larger debates about languaging diversity across digital platforms. By failing to adequately represent linguistic diversity, AI-driven translation and content generation tools risk erasing culturally situated language practices and limiting access to domain-specific knowledge. UniTermGPT also engages with the ethics and politics of translation in the age of AI. It challenges the assumption that “good enough” machine-generated translations are sufficient when they ignore the socio-cultural embeddedness of terminology. By incorporating expert annotation and evaluation into the process, the project models a participatory approach that values local linguistic knowledge and human expertise. Importantly, UniTermGPT is committed to Open Science principles. The corpus, methodology, annotated outputs and prompts are made openly available to support transparency, replication and community engagement. In doing so, the project not only advances research in language technology and terminology but also offers policy-relevant insights for developers and translators seeking to build and use more inclusive and culturally sensitive AI applications. Ultimately, UniTermGPT contributes to a growing body of work that calls for critical reflection on how language technologies participate in the negotiation of identity, diversity and power. It argues that for language technology to truly serve society, it must account for the full spectrum of linguistic variation.