Adapting machine translation for under-resourced languages: a first attempt for institutional German in South Tyrol

Flavia De Camillis; AG Contarino

The setting of our study is South Tyrol, a small province in Italy where Italian and German are co-official languages. Although German is a well-resourced, pluricentric language, its South Tyrolean variety is to a certain extent under-resourced. This is because institutional German in South Tyrol is strongly bound to the Italian legal system, which makes its legal and administrative terminology substantially different from that of other German varieties. Standardised and recommended terminology for South Tyrol is collected in the freely accessible information system bistro (Ralli & Andreatta, 2018). Public institutions have systematically published many of their documents (e.g. laws, invitations to tender, resolutions) in both Italian and German since the 1990s. However, these documents are one of the few sources of legal and institutional discourse in South Tyrolean German, making it a narrow domain. For machine translation research, this is a well-known limitation (Koehn & Knowles, 2017; Michon et al., 2020; Skadiņa et al., 2010). It is also the primary reason why South Tyrolean civil servants cannot rely on mainstream MT tools for their translations. The output of such tools may be quite good standard German thanks to the progress made by neural network models (Barrault et al., 2020; Vaswani et al., 2017), but legal and administrative terms are frequently mistranslated (De Camillis, 2021; Wiesmann, 2019). Against this background, our exploratory study is, to the best of our knowledge, the first attempt at tailoring an MT system to South Tyrolean institutions. From our experiments, we expect improvements particularly in legal and administrative terminology. For this pilot phase, we chose an adaptive NMT system, ModernMT, as its adaptation approach allows on-the-fly fine-tuning of a pre-trained baseline model based on an in-domain adaptation set (Bertoldi, Caroselli, et al., 2018).
To set up the tests, we collected existing legal and administrative resources in South Tyrolean German, consisting of published documents and local terminology. The same resources are also available in Italian. First, we created a parallel corpus of local legislation by scraping the public database LexBrowser. A total of 4,987 texts were collected, aligned, and then carefully cleaned and filtered using deterministic rules in order to retain high-quality sentence pairs. Cleaning operations consisted of removing segment-internal noise and correcting hyphenated words, whereas filtering operations discarded bad sentence pairs according to several noise classes (wrong language, sentence length ratio, duplicates, etc.). At the same time, we collected and cleaned translation memories (TMs) from the central translation bureau of the local administration. Overall, we gathered approximately 243k translation units (11.6M tokens), as well as around 10k German terms from the bistro system. Finally, we fed this data to the ModernMT system as TM files. For our experiments, we first used a test set consisting of 2k segments from the LexBrowser corpus. The results reveal considerable improvements both in terms of automated metrics, achieving 50.61 BLEU (+20.30 BLEU over the ModernMT baseline system), and with regard to the translation of legal and administrative terminology. Further analyses of legal terminology correctness and adequacy are still ongoing. With our exploratory study, we hope to pave the way for in-depth research aimed at creating an MT system for the translating institutions of South Tyrol (Koskinen, 2008).
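
The deterministic cleaning and filtering rules described above might look roughly like the following minimal Python sketch. It is an illustration, not the pipeline used in the study: the function names, the 2.5 length-ratio threshold, and the exact hyphenation pattern are all assumptions. The wrong-language noise class is left out because it would require a language identifier.

```python
import re

# Hypothetical helper: joins words split by a line-break hyphen
# ("Ver- waltung" -> "Verwaltung"), but leaves German suspended
# hyphens such as "Ein- und Ausgang" untouched.
_HYPHEN_BREAK = re.compile(r"(\w)-\s+(?!(?:und|oder)\b)(?=[a-zäöüß])")


def fix_hyphenation(text: str) -> str:
    return _HYPHEN_BREAK.sub(r"\1", text)


def filter_pairs(pairs, max_ratio=2.5):
    """Keep sentence pairs that pass simple deterministic checks.

    max_ratio is an assumed threshold on the token-count ratio
    between the longer and the shorter side of a pair.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Noise class: empty segment on either side.
        if not src or not tgt:
            continue
        # Noise class: implausible sentence length ratio.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Noise class: duplicates (case-insensitive exact match).
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

In such a setup, `fix_hyphenation` would be applied to each segment before `filter_pairs`, so that duplicate detection works on the cleaned text.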
