Abstract
We present the first meta-evaluation of Automatic Machine Translation Evaluation (AMTE) metrics between Italian and South Tyrolean German, a low-resourced standard variety of German. This minor German variety is recognised as a co-official language at the local level and is used by the local public administration and legislature. We evaluate metric agreement with human judgments across translation quality levels, using a dataset of bilingual machine-translated decrees annotated with human-curated error tags. Our findings show that embedding-based metrics perform best for evaluating high-quality translations, while learned neural metrics correlate more strongly with human judgments in lower-quality ranges. We also expose a persistent bias in AMTE against minor language varieties and offer suggestions for the design of linguistic resources for envisaged custom metric development.