Towards an ASR System for Documenting Endangered Languages: A Preliminary Study on Sardinian
Conference proceeding   Open access   Peer reviewed

I Chizzoni and Alessandro Vietti
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), Vol.3878, 25
CEUR Workshop Proceedings, 3878
Tenth Italian Conference on Computational Linguistics (CLiC-it 2024) (Pisa, 04/12/2024–06/12/2024)
2024
Handle:
https://hdl.handle.net/10863/52263

Abstract

Keywords: Speech recognition; Campidanese Sardinian; Resource and evaluation; Spoken language documentation
Speech recognition systems are still highly dependent on textual orthographic resources, which poses a challenge for low-resource languages. Recent research leverages self-supervised learning from unlabeled data or employs multilingual models pre-trained on high-resource languages and fine-tunes them on the target low-resource language. These approaches are effective when the target language has a shared writing tradition. When we are confronted with mainly spoken languages, whether endangered minority languages, dialects, or regional varieties, we lack not only labeled data but also a shared metric to assess speech recognition performance. We first provide a research background on ASR for low-resource languages and describe the specific linguistic situation of Campidanese Sardinian; we then evaluate five multilingual ASR models using traditional evaluation metrics and an exploratory linguistic analysis. The paper addresses key challenges in developing a tool for researchers to document and analyze the phonetics and phonology of spoken (endangered) languages.
PDF: 25_main_long (256.58 kB), Open Access
URL: https://ceur-ws.org/Vol-3878/#25_main_long
