Toward Optimised Datasets to Fine-tune ASR Systems Leveraging Less but More Informative Speech
Conference proceeding   Peer reviewed

Loredana Schettino, Vincenzo Norman Vitale and Alessandro Vietti
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Vol.4112, pp.1-7
CEUR Workshop Proceedings, 4112
Eleventh Italian Conference on Computational Linguistics (Cagliari, 24/09/2025–26/09/2025)
2025
Handle:
https://hdl.handle.net/10863/52198

Abstract

Keywords: Speech style; ASR; Sample Efficiency; Acoustic Features; K-fold Cross-Validation

Modern Automatic Speech Recognition (ASR) systems based on Deep Neural Networks (DNNs) have achieved remarkable performance by modelling huge quantities of speech data. However, recent studies have shown that fine-tuning pre-trained models, despite providing a powerful solution in low-resource settings, lacks robustness across different speech styles, and that this is related not just to the amount of training data but also to substantial differences in phonetic-prosodic characteristics. Therefore, this study explores how the performance of modern end-to-end (E2E) ASR systems is affected by the amount and type of speech data used for training, and which acoustic-phonetic features exert the most marked influence. To this end, k-fold cross-validation was performed by fine-tuning a pre-trained FastConformer model on datasets varying in speech data type and size. A correlation analysis was then performed between the values of the acoustic characteristics of the data and the recognition scores. The analyses allow the identification of an optimal combination of speech data type and amount of training data. The results also show that using either more spontaneous or more controlled speech can be beneficial, provided that the speech rate is contained.
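
The two analysis steps named in the abstract, k-fold cross-validation and a correlation between acoustic characteristics and recognition scores, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the choice of Pearson correlation, and the placeholder numbers are all assumptions made for illustration, and the fine-tuning step itself is omitted.

```python
import math
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper: the paper does not specify how folds were built.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Each fold would fine-tune on `train` and score on `test` (omitted here).
    for train, test in k_fold_splits(n_items=10, k=5):
        print(len(train), len(test))

    # Placeholder values only, NOT results from the paper:
    # one acoustic feature value and one recognition score per fold.
    feature_per_fold = [1.0, 2.0, 3.0, 4.0]
    score_per_fold = [2.0, 4.0, 6.0, 8.0]
    print(pearson(feature_per_fold, score_per_fold))
```

In such a setup, each fold would contribute one recognition score and one value per acoustic feature (for example, mean speech rate), and the correlation would be computed across folds or dataset configurations.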
URL:
https://ceur-ws.org/Vol-4112/#97_main_longView
