Toward Optimised Datasets to Fine-tune ASR Systems Leveraging Less but More Informative Speech

Loredana Schettino; Vincenzo Norman Vitale; Alessandro Vietti

Conference proceeding

Peer reviewed

Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Vol.4112, pp.1-7

CEUR Workshop Proceedings, 4112

Eleventh Italian Conference on Computational Linguistics (Cagliari, 24/09/2025–26/09/2025)

2025

Handle:

https://hdl.handle.net/10863/52198

Keywords

Speech style; ASR; Sample Efficiency; Acoustic Features; K-fold Cross-Validation

Abstract

Modern Automatic Speech Recognition (ASR) systems based on Deep Neural Networks (DNNs) have achieved remarkable performance by modelling huge quantities of speech data. However, recent studies have shown that fine-tuning pre-trained models, despite providing a powerful solution in low-resource settings, lacks robustness across different speech styles, and that this is related not just to the amount of training data but also to substantial differences in phonetic-prosodic characteristics. This study therefore explores how the performance of modern end-to-end (E2E) ASR systems is affected by the amount and the type of training speech data, and which acoustic-phonetic features exert the most marked influence. To this end, a k-fold cross-validation was performed by fine-tuning a pre-trained FastConformer model on datasets varying in type of speech data and in size. A correlation analysis was then performed between the values of the acoustic characteristics of the data and the recognition scores. The analyses allow the identification of an optimal combination of speech data type and amount of training data. The results also show that using either more spontaneous or more controlled speech can be beneficial, provided that the speech rate is kept moderate.
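
The two analysis steps named in the abstract, k-fold cross-validation and a correlation between acoustic characteristics and recognition scores, can be illustrated with a minimal Python sketch. This is not the authors' code: the helper names `k_fold_indices` and `pearson` are hypothetical, the choice of Pearson's coefficient is an assumption (the abstract does not name the correlation measure), and the fine-tuning and scoring of the FastConformer model on each fold is deliberately left out.

```python
import math
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k shuffled, near-equal folds.

    Hypothetical helper: each item lands in exactly one test fold.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives k folds whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    Assumption: the abstract only says "correlation analysis"; Pearson is
    used here purely for illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Ten placeholder utterance IDs split into five folds.
    for train, test in k_fold_indices(10, 5):
        # A real experiment would fine-tune on `train` and score on `test` here.
        print(sorted(test))
```

In an actual study, one acoustic measure (for example, the mean speech rate of each test fold) would be paired with the recognition score obtained on that fold, and the two resulting lists passed to `pearson`.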

Files and links (1)

url

https://ceur-ws.org/Vol-4112/#97_main_longView

Details

Title: Toward Optimised Datasets to Fine-tune ASR Systems Leveraging Less but More Informative Speech
Creators: Loredana Schettino; Vincenzo Norman Vitale; Alessandro Vietti
Publication Details: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Vol.4112, pp.1-7
Editor(s): Bosco C, Jezek E, Polignano M, Sanguinetti M
ISSN: 1613-0073
Conference: Eleventh Italian Conference on Computational Linguistics (Cagliari, 24/09/2025–26/09/2025)
Series / Volume: CEUR Workshop Proceedings, 4112
Publisher: CEUR
Format: Online
Number of pages: 7
Identifiers: (UNIBZ)97880651; 991007330262101241
Scopus ID: 2-s2.0-105034258026
Academic Unit: Faculty of Education
Language: English
Resource Type: Conference proceeding
Author Names String: Schettino L, Vitale VN, Vietti A
Additional Description: Editors/Supervisors: Bosco C, Jezek E, Polignano M, Sanguinetti M
