Abstract
Modern Automatic Speech Recognition (ASR) systems, based on Deep Neural Networks (DNNs), have achieved remarkable performance by modelling large quantities of speech data. However, recent studies have shown that fine-tuning pre-trained models, despite providing a powerful solution in low-resource settings, lacks robustness across different speech styles, and that this is related not only to the amount of training data but also to substantial differences in phonetic-prosodic characteristics. This study therefore explores how the performance of modern end-to-end (E2E) ASR systems is affected by the amount and type of training speech data, and which acoustic-phonetic features exert the strongest influence. To this end, a k-fold cross-validation was performed by fine-tuning a pre-trained FastConformer model on datasets varying in speech type and size. We then performed a correlation analysis between the acoustic characteristics of the data and the recognition scores. The analyses allow the identification of an optimal combination of speech data type and amount of training data. The results also show that using either more spontaneous or more controlled speech can be beneficial, provided that the speech rate remains moderate.