Abstract
The performance of automatic speech recognition (ASR) systems in acoustically challenging environments is crucial to the effectiveness of voice-controlled applications. This study presents an extensive experimental evaluation of the robustness of several ASR models to a range of acoustic distortions, including white noise, reverberation, time stretching, and pitch shifting. By comparing the models on English, Italian, and German, it provides a cross-linguistic perspective that covers both well-researched and less-explored languages. The results reveal a significant decline in performance for all models under these distortions, with the degree of resilience varying across languages. These findings offer insights into the challenges and opportunities for improving ASR technologies. Our comparison shows that although ASR systems are reaching near-human accuracy in ideal acoustic conditions, their performance across the full range of distortions remains well below human performance.