Abstract
Automatic speech recognition (ASR) has advanced considerably with the emergence of large transformer-based models such as OpenAI’s Whisper. This paper presents an experimental evaluation of the Whisper models, focusing on their performance under various acoustic conditions and input configurations. We specifically examine the effects of audio transformations (white and Gaussian noise, reverberation, time stretch, and pitch shift) as well as the impact of varying chunk lengths. The findings suggest that while Whisper models tolerate minimal background noise and perform well on clean audio, their performance degrades rapidly under more severe audio transformations and noise, particularly with shorter chunk lengths. This study offers insight into the capabilities and limitations of the Whisper models, particularly for real-time speech recognition, and provides guidance for future improvements in ASR technology.