Noise suppression is essential in real-time speech communication, yet common evaluation metrics capture different aspects of performance and do not always agree. This paper investigates the trade-off between perceptual quality and intelligibility in speech enhanced by the Dual-Signal Transformation LSTM Network (DTLN). A dataset of 1,360 noisy mixtures was created by combining English and Indonesian speech with environmental noise at multiple signal-to-noise ratio (SNR) levels. Objective evaluation was conducted using Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Squared Error (MSE). Results reveal a weak negative correlation between PESQ and STOI (r = −0.265), indicating that cleaner-sounding speech is not always easier to understand. Correlations involving MSE were negligible (PESQ–MSE: r = −0.008; STOI–MSE: r = 0.074), indicating that MSE has limited perceptual relevance. These findings show that perceptual quality and intelligibility are not interchangeable, and that relying solely on MSE is insufficient. The study recommends intelligibility-aware training objectives and multi-metric evaluation strategies to balance listening comfort and clarity in practical applications such as telemedicine and online learning.
Keywords: Noise Suppression, WebRTC, PESQ, STOI, Correlation Analysis
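
The correlation analysis summarized in the abstract can be illustrated with a minimal sketch. It assumes that the reported r values are Pearson correlation coefficients computed over per-utterance scores, which the abstract does not state explicitly. The function names (`mse`, `pearson_r`, `correlation_table`) are hypothetical and are not taken from the study's code. PESQ and STOI scores are assumed to have been computed beforehand with external tools and passed in as plain lists.

```python
import math


def mse(clean, enhanced):
    """Mean squared error between two sample sequences.

    Assumption: sequences are compared over their common length.
    """
    n = min(len(clean), len(enhanced))
    if n == 0:
        raise ValueError("both signals must contain at least one sample")
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / n


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    sxx = sum(v * v for v in xc)
    syy = sum(v * v for v in yc)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sum(a * b for a, b in zip(xc, yc)) / math.sqrt(sxx * syy)


def correlation_table(scores):
    """Pairwise Pearson r for a dict mapping metric name to per-utterance scores."""
    names = list(scores)
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Given one score per noisy mixture for each metric, `correlation_table({"PESQ": pesq_scores, "STOI": stoi_scores, "MSE": mse_scores})` returns one coefficient for each of the three metric pairs, which is the form in which the results above are reported.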