Reverberation degrades speech intelligibility and quality, which makes it a persistent challenge in speech processing. Deep learning has produced promising solutions, but current methods often fail to integrate low-level and high-level feature representations effectively, which limits overall performance. Prior approaches also rely heavily on loss functions based on numerical error metrics, which may not fully capture the perceptual characteristics of speech signals. To address these concerns, we introduce CAT-DUnet, a Unet architecture that integrates channel attention, time-frequency attention, and dilated convolution blocks to enhance feature fusion. We also propose a novel time-weighted Mel-spectrogram and adopt a structural similarity (SSIM) loss, which aligns more closely with human auditory perception. Extensive ablation experiments demonstrate the effectiveness of each proposed enhancement. Our model outperforms state-of-the-art models on 6 out of 7 relative and absolute metrics.
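As a rough illustration of how a structural similarity term differs from a plain error metric, the sketch below computes a single-window (global) SSIM between two spectrogram-like arrays and mixes it with an L1 term. It is a generic sketch under stated assumptions, not the loss used for CAT-DUnet: the abstract does not specify the windowing, the time weighting of the Mel-spectrogram, or how the terms are combined, so the function names `global_ssim` and `mixed_loss`, the mixing weight `alpha`, and the `data_range` default are all hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally shaped arrays.

    Assumption: one window covering the whole array, rather than the
    sliding local windows used in most SSIM implementations.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("inputs must have the same shape")

    # Standard SSIM stabilising constants.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mixed_loss(est, ref, alpha=0.5, data_range=1.0):
    """Hypothetical mix of an L1 term and a (1 - SSIM) term.

    `alpha` is an illustrative weight, not a value from the paper.
    """
    est = np.asarray(est, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    l1_term = np.abs(est - ref).mean()
    ssim_term = 1.0 - global_ssim(est, ref, data_range=data_range)
    return alpha * l1_term + (1.0 - alpha) * ssim_term
```

With identical inputs the SSIM term is 1 and the mixed loss is 0. Any deviation in mean, variance, or covariance structure lowers the SSIM term, so the loss penalises structural mismatch in addition to pointwise error.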
RT60 (s) | Unprocessed | Clean | CAT-DUnet | WPE | DCCRN | FullSubNet+ | StoRM | LS Unet |
---|---|---|---|---|---|---|---|---|
1.20 | | | | | | | | |
1.53 | | | | | | | | |
1.53 | | | | | | | | |
1.58 | | | | | | | | |
1.68 | | | | | | | | |
1.84 | | | | | | | | |

RT60 (s) | Unprocessed | Clean | CAT-DUnet | WPE | DCCRN | FullSubNet+ | StoRM | LS Unet |
---|---|---|---|---|---|---|---|---|
0.36 | | | | | | | | |
0.57 | | | | | | | | |
0.75 | | | | | | | | |
0.91 | | | | | | | | |
1.11 | | | | | | | | |
1.43 | | | | | | | | |

RT60 (s) | Unprocessed | Clean | Backbone | +CA | +FTA | +DCB (Full) |
---|---|---|---|---|---|---|
0.49 | | | | | | |
0.61 | | | | | | |
0.75 | | | | | | |
1.21 | | | | | | |
1.42 | | | | | | |
1.53 | | | | | | |

RT60 (s) | Unprocessed | Clean | L1-S | SSIM-S | Mix-S | Mix-MS | Mix-TMS | SISNR | MSTFT |
---|---|---|---|---|---|---|---|---|---|
0.89 | | | | | | | | | |
1.02 | | | | | | | | | |
1.41 | | | | | | | | | |
1.53 | | | | | | | | | |
1.58 | | | | | | | | | |
1.62 | | | | | | | | | |