CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

Abstract

Reverberation significantly degrades speech intelligibility and quality, posing a substantial challenge in speech processing. While deep learning offers promising solutions, current methods often fail to integrate low-level and high-level feature representations effectively, which degrades overall performance. In addition, prior approaches rely heavily on loss functions grounded in quantitative error metrics, which may not fully capture the perceptual characteristics of speech signals. To address these concerns, we introduce CAT-DUnet, a Unet architecture that integrates channel attention, time-frequency attention, and dilated convolution blocks to enhance feature fusion. We also propose a novel time-weighted Mel-spectrogram and leverage the structural similarity loss, which aligns more closely with human auditory perception. Extensive ablation experiments demonstrate the effectiveness of the proposed enhancements. Our model outperforms state-of-the-art models on 6 out of 7 relative and absolute metrics.

Model Overview

(a) The overall architecture of CAT-DUnet, comprising a Unet backbone and several Fusion Blocks. (b) A single Fusion Block, composed of a Channel Attention Block, (c) a Dilated Convolution Block, and (d) a Time-frequency Attention Block.
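
As a rough illustration of why a Dilated Convolution Block can widen the context a Fusion Block sees without adding parameters, the sketch below implements a 1-D dilated convolution in plain Python and computes the receptive field of a small stack. The function names, kernel, and dilation rates are illustrative assumptions and are not the configuration used in CAT-DUnet.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with `dilation` samples between kernel taps."""
    span = (len(kernel) - 1) * dilation  # distance covered by the dilated kernel
    out = []
    for start in range(len(signal) - span):
        acc = 0.0
        for k, w in enumerate(kernel):
            acc += w * signal[start + k * dilation]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical example: a ramp signal and a 3-tap difference kernel.
x = [float(i) for i in range(10)]
y = dilated_conv1d(x, kernel=[1.0, 0.0, -1.0], dilation=2)

# Three stacked 3-tap layers with (assumed) dilations 1, 2, 4.
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

With the same three-tap kernel, raising the dilation from 1 to 2 doubles the span each output sample depends on, and stacking layers with growing dilation rates expands the receptive field much faster than stacking undilated layers.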

Dereverberation Effects of Various Models

Here, we showcase the dereverberation performance of the models compared in Section III-B of the paper on synthetic reverberant audio samples described in Section III-A.

RT60/s	Unprocessed	Clean	CAT-DUnet	WPE	DCCRN	FullSubNet+	StoRM	LS Unet
1.20
1.53
1.53
1.58
1.68
1.84

Model Performance in the Presence of Noise Interference

Here, we showcase audio samples that contain both reverberation and noticeable noise, to illustrate how the different models perform dereverberation in the presence of noise interference.

RT60/s	Unprocessed	Clean	CAT-DUnet	WPE	DCCRN	FullSubNet+	StoRM	LS Unet
0.36
0.57
0.75
0.91
1.11
1.43

Ablation Experiments on Sub-modules of CAT-DUnet

Here, we showcase some results of the ablation experiments conducted in Section III-C of the paper. Each column after 'Backbone' adds one module to the configuration of the column before it, so the last column corresponds to the full model.

RT60/s	Unprocessed	Clean	Backbone	+CA	+FTA	+DCB (Full)
0.49
0.61
0.75
1.21
1.42
1.53
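
For readers unfamiliar with the idea behind the '+CA' column, the sketch below shows one common form of channel attention: a squeeze-and-excitation-style gate that rescales each channel of a feature map by a value in (0, 1). This is a generic illustration under stated assumptions, not the Channel Attention Block of CAT-DUnet; the single linear layer, the `weights` matrix, and the `bias` vector are hypothetical stand-ins for learned parameters.

```python
import math


def channel_attention(features, weights, bias):
    """Rescale each channel by a sigmoid gate computed from the channel means.

    features: list of C channels, each a list of T rows of F floats.
    weights:  C x C matrix (hypothetical learned parameters).
    bias:     length-C vector (hypothetical learned parameters).
    """
    # Squeeze: global average over time and frequency for each channel.
    means = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in features
    ]
    # Excite: one linear layer followed by a sigmoid gate in (0, 1).
    gates = []
    for w_row, b in zip(weights, bias):
        z = sum(w * m for w, m in zip(w_row, means)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    # Scale: multiply every bin of a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(features, gates)
    ]
    return scaled, gates


feats = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[2.0, 4.0], [6.0, 8.0]],  # channel 1, mean 5.0
]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, gates = channel_attention(feats, identity, bias=[0.0, 0.0])
```

With the identity weights used here, the channel with the larger mean receives the larger gate; in a trained network the weights decide which channels are emphasized.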

Comparison of Loss Functions on CAT-DUnet Backbone

We compare here a subset of the loss functions discussed in Section III-D of the paper on a CAT-DUnet backbone.

RT60/s	Unprocessed	Clean	L1-S	SSIM-S	Mix-S	Mix-MS	Mix-TMS	SISNR	MSTFT
0.89
1.02
1.41
1.53
1.58
1.62
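
The column names above suggest L1, SSIM, and mixed losses computed on spectrogram-like inputs; the exact definitions are given in Section III-D of the paper. As a minimal sketch of how a structural similarity term can be combined with an L1 term, the code below computes SSIM from the statistics of the whole spectrogram (a single window, which is a simplification of the usual sliding-window SSIM) and mixes `1 - SSIM` with L1. The mixing weight `alpha`, the `data_range`, and the toy spectrograms are assumptions chosen for illustration only.

```python
def _flatten(spec):
    """Flatten a 2-D spectrogram (list of rows) into a list of values."""
    return [v for row in spec for v in row]


def l1_loss(pred, target):
    """Mean absolute error over all time-frequency bins."""
    p, t = _flatten(pred), _flatten(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


def global_ssim(pred, target, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-spectrogram statistics (one window)."""
    p, t = _flatten(pred), _flatten(target)
    n = len(p)
    mu_p, mu_t = sum(p) / n, sum(t) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_t = sum((b - mu_t) ** 2 for b in t) / n
    cov = sum((a - mu_p) * (b - mu_t) for a, b in zip(p, t)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return num / den


def mixed_loss(pred, target, alpha=0.5):
    """Weighted sum of (1 - SSIM) and L1; alpha is a hypothetical weight."""
    ssim_term = 1.0 - global_ssim(pred, target)
    return alpha * ssim_term + (1.0 - alpha) * l1_loss(pred, target)


# Toy magnitude spectrograms scaled to [0, 1] (2 frames x 3 bins).
clean = [[0.1, 0.4, 0.9], [0.2, 0.5, 0.8]]
smeared = [[0.3, 0.45, 0.7], [0.35, 0.5, 0.65]]  # lower contrast than `clean`

loss_same = mixed_loss(clean, clean)
loss_smeared = mixed_loss(smeared, clean)
```

The SSIM term penalizes the loss of contrast and structure in the smeared spectrogram, which is the kind of degradation reverberation tends to introduce, while the L1 term keeps the per-bin magnitudes close to the target.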

References

[1] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 85–88.

[2] D. Leon and F. Tobar, “Late reverberation suppression using U-nets,” 2021.

[3] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.

[4] J. Chen, Z. Wang, D. Tuo, Z. Wu, S. Kang, and H. Meng, “FullSubNet+: Channel attention FullSubNet with complex spectrograms for speech enhancement,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7857–7861.

[5] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.

[6] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6199–6203.