Using deep learning methods for supervised speech enhancement in noisy and reverberant environments
PIR HOSSEINLOO, SHADI
Abstract
In real-world environments, the speech signals received by our ears are usually a combination of different sounds that include not only the target speech but also acoustic interference such as music, background noise, and competing speakers. This interference has a negative effect on speech perception and degrades the performance of speech processing applications such as automatic speech recognition (ASR), speaker identification, and hearing aid devices. One way to address this problem is to use source separation algorithms to separate the desired speech from the interfering sounds. Many source separation algorithms have been proposed to improve the performance of ASR systems and hearing aid devices, but it remains challenging for these systems to work efficiently in noisy and reverberant environments. Humans, on the other hand, have a remarkable ability to separate desired sounds and listen to a specific talker amid noise and other talkers. Inspired by this capability of the human auditory system, a popular approach known as auditory scene analysis (ASA) was proposed to separate different sources in a two-stage process of segmentation and grouping. The main goal of source separation in ASA is to estimate time-frequency masks that optimally separate the noise from a mixture of speech and noise.

In this work, multiple algorithms are proposed to improve source separation in noisy and reverberant acoustic environments. First, a simple and novel algorithm is proposed to increase the discriminability between two sound sources by scaling (magnifying) the head-related transfer function of the interfering source. Experimental results show a significant increase in the quality of the recovered target speech. Second, a time-frequency masking-based source separation algorithm is proposed that can separate a male speaker from a female speaker in reverberant conditions by using the spatial cues of the source signals. Furthermore, the proposed algorithm preserves the locations of the sources after separation.
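To make the time-frequency masking idea concrete, the minimal sketch below computes an oracle ideal ratio mask (IRM) from parallel clean and noise signals and applies it to their mixture. This is a generic illustration of the masking targets discussed above, not code from the dissertation; the function names are hypothetical, and the signals are assumed to be time-aligned and of equal length.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_ratio_mask(clean, noise, fs=16000, nperseg=512):
        # Oracle IRM: square root of speech power over total power
        # in each time-frequency unit.
        _, _, S = stft(clean, fs=fs, nperseg=nperseg)
        _, _, N = stft(noise, fs=fs, nperseg=nperseg)
        ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
        return np.sqrt(ps / (ps + pn + 1e-12))

    def apply_mask(mixture, mask, fs=16000, nperseg=512):
        # Weight the mixture spectrogram by the mask, then
        # resynthesize a waveform with the inverse STFT.
        _, _, M = stft(mixture, fs=fs, nperseg=nperseg)
        _, enhanced = istft(M * mask, fs=fs, nperseg=nperseg)
        return enhanced

In the supervised setting described next, a neural network is trained to predict such a mask from features of the noisy input alone, so at test time the network's estimate takes the place of this oracle computation.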
Three major aims are then pursued for supervised speech separation based on deep neural networks that estimate either time-frequency masks or the clean speech spectrum. First, a novel monaural acoustic feature set based on a gammatone filterbank is presented as the input to the deep neural network (DNN) based speech separation model, yielding significant improvements in objective speech intelligibility and speech quality across different testing conditions. Second, a complementary binaural feature set is proposed to strengthen source separation in adverse environments with non-stationary background noise and high reverberation using two-channel recordings; experimental results show that combining spatial features with this complementary feature set significantly improves speech intelligibility and quality in noisy and reverberant conditions. Third, a novel dilated convolutional neural network is proposed to improve the generalization of the monaural supervised speech enhancement model to untrained speakers, unseen noises, and unseen simulated rooms. This model significantly increases the intelligibility and quality of the recovered speech while being more computationally efficient and requiring less memory than comparable models. In addition, the proposed model is modified with recurrent layers and dilated causal convolution layers for real-time processing. The resulting model is causal, which makes it suitable for implementation in hearing aid devices and ASR systems: it has fewer trainable parameters and uses only information from previous time frames to predict its output.

The main goal of the proposed algorithms is to increase the intelligibility and quality of speech recovered from noisy and reverberant environments, which has the potential to improve both speech processing applications and signal processing strategies for hearing aid and cochlear implant technology.
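For readers unfamiliar with dilated causal convolutions, the following minimal PyTorch sketch shows the general building block such a real-time model relies on: left-padding makes each output frame depend only on the current and past frames, and exponentially growing dilations enlarge the receptive field cheaply. This is not the dissertation's architecture; the channel counts, kernel sizes, and class name are assumed for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedConv1d(nn.Module):
        # 1-D dilated convolution that only looks at past frames:
        # padding (kernel_size - 1) * dilation zeros on the left keeps
        # the output at frame t independent of frames after t.
        def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

        def forward(self, x):                  # x: (batch, channels, frames)
            x = F.pad(x, (self.pad, 0))        # pad the past (left) side only
            return self.conv(x)

    # Dilations 1, 2, 4, 8 grow the receptive field exponentially while
    # each output frame still uses only causal (past) context.
    stack = nn.Sequential(*[
        CausalDilatedConv1d(64, 64, kernel_size=3, dilation=2 ** i)
        for i in range(4)
    ])
    frames = torch.randn(1, 64, 100)           # (batch, features, time frames)
    out = stack(frames)                        # same 100-frame length, causal

Because no future frames are consulted, a model built from such layers can emit an enhanced frame as soon as the corresponding input frame arrives, which is the property that makes it attractive for hearing aids and streaming ASR front ends.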
Date
2019-12-31
Publisher
University of Kansas
Keywords
Electrical engineering, deep neural networks, feature selection, noisy and reverberant environment, speech separation, supervised speech enhancement, time frequency masking