Computational auditory scene analysis

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means.^[1] In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Principles

Since CASA serves to model functional parts of the auditory system, it is necessary to view parts of the biological auditory system in terms of known physical models. Consisting of three areas, the outer, middle and inner ear, the auditory periphery acts as a complex transducer that converts sound vibrations into action potentials in the auditory nerve. The outer ear consists of the external ear, the ear canal and the eardrum. The outer ear, like an acoustic funnel, helps locate the sound source.^[2] The ear canal acts as a resonant tube (like an organ pipe) that amplifies frequencies between 2 and 5.5 kHz, with a maximum amplification of about 11 dB occurring around 4 kHz.^[3]

As the organ of hearing, the cochlea contains two membranes, Reissner's membrane and the basilar membrane. The basilar membrane responds to audio stimuli: a given region moves when the stimulus frequency matches the resonant frequency of that region. The movement of the basilar membrane displaces the inner hair cells in one direction, which encodes a half-wave rectified signal as action potentials in the spiral ganglion cells. The axons of these cells make up the auditory nerve, which carries the rectified stimulus. Auditory nerve fibers are frequency selective, similar to the basilar membrane. For lower frequencies, the fibers exhibit "phase locking".

Neurons in higher auditory pathway centers are tuned to specific stimulus features, such as periodicity, sound intensity, amplitude modulation and frequency modulation.^[1] There are also neuroanatomical associations of ASA with the posterior cortical areas, including the posterior superior temporal lobes and the posterior cingulate. Studies have found that ASA, including segregation and grouping operations, is impaired in patients with Alzheimer's disease.^[4]

System Architecture

Cochleagram

As the first stage of CASA processing, the cochleagram creates a time-frequency representation of the input signal. By mimicking the components of the outer and middle ear, the model breaks the signal up into the different frequencies that are naturally selected by the cochlea and hair cells. Because the basilar membrane is frequency selective, it is modeled with a filter bank in which each filter is associated with a specific point on the membrane.^[1]

Since the hair cells produce spike patterns, each filter of the model should also produce a similar spike in the impulse response. The gammatone filter provides an impulse response that is the product of a gamma function and a tone. The output of the gammatone filter can be regarded as a measurement of the basilar membrane displacement. Most CASA systems represent the firing rate in the auditory nerve rather than individual spikes. To obtain this, the filter bank outputs are half-wave rectified and then passed through a square root. (Other models, such as automatic gain controllers, have also been implemented.) The half-wave rectified signal is similar to the displacement model of the hair cells. Additional models of the hair cells include the Meddis hair cell model, which pairs with the gammatone filter bank by modeling hair cell transduction.^[5] The model assumes that there are three reservoirs of transmitter substance within each hair cell and that transmitter is released in proportion to the degree of displacement of the basilar membrane; the release is equated with the probability of a spike being generated in the nerve fiber. This model replicates many of the nerve responses relevant to CASA systems, such as rectification, compression, spontaneous firing, and adaptation.^[1]
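The cochleagram stage described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of any particular CASA system: the function names, the fourth-order filter, the ERB bandwidth formula with its 1.019 scaling, the five center frequencies and the 20 ms frames are all assumptions chosen for the example.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (assumed Glasberg-Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Impulse response: gamma envelope t^(n-1) * exp(-2*pi*b*t) times a tone at fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth parameter (assumption)
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.max(np.abs(ir))  # normalize the peak to 1

def cochleagram(signal, fs, center_freqs, frame_len=0.02):
    """Filter each channel, half-wave rectify, square-root compress, average per frame."""
    hop = int(frame_len * fs)
    n_frames = len(signal) // hop
    out = np.zeros((len(center_freqs), n_frames))
    for i, fc in enumerate(center_freqs):
        # Filter output, regarded as basilar membrane displacement
        bm = np.convolve(signal, gammatone_ir(fc, fs))[:len(signal)]
        # Half-wave rectification followed by a square root (firing-rate estimate)
        rate = np.sqrt(np.maximum(bm, 0.0))
        out[i] = rate[:n_frames * hop].reshape(n_frames, hop).mean(axis=1)
    return out

# Hypothetical test signal: a 1 kHz tone, 0.5 s long
fs = 16000
t = np.arange(fs // 2) / fs
tone = np.sin(2 * np.pi * 1000 * t)
center_freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
cg = cochleagram(tone, fs, center_freqs)
best_channel = int(np.argmax(cg.mean(axis=1)))  # channel with the strongest response
```

In this example the channel centered at 1 kHz responds most strongly to the tone, which is the frequency selectivity the filter bank is meant to reproduce.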

Correlogram

The correlogram is an important model of pitch perception because it unifies two schools of pitch theory:^[1]

Place theories (emphasizing the role of resolved harmonics)
Temporal theories (emphasizing the role of unresolved harmonics)

The correlogram is generally computed in the time domain by autocorrelating the simulated auditory nerve firing activity at the output of each filter channel.^[1] When the autocorrelation is pooled across frequency, the position of the peaks in the resulting summary correlogram corresponds to the perceived pitch.^[1]
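As a minimal sketch of this computation, the following NumPy example autocorrelates each channel and pools the result across frequency. The function names are hypothetical, and the "nerve activity" is simulated here by half-wave rectified harmonics of a 200 Hz fundamental rather than by a real filter bank; both are assumptions made to keep the example self-contained.

```python
import numpy as np

def correlogram_frame(channels, max_lag):
    """Autocorrelation of each channel (one row per channel) for lags 0..max_lag-1."""
    n_ch, n = channels.shape
    acf = np.zeros((n_ch, max_lag))
    for lag in range(max_lag):
        acf[:, lag] = np.sum(channels[:, :n - lag] * channels[:, lag:], axis=1)
    return acf

def summary_pitch(acf, fs, min_lag):
    """Pool the autocorrelation across channels and read the pitch from the peak lag."""
    summary = acf.sum(axis=0)  # summary correlogram
    lag = min_lag + int(np.argmax(summary[min_lag:]))
    return fs / lag

# Simulated channel outputs: rectified harmonics 1 to 4 of a 200 Hz fundamental
fs = 16000
f0 = 200.0
t = np.arange(1600) / fs
channels = np.array([np.maximum(np.sin(2 * np.pi * k * f0 * t), 0.0)
                     for k in (1, 2, 3, 4)])

acf = correlogram_frame(channels, max_lag=200)
pitch = summary_pitch(acf, fs, min_lag=20)  # min_lag excludes the trivial peak at lag 0
```

Each channel peaks at multiples of its own period, but only the common period of 80 samples lines up in every channel, so the summary peak falls at the fundamental.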

Cross-Correlogram

Because a sound reaches the two ears at different times, the location of the sound source can be determined from the delay between the two ears.^[6] By cross-correlating the left and right channels (of the model) over a range of delays, peaks that coincide at the same delay can be grouped as belonging to the same localized sound, despite their temporal location in the input signal.^[1] The interaural cross-correlation mechanism is supported by physiological studies and parallels the arrangement of neurons in the auditory midbrain.^[7]
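The delay estimate can be illustrated with a short NumPy sketch. The function names, the sign convention (a positive lag means the right channel is delayed) and the synthetic noise source are assumptions for this example, not details taken from a specific CASA system.

```python
import numpy as np

def cross_correlogram(left, right, max_lag):
    """Cross-correlation per channel for lags -max_lag..max_lag.

    left, right: arrays of shape (n_channels, n_samples).
    ccf[c, j] = sum over t of left[c, t] * right[c, t + lags[j]].
    """
    lags = np.arange(-max_lag, max_lag + 1)
    n = left.shape[1]
    core = slice(max_lag, n - max_lag)  # samples that stay in range for every lag
    ccf = np.zeros((left.shape[0], len(lags)))
    for j, lag in enumerate(lags):
        shifted = right[:, max_lag + lag: n - max_lag + lag]
        ccf[:, j] = np.sum(left[:, core] * shifted, axis=1)
    return lags, ccf

def estimate_itd(left, right, fs, max_lag):
    """Pool the cross-correlogram across channels and return the peak delay in seconds."""
    lags, ccf = cross_correlogram(left, right, max_lag)
    best = lags[np.argmax(ccf.sum(axis=0))]
    return best / fs

# Synthetic example: the right ear receives the same noise 8 samples later
fs = 16000
delay = 8
rng = np.random.default_rng(0)
src = rng.standard_normal(2000)
l, r = src[delay:], src[:-delay]
# Two illustrative "channels": the raw signal and a half-wave rectified copy
left = np.vstack([l, np.maximum(l, 0.0)])
right = np.vstack([r, np.maximum(r, 0.0)])

itd = estimate_itd(left, right, fs, max_lag=16)  # interaural time difference in seconds
```

Both channels peak at the same lag of 8 samples, so pooling them reinforces a single delay of 0.5 ms for the source.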

Time-Frequency Masks

To segregate the sound source, CASA systems mask the cochleagram. This mask, sometimes a Wiener filter, weights the target source regions and suppresses the rest.^[1] The physiological motivation for the mask comes from auditory masking, the perceptual effect in which a sound is rendered inaudible by a louder sound.^[8]
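A small NumPy sketch shows two common mask shapes applied to a grid of time-frequency units. It assumes that the target and interference energies are known for each unit and that they add in the mixture, which is only true in an idealized setting; the binary mask variant, the 0 dB threshold and the function names are assumptions added for illustration.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero and log of zero

def binary_mask(target_energy, noise_energy, threshold_db=0.0):
    """Keep a unit (1) when the local target-to-noise ratio exceeds the threshold."""
    snr_db = 10 * np.log10((target_energy + EPS) / (noise_energy + EPS))
    return (snr_db > threshold_db).astype(float)

def wiener_mask(target_energy, noise_energy):
    """Soft, Wiener-style weight: fraction of the unit's energy due to the target."""
    return target_energy / (target_energy + noise_energy + EPS)

# Hypothetical energies for a 2 x 2 grid of time-frequency units
target = np.array([[4.0, 0.1], [0.0, 9.0]])
noise = np.array([[1.0, 1.0], [2.0, 1.0]])
mixture = target + noise  # idealized assumption: energies add

bm = binary_mask(target, noise)   # hard decision per unit
wm = wiener_mask(target, noise)   # weights between 0 and 1
masked = wm * mixture             # weighted mixture, close to the target energy
```

The binary mask keeps only the units where the target dominates, while the Wiener-style mask scales every unit by the estimated share of target energy.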

Resynthesis

A resynthesis pathway reconstructs an audio signal from a group of segments. By inverting the cochleagram, high-quality resynthesized speech signals can be obtained.^[1]

Applications

Monaural CASA

Monaural sound separation first began with separating voices based on frequency. Many early developments were based on segmenting different speech signals by frequency.^[1] Later models extended this process by adding adaptation through state-space models, batch processing, and prediction-driven architecture.^[9] The use of CASA has improved the robustness of automatic speech recognition (ASR) and speech separation systems.^[10]

Binaural CASA

Since CASA models human auditory pathways, binaural CASA systems improve the model of human hearing by including two spatially separated microphones, which provide sound localization, auditory grouping and robustness to reverberation. With methods similar to cross-correlation, such systems are able to extract the target signal from the two microphone inputs.^[11]^[12]

Neural CASA Models

Since the biological auditory system is deeply connected with the actions of neurons, CASA systems have also incorporated neural models in their design. Two models provide the basis for this area. Von der Malsburg and Schneider proposed a neural network model in which oscillators represent features of different streams (synchronized within a stream and desynchronized between streams).^[13] Wang also presented a model that uses a network of excitatory units with a global inhibitor and delay lines to represent the auditory scene in time and frequency.^[14]^[15]

Analysis of Musical Audio Signals

Typical approaches in CASA systems start by segmenting sound sources into individual constituents, in an attempt to mimic the physical auditory system. However, there is evidence that the brain does not necessarily process audio input separately, but rather as a mixture.^[16] Instead of breaking the audio signal down into individual constituents, the input is described by higher-level descriptors, such as chords, bass and melody, beat structure, and chorus and phrase repetitions. These descriptors run into difficulties in real-world scenarios, with both monaural and binaural signals.^[1] The estimation of these descriptors also depends heavily on the cultural background of the musical input. For example, within Western music, the melody and bass influence the identity of the piece, with the core formed by the melody. By distinguishing the frequency responses of melody and bass, a fundamental frequency can be estimated and filtered for distinction.^[17] Chord detection can be implemented through pattern recognition, by extracting low-level features describing harmonic content.^[18] The techniques used in music scene analysis can also be applied to speech recognition and to other environmental sounds.^[19] Future work includes a top-down integration of audio signal processing, such as a real-time beat-tracking system, and expansion beyond the signal processing realm through the incorporation of auditory psychology and physiology.^[20]

Neural Perceptual Modeling

While many models consider the audio signal as a complex combination of different frequencies, modeling the auditory system can also require consideration of its neural components. By taking a holistic view, in which a stream (of feature-based sounds) corresponds to neuronal activity distributed across many brain areas, the perception of the sound can be mapped and modeled. Two different solutions have been proposed to the problem of binding auditory perception to areas of the brain. Hierarchical coding uses many cells to encode all possible combinations of features and objects in the auditory scene.^[21]^[22] Temporal or oscillatory correlation addresses the binding problem by focusing on the synchrony and desynchrony between neural oscillations to encode the state of binding among the auditory features.^[1] These two solutions closely parallel the debate between place coding and temporal coding. Modeling neural components also raises another question for CASA systems: the extent to which neural mechanisms should be modeled. Studies of CASA systems have involved modeling some known mechanisms, such as the bandpass nature of cochlear filtering and random auditory nerve firing patterns. However, these models may not lead to the discovery of new mechanisms, but rather give an understanding of the purpose of the known mechanisms.^[23]

References

[1] Wang, D. L. and Brown, G. J. (Eds.) (2006). Computational auditory scene analysis: Principles, algorithms and applications. IEEE Press/Wiley-Interscience.
[2] Warren, R. (1999). Auditory Perception: A New Analysis and Synthesis. New York: Cambridge University Press.
[3] Wiener, F. (1947). "On the diffraction of a progressive wave by the human head". Journal of the Acoustical Society of America, 19, 143–146.
[4] Goll, J., Kim, L. (2012). "Impairments of auditory scene analysis in Alzheimer's disease". Brain, 135(1), 190–200.
[5] Meddis, R., Hewitt, M., Shackleton, T. (1990). "Implementation details of a computational model of the inner hair-cell/auditory nerve synapse". Journal of the Acoustical Society of America, 87(4), 1813–1816.
[6] Jeffress, L. A. (1948). "A place theory of sound localization". Journal of Comparative and Physiological Psychology, 41, 35–39.
[7] Yin, T., Chan, J. (1990). "Interaural time sensitivity in medial superior olive of cat". Journal of Neurophysiology, 64(2), 465–488.
[8] Moore, B. (2003). An Introduction to the Psychology of Hearing (5th ed.). Academic Press, London.
[9] Ellis, D. (1996). "Prediction-Driven Computational Auditory Scene Analysis". PhD thesis, MIT Department of Electrical Engineering and Computer Science.
[10] Li, P., Guan, Y. (2010). "Monaural speech separation based on MASVQ and CASA for robust speech recognition". Computer Speech and Language, 24, 30–44.
[11] Bodden, M. (1993). "Modeling human sound-source localization and the cocktail-party-effect". Acta Acustica, 1, 43–55.
[12] Lyon, R. (1983). "A computational model of binaural localization and separation". Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1148–1151.
[13] Von der Malsburg, C., Schneider, W. (1986). "A neural cocktail-party processor". Biological Cybernetics, 54, 29–40.
[14] Wang, D. (1994). "Auditory stream segregation based on oscillatory correlation". Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, 624–632.
[15] Wang, D. (1996). "Primitive auditory segregation based on oscillatory correlation". Cognitive Science, 20, 409–456.
[16] Bregman, A. (1995). "Constraints on computational models of auditory scene analysis as derived from human perception". The Journal of the Acoustical Society of Japan (E), 16(3), 133–136.
[17] Goto, M. (2004). "A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals". Speech Communication, 43, 311–329.
[18] Zbigniew, R., Wieczorkowska, A. (2010). "Advances in Music Information Retrieval". Studies in Computational Intelligence, 274, 119–142.
[19] Masuda-Katsuse, I. (2001). "A new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise". Proceedings Eurospeech, 1119–1122.
[20] Goto, M. (2001). "An audio-based real-time beat tracking system for music with or without drum sounds". Journal of New Music Research, 30(2), 159–171.
[21] deCharms, R., Merzenich, M. (1996). "Primary cortical representation of sounds by the coordination of action-potential timing". Nature, 381, 610–613.
[22] Wang, D. (2005). "The time dimension of scene analysis". IEEE Transactions on Neural Networks, 16(6), 1401–1426.
[23] Bregman, A. (1990). Auditory Scene Analysis. Cambridge: MIT Press.
