Accurate spatial reproduction of sound can significantly enhance the visualization of three-dimensional (3-D) multimedia information, particularly for applications in which it is important to achieve sound localization relative to visual images. Such applications include immersive telepresence; augmented and virtual reality for manufacturing and entertainment; air traffic control, pilot warning, and guidance systems; displays for the visually or aurally impaired; home entertainment; and distance learning. Gaming can also benefit from multichannel audio, which provides a more realistic sense of the game environment and a more immersive playing experience.
Two-channel (stereo) reproduction is the most familiar way of conveying spatial content in sound recording and reproduction, and it can be considered the simplest approximation to spatial sound. Surround sound systems, which employ more than two loudspeakers, have evolved and entered homes and cinemas in order to deliver a stronger sense of immersion. Exact reconstruction of a soundfield has been proposed based on the Ambisonics and Wavefield Synthesis (WFS) approaches. While both can provide accurate reproduction of a soundfield, they suffer from practical constraints: the former has a very narrow optimal listening area, while the latter requires a prohibitively large number of loudspeakers, which is impractical for most commercial applications.
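To make the Ambisonics representation concrete, the sketch below encodes a mono source into traditional first-order B-format (W, X, Y, Z) at a given azimuth and elevation. Channel ordering and normalization conventions vary across implementations, so this is an illustrative sketch rather than a reference encoder; the function name and signal parameters are assumptions made here for illustration.

```python
import numpy as np

def encode_b_format(mono, azimuth_deg, elevation_deg=0.0):
    """Encode a mono signal into traditional first-order B-format (W, X, Y, Z),
    with the customary 1/sqrt(2) weighting on the omnidirectional W channel."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = mono / np.sqrt(2.0)
    x = mono * np.cos(az) * np.cos(el)
    y = mono * np.sin(az) * np.cos(el)
    z = mono * np.sin(el)
    return np.stack([w, x, y, z], axis=0)

# Example: place a 440 Hz tone 45 degrees to the left, on the horizontal plane.
fs = 48000
t = np.arange(fs) / fs
source = 0.3 * np.sin(2 * np.pi * 440.0 * t)
bformat = encode_b_format(source, azimuth_deg=45.0)   # shape (4, n_samples)
```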
Instead of targeting the recreation of the physical soundfield, perceptually optimized methods provide a more practical solution to spatial sound rendering and reproduction. Given a known arrangement of loudspeakers, simple panning rules can be used to position a virtual acoustic source at a desired location and thus construct a virtual 2-D or 3-D acoustic environment. Moreover, the virtual sound sources need not be static; their locations may change dynamically over time, either to simulate a “moving” source or as a means of interacting with the actions of the listener.
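As a concrete illustration of such panning rules, the following sketch applies the classical tangent law to a standard stereo pair at ±30° and then sweeps the source azimuth over time to simulate a moving source. The function names, the sign convention (positive azimuth toward the left loudspeaker), and the sweep trajectory are illustrative assumptions, not a prescription of any particular system.

```python
import numpy as np

def tangent_law_gains(source_deg, base_deg=30.0):
    """Pairwise amplitude-panning gains (tangent law) for a stereo pair placed
    symmetrically at +/- base_deg; positive azimuth points toward the left."""
    theta = np.radians(np.clip(source_deg, -base_deg, base_deg))
    theta0 = np.radians(base_deg)
    ratio = np.tan(theta) / np.tan(theta0)       # (gL - gR) / (gL + gR)
    g_left, g_right = 1.0 + ratio, 1.0 - ratio
    norm = np.sqrt(g_left**2 + g_right**2)       # constant-power normalization
    return g_left / norm, g_right / norm

def pan_mono_source(mono, source_deg, base_deg=30.0):
    """Render a mono signal to the stereo pair at a fixed azimuth."""
    g_left, g_right = tangent_law_gains(source_deg, base_deg)
    return np.stack([g_left * mono, g_right * mono], axis=0)

# Static source 20 degrees to the left of center.
fs = 48000
t = np.arange(fs) / fs
mono = 0.5 * np.sin(2 * np.pi * 440.0 * t)
stereo_static = pan_mono_source(mono, source_deg=20.0)

# A "moving" source: sweep the azimuth across the stereo stage sample by sample.
trajectory = np.linspace(-30.0, 30.0, len(t))    # azimuth in degrees over 1 s
gains = np.array([tangent_law_gains(a) for a in trajectory])   # shape (n, 2)
stereo_moving = gains.T * mono                                  # shape (2, n)
```

The same pairwise principle extends to multichannel layouts by selecting, for each source direction, the loudspeaker pair (or triplet, in 3-D) that encloses it and applying the panning gains to that subset.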
Spatial audio theory today offers numerous methodologies for capturing and reproducing real auditory environments as a whole, as well as for synthesizing completely artificial ones. The techniques vary with respect to the microphone types and arrangements, as well as with respect to the type of rendering applied to the recorded signals before they are transmitted through the available loudspeaker setup. Spaced microphone techniques are among the most common configurations used by researchers and engineers: microphones are sparsely distributed in space, capturing in a natural way the variability of the acoustic environment. The recorded signals may be fed directly to the loudspeakers of a surround reproduction system, delivering a sense of envelopment and immersion to the listener. At the same time, more sophisticated approaches have been proposed in which the recorded signals are further processed to extract a higher-level spatial description of the auditory scene, or to detect and modify certain auditory components of the acquired environment. While these techniques are well established for achieving high-quality multichannel or immersive reproduction with carefully placed, high-quality microphones, it remains a challenging task to derive corresponding techniques for immersive sound reproduction from crowdsourced audio content, given the possibly low audio quality and unknown positions of the microphones (smartphones, tablets, etc.) in a specific venue.
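As one example of such further processing, the sketch below estimates the time difference of arrival between a pair of spaced microphones using GCC-PHAT and maps it to a coarse azimuth under a far-field assumption. The microphone spacing, function names, and the simulated integer-sample delay are assumptions made for illustration; real crowdsourced recordings would additionally require synchronization and robustness to unknown microphone positions.

```python
import numpy as np

def gcc_phat_delay(ref, sig, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` in seconds using GCC-PHAT;
    positive values mean `sig` arrives later than `ref`."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(ref, n=n)
    S = np.fft.rfft(sig, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(round(fs * max_tau)), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_to_azimuth(tau, mic_distance, c=343.0):
    """Map an inter-microphone delay to a coarse azimuth (far-field assumption)."""
    sin_theta = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Synthetic check: a broadband source reaching the second microphone 20 samples late.
fs = 48000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)
mic1 = src
mic2 = np.concatenate((np.zeros(20), src[:-20]))
tau = gcc_phat_delay(mic1, mic2, fs, max_tau=0.5 / 343.0)   # mics 0.5 m apart
print(delay_to_azimuth(tau, mic_distance=0.5))   # coarse azimuth estimate in degrees
```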