Audio quality assessment
UHDTV refers to the specifications relevant to the picture, but what about the sound? Sound is an integral component of the overall viewing experience, sometimes even more important than the visual content. For the best viewing experience, therefore, UHDTV should be accompanied by high-quality sound, ideally in a surround format such as 5.1 or 7.1 multichannel audio (for home-theatre use). However, the quality of user-generated content is usually not suitable for UHDTV broadcasts, and often not even suitable for playback on a secondary device accompanying a UHDTV broadcast. It is therefore important to derive methods for assessing the audio quality of a given recording, so that the content can be used in an appropriate manner (or a decision can be made on whether it can be enhanced via signal processing methods).
Unfortunately, no objective quality measure exists today for audio signals that can accurately predict the quality perceived by an average listener. Subjective tests remain the most accurate method for judging the perceived quality of audio content. In the last two decades, as lossy audio compression methods have been developed and standardized, subjective tests have become essential for understanding the efficiency and limitations of an audio compression scheme. The procedure generally involves several human listeners who evaluate the quality of given audio content following a standard protocol, so that the process is consistent, and the listeners’ evaluations are then statistically analysed. Several methods have been proposed for performing such tests, and some have been standardized and used extensively in the audio community, such as ITU BS.1116 and ITU BS.1534 (MUSHRA). These methodologies cover the evaluation of monophonic, stereophonic (2-channel stereo) and multichannel (e.g. 5.1 surround) formats.

At the same time, research on objective audio quality assessment methods that can accurately predict perceived sound quality continues to advance, with ITU BS.1387 (PEAQ) being the most popular in the current state of the art. PEAQ, however, is not directly applicable to multichannel sound, and has limitations when the quality of the tested sound is significantly lower than that of the reference. It is also a reference-based approach, meaning that the “ideal” version of the sound content must be available for comparison. Equally importantly, in the context of assessing the overall Quality of Experience (QoE) of a multimedia-services user, it is becoming increasingly apparent that content evaluation should be performed on the joint audio-visual experience, instead of looking at the two modalities separately, as has been the usual practice thus far.
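To illustrate the reference-based principle behind measures such as PEAQ, the sketch below computes a segmental signal-to-noise ratio between a reference and a degraded signal. This is a deliberately crude stand-in: PEAQ models hearing psychoacoustically, whereas this proxy only shows the core idea that a full-reference metric compares the degraded signal against the "ideal" version. The function name, frame length, and clamping range are illustrative choices, not part of any standard.

```python
import numpy as np

def segmental_snr(reference, degraded, frame_len=512, eps=1e-10):
    """Crude full-reference quality proxy: mean per-frame SNR in dB.

    Unlike PEAQ, this ignores psychoacoustics entirely; it only
    illustrates the reference-based comparison principle.
    """
    n = min(len(reference), len(degraded))
    ref = np.asarray(reference[:n], dtype=float)
    deg = np.asarray(degraded[:n], dtype=float)
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        r = ref[start:start + frame_len]
        noise = r - deg[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(r**2) + eps) / (np.sum(noise**2) + eps))
        snrs.append(np.clip(snr, -10.0, 35.0))  # clamp per-frame SNR to a sane range
    return float(np.mean(snrs))

# A lightly corrupted copy of a tone scores higher than a heavily corrupted one.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
slightly_noisy = clean + 0.05 * rng.standard_normal(clean.shape)
very_noisy = clean + 0.3 * rng.standard_normal(clean.shape)
```

A real objective metric would additionally weight the error by its audibility (masking, loudness), which is exactly where such simple SNR-based proxies fail and PEAQ-style models are needed.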
Quality enhancement of user-captured audio
Audio recordings are often subject to localized distortions, due to environmental factors (e.g. acoustic interference and noise) or to constraints imposed by the recording device or the transmission medium itself. Depending on the type of degradation, one may want to remove impulsive or stationary noise, restore data missing due to clipping or packet loss, remove artifacts introduced by format conversion, or repair old CDs and vinyl records affected by scratches and other types of corruption. Over the years, different applications have been developed to automatically detect and remove these imperfections and produce a higher-quality version. Especially for digital audio restoration, a large collection of software products is available today, either as standalone applications or as tools embedded in Digital Audio Workstations (DAWs).
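As a minimal sketch of one such restoration task, the snippet below detects and repairs impulsive clicks by comparing each sample to a running median and replacing outliers with the median value. This is only an illustration of the detect-and-repair idea: production restoration tools use far more elaborate detection (e.g. autoregressive modelling) and interpolation, and the fixed amplitude threshold here is an assumption that would not transfer across signals of different scales.

```python
import numpy as np

def remove_clicks(signal, threshold=0.5, kernel=5):
    """Repair impulsive clicks: samples deviating from a running median
    by more than `threshold` (absolute) are replaced by the median.

    Sketch only; the absolute threshold assumes a roughly unit-scale signal.
    """
    x = np.asarray(signal, dtype=float)
    half = kernel // 2
    padded = np.pad(x, half, mode='edge')
    # running median over a sliding window centred on each sample
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel)
    med = np.median(windows, axis=1)
    resid = x - med
    clicks = np.abs(resid) > threshold
    repaired = np.where(clicks, med, x)
    return repaired, clicks

# Inject a single click into a smooth signal and repair it.
t = np.arange(1000) / 1000.0
clean = np.sin(2 * np.pi * 5 * t)
corrupted = clean.copy()
corrupted[500] += 2.0  # impulsive distortion
repaired, clicks = remove_clicks(corrupted)
```

The median filter is robust to the outlier itself, which is why the detection statistic (sample minus running median) isolates the click without being biased by it.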
Typically, quality enhancement is applied separately to each audio recording or audio channel. However, the recent paradigm of crowdsourced content has opened new directions in the way audio can be processed. Given a multitude of recordings provided by people attending the same event, one of the greatest challenges is to exploit these recordings to create a temporally complete, high-quality representation of the acoustic event, which can complement or even replace a professionally captured recording, which is expensive and not always available. An important requirement here is to align each user-generated recording along a common time axis and to accurately synchronize those that overlap in time. Recent works have shown that this is possible by exploiting the correlations among the available audio streams. One may then use the best available recording at each point in time, or combine portions of the signal from the synchronized recordings to create a continuous timeline. For the same purpose, an even more interesting approach has been proposed: to exploit the redundancy of information in a collection of overlapping user-generated content to create an audio stream of better quality than any of its components. This type of collaborative audio enhancement relies on the assumption that the different recordings share common content, which needs to be revealed and enhanced, while at the same time each audio stream is uniquely corrupted by different types of distortions and interferences, which need to be detected and suppressed.
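The correlation-based alignment step mentioned above can be sketched as follows: the relative time offset between two recordings of the same event is estimated by locating the peak of their cross-correlation. The function name and the toy signals are illustrative; real systems work on audio fingerprints or spectral features rather than raw samples, and must also cope with differing sample-rate drift between devices.

```python
import numpy as np

def estimate_offset(a, b):
    """Estimate the lag (in samples) of recording `b` relative to `a`
    via the peak of their full cross-correlation.

    A positive result means `b` starts that many samples after `a`.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # In 'full' mode, output index k corresponds to lag k - (len(b) - 1).
    corr = np.correlate(a, b, mode='full')
    return int(np.argmax(corr) - (len(b) - 1))

# Two "devices" capture the same event; one starts 300 samples later.
rng = np.random.default_rng(1)
event = rng.standard_normal(5000)
rec_a = event
rec_b = event[300:]  # started late, so it misses the first 300 samples
offset = estimate_offset(rec_a, rec_b)
```

Once such pairwise offsets are known, the recordings can be placed on a common time axis, after which the best segment can be selected at each instant or the overlapping streams combined for collaborative enhancement.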