Compressing Digital Audio

Digital Audio Compression

Are you a sound engineer or audio enthusiast interested in understanding how digital audio data rates can be reduced to meet the available bandwidth of a medium, the storage capacity or a limited data transfer rate? Then you've come to the right place.

This article will look at the options available to reduce the data rate, including predictive data rate reduction and psycho-acoustic data rate reduction, and offer a comparison of the systems.

It is important to recognise that digital audio should be handled at the highest practical resolution throughout the production process, so that the audio at the end of the process is as good as it can be. Data rate reduction therefore requires careful consideration, and may only be appropriate at the final delivery stage.

Digital Audio Data Rates

At the output of the production chain, high quality (uncompressed) digital audio will consist of stereo audio data sampled at 48kHz to 16bit resolution. This will produce a data rate of 1.536Mbit/s.

This can be calculated as follows:

16 bits x 48kHz x 2 channels for stereo
= 16 x 48000 x 2
= 1536000 bits per second
= 1.536 Mbits per second

At that data rate, the amount of storage we require can be calculated as follows:

Data Rate x Time
= 1.536 Mbits per second x 60 seconds x 60 minutes
= 1536000 x 60 x 60
= 5,529,600,000 bits

Divide by 8 to convert to bytes
= 691,200,000 bytes or about 691Mbytes of data

So, to store one hour of stereo audio at 16 bits resolution with a 48kHz sampling frequency, we would need almost 700Mbytes of data.
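The two calculations above can be checked with a short Python sketch. The function names are illustrative only and are not taken from any library.

```python
def pcm_data_rate(bits_per_sample, sample_rate_hz, channels):
    """Uncompressed PCM data rate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels


def storage_bytes(data_rate_bps, seconds):
    """Storage needed for the given duration, in bytes (8 bits per byte)."""
    return data_rate_bps * seconds // 8


rate = pcm_data_rate(16, 48_000, 2)       # 16 bit, 48kHz, stereo
one_hour = storage_bytes(rate, 60 * 60)   # 60 seconds x 60 minutes

print(rate)      # bits per second
print(one_hour)  # bytes for one hour
```

Changing the arguments shows how quickly the figures grow with more channels or a higher sampling frequency.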

Limitations of Data Rate Reduction

On occasion, it will not be possible to store or transfer this audio at full bandwidth, so the data rate may have to be reduced.

Unfortunately, data reduction techniques cause some loss in quality. They have principally been designed for delivery systems rather than for use within the production environment, where the cascading of various audio processes may introduce unacceptable degradation to the sound quality.

A number of systems have been developed to reduce the data rate without a significant reduction in audio quality. Techniques involving processing in the analogue domain introduce obvious audible effects, so only techniques that are implemented in the digital domain will be considered here.

Let's start by looking at the ISO-MPEG audio compression standards that exist and how they work.

ISO-MPEG Audio

The ISO MPEG standard for compression of digital television signals includes a series of audio data rate reduction systems based on psycho-acoustic theory. MPEG-1 was the first version of the standard and includes three audio 'layers':

  • Layer 1 is the simplest and offers a modest reduction in data rate of around 4:1. It formed the basis of the PASC (Precision Adaptive Subband Coding) used in DCC (Digital Compact Cassette) machines. DCC used cassettes similar to standard audio cassettes and recorded digital data using a fixed head with 12 tracks. At the end of the tape the transport reversed and recorded on the 'other side.' One of the selling points of this system was that it could replay analogue cassette tapes.
  • Layer 2 is commonly called MUSICAM and offers various compression ratios; typically 8:1 compression may be used. This is the system used in DAB (Digital Audio Broadcasting) and DTT (Digital Terrestrial Television). The data rate can be varied according to the need and data availability in the ensemble or multiplex. MUSICAM can be used over ISDN circuits. The quality and number of channels will depend on the number of circuits available.
  • Layer 3 is the most complex implementation and provides high quality reproduction at very low data rates. This is suited to transferring high quality audio over ISDN circuits. At 128kbit/s near CD quality can be achieved, and with just one 64kbit/s circuit good quality stereo, albeit with only 12kHz bandwidth, is possible. Layer 3 is also used in other applications where the data rate is restricted, including MP3 files, which give high quality audio over the Internet.

The MPEG-2 standard includes, in addition to the options above, a backward-compatible multi-channel version and also a half sample rate option which provides good commentary-quality audio at low data rates without the complexity of Layer 3 coding. It also includes Advanced Audio Coding (AAC), a 'toolbox' allowing a flexible approach to suit the data capacity available.

The MPEG-4 standard includes many different options and extensions.

Sub-bands

In advanced data rate reduction systems, the audio is split into sub-bands and data rate reduction is applied to each sub-band. This technique often uses quadrature mirror filters (QMFs), or modified versions of them.

In psychoacoustic systems many sub-bands are used, typically 32 or more. Predictive systems use only two or four sub-bands.

The use of digital filters allows the audio to be split and recombined accurately. The digital filters normally split the audio into a number of sub-bands each the same width, say 750Hz or 4kHz wide.

This process does not increase the data rate as each sub-band can be sampled at a fraction of the original rate.

Example of Sub-band Splitting

The simplest QMF splits the digital audio into two subbands sampled at half the original sampling frequency.

In this example, based on the G722 codec, audio sampled at 16kHz is split and two outputs (each 4kHz wide) sampled at 8kHz are produced. The lower sub-band will cover frequencies up to 4kHz (sampled at 8kHz); the upper band will have the frequency band from 4kHz to 8kHz (also sampled at 8kHz).

Because the bandwidth of each sub-band is only 4kHz, sampling at 8kHz is acceptable. Sampling theory requires the sampling frequency to be at least twice the bandwidth, not twice the maximum frequency. Only in the case of a baseband signal are these two equal.

The two sub-bands can be recombined by passing them back through an inverse QMF. This process does not introduce any distortion to the audio signal and is reversible.

Using an extension of this technique, the audio can be split into 32 or more bands as required.
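The split and recombine idea can be illustrated with a minimal Python sketch. It assumes the shortest possible filter pair (a two-tap average and difference, often called the Haar pair), chosen here only because it is easy to follow; practical codecs use much longer filters to separate the bands more cleanly. The function names are hypothetical.

```python
def qmf_split(samples):
    """Split a signal into a low band and a high band.

    Each band holds half as many samples as the input, so the total
    data rate is unchanged.
    """
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    pairs = [(samples[i], samples[i + 1]) for i in range(0, len(samples), 2)]
    low = [(a + b) / 2 for a, b in pairs]    # local average: low frequencies
    high = [(a - b) / 2 for a, b in pairs]   # local difference: high frequencies
    return low, high


def qmf_merge(low, high):
    """Inverse of qmf_split: rebuild the full-rate signal."""
    out = []
    for l, h in zip(low, high):
        out.append(l + h)
        out.append(l - h)
    return out


audio = [3, 7, -2, 4, 10, 10, -5, 1]
low, high = qmf_split(audio)
restored = qmf_merge(low, high)
print(low, high, restored)
```

Each band can be split again in the same way, which is one route to the larger numbers of sub-bands mentioned above.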

Advanced Data Rate Reduction Techniques

With the development of digital signal processing techniques, it is possible to reduce the data rate required. Audio signals can be considered in terms of either the time domain or the frequency domain.

  • Time Domain or ADPCM (Adaptive Differential Pulse Code Modulation): This uses predictors to identify redundant data which does not need to be transmitted. This technique is used in G722 and apt-X100.

  • Transform Coding or Frequency Domain: This removes irrelevant data (data which is inaudible) and makes use of a psychoacoustic model of the ear. The data for the model is provided by an FFT or DCT. MUSICAM and ATRAC (used by MiniDisc) use this approach.

Later systems use a combination of these two approaches.

Predictive Data Rate Reduction

These are based on the removal of redundant data before transmission and adding it back on reception. The data is usually split into a small number of subbands (two or four) before being processed in the time domain and the data rate is reduced to around a quarter of the original rate.

Predictive systems include G722 and apt-X100.

Prediction

Predictors are used to identify the redundant data. Only the error in the prediction is transmitted.

Simple Predictive Data Rate Reduction System

In the decoder, the predictor does not have access to the original audio data. For the system to work successfully, the predictions made in the encoder should be the same as those of the decoder. The encoding process therefore must include a 'decoder'. The predictor in the encoder must not 'cheat' by using the original audio data.

The predictors make a 'guess' at the output based on what has gone already. If the prediction is good then the error signal will be small and few bits will be required to code the error signal. If the prediction is not so good then the error could be large and no reduction in data rate would be achieved.

To achieve better predictions, the predictors make use of the error signal as well as the output of the previous prediction 'corrected' by the error. For much of the time good predictions are possible.
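The structure described above can be sketched in Python. This is a deliberately simple illustration, not the scheme used by any particular codec: the predictor just repeats the previous decoded value, the quantiser step is fixed at 4, and all names are hypothetical. The point to notice is that the encoder updates its prediction from the decoded value, never from the original sample, so it stays in step with the decoder.

```python
STEP = 4  # assumed fixed quantiser step size


def encode(samples):
    """Send only the quantised prediction error for each sample."""
    codes = []
    prediction = 0                      # encoder and decoder start from the same state
    for sample in samples:
        code = round((sample - prediction) / STEP)
        codes.append(code)
        # Embedded 'decoder': rebuild exactly what the receiver will produce.
        prediction = prediction + code * STEP
    return codes


def decode(codes):
    """Add each received error back onto the running prediction."""
    out = []
    prediction = 0
    for code in codes:
        prediction = prediction + code * STEP
        out.append(prediction)
    return out


samples = [0, 10, 22, 30, 33, 31]
codes = encode(samples)
decoded = decode(codes)
print(codes, decoded)
```

For this slowly changing input the transmitted codes are much smaller than the samples themselves, and each decoded value stays within half a step of the original.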

Predictive Data Reduction System

Prediction techniques have difficulty with transients, which, by their very nature, are unpredictable. If a large error is produced because of a transient, it is not possible to allocate any more bits than the channel capacity will allow. This is overcome by use of adaptation.

Adaptation

By altering the step size (using an adaptive quantiser) it is possible to code large error signals with lower resolution. Inevitably this will introduce some distortion of the signal but hopefully this will not be audible.

The data capacity available in data rate reduced systems is severely limited and so it is not practical to allocate any data to transmit the adaptation information directly. Instead the adaptation information is derived from the outgoing data.

Adaptive Data Rate Reduction Encoder

In order to ensure that the system operates correctly, the encoder and decoder must derive identical adaptation information. The encoder contains an inverse adaptive quantiser as found in the decoder. The adaptation in the encoder and the decoder operate on the same data and so should respond identically.

Adaptive Data Rate Reduction Decoder

The adaptation process changes the step size by a factor of two; each step then represents an error of twice the size. If necessary this process can be repeated to cover a greater range and the step size will be increased by a factor of four or eight or more. Thus, large error signals can be coded using this approach.
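A minimal Python sketch of this kind of adaptation follows. The rule used here is an assumption made for illustration: the step doubles whenever the transmitted code hits its limit and halves when the code is small. The seven-level code range and all names are hypothetical. Because the step size is derived only from the codes that are sent, the decoder can follow the encoder without any extra adaptation data.

```python
MAX_CODE = 3   # hypothetical 7-level quantiser: codes run from -3 to +3
MIN_STEP = 1


def next_step(step, code):
    """Derive the next step size from the code that was just transmitted."""
    if abs(code) == MAX_CODE:
        return step * 2                  # error was too large: coarser steps
    if abs(code) <= 1:
        return max(MIN_STEP, step // 2)  # error was small: finer steps
    return step


def encode(samples):
    codes, prediction, step = [], 0, MIN_STEP
    for sample in samples:
        code = round((sample - prediction) / step)
        code = max(-MAX_CODE, min(MAX_CODE, code))   # limited channel capacity
        codes.append(code)
        prediction += code * step        # inverse adaptive quantiser, as in the decoder
        step = next_step(step, code)
    return codes


def decode(codes):
    out, prediction, step = [], 0, MIN_STEP
    for code in codes:
        prediction += code * step
        out.append(prediction)
        step = next_step(step, code)     # same rule, same data, same result
    return out


samples = [0, 0, 100, 100, 100, 100, 100, 100]   # silence followed by a transient
codes = encode(samples)
decoded = decode(codes)
print(codes, decoded)
```

The decoded output takes several samples to catch up with the transient, which is the distortion the text refers to, but it gets there using only three-bit codes.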

Improved Adaptation Performance

The performance of the adaptive system can be improved if the predictor gets adaptation information. The predictor may then be able to respond to the transient, make a good 'prediction' and thus reduce the size of the next error signal that needs to be coded.

Improved Adaptive System

Errors

Because the adaptation process changes the size of the quantising steps it introduces a 'quantising' error; the decoded audio will not be the same as the original. This error may be small and inaudible but if the coding and decoding process is repeated then significant degradation of the signal will occur.

The second time around, the predictor will still not make perfect predictions and the error will be different as the data has already been coarsely quantised; thus the audio will be further degraded.

If the predictions are good then 'near lossless' compression is achieved. The quality is dependent on the programme material, and the compression ratio. Tone can be reproduced accurately but castanets, for example, will suffer some loss on the transients. For a single pass however, the loss of quality may not be audible.

G722

The G722 system uses one 64kbit/s bearer and codes 8kHz bandwidth audio in two 4kHz bands at a total data rate of 64kbit/s or less. 48kbit/s is used for the lower sub-band, effectively giving 6bits per 'error' sample; only 16kbit/s is used for the upper sub-band, 2 bits per 'error' sample.

Alternative modes allow additional data to be carried by reducing the lower sub-band rate by 8kbit/s or 16kbit/s.
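The bit budget above can be checked with a few lines of Python. This is a sketch under one assumption: that the alternative modes simply use one or two fewer bits for each lower sub-band sample, which is inferred from the 8kbit/s and 16kbit/s figures. The function name is illustrative.

```python
SUBBAND_SAMPLE_RATE = 8_000   # each 4kHz sub-band is sampled at 8kHz
BEARER_RATE = 64_000          # one 64kbit/s circuit
UPPER_BITS = 2                # bits per 'error' sample in the upper sub-band


def g722_budget(lower_bits):
    """Return (lower rate, upper rate, spare capacity) in bits per second."""
    lower = SUBBAND_SAMPLE_RATE * lower_bits
    upper = SUBBAND_SAMPLE_RATE * UPPER_BITS
    return lower, upper, BEARER_RATE - (lower + upper)


for bits in (6, 5, 4):
    print(bits, g722_budget(bits))
```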

APT-X100

APT-X100 uses four bands, which, with 48kHz sampling, are 6kHz wide. The data rate is reduced by a factor of four to 384kbit/s for stereo. It can also be used at other sample rates and can interface with a number of 64kbit/s ISDN circuits to provide high quality transmission.

Use of APT-X100 with ISDN for Mono Audio

Sample Rate   Audio Bandwidth   Sub-band Width   Mono Data Rate   64kbit/s Circuits
48kHz         24kHz (20kHz)     6kHz             192kbit/s        3
32kHz         16kHz             4kHz             128kbit/s        2
24kHz         12kHz             3kHz             96kbit/s         2
16kHz         8kHz              2kHz             64kbit/s         1

Use of APT-X100 with ISDN for Stereo Audio

Sample Rate   Audio Bandwidth   Sub-band Width   Stereo Data Rate   64kbit/s Circuits
48kHz         24kHz (20kHz)     6kHz             384kbit/s          6
32kHz         16kHz             4kHz             256kbit/s          4
24kHz         12kHz             3kHz             192kbit/s          3
16kHz         8kHz              2kHz             128kbit/s          2

The audio bandwidth depends on the number of 64kbit/s circuits available as apt-X100 always processes at a fixed 4:1 ratio. If one of the circuits is lost the interface will reconfigure to the new data rate.
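The figures in the two tables follow directly from the fixed 4:1 ratio, as this Python sketch shows. It assumes 16 bit samples before coding, which is consistent with the 384kbit/s stereo figure at 48kHz; the function and constant names are illustrative.

```python
import math

BITS_PER_SAMPLE = 16      # assumed word length before data rate reduction
COMPRESSION_RATIO = 4     # apt-X100 always works at 4:1
SUBBANDS = 4
CIRCUIT_RATE = 64_000     # one ISDN circuit, in bits per second


def aptx_link(sample_rate_hz, channels):
    """Return (data rate in bit/s, circuits needed, sub-band width in Hz)."""
    data_rate = sample_rate_hz * BITS_PER_SAMPLE * channels // COMPRESSION_RATIO
    circuits = math.ceil(data_rate / CIRCUIT_RATE)   # part-used circuits still count
    subband_width = sample_rate_hz // 2 // SUBBANDS  # audio bandwidth shared by 4 bands
    return data_rate, circuits, subband_width


for rate in (48_000, 32_000, 24_000, 16_000):
    print(rate, "mono", aptx_link(rate, 1), "stereo", aptx_link(rate, 2))
```

The 24kHz mono case needs 96kbit/s, which is why it occupies two circuits even though the second is only half used.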

In addition to its use with ISDN, apt-X100 is also used in DART recorders that record audio on zip discs for use with jingles or trails.

Psycho Acoustic Data Rate Reduction

Another approach to data rate reduction avoids sending irrelevant data, data for audio which cannot be heard. In this case, analysis takes place in the frequency domain and these systems use many sub-bands, 32 or more.

Characteristics of the Ear

These data rate reduction systems remove irrelevant data, that is data for signals which are inaudible. They rely on a Psycho Acoustic Model representing the response characteristics of the ear.

Audio signals below the threshold of sensitivity of the ear need not be transmitted. The ear is most sensitive to signals between 1kHz and 5kHz; it is less sensitive to signals at low and high frequencies.

Threshold of Hearing

Masking

The presence of any signal alters the threshold of hearing. The level below which signals are inaudible is altered. For example, a loud signal will mask quieter signals close to the same frequency.

Signal at 630Hz masked by louder 1kHz signal

This masking phenomenon allows considerable savings in the data rate required to provide acceptable quality for the listener.

Psycho Acoustic Model

In order to make use of these effects in a data rate reduction system, a mathematical 'model' of the performance of the ear is required. The model takes account of the differing sensitivity of the ear to 'tonal' (musical) sounds and 'atonal' (noise-like) sounds.

In addition to the static masking described above, there are also temporal masking effects. It is not surprising that loud signals mask quiet signals which follow shortly afterwards. What also needs to be taken into account is that quiet signals may be masked by louder signals which occur a few milliseconds later.

ISO MPEG Layer 2 (MUSICAM)

MUSICAM (Masking-pattern Universal Sub-band Integrated Coding and Multiplexing) is the common name for ISO MPEG Layer 2. This is the most widely used system in broadcast applications. It is used in both DAB (Digital Audio Broadcasting) and DTT (Digital Terrestrial Television) and also on ISDN circuits. It is particularly optimised for applications where only the decoder is in the consumer product. Although the coding process is more complex than Layer 1, the decoder is only slightly more complex and improved performance is achieved at lower data rates with little extra cost.

MUSICAM Encoder

The basic structure of an encoder has a main path and a side chain. The side chain incorporates the analysis of the audio and the psycho-acoustic model of the ear and controls the data output from the main chain.

MUSICAM codes data into self-contained frames lasting 24ms. The standard defines how the output data is formatted, and decoders are therefore simpler to implement than the encoder as they do not require analysis software.

The conceptual diagram shows the basic processes required for either Layer 1 or Layer 2 encoders. The simplest coders may not have a Fast Fourier Transform (FFT) and make a crude analysis using just the scale factors as the input data to determine the bit allocation.

ISO Layer 2 Conceptual Encoder

Psycho Acoustic Decoder

ISO Layer 2 Decoder

The decoder (contained within the set-top box or digital radio) is much simpler than the encoder as it does not have to make any decisions. It reverses the bit allocation and scaling processes and reassembles the full audio signal.

Sub-bands

The audio signal is split into 32 sub-bands. These bands are each 750Hz wide and are sampled at 1.5kHz (48kHz/32). Splitting the audio into sub-bands in this way does not in itself alter the data rate but the bands representing signals above 20kHz will clearly never be required. It is by reducing the number of quantising levels in each sub-band that the data rate is reduced.

The audio is divided into 8ms blocks. Thus there are 12 audio samples in each sub-band block. The samples within each block are coded with the same number of quantising levels.

FFT Analysis

At the same time as the audio is passing through the sub-band filter, it is analysed using a 1024 point Fast Fourier Transform (FFT) which gives detailed data, 47Hz resolution, within the audio block. This data is used by the psycho-acoustic model to identify which of the sub-bands contain no audible information above the threshold of hearing and therefore do not need to be coded at all. The signals in the remaining bands are used to determine the masking threshold below which signals are inaudible.

In this way, the maximum signal and the masking level are determined for each sub-band. Some sub-bands may have no signals above the masking threshold and will not require to be coded.
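The timing and resolution figures quoted in the Sub-bands and FFT Analysis sections above all follow from the 48kHz sampling frequency. The Python sketch below simply reproduces that arithmetic; the constant names are illustrative.

```python
SAMPLE_RATE = 48_000
SUBBANDS = 32
SAMPLES_PER_BLOCK = 12    # samples per sub-band in one block
BLOCKS_PER_FRAME = 3
FFT_POINTS = 1024

subband_rate = SAMPLE_RATE // SUBBANDS                 # sampling rate of each sub-band, Hz
subband_width = subband_rate // 2                      # bandwidth of each sub-band, Hz
block_ms = SAMPLES_PER_BLOCK * 1000 / subband_rate     # duration of one block
frame_ms = block_ms * BLOCKS_PER_FRAME                 # duration of one frame
samples_per_frame = SUBBANDS * SAMPLES_PER_BLOCK * BLOCKS_PER_FRAME
fft_resolution = SAMPLE_RATE / FFT_POINTS              # spacing of FFT bins, Hz

print(subband_rate, subband_width, block_ms, frame_ms, samples_per_frame, fft_resolution)
```

The FFT bin spacing works out at just under 47Hz, far finer than the 750Hz sub-bands, which is what gives the psycho-acoustic model its detailed view of the signal.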

Scale Factors

Scale factors are only required for active bands, bands for which audio data will be sent. A Scale Factor for each active band is used to indicate the maximum absolute value in the 8ms block. The 6 bit Scale Factor covers a range of over 120dB in 2dB steps.

Scale Factor Select Information (SFSI)

In Layer 2, to further save on the number of bits required for transmission, the Scale Factors for three blocks (a 24ms Frame) are compared. If all three are the same (or very close to the same value) then only one needs to be sent. If two are the same then one Scale Factor represents the data for two blocks and another is sent for the third block. If necessary three different Scale Factors could be sent.

Scale Factor Select Information also needs to be sent to indicate to the decoder how the Scale Factors are arranged. Two bits indicate whether one, two or three Scale Factors are being transmitted.

In this way, for a 24ms Frame between 8 bits and 20 bits will be required to code the scale information for each active band. In many instances, this will be less than the 18 bits required for each active sub-band for the same 24ms of audio when using a Layer 1 coder.
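The bit counts above can be reproduced with a small Python sketch. It makes a simplifying assumption: Scale Factors are shared only when neighbouring blocks have exactly the same value, whereas a real encoder also merges values that are very close. The function names are hypothetical.

```python
from itertools import groupby

SCALE_FACTOR_BITS = 6
SFSI_BITS = 2
BLOCKS_PER_FRAME = 3


def scale_factors_sent(scale_factors):
    """Count the runs of identical neighbouring Scale Factors in a frame."""
    return sum(1 for _ in groupby(scale_factors))


def layer2_scale_bits(scale_factors):
    """Bits for one active band in one 24ms frame, including the SFSI."""
    return SFSI_BITS + SCALE_FACTOR_BITS * scale_factors_sent(scale_factors)


def layer1_scale_bits():
    """Layer 1 sends a Scale Factor for every block and no SFSI."""
    return SCALE_FACTOR_BITS * BLOCKS_PER_FRAME


print(layer2_scale_bits([20, 20, 20]))   # steady signal
print(layer2_scale_bits([20, 20, 22]))   # level changes once
print(layer2_scale_bits([20, 21, 22]))   # level changes every block
print(layer1_scale_bits())
```

Layer 2 only costs more than Layer 1 in the worst case, where all three Scale Factors differ.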

Bit Allocation

The number of quantising levels required for each sub-band is determined. The aim is to use as few quantising levels as are required to provide sufficient dynamic range to ensure that the quantisation noise is below the masking noise threshold in the band.

Masking threshold diagram showing first two sub-bands

Using the above example, in the first 750Hz sub-band, the Scale Factor indicates that the maximum signal level is 60dB. The masking threshold is at around 30dB, giving a signal-to-mask ratio of 30dB. This will determine how many quantising levels are required to provide the necessary resolution.

In the second band, the Scale Factor indicates 90dB, and the signal-to-mask ratio is also 30dB, so the number of quantising levels required will be the same. In each case, the aim is to ensure that the quantising noise is below the masking noise threshold, therefore the quantising noise will be inaudible.

This part of the process may involve some compromise, depending on the audio programme content, as there may not be enough data capacity available to code the audio samples to the required resolution. In that case, the process of allocating bits to each band is repeated to minimise the total noise-to-mask ratio for each band and for the entire frame.

Quantising Levels

In many bands only a small number of quantisation levels may be required. With small numbers of quantisation levels the difficulty of representing silence must be dealt with.

A small offset of half of one quantising level is not significant in a high-resolution system, but if, for example, only 4 quantising levels are to be used, an offset of half a quantising level would be significant. To overcome this problem, the coding uses only odd numbers of quantisation levels; silence is correctly coded as the mid-value.

The number of levels used to code the samples may be 3, 5, 7, 9, 15, 31, 63,..., or 65535. A Bit Allocation code is used to indicate to the decoder which of the allowed numbers of quantising levels has been chosen. If no audio samples and no scale factor information are to be sent, because the band is inactive, this is also indicated by the Bit Allocation code.
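The link between the signal-to-mask ratio and the number of quantising levels can be sketched in Python. This is only a rough illustration: it assumes the signal-to-noise ratio of a quantiser is about 20·log10(levels) dB (roughly 6dB per bit), which is a rule of thumb and not the figure a real encoder would take from the standard. The function name is hypothetical.

```python
import math

# Odd level counts only: 3, 5, 7, 9, then 2**n - 1 up to 16 bits.
ALLOWED_LEVELS = [3, 5, 7, 9] + [2 ** bits - 1 for bits in range(4, 17)]


def levels_for(signal_db, mask_db):
    """Smallest permitted level count that keeps quantising noise under the mask."""
    smr = signal_db - mask_db
    if smr <= 0:
        return 0                      # signal is masked: band is inactive
    for levels in ALLOWED_LEVELS:
        if 20 * math.log10(levels) >= smr:   # rule-of-thumb SNR, an assumption
            return levels
    return ALLOWED_LEVELS[-1]


print(levels_for(60, 30))   # first band in the example: 30dB signal-to-mask
print(levels_for(90, 60))   # second band: louder, but the same ratio
print(levels_for(20, 30))   # a signal below the masking threshold
```

The two example bands need the same resolution because only the ratio matters, not the absolute level, which is carried separately by the Scale Factor.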

Bit Allocation

Layer 1 used the same bit allocation method for all sub-bands. The audio in each band can use any of the 15 permitted numbers of quantisation levels. The choice is indicated by the 4 bit Bit Allocation Data.

In Layer 2, additional data is saved by reducing the bit allocation options for higher frequency bands as shown in the table below. For low frequency sub-bands (sub-bands 0 to 10), it is possible to choose any of 15 numbers of levels. For mid frequency sub-bands (sub-bands 11 to 22), 3 bits are used for the Bit Allocation Information and 7 possible numbers of quantisation levels may be used.

The highest bands (sub-bands 23 to 26) are around 10% of an octave wide. With very narrow bands, only a few quantising levels are required to ensure that any quantising noise is masked by the signal, so only 2 bits are used for Bit Allocation and 3 possible numbers of quantisation levels are used.

Bands 27 to 31 extend above 20kHz and are never used and so no data relating to those bands need to be sent.

Bit Allocation Table - Layer 2

In each sub-band if the index is 0, no quantising takes place in that band. Bands above 20kHz are never used.

If the index is 1, then only three quantising levels are allocated. If the index is 2, then seven levels are used for the bands 0 to 2, and five for all other active bands. Similarly, if the index is 3 then 15 levels are used for bands 0 to 2, and seven for bands 3 to 22, and 65535 for bands 23 to 26.

Audio Data

For some sub-bands, no audio data may be required because there is no signal above the masking threshold. For others, very few quantisation levels will be required to ensure that the dynamic range of signal above the masking level can be achieved. In this way, significant reductions in overall data rate can be achieved for many types of programme material.

Sub-band Samples

The 12 sub-band samples in each audio block are processed in the same way. They are first multiplied by the scale factor to "normalise" them. The data is then re-quantised to reduced resolution as determined by the bit allocation. The three blocks of audio samples within each 24ms frame are coded separately.

Granules

As 3, 5, and 9 quantising levels are not coded efficiently in binary form, a further saving is made in Layer 2 codecs by combining three successive audio samples to form a "granule".

For example, if the Bit Allocation allows 3 quantising levels for each sample in the sub-band, then there will be 27 possible combinations for the three audio samples forming a granule. These can be coded using just 5 bits rather than 6, thus making a saving when this is applied to the 12 samples in the block and for many sub-bands.

If 5 quantising levels are required, the audio samples are coded with a 7 bit codeword, rather than 9 bits; samples with 9 quantising levels are coded with a 10 bit codeword, rather than 12. The saving can be up to 20% using this technique, and even this modest saving can offer some useful improvement in performance as many bands, particularly at high frequencies, will be coded with a small number of quantising levels.
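The saving from granules can be demonstrated with a short Python sketch. The packing order used here (first sample in the lowest position) and the function names are assumptions made for illustration; what matters is that three samples share one codeword.

```python
def bits_for(values):
    """Bits needed to give each of `values` possibilities its own binary code."""
    return (values - 1).bit_length()


def pack_granule(samples, levels):
    """Combine three quantised samples (each 0..levels-1) into one codeword."""
    a, b, c = samples
    return a + levels * b + levels * levels * c


def unpack_granule(code, levels):
    """Recover the three samples from a granule codeword."""
    return [code % levels, (code // levels) % levels, code // (levels * levels)]


for levels in (3, 5, 9):
    separate = 3 * bits_for(levels)     # three samples coded one by one
    grouped = bits_for(levels ** 3)     # one codeword for the whole granule
    print(levels, separate, grouped)

code = pack_granule([2, 0, 1], 3)
print(code, unpack_granule(code, 3))
```

The same calculation shows why granules are not used for 7 or 15 levels: three bits or four bits per sample already waste very little.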

Output

A Layer 1 coder produces a block of data for each 8ms of audio (at 48kHz). The usual data rate will be 384kbit/s for stereo.

The Layer 2 coder processes 24ms frames, which incorporate three 8ms blocks. Using this compression technique, 256kbit/s will provide the best quality stereo transmission, 192kbit/s will be almost as good and will be sufficient for most broadcast material. If the rate is reduced still further, the audio performance will be degraded.

These rates can be halved if mono programme material is used.

MUSICAM Audio Frame Structure

The standard defines the MUSICAM audio frame. The standard does not define how either the encoder or decoder should operate. The encoder must provide data conforming to the format and the decoder converts that data back to audio. The frame of data is decoded into a 24ms audio segment.

MUSICAM Audio Frame (24ms)

  • Header: Sync word, ID, layer, bit rate, sample frequency, mode, copyright, emphasis
  • CRC: Error detection for header, Bit Allocation and Scale Factor Select Information
  • SFSI: Scale Factor Select Information used for active bands only - two sets in stereo mode.
  • Scale Factors: One, two or three 6 bit scale factors for active bands only - two sets in stereo mode.
  • Sub-band Samples: Audio data coded as indicated by the bit allocation information, using granules when required - separate data for each audio channel.

Usage of Available Data Capacity

The encoder processes the data according to the demands of the psycho-acoustic model. If the total number of bits required is less than the number available, any remaining bits are unused.

If the total number required is greater than the number available some compromise will be required and the bit allocation will be recalculated to minimise the quantising noise to mask ratio for each active band and for the signal as a whole.
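Both cases can be illustrated with a simple greedy allocation loop in Python. This is a sketch of the general idea and not the procedure from the standard: it assumes each extra bit per sample lowers the quantising noise by about 6dB, and it always gives the next bit to the band whose noise is furthest above its masking threshold. All names are hypothetical.

```python
def allocate_bits(smr_db, bit_pool, samples_per_band=12, db_per_bit=6.0):
    """Share a pool of bits between bands; returns (bits per sample, bits left over)."""
    bits = [0] * len(smr_db)

    def noise_to_mask(band):
        # How far the quantising noise in this band sits above the mask, in dB.
        return smr_db[band] - db_per_bit * bits[band]

    while bit_pool >= samples_per_band:
        worst = max(range(len(smr_db)), key=noise_to_mask)
        if noise_to_mask(worst) <= 0:
            break                      # all noise is masked: remaining bits unused
        bits[worst] += 1               # one more bit for every sample in that band
        bit_pool -= samples_per_band
    return bits, bit_pool


smr = [30, 12, -5]                     # signal-to-mask ratio per band, in dB
print(allocate_bits(smr, 1000))        # plenty of capacity
print(allocate_bits(smr, 36))          # not enough capacity: a compromise
```

With a generous pool the loop stops early and leaves bits unused; with a small pool it spends everything on the band with the worst noise-to-mask ratio and the others go without.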

MPEG-1 Layer 3

Layer 3 of the MPEG standard was designed to give the best quality at very low rates and involves a considerable increase in complexity compared with Layer 2. It combines a number of techniques to try to optimise the performance under the most stringent conditions. This part of the standard represented the best that could be achieved at the time with the processing power then available.

MPEG Layer 3 Encoder

Filter Bank

The process starts with the same filter bank as MUSICAM, which splits the audio into 32 x 750Hz wide sub-bands.

FFT

At the same time the audio is analysed by a 1024 point Fast Fourier Transform to provide data for the psycho acoustic model.

MDCT

The audio is further split by 18-channel modified discrete cosine transforms (MDCTs). This results in 576 (32 x 18) fine resolution frequency "lines" about 40Hz wide.

The MDCT can be switched into two modes - long (24ms) or short (8ms) - under the control of the psycho-acoustic model. This improves the time resolution. Four windows are used in order to ensure that the correct overlapping takes place when the mode changes. Even the short window is not short enough to fully overcome the effects of pre-echo.

Quantisation

The quantisation is a two-loop process referred to as "Analysis by Synthesis".

The inner loop determines the step size for the data by using non-uniform quantisation. Large values are quantised less accurately than small values. The step sizes are chosen to ensure that the data required does not exceed the available capacity.

In the second outer loop, the actual quantising error produced by the inner loop is compared with the calculated masking threshold from the psycho acoustic model. The individual weightings for each band are adapted accordingly.

Pre-Echo Reduction

As the short window used by the MDCT is 8ms, some pre-echo effects may be present. In order to reduce the audible effects of pre-echoes, the resolution in blocks where this occurs must be increased. In this way, the quantising noise will be reduced and will no longer be audible.

The quantisation process allocates all the bits available to achieve the best possible performance. So no data capacity is available to allow increased resolution. Instead the frame size is varied so that frames do not all have the same length but the long-term average is kept constant.

Clearly, the bit rate cannot exceed the channel capacity, so instead additional delay is introduced into the system and forms a "bit reservoir". Extra data required for pre-echo reduction uses bits from the reservoir. Over the following frames the bit rate is reduced below the average to allow the reservoir to balance the rates.
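The behaviour of a bit reservoir can be sketched in Python. The model below is a simplification with invented numbers: the reservoir is assumed to start full, each frame may use the average allocation plus whatever is in the reservoir, and anything unspent is returned to it up to a fixed limit. The function name and parameters are hypothetical.

```python
def share_bits(demands, mean_bits, reservoir_max):
    """Grant bits to each frame, borrowing from and repaying a reservoir."""
    reservoir = reservoir_max          # assumed to start full
    granted = []
    for demand in demands:
        available = mean_bits + reservoir
        used = min(demand, available)  # a frame can never exceed what is available
        granted.append(used)
        reservoir = min(reservoir_max, available - used)
    return granted


# The third frame contains a transient and asks for extra bits.
demands = [3000, 3000, 4000, 2500, 2500]
granted = share_bits(demands, mean_bits=3000, reservoir_max=1500)
print(granted, sum(granted) / len(granted))
```

The frame with the transient receives more than the average, the following frames receive less, and the long-term average stays at the channel rate.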

Huffman Encoding

In order to reduce still further the data rate required, Huffman coding is used. This is a lossless process which does not modify the information being transmitted, but merely reduces the number of bits required.

A sequence of "zeros" is run-length encoded: a number indicates how many zeros are in the sequence.

Other data uses a Huffman code table. In this technique frequently occurring patterns in the data are represented by short codes. The data is analysed to identify the patterns and a "look-up table" is used. Different sub-regions within the data stream may use different code tables. In this way the efficiency of the coding process is enhanced.
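Both ideas can be illustrated with a short Python sketch. For clarity it builds a Huffman table from the data itself, which is an assumption made for the example; a codec following a standard would use tables that the decoder already knows. The function names and the ("Z", n) marker for a run of zeros are hypothetical.

```python
import heapq
from collections import Counter


def run_length_zeros(values):
    """Replace each run of zeros with a ("Z", run_length) pair."""
    out, run = [], 0
    for value in values:
        if value == 0:
            run += 1
            continue
        if run:
            out.append(("Z", run))
            run = 0
        out.append(value)
    if run:
        out.append(("Z", run))
    return out


def huffman_table(symbols):
    """Build a code table in which frequent symbols get the shortest codes."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # the two least frequent groups
        n2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


data = [1, 1, 1, 1, 2, 2, 3, 1]
table = huffman_table(data)
encoded = "".join(table[s] for s in data)
print(table, encoded, len(encoded))
print(run_length_zeros([5, 0, 0, 0, 2, 0, 0]))
```

The most common value is given a one-bit code, so the eight values take fewer bits than a fixed two-bit code would need, and nothing is lost in the process.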

In addition, there is an improvement in sensitivity to transmission errors as, at a particular error rate, short codes will be less prone to errors than long codes.

Dolby AC-3

Dolby AC-3 was designed specifically with multichannel applications in mind. It is used for Dolby surround on films for cinemas and on DVDs. This is a proprietary data rate reduction system developed by Dolby. It can code between one and five channels of audio, and in addition there is an optional low-frequency channel (up to 120Hz).

The processing uses an MDCT-like transform and has a variable block length of 5.3ms or 2.6ms and a basic frequency resolution of 94Hz.

The bit allocation is completely adaptive and depends on the control by the psycho-acoustic model. It uses a "shared bit pool". This allows channels with greater frequency content to demand more data than sparsely occupied channels. The data stream contains information on the bit allocation and ensures that the encoder and decoder keep in step.

Channel Coupling is also used, but like the joint stereo option in MPEG audio, it applies only to the high frequency bands.

Again like the system of "granules" used in layer 2, the sample values can be combined in pairs or triplets to give the most compact data.

The data rates are between 32kbit/s and 640kbit/s. A single mono channel might use 32kbit/s; 192kbit/s would be required for two channel audio and 384kbit/s for 5.1 channel Dolby Surround Digital.

Speech Codecs

When broadcasters started to use transmission methods which worked at very low data rates, speech coders were used for services involving only speech based audio.

Rather than trying to represent the speech as a sequence of samples, speech based codecs rely on a parametric model to simulate a short 10ms to 40ms segment of speech. For each segment, a set of parameters are estimated and the decoder uses those parameters to synthesise the speech signal, which is perceptually close to the original.

The CELP (Code Excited Linear Prediction) sometimes called VXC (Vector eXcitation Coding) approach achieves good performance at 8kbps but degrades very quickly if the data rate is reduced. The CELP approach determines whether each segment is periodic (voiced) or noise-like (unvoiced). This approach generally does not get it quite right and an error signal is computed to compensate for the errors. This approach also does not cope well with speech in a noisy environment.

Multi Band Excitation (MBE) allows the speech segment to be a mixture of both voiced and unvoiced components, which improves the performance.

The HVXC (Harmonic Vector eXcitation Coding) approach adopted within MPEG-4 is a hybrid that includes both CELP and MBE approaches. The signal is classified into four classes: unvoiced, mixed with low-level voicing, mixed with high-level voicing, and voiced. This improves the performance and gives good results at 2kbps.
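The data rates quoted for the speech codecs translate into a very small bit budget for each segment. The sketch below assumes a 20ms segment (within the 10ms to 40ms range given above) and, for the uncompressed comparison, 16-bit samples at 8kHz; both are assumptions rather than figures from any codec specification.

```python
def bits_per_segment(bit_rate_bps, segment_ms):
    """Bits available to describe one segment at a given data rate."""
    return bit_rate_bps * segment_ms // 1000


def pcm_bits_per_segment(segment_ms, sample_rate_hz=8000, bits_per_sample=16):
    """Bits needed to store the same segment as plain samples (assumed format)."""
    return sample_rate_hz * segment_ms // 1000 * bits_per_sample


SEGMENT_MS = 20  # assumed segment length

budget_8k = bits_per_segment(8000, SEGMENT_MS)  # bits per segment at 8kbps
budget_2k = bits_per_segment(2000, SEGMENT_MS)  # bits per segment at 2kbps
pcm_bits = pcm_bits_per_segment(SEGMENT_MS)     # bits per segment as samples

print(budget_8k, budget_2k, pcm_bits)
```

With only a few dozen bits per segment at 2kbps, the coder can afford to send model parameters but not the samples themselves.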

Comparison of Systems

The predictive systems have the advantage of short processing delays, typically 2 to 3ms, and so can be used directly in live situations with relatively few problems. The coder and decoder must have the same predictor characteristics, so if the predictor is improved then both the encoder and decoder need to be updated.

The transform systems have a much longer delay, in excess of 50ms, which may be as much as 200ms for MPEG layer 2 and over 600ms for layer 3. The decoder in these systems is relatively simple, and improvements in the psycho-acoustic model can be made without the need to change the decoder.

Performance Testing

The traditional assessment method for high quality codecs exhibiting small impairments is standardised by the ITU (Recommendation BS.1116-3). This is a 'double-blind, triple-stimulus with hidden reference' test. Samples are presented in the order ABCABC, where A is always the unprocessed source material or reference, and B and C are the processed material or another presentation of the original; the listeners do not know which is which. Listeners score the results on the five-point impairment scale from imperceptible (5) to very annoying (1). The results for these tests are averaged and the mean values quoted. The hidden reference scores above 4.5 and usually above 4.7. No direct comparison between codecs is made in these tests.
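The averaging step can be sketched as follows. The listener scores here are hypothetical placeholders invented purely to show the calculation; they are not results from any real listening test, and `mean_scores` is an illustrative helper.

```python
from statistics import mean


def mean_scores(scores_by_item):
    """Average each item's listener scores on the five-point scale."""
    return {item: mean(values) for item, values in scores_by_item.items()}


# Hypothetical scores (5 = imperceptible, 1 = very annoying); not real data.
example_scores = {
    "hidden reference": [5.0, 4.8, 5.0, 4.6, 4.9],
    "codec under test": [4.2, 3.9, 4.5, 4.0, 4.1],
}

results = mean_scores(example_scores)
for item, score in results.items():
    print(f"{item}: {score:.2f}")
```

If the hidden reference did not average close to 5, that would suggest the listeners could not reliably tell the reference from the processed material.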

For low data rate codecs this testing method is less satisfactory. The MUSHRA (Multi Stimulus test with Hidden Reference and hidden Anchors) approach, which is also double-blind, is more appropriate as it gives an absolute measure of the audio quality which can be compared to the reference.

This system allows listeners to compare systems with each other and with the reference. It is expected that the difference between the codecs may be small and this technique allows direct comparisons. The tests involve one full bandwidth reference signal and at least one low pass filtered (say 3.5kHz) version of the unprocessed signal as an anchor. The anchor signals are 'hidden'. The results are recorded on a continuous scale from excellent (100%) to bad (0%).

In designing data rate reduction systems, manufacturers and standards bodies have compared the degradation of processed signals with the originals. In many instances, particularly at high data rates, the degradation may be 'only just perceptible'. The aim in testing these systems is to achieve good performance in an A-B test.

An additional problem concerns the choice of test material. What audio is demanding on these systems and how do you test to identify if they are likely to fail in service? Typical audio used in broadcasting covers a large range of styles, languages and so forth. Many forms of new or experimental music may put unusual demands on the rate reduction systems.

Codecs working at low data rates may well give significantly different scores depending on the choice of programme material. Thus the choice of coding method may not be straightforward and may in fact depend on the source material.

Disadvantages of Audio Data Rate Reduction Systems

All the processes described are 'lossy' and will inevitably introduce some distortions or noise into the audio. In many applications this may not be a problem as the level of such effects may be below the threshold of audibility.

However, although these effects may not be audible, cascading (even the best) codecs will introduce significant degradation to the signal. The principal reason for this sudden reduction in quality is that all the processes are designed to work with 'normal' audio. Once the signal has been processed, the characteristics of the audio will have changed. In many cases the noise floor will be considerably above the normal value and no longer evenly spread across the audio band; the noise may appear as a 'signal' in subsequent processing. This will in turn reduce the data available for the 'wanted' audio signals and therefore reduce the quality.


© 2022 Mr Singh