
114. Audio Encoding: PCM and AAC


Compressing audio data is essentially about drastically reducing the amount of data while keeping the sound quality high. Here are a few key points:


First, raw audio files, such as uncompressed WAV files, contain a lot of data. They hold sound-wave samples taken 44,100 times per second, each represented by 16 bits, in two channels (stereo). Imagine trying to catch tens of thousands of ping-pong balls every second; it's overwhelming!


Compression, then, is about finding methods that significantly reduce the data volume while leaving the perceived sound quality largely unaffected. Here are some of the techniques used:


  1. Entropy coding: This method reduces file size by exploiting the uneven frequency of symbols (audio sample values) in the data. Huffman coding is the classic example: common symbols are encoded with shorter codes and rare symbols with longer ones, reducing the average code length.

  2. Differential coding: This method relies on the correlation between audio samples. If a note is sustained, for example, many samples during that period are very similar. Instead of storing each sample separately, differential coding stores the difference between a sample and the previous one. Since the differences are usually small, less storage space is required (a small sketch after this list illustrates this together with entropy coding).

  3. Perceptual compression: The human ear is not sensitive to some details in sound, like quiet sounds next to loud ones. Audio coding formats, like MP3, take advantage of this by removing audio information that human ears are likely not to hear. This does lose some information but has little effect on the listening experience.
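
To make the first two ideas concrete, here is a minimal sketch in Python (the helper names are my own). It delta-encodes a block of 16-bit samples and then measures how unevenly the resulting values are distributed, which is exactly the property an entropy coder such as Huffman coding exploits:

```python
from collections import Counter
import math

def delta_encode(samples):
    """Differential coding: keep the first sample, then store successive differences."""
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

def empirical_entropy(values):
    """Bits per symbol that an ideal entropy coder (e.g. Huffman) could approach."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A slowly varying "sustained note": adjacent samples differ only slightly.
samples = [int(1000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(2048)]
diffs = delta_encode(samples)

print(f"entropy of raw samples : {empirical_entropy(samples):.2f} bits/sample")
print(f"entropy of differences : {empirical_entropy(diffs):.2f} bits/sample")
```

The differences cluster around a few small values, so they need noticeably fewer bits per sample than the raw amplitudes.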


To effectively compress audio data, the key is to identify and reduce redundant information. Audio signal redundancy mainly appears in two aspects: time domain and frequency domain.


Understanding Time Domain Redundancy:


  • Non-uniform amplitude distribution: Sample amplitudes are not evenly distributed; most samples, especially in speech, sit at low levels. This skew toward small amplitudes is one thing that makes compression possible.

  • Correlation between sample values: A sample is often highly correlated with the one before it. For example, at a sampling rate of 8 kHz, the correlation between adjacent samples can exceed 0.85. This high correlation is what lets differential coding compress data effectively (the sketch after this list shows how such a correlation can be measured).

  • Correlation between signal periods: At any given moment, only a few frequencies in the band are active in producing a sound, providing another opportunity for compression. Although using this periodic correlation is complex, it results in better compression.

  • Silence coefficient: For instance, in a phone call, each person speaks about half the time, and there are pauses in speech. These pauses represent a natural data redundancy, which can be exploited for effective compression.

  • Long-term auto-correlation: Long-term observations show that there is a persistent correlation between audio samples, offering another avenue for data compression.
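
As a hedged illustration of the "correlation between sample values" point, the sketch below estimates the lag-1 correlation coefficient of a signal. For real speech sampled at 8 kHz this typically comes out high (the 0.85 figure above); the signal here is synthetic, so the exact number is only illustrative:

```python
import math

def lag1_correlation(x):
    """Normalized correlation between each sample and the one before it."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# A low-frequency tone sampled at 8 kHz as a stand-in for voiced speech.
fs = 8000
x = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
print(f"lag-1 correlation: {lag1_correlation(x):.3f}")  # close to 1 for smooth signals
```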


Understanding Frequency Domain Redundancy:


  • Non-uniform long-term power spectral density: The long-term power spectral density of audio signals is uneven, indicating that some frequency bands are underutilized, i.e., redundant. These areas are mainly in the high-frequency range where energy is generally lower.

  • Short-term power spectral density of speech: The short-term power spectral density of speech signals shows clear peaks at specific frequencies, known as formant frequencies, which define the unique characteristics of speech. Additionally, the entire power spectrum decreases with increasing frequency and forms a complex harmonic structure based on the fundamental frequency.


To effectively compress audio data, we need to consider the way humans hear and use these characteristics to reduce redundant parts of the audio signal. Here are some key concepts based on auditory masking effects that help us remove certain components of the signal without affecting the listening experience:


  • Masked Signal Components: In audio signals, parts that would be masked by other signals and wouldn't be heard by the human ear can be removed. This means we don’t waste bandwidth on signals that wouldn’t be heard anyway.

  • Masking of Quantization Noise: During the process of converting sound into digital data (quantization), some noise is generated. If this noise is masked by other signals, we can ignore it.

  • Frequency Filtering: Since human ears are insensitive to certain frequencies, we can filter them out during digitization. In voice signals, for example, we usually keep only the 300-3400 Hz band, which is enough for clear, recognizable speech over telephone links (a filtering sketch follows this list).
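
As an illustration of the frequency-filtering point, here is a hedged sketch (assuming NumPy and SciPy are available) that keeps roughly the 300-3400 Hz telephony band and discards everything else:

```python
import numpy as np
from scipy.signal import butter, lfilter

def telephone_bandpass(x, fs, low=300.0, high=3400.0, order=4):
    """Band-pass filter that keeps roughly the 300-3400 Hz telephony band."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return lfilter(b, a, x)

# Example: a 100 Hz hum plus a 1 kHz tone; only the 1 kHz tone survives filtering.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
y = telephone_bandpass(x, fs)
```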


Types of Masking Effects:


  • Absolute Threshold of Hearing: Sounds below a certain energy threshold at specific frequencies cannot be heard by the human ear.

  • Frequency Masking: When a loud sound at one frequency is present, it raises the threshold at which nearby frequencies can be heard, making these quieter sounds inaudible.

  • Temporal Masking: This includes pre-masking, simultaneous masking, and post-masking. Pre-masking means a loud sound masks quieter sounds that occur just before its onset. Simultaneous masking happens when a loud sound masks quieter sounds playing at the same time. Post-masking means that after a loud sound stops, quieter sounds that follow remain inaudible for a short time.




Common Audio Encoding Formats:


  • PCM (Pulse Code Modulation): A digital audio encoding method that stores every sample directly, without compression, and reproduces the original analog signal with high fidelity.

  • WAV: This format is also uncompressed and is based on PCM but includes additional information like sample rate and channel number at the start of the file for better handling and compatibility.

  • MP3: This is a lossy compression format that significantly reduces file size while maintaining a relatively high sound quality, especially at bit rates above 128 Kbps, and is widely used.

  • AAC (Advanced Audio Coding): This format performs well even at bit rates below 128 Kbps and is commonly used as the audio track in video files.

  • OPUS: Designed for low-latency, real-time internet audio such as voice chat; it provides excellent sound quality at very low bit rates, though device and player compatibility is narrower than for MP3 or AAC.


By choosing and using these technologies and formats wisely, we can compress audio data effectively, saving storage space and bandwidth while ensuring sound quality.


Understanding PCM Encoding


PCM stands for Pulse Code Modulation, a common digital communication technique used to convert analog signals into digital signals. This method is crucial for audio communications such as in phone networks.


The basic process of PCM encoding involves three key steps:


  • Sampling: This is the first step in audio processing where the analog signal is sampled at set intervals (known as the sample rate). These samples convert the continuous signal into a series of discrete values.

  • Quantization: Each sample point is then converted into a digital value, a process called quantization. Essentially, this step represents each sample's amplitude in digital form, typically rounding it to the nearest quantization level.

  • Encoding: Finally, these quantized values are converted into binary codes that can be stored or transmitted.



Quality and Bit Rate Calculation in PCM


PCM is widely used for archiving audio material and enjoying high-quality music thanks to its high fidelity. Although PCM cannot guarantee a perfect replica of the original signal (quantization introduces small errors), it comes very close to the original sound. The bit rate of a PCM audio stream can be calculated with the formula: `Bit rate = Sampling rate × Bit depth × Number of channels`. This gives a direct estimate of the data rate of the audio stream.
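
A quick worked example of the formula with CD-style parameters:

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Bit rate = sampling rate x bit depth x number of channels."""
    return sample_rate_hz * bit_depth * channels

bps = pcm_bit_rate(44_100, 16, 2)   # 1,411,200 bits per second
print(f"{bps} bps, about {bps * 60 / 8 / 1024 / 1024:.1f} MiB per minute of audio")
```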


1. Storing PCM Data


When handling PCM data, there are two formats for storing data from different channels:

  • Interleaved format: Data from different channels are alternated in a single data stream.

  • Planar format: Data from the same channel are grouped together in the data stream.



Here is an example:
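
Since the two layouts are easy to show in code, here is a small sketch for a stereo signal (the sample values are just placeholders):

```python
left  = [10, 11, 12]   # left-channel samples
right = [20, 21, 22]   # right-channel samples

# Interleaved: L0 R0 L1 R1 L2 R2 ...
interleaved = [s for pair in zip(left, right) for s in pair]
assert interleaved == [10, 20, 11, 21, 12, 22]

# Planar: L0 L1 L2 ... R0 R1 R2 ...
planar = left + right
assert planar == [10, 11, 12, 20, 21, 22]

# De-interleaving with strided slices recovers the separate channels.
assert interleaved[0::2] == left and interleaved[1::2] == right
```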



Additionally, it's important to consider the byte order, which affects how binary data is interpreted and stored across different computer architectures.
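
For example, the same 16-bit sample is laid out differently in little-endian and big-endian byte order; Python's struct module makes the difference visible (WAV stores PCM little-endian, AIFF big-endian):

```python
import struct

sample = -12345                   # one signed 16-bit PCM sample
le = struct.pack("<h", sample)    # little-endian layout
be = struct.pack(">h", sample)    # big-endian layout
print(le.hex(), be.hex())         # 'c7cf' vs 'cfc7'
```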


Application of Audio Encoding


Although PCM provides lossless audio quality, its high data requirement often necessitates compression of audio data to save space and reduce transmission costs. Formats like MP3, AAC, and OPUS are popular audio encoding formats that use lossy compression techniques to reduce the data size while maintaining audio quality as much as possible.


In summary, PCM encoding is a fundamental audio technology that offers a reliable way to digitize audio signals, ensuring high fidelity during transmission. Through further compression and encoding, these data can be effectively managed and transmitted.


2. AAC Encoding Detailed Explanation


2.1 Background Introduction


AAC, which stands for Advanced Audio Coding, is a lossy digital audio compression method. Developed by organizations including Fraunhofer IIS, Dolby Laboratories, AT&T, and Sony, it was first introduced in 1997 as part of the MPEG-2 standard. Starting around 2000, with the introduction of the MPEG-4 standard, AAC was extended with technologies such as Long Term Prediction (LTP), Perceptual Noise Substitution (PNS), Spectral Band Replication (SBR), and Parametric Stereo (PS).


AAC was designed to succeed MP3, incorporating various new technologies to improve audio compression efficiency. It supports a wide range of sampling rates from 8 kHz to 96 kHz and various channel configurations, providing better sound quality at the same bit rate compared to MP3.


2.2 Encoding Tools and Process


AAC employs a perceptual audio coding approach, focusing on exploiting the masking effect of human hearing to optimize audio data encoding. This strategy not only removes information that will be masked but also controls quantization noise to make it imperceptible to the ear.


Key steps in the encoding process include:


  • Frequency Domain Transformation: The time-domain signal is first processed through a filter bank and transformed into the frequency domain using the Modified Discrete Cosine Transform (MDCT); a minimal MDCT sketch appears after this list.

  • Psychoacoustic Model Analysis: This model helps determine important parameters like the signal-to-masking ratio and masking thresholds, guiding stereo encoding and other processing steps.

  • Stereo Processing: This includes M/S (Mid/Side) stereo coding and intensity stereo coding, which help reduce the number of bits needed.

  • Temporal Noise Shaping (TNS): This module shapes quantization noise so that it follows the temporal envelope of the signal, which reduces pre-echo artifacts and improves audio quality.

  • Quantization and Coding: A two-loop quantization process is used for bit allocation, ensuring that quantization noise stays below the masking threshold.

  • Huffman Coding: Huffman coding, optimized through an improved codebook, is a crucial step in generating the AAC stream.
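
To make the frequency-domain transformation step more tangible, here is a naive MDCT written directly from its textbook definition. Real encoders use windowed, overlapped, fast implementations; this sketch only shows the transform itself, on a shorter block than the 2048-sample long block AAC actually uses:

```python
import math

def mdct(block):
    """Naive MDCT: 2N time-domain samples -> N spectral coefficients."""
    two_n = len(block)
    n = two_n // 2
    coeffs = []
    for k in range(n):
        acc = 0.0
        for i, x in enumerate(block):
            acc += x * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
        coeffs.append(acc)
    return coeffs

# A 256-sample block of a pure tone, kept small so the O(N^2) loop stays fast.
block = [math.sin(2 * math.pi * 1000 * t / 44100) for t in range(256)]
spectrum = mdct(block)   # 128 coefficients; most energy sits in the bins around 1 kHz
```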


The AAC encoding system comprises several efficient tools such as gain control, filter banks, psychoacoustic models, quantization and coding, prediction, and stereo processing. These tools combine to form the fundamental encoding and decoding flow of AAC.


In practical applications, not all modules of AAC are necessary. The following table shows the optional nature of each module in MPEG-2 AAC, allowing users to optimize encoding settings based on specific needs.



1) Bitstream Formatter - During decoding, this module demultiplexes the AAC bitstream into its constituent parts and delivers the relevant data to each tool module. Its outputs include:


  • Section information of the noiseless encoded spectrum

  • The noiseless encoded spectrum itself

  • Mid/Side decision information

  • Predictor state information

  • Intensity stereo control information and coupled channel control information

  • Temporal Noise Shaping (TNS) information

  • Filter bank control information

  • Gain control information


2) Noiseless Decoding - This is the Huffman coding module, which further reduces redundancy by encoding the scale factors and the quantized spectrum. During decoding, it takes the data stream from the Bitstream Formatter, decodes the Huffman-coded data, and reconstructs the quantized spectrum together with the Huffman- and DPCM-coded scale factors. Inputs and outputs of this module include:

  • Inputs: Section information and the noiseless encoded spectrum.

  • Outputs: Decoded integers of scale factors and the quantized spectrum.


3) Inverse Quantization - In AAC encoding, the spectral coefficients are quantized with a non-uniform quantizer, so this step must be reversed during decoding. The module converts the quantized spectral values back into integer values representing the unscaled reconstructed spectrum. Because the quantizer is non-uniform, careful control of the quantization analysis lets the bitrate be used more efficiently. The primary way to shape quantization noise in the frequency domain is through scale factors, which change the amplitude gain of all spectral coefficients within a scale factor band. This module's inputs and outputs are:

  • Inputs: The quantized values of the spectrum.

  • Outputs: The unscaled, inverse quantized spectrum.


4) Rescaling - During decoding, this module converts the integer representations of the scale factors into actual gain values and multiplies the unscaled inverse quantized spectrum by them (a small sketch of steps 3 and 4 follows the inputs and outputs below). This module's inputs and outputs are:

  • Inputs: The decoded integers of scale factors and the unscaled, inverse quantized spectrum.

  • Outputs: The rescaled inverse quantized spectrum.
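
As a hedged sketch of steps 3 and 4 together, the code below applies the non-uniform inverse quantizer (the 4/3 power) and the scale-factor gain in the form they take in ISO/IEC 13818-7; the constant and function names are my own:

```python
def inverse_quantize(band):
    """Non-uniform inverse quantizer: x = sign(q) * |q|^(4/3)."""
    return [(1 if q >= 0 else -1) * abs(q) ** (4.0 / 3.0) for q in band]

SF_OFFSET = 100  # offset used in the scale-factor gain formula

def rescale(unscaled, scale_factor):
    """Apply the gain 2^(0.25 * (sf - SF_OFFSET)) to one scale factor band."""
    gain = 2.0 ** (0.25 * (scale_factor - SF_OFFSET))
    return [v * gain for v in unscaled]

quantized_band = [3, -2, 0, 1]     # quantized spectral values of one band
spectrum = rescale(inverse_quantize(quantized_band), scale_factor=112)
```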


5) M/S (Mid/Side) Stereo Coding - This is a form of joint stereo coding that exploits the information the two channels have in common. Based on the Mid/Side decision information, this module converts spectrum pairs from Mid/Side back to Left/Right representation, improving encoding efficiency. It is generally used when the left and right channels are very similar: the encoder forms a sum (L+R) and a difference (L-R) track from the two channels, and both tracks are then processed with the psychoacoustic model and the filter bank (a small decode sketch follows the inputs and outputs below). Inputs and outputs of this module are:

  • Inputs: Mid/Side decision information and the related rescaled inverse quantized spectrum of the channels.

  • Outputs: The rescaled inverse quantized spectrum related to the channel pairs after M/S decoding.
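
A minimal sketch of the M/S reconstruction, using the common convention that the encoder forms M = (L + R) / 2 and S = (L - R) / 2 per spectral coefficient (only in bands flagged as M/S-coded), so the decoder recovers L = M + S and R = M - S:

```python
def ms_encode(left, right):
    """Mid = average of the two channels, Side = half their difference."""
    mid  = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover the Left/Right spectrum from Mid/Side."""
    left  = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Values chosen to stay exact in floating point; note how small the Side track is.
L, R = [1.0, 0.5, 0.75], [0.5, 0.5, 0.25]
M, S = ms_encode(L, R)
assert ms_decode(M, S) == (L, R)
```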


6) Prediction - During decoding, this module reinserts redundancy information extracted during encoding under the control of predictor state information. This module is implemented as a second-order backward adaptive predictor. Predicting the audio signal reduces the processing of repetitive redundant signals, improving efficiency. Inputs and outputs of this module are:

  • Inputs: Predictor state information and the rescaled inverse quantized spectrum.

  • Outputs: The predicted rescaled inverse quantized spectrum.


The enhancements made to MPEG-4 AAC after MPEG-2 AAC include additional modules that improve both encoding efficiency and sound quality, especially at lower bitrates. Here are the details of these added modules:


1) LTP (Long Term Prediction): This module aims to reduce redundancy between consecutive audio frames. It's particularly effective for low bitrate audio such as speech, by predicting the signals in upcoming frames to enhance compression efficiency.


2) PNS (Perceptual Noise Substitution): When the encoder detects noise-like signal components, it skips traditional quantization for these parts and simply marks them to be restored during decoding. This method improves efficiency by analyzing the tone and energy changes of the signal, especially for components below 4 kHz frequency.


3) SBR (Spectral Band Replication): The main energy of music signals is typically concentrated in the lower frequency range, while the high frequencies carry less energy but still matter for perceived quality. SBR therefore encodes the low band in full and lets the decoder reconstruct the high band from it using a small amount of side information, which improves coding efficiency without sacrificing sound quality.


4) PS (Parametric Stereo): Traditional stereo encoding typically requires twice the data. Parametric Stereo technology stores complete information for one channel and uses minimal data to describe the differences of the other channel, significantly improving the efficiency of stereo audio encoding.


These technologies have significantly enhanced the performance of AAC encoding in various scenarios, both in terms of sound quality and compression efficiency. The ISO/IEC 13818-7 standard illustrates the encoding and decoding processes of MPEG-2 AAC with diagrams, providing a clear visual guide to understanding how the AAC coding system works.



The decoding process for MPEG-2 AAC is illustrated in the diagram below:



2.3 Encoding Specifications Overview


To meet diverse application needs, the MPEG-2 AAC standard defines three core encoding specifications, each tailored for different situations and performance requirements:


  1. MPEG-2 AAC LC (Low Complexity): This specification is mainly suitable for environments with limited resources, such as restricted storage space and computing power. It does not support predictive and gain control tools, and the use of TNS (Temporal Noise Shaping) is relatively limited. This spec is typically used for encoding rates between 96kbps and 192kbps, commonly seen in the audio portion of MP4 files.

  2. MPEG-2 AAC Main: The main specification has the highest complexity and is suitable for situations where there is ample storage and processing capability. It utilizes almost all available encoding tools (except gain control) to achieve the highest compression efficiency.

  3. MPEG-2 AAC SSR (Scalable Sampling Rate): This specification allows for variable sample rates and uses the gain control tool, but does not permit the prediction and coupling tools. It is particularly suitable for environments with fluctuating network bandwidth, since complexity can be adjusted as bandwidth changes.


In terms of technical implementation, the Main and LC specifications use MDCT (Modified Discrete Cosine Transform) as the time/frequency analysis tool, while the SSR specification uses a hybrid filter bank, initially dividing the signal into four sub-bands, then performing MDCT transformations. These three approaches balance between encoding quality and algorithm complexity by selecting different modules.


Extended Encoding Specifications in MPEG-4 AAC


The MPEG-4 AAC standard not only inherits and improves the three specifications mentioned above but also introduces additional encoding specifications to cater to a broader range of applications:


  1. MPEG-4 AAC LC (Low Complexity): Maintains the low complexity specification for most standard applications.

  2. MPEG-4 AAC Main: Continues the high complexity of the main specification.

  3. MPEG-4 AAC SSR (Scalable Sampling Rate): Continues to provide variable sample rate support.

  4. MPEG-4 AAC LD (Low Delay): A low delay specification designed specifically for real-time two-way communication, ensuring no more than 20ms of encoding delay, suitable for a wide range of signals including voice and music.

  5. MPEG-4 AAC LTP (Long Term Prediction): Adds forward prediction functionality to improve compression efficiency.

  6. MPEG-4 AAC HE (High Efficiency): A high-efficiency specification that combines basic AAC encoding with SBR technology, and in the latest version HE v2, includes Parametric Stereo (PS) technology. This spec is recommended for low bitrate applications ranging from 32-96 Kbps, offering high-quality audio output.


2.4 AAC Format


2.4.1 Audio Object Types


The MPEG-4 standard includes various versions of AAC, such as AAC-LC, HE-AAC, AAC-LD, and others mentioned earlier. The standard defines encoding and decoding tool modules, audio object types, and profiles to specify the encoder. Among these, the audio object types are the primary means of marking the encoder. Below is an illustration of the common MPEG-4 audio object types:



2.4.2 Audio Specific Config


When transmitting and storing MPEG-4 audio, the audio object types and the basic information of the audio (such as sample rate, bit depth, channels) must be encoded. This information is usually specified in the AudioSpecificConfig data structure.


The information in AudioSpecificConfig lets the decoder learn these details out of band, before any AAC bitstream has been received.


This is useful during the setup phase of codec negotiation, for example when establishing a session with SIP (Session Initiation Protocol) and describing it with SDP (Session Description Protocol). MPEG-2 does not define AudioSpecificConfig; its ADIF and ADTS formats carry this information in their own headers and use a fixed frame length of 1024 samples.


The structure of AudioSpecificConfig is as follows:



It includes two parts:


  • General Section: Contains common fields used by most MPEG-4 audio specifications.

  • Specific Section: Contains characteristic fields for different audio object types.


For example, for AAC-LC, HE-AAC, and AAC-LD, the specific section includes the GASpecificConfig characteristics; for AAC-ELD, it includes the ELDSpecificConfig characteristics; and for xHE-AAC (USAC), it includes the UsacConfig characteristics.
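
As a hedged sketch, the general section of a simple (non-escaped) AudioSpecificConfig fits in two bytes: 5 bits of audioObjectType, 4 bits of samplingFrequencyIndex, and 4 bits of channelConfiguration. The parser below handles only this simple case and ignores the escape values and the object-type-specific section:

```python
SAMPLING_FREQUENCIES = [96000, 88200, 64000, 48000, 44100, 32000,
                        24000, 22050, 16000, 12000, 11025, 8000, 7350]

def parse_audio_specific_config(data: bytes):
    """Parse the general section of AudioSpecificConfig (no escape values)."""
    bits = int.from_bytes(data[:2], "big")
    audio_object_type = bits >> 11           # 5 bits
    freq_index        = (bits >> 7) & 0x0F   # 4 bits
    channel_config    = (bits >> 3) & 0x0F   # 4 bits
    return audio_object_type, SAMPLING_FREQUENCIES[freq_index], channel_config

# 0x12 0x10 is the familiar "AAC-LC, 44.1 kHz, stereo" configuration.
print(parse_audio_specific_config(bytes([0x12, 0x10])))   # (2, 44100, 2)
```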


2.4.3 Data Format Overview


In the MPEG-4 system, the transfer and storage of audio data primarily use Raw Data Blocks or Access Units. These units contain the actual audio encoded bitstreams. These bitstreams are flexibly divided to represent different audio channels. Once these data are obtained, the next critical step is to further analyze their format.


For MPEG-2 AAC, there are two main audio data formats:


1) ADIF (Audio Data Interchange Format): ADIF has a single header at the very beginning of the stream, so decoding must start from that clearly defined starting point. This makes the format suitable for file storage, where the data is read from the beginning, but not for streaming.


2) ADTS (Audio Data Transport Stream): This format places a sync word in front of every frame, so the decoder can start decoding from any point in the stream. That makes it well suited to streaming media: even if some data are lost in transmission, decoding resumes at the next sync word (a header-parsing sketch follows below).
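
To make the ADTS description concrete, here is a hedged sketch that parses the 7-byte ADTS header (the bit layout follows ISO/IEC 13818-7; the field and key names are my own):

```python
ADTS_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                     24000, 22050, 16000, 12000, 11025, 8000, 7350]

def parse_adts_header(data: bytes):
    """Parse one ADTS header (7 bytes, without CRC) at the start of `data`."""
    if len(data) < 7 or data[0] != 0xFF or (data[1] & 0xF0) != 0xF0:
        raise ValueError("no ADTS syncword (0xFFF) at this position")
    protection_absent = data[1] & 0x01
    profile           = (data[2] >> 6) & 0x03                      # 0=Main, 1=LC, 2=SSR
    freq_index        = (data[2] >> 2) & 0x0F
    channel_config    = ((data[2] & 0x01) << 2) | (data[3] >> 6)
    frame_length      = ((data[3] & 0x03) << 11) | (data[4] << 3) | (data[5] >> 5)
    return {
        "profile": profile,
        "sample_rate": ADTS_SAMPLE_RATES[freq_index],
        "channels": channel_config,
        "frame_length": frame_length,      # header plus AAC payload, in bytes
        "has_crc": protection_absent == 0,
    }

# A hand-built header: AAC-LC, 44.1 kHz, stereo, 23-byte frame, no CRC.
print(parse_adts_header(bytes.fromhex("fff1508002fffc")))
```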


In MPEG-4 AAC, additional audio encoding data formats have been added for new variants like AAC-LD and AAC-ELD:


1) LATM (Low-overhead MPEG-4 Audio Transport Multiplex): This is a low-overhead audio transport multiplex format that includes an independent bitstream and supports error recovery syntax in MPEG-4, making it more reliable for transmission over networks.


2) LOAS (Low Overhead Audio Stream): This format wraps LATM with synchronization information, supporting random access and data skipping. By embedding sync and structural information in the audio stream, LOAS improves robustness against interruptions and adds flexibility.


The following chart provides a key feature comparison of various AAC transmission formats, helping users and developers better understand the application scenarios and advantages of each format:



ADIF format structure is generally as follows:




The structure of the ADTS format is based on the ADTS Frame. Its structure is generally as follows:



The structure of the LOAS format is generally as follows:



 
