In our previous articles, we raised a thought-provoking question: What mysterious transformation occurs when the 'sound' we hear with our ears becomes 'audio data' processed by our phones and computers? Based on this query, we have already ventured into topics such as 'What is the essence of sound?', 'What are the significant characteristics of sound?', and 'How can sound be described using mathematical language?' in our articles "The Representation of Sound (1)" and "The Representation of Sound (2)". Now, we continue our journey of exploration, delving into two more questions: 'How can sound be converted into digital form?' and 'What is the nature of digital audio data?'.
The Journey of Digitalizing Sound
To digitalize sound, we first need to capture it using special devices, like microphones, which are our familiar collectors of sound. Inside a microphone lies an extremely thin and sensitive layer of carbon film. Sound, being a longitudinal wave, compresses not only the air but also this carbon film. As the carbon film vibrates under pressure, it touches an electrode beneath it. The duration and frequency of this contact correlate with the amplitude and frequency of the sound wave, thus transforming the sound signal into an electrical signal. After being processed by an amplification circuit, the sound is ready for sampling and quantization.
Our earlier discussions on the mathematical description of sound's three main elements form the foundation of sound digitalization.
Sound consists of waveforms that include the superposition of waves of various frequencies and amplitudes. To represent these waveforms in the digital realm, we must sample them, ensuring the sampling rate is sufficient to capture the highest frequency of the sound. Also, we need enough bit depth to accurately record the amplitude of the waveforms.
The ability of sound processing equipment to reconstruct frequencies is known as frequency response, and its capability to create suitable loudness and softness is called its dynamic range. These terms are generally referred to as the fidelity of the sound equipment. The simplest encoding methods use these two basic elements to reconstruct sound, while also efficiently storing and transmitting data.
The process of digitalizing sound involves converting an analog signal (a continuous-time signal) into a digital signal (a discrete-time signal), and it includes three key steps (sketched in code after the list):
Sampling: Capturing discrete signals at a certain sampling rate within the time domain.
Quantization: Digitally representing the amplitude of each sampled point.
Encoding: Storing data in a specific format.
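As a rough illustration, here is a minimal Python sketch of these three steps (using numpy, with a 440 Hz sine standing in for the analog signal; the variable names and parameters are my own and purely illustrative):

```python
import numpy as np

# --- Sampling: take discrete readings of a "continuous" 440 Hz tone ---
sample_rate = 44100                                 # samples per second
duration = 1.0                                      # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate
analog_like = np.sin(2 * np.pi * 440 * t)           # amplitudes in [-1.0, 1.0]

# --- Quantization: map each amplitude onto one of 2^16 integer levels ---
quantized = np.round(analog_like * 32767).astype(np.int16)

# --- Encoding: lay the samples out in a concrete byte format
#     (here: raw little-endian 16-bit PCM, as found inside a WAV file) ---
encoded = quantized.tobytes()
print(len(encoded), "bytes for one second of mono 16-bit audio")   # 88200
```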
The digitally processed audio contains three key elements:
Sampling rate
Quantization bit depth
The number of audio channels
The Tale of Sampling Rates
In the world of digital audio, the first step in converting an analog signal into a digital one is sampling, guided by the Nyquist sampling theorem. This theorem states that if a signal is limited to a certain frequency range and sampled densely enough relative to its highest frequency, the samples uniquely represent the original signal and allow it to be reconstructed perfectly. To faithfully reproduce the analog signal, the sampling rate must be at least twice the highest frequency in the signal; in practice, a sampling rate of 2.56 to 4 times the highest frequency is typically used.
Digital signals, derived from sampling analog signals and adhering to the sampling theorem, can fully restore the original analog signal.
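To build intuition for why the rate must be at least twice the highest frequency, here is a small numpy experiment of my own (not from the original article): a 5 kHz tone sampled at only 8 kHz produces exactly the same sample values as a 3 kHz tone, so after sampling the two signals can no longer be told apart. This is aliasing.

```python
import numpy as np

fs = 8000                                    # sampling rate, too low for a 5 kHz tone
n = np.arange(64)                            # 64 sample instants
tone_5k = np.sin(2 * np.pi * 5000 * n / fs)
tone_3k = np.sin(2 * np.pi * 3000 * n / fs)

# The 5 kHz tone "folds down" to 8 kHz - 5 kHz = 3 kHz (with inverted phase):
print(np.allclose(tone_5k, -tone_3k))        # True: the samples cannot tell them apart
```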
From a sound production perspective, most human-generated audio frequencies fall within 5 kHz, so a sampling rate of 10 kHz is sufficient.
From a hearing perspective, the human auditory range spans 20 Hz to 20 kHz, so the sampling rate for digital audio should be above 40 kHz.
The use of a 44100 Hz sampling rate for CD audio is partly due to this. The specific choice of 44100 Hz is historical: early digital recordings were made on video recorders fitted with a PCM encoder. Under the PAL video standard (a 50 Hz field frequency with 294 usable scan lines per field), with 3 audio samples recorded per scan line, multiplying these numbers gives 50 × 294 × 3 = 44100.
Common sampling rates we encounter in daily life include:
8000 Hz: Used for telephones, adequate for speech.
11025 Hz: Used for AM radio broadcasting.
22050 Hz and 24000 Hz: Used for FM radio broadcasting.
32000 Hz: Used for miniDV digital video cameras, DAT (LP mode).
44100 Hz: Used for audio CDs and MPEG-1 audio (VCD/SVCD/MP3).
47250 Hz: Used for commercial PCM recorders.
48000 Hz: Used for miniDV, digital TV, DVD, DAT, movies, and professional audio.
50000 Hz: Used for commercial digital recorders.
96000 or 192000 Hz: Used for DVD-Audio, some LPCM DVD tracks, BD-ROM (Blu-ray) audio tracks, and HD-DVD audio tracks.
2.8224 MHz: Used for the sigma-delta modulation process in Direct Stream Digital.
The Melody of Quantization Bit Depth
In the world of digital audio, quantization bit depth digitizes the amplitude axis of the analog signal and determines the dynamic range after digitization. For example, an 8-bit depth yields a dynamic range of about 48 decibels, 16-bit about 96 decibels, 24-bit about 144 decibels, and 32-bit an impressive 192 decibels. The relationship between bit depth and dynamic range can be derived from the sound pressure level formula we discussed earlier. Bit depth determines the range of values that can be represented: 16-bit depth, for instance, can represent a maximum value of 2^16 − 1 = 65535, so its maximum sound pressure level is 20 × lg(65535) = 96.33 decibels. In other words, 16-bit depth can represent sounds spanning up to about 96 decibels.
This relationship is expressed by the following formula: Dynamic Range (dB) = 20 × lg(2^n − 1) ≈ 6.02 × n, where n is the quantization bit depth.
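A quick way to check these numbers, following the formula above (a sketch of my own):

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of n-bit quantization, using 20 * lg(2^n - 1)."""
    return 20 * math.log10(2 ** bits - 1)

for bits in (8, 16, 24, 32):
    print(f"{bits:2d}-bit -> {dynamic_range_db(bits):6.2f} dB")
# 8-bit -> 48.13 dB, 16-bit -> 96.33 dB, 24-bit -> 144.49 dB, 32-bit -> 192.66 dB
```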
Human ears have an auditory dynamic range of about 140 decibels, spanning from the faint sound of a needle dropping to the roar of a jet engine. When the sound pressure level reaches 120 decibels, it becomes painful and unbearable for our ears, so the comfortable range for human hearing is between 0 to 120 decibels. In a concert hall, listening to a grand symphony, the loudest parts can reach up to 115 decibels while the softest are around 25 decibels, giving a dynamic range of about 90 decibels. However, this is quite rare. Usually, the dynamic range of symphonic music is about 50 to 80 decibels, while for smaller music pieces it's around 40 decibels, and for spoken language, it's about 30 decibels.
Regarding audio storage and processing, CD music commonly uses a 16-bit depth, DVD audio uses a 24-bit depth, and most telephone equipment uses an 8-bit depth. These choices reflect the different needs and capabilities for capturing details in various audio formats.
To avoid loss of precision in the sound signal during processing, high-end audio systems currently use 32-bit floating-point sampling for calculations, ensuring refined and nuanced sound quality. However, for output, it is usually converted to 16-bit to meet the standards of playback devices and the auditory experience of the final listeners. This conversion is like transforming a rich and colorful painting into a form more suitable for general viewing, preserving the essence of the art while adapting to a wider acceptance.
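A common way to perform this final conversion is to clip the floating-point samples to full scale and rescale them into the 16-bit integer range. The snippet below is a minimal sketch of that idea, with illustrative names of my own:

```python
import numpy as np

def float32_to_int16(samples: np.ndarray) -> np.ndarray:
    """Convert float samples in [-1.0, 1.0] to 16-bit integer PCM for output."""
    clipped = np.clip(samples, -1.0, 1.0)         # guard against values beyond full scale
    return (clipped * 32767.0).astype(np.int16)   # rescale to the 16-bit range

mix = np.float32([0.0, 0.5, -1.2, 1.0])           # -1.2 exceeds full scale and gets clipped
print(float32_to_int16(mix))                      # [     0  16383 -32767  32767]
```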
The Dimensions of Audio Channels
In the world of sound, audio channels are a way to tell spatial stories. They refer to independent audio signals collected or played back from different spatial positions during recording or playback. Thus, the number of channels also represents the number of sound sources during recording or the number of speakers during playback.
Mono (Monaural): This is the method of reproducing sound using a single channel. It involves using just one microphone and one speaker or headphone, or several speakers connected in parallel, but they all receive the same signal from the same signal path. In this parallel speaker setup, although there are multiple speakers, each one transmits the same melody.
Stereo (Stereophonic): This method uses two or more independent sound channels to reproduce sound on a pair of symmetrically positioned speakers. With this technique, the sound remains natural and pleasing to the ear, even from different directions.
5.1 Channels: This is a more complex channel system, consisting of a front center channel, left and right front channels, left and right surround channels, and a dedicated channel for reproducing ultra-low frequencies below 120 Hz. It was first used in cinema sound systems such as Dolby Digital (AC-3).
7.1 Channels: Building on the 5.1 channel format, this format further divides the left and right surround channels into left and right surround and left and right rear channels. It is primarily used in Blu-ray and modern cinema, offering an even more immersive and three-dimensional auditory experience to the audience.
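For reference, here is a small sketch of how these layouts are often described in code. The speaker labels and their ordering follow one common (WAV/SMPTE-style) convention; orderings differ between systems, so treat this as illustrative rather than authoritative:

```python
# Speaker labels per layout (FL/FR = front left/right, FC = center, LFE = subwoofer,
# SL/SR = side surround, BL/BR = back surround). Ordering conventions vary by system.
CHANNEL_LAYOUTS = {
    "mono":   ["FC"],
    "stereo": ["FL", "FR"],
    "5.1":    ["FL", "FR", "FC", "LFE", "SL", "SR"],
    "7.1":    ["FL", "FR", "FC", "LFE", "BL", "BR", "SL", "SR"],
}

for name, speakers in CHANNEL_LAYOUTS.items():
    print(f"{name}: {len(speakers)} channels ({', '.join(speakers)})")
```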
The Nature of Digital Audio Data
The sound data we process on our phones and computers is sound transformed by the magic of digitalization, known as digital audio data. Its most common form in the digital realm is PCM (Pulse Code Modulation). Creating PCM data mainly involves sampling an analog signal, such as a voice, at discrete points in time, then rounding and quantizing those sample values and representing them as a series of binary codes that describe the amplitude of each sample pulse. This is exactly the sampling, quantization, and encoding process we discussed earlier.
In the world of computers, PCM is like an art form that achieves the highest level of audio fidelity, widely used for preserving original materials and enjoying music. Therefore, PCM is also known as a lossless encoding format. But this doesn't mean PCM can perfectly guarantee the fidelity of the signal; it only gets as close as possible to the original sound. To calculate the bitrate of a PCM audio stream, we only need three elements:
Bitrate = Sampling Rate × Quantization Bit Depth × Number of Channels.
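For example, standard CD audio works out to 44100 Hz × 16 bits × 2 channels = 1,411,200 bit/s, roughly 1411 kbit/s (or 176,400 bytes per second) before any compression is applied.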
When handling PCM data, the data of different audio channels can have two magical storage formats:
Interleaved Format: Data from different channels are interwoven like steps in a dance.
Planar Format: Data from the same channel are grouped together, like a calm water surface.
Here's an example:
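With two channels (left = L, right = R) and four samples per channel, interleaved storage looks like L0 R0 L1 R1 L2 R2 L3 R3, while planar storage looks like L0 L1 L2 L3 R0 R1 R2 R3. Converting between the two is essentially a reshape; the numpy sketch below is my own illustration and not tied to any particular library:

```python
import numpy as np

# Interleaved stereo: L0 R0 L1 R1 ... (left channel holds 10..13, right holds 20..23)
interleaved = np.array([10, 20, 11, 21, 12, 22, 13, 23], dtype=np.int16)

# Interleaved -> planar: de-interleave into one row per channel
planar = interleaved.reshape(-1, 2).T        # [[10 11 12 13], [20 21 22 23]]

# Planar -> interleaved: weave the rows back together
restored = planar.T.reshape(-1)              # [10 20 11 21 12 22 13 23]
print(np.array_equal(restored, interleaved)) # True
```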
Furthermore, when processing PCM data, one should also pay attention to endianness, that is, the byte order of multi-byte samples.
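As a tiny illustration of my own, the same 16-bit sample value produces different byte sequences depending on the byte order:

```python
import struct

sample = 0x1234                          # one 16-bit sample value
print(struct.pack('<h', sample).hex())   # '3412' : little-endian, as used in WAV files
print(struct.pack('>h', sample).hex())   # '1234' : big-endian, as used e.g. in AIFF files
```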
Since PCM encoding is lossless and widely used, we usually consider it as the raw data format of audio. However, to save storage space and reduce transmission costs, we often compress the PCM data, which is the charm of audio encoding. For example, MP3, AAC, OPUS, and others are common audio encoding formats. We will unveil more mysteries of audio encoding in future topics.