Title: Digital Audio
1. Digital Audio
- Introducing perceptual encoders
- Using psychoacoustics
2. Principal features of the ear / brain response
- The ear is approximately logarithmic in its subjective response to increasing volume (the 3 dB fader in audio control)
- The response of the ear to a fixed audio spectral distribution (e.g. an audio recording) is subjectively different as the volume changes (the phon curves): bass becomes more pronounced as volume increases (loudness controls), and richness of tone increases above about 60 dB loudness level (i.e. harmonics introduced by the hearing system)
3. Principal features of the ear / brain response, part 2
- Masking occurs in both time and frequency
- In frequency, this modifies the effective threshold of hearing
- In time, this can modify the effective threshold of hearing before a loud signal arrives (it cuts off the build-up) and after the signal has stopped. The "before" deaf period can be a few ms; the "after" deaf period can be up to 200 ms
- Equivalently, the ear behaves differently for long- and short-duration bursts, i.e. when the burst duration is compared with a few hundred ms
4. Masking holds the key to psychoacoustic codes: it can significantly raise the effective threshold of hearing
MAF: minimum audible field (threshold of hearing)
5. Even with simple MAF curves
- Audible dynamic range is less at 50 Hz than at 5 kHz
- Maybe we could start, then, by splitting the audio spectrum into bands and quantising each one differently (forget masking at this stage)
6. Sub-band coding: digital audio sampled at 48 ksamples/sec
Bit rates:
- 16 x 48000 = 768 kbps (16-bit input)
- 16 x 3 x 48000 = 2304 kbps (split into 3 sub-bands)
- 16 x 3 x 16000 = 768 kbps (1/3 of the sample rate in each channel)
- 4 x 3 x 16000 = 192 kbps (reduce bits per sample)
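The bit-rate arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `bit_rate_kbps` and the variable names are made up for illustration, and the three-band split with 4 bits per sample is simply the example from the slide.

```python
def bit_rate_kbps(bits_per_sample, sample_rate_hz, bands=1):
    """Total bit rate in kbit/s for `bands` parallel sample streams."""
    return bits_per_sample * bands * sample_rate_hz / 1000


# 16-bit audio at 48 ksamples/sec, before any sub-band processing
full_band = bit_rate_kbps(16, 48_000)

# Split into 3 bands, each still running at the full sample rate
split_undecimated = bit_rate_kbps(16, 48_000, bands=3)

# Each band only needs 1/3 of the sample rate, so the total is unchanged
split_decimated = bit_rate_kbps(16, 16_000, bands=3)

# Requantise every band to 4 bits per sample
requantised = bit_rate_kbps(4, 16_000, bands=3)

print(full_band, split_undecimated, split_decimated, requantised)
```

Splitting into bands does not save anything by itself; the saving comes entirely from using fewer bits per sample in each band.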
7. Reducing the bits per sample: an example using only the standard MAF
8. Bits needed in different parts of the spectrum (still no masking)
[Figure: Sound Pressure Level (dB-SPL, -30 to 80) against Frequency (0 to 15000 Hz), showing the Peak Signal Level and the Threshold of Hearing. Twelve sub-bands are labelled with word lengths between 9 and 12 bits.]
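The word length needed in a band follows from the gap between the peak signal level and the threshold of hearing, since each extra bit adds about 6 dB of dynamic range. The sketch below assumes that rule; the function name `bits_needed` and the example levels are illustrative and are not read off the figure.

```python
import math

# Each bit doubles the number of levels: 20*log10(2), about 6.02 dB per bit
DB_PER_BIT = 20 * math.log10(2)


def bits_needed(peak_db_spl, threshold_db_spl):
    """Smallest word length spanning the range from threshold up to peak."""
    dynamic_range = peak_db_spl - threshold_db_spl
    if dynamic_range <= 0:
        return 0  # nothing in this band is audible, so no bits are needed
    return math.ceil(dynamic_range / DB_PER_BIT)


# Hypothetical bands: same peak level, different thresholds of hearing
print(bits_needed(70, 5))   # 65 dB of audible range
print(bits_needed(70, 20))  # 50 dB of audible range
```

A band where the threshold of hearing is high (for example at very low frequencies) needs fewer bits than one where the ear is most sensitive.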
9. Using masking: a psychoacoustic model
[Figure panels: Signal; Signal + Noise (SNR 24 dB); Noise]
- Signal suppresses the noise
- Raises the effective threshold of hearing to the masking threshold
- Establish a model of the new threshold of hearing and use this to determine the resolution needed in a particular frequency band: the essence of MPEG audio codes (but at present ignore temporal masking)
- Only use the dynamic range needed, and make this adaptive
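A minimal sketch of the adaptive idea: in each band the effective threshold is the higher of the threshold in quiet and the masking threshold, and only the range above it is coded. The function `allocate_bits` and all the levels below are hypothetical; in a real coder the masking thresholds would come from a psychoacoustic model, which is not attempted here.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of dynamic range per bit


def allocate_bits(peaks_db, quiet_db, masking_db):
    """Bits per band, given peak, quiet-threshold and masking levels in dB."""
    bits = []
    for peak, quiet, mask in zip(peaks_db, quiet_db, masking_db):
        effective_threshold = max(quiet, mask)
        needed_range = max(0.0, peak - effective_threshold)
        bits.append(math.ceil(needed_range / DB_PER_BIT))
    return bits


# Hypothetical three-band frame (levels in dB-SPL)
peaks = [70.0, 60.0, 40.0]
quiet = [20.0, 5.0, 10.0]
masking = [10.0, 36.0, 45.0]  # band 2 is completely masked

allocation = allocate_bits(peaks, quiet, masking)
print(allocation)  # [9, 4, 0]
```

In the second band the signal sits 24 dB above the masking threshold, so 4 bits are enough; the third band is entirely masked and gets no bits at all. Recomputing this for every frame is what makes the allocation adaptive.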
10. Single tone masking
11. Masking on a more complex signal
12. Other information needed in addition to the coded audio frames
- The code is prepared and processed in frames of a predetermined length
- Each frame contains mostly coded audio, but in addition:
- The peak level in each frequency sub-band
- The masking level in each sub-band
- The number of bits in each sample in each sub-band
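As a rough illustration of what such a frame carries, the side information can be pictured as one small record per sub-band sitting alongside the coded samples. The class and field names below are invented for this sketch and do not reflect the actual MPEG bitstream layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubBandInfo:
    """Side information for one sub-band (hypothetical field names)."""
    peak_level_db: float
    masking_level_db: float
    bits_per_sample: int


@dataclass
class Frame:
    side_info: List[SubBandInfo]  # one entry per sub-band
    samples: List[List[int]]      # samples[band][n], already quantised

    def audio_bits(self) -> int:
        """Bits spent on coded audio, excluding the side information."""
        return sum(
            info.bits_per_sample * len(band)
            for info, band in zip(self.side_info, self.samples)
        )


frame = Frame(
    side_info=[SubBandInfo(70.0, 46.0, 4), SubBandInfo(50.0, 40.0, 2)],
    samples=[[3, -2, 7], [1, 0, -1]],
)
print(frame.audio_bits())
```

The decoder needs the bits-per-sample field to split the bit stream back into samples, and the level information to restore each band to its original scale.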
13. What does the system look like? Encoding compressed audio
[Figure: ENCODER block diagram. Main path: Digital Audio In, Sub-band filter bank, Scale and Quantise, Multiplex and Data Format, Coded Audio Out. Side path: FFT, Psycho-acoustic model, Masking thresholds, Code additional Info.]
Everything is done in the digital domain: the analogue original has been digitised to high quality before entering the coder
14. What does the system look like? Decoding compressed audio
15. MPEG 1 digital audio standards
- Three perceptual coders in the MPEG 1 specification
- Layers 1, 2 & 3
- Layer 1 (.mp1)
- Similar to the simple coder just described
- 32 sub-bands are used
- Each frame contains 384 samples (32 x 12), lasting about 8 ms
- A version of layer 1 was used in the Digital Compact Cassette (DCC)
- Layer 2 (.mp2)
- Slightly more complex but better quality than layer 1
- Frame length increased to 1152 samples (32 x 36), lasting about 24 ms at 48 ksamples/sec
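The frame durations follow directly from the frame length and the sample rate. The short sketch below assumes the 48 ksamples/sec rate used earlier in these slides; the function name `frame_duration_ms` is illustrative.

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz=48_000):
    """Duration of one frame in milliseconds."""
    return 1000 * samples_per_frame / sample_rate_hz


layer1_ms = frame_duration_ms(32 * 12)  # 384 samples per frame
layer2_ms = frame_duration_ms(32 * 36)  # 1152 samples per frame

print(layer1_ms, layer2_ms)  # 8.0 24.0
```

At lower sample rates the same frame lengths last proportionally longer.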
16. MPEG 1 digital audio standards (continued)
- Layer 2 (continued)
- Data formatting of samples and side information is slightly more efficient
- Used in Digital Audio Broadcasting (DAB)
- Layer 3 (.mp3)
- Significantly more complex than layers 1 or 2
- Capable of reasonable quality even at very low data rates
- A combination of fixed sub-band coding and adaptive frequency transform coding is used to give up to 576 frequency bands (compared to 32 for layers 1 & 2)
- Uses signal statistics as well as the signal waveform for coding
- Huffman encoding is applied to samples (more on these to come)
- MP3 files are the most prevalent form of compressed audio (MP3 players etc.)
- Introduced in the late 1990s
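To give a feel for why Huffman encoding helps, the sketch below builds a Huffman code for a short run of quantised sample values in which small values dominate. It is a generic textbook construction, not the Layer 3 scheme itself: real Layer 3 coders use code tables defined in the standard rather than building a code per signal, and the sample values here are made up.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bit string} for a sequence of hashable symbols."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: partial code});
    # the unique tie_breaker stops Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)   # two least frequent subtrees
        w1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical quantised samples: zero is by far the most common value
samples = [0, 0, 0, 0, 0, 1, 1, -1, -1, 3]
codes = huffman_code(samples)
encoded = "".join(codes[s] for s in samples)
print(codes, len(encoded))
```

With four distinct values a fixed-length code would need 2 bits per sample, 20 bits in total; the Huffman code spends 18 because the most common value gets a one-bit code word.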
17. From wax discs to MP3
- Original analogue records (discs, audio tape) attempted to cope with all possible signals at all possible times
- The CD (1980) did the same and included error correction (almost 1 GByte/hour)
- Perceptual encoders eliminate what isn't relevant to the listener and compress to 100 MByte per hour (and less) using MP3 and related formats
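The storage figures can be related to bit rates with a one-line conversion. The sketch below assumes CD audio is 44.1 ksamples/sec, 16 bits, stereo (not stated in these slides) and counts only the audio payload, so it comes out below the slide's figure, which includes error correction. The 128 kbps rate is just a commonly used MP3 setting chosen for illustration.

```python
def megabytes_per_hour(bit_rate_kbps):
    """Storage for one hour of audio, in units of 10**6 bytes."""
    bits_per_hour = bit_rate_kbps * 1000 * 3600
    return bits_per_hour / 8 / 1e6


# Assumed CD audio format: 44.1 ksamples/sec, 16 bits, 2 channels
cd_audio_kbps = 44_100 * 16 * 2 / 1000

cd_mb = megabytes_per_hour(cd_audio_kbps)  # audio payload only
mp3_mb = megabytes_per_hour(128)           # illustrative MP3 rate

print(round(cd_mb, 1), round(mp3_mb, 1))
```

Read the other way round, 100 MByte per hour corresponds to a bit rate of roughly 222 kbps.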