Title: Voice DSP Processing I
1Voice DSP Processing I
- Yaakov J. Stein
- Chief Scientist, RAD Data Communications
2Voice DSP
- Part 1 Speech biology and what we can learn from it
- Part 2 Speech DSP (AGC, VAD, features, echo cancellation)
- Part 3 Speech compression techniques
- Part 4 Speech Recognition
3Voice DSP - Part 1a
- Speech production mechanisms
- Biology of the vocal tract
- Pitch and formants
- Sonograms
- The basic LPC model
- The cepstrum
- LPC cepstrum
- Line spectral pairs
4Voice DSP - Part 1b
- Speech perception mechanisms
- Biology of the ear
- Psychophysical phenomena
- Weber's law
- Fechner's law
- Changes
- Masking
5Voice DSP - Part 1c
- Speech quality measurement
- Subjective measurement
- MOS and its variants
- Objective measurement
- PSQM, PESQ
6Voice DSP - Part 2a
- Basic speech processing
- Simplest processing
- AGC
- Simplistic VAD
- More complex processing
- pitch tracking
- formant tracking
- U/V decision
- computing LPC and other features
7Voice DSP - Part 2b
- Echo Cancellation
- Sources of echo (acoustic vs. line echo)
- Echo suppression and cancellation
- Adaptive noise cancellation
- The LMS algorithm
- Other adaptive algorithms
- The standard LEC
8Voice DSP - Part 3
- Speech compression techniques
- PCM
- ADPCM
- SBC
- VQ
- ABS-CELP
- MBE
- MELP
- STC
- Waveform Interpolation
9Voice DSP - Part 4
- Speech Recognition tasks
- ASR Engine
- Phonetic labeling
- DTW
- HMM
- State-of-the-Art
10Voice DSP - Part 1a
- Speech
- production
- mechanisms
11Speech Production Organs
Brain
Hard Palate
Nasal cavity
Velum
Teeth
Uvula
Lips
Mouth cavity
Pharynx
Tongue
Larynx
Trachea
Lungs
12Speech Production Organs - cont.
- Air from lungs is exhaled into trachea (windpipe)
- Vocal cords (folds) in larynx can produce periodic pulses of air by opening and closing (glottis)
- Throat (pharynx), mouth, tongue and nasal cavity modify air flow
- Teeth and lips can introduce turbulence
- Epiglottis separates esophagus (food pipe) from trachea
13Voiced vs. Unvoiced Speech
- When vocal cords are held open, air flows unimpeded
- When laryngeal muscles stretch them, glottal flow is in bursts
- When glottal flow is periodic, the result is called voiced speech
- Basic interval/frequency is called the pitch
- Pitch period usually between 2.5 and 20 milliseconds
- Pitch frequency between 50 and 400 Hz
- You can feel the vibration of the larynx
- Vowels are always voiced (unless whispered)
- Consonants come in voiced/unvoiced pairs
- for example B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
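The two pitch ranges quoted above are the same statement in different units, since period = 1 / frequency. The sketch below checks that and then recovers the pitch of a synthetic impulse train by picking the autocorrelation peak. The 8 kHz sampling rate, the 80-sample period, and the function names are illustrative assumptions, and autocorrelation is only one of several pitch estimation methods.

```python
def pitch_period_ms(f_hz):
    """Pitch period in milliseconds for a pitch frequency in Hz."""
    return 1000.0 / f_hz


def estimate_pitch_hz(signal, fs, f_min=50.0, f_max=400.0):
    """Pick the autocorrelation peak among lags in the 50-400 Hz pitch range."""
    lag_min = int(fs / f_max)   # shortest period considered (2.5 ms)
    lag_max = int(fs / f_min)   # longest period considered (20 ms)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if val > best_val:      # strict ">" keeps the shortest of equal peaks
            best_lag, best_val = lag, val
    return fs / best_lag


FS = 8000        # assumed sampling rate in Hz
PERIOD = 80      # assumed pitch period in samples (10 ms)
train = [1.0 if n % PERIOD == 0 else 0.0 for n in range(800)]

print(pitch_period_ms(400.0), pitch_period_ms(50.0))
print(estimate_pitch_hz(train, FS))
```

Restricting the lag search to the 2.5-20 ms range is what keeps the estimator from locking onto multiples of the true period beyond 20 ms.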
14Excitation spectra
- Voiced speech
- Pulse train is not sinusoidal: harmonic rich
- Unvoiced speech
- Common assumption: white noise
[Figures: excitation spectra versus frequency f]
15Effect of vocal tract
- Mouth and nasal cavities have resonances
- Resonant frequencies
- depend on geometry
16Effect of vocal tract - cont.
- Sound energy at these resonant frequencies is amplified
- Frequencies of peak amplification are called formants
[Figure: frequency response versus frequency, with F0 marked]
17Formant frequencies
- Peterson - Barney data (note the vowel triangle)
18Sonograms
19Cylinder model(s)
- Rough model of throat and mouth cavity
- With nasal cavity
[Figure: cylinder models; labels: Voice Excitation, open, open, Voice Excitation, open/closed]
20Phonemes
- The smallest acoustic unit that can change meaning
- Different languages have different phoneme sets
- Types (notations: phonetic, CVC, ARPABET)
- Vowels
- front (heed, hid, head, hat)
- mid (hot, heard, hut, thought)
- back (boot, book, boat)
- diphthongs (buy, boy, down, date)
- Semivowels
- liquids (w, l)
- glides (r, y)
21Phonemes - cont.
- Consonants
- nasals (murmurs) (n, m, ng)
- stops (plosives)
- voiced (b,d,g)
- unvoiced (p, t, k)
- fricatives
- voiced (v, that, z, zh)
- unvoiced (f, think, s, sh)
- affricates (j, ch)
- whispers (h, what)
- gutturals ( ? ,? )
- clicks, etc. etc. etc.
22Basic LPC Model
Pulse Generator
LPC synthesis filter
U/V Switch
White Noise Generator
23Basic LPC Model - cont.
- Pulse generator produces a harmonic rich periodic impulse train (with pitch period and gain)
- White noise generator produces a random signal (with gain)
- U/V switch chooses between voiced and unvoiced speech
- LPC filter amplifies formant frequencies (all-pole or AR IIR filter)
- The output will resemble true speech to within residual error
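The block diagram above can be sketched in a few lines: a U/V switch selects an impulse train or white noise, and an all-pole filter shapes the result. This is a minimal illustration, not a production synthesizer. The single 500 Hz resonance with 100 Hz bandwidth, the 8 kHz sampling rate, the sign convention s[n] = e[n] + sum of a[k] s[n-k], and all function names are assumptions made for the example.

```python
import math
import random


def resonator(freq_hz, bandwidth_hz, fs):
    """Coefficients [a1, a2] of a two-pole resonance (one hypothetical formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    return [2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs), -r * r]


def lpc_synthesize(a, excitation):
    """All-pole (AR) filter: s[n] = e[n] + sum_k a[k] * s[n-k], k = 1..len(a)."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a_k in enumerate(a, start=1):
            if n >= k:
                acc += a_k * s[n - k]
        s.append(acc)
    return s


def make_excitation(voiced, n_samples, pitch_period, gain, seed=0):
    """U/V switch: periodic impulse train if voiced, white noise otherwise."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


FS = 8000                                  # assumed sampling rate in Hz
a = resonator(500.0, 100.0, FS)            # assumed formant at 500 Hz
voiced = lpc_synthesize(a, make_excitation(True, 400, 80, 1.0))
unvoiced = lpc_synthesize(a, make_excitation(False, 400, 80, 0.1))
print(voiced[:3])
print(max(abs(v) for v in unvoiced))
```

A real LPC synthesizer would use around ten or more coefficients estimated from speech frames, so that several formants are modeled at once.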
24Cepstrum
- Another way of thinking about the LPC model
- Speech spectrum is obtained from multiplication
- Spectrum of (pitch) pulse train times
- Vocal tract (formant) frequency response
- So log of this spectrum is obtained from addition
- Log spectrum of pitch train plus
- Log of vocal tract frequency response
- Consider this log spectrum to be the spectrum of some new signal, called the cepstrum
- The cepstrum is the sum of two components
- excitation plus vocal tract
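The additivity claimed above can be checked numerically. The sketch below computes the real cepstrum (inverse DFT of the log magnitude spectrum) of a toy excitation, a toy vocal tract response, and their convolution, and confirms that the third equals the sum of the first two. The decaying pulse train, the one-pole "vocal tract", the DFT length of 256, and the function names are all assumptions chosen so that no spectrum has zeros and the convolution fits in the DFT without wraparound.

```python
import cmath
import math


def dft(x, n_fft, inverse=False):
    """Direct O(N^2) DFT of x zero-padded to n_fft (adequate for a small demo)."""
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n_fft) for k in range(n_fft)]
    padded = list(x) + [0.0] * (n_fft - len(x))
    out = [sum(padded[n] * twiddle[(k * n) % n_fft] for n in range(n_fft))
           for k in range(n_fft)]
    return [v / n_fft for v in out] if inverse else out


def real_cepstrum(x, n_fft=256):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(v)) for v in dft(x, n_fft)]
    return [c.real for c in dft(log_mag, n_fft, inverse=True)]


def convolve(x, h):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi != 0.0:
            for j, hj in enumerate(h):
                y[i + j] += xi * hj
    return y


PERIOD = 32   # assumed pitch period in samples
pulses = [0.5 ** (n // PERIOD) if n % PERIOD == 0 else 0.0 for n in range(128)]
tract = [0.9 ** n for n in range(64)]          # toy one-pole impulse response
speech = convolve(pulses, tract)               # length 191, fits in 256

c_speech = real_cepstrum(speech)
c_sum = [p + t for p, t in zip(real_cepstrum(pulses), real_cepstrum(tract))]
peak = max(range(16, 129), key=lambda q: c_speech[q])
print(max(abs(s - t) for s, t in zip(c_speech, c_sum)))
print(peak)
```

The vocal tract contribution is concentrated at low quefrency, so the largest value away from the origin sits at the pitch period. That separation is what makes the cepstrum useful for pitch detection.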
25Cepstrum - cont.
- Cepstral processing has its own language
- Cepstrum (note that this is really a signal in the time domain)
- Quefrency (its units are seconds)
- Liftering (filtering)
- Alanysis
- Saphe
- Several variants
- complex cepstrum
- power cepstrum
- LPC cepstrum
26Do we know enough?
- Standard speech model (LPC)
- (used by most speech processing/compression/recognition systems)
- is a model of speech production
- Unfortunately, speech production and speech perception systems are not matched
- So next we'll look at the biology of the hearing (auditory) system
- and some psychophysics (perception)
27Voice DSP - Part 1b
- Speech
- Hearing perception mechanisms
28Hearing Organs
29Hearing Organs - cont.
- Sound waves impinge on outer ear and enter auditory canal
- Amplified waves cause eardrum to vibrate
- Eardrum separates outer ear from middle ear
- The Eustachian tube equalizes air pressure of middle ear
- Ossicles (hammer, anvil, stirrup) amplify vibrations
- Oval window separates middle ear from inner ear
- Stirrup excites oval window which excites liquid in the cochlea
- The cochlea is curled up like a snail
- The basilar membrane runs along middle of cochlea
- The organ of Corti transduces vibrations to electric pulses
- Pulses are carried by the auditory nerve to the brain
30Function of Cochlea
- Cochlea has 2 1/2 to 3 turns
- were it straightened out it would be 3 cm in length
- The basilar membrane runs down the center of the cochlea
- as does the organ of Corti
- 15,000 cilia (hairs) contact the vibrating basilar membrane
- and release neurotransmitter stimulating 30,000 auditory neurons
- Cochlea is wide (1/2 cm) near oval window and tapers towards apex
- is stiff near oval window and flexible near apex
- Hence high frequencies cause section near oval window to vibrate
- low frequencies cause section near apex to vibrate
- Overlapping bank of filters: frequency decomposition
31Psychophysics - Weber's law
- Ernst Weber, Professor of physiology at Leipzig in the early 1800s
- Just Noticeable Difference
- minimal stimulus change that can be detected by senses
- Discovery: $\Delta I = K I$
- Example
- Tactile sense: place coins in each hand
- subject could discriminate between 10 coins and 11,
- but not 20/21, but could 20/22!
- Similarly vision (lengths of lines), taste (saltiness), sound (frequency)
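Weber's law says the just noticeable difference is a fixed fraction K of the current stimulus, so a change is detectable once the relative change reaches K. The sketch below applies that rule to the coin example. The value K = 0.1 is an assumption picked to be consistent with the three observations quoted (10/11 yes, 20/21 no, 20/22 yes), and the function name is illustrative.

```python
def is_noticeable(i_old, i_new, k):
    """Weber's law: a change is detectable when |delta I| / I reaches K."""
    return abs(i_new - i_old) / i_old >= k


K_COINS = 0.1   # assumed Weber fraction, chosen to match the coin example

for old, new in [(10, 11), (20, 21), (20, 22)]:
    print(old, new, is_noticeable(old, new, K_COINS))
```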
32Weber's law - cont.
- This makes a lot of sense
Bill Gates
33Psychophysics - Fechner's law
- Weber's law is not a true psychophysical law
- it relates stimulus threshold to stimulus (both physical entities)
- not internal representation (feelings) to physical entity
- Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy)
- Simplest assumption: JND is single internal unit
- Using Weber's law we find
- $Y = A \log I + B$
- Fechner Day (October 22 1850)
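The step from Weber's law to the logarithm can be spelled out as follows. This is a sketch under the slide's stated assumption that each JND corresponds to one unit of internal sensation Y, with the finite differences treated as differentials. Here K is the Weber fraction and B is an integration constant.

```latex
\Delta Y = 1 \ \text{whenever}\ \Delta I = K I
\quad\Longrightarrow\quad
\frac{dY}{dI} = \frac{1}{K I}
\quad\Longrightarrow\quad
Y = \frac{1}{K} \ln I + B = A \log I + B
```

With the natural logarithm A = 1/K; choosing another base for the logarithm only rescales A.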
34Fechner's law - cont.
- Log is very compressive
- Fechner's law explains the fantastic ranges of our senses
- Sight: single photon to direct sunlight, $10^{15}$
- Hearing: eardrum moves 1 H atom to jet plane, $10^{12}$
- Bel defined to be log10 of power ratio
- decibel (dB) one tenth of a Bel
- $d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
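As a quick check of how compressive the logarithm is, the decibel formula can be applied to the two ranges quoted above. Treating those ranges as power ratios is an assumption made for this example, and the function name is illustrative.

```python
import math


def power_ratio_to_db(p1, p2):
    """Decibels for a power ratio: d = 10 * log10(P1 / P2)."""
    return 10.0 * math.log10(p1 / p2)


print(power_ratio_to_db(1e12, 1.0))   # hearing range quoted on the slide
print(power_ratio_to_db(1e15, 1.0))   # sight range quoted on the slide
print(power_ratio_to_db(2.0, 1.0))    # doubling the power
```

Twelve orders of magnitude collapse to 120 dB, and doubling the power adds only about 3 dB.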
35Fechner's law - sound amplitudes
- Companding
- adaptation of logarithm to positive/negative signals
- μ-law and A-law are piecewise linear approximations
- Equivalent to linear sampling at 12-14 bits
- (8 bit linear sampling is significantly more noisy)
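The benefit of companding for quiet signals can be sketched with the continuous μ-law curve. Note that this is the smooth formula with μ = 255, not the piecewise linear segment encoding used in practice, and the quantizer, the test amplitudes, and the function names are assumptions for illustration.

```python
import math

MU = 255.0   # standard mu-law constant


def mu_compress(x):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(v, bits=8):
    """Uniform quantizer on [-1, 1] with step 2 / 2**bits (round to grid)."""
    step = 2.0 / (2 ** bits)
    return round(v / step) * step


def linear_error(x):
    return abs(quantize(x) - x)


def companded_error(x):
    return abs(mu_expand(quantize(mu_compress(x))) - x)


quiet = [i / 1000.0 for i in range(1, 51)]     # small amplitudes up to 0.05
lin = sum(linear_error(x) for x in quiet) / len(quiet)
comp = sum(companded_error(x) for x in quiet) / len(quiet)
print(lin, comp)
```

For small amplitudes the companded 8 bit path has a much smaller average error than plain 8 bit linear quantization, which is the sense in which it behaves like a longer linear word.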
36Fechner's law - sound frequencies
- octaves, well tempered scale
- Critical bands
- Frequency warping
- Melody (mel): 1 kHz = 1000 mel, JND afterwards: $M = 1000 \log_2 (1 + f_{\mathrm{kHz}})$
- Barkhausen (Bark): can be simultaneously heard: $B = 25 + 75\,(1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
- excite different basilar membrane regions
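The two warping formulas on this slide, read as M = 1000 log2(1 + f) mel and B = 25 + 75 (1 + 1.4 f^2)^0.69 Hz with f in kHz, can be evaluated directly. The reading of the garbled slide text as these formulas is an assumption, as are the function names. B is interpreted here as the critical bandwidth in Hz around frequency f.

```python
import math


def mel(f_hz):
    """Mel value for a frequency in Hz: M = 1000 * log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)


def critical_bandwidth_hz(f_hz):
    """Critical bandwidth in Hz: B = 25 + 75 * (1 + 1.4 * f_kHz**2) ** 0.69."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


for f in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(mel(f), 1), round(critical_bandwidth_hz(f), 1))
```

The mel value grows almost linearly at low frequencies and logarithmically above 1 kHz, while the critical bandwidth stays near 100 Hz at low frequencies and widens as frequency rises.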
37Psychophysics - changes
- Our senses respond to changes
38Psychophysics - masking
- Masking: strong tones block weaker ones at nearby frequencies
- narrowband noise blocks tones (up to critical band)
39Voice DSP - Part 1c
- Speech
- Quality
- Measurement
40Why does it sound the way it sounds?
- PSTN
- BW = 0.2-3.8 kHz, SNR > 30 dB
- PCM, ADPCM (BER $10^{-3}$)
- five nines reliability
- line echo cancellation
- Voice over packet network
- speech compression
- delay, delay variation, jitter
- packet loss/corruption/priority
- echo cancellation
41Subjective Voice Quality
- Old Measures
- 5/9
- DRT
- DAM
- The modern scale
- MOS
- DMOS
meet neat seat feet Pete beat heat
42MOS according to ITU
- P.800 Subjective Determination of Transmission Quality
- Annex B Absolute Category Rating (ACR)
- Score: Listening Quality / Listening Effort
- 5: excellent / relaxed
- 4: good / attention needed
- 3: fair / moderate effort
- 2: poor / considerable effort
- 1: bad / no meaning with feasible effort
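A Mean Opinion Score is simply the average of the category ratings that a panel of listeners gives on the 5-point scale above. The sketch below shows that computation. The ratings are hypothetical votes invented for the example, and the function name is illustrative.

```python
ACR_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average of ACR listening-quality votes, each an integer from 1 to 5."""
    if not ratings or any(r not in ACR_LABELS for r in ratings):
        raise ValueError("ratings must be a non-empty list of integers 1..5")
    return sum(ratings) / len(ratings)


ratings = [4, 4, 5, 3, 4, 4, 3, 5]   # hypothetical listener votes
print(mean_opinion_score(ratings))
```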
43MOS according to ITU (cont)
- Annex D Degradation Category Rating (DCR)
- Annex E Comparison Category Rating (CCR)
- ACR not good at high quality speech
- DCR: 5 inaudible, 4 not annoying, 3 slightly annoying, 2 annoying, 1 very annoying
- CCR: 3 much better, 2 better, 1 slightly better, 0 the same, -1 slightly worse, -2 worse, -3 much worse
44Some MOS numbers
- Effect of Speech Compression
- (from ITU-T Study Group 15)
- Quiet room, 48 kHz 16 bit linear sampling: 5.0
- PCM (A-law/μ-law) 64 Kb/s: 4.1
- G.723.1 @ 6.3 Kb/s: 3.9
- G.729 @ 8 Kb/s: 3.9
- ADPCM G.726 32 Kb/s: 3.8
- toll quality
- GSM @ 13 Kb/s: 3.6
- VSELP IS54 @ 8 Kb/s: 3.4
45The Problem(s) with MOS
- Accurate MOS tests are the only reliable benchmark
- BUT
- MOS tests are off-line
- MOS tests are slow
- MOS tests are expensive
- Different labs give consistently different results
- Most MOS tests only check one aspect of system
46The Problem(s) with SNR
- Naive question: Isn't CCR the same as SNR?
- SNR does not correlate well with subjective criteria
- Squared difference is not an accurate comparator
- Gain
- Delay
- Phase
- Nonlinear processing
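The items listed above are exactly the changes that a listener barely notices but that wreck a squared-difference comparison. The sketch below computes the SNR of a 440 Hz tone against three copies that sound essentially the same: one at half gain, one inverted in phase, and one delayed by 4 samples (0.5 ms). The tone, the sampling rate, and the function name are assumptions for illustration.

```python
import math


def snr_db(ref, test):
    """Signal energy over squared-difference energy, in dB."""
    signal = sum(r * r for r in ref)
    noise = sum((r - t) ** 2 for r, t in zip(ref, test))
    return 10.0 * math.log10(signal / noise)


FS = 8000                                      # assumed sampling rate in Hz
OMEGA = 2.0 * math.pi * 440.0 / FS             # 440 Hz test tone
ref = [math.sin(OMEGA * n) for n in range(800)]

quieter = [0.5 * r for r in ref]               # gain change only
inverted = [-r for r in ref]                   # 180 degree phase change only
delayed = [math.sin(OMEGA * (n - 4)) for n in range(800)]   # 0.5 ms delay

print(snr_db(ref, quieter))
print(snr_db(ref, inverted))
print(snr_db(ref, delayed))
```

All three copies score around 6 dB or below, and two of them score negative, even though none contains any added noise.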
47Speech distance measures
- Many objective measures have been proposed
- Segmental SNR
- Itakura-Saito distance
- Euclidean distance in Cepstrum space
- Bark spectral distortion
- Coherence Function
- None correlate well with MOS
- ITU target: find a quality measure that does correlate well
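As one concrete instance of the measures listed, segmental SNR averages per-frame SNR values in dB instead of forming one global ratio. The sketch below is a minimal version. The 160-sample frame length and the clamping limits of -10 and 35 dB are commonly used choices taken here as assumptions, and the function name is illustrative.

```python
import math


def segmental_snr_db(ref, test, frame_len=160, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to the range [lo, hi]."""
    snrs = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        r = ref[start:start + frame_len]
        t = test[start:start + frame_len]
        sig = sum(v * v for v in r)
        noise = sum((a - b) ** 2 for a, b in zip(r, t))
        if sig == 0.0:
            continue                           # skip silent reference frames
        snr = hi if noise == 0.0 else 10.0 * math.log10(sig / noise)
        snrs.append(min(hi, max(lo, snr)))
    return sum(snrs) / len(snrs)


FS = 8000                                      # assumed sampling rate in Hz
OMEGA = 2.0 * math.pi * 440.0 / FS
ref = [math.sin(OMEGA * n) for n in range(800)]

dropout = [0.0] * 160 + ref[160:]              # first 20 ms frame lost
quieter = [0.5 * r for r in ref]               # uniform gain change

print(segmental_snr_db(ref, dropout))
print(segmental_snr_db(ref, quieter))
```

Like plain SNR, this measure still penalizes a harmless gain change, which illustrates why none of these distances tracks MOS well.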
48Some objective methods
- Perceptual Speech Quality Measurement (PSQM)
- ITU-T P.861
- Perceptual Analysis Measurement System (PAMS)
- BT proprietary technique
- Perceptual Evaluation of Speech Quality (PESQ)
- ITU-T P.862
- Objective Measurement of Perceived Audio Quality (PEAQ)
- ITU-R BS.1387
49Objective Quality Strategy
speech
50PSQM philosophy (from P.861)
Internal Representation
Perceptual model
Audible Difference
Cognitive Model
Perceptual model
Internal Representation
51PSQM philosophy (cont)
- Perceptual Modelling (Internal representation)
- Short time Fourier transform
- Frequency warping (telephone-band filtering, Hoth noise)
- Intensity warping
- Cognitive Modelling
- Loudness scaling
- Internal cognitive noise
- Asymmetry
- Silent interval processing
- PSQM Values
- 0 (no degradation) to 6.5 (maximum degradation)
- Conversion to MOS
- PSQM to MOS calibration using known references
- Equivalent Q values
52Problems with PSQM
- Designed for telephony grade speech codecs
- Doesn't take network effects into account
- filtering
- variable time delay
- localized distortions
- Draft standard P.862 adds
- transfer function equalization
- time alignment, delay skipping
- distortion averaging
53PESQ philosophy (from P.862)
Perceptual model
Internal Representation
Cognitive Model
Audible Difference
Time Alignment
Perceptual model
Internal Representation