Title: Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends
1- Spoken Language Structure
- E.M. Bakker
- LIACS Media Lab
- Leiden University
2Spoken Language Structure
- Introduction
- Sound
- Speech Production
- Speech Perception
- Phonetics and Phonology
- Syllables and Words
- Syntax and Semantics
3Introduction
- Speech Generation
- Message Formulation (Application semantics, actions)
- Language System (Phonemes, words, prosody)
- Neuromuscular Mapping (Articulatory Parameter)
- Vocal Tract System (Speech Generation)
- Sound Wave
- Speech Understanding
- Cochlea Motion (Speech Analysis)
- Neural Transduction (Feature Extraction)
- Language System (Phonemes, words, prosody)
- Message Comprehension (Application semantics,
actions)
4Sound (some facts)
- Speed of a sound pressure wave: 331.5 + 0.6 T m/s, where T is the air temperature in degrees Celsius
- At 20 degrees Celsius this is equal to 343.5 m/s = 1236.6 km/h
- The amplitude of a sound (the degree of displacement) is measured on a log10 scale in decibels (dB).
- In fact it is a comparison of the sound intensity I to the intensity I0 of the threshold of our hearing (TOH).
- Sound Amplitude (dB) = 10 log10(I/I0)
- Sound Pressure Level: 10 log10(P^2/P0^2) = SPL(dB) = 20 log10(P/P0),
- where P0 = 0.0002 μbar for a tone of 1 kHz (TOH)
- Note that I is proportional to P^2.
- SPL is a measure of the absolute sound pressure P in dB.
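The formulas on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, the temperature is assumed to be in degrees Celsius, and the reference pressure 0.0002 μbar is written as 2e-5 Pascal.

```python
import math

# Reference pressure at the threshold of hearing (1 kHz tone):
# 0.0002 microbar = 2e-5 Pascal.
P0_PASCAL = 2e-5


def speed_of_sound(temp_celsius):
    """Speed of a sound pressure wave in m/s: 331.5 + 0.6 T."""
    return 331.5 + 0.6 * temp_celsius


def intensity_level_db(intensity, reference_intensity):
    """Sound amplitude in dB: 10 log10(I / I0)."""
    return 10.0 * math.log10(intensity / reference_intensity)


def spl_db(pressure_pascal, p0=P0_PASCAL):
    """Sound pressure level in dB: 20 log10(P / P0)."""
    return 20.0 * math.log10(pressure_pascal / p0)


if __name__ == "__main__":
    v = speed_of_sound(20.0)
    print(f"{v:.1f} m/s = {v * 3.6:.1f} km/h")
    # A pressure of 1 Pascal comes out at about 94 dB SPL.
    print(f"{spl_db(1.0):.1f} dB SPL")
```

Because I is proportional to P^2, both level functions give the same number of dB for the same sound; the factor 20 in `spl_db` is just the squared pressure pulled out of the logarithm.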
5Sound (some facts)
- Sound | dB Level | x TOH
- TOH (10^-12 W/m^2) | 0 | 10^0 (= 1)
- Light whisper | 10 | 10^1
- Quiet conversation | 40 | 10^4
- Average office | 50 | 10^5
- Busy city street | 70 | 10^7
- Power tools | 110 | 10^11
- Pain threshold ear | 120 | 10^12
- Airport runway | 130 | 10^13
- Permanent damage to hearing | 140 | 10^14
- Jet engine (close) | 160 | 10^16
- 3.6 m from cannon (10^10 W/m^2) | 220 | 10^22
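The relation between the dB column and the multiple-of-TOH column is a power of ten: every 10 dB multiplies the intensity by 10. A small sketch, assuming the threshold intensity of 10^-12 W/m^2 from the first row (the function names are illustrative only):

```python
import math

# Intensity at the threshold of hearing (TOH), in W/m^2.
I0_WATT_PER_M2 = 1e-12


def ratio_from_db(level_db):
    """Intensity as a multiple of the TOH intensity for a level in dB."""
    return 10.0 ** (level_db / 10.0)


def intensity_from_db(level_db, i0=I0_WATT_PER_M2):
    """Absolute intensity in W/m^2 for a level in dB."""
    return i0 * ratio_from_db(level_db)


def db_from_intensity(intensity, i0=I0_WATT_PER_M2):
    """Level in dB for an absolute intensity in W/m^2."""
    return 10.0 * math.log10(intensity / i0)


if __name__ == "__main__":
    for name, level in [("Quiet conversation", 40), ("Pain threshold ear", 120)]:
        print(f"{name}: {level} dB = {ratio_from_db(level):.0e} x TOH")
```

For the last row, `intensity_from_db(220)` gives roughly 10^10 W/m^2, which matches the intensity quoted for the cannon.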
6Sound (some facts)
- Sound | dB Level | x TOH
- Quiet conversation | 40 | 10^4
- Average office | 50 | 10^5
- Busy city street | 70 | 10^7
- The sound's energy is inversely proportional to r^2, where r is the distance to the sound's point source.
- => if the distance is doubled (x2), SPL(dB) decreases by 10 log10(4) ≈ 6 dB
- Energy E ∝ A^2/r^2
- Example: talking to a microphone from a distance of 2.56 cm gives a sound pressure P = 1 Pascal = 10^-5 bar; this corresponds to 94 dB SPL.
- => 25.6 cm away the sound pressure is P = 0.1 Pascal = 10^-6 bar, while the energy is 1/100th of the original energy. This corresponds to a 10 log10(100) = 20 dB decrease in dB SPL => 74 dB SPL.
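The distance rule can be sketched as follows, assuming an ideal point source in free space (no reflections); the helper names are hypothetical.

```python
import math


def spl_change_db(r_from, r_to):
    """Change in SPL (dB) when moving from distance r_from to r_to.

    Energy falls off as 1/r^2, so the change is -10 log10((r_to/r_from)^2).
    """
    return -10.0 * math.log10((r_to / r_from) ** 2)


def spl_at_distance(spl_ref_db, r_ref, r):
    """SPL at distance r, given the SPL measured at distance r_ref."""
    return spl_ref_db + spl_change_db(r_ref, r)


if __name__ == "__main__":
    # Doubling the distance costs about 6 dB.
    print(f"{spl_change_db(1.0, 2.0):.2f} dB")
    # Microphone example: 94 dB SPL at 2.56 cm, moved to 25.6 cm.
    print(f"{spl_at_distance(94.0, 2.56, 25.6):.1f} dB SPL")
```

The units of the two distances cancel, so centimetres or metres both work as long as they are used consistently.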
7Sound (some facts)
Absolute threshold of hearing.
8Speech Production
- Articulators
- The Voicing Mechanism
- Spectrograms and Formants
9Speech Production Articulators
- Speech production apparatus consists of
- lungs
- vocal cords (larynx; Dutch: strottenhoofd) (movie: model vocal tract)
- if vibrating during a speech sound, the sound is voiced
- if not, the sound is unvoiced
- soft palate (velum; Dutch: huig)
- hard palate (Dutch: gehemelte)
- tongue
- teeth
- lips
10Speech Production the Voicing Mechanism
- Voiced sounds
- vocal folds vibrate
- the fundamental frequency: 60 - 300 Hz (movie: vocal cords)
- higher-frequency harmonics are contributed by the different oral resonance cavities
- roughly regular patterns in time and frequency
- more energy
- timbres are created by the tongue, the lips, and the shape of the main oral resonance cavity
- Voiceless sounds
- no vocal vibration
- no regular patterns
- less energy
11Speech Production Spectrograms and Formants
- The glottal wave is periodic, consisting of the fundamental frequency F0 and a number of harmonics (frequency m·F0)
- Formants are harmonics that are emphasized because they are close to the resonances of the cavities (in particular articulatory configurations)
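To make the relation between harmonics and formants concrete, the sketch below lists the harmonics m·F0 and picks, for each cavity resonance, the harmonic closest to it. The resonance values used in the example are hypothetical illustration values, not measurements from these slides, and the function names are made up.

```python
def harmonics(f0, f_max):
    """Harmonic frequencies m * f0 (m = 1, 2, ...) up to f_max, in Hz."""
    count = int(f_max // f0)
    return [m * f0 for m in range(1, count + 1)]


def emphasized_harmonics(f0, resonances, f_max=4000.0):
    """For each cavity resonance, return the harmonic nearest to it."""
    candidates = harmonics(f0, f_max)
    return [min(candidates, key=lambda h: abs(h - r)) for r in resonances]


if __name__ == "__main__":
    # Hypothetical resonances (Hz) for one articulatory configuration.
    example_resonances = [730.0, 1090.0, 2440.0]
    print(emphasized_harmonics(100.0, example_resonances))
```

With F0 = 100 Hz the harmonics lie on a 100 Hz grid, so each resonance is at most 50 Hz away from some harmonic; with a higher F0 the grid is coarser and the emphasized harmonics can sit further from the resonance.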
12Speech Production Spectrograms and Formants
Spectrogram labels: S /s/ voiceless; ee /iy/ voiced; S /z/ voiced
13Speech Perception
- Physiology of the Ear
- Physical vs. Perceptual Attributes
- Dynamic Range
- Frequency Analysis
- Masking
14Speech Perception Physiology of the Ear
15Speech Perception Physical vs. Perceptual
Attributes
- Physical Quantity | Perceptual Quality
- Intensity | Loudness
- Fundamental Frequency | Pitch
- Spectral Shape | Timbre
- Onset/offset time | Timing
- Phase difference in binaural hearing | Location
16Speech Perception Physical vs. Perceptual
Attributes
17Speech Perception Dynamic Range
- The human ear features a wide dynamic range. An important property of the human ear is its absolute hearing threshold, which changes significantly across the frequency range. The figure below shows the hearing threshold curve. Sounds with a level below the threshold cannot be heard.
18Speech Perception Frequency Analysis
- Investigations by many researchers show that the human hearing system processes perceived sound in sub-bands called critical bands. In each critical band the sound is analyzed independently. Each band corresponds to an equal section of the cochlea (about 1.3 mm). The critical bandwidth differs across the frequency range: below 500 Hz the bandwidths are constant and equal to 100 Hz; above 500 Hz each next critical band is 20% wider than the band below it. Therefore, the human hearing system can be modeled as a set of band-pass filters, each with the bandwidth of the corresponding critical band.
- However, there are some differences between the definitions of critical bands given by different researchers. The definitions of Zwicker and Fletcher are the most commonly used.
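The bandwidth rule stated on this slide (100 Hz wide bands up to 500 Hz, each following band 20% wider than the previous one) can be turned into a list of band edges. This is only a sketch of that simplified rule, not the Zwicker or Fletcher definition, and the function names are illustrative.

```python
def critical_band_edges(f_max):
    """Upper band edges (Hz) under the simplified rule, covering 0..f_max.

    Bands are 100 Hz wide up to 500 Hz; above that, each band is
    20% wider than the band below it.
    """
    edges = []
    width = 100.0
    upper = 0.0
    while upper < f_max:
        upper += width
        edges.append(upper)
        if upper >= 500.0:
            width *= 1.2  # the next band is 20% wider
    return edges


def critical_bands(f_max):
    """Band-pass ranges as (low, high) pairs in Hz."""
    edges = critical_band_edges(f_max)
    lows = [0.0] + edges[:-1]
    return list(zip(lows, edges))


if __name__ == "__main__":
    for low, high in critical_bands(1000.0):
        print(f"{low:7.1f} - {high:7.1f} Hz  (width {high - low:.1f} Hz)")
```

Each (low, high) pair plays the role of one band-pass filter in the filter-bank model; the last band is allowed to extend past `f_max` so that the whole range is covered.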
19Speech Perception Frequency Analysis
- According to the definition of critical bands, a special psychoacoustic unit, the bark, was introduced. One bark corresponds to the width of one critical band. The graph below illustrates the relation between the objective frequency scale and the subjective (bark) scale.
- Bark Band | Edge (Hz) | Center (Hz)
- 1 | 100 | 50
- 2 | 200 | 150
- 3 | 300 | 250
- 4 | 400 | 350
- 5 | 510 | 450
- 6 | 630 | 570
- 7 | 770 | 700
- 8 | 920 | 840
- 9 | 1080 | 1000
- etc.
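A frequency can be mapped to its bark band with a simple table lookup. The sketch below uses only the nine bands listed on this slide and assumes that the "Edge" column gives the upper edge of each band, with the first band starting at 0 Hz; the names are illustrative.

```python
from bisect import bisect_left

# Upper band edges (Hz) of bark bands 1..9, taken from the table on this slide.
BARK_UPPER_EDGES_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080]


def bark_band(freq_hz):
    """Return the bark band number (1-based) that contains freq_hz.

    A frequency exactly on an edge is assigned to the lower band.
    Only the bands listed on the slide (up to 1080 Hz) are covered.
    """
    if freq_hz < 0 or freq_hz > BARK_UPPER_EDGES_HZ[-1]:
        raise ValueError("frequency outside the tabulated range")
    return bisect_left(BARK_UPPER_EDGES_HZ, freq_hz) + 1


if __name__ == "__main__":
    for f in (50, 450, 1000):
        print(f"{f} Hz is in bark band {bark_band(f)}")
```

The three example frequencies are the tabulated centers of bands 1, 5 and 9, so the lookup returns exactly those band numbers.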
20Speech Perception Masking
- Masking is an important feature of the human hearing system. As a result of masking, some tone components of an audible signal cannot be heard if another sound close in frequency has a high enough level.
- The masking feature is widely used in sound compression algorithms as well as in our perceptual noise reduction method.
- There are two types of masking:
- non-parallel (time-domain)
- parallel (frequency-domain), as shown in the figure below.
21Speech Perception Masking
- non-parallel (time-domain)
- parallel (frequency-domain) (left: some curves; right: bark scale).
22Phonetics and Phonology
- Phonemes
- Vowels
- Consonants
- Phonetic Typology
- The Allophone Sound and Context
- Speech Rate and Coarticulation
23Phonetics and Phonology Phonemes
- The phoneme is the minimal unit of speech sound in a language that can distinguish one word from another (a perceptual category, or class).
- "Bat" and "pat" are similar sounding, but we want the phonemes /b/ and /p/ to distinguish the different meanings of the words.
- A phone is a phoneme's acoustic realization.
- The phoneme /t/ has two different acoustic realizations (phones) in the words:
- sat: voiceless alveolar plosive
- meter: alveolar flap
- Phoneme and phone will be used interchangeably to refer to speaker-independent and context-independent units of meaningful sound contrast.
24Phonetics and Phonology PhoneHead
- sat: voiceless alveolar plosive
- meter: alveolar flap
(Movie: x-ray woman)
25Phonetics and Phonology Vowels
- SIMPLE VOWELS (pure vowels or monophthongs.)
- A simple vowel is a sound without detectable change in quality from beginning to end.
- It results from changing the shape and the position of the tongue and lips.
- Simple vowels can be compared in terms of their duration, tongue position and lip shape.
- Examples: sleep, sit, book, boot, ten, after, bird, horse, cat, up, far, hot
- COMPLEX VOWELS (DIPHTHONGS)
- continually moving tongue shape and changing sound quality
- represented by two vowel symbols but counted as one unit
- the 2 symbols represent the beginning and the end of the sound quality
- the jaw, tongue and lips make a gliding movement from the first element of the diphthong to the second
- the first part is much stronger than the second part
- can be classified as either closing or centering
- Examples: beer, say, fewer, boy, go, bear, high, how
26Phonetics and Phonology Vowels
- Vowels (Dutch: klinkers)
- the major resonances of the oral and pharyngeal cavities lead to the first and second formants, F1 and F2
Relative tongue positions => phonological feature decomposition
27Phonetics and Phonology Consonants
- Consonants (Dutch: medeklinkers)
- Type | Phoneme | Examples | Mechanism
- plosive | /p/ | pat, tap | closure in oral cavity
- nasal | /m/ | team, meat | closure of nasal cavity
- fricative | /s/ | sick, kiss | turbulent airstream noise
- retroflex liquid | /r/ | rat, tar | vowel-like, tongue high and curled back
- lateral liquid | /l/ | lean, kneel | vowel-like, tongue central, side airstream
- glide | /y/, /w/ | yes, well | vowel-like
28Phonetics and Phonology Phonetic Typology
- Oral, nasal, pharyngeal, and glottal mechanisms can produce effects beyond those used in English
- length: kado (corner), kaado (card) (Japanese, Dutch, ...)
- trilled r: pero (but), perro (dog) (Spanish)
- pitch: ma (high level: mother), ma (high rising: numb), ma (low rising: horse), ma (high falling: to scold) (Mandarin)
- etc.
29The Allophone Sound and Context
- Coarticulation is the process by which neighboring sounds influence one another.
- The perceivably modified phonemes are called allophones.
/p ih n/ pin; /s p ih n/ spin
30Coarticulation
- Perseverance: in the early part of the vowel /eh/ the articulators are still somewhat set from the realization of the initial consonants (/b/, /d/, /g/).
- Also anticipation: the vowel is influenced by the ending consonants.
/b eh t/ bet; /d eh t/ debt; /g eh t/ get
31Syllables and Words Syntax and Semantics
- Syllables and Words
- Lexical Part-of-Speech
- Morphology
- Word Classes
- Syntax and Semantics
- Syntactic Constituents
- Phrase Schemata
- Clauses and Sentences
- Parse Tree Representations
- Semantic Roles
- Lexical Semantics
- Logical Form
32Syllables
33Words
34Lexical Part-of-Speech
35Morphology
36Word Classes
37Syntax and Semantics
- Syntactic Constituents
- Phrase Schemata
- Clauses and Sentences
- Parse Tree Representations
- Semantic Roles
- Lexical Semantics
- Logical Form
38Syntax Syntactic Constituents
39Syntax Phrase Schemata
40Syntax Clauses and Sentences
41Syntax Parse Tree Representations
42Semantics Semantic Roles
43Semantics Lexical Semantics
44Semantics Logical Form