Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends


Title: Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends


1
  • Spoken Language Structure
  • E.M. Bakker
  • LIACS Media Lab
  • Leiden University

2
Spoken Language Structure
  • Introduction
  • Sound
  • Speech Production
  • Speech Perception
  • Phonetics and Phonology
  • Syllables and Words
  • Syntax and Semantics

3
Introduction
  • Speech Generation
  • Message Formulation (Application semantics,
    actions)
  • Language System (Phonemes, words, prosody)
  • Neuromuscular Mapping (Articulatory Parameters)
  • Vocal Tract System (Speech Generation)
  • Sound Wave
  • Speech Understanding
  • Cochlea Motion (Speech Analysis)
  • Neural Transduction (Feature Extraction)
  • Language System (Phonemes, words, prosody)
  • Message Comprehension (Application semantics,
    actions)

4
Sound (some facts)
  • Speed of a sound pressure wave: 331.5 + 0.6T m/s,
    where T is the air temperature in degrees Celsius
  • At 20 degrees Celsius this is equal to 343.5
    m/s = 1236.6 km/h
  • The amplitude of a sound (the degree of
    displacement) is measured on a log10 scale in
    decibels (dB).
  • In fact it is a comparison of the sound intensity
    I to the intensity I0 at the
    threshold of hearing (TOH).
  • Sound Amplitude
  • dB = 10 log10(I/I0)
  • Sound Pressure Level
  • SPL(dB) = 10 log10(P^2/P0^2) = 20 log10(P/P0),
  • where P0 = 0.0002 μbar for a tone of 1 kHz (TOH)
  • Note that I is proportional to P^2.
  • SPL is a measure of the absolute sound pressure P
    in dB.
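
The two formulas on this slide can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names are illustrative, and the reference pressure is the slide's 0.0002 μbar converted to pascals (2e-5 Pa).

```python
import math

# Reference pressure at the threshold of hearing:
# 0.0002 microbar = 2e-5 Pa (value taken from the slide).
P0_PA = 2e-5


def speed_of_sound(temp_celsius):
    """Speed of sound in air (m/s), using the slide's linear approximation."""
    return 331.5 + 0.6 * temp_celsius


def spl_db(pressure_pa, p0_pa=P0_PA):
    """Sound pressure level in dB: 20 log10(P / P0)."""
    return 20.0 * math.log10(pressure_pa / p0_pa)


if __name__ == "__main__":
    c = speed_of_sound(20.0)
    print(f"{c:.1f} m/s = {c * 3.6:.1f} km/h")       # 343.5 m/s = 1236.6 km/h
    print(f"SPL at 1 Pa: {spl_db(1.0):.1f} dB")      # about 94 dB
```

Because I is proportional to P^2, computing 10 log10 of the squared pressure ratio gives the same value as `spl_db`.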

5
Sound (some facts)
  • Sound                              dB level   x TOH
  • TOH (10^-12 W/m^2)                 0          10^0 (= 1)
  • Light whisper                      10         10^1
  • Quiet conversation                 40         10^4
  • Average office                     50         10^5
  • Busy city street                   70         10^7
  • Power tools                        110        10^11
  • Pain threshold of the ear          120        10^12
  • Airport runway                     130        10^13
  • Permanent damage to hearing        140        10^14
  • Jet engine (close)                 160        10^16
  • 3.6 m from cannon (10^10 W/m^2)    220        10^22
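
The two numeric columns of this table are related by dB = 10 log10(I/I0). As a small sketch (function names are illustrative and not from the slides), the conversion in both directions looks like this, with I0 taken as the 10^-12 W/m^2 threshold listed in the first row:

```python
import math

# Intensity at the threshold of hearing in W/m^2 (first row of the table).
I0 = 1e-12


def intensity_to_db(intensity_w_m2):
    """dB level of an intensity relative to the threshold of hearing."""
    return 10.0 * math.log10(intensity_w_m2 / I0)


def db_to_toh_ratio(level_db):
    """How many times the threshold intensity a given dB level represents."""
    return 10.0 ** (level_db / 10.0)


if __name__ == "__main__":
    # Cannon row: 10^10 W/m^2 should come out at about 220 dB.
    print(round(intensity_to_db(1e10)))
    # Quiet conversation row: 40 dB is 10^4 times the threshold intensity.
    print(db_to_toh_ratio(40.0))
```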

6
Sound (some facts)
  • Sound                              dB level   x TOH
  • Quiet conversation                 40         10^4
  • Average office                     50         10^5
  • Busy city street                   70         10^7
  • The sound's energy is inversely proportional to
    r^2, where r is the distance to the sound's point
    source.
  • => if the distance is doubled (x2), SPL(dB) decreases
    by 10 log10(4) ≈ 6 dB
  • Energy E is proportional to A^2/r^2
  • Example: talking into a microphone from a distance
    of 2.56 cm gives a sound pressure P = 1 Pascal =
    10^-5 bar; this corresponds to 94 dB SPL.
  • => 25.6 cm away the sound pressure is P = 0.1 Pascal
    = 10^-6 bar, while the energy is 1/100th of the
    original energy. This corresponds to a 10
    log10(100) = 20 dB decrease in dB SPL => 74 dB
    SPL.
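
The inverse-square relation and the microphone example can be reproduced in a few lines. This is a sketch assuming an ideal point source in free space; the function name is illustrative and not from the slides.

```python
import math


def spl_at_distance(spl_ref_db, r_ref, r):
    """SPL at distance r, given the SPL at r_ref, for an ideal point source.

    Energy falls off as 1/r^2, so the level drops by 10 log10((r / r_ref)^2).
    Both distances must use the same unit.
    """
    return spl_ref_db - 10.0 * math.log10((r / r_ref) ** 2)


if __name__ == "__main__":
    # Microphone example from the slide: 94 dB SPL at 2.56 cm.
    print(f"{spl_at_distance(94.0, 2.56, 25.6):.1f} dB")   # about 74 dB
    # Doubling the distance costs about 6 dB.
    print(f"{spl_at_distance(94.0, 1.0, 2.0):.1f} dB")     # about 88 dB
```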

7
Sound (some facts)
Absolute threshold of hearing.
8
Speech Production
  • Articulators
  • The Voicing Mechanism
  • Spectrograms and Formants

9
Speech Production: Articulators
  • Speech production apparatus consists of
  • lungs
  • vocal cords (larynx; Dutch: strottenhoofd) (movie:
    model vocal tract)
  • if they vibrate during a speech sound, the sound is
    voiced
  • if not, the sound is unvoiced
  • soft palate (velum; Dutch: huig)
  • hard palate (Dutch: gehemelte)
  • tongue
  • teeth
  • lips

10
Speech Production: the Voicing Mechanism
  • Voiced sounds
  • vocal folds vibrate
  • the fundamental frequency is 60 - 300 Hz (movie:
    vocal cords)
  • higher-frequency harmonics are contributed by the
    different oral resonance cavities
  • roughly regular patterns in time and frequency
  • more energy
  • timbres are created by the tongue, the lips, and the
    shape of the main oral resonance cavity
  • Voiceless sounds
  • no vocal vibration
  • no regular patterns
  • less energy

11
Speech Production: Spectrograms and Formants
  • The glottal wave is periodic, consisting of the
    fundamental frequency F0 and a number of
    harmonics (frequencies m·F0, m = 2, 3, ...)
  • Formants are harmonics that are emphasized
    because they are close to the resonances of the
    cavities (in particular articulatory configurations)
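
To make the harmonic and formant description concrete, the sketch below lists the harmonics m·F0 and picks, for each cavity resonance, the harmonic closest to it. The function names and the resonance values in the usage example are hypothetical illustrations, not measurements from the slides.

```python
def harmonics(f0_hz, max_hz):
    """Frequencies m * F0 for m = 1, 2, ... up to max_hz."""
    count = int(max_hz // f0_hz)
    return [m * f0_hz for m in range(1, count + 1)]


def emphasized_harmonics(f0_hz, resonances_hz, max_hz=4000.0):
    """For each cavity resonance, return the harmonic closest to it."""
    candidates = harmonics(f0_hz, max_hz)
    return [min(candidates, key=lambda h: abs(h - r)) for r in resonances_hz]


if __name__ == "__main__":
    # F0 = 100 Hz lies inside the 60 - 300 Hz range given for voiced sounds.
    # The three resonance frequencies are made-up illustrative values.
    print(emphasized_harmonics(100.0, [730.0, 1090.0, 2440.0]))
    # [700.0, 1100.0, 2400.0]
```

With a different F0 the same resonances select different harmonics, which is one way to see that the formant positions are set by the cavities rather than by the pitch.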

12
Speech Production: Spectrograms and Formants
S: /s/ voiceless; ee: /iy/ voiced; S: /z/ voiced
13
Speech Perception
  • Physiology of the Ear
  • Physical vs. Perceptual Attributes
  • Dynamic Range
  • Frequency Analysis
  • Masking

14
Speech Perception: Physiology of the Ear
15
Speech Perception: Physical vs. Perceptual Attributes
  • Physical Quantity                      Perceptual Quality
  • Intensity                              Loudness
  • Fundamental frequency                  Pitch
  • Spectral shape                         Timbre
  • Onset/offset time                      Timing
  • Phase difference in binaural hearing   Location

16
Speech Perception: Physical vs. Perceptual Attributes
17
Speech Perception: Dynamic Range
  • The human ear has a wide dynamic range. An
    important property of the ear is its absolute
    hearing threshold, which varies significantly
    across the frequency range. The figure below
    shows the hearing threshold curve. Sounds whose
    level is below the threshold cannot be heard.
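
The slides show the threshold curve only as a figure. As an illustration of its shape, the sketch below uses a widely quoted closed-form approximation of the threshold in quiet (usually attributed to Terhardt); it is not taken from the slides and the function name is illustrative.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate absolute hearing threshold in dB SPL at freq_hz.

    Closed-form approximation commonly attributed to Terhardt; an
    assumption of this sketch, not a formula from the slides.
    """
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


if __name__ == "__main__":
    for freq in (100.0, 1000.0, 3300.0, 10000.0):
        print(f"{freq:7.0f} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
```

The curve is high at low frequencies, dips below 0 dB SPL around 3 to 4 kHz where the ear is most sensitive, and rises again toward high frequencies.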

18
Speech Perception: Frequency Analysis
  • Many studies indicate that the human hearing
    system processes perceived sound in sub-bands
    called critical bands. In each critical band,
    sound is analyzed independently. Each band
    corresponds to an equal section of the cochlea
    (about 1.3 mm). The critical bandwidth varies
    across the frequency range. Below 500 Hz the
    bandwidths are constant at 100 Hz. Above 500 Hz
    each successive critical band is 20% wider than
    the band below it. The human hearing system can
    therefore be modeled as a set of band-pass
    filters, each with the bandwidth of the
    corresponding critical band.
  • However, the definitions of critical bands given
    by different researchers differ somewhat. The
    definitions of Zwicker and Fletcher are the most
    commonly used in applications.
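
The bandwidth rule stated above (100 Hz wide below 500 Hz, each later band 20% wider than the one before) can be turned into band edges directly. This is a sketch of that simplified rule only, with an illustrative function name; the edges published by Zwicker or Fletcher differ somewhat from what it produces.

```python
def critical_band_edges(max_hz=4000.0):
    """Band edges (Hz) under the simplified rule described on the slide.

    Bands starting below 500 Hz are 100 Hz wide; every band starting at
    or above 500 Hz is 20% wider than the band below it.
    """
    edges = [0.0]
    width = 100.0
    while edges[-1] < max_hz:
        if edges[-1] >= 500.0:
            width *= 1.2
        edges.append(edges[-1] + width)
    return edges


if __name__ == "__main__":
    print([round(e) for e in critical_band_edges(1200.0)])
```

The first edges above 500 Hz come out near 620 Hz and 764 Hz, close to but not identical with tabulated Bark band edges, which is consistent with the remark that the definitions vary between researchers.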

19
Speech Perception: Frequency Analysis
  • In accordance with the definition of critical
    bands, a special psychoacoustic unit, the bark,
    was introduced. One bark corresponds to the width
    of one critical band. The graph below illustrates
    the relation between the objective frequency
    scale (Hz) and the subjective one (bark scale).
  • Bark band   Edge (Hz)   Center (Hz)
  • 1           100         50
  • 2           200         150
  • 3           300         250
  • 4           400         350
  • 5           510         450
  • 6           630         570
  • 7           770         700
  • 8           920         840
  • 9           1080        1000
  • etc.
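
A frequency can be mapped to its Bark band with a simple table lookup. The sketch below covers only the nine bands listed on this slide and assumes that "Edge" means the upper edge of each band, which matches the listed center frequencies; the names are illustrative.

```python
from bisect import bisect_left

# Upper band edges in Hz for Bark bands 1..9, as listed on the slide.
BARK_UPPER_EDGES_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080]


def bark_band(freq_hz):
    """1-based Bark band containing freq_hz (only bands 1..9 are covered)."""
    if not 0 <= freq_hz <= BARK_UPPER_EDGES_HZ[-1]:
        raise ValueError("frequency outside the bands listed on the slide")
    # Index of the first upper edge that is >= freq_hz.
    return bisect_left(BARK_UPPER_EDGES_HZ, freq_hz) + 1


if __name__ == "__main__":
    for freq in (50, 450, 1000):
        print(freq, "Hz -> Bark band", bark_band(freq))
```

Extending the lookup to the full audible range only requires the remaining edges that the slide abbreviates as "etc."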

20
Speech Perception: Masking
  • Masking is an important feature of the human
    hearing system. As a result of masking, some tone
    components of an audible signal cannot be heard if
    another sound close in frequency has a high
    enough level.
  • The masking feature is widely used in sound
    compression algorithms as well as in our
    perceptual noise reduction method.
  • There are two types of masking:
  • non-parallel (time-domain)
  • parallel (frequency-domain), as shown in the figure
    below.

21
Speech Perception: Masking
  • non-parallel (time-domain)
  • parallel (frequency-domain) (left: some curves;
    right: bark scale).

22
Phonetics and Phonology
  • Phonemes
  • Vowels
  • Consonants
  • Phonetic Typology
  • The Allophone: Sound and Context
  • Speech Rate and Coarticulation

23
Phonetics and Phonology: Phonemes
  • The phoneme is the minimal unit of speech sound
    in a language that can distinguish one word from
    another (a perceptual category, or class).
  • "Bat" and "pat" sound similar, but the phonemes
    /b/ and /p/ distinguish the different meanings
    of the words.
  • A phone is a phoneme's acoustic realization.
  • The phoneme /t/ has two different acoustic
    realizations (phones) in the words
  • sat: voiceless alveolar plosive
  • meter: alveolar flap
  • Phoneme and phone will be used interchangeably to
    refer to speaker-independent and
    context-independent units of meaningful
    sound contrast.
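
The "bat" versus "pat" contrast can be expressed as a check on phoneme sequences: two words form a minimal pair when their transcriptions differ in exactly one position. This is an illustrative sketch; the function name and the symbol "ae" for the vowel are assumptions, chosen to match the ASCII-style symbols (/ih/, /eh/) used elsewhere in the slides.

```python
def is_minimal_pair(phonemes_a, phonemes_b):
    """True if two phoneme sequences differ in exactly one position."""
    if len(phonemes_a) != len(phonemes_b):
        return False
    differences = sum(a != b for a, b in zip(phonemes_a, phonemes_b))
    return differences == 1


if __name__ == "__main__":
    bat = ["b", "ae", "t"]   # "ae" is an assumed symbol for the vowel
    pat = ["p", "ae", "t"]
    pin = ["p", "ih", "n"]
    print(is_minimal_pair(bat, pat))   # True: only /b/ vs /p/ differs
    print(is_minimal_pair(bat, pin))   # False: all three positions differ
```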

24
Phonetics and Phonology: PhoneHead
  • sat: voiceless alveolar plosive
  • meter: alveolar flap

(Movie x-ray woman)
25
Phonetics and Phonology: Vowels
  • SIMPLE VOWELS (pure vowels, or monophthongs)
  • A simple vowel is a sound with no detectable
    change in quality from beginning to end.
  • It results from changing the shape and the
    position of the tongue and lips.
  • Simple vowels can be compared in terms of their
    duration, tongue position and lip shape.
  • Examples: sleep, sit, book, boot, ten, after, bird,
    horse, cat, up, far, hot
  • COMPLEX VOWELS (DIPHTHONGS)
  • continually moving tongue shape and changing
    sound quality
  • represented by two vowel symbols but counted as
    one unit
  • the 2 symbols represent the beginning and the end
    of the sound quality
  • the jaw, tongue and lips make a gliding movement
    from the first element of the diphthong to the
    second
  • the first part is much stronger than the second
    part
  • can be classified as either closing or centering
  • Examples: beer, say, fewer, boy, go, bear, high, how

26
Phonetics and Phonology: Vowels
  • Vowels (Dutch: klinkers)
  • the major resonances of the oral and pharyngeal
    cavities lead to the first and second formants,
    F1 and F2

Relative tongue positions => phonological feature
decomposition
27
Phonetics and Phonology: Consonants
  • Consonants (Dutch: medeklinkers)
  • plosive: /p/ (pat, tap); closure in oral cavity
  • nasal: /m/ (team, meat); closure of nasal cavity
  • fricative: /s/ (sick, kiss); turbulent airstream
    noise
  • retroflex liquid: /r/ (rat, tar); vowel-like, tongue
    high and curled back
  • lateral liquid: /l/ (lean, kneel); vowel-like,
    tongue central, side airstream
  • glide: /y/, /w/ (yes, well); vowel-like

28
Phonetics and Phonology: Phonetic Typology
  • Oral, nasal, pharyngeal, and glottal mechanisms can
    produce effects beyond those used in English
  • length: kado (corner), kaado (card) (Japanese,
    Dutch, ...)
  • trilled r: pero (but), perro (dog) (Spanish)
  • pitch: ma (high level: mother), ma (high rising:
    numb), ma (low rising: horse), ma (high
    falling: to scold) (Mandarin Chinese)
  • etc.

29
The Allophone: Sound and Context
  • Coarticulation is the process by which
    neighboring sounds influence one another.
  • The perceivably modified phonemes are called
    allophones.

/p ih n/ pin, /s p ih n/ spin
30
Coarticulation
  • Perseverance: in the early part of the vowel /eh/,
    the articulators are still somewhat set from the
    realization of the initial consonants (/b/, /d/,
    /g/).
  • Anticipation: the later part of the vowel is also
    shaped in anticipation of the ending consonant.

/b eh t/ bet, /d eh t/ debt, /g eh t/ get
31
Syllables and Words Syntax and Semantics
  • Syllables and Words
  • Lexical Part-of-Speech
  • Morphology
  • Word Classes
  • Syntax and Semantics
  • Syntactic Constituents
  • Phrase Schemata
  • Clauses and Sentences
  • Parse Tree Representations
  • Semantic Roles
  • Lexical Semantics
  • Logical Form

32
Syllables
33
Words
34
Lexical Part-of-Speech
35
Morphology
36
Word Classes
37
Syntax and Semantics
  • Syntactic Constituents
  • Phrase Schemata
  • Clauses and Sentences
  • Parse Tree Representations
  • Semantic Roles
  • Lexical Semantics
  • Logical Form

38
Syntax: Syntactic Constituents
39
Syntax: Phrase Schemata
40
Syntax: Clauses and Sentences
41
Syntax: Parse Tree Representations
42
Semantics: Semantic Roles
43
Semantics: Lexical Semantics
44
Semantics: Logical Form