LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing

Provided by: DanJur6
Learn more at: https://web.stanford.edu

1
LING 138/238 SYMBSYS 138: Intro to Computer Speech
and Language Processing
  • Lecture 9: Speech Recognition (I)
  • October 26, 2004
  • Dan Jurafsky

2
Outline for ASR this week
  • Acoustic Phonetics
  • ASR Architecture
  • The Noisy Channel Model
  • Five easy pieces of an ASR system
  • Feature Extraction
  • Acoustic Model
  • Lexicon/Pronunciation Model
  • Decoder
  • Language Model
  • Evaluation

3
Acoustic Phonetics
  • Sound Waves
  • http://www.kettering.edu/drussell/Demos/waves-intro/waves-intro.html
  • http://www.kettering.edu/drussell/Demos/waves/Lwave.gif

4
Waveforms for speech
  • Waveform of the vowel [iy]
  • Frequency: repetitions/second of a wave
  • Above vowel has 28 reps in .11 secs
  • So freq is 28/.11 ≈ 255 Hz
  • This is the speed at which the vocal folds move,
    hence voicing
  • Amplitude: y-axis = amount of air pressure at that
    point in time
  • Zero is normal air pressure, negative is
    rarefaction
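The frequency arithmetic on this slide can be checked with a short sketch. The function name is hypothetical; only the 28 repetitions and the .11 seconds come from the slide.

```python
def frequency_hz(repetitions, duration_secs):
    """Frequency is the number of wave repetitions per second."""
    return repetitions / duration_secs


# Values read off the [iy] waveform on the slide.
iy_freq = frequency_hz(28, 0.11)
print(round(iy_freq))  # roughly 255 Hz
```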

5
She just had a baby
  • What can we learn from a wavefile?
  • Vowels are voiced, long, loud
  • Length in time = length in space in waveform
    picture
  • Voicing: regular peaks in amplitude
  • When stops are closed: no peaks, silence.
  • Peaks = voicing: .46 to .58 (vowel [iy]), from
    second .65 to .74 (vowel [ax]), and so on
  • Silence of stop closure (1.06 to 1.08 for first
    [b], or 1.26 to 1.28 for second [b])
  • Fricatives like [sh]: intense irregular pattern,
    see .33 to .46

6
Examples from Ladefoged
pad
bad
spat
7
Spectra
  • New idea: spectra (singular: spectrum)
  • Different way to view a waveform
  • Fourier analysis: every wave can be represented
    as a sum of many simple waves of different
    frequencies.
  • Articulatory facts
  • The vocal cord vibrations create harmonics
  • The mouth is an amplifier
  • Depending on shape of mouth, some harmonics are
    amplified more than others

8
Part of [ae] waveform from "had"
  • Note complex wave repeating nine times in figure
  • Plus smaller waves, which repeat 4 times for
    every large pattern
  • Large wave has frequency of 250 Hz (9 times in
    .036 seconds)
  • Small wave roughly 4 times this, or roughly 1000
    Hz
  • Two little tiny waves on top of peak of 1000 Hz
    waves

9
A spectrum
  • Spectrum represents these freq components
  • Computed by Fourier transform, algorithm which
    separates out each frequency component of wave.
  • x-axis shows frequency, y-axis shows magnitude
    (in decibels, a log measure of amplitude)
  • Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
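As a rough illustration of how a Fourier transform separates frequency components, the sketch below builds a synthetic wave from a 250 Hz and a 1000 Hz sine (echoing the large and small waves in the [ae] example) and recovers both peaks. It uses a naive discrete Fourier transform instead of a fast one, and the sample rate, window length, and amplitudes are assumptions chosen so the two tones fall exactly on frequency bins.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each frequency bin from 0 up to half the sample rate."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


SAMPLE_RATE = 8000   # assumed samples per second
N = 320              # 0.04 s window, so each bin is 25 Hz wide

# Synthetic stand-in for a vowel: a 250 Hz wave plus a weaker 1000 Hz wave.
wave = [
    math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(wave)
bin_hz = SAMPLE_RATE / N

# The two strongest bins, reported in ascending frequency order.
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = [k * bin_hz for k in sorted(strongest)]
print(peak_freqs)
```

A real spectrum of speech would show broader peaks, since natural vowels are not pure sums of two sines.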

10
Spectrogram
11
Formants
  • Vowels largely distinguished by 2 characteristic
    pitches.
  • One of them (the higher of the two) goes downward
    throughout the series iy ih eh ae aa ao ou u
    (whisper iy eh uw)
  • The other goes up for the first four vowels and
    then down for the next four.
  • creaky voice iy ih eh ae (goes up)
  • creaky voice aa ow uh uw (goes down)
  • These are called the "formants" of the vowels;
    the lower is the 1st formant, the higher is the
    2nd formant.

12
How formants are produced
  • Q: Why do vowels have different pitches if the
    vocal cords vibrate at the same rate?
  • A: The mouth acts as an "amplifier" that amplifies
    different frequencies
  • Formants are the result of different shapes of the
    vocal tract.
  • Any body of air will vibrate in a way that
    depends on its size and shape. Air in the vocal
    tract is set in vibration by the action of the
    vocal cords.
  • Every time the vocal cords open and close, a pulse
    of air from the lungs acts like a sharp tap on the
    air in the vocal tract, setting the resonating
    cavities into vibration so that they produce a
    number of different frequencies.

13
How to read spectrograms
  • "bab": closure of lips lowers all formants, so
    rapid increase in all formants at beginning of
    "bab"
  • "dad": first formant increases, but F2 and F3
    fall slightly
  • "gag": F2 and F3 come together; this is a
    characteristic of velars. Formant transitions
    take longer in velars than in alveolars or labials

14
She came back and started again
  • 1. [sh]: lots of high-freq energy
  • 3. closure for [k]
  • 4. burst of aspiration for [k]
  • 5. [ey] vowel; faint 1100 Hz formant is
    nasalization
  • 6. bilabial nasal
  • 7. short [b] closure, voicing barely visible.
  • 8. [ae]: note upward transitions after bilabial
    stop at beginning
  • 9. note F2 and F3 coming together for [k]

15
Spectrogram for She just had a baby
16
Perceptual properties
  • Pitch: perceptual correlate of frequency
  • Loudness: perceptual correlate of power, which is
    related to square of amplitude

17
Speech Recognition
  • Applications of Speech Recognition (ASR)
  • Dictation
  • Telephone-based Information (directions, air
    travel, banking, etc)
  • Hands-free (in car)
  • Speaker Identification
  • Language Identification
  • Second language ('L2') (accent reduction)
  • Audio archive searching

18
LVCSR
  • Large Vocabulary Continuous Speech Recognition
  • 20,000-64,000 words
  • Speaker independent (vs. speaker-dependent)
  • Continuous speech (vs. isolated-word)

19
LVCSR Design Intuition
  • Build a statistical model of the speech-to-words
    process
  • Collect lots and lots of speech, and transcribe
    all the words.
  • Train the model on the labeled speech
  • Paradigm: Supervised Machine Learning + Search

20
Speech Recognition Architecture
21
The Noisy Channel Model
  • Search through space of all possible sentences.
  • Pick the one that is most probable given the
    waveform.

22
The Noisy Channel Model (II)
  • What is the most likely sentence out of all
    sentences in the language L given some acoustic
    input O?
  • Treat acoustic input O as sequence of individual
    observations
  • O = o1, o2, o3, ..., ot
  • Define a sentence as a sequence of words
  • W = w1, w2, w3, ..., wn

23
Noisy Channel Model (III)
  • Probabilistic implication: pick the
    highest-probability sentence W
  • We can use Bayes rule to rewrite this
  • Since denominator is the same for each candidate
    sentence W, we can ignore it for the argmax
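The equations for this slide did not survive in the transcript. A reconstruction from the surrounding definitions, writing the candidate sentence as W, the acoustic input as O, and the language as L, would be:

```latex
\begin{align*}
\hat{W} &= \operatorname*{argmax}_{W \in L} P(W \mid O) \\
        &= \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)} \\
        &= \operatorname*{argmax}_{W \in L} P(O \mid W)\, P(W)
\end{align*}
```

The last step drops P(O) because it is the same for every candidate W.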

24
A quick derivation of Bayes Rule
  • Conditionals
  • Rearranging
  • And also

25
Bayes (II)
  • We know
  • So rearranging things
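The equations on these two slides are missing from the transcript. The standard derivation that the bullet labels appear to follow, with A and B as generic events (symbols assumed here), is:

```latex
\begin{align*}
\text{Conditionals:} \quad & P(A \mid B) = \frac{P(A, B)}{P(B)},
  \qquad P(B \mid A) = \frac{P(A, B)}{P(A)} \\
\text{Rearranging:} \quad & P(A \mid B)\, P(B) = P(A, B) \\
\text{And also:} \quad & P(B \mid A)\, P(A) = P(A, B) \\
\text{We know:} \quad & P(A \mid B)\, P(B) = P(B \mid A)\, P(A) \\
\text{So rearranging things:} \quad & P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align*}
```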

26
Noisy channel model
  • P(O|W) is the likelihood
  • P(W) is the prior
27
The noisy channel model
  • Ignoring the denominator leaves us with two
    factors: P(Source) and P(Signal|Source)
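A toy sketch of the resulting decision rule: score each candidate by likelihood times prior and keep the best. The candidate sentences and all probability values below are invented for illustration; a real system gets the prior from the language model and the likelihood from the acoustic model.

```python
# Hypothetical scores for three candidate word sequences.
candidates = {
    "recognize speech": {"prior": 0.6, "likelihood": 0.3},
    "wreck a nice beach": {"prior": 0.1, "likelihood": 0.5},
    "wreck an ice peach": {"prior": 0.01, "likelihood": 0.4},
}


def noisy_channel_score(entry):
    """P(Signal|Source) * P(Source); the denominator is ignored."""
    return entry["likelihood"] * entry["prior"]


best = max(candidates, key=lambda w: noisy_channel_score(candidates[w]))
print(best)
```

Note that the second candidate has the highest likelihood but still loses, because its prior is much lower.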

28
Five easy pieces
  • Feature extraction
  • Acoustic Modeling
  • HMMs, Lexicons, and Pronunciation
  • Decoding
  • Language Modeling

29
Feature Extraction
  • Digitize Speech
  • Extract Frames

30
Digitizing Speech
31
Digitizing Speech (A-D)
  • Sampling
  • measuring amplitude of signal at time t
  • 16,000 Hz (samples/sec) Microphone (Wideband)
  • 8,000 Hz (samples/sec) Telephone
  • Why?
  • Need at least 2 samples per cycle
  • max measurable frequency is half sampling rate
  • Human speech < 10,000 Hz, so need max 20K
  • Telephone filtered at 4K, so 8K is enough
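The "at least 2 samples per cycle" rule can be illustrated with a small sketch. At an 8,000 Hz sampling rate, a 5,000 Hz tone (above the 4,000 Hz limit) yields exactly the same sample values as a 3,000 Hz tone with the sign flipped, so the two cannot be told apart after sampling. The function names and the specific tones are assumptions for illustration.

```python
import math


def nyquist_hz(rate_hz):
    """Highest measurable frequency is half the sampling rate."""
    return rate_hz / 2


def sample_tone(freq_hz, rate_hz, n_samples):
    """Sample a pure sine tone at the given rate."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz)
            for n in range(n_samples)]


RATE = 8000  # telephone sampling rate from the slide
below_limit = sample_tone(3000, RATE, 16)
above_limit = sample_tone(5000, RATE, 16)

# The 5000 Hz samples match the 3000 Hz samples up to a sign flip.
indistinguishable = all(
    math.isclose(a, -b, abs_tol=1e-9)
    for a, b in zip(above_limit, below_limit)
)
print(nyquist_hz(RATE), indistinguishable)
```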

32
Digitizing Speech (II)
  • Quantization
  • Representing real value of each amplitude as
    integer
  • 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
  • Formats
  • 16 bit PCM
  • 8 bit mu-law log compression
  • LSB (Intel) vs. MSB (Sun, Apple)
  • Headers
  • Raw (no header)
  • Microsoft wav
  • Sun .au

40 byte header
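A minimal sketch of the two quantization ideas above: linear quantization to a signed integer, and mu-law log compression before 8-bit quantization. It uses the continuous mu-law curve with mu = 255; actual telephone codecs use a piecewise approximation, which this sketch does not reproduce. Amplitudes are assumed to be scaled to the range -1.0 to 1.0.

```python
import math


def quantize(x, bits=16):
    """Map an amplitude in [-1.0, 1.0] to a signed integer of the given width."""
    hi = 2 ** (bits - 1) - 1
    lo = -(2 ** (bits - 1))
    return max(lo, min(hi, round(x * hi)))


def mu_law_compress(x, mu=255):
    """Continuous mu-law curve: expands small amplitudes, compresses large ones."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


quiet = 0.01  # a quiet sample, 1% of full scale
linear_8bit = quantize(quiet, bits=8)
mu_law_8bit = quantize(mu_law_compress(quiet), bits=8)
print(quantize(1.0), linear_8bit, mu_law_8bit)
```

The quiet sample lands on a much larger 8-bit code after mu-law compression, which is why log compression preserves more detail at low amplitudes with the same number of bits.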
33
Frame Extraction
  • A frame (25 ms wide) extracted every 10 ms

[Figure: overlapping 25 ms frames a1, a2, a3, ..., each shifted by 10 ms. Figure from Simon Arnfield]
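Frame extraction can be sketched as a sliding window over the samples. The 25 ms width and 10 ms shift come from the slide; the function name and the choice to drop a trailing partial frame are assumptions.

```python
def frame_signal(samples, rate_hz, width_ms=25, shift_ms=10):
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    width = rate_hz * width_ms // 1000   # samples per frame
    shift = rate_hz * shift_ms // 1000   # samples between frame starts
    return [
        samples[start:start + width]
        for start in range(0, len(samples) - width + 1, shift)
    ]


one_second = [0.0] * 16000               # one second of silence at 16,000 Hz
frames = frame_signal(one_second, 16000)
print(len(frames), len(frames[0]))
```

At 16,000 Hz each frame holds 400 samples and consecutive frames start 160 samples apart, so neighboring frames overlap by 15 ms.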
34
MFCC (Mel Frequency Cepstral Coefficients)
  • Do FFT to get spectral information
  • Like the spectrogram/spectrum we saw earlier
  • Apply Mel scaling
  • Linear below 1kHz, log above, equal samples above
    and below 1kHz
  • Models human ear: more sensitivity in lower freqs
  • Plus Discrete Cosine Transform
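One commonly used formula for the mel scale is sketched below. The slides do not say which formula the course used, so the constants 2595 and 700 are an assumption; this version is a smooth curve that is roughly linear below 1 kHz and logarithmic above, and it maps 1000 Hz to about 1000 mel.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


low_step = hz_to_mel(1000) - hz_to_mel(0)       # a 1000 Hz step at low frequency
high_step = hz_to_mel(5000) - hz_to_mel(4000)   # the same step higher up
print(round(hz_to_mel(1000)), low_step > high_step)
```

The same 1000 Hz step covers far fewer mels at high frequencies, which reflects the ear's greater sensitivity in the lower range.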

35
Final Feature Vector
  • 39 Features per 10 ms frame
  • 12 MFCC features
  • 12 Delta MFCC features
  • 12 Delta-Delta MFCC features
  • 1 (log) frame energy
  • 1 Delta (log) frame energy
  • 1 Delta-Delta (log) frame energy
  • So each frame represented by a 39D vector
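A sketch of how the 39 features might be assembled from 13 static values per frame (12 MFCCs plus log energy). The delta here is a simple two-point difference between neighboring frames, which is an assumption; real front ends often use a regression over a wider window. The input values are placeholders, not real MFCCs.

```python
def deltas(frames):
    """Two-point delta: (next - previous) / 2, repeating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(n - p) / 2 for p, n in zip(prev, nxt)])
    return out


def add_dynamic_features(static):
    """Append delta and delta-delta features to each static frame."""
    d = deltas(static)
    dd = deltas(d)
    return [s + d1 + d2 for s, d1, d2 in zip(static, d, dd)]


# Placeholder input: 5 frames, each with 13 static features.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
print(len(full), len(full[0]))
```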

36
Acoustic Modeling
  • Given a 39D vector corresponding to the
    observation of one frame oi
  • And given a phone q we want to detect
  • Compute p(oi|q)
  • Most popular method
  • GMM (Gaussian mixture models)
  • Other methods
  • MLP (multi-layer perceptron)

37
Acoustic Modeling: MLP computes p(q|o)
38
Gaussian Mixture Models
  • Also called fully-continuous HMMs
  • P(o|q) computed by a Gaussian

39
Gaussians for Acoustic Modeling
  • A Gaussian is parameterized by a mean and a
    variance
  • [Figure: Gaussians with different means; x-axis is o, y-axis is P(o|q)]
  • P(o|q) is highest at the mean
  • P(o|q) is low very far from the mean
40
Training Gaussians
  • A (single) Gaussian is characterized by a mean
    and a variance
  • Imagine that we had some training data in which
    each phone was labeled
  • We could just compute the mean and variance from
    the data
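A minimal sketch of this idea for a single feature: estimate the mean and variance from labeled values for one phone, then evaluate the Gaussian density for new observations. The data values are invented, and the variance here divides by N (the maximum likelihood estimate), which is an assumption about the intended estimator.

```python
import math


def fit_gaussian(values):
    """Estimate mean and variance from the training values for one phone."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical values of one feature in frames labeled with the same phone.
training = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, var = fit_gaussian(training)

at_mean = gaussian_pdf(mean, mean, var)
far_away = gaussian_pdf(mean + 3.0, mean, var)
print(mean, var, at_mean > far_away)
```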

41
But we need 39 gaussians, not 1!
  • The observation o is really a vector of length 39
  • So need a vector of Gaussians

42
Actually, mixture of gaussians
  • Each phone is modeled by a sum of different
    gaussians
  • Hence able to model complex facts about the data

[Figure: mixture-of-Gaussians densities for Phone A and Phone B]
43
Gaussians: acoustic modeling
  • Summary: each phone is represented by a GMM
    parameterized by
  • M mixture weights
  • M mean vectors
  • M covariance matrices
  • Usually assume covariance matrix is diagonal
  • I.e. just keep separate variance for each
    cepstral feature
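The summary above can be sketched as a function that scores one observation vector against a diagonal-covariance GMM. Working in log space with a log-sum-exp is a choice made here for numerical stability, not something the slides specify, and the two-component model below uses invented parameters in 2 dimensions where a real system would use 39.

```python
import math


def diag_gaussian_logpdf(o, mean, var):
    """Log density of a Gaussian with a separate variance per feature."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def gmm_log_likelihood(o, weights, means, variances):
    """log p(o|q) for a mixture of M diagonal Gaussians."""
    logs = [
        math.log(w) + diag_gaussian_logpdf(o, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


# Hypothetical phone model: M = 2 components.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = gmm_log_likelihood([0.1, -0.1], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)
```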

44
ASR Lexicon: Markov Models for pronunciation
45
The Hidden Markov model
46
Formal definition of HMM
  • States: a set of states Q = q1, q2, ..., qN
  • Transition probabilities: a set of probabilities
    A = a01, a02, ..., an1, ..., ann.
  • Each aij represents P(j|i)
  • Observation likelihoods: a set of likelihoods
    B = bi(ot), the probability that state i generated
    observation t
  • Special non-emitting initial and final states
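The definition above could be held in a small data structure like the following sketch. The class layout, the state names, the transition values, and the constant emission function are all hypothetical; in a real system b_i(o_t) would come from the acoustic model. The method computes the probability of one given state path, which is a building block and not the decoding algorithm itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HMM:
    states: list                           # Q, plus non-emitting "start" and "end"
    transitions: dict                      # A: transitions[i][j] = P(j | i)
    emit: Callable[[str, float], float]    # B: emit(i, o_t) = b_i(o_t)

    def path_probability(self, path, observations):
        """Joint probability of one state path and the observations."""
        prob = self.transitions["start"].get(path[0], 0.0)
        for t, (state, obs) in enumerate(zip(path, observations)):
            prob *= self.emit(state, obs)
            nxt = path[t + 1] if t + 1 < len(path) else "end"
            prob *= self.transitions[state].get(nxt, 0.0)
        return prob


# Hypothetical left-to-right model with self-loops for three phones.
hmm = HMM(
    states=["start", "ch", "iy", "z", "end"],
    transitions={
        "start": {"ch": 1.0},
        "ch": {"ch": 0.5, "iy": 0.5},
        "iy": {"iy": 0.5, "z": 0.5},
        "z": {"z": 0.5, "end": 0.5},
    },
    emit=lambda state, obs: 0.1,   # placeholder for the acoustic likelihood
)

p = hmm.path_probability(["ch", "iy", "iy", "z"], [0.0, 0.0, 0.0, 0.0])
print(p)
```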

47
Pieces of the HMM
  • Observation likelihoods (b), p(o|q), represent
    the acoustics of each phone, and are computed by
    the Gaussians (Acoustic Model, or AM)
  • Transition probabilities represent the
    probability of different pronunciations
    (different sequences of phones)
  • States correspond to phones

48
Pieces of the HMM
  • Actually, I lied when I said states correspond to
    phones
  • Actually states usually correspond to triphones
  • CHEESE (phones): ch iy z
  • CHEESE (triphones): -ch+iy, ch-iy+z, iy-z

49
Pieces of the HMM
  • Actually, I lied again when I said states
    correspond to triphones
  • In fact, each triphone has 3 states for
    beginning, middle, and end of the triphone.

50
A real HMM
51
Cross-word triphones
  • Word-Internal Context-Dependent Models
  • OUR LIST
  • SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
  • Cross-Word Context-Dependent Models
  • OUR LIST
  • SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
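The difference between the two schemes can be sketched with one helper that labels each phone with its left and right neighbors, written here as left-phone+right. The helper name and its optional context arguments are assumptions; the phone strings for OUR LIST are taken from the slide.

```python
def triphones(phones, left=None, right=None):
    """Label each phone as left-phone+right, omitting a missing side."""
    padded = [left] + list(phones) + [right]
    labels = []
    for i in range(1, len(padded) - 1):
        prev, phone, nxt = padded[i - 1], padded[i], padded[i + 1]
        label = phone
        if prev is not None:
            label = f"{prev}-{label}"
        if nxt is not None:
            label = f"{label}+{nxt}"
        labels.append(label)
    return labels


our = ["AA", "R"]
list_ = ["L", "IH", "S", "T"]

# Word-internal: context stops at each word boundary.
word_internal = triphones(our) + triphones(list_)

# Cross-word: context runs across the boundary, with silence at both ends.
cross_word = triphones(our + list_, left="SIL", right="SIL")

print(word_internal)
print(cross_word)
```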

52
Summary
  • ASR Architecture
  • The Noisy Channel Model
  • Five easy pieces of an ASR system
  • Feature Extraction
  • 39 MFCC features
  • Acoustic Model
  • Gaussians for computing p(o|q)
  • Lexicon/Pronunciation Model
  • HMM
  • Next time: Decoding, or how to combine these to
    compute words from speech!