Title: LSA 352: Speech Recognition and Synthesis
1. LSA 352: Speech Recognition and Synthesis
- Dan Jurafsky
- Lecture 1
- 1) Overview of Course
- 2) Refresher: Intro to Probability
- 3) Language Modeling
IP notice: some slides for today are from Josh Goodman, Dan Klein, Bonnie Dorr, Julia Hirschberg, and Sandiway Fong
2. Outline
- Overview of Course
- Probability
- Language Modeling
- Language Modeling means probabilistic grammar
3. Definitions
- Speech Recognition
- Speech-to-Text
- Input: a wavefile
- Output: a string of words
- Speech Synthesis
- Text-to-Speech
- Input: a string of words
- Output: a wavefile
4. Automatic Speech Recognition (ASR) / Automatic Speech Understanding (ASU)
- Applications
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Second language ('L2') (accent reduction)
- Audio archive searching
- Linguistic research
- Automatically computing word durations, etc.
5. Applications of Speech Synthesis / Text-to-Speech (TTS)
- Games
- Telephone-based information (directions, air travel, banking, etc.)
- Eyes-free (in car)
- Reading/speaking for the disabled
- Education: reading tutors
- Education: L2 learning
6. Applications of Speaker/Language Recognition
- Language recognition for call routing
- Speaker Recognition
- Speaker verification (binary decision)
- Voice password, telephone assistant
- Speaker identification (one of N)
- Criminal investigation
7. History: foundational insights, 1900s-1950s
- Automaton:
- Markov 1911
- Turing 1936
- McCulloch-Pitts neuron (1943)
- http://marr.bsee.swin.edu.au/dtl/het704/lecture10/ann/node1.html
- http://diwww.epfl.ch/mantra/tutorial/english/mcpits/html/
- Shannon (1948): link between automata and Markov models
- Human speech processing
- Fletcher at Bell Labs (1920s)
- Probabilistic/Information-theoretic models
- Shannon (1948)
8. Synthesis precursors
- Von Kempelen: mechanical (bellows, reeds) speech production simulacrum
- 1929: Channel vocoder (Dudley)
9. History: Early Recognition
- 1920s: Radio Rex
- Celluloid dog with an iron base, held within its house by an electromagnet against the force of a spring
- Current to the magnet flowed through a bridge which was sensitive to energy at 500 Hz
- 500 Hz energy caused the bridge to vibrate, interrupting the current and making the dog spring forward
- The sound [eh] (ARPAbet "eh") in "Rex" has a 500 Hz component
10. History: early ASR systems
- 1950s: Early speech recognizers
- 1952: Bell Labs single-speaker digit recognizer
- Measured energy from two bands (formants)
- Built with analog electrical components
- 2% error rate for a single speaker, isolated digits
- 1958: Dudley built a classifier that used the continuous spectrum rather than just formants
- 1959: Denes ASR combining grammar and acoustic probability
- 1960s:
- FFT: Fast Fourier Transform (Cooley and Tukey 1965)
- LPC: linear prediction (1968)
- 1969: John Pierce letter "Whither Speech Recognition?"
- Random tuning of parameters
- Lack of scientific rigor, no evaluation metrics
- Need to rely on higher-level knowledge
11. ASR: 1970s and 1980s
- Hidden Markov Model (1972)
- Independent application by Baker (CMU) and the Jelinek/Bahl/Mercer lab (IBM), following work of Baum and colleagues at IDA
- ARPA project 1971-1976
- 5-year speech understanding project: 1000-word vocabulary, continuous speech, multi-speaker
- SDC, CMU, BBN
- Only one CMU system achieved the goal
- 1980s:
- Annual ARPA bakeoffs
- Large corpus collection:
- TIMIT
- Resource Management
- Wall Street Journal
12. State of the Art
- ASR
- speaker-independent, continuous, no noise, world's best research systems:
- Human-human speech: 13-20% Word Error Rate (WER)
- Human-machine speech: 3-5% WER
- TTS (demo next week)
13. LVCSR Overview
- Large Vocabulary Continuous (Speaker-Independent) Speech Recognition
- Build a statistical model of the speech-to-words process
- Collect lots of speech and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
14. Unit Selection TTS Overview
- Collect lots of speech (5-50 hours) from one speaker; transcribe very carefully, all the syllables and phones and whatnot
- To synthesize a sentence, patch together syllables and phones from the training data.
- Paradigm: search
15. Requirements and Grading
- Readings
- Required text:
- Selected chapters on the web from:
- Jurafsky & Martin, 2000. Speech and Language Processing.
- Taylor, Paul. 2007. Text-to-Speech Synthesis.
- Grading
- Homework: 75% (3 homeworks, 25% each)
- Participation: 25%
- You may work in groups
16. Overview of the course
- http://nlp.stanford.edu/courses/lsa352/
17. Introduction to Probability
- Experiment (trial)
- Repeatable procedure with well-defined possible outcomes
- Sample Space (S)
- the set of all possible outcomes
- finite or infinite
- Example
- coin toss experiment
- possible outcomes: S = {heads, tails}
- Example
- die toss experiment
- possible outcomes: S = {1, 2, 3, 4, 5, 6}
Slides from Sandiway Fong
18. Introduction to Probability
- Definition of the sample space depends on what we are asking
- Sample Space (S): the set of all possible outcomes
- Example
- die toss experiment for whether the number is even or odd
- possible outcomes: {even, odd}
- not {1, 2, 3, 4, 5, 6}
19. More definitions
- Events
- an event is any subset of outcomes from the sample space
- Example
- die toss experiment
- let A represent the event that the outcome of the die toss is divisible by 3
- A = {3, 6}
- A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}
- Example
- Draw a card from a deck
- suppose the sample space S = {heart, spade, club, diamond} (four suits)
- let A represent the event of drawing a heart
- let B represent the event of drawing a red card
- A = {heart}
- B = {heart, diamond}
20. Introduction to Probability
- Some definitions
- Counting
- suppose operation oi can be performed in ni ways; then
- a sequence of k operations o1 o2 ... ok
- can be performed in n1 × n2 × ... × nk ways
- Example
- die toss experiment, 6 possible outcomes
- two dice are thrown at the same time
- number of sample points in the sample space = 6 × 6 = 36
21. Definition of Probability
- The probability law assigns to an event a nonnegative number
- Called P(A)
- Also called the probability of A
- That encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
22. Probability Axioms
- Nonnegativity
- P(A) ≥ 0, for every event A
- Additivity
- If A and B are two disjoint events, then the probability of their union satisfies
- P(A ∪ B) = P(A) + P(B)
- Normalization
- The probability of the entire sample space S is equal to 1, i.e. P(S) = 1.
23. An example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H, T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes
- Since they have to sum to 1:
- P(H) = 0.5
- P(T) = 0.5
- P({H, T}) = P(H) + P(T) = 1.0
24. Another example
- Experiment involving 3 coin tosses
- Outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable
- Uniform distribution
- What is the probability of the event that exactly 2 heads occur?
- A = {HHT, HTH, THH}
- P(A) = P(HHT) + P(HTH) + P(THH)
- = 1/8 + 1/8 + 1/8
- = 3/8
25. Probability definitions
- In summary:
- Probability of drawing a spade from 52 well-shuffled playing cards:
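Working the example out: with 13 spades in a standard 52-card deck,

```latex
P(\text{spade}) = \frac{13}{52} = \frac{1}{4} = 0.25
```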
26. Probabilities of two events
- If two events A and B are independent
- Then:
- P(A and B) = P(A) × P(B)
- If we flip a fair coin twice
- What is the probability that both come up heads?
- If we draw a card from a deck, then put it back, then draw a card from the deck again
- What is the probability that both drawn cards are hearts?
- A coin is flipped twice
- What is the probability that it comes up heads both times?
27. How about non-uniform probabilities? An example
- A biased coin,
- twice as likely to come up tails as heads,
- is tossed twice
- What is the probability that at least one head occurs?
- Sample space: {hh, ht, th, tt} (h = heads, t = tails)
- Sample points/probabilities for the event:
- ht: 1/3 × 2/3 = 2/9    hh: 1/3 × 1/3 = 1/9
- th: 2/3 × 1/3 = 2/9    tt: 2/3 × 2/3 = 4/9
- Answer: 5/9 ≈ 0.56 (the sum of the probabilities of hh, ht, and th)
28. Moving toward language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
29. Probability and part-of-speech tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these:
- All words: just count all the words in the dictionary
- # of ways to get a verb: the number of words which are verbs!
- If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
30. Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
- In a word-guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
- How likely is it that a person has a disease given that a medical test was negative?
- A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
31. More precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is within some given event B
- We want to quantify the likelihood that the outcome also belongs to some other given event A.
- We need a new probability law that gives us the conditional probability of A given B
- P(A|B)
32. An intuition
- A is "it's raining now".
- P(A) in dry California is .01
- B is "it was raining ten minutes ago"
- P(A|B) means "what is the probability of it raining now if it was raining 10 minutes ago"
- P(A|B) is probably way higher than P(A)
- Perhaps P(A|B) is .10
- Intuition: the knowledge about B should change our estimate of the probability of A.
33. Conditional probability
- One of the following 30 items is chosen at random (figure omitted)
- What is P(X), the probability that it is an X?
- What is P(X|red), the probability that it is an X given that it is red?
34. Conditional Probability
- let A and B be events
- p(B|A) = the probability of event B occurring given that event A occurs
- definition: p(B|A) = p(A ∩ B) / p(A)
35. Conditional probability
Note: P(A,B) = P(A|B) P(B). Also: P(A,B) = P(B,A)
36. Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) P(B) iff A, B independent.
- P(heads, tails) = P(heads) P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
37. Bayes' Theorem
- Swap the conditioning
- Sometimes easier to estimate one kind of dependence than the other
38. Deriving Bayes' Rule
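The derivation follows directly from the definition of conditional probability:

```latex
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}
\;\Rightarrow\; P(A, B) = P(B \mid A)\,P(A)
\;\Rightarrow\; P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```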
39. Summary
- Probability
- Conditional Probability
- Independence
- Bayes' Rule
40. How many words?
- "I do uh main- mainly business data processing"
- Fragments
- Filled pauses
- Are "cat" and "cats" the same word?
- Some terminology:
- Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
- "Cat" and "cats" = same lemma
- Wordform: the full inflected surface form
- "Cat" and "cats" = different wordforms
41. How many words?
- "they picnicked by the pool then lay back on the grass and looked at the stars"
- 16 tokens
- 14 types
- SWBD:
- 20,000 wordform types,
- 2.4 million wordform tokens
- Brown et al. (1992) large corpus:
- 583 million wordform tokens
- 293,181 wordform types
- Let N = number of tokens, V = vocabulary = number of types
- General wisdom: V > O(sqrt(N))
42. Language Modeling
- We want to compute P(w1, w2, w3, w4, w5, ..., wn), the probability of a sequence
- Alternatively we want to compute P(w5 | w1, w2, w3, w4), the probability of a word given some previous words
- The model that computes P(W) or P(wn | w1, w2, ..., wn-1) is called the language model.
- A better term for this would be "The Grammar"
- But "language model" or "LM" is standard
43. Computing P(W)
- How to compute this joint probability:
- P(the, other, day, I, was, walking, along, and, saw, a, lizard)
- Intuition: let's rely on the Chain Rule of Probability
44. The Chain Rule of Probability
- Recall the definition of conditional probabilities
- Rewriting:
- More generally:
- P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
- In general:
- P(x1, x2, x3, ..., xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)
45. The Chain Rule applied to the joint probability of words in a sentence
- P(the big red dog was) =
- P(the) P(big|the) P(red|the big) P(dog|the big red) P(was|the big red dog)
46. Very easy estimate
- How to estimate?
- P(the | its water is so transparent that)
- P(the | its water is so transparent that) =
  C(its water is so transparent that the) / C(its water is so transparent that)
47. Unfortunately
- There are a lot of possible sentences
- We'll never be able to get enough data to compute the statistics for those long prefixes
- P(lizard | the, other, day, I, was, walking, along, and, saw, a)
- Or
- P(the | its water is so transparent that)
48. Markov Assumption
- Make the simplifying assumption
- P(lizard | the, other, day, I, was, walking, along, and, saw, a) ≈ P(lizard | a)
- Or maybe
- P(lizard | the, other, day, I, was, walking, along, and, saw, a) ≈ P(lizard | saw, a)
49. Markov Assumption
- So for each component in the product, replace it with the approximation (assuming a prefix of N)
- Bigram version (sketched below)
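In the usual N-gram notation, the approximation and its bigram special case are:

```latex
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})
\qquad\text{bigram: } P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})
```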
50. Estimating bigram probabilities
- The Maximum Likelihood Estimate (see below)
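The MLE of a bigram probability is the relative frequency:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}
```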
51. An example
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model)
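A minimal sketch of MLE bigram estimation on this tiny corpus (the helper names are illustrative, not from the original slides):

```python
from collections import Counter

# Tiny training corpus from the slide, with sentence-boundary markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_mle(word, prev):
    """Maximum likelihood estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2
```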
52. Maximum Likelihood Estimates
- The maximum likelihood estimate of some parameter of a model M from a training set T
- is the estimate that maximizes the likelihood of the training set T given the model M
- Suppose the word "Chinese" occurs 400 times in a corpus of a million words (Brown corpus)
- What is the probability that a random word from some other text will be "Chinese"?
- MLE estimate is 400/1,000,000 = .0004
- This may be a bad estimate for some other corpus
- But it is the estimate that makes it most likely that "Chinese" will occur 400 times in a million-word corpus.
53. More examples: Berkeley Restaurant Project sentences
- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i'm looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i'm looking for a good place to eat breakfast
- when is caffe venezia open during the day
54. Raw bigram counts
55. Raw bigram probabilities
- Normalize by unigrams
- Result:
56. Bigram estimates of sentence probabilities
- P(<s> I want english food </s>) =
- P(i|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
- = .24 × .33 × .0011 × 0.5 × 0.68
- = .000031
57. What kinds of knowledge?
- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat|to) = .28
- P(food|to) = 0
- P(want|spend) = 0
- P(i|<s>) = .25
58. The Shannon Visualization Method
- Generate random sentences (see the sketch after this list):
- Choose a random bigram (<s>, w) according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
- <s> I
- I want
- want to
- to eat
- eat Chinese
- Chinese food
- food </s>
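A minimal sketch of this sampler; the bigram_probs table below is a toy, hand-filled stand-in for a real model's conditional probabilities:

```python
import random

# Toy table of P(next | prev); a real table would come from bigram counts.
bigram_probs = {
    "<s>":     {"I": 0.25, "you": 0.75},
    "I":       {"want": 1.0},
    "you":     {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 0.5, "food": 0.5},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def shannon_sentence(probs):
    """Repeatedly draw the next word from P(. | previous word) until </s>."""
    word, words = "<s>", []
    while True:
        nxt = random.choices(list(probs[word]), weights=list(probs[word].values()))[0]
        if nxt == "</s>":
            return " ".join(words)
        words.append(nxt)
        word = nxt

print(shannon_sentence(bigram_probs))  # e.g. "I want to eat Chinese food"
```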
60. Shakespeare as corpus
- N = 884,647 tokens, V = 29,066
- Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
61. The Wall Street Journal is not Shakespeare (no offense)
62. Evaluation
- We train the parameters of our model on a training set.
- How do we evaluate how well our model works?
- We look at the model's performance on some new data
- This is what happens in the real world: we want to know how our model performs on data we haven't seen
- So: a test set. A dataset which is different from our training set
- Then we need an evaluation metric to tell us how well our model is doing on the test set.
- One such metric is perplexity (to be introduced below)
63. Unknown words: open versus closed vocabulary tasks
- If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
- Often we don't know this
- Out Of Vocabulary = OOV words
- Open vocabulary task
- Instead: create an unknown-word token <UNK> (a small sketch follows this list)
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At the text normalization phase, any training word not in L is changed to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in training
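A minimal sketch of the <UNK> normalization step (the lexicon size and toy sentences are illustrative only):

```python
from collections import Counter

def build_lexicon(tokens, max_vocab):
    """Fixed lexicon L: keep the max_vocab most frequent training wordforms."""
    return {w for w, _ in Counter(tokens).most_common(max_vocab)}

def normalize(tokens, lexicon):
    """Text normalization: any word outside L becomes the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the cat sat on the mat while the cat slept".split()
L = build_lexicon(train, max_vocab=3)
print(normalize(train, L))                  # rare training words -> <UNK>
print(normalize("the dog sat".split(), L))  # unseen test words -> <UNK>
```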
64. Evaluating N-gram models
- Best evaluation for an N-gram:
- Put model A in a speech recognizer
- Run recognition, get word error rate (WER) for A
- Put model B in the speech recognizer, get word error rate for B
- Compare WER for A and B
- In-vivo evaluation
65. Difficulty of in-vivo evaluation of N-gram models
- In-vivo evaluation
- This is really time-consuming
- Can take days to run an experiment
- So,
- as a temporary solution, in order to run experiments,
- to evaluate N-grams we often use an approximation called perplexity
- But perplexity is a poor approximation unless the test data looks just like the training data
- So it is generally only useful in pilot experiments (generally not sufficient to publish)
- But it is helpful to think about.
66. Perplexity
- Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words (formulas sketched below)
- Chain rule
- For bigrams
- Minimizing perplexity is the same as maximizing probability
- The best language model is one that best predicts an unseen test set
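For a test set W = w1 w2 ... wN, the standard definitions are:

```latex
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
             = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
             = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
\qquad\text{(bigrams: } \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}\text{)}
```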
67. A totally different perplexity intuition
- How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9, "oh"? Easy: perplexity 11 (or, if we ignore "oh", perplexity 10)
- How hard is recognizing (30,000) names at Microsoft? Hard: perplexity = 30,000
- If a system has to recognize
- Operator (1 in 4)
- Sales (1 in 4)
- Technical Support (1 in 4)
- 30,000 names (1 in 120,000 each)
- Perplexity is 54
- Perplexity is the weighted equivalent branching factor
Slide from Josh Goodman
68. Perplexity as branching factor
69. Lower perplexity = better model
- Training: 38 million words; test: 1.5 million words, WSJ
70. Lesson 1: the perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models, adapt to the test set, etc.
71. Lesson 2: zeros or not?
- Zipf's Law:
- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high-frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low-frequency events
- Result:
- Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
- Some of the zeroes in the table are really zeros. But others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
- How to address this?
- Answer:
- Estimate the likelihood of unseen N-grams!
Slide adapted from Bonnie Dorr and Julia Hirschberg
72. Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)
Slide from Dan Klein
73. Laplace smoothing
- Also called add-one smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate
- Laplace estimate
- Reconstructed counts (formulas for all three sketched below)
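With V the vocabulary size and N the token count, the unigram and bigram versions are:

```latex
P_{\mathrm{MLE}}(w_i) = \frac{c_i}{N}
\qquad
P_{\mathrm{Laplace}}(w_i) = \frac{c_i + 1}{N + V}
\qquad
c_i^{*} = (c_i + 1)\,\frac{N}{N + V}
\qquad
P_{\mathrm{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```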
74. Laplace-smoothed bigram counts
75. Laplace-smoothed bigrams
76. Reconstituted counts
77. Note the big change to counts
- C(want to) went from 608 to 238!
- P(to|want) from .66 to .26!
- Discount d = c*/c
- d for "chinese food" = .10! A 10x reduction
- So in general, Laplace is a blunt instrument
- Could use a more fine-grained method (add-k)
- But Laplace smoothing is not used for N-grams, as we have much better methods
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
- for pilot studies
- and in domains where the number of zeros isn't so huge.
78. Better discounting algorithms
- The intuition used by many smoothing algorithms
- Good-Turing
- Kneser-Ney
- Witten-Bell
- is to use the count of things we've seen once to help estimate the count of things we've never seen
79. Good-Turing: Josh Goodman's intuition
- Imagine you are fishing
- There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass
- You have caught
- 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
- How likely is it that the next species is new (i.e. catfish or bass)?
- 3/18
- Assuming so, how likely is it that the next species is trout?
- Must be less than 1/18
Slide adapted from Josh Goodman
80. Good-Turing Intuition
- Notation: Nx is the frequency of frequency x
- So N10 = 1, N1 = 3, etc.
- To estimate the total number of unseen species:
- use the number of species (words) we've seen once
- c0 is estimated from c1: p0 = N1/N
- All other estimates are adjusted (down) to give probabilities for the unseen
Slide from Josh Goodman
81. Good-Turing Intuition
- Notation: Nx is the frequency of frequency x
- So N10 = 1, N1 = 3, etc.
- To estimate the total number of unseen species:
- use the number of species (words) we've seen once
- c0 is estimated from c1: p0 = N1/N = 3/18
- All other estimates are adjusted (down) to give probabilities for the unseen
P(eel): c*(1) = (1+1) × N2/N1 = 2 × 1/3 = 2/3
Slide from Josh Goodman
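A minimal sketch of the Good-Turing bookkeeping for the fishing example (variable names are illustrative):

```python
from collections import Counter

# Fishing example: how many times each species has been caught.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                    # 18 fish in total

# N_c = number of species seen exactly c times (frequency of frequency c).
freq_of_freq = Counter(catch.values())     # {10: 1, 3: 1, 2: 1, 1: 3}

# Probability mass reserved for unseen species: p0 = N1 / N = 3/18.
p_unseen = freq_of_freq[1] / N

# Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c (needs N_{c+1} > 0).
def adjusted_count(c):
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(p_unseen)               # 0.166...  (3/18)
print(adjusted_count(1))      # 0.666...  (2 * N2/N1 = 2 * 1/3)
print(adjusted_count(1) / N)  # ~0.037, the GT estimate for e.g. trout, < 1/18
```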
83. Bigram frequencies of frequencies and GT re-estimates
84. Complications
- In practice, we assume large counts (c > k for some k) are reliable
- That complicates the formula for c*
- Also, we assume singleton counts (c = 1) are unreliable, so we treat N-grams with a count of 1 as if they had count 0
- Also, the Nk need to be non-zero, so we need to smooth (interpolate) the Nk counts before computing c* from them
85. Backoff and Interpolation
- Another really useful source of knowledge
- If we are estimating:
- trigram p(z|x,y)
- but c(x,y,z) is zero
- Use info from:
- bigram p(z|y)
- Or even:
- unigram p(z)
- How to combine the trigram/bigram/unigram info?
86. Backoff versus interpolation
- Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram
- Interpolation: mix all three
87. Interpolation
- Simple interpolation (sketched below)
- Lambdas conditional on context
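The standard simple-interpolation formula is:

```latex
\hat{P}(w_n \mid w_{n-2} w_{n-1})
  = \lambda_1 P(w_n \mid w_{n-2} w_{n-1})
  + \lambda_2 P(w_n \mid w_{n-1})
  + \lambda_3 P(w_n),
\qquad \sum_i \lambda_i = 1
```

With context-conditioned lambdas, each λi becomes a function λi(w_{n-2}^{n-1}) of the preceding words.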
88. How to set the lambdas?
- Use a held-out corpus
- Choose the lambdas which maximize the probability of some held-out data
- i.e., fix the N-gram probabilities
- Then search for lambda values
- that, when plugged into the previous equation,
- give the largest probability for the held-out set
- Can use EM to do this search
89. Katz Backoff
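In the bigram case, Katz backoff has the standard form (P* is the discounted, e.g. Good-Turing, estimate):

```latex
P_{\mathrm{katz}}(w_i \mid w_{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-1}) & \text{if } C(w_{i-1} w_i) > 0 \\[4pt]
\alpha(w_{i-1})\, P(w_i) & \text{otherwise}
\end{cases}
```

where α(w_{i-1}) spreads the probability mass left over by the discounting across the unseen bigrams.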
90. Why discounted P* and alpha?
- MLE probabilities sum to 1
- So if we used MLE probabilities but backed off to a lower-order model when the MLE prob is zero
- we would be adding extra probability mass
- and the total probability would be greater than 1
91. GT-smoothed bigram probs
92. Intuition of backoff + discounting
- How much probability to assign to all the zero trigrams?
- Use GT or another discounting algorithm to tell us
- How to divide that probability mass among different contexts?
- Use the (N-1)-gram estimates to tell us
- What do we do for unigram words not seen in training?
- Out Of Vocabulary = OOV words
93. OOV words: the <UNK> word
- Out Of Vocabulary = OOV words
- We don't use GT smoothing for these
- because GT assumes we know the number of unseen events
- Instead: create an unknown-word token <UNK>
- Training of <UNK> probabilities:
- Create a fixed lexicon L of size V
- At the text normalization phase, any training word not in L is changed to <UNK>
- Now we train its probabilities like a normal word
- At decoding time:
- If text input: use <UNK> probabilities for any word not in training
94. Practical Issues
- We do everything in log space
- Avoids underflow
- (also, adding is faster than multiplying; a small sketch follows)
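A minimal sketch of the log-space trick, reusing the bigram factors from the earlier sentence-probability slide:

```python
import math

# Bigram factors for "<s> I want english food </s>" (from the earlier slide).
probs = [0.24, 0.33, 0.0011, 0.5, 0.68]

p_linear = math.prod(probs)                 # ~3.0e-05; long sentences would underflow
log_p = sum(math.log(p) for p in probs)     # summing logs stays numerically safe
print(p_linear, log_p, math.exp(log_p))     # exp(sum of logs) equals the product
```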
95. ARPA format
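Roughly, an ARPA-format LM file lists each N-gram with its log10 probability and, for lower-order N-grams, a log10 backoff weight; the numbers below are made-up placeholders:

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.2041  <s>    -0.5
-0.6990  i      -0.3
-0.9031  want   -0.4
-1.2041  </s>

\2-grams:
-0.3010  <s> i
-0.4771  i want

\end\
```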
97. Language Modeling Toolkits
- SRILM
- CMU-Cambridge LM Toolkit
98. Google N-Gram Release
99. Google N-Gram Release
- serve as the incoming 92
- serve as the incubator 99
- serve as the independent 794
- serve as the index 223
- serve as the indication 72
- serve as the indicator 120
- serve as the indicators 45
- serve as the indispensable 111
- serve as the indispensible 40
- serve as the individual 234
100. Advanced LM stuff
- Current best smoothing algorithm:
- Kneser-Ney smoothing
- Other stuff:
- Variable-length n-grams
- Class-based n-grams
- Clustering
- Hand-built classes
- Cache LMs
- Topic-based LMs
- Sentence mixture models
- Skipping LMs
- Parser-based LMs
101. Summary
- LM
- N-grams
- Discounting: Good-Turing
- Katz backoff with Good-Turing discounting
- Interpolation
- Unknown words
- Evaluation:
- Entropy, Entropy Rate, Cross-Entropy
- Perplexity
- Advanced LM algorithms