Title: Speech
1. Speech
- Dr. Björn Gambäck
- SICS Swedish Institute of Computer Science AB
- Stockholm, Sweden
2. Reading Instructions
- Daniel Jurafsky and James H. Martin:
  Ch. 1; Ch. 2; Ch. 3; Ch. 4, pp. 91-112, 120-133;
  Ch. 6; Ch. 7.1-7; Ch. 8; Ch. 9; Ch. 10;
  Ch. 11.1-3; Ch. 12.4-5; Ch. 14, pp. 501-527; Ch. 15.1-2, 4-5;
  Ch. 16; Ch. 17; Ch. 18.1-3; Ch. 19; Ch. 20.1-2; Ch. 21
- Douglas Arnold et al. 1994: Machine Translation: An Introductory Guide, Ch. 3, 4, 6, 8, 9.
- Björn Gambäck 1999: "Human Language Technology: The Babel Fish"
- John Kimball 1973: "Seven Principles of Surface Structure Parsing in Natural Languages", Cognition 2(1), pp. 15-47.
- Yorick Wilks and Roberta Catizone 2000: "Human-Computer Conversation", Encyclopedia of Microcomputers, Dekker, New York.
- Victor Zue and James Glass 2000: "Conversational Interfaces: Advances and Challenges", Proceedings of the IEEE 88(8), pp. 1166-1180.
3. Left Recursion
- A left-recursive rule has the form
  A → A α | β
  where α and β are sequences of non-terminal symbols that don't start with A
- The rule can be rewritten as (right-recursive; see the sketch below)
  A → β A′
  A′ → α A′ | ε
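To make the transformation concrete, here is a minimal Python sketch; the rule representation (a mapping from a left-hand side to lists of right-hand-side symbols) and the function name are our own, not from the slides:

```python
# Minimal sketch: eliminating immediate left recursion.
# "A -> A alpha | beta" (beta not starting with A) becomes
# "A -> beta A'", "A' -> alpha A' | epsilon".

def eliminate_left_recursion(lhs, productions):
    """productions: list of right-hand sides, each a list of symbols."""
    recursive = [p[1:] for p in productions if p and p[0] == lhs]  # the alphas
    others = [p for p in productions if not p or p[0] != lhs]      # the betas
    if not recursive:
        return {lhs: productions}  # nothing to rewrite
    new = lhs + "'"
    return {
        lhs: [beta + [new] for beta in others],              # A  -> beta A'
        new: [alpha + [new] for alpha in recursive] + [[]],  # A' -> alpha A' | eps
    }

# Example: NP -> NP PP | Det N  becomes  NP -> Det N NP', NP' -> PP NP' | eps
print(eliminate_left_recursion("NP", [["NP", "PP"], ["Det", "N"]]))
```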
4. Dialogue
- Turn-taking
- Utterances
- Spoken language phenomena
- Discourse markers
- Pauses
5. Content Relevance
- Relevant to Domain
- Non Relevant
- Uninterpretable
- SelfTalk (a speaker talking to himself or herself)
- OffTopic
6. Verbmobil: Speech-to-Speech Translation
- Ca. 900 researchers, 166 million DM (ca. 1 billion birr)
- Speaker-independent spontaneous speech
- Offers assistance in dialogue situations
7. Verbmobil, Phase 1 (1993-1996)
- Appointment negotiation
- German / Japanese → English
- around 2,500 words
- less than six times real time
- 74.2% of proposed translations approx. correct
8. Verbmobil, Phase 2 (1997-2000)
- Bidirectional
- Travel planning, hotel reservation
9. Verbmobil Dialogue Phases
- Hello
- The dialogue participants greet each other and introduce themselves.
- Opening
- The topic to be discussed is introduced.
- Negotiation
- The actual negotiation, between opening and closing.
- Closing
- The discussion is finished (all the participants have agreed).
- Good-Bye
- The dialogue participants say good-bye to each other.
10. Speaker Variation
- Lexical variation
- (which words?)
- Allophonic variation
- (which sounds?)
- Dialect
- Sociolect
- Hearer
- Style
- ...
11. Coarticulation
- A segment is influenced by its neighbours
- Assimilation
- change to be more like the neighbours
- Deletion
12. Text-to-Speech Synthesis (TTS)
- Fundamental frequency contour, F0
- Duration
- Prosody
- Mood
- Canned speech
- Fill in the blanks
- Free speech (generation)
13. Text-to-Speech Synthesis, cont.
- Concatenative synthesis
- Formant synthesis
- Articulatory synthesis
- Diphones
- Polyphones
- PSOLA (Pitch-Synchronous Overlap and Add)
- MBROLA (Multi-Band Resynthesis Overlap and Add)
14. Multimodal Speech Synthesis
15. Automatic Speech Recognition (ASR)
- Speaker-dependent vs. speaker-independent
- (speaker-adaptive?!)
- Isolated words vs. continuous speech
- Small vs. large vocabulary
- (large: 10,000 words or more)
- Broad vocabulary vs. restricted domain
- Speaker recognition/verification
16. ASR System Components
- Bayes decision rule
- All state-of-the-art ASR is based on statistical approaches
- Reasoning under uncertainty
- "The theory of probability is a system for making better guesses" (Feynman)
- Language model
- predicts input based on previous words (N-grams)
- Search engine
- chooses amongst competing hypotheses
17. ASR System Components, cont.
- Acoustic analysis
- Analog speech signal → a sequence of acoustic feature vectors (containing characteristic information about the spoken utterance)
- Global search
- Determines the unknown-length word sequence which most probably caused the observed sequence of acoustic feature vectors
18. Acoustic Confusability
- Overlap → classification error
- Acoustic and linguistic context → reduce overlap
19. Decoding
- Choose the most likely word given an observation (toy example below):
  P(word | observation) = P(w | O) → maximize P(w | O)
- P(w | O) = P(O | w) P(w) / P(O)  (Bayes' rule)
- max P(w | O) = max P(O | w) P(w)
- (since P(O) is the same for all words)
- P(w): prior probability (e.g., frequency)
- P(O | w): likelihood
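As a toy illustration of this decision rule (all probability values below are invented for the example):

```python
# Toy sketch of the Bayesian decision rule: pick the word w maximizing
# P(O|w) * P(w). The numbers are made-up illustration values.

priors = {"wreck": 0.01, "recognize": 0.03}        # P(w), e.g. from frequency
likelihoods = {"wreck": 0.42, "recognize": 0.37}   # P(O|w) from the acoustic model

best = max(priors, key=lambda w: likelihoods[w] * priors[w])
print(best)  # "recognize": lower likelihood, but a much higher prior
# P(O) is omitted: it scales every candidate equally and cannot change the argmax.
```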
20. Probabilistic Formulation
- Objective: minimize word error rate by maximizing
  P(O | W) P(W) / P(O)
- Approach: maximize P(O | W) during training
- Components:
- P(O | W): observation likelihood → Acoustic Model
- P(W): prior probability → Language Model
- P(O): acoustic probability
- (same for all word sequences → ignored during maximization)
21. Probabilistic Method
- Objective: minimize word error rate by maximizing P(O | W) P(W)
- Global search
- Combine two statistical knowledge sources:
- Acoustic Model
- Language Model
- Optimal combination → search process (see the sketch below)
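In practice the two knowledge sources are combined in the log domain to avoid numerical underflow; a minimal sketch, where the language-model weight is an assumed tuning parameter, not a value from the slides:

```python
import math

# Sketch: combining acoustic and language model scores in the log domain.
# log [ P(O|W) * P(W)^lm_weight ]; lm_weight is a hypothetical tuning knob.

def combined_score(log_p_acoustic, log_p_lm, lm_weight=1.0):
    return log_p_acoustic + lm_weight * log_p_lm

hypotheses = {                        # (log P(O|W), log P(W)), invented values
    "two": (math.log(0.30), math.log(0.05)),
    "too": (math.log(0.28), math.log(0.08)),
}
best = max(hypotheses, key=lambda w: combined_score(*hypotheses[w]))
print(best)  # "too" wins once the language model is taken into account
```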
22. ASR System Architecture (Ney 1990)
23. Acoustic Model
- P(O | W): observation likelihood → Acoustic Model
- (class-dependent probability distribution)
- How closely does the hypothesized sequence mimic the observed acoustic sequence?
- (the conditional probability of observing the acoustic feature vectors x1 ... xT when a speaker utters the word sequence w1 ... wN)
- Acoustic analysis (spectral analysis)
- transforms the analog speech signal into a sequence of acoustic feature vectors
- Hidden Markov Models for sub-word units
- approximate speech speed variations
- Pronunciation lexicon
- Decomposition of words into sub-word units
24. Acoustic Analysis
- Transforms the analog speech signal into a sequence of acoustic feature vectors
- Dependencies on the speaker's voice, the acoustic channel, and the environmental conditions should be suppressed
- Short-time spectral analysis of the speech signal (sketched below)
- performed every 10 milliseconds on a short segment
- (e.g., with a length of 25 milliseconds)
25. Decomposition into Sub-Word Units
- Not feasible to estimate probability distributions for each possible word sequence → break into units
- Small vocabulary recognition tasks: the words themselves
- Larger vocabularies → decompose further into sub-word units
- Pronunciation lexicon
- phoneme sequences representing the words
- Coarticulation
- Triphones
- (modeled by Hidden Markov Models)
26. Pronunciation Lexicon
- Store the most likely pronunciations
- Weighted automaton
- (each arc has a probability)
- (the probabilities of all arcs leaving a node sum to 1)
- Pronunciation networks (sketched below)
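A small sketch of such a weighted pronunciation network; the phone inventory, node names, and probabilities are illustrative:

```python
# Weighted pronunciation network for "tomato" (illustrative numbers).
# Each node's outgoing arc probabilities sum to 1.

network = {
    "start": [("t", 1.0, "n1")],
    "n1":    [("ax", 0.7, "n2"), ("ow", 0.3, "n2")],   # reduced vs. full vowel
    "n2":    [("m", 1.0, "n3")],
    "n3":    [("ey", 0.6, "n4"), ("aa", 0.4, "n4")],   # "tom-AY-to" vs. "tom-AH-to"
    "n4":    [("t", 0.8, "n5"), ("dx", 0.2, "n5")],    # full [t] vs. flap
    "n5":    [("ow", 1.0, "end")],
}

def pronunciation_prob(phones, node="start"):
    """Probability the network assigns to a complete phone sequence."""
    if not phones:
        return 1.0 if node == "end" else 0.0
    return sum(p * pronunciation_prob(phones[1:], nxt)
               for phone, p, nxt in network.get(node, ()) if phone == phones[0])

print(pronunciation_prob(["t", "ax", "m", "ey", "t", "ow"]))  # 0.7*0.6*0.8 = 0.336
```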
27. Hidden Markov Models (HMMs)
- Stochastic finite automata: states and transitions
- State set
- models the acoustic characteristics of part of the triphone
- Transition probabilities
- aij: transition from state i to state j
- Observation likelihoods (emission probabilities)
- bi(ot): probability of observation ot coming from state i
- (acoustic emission probability distribution; see the sketch below)
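A minimal sketch of these parameters for a single three-state unit with toy discrete emissions; real systems use Gaussian mixture densities for bi:

```python
# HMM parameters for a 3-state unit; all numbers are illustrative.

states = [0, 1, 2]

# a[i][j]: transition probability from state i to state j
# (self-loops absorb variation in speaking rate)
a = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]

# b[i][o_t]: emission probability of observation o_t in state i,
# here a toy table over two discrete acoustic symbols
b = [{"x": 0.8, "y": 0.2},
     {"x": 0.3, "y": 0.7},
     {"x": 0.5, "y": 0.5}]

def path_prob(state_seq, observations):
    """Joint probability of one state path and the observations,
    assuming the path starts in its first state."""
    prob = b[state_seq[0]][observations[0]]
    for t in range(1, len(observations)):
        prob *= a[state_seq[t - 1]][state_seq[t]] * b[state_seq[t]][observations[t]]
    return prob

print(path_prob([0, 1, 2], ["x", "y", "x"]))  # 0.8 * 0.4*0.7 * 0.3*0.5 = 0.0336
```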
28. Acoustic Modeling, HMMs
- Hidden Markov Models
- temporal variation in the transition probabilities
- Gaussian mixture distributions
- variations in speaker, accent, and pronunciation
- (Hamaker 2002)
29. ANN Hybrids
- Flexible, discriminative classifiers for emission probabilities
- Avoid HMM independence assumptions (can use wider acoustic context)
- Prone to overfitting
- require cross-validation to determine when to stop training
- No substantial recognition improvements over HMM/GMM
30. Recombination: Search Strategies
- The global decision about the most probable word sequence is decomposed into local decisions within the network
- On HMM states and/or on words
- Considerable reduction of search paths
- Two strategies:
- A* (Best-First)
- Viterbi (dynamic programming)
31. Best-First (A*) Search
- The score of each partial word-sequence hypothesis is enhanced by an estimate of the probability of the not yet decoded part of the sentence
- Ĉ(n) = S(n) + G(n)
- Ĉ(n): evaluation function for node n, estimates C(n)
- C(n): actual cost of the optimal path from start to goal through n
- S(n): cost of the path followed so far to n (Ŝ(n) = S(n))
- G(n): estimates the cost of the remaining path from n to the goal (see the sketch below)
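A compact sketch of the search itself, on a toy graph with an assumed heuristic G; in decoding, nodes would be partial word-sequence hypotheses and costs negative log probabilities:

```python
import heapq

# Best-first search driven by C(n) = S(n) + G(n).
# Graph, costs, and the heuristic G are illustrative.

graph = {"start": {"a": 2, "b": 5}, "a": {"goal": 6}, "b": {"goal": 1}}
G = {"start": 4, "a": 5, "b": 1, "goal": 0}   # estimate of remaining cost

def a_star(start, goal):
    frontier = [(G[start], 0, start, [start])]      # (C(n), S(n), node, path)
    while frontier:
        c, s, node, path = heapq.heappop(frontier)  # expand most promising node
        if node == goal:
            return path, s
        for nxt, cost in graph.get(node, {}).items():
            s_next = s + cost                       # S(n): cost so far
            heapq.heappush(frontier, (s_next + G[nxt], s_next, nxt, path + [nxt]))

print(a_star("start", "goal"))  # (['start', 'b', 'goal'], 6)
```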
32. The Viterbi Algorithm (Dynamic Time Warping, Breadth-First)
- All possible word sequences are hypothesized in parallel
- A threshold excludes improbable hypotheses (see the sketch below)
- Based on
- previous path probability
- (getting to state i)
- transition probability
- (getting from i to j)
- observation likelihood
- (state j matches input)
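A minimal sketch of Viterbi with a pruning threshold (beam), on toy HMM parameters; at each time step every state keeps only its best incoming path, and hypotheses far below the best score are discarded:

```python
# Viterbi with beam pruning; all probabilities are illustrative.

states = ["s1", "s2"]
init = {"s1": 0.8, "s2": 0.2}
trans = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.3, "s2": 0.7}}
emit = {"s1": {"x": 0.7, "y": 0.3}, "s2": {"x": 0.2, "y": 0.8}}

def viterbi(observations, beam=0.01):
    probs = {s: init[s] * emit[s][observations[0]] for s in states}
    for o in observations[1:]:
        # best incoming path per state: previous path prob * transition * emission
        probs = {s: max(probs[r] * trans[r][s] for r in probs) * emit[s][o]
                 for s in states}
        best = max(probs.values())
        probs = {s: p for s, p in probs.items() if p >= beam * best}  # threshold
    return max(probs, key=probs.get)

print(viterbi(["x", "y", "y"]))  # most likely final state: "s2"
```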
33. Language Model
- P(W): prior probability → Language Model
- independent of the acoustic feature vectors
- provides the a-priori probability of a given word sequence
- Perplexity: the average number of choices
- N-gram: P(W) ≈ ∏i P(wi | wi-1 wi-2 ... wi-N)  (see the sketch below)
- Part-of-Speech
- Grammar
- Class pairs
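A bigram (N = 2) sketch, estimated by relative frequency from a toy corpus; real language models add smoothing for unseen word pairs:

```python
from collections import Counter

# Bigram language model estimated by relative frequency.

corpus = "we want a meeting we want a room we book a room".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]   # P(w_i | w_{i-1})

def p_sequence(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

print(p_next("want", "we"))                  # 2/3
print(p_sequence("we want a room".split()))  # (2/3) * 1 * (2/3) ~ 0.44
```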
34. Class Pair Grammar
- A word graph: all possible word combinations that the A* search can generate with the help of a search network
- Search network created by
- lexical transcriptions
- class pair grammar
- The class pair grammar defines which words may follow which
- Can't recognize utterances not covered by the grammar
35. A Class Pair Grammar

  Start class pairs    Class pairs        End class pairs
  SILENCE PRON         ART PLACE          PLACE SILENCE
  SILENCE FRAGMENT     PRON OBJECT        OBJECT SILENCE
  SILENCE INTRO        FRAGMENT OBJECT
                       FRAGMENT PLACE
                       OBJECT ART
                       OBJECT VERB
                       VERB ART
                       INTRO FRAGMENT2
                       FRAGMENT2 OBJECT
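A sketch of how a table like the one above can accept or reject a word sequence; the word-to-class lexicon and the word sequences are invented for the example:

```python
# Class-pair check: an utterance is accepted only if every adjacent
# word pair maps to an allowed class pair (subset of the table above).

word_class = {"<sil>": "SILENCE", "i": "PRON", "want": "VERB",
              "the": "ART", "ticket": "OBJECT", "stockholm": "PLACE"}

allowed = {("SILENCE", "PRON"), ("PRON", "OBJECT"), ("OBJECT", "VERB"),
           ("VERB", "ART"), ("ART", "PLACE"), ("PLACE", "SILENCE"),
           ("OBJECT", "SILENCE"), ("OBJECT", "ART")}

def accepts(words):
    classes = [word_class[w] for w in words]
    return all(pair in allowed for pair in zip(classes, classes[1:]))

print(accepts(["<sil>", "i", "ticket", "want", "the", "stockholm", "<sil>"]))  # True
print(accepts(["<sil>", "want", "i"]))  # False: SILENCE VERB is not allowed
```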