1
Speech
  • Dr. Björn Gambäck
  • SICS Swedish Institute of Computer Science AB
  • Stockholm, Sweden

2
Reading Instructions
  • Daniel Jurafsky and James H. Martin: Speech and
    Language Processing
  • Ch. 1; Ch. 2; Ch. 3; Ch. 4, pp. 91-112, 120-133
  • Ch. 6; Ch. 7.1-7; Ch. 8; Ch. 9; Ch. 10
  • Ch. 11.1-3; Ch. 12.4-5; Ch. 14, pp. 501-527;
    Ch. 15.1-2, 4-5
  • Ch. 16; Ch. 17; Ch. 18.1-3; Ch. 19; Ch. 20.1-2;
    Ch. 21
  • Douglas Arnold et al. 1994: Machine Translation:
    An Introductory Guide, Ch. 3, 4, 6, 8, 9.
  • Björn Gambäck 1999: Human Language Technology:
    The Babel Fish.
  • John Kimball 1973: "Seven Principles of Surface
    Structure Parsing in Natural Languages",
    Cognition 2(1), pp. 15-47.
  • Yorick Wilks & Roberta Catizone 2000:
    "Human-Computer Conversation", Encyclopedia of
    Microcomputers, Dekker, New York.
  • Victor Zue & James Glass 2000: "Conversational
    Interfaces: Advances and Challenges", Proceedings
    of the IEEE 88(8), pp. 1166-1180.

3
Left Recursion
  • A left-recursive rule is of the form
  • A → A α
  • A → β
  • where α and β are sequences of symbols that
    don't start with A
  • The rule can be rewritten as (right-recursive,
    see the sketch below)
  • A → β A′
  • A′ → α A′
  • A′ → ε
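A minimal sketch of this rewrite in Python, assuming a grammar stored as a dictionary mapping each non-terminal to a list of right-hand sides; the rule format and the NP/PP example are illustrative, not from the slides:

```python
def eliminate_left_recursion(nonterminal, rules):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon."""
    recursive, nonrecursive = [], []
    for rhs in rules[nonterminal]:
        if rhs and rhs[0] == nonterminal:
            recursive.append(rhs[1:])      # the alpha part of A -> A alpha
        else:
            nonrecursive.append(rhs)       # a beta alternative
    if not recursive:
        return rules                       # rule set is not left-recursive
    fresh = nonterminal + "'"              # the new helper non-terminal A'
    rules[nonterminal] = [beta + [fresh] for beta in nonrecursive]
    rules[fresh] = [alpha + [fresh] for alpha in recursive] + [[]]  # [] = epsilon
    return rules

# NP -> NP PP | Det N   becomes   NP -> Det N NP',  NP' -> PP NP' | epsilon
grammar = {"NP": [["NP", "PP"], ["Det", "N"]]}
print(eliminate_left_recursion("NP", grammar))
```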

4
Dialogue
  • Turn-taking
  • Utterances
  • Spoken language phenomena
  • Discourse markers
  • Pauses

5
Content Relevance
  • Relevant to Domain
  • Non Relevant
  • Uninterpretable
  • SelfTalk (one speaker talking to himself or
    herself)
  • OffTopic

6
Verbmobil: Speech-to-speech translation
  • Ca. 900 researchers; 166 million DM (ca. 1
    billion birr)
  • Speaker-independent spontaneous speech.
  • Offers assistance in dialogue situations

7
Verbmobil, phase 1 (1993-1996)
  • Appointment negotiation
  • German / Japanese → English
  • around 2,500 words
  • less than six times real time
  • 74.2% of proposed translations approx. correct

8
Verbmobil, phase 2 (1997-2000)
  • Bidirectional
  • Travel planning
  • Hotel reservation

9
Verbmobil Dialogue Phases
  • Hello
  • The dialogue participants greet each other and
    introduce themselves.
  • Opening
  • The topic to be discussed is introduced.
  • Negotiation
  • The actual negotiation, between opening and
    closing.
  • Closing
  • The discussion is finished (all the participants
    have agreed).
  • Good-Bye
  • The dialogue participants say good-bye to each
    other.

10
Speaker Variation
  • Lexical variation
  • (which words?)
  • Allophonic variation
  • (which sounds?)
  • Dialect
  • Sociolect
  • Hearer
  • Style
  • ...

11
Coarticulation
  • A segment is influenced by its neighbours
  • Assimilation
  • change to be more like the neighbours
  • Deletion

12
Text-to-Speech Synthesis (TTS)
  • Fundamental frequency contour, F0
  • Duration
  • Prosody
  • Mood
  • Canned speech
  • Fill in the blanks
  • Free speech (generation)

13
Text-to-Speech Synthesis, cont.
  • Concatenative synthesis
  • Formant synthesis
  • Articulatory synthesis
  • Diphones
  • Polyphones
  • PSOLA (Pitch-Synchronous Overlap and Add)
  • MBROLA (Multi-Band Resynthesis Overlap and Add)

14
Multimodal Speech Synthesis
  • Lip movements
  • Eyelids

15
Automatic Speech Recognition (ASR)
  • Speaker-dependent vs. speaker-independent
  • (speaker-adaptive?!)
  • Isolated words vs. continuous speech
  • Small vs. large vocabulary
  • (large: 10,000+ words / word units)
  • Broad vocabulary vs. restricted domain
  • Speaker recognition/verification

16
ASR System Components
  • Bayes decision rule
  • All state-of-the-art ASR is based on statistical
    approaches
  • Reasoning under uncertainty
  • "The theory of probability is a system for
    making better guesses" (Feynman)
  • Language model
  • predicts input based on previous words (N-grams)
  • Search engine
  • chooses amongst competing hypotheses

17
ASR System Components, cont.
  • Acoustic analysis
  • Analog speech signal → a sequence of acoustic
    feature vectors (containing characteristic
    information about the spoken utterance)
  • Global search
  • Determines the word sequence (of unknown length)
    that most probably caused the observed sequence
    of acoustic feature vectors

18
Acoustic Confusability
  • Overlap → classification error
  • Acoustic and linguistic context → reduce overlap
19
Decoding
  • Choose the most likely word given an observation
  • P(word | observation) = P(w|O) → maximize P(w|O)
  • P(w|O) = P(O|w) P(w) / P(O)   (Bayes' rule)
  • max P(w|O) = max P(O|w) P(w)
  • (since P(O) is the same for all words)
  • P(w): prior probability (e.g., frequency)
  • P(O|w): likelihood
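A toy numeric illustration of this decision rule in Python; the candidate words and all probability values are made up for the sketch, and P(O) is dropped since it is constant across candidates:

```python
# Hypothetical priors P(w) and likelihoods P(O|w) for one observation O.
candidates = {
    "knee": (0.00024, 2.0e-6),   # (P(w), P(O|w))
    "new":  (0.00360, 1.0e-6),
    "neat": (0.00056, 5.0e-7),
}

# Bayes decision rule: pick the word maximizing P(O|w) * P(w).
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "new": 3.6e-9 beats 4.8e-10 and 2.8e-10
```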

20
Probabilistic Formulation
  • Objective: Minimize word error rate by maximizing
  • P(O|W) P(W) / P(O)
  • Approach: Maximize P(O|W) during training
  • Components
  • P(O|W): observation likelihood (Acoustic Model)
  • P(W): prior probability (Language Model)
  • P(O): acoustic probability
  • (same for all word sequences; ignored during
    maximization)

21
Probabilistic Method
  • Objective: Minimize word error rate by maximizing
  • P(O|W) P(W)
  • Global search
  • Combine two statistical knowledge sources
  • Acoustic Model
  • Language Model
  • Optimal combination → search process

22
ASR System Architecture (Ney 1990)
23
Acoustic Model
  • P(O|W): observation likelihood (Acoustic Model)
  • (Class-dependent probability distribution)
  • How closely does the hypothesized sequence mimic
    the observed acoustic sequence?
  • (the conditional probability of observing the
    acoustic feature vectors x1 … xT when a speaker
    utters the word sequence w1 … wN)
  • Acoustic analysis (spectral analysis)
  • transforms the analog speech signal into a
    sequence of acoustic feature vectors
  • Hidden Markov Models for sub-word units
  • approximate speech speed variations
  • Pronunciation lexicon
  • Decomposition of words into sub-word units

24
Acoustic Analysis
  • Transforms the analog speech signal into a
    sequence of acoustic feature vectors
  • Dependencies on the speaker's voice, the acoustic
    channel, and the environmental conditions should
    be suppressed
  • Short-time spectral analysis of the speech signal
  • performed every 10 milliseconds on a short
    segment
  • (e.g., with a length of 25 milliseconds)
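A minimal framing sketch of this short-time analysis in Python/NumPy; the 25 ms window and 10 ms shift are from the slide, while the 16 kHz sampling rate and the Hamming taper are illustrative assumptions:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, window_ms=25, shift_ms=10):
    window = int(sample_rate * window_ms / 1000)   # 400 samples per segment
    shift = int(sample_rate * shift_ms / 1000)     # a new frame every 160 samples
    n_frames = 1 + (len(signal) - window) // shift
    frames = np.stack([signal[i * shift : i * shift + window]
                       for i in range(n_frames)])
    return frames * np.hamming(window)             # taper before spectral analysis

speech = np.random.randn(16000)        # one second of dummy "speech"
print(frame_signal(speech).shape)      # (98, 400): about 100 frames per second
```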

25
Decomposition into sub-units
  • Not feasible to estimate probability
    distributions for each possible word sequence →
    break into units
  • Small vocabulary recognition tasks: the words
    themselves
  • Larger vocabularies → decompose further into
    sub-word units
  • Pronunciation lexicon
  • phoneme sequences representing the words
  • Coarticulation
  • Triphones
  • (modeled by Hidden Markov Models)

26
Pronunciation Lexicon
  • Store the most likely pronunciations
  • Weighted automaton
  • (each arc has a probability)
  • (the probabilities of all arcs leaving a node
    sum to 1)
  • Pronunciation networks (sketched below)
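A sketch of such a weighted pronunciation network as a Python adjacency map; the phone labels and arc probabilities (loosely modeled on the familiar "tomato" example) are hypothetical, and the probabilities of the arcs leaving each node sum to 1:

```python
# node -> [(phone label, next node, arc probability), ...]
network = {
    "start": [("t", "n1", 1.0)],
    "n1":    [("ax", "n2", 0.68), ("ow", "n2", 0.32)],  # schwa vs. full vowel
    "n2":    [("m", "n3", 1.0)],
    "n3":    [("ey", "n4", 0.74), ("aa", "n4", 0.26)],  # "-mey-" vs. "-mah-"
    "n4":    [("t", "n5", 1.0)],
    "n5":    [("ow", "end", 1.0)],
}

def path_probability(phones, network, node="start"):
    """Multiply the arc probabilities along the path labelled by phones."""
    prob = 1.0
    for phone in phones:
        arcs = {label: (dest, p) for label, dest, p in network[node]}
        if phone not in arcs:
            return 0.0                      # pronunciation not in the network
        node, prob = arcs[phone][0], prob * arcs[phone][1]
    return prob

print(path_probability(["t", "ow", "m", "aa", "t", "ow"], network))  # 0.0832
```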

27
Hidden Markov Models (HMMs)
  • Stochastic finite automata: states and
    transitions
  • State set
  • models the acoustic characteristics of part of
    a triphone
  • Transition probabilities
  • a_ij: probability of a transition from state i
    to state j
  • Observation likelihoods (emission probabilities)
  • b_i(o_t): probability of observation o_t coming
    from state i
  • (acoustic emission probability distribution)

28
Acoustic Modeling, HMMs
  • Hidden Markov Models
  • temporal variation in the transition
    probabilities
  • Gaussian mixture distributions
  • variations in speaker, accent, and pronunciation

(Hamaker 2002)
29
ANN Hybrids
  • Flexible, discriminative classifiers for emission
    probabilities
  • Avoid HMM independence assumptions (can use wider
    acoustic context)
  • Prone to overfitting
  • require cross-validation to determine when to
    stop training
  • No substantial recognition improvements over
    HMM/GMM

30
Recombination Search Strategies
  • The global decision about the most probable word
    sequence is decomposed into local decisions
    within the network.
  • On HMM states and/or on words.
  • Considerable reduction of search paths
  • Two strategies
  • A* (best-first)
  • Viterbi (dynamic programming)

31
Best-First (A*) Search
  • The score of each partial word sequence
    hypothesis is enhanced by an estimate of the
    probability of the not yet decoded part of the
    sentence
  • Ĉ(n) = Ŝ(n) + Ĝ(n)
  • Ĉ(n): evaluation function for node n, estimates
    C(n)
  • C(n): actual cost of the optimal path from start
    to goal through n
  • Ŝ(n): cost of the path followed so far to n,
    Ŝ(n) ≥ S(n)
  • Ĝ(n): estimates the cost of the remaining path
    from n to the goal
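A minimal best-first sketch using Python's heapq, stated in terms of costs (lower is better): Ŝ(n) accumulates along the path and Ĝ is the supplied estimate of the remaining cost. The toy graph and estimates are hypothetical, and revisited nodes are not pruned, which a real decoder would add:

```python
import heapq

def a_star(start, goal, successors, g_hat):
    """successors(n) -> iterable of (next node, arc cost);
    g_hat(n) estimates the remaining cost from n to the goal."""
    frontier = [(g_hat(start), 0.0, start, [start])]   # (C, S, node, path)
    while frontier:
        c, s, node, path = heapq.heappop(frontier)     # cheapest estimate first
        if node == goal:
            return path, s
        for nxt, cost in successors(node):
            s_next = s + cost                          # S(n): cost so far
            heapq.heappush(frontier,
                           (s_next + g_hat(nxt), s_next, nxt, path + [nxt]))
    return None, float("inf")

graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
estimate = {"a": 2.0, "b": 1.0, "c": 0.0}          # hypothetical G estimates
print(a_star("a", "c", lambda n: graph[n], estimate.get))  # (['a', 'b', 'c'], 2.0)
```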

32
The Viterbi Algorithm (Dynamic Time Warping,
breadth-first)
  • All possible word sequences are hypothesized in
    parallel
  • Threshold excludes improbable hypotheses
  • Based on
  • previous path probability
  • (getting to state i)
  • transition probability
  • (getting from i to j)
  • observation likelihood
  • (state j matches input)
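A compact Viterbi sketch over a toy HMM, combining exactly the three quantities above: previous path probability, transition probability, and observation likelihood. The states, transition and emission values are hypothetical, and the threshold (beam) pruning mentioned above is omitted for brevity:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][j]: probability of the best path ending in state j at time t
    best = [{j: start_p[j] * emit_p[j][observations[0]] for j in states}]
    back = [{}]
    for t, obs in enumerate(observations[1:], start=1):
        best.append({})
        back.append({})
        for j in states:
            # previous path probability * transition probability ...
            prev, p = max(((i, best[t - 1][i] * trans_p[i][j]) for i in states),
                          key=lambda pair: pair[1])
            best[t][j] = p * emit_p[j][obs]   # ... * observation likelihood
            back[t][j] = prev
    # follow the backpointers from the most probable final state
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1], max(best[-1].values())

states = ("S1", "S2")
start = {"S1": 0.8, "S2": 0.2}
trans = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.3, "S2": 0.7}}
emit = {"S1": {"x": 0.7, "y": 0.3}, "S2": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "y", "y"], states, start, trans, emit))
```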

33
Language Model
  • P(W): prior probability (Language Model)
  • independent of the acoustic feature vectors
  • provides the a priori probability of a given
    word sequence
  • Perplexity: the average number of choices
  • N-gram: P(W) ≈ ∏ P(wi | wi−1 wi−2 … wi−N+1)
  • Part-of-Speech
  • Grammar
  • Class pairs
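A minimal bigram (N = 2) language model sketch in Python; the toy corpus and the add-one smoothing are illustrative assumptions, and perplexity is computed as the inverse geometric mean of the bigram probabilities:

```python
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def p_bigram(w_prev, w):
    """Add-one smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(words):
    log_p = sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

print(perplexity("the cat sat".split()))   # about 3.5: few choices on average
```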

34
Class Pair Grammar
  • A word graph: all possible word combinations
    that the A* search can generate with the help of
    a search network
  • Search network created by
  • lexical transcriptions
  • class pair grammar
  • Class pair grammar defines which words may
    follow which
  • Can't recognize utterances not covered by the
    grammar

35
A Class Pair Grammar
  Start class pairs     Class pairs          End class pairs
  SILENCE PRON          ART PLACE            PLACE SILENCE
  SILENCE FRAGMENT      PRON OBJECT          OBJECT SILENCE
  SILENCE INTRO         FRAGMENT OBJECT
                        FRAGMENT PLACE
                        OBJECT ART
                        OBJECT VERB
                        VERB ART
                        INTRO FRAGMENT2
                        FRAGMENT2 OBJECT
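A sketch of how such a class pair grammar constrains recognition, with the table above turned into Python pair sets; an utterance's class sequence is licensed only if it starts with a start pair, ends with an end pair, and every pair in between is in the grammar:

```python
start_pairs = {("SILENCE", "PRON"), ("SILENCE", "FRAGMENT"),
               ("SILENCE", "INTRO")}
pairs = {("ART", "PLACE"), ("PRON", "OBJECT"), ("FRAGMENT", "OBJECT"),
         ("FRAGMENT", "PLACE"), ("OBJECT", "ART"), ("OBJECT", "VERB"),
         ("VERB", "ART"), ("INTRO", "FRAGMENT2"), ("FRAGMENT2", "OBJECT")}
end_pairs = {("PLACE", "SILENCE"), ("OBJECT", "SILENCE")}

def licensed(classes):
    """True if every adjacent pair of word classes is allowed by the grammar."""
    steps = list(zip(classes, classes[1:]))
    return (steps[0] in start_pairs and steps[-1] in end_pairs
            and all(p in pairs for p in steps[1:-1]))

print(licensed(["SILENCE", "PRON", "OBJECT", "ART", "PLACE", "SILENCE"]))  # True
```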