1
Automatic Speech Recognition
Chai Wutiwiwatchai, Ph.D. Human Language
Technology Laboratory NECTEC
2
Outline
  • Hidden Markov Model (HMM)
    - Sequence Modeling Problems
    - HMM
  • Automatic Speech Recognition (ASR)
    - Classification of ASR
    - ASR Formulation
    - Components

3
Sequence Modeling Problems
4
Sequence Modeling
Forecasting Tomorrow's Weather
  • Given weather sequences observed in the past
  • Create a model that can predict tomorrow's weather
5
Sequence Modeling
  • Prediction is done by calculating the probability of each candidate sequence of weather states (Sunny, Rain, Cloud)
  • and selecting the maximum-probability sequence (see the worked factorization below)
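As an illustration (assuming a first-order Markov chain over the weather states, which the slide does not spell out), the probability of the three-day sequence Sunny, Sunny, Rain factorizes as

  P(Sunny, Sunny, Rain) = P(Sunny) · P(Sunny | Sunny) · P(Rain | Sunny)

and the forecast picks the candidate sequence whose product is largest.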

6
Sequence Modeling
Modeling Brazil's World-Cup Results
  • Given match results from the past
  • How likely is it that Brazil will never lose in the next tournament?

7
Sequence Classification
Genomic DNA Sequencing
  • Given a long string of genes
  • Classify substrings into DNA types

8
Sequence Classification
  • where each type (e.g. EXON, INTRON) has its own characteristics
9
Sequence Classification
Football Event Detection
(Example events: Free-kick, Foul, Adv.)
10
In Conclusion
Problem 1 (Training): How to create a model λ, given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O | λ)?
11
Hidden Markov Model (HMM)
12
Sequence Modeling Problem
Problem 1 (Training): How to create a model λ, given a training set of observation sequences O?
Problem 2 (Scoring): How to compute the probability of an observation sequence O given a model λ, i.e. P(O | λ)?
13
Example Urn-and-Ball 1
Rabiner, L.R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
  • Mary plays a game with a blind observer
  • (1) Mary initially selects an urn at random
  • (2) Mary picks a colored ball and tells its color to the blind observer
  • (3) Mary puts the ball back in the urn
  • (4) Mary moves to the next urn randomly
  • (5) Repeat steps 2-4

14
Example Urn-and-Ball 2
  • Observation Sequence - the color sequence of the picked-up balls
  • States - the identity of the urns
  • State Transitions - the process of selecting the urns
  • Initial State - the identity of the first selected urn
15
Example Urn-and-Ball 3
Initial state probabilities: π1 = 0.4, π2 = 0.3, π3 = 0.3
Transition probabilities between urns 1, 2, 3:
  a11 = 0.3  a12 = 0.6  a13 = 0.1
  a21 = 0.3  a22 = 0.5  a23 = 0.2
  a31 = 0.7  a32 = 0.1  a33 = 0.2
Emission probabilities:
  b1(R) = 0.3   b1(G) = 0.2   b1(B) = 0.2   b1(Y) = 0.3
  b2(R) = 0.33  b2(G) = 0.17  b2(B) = 0.17  b2(Y) = 0.33
  b3(R) = 0.2   b3(G) = 0.3   b3(B) = 0.3   b3(Y) = 0.2
16
HMM Notations
  • An observation sequence O = o1 o2 ... oT
  • A set of N states S = {s1, ..., sN}
  • The resulting state sequence Q = q1 q2 ... qT
  • Transition probabilities A = {aij}, where aij = P(qt+1 = sj | qt = si)
  • A set of M observation symbols V = {v1, ..., vM}
  • Emission probabilities B = {bj(k)}, where bj(k) = P(ot = vk | qt = sj)
  • Initial state probabilities π = {πi}, where πi = P(q1 = si)
17
HMM Notations In Short
  • An HMM is symbolized compactly by λ = (A, B, π)
18
HMM Topology
Left-to-Right Model
Ergodic Model
19
Example World Cup
Brazil's match results modeled by an HMM (example)
20
Scoring
Forward Algorithm
  • Define the forward variable αt(i) = P(o1 o2 ... ot, qt = si | λ)
Step 1 (Initialization): α1(i) = πi bi(o1), for 1 ≤ i ≤ N
Step 2 (Induction): αt+1(j) = [ Σi αt(i) aij ] bj(ot+1)
Step 3 (Termination): P(O | λ) = Σi αT(i)
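A minimal NumPy sketch of the Forward algorithm for the scoring problem. It reuses the urn-and-ball parameters from the earlier slide; the variable names (pi, A, B, obs) are illustrative, not from any particular toolkit.

import numpy as np

pi = np.array([0.4, 0.3, 0.3])                 # initial state probabilities
A = np.array([[0.3, 0.6, 0.1],                 # transition probabilities a_ij
              [0.3, 0.5, 0.2],
              [0.7, 0.1, 0.2]])
B = np.array([[0.30, 0.20, 0.20, 0.30],        # emission probabilities b_i(k)
              [0.33, 0.17, 0.17, 0.33],        # for the symbols R, G, B, Y
              [0.20, 0.30, 0.30, 0.20]])
obs = [0, 3, 1]                                # an observed color sequence: R, Y, G

alpha = pi * B[:, obs[0]]                      # Step 1: initialization
for o_t in obs[1:]:                            # Step 2: induction over t
    alpha = (alpha @ A) * B[:, o_t]
print("P(O|lambda) =", alpha.sum())            # Step 3: termination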
21
Forward Algorithm
(Trellis illustration: the three states S1, S2, S3 unrolled over successive time frames; each node sums the forward probabilities arriving from every predecessor state.)
22
Example
What is the probability of generating a given observation sequence, computed using the Forward algorithm?
23
Scoring
Viterbi Algorithm
  • Define δt(i) = max over q1 ... qt-1 of P(q1 ... qt-1, qt = si, o1 ... ot | λ)
Step 1 (Initialization): δ1(i) = πi bi(o1)
Step 2 (Recursion): δt+1(j) = [ maxi δt(i) aij ] bj(ot+1), keeping a back-pointer ψt+1(j) = argmaxi δt(i) aij
Step 3 (Termination): P* = maxi δT(i); backtrack through ψ to recover the best state sequence
24
Viterbi Algorithm Illustration
(Trellis illustration: at each time frame, only the best-scoring incoming path into each state S1, S2, S3 is kept.)
25
Viterbi vs. Forward
(Illustration: the Forward algorithm sums the probabilities of all paths entering a state, whereas Viterbi keeps only the maximum.)
26
Example
What is the probability of generating a given observation sequence, computed using the Viterbi algorithm?
27
Viterbi in Log Domain
  • Due to numerical underflow, it is common to run the Viterbi algorithm in the log domain
Step 1: φ1(i) = log πi + log bi(o1)
Step 2: φt+1(j) = maxi [ φt(i) + log aij ] + log bj(ot+1)
Step 3: log P* = maxi φT(i)
where φt(i) = log δt(i)
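A minimal sketch of log-domain Viterbi decoding, reusing the same illustrative urn-and-ball parameters as the Forward sketch above (not production code):

import numpy as np

def viterbi_log(pi, A, B, obs):
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    phi = log_pi + log_B[:, obs[0]]            # Step 1: initialization
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):                      # Step 2: recursion with back-pointers
        scores = phi[:, None] + log_A          # scores[i, j] = phi_t(i) + log a_ij
        backptr[t] = scores.argmax(axis=0)
        phi = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(phi.argmax())]                 # Step 3: termination and backtracking
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return phi.max(), path[::-1]

pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.3, 0.6, 0.1], [0.3, 0.5, 0.2], [0.7, 0.1, 0.2]])
B = np.array([[0.30, 0.20, 0.20, 0.30],
              [0.33, 0.17, 0.17, 0.33],
              [0.20, 0.30, 0.30, 0.20]])
print(viterbi_log(pi, A, B, [0, 3, 1]))        # log P* and the best state sequence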
28
CDHMM
  • Discrete HMM
    - Models discrete symbols, such as World-Cup match results
  • Continuous-Density HMM (CDHMM)
    - Models continuous values, such as speech feature vectors
    - The emission probability bj(o) then becomes a probability density function of the observation o

29
Mixture Gaussian PDF
  • A typical pdf is the M-component mixture Gaussian:
    bj(o) = Σm=1..M cjm N(o; μjm, Σjm),  with Σm cjm = 1
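A small sketch of evaluating an M-component diagonal-covariance mixture Gaussian pdf bj(o); the weights, means, and variances below are illustrative values only:

import numpy as np

def gmm_pdf(o, weights, means, variances):
    # o: (D,) feature vector; weights: (M,); means, variances: (M, D)
    norm = np.prod(2 * np.pi * variances, axis=1) ** -0.5
    expo = np.exp(-0.5 * np.sum((o - means) ** 2 / variances, axis=1))
    return np.sum(weights * norm * expo)

o = np.array([0.5, -1.2])
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [1.0, -1.0]])
variances = np.array([[1.0, 1.0], [0.5, 2.0]])
print("b_j(o) =", gmm_pdf(o, weights, means, variances))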
30
Training
Given O, tune λ to maximize P(O | λ)
Baum-Welch Algorithm
  • Also called the Forward-Backward algorithm
  • Step 1: Initialize λ
  • Step 2: Compute the forward/backward probabilities using the current λ
  • Step 3: Adjust λ based on the computed probabilities
  • Step 4: Repeat steps 2-3 until convergence
31
Baum-Welch Algorithm
  • Define ξt(i, j), the probability of being in state i at time t and state j at time t+1, given O and λ:
    ξt(i, j) = P(qt = si, qt+1 = sj | O, λ) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)
  • Its per-state sum γt(i) = Σj ξt(i, j) is the probability of being in state i at time t
32
Baum-Welch Algorithm
  • Adjust π:  πi' = γ1(i)   (expected number of times in state i at t = 1)
  • Adjust a:  aij' = Σt ξt(i, j) / Σt γt(i)
33
Baum-Welch Algorithm
  • Adjust b:  bi(k)' = [ Σ over t with ot = vk of γt(i) ] / Σt γt(i)
    (expected number of times vk is observed in state i, divided by the expected number of times in state i)
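A compact single-sequence Baum-Welch re-estimation step for a discrete HMM, following the ξ/γ formulas above (an illustrative sketch, without the numerical safeguards a real trainer would need):

import numpy as np

def baum_welch_step(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]               # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                             # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    p_O = alpha[-1].sum()
    xi = np.zeros((T - 1, N, N))               # xi[t, i, j]
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :] / p_O
    gamma = alpha * beta / p_O                 # gamma[t, i]
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                # re-estimate emissions per symbol k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B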
34
In Conclusion
  • Hidden Markov Model (HMM) for Sequence Modeling

35
In Conclusion
  • Hidden Markov Model (HMM) for Sequence Classification

36
Questions ?
37
Automatic Speech Recognition
38
Advantages Of ASR
  • Easy to perform; no specialized skill needed
  • 3-4 times faster than typing
  • 8-10 times faster than handwriting (if recognized correctly!)
  • Convenient while doing multiple activities
  • Economical equipment
39
Classification Of ASR
  • Continuity of speech
    1) Isolated word recognition (IWR)
    2) Continuous speech recognition (CSR)
       - Transcription / understanding task
       - Restricted / free grammar
  • Speaking style
    1) Isolated words or phrases: voice commands
    2) Connected speech: digit strings
    3) Read speech: dictation
    4) Fluent speech: broadcast news
    5) Spontaneous speech: conversation
40
Classification Of ASR
  • Speaker dependency
    1) Speaker dependent
    2) Speaker independent
  • Unit of the reference template
    1) Word unit
    2) Subword unit
       - Phoneme
       - Syllable
41
Status of ASR
42
Pattern Recognition
(Block diagram: the training set and the development set each pass through feature extraction; training on the training set produces a trained model, which is tested on the development set and adjusted, giving a trained model with optimized parameters and the final result.)
43
Feature Extraction
44
ASR Problem Formulation 1
  • Given an observation sequence O (a sequence of feature vectors) extracted from a speech signal
  • Determine the underlying word (IWR) or word sequence (CSR)

45
ASR Problem Formulation 2
  • Solution: maximize the a posteriori probability, W* = argmaxW P(W | O)
  • Using Bayes' rule: P(W | O) = P(O | W) P(W) / P(O)

46
ASR Problem Formulation 3
  • Since P(O) is equal for all word sequences W:
    W* = argmaxW P(O | W) P(W)

P(O | W): Acoustic Model (AM), the probability that the word sequence W is uttered as O
P(W): Language Model (LM), how likely the word sequence W is to be said
47
IWR
Example: a digit 0-9 recognizer
  • For each digit W in {0, 1, ..., 9}, compute P(O | W) P(W) and choose the maximum
CSR
Example: a digit string recognizer
  • For each candidate digit string W, compute P(O | W) P(W) and choose the maximum (a toy sketch of this decision rule follows)
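A toy illustration of the decision rule W* = argmaxW P(O | W) P(W); the acoustic log-likelihoods below are made-up numbers, not real model outputs:

import math

log_p_O_given_W = {"zero": -120.3, "one": -118.7, "two": -122.4}   # fabricated log P(O|W)
log_p_W = {w: math.log(1.0 / 3.0) for w in log_p_O_given_W}        # uniform LM prior
best_word = max(log_p_O_given_W, key=lambda w: log_p_O_given_W[w] + log_p_W[w])
print(best_word)                               # "one" has the highest combined score here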
48
Basic Idea
  • It is impossible to build an HMM for every possible word sequence W
  • Advantageously, the HMM of a longer unit can be built by concatenating HMMs of smaller units
  • Each smaller-unit HMM is trained in the same way as a word HMM in IWR
49
CSR Structure
50
Acoustic Model
51
Subword-Unit Acoustic Model
52
Context-Dependent Model
  • In continuous speech, the sound of a phoneme often changes with its context
    /a/ in d e t a b e s (database)
    /a/ in d e t a (data)
  • Using different HMMs for a phoneme in different contexts
    -> Context-Dependent Model
  • Always using one HMM for a phoneme regardless of context
    -> Context-Independent Model

53
Context-Dependent Model
Context-Independent (Monophone) Model
Context-Dependent (Triphone) Model
54
Tied-State Model 1
  • An example Thai acoustic model (NECTEC):
    - No. of monophone HMMs: 76
    - No. of triphone HMMs: 49,631 !
  • Not all triphones appear in the training data -> unseen triphones
  • Many triphones occur only infrequently -> data sparseness problem
  • Triphones with similar contexts share HMM states -> Tied-State Triphone model
55
Tied-State Model 2
p-aan, p-aang (two triphones sharing states)
  • The 49,631 triphone HMMs can be constructed from a small set of shared states (1K to 3K states)
56
Tied-State Model 3
  • Decision Tree-Based State Tying
  • Decision trees are built for each phoneme

57
Conclusion Of AM In CSR
  • Phoneme-based HMM
  • Context-dependent triphone HMM
  • Tied-state triphone HMM

58
Language Model
59
Language Modeling
  • Typical language modeling techniques for CSR
  • Regular Grammar Model
    - Finite-state model
    - Small vocabulary
    - Restricted grammar
  • N-gram Model
    - Large vocabulary
    - Free grammar
    - Higher computation
60
Regular Grammar Model 1
  • Grammar is defined by a Finite-State Network

An example from the Voice Dialing task in the HTK Book, Cambridge University
61
N-gram Model 1
  • Large Vocabulary Continuous Speech Recognition (LVCSR)
    - Vocabulary size > 1,000 words
    - Unrestricted grammar
    - Example tasks:
      broadcast news transcription, dictation systems, meeting transcription
    - Language model: the N-gram model
62
N-gram Model 2
  • Given a word sequence W = w1 ... wM, e.g. W = w1 w2 w3 w4, compute P(W) by the chain rule:
    P(W) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3)
63
2-gram Model
  • Assume wi depends only on the one previous word wi-1 -> 2-gram (Bigram) Model
  • For example:
    P(W) ≈ P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
64
3-gram Model
  • Assume wi depends only on the two previous words wi-2, wi-1 -> 3-gram (Trigram) Model
  • For example:
    P(W) ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3)
65
Training N-gram
  • P(wi | wi-1, wi-2), P(wi | wi-1) and P(wi) are computed by counting over a Training Text (see the sketch below):
    P(wi) = C(wi) / N
    P(wi | wi-1) = C(wi-1 wi) / C(wi-1)
    P(wi | wi-1, wi-2) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)

C(w): number of occurrences of w    N: total number of words
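A small sketch of maximum-likelihood bigram estimation by counting; the toy corpus assumes the sentence split used in the example on the next slide:

from collections import Counter

corpus = ["he is a man", "he is a student", "is he a man"]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, w_prev):
    # P(w | w_prev) = C(w_prev w) / C(w_prev); zero if the bigram was never seen
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

print(p_bigram("he", "is"))       # C(is he) / C(is) = 1/3
print(p_bigram("is", "student"))  # unseen bigram -> 0.0, which motivates smoothing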
66
Example Of N-gram Model
  • Given a training text:
      he is a man
      he is a student
      is he a man
  • Compute P(W) for W = "is he a student" and for W = "student is man" (a worked bigram computation follows)
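Assuming a bigram model, the sentence split shown above, and no sentence-boundary symbols, the computation works out as:

  P(is he a student) = P(is) P(he | is) P(a | he) P(student | a)
                     = (3/12) x (1/3) x (1/3) x (1/3) ≈ 0.009

  P(student is man)  = P(student) P(is | student) P(man | is)
                     = (1/12) x 0 x ...  = 0

The second sequence gets probability zero because the bigram "student is" never occurs in the training text, which is exactly the problem that smoothing (next slide) addresses.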
67
Smoothing
  • N-grams that do not occur in the training text always get zero probability; however, these N-grams might occur in the real world
  • Giving probabilities to unseen word pairs -> the Smoothing process
  • Smoothing techniques:
    - Add-1
    - Deleted Interpolation
    - Good-Turing
    - Katz
68
Add-1 Smoothing
  • Eliminate zero probabilities by adding 1 occurrence to every N-gram:
    P(wi | wi-1) = ( C(wi-1 wi) + 1 ) / ( C(wi-1) + V )
    V: vocabulary size
  • Simple, but it does not work very well!
69
Deleted Interpolation
  • Interpolation (linear combination) of different-order N-grams:
    P'(wi | wi-1, wi-2) = λ3 P(wi | wi-1, wi-2) + λ2 P(wi | wi-1) + λ1 P(wi)
  • The interpolation weights λi can be estimated from a training text
  • Better than Add-1, but still not very good!
70
Conclusion Of LM In CSR
  • Regular grammar model for small tasks
  • N-gram model with smoothing for LVCSR

71
Pronunciation Modeling
72
Pronunciation Modeling
  • Finding the best phoneme sequence given a word sequence W
  • Not easy! Consider:
    - the elephant vs. the butterfly (the pronunciation of "the" differs)
    - lead nitrate vs. he follows her lead (the pronunciation of "lead" differs)
  • A simple solution is a Pronunciation Dictionary
73
Decoding
74
Decoding Problem
  • A critical problem in CSR is the infinite number of possible word sequences W
    (Example: a digit string recognizer cannot enumerate every possible digit string W, score each with P(O | W) P(W), and choose the maximum.)
75
Decoding Solution 1
  • We instead expand the LM, regular grammar or N-gram, into a word network annotated with LM probabilities

(Figure: a word network starting from a Start node and branching to the digit nodes 0, 1, ..., 9 with probabilities P(0), P(1), ..., P(9); each digit node then branches to following digit nodes with probabilities such as P(0 | 0), P(1 | 0), P(9 | 0), then P(0 | 0, 0), P(1 | 0, 0), P(9 | 0, 0), and so on.)
76
Decoding Solution 2
  • Then expand each word node into its phoneme sequence using the pronunciation dictionary

(Figure: in the word network, each digit node is replaced by its phoneme sequence, e.g. 0 -> z iy r o, 1 -> w a n, 9 -> n ay n, while the LM probabilities P(0), P(0 | 0), P(1), P(1 | 0), P(9), P(9 | 0), ... remain on the arcs.)
77
Decoding Solution 3
  • Finally, incorporate the phoneme HMMs into each phoneme node; these produce the AM probabilities

(Figure: each phoneme node, e.g. z iy r o for 0, w a n for 1, n ai n for 9, is replaced by its HMM states, so a path through the network accumulates both the AM probabilities P(O | 0), P(O | 1), ..., P(O | 9) and the LM probabilities P(0), P(0 | 0), P(1), P(1 | 0), ...)

  • The result is the Decoding Network

78
Decoding Solution 4
  • Frame-Synchronous Viterbi Beam Search (a minimal sketch follows this list)
    - The observation sequence O is fed frame-by-frame into the decoding network
    - The probabilities P(O | W) P(W) are accumulated for every possible path using the Viterbi algorithm
    - At every frame, paths whose cumulative probability falls below a threshold (the beam) are eliminated
    - After all frames have been processed, the path with the maximum cumulative probability gives the resulting word sequence
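A minimal, illustrative sketch of frame-synchronous beam pruning (not the full decoding-network search). Here `hypotheses` maps a partial path to its cumulative log probability, and `extend` is a hypothetical caller-supplied function returning the (next_token, log_prob) extensions allowed by the decoding network for the current frame:

import math

def beam_search(frames, extend, beam=10.0):
    hypotheses = {(): 0.0}                       # the empty path with log-probability 0
    for frame in frames:
        new_hyps = {}
        for path, logp in hypotheses.items():
            for token, step_logp in extend(path, frame):
                cand = path + (token,)
                score = logp + step_logp         # accumulate AM + LM log-probabilities
                if score > new_hyps.get(cand, -math.inf):
                    new_hyps[cand] = score
        best = max(new_hyps.values())
        # keep only hypotheses whose score is within `beam` of the current best
        hypotheses = {p: s for p, s in new_hyps.items() if s >= best - beam}
    return max(hypotheses, key=hypotheses.get)   # the best-scoring path after the last frame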

79
Decoding Solution 5
  • The Viterbi algorithm run over the decoding network uses:
    - HMM node: emission probability = HMM state emission probability; transition probability = HMM state transition probability
    - Word node: emission probability = none; transition probability = N-gram probability
80
Decoding Illustration
81
Questions?
82
Building ASR
http://htk.eng.cam.ac.uk
83
Procedure
  • Preparation
    - Phoneme inventory design
    - Task grammar
    - Pronunciation dictionary
    - Phoneme list
    - Training speech data
    - Training data transcriptions
  • Training
    - Acoustic model training
    - Language model training
  • Evaluation
84
Preparation 1
Task Grammar
Digit (Isolated Word):
  $word = one | two | ... | zero;
  ( SENT-START $word SENT-END )
Digit String (Continuous Speech):
  $word = one | two | ... | zero;
  ( SENT-START <$word> SENT-END )
(In HTK grammar notation, <...> means one or more repetitions of the enclosed item.)
85
Preparation 2
Pronun Dict
  zero        zero   s ii r oo
  one         one    w a n
  two         two    th uu
  nine        nine   n aa j
  SENT-START         sil
  SENT-END           sil
86
Preparation 3
87
Training 1
Task Grammar: config/dgs.gram
Pronun Dict: config/dgs.dict
Phoneme List: config/monophn.list
Training Speech Data: wav/train/*.wav
Data Transcriptions: config/monophn.mlf
Scripts:
  Feature Extraction: config/trcode.scp
  HMM Training: config/train.scp
Configuration Files:
  Feature Extraction: config/code.config
  HMM Training: config/train.config
HMM Prototype: config/proto5s
88
Training 2
Feature Extraction:
  HCopy -T 1 -C config/code.config -S config/trcode.scp
HMM Initialization:
  mkdir am/hmm_0
  HCompV -T 1 -C config/train.config -f 0.01 -m -S config/train.scp -M am/hmm_0 config/proto5s
  perl script/createmono.pl config/monophn.list am/hmm_0/proto5s am/hmm_0/vFloors am/hmm_0/newMacros
HMM Parameter Re-estimation:
  mkdir am/hmm_1
  HERest -T 1 -C config/train.config -I config/monophn.mlf -S config/train.scp -H am/hmm_0/newMacros -M am/hmm_1 config/monophn.list
  Repeat HERest 3 times (obtaining am/hmm_3)
89
Training 3
Ways to improve:
  • Increasing the number of Gaussian mixtures in the HMMs
  • Tied-state triphone HMMs
  • Speaker adaptation

Increasing Gaussian Mixtures in the HMMs:
  mkdir am/hmm2m_0
  HHEd -T 1 -H am/hmm_3/newMacros -M am/hmm2m_0 config/mix1to2.hed config/monophn.list
  mkdir am/hmm2m_1
  HERest -T 1 -C config/train.config -v 1.0e-8 -m 1 -I config/monophn.mlf -S config/train.scp -H am/hmm2m_0/newMacros -M am/hmm2m_1 config/monophn.list
  Repeat HERest 3 times (obtaining am/hmm2m_3)
90
Evaluation
Compiling the Language Model:
  HParse lm/dgs.gram lm/dgs.wdnet
Test Speech Data: wav/test/*.wav
Scripts:
  Feature Extraction: config/tscode.scp
  Testing: config/test.scp
Testing:
  HCopy -T 1 -C config/code.config -S config/tscode.scp
  HVite -H am/hmm2m_3/newMacros -S config/test.scp -l '' -w lm/dgs.wdnet -i result/result.mlf config/dgs.dict config/monophn.list
91
Demonstration
  • Isolated Digit recognition
  • Digit String recognition

92
Improving
  • In terms of training data
    - Adding more speech training data
    - Recording training speech that best matches the test speech:
      - Speaking style
      - Environment
      - Equipment
  • In terms of training algorithms
    - Tied-state triphone HMMs
    - Optimizing the training parameters
    - Trying algorithms for robust ASR:
      - Noise classification and model selection
      - Noise/speaker adaptation
93
Questions?