Title: ICASSP 2005 Survey: Discriminative Training (6 papers)
1. ICASSP 2005 Survey: Discriminative Training (6 papers)
2. Outline
- Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition (Cambridge University)
- Discriminative Training of CDHMMs for Maximum Relative Separation Margin (York University)
- Statistical Performance Analysis of MCE/GPD Learning in Gaussian Classifiers and Hidden Markov Models (BBN)
- Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts (JHU)
- Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers (NTT)
- Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition (Microsoft)
3. Discriminative Training of CDHMMs for Maximum Relative Separation Margin
- Chaojun Liu, Hui Jiang, Xinwei Li
- York University, Canada
- ICASSP05 - Discriminative Training session
- Presenter: Jen-Wei Kuo
4. Reference
- Large Margin HMMs for Speech Recognition
- Xinwei Li, Hui Jiang, Chaojun Liu
- York University, Canada
- ICASSP05 - Speech and Audio Processing Applications session
5. Large Margin Estimation (LME) of HMMs
- The constraint alone cannot guarantee the existence of a solution.
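As background, the margin in question can be written as follows. This is a hedged reconstruction from the standard LME formulation (the slide itself shows only the claim); the symbols F, lambda, and S are assumptions, not copied from the slide:

```latex
% Separation margin of training token X_i with true label W_i,
% where F(X_i | \lambda_W) is the discriminant function
% (e.g. log-likelihood under the HMM for word string W):
d(X_i) = F(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} F(X_i \mid \lambda_W)

% LME maximizes the minimum margin over the support token set S:
\tilde{\lambda} = \arg\max_{\lambda} \; \min_{X_i \in S} \; d(X_i)
```

Intuitively, requiring only d(X_i) > 0 does not bound the objective, since margins can be inflated without limit (for instance by rescaling the model parameters); this is why the constraint by itself cannot guarantee a solution, and it motivates the relative margin introduced later in the talk.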
6. Iterative Localized Optimization
- Step 1. Based on the current model, choose the support token that satisfies the above constraints and gives the minimum margin.
- Step 2. Update the model using GPD.
- Step 3. If the convergence conditions are not met, go to Step 1.
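The loop above can be sketched as follows. This is a minimal, generic illustration, not the paper's implementation: the "model" is a toy linear classifier, the margin is g_true minus the best competing score, and GPD is realized as plain gradient ascent on the support token's margin.

```python
import numpy as np

def margin(model, token):
    """Placeholder margin d(X) = g_true(X) - max competing g(X),
    with linear discriminants standing in for HMM scores."""
    x, y = token
    scores = model @ x
    competing = np.delete(scores, y)
    return scores[y] - competing.max()

def margin_grad(model, token):
    """Numerical gradient of the margin w.r.t. the model (for brevity)."""
    eps = 1e-6
    base = margin(model, token)
    g = np.zeros_like(model)
    for idx in np.ndindex(model.shape):
        bumped = model.copy()
        bumped[idx] += eps
        g[idx] = (margin(bumped, token) - base) / eps
    return g

def iterative_localized_gpd(model, tokens, lr=0.1, iters=50):
    for _ in range(iters):
        # Step 1: the support token is the one with the minimum margin
        support = min(tokens, key=lambda t: margin(model, t))
        # Step 2: GPD-style update to increase that token's margin
        model = model + lr * margin_grad(model, support)
    # Step 3 (convergence test) is elided; a fixed iteration count is used
    return model
```

In practice the update targets only the support token's margin at each pass, which is what makes the optimization "localized".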
7. Experimental Results
- English E-set vocabulary of the OGI ISOLET database
8. Experimental Results
9. Experimental Results
10. Large Relative Margin Estimation (LRME) of HMMs
11. Large Relative Margin Estimation (LRME) of HMMs
12. Experimental Results
- English E-set vocabulary of the OGI ISOLET database and the alphabet set
13. Experimental Results
14. Experimental Results
15. Conclusion
- Main concept
- Criteria
- Maximum large margin
- Maximum large relative margin
- Support token
- An utterance with a relatively small positive margin
16. Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts
- Lambert Mathias (JHU)
- Girija Yegnanarayanan, Juergen Fritsch (Multimodal Technologies, Inc.)
- ICASSP05 - Discriminative Training session
- Presenter: Jen-Wei Kuo
17. Introduction
- This paper presents a method for the automatic generation of transcripts from medical reports.
- Medical domain:
- An unlimited amount of speech data is available for each speaker
- These speech data have no verbatim transcripts, only final reports
- Medical final reports:
- Produced by physicians and other healthcare professionals
- Grammatical errors are corrected
- Disfluencies and repetitions are removed
- Non-dictated sentence and paragraph boundaries are added
- Dictated paragraphs are rearranged
- The reports can still be exploited as an information source for generating training transcripts
18. Introduction
- Central idea of this paper:
- Step 1. Transform the reports into spoken-form transcripts (Partially Reliable Transcripts, PRT)
- Step 2. Identify reliable regions in the transcripts
- Step 3. Apply ML/MMI acoustic training
- A frame-based filtering approach is proposed for lattice-based MMI
- Step 4. The results show that MMI outperforms ML
19. Partially Reliable Transcripts
- Step 1. Normalize the medical reports to a common format
- Step 2. Generate a report-specific FSG for each of the available medical reports
- Step 3. Use the normalized medical reports to train an LM
- Step 4. Generate the orthographic transcripts using the LM and the best AM
- Step 5. Annotate the orthographic transcripts by aligning them against the corresponding report-specific FSG
- Step 6. Parse the orthographic transcripts using the report-specific FSG with a robust parser that allows for insertions (INS), deletions (DEL) and substitutions (SUB)
- Step 7. If a word is an INS, DEL or SUB, mark the frames of its underlying phone sequence as unreliable; otherwise mark them reliable
- Step 8. Use the reliable segments to retrain the AMs
- Step 9. Go to Step 4.
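Steps 6-7 can be sketched as follows. This is a hedged illustration with an assumed data layout: the robust alignment is taken as given, as a list of per-word edit labels with frame spans, and words labeled anything but MATCH have their frames marked unreliable.

```python
# Hypothetical per-word alignment format: (label, (start_frame, end_frame)),
# where label is one of 'MATCH', 'INS', 'DEL', 'SUB'.

def mark_frames(aligned_words):
    """Return a per-frame reliability mask for the utterance: a frame is
    reliable only if its word aligned as an exact MATCH against the
    report-specific FSG."""
    n_frames = max(end for _, (_, end) in aligned_words)
    reliable = [False] * n_frames
    for label, (start, end) in aligned_words:
        ok = (label == 'MATCH')
        for t in range(start, end):
            reliable[t] = ok
    return reliable
```

The resulting mask is what the subsequent ML/MMI retraining consumes: only frames flagged reliable contribute statistics.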
20. MMI Training with Frame Filtering
- Approach 1:
- Step 1. Mark each arc in the MMI training lattices as RELIABLE or UNRELIABLE
- Step 2. Accumulate the numerator and denominator counts only on RELIABLE arcs
- Approach 2 (frame filtering):
- Step 1. Mark each frame as reliable or unreliable
- Step 2. Allow partially reliable words to be included in training
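The frame-filtering idea (Approach 2) can be sketched as below: occupancy statistics are accumulated only over frames marked reliable, so a word that is only partially reliable still contributes its reliable frames. The array shapes and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def accumulate_filtered(gammas, feats, reliable):
    """Accumulate zeroth- and first-order statistics on reliable frames.

    gammas:   (T, S) per-frame state occupancies (from the lattice pass)
    feats:    (T, D) acoustic feature vectors
    reliable: (T,)   boolean per-frame reliability mask
    Returns (occupancy per state, first-order sum per state).
    """
    g = gammas[reliable]          # keep only the reliable frames
    x = feats[reliable]
    occ = g.sum(axis=0)           # (S,)
    first = g.T @ x               # (S, D)
    return occ, first
```

The same routine would be called twice in MMI, once with numerator and once with denominator occupancies, both restricted by the same mask.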
21. Experimental Results
22. Experimental Results
23. Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers
- Erik McDermott and Shigeru Katagiri
- NTT Communication Science Laboratories
- ICASSP05 - Discriminative Training session
- Presenter: Jen-Wei Kuo
24. Introduction
- Special features highlighted in this paper:
- MCE training with Quickprop optimization
- SOLON, NTT's WFST-based recognizer
- Uses a time-synchronous beam search strategy and has been applied to LMs with vocabularies of up to 1.8 million words
- Context-dependent model design using decision trees
- Corpus of Spontaneous Japanese (CSJ) lecture speech transcription task (about 190 hrs)
- Name recognition on 22k names
- Word recognition on 30k words
25. Corpus for Name Recognition
- Name recognition (40 hrs from CSJ)
- 35,500 utterances (39 hrs) for training
- Contain 22,320 names (16,547 family names and 5,744 given names)
- 6,428 utterances for testing
- Contain OOVs
- WFST
- Weight pushing, network optimization
- 489,756 nodes
- 1,349,430 arcs
26. WFST Recognizer
- Four strategies for generating denominator statistics for MCE training:
- Triphone loop
- Like free syllable recognition in Mandarin
- Bigram triphone LM
- Full-WFST LM, flat transcripts
- Full 22k LM (22,320 names in the vocabulary)
- The transcription is represented as a WFST obtained by composing the full WFST with the transcribed word sequence
- Lattice-WFST, flat transcripts
- The lattice is first generated by the MLE-trained model
- Faster than Full-WFST (800 arcs on average vs. 1,349,430 arcs)
- Lattice-WFST, rich transcripts
- Add all possible fillers to the transcription grammar
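The composition used to restrict the full WFST to the transcribed word sequence can be sketched as below. This is a hedged, heavily simplified toy (epsilon-free acceptors over word labels, tropical-style additive weights), not SOLON's actual machinery; the arc format is an assumption.

```python
# Acceptor representation (assumed): arcs maps state -> [(next, label, weight)],
# plus a start state and a set of final states.

def linear_acceptor(words):
    """Build a linear acceptor for a transcript: state i --word--> i+1."""
    arcs = {i: [(i + 1, w, 0.0)] for i, w in enumerate(words)}
    arcs[len(words)] = []
    return arcs, 0, {len(words)}

def compose(a, b):
    """Epsilon-free acceptor intersection: pair states, match equal labels,
    add weights. Only states reachable from the paired start are built."""
    arcs_a, start_a, finals_a = a
    arcs_b, start_b, finals_b = b
    start = (start_a, start_b)
    arcs, finals = {}, set()
    stack = [start]
    while stack:
        s1, s2 = state = stack.pop()
        if state in arcs:
            continue                      # already expanded
        arcs[state] = []
        if s1 in finals_a and s2 in finals_b:
            finals.add(state)
        for n1, l1, w1 in arcs_a[s1]:
            for n2, l2, w2 in arcs_b[s2]:
                if l1 == l2:              # labels must agree
                    arcs[state].append(((n1, n2), l1, w1 + w2))
                    stack.append((n1, n2))
    return arcs, start, finals
```

Composing the full network with a transcript-shaped linear acceptor keeps only paths matching the transcription, which is what the "Full-WFST, flat transcripts" strategy needs for the numerator side.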
27. Experimental Results
28. Experimental Results
29. Experimental Results
- Use of the Lp norm and N-best incorrect candidates
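The Lp-norm treatment of the N-best incorrect candidates follows the standard MCE misclassification measure, sketched below. This is the textbook MCE/GPD form rather than anything quoted from the paper; the scores and smoothing constants are placeholders.

```python
import math

def mce_misclassification(g_correct, g_competing, eta=2.0):
    """d(X) = -g_correct + (1/eta) * log((1/N) * sum_j exp(eta * g_j)),
    i.e. an Lp-style soft max over the N best incorrect candidates.
    Computed via log-sum-exp for numerical stability."""
    n = len(g_competing)
    m = max(eta * g for g in g_competing)
    lse = m + math.log(sum(math.exp(eta * g - m) for g in g_competing))
    return -g_correct + (lse - math.log(n)) / eta

def mce_loss(d, gamma=1.0, theta=0.0):
    """Smoothed 0-1 loss via the sigmoid: l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))
```

As eta grows, the soft max approaches the single best incorrect candidate; smaller eta spreads the penalty over all N competitors.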
30. Word Recognition
- Word recognition corpus and experimental results
- 154,000 utterances (190 hrs) for training
- 10 lecture speeches, 130 minutes in total
- 30k words in the vocabulary
- WFST
- Trigram LM
- 6,138,702 arcs
- MCE training
- Beam search with a unigram LM (about 3-5x real time)
- 494,845 arcs
31. Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition
- Bo Liu (1,2), Hui Jiang (3), Jian-Lai Zhou (1), Ren-Hua Wang (2)
- 1. Microsoft Research Asia
- 2. University of Science and Technology of China
- 3. York University
- ICASSP05 - Discriminative Training session
- Presenter: Jen-Wei Kuo
32. Reference
- A Dynamic In-Search Discriminative Training Approach for Large Vocabulary Speech Recognition
- Hui Jiang, Olivier Siohan, Frank K. Soong, Chin-Hui Lee
- Bell Labs, Lucent Technologies
- ICASSP02 - Discriminative Training in Speech Recognition session
33. Competing Token Collection
- For each frame t:
- For each active word arc w:
- Perform a backtrace to obtain the partial path
- HMM alignment
- For each HMM m:
- Calculate the overlap rate
- If overlap rate < threshold and Likelihood(m) < Likelihood(Ref),
- then m is collected as a competing token
- End
- End
- End
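The inner test of the loop above can be sketched as follows. This is a hedged illustration: the token layout, the overlap-rate definition (overlap relative to the shorter span), and the threshold value are assumptions, not the paper's exact choices.

```python
def overlap_rate(seg_a, seg_b):
    """Fraction of temporal overlap between two (start, end) frame spans,
    measured relative to the shorter span (one common convention)."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    shorter = min(seg_a[1] - seg_a[0], seg_b[1] - seg_b[0])
    return max(0, end - start) / shorter if shorter > 0 else 0.0

def collect_competing_tokens(active_hmms, ref_segment, ref_loglik,
                             overlap_threshold=0.5):
    """active_hmms: list of (hmm_id, (start, end), loglik) obtained by
    backtracing the active word arcs. An HMM is collected as a competing
    token when its overlap rate with the reference segment is below the
    threshold and its likelihood is below the reference's (the slide's
    stated condition)."""
    competing = []
    for hmm_id, segment, loglik in active_hmms:
        if (overlap_rate(segment, ref_segment) < overlap_threshold
                and loglik < ref_loglik):
            competing.append((hmm_id, segment, loglik))
    return competing
```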
34. Experimental Results
- Corpus
- DARPA Communicator task (travel reservation application)
35. Introduction
- Discriminative criterion at the phone level
- Least Phone Competing Tokens (LPCT) criterion
- Given a speech segment O and a phone a:
- Competing Token (CT)
- True Token (TT)
36. Off-line Token Collection
- Discriminative criterion at the phone level
- True Token (TT)
- First, forced alignment is performed.
- Every segment in the reference is treated as a TT.
- Competing Token (CT)
- Generate the word lattice.
- At each word arc, phone boundaries are annotated.
- Decide for each phone arc whether it is a CT:
- 1. The maximum overlap with the same phone in the reference is > a threshold
- 2. The log-likelihood difference is > a threshold
- 3. Add the phone arc (segment and phone id) to the CT set
- LPCT: token collection + MCE/GPD
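The two-threshold off-line CT test above can be sketched as follows. In contrast to the in-search collection of the earlier slide, this operates on annotated lattice phone arcs. The data layout, threshold values, and the direction of the log-likelihood difference are illustrative assumptions.

```python
def max_overlap_with_reference(arc_span, phone, ref_tokens):
    """Largest overlap in frames between a lattice phone arc and any
    reference (TT) segment carrying the same phone label."""
    best = 0
    for ref_phone, (start, end) in ref_tokens:
        if ref_phone == phone:
            best = max(best, min(arc_span[1], end) - max(arc_span[0], start))
    return best

def is_competing_token(arc, ref_tokens, overlap_th=3, loglik_th=0.0):
    """arc: (phone, (start, end), loglik_diff), where loglik_diff is the
    arc's log-likelihood minus the reference segment's (assumed sign
    convention). Both thresholds must be exceeded for the arc to join
    the CT set."""
    phone, span, loglik_diff = arc
    return (max_overlap_with_reference(span, phone, ref_tokens) > overlap_th
            and loglik_diff > loglik_th)
```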
37. Least Phone Competing Tokens Criterion (LPCT)
- Experimental results
- Resource Management database
38. Least Phone Competing Tokens Criterion (LPCT)
39. Experimental Results
40. Experimental Results
41. Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition
- K. C. Sim and M. J. F. Gales
- University of Cambridge
- ICASSP05 - Discriminative Training session
- Presenter: Jen-Wei Kuo
42. Background for Precision Modeling
- Problem
- How to model the correlations in the features as the dimensionality increases
- Solution
- A diagonal covariance approximation is commonly employed
- Structured precision matrix approximations → SPAM model
- With rank-1 bases:
- n = d → STC model
- d < n < d(d+1)/2 → EMLLT model
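The rank-1 structured precision idea can be illustrated as below: each Gaussian's precision matrix is a weighted sum of shared rank-1 basis matrices, P = sum_i lambda_i * a_i a_i^T, with n = d bases giving an STC-style model and d < n < d(d+1)/2 an EMLLT-style model. All numbers are synthetic, and restricting the coefficients to be positive is a simplification (EMLLT allows negative coefficients as long as P stays positive definite).

```python
import numpy as np

def precision_from_rank1_bases(basis, coeffs):
    """basis: (n, d) array of shared basis vectors a_i;
    coeffs: (n,) per-Gaussian coefficients lambda_i.
    Returns the (d, d) structured precision matrix."""
    return sum(l * np.outer(a, a) for l, a in zip(coeffs, basis))

rng = np.random.default_rng(0)
d, n = 3, 5                               # d < n < d(d+1)/2 = 6: EMLLT-like
basis = rng.standard_normal((n, d))       # shared across Gaussians
coeffs = rng.uniform(0.5, 1.5, size=n)    # positive weights keep P PSD here
P = precision_from_rank1_bases(basis, coeffs)
```

With more bases than dimensions, the model interpolates between the cheap diagonal/STC structure and a full covariance, which is the trade-off the SPAM family explores.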
43. Research Progress