Title: Multi-Pass Pronunciation Adaptation
1 Multi-Pass Pronunciation Adaptation
- Nathan Bodenstab, Summer Intern 2006
2 The Problem
- Word pronunciations (prons) in a lexicon can be
- Incorrect
- Consistently mispronounced by speakers (especially names)
- Using transcribed acoustic data, we want to correct these prons and increase recognition accuracy, i.e. solve for the best pron candidate (a hedged reconstruction of this objective follows the example below) given
- X: acoustic data
- A: lexicon entry / canonical pron (language model)
- B_i: the i-th pron candidate
- Example: Stephan Granger (Auto Attendant), Nuance phonemes (similar to CPA)
- Lexicon entries (canonical prons)
- s t E i f v @ I n
- g r e n dZ @r
- Prons from acoustic data
- s t @ I f A n
- g r A o n dZ i e @r
- May be multi-modal
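The objective itself is not reproduced on the slide; a hedged reconstruction from the variable list above, assuming a standard Bayes-style split into an acoustic term and a pron prior, would be:

```latex
B^{*} = \arg\max_{B_i} \; P(X \mid B_i)\, P(B_i \mid A)
```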
3 Prior Work
- Dragon's MREC pron-guessing (Drew Lowry) and dictionary checker (Paul Vozila)
- Learns graphoneme n-grams (grapheme / phoneme pairs) using an HMM. Passes acoustic data through the n-best results to find the best pron(s).
- Nuance Autopron (Francoise Beaufays, Ananth Sankar, Mitch Weintraub)
- Forced alignment of the utterance to the best dictionary pron. Finds the worst phoneme match and replaces it with alternative phonemes. Passes acoustic data through the resulting prons (with a linguistic prior) to find the best pron(s).
- SpeechWorks 6.5 LEARN (Mark Fanty and Krishna Govindarajan)
- Starts with the dictionary phone graph (FSM) and augments it with learned phone variations. Passes acoustic data through the new phone graph to find the best pron(s).
4 PronLearn - New Pron Learning Algorithm
- Summer Project Goals
- Stand-alone tool to correct sub-optimal prons and be compatible with both Quantum and OSR
- PronLearn Algorithm Outline (see the sketch after this outline)
- Input: a set of transcribed audio files (e.g. 25 utterances of Stephan Granger)
- Pass 1
- Create a weighted FSM of possible prons for the utterance
- Run each audio file through the FSM and record its preferred path
- Pass 2
- Take the top X phoneme distortions from Pass 1 and build a new (non-weighted) FSM
- Re-run each audio file through the new FSM and record the preferred path
- Pass 3 (if recognition results aren't clustered well)
- Repeat Pass 2 using the new preferred phoneme distortions
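A minimal Python sketch of the three-pass flow outlined above. Every callable it takes (the FSM builders, the decoder, the distortion extractor) is a hypothetical placeholder for the Quantum/OSR tooling, not the shipped PronLearn code.

```python
from collections import Counter

def pron_learn(utterances, build_weighted_fsm, build_unweighted_fsm,
               decode, extract_distortions, top_x=5):
    """Hypothetical driver: decode(utt, fsm) returns the preferred phone path,
    extract_distortions(path) yields the phoneme distortions that path used."""
    # Pass 1: run every utterance against the weighted FSM of possible prons
    fsm1 = build_weighted_fsm()
    pass1_paths = [decode(u, fsm1) for u in utterances]

    # Tally the phoneme distortions preferred in Pass 1 and keep the top X
    counts = Counter(d for path in pass1_paths for d in extract_distortions(path))
    top_distortions = [d for d, _ in counts.most_common(top_x)]

    # Pass 2: re-decode against a non-weighted FSM limited to those distortions
    fsm2 = build_unweighted_fsm(top_distortions)
    pass2_paths = [decode(u, fsm2) for u in utterances]

    # Pass 3 (if the Pass 2 prons are not well clustered) would repeat the
    # Pass 2 step with the newly preferred distortions.
    return Counter(tuple(p) for p in pass2_paths)   # pron -> frequency
```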
5 PronLearn Pass 1
- Pass 1 example: Stephen
- Initialize the weighted FSM with the canonical pron(s)
- Add phoneme substitutions, deletions, and insertions with learned weights P(new_phone | canonical_phone), as in the sketch below
- Run the utterances through the weighted FSM and retrieve each preferred path
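A minimal sketch of how the Pass 1 arcs could be laid out, assuming a simple (state, state, label, weight) arc list rather than the actual Quantum/OSR FSM format; the insertion and deletion weights are illustrative defaults.

```python
def build_pass1_arcs(canonical, sub_probs, ins_prob=0.01, del_prob=0.01):
    """canonical: list of phones.
    sub_probs: dict (canonical_phone, new_phone) -> P(new_phone | canonical_phone)."""
    arcs = []  # (from_state, to_state, phone_label, weight)
    for i, phone in enumerate(canonical):
        arcs.append((i, i + 1, phone, 1.0))                 # canonical phone
        for (canon, new), p in sub_probs.items():           # substitutions
            if canon == phone and new != phone:
                arcs.append((i, i + 1, new, p))
        arcs.append((i, i + 1, "<eps>", del_prob))          # deletion
        arcs.append((i, i, "<any>", ins_prob))              # insertion self-loop
    return arcs

# e.g. the canonical "Stephen" pron with one learned substitution f -> v
arcs = build_pass1_arcs("s t E f @ n".split(), {("f", "v"): 0.3})
```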
6 PronLearn Pass 1
- Substitution probabilities P(new_phone | canonical_phone) are estimated using a linguist-generated lexicon
- Align alternate prons of a single word with a dynamic programming alignment algorithm (shortest edit distance, as in spelling correction); see the sketch below
- EPS is used to represent insertions and deletions
- Example
- / s t E f @ n /
- / t E v I n / → (s,EPS), (f,v), and (@,I)
- Phoneme differences between prons are tallied and relative frequency counts are used to estimate probabilities
- Adding more context, e.g. P(new_phone | prev, canonical, next)
- Didn't give a sufficient improvement in accuracy
- When no data is available, hand-crafted estimation simplifies to a phone similarity confusion matrix
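A runnable sketch of the dynamic programming alignment described above (a standard edit-distance backtrace, not the tool's own implementation), reproducing the slide's Stephen example:

```python
def align_prons(a, b, eps="EPS"):
    """Align two phone sequences; EPS marks an insertion or deletion."""
    n, m = len(a), len(b)
    # cost[i][j] = edit distance between a[:i] and b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,      # delete a[i-1]
                             cost[i][j - 1] + 1)      # insert b[j-1]
    # Backtrace to recover aligned (canonical, new) phone pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], eps)); i -= 1
        else:
            pairs.append((eps, b[j - 1])); j -= 1
    return list(reversed(pairs))

# The slide's example: "s t E f @ n" vs. "t E v I n"
print(align_prons("s t E f @ n".split(), "t E v I n".split()))
# -> [('s', 'EPS'), ('t', 't'), ('E', 'E'), ('f', 'v'), ('@', 'I'), ('n', 'n')]
```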
7 PronLearn Pass 1 FSM Weight
- We can control the balance between the acoustic and the language model contribution
- We can modify the phoneme substitution, deletion, and insertion weights to
- Favor acoustics: P(new_phone | canonical_phone) → 1.0
- Favor the canonical pron (LM): P(new_phone | canonical_phone) → 0.0
- One possible weighting scheme is sketched below
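One way to expose this knob is as an exponent (a scale) on the learned substitution weights. The exponent scheme below is only an illustrative assumption, not necessarily how PronLearn implements the bias.

```python
def scaled_arc_weight(learned_prob, lm_scale):
    """lm_scale = 0.0 -> every alternative arc gets weight 1.0 (favor acoustics);
    large lm_scale -> alternative arcs vanish (favor the canonical pron);
    lm_scale = 1.0 -> use P(new_phone | canonical_phone) unchanged."""
    return learned_prob ** lm_scale

print(scaled_arc_weight(0.3, 0.0))   # 1.0    (pure acoustics)
print(scaled_arc_weight(0.3, 1.0))   # 0.3    (learned weight)
print(scaled_arc_weight(0.3, 4.0))   # ~0.008 (strongly favor the canonical pron)
```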
8 PronLearn Problems
- Why is one pass not enough?
- We have a tuning parameter to shift bias between favoring the utterance acoustics or favoring the canonical phonemes
- Favor acoustics: weights all phoneme sequences equally (used in voice enrollment). Recognized prons vary widely; no clustered group of new prons.
9 PronLearn Problems
- Why is one pass not enough?
- We have a tuning parameter to shift bias between favoring the utterance acoustics or favoring the dictionary phonemes
- Favor acoustics: weights all phoneme sequences equally (used in voice enrollment). Recognized prons vary widely; no clustered group of new prons.
- Favor dictionary: heavy bias towards dictionary prons clusters the new pron results, but does not allow much deviation.
10 PronLearn Problems
- Why is one pass not enough?
- We have a tuning parameter to shift bias between favoring the utterance acoustics or favoring the dictionary phonemes
- Favor acoustics: weights all phoneme sequences equally (used in voice enrollment). Recognized prons vary widely; no clustered group of new prons.
- Favor dictionary: heavy bias towards dictionary prons clusters the new pron results, but does not allow much deviation.
- But we want both!
- PronLearn solution: first learn which phoneme substitutions are acoustically popular (Pass 1), then limit the possible paths through the FSM to only those substitutions (Pass 2)
11 PronLearn Pass 2
- Pass 1: favor acoustics (low dictionary pron bias)
- Pass 2
- Extract the top X substitutions from the Pass 1 prons and build an unweighted FSM (see the sketch below)
- Re-run the utterances through the new Pass 2 FSM and record the preferred prons
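A minimal sketch of the Pass 1 → Pass 2 hand-off described above. The `align` callable is an assumption: it should return (canonical_phone, new_phone) pairs, e.g. the DP alignment sketched on the earlier Pass 1 slide.

```python
from collections import Counter

def top_substitutions(canonical, pass1_paths, align, top_x=3):
    """Tally the phone distortions seen in the Pass 1 preferred paths and
    return the top X, to be used when building the unweighted Pass 2 FSM."""
    counts = Counter()
    for path in pass1_paths:
        for canon_phone, new_phone in align(canonical, path):
            if canon_phone != new_phone:          # keep only real distortions
                counts[(canon_phone, new_phone)] += 1
    return [sub for sub, _ in counts.most_common(top_x)]
```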
12 PronLearn Three Pass Examples
- Most frequent pron is new
13 PronLearn Pass 3
- If we want to add at most n new prons, we can either take the top n prons from Pass 2, or
- Pass 3: build a new FSM with only the dictionary prons and the n-best new prons (a sketch follows)
- This forces every utterance to choose which of the n new prons is its best acoustic representation
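A minimal sketch of Pass 3 as described above; `build_fsm_from_prons` and `decode` stand in for the recognizer interface and are assumptions, not the shipped tool calls.

```python
from collections import Counter

def pass3_vote(utterances, dict_prons, new_prons, build_fsm_from_prons, decode, n=2):
    """Restrict the FSM to the dictionary prons plus the n best new prons,
    then let each utterance vote for the pron it matches best acoustically."""
    candidates = list(dict_prons) + list(new_prons)[:n]
    fsm = build_fsm_from_prons(candidates)
    votes = Counter(tuple(decode(u, fsm)) for u in utterances)
    return votes  # pron -> number of utterances that preferred it
```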
14 PronLearn Three Pass Examples
15 PronLearn Three Pass Examples
16 Results
- How much can pron learning help improve recognition results? Many Auto Attendant recognition errors had one or more of the following
- Heavy signal noise
- Name alteration (Michael → Mike, Richard → Dick)
- Difficult grammar competition (Teri Thomas vs. Kerry Thomas)
- Oracle pron learning accuracy is difficult (impossible) to know, but perfect prons will obviously not solve all of our problems
17 Results
- Auto Attendant Simulation - Phantom
- Training: 3750 utterances (150 names)
- Testing: 3750 utterances + 10,000 additional grammar names to increase task difficulty
18 Results
- Auto Attendant Simulation (2) - Phantom
- Training: 3750 utterances (150 names)
- Testing: 3-pass PronLearn with 3750 utterances + X additional grammar names to increase the task difficulty
19 Thanks
20 Results
- BellSouth Directory Assistance, OSR 3.09 (thanks to Jean-Philippe)
- Baseline accuracy was achieved using an OSS hand-crafted dictionary that decreased the original error by 2.0%
- Only learned prons for 240 words from one data set
21 PronLearn Tools
- Documentation at
- http://silicon.speechworks.com/cgi-bin/wiki.pl?PronLearn
- Written in Python
- Requires a local install of OSR or Phantom. Uses acc_test, dicttest, split_gram, and FSM tools
- genProns.py
- Input: (transcription, audio file) pairs
- Output
- mergePronCounts.py: parallelize work or accumulate counts over time
- genUserDict.py: threshold pron percentages or optimize on a validation set, and output to an XML user dictionary (a sketch of this thresholding step follows the table below)
Word        Freq  Percent  InDict  Pron
leonard       21    0.840       0  l I n @r d
leonard        3    0.120       1  l E n @r d
leonard        1    0.040       0  A l w E n @r d
leonard        0    0.000       1  l E n @ d
pendergast    23    0.920       1  p E n d @r g a s t
pendergast     2    0.080       0  E n d @r g a s t
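A minimal sketch of the thresholding step genUserDict.py is described as performing: keep the new (not-in-dictionary) prons whose relative frequency clears a cutoff and emit them as user-dictionary entries. The row layout mirrors the count table above; the XML element names are illustrative assumptions, not the actual OSR user-dictionary schema.

```python
def select_new_prons(counts, min_percent=0.5):
    """counts: list of (word, freq, percent, in_dict, pron) rows, as in the table above."""
    keep = [(word, pron) for word, freq, pct, in_dict, pron in counts
            if not in_dict and pct >= min_percent]
    lines = ["<userdict>"]
    for word, pron in keep:
        lines.append('  <entry word="%s" pron="%s"/>' % (word, pron))
    lines.append("</userdict>")
    return "\n".join(lines)

rows = [("leonard", 21, 0.840, 0, "l I n @r d"),
        ("leonard", 3, 0.120, 1, "l E n @r d")]
print(select_new_prons(rows))   # keeps only the frequent new pron "l I n @r d"
```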