Language Modeling experiences collected over the past evaluations SWB, MT


1
Language Modeling experiences collected over the
past evaluations (SWB, MT)
  • Interactive Systems Labs
  • Christian Fügen
  • Pittsburgh, Aug. 25th, 2004

2
Outline
  • Corpus selection cleaning
  • Some aspects of language modeling
  • Length of n-grams and cut-offs
  • Class based LMs
  • Text-Merging vs. Interpolation
  • Decoding with interpolated LMs
  • Dictionary making
  • Hints on using the SRILM-Toolkit
  • Some evaluation results

3
Corpus Selection
  • The more data, the better? Not always!
  • Corpora should match the target domain if used
    to form a single LM
  • Erroneous material is better removed than used
    for training → LM pollution
  • The more consistent the data, the better!
  • Never use pre-filtered text material:
  • The rules that were used to filter the text are
    unknown
  • Interesting parts, e.g. noises, may already be
    filtered out
  • Unknown errors may have been inserted, e.g. in
    abbreviation format, word substitutions,
    compound handling
  • Merging different corpora can be difficult

4
Corpus Cleaning I
  • Noises, e.g. <breath>
  • Map noises to only a few classes
  • SWB: <noise>; BN: <breath>, <filler>, <noise>
  • Decision should be made on the available
    transcriptions
  • Hesitations, e.g. uh, uh-huh, mhm, ...
  • Many different spelling variants
  • Map all hesitations to a few (3) distinguishable
    classes
  • Automatic clustering depending on the context
    probably performs best
  • Abbreviations / spellings, e.g. CNN → C._N._N.
  • Normalize all abbreviations to a common form
  • Did not use the abbrproc script provided by LDC
    → it inserts errors
  • Instead wrote our own abbreviation tagger based
    on a background list and a set of rules
  • Numbers / punctuation, e.g. 1993, 1st, ...
  • Used numproc and punctproc provided by LDC
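The mapping steps above can be sketched in a few lines. This is a minimal illustration, not the actual cleaning pipeline: the class names (<noise>, <hes1>, <hes2>) and the tiny rule tables are invented stand-ins for the real noise/hesitation inventories.

```python
import re

# Illustrative maps: collapse noises to one class, hesitations to a few
NOISE_MAP = {"<breath>": "<noise>", "<laugh>": "<noise>", "<cough>": "<noise>"}
HESITATION_MAP = {"uh": "<hes1>", "um": "<hes1>", "uh-huh": "<hes2>", "mhm": "<hes2>"}

def normalize_token(token: str) -> str:
    token = NOISE_MAP.get(token, token)
    token = HESITATION_MAP.get(token.lower(), token)
    # Normalize abbreviations like "CNN" to the dotted form "C._N._N."
    if re.fullmatch(r"[A-Z]{2,}", token):
        token = "_".join(c + "." for c in token)
    return token

def clean_line(line: str) -> str:
    return " ".join(normalize_token(t) for t in line.split())
```

A real tagger would additionally consult a background abbreviation list, as the slide describes.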

5
Corpus Cleaning II
  • Partial words, e.g. some-, s-
  • Map all partial words with the same starting
    phoneme to the same class
  • Add a few more classes for the most common
    partials
  • Reductions, e.g. gonna → going to
  • Expand all reductions to their base form
  • Not done for e.g. we've, you're
  • Compounds, e.g. good-bye → good bye
  • Use a background dictionary to find a common form
  • Sentence splitting
  • Use the output of punctproc together with a set
    of our own rules
  • Overall improvement about 0.5 - 1.0% absolute

6
N-Gram Length and Cut-Offs
  • Depends on the corpus size
  • SWB, 4.5M words: best is a 3-gram LM with a
    cut-off of 2 for tri-grams
  • MT+SWB, 5.5M words: best is a 3-gram LM with a
    cut-off of 2 for tri-grams
  • BN, 140M words: best is a 4-gram LM with cut-offs
    of 1 for bi-grams, 2 for tri-grams and 4 for
    four-grams
  • A better way of pruning than cut-offs?
  • Remove n-grams if their removal increases the
    (training set) perplexity of the model by less
    than a relative threshold (SRILM)
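Count cut-offs can be sketched as a simple filter over an n-gram count table. This is an illustration only; here a cut-off of c drops n-grams with training count ≤ c (SRILM's -gtNmin expresses the same idea as a minimum count), and the thresholds mirror the BN setup above.

```python
from collections import Counter

def apply_cutoffs(counts: Counter, cutoffs: dict) -> Counter:
    """Keep an n-gram only if its count exceeds the cut-off for its order."""
    kept = Counter()
    for ngram, count in counts.items():
        order = len(ngram)
        if count > cutoffs.get(order, 0):  # uni-grams default to cut-off 0
            kept[ngram] = count
    return kept

# Toy counts; cut-offs of 1 / 2 / 4 for bi-/tri-/four-grams as on the slide
counts = Counter({("the",): 10, ("the", "cat"): 1,
                  ("the", "cat", "sat"): 2, ("a", "dog"): 3})
pruned = apply_cutoffs(counts, {2: 1, 3: 2, 4: 4})
```

The probability mass of the dropped n-grams then falls to the back-off distribution during estimation.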

MT corpus only
7
Class Based LMs
  • Clustering
  • 500 - 1000 classes seem to be practical
  • Very time consuming: 2 days for clustering a 5M
    word corpus with 800 classes and 30 iterations
  • Using the SRILM toolkit instead seems to be
    better (1 day, see later)
  • Hints
  • Best performance when interpolated with a
    standard n-gram LM on the same corpus (see next
    slide)
  • Due to the introduced classes, the n-gram length
    can be increased
  • Also use cut-offs

(Table note: means context dependent interpolation)
8
Text Merging vs. Interpolation
  • Merging of text corpora
  • No discounting applied beforehand
  • Reduces the number of interpolated LMs
  • Interpolation of LMs
  • Interpolation weights can be automatically
    computed
  • Different n-gram lengths, cut-offs
  • Context dependent interpolation is possible
  • But time consuming during decoding
  • Hints
  • Merge text corpora if they belong to the same
    domain
  • Interpolate LMs if they represent different
    types or domains
  • Do context dependent interpolation only if the
    cross-validation corpus matches the test corpus
  • Maximum number of interpolated LMs used so far
    in an eval system: 3-fold
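Automatic weight computation can be sketched with unigram models: mix two distributions with weight λ and pick the λ that minimizes perplexity on a held-out (cross-validation) text. A minimal grid-search illustration; real systems interpolate full n-gram LMs and estimate weights with EM.

```python
import math

def perplexity(lm: dict, text: list) -> float:
    logprob = sum(math.log(lm[w]) for w in text)
    return math.exp(-logprob / len(text))

def interpolate(lm_a: dict, lm_b: dict, lam: float) -> dict:
    """Context independent linear interpolation of two word distributions."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}

def best_lambda(lm_a: dict, lm_b: dict, heldout: list, steps: int = 99) -> float:
    # Grid search over lambda in (0, 1) for minimal held-out perplexity
    return min((l / (steps + 1) for l in range(1, steps + 1)),
               key=lambda lam: perplexity(interpolate(lm_a, lm_b, lam), heldout))
```

Writing the interpolated probabilities back out as a single LM is what avoids the decoding-time cost mentioned above.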

9
Decoding with interpolated LMs
  • Problems when using 3-fold interpolation with big
    LMs
  • Decoding speed decreases
  • Lattice generation takes long
  • Rescoring over a matrix takes even longer
  • Use a lookahead mapper
  • Lookahead is built over the first domain
    dependent LM only
  • Exact scores are only asked for when the word
    identity is known

(Example: manual decoder initialization)
10
Dictionary Making I
  • Filler words
  • Models for silence and pause
  • Noises
  • SWB: acoustic models for <noise>, <human>,
    <breath>, <laugh>, <throat>, <smack> mapped to
    one LM class
  • BN: acoustic models for <breath>, <noise> mapped
    to themselves
  • Therefore depends on the trained acoustic models
  • Hesitations
  • SWB: several specially trained acoustic models
    mapped to 3 LM classes
  • BN: one acoustic model <filler> mapped to itself
  • Therefore depends on the trained acoustic models
  • Multi-words
  • Should be included

11
Dictionary Making II
  • Pronunciation variants
  • SWB: multiple pronunciation dictionary, 2.2
    entries per word, 100k entries
  • BN: single pronunciation dictionary, 1.1
    entries per word, 54k entries
  • Decision depends highly on the type of training
    dictionary
  • Comparison can only be made between systems
    trained over several iterations
  • Computing pronunciation probabilities
  • Generate all pronunciation variants on the basis
    of rules
  • Forced alignment on the training data
  • Count the rules used during forced alignment
  • Compute probabilities after some pruning
  • Overall reduction in WER through LM-ing and
    dictionary making: 3% absolute
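The estimation steps above (count alignments, prune, normalize) can be sketched as follows. The counts and the min-count threshold are invented for illustration; the slide's actual procedure counts rule applications rather than whole variants.

```python
from collections import defaultdict

def pron_probs(alignment_counts: dict, min_count: int = 2) -> dict:
    """alignment_counts: {(word, variant): count} from forced alignment."""
    by_word = defaultdict(dict)
    for (word, variant), count in alignment_counts.items():
        if count >= min_count:          # prune rarely chosen variants
            by_word[word][variant] = count
    probs = {}
    for word, variants in by_word.items():
        total = sum(variants.values())  # renormalize over surviving variants
        for variant, count in variants.items():
            probs[(word, variant)] = count / total
    return probs
```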

12
Hints on using the SRILM-Toolkit I
  • Discounting
  • Modified Kneser-Ney
  • -kndiscountN for every N: -kndiscount1
    -kndiscount2 -kndiscount3 for a tri-gram LM
  • Good-Turing
  • Default
  • Cut-offs apply to all discounting methods
  • -gtNmin M, e.g. -gt3min 3 for a tri-gram cut-off
    of 2
  • Default is -gtNmin 1 for N > 2
  • Pruning
  • -prune P
  • Same value for all n-grams
  • Value has to be very small

13
Hints on using the SRILM-Toolkit II
  • Interpolation
  • Only context independent
  • Result can be written into one single LM
  • Seems to be better than the UKA/CMU context
    dependent interpolation (only results for a small
    domain)
  • Currently only good results for interpolating
    standard n-gram LMs, i.e. without classes
  • Clustering
  • Uses only bi-gram statistics
  • Faster than and as good as the tool from
    UKA/CMU: ngram-class -vocab swb04mt04.vocab
    -text swb04mt04.txt -numclasses 800 -classcounts
    swb04mt04.counts -classes swb04mt04.classes >
    ngram-class.log
  • Interpolation with a standard n-gram LM using
    SRILM is currently not working
  • For using LMs produced by the SRILM toolkit, a
    separate conversion script must be used, e.g.
    convertSRI.tcl provided by me!

14
RT-03 (SWB) Results I
  • Vocabulary
  • 41k vocabulary selected from SWB, BN, CNN
  • CNN used for vocabulary selection but not for LM
    training
  • Pronunciation variants
  • Rule-derived dictionary expansion: 95k entries
  • Probabilities
  • Based on frequencies (forced alignment)
  • Viterbi decoding: probabilities as
    penalties (max = 1)
  • Confusion networks: real probabilities (sum
    = 1)
  • Single vs. multi-pronunciation dictionary

15
RT-03 Results II
  • Better text processing, more data
  • Removing inconsistencies: 32.6 → 32.4
  • Adding CELL + CTRAN transcripts: 32.4 → 31.6
  • LM interpolation (context dependent)
  • 3-gram SWB: 31.4
  • 3-gram SWB + 5-gram class SWB: 31.0
  • 3-gram SWB + 5-gram class SWB + 4-gram BN: 30.3
  • 3-gram SWB + 5-gram class SWB + 4-gram BN +
    4-gram CNN: 30.5

16
RT-04S (Meeting) Results I
  • Existing 40k BN vocabulary expanded by
  • Missing words from the RT-03 SWB vocabulary
  • Missing words from meetings (frequency > 4)
  • Missing abbreviations (frequency > 5)
  • 56 most common partial words plus breath, filler,
    noise
  • 48k vocabulary with an OOV rate of 0.8% on the
    RT-04S devtest
  • Pronunciations derived from
  • Existing SWB/Meeting dictionaries
  • CMU dictionary
  • Generated by Festival
  • 55k dictionary
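An OOV rate like the one quoted above is simply the percentage of running words in a dev text that are missing from the decoding vocabulary. A minimal sketch (the data below is invented):

```python
def oov_rate(vocab, text_tokens) -> float:
    """Percentage of running words in text_tokens not covered by vocab."""
    vocab = set(vocab)
    oov = sum(1 for w in text_tokens if w not in vocab)
    return 100.0 * oov / len(text_tokens)
```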

17
RT-04S Results II
  • LM sources
  • SWB/CTRAN/CH/CELL
  • Meetings
  • BN
  • 3-fold interpolation
  • SWB + Meeting 3-gram LM
  • SWB + Meeting 5-gram class-based LM (800
    classes)
  • BN 4-gram LM
  • Results on RT-04S devtest (SDM, ref.
    segmentation)
  • Baseline: 40k BN dictionary + 3-gram BN LM
  • All other experiments use the expanded dictionary

(Table note: text merging + interpolation, context independent)
18
  • Questions?

19
Decoding along Context Free Grammars with Ibis
  • Interactive Systems Labs
  • Christian Fügen
  • Pittsburgh, Aug. 25th, 2004

20
Outline
  • Basics
  • Language Model Interface in Ibis
  • CFG Framework
  • Some other aspects of using grammar based
    speech recognition
  • Spontaneous speech
  • Sub-grammars / grammar domains
  • Tight coupling of SR and DLM
  • Other features

21
Basics
  • Context Free Grammars
  • Chomsky type 2
  • Rules A → v, with A ∈ V and v ∈ (V ∪ Σ)*,
    e.g. aⁿbⁿ; Σ is the set of terminals, V the
    set of non-terminals, (V ∪ Σ)* the set of
    strings composed of symbols from V and/or Σ,
    including the empty string
  • Some speech grammar specifications
  • SOUP: format of the SOUP parser, an extension of
    the CMU Phoenix parser format: .greeting ( hello
    ) ( hi )
  • JSGF: the Java Speech Grammar Format:
    <greeting> = hello | hi;
  • SRGS: XML specification of JSGF from the W3C
  • A few others
  • Grammar types
  • Syntactic, semantic or mixed
  • Statistical or non-statistical
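The aⁿbⁿ language above is the textbook context-free example, generated by the type-2 rules S → a S b | ε. A tiny recognizer makes the idea concrete (an illustration, unrelated to the Ibis parser):

```python
def in_anbn(s: str) -> bool:
    """Recognize a^n b^n, the language of the rules S -> a S b | eps."""
    if s == "":
        return True  # the empty string is derived by S -> eps
    # Peel one 'a' off the front and one 'b' off the back, as S -> a S b does
    return s.startswith("a") and s.endswith("b") and in_anbn(s[1:-1])
```

No finite n-gram model can capture this unbounded pairing, which is why CFGs sit above regular grammars in the Chomsky hierarchy.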

22
Grammar based decoding
  • Static network representation
  • Faster, especially when used in FST-based
    decoders
  • Transition scores can be computed on the basis of
    any context length; mostly only bi-/tri-grams are
    practical
  • High memory requirements
  • Time consuming static network compilation
  • Dynamic network representation
  • More flexible: adding new parts on the fly is
    easy
  • Less memory expensive
  • Slower, due to dynamic expansion
  • Only bi-gram statistics

23
Language Model Interface in Ibis
  • Linguistic Knowledge Source (LingKS)
  • Common interface to all types of language models,
    e.g. PhraseLM, grammars, interpolation of LingKSs
  • Consists of 4 major functions:
  • lks.createLCT (lvX): creates an initial LCT with
    an lvX
  • lks.extendLCT (lct, lvX): extends a given LCT
    with an lvX
  • lks.scoreArray (lct): returns transition scores
    for all lvX given an LCT
  • lks.score (lct, lvX): returns only the transition
    score from an LCT to an lvX
  • Search Vocabulary Mapper (SVMap): defines the
    mapping between svX and lvX

lks = linguistic knowledge source, lct = linguistic
context, lvX = language model index, svX = search
vocabulary index
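The four-function interface can be sketched with a toy bi-gram LM behind it. The method names follow the slide (createLCT, extendLCT, scoreArray, score); everything else, including representing an LCT as just the last word, is an invented simplification of the actual Ibis implementation.

```python
class BigramLingKS:
    """Toy LingKS: a bi-gram LM exposed through the four-function interface."""

    def __init__(self, bigram_probs: dict, vocab: list):
        self.p = bigram_probs   # {(prev_lvX, lvX): probability}
        self.vocab = vocab      # list of lvX indices

    def createLCT(self, lvX):
        return (lvX,)           # for a bi-gram LM the LCT is just the last word

    def extendLCT(self, lct, lvX):
        return (lvX,)

    def score(self, lct, lvX) -> float:
        import math
        # Negative log probability, as decoders use costs; tiny floor for unseen pairs
        return -math.log(self.p.get((lct[-1], lvX), 1e-10))

    def scoreArray(self, lct) -> list:
        return [self.score(lct, w) for w in self.vocab]
```

A longer-span LM would only change what an LCT stores; the decoder sees the same four calls, which is the point of the abstraction.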
24
CFG Framework I
  • CFG
  • Context free grammar
  • CFGSet
  • Set of context free grammars
  • ParseTree
  • Stores the grammar states (LCTs) while
    traversing the network
  • RuleStack
  • Stack for accessing and leaving the grammar
    rules while traversing the network
  • Lexicon
  • Stores terminal and non-terminal symbols

(Diagram labels: LingKS CFGSet, LingKS CFG)
25
CFG Framework II
  • Supported file formats
  • SOUP: grammar format of the SOUP parser (without
    support of character level and speaker-side
    rules)
  • JSGF: Java Speech Grammar Format (without support
    of import statements)
  • PFSG: Probabilistic Finite State Graph format,
    which is used by SRILM
  • FSM: AT&T FSM (finite state machine) text file
    format
  • Basic initialization
  • Can be used for decoding and basic parsing (word
    skipping is not possible)
  • Hints for grammar writing
  • Be as restrictive as possible

set SID(cfg,grammars) [list [list NAV \
    [list $cfgPath/cfg.ka.nav $cfgPath/cfg.base.nav]]]
cfgInit SID
ibisInit SID -lm cfgSet$SID
spass$SID run                                    ;# decoding
cfgSet$SID.data.parseTree text -svMap svmap$SID  ;# parsing
26
CFG Framework III
  • Transition probabilities
  • Can be defined to have a fixed value (-1.0) or be
    equally distributed
  • A fixed value seems to work a little better
  • Automatic dictionary selection
  • A dictionary can be automatically compiled
    while reading in the grammars, using a large
    background dictionary
  • Outsourcing terminal classes
  • Large lists of terminals, e.g. street names, can
    be outsourced to the SVMap to speed up
    decoding

set SID(cfg,dict) $dictPath/nav.dict
set SID(cfg,baseDict) $dictPath/baseDict
cfgInit SID -makeDict 1
dictInit SID -desc [set SID(cfg,dict)]
ibisInit SID -lm cfgSet$SID
set SID(cfg,classes) [list $dictPath/nav.classes]

Class file example, whereby @street has to be a
terminal in the CFG:
acherstrase @street
adalbert-stifter-strase @street
adenauerring @street
27
Spontaneous Speech
  • Coping with non-human and human noises
  • Difficult to model in a grammar, because it is
    not known when they appear
  • Use filler words for modeling noises
  • Filler words can occur between any two terminals
    of the grammars
  • The LM is not asked for their score; a filler
    penalty is used instead
  • A complete set of variable definitions and
    initialization may look like:

set SID(cfg,grammars) [list \
    [list NAV [list $cfgPath/cfg.ka.nav $cfgPath/cfg.base.nav]] \
    [list SHARED [list $cfgPath/cfg.shared]]]
set SID(cfg,dict) $dictPath/nav.dict
set SID(cfg,baseDict) $dictPath/baseDict
set SID(cfg,classes) [list $dictPath/nav.classes]
set SID(cfg,fillers) [list $dictPath/nav.fillers]
set dict [set SID(cfg,dict)]
cfgInit SID -makeDict 1
dictInit SID -desc $dict
ibisInit SID -lm cfgSet$SID \
    -vocabDesc $dict.v -mapDesc $dict.m
svmap$SID configure -filPen 60

Example filler file:
click
interjection
interjection(ah)
pause
28
Sub-Grammars / Grammar Domains
  • A CFGSet is a set of single sub-grammars
  • Each grammar can be seen as modeling a single
    domain, e.g. hotel reservation, navigation, ...
  • Tags can be given to sub-grammars to identify the
    domains
  • Sub-grammars can be activated / deactivated at
    run-time
  • Common, domain independent concepts can be
    defined in a special shared sub-grammar
  • Easier to maintain and better handling together
    with dialogue systems
  • Can be used e.g. for multilingual decoding
  • Multilingual acoustic model
  • Each sub-grammar represents a different language
  • Common, multilingual search and language model
    vocabulary
  • Decoding works in parallel

29
Tight Coupling of SR and DLM
  • Sub-grammars and single rules can be
  • activated / deactivated during runtime
  • given a weight, e.g. for penalizing them
  • This mechanism can be used for a tight coupling
    of SR and DLM
  • Therefore, the same grammar description files
    should be used in both SR and DLM
  • The parsed output of the SR can be used directly
    by the DLM to determine the user intention
  • Together with the dialogue context, it is now
    possible to predict the context of the next user
    answer
  • Rules within that context can be preferred by
    adjusting the rule weights in the SR

30
Other features
  • Grammars can be expanded on the fly
  • Starting over during decoding one sentence can be
    activated
  • It is possible to define all rules as top-level
    rules
  • Visualization is possible by using the FSM output
    format together with the ATT FSM toolkit and
    dot
  • Upcoming features
  • Training of the grammar transitions by parsing a
    training corpus
  • Collecting bi-gram transition statistics
  • Hybrid Language Models / Unified Language Models
  • Combinations of n-gram LMs and CFGs
  • .

(Diagram: combinations of n-gram LMs and CFGs)