Language Modeling experiences collected over the past evaluations SWB, MT


1
Language Modeling experiences collected over the
past evaluations (SWB, MT)
  • Interactive Systems Labs
  • Christian Fügen
  • Pittsburgh, Aug. 25th, 2004

2
Outline
  • Corpus selection cleaning
  • Some aspects of language modeling
  • Length of n-grams and cut-offs
  • Class based LMs
  • Text-Merging vs. Interpolation
  • Decoding with interpolated LMs
  • Dictionary making
  • Hints on using the SRILM-Toolkit
  • Some evaluation results

3
Corpus Selection
  • The more data, the better? Not always!
  • Corpora should match the target domain if used
    to form a single LM
  • Erroneous material is better removed than used
    for training → LM pollution
  • The more consistent the data, the better!
  • Never use pre-filtered text material:
  • The rules that were used to filter the text are
    unknown
  • Interesting parts, e.g. noises, may already be
    filtered out
  • Unknown errors may have been inserted, e.g. in
    abbreviation format, word substitutions,
    compound handling
  • Merging different corpora can be difficult

4
Corpus Cleaning I
  • Noises, e.g. <breath>
  • Map noises to only a few classes
  • SWB: <noise>; BN: <breath>, <filler>, <noise>
  • Decision should be made on the available
    transcriptions
  • Hesitations, e.g. uh, uh-huh, mhm, ...
  • Many different spelling variants
  • Map all hesitations to a few (3) distinguishable
    classes
  • Automatic clustering depending on the context
    probably performs best
  • Abbreviations / spellings, e.g. CNN → C._N._N.
  • Normalize all abbreviations to a common form
  • Did not use the abbrproc script provided by LDC
    → it inserts errors
  • Instead wrote our own abbreviation tagger based
    on a background list and a set of rules
  • Numbers / punctuation, e.g. 1993, 1st, ...
  • Used numproc and punctproc provided by LDC
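The mapping steps above can be sketched in a few lines. This is a minimal illustration, not the actual cleaning pipeline: the class names (<noise>, <hes1>, <hes2>) and the tiny rule tables are invented stand-ins for the real noise/hesitation inventories.

```python
import re

# Illustrative maps: collapse noises to one class, hesitations to a few
NOISE_MAP = {"<breath>": "<noise>", "<laugh>": "<noise>", "<cough>": "<noise>"}
HESITATION_MAP = {"uh": "<hes1>", "um": "<hes1>", "uh-huh": "<hes2>", "mhm": "<hes2>"}

def normalize_token(token: str) -> str:
    token = NOISE_MAP.get(token, token)
    token = HESITATION_MAP.get(token.lower(), token)
    # Normalize abbreviations like "CNN" to the dotted form "C._N._N."
    if re.fullmatch(r"[A-Z]{2,}", token):
        token = "_".join(c + "." for c in token)
    return token

def clean_line(line: str) -> str:
    return " ".join(normalize_token(t) for t in line.split())
```

A real tagger would additionally consult a background abbreviation list, as the slide describes.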

5
Corpus Cleaning II
  • Partial words, e.g. some-, s-
  • Map all partial words with the same starting
    phoneme to the same class
  • Add a few more classes for the most common
    partials
  • Reductions, e.g. gonna → going to
  • Expand all reductions to their base form
  • Not done for e.g. we've, you're
  • Compounds, e.g. good-bye → good bye
  • Use a background dictionary to find a common form
  • Sentence splitting
  • Use the output of punctproc together with a set
    of our own rules
  • Overall improvement about 0.5 - 1.0% absolute

6
N-Gram Length and Cut-Offs
  • Depends on the corpus size
  • SWB, 4.5M words: best is a 3-gram LM with a
    cut-off of 2 for tri-grams
  • MT+SWB, 5.5M words: best is a 3-gram LM with a
    cut-off of 2 for tri-grams
  • BN, 140M words: best is a 4-gram LM with cut-offs
    of 1 for bi-grams, 2 for tri-grams and 4 for
    four-grams
  • A better way of pruning than cut-offs?
  • Remove n-grams if their removal increases the
    (training set) perplexity of the model by less
    than a relative threshold (SRILM)
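Count cut-offs can be sketched as a simple filter over an n-gram count table. This is an illustration only; here a cut-off of c drops n-grams with training count ≤ c (SRILM's -gtNmin expresses the same idea as a minimum count), and the thresholds mirror the BN setup above.

```python
from collections import Counter

def apply_cutoffs(counts: Counter, cutoffs: dict) -> Counter:
    """Keep an n-gram only if its count exceeds the cut-off for its order."""
    kept = Counter()
    for ngram, count in counts.items():
        order = len(ngram)
        if count > cutoffs.get(order, 0):  # uni-grams default to cut-off 0
            kept[ngram] = count
    return kept

# Toy counts; cut-offs of 1 / 2 / 4 for bi-/tri-/four-grams as on the slide
counts = Counter({("the",): 10, ("the", "cat"): 1,
                  ("the", "cat", "sat"): 2, ("a", "dog"): 3})
pruned = apply_cutoffs(counts, {2: 1, 3: 2, 4: 4})
```

The probability mass of the dropped n-grams then falls to the back-off distribution during estimation.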

MT corpus only
7
Class Based LMs
  • Clustering
  • 500 - 1000 classes seem to be practical
  • Very time consuming: 2 days for clustering a 5M
    word corpus with 800 classes and 30 iterations
  • Using the SRILM toolkit instead seems to be
    better (1 day, see later)
  • Hints
  • Best performance when interpolated with a
    standard n-gram LM on the same corpus (see next
    slide)
  • Due to the introduced classes, the n-gram length
    can be increased
  • Also use cut-offs

(Table note: means context dependent interpolation)
8
Text Merging vs. Interpolation
  • Merging of text corpora
  • No discounting applied beforehand
  • Reduces the number of interpolated LMs
  • Interpolation of LMs
  • Interpolation weights can be automatically
    computed
  • Different n-gram lengths, cut-offs
  • Context dependent interpolation is possible
  • But time consuming during decoding
  • Hints
  • Merge text corpora if they belong to the same
    domain
  • Interpolate LMs if they represent different
    types or domains
  • Do context dependent interpolation only if the
    cross-validation corpus matches the test corpus
  • Maximum number of interpolated LMs used so far
    in an eval system: 3-fold
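Automatic weight computation can be sketched with unigram models: mix two distributions with weight λ and pick the λ that minimizes perplexity on a held-out (cross-validation) text. A minimal grid-search illustration; real systems interpolate full n-gram LMs and estimate weights with EM.

```python
import math

def perplexity(lm: dict, text: list) -> float:
    logprob = sum(math.log(lm[w]) for w in text)
    return math.exp(-logprob / len(text))

def interpolate(lm_a: dict, lm_b: dict, lam: float) -> dict:
    """Context independent linear interpolation of two word distributions."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}

def best_lambda(lm_a: dict, lm_b: dict, heldout: list, steps: int = 99) -> float:
    # Grid search over lambda in (0, 1) for minimal held-out perplexity
    return min((l / (steps + 1) for l in range(1, steps + 1)),
               key=lambda lam: perplexity(interpolate(lm_a, lm_b, lam), heldout))
```

Writing the interpolated probabilities back out as a single LM is what avoids the decoding-time cost mentioned above.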

9
Decoding with interpolated LMs
  • Problems when using 3-fold interpolation with big
    LMs
  • Decoding speed decreases
  • Lattice generation takes long
  • Rescoring over a matrix takes even longer
  • Use a lookahead mapper
  • Lookahead is built over the first domain
    dependent LM only
  • Exact scores are only asked for when the word
    identity is known

(Example: manual decoder initialization)
10
Dictionary Making I
  • Filler words
  • Models for silence and pause
  • Noises
  • SWB: acoustic models for <noise>, <human>,
    <breath>, <laugh>, <throat>, <smack> mapped to
    one LM class
  • BN: acoustic models for <breath>, <noise> mapped
    to themselves
  • Therefore depends on the trained acoustic models
  • Hesitations
  • SWB: several specially trained acoustic models
    mapped to 3 LM classes
  • BN: one acoustic model <filler> mapped to itself
  • Therefore depends on the trained acoustic models
  • Multi-words
  • Should be included

11
Dictionary Making II
  • Pronunciation variants
  • SWB: multiple pronunciation dictionary, 2.2
    entries per word, 100k entries
  • BN: single pronunciation dictionary, 1.1
    entries per word, 54k entries
  • Decision depends highly on the type of training
    dictionary
  • Comparison can only be made between systems
    trained over several iterations
  • Computing pronunciation probabilities
  • Generate all pronunciation variants on the basis
    of rules
  • Forced alignment on the training data
  • Count the rules used during forced alignment
  • Compute probabilities after some pruning
  • Overall reduction in WER through LM-ing and
    dictionary making: 3% absolute
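The estimation steps above (count alignments, prune, normalize) can be sketched as follows. The counts and the min-count threshold are invented for illustration; the slide's actual procedure counts rule applications rather than whole variants.

```python
from collections import defaultdict

def pron_probs(alignment_counts: dict, min_count: int = 2) -> dict:
    """alignment_counts: {(word, variant): count} from forced alignment."""
    by_word = defaultdict(dict)
    for (word, variant), count in alignment_counts.items():
        if count >= min_count:          # prune rarely chosen variants
            by_word[word][variant] = count
    probs = {}
    for word, variants in by_word.items():
        total = sum(variants.values())  # renormalize over surviving variants
        for variant, count in variants.items():
            probs[(word, variant)] = count / total
    return probs
```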

12
Hints on using the SRILM-Toolkit I
  • Discounting
  • Modified Kneser-Ney
  • -kndiscountN for every N: -kndiscount1
    -kndiscount2 -kndiscount3 for a tri-gram LM
  • Good-Turing
  • Default
  • Cut-offs apply to all discounting methods
  • -gtNmin M, e.g. -gt3min 3 for a tri-gram cut-off
    of 2
  • Default is -gtNmin 1 for N > 2
  • Pruning
  • -prune P
  • Same value for all n-grams
  • Value has to be very small

13
Hints on using the SRILM-Toolkit II
  • Interpolation
  • Only context independent
  • Result can be written into one single LM
  • Seems to be better than the UKA/CMU context
    dependent interpolation (only results for a small
    domain)
  • Currently only good results for interpolating
    standard n-gram LMs, i.e. without classes
  • Clustering
  • Uses only bi-gram statistics
  • Faster than and as good as the tool from
    UKA/CMU: ngram-class -vocab swb04mt04.vocab
    -text swb04mt04.txt -numclasses 800 -classcounts
    swb04mt04.counts -classes swb04mt04.classes >
    ngram-class.log
  • Interpolation with a standard n-gram LM using
    SRILM is currently not working
  • For using LMs produced by the SRILM toolkit, a
    separate conversion script must be used, e.g.
    convertSRI.tcl provided by me!

14
RT-03 (SWB) Results I
  • Vocabulary
  • 41k vocabulary selected from SWB, BN, CNN
  • CNN used for vocabulary selection but not for LM
    training
  • Pronunciation variants
  • Rule-derived dictionary expansion: 95k entries
  • Probabilities
  • Based on frequencies (forced alignment)
  • Viterbi decoding: probabilities as
    penalties (max = 1)
  • Confusion networks: real probabilities (sum
    = 1)
  • Single vs. multi-pronunciation dictionary

15
RT-03 Results II
  • Better text processing, more data
  • Removing inconsistencies: 32.6 → 32.4
  • Adding CELL + CTRAN transcripts: 32.4 → 31.6
  • LM interpolation (context dependent)
  • 3-gram SWB: 31.4
  • 3-gram SWB + 5-gram class SWB: 31.0
  • 3-gram SWB + 5-gram class SWB + 4-gram BN: 30.3
  • 3-gram SWB + 5-gram class SWB + 4-gram BN +
    4-gram CNN: 30.5

16
RT-04S (Meeting) Results I
  • Existing 40k BN vocabulary expanded by
  • Missing words from the RT-03 SWB vocabulary
  • Missing words from meetings (frequency > 4)
  • Missing abbreviations (frequency > 5)
  • 56 most common partial words plus breath, filler,
    noise
  • 48k vocabulary with an OOV rate of 0.8% on the
    RT-04S devtest
  • Pronunciations derived from
  • Existing SWB/Meeting dictionaries
  • CMU dictionary
  • Generated by Festival
  • 55k dictionary
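An OOV rate like the one quoted above is simply the percentage of running words in a dev text that are missing from the decoding vocabulary. A minimal sketch (the data below is invented):

```python
def oov_rate(vocab, text_tokens) -> float:
    """Percentage of running words in text_tokens not covered by vocab."""
    vocab = set(vocab)
    oov = sum(1 for w in text_tokens if w not in vocab)
    return 100.0 * oov / len(text_tokens)
```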

17
RT-04S Results II
  • LM sources
  • SWB/CTRAN/CH/CELL
  • Meetings
  • BN
  • 3-fold interpolation
  • SWB + Meeting 3-gram LM
  • SWB + Meeting 5-gram class-based LM (800
    classes)
  • BN 4-gram LM
  • Results on RT-04S devtest (SDM, ref.
    segmentation)
  • Baseline: 40k BN dictionary + 3-gram BN LM
  • All other experiments use the expanded dictionary

(Table note: text merging + interpolation, context independent)
18
  • Questions?

19
Decoding along Context Free Grammars with Ibis
  • Interactive Systems Labs
  • Christian Fügen
  • Pittsburgh, Aug. 25th, 2004

20
Outline
  • Basics
  • Language Model Interface in Ibis
  • CFG Framework
  • Some other aspects of using grammar based
    speech recognition
  • Spontaneous speech
  • Sub-grammars / grammar domains
  • Tight coupling of SR and DLM
  • Other features

21
Basics
  • Context Free Grammars
  • Chomsky type 2
  • Rules A → v, with A ∈ V and v ∈ (V ∪ Σ)*,
    e.g. aⁿbⁿ; Σ is the set of terminals, V the
    set of non-terminals, (V ∪ Σ)* the set of
    strings composed of symbols from V and/or Σ,
    including the empty string
  • Some speech grammar specifications
  • SOUP: format of the SOUP parser, an extension of
    the CMU Phoenix parser format: .greeting ( hello
    ) ( hi )
  • JSGF: the Java Speech Grammar Format:
    <greeting> = hello | hi;
  • SRGS: XML specification of JSGF from the W3C
  • A few others
  • Grammar types
  • Syntactic, semantic or mixed
  • Statistical or non-statistical
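The aⁿbⁿ language above is the textbook context-free example, generated by the type-2 rules S → a S b | ε. A tiny recognizer makes the idea concrete (an illustration, unrelated to the Ibis parser):

```python
def in_anbn(s: str) -> bool:
    """Recognize a^n b^n, the language of the rules S -> a S b | eps."""
    if s == "":
        return True  # the empty string is derived by S -> eps
    # Peel one 'a' off the front and one 'b' off the back, as S -> a S b does
    return s.startswith("a") and s.endswith("b") and in_anbn(s[1:-1])
```

No finite n-gram model can capture this unbounded pairing, which is why CFGs sit above regular grammars in the Chomsky hierarchy.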

22
Grammar based decoding
  • Static network representation
  • Faster, especially when used in FST-based
    decoders
  • Transition scores can be computed on the basis of
    any context length; mostly only bi-/tri-grams are
    practical
  • High memory requirements
  • Time consuming static network compilation
  • Dynamic network representation
  • More flexible: adding new parts on the fly is
    easy
  • Less memory expensive
  • Slower, due to dynamic expansion
  • Only bi-gram statistics

23
Language Model Interface in Ibis
  • Linguistic Knowledge Source (LingKS)
  • Common interface to all types of language models,
    e.g. PhraseLM, grammars, interpolation of LingKSs
  • Consists of 4 major functions:
  • lks.createLCT (lvX): creates an initial LCT with
    an lvX
  • lks.extendLCT (lct, lvX): extends a given LCT
    with an lvX
  • lks.scoreArray (lct): returns transition scores
    for all lvX given an LCT
  • lks.score (lct, lvX): returns only the transition
    score from an LCT to an lvX
  • Search Vocabulary Mapper (SVMap): defines the
    mapping between svX and lvX

lks = linguistic knowledge source, lct = linguistic
context, lvX = language model index, svX = search
vocabulary index
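The four-function interface can be sketched with a toy bi-gram LM behind it. The method names follow the slide (createLCT, extendLCT, scoreArray, score); everything else, including representing an LCT as just the last word, is an invented simplification of the actual Ibis implementation.

```python
class BigramLingKS:
    """Toy LingKS: a bi-gram LM exposed through the four-function interface."""

    def __init__(self, bigram_probs: dict, vocab: list):
        self.p = bigram_probs   # {(prev_lvX, lvX): probability}
        self.vocab = vocab      # list of lvX indices

    def createLCT(self, lvX):
        return (lvX,)           # for a bi-gram LM the LCT is just the last word

    def extendLCT(self, lct, lvX):
        return (lvX,)

    def score(self, lct, lvX) -> float:
        import math
        # Negative log probability, as decoders use costs; tiny floor for unseen pairs
        return -math.log(self.p.get((lct[-1], lvX), 1e-10))

    def scoreArray(self, lct) -> list:
        return [self.score(lct, w) for w in self.vocab]
```

A longer-span LM would only change what an LCT stores; the decoder sees the same four calls, which is the point of the abstraction.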
24
CFG Framework I
  • CFG
  • Context free grammar
  • CFGSet
  • Set of context free grammars
  • ParseTree
  • Stores the grammar states (LCTs) while
    traversing the network
  • RuleStack
  • Stack for accessing and leaving the grammar
    rules while traversing the network
  • Lexicon
  • Stores terminal and non-terminal symbols

(Diagram labels: LingKS CFGSet, LingKS CFG)
25
CFG Framework II
  • Supported file formats
  • SOUP: grammar format of the SOUP parser (without
    support of character level and speaker-side
    rules)
  • JSGF: Java Speech Grammar Format (without support
    of import statements)
  • PFSG: Probabilistic Finite State Graph format,
    which is used by SRILM
  • FSM: AT&T FSM (finite state machine) text file
    format
  • Basic initialization
  • Can be used for decoding and basic parsing (word
    skipping is not possible)
  • Hints for grammar writing
  • Be as restrictive as possible

set SID(cfg,grammars) [list [list NAV \
    [list $cfgPath/cfg.ka.nav $cfgPath/cfg.base.nav]]]
cfgInit SID
ibisInit SID -lm cfgSet$SID
spass$SID run                                    ;# decoding
cfgSet$SID.data.parseTree text -svMap svmap$SID  ;# parsing
26
CFG Framework III
  • Transition probabilities
  • Can be defined to have a fixed value (-1.0) or be
    equally distributed
  • A fixed value seems to work a little better
  • Automatic dictionary selection
  • A dictionary can be automatically compiled
    while reading in the grammars, using a large
    background dictionary
  • Outsourcing terminal classes
  • Large lists of terminals, e.g. street names, can
    be outsourced to the SVMap to speed up
    decoding

set SID(cfg,dict) $dictPath/nav.dict
set SID(cfg,baseDict) $dictPath/baseDict
cfgInit SID -makeDict 1
dictInit SID -desc [set SID(cfg,dict)]
ibisInit SID -lm cfgSet$SID
set SID(cfg,classes) [list $dictPath/nav.classes]

Class file example, whereby @street has to be a
terminal in the CFG:
acherstrase @street
adalbert-stifter-strase @street
adenauerring @street
27
Spontaneous Speech
  • Coping with non-human and human noises
  • Difficult to model in a grammar, because it is
    not known when they appear
  • Use filler words for modeling noises
  • Filler words can occur between any two terminals
    of the grammars
  • The LM is not asked for their score; a filler
    penalty is used instead
  • A complete set of variable definitions and
    initialization may look like:

set SID(cfg,grammars) [list \
    [list NAV [list $cfgPath/cfg.ka.nav $cfgPath/cfg.base.nav]] \
    [list SHARED [list $cfgPath/cfg.shared]]]
set SID(cfg,dict) $dictPath/nav.dict
set SID(cfg,baseDict) $dictPath/baseDict
set SID(cfg,classes) [list $dictPath/nav.classes]
set SID(cfg,fillers) [list $dictPath/nav.fillers]
set dict [set SID(cfg,dict)]
cfgInit SID -makeDict 1
dictInit SID -desc $dict
ibisInit SID -lm cfgSet$SID \
    -vocabDesc $dict.v -mapDesc $dict.m
svmap$SID configure -filPen 60

Example filler file:
click
interjection
interjection(ah)
pause
28
Sub-Grammars / Grammar Domains
  • A CFGSet is a set of single sub-grammars
  • Each grammar can be seen as modeling a single
    domain, e.g. hotel reservation, navigation, ...
  • Tags can be given to sub-grammars to identify the
    domains
  • Sub-grammars can be activated / deactivated at
    run-time
  • Common, domain independent concepts can be
    defined in a special shared sub-grammar
  • Easier to maintain and better handling together
    with dialogue systems
  • Can be used e.g. for multilingual decoding
  • Multilingual acoustic model
  • Each sub-grammar represents a different language
  • Common, multilingual search and language model
    vocabulary
  • Decoding works in parallel

29
Tight Coupling of SR and DLM
  • Sub-grammars and single rules can be
  • activated / deactivated during runtime
  • given a weight, e.g. for penalizing them
  • This mechanism can be used for a tight coupling
    of SR and DLM
  • Therefore, the same grammar description files
    should be used in both SR and DLM
  • The parsed output of the SR can be used directly
    by the DLM to determine the user intention
  • Together with the dialogue context, it is now
    possible to predict the context of the next user
    answer
  • Rules within that context can be preferred by
    adjusting the rule weights in the SR

30
Other features
  • Grammars can be expanded on the fly
  • Starting over during decoding one sentence can be
    activated
  • It is possible to define all rules as top-level
    rules
  • Visualization is possible by using the FSM output
    format together with the ATT FSM toolkit and
    dot
  • Upcoming features
  • Training of the grammar transitions by parsing a
    training corpus
  • Collecting bi-gram transition statistics
  • Hybrid Language Models / Unified Language Models
  • Combinations of n-gram LMs and CFGs
  • .

(Diagram: combinations of n-gram LMs and CFGs)