Title: A new framework for Language Model Training
1. A new framework for Language Model Training
- David Huggins-Daines
- January 19, 2006
2. Overview
- Current tools
- Requirements for new framework
- User Interface Examples
- Design and API
3. Current status of LM training
- The CMU SLM toolkit
- Efficient implementation of basic algorithms
- Doesn't handle all the tasks of building an LM
- Text normalization
- Vocabulary selection
- Interpolation/adaptation
- Requires an expert to put the pieces together
- Lots of scripts
- SimpleLM, Communicator, CALO, etc.
- Other LM toolkits
- SRILM, Lemur, others?
4. Requirements
- LM training should be
- Repeatable
- An end-to-end rebuild should produce the same result
- Configurable
- It should be easy to change parameters and rebuild the entire model to see their effect
- Flexible
- Should support many types of source texts and methods of training
- Extensible
- Modular structure to allow new methods and data sources to be easily implemented
5. Tasks of building an LM
- Normalize source texts (see the sketch after this list)
- They come in many different formats!
- LM toolkit expects a stream of words
- What is a word?
- Compound words, acronyms
- Non-lexemes (filler words, pauses, disfluencies)
- What is a sentence?
- Segmentation of input data
- Annotate source texts with class tags
- Select a vocabulary
- Determine optimal vocabulary size
- Collect words from training texts
- Define vocabulary classes
- Vocabulary closure
- Build a dictionary (pronunciation modeling)
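To make the normalization tasks above concrete, here is a minimal, hypothetical sketch in Perl; the sample line and the regexes are invented for illustration and are not from the toolkit:

    use strict;

    # Hypothetical cleanup of one raw transcript line (sample text invented)
    my $line = "A: yeah [laughter] i mean, it's EE-mail";
    $line =~ s/^[A-Z]+:\s*//;     # strip the speaker tag
    $line =~ s/\[[^\]]*\]\s*//g;  # drop bracketed non-lexemes (noises, fillers)
    $line =~ tr/.,?!;//d;         # strip punctuation
    my @words = split ' ', uc $line;
    # @words is now: YEAH I MEAN IT'S EE-MAIL -- a plain stream of words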
6. Tasks, continued
- Estimate N-Gram model(s)
- Choose the appropriate smoothing parameters
- Find the appropriate divisions of the training set
- Interpolate N-Gram models
- Use a held-out set representative of the test set
- Find weights for the different models which maximize likelihood (minimize perplexity) on this domain (see the formulas below)
- Evaluate the language model
- Jointly minimize perplexity and OOV rate
- (they tend to move in opposite directions)
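For reference, the interpolation and evaluation steps above use the standard definitions (not specific to this toolkit): a linearly interpolated model combines component models with non-negative weights summing to one, and the weights are chosen to minimize perplexity on the held-out set.

    % Linear interpolation of component models with weights lambda_i
    P_{\mathrm{interp}}(w \mid h) = \sum_i \lambda_i \, P_i(w \mid h),
    \qquad \sum_i \lambda_i = 1, \quad \lambda_i \ge 0

    % Perplexity over N held-out words
    \mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log P_{\mathrm{interp}}(w_t \mid h_t) \right)

Since PPL decreases monotonically as the average log-likelihood increases, maximizing held-out likelihood and minimizing perplexity are the same criterion.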
7. A Simple Switchboard Example

<NGramModel>                          <!-- Top-level tag: there must be only one -->
  <Transcripts name="swb.files">      <!-- A set of transcripts -->
    <InputFilterSWB>                  <!-- The input filter to use -->
      <Transcripts list="swb.files"/> <!-- A list of files -->
    </InputFilterSWB>
  </Transcripts>
  <Vocabulary cutoff="1">             <!-- cutoff="1" excludes singletons -->
    <Transcripts name="swb.files"/>   <!-- Backreference to the named object -->
  </Vocabulary>
</NGramModel>
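Assuming the configuration above is saved to a file (say, swb.xml), it would be run with the lm_train tool described on the command-line interface slide, presumably as something like "lm_train swb.xml"; the exact invocation is a guess based on lm_train's description as the tool that runs an XML configuration file.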
8. A More Complicated Example
(Interpolation of ICSI and Switchboard)

<NGramModel name="interp.test">
  <Transcripts name="swb.test">
    swb.test.lsn                      <!-- Files can be listed directly in element contents -->
  </Transcripts>
  <Transcripts name="icsi.test">
    <InputFilterICSI>
      icsi.test.mrt
    </InputFilterICSI>
  </Transcripts>
  <Vocabulary name="icsi.swb1">       <!-- Vocabularies can be nested (merged) -->
    <Vocabulary cutoff="1">
      <Transcripts name="swb.test"/>
    </Vocabulary>
    <Vocabulary>
      <Transcripts name="icsi.test"/>
    </Vocabulary>
    BRAZIL                            <!-- Words can be listed directly in element contents -->
  </Vocabulary>
  <NGramModel name="swb.test">
    <Transcripts name="swb.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <NGramModel name="icsi.test">
    <Transcripts name="icsi.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <Interpolation>
    <InputFilterCMU>                  <!-- Held-out set for interpolation -->
      cmu.test.trs
    </InputFilterCMU>
    <NGramModel name="swb.test"/>     <!-- Interpolate previously named LMs -->
    <NGramModel name="icsi.test"/>
  </Interpolation>
</NGramModel>
9. Command-line Interface
- lm_train
- Runs an XML configuration file
- build_vocab
- Build vocabularies, normalize transcripts
- ngram_train
- Train individual N-Gram models
- ngram_test
- Evaluate N-Gram models
- ngram_interpolate
- Interpolate and combine N-Gram models
- ngram_pronounce
- Build a pronunciation lexicon from a language model or vocabulary
10. Programming Interface
- NGramFactory
- Builds an NGramModel from an XML specification (as seen previously; a usage sketch follows this list)
- NGramModel
- Trains a single N-Gram LM from some transcripts
- Vocabulary
- Builds a vocabulary from transcripts or other vocabularies
- InputFilter
- Subclassed into InputFilterCMU, InputFilterICSI, InputFilterHUB5, InputFilterISL, etc.
- Reads transcripts in some format and outputs a word stream
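A sketch of how these classes might be driven from Perl. Only the class names come from this slide; the constructor and method names (new, build) are assumptions for illustration, not documented API:

    use strict;
    use NGramFactory;

    # Hypothetical driver: method names are assumed, not documented API
    my $factory = NGramFactory->new();
    my $lm = $factory->build("swb.xml");  # parse the XML spec and train the model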
11. Design in Plain English
- NGramFactory builds an NGramModel
- NGramModel has a Vocabulary
- NGramModel and Vocabulary can have Transcripts
- NGramModel and Vocabulary use an InputFilter (or maybe they don't)
- NGramModel can merge two other NGramModels using a set of Transcripts
- Vocabulary can merge another Vocabulary (see the XML skeleton below)
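The same relationships, restated as the XML nesting they imply; this skeleton is a sketch that reuses only element types from the earlier examples, with placeholder names:

    <NGramModel>                   <!-- NGramFactory builds an NGramModel -->
      <Vocabulary>                 <!-- NGramModel has a Vocabulary -->
        <Transcripts name="..."/>  <!-- ...which can have Transcripts -->
      </Vocabulary>
      <Transcripts name="...">     <!-- NGramModel can have Transcripts -->
        <InputFilterCMU>           <!-- ...read through an InputFilter -->
          some.training.trs
        </InputFilterCMU>
      </Transcripts>
      <Interpolation>              <!-- merge two other NGramModels -->
        <NGramModel name="..."/>
        <NGramModel name="..."/>
      </Interpolation>
    </NGramModel>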
12. A very simple InputFilter, please!!!
(InputFilter/Simple.pm)
use strict;                          # (This is just good practice)
package InputFilter::Simple;
require InputFilter;
use base 'InputFilter';              # Subclass of InputFilter

sub process_transcript {
    my ($self, $file) = @_;
    local ($_, *FILE);
    open FILE, "<$file" or die "Failed to open $file: $!";
    while (<FILE>) {                 # Read the input file
        chomp;
        my @words = split;           # Tokenize, normalize, etc.
        $self->output_sentence(\@words);  # Pass each sentence to this method
    }
}

1;
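Following the naming pattern of the earlier examples (InputFilterSWB, InputFilterICSI), this filter would presumably be selected in a configuration as an InputFilterSimple element; the file name below is invented:

    <Transcripts name="my.files">
      <InputFilterSimple>
        my.transcripts.txt
      </InputFilterSimple>
    </Transcripts>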
13. Where to get it
- Currently in CVS on fife.speech
- :ext:fife.speech.cs.cmu.edu:/home/CVS
- module LMTraining
- Future: CPAN and cmusphinx.org
- Possibly integrated with the CMU SLM toolkit in the future
14. Stuff TODO
- Class LM support
- Communicator-style class tags are recognized and supported
- NGramModel will build .lmctl and .probdef files
- However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
- Automatic tagging would be nice
- Support for languages other than English
- Text normalization conventions
- Word segmentation (for Asian languages)
- Character set support (case conversions etc)
- Unicode (also a CMU-SLM problem)
15. Questions?