Title: A new framework for Language Model Training
1. A new framework for Language Model Training
- David Huggins-Daines
- January 19, 2006
2. Overview
- Current tools
- Requirements for new framework
- User Interface Examples
- Design and API
3. Current status of LM training
- The CMU SLM toolkit
- Efficient implementation of basic algorithms
- Doesn't handle all the tasks of building an LM
- Text normalization
- Vocabulary selection
- Interpolation/adaptation
- Requires an expert to put the pieces together
- Lots of scripts
- SimpleLM, Communicator, CALO, etc.
- Other LM toolkits
- SRILM, Lemur, others?
4. Requirements
- LM training should be
- Repeatable
- An end-to-end rebuild should produce the same result
- Configurable
- It should be easy to change parameters and rebuild the entire model to see their effect
- Flexible
- Should support many types of source texts and methods of training
- Extensible
- Modular structure to allow new methods and data sources to be easily implemented
5. Tasks of building an LM
- Normalize source texts (see the sketch after this list)
- They come in many different formats!
- LM toolkit expects a stream of words
- What is a word?
- Compound words, acronyms
- Non-lexemes (filler words, pauses, disfluencies)
- What is a sentence?
- Segmentation of input data
- Annotate source texts with class tags
- Select a vocabulary
- Determine optimal vocabulary size
- Collect words from training texts
- Define vocabulary classes
- Vocabulary closure
- Build a dictionary (pronunciation modeling)
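To make the normalization tasks above concrete, here is a minimal, hypothetical sketch in Perl; the sample line and the regexes are invented for illustration and are not from the toolkit:

    use strict;

    # Hypothetical cleanup of one raw transcript line (sample text invented)
    my $line = "A: yeah [laughter] i mean, it's EE-mail";
    $line =~ s/^[A-Z]+:\s*//;     # strip the speaker tag
    $line =~ s/\[[^\]]*\]\s*//g;  # drop bracketed non-lexemes (noises, fillers)
    $line =~ tr/.,?!;//d;         # strip punctuation
    my @words = split ' ', uc $line;
    # @words is now: YEAH I MEAN IT'S EE-MAIL -- a plain stream of words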
6. Tasks, continued
- Estimate N-Gram model(s)
- Choose the appropriate smoothing parameters
- Find the appropriate divisions of the training set
- Interpolate N-Gram models
- Use a held-out set representative of the test set
- Find weights for the different models which maximize likelihood (minimize perplexity) on this domain (see the formulas below)
- Evaluate the language model
- Jointly minimize perplexity and OOV rate
- (they tend to move in opposite directions)
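For reference, the interpolation and evaluation steps above use the standard definitions (not specific to this toolkit): a linearly interpolated model combines component models with non-negative weights summing to one, and the weights are chosen to minimize perplexity on the held-out set.

    % Linear interpolation of component models with weights lambda_i
    P_{\mathrm{interp}}(w \mid h) = \sum_i \lambda_i \, P_i(w \mid h),
    \qquad \sum_i \lambda_i = 1, \quad \lambda_i \ge 0

    % Perplexity over N held-out words
    \mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log P_{\mathrm{interp}}(w_t \mid h_t) \right)

Since PPL decreases monotonically as the average log-likelihood increases, maximizing held-out likelihood and minimizing perplexity are the same criterion.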
7. A Simple Switchboard Example

<NGramModel>                          <!-- Top-level tag: there must be only one -->
  <Transcripts name="swb.files">      <!-- A set of transcripts -->
    <InputFilterSWB>                  <!-- The input filter to use -->
      <Transcripts list="swb.files"/> <!-- A list of files -->
    </InputFilterSWB>
  </Transcripts>
  <Vocabulary cutoff="1">             <!-- cutoff="1" excludes singletons -->
    <Transcripts name="swb.files"/>   <!-- Backreference to the named object -->
  </Vocabulary>
</NGramModel>
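Assuming the configuration above is saved to a file (say, swb.xml), it would be run with the lm_train tool described on the command-line interface slide, presumably as something like "lm_train swb.xml"; the exact invocation is a guess based on lm_train's description as the tool that runs an XML configuration file.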
8. A More Complicated Example
(Interpolation of ICSI and Switchboard)

<NGramModel name="interp.test">
  <Transcripts name="swb.test">
    swb.test.lsn                      <!-- Files can be listed directly in element contents -->
  </Transcripts>
  <Transcripts name="icsi.test">
    <InputFilterICSI>
      icsi.test.mrt
    </InputFilterICSI>
  </Transcripts>
  <Vocabulary name="icsi.swb1">       <!-- Vocabularies can be nested (merged) -->
    <Vocabulary cutoff="1">
      <Transcripts name="swb.test"/>
    </Vocabulary>
    <Vocabulary>
      <Transcripts name="icsi.test"/>
    </Vocabulary>
    BRAZIL                            <!-- Words can be listed directly in element contents -->
  </Vocabulary>
  <NGramModel name="swb.test">
    <Transcripts name="swb.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <NGramModel name="icsi.test">
    <Transcripts name="icsi.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <Interpolation>
    <InputFilterCMU>                  <!-- Held-out set for interpolation -->
      cmu.test.trs
    </InputFilterCMU>
    <NGramModel name="swb.test"/>     <!-- Interpolate previously named LMs -->
    <NGramModel name="icsi.test"/>
  </Interpolation>
</NGramModel>
9. Command-line Interface
- lm_train
- Runs an XML configuration file
- build_vocab
- Build vocabularies, normalize transcripts
- ngram_train
- Train individual N-Gram models
- ngram_test
- Evaluate N-Gram models
- ngram_interpolate
- Interpolate and combine N-Gram models
- ngram_pronounce
- Build a pronunciation lexicon from a language model or vocabulary
10. Programming Interface
- NGramFactory
- Builds an NGramModel from an XML specification (as seen previously; a usage sketch follows this list)
- NGramModel
- Trains a single N-Gram LM from some transcripts
- Vocabulary
- Builds a vocabulary from transcripts or other vocabularies
- InputFilter
- Subclassed into InputFilterCMU, InputFilterICSI, InputFilterHUB5, InputFilterISL, etc.
- Reads transcripts in some format and outputs a word stream
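A sketch of how these classes might be driven from Perl. Only the class names come from this slide; the constructor and method names (new, build) are assumptions for illustration, not documented API:

    use strict;
    use NGramFactory;

    # Hypothetical driver: method names are assumed, not documented API
    my $factory = NGramFactory->new();
    my $lm = $factory->build("swb.xml");  # parse the XML spec and train the model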
11. Design in Plain English
- NGramFactory builds an NGramModel
- NGramModel has a Vocabulary
- NGramModel and Vocabulary can have Transcripts
- NGramModel and Vocabulary use an InputFilter (or maybe they don't)
- NGramModel can merge two other NGramModels using a set of Transcripts
- Vocabulary can merge another Vocabulary (see the XML skeleton below)
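The same relationships, restated as the XML nesting they imply; this skeleton is a sketch that reuses only element types from the earlier examples, with placeholder names:

    <NGramModel>                   <!-- NGramFactory builds an NGramModel -->
      <Vocabulary>                 <!-- NGramModel has a Vocabulary -->
        <Transcripts name="..."/>  <!-- ...which can have Transcripts -->
      </Vocabulary>
      <Transcripts name="...">     <!-- NGramModel can have Transcripts -->
        <InputFilterCMU>           <!-- ...read through an InputFilter -->
          some.training.trs
        </InputFilterCMU>
      </Transcripts>
      <Interpolation>              <!-- merge two other NGramModels -->
        <NGramModel name="..."/>
        <NGramModel name="..."/>
      </Interpolation>
    </NGramModel>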
12. A very simple InputFilter, please!!!
(InputFilter/Simple.pm)
use strict;                          # (This is just good practice)
package InputFilter::Simple;
require InputFilter;
use base 'InputFilter';              # Subclass of InputFilter

sub process_transcript {
    my ($self, $file) = @_;
    local ($_, *FILE);
    open FILE, "<$file" or die "Failed to open $file: $!";
    while (<FILE>) {                 # Read the input file
        chomp;
        my @words = split;           # Tokenize, normalize, etc.
        $self->output_sentence(\@words);  # Pass each sentence to this method
    }
}

1;
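Following the naming pattern of the earlier examples (InputFilterSWB, InputFilterICSI), this filter would presumably be selected in a configuration as an InputFilterSimple element; the file name below is invented:

    <Transcripts name="my.files">
      <InputFilterSimple>
        my.transcripts.txt
      </InputFilterSimple>
    </Transcripts>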
13. Where to get it
- Currently in CVS on fife.speech
- :ext:fife.speech.cs.cmu.edu:/home/CVS
- module LMTraining
- Future: CPAN and cmusphinx.org
- Possibly integrated with the CMU SLM toolkit in the future
14. Stuff TODO
- Class LM support
- Communicator-style class tags are recognized and supported
- NGramModel will build .lmctl and .probdef files
- However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger
- Automatic tagging would be nice
- Support for languages other than English
- Text normalization conventions
- Word segmentation (for Asian languages)
- Character set support (case conversions etc)
- Unicode (also a CMU-SLM problem)
15. Questions?