A new framework for Language Model Training

Provided by: davidhugg
Learn more at: http://www.cs.cmu.edu

1
A new framework for Language Model Training
  • David Huggins-Daines
  • January 19, 2006

2
Overview
  • Current tools
  • Requirements for new framework
  • User Interface Examples
  • Design and API

3
Current status of LM training
  • The CMU SLM toolkit
  • Efficient implementation of basic algorithms
  • Doesn't handle all tasks of building an LM
  • Text normalization
  • Vocabulary selection
  • Interpolation/adaptation
  • Requires an expert to put the pieces together
  • Lots of scripts
  • SimpleLM, Communicator, CALO, etc.
  • Other LM toolkits
  • SRILM, Lemur, others?

4
Requirements
  • LM training should be
    • Repeatable
      • An end-to-end rebuild should produce the same result
    • Configurable
      • It should be easy to change parameters and rebuild the
        entire model to see their effect
    • Flexible
      • Should support many types of source texts and methods of
        training
    • Extensible
      • Modular structure to allow new methods and data sources to
        be easily implemented

5
Tasks of building an LM
  • Normalize source texts
    • They come in many different formats!
    • LM toolkit expects a stream of words
    • What is a word?
      • Compound words, acronyms
      • Non-lexemes (filler words, pauses, disfluencies)
    • What is a sentence?
      • Segmentation of input data
  • Annotate source texts with class tags
  • Select a vocabulary
    • Determine optimal vocabulary size
    • Collect words from training texts
    • Define vocabulary classes
    • Vocabulary closure
  • Build a dictionary (pronunciation modeling)
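
The vocabulary-selection step can be illustrated with a small Python sketch. This is not the toolkit's code (the toolkit is written in Perl); the function names and the toy data are hypothetical. It assumes the source texts have already been normalized into lists of words, and that a cutoff of N drops every word seen N times or fewer, which matches the "cutoff 1 excludes singletons" convention used in the configuration examples.

```python
from collections import Counter

def select_vocabulary(sentences, cutoff=0):
    """Keep every word whose training count is strictly greater than cutoff."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n > cutoff}

def oov_rate(sentences, vocabulary):
    """Fraction of running words in `sentences` that are not in `vocabulary`."""
    tokens = [word for sentence in sentences for word in sentence]
    if not tokens:
        return 0.0
    return sum(word not in vocabulary for word in tokens) / len(tokens)

# Toy data, invented for illustration only.
train = [["i", "see", "the", "cat"], ["the", "cat", "sleeps"]]
test = [["the", "dog", "sleeps"]]

vocab = select_vocabulary(train, cutoff=1)  # drops words seen only once
rate = oov_rate(test, vocab)                # share of test tokens outside vocab
```

Raising the cutoff shrinks the vocabulary and raises the out-of-vocabulary rate on unseen text, which is one reason the vocabulary size has to be tuned rather than fixed in advance.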

6
Tasks, continued
  • Estimate N-Gram model(s)
    • Choose the appropriate smoothing parameters
    • Find the appropriate divisions of the training set
  • Interpolate N-Gram models
    • Use a held-out set representative of the test set
    • Find weights for different models which maximize likelihood
      (minimize perplexity) on this domain
  • Evaluate language model
    • Jointly minimize perplexity and OOV rate
      (they tend to move in opposite directions)
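
One common way to find interpolation weights that maximize held-out likelihood is the EM algorithm for mixture weights. The Python sketch below is a minimal illustration of that technique, not the method the toolkit is confirmed to use. It assumes each component model has already assigned a probability to every held-out token; the function names and the probability values are invented for the example.

```python
import math

def em_interpolation_weights(component_probs, iterations=50):
    """Estimate linear interpolation weights by EM.

    component_probs[i][k] is the probability that model k assigns to
    held-out token i given its history (all values must be > 0).
    """
    n_models = len(component_probs[0])
    weights = [1.0 / n_models] * n_models  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * n_models
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            # E-step: posterior share of this token owed to each model
            for k, (w, p) in enumerate(zip(weights, probs)):
                expected[k] += w * p / mix
        # M-step: new weight is the average posterior share
        weights = [e / len(component_probs) for e in expected]
    return weights

def perplexity(component_probs, weights):
    """Perplexity of the interpolated model on the held-out tokens."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, probs)))
        for probs in component_probs
    )
    return math.exp(-log_sum / len(component_probs))

# Invented per-token probabilities from two hypothetical component models.
heldout = [(0.20, 0.01), (0.10, 0.02), (0.05, 0.30), (0.15, 0.05)]
weights = em_interpolation_weights(heldout)
```

Each EM iteration cannot decrease held-out likelihood, so the resulting perplexity is never worse than that of the uniform starting weights. The held-out set should resemble the test domain, since the weights are only as good as that match.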

7
A Simple Switchboard Example
<!-- Top level tag - must be only one -->
<NGramModel>
  <!-- A set of transcripts -->
  <Transcripts name="swb.files">
    <!-- The input filter to use -->
    <InputFilterSWB>
      <!-- A list of files -->
      <Transcripts list="swb.files"/>
    </InputFilterSWB>
  </Transcripts>
  <!-- Exclude singletons -->
  <Vocabulary cutoff="1">
    <!-- Backreference to named object -->
    <Transcripts name="swb.files"/>
  </Vocabulary>
</NGramModel>

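
The "backreference to named object" idea can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration of one way such references might be resolved, not the toolkit's implementation (which is in Perl). The helper names are hypothetical, and the rule used here, that an empty element carrying only a `name` refers back to an earlier element with the same tag and name, is an assumption drawn from the example.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<NGramModel>
  <Transcripts name="swb.files">
    <InputFilterSWB>
      <Transcripts list="swb.files"/>
    </InputFilterSWB>
  </Transcripts>
  <Vocabulary cutoff="1">
    <Transcripts name="swb.files"/>
  </Vocabulary>
</NGramModel>
"""

def has_body(element):
    """True if the element has child elements or non-blank text."""
    return len(element) > 0 or bool((element.text or "").strip())

def index_named_objects(root):
    """Map (tag, name) to the first element that defines that name."""
    definitions = {}
    for element in root.iter():
        name = element.get("name")
        if name and has_body(element):
            definitions.setdefault((element.tag, name), element)
    return definitions

def resolve(element, definitions):
    """Return the definition a back-reference points to, else the element."""
    key = (element.tag, element.get("name"))
    if element.get("name") and not has_body(element) and key in definitions:
        return definitions[key]
    return element

root = ET.fromstring(CONFIG.strip())
definitions = index_named_objects(root)
reference = root.find("Vocabulary/Transcripts")  # the empty back-reference
target = resolve(reference, definitions)         # the defining Transcripts
```

Resolving by (tag, name) rather than by name alone lets a transcript set and a model share a name without colliding, as the longer example does with `swb.test`.
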
8
A More Complicated Example
(Interpolation of ICSI and Switchboard)
<NGramModel name="interp.test">
  <!-- Files can be listed directly in element contents -->
  <Transcripts name="swb.test">
    swb.test.lsn
  </Transcripts>
  <Transcripts name="icsi.test">
    <InputFilterICSI>
      icsi.test.mrt
    </InputFilterICSI>
  </Transcripts>
  <!-- Vocabularies can be nested (merged) -->
  <Vocabulary name="icsi.swb1">
    <Vocabulary cutoff="1">
      <Transcripts name="swb.test"/>
    </Vocabulary>
    <Vocabulary>
      <Transcripts name="icsi.test"/>
    </Vocabulary>
    <!-- Words can be listed directly in element contents -->
    BRAZIL
  </Vocabulary>
  <NGramModel name="swb.test">
    <Transcripts name="swb.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <NGramModel name="icsi.test">
    <Transcripts name="icsi.test"/>
    <Vocabulary name="icsi.swb1"/>
  </NGramModel>
  <Interpolation>
    <!-- Held-out set for interpolation -->
    <InputFilterCMU>
      cmu.test.trs
    </InputFilterCMU>
    <!-- Interpolate previously named LMs -->
    <NGramModel name="swb.test"/>
    <NGramModel name="icsi.test"/>
  </Interpolation>
</NGramModel>

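
The nested-vocabulary behaviour can be sketched in Python as a recursive set union. This is a guess at the semantics suggested by the example, not the toolkit's code: each nested vocabulary is assumed to apply its own cutoff to its own transcripts before the results are merged with any literal words. The dictionary keys, function name, and toy transcripts are all hypothetical.

```python
from collections import Counter

def build_vocabulary(spec, transcripts):
    """Recursively build a word set from a nested vocabulary spec.

    Hypothetical spec keys:
      "transcripts": name of a transcript set to collect words from
      "cutoff":      drop words seen this many times or fewer (default 0)
      "words":       words listed directly
      "children":    nested vocabulary specs to merge in
    """
    words = set(spec.get("words", []))
    if "transcripts" in spec:
        sentences = transcripts[spec["transcripts"]]
        counts = Counter(w for sentence in sentences for w in sentence)
        cutoff = spec.get("cutoff", 0)
        words |= {w for w, n in counts.items() if n > cutoff}
    for child in spec.get("children", []):
        words |= build_vocabulary(child, transcripts)
    return words

# Toy transcripts, invented for illustration only.
transcripts = {
    "swb.test": [["hello", "there"], ["hello", "again"]],
    "icsi.test": [["meeting", "notes"]],
}
spec = {
    "words": ["BRAZIL"],
    "children": [
        {"transcripts": "swb.test", "cutoff": 1},
        {"transcripts": "icsi.test"},
    ],
}
merged = build_vocabulary(spec, transcripts)
```

Applying the cutoff per child rather than after merging means a small in-domain corpus can keep its rare words while a large out-of-domain corpus is pruned.
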
9
Command-line Interface
  • lm_train
    • Runs an XML configuration file
  • build_vocab
    • Build vocabularies, normalize transcripts
  • ngram_train
    • Train individual N-Gram models
  • ngram_test
    • Evaluate N-Gram models
  • ngram_interpolate
    • Interpolate and combine N-Gram models
  • ngram_pronounce
    • Build a pronunciation lexicon from a language model or
      vocabulary

10
Programming Interface
  • NGramFactory
    • Builds an NGramModel from an XML specification
      (as seen previously)
  • NGramModel
    • Trains a single N-Gram LM from some transcripts
  • Vocabulary
    • Builds a vocabulary from transcripts or other vocabularies
  • InputFilter
    • Subclassed into InputFilterCMU, InputFilterICSI,
      InputFilterHUB5, InputFilterISL, etc.
    • Reads transcripts in some format and outputs a word stream

11
Design in Plain English
  • NGramFactory builds an NGramModel
  • NGramModel has a Vocabulary
  • NGramModel and Vocabulary can have Transcripts
  • NGramModel and Vocabulary use an InputFilter (or
    maybe they don't)
  • NGramModel can merge two other NGramModels using
    a set of Transcripts
  • Vocabulary can merge another Vocabulary

12
A very simple InputFilter
please!!!
(InputFilter/Simple.pm)
use strict;
package InputFilter::Simple;
require InputFilter;
use base 'InputFilter';    # Subclass of InputFilter

sub process_transcript {
    my ($self, $file) = @_;

    local ($_, *FILE);     # (This is just good practice)
    # Read the input file
    open FILE, "<$file" or die "Failed to open $file: $!";
    while (<FILE>) {
        chomp;
        my @words = split;                  # Tokenize, normalize, etc
        $self->output_sentence(\@words);    # Pass each sentence to this method
    }
}

1;

13
Where to get it
  • Currently in CVS on fife.speech
    • :ext:fife.speech.cs.cmu.edu:/home/CVS
    • module: LMTraining
  • Future: CPAN and cmusphinx.org
  • Possibly integrated with the CMU SLM toolkit in
    the future

14
Stuff TODO
  • Class LM support
    • Communicator-style class tags are recognized and supported
    • NGramModel will build .lmctl and .probdef files
    • However, this requires normalizing the files to a transcript
      first, then running the semi-automatic Communicator tagger
    • Automatic tagging would be nice
  • Support for languages other than English
    • Text normalization conventions
    • Word segmentation (for Asian languages)
    • Character set support (case conversions, etc.)
    • Unicode (also a CMU-SLM problem)

15
Questions?