Title: Build a Good-Turing Language Model with SRILM
1. Build a Good-Turing Language Model with SRILM
- Student: Sunya Santananchai
- Instructor: Dr. Veton Z. Këpuska
2. Objective
- Build a Good-Turing language model from two different corpora.
- Use the SRI Language Modeling toolkit to build a 3-gram language model.
- Compare the perplexities of the two 3-gram language models.
- Compare the perplexity of a model built on the two corpora combined.
3. Introduction
- SRILM is a toolkit for building and applying statistical language models (LMs).
- It is used in speech recognition, statistical tagging, and segmentation.
- It has been developed at the SRI Speech Technology and Research Laboratory since 1995.
4. SRILM consists of the following components
- A set of C++ class libraries implementing language models, supporting data structures, and miscellaneous utility functions.
- A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, tagging or segmenting text, etc.
- A collection of miscellaneous scripts facilitating minor related tasks.
5. Basic LM operations
- The main purpose of SRILM is to support language model estimation and evaluation.
- Estimation creates a model from training data.
- Evaluation computes the probability of a test corpus, conventionally expressed as the test-set perplexity.
- SRILM is based on N-gram statistics; the main tools are ngram-count and ngram.
6. Build a Good-Turing language model with SRILM
- Use the SRILM toolkit to build 3-gram language models based on two different corpora.
- Compare the perplexities of the language models built on each corpus.
- Build a 3-gram language model on the two corpora combined, and compare its perplexity with the others.
7. Install a Linux-like environment: Cygwin
- Download the Cygwin installer
- http://www.cygwin.com
- Run setup.exe
- Click "Install from Internet"
- Keep the default root install directory: C:\cygwin
- Select a download site
- Install all Cygwin packages
8. Install SRILM
- Download the SRILM toolkit
- Run Cygwin
- Unpack srilm.tgz with the following commands:
- cd /cygdrive/c/cygwin/srilm
- tar zxvf srilm.tgz
9. Install SRILM
- Edit the Makefile in the SRILM folder and add the following lines:
- SRILM = /cygdrive/c/cygwin/srilm
- MACHINE_TYPE = cygwin
- Command for building and installing SRILM:
- make World
10. Corpus and Lexicon
- Download nltk-data-0.9.zip from the Natural Language Toolkit website
- http://nltk.sourceforge.net/index.php/Corpora
- From nltk-data-0.9, take the folder abc and the folder words
- Folder abc contains two corpora: rural.txt and science.txt
- Folder words contains the lexicon file en
11. Generate Count File
- Command for generating the 3-gram count file for rural.txt:
- ./ngram-count -vocab en
- -text rural.txt
- -order 3
- -write rural.count
- -unk
12. Generate Count File
- Command for generating the 3-gram count file for science.txt:
- ./ngram-count -vocab en
- -text science.txt
- -order 3
- -write science.count
- -unk
13. Generate Count File
- Combine the two corpora into one corpus, sum.txt:
- cat rural.txt science.txt > sum.txt
- Command for generating the 3-gram count file:
- ./ngram-count -vocab en
- -text sum.txt
- -order 3
- -write sum.count
- -unk
14. Count File
- ngram-count
- Counts N-grams and estimates language models.
- -vocab file
- Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts or text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
- -text textfile
- Generate N-gram counts from the text file. textfile should contain one sentence unit per line. Begin/end-sentence tokens are added if not already present. Empty lines are ignored.
- -order n
- Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
- -write file
- Write total counts to file.
- -unk
- Build an open-vocabulary LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
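The counting behavior these options describe can be sketched in a few lines of Python. This is an illustration only, not SRILM's implementation; ngram-count is a compiled tool with many more options.

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Rough sketch of what ngram-count -order 3 tallies: all 1- to
    order-grams, with <s>/</s> boundary tokens added to each line
    when absent, as the -text description above notes."""
    counts = Counter()
    for line in sentences:
        if not line.strip():          # empty lines are ignored
            continue
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["the cow eats grass", "the cow sleeps"])
print(counts[("the", "cow")])   # the bigram "the cow" occurs twice
```

A real count file simply lists each of these N-grams followed by its integer count, one per line.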
16. Good-Turing Language Model
- ./ngram-count
- -read rural.count
- -order 3
- -lm rural.lm
- -gt3min 1 -gt3max 3
17. Good-Turing Language Model
- ./ngram-count
- -read science.count
- -order 3
- -lm science.lm
- -gt3min 1 -gt3max 3
18. Good-Turing Language Model
- ./ngram-count
- -read sum.count
- -order 3
- -lm sum.lm
- -gt3min 1 -gt3max 3
19. Good-Turing Language Model
- -read countsfile
- Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
- -lm lmfile
- Estimate a backoff N-gram model from the total counts, and write it to lmfile.
- -gtnmin count
- where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
- -gtnmax count
- where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that receive maximum-likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
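The Good-Turing discounting that -gtnmin/-gtnmax control replaces each low raw count c with c* = (c + 1) N_(c+1) / N_c, where N_c is the number of distinct N-grams seen exactly c times. A minimal sketch of that formula follows; SRILM additionally smooths the N_c values and renormalizes, so this is illustrative only:

```python
from collections import Counter

def good_turing_counts(counts, gtmax=3):
    """Replace each raw count c <= gtmax with the Good-Turing estimate
    c* = (c + 1) * N_{c+1} / N_c; larger counts keep their maximum-
    likelihood value, mirroring the -gt3max behavior described above.
    Illustrative sketch only: SRILM smooths N_c and renormalizes."""
    n_c = Counter(counts.values())   # N_c: how many n-grams occur exactly c times
    adjusted = {}
    for gram, c in counts.items():
        if c <= gtmax and n_c[c + 1] > 0:
            adjusted[gram] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            adjusted[gram] = float(c)
    return adjusted

raw = {"a b c": 1, "a b d": 1, "b c d": 1, "c d e": 2, "d e f": 5}
print(good_turing_counts(raw))
```

With three singletons and one doubleton, each singleton count 1 is discounted to 2 * 1/3, reserving probability mass for unseen N-grams.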
21. Test Data Perplexity
- Choose one article from Yahoo News as test data.
- Commands for the 3-gram language models:
- ./ngram -ppl test.txt -order 3 -lm rural.lm
- ./ngram -ppl test.txt -order 3 -lm science.lm
- ./ngram -ppl test.txt -order 3 -lm sum.lm
22. Test Data Perplexity
- -ppl textfile
- Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line. The -debug option controls the level of detail printed, even though output is to stdout (not stderr).
- -lm file
- Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
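The perplexity figure that ngram -ppl prints can be reproduced from its summary line: SRILM reports a total base-10 log probability, and perplexity is 10 raised to the negative average log probability per predicted token, where the end-of-sentence markers count as predicted tokens too. A sketch, with made-up numbers standing in for a real summary line:

```python
def srilm_ppl(logprob10, n_words, n_sentences):
    """SRILM-style perplexity: ppl = 10^(-logprob / (words + sentences)).
    The sentence count is added because each </s> token is also
    predicted by the model; logprob10 is base-10, as ngram reports it.
    (Sketch only; ngram also excludes OOV/zero-probability tokens.)"""
    return 10 ** (-logprob10 / (n_words + n_sentences))

# Hypothetical values resembling an "ngram -ppl" summary:
# 30 sentences, 550 words, total logprob = -1500
print(round(srilm_ppl(-1500.0, 550, 30), 2))
```

Lower perplexity means the model found the test text less surprising, which is how the three models on slide 24 are compared.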
23. Result of test.txt
24. Conclusion
  Corpus            Perplexity
  Rural             389.96
  Science           463.479
  Rural + Science   484.409
25. Conclusion
- We used the SRILM toolkit to build a 3-gram language model from each corpus.
- We also used the SRILM toolkit to build a 3-gram language model from the combined corpora. The language models give different perplexities on the test corpus.
26. Conclusion
- Why does the combined Rural + Science model not achieve a better perplexity than the Rural model alone?
- Because the test article contains more rural-related content than science-related content. Neither corpus fully covers the article, so the Rural model achieves the lowest perplexity while the combined model has the highest.
27. References
- SRI International, The SRI Language Modeling Toolkit, http://www.speech.sri.com/projects/srilm/
- Cygwin Information and Installation, Installing and Updating Cygwin, http://www.cygwin.com/
- Natural Language Toolkit, http://nltk.sourceforge.net/index.php/Corpora