Title: Build a Good-Turing Language Model with SRILM
1. Build a Good-Turing Language Model with SRILM
- Student: Sunya Santananchai
- Instructor: Dr. Veton Z. Këpuska
2. Objective
- Build a Good-Turing language model from two different corpora.
- Use the SRI Language Modeling toolkit to build a 3-gram language model.
- Compare the perplexities of the two 3-gram language models.
- Compare the perplexity of a model built on the two corpora combined.
3. Introduction
- SRILM is a toolkit for building and applying statistical language models (LMs).
- It is used in speech recognition, statistical tagging, and segmentation.
- It has been developed at the SRI Speech Technology and Research Laboratory since 1995.
4. SRILM consists of the following components
- A set of C++ class libraries implementing language models, supporting data structures, and miscellaneous utility functions.
- A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, tagging or segmenting text, etc.
- A collection of miscellaneous scripts facilitating minor related tasks.
5. Basic LM operations
- The main purpose of SRILM is to support language model estimation and evaluation.
- Estimation creates a model from training data.
- Evaluation computes the probability of a test corpus, conventionally expressed as the test-set perplexity.
- SRILM is based on N-gram statistics; the main tools are ngram-count and ngram.
6. Build a Good-Turing language model with SRILM
- Use the SRILM toolkit to build 3-gram language models based on two different corpora.
- Compare the perplexities of the language models built on each corpus.
- Build a 3-gram language model on the two corpora combined, and compare its perplexity with the others.
7. Install a Linux-like environment: Cygwin
- Download the Cygwin installer
- http://www.cygwin.com
- Run setup.exe
- Click "Install from Internet"
- Keep the default root install directory: C:\cygwin
- Select a download site
- Install all Cygwin packages
8. Install SRILM
- Download the SRILM toolkit
- Run Cygwin
- Unpack srilm.tgz with the following commands:
- cd /cygdrive/c/cygwin/srilm
- tar zxvf srilm.tgz
9. Install SRILM
- Edit the Makefile in the SRILM folder and add the following lines:
- SRILM = /cygdrive/c/cygwin/srilm
- MACHINE_TYPE = cygwin
- Command for building and installing SRILM:
- make World
10. Corpus and Lexicon
- Download nltk-data-0.9.zip from the Natural Language Toolkit website
- http://nltk.sourceforge.net/index.php/Corpora
- From nltk-data-0.9, take the folder abc and the folder words
- Folder abc contains two corpora: rural.txt and science.txt
- Folder words contains the lexicon file en
11. Generate Count File
- Command for generating the 3-gram count file for rural.txt:
- ./ngram-count -vocab en
- -text rural.txt
- -order 3
- -write rural.count
- -unk
12. Generate Count File
- Command for generating the 3-gram count file for science.txt:
- ./ngram-count -vocab en
- -text science.txt
- -order 3
- -write science.count
- -unk
13. Generate Count File
- Combine the two corpora into one corpus, sum.txt:
- cat rural.txt science.txt > sum.txt
- Command for generating the 3-gram count file:
- ./ngram-count -vocab en
- -text sum.txt
- -order 3
- -write sum.count
- -unk
14. Count File
- ngram-count
- Counts N-grams and estimates language models.
- -vocab file
- Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts or text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
- -text textfile
- Generate N-gram counts from the text file. textfile should contain one sentence unit per line. Begin/end-sentence tokens are added if not already present. Empty lines are ignored.
- -order n
- Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
- -write file
- Write total counts to file.
- -unk
- Build an open-vocabulary LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
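The counting behavior these options describe can be sketched in a few lines of Python. This is an illustration only, not SRILM's implementation; ngram-count is a compiled tool with many more options.

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Rough sketch of what ngram-count -order 3 tallies: all 1- to
    order-grams, with <s>/</s> boundary tokens added to each line
    when absent, as the -text description above notes."""
    counts = Counter()
    for line in sentences:
        if not line.strip():          # empty lines are ignored
            continue
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["the cow eats grass", "the cow sleeps"])
print(counts[("the", "cow")])   # the bigram "the cow" occurs twice
```

A real count file simply lists each of these N-grams followed by its integer count, one per line.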
16. Good-Turing Language Model
- ./ngram-count
- -read rural.count
- -order 3
- -lm rural.lm
- -gt3min 1 -gt3max 3
17. Good-Turing Language Model
- ./ngram-count
- -read science.count
- -order 3
- -lm science.lm
- -gt3min 1 -gt3max 3
18. Good-Turing Language Model
- ./ngram-count
- -read sum.count
- -order 3
- -lm sum.lm
- -gt3min 1 -gt3max 3
19. Good-Turing Language Model
- -read countsfile
- Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
- -lm lmfile
- Estimate a backoff N-gram model from the total counts, and write it to lmfile.
- -gtnmin count
- where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
- -gtnmax count
- where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that receive maximum-likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
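The Good-Turing discounting that -gtnmin/-gtnmax control replaces each low raw count c with c* = (c + 1) N_(c+1) / N_c, where N_c is the number of distinct N-grams seen exactly c times. A minimal sketch of that formula follows; SRILM additionally smooths the N_c values and renormalizes, so this is illustrative only:

```python
from collections import Counter

def good_turing_counts(counts, gtmax=3):
    """Replace each raw count c <= gtmax with the Good-Turing estimate
    c* = (c + 1) * N_{c+1} / N_c; larger counts keep their maximum-
    likelihood value, mirroring the -gt3max behavior described above.
    Illustrative sketch only: SRILM smooths N_c and renormalizes."""
    n_c = Counter(counts.values())   # N_c: how many n-grams occur exactly c times
    adjusted = {}
    for gram, c in counts.items():
        if c <= gtmax and n_c[c + 1] > 0:
            adjusted[gram] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            adjusted[gram] = float(c)
    return adjusted

raw = {"a b c": 1, "a b d": 1, "b c d": 1, "c d e": 2, "d e f": 5}
print(good_turing_counts(raw))
```

With three singletons and one doubleton, each singleton count 1 is discounted to 2 * 1/3, reserving probability mass for unseen N-grams.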
21. Test Data Perplexity
- Choose one article from Yahoo News as test data.
- Commands for the 3-gram language models:
- ./ngram -ppl test.txt -order 3 -lm rural.lm
- ./ngram -ppl test.txt -order 3 -lm science.lm
- ./ngram -ppl test.txt -order 3 -lm sum.lm
22. Test Data Perplexity
- -ppl textfile
- Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line. The -debug option controls the level of detail printed, even though output is to stdout (not stderr).
- -lm file
- Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
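The perplexity figure that ngram -ppl prints can be reproduced from its summary line: SRILM reports a total base-10 log probability, and perplexity is 10 raised to the negative average log probability per predicted token, where the end-of-sentence markers count as predicted tokens too. A sketch, with made-up numbers standing in for a real summary line:

```python
def srilm_ppl(logprob10, n_words, n_sentences):
    """SRILM-style perplexity: ppl = 10^(-logprob / (words + sentences)).
    The sentence count is added because each </s> token is also
    predicted by the model; logprob10 is base-10, as ngram reports it.
    (Sketch only; ngram also excludes OOV/zero-probability tokens.)"""
    return 10 ** (-logprob10 / (n_words + n_sentences))

# Hypothetical values resembling an "ngram -ppl" summary:
# 30 sentences, 550 words, total logprob = -1500
print(round(srilm_ppl(-1500.0, 550, 30), 2))
```

Lower perplexity means the model found the test text less surprising, which is how the three models on slide 24 are compared.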
23. Result of test.txt
24. Conclusion
  Corpus            Perplexity
  Rural             389.96
  Science           463.479
  Rural + Science   484.409
25. Conclusion
- We used the SRILM toolkit to build a 3-gram language model from each corpus.
- We also used the SRILM toolkit to build a 3-gram language model from the combined corpora. The language models give different perplexities on the test corpus.
26. Conclusion
- Why does the combined Rural + Science model not achieve a better perplexity than the Rural model alone?
- Because the test article contains more rural-related content than science-related content. Neither corpus fully covers the article, so the Rural model achieves the lowest perplexity while the combined model has the highest.
27. References
- SRI International, The SRI Language Modeling Toolkit, http://www.speech.sri.com/projects/srilm/
- Cygwin Information and Installation, Installing and Updating Cygwin, http://www.cygwin.com/
- Natural Language Toolkit, http://nltk.sourceforge.net/index.php/Corpora