Build a Good-Turing language model with SRILM

1
Build a Good-Turing language model with SRILM
  • Student: Sunya Santananchai
  • Instructor: Dr. Veton Z. Këpuska

2
Objective
  • Build a Good-Turing language model based on two
    different corpora.
  • Use the SRI Language Modeling toolkit (SRILM) to
    build a 3-gram language model for each corpus.
  • Compare the perplexities of the two 3-gram
    language models.
  • Compare the perplexity of a 3-gram model trained
    on the two corpora combined.

3
Introduction
  • SRILM is a toolkit for building and applying
    statistical language models (LMs).
  • It is used in speech recognition, statistical
    tagging, and segmentation.
  • It has been developed at the SRI Speech
    Technology and Research Laboratory since 1995.

4
SRILM consists of the following components
  • A set of C++ class libraries implementing
    language models, supporting data structures and
    miscellaneous utility functions.
  • A set of executable programs built on top of
    these libraries to perform standard tasks such as
    training LMs and testing them on data, tagging or
    segmenting text, etc.
  • A collection of miscellaneous scripts
    facilitating minor related tasks.

5
Basic LM operations
  • The main objective of SRILM is to support
    language model estimation and evaluation.
  • Estimation: create a model from training data.
  • Evaluation: compute the probability of a test
    corpus, conventionally expressed as the test-set
    perplexity.
  • SRILM is based on N-gram statistics; the main
    tools are ngram-count and ngram.
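  • A minimal end-to-end sketch of these two
    operations (the file names train.txt, test.txt,
    and train.lm are placeholders, not from the
    slides): ngram-count estimates a 3-gram model
    from training text, and ngram reports the
    test-set perplexity.
  • ./ngram-count -text train.txt -order 3 -lm train.lm
  • ./ngram -ppl test.txt -order 3 -lm train.lm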

6
Build a Good-Turing language model with SRILM
  • Use the SRILM toolkit to build 3-gram language
    models based on two different corpora.
  • Compare the perplexities of the two language
    models, each based on one corpus.
  • Build a 3-gram language model based on the
    combination of the two corpora, and compare the
    resulting perplexity.

7
Install a Linux-like environment: Cygwin
  • Download the Cygwin installation file from
  • http://www.cygwin.com
  • Run setup.exe
  • Click "Install from Internet"
  • Accept the default root install directory, C:\cygwin
  • Select a download site
  • Install all Cygwin packages

8
Install SRILM
  • Download the SRILM toolkit
  • Run Cygwin
  • Unzip srilm.tgz with the following commands:
  • cd /cygdrive/c/cygwin/srilm
  • tar zxvf srilm.tgz

9
Install SRILM
  • Edit the Makefile in the SRILM folder
  • Add the following lines:
  • SRILM = /cygdrive/c/cygwin/srilm
  • MACHINE_TYPE = cygwin
  • Command for installing SRILM:
  • make World
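  • Putting the steps together, a typical build
    session is sketched below. After make World
    finishes, the executables (ngram-count, ngram,
    etc.) should end up in the machine-specific
    directory bin/cygwin under the SRILM folder.
  • cd /cygdrive/c/cygwin/srilm
  • make World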

10
Corpus and Lexicon
  • Download nltk-data-0.9.zip from the Natural
    Language Toolkit (NLTK) website:
  • http://nltk.sourceforge.net/index.php/Corpora
  • Choose the folder abc and the folder words from
    nltk-data-0.9
  • The folder abc includes two corpora: rural.txt
    and science.txt
  • The folder words includes the lexicon file en

11
Generate Count File
  • Command for generating the 3-gram count file:
  • ./ngram-count -vocab en -text rural.txt -order 3 -write rural.count -unk
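  • For illustration, a count file written with
    -write lists one N-gram per line followed by its
    count (see the -read description on a later
    slide); the words and counts below are
    hypothetical, not taken from rural.count:
  • <s> the 42
  • the drought 7
  • the drought has 3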

12
Generate Count File
  • Command for generating the 3-gram count file:
  • ./ngram-count -vocab en -text science.txt -order 3 -write science.count -unk

13
Generate Count File
  • Combine the two corpora into one corpus, sum.txt
    (a sample command is sketched below).
  • Command for generating the 3-gram count file:
  • ./ngram-count -vocab en -text sum.txt -order 3 -write sum.count -unk
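  • One simple way to create sum.txt (an assumed
    command; the slides do not show this step) is to
    concatenate the two corpus files:
  • cat rural.txt science.txt > sum.txt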

14
Count File
  • ngram-count
  • count N-grams and estimate language models
  • -vocab file
  • Read a vocabulary from file. Subsequently,
    out-of-vocabulary words in both counts or text
    are replaced with the unknown-word token. If this
    option is not specified, all words found are
    implicitly added to the vocabulary.
  • -text textfile
  • Generate N-gram counts from text file.
    textfile should contain one sentence unit per
    line. Begin/end sentence tokens are added if not
    already present. Empty lines are ignored.
  • -order n
  • Set the maximal order (length) of N-grams
    to count. This also determines the order of the
    estimated LM, if any. The default order is 3.
  • -write file
  • Write total counts to file.
  • -unk
  • Build an open vocabulary LM, i.e., one
    that contains the unknown-word token as a regular
    word. The default is to remove the unknown word.

16
Good-Turing Language Model
  • ./ngram-count -read rural.count -order 3 -lm rural.lm -gt3min 1 -gt3max 3

17
Good-Turing Language Model
  • ./ngram-count -read science.count -order 3 -lm science.lm -gt3min 1 -gt3max 3

18
Good-Turing Language Model
  • ./ngram-count -read sum.count -order 3 -lm sum.lm -gt3min 1 -gt3max 3
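  • The -lm files use the standard ARPA backoff
    format. An illustrative excerpt (all N-gram
    totals and log probabilities below are
    hypothetical):

\data\
ngram 1=5000
ngram 2=20000
ngram 3=15000

\1-grams:
-2.3450 the -0.5120
...
\end\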

19
Good-Turing Language Model
  • -read countsfile
  • Read N-gram counts from a file. ASCII count
    files contain one N-gram of words per line,
    followed by an integer count, all separated by
    whitespace. Repeated counts for the same N-gram
    are added.
  • -lm lmfile
  • Estimate a backoff N-gram model from the
    total counts, and write it to lmfile.
  • -gtnmin count
  • where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
    Set the minimal count of N-grams of order n that
    will be included in the LM. All N-grams with
    frequency lower than that will effectively be
    discounted to 0. If n is omitted the parameter
    for N-grams of order > 9 is set. NOTE: This
    option affects not only the default Good-Turing
    discounting but the alternative discounting
    methods described below as well.
  • -gtnmax count
  • where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
    Set the maximal count of N-grams of order n that
    are discounted under Good-Turing. All N-grams
    more frequent than that will receive maximum
    likelihood estimates. Discounting can be
    effectively disabled by setting this to 0. If n
    is omitted the parameter for N-grams of order > 9
    is set.
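  • For reference, the Good-Turing discount that
    these options control replaces a raw count r by
    an adjusted count (the standard Good-Turing
    formula, stated here for clarity; it is not on
    the original slides):

    r* = (r + 1) * n(r+1) / n(r)

    where n(r) is the number of distinct N-grams
    occurring exactly r times. With -gt3min 1
    -gt3max 3 as used above, trigram counts from 1
    to 3 are discounted by this formula, and trigrams
    seen more than 3 times keep their maximum
    likelihood estimates.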

20
(No Transcript)
21
Test Data Perplexity
  • Choose one article from Yahoo News as the test data.
  • Commands for the 3-gram language models:
  • ./ngram -ppl test.txt -order 3 -lm rural.lm
  • ./ngram -ppl test.txt -order 3 -lm science.lm
  • ./ngram -ppl test.txt -order 3 -lm sum.lm
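  • For each model, ngram -ppl prints a two-line
    summary in the following shape (angle-bracket
    values are placeholders, not actual results from
    this experiment):

file test.txt: <sentences> sentences, <words> words, <OOVs> OOVs
<zeroprobs> zeroprobs, logprob= <L> ppl= <P> ppl1= <P1>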

22
Test Data Perplexity
  • -ppl textfile
  • Compute sentence scores (log probabilities)
    and perplexities from the sentences in textfile,
    which should contain one sentence per line. The
    -debug option controls the level of detail
    printed, even though output is to stdout (not
    stderr).
  • -lm file
  • Read the (main) N-gram model from file. This
    option is always required, unless -null was
    chosen.
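  • From these quantities, SRILM derives perplexity
    from the total base-10 log probability; to our
    understanding of the toolkit's convention, OOVs
    and zero-probability words are excluded from the
    normalization:

    ppl = 10 ^ ( -logprob / (words - OOVs - zeroprobs + sentences) )

    The sentence count enters because the
    end-of-sentence token is also predicted by the
    model.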

23
Results for test.txt
24
Conclusion
Corpus          Perplexity
Rural           389.96
Science         463.479
Rural+Science   484.409
25
Conclusion
  • We can use the SRILM toolkit to build a 3-gram
    language model based on each corpus used in the
    experiment.
  • We can also use the SRILM toolkit to build a
    3-gram language model based on the combination of
    the two corpora. The resulting language models
    give different perplexities on the test data.

26
Conclusion
  • Why does the combined rural and science corpus
    not give a better perplexity than the rural
    corpus alone?
  • Because the chosen test article contains more
    rural content than science content. Since neither
    corpus fully covers the test article, the rural
    model achieves the lowest perplexity and the
    combined rural and science model the highest.

27
Reference
  • SRI International, The SRI Language Modeling
    Toolkit, http://www.speech.sri.com/projects/srilm/
  • Cygwin Information and Installation, Installing
    and Updating Cygwin, http://www.cygwin.com/
  • Natural Language Toolkit,
    http://nltk.sourceforge.net/index.php/Corpora