STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS

1
STATISTICAL LANGUAGE MODELS FOR CROATIAN
WEATHER-DOMAIN CORPUS
  • Lucia Načinović, Sanda Martinčić-Ipšić and Ivo
    Ipšić
  • Department of Informatics, University of Rijeka
  • lnacinovic, smarti, ivoi _at_inf.uniri.hr

2
Introduction
  • Statistical language modelling estimates the
    regularities in natural languages
  • the probabilities of word sequences which are
    usually derived from large collections of text
    material
  • Employed in
  • Speech recognition
  • Optical character recognition
  • Handwriting recognition
  • Machine translation
  • Spelling correction
  • ...

3
N-gram language models
  • The most widely-used LMs
  • Based on the probability of a word $w_n$ given the
    preceding sequence of words $w_1, \dots, w_{n-1}$
  • Bigram models (2-grams)
  • determine the probability of a word given the
    previous word
  • Trigram models (3-grams)
  • determine the probability of a word given the
    previous two words
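
As a minimal sketch of what a bigram or trigram model counts, the snippet below extracts n-grams from a token list. The helper name `ngrams` and the toy sentence are illustrative assumptions, not part of the original work.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy sentence; not taken from the weather corpus.
tokens = "sunny in the morning cloudy in the afternoon".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts[("in", "the")])              # 2
print(trigram_counts[("in", "the", "morning")])  # 1
```

These counts are the raw material from which the conditional probabilities of a word given its one or two predecessors are estimated.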

4
Language model perplexity
  • The most common metric for evaluating a language
    model is the probability that the model assigns to
    test data, or the measures derived from it:
  • cross-entropy
  • perplexity

5
Cross-entropy
  • The cross-entropy $H_p(T)$ of a model $p$ on data $T$:
    $H_p(T) = -\frac{1}{W_T} \log_2 p(T)$
  • $W_T$ - the length of the text $T$ measured in words

6
Perplexity
  • The reciprocal of the (geometric) average probability
    assigned by the model to each word in the test
    set $T$
  • The perplexity $PP_p(T)$ of a model is related to
    cross-entropy by the equation
    $PP_p(T) = 2^{H_p(T)}$
  • Lower cross-entropies and perplexities are better
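
The relation between the two measures can be sketched as follows, assuming the usual definitions: cross-entropy is the negative average base-2 log probability per word, and perplexity is 2 raised to the cross-entropy. The function names and the list of per-word probabilities are hypothetical.

```python
import math

def cross_entropy(word_probs):
    """Negative average log2 probability over the words of the test text."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(word_probs)

# Hypothetical probabilities a model assigned to four test words.
probs = [0.25, 0.25, 0.25, 0.25]

print(cross_entropy(probs))  # 2.0
print(perplexity(probs))     # 4.0
```

A model that assigns probability 1/4 to every word is as uncertain as a uniform choice among four words, hence perplexity 4.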

7
Smoothing
  • Data sparsity problem
  • N-gram models are trained from a finite corpus
  • some perfectly acceptable N-grams are missing from
    it and are assigned probability 0
  • Solution: smoothing techniques
  • adjust the maximum likelihood estimate of
    probabilities to produce more accurate
    probabilities
  • adjust low probabilities, such as zero
    probabilities, upward and high probabilities
    downward

8
Smoothing techniques used in our research
  • Additive smoothing
  • Absolute discounting
  • Witten-Bell technique
  • Kneser-Ney technique

9
Additive smoothing
  • one of the simplest types of smoothing
  • we add a factor $d$ to every count, $0 < d \le 1$
  • Formula for additive smoothing:
    $p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{d + c(w_{i-n+1}^{i})}{d\,|V| + \sum_{w_i} c(w_{i-n+1}^{i})}$
  • $V$ - the vocabulary (set of all words considered)
  • $c$ - the number of occurrences
  • values of the parameter $d$ used in our research:
    0.1, 0.5 and 1
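
A minimal sketch of additive smoothing for bigrams, assuming the standard form in which d is added to every count and d times the vocabulary size is added to the context total. The function name and toy data are illustrative assumptions.

```python
from collections import Counter

def additive_bigram_prob(word, prev, bigram_counts, vocab, d=1.0):
    """p(word | prev) = (d + c(prev, word)) / (d * |V| + sum_w c(prev, w))."""
    context_total = sum(bigram_counts[(prev, w)] for w in vocab)
    return (d + bigram_counts[(prev, word)]) / (d * len(vocab) + context_total)

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
vocab = set(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

seen = additive_bigram_prob("north", "the", bigram_counts, vocab, d=0.5)
unseen = additive_bigram_prob("rain", "the", bigram_counts, vocab, d=0.5)
print(seen, unseen)
```

The unseen bigram "the rain" now receives a small non-zero probability, while the probabilities over the whole vocabulary still sum to 1.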

10
Absolute discounting
  • When there is little data for directly estimating
    an n-gram probability, useful information can be
    provided by the corresponding (n-1)-gram
  • Absolute discounting - the higher-order
    distribution is created by subtracting a fixed
    discount D from each non-zero count
  • Values of D used in our research: 0.3, 0.5, 1
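
The idea can be sketched for bigrams as below. This assumes the interpolated form of absolute discounting, in which the probability mass freed by the discount is redistributed according to the unigram distribution. The exact variant used in the research is not stated in the slides, and the function name and toy data are hypothetical.

```python
from collections import Counter

def abs_discount_bigram_prob(word, prev, bigram_counts, unigram_counts, D=0.5):
    """Interpolated absolute discounting for a bigram model."""
    # Counts of every word seen after `prev`.
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    context_total = sum(followers.values())
    p_lower = unigram_counts[word] / sum(unigram_counts.values())
    if context_total == 0:
        return p_lower  # unseen context: rely on the unigram estimate alone
    discounted = max(followers.get(word, 0) - D, 0) / context_total
    # Mass removed by the discount: D for each distinct follower.
    backoff_weight = D * len(followers) / context_total
    return discounted + backoff_weight * p_lower

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(abs_discount_bigram_prob("north", "the", bigram_counts, unigram_counts))  # 0.3125
```

With D = 0.5, the seen bigram "the north" drops from 0.5 to 0.3125, and the freed mass goes to words never seen after "the".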

11
Witten-Bell technique
  • The number of different words in the corpus is used
    to help determine the probability of words
    that never occur in the corpus
  • Example for bigram
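
The formula from the original slide is not preserved in this transcript. As a hedged illustration only, the sketch below uses the interpolated Witten-Bell form described by Chen and Goodman, where the number of distinct word types seen after a context sets the weight given to the lower-order (unigram) estimate. The function name and toy data are assumptions.

```python
from collections import Counter

def witten_bell_bigram_prob(word, prev, bigram_counts, unigram_counts):
    """Interpolated Witten-Bell estimate of p(word | prev)."""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    context_total = sum(followers.values())  # tokens seen after `prev`
    distinct = len(followers)                # different word types seen after `prev`
    p_lower = unigram_counts[word] / sum(unigram_counts.values())
    if context_total == 0:
        return p_lower  # unseen context: rely on the unigram estimate alone
    return (followers.get(word, 0) + distinct * p_lower) / (context_total + distinct)

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(witten_bell_bigram_prob("north", "the", bigram_counts, unigram_counts))  # 0.3125
```

A context followed by many different words reserves more probability for unseen continuations than a context that is always followed by the same word.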

12
Kneser-Ney technique
  • An extension of absolute discounting
  • the lower-order distribution that one combines
    with a higher-order distribution is built in a
    novel manner
  • the lower-order distribution is taken into
    consideration only when few or no counts are
    present in the higher-order distribution
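
A sketch of the "novel manner" for bigrams, assuming the interpolated Kneser-Ney form: the lower-order distribution is based on how many distinct words precede a word (its continuation count) rather than on how often the word occurs. The function name, the single discount D and the toy data are illustrative assumptions.

```python
from collections import Counter

def kneser_ney_bigram_prob(word, prev, bigram_counts, D=0.5):
    """Interpolated Kneser-Ney estimate of p(word | prev)."""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    context_total = sum(followers.values())
    # Continuation probability: share of distinct bigram types ending in `word`.
    left_contexts = sum(1 for (_, w) in bigram_counts if w == word)
    p_cont = left_contexts / len(bigram_counts)
    if context_total == 0:
        return p_cont  # unseen context: rely on the continuation estimate alone
    discounted = max(followers.get(word, 0) - D, 0) / context_total
    backoff_weight = D * len(followers) / context_total
    return discounted + backoff_weight * p_cont

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(kneser_ney_bigram_prob("north", "the", bigram_counts))
```

The only difference from plain absolute discounting is the lower-order term: a frequent word that follows just one particular predecessor gets a low continuation probability.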

13
Smoothing implementation
  • 2-gram, 3-gram and 4-gram language models were
    built
  • Corpus: 290 480 words
  • 2 398 1-grams,
  • 18 694 2-grams,
  • 23 021 3-grams and
  • 29 736 4-grams
  • On each of these models four different smoothing
    techniques were applied

14
Corpus
  • Major part developed from 2002 until 2005 and
    some parts added later
  • Includes the vocabulary related to weather, bio
    and maritime forecast, river water levels and
    weather reports
  • Divided into 10 parts
  • 9/10 used for building language models
  • 1/10 used for evaluating those models in terms of
    their estimated perplexities
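
A 9/10 versus 1/10 split can be sketched as follows. How the authors actually assigned sentences to the ten parts is not stated, so dealing sentences out in turn is an assumption, as are the function name and the placeholder data.

```python
def split_corpus(sentences, n_parts=10, test_part=9):
    """Deal sentences into n_parts parts; return (training, test) lists."""
    parts = [sentences[i::n_parts] for i in range(n_parts)]
    test = parts[test_part]
    train = [s for i, part in enumerate(parts) if i != test_part for s in part]
    return train, test

# Hypothetical stand-in for the corpus: 20 numbered placeholder sentences.
sentences = [f"sentence {i}" for i in range(20)]
train, test = split_corpus(sentences)

print(len(train), len(test))  # 18 2
```

Keeping the test part out of training is what makes the estimated perplexities a fair measure of how the models handle unseen text.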

15
Results given by the perplexities of the LMs

16
Conclusion
  • In this paper we described the process of
    language model building from the Croatian
    weather-domain corpus
  • We built models of different order
  • 2-grams
  • 3-grams
  • 4-grams

17
Conclusion
  • We applied four different smoothing techniques
  • additive smoothing
  • absolute discounting
  • Witten-Bell technique
  • Kneser-Ney technique
  • We estimated and compared perplexities of those
    models
  • Kneser-Ney smoothing technique gives the best
    results

18
Further work
  • Prepare a more balanced corpus of Croatian text and
    thus build a more complete language model
  • Other LMs
  • Class-based
  • Other smoothing techniques

20
References
  • Chen, Stanley F.; Goodman, Joshua. An empirical
    study of smoothing techniques for language
    modelling. Cambridge, MA: Computer Science Group,
    Harvard University, 1998
  • Chou, Wu; Juang, Biing-Hwang. Pattern recognition
    in speech and language processing. CRC Press,
    2003
  • Jelinek, Frederick. Statistical Methods for
    Speech Recognition. Cambridge, MA: The MIT Press,
    1998
  • Jurafsky, Daniel; Martin, James H. Speech and
    Language Processing: An Introduction to Natural
    Language Processing, Computational Linguistics,
    and Speech Recognition. Upper Saddle River, New
    Jersey: Prentice Hall, 2000
  • Manning, Christopher D.; Schütze, Hinrich.
    Foundations of Statistical Natural Language
    Processing. Cambridge, MA: The MIT Press, 1999
  • Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza
    hrvatskoga govora konteksno ovisnim skrivenim
    Markovljevim modelima, doctoral dissertation.
    Zagreb: FER, 2007
  • Milharčič, Grega; Žibert, Janez; Mihelič, France.
    Statistical Language Modeling of SiBN Broadcast
    News Text Corpus. // Proceedings of 5th Slovenian
    and 1st international Language Technologies
    Conference 2006 / Erjavec, T.; Žganec Gros, J.
    (eds.). Ljubljana: Jožef Stefan Institute, 2006
  • Stolcke, Andreas. SRILM: An Extensible Language
    Modeling Toolkit. // Proceedings Intl. Conf. on
    Spoken Language Processing. Denver, 2002, vol. 2,
    pp. 901-904

21
SRILM toolkit
  • The models were built and evaluated using the SRILM toolkit
  • http://www.speech.sri.com/projects/srilm/
  • ngram-count -text TRAINDATA -lm LM
  • ngram -lm LM -ppl TESTDATA

22
Language model
  • Speech recognition: converting an acoustic
    signal into a sequence of words
  • Through language modelling, the speech signal is
    statistically modelled
  • The language model estimates the probability
    $\Pr(W)$ for all possible word strings
    $W = (w_1, w_2, \dots, w_i)$.

23
System diagram of a generic speech recognizer
based on statistical models

24
  • Bigram language models (2-grams)
  • Central goal: to determine the probability of a
    word given the previous word
  • Trigram language models (3-grams)
  • Central goal: to determine the probability of a
    word given the previous two words
  • The simplest way to approximate this probability
    is to compute, for a bigram,
    $p_{ML}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{\sum_{w_i} c(w_{i-1} w_i)}$
  • This value is called the maximum likelihood
    (ML) estimate
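
The maximum likelihood estimate for a bigram is the bigram count divided by the total count of bigrams that share the same first word. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import Counter

def ml_bigram_prob(word, prev, bigram_counts):
    """Maximum likelihood estimate: c(prev, word) / sum_w c(prev, w)."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if context_total == 0:
        raise ValueError(f"context {prev!r} was never seen in training data")
    return bigram_counts[(prev, word)] / context_total

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(ml_bigram_prob("north", "the", bigram_counts))  # 0.5
print(ml_bigram_prob("rain", "the", bigram_counts))   # 0.0
```

The second result shows the weakness of the unsmoothed estimate: any bigram absent from the training text gets probability zero.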

25
  • Linear interpolation - a simple method for
    combining the information from lower-order n-gram
    models in estimating higher-order probabilities
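
For a bigram model, simple linear interpolation can be sketched as a weighted sum of the bigram and unigram maximum likelihood estimates. The weight `lam` is fixed by hand here as an assumption. The function name and toy data are likewise hypothetical.

```python
from collections import Counter

def interpolated_bigram_prob(word, prev, bigram_counts, unigram_counts, lam=0.7):
    """lam * p_ML(word | prev) + (1 - lam) * p_ML(word)."""
    p_unigram = unigram_counts[word] / sum(unigram_counts.values())
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if context_total == 0:
        return p_unigram  # unseen context: only the unigram estimate is available
    p_bigram = bigram_counts[(prev, word)] / context_total
    return lam * p_bigram + (1 - lam) * p_unigram

# Hypothetical toy text; not taken from the weather corpus.
tokens = "rain in the north rain in the south".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(interpolated_bigram_prob("rain", "the", bigram_counts, unigram_counts))
```

The unseen bigram "the rain" receives a non-zero probability because the unigram estimate for "rain" contributes with weight 1 - lam.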

26
  • A general class of interpolated models is
    described by Jelinek and Mercer
  • The nth-order smoothed model is defined
    recursively as a linear interpolation between the
    nth-order maximum likelihood model and the
    (n-1)-th-order smoothed model
  • Given fixed $p_{ML}$, it is possible to search
    efficiently for the factor $\lambda$ that
    maximizes the probability of some data using the
    Baum-Welch algorithm
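
Written out, the recursive definition takes the following form. The notation follows Chen and Goodman and is an assumption about what the original slide showed: $w_{i-n+1}^{i-1}$ denotes the $n-1$ words preceding $w_i$, and the weight $\lambda$ may depend on that context.

```latex
\[
p_{interp}(w_i \mid w_{i-n+1}^{i-1}) =
  \lambda_{w_{i-n+1}^{i-1}} \, p_{ML}(w_i \mid w_{i-n+1}^{i-1})
  + \bigl(1 - \lambda_{w_{i-n+1}^{i-1}}\bigr) \, p_{interp}(w_i \mid w_{i-n+2}^{i-1})
\]
```

The recursion is usually ended by taking the first-order smoothed model to be the unigram maximum likelihood distribution, or by interpolating that with a uniform distribution over the vocabulary.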

27
  • In absolute discounting, instead of
    multiplying the higher-order maximum-likelihood
    distribution by a factor $\lambda$, the
    higher-order distribution is created by
    subtracting a fixed discount $D$ from each non-zero
    count
  • Values of D used in the research: 0.3, 0.5, 1