Title: STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS
1. STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS
- Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić
- Department of Informatics, University of Rijeka
- {lnacinovic, smarti, ivoi}@inf.uniri.hr
2. Introduction
- Statistical language modelling estimates the regularities of natural language: the probabilities of word sequences, usually derived from large collections of text material
- Employed in
  - Speech recognition
  - Optical character recognition
  - Handwriting recognition
  - Machine translation
  - Spelling correction
  - ...
3. N-gram language models
- The most widely used LMs
- Based on the probability of a word wn given the preceding sequence of words w1, ..., wn-1
- Bigram models (2-grams)
  - determine the probability of a word given the previous word
- Trigram models (3-grams)
  - determine the probability of a word given the previous two words (a toy estimation sketch follows)
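A minimal sketch in Python of how such probabilities can be estimated by maximum likelihood from counts; the toy corpus and all names are illustrative assumptions, not material from the slides:

    from collections import Counter

    def bigram_ml(tokens):
        # ML estimate: p(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1})
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(u, v): n / unigrams[u] for (u, v), n in bigrams.items()}

    tokens = "jutro vedro popodne kisa jutro kisa".split()  # toy weather-domain text
    probs = bigram_ml(tokens)
    print(probs[("jutro", "vedro")])  # 0.5: "jutro" occurs twice, once followed by "vedro"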
4. Language model perplexity
- The most common metric for evaluating a language model is the probability that the model assigns to test data, or one of the derived measures:
  - cross-entropy
  - perplexity
5. Cross-entropy
- The cross-entropy of a model p(T) on data T is given by the formula below
- WT: the length of the text T, measured in words
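The formula itself did not survive extraction; the standard definition, consistent with the symbols above (cf. Chen & Goodman, 1998), is

    H_p(T) = -\frac{1}{W_T} \log_2 p(T)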
6. Perplexity
- The reciprocal of the average probability assigned by the model to each word in the test set T
- The perplexity PPp(T) of a model is related to cross-entropy by the equation below
- Lower cross-entropies and perplexities are better
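The missing equation is the standard relation

    PP_p(T) = 2^{H_p(T)}

so lowering the cross-entropy by one bit halves the perplexity.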
7. Smoothing
- Data sparsity problem
  - N-gram models are trained from a finite corpus
  - some perfectly acceptable N-grams are missing: they get probability 0
- Solution: smoothing techniques
  - adjust the maximum likelihood estimates to produce more accurate probabilities
  - adjust low probabilities such as zero probabilities upward, and high probabilities downward
8. Smoothing techniques used in our research
- Additive smoothing
- Absolute discounting
- Witten-Bell technique
- Kneser-Ney technique
9. Additive smoothing
- One of the simplest types of smoothing
- We add a factor d to every count (0 < d ≤ 1)
- Formula for additive smoothing: reconstructed below
- V: the vocabulary (the set of all words considered)
- c: the number of occurrences
- Values of the d parameter used in our research: 0.1, 0.5 and 1
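The formula was lost with the slide image; the standard additive-smoothing estimate, using the symbols defined above (cf. Chen & Goodman, 1998), is

    p_add(w_i | w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + d}{\sum_{w_i} c(w_{i-n+1}^{i}) + d |V|}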
10. Absolute discounting
- When there is little data for directly estimating an n-gram probability, useful information can be provided by the corresponding (n-1)-gram
- Absolute discounting: the higher-order distribution is created by subtracting a fixed discount D from each non-zero count (formula below)
- Values of D used in our research: 0.3, 0.5 and 1
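The slide's formula is not preserved; the usual interpolated form of absolute discounting (cf. Chen & Goodman, 1998) is

    p_abs(w_i | w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_abs(w_i | w_{i-n+2}^{i-1})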
11. Witten-Bell technique
- The number of different words in the corpus is used to help determine the probability of words that never occur in the corpus
- Example for the bigram case: reconstructed below
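The bigram example was an image; a standard statement of Witten-Bell for bigrams (a reconstruction, not necessarily the slide's exact notation), where N_{1+}(w_{i-1} \bullet) is the number of distinct words seen after w_{i-1}:

    p_WB(w_i | w_{i-1}) = \frac{c(w_{i-1} w_i) + N_{1+}(w_{i-1} \bullet) \, p_WB(w_i)}{c(w_{i-1}) + N_{1+}(w_{i-1} \bullet)}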
12. Kneser-Ney technique
- An extension of absolute discounting
- The lower-order distribution that one combines with a higher-order distribution is built in a novel manner (sketched below)
- It is taken into consideration only when few or no counts are present in the higher-order distribution
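Written out (a standard reconstruction): the lower-order Kneser-Ney distribution is based on how many distinct contexts a word follows, not on its raw count,

    p_KN(w_i) = \frac{N_{1+}(\bullet \, w_i)}{N_{1+}(\bullet \, \bullet)}

where N_{1+}(\bullet \, w_i) is the number of distinct words that precede w_i in the training data.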
13. Smoothing implementation
- 2-gram, 3-gram and 4-gram language models were built
- Corpus: 290 480 words
  - 2 398 1-grams
  - 18 694 2-grams
  - 23 021 3-grams
  - 29 736 4-grams
- On each of these models, the four smoothing techniques were applied
14. Corpus
- The major part was developed from 2002 until 2005; some parts were added later
- Includes vocabulary related to weather, bio and maritime forecasts, river water levels and weather reports
- Divided into 10 parts
  - 9/10 used for building the language models
  - 1/10 used for evaluating those models in terms of their estimated perplexities (a toy evaluation sketch follows)
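A minimal sketch in Python of the evaluation step, scoring held-out text with an additively smoothed (add-d) bigram model; the toy data, the names and the choice of add-d smoothing are illustrative assumptions, not the actual experimental code:

    import math
    from collections import Counter

    def perplexity(train, test, d=0.5):
        # add-d bigram model trained on `train`, scored on `test`
        vocab = set(train) | set(test)
        unigrams = Counter(train)
        bigrams = Counter(zip(train, train[1:]))
        log_prob = 0.0
        for u, v in zip(test, test[1:]):
            p = (bigrams[(u, v)] + d) / (unigrams[u] + d * len(vocab))  # add-d estimate
            log_prob += math.log2(p)
        n = len(test) - 1            # number of scored bigrams
        return 2 ** (-log_prob / n)  # PP = 2^(cross-entropy), lower is better

    train = "jutro vedro popodne kisa jutro kisa".split()
    test = "jutro kisa popodne".split()
    print(perplexity(train, test))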
15. Results given by the perplexities of the LMs
16. Conclusion
- In this paper we described the process of building language models from the Croatian weather-domain corpus
- We built models of different orders
  - 2-grams
  - 3-grams
  - 4-grams
17. Conclusion
- We applied four different smoothing techniques
  - additive smoothing
  - absolute discounting
  - Witten-Bell technique
  - Kneser-Ney technique
- We estimated and compared the perplexities of these models
- The Kneser-Ney smoothing technique gives the best results
18. Further work
- Prepare a more balanced corpus of Croatian text and thus build a more complete language model
- Other LMs
  - class-based models
- Other smoothing techniques
20. References
- Chen, Stanley F.; Goodman, Joshua. An Empirical Study of Smoothing Techniques for Language Modeling. Cambridge, MA: Computer Science Group, Harvard University, 1998.
- Chou, Wu; Juang, Biing-Hwang. Pattern Recognition in Speech and Language Processing. CRC Press, 2003.
- Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1998.
- Jurafsky, Daniel; Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall, 2000.
- Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999.
- Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora kontekstno ovisnim skrivenim Markovljevim modelima, doctoral dissertation. Zagreb: FER, 2007.
- Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of SiBN Broadcast News Text Corpus. // Proceedings of the 5th Slovenian and 1st International Language Technologies Conference 2006 / Erjavec, T.; Žganec Gros, J. (eds.). Ljubljana: Jožef Stefan Institute, 2006.
- Stolcke, Andreas. SRILM - An Extensible Language Modeling Toolkit. // Proceedings of the Intl. Conf. on Spoken Language Processing. Denver, 2002, vol. 2, pp. 901-904.
21. SRILM toolkit
- The models were built and evaluated with the SRILM toolkit
- http://www.speech.sri.com/projects/srilm/
- ngram-count -text TRAINDATA -lm LM
- ngram -lm LM -ppl TESTDATA
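For reference, ngram-count selects the smoothing method through options such as -addsmooth (additive), -cdiscount (absolute discounting), -wbdiscount (Witten-Bell) and -kndiscount (Kneser-Ney); the slides do not show which flags were actually used, so the exact invocations remain an assumption.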
22. Language model
- Speech recognition: converting an acoustic signal into a sequence of words
- In language modelling, word sequences are modelled statistically
- A language model estimates the probability Pr(W) for all possible word strings W = (w1, w2, ..., wi) (chain-rule factorization below)
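The chain-rule factorization behind this (standard, not shown on the slide):

    Pr(W) = \prod_{k} Pr(w_k | w_1, ..., w_{k-1})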
23. System diagram of a generic speech recognizer based on statistical models
24.
- Bigram language models (2-grams)
  - central goal: to determine the probability of a word given the previous word
- Trigram language models (3-grams)
  - central goal: to determine the probability of a word given the previous two words
- The simplest way to approximate this probability is to compute the relative frequency shown below
- This value is called the maximum likelihood (ML) estimate
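The computation itself was an image; for a trigram model the maximum-likelihood estimate is standardly written as

    p_ML(w_i | w_{i-2}, w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}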
25.
- Linear interpolation: a simple method for combining the information from lower-order n-gram models when estimating higher-order probabilities (formula below)
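The formula is missing from the transcript; for bigrams, linear interpolation mixes the bigram and unigram ML estimates (standard form):

    p_interp(w_i | w_{i-1}) = \lambda \, p_ML(w_i | w_{i-1}) + (1 - \lambda) \, p_ML(w_i)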
26.
- A general class of interpolated models is described by Jelinek and Mercer
- The nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model (recursion written out below)
- Given fixed pML, it is possible to search efficiently for the interpolation factors λ that maximize the probability of some data, using the Baum-Welch algorithm
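The recursion, reconstructed in the notation of Chen & Goodman (1998):

    p_interp(w_i | w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_ML(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_interp(w_i | w_{i-n+2}^{i-1})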
27.
- In absolute discounting, instead of multiplying the higher-order maximum-likelihood distribution by a factor λ, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count
- Values of D used in our research: 0.3, 0.5 and 1