Title: STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS
1. STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS
- Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić
- Department of Informatics, University of Rijeka
- {lnacinovic, smarti, ivoi}@inf.uniri.hr
2. Introduction
- Statistical language modelling estimates the regularities of natural language: the probabilities of word sequences, usually derived from large collections of text material
- Employed in
  - Speech recognition
  - Optical character recognition
  - Handwriting recognition
  - Machine translation
  - Spelling correction
  - ...
3. N-gram language models
- The most widely used LMs
- Based on the probability of a word wn given the preceding sequence of words w1, ..., wn-1
- Bigram models (2-grams)
  - determine the probability of a word given the previous word
- Trigram models (3-grams)
  - determine the probability of a word given the previous two words (a toy estimation sketch follows)
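A minimal sketch in Python of how such probabilities can be estimated by maximum likelihood from counts; the toy corpus and all names are illustrative assumptions, not material from the slides:

    from collections import Counter

    def bigram_ml(tokens):
        # ML estimate: p(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1})
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(u, v): n / unigrams[u] for (u, v), n in bigrams.items()}

    tokens = "jutro vedro popodne kisa jutro kisa".split()  # toy weather-domain text
    probs = bigram_ml(tokens)
    print(probs[("jutro", "vedro")])  # 0.5: "jutro" occurs twice, once followed by "vedro"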
4. Language model perplexity
- The most common metric for evaluating a language model is the probability that the model assigns to test data, or one of the derived measures:
  - cross-entropy
  - perplexity
5. Cross-entropy
- The cross-entropy of a model p(T) on data T is given by the formula below
- WT: the length of the text T, measured in words
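The formula itself did not survive extraction; the standard definition, consistent with the symbols above (cf. Chen & Goodman, 1998), is

    H_p(T) = -\frac{1}{W_T} \log_2 p(T)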
6. Perplexity
- The reciprocal of the average probability assigned by the model to each word in the test set T
- The perplexity PPp(T) of a model is related to cross-entropy by the equation below
- Lower cross-entropies and perplexities are better
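The missing equation is the standard relation

    PP_p(T) = 2^{H_p(T)}

so lowering the cross-entropy by one bit halves the perplexity.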
7. Smoothing
- Data sparsity problem
  - N-gram models are trained from a finite corpus
  - some perfectly acceptable N-grams are missing: they get probability 0
- Solution: smoothing techniques
  - adjust the maximum likelihood estimates to produce more accurate probabilities
  - adjust low probabilities such as zero probabilities upward, and high probabilities downward
8. Smoothing techniques used in our research
- Additive smoothing
- Absolute discounting
- Witten-Bell technique
- Kneser-Ney technique
9. Additive smoothing
- One of the simplest types of smoothing
- We add a factor d to every count (0 < d ≤ 1)
- Formula for additive smoothing: reconstructed below
- V: the vocabulary (the set of all words considered)
- c: the number of occurrences
- Values of the d parameter used in our research: 0.1, 0.5 and 1
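The formula was lost with the slide image; the standard additive-smoothing estimate, using the symbols defined above (cf. Chen & Goodman, 1998), is

    p_add(w_i | w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + d}{\sum_{w_i} c(w_{i-n+1}^{i}) + d |V|}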
10. Absolute discounting
- When there is little data for directly estimating an n-gram probability, useful information can be provided by the corresponding (n-1)-gram
- Absolute discounting: the higher-order distribution is created by subtracting a fixed discount D from each non-zero count (formula below)
- Values of D used in our research: 0.3, 0.5 and 1
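The slide's formula is not preserved; the usual interpolated form of absolute discounting (cf. Chen & Goodman, 1998) is

    p_abs(w_i | w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_abs(w_i | w_{i-n+2}^{i-1})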
11. Witten-Bell technique
- The number of different words in the corpus is used to help determine the probability of words that never occur in the corpus
- Example for the bigram case: reconstructed below
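The bigram example was an image; a standard statement of Witten-Bell for bigrams (a reconstruction, not necessarily the slide's exact notation), where N_{1+}(w_{i-1} \bullet) is the number of distinct words seen after w_{i-1}:

    p_WB(w_i | w_{i-1}) = \frac{c(w_{i-1} w_i) + N_{1+}(w_{i-1} \bullet) \, p_WB(w_i)}{c(w_{i-1}) + N_{1+}(w_{i-1} \bullet)}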
12. Kneser-Ney technique
- An extension of absolute discounting
- The lower-order distribution that one combines with a higher-order distribution is built in a novel manner (sketched below)
- It is taken into consideration only when few or no counts are present in the higher-order distribution
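Written out (a standard reconstruction): the lower-order Kneser-Ney distribution is based on how many distinct contexts a word follows, not on its raw count,

    p_KN(w_i) = \frac{N_{1+}(\bullet \, w_i)}{N_{1+}(\bullet \, \bullet)}

where N_{1+}(\bullet \, w_i) is the number of distinct words that precede w_i in the training data.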
13. Smoothing implementation
- 2-gram, 3-gram and 4-gram language models were built
- Corpus: 290 480 words
  - 2 398 1-grams
  - 18 694 2-grams
  - 23 021 3-grams
  - 29 736 4-grams
- On each of these models, the four smoothing techniques were applied
14. Corpus
- The major part was developed from 2002 until 2005; some parts were added later
- Includes vocabulary related to weather, bio and maritime forecasts, river water levels and weather reports
- Divided into 10 parts
  - 9/10 used for building the language models
  - 1/10 used for evaluating those models in terms of their estimated perplexities (a toy evaluation sketch follows)
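A minimal sketch in Python of the evaluation step, scoring held-out text with an additively smoothed (add-d) bigram model; the toy data, the names and the choice of add-d smoothing are illustrative assumptions, not the actual experimental code:

    import math
    from collections import Counter

    def perplexity(train, test, d=0.5):
        # add-d bigram model trained on `train`, scored on `test`
        vocab = set(train) | set(test)
        unigrams = Counter(train)
        bigrams = Counter(zip(train, train[1:]))
        log_prob = 0.0
        for u, v in zip(test, test[1:]):
            p = (bigrams[(u, v)] + d) / (unigrams[u] + d * len(vocab))  # add-d estimate
            log_prob += math.log2(p)
        n = len(test) - 1            # number of scored bigrams
        return 2 ** (-log_prob / n)  # PP = 2^(cross-entropy), lower is better

    train = "jutro vedro popodne kisa jutro kisa".split()
    test = "jutro kisa popodne".split()
    print(perplexity(train, test))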
15. Results given by the perplexities of the LMs
16. Conclusion
- In this paper we described the process of building language models from the Croatian weather-domain corpus
- We built models of different orders
  - 2-grams
  - 3-grams
  - 4-grams
17. Conclusion
- We applied four different smoothing techniques
  - additive smoothing
  - absolute discounting
  - Witten-Bell technique
  - Kneser-Ney technique
- We estimated and compared the perplexities of these models
- The Kneser-Ney smoothing technique gives the best results
18. Further work
- Prepare a more balanced corpus of Croatian text and thus build a more complete language model
- Other LMs
  - class-based models
- Other smoothing techniques
20. References
- Chen, Stanley F.; Goodman, Joshua. An Empirical Study of Smoothing Techniques for Language Modeling. Cambridge, MA: Computer Science Group, Harvard University, 1998.
- Chou, Wu; Juang, Biing-Hwang. Pattern Recognition in Speech and Language Processing. CRC Press, 2003.
- Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1998.
- Jurafsky, Daniel; Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall, 2000.
- Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999.
- Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora kontekstno ovisnim skrivenim Markovljevim modelima, doctoral dissertation. Zagreb: FER, 2007.
- Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of SiBN Broadcast News Text Corpus. // Proceedings of the 5th Slovenian and 1st International Language Technologies Conference 2006 / Erjavec, T.; Žganec Gros, J. (eds.). Ljubljana: Jožef Stefan Institute, 2006.
- Stolcke, Andreas. SRILM - An Extensible Language Modeling Toolkit. // Proceedings of the Intl. Conf. on Spoken Language Processing. Denver, 2002, vol. 2, pp. 901-904.
21. SRILM toolkit
- The models were built and evaluated with the SRILM toolkit
- http://www.speech.sri.com/projects/srilm/
- ngram-count -text TRAINDATA -lm LM
- ngram -lm LM -ppl TESTDATA
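For reference, ngram-count selects the smoothing method through options such as -addsmooth (additive), -cdiscount (absolute discounting), -wbdiscount (Witten-Bell) and -kndiscount (Kneser-Ney); the slides do not show which flags were actually used, so the exact invocations remain an assumption.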
22. Language model
- Speech recognition: converting an acoustic signal into a sequence of words
- In language modelling, word sequences are modelled statistically
- A language model estimates the probability Pr(W) for all possible word strings W = (w1, w2, ..., wi) (chain-rule factorization below)
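The chain-rule factorization behind this (standard, not shown on the slide):

    Pr(W) = \prod_{k} Pr(w_k | w_1, ..., w_{k-1})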
23. System diagram of a generic speech recognizer based on statistical models
24.
- Bigram language models (2-grams)
  - central goal: to determine the probability of a word given the previous word
- Trigram language models (3-grams)
  - central goal: to determine the probability of a word given the previous two words
- The simplest way to approximate this probability is to compute the relative frequency shown below
- This value is called the maximum likelihood (ML) estimate
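The computation itself was an image; for a trigram model the maximum-likelihood estimate is standardly written as

    p_ML(w_i | w_{i-2}, w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}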
25.
- Linear interpolation: a simple method for combining the information from lower-order n-gram models when estimating higher-order probabilities (formula below)
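The formula is missing from the transcript; for bigrams, linear interpolation mixes the bigram and unigram ML estimates (standard form):

    p_interp(w_i | w_{i-1}) = \lambda \, p_ML(w_i | w_{i-1}) + (1 - \lambda) \, p_ML(w_i)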
26.
- A general class of interpolated models is described by Jelinek and Mercer
- The nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model (recursion written out below)
- Given fixed pML, it is possible to search efficiently for the interpolation factors λ that maximize the probability of some data, using the Baum-Welch algorithm
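The recursion, reconstructed in the notation of Chen & Goodman (1998):

    p_interp(w_i | w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_ML(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_interp(w_i | w_{i-n+2}^{i-1})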
27.
- In absolute discounting, instead of multiplying the higher-order maximum-likelihood distribution by a factor λ, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count
- Values of D used in our research: 0.3, 0.5 and 1