Title: Language Modelling
Language Modelling
- María Fernández Pajares
- Verarbeitung gesprochener Sprache (Spoken Language Processing)
Index
- 1. Introduction
- 2. Regular grammars
- 3. Stochastic languages
- 4. N-gram models
- 5. Perplexity
Introduction: Language models
- What is a language model?
- It is a method for defining language structure, used to restrict the search to the most probable sequences of linguistic units.
- Language models are especially useful for applications with complex syntax and/or semantics.
- A good LM should accept only correct sentences (with a high probability) and reject wrong word sequences (or give them a low probability).
- CLASSIC MODELS
- - N-grams
- - Stochastic grammars
Introduction: general scheme of a system
[Block diagram: signal → measurement of parameters → comparison against models (acoustic and grammar models) → decision rule → text]
Introduction: measuring task difficulty
- Determined by the real flexibility of the admitted language
- Perplexity: the average number of options
- There are finer measures that take into account the difficulty of the words or of the acoustic models
- Speech recognizers seek the word sequence W which is most likely to have produced the acoustic evidence A (the decision rule is sketched below)
- Speech recognition involves acoustic processing, acoustic modelling, language modelling, and search
- Language models (LMs) assign a probability estimate P(W) to word sequences W = w1, ..., wn, subject to the normalization constraint that P(W) sums to 1 over all word sequences
- Language models help guide and constrain the search among alternative word hypotheses during recognition
- Huge vocabularies: integration of the acoustic models and of the language model into a single hidden Markov macro-model covering the whole language
Introduction: dimensions of problem difficulty
[Diagram axes: connectivity, environment (noise, robustness), speakers, vocabulary and language complexity]
Introduction: MODELS BASED ON GRAMMARS
- They represent language restrictions in a natural way
- They allow the modelling of dependencies of any required length
- Defining these models is very difficult for tasks whose languages are close to natural language (pseudo-natural)
- Integration with the acoustic model is not very natural
Introduction: kinds of grammars
- Consider a grammar G = (N, Σ, P, S): non-terminals N, terminals Σ, productions P, start symbol S
- Chomsky hierarchy:
- Type 0: no restrictions on the rules → too complex to be useful
- Type 1: context-sensitive rules → too complex
- Type 2: context-free → used in experimental systems
- Type 3: regular, or finite-state
Grammars and automata
- Every kind of grammar corresponds to a kind of automaton that recognizes it
- Type 0 (unrestricted): Turing machine
- Type 1 (context-sensitive): linear bounded automaton
- Type 2 (context-free): pushdown automaton
- Type 3 (regular): finite-state automaton
Regular grammars
- A regular grammar is any right-linear or left-linear grammar
- Examples (a minimal one is sketched below)
- Regular grammars generate regular languages: the languages generated by regular grammars are exactly the regular languages
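The example grammars on the original slide are not reproduced here; as a minimal illustration (my own, not from the slide), the following right-linear grammar generates the regular language a*b:

```
S → a S
S → b
```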
Search space
An example
Grammars and stochastic languages
- Add a probability to each of the production rules
- A stochastic grammar is a pair (G, p), where G is a grammar and p is a function p: P → [0, 1] with the property that Σ p(A → α) = 1, the sum running over P_A, the set of grammar rules whose antecedent is A (a check of this property is sketched below)
- A stochastic language over an alphabet Σ is a pair (L, p), where L ⊆ Σ* and p assigns each string of L a probability, fulfilling p(x) ≥ 0 and Σ_{x ∈ L} p(x) = 1
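A minimal Python sketch of the normalization property of p, using a hypothetical toy grammar chosen only for illustration:

```python
from collections import defaultdict

# Hypothetical toy grammar, for illustration only.
# Each rule is (antecedent, right-hand side, probability p(A -> alpha)).
rules = [
    ("S", ("a", "S"), 0.7),
    ("S", ("b",), 0.3),
]

def is_normalized(rules, tol=1e-9):
    """Check the defining property of p: for every antecedent A,
    the probabilities of all rules A -> alpha must sum to 1."""
    totals = defaultdict(float)
    for antecedent, _, prob in rules:
        totals[antecedent] += prob
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_normalized(rules))  # True
```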
Example
N-gram models
- P(W) can be decomposed with the chain rule and approximated by limiting each word's history to the previous n-1 words (the decomposition is given below)
- When n = 2 → bigrams; when n = 3 → trigrams
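The decomposition referred to on the slide is the standard chain-rule factorization with a truncated history (not spelled out on the slide):

```latex
P(W) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```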
Example
- Suppose that acoustic decoding assigns similar probabilities to the phrases "the pig dog" and "the big dog"
- If P(pig | the) ≈ P(big | the), then the choice between them depends on the word "dog":
- P(the pig dog) = P(the) · P(pig | the) · P(dog | the pig)
- P(the big dog) = P(the) · P(big | the) · P(dog | the big)
- Since P(dog | the big) > P(dog | the pig), the model helps to decode the sentence correctly (a code sketch of this comparison follows)
- Problems
- Need for a very large number of training samples, which grows with the order of the model (unigram, bigram, trigram)
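A minimal Python sketch of this comparison; the log-probabilities are made up, chosen only to mirror the slide's argument:

```python
import math

# Hypothetical trigram log-probabilities, chosen so that
# P(pig | the) ≈ P(big | the) and P(dog | the big) > P(dog | the pig).
log_p = {
    ("<s>", "<s>", "the"): math.log(0.20),
    ("<s>", "the", "pig"): math.log(0.01),
    ("<s>", "the", "big"): math.log(0.01),
    ("the", "pig", "dog"): math.log(0.02),
    ("the", "big", "dog"): math.log(0.30),
}

def sentence_logprob(words):
    """Sum trigram log-probabilities over the sentence, padded with <s>."""
    padded = ["<s>", "<s>"] + words
    return sum(log_p[tuple(padded[i - 2:i + 1])] for i in range(2, len(padded)))

# The language model prefers "the big dog" over "the pig dog".
print(sentence_logprob(["the", "big", "dog"]) > sentence_logprob(["the", "pig", "dog"]))  # True
```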
- Advantages
- Probabilities are based on data
- Parameters determined automatically from corpora
- Incorporate local syntax, semantics, and pragmatics
- Many languages have a strong tendency toward standard word order and are thus substantially local
- Relatively easy to integrate into forward search methods such as Viterbi (bigram) or A*
- Disadvantages
- Unable to incorporate long-distance constraints
- Not well suited for flexible word order languages
- Cannot easily accommodate
- - New vocabulary items
- - Alternative domains
- - Dynamic changes (e.g., discourse)
- Not as good as humans at tasks of
- - Identifying and correcting recognizer errors
- - Predicting following words (or letters)
- Do not capture meaning for speech understanding
Estimation of the probabilities
- Suppose that the N-gram model has been represented as a finite automaton
- The events are: unigram w1, bigram w1 w2, trigram w1 w2 w3
- Suppose we have a training sample from which an N-gram model, represented as a finite automaton, has been estimated
- A state of the automaton is q, and c(q) is the total number of events (N-grams) observed in the sample while the model is in state q
- c(w|q) is the number of times that the word w has been observed in the sample while the model is in state q
- P(w|q) is the probability of observing word w conditioned on state q; its maximum-likelihood estimate is P(w|q) = c(w|q) / c(q)
- Two further quantities are needed: the set of words observed in the sample when the model is in state q, and the total vocabulary of the language to be modelled
- For example, in a bigram model the state q is simply the previous word, so P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
- This approach assigns probability 0 to events that have never been observed → this causes coverage problems → the solution is to smooth the model → smoothing can be flat, linear, non-linear, back-off, or syntactic back-off (an estimation sketch with a simple smoothing follows)
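A minimal Python sketch of this estimation, using add-one smoothing as one simple way to avoid zero probabilities (the slides list several other smoothing schemes; this particular choice is mine, for illustration only):

```python
from collections import defaultdict

def train_bigram(corpus):
    """Collect c(q) and c(w|q), where the state q is simply the previous word."""
    c_q = defaultdict(int)    # c(q): events observed while in state q
    c_wq = defaultdict(int)   # c(w|q): times word w was observed in state q
    vocab = set()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            c_q[prev] += 1
            c_wq[(prev, word)] += 1
    return c_q, c_wq, vocab

def prob(word, state, c_q, c_wq, vocab):
    """Add-one smoothed estimate: P(w|q) = (c(w|q) + 1) / (c(q) + |V|),
    so unseen events get a small non-zero probability instead of 0."""
    return (c_wq[(state, word)] + 1) / (c_q[state] + len(vocab))

corpus = [["the", "big", "dog"], ["the", "pig", "dog"], ["the", "big", "dog"]]
c_q, c_wq, vocab = train_bigram(corpus)
print(prob("big", "the", c_q, c_wq, vocab))  # seen event: relatively high
print(prob("cat", "the", c_q, c_wq, vocab))  # unseen event: small but non-zero
```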
- Bigrams are easily incorporated in Viterbi search
- Trigrams have been used for large-vocabulary recognition since the mid-1970s and remain the dominant language model
- IBM TRIGRAM EXAMPLE
- Methods for estimating the probability of unseen N-grams
- N-gram performance can be improved by clustering words (a class-based formulation is sketched after this list)
- Hard clustering puts a word into a single cluster
- Soft clustering allows a word to belong to multiple clusters
- Clusters can be created manually or automatically
- - Manually created clusters have worked well for small domains
- - Automatic clusters have been created bottom-up or top-down
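A common hard-clustering formulation is the class-based n-gram of Brown et al. (1992), listed in the bibliography, where each word w belongs to a single class c(w); for the bigram case:

```latex
P(w_i \mid w_{i-1}) \approx P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
```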
PERPLEXITY
- Average number of options
- Quantifying LM complexity
- One LM is better than another if it can predict an n-word test corpus W with a higher probability
- For LMs representable by the chain rule, comparisons are usually based on the average per-word log probability, LP = -(1/n) log2 P(W)
- A more intuitive representation of LP is the perplexity, PP = 2^LP
- (a uniform LM will have PP equal to the vocabulary size)
- PP is often interpreted as an average branching factor (a small computation sketch follows)
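A minimal sketch of the perplexity computation under these definitions (illustrative only):

```python
import math

def perplexity(word_logprobs):
    """PP = 2**LP, where LP is the negative average per-word log2 probability."""
    lp = -sum(word_logprobs) / len(word_logprobs)
    return 2 ** lp

# A uniform LM over a 1000-word vocabulary assigns each word probability 1/1000,
# so its perplexity equals the vocabulary size.
uniform = [math.log2(1 / 1000)] * 50
print(perplexity(uniform))  # ≈ 1000
```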
Perplexity examples
Bibliography
- P. Brown et al., "Class-based n-gram models of natural language", Computational Linguistics, 1992.
- R. Lau, "Adaptive Statistical Language Modelling", S.M. Thesis, MIT, 1994.
- M. McCandless, "Automatic Acquisition of Language Models for Speech Recognition", S.M. Thesis, MIT, 1994.
- L. R. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, 1993.
- Google