Title: Language and Document Models in Information Retrieval
1 Language and Document Models in Information Retrieval
2 Table of Contents
- Definitions
- Applications
- Evaluations
- SLM for IR
- Burstiness
3 What is an SLM?
- A Statistical Language Model (SLM) is a probability distribution over sequences of words.
- An example: P(Rose is red) > P(Red is Rose) > 0
- Another: P(color around | It might be nice to have a little more) ?
4 Two Stories of SLM
- The story of the document model
- Given a document (i.e., a sequence of words), how good is that document (the odds that it was composed by a person)?
- The judgment may be drawn from words and other sources, e.g. syntax, burstiness, hyperlinks, etc.
- The story of generation (used in SR and IR)
- Given a training set (i.e., a collection of sequences), how can we generate a sequence that is in accordance with the training set?
- In speech recognition: generating the next word. In IR: generating a query from a document.
5 What can an SLM do?
- Speech recognition
- Machine Translation
- Information Retrieval
- Handwriting recognition
- Spelling check
- OCR
6 How can an SLM do that?
- Compare the probabilities of candidate word sequences and pick the one that looks most likely.
- The actual question depends on the specific field:
- MT: Given a bag of words, what is the best permutation to get a sentence?
- Speech recognition: Given the preceding words, what is the next word?
- IR: Given a document, what is the query?
7 Challenges in SLM
- Long sequences
- Partial independence assumption
- Sparseness
- Smoothing methods
- Distributions
- Is there really one?
8 Evaluation of SLMs
- Indirect evaluation
- Compare the outcomes of the application, be it MT, SR, IR, or others.
- Issues: slow; depends on the dataset, other components, etc.
- Direct evaluation
- Perplexity
- Cross entropy
9 Evaluation of SLMs: Perplexity
- Definition: perplexity is the geometric average of the inverse probability of the test data.
- Formula: PP(w1...wN) = P(w1...wN)^(-1/N) (from Joshua Goodman; see the sketch below)
- Usually the lower the better, but
- Limits:
- The LM must be normalized (sum to 1).
- The probability of any term must be > 0.
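A minimal sketch of the definition above, assuming a made-up unigram model and test sequence (both are illustrative, not from the slides):

import math

unigram = {"rose": 0.2, "is": 0.5, "red": 0.3}   # hypothetical model, sums to 1
test = ["rose", "is", "red", "is", "rose"]

# PP = P(w1..wN)^(-1/N), computed in log space for numerical stability
log_prob = sum(math.log(unigram[w]) for w in test)
perplexity = math.exp(-log_prob / len(test))
print(round(perplexity, 3))   # about 3.2; the lower, the better the fit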
10 Evaluation of SLMs: Cross entropy
- Cross entropy = log2(perplexity), i.e., the average number of bits per word
- Example
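A quick worked instance of the relation above, with made-up numbers: a model with perplexity 64 on some test set has cross entropy log2(64) = 6 bits per word, and lowering the cross entropy by one bit halves the perplexity to 32.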
11 The Poisson Model (Bookstein and Swanson)
- Intuition: content-bearing words cluster in relevant documents; non-content words occur randomly.
- Method: a linear combination of Poisson distributions (see the sketch below).
- The two-Poisson model, surprisingly, could account for the occupancy distribution of most words.
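A minimal sketch of a two-Poisson mixture for within-document term counts; the mixing weight and the two rates are illustrative values, not parameters fitted to data:

from math import exp, factorial

def poisson(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def two_poisson(k, pi=0.1, lam_elite=4.0, lam_other=0.2):
    # pi: share of "elite" (on-topic) documents, which use the high rate;
    # the remaining documents use the low background rate
    return pi * poisson(k, lam_elite) + (1 - pi) * poisson(k, lam_other)

# probability of seeing the term 0, 1, 2, 3 times in a document under the mixture
print([round(two_poisson(k), 4) for k in range(4)])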
12 Poisson Mixtures (Church and Gale)
- Enhancements to the 2-Poisson model: Poisson mixtures, negative binomial, ...
- Problems: parameter estimation and overfitting (from Church & Gale, 1995)
13 Formulas (from Church & Gale)
14 SLM for IR: Ponte and Croft
- Tells a story different from the 2-Poisson model
- Doesn't rely on Bayes' theorem
- Conceptually simple and parameter-free, leaving room for further improvement
15 SLM for IR: Lafferty and Zhai
- A framework that incorporates Bayesian theory, Markov chains, and language modeling by using a loss function
- Features query expansion
16 SLM for IR: Liu and Croft
- The query likelihood model
- Generates the query from the document
- arg max_D P(D|Q) = arg max_D P(Q|D) P(D)
- P(D) is assumed to be uniform. There are many ways to model P(Q|D): multivariate Bernoulli, multinomial, tf-idf, HMM, noisy channel, risk minimization (K-L divergence), and all the smoothing methods (see the sketch below).
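A minimal sketch of query-likelihood ranking under the model above, using Jelinek-Mercer smoothing against the collection; the two toy documents, the query, and the value of lam are assumptions for illustration:

from collections import Counter
import math

docs = {
    "d1": "the rose is red and the rose is sweet".split(),
    "d2": "the sky is blue and wide".split(),
}
collection = Counter(w for d in docs.values() for w in d)
coll_len = sum(collection.values())

def log_query_likelihood(query, doc, lam=0.5):
    # log P(Q|D) with P(w|D) = lam*tf(w,D)/|D| + (1-lam)*P(w|collection);
    # P(D) is assumed uniform, so it drops out of the ranking
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p = lam * tf[w] / len(doc) + (1 - lam) * collection[w] / coll_len
        score += math.log(p)
    return score

query = "red rose".split()
ranking = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 ranks above d2 for this query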
17 SLM: Syntactic
- Chelba and Jelinek
- Construct n-grams from syntactic analysis
- e.g. "The contract ended with a loss of 7 cents after trading as low as 89 cents."
- (ended (with ())) after -> ended_after
- The headword carries long-distance information when predicting with the n-gram
- A left-to-right, incremental parsing strategy, usable for speech recognition
18 Smoothing Strategies
- No smoothing (Maximum Likelihood)
- Interpolation
- Jelinek-Mercer
- Good-Turing
- Absolute discounting
19 Smoothing Strategies: maximum likelihood
- Formula: P(z|xy) = C(xyz)/C(xy)
- The name comes from the fact that it wastes no probability mass on unseen events and maximizes the probability of the observed events.
- Cons: zero probabilities for unseen n-grams, which propagate into P(D) (see the worked case below).
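A small worked case of the zero-probability problem, with invented counts: if "rose is red" occurs twice in the training text and the context "rose is" occurs three times, then P(red | rose is) = 2/3, but an unseen continuation such as "rose is blue" gets P(blue | rose is) = 0/3 = 0, and any document containing that trigram receives P(D) = 0.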
20 Smoothing Strategies: interpolation
- Formula: P(z|xy) = w1*C(xyz)/C(xy) + w2*C(yz)/C(y) + (1-w1-w2)*C(z)/C
- Combines unigram, bigram, and trigram estimates
- Search for w1, w2 on held-out training data and pick the best values
- Hint: allow enough training data for each parameter
- Good in practice (see the sketch below)
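A minimal sketch of the interpolated estimate above, contrasted with the zero-probability behavior of maximum likelihood; the tiny corpus and the weights w1, w2 are made up rather than tuned on held-out data:

from collections import Counter

corpus = "the rose is red the rose is sweet the sky is blue".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
C = len(corpus)

def p_interp(z, x, y, w1=0.6, w2=0.3):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0   # trigram estimate
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0               # bigram estimate
    p1 = uni[z] / C                                           # unigram estimate
    return w1 * p3 + w2 * p2 + (1 - w1 - w2) * p1

# "rose is red" was seen once in the two "rose is" contexts;
# "rose is blue" is an unseen trigram but no longer gets zero probability
print(round(p_interp("red", "rose", "is"), 3))
print(round(p_interp("blue", "rose", "is"), 3))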
21 Smoothing Strategies: Jelinek-Mercer
- Formula: P(z|xy) = w1*C(xyz)/C(xy) + (1-w1)*C(yz)/C(y)
- w1 is usually trained using EM.
- Also known as deleted interpolation
22 Example for Good-Turing smoothing (from Joshua Goodman)
- Imagine you are fishing and you have caught 5 carp, 3 tuna, 1 trout, and 1 bass.
- How likely is it that your next fish is none of the four species? (2/10)
- How likely is it that your next fish is a tuna? (less than 3/10)
23 Smoothing Strategies: Good-Turing
- Intuition: all unseen events together receive the total probability mass of the events that occur once; the odds for the other events are adjusted accordingly.
- Formula:
- n_r: the number of types that occur r times
- N: the total number of tokens in the corpus
- p(w) = ((r+1)/N) * (n_(r+1)/n_r)
- Note: the maximum likelihood estimate for w is r/N (see the sketch below).
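A minimal sketch of the Good-Turing estimate applied to the fishing example from the previous slide; note that with such a tiny sample n_(r+1) is often zero, which is why practical implementations first smooth the n_r counts:

from collections import Counter

catch = Counter({"carp": 5, "tuna": 3, "trout": 1, "bass": 1})
N = sum(catch.values())                 # 10 fish caught
n = Counter(catch.values())             # n_r: n[1] = 2, n[3] = 1, n[5] = 1

# total probability mass reserved for unseen species: n_1 / N
print("P(new species) =", n[1] / N)     # 0.2, the 2/10 on the previous slide

def good_turing(species):
    # p = ((r+1)/N) * (n_(r+1)/n_r); undefined here when n_(r+1) = 0
    r = catch[species]
    if n[r + 1] == 0:
        return None                     # a real system would smooth the n_r curve first
    return (r + 1) / N * n[r + 1] / n[r]

print("ML estimate for tuna =", catch["tuna"] / N)    # 3/10
print("GT estimate for tuna =", good_turing("tuna"))  # None: n_4 = 0 in this tiny sample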
24 Smoothing Strategies: absolute discounting
- Intuition: lower the probability mass of seen events by subtracting a constant D.
- Formula:
- Pa(z|xy) = max(0, C(xyz)-D)/C(xy) + w*Pa(z|y)
- w = DT/N, where N is the number of tokens and T is the number of types.
- Rule of thumb: D = n1/(n1 + 2*n2)
- Works well except for count-1 situations (see the sketch below)
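A minimal sketch of absolute discounting for trigrams following the slide's formula; taking T and N per context and backing off to a uniform distribution at the unigram level are simplifying assumptions, and D is fixed by hand rather than set by the rule of thumb:

from collections import Counter

corpus = "the rose is red the rose is sweet the sky is blue".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
V = len(uni)                                     # vocabulary size

def p_bigram(z, y, D=0.5):
    followers = [b for b in bi if b[0] == y]
    N_ctx = sum(bi[b] for b in followers)        # tokens observed after y
    T_ctx = len(followers)                       # types observed after y
    if N_ctx == 0:
        return 1 / V                             # back off to uniform (assumption)
    w = D * T_ctx / N_ctx
    return max(0.0, bi[(y, z)] - D) / N_ctx + w * (1 / V)

def p_trigram(z, x, y, D=0.5):
    followers = [t for t in tri if t[:2] == (x, y)]
    N_ctx = sum(tri[t] for t in followers)
    T_ctx = len(followers)
    if N_ctx == 0:
        return p_bigram(z, y, D)
    w = D * T_ctx / N_ctx
    return max(0.0, tri[(x, y, z)] - D) / N_ctx + w * p_bigram(z, y, D)

print(round(p_trigram("red", "rose", "is"), 3))   # seen trigram
print(round(p_trigram("blue", "rose", "is"), 3))  # unseen trigram, still > 0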
25 The Study of Burstiness
26 Burstiness of Words
- The definitions of word frequency (see the sketch below):
- Term frequency (TF): the count of occurrences in a given document
- Document frequency (DF): the count of documents in a corpus in which a word occurs
- Generalized document frequency (DFj): like DF, but the word must occur at least j times
- DF/N: given a word, the chance that we will see it in a document (the p in Church 2000)
- ΣTF/N: given a word, the average count of it we will see in a document
- Given that we have seen a word in one document, what's the chance that we will see it again?
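A minimal sketch of the measures above on a made-up three-document corpus (the documents and the target word are invented for illustration):

from collections import Counter

docs = [
    "noriega was arrested noriega fled".split(),
    "the weather was mild".split(),
    "noriega noriega noriega appeared in court".split(),
]
N = len(docs)
word = "noriega"

tf = [Counter(d)[word] for d in docs]          # TF per document: [2, 0, 3]
df = sum(1 for d in docs if word in d)         # DF: the word occurs in 2 documents

def df_j(j):
    # generalized document frequency: documents with at least j occurrences
    return sum(1 for d in docs if Counter(d)[word] >= j)

print("DF/N =", df / N)                        # chance of seeing the word in a document
print("avg TF =", sum(tf) / N)                 # average count per document
print("df_2 =", df_j(2))                       # documents containing it twice or more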
27 Burstiness: the question
- What are the chances of seeing one, two, and three Noriegas within a document?
- Traditional assumptions:
- Poisson mixtures, the 2-Poisson model
- Independence of words
- The first occurrence depends on DF, but the second does not!
- The adaptive language model (used in SR)
- The degree of adaptation depends on lexical content, independent of the frequency.
- "Word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph" -- Church & Gale
28 Count in the adaptations
- Church's formulas
- Cache model:
- Pr(w) = λ*Pr_local(w) + (1-λ)*Pr_global(w)
- History-test division: positive and negative adaptations (see the sketch below)
- Pr(+adapt) = Pr(w in test | w in history)
- Pr(-adapt) = Pr(w in test | w not in history)
- Observation: Pr(+adapt) >> Pr(prior) > Pr(-adapt)
- Generalized DF:
- df_j = the number of documents with j or more instances of w
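A minimal sketch of the history-test split behind Pr(+adapt) and Pr(-adapt); the three-document corpus is invented, and splitting each document in half is a simplifying assumption rather than Church's exact setup:

def adaptation(docs, w):
    hist_and_test = hist_only = no_hist_and_test = no_hist = 0
    for doc in docs:
        half = len(doc) // 2
        history, test = set(doc[:half]), set(doc[half:])
        if w in history:
            hist_only += 1
            hist_and_test += w in test          # positive adaptation count
        else:
            no_hist += 1
            no_hist_and_test += w in test       # negative adaptation count
    p_adapt = hist_and_test / hist_only if hist_only else 0.0
    p_neg = no_hist_and_test / no_hist if no_hist else 0.0
    return p_adapt, p_neg

docs = [
    "noriega was seen and noriega was arrested today".split(),
    "the markets were calm and trading was light today".split(),
    "officials said noriega had already left the country".split(),
]
print(adaptation(docs, "noriega"))   # Pr(+adapt) high, Pr(-adapt) low for bursty words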
29 Experimental results 1
- High-adaptation words (based on Pr(+adapt2)):
- a 14218 13306
- and 14248 13196
- ap 15694 14567
- i 12178 11691
- in 14533 13604
- of 14648 13635
- the 15183 14665
- to 14099 13368
- -----------------------------------------
- agusta 18 17
- akchurin 14 14
- amex 20 20
- apnewsalert 137 131
- barghouti 11 11
- berisha 18 17
30 Experimental results 2
- Low-adaptation words:
- asia 9560 489
- brit 12632 18
- ct 15694 7
- eds 5631 11
- english 15694 529
- est 15694 72
- euro 12660 261
- lang 15694 24
- ny 15694 370
- ----------------------------------------------
- accuses 177 3
- angered 155 2
- attract 109 2
- carpenter 117 2
- confirm 179 3
- confirmation 114 2
- considers 102 2
31 Experimental results 3
- Words with low frequency and high burstiness (many):
- alianza, andorra, atl, awadallah, ayhan, bertelsmann, bhutto, bliss, boesak, bougainville, castel, chess, chiquita, cleopa, coolio, creatine, damas, demobilization
- Words with high frequency and high burstiness (few):
- a, and, as, at, by, for, from, has, he, his, in, iraq, is, it, of, on, reg, said, that, the, to, u, was, were, with
32 Experimental results 4
- Words with low frequency and low burstiness (lots):
- accelerating, aga, aida, ales, annie, ashton, auspices, banditry, beg, beveridge, birgit, bombardments, bothered, br, breached, brisk, broadened, brunet, carrefour, catching, chant, combed, communicate, compel, concede, constituency, corpses, cushioned, defensively, deplore, desolate, dianne, dismisses
- Words with high frequency and low burstiness (few):
- adc, afri, ams, apw, asiangames, byline, edits, engl, filter, indi, mest, miscellaneous, ndld, nw, prompted, psov, rdld, recasts, stld, thld, ws, wstm
33 Detection of bursty words from a stream of documents
- Idea: find features that occur with high intensity over a limited period of time
- Method: an infinite-state automaton; bursts appear as state transitions
- -- Kleinberg, "Bursty and Hierarchical Structure in Streams." Proc. 8th ACM SIGKDD, 2002
34 Detecting Bursty Words
- Term w occurs in a sequence of text at positions u1, u2, ...; events happen with positive time gaps x1, x2, ..., where x1 = u2 - u1, x2 = u3 - u2, etc.
- Assume the events are emitted by a probabilistic infinite-state automaton, with each state associated with an exponential density function f(x) = a*e^(-ax), where a is the rate parameter (the expected gap is 1/a).
35 Finding the state transitions
- From J. Kleinberg, "Bursty and Hierarchical Structure in Streams." 8th ACM SIGKDD, 2002
- The optimal state sequence uses fewer state transitions while keeping the rates in close agreement with the observed gaps (see the sketch below).
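A minimal sketch in the spirit of Kleinberg's automaton, restricted to two states (a base state and one burst state) with a Viterbi-style search; the parameters s and gamma and the toy gap sequence are illustrative assumptions, not values from the paper:

import math

def detect_bursts(gaps, s=2.0, gamma=1.0):
    # label each gap 0 (base state) or 1 (burst state)
    n = len(gaps)
    base_rate = n / sum(gaps)               # overall event rate from the data
    rates = [base_rate, s * base_rate]      # the burst state fires s times faster
    trans_cost = gamma * math.log(n)        # price of moving up into the burst state

    def fit_cost(state, gap):
        # negative log density of an exponential gap under the state's rate
        a = rates[state]
        return -math.log(a) + a * gap

    # Viterbi-style dynamic programming over the two states
    cost = [fit_cost(0, gaps[0]), trans_cost + fit_cost(1, gaps[0])]
    back = [[None, None]]
    for gap in gaps[1:]:
        new_cost, new_back = [math.inf, math.inf], [0, 0]
        for cur in (0, 1):
            for prev in (0, 1):
                c = cost[prev] + fit_cost(cur, gap)
                if cur == 1 and prev == 0:
                    c += trans_cost         # pay only when entering the burst state
                if c < new_cost[cur]:
                    new_cost[cur], new_back[cur] = c, prev
        cost = new_cost
        back.append(new_back)

    # trace back the cheapest state sequence
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    return list(reversed(states))

# long gaps, then a run of short gaps (the burst), then long gaps again
print(detect_bursts([5, 6, 5, 0.2, 0.3, 0.2, 0.3, 6, 5]))   # [0, 0, 0, 1, 1, 1, 1, 0, 0]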
36 Sample Results
- From database conferences (SIGMOD, VLDB), 1975-2001
- data, base, application: 1975-1979/1981/1982
- relational: 1975-1989
- schema: 1975-1980
- distributed: 1977-1985
- statistical: 1981-1984
- transaction: 1987-1992
- object-oriented: 1987-1994
- parallel: 1989-1996
- mining: 1995
- web: 1998
- xml: 1999
37 Sample Results
- From AI conferences (AAAI, IJCAI), 1980-2001
- an: 1980-1982
- language: 1980-1983
- image: 1980-1987
- prolog: 1983-1987
- reasoning: 1987-1988
- decision: 1992-1997
- agents: 1998
- agent: 1994
- mobile: 1996
- web: 1996
- bayesian: 1996-1998
- auctions: 1998
- reinforcement: 1998
38 THE END