Title: Language and Document Models in Information Retrieval
1 Language and Document Models in Information Retrieval
2 Table of Contents
- Definitions
- Applications
- Evaluations
- SLM for IR
- Burstiness
3 What is an SLM?
- A Statistical Language Model (SLM) is a probability distribution over sequences of words.
- An example: P(Rose is red) > P(Red is Rose) > 0
- Another: P(color around | It might be nice to have a little more) ?
4 Two Stories of SLM
- The story of the document model
- Given a document (i.e., a sequence of words), how good is that document (the odds that it was composed by a person)?
- The judgment may be drawn from words and other sources, e.g. syntax, burstiness, hyperlinks, etc.
- The story of generation (used in SR and IR)
- Given a training set (i.e., a collection of sequences), how can we generate a sequence that is in accordance with the training set?
- In speech recognition: generating the next word. In IR: generating a query from a document.
5 What can an SLM do?
- Speech recognition
- Machine Translation
- Information Retrieval
- Handwriting recognition
- Spelling check
- OCR
6 How can an SLM do that?
- Compare the probabilities of candidate word sequences and pick the one that looks most likely.
- The actual question depends on the specific field:
- MT: Given a bag of words, what is the best permutation to get a sentence?
- Speech recognition: Given the preceding words, what is the next word?
- IR: Given a document, what is the query?
7 Challenges in SLM
- Long sequences
- Partial independence assumption
- Sparseness
- Smoothing methods
- Distributions
- Is there really one?
8 Evaluation of SLMs
- Indirect evaluation
- Compare the outcomes of the application, be it MT, SR, IR, or others.
- Issues: slow; depends on the dataset, other components, etc.
- Direct evaluation
- Perplexity
- Cross entropy
9 Evaluation of SLMs: Perplexity
- Definition: perplexity is the geometric average of the inverse probability of the test data.
- Formula: PP(w1...wN) = P(w1...wN)^(-1/N) (from Joshua Goodman; see the sketch below)
- Usually the lower the better, but
- Limits:
- The LM must be normalized (sum to 1).
- The probability of any term must be > 0.
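A minimal sketch of the definition above, assuming a made-up unigram model and test sequence (both are illustrative, not from the slides):

import math

unigram = {"rose": 0.2, "is": 0.5, "red": 0.3}   # hypothetical model, sums to 1
test = ["rose", "is", "red", "is", "rose"]

# PP = P(w1..wN)^(-1/N), computed in log space for numerical stability
log_prob = sum(math.log(unigram[w]) for w in test)
perplexity = math.exp(-log_prob / len(test))
print(round(perplexity, 3))   # about 3.2; the lower, the better the fit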
10 Evaluation of SLMs: Cross entropy
- Cross entropy = log2(perplexity), i.e., the average number of bits per word
- Example
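A quick worked instance of the relation above, with made-up numbers: a model with perplexity 64 on some test set has cross entropy log2(64) = 6 bits per word, and lowering the cross entropy by one bit halves the perplexity to 32.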
11 The Poisson Model (Bookstein and Swanson)
- Intuition: content-bearing words cluster in relevant documents; non-content words occur randomly.
- Method: a linear combination of Poisson distributions (see the sketch below).
- The two-Poisson model, surprisingly, could account for the occupancy distribution of most words.
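A minimal sketch of a two-Poisson mixture for within-document term counts; the mixing weight and the two rates are illustrative values, not parameters fitted to data:

from math import exp, factorial

def poisson(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def two_poisson(k, pi=0.1, lam_elite=4.0, lam_other=0.2):
    # pi: share of "elite" (on-topic) documents, which use the high rate;
    # the remaining documents use the low background rate
    return pi * poisson(k, lam_elite) + (1 - pi) * poisson(k, lam_other)

# probability of seeing the term 0, 1, 2, 3 times in a document under the mixture
print([round(two_poisson(k), 4) for k in range(4)])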
12 Poisson Mixtures (Church and Gale)
- Enhancements to the 2-Poisson model: Poisson mixtures, negative binomial, ...
- Problems: parameter estimation and overfitting (from Church & Gale, 1995)
13 Formulas (from Church & Gale)
14 SLM for IR: Ponte and Croft
- Tells a story different from the 2-Poisson model
- Doesn't rely on Bayes' theorem
- Conceptually simple and parameter-free, leaving room for further improvement
15 SLM for IR: Lafferty and Zhai
- A framework that incorporates Bayesian theory, Markov chains, and language modeling by using a loss function
- Features query expansion
16 SLM for IR: Liu and Croft
- The query likelihood model
- Generates the query from the document
- arg max_D P(D|Q) = arg max_D P(Q|D) P(D)
- P(D) is assumed to be uniform. There are many ways to model P(Q|D): multivariate Bernoulli, multinomial, tf-idf, HMM, noisy channel, risk minimization (K-L divergence), and all the smoothing methods (see the sketch below).
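A minimal sketch of query-likelihood ranking under the model above, using Jelinek-Mercer smoothing against the collection; the two toy documents, the query, and the value of lam are assumptions for illustration:

from collections import Counter
import math

docs = {
    "d1": "the rose is red and the rose is sweet".split(),
    "d2": "the sky is blue and wide".split(),
}
collection = Counter(w for d in docs.values() for w in d)
coll_len = sum(collection.values())

def log_query_likelihood(query, doc, lam=0.5):
    # log P(Q|D) with P(w|D) = lam*tf(w,D)/|D| + (1-lam)*P(w|collection);
    # P(D) is assumed uniform, so it drops out of the ranking
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p = lam * tf[w] / len(doc) + (1 - lam) * collection[w] / coll_len
        score += math.log(p)
    return score

query = "red rose".split()
ranking = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 ranks above d2 for this query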
17 SLM: Syntactic
- Chelba and Jelinek
- Construct n-grams from syntactic analysis
- e.g. "The contract ended with a loss of 7 cents after trading as low as 89 cents."
- (ended (with ())) after -> ended_after
- The headword carries long-distance information when predicting with the n-gram
- A left-to-right, incremental parsing strategy, usable for speech recognition
18 Smoothing Strategies
- No smoothing (Maximum Likelihood)
- Interpolation
- Jelinek-Mercer
- Good-Turing
- Absolute discounting
19 Smoothing Strategies: maximum likelihood
- Formula: P(z|xy) = C(xyz)/C(xy)
- The name comes from the fact that it wastes no probability mass on unseen events and maximizes the probability of the observed events.
- Cons: zero probabilities for unseen n-grams, which propagate into P(D) (see the worked case below).
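A small worked case of the zero-probability problem, with invented counts: if "rose is red" occurs twice in the training text and the context "rose is" occurs three times, then P(red | rose is) = 2/3, but an unseen continuation such as "rose is blue" gets P(blue | rose is) = 0/3 = 0, and any document containing that trigram receives P(D) = 0.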
20 Smoothing Strategies: interpolation
- Formula: P(z|xy) = w1*C(xyz)/C(xy) + w2*C(yz)/C(y) + (1-w1-w2)*C(z)/C
- Combines unigram, bigram, and trigram estimates
- Search for w1, w2 on held-out training data and pick the best values
- Hint: allow enough training data for each parameter
- Good in practice (see the sketch below)
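A minimal sketch of the interpolated estimate above, contrasted with the zero-probability behavior of maximum likelihood; the tiny corpus and the weights w1, w2 are made up rather than tuned on held-out data:

from collections import Counter

corpus = "the rose is red the rose is sweet the sky is blue".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
C = len(corpus)

def p_interp(z, x, y, w1=0.6, w2=0.3):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0   # trigram estimate
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0               # bigram estimate
    p1 = uni[z] / C                                           # unigram estimate
    return w1 * p3 + w2 * p2 + (1 - w1 - w2) * p1

# "rose is red" was seen once in the two "rose is" contexts;
# "rose is blue" is an unseen trigram but no longer gets zero probability
print(round(p_interp("red", "rose", "is"), 3))
print(round(p_interp("blue", "rose", "is"), 3))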
21 Smoothing Strategies: Jelinek-Mercer
- Formula: P(z|xy) = w1*C(xyz)/C(xy) + (1-w1)*C(yz)/C(y)
- w1 is usually trained using EM.
- Also known as deleted interpolation
22 Example for Good-Turing smoothing (from Joshua Goodman)
- Imagine you are fishing and you have caught 5 carp, 3 tuna, 1 trout, and 1 bass.
- How likely is it that your next fish is none of the four species? (2/10)
- How likely is it that your next fish is a tuna? (less than 3/10)
23 Smoothing Strategies: Good-Turing
- Intuition: all unseen events together receive the total probability mass of the events that occur once; the odds for the other events are adjusted accordingly.
- Formula:
- n_r: the number of types that occur r times
- N: the total number of tokens in the corpus
- p(w) = ((r+1)/N) * (n_(r+1)/n_r)
- Note: the maximum likelihood estimate for w is r/N (see the sketch below).
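A minimal sketch of the Good-Turing estimate applied to the fishing example from the previous slide; note that with such a tiny sample n_(r+1) is often zero, which is why practical implementations first smooth the n_r counts:

from collections import Counter

catch = Counter({"carp": 5, "tuna": 3, "trout": 1, "bass": 1})
N = sum(catch.values())                 # 10 fish caught
n = Counter(catch.values())             # n_r: n[1] = 2, n[3] = 1, n[5] = 1

# total probability mass reserved for unseen species: n_1 / N
print("P(new species) =", n[1] / N)     # 0.2, the 2/10 on the previous slide

def good_turing(species):
    # p = ((r+1)/N) * (n_(r+1)/n_r); undefined here when n_(r+1) = 0
    r = catch[species]
    if n[r + 1] == 0:
        return None                     # a real system would smooth the n_r curve first
    return (r + 1) / N * n[r + 1] / n[r]

print("ML estimate for tuna =", catch["tuna"] / N)    # 3/10
print("GT estimate for tuna =", good_turing("tuna"))  # None: n_4 = 0 in this tiny sample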
24 Smoothing Strategies: absolute discounting
- Intuition: lower the probability mass of seen events by subtracting a constant D.
- Formula:
- Pa(z|xy) = max(0, C(xyz)-D)/C(xy) + w*Pa(z|y)
- w = DT/N, where N is the number of tokens and T is the number of types.
- Rule of thumb: D = n1/(n1 + 2*n2)
- Works well except for count-1 situations (see the sketch below)
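A minimal sketch of absolute discounting for trigrams following the slide's formula; taking T and N per context and backing off to a uniform distribution at the unigram level are simplifying assumptions, and D is fixed by hand rather than set by the rule of thumb:

from collections import Counter

corpus = "the rose is red the rose is sweet the sky is blue".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
V = len(uni)                                     # vocabulary size

def p_bigram(z, y, D=0.5):
    followers = [b for b in bi if b[0] == y]
    N_ctx = sum(bi[b] for b in followers)        # tokens observed after y
    T_ctx = len(followers)                       # types observed after y
    if N_ctx == 0:
        return 1 / V                             # back off to uniform (assumption)
    w = D * T_ctx / N_ctx
    return max(0.0, bi[(y, z)] - D) / N_ctx + w * (1 / V)

def p_trigram(z, x, y, D=0.5):
    followers = [t for t in tri if t[:2] == (x, y)]
    N_ctx = sum(tri[t] for t in followers)
    T_ctx = len(followers)
    if N_ctx == 0:
        return p_bigram(z, y, D)
    w = D * T_ctx / N_ctx
    return max(0.0, tri[(x, y, z)] - D) / N_ctx + w * p_bigram(z, y, D)

print(round(p_trigram("red", "rose", "is"), 3))   # seen trigram
print(round(p_trigram("blue", "rose", "is"), 3))  # unseen trigram, still > 0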
25 The Study of Burstiness
26 Burstiness of Words
- The definitions of word frequency (see the sketch below):
- Term frequency (TF): the count of occurrences in a given document
- Document frequency (DF): the count of documents in a corpus in which a word occurs
- Generalized document frequency (DFj): like DF, but the word must occur at least j times
- DF/N: given a word, the chance that we will see it in a document (the p in Church 2000)
- ΣTF/N: given a word, the average count of it we will see in a document
- Given that we have seen a word in one document, what's the chance that we will see it again?
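A minimal sketch of the measures above on a made-up three-document corpus (the documents and the target word are invented for illustration):

from collections import Counter

docs = [
    "noriega was arrested noriega fled".split(),
    "the weather was mild".split(),
    "noriega noriega noriega appeared in court".split(),
]
N = len(docs)
word = "noriega"

tf = [Counter(d)[word] for d in docs]          # TF per document: [2, 0, 3]
df = sum(1 for d in docs if word in d)         # DF: the word occurs in 2 documents

def df_j(j):
    # generalized document frequency: documents with at least j occurrences
    return sum(1 for d in docs if Counter(d)[word] >= j)

print("DF/N =", df / N)                        # chance of seeing the word in a document
print("avg TF =", sum(tf) / N)                 # average count per document
print("df_2 =", df_j(2))                       # documents containing it twice or more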
27 Burstiness: the question
- What are the chances of seeing one, two, and three Noriegas within a document?
- Traditional assumptions:
- Poisson mixtures, the 2-Poisson model
- Independence of words
- The first occurrence depends on DF, but the second does not!
- The adaptive language model (used in SR)
- The degree of adaptation depends on lexical content, independent of the frequency.
- "Word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph" -- Church & Gale
28 Count in the adaptations
- Church's formulas
- Cache model:
- Pr(w) = λ*Pr_local(w) + (1-λ)*Pr_global(w)
- History-test division: positive and negative adaptations (see the sketch below)
- Pr(+adapt) = Pr(w in test | w in history)
- Pr(-adapt) = Pr(w in test | w not in history)
- Observation: Pr(+adapt) >> Pr(prior) > Pr(-adapt)
- Generalized DF:
- df_j = the number of documents with j or more instances of w
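A minimal sketch of the history-test split behind Pr(+adapt) and Pr(-adapt); the three-document corpus is invented, and splitting each document in half is a simplifying assumption rather than Church's exact setup:

def adaptation(docs, w):
    hist_and_test = hist_only = no_hist_and_test = no_hist = 0
    for doc in docs:
        half = len(doc) // 2
        history, test = set(doc[:half]), set(doc[half:])
        if w in history:
            hist_only += 1
            hist_and_test += w in test          # positive adaptation count
        else:
            no_hist += 1
            no_hist_and_test += w in test       # negative adaptation count
    p_adapt = hist_and_test / hist_only if hist_only else 0.0
    p_neg = no_hist_and_test / no_hist if no_hist else 0.0
    return p_adapt, p_neg

docs = [
    "noriega was seen and noriega was arrested today".split(),
    "the markets were calm and trading was light today".split(),
    "officials said noriega had already left the country".split(),
]
print(adaptation(docs, "noriega"))   # Pr(+adapt) high, Pr(-adapt) low for bursty words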
29 Experimental results 1
- High-adaptation words (based on Pr(+adapt2)):
- a 14218 13306
- and 14248 13196
- ap 15694 14567
- i 12178 11691
- in 14533 13604
- of 14648 13635
- the 15183 14665
- to 14099 13368
- -----------------------------------------
- agusta 18 17
- akchurin 14 14
- amex 20 20
- apnewsalert 137 131
- barghouti 11 11
- berisha 18 17
30 Experimental results 2
- Low-adaptation words:
- asia 9560 489
- brit 12632 18
- ct 15694 7
- eds 5631 11
- english 15694 529
- est 15694 72
- euro 12660 261
- lang 15694 24
- ny 15694 370
- ----------------------------------------------
- accuses 177 3
- angered 155 2
- attract 109 2
- carpenter 117 2
- confirm 179 3
- confirmation 114 2
- considers 102 2
31 Experimental results 3
- Words with low frequency and high burstiness (many):
- alianza, andorra, atl, awadallah, ayhan, bertelsmann, bhutto, bliss, boesak, bougainville, castel, chess, chiquita, cleopa, coolio, creatine, damas, demobilization
- Words with high frequency and high burstiness (few):
- a, and, as, at, by, for, from, has, he, his, in, iraq, is, it, of, on, reg, said, that, the, to, u, was, were, with
32 Experimental results 4
- Words with low frequency and low burstiness (lots):
- accelerating, aga, aida, ales, annie, ashton, auspices, banditry, beg, beveridge, birgit, bombardments, bothered, br, breached, brisk, broadened, brunet, carrefour, catching, chant, combed, communicate, compel, concede, constituency, corpses, cushioned, defensively, deplore, desolate, dianne, dismisses
- Words with high frequency and low burstiness (few):
- adc, afri, ams, apw, asiangames, byline, edits, engl, filter, indi, mest, miscellaneous, ndld, nw, prompted, psov, rdld, recasts, stld, thld, ws, wstm
33 Detection of bursty words from a stream of documents
- Idea: find features that occur with high intensity over a limited period of time
- Method: an infinite-state automaton; bursts appear as state transitions
- -- Kleinberg, "Bursty and Hierarchical Structure in Streams." Proc. 8th ACM SIGKDD, 2002
34 Detecting Bursty Words
- Term w occurs in a sequence of text at positions u1, u2, ...; events happen with positive time gaps x1, x2, ..., where x1 = u2 - u1, x2 = u3 - u2, etc.
- Assume the events are emitted by a probabilistic infinite-state automaton, with each state associated with an exponential density function f(x) = a*e^(-ax), where a is the rate parameter (the expected gap is 1/a).
35 Finding the state transitions
- From J. Kleinberg, "Bursty and Hierarchical Structure in Streams." 8th ACM SIGKDD, 2002
- The optimal state sequence uses fewer state transitions while keeping the rates in close agreement with the observed gaps (see the sketch below).
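A minimal sketch in the spirit of Kleinberg's automaton, restricted to two states (a base state and one burst state) with a Viterbi-style search; the parameters s and gamma and the toy gap sequence are illustrative assumptions, not values from the paper:

import math

def detect_bursts(gaps, s=2.0, gamma=1.0):
    # label each gap 0 (base state) or 1 (burst state)
    n = len(gaps)
    base_rate = n / sum(gaps)               # overall event rate from the data
    rates = [base_rate, s * base_rate]      # the burst state fires s times faster
    trans_cost = gamma * math.log(n)        # price of moving up into the burst state

    def fit_cost(state, gap):
        # negative log density of an exponential gap under the state's rate
        a = rates[state]
        return -math.log(a) + a * gap

    # Viterbi-style dynamic programming over the two states
    cost = [fit_cost(0, gaps[0]), trans_cost + fit_cost(1, gaps[0])]
    back = [[None, None]]
    for gap in gaps[1:]:
        new_cost, new_back = [math.inf, math.inf], [0, 0]
        for cur in (0, 1):
            for prev in (0, 1):
                c = cost[prev] + fit_cost(cur, gap)
                if cur == 1 and prev == 0:
                    c += trans_cost         # pay only when entering the burst state
                if c < new_cost[cur]:
                    new_cost[cur], new_back[cur] = c, prev
        cost = new_cost
        back.append(new_back)

    # trace back the cheapest state sequence
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    return list(reversed(states))

# long gaps, then a run of short gaps (the burst), then long gaps again
print(detect_bursts([5, 6, 5, 0.2, 0.3, 0.2, 0.3, 6, 5]))   # [0, 0, 0, 1, 1, 1, 1, 0, 0]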
36 Sample Results
- From database conferences (SIGMOD, VLDB), 1975-2001
- data, base, application: 1975-1979/1981/1982
- relational: 1975-1989
- schema: 1975-1980
- distributed: 1977-1985
- statistical: 1981-1984
- transaction: 1987-1992
- object-oriented: 1987-1994
- parallel: 1989-1996
- mining: 1995
- web: 1998
- xml: 1999
37 Sample Results
- From AI conferences (AAAI, IJCAI), 1980-2001
- an: 1980-1982
- language: 1980-1983
- image: 1980-1987
- prolog: 1983-1987
- reasoning: 1987-1988
- decision: 1992-1997
- agents: 1998
- agent: 1994
- mobile: 1996
- web: 1996
- bayesian: 1996-1998
- auctions: 1998
- reinforcement: 1998
38 THE END