Language and Document Models in Information Retrieval

  • ZhuoRan Chen
  • 2006-2-8

Table of Contents
  • Definitions
  • Applications
  • Evaluations
  • SLM for IR
  • Burstiness

What is an SLM?
  • A Statistical Language Model (SLM) is a
    probability distribution function over sequences
    of words.
  • An example: P(Rose is red) > P(Red is Rose) > ...
  • Another: P(color around It might be nice to
    have a little more) = ?

Two Stories of SLM
  • The Story of Document Model
  • Given a document (defined as a sequence of words),
    how good is that document (what are the odds that
    it was composed by a person)?
  • Judgment may be drawn from words and other
    sources, e.g. syntax, burstiness, hyperlinks.
  • The Story of Generating (used in SR and IR)
  • Given a training set (defined as a collection of
    sequences), how can we generate a sequence that
    is in accordance with the training set?
  • In speech recognition: generating the next word.
  • In IR: generating a query from a document.

What can an SLM do?
  • Speech recognition
  • Machine Translation
  • Information Retrieval
  • Handwriting recognition
  • Spelling check
  • OCR

How can an SLM do that?
  • Compare the probabilities of candidate word
    sequences and pick the one that looks most likely.
  • The actual question depends on the specific field:
  • MT: Given a bag of words, what is the best
    permutation to form a sentence?
  • Speech recognition: Given the preceding words,
    what is the next word?
  • IR: Given a document, what is the query?
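
The idea of comparing candidate word sequences can be sketched as follows. This is a minimal illustration, not the presenter's method: the toy corpus, the add-one-smoothed bigram estimate, and the names `log_prob` and `candidates` are all assumptions made for the example.

```python
from collections import Counter
import math

# Hypothetical toy training corpus; any tokenized text would do.
corpus = [
    "the rose is red".split(),
    "the violet is blue".split(),
    "the rose is sweet".split(),
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence          # <s> marks the sentence start
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def log_prob(sentence):
    """Log-probability of a token list under an add-one-smoothed bigram estimate."""
    tokens = ["<s>"] + sentence
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Candidate permutations of the same bag of words (the MT question above).
candidates = [
    "rose is red the".split(),
    "the rose is red".split(),
    "red the is rose".split(),
]
best = max(candidates, key=log_prob)
print(" ".join(best))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.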

Challenges in SLM
  • Long sequences: partial independence assumption
  • Sparseness: smoothing methods
  • Distributions: is there really one?

Evaluation of SLMs
  • Indirect evaluation
  • Compare the outcomes of the application, be it
    MT, SR, IR, or others.
  • Issues: slow; depends on the dataset, other
    components, etc.
  • Direct evaluation
  • Perplexity
  • Cross entropy

Evaluation of SLM: Perplexity
  • Definition: perplexity is the geometric average of
    the inverse probability.
  • Formula: $\sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
  • (from Joshua Goodman)
  • Usually the lower the better, but ...
  • Limits:
  • The LM must be normalized (sum to 1).
  • The probability of any term must be > 0.

Evaluation of SLM: Cross entropy
  • Entropy = log2(perplexity)
  • Example
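
A small sketch of both measures, assuming the model has already assigned a probability to each word of a test sequence. The function names and the sample probabilities are illustrative assumptions, not taken from the slides.

```python
import math

def perplexity(word_probs):
    """Geometric mean of the inverse per-word probabilities."""
    if any(p <= 0 for p in word_probs):
        # Mirrors the limit above: every term needs probability > 0.
        raise ValueError("every word needs probability > 0")
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

def cross_entropy(word_probs):
    """Cross entropy in bits per word, i.e. log2 of the perplexity."""
    return math.log2(perplexity(word_probs))

# Hypothetical per-word probabilities for a four-word test sequence.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs), cross_entropy(probs))
```

A model that spreads its probability uniformly over four choices at every step has perplexity 4, which is 2 bits of cross entropy per word.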

The Poisson Model: Bookstein and Swanson
  • Intuition: content-bearing words cluster in
    relevant documents; non-content words occur ...
  • Method: linear combination of Poisson distributions
  • The two-Poisson model, surprisingly, could
    account for the occupancy distribution of most ...
Poisson Mixtures: Church and Gale
  • Enhancements for 2-Poisson: Poisson mixtures,
    negative binomial...
  • Problems: parameter estimation and overfitting

From Church and Gale (1995)

Formulas (from Church and Gale)

SLM for IR: Ponte and Croft
  • Tells a story different from 2-Poisson
  • Doesn't rely on Bayes' theorem
  • Conceptually simple and parameter-free; leaves
    room for further improvement

SLM for IR: Lafferty and Zhai
  • A framework that incorporates Bayesian theory,
    Markov chains, and language modeling by using the
    loss function
  • Features query expansion

SLM for IR: Liu and Croft
  • The query likelihood model
  • To generate the query from the document:
  • $\arg\max_D P(D \mid Q) = \arg\max_D P(Q \mid D)\,P(D)$
  • $P(D)$ is assumed to be uniform. Many ways to model
    $P(D \mid Q)$: multivariate Bernoulli, multinomial,
    tf-idf, HMM, noisy channel, risk minimization
    function (K-L divergence), and all the smoothing ...
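
Query likelihood ranking with a uniform document prior can be sketched as below. This is one possible instantiation, assuming a multinomial unigram model per document, linearly interpolated with a collection model. The toy documents, the weight `lam`, and all function names are assumptions made for the example.

```python
from collections import Counter
import math

# Hypothetical toy collection.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "speech recognition with language models".split(),
    "d3": "fishing for carp and tuna".split(),
}
collection = Counter(w for words in docs.values() for w in words)
collection_size = sum(collection.values())

def query_log_likelihood(query, doc_words, lam=0.5):
    """log P(Q | D) under a unigram document model mixed with the collection model.

    Assumes every query term occurs somewhere in the collection; otherwise
    the mixed probability is zero and the logarithm is undefined.
    """
    tf = Counter(doc_words)
    score = 0.0
    for term in query:
        p_doc = tf[term] / len(doc_words)
        p_coll = collection[term] / collection_size
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

query = "information retrieval".split()
# With uniform P(D), ranking by P(D | Q) reduces to ranking by P(Q | D).
ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d]), reverse=True)
print(ranking)
```

Mixing in the collection model keeps documents that lack a query term from receiving probability zero.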

SLM: Syntactic
  • Chelba and Jelinek
  • Construct n-grams from syntactic analysis
  • e.g. The contract ended with a loss of 7 cents
    after trading as low as 89 cents.
  • (ended (with (...))) after → ended_after
  • Headwords carry long-distance information when
    predicting with n-grams
  • Left-to-right incremental parsing strategy,
    usable for speech recognition

Smoothing Strategies
  • No smoothing (maximum likelihood)
  • Interpolation
  • Jelinek-Mercer
  • Good-Turing
  • Absolute discounting

Smoothing Strategies: maximum likelihood
  • Formula: $P(z \mid xy) = C(xyz) / C(xy)$
  • The name comes from the fact that it does not
    waste any probability mass on unseen events; it
    maximizes the probability of the observed events.
  • Cons: zero probabilities for unseen n-grams,
    which will propagate into $P(D)$.
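
The maximum likelihood estimate and its zero-probability problem can be seen on a tiny example. The token sequence and the name `p_ml` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical training text.
tokens = "the rose is red and the rose is sweet".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# C(xy): how often the pair (x, y) occurs as a trigram context.
contexts = Counter()
for (x, y, _z), count in trigrams.items():
    contexts[(x, y)] += count

def p_ml(z, x, y):
    """Maximum likelihood estimate P(z | x y) = C(xyz) / C(xy)."""
    if contexts[(x, y)] == 0:
        return 0.0  # context never seen; treated as zero here (an assumption)
    return trigrams[(x, y, z)] / contexts[(x, y)]

print(p_ml("red", "rose", "is"))    # seen trigram
print(p_ml("blue", "rose", "is"))   # unseen trigram gets probability zero
```

A single unseen trigram in a document is enough to drive the whole product to zero, which is why the smoothing strategies exist.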

Smoothing Strategies: interpolation
  • Formula: $P(z \mid xy) = w_1 \frac{C(xyz)}{C(xy)} + w_2 \frac{C(yz)}{C(y)} + (1 - w_1 - w_2) \frac{C(z)}{C}$
  • Combines unigram, bigram and trigram estimates
  • Search for $w_1$, $w_2$ over the training set and
    pick the best
  • Hint: allow enough training data for each ...
  • Good in practice
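
The interpolation formula translates almost literally into code. In this sketch the weights `w1=0.6` and `w2=0.3` are hypothetical placeholders rather than tuned values, and the toy token sequence is an assumption.

```python
from collections import Counter

# Hypothetical training text.
tokens = "the rose is red and the rose is sweet".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)   # C in the formula: total number of tokens

def ratio(num, den):
    """Relative frequency, defined as 0 when the denominator count is 0."""
    return num / den if den else 0.0

def p_interp(z, x, y, w1=0.6, w2=0.3):
    """Weighted mix of trigram, bigram and unigram relative frequencies."""
    return (w1 * ratio(tri[(x, y, z)], bi[(x, y)])
            + w2 * ratio(bi[(y, z)], uni[y])
            + (1 - w1 - w2) * ratio(uni[z], total))

# The trigram "violet is red" was never seen, yet the estimate stays positive.
print(p_interp("red", "violet", "is"))
```

In practice the weights would be chosen by searching over candidate values and keeping the pair that scores best on data set aside for that purpose.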

Smoothing Strategies: Jelinek-Mercer
  • Formula: $P(z \mid xy) = w_1 \frac{C(xyz)}{C(xy)} + (1 - w_1) \ldots$
  • $w_1$ is usually trained using EM.
  • Also known as deleted interpolation

Example for Good-Turing smoothing (from Joshua Goodman)
  • Imagine you are fishing and you have caught 5 carp,
    3 tuna, 1 trout, 1 bass.
  • How likely is it that your next fish is none of
    the four species? (2/10)
  • How likely is it that your next fish is tuna?
    (less than 3/10)

Smoothing Strategies: Good-Turing
  • Intuition: all unseen events together receive the
    total probability mass of the events that occur
    once; the odds for other events are adjusted
    accordingly.
  • Formula:
  • $n_r$: number of types that occur $r$ times
  • $N$: total tokens in the corpus
  • $p(w) = \frac{r+1}{N} \cdot \frac{n_{r+1}}{n_r}$
  • Note: the maximum likelihood estimate for $w$ is $r/N$.
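
The formula can be applied to the fishing example as a rough sketch. The function name `good_turing` and the second sample are assumptions made for illustration. Note that on such a tiny sample the raw formula yields zero whenever $n_{r+1} = 0$ (for example for tuna, since no species was caught four times), so practical implementations typically smooth the $n_r$ values first.

```python
from collections import Counter

def good_turing(counts):
    """Raw Good-Turing estimates: (mass reserved for unseen events, p(w) per seen type)."""
    total = sum(counts.values())            # N: total tokens
    n = Counter(counts.values())            # n[r]: number of types seen r times
    unseen_mass = n[1] / total
    probs = {w: (r + 1) / total * n[r + 1] / n[r] for w, r in counts.items()}
    return unseen_mass, probs

# The catch from the slide: 5 carp, 3 tuna, 1 trout, 1 bass.
catch = {"carp": 5, "tuna": 3, "trout": 1, "bass": 1}
unseen, probs = good_turing(catch)
print(unseen)          # mass for "none of the four species"
print(probs["tuna"])   # raw estimate collapses to 0 because n_4 = 0

# Hypothetical second sample in which n_2 > 0, so singletons get a nonzero estimate.
_, probs2 = good_turing({"a": 3, "b": 2, "c": 1, "d": 1, "e": 1})
print(probs2["c"])
```

The unseen mass of 2/10 matches the answer given in the example, because two of the ten fish belong to species seen exactly once.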

Smoothing Strategies: Absolute discounting
  • Intuition: lower the probability mass of seen
    events by subtracting a constant $D$.
  • Formula:
  • $P_a(z \mid xy) = \frac{\max\{0,\, C(xyz) - D\}}{C(xy)} + w\,P_a(z \mid y)$
  • $w = DT/N$, where $N$ is the number of tokens and $T$
    is the number of types.
  • Rule of thumb: $D = n_1 / (n_1 + 2 n_2)$
  • Works well except for count = 1 situations
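
A bigram-level sketch of absolute discounting follows, backing off to the unigram relative frequency. It assumes one particular reading of the weight: $T$ is the number of distinct word types seen after the context and $N$ is the number of tokens seen after it, which makes the distribution sum to 1. The toy text and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training text.
tokens = "the rose is red and the rose is sweet and the violet is blue".split()
uni = Counter(tokens)
total = len(tokens)

followers = defaultdict(Counter)          # followers[y][z] = C(yz)
for y, z in zip(tokens, tokens[1:]):
    followers[y][z] += 1

def rule_of_thumb_discount():
    """D = n1 / (n1 + 2 * n2), with n_r counted over bigram types."""
    n = Counter(c for seen in followers.values() for c in seen.values())
    return n[1] / (n[1] + 2 * n[2])

def p_abs(z, y, discount=0.5):
    """Discounted bigram estimate plus the freed mass spread by unigram frequency."""
    seen = followers.get(y, Counter())
    n_ctx = sum(seen.values())            # tokens observed after y
    p_uni = uni[z] / total
    if n_ctx == 0:
        return p_uni                      # unseen context: fall back entirely
    w = discount * len(seen) / n_ctx      # w = D * T / N for this context
    return max(0.0, seen[z] - discount) / n_ctx + w * p_uni

print(rule_of_thumb_discount())
print(p_abs("rose", "the"), p_abs("blue", "the"))
```

Each seen bigram gives up exactly `discount` counts, so the total mass removed equals the weight `w` handed to the lower-order model.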

The Study of Burstiness

Burstiness of Words
  • The definitions of word frequency:
  • Term frequency (TF): count of occurrences in a
    given document
  • Document frequency (DF): count of documents in a
    corpus in which a word occurs
  • Generalized document frequency (DFj): like DF,
    but the word must occur at least j times
  • DF/N: Given a word, the chance we will see it in
    a document (the p in Church 2000).
  • ΣTF/N: Given a word, the average count we will
    see it in a document
  • Given we have seen a word in one document, what's
    the chance that we will see it again?
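
These frequency definitions can be computed on a toy corpus as follows. The four documents are invented for illustration, and using DF2/DF1 as an estimate of "seeing the word again" is an assumption of this sketch, not something stated on the slide.

```python
from collections import Counter

# Hypothetical corpus of tokenized documents.
docs = [
    "noriega said noriega would meet noriega aides".split(),
    "the general said nothing".split(),
    "noriega left".split(),
    "the market fell".split(),
]
n_docs = len(docs)                       # N
tf_per_doc = [Counter(d) for d in docs]  # TF of every word in every document

def df(word, j=1):
    """Generalized document frequency: documents with at least j occurrences."""
    return sum(1 for tf in tf_per_doc if tf[word] >= j)

word = "noriega"
p_doc = df(word) / n_docs                                # DF/N
mean_tf = sum(tf[word] for tf in tf_per_doc) / n_docs    # sum of TF over N
p_again = df(word, 2) / df(word)                         # DF2/DF1 (assumed estimate)
print(p_doc, mean_tf, p_again)
```

Here the word appears in half of the documents, and half of those documents contain it more than once.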

Burstiness: the question
  • What are the chances of seeing one, two, and
    three Noriegas within a document?
  • Traditional assumptions:
  • Poisson mixture, 2-Poisson model
  • Independence of words
  • The first occurrence depends on DF, but the
    second does not!
  • The adaptive language model (used in SR)
  • The degree of adaptation depends on lexical
    content, independent of the frequency.
  • "word rates vary from genre to genre, author to
    author, topic to topic, document to document,
    section to section, and paragraph to paragraph"
    -- Church and Gale

Count in the adaptations
  • Church's formulas
  • Cache model:
  • $\Pr(w) = \lambda \Pr_{\text{local}}(w) + (1 - \lambda) \Pr_{\text{global}}(w)$
  • History-test division: positive and negative
  • $\Pr(+\text{adapt}) = \Pr(w \in \text{test} \mid w \in \text{history})$
  • $\Pr(-\text{adapt}) = \Pr(w \in \text{test} \mid w \notin \text{history})$
  • Observation: $\Pr(+\text{adapt}) \gg \Pr(\text{prior}) > \Pr(-\text{adapt})$
  • Generalized DF:
  • $df_j$ = number of documents with $j$ or more
    instances of $w$.
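
The history-test division can be sketched by splitting each document in half and counting where a word shows up. Splitting at the midpoint, the toy documents, and the function name `adaptation` are assumptions made for this example.

```python
def adaptation(docs, word):
    """Estimate (prior, Pr(+adapt), Pr(-adapt)) for one word.

    Each document is split into a history (first half) and a test (second half).
    """
    in_hist = in_test = in_both = test_only = 0
    for doc in docs:
        half = len(doc) // 2
        history, test = set(doc[:half]), set(doc[half:])
        h, t = word in history, word in test
        in_hist += h
        in_test += t
        in_both += h and t
        test_only += t and not h
    not_hist = len(docs) - in_hist
    prior = in_test / len(docs)
    pos = in_both / in_hist if in_hist else 0.0
    neg = test_only / not_hist if not_hist else 0.0
    return prior, pos, neg

# Hypothetical corpus of tokenized documents.
docs = [
    "noriega spoke today and noriega left".split(),
    "troops saw noriega then noriega fled".split(),
    "noriega was not seen again today".split(),
    "markets fell sharply in early trading".split(),
    "the market opened lower said noriega".split(),
    "rain is expected over the weekend".split(),
]
prior, pos, neg = adaptation(docs, "noriega")
print(prior, pos, neg)
```

On this toy data the ordering matches the observation above: seeing the word in the history raises the chance of seeing it in the test portion well above the prior.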

Experimental results 1
  • High adaptation words (based on Pr(+adapt2))
  • a 14218 13306
  • and 14248 13196
  • ap 15694 14567
  • i 12178 11691
  • in 14533 13604
  • of 14648 13635
  • the 15183 14665
  • to 14099 13368
  • -----------------------------------------
  • agusta 18 17
  • akchurin 14 14
  • amex 20 20
  • apnewsalert 137 131
  • barghouti 11 11
  • berisha 18 17

Experimental results 2
  • Low adaptation words
  • asia 9560 489
  • brit 12632 18
  • ct 15694 7
  • eds 5631 11
  • english 15694 529
  • est 15694 72
  • euro 12660 261
  • lang 15694 24
  • ny 15694 370
  • ----------------------------------------------
  • accuses 177 3
  • angered 155 2
  • attract 109 2
  • carpenter 117 2
  • confirm 179 3
  • confirmation 114 2
  • considers 102 2

Experimental results 3
  • Words with low frequency and high burstiness
  • alianza, andorra, atl, awadallah, ayhan,
    bertelsmann, bhutto, bliss, boesak, bougainville,
    castel, chess, chiquita, cleopa, coolio,
    creatine, damas, demobilization
  • Words with high frequency and high burstiness
  • a, and, as, at, by, for, from, has, he, his, in,
    iraq, is, it, of, on, reg, said, that, the, to,
    u, was, were, with

Experimental results 4
  • Words with low frequency and low burstiness (lots)
  • accelerating, aga, aida, ales, annie, ashton,
    auspices, banditry, beg, beveridge, birgit,
    bombardments, bothered, br, breached, brisk,
    broadened, brunet, carrefour, catching, chant,
    combed, communicate, compel, concede,
    constituency, corpses, cushioned, defensively,
    deplore, desolate, dianne, dismisses
  • Words with high frequency and low burstiness (few)
  • adc, afri, ams, apw, asiangames, byline, edits,
    engl, filter, indi, mest, miscellaneous, ndld,
    nw, prompted, psov, rdld, recasts, stld, thld,
    ws, wstm

Detection of bursty words from a stream of
  • Idea: find features that occur with high
    intensity over a limited period of time.
  • Method: infinite-state automaton. Bursts appear
    as state transitions.
  • -- Kleinberg, Bursty and Hierarchical Structure
    in Streams. Proc. 8th ACM SIGKDD, 2002

Detecting Bursty Words
  • Term $w$ occurs in a sequence of text at positions
    $u_1, u_2, \ldots$; events happen with positive time
    gaps $x_1, x_2, \ldots$, where $x_1 = u_2 - u_1$,
    $x_2 = u_3 - u_2$, etc.
  • Assume the events are emitted by a probabilistic
    infinite-state automaton, each state associated
    with an exponential density function
    $f(x) = a e^{-ax}$, where $a$ is the rate parameter
    (the expected value of the gap is $a^{-1}$).
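
The automaton idea can be illustrated with a simplified two-state variant instead of the infinite-state one: a base state whose rate matches the average gap, and a burst state whose rate is `s` times higher. Entering the burst state costs `gamma * ln(n)`, leaving it is free, and the cheapest state sequence is found with dynamic programming. The parameter values, the sample gaps, and the function name are assumptions for this sketch.

```python
import math

def detect_bursts(gaps, s=2.0, gamma=1.0):
    """Return one state per gap: 0 = base rate, 1 = burst (two-state simplification)."""
    n = len(gaps)
    base_rate = n / sum(gaps)
    rates = (base_rate, s * base_rate)
    up_cost = gamma * math.log(n)        # price of moving from state 0 to state 1

    def gap_cost(state, x):
        a = rates[state]
        return -(math.log(a) - a * x)    # -ln(a * exp(-a * x))

    cost = [0.0, math.inf]               # the automaton starts in the base state
    back = []
    for x in gaps:
        step_cost, step_back = [], []
        for state in (0, 1):
            options = [cost[0] + (up_cost if state == 1 else 0.0), cost[1]]
            prev = 0 if options[0] <= options[1] else 1
            step_cost.append(options[prev] + gap_cost(state, x))
            step_back.append(prev)
        cost = step_cost
        back.append(step_back)

    # Trace the cheapest path backwards from the final gap.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for step_back in reversed(back[1:]):
        state = step_back[state]
        states.append(state)
    states.reverse()
    return states

# Hypothetical gaps: long pauses, then a run of short gaps, then long pauses again.
gaps = [10, 10, 10, 1, 1, 1, 1, 1, 10, 10, 10]
print(detect_bursts(gaps, s=3.0))
```

The transition cost is what keeps the sequence from flipping state on every short gap: a burst is reported only when enough short gaps accumulate to pay for entering the faster state.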

Finding the state transitions
  • From J. Kleinberg, Bursty and Hierarchical
    Structure in Streams. 8th ACM SIGKDD, 2002
  • Optimal sequence: fewer state transitions
    while keeping the rates in close agreement with
    the observed gaps.

Sample Results
  • From database conferences (SIGMOD, VLDB)
  • data, base, application: 1975-1979/1981/1982
  • relational: 1975-1989
  • schema: 1975-1980
  • distributed: 1977-1985
  • statistical: 1981-1984
  • transaction: 1987-1992
  • object-oriented: 1987-1994
  • parallel: 1989-1996
  • mining: 1995
  • web: 1998
  • xml: 1999-

Sample Results
  • From AI conferences (AAAI, IJCAI), 1980-2001
  • an: 1980-1982
  • language: 1980-1983
  • image: 1980-1987
  • prolog: 1983-1987
  • reasoning: 1987-1988
  • decision: 1992-1997
  • agents: 1998
  • agent: 1994
  • mobile: 1996
  • web: 1996
  • bayesian: 1996-1998
  • auctions: 1998
  • reinforcement: 1998

  • Discussion?