Title: Introduction to Statistical Modeling
Why Statistical Modeling?
- The vector space model for information retrieval
  - Both documents and queries are vectors in the term space
  - Relevance is measured by the similarity between the document vectors and the query vector
- Many problems with the vector space model
  - Ad-hoc term weighting schemes
  - Ad-hoc basis vectors
  - Ad-hoc similarity measurement
- We need something that is much more principled!
A Simple Example (I)
- Suppose you have three coins: C1, C2, C3
- Alex picked one of the coins and flipped it six times
- You didn't see which coin he picked, but you observed the results of the flips: t, h, t, h, t, t
- Question: how do you guess which coin Alex chose?
A Simple Example (II)
- You experimented with the three coins, say 6 flips each
  - C1: h, h, h, t, h, h
  - C2: t, t, h, t, t, t
  - C3: t, h, t, t, t, h
- Given the observed sequence t, h, t, h, t, t
- Now, which coin do you think Alex chose?
A Simple Example (III)
- q: t, h, t, h, t, t → bias bq = 2/6 = 1/3 (here the bias is the observed fraction of heads)
- C1: h, h, h, t, h, h → bias b1 = 5/6
- C2: t, t, h, t, t, t → bias b2 = 1/6
- C3: t, h, t, t, t, h → bias b3 = 2/6 = 1/3
- So, which coin do you think Alex selected?
- A more principled approach: compute the likelihood p(q|Ci) for each coin
A Simple Example (IV)
- p(q|C1) = p(t, h, t, h, t, t | C1)
  = p(t|C1) p(h|C1) p(t|C1) p(h|C1) p(t|C1) p(t|C1)
  = 1/6 × 5/6 × 1/6 × 5/6 × 1/6 × 1/6 ≈ 5.3 × 10^-4
- Compute p(q|C2) and p(q|C3) in the same way
  - p(q|C2) ≈ 0.013, p(q|C3) ≈ 0.02
- Which coin has the largest likelihood? (see the sketch below)
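As a quick check of the arithmetic above, here is a minimal Python sketch (the coin labels and biases are taken from the preceding slides) that computes p(q|Ci) for each coin and reports the most likely one:

    # Maximum-likelihood guess of which coin produced the observed flips.
    # Biases (probability of heads) are the ones estimated on the earlier slides.
    biases = {"C1": 5/6, "C2": 1/6, "C3": 1/3}   # p(h | coin)
    q = ["t", "h", "t", "h", "t", "t"]           # observed flips

    def likelihood(p_heads, flips):
        """p(flips | coin), treating the flips as independent."""
        prob = 1.0
        for f in flips:
            prob *= p_heads if f == "h" else 1.0 - p_heads
        return prob

    for coin, p_h in biases.items():
        print(coin, likelihood(p_h, q))          # C1 ~ 5.3e-4, C2 ~ 0.013, C3 ~ 0.022

    best = max(biases, key=lambda c: likelihood(biases[c], q))
    print("Most likely coin:", best)             # C3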
An Information Retrieval View
- Query (q): t, h, t, h, t, t
- Doc1 (C1): h, h, h, t, h, h
- Doc2 (C2): t, t, h, t, t, t
- Doc3 (C3): t, h, t, t, t, h
- Which document is ranked first if we use the vector space model?
An Information Retrieval View
- Query (q): t, h, t, h, t, t
- Doc1 (C1): h, h, h, t, h, h → sim(D1) = 1/3 × 5/6 + 2/3 × 1/6 ≈ 0.39
- Doc2 (C2): t, t, h, t, t, t → sim(D2) = 1/3 × 1/6 + 2/3 × 5/6 ≈ 0.61
- Doc3 (C3): t, h, t, t, t, h → sim(D3) = 1/3 × 1/3 + 2/3 × 2/3 ≈ 0.56
- Which document is ranked first if we use the vector space model? (the sketch below reproduces these similarities)
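The slide does not spell out the similarity formula, but a plain dot product between the query's and each document's (p(h), p(t)) frequency vectors reproduces the numbers above; a small sketch:

    # Vector-space view of the coin example: documents and the query are
    # represented by their (p(h), p(t)) frequency vectors, and relevance is
    # measured by the dot product between the vectors.
    def freq_vector(flips):
        h = flips.count("h")
        return (h / len(flips), 1 - h / len(flips))   # (p(h), p(t))

    query = freq_vector("ththtt")        # (1/3, 2/3)
    docs = {
        "Doc1": freq_vector("hhhthh"),   # (5/6, 1/6)
        "Doc2": freq_vector("tthttt"),   # (1/6, 5/6)
        "Doc3": freq_vector("thttth"),   # (1/3, 2/3)
    }

    for name, (ph, pt) in docs.items():
        sim = query[0] * ph + query[1] * pt
        print(name, round(sim, 2))       # Doc1 0.39, Doc2 0.61, Doc3 0.56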
A Simple Example Summary
- Estimating the likelihood p(q | bias)
- Estimating a bias for each coin
A Probabilistic Framework for Information Retrieval
- Estimating the likelihood p(q | θ)
- Estimating some statistics θ for each document
A Probabilistic Framework for Information Retrieval
- Three fundamental questions
  - What statistics θ should be chosen to describe the characteristics of documents?
  - How do we estimate these statistics θ?
  - How do we compute the likelihood of generating queries given the statistics θ?
Unigram Language Model
- Probabilities for single words: p(w)
  - One probability p(w) for each word w in the vocabulary V, with Σ_{w∈V} p(w) = 1
- Estimating a unigram language model: simple counting
  - Given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|, where |d| is the document length (see the sketch below)
- How do we estimate the likelihood p(q|θ)?
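A minimal sketch of the counting estimate just described; the toy document text is made up purely for illustration:

    # Maximum-likelihood unigram model by simple counting: p(w) = c(w, d) / |d|.
    from collections import Counter

    def unigram_lm(document_text):
        words = document_text.lower().split()
        counts = Counter(words)
        total = len(words)
        return {w: c / total for w, c in counts.items()}

    # A made-up toy document, just to show the counting.
    theta_d = unigram_lm("the cat sat on the mat")
    print(theta_d["the"])   # 2/6 = 0.333...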
Estimate p(q|θ)
- q = w1, w2, ..., wk
- Similar to the example of flipping coins
- E.g. q = {bush, kerry}
  - θd: p(bush) = 0.001, p(kerry) = 0.02
  - p(q|θd) = 0.001 × 0.02 = 2 × 10^-5
- What if the document didn't mention the word "bush", but instead used the phrase "president of the united states"? (see the sketch below)
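A small sketch of the query-likelihood computation with the numbers from this slide; the second model's probabilities are made up to show how an unseen query word drives the likelihood to zero, which previews the sparse-data problem discussed later:

    # Query likelihood under a unigram model: p(q | theta_d) = product of p(w_i | theta_d).
    def query_likelihood(theta_d, query):
        prob = 1.0
        for w in query:
            prob *= theta_d.get(w, 0.0)   # words not in the model get probability 0
        return prob

    theta_d = {"bush": 0.001, "kerry": 0.02}
    print(query_likelihood(theta_d, ["bush", "kerry"]))    # about 2e-05

    # If the document never mentions "bush" (it only says, e.g.,
    # "president of the united states"), the whole likelihood collapses to 0.
    # These probabilities are illustrative, not from the slides.
    theta_d2 = {"president": 0.01, "united": 0.01, "states": 0.01, "kerry": 0.02}
    print(query_likelihood(theta_d2, ["bush", "kerry"]))   # 0.0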
Illustration of Language Models for Information Retrieval
- Estimating language models by counting, e.g. θ1: p(h) = 1/3, p(t) = 2/3 and θ2: p(h) = 1/2, p(t) = 1/2
- Estimating the likelihood p(q|θ) = p(h)^2 p(t)^4
A Simple Example Summary
- Estimating language models by counting, e.g. θ2: p(h) = 1/6, p(t) = 5/6 and θ3: p(h) = 1/3, p(t) = 2/3
- Estimating the likelihood p(q|θ) = p(h)^2 p(t)^4
- Problems?
Problems With Unigram LM
- Unigram probabilities
  - Insufficient for representing true documents
- Simple counting for estimating unigram probabilities
  - It does not account for variance in documents: if you ask the same person to write the same story twice, the two versions will differ
  - Most words will have zero probabilities → the sparse data problem
Sparse Data Problems
- Shrinkage
- Maximum a posteriori (MAP) estimation
- Bayesian approach
Shrinkage: Jelinek-Mercer Smoothing
- Linearly interpolate between the document language model and the collection language model
- 0 < λ < 1 is a smoothing parameter (the formula is written out below)
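Written out, the interpolation the slide describes is the standard Jelinek-Mercer estimate (reconstructed here; p_ml is the counting estimate from the unigram slide and p(w|C) is the collection language model):

    p_\lambda(w \mid d) \;=\; (1-\lambda)\, p_{\mathrm{ml}}(w \mid d) \;+\; \lambda\, p(w \mid C), \qquad 0 < \lambda < 1

With any λ > 0, a word that does not occur in d still receives probability λ p(w|C), which removes the zero-probability problem from the previous slides.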
Smoothing and TF-IDF Weighting
- Are smoothing and TF-IDF weighting totally irrelevant to each other?
Smoothing and TF-IDF Weighting
- Part of the smoothed query likelihood is similar to TF-IDF weighting
- The remaining part is irrelevant to the documents, so it does not affect the ranking (see the decomposition below)
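The two annotations above refer to the standard decomposition of the Jelinek-Mercer-smoothed query likelihood; the equation itself is not in the text, so it is reconstructed here in the notation of the smoothing slide:

    \log p(q \mid d)
      \;=\; \underbrace{\sum_{w \in q,\; c(w,d) > 0} \log\!\left(1 + \frac{(1-\lambda)\, p_{\mathrm{ml}}(w \mid d)}{\lambda\, p(w \mid C)}\right)}_{\text{similar to TF-IDF weighting}}
      \;+\; \underbrace{|q| \log \lambda \;+\; \sum_{w \in q} \log p(w \mid C)}_{\text{irrelevant to documents}}

The first sum grows with the term frequency c(w,d) (a TF effect) and shrinks when p(w|C) is large, i.e. when the word is common in the collection (an IDF-like effect); the last two terms do not depend on d and can be dropped for ranking.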