Title: Basic IR: Modeling
1Basic IR Modeling
- Basic IR Task
  - Match a subset of documents to the user's query
- Slightly more complex
  - ... and rank the resulting documents by predicted relevance
- The derivation of relevance leads to different IR models.
2Concepts Term-Document Incidence
- Imagine a matrix of terms × documents, with 1 when the term appears in the document and 0 otherwise.
- Queries satisfied how?
- Problems?

         search   segment   select   semantic
  MIR      1         0         1         1
  AI       1         1         0         1
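
As a minimal sketch of how such an incidence table could be built, the snippet below marks each term as present (1) or absent (0) per document. The document texts and the helper name `incidence_matrix` are invented for illustration; the texts are chosen only so that the rows match the table above.

```python
# Toy documents: the texts are hypothetical, picked so the resulting rows
# reproduce the 0/1 table from the slide.
docs = {
    "MIR": "search select semantic",
    "AI": "search segment semantic",
}
terms = ["search", "segment", "select", "semantic"]


def incidence_matrix(docs, terms):
    """Return {doc_name: [0/1 per term]} marking which terms occur in each doc."""
    matrix = {}
    for name, text in docs.items():
        words = set(text.lower().split())  # presence only, counts are discarded
        matrix[name] = [1 if t in words else 0 for t in terms]
    return matrix


matrix = incidence_matrix(docs, terms)
for name, row in matrix.items():
    print(name, row)
```

Because only presence is recorded, a document that mentions a term once looks the same as one that mentions it fifty times.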
3Concepts Term Frequency
- To support document ranking, we need more than just term incidence.
- Term frequency records the number of times a given term appears in each document.
- Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
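
A small sketch of counting term frequencies, assuming lowercased, whitespace-separated tokens (a real system would use a proper tokenizer). The sample sentence and the function name `term_frequencies` are made up for illustration.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


tf = term_frequencies("search engines rank search results")
# Counter returns 0 for terms that never occur in the text.
print(tf["search"], tf["rank"], tf["semantic"])  # prints: 2 1 0
```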
4Concept Term Weight
- Weights represent the importance of a given term for characterizing a document.
- $w_{ij}$ is the weight for term $i$ in document $j$.
5Mapping Task and Document Type to Model
                          Index Terms   Full Text         Full Text + Structure
  Searching (Retrieval)   Classic       Classic           Structured
  Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext
6IR Models
from MIR text
7Classic Models Basic Concepts
- $k_i$ is an index term
- $d_j$ is a document
- $t$ is the total number of index terms
- $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
- $w_{ij} > 0$ is a weight associated with $(k_i, d_j)$
- $w_{ij} = 0$ indicates that the term does not belong to the document
- $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$
- $g_i(\vec{d_j}) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$
8Classic Boolean Model
- Based on set theory: map queries with Boolean operations to set operations
- Select documents from the term-document incidence matrix
- Pros
- Cons
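
The mapping from Boolean operators to set operations can be sketched as follows. The posting sets are read off the term-document incidence table earlier in the deck; the helper `docs_with` and the three sample queries are illustrative assumptions, not part of the original slides.

```python
# Each term maps to the set of documents that contain it (from the 0/1 table).
postings = {
    "search": {"MIR", "AI"},
    "segment": {"AI"},
    "select": {"MIR"},
    "semantic": {"MIR", "AI"},
}
all_docs = {"MIR", "AI"}


def docs_with(term):
    """Documents containing the term; unknown terms match nothing."""
    return postings.get(term, set())


# search AND select -> set intersection
q_and = docs_with("search") & docs_with("select")
# segment OR select -> set union
q_or = docs_with("segment") | docs_with("select")
# semantic AND NOT segment -> intersection with the complement
q_not = docs_with("semantic") & (all_docs - docs_with("segment"))

print(sorted(q_and), sorted(q_or), sorted(q_not))
```

Every result is an unordered set: a document either matches or it does not, so there is nothing to rank by.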
9Exact Matching Ignores
- term frequency in document
- term scarcity in corpus
- size of document
- ranking
10Vector Model
- Vector of term weights based on term frequency
- Compute similarity between query and document, where both are vectors
- $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$, $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$
- Similarity is the cosine of the angle between the vectors.
11Cosine Measure
- Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le \mathrm{sim}(q, d_j) \le 1$
from MIR notes
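
The slide's formula did not survive extraction. The sketch below uses the standard cosine definition, the dot product of the two vectors divided by the product of their lengths, which is assumed to be the formula shown. The function name `cosine` and the sample vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # By convention here, an all-zero vector is similar to nothing.
    return dot / norm if norm else 0.0


print(cosine([1, 0, 1], [1, 1, 0]))  # one shared term out of two each
print(cosine([0, 0, 0], [1, 1, 0]))  # zero vector
```

With non-negative weights the dot product is never negative, which is why the similarity stays between 0 and 1.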
12How to Set Wij Weights? TF-IDF
- Within document: Term Frequency
  - tf measures term density within a document
- Across documents: Inverse Document Frequency
  - idf measures the informativeness or rarity of a term across the corpus.
13TF IDF Computation
- What happens as the number of occurrences in a document increases?
- What happens as a term becomes more rare?
14TF IDF
- TF may be normalized.
  - $\mathrm{tf}(i,d) = \mathrm{freq}(i,d) / \max_l \mathrm{freq}(l,d)$
- IDF is computed
  - normalized to the size of the corpus
  - as a log, to make TF and IDF values comparable
- IDF requires a static corpus.
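
A sketch of the two pieces combined. The normalized tf follows the formula above. The slide does not spell out the idf formula, so the form $\log(N / n_i)$ with a natural logarithm is an assumption, inferred from the worked example later in the deck. Function names are illustrative.

```python
import math


def normalized_tf(freqs):
    """tf(i,d) = freq(i,d) / max_l freq(l,d); all zeros for an empty document."""
    peak = max(freqs, default=0)
    return [f / peak if peak else 0.0 for f in freqs]


def idf(n_docs, doc_freq):
    """Assumed form: ln(corpus size / number of docs containing the term).

    Requires doc_freq > 0, i.e. the term occurs somewhere in the corpus.
    """
    return math.log(n_docs / doc_freq)


def tfidf(freqs, doc_freqs, n_docs):
    """Per-term weight: normalized tf multiplied by idf."""
    return [tf * idf(n_docs, df) for tf, df in zip(normalized_tf(freqs), doc_freqs)]


# Numbers taken from the deck's worked example: term counts 2, 0, 1 in the
# first document; the terms occur in 5, 4, and 3 of the 7 documents.
weights = tfidf([2, 0, 1], [5, 4, 3], 7)
print([round(w, 3) for w in weights])
```

The first weight comes out near 0.336 and the third near 0.424, in line with the .33 and .42 quoted in the worked example (the slide appears to truncate rather than round the first value).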
15How to Set Wi,q Weights?
- Create Vector directly from query
- Use modified tf-idf
16The Vector Model Example
from MIR notes
17The Vector Model Example (cont.)
- 1. Compute the TF-IDF vector for each document
- For the first document:
  - K1: $(2/2) \cdot \log(7/5) = .33$
  - K2: $0 \cdot \log(7/4) = 0$
  - K3: $(1/2) \cdot \log(7/3) = .42$
- For the rest:
  - (.34, 0, 0), (0, .19, .85), (.34, 0, 0), (.08, .28, .85), (.17, .56, 0), (0, .56, 0)
from MIR notes
18The Vector Model Example (cont.)
- 2. Compute the TF-IDF for the query (1, 2, 3)
  - K1: $(.5 + (.5 \cdot 1)/3) \cdot \log(7/5)$
  - K2: $(.5 + (.5 \cdot 2)/3) \cdot \log(7/4)$
  - K3: $(.5 + (.5 \cdot 3)/3) \cdot \log(7/3)$
- which is (.22, .47, .85)
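
The query weighting used in this step can be checked with a short sketch. It assumes the modified formula $w_{iq} = (0.5 + 0.5 \cdot \mathrm{freq}_{iq} / \max_l \mathrm{freq}_{lq}) \cdot \ln(N / n_i)$, read off the arithmetic above, and that every listed term actually occurs in the query. The function name `query_weights` is illustrative.

```python
import math


def query_weights(query_freqs, doc_freqs, n_docs):
    """Smoothed tf times idf for each query term.

    Assumes all query_freqs are > 0; a term absent from the query would
    otherwise still receive the 0.5 floor.
    """
    peak = max(query_freqs)
    return [
        (0.5 + 0.5 * f / peak) * math.log(n_docs / df)
        for f, df in zip(query_freqs, doc_freqs)
    ]


# Query term counts 1, 2, 3; the terms occur in 5, 4, and 3 of 7 documents.
q = query_weights([1, 2, 3], [5, 4, 3], 7)
print([round(w, 2) for w in q])  # prints: [0.22, 0.47, 0.85]
```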
19The Vector Model Example (cont.)
- 3. Compute the Sim for each document
- D1:
  - $D_1 \cdot q = (.33 \cdot .22) + (0 \cdot .47) + (.42 \cdot .85) = .43$
  - $|D_1| = \sqrt{.33^2 + .42^2} = .53$
  - $|q| = \sqrt{.22^2 + .47^2 + .85^2} = 1.0$
  - $\mathrm{sim} = .43 / (.53 \cdot 1.0) = .81$
- D2 = .22, D3 = .93, D4 = .23
- D5 = .97, D6 = .51, D7 = .47
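
To see the ranking this produces, the sketch below recomputes the cosine for all seven documents from the rounded vectors quoted in the example. Since the inputs are already rounded, the scores may differ from the slide's figures in the second decimal place.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Rounded TF-IDF vectors (K1, K2, K3) as listed in the example.
docs = {
    "D1": (.33, 0, .42),
    "D2": (.34, 0, 0),
    "D3": (0, .19, .85),
    "D4": (.34, 0, 0),
    "D5": (.08, .28, .85),
    "D6": (.17, .56, 0),
    "D7": (0, .56, 0),
}
query = (.22, .47, .85)

scores = {name: cosine(vec, query) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
```

D5 and D3 come out on top because they carry most of their weight on K3, the rarest and most heavily weighted query term. D2 and D4 tie at the bottom, since they contain only K1.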
20Vector Model Implementation Issues
- Sparse term × document matrix
- Store term count, term weight, or weighted by $idf_i$?
- What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
- How to efficiently compute cosine for a large index?
21Heuristics for Computing Cosine for Large Index
- Select from only non-zero cosines
- Focus on non-zero cosines for rare (high idf) words
- Pre-compute document adjacency
  - for each term, pre-compute k nearest docs
  - for a t-term query, compute cosines from the query to the union of t pre-computed lists, choose top k
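
One way to restrict scoring to non-zero cosines is to walk an inverted index and accumulate partial dot products only for documents that share a term with the query. This is a minimal sketch of that idea, not the pre-computed adjacency scheme. The data, the function names `build_index` and `top_k`, and the weights are all hypothetical.

```python
import math
from collections import defaultdict


def build_index(doc_vectors):
    """Map each term to a postings list of (doc, weight), skipping zero weights."""
    index = defaultdict(list)
    for doc, weights in doc_vectors.items():
        for term, w in weights.items():
            if w:
                index[term].append((doc, w))
    return index


def top_k(query, index, doc_norms, k):
    """Score only documents that share at least one term with the query."""
    acc = defaultdict(float)  # doc -> partial dot product
    for term, wq in query.items():
        for doc, wd in index.get(term, ()):
            acc[doc] += wq * wd
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = {d: dot / (doc_norms[d] * q_norm) for d, dot in acc.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical sparse TF-IDF weights: only non-zero entries are stored.
doc_vectors = {
    "d1": {"search": 0.3, "semantic": 0.4},
    "d2": {"segment": 0.5},
    "d3": {"search": 0.6},
}
doc_norms = {
    d: math.sqrt(sum(w * w for w in ws.values())) for d, ws in doc_vectors.items()
}
index = build_index(doc_vectors)
results = top_k({"search": 1.0}, index, doc_norms, k=2)
print(results)
```

Document `d2` is never touched because it shares no term with the query, which is the saving the first heuristic is after. Storing vectors as dictionaries of non-zero weights also reflects the sparsity of the term × document matrix.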
22The TFIDF Vector Model Pros/Cons
- Pros
  - term weighting improves quality
  - the cosine ranking formula sorts documents according to their degree of similarity to the query
- Cons
  - assumes independence of index terms