Title: Under The Hood Part I: Web-Based Information Architectures
1. Under The Hood Part I: Web-Based Information Architectures
- MSEC 20-760 Mini II
- 28-October-2003
- Jaime Carbonell
2. Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems
3. The Vector Space Model
- Definitions of document and query vectors, where wj is the jth word and c(wj, di) counts the occurrences of wj in document di
4. Computing the Similarity
- Dot-product similarity: sim(q, di) = q · di
- Cosine similarity: sim(q, di) = (q · di) / (||q|| ||di||)
5. Computing Norms and Products
- Dot product: q · d = Σj c(wj, q) · c(wj, d)
- Euclidean vector norm (aka 2-norm): ||d|| = sqrt(Σj c(wj, d)²)
6. Similarity in Retrieval
- Similarity ranking
- If sim(q, di) > sim(q, dj), di ranks higher
- Retrieving top k documents
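The pipeline on the last three slides can be sketched in a few lines of Python. This is illustrative code, not from the lecture: `vectorize`, `cosine`, and `top_k` are our own names, and the vectors here use raw term counts (term weighting comes later in the deck).

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words vector: term -> occurrence count c(w, d)."""
    return Counter(text.lower().split())

def cosine(q, d):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(q[t] * d[t] for t in q if t in d)
    nq = sqrt(sum(v * v for v in q.values()))
    nd = sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def top_k(query, docs, k):
    """Rank documents by cosine similarity to the query; return the top k."""
    q = vectorize(query)
    return sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]
```

Because cosine divides out the norms, a long document does not outrank a short one merely by repeating terms.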
7. Refinements to VSM (1)
- Word normalization
- Words in morphological root form
- countries -> country
- interesting -> interest
- Stemming as a fast approximation
- countries, country -> countr
- moped -> mop
- Reduces vocabulary (always good)
- Generalizes matching (usually good)
- More useful for non-English IR
- (Arabic has > 100 variants per verb)
8. Refinements to VSM (2)
- Stop-Word Elimination
- Discard articles, auxiliaries, prepositions, ... (typically the 100-300 most frequent small words)
- Reduces document length by 30-40%
- Retrieval accuracy improves slightly (5-10%)
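A minimal sketch of both refinements, assuming a toy stop list and crude suffix stripping. Real systems use a 100-300 word stop list and a proper stemmer such as Porter's; the tiny lists below are illustrative only, chosen to reproduce the slides' examples (including the "moped -> mop" overstemming).

```python
# Illustrative stop list and suffix list -- far smaller than a real system's.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "are"}
SUFFIXES = ("ies", "ing", "ed", "s")  # checked longest-first

def stem(word):
    """Strip the first matching suffix: a fast approximation to morphology."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def normalize(text):
    """Lowercase, drop stop words, stem the rest."""
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]
```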
9. Refinements to VSM (3)
- Proximity Phrases
- E.g. "air force" -> airforce
- Found by high mutual information
- p(w1 w2) >> p(w1) p(w2)
- p(w1, w2 in k-window) >> p(w1 in k-window) p(w2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases -> inefficiency
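The mutual-information test above can be sketched for adjacent word pairs (window k = 2). The threshold and minimum count are illustrative knobs, not values from the lecture; the count filter keeps one-off pairs, which always score high PMI, out of the phrase list.

```python
from collections import Counter
from math import log2

def phrase_candidates(tokens, threshold=2.0, min_count=2):
    """Adjacent pairs with pointwise mutual information
    log2( p(w1 w2) / (p(w1) p(w2)) ) above a threshold, i.e. pairs
    co-occurring far more often than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    phrases = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable, inflated PMI
        pmi = log2((c / nb) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= threshold:
            phrases.append((w1, w2))
    return phrases
```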
10. Refinements to VSM (4)
- Words -> Terms
- term = word | stemmed word | phrase
- Use exactly the same VSM method on terms (vs. words)
11. Evaluating Information Retrieval (1)
- With a = relevant and retrieved, b = irrelevant and retrieved, c = relevant but not retrieved, d = irrelevant and not retrieved:
- Recall = a/(a+c): fraction of relevant documents that are retrieved
- Precision = a/(a+b): fraction of retrieved documents that are relevant
12. Evaluating Information Retrieval (2)
- P = a/(a+b), R = a/(a+c)
- Accuracy = (a+d)/(a+b+c+d)
- F1 = 2PR/(P+R)
- Miss = c/(a+c) = 1 - R
- (false negatives)
- F/A = b/(a+b+c+d)
- (false positives)
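The metrics on these two slides follow directly from the 2x2 contingency counts; a small helper (illustrative code, our own naming) makes the definitions concrete:

```python
def ir_metrics(a, b, c, d):
    """Metrics from the 2x2 contingency table:
    a = relevant & retrieved,  b = irrelevant & retrieved,
    c = relevant & missed,     d = irrelevant & not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * precision * recall / (precision + recall),
        "miss": c / (a + c),                  # = 1 - recall (false negatives)
        "false_alarm": b / (a + b + c + d),   # false positives
    }
```

Note that accuracy is usually uninformative in IR: d (irrelevant, not retrieved) dwarfs the other cells, so accuracy stays near 1 even for poor rankings.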
13. Evaluating Information Retrieval (3)
- 11-point precision curves
- IR system generates a total ranking
- Plot precision at 10%, 20%, 30%, ..., 100% recall
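One common way to compute such a curve, sketched below, uses interpolated precision: precision at recall level r is taken as the maximum precision achieved at any recall >= r. This is the standard TREC-style convention; the slide does not specify interpolation, so treat this as one reasonable reading.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0%, 10%, ..., 100%,
    given a total ranking and the set of relevant documents."""
    hits, points = 0, []
    for rank, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in (i / 10 for i in range(11))]
```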
14. Query Expansion (1)
- Observations
- Longer queries often yield better results
- User's vocabulary may differ from document vocabulary
- Q: how to avoid heart disease
- D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
- Maybe longer queries have more chances to help recall.
15. Query Expansion (2)
- Bridging the Gap
- Human query expansion (user or expert)
- Thesaurus-based expansion
- Seldom works in practice (unfocused)
- Relevance feedback
- Widen a thin bridge over vocabulary gap
- Adds words from document space to query
- Pseudo-Relevance feedback
- Local Context analysis
16. Relevance Feedback: Rocchio's Method
- Idea: update the query via user feedback
- Exact method (vector sums): Qnew = α·Qold + β·Σ Drel - γ·Σ Dirr
17. Relevance Feedback (2)
- For example, if
- Q = (heart attack medicine)
- W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
- Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
- W(nitroglycerine,D) = 2, W(medicine,D) = 1
- Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
- W(attack,D) = 1, W(nitroglycerine,D) = 2, W(explosive,D) = 1
- and α = 1, β = 2, γ = 0.5
18. Relevance Feedback (3)
- Then:
- W(attack,Q) = 1·1 - 0.5·1 = 0.5
- W(nitroglycerine,Q) = 2·2 - 0.5·2 = 3
- W(medicine,Q) = 1·1 + 2·1 = 3
- W(explosive,Q) = -0.5·1 = -0.5
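The worked example on the last two slides can be reproduced with a short implementation of the unnormalized vector-sum form of Rocchio (illustrative code; sparse vectors are plain `{term: weight}` dicts):

```python
from collections import Counter

def rocchio(query, rel_docs, irr_docs, alpha=1.0, beta=2.0, gamma=0.5):
    """Rocchio update: Qnew = alpha*Q + beta*sum(rel docs) - gamma*sum(irr docs),
    over sparse {term: weight} vectors."""
    new_q = Counter()
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in rel_docs:
        for term, w in doc.items():
            new_q[term] += beta * w     # pull toward relevant documents
    for doc in irr_docs:
        for term, w in doc.items():
            new_q[term] -= gamma * w    # push away from irrelevant ones
    return dict(new_q)
```

Note that terms absent from the original query (nitroglycerine, explosive) enter the new query, which is exactly how relevance feedback widens the vocabulary bridge.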
19. Term Weighting Methods (1)
- Salton's Tf-IDf
- Tf = term frequency in a document
- Df = document frequency of the term (number of documents in the collection containing the term)
- IDf = 1/Df
20. Term Weighting Methods (2)
- Salton's Tf-IDf
- TfIDf = f1(Tf) · f2(IDf)
- E.g. f1(Tf) = Tf · ave(|Dj|) / |D|
- E.g. f2(IDf) = log2(IDf)
- f1 and f2 can differ for Q and D
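A minimal sketch of Tf-IDf weighting, using one common instantiation of f1 and f2: raw term frequency and log2 of the inverse document fraction. This is an assumption for illustration, not necessarily the exact variant on the slide.

```python
from collections import Counter
from math import log2

def tfidf_weights(docs):
    """Per-document Tf-IDf weights: weight(t, d) = Tf(t, d) * log2(N / Df(t)),
    with Tf = raw count in the document and Df = number of docs containing t."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    return [{t: c * log2(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]
```

A term that occurs in every document gets weight 0, which matches the intuition behind IDf: ubiquitous terms carry no discriminating power.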
21. Efficient Implementations of VSM (1)
- Exploit sparseness
- Only compute non-zero multiplies in dot-products
- Do not even look at zero elements (how?)
- => Use non-stop terms to index documents
22. Efficient Implementations of VSM (2)
- Inverted Indexing
- Find all unique stemmed terms in the document collection
- Remove stopwords from the word list
- If the collection is large (over 100,000 documents), optionally remove singletons (usually spelling errors or obscure names)
- Alphabetize or use a hash table to store the list
- For each term, create a data structure like:
23. Efficient Implementations of VSM (3)
- termi: [IDF(termi),
  <doci, freq(termi, doci)>,
  <docj, freq(termi, docj)>,
  ...]
- or, with term positions:
- termi: [IDF(termi),
  <doci, freq(termi, doci), pos1,i, pos2,i, ...>,
  <docj, freq(termi, docj), pos1,j, pos2,j, ...>,
  ...]
- pos1,j indicates the first position of termi in docj, and so on.
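The frequency-only variant of this data structure can be built in a few lines (illustrative code; the stop list is a toy, and positions are omitted for brevity):

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "in"}  # toy list; real systems use 100-300 words

def build_inverted_index(docs):
    """Map each non-stop term to its postings list:
    term -> [(doc_id, frequency), ...]."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        counts = Counter(w for w in text.lower().split() if w not in STOPWORDS)
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index
```

With this index, a query's dot products touch only documents that share at least one term with the query, which is how zero elements are never even looked at.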
24. Open Research Problems in IR (1)
- Beyond VSM
- Vectors in different spaces
- Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (Language Modeling)
- P(D|Q) = P(Q|D) P(D) / P(Q)
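One concrete instance of the language-modeling view scores documents by P(Q|D) under a unigram model of each document. The sketch below uses simple linear (Jelinek-Mercer) smoothing against the whole collection; the smoothing scheme and the `mu` value are our assumptions for illustration, not details from the slide.

```python
from collections import Counter

def query_likelihood(query, doc, collection, mu=0.5):
    """Score P(Q|D) under a smoothed unigram language model:
    P(w|D) = (1 - mu) * count(w, D)/|D| + mu * count(w, C)/|C|.
    query: list of terms; doc and collection: lists of tokens."""
    d, c = Counter(doc), Counter(collection)
    score = 1.0
    for w in query:
        score *= (1 - mu) * d[w] / len(doc) + mu * c[w] / len(collection)
    return score
```

Smoothing keeps a single unseen query term from zeroing out an otherwise good document, the standard fix in probabilistic IR.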
25. Open Research Problems in IR (2)
- Beyond Relevance
- Appropriateness of doc to user (comprehension level, etc.)
- Novelty of information in doc to user
- Anti-redundancy as an approximation to novelty
26. Open Research Problems in IR (3)
- Beyond one Language
- Translingual IR
- Transmedia IR
27. Open Research Problems in IR (4)
- Beyond Content Queries
- "What's new today?"
- "What sort of things do you know about?"
- "Build me a Yahoo-style index for X"
- "Track the event in this news-story"