Under The Hood Part I: Web-Based Information Architectures
- MSEC 20-760, Mini II
- Jaime Carbonell
Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems
The Vector Space Model (1)
- Let S = {w1, w2, ..., wn} be the set of index terms (the vocabulary)
- Let Dj = <c(w1, Dj), c(w2, Dj), ..., c(wn, Dj)>, where c(wi, Dj) is the count of wi in document Dj
- Let Q = <c(w1, Q), c(w2, Q), ..., c(wn, Q)>
The Vector Space Model (2)
- Initial definition of similarity:
- SI(Q, Dj) = Q . Dj
- Normalized definition of similarity:
- SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)
- = cos(Q, Dj)
The Vector Space Model (3)
- Relevance ranking:
- If SN(Q, Di) > SN(Q, Dj),
- then Di is more relevant to Q than Dj
- Retrieve(k, Q, {Dj}) = Argmax_k cos(Q, Dj)
- over Dj in {Dj}
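A minimal sketch of this model in Python, using raw term counts and whitespace tokenization (the function names are my own, not from the slides):

```python
import math
from collections import Counter

def vectorize(tokens):
    """Raw term-frequency vector c(w, .): term -> count."""
    return Counter(tokens)

def cosine(q, d):
    """Normalized similarity SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)."""
    dot = sum(count * d.get(term, 0) for term, count in q.items())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def retrieve(k, query, docs):
    """Retrieve(k, Q, {Dj}): the k documents maximizing cos(Q, Dj)."""
    q = vectorize(query.split())
    scored = {name: cosine(q, vectorize(text.split()))
              for name, text in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```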
Refinements to VSM (1)
- Word normalization
- Map words to their morphological root form:
- countries -> country
- interesting -> interest
- Stemming as a fast approximation:
- countries, country -> countr
- moped -> mop
- Reduces vocabulary size (always good)
- Generalizes matching (usually good)
- More useful for non-English IR
- (Arabic has > 100 variants per verb)
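For illustration, a quick check with NLTK's Porter stemmer (this assumes the nltk package is installed; Porter produces "countri" rather than the slide's "countr", but the effect is the same: both surface forms collapse to one index term):

```python
# Assumes the nltk package is installed (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["countries", "country", "interesting", "interest"]:
    print(word, "->", stemmer.stem(word))
# countries -> countri    country -> countri    (shared index term)
# interesting -> interest interest -> interest
```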
Refinements to VSM (2)
- Stop-word elimination
- Discard articles, auxiliaries, prepositions, ...
- Typically the 100-300 most frequent small words
- Reduces document length by 30-40%
- Retrieval accuracy improves slightly (5-10%)
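A toy illustration with a tiny stop list (the list below is my own abbreviation; real systems use the 100-300 most frequent function words, as noted above):

```python
# A tiny illustrative stop list; real systems use a much longer one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("how to avoid the risks of heart disease".split()))
# ['how', 'avoid', 'risks', 'heart', 'disease']
```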
Refinements to VSM (3)
- Proximity phrases
- E.g. "air force" -> airforce
- Found by high mutual information:
- p(w1 w2) >> p(w1)p(w2)
- p(w1, w2 in k-window) >>
- p(w1 in k-window) p(w2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases -> inefficiency
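A sketch of the mutual-information test above for adjacent word pairs, using maximum-likelihood estimates over a token stream (the function name is my own):

```python
import math
from collections import Counter

def adjacent_pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2( p(w1 w2) / (p(w1) p(w2)) ), estimated by maximum likelihood."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    p12 = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p12 / (p1 * p2)) if p12 else float("-inf")
```

A high score, i.e. p(w1 w2) >> p(w1)p(w2), flags candidate phrases such as "air force".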
Refinements to VSM (4)
- Words -> Terms
- term = word | stemmed word | phrase
- Use exactly the same VSM method on terms (vs. words)
Evaluating Information Retrieval (1)
- Contingency table of retrieval outcomes:
- a = relevant and retrieved, b = irrelevant and retrieved
- c = relevant and not retrieved, d = irrelevant and not retrieved
Evaluating Information Retrieval (2)
- P = a/(a+b), R = a/(a+c)
- Accuracy = (a+d)/(a+b+c+d)
- F1 = 2PR/(P+R)
- Miss = c/(a+c) = 1 - R
- (false negatives)
- F/A = b/(a+b+c+d)
- (false positives)
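The metrics above, computed directly from the four contingency-table counts (a sketch; the function name is my own, and F/A follows the slide's definition):

```python
def ir_metrics(a, b, c, d):
    """a: relevant & retrieved, b: irrelevant & retrieved,
    c: relevant & not retrieved, d: irrelevant & not retrieved."""
    p = a / (a + b)                          # precision
    r = a / (a + c)                          # recall
    return {
        "precision": p,
        "recall": r,
        "accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * p * r / (p + r),
        "miss": c / (a + c),                 # = 1 - R (false negatives)
        "false_alarm": b / (a + b + c + d),  # as defined on the slide
    }
```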
Evaluating Information Retrieval (3)
- 11-point precision curves
- IR system generates a total ranking
- Plot precision at 0%, 10%, 20%, ..., 100% recall
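One common way to compute the curve uses interpolated precision (the maximum precision at any point with recall at or above the target level); that convention is my assumption, since the slide does not fix one:

```python
def eleven_point_curve(ranked_relevance, total_relevant):
    """ranked_relevance: 0/1 relevance flags in system rank order.
    Returns interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in (i / 10 for i in range(11))]
```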
Query Expansion (1)
- Observations:
- Longer queries often yield better results
- Users' vocabulary may differ from document vocabulary
- Q = "how to avoid heart disease"
- D = "Factors in minimizing stroke and cardiac arrest: recommended dietary and exercise regimens"
- Maybe longer queries have more chances to help recall.
Query Expansion (2)
- Bridging the gap:
- Human query expansion (user or expert)
- Thesaurus-based expansion
- Seldom works in practice (unfocused)
- Relevance feedback
- Widens a thin bridge over the vocabulary gap
- Adds words from document space to the query
- Pseudo-relevance feedback
- Local context analysis
Relevance Feedback (1)
- Rocchio formula
- Q' = F(Q, Dret)
- F = weighted vector sum, such as:
- W(t, Q') = aW(t, Q) + bW(t, Drel) - cW(t, Dirr)
Relevance Feedback (2)
- For example, if
- Q = (heart attack medicine)
- W(heart, Q) = W(attack, Q) = W(medicine, Q) = 1
- Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
- W(nitroglycerine, D) = 2, W(medicine, D) = 1
- Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
- W(attack, D) = 1, W(nitroglycerine, D) = 2,
- W(explosive, D) = 1
- and a = 1, b = 2, c = 0.5
Relevance Feedback (3)
- Then:
- W(attack, Q') = 1x1 - 0.5x1 = 0.5
- W(nitroglycerine, Q') = 2x2 - 0.5x2 = 3
- W(medicine, Q') = 1x1 + 2x1 = 3
- W(explosive, Q') = -0.5x1 = -0.5
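A minimal sketch of the Rocchio update on term-weight dictionaries, reproducing the numbers above (names are my own):

```python
def rocchio(q, d_rel, d_irr, a=1.0, b=2.0, c=0.5):
    """W(t, Q') = a*W(t, Q) + b*W(t, Drel) - c*W(t, Dirr)."""
    terms = set(q) | set(d_rel) | set(d_irr)
    return {t: a * q.get(t, 0) + b * d_rel.get(t, 0) - c * d_irr.get(t, 0)
            for t in terms}

# Term weights from the example above:
q     = {"heart": 1, "attack": 1, "medicine": 1}
d_rel = {"cardiac": 1, "arrest": 1, "prevention": 1, "medicine": 1,
         "nitroglycerine": 2, "heart": 1, "disease": 1}
d_irr = {"terrorist": 1, "attack": 1, "explosive": 1, "semtex": 1,
         "nitroglycerine": 2, "proximity": 1, "fuse": 1}
new_q = rocchio(q, d_rel, d_irr)
print(new_q["attack"], new_q["nitroglycerine"],
      new_q["medicine"], new_q["explosive"])
# 0.5 3.0 3.0 -0.5
```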
Term Weighting Methods (1)
- Salton's TfIDf
- Tf = term frequency in a document
- Df = document frequency of a term
- = number of documents in the collection
- containing this term
- IDf = Df^-1 (inverse document frequency)
Term Weighting Methods (2)
- Salton's TfIDf
- TfIDf = f1(Tf) x f2(IDf)
- E.g. f1(Tf) = Tf x ave(|Dj|) / |D|
- E.g. f2(IDf) = log2(IDf)
- f1 and f2 can differ for Q and D
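A sketch of one concrete instantiation. Since the slide writes IDf = Df^-1 schematically, I scale it here by the collection size N (IDf = N/Df, a common convention) so the logarithm stays non-negative:

```python
import math

def tf_idf(tf, df, n_docs, doc_len, avg_doc_len):
    """TfIDf = f1(Tf) * f2(IDf), with the example choices above."""
    f1 = tf * avg_doc_len / doc_len   # length-normalized term frequency
    f2 = math.log2(n_docs / df)       # log2 of the (scaled) inverse Df
    return f1 * f2
```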
Vector Space Model: a toy example
- Suppose your document collection contains only 2 documents
- Q = (heart attack medicine)
- D1 = (cardiac arrest prevention medicine nitroglycerine heart disease)
- D2 = (terrorist attack explosive semtex attack nitroglycerine proximity fuse)
Vector Space Model: a toy example
- Then the dictionary is (alphabetically sorted):
- arrest, attack, cardiac, disease, explosive, fuse, heart, medicine, nitroglycerine, prevention, proximity, semtex, terrorist
- The vectors of Q, D1 and D2 are as follows:
- Q  = <0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0>
- D1 = <1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0>
- D2 = <0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1>
- Each component of a vector is the term weight TW(t, V); here, for simplicity, we use the raw term frequency of term t in the vector V.
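The same toy example in code. Building the dictionary from D1 and D2 suffices here, because every query term also occurs in some document:

```python
import math

Q  = "heart attack medicine".split()
D1 = "cardiac arrest prevention medicine nitroglycerine heart disease".split()
D2 = "terrorist attack explosive semtex attack nitroglycerine proximity fuse".split()

vocab = sorted(set(D1) | set(D2))   # the 13-term dictionary above

def tf_vector(tokens):
    return [tokens.count(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

q, d1, d2 = tf_vector(Q), tf_vector(D1), tf_vector(D2)
print(q)                              # [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
print(cosine(q, d1), cosine(q, d2))   # ~0.436 vs ~0.365: D1 ranks higher
```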
Efficient Implementations of VSM (1)
- Exploit sparseness
- Only compute non-zero multiplies in dot products
- Do not even look at zero elements (how?)
- => Use non-stop terms to index documents
Efficient Implementations of VSM (2)
- Inverted indexing
- Find all unique stemmed terms in the document collection
- Remove stopwords from the word list
- If the collection is large (over 100,000 documents), optionally remove singletons
- (usually spelling errors or obscure names)
- Alphabetize, or use a hash table to store the list
- For each term, create a data structure like the following
Efficient Implementations of VSM (3)
- term_i: [IDF(term_i),
-   <doc_i, freq(term_i, doc_i)>,
-   <doc_j, freq(term_i, doc_j)>,
-   ...]
- or, augmented with term positions:
- term_i: [IDF(term_i),
-   <doc_i, freq(term_i, doc_i), pos_1,i, pos_2,i, ...>,
-   <doc_j, freq(term_i, doc_j), pos_1,j, pos_2,j, ...>,
-   ...]
- pos_1,j denotes the first position of the term in document j, and so on, as sketched below.
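A minimal positional inverted index along these lines (a sketch; the names are my own, freq(term, doc) is the length of the position list, and IDF can be derived from the number of postings per term):

```python
from collections import defaultdict

def build_inverted_index(docs, stopwords=frozenset()):
    """docs: {doc_id: list of (stemmed) terms}.
    Returns {term: {doc_id: [positions]}}; freq(term, doc) = len(positions)."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            if term not in stopwords:
                index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_inverted_index({
    "D1": "cardiac arrest prevention medicine nitroglycerine heart disease".split(),
    "D2": "terrorist attack explosive semtex attack nitroglycerine proximity fuse".split(),
})
print(index["nitroglycerine"])  # {'D1': [4], 'D2': [5]}
print(index["attack"])          # {'D2': [1, 4]}
```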
Open Research Problems in IR (1)
- Beyond VSM
- Vectors in different spaces:
- Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (language modeling):
- P(D|Q) = P(Q|D)P(D)/P(Q)
Open Research Problems in IR (2)
- Beyond relevance
- Appropriateness of a document to the user (comprehension level, etc.)
- Novelty of the information in a document to the user
- (anti-redundancy as an approximation to novelty)
Open Research Problems in IR (3)
- Beyond one Language
- Translingual IR
- Transmedia IR
Open Research Problems in IR (4)
- Beyond content queries:
- "What's new today?"
- "What sort of things do you know about?"
- "Build me a Yahoo-style index for X"
- "Track the event in this news story"