Under The Hood Part I: Web-Based Information Architectures
- MSEC 20-760, Mini II
- Jaime Carbonell
Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems
The Vector Space Model (1)
- Let S = {w1, w2, ..., wn} be the set of index terms (the vocabulary)
- Let Dj = <c(w1, Dj), c(w2, Dj), ..., c(wn, Dj)>, where c(wi, Dj) is the count of wi in document Dj
- Let Q = <c(w1, Q), c(w2, Q), ..., c(wn, Q)>
The Vector Space Model (2)
- Initial definition of similarity:
- SI(Q, Dj) = Q . Dj
- Normalized definition of similarity:
- SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)
- = cos(Q, Dj)
The Vector Space Model (3)
- Relevance ranking:
- If SN(Q, Di) > SN(Q, Dj),
- then Di is more relevant to Q than Dj
- Retrieve(k, Q, {Dj}) = Argmax_k cos(Q, Dj)
- over Dj in {Dj}
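A minimal sketch of this model in Python, using raw term counts and whitespace tokenization (the function names are my own, not from the slides):

```python
import math
from collections import Counter

def vectorize(tokens):
    """Raw term-frequency vector c(w, .): term -> count."""
    return Counter(tokens)

def cosine(q, d):
    """Normalized similarity SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)."""
    dot = sum(count * d.get(term, 0) for term, count in q.items())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def retrieve(k, query, docs):
    """Retrieve(k, Q, {Dj}): the k documents maximizing cos(Q, Dj)."""
    q = vectorize(query.split())
    scored = {name: cosine(q, vectorize(text.split()))
              for name, text in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```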
Refinements to VSM (1)
- Word normalization
- Map words to their morphological root form:
- countries -> country
- interesting -> interest
- Stemming as a fast approximation:
- countries, country -> countr
- moped -> mop
- Reduces vocabulary size (always good)
- Generalizes matching (usually good)
- More useful for non-English IR
- (Arabic has > 100 variants per verb)
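For illustration, a quick check with NLTK's Porter stemmer (this assumes the nltk package is installed; Porter produces "countri" rather than the slide's "countr", but the effect is the same: both surface forms collapse to one index term):

```python
# Assumes the nltk package is installed (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["countries", "country", "interesting", "interest"]:
    print(word, "->", stemmer.stem(word))
# countries -> countri    country -> countri    (shared index term)
# interesting -> interest interest -> interest
```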
Refinements to VSM (2)
- Stop-word elimination
- Discard articles, auxiliaries, prepositions, ...
- Typically the 100-300 most frequent small words
- Reduces document length by 30-40%
- Retrieval accuracy improves slightly (5-10%)
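A toy illustration with a tiny stop list (the list below is my own abbreviation; real systems use the 100-300 most frequent function words, as noted above):

```python
# A tiny illustrative stop list; real systems use a much longer one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("how to avoid the risks of heart disease".split()))
# ['how', 'avoid', 'risks', 'heart', 'disease']
```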
Refinements to VSM (3)
- Proximity phrases
- E.g. "air force" -> airforce
- Found by high mutual information:
- p(w1 w2) >> p(w1)p(w2)
- p(w1, w2 in k-window) >>
- p(w1 in k-window) p(w2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases -> inefficiency
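A sketch of the mutual-information test above for adjacent word pairs, using maximum-likelihood estimates over a token stream (the function name is my own):

```python
import math
from collections import Counter

def adjacent_pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2( p(w1 w2) / (p(w1) p(w2)) ), estimated by maximum likelihood."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    p12 = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p12 / (p1 * p2)) if p12 else float("-inf")
```

A high score, i.e. p(w1 w2) >> p(w1)p(w2), flags candidate phrases such as "air force".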
Refinements to VSM (4)
- Words -> Terms
- term = word | stemmed word | phrase
- Use exactly the same VSM method on terms (vs. words)
Evaluating Information Retrieval (1)
- Contingency table of retrieval outcomes:
- a = relevant and retrieved, b = irrelevant and retrieved
- c = relevant and not retrieved, d = irrelevant and not retrieved
Evaluating Information Retrieval (2)
- P = a/(a+b), R = a/(a+c)
- Accuracy = (a+d)/(a+b+c+d)
- F1 = 2PR/(P+R)
- Miss = c/(a+c) = 1 - R
- (false negatives)
- F/A = b/(a+b+c+d)
- (false positives)
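The metrics above, computed directly from the four contingency-table counts (a sketch; the function name is my own, and F/A follows the slide's definition):

```python
def ir_metrics(a, b, c, d):
    """a: relevant & retrieved, b: irrelevant & retrieved,
    c: relevant & not retrieved, d: irrelevant & not retrieved."""
    p = a / (a + b)                          # precision
    r = a / (a + c)                          # recall
    return {
        "precision": p,
        "recall": r,
        "accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * p * r / (p + r),
        "miss": c / (a + c),                 # = 1 - R (false negatives)
        "false_alarm": b / (a + b + c + d),  # as defined on the slide
    }
```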
Evaluating Information Retrieval (3)
- 11-point precision curves
- IR system generates a total ranking
- Plot precision at 0%, 10%, 20%, ..., 100% recall
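One common way to compute the curve uses interpolated precision (the maximum precision at any point with recall at or above the target level); that convention is my assumption, since the slide does not fix one:

```python
def eleven_point_curve(ranked_relevance, total_relevant):
    """ranked_relevance: 0/1 relevance flags in system rank order.
    Returns interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in (i / 10 for i in range(11))]
```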
Query Expansion (1)
- Observations:
- Longer queries often yield better results
- Users' vocabulary may differ from document vocabulary
- Q = "how to avoid heart disease"
- D = "Factors in minimizing stroke and cardiac arrest: recommended dietary and exercise regimens"
- Maybe longer queries have more chances to help recall.
Query Expansion (2)
- Bridging the gap:
- Human query expansion (user or expert)
- Thesaurus-based expansion
- Seldom works in practice (unfocused)
- Relevance feedback
- Widens a thin bridge over the vocabulary gap
- Adds words from document space to the query
- Pseudo-relevance feedback
- Local context analysis
Relevance Feedback (1)
- Rocchio formula
- Q' = F(Q, Dret)
- F = weighted vector sum, such as:
- W(t, Q') = aW(t, Q) + bW(t, Drel) - cW(t, Dirr)
Relevance Feedback (2)
- For example, if
- Q = (heart attack medicine)
- W(heart, Q) = W(attack, Q) = W(medicine, Q) = 1
- Drel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
- W(nitroglycerine, D) = 2, W(medicine, D) = 1
- Dirr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
- W(attack, D) = 1, W(nitroglycerine, D) = 2,
- W(explosive, D) = 1
- and a = 1, b = 2, c = 0.5
Relevance Feedback (3)
- Then:
- W(attack, Q') = 1x1 - 0.5x1 = 0.5
- W(nitroglycerine, Q') = 2x2 - 0.5x2 = 3
- W(medicine, Q') = 1x1 + 2x1 = 3
- W(explosive, Q') = -0.5x1 = -0.5
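A minimal sketch of the Rocchio update on term-weight dictionaries, reproducing the numbers above (names are my own):

```python
def rocchio(q, d_rel, d_irr, a=1.0, b=2.0, c=0.5):
    """W(t, Q') = a*W(t, Q) + b*W(t, Drel) - c*W(t, Dirr)."""
    terms = set(q) | set(d_rel) | set(d_irr)
    return {t: a * q.get(t, 0) + b * d_rel.get(t, 0) - c * d_irr.get(t, 0)
            for t in terms}

# Term weights from the example above:
q     = {"heart": 1, "attack": 1, "medicine": 1}
d_rel = {"cardiac": 1, "arrest": 1, "prevention": 1, "medicine": 1,
         "nitroglycerine": 2, "heart": 1, "disease": 1}
d_irr = {"terrorist": 1, "attack": 1, "explosive": 1, "semtex": 1,
         "nitroglycerine": 2, "proximity": 1, "fuse": 1}
new_q = rocchio(q, d_rel, d_irr)
print(new_q["attack"], new_q["nitroglycerine"],
      new_q["medicine"], new_q["explosive"])
# 0.5 3.0 3.0 -0.5
```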
Term Weighting Methods (1)
- Salton's TfIDf
- Tf = term frequency in a document
- Df = document frequency of a term
- = number of documents in the collection
- containing this term
- IDf = Df^-1 (inverse document frequency)
Term Weighting Methods (2)
- Salton's TfIDf
- TfIDf = f1(Tf) x f2(IDf)
- E.g. f1(Tf) = Tf x ave(|Dj|) / |D|
- E.g. f2(IDf) = log2(IDf)
- f1 and f2 can differ for Q and D
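A sketch of one concrete instantiation. Since the slide writes IDf = Df^-1 schematically, I scale it here by the collection size N (IDf = N/Df, a common convention) so the logarithm stays non-negative:

```python
import math

def tf_idf(tf, df, n_docs, doc_len, avg_doc_len):
    """TfIDf = f1(Tf) * f2(IDf), with the example choices above."""
    f1 = tf * avg_doc_len / doc_len   # length-normalized term frequency
    f2 = math.log2(n_docs / df)       # log2 of the (scaled) inverse Df
    return f1 * f2
```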
Vector Space Model: a toy example
- Suppose your document collection contains only 2 documents
- Q = (heart attack medicine)
- D1 = (cardiac arrest prevention medicine nitroglycerine heart disease)
- D2 = (terrorist attack explosive semtex attack nitroglycerine proximity fuse)
Vector Space Model: a toy example
- Then the dictionary is (alphabetically sorted):
- arrest, attack, cardiac, disease, explosive, fuse, heart, medicine, nitroglycerine, prevention, proximity, semtex, terrorist
- The vectors of Q, D1 and D2 are as follows:
- Q  = <0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0>
- D1 = <1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0>
- D2 = <0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1>
- Each component of a vector is the term weight TW(t, V); here, for simplicity, we use the raw term frequency of term t in the vector V.
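The same toy example in code. Building the dictionary from D1 and D2 suffices here, because every query term also occurs in some document:

```python
import math

Q  = "heart attack medicine".split()
D1 = "cardiac arrest prevention medicine nitroglycerine heart disease".split()
D2 = "terrorist attack explosive semtex attack nitroglycerine proximity fuse".split()

vocab = sorted(set(D1) | set(D2))   # the 13-term dictionary above

def tf_vector(tokens):
    return [tokens.count(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

q, d1, d2 = tf_vector(Q), tf_vector(D1), tf_vector(D2)
print(q)                              # [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
print(cosine(q, d1), cosine(q, d2))   # ~0.436 vs ~0.365: D1 ranks higher
```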
Efficient Implementations of VSM (1)
- Exploit sparseness
- Only compute non-zero multiplies in dot products
- Do not even look at zero elements (how?)
- => Use non-stop terms to index documents
Efficient Implementations of VSM (2)
- Inverted indexing
- Find all unique stemmed terms in the document collection
- Remove stopwords from the word list
- If the collection is large (over 100,000 documents), optionally remove singletons
- (usually spelling errors or obscure names)
- Alphabetize, or use a hash table to store the list
- For each term, create a data structure like the following
Efficient Implementations of VSM (3)
- term_i: [IDF(term_i),
-   <doc_i, freq(term_i, doc_i)>,
-   <doc_j, freq(term_i, doc_j)>,
-   ...]
- or, augmented with term positions:
- term_i: [IDF(term_i),
-   <doc_i, freq(term_i, doc_i), pos_1,i, pos_2,i, ...>,
-   <doc_j, freq(term_i, doc_j), pos_1,j, pos_2,j, ...>,
-   ...]
- pos_1,j denotes the first position of the term in document j, and so on, as sketched below.
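A minimal positional inverted index along these lines (a sketch; the names are my own, freq(term, doc) is the length of the position list, and IDF can be derived from the number of postings per term):

```python
from collections import defaultdict

def build_inverted_index(docs, stopwords=frozenset()):
    """docs: {doc_id: list of (stemmed) terms}.
    Returns {term: {doc_id: [positions]}}; freq(term, doc) = len(positions)."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            if term not in stopwords:
                index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_inverted_index({
    "D1": "cardiac arrest prevention medicine nitroglycerine heart disease".split(),
    "D2": "terrorist attack explosive semtex attack nitroglycerine proximity fuse".split(),
})
print(index["nitroglycerine"])  # {'D1': [4], 'D2': [5]}
print(index["attack"])          # {'D2': [1, 4]}
```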
Open Research Problems in IR (1)
- Beyond VSM
- Vectors in different spaces:
- Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (language modeling):
- P(D|Q) = P(Q|D)P(D)/P(Q)
Open Research Problems in IR (2)
- Beyond relevance
- Appropriateness of a document to the user (comprehension level, etc.)
- Novelty of the information in a document to the user
- (anti-redundancy as an approximation to novelty)
Open Research Problems in IR (3)
- Beyond one Language
- Translingual IR
- Transmedia IR
Open Research Problems in IR (4)
- Beyond content queries:
- "What's new today?"
- "What sort of things do you know about?"
- "Build me a Yahoo-style index for X"
- "Track the event in this news story"