Under the Hood, Part I: Web-Based Information Architectures

Slides: 28
Provided by: cjin

Transcript and Presenter's Notes

1
Under the Hood, Part I: Web-Based Information Architectures
  • MSEC 20-760 Mini II
  • 28-October-2003
  • Jaime Carbonell

2
Topics Covered
  • The Vector Space Model for IR (VSM)
  • Evaluation Metrics for IR
  • Query Expansion (the Rocchio Method)
  • Inverted Indexing for Efficiency
  • A Glimpse into Harder Problems

3
The Vector Space Model
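  • Definitions of document and query vectors, where
    $w_j$ = the $j$th word, and $c(w_j, d_i)$ = the number of
    occurrences of $w_j$ in document $d_i$
  • $d_i = [c(w_1, d_i), c(w_2, d_i), \ldots, c(w_n, d_i)]$
  • $q = [c(w_1, q), c(w_2, q), \ldots, c(w_n, q)]$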
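
As a rough illustration of these definitions, the sketch below builds count vectors over a fixed, ordered vocabulary. The function name `count_vector`, the whitespace tokenizer, and the four-word vocabulary are assumptions made for the example, not part of the slides.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return [c(w_1, d), ..., c(w_n, d)] for a fixed, ordered vocabulary."""
    # Assumption: lowercase + whitespace splitting stands in for real tokenization.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocabulary = ["heart", "attack", "medicine", "disease"]
d1 = count_vector("heart disease and heart attack", vocabulary)
q = count_vector("heart medicine", vocabulary)
print(d1, q)
```

Words outside the vocabulary (here "and") are simply ignored, and the query is vectorized with the same function as the documents.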

4
Computing the Similarity
  • Dot-product similarity: $sim(q, d_i) = q \cdot d_i$
  • Cosine similarity: $sim(q, d_i) = \frac{q \cdot d_i}{\|q\|_2 \, \|d_i\|_2}$

5
Computing Norms and Products
  • Dot product: $x \cdot y = \sum_j x_j y_j$
  • Euclidean vector norm (aka 2-norm): $\|x\|_2 = \sqrt{\sum_j x_j^2}$
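
A minimal sketch of the dot product, the Euclidean norm, and the cosine similarity built from them. The function names and the two example vectors are illustrative assumptions; returning 0.0 for an all-zero vector is also a choice made here, not something the slides specify.

```python
import math

def dot(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean (2-)norm: the square root of the vector's dot product with itself."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Dot product divided by the product of the norms; 0.0 if either vector is all zeros."""
    denom = norm(x) * norm(y)
    return dot(x, y) / denom if denom else 0.0

q = [1, 0, 1, 0]
d = [2, 1, 0, 1]
print(dot(q, d), cosine(q, d))
```

Unlike the raw dot product, the cosine does not grow just because a document is longer, since it divides by both vector lengths.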

6
Similarity in Retrieval
  • Similarity ranking
  • If $sim(q, d_i) > sim(q, d_j)$, $d_i$ ranks higher
  • Retrieving top k documents
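
The ranking step can be sketched as follows, assuming cosine similarity as the `sim` function and a small in-memory dictionary of document vectors. The names `top_k` and `docs` and the example vectors are hypothetical.

```python
import heapq
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / denom if denom else 0.0

def top_k(query, docs, k):
    """Return the k (doc_id, score) pairs with the highest similarity, best first."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

docs = {"d1": [2, 1, 0, 1], "d2": [0, 0, 3, 0], "d3": [0, 2, 0, 0]}
ranked = top_k([1, 0, 1, 0], docs, k=2)
print(ranked)
```

`heapq.nlargest` avoids sorting the whole collection when only the top few documents are needed.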

7
Refinements to VSM (1)
  • Word normalization
  • Words in morphological root form
  • countries → country
  • interesting → interest
  • Stemming as a fast approximation
  • countries, country → countr
  • moped → mop
  • Reduces vocabulary (always good)
  • Generalizes matching (usually good)
  • More useful for non-English IR
  • (Arabic has > 100 variants per verb)
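
To make the stemming examples concrete, here is a deliberately crude suffix stripper. It is a toy written for this illustration (the suffix list, its order, and the `min_stem` threshold are all assumptions) and is not the Porter stemmer or any other published algorithm.

```python
# Toy suffix list, checked in this order; an assumption for illustration only.
SUFFIXES = ("ies", "ing", "ed", "y", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["countries", "country", "interesting", "moped"]:
    print(w, "->", toy_stem(w))
```

The last case shows the cost of the approximation: "moped" collapses onto "mop", so two unrelated words would match each other.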

8
Refinements to VSM (2)
  • Stop-Word Elimination
  • Discard articles, auxiliaries, prepositions, ...
    typically 100-300 most frequent small words
  • Reduce document length by 30-40%
  • Retrieval accuracy improves slightly (5-10%)
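
Stop-word elimination amounts to a set-membership filter. The sketch below uses a tiny hand-picked stop list purely for illustration; real lists typically hold the 100-300 most frequent small words mentioned above.

```python
# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the role of diet in the prevention of heart disease".split()
kept = remove_stop_words(tokens)
print(kept)
```

In this example half of the ten tokens are dropped, which is in the same spirit as the length reduction quoted on the slide, although a single sentence proves nothing about a collection.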

9
Refinements to VSM (3)
  • Proximity Phrases
  • E.g. "air force" → airforce
  • Found by high mutual information
  • $p(w_1 w_2) \gg p(w_1)\,p(w_2)$
  • $p(w_1, w_2 \text{ in } k\text{-window}) \gg p(w_1 \text{ in } k\text{-window})\,p(w_2 \text{ in same } k\text{-window})$
  • Retrieval accuracy improves slightly (5-10%)
  • Too many phrases → inefficiency
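
One way to operationalize the first inequality is pointwise mutual information over adjacent word pairs, sketched below. The function name `phrase_candidates`, the `min_count` cutoff, and the toy token stream are assumptions; the cutoff is included because mutual information tends to overrate pairs seen only once.

```python
import math
from collections import Counter

def phrase_candidates(tokens, min_count=2):
    """Score adjacent pairs by log2(p(w1 w2) / (p(w1) * p(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores

tokens = "the air force said the air force base is near the base".split()
scores = phrase_candidates(tokens)
best = max(scores, key=scores.get)
print(best, scores)
```

A positive score means the pair co-occurs more often than independence would predict; the k-window variant would count co-occurrences within k words instead of strict adjacency.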

10
Refinements to VSM (4)
  • Words → Terms
  • term = word | stemmed word | phrase
  • Use exactly the same VSM method on terms (vs.
    words)

11
Evaluating Information Retrieval (1)
Recall = $a/(a+c)$: fraction of relevant documents that are retrieved
Precision = $a/(a+b)$: fraction of retrieved documents that are relevant
  • Contingency table

                       Relevant   Not relevant
      Retrieved           a            b
      Not retrieved       c            d

12
Evaluating Information Retrieval (2)
  • $P = a/(a+b)$, $R = a/(a+c)$
  • Accuracy = $(a+d)/(a+b+c+d)$
  • $F_1 = 2PR/(P+R)$
  • Miss = $c/(a+c) = 1 - R$
  • (false negatives)
  • F/A = $b/(a+b+c+d)$
  • (false positives)
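
The formulas on this slide translate directly into code. In the sketch below, `a`, `b`, `c`, `d` are the four contingency-table counts; the function name, the dictionary keys, and the example counts are made up for illustration, and the code assumes no denominator is zero.

```python
def ir_metrics(a, b, c, d):
    """Compute the slide's metrics from contingency counts.

    a: retrieved and relevant      b: retrieved, not relevant
    c: relevant, not retrieved     d: neither retrieved nor relevant
    """
    precision = a / (a + b)
    recall = a / (a + c)
    total = a + b + c + d
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (a + d) / total,
        "f1": 2 * precision * recall / (precision + recall),
        "miss": c / (a + c),
        "false_alarm": b / total,
    }

m = ir_metrics(a=30, b=10, c=20, d=940)
print(m)
```

With these made-up counts accuracy is very high even though recall is modest, because `d` (irrelevant documents correctly left out) dominates the total.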

13
Evaluating Information Retrieval (3)
  • 11-point precision curves
  • IR system generates total ranking
  • Plot precision at 10%, 20%, 30%, ... recall
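
A sketch of how such a curve might be computed for a single query. It assumes the common interpolation convention (precision at a recall level is the best precision achieved at that recall or higher) and the eleven levels 0%, 10%, ..., 100%; the slide does not spell out either detail, and the ranking and relevance set are invented.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one ranked list."""
    hits = 0
    points = []  # (recall, precision) measured each time a relevant doc is found
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(11):
        level = i / 10
        # Interpolation: best precision at any recall >= this level (0.0 if none).
        candidates = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(candidates, default=0.0))
    return curve

ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1"}
curve = eleven_point_precision(ranking, relevant)
print(curve)
```

Averaging these eleven values over many queries gives a single curve per system, which is how such plots are usually compared.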

14
Query Expansion (1)
  • Observations
  • Longer queries often yield better results
  • Users' vocabulary may differ from document vocabulary
  • Q: "how to avoid heart disease"
  • D: "Factors in minimizing stroke and cardiac
    arrest: Recommended dietary and exercise
    regimens"
  • Longer queries may offer more chances to match
    relevant documents, which helps recall.

15
Query Expansion (2)
  • Bridging the Gap
  • Human query expansion (user or expert)
  • Thesaurus-based expansion
  • Seldom works in practice (unfocused)
  • Relevance feedback
  • Widen a thin bridge over vocabulary gap
  • Adds words from document space to query
  • Pseudo-Relevance feedback
  • Local Context analysis

16
Relevance Feedback: Rocchio's Method
  • Idea: update the query via user feedback
  • Exact method (vector sums)
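
The formula itself did not survive in the transcript. A common way to write the vector-sum update is sketched below; the symbols $Q'$, $D_{rel}$, $D_{irr}$ and the weights $\alpha$, $\beta$, $\gamma$ are notation assumed here, and some formulations average over each document set rather than summing.

```latex
Q' = \alpha\, Q \;+\; \beta \sum_{D_i \in D_{rel}} D_i \;-\; \gamma \sum_{D_j \in D_{irr}} D_j
```

Here $D_{rel}$ is the set of documents the user marked relevant and $D_{irr}$ the set marked irrelevant, so terms from relevant documents are pulled into the query and terms from irrelevant ones are pushed out.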

17
Relevance Feedback (2)
  • For example, if
  • Q = (heart attack medicine)
  • W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
  • Drel = (cardiac arrest prevention medicine
    nitroglycerine heart disease ...)
  • W(nitroglycerine,D) = 2, W(medicine,D) = 1
  • Dirr = (terrorist attack explosive semtex attack
    nitroglycerine proximity fuse ...)
  • W(attack,D) = 1, W(nitroglycerine,D) = 2,
    W(explosive,D) = 1
  • and $\alpha = 1$, $\beta = 2$, $\gamma = 0.5$

18
Relevance Feedback (3)
  • Then
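  • W(attack,Q) = $1 \cdot 1 - 0.5 \cdot 1 = 0.5$
  • W(nitroglycerine,Q) = $2 \cdot 2 - 0.5 \cdot 2 = 3$
  • W(medicine,Q) = $1 \cdot 1 + 2 \cdot 1 = 3$
  • W(explosive,Q) = $-0.5 \cdot 1 = -0.5$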
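
The worked example can be checked with a few lines of code. The sketch assumes the vector-sum form of the update (new query = α·Q + β·sum of relevant documents − γ·sum of irrelevant documents) and uses only the term weights that the example states explicitly; any weight not given, such as that of "heart" in the relevant document, is treated as 0, which is an assumption. The function name `rocchio` and the dictionary representation are choices made here.

```python
from collections import defaultdict

def rocchio(query, relevant, irrelevant, alpha, beta, gamma):
    """Return alpha*Q + beta*sum(relevant docs) - gamma*sum(irrelevant docs)."""
    updated = defaultdict(float)
    for term, w in query.items():
        updated[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w
    for doc in irrelevant:
        for term, w in doc.items():
            updated[term] -= gamma * w
    return dict(updated)

q = {"heart": 1, "attack": 1, "medicine": 1}
d_rel = {"nitroglycerine": 2, "medicine": 1}  # only the weights stated in the example
d_irr = {"attack": 1, "nitroglycerine": 2, "explosive": 1}
new_q = rocchio(q, [d_rel], [d_irr], alpha=1, beta=2, gamma=0.5)
print(new_q)
```

The ambiguous word "attack" is down-weighted and "explosive" ends up negative, while "medicine" and "nitroglycerine" gain weight from the relevant document.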

19
Term Weighting Methods (1)
  • Salton's Tf*IDf
  • Tf = term frequency in a document
  • Df = document frequency of term
    = number of documents in the collection
    with this term
  • IDf = $Df^{-1}$

20
Term Weighting Methods (2)
  • Salton's Tf*IDf
  • Tf*IDf = $f_1(Tf) \cdot f_2(IDf)$
  • E.g. $f_1(Tf) = Tf \cdot \mathrm{ave}(|D_j|) / |D|$
  • E.g. $f_2(IDf) = \log_2(IDf)$
  • f1 and f2 can differ for Q and D
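
A minimal sketch of one Tf*IDf variant. It makes two assumptions that go beyond the slides: $f_1$ is the raw term count, and $f_2$ is $\log_2(N/Df)$ with $N$ the number of documents, i.e. the inverse document frequency scaled by collection size so that the logarithm is never negative. The function name and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict doc_id -> list of terms. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Df: number of documents that contain each term at least once.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Assumed weighting: raw Tf times log2(N / Df).
        vectors[doc_id] = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
    return vectors

docs = {
    "d1": ["heart", "attack", "heart"],
    "d2": ["heart", "medicine"],
    "d3": ["terrorist", "attack"],
    "d4": ["heart", "disease"],
}
vectors = tf_idf_vectors(docs)
print(vectors["d1"])
```

"heart" occurs in three of the four documents, so even with a count of 2 in d1 it gets less weight than the rarer "attack".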

21
Efficient Implementations of VSM (1)
  • Exploit sparseness
  • Only compute non-zero multiplies in dot-products
  • Do not even look at zero elements (how?)
  • → Use non-stop terms to index documents
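
One way to never look at zero elements is to store each vector as a dictionary holding only its non-zero terms. The sketch below is an illustration under that assumption; the name `sparse_dot` and the example weights are made up.

```python
def sparse_dot(x, y):
    """Dot product of two sparse vectors stored as {term: weight} dictionaries."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector, look up in the longer one
    return sum(w * y[t] for t, w in x.items() if t in y)

q = {"heart": 1, "medicine": 1}
d = {"heart": 2, "attack": 1, "disease": 1}
print(sparse_dot(q, d))
```

The cost depends on the number of non-zero entries in the shorter vector, typically the query, rather than on vocabulary size.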

22
Efficient Implementations of VSM (2)
  • Inverted Indexing
  • Find all unique stemmed terms in document
    collection
  • Remove stopwords from word list
  • If the collection is large (over 100,000 documents),
    optionally remove singletons
  • Usually spelling errors or obscure names
  • Alphabetize or use a hash table to store the list
  • For each term, create a data structure like:

23
Efficient Implementations of VSM (3)
  • term_i: IDF(term_i),
    <doc_i, freq(term_i, doc_i);
     doc_j, freq(term_i, doc_j);
     ...>
  • or
  • term_i: IDF(term_i),
    <doc_i, freq(term_i, doc_i), pos_{1,i}, pos_{2,i}, ...;
     doc_j, freq(term_i, doc_j), pos_{1,j}, pos_{2,j}, ...;
     ...>
  • pos_{1,j} indicates the first position of the term in
    document j, and so on.
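
The positional variant of this structure might look like the following in code. This is a sketch under simplifying assumptions: documents are already stemmed and stop-word filtered, the IDF field is omitted, term frequency is recovered as the length of the position list, and scoring uses raw counts. The names `build_inverted_index` and `score` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> list of terms. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def score(query_terms, index):
    """Raw-count dot product, touching only the postings of the query terms."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += len(positions)  # freq(term, doc) = number of positions
    return dict(scores)

docs = {
    "d1": ["heart", "attack", "heart"],
    "d2": ["heart", "medicine"],
    "d3": ["terrorist", "attack"],
}
index = build_inverted_index(docs)
print(index["heart"], score(["heart", "medicine"], index))
```

Documents that share no term with the query (d3 here) are never visited, which is the efficiency gain the inverted index is meant to deliver.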

24
Open Research Problems in IR (1)
  • Beyond VSM
  • Vectors in different Spaces
  • Generalized VSM, Latent Semantic Indexing...
  • Probabilistic IR (Language Modeling)
  • $P(D \mid Q) = P(Q \mid D)\,P(D) / P(Q)$

25
Open Research Problems in IR (2)
  • Beyond Relevance
  • Appropriateness of doc to user (comprehension
    level, etc.)
  • Novelty of information in doc to user
    (anti-redundancy as an approximation to novelty)

26
Open Research Problems in IR (3)
  • Beyond one Language
  • Translingual IR
  • Transmedia IR

27
Open Research Problems in IR (4)
  • Beyond Content Queries
  • "What's new today?"
  • "What sort of things do you know about?"
  • "Build me a Yahoo-style index for X"
  • "Track the event in this news-story"