Under The Hood, Part I: Web-Based Information Architectures - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Under The Hood, Part I: Web-Based Information Architectures
  • MSEC 20-760, Mini II, Jaime Carbonell

2
Topics Covered
  • The Vector Space Model for IR (VSM)
  • Evaluation Metrics for IR
  • Query Expansion (the Rocchio Method)
  • Inverted Indexing for Efficiency
  • A Glimpse into Harder Problems

3
The Vector Space Model (1)
  • Let S = {w1, w2, ... wn}
  • Let Dj = <c(w1, Dj), c(w2, Dj), ... c(wn, Dj)>
  • Let Q = <c(w1, Q), c(w2, Q), ... c(wn, Q)>

4
The Vector Space Model (2)
  • Initial Definition of Similarity
  • SI(Q, Dj) = Q . Dj
  • Normalized Definition of Similarity
  • SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)
  •           = cos(Q, Dj)
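Both similarity definitions can be sketched directly; a minimal Python version over dense term-count vectors (function names are mine, not from the slides):

```python
import math

def dot(q, d):
    """Inner product Q . D of two equal-length term-count vectors."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Normalized similarity SN(Q, D) = (Q . D) / (|Q| x |D|)."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0
```

Normalization keeps long documents from dominating the ranking purely by length.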

5
The Vector Space Model (3)
  • Relevance Ranking
  • If SN(Q, Di) > SN(Q, Dj)
  • Then Di is more relevant than Dj to Q
  • Retrieve(k, Q, {Dj}) =
  •   Argmax-k [cos(Q, Dj)],
  •   Dj in {Dj}

6
Refinements to VSM (1)
  • Word normalization
  • Words in morphological root form
  • countries => country
  • interesting => interest
  • Stemming as a fast approximation
  • countries, country => countr
  • moped => mop
  • Reduces vocabulary (always good)
  • Generalizes matching (usually good)
  • More useful for non-English IR
  • (Arabic has > 100 variants per verb)
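The conflations above, including the over-stemming of "moped", can be reproduced with a toy suffix stripper; the rules are my own illustration, far cruder than real stemmers such as Porter's:

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix, provided
    enough of the word remains. Illustrative only."""
    for suffix in ("ies", "ing", "ed", "s", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Note how "moped" => "mop" falls out of the blind "ed" rule, exactly the kind of over-generalization the slide warns about.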

7
Refinements to VSM (2)
  • Stop-Word Elimination
  • Discard articles, auxiliaries, prepositions, ...
    typically 100-300 most frequent small words
  • Reduces document length by 30-40%
  • Retrieval accuracy improves slightly (5-10%)
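A sketch of stop-word elimination with a tiny illustrative stop list (real systems drop the 100-300 most frequent small words, as noted above):

```python
# Tiny illustrative stop list, not a production one.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "how"}

def remove_stop_words(tokens):
    """Drop stop words, keeping content terms in order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```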

8
Refinements to VSM (3)
  • Proximity Phrases
  • E.g. "air force" => airforce
  • Found by high mutual information
  • p(w1 w2) >> p(w1) p(w2)
  • p(w1 w2 in k-window) >>
  •   p(w1 in k-window) p(w2 in same k-window)
  • Retrieval accuracy improves slightly (5-10%)
  • Too many phrases => inefficiency
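The mutual-information test above can be sketched as a pointwise MI score computed from corpus counts (function name and the example counts are illustrative):

```python
import math

def pmi(count_w1, count_w2, count_pair, total):
    """Pointwise mutual information log2(p(w1 w2) / (p(w1) p(w2))).
    A large positive score suggests w1 w2 behaves as a phrase."""
    p1, p2 = count_w1 / total, count_w2 / total
    p12 = count_pair / total
    return math.log2(p12 / (p1 * p2))
```

If "air" and "force" each occur 10 times in 1000 tokens and always together, PMI is log2(100), far above the 0 expected for independent words.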

9
Refinements to VSM (4)
  • Words gt Terms
  • term = word | stemmed word | phrase
  • Use exactly the same VSM method on terms (vs
    words)

10
Evaluating Information Retrieval (1)
  • Contingency table

                    relevant    not relevant
    retrieved           a            b
    not retrieved       c            d

11
Evaluating Information Retrieval (2)
  • P = a/(a+b)    R = a/(a+c)
  • Accuracy = (a+d)/(a+b+c+d)
  • F1 = 2PR/(P+R)
  • Miss = c/(a+c) = 1 - R
  •   (false negative)
  • F/A = b/(a+b+c+d)
  •   (false positive)
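A minimal sketch computing these measures from the contingency counts a, b, c, d:

```python
def ir_metrics(a, b, c, d):
    """a = relevant & retrieved, b = irrelevant & retrieved,
    c = relevant & missed,      d = irrelevant & not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    miss = c / (a + c)                 # false negative rate, 1 - R
    false_alarm = b / (a + b + c + d)  # false positive rate, per the slide
    return precision, recall, f1, accuracy, miss, false_alarm
```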

12
Evaluating Information Retrieval (3)
  • 11-point precision curves
  • IR system generates total ranking
  • Plot precision at 10%, 20%, 30%, ..., 100% recall

13
Query Expansion (1)
  • Observations
  • Longer queries often yield better results
  • Users' vocabulary may differ from document
    vocabulary
  • Q: how to avoid heart disease
  • D: "Factors in minimizing stroke and cardiac
    arrest: Recommended dietary and exercise
    regimens"
  • Longer queries give more chances to match
    relevant documents, helping recall.

14
Query Expansion (2)
  • Bridging the Gap
  • Human query expansion (user or expert)
  • Thesaurus-based expansion
  • Seldom works in practice (unfocused)
  • Relevance feedback
  • Widen a thin bridge over vocabulary gap
  • Adds words from document space to query
  • Pseudo-Relevance feedback
  • Local Context analysis

15
Relevance Feedback (1)
  • Rocchio Formula
  • Q' = F[Q, Dret]
  • F = weighted vector sum, such as
  • W(t, Q') =
  •   αW(t, Q) + βW(t, Drel) - γW(t, Dirr)

16
Relevance Feedback (2)
  • For example, if
  • Q = (heart attack medicine)
  • W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
  • Drel = (cardiac arrest prevention medicine
    nitroglycerine heart disease ...)
  • W(nitroglycerine,Drel) = 2, W(medicine,Drel) = 1
  • Dirr = (terrorist attack explosive semtex attack
    nitroglycerine proximity fuse ...)
  • W(attack,Dirr) = 1, W(nitroglycerine,Dirr) = 2,
  • W(explosive,Dirr) = 1
  • AND α = 1, β = 2, γ = 0.5

17
Relevance Feedback (3)
  • Then
  • W(attack, Q') = 1·1 - 0.5·1 = 0.5
  • W(nitroglycerine, Q') = 2·2 - 0.5·2 = 3
  • W(medicine, Q') = 1·1 + 2·1 = 3
  • W(explosive, Q') = -0.5·1 = -0.5
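The worked example can be reproduced with a small Rocchio sketch over {term: weight} dicts, using the weights and α = 1, β = 2, γ = 0.5 given on the previous slide:

```python
def rocchio(q, d_rel, d_irr, alpha=1.0, beta=2.0, gamma=0.5):
    """Rocchio update W(t,Q') = alpha*W(t,Q) + beta*W(t,Drel)
    - gamma*W(t,Dirr), with vectors as {term: weight} dicts."""
    terms = set(q) | set(d_rel) | set(d_irr)
    return {t: alpha * q.get(t, 0) + beta * d_rel.get(t, 0)
               - gamma * d_irr.get(t, 0)
            for t in terms}

# Non-zero weights from the example above.
q = {"heart": 1, "attack": 1, "medicine": 1}
d_rel = {"nitroglycerine": 2, "medicine": 1}
d_irr = {"attack": 1, "nitroglycerine": 2, "explosive": 1}
new_q = rocchio(q, d_rel, d_irr)
```

"medicine" and "nitroglycerine" are boosted, while "explosive" goes negative, steering the expanded query away from the irrelevant document.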

18
Term Weighting Methods (1)
  • Salton's Tf·IDf
  • Tf = term frequency in a document
  • Df = document frequency of term
  •    = # documents in collection
  •      with this term
  • IDf = Df^-1

19
Term Weighting Methods (2)
  • Salton's Tf·IDf
  • Tf·IDf = f1(Tf) · f2(IDf)
  • E.g. f1(Tf) = Tf · ave(|Dj|) / |D|
  • E.g. f2(IDf) = log2(IDf)
  • f1 and f2 can differ for Q and D
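The slide leaves f1 and f2 as free choices; a minimal sketch of one common instantiation (raw Tf times log2 of an N/Df-style inverse document frequency, which is a standard variant rather than the exact forms listed above):

```python
import math

def tf_idf(tf, df, n_docs):
    """One common Tf-IDf weighting: raw Tf times log2(N/Df).
    Rare terms (low Df) get large weights; ubiquitous terms get 0."""
    return tf * math.log2(n_docs / df)
```

A term occurring in every document scores 0, so it contributes nothing to ranking no matter how frequent it is.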

20
Vector Space Model a toy example
  • Suppose your document collection only contains 2
    documents
  • Q  = (heart attack medicine)
  • D1 = (cardiac arrest prevention medicine
    nitroglycerine heart disease)
  • D2 = (terrorist attack explosive semtex attack
    nitroglycerine proximity fuse)

21
Vector Space Model a toy example
  • Then, the dictionary is (alphabetically sorted)
  • arrest, attack, cardiac, disease, explosive,
    fuse, heart, medicine, nitroglycerine,
    prevention, proximity, semtex, terrorist
  • The vectors of Q, D1 and D2 are as follows
  • Q  = <0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0>
  • D1 = <1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0>
  • D2 = <0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1>
  • Each component of the vector is defined as TW(t,
    V), and here for simplicity, we just use raw TF
    as the term weight of a term in the vector V.
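The dictionary and vectors above can be built mechanically; a small sketch using raw TF as the term weight, as the slide states:

```python
def tf_vector(tokens, dictionary):
    """Raw-TF vector of a token list over the sorted dictionary."""
    return [tokens.count(term) for term in dictionary]

q_toks = "heart attack medicine".split()
d1_toks = ("cardiac arrest prevention medicine "
           "nitroglycerine heart disease").split()
d2_toks = ("terrorist attack explosive semtex attack "
           "nitroglycerine proximity fuse").split()

# Alphabetically sorted dictionary over the whole collection.
dictionary = sorted(set(q_toks) | set(d1_toks) | set(d2_toks))
vq, v1, v2 = (tf_vector(t, dictionary) for t in (q_toks, d1_toks, d2_toks))
```

Note "attack" appears twice in D2, giving the component value 2 in that vector.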

22
Efficient Implementations of VSM (1)
  • Exploit sparseness
  • Only compute non-zero multiplies in dot-products
  • Do not even look at zero elements (how?)
  • => Use non-stop terms to index documents
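A sketch of a sparse dot product over {term: weight} dicts, which touches only terms present in both vectors and iterates over the shorter one:

```python
def sparse_dot(q, d):
    """Q . D over sparse {term: weight} dicts: zero entries are
    never stored, so they are never looked at."""
    if len(d) < len(q):       # iterate over the shorter vector
        q, d = d, q
    return sum(w * d[t] for t, w in q.items() if t in d)
```

Since a query typically has a handful of terms while a document has hundreds, this does only a few multiplies per document.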

23
Efficient Implementations of VSM (2)
  • Inverted Indexing
  • Find all unique stemmed terms in document
    collection
  • Remove stopwords from word list
  • If collection is large (over 100,000 documents),
    optionally remove singletons
  • Usually spelling errors or obscure names
  • Alphabetize or use hash table to store list
  • For each term create data structure like

24
Efficient Implementations of VSM (3)
  • [termi, IDF(termi),
  •   <doci, freq(termi, doci)>,
  •   <docj, freq(termi, docj)>,
  •   ... ]
  • or
  • [termi, IDF(termi),
  •   <doci, freq(termi, doci), pos1,i, pos2,i, ...>,
  •   <docj, freq(termi, docj), pos1,j, pos2,j, ...>,
  •   ... ]
  • pos1,j indicates the first position of termi in
    docj, and so on.
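A minimal sketch of building the positional variant of this structure (IDF omitted for brevity; the two-document collection is illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}. The per-document
    frequency freq(term, doc) is just len(positions)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical toy collection.
idx = build_positional_index({"d1": ["heart", "attack"],
                              "d2": ["attack", "attack", "fuse"]})
```

Storing positions is what makes proximity-phrase matching (slide 8) possible at query time.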

25
Open Research Problems in IR (1)
  • Beyond VSM
  • Vectors in different Spaces
  • Generalized VSM, Latent Semantic Indexing...
  • Probabilistic IR (Language Modeling)
  • P(D|Q) = P(Q|D) P(D) / P(Q)

26
Open Research Problems in IR (2)
  • Beyond Relevance
  • Appropriateness of doc to user comprehension
    level, etc.
  • Novelty of information in doc to user
  •   (anti-redundancy as an approximation to novelty)

27
Open Research Problems in IR (3)
  • Beyond one Language
  • Translingual IR
  • Transmedia IR

28
Open Research Problems in IR (4)
  • Beyond Content Queries
  • "What's new today?"
  • "What sort of things do you know about?"
  • "Build me a Yahoo-style index for X"
  • "Track the event in this news story"