Basic IR: Modeling - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Basic IR: Modeling
  • Basic IR task: match a subset of documents to the
    user's query
  • Slightly more complex: also rank the resulting
    documents by predicted relevance
  • The derivation of relevance leads to different IR
    models.

2
Concepts: Term-Document Incidence
  • Imagine a matrix of terms × documents, with a 1 when
    the term appears in the document and 0 otherwise.
  • How are queries satisfied?
  • Problems?

          search  segment  select  semantic
  MIR        1       0        1       1
  AI         1       1        0       1
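How incidence answers queries can be sketched in a few lines of Python; the two-row matrix below is taken from the table above, and the `matches` helper is an illustrative assumption, not part of the slides.

```python
# Term-document incidence matrix from the slide: rows are documents,
# columns are terms; 1 means the term occurs in the document.
incidence = {
    "MIR": {"search": 1, "segment": 0, "select": 1, "semantic": 1},
    "AI":  {"search": 1, "segment": 1, "select": 0, "semantic": 1},
}

def matches(doc_terms, required, forbidden=()):
    """A document satisfies the query when every required term is
    present (1) and every forbidden term is absent (0)."""
    return (all(doc_terms[t] == 1 for t in required)
            and all(doc_terms[t] == 0 for t in forbidden))

# Query: search AND select AND NOT segment
hits = [doc for doc, terms in incidence.items()
        if matches(terms, required=["search", "select"],
                   forbidden=["segment"])]
```

Only MIR satisfies this query, which also hints at one of the slide's "Problems?": the model gives yes/no answers with no ranking.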

3
Concepts: Term Frequency
  • To support document ranking, we need more than just
    term incidence.
  • Term frequency records the number of times a given
    term appears in each document.
  • Intuition: the more times a term appears in a
    document, the more central it is to the topic of
    the document.
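Counting raw term frequencies is a one-liner with the standard library; the sample sentence here is an illustrative assumption.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each term appears in a document
    (a document here is just a whitespace-tokenized string)."""
    return Counter(text.lower().split())

tf = term_frequencies("search the web search the index")
```

In `tf`, "search" and "the" each count 2 while "web" and "index" count 1, giving the per-document counts that the later TF-IDF slides normalize.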

4
Concept: Term Weight
  • Weights represent the importance of a given term
    for characterizing a document.
  • wij is a weight for term i in document j.

5
Mapping Task and Document Type to Model

                          Index Terms   Full Text         Full Text + Structure
  Searching (Retrieval)   Classic       Classic           Structured
  Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext
6
IR Models
from MIR text
7
Classic Models: Basic Concepts
  • ki is an index term
  • dj is a document
  • t is the total number of index terms
  • K = (k1, k2, ..., kt) is the set of all index terms
  • wij > 0 is a weight associated with the pair (ki, dj)
  • wij = 0 indicates that the term does not belong to
    the doc
  • vec(dj) = (w1j, w2j, ..., wtj) is a weighted vector
    associated with the document dj
  • gi(vec(dj)) = wij is a function which returns the
    weight associated with the pair (ki, dj)

8
Classic Boolean Model
  • Based on set theory: map queries with Boolean
    operations to set operations
  • Select documents from the term-document incidence
    matrix
  • Pros
  • Cons
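The mapping from Boolean operators to set operations can be sketched over an inverted index; the postings sets and document ids below are toy assumptions for illustration.

```python
# Postings: for each term, the set of document ids containing it
# (a toy inverted index; the ids are illustrative assumptions).
postings = {
    "search":   {1, 2, 4},
    "segment":  {2, 3},
    "semantic": {1, 2, 3, 4},
}
all_docs = {1, 2, 3, 4}

# AND -> intersection, OR -> union, NOT -> difference from the corpus
not_segment = all_docs - postings["segment"]

# (search AND semantic) AND NOT segment
result = (postings["search"] & postings["semantic"]) - postings["segment"]
```

This directness is the model's main "Pro" (simple, exact semantics); the "Cons" follow from the next slide: the sets carry no frequency or size information, so no ranking is possible.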

9
Exact Matching Ignores
  • term frequency in document
  • term scarcity in corpus
  • size of document
  • ranking

10
Vector Model
  • Vector of term weights based on term frequency
  • Compute similarity between query and document
    where both are vectors
  • vec(dj) = (w1j, w2j, ..., wtj);  vec(q) = (w1q,
    w2q, ..., wtq)
  • Similarity is the cosine of the angle between the
    vectors.

11
Cosine Measure
  • Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1

from MIR notes
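The cosine measure in the MIR notes can be written out directly; a minimal sketch, with vectors as plain Python lists of weights:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between query and document weight
    vectors: sim(q, d) = (q . d) / (|q| * |d|)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0          # no shared information, by convention
    return dot / (norm_q * norm_d)
```

With non-negative weights the result stays in [0, 1]: identical directions give 1, disjoint term sets give 0.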
12
How to Set wij Weights? TF-IDF
  • Within a document: term frequency
  • tf measures term density within a document
  • Across documents: inverse document frequency
  • idf measures the informativeness or rarity of a term
    across the corpus.

13
TF × IDF Computation
  • What happens as number of occurrences in a
    document increases?
  • What happens as term becomes more rare?

14
TF × IDF
  • TF may be normalized:
  • tf(i,d) = freq(i,d) / max_l(freq(l,d))
  • IDF is computed
  • normalized to the size of the corpus
  • as a log, to make TF and IDF values comparable
  • IDF requires a static corpus.
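The two factors can be sketched as small helpers; the natural log is an assumption here (it matches the worked example that follows), and the function names are illustrative.

```python
import math

def tf(freq, max_freq):
    """Normalized term frequency: raw count divided by the count
    of the most frequent term in the same document."""
    return freq / max_freq

def idf(n_docs, doc_freq):
    """Inverse document frequency: log of corpus size over the
    number of documents containing the term (static corpus)."""
    return math.log(n_docs / doc_freq)

def tf_idf(freq, max_freq, n_docs, doc_freq):
    """Weight w_ij = tf(i, d_j) * idf(i)."""
    return tf(freq, max_freq) * idf(n_docs, doc_freq)
```

As the previous slide asks: the weight grows with occurrences in the document (via `tf`) and grows as the term becomes rarer in the corpus (via `idf`); a term in every document gets idf = log(1) = 0.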

15
How to Set Wi,q Weights?
  1. Create Vector directly from query
  2. Use modified tf-idf

16
The Vector Model Example
from MIR notes
17
The Vector Model Example (cont.)
  • Compute the tf-idf vector for each document
  • For the first document:
  • k1 = (2/2)(log(7/5)) = .33
  • k2 = 0 · (log(7/4)) = 0
  • k3 = (1/2)(log(7/3)) = .42
  • For the rest:
  • .34 0 0,   0 .19 .85,   .34 0 0,   .08 .28 .85,
  • .17 .56 0,   0 .56 0

from MIR notes
18
The Vector Model Example (cont.)
  • 2. Compute the tf-idf for the query "1 2 3"
  • k1 = (.5 + ((.5 · 1)/3)) · (log(7/5))
  • k2 = (.5 + ((.5 · 2)/3)) · (log(7/4))
  • k3 = (.5 + ((.5 · 3)/3)) · (log(7/3))
  • which is .22 .47 .85

19
The Vector Model Example (cont.)
  • 3. Compute the sim for each document
  • D1:
  • D1 · q = (.33 × .22) + (0 × .47) + (.42 × .85)
    = .43
  • |D1| = sqrt(.33² + .42²) = .53
  • |q| = sqrt(.22² + .47² + .85²) = 1.0
  • sim = .43 / (.53 × 1.0) = .81
  • D2 = .22,  D3 = .93,  D4 = .23,
  • D5 = .97,  D6 = .51,  D7 = .47
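The D1 arithmetic from the example can be checked end to end; the natural log is assumed (it reproduces the slide's weights), and the term frequencies and document frequencies are the ones used in the worked example above.

```python
import math

# Document 1 weights: freq/max_freq times log(N/n_i), with N = 7 docs
# and document frequencies n = 5, 4, 3 for terms k1, k2, k3.
d1 = [(2 / 2) * math.log(7 / 5),    # k1  ~ .33
      0.0,                          # k2
      (1 / 2) * math.log(7 / 3)]    # k3  ~ .42

# Query weights: augmented tf (0.5 + 0.5 * freq/max_freq) times idf,
# for query term frequencies 1, 2, 3 with max frequency 3.
q = [(0.5 + 0.5 * f / 3) * math.log(7 / n)
     for f, n in [(1, 5), (2, 4), (3, 3)]]  # ~ .22, .47, .85

dot = sum(wq * wd for wq, wd in zip(q, d1))
norm_d1 = math.sqrt(sum(w * w for w in d1))
norm_q = math.sqrt(sum(w * w for w in q))
sim = dot / (norm_d1 * norm_q)     # ~ .81, as on the slide
```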

20
Vector Model Implementation Issues
  • Sparse term × document matrix
  • Store term count, term weight, or count weighted by
    idf_i?
  • What if the corpus is not fixed (e.g., the Web)?
    What happens to IDF?
  • How to efficiently compute Cosine for large index?

21
Heuristics for Computing Cosine for Large Index
  • Select from only non-zero cosines
  • Focus on non-zero cosines for rare (high idf)
    words
  • Pre-compute document adjacency
  • for each term, pre-compute k nearest docs
  • for a t term query, compute cosines from query to
    union of t pre-computed lists, choose top k
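The "pre-compute k nearest docs per term" heuristic can be sketched as champion lists; the data structure and the toy weights below are illustrative assumptions, and full cosine scoring would then run only over the candidate union.

```python
def champion_lists(weights, k):
    """weights: {term: {doc_id: weight}}.
    For each term, keep only the k documents with the highest
    weight for that term."""
    return {t: set(sorted(docs, key=docs.get, reverse=True)[:k])
            for t, docs in weights.items()}

def candidate_docs(query_terms, champions):
    """Union of the pre-computed lists for a t-term query; cosines
    are then computed only against this (much smaller) set."""
    out = set()
    for t in query_terms:
        out |= champions.get(t, set())
    return out

weights = {"search": {1: .9, 2: .1, 3: .5},
           "rank":   {2: .8, 3: .7}}
champs = champion_lists(weights, k=2)
cands = candidate_docs(["search", "rank"], champs)
```

The trade-off: the true top-k by cosine may fall outside the union, so this is an approximation that buys speed at large index sizes.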

22
The TF-IDF Vector Model: Pros/Cons
  • Pros
  • term-weighting improves quality
  • cosine ranking formula sorts documents according
    to degree of similarity to the query
  • Cons
  • assumes independence of index terms