CS246 - PowerPoint PPT Presentation

Provided by: cho125
Transcript and Presenter's Notes

Title: CS246


1
CS246
  • Basic Information Retrieval

2
Today's Topics
  • Basic Information Retrieval (IR)
  • Bag of words assumption
  • Boolean Model
  • Inverted index
  • Vector-space model
  • Document-term matrix
  • TF-IDF vector and cosine similarity
  • Phrase queries
  • Spell correction

3
Information-Retrieval System
  • Information source: existing text documents
  • Keyword-based/natural-language query
  • The system returns best-matching documents given
    the query
  • Challenge
  • Both queries and data are fuzzy
  • Unstructured text and natural-language queries
  • What documents are good matches for a query?
  • Computers do not understand the documents or
    the queries
  • Developing a computerizable model is essential
    to implement this approach

4
Bag of Words: Major Simplification
  • Consider each document as a bag of words
  • Bag vs. set
  • Ignore word ordering, but keep word count
  • Consider queries as bags of words as well
  • Great oversimplification, but works adequately in
    many cases
  • "John loves only Jane" vs. "Only John loves Jane"
  • The limitation still shows up in current search
    engines
  • Still, how do we match documents and queries?
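The bag-of-words idea can be illustrated with a short sketch. It assumes lowercasing and whitespace tokenization, neither of which the slide specifies, and the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count each word (assumed tokenization)."""
    return Counter(text.lower().split())

a = bag_of_words("John loves only Jane")
b = bag_of_words("Only John loves Jane")

# Both sentences contain the same words with the same counts,
# so their bags are equal even though the meanings differ.
same_bag = (a == b)
print(same_bag)

# A bag keeps counts; a set would only record which words occur.
c = bag_of_words("to be or not to be")
print(c["to"], sorted(set(c)))
```

The two example sentences from the slide map to the same bag, which is exactly the information the model throws away.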

5
Boolean Model
  • Return all documents that contain the words in
    the query
  • Simplest model for information retrieval
  • No notion of ranking
  • A document is either a match or non-match
  • Q: How do we find and return matching documents?
  • Basic algorithm?
  • Useful data structure?

6
Inverted Index
  • Allows quick lookup of the ids of documents that
    contain a particular word
  • Q: How can we use this to answer the query "UCLA Physics"?

Lexicon/dictionary (DIC)   Postings list
Stanford                →  PL(Stanford): 3 8 10 13 16 20
UCLA                    →  PL(UCLA): 1 2 3 9 16 18
MIT                     →  PL(MIT): 4 5 8 10 13 19 20 22
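One common way to answer a multi-word Boolean query with this structure is to intersect the sorted postings lists of the query words. The sketch below uses the three postings lists from the figure. Since the figure has no postings list for "Physics", the query "UCLA Stanford" stands in for "UCLA Physics"; the function names `intersect` and `boolean_and` are illustrative.

```python
def intersect(p1, p2):
    """Merge-style intersection of two postings lists sorted by docid."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Postings lists copied from the figure.
index = {
    "Stanford": [3, 8, 10, 13, 16, 20],
    "UCLA": [1, 2, 3, 9, 16, 18],
    "MIT": [4, 5, 8, 10, 13, 19, 20, 22],
}

def boolean_and(index, words):
    """Return docids containing every query word; unknown words match nothing."""
    lists = sorted((index.get(w, []) for w in words), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:  # shortest lists first keeps intermediate results small
        result = intersect(result, postings)
    return result

print(boolean_and(index, ["UCLA", "Stanford"]))  # [3, 16]
```

Each intersection takes time linear in the combined length of the two lists, because both are kept sorted by docid.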

8
Size of Inverted Index (1)
  • 100M docs, 10KB/doc, 1000 unique words/doc,
    10B/word, 4B/docid
  • Q: Document collection size?
  • Q: Inverted index size?
  • Heaps' law: vocabulary size = k n^b, where n is the
    number of words in the collection, 30 < k < 100,
    and 0.4 < b < 1
  • k = 50 and b = 0.5 are a good rule of thumb
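A back-of-the-envelope answer to the two questions can be worked out from the numbers on the slide. The sketch rests on several assumptions that the slide does not state: 1KB = 1000 bytes, one posting per (document, unique word) pair, a dictionary that stores only the words themselves, and a total word count of 10KB / 10B = 1000 words per document for Heaps' law.

```python
NUM_DOCS = 100_000_000        # 100M docs
DOC_BYTES = 10_000            # 10KB/doc, taking 1KB = 1000 bytes (assumption)
UNIQUE_WORDS_PER_DOC = 1000
WORD_BYTES = 10
DOCID_BYTES = 4

# Document collection size: docs x bytes per doc.
collection_bytes = NUM_DOCS * DOC_BYTES

# Postings lists: one docid per (doc, unique word) pair (assumption).
postings_bytes = NUM_DOCS * UNIQUE_WORDS_PER_DOC * DOCID_BYTES

# Dictionary: vocabulary size from Heaps' law with k = 50, b = 0.5.
total_words = NUM_DOCS * (DOC_BYTES // WORD_BYTES)   # assumed word count n
K, B = 50, 0.5
vocabulary_size = K * total_words ** B
dictionary_bytes = vocabulary_size * WORD_BYTES

print(f"collection: {collection_bytes / 1e12:.1f} TB")
print(f"postings:   {postings_bytes / 1e9:.0f} GB")
print(f"vocabulary: {vocabulary_size / 1e6:.1f} M words")
print(f"dictionary: {dictionary_bytes / 1e6:.0f} MB")
```

Under these assumptions the collection is about 1 TB, the postings lists about 400 GB, and the dictionary only on the order of 160 MB.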

9
Size of Inverted Index (2)
  • Q: Between the dictionary and the postings lists,
    which one is larger?
  • Q: Lengths of postings lists?
  • Zipf's law: collection term frequency ∝
    1 / frequency rank
  • Q: How do we construct an inverted index?

10
Inverted Index Construction
  • C: set of all documents (corpus)
  • DIC: dictionary of the inverted index
  • PL(w): postings list of word w
  • 1. For each document d ∈ C
  • 2.   Extract all words in content(d) into W
  • 3.   For each w ∈ W
  • 4.     If w ∉ DIC, then add w to DIC
  • 5.     Append id(d) to PL(w)
  • Q: What if the index is larger than main memory?
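The five steps can be sketched in Python as follows, for the case where everything fits in memory. The corpus representation (a dict from docid to text) and the lowercase, whitespace-based word extraction are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """corpus: dict mapping docid -> text. Returns dict mapping word -> list of docids."""
    index = defaultdict(list)                        # DIC and PL(w) in one structure
    for doc_id in sorted(corpus):                    # step 1: for each document d in C
        words = set(corpus[doc_id].lower().split())  # step 2: extract words into W
        for w in words:                              # step 3: for each w in W
            # steps 4-5: defaultdict creates the entry for a new word,
            # then the docid is appended to its postings list
            index[w].append(doc_id)
    return dict(index)

corpus = {
    1: "UCLA physics",
    2: "Stanford physics",
    3: "UCLA Stanford",
}
index = build_inverted_index(corpus)
print(index["ucla"])     # [1, 3]
print(index["physics"])  # [1, 2]
```

Visiting documents in increasing docid order keeps every postings list sorted, which later list intersections rely on.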

11
Inverted-Index Construction
  • For a large text corpus
  • Block sort-based construction
  • Partition and merge
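A rough sketch of the partition-and-merge idea follows. It is an in-memory simulation under stated assumptions: each block of documents is turned into a sorted run of (word, docid) pairs, which a real system would write to disk, and the runs are then merged in a single pass. The function names and the block size are illustrative, not taken from the slides.

```python
import heapq
from itertools import groupby, islice

def blocks(docs, size):
    """Partition: yield consecutive chunks of at most `size` documents."""
    it = iter(docs)
    while chunk := list(islice(it, size)):
        yield chunk

def make_run(block):
    """Sort one block's unique (word, docid) pairs; stands in for an on-disk run."""
    pairs = {(w, doc_id) for doc_id, text in block for w in text.lower().split()}
    return sorted(pairs)

def build_index_blocked(docs, block_size=2):
    """Merge: combine the sorted runs and group the pairs into postings lists."""
    runs = [make_run(b) for b in blocks(docs, block_size)]
    merged = heapq.merge(*runs)  # lazy k-way merge of already sorted runs
    return {w: [d for _, d in group]
            for w, group in groupby(merged, key=lambda pair: pair[0])}

docs = [
    (1, "ucla physics"),
    (2, "stanford physics"),
    (3, "ucla stanford"),
    (4, "mit physics"),
]
blocked_index = build_index_blocked(docs)
print(blocked_index["physics"])  # [1, 2, 4]
```

Only one block has to be sorted in memory at a time, and the merge reads each run sequentially, which is what makes the approach workable when the index exceeds main memory.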

12
Evaluation: Precision and Recall
  • Q: Are all matching documents what users want?
  • Basic idea: a model is good if it returns a
    document if and only if it is relevant.
  • R: set of relevant documents; D: set of documents
    returned by a model
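The formulas themselves did not survive in this transcript. The sketch below uses the standard definitions, precision = |R ∩ D| / |D| and recall = |R ∩ D| / |R|, with made-up docids for illustration.

```python
def precision_recall(relevant, returned):
    """Return (precision, recall) for relevant set R and returned set D."""
    relevant, returned = set(relevant), set(returned)
    hits = len(relevant & returned)                      # |R ∩ D|
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 relevant documents, 3 returned, 2 of them relevant.
p, r = precision_recall(relevant={1, 2, 3, 4}, returned={3, 4, 5})
print(p, r)
```

Here two of the three returned documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).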

13
Vector-Space Model
  • Main problem of the Boolean model
  • Too many matching documents when the corpus is
    large
  • Any way to rank documents?
  • Matrix interpretation of the Boolean model
  • Document-term matrix
  • Boolean 0 or 1 value for each entry
  • Basic idea
  • Assign a real-valued weight to each matrix entry
    depending on the importance of the term
  • "the" vs. "UCLA"
  • Q: How should we assign the weights?

14
TF-IDF Vector
  • A term t is important for document d
  • If t appears many times in d, or
  • If t is a rare term
  • TF: term frequency
  • Number of occurrences of t in d
  • IDF: inverse document frequency
  • Based on DF, the number of documents containing t
  • TF-IDF weighting
  • TF × log(N / DF), where N is the total number of
    documents
  • Q: How do we use it to compute query-document
    relevance?
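A small sketch of the weighting, reading the formula as TF × log(N / DF), where DF is the number of documents containing the term and N is the corpus size. The natural logarithm and the three toy documents are assumptions; the slide does not fix a log base.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict docid -> list of words. Returns docid -> {term: TF-IDF weight}."""
    n = len(docs)                                                    # N
    df = Counter(t for words in docs.values() for t in set(words))   # DF per term
    return {
        doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(words).items()}
        for doc_id, words in docs.items()
    }

# Hypothetical toy corpus.
docs = {
    1: "the ucla campus".split(),
    2: "the car the racing".split(),
    3: "the university".split(),
}
vectors = tf_idf_vectors(docs)
print(vectors[1])
print(vectors[2])
```

The word "the" appears in every document, so its weight is log(3/3) = 0 no matter how often it occurs, while a word found in a single document, such as "ucla", gets the largest IDF factor.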

15
Cosine Similarity
  • Represent both query and document as a TF-IDF
    vector
  • Take the inner product of the two normalized
    vectors to compute their similarity
  • Note: |Q| does not matter for document ranking.
    Division by |D| penalizes longer documents.
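The formula from the slide is not preserved in this transcript. In standard notation, with q_t and d_t denoting the TF-IDF weights of term t in the query and the document (symbols chosen here for illustration), cosine similarity is:

```latex
\[
\cos(Q, D) \;=\; \frac{Q \cdot D}{|Q|\,|D|}
\;=\; \frac{\sum_{t} q_t\, d_t}{\sqrt{\sum_{t} q_t^{2}}\;\sqrt{\sum_{t} d_t^{2}}}
\]
```

For a fixed query, |Q| is the same for every document, so dropping it leaves the ranking unchanged; the division by |D| is what keeps long documents from winning on length alone.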

16
Cosine Similarity Example
  • idf(UCLA) = 10, idf(good) = 0.1, idf(university) =
    idf(car) = idf(racing) = 1
  • Q = (UCLA, university), D = (car, racing)
  • Q = (UCLA, university), D = (UCLA, good)
  • Q = (UCLA, university), D = (university, good)
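The three cases can be worked through numerically. The sketch assumes each listed word occurs exactly once and uses the idf values from the slide directly as the term weights; the helper names are illustrative.

```python
import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vector(words):
    """Weight vector assuming TF = 1 for each word, so weight = idf (assumption)."""
    return {w: idf[w] for w in words}

def cosine(q, d):
    """Inner product of the two vectors divided by the product of their lengths."""
    dot = sum(weight * d.get(t, 0.0) for t, weight in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

q = vector(["UCLA", "university"])
documents = [["car", "racing"], ["UCLA", "good"], ["university", "good"]]
scores = [cosine(q, vector(d)) for d in documents]
for d, s in zip(documents, scores):
    print(d, round(s, 3))
```

Under these assumptions the scores come out at 0, about 0.995, and about 0.099. Matching on the rare word "UCLA" counts far more than matching on the common word "university".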

17
Finding High Cosine-Similarity Documents
  • Q: Under the vector-space model, do precision and
    recall make sense?
  • Q: How do we find the documents with the highest
    cosine similarity in the corpus?
  • Q: Any way to avoid a complete scan of the corpus?

18
Inverted Index for TF-IDF
  • Q · d_i = 0 if d_i has no query words
  • Consider only the documents with query words
  • Inverted index: word → documents
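One way to use the index for ranking is to walk only the postings lists of the query words and accumulate partial inner products per document. In the sketch below, the index contents, the stored weights, and the precomputed document lengths are all hypothetical. Scores are divided by |D| only, since |Q| does not affect the ranking.

```python
import math
from collections import defaultdict

# Hypothetical index: word -> list of (docid, weight of the word in that document).
weighted_index = {
    "ucla": [(1, 10.0), (3, 10.0)],
    "university": [(2, 1.0), (3, 1.0)],
    "good": [(1, 0.1), (2, 0.1), (4, 0.1)],
}
# Hypothetical precomputed vector lengths |D| for the four documents.
doc_norm = {
    1: math.sqrt(10.0**2 + 0.1**2),
    2: math.sqrt(1.0**2 + 0.1**2),
    3: math.sqrt(10.0**2 + 1.0**2),
    4: 0.1,
}

def rank(query_weights, k=10):
    """Return the top-k (docid, score) pairs, touching only query-word postings."""
    scores = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, d_w in weighted_index.get(term, []):
            scores[doc_id] += q_w * d_w          # accumulate Q · d_i term by term
    ranked = sorted(((s / doc_norm[d], d) for d, s in scores.items()), reverse=True)
    return [(d, s) for s, d in ranked[:k]]

top = rank({"ucla": 10.0, "university": 1.0})
print(top)
```

Document 4 contains no query word, so it is never visited and never receives a score. That is the saving over a complete scan of the corpus.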

19
Phrase Queries
  • "Harvard University Boston" exactly as a phrase
  • Q: How can we support this query?
  • Two approaches
  • Biword index
  • Positional index
  • Q: Pros and cons of each approach?
  • Rule of thumb: 2x to 4x size increase for a
    positional index compared to a docid-only index
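A minimal sketch of the positional-index approach: each posting stores the positions at which the word occurs, and a phrase matches when the query words appear at consecutive positions. The toy corpus and the function names are made up for the example.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """Return word -> {docid -> [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for pos, w in enumerate(text.lower().split()):
            index[w].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return sorted docids in which the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    matches = []
    for doc_id in sorted(candidates):
        starts = index[words[0]][doc_id]
        # The phrase starts at p if word number k sits at position p + k.
        if any(all(p + k in index[w][doc_id] for k, w in enumerate(words))
               for p in starts):
            matches.append(doc_id)
    return matches

corpus = {
    1: "Harvard University Boston campus",
    2: "Boston University near Harvard",
    3: "visit Harvard University in Boston",
}
pos_index = build_positional_index(corpus)
print(phrase_query(pos_index, "Harvard University Boston"))  # [1]
```

All three documents contain all three words, so a docid-only index cannot tell them apart; the positions are what rule out documents 2 and 3. A biword index would instead store pairs such as "harvard university" as dictionary entries, trading a larger dictionary for simpler lookups.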

20
Spell correction
  • Q: What is the user's intention for the query
    "Britnie Spears"? How can we find the correct
    spelling?
  • Given a user-typed word w, find its correct
    spelling c.
  • Probabilistic approach: find c with the highest
    probability P(c|w).
  • Q: How to estimate it?
  • Bayes' rule: P(c|w) = P(w|c)P(c)/P(w)
  • Q: What are these probabilities and how can we
    estimate them?
  • Rule of thumb: 75% of misspellings are within edit
    distance 1; 98% are within edit distance 2.
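The probabilistic approach can be sketched as choosing the candidate c that maximizes P(w|c)P(c); P(w) is the same for every candidate and can be ignored. Everything numeric below is an assumption for illustration: the word counts are invented, and the error model is a crude one that depends only on edit distance, loosely reading the rule of thumb as 75% of misspellings at distance 1 and a further 23% at distance 2.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical word counts standing in for a corpus-based estimate of P(c).
WORD_COUNTS = {"britney": 900, "britain": 300, "brittany": 400,
               "spears": 800, "spares": 100}
# Hypothetical error model: relative likelihood of a typo at each edit distance.
ERROR_WEIGHT = {1: 0.75, 2: 0.23}

def correct(w):
    """Return the known word maximizing P(w|c) * P(c), or w itself if none is close."""
    w = w.lower()
    if w in WORD_COUNTS:        # assumption: known words are taken as correct
        return w
    total = sum(WORD_COUNTS.values())
    best, best_score = w, 0.0
    for c, count in WORD_COUNTS.items():
        d = edit_distance(w, c)
        if d in ERROR_WEIGHT:   # consider only candidates within edit distance 2
            score = ERROR_WEIGHT[d] * count / total
            if score > best_score:
                best, best_score = c, score
    return best

print(correct("britnie"), correct("spears"))
```

Both "britney" and "britain" are at edit distance 2 from "britnie", so in this sketch the prior P(c) decides between them. A more careful error model would estimate P(w|c) from observed typing mistakes rather than from edit distance alone.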

21
Summary
  • Boolean model
  • Vector-space model
  • TF-IDF weight, cosine similarity
  • Inverted index
  • Boolean model
  • TF-IDF model
  • Phrase queries
  • Spell correction