1
Information Retrieval, aka Google-lite
  • CMSC 16100
  • November 27, 2006

2
Roadmap
  • Information Retrieval (IR)
    • Goal: Match information need to document concept
  • Solution: Vector Space Model
    • Representation of documents and queries
    • Computing similarity
  • Implementation
    • Indexing: Documents -> Vectors
    • Query construction: Query -> Vector
    • Retrieval: Finding the best query/document match

3
The Information Retrieval Task
  • Goal
    • Match the information need expressed by the user (the query)
    • with concepts in documents (the document collection)
  • Issues
    • How do we represent documents and queries?
    • How do we know if they're similar? Match?

4
Vector Space Model
  • Represent documents and queries with
    • Pattern of words
      • i.e., queries and documents with lots of the same words
    • Vector of word occurrences
      • Each position in vector = word
      • Value of position x in vector = number of times word x occurs
  • Similarity
    • Dot product of document vector and query vector
    • Biggest wins

5
Vector Space Model
[Figure: three-dimensional vector space with axes Tv, Program, Computer]
Two documents: "computer program", "tv program"
Query "computer program": matches 1st doc exactly; similarity = 2 vs. 1
Query "educational program": matches both equally; similarity = 1
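
The dot-product similarity from the two Vector Space Model slides can be sketched as follows. This is a minimal sketch assuming Racket; the slides never define dot-product, so this definition, the dictionary order (tv = 0, program = 1, computer = 2), and the variable names are assumptions. Under this definition the "tv program" document scores 1 against the query "computer program", because both contain "program".

```scheme
#lang racket

;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the products of matching positions; assumes equal lengths.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; Assumed dictionary order: tv = 0, program = 1, computer = 2.
(define doc1 (vector 0 1 1))   ; "computer program"
(define doc2 (vector 1 1 0))   ; "tv program"
(define query1 (vector 0 1 1)) ; "computer program"
;; "educational" is not in the assumed dictionary, so only
;; "program" is counted for the second query.
(define query2 (vector 0 1 0)) ; "educational program"

(dot-product doc1 query1) ; 2
(dot-product doc2 query1) ; 1
(dot-product doc1 query2) ; 1
(dot-product doc2 query2) ; 1
```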

6
Information Retrieval in Scheme
  • Representation
  • A vector-rep is (vectorof number)
  • (define-struct doc-rep (id vec))
  • A doc is (make-doc-rep id vec)
    • where id: symbol, vec: vector-rep
  • A doc-index is (listof doc)
  • A query is a vector-rep
  • A simple-web-page (swp) is (make-swp h b)
    • where (define-struct swp (header body)), h: symbol, b: (listof symbol)
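
The data definitions above can be written out as a short sketch, assuming Racket's define-struct. The field names header and body for swp are an assumption, chosen because the indexing code calls swp-header and swp-body; the example page and vector are hypothetical.

```scheme
#lang racket

;; A doc-rep pairs a document id (symbol) with its word-count vector.
(define-struct doc-rep (id vec))

;; A simple web page: header (symbol) and body (listof symbol).
;; Field names are assumed so that the accessors are
;; swp-header and swp-body.
(define-struct swp (header body))

;; Hypothetical example values
(define page1 (make-swp 'page1 '(computer program)))
(define doc1 (make-doc-rep 'page1 (vector 0 1 1)))

(swp-header page1)  ; the page's id symbol
(swp-body page1)    ; the list of words on the page
(doc-rep-id doc1)   ; the same id, carried over to the doc-rep
(doc-rep-vec doc1)  ; the word-count vector
```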

7
Three Steps to IR
  • Three phases
    • Indexing: Build collection of document representations
      • Convert web pages to doc-rep
      • Vectors of word counts
    • Query construction
      • Convert query text to vector of word counts
    • Retrieval
      • Compute similarity between query and doc representation
      • Return closest match

8
Words-to-vector
;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
;; Adds 1 to the count at each word's dictionary position (mutates wvec)
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (let ((cur-count (vector-ref wvec wpos)))
             (vector-set! wvec wpos (+ cur-count 1))
             (words-to-vector (cdr wlist) wvec))))))

;; posn: looks up the vector position of word wd in the dictionary
(define (posn wd dict)
  (cond ((null? dict) (error 'posn "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
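
The slide leaves dict, map-wd, and map-num undefined. The following is a minimal sketch, assuming Racket and assuming the dictionary is a list of (word . position) pairs; the accessor bodies, the dictionary contents, and the dict-size value are illustrative assumptions. The two functions are the slide's definitions with the parentheses balanced and the missing + restored.

```scheme
#lang racket

;; Assumed dictionary representation: a list of (word . position) pairs.
;; map-wd and map-num are the accessor names used on the slide;
;; these definitions are hypothetical.
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))

(define dict '((tv . 0) (program . 1) (computer . 2)))
(define dict-size (length dict))

;; posn: position of word wd in the dictionary; error if absent.
(define (posn wd dict)
  (cond ((null? dict) (error 'posn "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: increments the count for each word in wlist.
;; Mutates and returns wvec.
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (vector-set! wvec wpos (+ (vector-ref wvec wpos) 1))
           (words-to-vector (cdr wlist) wvec)))))

;; Resulting counts: tv 0, program 2, computer 1
(words-to-vector '(computer program program) (make-vector dict-size 0))
```

Because words-to-vector mutates its second argument, each call should receive a fresh vector, which is why the indexing and query code create one with make-vector every time.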

9
Indexing
;; build-index: (listof swp) -> (listof doc-rep)
;; Convert text of web pages to list of vector document reps
(define (build-index swp-list)
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-header (car swp-list))
                             (words-to-vector (swp-body (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))

10
Query Construction
;; build-query: (listof symbol) -> vector-rep
;; Convert query text to vector of word occurrence counts
(define (build-query wlist)
  (words-to-vector wlist (make-vector dict-size 0)))

11
Retrieval
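;; retrieve: vector-rep (listof doc-rep) -> symbol
;; Finds id of document with best match with query (index must be non-empty)
(define (retrieve query index)
  (doc-rep-id (best-doc query (car index) (cdr index))))

;; best-doc: keeps the doc with the largest dot product seen so far
(define (best-doc query best others)
  (cond ((null? others) best)
        ((> (dot-product (doc-rep-vec (car others)) query)
            (dot-product (doc-rep-vec best) query))
         (best-doc query (car others) (cdr others)))
        (else (best-doc query best (cdr others)))))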
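
retrieve returns only the single best id. As a hypothetical extension, the same scores can be used to rank every document, which also makes the scores visible for checking. This sketch assumes Racket; rank, the page ids, the dictionary order, and the dot-product definition (which the slides do not give) are all assumptions.

```scheme
#lang racket

(define-struct doc-rep (id vec))

;; dot-product: sums products of matching positions; assumes equal lengths.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; rank: vector-rep (listof doc-rep) -> (listof (cons symbol number))
;; Hypothetical helper: pairs each document id with its score, best first.
(define (rank query index)
  (sort (map (lambda (doc)
               (cons (doc-rep-id doc)
                     (dot-product (doc-rep-vec doc) query)))
             index)
        >
        #:key cdr))

;; Hand-built index; assumed dictionary order tv = 0, program = 1, computer = 2.
(define index
  (list (make-doc-rep 'tv-page (vector 1 1 0))
        (make-doc-rep 'computer-page (vector 0 1 1))))

;; Query "computer program": computer-page scores 2, tv-page scores 1.
(rank (vector 0 1 1) index)
```

The id in the first pair of the ranked list is the document that retrieve is meant to return.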