Title: Information Retrieval: aka Google-lite
1. Information Retrieval, aka Google-lite
- CMSC 16100
- November 27, 2006
2. Roadmap
- Information Retrieval (IR)
  - Goal: Match information need to document concept
  - Solution: Vector Space Model
    - Representation of documents and queries
    - Computing similarity
- Implementation
  - Indexing: Documents -> Vectors
  - Query construction: Query -> Vector
  - Retrieval: Finding best match between query and document
3. The Information Retrieval Task
- Goal
  - Match the information need expressed by the user (the query)
  - With concepts in documents (the document collection)
- Issues
  - How do we represent documents and queries?
  - How do we know if they're similar? Match?
4. Vector Space Model
- Represent documents and queries with
  - Pattern of words
    - I.e., queries and documents with lots of the same words match
  - Vector of word occurrences
    - Each position in vector = a word
    - Value of position x in vector = number of times word x occurs
- Similarity
  - Dot product of document vector and query vector
  - Biggest wins
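As a rough illustration of the similarity measure above, here is a minimal sketch of a dot product over two word-count vectors. The name dot-product matches the one used in the retrieval code later in the deck, but this definition is an assumption (the slides never show it), as are the sample counts and the word order (tv program computer). It assumes Racket.

```scheme
#lang racket
;; dot-product: (vectorof number) (vectorof number) -> number
;; Assumed helper: sums the pairwise products of two equal-length vectors.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (cond ((= i (vector-length v1)) sum)
          (else (loop (+ i 1)
                      (+ sum (* (vector-ref v1 i) (vector-ref v2 i))))))))

;; Hypothetical counts over the word order (tv program computer).
(define doc   (vector 0 2 1))   ; "program" twice, "computer" once
(define query (vector 0 1 1))   ; "computer program"
(dot-product doc query)         ; => 3
```

Words shared by the query and the document contribute to the sum; words that appear in only one of them contribute nothing.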
5. Vector Space Model
[Figure: word space with axes Tv, Program, Computer]
- Two documents: "computer program", "tv program"
- Query "computer program" matches 1st doc exactly: distance = 2 vs 0
- Query "educational program" matches both equally: distance = 1
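The example can be checked by hand with a short sketch. The slide does not define "distance"; squared Euclidean distance is one reading that reproduces its numbers, so that choice is an assumption, as are the axis order and the helper names sq-distance and dot-product. It assumes Racket.

```scheme
#lang racket
;; Word order (an assumption): position 0 = tv, 1 = program, 2 = computer.
(define doc1 (vector 0 1 1))        ; "computer program"
(define doc2 (vector 1 1 0))        ; "tv program"
(define q-computer (vector 0 1 1))  ; query "computer program"
(define q-educ     (vector 0 1 0))  ; query "educational program";
                                    ; "educational" has no axis here

;; dot-product: higher means more similar
(define (dot-product v1 v2)
  (apply + (map * (vector->list v1) (vector->list v2))))

;; sq-distance: squared Euclidean distance; lower means more similar
(define (sq-distance v1 v2)
  (apply + (map (lambda (a b) (* (- a b) (- a b)))
                (vector->list v1) (vector->list v2))))

(sq-distance q-computer doc1)  ; => 0
(sq-distance q-computer doc2)  ; => 2
(sq-distance q-educ doc1)      ; => 1
(sq-distance q-educ doc2)      ; => 1
(dot-product q-computer doc1)  ; => 2
(dot-product q-computer doc2)  ; => 1
```

Both measures rank the first document highest for "computer program" and leave the two documents tied for "educational program".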
6. Information Retrieval in Scheme
- Representation
  - A vector-rep is (vectorof number)
  - (define-struct doc-rep (id vec))
  - A doc is (make-doc-rep id vec)
    - Where id: symbol, vec: vector-rep
  - A doc-index is (listof doc)
  - A query is vector-rep
  - A simple-web-page (swp) is
    - (make-swp h b)
    - Where (define-struct swp (header body)), h: symbol, b: (listof symbol)
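To make the data definitions concrete, here is a small sketch that builds one value of each structure. The swp field names header and body are an assumption chosen to match the swp-header and swp-body accessors used in the indexing code; the page name and contents are hypothetical. It assumes Racket, where define-struct generates the make- constructors.

```scheme
#lang racket
;; Structure definitions following the slide (swp field names assumed).
(define-struct doc-rep (id vec))
(define-struct swp (header body))

;; A hypothetical page and a hand-built document representation.
(define page (make-swp 'page1 '(computer program)))
(define doc  (make-doc-rep 'page1 (vector 0 1 1)))  ; counts for (tv program computer)

(swp-header page)   ; => 'page1
(swp-body page)     ; => '(computer program)
(doc-rep-id doc)    ; => 'page1
(doc-rep-vec doc)   ; => '#(0 1 1)
```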
7. Three Steps to IR
- Three phases
  - Indexing: Build collection of document representations
    - Convert web pages to doc-rep
    - Vectors of word counts
  - Query construction
    - Convert query text to vector of word counts
  - Retrieval
    - Compute similarity between query and doc representation
    - Return closest match
8. Words-to-vector
(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds 1 to the count in wvec for each word in wlist
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (let ((cur-count (vector-ref wvec wpos)))
             (vector-set! wvec wpos (+ cur-count 1))
             (words-to-vector (cdr wlist) wvec))))))

(define (posn wd dict)
  ;; posn: symbol dictionary -> number
  ;; Looks up the vector position assigned to word wd
  (cond ((null? dict) (error 'posn "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
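The code above relies on a dictionary dict and on the accessors map-wd and map-num, none of which the slides define. The sketch below fills them in under one possible assumption, a list of (word . position) pairs, and runs words-to-vector on a short word list. The three-word dictionary is hypothetical. It assumes Racket.

```scheme
#lang racket
;; Assumed dictionary layout: a list of (word . position) pairs.
(define dict '((tv . 0) (program . 1) (computer . 2)))
(define (map-wd entry) (car entry))    ; word stored in an entry
(define (map-num entry) (cdr entry))   ; vector position of that word

;; posn: look up the position of wd, signalling an error if it is unknown.
(define (posn wd dict)
  (cond ((null? dict) (error 'posn "missing word: ~a" wd))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: bump the count in wvec for every word in wlist.
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let* ((wpos (posn (car wlist) dict))
                (cur-count (vector-ref wvec wpos)))
           (vector-set! wvec wpos (+ cur-count 1))
           (words-to-vector (cdr wlist) wvec)))))

(words-to-vector '(computer program program) (make-vector 3 0))  ; => '#(0 2 1)
```

Note that a word missing from the dictionary raises an error here, so every word in a page or query must have a dictionary entry.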
9. Indexing
(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-header (car swp-list))
                             (words-to-vector (swp-body (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))
10. Query Construction
(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))
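11. Retrieval
(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query
  ;; (index must be non-empty; ties go to the earliest document)
  (let loop ((best (car index)) (docs (cdr index)))
    (cond ((null? docs) (doc-rep-id best))
          ((> (dot-product (doc-rep-vec (car docs)) query)
              (dot-product (doc-rep-vec best) query))
           (loop (car docs) (cdr docs)))
          (else (loop best (cdr docs))))))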
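Putting the three phases together, the following is a compact end-to-end sketch, not the slides' exact code. It assumes a three-word dictionary stored as (word . position) pairs, a dot-product helper, simplified one-argument versions of posn and words-to-vector, and a retrieve that tracks the best-scoring document and returns its id. The two sample pages are hypothetical. It assumes Racket.

```scheme
#lang racket
(define-struct doc-rep (id vec))
(define-struct swp (header body))

;; Assumed dictionary: (word . position) pairs, plus its size.
(define dict '((tv . 0) (program . 1) (computer . 2)))
(define dict-size (length dict))

;; posn: symbol -> number, position of wd in the dictionary
(define (posn wd)
  (let ((entry (assq wd dict)))
    (if entry (cdr entry) (error 'posn "missing word: ~a" wd))))

;; words-to-vector: (listof symbol) -> (vectorof number), a fresh count vector
(define (words-to-vector wlist)
  (let ((wvec (make-vector dict-size 0)))
    (for-each (lambda (wd)
                (let ((i (posn wd)))
                  (vector-set! wvec i (+ 1 (vector-ref wvec i)))))
              wlist)
    wvec))

;; Indexing: one doc-rep per page
(define (build-index swp-list)
  (map (lambda (page)
         (make-doc-rep (swp-header page) (words-to-vector (swp-body page))))
       swp-list))

;; Query construction
(define (build-query wlist) (words-to-vector wlist))

(define (dot-product v1 v2)
  (apply + (map * (vector->list v1) (vector->list v2))))

;; Retrieval: keep the best-scoring document seen so far (index non-empty)
(define (retrieve query index)
  (let loop ((best (car index)) (docs (cdr index)))
    (cond ((null? docs) (doc-rep-id best))
          ((> (dot-product (doc-rep-vec (car docs)) query)
              (dot-product (doc-rep-vec best) query))
           (loop (car docs) (cdr docs)))
          (else (loop best (cdr docs))))))

(define index
  (build-index (list (make-swp 'doc1 '(computer program))
                     (make-swp 'doc2 '(tv program)))))

(retrieve (build-query '(computer program)) index)  ; => 'doc1
(retrieve (build-query '(tv)) index)                ; => 'doc2
```

Building a fresh vector inside words-to-vector avoids accidentally sharing one mutable vector between documents, which is the main reason this variant drops the second argument.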