1
Introduction to Information Retrieval (IR)
  • Mark Craven
  • craven@cs.wisc.edu
  • craven@biostat.wisc.edu
  • 5730 Medical Sciences Center

2
Documents and Corpora
  • document: a passage of free text or hypertext
  • Usenet posting
  • Web page
  • newswire story
  • MEDLINE abstract
  • journal article
  • corpus (pl. corpora): a collection of documents
  • MEDLINE
  • Reuters stories from 1999
  • the Web

3
The Ad-Hoc Retrieval Problem
  • given
  • a document collection (corpus)
  • an arbitrary query
  • do
  • return a list of relevant documents
  • this is the problem addressed by Web search
    engines

4
Typical IR System
[diagram: a typical IR system built around an inverted index]
5
The Index and Inverse Index
  • index: a relation mapping each document to the
    set of keywords it is about
  • inverse index: a relation mapping each keyword
    to the set of documents it appears in
  • where do these come from?
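The two relations above can be built in one pass over the corpus. A minimal sketch (not from the slides; the toy documents and IDs are made up):

```python
# Build a forward index (document -> words) and an inverted index
# (word -> documents) from a toy corpus.
corpus = {
    1: "the hungry zebra laughed",
    2: "a zebra grazed",
    3: "the hungry lion",
}

index = {}            # document -> set of words it contains
inverted_index = {}   # word -> set of documents containing it
for doc_id, text in corpus.items():
    words = set(text.split())
    index[doc_id] = words
    for w in words:
        inverted_index.setdefault(w, set()).add(doc_id)

print(inverted_index["zebra"])   # {1, 2}
```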

6
Inverted Index
[diagram: an index and the corresponding inverted index for a small corpus]
7
A Simple Boolean Query
  • to answer the query "hungry AND zebra", take the
    intersection of the documents pointed to by
    "hungry" and the documents pointed to by "zebra"
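With posting lists stored as sets, the intersection is a one-liner. A sketch with a hypothetical toy inverted index:

```python
# Answer "hungry AND zebra" by intersecting the two posting sets.
inverted_index = {
    "hungry": {1, 3, 7},
    "zebra": {1, 2, 7, 9},
}
result = inverted_index["hungry"] & inverted_index["zebra"]
print(sorted(result))  # [1, 7]
```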

8
Other Things to Consider
  • How can we search on phrases?
  • Should we treat these queries differently?
  • a hungry zebra
  • the hungry zebra
  • hungry as a zebra
  • If we query on "laugh zebra", should we return
    documents containing the following?
  • "laughing zebra"
  • "laughable zebra"
  • Boolean queries are too coarse: they return too
    many or too few relevant documents.

9
Handling Phrases
[diagram: positional inverted index with word position lists]
  • store position information in the inverted index
  • to answer the query "hungry zebra", look for
    documents having "hungry" at position i and
    "zebra" at position i+1
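A sketch of that lookup, assuming a positional index that maps each word to a dictionary of document IDs and position lists (the example postings are made up):

```python
# Phrase query over a positional inverted index: "hungry zebra" matches
# document d if some occurrence of "hungry" at position i in d is
# immediately followed by "zebra" at position i+1.
positional_index = {
    "hungry": {1: [4, 20], 2: [7]},
    "zebra":  {1: [5, 38], 3: [2]},
}

def phrase_match(index, w1, w2):
    docs = set(index.get(w1, {})) & set(index.get(w2, {}))
    hits = set()
    for d in docs:
        pos2 = set(index[w2][d])
        if any(i + 1 in pos2 for i in index[w1][d]):
            hits.add(d)
    return hits

print(phrase_match(positional_index, "hungry", "zebra"))  # {1}
```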

10
Handling Phrases
  • but this is a primitive notion of phrase
  • we might want "zebras that are hungry" to be
    considered a match to the phrase "hungry zebra"
  • this requires doing sentence analysis:
    determining parts of speech for words, etc.

11
Stop Words
  • Should we treat these queries differently?
  • a hungry zebra
  • the hungry zebra
  • hungry as a zebra
  • Some systems employ a list of stop words (a.k.a.
    function words) that are probably not informative
    for most searches.
  • a, an, the, that, this, of, by, with, to
  • stop words in a query are ignored
  • but might be handled differently in phrases
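Filtering a query against a stop list is straightforward. A sketch using the slide's example stop words:

```python
# Drop stop words from a query before index lookup.
STOP_WORDS = {"a", "an", "the", "that", "this", "of", "by", "with", "to"}

def strip_stop_words(query):
    return [w for w in query.lower().split() if w not in STOP_WORDS]

print(strip_stop_words("the hungry zebra"))  # ['hungry', 'zebra']
```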

12
Stop Words
a, able, about, above, according, accordingly, across, actually, after,
afterwards, again, against, all, allow, allows, almost, alone, along,
already, also, although, always, am, among, amongst, an, and, another,
any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart,
appear, appreciate, appropriate, are, around, as, aside, ask, asking,
associated, at, available, away, awfully, b, be, became, because, become,
becomes, becoming, been, before, beforehand, behind, being, believe,
below, beside, besides, best, better, between, beyond, both, brief,
but, by, ...
13
A Special Purpose Stop List
Bos taurus, Botrytis cinerea, C. elegans, Chicken, Goat, Gorilla,
Guinea pig, Hamster, Human, Mouse, Pig, Rat, Spinach
unknown, gene, cDNA, DNA clone, BAC, PAC, cosmid clone,
genomic sequence, potentially degraded
14
Stemming
  • If we query on "laugh zebra", should we return
    documents containing the following?
  • laughing zebra
  • laughable zebra
  • Some systems perform stemming on words:
    truncating related words to a common stem.
  • laugh → laugh-
  • laughs → laugh-
  • laughing → laugh-
  • laughed → laugh-

15
Stemming
  • the Lovins stemmer
  • 260 suffix patterns
  • iterative longest match procedure

(.*)SSES → \1SS
(.*[AEIOU].*)ED → \1
  • the Porter stemmer
  • about 60 patterns grouped into sets
  • apply patterns in each set before moving to next
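The iterative longest-match idea can be sketched as follows. The tiny rule list is illustrative only; it is not the actual Lovins or Porter rule set:

```python
# Longest-match suffix stripping: try suffix rules from longest to
# shortest, applying the first one that matches and leaves a stem of
# at least 3 characters.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ings", ""),
    ("able", ""),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def stem(word):
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ["laugh", "laughs", "laughing", "laughed"]:
    print(w, "->", stem(w))   # all stem to "laugh"
```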

16
Stemming
  • May be helpful
  • reduces vocabulary size by 10-50%
  • may increase recall
  • May not be helpful
  • for some queries, the sense of a word is
    important
  • stemming algorithms are heuristic and may
    conflate semantically different words (e.g.
    "gall" and "gallery")
  • As with stop words, might want to handle stemming
    differently in phrases

17
The Vector Space Model
  • Boolean queries are too coarse: they return too
    many or too few relevant documents.
  • Most IR systems are based on the vector space
    model

18
The Vector Space Model
  • documents/queries represented by vectors in a
    high-dimensional space
  • each dimension corresponds to a word in the
    vocabulary
  • most relevant documents are those whose vectors
    are closest to query vector

19
Vector Similarity
  • one way to determine vector similarity is the
    cosine measure
  • if the vectors are normalized, we can simply take
    their dot product
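The cosine measure is the dot product divided by the product of the vector lengths; for unit-length vectors the denominator is 1 and only the dot product remains. A sketch with hypothetical term-weight vectors:

```python
# Cosine similarity between a query vector and document vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1.0, 1.0, 0.0]
doc1  = [0.8, 0.9, 0.1]   # mostly overlaps the query's dimensions
doc2  = [0.1, 0.0, 1.0]   # mostly a different dimension
print(cosine(query, doc1) > cosine(query, doc2))  # True
```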

20
Determining Word Weights
  • lots of heuristics
  • one well-established scheme is TFIDF (term
    frequency, inverse document frequency) weighting
  • the numerator includes the number of occurrences
    of the word in the document
  • the denominator includes the total number of
    occurrences of the word in the corpus

21
TFIDF One Form
(N = total number of words in the corpus)
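A sketch of one TFIDF form consistent with the slide's definitions: weight(w, d) = f(w, d) × log(N / f(w)), where f(w, d) counts w in document d, f(w) counts w in the whole corpus, and N is the total number of words in the corpus. The names f(w, d), f(w), and the toy corpus are my own; the slide's exact formula image was not preserved.

```python
# TFIDF weighting: frequent-in-document, rare-in-corpus words score high.
import math

corpus = {
    1: "the hungry zebra laughed".split(),
    2: "the zebra grazed".split(),
}

N = sum(len(words) for words in corpus.values())   # total words in corpus
corpus_freq = {}                                   # f(w)
for words in corpus.values():
    for w in words:
        corpus_freq[w] = corpus_freq.get(w, 0) + 1

def tfidf(word, doc_id):
    f_wd = corpus[doc_id].count(word)              # f(w, d)
    return f_wd * math.log(N / corpus_freq[word])

# A rare word outweighs a common one in the same document.
print(tfidf("hungry", 1) > tfidf("the", 1))  # True
```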
22
The Probability Ranking Principle
  • most IR systems are based on the premise that
    ranking documents in order of decreasing
    probability is the right thing to do
  • assumes documents are independent
  • does the wrong thing with duplicates
  • doesn't promote diversity in returned documents