Title: Introduction to Information Retrieval (IR)
1. Introduction to Information Retrieval (IR)
- Mark Craven
- craven_at_cs.wisc.edu
- craven_at_biostat.wisc.edu
- 5730 Medical Sciences Center
2. Documents and Corpora
- document: a passage of free text or hypertext
- Usenet posting
- Web page
- newswire story
- MEDLINE abstract
- journal article
- corpus (pl. corpora): a collection of documents
- MEDLINE
- Reuters stories from 1999
- the Web
3. The Ad-Hoc Retrieval Problem
- given
- a document collection (corpus)
- an arbitrary query
- do
- return a list of relevant documents
- this is the problem addressed by Web search engines
4. Typical IR System
[figure: architecture of a typical IR system, built around an inverted index]
5. The Index and Inverse Index
- index: a relation mapping each document to the set of keywords it is about
- where do these come from?
6. Inverted Index
[figure: an index and the corresponding inverted index over a corpus]
7. A Simple Boolean Query
- to answer the query hungry AND zebra, take the intersection of the documents pointed to by hungry and the documents pointed to by zebra
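This intersection of posting sets can be sketched in a few lines; the tiny corpus and document IDs below are made up for illustration.

```python
# Sketch of an inverted index supporting a Boolean AND query.
corpus = {
    1: "the hungry zebra grazes",
    2: "a zebra sleeps",
    3: "the hungry lion hunts",
}

# Build the inverted index: word -> set of document IDs.
index = {}
for doc_id, text in corpus.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def boolean_and(term1, term2):
    """Answer 'term1 AND term2' by intersecting posting sets."""
    return index.get(term1, set()) & index.get(term2, set())

print(boolean_and("hungry", "zebra"))  # {1}: only doc 1 contains both terms
```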
8. Other Things to Consider
- How can we search on phrases?
- Should we treat these queries differently?
- a hungry zebra
- the hungry zebra
- hungry as a zebra
- If we query on laugh zebra, should we return documents containing the following?
- laughing zebra
- laughable zebra
- Boolean queries are too coarse: they return too many or too few relevant documents.
9. Handling Phrases
[figure: inverted index augmented with word positions]
- store position information in the inverted index
- to answer the query hungry zebra, look for documents having hungry at position i and zebra at position i + 1
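A positional index and the adjacency check can be sketched as follows; the example corpus is made up for illustration.

```python
# Sketch of a positional inverted index for phrase queries.
corpus = {
    1: "the hungry zebra grazes",
    2: "zebra hungry never",
}

# Map word -> {doc_id: [positions]}.
index = {}
for doc_id, text in corpus.items():
    for pos, word in enumerate(text.split()):
        index.setdefault(word, {}).setdefault(doc_id, []).append(pos)

def phrase_match(w1, w2):
    """Find docs where w1 occurs at position i and w2 at position i + 1."""
    hits = []
    for doc_id, positions in index.get(w1, {}).items():
        next_positions = set(index.get(w2, {}).get(doc_id, []))
        if any(p + 1 in next_positions for p in positions):
            hits.append(doc_id)
    return hits

print(phrase_match("hungry", "zebra"))  # [1]: the words are adjacent only in doc 1
```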
10. Handling Phrases
- but this is a primitive notion of phrase
- we might want zebras that are hungry to be considered a match to the phrase hungry zebra
- this requires doing sentence analysis: determining parts of speech for words, etc.
11. Stop Words
- Should we treat these queries differently?
- a hungry zebra
- the hungry zebra
- hungry as a zebra
- Some systems employ a list of stop words (a.k.a. function words) that are probably not informative for most searches.
- a, an, the, that, this, of, by, with, to
- stop words in a query are ignored
- but might be handled differently in phrases
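Stop-word removal can be sketched with a tiny list; a real system would use a much longer one, like the list on the next slide.

```python
# Sketch of stop-word removal from a query (tiny illustrative stop list).
STOP_WORDS = {"a", "an", "the", "that", "this", "of", "by", "with", "to", "as"}

def remove_stop_words(query):
    """Drop stop words; the remaining terms are used for retrieval."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

# All three example queries reduce to the same term list.
print(remove_stop_words("a hungry zebra"))     # ['hungry', 'zebra']
print(remove_stop_words("the hungry zebra"))   # ['hungry', 'zebra']
print(remove_stop_words("hungry as a zebra"))  # ['hungry', 'zebra']
```

This is why a plain keyword engine treats the three queries identically: after stop-word removal they are indistinguishable.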
12. Stop Words
a able about above according accordingly across actually after afterwards again against all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by ...
13. A Special Purpose Stop List
Bos taurus Botrytis cinerea C. elegans Chicken Goat Gorilla Guinea pig Hamster Human Mouse Pig Rat Spinach
unknown gene cDNA DNA clone BAC PAC cosmid clone genomic sequence potentially degraded
14. Stemming
- If we query on laugh zebra, should we return documents containing the following?
- laughing zebra
- laughable zebra
- Some systems perform stemming on words: truncating related words to a common stem.
- laugh → laugh-
- laughs → laugh-
- laughing → laugh-
- laughed → laugh-
15. Stemming
- the Lovins stemmer
- 260 suffix patterns
- iterative longest match procedure
- example rules: (.)SSES → 1SS, (.AEIOU.)ED → 1
- the Porter stemmer
- about 60 patterns grouped into sets
- apply patterns in each set before moving to the next
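The longest-match idea can be sketched with a toy suffix stripper; the real stemmers have hundreds of patterns with context conditions, and the tiny rule list below is made up for illustration.

```python
# Toy longest-match suffix stripper in the spirit of the stemmers above.
SUFFIXES = ["ing", "able", "ed", "s"]  # checked longest-first

def stem(word):
    """Strip the longest matching suffix, leaving a common stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Guard against over-stripping very short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["laugh", "laughs", "laughing", "laughable", "laughed"]:
    print(w, "->", stem(w))  # all map to the stem "laugh"
```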
16. Stemming
- May be helpful
- reduces vocabulary 10-50%
- may increase recall
- May not be helpful
- for some queries, the sense of a word is important
- stemming algorithms are heuristic; they may conflate semantically different words (e.g. gall and gallery)
- As with stop words, might want to handle stemming differently in phrases
17. The Vector Space Model
- Boolean queries are too coarse: they return too many or too few relevant documents.
- Most IR systems are based on the vector space model
18. The Vector Space Model
- documents/queries represented by vectors in a high-dimensional space
- each dimension corresponds to a word in the vocabulary
- most relevant documents are those whose vectors are closest to the query vector
19. Vector Similarity
- one way to determine vector similarity is the cosine measure: cos(q, d) = (q · d) / (|q| |d|)
- if the vectors are normalized, we can simply take their dot product
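The cosine measure can be sketched directly; the term weights below are made-up numbers for illustration.

```python
# Sketch of cosine similarity between a query vector and document vectors.
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1.0, 1.0, 0.0]
doc1 = [2.0, 2.0, 0.0]  # same direction as the query, just longer
doc2 = [0.0, 1.0, 3.0]

print(cosine(query, doc1))  # close to 1.0: direction matters, length does not
print(cosine(query, doc2) < cosine(query, doc1))  # True: doc1 ranks higher
```

Because the measure depends only on the angle between vectors, a long document is not favored over a short one with the same word proportions.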
20. Determining Word Weights
- lots of heuristics
- one well established one is TFIDF (term frequency, inverse document frequency) weighting
- numerator includes n(w, d), the number of occurrences of word w in document d
- denominator includes n(w), the total number of occurrences of w in the corpus
21. TFIDF One Form
weight(w, d) = n(w, d) × log( N / n(w) )
(N = total number of words in the corpus)
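A sketch of this form of TFIDF, with a made-up three-document corpus for illustration: the weight grows with occurrences of the word in the document and shrinks as the word becomes common in the corpus.

```python
# Sketch of one TFIDF form: weight(w, d) = n(w, d) * log(N / n(w)).
import math

corpus = {
    1: "hungry zebra hungry",
    2: "zebra sleeps",
    3: "lion hunts zebra",
}

N = sum(len(text.split()) for text in corpus.values())  # total words in corpus

def n_occurrences(word, text):
    """Count occurrences of a word in one document's text."""
    return text.split().count(word)

def tfidf(word, doc_id):
    tf = n_occurrences(word, corpus[doc_id])
    corpus_count = sum(n_occurrences(word, t) for t in corpus.values())
    return tf * math.log(N / corpus_count) if corpus_count else 0.0

# "hungry" is rarer in the corpus than "zebra", so it is weighted higher.
print(tfidf("hungry", 1) > tfidf("zebra", 1))  # True
```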
22. The Probability Ranking Principle
- most IR systems are based on the premise that ranking documents in order of decreasing probability of relevance is the right thing to do
- assumes documents are independent
- does the wrong thing with duplicates
- doesn't promote diversity in returned documents