Title: Information Retrieval
1Information Retrieval
January 7, 2005
2Course Information
- Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: TBA
- Course page: http://tangra.si.umich.edu/radev/650/
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3Introduction
4IR systems
- Google
- Vivísimo
- AskJeeves
- NSIR
- Lemur
- MG
- Nutch
5Examples of IR systems
- Conventional (library catalog). Search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, etc.).
- Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.
8Need for IR
- Advent of WWW - more than 4 billion documents indexed on Google
- How much information? 200 TB according to Lyman and Varian 2003
- http://www.sims.berkeley.edu/research/projects/how-much-info/
- Search, routing, filtering
- Users' information need
9Some definitions of Information Retrieval (IR)
Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests."
Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."
10Syllabus (Part I)
- Introduction. Information Needs and Queries.
- Document preprocessing. Stemming. Document representations. TFIDF. Indexing and Searching. Inverted indexes.
- IR Models. The Vector model. The Boolean model.
- Retrieval Evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- Queries and Documents. Query Languages. Natural language querying.
- Word distributions. The Zipf distribution.
- Relevance feedback and query expansion.
- Approximate matching.
- Compression.
- Vector space similarity and clustering. k-means clustering.
11Syllabus (Part II)
- Document classification. k-nearest neighbors. Naive Bayes. Support vector machines.
- Singular value decomposition and Latent Semantic Indexing.
- Probabilistic models. Document models. Language models.
- Crawling the Web. Hyperlink analysis. Measuring the Web.
- Hypertext retrieval. Web-based IR.
- Social network analysis for IR. Hubs and authorities. PageRank and HITS.
- Focused crawling. Resource discovery. Discovering communities.
- Collaborative filtering.
- Information extraction using Hidden Markov Models.
- Additional topics, e.g., relevance transfer, XML retrieval, text tiling, text summarization, question answering.
12Readings
- Books
- 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
- 2. Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003, ISBN 0-470-84906-1.
- Papers (tentative list)
- Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
- Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
- Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
- Bush "As We May Think" The Atlantic Monthly 1945
- Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
- Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
- Davison "Topical locality on the Web" SIGIR 2000
- Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
- Deerwester, Dumais, Landauer, Furnas, and Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990
13Readings
- Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
- Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
- Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
- Haveliwala "Topic-sensitive PageRank" WWW 2002
- Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, and Upfal "The Web as a graph" PODS 2000
- Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
- Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
- Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
- Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
- Radev, Fan, Qi, Wu, and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
- Singhal "Modern Information Retrieval: An Overview" IEEE 2001
14Assignments
- Homeworks
- The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.
- Project
- The final course project can be done in three different formats:
- (1) a programming project implementing a challenging and novel information retrieval application,
- (2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or
- (3) a SIGIR-style experimental IR paper.
15Grading
- Three HW assignments (30%)
- Project (30%)
- Final (40%)
16Sample queries (from Excite)
- In what year did baseball become an offical sport?
- play station codes . com
- birth control and depression
- government
- "WorkAbility I" conference
- kitchen appliances
- where can I find a chines rosewood
- tiger electronics
- 58 Plymouth Fury
- How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
- emeril Lagasse
- Hubble
- M.S Subalaksmi
- running
17Types of queries (AltaVista)
Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box. Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie oatmeal -raisin.
Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings. Example: Try wish*, to find wish, wishes, wishful, wishbone, and wishy-washy.
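A toy sketch of the inclusion/exclusion and wildcard behavior described above, over a hypothetical three-document collection; query parsing is omitted and required/excluded terms are passed in directly, so this is only an illustration of the semantics, not of any real engine.

```python
import fnmatch

# Hypothetical mini-collection (illustrative only).
docs = {
    1: "oatmeal raisin cookie recipe",
    2: "oatmeal walnut cookie recipe",
    3: "chocolate chip cookie recipe",
}

def matches(text, required=(), excluded=(), wildcard=None):
    """True if all required (+) terms appear, no excluded (-) terms appear,
    and, optionally, some token matches a wildcard pattern such as 'wish*'."""
    tokens = text.split()
    if any(t not in tokens for t in required):
        return False
    if any(t in tokens for t in excluded):
        return False
    if wildcard and not any(fnmatch.fnmatch(t, wildcard) for t in tokens):
        return False
    return True

# Analogue of: recipe cookie oatmeal -raisin
hits = [d for d, text in docs.items()
        if matches(text, required=["recipe", "cookie", "oatmeal"],
                   excluded=["raisin"])]
print(hits)
```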
18Types of queries
AND (&): Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.
OR (|): Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.
NOT (!): Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone - use it with another operator, like AND.
NEAR (~): Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
19Mappings and abstractions
[Diagram from Korfhage's book: mappings between Reality and Data, and between an Information need and a Query.]
20Typical IR system
- (Crawling)
- Indexing
- Retrieval
- User interface
21Key Terms Used in IR
- QUERY: a representation of what the user is looking for - can be a list of words or a phrase
- DOCUMENT: an information entity that the user wants to retrieve
- COLLECTION: a set of documents
- INDEX: a representation of information that makes querying easier
- TERM: word or concept that appears in a document or a query
22Other important terms
- Classification
- Cluster
- Similarity
- Information Extraction
- Term Frequency
- Inverse Document Frequency
- Precision
- Recall
- Inverted File
- Query Expansion
- Relevance
- Relevance Feedback
- Stemming
- Stopword
- Vector Space Model
- Weighting
- TREC/TIPSTER/MUC
23Query structures
- Query viewed as a document?
- Length
- repetitions
- syntactic differences
- Types of matches
- exact
- range
- approximate
24Additional references on IR
- Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
- Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
- Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
- C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
- ACM SIGIR Proceedings, SIGIR Forum
- ACM conferences in Digital Libraries
25Related courses elsewhere
- Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) http://www.stanford.edu/class/cs276a/
- Cornell (Jon Kleinberg) http://www.cs.cornell.edu/Courses/cs685/2002fa/
- CMU (Yiming Yang and Jamie Callan) http://krakow.lti.cs.cmu.edu/classes/11-741/2004/index.html
- UMass (James Allan) http://ciir.cs.umass.edu/cmpsci646/
- UTexas (Ray Mooney) http://www.cs.utexas.edu/users/mooney/ir-course/
- Illinois (Chengxiang Zhai) http://sifaka.cs.uiuc.edu/course/498cxz04f/
- Johns Hopkins (David Yarowsky) http://www.cs.jhu.edu/yarowsky/cs466.html
26Readings for weeks 1-3
- MIR (Modern Information Retrieval)
- Week 1
- Chapter 1 Introduction
- Chapter 2 Modeling
- Chapter 3 Evaluation
- Week 2
- Chapter 4 Query languages
- Chapter 5 Query operations
- Week 3
- Chapter 6 Text and multimedia languages
- Chapter 7 Text operations
- Chapter 8 Indexing and searching
27Documents
28Documents
- Not just printed paper
- collections vs. documents
- data structures and representations
- Bag of words method
- document surrogates: keywords, summaries
- encoding: ASCII, Unicode, etc.
29Document preprocessing
- Formatting
- Tokenization (Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
- Casing (cat vs. CAT)
- Stemming (computer, computation)
- Soundex
30Document representations
- Term-document matrix (m x n)
- term-term matrix (m x m x n)
- document-document matrix (n x n)
- Example: 3,000,000 documents (n) with 50,000 terms (m)
- sparse matrices
- Boolean vs. integer matrices
31Document representations
- Term-document matrix
- Evaluating queries (e.g., (A?B)?C)
- Storage issues
- Inverted files
- Storage issues
- Evaluating queries
- Advantages and disadvantages
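A minimal sketch, assuming a toy three-document collection (the documents and terms below are illustrative), contrasting the two representations discussed above: a dense term-document incidence matrix and the equivalent inverted file, which stores only the nonzero entries and is therefore far more compact when the matrix is sparse.

```python
from collections import defaultdict

# Toy collection (illustrative documents).
docs = ["information retrieval systems",
        "text retrieval",
        "web search engines"]

terms = sorted({t for d in docs for t in d.split()})

# Term-document incidence matrix (m x n): rows are terms, columns are documents.
matrix = [[1 if t in d.split() else 0 for d in docs] for t in terms]
for t, row in zip(terms, matrix):
    print(f"{t:12s}", row)

# Inverted file: each term maps to the list of documents containing it.
inverted = defaultdict(list)
for j, d in enumerate(docs):
    for t in set(d.split()):
        inverted[t].append(j)
print(dict(inverted))
```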
32Additional issues
- Dealing with phrases?
- Proximity search
- Synonyms?
33Porter's algorithm
Example: the word "duplicatable"
duplicat    (rule 4)
duplicate   (rule 1b1)
duplic      (rule 3)
Another rule in step 4, removing "ic", cannot be applied, since only one rule from each step may be applied.
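For experimentation, a minimal sketch using the Porter stemmer shipped with NLTK (assuming the package is installed); the output may differ in small details from the hand-worked derivation above, since implementations of the algorithm vary slightly.

```python
# Quick way to try Porter stemming; requires the NLTK package.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["duplicatable", "computer", "computation", "wishes", "wishful"]:
    print(word, "->", stemmer.stem(word))
```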
34Porter's algorithm
35Relevance feedback
- Automatic
- Manual
- Method: identifying feedback terms
- Q' = a1*Q + a2*R - a3*N
- Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
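A small numeric sketch of this update, assuming R and N denote the sums of the relevant and non-relevant document vectors and that documents and the query are already encoded as term-frequency vectors over a shared vocabulary; the five-term vocabulary and vectors below are illustrative, not taken from the slides.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Relevance-feedback update Q' = a1*Q + a2*sum(R) - a3*sum(N),
    with a2 = 1/|R| and a3 = 1/|N| by default."""
    a2 = 1.0 / len(relevant) if a2 is None else a2
    a3 = 1.0 / len(nonrelevant) if a3 is None else a3
    r = np.sum(relevant, axis=0)      # sum of relevant document vectors
    n = np.sum(nonrelevant, axis=0)   # sum of non-relevant document vectors
    return a1 * np.asarray(query) + a2 * r - a3 * n

# Illustrative five-term vocabulary: [safety, minivans, car, tests, injury]
q  = np.array([1, 1, 0, 0, 0])
d1 = np.array([1, 1, 1, 1, 1])   # judged relevant
d2 = np.array([1, 0, 0, 1, 0])   # judged relevant
d3 = np.array([0, 0, 1, 0, 1])   # judged non-relevant
print(rocchio(q, [d1, d2], [d3]))
```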
36Example
- Q: safety minivans
- D1: car safety minivans tests injury statistics - relevant
- D2: liability tests safety - relevant
- D3: car passengers injury reviews - non-relevant
- R = ?
- S = ?
- Q' = ?
37Approximate string matching
- The Soundex algorithm (Odell and Russell)
- Uses
- spelling correction
- hash function
- non-recoverable
38The Soundex algorithm
- 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
- 2. Assign the following numbers to the remaining letters after the first:
- b, f, p, v - 1
- c, g, j, k, q, s, x, z - 2
- d, t - 3
- l - 4
- m, n - 5
- r - 6
39The Soundex algorithm
- 3. If two or more letters with the same code were adjacent in the original name, omit all but the first
- 4. Convert to the form LDDD by adding terminal zeros or by dropping rightmost digits
- Examples:
- Euler E460, Gauss G200, Hilbert H416, Knuth K530, Lloyd L300
- same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
- Some problems: Rogers and Rodgers, Sinclair and St. Clair
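A compact sketch of the four steps, following the simplified classroom variant described above; production Soundex implementations differ in minor details (e.g. the treatment of 'h' and 'w'), so treat this as an illustration rather than a reference implementation.

```python
def soundex(name):
    """Soundex code per the four steps on the slides (simplified variant)."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4", "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    digits = [codes.get(c, "") for c in name]   # a,e,h,i,o,u,w,y get no code
    kept = [name[0].upper()]                    # step 1: retain first letter
    for i in range(1, len(name)):
        # step 3: skip a letter whose code repeats the previous letter's code;
        # step 1: vowels etc. (empty code) are simply dropped.
        if digits[i] and digits[i] != digits[i - 1]:
            kept.append(digits[i])
    # step 4: pad with zeros or truncate to the form LDDD.
    return ("".join(kept) + "000")[:4]

for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Ellery", "Ladd"]:
    print(n, soundex(n))
```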
40IR models
41Major IR models
- Boolean
- Vector
- Probabilistic
- Language modeling
- Fuzzy retrieval
- Latent semantic indexing
42Major IR tasks
- Ad-hoc
- Filtering and routing
- Question answering
- Spoken document retrieval
- Multimedia retrieval
43Venn diagrams
[Venn diagram: documents D1 and D2 shown as overlapping sets of terms w, x, y, z.]
44Boolean model
[Venn diagram: two overlapping sets A and B illustrating Boolean combinations.]
45Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive
- What types of documents are returned?
- Stemming
- thesaurus expansion
- inclusive vs. exclusive OR
- confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
46Boolean queries
- Weighting (Beethoven AND sonatas)
- precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
- Use of negation: potential problems
- Conjunctive and Disjunctive normal forms
- Full CNF and DNF
47Transformations
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
- CNF or DNF?
- Reference librarians prefer CNF - why?
48Boolean model
- Partition
- Partial relevance?
- Operators AND, NOT, OR, parentheses
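A minimal sketch of Boolean retrieval over an inverted index, reusing the four documents from the exercise on the next slide; the sample queries shown are illustrative and simply map each operator onto a set operation.

```python
from collections import defaultdict

docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

all_docs = set(docs)

# Boolean operators correspond to set operations on posting sets.
print(index["information"] & index["retrieval"])   # AND
print(index["information"] | index["computer"])    # OR
print(index["information"] - index["computer"])    # AND NOT
print(all_docs - index["retrieval"])                # NOT
```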
49Exercise
- D1 computer information retrieval
- D2 computer retrieval
- D3 information
- D4 computer information
- Q1 information ? retrieval
- Q2 information ? computer
50Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
51Stop lists
- The 250-300 most common words in English account for 50% or more of a given text.
- Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" - another 10%; the next 12 words - another 10%.
- Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (about 50%).
- Token/type ratio: 2256/859 = 2.63
52Vector models
[Figure: three documents (Doc 1, Doc 2, Doc 3) plotted as vectors in a space spanned by Term 1, Term 2, and Term 3.]
53Vector queries
- Each document is represented as a vector
- inefficient representations (bit vectors)
- dimensional compatibility
54The matching process
- Document space
- Matching is done between a document and a query (or between two documents)
- distance vs. similarity
- Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
55Miscellaneous similarity measures
Cosine: σ(D,Q) = Σ (d_i × q_i) / sqrt( Σ (d_i)² × Σ (q_i)² ), or in set form σ(D,Q) = |X ∩ Y| / sqrt(|X| × |Y|)
Jaccard: σ(D,Q) = |X ∩ Y| / |X ∪ Y|
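A small sketch computing several of these measures for two term-count vectors; the vectors are illustrative and are not the D1-D3 of the exercise that follows.

```python
import math

def cosine(d, q):
    """sum(d_i*q_i) / (sqrt(sum d_i^2) * sqrt(sum q_i^2))"""
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / (math.sqrt(sum(di * di for di in d)) *
                  math.sqrt(sum(qi * qi for qi in q)))

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

def jaccard(d, q):
    """Set-based Jaccard over the terms present in each vector."""
    x = {i for i, v in enumerate(d) if v}
    y = {i for i, v in enumerate(q) if v}
    return len(x & y) / len(x | y)

# Illustrative term-count vectors over a five-term vocabulary.
d1, d2 = [1, 2, 0, 1, 0], [0, 1, 1, 1, 0]
print(cosine(d1, d2), euclidean(d1, d2), manhattan(d1, d2), jaccard(d1, d2))
```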
56Exercise
- Compute the cosine measures σ(D1,D2) and σ(D1,D3) for the documents D1, D2, and D3.
- Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.