Title: Information Retrieval
1Information Retrieval
January 7, 2005
2Course Information
- Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: TBA
- Course page: http://tangra.si.umich.edu/radev/650/
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3Introduction
4IR systems
- Google
- Vivísimo
- AskJeeves
- NSIR
- Lemur
- MG
- Nutch
5Examples of IR systems
- Conventional (library catalog). Search by keyword, title, author, etc.
- Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
- Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, etc.).
- Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.
8Need for IR
- Advent of WWW - more than 4 billion documents indexed on Google
- How much information? 200 TB according to Lyman and Varian 2003
- http://www.sims.berkeley.edu/research/projects/how-much-info/
- Search, routing, filtering
- Users' information need
9Some definitions of Information Retrieval (IR)
Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests."
Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."
10Syllabus (Part I)
- Introduction. Information Needs and Queries.
- Document preprocessing. Stemming. Document representations. TFIDF. Indexing and Searching. Inverted indexes.
- IR Models. The Vector model. The Boolean model.
- Retrieval Evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- Queries and Documents. Query Languages. Natural language querying.
- Word distributions. The Zipf distribution.
- Relevance feedback and query expansion.
- Approximate matching.
- Compression.
- Vector space similarity and clustering. k-means clustering.
11Syllabus (Part II)
- Document classification. k-nearest neighbors. Naive Bayes. Support vector machines.
- Singular value decomposition and Latent Semantic Indexing.
- Probabilistic models. Document models. Language models.
- Crawling the Web. Hyperlink analysis. Measuring the Web.
- Hypertext retrieval. Web-based IR.
- Social network analysis for IR. Hubs and authorities. PageRank and HITS.
- Focused crawling. Resource discovery. Discovering communities.
- Collaborative filtering.
- Information extraction using Hidden Markov Models.
- Additional topics, e.g., relevance transfer, XML retrieval, text tiling, text summarization, question answering.
12Readings
- Books
- 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
- 2. Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003, ISBN 0-470-84906-1.
- Papers (tentative list)
- Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
- Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
- Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
- Bush "As We May Think" The Atlantic Monthly 1945
- Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
- Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
- Davison "Topical locality on the Web" SIGIR 2000
- Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
- Deerwester, Dumais, Landauer, Furnas, and Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990
13Readings
- Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
- Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
- Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
- Haveliwala "Topic-sensitive PageRank" WWW 2002
- Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, and Upfal "The Web as a graph" PODS 2000
- Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
- Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
- Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
- Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
- Radev, Fan, Qi, Wu, and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
- Singhal "Modern Information Retrieval: An Overview" IEEE 2001
14Assignments
- Homeworks
- The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.
- Project
- The final course project can be done in three different formats:
- (1) a programming project implementing a challenging and novel information retrieval application,
- (2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or
- (3) a SIGIR-style experimental IR paper.
15Grading
- Three HW assignments (30%)
- Project (30%)
- Final (40%)
16Sample queries (from Excite)
- In what year did baseball become an offical sport?
- play station codes . com
- birth control and depression
- government
- "WorkAbility I" conference
- kitchen appliances
- where can I find a chines rosewood
- tiger electronics
- 58 Plymouth Fury
- How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
- emeril Lagasse
- Hubble
- M.S Subalaksmi
- running
17Types of queries (AltaVista)
Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box. Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie oatmeal -raisin.
Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings. Example: Try wish*, to find wish, wishes, wishful, wishbone, and wishy-washy.
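A toy sketch of the inclusion/exclusion and wildcard behavior described above, over a hypothetical three-document collection; query parsing is omitted and required/excluded terms are passed in directly, so this is only an illustration of the semantics, not of any real engine.

```python
import fnmatch

# Hypothetical mini-collection (illustrative only).
docs = {
    1: "oatmeal raisin cookie recipe",
    2: "oatmeal walnut cookie recipe",
    3: "chocolate chip cookie recipe",
}

def matches(text, required=(), excluded=(), wildcard=None):
    """True if all required (+) terms appear, no excluded (-) terms appear,
    and, optionally, some token matches a wildcard pattern such as 'wish*'."""
    tokens = text.split()
    if any(t not in tokens for t in required):
        return False
    if any(t in tokens for t in excluded):
        return False
    if wildcard and not any(fnmatch.fnmatch(t, wildcard) for t in tokens):
        return False
    return True

# Analogue of: recipe cookie oatmeal -raisin
hits = [d for d, text in docs.items()
        if matches(text, required=["recipe", "cookie", "oatmeal"],
                   excluded=["raisin"])]
print(hits)
```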
18Types of queries
AND (&): Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.
OR (|): Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.
NOT (!): Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone - use it with another operator, like AND.
NEAR (~): Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
19Mappings and abstractions
[Diagram from Korfhage's book: mappings between Reality and Data, and between an Information need and a Query.]
20Typical IR system
- (Crawling)
- Indexing
- Retrieval
- User interface
21Key Terms Used in IR
- QUERY: a representation of what the user is looking for - can be a list of words or a phrase
- DOCUMENT: an information entity that the user wants to retrieve
- COLLECTION: a set of documents
- INDEX: a representation of information that makes querying easier
- TERM: word or concept that appears in a document or a query
22Other important terms
- Classification
- Cluster
- Similarity
- Information Extraction
- Term Frequency
- Inverse Document Frequency
- Precision
- Recall
- Inverted File
- Query Expansion
- Relevance
- Relevance Feedback
- Stemming
- Stopword
- Vector Space Model
- Weighting
- TREC/TIPSTER/MUC
23Query structures
- Query viewed as a document?
- Length
- repetitions
- syntactic differences
- Types of matches
- exact
- range
- approximate
24Additional references on IR
- Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
- Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
- Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
- C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
- ACM SIGIR Proceedings, SIGIR Forum
- ACM conferences in Digital Libraries
25Related courses elsewhere
- Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) http://www.stanford.edu/class/cs276a/
- Cornell (Jon Kleinberg) http://www.cs.cornell.edu/Courses/cs685/2002fa/
- CMU (Yiming Yang and Jamie Callan) http://krakow.lti.cs.cmu.edu/classes/11-741/2004/index.html
- UMass (James Allan) http://ciir.cs.umass.edu/cmpsci646/
- UTexas (Ray Mooney) http://www.cs.utexas.edu/users/mooney/ir-course/
- Illinois (Chengxiang Zhai) http://sifaka.cs.uiuc.edu/course/498cxz04f/
- Johns Hopkins (David Yarowsky) http://www.cs.jhu.edu/yarowsky/cs466.html
26Readings for weeks 1-3
- MIR (Modern Information Retrieval)
- Week 1
- Chapter 1 Introduction
- Chapter 2 Modeling
- Chapter 3 Evaluation
- Week 2
- Chapter 4 Query languages
- Chapter 5 Query operations
- Week 3
- Chapter 6 Text and multimedia languages
- Chapter 7 Text operations
- Chapter 8 Indexing and searching
27Documents
28Documents
- Not just printed paper
- collections vs. documents
- data structures and representations
- Bag of words method
- document surrogates: keywords, summaries
- encoding: ASCII, Unicode, etc.
29Document preprocessing
- Formatting
- Tokenization (Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
- Casing (cat vs. CAT)
- Stemming (computer, computation)
- Soundex
30Document representations
- Term-document matrix (m x n)
- term-term matrix (m x m x n)
- document-document matrix (n x n)
- Example: 3,000,000 documents (n) with 50,000 terms (m)
- sparse matrices
- Boolean vs. integer matrices
31Document representations
- Term-document matrix
- Evaluating queries (e.g., (A?B)?C)
- Storage issues
- Inverted files
- Storage issues
- Evaluating queries
- Advantages and disadvantages
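A minimal sketch, assuming a toy three-document collection (the documents and terms below are illustrative), contrasting the two representations discussed above: a dense term-document incidence matrix and the equivalent inverted file, which stores only the nonzero entries and is therefore far more compact when the matrix is sparse.

```python
from collections import defaultdict

# Toy collection (illustrative documents).
docs = ["information retrieval systems",
        "text retrieval",
        "web search engines"]

terms = sorted({t for d in docs for t in d.split()})

# Term-document incidence matrix (m x n): rows are terms, columns are documents.
matrix = [[1 if t in d.split() else 0 for d in docs] for t in terms]
for t, row in zip(terms, matrix):
    print(f"{t:12s}", row)

# Inverted file: each term maps to the list of documents containing it.
inverted = defaultdict(list)
for j, d in enumerate(docs):
    for t in set(d.split()):
        inverted[t].append(j)
print(dict(inverted))
```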
32Additional issues
- Dealing with phrases?
- Proximity search
- Synonyms?
33Porter's algorithm
Example: the word "duplicatable"
duplicat    (rule 4)
duplicate   (rule 1b1)
duplic      (rule 3)
Another rule in step 4, removing "ic", cannot be applied, since only one rule from each step may be applied.
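For experimentation, a minimal sketch using the Porter stemmer shipped with NLTK (assuming the package is installed); the output may differ in small details from the hand-worked derivation above, since implementations of the algorithm vary slightly.

```python
# Quick way to try Porter stemming; requires the NLTK package.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["duplicatable", "computer", "computation", "wishes", "wishful"]:
    print(word, "->", stemmer.stem(word))
```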
34Porter's algorithm
35Relevance feedback
- Automatic
- Manual
- Method: identifying feedback terms
- Q' = a1*Q + a2*R - a3*N
- Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
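A small numeric sketch of this update, assuming R and N denote the sums of the relevant and non-relevant document vectors and that documents and the query are already encoded as term-frequency vectors over a shared vocabulary; the five-term vocabulary and vectors below are illustrative, not taken from the slides.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Relevance-feedback update Q' = a1*Q + a2*sum(R) - a3*sum(N),
    with a2 = 1/|R| and a3 = 1/|N| by default."""
    a2 = 1.0 / len(relevant) if a2 is None else a2
    a3 = 1.0 / len(nonrelevant) if a3 is None else a3
    r = np.sum(relevant, axis=0)      # sum of relevant document vectors
    n = np.sum(nonrelevant, axis=0)   # sum of non-relevant document vectors
    return a1 * np.asarray(query) + a2 * r - a3 * n

# Illustrative five-term vocabulary: [safety, minivans, car, tests, injury]
q  = np.array([1, 1, 0, 0, 0])
d1 = np.array([1, 1, 1, 1, 1])   # judged relevant
d2 = np.array([1, 0, 0, 1, 0])   # judged relevant
d3 = np.array([0, 0, 1, 0, 1])   # judged non-relevant
print(rocchio(q, [d1, d2], [d3]))
```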
36Example
- Q: safety minivans
- D1: car safety minivans tests injury statistics - relevant
- D2: liability tests safety - relevant
- D3: car passengers injury reviews - non-relevant
- R = ?
- S = ?
- Q' = ?
37Approximate string matching
- The Soundex algorithm (Odell and Russell)
- Uses
- spelling correction
- hash function
- non-recoverable
38The Soundex algorithm
- 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
- 2. Assign the following numbers to the remaining letters after the first:
- b, f, p, v - 1
- c, g, j, k, q, s, x, z - 2
- d, t - 3
- l - 4
- m, n - 5
- r - 6
39The Soundex algorithm
- 3. If two or more letters with the same code were adjacent in the original name, omit all but the first
- 4. Convert to the form LDDD by adding terminal zeros or by dropping rightmost digits
- Examples:
- Euler E460, Gauss G200, Hilbert H416, Knuth K530, Lloyd L300
- same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
- Some problems: Rogers and Rodgers, Sinclair and St. Clair
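A compact sketch of the four steps, following the simplified classroom variant described above; production Soundex implementations differ in minor details (e.g. the treatment of 'h' and 'w'), so treat this as an illustration rather than a reference implementation.

```python
def soundex(name):
    """Soundex code per the four steps on the slides (simplified variant)."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4", "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    digits = [codes.get(c, "") for c in name]   # a,e,h,i,o,u,w,y get no code
    kept = [name[0].upper()]                    # step 1: retain first letter
    for i in range(1, len(name)):
        # step 3: skip a letter whose code repeats the previous letter's code;
        # step 1: vowels etc. (empty code) are simply dropped.
        if digits[i] and digits[i] != digits[i - 1]:
            kept.append(digits[i])
    # step 4: pad with zeros or truncate to the form LDDD.
    return ("".join(kept) + "000")[:4]

for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Ellery", "Ladd"]:
    print(n, soundex(n))
```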
40IR models
41Major IR models
- Boolean
- Vector
- Probabilistic
- Language modeling
- Fuzzy retrieval
- Latent semantic indexing
42Major IR tasks
- Ad-hoc
- Filtering and routing
- Question answering
- Spoken document retrieval
- Multimedia retrieval
43Venn diagrams
[Venn diagram: documents D1 and D2 shown as overlapping sets of terms w, x, y, z.]
44Boolean model
[Venn diagram: two overlapping sets A and B illustrating Boolean combinations.]
45Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive
- What types of documents are returned?
- Stemming
- thesaurus expansion
- inclusive vs. exclusive OR
- confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
46Boolean queries
- Weighting (Beethoven AND sonatas)
- precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
- Use of negation: potential problems
- Conjunctive and Disjunctive normal forms
- Full CNF and DNF
47Transformations
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
- CNF or DNF?
- Reference librarians prefer CNF - why?
48Boolean model
- Partition
- Partial relevance?
- Operators AND, NOT, OR, parentheses
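A minimal sketch of Boolean retrieval over an inverted index, reusing the four documents from the exercise on the next slide; the sample queries shown are illustrative and simply map each operator onto a set operation.

```python
from collections import defaultdict

docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

all_docs = set(docs)

# Boolean operators correspond to set operations on posting sets.
print(index["information"] & index["retrieval"])   # AND
print(index["information"] | index["computer"])    # OR
print(index["information"] - index["computer"])    # AND NOT
print(all_docs - index["retrieval"])                # NOT
```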
49Exercise
- D1 computer information retrieval
- D2 computer retrieval
- D3 information
- D4 computer information
- Q1 information ? retrieval
- Q2 information ? computer
50Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
51Stop lists
- The 250-300 most common words in English account for 50% or more of a given text.
- Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" - another 10%; the next 12 words - another 10%.
- Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (about 50%).
- Token/type ratio: 2256/859 = 2.63
52Vector models
[Figure: three documents (Doc 1, Doc 2, Doc 3) plotted as vectors in a space spanned by Term 1, Term 2, and Term 3.]
53Vector queries
- Each document is represented as a vector
- inefficient representations (bit vectors)
- dimensional compatibility
54The matching process
- Document space
- Matching is done between a document and a query (or between two documents)
- distance vs. similarity
- Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
55Miscellaneous similarity measures
Cosine: σ(D,Q) = Σ (d_i × q_i) / sqrt( Σ (d_i)² × Σ (q_i)² ), or in set form σ(D,Q) = |X ∩ Y| / sqrt(|X| × |Y|)
Jaccard: σ(D,Q) = |X ∩ Y| / |X ∪ Y|
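A small sketch computing several of these measures for two term-count vectors; the vectors are illustrative and are not the D1-D3 of the exercise that follows.

```python
import math

def cosine(d, q):
    """sum(d_i*q_i) / (sqrt(sum d_i^2) * sqrt(sum q_i^2))"""
    dot = sum(di * qi for di, qi in zip(d, q))
    return dot / (math.sqrt(sum(di * di for di in d)) *
                  math.sqrt(sum(qi * qi for qi in q)))

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

def jaccard(d, q):
    """Set-based Jaccard over the terms present in each vector."""
    x = {i for i, v in enumerate(d) if v}
    y = {i for i, v in enumerate(q) if v}
    return len(x & y) / len(x | y)

# Illustrative term-count vectors over a five-term vocabulary.
d1, d2 = [1, 2, 0, 1, 0], [0, 1, 1, 1, 0]
print(cosine(d1, d2), euclidean(d1, d2), manhattan(d1, d2), jaccard(d1, d2))
```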
56Exercise
- Compute the cosine measures σ(D1,D2) and σ(D1,D3) for the documents D1, D2, and D3.
- Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.