Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval
January 7, 2005
  • Handout 1

2
Course Information
  • Instructor: Dragomir R. Radev (radev@si.umich.edu)
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: TBA
  • Course page: http://tangra.si.umich.edu/radev/650/
  • Class meets on Fridays, 2:10-4:55 PM in 409 West
    Hall

3
Introduction
4
IR systems
  • Google
  • Vivísimo
  • AskJeeves
  • NSIR
  • Lemur
  • MG
  • Nutch

5
Examples of IR systems
  • Conventional (library catalog). Search by
    keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, FAST). Search by
    keywords. Limited search using queries in natural
    language.
  • Multimedia (QBIC, WebSeek, SaFe). Search by visual
    appearance (shapes, colors, ...).
  • Question answering systems (AskJeeves, NSIR,
    Answerbus). Search in (restricted) natural language.

6
(No Transcript)
7
(No Transcript)
8
Need for IR
  • Advent of WWW - more than 4 billion documents
    indexed on Google
  • How much information? 200 TB according to Lyman
    and Varian (2003).
  • http://www.sims.berkeley.edu/research/projects/how-much-info/
  • Search, routing, filtering
  • Users' information needs

9
Some definitions of Information Retrieval (IR)
Salton (1989): "Information-retrieval systems
process files of records and requests for
information, and identify and retrieve from the
files certain records in response to the
information requests. The retrieval of particular
records depends on the similarity between the
records and the queries, which in turn is
measured by comparing the values of certain
attributes to records and information requests."

Kowalski (1997): "An Information Retrieval System
is a system that is capable of storage,
retrieval, and maintenance of information.
Information in this context can be composed of
text (including numeric and date data), images,
audio, video, and other multi-media objects."
10
Syllabus (Part I)
  • Introduction. Information Needs and Queries.
  • Document preprocessing. Stemming. Document
    representations. TFIDF. Indexing and Searching.
    Inverted indexes
  • IR Models. The Vector model. The Boolean model.
  • Retrieval Evaluation. Precision and Recall.
    F-measure. Reference collections. The TREC
    conferences.
  • Queries and Documents. Query Languages. Natural
    language querying
  • Word distributions. The Zipf distribution.
  • Relevance feedback and query expansion.
  • Approximate matching.
  • Compression.
  • Vector space similarity and clustering. k-means
    clustering.

11
Syllabus (Part II)
  • Document classification. k-nearest neighbors.
    Naive Bayes. Support vector machines.
  • Singular value decomposition and Latent Semantic
    Indexing.
  • Probabilistic models. Document models. Language
    models.
  • Crawling the Web. Hyperlink analysis. Measuring
    the Web.
  • Hypertext retrieval. Web-based IR.
  • Social network analysis for IR. Hubs and
    authorities. PageRank and HITS.
  • Focused crawling. Resource discovery. Discovering
    communities.
  • Collaborative filtering.
  • Information extraction using Hidden Markov
    Models.
  • Additional topics, e.g., relevance transfer, XML
    retrieval, text tiling, text summarization,
    question answering.

12
Readings
  • Books
  • 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto,
    Modern Information Retrieval, Addison-Wesley/ACM
    Press, 1999.
  • 2. Pierre Baldi, Paolo Frasconi, and Padhraic Smyth,
    Modeling the Internet and the Web: Probabilistic
    Methods and Algorithms, Wiley, 2003, ISBN
    0-470-84906-1.
  • Papers (tentative list)
  • Barabasi and Albert "Emergence of scaling in
    random networks" Science (286) 509-512, 1999
  • Bharat and Broder "A technique for measuring the
    relative size and overlap of public Web search
    engines" WWW 1998
  • Brin and Page "The Anatomy of a Large-Scale
    Hypertextual Web Search Engine" WWW 1998
  • Bush "As we may thing" The Atlantic Monthly 1945
  • Chakrabarti, van den Berg, and Dom "Focused
    Crawling" WWW 1999
  • Cho, Garcia-Molina, and Page "Efficient Crawling
    Through URL Ordering" WWW 1998
  • Davison "Topical locality on the Web" SIGIR 2000
  • Dean and Henzinger "Finding related pages in the
    World Wide Web" WWW 1999
  • Deerwester, Dumais, Landauer, Furnas, Harshman
    "Indexing by latent semantic analysis" JASIS
    41(6) 1990

13
Readings
  • Erkan and Radev "LexRank: Graph-based Lexical
    Centrality as Salience in Text Summarization"
    JAIR 22, 2004
  • Jeong and Barabasi "Diameter of the world wide
    web" Nature (401) 130-131, 1999
  • Hawking, Voorhees, Craswell, and Bailey "Overview
    of the TREC-8 Web Track" TREC 2000
  • Haveliwala "Topic-sensitive pagerank" WWW 2002
  • Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins,
    Upfal "The Web as a graph" PODS 2000
  • Lawrence and Giles "Accessibility of information
    on the Web" Nature (400) 107-109, 1999
  • Lawrence and Giles "Searching the World-Wide Web"
    Science (280) 98-100, 1998
  • Menczer "Links tell us about lexical and semantic
    Web content" arXiv 2001
  • Page, Brin, Motwani, and Winograd "The PageRank
    citation ranking: Bringing order to the Web"
    Stanford TR, 1998
  • Radev, Fan, Qi, Wu and Grewal "Probabilistic
    Question Answering on the Web" JASIST 2005
  • Singhal "Modern Information Retrieval an
    Overview" IEEE 2001

14
Assignments
  • Homeworks
  • The course will have three homework assignments
    in the form of problem sets. Each problem set
    will include essay-type questions, questions
    designed to show understanding of specific
    concepts, and hands-on exercises involving
    existing IR engines.
  • Project
  • The final course project can be done in three
    different formats:
  • (1) a programming project implementing a
    challenging and novel information retrieval
    application,
  • (2) an extensive survey-style research paper
    providing an exhaustive look at an area of IR, or
  • (3) a SIGIR-style experimental IR paper.

15
Grading
  • Three HW assignments (30%)
  • Project (30%)
  • Final (40%)

16
Sample queries (from Excite)
  • In what year did baseball become an offical
    sport?
  • play station codes . com
  • birth control and depression
  • government
  • "WorkAbility I"conference
  • kitchen appliances
  • where can I find a chines rosewood
  • tiger electronics
  • 58 Plymouth Fury
  • How does the character Seyavash in Ferdowsi's
    Shahnameh exhibit characteristics of a hero?
  • emeril Lagasse
  • Hubble
  • M.S Subalaksmi
  • running

17
Types of queries (AltaVista)
Including or excluding words: To make sure that a
specific word is always included in your search
topic, place the plus (+) symbol before the key
word in the search box. To make sure that a
specific word is always excluded from your search
topic, place a minus (-) sign before the keyword
in the search box. Example: To find recipes for
cookies with oatmeal but without raisins,
try recipe cookie +oatmeal -raisin. Expand your
search using wildcards (*): By typing an * at the
end of a keyword, you can search for the word
with multiple endings. Example: Try wish*, to
find wish, wishes, wishful, wishbone, and
wishy-washy.
18
Types of queries
AND (&): Finds only documents containing all of
the specified words or phrases. Mary AND lamb
finds documents with both the word Mary and the
word lamb. OR (|): Finds documents containing
at least one of the specified words or phrases.
Mary OR lamb finds documents containing either
Mary or lamb. The found documents could contain
both, but do not have to.
NOT (!): Excludes documents
containing the specified word or phrase. Mary AND
NOT lamb finds documents with Mary but not
containing lamb. NOT cannot stand alone -- use it
with another operator, like AND. NEAR (~): Finds
documents containing both specified words or
phrases within 10 words of each other. Mary NEAR
lamb would find the nursery rhyme, but likely not
religious or Christmas-related documents.
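
A minimal sketch of how these operators behave, evaluated over three
made-up documents (an illustration of the semantics, not AltaVista's
implementation):

    docs = {
        1: "mary had a little lamb whose fleece was white as snow",
        2: "mary is a common name in christmas stories",
        3: "roast lamb is a traditional christmas dinner",
    }

    def contains(doc_id, word):
        return word in docs[doc_id].split()

    def near(doc_id, w1, w2, window=10):
        # True if w1 and w2 occur within `window` words of each other.
        toks = docs[doc_id].split()
        pos1 = [i for i, t in enumerate(toks) if t == w1]
        pos2 = [i for i, t in enumerate(toks) if t == w2]
        return any(abs(i - j) <= window for i in pos1 for j in pos2)

    print([d for d in docs if contains(d, "mary") and contains(d, "lamb")])      # AND: [1]
    print([d for d in docs if contains(d, "mary") or contains(d, "lamb")])       # OR: [1, 2, 3]
    print([d for d in docs if contains(d, "mary") and not contains(d, "lamb")])  # AND NOT: [2]
    print([d for d in docs if near(d, "mary", "lamb")])                          # NEAR: [1]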
19
Mappings and abstractions
[Diagram from Korfhage's book: Reality is abstracted into Data; an
Information need is abstracted into a Query.]
20
Typical IR system
  • (Crawling)
  • Indexing
  • Retrieval
  • User interface

21
Key Terms Used in IR
  • QUERY: a representation of what the user is
    looking for - can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user
    wants to retrieve
  • COLLECTION: a set of documents
  • INDEX: a representation of information that makes
    querying easier
  • TERM: word or concept that appears in a document
    or a query

22
Other important terms
  • Classification
  • Cluster
  • Similarity
  • Information Extraction
  • Term Frequency
  • Inverse Document Frequency
  • Precision
  • Recall
  • Inverted File
  • Query Expansion
  • Relevance
  • Relevance Feedback
  • Stemming
  • Stopword
  • Vector Space Model
  • Weighting
  • TREC/TIPSTER/MUC

23
Query structures
  • Query viewed as a document?
  • Length
  • repetitions
  • syntactic differences
  • Types of matches
  • exact
  • range
  • approximate

24
Additional references on IR
  • Gerard Salton, Automatic Text Processing,
    Addison-Wesley (1989)
  • Gerald Kowalski, Information Retrieval Systems:
    Theory and Implementation, Kluwer (1997)
  • Gerard Salton and M. McGill, Introduction to
    Modern Information Retrieval, McGraw-Hill (1983)
  • C. J. van Rijsbergen, Information Retrieval,
    Butterworths (1979)
  • Ian H. Witten, Alistair Moffat, and Timothy C.
    Bell, Managing Gigabytes, Van Nostrand Reinhold
    (1994)
  • ACM SIGIR Proceedings, SIGIR Forum
  • ACM conferences in Digital Libraries

25
Related courses elsewhere
  • Stanford (Chris Manning, Prabhakar Raghavan, and
    Hinrich Schuetze): http://www.stanford.edu/class/cs276a/
  • Cornell (Jon Kleinberg): http://www.cs.cornell.edu/Courses/cs685/2002fa/
  • CMU (Yiming Yang and Jamie Callan): http://krakow.lti.cs.cmu.edu/classes/11-741/2004/index.html/
  • UMass (James Allan): http://ciir.cs.umass.edu/cmpsci646/
  • UTexas (Ray Mooney): http://www.cs.utexas.edu/users/mooney/ir-course/
  • Illinois (Chengxiang Zhai): http://sifaka.cs.uiuc.edu/course/498cxz04f/
  • Johns Hopkins (David Yarowsky): http://www.cs.jhu.edu/yarowsky/cs466.html

26
Readings for weeks 1-3
  • MIR (Modern Information Retrieval)
  • Week 1
  • Chapter 1 Introduction
  • Chapter 2 Modeling
  • Chapter 3 Evaluation
  • Week 2
  • Chapter 4 Query languages
  • Chapter 5 Query operations
  • Week 3
  • Chapter 6 Text and multimedia languages
  • Chapter 7 Text operations
  • Chapter 8 Indexing and searching

27
Documents
28
Documents
  • Not just printed paper
  • collections vs. documents
  • data structures and representations
  • Bag-of-words method
  • document surrogates: keywords, summaries
  • encoding: ASCII, Unicode, etc.

29
Document preprocessing
  • Formatting
  • Tokenization (Paul's, Willow Dr., Dr. Willow,
    555-1212, New York, ad hoc)
  • Casing (cat vs. CAT)
  • Stemming (computer, computation)
  • Soundex

30
Document representations
  • Term-document matrix (m x n)
  • term-term matrix (m x m x n)
  • document-document matrix (n x n)
  • Example: 3,000,000 documents (n) with 50,000
    terms (m)
  • sparse matrices (see the sketch below)
  • Boolean vs. integer matrices
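
A small sketch of a sparse term-document matrix stored as a dictionary of
dictionaries, keeping only the non-zero cells; the three documents are
made up for illustration:

    from collections import Counter

    docs = {                                   # hypothetical toy collection
        "D1": "computer information retrieval",
        "D2": "computer retrieval",
        "D3": "information",
    }

    # Sparse term-document matrix: term -> {document -> count}. Only non-zero
    # cells are stored, which is what makes a 3,000,000 x 50,000 matrix feasible.
    matrix = {}
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            matrix.setdefault(term, {})[doc_id] = count

    print(matrix["information"])               # {'D1': 1, 'D3': 1}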

31
Document representations
  • Term-document matrix
  • Evaluating queries (e.g., (A ∧ B) ∨ C)
  • Storage issues
  • Inverted files
  • Storage issues
  • Evaluating queries
  • Advantages and disadvantages
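
A minimal inverted file over a made-up four-document collection, with
Boolean AND and OR evaluated by combining postings lists (a sketch, not a
full query evaluator):

    docs = {
        "D1": "computer information retrieval",
        "D2": "computer retrieval",
        "D3": "information",
        "D4": "computer information",
    }

    # Inverted file: term -> sorted postings list of document ids.
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index.setdefault(term, []).append(doc_id)

    def boolean_and(t1, t2):
        # Intersect the two postings lists.
        return [d for d in index.get(t1, []) if d in index.get(t2, [])]

    def boolean_or(t1, t2):
        # Union of the two postings lists.
        return sorted(set(index.get(t1, [])) | set(index.get(t2, [])))

    print(boolean_and("information", "retrieval"))   # ['D1']
    print(boolean_or("information", "computer"))     # ['D1', 'D2', 'D3', 'D4']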

32
Additional issues
  • Dealing with phrases?
  • Proximity search
  • Synonyms?

33
Porter's algorithm
Example: the word "duplicatable"

    duplicatable -> duplicat    (rule 4)
    duplicat     -> duplicate   (rule 1b1)
    duplicate    -> duplic      (rule 3)

The application of another rule in step 4,
removing -ic, cannot be applied, since only one
rule from each step is allowed to be applied.
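
One readily available implementation for reproducing this kind of trace is
NLTK's PorterStemmer (NLTK is an assumption here, not part of the course
materials, and its output can differ slightly from the textbook trace):

    # Requires: pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["duplicatable", "computer", "computation", "wishes", "wishful"]:
        print(word, "->", stemmer.stem(word))
    # Related surface forms collapse to a common stem (e.g. "computer" and
    # "computation" both reduce to roughly "comput"), which is the point of stemming.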
34
Porter's algorithm
35
Relevance feedback
  • Automatic
  • Manual
  • Method: identifying feedback terms
  • Q' = a1·Q + a2·R - a3·N
  • Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
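
A sketch of this update (a Rocchio-style reweighting) with vectors stored
as term-to-weight dictionaries, reading R and N as the summed vectors of
the relevant and non-relevant documents; the binary vectors below follow
the example on the next slide:

    def feedback(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
        a2 = a2 if a2 is not None else 1.0 / max(len(relevant), 1)
        a3 = a3 if a3 is not None else 1.0 / max(len(nonrelevant), 1)
        terms = set(query) | {t for d in relevant + nonrelevant for t in d}
        new_q = {}
        for t in terms:
            w = (a1 * query.get(t, 0.0)
                 + a2 * sum(d.get(t, 0.0) for d in relevant)
                 - a3 * sum(d.get(t, 0.0) for d in nonrelevant))
            new_q[t] = max(w, 0.0)        # negative weights are often clipped to zero
        return new_q

    q  = {"safety": 1, "minivans": 1}
    d1 = {"car": 1, "safety": 1, "minivans": 1, "tests": 1, "injury": 1, "statistics": 1}
    d3 = {"car": 1, "passengers": 1, "injury": 1, "reviews": 1}
    print(feedback(q, [d1], [d3]))        # boosts safety/minivans, adds tests, statistics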

36
Example
  • Q safety minivans
  • D1 car safety minivans tests injury
    statistics - relevant
  • D2 liability tests safety - relevant
  • D3 car passengers injury reviews -
    non-relevant
  • R ?
  • S ?
  • Q ?

37
Approximate string matching
  • The Soundex algorithm (Odell and Russell)
  • Uses
  • spelling correction
  • hash function
  • non-recoverable

38
The Soundex algorithm
  • 1. Retain the first letter of the name, and drop
    all occurrences of a, e, h, i, o, u, w, y in other
    positions
  • 2. Assign the following numbers to the remaining
    letters after the first:
  • b, f, p, v: 1
  • c, g, j, k, q, s, x, z: 2
  • d, t: 3
  • l: 4
  • m, n: 5
  • r: 6

39
The Soundex algorithm
  • 3. If two or more letters with the same code were
    adjacent in the original name, omit all but the
    first
  • 4. Convert to the form LDDD (a letter followed by
    three digits) by adding terminal zeros or by
    dropping rightmost digits
  • Examples:
  • Euler E460, Gauss G200, Hilbert H416, Knuth K530,
    Lloyd L300
  • the same codes as Ellery, Ghosh, Heilbronn, Kant,
    and Ladd, respectively
  • Some problems: Rogers and Rodgers, Sinclair and
    StClair
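
The four steps transcribe almost directly into Python. The sketch below
follows the slide's wording (standard Soundex variants differ in small
details, such as the treatment of h and w) and reproduces the example
codes:

    def soundex(name):
        code_for = {}
        for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")]:
            for ch in letters:
                code_for[ch] = digit
        name = name.lower()
        digits = []
        prev = code_for.get(name[0], "")        # code of the previous original letter
        for c in name[1:]:
            d = code_for.get(c, "")             # a,e,h,i,o,u,w,y get no code (step 1)
            if d and d != prev:                 # step 3: same-code adjacent letters
                digits.append(d)                #         contribute a single digit
            prev = d
        # Step 4: first letter plus three digits, zero-padded or truncated.
        return name[0].upper() + "".join(digits + ["0", "0", "0"])[:3]

    for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Rogers", "Rodgers"]:
        print(n, soundex(n))                    # E460 G200 H416 K530 L300 R262 R326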

40
IR models
41
Major IR models
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

42
Major IR tasks
  • Ad-hoc
  • Filtering and routing
  • Question answering
  • Spoken document retrieval
  • Multimedia retrieval

43
Venn diagrams
[Venn diagram: two overlapping documents D1 and D2, with regions labeled
w, x, y, and z]
44
Boolean model
[Venn diagram: two overlapping sets A and B]
45
Boolean queries
restaurants AND (Mideastern OR vegetarian) AND
inexpensive
  • What types of documents are returned?
  • Stemming
  • thesaurus expansion
  • inclusive vs. exclusive OR
  • confusing uses of AND and OR

dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor,
computer, personal)
46
Boolean queries
  • Weighting (Beethoven AND sonatas)
  • precedence

coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
  • Use of negation potential problems
  • Conjunctive and Disjunctive normal forms
  • Full CNF and DNF

47
Transformations
  • De Morgan's Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
  • CNF or DNF?
  • Reference librarians prefer CNF - why?
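
Both identities can be checked mechanically by enumerating every truth
assignment; a tiny Python check:

    from itertools import product

    # Exhaustively verify De Morgan's laws for every combination of truth values.
    for A, B in product([False, True], repeat=2):
        assert (not (A and B)) == ((not A) or (not B))
        assert (not (A or B)) == ((not A) and (not B))
    print("De Morgan's laws hold for all truth assignments")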

48
Boolean model
  • Partition
  • Partial relevance?
  • Operators AND, NOT, OR, parentheses

49
Exercise
  • D1 = "computer information retrieval"
  • D2 = "computer retrieval"
  • D3 = "information"
  • D4 = "computer information"
  • Q1 = information ∧ retrieval
  • Q2 = information ∨ computer

50
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
51
Stop lists
  • The 250-300 most common words in English account
    for 50% or more of a given text.
  • Example: "the" and "of" represent 10% of tokens;
    "and", "to", "a", and "in" - another 10%; the next
    12 words - another 10%.
  • Moby Dick, Ch. 1: 859 unique words (types), 2256
    word occurrences (tokens). The top 65 types cover
    1132 tokens (about 50%).
  • Token/type ratio: 2256/859 = 2.63 (see the sketch
    below)
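
The Moby Dick figures can be reproduced with a few lines of Python; the
exact numbers depend on tokenization and casing conventions, so this is
only a sketch (the file name is hypothetical):

    from collections import Counter

    def token_type_stats(text):
        tokens = text.lower().split()                 # crude tokenization
        counts = Counter(tokens)
        total = sum(counts.values())
        top65 = sum(c for _, c in counts.most_common(65))
        return {"tokens": total,
                "types": len(counts),
                "token/type ratio": round(total / len(counts), 2),
                "top-65 coverage": round(top65 / total, 2)}

    # e.g. token_type_stats(open("moby_dick_ch1.txt").read())   # hypothetical file
    print(token_type_stats("the quick brown fox jumps over the lazy dog the end"))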

52
Vector models
Term 1
Doc 1
Doc 2
Term 3
Doc 3
Term 2
53
Vector queries
  • Each document is represented as a vector
  • non-efficient representations (bit vectors)
  • dimensional compatibility

54
The matching process
  • Document space
  • Matching is done between a document and a query
    (or between two documents)
  • distance vs. similarity
  • Euclidean distance, Manhattan distance, Word
    overlap, Jaccard coefficient, etc.
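
Minimal sketches of a few of these measures, for documents and queries
stored as term-to-weight dictionaries (the vectors below are made up):

    def euclidean(x, y):
        terms = set(x) | set(y)
        return sum((x.get(t, 0) - y.get(t, 0)) ** 2 for t in terms) ** 0.5

    def manhattan(x, y):
        terms = set(x) | set(y)
        return sum(abs(x.get(t, 0) - y.get(t, 0)) for t in terms)

    def word_overlap(x, y):
        return len(set(x) & set(y))             # number of shared terms

    d = {"safety": 1, "minivans": 1, "tests": 1}
    q = {"safety": 1, "minivans": 1}
    print(euclidean(d, q), manhattan(d, q), word_overlap(d, q))   # 1.0 1 2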

55
Miscellaneous similarity measures
  • The Cosine measure

    σ(D,Q) = Σ (d_i × q_i) / ( √(Σ d_i²) × √(Σ q_i²) )

    (for binary vectors: |X ∩ Y| / ( √|X| × √|Y| ),
    where X and Y are the term sets of D and Q)

  • The Jaccard coefficient

    σ(D,Q) = |X ∩ Y| / |X ∪ Y|
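
The same two measures in Python, over term-to-weight dictionaries; the
document vectors below are made up for illustration:

    def cosine(d, q):
        # sigma(D,Q) = sum(d_i * q_i) / (sqrt(sum d_i^2) * sqrt(sum q_i^2))
        num = sum(d[t] * q[t] for t in set(d) & set(q))
        den = (sum(v * v for v in d.values()) ** 0.5) * (sum(v * v for v in q.values()) ** 0.5)
        return num / den if den else 0.0

    def jaccard(d, q):
        # sigma(D,Q) = |X ∩ Y| / |X ∪ Y| over the term sets
        x, y = set(d), set(q)
        return len(x & y) / len(x | y) if x | y else 0.0

    D1 = {"computer": 1, "information": 1, "retrieval": 1}
    D2 = {"computer": 1, "retrieval": 1}
    print(round(cosine(D1, D2), 3), round(jaccard(D1, D2), 3))    # 0.816 0.667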
56
Exercise
  • Compute the cosine measures σ(D1,D2) and
    σ(D1,D3) for the documents D1, D2,
    and D3
  • Compute the corresponding Euclidean distances,
    Manhattan distances, and Jaccard coefficients.