Information Retrieval

Slides: 33
Provided by: dellz

Transcript and Presenter's Notes

2
Information Retrieval
For the MSc Computer Science Programme
  • Lecture 1
  • Introduction to Information Retrieval (Manning et
    al. 2007)
  • Chapter 1

Dell Zhang, Birkbeck, University of London
3
What is IR?
  • IR is about search engines.

Search Engine: Unstructured Data (Text), Keyword Queries
vs.
Database System: Structured Data (Tables), SQL Queries
Structured data has been the big commercial
success (e.g., Oracle) but unstructured data is
now becoming dominant in a large and increasing
range of activities.
4
Search is Cool
  • Text everywhere
  • books, documents (ms-word, pdf, etc.), articles
    (journal, magazine, newspaper, etc.), Web pages,
    emails, SMS, chat,
  • Much more than text
  • music, photos, videos,
  • Much more than the Web
  • personal
  • enterprise, institutional and domain-specific

5
Search is Cool
  • Not only search/finding, but also organization
    and mining
  • clustering
  • classification

http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg
For example, given thousands of CVs,
6
Search is Hot
  • Most people's means of information access
  • In the 1990s: other people
  • Nowadays: Web search
  • 92% of Internet users say the Internet is a good
    place to go for getting everyday information
    (2004 Pew Internet Survey)
  • The Search Engine War
  • Google, Yahoo, Microsoft,

Video: How Big Can You Think?
7
Data: Unstructured vs. Structured in 1996
8
Data: Unstructured vs. Structured in 2006
9
Free Textbook Online
  • C. D. Manning, P. Raghavan and H. Schütze,
    Introduction to Information Retrieval, Cambridge
    University Press. 2008.
  • http://www-csli.stanford.edu/schuetze/information-retrieval-book.html

P. Raghavan: Head of Yahoo! Research
Don't remember. Search!
Another reason for you to come to this module.
10
A Simple Search Engine
to be, or not to be
  • http://www.rhymezone.com/shakespeare/
  • Which plays of Shakespeare contain the words
    Brutus AND Caesar but NOT Calpurnia?
  • One could grep all of Shakespeare's plays for
    Brutus and Caesar, then strip out lines
    containing Calpurnia.
  • Slow (for large corpora)
  • NOT Calpurnia is non-trivial
  • Other operations (e.g., find the word Romans near
    countrymen) not feasible
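
The grep-style approach above can be sketched as a linear scan. This is only an illustration: the play titles and texts are hypothetical stand-ins, and `linear_scan` is a name chosen here, not something from the lecture.

```python
# Toy corpus: titles and texts are hypothetical stand-ins for the plays.
plays = {
    "Play A": "Brutus spoke and Caesar answered",
    "Play B": "Brutus met Caesar and Calpurnia",
    "Play C": "Caesar alone",
}

def linear_scan(plays, must_have, must_not_have):
    """Return titles containing every word in must_have and none in must_not_have."""
    hits = []
    for title, text in plays.items():
        # Every query re-reads and re-tokenises every document: this is the slow part.
        words = set(text.lower().split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            hits.append(title)
    return hits

result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)
```

The cost of each query grows with the total size of the corpus, which is why scanning does not scale to large collections.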

How do you search in a book?
11
Boolean Queries
  • Boolean Queries are queries using AND, OR and NOT
    together with query terms.
  • Each document is viewed as a set of words.
  • Precise: a document either matches the condition or
    it does not.
  • Primary commercial retrieval tool for 3 decades.
  • Professional searchers still like Boolean queries
    - you know exactly what you're getting.
  • For example, http://www.westlaw.com/ .

12
Term-Document Incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
13
Incidence Vectors
  • So we have a 0/1 vector for each term.
  • To answer the query
  • take the vectors for Brutus, Caesar and Calpurnia
    (complemented), then bitwise AND.
  • 110100 AND 110111 AND 101111 = 100100
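
A minimal sketch of the same computation with integers as bit vectors. The three vectors come from the slide; the play order in `plays` is an assumption taken from the Manning et al. textbook example, since the incidence table itself did not survive in this transcript.

```python
# Column order: assumed to follow the textbook's example (not shown in the transcript).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000   # its complement is the 101111 on the slide

mask = (1 << len(plays)) - 1          # restrict the complement to 6 bits
answer = brutus & caesar & (~calpurnia & mask)

bits = format(answer, "06b")
matching = [play for play, bit in zip(plays, bits) if bit == "1"]
print(bits, matching)
```

The mask matters because Python integers are unbounded, so a bare `~calpurnia` would be negative rather than a 6-bit complement.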

14
Query Results
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed
i' the Capitol; Brutus killed me.
15
Bigger Corpora
  • Consider
  • n = 1M documents
  • each with about 1K terms.
  • Avg 6 bytes/term incl spaces/punctuation
  • 6GB of data in the documents.
  • Say there are m = 500K distinct terms among these.

16
Bigger Corpora
  • The 500K x 1M matrix has
  • half-a-trillion 0s and 1s,
  • but no more than one billion 1s.
  • The matrix is extremely sparse.
  • Can't build the matrix straightforwardly.
  • What's a better representation?
  • Only record the 1 positions.

Why?
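
The figures on these two slides can be checked with a few lines of arithmetic. The bound of one billion 1s follows because each of the 1M documents contains about 1K terms, so it can switch on at most about 1K cells in its column. Variable names below are chosen for illustration.

```python
n_docs = 1_000_000        # n = 1M documents
terms_per_doc = 1_000     # about 1K terms each
bytes_per_term = 6        # average, including spaces and punctuation
n_terms = 500_000         # m = 500K distinct terms

corpus_bytes = n_docs * terms_per_doc * bytes_per_term   # size of the raw text
matrix_cells = n_terms * n_docs                          # cells in the incidence matrix
max_ones = n_docs * terms_per_doc                        # upper bound on the number of 1s
density = max_ones / matrix_cells                        # fraction of cells that can be 1

print(corpus_bytes, matrix_cells, max_ones, density)
```

At most 0.2% of the cells can be 1, which is why recording only the 1 positions is so much cheaper than storing the full matrix.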
17
Inverted Index
  • For each term T, we must store a list of all
    documents that contain T.
  • Do we use an array or a list for this?

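[Figure: dictionary terms Brutus, Calpurnia and Caesar, each pointing to its postings list; the list shown for Caesar is 13, 16.]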
What happens if the word Caesar is added to
document 14?
18
Inverted Index
  • Linked lists generally preferred to arrays
  • Dynamic space allocation
  • Insertion of terms into documents easy
  • Space overhead of pointers

[Figure: postings lists, each sorted by docID]
2 4 8 16 32 64 128
2 3 5 8 13 21 34
1
13 16
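
To make the "Caesar is added to document 14" question concrete, here is a small sketch that keeps a postings list sorted on insertion. It uses a Python list, which is array-backed, so it illustrates the array side of the trade-off; `add_posting` is a hypothetical helper name.

```python
import bisect

caesar_postings = [13, 16]   # postings shown on the slide

def add_posting(postings, doc_id):
    """Insert doc_id into a sorted postings list, ignoring duplicates."""
    i = bisect.bisect_left(postings, doc_id)      # binary search for the position
    if i == len(postings) or postings[i] != doc_id:
        postings.insert(i, doc_id)                # shifts later entries: O(n) in an array

add_posting(caesar_postings, 14)
add_posting(caesar_postings, 14)   # a second occurrence in the same document changes nothing
print(caesar_postings)
```

With a linked list the insertion itself needs no shifting once the position is found, at the cost of a pointer per entry and no binary search.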
19
Inverted Index - Construction
20
Linguistic Preprocessing
  • Case-folding
  • Often best to lower case everything, since users
    will use lowercase regardless of correct
    capitalization.
  • Removal of stopwords
  • Very common words like the, of, to, etc.
  • List of stopwords (stop lists)
  • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
  • http://www.ranks.nl/tools/stopwords.html
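
Case-folding and stopword removal can be sketched as below. The stop list is a tiny hypothetical one made up for illustration, not either of the lists linked above, and the tokenisation (split on whitespace, strip a little punctuation) is a simplifying assumption.

```python
# Tiny illustrative stop list (hypothetical, far shorter than a real one).
STOPWORDS = {"the", "of", "to", "a", "and", "in"}

def preprocess(text):
    """Lower-case the text, split on whitespace, and drop stopwords."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOPWORDS]

tokens = preprocess("The noble Brutus hath told you Caesar was ambitious")
print(tokens)
```

Note that the same preprocessing has to be applied to queries, otherwise a query term such as "Brutus" would never match the indexed "brutus".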

21
Linguistic Preprocessing
  • Stemming
  • Reduce terms to their roots before indexing,
    e.g., automate(s), automatic, automation all
    reduced to automat.
  • Porter's stemmer
  • Implementations in C, Java, Perl, Python, etc.
  • http://www.tartarus.org/martin/PorterStemmer/
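
The idea of stemming can be illustrated with a deliberately crude suffix stripper. This is not Porter's algorithm, which applies several phases of conditional rewrite rules; the suffix list and the `toy_stem` name are assumptions chosen only to reproduce the slide's example.

```python
# Suffixes are tried in this order; the list is hand-picked for the example.
SUFFIXES = ("ion", "es", "ic", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["automate", "automates", "automatic", "automation"]]
print(stems)
```

A rule set this small will conflate or mangle many other words, which is why a tested implementation of Porter's stemmer is the usual choice in practice.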

22
Indexing
  • Input
  • a sequence of (term, docID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
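
One way to produce the (term, docID) sequence from the two documents above is sketched here. Splitting on whitespace, stripping trailing punctuation, and lower-casing are simplifying assumptions; `term_doc_pairs` is a hypothetical name.

```python
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def term_doc_pairs(docs):
    """Emit one (term, docID) pair per token, in document order."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.strip(".,;:").lower()   # case-fold and drop trailing punctuation
            if term:
                pairs.append((term, doc_id))
    return pairs

pairs = term_doc_pairs(docs)
print(pairs[:5])
```

Repeated terms such as "killed" in Doc 1 and "caesar" in Doc 2 appear more than once at this stage; duplicates are dealt with in a later step.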
23
  • Sort
  • by terms.

This is the core indexing step.
24
  • Merge
  • multiple term entries in a single document.
  • Add
  • frequency information.

Why? Will discuss later.
25
  • Split
  • the result into
  • a dictionary file and a postings file.
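
The sort, merge and split steps can be sketched end to end on a handful of pairs. The layout used here (a dictionary mapping each term to its document frequency and an offset into one flat postings list) is one possible arrangement assumed for illustration, not necessarily the one on the slides.

```python
from itertools import groupby

# A small sample of (term, docID) pairs, deliberately unsorted and with duplicates.
pairs = [("brutus", 1), ("caesar", 1), ("killed", 1), ("killed", 1),
         ("caesar", 2), ("brutus", 2), ("caesar", 2)]

# Sort: by term, then by docID.
pairs.sort()

# Merge: collapse repeated (term, docID) entries and record the term frequency.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

# Split: dictionary (term -> df and offset) plus a flat postings list of (docID, tf).
dictionary = {}
postings = []
for term, doc_id, tf in merged:
    if term not in dictionary:
        dictionary[term] = {"df": 0, "offset": len(postings)}
    dictionary[term]["df"] += 1
    postings.append((doc_id, tf))

def postings_for(term):
    """Look a term up in the dictionary and slice out its postings."""
    entry = dictionary[term]
    return postings[entry["offset"]: entry["offset"] + entry["df"]]

print(postings_for("caesar"))
```

Because the pairs were sorted first, all postings for a term are contiguous and already ordered by docID, so the dictionary only needs a starting offset and a count.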

26

The index we just built
Where do we pay in storage?
How do we process a query?
27
Query Processing
  • Consider processing the query
  • Brutus AND Caesar
  • Locate Brutus in the dictionary
  • Retrieve its postings.
  • Locate Caesar in the dictionary
  • Retrieve its postings.
  • Merge the two postings

[Figure: the postings lists for Brutus and Caesar; docIDs 128 and 34 are visible.]
28
Query Processing - Merge
Walk through the two postings simultaneously, in
time linear in the total number of postings
entries. In other words, if the posting list
lengths are x and y, the merge takes O(x+y)
operations.
[Figure: merging the two postings lists; docIDs 128, 2 and 34 are visible.]
What's crucial: postings must be sorted by docID.
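
A minimal sketch of the merge for an AND query, walking two sorted postings lists with one pointer each. The postings values are illustrative, chosen to resemble the number sequences that appear in the inverted index slides.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer behind the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]        # illustrative postings
caesar = [1, 2, 3, 5, 8, 13, 21, 34]       # illustrative postings
result = intersect(brutus, caesar)
print(result)
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times. Without the sort order, advancing past the smaller docID would not be safe.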
29
Query Processing - Exercise
  • Adapt the merge algorithm for the queries
  • Brutus AND NOT Caesar
  • Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?
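
One possible adaptation for the first query is sketched below; it is an illustration rather than a model answer, and the postings values are again illustrative.

```python
def and_not(p1, p2):
    """Return the docIDs in sorted list p1 that do not occur in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p2 has already moved past this docID: keep it
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # docID occurs in p2: drop it
            j += 1
        else:
            j += 1                 # catch p2 up to p1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
result = and_not(brutus, caesar)
print(result)
```

This still takes O(x+y) steps. The OR NOT query is different in kind: its result can include almost every document in the collection, so its cost depends on the collection size rather than on x and y alone.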
30
Query Processing - Exercise
  • What about an arbitrary Boolean formula?
  • (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

Can we always merge in linear time? The time
complexity is linear in what? Can we do better?
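
For the formula above, one option is to evaluate each sub-expression into an intermediate sorted list and combine those with further linear merges. The sketch below assumes hypothetical postings for the four terms; `union` and `difference` are names chosen here.

```python
def union(p1, p2):
    """Merge two sorted postings lists, keeping each docID once (OR)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    return answer + p1[i:] + p2[j:]   # append whatever remains of the longer list

def difference(p1, p2):
    """docIDs in sorted list p1 that are absent from sorted list p2 (AND NOT)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer

# Hypothetical postings, for illustration only.
brutus, caesar = [1, 2, 4], [1, 2, 4, 5, 6]
antony, cleopatra = [1, 2, 6], [1]

result = difference(union(brutus, caesar), union(antony, cleopatra))
print(result)
```

Each merge is linear in the sizes of its two inputs, and an intermediate result is never longer than its inputs combined, so the whole evaluation here is linear in the total postings size of the four terms.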
31
Query Processing - Exercise
  • Extend the merge algorithm to an arbitrary
    Boolean query.

Can we always guarantee execution in time linear
in the total postings size? Hint: begin with
the case of a Boolean formula query, in which each
query term appears only once in the query.
32
Take Home Messages
  • IR is about search engines.
  • Search is hot! Search is cool!
  • Free textbook online!
  • http://www.dcs.bbk.ac.uk/dell/teaching/ir/
  • Inverted Index
  • dictionary and postings
  • Boolean Search
  • query processing