Title: Information Retrieval
2Information Retrieval
For the MSc Computer Science Programme
- Lecture 1
- Introduction to Information Retrieval (Manning et
al. 2007) - Chapter 1
Dell Zhang Birkbeck, University of London
3What is IR?
- IR is about search engines.
Search Engine: Unstructured Data (Text), Keyword Queries
VS
Database System: Structured Data (Tables), SQL Queries
Structured data has been the big commercial
success (e.g., Oracle) but unstructured data is
now becoming dominant in a large and increasing
range of activities.
4Search is Cool
- Text everywhere
- books, documents (MS Word, PDF, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, etc.
- Much more than text
- music, photos, videos, etc.
- Much more than the Web
- personal
- enterprise, institutional and domain-specific
5Search is Cool
- Not only search/finding, but also organization and mining
- clustering
- classification
Image: http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg
For example, given thousands of CVs,
6Search is Hot
- Most people's means of information access
- In the 1990s: other people
- Nowadays: Web search
- 92% of Internet users say the Internet is a good place to go for getting everyday information (2004 Pew Internet Survey)
- The Search Engine War
- Google, Yahoo, Microsoft, etc.
Video: How Big Can You Think?
7Data Unstructured vs. Structured
in 1996
8Data Unstructured vs. Structured
in 2006
9Free Textbook Online
Head of Yahoo! Research
- C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
- http://www-csli.stanford.edu/schuetze/information-retrieval-book.html
Don't remember. Search!
Another reason for you to come to this module.
10A Simple Search Engine
to be, or not to be
- http://www.rhymezone.com/shakespeare/
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
- Slow (for large corpora)
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) not feasible
How do you search in a book?
11Boolean Queries
- Boolean Queries are queries using AND, OR and NOT together with query terms.
- Each document is viewed as a set of words.
- Precise: a document either matches the condition or it does not.
- Primary commercial retrieval tool for 3 decades.
- Professional searchers still like Boolean queries: you know exactly what you're getting.
- For example, http://www.westlaw.com/ .
12Term-Document Incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
13Incidence Vectors
- So we have a 0/1 vector for each term.
- To answer the query
- take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND.
- 110100 AND 110111 AND 101111 = 100100
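The bitwise AND on this slide can be checked with a minimal Python sketch. The vectors for Brutus and Caesar are taken from the slide; the uncomplemented Calpurnia vector (010000) is an assumption, inferred from its complement 101111. The names `N_PLAYS` and `mask` are illustrative only.

```python
# Incidence vectors, one bit per play (leftmost bit = first play).
N_PLAYS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # assumed: inferred from the complemented vector 101111

# Complement Calpurnia within N_PLAYS bits, then AND the three vectors.
mask = (1 << N_PLAYS) - 1
not_calpurnia = ~calpurnia & mask
result = brutus & caesar & not_calpurnia

print(format(not_calpurnia, "06b"))  # 101111
print(format(result, "06b"))         # 100100
```

Each 1 bit in the result marks a play that satisfies the query.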
14Query Results
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
15Bigger Corpora
- Consider
- n = 1M documents
- each with about 1K terms.
- Avg 6 bytes/term incl. spaces/punctuation
- 6GB of data in the documents.
- Say there are m = 500K distinct terms among these.
16Bigger Corpora
- The 500K x 1M matrix has
- half-a-trillion 0s and 1s,
- but no more than one billion 1s.
- The matrix is extremely sparse.
- Can't build the matrix straightforwardly.
- What's a better representation?
- Only record the 1 positions.
Why?
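One way to check these figures is the short calculation below. It is a sketch that assumes every one of the roughly 1K terms in a document is distinct, which gives an upper bound on the number of 1s. The variable names are illustrative.

```python
n_docs = 1_000_000       # n = 1M documents
terms_per_doc = 1_000    # about 1K terms each
bytes_per_term = 6       # average, including spaces/punctuation
n_terms = 500_000        # m = 500K distinct terms

corpus_bytes = n_docs * terms_per_doc * bytes_per_term
matrix_cells = n_terms * n_docs
# Each term occurrence can set at most one cell of the matrix to 1.
max_ones = n_docs * terms_per_doc

print(corpus_bytes / 1e9)       # 6.0 (GB)
print(matrix_cells)             # 500000000000 (half a trillion)
print(max_ones / matrix_cells)  # 0.002, i.e. at most 0.2% of cells are 1
```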
17Inverted Index
- For each term T, we must store a list of all documents that contain T.
- Do we use an array or a list for this?
[Figure: postings lists for Brutus, Calpurnia and Caesar; the list for Caesar contains docIDs 13 and 16]
What happens if the word Caesar is added to
document 14?
18Inverted Index
- Linked lists generally preferred to arrays
- Dynamic space allocation
- Insertion of terms into documents easy
- Space overhead of pointers
[Figure: three postings lists, each sorted by docID: (2, 4, 8, 16, 32, 64, 128), (1, 2, 3, 5, 8, 13, 21, 34) and (13, 16)]
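Adding a document to a term's postings while keeping them sorted by docID might look like the sketch below. It assumes Caesar's postings are 13 and 16, as the figure appears to show, and uses a Python list, which is a dynamic array. The helper name `add_posting` is hypothetical.

```python
import bisect

# Assumed starting state: Caesar occurs in documents 13 and 16.
postings = {"Caesar": [13, 16]}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping them sorted and duplicate-free."""
    plist = index.setdefault(term, [])
    pos = bisect.bisect_left(plist, doc_id)  # binary search for the insert position
    if pos == len(plist) or plist[pos] != doc_id:
        plist.insert(pos, doc_id)            # shifts later entries in an array

add_posting(postings, "Caesar", 14)
print(postings["Caesar"])  # [13, 14, 16]
```

With an array, the insert has to shift every later entry. A linked list avoids the shift at the cost of one pointer per entry, which is the trade-off listed on this slide.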
19Inverted Index - Construction
20Linguistic Preprocessing
- Case-folding
- Often best to lowercase everything, since users will use lowercase regardless of correct capitalization.
- Removal of stopwords
- Very common words like the, of, to, etc.
- List of stopwords (stop lists)
- http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
- http://www.ranks.nl/tools/stopwords.html
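Case-folding and stopword removal can be sketched in a few lines of Python. The three-word stop list and the tokenising regular expression are assumptions made for illustration; a real stop list such as those linked above is much longer. The function name `preprocess` is hypothetical.

```python
import re

# Tiny illustrative stop list (assumption); real stop lists are much longer.
STOPWORDS = {"the", "of", "to"}

def preprocess(text):
    """Lowercase the text, split it into tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The noble Brutus hath told you Caesar was ambitious"))
# ['noble', 'brutus', 'hath', 'told', 'you', 'caesar', 'was', 'ambitious']
```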
21Linguistic Preprocessing
- Stemming
- Reduce terms to their roots before indexing, e.g., automate(s), automatic, automation all reduced to automat.
- Porter's stemmer
- Implementations in C, Java, Perl, Python, etc.
- http://www.tartarus.org/martin/PorterStemmer/
22Indexing
- Input
- a sequence of (term, docID) pairs
Doc 1: I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
23 This is the core indexing step,
24 - Merge multiple term entries in a single document.
- Add frequency information.
Why? Will discuss later.
25 - Split the result into a dictionary file and a postings file.
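The indexing steps on the last few slides can be put together in a minimal in-memory sketch. Several things here are assumptions: punctuation has been removed from the two documents by hand, the sort is by term and then docID, and the dictionary stores each term's document frequency. Real systems write the dictionary and postings to separate files.

```python
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Input: a sequence of (term, docID) pairs, with case-folding applied.
pairs = [(tok.lower(), doc_id) for doc_id, text in docs.items() for tok in text.split()]

# Sort by term, then by docID.
pairs.sort()

# Merge repeated (term, docID) entries and record the term frequency.
counts = Counter(pairs)  # (term, docID) -> frequency in that document

# Split into a dictionary (term -> document frequency)
# and postings (term -> list of (docID, frequency), sorted by docID).
postings = {}
for (term, doc_id), tf in sorted(counts.items()):
    postings.setdefault(term, []).append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(postings["caesar"])    # [(1, 1), (2, 2)]
print(postings["killed"])    # [(1, 2)]
print(dictionary["brutus"])  # 2
```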
26 The index we just built
Where do we pay in storage?
How do we process a query?
27Query Processing
- Consider processing the query
- Brutus AND Caesar
- Locate Brutus in the dictionary
- Retrieve its postings.
- Locate Caesar in the dictionary
- Retrieve its postings.
- Merge the two postings
[Figure: the postings lists for Brutus and Caesar]
28Query Processing - Merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the posting list lengths are x and y, the merge takes O(x+y) operations.
What's crucial: postings must be sorted by docID.
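The merge described on this slide can be sketched as a two-pointer walk over the sorted lists. The docIDs below are illustrative and the function name `intersect` is an assumption. Each step advances at least one pointer, so the running time is linear in the combined length of the two lists.

```python
def intersect(p1, p2):
    """Return the docIDs present in both postings lists (each sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]    # illustrative docIDs
caesar = [1, 2, 3, 5, 8, 13, 21, 34]   # illustrative docIDs
print(intersect(brutus, caesar))  # [2, 8]
```

If the lists were not sorted by docID, advancing the pointer at the smaller docID would no longer be safe, and the walk could miss matches.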
29Query Processing - Exercise
- Adapt the merge algorithm for the queries
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
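One possible adaptation for the first query, Brutus AND NOT Caesar, is sketched below. It is one approach, not the only answer to the exercise. The function name `and_not` and the docIDs are illustrative. The walk is still linear in the combined length of the two lists.

```python
def and_not(p1, p2):
    """Return the docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # excluded: docID is in both lists
            j += 1
        else:
            j += 1                # catch up in p2
    return answer

print(and_not([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
# [4, 16, 32, 64, 128]
```

Brutus OR NOT Caesar is different in kind: its result includes every document that does not contain Caesar, so its size depends on the whole collection and not only on the two postings lists.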
30Query Processing - Exercise
- What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time? The time
complexity is linear in what? Can we do better?
31Query Processing - Exercise
- Extend the merge algorithm to an arbitrary
Boolean query.
Can we always guarantee execution in time linear in the total postings size? Hint: Begin with the case of a Boolean formula query, where each query term appears only once in the query.
32Take Home Messages
- IR is about search engines.
- Search is hot! Search is cool!
- Free textbook online!
- http://www.dcs.bbk.ac.uk/dell/teaching/ir/
- Inverted Index
- dictionary and postings
- Boolean Search
- query processing