Title: Information Retrieval
1Information Retrieval
- CSE 8337 (Part I)
- Spring 2011
- Some material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/hearst/irbook/
- Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/mhd/book
- Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, http://informationretrieval.org
2CSE 8337 Outline
- Introduction
- Text Processing
- Indexes
- Boolean Queries
- Web Searching/Crawling
- Vector Space Model
- Matching
- Evaluation
- Feedback/Expansion
3Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
- Library Science
- Digital Libraries
- Web Search Engines
- Traditionally keyword based
- Sample query
- Find all documents about data mining.
4Motivation
- IR: representation, storage, organization of, and access to information items
- Focus is on the user information need
- User information need (example):
- Find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament.
- Emphasis is on the retrieval of information (not data)
5DB vs IR
- Records (tuples) vs. documents
- Well defined results vs. fuzzy results
- DB grew out of files and traditional business systems
- IR grew out of library science and the need to categorize/group/access books/articles
6Unstructured data
- Typically refers to free text
- Allows
- Keyword queries including operators
- More sophisticated concept queries e.g.,
- find all web pages dealing with drug abuse
- Classic model for searching text documents
7Semi-structured data
- In fact almost no data is unstructured
- E.g., this slide has distinctly identified zones such as the Title and Bullets
- Facilitates semi-structured search such as:
- Title contains data AND Bullets contain search
- to say nothing of linguistic structure
8DB vs IR (contd)
- Data retrieval
- which docs contain a set of keywords?
- Well defined semantics
- a single erroneous object implies failure!
- Information retrieval
- information about a subject or topic
- semantics is frequently loose
- small errors are tolerated
- IR system
- interpret contents of information items
- generate a ranking which reflects relevance
- notion of relevance is most important
9Motivation
- IR software issues
- classification and categorization
- systems and languages
- user interfaces and visualization
- Still, the area was seen as of narrow interest
- Advent of the Web changed this perception once and for all
- universal repository of knowledge
- free (low cost) universal access
- no central editorial board
- many problems, though IR seen as key to finding the solutions!
10Basic Concepts
- The User Task
- Retrieval
- information or data
- purposeful
- Browsing
- glancing around
- Feedback
(Diagram: the user interacts with the database through retrieval, browsing, and feedback; the system returns a response.)
11Basic Concepts
Logical view of the documents
12The Retrieval Process
13Basic assumptions of Information Retrieval
- Collection: a fixed set of documents
- Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task
14Fuzzy Sets and Logic
- Fuzzy Set: set membership function is a real-valued function with output in the range [0,1].
- f(x) = probability that x is in F.
- 1 - f(x) = probability that x is not in F.
- Example:
- T = {x | x is a person and x is tall}
- Let f(x) be the probability that x is tall
- Here f is the membership function
15Fuzzy Sets
16IR is Fuzzy
(Diagram: a simple set has a crisp Relevant/Not Relevant boundary; a fuzzy set has a gradual one.)
17Information Retrieval Metrics
- Similarity: measure of how close a query is to a document.
- Documents which are close enough are retrieved.
- Metrics:
- Precision = |Relevant and Retrieved| / |Retrieved|
- Recall = |Relevant and Retrieved| / |Relevant|
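Both metrics can be computed directly from document-ID sets; a minimal sketch (the sets below are made up for illustration):

```python
# Hypothetical query result: which docs are relevant vs. which were returned.
relevant = {1, 3, 5, 7}        # docs actually relevant to the query
retrieved = {2, 3, 5, 8, 9}    # docs the system returned

hits = relevant & retrieved    # relevant AND retrieved
precision = len(hits) / len(retrieved)   # 2 of 5 retrieved are relevant
recall = len(hits) / len(relevant)       # 2 of 4 relevant were retrieved

print(precision)  # 0.4
print(recall)     # 0.5
```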
18IR Query Result Measures
19CSE 8337 Outline
- Introduction
- Text Processing (Background)
- Indexes
- Boolean Queries
- Web Searching/Crawling
- Vector Space Model
- Matching
- Evaluation
- Feedback/Expansion
20Text Processing TOC
- Simple Text Storage
- String Matching
- Approximate (Fuzzy) Matching (Spell Checker)
- Parsing
- Tokenization
- Stemming/ngrams
- Stop words
- Synonyms
21Text storage
- EBCDIC/ASCII
- Array of characters
- Linked list of characters
- Trees: B-tree, trie
- Stuart E. Madnick, "String Processing Techniques," Communications of the ACM, Vol. 10, No. 7, July 1967, pp. 420-424.
22Pattern Matching(Recognition)
- Pattern matching: finds occurrences of a predefined pattern in the data.
- Applications include speech recognition, information retrieval, and time series analysis.
23Similarity Measures
- Determine similarity between two objects.
- Similarity characteristics
- Alternatively, distance measures measure how
unlike or dissimilar objects are.
24String Matching Problem
- Input:
- Pattern, length m
- Text string, length n
- Find one (next, all) occurrence(s) of the pattern in the string
- Example:
- String: 00110011011110010100100111
- Pattern: 011010
25String Matching Algorithms
- Brute Force
- Knuth-Morris-Pratt
- Boyer-Moore
26Brute Force String Matching
- Brute Force
- Handbook of Algorithms and Data Structures
- http://www.dcc.uchile.cl/rbaeza/handbook/algs/7/711a.srch.c.html
- Space: O(m+n)
- Time: O(mn)
- Example string: 00110011011110010100100111
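A minimal Python sketch of the brute-force scan (the Handbook's version is in C): try every starting position in the text and compare character by character.

```python
def brute_force_search(text, pattern):
    """Return all start positions of pattern in text; O(mn) comparisons worst case."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

# The slide's example string; "0110" occurs twice, "011010" not at all.
print(brute_force_search("00110011011110010100100111", "0110"))    # [1, 5]
print(brute_force_search("00110011011110010100100111", "011010"))  # []
```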
27FSR
28Creating FSR
- Create FSM
- Construct the correct spine.
- Add a default failure bus to state 0.
- Add a default initial bus to state 1.
- For each state, decide its attachments to failure
bus, initial bus, or other failure links.
29Knuth-Morris-Pratt
- Apply the FSM to the string by processing characters one at a time.
- An accepting state is reached when the pattern is found.
- Space: O(m+n)
- Time: O(m+n)
- Handbook of Algorithms and Data Structures
- http://www.dcc.uchile.cl/rbaeza/handbook/algs/7/712.srch.c.html
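A compact Python sketch of KMP (the Handbook's version is in C); the failure function plays the role of the FSM's failure links, so no text character is examined twice.

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: O(m+n) time. Returns all match start positions."""
    # Failure function: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, falling back via the failure links on a mismatch.
    matches, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("abababc", "abab"))  # [0, 2]
```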
30Boyer-Moore
- Scan the pattern from right to left
- Skip many positions on a mismatched (bad) character.
- Worst case O(mn)
- Expected time better than KMP
- Expected behavior better
- Handbook of Algorithms and Data Structures
- http://www.dcc.uchile.cl/rbaeza/handbook/algs/7/713.preproc.c.html
31Approximate String Matching
- Find patterns close to the string
- Fuzzy matching
- Applications
- Spelling checkers
- IR
- Define similarity (distance) between string and
pattern
32String-to-String Correction
- Levenshtein Distance
- http://www.mendeley.com/research/binary-codes-capable-of-correcting-insertions-and-reversals/
- Measure of similarity between strings
- Can be used to determine how to convert from one string to another
- Cost to convert one to the other
- Transformations:
- Match: current characters in both strings are the same
- Delete: delete the current character in the input string
- Insert: insert the current character of the target string into the input string
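A standard dynamic-programming sketch of Levenshtein distance, with unit cost for delete/insert/substitute and zero cost for a match:

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insert/delete/substitute operations."""
    n = len(t)
    prev = list(range(n + 1))              # distances from the empty prefix of s
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match is free
            cur[j] = min(prev[j] + 1,          # delete from input string
                         cur[j - 1] + 1,       # insert from target string
                         prev[j - 1] + cost)   # match / substitute
        prev = cur
    return prev[n]

print(levenshtein("misspell", "mispell"))  # 1 (delete one 's')
```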
33Distance Between Strings
34Spell Checkers
- Check or Replace or Expand or Suggest
- Phonetic
- Use phonetic spelling for word
- Truespel: www.foreignword.com/cgi-bin//transpel.cgi
- Phoneme: smallest unit of sound
- Jaro-Winkler
- distance measure
- http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
- Autocomplete
- www.amazon.com
35Tokenization
- Find individual words (tokens) in a text string.
- Look for spaces, commas, etc.
- http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
36Stemming/ngrams
- Convert token/word into the smallest word with similar derivations
- Remove suffixes (s, ed, ing, ...)
- Remove prefixes (pre, re, un, ...)
- n-gram: subsequence of length n
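A toy illustration of both ideas; the suffix list and the '$' edge padding are arbitrary choices for the sketch, not a real stemmer such as Porter's:

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Crude suffix stripping; keeps at least 3 characters of the word."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def char_ngrams(token, n=3):
    """Character n-grams of a token, padded with '$' at the edges."""
    padded = "$" + token + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(naive_stem("walking"))  # walk
print(char_ngrams("tree"))    # ['$tr', 'tre', 'ree', 'ee$']
```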
37Stopwords
- Common words
- Bad words
- Implementation
- Text file
38Synonyms
- Exact/similar meaning
- Hierarchy
- One way
- Bidirectional
- Expand Query
- Replace terms
- Implementation
- Synonym File or dictionary
39CSE 8337 Outline
- Introduction
- Text Processing
- Indexes
- Boolean Queries
- Web Searching/Crawling
- Vector Space Model
- Matching
- Evaluation
- Feedback/Expansion
40Index
- Common access is by keyword
- Fast access by keyword
- Index organizations?
- Hash
- B-tree
- Linked List
- Process document and query to identify keywords
41Term-document incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
42Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100.
- http://www.rhymezone.com/shakespeare/
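The bitwise AND can be sketched with Python integers standing in for the 0/1 incidence vectors (leftmost bit = first play):

```python
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111   # six plays in the collection

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))  # 100100
```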
43Inverted index
- For each term T, we must store a list of all
documents that contain T. - Do we use an array or a list for this?
(Diagram: postings lists for Brutus, Calpurnia, and Caesar.)
What happens if the word Caesar is added to
document 14?
44Inverted index
- Linked lists generally preferred to arrays
- Dynamic space allocation
- Insertion of terms into documents easy
- Space overhead of pointers
Postings:
Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
Calpurnia -> 13 16
Sorted by docID (more later on why).
45Inverted index construction
Documents to be indexed.
Friends, Romans, countrymen.
46Indexer steps
- Sequence of (Modified token, Document ID) pairs.
Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
47 Core indexing step.
48 - Multiple term entries in a single document are merged.
- Frequency information is added.
Why frequency? Will discuss later.
49 - The result is split into a Dictionary file and a
Postings file.
50 - Where do we pay in storage?
(Diagram: terms in the dictionary, pointers in the postings.)
51Example Data
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
- This is one year of Reuters newswire (part of 1995 and 1996)
- http://about.reuters.com/researchandstandards/corpus/available.asp
- Hardware assumptions: Table 4.1, p. 62 in the textbook
52A Reuters RCV1 document
53Reuters RCV1 statistics
Symbol  Statistic                                     Value
N       documents                                     800,000
L       avg. tokens per doc                           200
M       terms (= word types)                          400,000
        avg. bytes per token (incl. spaces/punct.)    6
        avg. bytes per token (without spaces/punct.)  4.5
        avg. bytes per term                           7.5
T       non-positional postings                       100,000,000
4.5 bytes per word token vs. 7.5 bytes per word type: why?
54Basic index construction
- Documents are parsed to extract words and these
are saved with the Document ID.
Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
55 Key step
- After all documents have been parsed, the
inverted file is sorted by terms.
We focus on this sort step. We have 100M items to
sort.
56Scaling index construction
- In-memory index construction does not scale.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints:
- Memory, disk, speed, etc.
57Sort-based Index construction
- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks
- The final postings for any term are incomplete until the end.
- At 12 bytes per postings entry, this demands a lot of space for large collections.
- T = 100,000,000 in the case of RCV1
- So we can do this in memory in 2008, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire
- Thus: we need to store intermediate results on disk.
58Use the same algorithm for disk?
- Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow (too many disk seeks).
- We need an external sorting algorithm.
59Bottleneck
- Parse and build postings entries one doc at a time
- Now sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow: must sort T = 100M records
If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
60BSBI Blocked sort-based Indexing
- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a Block: 10M such records
- Can easily fit a couple into memory.
- Will have 10 such blocks to start with.
- Basic idea of the algorithm:
- Accumulate postings for each block, sort, write to disk.
- Then merge the blocks into one long sorted order.
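The basic idea can be sketched in a few lines; here sorted runs kept in memory stand in for the on-disk files a real implementation would write and re-read:

```python
import heapq, itertools

def bsbi(pairs, block_size):
    """Blocked sort-based indexing over a stream of (term, docID) pairs."""
    runs, it = [], iter(pairs)
    while True:                                   # accumulate, sort, "write" blocks
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        runs.append(sorted(block))
    index = {}                                    # merge runs into one sorted stream
    for term, doc in heapq.merge(*runs):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc:   # collapse duplicate entries
            postings.append(doc)
    return index

print(bsbi([("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 4)], 2))
```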
62Sorting 10 blocks of 10M records
- First, read each block and sort within:
- Quicksort takes 2N ln N expected steps
- In our case 2 x (10M ln 10M) steps
- Exercise: estimate the total time to read each block from disk and quicksort it.
- 10 times this estimate gives us 10 sorted runs of 10M records each.
- Done straightforwardly, we need 2 copies of the data on disk
- But we can optimize this
64How to merge the sorted runs?
- Can do binary merges, with a merge tree of log2 10 ≈ 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.
(Diagram: runs 1-4 on disk being merged pairwise into a merged run.)
65How to merge the sorted runs?
- But it is more efficient to do an n-way merge, where you are reading from all blocks simultaneously
- Providing you read decent-sized chunks of each block into memory, you're not killed by disk seeks
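A sketch of that n-way merge, with a small in-memory chunk per run standing in for the buffered disk reads (in-memory lists stand in for the run files):

```python
import heapq

def nway_merge(runs, chunk_size=2):
    """Merge sorted runs, reading each run a chunk at a time."""
    chunks = [run[:chunk_size] for run in runs]   # one buffer per run
    offsets = [chunk_size] * len(runs)
    heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        j += 1
        if j == len(chunks[i]):                   # buffer empty: "read" next chunk
            chunks[i] = runs[i][offsets[i]:offsets[i] + chunk_size]
            offsets[i] += chunk_size
            j = 0
        if j < len(chunks[i]):
            heapq.heappush(heap, (chunks[i][j], i, j))
    return out

print(nway_merge([[2, 4, 8, 16], [1, 3, 5, 8], [13, 16]]))
```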
66Problem with sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . .
- . . . but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)
67SPIMI Single-pass in-memory indexing
- Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
- Key idea 2: don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.
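The single-pass inversion of one block might be sketched as follows (disk writes and block boundaries elided):

```python
def spimi_invert(token_stream):
    """One SPIMI block: grow the dictionary and postings lists in a single pass."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # new terms added on the fly
        if not postings or postings[-1] != doc_id:   # docIDs arrive in order
            postings.append(doc_id)
    # Terms are sorted once, at write-out time; the postings need no sorting.
    return dict(sorted(dictionary.items()))

print(spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)]))
```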
68SPIMI-Invert
- Merging of blocks is analogous to BSBI.
69SPIMI Compression
- Compression makes SPIMI even more efficient.
- Compression of terms
- Compression of postings
70Distributed indexing
- For web-scale indexing (don't try this at home!)
- must use a distributed computing cluster
- Individual machines are fault-prone
- Can unpredictably slow down or fail
- How do we exploit such a pool of machines?
71Google data centers
- Google data centers mainly contain commodity machines.
- Data centers are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
- Estimate: Google installs 100,000 servers each quarter.
- Based on expenditures of 200-250 million dollars per year
- This would be 10% of the computing capacity of the world!?!
72Google data centers
- If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
- Answer: 0.999^1000 ≈ 37% uptime, i.e., the system is down about 63% of the time.
- Calculate the number of servers failing per minute for an installation of 1 million servers.
73Distributed indexing
- Maintain a master machine directing the indexing job (considered "safe").
- Break up indexing into sets of (parallel) tasks.
- Master machine assigns each task to an idle machine from a pool.
74Parallel tasks
- We will use two sets of parallel tasks
- Parsers
- Inverters
- Break the input document corpus into splits
- Each split is a subset of documents
(corresponding to blocks in BSBI/SPIMI)
75Parsers
- Master assigns a split to an idle parser machine
- Parser reads a document at a time and emits (term, doc) pairs
- Parser writes pairs into j partitions
- Each partition is for a range of terms' first letters
- (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion
76Inverters
- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- Sorts and writes to postings lists
77Data flow
(Diagram: the master assigns splits to parsers and partitions to inverters. Each parser writes segment files for the term partitions a-f, g-p, and q-z; each inverter collects one partition from all segment files into its postings. Parsing is the map phase, inverting the reduce phase.)
78MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing
- without having to write code for the distribution part.
- They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
79MapReduce
- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
- Term-partitioned: one machine handles a subrange of terms
- Document-partitioned: one machine handles a subrange of documents
- (As we discuss in the web part of the course) most search engines use a document-partitioned index (better load balancing, etc.)
80Dynamic indexing
- Up to now, we have assumed that collections are static.
- They rarely are:
- Documents come in over time and need to be inserted.
- Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
- Postings updates for terms already in the dictionary
- New terms added to the dictionary
81Simplest approach
- Maintain big main index
- New docs go into small auxiliary index
- Search across both, merge results
- Deletions
- Invalidation bit-vector for deleted docs
- Filter docs output on a search result by this invalidation bit-vector
- Periodically, re-index into one main index
82Issues with main and auxiliary indexes
- Problem of frequent merges: you touch stuff a lot
- Poor performance during merge
- Actually:
- Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
- Merge is the same as a simple append.
- But then we would need a lot of files: inefficient for the O/S.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
83Further issues with multiple indexes
- Corpus-wide statistics are hard to maintain
- E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user?
- We said: pick the one with the most hits
- How do we maintain the top ones with multiple indexes and invalidation bit vectors?
- One possibility: ignore everything but the main index for such ordering
- Will see more such statistics used in results ranking
84Dynamic indexing at search engines
- All the large search engines now do dynamic indexing
- Their indices have frequent incremental changes
- News items, new topical web pages
- Sarah Palin
- But (sometimes/typically) they also periodically reconstruct the index from scratch
- Query processing is then switched to the new index, and the old index is then deleted
85Other sorts of indexes
- Positional indexes
- Same sort of sorting problem, just larger
- Building character n-gram indexes:
- As text is parsed, enumerate n-grams.
- For each n-gram, need pointers to all dictionary terms containing it: the postings.
- Note that the same postings entry will arise repeatedly in parsing the docs; need efficient hashing to keep track of this.
- E.g., that the trigram "uou" occurs in the term "deciduous" will be discovered on each text occurrence of "deciduous"
- Only need to process each term once
86CSE 8337 Outline
- Introduction
- Text Processing
- Indexes
- Boolean Queries
- Web Searching/Crawling
- Vector Space Model
- Matching
- Evaluation
- Feedback/Expansion
87The index we just built
Today's focus
- How do we process a query?
- Later - what kinds of queries can we process?
88Keyword Based Queries
- Basic Queries
- Single word
- Multiple words
- Context Queries
- Phrase
- Proximity
89Boolean Queries
- Keywords combined with Boolean operators:
- OR: (e1 OR e2)
- AND: (e1 AND e2)
- BUT: (e1 BUT e2) - satisfy e1 but not e2
- Negation only allowed using BUT, to allow efficient use of the inverted index by filtering another efficiently retrievable set.
- Naïve users have trouble with Boolean logic.
90Boolean Retrieval with Inverted Indices
- Primitive keyword: retrieve containing documents using the inverted index.
- OR: recursively retrieve e1 and e2 and take the union of results.
- AND: recursively retrieve e1 and e2 and take the intersection of results.
- BUT: recursively retrieve e1 and e2 and take the set difference of results.
91Query processing AND
- Consider processing the query:
- Brutus AND Caesar
- Locate Brutus in the Dictionary
- Retrieve its postings.
- Locate Caesar in the Dictionary
- Retrieve its postings.
- Merge the two postings
(Diagram: postings lists for Brutus and Caesar.)
92The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries
- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings sorted by docID.
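The merge in code, walking both docID-sorted lists with two cursors:

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:          # advance whichever cursor is behind
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```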
93Example: WestLaw http://www.westlaw.com/
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use boolean queries
- Example query:
- What is the statute of limitations in cases involving the federal tort claims act?
- LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
- /3 = within 3 words, /S = in same sentence
94Boolean queries More general merges
- Exercise: adapt the merge for the queries:
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
- Can we still run through the merge in time O(x+y)?
95Merging
- What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT
- (Antony OR Cleopatra)
- Can we always merge in linear time?
- Linear in what?
- Can we do better?
96Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then
AND them together.
(Diagram: postings lists for Brutus, Calpurnia, and Caesar.)
Query Brutus AND Calpurnia AND Caesar
97Query optimization example
- Process in order of increasing freq:
- start with the smallest set, then keep cutting further.
This is why we kept freq in dictionary
Execute the query as (Caesar AND Brutus) AND
Calpurnia.
98More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get freqs for all terms.
- Estimate the size of each OR by the sum of its freqs (conservative).
- Process in increasing order of OR sizes.
99Exercise
- Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies)
AND (kaleidoscope OR eyes)
100Phrasal Queries
- Retrieve documents with a specific phrase (an ordered list of contiguous words)
- "information theory"
- May allow intervening stop words and/or stemming.
- "buy camera" matches:
- "buy a camera"
- "buying the cameras"
- etc.
101Phrasal Retrieval with Inverted Indices
- Must have an inverted index that also stores positions of each keyword in a document.
- Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions.
- Best to start the contiguity check with the least common word in the phrase.
102Phrasal Search Algorithm
- Find the set of documents D in which all keywords (k1...km) in the phrase occur (using AND query processing).
- Initialize an empty set, R, of retrieved documents.
- For each document, d, in D do:
- Get the array, Pi, of positions of occurrences for each ki in d
- Find the shortest array Ps of the Pi's
- For each position p of keyword ks in Ps do:
- For each keyword ki except ks do:
- Use binary search to find a position (p + i - s) in the array Pi
- If the correct position for every keyword is found, add d to R
- Return R
103Proximity Queries
- List of words with specific maximal distance constraints between terms.
- Example: "dogs" and "race" within 4 words match "...dogs will begin the race..."
- May also perform stemming and/or not count stop words.
104Proximity Retrieval with Inverted Index
- Use an approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints.
- During the binary search for positions of the remaining keywords, find the closest position of ki to p and check that it is within the maximum allowed distance.
105Pattern Matching
- Allow queries that match strings rather than word
tokens. - Requires more sophisticated data structures and
algorithms than inverted indices to retrieve
efficiently.
106Simple Patterns
- Prefixes: pattern that matches the start of a word.
- "anti" matches "antiquity", "antibody", etc.
- Suffixes: pattern that matches the end of a word.
- "ix" matches "fix", "matrix", etc.
- Substrings: pattern that matches an arbitrary subsequence of characters.
- "rapt" matches "enrapture", "velociraptor", etc.
- Ranges: pair of strings that matches any word lexicographically (alphabetically) between them.
- "tin" to "tix" matches "tip", "tire", "title", etc.
107Allowing Errors
- What if the query or document contains typos or misspellings?
- Judge similarity of words (or arbitrary strings) using:
- Edit distance (cost of insert/delete/match)
- Longest Common Subsequence (LCS)
- Allow proximity search with a bound on string similarity.
108Longest Common Subsequence (LCS)
- Length of the longest subsequence of characters shared by two strings.
- A subsequence of a string is obtained by deleting zero or more characters.
- Examples:
- LCS of "misspell" and "mispell" is 7 ("mispell")
- LCS of "misspelled" and "misinterpretted" is 7 ("mispeed")
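A dynamic-programming sketch of the LCS length, keeping only two rows of the table:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, th in enumerate(t, 1):
            # Extend the match on equal characters, else carry the best so far.
            cur.append(prev[j - 1] + 1 if ch == th else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

print(lcs_length("misspell", "mispell"))            # 7 ("mispell")
print(lcs_length("misspelled", "misinterpretted"))  # 7 ("mispeed")
```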
109Structural Queries
- Assumes documents have structure that can be exploited in search.
- Structure could be:
- Fixed set of fields, e.g. title, author, abstract, etc.
- Hierarchical (recursive) tree structure
(Diagram: a book contains chapters; each chapter has a title and sections; sections have titles and subsections.)
110Queries with Structure
- Allow queries for text appearing in specific fields:
- "nuclear fusion" appearing in a chapter title
- SFQL: relational database query language SQL enhanced with full-text search.
- Select abstract from journal.papers
- where author contains "Teller" and
- title contains "nuclear fusion" and
- date < 1/1/1950
111Ranking search results
- Boolean queries give inclusion or exclusion of
docs. - Often we want to rank/group results
- Need to measure proximity from query to each doc.
- Need to decide whether docs presented to user are
singletons, or a group of docs covering various
aspects of the query.
112The web and its challenges
- Unusual and diverse documents
- Unusual and diverse users, queries, information
needs - Beyond terms, exploit ideas from social networks
- link analysis, clickstreams ...
- How do search engines work? And how can we make
them better?
113More sophisticated information retrieval
- Cross-language information retrieval
- Question answering
- Summarization
- Text mining