Title: Information Retrieval (IR)
1. Information Retrieval (IR)
2. Unstructured Information
Information that is not readily accessible via standard data retrieval techniques:
- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- Scientific articles
- Sound (music)
- Customer complaint letters
- Contracts (legal docs)
- Transcripts of phone calls with customers
- Technical documents
- Images
3. Text Retrieval
- Deals with returning all relevant documents related to a given user need (query)
- Relevance is a difficult concept to define; however, most people know a relevant document when they see it
- In IR, relevance is usually taken to be objective
4. Text Retrieval Process
- Preprocessing
- Representation (term-weighting)
- Comparison
- Results
- Reformulation (Feedback)
5. A typical IR system
6. Sample Document
Bag-of-words term frequencies for the article below:
- 16 said
- 14 McDonalds
- 12 fat
- 11 fries
- 8 new
- 6 company, french, nutrition
- 5 food, oil, percent, reduce, taste, Tuesday
- McDonald's slims down spuds
- Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
- NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
- But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
- But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
- Shares of Oak Brook, Ill.-based McDonald's (MCD down 0.54 to 23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN down 0.80 to 34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
7. Pre-processing
- Stop-word removal: remove common words that have little or no semantic meaning (e.g. the, a, of)
- These words occur too frequently, and in too many documents, to be useful
- Advantage: it reduces the size of the text to be processed, since stop-words take up a considerable amount of text in a document
- Disadvantage: it can leave some documents irretrievable (e.g. a query like "to be or not to be" consists almost entirely of stop-words)
8. Stemming
- Stemming algorithms remove common suffixes from terms occurring in documents
- An example of a stem is the word connect, which is the stem of the variants connected, connecting, connection and connections
- Semantic information can be lost by stemming terms, but in general stemming queries and documents does not damage, and often improves, the performance of IR systems, while at the same time decreasing the size of the document collection
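A quick check of the connect example with a standard Porter stemmer, via the NLTK library (assumes nltk is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))  # all four reduce to "connect"
```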
9. Representation (weighting)
- Provide a weighting for terms based on their frequency characteristics (tf-idf)
- Term frequency (tf): a term that occurs more frequently in a document is more likely to describe the content of that document
- Inverse document frequency (idf): a term that occurs in few documents is better able to distinguish those documents from the rest of the collection
10. Representation (weighting) (2)
- Length normalisation: penalise overlong documents, as they simply contain more words and may not be as relevant as shorter documents (which are more concise); a sketch combining these weighting steps follows below
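A minimal tf-idf sketch, assuming one common variant of the weighting (tf × log(N/df)) with Euclidean length normalisation; the function name and toy statistics are illustrative:

```python
import math

def tf_idf_vector(doc_terms, doc_freq, n_docs):
    # Raw term frequencies within the document.
    tf = {}
    for term in doc_terms:
        tf[term] = tf.get(term, 0) + 1
    # tf * idf, with idf = log(N / df).
    weights = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    # Length normalisation: divide by the vector's Euclidean norm.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

# Hypothetical document frequencies over a 1000-document collection.
df = {"fries": 50, "nutrition": 10, "the": 990}
print(tf_idf_vector(["fries", "fries", "nutrition", "the"], df, 1000))
```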
11. Comparison
- Use the weighting scheme to compare each document to the query
- Weight the terms that are common to the query and the document
- Score each document in the collection and return a ranked list to the user
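A sketch of this step, assuming documents and the query are already weighted term-to-weight maps; score() is a simple inner product over shared terms, and all names are illustrative:

```python
def score(query_vec, doc_vec):
    # Inner product over the terms the query and document share.
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

def rank(query_vec, docs):
    # docs: {doc_id: weighted term vector}; returns a ranked list.
    scored = {doc_id: score(query_vec, vec) for doc_id, vec in docs.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```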
12. Ranked Lists
- A popular way to return results for the Vector Space Model (VSM)
13. Evaluation
- How do we evaluate systems?
- How do we tell if system A is better than system B?
- Systems are usually tested on document test collections
- Relevance is evaluated using a binary decision (i.e. relevant or not relevant)
14. Test collections
- Test collections usually comprise 3 parts:
- A set of documents (a large sample, often up to a million or so nowadays)
- A set of queries
- Human relevance judgements for the queries (e.g. documents 400, 10234 and 502344 are relevant to query 1)
15. Evaluation metrics
- Precision: how accurate is the system?
- Precision = (# of relevant documents returned) / (# of documents returned)
- Recall: how many of the relevant documents has the system returned?
- Recall = (# of relevant documents returned) / (# of relevant documents in total)
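Both definitions transcribe directly into code; a minimal sketch using document-id sets:

```python
def precision(returned, relevant):
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    return len(returned & relevant) / len(relevant)

returned = {1, 2, 3, 4}         # documents the system returned
relevant = {2, 4, 7, 9, 11}     # all relevant documents in the collection
print(precision(returned, relevant))  # 2/4 = 0.5
print(recall(returned, relevant))     # 2/5 = 0.4
```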
16. Mean Average Precision (MAP)
- Average precision (AP): for a query, calculate the precision at each relevant document retrieved, and average over the number of relevant documents found
- Mean average precision (MAP): the mean of the APs for a set of queries
- E.g. you could have a sample of 50 queries
17. Example (10 docs)
- The system returns a ranked list of 10 documents for a certain query; the actual relevant documents were highlighted in blue on the slide
- AP = ((1/3) + (1/5) + (1/8)) / 3 ≈ 0.2194
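A sketch of AP and MAP following the definitions above; note it uses the standard convention of cumulative precision at the rank of each relevant document retrieved, and all names are illustrative:

```python
def average_precision(ranked, relevant):
    # ranked: doc ids in ranked order; relevant: set of relevant doc ids.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    # Average over the number of relevant documents found, as on the slide
    # (TREC-style AP divides by the total number of relevant documents).
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query):
    # per_query: list of (ranked_list, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in per_query) / len(per_query)
```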
18. Differing Models of IR
- Boolean Model: uses binary weights and the logical operators AND, OR, NOT
- Vector Space Model: models documents and queries in a vector framework; can provide partial matching
- Probabilistic Model: uses relevance and non-relevance judgements to assign the correct weighting to terms
19. Boolean Model
- Weights assigned to terms are either 0 or 1
- 0 represents absence: the term isn't in the document
- 1 represents presence: the term is in the document
- Build queries by combining terms with the Boolean operators AND, OR, NOT
- The system returns all documents that satisfy the query
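A minimal sketch of Boolean retrieval over an inverted index, using Python set operations; the toy index anticipates the sample queries below:

```python
# Toy inverted index: term -> set of documents containing it.
index = {"dog": {3, 5}, "fox": {3, 5, 7}}

dog, fox = index["dog"], index["fox"]
print(dog & fox)   # dog AND fox -> {3, 5}
print(dog | fox)   # dog OR fox  -> {3, 5, 7}
print(dog - fox)   # dog NOT fox -> set()
print(fox - dog)   # fox NOT dog -> {7}
```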
20. Boolean Model (1)
[Figure: Venn diagram of document sets A, B and C within the set of all documents]
21. Boolean View of a Collection
Each column represents the view of a particular document: what terms are contained in this document?
Each row represents the view of a particular term: what documents contain this term?
To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.
22. Sample Queries
dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → (empty)
fox NOT dog → Doc 7
Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good           0      1      0      1      0      1      0      1
party          0      0      0      0      0      1      0      1
g ∧ p          0      0      0      0      0      1      0      1
over           1      0      1      0      1      0      1      1
g ∧ p ∧ ¬o     0      0      0      0      0      1      0      0

good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6
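The row-wise query execution as a sketch: each term row is a bit vector over Doc 1..Doc 8 and the operators act position-wise (toy data copied from the table above):

```python
# One bit per document (Doc 1 .. Doc 8), copied from the table above.
good  = [0, 1, 0, 1, 0, 1, 0, 1]
party = [0, 0, 0, 0, 0, 1, 0, 1]
over  = [1, 0, 1, 0, 1, 0, 1, 1]

def AND(a, b): return [x & y for x, y in zip(a, b)]
def NOT(a):    return [1 - x for x in a]

def docs(bits):
    # Bit vector -> list of matching document numbers.
    return [i + 1 for i, b in enumerate(bits) if b]

print(docs(AND(good, party)))                  # [6, 8]
print(docs(AND(AND(good, party), NOT(over))))  # [6]
```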
23. Why Boolean Retrieval Works
- Boolean operators approximate natural language
- Find documents about a good party that is not over
- AND can discover relationships between concepts: good party
- OR can discover alternate terminology: excellent party, wild party, etc.
- NOT can discover alternate meanings: Democratic party
24. Strengths and Weaknesses
- Strengths
- Precise, if you know the right strategies
- Precise, if you have an idea of what you're looking for
- Efficient for the computer
- Weaknesses
- Users must learn Boolean logic
- Boolean logic is insufficient to capture the richness of language
- No control over the size of the result set: either too many documents or none
- When do you stop reading? All documents in the result set are considered equally good
- What about partial matches? Documents that don't quite match the query may also be useful
25. Vector Space Model
- Arranging documents by relevance is:
- Closer to how humans think: some documents are better than others
- Closer to user behaviour: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms, although documents with more query terms should score better
- Easier said than done!
26. Document Vectors
- Documents are represented as bags of words
- They are represented as vectors when used computationally
- A vector is like an array of floating-point numbers; it has direction and magnitude
- Each vector holds a place for every term in the collection; therefore, most vectors are sparse
27. Vector Representation
- Documents and queries are represented as vectors
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t
28. Document Vectors
Sample term weights per document (terms: nova, galaxy, heat, h'wood, film, role, diet, fur):
A: 1.0 0.5 0.3
B: 0.5 1.0
C: 1.0 0.8 0.7
D: 0.9 1.0 0.5
E: 1.0 1.0
F: 0.9 1.0
G: 0.5 0.7 0.9
H: 0.6 1.0 0.3 0.2 0.8
I: 0.7 0.5 0.1 0.3
29. Computing Similarity Scores
[Figure: document and query vectors plotted in a two-dimensional term space, with weights from 0.2 to 1.0 on each axis]
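A common way to compute such a score is cosine similarity, i.e. the cosine of the angle between the document and query vectors; a minimal sketch over sparse dict vectors (names and data are illustrative):

```python
import math

def cosine(u, v):
    # u, v: sparse vectors as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc   = {"nova": 1.0, "galaxy": 0.5, "heat": 0.3}
query = {"nova": 0.8, "galaxy": 0.6}
print(round(cosine(doc, query), 2))  # -> 0.95
```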
30. Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
- It is more for visualisation than having any real basis
- Most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms
31. Probabilistic Model
- A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle)
- Relies on accurate estimates of probabilities for accurate results
32. Probabilistic
- Goes back to the 1960s (Maron and Kuhns)
- Robertson's Probabilistic Ranking Principle: retrieved documents should be ranked in decreasing probability that they are relevant to the user's query
- How to estimate these probabilities?
- Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done
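One classic instantiation of such an estimate (not spelled out on these slides) is the Robertson/Sparck-Jones relevance weight, which scores a term t from relevance judgements:

```latex
w_t = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}
                {(n_t - r_t + 0.5)\,(R - r_t + 0.5)}
```

where N is the number of documents, n_t the number containing t, R the number of judged-relevant documents, and r_t the number of relevant documents containing t.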
33. Probabilistic Models
Disadvantages
- Relevance information is required, or has to be guesstimated
- Important indicators of relevance may not be terms, though usually only terms are used
- Optimally requires ongoing collection of relevance information
Advantages
- Strong theoretical basis
- In principle, should supply the best predictions of relevance given the available information
- Can be implemented similarly to the Vector Space Model
34. Relevance Feedback
- The user can give samples of relevant documents from an initial retrieval run
- The user can mark documents that he/she has found relevant
- The system takes these positive samples and adds terms from them to the query to improve the performance of the system (a type of document clustering)
- Users tend not to want to give much feedback for short searches
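A standard way to implement this update (not named on the slide) is Rocchio-style feedback; a minimal positive-feedback-only sketch over dict vectors, with conventional but illustrative constants:

```python
def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    # Move the query toward the centroid of the marked-relevant documents.
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for doc in relevant_docs:
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant_docs)
    return new_q
```

The full Rocchio formula also subtracts a weighted centroid of non-relevant documents; the slide's positive-samples-only description corresponds to dropping that term.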
35. Automatic Query Expansion
- Use some type of automated approach to select and add terms to the query that might be useful
- Helps overcome term mismatch (vocabulary differences)
- How to select these terms?
- Look at all documents in the collection and their term co-occurrences (global)
- Look at the top few documents from an initial run and assume they are relevant (local)
36. Thesaurus construction
- Uses terms in the entire collection (global approach)
- Measures the similarity between terms using co-occurrence characteristics
- Uses the term-document matrix to calculate similarities, which are stored in a term-term matrix
- Adds terms that are similar to query terms
37. Pseudo-relevance (blind) feedback
- Perform an initial retrieval run
- Assume the top N documents are relevant
- This gives us a pool of terms that are potentially useful
- Add some of these to the query and run the query again
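A sketch of the whole blind-feedback loop; the term-selection rule (most frequent terms in the top documents) and the parameters top_n and add_k are illustrative assumptions:

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, top_n=5, add_k=3):
    # Blind feedback: assume the top_n ranked documents are relevant and
    # add their add_k most frequent terms not already in the query.
    pool = Counter()
    for doc_terms in ranked_docs[:top_n]:
        pool.update(t for t in doc_terms if t not in query_terms)
    expansion = [t for t, _ in pool.most_common(add_k)]
    return list(query_terms) + expansion
```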
38. Summary
- Know basics of IR
- Pre-processing
- Stop-word removal
- Stemming
- Evaluation
- Precision and Recall
- Query expansion
- Thesaurus and pseudo-relevance feedback