1
Information Retrieval (IR)
2
Unstructured Information
Information that is not easily accessible via standard data retrieval techniques:
  • Email
  • Insurance claims
  • News articles
  • Web pages
  • Patent portfolios
  • Scientific articles
  • Sound (music)
  • Customer complaint letters
  • Contracts (legal docs)
  • Transcripts of phone calls with customers
  • Technical documents
  • Images

3
Text Retrieval
  • Deals with returning all relevant documents
    related to a given user need (query)
  • Relevance is a difficult concept to determine.
    However, most people know a relevant document
    when they see it.
  • In IR, relevance is usually taken to be objective.

4
Text Retrieval Process
  • Preprocessing
  • Representation (term-weighting)
  • Comparison
  • Results
  • Reformulation (Feedback)

5
A typical IR system
6
Sample Document
Term counts: 16 × said, 14 × McDonalds, 12 × fat, 11 × fries, 8 × new,
6 × company, french, nutrition, 5 × food, oil, percent, reduce, taste, Tuesday
  • McDonald's slims down spuds
  • Fast-food chain to reduce certain types of fat in
    its french fries with new cooking oil.
  • NEW YORK (CNN/Money) - McDonald's Corp. is
    cutting the amount of "bad" fat in its french
    fries nearly in half, the fast-food chain said
    Tuesday as it moves to make all its fried menu
    items healthier.
  • But does that mean the popular shoestring fries
    won't taste the same? The company says no. "It's
    a win-win for our customers because they are
    getting the same great french-fry taste along
    with an even healthier nutrition profile," said
    Mike Roberts, president of McDonald's USA.
  • But others are not so sure. McDonald's will not
    specifically discuss the kind of oil it plans to
    use, but at least one nutrition expert says
    playing with the formula could mean a different
    taste.
  • Shares of Oak Brook, Ill.-based McDonald's (MCD:
    down $0.54 to $23.22, Research, Estimates) were
    lower Tuesday afternoon. It was unclear Tuesday
    whether competitors Burger King and Wendy's
    International (WEN: down $0.80 to $34.91,
    Research, Estimates) would follow suit. Neither
    company could immediately be reached for comment.

Bag of Words
7
Pre-processing
  • Stop-word removal
  • Remove common words that have little or no
    semantic meaning (e.g. the, a, of)
  • These words occur too frequently, and in too many
    documents, to be of use
  • Advantage: it reduces the size of the text to be
    processed, since stop-words make up a considerable
    amount of the text in a document
  • Disadvantage: it can leave some documents
    irretrievable
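A minimal sketch of stop-word removal, assuming a small hand-picked stop list (real systems use much longer lists):

    # Toy stop-word removal; the stop list here is a tiny illustrative sample.
    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it"}

    def remove_stop_words(text):
        # Lower-case, split on whitespace, drop any token found in the stop list.
        return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

    print(remove_stop_words("the company said it will reduce the fat in the fries"))
    # ['company', 'said', 'will', 'reduce', 'fat', 'fries']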

8
Stemming
  • Stemming algorithms remove common suffixes from
    terms occurring in documents
  • An example of a stem is the word connect, which
    is the stem for the variants connected,
    connecting, connection and connections.
  • It is worth noting that semantic information can
    be lost by stemming, but in general stemming
    queries and documents does not damage, and often
    improves, the performance of IR systems, while at
    the same time reducing the number of distinct
    index terms
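As an illustration of the idea, a toy suffix-stripping stemmer for the connect example above; this is only a sketch, not the Porter algorithm a real system would use:

    # Toy stemmer: strips a few common suffixes from a term.
    SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

    def stem(term):
        for suffix in SUFFIXES:
            # Only strip if a reasonably long stem remains.
            if term.endswith(suffix) and len(term) - len(suffix) >= 3:
                return term[: -len(suffix)]
        return term

    for word in ["connected", "connecting", "connection", "connections"]:
        print(word, "->", stem(word))
    # all four variants reduce to the stem "connect"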

9
Representation (weighting)
  • Provide a weighting for terms based on some
    frequency characteristics (tf-idf)
  • Term-frequency
  • A term that occurs more frequently in a document
    is more likely to describe the content of that
    document
  • Inverse document frequency
  • A term that occurs in few documents is better
    able to distinguish those documents from the rest
    of the collection
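A sketch of the tf-idf weighting just described, using raw term frequency and log(N / df); many variants of both components exist:

    import math

    def tf_idf(tf, df, n_docs):
        # Weight = term frequency in the document x log(collection size / document frequency).
        return tf * math.log(n_docs / df)

    # Made-up numbers: a term occurring 8 times in a document, and appearing
    # in 1,000 documents of a 100,000-document collection.
    print(tf_idf(tf=8, df=1000, n_docs=100_000))   # about 36.8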

10
Representation (weighting)(2)
  • Length Normalisation
  • Penalise overlong documents as they simply
    contain more words and may not be as relevant as
    shorter documents (which are more concise)

11
Comparison
  • Use the weighting scheme to compare each document
    to the query
  • Weight the terms that are in common to the query
    and document
  • Score each document in the collection and return
    a ranked list to the user
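A sketch of the comparison step: documents and the query are held as sparse term-weight vectors, and cosine similarity supplies both the overlap score and the length normalisation from the previous slide (the weights below are invented, not tf-idf values from a real collection):

    import math

    def cosine(a, b):
        # Dot product over shared terms, divided by the vector lengths;
        # the division is what normalises for document length.
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    docs = {
        "D1": {"fries": 0.8, "fat": 0.6, "oil": 0.3},
        "D2": {"party": 0.9, "good": 0.7},
    }
    query = {"fries": 1.0, "oil": 1.0}

    # Score every document against the query and return a ranked list, best first.
    ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
    print(ranking)   # ['D1', 'D2']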

12
Ranked Lists
  • A popular way to return results for the VSM

13
Evaluation
  • How to evaluate systems?
  • How to tell if system A is better than system B?
  • Systems are usually tested on document test
    collections
  • Relevance is evaluated using a binary decision
    (i.e. relevant or not relevant)

14
Test collections
  • Test collections are usually made up of 3 parts
  • A set of documents (large sample, often up to a
    million or so nowadays)
  • A set of queries
  • Human relevance judgements for the queries
  • e.g. documents 400, 10234 and 502344 are relevant
    to query 1

15
Evaluation metrics
  • Precision
  • How accurate is the system?
  • (# of relevant documents returned) / (# of
    documents returned)
  • Recall
  • How many relevant documents has the system
    returned?
  • (# of relevant documents returned) / (# of relevant
    documents in total)
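A sketch of both measures, treating the returned list and the relevance judgements as sets of document ids (the ids below are made up):

    def precision(returned, relevant):
        # Fraction of the returned documents that are relevant.
        return len(set(returned) & set(relevant)) / len(returned)

    def recall(returned, relevant):
        # Fraction of all relevant documents that were returned.
        return len(set(returned) & set(relevant)) / len(relevant)

    returned = [400, 17, 10234, 88, 502344]    # what the system returned
    relevant = [400, 10234, 502344, 999]       # the human judgements
    print(precision(returned, relevant))       # 3/5 = 0.6
    print(recall(returned, relevant))          # 3/4 = 0.75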

16
Mean Average Precision (MAP)
  • Average precision (AP)
  • For a query, for each relevant document retrieved,
    calculate the precision at that point and average
    over the number of relevant documents found
  • Mean Average precision (MAP)
  • Mean of the APs for a set of queries
  • E.g. you could have a sample of 50 queries

17
Example (10 docs)
  • System returns the following ranked list for a
    certain query
  • Actual relevant documents are in blue
  • AP = ((1/3) + (1/5) + (1/8)) / 3 ≈ 0.2194
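For comparison, a sketch of average precision under the usual definition, in which the precision measured at the k-th relevant document retrieved is k divided by its rank; the ranked list and judgements below are invented, with relevant documents at ranks 3, 5 and 8:

    def average_precision(ranked_list, relevant):
        # Average the precision values measured at each relevant document found.
        hits, precisions = 0, []
        for rank, doc in enumerate(ranked_list, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)   # precision at this rank
        return sum(precisions) / len(relevant) if relevant else 0.0

    ranked = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6", "d0"]
    relevant = {"d9", "d5", "d4"}                # found at ranks 3, 5 and 8
    print(round(average_precision(ranked, relevant), 4))   # 0.3694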

18
Differing Models of IR
  • Boolean Model
  • Uses binary weights and logical operators AND,
    OR, NOT
  • Vector Space Model
  • Models documents and queries in a vector
    framework. Can provide partial matching.
  • Probabilistic Model
  • Uses relevance and non-relevance judgements to
    assign appropriate weights to terms

19
Boolean Model
  • Weights assigned to terms are either 0 or 1
  • 0 represents absence (the term isn't in the
    document)
  • 1 represents presence (the term is in the
    document)
  • Build queries by combining terms with Boolean
    operators
  • AND, OR, NOT
  • The system returns all documents that satisfy the
    query

20
Boolean Model (1)
(Diagram: document sets A, B and C within the set of all documents)
21
Boolean View of a Collection
Each column represents the view of a particular
document: what terms are contained in this
document?
Each row represents the view of a particular
term: what documents contain this term?
To execute a query, pick out the rows corresponding
to the query terms and then apply the truth table of
the corresponding Boolean operator
22
Sample Queries
dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7
good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6

Term                      Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
good                       0     1     0     1     0     1     0     1
party                      0     0     0     0     0     1     0     1
good AND party             0     0     0     0     0     1     0     1
over                       1     0     1     0     1     0     1     1
good AND party NOT over    0     0     0     0     0     1     0     0
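A sketch of how the Boolean queries above can be executed, representing each term by the set of documents that contain it (only the good / party / over rows of the table are reproduced):

    # Each term maps to the set of documents that contain it,
    # taken directly from the incidence table above.
    index = {
        "good":  {2, 4, 6, 8},
        "party": {6, 8},
        "over":  {1, 3, 5, 7, 8},
    }

    # good AND party  ->  documents 6 and 8
    print(index["good"] & index["party"])

    # good AND party NOT over  ->  document 6
    print((index["good"] & index["party"]) - index["over"])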
23
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • "Find documents about a good party that is not
    over"
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party, wild party, etc.
  • NOT can discover alternate meanings
  • Democratic party

24
Strengths and Weaknesses
  • Strengths
  • Precise, if you know the right strategies
  • Precise, if you have an idea of what you're
    looking for
  • Efficient for the computer
  • Weaknesses
  • Users must learn Boolean logic
  • Boolean logic is insufficient to capture the
    richness of language
  • No control over the size of the result set: either
    too many documents or none
  • When do you stop reading? All documents in the
    result set are considered equally good
  • What about partial matches? Documents that don't
    quite match the query may be useful also

25
Vector Space Model
  • Arranging documents by relevance is
  • Closer to how humans think: some documents are
    better than others
  • Closer to user behavior: users can decide when to
    stop reading
  • Best (partial) match: documents need not have all
    query terms
  • Although documents with more query terms should
    be better
  • Easier said than done!

26
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating-point
    numbers
  • It has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

27
Vector Representation
  • Documents and Queries are represented as vectors.
  • Position 1 corresponds to term 1, position 2 to
    term 2, position t to term t

28
Document Vectors
Terms: nova, galaxy, heat, hwood, film, role, diet, fur
Document ids A to I (each row lists that document's non-zero term weights):
  A: 1.0 0.5 0.3
  B: 0.5 1.0
  C: 1.0 0.8 0.7
  D: 0.9 1.0 0.5
  E: 1.0 1.0
  F: 0.9 1.0
  G: 0.5 0.7 0.9
  H: 0.6 1.0 0.3 0.2 0.8
  I: 0.7 0.5 0.1 0.3
29
Computing Similarity Scores
(Diagram: vectors plotted in a two-dimensional term space, with axis values from 0.2 to 1.0)
30
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more of a visualization aid than something
    with a real theoretical basis
  • most similarity measures work about the same
    regardless of the model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

31
Probabilistic Model
  • A rigorous formal model that attempts to predict
    the probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Relies on accurate estimates of probabilities for
    accurate results

32
Probabilistic
  • Goes back to the 1960s (Maron and Kuhns)
  • Robertson's Probabilistic Ranking Principle
  • Retrieved documents should be ranked in
    decreasing probability that they are relevant to
    the user's query.
  • How to estimate these probabilities?
  • Several methods (Model 1, Model 2, Model 3) with
    different emphases on how estimates are done.

33
Probabilistic Models
Disadvantages
  • Relevance information is required -- or must be
    guesstimated
  • Important indicators of relevance may not be terms
    -- though usually only terms are used
  • Optimally requires on-going collection of
    relevance information

Advantages
  • Strong theoretical basis
  • In principle, should supply the best predictions
    of relevance given the available information
  • Can be implemented similarly to the Vector Space
    Model
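The slides do not commit to a particular weighting function, but as one concrete, widely used example of a probabilistically motivated scheme, here is a sketch of the BM25 term weight (simplified to a single term; k1 and b are tuning constants):

    import math

    def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # An idf component times a saturating, length-normalised tf component.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * tf_part

    # Made-up numbers: term occurs 3 times in a 120-word document and appears
    # in 500 documents of a 100,000-document collection averaging 100 words.
    print(bm25_weight(tf=3, df=500, n_docs=100_000, doc_len=120, avg_doc_len=100))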

34
Relevance Feedback
  • User can give samples of relevant documents from
    an initial retrieval run
  • The user can mark documents that he/she has found
    relevant
  • The system takes these positive samples and adds
    terms from them into the query to improve the
    performance of the system (a type of document
    clustering)
  • Users tend not to want to give much feedback for
    short searches
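One common way to fold the marked documents back into the query is the Rocchio update; this sketch moves the query vector towards the centroid of the documents the user marked relevant (the alpha and beta values are typical but arbitrary):

    def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
        # Vectors are dicts of term -> weight. Start from the weighted
        # original query, then add a share of each relevant document.
        new_query = {t: alpha * w for t, w in query.items()}
        for doc in relevant_docs:
            for term, weight in doc.items():
                new_query[term] = new_query.get(term, 0.0) + beta * weight / len(relevant_docs)
        return new_query

    query = {"fries": 1.0}
    marked = [{"fries": 0.8, "oil": 0.6}, {"fat": 0.9, "oil": 0.4}]
    print(rocchio(query, marked))
    # the expanded query gains the terms "oil" and "fat" from the marked documents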

35
Automatic Query expansion
  • Use some type of automated approach to select and
    add terms to the query that might be useful
  • Helps overcome term-mismatch (vocabulary
    differences)
  • How to select these terms?
  • Look at all documents in the collection and
    co-occurrences (global)
  • Look at top few documents from an initial run and
    assume they are relevant (local)

36
Thesaurus construction
  • Uses terms in the entire collection (Global
    approach)
  • Measures the similarity between terms using
    co-occurrence characteristics
  • Uses the term-document matrix to calculate term
    similarities, which are stored in a term-term matrix
  • Add terms that are similar to query terms
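A sketch of the global approach: starting from a term-document matrix, a term-term similarity matrix can be built by comparing the rows, here with cosine similarity (the weights are invented):

    import math

    # Rows of a tiny term-document matrix: each term's weight in four documents.
    term_doc = {
        "fries": [1.0, 0.0, 0.8, 0.0],
        "chips": [0.9, 0.0, 0.7, 0.1],
        "party": [0.0, 1.0, 0.0, 0.9],
    }

    def row_cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Term-term similarity matrix built from co-occurrence across documents.
    terms = list(term_doc)
    term_term = {s: {t: row_cosine(term_doc[s], term_doc[t]) for t in terms} for s in terms}
    print(round(term_term["fries"]["chips"], 2))   # high: the terms co-occur
    print(round(term_term["fries"]["party"], 2))   # 0.0: they never co-occur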

37
Pseudo-relevance (blind) feedback
  • Perform an initial retrieval run
  • Assume top N documents are relevant
  • This gives us a pool of terms that are
    potentially useful
  • Add some of these to the query and run the query
    again
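A sketch of the loop just described: run the query, assume the top N documents are relevant, pull the most frequent new terms from them, and run the expanded query; the toy_search function below is a stand-in for the real retrieval engine:

    from collections import Counter

    def pseudo_relevance_feedback(query_terms, search, top_n=5, new_terms=3):
        # Expand the query with frequent terms from the top-ranked documents.
        top_docs = search(query_terms)[:top_n]     # assume these are relevant
        counts = Counter(t for doc in top_docs for t in doc)
        for t in query_terms:
            counts.pop(t, None)                    # only add genuinely new terms
        expansion = [t for t, _ in counts.most_common(new_terms)]
        return query_terms + expansion             # the query to run again

    # Stand-in retrieval run: returns ranked documents as lists of terms.
    def toy_search(query_terms):
        return [["fries", "oil", "fat"], ["fries", "oil", "mcdonalds"], ["party"]]

    print(pseudo_relevance_feedback(["fries"], toy_search, top_n=2, new_terms=2))
    # ['fries', 'oil', 'fat']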

38
Summary
  • Know basics of IR
  • Pre-processing
  • Stop-word removal
  • Stemming
  • Evaluation
  • Precision and Recall
  • Query expansion
  • Thesaurus and pseudo-relevance feedback