1
Information Retrieval (IR)
2
Unstructured Information
Information that is not easily accessible via standard data retrieval techniques:
  • Email
  • Insurance claims
  • News articles
  • Web pages
  • Patent portfolios
  • Scientific articles
  • Sound (music)
  • Customer complaint letters
  • Contracts (legal docs)
  • Transcripts of phone calls with customers
  • Technical documents
  • Images

3
Text Retrieval
  • Deals with returning all relevant documents
    related to a given user need (query)
  • Relevance is a difficult concept to determine.
    However, most people know a relevant document
    when they see it.
  • In IR, relevance is usually taken to be objective.

4
Text Retrieval Process
  • Preprocessing
  • Representation (term-weighting)
  • Comparison
  • Results
  • Reformulation (Feedback)

5
A typical IR system
6
Sample Document
Term counts: 16 × said, 14 × McDonalds, 12 × fat, 11 × fries, 8 × new,
6 × company, french, nutrition, 5 × food, oil, percent, reduce, taste, Tuesday
  • McDonald's slims down spuds
  • Fast-food chain to reduce certain types of fat in
    its french fries with new cooking oil.
  • NEW YORK (CNN/Money) - McDonald's Corp. is
    cutting the amount of "bad" fat in its french
    fries nearly in half, the fast-food chain said
    Tuesday as it moves to make all its fried menu
    items healthier.
  • But does that mean the popular shoestring fries
    won't taste the same? The company says no. "It's
    a win-win for our customers because they are
    getting the same great french-fry taste along
    with an even healthier nutrition profile," said
    Mike Roberts, president of McDonald's USA.
  • But others are not so sure. McDonald's will not
    specifically discuss the kind of oil it plans to
    use, but at least one nutrition expert says
    playing with the formula could mean a different
    taste.
  • Shares of Oak Brook, Ill.-based McDonald's (MCD:
    down $0.54 to $23.22, Research, Estimates) were
    lower Tuesday afternoon. It was unclear Tuesday
    whether competitors Burger King and Wendy's
    International (WEN: down $0.80 to $34.91,
    Research, Estimates) would follow suit. Neither
    company could immediately be reached for comment.

Bag of Words
7
Pre-processing
  • Stop-word removal
  • Remove common words that have little or no
    semantic meaning (e.g. the, a, of)
  • These words occur too frequently, and in too many
    documents, to be of use
  • Advantage: it reduces the size of the text to be
    processed, since stop-words make up a considerable
    amount of the text in a document
  • Disadvantage: it can leave some documents
    irretrievable
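A minimal sketch of stop-word removal, assuming a small hand-picked stop list (real systems use much longer lists):

    # Toy stop-word removal; the stop list here is a tiny illustrative sample.
    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it"}

    def remove_stop_words(text):
        # Lower-case, split on whitespace, drop any token found in the stop list.
        return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

    print(remove_stop_words("the company said it will reduce the fat in the fries"))
    # ['company', 'said', 'will', 'reduce', 'fat', 'fries']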

8
Stemming
  • Stemming algorithms remove common suffixes from
    terms occurring in documents
  • An example of a stem is the word connect, which
    is the stem for the variants connected,
    connecting, connection and connections.
  • It is worth noting that semantic information can
    be lost by stemming, but in general stemming
    queries and documents does not damage, and often
    improves, the performance of IR systems, while at
    the same time reducing the number of distinct
    index terms
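As an illustration of the idea, a toy suffix-stripping stemmer for the connect example above; this is only a sketch, not the Porter algorithm a real system would use:

    # Toy stemmer: strips a few common suffixes from a term.
    SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

    def stem(term):
        for suffix in SUFFIXES:
            # Only strip if a reasonably long stem remains.
            if term.endswith(suffix) and len(term) - len(suffix) >= 3:
                return term[: -len(suffix)]
        return term

    for word in ["connected", "connecting", "connection", "connections"]:
        print(word, "->", stem(word))
    # all four variants reduce to the stem "connect"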

9
Representation (weighting)
  • Provide a weighting for terms based on some
    frequency characteristics (tf-idf)
  • Term-frequency
  • A term that occurs more frequently in a document
    is more likely to describe the content of that
    document
  • Inverse document frequency
  • A term that occurs in few documents is better
    able to distinguish those documents from the rest
    of the collection
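A sketch of the tf-idf weighting just described, using raw term frequency and log(N / df); many variants of both components exist:

    import math

    def tf_idf(tf, df, n_docs):
        # Weight = term frequency in the document x log(collection size / document frequency).
        return tf * math.log(n_docs / df)

    # Made-up numbers: a term occurring 8 times in a document, and appearing
    # in 1,000 documents of a 100,000-document collection.
    print(tf_idf(tf=8, df=1000, n_docs=100_000))   # about 36.8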

10
Representation (weighting)(2)
  • Length Normalisation
  • Penalise overlong documents as they simply
    contain more words and may not be as relevant as
    shorter documents (which are more concise)

11
Comparison
  • Use the weighting scheme to compare each document
    to the query
  • Weight the terms that are in common to the query
    and document
  • Score each document in the collection and return
    a ranked list to the user
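A sketch of the comparison step: documents and the query are held as sparse term-weight vectors, and cosine similarity supplies both the overlap score and the length normalisation from the previous slide (the weights below are invented, not tf-idf values from a real collection):

    import math

    def cosine(a, b):
        # Dot product over shared terms, divided by the vector lengths;
        # the division is what normalises for document length.
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    docs = {
        "D1": {"fries": 0.8, "fat": 0.6, "oil": 0.3},
        "D2": {"party": 0.9, "good": 0.7},
    }
    query = {"fries": 1.0, "oil": 1.0}

    # Score every document against the query and return a ranked list, best first.
    ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
    print(ranking)   # ['D1', 'D2']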

12
Ranked Lists
  • A popular way to return results for the VSM

13
Evaluation
  • How to evaluate systems?
  • How to tell if system A is better than system B?
  • Systems are usually tested on document test
    collections
  • Relevance is evaluated using a binary decision
    (i.e. relevant or not relevant)

14
Test collections
  • Test collections are usually made up of 3 parts
  • A set of documents (large sample, often up to a
    million or so nowadays)
  • A set of queries
  • Human relevance judgements for the queries
  • e.g. documents 400, 10234 and 502344 are relevant
    to query 1

15
Evaluation metrics
  • Precision
  • How accurate is the system?
  • (# of relevant documents returned) / (# of
    documents returned)
  • Recall
  • How many relevant documents has the system
    returned?
  • (# of relevant documents returned) / (# of relevant
    documents in total)
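A sketch of both measures, treating the returned list and the relevance judgements as sets of document ids (the ids below are made up):

    def precision(returned, relevant):
        # Fraction of the returned documents that are relevant.
        return len(set(returned) & set(relevant)) / len(returned)

    def recall(returned, relevant):
        # Fraction of all relevant documents that were returned.
        return len(set(returned) & set(relevant)) / len(relevant)

    returned = [400, 17, 10234, 88, 502344]    # what the system returned
    relevant = [400, 10234, 502344, 999]       # the human judgements
    print(precision(returned, relevant))       # 3/5 = 0.6
    print(recall(returned, relevant))          # 3/4 = 0.75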

16
Mean Average Precision (MAP)
  • Average precision (AP)
  • For a query, for each relevant document retrieved,
    calculate the precision at that point and average
    over the number of relevant documents found
  • Mean Average precision (MAP)
  • Mean of the APs for a set of queries
  • E.g. you could have a sample of 50 queries

17
Example (10 docs)
  • System returns the following ranked list for a
    certain query
  • Actual relevant documents are in blue
  • AP = ((1/3) + (1/5) + (1/8)) / 3 ≈ 0.2194
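For comparison, a sketch of average precision under the usual definition, in which the precision measured at the k-th relevant document retrieved is k divided by its rank; the ranked list and judgements below are invented, with relevant documents at ranks 3, 5 and 8:

    def average_precision(ranked_list, relevant):
        # Average the precision values measured at each relevant document found.
        hits, precisions = 0, []
        for rank, doc in enumerate(ranked_list, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)   # precision at this rank
        return sum(precisions) / len(relevant) if relevant else 0.0

    ranked = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6", "d0"]
    relevant = {"d9", "d5", "d4"}                # found at ranks 3, 5 and 8
    print(round(average_precision(ranked, relevant), 4))   # 0.3694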

18
Differing Models of IR
  • Boolean Model
  • Uses binary weights and logical operators AND,
    OR, NOT
  • Vector Space Model
  • Models documents and queries in a vector
    framework. Can provide partial matching.
  • Probabilistic Model
  • Uses relevance and non-relevance judgements to
    assign appropriate weights to terms

19
Boolean Model
  • Weights assigned to terms are either 0 or 1
  • 0 represents absence (the term isn't in the
    document)
  • 1 represents presence (the term is in the
    document)
  • Build queries by combining terms with Boolean
    operators
  • AND, OR, NOT
  • The system returns all documents that satisfy the
    query

20
Boolean Model (1)
(Diagram: document sets A, B and C within the set of all documents)
21
Boolean View of a Collection
Each column represents the view of a particular
document: what terms are contained in this
document?
Each row represents the view of a particular
term: what documents contain this term?
To execute a query, pick out the rows corresponding
to the query terms and then apply the truth table of
the corresponding Boolean operator
22
Sample Queries
dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7
good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6

Term                      Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
good                       0     1     0     1     0     1     0     1
party                      0     0     0     0     0     1     0     1
good AND party             0     0     0     0     0     1     0     1
over                       1     0     1     0     1     0     1     1
good AND party NOT over    0     0     0     0     0     1     0     0
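A sketch of how the Boolean queries above can be executed, representing each term by the set of documents that contain it (only the good / party / over rows of the table are reproduced):

    # Each term maps to the set of documents that contain it,
    # taken directly from the incidence table above.
    index = {
        "good":  {2, 4, 6, 8},
        "party": {6, 8},
        "over":  {1, 3, 5, 7, 8},
    }

    # good AND party  ->  documents 6 and 8
    print(index["good"] & index["party"])

    # good AND party NOT over  ->  document 6
    print((index["good"] & index["party"]) - index["over"])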
23
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • "Find documents about a good party that is not
    over"
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party, wild party, etc.
  • NOT can discover alternate meanings
  • Democratic party

24
Strengths and Weaknesses
  • Strengths
  • Precise, if you know the right strategies
  • Precise, if you have an idea of what you're
    looking for
  • Efficient for the computer
  • Weaknesses
  • Users must learn Boolean logic
  • Boolean logic is insufficient to capture the
    richness of language
  • No control over the size of the result set: either
    too many documents or none
  • When do you stop reading? All documents in the
    result set are considered equally good
  • What about partial matches? Documents that don't
    quite match the query may be useful also

25
Vector Space Model
  • Arranging documents by relevance is
  • Closer to how humans think: some documents are
    better than others
  • Closer to user behavior: users can decide when to
    stop reading
  • Best (partial) match: documents need not have all
    query terms
  • Although documents with more query terms should
    be better
  • Easier said than done!

26
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating-point
    numbers
  • It has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

27
Vector Representation
  • Documents and Queries are represented as vectors.
  • Position 1 corresponds to term 1, position 2 to
    term 2, position t to term t

28
Document Vectors
Terms: nova, galaxy, heat, hwood, film, role, diet, fur
Document ids A to I (each row lists that document's non-zero term weights):
  A: 1.0 0.5 0.3
  B: 0.5 1.0
  C: 1.0 0.8 0.7
  D: 0.9 1.0 0.5
  E: 1.0 1.0
  F: 0.9 1.0
  G: 0.5 0.7 0.9
  H: 0.6 1.0 0.3 0.2 0.8
  I: 0.7 0.5 0.1 0.3
29
Computing Similarity Scores
(Diagram: vectors plotted in a two-dimensional term space, with axis values from 0.2 to 1.0)
30
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more of a visualization aid than something
    with a real theoretical basis
  • most similarity measures work about the same
    regardless of the model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

31
Probabilistic Model
  • A rigorous formal model that attempts to predict
    the probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Relies on accurate estimates of probabilities for
    accurate results

32
Probabilistic
  • Goes back to the 1960s (Maron and Kuhns)
  • Robertson's Probabilistic Ranking Principle
  • Retrieved documents should be ranked in
    decreasing probability that they are relevant to
    the user's query.
  • How to estimate these probabilities?
  • Several methods (Model 1, Model 2, Model 3) with
    different emphases on how estimates are done.

33
Probabilistic Models
Disadvantages
  • Relevance information is required -- or must be
    guesstimated
  • Important indicators of relevance may not be terms
    -- though usually only terms are used
  • Optimally requires on-going collection of
    relevance information

Advantages
  • Strong theoretical basis
  • In principle, should supply the best predictions
    of relevance given the available information
  • Can be implemented similarly to the Vector Space
    Model
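The slides do not commit to a particular weighting function, but as one concrete, widely used example of a probabilistically motivated scheme, here is a sketch of the BM25 term weight (simplified to a single term; k1 and b are tuning constants):

    import math

    def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # An idf component times a saturating, length-normalised tf component.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * tf_part

    # Made-up numbers: term occurs 3 times in a 120-word document and appears
    # in 500 documents of a 100,000-document collection averaging 100 words.
    print(bm25_weight(tf=3, df=500, n_docs=100_000, doc_len=120, avg_doc_len=100))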

34
Relevance Feedback
  • User can give samples of relevant documents from
    an initial retrieval run
  • The user can mark documents that he/she has found
    relevant
  • The system takes these positive samples and adds
    terms from them into the query to improve the
    performance of the system (a type of document
    clustering)
  • Users tend not to want to give much feedback for
    short searches
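One common way to fold the marked documents back into the query is the Rocchio update; this sketch moves the query vector towards the centroid of the documents the user marked relevant (the alpha and beta values are typical but arbitrary):

    def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
        # Vectors are dicts of term -> weight. Start from the weighted
        # original query, then add a share of each relevant document.
        new_query = {t: alpha * w for t, w in query.items()}
        for doc in relevant_docs:
            for term, weight in doc.items():
                new_query[term] = new_query.get(term, 0.0) + beta * weight / len(relevant_docs)
        return new_query

    query = {"fries": 1.0}
    marked = [{"fries": 0.8, "oil": 0.6}, {"fat": 0.9, "oil": 0.4}]
    print(rocchio(query, marked))
    # the expanded query gains the terms "oil" and "fat" from the marked documents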

35
Automatic Query expansion
  • Use some type of automated approach to select and
    add terms to the query that might be useful
  • Helps overcome term-mismatch (vocabulary
    differences)
  • How to select these terms?
  • Look at all documents in the collection and
    co-occurrences (global)
  • Look at top few documents from an initial run and
    assume they are relevant (local)

36
Thesaurus construction
  • Uses terms in the entire collection (Global
    approach)
  • Measures the similarity between terms using
    co-occurrence characteristics
  • Uses the term-document matrix to calculate term
    similarities, which are stored in a term-term matrix
  • Add terms that are similar to query terms
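A sketch of the global approach: starting from a term-document matrix, a term-term similarity matrix can be built by comparing the rows, here with cosine similarity (the weights are invented):

    import math

    # Rows of a tiny term-document matrix: each term's weight in four documents.
    term_doc = {
        "fries": [1.0, 0.0, 0.8, 0.0],
        "chips": [0.9, 0.0, 0.7, 0.1],
        "party": [0.0, 1.0, 0.0, 0.9],
    }

    def row_cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Term-term similarity matrix built from co-occurrence across documents.
    terms = list(term_doc)
    term_term = {s: {t: row_cosine(term_doc[s], term_doc[t]) for t in terms} for s in terms}
    print(round(term_term["fries"]["chips"], 2))   # high: the terms co-occur
    print(round(term_term["fries"]["party"], 2))   # 0.0: they never co-occur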

37
Pseudo-relevance (blind) feedback
  • Perform an initial retrieval run
  • Assume top N documents are relevant
  • This gives us a pool of terms that are
    potentially useful
  • Add some of these to the query and run the query
    again
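A sketch of the loop just described: run the query, assume the top N documents are relevant, pull the most frequent new terms from them, and run the expanded query; the toy_search function below is a stand-in for the real retrieval engine:

    from collections import Counter

    def pseudo_relevance_feedback(query_terms, search, top_n=5, new_terms=3):
        # Expand the query with frequent terms from the top-ranked documents.
        top_docs = search(query_terms)[:top_n]     # assume these are relevant
        counts = Counter(t for doc in top_docs for t in doc)
        for t in query_terms:
            counts.pop(t, None)                    # only add genuinely new terms
        expansion = [t for t, _ in counts.most_common(new_terms)]
        return query_terms + expansion             # the query to run again

    # Stand-in retrieval run: returns ranked documents as lists of terms.
    def toy_search(query_terms):
        return [["fries", "oil", "fat"], ["fries", "oil", "mcdonalds"], ["party"]]

    print(pseudo_relevance_feedback(["fries"], toy_search, top_n=2, new_terms=2))
    # ['fries', 'oil', 'fat']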

38
Summary
  • Know basics of IR
  • Pre-processing
  • Stop-word removal
  • Stemming
  • Evaluation
  • Precision and Recall
  • Query expansion
  • Thesaurus and pseudo-relevance feedback