1
Information Retrieval
  • CSE 8337
  • Spring 2005
  • Modeling
  • Material for these slides obtained from
  • Modern Information Retrieval by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto
    http://www.sims.berkeley.edu/hearst/irbook/
  • Introduction to Modern Information Retrieval by
    Gerard Salton and Michael J. McGill, McGraw-Hill,
    1983.

2
Modeling TOC
  • Introduction
  • Classic IR Models
  • Boolean Model
  • Vector Model
  • Probabilistic Model
  • Set Theoretic Models
  • Fuzzy Set Model
  • Extended Boolean Model
  • Generalized Vector Model
  • Latent Semantic Indexing
  • Neural Network Model
  • Alternative Probabilistic Models
  • Inference Network
  • Belief Network

3
Introduction
  • IR systems usually adopt index terms to process
    queries
  • Index term
  • a keyword or group of selected words
  • any word (more general)
  • Stemming might be used
  • connect: connecting, connection, connections
  • An inverted file is built for the chosen index
    terms
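
A minimal sketch of building such an inverted file in Python; the toy documents and the whitespace tokenizer are illustrative, not from the slides:

```python
from collections import defaultdict

# Toy documents; any tokenizer/stemmer could be plugged in here.
docs = {
    1: "information retrieval systems adopt index terms",
    2: "stemming maps connecting and connections to connect",
}

# Inverted file: index term -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

print(sorted(inverted["index"]))  # -> [1]
```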

4
Introduction
(Diagram: documents are indexed into index terms; the user's information need is expressed as a query; query and document index terms are matched to produce a ranking.)
5
Introduction
  • Matching at index term level is quite imprecise
  • No surprise that users frequently end up unsatisfied
  • Since most users have no training in query formulation, the problem is even worse
  • Hence the frequent dissatisfaction of Web users
  • The issue of deciding relevance is critical for IR systems: ranking

6
Introduction
  • A ranking is an ordering of the documents
    retrieved that (hopefully) reflects the relevance
    of the documents to the query
  • A ranking is based on fundamental premises regarding the notion of relevance, such as
  • common sets of index terms
  • sharing of weighted terms
  • likelihood of relevance
  • Each set of premises leads to a distinct IR model

7
IR Models
(Diagram: the user task is either retrieval (ad hoc or filtering) or browsing.)
8
IR Models
9
Classic IR Models - Basic Concepts
  • Each document is represented by a set of representative keywords or index terms
  • An index term is a document word useful for remembering the document's main themes
  • Usually, index terms are nouns because nouns have meaning by themselves
  • However, search engines assume that all words are
    index terms (full text representation)

10
Classic IR Models - Basic Concepts
  • The importance of the index terms is represented by weights associated with them
  • ki: an index term
  • dj: a document
  • wij: a weight associated with the pair (ki, dj)
  • The weight wij quantifies the importance of the index term ki for describing the contents of document dj

11
Classic IR Models - Basic Concepts
  • t is the total number of index terms
  • K = {k1, k2, ..., kt} is the set of all index terms
  • wij > 0 is a weight associated with (ki, dj)
  • wij = 0 indicates that the term does not belong to the doc
  • dj = (w1j, w2j, ..., wtj) is a weighted vector associated with the document dj
  • gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)

12
The Boolean Model
  • Simple model based on set theory
  • Queries specified as boolean expressions
  • precise semantics and neat formalism
  • Terms are either present or absent. Thus, wij ∈ {0,1}
  • Consider
  • q = ka ∧ (kb ∨ ¬kc)
  • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • qcc = (1,1,0) is a conjunctive component

13
The Boolean Model
  • q = ka ∧ (kb ∨ ¬kc)
  • sim(q,dj) =
  • 1 if ∃ qcc such that (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))
  • 0 otherwise
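
A minimal sketch of this Boolean similarity in Python, assuming documents are given as binary weight tuples over the query terms (ka, kb, kc) and the disjunctive normal form is listed explicitly:

```python
# q = ka AND (kb OR NOT kc), in disjunctive normal form over (ka, kb, kc).
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def boolean_sim(doc_weights, q_dnf):
    """Return 1 if some conjunctive component matches the document's binary weights exactly, else 0."""
    return 1 if any(doc_weights == cc for cc in q_dnf) else 0

d1 = (1, 0, 0)   # contains ka only
d2 = (0, 1, 1)   # contains kb and kc, but not ka
print(boolean_sim(d1, q_dnf), boolean_sim(d2, q_dnf))  # -> 1 0
```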

14
Drawbacks of the Boolean Model
  • Retrieval based on binary decision criteria with
    no notion of partial matching
  • No ranking of the documents is provided
  • Information need has to be translated into a
    Boolean expression
  • The Boolean queries formulated by the users are
    most often too simplistic
  • As a consequence, the Boolean model frequently
    returns either too few or too many documents in
    response to a user query

15
The Vector Model
  • Use of binary weights is too limiting
  • Non-binary weights provide consideration for
    partial matches
  • These term weights are used to compute a degree
    of similarity between a query and each document
  • Ranked set of documents provides for better
    matching

16
The Vector Model
  • wij > 0 whenever ki appears in dj
  • wiq > 0 is the weight associated with the pair (ki, q)
  • dj = (w1j, w2j, ..., wtj)
  • q = (w1q, w2q, ..., wtq)
  • To each term ki is associated a unitary vector i
  • The unitary vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
  • The t unitary vectors i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors

17
The Vector Model
(Figure: the document vector dj and the query vector q in term space, with θ the angle between them.)
  • sim(q,dj) = cos(θ) = (dj • q) / (|dj| × |q|) = Σi (wij × wiq) / (|dj| × |q|)
  • Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1
  • A document is retrieved even if it matches the query terms only partially
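
A minimal sketch of the cosine computation above; the weight vectors are illustrative:

```python
import math

def cosine_sim(d, q):
    """sim(q, d) = (d . q) / (|d| * |q|); weights are assumed non-negative."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d_j = [0.5, 0.8, 0.0]   # illustrative tf-idf weights for one document
q   = [1.0, 0.0, 0.4]   # illustrative query weights
print(round(cosine_sim(d_j, q), 3))
```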

18
Weights wij and wiq?
  • One approach is to examine the frequency of occurrence of a word in a document
  • Absolute frequency
  • tf factor, the term frequency within a document
  • freqi,j: raw frequency of ki within dj
  • Both high-frequency and low-frequency terms may not actually be significant
  • Relative frequency: tf divided by the number of words in the document
  • Normalized frequency
  • fi,j = freqi,j / (maxl freql,j)

19
Inverse Document Frequency
  • Importance of term may depend more on how it can
    distinguish between documents.
  • Quantification of inter-document separation
  • Dissimilarity not similarity
  • idf factor, the inverse document frequency

20
IDF
  • Let N be the total number of docs in the collection
  • Let ni be the number of docs which contain ki
  • The idf factor is computed as
  • idfi = log(N/ni)
  • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
  • IDF example (base-10 log)
  • N = 1000, n1 = 100, n2 = 500, n3 = 800
  • idf1 = log(1000/100) = 3 - 2 = 1
  • idf2 = log(1000/500) ≈ 3 - 2.7 = 0.3
  • idf3 = log(1000/800) ≈ 3 - 2.9 = 0.1

21
The Vector Model
  • The best term-weighting schemes take both into
    account.
  • wij = fi,j × log(N/ni)
  • This strategy is called a tf-idf weighting
    scheme

22
The Vector Model
  • For the query term weights, a suggestion is
  • wiq = (0.5 + 0.5 × freqi,q / (maxl freql,q)) × log(N/ni)
  • The vector model with tf-idf weights is a good
    ranking strategy with general collections
  • The vector model is usually as good as any known
    ranking alternatives.
  • It is also simple and fast to compute.
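
A sketch of the tf-idf weighting formulas above, using a base-10 log so the slide-20 idf values are reproduced; the raw frequencies passed in are illustrative:

```python
import math

def norm_tf(freq, max_freq):
    """f_ij = freq_ij / max_l freq_lj."""
    return freq / max_freq

def idf(N, n_i):
    """idf_i = log(N / n_i); base 10 matches the worked example (idf1 = 1, idf2 ~ 0.3, idf3 ~ 0.1)."""
    return math.log10(N / n_i)

def doc_weight(freq, max_freq, N, n_i):
    """w_ij = f_ij * log(N / n_i)."""
    return norm_tf(freq, max_freq) * idf(N, n_i)

def query_weight(freq_q, max_freq_q, N, n_i):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    return (0.5 + 0.5 * freq_q / max_freq_q) * idf(N, n_i)

# Slide-20 numbers: N = 1000, n1 = 100, n2 = 500, n3 = 800.
print(round(idf(1000, 100), 1), round(idf(1000, 500), 1), round(idf(1000, 800), 1))  # -> 1.0 0.3 0.1
print(round(doc_weight(3, 5, 1000, 100), 2))    # illustrative document frequencies
print(round(query_weight(1, 2, 1000, 100), 2))  # illustrative query frequencies
```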

23
The Vector Model
  • Advantages
  • term-weighting improves quality of the answer set
  • partial matching allows retrieval of docs that
    approximate the query conditions
  • cosine ranking formula sorts documents according
    to degree of similarity to the query
  • Disadvantages
  • Assumes independence of index terms; it is not clear that this is a bad assumption, though

24
The Vector Model Example I
25
The Vector Model Example II
26
The Vector Model Example III
27
Probabilistic Model
  • Objective: to capture the IR problem using a probabilistic framework
  • Given a user query, there is an ideal answer set
  • Querying as specification of the properties of
    this ideal answer set (clustering)
  • But, what are these properties?
  • Guess at the beginning what they could be (i.e.,
    guess initial description of ideal answer set)
  • Improve by iteration

28
Probabilistic Model
  • An initial set of documents is retrieved somehow
  • User inspects these docs looking for the relevant
    ones (in truth, only top 10-20 need to be
    inspected)
  • IR system uses this information to refine
    description of ideal answer set
  • By repeating this process, it is expected that the description of the ideal answer set will improve
  • Keep in mind that the description of the ideal answer set has to be guessed at the very beginning
  • Description of ideal answer set is modeled in
    probabilistic terms

29
Probabilistic Ranking Principle
  • Given a user query q and a document dj, the
    probabilistic model tries to estimate the
    probability that the user will find the document
    dj interesting (i.e., relevant). Ideal answer set
    is referred to as R and should maximize the
    probability of relevance. Documents in the set R
    are predicted to be relevant.
  • But,
  • how to compute probabilities?
  • what is the sample space?

30
The Ranking
  • Probabilistic ranking computed as
  • sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
  • This is the odds of the document dj being relevant
  • Taking the odds minimizes the probability of an erroneous judgement
  • Definitions
  • wij ∈ {0,1}
  • P(R | dj): probability that the given doc is relevant
  • P(¬R | dj): probability that the doc is not relevant

31
The Ranking
  • sim(dj,q) = P(R | dj) / P(¬R | dj)
  •           = [ P(dj | R) × P(R) ] / [ P(dj | ¬R) × P(¬R) ]    (Bayes' rule)
  •           ~ P(dj | R) / P(dj | ¬R)    (the priors P(R) and P(¬R) are the same for all documents and can be ignored for ranking)
  • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents

32
The Ranking
  • sim(dj,q) ~ P(dj | R) / P(dj | ¬R)
  •           = [ Π(gi(dj)=1) P(ki | R) × Π(gi(dj)=0) P(¬ki | R) ] / [ Π(gi(dj)=1) P(ki | ¬R) × Π(gi(dj)=0) P(¬ki | ¬R) ]
  •   (assuming independence of index terms)
  • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents

33
The Ranking
  • sim(dj,q)
  •   ~ log { [ Π P(ki | R) × Π P(¬ki | R) ] / [ Π P(ki | ¬R) × Π P(¬ki | ¬R) ] }
  •   ~ K + log Π [ P(ki | R) / P(¬ki | R) ] + log Π [ P(¬ki | ¬R) / P(ki | ¬R) ]
  • where P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R)

34
The Initial Ranking
  • sim(dj,q) ~ Σ wiq × wij × ( log [ P(ki | R) / P(¬ki | R) ] + log [ P(¬ki | ¬R) / P(ki | ¬R) ] )
  • Probabilities P(ki | R) and P(ki | ¬R)?
  • Estimates based on assumptions
  • P(ki | R) = 0.5
  • P(ki | ¬R) = ni / N
  • Use this initial guess to retrieve an initial
    ranking
  • Improve upon this initial ranking
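
A sketch of this initial ranking under the stated assumptions (P(ki | R) = 0.5 and P(ki | ¬R) = ni/N); the term statistics and binary document vector below are illustrative:

```python
import math

def initial_prob_ranking(doc, query, n, N):
    """
    sim(dj, q) ~ sum over query terms of
        w_iq * w_ij * ( log[P(ki|R)/(1-P(ki|R))] + log[(1-P(ki|~R))/P(ki|~R)] )
    with the initial estimates P(ki|R) = 0.5 and P(ki|~R) = n_i / N.
    doc, query: dicts term -> binary weight; n: dict term -> document frequency.
    """
    score = 0.0
    for term, w_iq in query.items():
        w_ij = doc.get(term, 0)
        if w_iq == 0 or w_ij == 0:
            continue
        p_r = 0.5
        p_nr = n[term] / N
        score += w_iq * w_ij * (math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr))
    return score

N = 1000
n = {"fuzzy": 50, "model": 400}          # illustrative document frequencies
query = {"fuzzy": 1, "model": 1}
print(round(initial_prob_ranking({"fuzzy": 1, "model": 1}, query, n, N), 2))
```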

35
Improving the Initial Ranking
  • Let
  • V: set of docs initially retrieved
  • Vi: subset of docs retrieved that contain ki
  • Reevaluate estimates
  • P(ki | R) = Vi / V
  • P(ki | ¬R) = (ni - Vi) / (N - V)
  • Repeat recursively

36
Improving the Initial Ranking
  • To avoid problems with V = 1 and Vi = 0
  • P(ki | R) = (Vi + 0.5) / (V + 1)
  • P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
  • Also,
  • P(ki | R) = (Vi + ni/N) / (V + 1)
  • P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
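
A sketch of the adjusted re-estimation formulas above; the V and Vi sizes are illustrative:

```python
def reestimate(V_size, Vi_size, n_i, N):
    """
    P(ki|R)  = (Vi + 0.5) / (V + 1)
    P(ki|~R) = (n_i - Vi + 0.5) / (N - V + 1)
    The 0.5 adjustment factors avoid zero probabilities when V = 1 or Vi = 0.
    """
    p_r = (Vi_size + 0.5) / (V_size + 1)
    p_nr = (n_i - Vi_size + 0.5) / (N - V_size + 1)
    return p_r, p_nr

# e.g. 10 docs retrieved, 4 of them contain ki, and ki occurs in 50 of 1000 docs
print(reestimate(10, 4, 50, 1000))   # -> (0.409..., 0.0469...)
```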

37
Pluses and Minuses
  • Advantages
  • Docs ranked in decreasing order of probability of
    relevance
  • Disadvantages
  • need to guess initial estimates for P(ki | R)
  • method does not take into account tf and idf
    factors

38
Brief Comparison of Classic Models
  • Boolean model does not provide for partial
    matches and is considered to be the weakest
    classic model
  • Salton and Buckley did a series of experiments
    that indicate that, in general, the vector model
    outperforms the probabilistic model with general
    collections
  • This seems also to be the view of the research
    community

39
Set Theoretic Models
  • The Boolean model imposes a binary criterion for
    deciding relevance
  • The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past
  • We discuss now two set theoretic models for this
  • Fuzzy Set Model
  • Extended Boolean Model

40
Fuzzy Set Model
  • Matching of documents to query terms is approximate (vague); this vagueness can be modeled using a fuzzy framework, as follows
  • with each term is associated a fuzzy set
  • each doc has a degree of membership in this fuzzy
    set
  • Here, we discuss the model proposed by Ogawa,
    Morita, and Kobayashi (1991)

41
Fuzzy Set Theory
  • A fuzzy subset A of U is characterized by a membership function μ(A,u) : U → [0,1] which associates with each element u of U a number μ(A,u) in the interval [0,1]
  • Definition
  • Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
  • μ(¬A,u) = 1 - μ(A,u)
  • μ(A ∪ B,u) = max(μ(A,u), μ(B,u))
  • μ(A ∩ B,u) = min(μ(A,u), μ(B,u))

42
Fuzzy Information Retrieval
  • Fuzzy sets are modeled based on a thesaurus
  • This thesaurus is built as follows
  • Let c be a term-term correlation matrix
  • Let ci,l be a normalized correlation factor for (ki, kl):
  • ci,l = ni,l / (ni + nl - ni,l)
  • ni: number of docs which contain ki
  • nl: number of docs which contain kl
  • ni,l: number of docs which contain both ki and kl
  • We now have the notion of proximity among index
    terms.

43
Fuzzy Information Retrieval
  • The correlation factor ci,l can be used to define fuzzy set membership for a document dj as follows:
  • μi,j = 1 - Π(kl ∈ dj) (1 - ci,l)
  • μi,j: membership of doc dj in the fuzzy subset associated with ki
  • The above expression computes an algebraic sum (as the complement of a product of complements) over all terms in the doc dj
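
A sketch of these two steps (the normalized correlation ci,l and the membership μi,j) on a toy collection; the documents are illustrative:

```python
# Toy collection: each document is the set of index terms it contains.
docs = {
    "d1": {"fuzzy", "set", "model"},
    "d2": {"fuzzy", "retrieval"},
    "d3": {"set", "retrieval", "model"},
}

def correlation(ki, kl):
    """c_il = n_il / (n_i + n_l - n_il), a normalized term-term correlation factor."""
    n_i = sum(ki in d for d in docs.values())
    n_l = sum(kl in d for d in docs.values())
    n_il = sum(ki in d and kl in d for d in docs.values())
    return n_il / (n_i + n_l - n_il)

def membership(ki, doc_terms):
    """mu_ij = 1 - prod over kl in dj of (1 - c_il): an algebraic sum over the doc's terms."""
    mu = 1.0
    for kl in doc_terms:
        mu *= (1.0 - correlation(ki, kl))
    return 1.0 - mu

print(round(membership("fuzzy", docs["d3"]), 3))  # d3 lacks 'fuzzy' but contains related terms
```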

44
Fuzzy Information Retrieval
  • A doc dj belongs to the fuzzy set for ki if its own terms are associated with ki
  • If doc dj contains a term kl which is closely related to ki, we have
  • ci,l ≈ 1
  • hence μi,j ≈ 1
  • and the index ki is a good fuzzy index for the doc

45
Fuzzy IR An Example
  • q = ka ∧ (kb ∨ ¬kc)
  • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  •      = cc1 ∨ cc2 ∨ cc3
  • μq,dj = μcc1∨cc2∨cc3,j
  •       = 1 - (1 - μa,j μb,j μc,j) × (1 - μa,j μb,j (1 - μc,j)) × (1 - μa,j (1 - μb,j)(1 - μc,j))
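
A sketch of evaluating this fuzzy disjunctive normal form as an algebraic sum of the three conjunctive components; the membership values are illustrative:

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """
    q = ka AND (kb OR NOT kc), qdnf = (1,1,1) + (1,1,0) + (1,0,0).
    mu_q = 1 - (1 - mu_a*mu_b*mu_c) * (1 - mu_a*mu_b*(1-mu_c)) * (1 - mu_a*(1-mu_b)*(1-mu_c))
    """
    cc1 = mu_a * mu_b * mu_c
    cc2 = mu_a * mu_b * (1 - mu_c)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(round(fuzzy_query_membership(0.8, 0.6, 0.3), 3))  # illustrative memberships -> 0.559
```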

46
Fuzzy Information Retrieval
  • Fuzzy IR models have been discussed mainly in the
    literature associated with fuzzy theory
  • Experiments with standard test collections are
    not available
  • Difficult to compare at this time

47
Extended Boolean Model
  • Boolean model is simple and elegant.
  • But, no provision for a ranking
  • As with the fuzzy model, a ranking can be
    obtained by relaxing the condition on set
    membership
  • Extend the Boolean model with the notions of
    partial matching and term weighting
  • Combine characteristics of the Vector model with
    properties of Boolean algebra

48
The Idea
  • The Extended Boolean Model (introduced by Salton,
    Fox, and Wu, 1983) is based on a critique of a
    basic assumption in Boolean algebra
  • Let,
  • q = kx ∧ ky
  • wxj = (fxj × idfx) / maxi(idfi) is the weight associated with the pair (kx, dj)
  • Further, let wxj = x and wyj = y

49
The Idea
qand = kx ∧ ky, with wxj = x and wyj = y
(Figure: in the (x, y) plane, the AND query is best satisfied near the point (1,1); sim(qand, dj) grows as the document gets closer to (1,1).)
50
The Idea
qor = kx ∨ ky, with wxj = x and wyj = y
(Figure: in the (x, y) plane, (1,1) is the best spot and (0,0) the worst; sim(qor, dj) grows with the document's distance from (0,0).)
51
Generalizing the Idea
  • We can extend the previous model to consider
    Euclidean distances in a t-dimensional space
  • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter

52
Generalizing the Idea
  • A generalized disjunctive query is given by
  • qor = k1 ∨^p k2 ∨^p ... ∨^p kt
  • A generalized conjunctive query is given by
  • qand = k1 ∧^p k2 ∧^p ... ∧^p kt

53
Properties
54
Properties
  • This is quite powerful and is a good argument in favor of the extended Boolean model
  • q = (k1 ∧ k2) ∨ k3
  • with p = 2, applying the p-norm formulas recursively gives
  • sim(q,dj) = [ ( (1 - [((1-x1)^2 + (1-x2)^2)/2]^(1/2))^2 + x3^2 ) / 2 ]^(1/2)
  • By mixing p values within a single query (e.g., q = (k1 ∨^2 k2) ∧^∞ k3), k1 and k2 are to be used as in a vector retrieval while the presence of k3 is required
55
Conclusions
  • Model is quite powerful
  • Properties are interesting and might be useful
  • Computation is somewhat complex
  • However, distributivity does not hold for the ranking computation
  • q1 = (k1 ∧ k2) ∨ k3
  • q2 = (k1 ∨ k3) ∧ (k2 ∨ k3)
  • sim(q1,dj) ≠ sim(q2,dj)