1
Modeling (Chap. 2)
  • Modern Information Retrieval
  • Spring 2000

2
Introduction
  • Traditional IR systems adopt index terms to index and retrieve documents
  • An index term is simply any word that appears in the text of a document
  • Retrieval based on index terms is simple
  • the premise is that the semantics of documents and of the user information need can be expressed through sets of index terms

3
  • Key Question
  • semantics of the document (and of the user request) is lost when the text is replaced with a set of words
  • matching between documents and the user request is done in the very imprecise space of index terms (low quality retrieval)
  • the problem is worsened for users with no training in properly forming queries (a frequent cause of dissatisfaction of Web users with the answers obtained)

4
Taxonomy of IR Models
  • Three classic models
  • Boolean
  • documents and queries represented as sets of
    index terms
  • Vector
  • documents and queries represented as vectors in
    t-dimensional space
  • Probabilistic
  • document and query representations based on
    probability theory

5
Basic Concepts
  • Classic models consider that each document is described by index terms
  • An index term is a (document) word that helps in remembering the document's main themes
  • index terms are used to index and summarize document content
  • in general, index terms are nouns (because nouns have meaning by themselves)
  • index terms may also be taken to be all the distinct words in a document collection

6
  • Distinct index terms have varying relevance when
    describing document contents
  • Thus numerical weights assigned to each index
    term of a document
  • Let ki be an index term, dj a document, and wi,j ≥ 0 a weight for the pair (ki, dj)
  • Weight quantifies importance of index term for
    describing document semantic contents

7
Definition (p. 25)
  • Let t be the no. of index terms in the system and ki a generic index term.
  • K = {k1, ..., kt} is the set of all index terms.
  • A weight wi,j > 0 is associated with each index term ki of document dj.
  • For an index term that does not appear in the document text, wi,j = 0.
  • Document dj is associated with an index term vector dj represented by dj = (w1,j, w2,j, ..., wt,j)
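
A minimal sketch (Python) of this representation, using a toy vocabulary and hypothetical weights: each document becomes a vector of weights over the full set of index terms K, with weight 0 for terms that do not appear in the document.

```python
# Toy vocabulary K = {k1, ..., kt}; names and weights are hypothetical.
K = ["information", "retrieval", "model", "boolean", "vector"]

def term_vector(doc_weights, vocabulary):
    """Return (w1,j, ..., wt,j); terms absent from the document get weight 0."""
    return tuple(doc_weights.get(term, 0.0) for term in vocabulary)

# Hypothetical non-zero weights for one document dj
dj_weights = {"information": 0.8, "retrieval": 0.6, "model": 0.3}
print(term_vector(dj_weights, K))   # (0.8, 0.6, 0.3, 0.0, 0.0)
```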

8
Boolean Model
  • Simple retrieval model based on set theory and
    Boolean algebra
  • framework is easy to grasp by users (concept of
    set is intuitive)
  • Queries specified as Boolean expressions which
    have precise semantics

9
Drawbacks
  • Retrieval strategy is binary decision (document
    is relevant/non-relevant)
  • prevents good retrieval performance
  • not simple to translate information need into
    Boolean expression (difficult and awkward to
    express)
  • despite these drawbacks, it is the dominant model with commercial DB systems

10
Boolean Model (Cont.)
  • Considers that index terms are present or absent
    in document
  • index term weights are binary, i.e. wi,j ∈ {0, 1}
  • query q is composed of index terms linked by not, and, or
  • a query is a Boolean expression which can be represented in disjunctive normal form (DNF)

11
Boolean Model (Cont.)
  • Query q = ka ∧ (kb ∨ ¬kc) can be written in DNF as qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • each component is a binary weighted vector associated with the tuple (ka, kb, kc)
  • the binary weighted vectors are called the conjunctive components of qdnf

12
Boolean Model (cont.)
  • Index term weight variables are all binary, i.e. wi,j ∈ {0, 1}
  • query q is a Boolean expression
  • Let qdnf be the DNF for query q
  • Let qcc be any of the conjunctive components of qdnf
  • Similarity of document dj to query q is
  • sim(dj,q) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ ki, gi(dj) = gi(qcc)), where gi(dj) = wi,j
  • sim(dj,q) = 0 otherwise
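
A minimal sketch of this matching rule, reusing the slide 11 example q = ka ∧ (kb ∨ ¬kc); the document term sets and the helper name are assumptions for illustration.

```python
# Conjunctive components of qdnf over the tuple (ka, kb, kc), from slide 11.
QUERY_TERMS = ("ka", "kb", "kc")
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def boolean_sim(document_terms, q_dnf=Q_DNF, terms=QUERY_TERMS):
    """Return 1 if the document's binary pattern over the query terms
    equals some conjunctive component of the DNF, else 0."""
    pattern = tuple(1 if t in document_terms else 0 for t in terms)
    return 1 if pattern in q_dnf else 0

print(boolean_sim({"ka", "kb"}))   # 1 -> matches component (1, 1, 0)
print(boolean_sim({"kb", "kc"}))   # 0 -> ka missing, no component matches
```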

13
Boolean Model (Cont.)
  • If sim(dj,q) = 1 then the Boolean model predicts that document dj is relevant to query q (it might not be)
  • Otherwise, prediction is that document is not
    relevant
  • Boolean model predicts that each document is
    either relevant or non-relevant
  • no notion of partial match

14
  • Main advantages
  • clean formalism
  • simplicity
  • Main disadvantages
  • exact matching may lead to retrieval of too few or too many documents
  • no index term weighting, although term weighting can lead to improvements in retrieval performance

15
Vector Model
  • Assign non-binary weights to index terms in
    queries and documents
  • term weights used to compute degree of similarity
    between document and user query
  • by sorting retrieved documents in decreasing order of degree of similarity, the vector model also considers documents which match the query only partially
  • the ranked document answer set is a lot more precise (than the answer set retrieved by the Boolean model)

16
Vector Model (Cont.)
  • Weight wi,j for pair (ki, dj) is positive and
    non-binary
  • index terms in query are also weighted
  • Let wi,q be the weight associated with the pair (ki, q), where wi,q ≥ 0
  • the query vector q is defined as q = (w1,q, w2,q, ..., wt,q), where t is the total no. of index terms in the system
  • the vector for document dj is represented by dj = (w1,j, w2,j, ..., wt,j)

17
Vector Model (Cont.)
  • Document dj and user query q represented as
    t-dimensional vectors.
  • evaluate the degree of similarity of dj with regard to q as the correlation between the vectors dj and q
  • this correlation can be quantified by the cosine of the angle between the two vectors
  • sim(dj,q) = (dj · q) / (|dj| × |q|)
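
A minimal sketch of the cosine correlation above, with toy weight vectors assumed for illustration:

```python
import math

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

d_j = [0.8, 0.6, 0.0, 0.3]   # document weights wi,j (toy values)
q   = [1.0, 0.0, 0.0, 0.5]   # query weights wi,q (toy values)
print(round(cosine_sim(d_j, q), 3))   # a score between 0 and 1, used for ranking
```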

18
Vector Model (Cont.)
  • sim(dj,q) varies from 0 to 1.
  • Ranks documents according to degree of similarity
    to query
  • document may be retrieved even if it partially
    matches query
  • establish a threshold on sim(dj,q) and retrieve
    documents with degree of similarity above
    threshold

19
Index term weights
  • The documents in a collection C can be seen as a collection of objects
  • The user query is a (vague) specification of a set A of objects
  • The IR problem is to determine which documents are in set A and which are not (i.e. a clustering problem)
  • In the clustering problem
  • intra-cluster similarity (features which better describe the objects in set A)
  • inter-cluster dissimilarity (features which better distinguish the objects in set A from the remaining objects in collection C)

20
  • In vector model, intra-cluster similarity
    quantified by measuring raw frequency of term ki
    inside document dj (tf factor)
  • how well term describes document contents
  • inter-cluster dissimilarity quantified by
    measuring inverse of frequency of term ki among
    documents in collection (idf factor)
  • terms which appear in many documents are not very
    useful for distinguishing relevant document from
    non-relevant one

21
Definition (p. 29)
  • Let N be total no. of documents in system
  • let ni be number of documents in which index term
    ki appears
  • let freqi,j be raw frequency of term ki in
    document dj
  • no. of times term ki mentioned in text of
    document dj
  • Normalized frequency fi,j of term ki in document dj is
  • fi,j = freqi,j / maxl freql,j

22
  • Maximum computed over all terms mentioned in text
    of document dj
  • if term ki does not appear in document dj, then fi,j = 0
  • let idfi, the inverse document frequency for ki, be
  • idfi = log(N / ni)
  • the best known term weighting scheme is
  • wi,j = fi,j × log(N / ni)
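
A minimal sketch of this tf-idf weighting over a toy three-document collection (assumed for illustration):

```python
import math
from collections import Counter

docs = [
    "information retrieval model",
    "boolean model boolean algebra",
    "vector model information",
]

N = len(docs)                                   # total number of documents
doc_terms = [Counter(d.split()) for d in docs]  # raw frequencies freq i,j
n = Counter()                                   # ni: documents containing ki
for terms in doc_terms:
    n.update(terms.keys())

def tf_idf(term, j):
    """wi,j = fi,j * log(N / ni), with fi,j = freq i,j / max_l freq l,j."""
    freqs = doc_terms[j]
    if term not in freqs:
        return 0.0
    f_ij = freqs[term] / max(freqs.values())
    return f_ij * math.log(N / n[term])

print(round(tf_idf("boolean", 1), 3))   # frequent in d1 and rare in the collection
print(round(tf_idf("model", 1), 3))     # appears in every document -> idf = 0
```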

23
  • Advantages of vector model
  • term weighting scheme improves retrieval
    performance
  • retrieve documents that approximate query
    conditions
  • sorts documents according to degree of similarity
    to query
  • Disadvantage
  • index terms are assumed to be mutually independent

24
Probabilistic Model
  • Given a user query, there is a set of documents which contains exactly the relevant documents and no others
  • this is the ideal answer set
  • given description of ideal answer set, no problem
    in retrieving its documents
  • querying process is process of specifying
    properties of ideal answer set
  • the properties are not exactly known
  • there are index terms whose semantics are used to
    characterize these properties

25
Probabilistic Model (Cont.)
  • These properties not known at query time
  • an effort has to be made to initially guess what they (i.e. the properties) are
  • this initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents
  • user interaction initiated to improve
    probabilistic description of ideal answer set

26
  • User examines the retrieved documents and decides which ones are relevant
  • this information is used to refine the description of the ideal answer set
  • by repeating this process, the description will evolve and come closer to the ideal answer set

27
Fundamental Assumption
  • Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant
  • assumes that probability of relevance depends on
    query and document representations only
  • assumes that there is subset of all documents
    which user prefers as answer set for query q
  • such ideal answer set is labeled R
  • documents in set R are predicted to be relevant
    to query

28
  • Given query q, probabilistic model assigns to
    each document dj the ratio P(dj relevant-to
    q)/P(dj non-relevant-to q)
  • measure of similarity to query
  • odds of document dj being relevant to query q

29
  • Index term weight variables are all binary, i.e. wi,j ∈ {0, 1}, wi,q ∈ {0, 1}
  • query q is a subset of index terms
  • let R be the set of documents known (initially guessed) to be relevant
  • let R̄ be the complement of R
  • let P(R | dj) be the probability that document dj is relevant to query q
  • let P(R̄ | dj) be the probability that document dj is not relevant to query q.

30
  • Similarity sim(dj,q) of document dj to query q is defined as the ratio
  • sim(dj,q) = P(R | dj) / P(R̄ | dj)
  • applying Bayes' rule, sim(dj,q) ~ P(dj | R) / P(dj | R̄), since P(R) and P(R̄) are the same for all documents
  • assuming independence of index terms and taking logarithms, sim(dj,q) ~ Σi wi,q × wi,j × ( log( P(ki|R) / (1 - P(ki|R)) ) + log( (1 - P(ki|R̄)) / P(ki|R̄) ) )
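
A minimal sketch of this ranking formula, assuming the estimates P(ki|R) and P(ki|R̄) are already available (toy values):

```python
import math

def bir_sim(d_weights, q_weights, p, u):
    """sim(dj,q) ~ sum over i of wi,q * wi,j *
    ( log(p_i / (1 - p_i)) + log((1 - u_i) / u_i) ),
    where p_i = P(ki|R), u_i = P(ki|R-bar), and the weights are binary."""
    score = 0.0
    for w_ij, w_iq, p_i, u_i in zip(d_weights, q_weights, p, u):
        if w_ij and w_iq:
            score += math.log(p_i / (1 - p_i)) + math.log((1 - u_i) / u_i)
    return score

d_j = [1, 1, 0, 1]            # binary document weights wi,j
q   = [1, 0, 1, 1]            # binary query weights wi,q
p   = [0.5, 0.5, 0.5, 0.5]    # initial P(ki|R) = 0.5 (next slide)
u   = [0.2, 0.6, 0.1, 0.05]   # initial P(ki|R-bar) = ni / N (toy values)
print(round(bir_sim(d_j, q, p, u), 3))
```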

31
  • How to compute P(ki|R) and P(ki|R̄) initially?
  • assume P(ki|R) is constant for all index terms ki (typically 0.5)
  • P(ki|R) = 0.5
  • assume the distribution of index terms among non-relevant documents is approximated by the distribution of index terms among all documents in the collection
  • P(ki|R̄) = ni / N, where ni is the no. of documents containing index term ki and N is the total no. of documents

32
  • Let V be subset of documents initially retrieved
    and ranked by model
  • let Vi be subset of V composed of documents in V
    with index term ki
  • P(ki|R) is approximated by the distribution of index term ki among the documents retrieved
  • P(ki|R) = Vi / V
  • P(ki|R̄) is approximated by considering that all non-retrieved documents are not relevant
  • P(ki|R̄) = (ni - Vi) / (N - V)
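
A minimal sketch of this refinement step, with toy counts assumed for illustration (V and Vi are taken here as document counts):

```python
def refine_estimates(V, Vi, N, n_i):
    """Return updated (P(ki|R), P(ki|R-bar)) from the retrieved subset."""
    p_i = Vi / V                 # P(ki|R)     ~ Vi / V
    u_i = (n_i - Vi) / (N - V)   # P(ki|R-bar) ~ (ni - Vi) / (N - V)
    return p_i, u_i

# e.g. N = 100 documents, ni = 20 contain ki; V = 10 retrieved, Vi = 6 contain ki
print(refine_estimates(V=10, Vi=6, N=100, n_i=20))   # (0.6, 0.1555...)
```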

33
  • Advantages
  • documents ranked in decreasing order of their
    probability of being relevant
  • Disadvantages
  • need to guess initial separation of relevant and
    non-relevant sets
  • all index term weights are binary
  • index terms are assumed to be mutually independent