IST2140 Information Storage and Retrieval - PowerPoint PPT Presentation

1
IST2140Information Storage and Retrieval
  • Week 4
  • Information Retrieval Models II

2
Information Retrieval Models
  • A model is an embodiment of the theory in which
    we define a set of objects about which assertions
    can be made and restrict the ways in which
    classes of objects can interact
  • A retrieval model specifies the representations
    used for documents and information needs, and how
    they are compared.
  • (Turtle & Croft, 1992)

3
Information Retrieval Models
  • An IR model specifies the representations used
    for documents and information needs, and how they
    are compared
  • Three classic models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model

4
Boolean Model
  • Exact match system
  • Document features are binary variables {0, 1}
  • Features can be document terms, also more complex
    information (dates, source, authors, etc.)
  • Query is a Boolean expression using feature
    variables related by AND, OR, NOT operators
  • All documents that match the query are retrieved
    and are treated as equally relevant (no ranking)
  • Extensions to model
  • Add additional operators, e.g. positional
  • Introduce term weights
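
The exact-match behaviour above can be sketched in a few lines. The documents and terms here are invented for illustration; each document is reduced to a set of binary term features, and a query is evaluated with ordinary set operations, so every matching document is returned as an equal:

```python
# Minimal sketch of exact-match Boolean retrieval (illustrative data).
docs = {
    1: {"information", "retrieval", "boolean"},
    2: {"information", "storage"},
    3: {"vector", "retrieval"},
}

def having(term):
    """Ids of documents whose binary feature for `term` is 1 (present)."""
    return {d for d, terms in docs.items() if term in terms}

and_result = having("information") & having("retrieval")   # AND
or_result = having("storage") | having("vector")           # OR
not_result = having("retrieval") - having("boolean")       # AND NOT

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Note that the result sets carry no ordering information: every retrieved document is an equally good match, which is exactly the weakness the weighted extensions address.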

5
Vector space model
  • Documents and queries are represented as vectors
    in l-dimensional hyperspace
  • Each dimension corresponds to possible document
    feature
  • Vector elements are usually weighted 0 to 1
  • Matching function is a distance metric that
    operates on document and query vectors
  • Documents can be ranked in order of distance from
    query
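
A minimal sketch of the idea, using invented documents and cosine similarity as the matching function (weights in [0, 1]; each dict key is one dimension of the hyperspace):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "d1": {"information": 0.8, "retrieval": 0.6},
    "d2": {"storage": 0.9, "information": 0.3},
}
query = {"information": 1.0, "retrieval": 1.0}

# rank documents by decreasing similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)
```

Unlike the Boolean model, this produces a ranking rather than an unordered match set.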

6
Probabilistic Model
  • First proposed by Maron & Kuhns (1960)
  • Further developed by (inter alia)
  • Maron & Cooper
  • Van Rijsbergen and colleagues
  • Croft and Turtle (InQuery)
  • Robertson and colleagues (Okapi)

7
Probabilistic Model
  • Views retrieval as an attempt to answer a basic
    question: what is the probability that this
    document is relevant to this query?
  • expressed as
  • P(REL|D)
  • i.e. P(x|y), the probability of x given y (here,
    the probability of relevance given a particular
    document D)

8
Assumptions
  • document here means the content representation or
    description, i.e. surrogate
  • relevance is binary
  • relevance of a document is independent of
    relevance of other documents
  • terms are independent of one another

9
Independence Assumption
  • General model is developed by making a strong
    Independence Assumption
  • Given relevance, the attributes are statistically
    independent
  • i.e. within each class of documents, relevant or
    non-relevant, the attributes are statistically
    independent
  • Simplifies theory

10
Probability Ranking Principle
  • If retrieved documents are ordered by decreasing
    probability of relevance on the data available,
    then the system's effectiveness is the best that
    can be obtained for that data

11
Formal Model
  • Given a document D and query Q, there are two
    possible events
  • REL, that D is relevant to Q
  • notREL, that D is not relevant to Q
  • We want to calculate the probability
  • P(REL|D),
  • i.e. the probability that a document is relevant
    given its content

12
Formal Model
  • Assuming relevance is binary
  • P(notREL) = 1 - P(REL)
  • Probability that a document is relevant is
  • P(REL) = n / N, where
  • n = no. of relevant docs
  • N = no. of docs
  • Probability that a document is not relevant is
  • P(notREL) = 1 - P(REL) = (N - n) / N

13
  • Assuming term independence, then
  • P(D|REL) = ∏i P(ai|REL)
  • i.e. probability that a document is relevant is
    based on the probability of relevance of
    individual terms
  • Can derive a matching score of a document which
    is the sum of the weights of the matching
    (present) terms
  • For simplicity use log-odds rather than odds

14
What term weights to use?
  • What contribution does the presence of a term in
    specific documents make to a document's
    probability of relevance?
  • E.g.
  • CFW = collection frequency weight = log(N/ni)
  • RW = relevance weight
  • CW = combined weight (using in-document frequency)
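
A minimal sketch of the collection frequency weight, CFW = log(N/ni), and of the matching score as a sum of the weights of the query terms present in a document. The collection counts below are invented for illustration; note how the frequent term "the" contributes almost nothing:

```python
import math

N = 1000                      # documents in the collection (invented)
n = {"information": 400, "retrieval": 50, "the": 990}   # docs containing term

def cfw(term):
    """Collection frequency weight (idf): log(N / n_i)."""
    return math.log(N / n[term])

def score(doc_terms, query_terms):
    """Sum the weights of the matching (present) query terms."""
    return sum(cfw(t) for t in query_terms if t in doc_terms and t in n)

doc = {"information", "retrieval", "the"}
print(round(score(doc, {"information", "retrieval"}), 3))
```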

15
Formula for relevance weights
  •          (r + 0.5)(N - n - R + r + 0.5)
  • RW = log -------------------------------
  •          (R - r + 0.5)(n - r + 0.5)
  • where r = no. of relevant docs containing term i
  • n = no. of docs containing term i
  • R = no. of relevant docs for the query
  • N = no. of docs in the collection
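
The relevance weight formula (with the standard 0.5 corrections) can be computed directly. The counts below are invented for illustration; a term concentrated in the known relevant documents gets a large positive weight:

```python
import math

def relevance_weight(r, n, R, N):
    """Relevance weight with 0.5 corrections.
    r = relevant docs containing the term, n = docs containing the term,
    R = relevant docs, N = docs in the collection."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# a term in 8 of 10 known relevant docs but only 20 of 1000 docs overall
print(round(relevance_weight(r=8, n=20, R=10, N=1000), 3))
```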

16
Probabilistic Model
  • Extensively tested
  • See for instance Sparck Jones et al., A
    probabilistic model of information retrieval:
    development and comparative experiments, Parts I
    and II. Information Processing & Management 36:
    779-840.
  • -based on performance figures from Cranfield,
    UKCIS, NPL, TREC

17
Sparck Jones Evaluation
  • CW > RW > CFW (idf) > UW
  • i.e. performance improves in steps from
  • unweighted terms, to
  • collection frequency weights, to
  • relevance weights, to
  • combined weights (using in-document frequency)
  • Largest single gains from using term frequency
    information
  • Also very noticeable gain from using relevance
    information

18
Classic IR Models
  • Boolean considered weakest
  • Vector vs. probabilistic
  • Through several different measures, Salton and
    Buckley showed that the vector model is expected
    to outperform the probabilistic model with
    general collections. This also seems to be the
    dominant thought among researchers,
    practitioners, and the Web community, where the
    popularity of the vector model runs high.
    (Baeza-Yates & Ribeiro-Neto, p. 34)

19
Classic IR Models
  • Vector vs. probabilistic
  • Numerous experiments demonstrate that
    probabilistic retrieval procedures yield good
    results. However, the results have not been
    sufficiently better than those obtained using
    Boolean or vector techniques to convince system
    developers to move heavily in this direction
    (Korfhage, p. 92)
  • However, see recent Sparck Jones et al article,
    and success of probabilistic systems such as
    InQuery and Okapi at TREC.

20
Latest Model: the Language Model
  • a statistical language model is a probabilistic
    model for generating text
  • In the late 1990s the language model was proposed
    for information retrieval
  • (not a new model, but new for IR)
  • generates scores for probability that query is
    generated from the document model

21
Language Models in IR
  • Views documents as models and considers queries
    as strings of texts randomly sampled from models
  • documents are ranked by probability that a query
    Q would be observed during repeated random
    sampling from the document model MD
  • P(Q|MD)
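
A minimal sketch of query-likelihood ranking: each document is treated as a language model MD, and documents are scored by P(Q|MD). The documents are invented; Jelinek-Mercer (linear) smoothing with the collection model is one common way to avoid zero probabilities, and lambda = 0.5 is an arbitrary illustrative choice:

```python
from collections import Counter

LAMBDA = 0.5   # illustrative smoothing weight

docs = {
    "d1": "information retrieval retrieval models".split(),
    "d2": "information storage systems".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(collection), len(collection)

def p_query(query, doc):
    """P(Q|M_D): product over query terms of the smoothed term probability."""
    counts, dlen = Counter(doc), len(doc)
    p = 1.0
    for t in query:
        p_doc = counts[t] / dlen              # maximum-likelihood doc model
        p_coll = coll_counts[t] / coll_len    # collection (background) model
        p *= LAMBDA * p_doc + (1 - LAMBDA) * p_coll
    return p

query = "retrieval models".split()
ranked = sorted(docs, key=lambda d: p_query(query, docs[d]), reverse=True)
print(ranked)
```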

22
Other IR Models
  • Extended Boolean
  • Fuzzy Set
  • Cluster-based Retrieval

23
Cluster-based retrieval
  • Cluster analysis is a technique in multivariate
    analysis used to generate a category structure
    that fits a set of observations
  • A form of automatic classification, but the
    classes are not known prior to processing; they
    are defined by the clustering process

24
Clustering of Text
  • Cluster documents on basis of terms they contain
  • Cluster documents on basis of co-occurring
    citations
  • Cluster terms on basis of documents they occur in

25
Applications in IR
  • Cluster-based retrieval: retrieve documents in
    the same cluster or hierarchy
  • Text categorization: used to categorize documents
  • Term clusters: build a thesaurus of like terms
    for use in query expansion
  • Web IR: use post-hoc clustering to group
    retrieved web pages (e.g. jaguar: car, animal,
    software)
  • Use in text mining to find patterns

26
Practicalities of clustering
  • Determine attributes clusters will be based on
  • Choose a clustering method
  • Choose a similarity function
  • Create the clusters
  • Assess validity of result

27
Similarity Matrix
  • Many similarity measures for text: Dice,
    Jaccard, and cosine are commonly used
  • Calculate the similarity matrix (lower triangle)
  • S21
  • S31 S32
  • S41 S42 S43
  • ...
  • Sn1 Sn2 Sn3 ... Sn(n-1)
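
Building the lower-triangular similarity matrix can be sketched as follows, with invented documents represented as term sets and the Dice coefficient as the similarity measure (Jaccard or cosine could be substituted):

```python
docs = [
    {"information", "retrieval"},
    {"information", "storage"},
    {"storage", "systems"},
    {"retrieval", "models"},
]

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# S[i] holds the similarities of document i with each earlier document j < i
S = [[dice(docs[i], docs[j]) for j in range(i)] for i in range(len(docs))]
for row in S[1:]:
    print([round(s, 2) for s in row])
```

Only the lower triangle is stored because similarity is symmetric and Sii is trivially 1.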

28
Clustering methods
  • Non-hierarchical
  • Single pass
  • Reallocation
  • Hierarchical
  • Single link
  • Complete link
  • Ward's method

29
Single Pass Method
  • First document becomes representative for first
    cluster
  • For each document Di, calculate Sim with each
    cluster representative
  • If Simmax > threshold SimT, add Di to the
    corresponding cluster; otherwise start a new
    cluster
  • If any document Di remains, return to step 2
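
The steps above can be sketched directly. The documents, the Dice similarity measure, the 0.3 threshold, and the naive union-based representative update are all illustrative choices:

```python
SIM_T = 0.3   # illustrative similarity threshold

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def single_pass(docs):
    clusters = []          # each cluster: {"rep": set, "members": [ids]}
    for i, d in enumerate(docs):
        if clusters:
            sims = [dice(d, c["rep"]) for c in clusters]
            best = max(range(len(sims)), key=lambda k: sims[k])
            if sims[best] > SIM_T:
                clusters[best]["members"].append(i)
                clusters[best]["rep"] |= d   # naive representative update
                continue
        clusters.append({"rep": set(d), "members": [i]})
    return [c["members"] for c in clusters]

docs = [
    {"information", "retrieval"},
    {"information", "storage"},
    {"cats", "dogs"},
    {"dogs", "pets"},
]
print(single_pass(docs))
```

Because it needs only one pass over the collection it is cheap, but the result depends on document order.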

30
Reallocation Method
  • Select M cluster representatives or centroids
  • For i = 1 to N, assign Di to the most similar
    centroid
  • For j = 1 to M, recalculate cluster centroid Cj
  • Repeat steps 2 and 3 until cluster membership
    stabilizes
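
The reallocation loop can be sketched on 2-D points with Euclidean distance (the points and the M = 2 starting centroids are invented for illustration): assign each item to its nearest centroid, recompute the centroids, and repeat until membership stabilizes:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def reallocate(points, centroids):
    assignment = None
    while True:
        new = [min(range(len(centroids)),
                   key=lambda j: dist(p, centroids[j])) for p in points]
        if new == assignment:            # membership has stabilized
            return assignment, centroids
        assignment = new
        for j in range(len(centroids)):  # recalculate each centroid Cj
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = reallocate(points, [(0, 0), (10, 10)])
print(assignment)
```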

31
HACM (Hierarchical agglomerative clustering
methods)
  • Identify two closest points and combine them in a
    cluster
  • Identify and combine the next two closest points,
    treating existing clusters as points
  • If more than one cluster remains, return to step 1
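
The agglomerative steps above can be sketched on 1-D points, using the single-link definition of "closest" (distance between the closest pair of members); the input points are invented for illustration:

```python
def single_link(a, b):
    """Single-link distance: closest pair of points across two clusters."""
    return min(abs(x - y) for x in a for y in b)

def hacm(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # identify the two closest clusters (clusters treated as points)
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]],
                                                     clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # combine into one cluster
        del clusters[j]
    return merges

print(hacm([1, 2, 9, 10, 30]))
```

The sequence of merges is exactly what a dendrogram records; swapping `single_link` for a least-similar-pair function gives complete link.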

32
HACM
  • Differ based on
  • Definition of most similar pair
  • Cluster representation used
  • High storage and time requirements
  • Output presented as a dendrogram

33
HACM
  • Single link
  • Joins most similar pair of objects not yet in
    same cluster
  • Complete link
  • Uses least similar pair between clusters to
    measure intercluster similarity
  • Ward's method
  • Merges the pair that minimizes the increase in
    within-group error sum of squares; uses Euclidean
    distance; the cluster center is a weighted average

34
How many clusters?
  • HACMs give a complete hierarchy of clusters, from
    1 to N
  • Use a stopping rule to determine when to stop the
    process; this yields the number of clusters

35
Cluster-based Retrieval
  • Nearest neighbour cluster (most similar documents)
  • Rank clusters rather than documents; retrieve
    entire clusters
  • Use the hierarchy structure: top-down or
    bottom-up search