Lecture 8: Probabilistic IR and Relevance Feedback. SIMS 202: Information Organization and Retrieval. Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS.

1
Lecture 8: Probabilistic IR and Relevance Feedback
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2004
  • http://www.sims.berkeley.edu/academics/courses/is202/f04/

2
Lecture Overview
  • Review
  • Vector Representation
  • Term Weights
  • Vector Matching
  • Clustering
  • Probabilistic Models of IR
  • Relevance Feedback

Credit for some of the slides in this lecture
goes to Marti Hearst
4
Document Vectors
5
Vector Space: Documents and Queries
Q is a query, also represented as a vector
Boolean term combinations
6
Documents in Vector Space
[Figure: documents D1 through D11 plotted as points in a three-dimensional term space with axes t1, t2 and t3]
7
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

8
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

9
tfidf weights
10
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10,000 documents (N = 10000)
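
The IDF formula and its table of values did not survive in this transcript. As a rough sketch, assuming the common definition IDF = log(N / n) with a base-10 logarithm (the base and the sample document frequencies below are assumptions, not taken from the slide), the behaviour described above can be reproduced like this:

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, assumed here to be log10(N / n)."""
    if doc_freq <= 0 or doc_freq > n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log10(n_docs / doc_freq)


N = 10000  # collection size from the slide
# Illustrative document frequencies: from a term in every document
# down to a term that occurs in only one document.
for doc_freq in (10000, 5000, 20, 1):
    print(doc_freq, round(idf(N, doc_freq), 3))
```

A term that appears in every document gets a weight of 0, and the weight grows as the term becomes rarer.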
11
tfidf Normalization
  • Normalize the term weights (so longer vectors are
    not unfairly given more weight)
  • Normalize usually means force all values to fall
    within a certain range, usually between 0 and 1,
    inclusive
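
One common way to do this is to divide each weight by the length of the vector. The sketch below assumes that choice (unit-length, or cosine, normalization); the slide does not say which normalization it has in mind, and the function name is only illustrative.

```python
import math


def l2_normalize(weights):
    """Scale a vector of non-negative term weights to unit length.

    After scaling, every weight lies between 0 and 1 inclusive, so a long
    document no longer outweighs a short one simply by having more terms.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        return list(weights)  # an all-zero vector is left unchanged
    return [w / norm for w in weights]


print(l2_normalize([3.0, 4.0]))
```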

12
Vector Space Similarity
  • Now, the similarity of two documents is
  • This is also called the cosine, or normalized
    inner product
  • The normalization was done when weighting the
    terms
  • Note that the wik weights can be stored in the
    vectors/ inverted files for the documents

13
Vector Space Matching
Di = (di1, wdi1; di2, wdi2; ...; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; ...; qit, wqit)
Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
[Figure: Q, D1 and D2 drawn as vectors in a plane with axes Term A and Term B, each running from 0 to 1.0]
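
The matching step can be sketched with the three vectors from this slide. The function below computes the cosine directly from unnormalized weights; it is a minimal illustration, not the course's own code.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D1), 2))  # about 0.73
print(round(cosine(Q, D2), 2))  # about 0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1.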
14
Vector Space Visualization
15
Document/Document Matrix
16
Text Clustering
  • Clustering is "the art of finding groups in data."
  • -- Kaufman and Rousseeuw

[Figure: scatter plot with axes Term 1 and Term 2]
18
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more for visualization than having any real
    basis
  • most similarity measures work about the same
    regardless of model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms
  • Retrieval efficiency vs. indexing and update
    efficiency for stored pre-calculated weights

19
Lecture Overview
  • Review
  • Vector Representation
  • Term Weights
  • Vector Matching
  • Clustering
  • Probabilistic Models of IR
  • Relevance Feedback

20
Probabilistic Models
  • Rigorous formal model attempts to predict the
    probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Relies on accurate estimates of probabilities

21
Probability Ranking Principle
  • If a reference retrieval system's response to
    each request is a ranking of the documents in the
    collections in the order of decreasing
    probability of usefulness to the user who
    submitted the request, where the probabilities
    are estimated as accurately as possible on the
    basis of whatever data has been made available to
    the system for this purpose, then the overall
    effectiveness of the system to its users will be
    the best that is obtainable on the basis of that
    data.

Stephen E. Robertson, Journal of Documentation, 1977
22
Model 1: Maron and Kuhns
  • Concerned with estimating probabilities of
    relevance at the point of indexing
  • If a patron came with a request using term ti,
    what is the probability that she/he would be
    satisfied with document Dj ?

23
Model 1
  • A patron submits a query (call it Q) consisting
    of some specification of her/his information
    need. Different patrons submitting the same
    stated query may differ as to whether or not they
    judge a specific document to be relevant. The
    function of the retrieval system is to compute
    for each individual document the probability that
    it will be judged relevant by a patron who has
    submitted query Q.

Robertson, Maron & Cooper, 1982
24
Model 1: Bayes
  • A is the class of events of using the library
  • Di is the class of events of Document i being
    judged relevant
  • Ij is the class of queries consisting of the
    single term Ij
  • P(Di | A, Ij) = probability that if a query is
    submitted to the system then a relevant document
    is retrieved

25
Model 2
  • Documents have many different properties; some
    documents have all the properties that the patron
    asked for, and other documents have only some or
    none of the properties. If the inquiring patron
    were to examine all of the documents in the
    collection she/he might find that some having all
    the sought after properties were relevant, but
    others (with the same properties) were not
    relevant. And conversely, he/she might find that
    some of the documents having none (or only a few)
    of the sought after properties were relevant,
    others not. The function of a document retrieval
    system is to compute the probability that a
    document is relevant, given that it has one (or a
    set) of specified properties.

Robertson, Maron & Cooper, 1982
26
Model 2: Robertson & Sparck Jones
Given a term t and a query q

                              Document Relevance
                          +          -                Total
Document Indexing   +     r          n - r            n
                    -     R - r      N - n - R + r    N - n
                  Total   R          N - R            N
27
Robertson-Sparck Jones Weights
  • Retrospective formulation

28
Robertson-Sparck Jones Weights
  • Predictive formulation
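
The two formulas are missing from this transcript. The sketch below uses the forms usually given for the Robertson-Sparck Jones weight, written in terms of the contingency-table counts r, n, R and N defined earlier; treat it as an assumption about what the slides showed. The predictive form adds 0.5 to each cell so that the weight stays defined when a cell is zero.

```python
import math


def rsj_retrospective(r, n, R, N):
    """Log odds ratio of the term occurring in relevant vs. non-relevant docs.

    Undefined (division by zero or log of zero) when any table cell is 0.
    """
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))


def rsj_predictive(r, n, R, N):
    """Same ratio with 0.5 added to every cell of the contingency table."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5))
        / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


# Illustrative counts (not from the lecture): 100 documents, 10 relevant,
# the term indexes 20 documents, 8 of which are relevant.
print(rsj_retrospective(r=8, n=20, R=10, N=100))
print(rsj_predictive(r=8, n=20, R=10, N=100))
```

With no relevance information at all (r = 0, R = 0) the predictive form reduces to log((N - n + 0.5) / (n + 0.5)), which behaves much like an IDF weight.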

29
Probabilistic Models: Some Unifying Notation
  • D = All present and future documents
  • Q = All present and future queries
  • (Di, Qj) = A document-query pair
  • x = class of similar documents
  • y = class of similar queries
  • Relevance (R) is a relation

30
Probabilistic Models
  • Model 1 -- Probabilistic Indexing, P(R | y, Di)
  • Model 2 -- Probabilistic Querying, P(R | Qj, x)
  • Model 3 -- Merged Model, P(R | Qj, Di)
  • Model 0 -- P(R | y, x)
  • Probabilities are estimated based on prior usage
    or relevance estimation

31
Probabilistic Models
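[Figure: the set of queries Q, containing a class y and an individual query Qj, alongside the set of documents D, containing a class x and an individual document Di]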
32
Logistic Regression
  • Another approach to estimating probability of
    relevance
  • Based on work by William Cooper, Fred Gey and
    Daniel Dabney
  • Builds a regression model for relevance
    prediction based on a set of training data
  • Uses less restrictive independence assumptions
    than Model 2
  • Linked Dependence

33
So What's Regression?
  • A method for fitting a curve (not necessarily a
    straight line) through a set of points using some
    goodness-of-fit criterion
  • The most common type of regression is linear
    regression

34
What's Regression?
  • Least Squares Fitting is a mathematical procedure
    for finding the best fitting curve to a given set
    of points by minimizing the sum of the squares of
    the offsets ("the residuals") of the points from
    the curve
  • The sum of the squares of the offsets is used
    instead of the offset absolute values because
    this allows the residuals to be treated as a
    continuous differentiable quantity
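
For the straight-line case, the least-squares fit has a closed form. The following is a small sketch of it; the function names and the sample points are illustrative only.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of squared vertical offsets of the points from the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


pts = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
slope, intercept = fit_line(pts)
print(slope, intercept, sum_squared_residuals(pts, slope, intercept))
```

Moving the slope or intercept away from the fitted values can only increase the sum of squared residuals, which is what "best fitting" means here.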

35
Logistic Regression
36
Probabilistic Models: Logistic Regression
  • Estimates for relevance based on log-linear model
    with various statistical measures of document
    content as independent variables

Log odds of relevance is a linear function of
attributes
Term contributions summed
Probability of relevance is recovered by inverting the log odds (the logistic function)
37
Logistic Regression Attributes
  • Average Absolute Query Frequency
  • Query Length
  • Average Absolute Document Frequency
  • Document Length
  • Average Inverse Document Frequency
  • Inverse Document Frequency
  • Number of Terms in common between query and
    document -- logged
38
Logistic Regression
  • Probability of relevance is based on Logistic
    regression from a sample set of documents to
    determine values of the coefficients
  • At retrieval the probability estimate is obtained
    by
  • For the 6 X attribute measures shown previously
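
The retrieval-time formula is missing from this transcript. As a sketch, assuming the usual logistic form (log odds equal to an intercept plus a weighted sum of the attribute values, then converted to a probability), the computation looks like the following. The coefficient and attribute values are placeholders, not the fitted values used in the lecture or in any real system.

```python
import math


def probability_of_relevance(coefficients, attributes):
    """Turn a linear log-odds score into a probability of relevance.

    coefficients: intercept followed by one coefficient per attribute.
    attributes:   the attribute values X1..Xk for one query-document pair.
    """
    intercept, *weights = coefficients
    if len(weights) != len(attributes):
        raise ValueError("need one coefficient per attribute")
    log_odds = intercept + sum(c * x for c, x in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients and attribute values for six attributes.
coeffs = [-3.5, 1.2, -0.2, 0.9, -0.1, 0.6, 0.3]
x = [0.7, 1.4, 0.5, 2.0, 1.1, 1.6]
print(probability_of_relevance(coeffs, x))
```

Documents are then ranked by this estimate, highest first.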

39
Probabilistic Models
Advantages
  • Strong theoretical basis
  • In principle should supply the best predictions
    of relevance given available information
  • Can be implemented similarly to Vector
Disadvantages
  • Relevance information is required -- or is
    "guestimated"
  • Important indicators of relevance may not be terms
    -- though terms only are usually used
  • Optimally requires on-going collection of
    relevance information

40
Vector and Probabilistic Models
  • Support natural language queries
  • Treat documents and queries the same
  • Support relevance feedback searching
  • Support ranked retrieval
  • Differ primarily in theoretical basis and in how
    the ranking is calculated
  • Vector assumes relevance
  • Probabilistic relies on relevance judgments or
    estimates

41
Current Use of Probabilistic Models
  • Virtually all the major systems in TREC now use
    the Okapi BM25 formula which incorporates the
    Robertson-Sparck Jones weights

42
Okapi BM25
  • Where
  • Q is a query containing terms T
  • K is k1((1-b) + b.dl/avdl)
  • k1, b and k3 are parameters, usually 1.2, 0.75
    and 7-1000
  • tf is the frequency of the term in a specific
    document
  • qtf is the frequency of the term in a topic from
    which Q was derived
  • dl and avdl are the document length and the
    average document length measured in some
    convenient unit
  • w(1) is the Robertson-Sparck Jones weight
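
The scoring formula itself is missing from this transcript. The sketch below uses the commonly published form of BM25, the sum over query terms of w(1) * ((k1+1)tf / (K+tf)) * ((k3+1)qtf / (k3+qtf)), with the symbols defined above. The function names, the choice k3 = 7 and the no-relevance-information form of w(1) are assumptions made for illustration.

```python
import math


def rsj_weight(n, N, r=0, R=0):
    """Predictive Robertson-Sparck Jones weight (0.5 added to each cell)."""
    return math.log(
        ((r + 0.5) / (R - r + 0.5))
        / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def bm25_score(query_tf, doc_tf, doc_freq, n_docs, dl, avdl,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document against a query.

    query_tf: term -> frequency in the topic (qtf)
    doc_tf:   term -> frequency in this document (tf)
    doc_freq: term -> number of documents containing the term (n)
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        w1 = rsj_weight(doc_freq[term], n_docs)
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


print(bm25_score({"retrieval": 1}, {"retrieval": 2}, {"retrieval": 10},
                 n_docs=1000, dl=100, avdl=100))
```

A document longer than average gets a larger K, which dampens the contribution of each term occurrence.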

43
Language Models
  • A recent addition to the probabilistic models is
    language modeling that estimates the
    probability that a query could have been produced
    by a given document.
  • This is a slight variation on the other
    probabilistic models that has led to some modest
    improvements in performance
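
One simple way to make this concrete is a unigram query-likelihood score. The sketch below assumes add-one smoothing so that unseen query terms do not zero out the score; the slide does not specify a smoothing method, and the documents and vocabulary size are made up for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, vocabulary_size):
    """Log probability of the query under a unigram model of the document.

    Uses add-one (Laplace) smoothing, which is an assumption of this sketch.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    return sum(
        math.log((counts[t] + 1) / (doc_len + vocabulary_size))
        for t in query_terms
    )


doc_a = "information retrieval systems rank documents".split()
doc_b = "cooking recipes for pasta".split()
query = ["information", "retrieval"]
vocab = len(set(doc_a) | set(doc_b))

print(query_log_likelihood(query, doc_a, vocab))
print(query_log_likelihood(query, doc_b, vocab))
```

Documents are ranked by this log likelihood, so doc_a, which contains both query terms, scores above doc_b.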

44
Logistic Regression and Cheshire II
  • The Cheshire II system (see readings) uses
    Logistic Regression equations estimated from TREC
    full-text data
  • Used for a number of production level systems
    here and in the U.K.

45
Lecture Overview
  • Review
  • Vector Representation
  • Term Weights
  • Vector Matching
  • Clustering
  • Probabilistic Models of IR
  • Relevance Feedback

46
Querying in IR System
47
Relevance Feedback in an IR System
48
Query Modification
  • Problem: How to reformulate the query?
  • Thesaurus expansion
  • Suggest terms similar to query terms
  • Relevance feedback
  • Suggest terms (and documents) similar to
    retrieved documents that have been judged to be
    relevant

49
Relevance Feedback
  • Main Idea
  • Modify existing query based on relevance
    judgements
  • Extract terms from relevant documents and add
    them to the query
  • And/or re-weight the terms already in the query
  • Two main approaches
  • Automatic (pseudo-relevance feedback)
  • Users select relevant documents
  • Users/system select terms from an
    automatically-generated list

50
Relevance Feedback
  • Usually do both
  • Expand query with new terms
  • Re-weight terms in query
  • There are many variations
  • Usually positive weights for terms from relevant
    docs
  • Sometimes negative weights for terms from
    non-relevant docs
  • Remove terms ONLY in non-relevant documents

51
Rocchio Method
52
Rocchio/Vector Illustration
Q0 = "retrieval of information" = (0.7, 0.3)
D1 = "information science" = (0.2, 0.8)
D2 = "retrieval systems" = (0.9, 0.1)
Q' = ½ Q0 + ½ D1 = (0.45, 0.55)
Q'' = ½ Q0 + ½ D2 = (0.80, 0.20)
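
The Rocchio formula is not in this transcript. A sketch of the commonly cited form is given below: the new query is alpha times the old query, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the non-relevant documents. The parameter names and defaults are assumptions; with alpha = beta = ½ and no non-relevant documents it reproduces the two results in the illustration above.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=0.0):
    """Return the modified query vector.

    query:        list of term weights
    relevant:     list of document vectors judged relevant
    non_relevant: list of document vectors judged non-relevant
    """
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(query))]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [alpha * q + beta * r - gamma * s for q, r, s in zip(query, rel, non)]


Q0 = [0.7, 0.3]
D1 = [0.2, 0.8]
D2 = [0.9, 0.1]

print(rocchio(Q0, [D1], [], alpha=0.5, beta=0.5))  # close to (0.45, 0.55)
print(rocchio(Q0, [D2], [], alpha=0.5, beta=0.5))  # close to (0.80, 0.20)
```

With gamma greater than 0 some weights can become negative; a common choice, not shown here, is to clip them at zero.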
53
Example Rocchio Calculation
[Worked example; the numeric values are not in this transcript. Rows shown: relevant docs, non-relevant doc, original query, constants, Rocchio calculation, resulting feedback query]
54
Rocchio Method
  • Rocchio automatically
  • Re-weights terms
  • Adds in new terms (from relevant docs)
  • Have to be careful when using negative terms
  • Rocchio is not a machine learning algorithm
  • Most methods perform similarly
  • Results heavily dependent on test collection
  • Machine learning methods are proving to work
    better than standard IR approaches like Rocchio

55
Probabilistic Relevance Feedback
Given a query term t

                              Document Relevance
                          +          -                Total
Document Indexing   +     r          n - r            n
                    -     R - r      N - n - R + r    N - n
                  Total   R          N - R            N

Where N is the number of documents seen
56
Robertson-Sparck Jones Weights
  • Retrospective formulation

57
Using Relevance Feedback
  • Known to improve results
  • In TREC-like conditions (no user involved)
  • What about with a user in the loop?
  • How might you measure this?

58
Relevance Feedback Summary
  • Iterative query modification can improve
    precision and recall for a standing query
  • In at least one study, users were able to make
    good choices by seeing which terms were suggested
    for R.F. and selecting among them (Koenemann &
    Belkin)

59
Alternative Notions of Relevance Feedback
  • Find people whose taste is similar to yours
  • Will you like what they like?
  • Follow a user's actions in the background
  • Can this be used to predict what the user will
    want to see next?
  • Track what lots of people are doing
  • Does this implicitly indicate what they think is
    good and not good?

60
Alternative Notions of Relevance Feedback
  • Several different criteria to consider
  • Implicit vs. Explicit judgements
  • Individual vs. Group judgements
  • Standing vs. Dynamic topics
  • Similarity of the items being judged vs.
    similarity of the judges themselves

61
Collaborative Filtering (Social Filtering)
  • If Pam liked the paper, I'll like the paper
  • If you liked Star Wars, you'll like Independence
    Day
  • Rating based on ratings of similar people
  • Ignores the text, so works on text, sound,
    pictures, etc.
  • But: initial users can bias ratings of future
    users

62
Ringo Collaborative Filtering
  • Users rate musical artists from like to dislike
  • 1 = detest, 7 = can't live without, 4 = ambivalent
  • There is a normal distribution around 4
  • However, what matters are the extremes
  • Nearest Neighbors Strategy: find similar users
    and predict a (weighted) average of user ratings
  • Pearson r algorithm: weight by degree of
    correlation between user U and user J
  • 1 means very similar, 0 means no correlation, -1
    means dissimilar
  • Works better to compare against the ambivalent
    rating (4), rather than the individual's average
    score
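
The nearest-neighbour idea can be sketched as follows. The correlation is taken around the ambivalent rating 4 instead of each user's mean, as the last bullet suggests. Using only positively correlated neighbours is an assumption of this sketch, and the users and ratings are invented.

```python
import math


def constrained_pearson(ratings_u, ratings_j, midpoint=4):
    """Correlation of two users' ratings, measured around the midpoint."""
    common = set(ratings_u) & set(ratings_j)
    num = sum((ratings_u[a] - midpoint) * (ratings_j[a] - midpoint) for a in common)
    den = math.sqrt(
        sum((ratings_u[a] - midpoint) ** 2 for a in common)
        * sum((ratings_j[a] - midpoint) ** 2 for a in common)
    )
    return num / den if den else 0.0


def predict(user, others, artist, midpoint=4):
    """Similarity-weighted average of neighbours' ratings for one artist."""
    weighted, total = 0.0, 0.0
    for other in others:
        if artist not in other:
            continue
        w = constrained_pearson(user, other, midpoint)
        if w <= 0:
            continue  # assumption: ignore uncorrelated or dissimilar users
        weighted += w * other[artist]
        total += w
    return weighted / total if total else None


u = {"A": 7, "B": 1}
j = {"A": 6, "B": 2, "C": 7}  # agrees with u
k = {"A": 1, "B": 7, "C": 1}  # disagrees with u
print(constrained_pearson(u, j), constrained_pearson(u, k))
print(predict(u, [j, k], "C"))
```

Here user j agrees with u on both shared artists and user k disagrees, so the prediction for artist C follows j's rating.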

63
Social Filtering
  • Ignores the content, only looks at who judges
    things similarly
  • Works well on data relating to taste
  • something that people are good at predicting
    about each other too
  • Does it work for topic?
  • GroupLens results suggest otherwise (preliminary)
  • Perhaps for quality assessments
  • What about for assessing if a document is about a
    topic?

64
Summary
  • Relevance feedback is an effective means for
    user-directed query modification
  • Modification can be done with either direct or
    indirect user input
  • Modification can be done based on an individual's
    or a group's past input

65
David Hong on Cheshire
  • Cheshire II provided the paradigm of a fully
    standards-based IR system (SGML and Z39.50
    Protocol). While there are both benefits and
    drawbacks to implementing standards-based
    technologies, what can other IR systems gain from
    being standards-compliant and how could this
    model make other IR systems more flexible?
  • Cheshire II's interface allows users to specify
    conventional Boolean matching and probabilistic
    search. How would you infer this level of
    granularity in the form of a natural language
    query?
  • What would be some of the potential benefits of
    doing feedback searching with multiple records in
    a large Internet search engine?
  • What are the potential barriers in implementing
    this feature?

66
Next Time
  • Information Retrieval Evaluation and more on
    collaborative filtering
  • Readings for next time
  • An Evaluation of Retrieval Effectiveness (Blair &
    Maron)
  • Rave Reviews: Acquiring Relevance Assessments
    from Multiple Users (Belew)
  • A Case for Interaction: A Study of Interactive
    Information Retrieval Behavior and Effectiveness
    (Koenemann & Belkin)
  • Work Tasks and Socio-Cognitive Relevance: A
    Specific Example (Hjørland & Christensen)
  • Social Information Filtering: Algorithms for
    Automating "Word of Mouth" (Shardanand & Maes)