1
Relevance Ranking and Relevance Feedback
  • Carl Staelin

2
Motivation - Feast or famine
  • Queries return either too few or too many results
  • Users are generally looking for the best document
    with a particular piece of information
  • Users don't want to look through hundreds of
    documents to locate the information
  • ⇒ Rank documents according to expected relevance!

3
Model
  • Can we get user feedback?
  • Document score is influenced by similarity to
    previously user-rated documents
  • Can we utilize external information?
  • E.g., how many other documents reference this
    document?

4
Queries
  • Most queries are short
  • One to three words
  • Many queries are ambiguous
  • Saturn
  • Saturn the planet?
  • Saturn the car?

5
Sample Internal Features
  • Term frequency
  • Location of term appearance in document
  • Capitalization
  • Font
  • Relative font size
  • Bold, italic, ...
  • Appearance in <title>, <meta>, <h?> tags
  • Co-location with other (relevant) words

6
Sample External Features
  • Document citations
  • How often is it cited?
  • How important are the documents that cited it?
  • Relevance of text surrounding hyperlink
  • Relevance of documents citing this document
  • Location within website (e.g. height in the
    directory structure or click distance from /)
  • Popularity of pages from similar queries
  • Search engine links often connect through search
    site so they can track click-throughs
  • Your idea here

7
Problem
  • Given all these features, how do we rank the
    search results?
  • Often use similarity between query and document
  • May use other factors to weight ranking score,
    such as citation ranking
  • May use an iterative search which ranks documents
    according to similarity/dissimilarity to query
    and previously marked relevant/irrelevant
    documents

8
Relevance Feedback
  • Often an information retrieval system does not
    return useful information on the first try!
  • If at first you don't succeed, try, try again
  • Find out from the user which results were most
    relevant, and try to find more documents like
    them and less like the others
  • Assumption: relevant documents are somehow
    similar to each other and different from
    irrelevant documents
  • Question: how?

9
Relevance Feedback Methods
  • Two general approaches:
  • Create new queries with user feedback
  • Create new queries automatically
  • Re-compute document weights with new information
  • Expand or modify the query to more accurately
    reflect the user's desires

10
Vector Space Re-Weighting
  • Given a query Q with its query vector q
  • Initial, user-annotated results D
  • Dr = the relevant retrieved documents
  • Dn = the irrelevant retrieved documents
  • di are the document weight vectors
  • Update weights on query vector q
  • Re-compute similarity score to new q

11
Vector Space Re-Weighting
  • Basic idea:
  • Increase weight for terms appearing in relevant
    documents
  • Decrease weight for terms appearing in irrelevant
    documents
  • There are a few standard equations

12
Vector Space Re-Weighting
  • Rocchio
  • q' = αq + (β/|Dr|) Σdi∈Dr di - (γ/|Dn|) Σdi∈Dn di
  • Ide regular
  • q' = αq + β Σdi∈Dr di - γ Σdi∈Dn di
  • Ide Dec-Hi
  • q' = αq + β Σdi∈Dr di - γ maxdi∈Dn(di)
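A minimal Python sketch of the Rocchio update above, assuming dense NumPy term-weight vectors; the parameter values α=1.0, β=0.75, γ=0.15 are illustrative choices, not from the slides:

import numpy as np

def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dn|)*sum(Dn).
    The query and every document are dense vectors over the same term axis."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(relevant):
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q_new = q_new - gamma * np.mean(irrelevant, axis=0)
    return q_new  # terms seen only in irrelevant docs can end up negative

# Toy 4-term vocabulary: the query mentions terms 0 and 3.
q  = np.array([1.0, 0.0, 0.0, 1.0])
dr = [np.array([0.8, 0.5, 0.0, 0.9])]   # user-rated relevant
dn = [np.array([0.1, 0.0, 0.9, 0.2])]   # user-rated irrelevant
print(rocchio(q, dr, dn))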

13
Vector Space Re-Weighting
  • The initial query vector q0 will have non-zero
    weights only for terms appearing in the query
  • The query vector update process can add weight to
    terms that don't appear in the original query
  • Automatic expansion of the original query terms
  • Some terms can end up having negative weight!
  • E.g., if you want to find information on the
    planet Saturn, "car" could have a negative weight

14
Probabilistic Re-Weighting
  • After initial search, get feedback from user on
    document relevance
  • Use relevance information to recalculate term
    weights
  • Re-compute similarity and try again

15
Probabilistic Re-Weighting
  • Remember from last time
  • Simij ≈ Σk wik wjk (ln(P(tk|R)/(1 - P(tk|R)))
    + ln((1 - P(tk|¬R))/P(tk|¬R)))
  • P(tk|R) = 0.5
  • P(tk|¬R) = ni/N, where ni = # of docs containing tk
  • ⇒
  • Simij ≈ Σk wik wjk ln((N - ni)/ni)
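A small sketch showing how the no-feedback assumptions above collapse the two log-odds terms into the idf-like weight ln((N - ni)/ni); N and ni are corpus statistics:

import math

def initial_term_weight(n_i, N):
    """With no relevance information: P(tk|R) = 0.5 and P(tk|~R) = n_i / N,
    so the two log-odds terms reduce to ln((N - n_i) / n_i)."""
    p_rel, p_nonrel = 0.5, n_i / N
    w = math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
    assert abs(w - math.log((N - n_i) / n_i)) < 1e-9
    return w

# A rare term gets a large weight, a very common term a small one.
print(initial_term_weight(50, 10000), initial_term_weight(5000, 10000))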

16
Probabilistic Re-Weighting
  • Given document relevance feedback
  • Dr = the set of relevant retrieved docs
  • Drk = the subset of Dr containing tk
  • ⇒
  • P(tk|R) = |Drk| / |Dr|
  • P(tk|¬R) = (ni - |Drk|) / (N - |Dr|)

17
Probabilistic Re-Weighting
  • Substituting the new probabilities gives
  • Simij ≈ Σk wik wjk ln((|Drk| / (|Dr| - |Drk|)) ÷
    ((ni - |Drk|) / (N - |Dr| - (ni - |Drk|))))
  • However, small values of |Drk| and |Dr| can cause
    problems, so usually a fudge factor is added
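A sketch of the feedback-updated term weight with a 0.5 adjustment added to each count; the exact form of the fudge factor is an assumption, the slide only says one is usually added:

import math

def feedback_term_weight(Drk, Dr, n_i, N, adj=0.5):
    """Term weight after relevance feedback.
    Drk = # relevant retrieved docs containing tk, Dr = # relevant retrieved docs,
    n_i = # corpus docs containing tk, N = corpus size.
    `adj` keeps extreme counts (e.g. Drk = 0 or Drk = Dr) from blowing up the log."""
    p_rel    = (Drk + adj) / (Dr + 2 * adj)
    p_nonrel = (n_i - Drk + adj) / (N - Dr + 2 * adj)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

# Term present in 8 of 10 relevant retrieved docs but only 120 of 10,000 overall.
print(feedback_term_weight(Drk=8, Dr=10, n_i=120, N=10000))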

18
Probabilistic Re-Weighting
  • Effectively updates query term weights
  • No automatic query expansion
  • Terms not in the initial query are never
    considered
  • No memory of previous weights

19
Query Expansion Via Local Clustering
  • Create new queries automatically
  • Cluster initial search results
  • Use clusters to create new queries
  • Compute Sk(n), which is the set of keywords
    similar to tk that should be added to the query
  • D = the set of retrieved documents
  • V = the vocabulary of D

20
Association Clustering
  • Find terms that frequently co-occur within
    documents
  • skl = ckl = Σi fik fil
  • Or, the normalized association matrix s
  • skl = ckl / (ckk + cll - ckl)
  • Association cluster Sk(n)
  • Sk(n) = the n largest skl s.t. l ≠ k
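A sketch of association clustering, assuming a term-frequency matrix F over the retrieved documents (rows = documents, columns = terms):

import numpy as np

def association_clusters(F, n):
    """F[i, k] = frequency of term tk in retrieved document di.
    Returns, per term k, the indices of the n terms l (l != k) with the
    largest normalized association skl."""
    C = F.T @ F                                    # ckl = sum_i fik * fil
    diag = np.diag(C)
    S = C / (diag[:, None] + diag[None, :] - C)    # skl = ckl / (ckk + cll - ckl)
    np.fill_diagonal(S, -np.inf)                   # exclude l == k
    return np.argsort(-S, axis=1)[:, :n]

F = np.array([[2, 1, 0, 0],
              [1, 2, 1, 0],
              [0, 0, 1, 3]], dtype=float)
print(association_clusters(F, n=2))                # Sk(2) for each term k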

21
Metric Clustering
  • Measure distance between keyword appearances in a
    document
  • r(tk, tl) = # of words between tk and tl
  • ckl = Σtk Σtl (1 / r(tk, tl))
  • Normalized: skl = ckl / (|tk| × |tl|)
  • |tk| = # of words stemmed to tk
  • Sk(n) is the same as before
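A sketch of metric clustering for one document, assuming the document is already a list of stemmed tokens and that r(tk, tl) is taken as the positional distance between two occurrences:

from collections import defaultdict
from itertools import product

def metric_clusters(tokens):
    """ckl = sum over occurrence pairs of 1 / r(tk, tl);
    skl = ckl / (|tk| * |tl|), where |tk| = # of words stemmed to tk."""
    positions = defaultdict(list)
    for pos, stem in enumerate(tokens):
        positions[stem].append(pos)

    s = {}
    for tk, tl in product(positions, repeat=2):
        if tk == tl:
            continue
        ckl = sum(1.0 / abs(i - j) for i, j in product(positions[tk], positions[tl]))
        s[(tk, tl)] = ckl / (len(positions[tk]) * len(positions[tl]))
    return s

tokens = "saturn planet ring saturn car dealer".split()
pairs = sorted(metric_clusters(tokens).items(), key=lambda kv: -kv[1])
print(pairs[:3])   # closest term pairs; take the top n per term as Sk(n)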

22
Scalar Clustering
  • Terms with similar association clusters are more
    likely to be synonyms
  • sk is the row vector (skl) from association
    clustering
  • skl = (sk · sl) / (|sk| × |sl|)
  • Sk(n) is the same as before
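A sketch of scalar clustering: each term's row of association values is treated as a vector, and the new skl is the cosine between rows, so terms with similar association patterns score high even if they rarely co-occur directly:

import numpy as np

def scalar_clusters(S_assoc):
    """S_assoc[k, l] = association value skl from the previous step.
    Returns skl' = (sk . sl) / (|sk| * |sl|), the cosine of the row vectors."""
    norms = np.linalg.norm(S_assoc, axis=1, keepdims=True)
    unit = S_assoc / np.clip(norms, 1e-12, None)
    return unit @ unit.T

S_assoc = np.array([[1.0, 0.6, 0.0],
                    [0.6, 1.0, 0.1],
                    [0.0, 0.1, 1.0]])
print(scalar_clusters(S_assoc))   # Sk(n): the n largest entries per row, l != k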

23
Query Expansion Via Local Context Analysis
  • Combine information from initial search results
    and global corpus
  • Break retrieved documents into fixed-length
    passages
  • Treat each passage as a document, and rank order
    them using the initial query
  • Compute the weight of each term in the top-ranked
    passages using a TF-IDF-like similarity with the query
  • Take the top m terms and add them to the original
    query with weight
  • w = 1 - 0.9 × (rank / m)

24
Local Context Analysis
  • Compute weight of each term in top ranked
    passages
  • N = # of passages
  • nk = # of passages containing tk
  • pfik = frequency of tk in passage pi
  • f(tk, tl) = Σi pfik pfil
  • idfk = max(1, log10(N/nk)/5)
  • Sim(q, tk) = Πtl∈query (δ + ln(f(tk, tl) × idfk)/ln(N))^idfl
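A sketch of the local context analysis score, assuming the top-ranked passages are given as term-frequency dictionaries and that δ is a small smoothing constant (0.1 here, an assumed value):

import math

def lca_scores(passages, query, delta=0.1):
    """Sim(q, tk) = prod over tl in q of (delta + ln(f(tk, tl) * idfk)/ln(N)) ** idfl."""
    N = len(passages)
    vocab = {t for p in passages for t in p}
    n = {t: sum(1 for p in passages if t in p) for t in vocab}
    idf = {t: max(1.0, math.log10(N / n[t]) / 5) for t in vocab}

    def f(tk, tl):  # co-occurrence of tk and tl summed over the passages
        return sum(p.get(tk, 0) * p.get(tl, 0) for p in passages)

    scores = {}
    for tk in vocab:
        sim = 1.0
        for tl in query:
            co = f(tk, tl)
            # co == 0 would send the log to -inf, so fall back to the bare delta
            base = delta + (math.log(co * idf[tk]) / math.log(N) if co > 0 else 0.0)
            sim *= base ** idf.get(tl, 1.0)
        scores[tk] = sim
    return scores

passages = [{"saturn": 2, "planet": 1, "ring": 1},
            {"saturn": 1, "car": 2, "dealer": 1},
            {"planet": 2, "ring": 1, "orbit": 1}]
query, m = ["saturn", "planet"], 3
scores = lca_scores(passages, query)
ranked = [t for t in sorted(scores, key=scores.get, reverse=True) if t not in query]
# Top-m expansion terms get weight 1 - 0.9 * (rank / m), as on the previous slide.
print({t: 1 - 0.9 * (r / m) for r, t in enumerate(ranked[:m], start=1)})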

25
Global Analysis
  • Examine all documents in the corpus
  • Create a corpus-wide data structure that is used
    by all queries

26
Similarity Thesaurus
  • itfi = ln(|V| / |Vi|)
  • wik = ((0.5 + 0.5 fik/maxi(fik)) × itfi) /
    √(Σj (0.5 + 0.5 fjk/maxj(fjk))² × itfj²)
  • ckl = Σi wik wil
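A sketch of building the similarity thesaurus from a corpus-wide term-frequency matrix; using itfj² inside the normalization (with the sum running over documents j) is assumed:

import numpy as np

def similarity_thesaurus(F):
    """F[i, k] = frequency of term tk in document di, over the whole corpus.
    itfi = ln(|V| / |Vi|), where Vi is the set of distinct terms in di.
    Returns C with ckl = sum_i wik * wil."""
    V = int((F.sum(axis=0) > 0).sum())                 # corpus vocabulary size |V|
    Vi = (F > 0).sum(axis=1)                           # distinct terms per document |Vi|
    itf = np.log(V / np.maximum(Vi, 1))                # itfi

    maxf = np.maximum(F.max(axis=0, keepdims=True), 1) # max frequency of each term over docs
    W = (0.5 + 0.5 * F / maxf) * itf[:, None]          # wik before normalization
    W = W / np.sqrt((W ** 2).sum(axis=0, keepdims=True))  # normalize each term over the docs
    return W.T @ W                                     # ckl

F = np.array([[3, 1, 0, 0],
              [0, 2, 2, 1],
              [1, 0, 0, 4]], dtype=float)
print(similarity_thesaurus(F))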

27
Query Expansion With Similarity Thesaurus
  • wqk = the weight (from above) of term tk in the query q
  • sim(q, tl) = Σk wqk ckl
  • Take the top m terms according to sim(q, tl)
  • Assign query weights to the new terms
  • wk = sim(q, tk) / Σl wql
  • Re-run with new (weighted) query
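A sketch of the expansion step, reusing the similarity_thesaurus matrix from the previous sketch; the query weights wqk are assumed given (zero for terms not in the query), and the ckl values below are toy numbers:

import numpy as np

def expand_query(w_q, C, m=2):
    """sim(q, tl) = sum_k wqk * ckl; the top-m terms not already in the query
    are added with weight wk = sim(q, tk) / sum_l wql."""
    sim = w_q @ C
    ranked = np.argsort(-sim)
    new_terms = [k for k in ranked if w_q[k] == 0][:m]
    return {int(k): float(sim[k] / w_q.sum()) for k in new_terms}

C = np.array([[1.0, 0.7, 0.2, 0.1],        # toy ckl matrix (e.g. from the sketch above)
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])
w_q = np.array([1.0, 0.0, 0.5, 0.0])        # the query mentions terms 0 and 2
print(expand_query(w_q, C, m=2))            # re-run the search with these added weights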

28
SONIA Feedback
  • SONIA is a meta-search engine
  • Clusters search results
  • Extracts keywords describing each cluster
  • Allows the user to expand the search within a cluster

29
ifile Feedback
  • ifile is an automatic mail filer for mh
  • Email is automatically filed in folders
  • User refile actions provide feedback on poor
    classification decisions
  • Does not use positive feedback
  • No reinforcement of correct decisions, only
    correction based on bad decisions

30
Search Engine Feedback
  • Search engines rarely allow relevance feedback
  • Too CPU-intensive
  • Most people aren't willing to wait
  • Search engines typically operate near the edge of
    their capacity
  • Google has 2,000 servers(?) and 300TB of data

31
Meta-Search Engine Feedback
  • Meta-search engines collect and process results
    from several search engines
  • Most search engines do not allow users to specify
    query term weights
  • Some search engines allow users to specify
    negative query terms
  • User relevance feedback might be used to weight
    results by search engine

32
Pet Peeves
  • What are your pet peeves when trying to locate
    information from
  • Web?
  • Intranet?
  • Email?
  • Local disk?
  • Peer-to-peer network?
  • Who knows?

33
Information Types
  • What types of information do you typically try to
    find?
  • Technical papers?
  • Product information? (e.g., books, motherboards,
    ...)
  • Travel? (e.g., where to go for vacation, and what
    to do there?)
  • People? (e.g., Mehran Sahami)

34
Project
  • First draft of search is available
  • It is incomplete and buggy
  • Can look at architecture to see how you might
    extend it
  • Python w/ extensions
  • Should be installed on cluster
  • RPMs/SRPMs available on web
    http://www.hpl.hp.com/personal/Carl_Staelin/cs236601/software/