Title: Relevance Ranking and Relevance Feedback
1 Relevance Ranking and Relevance Feedback
2 Motivation - Feast or Famine
- Queries return either too few or too many results
- Users are generally looking for the best document with a particular piece of information
- Users don't want to look through hundreds of documents to locate the information
- ⇒ Rank documents according to expected relevance!
3 Model
- Can we get user feedback?
- Document score is influenced by similarity to previously user-rated documents
- Can we utilize external information?
- E.g., how many other documents reference this document?
4 Queries
- Most queries are short
- One to three words
- Many queries are ambiguous
- Saturn
- Saturn the planet?
- Saturn the car?
5 Sample Internal Features
- Term frequency
- Location of term appearance in document
- Capitalization
- Font
- Relative font size
- Bold, italic, ...
- Appearance in <title>, <meta>, <h?> tags
- Co-location with other (relevant) words
6 Sample External Features
- Document citations
- How often is it cited?
- How important are the documents that cited it?
- Relevance of text surrounding hyperlink
- Relevance of documents citing this document
- Location within website (e.g. height in the directory structure or click distance from /)
- Popularity of pages from similar queries
- Search engine links often connect through the search site so they can track click-throughs
- Your idea here
7 Problem
- Given all these features, how do we rank the search results?
- Often use similarity between query and document
- May use other factors to weight the ranking score, such as citation ranking
- May use an iterative search which ranks documents according to similarity/dissimilarity to the query and previously marked relevant/irrelevant documents
8 Relevance Feedback
- Often an information retrieval system does not return useful information on the first try!
- If at first you don't succeed: try, try again
- Find out from the user which results were most relevant, and try to find more documents like them and fewer like the others
- Assumption: relevant documents are somehow similar to each other and different from irrelevant documents
- Question: how?
9 Relevance Feedback Methods
- Two general approaches
- Create new queries with user feedback
- Create new queries automatically
- Re-compute document weights with new information
- Expand or modify the query to more accurately reflect the user's desires
10 Vector Space Re-Weighting
- Given a query Q with its query vector q
- Initial, user-annotated results D
- Dr = relevant, retrieved documents
- Dn = irrelevant, retrieved documents
- di are the document weight vectors
- Update the weights on query vector q
- Re-compute the similarity score with the new q
11 Vector Space Re-Weighting
- Basic idea
- Increase weight for terms appearing in relevant documents
- Decrease weight for terms appearing in irrelevant documents
- There are a few standard equations
12 Vector Space Re-Weighting
- Rocchio
- q' = αq + (β/|Dr|) Σ_{di∈Dr} di − (γ/|Dn|) Σ_{di∈Dn} di
- Ide regular
- q' = αq + β Σ_{di∈Dr} di − γ Σ_{di∈Dn} di
- Ide Dec-Hi
- q' = αq + β Σ_{di∈Dr} di − γ max_{di∈Dn}(di)
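As a concrete illustration, the Rocchio update above can be sketched in a few lines. The α/β/γ defaults are commonly used values, not taken from the slides:

```python
import numpy as np

def rocchio_update(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio re-weighting: move the query vector toward the centroid
    of the relevant documents and away from the irrelevant ones.

    q          : query term-weight vector
    relevant   : list of document weight vectors marked relevant (Dr)
    irrelevant : list of document weight vectors marked irrelevant (Dn)
    """
    q_new = alpha * np.asarray(q, dtype=float)
    if len(relevant):
        q_new = q_new + (beta / len(relevant)) * np.sum(relevant, axis=0)
    if len(irrelevant):
        q_new = q_new - (gamma / len(irrelevant)) * np.sum(irrelevant, axis=0)
    # Note: components may go negative (see slide 13); some systems clip to zero
    return q_new
```

The Ide variants differ only in dropping the 1/|Dr|, 1/|Dn| normalization (and, for Dec-Hi, subtracting only the highest-ranked irrelevant document).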
13 Vector Space Re-Weighting
- The initial query vector q0 will have non-zero weights only for terms appearing in the query
- The query vector update process can add weight to terms that don't appear in the original query
- Automatic expansion of the original query terms
- Some terms can end up having negative weight!
- E.g., if you want to find information on the planet Saturn, "car" could have a negative weight
14 Probabilistic Re-Weighting
- After the initial search, get feedback from the user on document relevance
- Use the relevance information to recalculate term weights
- Re-compute similarity and try again
15 Probabilistic Re-Weighting
- Remember from last time
- Simij = Σk wik wjk (ln(P(tk|R)/(1 − P(tk|R))) + ln((1 − P(tk|¬R))/P(tk|¬R)))
- With no relevance information, assume
- P(tk|R) = 0.5
- P(tk|¬R) = ni/N, where ni = # of docs containing tk
- ⇒
- Simij = Σk wik wjk ln((N − ni)/ni)
16 Probabilistic Re-Weighting
- Given document relevance feedback
- Dr = set of relevant retrieved docs
- Drk = subset of Dr containing tk
- ⇒
- P(tk|R) = |Drk| / |Dr|
- P(tk|¬R) = (ni − |Drk|) / (N − |Dr|)
17 Probabilistic Re-Weighting
- Substituting the new probabilities gives
- Simij = Σk wik wjk ln((|Drk| / (|Dr| − |Drk|)) ÷ ((ni − |Drk|) / (N − |Dr| − (ni − |Drk|))))
- However, small values of |Drk| and |Dr| can cause problems, so usually a fudge factor is added
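The per-term weight after feedback can be sketched as follows; adding 0.5 to each count is one common choice of fudge factor, assumed here rather than specified by the slides:

```python
import math

def prob_term_weight(n_docs, n_k, d_r, d_rk, adj=0.5):
    """Per-term component of the probabilistic similarity after feedback.

    n_docs : N, total documents in the collection
    n_k    : ni, documents containing term tk
    d_r    : |Dr|, relevant retrieved documents
    d_rk   : |Drk|, relevant retrieved documents containing tk
    adj    : smoothing constant guarding against zero counts
    """
    p_rel = (d_rk + adj) / (d_r + 2 * adj)                  # P(tk|R)
    p_irr = (n_k - d_rk + adj) / (n_docs - d_r + 2 * adj)   # P(tk|~R)
    # ln of the odds ratio: positive when the term favors relevance
    return math.log((p_rel / (1 - p_rel)) / (p_irr / (1 - p_irr)))
```

A term concentrated in the relevant set gets a positive weight; a term appearing only in irrelevant documents gets a negative one.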
18 Probabilistic Re-Weighting
- Effectively updates query term weights
- No automatic query expansion
- Terms not in the initial query are never considered
- No memory of previous weights
19 Query Expansion Via Local Clustering
- Create new queries automatically
- Cluster the initial search results
- Use the clusters to create new queries
- Compute Sk(n), the set of keywords similar to tk that should be added to the query
- D = set of retrieved documents
- V = vocabulary of D
20 Association Clustering
- Find terms that frequently co-occur within documents
- skl = ckl = Σi fik fil
- Or, use the normalized association matrix s
- skl = ckl / (ckk + cll − ckl)
- Association cluster Sk(n)
- Sk(n) = the n largest skl s.t. l ≠ k
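A minimal sketch of association clustering over the retrieved set, assuming each document is represented as a term-frequency dict:

```python
from collections import defaultdict

def association_cluster(doc_term_freqs, t_k, n, normalized=True):
    """Return Sk(n): the n terms most strongly co-occurring with t_k.

    doc_term_freqs: list of {term: frequency} dicts, one per retrieved doc.
    """
    # c_kl = sum_i f_ik * f_il
    c = defaultdict(float)
    for freqs in doc_term_freqs:
        f_k = freqs.get(t_k, 0)
        if not f_k:
            continue
        for t_l, f_l in freqs.items():
            c[t_l] += f_k * f_l
    c_kk = c.get(t_k, 0.0)
    scores = {}
    for t_l, c_kl in c.items():
        if t_l == t_k:          # Sk(n) excludes t_k itself (l != k)
            continue
        if normalized:
            c_ll = sum(f.get(t_l, 0) ** 2 for f in doc_term_freqs)
            scores[t_l] = c_kl / (c_kk + c_ll - c_kl)
        else:
            scores[t_l] = c_kl
    return sorted(scores, key=scores.get, reverse=True)[:n]
```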
21 Metric Clustering
- Measure the distance between keyword appearances in a document
- r(tk, tl) = # of words between tk and tl
- ckl = Σtk Σtl (1 / r(tk, tl))
- Normalized: skl = ckl / (|tk| × |tl|)
- |tk| = # of words stemmed to tk
- Sk(n) is the same as before
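Metric clustering can be sketched the same way; here r(tk, tl) is taken as the position difference between occurrences, a common simplification of "words between":

```python
from collections import defaultdict

def metric_cluster(doc_tokens, t_k, n):
    """Return Sk(n) under metric clustering.

    doc_tokens: list of token lists (already stemmed), one per document.
    """
    c = defaultdict(float)    # c_kl = sum over occurrence pairs of 1/r
    count = defaultdict(int)  # |t_l| = total occurrences of each term
    for tokens in doc_tokens:
        positions = defaultdict(list)
        for pos, tok in enumerate(tokens):
            positions[tok].append(pos)
            count[tok] += 1
        for p_k in positions.get(t_k, []):
            for t_l, plist in positions.items():
                if t_l == t_k:
                    continue
                for p_l in plist:
                    c[t_l] += 1.0 / abs(p_k - p_l)   # inverse distance
    # normalized s_kl = c_kl / (|t_k| * |t_l|)
    scores = {t_l: v / (count[t_k] * count[t_l]) for t_l, v in c.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```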
22 Scalar Clustering
- Terms with similar association clusters are more likely to be synonyms
- sk is the row vector of skl values from association clustering
- skl = (sk · sl) / (|sk| × |sl|)
- Sk(n) is the same as before
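Scalar clustering is just cosine similarity between rows of the association matrix; a sketch assuming the matrix is a NumPy array indexed by a parallel term list:

```python
import numpy as np

def scalar_cluster(assoc_matrix, terms, t_k, n):
    """Return Sk(n): terms whose association-matrix rows point in a
    similar direction (cosine similarity) to t_k's row."""
    k = terms.index(t_k)
    s_k = np.asarray(assoc_matrix[k], dtype=float)
    scores = {}
    for l, t_l in enumerate(terms):
        if l == k:
            continue
        s_l = np.asarray(assoc_matrix[l], dtype=float)
        denom = np.linalg.norm(s_k) * np.linalg.norm(s_l)
        scores[t_l] = float(np.dot(s_k, s_l) / denom) if denom else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:n]
```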
23 Query Expansion Via Local Context Analysis
- Combine information from the initial search results and the global corpus
- Break retrieved documents into fixed-length passages
- Treat each passage as a document, and rank-order the passages using the initial query
- Compute the weight of each term in the top-ranked passages using a TF-IDF-like similarity with the query
- Take the top m terms and add them to the original query with weight
- w = 1 − 0.9 (rank / m)
24 Local Context Analysis
- Compute the weight of each term in the top-ranked passages
- N = # of passages
- nk = # of passages containing tk
- pfik = frequency of tk in passage pi
- f(tk, tl) = Σi pfik pfil
- idfk = max(1, log10(N/nk)/5)
- Sim(q, tk) = Π_{tl∈query} (δ + ln(f(tk, tl) × idfk)/ln(N))^idfl
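The LCA score can be sketched directly from the definitions above; the value of the δ constant below is an assumption (it only needs to be a small positive constant keeping the product non-zero):

```python
import math

def lca_score(query_terms, t_k, passages, delta=0.1):
    """Local Context Analysis score of candidate term t_k for the query,
    computed over the top-ranked passages ({term: freq} dicts)."""
    N = len(passages)

    def idf(t):
        n_t = sum(1 for p in passages if t in p)
        return max(1.0, math.log10(N / n_t) / 5) if n_t else 1.0

    def cooccur(t_l):  # f(t_k, t_l) = sum_i pf_ik * pf_il
        return sum(p.get(t_k, 0) * p.get(t_l, 0) for p in passages)

    idf_k = idf(t_k)
    score = 1.0
    for t_l in query_terms:
        f = cooccur(t_l)
        inner = delta + (math.log(f * idf_k) / math.log(N) if f else 0.0)
        score *= inner ** idf(t_l)   # product over query terms
    return score
```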
25 Global Analysis
- Examine all documents in the corpus
- Create a corpus-wide data structure that is used
by all queries
26 Similarity Thesaurus
- itfi = ln(|V| / |Vi|)
- wik = ((0.5 + 0.5 fik/maxi(fik)) itfi) / √(Σj (0.5 + 0.5 fjk/maxj(fjk))² itfj²)
- ckl = Σi wik wil
27 Query Expansion With Similarity Thesaurus
- wqk = the weight from above, with the query treated as a document
- sim(q, tl) = Σk wqk ckl
- Take the top m terms according to sim(q, tl)
- Assign query weights to the new terms
- wk = sim(q, tk) / Σl wql
- Re-run with the new (weighted) query
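The expansion step can be sketched as follows, assuming the thesaurus correlations ckl are supplied as a sparse dict keyed by term pairs:

```python
def expand_query(query_weights, concept_sim, m):
    """Expand a weighted query using a similarity-thesaurus matrix.

    query_weights: {term: w_qk} for the current query
    concept_sim  : {(t_k, t_l): c_kl} sparse correlation matrix
    Returns the m best new terms with their assigned weights.
    """
    # sim(q, t_l) = sum_k w_qk * c_kl, over terms not already in the query
    candidates = {}
    for (t_k, t_l), c_kl in concept_sim.items():
        if t_k in query_weights and t_l not in query_weights:
            candidates[t_l] = candidates.get(t_l, 0.0) + query_weights[t_k] * c_kl
    top = sorted(candidates, key=candidates.get, reverse=True)[:m]
    # new-term weight: w_l = sim(q, t_l) / sum_k w_qk
    total_q = sum(query_weights.values())
    return {t_l: candidates[t_l] / total_q for t_l in top}
```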
28 SONIA Feedback
- SONIA is a meta-search engine
- Clusters search results
- Extracts keywords describing each cluster
- Allows the user to expand the search within a cluster
29 ifile Feedback
- ifile is an automatic mail filer for mh
- Email is automatically filed in folders
- User refile actions provide feedback on poor classification decisions
- Does not use positive feedback
- No reinforcement of correct decisions, only correction based on bad decisions
30 Search Engine Feedback
- Search engines rarely allow relevance feedback
- Too CPU-intensive
- Most people aren't willing to wait
- Search engines typically operate near the edge of their capacity
- Google has 2,000 servers(?) and 300TB of data
31 Meta-Search Engine Feedback
- Meta-search engines collect and process results from several search engines
- Most search engines do not allow users to specify query term weights
- Some search engines allow users to specify negative query terms
- User relevance feedback might be used to weight results by search engine
32 Pet Peeves
- What are your pet peeves when trying to locate information from
- Web?
- Intranet?
- Email?
- Local disk?
- Peer-to-peer network?
- Who knows?
33 Information Types
- What types of information do you typically try to find?
- Technical papers?
- Product information? (e.g., books, motherboards, ...)
- Travel? (e.g., where to go for vacation, and what to do there?)
- People? (e.g., Mehran Sahami)
34 Project
- First draft of search is available
- It is incomplete and buggy
- Can look at the architecture to see how you might extend it
- Python w/ extensions
- Should be installed on the cluster
- RPMs/SRPMs available on the web
- http://www.hpl.hp.com/personal/Carl_Staelin/cs236601/software/