Title: Personalized Web Search using Clickthrough History
1. Personalized Web Search using Clickthrough History
- U. Rohini
- 200407019
- rohini_at_research.iiit.ac.in
- Language Technologies Research Center (LTRC)
- International Institute of Information Technology (IIIT), Hyderabad, India
2. Outline of the talk
- Introduction
  - Current Search Engines: Problems
  - Motivation
  - Background
  - Problem Description
  - Solution Outline
  - Contributions
- Review of Personalized Search
- I Search: a suite of approaches for Personalized Web Search
  - Personalized Search using user Relevance Feedback: Statistical Language modeling based approaches
    - Simple N-gram based methods
    - Noisy Channel based method
  - Personalized Search using user Relevance Feedback: Machine Learning based approach
    - Ranking SVM based method
  - Personalization without Relevance Feedback: Simple Statistical Language modeling based method
- Experiments
  - Query Log Study
  - Simulated Feedback
- Conclusions and Future Directions
4. Introduction
- Current Web Search engines
  - Provide users with documents relevant to their information need
- Issues
  - Information overload
    - Hundreds of millions of users to cater to
    - Terabytes of data
  - Poor description of the information need
    - Short queries, difficult to understand
    - Word ambiguities
  - Users only see the top few results
  - Relevance is subjective: it depends on the user
  - One size fits all?
5. Motivation
- Search is not a solved problem!
- Poorly described information need
  - Java (Java island / Java programming language)
  - Jaguar (cat / car)
  - Lemur (animal / Lemur toolkit)
  - SBH (State Bank of Hyderabad / Syracuse Behavioral Healthcare)
- Given prior information
  - "I am into biology": best guess for Jaguar?
  - Past queries "information retrieval", "language modeling": best guess for lemur?
6. Background
- Prior information: user feedback
7. Problem Description
- Personalized Search
  - Customize search results according to each individual user
- Personalized Search: issues
  - What to use to personalize?
  - How to personalize?
  - When not to personalize?
  - How to know whether personalization helped?
8. Problem Statement
- Problem
  - How to personalize?
- Our direction
  - Use past search history
  - Long-term learning
- Sub-problems: broken down into two
  - How to model and represent past search contexts
  - How to use them to improve search results
9. Solution Outline
- 1. How to model and represent past search contexts
  - Past search history from a user over a period of time: query logs
  - User contexts: triples (user, query, relevant documents)
  - Apply an appropriate method, learn from the user contexts, and build a model, the user profile (User Profile Learning)
- 2. How to use it to improve search results
  - Get the initial search results
  - Take the top few documents, re-score them using the user profile, and sort again (Reranking)
10. Contributions
- I Search: a suite of approaches for Personalized Web Search
  - Proposed personalized search approaches
  - Baseline
  - Basic retrieval methods
  - Automatic evaluation
- Analysis of a query log
- Creating simulated feedback
12. Review of Personalized Search
(Taxonomy: Personalized Search approaches based on query logs, machine learning, language modeling, community-based methods, and others.)
14. I Search: a suite of approaches for Personalized Search
- Suite of approaches
  - Statistical Language modeling based approaches
    - Simple N-gram based methods
    - Noisy Channel Model based method
  - Machine learning based approach
    - Ranking SVM based method
  - Personalization without relevance feedback
    - Simple N-gram based method
16. Statistical Language Modeling based Approaches: Introduction
- Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language
- Applied to a number of problems: speech recognition, machine translation, IR, summarization
17. Statistical Language Modeling based Approaches: Background
(Diagram: the query formulation model. The user's information need corresponds to an ideal document, from which the query is generated. Given a query, which document is most likely to be the ideal document?)
- In spite of the progress, not much work to capture, model and integrate user context!
18. Motivation for our approach
(Diagram: for the query "lemur", two candidate ideal documents:
- "Encyclopedia gives a brief description of the physical traits of this animal."
- "The Lemur toolkit for language modeling and information retrieval is documented and made available for download."
The user's past search contexts, e.g. "Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata ...", point to the toolkit document.)
19. Statistical Language Modeling based Approaches: Overview
- From user contexts, capture statistical properties of texts
- Use these to improve search results
- Different contexts
  - Unigrams and bigrams: simple N-gram based approaches
  - The relationship between query and document words: Noisy Channel based approach
21. N-gram based Approaches: Motivation
(Diagram: as before, the query "lemur" has two candidate ideal documents, the encyclopedia entry about the animal and the Lemur toolkit page. From the past search contexts ("Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata ..."), we extract:
- Unigrams: information, retrieval, documents
- Bigrams: information retrieval, searching documents, information documents
These point to the toolkit document as the ideal one.)
22. Sample user profile
23. Learning user profile
- Given past search history
  - Hu = (q1, rf1), (q2, rf2), ..., (qn, rfn)
  - rfall = concatenation of all rf
- For each unigram wi, estimate its probability from rfall
- The resulting unigram distribution is the user profile (a sketch follows)
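A minimal sketch of this step, assuming a plain relative-frequency (maximum likelihood) estimate; the function and variable names are illustrative, not from the thesis:

```python
from collections import Counter

def learn_user_profile(history):
    """Build a unigram user profile from past search history.

    history: list of (query, relevance_feedback_text) pairs, where the
    feedback text is e.g. the concatenated snippets of clicked documents.
    Returns a dict mapping each word to its relative frequency."""
    # rf_all: concatenation of all relevance-feedback texts
    rf_all = " ".join(rf for _, rf in history).lower().split()
    counts = Counter(rf_all)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

profile = learn_user_profile([
    ("lemur", "the lemur toolkit for language modeling and information retrieval"),
    ("language modeling", "statistical language modeling for information retrieval"),
])
print(profile["retrieval"])
```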
24. Reranking
- Recall that in the general LM approach to IR, documents are ranked by the query likelihood P(Q|D)
- Our approach: re-score the top results by combining the document model with the learned user profile (see the sketch below)
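One plausible instantiation of the re-scoring, assuming linear interpolation between the document unigram model and the user profile; the interpolation weight lam is an assumption, not taken from the slides:

```python
import math

def rerank(query, documents, profile, lam=0.7, eps=1e-6):
    """Re-score top-N documents: interpolate the document unigram model
    with the user-profile model, then sort by the new score."""
    def score(doc):
        words = doc.lower().split()
        total = len(words)
        s = 0.0
        for q in query.lower().split():
            p_doc = words.count(q) / total if total else 0.0
            p_user = profile.get(q, 0.0)  # lam is an assumed mixing weight
            s += math.log(lam * p_doc + (1 - lam) * p_user + eps)
        return s
    return sorted(documents, key=score, reverse=True)
```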
26. Noisy Channel based Approach
- Documents and queries live in different information spaces
  - Queries: short, concise
  - Documents: more descriptive
- Most approaches to retrieval or personalized web search do not model this
- We capture the relationship between query words and document words
27. Noisy Channel based approach: Motivation
(Diagram: the ideal document passes through the query generation process, a noisy channel, to produce the query; retrieval inverts this channel.)
28. Similar to Statistical Machine Translation
- Noisy channel 1 (SMT): an English sentence passes through the channel to become a French sentence; translation recovers it via P(e|f)
- Noisy channel 2 (retrieval): the ideal document passes through the channel to become the query; given a query, retrieve documents close to the ideal document via P(q|w)
29. Learning user profile
- User profile = a translation model
  - Triples (qw, dw, p(qw|dw)): query word, document word, translation probability
- Use Statistical Machine Translation methods
  - Learning the user profile = training a translation model
- In SMT, a translation model is trained
  - From parallel texts
  - Using the EM algorithm
30. Learning User profile
- Extracting parallel texts
  - From queries and the corresponding snippets of clicked documents
- Training a translation model
  - GIZA: an open-source toolkit widely used for training translation models in Statistical Machine Translation research (see the sketch below)

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Statistical machine translation models for personalized search. Technical report, International Institute of Information Technology, 2007.
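The thesis trains the translation model with GIZA; as a self-contained illustration of the underlying technique, here is a minimal IBM Model 1 EM trainer over (query, snippet) pairs. It is a sketch, not the GIZA implementation:

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Train IBM Model 1 with EM on (query, snippet) parallel pairs.
    Returns t, where t[qw][dw] approximates p(query word qw | doc word dw)."""
    pairs = [(q.lower().split(), s.lower().split()) for q, s in pairs]
    q_vocab = {w for q_words, _ in pairs for w in q_words}
    uniform = 1.0 / len(q_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))  # uniform init
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected counts
        total = defaultdict(float)
        for q_words, d_words in pairs:
            for qw in q_words:  # E-step: fractional alignment counts
                norm = sum(t[qw][dw] for dw in d_words)
                for dw in d_words:
                    frac = t[qw][dw] / norm
                    count[qw][dw] += frac
                    total[dw] += frac
        for qw in count:  # M-step: renormalize per document word
            for dw in count[qw]:
                t[qw][dw] = count[qw][dw] / total[dw]
    return t

pairs = [("lemur toolkit", "the lemur toolkit for language modeling and information retrieval")]
t = train_model1(pairs)
print(t["lemur"]["retrieval"])
```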
31. Sample user profile
32. Reranking
- Recall that in the general LM approach to IR, documents are ranked by P(Q|D)
- Noisy Channel based approach: score each document using the learned translation probabilities (see the sketch below)
(Example: for the query "lemur", the profile entry P(retrieval|lemur) favours D4, "The Lemur toolkit for language modeling and information retrieval is documented and made available for download.", over D1, "Lemur - Encyclopedia gives a brief description of the physical traits of this animal.")
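A sketch of translation-based rescoring, assuming the common formulation P(Q|D) = prod over qw of sum over dw of p(qw|dw) P(dw|D); the exact scoring formula on the slide is an image and is not recoverable, and the toy translation table below is illustrative:

```python
import math

def noisy_channel_score(query, doc, t, eps=1e-9):
    """Score a document under the translation (noisy channel) model:
    each query word qw contributes sum over doc words dw of
    p(qw|dw) * P(dw|D)."""
    d_words = doc.lower().split()
    n = len(d_words)
    score = 0.0
    for qw in query.lower().split():
        p = sum(t.get(qw, {}).get(dw, 0.0) * d_words.count(dw) / n
                for dw in set(d_words))
        score += math.log(p + eps)
    return score

# Toy translation table: the profile relates "lemur" to "retrieval"
t = {"lemur": {"lemur": 1.0},
     "retrieval": {"lemur": 0.3, "retrieval": 0.7}}
docs = ["Lemur Encyclopedia gives a brief description of this animal",
        "The Lemur toolkit for language modeling and information retrieval"]
ranked = sorted(docs, key=lambda d: noisy_channel_score("lemur retrieval", d, t),
                reverse=True)
print(ranked[0])  # the toolkit page wins
```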
34. Machine Learning based Approaches: Introduction
- Most machine learning for IR treats it as a binary classification problem: relevant vs. non-relevant
- Clickthrough data
  - A click is not an absolute relevance judgment but a relative one
  - i.e., assuming clicked = relevant and unclicked = irrelevant is wrong; clicks are biased
  - Partial relative relevance: clicked documents are more relevant than the unclicked documents (see the sketch below)
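A sketch of turning clicks into partial relative-relevance pairs, in the spirit of Joachims' "clicked beats skipped-above" heuristic; the slides only state that clicked documents are preferred to unclicked ones, so the exact pair-extraction rule here is an assumption:

```python
def preference_pairs(ranked_urls, clicked):
    """Extract partial relative-relevance preferences from one query's
    clicks: a clicked result is preferred over every unclicked result
    ranked above it (clicks are biased, so only relative judgments)."""
    pairs = []
    clicked = set(clicked)
    for i, url in enumerate(ranked_urls):
        if url in clicked:
            for skipped in ranked_urls[:i]:
                if skipped not in clicked:
                    pairs.append((url, skipped))  # url preferred over skipped
    return pairs

print(preference_pairs(["a", "b", "c", "d"], clicked=["c"]))
# [('c', 'a'), ('c', 'b')]
```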
35. Background
- Ranking SVM
  - A variation of the SVM
  - Learns from partial relevance data
  - Learning is similar to a classification SVM
36. Ranking SVM based method
- Use Ranking SVMs to learn the user profile
- Experimented with
  - Different features: unigram, bigram
  - Different feature weights: Boolean, term frequency, normalized term frequency
37. Learning user profile
- User profile = a weight vector
- Learning = training an SVM model
- Steps
  - Extracting features
  - Computing feature weights
  - Training the SVM

1. Uppuluri R. and Ambati V. Improving web search results using collaborative filtering. In Proceedings of the 3rd International Workshop on Web Personalization (ITWP), held in conjunction with AAAI 2006, 2006.
2. U. Rohini and Vasudeva Varma. A novel approach for re-ranking of search results using collaborative filtering. In Proceedings of the International Conference on Computing Theory and Applications (ICCTA 07), pages 491-495, Kolkata, India, March 2007.
38. Extracting Features
- Features: unigrams, bigrams
- Given past search history
  - Hu = (q1, rf1), (q2, rf2), ..., (qn, rfn)
  - rfall = concatenation of all rf
- Remove stop words from rfall
- Extract all unigrams (or bigrams) from rfall (a sketch follows)
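A minimal sketch of the feature extraction, with a toy stop-word list (the thesis's actual stop-word list is not given):

```python
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "is", "to"}  # toy list

def extract_features(history, use_bigrams=False):
    """Extract unigram (or bigram) features from the concatenated
    relevance-feedback texts of a user's past search history."""
    rf_all = " ".join(rf for _, rf in history).lower().split()
    words = [w for w in rf_all if w not in STOPWORDS]
    if use_bigrams:
        return sorted({(w1, w2) for w1, w2 in zip(words, words[1:])})
    return sorted(set(words))
```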
39. Computing Feature Weights
- In each relevant document di, compute the weight of each feature
  - Boolean weighting: 1 or 0
  - Term frequency weighting: tfw = number of times the feature occurs in di
  - Normalized term frequency weighting: tfw / |di|
40. Training SVM
- Represent each relevant document as a vector of features and their corresponding weights
- We used SVMlight for training (a sketch of the input format follows)
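SVMlight's ranking mode reads one line per example: a target rank, a query id, and sparse feature:value pairs in increasing feature-id order. A sketch of writing such a training file; the file name and feature ids are illustrative:

```python
def write_svmlight_ranking(filename, examples):
    """Write examples in SVMlight ranking format:
    <rank> qid:<qid> <feat>:<weight> ...
    Higher rank = preferred within the same query."""
    with open(filename, "w") as f:
        for rank, qid, features in examples:  # features: {feat_id: weight}
            feats = " ".join(f"{i}:{w}" for i, w in sorted(features.items()))
            f.write(f"{rank} qid:{qid} {feats}\n")

# For query 1, the clicked document (rank 2) is preferred
# over an unclicked one ranked above it (rank 1).
write_svmlight_ranking("train.dat", [
    (2, 1, {3: 1.0, 7: 0.5}),   # clicked document
    (1, 1, {2: 1.0, 9: 0.25}),  # unclicked document
])
```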
41. Sample training data and sample user profile
42. Reranking
- Sim(Q,D) = W · Φ(Q,D)
  - W: the weight vector (user profile)
  - Φ(Q,D): a vector of terms and their weights, a measure of similarity between Q and D
    - Each term is a term in the query
    - A term's weight is the product of its weights in the query and the document (Boolean, term frequency, or normalized term frequency)
- (A sketch follows)
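A sketch of applying the learned weight vector at reranking time; the dictionary representations of W and the term weights are illustrative:

```python
def svm_rerank_score(query_terms, query_weights, doc_weights, w):
    """Sim(Q, D) = W . Phi(Q, D): Phi has one component per query term,
    whose value is the product of the term's query and document weights."""
    return sum(
        w.get(term, 0.0) * query_weights.get(term, 0.0) * doc_weights.get(term, 0.0)
        for term in query_terms
    )

w = {"retrieval": 1.4, "lemur": 0.9}  # weight vector learned by the Ranking SVM
score = svm_rerank_score(
    ["lemur", "toolkit"],
    {"lemur": 1.0, "toolkit": 1.0},   # term weights in Q
    {"lemur": 0.2, "toolkit": 0.1},   # term weights in D
    w,
)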
44. Personalized Search without Relevance Feedback: Introduction
- Can personalization be done without relevance feedback about which documents are relevant?
- How informative are the queries posed by users?
- Is the information contained in the queries enough to personalize?
45. Approach
- The user's past queries are available
- Make effective use of past queries
- Simple N-gram based approach
46. Learning user profile
- Given past search history
  - Hu = q1, q2, ..., qn
  - qconcat = concatenation of all queries
- For each unigram wi, estimate its probability from qconcat
- The resulting unigram distribution is the user profile (a sketch follows)
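A sketch of the query-only profile, together with the kind of smoothing used later in the PRWF+Smoothing run (slide 68): the background model is built from all users' queries, and the mixing weight lam is an assumption:

```python
from collections import Counter

def query_only_profile(queries):
    """Unigram profile learnt from the user's past queries alone."""
    words = " ".join(queries).lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(word, user_profile, background_profile, lam=0.8):
    """Smooth the sparse query-only profile with a background language
    model built from all users' queries (cf. PRWF+Smoothing)."""
    return (lam * user_profile.get(word, 0.0)
            + (1 - lam) * background_profile.get(word, 0.0))
```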
47. Sample user profile
48. Reranking
- As in the general LM approach to IR, documents are ranked by P(Q|D)
- Our approach: re-score the top results using the query-only user profile, as in the earlier reranking sketch

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Personalized search without relevance feedback. Technical report, International Institute of Information Technology, 2007.
50. Experiments: Introduction and Problems
- Aim: to see how the approaches perform by comparing them with a baseline
- Problems
  - No standard evaluation framework
  - Data
    - Lack of standardization
    - Comparison with previous work is difficult
    - Difficult to repeat previously conducted experiments
    - Difficult to share results and observations
    - The effort of collecting data is repeated over and over
    - Identified as a problem in need of standardization (Allan et al. 2003)
  - Lack of standard personalized search baselines
    - In our work, we used a variation of the Rocchio algorithm
  - Metrics
51. Experiments: Data
- Clickthrough data from a popular search engine
- Data collected from about 250k users over 3 months in 2006
- Consists of (anonymous id, query, timestamp, position of the click, domain name of the clicked URL)
52. Experiments: Sample Data
53. Issues with the query log data
- Web search engines
  - Search engine indices change over time
  - However, the top 10 results stay mostly the same
- Implicit feedback = partial relevance feedback
  - 90% of the users click only the top 10 results
  - 95% only the top 5 results
- The log only contains the domain name of the clicked URLs
54. Extracting the Data Set
- Conditions
  - A query should have at least 1 click
  - Users should exhibit long-term behaviour (pose queries over the 3 months and exhibit similar interests)
- Assumptions
  - Each anonymous id corresponds to one user
  - The domain name of the clicked URL is used when comparing results
- Final data set
  - How to split the data for training (learning the user profile) and testing? Temporally
    - First 2 months for training, third month for testing
  - 17 users
  - 51.88 queries on average in the train set and 12.64 in the test set
55. Baseline
- A variation of the Rocchio algorithm (Rocchio 1971)
- Learning the profile
  - User profile = a vector of words and weights
  - For each query, for each clicked document, collect the corresponding snippet from the search engine
  - Concatenate all such snippets over all queries
  - Compute the frequency distribution of words
- Reranking (see the sketch below)
  - Sim(Q,D) = (tf_Q/|Q| + tf_RUP/|RUP|) · tf_D/|D|
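A sketch of the baseline score as reconstructed above; RUP denotes the Rocchio user profile, and the exact combination is inferred from the garbled slide formula:

```python
def rocchio_baseline_score(query, doc, rup, rup_size):
    """Baseline reranking: Sim(Q,D) = (tf_Q/|Q| + tf_RUP/|RUP|) * tf_D/|D|,
    summed over query terms. rup maps word -> frequency in the profile."""
    q_words = query.lower().split()
    d_words = doc.lower().split()
    score = 0.0
    for w in set(q_words):
        q_part = q_words.count(w) / len(q_words)
        rup_part = rup.get(w, 0) / rup_size if rup_size else 0.0
        d_part = d_words.count(w) / len(d_words) if d_words else 0.0
        score += (q_part + rup_part) * d_part
    return score
```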
56. Metrics
- MRR: Mean Reciprocal Rank
  - MRR(Q,D,u) = (1/|Q|) Σ_{q in Q} rr(q, R_{q,D,u})
  - rr(q, R_{q,D,u}) = the reciprocal of the position of the first relevant document, and 0 if there is no relevant result in the top N (= 10)
- (A sketch follows)
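A small sketch of the metric as defined above:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query, top_n=10):
    """MRR over a set of queries: average of 1/rank of the first
    relevant result, or 0 if none appears in the top N."""
    total = 0.0
    for qid, ranked in results_per_query.items():
        relevant = relevant_per_query.get(qid, set())
        rr = 0.0
        for rank, doc in enumerate(ranked[:top_n], start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results_per_query)

print(mean_reciprocal_rank({1: ["a", "b", "c"]}, {1: {"b"}}))  # 0.5
```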
57. Set up
- Reranker
  1. Rerank the top M (= 10) results using clickthrough data
  2. First get the results from Google, ignoring the ranks given by Google (similar to Tan, Shen and Zhai 2006)
  3. Rescore the results using the appropriate method
  4. Sort in descending order and return
(Diagram: each query from the test data is posed, the top m URLs are reranked, and the reranked results are compared with the clicked URLs over the top n URLs, yielding MRR and P@n.)
58. Results: Simple N-gram based Methods
59. Noisy Channel Based Method
- Experiment 1: comparison with the baseline
- Experiment 2: different methods of extracting parallel texts
- Experiment 3: different training schemes
  - Different contexts for training
  - Different training models
60. Experiment 1: Comparison with the baseline
61. Experiment 2
- Extracting parallel texts: comparison of methods
62. Results
(Legend, reconstructed from the slide: NS1 - queries paired with snippets of relevant documents; NS2 - NS1 plus synthetic queries paired with snippets; NS3 - NS2 plus document titles paired with snippets.)
63. Experiment 3
- Different training schemes
  - Different contexts for training: snippet vs. document
  - Different training models
64. Data and Set up
- Data
  - Explicit feedback data collected from 7 users
  - For each query, each user examined the top 10 documents and identified the relevant ones
  - Collected the top 10 results for all queries: 3469 documents in total
- Set up
  - Created a Lucene index of the 3469 documents
  - For reranking, first retrieve the results using Lucene, then rerank them using the noisy channel approach
  - We perform 10-fold cross validation
65. Results
66. Results
(Legend: I - document training, document testing; II - document training, snippet testing; III - snippet training, document testing; IV - snippet training, snippet testing.)
67. Results: SVM
(Legend: SVM1 - unigram, Boolean; SVM2 - unigram, term frequency; SVM3 - unigram, normalized term frequency; SVM4 - bigram, normalized term frequency; SVM5 - unigrams + bigrams, normalized term frequency.)
68. Results: Personalization without Relevance Feedback
(Legend: PRWF - personalization without relevance feedback, using only the profile learnt from queries alone; PRWF+Smoothing - smoothing the probabilities from the user profile with a large query language model obtained from all the queries of all the users in collection 01 of the clickthrough data.)
69. Experiments: Summary
- Language modeling: best results!
  - An interesting framework for Personalized Search
  - Simple N-gram based approaches also worked well
  - The Noisy Channel model worked best
    - Extracting synthetic queries helped
    - Different training schemes: IBM Model 1 vs. GIZA, snippet vs. document
- Machine learning: competitive results
  - Different features and weights
- Without relevance feedback: very encouraging results
  - A simple approach worked well
  - Sparsity: the query log was useful
71. Query Log Study: Introduction
- Large interest in finding patterns and computing statistics from query logs
- Previous work
  - Patterns and statistics of queries: common queries, avg. no. of words, avg. no. of queries per session, etc.
  - Little work on analyzing the click behaviour of users
    - Granka et al.: an eye-tracking study
72. Query Log Study: Our Analysis
- Analyzing the clicking behaviour of users
- Study whether there is any general pattern in clicking behaviour
- Aim to answer the following
  - Expt 1: Do all users view results from top to bottom?
  - Expt 2: Do all users view the same number of results?
73. Query Log Study: Observations
- Expt 1: Do all users view results from top to bottom? YES, for 90% of queries
  - Why is this important?
- Expt 2: How many top results does the user view? -> the deepest click made by users
  - Statistical analysis showed that the deepest clicks made by a sample of users follow a Zipf distribution, or power law
  - Many users view only the top 5 (about 90-95%), few view the top 10, much fewer the top 20, and so on
  - Why is this important?
75. Simulated Feedback: Introduction
- Relevance feedback: types and problems
  - Explicit: difficult to collect
  - Implicit: clickthrough data from search engines is not available
- Repeatability of experiments is a problem!
  - The Web's data collections are dynamic: collected feedback becomes stale
  - Privacy
76. Simulated Feedback: Motivation
- Simulated feedback: analogous to explicit and implicit feedback
- A potential area whose outcome is useful for web search and personalization
- Easy to create
- Customizable
- Large amounts can be created
- Repeatable
- Allows testing specific domains
77. Simulated Feedback: Creation
(Diagram: a User Creator feeds parameters to the Simulator, a web search behaviour simulator, which outputs simulated feedback. The simulated behaviour: Step 1 - formulate a query; Step 2 - pose it to a search engine; Step 3 - look at the results returned by the search engine; Step 4 - possibly click one or more results.)
79. Conclusions
- Statistical Language Modeling based approaches
- Machine learning based approach
- Personalized Search without relevance feedback
- Performed evaluation using query log data
- Query Log Analysis and Simulated Feedback
80. Future Directions
- Recommending documents
- Extend to exploit repetition in queries and clickthroughs
- Language Modeling based approaches
  - Capture richer context: trigrams etc. for the N-gram based method, bigrams for the Noisy Channel based method
- Machine learning based approaches
  - Can learn non-text patterns or behaviour
- Personalized summarization
- Simulating user behaviour
82. Simple N-gram based approaches
- N-gram: a general term for a sequence of n words
  - 1-gram = unigram, 2-gram = bigram
- Capture statistical properties of text
  - Single words (unigrams)
  - Pairs of adjacent words (bigrams)
87. Experiment 1
- Do all users view results from top to bottom?
- Position: the position of the search result in the search engine ranking
- For each query
  - Arrange the clicks by time of click
  - If all the positions are in ascending order, the user views from top to bottom
  - The query is said to be an anomaly otherwise (see the sketch below)
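A sketch of the anomaly test:

```python
def views_top_to_bottom(clicks):
    """clicks: list of (timestamp, position) for one query.
    Returns True if, ordered by click time, the clicked positions are
    ascending, i.e. the user moved top to bottom; False = anomaly."""
    positions = [pos for _, pos in sorted(clicks)]
    return all(a <= b for a, b in zip(positions, positions[1:]))

print(views_top_to_bottom([(1, 2), (2, 5), (3, 9)]))  # True
print(views_top_to_bottom([(1, 5), (2, 2)]))          # False: anomaly
```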
89. Observations
- For 90% of the queries, users always go from top to bottom!
- For the remaining 10% of queries
  - The user clicks at least one bottom result before clicking a top result
  - The user is not happy with the search engine ranking
  - This is not the behaviour of particular users: 50% of users exhibit it
  - Are certain queries hard?
90. Experiment 2
- How many top results does the user view?
- Intuition
  - Typically, users don't view all the results
  - Only the top few: how many?
  - Does it depend on the user?
- Goal: to see how deep a user goes into the results
91.
- Patience: how many results a user views
- Measure 1: for each query, the deepest click; take the maximum over all queries
- Measure 2: for each query, the average click; take the maximum over all queries
92. (Plot: for each query, the deepest click; maximum over all queries.)
93. (Plot: for each query, the average click; maximum over all queries.)
94. Observations
- Statistical analysis shows the values follow a Zipf distribution, or power law
- Many users view only the top 5 (about 90-95%), few view the top 10, much fewer the top 20, and so on
- The patience of a group of users can be characterized using Zipf's law or a power law
97.
- Creating simulated feedback
  - Creating a simulated user
  - Simulating user web search behaviour

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Creating simulated feedback. Technical report, International Institute of Information Technology, 2007.
98. Creating a Simulated User
- User-specific parameters (unique id, etc.)
- Web-search-specific parameters
  - Patience (from the query log analysis)
  - Threshold
  - Others could be interests (user profile/model), browsing history, etc.
- We consider patience and threshold in this work
99. Patience
- Pick from a power law distribution: many users view the top 5, fewer the top 10, much fewer the top 20, and so on (see the sketch below)
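A sketch of drawing patience from a truncated power law; the exponent 2.0 is an assumption, and the 1-25 range follows the experimental set up on slide 109:

```python
import random

def sample_patience(max_depth=25, exponent=2.0):
    """Draw a user's patience (how many results they view) from a
    truncated power law: P(depth = k) proportional to k**(-exponent)."""
    weights = [k ** -exponent for k in range(1, max_depth + 1)]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for k, wgt in enumerate(weights, start=1):
        acc += wgt
        if r <= acc:
            return k
    return max_depth

print([sample_patience() for _ in range(10)])  # mostly small depths
```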
100. Relevance Threshold
- Depends on the query and the user
- For some queries, very high relevance is needed
- We compute it according to the query, for each user
101. Simulating user web search behaviour
- Formulate a web search process
  - Step 1: Create the query
  - Step 2: Pose it to a search engine
  - Step 3: Look at the results returned by the search engine
  - Step 4: Possibly click one or more results
  - Step 5: Reformulate if unsatisfied
- Simulate the search process for the created user
- We consider only Steps 1 to 4 in our approach
102. Simulating Step 1: Formulating the query
- Can be very complex
- We take a simple and practical approach
- As of now, the queries are assumed to be given to the system
103. Simulating Step 2: Searching the Search Engine
- Given a search engine, pose the query from Step 1 to it
- Get the search results
104. Simulating Step 3: Looking at the Search Results
- Simulation of this step can be done in a number of ways, e.g. random, top to bottom, bottom to top, etc.
- We consider: sequential, from top to bottom, until patience reaches zero
  - For each document, perform clicks as in Step 4
- (Motivated by Radlinski et al., Granka et al.)
105. Simulating Step 4: Clicking the results
- The crucial step of our simulation
- The user clicks a result if
  - The snippet shown by the search engine appears relevant to the user
  - The result below it is not more relevant than it
- (Motivated by Radlinski et al., Granka et al.; a sketch follows)
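A sketch of the click rule; how snippet relevance and the threshold are actually computed is not specified on the slides, so the relevance function is passed in as an assumption:

```python
def simulate_clicks(snippets, relevance, patience, threshold):
    """Walk results top to bottom until patience runs out; click a
    result if its snippet relevance reaches the threshold and the
    result just below it is not more relevant.
    relevance(snippet) -> score in [0, 1] is assumed to be supplied,
    e.g. a similarity between the snippet and the user's profile."""
    clicks = []
    for i, snip in enumerate(snippets[:patience]):
        rel = relevance(snip)
        next_rel = relevance(snippets[i + 1]) if i + 1 < len(snippets) else 0.0
        if rel >= threshold and rel >= next_rel:
            clicks.append(i + 1)  # 1-based result position
    return clicks

clicks = simulate_clicks(
    ["lemur animal encyclopedia", "lemur toolkit information retrieval"],
    relevance=lambda s: 1.0 if "toolkit" in s else 0.2,  # toy relevance
    patience=2, threshold=0.5,
)
print(clicks)  # [2]
```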
106. Simulated Feedback: Creation
(Diagram: the search engine returns search results to the Simulator, a web search behaviour simulator; the User Creator supplies the parameters; the output is simulated feedback.)
107. Evaluation: Problems
- Is the simulated feedback relevant?
- How different is it from randomly created feedback?
- Evaluation issues
  - No standard methods to evaluate
  - No metrics to quantify success
  - How and what to compare?
108. Experiments
- Experiment 1: comparison with implicit feedback from the query log data
- Experiment 2: comparison with baselines
- Experiment 3: comparison with explicit feedback
109. Experimental Set up
- Creating a simulated user
  - Randomly assign a unique id
  - Patience: drawn randomly from a power law distribution over 1-25
110. Experimental set up
- Simulating the web search process
  - Pick a user from the query log and gather all queries posed by that user
  - Simulate the web search process for each query in succession
    - Step 1 (formulating a query): pick each query in succession from the gathered queries
    - Step 2 (searching the search engine): pose the query to a search engine and gather the results
    - Step 3: look at the results
    - Step 4: click one or more
111. Sample Data Created
112. Experiment 1
- Comparison with clickthroughs from the query log
- For each query, a Relevant Document Pool (RDP): all clicked documents for the query from all the users in the query log
- Average accuracy: 60.04%
113. Experiment 2
- Random navigation
- Power law navigation
- Random click
114. Creating the user
115. Creating the Web Search Process
116. Results
117. Experiment 3
- Comparison with explicit feedback
- 4 judges
- Selected a small subset of the data created
  - 25 users, 1 query per user: 25 queries in total
  - We consider each query and the simulated feedback created for it
118.
- Each judge was given an evaluation form
- The evaluation form contains
  - Details about the judge
  - A table containing each query and the corresponding simulated click URLs
  - For each simulated click, the judge gives feedback
    - Boolean feedback: 1 or 0
119. Results
- Judge accuracy: 66.02%
- Correlation between the judges: 0.859
120. Discussion
- 6% increase in accuracy over the comparison with the query log
- Match problems
  - The search engine index changes: relevance feedback becomes stale!
  - Too few relevant documents in the RDP
    - qualcomm.com: only one document in the RDP
    - A focussed query; only one user posed it
- Focussed query vs. general query
  - qualcomm.com: only one query, posed by one user
  - lottery: 58 users, 24 unique click URLs