Personalized Web Search using Clickthrough History

1
Personalized Web Search using Clickthrough
History
  • U. Rohini
  • 200407019
  • rohini@research.iiit.ac.in
  • Language Technologies Research Center (LTRC)
  • International Institute of Information Technology
    (IIIT)
  • Hyderabad, India

2
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Personalized Search using User Relevance Feedback: Statistical Language Modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Personalized Search using User Relevance Feedback: Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback: Simple Statistical Language Modeling based method
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

3
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

4
Introduction
  • Current web search engines
  • Provide users with documents relevant to their information need
  • Issues
  • Information overload
  • Must cater to hundreds of millions of users
  • Terabytes of data
  • Poor description of the information need
  • Short queries - difficult to understand
  • Word ambiguities
  • Users only see the top few results
  • Relevance
  • Subjective - depends on the user
  • One size fits all?

5
Motivation
  • Search is not a solved problem!
  • Poorly described information need
  • Java (Java island / Java programming language)
  • Jaguar (cat / car)
  • Lemur (animal / Lemur toolkit)
  • SBH (State Bank of Hyderabad / Syracuse Behavioral Healthcare)
  • Given prior information
  • I am into biology - best guess for Jaguar?
  • Past queries: information retrieval, language modeling - best guess for Lemur?

6
Background
  • Prior information: user feedback

7
Problem Description
  • Personalized Search
  • Customize search results according to each
    individual user
  • Personalized Search - Issues
  • What to use to personalize?
  • How to personalize?
  • When not to personalize?
  • How to know whether personalization helped?

8
Problem Statement
  • Problem
  • How to Personalize?
  • Our Direction
  • Use past search history
  • Long-term learning
  • Sub-problems
  • Broken down into 2 sub-problems
  • How to model and represent past search contexts
  • How to use them to improve search results

9
Solution Outline
  • 1. How to model and represent past search contexts
  • Past search history from the user over a period of time: query logs
  • User contexts: triples (user, query, relevant documents)
  • Apply an appropriate method, learn from the user contexts, and build a model: the user profile
  • User profile learning
  • 2. How to use it to improve search results
  • Get the initial search results
  • Take the top few documents, re-score them using the user profile, and sort again
  • Reranking

10
Contributions
  • I-Search: A suite of approaches for Personalized Web Search
  • Proposed Personalized search approaches
  • Baseline
  • Basic Retrieval methods
  • Automatic Evaluation
  • Analysis of Query Log
  • Creating Simulated Feedback

11
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

12
Review of Personalized Search
  • Personalized Search approaches: based on query logs, machine learning, language modeling, community methods, and others

13
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

14
I-Search: A suite of approaches for Personalized Search
  • Suite of Approaches
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel Model based method
  • Machine learning based approach
  • Ranking SVM based method
  • Personalization without relevance feedback
  • Simple N-gram based method

15
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple Language model based method
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

16
Statistical Language Modeling based Approaches: Introduction
  • Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language
  • Applied to a number of problems: speech, machine translation, IR, summarization

17
Statistical Language Modeling based Approaches: Background
[Figure: the query formulation model - a user with an information need has an ideal document in mind and formulates a query (e.g., "lemur"); given the query, retrieval asks which document is most likely to be the ideal document.]
In spite of the progress, there has been little work to capture, model and integrate user context!
18
Motivation for our approach
[Figure: for the query "lemur", the ideal document depends on the user. One candidate: "Encyclopedia gives a brief description of the physical traits of this animal." Another: "The Lemur toolkit for language modeling and information retrieval is documented and made available for download." The user's past search contexts (e.g., "Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata...") point to the toolkit sense.]
19
Statistical Language Modeling based Approaches: Overview
  • From user contexts, capture statistical properties of text
  • Use these properties to improve search results
  • Different contexts
  • Unigrams and bigrams: simple N-gram based approaches
  • Relationship between query and document words: Noisy Channel based approach

20
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

21
N-gram based Approaches: Motivation
[Figure: the "lemur" example again. From the user's past search contexts (e.g., "Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata..."), extract unigrams (information, retrieval, documents) and bigrams (information retrieval, searching documents, information documents); these favor the toolkit reading of the query - "The Lemur toolkit for language modeling and information retrieval is documented and made available for download" - over the animal reading, "Lemur - Encyclopedia gives a brief description of the physical traits of this animal."]
22
Sample user profile
23
Learning user profile
  • Given: past search history
  • Hu = (q1, rf1), (q2, rf2), ..., (qn, rfn)
  • rf_all = concatenation of all rf
  • For each unigram wi, estimate its probability in rf_all
  • User profile: the resulting unigram distribution (a sketch follows)
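
A minimal sketch of this profile-learning step, assuming the profile is the maximum-likelihood unigram distribution over the concatenated relevance feedback (names are illustrative, not from the thesis):

    from collections import Counter

    def learn_profile(history):
        # history: list of (query, relevance_feedback_text) pairs
        rf_all = " ".join(rf for _, rf in history)        # concatenate all rf
        tokens = rf_all.lower().split()
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}  # P(w | profile)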

24
Reranking
  • Recall: in the general language modeling approach to IR, a document D is scored by the query likelihood, P(Q|D) = product over query words w of P(w|D)
  • Our approach: re-score the top documents using the learnt user profile as well (a sketch follows)
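
A minimal reranking sketch, assuming the user profile is interpolated with the document language model (Jelinek-Mercer style); the exact combination used in the thesis may differ:

    import math

    def rerank(query, docs, profile, lam=0.5, eps=1e-9):
        # docs: list of (doc_id, text); returns docs sorted by score
        def score(text):
            tokens = text.lower().split()
            n = len(tokens) or 1
            s = 0.0
            for w in query.lower().split():
                p_doc = tokens.count(w) / n       # P(w | D), MLE estimate
                p_user = profile.get(w, 0.0)      # P(w | user profile)
                s += math.log(lam * p_doc + (1 - lam) * p_user + eps)
            return s
        return sorted(docs, key=lambda d: score(d[1]), reverse=True)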

25
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

26
Noisy Channel based Approach
  • Documents and queries live in different information spaces
  • Queries: short, concise
  • Documents: more descriptive
  • Most retrieval and personalized web search methods do not model this
  • We capture the relationship between query and document words

27
Noisy Channel based approach: Motivation
[Figure: the user's ideal document passes through a query generation process (a noisy channel) to produce the query; retrieval inverts this channel to recover the ideal document.]
28
Similar to Statistical Machine Translation
  • SMT: given an English sentence, translate it into French
  • Ours: given a query, retrieve documents close to the ideal document
[Figure: Noisy channel 1 - English sentence to French sentence, modeled by P(e|f). Noisy channel 2 - ideal document to query, modeled by P(q|w).]
29
Learning user profile
  • User profile = translation model
  • Triples (qw, dw, p(qw|dw))
  • Use statistical machine translation methods
  • Learning the user profile = training a translation model
  • In SMT, a translation model is trained
  • From parallel texts
  • Using the EM algorithm (a sketch follows)
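
A minimal IBM Model 1 sketch of this EM training, assuming query/snippet pairs play the role of parallel text; the thesis used GIZA, so this is only illustrative:

    from collections import defaultdict

    def train_model1(pairs, iterations=10):
        # pairs: list of (query_words, doc_words); returns t[(qw, dw)] ~ p(qw|dw)
        t = defaultdict(lambda: 1e-3)                # near-uniform start
        for _ in range(iterations):
            count = defaultdict(float)
            total = defaultdict(float)
            for q_words, d_words in pairs:
                for qw in q_words:                   # E-step: expected counts
                    z = sum(t[(qw, dw)] for dw in d_words)
                    for dw in d_words:
                        c = t[(qw, dw)] / z
                        count[(qw, dw)] += c
                        total[dw] += c
            for (qw, dw), c in count.items():        # M-step: re-normalize
                t[(qw, dw)] = c / total[dw]
        return t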

30
Learning User profile
  • Extracting Parallel Texts
  • From Queries and corresponding snippets from
    clicked documents
  • Training a Translation Model
  • GIZA - an open source tool kit widely used for
    training translation models in Statistical
    Machine Translation research.

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Statistical machine translation models for personalized search. Technical report, International Institute of Information Technology, 2007.
31
Sample user profile
32
Reranking
  • Recall: in the general language modeling approach to IR, documents are scored by query likelihood
  • Noisy Channel based approach: score each candidate document by the probability that it "translates" into the query (a sketch follows the example)

[Figure: for the query "lemur", translation probabilities such as P(retrieval|lemur) raise the score of D4, "The Lemur toolkit for language modeling and information retrieval is documented and made available for download," above D1, "Lemur - Encyclopedia gives a brief description of the physical traits of this animal."]
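
A minimal noisy-channel scoring sketch using the translation table t from the training sketch above; the smoothing constant and the uniform P(dw|D) are assumptions:

    import math

    def nc_score(query_words, doc_words, t, eps=1e-6):
        # P(Q|D) with P(qw|D) = sum over dw of p(qw|dw) * P(dw|D)
        n = len(doc_words) or 1
        score = 0.0
        for qw in query_words:
            p = sum(t.get((qw, dw), 0.0) for dw in doc_words) / n
            score += math.log(p + eps)
        return score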
33
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

34
Machine Learning based Approaches: Introduction
  • Most machine learning for IR treats it as a binary classification problem: relevant vs. non-relevant
  • Clickthrough data
  • A click is not an absolute relevance judgment but a relative one
  • i.e., assuming clicked = relevant and un-clicked = irrelevant is wrong
  • Clicks are biased
  • Partial relative relevance: clicked documents are more relevant than the un-clicked documents (a sketch follows)
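
A minimal sketch of turning clicks into such partial relative judgments, in the style of Joachims' "clicked beats skipped-above" heuristic; the thesis's exact pair-extraction rule is an assumption here:

    def preference_pairs(ranked_doc_ids, clicked_ids):
        # each clicked document is preferred over every un-clicked
        # document ranked above it
        clicked = set(clicked_ids)
        pairs = []
        for i, doc in enumerate(ranked_doc_ids):
            if doc in clicked:
                for earlier in ranked_doc_ids[:i]:
                    if earlier not in clicked:
                        pairs.append((doc, earlier))
        return pairs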

35
Background
  • Ranking SVM
  • A variation of SVM
  • Learns from Partial Relevance Data
  • Learning similar to classification SVM

36
Ranking SVMs based method
  • Use Ranking SVMs for learning user profile
  • Experimented
  • Different features
  • Unigram, bigram
  • Different Feature weights
  • Boolean, Term Frequency, Normalized Term Frequency

37
Learning user profile
  • User profile a weight vector
  • Learning Training an SVM Model
  • Steps
  • Extracting Features
  • Computing Feature Weights
  • Training SVM

1. Uppuluri R. and Ambati V. Improving web search results using collaborative filtering. In Proceedings of the 3rd International Workshop on Web Personalization (ITWP), held in conjunction with AAAI 2006, 2006.
2. U. Rohini and Vasudeva Varma. A novel approach for re-ranking of search results using collaborative filtering. In Proceedings of the International Conference on Computing Theory and Applications (ICCTA07), pages 491-495, Kolkata, India, March 2007.
38
Extracting Features
  • Features: unigrams, bigrams
  • Given: past search history
  • Hu = (q1, rf1), (q2, rf2), ..., (qn, rfn)
  • rf_all = concatenation of all rf
  • Remove stop words from rf_all
  • Extract all unigrams (or bigrams) from rf_all

39
Computing Feature Weights
  • In each relevant document di, compute the weight of each feature
  • Boolean weighting: 1 or 0
  • Term frequency weighting: tf_w = number of times the feature occurs in di
  • Normalized term frequency weighting: tf_w normalized by the document length (a sketch follows)
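
A minimal sketch of the three weighting schemes (tokenization and the exact normalization are assumptions):

    def feature_weight(doc_tokens, feature, scheme="tf_norm"):
        tf = doc_tokens.count(feature)
        if scheme == "boolean":
            return 1 if tf > 0 else 0
        if scheme == "tf":
            return tf
        if scheme == "tf_norm":
            return tf / (len(doc_tokens) or 1)   # normalize by doc length
        raise ValueError(scheme)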

40
Training SVM
  • Represent each relevant document as a string of features and their corresponding weights
  • We used SVMlight for training (an input sketch follows)
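
A minimal sketch of writing training data in the SVMlight ranking format ("<target> qid:<q> <feature_id>:<weight> ..."); the feature-to-integer-id mapping and the example values are assumptions:

    def write_svmlight(path, examples):
        # examples: list of (target, qid, {feature_id: weight})
        with open(path, "w") as f:
            for target, qid, feats in examples:
                parts = [f"{target} qid:{qid}"]
                parts += [f"{i}:{w}" for i, w in sorted(feats.items())]
                f.write(" ".join(parts) + "\n")

    # e.g. a clicked document ranked above an un-clicked one for query 1:
    write_svmlight("train.dat", [(2, 1, {1: 0.12, 5: 0.30}),
                                 (1, 1, {2: 0.05, 5: 0.10})])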

41

Sample Training
Sample User Profile
42
Reranking
  • Sim(Q,D) = W . Phi(Q,D)
  • W: weight vector (the learnt user profile)
  • Phi(Q,D): vector of terms and their weights - a measure of the similarity between Q and D
  • Each term: a term in the query
  • Term weight: the product of the term's weights in the query and in the document (boolean, term frequency, or normalized term frequency); a sketch follows
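
A minimal sketch of this scoring, with all three inputs represented as term-to-weight maps (names are illustrative):

    def sim(query_weights, doc_weights, profile_w):
        # Sim(Q, D) = W . Phi(Q, D), where Phi pairs each query term
        # with the product of its query and document weights
        score = 0.0
        for term, qw in query_weights.items():
            phi = qw * doc_weights.get(term, 0.0)
            score += profile_w.get(term, 0.0) * phi
        return score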

43
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

44
Personalized Search without Relevance Feedback: Introduction
  • Can personalization be done without relevance feedback about which documents are relevant?
  • How informative are the queries posed by users?
  • Is the information contained in the queries enough to personalize?

45
Approach
  • Past queries of the user available
  • Make effective use of past queries
  • Simple N-gram based approach

46
Learning user profile
  • Given: past search history
  • Hu = q1, q2, ..., qn
  • q_concat = concatenation of all queries
  • For each unigram wi, estimate its probability in q_concat
  • User profile: the resulting unigram distribution

47
Sample user profile
48
Reranking
  • Recall: the general language modeling approach to IR
  • Our approach: re-score the top documents using the query-based user profile

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Personalized search without relevance feedback. Technical report, International Institute of Information Technology, 2007.
49
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

50
Experiments: Introduction, Problems
  • Aim: see how the proposed approaches perform by comparing them with a baseline
  • Problems
  • No standard evaluation framework
  • Data
  • Lack of standardization
  • Comparison with previous work is difficult
  • Difficult to repeat previously conducted experiments
  • Difficult to share results and observations
  • Effort to collect data is repeated over and over
  • Identified as a problem in need of standardization (Allan et al. 2003)
  • Lack of standard personalized search baselines
  • In our work, we used a variation of the Rocchio algorithm
  • Metrics

51
Experiments: Data
  • Clickthrough data from a popular search engine
  • Data collected from about 250k users over 3 months in 2006
  • Consists of (anonymous id, query, timestamp, position of the click, domain name of the clicked URL)

52
Experiments: Sample Data

53
Issues with the query log data
  • Web search engines
  • Search engine indices change over time
  • However, the top 10 results stay mostly the same
  • Implicit feedback: partial relevance feedback
  • 90% of the users click only the top 10 results
  • 95% only the top 5 results
  • The log only contained the domain name of the clicked URLs
54
Extracting Data Set
  • Conditions
  • A query should have at least 1 click
  • The user should exhibit long-term behaviour (pose queries over the 3 months and exhibit similar interests)
  • Assumptions
  • Each anonymous id corresponds to one user
  • Use the domain name of the clicked URL when comparing
  • Final data set
  • How to split the data for training (learning the user profile) and testing?
  • Temporally
  • Training data: learning the user profile; testing data: testing
  • First 2 months for training, third month for testing
  • 17 users
  • 51.88 average queries in the train set and 12.64 average queries in the test set

55
Baseline
  • Variation of the Rocchio algorithm (Rocchio 1971)
  • Learning the profile
  • User profile: a vector of words and weights
  • For each query
  • For each clicked document
  • Collect the corresponding snippet from the search engine
  • Concatenate all such snippets for all queries
  • Compute the frequency distribution of the words
  • Reranking
  • Sim(Q,D) = (tf_Q/|Q| + tf_RUP/|RUP|) . tf_D/|D|, where RUP is the Rocchio user profile

56
Metrics
  • MRR: Mean Reciprocal Rank
  • MRR(Q, D, u) = (1/|Q|) * sum over q in Q of rr(q, R_{Q,D,u})
  • rr(q, R_{Q,D,u}): the reciprocal of the position of the first relevant document, and 0 if there is no relevant result in the top N (= 10); a sketch follows
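
A minimal sketch of this metric (data structures are illustrative):

    def mean_reciprocal_rank(ranked_lists, relevant, n=10):
        # ranked_lists: query -> ranked doc ids; relevant: query -> set of ids
        total = 0.0
        for q, ranking in ranked_lists.items():
            rr = 0.0
            for pos, doc in enumerate(ranking[:n], start=1):
                if doc in relevant.get(q, set()):
                    rr = 1.0 / pos        # first relevant document
                    break
            total += rr
        return total / len(ranked_lists)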

57
Set up
Reranker:
1. Rerank the top M (= 10) results using clickthrough data.
2. First get the results from Google, ignoring the ranks given by Google (similar to Tan, Shen and Zhai 2006).
3. Re-score the results appropriately.
4. Sort in descending order and return.
[Figure: a query from the test data (query + clicked URLs) goes to the search engine; the top m URLs are reranked, and the reranked results are compared against the clicked URLs over the top n URLs using MRR and P@n.]
58
Results: Simple N-gram based Methods
59
Noisy Channel Based Method
  • Experiment 1
  • Comparison with baseline
  • Experiment 2
  • Different methods of extracting parallel texts
  • Experiment 3
  • Different training schemes
  • Different contexts for training
  • Different training models

60
Experiment 1
Comparison with baseline
61
Experiment 2
  • Extracting parallel texts: comparison of methods

62
Results
[Figure: comparison of parallel-text extraction methods.
NS1 - queries paired with snippets of relevant documents.
NS2 - queries paired with snippets of relevant documents, plus synthetic queries paired with snippets.
NS3 - queries paired with snippets of relevant documents, document titles paired with snippets, and synthetic queries paired with snippets.]
63
Experiment 3
  • Different training schemes
  • Different contexts for training: snippet vs. document
  • Different training models
64

Data and Set up
  • Data
  • Explicit feedback data collected from 7 users
  • For each query, each user examined the top 10 documents and identified the relevant ones
  • Collected the top 10 results for all queries: 3469 documents in total
  • Set up
  • Created a Lucene index of the 3469 documents
  • For reranking, first retrieve results using Lucene, then rerank them using the noisy channel approach
  • We perform 10-fold cross validation
65
Results
66
Results
I - Document training and document testing
II - Document training and snippet testing
III - Snippet training and document testing
IV - Snippet training and snippet testing
67
Results: SVM
SVM1 - unigram, boolean
SVM2 - unigram, term frequency
SVM3 - unigram, normalized term frequency
SVM4 - bigram, normalized term frequency
SVM5 - unigrams + bigrams, normalized term frequency
68
Results: Personalization without Relevance Feedback
PRWF - personalization without relevance feedback, using only the profile learnt from queries alone
PRWF+Smoothing - the same, with the user-profile probabilities smoothed using a large query language model built from all queries of all users in collection 01 of the clickthrough data
69
Experiments: Summary
  • Language modeling: best results!
  • An interesting framework for personalized search
  • Simple N-gram based approaches also worked well
  • The Noisy Channel model worked best
  • Extracting synthetic queries helped
  • Different training schemes: IBM Model 1 vs. GIZA, snippet vs. document
  • Machine learning: competitive results
  • Different features and weights
  • Without relevance feedback: very encouraging results
  • The simple approach worked well
  • Sparsity: the query log was useful

70
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

71
Query Log Study: Introduction
  • Large interest in finding patterns and computing statistics from query logs
  • Previous work
  • Patterns and statistics of queries: common queries, avg. no. of words, avg. no. of queries per session, etc.
  • Little work on analyzing the click behaviour of users
  • Granka et al. - eye tracking study

72
Query Log Study: Our Analysis
  • Analyzing the clicking behaviour of users
  • Study whether there is any general pattern in clicking behaviour
  • Aim to answer the following
  • Expt 1: Do all users view results from top to bottom?
  • Expt 2: Do all users view the same number of results?

73
Query Log Study: Observations
  • Expt 1: Do all users view results from top to bottom?
  • YES!! - for 90% of queries
  • Why is this important?
  • Expt 2: How many top results does the user view? -> the deepest click made by users
  • Statistical analysis showed that the deepest clicks made by a sample of users follow a Zipf distribution (power law)
  • Many users view only the top 5 (about 90-95%), few view the top 10, much fewer view the top 20, and so on
  • Why is this important?

74
Outline of the talk
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem Description
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • I-Search: A suite of approaches for Personalized Web Search
  • Statistical Language modeling based approaches
  • Simple N-gram based methods
  • Noisy Channel based method
  • Machine Learning based approach
  • Ranking SVM based method
  • Personalization without Relevance Feedback
  • Experiments
  • Query Log Study
  • Simulated Feedback
  • Conclusions and Future Directions

75
Simulated Feedback: Introduction
  • Relevance feedback: types and problems
  • Explicit
  • Difficult to collect
  • Implicit
  • Clickthrough data from search engines is not available
  • Repeatability of experiments is a problem!
  • The web has dynamic data collections - collected feedback becomes stale
  • Privacy

76
Simulated Feedback: Motivation
  • Simulated feedback: drawing an analogy from explicit and implicit feedback
  • A potential area whose outcome is useful for web search and personalization
  • Easy to create
  • Customizable
  • Large amounts can be created
  • Repeatable
  • Can target specific domains for testing

77
Simulated Feedback: Creation
[Figure: a user creator builds a simulated user from parameters; the web search behaviour simulator then runs - Step 1: formulate a query; Step 2: pose it to a search engine; Step 3: look at the results returned by the search engine; Step 4: possibly click one or more results - and emits simulated feedback.]
78
Outline
  • Introduction
  • Current Search Engines: Problems
  • Motivation
  • Background
  • Problem
  • Solution Outline
  • Contributions
  • Review of Personalized Search
  • Thesis Outline
  • Statistical Language modeling based approaches
  • Simple Language model based approaches
  • Noisy Channel
  • Machine Learning based approach
  • Ranking SVM
  • Personalization without Relevance Feedback
  • Experiments
  • Conclusions and Future Directions

79
Conclusions
  • Statistical Language Modeling based approaches
  • Machine learning based approach
  • Personalized Search without relevance feedback
  • Performed evaluation using query log data
  • Query Log Analysis and Simulated Feedback

80
Future Directions
  • Recommending documents
  • Extend to exploit repetition in queries and clickthroughs
  • Language modeling based approaches
  • Capture richer context
  • N-gram based method: trigrams, etc.
  • Noisy Channel based method: bigrams
  • Machine learning based approaches
  • Can learn non-text patterns or behaviour
  • Personalized summarization
  • Simulating user behaviour

81
  • Thank you

82
Simple N-gram based approaches
  • N-gram: a general term for a sequence of n words
  • 1-gram = unigram, 2-gram = bigram
  • Capture statistical properties of text
  • Single words (unigrams)
  • Two adjacent words (bigrams)

83
Query Log Study: Introduction
  • Query logs
  • Large interest in finding patterns and computing statistics from query logs
  • Previous work
  • Patterns and statistics on queries
  • Common queries, avg. no. of words, avg. no. of queries per session, etc.
  • Little work on analyzing the click behaviour of users
  • Granka et al. - eye tracking study

84
Query Log Study: Our Analysis
  • Focus on analyzing the clicking behaviour of users
  • Study whether there is any general pattern in clicking behaviour
  • Aim to answer the following
  • Do all users view results from top to bottom? (Expt 1)
  • Do all users view the same number of results? (Expt 2)

85
Query Log Data
  • Clickthrough data from a popular search engine
  • Data collected from about 250k users over 3 months in 2006
  • Consists of (anonymous id, query, timestamp, position of the click, domain name of the clicked URL)

86
Sample Data

87
Experiment 1
  • Do all users view results from top to bottom?
  • Position: the position of the search result in the search engine ranking
  • For each query
  • Arrange the clicks by time of click
  • If all the positions are in ascending order, the user views from top to bottom
  • The query is said to be an anomaly otherwise (a check sketched below)
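
A minimal sketch of this check (the input format is illustrative):

    def is_anomaly(clicks):
        # clicks: list of (timestamp, position) for one query
        positions = [pos for _, pos in sorted(clicks)]   # order by time
        return any(a > b for a, b in zip(positions, positions[1:]))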

88
89
Observations
  • For 90% of the queries, users always go from top to bottom!!!
  • For the remaining 10% of queries
  • The user clicks at least one lower result before clicking a higher result
  • User not happy with the search engine ranking?
  • Not the behaviour of particular users - 50% of users exhibit it
  • Certain queries are hard?

90
Experiment 2
  • How many top results does the user view?
  • Intuition
  • Typically, users don't view all the results
  • Only the top few - how many?
  • Does it depend on the user?
  • Goal: to see how deep a user goes into the results

91
  • Patience: how many results a user views
  • For each query, the deepest click; maximum over all queries
  • For each query, the average click; maximum over all queries

92
[Figure: for each query, the deepest click; maximum over all queries]
93
[Figure: for each query, the average click; maximum over all queries]
94
Observations
  • Statistical analysis shows they follow a Zipf distribution (power law)
  • Many users view only the top 5 (about 90-95%), few view the top 10, much fewer view the top 20, and so on
  • We can characterize the patience of a group of users using Zipf's law (a power law)

95
Simulated Feedback
  • Relevance feedback
  • Explicit
  • Difficult to collect
  • Implicit
  • Clickthrough data from search engines is not available
  • Repeatability of experiments is a problem!
  • The web has dynamic data collections - collected feedback becomes stale

96
Simulated Feedback
  • Simulated feedback: drawing an analogy from explicit and implicit feedback
  • A potential area whose outcome is useful for web search and personalization
  • Easy to create
  • Customizable
  • Large amounts can be created
  • Repeatable
  • Can target specific domains for testing

97
  • Creating simulated Feedback
  • Creating Simulated user
  • Simulating user web search behaviour

U. Rohini, Vamshi Ambati, and Vasudeva Varma. Creating simulated feedback. Technical report, International Institute of Information Technology, 2007.
98
Creating Simulated User
  • User-specific parameters (unique id, etc.)
  • Web-search-specific parameters
  • Patience (from the query log analysis)
  • Threshold
  • Others could be interests (user profile/model), browsing history, etc.

We considered patience and threshold in this work.
99
Patience
Pick from a power-law distribution: many users view the top 5, fewer view the top 10, much fewer view the top 20, and so on (a sampling sketch follows).
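
A minimal sketch of sampling a simulated user's patience from a truncated Zipf (power-law) distribution; the exponent and cutoff are assumed parameters:

    import numpy as np

    def sample_patience(max_depth=25, exponent=2.0, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        while True:
            p = int(rng.zipf(exponent))   # heavy tail: small depths dominate
            if p <= max_depth:
                return p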
100
Relevance Threshold
  • Depends on the query and the user
  • For some queries, very high relevance is needed
  • We compute it per user, according to the query

101
Simulating user web search behaviour
  • Formulate a web search process
  • Step 1: Create the query
  • Step 2: Pose it to a search engine
  • Step 3: Look at the results returned by the search engine
  • Step 4: Possibly click one or more results
  • Step 5: Reformulate if unsatisfied
  • Simulate the search process for the created user

We consider only Steps 1 to 4 in our approach.
102
Simulating Step 1: Formulating the query
  • Can be very complex
  • We take a simple and practical approach
  • For now, the queries are assumed to be given to the system

103
Simulating Step 2: Searching the Search Engine
  • Given a search engine
  • Pose the query from Step 1 to the search engine
  • Get the search results

104
Simulating Step 3: Looking at the Search Results
  • Simulation of this step can be done in a number of ways
  • e.g., random, top to bottom, bottom to top, etc.
  • We consider
  • Sequential, from top to bottom, until patience reaches zero
  • For each document, perform clicks as in Step 4
  • (motivated by Radlinski et al., Granka et al.)

105
Simulating Step 4: Clicking the results
  • The crucial step of our simulation
  • The user clicks a result if
  • The snippet shown by the search engine appears relevant to the user
  • The result below it is not more relevant than it (motivated by Radlinski et al., Granka et al.); a sketch follows
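
A minimal sketch of this click rule while scanning top to bottom; the relevance function and threshold come from the simulated user and are assumptions here:

    def simulate_clicks(snippets, relevance, threshold, patience):
        clicks = []
        for i, snip in enumerate(snippets[:patience]):
            rel = relevance(snip)
            nxt = relevance(snippets[i + 1]) if i + 1 < len(snippets) else 0.0
            # click if the snippet looks relevant and the one below it
            # is not more relevant
            if rel >= threshold and nxt <= rel:
                clicks.append(i + 1)      # 1-based click position
        return clicks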

106
Simulated Feedback: Creation
[Figure: the user creator turns parameters into a simulated user; the web search behaviour simulator poses queries to the search engine, scans the returned search results, and emits simulated feedback.]
107
Evaluation: Problems
  • Is simulated feedback relevant?
  • How different is it from randomly created feedback?
  • Evaluation
  • No standard methods to evaluate
  • No metrics to quantify success
  • How and what to compare?

108
Experiments
  • Experiment 1
  • Comparison with Implicit Feedback from Query log
    Data
  • Experiment 2
  • Comparison with Baselines
  • Experiment 3
  • Comparison with Explicit Feedback

109
Experimental Set up
  • Creating a simulated user
  • Randomly assign a unique id
  • Patience
  • Draw randomly from a power-law distribution over 1-25

110
Experimental set up
  • Simulating the web search process
  • Pick a user from the query log and gather all queries posed by them
  • Simulate the web search process for each query in succession
  • Step 1: Formulating a query
  • Pick each query in succession from the gathered queries
  • Step 2: Searching the search engine
  • Pose the query to a search engine and gather the results
  • Step 3: Looking at the results
  • Step 4: Clicking one or more results

111
Sample Data Created
112
Experiment 1
  • Comparison with clickthroughs from the query log
  • For each query, a Relevance Document Pool (RDP): all clicked documents for the query from all the users in the query log
  • Average accuracy: 60.04%

113
Experiment 2
  • Random Navigation
  • Power law Navigation
  • Random click

114
Creating user
115
Creating Web Search Process
116
Results
117
Experiment 3
  • Comparison with explicit feedback
  • 4 judges
  • Selected a small subset of the created data
  • 25 users, 1 query per user: 25 queries in total
  • We consider the query and the simulated feedback created for it

118
  • Each judge was given an evaluation form
  • Evaluation form
  • Details about the judge
  • A table containing the query and the corresponding simulated click URLs
  • For each simulated click, the judge gives feedback
  • Boolean feedback: 1 or 0

119
Results
  • Judge accuracy: 66.02%
  • Correlation between the judges: 0.859

120
Discussion
  • 6% increase in accuracy over the comparison with the query log
  • Matching problems
  • Search engine index changes - relevance feedback becomes stale!
  • Too few relevant documents in the RDP
  • qualcomm.com - only one document in the RDP
  • A focussed query that only one user posed
  • Focussed query vs. general query
  • qualcomm.com - only one query, posed by one user
  • lottery - 58 users, 24 unique clicked URLs

121
Reranking
  • Recall: in the general language modeling approach to IR, documents are scored by query likelihood
  • Noisy Channel based approach

[Figure: the "lemur" example - translation probabilities favor "The Lemur toolkit for language modeling and information retrieval is documented and made available for download" over "Lemur - Encyclopedia gives a brief description of the physical traits of this animal."]