1. Optimizing Search Engines using Clickthrough Data
Presentation by M. Sükrü Kuran
2. Outline
- Search Engines
- Clickthrough Data
- Learning of Retrieval Functions
- Support Vector Machine (SVM) for Learning of Ranking Functions
- Experiment Setup
- Offline Experiment
- Online Experiment
- Analysis of the Online Experiment
- Conclusion and Future Work
- References
- Questions
3. Search Engines
- Search engines use ranking functions to list results by their relevance to the query
- Current ranking functions are not optimized for relevance
- As an alternative, we can use clickthrough data to learn rankings that are better optimized for relevance
4. Clickthrough Data
- What is clickthrough data?
- Clickthrough data is the set of links that the user selects from the list of links retrieved by the search engine for a given query
- Why is clickthrough data important?
- The clicked links are the most relevant links among the query results, from the user's perspective
- It is easier to acquire than explicit user feedback, since the data is already in the logs of the search engine
5. Clickthrough Data (2)
- Users are less likely to click on a link that has a low ranking (independent of its actual relevance)
- Users typically scan only the first 10 links in the result set [24]
- Thus, clickthrough data does not give an absolute relevance value for the query, but it is a good relative relevance signal
6. Clickthrough Data (3)
- Example: results for a search for SVM
  1. Kernel Machines
  2. Support Vector Machine
  3. SVM-Light Support Vector Machine
  4. An Introduction to Support Vector Machines
  5. Support Vector Machine and Kernel Methods References
  6. Archives of Support Vector Machines
  7. SVM demo Applet
  8. Royal Holloway Support Vector Machine
  9. Support Vector Machine - The Software
  10. Lagrangian Support Vector Machine Home Page
Among the 10 results, only links 1, 3 and 7 are clicked (the clickthrough data).
7. Clickthrough Data (4)
The clicks in the example imply the following pairwise preferences (a binary relation expressing the ranking preferred by the user):
- link3 < link2
- link7 < link2
- link7 < link4
- link7 < link5
- link7 < link6
Here link_i < link_j means the user prefers link_i to be ranked above link_j. We can generalize this preference information: extract link_i < link_j for all pairs with 1 ≤ j < i, where link_i is clicked and link_j is not. (A code sketch of this rule follows below.)
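Below is a minimal Python sketch of this extraction rule; it is not from the paper, and the function name and interface are illustrative assumptions.

```python
# Minimal sketch of preference-pair extraction from clickthrough data
# (slide 7's rule). `ranking` is the list of links in presented order;
# `clicked` is the set of 1-based ranks the user clicked. Illustrative
# interface, not the paper's code.

def preference_pairs(ranking, clicked):
    """For each clicked link i, emit (link_i, link_j) meaning
    'link_i should rank above link_j' for every non-clicked
    link_j presented above it (j < i)."""
    pairs = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                pairs.append((ranking[i - 1], ranking[j - 1]))
    return pairs

# The slide's example: links 1, 3 and 7 clicked out of 10 results.
links = [f"link{k}" for k in range(1, 11)]
print(preference_pairs(links, {1, 3, 7}))
# [('link3', 'link2'), ('link7', 'link2'), ('link7', 'link4'),
#  ('link7', 'link5'), ('link7', 'link6')]
```

Running this on the slide's example reproduces exactly the five preference pairs listed above.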
8. Learning of Retrieval Functions
- Goal: we have to find a retrieval function f whose ranking r_{f(q)} is close to the target (user-preferred) ranking r^*
- In order to calculate the similarity between any given r_{f(q)} and r^*, we have to use a performance metric
- Average Precision (binary relevance): very simple
- Kendall's τ: a good performance metric for comparing rankings
9. Learning of Retrieval Functions (2)
- Kendall's τ
- Between any two rankings r_a and r_b of the same result set, the distance is

  \tau(r_a, r_b) = \frac{P - Q}{P + Q} = 1 - \frac{2Q}{\binom{m}{2}}

- D: set of documents in a query result
- P: number of concordant pairs in D x D (pairs ordered the same way by both rankings)
- Q: number of discordant pairs in D x D
- m: number of documents/links in D
(A code sketch computing τ follows below.)
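A minimal sketch computing Kendall's τ directly from this definition; representing a ranking as a dict mapping each document to its position is an illustrative assumption.

```python
from itertools import combinations

# Kendall's tau from the definition on slide 9: count concordant (P)
# and discordant (Q) pairs. Assumes strict rankings (no ties).
# `rank_a` and `rank_b` map each document to its position (1 = best).

def kendall_tau(rank_a, rank_b):
    P = Q = 0
    for d1, d2 in combinations(list(rank_a), 2):
        concordant = (rank_a[d1] - rank_a[d2]) * (rank_b[d1] - rank_b[d2]) > 0
        if concordant:
            P += 1
        else:
            Q += 1
    return (P - Q) / (P + Q)

# Identical rankings give tau = 1; a fully reversed ranking gives -1.
a = {"d1": 1, "d2": 2, "d3": 3}
print(kendall_tau(a, a))                            # 1.0
print(kendall_tau(a, {"d1": 3, "d2": 2, "d3": 1}))  # -1.0
```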
10. Learning of Retrieval Functions (3)
- Problem definition of learning an appropriate retrieval function:
- For a fixed (but unknown) distribution \Pr(q, r^*) of queries and target (user-preferred) rankings, the goal is to maximize the expected Kendall's τ

  \tau_P(f) = \int \tau(r_{f(q)}, r^*) \, d\Pr(q, r^*)

- where \Pr(q, r^*) is the distribution of queries and target rankings
11. Support Vector Machine (SVM) for Learning of Ranking Functions
- Machine learning in information retrieval is usually based on binary classification (a document is either relevant to the query or not)
- Since the information gathered from clickthrough data is not absolute relevance information, we cannot use binary classification directly
12. Support Vector Machine (SVM) for Learning of Ranking Functions (2)
- Using a set of queries and user ranking sets (the training data), we will select a ranking function from a family F of ranking functions
- Selection is based on maximizing the empirical τ on the training set (equivalently, minimizing the number of discordant pairs):

  \tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)

- n: number of queries in the training set
13. Support Vector Machine (SVM) for Learning of Ranking Functions (3)
- Then, we need to find a sound family of ranking functions
- How do we find a family F that includes an efficient ranking function f?
14. Support Vector Machine (SVM) for Learning of Ranking Functions (4)
- Consider the set of linear ranking functions f_w, where for query q document d_i is ranked above document d_j whenever

  w \cdot \Phi(q, d_i) > w \cdot \Phi(q, d_j)

- \Phi(q, d) is a feature mapping describing the match between query q and document d, as in description-based retrieval functions [10, 11]
- w is a weight vector adjusted by learning (a code sketch follows below)
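A minimal sketch of ranking with such a linear function; the weight values and two-dimensional features are made-up illustrations, not the paper's feature set (see slide 17).

```python
import numpy as np

# Linear ranking function from slide 14: order documents by the score
# w . phi(q, d). The feature vectors here are illustrative stand-ins.

def rank_documents(w, features):
    """`features` maps doc id -> feature vector phi(q, d); returns
    doc ids sorted by descending score w . phi(q, d)."""
    return sorted(features, key=lambda d: -np.dot(w, features[d]))

w = np.array([0.7, 0.3])
features = {"d1": np.array([0.2, 0.9]),
            "d2": np.array([0.8, 0.1]),
            "d3": np.array([0.5, 0.5])}
print(rank_documents(w, features))  # ['d2', 'd3', 'd1']
```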
15. Support Vector Machine (SVM) for Learning of Ranking Functions (5)
- Instead of maximizing the goal function directly, we can minimize Q, the number of discordant pairs in our performance measure
- Using classification SVMs [7], this becomes the optimization problem

  minimize:    V(w, \xi) = \frac{1}{2}\, w \cdot w + C \sum \xi_{i,j,k}
  subject to:  w \cdot \Phi(q_k, d_i) \ge w \cdot \Phi(q_k, d_j) + 1 - \xi_{i,j,k}  for all (d_i, d_j) \in r_k^*
               \xi_{i,j,k} \ge 0

(A sketch of the reduction to a classification SVM follows below.)
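Each constraint can be rearranged as w \cdot (\Phi(q_k, d_i) - \Phi(q_k, d_j)) \ge 1 - \xi_{i,j,k}, i.e., a soft-margin classification SVM on pairwise difference vectors. The sketch below illustrates this reduction with scikit-learn's LinearSVC as a stand-in solver (the paper's experiments used SVM-Light [14]); the interface is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Ranking SVM via its classification reduction: train a linear SVM on
# difference vectors phi(q, d_i) - phi(q, d_j), labeled +1 when d_i is
# preferred over d_j (and the mirrored pair labeled -1).

def train_ranking_svm(pref_pairs, C=1.0):
    """`pref_pairs`: list of (phi_winner, phi_loser) feature vectors.
    Returns the learned weight vector w."""
    X, y = [], []
    for phi_i, phi_j in pref_pairs:
        X.append(phi_i - phi_j)   # winner minus loser -> class +1
        y.append(+1)
        X.append(phi_j - phi_i)   # mirrored pair -> class -1
        y.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.array(X), np.array(y))
    return svm.coef_.ravel()

# Toy preferences where feature 0 should dominate the ranking.
pairs = [(np.array([0.9, 0.1]), np.array([0.2, 0.8])),
         (np.array([0.7, 0.3]), np.array([0.1, 0.4]))]
w = train_ranking_svm(pairs)
print(w)  # expect w[0] > w[1]
```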
16. Experiment Setup
- A baseline meta-search engine called Striver is used for testing purposes
- Striver forwards each query to MSNSearch, Google, Excite, Altavista and Hotbot
- It acquires the top 100 results from each search engine
- Based on the learned retrieval function, it selects the top 50 of the resulting pool of up to 500 documents (the pool may be smaller when more than one engine has found the same document); a sketch of this step follows below
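A minimal sketch of this merge-and-rank step; `engine_results`, `phi` and the learned weights `w` are illustrative assumptions, not Striver's actual interfaces.

```python
import numpy as np

# Striver-style meta-search step (slide 16): pool the per-engine result
# lists, de-duplicate documents returned by several engines (via a set),
# then keep the top k by the learned score w . phi(d).

def merge_and_rank(engine_results, phi, w, k=50):
    """`engine_results`: one result list (top 100) per engine.
    Returns up to k unique documents ranked by the learned function."""
    pool = {doc for results in engine_results for doc in results}
    return sorted(pool, key=lambda d: -np.dot(w, phi(d)))[:k]
```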
17. Offline Experiment
- Using Striver, 112 queries are recorded
- A large set of features is used to compute the description-based retrieval functions
- Testing is done with different numbers of training-set queries
- Results from Google and MSNSearch are used for benchmarking purposes
18. Offline Experiment (2)
[Results figure: performance of the learned retrieval function for different training-set sizes, benchmarked against Google and MSNSearch.]
19. Online Experiment
- Striver is used by a group of 20 people
- Based on these people's queries, the training set of Striver is composed of 260 queries
- The results are compared with results from Google, MSNSearch and Toprank (a simple meta-search engine)
20. Online Experiment (2)
In the comparison against Google, "more clicks" means users clicked more links in the learned engine's ranking than in Google's; this happened for 29 out of 88 queries. "Less clicks" means users clicked fewer links in the learned engine's ranking than in Google's; this happened for 13 out of 88 queries.
21. Analysis of the Online Experiment
- Since all of the users used the engine for academic searches, the learned function is good for searches on academic research topics
- But it may not give equally good results for different groups of users
- We can say that the learned engine is customizable to its user group, unlike traditional engines
22. Future Work and Conclusions
- What is the optimal group size for user customization?
- Features can be tuned for better performance
- Can clustering algorithms cluster WWW users into subgroups based on their clickthrough data?
- Can malicious users corrupt the learning process by clicking irrelevant links, and how can this be avoided?
23. References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.
[2] B. Bartell, G. Cottrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In Annual ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 1994.
[3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[5] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, August 1996.
[6] W. Cohen, R. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, 1999.
[7] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning Journal, 20:273-297, 1995.
[8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), 2001.
[9] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In International Conference on Machine Learning (ICML), 1998.
[10] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183-204, 1989.
[11] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. AIR/X - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606-623, 1991.
[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. MIT Press, Cambridge, MA, 2000.
[13] K. Hoffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114-125, 1995.
[14] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.
[15] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer, 2002.
[16] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. Technical report, Cornell University, Department of Computer Science, 2002. http://www.joachims.org.
[17] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a tour guide for the world wide web. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 770-777. Morgan Kaufmann, 1997.
[18] J. Kemeny and L. Snell. Mathematical Models in the Social Sciences. Ginn & Co., 1962.
[19] M. Kendall. Rank Correlation Methods. Hafner, 1955.
24. Questions
- Experiment results?
- Clickthrough data?
- Machine learning for retrieval functions?
- Retrieval functions?