1
Optimizing Search Engines using Clickthrough Data
  • by Thorsten Joachims

Presentation by M. Şükrü Kuran
2
Outline
  • Search Engines
  • Clickthrough Data
  • Learning of Retrieval Functions
  • Support Vector Machine (SVM) for Learning of Ranking Functions
  • Experiment Setup
  • Offline Experiment
  • Online Experiment
  • Analysis of the Online Experiment
  • Conclusion and Future Work
  • References
  • Questions

3
Search Engines
  • Search engines use ranking functions to order results by their relevance to the query
  • Current ranking functions are not optimized for relevance
  • As an alternative, clickthrough data can be used to learn rankings that are better optimized for relevance

4
Clickthrough Data
  • What is clickthrough data?
  • Clickthrough data is the set of links that the user selects from the list of links retrieved by the search engine for a user-given query
  • Why is clickthrough data important?
  • The clicked links are the most relevant links among the query results
  • It is easier to acquire than explicit user feedback (the data is already in the search engine's logs)

5
Clickthrough Data (2)
  • Users are less likely to click on a link that has a low ranking, independent of its actual relevance
  • Users typically scan only the first 10 links in the result set [24]
  • Thus, clickthrough data does not give absolute relevance values for a query, but it does give good relative relevance values

6
Clickthrough Data (3)
  • Example: results for a search for "SVM"
  • 1. Kernel Machines
  • 2. Support Vector Machine
  • 3. SVM-Light Support Vector Machine
  • 4. Intr. to Support Vector Machines
  • 5. Support Vector Machine and Kernel Methods Ref.
  • 6. Archives of Support Vector Machines
  • 7. SVM demo Applet
  • 8. Royal Holloway Support Vector Machine
  • 9. Support Vector Machine - The Software
  • 10. Lagrangian Support Vector Machine Home Page

Among the 10 results, only links 1, 3 and 7 are clicked (the clickthrough data)
7
Clickthrough Data (4)
  • link3 < link2
  • link7 < link2
  • link7 < link4
  • link7 < link5
  • link7 < link6

This is the ranking preferred by the user (a binary relation): link_i < link_j means the user prefers link_i over link_j.
We can generalize this preference information:
link_i < link_j for all pairs 1 ≤ j < i,
with link_i clicked and link_j not clicked (a sketch of this rule follows).
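As a minimal sketch of this extraction rule (the helper name and data layout are illustrative, not from the paper):

```python
def preferences_from_clicks(results, clicked):
    """Extract pairwise preferences from one query's clickthrough data.

    For every clicked link i, emit "link i is preferred over link j"
    for each *unclicked* link j that was ranked above it (j < i).
    `results` is the ranked result list; `clicked` holds 1-based ranks.
    """
    prefs = []
    for i in sorted(clicked):
        for j in range(1, i):            # links presented above link i
            if j not in clicked:         # ...that the user skipped
                prefs.append((results[i - 1], results[j - 1]))
    return prefs

# The example above: links 1, 3 and 7 clicked out of 10 results.
results = [f"link{k}" for k in range(1, 11)]
print(preferences_from_clicks(results, {1, 3, 7}))
# [('link3', 'link2'), ('link7', 'link2'), ('link7', 'link4'),
#  ('link7', 'link5'), ('link7', 'link6')]
```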
8
Learning of Retrieval Functions
  • Goal
  • We have to find a retrieval function f whose ranking r_f(q) is close to the user-preferred target ranking r*
  • In order to calculate the similarity between any given r_f(q) and r*, we have to use a performance metric
  • Average Precision (requires binary relevance judgments; very simple)
  • Kendall's τ (a good performance metric for comparing rankings)
9
Learning of Retrieval Functions (2)
  • Kendall's τ
  • Between any two rankings r_a and r_b of the same result set, the measure is
  • τ(r_a, r_b) = (P - Q) / (P + Q) = 1 - 2Q / (m choose 2)
  • D: set of documents in a query result
  • P: number of concordant pairs in D x D
  • Q: number of discordant pairs in D x D
  • m: number of documents/links in D
  • (a sketch of this computation follows)
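A compact sketch of this measure (the dict-based ranking layout is an assumption for illustration; ties are not handled):

```python
from itertools import combinations

def kendalls_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings of the same m documents.

    rank_a, rank_b map each document to its position (1 = best).
    tau = (P - Q) / (P + Q), with P concordant and Q discordant pairs.
    """
    concordant = discordant = 0
    for d1, d2 in combinations(rank_a, 2):
        # The pair is concordant if both rankings order d1, d2 the same way.
        if (rank_a[d1] - rank_a[d2]) * (rank_b[d1] - rank_b[d2]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Two rankings of five documents that disagree on one pair (d2 vs. d3):
ra = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
rb = {"d1": 1, "d2": 3, "d3": 2, "d4": 4, "d5": 5}
print(kendalls_tau(ra, rb))  # 0.8  (9 concordant, 1 discordant of 10 pairs)
```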

10
Learning of Retrieval Functions (3)
  • Problem definition of learning an appropriate retrieval function
  • For a fixed (but unknown) distribution Pr(q, r*) of queries and target (user-preferred) rankings, the goal is to find the f that maximizes the expected Kendall's τ:
  • τ_P(f) = ∫ τ(r_f(q), r*) d Pr(q, r*)
  • where Pr(q, r*) is the distribution of queries and target rankings

11
Support Vector Machine (SVM) for Learning of
Ranking Functions
  • Machine learning in information retrieval is usually based on binary classification
  • (a document is either relevant to the query or not)
  • Since the information gathered from clickthrough data is not absolute relevance information, we cannot use binary classification

12
Support Vector Machine (SVM) for Learning of
Ranking Functions (2)
  • Using a set of queries and user ranking sets (training data), we will select a ranking function from a family F of ranking functions

Selection is based on maximizing the empirical (average) Kendall's τ over the training set,
τ_S(f) = (1/n) Σ_{i=1..n} τ(r_f(q_i), r*_i),
equivalently minimizing the number of discordant pairs, where
n: number of queries in the training set (see the sketch below)
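As a tiny illustration (reusing the kendalls_tau sketch from slide 9; the training-data layout is assumed, not from the paper):

```python
def empirical_tau(rank_fn, training_set):
    """Average Kendall's tau of a ranking function over n training queries.

    training_set: list of (query, target_ranking) pairs, where rankings
    are dicts mapping document -> position, as in kendalls_tau above.
    """
    return sum(kendalls_tau(rank_fn(q), r_star)
               for q, r_star in training_set) / len(training_set)

# Selecting f from a finite family F then amounts to:
# best_f = max(F, key=lambda f: empirical_tau(f, training_set))
```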
13
Support Vector Machine (SVM) for Learning of
Ranking Functions (3)
  • Then, we need to find a sound family of ranking functions
  • How do we find an F which includes an efficient ranking function f?

14
Support Vector Machine (SVM) for Learning of
Ranking Functions (4)
  • Consider the set of functions f_w that rank documents by the score w · Φ(q, d)
  • where the components of Φ(q, d) are description-based retrieval features [10, 11]
  • and w is a weight vector (two-dimensional in the slide's illustration) adjusted by learning; a scoring sketch follows
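As a sketch, with a made-up two-feature Φ, ranking with f_w is just sorting by the score w · Φ(q, d):

```python
import numpy as np

def rank_documents(w, phi, query, documents):
    """Rank documents by descending score w . phi(query, d)."""
    return sorted(documents, key=lambda d: float(w @ phi(query, d)), reverse=True)

# Toy 2-feature example: phi = (title match, link popularity), both made up.
features = {"doc_a": np.array([1.0, 0.2]), "doc_b": np.array([0.0, 0.9])}
phi = lambda q, d: features[d]
w = np.array([0.7, 0.3])  # the weight vector that learning would adjust
print(rank_documents(w, phi, "svm", ["doc_a", "doc_b"]))  # ['doc_a', 'doc_b']
```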

15
Support Vector Machine (SVM) for Learning of
Ranking Functions (5)
  • Instead of maximizing our goal function directly, we can minimize Q (the number of discordant pairs) in our performance measure
  • by using classification SVMs [7]; a reduction sketch follows

minimize: V(w, ξ) = (1/2) w · w + C Σ ξ_{i,j,k}
subject to: w · Φ(q_k, d_i) ≥ w · Φ(q_k, d_j) + 1 - ξ_{i,j,k} for all (d_i, d_j) ∈ r*_k, and ξ_{i,j,k} ≥ 0
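A common way to implement this reduction (a sketch using scikit-learn's LinearSVC rather than the SVM-Light package used in the paper; regularization details differ) is to train a binary classifier on pairwise feature differences Φ(q, d_i) - Φ(q, d_j):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_ranking_svm(pairs, C=1.0):
    """Train a linear SVM on feature differences.

    pairs: list of (phi_preferred, phi_other) feature vectors, meaning the
    first document should rank above the second. Each pair yields two
    training examples (+1 and -1) so the classification data is balanced.
    """
    X, y = [], []
    for phi_i, phi_j in pairs:
        X.append(phi_i - phi_j); y.append(+1)
        X.append(phi_j - phi_i); y.append(-1)
    svm = LinearSVC(loss="hinge", C=C).fit(np.array(X), np.array(y))
    return svm.coef_.ravel()  # the weight vector w of f_w

# Toy preferences: documents stronger in feature 0 should rank higher.
pairs = [(np.array([0.9, 0.1]), np.array([0.2, 0.1])),
         (np.array([0.8, 0.5]), np.array([0.1, 0.6]))]
w = fit_ranking_svm(pairs)
print(w)  # the weight on feature 0 dominates
```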
16
Experiment Setup
  • A baseline meta-search engine called Striver is used for testing purposes
  • Striver forwards a query to MSNSearch, Google, Excite, Altavista and Hotbot
  • It acquires the top 100 results from each search engine
  • Based on the learned retrieval function, it selects the top 50 of the up to 500 candidates (fewer if more than one engine has found the same document); a sketch of this merge-and-rerank step follows
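A rough sketch of that step (function and parameter names are hypothetical, not Striver's actual code): collect the per-engine result lists, deduplicate by URL, score every candidate with the learned function, and keep the top 50:

```python
def merge_and_rerank(engine_results, score, k=50):
    """engine_results: dict engine -> ranked list of URLs (top 100 each).

    Duplicate URLs found by several engines are kept once, then all
    candidates are re-ranked with the learned retrieval function `score`
    (a callable mapping a URL to a real-valued relevance score).
    """
    candidates = {}
    for engine, urls in engine_results.items():
        for url in urls:
            candidates.setdefault(url, set()).add(engine)  # dedupe by URL
    return sorted(candidates, key=score, reverse=True)[:k]
```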

17
Offline Experiment
  • Using Striver, 112 queries are recorded
  • A large set of features is used to compute the description-based retrieval features
  • Testing is done with different numbers of training queries
  • Results from Google and MSNSearch are used for benchmarking purposes

18
Offline Experiment (2)
19
Online Experiment
  • Striver is used by a group of 20 people
  • Based on these people's queries, the training set of Striver is composed of 260 queries
  • The results are compared with results from Google, MSNSearch and Toprank (a simple meta-search engine)

20
Online Experiment (2)
In the comparison with Google, "more clicks" means that users clicked more links in the learned engine's ranking than in Google's; this happened for 29 out of 88 queries. "Fewer clicks" means they clicked fewer links in the learned engine than in Google, which happened for only 13 out of 88 queries.
21
Analysis of the Online Experiment
  • Since all of the users used the engine for academic searches, the learned ranking is good for searches on academic research topics
  • But it may not give results as good for different groups of users
  • We can say that the learned engine is a customizable engine, unlike traditional engines

22
Future Work and Conclusions
  • What is the optimal group size for user customization?
  • The features can be tuned for better performance
  • Can clustering algorithms cluster WWW users into subgroups based on their clickthrough data?
  • Can malicious users corrupt the learning process by clicking on irrelevant links, and how can this be avoided?

23
References
  • [1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.
  • [2] B. Bartell, G. Cottrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In Annual ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 1994.
  • [3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
  • [4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
  • [5] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, August 1996.
  • [6] W. Cohen, R. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, 1999.
  • [7] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning Journal, 20:273-297, 1995.
  • [8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), 2001.
  • [9] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In International Conference on Machine Learning (ICML), 1998.
  • [10] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183-204, 1989.
  • [11] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. AIR/X - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606-623, 1991.
  • [12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. MIT Press, Cambridge, MA, 2000.
  • [13] K. Höffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114-125, 1995.
  • [14] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.
  • [15] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer, 2002.
  • [16] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. Technical report, Cornell University, Department of Computer Science, 2002. http://www.joachims.org
  • [17] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a tour guide for the World Wide Web. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 770-777. Morgan Kaufmann, 1997.
  • [18] J. Kemeny and L. Snell. Mathematical Models in the Social Sciences. Ginn & Co., 1962.
  • [19] M. Kendall. Rank Correlation Methods. Hafner, 1955.

24
Questions?
  • Search engines?
  • Clickthrough data?
  • Retrieval functions?
  • Machine learning for retrieval functions?
  • Experiment results?