KDDCUP 2005 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: KDDCUP 2005

1
KDD-CUP 2005

An Ensemble Method for Query Classification
Dou Shen
HKUST Team: Dou Shen, Rong Pan, Jiantao Sun,
Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
Hong Kong University of Science and Technology
Hong Kong, China
http://www.cs.ust.hk/

2
Task

Task
Categorize 800,000 queries into 67 predefined
categories
Limitation
"There is no restriction on what data you
can/can't use to build your models." (from
http://kdd05.lac.uic.edu/kddcup.html)

Key Characteristics
No training data
Meaning of queries is ambiguous
A query usually contains too few words
Queries often have more than one meaning
Semantics of categories are uncertain
Only the category names are given, with no further
specification

3
Nature of Problem
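[Diagram: in Phase I, search engines return page categories and page content for each query; in Phase II, a synonym-based classifier works on the page categories and statistical classifiers work on the page content.]
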
4
Phase I: From queries to pages and categories

Input
A query Q_i
Output
<Page list_i, Category list_i>
Approach
through Search Engines (SE)

We collected
40 million entries
50GB
Search engines
Lemur (CMU open source)
Google
ODP
Looksmart

5
Phase II.a: Synonym-based Classifier using
directories

67 KDD-categories in KDDCUP
172,565 in ODP/Google, 272,405 in Looksmart
For each KDDCUP category
Apply WordNet to find the corresponding synonyms
in the categories of ODP (Google) and Looksmart,
respectively
This produces one mapping function f for each
directory
Also returns a rank by matching frequency
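
A minimal sketch of the mapping f described above, ranking KDDCUP categories by matching frequency. The `SYNONYMS` table is a hand-written, hypothetical stand-in for WordNet lookups, and the category names, directory paths, and function names are illustrative assumptions, not the team's actual code.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: each KDDCUP category name maps to a
# hand-written set of synonym terms. A real system would query WordNet instead.
SYNONYMS = {
    "Computers\\Hardware": {"hardware", "computer", "peripheral"},
    "Living\\Food & Cooking": {"food", "cooking", "recipe", "cuisine"},
}


def tokenize(text):
    """Lowercase a directory path and split it into alphabetic terms."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


def map_directory_category(directory_category, synonyms=SYNONYMS):
    """Mapping f: return (KDDCUP category, matching frequency) pairs, best first."""
    terms = Counter(tokenize(directory_category))
    scores = {}
    for kdd_category, words in synonyms.items():
        hits = sum(terms[w] for w in words)
        if hits > 0:
            scores[kdd_category] = hits
    # Highest matching frequency first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative directory paths (assumed, not taken from ODP or Looksmart).
ranked = map_directory_category("Top/Home/Cooking/Recipe/Food_Safety")
unmapped = map_directory_category("Top/Arts/Music/Jazz")
```

The second call returns an empty list, which illustrates the low-recall risk: a directory category with no synonym match contributes nothing.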

Advantage
Fast
Precise
Disadvantage
Many of the 172K and 272K categories from
ODP/Google and Looksmart do not map to KDDCUP
categories
This may result in low recall

6
Phase II.b: Statistical Classifiers

Statistical Classifiers
Support Vector Machine (SVM) mapping pages to
KDDCUP categories
Training Data
15 million pages with categories from ODP
Directory
Apply the mapping f from Phase II.a, to build
training data.
Application of the classifier
Construct a virtual document for each query by
combining the snippets from the returned pages
given in Phase I.
Classifier returns category and rank
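
The two data-preparation steps above can be sketched as follows, assuming the mapping f is available as a plain dictionary. SVM training itself is omitted; the function names, the dictionary form of f, and the sample data are hypothetical and only illustrate the idea.

```python
def relabel_pages(page_pairs, f):
    """Turn (page, odp_category) pairs into (page, kdd_category) pairs via f."""
    training = []
    for page, odp_category in page_pairs:
        # Pages whose directory category has no mapping are dropped.
        for kdd_category in f.get(odp_category, []):
            training.append((page, kdd_category))
    return training


def build_virtual_document(snippets):
    """Concatenate the snippets of a query's returned pages into one text."""
    return " ".join(s.strip() for s in snippets if s.strip())


# Hypothetical mapping f and data, for illustration only.
f = {"Top/Computers/Hardware": ["Computers\\Hardware"]}
pages = [
    ("page-1 text", "Top/Computers/Hardware"),
    ("page-2 text", "Top/Arts/Music"),
]
training_pairs = relabel_pages(pages, f)
virtual_doc = build_virtual_document(["  cheap graphics cards ", "", "GPU reviews"])
```

The virtual document would then be passed to the trained classifier in place of the short query text.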

[Diagram: the mapping f turns 15 million (page, ODP-category) pairs into 15 million (page, KDD-category) pairs.]

7
Component Classifier Integration

We follow an ensemble learning approach
Each classifier returns the category and rank
The two kinds of classifiers have similar
performance.
We integrate the different classifiers together
by a weighted sum of the ranks
Weights can be determined from a validation data set
Based on the performance on the 111 sample queries
Assign different weight values for a classifier
on different categories
The higher the precision, the higher the weight
value
We have also tried to use equally weighted
component classifiers
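
A rough sketch of the weighted combination described above. The slides do not say how a rank becomes a number, so the 1/position score, the `top_k` cutoff, and all names and values here are assumptions made for illustration.

```python
from collections import defaultdict


def combine(rankings, weights=None, top_k=5):
    """Merge ranked category lists from several classifiers.

    rankings: {classifier_name: [category, ...]} with the best category first.
    weights:  {classifier_name: {category: weight}}; None means equal weights.
    """
    scores = defaultdict(float)
    for name, categories in rankings.items():
        for position, category in enumerate(categories, start=1):
            if weights is None:
                w = 1.0
            else:
                w = weights.get(name, {}).get(category, 0.0)
            # Assumption: a rank is turned into a score of 1/position.
            scores[category] += w / position
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, score in ordered[:top_k] if score > 0]


# Hypothetical classifier outputs and per-category precision weights.
rankings = {"synonym": ["A", "B"], "svm": ["B", "C"]}
precision_weights = {
    "synonym": {"A": 0.9, "B": 0.2},
    "svm": {"B": 0.3, "C": 0.6},
}
equal = combine(rankings)                       # equally weighted combination
weighted = combine(rankings, precision_weights)  # precision-based weights
```

With equal weights, category B wins because both classifiers return it; with precision-based weights, A moves to the top because the classifier that proposed it is trusted most on that category.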

8
Final Result Generation

Two Solutions: one for each evaluation criterion
S1: Using weights determined on the validation data
set is expected to achieve better precision
Since each component classifier is highly
weighted on the classes where it achieves high
precision.
S2: Equally weighted combination is expected to
achieve higher F1 performance
Since the recall is relatively high
Evaluation Results (http://www.acm.org/sigs/sigkdd/kdd2005/kddcup.html)
The Results are generated automatically.
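
To make the precision versus F1 trade-off between S1 and S2 concrete, here is a small sketch that computes micro-averaged precision, recall, and F1 over a set of queries. The micro-averaging choice, the function name, and the toy labels are assumptions; the official evaluation script may differ.

```python
def evaluate(predicted, gold):
    """Micro-averaged precision, recall and F1 over all queries.

    predicted, gold: {query: set of categories}.
    """
    correct = sum(len(predicted.get(q, set()) & labels) for q, labels in gold.items())
    n_predicted = sum(len(predicted.get(q, set())) for q in gold)
    n_gold = sum(len(labels) for labels in gold.values())
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: a cautious submission versus a broader one.
gold = {"q1": {"A", "B"}, "q2": {"C"}}
cautious = {"q1": {"A"}}
broad = {"q1": {"A", "B", "D"}, "q2": {"C", "D"}}

p_cautious, r_cautious, f_cautious = evaluate(cautious, gold)
p_broad, r_broad, f_broad = evaluate(broad, gold)
```

In this toy case the cautious submission has the higher precision, while the broad one has the higher F1 because its recall is much higher.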

9
Putting them together

[Diagram: Phase I, Phase II, Ensemble]

10
Many methods did not work...

An example of the failed methods
Submit a query into search engines and get its
related pages
Similarly, we could get the related pages of a
category
Judge the similarity between the target query and
category by checking the Web-page context of the
search results
Assign the query to the most similar categories.
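
For reference, the idea behind this failed method can be sketched as a bag-of-words cosine similarity between the pages returned for a query and the pages returned for each category name. The similarity measure, function names, and sample pages are assumptions; the slides only say the approach did not work, not how it was implemented.

```python
import math
from collections import Counter


def bag_of_words(pages):
    """Count lowercase whitespace-separated words across a list of page texts."""
    return Counter(word for page in pages for word in page.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_categories(query_pages, category_pages, top_k=1):
    """Rank categories by similarity of their pages to the query's pages."""
    query_vec = bag_of_words(query_pages)
    scored = [
        (cosine(query_vec, bag_of_words(pages)), name)
        for name, pages in category_pages.items()
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical search results for one query and two categories.
query_pages = ["jaguar car dealer", "jaguar car price"]
category_pages = {
    "Autos": ["car dealer price"],
    "Animals": ["jaguar habitat rainforest"],
}
best = most_similar_categories(query_pages, category_pages)
```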

11
Questions?
