KDDCUP 2005 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: KDDCUP 2005


1
KDD-CUP 2005
  • An Ensemble Method for Query Classification
  • Dou Shen
  • HKUST Team Dou Shen, Rong Pan, Jiantao Sun,
    Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
  • Hong Kong University of Science and Technology
  • Hong Kong, China
  • http://www.cs.ust.hk/

2
Task
  • Task
  • Categorize 800,000 queries into 67 predefined
    categories
  • Limitation
  • "There is no restriction on what data you
    can/can't use to build your models." (From
    http://kdd05.lac.uic.edu/kddcup.html)
  • Key Characteristics
  • No training data
  • Meaning of Queries ambiguous
  • A query usually contains too few words
  • Queries often have more than one meaning.
  • Semantics of Categories uncertain
  • Only the category names are given, with no
    further specification

3
Nature of Problem
[Diagram: Phase I uses search engines to obtain page categories and page content. Phase II applies a synonym-based classifier to the page categories and statistical classifiers to the page content.]

4
Phase I: From queries to pages and categories
  • Input
  • A query Q_i
  • Output
  • <Page list_i, Category list_i>
  • Approach
  • through Search Engines (SE)
  • We collected
  • 40 million entries
  • 50GB
  • Search engines
  • Lemur (CMU open source)
  • Google
  • ODP
  • Looksmart

5
Phase II.a: Synonym-based Classifier using
directories
  • 67 KDD-categories in KDDCUP
  • 172,565 categories in ODP/Google, 272,405 in Looksmart
  • For each KDDCUP category
  • Apply WordNet to find the corresponding synonyms
    in the categories of ODP (Google) and Looksmart,
    respectively
  • This produces one mapping function f for each
    directory
  • Also returns a rank by matching frequency
  • Advantage
  • Fast,
  • Precise
  • Disadvantage
  • Many of the 172K and 272K categories from
    ODP/Google and Looksmart do not map to KDDCUP
    categories
  • This may result in low recall
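
The synonym-based mapping described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the synonym sets are hard-coded stand-ins for what WordNet would supply, the category names and directory paths are made up, and the helper names `build_mapping` and `classify_query` are hypothetical, not taken from the slides.

```python
from collections import Counter

# Hypothetical synonym sets per KDDCUP category; the slides obtain these from WordNet.
SYNONYMS = {
    "Hardware": {"hardware", "peripherals", "components"},
    "Soccer": {"soccer", "football"},
}

def build_mapping(directory_categories, synonyms):
    """Build f: directory category path -> list of matching KDDCUP categories."""
    mapping = {}
    for path in directory_categories:
        # Split a path such as "Sports/Football/Clubs" into lowercase tokens.
        tokens = {t.lower() for t in path.replace("/", " ").replace("_", " ").split()}
        targets = [kdd for kdd, words in synonyms.items() if tokens & words]
        if targets:  # most directory categories match nothing (the low-recall issue)
            mapping[path] = targets
    return mapping

def classify_query(page_categories, mapping):
    """Rank KDDCUP categories by how often the returned pages' categories map to them."""
    counts = Counter()
    for path in page_categories:
        for kdd in mapping.get(path, []):
            counts[kdd] += 1
    return [kdd for kdd, _ in counts.most_common()]

directory = ["Computers/Hardware/Components", "Sports/Football/Clubs", "Arts/Music"]
f = build_mapping(directory, SYNONYMS)
ranked = classify_query(
    ["Sports/Football/Clubs", "Sports/Football/Clubs",
     "Computers/Hardware/Components", "Arts/Music"],
    f,
)
print(ranked)
```

In this toy run "Arts/Music" matches no synonym set and is dropped, which is the recall problem the slide points out.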

6
Phase II.b: Statistical Classifiers
  • Statistical Classifiers
  • Support Vector Machine (SVM) mapping pages to
    KDDCUP categories
  • Training Data
  • 15 million pages with categories from ODP
    Directory
  • Apply the mapping f from Phase II.a to build
    the training data
  • Application of the classifier
  • Construct a virtual document for each query by
    combining the snippets from the returned pages
    given in Phase I.
  • Classifier returns category and rank

[Figure: the mapping f turns 15 million (page, odp-categories) pairs into 15 million (page, kdd-categories) pairs]

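
The two data-preparation steps on this slide, relabeling the ODP training pages through f and building a virtual document per query, might look like the sketch below. The SVM itself is left out; the function names and the sample data are hypothetical, and joining snippets with a space is an assumption about how they are combined.

```python
def relabel_training_pairs(pairs, mapping):
    """Turn (page, odp_category) pairs into (page, kdd_category) pairs using f."""
    relabeled = []
    for page_text, odp_category in pairs:
        # Pages whose ODP category has no KDDCUP counterpart are skipped.
        for kdd_category in mapping.get(odp_category, []):
            relabeled.append((page_text, kdd_category))
    return relabeled

def build_virtual_document(snippets):
    """Join the snippets of a query's returned pages into one pseudo-document."""
    return " ".join(s.strip() for s in snippets if s.strip())

# Hypothetical mapping f and training pairs, for illustration only.
f = {"Sports/Football/Clubs": ["Soccer"]}
pairs = [
    ("club fixtures and results", "Sports/Football/Clubs"),
    ("concert listings", "Arts/Music"),
]
training_data = relabel_training_pairs(pairs, f)
virtual_doc = build_virtual_document(["  league table ", "", "match report"])
print(training_data, virtual_doc)
```

The virtual document would then be fed to the trained classifier in place of the short query text.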
7
Component Classifier Integration
  • We follow an ensemble learning approach
  • Each classifier returns the category and rank
  • The two kinds of classifiers have similar
    performance
  • We integrate the different classifiers
    by a weighted sum of the ranks
  • Weights can be determined from a validation data set
  • Based on performance on the 111 sample queries
  • Assign a different weight to a classifier
    for each category
  • The higher the precision, the higher the weight
  • We have also tried to use equally weighted
    component classifiers
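
A minimal sketch of combining ranked outputs by a weighted sum is given below. The slides do not say how a rank is turned into a score, so the reciprocal-rank scoring here is an assumption, as are the function name `combine`, the `top_k` parameter, and the example weights. Leaving `weights` empty gives the equally weighted variant.

```python
from collections import defaultdict

def combine(rankings, weights=None, top_k=5):
    """Merge ranked category lists from several classifiers.

    rankings: {classifier_name: [category, ...]}, best category first.
    weights:  {(classifier_name, category): weight}; missing entries count as 1.0.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked in rankings.items():
        for position, category in enumerate(ranked, start=1):
            w = weights.get((name, category), 1.0)
            scores[category] += w / position  # assumption: reciprocal-rank score
    # Highest score first; ties broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ordered[:top_k]]

rankings = {"synonym": ["A", "B"], "svm": ["B", "C"]}
equal = combine(rankings)
# Hypothetical per-category weights, e.g. derived from validation precision.
weighted = combine(rankings, {("synonym", "A"): 3.0, ("svm", "B"): 0.5})
print(equal, weighted)
```

With equal weights, category B wins because both classifiers return it; with the example weights, the synonym classifier's high weight on A moves A to the top.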

8
Final Result Generation
  • Two solutions, one for each evaluation criterion
  • S1: Weighting by the validation data set is expected to
    achieve better precision
  • Since each component classifier is highly
    weighted on the classes where it achieves high
    precision
  • S2: Equally weighted combination is expected to
    achieve higher F1 performance
  • Since the recall is relatively high
  • Evaluation Results (http://www.acm.org/sigs/sigkdd/kdd2005/kddcup.html)
  • The Results are generated automatically.
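
To make the precision versus F1 trade-off concrete, the sketch below computes both measures for multi-label predictions. The slides do not give the official evaluation formula, so pooling the counts over all queries is an assumption, and the function name and sample data are hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """predicted, truth: {query: set of categories}. Counts are pooled over queries."""
    correct = sum(len(predicted.get(q, set()) & labels) for q, labels in truth.items())
    n_predicted = sum(len(predicted.get(q, set())) for q in truth)
    n_true = sum(len(labels) for labels in truth.values())
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

truth = {"q1": {"A", "B"}, "q2": {"C"}}
cautious = {"q1": {"A"}, "q2": {"C"}}  # few predictions, all of them correct
precision, recall, f1 = precision_recall_f1(cautious, truth)
print(precision, recall, f1)
```

A cautious system like this one scores perfect precision but misses a label, which lowers recall and therefore F1.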

9
Putting them together
[Diagram: Phase I, Phase II, Ensemble]

10
Many methods did not work
  • An example of the failed methods
  • Submit a query into search engines and get its
    related pages
  • Similarly, we could get the related pages of a
    category
  • Judge the similarity between the target query and
    category by checking the Web-page context of the
    search results
  • Assign the query to the most similar categories.
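
The failed approach above could be sketched as follows, for readers who want to see what was tried. The slides do not name a similarity measure; cosine similarity over bag-of-words counts is an assumption, and all function names and sample pages are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(pages):
    """Count lowercase whitespace-separated tokens over a list of page texts."""
    counts = Counter()
    for text in pages:
        counts.update(text.lower().split())
    return counts

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar_categories(query_pages, category_pages, top_k=1):
    """Rank categories by similarity of their related pages to the query's pages."""
    q = bag_of_words(query_pages)
    scored = [(cosine(q, bag_of_words(pages)), name) for name, pages in category_pages.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]

query_pages = ["cheap laptop deals", "laptop reviews"]
category_pages = {
    "Hardware": ["laptop and desktop hardware"],
    "Soccer": ["soccer league results"],
}
best = most_similar_categories(query_pages, category_pages)
print(best)
```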

11
Questions?