KDD Cup Survey

1
KDD Cup Survey
  • Xinyue Liu

2
Outline
  • Nuts and Bolts of KDD Cup
  • KDD Cup 97-99
  • KDD Cup 2000
  • Summary

3
About KDD Cup
  • A knowledge discovery and data mining tools
    competition held in conjunction with the KDD
    conferences. It aims at:
  • Showcasing the best methods for discovering
    higher-level knowledge from data
  • Helping to close the gap between research and
    industry
  • Stimulating further KDD research and development

4
Statistics
  • Participation in KDD Cup grew steadily,
    especially requests to access the data
  • Average person-hours per submission: 204
  • Max person-hours per submission: 910
  • Commercial software use grew from 44% (Cup 97)
    to 52% (Cup 98) to 77% (Cup 2000)

5
Algorithms
Decision trees were the most widely tried and by
far the most commonly submitted
6
KDD Cup 97
  • A classification task: predict response to
    direct mail in the financial services industry
  • Winners:
  • Charles Elkan, a PhD from UC San Diego, with his
    Boosted Naive Bayesian classifier (BNB)
  • Silicon Graphics, Inc., with their software MineSet
  • Urban Science Applications, Inc., with their
    software Gain, Direct Marketing Selection System

7
BNB
  • Boosting learns a series of classifiers,
    where each classifier in the series pays more
    attention to the examples misclassified by its
    predecessor. Repeated for T rounds.
  • BNB is representationally equivalent to a
    multilayer perceptron with a single hidden layer.
  • Complexity: O(ef), where e is the number of
    examples and f the number of attributes
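
The boosting loop described above can be sketched in a few lines. This is a minimal illustration, assuming binary features, labels in {-1, +1}, and a standard AdaBoost-style reweighting; it is not necessarily Elkan's exact procedure, and the names `train_weighted_nb`, `boost_nb`, `predict`, and the constant `SMOOTH` are hypothetical.

```python
import math

SMOOTH = 0.01  # smoothing mass; an arbitrary choice for this sketch

def train_weighted_nb(X, y, w):
    """Fit naive Bayes on binary features using example weights w.
    Returns a function mapping a feature vector to a label in {-1, +1}."""
    n_feat = len(X[0])
    mass = {-1: 0.0, 1: 0.0}                        # weighted class totals
    ones = {-1: [0.0] * n_feat, 1: [0.0] * n_feat}  # weighted counts of x_j == 1
    for xi, yi, wi in zip(X, y, w):
        mass[yi] += wi
        for j, v in enumerate(xi):
            ones[yi][j] += wi * v

    def classify(x):
        score = {}
        for c in (-1, 1):
            s = math.log(mass[c] + SMOOTH)
            for j, v in enumerate(x):
                p1 = (ones[c][j] + 0.5 * SMOOTH) / (mass[c] + SMOOTH)
                s += math.log(p1 if v else 1.0 - p1)
            score[c] = s
        return 1 if score[1] >= score[-1] else -1
    return classify

def boost_nb(X, y, rounds=10):
    """Run up to `rounds` (T) boosting rounds; each round is one pass
    over e examples and f attributes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = train_weighted_nb(X, y, w)
        preds = [h(x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:          # no better than chance: stop
            break
        err = max(err, 1e-10)   # avoid division by zero on a perfect round
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Raise the weight of misclassified examples, lower the rest
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the classifiers learned in each round."""
    total = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if total >= 0 else -1
```

Calling `boost_nb(X, y, rounds=T)` and then `predict(ensemble, x)` gives the ensemble's vote for a new record.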

8
MineSet
  • A KDD tool that combines data access,
    transformation, classification, and visualization.

9
KDD Cup 98
  • URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
  • A classification task to analyze fundraising
    mail responses to a non-profit organization
  • Winners
  • Urban Science Applications, Inc. with their
    software GainSmarts.
  • SAS Institute, Inc. with their software
    Enterprise Miner.
  • Quadstone Limited with their software
    Decisionhouse

10
GainSmarts
  • GainSmarts: a feature selection expert system
  • First step: used logistic regression to assign
    each prospect a probability of donation (Pi)
  • Second step: used linear regression to estimate
    a conditional donation amount for responding
    donors (Ai)
  • Result (<1 error): prediction = Pi × Ai
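
The two-stage combination can be illustrated as follows. This is a sketch under stated assumptions: the stage-one probabilities and stage-two amounts are taken as given, and the mailing rule (mail when the expected donation exceeds a per-piece cost) is an assumed use of the prediction, not something the slide specifies. The function names and the `cost_per_mail` parameter are hypothetical.

```python
def expected_donation(p_respond, amount_if_respond):
    """Combine the two stages: probability of donating (Pi) times the
    conditional donation amount (Ai)."""
    return p_respond * amount_if_respond

def select_mailing(prospects, cost_per_mail):
    """prospects: list of (prospect_id, Pi, Ai) tuples.
    Keep only prospects whose expected donation exceeds the mailing cost
    (an assumed decision rule for this sketch)."""
    return [pid for pid, p, a in prospects
            if expected_donation(p, a) > cost_per_mail]
```

For example, a prospect with Pi = 0.05 and Ai = 20 has an expected donation of about 1, so it would be mailed under any per-piece cost below that.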

11
Enterprise Miner
  • A data mining solution that addresses the entire
    data mining process
  • SEMMA process: Sample, Explore, Modify, Model,
    Assess
  • Algorithms: decision tree, neural network,
    regression

12
Decisionhouse
  • Decisionhouse: an integrated modelling software
    suite by Quadstone
  • Data exploration using visualization modules.
  • Use Decision trees and Scorecards to model more
    complex tasks.
  • Choose the final model by comparing a variety of
    modeling approaches and looking at the difference
    in predicted net profitability (lift curve).
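
Comparing models by net profitability can be sketched as a cumulative profit curve: rank prospects by model score, then accumulate donation minus mailing cost. This is an illustrative sketch only; the function names, the input layout, and the `cost_per_mail` parameter are assumptions, not Decisionhouse features.

```python
def profit_curve(scored, cost_per_mail):
    """scored: list of (model_score, actual_donation) pairs.
    Returns cumulative net profit after mailing the top k prospects,
    for k = 1..n, with prospects ranked by descending score."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    curve, total = [], 0.0
    for _, donation in ranked:
        total += donation - cost_per_mail
        curve.append(total)
    return curve

def best_cutoff(curve):
    """Return (number mailed, net profit) at the curve's maximum."""
    k = max(range(len(curve)), key=lambda i: curve[i])
    return k + 1, curve[k]
```

Two candidate models can then be compared by the peak of their curves on the same held-out data.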

13
Results
  • Maximum possible profit line: $72,776 in profits
    with 4,873 mailed
  • Mail-to-everyone solution: $10,560 in profits
    with 96,367 mailed
  • Entries compared: GainSmarts, SAS/Enterprise
    Miner, Quadstone/Decisionhouse
14
KDD Cup 99
  • URL: www.cse.ucsd.edu/users/elkan/kdresults.html
  • Problem: same data set as KDD Cup 98
  • Winners
  • SAS Institute Inc. with their software Enterprise
    Miner.
  • Amdocs with their Information Analysis
    Environment

15
Software
  • SAS used a two-stage model built from two
    multilayer perceptron (MLP) neural network
    models.
  • Amdocs used its own Information Analysis
    Environment, which allows modeling of the value
    and class membership simultaneously. The
    algorithm used is a hybrid logistic regression
    model.

16
KDD Cup 2000
  • www.ecn.purdue.edu/KDDCUP/
  • Sponsored by
  • Purdue University
  • Blue Martini Software

17
Data Set
  • Data collected from Gazelle.com, a legwear and
    legcare web retailer
  • Pre-processed
  • Training set: 2 months
  • Test sets: one month
  • Data collected includes:
  • Click streams
  • Order information
  • Registration form

18
Problems
  • The goal: design models to support web-site
    personalization and to improve the profitability
    of the site by increasing customer response.
  • Questions, given a set of page views:
  • Question 1: Will the visitor view another page
    on the site or leave?
  • Question 2: Which product brand will the visitor
    view in the remainder of the session?
  • Question 3: Characterize heavy spenders.
  • Question 4: Characterize killer pages.
  • Question 5: Characterize which product brand a
    visitor will view in the remainder of the
    session.

19
Evaluation
  • Accuracy/score was measured on the test sets
    for the two prediction questions
  • Insight questions were judged with the help of
    retail experts from Gazelle and Blue Martini
  • A list of insights was created from all
    participants
  • Each insight was given a weight
  • Each participant was scored on all insights
  • Additional factors
  • Presentation quality
  • Correctness
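
The insight scoring can be pictured as a weighted tally. The sketch below assumes a participant earns the full weight of each insight they reported and nothing otherwise; the actual judging rules (partial credit, the additional factors) are not given in the slides, and all names here are hypothetical.

```python
def score_participants(insight_weights, reported):
    """insight_weights: {insight_id: weight} for the pooled insight list.
    reported: {participant: set of insight_ids that participant found}.
    Returns {participant: total weight of the insights they reported}."""
    return {
        name: sum(insight_weights[i] for i in found if i in insight_weights)
        for name, found in reported.items()
    }
```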

20
The Winners
  • Questions 1 and 5 winner: Amdocs
  • Questions 2 and 3 winner: Salford Systems
  • Question 4 winner: e-steam
  • poster

21
Software (Amdocs)
  • Exploratory data analysis: SAS
  • Classification tree: Amdocs Business Insight
    Tool
  • Decision tree
  • Rules Extraction
  • Modeling
  • Combining models

22
Scheme
23
Main Model
  • Decision tree: 5 trees, built on 34000 cases
24
Sub-models
  • Each model captures a different aspect of the
    overall behavior in the data. Combining or
    ensembling the models provides the best
    prediction results.
  • Best rule: chooses the most accurate rule
    satisfied by each record
  • Hybrid model: logistic regression on the rule
    set and raw field values, which combine to
    define a score for each record
  • Merged rules: logistic regression on the rule
    set defines the score for each record as a
    combination of the rules the record satisfies
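
Two of these sub-models can be sketched briefly. This is a hedged illustration, not the Amdocs implementation: rules are represented as dictionaries with a predicate and a training accuracy, and the logistic weights are taken as already fitted. The record fields, rule layout, and function names are all hypothetical.

```python
import math

def best_rule(record, rules):
    """Best rule: among the rules the record satisfies, return the one
    with the highest accuracy, or None if no rule fires."""
    fired = [r for r in rules if r["when"](record)]
    return max(fired, key=lambda r: r["accuracy"]) if fired else None

def merged_rules_score(record, rules, weights, bias=0.0):
    """Merged rules: logistic score from the rules the record satisfies.
    Each satisfied rule contributes its (pre-fitted) weight."""
    z = bias + sum(w for r, w in zip(rules, weights) if r["when"](record))
    return 1.0 / (1.0 + math.exp(-z))
```

A hybrid model in the same spirit would add raw field values as further terms in `z` alongside the rule indicators.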
25
Software (Salford)
  • CART: a decision tree tool that automatically
    searches for and isolates significant
    patterns and relationships
  • MARS: a multivariate non-parametric regression
    procedure
  • HotSpotDetector
  • TreeNet

26
CART
  • Binary recursive partitioning.
  • Key elements:
  • Splitting rules
  • Brute-force search over all possible splits for
    all variables
  • Rank each splitting rule on the basis of a
    quality-of-split criterion (default: Gini)
  • Recursion: split until further splitting is
    impossible or stopped.
  • Class assignment
  • Plurality rule
  • Assign a class to every node, whether it is
    terminal or not.
  • Pruning trees: tree growing does not stop in
    the middle
  • Testing: the best sub-tree is the one with the
    lowest error
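
The splitting step can be made concrete with a small sketch: a brute-force search over every (variable, threshold) pair, ranked by weighted Gini impurity, plus plurality class assignment. It assumes numeric features and binary splits of the form `x[f] <= t`; it is an illustration of the idea, not Salford's CART code, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature f and threshold t (split: x[f] <= t) and return
    (weighted_gini, f, t) for the lowest-impurity split, or None if no
    split is possible."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        # Skip the largest value: it would leave the right side empty
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def plurality(labels):
    """Plurality rule: the most common class in a node."""
    return Counter(labels).most_common(1)[0][0]
```

Recursion would apply `best_split` again to each side until no split is possible, after which the full tree is pruned back and the sub-tree with the lowest error is kept.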

27
MARS
  • Automatic variable search
  • Automatic variable transformation
  • Automatic limited interaction searches
  • Variable nesting
  • Built-in testing regimens and model selection
    parameters
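
As background for the variable transformation item: MARS models are sums of hinge basis functions, max(0, x - knot) and its mirror image. The sketch below only evaluates such a model for given terms; the knot search and term selection that MARS automates are not shown, and the function names and term layout are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) for direction=+1,
    max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """terms: list of (coefficient, feature_index, knot, direction).
    Evaluates an additive model of hinge functions at the point x."""
    return intercept + sum(c * hinge(x[j], k, d) for c, j, k, d in terms)
```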

28
Insights (Heavy Spenders)
  • Some of the good insights:
  • Referrers: establish ad policy based on
    conversion rates, not click-throughs
  • Not an AOL user (browser window too small for
    layout)
  • Referring site traffic changed dramatically over
    time
  • Came to site from print ad or news, not friends
    and family
  • Very high and very low income
  • Geographic: Northeast U.S. states
  • Repeat visitors

29
Insights (Who leaves?)
  • Some of the good insights:
  • Crawlers and bots accounted for 16% of sessions
  • Long processing time (> 12 seconds) implies high
    abandonment
  • Referring sites: visitors from mycoupons have
    long sessions; those from shopnow.com are prone
    to exit quickly
  • Returning visitors' probability of continuing is
    double
  • Views of specific products (Oroblue, Levante)
    cause abandonment
  • Probability of leaving decreases with page views
  • Free Gift and Welcome templates on first three
    pages encouraged visitors to stay at site

30
Insights (Brand view)
  • Some good insights:
  • Referrer URL is a great predictor
  • Fashionmall.com and winnie-cooper are referrers
    for Hanes and Donna Karan
  • mycoupons.com, tripod, deal-finder are referrers
    for American Essentials
  • Previous views of a product imply later views

31
Summary
  • Data mining requires background knowledge and
    access to business users
  • Successful data mining solutions combine
    automated and manual analysis, integrating the
    power of the machine with expert knowledge and
    human insight
  • Web mining is challenging: crawlers/bots,
    frequent site changes, etc.
  • KDD Cup is an excellent source for learning
    state-of-the-art KDD techniques
  • KDD Cup data available for research and education

32
References
  • Elkan, C. (1997). Boosting and Naive Bayesian
    Learning. Technical Report No. CS97-557,
    September 1997, UCSD.
  • Decisionhouse (1998). KDD Cup 98: Quadstone Take
    Bronze Miner Award. Retrieved March 15, 2001
    from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html
  • Urban Science (1998). Urban Science wins the
    KDD-98 Cup. Retrieved March 15, 2001 from
    http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html
  • Georges, J. & Milley, A. (1999). KDD99
    Competition: Knowledge Discovery Contest.
    Retrieved March 15, 2001 from
    http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf
  • Rosset, S. & Inger, A. (1999). KDD-Cup 99:
    Knowledge Discovery in a Charitable
    Organization's Donor Database. Retrieved March
    15, 2001 from
    http://www.cse.ucsd.edu/users/elkan/KDD2.doc

33
References (Cont.)
  • Sebastiani, P., Ramoni, M. & Crea, A. (1999).
    Profiling your Customers using Bayesian Networks.
    Retrieved March 15, 2001 from
    http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf
  • Inger, A., Vatnik, N., Rosset, S. & Neumann, E.
    (2000). KDD-Cup 2000: Question 1 Winner's Report.
    Retrieved March 18, 2000 from
    http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt
  • Neumann, E., Vatnik, N., Rosset, S., Duenias, M.,
    Sasson, I. & Inger, A. (2000). KDD-Cup 2000:
    Question 5 Winner's Report. Retrieved March 18,
    2000 from
    http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt
  • Salford Systems white papers.
    http://www.salford-systems.com/whitepaper.html
  • Summary talk presented at KDD (2000).
    http://robotics.stanford.edu/ronnyk/kddCupTalk.ppt