Title: KDD Cup Survey
1KDD Cup Survey
2Outline
- Nuts and Bolts of KDD Cup
- KDD Cup 97-99
- KDD Cup 2000
- Summary
3About KDD Cup
- A knowledge discovery and data mining tools
competition in conjunction with KDD conferences.
It aims at - showcase the best methods for discovering
higher-level knowledge from data. - Helping to close the gap between research and
industry - Stimulating further KDD research and development
4Statistics
- Participation in KDD Cup grew steadily,
especially requests to access the data - Average person-hours per submission 204Max
person-hours per submission 910 - Commercial software grew from 44 (cup 97) to 52
(cup 98) to 77 (cup 2000)
5Algorithms
Decision trees most widely tried and by far the
most commonly submitted
6KDD Cup 97
- A classification task to predict Financial
services industry direct mail response - Winners
- Charles Elkan, a PhD from UC-San Diego with his
Boosted Naive Bayesian (BNB) - Silicon Graphics, Inc with their software MineSet
- Urban Science Applications, Inc. with their
software gain, Direct Marketing Selection System
7BNB
- Boosting to learn a series of classifiers,
where each classifier in the series pays more
attention to the examples misclassified by its
predecessor. Repeated T rounds. - BNB representationally equivalent to a
multilayer perceptron with a single hidden layer. - Complexity O(ef)
- e examples f - attributes
8MineSet
- A KDD tool that combines data access,
transformation, classification, and visualization.
9KDD Cup 98
- URL www.kdnuggets.com/meetings/kdd98/kdd-cup-98.h
tml - A classification task to analyze fund raising
mail responses to a non-profit organization - Winners
- Urban Science Applications, Inc. with their
software GainSmarts. - SAS Institute, Inc. with their software
Enterprise Miner. - Quadstone Limited with their software
Decisionhouse
10GainSmarts
- GainSmarts a feature selection expert system
- First step - used Logistic Regression to assign
each prospect a probability of donation (Pi). - Second step - used Linear Regression to estimate
a conditional donation amount of responding
donors (Ai) - Result (lt1 error) -
- Prediction Pi Ai
-
11Enterprise Miner
- A data mining solution that addresses the entire
data mining process - SEMMA Process
- Sample
- Explore
- Modify
- Model
- Assess
- Algorithms
- Decision tree
- Neural network
- Regression
12Decisionhouse
- Decisionhouse an integrated modelling software
suite by Quadstone - Data exploration using visualization modules.
- Use Decision trees and Scorecards to model more
complex tasks. - Choose the final model by comparing a variety of
modeling approaches and looking at the difference
in predicted net profitability (lift curve).
13Results
Maximum Possible Profit Line (72,776 in profits
with 4,873 mailed)
Mail to Everyone Solution (10,560 in profits
with 96,367 mailed)
GainSmarts
SAS/Enterprise Miner
Quadstone/Decisionhouse
14KDD Cup 99
- URL www.cse.ucsd.edu/users/elkan/kdresults.html
- Problem
- same data set as KDD Cup 98
- Winners
- SAS Institute Inc. with their software Enterprise
Miner. - Amdocs with their Information Analysis
Environment
15Software
- SAS using two-stage model which includes two
multi-layer perceptron (MLP) neural networks
models. - Amdocs using its own Information Analysis
Environment, which allows modeling of the value
and class membership simultaneously. Algorithms
used is a hybrid logistic regression model
16KDD Cup 2000
- www.ecn.purdue.edu/KDDCUP/
- Sponsored by
- Purdue University
- Blue Martini Software
17Data Set
- Data collected from Gazelle.com, a legwear and
legcare web retailer - Pre-processed
- Training set 2 months
- Test sets one month
- Data collected includes
- Click streams
- Order information
- Registration form
18Problems
- The goal to design models to support web-site
personalization and to improve the profitability
of the site by increasing customer response. - Questions - When given a set of page views,
- will the visitor view another page on the site or
leave? - which product brand will the visitor view in the
remainder of the session? - characterize heavy spenders
- characterize killer pages
- characterize which product brand a visitor will
view in the remainder of the session?
19Evaluation
- Accuracy/score was measured for the two questions
with test sets - Insight questions judged with help of retail
experts from Gazelle and Blue Martini - Created a list of insights from all
participants - Each insight was given a weigh
- Each participant was scored on all insights
- Additional factors
- Presentation quality
- Correctness
20The Winners
- Question 1 5 Winner Amdocs
- Question 2 3 Winner Salford Systems
- Question 4 Winner e-steam
- poster
21Software (Amdocs)
- Exploratory Data Analysis SAS
- Classification Tree Amdocs Business Insight
Tool - Decision tree
- Rules Extraction
- Modeling
- Combining models
22Scheme
23Main Model
Decision Tree
Decision Tree
Decision Tree
5 trees
5 trees
5 trees
built on 34000 cases
built on 34000 cases
built on 34000 cases
24Sub-models
Each model captures a different aspect of the
overall behavior in the data. Combining or
ensembling the models provides the best
prediction results.
Best rule
Chooses most accurate rule satisfied by each
record
Logistic regression on rule set raw field
values combine to define score for each record
Hybrid Model
Logistic regression on rule set defines score for
each record as a combination of rules the record
satisfies
Merged Rules
25Software (Salford)
- CART - a decision tree tool that automatically
searching for and isolating significant
patterns and relationships - MARS - a multivariate non-parametric regression
procedure - HotSpotDetector
- TreeNet
26Cart
- Binary recursive partitioning.
- Key elements
- Splitting rules
- Brute force search all possible splits for all
variables - Rank each splitting rule on the basis of a
quality-of-split criterion (default GINI) - Recursion - split until further splitting is
impossible or stopped. - Class assignment
- Plurality rule
- Assign every node whether it is terminal or not.
- Pruning Trees does not stop in the middle
- Testing - best sub-tree is the one with the
lowest error
27MARs
- Automatic variable searchÂ
- Automatic variable transformationÂ
- Automatic limited interaction searchesÂ
- Variable nestingÂ
- Built-in testing regimens model selection
parameters.
28Insights (Heavy Spenders)
- Some of the Good insights
- Referrers - establish ad policy based on
conversion rates, not click-throughs - Not an AOL user - browser window too small for
layout - Referring site traffic changed dramatically over
time - Came to site from print-ad or news, not friends
families - Very high and very low income
- Geographic Northeast U.S. states
- Repeat visitors
29Insights (Who leaves?)
- Some of the good insights
- Crawlers, bots accounted for 16 of sessions
- Long processing time (gt 12 seconds) implies high
abandonment - Referring sites mycoupons have long sessions,
shopnow.com are prone to exit quickly - Returning visitors' prob of continuing is double
- View of specific products (Oroblue,Levante) cause
abandonment - Probability of leaving decreases with page views
- Free Gift and Welcome templates on first three
pages encouraged visitors to stay at site
30Insights(Brand view)
- Some good insights
- Referrer URL is great predictor
- Fashionmall.com and winnie-cooper are referrers
for Hanes and Donna Karan - mycoupons.com, tripod, deal-finder are referrers
for American Essentials - Previous views of a product imply later views
31Summary
- Data mining requires background knowledge and
access to business users - Successful data mining solutions combine
automated and manual analysis, integrating the
power of the machine with expert knowledge and
human insight - Web Mining is challenging crawlers/bots,
frequent site changes, etc. - KDD Cup is an excellent source to learn the
state-of-art KDD techniques - KDD Cup data available for research and education
32References
- Elkan C. (1997). Boosting and Naive Bayesian
Learning. Technical Report No. CS97-557,
September 1997, UCSD. - Decisionhouse (1998). KDD Cup 98 Quadstone Take
Bronze Miner Award. Retrieved March 15, 2001
from http//www.kdnuggets.com/meetings/kdd98/quads
tone/index.html - Urbane Science (1998). Urbane Science wins the
KDD-98 Cup. Retrieved March 15, 2001 from
http//www.kdnuggets.com/meetings/kdd98/gain-kddcu
p98-release.html - Georges, J. Milley, A. (1999). KDD99
Competition Knowledge Discovery Contest.
Retrieved March 15, 2001 from http//www.cse.ucsd.
edu/users/elkan/saskdd99.pdf - Rosset, S. Inger A. (1999). KDD-Cup 99
Knowledge Discovery In a Charitable
Organizations Donor Database. Retrieved March
15, 2001 from http//www.cse.ucsd.edu/users/elkan/
KDD2.doc
33References (Cont.)
- Sebastiani P., Ramoni M. Crea A. (1999).
Profiling your Customers using Bayesian Networks.
Retrieved March 15, 2001 from http//bayesware.com
/resources/tutorials/kddcup99/kddcup99.pdf - Inger A., Vatnik N., Rosset S. Neumann E.
(2000). KDD-Cup 2000 Question 1 Winners Report.
Retrieved March 18, 2000 from - http//www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.
ppt - Neumann E., Vatnik N., Rosset S., Duenias M.,
Sasson I. Inger A. (2000). KDD-Cup 2000
Question 5 Winners Report. Retrieved March 18,
2000 from http//www.ecn.purdue.edu/KDDCUP/amdocs-
slides-5.ppt - Salford System white papers
- http//www.salford-systems.com/whitepaper.html
- Summary talk presented at KDD (2000)
- http//robotics.stanford.edu/ronnyk/kddCupTalk.p
pt