Title: Affinity Analysis for Selecting "Next Best Activity"
1. Decision Trees: divide & conquer
Tom Breur
London, 23 April 2008
tombreur_at_xlntconsulting.com
www.xlntconsulting.com
31-6-463 468 75
2. Agenda
- Features of decision trees
- Overview algorithms
- Exploration and prediction
- Automatic vs. manual
- Pitfalls & best practices
- Decision trees vs. regression
3. Features of decision trees
- Symbolic analysis / recursive partitioning / decision trees (inductive learning)
- Split the record set arriving at each node, using the best variable; rinse & repeat
- Stop when
  - All records in a leaf belong to the same class
  - No variable can be found for splitting
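The split-and-repeat loop and its two stopping rules can be sketched in a few lines of Python. This is a minimal illustration, not any particular published algorithm. It assumes categorical inputs, multiway splits, and information gain as the measure of "best variable". The function names and the four toy records are invented for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(rows, target):
    """Shannon entropy (in bits) of the target classes in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def partition(rows, var):
    """Group rows by their value of var."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[var]].append(r)
    return groups

def best_variable(rows, target, variables):
    """Variable with the largest positive information gain, or None."""
    best, best_gain = None, 1e-12
    base = entropy(rows, target)
    for var in variables:
        groups = partition(rows, var)
        remainder = sum(len(g) / len(rows) * entropy(g, target)
                        for g in groups.values())
        if base - remainder > best_gain:
            best, best_gain = var, base - remainder
    return best

def grow(rows, target, variables):
    """Recursively partition rows; leaves hold the majority class."""
    classes = Counter(r[target] for r in rows)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1:        # stop: all records belong to the same class
        return majority
    var = best_variable(rows, target, variables)
    if var is None:              # stop: no variable can be found for splitting
        return majority
    return {var: {value: grow(group, target, variables)
                  for value, group in partition(rows, var).items()}}

# Invented toy records, for illustration only.
records = [
    {"occupation": "white collar", "debt": "low", "credit": "yes"},
    {"occupation": "white collar", "debt": "high", "credit": "yes"},
    {"occupation": "blue collar", "debt": "low", "credit": "yes"},
    {"occupation": "blue collar", "debt": "high", "credit": "no"},
]
tree = grow(records, "credit", ["occupation", "debt"])
print(tree)
```

The nested dictionary returned by `grow` mirrors the tree: each key is a split variable, each inner key a branch value, and each string a leaf class.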
4. Example decision tree
[Figure: example decision tree over records with the fields credit? (Yes/No), occupation (Blue collar/White collar), debt (Low/Medium/High) and buroscore (Low/High). Node labels: occupation? (branches White collar, Blue collar) and debt? (branches "Low, Medium" and "High").]
5. Overview algorithms (1)
- Usually categorical target, sometimes continuous (e.g. CART)
- Usually binary target, sometimes multiple categories (e.g. CHAID)
- Usually statistical loss function, sometimes information theory (e.g. ID3, C4.5, C5.0)
- Binary or multiple splits; nominal, ordinal or continuous predictors
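To make the difference between splitting criteria concrete, the sketch below scores one candidate binary split in two ways: by Gini impurity (the measure commonly associated with CART) and by entropy-based information gain (the measure used in the ID3 family). The class counts are invented for illustration, and the helper names are not taken from any package.

```python
import math

def gini(counts):
    """Gini impurity of a node, given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def weighted(impurity, children):
    """Size-weighted average impurity over the child nodes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * impurity(c) for c in children)

# Invented counts: [class A, class B] in the parent and in two child nodes.
parent = [50, 50]
children = [[40, 10], [10, 40]]

gini_decrease = gini(parent) - weighted(gini, children)
info_gain = entropy(parent) - weighted(entropy, children)
print(f"Gini decrease:    {gini_decrease:.3f}")
print(f"Information gain: {info_gain:.3f}")
```

Both criteria reward splits whose children are purer than the parent. On a given data set they often, though not always, pick the same variable.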
6. Overview algorithms (2)
7. Exploration and prediction
- Segmented prediction
- Insight in complex structures
- Discover noteworthy interactions
- Sanity check
- Foster adoption
- Spur data-driven business innovation
Every predictive model must be accompanied by insight
8. Automatic vs. manual
- Manual: apply domain expertise (aka model engineering)
- Variable selection
- Business problem
- Implementation specification
- Short vs. long term lift characteristics
- Transient/behavioural vs. stable variables
- Manual tree building drives variable development
9. Pitfalls & best practices (1)
- Pitfall 1
  - Leakers / anachronistic variables: when the model looks too good to be true, it probably is
- Best practice 1
  - Plot the sorted univariate relation between input and output and look for a drop (suspect), e.g. the first two variables on the next slide
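One possible reading of this best practice, sketched in Python: score every input on its own against the target, sort the scores, and inspect the inputs that sit above the largest drop. The score used here (accuracy of a one-variable majority rule) is an assumption chosen for simplicity; the slide does not name a measure. The records and the field `cancel_code` are invented, standing in for a variable that is only recorded after the outcome is known.

```python
from collections import Counter, defaultdict

def one_rule_accuracy(rows, var, target):
    """Accuracy when predicting the majority class within each value of var."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[var]][r[target]] += 1
    hits = sum(max(c.values()) for c in by_value.values())
    return hits / len(rows)

def ranked_inputs(rows, variables, target):
    """(score, variable) pairs, strongest univariate relation first."""
    scores = [(one_rule_accuracy(rows, v, target), v) for v in variables]
    return sorted(scores, reverse=True)

def suspects(ranking):
    """Variables ranked above the largest drop between neighbouring scores."""
    if len(ranking) < 2:
        return []
    drops = [ranking[i][0] - ranking[i + 1][0] for i in range(len(ranking) - 1)]
    cut = drops.index(max(drops))
    return [v for _, v in ranking[:cut + 1]]

# Invented records; cancel_code plays the role of a leaker.
rows = [
    {"cancel_code": "C", "region": "north", "tenure": "short", "churned": "yes"},
    {"cancel_code": "C", "region": "south", "tenure": "short", "churned": "yes"},
    {"cancel_code": "C", "region": "north", "tenure": "long", "churned": "yes"},
    {"cancel_code": "-", "region": "south", "tenure": "long", "churned": "no"},
    {"cancel_code": "-", "region": "north", "tenure": "short", "churned": "no"},
    {"cancel_code": "-", "region": "south", "tenure": "long", "churned": "no"},
]
ranking = ranked_inputs(rows, ["cancel_code", "region", "tenure"], "churned")
for score, var in ranking:
    print(f"{var:12s} {'#' * round(score * 20)} {score:.2f}")  # text bar chart
print("inspect first:", suspects(ranking))
```

The output is a list of candidates to inspect, not a verdict: a largest drop always exists, and high-cardinality fields such as identifiers also score well on this measure.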
10. Example best practice 1
11. Pitfalls & best practices (2)
- Pitfall 2
  - Never assume it is "the" tree; it is always "a" (possible) tree
- Best practice 2
  - Describe associations between the most important input variables and the target, even if the variables do not appear in the (eventual) tree
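One way to describe such associations independently of any tree is a strength-of-association statistic per input. The sketch below computes Cramér's V from a contingency table in plain Python. The choice of statistic is an assumption, not something the slides prescribe, and the records are invented.

```python
import math
from collections import Counter

def cramers_v(rows, var, target):
    """Cramér's V between two categorical fields (0 = none, 1 = perfect)."""
    n = len(rows)
    joint = Counter((r[var], r[target]) for r in rows)
    var_totals = Counter(r[var] for r in rows)
    target_totals = Counter(r[target] for r in rows)
    chi2 = 0.0
    for a in var_totals:
        for b in target_totals:
            expected = var_totals[a] * target_totals[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(var_totals), len(target_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Invented records for illustration.
rows = [
    {"occupation": "white collar", "region": "north", "credit": "yes"},
    {"occupation": "white collar", "region": "south", "credit": "yes"},
    {"occupation": "blue collar", "region": "north", "credit": "no"},
    {"occupation": "blue collar", "region": "south", "credit": "no"},
]
for var in ("occupation", "region"):
    print(var, round(cramers_v(rows, var, "credit"), 2))
```

Reporting such a table for the strongest inputs keeps variables visible even when a correlated variable took their place in the tree.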
12. Pitfalls & best practices (3)
- Pitfall 3
  - Overtraining, overly optimistic prognosis
- Best practice 3
  - Divide the mining set into 3 parts: training, test, and evaluation set (50-40-10 feels about right)
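A 50-40-10 division can be sketched as a shuffled three-way split. The function name, the fixed seed and the stand-in record ids are assumptions made for this example.

```python
import random

def three_way_split(records, fractions=(0.5, 0.4, 0.1), seed=42):
    """Shuffle records and cut them into training, test and evaluation sets."""
    rows = list(records)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_train = round(fractions[0] * len(rows))
    n_test = round(fractions[1] * len(rows))
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    evaluation = rows[n_train + n_test:]  # whatever remains (about 10%)
    return train, test, evaluation

# Stand-in record ids; in practice these would be the rows of the mining set.
train, test, evaluation = three_way_split(range(1000))
print(len(train), len(test), len(evaluation))
```

The evaluation set is only useful as a guard against optimistic estimates if it stays untouched until the model is final.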
13. Pitfalls & best practices (4)
- Pitfall 4
  - Replace missing by constant (mean/mode)
- Best practice 4
  - Identify "rightfully missing": yes/no?
  - If replacing, append a boolean "was previously missing"
  - Avoid adding bias: intelligent imputation
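The "append a boolean" advice can be sketched as follows. The function and field names are invented. The default fill (the mean of the observed values) is exactly the constant replacement the pitfall warns about, so it serves only as a placeholder; a less biased estimate can be passed in through the hypothetical `estimate` parameter.

```python
from statistics import mean

def impute_with_flag(rows, field, estimate=None):
    """Fill missing values of field and record that they were missing.

    estimate: optional function(row) -> value for a smarter imputation.
    Without it, the mean of the observed values is used as a placeholder.
    """
    observed = [r[field] for r in rows if r[field] is not None]
    fallback = mean(observed)
    result = []
    for r in rows:
        missing = r[field] is None
        new = dict(r)                            # leave the input rows untouched
        if missing:
            new[field] = estimate(r) if estimate else fallback
        new[field + "_was_missing"] = missing    # the appended boolean
        result.append(new)
    return result

# Invented records for illustration.
customers = [{"income": 30000}, {"income": None}, {"income": 50000}]
for row in impute_with_flag(customers, "income"):
    print(row)
```

With the flag in place, a tree can split on the fact that a value was missing, which matters when the value is missing for a reason.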
14. Decision trees vs. regression
- Few sound comparative studies
- Most familiar technique works best
- On average regression predicts more accurately
- Alternative considerations
- Spur development of variables
- Innovate business
15. Conclusion
- Decision trees are
  - Flexible
  - Versatile
  - Easy to learn (gentle learning curve)
- Superior insight drives
  - Development of (better) predictive variables
  - Innovation of business
- Manual tree building enhances data learning
16. Resources - history of decision trees
- AID: Morgan & Sonquist (1963)
- THAID: Messenger & Mandell (1972), Morgan & Messenger (1973), Morgan & Sonquist (1973)
- CHAID: Hartigan (1975)
- CHAID: Kass (1980)
- CART: Breiman, Friedman, Olshen & Stone (1984)
- ID3: Quinlan (1986)
- FACT: Loh & Vanichsetakul (1988)
- Exhaustive CHAID: Biggs, de Ville & Suen (1991)
- MARS: Friedman (1991)
- C4.5: Quinlan (1992)
- CHAID: Magidson (1993, 1994)
- FIRM: Hawkins (1995)
- QUEST: Loh & Shih (1997)
- C5.0: Quinlan (1998)
- CRUISE: Kim & Loh (2001)
- GUIDE: Loh (2002)
17. Resources - software
- www.angoss.com
- www.dtreg.com
- www.lumenaut.com
- www.microsoft.com
- www.portraitsoftware.com
- www.rulequest.com
- www.salford-systems.com
- www.sas.com
- www.spss.com
- www.vanguardsw.com
- www.xlstat.com
- www.xpertrule.com
18. Resources - references
- Breiman, Friedman, Olshen & Stone (1984) Classification and Regression Trees. ISBN 0412048418
- Berry & Linoff (1999) Mastering Data Mining. ISBN 0471331236
- Pyle (1999) Preparing Data for Data Mining. ISBN 1558605290
- Pyle (2003) Business Modeling and Data Mining. ISBN 155860653X