Title: Classification by search partition analysis: an alternative to tree methods in medical problems.
1 Classification by search partition analysis: an
alternative to tree methods in medical problems.
- Roger Marshall
- School of Population Health
- University of Auckland
- New Zealand
- rj.marshall_at_auckland.ac.nz
2 Why classification? Uses:
- to develop diagnostic/prognostic decision and classification rules
- to discover homogeneous subgroups, e.g. groups at risk in epidemiology, or groups who respond to treatment
3 Methods
- Regression methods (including neural networks): model based
- Trees: no model (unless the hierarchical tree is considered as such)
- Empirical density methods: smoothers of parameter space
- Support vector machines: find margins of maximum separation
- Boolean classifiers (including SPAN, rough sets): based on logical structures
4 Attractions of trees (in medicine)
- regression models perceived as unrealistic
- regression models based on arithmetic scores
- trees demarcate individuals with clusters of characteristics
- closer affinity to clinical reasoning
5 Feinstein (circa 1971): "Clinicians don't think in terms of weighted averages of clinical variables; they think about demarcated subgroups of people who possess combinations of clinical attributes." Suggested trees for prognostic stratification.
6 [Tree diagram with node labels A∩O, A∩O∩S, A∩O∩H, A∩O∩H∩G]
7 High risk if (H∩C) ∪ (¬H∩P∩U) ∪ (¬H∩P∩BV).
8 Trees: history
- 1960s-70s: AID (Sonquist and Morgan), CHAID
- 1980s: CART, C4.5, machine learning
- 1990s--: new ideas: bagging, boosting, Bayes, random forests
- Software: CART, S, KnowledgeSeeker, C4.5 (C5?), SPSS AnswerTree, RECAMP, QUEST, CAL5, SAS macros, SAS Enterprise Miner, R rpart
9 Measure of best split: goodness of split
- e.g. statistical-test-based measures: chi-square (categorical y), t- and F-statistics (continuous y), log-rank for survival trees
- e.g. decrease in impurity by splitting (CART)
- e.g. likelihood (deviance) statistics (Ripley, S)
10 CART ideas on impurity
Binary outcome classes D1 and D2. Define an impurity measure i(t) of node t, with p = proportion of D1s at node t.
- e.g. Gini diversity impurity: i(t) = p(1 - p)
- e.g. entropy measure: i(t) = -p log p - (1 - p) log(1 - p)
Change in impurity from splitting node t into left (L) and right (R) nodes:
Δi = i(t) - pL i(tL) - pR i(tR)
11 e.g. entropy deems a node with p = 0.25 more impure than Gini does (relative to the maximum impurity at p = 0.5)
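A minimal Python sketch of these impurity measures and the split gain, illustrating the comparison at p = 0.25 (natural logs assumed; the base does not affect the comparison):

```python
import math

def gini(p):
    # Gini diversity impurity: i(t) = p(1 - p)
    return p * (1.0 - p)

def entropy(p):
    # Entropy impurity: i(t) = -p log p - (1 - p) log(1 - p)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def split_gain(impurity, p, p_left, p_right, w_left):
    # Decrease in impurity by splitting node t:
    # i(t) - pL * i(tL) - pR * i(tR), with w_left the fraction
    # of cases sent to the left daughter node
    return (impurity(p)
            - w_left * impurity(p_left)
            - (1.0 - w_left) * impurity(p_right))

# Relative to the maximum impurity at p = 0.5:
print(gini(0.25) / gini(0.5))        # 0.75
print(entropy(0.25) / entropy(0.5))  # about 0.81: entropy deems p = 0.25 more impure
```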
12 Right-sized trees
1. Stopping rules (subgroup sample size, P-values): pre-pruning
2. Grow a big tree and prune: post-pruning, e.g. MCCP (minimum cost-complexity pruning), with complexity c = number of terminal nodes; use cross-validation to estimate prediction error
Many other methods.
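A sketch of the cost-complexity selection step; the list of candidate subtrees and the penalty values below are hypothetical illustrations, not numbers from the talk:

```python
def cost_complexity(error, n_leaves, alpha):
    # Penalised cost R_alpha(T) = R(T) + alpha * |T|, where |T| is the
    # number of terminal nodes (the complexity c above)
    return error + alpha * n_leaves

# Hypothetical pruning sequence: (cross-validated error, terminal nodes)
subtrees = [(0.30, 1), (0.22, 3), (0.18, 6), (0.17, 12), (0.165, 25)]

def best_subtree(subtrees, alpha):
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], alpha))

print(best_subtree(subtrees, 0.01))  # a mid-sized tree wins
print(best_subtree(subtrees, 0.10))  # heavy penalty: the root alone wins
```

The choice of alpha is itself usually made by cross-validation.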
13 New methods/extensions
- Multivariate trees (multivariate splits)
- Multivariate y
- Survival trees (Segal, LeBlanc)
- Bagging/boosting (Breiman)
- Bayes trees (Buntine, Chipman)
- Forests of trees (Breiman)
14 Some troubles with trees
- Net (main) effects not evaluated
- Misleading decision rules
- Rules hard to interpret
- Simple rules probably missed
- The tree itself as a model
15 [Tree diagram with node labels A∩O, A∩O∩S, A∩O∩H, A∩O∩H∩G]
16 Misleading classification rules
High risk of diabetes: (A∩O) ∪ (¬A∩O∩S) ∪ (A∩¬O∩H) ∪ (A∩¬O∩¬H∩G), which is the same as (A∩O) ∪ (O∩S) ∪ (A∩H) ∪ (A∩G), i.e. the complemented terms are redundant: e.g. the age condition is not needed in conjunction with O and S.
17 Simple rules may require complex trees (the replication problem)
- e.g. (A∩B) ∪ (C∩D), with A = age under 18, B = black, C = cigarette smoker, D = drinks alcohol: needs a tree with 7 nodes
- e.g. (A∩B∩C) ∪ (D∩E∩F) ∪ (G∩H∩K) needs a tree with 80 terminal nodes!
18 [Tree diagram for (A∩B) ∪ (C∩D), with the C-D subtree replicated on the branches where A∩B fails]
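The replication problem can be checked mechanically: written as nested ifs, the tree must duplicate the (C and D) test on two branches, while the flat regular rule states it once. A small sketch:

```python
from itertools import product

def flat_rule(a, b, c, d):
    # The simple regular rule (A and B) or (C and D)
    return (a and b) or (c and d)

def tree_rule(a, b, c, d):
    # The same rule as a hierarchical tree: the (C and D) subtree
    # must be replicated on both the A-but-not-B and not-A branches
    if a:
        if b:
            return True
        return c and d   # replicated subtree
    return c and d       # replicated subtree

# the two forms agree on all 16 truth assignments
assert all(flat_rule(*x) == tree_rule(*x)
           for x in product([False, True], repeat=4))
```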
19 Positive attributes
Usually (in medicine at least) an attribute can be considered in advance to be either positively or negatively associated with y:
- e.g. obese, sedentary, hypertensive are positive for diabetes
- e.g. smoking, old age, high cholesterol are positive for ischaemia
- e.g. presence of an adverse gene
20 Regular classification rules
Combinations of positive attributes only to define high risk. Tree rules are not usually regular (though occasionally they may reduce to a regular rule, as in the diabetes example).
21 High risk if (H∩C) ∪ (¬H∩P∩U) ∪ (¬H∩P∩BV): e.g. in (¬H∩P∩U), ¬H = not regularly hospitalized, so the rule is not regular.
22 The tree model
Is the hierarchical tree model sensible? Probably not. Even if it is, does the process of recursive subdivision estimate the best tree? Maybe.
23 These considerations suggest: why not consider non-hierarchical procedures, and why not focus on regular combinations directly? SPAN attempts to do both.
24 SPAN (Search Partition Analysis)
Generates regular decision rules of the form A = K1 or/and K2 or/and ... or/and Kq, where Ki is a conjunction (or disjunction) of pi attributes. Forms binary partitions of the predictor space into A and its complement. Non-hierarchical.
25 Example SPAN rule for detecting malignant cells (bcw data) from cell characteristics
26 SPAN
Carries out a search to find the best possible combinations of attributes. Unless the search is somehow limited, it becomes impossibly large: there are 2^(2^m - 1) - 1 ways to form a binary partition for m attributes, e.g. 2147 million for m = 5.
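The count is easy to check numerically:

```python
def n_binary_partitions(m):
    # Number of ways to form a binary partition of the predictor space
    # spanned by m binary attributes: 2^(2^m - 1) - 1
    return 2 ** (2 ** m - 1) - 1

print(n_binary_partitions(5))  # 2147483647, i.e. about 2147 million
```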
27 How SPAN limits the extent of the search
- By restricting to a set of m attributes Tm = {X1, ..., Xm}; typically m < 15. These may be the m best attributes.
- By not allowing mixed combinations of attributes of those in Tm.
- By restricting the complexity of the Boolean expressions, i.e. the pi and q parameters.
28 Attribute set Tm
If the set of m attributes Tm = {X1, ..., Xm} consists of attributes labelled positive, SPAN will generate only regular partitions. It is natural to consider the best-ranked attributes.
29 Ranked plot of attributes: GI cancer and tumour markers
30 Extent of search for different parameters: based on combinatoric formulae of the "lock and key" algorithm for generating partitions.
31 Iterated search procedure
1. Set j = 1, T = Tm.
2. Search over T; the best partition Aj defines a new attribute aj.
3. Set T = Tm + {aj}, j = j + 1; repeat.
Continue until Aj = Aj+1. Produces a sequence of new attributes with increasingly better discrimination (no proof of this assertion!).
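A toy sketch of the iterated procedure, with a deliberately small candidate space (single attributes and two-attribute conjunctions/disjunctions) standing in for SPAN's real search; the data, scoring and stopping test are illustrative only:

```python
from itertools import combinations

def score(part, rows, y):
    # fraction of cases the binary partition classifies correctly
    return sum(int(bool(part(r))) == yi for r, yi in zip(rows, y)) / len(y)

def search(T, rows, y):
    # toy stand-in for SPAN's search: best single attribute, or
    # conjunction/disjunction of a pair of attributes drawn from T
    cands = [(str(i), lambda r, i=i: r[i]) for i in T]
    for i, j in combinations(T, 2):
        cands.append((f"{i}&{j}", lambda r, i=i, j=j: r[i] and r[j]))
        cands.append((f"{i}|{j}", lambda r, i=i, j=j: r[i] or r[j]))
    return max(cands, key=lambda c: score(c[1], rows, y))

def iterated_search(attrs, rows, y, max_iter=10):
    # the best partition A_j is coded as a new attribute a_j, T is reset
    # to Tm plus a_j, and the search repeats until no improvement
    T, best = list(attrs), ("", None, -1.0)
    for j in range(max_iter):
        name, part = search(T, rows, y)
        s = score(part, rows, y)
        if s <= best[2]:
            break
        best = (name, part, s)
        key = f"a{j}"
        for r in rows:
            r[key] = part(r)      # a_j becomes a derived attribute
        T = list(attrs) + [key]
    return best[0], best[2]

# Illustrative sample in which y = (A and B) or (C and D), attributes 0..3
data = ([(1, 1, 0, 0)] * 4 + [(0, 0, 1, 1)] + [(1, 0, 0, 0)] * 2 +
        [(0, 1, 0, 0)] * 2 + [(0, 0, 0, 0)] * 2 + [(1, 0, 1, 0)])
y = [1] * 5 + [0] * 7
rows = [dict(enumerate(t)) for t in data]
name, s = iterated_search([0, 1, 2, 3], rows, y)
# first pass finds 0&1 (A and B); second pass combines the derived
# attribute with attribute 3, fitting this small sample perfectly
```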
32 SPAN rank plot: a_1-a_5 are partition attributes on 5 iterations (hea data)
33 Complexity penalising
To avoid overfitting, penalise complex Boolean expressions. How to measure complexity? By the number of subgroups (minus 1): e.g. 3 subgroups in A and 2 in its complement gives complexity c = 3 + 2 - 1 = 4. Penalise the measure G (e.g. entropy) by G - bc.
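As a worked example of the penalty (the weight b = 0.01 is an arbitrary choice for illustration, not a value from the talk):

```python
def complexity(n_in_A, n_in_comp):
    # complexity c = total number of subgroups minus 1
    return n_in_A + n_in_comp - 1

def penalised(G, c, b=0.01):
    # penalise the goodness measure G (e.g. entropy gain) by G - b*c;
    # b = 0.01 is an arbitrary illustrative weight
    return G - b * c

c = complexity(3, 2)  # the slide's example: 3 + 2 - 1
print(c)              # 4
```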
34Visualising subgroups
35 Extension to >2 ordinal states
e.g. categories 0, 1, 2. Can find a binary partition A2 of {0,1} v {2}, and also A1 of {0} v {1,2}. Need to ensure A2 is a subset of A1: constrain the search.
36 e.g. diabetes: 0 = none, 1 = impaired glucose tolerance, 2 = diabetes.
A2 = (F∩U) ∪ (F∩E∩T)
A1 = (F∩U) ∪ (F∩T) ∪ (F∩H)
F, T and U denote positive fructosamine, triglyceride and urinary albumin tests; E is ethnic Polynesian. It can be shown that A2 is a subset of A1.
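The claimed nesting can be verified by brute force over all attribute combinations (H is carried through as on the slide, though it is not defined there):

```python
from itertools import product

# F = positive fructosamine, T = triglyceride, U = urinary albumin,
# E = ethnic Polynesian; H as on the slide (not defined there)
def A2(F, U, T, H, E):
    # partition of {0, 1} v {2}: (F ∩ U) ∪ (F ∩ E ∩ T)
    return (F and U) or (F and E and T)

def A1(F, U, T, H, E):
    # partition of {0} v {1, 2}: (F ∩ U) ∪ (F ∩ T) ∪ (F ∩ H)
    return (F and U) or (F and T) or (F and H)

# A2 ⊆ A1, checked over all 32 attribute combinations
assert all(not A2(*x) or A1(*x) for x in product([False, True], repeat=5))
```

The subset relation also follows term by term: F∩U appears in both, and F∩E∩T implies F∩T.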
37 Comparisons of SPAN and other methods
- Lim, Loh and Shih (Machine Learning) compared 33 methods on 32 data sets.
- Methods: 22 tree, 9 statistical, 2 neural networks.
- 16 data sets (plus 16 with added noise).
- Seems to provide benchmarks for other methods.
- I tried SPAN on the 24 2-state and 3-state classification data sets.
38 Data  classes  SPAN error  LLS range (33 methods)
bcw 2 0.035 0.03-0.09
bcw 0.035 0.03-0.08
bld 2 0.365 0.28-0.43
bld 0.373 0.29-0.44
bos 3 0.236 0.221-0.314
bos 0.236 0.225-0.422
cmc 3 0.449 0.43-0.60
cmc 0.444 0.43-0.58
dna 3 0.075 0.05-0.38
dna 0.075 0.04-0.38
hea 2 0.170 0.14-0.34
hea 0.170 0.15-0.31
39 Data  classes  SPAN error  LLS range (33 methods)
pid 2 0.251 0.22-0.31
pid 0.252 0.22-0.32
smo 3 0.305 (0.44) 0.30-0.45
smo 0.305(0.44) 0.31-0.45
tae 3 0.510 0.325-0.693
tae 0.701 0.445-0.696
thy 3 0.0134 0.005-0.89
thy 0.0134 0.01-0.88
vot 2 0.044 0.04-0.06
vot 0.044 0.04-0.07
wav 3 0.266 0.151-0.477
wav 0.266 0.160-0.446
40 Data POL SPAN QI0 LOG LDA IC0 RBF ST0 (ranks among the 33 methods)
bcw 19 7 2.5 5.5 12 23 5.5 30
bcw 11 8 1.5 8 9 27 4 20
bld 3 26 7 9 18.5 20 24 10
bld 1 25 26 17 15 7 18 6
bos 2.5 6 28 11.5 14.5 18.5 2.5 22.5
bos 3 2 22.5 13 20 11 28 17.5
cmc 1 5 14 19 22 7.5 10 9
cmc 1 4 17 18.5 22.5 5 15.5 9
dna 2 21 18 16 12 10 31 10
dna 3 20 19 17 13.5 6 31 6
pid 16.5 31 5 11 1.5 16.5 11 16.5
pid 1 30 2.5 7 4 13.5 15 30
hea 9 7.5 4.5 6 1 18 11.5 30
hea 16 6 6 9 2 16 8 23
41 Data POL SPAN QI0 LOG LDA IC0 RBF ST0 (ranks among the 33 methods)
smo 9.5 9.5(32) 9.5 9.5 9.5 26 18.5 9.5
smo 7.5 9.5(32) 7.5 16.5 21 7.5 23.3 7.5
tae 20 19 10 11 6 3 13 30.5
tae 16 11 9 4 2 20 10 32
thy 14.5 14.5 17 26 28 8 23 5.5
thy 15 15 17.5 25 27 10 29 7.5
vot 25.5 9 1 21 15 17.5 25.5 21
vot 16 5 21 5 16 26 21 19
wav 5 21 8.5 2 10.5 29 1 26
wav 6 21 9 3.5 7.5 27.5 31 26
Mean Rank 9.3 11.1 (12.9) 11.8 12.1 12.9 15.6 17.1 17.7
Mean error 0.210 0.200 (0.230) 0.219 0.215 0.216 0.223 0.249 0.247
42 Limitations/criticisms
- Multi-class problems difficult
- Data dredging
- Loss of information by cutpoints of continuous variables
- Complexity penalising somewhat ad hoc
- Computationally intensive, unless the search is sensibly restricted
- Not black-box: requires user judgements
- Needs the (temperamental!) SPAN software; no R algorithms
43 Conclusion
Despite their popularity, trees have weaknesses that stem from their hierarchical structure. SPAN offers a non-hierarchical alternative that generally performs as well as or better than trees, and gives decision rules that are generally easy to understand.