Title: Learning Hierarchical Multilabel Classification Trees for Functional Genomics
1. Learning Hierarchical Multilabel Classification Trees for Functional Genomics
- Hendrik Blockeel (K.U.Leuven)
- In collaboration with
- Saso Dzeroski (Jozef Stefan Institute, Ljubljana)
- Amanda Clare (U. of Wales, Aberystwyth)
- Jan Struyf (U. of Wisconsin, Madison)
- Leander Schietgat (K.U.Leuven)
2. What's surprising about our results: the household equivalent
- What you would NOT expect is
- that the combo is smaller than some individual machines
- that the coffee is even better
- Yet that is what we found when learning combined models for different tasks
3. Overview
- Hierarchical multilabel classification (HMC)
- Decision trees for HMC
- Motivated by problems in functional genomics
- Some experimental results
- Conclusions
4. Classification settings
- Normally, in classification, we assign one class label ci from a set C = {c1, ..., ck} to each example
- In multilabel classification, we have to assign a subset S ⊆ C to each example
- i.e., one example can belong to multiple classes
- Some applications
- Text classification: assign subjects (newsgroups) to texts
- Functional genomics: assign functions to genes
- In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤)
- The partial order ≤ expresses "is a superclass of"
5. Hierarchical multilabel classification
- Hierarchy constraint
- ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
- Elements of a class must be elements of its superclasses
- Should hold for the given data as well as for predictions
- Three possible ways of learning HMC models
- Learn k binary classifiers, one for each class
- If learned independently, it is difficult to guarantee the hierarchy constraint
- They can also be learned hierarchically
- Learn one classifier that predicts a vector of classes
- E.g., a neural net can have multiple outputs
- We will show it can also be done for decision trees
6. 1) Learn k classifiers independently
- C = {c1, c2, ..., ck}
- Let S(x) = the set of class labels of x
- Learning
- For each i = 1..k, learn fi such that fi(x) = 1 iff ci ∈ S(x)
- Prediction
- S(x) := ∅
- For each i = 1..k, include ci in S(x) if fi(x) = 1
- Predict S(x)
- Problem: the hierarchy constraint may not hold for S
- A classifier for a subclass might predict 1 when the classifier for its superclass predicts 0
- This can be trivially fixed at prediction time, but it shows that the fi have not been learned optimally (a minimal sketch of this approach follows below)
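A minimal sketch of this first approach, assuming scikit-learn-style binary classifiers, a 0/1 label matrix Y with one column per class, and a `parent` map from each class index to its immediate superclass; the names are illustrative, not the Clus implementation. The repair in `predict_set` is the "trivial fix" mentioned above.

```python
from sklearn.tree import DecisionTreeClassifier

def learn_independent(X, Y):
    """Approach 1: one binary classifier per class, learned independently.
    Y is an (n_examples x k) 0/1 label matrix, one column per class."""
    return [DecisionTreeClassifier().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def predict_set(models, x, parent):
    """Predict a label set for one example x, then repair hierarchy violations.
    parent[i] is the index of ci's immediate superclass (None for top classes)."""
    raw = {i for i, f in enumerate(models) if f.predict([x])[0] == 1}

    def ancestors_predicted(i):
        p = parent[i]
        return p is None or (p in raw and ancestors_predicted(p))

    # The "trivial fix": keep a class only if all of its superclasses are kept too.
    return {i for i in raw if ancestors_predicted(i)}
```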
7. 2) Learn k classifiers hierarchically
- D = the data set; Di = {x ∈ D | ci ∈ S(x)}
- Di = the data set restricted to class-ci examples
- parent(ci) = the immediate superclass of ci
- Learning
- For each i = 1..k, learn fi from D_parent(ci) such that fi(x) = 1 iff ci ∈ S(x)
- Prediction (typically top-down in the hierarchy)
- S(x) := ∅
- For each i = 1..k, include ci in S(x) if parent(ci) ∈ S(x) and fi(x) = 1
- Predict S(x)
- Advantages
- The hierarchy constraint is solved
- More balanced distributions to learn from (a sketch follows below the figure)
[Figure: example class hierarchy over classes c1..c7]
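A sketch of the hierarchical variant under the same illustrative assumptions: each fi is trained only on the examples that belong to parent(ci), and prediction proceeds top-down, so the hierarchy constraint holds by construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_hierarchical(X, Y, parent):
    """Approach 2: train the classifier for ci only on D_parent(ci)."""
    models = []
    for i in range(Y.shape[1]):
        in_parent = np.ones(len(X), dtype=bool) if parent[i] is None else (Y[:, parent[i]] == 1)
        models.append(DecisionTreeClassifier().fit(X[in_parent], Y[in_parent, i]))
    return models

def predict_top_down(models, x, parent, order):
    """Top-down prediction; `order` lists class indices so every parent comes first."""
    S = set()
    for i in order:
        if (parent[i] is None or parent[i] in S) and models[i].predict([x])[0] == 1:
            S.add(i)
    return S
```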
8. 3) Learn one classifier
- Learning
- Learn f such that f(x) = S(x)
- Prediction
- Predict f(x)
- Need to have a learner that can learn models with >1 output
- E.g., neural nets can output a k-dimensional vector (c1, ..., ck)
- In this work, we extend decision trees (normally 1-D output) to HMC trees
- Trees = interpretable theories
- 1 tree is more interpretable than k trees
- Risks
- Perhaps that one tree will be much larger
- Perhaps the tree will be much less accurate
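As a rough stand-in for this third approach, scikit-learn's decision trees accept a two-dimensional label matrix and grow a single multi-output tree; this is not the Clus-HMC criterion (no class weights, no hierarchy-aware thresholds), just a sketch of "one tree, a vector of outputs" on hypothetical toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))                      # toy gene descriptions
Y = (rng.random((200, 7)) > 0.7).astype(int)   # toy 0/1 membership in 7 classes

tree = DecisionTreeClassifier(max_depth=4).fit(X, Y)  # one tree, 7 outputs
print(tree.predict(X[:1]))                     # a 1 x 7 vector of 0/1 predictions
```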
9. Data mining with decision trees
- Data mining = learning a general model from specific observations
- Decision trees are a popular format for models because
- They are fast to build and fast to use
- They make accurate predictions
- They are easy to interpret
Name  Age  Salary  Children  Loan?
Ann   25   29920   1         no
Bob   32   40000   2         yes
Carl  19   0       0         no
Dirk  44   45200   3         yes
...
10. Functional genomics
- Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene what functions it performs
- A gene can have multiple functions (out of 250 possible functions, in our case)
- Could be done with decision trees, with all the advantages that brings. But...
- Decision trees predict only one class, not a set of classes
- Should we learn a separate tree for each function?
- 250 functions = 250 trees: not so fast and interpretable anymore!
[Table: each gene G1, G2, G3, ... is described by attributes A1, A2, ..., An and marked with an 'x' in the function columns 1..250 it belongs to]
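The table above boils down to a 0/1 matrix with one row per gene and one column per function; a small sketch of building it from per-gene annotation sets (gene names and function numbers here are placeholders, not the real dataset).

```python
import numpy as np

# Hypothetical annotations: gene name -> set of function classes it belongs to.
annotations = {"G1": {1, 4}, "G2": {2, 4, 250}, "G3": {1, 3}}
functions = list(range(1, 251))                # the 250 function classes

Y = np.array([[int(f in annotations[g]) for f in functions]
              for g in annotations])           # shape: (n_genes, 250)
```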
11. Multiple prediction trees
- A multiple prediction tree (MPT) makes multiple predictions at once
- Basic idea (Blockeel, De Raedt, Ramon, 1998)
- A decision tree learner prefers tests that yield much information on the class attribute (measured using information gain (C4.5) or variance reduction (CART))
- An MPT learner prefers tests that reduce the variance of all target variables together
- Variance = mean squared distance of the vectors to the mean vector, in k-D space (see the sketch below)
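A minimal sketch of this split criterion: the variance of a set of label vectors is their mean squared distance to the mean vector, and a candidate test is scored by how much it reduces that variance (illustrative code, not the Clus implementation).

```python
import numpy as np

def variance(Y):
    """Variance of a set of class vectors: mean squared distance to the mean vector."""
    centre = Y.mean(axis=0)
    return float(np.mean(np.sum((Y - centre) ** 2, axis=1)))

def variance_reduction(Y, test):
    """Heuristic value of a boolean test splitting the examples into two subsets.
    Assumes both subsets are non-empty."""
    n, n_yes = len(Y), int(test.sum())
    return variance(Y) - (n_yes / n) * variance(Y[test]) \
                       - ((n - n_yes) / n) * variance(Y[~test])
```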
12. HMC tree learning
- A special case of MPT learning
- Main characteristics
- Errors higher up in the hierarchy are more important
- Use a weighted Euclidean distance (higher weight for higher classes)
- Need to ensure the hierarchy constraint
- Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5)
- To ensure compliance with the hierarchy constraint: ci ≤ cj ⇒ ti ≤ tj
- Automatically fulfilled if all ti are equal (sketches below)
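Two minimal sketches of these ingredients, assuming (as in the example on the next slide) that top-level classes get weight 1 and deeper classes get exponentially smaller weights, e.g. w0^depth with w0 = 0.5; the names are illustrative, not Clus's.

```python
import numpy as np

def class_weights(depths, w0=0.5):
    """Weight of each class: w0 ** depth, so higher classes weigh more."""
    return np.array([w0 ** d for d in depths])

def weighted_sq_dist(v1, v2, w):
    """Weighted Euclidean distance (squared): differences are scaled by w."""
    return float(np.sum((w * (v1 - v2)) ** 2))

def leaf_prediction(proportions, thresholds):
    """A leaf predicts ci iff the proportion of ci examples reaches ti.
    If "ci is a superclass of cj" implies ti <= tj, the predicted set
    automatically satisfies the hierarchy constraint."""
    return {i for i, p in enumerate(proportions) if p >= thresholds[i]}
```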
13. Example
[Figure: class hierarchy over c1..c7; classes c1, c2, c3 have weight 1, classes c4..c7 have weight 0.5; the class vectors of examples x1, x2, x3 are shown]
- x1 = {c1, c3, c5} = (1,0,1,0,1,0,0)
- x2 = {c1, c3, c7} = (1,0,1,0,0,0,1)
- x3 = {c1, c2, c5} = (1,1,0,0,1,0,0)
- d²(x1, x2) = 0.25 + 0.25 = 0.5
- d²(x1, x3) = 1 + 1 = 2
- x1 is more similar to x2 than to x3; the DT learner tries to create leaves with similar examples
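Plugging the vectors above into the weighted distance sketched for the previous slide reproduces the numbers on this slide (weights 1 for c1, c2, c3 and 0.5 for c4..c7):

```python
import numpy as np

w  = np.array([1, 1, 1, 0.5, 0.5, 0.5, 0.5])  # class weights for c1..c7
x1 = np.array([1, 0, 1, 0, 1, 0, 0])           # {c1, c3, c5}
x2 = np.array([1, 0, 1, 0, 0, 0, 1])           # {c1, c3, c7}
x3 = np.array([1, 1, 0, 0, 1, 0, 0])           # {c1, c2, c5}

d2 = lambda a, b: float(np.sum((w * (a - b)) ** 2))
print(d2(x1, x2))   # 0.5 -> x1 is close to x2
print(d2(x1, x3))   # 2.0 -> x1 is far from x3
```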
14. The Clus system
- Created by Jan Struyf
- Propositional DT learner, implemented in Java
- Implements ideas from
- C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984)
- predictive clustering trees (Blockeel et al., 1998)
- includes multiple prediction trees and hierarchical multilabel classification trees
- Reads data in ARFF format (Weka)
- We used two versions for our experiments
- Clus-HMC: the HMC version as explained
- Clus-SC: single-classification version, +/- CART
15. The datasets
- 12 datasets from functional genomics
- Each with a different description of the genes
- Sequence statistics (1)
- Phenotype (2)
- Predicted secondary structure (3)
- Homology (4)
- Micro-array data (5-12)
- Each with the same class hierarchy
- 250 classes distributed over 4 levels
- Number of examples: 1592 to 3932
- Number of attributes: 52 to 47034
16. Our expectations
- How does HMC tree learning compare to the straightforward approach of learning 250 trees?
- We expect:
- Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs
- Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs
- Larger trees: the HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees
- Less accurate: the HMCT is less accurate than the set of 250 SPTs (but hopefully not much less accurate)
- So how much faster / simpler / less accurate are our HMC trees?
17. The (surprising) results
- The HMCT is on average less complex than one single SPT
- The HMCT has 24 nodes, SPTs on average 33 nodes
- but you'd need 250 of the latter to do the same job
- The HMCT is on average slightly more accurate than a single SPT
- (see graphs)
- Surprising, as each SPT is tuned for one specific prediction task
- Expectations w.r.t. efficiency are confirmed
- Learning: min. speedup factor 4.5x, max 65x, average 37x
- Prediction: >250 times faster (since the tree is not larger)
- Faster to learn, much faster to apply
18. Precision-recall curves
Precision = proportion of predictions that is correct = P(X | predicted X)
Recall = proportion of class memberships correctly identified = P(predicted X | X)
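A minimal sketch of these two quantities for a single class, given 0/1 vectors of actual and predicted memberships (illustrative code; a PR curve is obtained by sweeping the leaf threshold ti and recomputing both).

```python
def precision_recall(y_true, y_pred):
    """Precision = P(member | predicted), recall = P(predicted | member)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    n_pred = sum(y_pred)
    n_true = sum(y_true)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall
```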
19. An example rule
- High interpretability: IF-THEN rules extracted from the HMCT are quite simple

IF Nitrogen_Depletion_8_h < -2.74
AND Nitrogen_Depletion_2_h > -1.94
AND 1point5_mM_diamide_5_min > -0.03
AND 1M_sorbitol___45_min_ > -0.36
AND 37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1

For class 40/3: recall = 0.15, precision = 0.97
(the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
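For illustration, the rule reads directly as a predicate over a gene's expression features; here a gene is assumed to be a dict from attribute names (as on the slide) to values, which is an illustrative encoding rather than Clus's output format.

```python
def predicts_40_3(gene):
    """The extracted rule: if all conditions hold, predict classes 40, 40/3, 5, 5/1."""
    return (gene["Nitrogen_Depletion_8_h"]     < -2.74 and
            gene["Nitrogen_Depletion_2_h"]     > -1.94 and
            gene["1point5_mM_diamide_5_min"]   > -0.03 and
            gene["1M_sorbitol___45_min_"]      > -0.36 and
            gene["37C_to_25C_shock___60_min"]  >  1.28)
```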
20. The effect of merging
[Figure: 250 separate trees, each optimized for a single class c1, c2, ..., c250, merged into one tree optimized for c1, c2, ..., c250 together]
- Smaller than the average individual tree
- More accurate than the average individual tree
21. Any explanation for these results?
- Almost too good to be true: how is it possible?
- Answer: the classes are not independent
- Different trees for different classes actually share structure
- This explains some of the complexity reduction achieved by the HMCT
- One class carries information on other classes
- This increases the signal-to-noise ratio
- Provides better guidance when learning the tree (explaining the good accuracy)
- Avoids overfitting (explaining the further reduction of tree size)
- This was confirmed empirically
22. Overfitting
- To check our overfitting hypothesis
- We compared the area under the PR curve on the training set (Atr) and on the test set (Ate)
- For SPC: Atr - Ate = 0.219
- For HMCT: Atr - Ate = 0.024
- (to verify, we also tried Weka's M5: 0.387)
- So the HMCT clearly overfits much less
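A sketch of that check for one class, assuming predicted class probabilities on the training and test sets and using scikit-learn's precision-recall utilities (the talk averages the areas over all classes; this only computes the train-test gap for a single class).

```python
from sklearn.metrics import auc, precision_recall_curve

def area_under_pr(y_true, y_score):
    """Area under the precision-recall curve for one class."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

def pr_gap(y_train, score_train, y_test, score_test):
    """Overfitting indicator: Atr - Ate for one class."""
    return area_under_pr(y_train, score_train) - area_under_pr(y_test, score_test)
```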
23. Conclusions
- Surprising discovery: a single tree can be found that
- predicts 250 different functions with, on average, equal or better accuracy than special-purpose trees for each function
- is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)
- is (much) more efficient to learn and to apply
- The reason for this is to be found in the dependencies between the gene functions
- They provide better guidance when learning the tree
- They help to avoid overfitting