Title: Learning Hierarchical Multilabel Classification Trees for Functional Genomics
1. Learning Hierarchical Multilabel Classification Trees for Functional Genomics
- Hendrik Blockeel (K.U.Leuven)
- In collaboration with
- Saso Dzeroski (Jozef Stefan Institute, Ljubljana)
- Amanda Clare (U. of Wales, Aberystwyth)
- Jan Struyf (U. of Wisconsin, Madison)
- Leander Schietgat (K.U.Leuven)
2. What's surprising about our results: the household equivalent
- What you would NOT expect is
- that the combo is smaller than some individual machines
- that the coffee is even better
- Yet that is what we found when learning combined models for different tasks
3. Overview
- Hierarchical multilabel classification (HMC)
- Decision trees for HMC
- Motivated by problems in functional genomics
- Some experimental results
- Conclusions
4. Classification settings
- Normally, in classification, we assign one class label ci from a set C = {c1, ..., ck} to each example
- In multilabel classification, we have to assign a subset S ⊆ C to each example
- i.e., one example can belong to multiple classes
- Some applications
- Text classification: assign subjects (newsgroups) to texts
- Functional genomics: assign functions to genes
- In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤)
- The partial order ≤ expresses "is a superclass of"
5. Hierarchical multilabel classification
- Hierarchy constraint
- ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
- Elements of a class must be elements of its superclasses
- Should hold for the given data as well as for predictions
- Three possible ways of learning HMC models
- Learn k binary classifiers, one for each class
- If learned independently, it is difficult to guarantee the hierarchy constraint
- They can also be learned hierarchically
- Learn one classifier that predicts a vector of classes
- E.g., a neural net can have multiple outputs
- We will show it can also be done for decision trees
6. 1) Learn k classifiers independently
- C = {c1, c2, ..., ck}
- Let S(x) = the set of class labels of x
- Learning
- For each i = 1..k, learn fi such that fi(x) = 1 iff ci ∈ S(x)
- Prediction
- S(x) := ∅
- For each i = 1..k, include ci in S(x) if fi(x) = 1
- Predict S(x)
- Problem: the hierarchy constraint may not hold for S
- A classifier for a subclass might predict 1 when the classifier for its superclass predicts 0
- This can be trivially fixed at prediction time, but it shows that the fi have not been learned optimally (a minimal sketch of this approach follows below)
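A minimal sketch of this first approach, assuming scikit-learn-style binary classifiers, a 0/1 label matrix Y with one column per class, and a `parent` map from each class index to its immediate superclass; the names are illustrative, not the Clus implementation. The repair in `predict_set` is the "trivial fix" mentioned above.

```python
from sklearn.tree import DecisionTreeClassifier

def learn_independent(X, Y):
    """Approach 1: one binary classifier per class, learned independently.
    Y is an (n_examples x k) 0/1 label matrix, one column per class."""
    return [DecisionTreeClassifier().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def predict_set(models, x, parent):
    """Predict a label set for one example x, then repair hierarchy violations.
    parent[i] is the index of ci's immediate superclass (None for top classes)."""
    raw = {i for i, f in enumerate(models) if f.predict([x])[0] == 1}

    def ancestors_predicted(i):
        p = parent[i]
        return p is None or (p in raw and ancestors_predicted(p))

    # The "trivial fix": keep a class only if all of its superclasses are kept too.
    return {i for i in raw if ancestors_predicted(i)}
```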
7. 2) Learn k classifiers hierarchically
- D = the data set; Di = {x ∈ D | ci ∈ S(x)}
- Di = the data set restricted to class-ci examples
- parent(ci) = the immediate superclass of ci
- Learning
- For each i = 1..k, learn fi from D_parent(ci) such that fi(x) = 1 iff ci ∈ S(x)
- Prediction (typically top-down in the hierarchy)
- S(x) := ∅
- For each i = 1..k, include ci in S(x) if parent(ci) ∈ S(x) and fi(x) = 1
- Predict S(x)
- Advantages
- The hierarchy constraint is solved
- More balanced distributions to learn from (a sketch follows below the figure)
[Figure: example class hierarchy over classes c1..c7]
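A sketch of the hierarchical variant under the same illustrative assumptions: each fi is trained only on the examples that belong to parent(ci), and prediction proceeds top-down, so the hierarchy constraint holds by construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_hierarchical(X, Y, parent):
    """Approach 2: train the classifier for ci only on D_parent(ci)."""
    models = []
    for i in range(Y.shape[1]):
        in_parent = np.ones(len(X), dtype=bool) if parent[i] is None else (Y[:, parent[i]] == 1)
        models.append(DecisionTreeClassifier().fit(X[in_parent], Y[in_parent, i]))
    return models

def predict_top_down(models, x, parent, order):
    """Top-down prediction; `order` lists class indices so every parent comes first."""
    S = set()
    for i in order:
        if (parent[i] is None or parent[i] in S) and models[i].predict([x])[0] == 1:
            S.add(i)
    return S
```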
8. 3) Learn one classifier
- Learning
- Learn f such that f(x) = S(x)
- Prediction
- Predict f(x)
- Need to have a learner that can learn models with >1 output
- E.g., neural nets can output a k-dimensional vector (c1, ..., ck)
- In this work, we extend decision trees (normally 1-D output) to HMC trees
- Trees = interpretable theories
- 1 tree is more interpretable than k trees
- Risks
- Perhaps that one tree will be much larger
- Perhaps the tree will be much less accurate
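As a rough stand-in for this third approach, scikit-learn's decision trees accept a two-dimensional label matrix and grow a single multi-output tree; this is not the Clus-HMC criterion (no class weights, no hierarchy-aware thresholds), just a sketch of "one tree, a vector of outputs" on hypothetical toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))                      # toy gene descriptions
Y = (rng.random((200, 7)) > 0.7).astype(int)   # toy 0/1 membership in 7 classes

tree = DecisionTreeClassifier(max_depth=4).fit(X, Y)  # one tree, 7 outputs
print(tree.predict(X[:1]))                     # a 1 x 7 vector of 0/1 predictions
```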
9. Data mining with decision trees
- Data mining = learning a general model from specific observations
- Decision trees are a popular format for models because
- They are fast to build and fast to use
- They make accurate predictions
- They are easy to interpret
Name  Age  Salary  Children  Loan?
Ann   25   29920   1         no
Bob   32   40000   2         yes
Carl  19   0       0         no
Dirk  44   45200   3         yes
...
10. Functional genomics
- Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene what functions it performs
- A gene can have multiple functions (out of 250 possible functions, in our case)
- Could be done with decision trees, with all the advantages that brings. But...
- Decision trees predict only one class, not a set of classes
- Should we learn a separate tree for each function?
- 250 functions = 250 trees: not so fast and interpretable anymore!
[Table: each gene G1, G2, G3, ... is described by attributes A1, A2, ..., An and marked with an 'x' in the function columns 1..250 it belongs to]
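The table above boils down to a 0/1 matrix with one row per gene and one column per function; a small sketch of building it from per-gene annotation sets (gene names and function numbers here are placeholders, not the real dataset).

```python
import numpy as np

# Hypothetical annotations: gene name -> set of function classes it belongs to.
annotations = {"G1": {1, 4}, "G2": {2, 4, 250}, "G3": {1, 3}}
functions = list(range(1, 251))                # the 250 function classes

Y = np.array([[int(f in annotations[g]) for f in functions]
              for g in annotations])           # shape: (n_genes, 250)
```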
11. Multiple prediction trees
- A multiple prediction tree (MPT) makes multiple predictions at once
- Basic idea (Blockeel, De Raedt, Ramon, 1998)
- A decision tree learner prefers tests that yield much information on the class attribute (measured using information gain (C4.5) or variance reduction (CART))
- An MPT learner prefers tests that reduce the variance of all target variables together
- Variance = mean squared distance of the vectors to the mean vector, in k-D space (see the sketch below)
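A minimal sketch of this split criterion: the variance of a set of label vectors is their mean squared distance to the mean vector, and a candidate test is scored by how much it reduces that variance (illustrative code, not the Clus implementation).

```python
import numpy as np

def variance(Y):
    """Variance of a set of class vectors: mean squared distance to the mean vector."""
    centre = Y.mean(axis=0)
    return float(np.mean(np.sum((Y - centre) ** 2, axis=1)))

def variance_reduction(Y, test):
    """Heuristic value of a boolean test splitting the examples into two subsets.
    Assumes both subsets are non-empty."""
    n, n_yes = len(Y), int(test.sum())
    return variance(Y) - (n_yes / n) * variance(Y[test]) \
                       - ((n - n_yes) / n) * variance(Y[~test])
```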
12. HMC tree learning
- A special case of MPT learning
- Main characteristics
- Errors higher up in the hierarchy are more important
- Use a weighted Euclidean distance (higher weight for higher classes)
- Need to ensure the hierarchy constraint
- Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5)
- To ensure compliance with the hierarchy constraint: ci ≤ cj ⇒ ti ≤ tj
- Automatically fulfilled if all ti are equal (sketches below)
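Two minimal sketches of these ingredients, assuming (as in the example on the next slide) that top-level classes get weight 1 and deeper classes get exponentially smaller weights, e.g. w0^depth with w0 = 0.5; the names are illustrative, not Clus's.

```python
import numpy as np

def class_weights(depths, w0=0.5):
    """Weight of each class: w0 ** depth, so higher classes weigh more."""
    return np.array([w0 ** d for d in depths])

def weighted_sq_dist(v1, v2, w):
    """Weighted Euclidean distance (squared): differences are scaled by w."""
    return float(np.sum((w * (v1 - v2)) ** 2))

def leaf_prediction(proportions, thresholds):
    """A leaf predicts ci iff the proportion of ci examples reaches ti.
    If "ci is a superclass of cj" implies ti <= tj, the predicted set
    automatically satisfies the hierarchy constraint."""
    return {i for i, p in enumerate(proportions) if p >= thresholds[i]}
```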
13. Example
[Figure: class hierarchy over c1..c7; classes c1, c2, c3 have weight 1, classes c4..c7 have weight 0.5; the class vectors of examples x1, x2, x3 are shown]
- x1 = {c1, c3, c5} = (1,0,1,0,1,0,0)
- x2 = {c1, c3, c7} = (1,0,1,0,0,0,1)
- x3 = {c1, c2, c5} = (1,1,0,0,1,0,0)
- d²(x1, x2) = 0.25 + 0.25 = 0.5
- d²(x1, x3) = 1 + 1 = 2
- x1 is more similar to x2 than to x3; the DT learner tries to create leaves with similar examples
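Plugging the vectors above into the weighted distance sketched for the previous slide reproduces the numbers on this slide (weights 1 for c1, c2, c3 and 0.5 for c4..c7):

```python
import numpy as np

w  = np.array([1, 1, 1, 0.5, 0.5, 0.5, 0.5])  # class weights for c1..c7
x1 = np.array([1, 0, 1, 0, 1, 0, 0])           # {c1, c3, c5}
x2 = np.array([1, 0, 1, 0, 0, 0, 1])           # {c1, c3, c7}
x3 = np.array([1, 1, 0, 0, 1, 0, 0])           # {c1, c2, c5}

d2 = lambda a, b: float(np.sum((w * (a - b)) ** 2))
print(d2(x1, x2))   # 0.5 -> x1 is close to x2
print(d2(x1, x3))   # 2.0 -> x1 is far from x3
```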
14. The Clus system
- Created by Jan Struyf
- Propositional DT learner, implemented in Java
- Implements ideas from
- C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984)
- predictive clustering trees (Blockeel et al., 1998)
- includes multiple prediction trees and hierarchical multilabel classification trees
- Reads data in ARFF format (Weka)
- We used two versions for our experiments
- Clus-HMC: the HMC version as explained
- Clus-SC: single-classification version, +/- CART
15. The datasets
- 12 datasets from functional genomics
- Each with a different description of the genes
- Sequence statistics (1)
- Phenotype (2)
- Predicted secondary structure (3)
- Homology (4)
- Micro-array data (5-12)
- Each with the same class hierarchy
- 250 classes distributed over 4 levels
- Number of examples: 1592 to 3932
- Number of attributes: 52 to 47034
16. Our expectations
- How does HMC tree learning compare to the straightforward approach of learning 250 trees?
- We expect:
- Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs
- Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs
- Larger trees: the HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees
- Less accurate: the HMCT is less accurate than the set of 250 SPTs (but hopefully not much less accurate)
- So how much faster / simpler / less accurate are our HMC trees?
17. The (surprising) results
- The HMCT is on average less complex than one single SPT
- The HMCT has 24 nodes, SPTs on average 33 nodes
- but you'd need 250 of the latter to do the same job
- The HMCT is on average slightly more accurate than a single SPT
- (see graphs)
- Surprising, as each SPT is tuned for one specific prediction task
- Expectations w.r.t. efficiency are confirmed
- Learning: min. speedup factor 4.5x, max 65x, average 37x
- Prediction: >250 times faster (since the tree is not larger)
- Faster to learn, much faster to apply
18. Precision-recall curves
Precision = proportion of predictions that is correct = P(X | predicted X)
Recall = proportion of class memberships correctly identified = P(predicted X | X)
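A minimal sketch of these two quantities for a single class, given 0/1 vectors of actual and predicted memberships (illustrative code; a PR curve is obtained by sweeping the leaf threshold ti and recomputing both).

```python
def precision_recall(y_true, y_pred):
    """Precision = P(member | predicted), recall = P(predicted | member)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    n_pred = sum(y_pred)
    n_true = sum(y_true)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall
```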
19. An example rule
- High interpretability: IF-THEN rules extracted from the HMCT are quite simple

IF Nitrogen_Depletion_8_h < -2.74
AND Nitrogen_Depletion_2_h > -1.94
AND 1point5_mM_diamide_5_min > -0.03
AND 1M_sorbitol___45_min_ > -0.36
AND 37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1

For class 40/3: recall = 0.15, precision = 0.97
(the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
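For illustration, the rule reads directly as a predicate over a gene's expression features; here a gene is assumed to be a dict from attribute names (as on the slide) to values, which is an illustrative encoding rather than Clus's output format.

```python
def predicts_40_3(gene):
    """The extracted rule: if all conditions hold, predict classes 40, 40/3, 5, 5/1."""
    return (gene["Nitrogen_Depletion_8_h"]     < -2.74 and
            gene["Nitrogen_Depletion_2_h"]     > -1.94 and
            gene["1point5_mM_diamide_5_min"]   > -0.03 and
            gene["1M_sorbitol___45_min_"]      > -0.36 and
            gene["37C_to_25C_shock___60_min"]  >  1.28)
```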
20. The effect of merging
[Figure: 250 separate trees, each optimized for a single class c1, c2, ..., c250, merged into one tree optimized for c1, c2, ..., c250 together]
- Smaller than the average individual tree
- More accurate than the average individual tree
21. Any explanation for these results?
- Almost too good to be true: how is it possible?
- Answer: the classes are not independent
- Different trees for different classes actually share structure
- This explains some of the complexity reduction achieved by the HMCT
- One class carries information on other classes
- This increases the signal-to-noise ratio
- Provides better guidance when learning the tree (explaining the good accuracy)
- Avoids overfitting (explaining the further reduction of tree size)
- This was confirmed empirically
22. Overfitting
- To check our overfitting hypothesis
- We compared the area under the PR curve on the training set (Atr) and on the test set (Ate)
- For SPC: Atr - Ate = 0.219
- For HMCT: Atr - Ate = 0.024
- (to verify, we also tried Weka's M5: 0.387)
- So the HMCT clearly overfits much less
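A sketch of that check for one class, assuming predicted class probabilities on the training and test sets and using scikit-learn's precision-recall utilities (the talk averages the areas over all classes; this only computes the train-test gap for a single class).

```python
from sklearn.metrics import auc, precision_recall_curve

def area_under_pr(y_true, y_score):
    """Area under the precision-recall curve for one class."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

def pr_gap(y_train, score_train, y_test, score_test):
    """Overfitting indicator: Atr - Ate for one class."""
    return area_under_pr(y_train, score_train) - area_under_pr(y_test, score_test)
```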
23. Conclusions
- Surprising discovery: a single tree can be found that
- predicts 250 different functions with, on average, equal or better accuracy than special-purpose trees for each function
- is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)
- is (much) more efficient to learn and to apply
- The reason for this is to be found in the dependencies between the gene functions
- They provide better guidance when learning the tree
- They help to avoid overfitting