Title: Three feature selection problems with solutions
1. Three feature selection problems (with solutions)
- Jose M. Peña
- IISLAB, IDA
- Linköping University
- Sweden
- jospe_at_ifm.liu.se
- www.ida.liu.se/jospe
Joint work with Roland Nilsson, Johan Björkegren and Jesper Tegnér.
2. Outline
- Problem I: Posterior distribution.
  - Solution: Markov boundary.
    - Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232.
    - Peña, J. M. (2008). Learning Gaussian Graphical Models of Gene Networks with False Discovery Rate Control. In Proceedings of the 6th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBIO 2008), Lecture Notes in Computer Science 4973, 165-176.
- Problem II: Class label.
  - Solution: Bayes relevant features.
    - Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612.
    - Peña, J. M. (2009). On the Possible Ordering of Discrete Features Subsets. Submitted.
- Problem III: All relevant features.
  - Solution: RIT algorithm.
    - Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.
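3. Problem I: Posterior distribution
- The Markov boundary of Y, $S_M$, is the minimal set of features such that $p(p(Y \mid X) = p(Y \mid S_M)) = 1$.
- If $p(X) > 0$ then $S_M$ is unique.
- If $p(X) > 0$ then $Z \in S_M$ iff $p(p(Y \mid X) \neq p(Y \mid X \setminus Z)) > 0$, i.e., iff Z is strongly relevant.
- Consequence: no exhaustive search is required, but the approach is data inefficient, since each test conditions on all the remaining features.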
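When the joint distribution is known, the strong-relevance criterion above can be checked by direct enumeration. The following is a minimal sketch under that assumption: the dictionary layout of `joint` and the function names `conditional` and `strongly_relevant` are illustrative choices, not part of the original slides.

```python
from collections import defaultdict

def conditional(joint, keep):
    """Return p(y | x[keep]) as a dict keyed by (projected x, y)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for (x, y), p in joint.items():
        xs = tuple(x[i] for i in keep)
        num[(xs, y)] += p
        den[xs] += p
    return {k: v / den[k[0]] for k, v in num.items()}

def strongly_relevant(joint, z, n_features, tol=1e-12):
    """True iff p(Y | X) differs from p(Y | X \\ Z) with positive probability."""
    rest = [i for i in range(n_features) if i != z]
    full = conditional(joint, list(range(n_features)))
    reduced = conditional(joint, rest)
    for (x, y), p in joint.items():
        if p <= 0:
            continue
        xs = tuple(x[i] for i in rest)
        if abs(full[(x, y)] - reduced[(xs, y)]) > tol:
            return True
    return False

# Toy distribution (an assumption for illustration): two binary features,
# uniform p(x) > 0, and p(Y=1 | x) depending on feature 0 only.
joint = {}
for x0 in (0, 1):
    for x1 in (0, 1):
        p1 = 0.9 if x0 else 0.2
        joint[((x0, x1), 1)] = 0.25 * p1
        joint[((x0, x1), 0)] = 0.25 * (1 - p1)

# With p(x) > 0, the Markov boundary is the set of strongly relevant features.
markov_boundary = [z for z in range(2) if strongly_relevant(joint, z, 2)]
print(markov_boundary)  # [0]
```

With real data the conditional distributions have to be estimated for every configuration of all the other features, which is where the data inefficiency comes from.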
4. Algorithms for $S_M$
- IAMB (Tsamardinos et al., 2003) is consistent under the composition property assumption: $X \perp Y \mid Z \,\wedge\, X \perp W \mid Z \Rightarrow X \perp YW \mid Z$.
- The composition property is:
  - Satisfied by Gaussian distributions.
  - Satisfied by distributions perfect to some graph.
  - Closed under marginalization and conditioning.
- KIAMB: IAMB with randomness at step 4.
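A grow-shrink search in the spirit of IAMB can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are assumed Gaussian (one of the cases where composition holds), conditional independence is tested with a Fisher z-test on partial correlations, and the smallest p-value is used as a stand-in for the association measure. All function names are hypothetical, and the randomized step of KIAMB is not modeled.

```python
import math
import random

def correlation_matrix(rows):
    """Sample correlation matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / n
            for b in range(d)] for a in range(d)]
    return [[cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(d)]
            for a in range(d)]

def partial_corr(corr, i, j, cond):
    """Partial correlation of i and j given cond, via the standard recursion."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    rij = partial_corr(corr, i, j, rest)
    rik = partial_corr(corr, i, k, rest)
    rjk = partial_corr(corr, j, k, rest)
    return (rij - rik * rjk) / math.sqrt((1 - rik ** 2) * (1 - rjk ** 2))

def p_value(corr, n, i, j, cond):
    """Two-sided Fisher z-test p-value for the null 'i independent of j given cond'."""
    r = max(min(partial_corr(corr, i, j, cond), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def iamb(corr, n, target, alpha=0.001):
    """Grow-shrink estimate of the Markov boundary of `target`."""
    d = len(corr)
    mb = []
    while True:  # growing phase: add the most associated dependent feature
        candidates = [v for v in range(d) if v != target and v not in mb]
        if not candidates:
            break
        pvals = {v: p_value(corr, n, target, v, mb) for v in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        mb.append(best)
    for v in list(mb):  # shrinking phase: drop features that became independent
        rest = [u for u in mb if u != v]
        if p_value(corr, n, target, v, rest) >= alpha:
            mb.remove(v)
    return sorted(mb)

# Synthetic Gaussian data: columns 0 and 1 are parents of column 2 (the target),
# column 3 is its child, and column 4 is unrelated noise.
rng = random.Random(0)
rows = []
for _ in range(2000):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    y = a + b + rng.gauss(0, 1)
    c = y + rng.gauss(0, 1)
    rows.append([a, b, y, c, rng.gauss(0, 1)])

boundary = iamb(correlation_matrix(rows), len(rows), target=2)
print(boundary)
```

In this synthetic setup the true Markov boundary of column 2 is columns 0, 1 and 3. Each test conditions only on the current boundary estimate rather than on all other features, which is the source of the data efficiency gain over testing strong relevance directly.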
5. Thrombin data
Data provided by DuPont Pharmaceuticals for KDD Cup 2001: 1909 training instances, 634 testing instances, and 139351 binary features (3-D properties of a drug compound tested for binding to thrombin, a key receptor in blood clotting).
6. Preliminaries for problem II
- Classifier, $g: X \rightarrow Y$.
- Bayes classifier, $g(X) = \arg\max_y p(y \mid X)$.
- Risk, $R(g) = p(g(X) \neq Y) = \sum_{x,y} p(x,y) \, 1_{\{g(x) \neq y\}}$.
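For a small discrete problem the two definitions above can be evaluated directly from the joint distribution. A minimal sketch, assuming the joint is given as a dictionary from `(x, y)` to probability (the layout and the toy numbers are illustrative assumptions):

```python
from collections import defaultdict

def bayes_classifier(joint):
    """Map each x to arg max_y p(y | x), computed from the joint p(x, y)."""
    by_x = defaultdict(dict)
    for (x, y), p in joint.items():
        by_x[x][y] = p
    # arg max_y p(y | x) equals arg max_y p(x, y), since p(x) does not depend on y.
    return {x: max(py, key=py.get) for x, py in by_x.items()}

def risk(joint, g):
    """R(g) = sum over (x, y) of p(x, y) * 1{g(x) != y}."""
    return sum(p for (x, y), p in joint.items() if g[x] != y)

# Toy joint distribution over one binary feature and labels in {-1, 1}.
joint = {(0, -1): 0.4, (0, 1): 0.1, (1, -1): 0.2, (1, 1): 0.3}
g = bayes_classifier(joint)
print(g, risk(joint, g))
```

Here the Bayes classifier predicts -1 for x = 0 and 1 for x = 1, and its risk is the total mass of the two minority cells, about 0.3.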
7. Problem II: Class label
- Let $X \in [0,1]$, $Y \in \{-1,1\}$, $f(x) > 0$ and $p(Y=1 \mid x) = x/3$ for all x.
- Then, $S_M = X$ but $g(x) = -1$ for all x and, thus, X is irrelevant for classification.
- Z is Bayes relevant iff $p(g(X) \neq g(X \setminus Z)) > 0$. Let S denote the set of Bayes relevant features.
- If $p(x) > 0$ and $p(Y \mid x)$ has a single maximum for all x, then $p(g(X) \neq g(X \setminus Z)) > 0$ iff $R(g(X)) \neq R(g(X \setminus Z))$.
- If $p(x) > 0$ and $p(Y \mid x)$ has a single maximum for all x, then S is the only minimal feature subset such that $R(g(S)) = R(g(X))$.
- Consequence: no exhaustive search is required, but the approach is data inefficient.
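The arithmetic behind the example on this slide can be spelled out as follows, assuming (as the use of a density f suggests) that x ranges over the interval [0, 1]:

```latex
p(Y = 1 \mid x) = \frac{x}{3} \le \frac{1}{3} < \frac{1}{2}
\quad\Longrightarrow\quad
g(x) = \arg\max_{y} p(y \mid x) = -1 \quad \text{for all } x \in [0, 1].
```

The posterior $p(Y \mid x)$ changes with x, so X belongs to the Markov boundary, yet the arg max never changes, so dropping X leaves the Bayes classifier and its risk unchanged.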
8. UCI data sets
Consistent version of the one-shot approach
- The following backward search is correct:
  - $S = X$.
  - Repeat while possible:
    - If there exists $Z \in S$ such that $R(g(S)) = R(g(S \setminus Z))$, then $S = S \setminus Z$.
- Data inefficient. Forward approaches?
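The backward search above can be sketched for a known discrete joint distribution, where Bayes risks are computed exactly instead of being estimated. The data layout, the function names and the toy distribution below are assumptions made for illustration only.

```python
from collections import defaultdict

def bayes_risk(joint, subset):
    """Bayes risk when only the features in `subset` are observed."""
    mass = defaultdict(lambda: defaultdict(float))
    for (x, y), p in joint.items():
        mass[tuple(x[i] for i in subset)][y] += p
    # For each projected x, the Bayes classifier errs on everything but the top label.
    return sum(sum(py.values()) - max(py.values()) for py in mass.values())

def backward_search(joint, n_features, tol=1e-12):
    """Start from all features and drop any feature whose removal keeps the risk."""
    s = list(range(n_features))
    changed = True
    while changed:
        changed = False
        for z in s:
            reduced = [i for i in s if i != z]
            if abs(bayes_risk(joint, s) - bayes_risk(joint, reduced)) <= tol:
                s = reduced
                changed = True
                break
    return s

# Toy distribution: three binary features with uniform p(x) > 0.
# Feature 0 decides the arg max, feature 1 only shifts the posterior,
# and feature 2 is pure noise.
joint = {}
for x0 in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            p1 = (0.9 if x1 else 0.7) if x0 else (0.1 if x1 else 0.3)
            joint[((x0, x1, x2), 1)] = 0.125 * p1
            joint[((x0, x1, x2), -1)] = 0.125 * (1 - p1)

selected = backward_search(joint, 3)
print(selected)  # [0]
```

In this toy case feature 1 changes $p(Y \mid x)$ but never changes the arg max, so it is dropped even though it belongs to the Markov boundary. With real data each risk has to be estimated on the full current subset, which is why the slide calls the search data inefficient.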
9. S may differ from $S_M$
- $S \subseteq S_M$.
- But the converse may not be true.
10. Possible orderings of discrete feature subsets
- Any strictly increasing Bayes risk ordering is possible as long as $R(g(ST)) < R(g(S))$, where $ST$ denotes $S \cup T$.
- E.g., $R(g(X_1,X_2,X_3)) < R(g(X_1,X_2)) < R(g(X_1,X_3)) < R(g(X_2,X_3)) < R(g(X_3)) < R(g(X_2)) < R(g(X_1)) < R(g(\emptyset))$.
- Finding the feature subset of size k that has minimal Bayes risk requires exhaustive search.
- As we have seen, finding S does not require exhaustive search.
- Open problem: Is any sequence of Bayes risks possible?
- Analogous results exist for continuous domains, though not Gaussian. See Cover and Van Campenhout (1978) and Van Campenhout (1980).
11. Problem III: All relevant features
- Z is weakly relevant iff $p(p(Y \mid X) = p(Y \mid X \setminus Z)) = 1$ but $p(p(Y \mid S) \neq p(Y \mid S, Z)) > 0$ for some $S \subseteq X \setminus Z$.
- The set of all-relevant features, $S_A$, is the set of strongly and weakly relevant features.
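The distinction between strong and weak relevance can be made concrete with a brute-force check on a known discrete distribution. This sketch enumerates every subset S, so it is exponential and only meant for tiny examples; the names `cond_y`, `changes_posterior` and `relevance` are hypothetical, and `joint` must list only cells with positive probability.

```python
from collections import defaultdict
from itertools import combinations

def cond_y(joint, keep):
    """p(y | x[keep]) for every (projected x, y) present in the joint."""
    num, den = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        xs = tuple(x[i] for i in keep)
        num[(xs, y)] += p
        den[xs] += p
    return {k: v / den[k[0]] for k, v in num.items()}

def changes_posterior(joint, s, z, tol=1e-12):
    """True iff p(Y | S) differs from p(Y | S, Z) with positive probability."""
    with_z, without_z = cond_y(joint, s + [z]), cond_y(joint, s)
    return any(abs(q - without_z[(xs[:-1], y)]) > tol
               for (xs, y), q in with_z.items())

def relevance(joint, z, n_features):
    """Classify feature z as 'strong', 'weak' or 'irrelevant'."""
    rest = [i for i in range(n_features) if i != z]
    if changes_posterior(joint, rest, z):
        return "strong"
    for k in range(len(rest)):  # all proper subsets S of X \ Z
        for s in combinations(rest, k):
            if changes_posterior(joint, list(s), z):
                return "weak"
    return "irrelevant"

# Toy distribution: feature 1 is an exact copy of feature 0, feature 2 is noise,
# and p(Y=1 | x) depends on feature 0 only.
joint = {}
for x0 in (0, 1):
    for x2 in (0, 1):
        p1 = 0.9 if x0 else 0.2
        joint[((x0, x0, x2), 1)] = 0.25 * p1
        joint[((x0, x0, x2), 0)] = 0.25 * (1 - p1)

labels = [relevance(joint, z, 3) for z in range(3)]
print(labels)  # ['weak', 'weak', 'irrelevant']
```

Because features 0 and 1 carry the same information, neither is strongly relevant, yet both belong to the all-relevant set.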
12. Algorithm for $S_A$
- There exists $f(X,Y) > 0$ such that searching for $S_A$ implies an exhaustive search.
- RIT is consistent under the following assumptions:
  - composition ($X \perp Y \mid Z \,\wedge\, X \perp W \mid Z \Rightarrow X \perp YW \mid Z$), and
  - weak transitivity ($X \perp Y \mid Z \,\wedge\, X \perp Y \mid ZV \Rightarrow X \perp V \mid Z \,\vee\, V \perp Y \mid Z$).
- These assumptions are:
  - Satisfied by Gaussian distributions.
  - Satisfied by distributions perfect to some graph.
  - Closed under marginalization and conditioning.
- RIT performs at most $|S_A| \cdot |X|$ tests ($|S_A| < |X|$).
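The recursive idea behind RIT can be sketched as below. The slide's own listing is not reproduced in the text, so this follows the usual description of the method (test marginal dependence with the target, then recurse from each feature found) and should be read as an assumption. The Gaussian Fisher z-test, the function names and the synthetic data are illustrative choices as well.

```python
import math
import random

def correlation_matrix(rows):
    """Sample correlation matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / n
            for b in range(d)] for a in range(d)]
    return [[cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(d)]
            for a in range(d)]

def dependent(corr, n, i, j, alpha):
    """Marginal Fisher z-test: True if independence of i and j is rejected."""
    r = max(min(corr[i][j], 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2)) < alpha

def rit(corr, n, target, features, alpha=0.001):
    """Recursive search for relevant features using marginal tests only."""
    found = {v for v in features if dependent(corr, n, target, v, alpha)}
    result = set(found)
    for v in found:
        # Recurse from each feature found, over the features not yet selected.
        result |= rit(corr, n, v, features - result, alpha)
    return result

# Synthetic data: column 1 is noise that contaminates column 0, and the target
# (column 3) depends on the clean signal. Column 1 is marginally independent of
# the target but becomes informative once column 0 is known. Column 2 is unrelated.
rng = random.Random(1)
rows = []
for _ in range(2000):
    signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([signal + noise, noise, rng.gauss(0, 1),
                 signal + 0.5 * rng.gauss(0, 1)])

relevant = rit(correlation_matrix(rows), len(rows), target=3, features={0, 1, 2})
print(sorted(relevant))
```

Only marginal (unconditional) tests are used, which is what makes the procedure data efficient. On this synthetic data the search should reach column 1 through column 0, although a univariate screen against the target alone would miss it.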
13. Algorithm for $S_A$
No exhaustive search required; data efficient.
14. Algorithm for $S_A$ with FDR control
15. Simulated data
16. Diabetes data
Data from Gunton et al. (2005), Cell, 122. The data comprise 7 normal vs. 15 type 2 diabetic patients, and 5000 genes kept after filtering out those with low variance.
3 genes are univariately differentially expressed: Arnt, Cdc14a and Ddx3Y (370 if there is no control for multiplicity).
Dopey1 was recently shown to be active in the vesicle traffic system, the mechanism that delivers insulin receptors to the cell surface.
4 genes encode TFs (transcription factors), which is intriguing since a large fraction of previously discovered diabetes-related genes are TFs.
So does Ddx3Y (only 6 genes annotated with this function).
17. Summary
- Problem I: Posterior distribution.
  - Solution: Markov boundary.
    - Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232.
    - Peña, J. M. (2008). Learning Gaussian Graphical Models of Gene Networks with False Discovery Rate Control. In Proceedings of the 6th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBIO 2008), Lecture Notes in Computer Science 4973, 165-176.
- Problem II: Class label.
  - Solution: Bayes relevant features.
    - Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612.
    - Peña, J. M. (2009). On the Possible Ordering of Discrete Features Subsets. Submitted.
- Problem III: All relevant features.
  - Solution: RIT algorithm.
    - Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.