1
Three feature selection problems (with solutions)
  • Jose M. Peña
  • IISLAB, IDA
  • Linköping University
  • Sweden
  • jospe@ifm.liu.se
  • www.ida.liu.se/jospe

Joint work with Roland Nilsson, Johan
Björkegren, and Jesper Tegnér.
2
Outline
  • Problem I: Posterior distribution.
  • Solution: Markov boundary.
  • Peña, J. M., Nilsson, R., Björkegren, J. and
    Tegnér, J. (2007). Towards Scalable and Data
    Efficient Learning of Markov Boundaries.
    International Journal of Approximate Reasoning,
    45(2), 211-232.
  • Peña, J. M. (2008). Learning Gaussian Graphical
    Models of Gene Networks with False Discovery Rate
    Control. In Proceedings of the 6th European
    Conference on Evolutionary Computation, Machine
    Learning and Data Mining in Bioinformatics
    (EvoBIO 2008), Lecture Notes in Computer
    Science 4973, 165-176.
  • Problem II: Class label.
  • Solution: Bayes relevant features.
  • Nilsson, R., Peña, J. M., Björkegren, J. and
    Tegnér, J. (2007). Consistent Feature Selection
    for Pattern Recognition in Polynomial Time.
    Journal of Machine Learning Research, 8, 589-612.
  • Peña, J. M. (2009). On the Possible Ordering of
    Discrete Feature Subsets. Submitted.
  • Problem III: All relevant features.
  • Solution: RIT algorithm.
  • Nilsson, R., Peña, J. M., Björkegren, J. and
    Tegnér, J. (2007). Detecting Multivariate
    Differentially Expressed Genes. BMC
    Bioinformatics, 8:150.

3
Problem I: Posterior distribution
  • The Markov boundary of Y, S_M, is the minimal set
    of features such that p(p(Y|X) = p(Y|S_M)) = 1.
  • If p(X) > 0 then S_M is unique.
  • If p(X) > 0 then Z ∈ S_M iff p(p(Y|X) ≠ p(Y|X\Z))
    > 0.

That is, Z ∈ S_M iff Z is strongly relevant:
no exhaustive search required, but data inefficient.
4
Algorithms for S_M
  • IAMB (Tsamardinos et al., 2003) is consistent under
    the composition property assumption
    (X ⊥ Y | Z ∧ X ⊥ W | Z ⇒ X ⊥ YW | Z).
  • The composition property is satisfied by:
  • Gaussian distributions.
  • Distributions perfect to some graph.
  • Both classes are closed under marginalization and
    conditioning.
  • KIAMB: IAMB with randomness at step 4; see the
    sketch below.
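The IAMB pseudocode itself is not reproduced in this transcript, so here is
a minimal Python sketch of the grow-shrink scheme, assuming placeholder
helpers assoc (a conditional dependence measure) and indep (a conditional
independence test):

```python
def iamb(features, target, assoc, indep):
    """Sketch of the IAMB scheme (Tsamardinos et al., 2003).

    assoc(f, target, cond): dependence strength of f and target given cond
        (assumed helper, e.g. conditional mutual information).
    indep(f, target, cond): True if f is independent of target given cond
        (assumed helper, e.g. a statistical test).
    """
    mb = set()
    # Growing phase: add the feature most associated with the target given
    # the current candidate boundary. KIAMB randomizes this choice (step 4),
    # picking at random among the still-dependent candidates.
    while True:
        candidates = [f for f in features
                      if f not in mb and not indep(f, target, mb)]
        if not candidates:
            break
        mb.add(max(candidates, key=lambda f: assoc(f, target, mb)))
    # Shrinking phase: remove false positives that turn out independent of
    # the target given the rest of the candidate boundary.
    for f in list(mb):
        if indep(f, target, mb - {f}):
            mb.remove(f)
    return mb
```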

5
Thrombin data
Data provided by DuPont Pharmaceuticals for KDD
Cup 2001: 1909 training instances, 634 testing
instances, 139,351 binary features (3-D properties
of a drug compound tested for binding to
thrombin, a key receptor in blood clotting).
6
Preliminaries for problem II
  • Classifier, g: X → Y.
  • Bayes classifier, g(x) = arg max_y p(y|x).
  • Risk, R(g) = p(g(X) ≠ Y) = Σ_{x,y} p(x,y) 1{g(x) ≠ y}.
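To make these definitions concrete, a small illustrative computation (the
joint distribution below is made up for the example):

```python
# Illustrative joint distribution p(x, y) over X = {0, 1}, Y = {-1, 1}.
p = {(0, -1): 0.4, (0, 1): 0.1, (1, -1): 0.3, (1, 1): 0.2}

# Bayes classifier: g(x) = arg max_y p(y|x), equivalently arg max_y p(x, y).
def g(x):
    return max((-1, 1), key=lambda y: p[(x, y)])

# Risk: R(g) = p(g(X) != Y) = sum over (x, y) of p(x, y) * 1{g(x) != y}.
risk = sum(prob for (x, y), prob in p.items() if g(x) != y)
print({x: g(x) for x in (0, 1)}, risk)  # {0: -1, 1: -1}, R(g) ≈ 0.3
```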

7
Problem II: Class label
  • Let X = {0,1}, Y = {-1,1}, p(x) > 0 and p(Y=1|x) = x/3
    for all x.
  • Then, S_M = {X} but, since p(Y=1|x) ≤ 1/3 < 1/2 for
    all x, g(x) = -1 for all x and, thus, X is
    irrelevant for classification.
  • Z is Bayes relevant iff p(g(X) ≠ g(X\Z)) > 0. Let
    S denote the set of Bayes relevant features.
  • If p(x) > 0 and p(Y|x) has a single maximum for all
    x, then p(g(X) ≠ g(X\Z)) > 0 iff
    R(g(X)) ≠ R(g(X\Z)).
  • If p(x) > 0 and p(Y|x) has a single maximum for all
    x, then S is the only minimal feature subset
    such that R(g(S)) = R(g(X)).

No exhaustive search required, but data inefficient.
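A quick numeric rendering of the slide's counterexample (any p(x) > 0 works;
no particular p(x) is needed for the check):

```python
# Counterexample: X = {0, 1}, Y = {-1, 1}, p(Y=1|x) = x/3.
p_y1 = {x: x / 3 for x in (0, 1)}

# X is strongly relevant (the posterior depends on x), so S_M = {X}.
assert p_y1[0] != p_y1[1]

# Yet p(Y=1|x) <= 1/3 < 1/2 for all x, so the Bayes classifier is
# constant: g(x) = -1 everywhere, and X is irrelevant for classification.
g = {x: (1 if p_y1[x] > 0.5 else -1) for x in (0, 1)}
assert g[0] == g[1] == -1
```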
8
UCI data sets
Consistent version of the one-shot approach
  • The following backward search is correct (sketched
    in code below):
  • S = X.
  • Repeat while possible:
  • If there exists Z ∈ S such that
    R(g(S)) = R(g(S\Z)), then S = S\Z.
  • Data inefficient. Forward approaches?
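A direct Python rendering of this backward search; bayes_risk stands in
for an assumed oracle or estimator of R(g(S)):

```python
def backward_search(features, bayes_risk):
    """Backward search from the slide: start from S = X and drop any
    feature whose removal leaves the Bayes risk unchanged.

    bayes_risk(S): assumed oracle/estimator returning R(g(S)).
    """
    s = frozenset(features)
    changed = True
    while changed:                       # "repeat while possible"
        changed = False
        for z in s:
            if bayes_risk(s - {z}) == bayes_risk(s):
                s = s - {z}              # Z is not Bayes relevant given the rest
                changed = True
                break
    return s
```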

9
S may differ from S_M
  • S ⊆ S_M.
  • But the converse inclusion may not hold.

10
Possible orderings of discrete feature subsets
  • Any strictly increasing Bayes risk ordering is
    possible as long as R(g(S∪T)) < R(g(S)).
  • E.g. (see the check below),
  • R(g({X1,X2,X3})) < R(g({X1,X2})) < R(g({X1,X3})) <
    R(g({X2,X3})) < R(g({X3})) < R(g({X2})) < R(g({X1}))
    < R(g(∅)).
  • Finding the feature subset of size k that has
    minimal Bayes risk requires exhaustive search.
  • As we have seen, finding S does not require
    exhaustive search.
  • Open problem: Is any sequence of Bayes risks
    possible?
  • Analogous results exist for continuous domains,
    though not for Gaussian ones. See Cover and Van
    Campenhout (1978) and Van Campenhout (1980).
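As an illustrative check (risk values made up to match the example
ordering), the superset constraint can be verified mechanically:

```python
# Subsets of {X1, X2, X3} in the example's increasing-risk order,
# with made-up strictly increasing risks attached.
order = [frozenset(s) for s in
         ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {3}, {2}, {1}, set())]
risk = {s: i / 10 for i, s in enumerate(order)}

# Constraint: every proper superset has strictly smaller Bayes risk.
ok = all(risk[sup] < risk[sub]
         for sup in order for sub in order if sub < sup)
print(ok)  # True: the example ordering is admissible
```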

11
Problem III: All relevant features
  • Z is weakly relevant iff p(p(Y|X) = p(Y|X\Z)) = 1
    but
  • p(p(Y|S) ≠ p(Y|S,Z)) > 0 for some S ⊆ X\Z.
  • The set of all relevant features, S_A, is the set
    of strongly and weakly relevant features.

12
Algorithm for S_A
  • There exists p(X,Y) > 0 such that searching for
    S_A implies an exhaustive search.
  • RIT is consistent under the following assumptions:
  • composition (X ⊥ Y | Z ∧ X ⊥ W | Z ⇒ X ⊥ YW | Z),
    and
  • weak transitivity (X ⊥ Y | Z ∧ X ⊥ Y | Z∪V ⇒
    X ⊥ V | Z ∨ V ⊥ Y | Z).
  • Both assumptions are satisfied by:
  • Gaussian distributions.
  • Distributions perfect to some graph.
  • These classes are closed under marginalization and
    conditioning.
  • RIT performs at most |S_A|·|X| tests (|S_A| ≤ |X|);
    see the sketch below.
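The RIT pseudocode is not reproduced in this transcript; the sketch below
shows the idea as described in the BMC Bioinformatics paper, assuming a
placeholder marginal independence test indep:

```python
def rit(target, features, indep):
    """Sketch of RIT: under composition and weak transitivity, every
    relevant feature is reachable from the target through a chain of
    pairwise dependencies, so S_A can be grown by a graph search.

    indep(a, b): True if a and b are marginally independent (assumed test).
    """
    relevant = set()
    frontier = [target]       # nodes whose dependents are still untested
    untested = set(features)
    while frontier:
        node = frontier.pop()
        # Only the target and relevant features enter the frontier, and each
        # is tested against at most |X| features: the |S_A|·|X| bound above.
        found = {f for f in untested if not indep(f, node)}
        untested -= found
        relevant |= found
        frontier.extend(found)
    return relevant
```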

13
Algorithm for S_A
No exhaustive search required; data efficient.
14
Algorithm for S_A with FDR control
15
Simulated data
16
Diabetes data
Data from Gunton et al. (2005), Cell, 122: 7
normal vs. 15 type 2 diabetic patients, with 5000
genes kept after filtering out those with low
variance. 3 genes are univariately differentially
expressed: Arnt, Cdc14a and Ddx3Y (370 if no
control for multiplicity).
Dopey1 was recently shown to be active in the
vesicle traffic system, the mechanism
that delivers insulin receptors to the cell
surface.
4 genes encode TFs, which is intriguing since a
large fraction of previously discovered
diabetes-related genes are TFs.
So does Ddx3Y (only 6 genes are annotated with
this function).
17
Summary
  • Problem I: Posterior distribution.
  • Solution: Markov boundary.
  • Peña, J. M., Nilsson, R., Björkegren, J. and
    Tegnér, J. (2007). Towards Scalable and Data
    Efficient Learning of Markov Boundaries.
    International Journal of Approximate Reasoning,
    45(2), 211-232.
  • Peña, J. M. (2008). Learning Gaussian Graphical
    Models of Gene Networks with False Discovery Rate
    Control. In Proceedings of the 6th European
    Conference on Evolutionary Computation, Machine
    Learning and Data Mining in Bioinformatics
    (EvoBIO 2008), Lecture Notes in Computer
    Science 4973, 165-176.
  • Problem II: Class label.
  • Solution: Bayes relevant features.
  • Nilsson, R., Peña, J. M., Björkegren, J. and
    Tegnér, J. (2007). Consistent Feature Selection
    for Pattern Recognition in Polynomial Time.
    Journal of Machine Learning Research, 8, 589-612.
  • Peña, J. M. (2009). On the Possible Ordering of
    Discrete Feature Subsets. Submitted.
  • Problem III: All relevant features.
  • Solution: RIT algorithm.
  • Nilsson, R., Peña, J. M., Björkegren, J. and
    Tegnér, J. (2007). Detecting Multivariate
    Differentially Expressed Genes. BMC
    Bioinformatics, 8:150.