Effective Feature Selection Framework for Cluster Analysis of Microarray Data - PowerPoint PPT Presentation

About This Presentation
Title:

Effective Feature Selection Framework for Cluster Analysis of Microarray Data

Description:

Effective Feature Selection Framework for Cluster Analysis of Microarray Data Gouchol Pok Computer Science Dept. Yanbian University China Keun Ho Ryu – PowerPoint PPT presentation

Slides: 25
Provided by: gcpok

Transcript and Presenter's Notes


1
Effective Feature Selection Framework for
Cluster Analysis of Microarray Data
  • Gouchol Pok
  • Computer Science Dept.
  • Yanbian University
  • China

  • Keun Ho Ryu
  • DB/Bioinformatics Lab
  • Chungbuk National University
  • Korea
2
Outline
  • Background
  • Motivation
  • Proposed Method
  • Experiments
  • Conclusion

3
Feature Selection
  • Definition
  • Process of selecting a subset of relevant
    features for building robust learning models
  • Objectives
  • Alleviating the effect of the curse of
    dimensionality
  • Enhancing generalization capability
  • Speeding up learning process
  • Improving model interpretability

from Wikipedia, http://en.wikipedia.org/wiki/Feature_selection
4
Issues in Feature Selection
  • How to compute the degree to which a feature is
    relevant to the class (discrimination)
  • How to decide whether a selected feature is redundant
    with other features (strongly correlated)
  • How to select features so that classifying power
    is not diminished (and ideally is increased)
  • Removal of irrelevancy
  • Removal of redundancy
  • Maintain class-discriminating power

5
Selection Modes
  • Univariate method
  • considers one feature at a time, ranked by score
  • typical measures: correlation, information measures,
    K-S statistic, etc.
  • Multivariate method
  • considers subsets of features together
  • Bayesian and PCA-based selection
  • in principle more powerful than the univariate
    method, but not always in practice (Guyon2008)
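
A minimal sketch of the univariate mode, assuming the score is the absolute Pearson correlation between each feature and a numeric class label. The function name `rank_features_by_correlation` and the toy data are illustrative only and do not come from the slides.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank features (columns of X) by |Pearson correlation| with labels y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features have zero variance; score them 0 instead of dividing by 0.
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    order = np.argsort(-scores)  # best-scoring feature first
    return order, scores

# Toy data: 4 samples (rows), 3 features (columns).
X = np.array([[1.0, 5.0, 0.3],
              [2.0, 5.0, 0.9],
              [3.0, 5.0, 0.1],
              [4.0, 5.0, 0.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])
order, scores = rank_features_by_correlation(X, y)
```

Each feature is scored in isolation, so a feature that is informative only in combination with another would rank low here.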

6
Hard Case in Univariate Method (Guyon2008)
Adapted from Guyon's tutorial at the IPAM summer
school
7
Proposed Method: Motivation
  • Method that fits 2-D microarray data
  • typical form: thousands of genes (rows) and
    hundreds of samples (columns)
  • Multivariate approach
  • Feature relevancy and redundancy are addressed
    simultaneously

8
System Flow
[Figure: system flow diagram; matrix axes labeled "samples" and "genes"]
9
System Flow (cont.)
10
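Methods: Step 1
  • Perform column-based difference op.
  • $D_i(N,M) = C(N,M) \ominus C_i(N,1), \quad i = 1, 2, \ldots, M$,
    where $\ominus$ is the difference operator and $C_i$ is the i-th column of $C$
  • Difference operator may depend on the application,
    e.g. Euclidean or Manhattan distance
  • $D_i(N,M)$ contains class-specific info. w.r.t. each
    gene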
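
One way to read Step 1 is sketched below, assuming the difference operator is the per-gene absolute difference (for a single gene, the Manhattan and Euclidean distances between two samples coincide). The function name `column_differences` and the stacked output layout are assumptions made for illustration.

```python
import numpy as np

def column_differences(C):
    """Return D of shape (M, N, M) where D[i] = |C - C[:, i]|.

    C has shape (N, M): N genes (rows) and M samples (columns).
    D[i][g, m] is the absolute difference between samples m and i on gene g.
    """
    # C[None, :, :] has shape (1, N, M); C.T[:, :, None] has shape (M, N, 1).
    return np.abs(C[None, :, :] - C.T[:, :, None])

# Toy matrix: 2 genes, 3 samples.
C = np.array([[1.0, 3.0, 6.0],
              [2.0, 2.0, 5.0]])
D = column_differences(C)
```

Column i of `D[i]` is all zeros, because a sample has no difference from itself.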

11
Methods: Step 2
  • Apply thresholds
  • Find a kind of emerging pattern that contrasts the 2
    classes
  • Suppose samples $1, 2, \ldots, j \in C_1$ and $j+1, j+2, \ldots, M \in C_2$
  • Sort the values in each column of $D_i(N,M)$
  • Apply a 25% threshold to the same-class differences and
    a 75% threshold to the different-class differences
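
The slides do not spell out how the thresholds are applied, so the following is only one possible interpretation: within each column of a difference matrix, mark a gene with 1 if its difference is at or below the column's 25th percentile (same-class column) or at or above the 75th percentile (different-class column). The function name `binarize_differences`, the percentile reading of "25" and "75", and the choice to zero out the reference column are all assumptions.

```python
import numpy as np

def binarize_differences(D_i, labels, i, low_pct=25, high_pct=75):
    """Binarize D_i (N genes x M samples), the differences from sample i."""
    same = labels == labels[i]                  # which columns share i's class
    low = np.percentile(D_i, low_pct, axis=0)   # per-column 25th percentile
    high = np.percentile(D_i, high_pct, axis=0) # per-column 75th percentile
    # Same class: small differences are marked; other class: large ones.
    B = np.where(same[None, :], D_i <= low, D_i >= high).astype(int)
    B[:, i] = 0  # the reference column is all zeros and carries no information
    return B

# Toy input: 4 genes, 3 samples; samples 0 and 1 are class 0, sample 2 is class 1.
labels = np.array([0, 0, 1])
D_0 = np.array([[0.0, 1.0, 9.0],
                [0.0, 2.0, 1.0],
                [0.0, 3.0, 8.0],
                [0.0, 4.0, 2.0]])
B_0 = binarize_differences(D_0, labels, i=0)
```

In this toy case only gene 0 is marked: it differs little from the same-class sample and strongly from the other-class sample.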

[Figure: sorted difference columns grouped by class (C1, C2), with the 25% and 75% thresholds marked]
12
Methods: Step 3
  • Extract class-specific features
  • Within-class summation of binary values (count
    1s)
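
A minimal sketch of the within-class count, assuming a binary gene-by-sample matrix and one class label per sample. The name `class_counts` and the toy values are hypothetical.

```python
import numpy as np

def class_counts(B, labels):
    """Count the 1s per gene within each class.

    B: binary matrix of shape (N genes, M samples).
    labels: class label for each of the M samples.
    Returns the sorted class labels and an (N, K) count matrix.
    """
    classes = np.unique(labels)
    counts = np.stack([B[:, labels == c].sum(axis=1) for c in classes], axis=1)
    return classes, counts

B = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 0]])
labels = np.array([0, 0, 1, 1])
classes, counts = class_counts(B, labels)
```

Row g of `counts` then summarizes how often gene g was marked within each class.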

13
Methods: Step 4
  • Gene selection
  • Apply a different threshold value for each
    class
  • Gene selection completes the row-wise
    reduction
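
As a sketch of the row-wise reduction, the code below keeps a gene when its count reaches the threshold of at least one class. The "at least one class" rule, the function name `select_genes`, and the threshold values are assumptions; the slides only state that each class has its own threshold.

```python
import numpy as np

def select_genes(counts, thresholds):
    """Return indices of genes whose count meets the threshold in any class.

    counts: (N genes, K classes) matrix of per-class counts.
    thresholds: length-K sequence, one threshold per class.
    """
    keep = (counts >= np.asarray(thresholds)[None, :]).any(axis=1)
    return np.flatnonzero(keep)

counts = np.array([[5, 0],
                   [1, 1],
                   [0, 4],
                   [2, 3]])
selected = select_genes(counts, thresholds=[4, 3])
reduced = counts[selected]  # rows kept after the reduction
```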

[Figure: per-class counts with the selection threshold marked]
14
Methods: Step 5
  • Column-wise reduction by clustering
  • Classification of samples
  • NMF method applied

15
Nonnegative Matrix Factorization (NMF)
  • Matrix factorization: $A \approx VH$
  • $A$: $n \times m$ matrix of n genes and m samples
  • $V$ ($n \times k$): the k columns of V are called basis
    vectors
  • $H$ ($k \times m$): describes how strongly each building
    block is present in each measurement
    vector
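
The factorization can be sketched with the multiplicative update rules of Lee and Seung for the squared-error (Frobenius) objective. This is a swapped-in, commonly used variant, not necessarily the one used in the presentation; the function name `nmf`, the iteration count, and the toy matrix are assumptions.

```python
import numpy as np

def nmf(A, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate nonnegative A (n x m) as V (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    V = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep V and H nonnegative at every step.
        H *= (V.T @ A) / (V.T @ V @ H + eps)
        V *= (A @ H.T) / (V @ H @ H.T + eps)
    return V, H

# Toy matrix with an exact rank-2 nonnegative factorization (4 genes, 3 samples).
A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 1.0, 3.0],
              [0.0, 3.0, 9.0]])
V, H = nmf(A, k=2)
rel_err = np.linalg.norm(A - V @ H) / np.linalg.norm(A)
```

The result depends on the random initialization, so in practice several restarts are often compared.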

16
NMF: Parts-based Clustering (Brunet2004)
  • Brunet et al. introduced the metagene concept

17
Experiments: Datasets
  • Leukemia Data
  • 5000 genes
  • 38 samples of two classes
  • 19 samples of ALL-B type and 8 samples of ALL-T type
  • 11 samples of AML type
  • Medulloblastoma Data
  • 5893 genes
  • 34 samples of two classes
  • 25 classic type and 9 desmoplastic
    medulloblastoma type
  • Central Nervous System Tumors Data
  • 7129 genes
  • 34 samples of four classes
  • 10 classic medulloblastomas, 10 malignant
    gliomas, 10 rhabdoids, and 4 normals

18
Classification
  • Given a target sample, its class is predicted by
    the highest value in k-dim column vector of H
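
The assignment rule above amounts to an argmax over each column of H. A minimal sketch follows, with a made-up H for k = 2 and three samples; the function name `predict_clusters` is hypothetical.

```python
import numpy as np

def predict_clusters(H):
    """Assign each sample (column of H) to the row with the largest value."""
    return np.argmax(H, axis=0)

# Toy coefficient matrix: k = 2 rows, 3 samples (columns).
H = np.array([[0.9, 0.1, 0.4],
              [0.1, 0.8, 0.6]])
assignments = predict_clusters(H)
```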

19
Results
  • Leukemia Data (ALL-T vs. ALL-B vs. AML)

20
Results
  • Medulloblastoma Data (Classic vs. Desmoplastic)

21
Results
  • Central Nervous System Tumors Data (4 classes)

22
Conclusions and Future Work
  • Our approach tries to capture a group of
    features but, in contrast to holistic methods
    such as PCA and ICA, preserves the intrinsic
    structure of the data distribution in the reduced space.
  • Still, PCA and ICA can be used as an aid to look
    into the structure of the data distribution and to provide
    useful information for further processing by
    other methods.
  • Our ongoing research is on how to combine
    PCA and ICA with the proposed method.

23
References
  • Wikipedia, http://en.wikipedia.org/wiki/Feature_selection
  • J.-P. Brunet, P. Tamayo, T. Golub, and J. P.
    Mesirov. Metagenes and molecular pattern
    discovery using matrix factorization. PNAS,
    101(12):4164-4169, 2004.
  • L. Yu and H. Liu. Feature selection for
    high-dimensional data: A fast correlation-based
    filter solution. In Proc. 20th Int. Conf. on Machine
    Learning (ICML-03), pages 856-863, 2003.
  • J. Biesiada and W. Duch. Feature selection for
    high-dimensional data: A Kolmogorov-Smirnov
    correlation-based filter solution. (CORES'05)
    Advances in Soft Computing, Springer Verlag, pp.
    95-104, 2005.
  • D. D. Lee and H. S. Seung. Learning the parts of
    objects by nonnegative matrix factorization.

24
Questions?