Feature Selection and Its Application in Genomic Data Analysis (March 9, 2004)

1
Feature Selection and Its Application in
Genomic Data Analysis March 9, 2004
  • Lei Yu
  • Arizona State University

2
Outline
  • Introduction to feature selection
  • Motivation
  • Problem statement
  • Key research issues
  • Application in genomic data analysis
  • Overview of data mining for microarray data
  • Gene selection
  • A case study
  • Current research directions

3
Motivation
  • An active field in
  • Pattern recognition
  • Machine learning
  • Data mining
  • Statistics
  • Benefits
  • Reducing dimensionality
  • Improving learning efficiency
  • Increasing predictive accuracy
  • Reducing complexity of learned results

4
Problem Statement
  • A process of selecting a minimum subset of
    features that is sufficient to construct a
    hypothesis consistent with the training examples
    (Almuallim and Dietterich, 1991)
  • Selecting a minimum subset G such that P(C|G) is
    equal, or as close as possible, to P(C|F)
    (Koller and Sahami, 1996)

5
An Example for the Problem
  • Data set
  • Five Boolean features
  • C is a Boolean function of F1 and F2
  • F3 = F2 , F5 = F4
  • Optimal subset
  • {F1, F2} or {F1, F3}
  • Combinatorial nature of searching for an optimal
    subset
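The combinatorial search above can be made concrete with a brute-force sketch of Almuallim and Dietterich's formulation: enumerate subsets by increasing size and return the first one consistent with all training examples. The Boolean target C = F1 AND F2 is an illustrative stand-in (the slide's exact function did not survive conversion), as is the toy data with F3 duplicating F2 and F5 duplicating F4.

```python
from itertools import combinations, product

def consistent(data, subset):
    """A subset is consistent if no two examples agree on the
    subset's features but disagree on the class label."""
    seen = {}
    for features, label in data:
        key = tuple(features[i] for i in subset)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

def minimal_consistent_subset(data, n_features):
    # Exhaustive search: try subsets in order of increasing size.
    # Exponential in n_features, hence the "combinatorial nature".
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if consistent(data, subset):
                return subset
    return None

# Hypothetical data mirroring the slide: five Boolean features where
# C depends only on F1 and F2, F3 duplicates F2, and F5 duplicates F4.
data = []
for f1, f2, f4 in product([0, 1], repeat=3):
    features = (f1, f2, f2, f4, f4)   # F3 = F2, F5 = F4
    label = f1 & f2                   # illustrative target: C = F1 AND F2
    data.append((features, label))

print(minimal_consistent_subset(data, 5))  # → (0, 1), i.e. {F1, F2}
```

Because F3 carries the same information as F2, the search would also accept {F1, F3}; ties like this are why optimal subsets need not be unique.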

6
Subset Search
  • An example of search space (Kohavi and John,
    1997)

7
Evaluation Measures
  • Wrapper model
  • Relying on a predetermined classification
    algorithm
  • Using predictive accuracy as goodness measure
  • High accuracy, computationally expensive
  • Filter model
  • Separating feature selection from classifier
    learning
  • Relying on general characteristics of data
    (distance, correlation, consistency)
  • No bias toward any learning algorithm, fast
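The wrapper model's "predictive accuracy as goodness measure" can be sketched as follows, using leave-one-out evaluation with a 1-nearest-neighbour classifier standing in for the predetermined learning algorithm; the data and feature indices are invented for illustration.

```python
def loo_accuracy(X, y, subset):
    """Wrapper-style goodness: leave-one-out accuracy of a
    1-nearest-neighbour classifier restricted to the chosen subset."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for j in range(len(X)):
            if j == i:
                continue
            # Distance computed only over the candidate feature subset.
            dist = sum((X[i][k] - X[j][k]) ** 2 for k in subset)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 5.1), (0.2, 1.3), (0.1, 4.2), (1.0, 0.9), (1.2, 3.8), (0.9, 2.0)]
y = [0, 0, 0, 1, 1, 1]
print(loo_accuracy(X, y, [0]))      # informative feature alone: perfect
print(loo_accuracy(X, y, [0, 1]))   # adding the noise feature hurts
```

Every candidate subset requires retraining and re-evaluating the classifier, which is why the wrapper model is accurate but computationally expensive; a filter would instead score subsets from data characteristics alone.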

8
A Framework for Algorithms
9
Feature Ranking
  • Weighting and ranking individual features
  • Selecting top-ranked ones for feature selection
  • Advantages
  • Efficient: O(N) in the number of features N
  • Easy to implement
  • Disadvantages
  • Hard to determine the threshold
  • Unable to consider correlation between features
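A minimal sketch of feature ranking as described above: score each feature independently (here by absolute Pearson correlation with the class, one of several possible measures) and keep the top-k. The data is hypothetical.

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def rank_features(X, y, top_k):
    """Score each feature independently, then keep the top-k.
    O(N) passes over the data; ignores correlation between features,
    which is the disadvantage noted on the slide."""
    n_features = len(X[0])
    scores = []
    for f in range(n_features):
        column = [row[f] for row in X]
        scores.append((abs(pearson(column, y)), f))
    scores.sort(reverse=True)
    return [f for _, f in scores[:top_k]]

X = [(1.0, 0.3, 9.0), (0.9, 0.7, 2.0), (0.1, 0.4, 8.0), (0.2, 0.8, 3.0)]
y = [1, 1, 0, 0]
print(rank_features(X, y, top_k=1))  # → [0]
```

Note that `top_k` is exactly the threshold the slide flags as hard to determine, and two perfectly redundant features would both be ranked highly.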

10
Applications of Feature Selection
  • Text categorization
  • Yang and Pederson, 1997 (CMU)
  • Forman, 2003 (HP Labs)
  • Image retrieval
  • Swets and Weng, 1995 (MSU)
  • Dy et al, 2003 (Purdue University)
  • Gene expression microarray data analysis
  • Xing et al, 2001 (UC Berkeley)
  • Lee et al, 2003 (Texas A&M)
  • Customer relationship management
  • Ng and Liu, 2000 (NUS)
  • Intrusion detection
  • Lee et al, 2000 (Columbia University)

11
Microarray Technology
  • Enabling simultaneous measurement of the
    expression levels of thousands or tens of
    thousands of genes in a single experiment
  • Providing new opportunities and challenges for
    data mining

  Gene        Value
  M23197_at   261
  U66497_at   88
  M92287_at   4778
  ...         ...
12
Two Ways to View Microarray Data
13
Data Mining Tasks
14
Gene Selection
  • Data characteristics in sample classification
  • High dimensionality (thousands of genes)
  • Small sample size (often fewer than 100 samples)
  • Problems
  • Curse of dimensionality
  • Overfitting the training data
  • Traditional gene selection methods
  • Within the filter model
  • Gene ranking

15
A Case Study (Golub et al., 1999)
  • Leukemia data
  • 7129 genes, 72 samples
  • Training 38 (27 ALL, 11 AML)
  • Test 34 (20 ALL, 14 AML)
  • Normalization
  • Mean 0
  • Standard deviation 1
  • Correlation measure
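The normalization and correlation measure on this slide can be sketched as follows. Golub et al.'s measure is commonly described as a signal-to-noise statistic, (mean1 - mean2)/(std1 + std2), computed per gene between the two classes; the expression values and sample labels below are invented for illustration.

```python
def normalize(values):
    """Scale one gene's expression values to mean 0, std 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def signal_to_noise(expr, labels):
    """Per-gene class-correlation measure in the style of Golub et al.:
    (mean_ALL - mean_AML) / (std_ALL + std_AML)."""
    g1 = [e for e, l in zip(expr, labels) if l == "ALL"]
    g2 = [e for e, l in zip(expr, labels) if l == "AML"]
    def stats(g):
        m = sum(g) / len(g)
        s = (sum((v - m) ** 2 for v in g) / len(g)) ** 0.5
        return m, s
    m1, s1 = stats(g1)
    m2, s2 = stats(g2)
    return (m1 - m2) / (s1 + s2)

# One hypothetical gene across six samples.
expr = normalize([261, 88, 4778, 150, 3200, 2900])
labels = ["ALL", "ALL", "AML", "ALL", "AML", "AML"]
print(signal_to_noise(expr, labels))  # large |value| = informative gene
```

Genes are then ranked by the absolute value of this score and the top ones selected, which is exactly the gene-ranking scheme whose limitations the next slide lists.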

16
Case Study (continued)
  • Performance of selected genes
  • Accuracy on training set: 36 out of 38 (94.74%)
    correctly classified
  • Accuracy on test set: 29 out of 34 (85.29%)
    correctly classified
  • Limitations
  • Domain knowledge required to determine the number
    of genes selected
  • Unable to remove redundant genes

17
Feature/Gene Redundancy
  • Examining redundant genes
  • Two heads are not necessarily better than one
  • Effects of redundant genes
  • How to handle redundancy
  • A challenge
  • Some latest work
  • MRMR (Maximum Relevance Minimum Redundancy) (Ding
    and Peng, CSB-2003)
  • FCBF (Fast Correlation Based Filter) (Yu and Liu,
    ICML-2003)
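The redundancy-handling idea behind FCBF can be sketched in simplified form: walk the features in decreasing order of relevance to the class, and drop a feature whenever some already-kept feature is more similar to it than it is relevant to the class (an approximation of FCBF's Markov-blanket criterion). The relevance and similarity scores below are hypothetical placeholders for correlation measures such as symmetrical uncertainty.

```python
def remove_redundant(relevance, similarity, order):
    """Simplified FCBF-style redundancy filter: keep a feature only if
    no already-kept feature dominates it (is more similar to it than
    the feature is relevant to the class)."""
    kept = []
    for f in order:
        if all(similarity(f, k) < relevance[f] for k in kept):
            kept.append(f)
    return kept

# Hypothetical scores: F0 and F1 are both relevant, but F1 duplicates F0;
# F2 is weakly relevant yet non-redundant.
relevance = {0: 0.9, 1: 0.8, 2: 0.5}
sim = {(0, 1): 0.95, (1, 0): 0.95, (0, 2): 0.1, (2, 0): 0.1,
       (1, 2): 0.1, (2, 1): 0.1}

def similarity(a, b):
    return sim[(a, b)]

order = sorted(relevance, key=relevance.get, reverse=True)  # [0, 1, 2]
print(remove_redundant(relevance, similarity, order))  # → [0, 2]
```

Pure ranking would have kept F1 alongside F0 ("two heads are not necessarily better than one"); the redundancy pass removes it while retaining the weaker but complementary F2.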

18
Research Directions
  • Feature selection for unlabeled data
  • Shares core issues with the labeled case
  • Differs in having no class labels to guide evaluation
  • Dealing with different data types
  • Nominal, discrete, continuous
  • Discretization
  • Dealing with large size data
  • Comparative study and intelligent selection of
    feature selection methods

19
References
  • G. John, R. Kohavi, and K. Pfleger. Irrelevant
    features and the subset selection problem.
    ICML-1994.
  • L. Yu and H. Liu. Feature selection for
    high-dimensional data: a fast correlation-based
    filter solution. ICML-2003.
  • T. R. Golub et al. Molecular classification of
    cancer: class discovery and class prediction by
    gene expression monitoring. Science-1999.
  • C. Ding and H. Peng. Minimum redundancy feature
    selection from microarray gene expression data.
    CSB-2003.
  • J. Shavlik and D. Page. Machine learning and
    genetic microarrays. ICML-2003 tutorial.
    http://www.cs.wisc.edu/dpage/ICML-2003-Tutorial-Shavlik-Page.ppt