Title: An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification


1
An Evaluation of Gene Selection Methods for
Multi-class Microarray Data Classification
  • by Carlotta Domeniconi and Hong Chai

2
Outline
  • Introduction to microarray data
  • Problem description
  • Related work
  • Our methods
  • Experimental analysis
  • Results
  • Conclusion and future work

3
Microarray
  • Measures gene expression levels across different
    conditions, times or tissue samples
  • Gene expression levels inform cell activity and
    disease status
  • Microarray data can distinguish between tumor
    types, define new subtypes, predict prognostic
    outcomes, identify candidate drugs, assess drug
    toxicity, etc.

4
Microarray Data
  • A matrix of measurements: rows are gene
    expression levels; columns are samples/conditions
    (a minimal sketch of this layout follows).
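  A minimal numpy sketch of this layout (the dataset sizes are borrowed
  from the Yeast dataset on the Datasets slide; the random values are
  placeholders, not real expression data):

    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_samples = 5775, 80              # sizes from the Yeast dataset
    X = rng.normal(size=(n_genes, n_samples))  # X[i, j]: expression of gene i in sample j
    y = rng.integers(1, 4, size=n_samples)     # one class label in {1, 2, 3} per sample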

5
Example: Lymphoma Dataset
6
Microarray data analysis
  • Clustering is applied to genes, to identify
    genes with similar functions or genes that
    participate in similar biological processes, or
    to samples, to find potential tumor subclasses.
  • Classification builds a model to predict
    diseased samples; it has diagnostic value.

7
Classification Problem
  • Large number of genes (features) - up to 20,000
    features per dataset.
  • Small number of experiments (samples) - at most
    a few hundred, usually fewer than 100 samples.
  • The need to identify marker genes to classify
    tissue types, e.g. to diagnose cancer - a
    feature selection problem.

8
Our Focus
  • Binary classification and feature selection
    methods have been studied extensively; the
    multi-class case has received little attention.
  • In practice, many microarray datasets have more
    than two categories of samples.
  • We focus on multi-class gene ranking and
    selection.

9
Related Work
  • Some criteria used in feature ranking
  • Correlation coefficient
  • Information gain
  • Chi-squared
  • SVM-RFE

10
Notation
  • Given C classes
  • m observations (samples or patients)
  • n feature measurements (gene expressions)
  • class labels y ∈ {1, ..., C}

11
Correlation Coefficient
  • Two-class problem, y ∈ {-1, +1}
  • Ranking criterion defined in Golub:
    wj = (µj+ - µj-) / (sj+ + sj-)
  • where µj+, µj- are the means and sj+, sj- the
    standard deviations along dimension j in the +
    and - classes. A large |wj| indicates a
    discriminant feature.

12
Fisher's Score
  • Fisher's criterion score in Pavlidis (both
    two-class scores are sketched below):
    F(j) = (µj+ - µj-)² / ((sj+)² + (sj-)²)
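  A hedged sketch of both two-class scores, assuming the (genes x
  samples) matrix X and labels y in {-1, +1} from the layout above; the
  function names are ours, not from the paper:

    import numpy as np

    def golub_score(X, y):
        # Golub signal-to-noise ratio wj per gene (row of X).
        pos, neg = X[:, y == +1], X[:, y == -1]
        mu_p, mu_n = pos.mean(axis=1), neg.mean(axis=1)
        s_p, s_n = pos.std(axis=1), neg.std(axis=1)
        return (mu_p - mu_n) / (s_p + s_n)

    def fisher_score(X, y):
        # Fisher criterion: squared mean gap over summed class variances.
        pos, neg = X[:, y == +1], X[:, y == -1]
        mu_p, mu_n = pos.mean(axis=1), neg.mean(axis=1)
        return (mu_p - mu_n) ** 2 / (pos.var(axis=1) + neg.var(axis=1))

    # Rank genes by score magnitude, e.g.:
    # top50 = np.argsort(-np.abs(golub_score(X, y)))[:50]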

13
Assumptions of the above methods
  • Features are analyzed in isolation; correlations
    between features are not considered.
  • Assumption: features are independent of each
    other.
  • Implication: redundant genes may be selected
    into the top subset.

14
Information Gain
  • A measure of the effectiveness of a feature in
    classifying the training data.
  • Expected reduction in entropy caused by
    partitioning the data according to this feature:
    IG(S, A) = E(S) - Σv∈V(A) (|Sv|/|S|) E(Sv)
  • where V(A) is the set of all possible values of
    feature A, and Sv is the subset of S for which
    feature A has value v.

15
Information Gain
  • E(S) is the entropy of the entire set S:
    E(S) = - Σi (|Ci|/|S|) log2(|Ci|/|S|)
  • where |Ci| is the number of training examples in
    class Ci, and |S| is the cardinality of the
    entire set S (see the sketch below).
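  A sketch of both quantities. It assumes the feature has already been
  discretized into a small set of values (for continuous expression
  levels this needs a binning step, as in the Chi-squared method below):

    import numpy as np

    def entropy(y):
        # E(S) = -sum_i (|Ci|/|S|) log2(|Ci|/|S|)
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(feature_values, y):
        # Expected entropy reduction from partitioning S by this feature.
        gain = entropy(y)
        for v in np.unique(feature_values):
            mask = feature_values == v
            gain -= mask.mean() * entropy(y[mask])   # |Sv|/|S| * E(Sv)
        return gain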

16
Chi-squared
  • Measures features individually
  • Continuous-valued features are discretized into
    intervals
  • Form a matrix A, where Aij is the number of
    samples of class Ci within the j-th interval.
  • Let CIj be the number of samples in the j-th
    interval

17
Chi-squared
  • The expected frequency of Aij is
    Eij = CIj · |Ci| / m
  • The Chi-squared statistic of a feature is
    defined as
    χ² = Σi=1..C Σj=1..I (Aij - Eij)² / Eij
  • where I is the number of intervals. The larger
    the statistic, the more informative the feature
    is (a sketch follows).
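  A sketch of the per-feature statistic. Equal-width binning is an
  assumption; the slides only say the continuous values are discretized
  into intervals:

    import numpy as np

    def chi_squared(feature_values, y, n_intervals=5):
        edges = np.linspace(feature_values.min(), feature_values.max(),
                            n_intervals + 1)
        interval = np.digitize(feature_values, edges[1:-1])  # 0 .. I-1
        m = len(y)
        stat = 0.0
        for ci in np.unique(y):
            for j in range(n_intervals):
                A_ij = np.sum((y == ci) & (interval == j))          # observed
                E_ij = np.sum(interval == j) * np.sum(y == ci) / m  # expected
                if E_ij > 0:
                    stat += (A_ij - E_ij) ** 2 / E_ij
        return stat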

18
SVM-RFE
  • Recursive Feature Elimination using SVM
  • The linear SVM model on the full feature set
    predicts sign(w · x + b)
  • where w is a vector of weights (one per
    feature), x is an input instance, and b is a
    threshold.
  • If wi = 0, feature Xi does not influence
    classification and can be eliminated from the
    set of features.

19
SVM-RFE
  • 1. Train a linear SVM on the full feature set to
    obtain the weight vector w.
  • 2. Sort the features in descending order of
    weight; a percentage of the lowest-weighted
    features is eliminated.
  • 3. A new linear SVM is built using the new set
    of features. Repeat the process.
  • 4. The best feature subset is chosen (a sketch
    of this loop follows).
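  A hedged sketch of the loop using scikit-learn's LinearSVC (the exact
  SVM settings and elimination percentage were not given on the slides;
  10% per round and the multi-class weight aggregation are our
  assumptions, since the original SVM-RFE is a two-class method):

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_rfe(X, y, n_keep=100, drop_frac=0.10):
        # X: (n_samples, n_features); returns surviving feature indices.
        remaining = np.arange(X.shape[1])
        while len(remaining) > n_keep:
            svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y)
            # coef_ is (n_classes, n_features) for C > 2; aggregate the
            # squared weights per feature across the class-wise hyperplanes.
            importance = (svm.coef_ ** 2).sum(axis=0)
            n_drop = min(max(1, int(drop_frac * len(remaining))),
                         len(remaining) - n_keep)
            keep = np.argsort(importance)[n_drop:]   # drop the lowest-weighted
            remaining = remaining[np.sort(keep)]
        return remaining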

20
Other criteria
  • The Brown-Forsythe, the Cochran, and the Welch
    test statistics used in Chen, et al.
  • (Extensions of the t-statistic used in the
    two-class classification problem.)
  • PCA
  • (Disadvantage: new dimensions are formed, and
    none of the original features can be discarded;
    therefore PCA can't identify marker genes.)

21
Our Ranking Methods
  • BScatter
  • MinMax
  • bSum
  • bMax
  • bMin
  • Combined

22
Notation
  • For each class i and each feature j, we define
    the mean value of feature j for class Ci:
    µj,i = (1/|Ci|) Σx∈Ci xj
  • Define the total mean along feature j:
    µj = (1/m) Σx xj

23
Notation
  • Define the between-class scatter along feature
    j:
    bj = Σi=1..C (µj,i - µj)²

24
Function 1 BScatter
  • Fisher discriminant analysis for multiple
    classes under the feature independence
    assumption. It credits the largest score to the
    feature that maximizes the ratio of the
    between-class scatter to the within-class
    scatter:
    BScatter(j) = bj / Σi=1..C sj,i
  • where sj,i is the standard deviation of class i
    along feature j

25
Function 2 MinMax
  • Favors features along which the largest
    mean-class difference is large and the
    within-class variance is small.

26
Function 3 bSum
  • For each feature j, we sort the C values µj,i in
    non-decreasing order: µj,1 ≤ µj,2 ≤ ... ≤ µj,C
  • Define bj,l = µj,l+1 - µj,l
  • bSum rewards the features with large distances
    between adjacent mean class values

27
Function 4 bMax
  • Rewards features j with a large maximum
    between-neighbor-class mean difference,
    bMax(j) = maxl bj,l

28
Function 5 bMin
  • Favors the features with a large smallest
    between-neighbor-class mean difference,
    bMin(j) = minl bj,l

29
Function 6 Comb
  • Considers a score function which combines MinMax
    and bMin (all six score functions are sketched
    below)
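  A hedged sketch of all six score functions. The BScatter ratio follows
  the formulas above; the MinMax normalization, the unnormalized gap
  scores, and the Comb combination rule (a plain sum) are literal
  readings of the verbal definitions and may differ from the original
  slide formulas:

    import numpy as np

    def multiclass_scores(X, y):
        # X: (n_genes, n_samples); y: labels in {1, ..., C}.
        classes = np.unique(y)
        mu = np.stack([X[:, y == c].mean(axis=1) for c in classes], axis=1)
        sd = np.stack([X[:, y == c].std(axis=1) for c in classes], axis=1)
        mu_tot = X.mean(axis=1)                        # total mean per feature
        between = ((mu - mu_tot[:, None]) ** 2).sum(axis=1)   # bj
        within = sd.sum(axis=1)                        # summed class std devs
        gaps = np.diff(np.sort(mu, axis=1), axis=1)    # bj,l: sorted-mean gaps
        scores = {
            "BScatter": between / within,
            "MinMax": (mu.max(axis=1) - mu.min(axis=1)) / within,
            "bSum": gaps.sum(axis=1),
            "bMax": gaps.max(axis=1),
            "bMin": gaps.min(axis=1),
        }
        scores["Comb"] = scores["MinMax"] + scores["bMin"]  # assumed combination
        return scores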

30
Datasets
Dataset    Samples  Genes  Classes  Comment
MLL        72       12582  3        Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma   88       4026   6        Samples per class: 46 in DLBCL, 11 in CLL, 9 in FL (malignant classes), 11 in ABB, 6 in RAT, and 6 in TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast      80       5775   3
NCI60      61       1155   8        Available at http://rana.lbl.gov/
31
Experiment Design
  • Gene expression values scaled to [-1, 1]
  • Compared 9 feature selection methods
  • (the 6 proposed scores, Chi-squared, Information
    Gain, and SVM-RFE)
  • Obtained subsets of top-ranked genes to train
    SVM classifiers
  • (3 kernel functions: linear, 2-degree
    polynomial, Gaussian; soft-margin parameter in
    [1, 100]; Gaussian kernel parameter in
    [0.001, 2])
  • Leave-one-out cross validation due to the small
    sample size
  • One-vs-one multi-class classification
    implemented on LIBSVM (see the sketch below)
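  A hedged sketch of the evaluation loop; scikit-learn's SVC wraps
  LIBSVM and uses one-vs-one multi-class classification, matching the
  setup above. X_sel is assumed to hold the samples (as rows, the
  transpose of the genes-x-samples layout) restricted to a top-ranked
  gene subset and scaled to [-1, 1]:

    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loocv_accuracy(X_sel, y, kernel="linear", C=1.0,
                       gamma=0.001, degree=2):
        # degree only matters for kernel="poly"; gamma for "rbf"/"poly".
        clf = SVC(kernel=kernel, C=C, gamma=gamma, degree=degree)
        return cross_val_score(clf, X_sel, y, cv=LeaveOneOut()).mean()

    # e.g. loocv_accuracy(X_sel, y, kernel="rbf", C=100, gamma=0.001)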

32
Results: MLL Dataset
33
Results: Lymphoma Dataset
34
Conclusions
  • SVM classification benefits from gene selection
  • Gene ranking with correlation coefficients gives
    higher accuracy than SVM-RFE in low dimensions
    on most data sets. The best-performing
    correlation score varies from problem to problem
  • Although SVM-RFE shows excellent performance in
    general, there is no clear winner. The
    performance of feature selection methods seems
    to be problem-dependent

35
Conclusions
  • For a given classification model, different gene
    selection methods reach the best performance for
    different feature set sizes
  • Very high accuracy was achieved on all the data
    sets studied here. In many cases perfect accuracy
    (based on leave-one-out error) was achieved
  • The NCI60 dataset [17] shows lower accuracy
    values. This dataset has the largest number of
    classes (eight) and smaller sample sizes per
    class. SVM-RFE handles this case well, achieving
    96.72% accuracy with 100 selected genes and a
    linear kernel. The gap in accuracy between
    SVM-RFE and the other gene ranking methods is
    highest for this dataset (ca. 11.5%).

36
Limitations and Future Work
  • The selection of features over the whole
    training set induces a bias in the results. We
    will study how to assess and correct this bias
    in future experiments.
  • We will take the correlation between pairs of
    selected features into consideration. The
    ranking methods will be modified so that the
    correlation between selected features stays
    below a certain threshold.
  • Evaluate the top-ranked genes in our research
    against marker genes identified in other
    studies.