Title: An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification


1
An Evaluation of Gene Selection Methods for
Multi-class Microarray Data Classification
  • by Carlotta Domeniconi and Hong Chai

2
Outline
  • Introduction to microarray data
  • Problem description
  • Related work
  • Our methods
  • Experimental analysis
  • Results
  • Conclusion and future work

3
Microarray
  • Measures gene expression levels across different
    conditions, times or tissue samples
  • Gene expression levels inform cell activity and
    disease status
  • Microarray data can distinguish between tumor
    types, define new subtypes, predict prognostic
    outcomes, identify candidate drugs, assess drug
    toxicity, etc.

4
Microarray Data
  • A matrix of measurements: rows are gene
    expression levels; columns are samples/conditions
    (a minimal sketch of this layout follows).
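  A minimal numpy sketch of this layout (the dataset sizes are borrowed
  from the Yeast dataset on the Datasets slide; the random values are
  placeholders, not real expression data):

    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_samples = 5775, 80              # sizes from the Yeast dataset
    X = rng.normal(size=(n_genes, n_samples))  # X[i, j]: expression of gene i in sample j
    y = rng.integers(1, 4, size=n_samples)     # one class label in {1, 2, 3} per sample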

5
Example: Lymphoma Dataset
6
Microarray data analysis
  • Clustering is applied to genes, to identify
    genes with similar functions or genes that
    participate in similar biological processes, or
    to samples, to find potential tumor subclasses.
  • Classification builds a model to predict
    diseased samples; it has diagnostic value.

7
Classification Problem
  • Large number of genes (features) - up to 20,000
    features per dataset.
  • Small number of experiments (samples) - at most
    a few hundred, usually fewer than 100 samples.
  • The need to identify marker genes to classify
    tissue types, e.g. to diagnose cancer - a
    feature selection problem.

8
Our Focus
  • Binary classification and feature selection
    methods have been studied extensively; the
    multi-class case has received little attention.
  • In practice, many microarray datasets have more
    than two categories of samples.
  • We focus on multi-class gene ranking and
    selection.

9
Related Work
  • Some criteria used in feature ranking
  • Correlation coefficient
  • Information gain
  • Chi-squared
  • SVM-RFE

10
Notation
  • Given C classes
  • m observations (samples or patients)
  • n feature measurements (gene expressions)
  • class labels y ∈ {1, ..., C}

11
Correlation Coefficient
  • Two-class problem, y ∈ {-1, +1}
  • Ranking criterion defined in Golub:
    wj = (µj+ - µj-) / (sj+ + sj-)
  • where µj+, µj- are the means and sj+, sj- the
    standard deviations along dimension j in the +
    and - classes. A large |wj| indicates a
    discriminant feature.

12
Fisher's Score
  • Fisher's criterion score in Pavlidis (both
    two-class scores are sketched below):
    F(j) = (µj+ - µj-)² / ((sj+)² + (sj-)²)
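  A hedged sketch of both two-class scores, assuming the (genes x
  samples) matrix X and labels y in {-1, +1} from the layout above; the
  function names are ours, not from the paper:

    import numpy as np

    def golub_score(X, y):
        # Golub signal-to-noise ratio wj per gene (row of X).
        pos, neg = X[:, y == +1], X[:, y == -1]
        mu_p, mu_n = pos.mean(axis=1), neg.mean(axis=1)
        s_p, s_n = pos.std(axis=1), neg.std(axis=1)
        return (mu_p - mu_n) / (s_p + s_n)

    def fisher_score(X, y):
        # Fisher criterion: squared mean gap over summed class variances.
        pos, neg = X[:, y == +1], X[:, y == -1]
        mu_p, mu_n = pos.mean(axis=1), neg.mean(axis=1)
        return (mu_p - mu_n) ** 2 / (pos.var(axis=1) + neg.var(axis=1))

    # Rank genes by score magnitude, e.g.:
    # top50 = np.argsort(-np.abs(golub_score(X, y)))[:50]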

13
Assumptions of the above methods
  • Features are analyzed in isolation; correlations
    between features are not considered.
  • Assumption: features are independent of each
    other.
  • Implication: redundant genes may be selected
    into the top subset.

14
Information Gain
  • A measure of the effectiveness of a feature in
    classifying the training data.
  • Expected reduction in entropy caused by
    partitioning the data according to this feature:
    IG(S, A) = E(S) - Σv∈V(A) (|Sv|/|S|) E(Sv)
  • where V(A) is the set of all possible values of
    feature A, and Sv is the subset of S for which
    feature A has value v.

15
Information Gain
  • E(S) is the entropy of the entire set S:
    E(S) = - Σi (|Ci|/|S|) log2(|Ci|/|S|)
  • where |Ci| is the number of training examples in
    class Ci, and |S| is the cardinality of the
    entire set S (see the sketch below).
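  A sketch of both quantities. It assumes the feature has already been
  discretized into a small set of values (for continuous expression
  levels this needs a binning step, as in the Chi-squared method below):

    import numpy as np

    def entropy(y):
        # E(S) = -sum_i (|Ci|/|S|) log2(|Ci|/|S|)
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(feature_values, y):
        # Expected entropy reduction from partitioning S by this feature.
        gain = entropy(y)
        for v in np.unique(feature_values):
            mask = feature_values == v
            gain -= mask.mean() * entropy(y[mask])   # |Sv|/|S| * E(Sv)
        return gain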

16
Chi-squared
  • Measures features individually
  • Continuous-valued features are discretized into
    intervals
  • Form a matrix A, where Aij is the number of
    samples of class Ci within the j-th interval.
  • Let CIj be the number of samples in the j-th
    interval

17
Chi-squared
  • The expected frequency of Aij is
    Eij = CIj · |Ci| / m
  • The Chi-squared statistic of a feature is
    defined as
    χ² = Σi=1..C Σj=1..I (Aij - Eij)² / Eij
  • where I is the number of intervals. The larger
    the statistic, the more informative the feature
    is (a sketch follows).
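  A sketch of the per-feature statistic. Equal-width binning is an
  assumption; the slides only say the continuous values are discretized
  into intervals:

    import numpy as np

    def chi_squared(feature_values, y, n_intervals=5):
        edges = np.linspace(feature_values.min(), feature_values.max(),
                            n_intervals + 1)
        interval = np.digitize(feature_values, edges[1:-1])  # 0 .. I-1
        m = len(y)
        stat = 0.0
        for ci in np.unique(y):
            for j in range(n_intervals):
                A_ij = np.sum((y == ci) & (interval == j))          # observed
                E_ij = np.sum(interval == j) * np.sum(y == ci) / m  # expected
                if E_ij > 0:
                    stat += (A_ij - E_ij) ** 2 / E_ij
        return stat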

18
SVM-RFE
  • Recursive Feature Elimination using SVM
  • The linear SVM model on the full feature set
    predicts sign(w · x + b)
  • where w is a vector of weights (one per
    feature), x is an input instance, and b is a
    threshold.
  • If wi = 0, feature Xi does not influence
    classification and can be eliminated from the
    set of features.

19
SVM-RFE
  • 1. Train a linear SVM on the full feature set to
    obtain the weight vector w.
  • 2. Sort the features in descending order of
    weight; a percentage of the lowest-weighted
    features is eliminated.
  • 3. A new linear SVM is built using the new set
    of features. Repeat the process.
  • 4. The best feature subset is chosen (a sketch
    of this loop follows).
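  A hedged sketch of the loop using scikit-learn's LinearSVC (the exact
  SVM settings and elimination percentage were not given on the slides;
  10% per round and the multi-class weight aggregation are our
  assumptions, since the original SVM-RFE is a two-class method):

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_rfe(X, y, n_keep=100, drop_frac=0.10):
        # X: (n_samples, n_features); returns surviving feature indices.
        remaining = np.arange(X.shape[1])
        while len(remaining) > n_keep:
            svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y)
            # coef_ is (n_classes, n_features) for C > 2; aggregate the
            # squared weights per feature across the class-wise hyperplanes.
            importance = (svm.coef_ ** 2).sum(axis=0)
            n_drop = min(max(1, int(drop_frac * len(remaining))),
                         len(remaining) - n_keep)
            keep = np.argsort(importance)[n_drop:]   # drop the lowest-weighted
            remaining = remaining[np.sort(keep)]
        return remaining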

20
Other criteria
  • The Brown-Forsythe, the Cochran, and the Welch
    test statistics used in Chen, et al.
  • (Extensions of the t-statistic used in the
    two-class classification problem.)
  • PCA
  • (Disadvantage: new dimensions are formed, and
    none of the original features can be discarded;
    therefore PCA can't identify marker genes.)

21
Our Ranking Methods
  • BScatter
  • MinMax
  • bSum
  • bMax
  • bMin
  • Combined

22
Notation
  • For each class i and each feature j, we define
    the mean value of feature j for class Ci:
    µj,i = (1/|Ci|) Σx∈Ci xj
  • Define the total mean along feature j:
    µj = (1/m) Σx xj

23
Notation
  • Define the between-class scatter along feature
    j:
    bj = Σi=1..C (µj,i - µj)²

24
Function 1 BScatter
  • Fisher discriminant analysis for multiple
    classes under the feature independence
    assumption. It credits the largest score to the
    feature that maximizes the ratio of the
    between-class scatter to the within-class
    scatter:
    BScatter(j) = bj / Σi=1..C sj,i
  • where sj,i is the standard deviation of class i
    along feature j

25
Function 2 MinMax
  • Favors features along which the largest
    mean-class difference is large and the
    within-class variance is small.

26
Function 3 bSum
  • For each feature j, we sort the C values µj,i in
    non-decreasing order: µj,1 ≤ µj,2 ≤ ... ≤ µj,C
  • Define bj,l = µj,l+1 - µj,l
  • bSum rewards the features with large distances
    between adjacent mean class values

27
Function 4 bMax
  • Rewards features j with a large maximum
    between-neighbor-class mean difference,
    bMax(j) = maxl bj,l

28
Function 5 bMin
  • Favors the features with a large smallest
    between-neighbor-class mean difference,
    bMin(j) = minl bj,l

29
Function 6 Comb
  • Considers a score function which combines MinMax
    and bMin (all six score functions are sketched
    below)
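  A hedged sketch of all six score functions. The BScatter ratio follows
  the formulas above; the MinMax normalization, the unnormalized gap
  scores, and the Comb combination rule (a plain sum) are literal
  readings of the verbal definitions and may differ from the original
  slide formulas:

    import numpy as np

    def multiclass_scores(X, y):
        # X: (n_genes, n_samples); y: labels in {1, ..., C}.
        classes = np.unique(y)
        mu = np.stack([X[:, y == c].mean(axis=1) for c in classes], axis=1)
        sd = np.stack([X[:, y == c].std(axis=1) for c in classes], axis=1)
        mu_tot = X.mean(axis=1)                        # total mean per feature
        between = ((mu - mu_tot[:, None]) ** 2).sum(axis=1)   # bj
        within = sd.sum(axis=1)                        # summed class std devs
        gaps = np.diff(np.sort(mu, axis=1), axis=1)    # bj,l: sorted-mean gaps
        scores = {
            "BScatter": between / within,
            "MinMax": (mu.max(axis=1) - mu.min(axis=1)) / within,
            "bSum": gaps.sum(axis=1),
            "bMax": gaps.max(axis=1),
            "bMin": gaps.min(axis=1),
        }
        scores["Comb"] = scores["MinMax"] + scores["bMin"]  # assumed combination
        return scores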

30
Datasets
Dataset    Samples  Genes  Classes  Comment
MLL        72       12582  3        Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma   88       4026   6        Samples per class: 46 in DLBCL, 11 in CLL, 9 in FL (malignant classes), 11 in ABB, 6 in RAT, and 6 in TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast      80       5775   3
NCI60      61       1155   8        Available at http://rana.lbl.gov/
31
Experiment Design
  • Gene expression values scaled to [-1, 1]
  • Compared 9 feature selection methods
  • (the 6 proposed scores, Chi-squared, Information
    Gain, and SVM-RFE)
  • Obtained subsets of top-ranked genes to train
    SVM classifiers
  • (3 kernel functions: linear, 2-degree
    polynomial, Gaussian; soft-margin parameter in
    [1, 100]; Gaussian kernel parameter in
    [0.001, 2])
  • Leave-one-out cross validation due to the small
    sample size
  • One-vs-one multi-class classification
    implemented on LIBSVM (see the sketch below)
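  A hedged sketch of the evaluation loop; scikit-learn's SVC wraps
  LIBSVM and uses one-vs-one multi-class classification, matching the
  setup above. X_sel is assumed to hold the samples (as rows, the
  transpose of the genes-x-samples layout) restricted to a top-ranked
  gene subset and scaled to [-1, 1]:

    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loocv_accuracy(X_sel, y, kernel="linear", C=1.0,
                       gamma=0.001, degree=2):
        # degree only matters for kernel="poly"; gamma for "rbf"/"poly".
        clf = SVC(kernel=kernel, C=C, gamma=gamma, degree=degree)
        return cross_val_score(clf, X_sel, y, cv=LeaveOneOut()).mean()

    # e.g. loocv_accuracy(X_sel, y, kernel="rbf", C=100, gamma=0.001)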

32
Results: MLL Dataset
33
Results: Lymphoma Dataset
34
Conclusions
  • SVM classification benefits from gene selection
  • Gene ranking with correlation coefficients gives
    higher accuracy than SVM-RFE in low dimensions
    on most data sets. The best-performing
    correlation score varies from problem to problem
  • Although SVM-RFE shows excellent performance in
    general, there is no clear winner. The
    performance of feature selection methods seems
    to be problem-dependent

35
Conclusions
  • For a given classification model, different gene
    selection methods reach the best performance for
    different feature set sizes
  • Very high accuracy was achieved on all the data
    sets studied here. In many cases perfect accuracy
    (based on leave-one-out error) was achieved
  • The NCI60 dataset [17] shows lower accuracy
    values. This dataset has the largest number of
    classes (eight) and smaller sample sizes per
    class. SVM-RFE handles this case well, achieving
    96.72% accuracy with 100 selected genes and a
    linear kernel. The gap in accuracy between
    SVM-RFE and the other gene ranking methods is
    highest for this dataset (ca. 11.5%).

36
Limitations and Future Work
  • The selection of features over the whole
    training set induces a bias in the results. We
    will study how to assess and correct this bias
    in future experiments.
  • We will take the correlation between pairs of
    selected features into consideration. The
    ranking methods will be modified so that the
    correlation between selected features stays
    below a certain threshold.
  • Evaluate the top-ranked genes in our research
    against marker genes identified in other
    studies.