A Kolmogorov-Smirnov Correlation-Based Filter for Microarray Data

1
A Kolmogorov-Smirnov Correlation-Based Filter for
Microarray Data
  • Jacek Biesiada
  • Division of Computer Methods, Dept. of
    Electrotechnology, The Silesian University of
    Technology, Katowice, Poland.
  • Wlodzislaw Duch
  • Dept. of Informatics, Nicolaus Copernicus
    University, Toruń, Poland (Google: Duch)
  • ICONIP 2007

2
Motivation
  • Attention is a basic cognitive skill; without
    focus on relevant information cognition would not
    be possible.
  • In natural perception (vision, auditory scenes,
    tactile signals) a large number of features may be
    dynamically selected depending on the task.
  • In large feature spaces (genes, proteins,
    chemistry, etc.) different features are relevant
    for different tasks.
  • Filters will leave a large number of potentially
    relevant features.
  • Redundancy should be removed!
  • Fast filters with removal of redundancy are
    needed!
  • Microarrays are a popular testing ground, although
    not reliable due to the small number of samples.
  • Goal: a fast filter with redundancy removal,
    tested on microarray data to identify problems.

3
Microarray matrices
  • Genes in rows, samples in columns, DNA/RNA type

4
Selection of information
  • Find relevant information:
  • discard attributes that do not contain
    information,
  • use weights to express the relative importance,
  • create new, more informative attributes,
  • reduce dimensionality by aggregating information.
  • Ranking: treat each feature as independent.
  • Selection: search for subsets, remove redundant
    features.
  • Filters: universal, model-independent criteria.
  • Wrappers: criteria specific to the data model are
    used.
  • Frapper: filter + wrapper in the final stage.
  • Redfilapper: redundancy removal + filter +
    wrapper.
  • Goal: create a fast redfilapper.

5
Filters & Wrappers
  • Filter approach for data D:
  • define your problem C, for example assignment of
    class labels,
  • define an index of relevance for each feature,
    Ji = J(Xi) = J(Xi|D,C),
  • calculate relevance indices for all features and
    order them, Ji1 ≥ Ji2 ≥ ... ≥ Jid,
  • remove all features with relevance below the
    threshold, J(Xi) < tR.
  • Wrapper approach:
  • select a predictor P and a performance measure
    J(D|X) = P(Data|X),
  • define a search scheme: forward, backward or mixed
    selection,
  • evaluate a starting subset of features Xs, e.g.
    the single best or all features,
  • add/remove a feature Xi, accept the new set
    Xs ← Xs ∪ {Xi} if P(Data|Xs ∪ {Xi}) > P(Data|Xs)
    (see the sketch below).
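
A minimal Python sketch of the two approaches (added here as an
illustration, not part of the original slides); relevance() and
cv_accuracy() are hypothetical placeholders for a relevance index J
and a predictor-based performance measure P.

import numpy as np

def filter_select(X, y, relevance, t_R):
    # Filter: compute a model-independent index J(Xi) for every feature,
    # order Ji1 >= Ji2 >= ... >= Jid and keep features above the threshold t_R.
    J = np.array([relevance(X[:, i], y) for i in range(X.shape[1])])
    order = np.argsort(-J)
    return [i for i in order if J[i] >= t_R]

def wrapper_forward_select(X, y, cv_accuracy):
    # Wrapper: greedy forward selection; add feature Xi to the set Xs only if
    # the predictor's performance P(Data|Xs + Xi) improves over P(Data|Xs).
    selected, best, improved = [], -np.inf, True
    while improved:
        improved = False
        for i in range(X.shape[1]):
            if i in selected:
                continue
            score = cv_accuracy(X[:, selected + [i]], y)
            if score > best:
                best, selected, improved = score, selected + [i], True
    return selected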

6
Information gain
  • Information gained by considering the joint
    probability distribution p(C, f) is the difference
    between the sum of the individual informations and
    the joint information,
    IG(C,Xj) = I(C) + I(Xj) - I(C,Xj).
  • A feature is more important if its information
    gain is larger.
  • Modifications of the information gain, used as
    criteria in some decision trees, include:
    IGR(C,Xj) = IG(C,Xj)/I(Xj), the gain ratio;
    IGn(C,Xj) = IG(C,Xj)/I(C), an asymmetric
    dependency coefficient; DM(C,Xj) =
    1 - IG(C,Xj)/I(C,Xj), the normalized Mantaras
    distance.
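
A short sketch of the information gain and gain ratio for
already-discretized data (an added illustration, not from the slides);
entropies are estimated from simple frequency counts.

from collections import Counter
import numpy as np

def entropy(values):
    # I(X): Shannon entropy of a discrete variable, from frequency counts.
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(x, c):
    # IG(C, X) = I(C) + I(X) - I(C, X)
    return entropy(c) + entropy(x) - entropy(list(zip(x, c)))

def gain_ratio(x, c):
    # IGR(C, X) = IG(C, X) / I(X)
    return info_gain(x, c) / entropy(x)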

7
Information indices
  • Information gained by considering attribute Xj and
    classes C together is also known as mutual
    information, equal to the Kullback-Leibler
    divergence between the joint and the product
    probability distributions:
    IG(C,Xj) = Σ p(C,Xj) log [ p(C,Xj) / (p(C) p(Xj)) ].

The entropy distance measure is a sum of conditional
informations, D(C,Xj) = I(C|Xj) + I(Xj|C).
The symmetrical uncertainty coefficient is obtained
from the entropy distance,
SU(C,Xj) = 1 - D(C,Xj)/(I(C)+I(Xj)) = 2 IG(C,Xj)/(I(C)+I(Xj)).
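
As an added illustration, the symmetrical uncertainty coefficient can be
computed with the entropy and info_gain helpers from the previous sketch:

def symmetrical_uncertainty(x, c):
    # SU(C, X) = 2 * IG(C, X) / (I(C) + I(X)); ranges from 0 (independent)
    # to 1 (the value of one variable fully determines the other).
    return 2.0 * info_gain(x, c) / (entropy(x) + entropy(c))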
8
Purity indices
  • Many information-based quantities may be used to
    evaluate attributes. Consistency or purity-based
    indices are one alternative.

For the selection of a subset of attributes F = {Xi} the
sum runs over all Cartesian products, i.e. the
multidimensional partitions rk(F). Advantages: the
simplest approach, suitable for both ranking and
selection. Hashing techniques are used to calculate
the p(rk(F)) probabilities.
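
An added sketch of a hashing-based inconsistency (impurity) count for a
feature subset, assuming discrete-valued features stored in an
(n_samples, n_features) NumPy array X:

from collections import defaultdict, Counter

def inconsistency_count(X, y, subset):
    # Hash each pattern of values of the subset F = {Xi}; for every distinct
    # pattern the inconsistency is n - n(C), where n(C) is the count of the
    # majority class, and the total sums over all partitions r_k(F).
    patterns = defaultdict(list)
    for row, label in zip(X, y):
        patterns[tuple(row[list(subset)])].append(label)
    return sum(len(labels) - Counter(labels).most_common(1)[0][1]
               for labels in patterns.values())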
9
Correlation coefficient
  • Perhaps the simplest index is based on the
    Pearson correlation coefficient (CC), which
    calculates expectation values for the product of
    feature values and class values.

For feature values that are linearly dependent the
correlation coefficient is +1 or -1; for complete
independence of the class and Xj distributions CC = 0.
How significant are small correlations? It depends on
the number of samples n; the answer is given in
Numerical Recipes (www.nr.com).
For n = 1000 even a small CC = 0.02 gives P ≈ 0.5, but
for n = 10 such a CC gives only P ≈ 0.05.
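
An added sketch of the CC and a large-n significance estimate; the
erfc-based formula is the approximation given in Numerical Recipes, and
numeric class labels are assumed:

import numpy as np
from scipy.special import erfc

def cc_with_significance(x, y):
    # Pearson correlation coefficient between a feature and the class labels,
    # and the probability that a correlation at least this large arises by
    # chance under the null hypothesis (large-n approximation).
    r = np.corrcoef(x, y)[0, 1]
    p_chance = erfc(abs(r) * np.sqrt(len(x) / 2.0))
    return r, p_chance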
10
F-score
  • Mutual information is based on the Kullback-Leibler
    distance; any distance measure between
    distributions may also be used, e.g. the
    Jeffreys-Matusita distance.

The F-score compares the scatter of the class means to
the pooled within-class variance.
For two classes F = t², the square of the t-score.
Many other such (dis)similarity measures exist. Which
is the best? In practice they are all similar,
although the accuracy of calculating the indices is
important: relevance indices should be insensitive to
noise and unbiased in their treatment of features with
many values.
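
An added sketch of one common form of the F-score, the ratio of
between-class scatter to the pooled within-class variance (consistent
with F = t² for two classes):

import numpy as np

def f_score(x, y):
    # One-way ANOVA style F: between-class variance of the feature means
    # divided by the pooled within-class variance.
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    classes = np.unique(y)
    n_k = np.array([np.sum(y == c) for c in classes])
    means = np.array([x[y == c].mean() for c in classes])
    between = np.sum(n_k * (means - x.mean()) ** 2) / (len(classes) - 1)
    pooled = np.sum([(n - 1) * x[y == c].var(ddof=1)
                     for n, c in zip(n_k, classes)]) / (len(x) - len(classes))
    return between / pooled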
11
State-of-the-art methods
  • 1. FCBF, Fast Correlation-Based Filter (Yu &
    Liu 2003).
  • Compare feature-class Ji = SU(Xi,C) and
    feature-feature SU(Xi,Xj) indices:
  • rank features, Ji1 ≥ Ji2 ≥ ... ≥ Jim ≥ min
    threshold;
  • compare each feature Xi to all Xj lower in the
    ranking;
  • if SU(Xi, Xj) ≥ SU(C, Xj) then Xj is redundant
    and is removed (see the sketch after this list).
  • 2. ConnSF, Consistency-based feature selection
    (Dash, Liu & Motoda 2000).
  • The inconsistency JI(S) for a discrete-valued
    feature subset S is JI(S) = n - n(C), where a
    subset of features S with values VS appears n
    times in the data, most often n(C) times with the
    label of class C.
  • Total inconsistency count: the sum of the
    inconsistency counts over all distinct patterns of
    the feature subset S.
  • Consistency: the least inconsistency count.
  • 3. CorrSF (Hall 1999), based on the correlation
    coefficient with 5-step backtracking.
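
An added sketch of the FCBF-style redundancy removal, reusing the
symmetrical_uncertainty helper from the earlier sketch and assuming
discretized features:

def fcbf(X, y, threshold=0.0):
    # Rank features by SU(Xi, C); walking down the ranking, drop a feature Xj
    # when some already-kept Xi satisfies SU(Xi, Xj) >= SU(C, Xj).
    d = X.shape[1]
    su_c = [symmetrical_uncertainty(X[:, i], y) for i in range(d)]
    ranked = sorted((i for i in range(d) if su_c[i] >= threshold),
                    key=lambda i: -su_c[i])
    kept = []
    for j in ranked:
        if all(symmetrical_uncertainty(X[:, i], X[:, j]) < su_c[j] for i in kept):
            kept.append(j)
    return kept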

12
Kolmogorov-Smirnov test
  • Are the distributions of values of two different
    features roughly equal? If yes, one is redundant.
  • The discretization process creates k clusters
    (vectors from roughly the same class), each
    typically covering a similar range of values.
  • A much larger number of independent observations,
    n1, n2 > 40, is taken from the two distributions,
    measuring the frequencies of the different classes.
  • Based on the frequency table the empirical
    cumulative distribution functions F1i and F2i are
    constructed.
  • λ (the K-S statistic) is proportional to the
    largest absolute difference |F1i - F2i|; if λ < λα
    the distributions are considered equal.
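
A simplified, added sketch of the redundancy decision using SciPy's
two-sample K-S test; it works directly on raw feature values rather
than on the class-frequency table described above:

from scipy.stats import ks_2samp

def ks_redundant(x_i, x_j, alpha=0.05):
    # Compare the empirical cumulative distributions of two features; if the
    # largest difference is not significant at level alpha, treat the pair as
    # carrying (approximately) the same information.
    statistic, p_value = ks_2samp(x_i, x_j)
    return p_value > alpha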

13
KS-CBS
  • Kolmogorov-Smirnov Correlation-Based Selection
    algorithm.
  • Relevance analysis:
  • 1. Order features according to the decreasing
    values of the relevance indices, creating the S
    list.
  • Redundancy analysis:
  • 2. Initialize Fi to the first feature in the S
    list.
  • 3. Use the K-S test to find and remove from S all
    features for which Fi forms an approximate
    redundant cover C(Fi).
  • 4. Move Fi to the set of selected features; take as
    Fi the next remaining feature in the list.
  • 5. Repeat steps 3 and 4 until the end of the S list
    (see the sketch below).
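
An added sketch of the KS-CBS loop above; relevance() stands for any
ranking index (e.g. SU or the F-score) and ks_redundant is the helper
from the previous sketch:

def ks_cbs(X, y, relevance, alpha=0.05):
    # 1. Relevance analysis: order features by decreasing relevance (list S).
    S = sorted(range(X.shape[1]), key=lambda i: -relevance(X[:, i], y))
    selected = []
    # 2-5. Redundancy analysis: repeatedly take the best remaining feature Fi
    # and remove from S every feature whose value distribution matches Fi.
    while S:
        f_i = S.pop(0)
        selected.append(f_i)
        S = [j for j in S if not ks_redundant(X[:, f_i], X[:, j], alpha)]
    return selected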

14
3 Datasets
  • Leukemia: training, 38 bone marrow samples (27 of
    the ALL and 11 of the AML type), using 7129
    probes from 6817 human genes; 34 test samples
    are provided, with 20 ALL and 14 AML cases. Too
    small for such a split.
  • Colon Tumor: 62 samples collected from colon
    cancer patients, with 40 biopsies from tumor
    areas (labelled as "negative") and 22 from
    healthy parts of the colons of the same patients.
    2000 out of around 6500 genes were pre-selected,
    based on the confidence in the measured
    expression levels.
  • Diffuse Large B-cell Lymphoma (DLBCL): two
    distinct types of diffuse large B-cell lymphoma
    (the most common subtype of non-Hodgkin's
    lymphoma); 47 samples, 24 from the "germinal
    centre B-like" group and 23 from the "activated
    B-like" group, 4026 genes.

15
Discretization and classifiers
  • For comparison of information selection
    techniques a simple discretization of gene
    expression levels into 3 intervals is used, with
    mean µ and standard deviation s; discrete values
    -1, 0, 1 are assigned for
  • (-∞, µ - s/2), [µ - s/2, µ + s/2], (µ + s/2, +∞).
  • This represents under-expression, baseline and
    over-expression of genes (see the sketch after
    this list).
  • Results after such discretization are in some
    cases significantly improved and are given in
    parentheses in the tables below.
  • Classifiers used:
  • C4.5 decision tree (Weka),
  • Naive Bayes with a single Gaussian kernel or
    discretized probabilities,
  • k-NN (with k = 1, the nearest-neighbor algorithm;
    GhostMiner implementation),
  • linear SVM with C = 1 (also GhostMiner).
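
An added sketch of the 3-interval discretization described above:

import numpy as np

def discretize_3(x):
    # Code expression values as -1 (under-expression), 0 (baseline) or
    # 1 (over-expression) relative to the feature mean and spread.
    mu, s = x.mean(), x.std()
    codes = np.zeros(x.shape, dtype=int)
    codes[x < mu - s / 2] = -1
    codes[x > mu + s / 2] = 1
    return codes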

16
No. of features selected
  • For the standard α = 0.05 confidence level for
    redundancy rejection a relatively large number of
    features is left for Leukemia.
  • Even for the α = 0.001 confidence level 47 features
    are left; it is best to optimize the threshold with
    a wrapper.
  • A larger number of features may lead to a more
    reliable profile (e.g. by chance a single gene in
    Leukemia gets 100% on training).
  • Large improvements, up to 30% in accuracy; with the
    small number of samples statistical significance is
    about 5%.
  • Discretization improves results in most cases.

17
Results
18
More results
19
Leukemia Bayes rules
  • Top: test, bottom: training; green: p(C|X) for a
    Gaussian-smoothed density with s = 0.01, 0.02,
    0.05, 0.20 (Zyxin gene).

20
Leukemia SVM LVO
  • Problems with stability

21
Leukemia boosting
  • 3 best genes, evaluation using bootstrap.

22
Conclusions
  • The KS-CBS algorithm uses relevance indices (the
    F-measure, SUC or another index) to rank features
    and reduce their number, and then uses the
    Kolmogorov-Smirnov test to reduce the number of
    features further.
  • It is computationally efficient and gives quite
    good results.
  • Variants of this algorithm may identify
    approximate redundant covers for consecutive
    features Xi and leave in the S set only the one
    that gives the best results.
  • Problems with the stability of solutions for small
    and large data! No significant difference between
    many feature selection methods.
  • Frapper selects on the training data those features
    that are helpful, in O(m) steps; this stabilizes
    LOO results a bit, but it is not a complete
    solution.
  • Will anything work reliably for microarray
    feature selection? Are the results published so far
    worth anything?