Predictive Analysis of Gene Expression Data from Human SAGE Libraries


Title: Predictive Analysis of Gene Expression Data from Human SAGE Libraries


1
Predictive Analysis of Gene Expression Data
from Human SAGE Libraries
  • Alexessander Alves, Nikolay Zagoruiko, Oleg Okun,
    Olga Kutnenko, Irina Borisova
  • University of Porto, PORTUGAL
  • Russian Academy of Sciences, RUSSIA
  • University of Oulu, FINLAND

2
Outline
  1. Goals
  2. Background
  3. SAGE Data
  4. Gene Expression Data
  5. Feature Selection
  6. GRAD
  7. Experiments
  8. Conclusions

3
Goal
  • Predictive Analysis
  • Feature Selection Methods in Bioinformatics and
    Machine Learning
  • Cancer Classification

4
Background
Central Dogma of Biology
  • Genes code for proteins and other large biomolecules
  • Genes are expressed in a two-step process
    (Central Dogma of Biology)
  • Several technologies measure transcription: SAGE,
    microarrays

Gene Expression Process (Molla et al., 2003)
  1. Transcribed into an RNA sequence
  2. Translated into a protein
5
SAGE DATA
  • Advantages
  • Compare samples between different organs and
    patients. (No normalisation required)
  • Collects complete gene expression profile of a
    cell/tissue without prior knowledge of the mRNA
    to be profiled

6
SAGE DATA
  • Drawbacks
  • Very Expensive to Collect Data using the SAGE
    method
  • Very Few Examples (consequence)

7
GENE EXPRESSION DATA
  • Challenges posed to Machine Learning
  • Number of Genes Dramatically Exceeds Number of Examples!
  • Curse of Dimensionality (not enough data density to
    estimate the model accurately)
  • Over-fitting (higher probability of finding
    chance relationships among data attributes)
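
The over-fitting risk can be illustrated with a minimal sketch, assuming purely random data with the shape of the small dataset used in the experiments (74 examples, 822 genes). No gene is related to the labels, yet the best gene still correlates noticeably with them by chance. The function name `pearson` and the seed are illustrative assumptions, not part of the slides.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # arbitrary seed, only for repeatability
n_examples, n_genes = 74, 822

# 50 "cancer" and 24 "normal" labels; expression values are pure noise.
labels = [1] * 50 + [0] * 24
genes = [[random.gauss(0, 1) for _ in range(n_examples)] for _ in range(n_genes)]

abs_corr = [abs(pearson(g, labels)) for g in genes]
best_by_chance = max(abs_corr)
typical = sorted(abs_corr)[n_genes // 2]  # median absolute correlation

print(f"median |r| = {typical:.2f}, max |r| = {best_by_chance:.2f}")
```

With so many candidate genes and so few examples, the maximum is far above the typical value even though every gene is noise, which is why a selected gene needs validation on held-out data.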

8
Feature Selection
  • Remove Irrelevant and Redundant Genes
  • Methods
  • Wrapper
  • Fit a classifier to a subset of the data and use
    classification accuracy to drive the search for
    relevant genes (e.g. C4.5 accuracy)
  • Filtering
  • Use a function to assess the goodness of a subset
    of genes (e.g. Euclidean distance, entropy,
    correlation, etc.)
  • Problem Complexity
  • O(2^n)
  • n: number of genes
  • Smaller dataset: n = 822
  • O(2^n) ≈ 2.8×10^247 → intractable using a simple
    exhaustive search
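
The size of the search space for n = 822 can be checked directly with Python's arbitrary-precision integers. This is a minimal sketch; the variable names are illustrative.

```python
n = 822  # number of genes in the smaller dataset

n_subsets = 2 ** n            # every gene is either in or out of a subset
digits = len(str(n_subsets))  # number of decimal digits of the count

# The leading digits give the mantissa in scientific notation.
mantissa = int(str(n_subsets)[:3]) / 100
print(f"2^{n} ~ {mantissa} x 10^{digits - 1}")
```

The count has 248 decimal digits, i.e. it is on the order of 10^247, so enumerating all subsets is out of the question.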

9
Gene Selection In Bioinformatics
  • Filtering is usually preferred because it is
    computationally less expensive
  • Several works on classification select genes
    with
  • Wilcoxon test,
  • t-test
  • Additionally, genes with low entropy, variability,
    or absolute expression level are also removed.
  • Cons
  • Redundancy
  • Unaware of interdependency
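
A univariate filter of this kind can be sketched as follows, assuming Welch's t-statistic as the score and a made-up toy dataset. The names `welch_t` and `rank_genes` are hypothetical, and the slides do not say which t-test variant was used.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples (unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_genes(expr, labels, top_k):
    """Rank genes by |t| between the two classes and keep the top_k.

    expr   -- dict: gene name -> list of expression values, one per sample
    labels -- list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expr.items():
        cancer = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(cancer, normal))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data (made up for illustration): g1 separates the classes, g2 does
# not, and g3 is an exact copy of g1, i.e. fully redundant.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [9, 8, 10, 9, 2, 3, 1, 2],
    "g2": [5, 1, 4, 2, 4, 2, 5, 1],
    "g3": [9, 8, 10, 9, 2, 3, 1, 2],
}
selected = rank_genes(expr, labels, top_k=2)
print(selected)
```

Because each gene is scored in isolation, the filter keeps both `g1` and its copy `g3`, which illustrates the redundancy drawback listed above.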

10
Our Proposals
  • Study Bioinformatics Filtering Techniques
  • Compare with Machine Learning Algorithms
  • Avoid Redundancy
  • Consider interdependency and low-expressed genes
  • Introduce a new filtering algorithm: GRAD

11
GRAD
  • Search Strategy
  • Use exhaustive search for the formation of
    informative groups of attributes (granules)
  • Use AdDel for choosing subsets of granules
  • AdDel: a combination of forward sequential search
    (FSS) and backward sequential search (BSS)
  • The number of attributes to include in a subset is
    estimated by the algorithm
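
One way to read the AdDel description is as an alternation of greedy additions and greedy deletions. The sketch below follows that reading under stated assumptions: the function `add_del`, its parameters `n_add` and `n_del`, and the toy quality function are hypothetical, and this is not the authors' implementation.

```python
def add_del(features, quality, n_add=2, n_del=1):
    """Alternate forward steps (add) and backward steps (delete).

    features -- list of unique candidate attributes (or granules)
    quality  -- function: list of selected features -> score, higher is better
    Returns the best-scoring subset seen during the search and its score.
    """
    if n_add <= n_del:
        raise ValueError("n_add must exceed n_del so the subset can grow")
    selected, best, best_score = [], [], float("-inf")

    def remember():
        nonlocal best, best_score
        score = quality(selected)
        if score > best_score:
            best, best_score = list(selected), score

    while len(selected) < len(features):
        # Forward phase: greedily add the most helpful remaining features.
        for _ in range(n_add):
            rest = [f for f in features if f not in selected]
            if not rest:
                break
            selected.append(max(rest, key=lambda f: quality(selected + [f])))
            remember()
        if len(selected) == len(features):
            break
        # Backward phase: greedily drop the least useful selected features.
        for _ in range(n_del):
            if len(selected) <= 1:
                break
            drop = max(selected, key=lambda f: quality([s for s in selected if s != f]))
            selected.remove(drop)
            remember()
    return best, best_score

# Toy quality (an assumption): "a" and "c" are informative, and every
# feature in the subset carries a small cost.
useful = {"a", "c"}

def toy_quality(subset):
    return sum(1.0 for f in subset if f in useful) - 0.1 * len(subset)

best, score = add_del(["a", "b", "c", "d"], toy_quality)
print(sorted(best), round(score, 2))
```

Because the best subset seen is remembered along the way, the size of the returned subset is determined by the search itself rather than fixed in advance.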

12
GRAD
  • Algorithm
  • P0 = {x1, x2, ..., xn}: initial set of features
  • Formation of granules
  • Ordering by individual relevance
  • G1 = {x7, x33, x12, ..., xn}
  • All pairs by exhaustive search
  • G2 = {x3x8, x15x88, ..., xixj}
  • All triplets by exhaustive search
  • G3 = {x75x1x35, x11x49x55, ..., xixjxk}
  • Top level: most relevant granules using
    AdDel
  • G = <G1, G2, G3> → AdDel

[Symbols not preserved] The two quantities are the
distances to the closest neighbors, one from each class
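
The granule-formation step can be sketched as an exhaustive enumeration of singles, pairs, and triplets, each ranked by a relevance score. Everything named here is an assumption for illustration: `form_granules`, the `keep` parameter, and the toy `purity` relevance, which stands in for the slides' own relevance measure (not reproduced here).

```python
from collections import Counter, defaultdict
from itertools import combinations

def purity(rows, labels, group):
    """Toy relevance: share of samples in the majority class of their cell,
    where a cell is one combination of values of the features in `group`."""
    cells = defaultdict(Counter)
    for row, y in zip(rows, labels):
        cells[tuple(row[f] for f in group)][y] += 1
    return sum(max(c.values()) for c in cells.values()) / len(rows)

def form_granules(features, relevance, keep=2):
    """Return [G1, G2, G3]: the `keep` best singles, pairs and triplets."""
    levels = []
    for size in (1, 2, 3):
        groups = list(combinations(features, size))  # exhaustive enumeration
        groups.sort(key=relevance, reverse=True)     # most relevant first
        levels.append(groups[:keep])
    return levels

# Toy data: the class is x1 XOR x2; x3 is irrelevant.
rows = [{"x1": a, "x2": b, "x3": c} for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [r["x1"] ^ r["x2"] for r in rows]

g1, g2, g3 = form_granules(["x1", "x2", "x3"], lambda g: purity(rows, labels, g))
print(g1, g2, g3)
```

In this toy case no single feature is informative on its own, but the pair (x1, x2) separates the classes perfectly, which is the kind of interdependency a purely univariate ranking cannot see. The resulting lists would then be handed to a subset search such as AdDel.
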
13
Experiments
  • Comparison
  • GRAD
  • Wrapper C4.5
  • Original Dataset
  • Filtering
  • Wilcoxon Test, low entropy, variability, and very
    low absolute expression level
  • Classifiers
  • C4.5
  • SVM
  • RBF
  • NN-MLP
  • Data
  • Small Dataset: 74 × 822 (examples × genes)

14
Data Characterization
  • Not all organs have samples of both classes
  • Unbalanced number of cases
  • 50 Cancer Samples
  • 24 Normal Samples
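
Given this imbalance, accuracy figures are easier to interpret against a trivial baseline. A minimal sketch of the arithmetic, using only the class counts above (the variable names are illustrative):

```python
n_cancer, n_normal = 50, 24      # class sizes from the slide
n_total = n_cancer + n_normal    # 74 samples in total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(n_cancer, n_normal) / n_total
print(f"majority-class baseline: {majority_baseline:.1%}")
```

Any classifier evaluated on this dataset should be compared with this baseline rather than with 50%.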

Most of the data shows relatively low expression. The mean is
quite far from the median, potentially due to outliers.
15
Data Characterization
[Plots: average vs. standard deviation; average vs. range]
Both range and standard deviation have a roughly
linear relationship with the average gene expression
level.
16
Experimental Results
Predictive Accuracy
  • GRAD: 86
  • Wrapper: 82
  • Original: 79
  • Filtering: 78
GRAD is significantly better than using the
original or the filtered dataset; the Wrapper
approach is not.
17
GRAD Results
  • Importance of considering dependence
  • Distance Function
  • Best by GRAD: P = 100
  • 10 most individually informative: P = 75.7
18
GRAD Results
  • Scatter Plot of GRAD Attributes

Interdependency relationship between two
non-differentially expressed genes selected with GRAD.

Two differentially expressed genes selected with
GRAD.
19
GRAD Results
  • Examples ordered by the value of the Distance
    Function

In the future, this may make it possible to estimate
the degree of risk, make early diagnoses, and
monitor a course of treatment.
20
Induced Classifiers
C4.5 Induced on GRAD attributes
C4.5 Induced using a Wrapper Approach
21
Conclusions
  1. Coping with redundancy and dependency between
    attributes is very important.
  2. The GRAD algorithm is an effective means of
    selecting a subset of attributes from a very large
    initial set.
  3. The presented results are illustrative only.
  4. We are open to cooperation with those who are
    interested in the biological interpretation of
    the results.

22
Questions

23
GRAD
  • As n increases, relevance grows; then the growth
    stops and relevance begins to decrease due to the
    addition of less informative, noisy attributes.
  • The maximum of the quality curve identifies
    the optimal number of attributes.
  • Only algorithms of the AdDel family have this
    property.
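
Reading the optimal subset size off such a curve amounts to locating its maximum. The sketch below assumes a hypothetical helper `optimal_size` and uses made-up quality values purely for illustration; they are not results from the slides.

```python
def optimal_size(quality_by_size):
    """Return the subset size n at which the quality curve peaks.

    quality_by_size -- dict: number of attributes -> quality estimate
    Ties are resolved in favour of the smaller subset.
    """
    return min(quality_by_size, key=lambda n: (-quality_by_size[n], n))

# Made-up curve for illustration only: quality rises, peaks, then falls
# as less informative attributes are added.
curve = {1: 0.62, 2: 0.71, 3: 0.80, 4: 0.84, 5: 0.84, 6: 0.79, 7: 0.73}
best_n = optimal_size(curve)
print(best_n)
```

Preferring the smaller subset on ties is a design choice made here, on the grounds that fewer attributes are easier to interpret.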

24
Gene Selection In Bioinformatics
  • Redundancy
  • Some genes are highly correlated (probably
    belonging to the same biological pathways)
  • Curse of Dimensionality
  • Interdependency
  • A few interdependent genes may carry together
    more significant information than a subset of
    independent genes.
  • Loss of relevant information to discriminate
    among classes

All of this has a negative impact on predictive
accuracy!
25
Feature Selection
  • Wrapper
  • Considers the classifier while searching for the
    best subset
  • Accuracy Improves
  • May overfit due to small sample sizes and huge
    dimensionality
  • Computationally more expensive
  • Filtering
  • Potentially less accurate
  • Faster: does not require the induction of a
    predictor
  • Commonly preferred approach in bioinformatics
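
A wrapper can be sketched as a greedy forward search driven by a classifier's estimated accuracy. In this minimal example a nearest-centroid classifier stands in for C4.5 purely to keep the code short, accuracy is estimated by leave-one-out, and the data are made up. The names `loo_accuracy` and `wrapper_forward` are hypothetical.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on `subset`.

    Assumes every class has at least two samples, so that a centroid
    can still be computed when one sample is held out.
    """
    correct = 0
    for i in range(len(rows)):
        # Class centroids computed without the held-out sample i.
        centroids = {}
        for cls in sorted(set(labels)):
            members = [rows[j] for j in range(len(rows)) if j != i and labels[j] == cls]
            centroids[cls] = [sum(m[f] for m in members) / len(members) for f in subset]
        point = [rows[i][f] for f in subset]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == labels[i]
    return correct / len(rows)

def wrapper_forward(rows, labels, features):
    """Greedy forward search; stops when accuracy no longer improves."""
    selected, best_acc = [], 0.0
    while len(selected) < len(features):
        rest = [f for f in features if f not in selected]
        cand = max(rest, key=lambda f: loo_accuracy(rows, labels, selected + [f]))
        acc = loo_accuracy(rows, labels, selected + [cand])
        if acc <= best_acc:
            break
        selected.append(cand)
        best_acc = acc
    return selected, best_acc

# Toy data (made up): g1 separates the classes, g2 is noise.
labels = [1, 1, 1, 0, 0, 0]
rows = [{"g1": a, "g2": b} for a, b in
        [(9, 5), (8, 1), (10, 4), (2, 4), (3, 2), (1, 5)]]
selected, acc = wrapper_forward(rows, labels, ["g1", "g2"])
print(selected, acc)
```

Every candidate subset requires re-fitting and re-evaluating the classifier, which is where the extra computational cost of the wrapper comes from, and with few samples the accuracy estimate itself can be over-fitted.
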