Title: Predictive Analysis of Gene Expression Data from Human SAGE Libraries
1Predictive Analysis of Gene Expression Data
from Human SAGE Libraries
- Alexessander Alves, Nikolay Zagoruiko, Oleg Okun, Olga Kutnenko, Irina Borisova
- University of Porto, PORTUGAL
- Russian Academy of Sciences, RUSSIA
- University of Oulu, FINLAND
2Outline
- Goals
- Background
- SAGE Data
- Gene Expression Data
- Feature Selection
- GRAD
- Experiments
- Conclusions
3Goal
- Predictive Analysis
- Feature Selection Methods in Bioinformatics and Machine Learning
- Cancer Classification
4Background
Central Dogma of Biology
- Genes code for proteins and other large biomolecules
- Genes are expressed in a two-step process (Central Dogma of Biology)
- Several technologies measure transcription: SAGE, microarray
Molla et al., 2003
Gene Expression Process
1- Transcribed into an RNA sequence
2- Translated into a protein
5SAGE DATA
- Advantages
- Samples can be compared across different organs and patients (no normalisation required)
- Collects the complete gene expression profile of a cell/tissue without prior knowledge of the mRNA to be profiled
6SAGE DATA
- Drawbacks
- Very expensive to collect data using the SAGE method
- Very few examples (as a consequence)
7GENE EXPRESSION DATA
- Challenges posed to Machine Learning
- Number of genes dramatically exceeds the number of examples!
- Curse of Dimensionality (not enough density to estimate the model accurately)
- Over-fitting (higher probability of finding chance relationships among data attributes)
8Feature Selection
- Remove Irrelevant and Redundant Genes
- Methods
- Wrapper
- Fit a classifier to a subset of the data and use classification accuracy to drive the search for relevant genes (e.g. C4.5 accuracy)
- Filtering
- Use a function to assess the goodness of a subset of genes (e.g. Euclidean distance, entropy, correlation, etc.)
- Problem Complexity
- O(2^n)
- n: number of genes
- Smaller dataset: n = 822
- O(2^n) = 2^822 ≈ 2.8 x 10^247: intractable using a simple exhaustive search
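The size of this search space can be checked directly. The following is a minimal sketch, assuming every subset of the n = 822 genes counts as a candidate; the function name `subset_count` is illustrative only.

```python
def subset_count(n_genes: int) -> int:
    """Number of possible gene subsets (including the empty one): 2^n."""
    return 2 ** n_genes

n = 822
count = subset_count(n)          # exact integer, no overflow in Python
digits = len(str(count))         # number of decimal digits
approx = f"{float(count):.1e}"   # compact scientific notation

print(f"2^{n} has {digits} digits, roughly {approx} candidate subsets")
```

Even at a billion subset evaluations per second, enumerating this many candidates is out of reach, which is why heuristic search strategies are used instead.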
9Gene Selection In Bioinformatics
- Filtering is usually preferred because it is computationally less expensive
- Several works on classification select genes with
- Wilcoxon test,
- t-test
- Additionally, genes with low entropy, variability, or absolute expression level are also removed.
- Cons
- Redundancy
- Interdependency unaware
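A univariate filter of the kind listed above can be sketched as follows. This is an illustration only, assuming Welch's form of the t statistic (the slides name just "t-test") and a tiny made-up expression matrix; `welch_t` and `rank_genes` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expr, labels):
    """Rank gene indices by |t| between class-1 and class-0 samples.

    expr is a list of samples, each a list of per-gene expression values.
    """
    scores = []
    for g in range(len(expr[0])):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == 0]
        scores.append((abs(welch_t(pos, neg)), g))
    return [g for _, g in sorted(scores, reverse=True)]

# Made-up data: gene 1 is a rescaled copy of gene 0, gene 2 is uninformative.
labels = [1, 1, 1, 0, 0, 0]
expr = [
    [10, 21, 5],
    [12, 25, 3],
    [11, 23, 4],
    [2, 5, 4],
    [3, 7, 5],
    [1, 3, 3],
]
ranking = rank_genes(expr, labels)
print(ranking)
```

In this toy case the two top-ranked genes carry the same information, which illustrates the redundancy drawback: each gene is scored on its own, so a near-duplicate ranks as highly as the original.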
10Our Proposals
- Study Bioinformatics Filtering Techniques
- Compare with Machine Learning Algorithms
- Avoid Redundancy
- Consider Interdependency and low-expressed genes
- Introduce a new Filtering Algorithm: GRAD
11GRAD
- Search Strategy
- Use exhaustive search for the formation of informative groups of attributes (granules)
- Use AdDel for choosing subsets of granules
- AdDel: a combination of forward sequential search (FSS) and backward sequential search (BSS)
- The number of attributes to include in a subset is estimated by the algorithm
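The alternation of forward and backward steps can be sketched as below. This is not the authors' implementation: the step sizes `n_add` and `n_del`, the greedy choice at each step, and the toy `quality` function are all assumptions made for illustration.

```python
def add_del(features, quality, n_add=2, n_del=1):
    """Alternate forward (Add) and backward (Del) steps; return the best subset seen."""
    if n_add <= n_del:
        raise ValueError("n_add must exceed n_del so the subset grows")
    selected, remaining = [], list(features)
    best, best_q = [], float("-inf")

    def record():
        # Remember the highest-quality subset encountered so far.
        nonlocal best, best_q
        q = quality(selected)
        if q > best_q:
            best, best_q = list(selected), q

    while len(remaining) >= n_add:
        for _ in range(n_add):      # forward: add the most helpful feature
            f = max(remaining, key=lambda x: quality(selected + [x]))
            selected.append(f)
            remaining.remove(f)
            record()
        for _ in range(n_del):      # backward: drop the least useful feature
            f = max(selected, key=lambda x: quality([s for s in selected if s != x]))
            selected.remove(f)
            remaining.append(f)
            record()
    return best, best_q

# Hypothetical quality: two useful features, every other feature adds noise.
USEFUL = {"g1", "g2"}
def quality(subset):
    return sum(1 for f in subset if f in USEFUL) - 0.1 * sum(1 for f in subset if f not in USEFUL)

best, best_q = add_del(["g0", "g1", "g2", "g3", "g4"], quality)
print(sorted(best), best_q)
```

Because the best subset is tracked across all steps, its size falls out of the search rather than being fixed in advance, which is one way to read "estimated by the algorithm".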
12GRAD
- Algorithm
- P0: x1, x2, ..., xn (initial set of features)
- Formation of granules
- Ordering by individual relevance
- G1: x7, x33, x12, ..., xn
- All pairs by exhaustive search
- G2: x3 x8, x15 x88, ..., xi xj
- All triplets by exhaustive search
- G3: x75 x1 x35, x11 x49 x55, ..., xi xj xk
-
- Top level: most relevant granules using AdDel
- G = <G1, G2, G3> -> AdDel
[...] and [...] are the distances to the closest neighbors, one from each class
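The granule-formation step can be sketched with exhaustive enumeration of singles, pairs, and triplets. This is a minimal sketch under stated assumptions: the `relevance` function here is a made-up stand-in (not the distance-based measure from the slides), and `top_k` is a hypothetical cut-off for how many granules are kept per level.

```python
from itertools import combinations
from math import comb

def form_granules(features, relevance, max_size=3, top_k=2):
    """Return, per granule size, the top_k feature groups ranked by relevance."""
    levels = {}
    for size in range(1, max_size + 1):
        groups = combinations(features, size)   # exhaustive enumeration
        levels[size] = sorted(groups, key=relevance, reverse=True)[:top_k]
    return levels

# Made-up individual weights, plus a bonus when x1 and x3 appear together.
WEIGHTS = {"x1": 0.2, "x2": 0.9, "x3": 0.1, "x4": 0.5}
def relevance(group):
    base = sum(WEIGHTS[f] for f in group) / len(group)
    return base + (1.0 if {"x1", "x3"} <= set(group) else 0.0)

levels = form_granules(list(WEIGHTS), relevance)
print(levels)

# Cost of exhaustive enumeration for the 822-gene dataset.
pairs, triplets = comb(822, 2), comb(822, 3)
print(pairs, triplets)
```

In the toy data, x1 and x3 are the two weakest features individually yet form the strongest pair. For 822 genes the enumeration covers 337,431 pairs and 92,231,140 triplets, which is large but feasible, unlike the full subset search.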
13Experiments
- Comparison
- GRAD
- Wrapper C4.5
- Original Dataset
- Filtering
- Wilcoxon test, low entropy, variability, and very low absolute expression level
- Classifiers
- C4.5
- SVM
- RBF
- NN-MLP
- Data
- Small dataset: 74 samples x 822 genes
14Data Characterization
- Not all organs have samples of both classes
- Unbalanced number of cases
- 50 Cancer Samples
- 24 Normal Samples
Most data is relatively low expressed
The mean is quite far from the median, potentially due to outliers
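A small numeric illustration of how a few outliers pull the mean away from the median. The counts below are invented for the example and are not taken from the SAGE libraries.

```python
from statistics import mean, median

# Hypothetical tag counts: mostly low values plus two highly expressed tags.
counts = [1, 1, 2, 2, 3, 3, 4, 5, 250, 900]

avg = mean(counts)
mid = median(counts)
print(f"mean = {avg}, median = {mid}")
```

The median stays with the bulk of low counts while the mean is dominated by the two large values, the same pattern the slide attributes to outliers.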
15Data Characterization
average vs standard deviation
average vs range
Both range and standard deviation have a roughly
linear relationship with the average gene
expression level
16Experimental Results
Predictive Accuracy (%)
Method      Accuracy
GRAD        86
Wrapper     82
Original    79
Filtering   78
GRAD is significantly better than using the
original or the filtered dataset; the Wrapper
approach is not.
17GRAD Results
- Importance of considering dependence
- Distance Function
10 most individually informative: P = 75.7
18GRAD Results
- Scatter Plot of GRAD Attributes
Interdependency relationship between two non-differentially
expressed genes selected with GRAD.
Two differentially expressed genes selected with
GRAD.
19GRAD Results
- Examples ordered by the value of the Distance
Function
In the future, this ordering may make it possible
to estimate the degree of risk, support early
diagnosis, and monitor the course of treatment
20Induced Classifiers
C4.5 Induced on GRAD attributes
C4.5 Induced using a Wrapper Approach
21Conclusions
- Coping with redundancy and dependency between attributes is very important.
- The GRAD algorithm is an effective means of selecting a subset of attributes from a very large initial set.
- The results presented are illustrative only.
- We are open to cooperation with those interested in the biological interpretation of the results
22Questions
23GRAD
- As n increases, relevance grows at first, then stops growing and begins to decrease as less informative, noisy attributes are added.
- The maximum of the quality curve makes it possible to specify the optimal number of attributes.
- Only algorithms of the AdDel family have this property.
24Gene Selection In Bioinformatics
- Redundancy
- Some genes are highly correlated (probably belonging to the same biological pathways)
- Curse of Dimensionality
- Interdependency
- A few interdependent genes may together carry more significant information than a subset of independent genes.
- Loss of relevant information to discriminate among classes
All of this has a negative impact on predictive
accuracy!
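The interdependency point can be illustrated with a deliberately extreme toy case: two genes that are each uninformative alone but determine the class jointly. The binarised data and the XOR rule are assumptions chosen for clarity, not a pattern reported in the slides.

```python
# Hypothetical binarised expression (1 = high, 0 = low) for two genes.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)] * 3
# The class depends on the pair of genes, not on either gene alone.
labels = [a ^ b for a, b in samples]

def accuracy_single(idx):
    """Best accuracy of a rule using one gene only (its value or its negation)."""
    hits = sum(s[idx] == y for s, y in zip(samples, labels))
    return max(hits, len(samples) - hits) / len(samples)

def accuracy_pair():
    """Accuracy of a rule that looks at both genes together."""
    hits = sum((a ^ b) == y for (a, b), y in zip(samples, labels))
    return hits / len(samples)

print(accuracy_single(0), accuracy_single(1), accuracy_pair())
```

A filter that scores genes one at a time would discard both genes here, since each is at chance level alone, and with them the information needed to separate the classes.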
25Feature Selection
- Wrapper
- Considers the classifier while searching for the best subset
- Accuracy improves
- May overfit due to small sample sizes and huge dimensionality
- Computationally more expensive
- Filtering
- Potentially less accurate
- Faster: does not require the induction of a predictor
- Commonly preferred approach in bioinformatics