Methods for MicroArray Analysis Data Mining - PowerPoint PPT Presentation


Slides: 30

Provided by: pre80


Transcript and Presenter's Notes


1
Methods for Micro-Array Analysis Data Mining
Machine Learning Approaches
2
What is Micro-Array Analysis?
3
Analysis of Micro-Array Data

Challenges posed
A typical characteristic of micro-array data is the
large number of variables relative to the number
of observations.
Hidden knowledge in these data has to be
discovered.
E.g., gene expression data from 72 leukemia patients
(samples) with 7,070 genes (variables).
The study of the variability of gene expression
patterns.
Problems
How can micro-array data be analyzed so that the
following requirements are met simultaneously?
Efficiency
Accuracy
Automation

4
Typical Micro-Array data set

Suppose that the identical micro-array experiment
is repeated p times (e.g. colon cancer cells from
p patients compared with p wild types). Then we
obtain a data set $(m_{ij},\ i = 1, \dots, G,\ j = 1, \dots, p)$, in
which $m_{ij}$ is the expression ratio of gene i in
the j-th experiment.
Such experiments usually generate large data sets with
expression values for thousands of genes (2,000 to 20,000) but
no more than a few dozen samples.
For example

5
Main objectives in micro-array data analyses

1. To find the genes that are differentially
expressed (DF) in the two samples (e.g. the given
colon cancer sample / the wild type, or cells that
underwent a given treatment / no treatment).
Although biologists can discover DF genes even
with p = 1, it has been realized lately that making
independent replications is good practice.
Questions that could be asked
- Which genes' expression is modified by the
condition? (It has been reported that many
diseases, especially tumors, are not
caused by a single gene mutation but are the
result of a series of gene changes.)
- Has the treatment changed the expression
level of specific (target) genes / gene sequences
to noticeably different levels? If so, is it
important (i.e. is the patient's condition
improved due to this change in expression levels)?
2. To find genes that behave similarly in
different conditions (i.e. clustering the row
vectors) and to find subgroups of samples (or
patients' tissues) that are similar to each
other (i.e. clustering the column vectors).
- novel discovery of genes in related
biological pathways or having related functions
- clinically important subgroups of
patients
3. Classification
- For example, Golub et al. (1999): 2
types of leukemia distinguished based on the gene
expression profile of each sample
4. Validation of the models, assessment of the
robustness / predictive power of the classifiers
(models)

6
Main objectives in micro-array data analyses
7
1. Finding differentially expressed genes.
Parametric methods: t-test

Standard t-test
H0 - no difference between the treatment
and the control samples
H1 - the treatment has an influence.
Knowing the probability distribution of the T
variable under H0 (Student's t distribution with
p-1 degrees of freedom), the actual T is computed
and compared to this distribution.
The smaller the p-value, the less likely it is
that a difference this extreme arose by chance.
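
The T computation above can be sketched as follows. This is a minimal illustration, assuming each gene has p log-ratios (treatment vs. control) that are tested against a mean of zero; the gene names and values are hypothetical, and the comparison with the Student distribution (the p-value) is left out because the Python standard library has no t-distribution function.

```python
import math
import statistics

def t_statistic(log_ratios):
    """One-sample t statistic for H0: mean log-ratio = 0.

    Returns (T, degrees of freedom), with p - 1 degrees of freedom.
    """
    p = len(log_ratios)
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample standard deviation (divides by p - 1)
    return mean / (sd / math.sqrt(p)), p - 1

# Hypothetical log2 expression ratios for two genes across p = 4 replicates.
genes = {
    "gene_a": [1.1, 0.9, 1.3, 1.0],    # consistently above zero
    "gene_b": [0.2, -0.3, 0.1, -0.1],  # fluctuates around zero
}
for name, ratios in genes.items():
    t, df = t_statistic(ratios)
    print(f"{name}: T = {t:.2f}, df = {df}")
```

A large |T| for a gene suggests a difference that is unlikely under H0; a gene whose ratios scatter around zero gives a T close to zero.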

8
1. Finding differentially expressed genes. t-test

Advantages: simple and implemented in all
commercial microarray analysis packages.
Disadvantages: distributional assumptions (due to
the small number of samples, we cannot assume
normality of the sample means) and the problem of
multiple testing -> what is the false discovery
rate?
Alternatives: empirical Bayes and parametric
Bayes.
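
To make the multiple-testing problem concrete, here is a small sketch. The Benjamini-Hochberg procedure used below is not named in the slides; it is shown only as one common way to control the false discovery rate, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Testing 7,070 genes at alpha = 0.05 without correction would be expected to
# flag about 7070 * 0.05 genes by chance alone, even if no gene truly changed.
expected_false_positives = 7070 * 0.05
print("expected false positives without correction:", round(expected_false_positives, 1))

p_values = [0.01, 0.04, 0.03, 0.5]  # hypothetical per-gene p-values
print(benjamini_hochberg(p_values, q=0.05))
```

The design choice here is to return one flag per gene in the original order, so the result can be matched back to the gene list directly.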

9
1. Finding differentially expressed genes. Fold
approach

The average expression level of each gene is
examined.
If it changes by a certain number of folds, the
gene is declared changed (on or off).
Disadvantage: does not reveal the desired
correlation between the gene and its function.
Does not find related genes.
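
A minimal sketch of the fold approach, assuming positive intensity values and a two-fold cut-off (the threshold and the example numbers are illustrative choices, not values from the slides):

```python
import statistics

def fold_change_call(control, treated, threshold=2.0):
    """Label a gene 'up', 'down' or 'unchanged' from the ratio of mean expression levels."""
    fold = statistics.mean(treated) / statistics.mean(control)
    if fold >= threshold:
        return fold, "up"
    if fold <= 1.0 / threshold:
        return fold, "down"
    return fold, "unchanged"

# Hypothetical intensities for one gene in control and treated samples.
print(fold_change_call([100, 110, 90], [310, 290, 300]))
```

Note that the call uses only the means: a gene with very noisy replicates can pass the threshold just as easily as a stable one, which is one reason the t-test is often preferred.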

10
Data Mining

11
Gene Expression grouping and classification.
Overview of existing approaches

Micro-Array analysis employs machine learning
algorithms and techniques to mine useful data.
Unsupervised data analysis
Principal Component Analysis (PCA)
Hierarchical Clustering
Non-Hierarchical Clustering
K-means
Self-organizing maps (a type of neural network)
Supervised data analysis
Decision Trees - C5.0 implementation
Artificial Neural Networks - back-propagation
algorithm
Two complementary techniques
Cross-validation
Multi-model approaches (boosting, bagging,
stacking)

12
Principal Component Analysis (PCA)

This is a technique for finding major
combinations of data (i.e. genes that are
regularly up- and down-regulated together).
Objectives
Graphically summarize a large rectangular table of
numbers, R, simplify its comprehension, find
pertinent features.
Reduce the dimensionality of the data set (e.g.
co-regulated genes).
Graphically summarize
- The correlations between the
variables.
- Find new meaningful underlying
variables (dimensions) that summarize the initial
variables.
MAXIMIZE THE INTRA-CLASS VARIANCE
MINIMIZE THE INTER-CLASS VARIANCE
- The proximities and the
principal oppositions between the individuals.
Simple example
Imagine a micro-array data set consisting of
only 2 experiments (2 samples).
Graphically represent the data.

13
Principal Component Analysis(PCA)

Principal component analysis of a
two-dimensional data cloud. The line shown is the
direction of the first principal component, which
gives an optimal (in the mean-square sense)
linear reduction of dimension from 2 to 1
dimensions.

14
Principal Component Analysis(PCA)

Principal component analysis (PCA) involves a
mathematical procedure that transforms a number
of (possibly) correlated variables into a
(smaller) number of uncorrelated variables called
principal components.
Illustration for the case of 2 samples
The variance of the sample x is given by
The variance of the sample y is given by
The covariance between x and y
Then we can write
This matrix is square and symmetric, admits a
characteristic polynomial and is diagonalizable.
It also admits a basis of orthogonal eigenvectors.
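
The formulas on this slide did not survive in the transcript. A standard reconstruction is sketched below, assuming that x_i and y_i are the expression values of gene i in the two samples, that there are G genes, and that the 1/(G-1) normalization is used (some texts use 1/G instead). The symbol C for the covariance matrix is introduced here for illustration.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{G}\sum_{i=1}^{G} x_i, \qquad
\bar{y} = \frac{1}{G}\sum_{i=1}^{G} y_i \\
\operatorname{var}(x) &= \frac{1}{G-1}\sum_{i=1}^{G} (x_i - \bar{x})^2 \\
\operatorname{var}(y) &= \frac{1}{G-1}\sum_{i=1}^{G} (y_i - \bar{y})^2 \\
\operatorname{cov}(x, y) &= \frac{1}{G-1}\sum_{i=1}^{G} (x_i - \bar{x})(y_i - \bar{y}) \\
C &= \begin{pmatrix}
\operatorname{var}(x) & \operatorname{cov}(x, y) \\
\operatorname{cov}(x, y) & \operatorname{var}(y)
\end{pmatrix}
\end{aligned}
```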

15
Principal Component Analysis(PCA)

Then there exists a matrix U, built from the
eigenvectors, that diagonalizes this matrix.
The 2 eigenvectors are orthogonal. They represent a
new system of INDEPENDENT coordinates. The
quantities u11 and u12 are the
coordinates of the new axis expressed in
vector form. The same holds for u21 and u22.
Each coefficient indicates the weight of a
particular experiment within this component
(how much this experiment contributes to
generating this pattern).
The change of coordinates amounts to a translation
and a rotation of the coordinate system.

16
Principal Component Analysis(PCA)

The first principal component accounts for as much
of the variability in the data as possible.
Each succeeding component accounts for as much of
the remaining variability as possible.
Imagine a cloud of data in many dimensions: this is
where the benefits appear!
The projection of a point A (x, y) on an axis u
(u1, u2) is obtained by taking the scalar
product of the coordinates of this point and the
vector coordinates of the axis: projection
$= x u_1 + y u_2$.
Here, the points are the genes.
It is interesting to plot the eigenvalues, which
express how the variability of the data is
distributed in the new coordinate system: the
relative sizes of the major and minor axes of the
ellipse.
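
The two-sample case can be worked through in a few lines. The sketch below assumes the 1/(n-1) normalization for the covariance matrix and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix; the function name and the data are hypothetical. Points are centered before projection, which corresponds to the translation mentioned earlier.

```python
import math

def pca_2d(points):
    """PCA of 2-D points: returns (eigenvalues, eigenvectors, scores on the first axis)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for the largest eigenvalue, then its orthogonal companion.
    if b != 0:
        u1 = (b, lam1 - a)
    else:
        u1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*u1)
    u1 = (u1[0] / norm, u1[1] / norm)
    u2 = (-u1[1], u1[0])
    # Projection of each centered point on the first axis: scalar product with u1.
    scores = [(x - mx) * u1[0] + (y - my) * u1[1] for x, y in points]
    return (lam1, lam2), (u1, u2), scores

# Hypothetical expression values of five genes in two samples.
genes = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
(lam1, lam2), (u1, u2), scores = pca_2d(genes)
print("first axis:", u1)
print("share of variance on first axis:", lam1 / (lam1 + lam2))
```

The ratio lam1 / (lam1 + lam2) is the quantity one reads off an eigenvalue plot: it tells how much of the total variability the first axis captures.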

17
Principal Component Analysis(PCA)

Application to sporulation time series:
observations of differential expression for
thousands of genes across multiple conditions.
Usually, the first component has all positive
coefficients, indicating a weighted average of
all experiments.
The second principal component has negative
coefficients at early time points and positive
coefficients at the later time points, indicating
a measure of change in expression.

18
Machine Learning for Micro-Array Analysis
clustering

Cluster analysis
Identification of new subgroups or classes of
some biological entity (e.g., tumors)

19
Hierarchical Clustering

Hierarchical cluster methods differ in
- the distance measure selected
- the manner in which the distances are computed
between the growing clusters and the remaining
members of the data set
Single linkage. Disadvantage: loose clusters.
Complete linkage. Disadvantage: too-compact
clusters of very similar size.
Average linkage
Unweighted pair-group method average (UPGMA): the two
groups with the lowest average distance are joined
to form a new cluster.
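
The three linkage rules can be compared with a small agglomerative sketch. It assumes Euclidean distance between expression vectors and stops when a requested number of clusters is reached; the function names and data are hypothetical, and the brute-force search is meant for illustration only.

```python
import math

def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [distance(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(pairwise)                  # closest pair
    if linkage == "complete":
        return max(pairwise)                  # farthest pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # UPGMA-style average over all pairs
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Repeatedly join the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]  # hypothetical 2-D profiles
print(agglomerate(points, 2, linkage="average"))
```

Recording the merge order and merge distances instead of stopping at n_clusters would give the information needed to draw a dendrogram.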

20
Hierarchical Clustering

Euclidean and Manhattan distances: sensitive to absolute
expression levels. Reveal genes that have similar
expression levels.
A and B have approximately the
same expression levels
Correlation coefficient with centering: sensitive
to expression profiles. Reveals genes that have
similar expression profiles.
D and E enhanced
A and C repressed
Absolute correlation coefficient
A, C, D, E may be involved in
the same biological pathway
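
The difference between these measures shows up clearly on a toy example. The sketch below uses hypothetical profiles g1, g2 and g3 (not the genes A to E of the lost figure) and defines the correlation-based distances as 1 - r and 1 - |r|, which is one common convention.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Centered correlation coefficient (undefined for a constant profile)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def correlation_distance(a, b):
    return 1 - pearson(a, b)

def absolute_correlation_distance(a, b):
    return 1 - abs(pearson(a, b))

g1 = [1, 2, 3, 4]      # rising profile, low level
g2 = [10, 20, 30, 40]  # same shape, much higher level
g3 = [4, 3, 2, 1]      # mirror image of g1

print("euclidean g1-g2:", euclidean(g1, g2))
print("correlation distance g1-g2:", correlation_distance(g1, g2))
print("correlation distance g1-g3:", correlation_distance(g1, g3))
print("absolute correlation distance g1-g3:", absolute_correlation_distance(g1, g3))
```

Euclidean distance keeps g1 and g2 far apart because their levels differ, the centered correlation groups them because their shapes agree, and the absolute correlation also groups g1 with its mirror image g3.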

21
K-means Clustering

1. Randomly assign the data to the clusters.
Suppose there are m genes per cluster.
2. Calculate an average expression vector for
each cluster i.
This corresponds to the centroid of the
cluster.
3. Calculate a mean within-cluster distance
between each point and the centroid, for each
cluster.
4. Move the data from one cluster to another,
with the aim of minimizing the overall within-cluster
distance measure.
ADVANTAGES: easy to implement.
DISADVANTAGES: computationally intensive;
the outcome is
determined by such factors as the distance metric
chosen.
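
The steps above can be sketched as follows. This is a minimal version assuming Euclidean distance and the usual rule of moving every point to its nearest centroid until nothing changes; the `initial_labels` parameter and the re-seeding of empty clusters are choices made for this sketch, not part of the slides.

```python
import math
import random

def centroid(vectors):
    """Average expression vector of a cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def k_means(data, k, initial_labels=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random assignment of the data to the clusters (unless labels are given).
    labels = list(initial_labels) if initial_labels else [rng.randrange(k) for _ in data]
    centroids = []
    for _ in range(max_iter):
        # Step 2: centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [v for v, lab in zip(data, labels) if lab == c]
            # An empty cluster is re-seeded with a random data point.
            centroids.append(centroid(members) if members else list(rng.choice(data)))
        # Steps 3-4: move each point to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in data]
        if new_labels == labels:
            break  # no point changed cluster
        labels = new_labels
    return labels, centroids

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # hypothetical profiles
labels, centroids = k_means(data, k=2)
print(labels)
```

Because the start is random, different seeds can give different labelings; running the algorithm several times and keeping the best result is a common remedy.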

22
Non-parametric models

Models that rely heavily on the empirical
analysis of large data sets rather than on prior
domain knowledge
Non-parametric Approaches
Decision trees, Neural networks, Genetic
algorithms, and Nearest neighbor methods.
Fundamental assumption
Consistently observed relationships or
patterns in large data sets will recur in future
observations.
Advantages
Does not require a thorough understanding of the
underlying system or problem
Can be used to build arbitrarily complex models,
that are highly non-linear and not restricted by
human comprehension.

23
Decision Tree

Strengths
Clearly indicates which attributes are most
important for prediction or classification.
Weaknesses
Limited ability to handle estimation or
regression tasks where the goal is to predict the
value of a continuous variable
Error-prone when the number of training examples
per class is small

24
Neural Networks

Strengths
Ability to handle a range of problem tasks
including classification (discrete outputs) and
estimation or regression tasks (continuous
outputs)
Provision of an indication (through sensitivity
analysis) of which attributes are most important
for prediction or classification

Weaknesses
The risk of premature convergence to an inferior
solution (this is normally addressed by
performing a sensible cross-validation procedure)

25
Multi-Model Approaches

Problem with the regular models
Instability of Prediction Method
Sensitivity of the final model to small changes
in the training set.
Unstable machine learning methods
Decision trees
Stable methods
k-nearest neighbor
Neural models

Now, let us see an approach to address the
instability problem.
26
Machine Learning for Micro-Array Analysis

Cross-validation
To test the robustness of the classifier
Algorithm choice depends on
Attributes
Ratio of the training data
TP, TN: if TP is small, over-fitting occurs
Combined approaches
With a limited amount of training data, an individual
classifier may not represent the true hypothesis.
A combined classifier may produce a good
approximation to the true hypothesis.
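
Cross-validation can be illustrated with a short sketch. It assumes k-fold splitting without shuffling and uses a 1-nearest-neighbor classifier as a stand-in for whatever classifier is being assessed; the samples and the class names "A" and "B" are hypothetical.

```python
import math

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]

def cross_validate(xs, ys, k):
    """Fraction of samples classified correctly when each fold is held out in turn."""
    correct = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        for i in fold:
            if nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]:
                correct += 1
    return correct / len(xs)

xs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ys = ["A", "A", "A", "B", "B", "B"]
print("cross-validated accuracy:", cross_validate(xs, ys, k=3))
```

Every sample is used for testing exactly once and never appears in the training set of its own fold, which is what makes the estimate an honest measure of robustness.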

27
Multi-Model Approaches

Common methods for constructing multi-model
systems
Boosting, Bagging, and Stacking

What do they do?
They create and combine multiple classifiers.
How are they different from each other?
They differ in how the classifiers are trained and in
how their outputs are combined.
How do they improve accuracy?
They improve accuracy by focusing the learning
process on examples in the data that are harder
to model than others.

28
Boosting
29
Boosting Algorithm

Step 1: Form the learning set and validation set
(with uniform sampling, without replacement).
Step 2: N different training set replicas are
sampled adaptively (with non-uniform sampling
probabilities and with replacement).
Step 3: Build each classifier, f'i (x), based on
its training set.
Step 4: Establish each classifier's performance
by testing it against the learning set.
Step 5: Calculate a weight for each classifier
based on its performance.
Step 6: Combine the models by means of a weighted
voting scheme, where each individual prediction
model carries a different weight.
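
Steps 2 to 6 can be sketched in an AdaBoost-like form. Several details below are assumptions made for illustration and are not taken from the slides: the base classifier is a one-feature threshold rule (a "stump"), labels are coded as -1 and +1, the classifier weight is 0.5 * ln((1 - err) / err), and the split into learning and validation sets (Step 1) is skipped.

```python
import math
import random

def train_stump(xs, ys):
    """Best threshold rule on one numeric feature; labels are -1 or +1."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all values, then midpoints between neighbors.
    candidates = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (sign if x >= threshold else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign

def boost(xs, ys, n_rounds=5, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    weights = [1.0 / n] * n  # sampling probabilities, uniform at first
    ensemble = []            # list of (classifier weight, classifier)
    for _ in range(n_rounds):
        # Step 2: sample a training replica adaptively, with replacement.
        idx = rng.choices(range(n), weights=weights, k=n)
        # Step 3: build a classifier on the replica.
        stump = train_stump([xs[i] for i in idx], [ys[i] for i in idx])
        # Step 4: weighted error on the whole learning set (clipped to avoid log(0)).
        err = sum(w for w, x, y in zip(weights, xs, ys) if stump(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        # Step 5: classifier weight based on its performance.
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples become more likely to be sampled next round.
        weights = [w * math.exp(-alpha * y * stump(x)) for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Step 6: weighted vote of all classifiers."""
    score = sum(alpha * clf(x) for alpha, clf in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 7, 8, 9]        # hypothetical expression level of one gene
ys = [-1, -1, -1, 1, 1, 1]     # hypothetical class labels
ensemble = boost(xs, ys)
print([predict(ensemble, x) for x in xs])
```

The toy data are separable by a single threshold, so most rounds produce the same rule; on realistic data the rounds differ because the sampling probabilities shift toward the examples that earlier classifiers got wrong.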