Title: Clustering and Classification In Gene Expression Data
1Clustering and Classification In Gene Expression Data
- Carlo Colantuoni
- ccolantu_at_jhsph.edu
Slide acknowledgements: Elizabeth Garrett-Mayer, Rafael Irizarry, Giovanni Parmigiani, David Madigan, Kevin Coombes, Richard Simon, Ingo Ruczinski. Classification material is based in part on Chapter 10 of Hand, Mannila, and Smyth, and Chapter 7 of Han and Kamber.
2Data from Garber et al. PNAS (98), 2001.
4Clustering
- Clustering is an exploratory tool to see who's running with whom: genes and samples.
- Unsupervised.
- NOT for classification of samples.
- NOT for identification of differentially expressed genes.
5Clustering
- Clustering organizes things that are close into groups.
- What does it mean for two genes to be close?
- What does it mean for two samples to be close?
- Once we know this, how do we define groups?
- Hierarchical and K-means clustering
6Distance
- We need a mathematical definition of distance between two points.
- What are points?
- If each gene is a point, what is the mathematical definition of a point?
7Points
DATA MATRIX: G genes (rows) × N samples (columns)
- Gene1 = (E11, E12, …, E1N)
- Gene2 = (E21, E22, …, E2N)
- Sample1 = (E11, E21, …, EG1)
- Sample2 = (E12, E22, …, EG2)
- Egi = expression of gene g in sample i
8Most Famous Distance
- Euclidean distance
- Example: the distance between gene 1 and gene 2 is d(1,2) = sqrt( Σ_{i=1..N} (E1i − E2i)² )
- When N is 2, this is distance as we know it (the figure showed the straight-line distance between Baltimore and DC on a map).
- When N is 20,000, you have to think abstractly (a sketch follows below).
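A minimal Python/NumPy sketch of this computation; the two expression vectors are made-up values for illustration:

```python
import numpy as np

# Two hypothetical expression profiles measured across N = 5 samples.
gene1 = np.array([2.1, 0.4, -1.3, 0.8, 1.5])
gene2 = np.array([1.9, 0.1, -0.9, 1.2, 1.1])

# Euclidean distance: square root of the sum of squared differences.
d = np.sqrt(np.sum((gene1 - gene2) ** 2))
print(d)  # identical to np.linalg.norm(gene1 - gene2)
```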
9Correlation can also be used to compute distance
- Pearson Correlation
- Spearman Correlation
- Uncentered Correlation
- Absolute Value of Correlation
10- The difference is that if you have two vectors X and Y with identical shape, but which are offset relative to each other by a fixed value, they will have a standard Pearson (centered) correlation of 1 but will not have an uncentered correlation of 1 (see the sketch below).
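A short Python sketch of the contrast just described, using hypothetical vectors x and y offset by a fixed value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x + 10.0  # identical shape, offset by a fixed value

# Centered (Pearson) correlation subtracts each vector's mean first.
pearson = np.corrcoef(x, y)[0, 1]

# Uncentered correlation: the same formula without mean-centering.
uncentered = np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

print(pearson)     # 1.0, because the shapes are identical
print(uncentered)  # about 0.95, penalized for the offset

# A correlation-based distance is then commonly taken as 1 - correlation.
```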
11The similarity/distance matrices
- From the G × N DATA MATRIX, compute the G × G GENE SIMILARITY MATRIX: the distance between every pair of genes (rows).
12The similarity/distance matrices
- From the G × N DATA MATRIX, compute the N × N SAMPLE SIMILARITY MATRIX: the distance between every pair of samples (columns). A sketch of both computations follows below.
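A sketch of both computations in Python with SciPy; the random data matrix is a stand-in for real expression values:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
G, N = 100, 6                    # 100 genes (rows) x 6 samples (columns)
data = rng.normal(size=(G, N))   # stand-in for a real expression matrix

# G x G gene distance matrix: distances between all pairs of rows.
gene_dist = squareform(pdist(data, metric="euclidean"))

# N x N sample distance matrix: distances between all pairs of columns
# ("correlation" gives 1 minus the Pearson correlation).
sample_dist = squareform(pdist(data.T, metric="correlation"))

print(gene_dist.shape, sample_dist.shape)  # (100, 100) (6, 6)
```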
13Gene and Sample Selection
- Do you want all genes included?
- What to do about replicates from the same individual/tumor?
- Genes that contribute noise will affect your results.
- If all genes are included, the whole dendrogram can't be seen at the same time.
- Perhaps screen the genes?
14Two commonly seen clustering approaches in gene
expression data analysis
- Hierarchical clustering
- Dendrogram (red-green picture)
- Allows us to cluster both genes and samples in one picture and see the whole dataset organized
- K-means/K-medoids
- Partitioning method
- Requires the user to define the number of clusters, K, a priori
- No picture to (over)interpret
15Hierarchical Clustering
- The most overused statistical method in gene expression analysis
- Gives us a pretty red-green picture with patterns
- But the pretty picture tends to be pretty unstable
- Many different ways to perform hierarchical clustering
- Tends to be sensitive to small changes in the data
- Provides clusters of every size: where to cut the dendrogram is user-determined
16Choose clustering direction
- Agglomerative clustering (bottom-up)
- Starts with each gene in its own cluster
- Joins the two most similar clusters
- Then joins the next two most similar clusters
- Continues until all genes are in one cluster
- Divisive clustering (top-down)
- Starts with all genes in one cluster
- Chooses the split so that genes within each of the two clusters are most similar (maximizing the distance between clusters)
- Finds the next split in the same manner
- Continues until all genes are in single-gene clusters
17Choose linkage method (if bottom-up)
- Single linkage: join clusters whose distance between closest genes is smallest (tends to give elongated, elliptical clusters)
- Complete linkage: join clusters whose distance between furthest genes is smallest (tends to give compact, spherical clusters)
- Average linkage: join clusters whose average distance is the smallest (see the sketch below)
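A sketch of agglomerative clustering with SciPy, showing where the linkage choice above enters; the two-group data are simulated:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
# Simulated data: two groups of 15 genes measured on 10 samples.
data = np.vstack([rng.normal(0, 1, (15, 10)),
                  rng.normal(3, 1, (15, 10))])

# Agglomerative (bottom-up) clustering; method can be "single",
# "complete", or "average", matching the linkage choices above.
Z = linkage(data, method="average", metric="euclidean")

# The user decides where to cut the dendrogram, e.g. into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z)  # draws the tree (requires matplotlib)
```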
18Dendrogram Creation and Interpretation
19Dendrogram Creation and Interpretation
20Dendrogram Creation and Interpretation
21Cluster Assignment
22Simulated Data with 4 clusters: samples 1-10, 11-20, 21-30, 31-40
450 relevant genes + 450 noise genes.
450 relevant genes.
23K-means and K-medoids
- Partitioning method
- Don't get a pretty picture
- MUST choose the number of clusters K a priori
- More of a black box, because the output is most commonly looked at purely as assignments
- Each object (gene or sample) gets assigned to a cluster
- Begins with an initial partition
- Iterates so that objects within clusters are most similar
24K-means (continued)
- Euclidean distance is most often used, which favors spherical clusters
- Can be hard to choose or figure out K
- No unique solution: the clustering can depend on the initial partition
- No pretty figure to (over)interpret
25K-means Algorithm
1. Choose K centroids at random.
2. Make an initial partition of objects into K clusters by assigning each object to its closest centroid, then calculate the centroid (mean) of each of the K clusters.
3. a. For object i, calculate its distance to each of the centroids.
   b. Allocate object i to the cluster with the closest centroid.
   c. If object i was reallocated, recalculate the centroids based on the new clusters.
4. Repeat step 3 for objects i = 1, …, N.
5. Repeat steps 3 and 4 until no reallocations occur.
6. Assess the cluster structure for fit and stability (see the sketch below).
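A minimal NumPy sketch of these steps. One simplification: the slide describes recalculating centroids after each reallocation, while this sketch uses the common batch variant (Lloyd's algorithm) that updates all centroids once per pass:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K centroids at random (here, K random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-4: assign each object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no reallocations occur.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its cluster.
        # (A production version would guard against empty clusters.)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Hypothetical 2-D data, like the 14-point example on the next slides.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (7, 2)), rng.normal(3, 0.5, (7, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```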
26K-means
- We start with some data.
- Interpretation: we are showing expression for two samples across 14 genes, or expression for two genes across 14 samples.
- Either way, this example is in 2 dimensions.
Iteration 0
27K-means
- Choose K centroids
- These are starting values that the user picks.
- There are some data-driven ways to do it.
Iteration 0
28K-means
- Make the first partition by finding the closest centroid for each point.
- This is where distance is used.
Iteration 1
29K-means
- Now re-compute the centroids by taking the middle
of each cluster
Iteration 2
30K-means
- Repeat until the centroids stop moving or until
you get tired of waiting
Iteration 3
31K-means Limitations
- Final results depend on starting values.
- How do we choose K? There are methods, but not much theory saying what is best.
- Where are the pretty pictures?
32Assessing cluster fit and stability
- Most often ignored.
- Cluster structure is treated as reliable and precise.
- Can be VERY sensitive to noise and to outliers.
- Homogeneity and separation.
- Cluster silhouettes: how similar each gene is to the rest of its cluster versus to genes in other clusters (Rousseeuw, Journal of Computational and Applied Mathematics, 1987).
33Silhouettes
- The silhouette of gene i is defined as s(i) = (bi − ai) / max(ai, bi), where:
- ai = average distance of gene i to other genes in the same cluster
- bi = average distance of gene i to genes in its nearest neighboring cluster
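A sketch of the silhouette computation using scikit-learn on simulated two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# s(i) = (b_i - a_i) / max(a_i, b_i) for each object.
s = silhouette_samples(X, labels)
print(s.mean())                     # values near 1 = tight, well-separated
print(silhouette_score(X, labels))  # the overall average silhouette
```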
34WADP: Weighted Average Discrepant Pairs
- Add perturbations to the original data.
- Calculate the number of paired samples that clustered together in the original clustering but didn't in the perturbed one.
- Repeat for every cutoff (i.e., for each k).
- Do this iteratively.
- Estimate, for each k, the proportion of discrepant pairs (a simplified sketch follows below).
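A simplified, unweighted sketch of the WADP idea (perturb, recluster, count discrepant pairs); the Gaussian noise level and the k-means clusterer are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def discrepant_pair_rate(X, k, noise_sd=0.5, n_reps=20, seed=0):
    """Perturb the data, recluster, and count sample pairs that were
    together in the original clustering but split in the perturbed one."""
    rng = np.random.default_rng(seed)
    orig = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    together = orig[:, None] == orig[None, :]   # co-clustered originally?
    n_pairs = together.sum() - len(X)           # exclude self-pairs
    rates = []
    for _ in range(n_reps):
        Xp = X + rng.normal(0, noise_sd, X.shape)      # perturbed data
        pert = KMeans(n_clusters=k, n_init=10).fit_predict(Xp)
        split = together & (pert[:, None] != pert[None, :])
        rates.append(split.sum() / n_pairs)
    return float(np.mean(rates))

# Lower rates = more stable clustering; compare across candidate k.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 5)), rng.normal(4, 1, (15, 5))])
print(discrepant_pair_rate(X, k=2))
```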
36Classification
- Diagnostic tests are good examples of classifiers.
- A patient has a given disease or not.
- The classifier is a machine that accepts some clinical parameters as input and spits out a prediction for the patient:
- D
- Not-D
- Classes must be mutually exclusive and exhaustive.
37Components of Class Prediction
- Select features (genes)
- Which genes will be included in the model
- Select type of classifier
- E.g., (D)LDA, SVM, k-nearest-neighbor, etc.
- Fit parameters for model (train the classifier)
- Quantify predictive accuracy: cross-validation
38Feature Selection
- The goal is to identify a small subset of genes which together give accurate predictions.
- Methods will vary depending on the nature of the classification problem.
- E.g., choose genes with significant t-statistics to distinguish between two simple classes (see the sketch below).
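A sketch of this t-statistic screen in Python with SciPy; the data and the cutoff of 50 genes are hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
G = 1000                              # hypothetical: 1000 genes
X1 = rng.normal(0, 1, (G, 10))        # class 1: 10 samples (columns)
X2 = rng.normal(0, 1, (G, 10))        # class 2: 10 samples
X2[:50] += 2.0                        # make the first 50 genes informative

# Two-sample t-test for every gene (row by row).
t, p = ttest_ind(X1, X2, axis=1)

# Keep, say, the 50 genes with the largest |t| as candidate features.
selected = np.argsort(-np.abs(t))[:50]
print(selected[:10])
```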
39Classifier Selection
- In microarray classification, the number of features is (almost) always much greater than the number of samples.
- Overfitting is a distinct risk, and it increases with more complicated methods.
40How microarrays differ from the rest of the world
- Complex classification algorithms, such as neural networks, that perform better elsewhere don't do as well as simpler methods on expression data.
- Comparative studies have shown that simpler methods work as well as or better for microarray problems because the number of candidate predictors exceeds the number of samples by orders of magnitude (Dudoit, Fridlyand, and Speed, JASA 2001).
41Statistical Methods Appropriate for Class
Comparison may not be Appropriate for Class
Prediction
- Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy.
- Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy.
- Most statistical methods were not developed for p >> n prediction problems.
42Linear discriminant analysis
- If there are K classes, simply draw lines (planes) to divide the space of expression profiles into K regions, one for each class.
- If profile X falls in region k, predict class k (see the sketch below).
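A sketch of linear discriminant analysis with scikit-learn on hypothetical two-class data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical training data: 20 samples x 5 selected genes, 2 classes.
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(2, 1, (10, 5))])
y = np.array([0] * 10 + [1] * 10)

# LDA fits the linear boundaries (hyperplanes) described above.
clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(rng.normal(1, 1, (3, 5))))  # classify 3 new profiles
```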
43Nearest Neighbor Classification
- To classify a new observation X, measure the distance d(X, Xi) between X and every sample Xi in the training set.
- Assign to X the class label of its nearest neighbor in the training set (see the sketch below).
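A minimal NumPy sketch of 1-nearest-neighbor classification as described above:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_new):
    """1-nearest-neighbor: label of the closest training sample."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # d(X, Xi) for all i
    return y_train[np.argmin(dists)]

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(3, 1, (10, 5))])
y_train = np.array(["D"] * 10 + ["Not-D"] * 10)
print(nearest_neighbor_predict(X_train, y_train, rng.normal(3, 1, 5)))
```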
44Random Forests
- Build several randomized decision trees and have them vote to determine the final classification (see the sketch below).
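A sketch using scikit-learn's random forest; the number of trees is an arbitrary choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 20)), rng.normal(1, 1, (10, 20))])
y = np.array([0] * 10 + [1] * 10)

# 500 randomized trees; the predicted class is the majority vote.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```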
45Evaluating a classifier
- We want to estimate the error rate when the classifier is used to predict the class of a new observation.
- The ideal approach is to get a set of new observations with known class labels and see how frequently the classifier makes the correct prediction.
- Performance on the training set is a poor approach and will deflate the error estimate.
- Cross-validation methods are used to get less biased estimates of error using only the training data.
46Split-Sample Evaluation
- Training set
- Used to select features, select the model type, and determine parameters and cut-off thresholds.
- Test set
- Withheld until a single model is fully specified using the training set.
- The fully specified model is applied to the expression profiles in the test set to predict class labels.
- The number of errors is counted (see the sketch below).
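A sketch of split-sample evaluation with scikit-learn; the data, split fraction, and LDA classifier are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(1, 1, (20, 10))])
y = np.array([0] * 20 + [1] * 20)

# Withhold a test set; features, model type, and thresholds must be
# decided using the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
n_errors = int(np.sum(clf.predict(X_te) != y_te))  # count test-set errors
print(n_errors, "of", len(y_te))
```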
47V-fold cross validation
- Divide the data into V groups.
- Hold one group back, train the classifier on the other V−1 groups, and use it to predict the held-out group.
- Rotate through all V groups, holding each back once.
- The error estimate is the total error rate over all V test groups.
48Leave-one-out Cross Validation
- Hold one data point back, train the classifier on the other n−1 data points, and use it to predict the held-out point.
- Rotate through all n points, holding each back once.
- The error estimate is the total error rate over all n test values (see the sketch below).
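A sketch of both V-fold (V = 5) and leave-one-out estimates with scikit-learn, on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (15, 10)), rng.normal(1, 1, (15, 10))])
y = np.array([0] * 15 + [1] * 15)
clf = LinearDiscriminantAnalysis()

# V-fold (here V = 5): each of the 5 groups is held out once.
vfold = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))

# Leave-one-out: n "folds" of size 1.
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(1 - vfold.mean(), 1 - loo.mean())  # estimated error rates
```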
49Non-cross-validated Prediction
1. Prediction rule is built using the full data set.
2. Rule is applied to each specimen for class prediction.

Cross-validated Prediction (Leave-one-out method)
1. Full data set is divided into training and test sets (the test set contains 1 specimen).
2. Prediction rule is built from scratch using the training set.
3. Rule is applied to the specimen in the test set for class prediction.
4. Process is repeated until each specimen has appeared once in the test set.
50Which to use depends mostly on sample size
- If the sample is large enough, split into test and training groups.
- If the sample is barely adequate for either testing or training, use leave-one-out.
- In between, consider V-fold. This method can give more accurate estimates than leave-one-out, but it reduces the size of the training set.
51Beware
- Cross-validation of a model cannot occur after selecting the genes to be used in the model: gene selection must be repeated inside each cross-validation loop (see the sketch below).
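A sketch of correct cross-validation with scikit-learn: putting gene selection inside a Pipeline means it is redone within every training fold. On pure-noise data this correctly reports chance-level accuracy, whereas selecting genes on the full data first would look deceptively good:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))      # pure noise: 30 samples, 1000 "genes"
y = np.array([0] * 15 + [1] * 15)

# Gene selection lives INSIDE the pipeline, so it is repeated from
# scratch within every training fold: the correct procedure.
model = Pipeline([("select", SelectKBest(f_classif, k=10)),
                  ("clf", LinearDiscriminantAnalysis())])
acc = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
print(acc.mean())  # close to 0.5 (chance) on noise, as it should be
```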
52Incomplete (incorrect) Cross-Validation
- Publications are using all the data to select genes and then cross-validating only the parameter-estimation component of model development.
- This is highly biased.
- Many published complex methods make strong claims based on incorrect cross-validation.
- This is frequently seen in complex feature-set selection algorithms.
- Some software encourages inappropriate cross-validation.
53Gene-Expression Profiles in Hereditary Breast
Cancer
- Breast tumors studied:
- 7 BRCA1 tumors
- 8 BRCA2 tumors
- 7 sporadic tumors
- Log-ratio measurements of 3,226 genes for each tumor after initial data filtering

RESEARCH QUESTION: Can we distinguish BRCA1+ from BRCA1− cancers, and BRCA2+ from BRCA2− cancers, based solely on their gene expression profiles?
54BRCA1
55BRCA2