CLUSTERING - PowerPoint PPT Presentation

1
CLUSTERING GENE EXPRESSION DATA BASED ON P-VALUES
Rebecka Jornsten
Department of Statistics
Rutgers University
rebecka_at_stat.rutgers.edu
Joint work with Prof. Regina Liu and Jun Li
2
Gene Expression Data
N: the number of genes considered
C: the number of experimental conditions
n_j: the number of replicates under the j-th condition
X_ijk: the gene expression level of the k-th replicate under
the j-th condition for the i-th gene

             Cond1                Cond2                       Cond3
Gene   rep1   rep2   rep3   rep1   rep2   rep3    ...   rep1   rep2   rep3
1      0.46   0.30   0.80   1.51   0.90   1.01    ...  -0.43   0.12  -0.17
2     -0.10   0.49   0.24   0.06   0.46   0.20    ...   0.27   1.32   0.89
3      0.15   0.74   0.04   0.10   0.20   0.55    ...  -0.54  -0.42  -0.05

Example entry: X_322 = 0.20 (gene 3, condition 2, replicate 2)

Assumption: X_ijk iid Normal with mean μ_ij and SD σ_ij
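
To make the indexing concrete, here is a minimal Python sketch that draws data under the normality assumption on this slide: each replicate for gene i under condition j is an independent normal draw with a gene- and condition-specific mean and SD. The function name, the example means and SDs, and the seed are illustrative assumptions, not values from the talk.

```python
import random

def simulate_expression(mu, sigma, n_reps, seed=0):
    """Return X with X[i][j][k] drawn independently from Normal(mu[i][j], sigma[i][j]).

    mu and sigma are (genes x conditions) nested lists; n_reps[j] is the
    number of replicates under condition j. All names are hypothetical.
    """
    rng = random.Random(seed)
    return [
        [
            [rng.gauss(mu[i][j], sigma[i][j]) for _ in range(n_reps[j])]
            for j in range(len(n_reps))
        ]
        for i in range(len(mu))
    ]

# Made-up example: 3 genes, 2 conditions, 3 replicates per condition.
mu = [[0.5, 1.0], [0.0, 0.0], [0.5, 1.0]]
sigma = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]
X = simulate_expression(mu, sigma, n_reps=[3, 3])

# Python indices start at 0, so X[2][1][1] is X_322 in the slide's notation.
print(X[2][1][1])
```

Storing the data as gene, then condition, then replicate keeps all replicates of one gene under one condition together, which is the unit the later pairwise tests work on.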
3
We perform the following two tasks jointly.

Gene clustering
  • Which genes are similar? Which are different?

Components in clustering analysis
1. Similarity/dissimilarity measure
2. Clustering algorithm
3. Cluster validation

Gene selection
  • Which genes are significantly differentially expressed?

Components in gene selection
1. Type of test
2. Computing p-values (based on distribution assumptions, or
permutation)
3. Adjusting p-values for multiple comparisons
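
As an illustration of the last component, the sketch below implements the Benjamini-Hochberg (BH) step-up adjustment, the correction this talk uses as its comparison method. The function name and the example p-values are made up for illustration; this is a plain-Python sketch of the standard procedure, not the authors' code.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pvals)
selected = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, selected)
```

Genes whose adjusted p-value falls at or below the chosen FDR level are selected, one gene at a time; the clustering approach in this talk instead selects whole clusters.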
4
Motivation
P-value based clustering
  • Joint gene clustering and selection
  • Explicitly takes into account the variability of the data (the
    experimental setup)
  • Provides a standardized way to assess the degree of similarity
    between genes
  • Is less arbitrary than many of the existing choices of
    dissimilarity measures, such as Euclidean distance
  • Can be easily calibrated for different separating criteria
  • Increased power by selecting clusters of genes, rather than genes
    one-by-one
  • Name-that-cluster: we can anchor the clusters with pseudo-genes
    with experimental profiles of interest

5
P-value as the dissimilarity measure
Similarity between genes (separating criterion) is cast as a
hypothesis testing problem.
Examples (a lot of flexibility!):
  • Testing whether the gene mean vectors across experimental
    conditions are equal
  • Testing whether gene mean and variance vectors across
    experimental conditions are equal
  • Alt 1: assuming independent errors for all genes
  • Alt 2: assuming independent errors for replicates, but allowing
    genes to be correlated within each replicate (pairwise tests)
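
The first two example criteria can be written as null hypotheses for a pair of genes i and i', using the mean and SD of the normal model from the data slide. The labels (1) and (2) are introduced here only for reference; the slides do not number them.

```latex
\begin{align*}
H_0^{(1)} &: \mu_{ij} = \mu_{i'j}
  \quad \text{for all } j = 1, \dots, C \\
H_0^{(2)} &: \mu_{ij} = \mu_{i'j} \ \text{and}\ \sigma_{ij} = \sigma_{i'j}
  \quad \text{for all } j = 1, \dots, C
\end{align*}
```

Rejecting the null hypothesis for a pair is read as evidence that the two genes are dissimilar under the chosen criterion.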

6
P-value as the dissimilarity measure (ctd)
  • Determine whether or not two genes are dissimilar according to
    some specified criterion
  • Equivalently, determine whether or not to reject the null
    hypothesis associated with the specified criterion
  • Small p-values: strong evidence against similarity
  • Dissimilarity measure: 1 - P-value
  • What we need for clustering: apply our chosen test and P-value
    computation method to all pairs of genes
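
The all-pairs step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a permutation test (one of the p-value options mentioned earlier in the talk) with a hypothetical statistic, the sum over conditions of squared differences in replicate means, and it shuffles gene labels within each condition. The three genes use the first two conditions of the example data table.

```python
import random

def mean(values):
    return sum(values) / len(values)

def statistic(gene_a, gene_b):
    """Sum over conditions of squared differences in replicate means."""
    return sum((mean(a) - mean(b)) ** 2 for a, b in zip(gene_a, gene_b))

def perm_pvalue(gene_a, gene_b, n_perm=999, seed=0):
    """Permutation p-value for H0: equal mean vectors across conditions.

    gene_a[j] and gene_b[j] hold the replicates under condition j.
    """
    rng = random.Random(seed)
    observed = statistic(gene_a, gene_b)
    hits = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(gene_a, gene_b):
            pooled = list(a) + list(b)
            rng.shuffle(pooled)  # reassign gene labels within this condition
            perm_a.append(pooled[:len(a)])
            perm_b.append(pooled[len(a):])
        # Small tolerance guards against floating-point ties.
        if statistic(perm_a, perm_b) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def dissimilarity_matrix(genes, n_perm=999, seed=0):
    """Return the symmetric matrix of 1 - p-value for all gene pairs."""
    n = len(genes)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for h in range(i + 1, n):
            p = perm_pvalue(genes[i], genes[h], n_perm, seed)
            D[i][h] = D[h][i] = 1.0 - p
    return D

genes = [
    [[0.46, 0.30, 0.80], [1.51, 0.90, 1.01]],
    [[-0.10, 0.49, 0.24], [0.06, 0.46, 0.20]],
    [[0.15, 0.74, 0.04], [0.10, 0.20, 0.55]],
]
D = dissimilarity_matrix(genes)
```

With only three replicates per condition the smallest attainable permutation p-value is limited, so a distribution-based test may be the more practical choice for small designs.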

7
Clustering algorithms
The gene-gene P-values provide the dissimilarity measures. Now, how
do we cluster the genes? Different clustering algorithms will
generate different results. Some approaches:
  • PAM: a global cost function that emphasizes within-cluster
    similarity, which tends to generate equal-size clusters
  • PAMsil: uses the silhouette validation criterion to cluster,
    taking both between-cluster and within-cluster (dis)similarities
    into account (van der Laan et al.), which makes it
    greedy/aggressive
  • Hierarchical clustering: more or less greedy depending on the
    selected linkage; generates clusters by cutting the tree at a
    chosen level
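
For the PAM option, the sketch below shows a simplified partitioning-around-medoids procedure that works directly on a dissimilarity matrix such as 1 - P-value. It is a small greedy build-then-swap variant written for illustration, not the implementation used in the talk, and the 6 x 6 matrix holds made-up dissimilarities.

```python
def total_cost(D, medoids):
    """Sum over items of the dissimilarity to the nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))

def pam(D, k):
    """Simplified PAM: greedy build, then swaps until no swap lowers the cost."""
    n = len(D)
    medoids = []
    # Build: repeatedly add the item that gives the lowest total cost.
    for _ in range(k):
        best = min(
            (i for i in range(n) if i not in medoids),
            key=lambda i: total_cost(D, medoids + [i]),
        )
        medoids.append(best)
    # Swap: replace a medoid by a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(D, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(D, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Made-up 1 - P-value matrix: genes 0-2 are similar, genes 3-5 are similar.
D = [
    [0.00, 0.10, 0.30, 0.99, 0.99, 0.99],
    [0.10, 0.00, 0.10, 0.99, 0.99, 0.99],
    [0.30, 0.10, 0.00, 0.99, 0.99, 0.99],
    [0.99, 0.99, 0.99, 0.00, 0.10, 0.30],
    [0.99, 0.99, 0.99, 0.10, 0.00, 0.10],
    [0.99, 0.99, 0.99, 0.30, 0.10, 0.00],
]
medoids, labels = pam(D, k=2)
print(medoids, labels)
```

Because the cost only counts within-cluster dissimilarity to the medoid, this criterion ignores between-cluster separation, which is the contrast with PAMsil drawn on this slide.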

8
Cluster validation: how many clusters?
This is a difficult problem for noisy gene expression data! Some
approaches:
  • Silhouette width: P-values are already standardized, so our
    silhouette (for gene i) corresponds to the difference between
    the average P-value from gene i to members of its own cluster
    and the average P-value from gene i to the nearest competing
    cluster.
  • We then select the number of clusters to maximize the
    non-standardized silhouette width.
  • Combined P-values: we select the number of clusters such that the
    combined P-values between all clusters satisfy a chosen criterion
    (e.g. not exceeding 5%).
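
The non-standardized silhouette can be sketched as below, working on dissimilarities D = 1 - P-value. For gene i, a_i is the average dissimilarity to the other members of its own cluster and b_i is the smallest average dissimilarity to another cluster; b_i - a_i then equals the difference in average P-values described above, with no division by max(a_i, b_i). The function name, the zero score for singleton clusters, and the example matrix are assumptions made for illustration.

```python
def nonstandardized_silhouette(D, labels):
    """Average over items of b_i - a_i, computed from a dissimilarity matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i in range(len(D)):
        own = [h for h in clusters[labels[i]] if h != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # assumed convention for singletons / one cluster
            continue
        a = sum(D[i][h] for h in own) / len(own)
        b = min(
            sum(D[i][h] for h in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        widths.append(b - a)
    return sum(widths) / len(widths)

# Made-up 1 - P-value matrix with two clear groups of three genes.
D = [
    [0.00, 0.10, 0.30, 0.99, 0.99, 0.99],
    [0.10, 0.00, 0.10, 0.99, 0.99, 0.99],
    [0.30, 0.10, 0.00, 0.99, 0.99, 0.99],
    [0.99, 0.99, 0.99, 0.00, 0.10, 0.30],
    [0.99, 0.99, 0.99, 0.10, 0.00, 0.10],
    [0.99, 0.99, 0.99, 0.30, 0.10, 0.00],
]
good = nonstandardized_silhouette(D, [0, 0, 0, 1, 1, 1])
bad = nonstandardized_silhouette(D, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

To choose the number of clusters, one would compute this score for the partition obtained at each candidate k and keep the k with the largest value.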

9
Simulation study 1a: Increased power
3 clusters with sizes 20/20/20. False positive/negative discovery
rates for:
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
10
Simulation study 1b: Increased power
3 clusters with sizes 40/10/10. False positive/negative discovery
rates for:
1. BH corrections
2. PAM on p-values
3. PAMsil on p-values, non-standardized sil
4. PAMsil
5. PAM on average data
6. PAM on full data
11
Simulation study 2a - Flexibility
3 clusters, 20 in each; the first and last 10 members of each
cluster have two different variances. Use a likelihood ratio test
(LRT) to test gene pairs for equal mean and variance vectors. The
number of clusters should be 6.
Figure panels: P-value matrix; Euclidean distance based on average;
Euclidean distance based on full data
12
Simulation study 2b - Flexibility
3 clusters with sizes 20/50/10; 5 members of the first cluster have a
larger variance than the rest. Error rates for:
1. PAMsil on p-values
2. PAMsil on average data
3. PAM on average data
13
Simulation study 3: Effect of Replication
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene
Figure panels: P-value matrix; Euclidean distance based on average;
Euclidean distance based on full data
14
Simulation study 3: Effect of Replications
3 clusters, 20 members in each; 5, 15 or 25 replicates for each gene.
Table of the fraction of time we choose 3 clusters over 2 with each
method, for 5, 15 or 25 replicates. Note that the p-value based
method adapts to the data.

                        No. of replicates
Method                    5     15     25
PAMsil on p-values      .19      1      1
PAMsil on average       .42    .40    .39
PAMsil on full data     .38    .39    .45
15
Using microarrays to screen anti-inflammatory
drugs in injured spinal cord
Data provided by R. Hart, Dept. Neuroscience, Rutgers
  • 7 conditions: uninjured, injured but untreated, and injured
    treated by five different drugs
  • 3 replicates of each condition
  • 1664 genes
  • Cluster genes to look for useful patterns
16
P-value based clustering
Cluster annotations:
  • 3 genes, chemokines thought to be beneficial for recovery
  • The null cluster (1518 genes)
  • 7 genes, related to cleaning mechanisms in the cell
    (macrophages), as well as stress response (hsp)
  • Drug 4: NS398, a COX-2 inhibitor
17
Selecting genes
Alt 1: Name-that-cluster: use Welch F to decide which cluster is the
null cluster
Alt 2: Benjamini-Hochberg (BH) correction
Comparison
  • Benjamini-Hochberg: 152 genes selected at FDR 1%
  • P-value based clustering (PAMsil): 146 genes not in the
    null cluster
  • Venn diagram counts: 97 genes selected by BH only, 55 by both
    methods, 91 by P-value based clustering only
Side note 1: If we increase to FDR 5% for BH, the overlap is 104
genes, out of 379.
Side note 2: If we cluster with PAM we get clusters of roughly equal
size, and all clusters contain many null genes.
18
Clustering genes filtered by BH
PAMsil with P-values for the BH subset (152 genes)
Missing some chemokines, and the stress
response genes
PAMsil on average data for the BH subset (152
genes)
19
Alternative clustering
When we increase the number of clusters to 5 for the BH-filtered
average data (PAMsil), the clustering is dominated by a few outliers.
Figure annotation: Macrophage inflammatory protein
Two outlying genes: PAMsil may be too aggressive?
20
Concluding Remarks
  • P-value-based clustering approach
    • reflects the exact experimental setup
    • has valid statistical justification
    • allows for flexible separating criteria
  • Joint gene selection/clustering approach
    • increased power/reduced false negative rate
  • Comparisons
    • PAMsil preferable to PAM, since PAM tends to generate
      equal-sized clusters
    • PAMsil on P-values preferable to PAMsil on average, since it is
      less sensitive to noisy/outlying genes

21
Future work
  • Explore other clustering algorithms
    • asymmetric costs?
    • focus only on between-cluster p-values?
    • robustify PAMsil
  • Explore the use of other tests
    • time course experiments
    • profile tests
  • More extensive simulations, and application to other gene
    expression data sets
  • How to deal with rag-bag genes?
    • allow for rag-bag clusters
    • post-processing/filtering and validation after clustering