More Microarray Analysis: Unsupervised Approaches - PowerPoint PPT Presentation

Slides: 46

Provided by: matthe51

Learn more at: https://www.cs.princeton.edu

Transcript and Presenter's Notes

1
More Microarray Analysis: Unsupervised Approaches

Matt Hibbs
Troyanskaya Lab

2
Outline

Gene Expression vs. DNA applications
A little more normalization (missing values)
Unsupervised Analysis
Basic Clustering
Statistical Enrichment
PCA/SVD
Advanced Clustering
Search-based Approaches

3
Expression / DNA

Some similar analysis concepts, but often very
different goals
Expression: clustering, guilt by association,
functional enrichment
DNA: signal processing, spatial relationships,
motif finding
Visualized differently (heat maps vs. karyoscope)

4
The missing value problem

Microarrays can have systematic or random missing
values
Some algorithms can't deal with missing values
(PCA/SVD in particular)
Instead of hoping missing values won't bias the
analysis, it is better to estimate them accurately

5
Spatial Defects
6
KNN Impute

Idea: use genes with similar expression profiles
to estimate missing values

Gene X: 2 5 7 3 1
Gene A: 2 4 5 7 3 2
Gene B: 8 9 2 1 4 9
Gene C: 3 5 6 7 3 2
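
The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the published KNNimpute implementation: it ranks candidate genes by Euclidean distance over the positions where the target gene is observed and fills each gap with the unweighted mean of the k nearest neighbors. The function name `knn_impute`, the choice k=2, and the position of the gap in Gene X (second column) are assumptions made for the example; the transcript does not say which entry is missing.

```python
import math

def knn_impute(target, candidates, k=2):
    """Fill None entries in `target` using the k most similar candidate rows."""
    # Positions where the target gene actually has a measurement.
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(row):
        # Euclidean distance restricted to the observed positions.
        return math.sqrt(sum((target[i] - row[i]) ** 2 for i in observed))

    neighbors = sorted(candidates, key=distance)[:k]
    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            # Unweighted mean of the neighbors' values (a simplification).
            filled[i] = sum(row[i] for row in neighbors) / len(neighbors)
    return filled

gene_a = [2, 4, 5, 7, 3, 2]
gene_b = [8, 9, 2, 1, 4, 9]
gene_c = [3, 5, 6, 7, 3, 2]
# Assumption: the gap in Gene X is placed in the second column for illustration.
gene_x = [2, None, 5, 7, 3, 1]

print(knn_impute(gene_x, [gene_a, gene_b, gene_c], k=2))
```

With these inputs, Genes A and C are the two nearest neighbors, so the gap is filled with the mean of their second-column values.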
7
Imputation affects downstream analysis

Complete data set
Data set with missing values estimated by
KNNimpute algorithm
Data set with 30 entries missing and filled with
zeros (zero values appear black)
8
Unsupervised Analysis

Supervised techniques are great if you have starting
information (e.g. labels)
But we often don't know enough beforehand to
apply these methods
Unsupervised techniques are exploratory
Let the data organize itself, then try to find
biological meaning
Approaches to understand whole data
Visualization often helpful

9
Clustering

Let the data organize itself
Reordering of genes (or conditions) in the
dataset so that similar patterns are next to each
other (or in separate groups)
Identify subsets of genes (or experiments) that
are related by some measure

10
Quick Example
Conditions
Genes
11
Why cluster?

Guilt by association: if unknown gene X is
similar in expression to known genes A and B,
maybe they are involved in the same or a related
pathway
Visualization: datasets are too large to get
information out of without reorganizing the
data

12
Clustering Techniques

Algorithm (Method)
Hierarchical
K-means
Self Organizing Maps
QT-Clustering
NNN
...

Distance Metric
Euclidean (L2)
Pearson Correlation
Spearman Correlation
Manhattan (L1)
Kendall's τ
...

13
Distance Metrics

Choice of distance measure is important for most
clustering techniques
Pair-wise metrics compare vectors of numbers,
e.g. genes x and y, each with n measurements
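
As a rough illustration of what a pair-wise metric computes, here is a sketch of three of the measures listed on the previous slide for two genes x and y with n measurements each. The formulas from the slide images are not in this transcript, so these are the standard textbook definitions; turning Pearson correlation into a distance as 1 - r is one common convention and is an assumption here.

```python
import math

def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_distance(x, y):
    """1 - Pearson correlation; 0 means perfectly correlated profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)

x = [1, 2, 3]
y = [2, 4, 6]  # same shape as x, twice the magnitude
print(euclidean(x, y), manhattan(x, y), pearson_distance(x, y))
```

The example shows why the choice matters: x and y are far apart under Euclidean and Manhattan distance but essentially identical under the correlation-based distance, which ignores magnitude and looks only at the shape of the profile.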

14
Distance Metrics
15
Hierarchical clustering

Imposes (pair-wise) hierarchical structure on all
of the data
Often good for visualization
Basic Method (agglomerative)
1. Calculate all pair-wise distances
2. Join the closest pair
3. Calculate the joined pair's distance to all others
4. Repeat from 2 until all joined
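
The agglomerative method above can be sketched as follows. This is a deliberately naive version for illustration (it recomputes every cluster distance on each pass rather than updating a distance matrix), and it assumes Euclidean distance with average linkage; the function name `agglomerate` and the toy profiles are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(profiles):
    """Return the merge order as a list of (members_a, members_b, distance)."""
    # Every gene starts as its own cluster, identified by row index.
    clusters = [[i] for i in range(len(profiles))]
    merges = []

    def cluster_distance(a, b):
        # Average linkage: mean over all pair-wise distances between members.
        total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Distances between all current clusters, including any joined pair.
        pairs = [(cluster_distance(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        # Join the closest pair and record the merge.
        d, p, q = min(pairs)
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges

profiles = [[0, 0], [0, 1], [5, 5], [5, 6]]
for members_a, members_b, d in agglomerate(profiles):
    print(members_a, members_b, round(d, 2))
```

The recorded merge order is what a dendrogram draws: the two tight pairs join first, and the final merge connects them at a much larger distance.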

16
Hierarchical clustering
17
Hierarchical clustering
18
Hierarchical clustering
19
Hierarchical clustering
20
Hierarchical clustering
21
Hierarchical clustering
22
HC Interior Distances

Three typical variants to calculate interior
distances within the tree
Average linkage: mean/median over all possible
pair-wise values
Single linkage: minimum pair-wise distance
Complete linkage: maximum pair-wise distance
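
The three variants differ only in how they summarize the pair-wise distances between the members of two clusters. A minimal sketch, with a hypothetical helper `linkage_distance` and one-dimensional toy values chosen so the arithmetic is easy to follow (the average here uses the mean, not the median):

```python
def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under the named linkage rule."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pairwise)                  # closest pair of members
    if method == "complete":
        return max(pairwise)                  # farthest pair of members
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage method: {method}")

def absolute_difference(a, b):
    return abs(a - b)

cluster_a = [1, 2]
cluster_b = [4, 8]
for method in ("single", "average", "complete"):
    print(method, linkage_distance(cluster_a, cluster_b, absolute_difference, method))
```

For these clusters the pair-wise distances are 3, 7, 2 and 6, so single linkage reports 2, average linkage 4.5, and complete linkage 7.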

23
Hierarchical clustering problems

Hard to define distinct clusters
Genes assigned to clusters on the basis of all
experiments
Optimizing node ordering is hard (finding the
optimal solution is NP-hard)
Can be driven by one strong cluster: a problem
for gene expression because data in row space is
often highly correlated

24
HC Real Example

Demo in JavaTreeView and HIDRA
Spellman et al., 1998: yeast alpha-factor synchronized
cell cycle time course

25
HC Another Example

Expression of tumors hierarchically clustered
Expression groups by clinical class

Garber et al.
26
K-means Clustering

Groups genes into a pre-defined number of
independent clusters
Basic algorithm
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed
(often with a random gene)
3. Assign each gene to the cluster with the most
similar seed
4. Recalculate all cluster seeds as means (or
medians) of genes assigned to the cluster
5. Repeat 3 and 4 until convergence
(e.g. no genes move, means don't change much,
etc.)
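
The basic algorithm above can be sketched in plain Python. This is a minimal illustration under stated assumptions: squared Euclidean distance as the similarity measure, seeds drawn from random genes, means (not medians) as the updated seeds, and "no genes move" as the convergence test. The function name `kmeans`, the `seed` and `max_iter` parameters, and the toy data are hypothetical.

```python
import random

def kmeans(profiles, k, seed=0, max_iter=100):
    """Return (assignment, centers) for a list of equal-length profiles."""
    rng = random.Random(seed)
    # Initialize each cluster seed with a randomly chosen gene.
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the cluster with the most similar seed.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in profiles
        ]
        # Convergence: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recalculate each seed as the mean of the genes assigned to it.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # an empty cluster keeps its previous seed
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

profiles = [[0, 0], [0, 1], [5, 5], [5, 6]]
assignment, centers = kmeans(profiles, k=2)
print(assignment, centers)
```

Cluster labels depend on the random initialization, so only the grouping is meaningful: here the first two profiles end up together and the last two end up together, whichever label each group receives.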

27
K-means example
28
K-means example
29
K-means example
30
K-means problems

Have to set k ahead of time
Ways to choose optimal k: minimize
within-cluster variation compared to random data
or held-out data
Each gene belongs to exactly 1 cluster
One cluster has no influence on the others (one
dimensional clustering)
Genes assigned to clusters on the basis of all
experiments

31
K-means Real Example

Demo in TIGR MeV
Spellman et al. alpha-factor cell cycle

32
Clustering Tweaks

Fuzzy clustering: allows genes to be partially
in different clusters
Dependent clusters: consider between-cluster
distances as well as within-cluster
Bi-clustering: look for patterns across subsets
of conditions
Very hard problem (NP-complete)
Practical solutions use heuristics/simplifications
that may affect biological interpretation

33
Cluster Evaluation

Mathematical consistency
Compare coherency of clusters to background
Look for functional consistency in clusters
Requires a gold standard, often based on GO,
MIPS, etc.
Evaluate likelihood of enrichment in clusters
Hypergeometric distribution, etc.
Several tools available

34
Gene Ontology

Organization of curated biological knowledge
3 branches: biological process, molecular
function, cellular component

35
Hypergeometric Distribution

Probability of observing x or more genes in a
cluster of n genes with a common annotation
N = total number of genes in genome
M = number of genes with annotation
n = number of genes in cluster
x = number of genes in cluster with annotation
Multiple hypothesis correction required if
testing multiple functions (Bonferroni, FDR,
etc.)
Additional genes in clusters with strong
enrichment may be related
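
The formula on this slide is not preserved in the transcript, so the sketch below uses the standard hypergeometric upper tail: the sum over i from x to min(n, M) of C(M, i) * C(N - M, n - i) / C(N, n), with N, M, n and x as defined above. The function names `enrichment_p_value` and `bonferroni` and the toy numbers are assumptions for illustration; the correction shown is the plain Bonferroni bound, the simplest of the options the slide lists.

```python
from math import comb

def enrichment_p_value(N, M, n, x):
    """P(at least x annotated genes in a cluster of n), hypergeometric tail."""
    upper = min(n, M)
    favorable = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return favorable / comb(N, n)

def bonferroni(p_value, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * num_tests)

# Toy numbers: 10 genes in the genome, 4 annotated, cluster of 3 with 2 annotated.
p = enrichment_p_value(N=10, M=4, n=3, x=2)
print(p)
print(bonferroni(p, num_tests=5))
```

In the toy case, 40 of the 120 possible clusters of size 3 contain at least 2 annotated genes, so the unadjusted probability is 1/3; testing five annotations pushes the Bonferroni-adjusted value to the cap of 1.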

36
GO term Enrichment Tools