Title: More%20Microarray%20Analysis:%20Unsupervised%20Approaches
1More Microarray AnalysisUnsupervised Approaches
- Matt Hibbs
- Troyanskaya Lab
2Outline
- Gene Expression vs. DNA applications
- A little more normalization (missing values)
- Unsupervised Analysis
- Basic Clustering
- Statistical Enrichment
- PCA/SVD
- Advanced Clustering
- Search-based Approaches
-
3Expression / DNA
- Some similar concepts to analysis, but often very
different goals - Expression clustering, guilt by association,
functional enrichment - DNA signal processing, spatial relationships,
motif finding - Visualized differently (Heat maps vs. karyoscope)
4The missing value problem
- Microarrays can have systematic or random missing
values - Some algorithms cant deal with missing values
(PCA/SVD in particular) - Instead of hoping missing values wont bias the
analysis, better to estimate them accurately
5Spatial Defects
6KNN Impute
- Idea use genes with similar expression profiles
to estimate missing values
2 5 7 3 1
Gene X
2 4 5 7 3 2
Gene A
8 9 2 1 4 9
Gene B
3 5 6 7 3 2
Gene C
7Imputation affects downstream analysis
Complete data set
Data set with missing values estimated by
KNNimpute algorithm
Data set with 30 entries missing and filled with
zeros (zero values appear black)
8Unsupervised Analysis
- Supervised techniques great if you have starting
information (e.g. labels) - But, we often we dont know enough beforehand to
apply these methods - Unsupervised techniques are exploratory
- Let the data organize itself, then try to find
biological meaning - Approaches to understand whole data
- Visualization often helpful
9Clustering
- Let the data organize itself
- Reordering of genes (or conditions) in the
dataset so that similar patterns are next to each
other (or in separate groups) - Identify subsets of genes (or experiments) that
are related by some measure
10Quick Example
Conditions
Genes
11Why cluster?
- Guilt by association if unknown gene X is
similar in expression to known genes A and B,
maybe they are involved in the same/related
pathway - Visualization datasets are too large to be able
to get information out without reorganizing the
data
12Clustering Techniques
- Algorithm (Method)
- Hierarchical
- K-means
- Self Organizing Maps
- QT-Clustering
- NNN
- .
- .
- .
- Distance Metric
- Euclidean (L2)
- Pearson Correlation
- Spearman Correlation
- Manhattan (L1)
- Kendalls t
- .
- .
- .
13Distance Metrics
- Choice of distance measure is important for most
clustering techniques - Pair-wise metrics compare vectors of numbers
- e.g. genes x y, ea. with n measurements
14Distance Metrics
15Hierarchical clustering
- Imposes (pair-wise) hierarchical structure on all
of the data - Often good for visualization
- Basic Method (agglomerative)
- Calculate all pair-wise distances
- Join the closest pair
- Calculate pairs distance to all others
- Repeat from 2 until all joined
16Hierarchical clustering
17Hierarchical clustering
18Hierarchical clustering
19Hierarchical clustering
20Hierarchical clustering
21Hierarchical clustering
22HC Interior Distances
- Three typical variants to calculate interior
distances within the tree - Average linkage mean/median over all possible
pair-wise values - Single linkage minimum pair-wise distance
- Complete linkage maximum pair-wise distance
23 Hierarchical clustering problems
- Hard to define distinct clusters
- Genes assigned to clusters on the basis of all
experiments - Optimizing node ordering hard (finding the
optimal solution is NP-hard) - Can be driven by one strong cluster a problem
for gene expression b/c data in row space is
often highly correlated
24HC Real Example
- Demo in JavaTreeView HIDRA
- Spellman et al., 1998 yeast alpha-factor sync
cell cycle timecourse
25HC Another Example
- Expression of tumors hierarchically clustered
- Expression groups by clinical class
Garber et al.
26K-means Clustering
- Groups genes into a pre-defined number of
independent clusters - Basic algorithm
- Define k number of clusters
- Randomly initialize each cluster with a seed
(often with a random gene) - Assign each gene to the cluster with the most
similar seed - Recalculate all cluster seeds as means (or
medians) of genes assigned to the cluster - Repeat 3 4 until convergence
- (e.g. No genes move, means dont change much,
etc.)
27K-means example
28K-means example
29K-means example
30K-means problems
- Have to set k ahead of time
- Ways to choose optimal k minimize
within-cluster variation compared to random data
or held out data - Each gene only belongs to exactly 1 cluster
- One cluster has no influence on the others (one
dimensional clustering) - Genes assigned to clusters on the basis of all
experiments
31K-means Real Example
- Demo in TIGR MeV
- Spellman et al. alpha-factor cell cycle
32 Clustering Tweaks
- Fuzzy clustering allows genes to be partially
in different clusters - Dependent clusters consider between-cluster
distances as well as within-cluster - Bi-clustering look for patterns across subsets
of conditions - Very hard problem (NP-complete)
- Practical solutions use heuristics/simplifications
that may affect biological interpretation
33Cluster Evaluation
- Mathematical consistency
- Compare coherency of clusters to background
- Look for functional consistency in clusters
- Requires a gold standard, often based on GO,
MIPS, etc. - Evaluate likelihood of enrichment in clusters
- Hypergeometric distribution, etc.
- Several tools available
34Gene Ontology
- Organization of curated biological knowledge
- 3 branches biological process, molecular
function, cellular component
35 Hypergeometric Distribution
- Probability of observing x or more genes in a
cluster of n genes with a common annotation - N total number of genes in genome
- M number of genes with annotation
- n number of genes in cluster
- x number of genes in cluster with annotation
- Multiple hypothesis correction required if
testing multiple functions (Bonferroni, FDR,
etc.) - Additional genes in clusters with strong
enrichment may be related
36GO term Enrichment Tools
- SGDs Princetons GoTermFinder
- http//go.princeton.edu
- GOLEM (http//function.princeton.edu/GOLEM)
- HIDRA
Sealfon et al., 2006
37More Unsupervised Methods
- Search-based approaches
- Starting with a query gene/condition, find most
related group - Singular Value Decomposition (SVD) Principal
Component Analysis (PCA) - Decomposition of data matrix into patterns
weights and contributions - Real names are principal componentssingular
values and left/right eigenvectors - Used to remove noise, reduce dimensionality,
identify common/dominant signals
38SVD ( PCA)
- SVD is the method, PCA is performing SVD on
centered data - Projects data into another orthonormal basis
- New basis ordered by variance explained
Singular values
Eigen-genes
Original Data matrix
Eigen-conditions
39SVD
40SVD Real Example
- Demo in TIGR MeV
- Spellman et al., 1998 cell cycle time courses
- alpha-factor sync
- cdc15 sync
41DNA arrays / Sequence-based Analysis
- Methods so far focused on expression data
- Other uses of microarrays often sequence based
CGH, ChIP-chip, SNP scanner - Data has important, inherent order
- Most analysis methods developed from signal
processing techniques (e.g. sound) - View data in chromosomal order (karyoscope)
- Tools JavaTreeView, IGB, Chippy
42CGH Example
43Aneuploidy affects expression too
rpl20aD/ rpl20aD, Chromosome XV
(data from Hughes et al. (2000))
44Software Tools
- JavaTreeView viz, karyoscope
- HIDRA viz, mult. datasets, search
- Cluster (Eisen lab) clustering
- TIGR MeV clustering, viz
- IGB Affys CGH browser
- ChIPpy ChIP-chip analysis
45Summary
- Unsupervised Analysis
- Let the data organize itself, find patterns
- Clustering Distance Metric Algorithm
- SVD/PCA auto find dominant patterns
- Impute missing values (KNN)
- CGH Karyoscope view
- Questions?