More Microarray Analysis: Unsupervised Approaches - PowerPoint PPT Presentation

1
More Microarray Analysis: Unsupervised Approaches
  • Matt Hibbs
  • Troyanskaya Lab

2
Outline
  • Gene Expression vs. DNA applications
  • A little more normalization (missing values)
  • Unsupervised Analysis
  • Basic Clustering
  • Statistical Enrichment
  • PCA/SVD
  • Advanced Clustering
  • Search-based Approaches

3
Expression / DNA
  • Some similar analysis concepts, but often very
    different goals
  • Expression: clustering, guilt by association,
    functional enrichment
  • DNA: signal processing, spatial relationships,
    motif finding
  • Visualized differently (heat maps vs. karyoscope)

4
The missing value problem
  • Microarrays can have systematic or random missing
    values
  • Some algorithms can't deal with missing values
    (PCA/SVD in particular)
  • Instead of hoping missing values won't bias the
    analysis, better to estimate them accurately

5
Spatial Defects
6
KNN Impute
  • Idea: use genes with similar expression profiles
    to estimate missing values

Gene X: 2 5 7 3 1 (one value missing)
Gene A: 2 4 5 7 3 2
Gene B: 8 9 2 1 4 9
Gene C: 3 5 6 7 3 2
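The idea can be sketched in a few lines of Python. This is only an illustration, not the KNNimpute implementation itself: Gene X lists five of its six values, and the sketch assumes (arbitrarily) that the second measurement is the missing one. The function name `knn_impute`, the Euclidean distance, and the unweighted average of the neighbors are all choices made here for brevity; weighting neighbors by similarity is a common refinement.

```python
import math

def knn_impute(target, others, k=2):
    """Fill None entries in `target` using the k most similar complete rows."""
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(row):
        # Euclidean distance over the columns where the target is observed
        return math.sqrt(sum((target[i] - row[i]) ** 2 for i in observed))

    neighbors = sorted(others, key=distance)[:k]
    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            # Unweighted mean of the neighbors' values in the missing column
            filled[i] = sum(row[i] for row in neighbors) / k
    return filled

gene_x = [2, None, 5, 7, 3, 1]  # position of the gap is an assumption
gene_a = [2, 4, 5, 7, 3, 2]
gene_b = [8, 9, 2, 1, 4, 9]
gene_c = [3, 5, 6, 7, 3, 2]

imputed = knn_impute(gene_x, [gene_a, gene_b, gene_c], k=2)
print(imputed)
```

Under that assumption, Genes A and C are the two nearest neighbors, so the gap is filled with the mean of their values in that column; Gene B is too dissimilar to contribute.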
7
Imputation affects downstream analysis

 
Panels:
  • Complete data set
  • Data set with missing values estimated by the
    KNNimpute algorithm
  • Data set with 30% of entries missing and filled with
    zeros (zero values appear black)
8
Unsupervised Analysis
  • Supervised techniques great if you have starting
    information (e.g. labels)
  • But we often don't know enough beforehand to
    apply these methods
  • Unsupervised techniques are exploratory
  • Let the data organize itself, then try to find
    biological meaning
  • Approaches to understand whole data
  • Visualization often helpful

9
Clustering
  • Let the data organize itself
  • Reordering of genes (or conditions) in the
    dataset so that similar patterns are next to each
    other (or in separate groups)
  • Identify subsets of genes (or experiments) that
    are related by some measure

10
Quick Example
Conditions
Genes
11
Why cluster?
  • Guilt by association: if unknown gene X is
    similar in expression to known genes A and B,
    maybe they are involved in the same/related
    pathway
  • Visualization: datasets are too large to be able
    to get information out without reorganizing the
    data

12
Clustering Techniques
  • Algorithm (Method):
  • Hierarchical
  • K-means
  • Self Organizing Maps
  • QT-Clustering
  • NNN
  • ...
  • Distance Metric:
  • Euclidean (L2)
  • Pearson Correlation
  • Spearman Correlation
  • Manhattan (L1)
  • Kendall's τ
  • ...

13
Distance Metrics
  • Choice of distance measure is important for most
    clustering techniques
  • Pair-wise metrics compare vectors of numbers
  • e.g. genes x and y, each with n measurements

14
Distance Metrics
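A minimal sketch of four of the pair-wise metrics listed earlier, for two genes x and y with n measurements each. The function names and the toy vectors are illustrative assumptions; the Spearman version assumes no tied values, and a correlation is usually turned into a distance with something like 1 - r.

```python
import math

def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 distance: summed absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def ranks(values):
    """Rank positions starting at 1; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [1.0, 4.0, 9.0, 20.0]  # rises with gene_x, but not linearly

print("Euclidean:", euclidean(gene_x, gene_y))
print("Manhattan:", manhattan(gene_x, gene_y))
print("Pearson:  ", pearson(gene_x, gene_y))
print("Spearman: ", spearman(gene_x, gene_y))
```

The two toy genes rise together but not linearly, so the Spearman correlation is 1 while the Pearson correlation is below 1. The Euclidean and Manhattan distances are large simply because the magnitudes differ, which is one reason the choice of metric matters.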
15
Hierarchical clustering
  • Imposes (pair-wise) hierarchical structure on all
    of the data
  • Often good for visualization
  • Basic Method (agglomerative):
  • 1. Calculate all pair-wise distances
  • 2. Join the closest pair
  • 3. Calculate the joined pair's distance to all others
  • 4. Repeat from step 2 until all are joined
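The agglomerative method above can be sketched as follows. This is a deliberately naive version that recomputes every cluster-to-cluster distance on each pass; the names `agglomerate` and `average_linkage`, the Euclidean distance, and the toy profiles are assumptions made for illustration, and real tools cache distances instead.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, data):
    """Mean distance over all pairs with one member in each cluster."""
    pairs = [euclidean(data[i], data[j]) for i in c1 for j in c2]
    return sum(pairs) / len(pairs)

def agglomerate(data, linkage=average_linkage):
    """Return the merges, in order, as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(data))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Compute distances between all current clusters, keep the closest pair
        dist, a, b = min(
            ((linkage(a, b, data), a, b)
             for idx, a in enumerate(clusters)
             for b in clusters[idx + 1:]),
            key=lambda t: t[0],
        )
        # Join the closest pair into one new cluster
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

# Toy expression profiles (rows are genes); values are made up
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
merges = agglomerate(profiles)
for a, b, dist in merges:
    print(a, "+", b, "at distance", round(dist, 3))
```

The recorded merges are exactly the information a dendrogram draws: which groups were joined, and at what height.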

16
Hierarchical clustering
17
Hierarchical clustering
18
Hierarchical clustering
19
Hierarchical clustering
20
Hierarchical clustering
21
Hierarchical clustering
22
HC Interior Distances
  • Three typical variants to calculate interior
    distances within the tree
  • Average linkage: mean/median over all possible
    pair-wise values
  • Single linkage: minimum pair-wise distance
  • Complete linkage: maximum pair-wise distance
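To make the three variants concrete, the sketch below evaluates each of them on the same two small clusters. The one-dimensional values and the function names are hypothetical, and the average variant uses the mean.

```python
def absolute_difference(a, b):
    """Toy one-dimensional distance."""
    return abs(a - b)

def pairwise(c1, c2, dist):
    """All distances between a member of c1 and a member of c2."""
    return [dist(a, b) for a in c1 for b in c2]

def single_linkage(c1, c2, dist):
    return min(pairwise(c1, c2, dist))

def complete_linkage(c1, c2, dist):
    return max(pairwise(c1, c2, dist))

def average_linkage(c1, c2, dist):
    values = pairwise(c1, c2, dist)
    return sum(values) / len(values)

left, right = [1.0, 2.0], [4.0, 8.0]
print("single:  ", single_linkage(left, right, absolute_difference))
print("complete:", complete_linkage(left, right, absolute_difference))
print("average: ", average_linkage(left, right, absolute_difference))
```

The pair-wise distances here are 3, 7, 2 and 6, so single linkage reports 2, complete linkage 7, and average linkage 4.5. The same pair of clusters can look close or far depending on the variant.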

23
Hierarchical clustering problems
  • Hard to define distinct clusters
  • Genes assigned to clusters on the basis of all
    experiments
  • Optimizing node ordering hard (finding the
    optimal solution is NP-hard)
  • Can be driven by one strong cluster: a problem
    for gene expression b/c data in row space is
    often highly correlated

24
HC Real Example
  • Demo in JavaTreeView and HIDRA
  • Spellman et al., 1998: yeast alpha-factor sync
    cell cycle timecourse

25
HC Another Example
  • Expression of tumors hierarchically clustered
  • Expression groups by clinical class

Garber et al.
26
K-means Clustering
  • Groups genes into a pre-defined number of
    independent clusters
  • Basic algorithm:
  • 1. Define k, the number of clusters
  • 2. Randomly initialize each cluster with a seed
    (often with a random gene)
  • 3. Assign each gene to the cluster with the most
    similar seed
  • 4. Recalculate all cluster seeds as means (or
    medians) of genes assigned to the cluster
  • 5. Repeat steps 3 and 4 until convergence
    (e.g. no genes move, means don't change much,
    etc.)
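A minimal sketch of the basic algorithm above, assuming squared Euclidean distance as the similarity measure, means (not medians) as seeds, and "no genes move" as the convergence test. The function name `kmeans`, its parameters, and the toy data are illustrative.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialize each cluster seed with a randomly chosen gene
    centers = rng.sample(data, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the cluster with the most similar seed
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(row, centers[c]))
            for row in data
        ]
        if new_assignment == assignment:  # converged: no genes moved
            break
        assignment = new_assignment
        # Recalculate each seed as the mean of its assigned genes
        for c in range(k):
            members = [row for row, a in zip(data, assignment) if a == c]
            if members:  # keep the old seed if a cluster went empty
                centers[c] = mean(members)
    return assignment, centers

# Toy profiles forming two obvious groups; values are made up
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
        [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
assignment, centers = kmeans(data, k=2)
print(assignment)
print(centers)
```

Cluster labels are arbitrary (which group is called 0 depends on the random seeds), so only the grouping is meaningful, not the label values.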

27
K-means example
28
K-means example
29
K-means example
30
K-means problems
  • Have to set k ahead of time
  • Ways to choose optimal k: minimize
    within-cluster variation compared to random data
    or held out data
  • Each gene belongs to exactly 1 cluster
  • One cluster has no influence on the others (one
    dimensional clustering)
  • Genes assigned to clusters on the basis of all
    experiments

31
K-means Real Example
  • Demo in TIGR MeV
  • Spellman et al. alpha-factor cell cycle

32
Clustering Tweaks
  • Fuzzy clustering: allows genes to be partially
    in different clusters
  • Dependent clusters: consider between-cluster
    distances as well as within-cluster
  • Bi-clustering: look for patterns across subsets
    of conditions
  • Very hard problem (NP-complete)
  • Practical solutions use heuristics/simplifications
    that may affect biological interpretation

33
Cluster Evaluation
  • Mathematical consistency
  • Compare coherency of clusters to background
  • Look for functional consistency in clusters
  • Requires a gold standard, often based on GO,
    MIPS, etc.
  • Evaluate likelihood of enrichment in clusters
  • Hypergeometric distribution, etc.
  • Several tools available

34
Gene Ontology
  • Organization of curated biological knowledge
  • 3 branches: biological process, molecular
    function, cellular component

35
Hypergeometric Distribution
  • Probability of observing x or more genes in a
    cluster of n genes with a common annotation
  • N = total number of genes in genome
  • M = number of genes with annotation
  • n = number of genes in cluster
  • x = number of genes in cluster with annotation
  • Multiple hypothesis correction required if
    testing multiple functions (Bonferroni, FDR,
    etc.)
  • Additional genes in clusters with strong
    enrichment may be related
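The tail probability described above can be computed directly from binomial coefficients. The sketch below uses the slide's symbols (N, M, n, x); the function names, the toy numbers, and the choice of Bonferroni as the correction are assumptions for illustration.

```python
from math import comb

def hypergeom_pvalue(N, M, n, x):
    """P(X >= x): probability of drawing x or more annotated genes.

    Sums C(M, i) * C(N - M, n - i) / C(N, n) for i = x .. min(n, M).
    """
    upper = min(n, M)
    numerator = sum(comb(M, i) * comb(N - M, n - i)
                    for i in range(x, upper + 1))
    return numerator / comb(N, n)

def bonferroni(p_value, num_tests):
    """Bonferroni correction: scale by the number of tests, cap at 1."""
    return min(1.0, p_value * num_tests)

# Toy numbers: 10 genes in the genome, 4 carry the annotation,
# and a cluster of 5 genes contains 3 of them
p = hypergeom_pvalue(N=10, M=4, n=5, x=3)
print("p-value:", p)
print("corrected for 20 tested functions:", bonferroni(p, 20))
```

With numbers this small the enrichment is unremarkable (p is about 0.26), and after correcting for 20 tested functions it is capped at 1. Genome-scale values of N work the same way because Python integers do not overflow.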

36
GO term Enrichment Tools
  • SGD's and Princeton's GoTermFinder
  • http://go.princeton.edu
  • GOLEM (http://function.princeton.edu/GOLEM)
  • HIDRA

Sealfon et al., 2006
37
More Unsupervised Methods
  • Search-based approaches
  • Starting with a query gene/condition, find most
    related group
  • Singular Value Decomposition (SVD) and Principal
    Component Analysis (PCA)
  • Decomposition of data matrix into patterns,
    weights, and contributions
  • Real names are principal components/singular
    values and left/right eigenvectors
  • Used to remove noise, reduce dimensionality,
    identify common/dominant signals

38
SVD (& PCA)
  • SVD is the method, PCA is performing SVD on
    centered data
  • Projects data into another orthonormal basis
  • New basis ordered by variance explained

Figure labels: original data matrix, eigen-genes, singular values, eigen-conditions
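To illustrate "PCA is performing SVD on centered data", the sketch below centers a small matrix and extracts its dominant pattern. In practice a library SVD routine would be used; power iteration is swapped in here only to keep the example dependency-free, and it recovers just the first component. Treating rows as genes and columns as conditions, along with the function names and the toy matrix, are assumptions.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean (the centering step that makes SVD a PCA)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def project(matrix, v):
    """Coordinates of every row along direction v."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def first_component(matrix, iterations=200):
    """Power iteration on X^T X.

    Returns (largest singular value, matching unit vector over columns).
    Assumes the matrix is not all zeros.
    """
    cols = len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        xv = project(matrix, v)
        # w = X^T (X v)
        w = [sum(row[j] * s for row, s in zip(matrix, xv)) for j in range(cols)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    sigma = math.sqrt(sum(s * s for s in project(matrix, v)))
    return sigma, v

# Toy matrix: 4 genes (rows) by 2 conditions (columns), values made up
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center_columns(data)
sigma, direction = first_component(centered)

total = sum(v * v for row in centered for v in row)
explained = sigma ** 2 / total
print("singular value:", sigma)
print("direction:", direction)
print("fraction of variance explained:", explained)
```

Because the second condition is exactly twice the first in this toy matrix, one component explains all of the variance. Real expression data spreads variance over several components, ordered from most to least explained.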
39
SVD
40
SVD Real Example
  • Demo in TIGR MeV
  • Spellman et al., 1998 cell cycle time courses
  • alpha-factor sync
  • cdc15 sync

41
DNA arrays / Sequence-based Analysis
  • Methods so far focused on expression data
  • Other uses of microarrays are often sequence based:
    CGH, ChIP-chip, SNP scanner
  • Data has important, inherent order
  • Most analysis methods developed from signal
    processing techniques (e.g. sound)
  • View data in chromosomal order (karyoscope)
  • Tools: JavaTreeView, IGB, ChIPpy
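Because the probes have an inherent chromosomal order, even very simple signal-processing operations are meaningful. The sketch below applies a moving average along that order; the choice of filter, the window size, and the log-ratio values are hypothetical and serve only to illustrate the idea.

```python
def moving_average(values, window=3):
    """Smooth values listed in chromosomal order with a centered window.

    The window is truncated at both ends of the chromosome.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Made-up log ratios along one chromosome: a noisy amplified segment
log_ratios = [0.0, 0.1, -0.1, 1.0, 0.9, 1.1, 1.0, 0.0, 0.1]
smoothed = moving_average(log_ratios, window=3)
print([round(v, 2) for v in smoothed])
```

Smoothing only makes sense because neighboring probes are physically adjacent. Applying the same filter to expression data in arbitrary gene order would blur unrelated measurements together.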

42
CGH Example
  • Demo in JavaTreeView

43
Aneuploidy affects expression too
rpl20aΔ/rpl20aΔ, Chromosome XV
(data from Hughes et al. (2000))
44
Software Tools
  • JavaTreeView: viz, karyoscope
  • HIDRA: viz, mult. datasets, search
  • Cluster (Eisen lab): clustering
  • TIGR MeV: clustering, viz
  • IGB: Affy's CGH browser
  • ChIPpy: ChIP-chip analysis

45
Summary
  • Unsupervised Analysis
  • Let the data organize itself, find patterns
  • Clustering: Distance Metric + Algorithm
  • SVD/PCA: automatically find dominant patterns
  • Impute missing values (KNN)
  • CGH: karyoscope view
  • Questions?