1
Microarray Data Analysis

Angela Burr
Divya Rao

2
Microarray Data Analysis

Objectives
Identify genes with differential gene expression
between two experimental conditions
Identify patterns of gene expression of a
sample at a particular state
Understand biological systems

3
Normalization

Adjust raw intensity values to correct for
intensity differences due to the procedure
Methods
Global adjustment
Intensity dependent
Housekeeping normalization
Print run normalizations
Between slides
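
As an illustration of the global adjustment method listed above, here is a minimal sketch that shifts all log2(R/G) ratios so their median is zero. The function name and the intensity values are hypothetical, and median centring is only one of several possible global adjustments.

```python
import math
from statistics import median

def global_median_normalize(red, green):
    """Return log2(R/G) ratios shifted so that their median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)  # global offset attributed to the procedure
    return [m - shift for m in log_ratios]

# Hypothetical intensities: the red channel is about 2x brighter overall,
# and only the last spot differs beyond that global offset.
red = [200.0, 400.0, 800.0, 1600.0, 6400.0]
green = [100.0, 200.0, 400.0, 800.0, 800.0]
normalized = global_median_normalize(red, green)
print(normalized)  # [0.0, 0.0, 0.0, 0.0, 2.0]
```

After the adjustment, only the spot that departs from the overall dye bias keeps a non-zero log ratio.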

4
Preliminary Analysis

General distribution information
Scatterplots (log R vs log G)
MA plots (M = log(R/G) vs A = 1/2 log(RG))
Histograms
Boxplots
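
The M and A coordinates of an MA plot can be computed directly from the two channel intensities. This is a small sketch assuming base-2 logarithms; the function name and the sample values are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

# One hypothetical spot with R = 8 and G = 2
m, a = ma_values([8.0], [2.0])
print(m, a)  # [2.0] [2.0]
```

Plotting M against A shows whether the log ratio depends on overall spot intensity, which is what intensity-dependent normalization tries to remove.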

7
Histogram of distribution
8
Differential Expression

Goal: Identify significant differences in gene
expression from the background
Problems
Microarray experiments with many variables and
few replicates are not well suited to traditional
statistical tests
Possible Solution
Use statistical models
Concerns
High variability among gene expression levels
makes it difficult to identify up-regulation or
down-regulation of a single gene
False negatives and false positives

9
Parametric Methods

T-test approaches
Bayesian approach: pool genes with similar
expression levels to better estimate the variance
Compute probability of gene expression from a
uniform distribution vs non-uniform distribution
ANOVA
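
To make the t-test approach concrete, here is a sketch of a per-gene test statistic. It assumes Welch's unequal-variance form, which is one common variant and not necessarily the one used in the slides; the replicate values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups of replicates."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference in means, variances not pooled
    se = math.sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical log expression values for one gene, three replicates each
condition_1 = [1.0, 2.0, 3.0]
condition_2 = [4.0, 5.0, 6.0]
print(round(welch_t(condition_1, condition_2), 3))  # -3.674
```

With only a few replicates the variance estimates in the denominator are unstable, which is the motivation for pooling variance information across genes.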

10
t-tests Comparison
11
Non-Parametric Methods

Nonparametric t-test
Permutations
Not very sensitive
Wilcoxon rank sum test
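
A permutation test can be sketched as follows. This version enumerates every relabelling of the replicates and uses the absolute difference in means as the statistic; both choices are assumptions made for illustration, and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Every way of choosing which pooled values are labelled "group a"
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

p = permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(p)  # 0.1
```

Even with perfectly separated groups, three replicates per condition cannot give a p-value below 2/20 = 0.1, which illustrates why the approach is not very sensitive for small experiments.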

12
Xu and Li, Bioinformatics Vol 19(10) 2003
p1284-1289
13
CLUSTER ANALYSIS

Genes that belong to a particular pathway are
generally co-regulated and have similar
expression patterns
Correlating gene expression changes to identify
sets of genes with similar profiles
Method of organizing gene expression data, and
developing phylogenetic trees

14
Terminology

Vector: Position of the gene in expression
space, a geometric coordinate
Expression space: The log2 ratios from the
experiment with n arrays, plotted in n dimensions
Clustering algorithms group genes based on their
separation in expression space

Mean centring: Subtract the basal expression
level of a gene from each experimental
measurement
Distance metric: Measure of distance between
gene expression vectors
Metric distances
Semi-Metric distances
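
Mean centring can be sketched in a few lines. The example assumes that the gene's mean across arrays stands in for its basal expression level; the function name and values are hypothetical.

```python
def mean_centre(profile):
    """Subtract the gene's mean expression from each measurement."""
    baseline = sum(profile) / len(profile)
    return [value - baseline for value in profile]

gene = [1.0, 2.0, 3.0]    # hypothetical log2 ratios across three arrays
print(mean_centre(gene))  # [-1.0, 0.0, 1.0]
```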

16
Distance Metrics

Metric distances ($d_{ij}$ between vectors i and j)
$d_{ij} \ge 0$ (positive definite)
$d_{ij} = d_{ji}$ (symmetric)
$d_{ii} = 0$ (zero distance from itself)
Triangle rule: $d_{ik} \le d_{ij} + d_{jk}$
Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Semi Metric distances
Do not obey triangle rule
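
The difference between a metric and a semi-metric distance can be checked numerically. The sketch below uses squared Euclidean distance as the semi-metric example; that choice is an assumption for illustration and is not named in the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance: symmetric, but breaks the triangle rule."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

i, j, k = [0.0], [1.0], [2.0]  # three hypothetical one-dimensional vectors

# Metric: d_ik = 2 does not exceed d_ij + d_jk = 1 + 1
print(euclidean(i, k) <= euclidean(i, j) + euclidean(j, k))  # True

# Semi-metric: d_ik = 4 exceeds d_ij + d_jk = 1 + 1
print(squared_euclidean(i, k) <= squared_euclidean(i, j) + squared_euclidean(j, k))  # False
```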

17
Classification of Clustering Algorithms

Hierarchical
Produces a hierarchy of clusters
Non-Hierarchical
Partitions objects into different clusters
Divisive
Breaks down parent cluster into smaller clusters
Agglomerative
Fuses single-member clusters

18
Classification continued

Supervised
Use known biological information to guide the
algorithm
Example: SVM
Unsupervised
Make use of gene expression information to reveal
patterns in the data
Examples: Hierarchical, k-Means, SOM, PCA

19
Hierarchical Clustering

Most widely used
Agglomerative approach, produces a single
hierarchical tree
Typically average linkage clustering is used.
Disadvantages
If a bad assignment is made, it cannot be
corrected
Expression patterns of genes become less relevant
with progress in clustering

20
Hierarchical Clustering Algorithms

Single Linkage Clustering
Minimum distance between members of two clusters
Complete Linkage Clustering
Greatest distance between members of relevant
clusters
Average Linkage Clustering
UPGMA
Centroid or Median
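
The three main linkage rules differ only in how they summarize the pairwise distances between members of two clusters. A minimal sketch, assuming Euclidean distance and hypothetical one-dimensional expression vectors:

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under three linkage rules."""
    pairwise = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # closest pair of members
        "complete": max(pairwise),                  # farthest pair of members
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs (UPGMA)
    }

cluster_a = [[0.0], [1.0]]
cluster_b = [[3.0], [5.0]]
print(linkage_distances(cluster_a, cluster_b))
# {'single': 2.0, 'complete': 5.0, 'average': 3.5}
```

An agglomerative algorithm repeatedly merges the two clusters with the smallest such distance, so the choice of rule changes the shape of the resulting tree.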

21
Visualization of Hierarchical clustering

average linking
complete linkage
single linkage

Quackenbush J

22
k-means Clustering

Used when the number of clusters is known from
prior biological knowledge.
Partitions objects into fixed number (k) of
clusters.
No dendrograms produced
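
The partitioning step can be sketched with a bare-bones k-means loop. For reproducibility this version takes the first k points as the initial centroids, which is an assumption made for the example; practical implementations usually initialize randomly and restart several times.

```python
def kmeans(points, k, iterations=100):
    """Return (assignment, centroids) after Lloyd-style k-means updates."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic start
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its current members
        for c in range(k):
            members = [p for p, label in zip(points, assignment) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Four hypothetical expression vectors forming two obvious groups
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [[0.0, 0.5], [10.0, 10.5]]
```

The output is a flat list of cluster labels, which is why no dendrogram is produced.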

23
Self-organizing Maps

Neural network based divisive clustering
Assigns genes to partitions based on the
similarity of their expression vectors to
pre-defined reference vectors
Requires prior knowledge

24
Principal Component Analysis

Provides a projection of complex data sets onto a
reduced, easily visualized space
Difficult to accurately define the precise
boundaries of distinct clusters in the data
Powerful when used with another classification
technique such as k-means clustering or SOM
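
The projection idea can be illustrated by finding the first principal component with power iteration on the covariance matrix. This is a sketch only: power iteration is a stand-in for a full eigendecomposition, the start vector is assumed not to be orthogonal to the leading direction, and the data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration from a uniform unit start vector
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project each centered row onto the leading direction
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

samples = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, scores = first_principal_component(samples)
print([round(x, 4) for x in direction])  # [0.7071, 0.7071]
print([round(s, 4) for s in scores])     # [-1.4142, 0.0, 1.4142]
```

The scores give a one-dimensional view of the data; cluster boundaries in that view still have to be drawn by another method.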

25
PCA and Hierarchical analysis

Hierarchical (average linkage) clustering
PCA
Quackenbush J

26
Supervised Methods - SVM

Require previous information about which genes
are expected to cluster together
Support Vector Machines (SVM)
Binary classifier
Kernel function
Classification of samples
Molecular expression fingerprinting

27
Comparison of clustering methods

Hierarchical clustering with correlation
Clustering by k-means
Diana
Fanny
Model-based clustering
Hierarchical clustering with partial least squares

28
Comparison of clustering methods

Data used
Transcriptional program of sporulation in budding
yeast (Chu et al)
7 time points
4975 genes
Simulated data (Quackenbush)
10 time points
450 genes

29
Hierarchical and k-means clustering

Hierarchical clustering
Produces a hierarchy of clusters
Average linkage clustering (UPGMA)
d(x, y) = 1 - corr(x, y)
corr(x, y): statistical correlation between
expression profiles of x and y
K-means clustering
Number of clusters determined using Hierarchical
clustering
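
The correlation-based distance used for the hierarchical clustering can be written out as follows. The sketch assumes Pearson correlation for corr(x, y); the profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_distance(x, y):
    """d(x, y) = 1 - corr(x, y): 0 for identical shape, 2 for opposite."""
    return 1.0 - pearson(x, y)

rising = [1.0, 2.0, 3.0]
also_rising = [2.0, 4.0, 6.0]
falling = [3.0, 2.0, 1.0]
print(round(correlation_distance(rising, also_rising), 6))  # 0.0
print(round(correlation_distance(rising, falling), 6))      # 2.0
```

Two profiles with the same shape but different magnitudes are at distance zero, so this measure groups genes by pattern and ignores amplitude.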

30
Diana

Divisive clustering method
Cluster C with cardinality n(C)
x1 ∈ C
Splits into
{x1, ..., xk} and
C \ {x1, ..., xk}

31
Fanny

Fuzzy logic
Used L1 (Manhattan distance)
Computes probability vectors for all genes that
minimize the objective function
Assigns genes to the group with highest
probability

32
Model-based clustering

Modeling expression profiles by mixtures of
multivariate normal distributions
Density function
Likelihood L of genes with expression profiles
x1, ..., xn
Group labels obtained by maximum likelihood,
maximizing L jointly in the model parameters and
the group labels

33
Hierarchical clustering with partial least squares

xi is the vector of expression ratios for the ith
gene
Partial least squares model of x1 on x2, ..., xM
Symmetrized coefficient for any gene pair (i, j),
i ≠ j

34
Validation

Average proportion of non-overlap measures
Average distance between means measure
Average distance measure
K: number of clusters
Expect small values

35
Results for sporulation data

Non-overlap measures

36
Results of sporulation data

Average distance between means measure
Average distance measure

37
Results for simulated data

Hierarchical clustering performed poorly
Diana and model-based clustering performed best
K-means and Fanny performed well

38
Model profile

Training set of seven temporal classes
Model temporal profile
For each class, the average of the log expression
ratio of all the genes in that class is plotted
over the seven time points
Same for 5 algorithms
Fanny not used

39
Comparison with model profiles
40
Comparison with model profiles
41
Comparison with model profiles
42
Objective comparison

Distance measure
Total distance from model profile

43
Discussion

End result is dependent on clustering method
Choose based on
Visual plots
Validation techniques
Compare with model profiles
Application of more than one technique

44
Microarray Software

To assist in management / organization of the
data and help in the data analysis process
Commercial and open-source available
Popular brands
GeneSpring
Genesis
MIDAS

45
A Gene-Coexpression Network for Global Discovery
of Conserved Genetic Modules (Stuart et al,
Science Vol. 302 2003)

Evolutionarily divergent organisms
Homo sapiens, Drosophila, C. elegans, S. cerevisiae
Importance
Only used genes with orthologs to develop
networks between species
Using multiple species was found to give
tighter and more exclusive clusters

46
Gene-Coexpression Networks

Methods
Group genes by BLAST
Identify gene groups that are functionally
related by prior microarray data
Identify genes with significant Pearson
correlations indicating conserved coexpression
Created a coexpression network based on
correlations
Validated by permutations and by adding
noise
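
The network-building step can be sketched by linking gene pairs whose profiles are strongly correlated. The fixed correlation threshold used here is a simplifying assumption and is not the significance procedure of Stuart et al.; gene names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def coexpression_edges(profiles, threshold):
    """Return gene pairs whose correlation is at least `threshold`."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        if pearson(profiles[gene_a], profiles[gene_b]) >= threshold:
            edges.append((gene_a, gene_b))
    return edges

profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpression_edges(profiles, threshold=0.9))  # [('geneA', 'geneB')]
```

Genes connected by an edge become candidates for involvement in the same biological process.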

47
Stuart et al, Science Vol 302 Oct 10 , 2003 p
249-255
48
Stuart et al, Science Vol 302 Oct 10 , 2003 p
249-255
49
Gene-Coexpression Networks

Results
Identification of genes involved with particular
biological processes (that were previously
unknown) because genes linked in network likely
to be involved in similar processes
Selective forces impacting interconnected classes
of genes over evolution

50
References

Baggerly et al, J. of Comp Bio Vol 8(6) 2001,
p639-659
Chen, G. et al. (2002) Evaluation and comparison
of clustering algorithms in analyzing ES cell
gene expression data. Statistica Sinica, 12,
241-262.
Dozmorov and Centola, Bioinformatics Vol 19(2)
2003, p204-211
Hatfield et al, Molecular Microbiology Vol 47(4)
2003, p871-877
Kerr, M.K. and Churchill, G.A. (2001)
Bootstrapping cluster analysis: assessing the
reliability of conclusions from microarray
experiments. Proc. Natl Acad. Sci. USA, 98,
8961-8965.
Pan, Bioinformatics Vol 19(11) 2003, p1333-1340
Sambrook J (ed), DNA Microarrays: A Molecular
Cloning Manual
Slonim D, Nature Genetics Vol 32 2002, p502-508
Stuart et al, Science Vol 302 2003, p249-255
Troyanskaya et al, Bioinformatics Vol 18(11)
2002, p1454-1461
Tsodikov et al, Bioinformatics Vol 18(2) 2002,
p251-260
Wilson et al, Bioinformatics Vol 19(11) 2003,
p1325-1332

51
References

Datta, S. and Datta, S. (2003) Comparisons and
validation of statistical clustering techniques
for microarray gene expression data.
Bioinformatics, 19, 459-466.
Eisen, M.B., Spellman, P.T., Brown, P.O. and
Botstein, D. (1998) Cluster analysis and display
of genome-wide expression patterns. Proc. Natl
Acad. Sci. USA, 95, 14863-14868.
McLachlan, G.J., Bean, R.W. and Peel, D. (2002) A
mixture model-based approach to the clustering of
microarray expression data. Bioinformatics, 18,
1-10.
Quackenbush, J. (2001) Computational analysis of
microarray data. Nat. Rev. Genet., 2, 418-427.
Waddell, P. and Kishino, H. (2000) Cluster
inference methods and graphical models evaluated
on NCI60 microarray gene expression data. Genome
Informatics, 11, 129-140.
Xu and Li, Bioinformatics Vol 19(10) 2003,
p1284-1289
Yeung, K., Haynor, D.R. and Ruzzo, W.L. (2001)
Validating clustering for gene expression data.
Bioinformatics, 17, 309-318.