Microarray Data Analysis - PowerPoint PPT Presentation

Slides: 52
Provided by: divyane

Transcript and Presenter's Notes

Title: Microarray Data Analysis


1
Microarray Data Analysis
  • Angela Burr
  • Divya Rao

2
Microarray Data Analysis
  • Objectives
  • Identify genes with differential gene expression
    between two experimental conditions
  • Identify patterns of gene expression of a
    sample in a particular state
  • Understand biological systems

3
Normalization
  • Adjust raw intensity values to correct for
    intensity differences due to the procedure
  • Methods
  • Global adjustment
  • Intensity dependent
  • Housekeeping normalization
  • Print run normalizations
  • Between slides
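A minimal sketch of the global adjustment method, assuming it means shifting all log2(R/G) ratios on a slide so that their median becomes zero. The function name and the intensity values are illustrative and not taken from the presentation.

```python
import math
import statistics

def global_median_normalize(red, green):
    """Return log2(R/G) ratios shifted so that their median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical raw intensities for five spots; the red channel tends to be brighter.
red = [200.0, 400.0, 800.0, 1600.0, 3200.0]
green = [100.0, 200.0, 200.0, 800.0, 3200.0]
normalized = global_median_normalize(red, green)
```

Intensity-dependent methods would instead subtract a value that varies with overall spot brightness, which this sketch does not attempt.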

4
Preliminary Analysis
  • General distribution information
  • Scatterplots (log R vs log G)
  • MA plots (M = log2(R/G) vs A = ½ log2(R·G))
  • Histograms
  • Boxplots
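As a small illustration of the MA plot coordinates, the sketch below computes M = log2(R/G) and A = ½ log2(R·G) for each spot. The function name and the intensities are made up for the example; plotting is left out.

```python
import math

def ma_values(red, green):
    """Return one (M, A) pair per spot from red and green intensities."""
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)          # log ratio
        a = 0.5 * (math.log2(r) + math.log2(g))  # average log intensity
        pairs.append((m, a))
    return pairs

# Hypothetical intensities for two spots.
points = ma_values([1024.0, 256.0], [256.0, 256.0])
```

Plotting M against A makes intensity-dependent bias visible as a curve away from M = 0.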

7
Histogram of distribution
8
Differential Expression
  • Goal: Identify significant differences in gene
    expression from the background
  • Problems
  • Microarray experiments with many variables and
    few replicates are not conducive for traditional
    statistical tests
  • Possible Solution
  • Use statistical models
  • Concerns
  • High variability among gene expression levels
    makes it difficult to identify up-regulation or
    down-regulation of a single gene
  • False negatives and false positives

9
Parametric Methods
  • T-test approaches
  • Bayesian approach: pool genes with similar
    expression levels to better estimate the variance
  • Compute probability of gene expression from a
    uniform distribution vs non-uniform distribution
  • ANOVA
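The slide does not say which t-test variant is meant. As one possible sketch, the code below computes Welch's t statistic (unequal variances) for a single gene across two conditions; the replicate values are invented.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical log expression values for one gene, three replicates each.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat = welch_t(treated, control)
```

With so few replicates the per-gene variance estimate is unstable, which is the motivation for pooling variance across genes with similar expression levels.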

10
t-tests Comparison
11
Non-Parametric Methods
  • Nonparametric t-test
  • Permutations
  • Not very sensitive
  • Wilcoxon rank sum test
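A minimal sketch of a permutation test for one gene, assuming the test statistic is the absolute difference in group means and that every possible relabelling of the replicates is enumerated. The function name and data are illustrative.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    total = 0
    # Enumerate every way of choosing which pooled values are labelled "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

p = permutation_p_value([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
```

With three replicates per group there are only 20 relabellings, so the smallest attainable p-value is 2/20 = 0.1 even for perfectly separated groups. This is one way to read the slide's remark that permutation methods are not very sensitive.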

12
Xu and Li, Bioinformatics Vol 19(10) 2003
p1284-1289
13
CLUSTER ANALYSIS
  • Genes that belong to a particular pathway are
    generally co-regulated and have similar
    expression patterns
  • Correlating gene expression changes to identify
    sets of genes with similar profiles
  • Method of organizing gene expression data, and
    developing phylogenetic trees

14
Terminology
  • Vector: Position of the gene in expression
    space, a geometric coordinate
  • Expression space - The log2 ratios from the
    experiment with n arrays, plotted in n dimensions
  • Clustering algorithms group genes based on their
    separation in expression space

15
  • Mean centring: Subtract the basal expression
    level of a gene from each experimental
    measurement
  • Distance metric: Measure of distance between
    gene expression vectors
  • Metric distances
  • Semi-Metric distances

16
Distance Metrics
  • Metric distances (dij between vectors i and j)
  • dij positive definite
  • dij symmetric
  • dij zero distance from itself
  • Triangle rule: $d_{ik} \le d_{ij} + d_{jk}$
  • Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  • Semi-metric distances
  • Do not obey the triangle rule
17
Classification of Clustering Algorithms
  • Hierarchical
  • Produces a hierarchy of clusters
  • Non-Hierarchical
  • Partitions objects into different clusters
  • Divisive
  • Breaks down parent cluster into smaller clusters
  • Agglomerative
  • Fuses single-member clusters

18
Classification continued
  • Supervised
  • Use known biological information to guide the
    algorithm
  • Example: SVM
  • Unsupervised
  • Make use of gene expression information to reveal
    patterns in the data
  • Examples: hierarchical, k-means, SOM, PCA

19
Hierarchical Clustering
  • Most widely used
  • Agglomerative approach, produces a single
    hierarchical tree
  • Typically average linkage clustering is used.
  • Disadvantages
  • If a bad assignment is made, it cannot be
    corrected
  • Expression patterns of genes become less relevant
    with progress in clustering

20
Hierarchical Clustering Algorithms
  • Single Linkage Clustering
  • Minimum distance between members of two clusters
  • Complete Linkage Clustering
  • Greatest distance between members of relevant
    clusters
  • Average Linkage Clustering
  • UPGMA
  • Centroid or Median
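The linkage rules above differ only in how the distance between two clusters is derived from the distances between their members. The sketch below shows single, complete, and average linkage together with a naive agglomerative loop. All names and data are illustrative, and the loop is written for clarity rather than speed.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(cluster_a, cluster_b, dist):
    return [dist(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(a, b, dist):
    return min(pairwise(a, b, dist))       # closest pair of members

def complete_linkage(a, b, dist):
    return max(pairwise(a, b, dist))       # farthest pair of members

def average_linkage(a, b, dist):
    d = pairwise(a, b, dist)
    return sum(d) / len(d)                 # mean over all member pairs

def agglomerate(points, linkage, dist, n_clusters):
    """Repeatedly fuse the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]       # start from single-member clusters
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                    # j > i, so index i stays valid
    return clusters

a = [[0.0], [1.0]]
b = [[3.0], [5.0]]
merged = agglomerate([[0.0], [1.0], [5.0], [6.0]], average_linkage, euclidean, 2)
```

Because a fused pair is never split again, an early bad merge persists through the rest of the run.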

21
Visualization of Hierarchical clustering
  • average linking
  • complete linkage
  • single linkage
  • Quackenbush J

22
k-means Clustering
  • Used when prior biological knowledge regarding
    the number of clusters is known.
  • Partitions objects into fixed number (k) of
    clusters.
  • No dendrograms produced
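A minimal sketch of k-means partitioning, assuming squared Euclidean distance and a deliberately naive initialization that takes the first k points as starting centroids. Real analyses typically use random or smarter seeding; all names and data here are illustrative.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, iterations=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    centroids = [list(p) for p in points[:k]]   # naive initialization (assumption)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break                               # converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical expression vectors forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, 2)
```

The output is a flat partition into k groups, which is why no dendrogram is produced.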

23
Self-organizing Maps
  • Neural network based divisive clustering
  • Assigns genes to partitions based on the
    similarity of their expression vectors to
    pre-defined reference vectors
  • Requires prior knowledge

24
Principal Component Analysis
  • Provides a projection of complex data sets onto a
    reduced, easily visualized space
  • Difficult to accurately define the precise
    boundaries of distinct clusters in the data
  • Powerful when used with another classification
    technique such as k-means clustering or SOM
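As a rough sketch of the projection idea, the code below finds the first principal component of a small data set by power iteration on the sample covariance matrix and projects each row onto it. This is a teaching sketch under simple assumptions (non-degenerate data, a starting vector not orthogonal to the leading component); the function name and data are invented.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    vec = [1.0] + [0.0] * (dims - 1)            # arbitrary starting direction
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]           # renormalize each step
    # Project each centered row onto the component to get 1-D scores.
    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores

# Hypothetical two-array data lying exactly on the line y = x.
direction, scores = first_principal_component(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
```

The scores give a reduced coordinate for each row that could then be passed to k-means or another clustering method.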

25
PCA and Hierarchical analysis
  • Hierarchical (average linkage) clustering
  • PCA
  • Quackenbush J

26
Supervised Methods - SVM
  • Require previous information about which genes
    are expected to cluster together
  • Support Vector Machines (SVM)
  • Binary classifier
  • Kernel function
  • Classification of samples
  • Molecular expression fingerprinting

27
Comparison of clustering methods
  • Hierarchical clustering with correlation
  • Clustering by k-means
  • Diana
  • Fanny
  • Model-based clustering
  • Hierarchical clustering with partial least squares

28
Comparison of clustering methods
  • Data used
  • Transcriptional program of sporulation in budding
    yeast (Chu et al)
  • 7 time points
  • 4975 genes
  • Simulated data (Quackenbush)
  • 10 time points
  • 450 genes

29
Hierarchical and k-means clustering
  • Hierarchical clustering
  • Produces a hierarchy of clusters
  • Average linkage clustering (UPGMA)
  • d(x, y) = 1 - corr(x, y)
  • corr(x, y): statistical correlation between
    expression profiles of x and y
  • K-means clustering
  • Number of clusters determined using Hierarchical
    clustering
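The correlation-based distance d(x, y) = 1 - corr(x, y) can be sketched as follows, assuming corr is the Pearson correlation and that neither profile is constant. The profiles are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """d(x, y) = 1 - corr(x, y); ranges from 0 (same shape) to 2 (opposite)."""
    return 1.0 - pearson(x, y)

rising = [1.0, 2.0, 3.0]
rising_scaled = [2.0, 4.0, 6.0]
falling = [3.0, 2.0, 1.0]
d_same_shape = correlation_distance(rising, rising_scaled)
d_opposite = correlation_distance(rising, falling)
```

Two profiles with the same shape but different magnitude are at distance close to 0, so this measure groups genes by pattern rather than by absolute expression level.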

30
Diana
  • Divisive clustering method
  • Cluster C with cardinality n(C)
  • x1 ∈ C
  • Splits into
  • {x1, .., xk} and
  • C \ {x1, .., xk}

31
Fanny
  • Fuzzy logic
  • Used L1 (Manhattan distance)
  • Computes probability vectors for all genes that
    minimize the objective function
  • Assigns genes to the group with highest
    probability
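The sketch below covers only two pieces mentioned on this slide: the L1 (Manhattan) distance and the final step of assigning each gene to its highest-membership group. It does not implement the Fanny objective function, and the membership vectors are hypothetical values rather than output of the algorithm.

```python
def manhattan(x, y):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hard_assignment(memberships):
    """For each gene, return the index of the group with the largest membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

# Hypothetical membership vectors for two genes over three groups.
memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
groups = hard_assignment(memberships)
```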

32
Model-based clustering
  • Modeling expression profiles by mixtures of
    multivariate normal distributions
  • Density function
  • Likelihood of genes with expression profiles
    x1, ..., xn
  • Group levels ? obtained by maximum likelihood
    that maximizes L jointly in ? and ?

33
Hierarchical clustering with partial least squares
  • xi is the vector of expression ratios for the ith
    gene
  • Partial least squares model of x1 on x2,..,xM
  • Symmetrized coefficient for any gene pair (i, j),
    i ≠ j

34
Validation
  • Average proportion of non-overlap measures
  • Average distance between means measure
  • Average distance measure
  • K: number of clusters
  • Expect small values

35
Results for sporulation data
  • Non-overlap measures

36
Results of sporulation data
  • Average distance between means measure
  • Average distance measure

37
Results for simulated data
  • Hierarchical clustering performed poorly
  • Diana and model-based clustering performed best
  • K-means and Fanny performed well

38
Model profile
  • Training set of seven temporal classes
  • Model temporal profile
  • For each class, the average of the log expression
    ratios of all the genes in that class is plotted
    over the seven time points
  • Same for 5 algorithms
  • Fanny not used
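The per-class averaging described above can be sketched as follows: for each class, average the log expression ratios of its genes at every time point. The class labels, function name, and values are hypothetical, and only three time points are used to keep the example short.

```python
def class_profiles(log_ratios, labels):
    """Return {class label: mean log ratio at each time point}."""
    grouped = {}
    for profile, label in zip(log_ratios, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [sum(col) / len(col) for col in zip(*profiles)]
        for label, profiles in grouped.items()
    }

# Hypothetical log ratios for three genes over three time points.
log_ratios = [
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0],
]
labels = ["early", "early", "late"]
profiles = class_profiles(log_ratios, labels)
```

The same function applies to the clusters produced by each algorithm, which gives cluster profiles that can be compared with the model profiles.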

39
Comparison with model profiles
40
Comparison with model profiles
41
Comparison with model profiles
42
Objective comparison
  • Distance measure
  • Total distance from model profile

43
Discussion
  • End result is dependent on clustering method
  • Choose based on
  • Visual plots
  • Validation techniques
  • Compare with model profiles
  • Application of more than one technique

44
Microarray Software
  • To assist in management / organization of the
    data and help in the data analysis process
  • Commercial and open-source available
  • Popular brands
  • Gene Spring
  • Genesis
  • MIDAS

45
A Gene-Coexpression Network for Global Discovery
of Conserved Genetic Modules (Stuart et al,
Science Vol. 302 2003)
  • Evolutionarily divergent organisms
  • Homo sapiens, Drosophila, C. elegans, S. cerevisiae
  • Importance
  • Only used genes with orthologs to develop
    networks between species
  • Using multiple species was found to give
    tighter and more exclusive clusters

46
Gene-Coexpression Networks
  • Methods
  • Group genes by BLAST
  • Identify gene groups that are functionally
    related by prior microarray data
  • Identify genes with significant Pearson
    correlations indicating conserved coexpression
  • Created a coexpression network based on
    correlations
  • Validated by permutation methods and by adding
    noise
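A heavily simplified sketch of the network-building step: link two genes when the Pearson correlation of their expression profiles exceeds a cutoff. The study worked with ortholog groups across several species and tested significance, none of which is reproduced here; the fixed threshold, gene names, and data are assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpression_edges(profiles, threshold):
    """Return gene pairs whose correlation is at least the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            edges.append((a, b))
    return edges

# Hypothetical expression profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(profiles, threshold=0.9)
```

A permutation check in this setting would shuffle each profile, rebuild the network, and compare the number of edges with the unshuffled result.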

47
Stuart et al, Science Vol 302 Oct 10 , 2003 p
249-255
48
Stuart et al, Science Vol 302 Oct 10 , 2003 p
249-255
49
Gene-Coexpression Networks
  • Results
  • Identification of genes involved in particular
    biological processes (that were previously
    unknown), because genes linked in the network are
    likely to be involved in similar processes
  • Selective forces impacting interconnected classes
    of genes over evolution

50
References
  • Baggerly et al, J. of Comp Bio Vol 8(6) 2001,
    p639-659
  • Chen, G. et al. (2002) Evaluation and comparison
    of clustering algorithms in analyzing ES cell
    gene expression data. Statistica Sinica, 12,
    241-262
  • Dozmorov and Centola, Bioinformatics Vol 19(2)
    2003, p204-211
  • Hatfield et al, Molecular Microbiology Vol 47(4)
    2003, p 871-877
  • Kerr, M.K. and Churchill, G.A. (2001)
    Bootstrapping cluster analysis: assessing the
    reliability of conclusions from microarray
    experiments. Proc. Natl Acad. Sci. USA, 98,
    8961-8965
  • Pan, Bioinformatics Vol 19(11) 2003, p 1333-1340
  • Sambrook J (ed), DNA Microarrays A Molecular
    Cloning Manual
  • Slonim D, Nature Genetics Vol 32 2002 p 502-508
  • Stuart et al, Science Vol 302 2003 p249-255
  • Troyanskaya et al, Bioinformatics Vol 18(11)
    2002, p1454-1461
  • Tsodikov et al, Bioinformatics Vol 18(2) 2002, p
    251-260.
  • Wilson et al, Bioinformatics Vol 19(11) 2003,
    p1325-1332

51
References
  • Datta, S., Datta, S. (2003) Comparisons and
    validation of statistical clustering techniques
    for microarray gene expression data.
    Bioinformatics, 19, 459-466.
  • Eisen, M.B., Spellman, P.T., Brown, P.O. and
    Botstein, D. (1998) Cluster analysis and display
    of genome-wide expression patterns. Proc. Natl
    Acad. Sci. USA, 95, 14863-14868.
  • McLachlan, G.J., Bean, R.W. and Peel, D. (2002) A
    mixture model-based approach to the clustering of
    microarray expression data. Bioinformatics, 18,
    1-10.
  • Quackenbush, J. (2001) Computational analysis of
    microarray data. Nat. Rev. Genet., 2, 418-427.
  • Waddell, P. and Kishino, H. (2000) Cluster
    inference methods and graphical models evaluated
    on NCI60 microarray gene expression data. Genome
    Informatics, 11, 129-140.
  • Xu and Li, Bioinformatics Vol 19(10) 2003, p
    1284-1289
  • Yeung, K., Haynor, D.R. and Ruzzo, W.L. (2001)
    Validating clustering for gene expression data.
    Bioinformatics, 17, 309-318.