Pabio590B - PowerPoint PPT Presentation

Slides: 48
Provided by: DS6

Transcript and Presenter's Notes

Title: Pabio590B


1
Pabio590B week 1: Microarrays
  • Overview
  • Design & hybridization
  • Data analysis

2
Overview
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

3
Experiments you might do
Measure RNA expression
  • Changes in gene expression over time / lifecycle
  • Compare differences between tissues/cell types
  • Comparisons between species/strains/conditions
  • Whole genome transcript mapping (tiling arrays)
Measure DNA content
  • Presence or absence of region
  • Copy number via Comparative Genomic Hybridization
  • SNP Genotyping/Re-sequencing
Other
  • ChIP on chip arrays
  • RIP on chip

4
Microarray Design
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

5
RNA Expression Chip Designs
  • Expression Array
    - N probes per gene of interest
    - Trade-off between accuracy and number of features
  • Tiling array
    - Place probe of X nt every Y bases
    - Biased vs unbiased

6
Probe considerations
  • Number of probes per region of interest
  • Specificity of probes
  • Distance between probes (tiling)
  • Mismatch probes (Affymetrix)

7
Hybridization
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

8
Two-color vs One-color
  • Two-color
    - Two samples on each slide
    - Cy3 - green - 532nm
    - Cy5 - red - 635nm
  • One-color
    - One sample per slide
    - Cy3
  • No significant difference in accuracy or
    reproducibility

9
Designs for Two-color Array
10
Data Normalization
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

11
Within-Array Normalization
Lowess Normalization
[Figure: Cy3/Cy5 plotted against signal intensity, before and after normalization]
12
Between-Array Normalization
  • RNA Spike-in
  • Random Probes
  • Median Scaling
  • Quantile Scaling

Median and quantile normalization are predicated
upon the arrays in question having the same
distribution. That is to say, only if you can
safely assume that the bulk of genes have the same
expression across the arrays can you use those
methods.
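A minimal sketch of median scaling, assuming each array is a plain list of probe intensities and that the common target is the mean of the per-array medians (the function name and the choice of target are illustrative assumptions, not taken from the slides):

```python
from statistics import median


def median_scale(arrays):
    """Rescale every array so that its median equals a common target."""
    medians = [median(a) for a in arrays]
    # Assumption: use the mean of the array medians as the common target.
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Two hypothetical arrays; the second is uniformly twice as bright.
arrays = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
scaled = median_scale(arrays)
```

After scaling, both arrays share the same median, which is only meaningful if most genes really are unchanged between the arrays.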
13
Quantile Normalization
[Figure: before and after quantile normalization]
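A rough sketch of quantile normalization, assuming equal-length arrays held as plain lists and ties broken by position (both are simplifying assumptions): each value is replaced by the mean, across arrays, of the values that share its rank.

```python
def quantile_normalize(arrays):
    """Force every array onto the same distribution (mean of sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average across arrays at each rank.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Indices of this array's probes, from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical three-probe arrays.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
normalized = quantile_normalize(arrays)
```

Each probe keeps its rank within its own array, but every array ends up with exactly the same set of values.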
14
Statistical Analysis
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

15
Some Advice About Statistics
  • Don't get too hung up on p-values or any other
    stat.
  • Ultimately what matters is biological relevance
    and external knowledge and other heterogeneous
    measures (related functions, pathways, other data
    types) that are not easily measured by statistics
    alone.
  • P-values should help you evaluate the strength of
    the evidence, rather than being used as an
    absolute yardstick of significance.
  • Statistical significance is not necessarily the
    same as biological relevance and vice-versa.

John Quackenbush
16
Is this gene differentially expressed between the
two conditions?
17
To rephrase the question
  • Is the mean probe value different between Samples
    A & B?
  • Null Hypothesis (H0): means are the same
  • Alternate Hypothesis (Ha): means are different

18
What affects our ability to test the hypothesis?
  • Difference in means
  • Number of sample points
  • Standard deviations of sample

19
The T-statistic
  • Directly proportional to difference in means
  • Inversely proportional to standard deviation
  • Increases with sample size (proportional to its
    square root)

The T-test calculates how likely a T-statistic at
least this extreme is, given the null hypothesis
that the means are actually the same.
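As an illustration, here is a small sketch of the unequal-variance (Welch) form of the T-statistic; the choice of this variant and the sample values are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's T-statistic for two independent samples."""
    # Standard error of the difference in means; shrinks as samples grow.
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical probe values for one gene in two conditions.
sample_a = [8.1, 7.9, 8.4, 8.0]
sample_b = [6.9, 7.2, 7.0, 7.1]
t = t_statistic(sample_a, sample_b)
```

A larger difference in means, a smaller spread, or more data points all push the statistic further from zero.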
20
T-statistic and P-values
  • P-values can be determined from theoretical
    distributions or permutation testing
  • Theoretical distributions rely on a set of
    assumptions that array experiments do not
    necessarily follow
  • Permutation tests rely on far fewer assumptions
    (mainly that samples are exchangeable under the
    null hypothesis)

21
Permutation Testing
1) Permute n times by random shuffling
2) Calculate T-statistic for each permutation
3) Calculate probability of original T-statistic
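The three steps can be sketched as follows. This is a minimal two-sided version; the Welch T-statistic, the number of permutations, the fixed seed and the add-one correction are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's T-statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Fraction of shuffled labelings with a T-statistic at least as extreme."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # 1) permute sample labels
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(t_statistic(perm_a, perm_b)) >= observed:  # 2) T per permutation
            at_least_as_extreme += 1
    # 3) add-one correction keeps the estimate from being exactly zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical, clearly separated samples (all values distinct).
sample_a = [8.1, 7.9, 8.4, 8.0, 8.3]
sample_b = [6.9, 7.2, 7.0, 7.1, 6.8]
p = permutation_p_value(sample_a, sample_b)
```

With only five values per group there are just 252 distinct splits, so the p-value cannot go much below about 0.008 no matter how large the difference is.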
22
Interpreting P-values
  • T-test tests the null hypothesis that sample
    means are equal
  • Gene X has p-value of 5% from T-test
  • This does not mean a 95% chance that it is
    differentially expressed
  • It means a 5% chance of seeing a difference this
    large if the gene is NOT differentially expressed
  • α = False Positive Rate = 5%

23
T-Test Refinements
  • Equal vs unequal variance of samples
  • Equal vs unequal sample size
  • Dependent vs independent samples
  • CAVEAT
  • As sample sizes get smaller, the validity of
    p-values calculated via permutation diminishes.
  • Microarrays typically have few probes per gene,
    so sample size is smallish.

24
Multiple Testing Problem
  • If there is a 5% chance of a false positive in one
    test, what happens when we are testing
    10,000 genes?
  • The majority of those genes are not
    differentially expressed, but
  • a 5% p-value cutoff means we can expect about 500
    false positives.

25
Family-Wise Error Rate (FWER)
FWER is the probability of making one or more
false discoveries (type I errors) among all the
hypotheses when performing multiple pair-wise
tests.
  • One comparison: FWER = p-value cutoff (α)
  • 10,000 comparisons: FWER ≈ 1.0

That means that when making 10,000 comparisons
you are virtually certain to make at least one error.
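A quick check of these numbers, assuming the tests are independent and every null hypothesis is true (the slides do not state either assumption):

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) over n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


fwer_one = family_wise_error_rate(0.05, 1)        # a single comparison
fwer_many = family_wise_error_rate(0.05, 10_000)  # a whole array of genes
expected_false_positives = 0.05 * 10_000          # about 500
```

Under these assumptions the FWER is already above 0.99 at around 90 tests, long before 10,000 is reached.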
26
Bonferroni Correction
  • What if you want to keep the FWER at 5%?
  • 0.05 / 10,000 = 0.000005 = 5e-6
  • Only those genes with a T-test p-value < 5e-6
    are called differentially expressed
  • Leads to an experiment-wide α of 0.05

The standard Bonferroni correction is considered
very conservative.
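A small sketch of the correction applied to a list of p-values (the function name and the four example p-values are made up for illustration):

```python
from math import isclose


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# The per-gene cutoff for 10,000 genes at an experiment-wide 0.05.
cutoff = 0.05 / 10_000

# Four hypothetical genes: the cutoff is 0.05 / 4 = 0.0125.
calls = bonferroni_significant([0.001, 0.015, 0.04, 0.2])
```

Only the first hypothetical gene survives, even though three of the four have unadjusted p-values under 0.05.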
27
Adjusted Bonferroni
  • Rank all genes by ascending order of p-value
  • Compare the smallest p-value against a critical
    p-value of α / N (0.05/10,000)
  • Compare the second smallest p-value against a
    critical p-value of α / (N-1)
  • Etc.

The Adjusted Bonferroni correction is less
conservative.
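This step-down scheme resembles what is usually called the Holm procedure. The sketch below follows that procedure; treating the slide's method as identical to Holm, including the rule of stopping at the first failure, is an assumption.

```python
def holm_significant(p_values, alpha=0.05):
    """Step-down correction: cutoffs alpha/N, alpha/(N-1), ... by rank."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (n - rank):
            significant[idx] = True
        else:
            break  # once one gene fails, all larger p-values fail too
    return significant


# Four hypothetical genes; cutoffs are 0.0125, 0.0167, 0.025, 0.05.
calls = holm_significant([0.001, 0.015, 0.04, 0.2])
```

Here the second gene (p = 0.015) passes its cutoff of 0.05/3, whereas a plain cutoff of 0.05/4 for every gene would have rejected it.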
28
False Discovery Rate
  • Measures the expected proportion of false
    positives among the discovered genes
  • Factors affecting FDR
    - Proportion of actual differentially expressed
      genes
    - Distribution of the true differences
    - Measurement variability
    - Sample size
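The slide does not name a procedure for controlling the FDR. One widely used option is the Benjamini-Hochberg step-up method, sketched here as an illustration rather than as the method the presenter had in mind.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag genes so that the expected false discovery rate stays near `fdr`."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / n) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * fdr:
            cutoff_rank = rank
    significant = [False] * n
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Four hypothetical genes; rank cutoffs are 0.0125, 0.025, 0.0375, 0.05.
calls = benjamini_hochberg([0.001, 0.02, 0.03, 0.2])
```

Three of the four hypothetical genes are called, which is more permissive than a family-wise correction on the same p-values.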

29
Analysis of Variance (ANOVA)
  • Microarray testing across 3 conditions
  • Is a gene expressed equally across all
    conditions?
  • F-ratio for given gene X
    - (variability across conditions) / (variability
      within conditions)
  • Calculate p-value
    - Look up probability of F-ratio, or
    - Determine probability by permutation testing
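A minimal one-way ANOVA sketch using the conventional definition of the F-ratio: mean square between conditions divided by mean square within conditions. The data values are hypothetical.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F-ratio for a list of per-condition measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variability of the condition means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of replicates around their own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical values for one gene under three conditions.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f = f_ratio(groups)
```

A large F-ratio means the conditions differ by more than the replicate noise would explain; the p-value then comes from the F-distribution or from permuting the condition labels.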

30
Significance Analysis of Microarrays (SAM)
  • Gene-specific T-tests
  • Computes statistic (dj) for each gene j
  • measures the relationship between gene expression
    and a response variable
  • describes and groups the data based on
    experimental conditions
  • uses non-parametric statistics
  • repeated permutations are used to determine FDR
  • Accounts for correlations in genes and avoids
    parametric assumptions about the (normal vs
    non-normal) distribution of individual genes

31
Clustering
  • Affix/synthesize probes of known sequence to chip
  • Hybridize with labeled sample
  • Quantify level of hybridization to each probe
  • Normalization
  • Statistics
  • Clustering & more

32
Why do clustering?
  • Identify groups of possibly co-regulated genes
    (e.g. so you can look for common sequence motifs)
  • Identify typical temporal or spatial gene
    expression patterns (e.g. cell-cycle data)
  • Arrange a set of genes in a linear order that is
    at least not totally meaningless

33
Can also cluster experiments
  • Quality control
  • detect bad/outlying experiments
  • Identify or categorize classes of biological
    samples
  • sorting by tumor sub-type

34
How do you cluster?
  • Define a distance measure
  • Group genes (or experiments) based on that measure

Objects are placed into groups. Objects within a
group are more similar to each other than objects
across groups.
In some cases groups are hierarchically organized
based on the intra-group similarity
35
Distance Metrics
Correlation vs Euclidean distance
  • (X,Y): correlation 1, distance 4
  • (X,Z): correlation -1, distance 2.83
  • (X,W): correlation 1, distance 1.41
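To make the contrast concrete, here is a short sketch of both measures. The three expression profiles are hypothetical and are not the vectors plotted on the slide.

```python
from math import sqrt
from statistics import mean


def euclidean_distance(x, y):
    """Straight-line distance: sensitive to magnitude and direction."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation: sensitive to the shape of the profile only."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariance / spread


x = [1.0, 2.0, 3.0]
z = [3.0, 2.0, 1.0]  # opposite trend to x
w = [0.0, 2.0, 4.0]  # same trend as x, different scale

corr_xz, dist_xz = pearson_correlation(x, z), euclidean_distance(x, z)
corr_xw, dist_xw = pearson_correlation(x, w), euclidean_distance(x, w)
```

Profiles with perfectly opposite trends can still be close in Euclidean terms, while perfectly correlated profiles can be far apart if their magnitudes differ.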
36
Clustering considerations
  • Euclidean clustering
    - Magnitude & direction
    - 2 conditions
  • Correlation clustering
    - Direction only
    - 3 conditions

Array data is noisy, so you probably need
multiple data points per condition.
  • Clustering methods
    - Hierarchical
    - Partitional
    - Other


37
Hierarchical clustering
Agglomerative, bottom-up method
  • Initial state
    - each item is a cluster
  • Iterate
    - join the two most similar clusters
  • Stop
    - when the number of clusters reaches a
      user-defined value

38
Linkage methods
Ways to determine cluster similarity
Single Link: similarity of the two most similar
members
Complete Link: similarity of the two least similar
members
Average Link: average similarity of all members
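A naive sketch of agglomerative clustering with a selectable linkage. It works on distances rather than similarities, so "most similar" becomes "smallest distance"; the points and function names are made up for the example.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage."""
    pairwise = [euclidean(a, b) for a in c1 for b in c2]
    if linkage == "single":      # two most similar members
        return min(pairwise)
    if linkage == "complete":    # two least similar members
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # average over all member pairs


def agglomerative(points, n_clusters, linkage="average"):
    clusters = [[p] for p in points]       # initial state: one item per cluster
    while len(clusters) > n_clusters:      # stop at the user-defined count
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                     # join the two closest clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, n_clusters=3, linkage="complete")
```

This brute-force search recomputes every pairwise distance on each pass, which is fine for a handful of points but far too slow for a whole array of genes.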
39
Comparing linkage methods
40
Partitional (K-means) clustering
Divisive, top-down method
  1) Partition data into K random clusters
  2) Calculate centroid of each cluster
  3) Assign each point to the nearest centroid
  4) Go to step 2 until assignments stop changing
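A minimal sketch of the K-means loop. The order used here (random partition, then centroids, then reassignment), the stopping rule, the fixed seed and the handling of emptied clusters are all assumptions made for the example.

```python
import random


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(points)))
    rng.shuffle(indices)
    labels = [0] * len(points)
    for position, idx in enumerate(indices):
        labels[idx] = position % k   # random partition with no empty cluster
    centroids = [None] * k
    for _ in range(max_iterations):
        for c in range(k):           # centroid of each cluster
            members = [p for p, label in zip(points, labels) if label == c]
            if members:              # an emptied cluster keeps its old centroid
                centroids[c] = centroid(members)
        # Reassign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:     # converged: nothing moved
            break
        labels = new_labels
    return labels, centroids


# Two hypothetical, well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
```

Which group receives label 0 depends on the random start, and on less cleanly separated data different seeds can give different clusterings.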

41
Other methods
  • Support Vector Machines (SVM)
  • K-nearest Neighbor (KNN)
  • Self Organizing Maps (SOM)
  • Self Organizing Tree Algorithm (SOTA)
  • Cluster Affinity Search Technique (CAST)
  • QT Cluster (QTC)
  • Discriminant Analysis Classifier (DAM)
  • Principal Component Analysis (PCA)
  • Etc.

42
Warnings and Limitations
  • Clusters are like statistics
  • Ideally they mirror reality, but they should
    only be taken seriously in conjunction with
    confirmatory data from other sources.
  • Clustering software clusters things
  • If you tell it to find 4 clusters, it will find
    4 clusters in anything!
  • Garbage In, Garbage Out
  • Clustering typically relies on a set of input
    parameters that can be hard to evaluate except
    for empirically evaluating the outputs for a
    given set of input parameters.

43
Clusters Interpretation - EASE (Expression
Analysis Systematic Explorer)
  • Population size: 40 genes
  • Cluster size: 12 genes
  • 10 genes, shown in green, have a common
    biological theme and 8 occur within the cluster
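One way to judge whether 8 of 10 themed genes landing in a 12-gene cluster is surprising is a plain hypergeometric tail probability, sketched below with the numbers from this slide. EASE reports its own score, which this sketch does not attempt to reproduce.

```python
from math import comb


def hypergeometric_tail(population, themed, cluster, themed_in_cluster):
    """P(at least `themed_in_cluster` themed genes in a randomly drawn cluster)."""
    total = comb(population, cluster)
    upper = min(themed, cluster)
    favourable = sum(
        comb(themed, k) * comb(population - themed, cluster - k)
        for k in range(themed_in_cluster, upper + 1)
    )
    return favourable / total


# Numbers from the slide: 40 genes, 10 themed, cluster of 12, 8 themed inside.
p = hypergeometric_tail(population=40, themed=10, cluster=12, themed_in_cluster=8)
```

The resulting probability is well below 0.001, so an overlap this large would be very unlikely if the cluster were a random draw from the 40 genes.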
44
Microarray Analysis Software
  • TIGR MEV
  • Limma
  • SAM
  • EDGE
  • These software packages are free and open-source
  • Each has different strengths/weaknesses and makes
    different assumptions about your data

45
Analysis Platforms
  • GeneSifter
  • Rosetta Resolver
  • BioDiscovery

46
Microarray Data Sources
  • Gene Expression Omnibus (NCBI)
  • ArrayExpress (EBI)
  • Stanford Microarray Database
  • Yale Microarray Database

47
Microarray Data Standards
  • Microarray Gene Expression Data Society (MGED)
  • MIAME
  • MAGE-OM
  • MAGE-ML
  • RNA Abundance Database (RAD)
  • Integrating data from various types of expression
    experiments