Pabio590B

Slides: 48

Provided by: DS6

Learn more at: http://faculty.washington.edu

1
Pabio590B week 1: Microarrays

Overview
Design & hybridization
Data analysis

2
Overview

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

3
Experiments you might do
Measure RNA expression
- Changes in gene expression over time / lifecycle
- Compare differences between tissues/cell types
- Comparisons between species/strains/conditions
- Whole genome transcript mapping (tiling arrays)
Measure DNA content
- Presence or absence of region
- Copy number via Comparative Genomic Hybridization
- SNP Genotyping/Re-sequencing
Other
- ChIP on chip arrays
- RIP on chip

4
Microarray Design

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

5
RNA Expression Chip Designs

Expression Array
- N number of probes per gene of interest
- Trade-off between accuracy and number of
features
Tiling array
- Place probe of X nt every Y bases
- Biased vs unbiased
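
The tiling rule above ("probe of X nt every Y bases") can be sketched in a few lines. This is only an illustration: the function name `tile_probes` and the example sequence are made up, and a real design would also screen probes for specificity.

```python
def tile_probes(sequence, probe_len, step):
    """Return (start, probe) pairs: one probe of probe_len nt every step bases."""
    probes = []
    # Stop early enough that the last probe still fits inside the sequence.
    for start in range(0, len(sequence) - probe_len + 1, step):
        probes.append((start, sequence[start:start + probe_len]))
    return probes


region = "ATGCGTACGTTAGCCTAGGCTA"  # hypothetical 22 nt region, for illustration only
for start, probe in tile_probes(region, probe_len=8, step=5):
    print(start, probe)
```

When `step` is smaller than `probe_len` (as here) the probes overlap; a larger `step` leaves gaps between probes.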

6
Probe considerations

Number of probes per region of interest
Specificity of probes
Distance between probes (tiling)
Mismatch probes (Affymetrix)

7
Hybridization

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

8
Two-color vs One-color

Two-color
Two samples on each slide
cy3 - green - 532nm
cy5 - red - 635nm
One-color
One sample per slide
cy3
No significant difference in accuracy or
reproducibility

9
Designs for Two-color Array
10
Data Normalization

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

11
Within-Array Normalization
Lowess Normalization
[Figure: Cy3/Cy5 ratio plotted against signal intensity, before and after lowess normalization]
12
Between-Array Normalization

RNA Spike-in
Random Probes
Median Scaling
Quantile Scaling

Median and quantile normalization are predicated
upon the arrays in question having the same
distribution. That is to say, only if you can
safely assume that the bulk of genes have the same
expression across the arrays can you use those
methods.
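
To make the two methods concrete, here is a minimal sketch of median scaling and quantile normalization. It assumes each array is a plain list of probe intensities in the same probe order, and it ignores ties; the function names are illustrative and not taken from any package.

```python
from statistics import median


def median_scale(arrays):
    """Rescale each array so its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Force every array onto one shared distribution (mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # probe keeps its rank, takes the reference value
        normalized.append(out)
    return normalized
```

After quantile normalization every array contains exactly the same set of values, which is why the method is only appropriate when the arrays are expected to share a distribution.
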
13
Quantile Normalization
[Figure: arrays before and after quantile normalization]
14
Statistical Analysis

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

15
Some Advice About Statistics

Don't get too hung up on p-values or any other
stat.
Ultimately what matters is biological relevance
and external knowledge and other heterogeneous
measures (related functions, pathways, other data
types) that are not easily measured by statistics
alone.
P-values should help you evaluate the strength of
the evidence, rather than being used as an
absolute yardstick of significance.
Statistical significance is not necessarily the
same as biological relevance and vice-versa.

John Quackenbush
16
Is this gene differentially expressed between the
two conditions?
17
To rephrase the question

Is the mean probe value different between Samples
A & B?
Null Hypothesis H0: means are the same
Alternate Hypothesis Ha: means are different

18
What affects our ability to test the hypothesis?

Difference in means
Number of sample points
Standard deviations of sample

19
The T-statistic

Directly proportional to difference in means
Inversely proportional to standard deviation
Increases with sample size (proportional to its
square root)

The T-test calculates how likely a T-statistic at
least this extreme is, given the null hypothesis
that the means are actually the same.
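
As a rough illustration of those three dependencies, the sketch below computes a two-sample T-statistic. It assumes the unequal-variance (Welch) form, which is one of several variants and not necessarily the one the slides had in mind.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's T-statistic for two independent samples."""
    # The standard error shrinks as sample sizes grow and grows with the variances.
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


print(t_statistic([1, 2, 3], [2, 3, 4]))
```

A larger difference in means, a smaller spread, or more data points all push the statistic further from zero.
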
20
T-statistic and P-values

P-values can be determined from theoretical
distributions or permutation testing
Theoretical distributions rely on a set of
assumptions that array experiments do not
necessarily follow
Permutation tests rely on far fewer assumptions
(mainly that samples are exchangeable under the
null hypothesis)

21
Permutation Testing
1) Permute n times by random shuffling
2) Calculate T-statistic for each permutation
3) Calculate probability of original T-statistic
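
The three steps can be sketched as follows. This is a minimal two-sided version under stated assumptions: a Welch-style T-statistic, a fixed random seed for repeatability, and a +1 correction so the estimated p-value is never exactly zero. The function names and defaults are illustrative.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's T-statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                              # 1) permute
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(t_statistic(perm_a, perm_b)) >= observed:  # 2) T-statistic per permutation
            hits += 1
    return (hits + 1) / (n_permutations + 1)             # 3) probability of the original
```

With only a handful of values per group there are few distinct permutations, so the smallest achievable p-value is limited.
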
22
Interpreting P-values

T-test tests the null hypothesis that sample
means are equal
Gene X has p-value of 5% from T-test
If Gene X is NOT differentially expressed, there
is a 5% chance of seeing a difference this large
This is not the same as a 95% chance that it is
differentially expressed
α = False Positive Rate = 5%

23
T-Test Refinements

Equal vs unequal variance of samples
Equal vs unequal sample size
Dependent vs independent samples
CAVEAT
As sample sizes get smaller, the validity of
p-values calculated via permutation diminishes.
Microarrays typically have few probes per gene,
so sample size is smallish.

24
Multiple Testing Problem

If there is a 5% chance of a false positive in one
test, what happens when we are testing
10,000 genes?
The majority of those genes are not
differentially expressed, but
a 5% p-value cutoff means we expect about 500
false positives.

25
Family-Wise Error Rate (FWER)
FWER is the probability of making one or more
false discoveries (type I errors) among all the
hypotheses when performing multiple pair-wise
tests.

One comparison: FWER = p-value threshold (α)
10,000 comparisons: FWER ≈ 1.0

That means that when making 10,000 comparisons
you are virtually certain to make at least one error.
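
The arithmetic behind these numbers is short enough to check directly. The sketch below assumes the tests are independent, which is a simplification for real array data; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no gene is truly different."""
    return n_tests * alpha


def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


print(expected_false_positives(10_000, 0.05))
print(family_wise_error_rate(1, 0.05))
print(family_wise_error_rate(10_000, 0.05))
```

With one test the FWER is just the threshold itself; with 10,000 tests it is indistinguishable from 1.
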
26
Bonferroni Correction

What if you want to keep the FWER at 5%?
0.05 / 10,000 = 0.000005 = 5e-6
Only those genes with T-test p-value of < 5e-6
are called differentially expressed
Leads to experiment-wide α of 0.05

The Standard Bonferroni correction is considered
very conservative
27
Adjusted Bonferroni

Rank all genes by ascending order of p-value
Assign gene with smallest p-value a corrected
p-value threshold of α / N (0.05/10,000)
Assign gene with second smallest p-value a
corrected p-value threshold of α / (N-1)
Etc

The Adjusted Bonferroni correction is less
conservative
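
Both corrections can be sketched side by side. The step-wise scheme described above is commonly known as Holm's step-down procedure; the sketch assumes that reading, and the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: every gene is compared against alpha / N."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Adjusted (step-down) Bonferroni: thresholds alpha/N, alpha/(N-1), ..."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # gene indices by ascending p
    significant = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (n - rank):
            significant[i] = True
        else:
            break  # once one gene fails, all genes with larger p-values fail too
    return significant


p_values = [0.001, 0.015, 0.04, 0.02]  # made-up p-values for four genes
print(bonferroni(p_values))
print(holm(p_values))
```

On these made-up values the standard correction keeps only the first gene, while the step-down version keeps all four, which illustrates why it is described as less conservative.
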
28
False Discovery Rate

Measures the expected proportion of false
positives amongst discovered genes
Factors affecting FDR
Proportion of actual differentially expressed
genes
Distribution of the true differences
Measurement variability
Sample size
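
One common way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not name a specific method, so treat the sketch below as an assumption about what might be used rather than the course's own approach.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag genes as discoveries while keeping the expected FDR at or below fdr."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # gene indices by ascending p
    # Find the largest rank k (1-based) whose p-value is at most (k / n) * fdr.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * fdr:
            cutoff_rank = rank
    significant = [False] * n
    for i in order[:cutoff_rank]:
        significant[i] = True  # everything up to the cutoff rank is a discovery
    return significant
```

Unlike the Bonferroni-style corrections, the threshold here grows with rank, so more genes are typically retained at the cost of tolerating a known fraction of false discoveries.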

29
Analysis of Variance (ANOVA)

Microarray testing across 3 or more conditions
Is a gene expressed equally across all
conditions?
F-ratio for given gene X
(variability across conditions) / (variability
within conditions)
Calculate p-value
Look up probability of F-ratio
Determine probability by permutation testing
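
A minimal one-way ANOVA F-ratio for a single gene might look like the sketch below, where each inner list holds that gene's values under one condition. The function name and the example values are illustrative.

```python
from statistics import mean


def f_ratio(groups):
    """F = (variability across conditions) / (variability within conditions)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-condition sum of squares: how far each condition mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-condition sum of squares: spread of replicates around their own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


print(f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F-ratio means the condition means differ by more than the replicate noise would explain; the p-value then comes from the F distribution or from permutation.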

30
Significance Analysis of Microarrays (SAM)

Gene-specific T-tests
Computes statistic (dj) for each gene j
measures the relationship between gene expression
and a response variable
describes and groups the data based on
experimental conditions
uses non-parametric statistics
repeated permutations are used to determine FDR
Accounts for correlations in genes and avoids
parametric assumptions about the (normal vs
non-normal) distribution of individual genes

31
Clustering

Affix/synthesize probes of known sequence to chip
Hybridize with labeled sample
Quantify level of hybridization to each probe
Normalization
Statistics
Clustering & more

32
Why do clustering?

Identify groups of possibly co-regulated genes
(e.g. so you can look for common sequence motifs)
Identify typical temporal or spatial gene
expression patterns (e.g. cell-cycle data)
Arrange a set of genes in a linear order that is
at least not totally meaningless

33
Can also cluster experiments

Quality control
detect bad/outlying experiments
Identify or categorize classes of biological
samples
sorting by tumor sub-type

34
How do you cluster?

Define a distance measure
Group genes (or experiments) based on that measure

Objects are placed into groups. Objects within a
group are more similar to each other than objects
across groups.
In some cases groups are hierarchically organized
based on the intra-group similarity
35
Distance Metrics
Pair    Correlation    Euclidean distance
(X,Y)   1              4
(X,Z)   -1             2.83
(X,W)   1              1.41
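
The contrast between the two metrics can be reproduced with a small sketch. The profiles `x`, `y` and `z` below are made up for illustration and are not the ones plotted on the slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: compares the shape (direction) of two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance: sensitive to magnitude as well as direction."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1, 2, 3, 4]
y = [11, 12, 13, 14]  # same shape as x, much higher level
z = [4, 3, 2, 1]      # opposite shape, similar level

print(pearson(x, y), euclidean(x, y))
print(pearson(x, z), euclidean(x, z))
```

Here `y` is perfectly correlated with `x` yet far away in Euclidean terms, while `z` is perfectly anti-correlated but much closer, so the two metrics would group these profiles differently.
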
36
Clustering considerations

Euclidean clustering
Magnitude & direction
At least 2 conditions

Correlation clustering
Direction only
At least 3 conditions

Array data is noisy, so you probably need
multiple data points per condition

Clustering methods
Hierarchical
Partitional
Other

37
Hierarchical clustering
Agglomerative, bottom-up method

Initial state
- each item is a cluster
Iterate
- join two most similar clusters
Stop
- when number of clusters reaches user-defined
value

38
Linkage methods
Ways to determine cluster similarity
Single Link: Similarity of two most similar
members
Complete Link: Similarity of two least similar
members
Average Link: Average similarity of all members
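
A compact sketch of agglomerative clustering with the three linkage rules is shown below. It assumes Euclidean distance between points given as tuples and stops at a user-defined number of clusters; it is written for clarity, not speed, and the names are illustrative.

```python
from math import dist  # Python 3.8+


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(pairwise)                  # two most similar (closest) members
    if linkage == "complete":
        return max(pairwise)                  # two least similar (farthest) members
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # average over all member pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="average"):
    """Bottom-up clustering: start with singletons, merge the closest pair each round."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two most similar clusters
        del clusters[j]
    return clusters
```

On well-separated data all three linkage rules give the same grouping; they diverge when clusters are elongated or close together.
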
39
Comparing linkage methods
40
Partitional (K-means) clustering
Divisive, top-down method

1. Partition data into K random clusters
2. Assign each point to nearest cluster
3. Calculate centroid of each cluster
4. GOTO step 2 (repeat until assignments stop changing)
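
The loop can be sketched as below. Two assumptions are worth flagging: centroids are computed from the random starting partition before the first assignment (a point needs a centroid to be "nearest" to), and any cluster that ends up empty is simply dropped. The names and defaults are illustrative.

```python
import random
from math import dist  # Python 3.8+


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    clusters = [shuffled[i::k] for i in range(k)]  # K random starting clusters
    for _ in range(max_iterations):
        centroids = [centroid(c) for c in clusters if c]  # skip emptied clusters
        new_clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            new_clusters[nearest].append(p)
        if new_clusters == clusters:  # assignments stopped changing
            break
        clusters = new_clusters
    return clusters
```

Because the start is random, different seeds can converge to different groupings on less clearly separated data, so running the procedure several times is common practice.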

41
Other methods

Support Vector Machines (SVM)
K-nearest Neighbor (KNN)
Self Organizing Maps (SOM)
Self Organizing Tree Algorithm (SOTA)
Cluster Affinity Search Technique (CAST)
QT Cluster (QTC)
Discriminant Analysis Classifier (DAM)
Principal Component Analysis (PCA)
Etc.

42
Warnings and Limitations

Clusters are like statistics
Ideally they mirror reality, but they should
only be taken seriously in conjunction with
confirmatory data from other sources.
Clustering software clusters things
If you tell it to find 4 clusters, it will find
4 clusters in anything!
Garbage In, Garbage Out
Clustering typically relies on a set of input
parameters that can be hard to evaluate except
for empirically evaluating the outputs for a
given set of input parameters.

43
Clusters Interpretation - EASE (Expression
Analysis Systematic Explorer)
Population size: 40 genes
Cluster size: 12 genes
10 genes, shown in green, have a common
biological theme and 8 occur within the cluster
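
How surprising is it to find 8 of the 10 themed genes in a cluster of 12? One way to quantify that is a hypergeometric tail probability, sketched below. EASE itself reports a related but adjusted score, so this should be read as an approximation of the idea rather than the EASE calculation; the function name is illustrative.

```python
from math import comb


def enrichment_p_value(population, theme_total, cluster_size, theme_in_cluster):
    """P(at least theme_in_cluster themed genes in a randomly drawn cluster)."""
    total = comb(population, cluster_size)
    upper = min(theme_total, cluster_size)
    tail = sum(
        comb(theme_total, k) * comb(population - theme_total, cluster_size - k)
        for k in range(theme_in_cluster, upper + 1)
    )
    return tail / total


# Numbers from the slide: 40 genes, 10 themed, cluster of 12, 8 themed in cluster.
print(enrichment_p_value(40, 10, 12, 8))
```

For the slide's numbers the probability is on the order of 2 in 10,000, so the theme is strongly over-represented in the cluster.
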
44
Microarray Analysis Software

TIGR MEV
Limma
SAM
EDGE
These software packages are free and open-source
Each has different strengths/weaknesses and makes
different assumptions about your data

45
Analysis Platforms

Gene Sifter
Rosetta Resolver
Bio Discovery

46
Microarray Data Sources

Gene Expression Omnibus (NCBI)
ArrayExpress (EBI)
Stanford Microarray Database
Yale Microarray Database

47
Microarray Data Standards

Microarray Gene Expression Data Society (MGED)
MIAME
MAGE - OM
MAGE ML
RNA Abundance Database (RAD)
Integrating data from various types of expression
experiments
