Title: Analysis of SAGE Data: An Introduction
1 Analysis of SAGE Data: An Introduction
- Kevin R. Coombes
- Section of Bioinformatics
2 Outline
- Description of SAGE method
- Preliminary bioinformatics issues
- Description of analysis methods introduced in early paper
- Review of literature: statistics and SAGE
3 What is SAGE?
- Serial Analysis of Gene Expression
- Method to quantify gene expression levels in samples of cells
- Open system:
- Can potentially reveal expression levels of all genes (unbiased and comprehensive)
- Microarrays are closed, since they only tell you about the genes spotted on the array
Ref: Velculescu et al., Science 1995; 270:484-487
4 How does SAGE work?
3.(c) Discard loose fragments.
9. Sequence and record the tags and frequencies.
5 From ditags to counts
- Locate the punctuation CATG
- Extract ditags of length 20-26 between the punctuation
- Discard duplicate ditags (including in reverse direction); these are probably PCR artifacts
- Take extreme 10 bases as the two tags, reversing right-hand tag
- Discard linker sequences
- Count occurrences of each tag
SAGE software available at http://www.sagenet.org
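The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SAGE software itself. The function names and the `linker_tags` parameter are hypothetical, and the sketch reads "reversing" as taking the reverse complement, which is an assumption about what the slide means.

```python
from collections import Counter

ANCHOR = "CATG"   # the "punctuation" site
TAG_LEN = 10      # bases taken from each end of a ditag
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_tags(reads, linker_tags=frozenset(), min_len=20, max_len=26):
    """Count 10-base tags found in sequenced concatemer reads."""
    seen = set()
    counts = Counter()
    for read in reads:
        # Pieces bounded by CATG on both sides are candidate ditags.
        for ditag in read.upper().split(ANCHOR)[1:-1]:
            if not (min_len <= len(ditag) <= max_len):
                continue
            if set(ditag) - set("ACGT"):
                continue  # skip ambiguous base calls
            # A ditag and its reverse complement count as the same molecule.
            key = min(ditag, reverse_complement(ditag))
            if key in seen:
                continue  # duplicate ditag, probably a PCR artifact
            seen.add(key)
            left = ditag[:TAG_LEN]
            right = reverse_complement(ditag[-TAG_LEN:])
            for tag in (left, right):
                if tag not in linker_tags:
                    counts[tag] += 1
    return counts
```

Keeping a set of ditags already seen is what implements the duplicate filter; each tag is only counted once per distinct ditag.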
6 What does the data look like?
7 From tags to genes
- Collect sequence records from GenBank that are represented in UniGene
- Assign sequence orientation (by finding poly-A tail or poly-A signal or from annotations)
- Extract the 10 bases 3'-adjacent to the 3'-most CATG
- Assign UniGene identifier to each sequence with a SAGE tag
- Record (for each tag-gene pair):
- sequences with this tag
- sequences in gene cluster with this tag
Maps available at http://www.ncbi.nlm.nih.gov/SAGE
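A minimal sketch of the tag-extraction step, assuming each sequence is already oriented 5' to 3' on the sense strand. The function names and the `(cluster_id, sequence)` record layout are hypothetical and are not the SAGEmap pipeline.

```python
from collections import Counter


def extract_sage_tag(mrna, tag_len=10, anchor="CATG"):
    """Return the tag adjacent to the 3'-most anchor site, or None."""
    seq = mrna.upper()
    pos = seq.rfind(anchor)          # 3'-most CATG
    if pos == -1:
        return None                  # no anchor site at all
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def build_tag_map(records):
    """Count, for each (tag, cluster) pair, how many sequences carry the tag."""
    pair_counts = Counter()
    for cluster_id, seq in records:
        tag = extract_sage_tag(seq)
        if tag is not None:
            pair_counts[(tag, cluster_id)] += 1
    return pair_counts
```

A tag that appears under more than one cluster identifier in the resulting map is an ambiguous tag-gene assignment.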
8 From tags to genes
- Ideal situation:
- one gene, one tag
- True situation:
- one gene, many tags (alternative splicing, alternative polyadenylation)
- one tag, many genes (conserved 3' regions)
9 Sequencing Errors
- Estimated sequencing error rate:
- 0.7% per base (range 0.2% - 1%)
- Affect:
- ditags in a SAGE experiment
- can improve by using phred scores and discarding ambiguous sequences
- tag-gene mappings from GenBank
- RNA better than EST
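To see what a per-base error rate means for whole tags, here is a short worked calculation. It reads the quoted rate as 0.7% per base and assumes errors are independent across the 10 bases of a tag; both are assumptions of this sketch.

```python
def tag_error_probability(per_base_rate, tag_len=10):
    """Chance that a tag contains at least one miscalled base."""
    return 1.0 - (1.0 - per_base_rate) ** tag_len


for rate in (0.002, 0.007, 0.01):
    share = tag_error_probability(rate)
    print(f"{rate:.1%} per base -> {share:.1%} of tags affected")
```

Under these assumptions roughly 7% of tags would carry at least one error at the central estimate.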
10 Reliable tag-gene assignments
11 SAGE and cancer
- Ten SAGE libraries, two each from
- normal colon
- colon tumors
- colon cancer cell lines
- pancreatic tumors
- pancreatic cell lines
- Pooled each pair
Ref: Zhang et al., Science 1997; 276:1268-1272
12 Variability in SAGE libraries
13 Distribution of tags
- 303,706 total tags
- 48,471 distinct tags
- Distribution
- 85.9% seen up to 5 times (25% of mass)
- 12.7% between 5 and 50 times (30%)
- 0.1% between 50 and 500 times (26%)
- 0.1% more than 500 times (19%)
Ref: Zhang et al., Science 1997; 276:1268-1272
14 How many tags were missed?
- Simulation in the paper gave a 92% chance of detecting tags at 3 copies/cell
- Using binomial approximation:
- Get 95% chance for 3 copies/cell
- Only get 63% chance for 1 copy/cell
- Most of what they saw occurred at 1-5 copies per cell
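The binomial figures can be reproduced with a short calculation. The sketch assumes about 300,000 transcripts per cell, a value the slides do not state, and uses the 303,706 tags sequenced in the study.

```python
def detection_probability(copies_per_cell, tags_sequenced,
                          transcripts_per_cell=300_000):
    """Chance of seeing a transcript at least once among the sequenced tags.

    transcripts_per_cell is an assumed value, not taken from the slides.
    """
    p = copies_per_cell / transcripts_per_cell   # chance a given tag is this one
    return 1.0 - (1.0 - p) ** tags_sequenced


for copies in (1, 3):
    prob = detection_probability(copies, 303_706)
    print(f"{copies} copies/cell: {prob:.1%}")
```

With these inputs the result is close to 63% at 1 copy/cell and 95% at 3 copies/cell.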
15 Differential Expression
- Found 289 tags differentially expressed between normal colon and colon cancer (181 decreased, 108 increased)
- Method: Monte Carlo simulation
- 100,000 sims per transcript for relative likelihood of seeing observed difference
- Used observed distribution of transcripts to simulate 40 experiments
Ref: Zhang et al., Science 1997; 276:1268-1272
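The slides do not spell out the simulation procedure, so the following is only a generic Monte Carlo sketch of the idea, not the method of Zhang et al. It assumes that, under the null hypothesis, each observed copy of a tag falls in library 1 with probability n1 / (n1 + n2). The function name and parameters are hypothetical.

```python
import random


def monte_carlo_p_value(x, y, n1, n2, sims=100_000, seed=0):
    """Estimate how often a null experiment is at least as lopsided as (x, y).

    x, y   : counts of one tag in libraries 1 and 2
    n1, n2 : total tags sequenced in each library
    """
    rng = random.Random(seed)
    total = x + y
    share = n1 / (n1 + n2)            # expected share of the tag in library 1
    observed = abs(x - total * share)
    hits = 0
    for _ in range(sims):
        # Assign each of the tag's copies to library 1 or 2 at random.
        sim_x = sum(rng.random() < share for _ in range(total))
        if abs(sim_x - total * share) >= observed:
            hits += 1
    return hits / sims
```

The estimate is only as fine as 1/sims, which is one reason to use many simulations per transcript.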
16 Sensitivity
- Claim: 95% chance of detecting 6-fold difference
- Method: Monte Carlo
- 200 simulations, assuming abundance of 0.0001 in first sample and 0.0006 in second sample
Ref: Zhang et al., Science 1997; 276:1268-1272
17 Weaknesses in Analysis
- Failed to account for intrinsic variability in samples (which changes depending on abundance) in assessing significance
- Monte Carlo used the observed distribution, which is definitely not the true distribution
- Sensitivity only measured at one abundance level
18 Alternative Analytic Methods
- Audic and Claverie, Genome Res 1997; 7:986-995
- Chen et al., J Exp Med 1998; 188(9):1657-1668
- Kal et al., Mol Biol Cell 1999; 10:1859-1872
- Michiels et al., Physiol Genomics 1999; 1:83-91
- Stollberg et al., Genome Res 2000; 10:1241-1248
- Man et al., Bioinformatics 2000; 16:953-959
19 Audic and Claverie
- Main goal: confidence limits for differential expression
- Use Poisson approximation for the number of times x you see the same tag
- Put a uniform prior on the Poisson parameter; get posterior probability of seeing the tag y times in a new experiment
- $p(y \mid x) = \dfrac{(x+y)!}{x!\, y!\, 2^{x+y+1}}$
- Generalizes to unequal sample sizes
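A minimal sketch of this probability in Python, including the unequal-size form p(y | x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), which reduces to the formula above when N1 = N2. The function names are hypothetical; log-gamma is used only to avoid overflow in the factorials.

```python
import math


def ac_probability(y, x, n1=1, n2=1):
    """p(y | x): chance of count y in library 2 given count x in library 1.

    n1, n2 are the library sizes; only their ratio matters.
    """
    r = n2 / n1
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(r))
    return math.exp(log_p)


def ac_upper_tail(y, x, n1=1, n2=1):
    """P(Y >= y | x): chance of a count at least as large as y."""
    return 1.0 - sum(ac_probability(k, x, n1, n2) for k in range(y))
```

For example, `ac_upper_tail(12, 2)` gives the chance of seeing 12 or more copies in an equally sized second library after seeing 2 in the first; a small value suggests a real increase.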
20 Chen et al.
- Assume:
- equal sample sizes
- tag has concentration X, Y in two samples
- Look at W = X/(X+Y)
- Use a symmetric Beta prior distribution with a peak near 0.5 (since most genes don't change)
- Use Bayes' theorem to compute posterior probability of threefold difference in expression
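One way to sketch this calculation, under stated assumptions: given the total count x + y, the count x is treated as binomial with parameter W, so a Beta(a, a) prior gives a Beta(x + a, y + a) posterior. The prior parameter a = 3 is a hypothetical choice (the slides do not give one), and "threefold difference" is read as W >= 3/4 or W <= 1/4. The integral is done with a plain midpoint rule to stay within the standard library.

```python
import math


def threefold_posterior(x, y, a=3.0, fold=3.0, steps=20_000):
    """Posterior probability of at least a `fold` difference in either direction.

    a is the parameter of the assumed symmetric Beta(a, a) prior on W.
    """
    alpha, beta = x + a, y + a        # posterior is Beta(x + a, y + a)
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))

    def density(w):
        return math.exp(log_norm + (alpha - 1) * math.log(w)
                        + (beta - 1) * math.log(1 - w))

    lo, hi = 1.0 / (1.0 + fold), fold / (1.0 + fold)   # W = 1/4 and W = 3/4
    h = (hi - lo) / steps
    # Midpoint rule for the posterior mass with less than a `fold` change.
    inside = sum(density(lo + (i + 0.5) * h) for i in range(steps)) * h
    return 1.0 - inside
```

A larger a pulls the posterior harder toward W = 0.5, so low-count tags need a more extreme split before the posterior favors a threefold change.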
21 Unequal sample sizes
- This analysis generalizes easily to the case of unequal size SAGE libraries
- Lal et al., Cancer Res 1999; 59:5403-5407
- This method is used at the NCBI SAGEmap web site for online differential expression queries
- http://www.ncbi.nlm.nih.gov/SAGE
22 Kal et al.
- Assume the proportion of times you see a tag has a binomial distribution
- Replace with a normal approximation to compute confidence limits
- Used at http://www.cmbi.kun.nl/usage
- Equivalent to chi-square test on 2x2 table
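The equivalence in the last bullet can be checked numerically. The sketch below assumes the pooled-proportion form of the two-sample Z statistic and the Pearson chi-square without continuity correction; the function names are hypothetical.

```python
import math


def two_proportion_z(x, n1, y, n2):
    """Z statistic for tag counts x of n1 and y of n2, using the pooled proportion."""
    pooled = (x + y) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x / n1 - y / n2) / se


def chi_square_2x2(x, n1, y, n2):
    """Pearson chi-square (no continuity correction) for [[x, n1-x], [y, n2-y]]."""
    observed = [[x, n1 - x], [y, n2 - y]]
    rows = [n1, n2]
    cols = [x + y, n1 + n2 - x - y]
    total = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def two_sided_p(z):
    """Two-sided normal p-value for a Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


z = two_proportion_z(30, 50_000, 60, 55_000)
print(z ** 2, chi_square_2x2(30, 50_000, 60, 55_000))  # the two values agree
```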
23 Michiels et al.
- First perform overall chi-square test to decide if the two SAGE libraries being compared are different
- Get significance by Monte Carlo simulation
- Perform gene-by-gene chi-square tests and use them to rank genes in order of most likely to be different
24 Stollberg et al.
- Assume binomial distributions
- Model the binomial parameters as a sum of two exponentials
- fit to the Zhang step function data
- Simulate from this model, adding:
- sequencing errors
- nonuniqueness of tags
- nonrandomness of DNA sequences
25 Stollberg et al.
- Key finding:
- Naively using observed data to fit model parameters cannot recover the observed data by simulation
- Maximum likelihood estimates of the parameters that do recover the observed data look very different
26 Stollberg et al.
27 Man et al.
- Compares specificity and sensitivity of different tests for differential expression:
- Audic and Claverie
- Kal
- Fisher's exact test
- Monte Carlo simulation of experiments
- Findings:
- Similar power at high abundance
- Kal has highest power at low abundance
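Of the tests compared, Fisher's exact test is the one not described on an earlier slide. A minimal standard-library sketch for a single tag follows; the function names are hypothetical, and the two-sided p-value uses the common convention of summing all tables no more probable than the observed one.

```python
import math


def log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(x, n1, y, n2):
    """Two-sided Fisher p-value for the table [[x, n1 - x], [y, n2 - y]]."""
    k = x + y                         # total copies of the tag

    def log_pmf(i):
        # Hypergeometric probability that i of the k copies fall in library 1.
        return (log_choose(n1, i) + log_choose(n2, k - i)
                - log_choose(n1 + n2, k))

    observed = log_pmf(x)
    lo, hi = max(0, k - n2), min(k, n1)
    total = 0.0
    for i in range(lo, hi + 1):
        lp = log_pmf(i)
        if lp <= observed + 1e-9:     # tables no more likely than the observed one
            total += math.exp(lp)
    return min(1.0, total)
```

Because the sum runs only over the x + y + 1 possible splits of the tag's copies, the cost stays small even for libraries of tens of thousands of tags.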
28 Questions
- Sample size computations:
- How many tags should we sequence if we want to see tags of a given frequency?
- How many tags should we sequence if we want to see a given percentage of tags?
- How many tags are expressed in a sample?
- Best method for identifying differential expression?
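For the first sample-size question, a simple binomial argument gives a starting point: a tag of frequency f is seen at least once in N tags with probability 1 - (1 - f)^N, so N >= ln(1 - q) / ln(1 - f) for a target detection probability q. The sketch below assumes independent sampling of tags and ignores sequencing errors; the function name is hypothetical.

```python
import math


def tags_needed(tag_frequency, detect_prob=0.95):
    """Smallest number of tags giving at least detect_prob of seeing the tag once."""
    return math.ceil(math.log1p(-detect_prob) / math.log1p(-tag_frequency))


# Hypothetical example: a transcript making up 1 in 300,000 tags.
print(tags_needed(1 / 300_000))
```

The other questions on this slide depend on the unknown abundance distribution of tags and do not reduce to a one-line formula.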
29 Additional SAGE references
- Review:
- Madden et al., Drug Disc Today 2000; 5:415-425
- Online Tools:
- Lash et al., Genome Res 2000; 10:1051-1060
- van Kampen et al., Bioinformatics 2000; 16:899-905
- Comparison of SAGE and Affymetrix:
- Ishii et al., Genomics 2000; 68:136-143
- Combine SAGE and custom microarrays:
- Nacht et al., Cancer Res 1999; 59:5464-5470
- Mapping SAGE data onto genome:
- Caron et al., Science 2001; 291:1289-1292
- Data mining the public SAGE libraries:
- Argani et al., Cancer Res 2001; 61:4320-4324