Analysis of SAGE Data: An Introduction

1
Analysis of SAGE Data: An Introduction

Kevin R. Coombes
Section of Bioinformatics

2
Outline

Description of SAGE method
Preliminary bioinformatics issues
Description of analysis methods introduced in an early paper
Review of literature: statistics and SAGE

3
What is SAGE?

Serial Analysis of Gene Expression
Method to quantify gene expression levels in samples of cells
Open system
Can potentially reveal expression levels of all genes: unbiased and comprehensive
Microarrays are closed, since they only tell you about the genes spotted on the array

Ref: Velculescu et al., Science 1995; 270:484-487
4
How does SAGE work?
3.(c) Discard loose fragments.
9. Sequence and record the tags and frequencies.
5
From ditags to counts

Locate the punctuation CATG
Extract ditags of length 20-26 between the punctuation
Discard duplicate ditags (including in reverse direction), which are probably PCR artifacts
Take the extreme 10 bases at each end as the two tags, reverse-complementing the right-hand tag
Discard linker sequences
Count occurrences of each tag

SAGE software available at http://www.sagenet.org
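
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SAGE software mentioned above: the function names are hypothetical, the read is assumed to be a clean string of A, C, G and T, and linker removal is left out.

```python
from collections import Counter

ANCHOR = "CATG"                          # the "punctuation" between ditags
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_tags(concatemer, tag_len=10, min_len=20, max_len=26):
    """Count tags in one concatemer read (illustrative sketch only)."""
    seen = set()
    counts = Counter()
    for ditag in concatemer.split(ANCHOR):
        # Keep only pieces whose length is plausible for a ditag.
        if not (min_len <= len(ditag) <= max_len):
            continue
        # A ditag and its reverse complement count as the same ditag.
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue                     # duplicate ditag: likely PCR artifact
        seen.add(key)
        counts[ditag[:tag_len]] += 1                        # left-hand tag
        counts[reverse_complement(ditag[-tag_len:])] += 1   # right-hand tag
    return counts
```

Keying the duplicate check on the smaller of a ditag and its reverse complement is one simple way to catch duplicates seen in either direction.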
6
What does the data look like?
7
From tags to genes

Collect sequence records from GenBank that are represented in UniGene
Assign sequence orientation (by finding poly-A tail or poly-A signal or from annotations)
Extract the 10 bases 3'-adjacent to the 3'-most CATG
Assign UniGene identifier to each sequence with a SAGE tag
Record (for each tag-gene pair):
sequences with this tag
sequences in gene cluster with this tag

Maps available at http://www.ncbi.nlm.nih.gov/SAGE
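
The tag extraction step can be sketched as below. This is an illustration only: the function name is hypothetical, and the sequence is assumed to be already oriented 5' to 3' on the sense strand.

```python
def virtual_tag(sequence, anchor="CATG", tag_len=10):
    """Return the tag 3'-adjacent to the 3'-most anchor site, or None."""
    pos = sequence.rfind(anchor)         # 3'-most CATG
    if pos == -1:
        return None                      # no anchor site in this sequence
    start = pos + len(anchor)
    tag = sequence[start:start + tag_len]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_len else None
```

For example, `virtual_tag("GGCATGAAACATGTTTTTCCCCCGGG")` skips the first CATG and reads the ten bases after the second one.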
8
From tags to genes

Ideal situation
one gene, one tag
True situation
one gene, many tags (alternative splicing, alternative polyadenylation)
one tag, many genes (conserved 3' regions)

9
Sequencing Errors

Estimated sequencing error rate
0.7% per base (range 0.2% - 1%)
Errors affect
ditags in a SAGE experiment (can improve by using phred scores and discarding ambiguous sequences)
tag-gene mappings from GenBank (RNA better than EST)
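
To see why a per-base error rate matters for 10-base tags, a rough calculation helps. The sketch below assumes errors are independent across bases, which is a simplification not stated on the slide; the function name is hypothetical.

```python
def tag_error_probability(per_base_error, tag_length=10):
    """Chance that a tag contains at least one miscalled base,
    assuming independent errors at each position."""
    return 1.0 - (1.0 - per_base_error) ** tag_length


for rate in (0.002, 0.007, 0.01):
    print(f"per-base error {rate:.3f} -> per-tag error {tag_error_probability(rate):.3f}")
```

Under that assumption, a 0.7% per-base error rate corresponds to roughly a 7% chance that any given 10-base tag carries an error.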

10
Reliable tag-gene assignments
11
SAGE and cancer

Ten SAGE libraries, two each from
normal colon
colon tumors
colon cancer cell lines
pancreatic tumors
pancreatic cell lines
Pooled each pair

Ref: Zhang et al., Science 1997; 276:1268-1272
12
Variability in SAGE libraries
13
Distribution of tags

303,706 total tags
48,471 distinct tags
Distribution
85.9% seen up to 5 times (25% of mass)
12.7% between 5 and 50 times (30%)
0.1% between 50 and 500 times (26%)
0.1% more than 500 times (19%)

Ref: Zhang et al., Science 1997; 276:1268-1272
14
How many tags were missed?

They simulated to find a 92% chance of detecting tags at 3 copies/cell
Using binomial approximation
Get 95% chance for 3 copies/cell
Only get 63% chance for 1 copy/cell
Most of what they saw occurred at 1-5 copies per cell
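
The binomial figures above can be reproduced with a short calculation. The sketch assumes about 300,000 transcripts per cell and about 300,000 tags sequenced; both round numbers are assumptions made here for illustration (the previous slide reports 303,706 total tags), and the function name is hypothetical.

```python
def detection_probability(copies_per_cell,
                          transcripts_per_cell=300_000,
                          tags_sequenced=300_000):
    """Chance of seeing a transcript at least once among the sequenced tags."""
    p = copies_per_cell / transcripts_per_cell    # chance a single tag comes from it
    return 1.0 - (1.0 - p) ** tags_sequenced      # 1 - P(never observed)


for copies in (1, 3, 5):
    print(copies, round(detection_probability(copies), 3))
```

With these assumptions the result is about 63% for 1 copy per cell and about 95% for 3 copies per cell, matching the slide.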

15
Differential Expression

Found 289 tags differentially expressed between normal colon and colon cancer (181 decreased, 108 increased)
Method: Monte Carlo simulation.
100,000 simulations per transcript for relative likelihood of seeing observed difference
Used observed distribution of transcripts to simulate 40 experiments.

Ref: Zhang et al., Science 1997; 276:1268-1272
16
Sensitivity

Claim: 95% chance of detecting 6-fold difference
Method: Monte Carlo
200 simulations, assuming abundance of 0.0001 in first sample and 0.0006 in second sample

Ref: Zhang et al., Science 1997; 276:1268-1272
17
Weaknesses in Analysis

Failed to account for intrinsic variability in samples (which changes depending on abundance) in assessing significance
Monte Carlo used the observed distribution, which is definitely not the true distribution.
Sensitivity was only measured at one abundance level.

18
Alternative Analytic Methods

Audic and Claverie, Genome Res 1997; 7:986-995
Chen et al., J Exp Med 1998; 9:1657-1668
Kal et al., Mol Biol of Cell 1999; 10:1859-1872
Michiels et al., Physiol Genomics 1999; 1:83-91
Stollberg et al., Genome Res 2000; 10:1241-1248
Man et al., Bioinformatics 2000; 16:953-959

19
Audic and Claverie

Main goal: confidence limits for differential expression
Use Poisson approximation for number of times x you see the same tag.
Put a uniform prior on the Poisson parameter; get the posterior probability of seeing the tag y times in a new experiment
$p(y \mid x) = \dfrac{(x + y)!}{x!\, y!\, 2^{x + y + 1}}$
Generalizes to unequal sample sizes
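
A small sketch of this probability follows. The equal-size case is the formula on the slide. The `ratio` parameter (size of the second library divided by the size of the first) is my statement of the unequal-size generalization and should be checked against the Audic and Claverie paper; the function names are hypothetical.

```python
from math import comb


def ac_probability(y, x, ratio=1.0):
    """P(see the tag y times in library 2 | seen x times in library 1).

    ratio = N2 / N1; ratio=1.0 gives (x+y)! / (x! y! 2^(x+y+1)).
    """
    return comb(x + y, x) * ratio ** y / (1.0 + ratio) ** (x + y + 1)


def ac_upper_tail(y, x, ratio=1.0):
    """P(count in library 2 is y or more | x), by summing the complement."""
    return 1.0 - sum(ac_probability(k, x, ratio) for k in range(y))


print(ac_probability(0, 0))      # 0.5
print(ac_upper_tail(12, 2))      # small value: 12 vs 2 is unlikely by chance
```

The probabilities sum to 1 over y, so tail sums like `ac_upper_tail` can be used directly as significance levels.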

20
Chen et al.

Assume
equal sample sizes
tag has concentration X, Y in two samples
Look at W = X/(X + Y)
Use a symmetric Beta prior distribution with a peak near 0.5 (since most genes don't change)
Use Bayes' theorem to compute the posterior probability of a threefold difference in expression
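
One way to carry out this calculation is sketched below. It assumes that, given the total count x + y, the count x is binomial with parameter W, so a Beta(a, a) prior gives a Beta(a + x, a + y) posterior. The prior parameter `prior=3.0` and the function names are hypothetical choices for illustration, not values taken from Chen et al.

```python
from math import exp, lgamma, log


def beta_pdf(w, a, b):
    """Density of a Beta(a, b) distribution at 0 < w < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(w) + (b - 1) * log(1 - w))


def prob_fold_increase(x, y, fold=3.0, prior=3.0, steps=20000):
    """Posterior P(X / Y >= fold), i.e. P(W >= fold / (1 + fold)).

    x, y are the observed tag counts in the two equal-size libraries.
    """
    a, b = prior + x, prior + y          # posterior is Beta(a, b)
    lower = fold / (1.0 + fold)          # threefold means W >= 0.75
    width = (1.0 - lower) / steps
    # Midpoint rule: never evaluates the density exactly at w = 1.
    total = sum(beta_pdf(lower + (i + 0.5) * width, a, b) for i in range(steps))
    return total * width


print(prob_fold_increase(30, 2))     # close to 1
print(prob_fold_increase(10, 10))    # close to 0
```

The probability of a threefold decrease follows by swapping the two counts.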

21
Unequal sample sizes

This analysis generalizes easily to the case of unequal size SAGE libraries
Lal et al., Cancer Res 1999; 59:5403-5407
This method is used at the NCBI SAGEmap web site for online differential expression queries
http://www.ncbi.nlm.nih.gov/SAGE

22
Kal et al.

Assume the proportion of times you see a tag has binomial distribution
Replace with a normal approximation to compute confidence limits
Used at http://www.cmbi.kun.nl/usage
Equivalent to chi-square test on 2x2 table
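
The normal approximation and its equivalence to the 2x2 chi-square test can be checked numerically. The sketch below is a standard pooled two-proportion z-test, offered as an illustration rather than the exact code behind the web site above; the function names and the example counts are hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic and two-sided p-value for tag counts x1 of n1 vs x2 of n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)               # proportion under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))             # two-sided normal tail


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square (no continuity correction) for the same 2x2 table."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    n = n1 + n2
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


z, p = two_proportion_z(30, 50_000, 10, 50_000)
print(z * z, chi_square_2x2(30, 50_000, 10, 50_000), p)
```

The squared z statistic and the chi-square statistic agree, which is the equivalence noted on the slide.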

23
Michiels et al.

First perform overall chi-square test to decide if the two SAGE libraries being compared are different.
Get significance by Monte Carlo simulation
Perform gene-by-gene chi-square tests and use them to rank genes in order of most likely to be different

24
Stollberg et al.

Assume binomial distributions
Model the binomial parameters as a sum of two exponentials
fit to the Zhang step function data
Simulate from this model, adding
sequencing errors
nonuniqueness of tags
nonrandomness of DNA sequences

25
Stollberg et al.

Key finding
Naively using observed data to fit model parameters cannot recover the observed data by simulation
Maximum likelihood estimates of the parameters that do recover the observed data look very different

26
Stollberg et al.
27
Man et al.

Compares specificity and sensitivity of different tests for differential expression
Audic and Claverie
Kal
Fisher's exact test
Monte Carlo simulation of experiments
Findings
Similar power at high abundance
Kal has highest power at low abundance

28
Questions