Analysis of SAGE Data: An Introduction

1
Analysis of SAGE Data: An Introduction
  • Kevin R. Coombes
  • Section of Bioinformatics

2
Outline
  • Description of SAGE method
  • Preliminary bioinformatics issues
  • Description of analysis methods introduced in
    an early paper
  • Review of literature: statistics and SAGE

3
What is SAGE?
  • Serial Analysis of Gene Expression
  • Method to quantify gene expression levels in
    samples of cells
  • Open system
  • Can potentially reveal expression levels of all
    genes: unbiased and comprehensive
  • Microarrays are closed, since they only tell you
    about the genes spotted on the array

Ref: Velculescu et al., Science 1995; 270:484-487
4
How does SAGE work?
3.(c) Discard loose fragments.
9. Sequence and record the tags and frequencies.
5
From ditags to counts
  • Locate the punctuation CATG
  • Extract ditags of length 20-26 between the
    punctuation
  • Discard duplicate ditags (including in reverse
    direction) -- probably PCR artifacts
  • Take extreme 10 bases as the two tags, reversing
    right-hand tag
  • Discard linker sequences
  • Count occurrences of each tag

SAGE software available at http://www.sagenet.org
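
The steps above can be sketched in Python. This is a minimal illustration, not the SAGE software itself: the function names, the reading of "reversing" as taking the reverse complement, and the omission of linker filtering are all assumptions made for the example.

```python
from collections import Counter

ANCHOR = "CATG"   # the "punctuation" between ditags
TAG_LEN = 10      # bases taken from each end of a ditag

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_ditags(concatemer, min_len=20, max_len=26):
    """Split a concatemer at CATG and keep pieces of ditag length."""
    pieces = concatemer.upper().split(ANCHOR)
    # Only pieces flanked by CATG on both sides count, so drop the two ends.
    inner = pieces[1:-1]
    return [p for p in inner if min_len <= len(p) <= max_len]


def count_tags(concatemers):
    """Count tags after discarding duplicate ditags in either orientation."""
    seen = set()
    counts = Counter()
    for concatemer in concatemers:
        for ditag in extract_ditags(concatemer):
            # Orientation-independent key for duplicate detection.
            key = min(ditag, reverse_complement(ditag))
            if key in seen:
                continue  # duplicate ditag, probably a PCR artifact
            seen.add(key)
            counts[ditag[:TAG_LEN]] += 1                        # left-hand tag
            counts[reverse_complement(ditag[-TAG_LEN:])] += 1   # right-hand tag
    return counts
```

Linker sequences would be removed from the resulting counts in a further step, given a list of known linker tags.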
6
What does the data look like?
7
From tags to genes
  • Collect sequence records from GenBank that are
    represented in UniGene
  • Assign sequence orientation (by finding poly-A
    tail or poly-A signal or from annotations)
  • Extract 10 bases 3'-adjacent to 3'-most CATG
  • Assign UniGene identifier to each sequence with a
    SAGE tag
  • Record (for each tag-gene pair):
  • sequences with this tag
  • sequences in gene cluster with this tag

Maps available at http://www.ncbi.nlm.nih.gov/SAGE
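
The tag extraction step can be illustrated with a short Python sketch. It assumes each sequence is already oriented 5' to 3' with any poly-A tail trimmed; the function names and the (cluster_id, sequence) record layout are hypothetical and do not reflect how the NCBI maps are actually built.

```python
from collections import defaultdict


def virtual_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag_len bases 3'-adjacent to the 3'-most anchor site.

    Returns None when the sequence has no anchor site or when fewer than
    tag_len bases follow the last one.
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)          # 3'-most CATG
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def build_tag_map(records):
    """Map each tag to the set of cluster IDs whose sequences carry it.

    `records` is an iterable of (cluster_id, sequence) pairs.
    """
    tag_map = defaultdict(set)
    for cluster_id, sequence in records:
        tag = virtual_tag(sequence)
        if tag is not None:
            tag_map[tag].add(cluster_id)
    return tag_map
```

A tag whose set holds more than one cluster ID is an ambiguous tag-gene assignment.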
8
From tags to genes
  • Ideal situation
  • one gene = one tag
  • True situation
  • one gene = many tags (alternative splicing,
    alternative polyadenylation)
  • one tag = many genes (conserved 3' regions)

9
Sequencing Errors
  • Estimated sequencing error rate:
  • 0.7% per base (range 0.2% - 1%)
  • Affect:
  • ditags in a SAGE experiment
  • can improve by using phred scores and discarding
    ambiguous sequences
  • tag-gene mappings from GenBank
  • RNA better than EST
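
As a worked example of why this matters: if each base is miscalled independently with probability e, a 10-base tag contains at least one error with probability 1 - (1 - e)^10. The sketch below evaluates this, assuming the quoted rate means 0.7% per base; the independence of errors and the function name are also assumptions.

```python
def tag_error_probability(per_base_error, tag_len=10):
    """Probability that a tag of tag_len bases contains at least one error,
    assuming independent errors at each base."""
    return 1.0 - (1.0 - per_base_error) ** tag_len


if __name__ == "__main__":
    # Evaluate at the low end, the estimate, and the high end of the range.
    for rate in (0.002, 0.007, 0.01):
        print(f"{rate:.3f} per base -> {tag_error_probability(rate):.3f} per tag")
```

At 0.7% per base this comes to roughly 7% of tags carrying at least one error.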

10
Reliable tag-gene assignments
11
SAGE and cancer
  • Ten SAGE libraries, two each from
  • normal colon
  • colon tumors
  • colon cancer cell lines
  • pancreatic tumors
  • pancreatic cell lines
  • Pooled each pair

Ref: Zhang et al., Science 1997; 276:1268-1272
12
Variability in SAGE libraries
13
Distribution of tags
  • 303,706 total tags
  • 48,471 distinct tags
  • Distribution:
  • 85.9% seen up to 5 times (25% of mass)
  • 12.7% between 5 and 50 times (30%)
  • 0.1% between 50 and 500 times (26%)
  • 0.1% more than 500 times (19%)

Ref: Zhang et al., Science 1997; 276:1268-1272
14
How many tags were missed?
  • They simulated to find 92% chance of detecting
    tags at 3 copies/cell
  • Using binomial approximation:
  • Get 95% chance for 3 copies/cell
  • Only get 63% chance for 1 copy/cell
  • Most of what they saw occurred at 1-5 copies per
    cell
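
The binomial figures above can be reproduced with a few lines of Python. The sketch assumes about 300,000 sequenced tags and about 300,000 transcripts per cell; the transcripts-per-cell value is not stated on the slide and is an assumption, as is the function name.

```python
def detection_probability(copies_per_cell, total_tags=300_000,
                          transcripts_per_cell=300_000):
    """Chance of seeing a transcript at least once among total_tags tags.

    Each sequenced tag is treated as an independent draw that hits the
    transcript with probability copies_per_cell / transcripts_per_cell.
    """
    p = copies_per_cell / transcripts_per_cell
    return 1.0 - (1.0 - p) ** total_tags


if __name__ == "__main__":
    for copies in (1, 3, 5):
        print(copies, "copies/cell:", round(detection_probability(copies), 3))
```

Under these assumptions the result is about 0.95 for 3 copies/cell and about 0.63 for 1 copy/cell.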

15
Differential Expression
  • Found 289 tags differentially expressed between
    normal colon and colon cancer (181 decreased, 108
    increased)
  • Method: Monte Carlo simulation.
  • 100,000 simulations per transcript for relative
    likelihood of seeing observed difference
  • Used observed distribution of transcripts to
    simulate 40 experiments.

Ref: Zhang et al., Science 1997; 276:1268-1272
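
To make the idea of a Monte Carlo significance estimate concrete, here is a minimal Python sketch. It is not the procedure of Zhang et al.: it swaps in a simple conditional simulation in which the observed copies of one tag are reassigned to the two libraries at random under the null hypothesis of equal abundance. The function name and parameters are hypothetical.

```python
import random


def monte_carlo_p_value(x1, n1, x2, n2, n_sims=10_000, seed=0):
    """Estimate how often chance alone gives a difference at least as large
    as the observed one for a single tag.

    x1, x2: counts of the tag in libraries 1 and 2
    n1, n2: total tags sequenced in each library
    """
    rng = random.Random(seed)
    total = x1 + x2
    share = n1 / (n1 + n2)   # null chance that a given copy falls in library 1
    observed = abs(x1 / n1 - x2 / n2)
    hits = 0
    for _ in range(n_sims):
        s1 = sum(rng.random() < share for _ in range(total))
        s2 = total - s1
        if abs(s1 / n1 - s2 / n2) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sims + 1)
```

For example, `monte_carlo_p_value(30, 50_000, 5, 50_000)` estimates the significance of a sixfold difference between two libraries of 50,000 tags each (library sizes chosen only for illustration).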
16
Sensitivity
  • Claim: 95% chance of detecting 6-fold difference
  • Method: Monte Carlo
  • 200 simulations, assuming abundance of 0.0001 in
    first sample and 0.0006 in second sample

Ref: Zhang et al., Science 1997; 276:1268-1272
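
A sensitivity estimate of this kind can be sketched as follows. The abundances and the 200 simulations come from the slide; the library size of 50,000 tags, the Poisson sampling, the exact binomial test used as the detection criterion, and the 0.01 threshold are all assumptions, so the number this produces is not expected to match the published 95%.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw a Poisson count by Knuth's multiplication method (small means)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def two_sided_binomial_p(x1, x2):
    """Exact two-sided p-value for x1 of x1 + x2 copies under a 50:50 split
    (that is, equal library sizes and equal abundance)."""
    total = x1 + x2
    if total == 0:
        return 1.0
    observed = abs(x1 - x2)
    tail = sum(math.comb(total, k) for k in range(total + 1)
               if abs(2 * k - total) >= observed)
    return tail / 2 ** total


def estimated_power(abundance1=0.0001, abundance2=0.0006,
                    library_size=50_000, n_sims=200, alpha=0.01, seed=0):
    """Fraction of simulated experiments in which the difference is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        x1 = poisson_draw(rng, abundance1 * library_size)
        x2 = poisson_draw(rng, abundance2 * library_size)
        if two_sided_binomial_p(x1, x2) < alpha:
            detected += 1
    return detected / n_sims
```

Calling `estimated_power` with other abundance pairs shows how sensitivity changes with abundance, which a single abundance level cannot reveal.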
17
Weaknesses in Analysis
  • Failed to account for intrinsic variability in
    samples (which changes depending on abundance) in
    assessing significance
  • Monte Carlo used observed distribution, which is
    definitely not true distribution.
  • Sensitivity only measured at one abundance level.

18
Alternative Analytic Methods
  • Audic and Claverie, Genome Res 1997; 7:986-995
  • Chen et al., J Exp Med 1998; 9:1657-1668
  • Kal et al., Mol Biol of Cell 1999; 10:1859-1872
  • Michiels et al., Physiol Genomics 1999; 1:83-91
  • Stollberg et al., Genome Res 2000; 10:1241-1248
  • Man et al., Bioinformatics 2000; 16:953-959

19
Audic and Claverie
  • Main goal: confidence limits for differential
    expression
  • Use Poisson approximation for number of times x
    you see the same tag.
  • Put a uniform prior on the Poisson parameter; get
    posterior probability of seeing the tag y times in
    a new experiment
  • $p(y \mid x) = \frac{(x+y)!}{x!\,y!\,2^{x+y+1}}$
  • Generalizes to unequal sample sizes
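
For equal library sizes the formula reads p(y | x) = (x + y)! / (x! y! 2^(x + y + 1)), which is a binomial coefficient divided by a power of two. A minimal Python sketch follows; the function names are hypothetical, and the unequal-size generalization is not shown.

```python
import math


def audic_claverie(y, x):
    """p(y | x) for equal library sizes: (x + y)! / (x! y! 2^(x + y + 1))."""
    return math.comb(x + y, x) / 2 ** (x + y + 1)


def prob_at_least(y, x):
    """Probability of seeing y or more copies in the second experiment,
    given x copies in the first."""
    return 1.0 - sum(audic_claverie(k, x) for k in range(y))


if __name__ == "__main__":
    # Chance of 30 or more copies in library 2 given 5 copies in library 1.
    print(prob_at_least(30, 5))
```

Summing `audic_claverie(k, x)` over all k gives 1, so it is a proper probability distribution over y for any fixed x.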

20
Chen et al.
  • Assume
  • equal sample sizes
  • tag has concentration X, Y in two samples
  • Look at W = X/(X+Y)
  • Use a symmetric Beta prior distribution with a
    peak near 0.5 (since most genes don't change)
  • Use Bayes' theorem to compute posterior
    probability of threefold difference in expression
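
A rough Python sketch of this calculation: with a symmetric Beta(a, a) prior on W = X/(X+Y) and observed counts x and y, the posterior is Beta(a + x, a + y), and a threefold difference corresponds to W >= 3/4 or W <= 1/4. The prior parameter a = 3 is a placeholder chosen only for illustration, not the value used by Chen et al., and the numerical integration stands in for a library Beta CDF.

```python
import math


def beta_pdf(w, a, b):
    """Density of the Beta(a, b) distribution at w, for 0 < w < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(w) + (b - 1) * math.log(1 - w))


def beta_cdf(w, a, b, steps=20_000):
    """Midpoint-rule integral of the Beta(a, b) density from 0 to w."""
    h = w / steps
    return h * sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps))


def prob_threefold(x, y, prior=3.0):
    """Posterior probabilities of a threefold change in either direction.

    Returns (P(X >= 3Y), P(Y >= 3X)) given counts x and y from two
    equal-sized libraries and a symmetric Beta(prior, prior) prior on W.
    """
    a, b = prior + x, prior + y
    up = 1.0 - beta_cdf(0.75, a, b)
    down = beta_cdf(0.25, a, b)
    return up, down
```

A larger prior parameter concentrates the prior near 0.5 and therefore demands more evidence before reporting a change.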

21
Unequal sample sizes
  • This analysis generalizes easily to the case of
    unequal size SAGE libraries
  • Lal et al., Cancer Res 1999; 59:5403-5407
  • This method is used at the NCBI SAGEmap web site
    for online differential expression queries
  • http://www.ncbi.nlm.nih.gov/SAGE

22
Kal et al.
  • Assume the proportion of times you see a tag has
    binomial distribution
  • Replace with a normal approximation to compute
    confidence limits
  • Used at http://www.cmbi.kun.nl/usage
  • Equivalent to chi-square test on 2x2 table
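
The equivalence with the chi-square test can be checked numerically. The sketch below assumes the z statistic uses the pooled proportion in its standard error and that the chi-square statistic has no continuity correction; function names are hypothetical.

```python
import math


def z_test(x1, n1, x2, n2):
    """Two-proportion z-test using the normal approximation to the binomial.

    Returns (z, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic for the table (this tag, all other tags)
    by (library 1, library 2), without continuity correction."""
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    n = n1 + n2
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    z, p = z_test(30, 50_000, 5, 50_000)
    print(z ** 2, chi_square_2x2(30, 50_000, 5, 50_000), p)
```

The squared z statistic and the chi-square statistic agree up to rounding, so the two tests give the same two-sided p-value.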

23
Michiels et al.
  • First perform overall chi-square test to decide
    if the two SAGE libraries being compared are
    different.
  • Get significance by Monte Carlo simulation
  • Perform gene-by-gene chi-square tests and use
    them to rank genes in order of most likely to be
    different
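
A minimal sketch of an overall test with Monte Carlo significance is given below. It substitutes a permutation scheme (pooling all tags and re-splitting them at random into two libraries of the original sizes), which may differ from the simulation Michiels et al. actually used; the function names are hypothetical.

```python
import random
from collections import Counter


def overall_chi_square(counts1, counts2):
    """Chi-square statistic over the full (tag x library) table.

    counts1, counts2: dicts mapping tag -> count; both must be non-empty.
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    n = n1 + n2
    stat = 0.0
    for tag in set(counts1) | set(counts2):
        a, b = counts1.get(tag, 0), counts2.get(tag, 0)
        e1, e2 = (a + b) * n1 / n, (a + b) * n2 / n   # expected counts
        stat += (a - e1) ** 2 / e1 + (b - e2) ** 2 / e2
    return stat


def monte_carlo_significance(counts1, counts2, n_sims=1000, seed=0):
    """Estimate the p-value of the overall statistic by random re-splitting."""
    rng = random.Random(seed)
    observed = overall_chi_square(counts1, counts2)
    pooled = list(Counter(counts1).elements()) + list(Counter(counts2).elements())
    n1 = sum(counts1.values())
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        sim1, sim2 = Counter(pooled[:n1]), Counter(pooled[n1:])
        # Small tolerance guards against floating-point summation order.
        if overall_chi_square(sim1, sim2) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_sims + 1)
```

The per-tag terms inside `overall_chi_square` are the quantities one would sort to rank individual tags.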

24
Stollberg et al.
  • Assume binomial distributions
  • Model the binomial parameters as a sum of two
    exponentials
  • fit to the Zhang step function data
  • Simulate from this model, adding
  • sequencing errors
  • nonuniqueness of tags
  • nonrandomness of DNA sequences

25
Stollberg et al.
  • Key finding
  • Naively using observed data to fit model
    parameters cannot recover the observed data by
    simulation
  • Maximum likelihood estimates of the parameters
    that do recover the observed data look very
    different

26
Stollberg et al.
27
Man et al.
  • Compares specificity and sensitivity of different
    tests for differential expression
  • Audic and Claverie
  • Kal
  • Fisher's exact test
  • Monte Carlo simulation of experiments
  • Findings
  • Similar power at high abundance
  • Kal has highest power at low abundance

28
Questions
  • Sample size computations
  • How many tags should we sequence if we want to
    see tags of a given frequency?
  • How many tags should we sequence if we want to
    see a given percentage of tags?
  • How many tags are expressed in a sample?
  • Best method for identifying differential
    expression?
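
The first sample size question has a simple closed form under a binomial model: to see a tag of relative frequency f at least once with probability P, sequence n >= ln(1 - P) / ln(1 - f) tags. A small Python sketch follows, with a hypothetical function name and independent sampling of tags assumed.

```python
import math


def tags_needed(frequency, detection_prob=0.95):
    """Smallest number of tags giving at least detection_prob chance of
    seeing a tag of the given relative frequency at least once."""
    n = math.log(1.0 - detection_prob) / math.log1p(-frequency)
    return math.ceil(n)


if __name__ == "__main__":
    for freq in (1e-4, 1e-5, 1e-6):
        print(freq, tags_needed(freq))
```

The required depth grows roughly as 3/f at the 95% level, so each tenfold drop in abundance costs about ten times as much sequencing.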

29
Additional SAGE references
  • Review
  • Madden et al., Drug Disc Today 2000; 5:415-425
  • Online Tools
  • Lash et al., Genome Res 2000; 10:1051-1060
  • van Kampen et al., Bioinformatics 2000;
    16:899-905
  • Comparison of SAGE and Affymetrix
  • Ishii et al., Genomics 2000; 68:136-143
  • Combine SAGE and custom microarrays
  • Nacht et al., Cancer Res 1999; 59:5464-5470
  • Mapping SAGE data onto genome
  • Caron et al., Science 2001; 291:1289-1292
  • Data mining the public SAGE libraries
  • Argani et al., Cancer Res 2001; 61:4320-4324