Title: Alternative Splicing from ESTs
1Alternative Splicing from ESTs
- Eduardo Eyras
- Bioinformatics UPF February 2004
2- Intro
- ESTs
- Prediction of
- Alternative Splicing from ESTs
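3[Figure: DNA strands with 5' and 3' ends; mRNA with 5' CAP and AAAAAAA (poly-A) tail]
4[Figure: Transcription of a gene with exons and introns into pre-mRNA; mRNA with 5' CAP and AAAAAAA (poly-A) tail]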
5Alt splicing as a mechanism of gene regulation
- Functional domains can be added/subtracted → protein diversity
- Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
- It can modify the activity of transcription factors, affecting the expression of genes
- It is observed in nearly all metazoans
- Estimated to occur in 30-40% of human genes
6Forms of alternative splicing
Exon skipping / inclusion
Alternative 3' splice site
Alternative 5' splice site
Mutually exclusive exons
Intron retention
Constitutive exon
Alternatively spliced exons
7- How to study alternative splicing?
8ESTs (Expressed Sequence Tags)
- Single-pass sequencing of a small (end) piece of cDNA
- Typically 200-500 nucleotides long
- It may contain coding and/or non-coding regions
9ESTs
[Figure: steps of cDNA synthesis, with 5' and 3' ends marked on each strand]
- Cells from a specific organ, tissue or developmental stage
- mRNA extraction
- Add oligo-dT primer (TTTTTT)
- Reverse transcriptase: RNA/DNA hybrid
- Ribonuclease H
- DNA polymerase, Ribonuclease H
- Double-stranded cDNA (AAAAAA/TTTTTT)
10ESTs
[Figure: double-stranded cDNA (AAAAAA/TTTTTT) with 5' and 3' ends marked]
- Clone cDNA into a vector
- Multiple cDNA clones
- Single-pass sequence reads: 5' EST and 3' EST
11Alternative Splicing from ESTs
[Figure: Genomic → Primary transcript → Splicing → Splice variants → cDNA clones → EST sequences (5' and 3')]
12Alternative Splicing from ESTs
ESTs can also provide information about potential
alternative splicing when aligned to the genome
(and when aligned to mRNA data)
13EST sequencing
- Is fast and cheap
- Gives direct information about the gene sequence
- Partial information
Resulting ESTs:
- Known gene (DB searches)
- Similar to known gene
- Contaminant
- Novel gene
14ESTs provide expression data
eVOC Ontologies: http://www.sanbi.ac.za/evoc/
15Linking the expression vocabulary to gene
annotations
ESTs
Genes
16Normalized vs. non-normalized libraries
17The down side of the ESTs
- Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
- Random sampling: the more ESTs we sequence, the fewer new useful sequences we get
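The diminishing return of random sampling can be illustrated with a small calculation. This is only a sketch under a simplifying assumption that is not in the slides: each EST is drawn independently, with probability proportional to the abundance of its gene's transcripts. The abundance values and the function name `expected_distinct_genes` are hypothetical.

```python
def expected_distinct_genes(abundances, n_reads):
    """Expected number of genes seen at least once after n_reads random ESTs.

    Assumes (hypothetically) that every EST is sampled independently with
    probability proportional to the abundance of its gene.
    """
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** n_reads for a in abundances)


# Made-up library: 10 highly expressed genes and 90 rarely expressed ones.
abundances = [100] * 10 + [1] * 90

for n in (1000, 2000, 3000):
    print(n, round(expected_distinct_genes(abundances, n), 1))
```

Each additional batch of 1000 ESTs adds fewer newly seen genes than the previous one, and the rare genes are the last to appear.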
18Gene Hunting
- Sequencing of the Human Genome (HGP)
EST Sequencing
19Origin of the ESTs
- Science. 1991 Jun 21;252(5013):1651-6
- Complementary DNA sequencing: expressed sequence tags and human genome project.
- Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.
- Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.
Automated partial DNA sequencing was conducted
on more than 600 randomly selected human brain
complementary DNA (cDNA) clones to generate
expressed sequence tags (ESTs). ESTs have
applications in the discovery of new human genes,
mapping of the human genome, and identification
of coding regions in genomic sequences. Of the
sequences generated, 337 represent new genes,
including 48 with significant similarity to genes
from other organisms, such as a yeast RNA
polymerase II subunit; Drosophila kinesin, Notch,
and Enhancer of split; and a murine tyrosine
kinase receptor. Forty-six ESTs were mapped to
chromosomes after amplification by the polymerase
chain reaction. This fast approach to cDNA
characterization will facilitate the tagging of
most human genes in a few years at a fraction of
the cost of complete genomic sequencing, provide
new genetic markers, and serve as a resource in
diverse biological research fields.
20EST-sequencing explosion
- → non-exclusivity (1992)
- Merck and WashU (1994)
- → public ESTs
- → GenBank
- → dbEST
21dbEST release 20 February 2004
- Number of public entries: 20,039,613
- Summary by organism:
- Homo sapiens (human): 5,472,005
- Mus musculus domesticus (mouse): 4,056,481
- Rattus sp. (rat): 583,841
- Triticum aestivum (wheat): 549,926
- Ciona intestinalis: 492,511
- Gallus gallus (chicken): 460,385
- Danio rerio (zebrafish): 450,652
- Zea mays (maize): 391,417
- Xenopus laevis (African clawed frog): 359,901
22EST lengths
450 bp
Human EST length distribution (dbEST Sep. 2003 )
23Recover the mRNA from the ESTs
24What is an EST cluster?
A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity.
Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene.
(Burke, Davison, Hide, Genome Research 1999).
25EST pre-processing
- Vector
- Repeats
- Mitochondrial
- Xenocontaminants
26EST Clustering
- UniGene (NCBI) www.ncbi.nlm.nih.gov/UniGene
- TIGR Human Gene Index www.tigr.org
- (The Institute for Genomic Research)
- StackDB www.sanbi.ac.za
- (South African Bioinformatics Institute)
27UniGene
- Species UniGene Entries
- Homo sapiens 118,517
- Mus musculus 82,482
- Rattus norvegicus 43,942
- Sus scrofa 20,426
- Gallus gallus 11,970
- Xenopus laevis 21,734
- Xenopus tropicalis 17,102
28 29ESTs aligned to the genome
- Some advantages:
- It defines the location of exons and introns
- We can verify the splice sites of introns (e.g. GT-AG)
- → hence also check the correct strand of spliced ESTs
- It helps prevent chimeras
- It can avoid putting together ESTs from paralogous genes
- We can prevent including pseudogenes in our analysis
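As an illustration of how splice sites verify an intron and reveal the strand, here is a minimal sketch. It assumes the genome is available as a plain string and that intron coordinates are 0-based, half-open intervals on the forward strand. The function names are hypothetical. The three dinucleotide pairs (GT-AG, AT-AC, GC-AG) are the ones listed on the alignment slide of this deck.

```python
# Donor/acceptor dinucleotide pairs accepted as splice sites.
CANONICAL = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_strand(genome, start, end):
    """Return '+', '-' or None for the intron genome[start:end].

    '+' means the splice sites are read on the forward strand,
    '-' means they are only found on the reverse complement,
    None means no accepted splice-site pair was found.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return None
    if (intron[:2], intron[-2:]) in CANONICAL:
        return "+"
    rc = revcomp(intron)
    if (rc[:2], rc[-2:]) in CANONICAL:
        return "-"
    return None


print(intron_strand("AAAGTCCCCAGTTT", 3, 11))   # intron GTCCCCAG
print(intron_strand("AAACTGGGGACTTT", 3, 11))   # intron CTGGGGAC
```

An EST whose introns all report the same strand can be oriented accordingly; an EST with introns that report no strand would be a candidate for rejection.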
30Aligning ESTs to the Genome
- Many ESTs → Fast programs, Fast computers
- Nearly exact matches: Coverage > 97%
- Percent_id > 97%
- Splice sites: GT-AG, AT-AC, GC-AG
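The coverage and identity thresholds can be sketched as a simple filter. The definitions used here are assumptions, since the slides do not define them: coverage is taken as the percentage of EST bases included in the alignment, and percent identity as the percentage of aligned bases that match the genome. The `Alignment` record and its fields are hypothetical and do not correspond to the output of any particular aligner.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    est_id: str
    est_length: int     # length of the EST
    aligned_bases: int  # EST bases included in the alignment
    matches: int        # aligned bases identical to the genome


def coverage(a):
    """Percentage of the EST covered by the alignment (assumed definition)."""
    return 100.0 * a.aligned_bases / a.est_length


def percent_id(a):
    """Percentage of aligned bases matching the genome (assumed definition)."""
    if a.aligned_bases == 0:
        return 0.0
    return 100.0 * a.matches / a.aligned_bases


def passes(a, min_coverage=97.0, min_percent_id=97.0):
    """Keep only nearly exact matches, using the 97% thresholds."""
    return coverage(a) > min_coverage and percent_id(a) > min_percent_id


# Made-up alignments for illustration.
alignments = [
    Alignment("est1", 450, 448, 446),  # high coverage, high identity
    Alignment("est2", 450, 300, 299),  # low coverage
    Alignment("est3", 400, 395, 370),  # low identity
]
kept = [a.est_id for a in alignments if passes(a)]
print(kept)
```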
31Aligning ESTs to the Genome
Extra pre-processing of ESTs
- Clip poly A tails/Clip 20bp from either end
- Best in genome
- Remove potential processed pseudogenes
- Give preference to ESTs that are spliced
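The clipping step can be sketched as follows. The slides do not say how a poly-A tail is recognised or in which order the two clips are applied, so this sketch makes its own assumptions: a tail is a terminal run of at least `min_tail` A's (or a leading run of T's for reads from the opposite strand), it is removed first, and then `end_clip` bases are removed from each end. The function name and both parameters are hypothetical.

```python
import re


def clip_est(seq, min_tail=10, end_clip=20):
    """Remove a terminal poly-A (or leading poly-T) run, then clip both ends.

    min_tail and end_clip are illustrative defaults; 20 bp is the value
    given on the slide, the tail threshold is an assumption.
    """
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)  # poly-A tail at the 3' end
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)  # poly-T at the 5' end
    if len(seq) <= 2 * end_clip:
        return ""  # nothing left after clipping
    return seq[end_clip:len(seq) - end_clip]


read = "ACGT" * 20 + "A" * 15
print(len(read), len(clip_est(read)))
```

Clipping the read ends is a common way to discard the lowest-quality part of a single-pass read before alignment.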
32Human ESTGenes
Genomic length distribution of aligned human ESTs
400bp
Tail up to 800kb
33The Problem
ESTs
Genome
What are the transcripts represented in this set
of mapped ESTs?
34Predict Transcripts from ESTs
ESTs
Transcript predictions
Merge ESTs according to splicing structure
compatibility
35Representation
Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not.
The redundancy relation is a graph.
[Figure: the two relations between ESTs, Extension (ESTs x and y) and Inclusion (ESTs x and z)]
Sort by the smallest coordinate ascending and by the largest coordinate descending
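The sorting rule can be written as a two-part sort key. In this sketch each EST is a hypothetical `(name, start, end)` tuple of genomic coordinates; the names and values are made up.

```python
# Hypothetical ESTs as (name, genomic start, genomic end).
ests = [
    ("est_c", 150, 400),
    ("est_a", 100, 900),
    ("est_b", 100, 500),
    ("est_d", 300, 350),
]


def sort_for_merging(ests):
    """Sort by smallest coordinate ascending, then largest coordinate descending."""
    return sorted(ests, key=lambda e: (e[1], -e[2]))


for name, start, end in sort_for_merging(ests):
    print(name, start, end)
```

With this order, among ESTs that start at the same position the longest comes first, so an EST that could include others is processed before the ESTs it could include.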
36Criteria of merging
Allow edge-exon mismatches
Allow internal mismatches
Allow intron mismatches
37Transitivity
[Figure: transitivity of the Extension and Inclusion relations among ESTs x, y, z and w]
This reduces the number of comparisons needed
38ClusterMerge graph
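Each node defines an inclusion sub-tree
[Figure: ESTs x, y, z and the corresponding inclusion sub-tree]
Extensions form acyclic graphs
[Figure: ESTs x, y, z, w and the corresponding extension graph]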
39Recovering the Solution
Mergeable sets of ESTs can be recovered as special paths in the graph
[Figure: example graph with nodes 1 to 9]
40Recovering the Solution
Root: does not extend any node
Leaf: not extended, and root of an inclusion tree
[Figure: example graph with nodes 1 to 9, with the Root and the Leaves marked]
41Recovering the Solution
Any set of ESTs in a path from a root to a leaf is mergeable
[Figure: example graph with nodes 1 to 9, with the Root and the Leaves marked]
42Recovering the Solution
Add the inclusion tree attached to each node in the path
[Figure: example graph with nodes 1 to 9, with the Root and the Leaves marked]
43Recovering the Solution
Lists produced: (1,2,3,4,5,6,7,8), (1,2,3,4,5,6,7,9)
[Figure: example graph with nodes 1 to 9]
This representation minimizes the necessary comparisons between ESTs
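The recovery of the lists can be sketched as a walk over root-to-leaf paths that collects the inclusion tree of every node on the path. The edges below are invented for illustration, because the actual edges of the slide's example graph are not recoverable from the text; they were chosen only so that the output matches the two lists above. The function names and the two dictionaries are hypothetical.

```python
def inclusion_subtree(node, includes):
    """Return the node plus every node in its inclusion sub-tree."""
    members = [node]
    for child in includes.get(node, []):
        members.extend(inclusion_subtree(child, includes))
    return members


def mergeable_lists(extended_by, includes, roots):
    """One list per root-to-leaf path of the extension graph.

    extended_by maps a node to the nodes that extend it,
    includes maps a node to the nodes it includes.
    """
    results = []

    def walk(node, path):
        path = path + [node]
        successors = extended_by.get(node, [])
        if not successors:  # leaf: not extended by any node
            ests = []
            for n in path:
                ests.extend(inclusion_subtree(n, includes))
            results.append(sorted(ests))
        for nxt in successors:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return results


# Invented example graph (not the one drawn on the slide).
extended_by = {1: [4], 4: [6], 6: [8, 9]}
includes = {1: [2, 3], 4: [5], 6: [7]}

print(mergeable_lists(extended_by, includes, roots=[1]))
```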
44How to build the graph
Mutual Recursion
Inclusion → go up in the tree
Recursion search along extension branch
Search graph (leaves)
Search sub-graph
45How to build the graph
Example
[Figure: ESTs 1 to 6]
46How to build the graph
Example
[Figure: ESTs 1 to 6 and the graph with nodes 1 to 6]
47How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 6; step label: Leaves]
48How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 6; step label: Inclusion]
49How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 6; step label: Inclusion]
50How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 6; step label: Extension]
51How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 6; step label: Inclusion]
52How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 7; step label: Place]
53How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 7; step label: Inclusion]
54How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 7; step label: tagged as visited - skip]
55How to build the graph
Example
[Figure: ESTs 1 to 7 and the graph with nodes 1 to 7]
Possible sub-trees beyond 1 or 3 remain unseen!
The representation minimizes the necessary comparisons
56Deriving the transcripts from the lists
Internal splice sites: external coordinates of the 5' and 3' exons are not allowed to contribute
57Deriving the transcripts from the lists
- Splice sites are set to the most common coordinate
- 5' and 3' coordinates are set to the exon coordinate that extends the potential UTR the most
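The two rules can be sketched for a simplified case. The sketch assumes, for illustration only, that every EST in the list has the same number of exons, given as `(start, end)` pairs in genomic order; real lists also contain partial ESTs, which this does not handle. The function name and the coordinates are hypothetical.

```python
from collections import Counter


def consensus_transcript(ests):
    """Derive one exon structure from ESTs that share the same introns.

    Internal boundaries (splice sites) take the most common coordinate;
    the outer ends of the first and last exon take the most extreme one.
    """
    n_exons = len(ests[0])
    exons = []
    for i in range(n_exons):
        starts = [est[i][0] for est in ests]
        ends = [est[i][1] for est in ests]
        if i == 0:
            start = min(starts)  # extend the transcript start the most
        else:
            start = Counter(starts).most_common(1)[0][0]
        if i == n_exons - 1:
            end = max(ends)      # extend the transcript end the most
        else:
            end = Counter(ends).most_common(1)[0][0]
        exons.append((start, end))
    return exons


# Made-up ESTs with three exons each.
ests = [
    [(100, 200), (300, 400), (500, 650)],
    [(120, 200), (300, 400), (500, 700)],
    [(90, 201), (300, 400), (500, 600)],
]
print(consensus_transcript(ests))
```

When two internal coordinates are equally common, `Counter.most_common` returns the one seen first, so a real implementation would need an explicit tie-breaking rule.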
58Single exon transcripts
Reject resulting single exon transcripts when
using ESTs
59Annotation with ESTs
ESTs aligned to the genome can provide
information about UTRs and alternative splicing
60Annotation with ESTs
EST-Transcripts at www.ensembl.org
61Annotation with ESTs
62Results for Human and Mouse
- Human EST-genes (assembly ncbi33)
- 38,581 Genes
- 122,247 Transcripts (42% with full CDS)
- Mouse EST-genes (assembly ncbi30)
- 32,848 Genes
- 103,664 Transcripts (36% with full CDS)
63- How many transcripts are conserved?
- Is Alternative Splicing conserved?
64EST-transcript pairs
- 42,625 transcript pairs (in 18,242 gene pairs)
- Gene pairs: 78% with one transcript pair conserved, 22% with more than one transcript pair conserved
- For 22% of the gene pairs some form of alt. splicing is conserved
65Conservation of Alt. Splicing
- Take gene-pairs with more than one transcript-pair

$$\text{conservation} = \frac{\sum \left(\text{number of paired transcripts} - 1\right)}{\sum \left(\text{number of transcripts} - 1\right)}$$

- Each sum runs over genes in a gene pair with more than one variant (subtract the main transcript form)
- 19% of alt. variants in human are conserved in mouse
- 32% of alt. variants in mouse are conserved in human
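The conservation ratio can be computed directly from per-gene counts. In this sketch the input is a hypothetical list of `(number of transcripts, number of paired transcripts)` tuples for the genes of one species, already restricted to gene pairs with more than one variant; the counts are made up and are not the data behind the percentages above.

```python
def alt_splicing_conservation(genes):
    """Fraction of alternative variants that are conserved.

    genes: iterable of (n_transcripts, n_paired_transcripts), one per gene.
    One transcript per gene is subtracted as the main form.
    """
    genes = list(genes)
    numerator = sum(paired - 1 for _, paired in genes)
    denominator = sum(total - 1 for total, _ in genes)
    return numerator / denominator


# Made-up counts for three genes.
genes = [(4, 2), (3, 2), (5, 3)]
print(round(100 * alt_splicing_conservation(genes), 1), "% conserved")
```

Computing the ratio once with the human counts and once with the mouse counts gives two different values, which is why the slide reports one percentage per species.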
66- How many predicted novel genes
- are validated by Human-Mouse comparison?
67Novel genes
ESTGenes Not in Ensembl
Human ESTGenes validated by comparison to mouse
13,174
18,242
24,201
ESTGenes with at least one complete ORF
68Novel genes
ESTGenes not in Ensembl validated by comparison
to mouse
984
With a complete ORF