Overview - PowerPoint PPT Presentation

Overview

Description:

TATA box. Initiator. Gene. DNA coding strand. Biological ... upstream regulatory signals (TATA boxes) and local characteristics of the sequence (CpG islands) ... – PowerPoint PPT presentation

Slides: 44
Provided by: CeleraE7
Transcript and Presenter's Notes

Title: Overview


1
Overview
  • Biological motivation
  • Methods in gene prediction
  • Mapping of large EST data sets
  • Applications of EST data mining

2
ESTomics
  • Sorin Istrail

6
Biological motivation
  • Model of eukaryotic gene transcription and
    translation

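[Figure: the RNA polymerase II promoter (upstream binding sites for Sp1, Oct1 and C/EBP, TATA box, initiator) and the gene on the DNA coding strand. Transcription gives the primary transcript (cap, 5' UTR, exon 1, intron delimited by GT and AG, exon 2, 3' UTR, AAUAAA signal, (A)n tail); splicing gives the mRNA (5' UTR, exons, 3' UTR); translation gives the protein (peptide).]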
7
Biological motivation
  • Expressed Sequence Tags (ESTs) are cDNA fragments
  • 500 bp long on average
  • may span one or more exons
  • cDNA: single-stranded DNA complementary to an
    RNA, synthesized from it by reverse transcription

[Figure: gene structure on the DNA coding strand (5' UTR, exons 1-3, non-coding exon 4, introns, 3' UTR), the primary transcript, the mRNA, and the ESTs derived from it.]
8
Overview
  • Biological motivation
  • Methods in gene prediction
  • Mapping of large EST data sets
  • Applications of EST data mining

9
Methods in gene finding
  • Ab initio analysis of genomic sequences
    (GenScan, Burge and Karlin 1997; HMMer, Haussler
    et al. 1993, Krogh et al. 1994; FGenesH, Solovyev
    and Salamov 1994)
  • Comparison of protein and genomic sequences
    (Procrustes, Gelfand et al. 1996; Genewise,
    Birney and Durbin)
  • Comparison of expressed DNA (ESTs, cDNA, mRNA)
    and genomic sequences (EST_GENOME, Mott 1997;
    SIM4, Florea et al. 1998)
  • Cross-species genomic sequence comparisons
    (ROSETTA, Batzoglou et al. 2000; CEM, Bafna and
    Huson 2000)

10
Ab initio gene finders
  • Use information embedded in the genomic sequence
    to predict the exon model
  • polyadenylation signal (AATAAA)
  • differential codon usage in coding versus
    non-coding sections of the gene
  • upstream regulatory signals (TATA boxes) and
    local characteristics of the sequence (CpG
    islands)
  • splice recognition signals (e.g., GT-AG)
  • Markov models are the predominant predictive
    method
  • Caveats
  • not effective in detecting alternatively spliced
    forms, interleaved or overlapping genes
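
The sequence signals listed above can be located with plain string scanning. The sketch below is only an illustration of that idea, not how GenScan or any other gene finder works: the function names and the `min_len` threshold are hypothetical, only the forward strand is scanned, and real tools score such signals statistically rather than accepting every occurrence.

```python
import re

def find_polya_signals(seq):
    """Return 0-based start positions of every AATAAA hexamer (overlaps allowed)."""
    return [m.start() for m in re.finditer("(?=AATAAA)", seq.upper())]

def candidate_introns(seq, min_len=20):
    """Enumerate half-open (start, end) spans that begin with GT and end with AG.

    min_len is an illustrative cut-off, not a biological constant.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

if __name__ == "__main__":
    toy = "CC" + "GT" + "C" * 18 + "AG" + "CC" + "AATAAA" + "CC"
    print(find_polya_signals(toy))
    print(candidate_introns(toy))
```

Every GT...AG pair is reported, which is why the number of candidates grows quickly on real sequence and why the statistical models described on the following slides are needed to rank them.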

11
The GenScan method
  • High-level organization
  • each of the basic functional units of a gene is
    associated with a state in the HMM
  • Lower-level organization
  • separate sequence prediction module for each of
    the higher-level elements
  • exons (marginal, internal, phase-specific) -
    inhomogeneous 3-periodic 5th order Markov model
  • introns and intergenic regions - homogeneous 5th
    order Markov model
  • 5' and 3' UTRs - homogeneous 5th order Markov
    model
  • polyadenylation signal
  • donor and acceptor splice sites - weight array
    model (WAM) and Maximal Dependence Decomposition
    (MDD), i.e., a decision tree-based weighted
    position matrix
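
To make the "homogeneous k-th order Markov model" concrete, here is a minimal sketch that estimates one from training sequences and scores a query by log-likelihood. It is not GenScan's code: the function names, the add-one pseudocount, and the assumption that sequences contain only A, C, G and T are all choices made for this example. The 3-periodic exon model would keep three such tables, one per codon position.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=5):
    """Sum of log transition probabilities along seq (first `order` bases skipped)."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))
```

Comparing the log-likelihood of a window under a model trained on one sequence class with its log-likelihood under a model trained on another gives a simple log-odds score for that window.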

12
GenScan's HMM for sequence generation
[Figure: state diagram of the HMM. Forward (+) strand states: P (prom), F (5' UTR), Einit, Esngl (single-exon gene), E0, E1, E2, I0, I1, I2, Eterm, T (3' UTR), A (polyA signal). Reverse (-) strand states: P-, F-, Einit-, Esngl-, E0-, E1-, E2-, I0-, I1-, I2-, Eterm-, T-, A-. N (intergenic region) appears once.]
(Burge and Karlin (1997) Prediction of complete gene structures in human
genomic DNA, JMB 268, p. 86)
13
Protein-genomic sequence comparisons
  • Use sequence similarity between the protein and
    the protein-coding regions of the genomic
    sequence for gene model prediction
  • Algorithmic techniques
  • dynamic programming-based sequence alignment
    algorithms
  • specialized recognition modules for splice
    junction prediction
  • profile HMMs
  • Examples
  • Procrustes (Gelfand et al. 1996)
  • combinatorial pairing of putative splice
    junctions to form introns
  • uses protein-genomic sequence similarity to
    validate the correct pairings
  • Genewise (Durbin and Birney)
  • HMM-based sequence profiles
  • uses similarity between the query protein and a
    database of protein families organized in
    profiles (Pfam)
  • Caveats
  • prediction limited to coding regions (excluding
    5' and 3' UTRs)

14
cDNA-genomic sequence comparisons
  • Use similarities between the cDNA (ESTs, mRNAs)
    and the genomic sequences to predict the gene
    model.
  • Algorithmic techniques
  • dynamic programming-based sequence alignment
    algorithms
  • specialized module for splice junction detection
    (pattern matching techniques, or statistical
    modeling)
  • Examples
  • EST_GENOME (Mott 1997)
  • dynamic programming alignment with an affine
    scoring scheme
  • uniform scoring for large indels (introns)
  • SIM4 (Florea et al. 1998)
  • incremental exon detection and refinement with
    blast-like and greedy sequence comparison
    techniques
  • pattern matching prediction of splice junctions
  • Caveats
  • accuracy depends on the quality of the data
    source (e.g., cannot detect genomic contamination
    by unspliced introns, or spurious priming)

15
Cross-species genomic sequence comparison
  • Use the sequence similarity and the ordering of
    homologous regions between genomic sequences from
    related organisms to infer their common gene
    model.
  • Algorithmic techniques
  • dynamic programming-based sequence comparison
    algorithms
  • statistical modeling of the splice junctions and
    other common transcriptional elements
  • Examples
  • ROSETTA (Batzoglou et al. 2000), CEM (Conserved
    Exon Method; Bafna and Huson 2000)
  • progressive sequence alignment between the
    various categories of orthologous regions (based
    on the expected sequence similarity)
  • statistical methods for splice signal recognition
    (?)
  • Caveats
  • accuracy depends on the specificity of sequence
    similarity and the presence of delimiting
    transcriptional signals at that locus (similarity
    may extend past the gene boundaries)

16
Automatic gene annotation with Otto
17
Components of the automatic gene annotation
  • Bn - blastn (dbEST, CHGI, CMGI, RefSeq)
  • S4 - SIM4 (dbEST, CHGI, CMGI, RefSeq)
  • Genewise (nr)
  • GenScan
  • FGenesH
  • repeat - RepeatMasker
  • etc.
  • Otto - automatic gene predictions by Otto
  • Promoted - curated transcripts

18
Overview
  • Biological motivation
  • Methods in gene prediction
  • Mapping of large EST data sets
  • Applications of EST data mining

20
Using large EST data sets for gene prediction
  • Each EST may span one or more of a gene's exons
  • Overlapping ESTs and mRNAs on the genome can be
    used to infer gene models
  • Large data sets must be used for completeness
  • dbEST (3.7 million ESTs)
  • UniGene (90,000 ESTs and mRNA transcripts,
    grouped by similarity)
  • proprietary data sets (LifeSeq, CHGI)
  • Analyzing such large data sets is time- and
    resource-consuming
  • Strategy for EST data mining
  • determine the occurrences of a large set of cDNA
    sequences in a target genome (mapping)
  • group the overlapping EST matches on the genome
    to infer the underlying gene model (clustering)
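
The clustering step can be pictured as merging overlapping intervals along the genome. The sketch below assumes each mapped EST has already been reduced to a hypothetical `(est_id, start, end)` tuple with half-open coordinates on a single chromosome and strand; it ignores exon structure, which a real clustering step would also have to reconcile.

```python
def cluster_matches(matches):
    """Group EST matches whose genomic spans overlap.

    matches: iterable of (est_id, start, end), half-open, same chromosome and strand.
    Returns a list of clusters sorted by start coordinate.
    """
    clusters = []
    for est_id, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
        if clusters and start < clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            cluster = clusters[-1]
            cluster["end"] = max(cluster["end"], end)
            cluster["ests"].append(est_id)
        else:
            clusters.append({"start": start, "end": end, "ests": [est_id]})
    return clusters
```

Each resulting cluster delimits a genomic region supported by at least one EST and is a candidate locus for a gene model.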

21
Mapping ESTs to a target genome
  • Mapping: determine, for a given EST, the exact
    genomic location(s) and exon model(s), i.e.,
  • exon coordinates in the genomic sequence
  • genomic match strand (forward, or reverse
    complement)
  • percent sequence identity values (at the exon and
    EST levels)
  • spliced EST-genomic sequence alignment
  • Validation: criteria for validating putative EST
    occurrences on the genome
  • EST coverage
  • similarity between the EST and genomic sequences
  • e.g., >80% of the EST must match the genome, at
    >90% sequence identity
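
The validation criteria translate directly into a filter. The sketch below uses the example thresholds from this slide (more than 80% of the EST aligned, at more than 90% identity); the function name and the per-exon tuple layout are assumptions made for illustration, and exon segments are assumed not to overlap in EST coordinates.

```python
def is_valid_match(est_length, exons, min_coverage=0.80, min_identity=0.90):
    """Decide whether a spliced EST-genome alignment passes the validation criteria.

    exons: list of (est_start, est_end, identical_bases), with half-open EST
    coordinates for each aligned exon segment.
    """
    aligned = sum(end - start for start, end, _ in exons)
    if aligned == 0 or est_length == 0:
        return False
    identical = sum(ident for _, _, ident in exons)
    coverage = aligned / est_length   # fraction of the EST placed on the genome
    identity = identical / aligned    # fraction of aligned bases that match
    return coverage > min_coverage and identity > min_identity
```

Keeping the two thresholds as parameters makes it easy to tighten them for high-quality mRNAs or relax them for noisy EST reads.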

22
Technical challenges
  • cDNA
  • Sequencing errors and polymorphisms
  • Interspecies contamination
  • Low quality EST data
  • Gene model
  • Multiple gene homologues
  • Alternative splicing
  • Interleaving and overlapping of genes
  • Genomic sequence
  • Repetitive elements
  • Genomic contamination
  • Genomic sequence representation
  • Large data size
  • 3 billion bp in the human genome
  • 2.8 billion bp in dbEST

23
Source: primary cDNA data
24
Source: underlying gene model
  • Multiple gene homologues
  • generate multiple EST matches
  • need to distinguish the true match based on
    sequence similarity
  • complicated by sequencing errors in cDNA data

[Figure: an EST and its genomic matches: the ortholog (true match) and paralogs 1, 2 and 3.]
25
Source: underlying gene model
  • Alternative splicing
  • a single gene gives rise to more than one mRNA
    sequence and protein product
  • may occur as a result of tissue specificity, or
    to activate different regulatory pathways
  • cannot be identified by ab initio methods

[Figure: mRNA transcript 1 and mRNA transcript 2 shown against the same genomic sequence.]
26
Source: underlying gene model
  • Interleaving and overlapping of genes
  • genes located in the introns of another gene
  • overlapping exons from different genes
  • difficult to detect with ab initio methods

[Figure: Gene 1 and Gene 2 on the same genomic region.]
27
Source: genomic sequence
  • Repetitive elements
  • classes
  • LINEs (Long Interspersed Nuclear Elements) --
    7,000 bp
  • SINEs (Short Interspersed Nuclear Elements) --
    300 bp -- e.g., Alu
  • low complexity regions -- e.g., ACACACACACACACAC
  • tandem repeats -- e.g., CAGCAGCAGCAG
  • occur in large numbers in the genome
  • considerably increase the size of the computation

28
Source: genomic sequence
  • Genomic contamination
  • unspliced introns (A)
  • internal priming (B)
  • these artifacts can only be resolved by
    clustering the ESTs on the genomic axis, or in
    conjunction with other prediction methods

[Figure: (A) an EST containing an unspliced intron, shown against the genome; (B) an EST arising from a false (non-genic) primer site on the genome, labelled AATATAAA.]
29
Source: genomic sequence
  • Genomic sequence representation
  • ideal view: one sequence per chromosome
  • public sequences: BACs, contigs, ordered and
    oriented to approximate full chromosomes
  • possible mis-ordering and mis-orienting
  • incomplete genomic sequence

[Figure: genomic sequence with a gap.]
30
Source: genomic sequence
  • Celera genome assembly
  • generated using the Whole Genome Shotgun (WGS)
    method and a compartmentalized sequence assembler
  • sequence: partially ordered and oriented
    collection of scaffolds
  • scaffolds: ordered and oriented collection of
    contigs
  • known mean and distribution of gap lengths

[Figure labels: scaffolds; contig ordering and orienting with mate-pairs; shared fragments; Gap(μ, σ²); fragments; BACs (finished or unordered collections of contigs).]
...ACCGATCACGTATCTAGCGATCTTAAGGCTATCCCATGCGAGACTTA
GCTTACGGNNNCATTCGAGCGGATCTATCTGAGCT....
31
Source: genomic sequence
[Figure labels: scaffold, contigs, BACtigs, genomic sequence, fragments.]
32
Strategies for large scale EST mapping
  • Direct mapping with an exact cDNA-genomic
    sequence alignment method (SIM4, EST_GENOME)
  • divide the genome into n overlapping fragments
  • align the EST against each of the genomic
    fragments
  • Time required
  • SIM4 - 0.3s per EST/Mb (1 EST vs. genome in 15
    minutes)
  • EST_GENOME - even slower
  • Too expensive!
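
The cost quoted above can be checked with a few lines of arithmetic. The sketch below combines the SIM4 rate from this slide with the genome size (3 billion bp) and the dbEST size (3.7 million ESTs) quoted on earlier slides; it assumes a single processor and ignores I/O and the overlap between genomic fragments.

```python
SECONDS_PER_EST_PER_MB = 0.3   # SIM4 rate quoted on the slide
GENOME_MB = 3_000              # 3 billion bp human genome
NUM_ESTS = 3_700_000           # dbEST size quoted earlier in the deck

def direct_mapping_seconds(num_ests, genome_mb=GENOME_MB,
                           rate=SECONDS_PER_EST_PER_MB):
    """Total CPU seconds to align num_ests ESTs against the whole genome."""
    return num_ests * genome_mb * rate

if __name__ == "__main__":
    one_est = direct_mapping_seconds(1)
    all_ests = direct_mapping_seconds(NUM_ESTS)
    print(f"one EST: {one_est / 60:.0f} minutes")
    print(f"all of dbEST: {all_ests / (86_400 * 365):.0f} CPU-years")
```

A single EST costs about 15 minutes, as stated on the slide, and the full dbEST set comes out at roughly a hundred CPU-years, which is what motivates the faster strategies on the following slides.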

33
Strategies for large scale EST mapping
  • Mapping of ESTs to the genome via the (predicted)
    mRNA transcripts
  • map each of the ESTs on the set of (predicted)
    mRNA transcripts, or genes with known genomic
    locations
  • align the EST against the genomic fragment
    containing the gene for the EST with an exact
    alignment method
  • Faster than exact mapping
  • Can be used to improve existing gene models, but
    not to discover new ones

34
Strategies for large scale EST mapping
  • Two-stage mapping of ESTs to the genome
  • detect potential EST matches on the genome with a
    fast similarity search program (signal finding)
  • blastn, MUMmer, tfastx
  • align the EST against the bounded genomic region
    containing the signal with an exact alignment
    method (polishing)
  • SIM4, EST_GENOME

[Figure: (1) the EST signal on the genome; (2) the EST and the bounded genomic regions containing the EST signal.]
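
The signal-finding stage can be approximated with an exact k-mer index, as sketched below. This is a stand-in for the similarity search programs named above, not a description of how any of them work: the function names, `k`, `max_gap`, `min_hits` and `flank` are all illustrative, and only the forward strand is searched. The regions it returns are what the second stage would pass to an exact aligner such as SIM4.

```python
from collections import defaultdict

def build_kmer_index(genome, k=12):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def find_signal_regions(est, index, genome_len, k=12,
                        min_hits=2, max_gap=1000, flank=50):
    """Return half-open genomic regions that share k-mers with the EST."""
    hits = sorted({p for j in range(len(est) - k + 1)
                   for p in index.get(est[j:j + k], ())})
    regions, group = [], []
    for p in hits + [None]:            # None is a sentinel that flushes the last group
        if group and (p is None or p - group[-1] > max_gap):
            if len(group) >= min_hits:
                regions.append((max(0, group[0] - flank),
                                min(genome_len, group[-1] + k + flank)))
            group = []
        if p is not None:
            group.append(p)
    return regions
```

Setting `max_gap` larger than typical intron lengths keeps the exons of one gene in a single bounded region, so that the polishing step sees the whole locus.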
35
Repeat detection and resolution
  • Repeats represent 40% of the sequence of the
    human genome
  • Some repeats can be found in the 3' UTRs of the
    genes
  • Spurious priming can produce repetitive ESTs
  • In tests using dbEST, 1% of the ESTs accounted
    for 99% of the EST signals
  • Resolution strategies
  • repeat-mask the genome prior to mapping using,
    e.g., RepeatMasker
  • repeat mask the EST data prior to mapping
  • selectively mask only those ESTs with large
    numbers of occurrences, during mapping
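
The third strategy, selective masking during mapping, amounts to counting signals per EST and setting aside the ESTs that exceed a threshold. A minimal sketch follows; the function names and the `max_occurrences` cut-off are hypothetical, and signals are assumed to be `(est_id, genomic_position)` pairs produced by the signal-finding stage.

```python
from collections import Counter

def select_repetitive_ests(signals, max_occurrences=100):
    """Return the ids of ESTs with more than max_occurrences genomic signals."""
    counts = Counter(est_id for est_id, _ in signals)
    return {est_id for est_id, n in counts.items() if n > max_occurrences}

def filter_signals(signals, repetitive):
    """Drop the signals that belong to ESTs flagged as repetitive."""
    return [(est_id, pos) for est_id, pos in signals if est_id not in repetitive]
```

The flagged ESTs can then be repeat-masked and mapped again, so that only the small repetitive fraction pays the extra cost.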

36
Overview
  • Biological motivation
  • Methods in gene prediction
  • Mapping of large EST data sets
  • Applications of EST data mining

37
EST data mining
  • Gene prediction by genomic EST clustering
    (previously discussed)
  • Generation of gene indices by EST clustering and
    assembly
  • 5' and 3' UTR reconstruction
  • Detection of alternatively spliced gene variants

38
Gene indices
  • Quality- and vector-trim the EST sequences
  • Cluster the ESTs in groups based on sequence
    similarity
  • Assemble the ESTs in each cluster using a
    multiple alignment program
  • For each cluster, select a consensus sequence
    (the EST assembly)
  • Each EST assembly is a potential mRNA transcript
  • Detect potential splice variants by pairwise
    comparisons between highly similar EST assemblies
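
The clustering step can be sketched with a union-find structure that joins two ESTs whenever they share an exact k-mer. This is a deliberately crude notion of sequence similarity chosen for brevity, not the method of any particular gene index: the function name and `k` are illustrative, and reverse-complement matches and sequencing errors are ignored.

```python
def cluster_by_shared_kmers(ests, k=20):
    """Group ESTs that share at least one exact k-mer (transitively).

    ests: dict mapping est_id -> sequence. Returns a sorted list of sorted id lists.
    """
    parent = {est_id: est_id for est_id in ests}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    first_seen = {}                          # k-mer -> first EST that contained it
    for est_id, seq in ests.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                parent[find(est_id)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = est_id

    clusters = {}
    for est_id in ests:
        clusters.setdefault(find(est_id), []).append(est_id)
    return sorted(sorted(c) for c in clusters.values())
```

Each cluster would then be handed to a multiple alignment or assembly program to produce the consensus sequence.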

39
5' and 3' UTR reconstruction
  • Map the ESTs on the genomic axis
  • Cluster the EST matches along the genomic axis in
    the area surrounding the predicted transcripts,
    in a manner consistent with the GenBank
    annotation
  • Determine putative 3' mRNA transcript ends in the
    vicinity of the 3'-most EST-genomic alignments
  • Use genomic information (e.g., poly-adenylation
    signals, AATAAA) to validate the 3' UTR ends
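
The last step, validating a putative 3' end with the polyadenylation signal, can be sketched as a windowed search. The window of 10 to 30 bases upstream of the end is an assumption of this example (it reflects where the signal is typically found, but the slides do not state a window), as are the function name and the restriction to the forward strand.

```python
def has_polya_signal(genome, end, window=(10, 30), signal="AATAAA"):
    """Check for a signal starting window[0]..window[1] bases upstream of `end`.

    genome: forward-strand sequence; end: 0-based coordinate of the putative
    3' transcript end.
    """
    lo = max(0, end - window[1])
    hi = max(0, end - window[0])
    # Any occurrence inside this slice starts between lo and hi inclusive.
    region = genome[lo:hi + len(signal)].upper()
    return signal in region
```

Putative ends without a nearby signal are not necessarily wrong, so a check like this is better treated as supporting evidence than as a hard filter.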

40
Detection of alternative splices
  • Using EST consensus information
  • cluster the ESTs to create gene indices
  • determine the consensus sequence for each cluster
  • compare highly similar consensus sequences to
    detect putative alternatively spliced exons
    (indel blocks)
  • Using the EST-genomic sequence alignments
  • cluster the EST matches along the genomic axis to
    infer possible exon models
  • determine (internal) exons that are present in
    some, but not all, ESTs in the cluster
    (alternatively spliced)
  • collect EST evidence for alternatively spliced
    variants
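
The second approach can be sketched as follows: within one cluster, an internal exon is flagged when at least one EST includes it and at least one other EST spans its position without it. The function name and data layout are hypothetical, each EST's exon list is assumed to be non-empty and sorted, and exons are compared by exact genomic coordinates, which real data would need to relax.

```python
def alternatively_spliced_exons(models):
    """Find internal exons present in some, but not all, ESTs that span them.

    models: dict est_id -> sorted list of (start, end) genomic exon coordinates.
    Returns {exon: {"including": [...], "skipping": [...]}}.
    """
    all_exons = sorted({exon for exons in models.values() for exon in exons})
    result = {}
    for exon in all_exons:
        including, skipping = [], []
        for est_id, exons in models.items():
            # The EST spans the exon if the exon lies between its first and last exons.
            spans = exons[0][1] <= exon[0] and exon[1] <= exons[-1][0]
            overlaps = any(s < exon[1] and exon[0] < e for s, e in exons)
            if exon in exons[1:-1]:
                including.append(est_id)
            elif spans and not overlaps:
                skipping.append(est_id)
        if including and skipping:
            result[exon] = {"including": sorted(including),
                            "skipping": sorted(skipping)}
    return result
```

Requiring the EST to span the exon matters because ESTs are fragments: an EST that simply ends before the exon is not evidence that the exon was skipped.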

41
References
  • Lewin B (2000) Genes VII, Oxford University Press
    Inc., New York, ISBN 0-19-879276-X.
  • Burge C, and Karlin S. (1997) Prediction of
    complete gene structures in human genomic DNA, J
    Mol Biol. 268(1):78-94.
  • Kulp D, Haussler D, Reese MG, and Eeckman FH.
    (1996) A generalized hidden Markov model for the
    recognition of human genes in DNA, Proc Int Conf
    Intell Syst Mol Biol. 4:134-42.
  • Krogh A, Mian IS, and Haussler D. (1994) A hidden
    Markov model that finds genes in E. coli DNA,
    Nucleic Acids Res. 22(22):4768-78.
  • Solovyev VV, Salamov AA, and Lawrence CB. (1994)
    Predicting internal exons by oligonucleotide
    composition and discriminant analysis of
    spliceable open reading frames, Nucleic Acids
    Res. 22(24):5156-63.
  • Salamov AA, and Solovyev VV. (2000) Ab initio
    gene finding in Drosophila genomic DNA, Genome
    Res. 10(4):516-22.

42
References
  • Gelfand MS, Mironov AA, and Pevzner PA (1996)
    Gene recognition via spliced sequence alignment,
    Proc Natl Acad Sci USA 93(17):9061-6.
  • Mott R. (1997) EST_GENOME: a program to align
    spliced DNA sequences to unspliced genomic DNA,
    Comput Appl Biosci. 13(4):477-8.
  • Florea L, Hartzell G, Zhang Z, Rubin GM, and
    Miller W. (1998) A computer program for
    aligning a cDNA sequence with a genomic DNA
    sequence, Genome Res. 8(9):967-74.
  • Florea, L. and Walenz, B. (in preparation)
    ESTMapper: Massive EST Mapping.
  • Batzoglou S, Pachter L, Mesirov JP, Berger B, and
    Lander ES. (2000) Human and mouse gene
    structure: comparative analysis and application
    to exon prediction, Genome Res. 10(7):950-8.
  • Bafna V, and Huson DH. (2000) The conserved exon
    method for gene finding, Proc Int Conf Intell
    Syst Mol Biol. 8:3-12.
  • Quackenbush J, Liang F, Holt I, Pertea G, and
    Upton J. (2000) The TIGR gene indices:
    reconstruction and representation of expressed
    gene sequences, Nucleic Acids Res. 28(1):141-5.

43
References
  • Venter JC, Adams MD, Myers EW, Li PW, Mural RJ,
    Sutton GG, Smith HO, Yandell M, Evans CA, Holt
    RA, et al. (2001) The sequence of the human
    genome, Science 291(5507):1304-51.
  • Gautheret D, Poirot O, Lopez F, Audic S, and
    Claverie JM. (1998) Alternate polyadenylation in
    human mRNAs: a large-scale analysis by EST
    clustering, Genome Res. 8(5):524-30.
  • Kan Z, Rouchka EC, Gish WR, and States DJ. (2001)
    Gene structure prediction and alternative
    splicing analysis using genomically aligned ESTs,
    Genome Res. 11(5):889-900.
  • Kan Z, Gish W, Rouchka E, Glasscock J, and States
    D. (2000) UTR reconstruction and analysis using
    genomically aligned EST sequences, Proc Int Conf
    Intell Syst Mol Biol. 8:218-27.
  • Ji H, Zhou Q, Wen F, Xia H, Lu X, and Li Y.
    (2001) AsMamDB: an alternative splice database of
    mammals, Nucleic Acids Res. 29(1):260-3.