Title: Gene discovery using combined signals from genome sequence and natural selection
1Gene discovery using combined signals from genome sequence and natural selection
Michael Brent, Washington University
The mouse genome analysis group
2Genes are read out via mRNA
3RNA Processing
4A typical human gene structure
5In a mammalian genome
- Finding all the genes is hard
- Mammalian genomes are large
- 5,051 miles of 10pt type
- Raleigh to Tripoli, Libya
- Only about 1.5% is protein coding
- Raleigh to Winston-Salem
6Genes are fairly unconstrained
- Intron length is highly variable
- 5% are 40-100 nt long
- 3% are longer than 30,000 nt
- Distance between genes is highly variable
- From 10^3 to 10^6 nt or more (probably)
7Exons per gene (RefSeq)
8Background is not random
- Segmental duplications
- Entire regions duplicate, then diverge slowly
- Processed pseudogenes
- Spliced transcripts integrate back into the genome
- Sequence is similar to source genes
- Generally not functional
9Gene prediction: two approaches
- 1. Transcript-based (e.g., GeneWise)
- Map experimentally determined sequences of spliced transcripts to their genomic source
- Map transcript sequences to genomic regions that could produce similar transcripts
- 2. De novo (genome only)
- Model DNA patterns characteristic of gene components
- Splice donor and acceptor
- Protein coding sequence
- Translation start and stop
10Advantages and disadvantages
- Transcript-based
- Advantage: conservative
- Evidence of transcription for every exon
- Disadvantage: conservative
- Can't find truly novel genes
- Still subject to error
11Advantages and disadvantages
- De novo
- Advantage 1: Less biased toward
- Known transcripts
- Transcripts that can be sequenced easily
- Advantage 2: Genome sequencing is easy
- Disadvantages
- No direct evidence of transcription
- Presumably, more false positives
12Single-genome de novo: Genscan
- Strengths
- For mammalian sequence, one of the best single-genome, de novo gene predictors
- Widely used to great practical advantage
- De facto standard for mammalian sequence
- Limitations
- Predicts >45K genes (best est. 25-30K)
- Predicts >315K exons (best est. 200K-250K)
- Gets only 9% of known genes exactly right
13Dual genome de novo
- We developed algorithms that use two genomes to
- Reduce the number of false positives
- Refine the details of the structures
14Single-genome de novo method
- Probability model
- Assigns probability to annotated DNA sequences
- 5' TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT 3'
- Optimization algorithm
- Given a DNA sequence, find the most probable annotation, according to the model
Intron
Exon
5' UTR
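The two ingredients above (a probability model over annotated DNA, and an optimizer that finds the most probable annotation) can be illustrated with a toy hidden Markov model decoded by the Viterbi algorithm. This is a minimal sketch, not Genscan itself: the two states, every probability, and the function name `viterbi` are invented for illustration.

```python
import math

# Hypothetical two-state model; every number below is made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}

def viterbi(seq):
    """Return (log probability, state path) of the most probable annotation."""
    # best[s] = log probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    log_prob = best[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return log_prob, path

log_prob, path = viterbi("GCGCGCGCATATATAT")
print(path)
```

With these invented parameters the GC-rich half is labeled exon and the AT-rich half intron, because one state switch costs less than emitting eight bases from the poorer-fitting state.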
15Genscan's generative model
Intron
Intron
Exon
CCATGGCGTCTTCAGGCAGTGACTC
16Genscan's generative model
- States correspond to gene features
- Model generates DNA sequence by passing through states
- The probability of annotated DNA sequence is the probability of
- generating the DNA sequence
- by passing through states corresponding to the annotation.
Generalized HMM
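To make the definition above concrete, the sketch below scores one annotation under a toy generalized HMM, in which each state emits a whole segment with an explicit length distribution. It is an illustration under stated assumptions, not Genscan's actual model: the state set, the geometric length model, all parameters, and the names `log_length_prob` and `log_prob_annotated` are hypothetical.

```python
import math

# Hypothetical generalized HMM: each state emits a whole segment whose length
# is drawn from an explicit distribution. All parameters are invented.
TRANS = {("start", "exon"): 1.0, ("exon", "intron"): 1.0, ("intron", "exon"): 1.0}
MEAN_LENGTH = {"exon": 5.0, "intron": 10.0}   # geometric length model per state
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def log_length_prob(state, length):
    """Log probability of a segment length under a geometric distribution."""
    p = 1.0 / MEAN_LENGTH[state]
    return (length - 1) * math.log(1.0 - p) + math.log(p)

def log_prob_annotated(segments):
    """Log probability of generating the segments in order.

    `segments` is a list of (state, dna) pairs, i.e. the annotation.
    A transition missing from TRANS raises KeyError (probability zero).
    """
    total, prev = 0.0, "start"
    for state, dna in segments:
        total += math.log(TRANS[(prev, state)])              # enter the state
        total += log_length_prob(state, len(dna))            # choose a length
        total += sum(math.log(EMIT[state][b]) for b in dna)  # emit the bases
        prev = state
    return total

# Two hypothetical annotations of the same 12 bases.
one_exon = [("exon", "GCGCGCATATAT")]
exon_then_intron = [("exon", "GCGCGC"), ("intron", "ATATAT")]
print(log_prob_annotated(one_exon), log_prob_annotated(exon_then_intron))
```

An optimizer such as the one on the earlier slide would search over all such segmentations and keep the one with the highest score.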
17Dual genome prediction
- Input
- Target and informant genomes
- Idea
- Patterns of evolution since the last common
ancestor may reveal gene structure
18Two conservation signals
- 1. Local alignment signal
- Selective pressures differ by feature
- This leaves a characteristic signature
- 2. Structural signal
- Locations of introns tend to be conserved
19Characteristic local alignments
Coding exon
TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
TTATCCACCAGACCAGATAG
GTATTTGTCAGCTACTCTC
human
mouse
Intron (non-coding)
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
CTAGAGC----AAGAAGACAG
GTACCATAGGGCTCTCCT
human
mouse
20Conservation of intron location
21Align → predict → filter → test
Aligned Intron Filter
Validation (RT-PCR)
TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
WU-BLAST
TWINSCAN
TCTGCCACC TCAGCTACT
TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
22Representation change
TWINSCAN
gHMM decoding
Conservation sequence
TCTGCCACC TCAGCTACT
TCTGCCACC
23BLAST Alignments
Target
Informant
24Projecting BLAST Alignments
Target
Informant
28Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT
- Pair each nucleotide of the target with
- | if it is aligned and identical
29Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT
- Pair each nucleotide of the target with
- | if it is aligned and identical
- : if it is aligned to mismatch or gap
30Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
. . . . . . . . .
mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT
- Pair each nucleotide of the target with
- | if it is aligned and identical
- : if it is aligned to mismatch or gap
- . if it is unaligned
31Conservation sequence
Conservation sequence
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
. . . . . . . . .
- Pair each nucleotide of the target with
- | if it is aligned and identical
- : if it is aligned to mismatch or gap
- . if it is unaligned
32Conservation sequence
Conservation sequence
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
human
. . . . . . . . .
- Pair each nucleotide of the target with
- | if it is aligned and identical
- : if it is aligned to mismatch or gap
- . if it is unaligned
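The pairing rule can be sketched as a short function over one projected alignment. This is a minimal illustration, not TWINSCAN's code: the symbol `.` for unaligned bases appears on the slide, while `|` (aligned and identical) and `:` (aligned to a mismatch or gap) are assumed here. The input convention (two equal-length rows, `-` for a gap, a space in the informant row for uncovered target bases) and the name `conservation_sequence` are also assumptions.

```python
def conservation_sequence(target_aln, informant_aln):
    """Build a conservation sequence from two equal-length alignment rows.

    Assumed input convention (hypothetical): '-' marks a gap, and a space in
    the informant row marks target bases not covered by any alignment.
    """
    if len(target_aln) != len(informant_aln):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for t, i in zip(target_aln, informant_aln):
        if t == "-":
            continue             # gap in the target: no target base to annotate
        if i == " ":
            symbols.append(".")  # unaligned
        elif i == t:
            symbols.append("|")  # aligned and identical
        else:
            symbols.append(":")  # aligned to a mismatch or a gap
    return "".join(symbols)

target = "ACGT-ACGTAC"
informant = "ACTTGA-G   "
cons = conservation_sequence(target, informant)
print(cons)  # ||:||:|...
```

Dropping target-gap columns gives exactly one symbol per target nucleotide, which is what lets the conservation sequence be decoded alongside the DNA.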
33Twinscan: Extending the model
- Probability model
- Assigns probability to annotated DNA
- 5' TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT 3'
- ........
- Optimization
- Given DNA and conservation sequence, find the most probable annotation, according to the model
Intron
Exon
5' UTR
34Twinscan
- Each state generates DNA and conservation sequence independently
- Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
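Because each state generates the two sequences independently, the joint emission probability for a segment is the product of a DNA term and a conservation term (a sum in log space). The sketch below shows only that factorization. The emission tables, their values, the `|`/`:`/`.` alphabet, and the name `log_emission` are assumptions for illustration; a real model would use richer, context-dependent emission models.

```python
import math

# Hypothetical per-state emission tables; all probabilities are invented.
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.7, ":": 0.2, ".": 0.1},    # coding: mostly conserved
    "intron": {"|": 0.3, ":": 0.2, ".": 0.5},  # non-coding: often unaligned
}

def log_emission(state, dna, cons):
    """Log probability that `state` emits `dna` and `cons`, assumed independent."""
    if len(dna) != len(cons):
        raise ValueError("DNA and conservation sequence must have equal length")
    log_dna = sum(math.log(DNA_EMIT[state][b]) for b in dna)
    log_cons = sum(math.log(CONS_EMIT[state][c]) for c in cons)
    return log_dna + log_cons   # independence: probabilities multiply

# A fully conserved, GC-rich segment scores better as exon than as intron.
print(log_emission("exon", "GCGC", "||||"), log_emission("intron", "GCGC", "||||"))
```

Plugging this joint term in where a single-genome model uses the DNA term alone leaves the decoding algorithm itself unchanged.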
35Performance Evaluation
- RefSeq
- A set of 13,000 known mRNAs
- Represents 40-50% of human genes
- Usually, only one of several splices
- Mapping to genome is imperfect
- Best available gold standard
40Short term goal
- All multi-exon human genes
- Predict accurately
- Integrate information from more genomes
- Verify at least one intron experimentally
- Follow up with full-length verification
41Acknowledgments
- Funding agencies
- National Institutes of Health (NHGRI)
- National Science Foundation (DBI)
- Sequencing centers
- Sanger, Whitehead, Wash. U.
- My group
- Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
- Collaborators
- Roderic Guigo, Josep Abril, Genis Parra
- Pankaj Agarwal
- Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis
42Other clades
- Plants
- Arabidopsis thaliana, cabbage, rice
- Nematodes
- C. elegans, C. briggsae
- Fungi
- Cryptococcus neoformans (JEC21, H99)
43Pair HMM algorithms (SLAM,)
- Input is orthologous sequences.
- Aligns and predicts simultaneously, using a joint probability model
- Predicts orthologous genes in 2 sequences
- All predicted CDS is aligned
- Some aligned regions are not predicted CDS
- Labeled conserved non-coding sequence
44The algorithms (SLAM,)
- sgp2
- Alignment before prediction (tblastx)
- Predicts genes in target sequence only
- Don't need orthologous input sequences
- Paralogs, low-coverage shotgun can help
- Modifies scores of all potential exons, by
- At each base, adding tblastx score of best overlapping local alignment (roughly)
- To geneid scores of that potential exon
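The exon rescoring described above can be sketched as follows. This is a rough illustration of the idea only, not sgp2's implementation: the coordinate convention (half-open intervals), the use of a single per-base score for each alignment, and the name `rescore_exons` are all assumptions.

```python
def rescore_exons(exons, alignments):
    """Add per-base alignment support to each candidate exon's score.

    exons:      list of (start, end, base_score), end exclusive
    alignments: list of (start, end, per_base_score), end exclusive
    Returns a list of (start, end, combined_score).
    """
    rescored = []
    for ex_start, ex_end, base_score in exons:
        bonus = 0.0
        for pos in range(ex_start, ex_end):
            overlapping = [s for a, b, s in alignments if a <= pos < b]
            if overlapping:
                bonus += max(overlapping)   # best overlapping alignment here
        rescored.append((ex_start, ex_end, base_score + bonus))
    return rescored

# Hypothetical candidate exons and alignments.
exons = [(0, 10, 5.0), (20, 30, 5.0)]
alignments = [(0, 5, 1.0), (3, 8, 2.0)]
rescored = rescore_exons(exons, alignments)
print(rescored)  # [(0, 10, 18.0), (20, 30, 5.0)]
```

The first exon gains 13.0 from the bases covered by alignments, while the second, with no alignment support, keeps its original score.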
45The algorithms
- TWINSCAN
- Alignment before prediction (blastn)
- Predicts in target sequence only
- Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
- At each base, applying a feature-specific scoring model (estimated for this purpose) to the best overlapping local alignment, and
- Adding the result to the Genscan scores for that feature
46Percent aligned, CDS vs. other
47Syntenic Gene Prediction (sgp2)
48Why work on gene finding?
- Genes are
- Components responsible for biological function
- Variations cause human disease / susceptibility
- Controls for modifying biological function
- Human gene therapy
- Agriculture
- Nanotechnology, etc.