Gene discovery using combined signals from genome sequence and natural selection

1
Gene discovery using combined signals from genome
sequence and natural selection
Michael Brent Washington University
The mouse genome analysis group
2
Genes are read out via mRNA
3
RNA Processing
4
A typical human gene structure
5
In a mammalian genome

Finding all the genes is hard
Mammalian genomes are large
5,051 miles of 10pt type
Raleigh to Tripoli, Libya
Only about 1.5% protein coding
Raleigh to Winston-Salem

6
Genes are fairly unconstrained

Intron length is highly variable
5% are 40-100 nt long
3% are longer than 30,000 nt
Distance between genes is highly variable
From 10^3 to 10^6 nt or more (probably)

7
Exons per gene (RefSeq)
8
Background is not random

Segmental duplications
Entire regions duplicate, then diverge slowly
Processed pseudogenes
Spliced transcripts integrate back into the
genome
Sequence is similar to source genes
Generally not functional

9
Gene prediction: two approaches

1. Transcript-based (e.g., GeneWise)
Map experimentally determined sequences of
spliced transcripts to their genomic source
Map transcript sequences to genomic regions that
could produce similar transcripts
2. De novo (genome only)
Model DNA patterns characteristic of gene
components
Splice donor and acceptor
Protein coding sequence
Translation start and stop

10
Advantages and disadvantages

Transcript-based
Advantage: conservative
Evidence of transcription for every exon
Disadvantage: conservative
Can't find truly novel genes
Still subject to error

11
Advantages and disadvantages

De novo
Advantage 1: Less biased toward
Known transcripts
Transcripts that can be sequenced easily
Advantage 2: Genome sequencing is easy
Disadvantages
No direct evidence of transcription
Presumably, more false positives

12
Single-genome de novo: Genscan

Strengths
For mammalian sequence, one of the best
single-genome, de novo gene predictors
Widely used to great practical advantage
De facto standard for mammalian sequence
Limitations
Predicts >45K genes (best est. 25-30K)
Predicts >315K exons (best est. 200K-250K)
Gets only 9% of known genes exactly right

13
Dual genome de novo

We developed algorithms that use two genomes to
Reduce the number of false positives
Refine the details of the structures

14
Single-genome de novo method

Probability model
Assigns probability to annotated DNA sequences
5' TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT 3'
Optimization algorithm
Given a DNA sequence, find the most probable
annotation, according to the model

Intron
Exon
5' UTR
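
The optimization step can be sketched with a plain HMM and the Viterbi algorithm. This is only a minimal illustration under stated assumptions, not Genscan itself: Genscan uses a generalized HMM with explicit feature length distributions, while the three states, the one-state-per-base structure, and every probability below are made-up placeholders.

```python
import math

STATES = ["intergenic", "exon", "intron"]

# Hypothetical parameters, chosen only for illustration.
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergenic": 0.05, "exon": 0.85, "intron": 0.1},
    "intron": {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path (annotation) for seq."""
    score = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this base.
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r][s]))
            col[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATGCGCGCGCATAT")
print(path)
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input.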
15
Genscan's generative model
Intron
Intron
Exon
CCATGGCGTCTTCAGGCAGTGACTC
16
Genscan's generative model

States correspond to gene features
Model generates DNA sequence by passing through
states
The probability of an annotated DNA sequence is the
probability of generating the DNA sequence
by passing through the states corresponding to the
annotation.

Generalized HMM
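
As a rough illustration of how a generative model assigns a probability to an annotated sequence, the sketch below multiplies start, transition, and emission probabilities along the state path given by the annotation. It assumes a plain HMM with one state label per base, whereas a generalized HMM lets a state emit a whole feature with its own length distribution. The two states and all numbers are hypothetical.

```python
import math

# Hypothetical two-state model; values are illustrative only.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def log_prob_annotated(seq, annotation):
    """log P(seq, annotation): generate seq by passing through the annotated states."""
    if len(seq) != len(annotation):
        raise ValueError("need one state label per base")
    first = annotation[0]
    logp = math.log(START[first]) + math.log(EMIT[first][seq[0]])
    for prev, cur, base in zip(annotation, annotation[1:], seq[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return logp


seq = "GCGT"
annotation = ["exon", "exon", "exon", "intron"]
lp = log_prob_annotated(seq, annotation)
print(lp)
```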
17
Dual genome prediction

Input
Target and informant genomes
Idea
Patterns of evolution since the last common
ancestor may reveal gene structure

18
Two conservation signals

1. Local alignment signal
Selective pressures differ by feature
This leaves a characteristic signature
2. Structural signal
Locations of introns tend to be conserved

19
Characteristic local alignments
Coding exon
human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
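
One way to see the characteristic signature is to list where the coding-exon example differs between human and mouse. The sketch below uses the two sequences from this slide (the mouse sequence is the slide's two fragments joined, as it also appears on the pipeline slide). The helper name `mismatch_positions` is hypothetical.

```python
def mismatch_positions(a, b):
    """0-based columns where two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


human_exon = "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC"
mouse_exon = "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC"

exon_mismatches = mismatch_positions(human_exon, mouse_exon)
# Position modulo 3 shows whether mismatches share a codon position.
phases = sorted({i % 3 for i in exon_mismatches})
print(exon_mismatches, phases)  # [20, 23, 29, 32, 35] [2]
```

Every mismatch falls at the same position modulo 3. That is the pattern one would expect from changes confined to third codon positions, although the slide does not state the reading frame, so that interpretation is an assumption.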
20
Conservation of intron location
21
Align → predict → filter → test
WU-BLAST (align) → TWINSCAN (predict) → Aligned Intron Filter (filter) → Validation by RT-PCR (test)
Example sequences from the figure:
TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
TCTGCCACC TCAGCTACT
22
Representation change
TWINSCAN
gHMM decoding
Conservation sequence
TCTGCCACC TCAGCTACT
TCTGCCACC
23
BLAST Alignments
Target
Informant
24
Projecting BLAST Alignments
Target
Informant
25
Projecting BLAST Alignments
Target
Informant
26
Projecting BLAST Alignments
Target
Informant
27
Projecting BLAST Alignments
Target
Informant
28
Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human

mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
| if it is aligned and identical

29
Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human

mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
| if it is aligned and identical
: if it is aligned to mismatch or gap

30
Conservation sequence
Synthetic (projected) local alignment
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
. . . . . . . . .
mouse
CTAGAG AGACAGGTACCATAGGGCTCTCCT

Pair each nucleotide of the target with
| if it is aligned and identical
: if it is aligned to mismatch or gap
. if it is unaligned

31
Conservation sequence
Conservation sequence
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
human
. . . . . . . . .

Pair each nucleotide of the target with
| if it is aligned and identical
: if it is aligned to mismatch or gap
. if it is unaligned

32
Conservation sequence
Conservation sequence
CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
human
. . . . . . . . .

Pair each nucleotide of the target with
| if it is aligned and identical
: if it is aligned to mismatch or gap
. if it is unaligned
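
The pairing rule can be sketched as a small function. This is an illustration under assumptions, not the Twinscan code: the symbols `|` (aligned and identical) and `:` (aligned to mismatch or gap) are assumed here alongside the `.` shown on the slide, the input is taken to be already-projected alignments given as hypothetical `(target_start, target_row, informant_row)` tuples, and gap columns in the target row are skipped because they correspond to no target base.

```python
def conservation_sequence(target_len, alignments):
    """Build one conservation symbol per target base.

    alignments: iterable of (target_start, target_row, informant_row), where
    the rows are gapped strings of equal length and target_start is the
    0-based target coordinate of the first non-gap target character.
    """
    cons = ["."] * target_len  # unaligned by default
    for start, target_row, informant_row in alignments:
        if len(target_row) != len(informant_row):
            raise ValueError("alignment rows must have equal length")
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":
                continue  # gap in target: no target base to label
            cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)


# Toy example: an 8-base target with one local alignment starting at base 2.
cons = conservation_sequence(8, [(2, "GT-AC", "GTTA-")])
print(cons)  # ..|||:..
```

The result has exactly one symbol per target base, so it can be read in parallel with the DNA sequence.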

33
Twinscan: extending the model

Probability model
Assigns probability to annotated DNA
5' TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT 3'
........
Optimization
Given DNA and conservation sequence, find the
most probable annotation, according to the model

Intron
Exon
5' UTR
34
Twinscan

Each state generates DNA and conservation
sequence independently
Probability of annotated DNA and conservation
sequence is probability of generating the DNA and
conservation sequence by passing through
corresponding states
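
Because each state generates the two sequences independently, the emission probability for a state factors into a DNA term and a conservation term. The sketch below shows that factorization in log space. It is a simplification under assumptions: emissions are treated as position-independent, the conservation symbols `|`, `:`, and `.` are assumed, and all table values are invented for illustration.

```python
import math

# Hypothetical per-state emission tables; numbers are illustrative only.
DNA_EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
            "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
CONS_EMIT = {"exon": {"|": 0.8, ":": 0.15, ".": 0.05},
             "intron": {"|": 0.4, ":": 0.3, ".": 0.3}}


def log_emission(state, dna, cons):
    """log P(dna, cons | state), with dna and cons independent given the state."""
    if len(dna) != len(cons):
        raise ValueError("need one conservation symbol per base")
    return sum(math.log(DNA_EMIT[state][b]) + math.log(CONS_EMIT[state][c])
               for b, c in zip(dna, cons))


exon_lp = log_emission("exon", "GC", "||")
intron_lp = log_emission("intron", "GC", "||")
print(exon_lp, intron_lp)
```

With these made-up tables, a well-conserved stretch scores higher under the exon state than under the intron state, which is the kind of evidence the extended model can use during decoding.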

35
Performance Evaluation

RefSeq
A set of 13,000 known mRNAs
Represents 40-50% of human genes
Usually, only one of several splice variants
Mapping to genome is imperfect
Best available gold standard

40
Short term goal

All multi-exon human genes
Predict accurately
Integrate information from more genomes
Verify at least one intron experimentally
Follow up with full-length verification

41
Acknowledgments

Funding agencies
National Institutes of Health (NHGRI)
National Science Foundation (DBI)
Sequencing centers
Sanger, Whitehead, Wash. U.
My group
Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
Collaborators
Roderic Guigo, Josep Abril, Genis Parra
Pankaj Agarwal
Stylianos Antonarakis, Alexandre Reymond, Manolis
Dermitzakis

42
Other clades

Plants
Arabidopsis thaliana, cabbage, rice
Nematodes
C. elegans, C. briggsae
Fungi
Cryptococcus neoformans (JEC21, H99)

43
Pair HMM algorithms (SLAM, ...)

Input is orthologous sequences.
Aligns and predicts simultaneously, using a joint
probability model
Predicts orthologous genes in 2 sequences
All predicted CDS is aligned
Some aligned regions are not predicted CDS
Labeled conserved non-coding sequence

44
The algorithms (SLAM, ...)

sgp2
Alignment before prediction (tblastx)
Predicts genes in target sequence only
Don't need orthologous input sequences
Paralogs, low-coverage shotgun can help
Modifies scores of all potential exons:
At each base, add the tblastx score of the best
overlapping local alignment (roughly)
to the geneid score of that potential exon
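
The exon rescoring described above can be sketched roughly as follows. This is an assumption-laden illustration, not sgp2's implementation: each local alignment is represented as a hypothetical `(start, end, per_base_score)` tuple on target coordinates, and the `weight` parameter is invented here to show where a scaling factor could go.

```python
def best_score_per_base(start, end, hsps):
    """Score of the best overlapping alignment at each base in [start, end).

    hsps: list of (hsp_start, hsp_end, per_base_score), half-open intervals.
    Bases covered by no alignment contribute 0.
    """
    return [max((score for s, e, score in hsps if s <= pos < e), default=0.0)
            for pos in range(start, end)]


def rescored_exon(base_score, start, end, hsps, weight=1.0):
    """Add the summed per-base alignment support to the exon's original score."""
    return base_score + weight * sum(best_score_per_base(start, end, hsps))


hsps = [(0, 5, 1.0), (3, 8, 2.0)]
new_score = rescored_exon(10.0, 2, 6, hsps)
print(new_score)  # 17.0
```

Where two alignments overlap, only the better one counts at each base, so overlapping evidence is not double counted.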

45
The algorithms

TWINSCAN
Alignment before prediction (blastn)
Predicts in target sequence only
Modifies scores of all potential exons, UTRs,
splice sites, start and stop models:
At each base, apply a feature-specific scoring
model (estimated for this purpose)
to the best overlapping local alignment, and
add the result
to the Genscan score for that feature
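
A feature-specific scoring model could take the form of a log-odds score over conservation symbols, added to the single-genome score for the same feature. The sketch below assumes that form. The log-odds formulation, the feature names, the `null` background model, and every frequency are assumptions made for illustration and are not taken from the slides.

```python
import math

# Hypothetical symbol frequencies per feature; "null" stands for background.
CONS_MODEL = {
    "coding_exon": {"|": 0.80, ":": 0.15, ".": 0.05},
    "utr": {"|": 0.55, ":": 0.25, ".": 0.20},
    "null": {"|": 0.40, ":": 0.30, ".": 0.30},
}


def conservation_score(feature, cons):
    """Sum of per-base log-odds of the conservation symbols under the feature model."""
    model, null = CONS_MODEL[feature], CONS_MODEL["null"]
    return sum(math.log(model[c] / null[c]) for c in cons)


def combined_score(genscan_score, feature, cons):
    """Single-genome score plus the conservation evidence for the same feature."""
    return genscan_score + conservation_score(feature, cons)


print(combined_score(1.0, "coding_exon", "|||:"))
print(combined_score(1.0, "utr", "|||:"))
```

Scoring the same conservation string under different feature models is what lets conservation evidence shift the balance between competing annotations of a region.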