Gene discovery using combined signals from genome sequence and natural selection

1
Gene discovery using combined signals from genome
sequence and natural selection
Michael Brent, Washington University
The mouse genome analysis group
2
Genes are read out via mRNA
3
RNA Processing
4
A typical human gene structure
5
In a mammalian genome
  • Finding all the genes is hard
  • Mammalian genomes are large
  • 5,051 miles of 10pt type
  • Raleigh to Tripoli, Libya
  • Only about 1.5% is protein coding
  • Raleigh to Winston-Salem

6
Genes are fairly unconstrained
  • Intron length is highly variable
  • 5% are 40-100 nt long
  • 3% are longer than 30,000 nt
  • Distance between genes is highly variable
  • From 10^3 to 10^6 nt or more (probably)

7
Exons per gene (RefSeq)
8
Background is not random
  • Segmental duplications
  • Entire regions duplicate, then diverge slowly
  • Processed pseudogenes
  • Spliced transcripts integrate back into the
    genome
  • Sequence is similar to source genes
  • Generally not functional

9
Gene prediction: two approaches
  • 1. Transcript-based (e.g., GeneWise)
  • Map experimentally determined sequences of
    spliced transcripts to their genomic source
  • Map transcript sequences to genomic regions that
    could produce similar transcripts
  • 2. De novo (genome only)
  • Model DNA patterns characteristic of gene
    components
  • Splice donor and acceptor
  • Protein coding sequence
  • Translation start and stop

10
Advantages and disadvantages
  • Transcript-based
  • Advantage: conservative
  • Evidence of transcription for every exon
  • Disadvantage: conservative
  • Can't find truly novel genes
  • Still subject to error

11
Advantages and disadvantages
  • De novo
  • Advantage 1: Less biased toward
  • Known transcripts
  • Transcripts that can be sequenced easily
  • Advantage 2: Genome sequencing is easy
  • Disadvantages
  • No direct evidence of transcription
  • Presumably, more false positives

12
Single-genome de novo: Genscan
  • Strengths
  • For mammalian sequence, one of the best
    single-genome, de novo gene predictors
  • Widely used to great practical advantage
  • De facto standard for mammalian sequence
  • Limitations
  • Predicts >45K genes (best est. 25-30K)
  • Predicts >315K exons (best est. 200K-250K)
  • Gets only 9% of known genes exactly right

13
Dual genome de novo
  • We developed algorithms that use two genomes to
  • Reduce the number of false positives
  • Refine the details of the structures

14
Single-genome de novo method
  • Probability model
  • Assigns probability to annotated DNA sequences
  • 5'-TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT-3'
  • Optimization algorithm
  • Given a DNA sequence, find the most probable
    annotation, according to the model

Intron
Exon
5' UTR
15
Genscan's generative model
Intron
Intron
Exon
CCATGGCGTCTTCAGGCAGTGACTC
16
Genscan's generative model
  • States correspond to gene features
  • Model generates DNA sequence by passing through
    states
  • The probability of an annotated DNA sequence is the
    probability of
  • generating the DNA sequence
  • by passing through states corresponding to the
    annotation.

Generalized HMM
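The idea of "assign a probability to every annotation, then find the most probable one" can be illustrated with a far simpler model than Genscan's. The sketch below is a plain two-state HMM decoded with the Viterbi algorithm; it is not a generalized HMM (which would also model the length of each feature explicitly), and the state names and all probabilities are made-up illustrative values, not Genscan parameters.

```python
import math

# Hypothetical toy parameters (NOT Genscan's): "E" = exon, "I" = intron.
STATES = ["E", "I"]
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # toy exon state: GC-rich
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # toy intron state: AT-rich
}

def log_prob(seq, path):
    """Log probability of generating seq by passing through the states in path."""
    lp = math.log(START[path[0]]) + math.log(EMIT[path[0]][seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(TRANS[path[i - 1]][path[i]]) + math.log(EMIT[path[i]][seq[i]])
    return lp

def viterbi(seq):
    """Return (most probable state path, its log probability) for seq."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

path, lp = viterbi("GCGCGCATATAT")
```

Here `path` labels each base with a state, which plays the role of the annotation, and `log_prob` is the quantity the model assigns to a (sequence, annotation) pair.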
17
Dual genome prediction
  • Input
  • Target and informant genomes
  • Idea
  • Patterns of evolution since the last common
    ancestor may reveal gene structure

18
Two conservation signals
  • 1. Local alignment signal
  • Selective pressures differ by feature
  • This leaves a characteristic signature
  • 2. Structural signal
  • Locations of introns tend to be conserved

19
Characteristic local alignments
Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
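The "characteristic signature" can be checked directly on the two example alignments from this slide. The sketch below lists the alignment columns where human and mouse differ and looks at those columns modulo 3. In the coding-exon example every difference falls on the same position modulo 3 and there are no gaps, which is what one would expect if changes concentrate at one codon position (the slide does not state the reading frame, so that interpretation is an assumption). The intron example has gaps and no such periodicity. The function name is hypothetical.

```python
def differing_columns(row_a, row_b):
    """Return the 0-based alignment columns where two aligned rows differ
    (mismatches and gaps both count)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return [i for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b]

# Alignments copied from the slide.
exon_human = "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC"
exon_mouse = "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC"
intron_human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
intron_mouse = "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT"

exon_diffs = differing_columns(exon_human, exon_mouse)
intron_diffs = differing_columns(intron_human, intron_mouse)

# Which positions modulo 3 do the differences fall on?
exon_phases = {i % 3 for i in exon_diffs}
intron_phases = {i % 3 for i in intron_diffs}

# Gaps appear only in the non-coding example.
intron_has_gaps = "-" in intron_human + intron_mouse
exon_has_gaps = "-" in exon_human + exon_mouse
```

Two short alignments are only an illustration; a real scoring model would be estimated from many annotated alignments.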
20
Conservation of intron location
21
Align → predict → filter → test
  • Align: WU-BLAST
  • Predict: TWINSCAN
  • Filter: Aligned Intron Filter
  • Test: Validation (RT-PCR)
TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
22
Representation change
TWINSCAN
gHMM decoding
Conservation sequence
TCTGCCACC TCAGCTACT
TCTGCCACC
23
BLAST Alignments
Target
Informant
24
Projecting BLAST Alignments
Target
Informant
25
Projecting BLAST Alignments
Target
Informant
26
Projecting BLAST Alignments
Target
Informant
27
Projecting BLAST Alignments
Target
Informant
28
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
  • Pair each nucleotide of the target with
  • | if it is aligned and identical

29
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
  • Pair each nucleotide of the target with
  • | if it is aligned and identical
  • : if it is aligned to mismatch or gap

30
Conservation sequence
Synthetic (projected) local alignment
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
  • Pair each nucleotide of the target with
  • | if it is aligned and identical
  • : if it is aligned to mismatch or gap
  • . if it is unaligned

31
Conservation sequence
Conservation sequence
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
  • Pair each nucleotide of the target with
  • | if it is aligned and identical
  • : if it is aligned to mismatch or gap
  • . if it is unaligned

32
Conservation sequence
Conservation sequence
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
       ||||||.........|:|||||||||::||:||||:
  • Pair each nucleotide of the target with
  • | if it is aligned and identical
  • : if it is aligned to mismatch or gap
  • . if it is unaligned
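The construction on these slides can be sketched in a few lines. The code below is a minimal illustration, not TWINSCAN's implementation: the input format (a list of local alignments already projected onto the target), the function name, and the symbol choice (`|` for match, `:` for mismatch or gap, `.` for unaligned) are assumptions of the sketch. The example uses the human/mouse alignment from the slides, with the block boundaries as read from the projected alignment shown there.

```python
def conservation_sequence(target, alignments):
    """Build a conservation sequence for `target`.

    `alignments` is a list of (start, target_row, informant_row) tuples
    (a hypothetical format): `start` is the 0-based target position where
    the local alignment begins, and the two rows are equal-length aligned
    strings that use '-' for gaps.
    """
    cons = ["."] * len(target)            # default: unaligned
    for start, target_row, informant_row in alignments:
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":                  # gap in the target: no target base to label
                continue
            cons[pos] = "|" if t == i else ":"   # an informant gap also gives ':'
            pos += 1
    return "".join(cons)

# Target (human) without gap characters, as on the last slide of the series.
human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"

# Two projected local alignment blocks against the informant (mouse).
blocks = [
    (0, "CTAGAG", "CTAGAG"),
    (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
]

cons = conservation_sequence(human, blocks)
```

The result has exactly one symbol per target nucleotide, so it can be fed to the gene model alongside the DNA sequence.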

33
Twinscan: Extending the model
  • Probability model
  • Assigns probability to annotated DNA
  • 5'-TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT-3'
  • ........
  • Optimization
  • Given DNA and conservation sequence, find the
    most probable annotation, according to the model

Intron
Exon
5' UTR
34
Twinscan
  • Each state generates DNA and conservation
    sequence independently
  • Probability of annotated DNA and conservation
    sequence is probability of generating the DNA and
    conservation sequence by passing through
    corresponding states
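Because each state generates the DNA and the conservation sequence independently, the joint probability for a state is the product of the two individual probabilities (a sum in log space). The sketch below shows this with position-independent toy emission tables; the real models are richer than this, and every number and name here is a hypothetical value chosen for illustration.

```python
import math

# Hypothetical per-state emission tables (illustrative values only).
DNA_EMIT = {
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon":   {"|": 0.7, ":": 0.2, ".": 0.1},   # toy coding state: mostly conserved
    "intron": {"|": 0.3, ":": 0.3, ".": 0.4},   # toy non-coding state: often unaligned
}

def dna_log_emission(state, dna):
    """log P(dna | state) under the toy DNA table."""
    return sum(math.log(DNA_EMIT[state][b]) for b in dna)

def cons_log_emission(state, cons):
    """log P(cons | state) under the toy conservation table."""
    return sum(math.log(CONS_EMIT[state][c]) for c in cons)

def joint_log_emission(state, dna, cons):
    """log P(dna, cons | state), assuming the state generates the two
    sequences independently."""
    if len(dna) != len(cons):
        raise ValueError("DNA and conservation sequence must have equal length")
    return dna_log_emission(state, dna) + cons_log_emission(state, cons)

dna = "ATAT"
conserved = joint_log_emission("exon", dna, "||||") - joint_log_emission("intron", dna, "||||")
unaligned = joint_log_emission("exon", dna, "....") - joint_log_emission("intron", dna, "....")
```

With these toy numbers the DNA alone slightly favors the intron state for `ATAT`, but a fully conserved stretch tips the comparison toward the exon state (`conserved > 0`), while an unaligned stretch pushes it further toward intron (`unaligned < 0`).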

35
Performance Evaluation
  • RefSeq
  • A set of 13,000 known mRNAs
  • Represents 40-50% of human genes
  • Usually, only one of several splices
  • Mapping to genome is imperfect
  • Best available gold standard
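One common way to score predictions against a gold standard such as RefSeq is to count exact coordinate matches. The sketch below is a generic illustration of that idea, not the evaluation procedure used in this talk; the function name is hypothetical and the coordinates are invented toy data.

```python
def exact_match_accuracy(predicted, annotated):
    """Compare predicted and annotated exons by exact (start, end) match.

    Returns (sensitivity, precision):
      sensitivity = fraction of annotated exons predicted exactly
      precision   = fraction of predicted exons that are annotated
    """
    pred, gold = set(predicted), set(annotated)
    true_positives = len(pred & gold)
    sensitivity = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return sensitivity, precision

# Toy data: one predicted exon has a wrong end, one has no annotation.
annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (500, 600), (700, 800)]
sensitivity, precision = exact_match_accuracy(predicted, annotated)
```

Because the gold standard covers only part of the genome's genes, a predicted exon with no annotation is not necessarily wrong, so the precision computed this way is a lower bound.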

40
Short term goal
  • All multi-exon human genes
  • Predict accurately
  • Integrate information from more genomes
  • Verify at least one intron experimentally
  • Follow up with full-length verification

41
Acknowledgments
  • Funding agencies
  • National Institutes of Health (NHGRI)
  • National Science Foundation (DBI)
  • Sequencing centers
  • Sanger, Whitehead, Wash. U.
  • My group
  • Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
  • Collaborators
  • Roderic Guigo, Josep Abril, Genis Parra
  • Pankaj Agarwal
  • Stylianos Antonarakis, Alexandre Reymond, Manolis
    Dermitzakis

42
Other clades
  • Plants
  • Arabidopsis thaliana, cabbage, rice
  • Nematodes
  • C. elegans, C. briggsae
  • Fungi
  • Cryptococcus neoformans (JEC21, H99)

43
Pair HMM algorithms (SLAM, ...)
  • Input is orthologous sequences.
  • Aligns and predicts simultaneously, using a joint
    probability model
  • Predicts orthologous genes in 2 sequences
  • All predicted CDS is aligned
  • Some aligned regions are not predicted CDS
  • Labeled conserved non-coding sequence

44
The algorithms (SLAM,)
  • sgp2
  • Alignment before prediction (tblastx)
  • Predicts genes in target sequence only
  • Doesn't need orthologous input sequences
  • Paralogs and low-coverage shotgun sequence can help
  • Modifies scores of all potential exons by
  • adding, at each base, the tblastx score of the best
    overlapping local alignment (roughly)
  • to the geneid score of that potential exon
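The exon rescoring described above can be sketched roughly as follows. This is only an illustration of "add the best overlapping alignment score at each base": the tuple formats, the per-base alignment score, the `weight` parameter, and both function names are hypothetical, and sgp2's actual scoring differs in detail (the slide itself says "roughly").

```python
def best_score_per_base(length, alignments):
    """For each base of the target, the score of the best overlapping alignment.

    `alignments` is a list of (start, end, score_per_base) tuples covering the
    half-open target interval [start, end). Bases with no alignment get 0.0.
    """
    best = [0.0] * length
    for start, end, score in alignments:
        for pos in range(max(start, 0), min(end, length)):
            best[pos] = max(best[pos], score)
    return best

def rescore_exon(exon, per_base, weight=1.0):
    """Add the summed per-base alignment scores to a potential exon's own score.

    `exon` is (start, end, base_score) over the half-open interval [start, end).
    """
    start, end, base_score = exon
    return base_score + weight * sum(per_base[start:end])

# Toy example: two overlapping alignments on a 10-base target.
per_base = best_score_per_base(10, [(2, 6, 1.0), (4, 8, 2.0)])
new_score = rescore_exon((3, 7, 5.0), per_base)
```

A potential exon that overlaps no alignment keeps its original score, so the alignments can only raise scores in this simplified version.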

45
The algorithms
  • TWINSCAN
  • Alignment before prediction (blastn)
  • Predicts in target sequence only
  • Modifies scores of all potential exons, UTRs,
    splice sites, start and stop models by
  • applying, at each base, a feature-specific scoring
    model (estimated for this purpose)
  • to the best overlapping local alignment, and
    adding the result
  • to the Genscan score for that feature

46
Aligned, CDS vs. other
47
Syntenic Gene Prediction (sgp2)
48
Why work on gene finding?
  • Genes are
  • Components responsible for biological function
  • Variations cause human disease / susceptibility
  • Controls for modifying biological function
  • Human gene therapy
  • Agriculture
  • Nanotechnology, etc.