Title: Comparative gene hunting
1Comparative gene hunting
- Irmtraud Meyer
- The Sanger Institute, now University of Oxford
- meyer_at_stats.ox.ac.uk
2Making sense of the genome
- What are the proteins and where are they encoded?
3Aim in ab initio gene prediction
4We will very soon have
5Rough comparative map
Reference: www.ensembl.org
6Typical Situation
?
. . . gagccgcctcctccccttccccacgctctaggagggggccgcgg
gggcctggct gcgtcggccaatcggagtgcacttccgcagctgacaaat
tcagtataaaagcttggggct ggggccgagcactggggactttgagggt
ggccaggccagcgtaggaggccagcgtaggat cctgctgggagcgggga
actgagggaagcgacgccgagaaagcaggcgtaccacggaggg agagaa
aagctccggaagcccagcagcgcctttacgcacagctgccaactggccgc
tgcc gaccgtctccagctcccgaggacgcgcgaccggacaccgggtcct
gccacagccgaggac agctcgccgctcgccgcagcgagcccggggcggc
ccttcagggggacctttcccagatcg Cccaggccgcccggatgtgcacg
aaaatggaacag. . . . . . ggcgacgggggctcgggaagcctg
acagggcttttgcgcacagctgccggctgg tgctacccgcccgcgccag
cccccgagaacgcgcgaccaggcacccagtccggtcaccgc agcggaga
gctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctcc
cc agaccgcctgggccgcccggatgtgcactaaaatggaacagcccttc
taccacgacgact catacacagctacgggatacggccgggcccctggtg
gcctctctctacacgactacaaac tcctgaaaccgagcctggcggtcaa
cctggccgacccctaccggagtctcaaagcgcctg Gggctcgcggaccc
ggcccagagggcggcggtggcggcagctacttttc . . .
Human DNA
Mouse DNA
7Similar problem
8Aim in comparative ab initio gene prediction
cctgctgggtgcgagagccggcgtaccggtgaggcc
cctgctgggagcgaaagcaggcgtaccacggaggg
9Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
- advantages
- can detect new genes, as there is no need to search in databases for proteins
- fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes
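The two letter strings above can be compared position by position to see why conservation helps. The following is a minimal sketch, assuming an ungapped comparison and a hypothetical helper named `conserved_runs` with an arbitrary minimum run length of 4. It is only an illustration of the idea, not the method used by Doublescan, which aligns the sequences rather than comparing fixed positions.

```python
def conserved_runs(a, b, min_len=4):
    """Return (start, text) for each run of identical characters of length >= min_len.

    Assumption: the two strings are compared without gaps, position by position.
    """
    runs = []
    start = None
    n = min(len(a), len(b))
    # Iterate one step past the end so a run that reaches the end is closed.
    for i in range(n + 1):
        same = i < n and a[i] == b[i]
        if same and start is None:
            start = i                      # a new conserved run begins here
        elif not same and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                runs.append((start, a[start:i]))
            start = None
    return runs


s1 = "IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY"
s2 = "KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ"

for start, text in conserved_runs(s1, s2):
    print(start, text)
```

The conserved runs stand out against the diverged background without any prior knowledge of what the words mean, which is the same argument made for conserved coding regions in two related genomes.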
10Mouse human comparison
- 3286 million bases
- about 30 000 (?) genes
11Analysing mouse and human DNA
- Training
- adjust the parameters of Doublescan with a set of known pairs of orthologous mouse and human genes
- Testing
- test set: 80 pairs of known mouse and human genes
- 55% same number of exons, different coding length
- 42% same number of exons, same coding length
- 3% different number of exons, different coding length
12Results - Performance
annotation
prediction
13C. elegans C. briggsae
- C. elegans
- sequenced in 1998
- 97 million bases
- 5 autosomes, one X
- about 20 000 genes
- C. briggsae
- around 100 million bases
- 5 autosomes, one X
14Results - Performance
annotation
prediction
15Summary
- Doublescan
- predicts the gene structures of both sequences at the same time as aligning the sequences
- capable of predicting partial, complete and multiple genes or no genes at all, as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting
- can be used to analyse long sequences using the Stepping Stone algorithm (same performance as Hirschberg algorithm)
- general concept can be trained to analyse other pairs of related genomes
- performance on mouse - human DNA and C. elegans - C. briggsae DNA very promising
16To do list
- large scale mouse - human comparison
- large scale C. elegans - C. briggsae comparison
- search for regulatory regions
17References
- www.sanger.ac.uk/Software/analysis/doublescan
- I. M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp. 1309-
18Acknowledgements
- Richard Durbin
- Sequencing centres
- Trinity College, Cambridge
- Wellcome Trust
- The Sanger Centre
19The method
- What are pair hidden Markov models?
- How can they be used to find genes?
20Pair HMMs
- idea: annotate the two sequences by parsing them through connected states
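As a rough illustration of parsing two sequences through connected states, here is a minimal sketch of the Viterbi algorithm for a three-state pair HMM. The state names (`M` reads one base from each sequence, `X` reads one base from the first only, `Y` reads one base from the second only), the function name `pair_hmm_viterbi`, and the parameter values `delta`, `epsilon` and `p_match` are all assumptions chosen for the example and are not trained. A gene-finding pair HMM would use many more states; this sketch only aligns.

```python
import math

# Number of bases each state reads from the first and second sequence.
STEP = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}


def pair_hmm_viterbi(x, y, delta=0.1, epsilon=0.3, p_match=0.9):
    """Return (log probability, state path) of the best parse of x and y."""
    neg_inf = float("-inf")
    # Emission log-probabilities (illustrative values only).
    log_same = math.log(p_match / 4)          # M emits an identical pair
    log_diff = math.log((1 - p_match) / 12)   # M emits a mismatched pair
    log_single = math.log(0.25)               # X or Y emits a single base
    # Allowed transitions between the connected states.
    trans = {
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta),
        ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(epsilon),
        ("X", "M"): math.log(1 - epsilon),
        ("Y", "Y"): math.log(epsilon),
        ("Y", "M"): math.log(1 - epsilon),
    }
    n, m = len(x), len(y)
    v = {s: [[neg_inf] * (m + 1) for _ in range(n + 1)] for s in STEP}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STEP}
    v["M"][0][0] = 0.0  # assumption: the parse starts in M before any base

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for s, (di, dj) in STEP.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                if s == "M":
                    emit = log_same if x[pi] == y[pj] else log_diff
                else:
                    emit = log_single
                # Pick the best predecessor state at the previous cell.
                best, arg = neg_inf, None
                for prev in STEP:
                    if (prev, s) in trans and v[prev][pi][pj] > neg_inf:
                        cand = v[prev][pi][pj] + trans[(prev, s)]
                        if cand > best:
                            best, arg = cand, prev
                if arg is not None:
                    v[s][i][j] = best + emit
                    back[s][i][j] = arg

    # Trace back from the best final state to recover the state path.
    state = max(STEP, key=lambda s: v[s][n][m])
    score = v[state][n][m]
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append(state)
        prev = back[state][i][j]
        di, dj = STEP[state]
        i, j = i - di, j - dj
        state = prev
    return score, "".join(reversed(path))


score, path = pair_hmm_viterbi("ACGT", "ACT")
print(path)
```

The returned state path is the annotation: each character says which state read the corresponding bases. Adding states for exons, introns and intergenic regions would turn the same recursion into a joint alignment and gene-structure prediction.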
23Doublescan
31Refinements
- Score all potential splice sites
- => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states
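The refinement above can be pictured with a small sketch. It assumes that candidate sites are the canonical GT (donor) and AG (acceptor) dinucleotides and that each candidate has already been given a score by some separate model. The helper names `candidate_splice_sites` and `rescale_transition`, and the simple linear rescaling, are hypothetical; the slides do not specify how Doublescan computes or applies the scores.

```python
def candidate_splice_sites(seq):
    """Return start positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    return donors, acceptors


def rescale_transition(nominal_prob, site_score, max_score):
    """Scale a nominal transition probability by a relative site score.

    Hypothetical scheme: a site with the maximum score keeps the nominal
    probability, weaker sites get a proportionally smaller one.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= site_score <= max_score:
        raise ValueError("site_score must lie between 0 and max_score")
    return nominal_prob * (site_score / max_score)


donors, acceptors = candidate_splice_sites("aagtccag")
print(donors, acceptors)
print(rescale_transition(0.01, 2.0, 4.0))
```

With a scheme of this kind, a weak candidate site makes the transition into the splice site state less likely at that position, so the most probable parse tends to pass through the strong sites.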
32Refinements to Doublescan
- Score all potential translation start sites
- => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the START START state