Comparative gene hunting - PowerPoint PPT Presentation

About This Presentation
Slides: 33
Provided by: stats5

Transcript and Presenter's Notes

Title: Comparative gene hunting


1
Comparative gene hunting
  • Irmtraud Meyer
  • The Sanger Institute
  • now: University of Oxford
  • meyer_at_stats.ox.ac.uk

2
Making sense of the genome
  • What are the proteins and where are they encoded?

3
Aim in ab initio gene prediction
4
We will very soon have
5
Rough comparative map
reference www.ensembl.org
6
Typical Situation
?
. . . gagccgcctcctccccttccccacgctctaggagggggccgcgg
gggcctggctgcgtcggccaatcggagtgcacttccgcagctgacaaat
tcagtataaaagcttggggctggggccgagcactggggactttgagggt
ggccaggccagcgtaggaggccagcgtaggatcctgctgggagcgggga
actgagggaagcgacgccgagaaagcaggcgtaccacggagggagagaa
aagctccggaagcccagcagcgcctttacgcacagctgccaactggccgc
tgccgaccgtctccagctcccgaggacgcgcgaccggacaccgggtcct
gccacagccgaggacagctcgccgctcgccgcagcgagcccggggcggc
ccttcagggggacctttcccagatcgcccaggccgcccggatgtgcacg
aaaatggaacag . . .
. . . ggcgacgggggctcgggaagcctg
acagggcttttgcgcacagctgccggctggtgctacccgcccgcgccag
cccccgagaacgcgcgaccaggcacccagtccggtcaccgcagcggaga
gctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctcc
ccagaccgcctgggccgcccggatgtgcactaaaatggaacagcccttc
taccacgacgactcatacacagctacgggatacggccgggcccctggtg
gcctctctctacacgactacaaactcctgaaaccgagcctggcggtcaa
cctggccgacccctaccggagtctcaaagcgcctggggctcgcggaccc
ggcccagagggcggcggtggcggcagctacttttc . . .
Human DNA
Mouse DNA
7
Similar problem
8
Aim in comparative ab initio gene prediction
cctgctgggtgcgagagccggcgtaccggtgaggcc
cctgctgggagcgaaagcaggcgtaccacggaggg
9
Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
  • Advantages:
  • can detect new genes, as there is no need to
    search protein databases
  • needs fewer assumptions than one-strand ab
    initio gene-prediction methods, so it can detect
    unusual genes
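
The two letter strings on this slide make the point by analogy: read alone, each looks like noise, but the positions where both agree spell out words. A minimal Python sketch of that comparison follows. The function name `conserved_runs` and the `min_len` threshold are illustrative assumptions, not part of Doublescan, and the strings are compared position by position without gaps.

```python
def conserved_runs(a, b, min_len=3):
    """Return (start, text) for each run of at least min_len positions
    where the two equally aligned strings carry the same character."""
    runs, start = [], None
    n = min(len(a), len(b))
    for i in range(n + 1):
        same = i < n and a[i] == b[i]
        if same and start is None:
            start = i                      # a conserved run begins here
        elif not same and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                runs.append((start, a[start:i]))
            start = None
    return runs


# The two strings from the slide.
s1 = "IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY"
s2 = "KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ"

words = [text for _, text in conserved_runs(s1, s2, min_len=4)]
print(words)
```

Real genomic sequences also differ by insertions and deletions, which is why the method aligns the two sequences rather than comparing them column by column.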

10
Mouse - human comparison
  • 3286 million bases
  • about 30 000 (?) genes

11
Analysing mouse and human DNA
  • Training: adjust the parameters of Doublescan with a set of
    known pairs of orthologous mouse and human genes
  • Testing: test set of 80 pairs of known mouse and human genes
  • 55% same number of exons, different coding
    length
  • 42% same number of exons, same coding length
  • 3% different number of exons, different
    coding length

12
Results - Performance
[Figure: annotation vs. prediction]
13
C. elegans - C. briggsae
  • C. elegans: sequenced in 1998; 97 million bases;
    5 autosomes, one X; about 20 000 genes
  • C. briggsae: around 100 million bases;
    5 autosomes, one X

14
Results - Performance
[Figure: annotation vs. prediction]
15
Summary
  • Doublescan
  • predicts the gene structures of both sequences at
    the same time as aligning the sequences
  • capable of predicting partial, complete and
    multiple genes or no genes at all as well as more
    diverged pairs of genes which are related by
    events of exon-fusion or exon-splitting
  • can be used to analyse long sequences using the
    Stepping Stone algorithm (same performance as
    Hirschberg algorithm)
  • general concept can be trained to analyse other
    pairs of related genomes
  • performance on mouse - human DNA and C. elegans -
    C. briggsae DNA very promising
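
The slides do not describe the Stepping Stone algorithm itself. As background for the Hirschberg comparison only, here is a minimal sketch of the building block that Hirschberg's algorithm relies on: computing the last row of a global alignment score table while keeping only two rows in memory. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions for illustration and have nothing to do with Doublescan's trained parameters.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-2):
    """Last row of the Needleman-Wunsch score table for a vs. b,
    using memory proportional to len(b) instead of len(a) * len(b)."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                         prev[j] + gap,       # gap in b
                         cur[j - 1] + gap)    # gap in a
        prev = cur                            # discard the older row
    return prev


print(nw_last_row("ACGT", "AGT")[-1])
```

Hirschberg's algorithm calls such a routine on the two halves of one sequence to find where the optimal alignment crosses the middle, then recurses, which recovers the full alignment in linear memory.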

16
To do list
  • large-scale mouse - human comparison
  • large-scale C. elegans - C. briggsae comparison
  • search for regulatory regions

17
References
  • www.sanger.ac.uk/Software/analysis/doublescan
  • I. M. Meyer and R. Durbin, Bioinformatics,
    2002, 18(10), pp. 1309-

18
Acknowledgements
  • Richard Durbin
  • Sequencing centres
  • Trinity College, Cambridge
  • Wellcome Trust
  • The Sanger Centre

19
The method
  • What are pair hidden Markov models?
  • How can they be used to find genes?

20
Pair HMMs
  • Idea: annotate the two sequences by parsing them
    through connected states
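
To make the idea concrete, here is a minimal sketch of Viterbi decoding in a textbook three-state pair HMM. It is not Doublescan's model, which the transcript does not reproduce: the states `M` (both sequences emit a base), `X` (only the first emits) and `Y` (only the second emits), and the parameter values `delta`, `eps` and `p_match`, are all assumptions chosen for illustration.

```python
import math

NEG_INF = float("-inf")


def pair_hmm_viterbi(x, y, delta=0.1, eps=0.3, p_match=0.8):
    """Most probable state path that emits x and y together."""
    states = ("M", "X", "Y")
    step = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}  # bases consumed per state
    log_t = {  # allowed transitions (no direct X <-> Y)
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta), ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(eps), ("X", "M"): math.log(1 - eps),
        ("Y", "Y"): math.log(eps), ("Y", "M"): math.log(1 - eps),
    }

    def log_emit(s, i, j):
        if s == "M":  # 4 identical pairs share p_match, 12 others the rest
            same = x[i - 1] == y[j - 1]
            return math.log(p_match / 4 if same else (1 - p_match) / 12)
        return math.log(0.25)  # a single base opposite a gap

    n, m = len(x), len(y)
    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in states}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in states}
    v["M"][0][0] = 0.0  # start in M before any base is read
    for i in range(n + 1):
        for j in range(m + 1):
            for s in states:
                pi, pj = i - step[s][0], j - step[s][1]
                if (i, j) == (0, 0) or pi < 0 or pj < 0:
                    continue
                for p in states:
                    if (p, s) in log_t and v[p][pi][pj] > NEG_INF:
                        cand = v[p][pi][pj] + log_t[(p, s)] + log_emit(s, i, j)
                        if cand > v[s][i][j]:
                            v[s][i][j], back[s][i][j] = cand, p
    # Trace back from the best final state to the start.
    s, i, j, path = max(states, key=lambda q: v[q][n][m]), n, m, []
    while (i, j) != (0, 0):
        path.append(s)
        s, i, j = back[s][i][j], i - step[s][0], j - step[s][1]
    return "".join(reversed(path))


print(pair_hmm_viterbi("ACGTT", "ACGT"))
```

The returned string labels every alignment column with a state. A gene-finding pair HMM extends the same recursion with states for exons, introns and intergenic regions, so the state path annotates both sequences while aligning them.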

23
Doublescan
31
Refinements
  • Score all potential splice sites
  • -> distinguish between true and false splice
    sites by rescaling the nominal transition
    probabilities to the splice site states
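
The slides do not say how the rescaling is computed. One possible reading, sketched below purely as an assumption, is that each candidate site receives a score and the nominal transition probability into the splice site state is multiplied by that score. The names `rescaled_log_transitions`, `toy_donor_score` and `nominal_prob`, and the score values, are hypothetical and do not come from Doublescan.

```python
import math


def toy_donor_score(seq, i):
    """Hypothetical site score in (0, 1]: high only when the candidate
    matches the common donor consensus 'gtaagt', low otherwise."""
    return 0.9 if seq[i:i + 6] == "gtaagt" else 0.1


def rescaled_log_transitions(seq, nominal_prob, site_score):
    """Map each candidate donor position (a 'gt' dinucleotide) to the log of
    its rescaled transition probability into the splice site state."""
    out = {}
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "gt":
            out[i] = math.log(nominal_prob * site_score(seq, i))
    return out


scores = rescaled_log_transitions("aaggtaagtccgtcc", 0.01, toy_donor_score)
for pos, log_p in sorted(scores.items()):
    print(pos, round(log_p, 3))
```

Under this assumption every "gt" remains a possible splice site, but poorly scoring candidates become expensive for the most probable path to use.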

32
Refinements to Doublescan
  • Score all potential translation start sites
  • -> distinguish between true and false translation
    start sites by rescaling the nominal transition
    probabilities to the START START state