Title: Ab Initio Gene Discovery
1. Ab Initio Gene Discovery
- Alberto Riva
- MGM / UFGI
- GMS6181
2. Genome annotation
- Sequencing a genome is only the first step.
  Annotation turns a genome sequence into a source
  of useful information.
- The most basic and (relatively speaking) easiest
  annotation step is the identification of
  protein-coding genes.
3.
GAGGAACCGAGAGGCTGAGACTAACCCAGAAACATCCAATTCTCAAACT
GAAGCTCGCACTCTCGCCTCCAGCATGAAAGTCTCTGCCGCCCTTCTGTG
CCTGCTGCTCATAGCAGCCACCTTCATTCCCCAAGGGCTCGCTCAGCCAG
GTAAGGCCCCCTCTTCTTCTCCTTGAACCACATTGTCTTCTCTCTGAGTT
ATCATGGACCATCCAAGCAGACGTGGTACCCACAGTCTTGCTTTAACGCT
ACTTTTCCAAGATAAGGTGACTCAGAAAAGGACAAGGGGTGAGCCCAACC
ACACAGCTGCTGCTCGGCAGAGCCTGAACTAGAATTCCAGCTGTGAACCC
CAAATCCAGCTCCTTCCAGGATTCCAGCTCTGGGAACACACTCAGCGCAG
TTACTCCCCCAGCTGCTTCCAGCAGAGTTTGGGGATCAGGGTAATCAAAG
AGAGGGTGGGTGTGTAGGCTGTTTCCAGACACGCTGGAGACCCAGAATCT
GGTCTGTGCTTCATTCACCTTAGCTTCCAGAGACGGTGACTCTGCAGAGG
TAATGAGTATCAGGGAAACTCATGACCAGGCATAGCCTATTCAGAGTCTA
AAAGGAGGCTCATAGTGGGGCTCCCCAGCTGATCTTCCCTGGTGCTGATC
ATCTGGATTATTGGTCCGTCTTAATGACACTTGTAGGCATTATCTAGCTT
TAACAGCTCCTCCTTCTCTCTGTCCATTATCAATGTTATATACCCATTTT
ACAGCATAGGAAACTGAGTCATTGGGTCAAAGATCACATTCTAGCTCTGA
GGTATAGGCAGAAGCACTGGGATTTAATGAGCTCTTTGTCTTCTCCTGCC
TGCCTTTTGCTTTTTCCTCATGACTCTTTTCTGCTCTTAAGATCAGAATA
ATCCAGTTCATCCTAAAATGCTTTTTCTTTGTGGTTTATTTTCCAGATGC
AATCAATGCCCCAGTCACCTGCTGTTATAACTTCACCAATAGGAAGATCT
CAGTGCAGAGGCTCGCGAGCTATAGAAGAATCACCAGCAGCAAGTGTCCC
AAAGAAGCTGTGATGTGAGTTCAGCACACCAACCTTCCCTGGCCTGAAGT
TCTTCCTTGTGGAGCAAGGGACAAGCCTCATAAACCTAGAGTCAGAGAGT
GCACTATTTAACTTAATGTACAAAGGTTCCCAATGGGAAAACTGAGGCAC
CAAGGGAAAAAGTGAACCCCAACATCACTCTCCACCTGGGTGCCTATTCA
GAACACCCCAATTTCTTTAGCTTGAAGTCAGGATGGCTCCACCTGGACAC
CTATAGGAGCAGTTTGCCCTGGGTTCCCTCCTTCCACCTGCGTTCCTCCT
CTAGCTCCCATGGCAGCCCTTTGGTGCAGAATGGGCTGCACTTCTAGACC
AAAACTGCAAAGGAACTTCATCTAACTCTGTCCTCCCTCCCCACAGCTTC
AAGACCATTGTGGCCAAGGAGATCTGTGCTGACCCCAAGCAGAAGTGGGT
TCAGGATTCCATGGACCACCTGGACAAGCAAACCCAAACTCCGAAGACTT
GAACACTCACTCCACAACCCAAGAATCTGCAGCTAACTTATTTTCCCCTA
GCTTTCCCCAGACACCCTGTTTTATTTTATTATAATGAATTTTGTTTGTT
GATGTGAAACATTATGCCTTAAGTAATGTTAATTCTTATTTAAGTTATTG
ATGTTTTAAGTTTATCTTTCATGGTACTAGTGTTTTTTAGATACAGAGAC
TTGGGGAAATTGCTTTTCCTCTTGAACCACAGTTCTACCCCTGGGATGTT
TTGAGGGTCTTTGCAAGAATCATTAATACAAAGAATTTTTTTTAACATTC
CAATGCATTGCTAAAATATTATTGTGGAAATGAATATTTTGTAACTATTA
CACCAAATAAATATATTTTTGTAC
Is there a gene in this sequence? What is its
structure? What does it do?
5. Some numbers
- Human genes in RefSeq: 25,041
- Average number of exons: 10.35
- Average exon length: 270 bp
- Longest exon: 21,693 bp (exon 81 of 83 in gene
  MUC16; transcript length is 132,498 bp).
7. A few hints
- The sequence is a human gene transcript, on the
  forward strand.
- All splice junctions are canonical (GT-AG).
- Will an approach based on simple string search
  work?
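To get a feel for the string-search idea, the sketch below lists every occurrence of the candidate signals (ATG, the three stop codons, GT, AG) in a sequence. The helper name `find_all` and the short toy sequence are assumptions made for illustration; it is not the transcript shown on the slides.

```python
def find_all(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

# Toy sequence (hypothetical), used only to show the idea.
toy = "CCATGAAGGTAAGTCCCAGGTTGA"
signals = {
    "start (ATG)": ["ATG"],
    "stop (TAA/TAG/TGA)": ["TAA", "TAG", "TGA"],
    "donor (GT)": ["GT"],
    "acceptor (AG)": ["AG"],
}
# Collect the sorted hit positions for each kind of signal.
hits = {
    name: sorted(p for motif in motifs for p in find_all(toy, motif))
    for name, motifs in signals.items()
}
for name, positions in hits.items():
    print(name, positions)
```

Even in 24 bases there are several candidates of each kind. On a real transcript the hits vastly outnumber the true signals, which is why string search alone is not enough.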
12. So the problem is simply:
- Find one or more GT-AG pairs such that, after
  removing all the nucleotides between them, what
  remains (between one of the ATGs and one of the
  stop codons) can be correctly translated into an
  amino acid sequence.
- (Gene prediction is much, much easier if we know
  the cDNA or protein sequence: it becomes a
  sequence alignment problem.)
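The problem statement above can be tried by brute force on a very short sequence. The sketch below is a simplification under stated assumptions: it handles exactly one intron, requires the start and stop codons to lie entirely within an exon, and uses an unrealistically small minimum intron length. The names `single_intron_models` and `is_complete_cds` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = ("TAA", "TAG", "TGA")

def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

def is_complete_cds(cds):
    """True if cds starts with ATG, is a multiple of 3, and only its last codon is a stop."""
    if len(cds) < 6 or len(cds) % 3 or not cds.startswith("ATG"):
        return False
    protein = translate(cds)
    return protein.endswith("*") and "*" not in protein[:-1]

def single_intron_models(seq, min_intron=4):
    """List (atg, donor, acceptor_end, stop_end, peptide) for each valid one-intron model.

    Coordinates are 0-based; the intron is seq[donor:acceptor_end].
    """
    n = len(seq)
    starts = [i for i in range(n - 2) if seq[i:i + 3] == "ATG"]
    stop_ends = [i + 3 for i in range(n - 2) if seq[i:i + 3] in STOPS]
    donors = [i for i in range(n - 1) if seq[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(n - 1) if seq[i:i + 2] == "AG"]
    models = []
    # Exhaustive search over all combinations: fine for a toy, hopeless at scale.
    for atg, donor, acc_end, stop_end in product(starts, donors, acceptor_ends, stop_ends):
        if not (atg + 3 <= donor and donor + min_intron <= acc_end
                and acc_end + 3 <= stop_end):
            continue
        cds = seq[atg:donor] + seq[acc_end:stop_end]
        if is_complete_cds(cds):
            models.append((atg, donor, acc_end, stop_end, translate(cds)[:-1]))
    return models

# Toy gene: exon "ATGGCC", intron "GTAAGTTTTCAG", exon "AAATGA".
print(single_intron_models("ATGGCCGTAAGTTTTCAGAAATGA"))
```

The number of combinations grows very quickly with sequence length and with the number of introns allowed, so this exhaustive search is only useful for seeing why the problem is not as simple as it sounds.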
13. ORF finding
- Translate the DNA sequence in all three (or six)
  frames.
- Look for ORFs: stretches with no stop codons,
  longer than what would be expected at random.
- Look for coding exons within the ORFs.
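The first two steps can be sketched as follows: scan each reading frame for stop codons and report the stop-free stretches of at least a chosen length. This is a minimal sketch, not the method of any particular tool; the function names and the `min_len` parameter are assumptions, and frames are numbered 1 to 3 on each strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def open_frames(seq, frame, min_len=0):
    """Stop-free stretches in one frame (0, 1 or 2).

    Returns 0-based half-open (start, end) intervals whose length in bp
    is at least min_len; stop codons themselves are excluded.
    """
    orfs = []
    orf_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > orf_start and i - orf_start >= min_len:
                orfs.append((orf_start, i))
            orf_start = i + 3
    # Last stretch runs to the end of the final complete codon.
    end = frame + ((len(seq) - frame) // 3) * 3
    if end > orf_start and end - orf_start >= min_len:
        orfs.append((orf_start, end))
    return orfs

def six_frame_orfs(seq, min_len=0):
    """Map (strand, frame number) to the ORFs found in that frame."""
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame + 1)] = open_frames(s, frame, min_len)
    return result

toy = "ATGAAATAGCCCTGA"  # hypothetical example sequence
for key, orfs in six_frame_orfs(toy).items():
    print(key, orfs)
```

Coordinates reported for the "-" strand refer to the reverse-complemented sequence, not to the original one.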
14. ORF finding
- Looking for ORFs:
  - 3 stop codons out of 64: on average 1 stop
    every 21.3 codons, so the average ORF length
    is 64 bp.
  - Unfortunately, many exons are shorter. A 51 bp
    ORF could be an exon, or a false positive.
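The arithmetic behind these numbers can be checked in a few lines, under the simplifying assumption that all 64 codons are equally likely and independent (real sequences are not uniform). The function name `prob_orf_at_least` is hypothetical.

```python
STOP_FRACTION = 3 / 64            # 3 of the 64 codons are stop codons
mean_codons = 1 / STOP_FRACTION   # average spacing between stops, in codons
mean_bp = 3 * mean_codons         # the same spacing in base pairs

def prob_orf_at_least(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons

print(round(mean_codons, 1), round(mean_bp))   # 21.3 64
# A 51 bp stretch is 17 codons long.
print(round(prob_orf_at_least(17), 2))
```

Under this model a stop-free stretch of 17 codons (51 bp) arises by chance a bit less than half of the time, so a 51 bp ORF on its own is weak evidence for an exon.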
15. ORF finding
- Looking for exons within ORFs:
  - ORFs extend before and after exons. Identifying
    the exact start and end site is hard. Only works
    for coding exons.
  - Exons can be in any one of the three frames.
    Alternative splicing and mutually exclusive
    exons add complexity.
16. ORF finding - example
- Our sequence contains 85 stop codons:
  - 27 in frame 1
  - 35 in frame 2
  - 23 in frame 3
- Average ORF length is 65.5 bp, as expected.
- 29 ORFs out of 82 (35%) are longer than average.
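Counts like these can be produced with a short script. The sketch below counts stop codons per frame and measures the stop-free gaps between consecutive stops in a frame, which matches the relation 85 - 3 = 82 ORFs on the slide. The function names and the toy sequence are assumptions; frames are 0-based in the code (frame 0 corresponds to frame 1 on the slide), and the slide's exact length convention is not known, so the numbers may differ slightly when run on the full transcript.

```python
STOPS = ("TAA", "TAG", "TGA")

def stop_positions(seq, frame):
    """0-based positions of the stop codons that are in the given frame."""
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOPS]

def stops_per_frame(seq):
    """Number of in-frame stop codons for frames 0, 1 and 2."""
    return [len(stop_positions(seq, frame)) for frame in range(3)]

def inter_stop_lengths(seq, frame):
    """Lengths in bp of the stop-free gaps between consecutive in-frame stops."""
    pos = stop_positions(seq, frame)
    return [b - a - 3 for a, b in zip(pos, pos[1:])]

toy = "ATGTAACCTGAGGGTAGC"  # hypothetical example sequence
print(stops_per_frame(toy))
for frame in range(3):
    print(frame, inter_stop_lengths(toy, frame))
```

A frame with k stop codons yields k - 1 gaps, so three frames with 85 stops in total give 82 ORFs of this kind.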
17. ORFs in frame 2
GAGGAACCGAGAGGCTGAGACTAACC
CAGAAACATCCAATTCTCAAACTGAAGCTCGCACTCTCGCCTCCAGCATG
AAAGTCTCTGCCGCCCTTCTGTGCCTGCTGCTCATAGCAGCCACCTTCAT
TCCCCAAGGGCTCGCTCAGCCAGGTAAGGCCCCCTCTTCTTCTCCTTGAA
CCACATTGTCTTCTCTCTGAGTTATCATGGACCATCCAAGCAGACGTGGT
ACCCACAGTCTTGCTTTAACGCTACTTTTCCAAGATAAGGTGACTCAGAA
AAGGACAAGGGGTGAGCCCAACCACACAGCTGCTGCTCGGCAGAGCCTGA
ACTAGAATTCCAGCTGTGAACCCCAAATCCAGCTCCTTCCAGGATTCCAG
CTCTGGGAACACACTCAGCGCAGTTACTCCCCCAGCTGCTTCCAGCAGAG
TTTGGGGATCAGGGTAATCAAAGAGAGGGTGGGTGTGTAGGCTGTTTCCA
GACACGCTGGAGACCCAGAATCTGGTCTGTGCTTCATTCACCTTAGCTTC
CAGAGACGGTGACTCTGCAGAGGTAATGAGTATCAGGGAAACTCATGACC
AGGCATAGCCTATTCAGAGTCTAAAAGGAGGCTCATAGTGGGGCTCCCCA
GCTGATCTTCCCTGGTGCTGATCATCTGGATTATTGGTCCGTCTTAATGA
CACTTGTAGGCATTATCTAGCTTTAACAGCTCCTCCTTCTCTCTGTCCAT
TATCAATGTTATATACCCATTTTACAGCATAGGAAACTGAGTCATTGGGT
CAAAGATCACATTCTAGCTCTGAGGTATAGGCAGAAGCACTGGGATTTAA
TGAGCTCTTTGTCTTCTCCTGCCTGCCTTTTGCTTTTTCCTCATGACTCT
TTTCTGCTCTTAAGATCAGAATAATCCAGTTCATCCTAAAATGCTTTTTC
TTTGTGGTTTATTTTCCAGATGCAATCAATGCCCCAGTCACCTGCTGTTA
TAACTTCACCAATAGGAAGATCTCAGTGCAGAGGCTCGCGAGCTATAGAA
GAATCACCAGCAGCAAGTGTCCCAAAGAAGCTGTGATGTGAGTTCAGCAC
ACCAACCTTCCCTGGCCTGAAGTTCTTCCTTGTGGAGCAAGGGACAAGCC
TCATAAACCTAGAGTCAGAGAGTGCACTATTTAACTTAATGTACAAAGGT
TCCCAATGGGAAAACTGAGGCACCAAGGGAAAAAGTGAACCCCAACATCA
CTCTCCACCTGGGTGCCTATTCAGAACACCCCAATTTCTTTAGCTTGAAG
TCAGGATGGCTCCACCTGGACACCTATAGGAGCAGTTTGCCCTGGGTTCC
CTCCTTCCACCTGCGTTCCTCCTCTAGCTCCCATGGCAGCCCTTTGGTGC
AGAATGGGCTGCACTTCTAGACCAAAACTGCAAAGGAACTTCATCTAACT
CTGTCCTCCCTCCCCACAGCTTCAAGACCATTGTGGCCAAGGAGATCTGT
GCTGACCCCAAGCAGAAGTGGGTTCAGGATTCCATGGACCACCTGGACAA
GCAAACCCAAACTCCGAAGACTTGAACACTCACTCCACAACCCAAGAATC
TGCAGCTAACTTATTTTCCCCTAGCTTTCCCCAGACACCCTGTTTTATTT
TATTATAATGAATTTTGTTTGTTGATGTGAAACATTATGCCTTAAGTAAT
GTTAATTCTTATTTAAGTTATTGATGTTTTAAGTTTATCTTTCATGGTAC
TAGTGTTTTTTAGATACAGAGACTTGGGGAAATTGCTTTTCCTCTTGAAC
CACAGTTCTACCCCTGGGATGTTTTGAGGGTCTTTGCAAGAATCATTAAT
ACAAAGAATTTTTTTTAACATTCCAATGCATTGCTAAAATATTATTGTGG
AAATGAATATTTTGTAACTATTACACCAAATAAATATATTTTTGTAC
18. Help!
- More informative consensus sequences for splice
  sites:
  - AGGTRAGT (donor)
  - YYYNCAGG (acceptor)
  (R = A/G, Y = C/T)
- Information on expected length of exons and
  introns, codon usage, hexamer frequencies, C+G
  content.
- -> Probabilistic methods
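A consensus written with ambiguity codes can be scanned with a simple window comparison. The sketch below is a minimal illustration, not a real splice-site predictor; the function name `find_sites` and the `max_mismatches` parameter are assumptions.

```python
# Ambiguity codes used in the two consensus sequences.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

DONOR = "AGGTRAGT"
ACCEPTOR = "YYYNCAGG"

def find_sites(seq, consensus, max_mismatches=0):
    """Return (position, mismatches) for every window within max_mismatches."""
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        mismatches = sum(base not in IUPAC[code]
                         for base, code in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits

# Bases around the first exon/intron boundary of the example transcript
# (exon ends ...CCAG, intron starts GTAAGG...).
boundary = "CCAGGTAAGGCC"
print(find_sites(boundary, DONOR))                     # exact match only
print(find_sites(boundary, DONOR, max_mismatches=1))   # one mismatch allowed
print(find_sites("TTTCCAGGAA", ACCEPTOR))
```

If the exon end at position 149 reported by GENSCAN is taken as correct, the first donor site of the example sequence reads AGGTAAGG and differs from the consensus in its last base, so an exact consensus match would miss it. Real sites only resemble the consensus, which is one motivation for scoring sites probabilistically instead of matching them exactly.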
19. Probabilistic methods
- Probabilistic methods compute the probability
  that a feature (e.g. an exon) is present in a
  certain region of the input sequence.
- They are able to combine different forms of
  evidence into a score for each detected
  feature.
- A good example is GENSCAN, based on Hidden Markov
  Models (HMMs). http://genes.mit.edu/GENSCAN.html
20. GENSCAN model
- An HMM is a state machine, with probabilities
  describing transitions between states.
- It can take into account correlations between
  exons, to build complete and accurate models.
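To make the "state machine with probabilities" idea concrete, here is a toy two-state HMM (N = non-coding, C = coding) decoded with the Viterbi algorithm. It is only a sketch: GENSCAN's actual model is far richer, and every probability below is a made-up illustrative value, not a GENSCAN parameter.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy model)
START = {"N": 0.5, "C": 0.5}
# Hypothetical transition probabilities: states tend to persist.
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Hypothetical emission probabilities: the coding state is G/C-rich here.
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(seq):
    """Most probable state path for a non-empty DNA string (log space)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AAAAAAAAGCGCGCGCGC"))
```

The decoder labels the A-rich prefix as N and the G/C-rich suffix as C, switching state once because a single unlikely transition costs less than many unlikely emissions. A gene-finding HMM applies the same principle with states for exons, introns and other features.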
21. GENSCAN output
GENSCANW output for sequence 21:16:32
GENSCAN 1.0    Date run: 12-Jan-107    Time: 21:16:35
Sequence Transcript : 1923 bp : 45.66% C+G : Isochore 2 (43 - 51 C+G%)
Parameter matrix: HumanIso.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Init +     74    149   76  1  1   69  110    69 0.934   7.40
 1.02 Intr +    946   1063  118  2  1  105   76    94 0.992   9.42
 1.03 Term +   1446   1551  106  0  1  105   47   127 0.996   8.18
 1.04 PlyA +   1905   1910    6                               1.05

Predicted peptide sequence(s):

>Transcript|GENSCAN_predicted_peptide_1|99_aa
MKVSAALLCLLLIAATFIPQGLAQPDAINAPVTCCYNFTNRKISVQRLASYRRITSSKCP
KEAVIFKTIVAKEICADPKQKWVQDSMDHLDKQTQTPKT
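The exon table and the predicted peptide can be checked against each other with simple arithmetic. The sketch below copies the Begin/End coordinates and the peptide from the output above; the helper names are hypothetical, and coordinates are taken to be 1-based and inclusive on the forward strand.

```python
# (begin, end) of the three coding exons, as listed in the exon table.
EXONS = [(74, 149), (946, 1063), (1446, 1551)]
PEPTIDE = ("MKVSAALLCLLLIAATFIPQGLAQPDAINAPVTCCYNFTNRKISVQRLAS"
           "YRRITSSKCPKEAVIFKTIVAKEICADPKQKWVQDSMDHLDKQTQTPKT")

def exon_lengths(exons):
    """Length of each exon in bp."""
    return [end - begin + 1 for begin, end in exons]

def intron_lengths(exons):
    """Length in bp of the introns between consecutive exons."""
    return [next_begin - prev_end - 1
            for (_, prev_end), (next_begin, _) in zip(exons, exons[1:])]

def extract_cds(seq, exons):
    """Join the exon slices of a sequence (1-based inclusive coordinates)."""
    return "".join(seq[begin - 1:end] for begin, end in exons)

lengths = exon_lengths(EXONS)
cds_len = sum(lengths)
print(lengths, cds_len)          # exon lengths and total coding length
print(intron_lengths(EXONS))     # intron lengths
# Codons minus the stop codon should equal the peptide length.
print(cds_len // 3 - 1, len(PEPTIDE))
```

The three exons add up to 300 bp, that is 100 codons: 99 amino acids plus the stop codon, which agrees with the 99 aa peptide. Calling `extract_cds` on the full transcript with these coordinates would return the coding sequence itself.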
22.
gaggaaccgagaggctgagactaacccagaaacatccaattctcaaact
gaagctcgcactctcgcctccagcATGAAAGTCTCTGCCGCCCTTCTGTG
CCTGCTGCTCATAGCAGCCACCTTCATTCCCCAAGGGCTCGCTCAGCCAG
GTAAGGCCCCCTCTTCTTCTCCTTGAACCACATTGTCTTCTCTCTGAGTT
ATCATGGACCATCCAAGCAGACGTGGTACCCACAGTCTTGCTTTAACGCT
ACTTTTCCAAGATAAGGTGACTCAGAAAAGGACAAGGGGTGAGCCCAACC
ACACAGCTGCTGCTCGGCAGAGCCTGAACTAGAATTCCAGCTGTGAACCC
CAAATCCAGCTCCTTCCAGGATTCCAGCTCTGGGAACACACTCAGCGCAG
TTACTCCCCCAGCTGCTTCCAGCAGAGTTTGGGGATCAGGGTAATCAAAG
AGAGGGTGGGTGTGTAGGCTGTTTCCAGACACGCTGGAGACCCAGAATCT
GGTCTGTGCTTCATTCACCTTAGCTTCCAGAGACGGTGACTCTGCAGAGG
TAATGAGTATCAGGGAAACTCATGACCAGGCATAGCCTATTCAGAGTCTA
AAAGGAGGCTCATAGTGGGGCTCCCCAGCTGATCTTCCCTGGTGCTGATC
ATCTGGATTATTGGTCCGTCTTAATGACACTTGTAGGCATTATCTAGCTT
TAACAGCTCCTCCTTCTCTCTGTCCATTATCAATGTTATATACCCATTTT
ACAGCATAGGAAACTGAGTCATTGGGTCAAAGATCACATTCTAGCTCTGA
GGTATAGGCAGAAGCACTGGGATTTAATGAGCTCTTTGTCTTCTCCTGCC
TGCCTTTTGCTTTTTCCTCATGACTCTTTTCTGCTCTTAAGATCAGAATA
ATCCAGTTCATCCTAAAATGCTTTTTCTTTGTGGTTTATTTTCCAGATGC
AATCAATGCCCCAGTCACCTGCTGTTATAACTTCACCAATAGGAAGATCT
CAGTGCAGAGGCTCGCGAGCTATAGAAGAATCACCAGCAGCAAGTGTCCC
AAAGAAGCTGTGATGTGAGTTCAGCACACCAACCTTCCCTGGCCTGAAGT
TCTTCCTTGTGGAGCAAGGGACAAGCCTCATAAACCTAGAGTCAGAGAGT
GCACTATTTAACTTAATGTACAAAGGTTCCCAATGGGAAAACTGAGGCAC
CAAGGGAAAAAGTGAACCCCAACATCACTCTCCACCTGGGTGCCTATTCA
GAACACCCCAATTTCTTTAGCTTGAAGTCAGGATGGCTCCACCTGGACAC
CTATAGGAGCAGTTTGCCCTGGGTTCCCTCCTTCCACCTGCGTTCCTCCT
CTAGCTCCCATGGCAGCCCTTTGGTGCAGAATGGGCTGCACTTCTAGACC
AAAACTGCAAAGGAACTTCATCTAACTCTGTCCTCCCTCCCCACAGCTTC
AAGACCATTGTGGCCAAGGAGATCTGTGCTGACCCCAAGCAGAAGTGGGT
TCAGGATTCCATGGACCACCTGGACAAGCAAACCCAAACTCCGAAGACTT
GAacactcactccacaacccaagaatctgcagctaacttattttccccta
gctttccccagacaccctgttttattttattataatgaattttgtttgtt
gatgtgaaacattatgccttaagtaatgttaattcttatttaagttattg
atgttttaagtttatctttcatggtactagtgttttttagatacagagac
ttggggaaattgcttttcctcttgaaccacagttctacccctgggatgtt
ttgagggtctttgcaagaatcattaatacaaagaattttttttaacattc
caatgcattgctaaaatattattgtggaaatgaatattttgtaactatta
caccaaataaatatatttttgtac
23. What did we find?
- We discovered gene CCL2 (Small inducible cytokine
  A2 precursor), on chromosome 17.
- Next steps: look at similarity with other genes,
  homologs in different organisms, and protein
  sequence and structure in order to infer gene
  function.
24. State of the art
- In real life: long sequences containing many
  genes, on both strands. Overlapping genes. Genes
  in introns of other genes.
- 5% of RefSeq genes have incorrect structure
  (wrong translation).
- How could we emulate what the
  transcriptional/splicing machinery does?
25. Future challenges
- Detecting promoters, control regions,
  transcription factor binding sites, splicing
  enhancers and silencers
- Non-coding RNAs, micro-RNAs, etc.
- Insulators, boundary elements, matrix-attachment
  and scaffold-attachment regions, replication
  origins, recombination hotspots ???