Ab Initio Gene Discovery

1
Ab Initio Gene Discovery
  • Alberto Riva
  • MGM / UFGI
  • GMS6181

2
Genome annotation
  • Sequencing a genome is only the first step.
    Annotation turns a genome sequence into a source
    of useful information.
  • The most basic and (relatively speaking) easiest
    annotation step is the identification of
    protein-coding genes.

3
  • GAGGAACCGAGAGGCTGAGACTAACCCAGAAACATCCAATTCTCAAACT
    GAAGCTCGCACTCTCGCCTCCAGCATGAAAGTCTCTGCCGCCCTTCTGTG
    CCTGCTGCTCATAGCAGCCACCTTCATTCCCCAAGGGCTCGCTCAGCCAG
    GTAAGGCCCCCTCTTCTTCTCCTTGAACCACATTGTCTTCTCTCTGAGTT
    ATCATGGACCATCCAAGCAGACGTGGTACCCACAGTCTTGCTTTAACGCT
    ACTTTTCCAAGATAAGGTGACTCAGAAAAGGACAAGGGGTGAGCCCAACC
    ACACAGCTGCTGCTCGGCAGAGCCTGAACTAGAATTCCAGCTGTGAACCC
    CAAATCCAGCTCCTTCCAGGATTCCAGCTCTGGGAACACACTCAGCGCAG
    TTACTCCCCCAGCTGCTTCCAGCAGAGTTTGGGGATCAGGGTAATCAAAG
    AGAGGGTGGGTGTGTAGGCTGTTTCCAGACACGCTGGAGACCCAGAATCT
    GGTCTGTGCTTCATTCACCTTAGCTTCCAGAGACGGTGACTCTGCAGAGG
    TAATGAGTATCAGGGAAACTCATGACCAGGCATAGCCTATTCAGAGTCTA
    AAAGGAGGCTCATAGTGGGGCTCCCCAGCTGATCTTCCCTGGTGCTGATC
    ATCTGGATTATTGGTCCGTCTTAATGACACTTGTAGGCATTATCTAGCTT
    TAACAGCTCCTCCTTCTCTCTGTCCATTATCAATGTTATATACCCATTTT
    ACAGCATAGGAAACTGAGTCATTGGGTCAAAGATCACATTCTAGCTCTGA
    GGTATAGGCAGAAGCACTGGGATTTAATGAGCTCTTTGTCTTCTCCTGCC
    TGCCTTTTGCTTTTTCCTCATGACTCTTTTCTGCTCTTAAGATCAGAATA
    ATCCAGTTCATCCTAAAATGCTTTTTCTTTGTGGTTTATTTTCCAGATGC
    AATCAATGCCCCAGTCACCTGCTGTTATAACTTCACCAATAGGAAGATCT
    CAGTGCAGAGGCTCGCGAGCTATAGAAGAATCACCAGCAGCAAGTGTCCC
    AAAGAAGCTGTGATGTGAGTTCAGCACACCAACCTTCCCTGGCCTGAAGT
    TCTTCCTTGTGGAGCAAGGGACAAGCCTCATAAACCTAGAGTCAGAGAGT
    GCACTATTTAACTTAATGTACAAAGGTTCCCAATGGGAAAACTGAGGCAC
    CAAGGGAAAAAGTGAACCCCAACATCACTCTCCACCTGGGTGCCTATTCA
    GAACACCCCAATTTCTTTAGCTTGAAGTCAGGATGGCTCCACCTGGACAC
    CTATAGGAGCAGTTTGCCCTGGGTTCCCTCCTTCCACCTGCGTTCCTCCT
    CTAGCTCCCATGGCAGCCCTTTGGTGCAGAATGGGCTGCACTTCTAGACC
    AAAACTGCAAAGGAACTTCATCTAACTCTGTCCTCCCTCCCCACAGCTTC
    AAGACCATTGTGGCCAAGGAGATCTGTGCTGACCCCAAGCAGAAGTGGGT
    TCAGGATTCCATGGACCACCTGGACAAGCAAACCCAAACTCCGAAGACTT
    GAACACTCACTCCACAACCCAAGAATCTGCAGCTAACTTATTTTCCCCTA
    GCTTTCCCCAGACACCCTGTTTTATTTTATTATAATGAATTTTGTTTGTT
    GATGTGAAACATTATGCCTTAAGTAATGTTAATTCTTATTTAAGTTATTG
    ATGTTTTAAGTTTATCTTTCATGGTACTAGTGTTTTTTAGATACAGAGAC
    TTGGGGAAATTGCTTTTCCTCTTGAACCACAGTTCTACCCCTGGGATGTT
    TTGAGGGTCTTTGCAAGAATCATTAATACAAAGAATTTTTTTTAACATTC
    CAATGCATTGCTAAAATATTATTGTGGAAATGAATATTTTGTAACTATTA
    CACCAAATAAATATATTTTTGTAC

Is there a gene in this sequence? What is its
structure? What does it do?
4
(No Transcript)
5
Some numbers
  • Human genes in RefSeq: 25,041
  • Average number of exons: 10.35
  • Average exon length: 270 bp
  • Longest exon: 21,693 bp (exon 81 of 83 in gene
    MUC16; transcript length is 132,498 bp).

6
  • (Same sequence as slide 3.)

7
A few hints
  • The sequence is a human gene transcript, on the
    forward strand
  • All splice junctions are canonical (GT-AG)
  • Will an approach based on simple string search
    work?
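One way to get a feel for this question is to count how often each signal occurs. The sketch below is only an illustration: the helper names `find_all` and `signal_counts` and the toy sequence are assumptions, not part of the slides.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_all(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

def signal_counts(seq):
    """Count candidate start codons, stop codons, donors (GT) and acceptors (AG)."""
    seq = seq.upper()
    return {
        "ATG": len(find_all(seq, "ATG")),
        "stop": sum(len(find_all(seq, s)) for s in STOP_CODONS),
        "GT": len(find_all(seq, "GT")),
        "AG": len(find_all(seq, "AG")),
    }

# Toy sequence, invented for illustration only.
toy = "CCATGAAAGTAAGTCCTTTCAGGTCTAAGG"
counts = signal_counts(toy)
print(counts)
```

Even this 30-base toy contains 3 GT and 4 AG dinucleotides, so the number of candidate GT-AG pairs grows quickly with sequence length. Plain string search finds every true signal, but it cannot tell the true ones from the chance matches.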

8
  • (Same sequence as slide 3.)

9
  • (Same sequence as slide 3.)

10
  • (Same sequence as slide 3.)

11
  • (Same sequence as slide 3.)

12
So the problem is simply:
  • Find one or more GT-AG pairs such that, after
    removing each pair together with all the
    nucleotides between them, what remains (between
    one of the ATGs and one of the stop codons) can
    be correctly translated into an amino acid
    sequence.
  • (Gene prediction is much, much easier if we know
    the cDNA or protein sequence: it becomes a
    sequence alignment problem.)
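The check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the 0-based end-exclusive intron coordinates, and the toy sequence are all invented here, and the input is assumed to contain only upper-case A, C, G, T.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(seq, introns):
    """Remove introns given as 0-based, end-exclusive (start, end) pairs."""
    pieces, prev = [], 0
    for start, end in sorted(introns):
        intron = seq[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} is not GT-AG")
        pieces.append(seq[prev:start])
        prev = end
    pieces.append(seq[prev:])
    return "".join(pieces)

def translates_cleanly(cds):
    """True if cds runs from ATG to a stop codon with no earlier in-frame stop."""
    if len(cds) % 3 or not cds.startswith("ATG"):
        return False
    protein = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))
    return protein.endswith("*") and "*" not in protein[:-1]

# Toy gene: exon "ATGAAA", a 16-base GT...AG intron, exon "GTCTAA".
toy = "ATGAAAGTAAGTCCCTTTTCAGGTCTAA"
mrna = splice(toy, [(6, 22)])
print(mrna, translates_cleanly(mrna))
```

The hard part is not this check but the search: every combination of ATG, stop codon and GT-AG pairs is a candidate, and many wrong combinations also translate cleanly.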

13
ORF finding
  1. Translate the DNA sequence in all three (or six)
    frames.
  2. Look for ORFs: stretches with no stop codons that
    are longer than would be expected at random.
  3. Look for coding exons within the ORFs.
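Steps 1 and 2 can be sketched as follows. The code scans codons directly rather than translating first, which gives the same stop-free stretches. Function names, the `min_len` parameter (in bp) and the toy sequence are assumptions made for this example; step 3 is not attempted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def open_reading_frames(seq, frame, min_len=1):
    """Stop-free stretches in one frame (0, 1 or 2).

    Returns 0-based, end-exclusive (start, end) pairs that exclude the
    stop codon itself; stretches shorter than min_len bp are dropped.
    """
    seq = seq.upper()
    orfs = []
    orf_start = frame
    codon_end = frame
    for i in range(frame, len(seq) - 2, 3):
        codon_end = i + 3
        if seq[i:i + 3] in STOP_CODONS:
            if i - orf_start >= min_len:
                orfs.append((orf_start, i))
            orf_start = codon_end
    # Trailing stretch that runs to the last complete codon.
    if codon_end - orf_start >= min_len:
        orfs.append((orf_start, codon_end))
    return orfs

def six_frame_orfs(seq, min_len=1):
    """ORFs for frames +1..+3 and -1..-3.

    Minus-strand coordinates refer to the reverse-complemented sequence.
    """
    rc = reverse_complement(seq)
    result = {}
    for frame in range(3):
        result[f"+{frame + 1}"] = open_reading_frames(seq, frame, min_len)
        result[f"-{frame + 1}"] = open_reading_frames(rc, frame, min_len)
    return result

toy = "ATGAAATAGCCCGGGTGA"  # invented example
print(open_reading_frames(toy, 0))
```

Raising `min_len` above the length expected at random is what turns this from a list of all stop-free stretches into a list of candidates worth inspecting.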

14
ORF finding
  • Looking for ORFs
  • 3 stop codons out of 64: on average 1 stop every
    21.3 codons. Average ORF length is 64 bp.
  • Unfortunately, many exons are shorter. A 51 bp ORF
    could be an exon, or a false positive.
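The arithmetic behind these numbers is short enough to spell out. The sketch assumes that all 64 codons are equally likely and independent, which real genomes only approximate. The helper name `p_no_stop` is made up for this example.

```python
from fractions import Fraction

# Assumption: all 64 codons are equally likely and independent.
p_stop = Fraction(3, 64)                    # 3 of the 64 codons are stops
mean_spacing_codons = 1 / p_stop            # 64/3, about 21.3 codons
mean_spacing_bp = 3 * mean_spacing_codons   # 64 bp

def p_no_stop(n_codons):
    """Probability that n_codons random codons in a row contain no stop."""
    return float((1 - p_stop) ** n_codons)

print(float(mean_spacing_codons), mean_spacing_bp)
print(p_no_stop(17))   # a 51 bp stretch is 17 codons
```

Under this model a random 17-codon (51 bp) window is stop-free with probability (61/64)^17, roughly 0.44. That is why an ORF of this length is weak evidence on its own.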

15
ORF finding
  • Looking for exons within ORFs
  • ORFs extend before and after exons. Identifying
    the exact start and end site is hard. Only works
    for coding exons.
  • Exons can be in any one of the three frames.
    Alternative splicing and mutually exclusive exons
    add complexity.

16
ORF finding - example
  • Our sequence contains 85 stop codons:
  • 27 in frame 1
  • 35 in frame 2
  • 23 in frame 3
  • Average ORF length is 65.5 bp, as expected
  • 29 ORFs out of 82 (35%) are longer than average.
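A per-frame tally like the one above could be produced with a sketch along these lines. The function name is hypothetical, and mapping list index 0 to "frame 1" is an assumption about the slide's numbering. The toy sequence is invented, so the printed counts are not the slide's.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stops_per_frame(seq):
    """Return [n1, n2, n3]: stop codon counts in forward frames 1 to 3.

    A stop starting at 0-based position i is counted in frame (i % 3) + 1.
    """
    seq = seq.upper()
    counts = [0, 0, 0]
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            counts[i % 3] += 1
    return counts

toy = "CCATGAAAGTAAGTCCTTTCAGGTCTAAGG"  # invented example
print(stops_per_frame(toy))
```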

17
ORFs in frame 2
GAGGAACCGAGAGGCTGAGACTAACC
CAGAAACATCCAATTCTCAAACTGAAGCTCGCACTCTCGCCTCCAGCATG
AAAGTCTCTGCCGCCCTTCTGTGCCTGCTGCTCATAGCAGCCACCTTCAT
TCCCCAAGGGCTCGCTCAGCCAGGTAAGGCCCCCTCTTCTTCTCCTTGAA
CCACATTGTCTTCTCTCTGAGTTATCATGGACCATCCAAGCAGACGTGGT
ACCCACAGTCTTGCTTTAACGCTACTTTTCCAAGATAAGGTGACTCAGAA
AAGGACAAGGGGTGAGCCCAACCACACAGCTGCTGCTCGGCAGAGCCTGA
ACTAGAATTCCAGCTGTGAACCCCAAATCCAGCTCCTTCCAGGATTCCAG
CTCTGGGAACACACTCAGCGCAGTTACTCCCCCAGCTGCTTCCAGCAGAG
TTTGGGGATCAGGGTAATCAAAGAGAGGGTGGGTGTGTAGGCTGTTTCCA
GACACGCTGGAGACCCAGAATCTGGTCTGTGCTTCATTCACCTTAGCTTC
CAGAGACGGTGACTCTGCAGAGGTAATGAGTATCAGGGAAACTCATGACC
AGGCATAGCCTATTCAGAGTCTAAAAGGAGGCTCATAGTGGGGCTCCCCA
GCTGATCTTCCCTGGTGCTGATCATCTGGATTATTGGTCCGTCTTAATGA
CACTTGTAGGCATTATCTAGCTTTAACAGCTCCTCCTTCTCTCTGTCCAT
TATCAATGTTATATACCCATTTTACAGCATAGGAAACTGAGTCATTGGGT
CAAAGATCACATTCTAGCTCTGAGGTATAGGCAGAAGCACTGGGATTTAA
TGAGCTCTTTGTCTTCTCCTGCCTGCCTTTTGCTTTTTCCTCATGACTCT
TTTCTGCTCTTAAGATCAGAATAATCCAGTTCATCCTAAAATGCTTTTTC
TTTGTGGTTTATTTTCCAGATGCAATCAATGCCCCAGTCACCTGCTGTTA
TAACTTCACCAATAGGAAGATCTCAGTGCAGAGGCTCGCGAGCTATAGAA
GAATCACCAGCAGCAAGTGTCCCAAAGAAGCTGTGATGTGAGTTCAGCAC
ACCAACCTTCCCTGGCCTGAAGTTCTTCCTTGTGGAGCAAGGGACAAGCC
TCATAAACCTAGAGTCAGAGAGTGCACTATTTAACTTAATGTACAAAGGT
TCCCAATGGGAAAACTGAGGCACCAAGGGAAAAAGTGAACCCCAACATCA
CTCTCCACCTGGGTGCCTATTCAGAACACCCCAATTTCTTTAGCTTGAAG
TCAGGATGGCTCCACCTGGACACCTATAGGAGCAGTTTGCCCTGGGTTCC
CTCCTTCCACCTGCGTTCCTCCTCTAGCTCCCATGGCAGCCCTTTGGTGC
AGAATGGGCTGCACTTCTAGACCAAAACTGCAAAGGAACTTCATCTAACT
CTGTCCTCCCTCCCCACAGCTTCAAGACCATTGTGGCCAAGGAGATCTGT
GCTGACCCCAAGCAGAAGTGGGTTCAGGATTCCATGGACCACCTGGACAA
GCAAACCCAAACTCCGAAGACTTGAACACTCACTCCACAACCCAAGAATC
TGCAGCTAACTTATTTTCCCCTAGCTTTCCCCAGACACCCTGTTTTATTT
TATTATAATGAATTTTGTTTGTTGATGTGAAACATTATGCCTTAAGTAAT
GTTAATTCTTATTTAAGTTATTGATGTTTTAAGTTTATCTTTCATGGTAC
TAGTGTTTTTTAGATACAGAGACTTGGGGAAATTGCTTTTCCTCTTGAAC
CACAGTTCTACCCCTGGGATGTTTTGAGGGTCTTTGCAAGAATCATTAAT
ACAAAGAATTTTTTTTAACATTCCAATGCATTGCTAAAATATTATTGTGG
AAATGAATATTTTGTAACTATTACACCAAATAAATATATTTTTGTAC
18
Help!
  • More informative consensus sequences for splice
    sites:
  • Donor: AGGTRAGT
  • Acceptor: YYYNCAGG (R = A/G, Y = C/T).
  • Information on expected length of exons and
    introns, codon usage, hexamer frequencies, CG
    content.
  • → Probabilistic methods

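The two consensus strings can be turned into regular expressions by expanding the IUPAC codes. This is a sketch under stated assumptions: the names `IUPAC`, `consensus_to_regex`, `DONOR` and `ACCEPTOR` are invented, and strict matching is only the simplest way to use a consensus.

```python
import re

# IUPAC codes used on the slide, expanded to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a strict regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus))

DONOR = consensus_to_regex("AGGTRAGT")
ACCEPTOR = consensus_to_regex("YYYNCAGG")

# Bases around the end of the first exon that GENSCAN reports for the
# example sequence (positions 149/150): ...CTCAGCCAG|GTAAGGCCCC
reported_donor = "CTCAGCCAGGTAAGGCCCC"

print(bool(DONOR.search("CAGGTAAGTAC")))    # idealised donor
print(bool(ACCEPTOR.search("TTTCCAGG")))    # idealised acceptor
print(bool(DONOR.search(reported_donor)))   # donor from the example sequence
```

The last search finds nothing: that donor reads AGGTAAGG, one base off the consensus. A consensus describes what is most common at each position, not what is required, which is one reason to score sites probabilistically rather than match them exactly.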
19
Probabilistic methods
  • Probabilistic methods compute the probability
    that a feature (e.g. an exon) is present in a
    certain region of the input sequence.
  • They are able to combine different forms of
    evidence into a score for each detected
    feature.
  • A good example is GENSCAN, based on Hidden Markov
    Models (HMMs). http://genes.mit.edu/GENSCAN.html

20
GENSCAN model
  • An HMM is a state machine, with probabilities
    describing transitions between states.
  • Can take into account correlations between
    exons, to build complete and accurate models.
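To make the "state machine with probabilities" idea concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch only: the states, the transition and emission probabilities, and the test sequence are all made up for illustration, and GENSCAN's real model is far richer than this.

```python
import math

STATES = ("noncoding", "coding")

# Illustrative, made-up parameters (NOT GENSCAN's trained values).
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

def viterbi(seq):
    """Most probable state path for a non-empty A/C/G/T string (log space)."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for being in state s at this base.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("AAAAAAAAGGGGGGGG")
print(path)
```

With these toy numbers the decoder labels the A-rich half "noncoding" and the G-rich half "coding". It switches state only once because each switch costs a factor of 0.1 against 0.9 for staying, which is how transition probabilities discourage fragmented predictions.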

21
GENSCAN output
  • GENSCANW output for sequence 21:16:32
  • GENSCAN 1.0    Date run: 12-Jan-107    Time: 21:16:35
  • Sequence Transcript : 1923 bp : 45.66% C+G :
    Isochore 2 (43 - 51 C+G%)
  • Parameter matrix: HumanIso.smat
  • Predicted genes/exons:
    Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
    ----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
     1.01 Init +     74    149   76  1  1   69  110    69 0.934   7.40
     1.02 Intr +    946   1063  118  2  1  105   76    94 0.992   9.42
     1.03 Term +   1446   1551  106  0  1  105   47   127 0.996   8.18
     1.04 PlyA +   1905   1910    6                               1.05
  • Predicted peptide sequence(s):
    >Transcript|GENSCAN_predicted_peptide_1|99_aa
    MKVSAALLCLLLIAATFIPQGLAQPDAINAPVTCCYNFTNRKISVQRLAS
    YRRITSSKCPKEAVIFKTIVAKEICADPKQKWVQDSMDHLDKQTQTPKT
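A quick consistency check on this output: the three coding exons should add up to a whole number of codons, one of which is the stop. The sketch below copies the exon coordinates and the peptide from the output above. The variable names are invented, and the coordinates are taken to be 1-based and inclusive, which is what the Len column implies.

```python
# (type, begin, end) for the coding exons in the GENSCAN table above,
# taken as 1-based inclusive coordinates.
exons = [("Init", 74, 149), ("Intr", 946, 1063), ("Term", 1446, 1551)]

lengths = [end - begin + 1 for _, begin, end in exons]
cds_length = sum(lengths)
n_codons, remainder = divmod(cds_length, 3)

# Introns implied by consecutive exons (same coordinate convention).
introns = [(exons[i][2] + 1, exons[i + 1][1] - 1) for i in range(len(exons) - 1)]

peptide = ("MKVSAALLCLLLIAATFIPQGLAQPDAINAPVTCCYNFTNRKISVQRLAS"
           "YRRITSSKCPKEAVIFKTIVAKEICADPKQKWVQDSMDHLDKQTQTPKT")

print(lengths, cds_length, n_codons, remainder)
print(introns, len(peptide))
```

The exon lengths (76, 118 and 106 bp) match the Len column and sum to 300 bp, which is 100 codons: 99 amino acids plus the stop codon. This agrees with the 99-residue predicted peptide.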

22
gaggaaccgagaggctgagactaacccagaaacatccaattctcaaact
gaagctcgcactctcgcctccagcATGAAAGTCTCTGCCGCCCTTCTGTG
CCTGCTGCTCATAGCAGCCACCTTCATTCCCCAAGGGCTCGCTCAGCCAG
GTAAGGCCCCCTCTTCTTCTCCTTGAACCACATTGTCTTCTCTCTGAGTT
ATCATGGACCATCCAAGCAGACGTGGTACCCACAGTCTTGCTTTAACGCT
ACTTTTCCAAGATAAGGTGACTCAGAAAAGGACAAGGGGTGAGCCCAACC
ACACAGCTGCTGCTCGGCAGAGCCTGAACTAGAATTCCAGCTGTGAACCC
CAAATCCAGCTCCTTCCAGGATTCCAGCTCTGGGAACACACTCAGCGCAG
TTACTCCCCCAGCTGCTTCCAGCAGAGTTTGGGGATCAGGGTAATCAAAG
AGAGGGTGGGTGTGTAGGCTGTTTCCAGACACGCTGGAGACCCAGAATCT
GGTCTGTGCTTCATTCACCTTAGCTTCCAGAGACGGTGACTCTGCAGAGG
TAATGAGTATCAGGGAAACTCATGACCAGGCATAGCCTATTCAGAGTCTA
AAAGGAGGCTCATAGTGGGGCTCCCCAGCTGATCTTCCCTGGTGCTGATC
ATCTGGATTATTGGTCCGTCTTAATGACACTTGTAGGCATTATCTAGCTT
TAACAGCTCCTCCTTCTCTCTGTCCATTATCAATGTTATATACCCATTTT
ACAGCATAGGAAACTGAGTCATTGGGTCAAAGATCACATTCTAGCTCTGA
GGTATAGGCAGAAGCACTGGGATTTAATGAGCTCTTTGTCTTCTCCTGCC
TGCCTTTTGCTTTTTCCTCATGACTCTTTTCTGCTCTTAAGATCAGAATA
ATCCAGTTCATCCTAAAATGCTTTTTCTTTGTGGTTTATTTTCCAGATGC
AATCAATGCCCCAGTCACCTGCTGTTATAACTTCACCAATAGGAAGATCT
CAGTGCAGAGGCTCGCGAGCTATAGAAGAATCACCAGCAGCAAGTGTCCC
AAAGAAGCTGTGATGTGAGTTCAGCACACCAACCTTCCCTGGCCTGAAGT
TCTTCCTTGTGGAGCAAGGGACAAGCCTCATAAACCTAGAGTCAGAGAGT
GCACTATTTAACTTAATGTACAAAGGTTCCCAATGGGAAAACTGAGGCAC
CAAGGGAAAAAGTGAACCCCAACATCACTCTCCACCTGGGTGCCTATTCA
GAACACCCCAATTTCTTTAGCTTGAAGTCAGGATGGCTCCACCTGGACAC
CTATAGGAGCAGTTTGCCCTGGGTTCCCTCCTTCCACCTGCGTTCCTCCT
CTAGCTCCCATGGCAGCCCTTTGGTGCAGAATGGGCTGCACTTCTAGACC
AAAACTGCAAAGGAACTTCATCTAACTCTGTCCTCCCTCCCCACAGCTTC
AAGACCATTGTGGCCAAGGAGATCTGTGCTGACCCCAAGCAGAAGTGGGT
TCAGGATTCCATGGACCACCTGGACAAGCAAACCCAAACTCCGAAGACTT
GAacactcactccacaacccaagaatctgcagctaacttattttccccta
gctttccccagacaccctgttttattttattataatgaattttgtttgtt
gatgtgaaacattatgccttaagtaatgttaattcttatttaagttattg
atgttttaagtttatctttcatggtactagtgttttttagatacagagac
ttggggaaattgcttttcctcttgaaccacagttctacccctgggatgtt
ttgagggtctttgcaagaatcattaatacaaagaattttttttaacattc
caatgcattgctaaaatattattgtggaaatgaatattttgtaactatta
caccaaataaatatatttttgtac
23
What did we find?
  • We discovered gene CCL2 (Small inducible cytokine
    A2 precursor), on chromosome 17.
  • Next steps: look at similarity with other genes,
    homologs in different organisms, protein sequence
    and structure, in order to infer gene function.

24
State of the art
  • In real life: long sequences containing many
    genes, on both strands. Overlapping genes. Genes
    in introns of other genes.
  • 5% of RefSeq genes have incorrect structure
    (wrong translation).
  • How could we emulate what the
    transcriptional/splicing machinery does?

25
Future challenges
  • Detecting promoters, control regions,
    transcription factor binding sites, splicing
    enhancers and silencers
  • Non-coding RNAs, micro-RNAs, etc.
  • Insulators, boundary elements, matrix-attachment
    and scaffold-attachment regions, replication
    origins, recombination hotspots ???