GENE FINDING - PowerPoint PPT Presentation

About This Presentation
Title:

GENE FINDING

Description:

Artemis genome viewer. Coding sequence vs non coding sequence. Gene finding software ... http://www.sanger.ac.uk/Software/Artemis ... – PowerPoint PPT presentation

Slides: 31
Provided by: neil5
Category:
Tags: finding | gene | artemis


Transcript and Presenter's Notes

Title: GENE FINDING


1
GENE FINDING
2
Module 3
  • Gene Finding
  • Sequence properties
  • Gene finding software
  • Homology
  • Protein domains
  • Pfam
  • Prosite
  • hmmer
  • Annotation
  • Database flat files
  • Artemis
  • Gene ontologies
  • Genome comparisons
  • Synteny
  • ACT

3
Gene finding
  • Artemis genome viewer
  • Coding sequence vs non coding sequence
  • Gene finding software
  • Homology between species
  • ESTs

4
DNA sequence
Gene finders
Blastn
Halfwise
Blastx
tRNA scan
RepeatMasker
Repeats
Promoters
Pseudo-Genes
rRNA
Genes
tRNA
Fasta
BlastP
Pfam
Prosite
Psort
SignalP
TMHMM
5
The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
6
Artemis
  • Artemis is a free DNA sequence viewer and
    annotation tool that allows visualization of
    sequence features and the results of analyses
    within the context of the sequence, and its
    six-frame translation.
  • http://www.sanger.ac.uk/Software/Artemis/
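
To make the idea of a six-frame translation concrete, here is a minimal Python sketch. It is not Artemis code: it assumes the standard genetic code, and the function names and the frame labels (+1 to +3, -1 to -3) are chosen purely for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (an assumption:
# some organisms and organelles use variant codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame=0):
    """Translate one frame (offset 0, 1 or 2); unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the three forward and three reverse translations of seq."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames
```

Calling `six_frame_translation("ATGAAATAA")` returns a dictionary with one protein string per frame; stop codons appear as `*`.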

7
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
8
DNA in Artemis
AT content
Forward translations
Reverse Translations
DNA and amino acids
9
Gene structure
  • IN TRYPANOSOMATIDS
  • Polycistronic structure
  • Genes occur on a single strand at a time.
  • Inflection points
  • No splicing

10
(No Transcript)
11
Trypanosome gene structure
12
GENE STRUCTURE IN MALARIA
  • Splicing
  • No polycistronic units
  • Can have small exons
  • Low complexity regions

13
Gene Structure
14
DONORS AND ACCEPTORS
TTTT AaaCAG Ccca gggg
ta GTAT g
nnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnn
Exon Any size
Intron 50-500bp
Exon Any size
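
As a rough illustration of how donor and acceptor sites constrain intron candidates, the sketch below lists every GT...AG span whose length falls in the 50-500 bp range given on the slide. It assumes only the canonical GT donor and AG acceptor dinucleotides; the function name and defaults are hypothetical, and a real gene finder would score much richer site models.

```python
def candidate_introns(seq, min_len=50, max_len=500):
    """Return (start, end) pairs where seq[start:end] begins with GT,
    ends with AG and has a length between min_len and max_len."""
    seq = seq.upper()
    # Start index of every GT dinucleotide (possible donor site).
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Exclusive end index of every AG dinucleotide (possible acceptor site).
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptors
        if min_len <= a - d <= max_len
    ]
```

Most spans found this way will be false positives, which is one reason sequence signals are combined with coding statistics and homology evidence.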
15
AT content
  • In AT-rich genomes, coding regions have a higher
    GC content than non-coding regions
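
A sliding-window GC calculation is one simple way to look for the GC-richer stretches described here. The following is a minimal sketch; the window and step sizes are arbitrary illustrative defaults, not values taken from Artemis.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def windowed_gc(seq, window=120, step=60):
    """Return (window start, GC fraction) for each full window along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Windows whose GC fraction stands clearly above the genome background are candidates for closer inspection, not confirmed coding regions.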

16
AT content
17
CODON USAGE
  • Codon bias differs between organisms.
  • Base composition in coding regions is constrained,
    but not in non-coding regions.
  • The codon usage for any particular gene can
    influence expression.

18
Codon usage
  • All organisms have a preferred set of codons.
  • Codon   Malaria   Trypanosoma
  • GUU     0.41      0.28
  • GUC     0.06      0.19
  • GUA     0.42      0.14
  • GUG     0.11      0.39
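
The four codons compared on this slide all encode valine, and each column sums to 1.00, so the values are fractions within that synonymous family. A minimal sketch of how such fractions could be computed from an in-frame coding sequence is shown below; the function names are illustrative only.

```python
from collections import Counter

# The four valine codons compared on the slide, written as RNA.
VALINE = ("GUU", "GUC", "GUA", "GUG")


def codon_counts(cds):
    """Count in-frame codons of a coding sequence, reported as RNA triplets."""
    rna = cds.upper().replace("T", "U")
    return Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))


def synonymous_fractions(cds, codons=VALINE):
    """Fraction of each codon within one synonymous family."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}
```

In practice the counts would be pooled over many genes from one organism before the fractions are compared between species.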

19
Codon Usage
  • http://www.kazusa.or.jp/codon/

20
Codon Usage Table
UUU  34.3 (26847)   UCU  15.3 (11956)   UAU  45.6 (35709)   UGU  15.3 (11942)
UUC   7.3 ( 5719)   UCC   5.3 ( 4141)   UAC   5.5 ( 4340)   UGC   2.4 ( 1872)
UUA  49.2 (38527)   UCA  18.2 (14239)   UAA   1.0 (  813)   UGA   0.2 (  188)
UUG  10.1 ( 7911)   UCG   2.8 ( 2154)   UAG   0.2 (  123)   UGG   5.2 ( 4066)
CUU   8.7 ( 6776)   CCU   9.1 ( 7148)   CAU  19.5 (15287)   CGU   3.3 ( 2561)
CUC   1.7 ( 1354)   CCC   2.5 ( 1982)   CAC   3.9 ( 3020)   CGC   0.5 (  354)
CUA   5.4 ( 4217)   CCA  13.1 (10221)   CAA  25.1 (19650)   CGA   2.4 ( 1878)
CUG   1.3 ( 1044)   CCG   0.9 (  742)   CAG   3.3 ( 2598)   CGG   0.2 (  184)
AUU  34.0 (26611)   ACU  12.8 (10050)   AAU 105.5 (82591)   AGU  21.6 (16899)
AUC   5.9 ( 4636)   ACC   5.5 ( 4312)   AAC  18.5 (14518)   AGC   3.8 ( 2994)
AUA  44.7 (34976)   ACA  22.8 (17822)   AAA  90.5 (70863)   AGA  16.9 (13213)
AUG  20.9 (16326)   ACG   3.8 ( 2951)   AAG  19.2 (15056)   AGG   3.9 ( 3091)
GUU  18.1 (14200)   GCU  12.5 ( 9811)   GAU  55.5 (43424)   GGU  16.6 (12960)
GUC   2.6 ( 2063)   GCC   3.2 ( 2541)   GAC   8.6 ( 6696)   GGC   1.6 ( 1269)
GUA  18.2 (14258)   GCA  12.6 ( 9871)   GAA  65.8 (51505)   GGA  16.7 (13043)
GUG   4.9 ( 3806)   GCG   1.1 (  890)   GAG  10.1 ( 7878)   GGG   2.9 ( 2243)
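
The table entries can be turned into within-family fractions with a short calculation. The sketch below assumes the usual layout of such tables (codon, frequency per thousand codons, raw count in parentheses) and uses the four valine entries from the table above; the function name is illustrative.

```python
# Valine entries copied from the table above (frequency per thousand codons,
# assuming the usual codon-usage-table convention).
valine_per_thousand = {"GUU": 18.1, "GUC": 2.6, "GUA": 18.2, "GUG": 4.9}


def relative_usage(per_thousand):
    """Normalise per-thousand frequencies so one codon family sums to 1."""
    total = sum(per_thousand.values())
    return {codon: round(freq / total, 2) for codon, freq in per_thousand.items()}


print(relative_usage(valine_per_thousand))
```

The result (0.41, 0.06, 0.42, 0.11) matches the Malaria column of the GUU/GUC/GUA/GUG comparison on the earlier codon usage slide.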
21
Codon Usage in Artemis
Forward frames
Reverse frames
22
GC frame plot
  • Plots the third position GC content of each frame
    of a DNA sequence.
  • In coding DNA the GC content of the 3rd base is
    often higher.
  • A good predictor of coding regions in malaria and
    trypanosomes.
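
The calculation behind a GC frame plot can be sketched as follows: for each window along the sequence, compute the GC fraction of the third codon position separately for each of the three forward frames. This is an illustrative reimplementation under assumed window and step sizes, not the code used by any particular viewer.

```python
def gc3_by_frame(seq, window=120, step=30):
    """Return (window start, (gc3 frame 0, gc3 frame 1, gc3 frame 2)) per window.

    Frames are defined on absolute sequence coordinates, so frame k has its
    third codon positions at indices where (index - k) % 3 == 2.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        row = []
        for frame in range(3):
            thirds = [
                base
                for i, base in enumerate(chunk)
                if (start + i - frame) % 3 == 2
            ]
            gc = sum(base in "GC" for base in thirds)
            row.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(row)))
    return rows
```

A frame whose third-position GC rises well above the other two over a long stretch suggests a coding region in that frame; the reverse strand can be handled by running the same function on the reverse complement.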

23
(No Transcript)
24
Gene-finding programs
  • Gene-finding software packages use hidden Markov
    models.
  • Predict coding, intergenic and intron sequences
  • Need to be trained on a specific organism.
  • Never perfect!

25
Phat
Cawley et al. (2001) Mol. Biochem. Parasitol. 118: 167
http://www.stat.berkeley.edu/users/scawley/Phat/
  • Based on a generalised hidden Markov model (GHMM)
  • Free, and easy to install and run.
  • Good at predicting multi-exon genes, but in some
    cases misses genes altogether and also
    over-predicts.

26
What is an HMM?
  • A statistical model that represents a gene.
  • Similar to a weight matrix that can recognise
    gaps and treat them in a systematic way.
  • Has different states that represent introns,
    exons and intergenic regions.
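
To show how an HMM assigns states to a sequence, here is a toy Viterbi decoder. It is deliberately simplified: it has only two states (intergenic and coding) instead of the intron, exon and intergenic states mentioned above, and every probability is invented for illustration, not trained on any organism.

```python
import math

# Toy model: all numbers below are made-up assumptions for illustration.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.5, "coding": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # AT-rich
    "coding": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},      # GC-richer
}


def viterbi(seq):
    """Return the most probable state path for seq (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Real gene finders use many more states (for example separate exon phases and splice-site states) and emission models trained on known genes from the target organism.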

27
GlimmerM
Salzberg et al. (1999) Genomics 59: 24-31
  • Adaptation of the prokaryotic gene finder Glimmer.
  • Delcher et al. (1999) NAR 27: 4636-4641
  • Based on an interpolated HMM (IHMM).
  • Only uses short chains of bases (Markov chains)
    to generate probabilities.
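
The idea of generating probabilities from short chains of bases can be illustrated with a fixed-order Markov chain. The sketch below is not GlimmerM's interpolated model (which combines several context lengths); it uses a single context length with pseudocounts, and all names and defaults are assumptions made for this example.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, pseudocount=1.0):
    """Count which base follows each context of `order` bases and return a
    function prob(context, base) with add-pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_score(seq, prob, order=2):
    """Log-probability of seq under the chain (first `order` bases are skipped)."""
    seq = seq.upper()
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )
```

Training one chain on known coding sequence and another on non-coding sequence, then comparing the two log scores for a candidate region, gives a simple coding-potential measure.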
  • Trained identically to Phat

28
GlimmerM
  • Under predicts splicing
  • Hardly ever misses a gene completely.
  • Does over-predict.
  • Free with licence.

29
Homology Data
  • Coding regions are more conserved than non coding
    regions due to selective pressure.
  • Comparing all possible translations against all
    known proteins will give clues to known genes.
  • Blastx

30
The Gene Prediction Process
ESTs
FASTA
BlastX
DNA SEQUENCE
Good Gene Models
ANALYSIS SOFTWARE
Phat
GlimmerM
DNA Plots
Annotator