Title: GENE FINDING
1GENE FINDING
2Module 3
- Gene Finding
- Sequence properties
- Gene finding software
- Homology
- Protein domains
- Pfam
- Prosite
- hmmer
- Annotation
- Database flat files
- Artemis
- Gene ontologies
- Genome comparisons
- Synteny
- ACT
3Gene finding
- Artemis genome viewer
- Coding sequence vs non-coding sequence
- Gene finding software
- Homology between species
- ESTs
4DNA sequence
Gene finders
Blastn
Halfwise
Blastx
tRNA scan
RepeatMasker
Repeats
Promoters
Pseudo-Genes
rRNA
Genes
tRNA
Fasta
BlastP
Pfam
Prosite
Psort
SignalP
TMHMM
5The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
6Artemis
- Artemis is a free DNA sequence viewer and
annotation tool that allows visualization of
sequence features and the results of analyses
within the context of the sequence and its
six-frame translation.
- http://www.sanger.ac.uk/Software/Artemis/
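To make "six-frame translation" concrete, here is a minimal Python sketch that translates a DNA string in the three forward and three reverse frames. It assumes the standard genetic code. The function names (`six_frame_translation`, `translate`, `reverse_complement`) are illustrative and have nothing to do with how Artemis itself is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG x TCAG x TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon; codons not in the table (e.g. with N) become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames
```

Stop codons appear as `*`, so long stretches without `*` in one frame are the open reading frames a viewer would highlight.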
7atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
8DNA in Artemis
AT content
Forward translations
Reverse Translations
DNA and amino acids
9Gene structure
- IN TRYPANOSOMATIDS
- Polycistronic structure
- Genes occur on a single strand at a time.
- Inflection points
- No splicing
11Trypanosome gene structure
12GENE STRUCTURE IN MALARIA
- Splicing
- No polycistronic units
- Can have small exons
- Low complexity regions
13Gene Structure
14DONORS AND ACCEPTORS
TTTT AaaCAG Ccca gggg
ta GTAT g
nnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnn
Exon Any size
Intron 50-500bp
Exon Any size
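The size limits above can be turned into a crude candidate-intron scan. The sketch below assumes the canonical GT (donor) and AG (acceptor) dinucleotides, which match the GTAT and CAG motifs in the diagram, and uses the 50-500 bp intron range from the slide. The function name `candidate_introns` is hypothetical. A real gene finder scores the surrounding consensus rather than accepting every GT...AG pair.

```python
def candidate_introns(seq, min_len=50, max_len=500):
    """Return (start, end) pairs, 0-based and end-exclusive, of GT...AG spans.

    Every span starts with a GT donor dinucleotide, ends with an AG acceptor
    dinucleotide, and has a length within [min_len, max_len].
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Store the position just past the AG so that seq[start:end] is the intron.
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors
            if min_len <= a - d <= max_len]
```

On AT-rich sequence this returns many false positives, which is why splice-site evidence is normally combined with coding signals such as codon usage.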
15AT content
- Coding regions have higher GC content in AT rich
genomes
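One way to look for this signal is to compute AT content in a sliding window and look for dips. This is a minimal sketch. The window and step sizes are arbitrary assumptions, not values taken from Artemis.

```python
def at_content(seq):
    """Fraction of A and T bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def windowed_at_content(seq, window=120, step=20):
    """Return (window_start, AT fraction) for each full window along seq."""
    return [(start, at_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

In an AT-rich genome, windows whose AT fraction drops noticeably below the local background are candidates for coding sequence.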
16AT content
17CODON USAGE
- Codon bias is different for each organism.
- DNA content in coding regions is restricted, but
not in non-coding regions.
- The codon usage for any particular gene can
influence expression.
18Codon usage
- All organisms have a preferred set of codons.
- Relative use of the four valine codons:
  Codon  Malaria  Trypanosoma
  GUU    0.41     0.28
  GUC    0.06     0.19
  GUA    0.42     0.14
  GUG    0.11     0.39
19Codon Usage
- http://www.kazusa.or.jp/codon/
20Codon Usage Table
Fields: codon, frequency per thousand, (number of occurrences)
UUU  34.3(26847)  UCU 15.3(11956)  UAU  45.6(35709)  UGU 15.3(11942)
UUC   7.3( 5719)  UCC  5.3( 4141)  UAC   5.5( 4340)  UGC  2.4( 1872)
UUA  49.2(38527)  UCA 18.2(14239)  UAA   1.0(  813)  UGA  0.2(  188)
UUG  10.1( 7911)  UCG  2.8( 2154)  UAG   0.2(  123)  UGG  5.2( 4066)
CUU   8.7( 6776)  CCU  9.1( 7148)  CAU  19.5(15287)  CGU  3.3( 2561)
CUC   1.7( 1354)  CCC  2.5( 1982)  CAC   3.9( 3020)  CGC  0.5(  354)
CUA   5.4( 4217)  CCA 13.1(10221)  CAA  25.1(19650)  CGA  2.4( 1878)
CUG   1.3( 1044)  CCG  0.9(  742)  CAG   3.3( 2598)  CGG  0.2(  184)
AUU  34.0(26611)  ACU 12.8(10050)  AAU 105.5(82591)  AGU 21.6(16899)
AUC   5.9( 4636)  ACC  5.5( 4312)  AAC  18.5(14518)  AGC  3.8( 2994)
AUA  44.7(34976)  ACA 22.8(17822)  AAA  90.5(70863)  AGA 16.9(13213)
AUG  20.9(16326)  ACG  3.8( 2951)  AAG  19.2(15056)  AGG  3.9( 3091)
GUU  18.1(14200)  GCU 12.5( 9811)  GAU  55.5(43424)  GGU 16.6(12960)
GUC   2.6( 2063)  GCC  3.2( 2541)  GAC   8.6( 6696)  GGC  1.6( 1269)
GUA  18.2(14258)  GCA 12.6( 9871)  GAA  65.8(51505)  GGA 16.7(13043)
GUG   4.9( 3806)  GCG  1.1(  890)  GAG  10.1( 7878)  GGG  2.9( 2243)
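Per-thousand frequencies like those in the table can be converted into the relative fractions shown on the earlier "Codon usage" slide by normalising within one amino acid. The sketch below does this for the four valine codons. The numbers are copied from the table, and the function name `relative_usage` is illustrative. The result appears to match the Malaria column of that slide.

```python
# Per-thousand frequencies for the four valine codons, copied from the table.
VALINE_PER_THOUSAND = {"GUU": 18.1, "GUC": 2.6, "GUA": 18.2, "GUG": 4.9}


def relative_usage(per_thousand):
    """Normalise synonymous-codon frequencies so that they sum to about 1."""
    total = sum(per_thousand.values())
    return {codon: round(freq / total, 2) for codon, freq in per_thousand.items()}


print(relative_usage(VALINE_PER_THOUSAND))
# {'GUU': 0.41, 'GUC': 0.06, 'GUA': 0.42, 'GUG': 0.11}
```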
21Codon Usage in Artemis
Forward frames
Reverse frames
22GC frame plot
- Plots the third-position GC content of each frame
of a DNA sequence.
- In coding DNA the GC content of the 3rd base is
often higher.
- A good predictor of coding regions in malaria and
trypanosomes.
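The calculation behind a GC frame plot can be sketched as follows: for each window, take every third base starting at codon position 3 of each frame and measure the GC fraction. The window and step sizes and the function name `gc3_by_frame` are assumptions for illustration. Artemis's own defaults may differ.

```python
def gc3_by_frame(seq, window=120, step=30):
    """Return (window_start, (gc3_frame1, gc3_frame2, gc3_frame3)) per window.

    Keep step a multiple of 3 so that the frame numbering stays consistent
    from one window to the next.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        row = []
        for frame in range(3):
            # Third codon positions of this frame: frame+2, frame+5, ...
            third = chunk[frame + 2::3]
            gc = sum(base in "GC" for base in third)
            row.append(gc / len(third) if third else 0.0)
        results.append((start, tuple(row)))
    return results
```

A frame whose third-position GC rises clearly above the other two over a stretch of windows is the likely coding frame for that region.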
24Genefinding programs
- Gene-finding software packages use hidden Markov
models.
- They predict coding, intergenic and intron sequences.
- They need to be trained on a specific organism.
- Never perfect!
25Phat
Cawley et al. (2001) Mol. Biochem. Parasitol. 118
p167
http://www.stat.berkeley.edu/users/scawley/Phat/
- Based on a generalised hidden Markov model (GHMM)
- Free, and easy to install and run.
- Good at predicting multi-exon genes, but in
some cases misses genes altogether and also
over-predicts.
26What is an HMM?
- A statistical model that represents a gene.
- Similar to a weight matrix, but able to recognise
gaps and treat them in a systematic way.
- Has different states that represent
introns, exons and intergenic regions.
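The idea of states that emit bases can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. All probabilities below are made-up values chosen for an AT-rich genome, not parameters from Phat or GlimmerM, and the model has no intron state. It handles only A, C, G and T.

```python
import math

# Hypothetical toy parameters: intergenic DNA is AT-rich, exons less so.
STATES = ("intergenic", "exon")
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
         "exon": {"intergenic": 0.1, "exon": 0.9}}
EMIT = {"intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "exon": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3}}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi("AAAAAAAAAA" + "GCGCGCGCGC")` labels the AT-rich half as intergenic and the GC-rich half as exon. Real gene finders add intron, splice-site and frame-aware exon states and learn the parameters from training genes.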
27GlimmerM
Salzberg et al. (1999) Genomics 59 24-31
- Adaptation of the prokaryotic gene finder Glimmer.
- Delcher et al. (1999) NAR 27 4636-4641
- Based on an interpolated HMM (IHMM).
- Only uses short chains of bases (Markov chains)
to generate probabilities.
- Trained identically to Phat.
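As a rough illustration of generating probabilities from short chains of bases, the sketch below trains a fixed-order Markov chain on example sequences and scores a new sequence. It is a simplification: GlimmerM interpolates between several chain lengths, which this sketch does not attempt. The function names, the order of 2, and the pseudocount are assumptions.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in "ACGT":
                counts[context][base] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: count / total for base, count in row.items()}
    return model


def log_probability(seq, model, order=2):
    """Log-probability of seq under the model; unseen contexts score 0.25."""
    seq = seq.upper()
    logp = 0.0
    for i in range(order, len(seq)):
        row = model.get(seq[i - order:i])
        p = row[seq[i]] if row and seq[i] in row else 0.25
        logp += math.log(p)
    return logp
```

Training one model on known coding sequence and another on non-coding sequence, then comparing the two log-probabilities for a candidate region, gives a simple coding score.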
28GlimmerM
- Under-predicts splicing.
- Hardly ever misses a gene completely.
- Does over-predict.
- Free with licence.
29Homology Data
- Coding regions are more conserved than non-coding
regions due to selective pressure.
- Comparing all possible translations against all
known proteins will give clues to known genes.
- Blastx
30The Gene Prediction Process
ESTs
FASTA
BlastX
DNA SEQUENCE
Good Gene Models
ANALYSIS SOFTWARE
Phat
GlimmerM
DNA Plots
Annotator