Title: Exercises 6 8: Results and Answers
1Exercises 6 8 Results and Answers
6. Sequence Profiles and Motifs 7. Hidden Markov
Models 8. Gene prediction
EPFL Bioinformatics I 9 Jan 2006
2Exercise 6.1. Discovery of the Shine-Dalgarno
interaction motif with MEME
- Questions/suggestion
- How far away from the 3' end is the region of the
16S RNA which is complementary to the motif found
by MEME? About 5 to 15 bases upstream. - Do all input sequences contain a Shine-Dalgarno
interaction region ? Almost all (according to
the MEME results) . Note that MEME by default
searches motifs on both strands. A few sequences
are reported to contain motif 1 on the non-coding
strand. - Do you find mismatched motifs in the sequences ?
Yes, many. Below is the position-specific
probability matrix returned by MEME.
A 0.26 0.49 0.00 0.02 0.72 0.16 C 0.30 0.16
0.00 0.00 0.03 0.05 G 0.44 0.25 1.00 0.98 0.00
0.51 T 0.00 0.10 0.00 0.00 0.25 0.28
3 end of E. coli 16s ribosomal
RNA GCGGTGGATCACCTCCTTA
MEME results see http//www.isrec.isb-sib.ch/pbu
cher/bioinfo_1/meme_results.html
EPFL Bioinformatics I 19 Dec 2005
3Exercise 6.2. Compare iterative PSI-BLAST search
with standard protein BLAST search
Number of yeast EST1 related found with E-values
0.005 Protein BLAST 11 PSI-Blast 1st
iteration 11 PSI-BLAST 2nd iteration
90 PSI-BLAST 3rd iteration even more What
next Perform PSI-BLAST searches with one of the
newly found EST1-related proteins. Do you find
the same collection of proteins starting with
another member of the protein family ? Apparently
yes (though perhaps not exactly). Example The
sequence AAN46114 (human EST1A) was newly found
in the third PSI-Blast iteration. Starting with
this protein, the yeast EST1 protein is found
already in the second iteration.
EPFL Bioinformatics I 19 Dec 2005
4Exercise 7 Hidden Markov Models
7.1 Given the simple HMM below and the sequence
CCC
By how many different paths can this sequence be
generated by the HMM ? 8 (23) WWW, WWS, WSW,
WSS, SWW, SWS, SSW, SSW
EPFL Bioinformatics I 19 Dec 2005
5Basic calculations for a simple HMM
Sequence Path 1 B -gt W -gt W -gt W
-gt E 0.50.150.80.150.80.150.10.000
108 Path 2 B -gt W -gt W -gt S -gt E
0.50.150.80.150.10.350.10.000032 Path
3 B -gt W -gt S -gt W -gt E
0.50.150.10.350.10.150.10.000004 Path 4 B
-gt W -gt S -gt S -gt E
0.50.150.10.350.80.350.10.000074 Path 5 B
-gt S -gt W -gt W -gt E
0.50.350.10.150.80.150.10.000032 Path 6 B
-gt S -gt W -gt S -gt E
0.50.350.10.150.10.350.10.000009 Path 7 B
-gt S -gt S -gt W -gt E
0.50.350.80.350.10.150.10.000074 Path 8 B
-gt S -gt S -gt S -gt E
0.50.350.80.350.80.350.10.001372
-------- Total
probability 0.001705
HMM
Sequence CCC
Which is the most probable path for its
generation ? SSS Which is the least probable
path for this sequences ? WSW What is the
probability that this sequence is generated by
the most probable path ? 0.001372 What is the
total probability that this sequence is generated
by the HMM ? 0.001705
EPFL Bioinformatics I 19 Dec 2005
6Basic calculations for a simple HMM
Sequence Path 1 B -gt W -gt W -gt W
-gt E 0.50.150.80.150.80.150.10.000
108 Path 2 B -gt W -gt W -gt S -gt E
0.50.150.80.150.10.350.10.000032 Path
3 B -gt W -gt S -gt W -gt E
0.50.150.10.350.10.150.10.000004 Path 4 B
-gt W -gt S -gt S -gt E
0.50.150.10.350.80.350.10.000074 Path 5 B
-gt S -gt W -gt W -gt E
0.50.350.10.150.80.150.10.000032 Path 6 B
-gt S -gt W -gt S -gt E
0.50.350.10.150.10.350.10.000009 Path 7 B
-gt S -gt S -gt W -gt E
0.50.350.80.350.10.150.10.000074 Path 8 B
-gt S -gt S -gt S -gt E
0.50.350.80.350.80.350.10.001372
-------- Total
probability 0.001705
HMM
Sequence CCC
Assuming that this sequence is generated by the
model, what is the probability that the first C
is emitted be state S ? (0.0000320.0000090.
0000740.001372)/0.001705 0.872141 Note This
type of calculation also called posterior
decoding.
EPFL Bioinformatics I 19 Dec 2005
7 SWISS-PROT Annotation
TMHMM2.0 outside 1 35 TMHMM2.0 TMhelix
36 58 TMHMM2.0 inside 59
70 TMHMM2.0 TMhelix 71 93 TMHMM2.0 outside
94 107 TMHMM2.0 TMhelix 108
130 TMHMM2.0 inside 131 149 TMHMM2.0 TMhelix
150 172 TMHMM2.0 outside 173
201 TMHMM2.0 TMhelix 202 224 TMHMM2.0 inside
225 250 TMHMM2.0 TMhelix 251
273 TMHMM2.0 outside 274 282 TMHMM2.0 TMhelix
283 305 TMHMM2.0 inside 306 348
FT TOPO_DOM 1 33 Extracellular
(Potential). FT TRANSMEM 34 58 1
(Potential). FT TOPO_DOM 59 70 Cytoplasmic
(Potential). FT TRANSMEM 71 96 2
(Potential). FT TOPO_DOM 97 110 Extracellular
(Potential). FT TRANSMEM 111 13 3
(Potential). FT TOPO_DOM 131 149 Cytoplasmic
(Potential). FT TRANSMEM 150 173 4
(Potential). FT TOPO_DOM 174 199 Extracellular
(Potential). FT TRANSMEM 200 227 5
(Potential). FT TOPO_DOM 228 249 Cytoplasmic
(Potential). FT TRANSMEM 250 273 6
(Potential). FT TOPO_DOM 274 281 Extracellular
(Potential). FT TRANSMEM 282 306 7
(Potential). FT TOPO_DOM 307 348 Cytoplasmic
(Potential).
EPFL Bioinformatics I 19 Dec 2005
8 SWISS-PROT Annotation
TMHMM2.0 outside 1 9 TMHMM2.0 TMhelix
10 32 TMHMM2.0 inside 33
242 TMHMM2.0 TMhelix 243 265 TMHMM2.0 outside
266 551
FT SIGNAL 1 26 FT TOPO_DOM 27 240
Extracellular (Potential). FT TRANSMEM 241 265
Potential. FT TOPO_DOM 266 551 Cytoplasmic
(Potential).
EPFL Bioinformatics I 19 Dec 2005
9 SWISS-PROT Annotation
TMHMM2.0 inside 1 93 TMHMM2.0 TMhelix
94 116 TMHMM2.0 outside 117 774
FT TOPO_DOM 1 97 Cytoplasmic (Potential). FT
TRANSMEM 98 118 Segment S1 (Potential). FT
TRANSMEM 125 145 Segment S2 (Potential). FT
TOPO_DOM 146 171 Cytoplasmic (Potential). FT
TRANSMEM 172 192 Segment S3 (Potential). FT
TRANSMEM 202 222 Segment S4 (Potential). FT
TOPO_DOM 223 253 Cytoplasmic (Potential). FT
TRANSMEM 254 274 Segment S5 (Potential). FT
TRANSMEM 298 319 Segment H5 (Potential). FT
TRANSMEM 330 350 Segment S6 (Potential). FT
TOPO_DOM 351 774 Cytoplasmic (Potential).
EPFL Bioinformatics I 19 Dec 2005
107.2. Use TMHMM server for transmembrane topology
predictions
- How accurate are the prediction with regard to
the annotations in SWISS-PROT? - OPSB_HUMAN Very good
- IL2RB_HUMAN Not so good. The signal peptide is
confounded with a TM (transmembrane helix) - HCN3_HUMAN Not so good, some TMs are visible in
the graphical output but have low probability
values -
- How reliable are the inside/outside predictions ?
- OPSB_HUMAN Very good
- IL2RB_HUMAN Unreliable according to posterior
decoding, false with regard to SWISS-PROT
annotation - HCN3_HUMAN Predictions with probabilities gt 0.8
are all correct. - Note further
- The diagram on top of the Figure reflects the
Viterbi path (most probable model) - The TM predictions and inside/outside predictions
are highly interdependent. A missed TM induces a
wrong inside/outside prediction adjacent to it.
EPFL Bioinformatics I 19 Dec 2005
11Exercise 8 Gene prediction
7.1. Paper and pencil exercise Predicting splice
donor sites with a matrix.
Test the discriminatory power of the above
scoring matrix using the human beta-globin gene
as an example.
Splice donor weight matrix
Score of true splice junctions
A 6 19 -24 -99 -99 15 23 -30 -11 C 8 -14
-37 -99 -99 -43 -27 -33 -6 G -7 -12 26 31 -99
13 -15 27 -4 T -15 -14 -27 -99 31 -43 -22 -33
13
G A A g t t g g t -719-243131-4315
2713 32 A G G g t g a g t
6-1226313113232713 158
EPFL Bioinformatics I 19 Dec 2005
128.1. Paper and pencil exercise Predicting splice
donor sites with a matrix.
aggacaggtacggctgtcatcacttagacctcaccctgtggagccacacc
ctagggttgg ccaatctactcccaggagcagggagggcaggagccaggg
ctgggcataaaagtcagggca gagccatctattgcttacatttgcttct
gacacaactgtgttcactagcaacctcaaaca
gacaccATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCT
GTGGGGCAAG GTGAACGTGGATGAAgttggtggtgaggccctgggcagg
ttggtatcaaggttacaagac aggtttaaggagaccaatagaaactggg
catgtggagacagagaagactcttgggtttct
gataggcactgactctctctgcctattggtctattttcccacccttagGC
TGCTGGTGGT CTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGA
TCTGTCCACTCCTGATGCTGT TATGGGCAACCCTAAGGTGAAGGCTCAT
GGCAAGAAAGTGCTCGGTGCCTTTAGTGATGG
CCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGC
TGCACTGTGA CAAGCTGCACGTGGATCCTGAGAACTTCAGGgtgagtct
atgggacccttgatgttttct ttccccttcttttctatggttaagttca
tgtcataggaaggggagaagtaacagggtaca
gtttagaatgggaaacagacgaatgattgcatcagtgtggaagtctcagg
atcgttttag tttcttttatttgctgttcataacaattgttttcttttg
tttaattcttgctttcttttt ttttcttctccgcaatttttactattat
acttaatgccttaacattgtgtataacaaaag
gaaatatctctgagatacattaagtaacttaaaaaaaaactttacacagt
ctgcctagta cattactatttggaatatatgtgtgcttatttgcatatt
cataatctccctactttattt tcttttatttttaattgatacataatca
ttatacatatttatgggttaaagtgtaatgtt
ttaatatgtgtacacatattgaccaaatcagggtaattttgcatttgtaa
ttttaaaaaa tgctttcttcttttaatatacttttttgtttatcttatt
tctaatactttccctaatctc tttctttcagggcaataatgatacaatg
tatcatgcctctttgcaccattctaaagaata
acagtgataatttctgggttaaggcaatagcaatatttctgcatataaat
atttctgcat ataaattgtaactgatgtaagaggtttcatattgctaat
agcagctacaatccagctacc attctgcttttattttatggttgggata
aggctggattattctgagtccaagctaggccc
ttttgctaatcatgttcatacctcttatcttcctcccacagCTCCTGGGC
AACGTGCTGG TCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCC
CACCAGTGCAGGCTGCCTATC AGAAAGTGGTGGCTGGTGTGGCTAATGC
CCTGGCCCACAAGTATCACtaagctcgctttc
ttgctgtccaatttctattaaaggttcctttgttccctaagtccaactac
taaactgggg gatattatgaagggccttgagcatctggattctgcctaa
taaaaaacatttattttcatt gcaatgatgtatttaaattatttctgaa
tattttactaaaaagggaatgtgggaggtcag
Test the discriminatory power of the above
scoring matrix using the human beta-globin gene
as an example.
EPFL Bioinformatics I 19 Dec 2005
13Exercise8.1. Paper and pencil exercise
Predicting splice donor sites with a matrix
Donor weight matrix for all GT dinucleotide
flanking sequences in the human ß-globin gene
region
5 CAGGTACGG 126 416 GTGGTCTAC -34 771
ATCGTTTTA -92 1325 ATTGTAACT 45 13 GCTGTCATC
-45 436 GAGGTTCTT 10 777 TTAGTTTCT -76 1334
GATGTAAGA 101 35 CCTGTGGAG -7 445 TGAGTCCTT
-79 793 gcTGTTCAT -73 1341 GAGGTTTCA -9 53
AGGGTTGGC 45 460 TCTGTCCAC -100 806 ATTGTTTTC
-77 1397 ATGGTTGGG 45 109 AAAGTCAGG 66 476
GCTGTTATG -43 816 TTTGTTTAA -100 1423 TGAGTCCAA
-100 155 ACTGTGTTC -21 494 AAGGTGAAG 115 884
ATTGTGTAT 1 1451 CATGTTCAT -25 157 TGTGTTCAC
-98 515 AAAGTGCTC 10 886 TGTGTATAA -40 1491
AACGTGCTG -1 187 ATGGTGCAC 30 522 TCGGTGCCT
25 921 TAAGTAACT 60 1497 CTGGTCTGT 57 210
GAAGTCTGC 6 531 TTAGTGATG 8 946 ACAGTCTGC
-14 1501 TCTGTGTGC 18 217 GCCGTTACT -36 582
TGAGTGAGC 68 955 CTAGTACAT 3 1503 TGTGTGCTG
-43 228 CCTGTGGGG 50 594 ACTGTGACA 19 978
TATGTGTGC 51 1542 CCAGTGCAG -16 238 AAGGTGAAC
113 608 CACGTGGAT 33 980 TGTGTGCTT -26 1563
AAAGTGGTG 24 244 AACGTGGAT 31 629 AGGGTGAGT
158 1062 TGGGTTAAA 0 1566 GTGGTGGCT 45 253
GAAGTTGGT 32 633 TGAGTCTAT -71 1070 AGTGTAATG
30 1573 CTGGTGTGG 96 257 TTGGTGGTG 20 651
GATGTTTTC -57 1068 AAAGTGTAA 13 1575 GGTGTGGCT
-6 260 GTGGTGAGG 126 676 ATGGTTAAG 26 1075
AATGTTTTA -49 1598 CAAGTATCA 14 276 CAGGTTGGT
97 681 TAAGTTCAT -45 1085 TATGTGTAC -6 1623
GCTGTCCAA -97 280 TTGGTATCA 8 687 CATGTCATA
-2 1087 TGTGTACAC -40 1641 AAGGTTCCT 23 288
AAGGTTACA 49 705 GAAGTAACA 44 1110 AGGGTAATT
100 1649 TTTGTTCCC -103 300 CAGGTTTAA 9 713
AGGGTACAG 36 1124 TTTGTAATT 24 1658 TAAGTCCAA
-69 329 CATGTGGAG 26 718 ACAGTTTAG -69 1165
TTTGTTTAT -76 1746 GATGTATTT 20 352 TGGGTTTCT
-24 752 TCAGTGTGG 23 1225 AATGTATCA 9 1786
AATGTGGGA 74 386 TTGGTCTAT -23 754 AGTGTGGAA
-14 1261 ACAGTGATA 22 413 CTGGTGGTC 41
760 GAAGTCTCA -59 1275 TGGGTTAAG 7
Red exons, bluescores of true splice donors
EPFL Bioinformatics I 19 Dec 2005
14Exercise 8.1. Score histogram for
non-splice-donor GT flanking sequences
.
EPFL Bioinformatics I 19 Dec 2005
15Exercise 8.1 Predicting splice donor sites with
a matrix
- Questions/Suggestions
- Based on these few results, propose a reasonable
cut-off value for discriminating true splice
junction from other GT dinucleotide sites. - The first splice junction has a relatively low
score of 32, the second one a high score of 158.
In the ß-globin gene region, 24 non-splice-donor
GT flanking sequences have scores 32. None of
them has a score 158. With these values, the
estimated sensitivity and selectivity would be - Cut-off value 32 sensitivity 100 selectivity
8.3 - Cut-off value 158 sensitivity
50 selectivity 100 - How would you rate the discriminatory power of
this splice donor site weight matrix ? - Based on these results Not very good
- Conclusions This splice donor weigh matrix
combined with any kind of splice acceptor site
prediction method would not be sufficient to
predict gene structures. A good coding
regions/exon prediction method is required in
addition.
EPFL Bioinformatics I 19 Dec 2005
168.2. Test the performance of several gene
prediction web server programs
GENSCAN predictions for the first 50'000 bp from
the embl entry AC002397. (graphical output)
EPFL Bioinformatics I 19 Dec 2005
17Exercise 8.2 Analysis of the results
GENSCAN performance for gene CD4.
Correctly predicted genes 0. completely missed
genes 0 Correctly predicted exons 6/7)
(85.7) incorrectmissed exons 2/8
(25.3) Correctly predicted bases 1343/1347
(99.7) missed coding bases 31/1374 (2.3)
EPFL Bioinformatics I 19 Dec 2005
188.2 .Gene Prediction Questions
- Compare the performance of the four programs.
- Summarize the types of errors made by the
programs - Do different programs tend to make the same
errors ? - A results summary will follow soon !
EPFL Bioinformatics I 19 Dec 2005