Title: Exercises 1 8: Results and Answers
1Exercises 1 8 Results and Answers
- Molecular sequence comparison
- Pairwise alignment
- Molecular databases
- Using BLAST with different parameters and options
- Multiple sequence alignment
EPFL Bioinformatics I 19 Dec 2005
2Exercise 1 Molecular Sequence Comparison
Text 1.1. Dot matrix comparisons over the
web. How to deduce domain architectures from
dotplots ? Example NCK1_HUMAN versus NCK1_HUMAN
Interpretation NCK1 contains a domain which is
Repeated 3 times.
EPFL Bioinformatics I 19 Dec 2005
3Exercise 1 Molecular Sequence Comparison
Text 1.1. Dot matrix comparisons over the
web. How to deduce domain architectures from
dotplots ? Example horizontal NCK1_HUMAN,
vertical VAV_HUMAN
Interpretation NCK1_HUMAN shares two different
domain types (A and B) with VAV_HUMAN.
NCK1_HUMAN contains 3 copies of domain A and 1
copy of domain B, in the order A-A-A-B. VAV_HUMAN
contains 2 copies of domain A and 1 copy of
domain B in the order A-B-A.
EPFL Bioinformatics I 19 Dec 2005
4Exercise 1 Molecular Sequence Comparison
Text 1.1. Dot matrix comparisons over the
web. Domain architectures of analyzed proteins
according to SWISS-PROT
LCK_HUMAN NCK1_HUMAN VAV_HUMAN
60-120 SH3 2- 61 SH3 1-119
CH 123-223 SH2 115-165 SH3 194-373
DH 224-497 Prot. Kin. 190-252 SH3 402-504
PH 282-376 SH2 617-660
SH3 671-765
SH2 782-842
SH3
Note Only the number and location of the domains
which occur more than once within these proteins
can be deduced from dot matrices. Those are the
SH2 and SH3 domains
EPFL Bioinformatics I 19 Dec 2005
5Exercise 1 Molecular Sequence Comparison
1.1. Blast searches against protein databases
Look at the degree of conservation of the
proteins RL12_HUMAN and TYSY_HUMAN proteins in
other vertebrates and in more distantly related
species.
Note TheVarizella-zoster virus thymidylate
synthase has probably been acquired from a
vertebrate by horizontal gene transfer.
EPFL Bioinformatics I 19 Dec 2005
6Exercise 2.1 Compute the optimal local alignment
score and trace back the alignment
Scoring system match/mismatch 1/-1 indel
-1 Recursive equations X0,0 0, for i 1
N X0,i 0, for j 1 M Xj,0 0, for i 1
N, j 1 M Xi,j max(0 Xi-1,j-1
s(ai,bj), Xi-1,j d Xi,j-1 d) Xopt
max(Xi,j,) Notation s(ai,bj) match score for
residue pair ai,bj d penalty for inserting or
deleting 1 residue
Maximal local alignment score 6. Several
alignments equally best alignments, e.g.
CGCATGGC ACCGCGCATGGC CGCATAGC ACCGC--ATAGC
EPFL Bioinformatics I 19 Dec 2005
7Exercise 2 Compute the score for the following
alignment by hand
Scoring system match 2, mismatch -3, gap
-4 gap_length
68.4 identity in 79 nt overlap score
22 test1 AGTCACAAAG-GCCAGAGGCTACACGATTCC--CTT-C-
TGCAGGCTATCTGGAGGT-TG
test2
AGTCACAAAAAGACAAATACTGTATGATTTCAACTTACATG-AGGC-ACC
TGGAGGAATG test1 CAGATTTATAGA-AGCAGA
test2 AAG-TCCATAGAGA-CAGA
matches 54, mismatches 14, gaps of length 1
9, gaps of length 2 1 Score 254 314 94
18 22
How exactly is the percent identity computed? Can
you reproduce this value too ? of matches /
length of alignment Length of test1 72, length
of test2 75, length of alignment 79. 54/72
100 75.0, 54/75100 72.0, 54/79100 68.4
EPFL Bioinformatics I 31 Oct 2005
8Exercise 2.2. Compute local alignments and
compute their statistical significance for pairs
of protein sequences
Pair CO9_HUMAN versus PERF_HUMAN
Note E-values expected number of scores
unshuffled score per 200 shufflings. The
estimated values may vary somewhat between
individual shuflling experiments.
EPFL Bioinformatics I 31 Oct 2005
9Exercise 2.2. Compute local alignments and
compute their statistical significance for pairs
of protein sequences
Pair CO9_HUMAN versus PERF_HUMAN Which
substitution matrix is most appropriate for this
sequence pair ? PAM250 produces the lowest
E-value (2.99e09) PAM250 also produces the
longest alignment (pos. 115-530 in CO9_HUMAN).
Why ? The percent identity of the alignments
is very low (around 20). Low Blosum matrices and
PAM250 are optimal for such sequence pairs.
PAM250 has target identity of about 20.
Note E-values expected number of scores
unshuffled score per 200 shufflings. The
estimated values may vary somewhat between
individual shuflling experiments.
E-values are expected number of occurrences in
200 shuffled sequences
EPFL Bioinformatics I 31 Oct 2005
10Exercises 3 Molecular Sequence Databases
3.1. Find sequences in EMBL and GenBank
Questions/Suggestions Try to repeat the SRS
search without restricting the Division to "INV".
You will find more entries. From which Sections
are these entries ? Mostly EST Have a look at
the user manual of the EMBL Nucleotide Sequence
Database From EBI home page, follow links -gt
Bioinformatics Products and Servuces -gt Databases
-gt EMBL Nucleotide Database -gt Documentation -gt
User Manual. In which sequence division would you
search the 18S RNA sequence from rat ? ROD
(rodents)
EPFL Bioinformatics I 19 Dec 2005
11Exercises 3 Molecular Sequence Databases
3.2. Find proteins in Swiss-Prot/Trembl
Questions/Suggestions What is the official name
of this gene ? Suggestions follow the database
cross-reference link to HGNC. MYC What is the
sequence of the helix-loop-helix motif ?
Suggestion click on the domain coordinates in
the Features Section. NELKRSFFALRDQIPELENNEKAPKVV
ILKKATAYILSVQ What is a helix-loop-helix motif ?
Suggestion follow the database cross-reference
link to the corresponding PROSITE entry. (see
PROSITE entry PDOC00038) To look at a TrEMBL
entry, try to find a c-Myc protein from a
crocodile. What is missing in a TrEMBL entry as
compared to a Swiss-Prot entry ? See for
instance Q7T244. Missing are many comments
(function, subunits, etc), Keywords, feature
tables.
EPFL Bioinformatics I 19 Dec 2005
12Exercises 3 Molecular Sequence Databases
3.3. Gene-centric resources RefSeq and Unigene
Questions/Suggestions On which chromosome is the
c-Myc gene located ? Chromosome 8 Go to the
"mRNA sequences" Section. Find the corresponding
RefSeq sequence (identifier NM_..). Look at the
RefSeq sequence from GenBank (takes two clicks).
Does the RefSeq sequence contain a link to
Swiss-Prot ? No. It refers to a
protein_id"NP_002458.2. Look at one of the EST
sequence entries that contains a piece of c-Myc
mRNA. What kind of information do you find in an
EST sequence entry ? cDNA library name, Tissue
origin Link to trace viewer Cloning vector (in
GenBank entry, takes a second mouse-click)
EPFL Bioinformatics I 19 Dec 2005
133.4. Using genome Browsers (Ensembl and USCS)
EPFL Bioinformatics I 19 Dec 2005
14Exercise 3.4. Using genome Browsers (Ensembl and
USCS)
Questions/Suggestions What is the Ensembl entry
identifier for this gene ? ENSG00000136997 What
are the neighboring genes upstream and downstream
of the c-Myc gene ? Upstream POU5F1P Downstream
ENSG00000180623 (predicted gene) Is there any
indication of alternative splicing, alternative
promoter usage or alternative polyadenylation
from the mapping of ESTs and mRNA sequences to
the genome ? Alternative promoters Yes, e.g.
EST sequence M13929 Alternative splicing Yes,
e.g. mRNA sequence U37688 Alternative
polyadenylation No, Is the Ensembl gene model
different from the RefSeq sequence ? How many
exons is the c-Myc gene composed of ? Yes. The
ResSeq sequence NP_002458 is lacking the first
non-coding exon. The Myc gene is composed of 3
exons. There may be a transcript consisting of 4
exons transcribed from an alternative
promoter. Do you find conserved sequences between
the c-Myc genes of human and chicken outside the
coding regions ? Yes, in the first intron.
EPFL Bioinformatics I 19 Dec 2005
15Exercises 4 Using Blast with Different
Parameters and Options
- Blast output look at
- the list of significant matches
- the alignments and alignment scores
- the E-values of the matches to the original
sequence
embY16474MTY16474 Branchiostoma
lanceolatumBranchiostoma lan... 119
2e-26 embAF098298AF098298 Branchiostoma
floridaeBranchiostoma flor... 103
1e-21 embAJ488732MRH488732 Mainwaringia
rhizophilaMainwaringia rhi... 34 0.82
embAY835431AY835431 Pseudendoclonium
akinetumPseudendocloniu... 32 3.3
gtembY16474MTY16474 Branchiostoma
lanceolatumBranchiostoma lanceolatum
complete mitochondrial genome Length
15076 Score 119 bits (60), Expect 2e-26
Identities 60/60 (100) Strand Plus / Plus
Query 1
cagtgagagatgggggcgctaagagtttaagtcttaacgcaaattgaagc
atattattgt 60
Sbjct 1341
cagtgagagatgggggcgctaagagtttaagtcttaacgcaaattgaagc
atattattgt 1400
Score 60, E-value2e-26
EPFL Bioinformatics I 19 Dec 2005
16Exercises 4 Using Blast with Different
Parameters and Options
- Blast output look at
- the alignment parameters and statistics about
hits and hit extensions given at the bottom of
the blast output page.
Matrix blastn matrix 1 -3 Gap Penalties
Existence 5, Extension 2 Number of Hits to DB
23,289 Number of Sequences 432248 Number of
extensions 23289 Number of successful
extensions 8300 Number of sequences better than
10.0 4 length of query 60 length of database
383,269,866 effective HSP length 17 effective
length of query 43 effective length of database
375,921,650 effective search space
16164630950 effective search space used
16164630950 T 0 A 40 X1 6 (11.9 bits) X2 15
(29.7 bits) S1 12 (24.3 bits) S2 16 (32.2 bits)
EPFL Bioinformatics I 19 Dec 2005
17Exercise 4.1. Analyse the effects of blast
parameters with artificial nucleotide sequence
Questions Which parameter combination(s)
enable(s) blast to find the source sequence at a
significant E-value? default (word size 11) -gt
not found word size 10 -gt not found (2e-26,
extensions 25842) word size 9 -gt found (4e-12,
extensions157091) word size 9, mismatch -7 -gt
found ? (0.72, extensions 157091) Explain the
effect of the mismatch penalty parameter on the
reported E-value of the match between the
original and the corrupted sequence. A mismatch
penalty of -7 targets sequence alignments with
high identity (zero score limit 87.5). 90
identity is too low to produce a significant
E-value.
EPFL Bioinformatics I 19 Dec 2005
18Exercise 4.2. Analyze the effects of blast
parameters with artificial protein sequences
The built-in word size of bastp is 3. Why do you
nevertheless find the source sequence even though
there is no exact 3-letter word match between
query and target? 2 non-overlapping word hits
are required, but there is only one. However,
unlike blastn, blastp does not require exact
matches. A cumulative substitution matrix score
higher than 12 constitutes a word hit. Explain
the effect of the substution matrix on the
E-value of the reported match. Blosum62 -gt
1e-47, Blosum45 -gt 1e-39. Blosum62 is closer to
identity of the alignment (67)
Score 186 bits (473), Expect 1e-47
Identities 95/141 (67), Positives 109/141
(76) Query 1 GLVTARIKCIQEHWGLNKKGDLQCAAESIGFK
ALTCYPHDLCFFIKFTSVQLYHLRTNPC 60 GL
TAIK IQHW LN KG LQ AASI FK LT YP DL FF KFSV
LY LRNP Sbjct 1 GLTTAQIKAIQDHWFLNIKGCLQAAADSI
FFKYLTAYPGDLAFFHKFSSVPLYGLRSNPA 60 Query 61
YKCQTMTVKNYMDKWVDCLGGNCGAMMKCKVQSHEAMHITQKHGGQMLKM
VGHVFREEGS 120 YK QTTV NYDK VD LGGN
GAMK KV SHAM IT KH GQLKVG VFEE S Sbjct 61
YKAQTLTVINYLDKVVDALGGNAGALMKAKVPSHDAMGITPKHFGQLLKL
VGGVFQEEFS 120 Query 121 AEPTVVACWGEAAHVLWACMK
141 APT VA WGAA VL A MK Sbjct 121
ADPTTVAAWGDAAGVLVAAMK 141
EPFL Bioinformatics I 19 Dec 2005
19Exercise 4.3. Analyze the effects of
compositionally biased (low-complexity) regions
in protein sequences on sequence alignment scores
and blast search results
Are the newly found hits with the BlAST filter
disabled statistically significant. Use the PRSS
program for this purpose. Not all of them. Test
for instance MAML2_HUMAN with PRSS. Does the
human TBP protein appear in the list of hits
found with the filter disabled. If not, why ?
No. The output list is probably too short. There
are too many significance matches. If the size of
the output list is expanded to 100, the protein
will be found.
60.5 identity in 81 aa overlap score 314
E(10,000) 1.8e-20 50 60 70
80 90 100 unknow
LSILEEQQRQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQAVA
AAAVQQSTSQ . ...
.. unknow
FSYQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
QQQQKQNQIQ 1010 1020 1030
1040 1050 1060 110 120
unknow QATQGTSGQAPQLFHSQTLTT
. ... . unknow QQKQNQIQQQQQQSRGSTLPS
1070 1080
EPFL Bioinformatics I 19 Dec 2005
20Exercises 5 Multiple Sequence Alignment
Compute the sum of pairs for the following two
alternative alignments of the same for sequences.
Scoring system identity 1, mismatch -1 , gap
-3 - gap_length
Sequence A A C T C C T C - C A Sequence B
A C C C T T C A Sequence C A C T C C A C
C A Sequence D A C T C C A C T C A Identities
6 6 3 6 6 2 3 0 6 6 44 1 Mismatches 0 0 0 0
0 4 3 0 0 0 7-1 Gaps(length1)0 0 3 0 0 0 0 3 0
0 6-4 Sum of pairs score 13
Pairwise scores
Sequence A A C T C C T C - C A Sequence B
A C C C T - T C A Sequence C A C T C C A C
C A Sequence D A C T C C A C T C A Identities
6 6 3 6 6 2 3 1 6 6 45 1 Mismatches 0 0 0 0
0 4 0 0 0 0 4-1 Gaps(length1)0 0 3 0 0 0 3 4 0
0 10-4 Sum of pairs score 1
Pairwise scores
EPFL Bioinformatics I 19 Dec 2005
21Exercise 5.1. Sum of pairs score
- Questions/suggestions
- Do consider it likely that the likely that the
alignment with the higher SP score is the correct
one ? Justify your answer. - Yes, if one assumes that gaps are less likely
than substitutions, as refleceted by the scoring
system. Note, however, that the multiple
alignment with the lower score implies more pairs
of identical residues.
Sequence A ACTCCTC-CA ACTCCTC-CA
Sequence B ACCCTTCA
ACCCT-TCA Sequence C ACTCCACCA
ACTCCACCA Sequence D ACTCCACTCA
ACTCCACTCA Total identities 44 1
45 1 Total mismatches 7-1
4-1 Total gaps(length1) 6-4
10-4 Sum of pairs score 13 1
EPFL Bioinformatics I 19 Dec 2005
22Exercise 5.2 Aligning seven distantly related
globin sequences
Structural alignment by Bashford (supposedly
correct)
GLB1_GLYDI ---------G LSAAQRQVIA ATWKDIAGAD
NGAGVGKDCL IKFLSAHPQM HBB_HUMAN --------VH
LTPEEKSAVT ALWGKV---- NVDEVGGEAL GRLLVVYPWT
HBA_HUMAN ---------V LSPADKTNVK AAWGKVGA--
HAGEYGAEAL ERMFLSFPTT MYG_PHYCA ---------V
LSEGEWQLVL HVWAKVEA-- DVAGHGQDIL IRLFKSHPET
GLB5_PETMA PIVDTGSVAP LSAAEKTKIR SAWAPVYS--
TYETSGVDIL VKFFTSTPAA GLB3_CHITP ----------
LSADQISTVQ ASFDKVKG-- ----DPVGIL YAVFKADPSI
LGB2_LUPLU --------GA LTESQAALVK SSWEEFNA--
NIPKHTHRFF ILVLEIAPAA GLB1_GLYDI
AAVFG-FSG- ---AS---DP GVAALGAKVL AQIGVAVSHL
--GDEGKMVA HBB_HUMAN QRFFESFGDL STPDAVMGNP
KVKAHGKKVL GAFSDGLAHL ---D--NLKG HBA_HUMAN
KTYFPHF-DL S-----HGSA QVKGHGKKVA DALTNAVAHV
---D--DMPN MYG_PHYCA LEKFDRFKHL KTEAEMKASE
DLKKHGVTVL TALGAILKK- ---K-GHHEA GLB5_PETMA
QEFFPKFKGL TTADQLKKSA DVRWHAERII NAVNDAVASM
--DDTEKMSM GLB3_CHITP MAKFTQFAG- KDLESIKGTA
PFETHANRIV GFFSKIIGEL --P---NIEA LGB2_LUPLU
KDLFS-FLK- GTSEVPQNNP ELQAHAGKVF KLVYEAAIQL
QVTGVVVTDA GLB1_GLYDI QMKAVGVRHK
GYGNKHIKAQ YFEPLGASLL SAMEHRIGGK MNAAAKDAWA
HBB_HUMAN TFATLSELHC DKL--HVDPE NFRLLGNVLV
CVLAHHFGKE FTPPVQAAYQ HBA_HUMAN ALSALSDLHA
HKL--RVDPV NFKLLSHCLL VTLAAHLPAE FTPAVHASLD
MYG_PHYCA ELKPLAQSHA TKH--KIPIK YLEFISEAII
HVLHSRHPGD FGADAQGAMN GLB5_PETMA KLRDLSGKHA
KSF--QVDPQ YFKVLAAVIA DTVAAG---- -----DAGFE
GLB3_CHITP DVNTFVASHK PRG---VTHD QLNNFRAGFV
SYMKAHT--D FA-GAEAAWG LGB2_LUPLU TLKNLGSVHV
SKG---VADA HFPVVKEAIL KTIKEVVGAK WSEELNSAWT
GLB1_GLYDI AAYADISGAL ISGLQS---- -
HBB_HUMAN KVVAGVANAL AHKYH----- -
HBA_HUMAN KFLASVSTVL TSKYR----- -
MYG_PHYCA KALELFRKDI AAKYKELGYQ G
GLB5_PETMA KLMSMICILL RSAY------ -
GLB3_CHITP ATLDTFFGMI FSKM------ -
LGB2_LUPLU IAYDELAIVI KKEMNDAA-- -
EPFL Bioinformatics I 19 Dec 2005
23Exercise 5.1 Globin alignment generated by
ClustalW.Red differences to Bashford alignment
GLB1_GLYDI ---------GLSAAQRQVIAATWKDIAGADNGAG
VGKDCLIKFLSAHPQMAAVFGFSGAS 51 HBB_HUMAN
--------VHLTPEEKSAVTALWGKV----NVDEVGGEALGRLLVVYPWT
QRFFESFGDL 48 HBA_HUMAN ---------VLSPADKTNVK
AAWGKVG--AHAGEYGAEALERMFLSFPTTKTYFPHF-DL
48 MYG_PHYCA ---------VLSEGEWQLVLHVWAKVE--AD
VAGHGQDILIRLFKSHPETLEKFDRFKHL 49 GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETSGVDILVKFFTSTPAA
QEFFPKFKGL 58 GLB3_CHITP ----------LSADQISTVQ
ASFDKVK------GDPVGILYAVFKADPSIMAKFTQFAGK
44 LGB2_LUPLU --------GALTESQAALVKSSWEEFN--AN
IPKHTHRFFILVLEIAPAAKDLFSFLKGT 50
. . .
GLB1_GLYDI DP--------GVAALGAKVLAQI
GVAVSHLGDE--GKMVAQMKAVGVRHKGYGNKHIKAQ
101 HBB_HUMAN STPDAVMGNPKVKAHGKKVLGAFSDGLAHL
DN-----LKGTFATLSELHC--DKLHVDPE 101 HBA_HUMAN
SH-----GSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLH
A--HKLRVDPV 96 MYG_PHYCA KTEAEMKASEDLKKHGVTV
LTALGAILKKKGH-----HEAELKPLAQSHA--HTKKIPIK
102 GLB5_PETMA TTADQLKKSADVRWHAERIINAVNDAVASM
DDT--EKMSMKLRDLSGKHA--KSFQVDPQ 114 GLB3_CHITP
DLES-IKGTAPFETHANRIVGFFSKIIGELPN-----IEADVNTFVASH
K---PRGVTHD 95 LGB2_LUPLU SEVP--QNNPELQAHAGKV
FKLVYEAAIQLQVTGVVVTDATLKNLGSVHV---SKGVADA 105
. . .
. . . GLB1_GLYDI
YFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGALISGLQS----
- 147 HBB_HUMAN NFRLLGNVLVCVLAHHFGKEFTPPVQAA
YQKVVAGVANALAHKYH------ 146 HBA_HUMAN
NFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR-----
- 141 MYG_PHYCA YLEFISEAIIHVLHSRHPGDFGADAQGA
MNKALELFRKDIAAKYKELGYQG 153 GLB5_PETMA
YFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY------
- 149 GLB3_CHITP QLNNFRAGFVSYMKAHTD---FAGAEAA
WGATLDTFFGMIFSKM------- 136 LGB2_LUPLU
HFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA--
- 153 . . .
. . .
EPFL Bioinformatics I 19 Dec 2005
24Exercise 5.3 Aligning seven distantly related
globin sequences
Questions/Suggestions Which parts of the
alignment are correct, which ones are wrong ? The
regions around gaps tend to be incorrectly
aligned. Are there pairs of sequences, for
which all residues are correctly aligned with
regard to the 3D structures ? No, not even the
most closely related sequences HBB_HUMAN and
HBBA_HUMAN. However, there are only two
differences between the two alignments of these
sequences. Analysis by columns Correctly
aligned columns 165 6/7 correct 6 gt6 correct 18
EPFL Bioinformatics I 19 Dec 2005
25Exercise 5.3. Compare the accuracy of pair-wise
and multiple sequence alignment
ClustalW alignment for GLB1_GLYDI and HBB_HUMAN
GLB1_GLYDI -GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSA
HPQMAAVFG-FSGASDPGVAAL 58 HBB_HUMAN
VHLTPEEKSAVTALWGKVN----VDEVGGEALGRLLVVYPWTQRFFESFG
DLSTPDAVMG 56 . .. . ...
. . . .. ... GLB1_GLYDI
GAKVLAQIGVAVSHLGDEGKMVAQMK--AVGVRHKGYGNKHIKAQYFEPL
GASLLSAMEH 116 HBB_HUMAN NPKVKAHGKKVLGAFSDGLAHLD
NLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAH 116
.. .. . . . . ..
. .. GLB1_GLYDI RIGGKMNAAAKDAWAAAYADIS
GALISGLQS 147 HBB_HUMAN HFGKEFTPPVQAAYQKVVAGVANA
LAHKYH- 146 .... .
..
EPFL Bioinformatics I 19 Dec 2005