Title: Jacques.van.Helden@ulb.ac.be
1 Sequence analysis. Part 3. Similarity searches in sequence databases
- Introduction to Bioinformatics
2 Matching a sequence against a database
3 Matching a sequence against a database
- Example of use
- We have the coding sequence of a gene and we would like
to search UniProt for all similar proteins, in
order to get a clue about its possible function.
- Approach: we align our query sequence to
each entry in UniProt.
- Problem of size: in Dec 2005, there were
2,500,000 entries in UniProt (Swiss-Prot + TrEMBL).
- It is possible to apply dynamic programming,
but this takes a lot of time or requires high
computing power.
4 Fast algorithms for database matching
- FastA
- BLAST (Basic Local Alignment Search Tool)
- In short
- These algorithms are 50 times faster than
Smith-Waterman.
- They cannot guarantee the optimal solution.
- However, a comparison with results obtained by
dynamic programming has shown that the FastA answer
is close to the optimum.
5 FastA strategy
- FastA builds an index with the positions of all
the small words (k-tuples) found in the query
sequence.
- The program then detects diagonals of k-tuples
shared between the query and the database sequences.
- When a significant diagonal is detected, the two
sequences are aligned with the Smith-Waterman
algorithm.
- The word size (k) influences the behaviour of
the program: when k increases, the search is
faster but some matches might be missed.
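The first two steps can be sketched in a few lines of Python. This is only an illustration, not the FastA implementation: the function names and the two toy sequences are assumptions, and the scoring and joining of diagonals are left out.

```python
from collections import Counter, defaultdict

def build_ktuple_index(query, k):
    """Map each k-tuple of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (subject position - query position)."""
    index = build_ktuple_index(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals

# Toy sequences (hypothetical, chosen for illustration only)
query = "KFGGSSLADVK"
subject = "MSKFGGTSVADFD"

hits_k2 = diagonal_hits(query, subject, k=2)
hits_k3 = diagonal_hits(query, subject, k=3)
print(hits_k2.most_common(1))  # densest diagonal with k = 2
print(hits_k3.most_common(1))  # fewer hits on the same diagonal with k = 3
```

With the larger word size, fewer k-tuples are shared, which illustrates why a larger k is faster but less sensitive.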
6 FastA strategy
- The positions of k-tuples are compared between the query and
the database sequences.
- The highest-density regions ("init regions") are
identified (the best one is marked with an
asterisk). Regions with a score below a given
threshold appear as dotted lines.
- Low-scoring regions are filtered out, and the
remaining regions are joined.
Source: Mount (2000)
7 BLAST strategy (gapless version, Altschul et al., 1990)
- Version 1 (1990)
- Builds a dictionary of k-tuples (small words)
found in the query sequence.
- Uses a substitution matrix (e.g. BLOSUM) to
calculate a score between these words and each
possible word of the same length.
- Only retains the words with a high score.
- Each time a word from the dictionary is
found in a database sequence (a hit), the hit is
extended in both directions (without gaps) to
obtain a High-scoring Segment Pair (HSP).
- The program returns the sequences with significant
high-scoring segment pairs.
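The dictionary of high-scoring words and the gapless extension can be sketched as follows. This is a simplified illustration under explicit assumptions: the scoring function is a toy stand-in for a real substitution matrix such as BLOSUM, and the threshold, the drop-off value `x_drop` and the sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b):
    """Toy substitution score (assumption): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1

def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return {
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }

def extend_ungapped(query, subject, qpos, spos, k, x_drop=5):
    """Extend a word hit in both directions without gaps.

    Returns (best_score, query_start, query_end) of the resulting HSP.
    Extension stops when the score falls more than `x_drop` below the best.
    """
    best = current = sum(score(query[qpos + i], subject[spos + i]) for i in range(k))
    q_start, q_end = qpos, qpos + k
    # Extend to the right of the hit
    i, j = qpos + k, spos + k
    while i < len(query) and j < len(subject) and best - current <= x_drop:
        current += score(query[i], subject[j])
        i += 1
        j += 1
        if current > best:
            best, q_end = current, i
    # Extend to the left of the hit, starting from the best segment so far
    current = best
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0 and best - current <= x_drop:
        current += score(query[i], subject[j])
        if current > best:
            best, q_start = current, i
        i -= 1
        j -= 1
    return best, q_start, q_end

query = "KFGGSSLADVK"
subject = "MSKFGGTSVADFD"
words = neighborhood("FGG", threshold=7)       # exact word plus single substitutions
hsp = extend_ungapped(query, subject, qpos=1, spos=3, k=3)
print(len(words), hsp)
```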
8 BLAST strategy (gapped version, Altschul et al., 1997)
- Version 2 (1997)
- Uses smaller words, but only extends when there are
two hits on the same diagonal.
- The extension includes gaps (dynamic programming).
- Each extension costs more time, but fewer
extensions are performed, because an
extension requires a pair of hits.
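The two-hit criterion can be illustrated with a short sketch. The function name, the list of hits and the handling of overlapping hits are assumptions made for the example; the window of 40 positions is taken here to correspond to the parameter listed as A in the BLAST output shown later in this course.

```python
def two_hit_seeds(hits, k, window=40):
    """Select hits preceded by a non-overlapping hit on the same diagonal.

    hits: iterable of (query_pos, subject_pos) word hits of length k.
    A hit triggers an extension only if an earlier hit lies on the same
    diagonal, at a distance between k and `window` positions.
    """
    last_on_diagonal = {}
    seeds = []
    for qpos, spos in sorted(hits, key=lambda h: h[1]):
        diagonal = spos - qpos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = spos - previous
            if distance < k:
                continue  # overlaps the previous hit: ignored in this sketch
            if distance <= window:
                seeds.append((qpos, spos))
        last_on_diagonal[diagonal] = spos
    return seeds

# Hypothetical word hits: three on diagonal 20, one isolated on diagonal 48
hits = [(10, 30), (25, 45), (12, 60), (11, 31)]
seeds = two_hit_seeds(hits, k=3)
print(seeds)
```

Only one of the four hits would be extended here, which is how the pair requirement reduces the number of costly gapped extensions.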
9 BLAST strategy (gapped version, Altschul et al., 1997)
- PSI-BLAST (described in the 1997 article as well)
- A second step after the BLAST search proper.
- Once gapped BLAST has returned a set of
sequences, these sequences are aligned and used
to build a profile (motif).
- The database is then scanned with this profile
to collect additional similar sequences.
- The process can be iterated several times:
- collect sequences -> build a profile -> collect
sequences -> build a profile ...
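One round of "build a profile, then scan" can be sketched as below. This is a strongly simplified illustration, not the PSI-BLAST procedure: it assumes gapless aligned sequences of equal length, a uniform background frequency, a fixed pseudocount, and hypothetical toy sequences.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Position-specific log-odds scores from equal-length aligned sequences.

    Assumption: uniform background frequency (1/20) for every residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_window(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(profile)
    scored = [
        (sum(col[aa] for col, aa in zip(profile, sequence[s:s + width])), s)
        for s in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy aligned segments (hypothetical) and a sequence to scan
aligned = ["KFGGS", "KFGGT", "RFGGT"]
profile = build_profile(aligned)
best_score, start = best_window(profile, "MSKFGGTSVADFD")
print(round(best_score, 2), start)
```

In an iterated search, the sequences collected with this profile would be added to the alignment and the profile rebuilt.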
10 BLAST family of programs
- Different program names exist, depending on the
type (protein or nucleic acid) of the query and
database sequences.
- For comparisons between nucleic acids and
proteins, the nucleic acid is translated in the 6
frames (3 frames per strand).
6-frame translation
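A minimal sketch of the 6-frame translation, using the standard genetic code. The function names and the short DNA example are assumptions; incomplete trailing codons are simply ignored, and codons containing ambiguous bases are translated as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate consecutive codons; unknown codons are rendered as X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Return the 6 translations: 3 frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```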
11 Statistics of sequence similarities
12 Matching statistics - raw score
- The raw score of a matching segment pair (MSP) is
obtained by summing the scores (obtained from the
substitution matrix) for each pair of residues
(r1,i and r2,i) over the length of the alignment
(L).
Sequence 1   R  L  A  S  V  E  T  D  M  P  L  T  L  R  Q  H
Sequence 2   T  L  T  S  L  Q  T  T  L  K  A  H  L  G  T  H
Score       -1  4  0  4  1  2  5 -1  2 -1 -1 -2  4 -2 -1  8   (sum = 21)
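The summation can be checked with a short script. It assumes the example is read as the segment RLASVETDMPLTLRQH aligned without gaps to TLTSLQTTLKAHLGTH; the per-pair scores are copied from the example itself rather than loaded from a full substitution matrix, and the function names are hypothetical.

```python
# Per-pair scores copied from the slide's example (only the pairs it uses)
PAIR_SCORES = {
    ("R", "T"): -1, ("L", "L"): 4, ("A", "T"): 0, ("S", "S"): 4,
    ("V", "L"): 1, ("E", "Q"): 2, ("T", "T"): 5, ("D", "T"): -1,
    ("M", "L"): 2, ("P", "K"): -1, ("L", "A"): -1, ("T", "H"): -2,
    ("R", "G"): -2, ("Q", "T"): -1, ("H", "H"): 8,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]

def raw_score(seq1, seq2):
    """Sum of substitution scores over a gapless alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("gapless segments must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

total = raw_score("RLASVETDMPLTLRQH", "TLTSLQTTLKAHLGTH")
print(total)
```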
13 MSP-wise P-value and bit score
- The P-value of a matching segment pair (MSP) with
score S is the probability of observing a score of
at least S by chance.
- Karlin and Altschul (1990) defined a way to
calculate the P-value of an MSP.
- The P-value decreases exponentially with the score,
according to two parameters, lambda and K. These two
parameters depend on the substitution matrix
chosen. They thus have to be estimated for each
substitution matrix separately.
- The analytic way to determine the parameters
lambda and K is only valid for gapless
alignments.
- For alignments with gaps, Altschul et al. (1997)
propose to estimate these parameters on the basis
of empirical observations.
- Bit score of an MSP
- Karlin and Altschul (1990) also propose to
convert the raw score S into a bit score S'.
- This facilitates the interpretation of the score,
because the P-value can be calculated directly
from the bit score, irrespective of the
substitution matrix used for the alignment.
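As an illustration, the usual conversion S' = (lambda * S - ln K) / ln 2 can be applied to raw scores from the BLAST example shown later in this course. The gapped lambda and K values are those printed at the end of that BLAST output; since they are rounded there, the result only agrees with the reported bit scores to within rounding. The function name is an assumption.

```python
import math

def bit_score(raw, lam, K):
    """Convert a raw score into a bit score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)

# Gapped parameters as printed in the BLAST output of this course (rounded)
GAPPED_LAMBDA = 0.267
GAPPED_K = 0.0410

bits_weak = bit_score(69, GAPPED_LAMBDA, GAPPED_K)     # reported as 31.2 bits
bits_strong = bit_score(882, GAPPED_LAMBDA, GAPPED_K)  # reported as 344 bits
print(round(bits_weak, 1), round(bits_strong))
```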
14 Matching statistics - the E-value
- If one aligns a random word with another
random word, the score is likely to be
low.
- However, if this is repeated billions of times,
some high scores will occasionally occur by
chance.
- In a database scan, each word of the query
sequence is compared to each word of the
database.
- For a query sequence of size m and a database of
size n, the search space (number of word pair
comparisons) is thus N = n × m.
- FastA and BLAST estimate, for each score, the
number of matches that would be obtained by
chance alone, given the size of the database.
This is called the E-value.
- The E-value is the product of the P-value and the
size of the search space.
- For a given score S, the expected number of
random matches thus increases with the size of
the database.
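A numerical sketch of this product, using the commonly used form E = K * N * exp(-lambda * S), which is equivalent to E = N * 2^(-S') for a bit score S'. The effective lengths, lambda and K are taken from the BLAST output shown later in this course; the function names are assumptions.

```python
import math

def e_value_from_raw(raw, lam, K, search_space):
    """Expected number of chance matches with a raw score of at least `raw`."""
    return K * search_space * math.exp(-lam * raw)

def e_value_from_bits(bits, search_space):
    """Same quantity computed from a bit score."""
    return search_space * 2.0 ** (-bits)

# Effective lengths printed at the end of the BLAST output of this course
effective_query_length = 718
effective_database_length = 961_058
search_space = effective_query_length * effective_database_length

e_weak = e_value_from_raw(69, lam=0.267, K=0.0410, search_space=search_space)
print(search_space, round(e_weak, 2))  # the weakest hit is reported with E = 0.28

# Doubling the search space doubles the expected number of chance matches
e_doubled = e_value_from_raw(69, 0.267, 0.0410, 2 * search_space)
```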
15 Threshold on E-value
- The lower the E-value, the more significant
the match.
- A high E-value (> 1) indicates that the match
should not be trusted too much.
- One essential parameter of FastA and BLAST is the
threshold on the E-value.
- Beware
- On the BLAST Web server at NCBI, the default
threshold value is 10.
- This means that, for each query, one expects 10
matches by chance alone.
- If this default value is used, we already know
that the answer is likely to contain about 10 false
positives.
16 Matching statistics - database-wise P-value (Family-Wise Error Rate)
- From the E-value (E), one can estimate the
probability of observing at least X matches by
chance in random sequences.
- This is a simple application of the Poisson
distribution: calculate the probability of
observing X occurrences of an event when
E are expected.
- A particular case is the probability of observing
at least one match by chance (X >= 1).
- This probability is called the database-wise P-value.
- This P-value represents the probability of finding
at least one spurious match in the whole database
search, with a score greater than or equal to S.
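The Poisson calculation can be written out as follows. The function names are assumptions; the E-values used in the example are the weakest hit of the BLAST example of this course (0.28) and the default NCBI threshold (10).

```python
import math

def prob_at_least(x, expected):
    """P(X >= x) for a Poisson distribution with mean `expected`."""
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i) for i in range(x)
    )
    return 1.0 - below

def database_wise_p_value(e_value):
    """P(X >= 1) = 1 - exp(-E), computed with expm1 for small E."""
    return -math.expm1(-e_value)

p_weak = database_wise_p_value(0.28)
p_default = database_wise_p_value(10)
print(round(p_weak, 4), round(p_default, 5))
```

For small E-values the two quantities are nearly equal (for E = 0.001, the P-value is about 0.001); they diverge as E approaches and exceeds 1, since a probability cannot exceed 1.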
17 Distribution of probability for matching scores
- When one performs a database similarity search,
the distribution of scores follows an extreme
value distribution. This distribution is
asymmetric, and should thus not be modelled with
a normal (Gaussian) distribution.
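The asymmetry can be made concrete with the Gumbel form of the extreme value distribution, P(S >= x) = 1 - exp(-K * N * exp(-lambda * x)). The sketch below assumes this form and uses the gapped lambda, K and effective search space printed in the BLAST output of this course; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def evd_summary(lam, K, search_space):
    """Mode, median and mean of the Gumbel distribution of chance scores."""
    mode = math.log(K * search_space) / lam
    median = mode - math.log(math.log(2)) / lam
    mean = mode + EULER_GAMMA / lam
    return mode, median, mean

def evd_p_value(score, lam, K, search_space):
    """P(S >= score) = 1 - exp(-K * N * exp(-lambda * score))."""
    return -math.expm1(-K * search_space * math.exp(-lam * score))

mode, median, mean = evd_summary(0.267, 0.0410, 690039644)
print(round(mode, 1), round(median, 1), round(mean, 1))

p_tail = evd_p_value(69, 0.267, 0.0410, 690039644)
```

The mode lies below the median, which lies below the mean: the distribution has a long right tail, so a Gaussian model would underestimate the probability of high chance scores.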
18 Interpreting similarity search results
19 Score distribution
- The histogram shows the number of database
matches for each score.
- For scores higher than 92, the number of matches
is very small.
- A higher-resolution histogram (zoom) is shown beside
the main histogram.
- Asterisks (*) represent the random expectation
(E-value) for each score.
FastA output from Pearson (2000)
20 BLAST result
- The text shows the result of a BLAST search.
- Query: the E. coli protein MetL, a bifunctional
enzyme combining aspartokinase and homoserine
dehydrogenase activities.
- Database: all proteins from Escherichia coli K12.
- The BLAST result file starts with a summary of
- the parameters used for the search
- the matching sequences and the score of each
match.
BLASTP 2.2.6 [Apr-09-2003]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= metL gi|16131778|ref|NP_418375.1| aspartokinase II and
homoserine dehydrogenase II bifunctional aspartokinase II
(N-terminal) homoserine dehydrogenase II (C-terminal)
Escherichia coli K12
(810 letters)

Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
4242 sequences; 1,351,322 total letters

Searching.........done

Sequences producing significant alignments:                           Score (bits)  E Value
gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine deh...   1596          0.0
gi|16127996|ref|NP_414543.1| bifunctional aspartokinase I (N-te...    344           2e-95
gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive...   122           7e-29
gi|16128228|ref|NP_414777.1| gamma-glutamate kinase Escherichia...    31            0.28

>gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine
dehydrogenase II bifunctional aspartokinase II (N-terminal)
homoserine dehydrogenase II (C-terminal) Escherichia coli K12
Length = 810
Score = 1596 bits (4132), Expect = 0.0
Identities = 810/810 (100%), Positives = 810/810 (100%)
Query 1
MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA
GSTTNQLINW 60 MSVIAQAGAKGRQLHKFGGSSLADV
KCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW Sbjct 1
MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA
GSTTNQLINW 60 Query 61 LKLSQTDRLSAHQVQQTLRRYQCD
LISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120
LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLER
LAALLDSGINDA Sbjct 61 LKLSQTDRLSAHQVQQTLRRYQCDLI
SGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120 Query
121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQ
VDEGLSYPLLQQLL 180 VYAEVVGHGEVWSARLMSAV
LNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL Sbjct
121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQ
VDEGLSYPLLQQLL 180 Query 181 VQHPGKRLVVTGFISRNNA
GETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240
VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGV
SRVTIWSDVAGVYSADP Sbjct 181 VQHPGKRLVVTGFISRNNAGE
TVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP
240 Query 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQ
PVSGSEIDLQLRCSYTPDQGSTRIER 300
RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYT
PDQGSTRIER Sbjct 241 RKVKDACLLPLLRLDEASELARLAAPVL
HARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300 Query 301
VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLA
VGVHNDRQLL 360 VLASGTGARIVTSHDDVCLIEFQV
PASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL Sbjct 301
VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLA
VGVHNDRQLL 360 Query 361 QFCYTSEVADSALKILDEAGLPG
ELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420
QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLH
CHRFWQQLKGQPV Sbjct 361 QFCYTSEVADSALKILDEAGLPGEL
RLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420 Query
421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGK
GNIGSRWLELFARE 480 EFTWQSDDGISLVAVLRTGP
TESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE Sbjct
421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGK
GNIGSRWLELFARE 480 Query 481 QSTLSARTGFEFVLAGVVD
SRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540
QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEA
VEQDEESLFLWMRAHPY Sbjct 481 QSTLSARTGFEFVLAGVVDSR
RSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY
540 Query 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKL
AGASDSNKYRQIHDAFEKTGRHWLYN 600
DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAF
EKTGRHWLYN Sbjct 541 DDLVVLDVTASQQLADQYLDFASHGFHV
ISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600 Query 601
ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTEL
VDQAWQQGLT 660 ATVGAGLPINHTVRDLIDSGDTIL
SISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT Sbjct 601
ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTEL
VDQAWQQGLT 660 Query 661 EPDPRDDLSGKDVMRKLVILARE
AGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720
EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGS
IDHFFENGDELNE Sbjct 661 EPDPRDDLSGKDVMRKLVILAREAG
YNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720 Query
721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLP
CDNVFAIESRWYRD 780 QMVQRLEAAREMGLVLRYVA
RFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD Sbjct
721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLP
CDNVFAIESRWYRD 780 Query 781 NPLVIRGPGAGRDVTAGAI
QSDINRLAQLL 810 NPLVIRGPGAGRDVTAGAIQSDI
NRLAQLL Sbjct 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL
810 gtgi16127996refNP_414543.1 bifunctional
aspartokinase I (N-terminal)
homoserine dehydrogenase I (C-terminal)
Escherichia coli K12 Length 820
Score 344 bits (882), Expect 2e-95
Identities 247/821 (30), Positives 410/821
(49), Gaps 44/821 (5) Query 16
KFGGSSLADVKCYLRVAGIMAEYSQPDDMM-VVSAAGSTTNQLINWLKLS
QTDRLSAHQV 74 KFGGSA LRVA I
VSA TN L Sbjct 5
KFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKT
ISGQDALPNI 64 Query 75 QQTLRRYQCDLISGLLPAEEADSL
--ISAFVSDLERLAALLDSGIN------DAVYAEVV 126
R LGL A L FV GI
D A Sbjct 65 SDAERIF-AELLTGLAAAQPGFPLAQ
LKTFVDQEFAQIKHVLHGISLLGQCPDSINAALI 123 Query
127 GHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAER---AAQPQVDE
GLSYPLLQQLLVQH 183 GE S M VL G
D E L A E H Sbjct
124 CRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAE
STRRIAASRIPADH 183 Query 184 PGKRLVVTGFISRNNAGET
VLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243
GF N GE VLGRNGSDYSA A
IWDV GVY DPRV Sbjct 184 ---MVLMAGFTAGNEKGELVV
LGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
240 Query 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVS
GSEIDLQLRCSYTPDQ-----GSTRI 298 DA LL
EA EL A VLH RT P I P
GR Sbjct 241 PDARLLKSMSYQEAMELSYFGAKVLHPRTITPI
AQFQIPCLIKNTGNPQAPGTLIGASRD 300 Query 299
ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRP
LAVGVHNDRQ 358 E L
P RA Sbjct 301
EDELP----VKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISV
VLITQSSSEY 356 Query 359 LLQFCYTSEVADSALKILDEA--
-----GLPGELRLRQGLALVAMVGAGVTRNPLHCHRF 411
FC A E GL L
LAVG G F Sbjct 357
SISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGM
RTLRGISAKF 416 Query 412 WQQLKGQPVEFTW--QSDDGISL
VAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNI 469
L Q S V HQ F
I G G Sbjct 417 FAALARANINIVAIAQGSSERSIS
VVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGV 476 Query
470 GSRWLELFAREQSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRA
LAFFNDEAVEQDEE 529 G LE RQS L
GV S L GL L E E Sbjct
477 GGALLEQLKRQQSWLKNKH-IDLRVCGVANSKALLTNVHGLN----
LENWQEELAQAKEP 531 Query 530 ----SLFLWMRAHPYDDLV
VLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQ 585
L VD TSQ ADQY DF
GFHV NK A S Y Q Sbjct 532
FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKA
NTSSMDYYHQ 591 Query 586 IHDAFEKTGRHWLYNATVGAGLP
INHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSV 645
A EK R LY VGAGLP LGD SGI
SGLSF D Sbjct 592 LRYAAEKSRRKFLYDTNVGAGLP
VIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGM 651 Query
646 PFTELVDQAWQQGLTEPDPRDDLSGKDVMRKLVILAREAGYNIEPD
QVRVESLVPAHCEG 705 FE A G
TEPDPRDDLSG DV RKLILARE G E E PA
Sbjct 652 SFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARE
TGRELELADIEIEPVLPAEFNA 711 Query 706
-GSIDHFFENGDELNEQMVQRLEAAREMGLVLRYVARFDANGKARVGVEA
VREDHPLASL 764 G F N L R
AR G VLRYV D G RV V PL Sbjct 712
EGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAE
VDGNDPLFKV 771 Query 765 LPCDNVFAIESRWYRDNPLVIRG
PGAGRDVTAGAIQSDINR 805 N A S Y
PLVRG GAG DVTA D R Sbjct 772
KNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR
812 gtgi16131850refNP_418448.1 aspartokinase
III, lysine sensitive aspartokinase
III, lysine-sensitive Escherichia coli
K12 Length 449 Score 122 bits
(307), Expect 7e-29 Identities 121/452
(26), Positives 194/452 (42), Gaps 25/452
(5) Query 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVS
AAGSTTNQLINWLK-LSQTDRLSAHQV 74
KFGGSAD R A I VSA TN L L
R Sbjct 8 KFGGTSVADFDAMNRSADIVLSDANVR-
LVVLSASAGITNLLVALAEGLEPGERF---EK 63 Query 75
QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAE
VVGHGEVWSA 134 R Q L
I LA A EV HGE S Sbjct 64
LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDE
LVSHGELMST 123 Query 135 RLMSAVLNQQGLPAAWLDAREFL
RA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192
L L A W D R R R D L
L LVT G Sbjct 124 LLFVEILRERDVQAQWFDVRKVMR
TNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183 Query
193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSA
DPRKVKDACLLPLL 252 FI N G T LGR
GSDYA SRV IWDV GY DPR V A
Sbjct 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIW
TDVPGIYTTDPRVVSAAKRIDEI 243 Query 253
RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI----
-----ERVLA 303 EAEA A VLH TL P
SI S P G T R LA Sbjct 244
AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKT
ENPPLFRALA 303 Query 304 SGTGARIVTSHDDVCLIEFQVPA
SQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363
T H L A LA I L
A L Sbjct 304 LRRNQTLLTLHSLNMLHSRGFLA
EVFGILARHNISVDLITTSEVSVAL-------TLDTT 356 Query
364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLH
CHRFWQQLKGQPVE 421 D L L E
GLALVAG L Sbjct
357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGV
GKEVFGVLEPFNIR 416 Query 422 FTWQSDDGISLVAVLRTGP
TESLIQGLHQSVF 453 L
E Q LH F Sbjct 417 MICYGASSHNLCFLVPGEDAEQVVQK
LHSNLF 448 gtgi16128228refNP_414777.1
gamma-glutamate kinase Escherichia
coli K12 Length 367 Score 31.2
bits (69), Expect 0.28 Identities 17/56
(30), Positives 29/56 (51) Query 194
ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV
KDACLL 249 I NA T D
LAG D GYADPR A L Sbjct 133
INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSN
PQAELI 188

Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
Posted date: Sep 8, 2004 12:13 PM
Number of letters in database: 1,351,322
Number of sequences in database: 4242

         Lambda   K        H
Ungapped 0.320    0.136    0.397
Gapped   0.267    0.0410   0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 2,199,628
Number of Sequences: 4242
Number of extensions: 96525
Number of successful extensions: 290
Number of sequences better than 1.0: 4
Number of HSP's better than 1.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 279
Number of HSP's gapped (non-prelim): 5
length of query: 810
length of database: 1,351,322
effective HSP length: 92
effective length of query: 718
effective length of database: 961,058
effective search space: 690039644
effective search space used: 690039644
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.8 bits)
S2: 65 (29.6 bits)
21 BLAST result - first match
The first match is the query sequence itself
(metL). This is not surprising, since we scanned
the set of all E. coli proteins with a protein
from E. coli. The E-value (0.0) means that, with
this level of similarity, one would expect no
false positive by chance.
>gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine
dehydrogenase II bifunctional aspartokinase II (N-terminal)
homoserine dehydrogenase II (C-terminal) Escherichia coli K12
Length = 810
Score = 1596 bits (4132), Expect = 0.0
Identities = 810/810 (100%), Positives = 810/810 (100%)
22 BLAST result - second match
The second match is another bifunctional protein,
the product of the gene thrA. This protein contains
the same two domains as MetL (aspartokinase and
homoserine dehydrogenase). The alignment covers
almost the complete sequences (820 aa), with 30%
identity and 49% similarity. The E-value is
very low (2e-95), indicating that thrA and metL
are likely to be true homologs.
>gi|16127996|ref|NP_414543.1| bifunctional aspartokinase I
(N-terminal) homoserine dehydrogenase I (C-terminal)
Escherichia coli K12
Length = 820
Score = 344 bits (882), Expect = 2e-95
Identities = 247/821 (30%), Positives = 410/821 (49%), Gaps = 44/821 (5%)
23BLAST result - third match
The third match is the product of the gene lysC:
aspartokinase III. This protein contains the
aspartokinase domain, but not the homoserine
dehydrogenase domain. Consequently, the alignment
only extends over the first half of the query
protein (453 aa). On this segment, there is a good
level of identity (26%) and similarity (42%). The
E-value is very low (7e-29), indicating that the
two domains are likely to be true homologs.
>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive; aspartokinase III, lysine-sensitive [Escherichia coli K12]
          Length = 449

 Score = 122 bits (307), Expect = 7e-29
 Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)

Query  16   KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV  74
            KFGG S AD     R A I           V SA    TN L      L    R
Sbjct  8    KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK  63

Query  75   QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA  134
                R  Q      L         I         LA         A   E V HGE  S
Sbjct  64   LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST  123

Query  135  RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G  192
             L    L      A W D R   R   R      D      L    L       LV T G
Sbjct  124  LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG  183

Query  193  FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL  252
            FI   N G T  LGR GSDY A         SRV IW DV G Y  DPR V  A
Sbjct  184  FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI  243

Query  253  RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA  303
               EA E A   A VLH  TL P   S I      S  P  G T            R LA
Sbjct  244  AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA  303

Query  304  SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC  363
                    T H    L      A     LA   I   L        A         L
Sbjct  304  LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT  356

Query  364  YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE  421
                  D  L    L E           GLALVA  G                L
Sbjct  357  GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR  416

Query  422  FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF  453
                      L         E   Q LH   F
Sbjct  417  MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF  448
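The remark that the lysC alignment covers only the first half of the query can be turned into a number. The following is a minimal sketch, assuming the coordinates and lengths reported in the output above (query residues 16 to 453 out of 810, subject residues 8 to 448 out of 449); the helper name `coverage` is illustrative and is not part of BLAST.

```python
def coverage(start, end, length):
    """Fraction of a sequence spanned by an alignment (1-based, inclusive coordinates)."""
    return (end - start + 1) / length

# Coordinates read from the lysC alignment: query is 810 aa, subject is 449 aa.
query_cov = coverage(16, 453, 810)
subject_cov = coverage(8, 448, 449)

print(f"query coverage:   {query_cov:.0%}")
print(f"subject coverage: {subject_cov:.0%}")
```

About half of the query but almost all of lysC is aligned, which is the pattern expected when a single-domain protein matches one domain of a bifunctional protein.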
24BLAST result - fourth match
The fourth match is a gamma-glutamate kinase,
product of proB. The match has the same level of
identity (30%) and similarity (51%) as the second
match (thrA). However, this match only extends
over 56 aa, whereas the alignment between thrA and
metL extends over 821 aa. The significance of the
match is thus much lower: the E-value is quite
high (0.28), suggesting that the similarity could
be an artefact. This clearly illustrates the
fact that the important parameter for evaluating
the significance of an alignment is the E-value,
not the percentage of similarity!
>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia coli K12]
          Length = 367

 Score = 31.2 bits (69), Expect = 0.28
 Identities = 17/56 (30%), Positives = 29/56 (51%)

Query  194  ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL  249
            I  N A  T        D        LAG        D  G Y ADPR    A L
Sbjct  133  INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI  188
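The identity percentage reported for this hit can be recomputed directly from the two aligned strings. The sketch below is only an illustration: the function name `identity_stats` is an assumption, and the two sequences are copied from the proB alignment above.

```python
def identity_stats(query, sbjct):
    """Count identical columns in a pairwise alignment given as two equal-length strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as an identity when both residues are equal and not a gap.
    identical = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    return identical, len(query)

# Aligned segments of the query (194-249) and of proB (133-188).
query = "ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL"
sbjct = "INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI"

identical, columns = identity_stats(query, sbjct)
print(f"Identities = {identical}/{columns} ({identical / columns:.0%})")
```

The percentage alone says nothing about the alignment length: 30% over 56 columns and 30% over several hundred columns give the same figure, which is why the E-value is the quantity to look at.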
25BLAST result - summary
  Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
    Posted date:  Sep 8, 2004 12:13 PM
  Number of letters in database: 1,351,322
  Number of sequences in database:  4242

Lambda     K      H
   0.320    0.136    0.397

Gapped
Lambda     K      H
   0.267   0.0410    0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 2,199,628
Number of Sequences: 4242
Number of extensions: 96525
Number of successful extensions: 290
Number of sequences better than 1.0: 4
Number of HSP's better than 1.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 279
Number of HSP's gapped (non-prelim): 5
length of query: 810
length of database: 1,351,322
effective HSP length: 92
effective length of query: 718
effective length of database: 961,058
effective search space: 690039644
effective search space used: 690039644
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.8 bits)
S2: 65 (29.6 bits)
The last part of the BLAST result gives some
statistics about the search, such as the number
of hits and the number of sequences in the DB.
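The summary statistics are enough to recompute the bit scores and E-values of the individual hits with the Karlin-Altschul formula, E = K m' n' exp(-lambda S), where m' and n' are the effective lengths of the query and of the database. The sketch below assumes the gapped Lambda and K values and the lengths printed above; the function names are illustrative. Because the printed parameters are rounded, the results only approximate what BLAST reports.

```python
import math

# Values copied from the BLAST summary (gapped Karlin-Altschul parameters).
LAMBDA = 0.267
K = 0.0410
QUERY_LEN = 810
DB_LETTERS = 1_351_322
DB_SEQS = 4242
EFF_HSP_LEN = 92

def effective_search_space():
    """Effective query length times effective database length."""
    eff_query = QUERY_LEN - EFF_HSP_LEN              # 810 - 92
    eff_db = DB_LETTERS - DB_SEQS * EFF_HSP_LEN      # one correction per sequence
    return eff_query * eff_db

def bit_score(raw_score):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score):
    """Expected number of chance hits: E = m' * n' * 2**(-S')."""
    return effective_search_space() * 2.0 ** (-bit_score(raw_score))

# Raw scores are the numbers in parentheses on the "Score" lines of the hits.
for name, raw in [("lysC", 307), ("proB", 69)]:
    print(f"{name}: raw={raw}  bits={bit_score(raw):.1f}  E={e_value(raw):.2g}")
```

With these inputs the effective search space comes out as 690039644, as in the summary, and the E-values land close to the reported 7e-29 for lysC and 0.28 for proB.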
26Traps for BLAST searches
- Spurious domains
- Some domains are found in many proteins. This
does not mean that these proteins have the same
function. The extent of the alignment should thus
be examined to assess whether it covers most of
the sequence length, or only a small segment.
- Low complexity regions (repetitive sequences)
- These regions return multiple matches with no
apparent functional relationship.
- Cloning vectors
- Some entries in the database contain a fragment
of the cloning vector. This can return many
apparent matches, where the matching region is
restricted to the cloning vector.
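To make the notion of a low complexity region concrete, one can measure the Shannon entropy of the residue composition in a sliding window: a repetitive stretch has an entropy close to zero. This is only a rough sketch and not the filter BLAST applies (BLAST relies on dedicated programs such as SEG for proteins and DUST for DNA). The window size, the threshold and the example sequence are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def low_complexity_windows(seq, size=12, max_entropy=2.2):
    """Return (start, entropy) for each window whose entropy is at most max_entropy."""
    hits = []
    for start in range(len(seq) - size + 1):
        h = shannon_entropy(seq[start:start + size])
        if h <= max_entropy:
            hits.append((start, h))
    return hits

# Made-up sequence: a poly-Q stretch embedded in an otherwise varied context.
seq = "MKTAYIAKQRQISFVKSHFSRQQQQQQQQQQQQLEEKVGDHTWLAR"
for start, h in low_complexity_windows(seq):
    print(start, seq[start:start + 12], round(h, 2))
```

Windows overlapping the poly-Q stretch are flagged; masking them before the search avoids matches that reflect composition rather than homology.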
27DNA versus protein searches
- When the query is a coding DNA sequence, it is
recommended to search with the translated
sequence rather than with the raw DNA sequence.
- This makes it possible to use a substitution
matrix (PAM, BLOSUM, ...), which better reflects
the evolutionary changes.
- It has been shown that some distant relationships
can be detected with translated searches, but
escape detection with DNA searches.
- It is easier to filter out low complexity regions
from proteins than from DNA sequences.
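Translating a coding sequence before the search is a simple operation. The following sketch assumes the standard genetic code and a sequence that starts in frame 1; the function name `translate` and the short DNA fragment are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a coding DNA sequence in frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    # Ignore a trailing incomplete codon, if any.
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

# Made-up coding fragment, used only to illustrate the call.
print(translate("ATGGCCAAATTTGGTGGTTAA"))
```

The resulting protein string (with `*` marking the stop codon) can then be used as the query of a protein search, so that a substitution matrix such as BLOSUM62 applies.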