Jacques.van.Heldenulb.ac.be - PowerPoint PPT Presentation

Slides: 28
Provided by: jacquesv8
1
Sequence analysis - Part 3. Similarity searches in
sequence databases
  • Introduction to Bioinformatics

2
Matching a sequence against a database
  • Bioinformatics

3
Matching a sequence against a database
  • Example of use
  • We have a gene coding sequence and we would like
    to search UniProt for all similar proteins, in
    order to get a clue about its possible function.
  • Approach: we will align our query sequence to
    each entry in UniProt.
  • Problem of size: in Dec 2005, there were
    2,500,000 entries in UniProt (Swiss-Prot +
    TrEMBL).
  • It is possible to apply dynamic programming,
    but it takes a lot of time or requires high
    computation power.

4
Fast algorithms for database matching
  • FastA
  • BLAST (Basic Local Alignment Search Tool)
  • In short:
  • These algorithms are ~50 times faster than
    Smith-Waterman.
  • They cannot guarantee the optimal solution.
  • However, a comparison with results obtained by
    dynamic programming has shown that the FastA
    answer is close to the optimum.

5
FastA strategy
  • FastA builds an index with the positions of all
    the small words (k-tuples) found in the query
    sequence
  • The program then detects diagonals of k-tuples
    between the query and the database sequences.
  • When a significant diagonal is detected, the two
    sequences are aligned with Smith-Waterman
    algorithm
  • The size of words (k) influences the behaviour of
    the program: when k increases, the search is
    faster but one might miss some matches.
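
The first two steps above (indexing the k-tuples of the query, then looking for diagonals shared with a database sequence) can be sketched as follows. This is only an illustration of the idea, not the actual FastA code: the function names and the toy sequences are made up for the example, and the scoring, filtering and joining of regions are left out.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (subject offset minus query offset)."""
    index = ktuple_index(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        # Every query position holding the same k-tuple gives one hit.
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


# Toy sequences (hypothetical): the subject contains "MSVIAQ" shifted by 2.
query = "MSVIAQAGAK"
subject = "GGMSVIAQTT"
hits = diagonal_hits(query, subject, k=2)
print(hits.most_common(1))  # the densest diagonal and its number of hits
```

With a larger k the index has fewer, more specific entries, so fewer hits are found on the same diagonal, which illustrates the speed versus sensitivity trade-off mentioned above.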

6
FastA strategy
  • Comparison of k-tuple positions between query and
    database sequences
  • Highest-density regions ("init regions") are
    identified (the best one is highlighted with an
    asterisk). Regions with a score below a given
    threshold appear in dotted lines.
  • Low-scoring regions are filtered out, and the
    remaining regions are joined.

Source: Mount (2000)
7
BLAST strategy (gapless version, Altschul et al.,
1990)
  • Version 1 (1990)
  • Builds a dictionary of k-tuples (small words)
    found in the query sequence.
  • Uses a substitution matrix (e.g. BLOSUM) to
    calculate a score between these words and each
    possible word of the same length.
  • Only retains the words with a high score.
  • Each time a word from the dictionary is found
    (a hit) in the database sequence, the program
    extends the hit in both directions (without
    gaps), to obtain a High-scoring Segment Pair (HSP).
  • The program returns sequences with significant
    high-scoring segment pairs.
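
The gapless extension of a hit can be sketched as below. This is a simplified illustration under explicit assumptions: the scoring function (+2 for a match, -3 for a mismatch) stands in for a real substitution matrix such as BLOSUM, and the stopping rule (give up once the running score drops more than `x_drop` below the best score seen) is an assumed drop-off criterion that the slide does not describe. All names are hypothetical.

```python
def extend_hit(query, subject, q_pos, s_pos, k, score, x_drop):
    """Extend a word hit without gaps in both directions.

    Returns (total_score, q_start, q_end); query[q_start:q_end] is the
    query side of the resulting segment pair.
    """
    total = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    def extend(q, s, step):
        # Walk away from the seed, remember the best cumulative gain, and
        # stop once the running score falls more than x_drop below it.
        best = running = length = best_len = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_gain, right_len = extend(q_pos + k, s_pos + k, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return total + left_gain + right_gain, q_pos - left_len, q_pos + k + right_len


def simple_score(a, b):
    """Toy scoring scheme (assumption): +2 for identity, -3 otherwise."""
    return 2 if a == b else -3


query = "KKKACDEFGHKKK"
subject = "PPPACDEFGHPPP"
# The seed is the word "DEF", found at position 5 in both sequences.
hsp = extend_hit(query, subject, q_pos=5, s_pos=5, k=3,
                 score=simple_score, x_drop=5)
print(hsp, query[hsp[1]:hsp[2]])
```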

8
BLAST strategy (gapped version, Altschul et al.,
1997)
  • Version 2 (1997)
  • Uses smaller words, but only extends when there
    are two hits on the same diagonal.
  • Extension includes gaps (dynamic programming).
  • The extension costs more time, but the number of
    times it is done is reduced, because the
    extension requires a pair of hits.

9
BLAST strategy (gapped version, Altschul et al.,
1997)
  • Version 1 (1990)
  • Builds a dictionary of k-tuples (small words)
    found in the query sequence.
  • Uses a substitution matrix (e.g. BLOSUM) to
    calculate a score between these words and each
    possible word of the same length.
  • Only retains the words with a high score.
  • Each time a word from the dictionary is found
    (a hit) in the database sequence, the program
    extends the hit in both directions (without
    gaps), to obtain a High-scoring Segment Pair (HSP).
  • The program returns sequences with significant
    high-scoring segment pairs.
  • Version 2 (1997)
  • Uses smaller words, but only extends when there
    are two hits on the same diagonal.
  • Extension includes gaps (dynamic programming).
  • PSI-BLAST (in the 1997 article as well)
  • A second step after the proper BLAST process.
  • Once the gapped BLAST has returned a set of
    sequences, these sequences are aligned and used
    to build a profile motif.
  • The database is then scanned with this profile
    motif to collect additional similarities.
  • The process can be iterated several times:
  • collect sequences -> build a profile -> collect
    sequences -> build a profile ...

10
BLAST family of programs
  • Different program names exist, depending on the
    type (protein or nucleic acid) of query and
    database sequences.
  • For comparison between nucleic acids and
    proteins, the nucleic acid is translated in all 6
    reading frames (3 frames per strand).

6-frame translation
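
As an illustration, a six-frame translation can be written in a few lines. This sketch assumes the standard genetic code and an uppercase DNA sequence containing only A, C, G and T; the function names and the frame labels ("+1" to "-3") are choices made for this example, not those of any BLAST program.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first position; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate the 3 frames of each strand."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCAAGTAA")
for name, protein in frames.items():
    print(name, protein)
```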
11
Statistics of sequence similarities
  • Bioinformatics

12
Matching statistics - raw score
  • The raw score of a matching segment pair (MSP) is
    obtained by summing the scores (obtained from the
    substitution matrix) for each pair of residues
    (r1,i and r2,i) over the length of the alignment
    (L).

Seq 1   R  L  A  S  V  E  T  D  M  P  L  T  L  R  Q  H
Seq 2   T  L  T  S  L  Q  T  T  L  K  A  H  L  G  T  H
Score  -1  4  0  4  1  2  5 -1  2 -1 -1 -2  4 -2 -1  8    Total: 21
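
The sum described above can be checked with a short script. The score table below contains only the residue pairs needed for this example, copied from the per-position scores of the slide; a real program would load a complete substitution matrix instead. The names `PAIR_SCORES` and `raw_score` are made up for this sketch.

```python
# Substitution scores for the residue pairs of the slide example only
# (symmetric, hence the frozenset keys).
PAIR_SCORES = {
    frozenset("RT"): -1, frozenset("L"): 4, frozenset("AT"): 0,
    frozenset("S"): 4, frozenset("VL"): 1, frozenset("EQ"): 2,
    frozenset("T"): 5, frozenset("DT"): -1, frozenset("ML"): 2,
    frozenset("PK"): -1, frozenset("LA"): -1, frozenset("TH"): -2,
    frozenset("RG"): -2, frozenset("QT"): -1, frozenset("H"): 8,
}


def raw_score(seq1, seq2, matrix):
    """Sum the substitution scores over the aligned residue pairs (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("a gapless segment pair must have equal lengths")
    return sum(matrix[frozenset((a, b))] for a, b in zip(seq1, seq2))


seq1 = "RLASVETDMPLTLRQH"
seq2 = "TLTSLQTTLKAHLGTH"
print(raw_score(seq1, seq2, PAIR_SCORES))  # 21, as on the slide
```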
13
MSP-wise P-value and bit score
  • The P-value of a matching segment pair (MSP) with
    score S is the probability of observing a score
    of at least S by chance.
  • Karlin and Altschul (1990) defined a way to
    calculate the P-value of an MSP.
  • The P-value decreases exponentially with the
    score, with two parameters, lambda and K. These
    two parameters depend on the substitution matrix
    chosen. They thus have to be estimated for each
    substitution matrix separately.
  • The analytic way to determine the parameters
    lambda and K is only valid for gapless
    alignments.
  • For alignments with gaps, Altschul et al. (1997)
    propose to estimate these parameters on the basis
    of empirical observations.
  • Bit score of an MSP
  • Karlin and Altschul (1990) also propose to
    convert the raw score S into a bit score S'.
  • This facilitates the interpretation of the score,
    because the P-value can be directly calculated
    from the bit score, irrespective of the
    substitution matrix used for the alignment.
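
The conversion formula itself is not in this transcript. The sketch below assumes the usual Karlin-Altschul normalisation, S' = (lambda * S - ln K) / ln 2, and applies it with the gapped Lambda (0.267) and K (0.0410) values printed at the end of the BLAST output shown later in this presentation. Because those parameters are rounded in the printout, the results can differ slightly from the bit scores BLAST reports.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score into a bit score (assumed Karlin-Altschul formula)."""
    return (lam * raw - math.log(k)) / math.log(2)


# Gapped parameters printed in the BLAST result of this presentation.
LAMBDA, K = 0.267, 0.0410

# Raw scores (in parentheses in the BLAST output) of three reported matches.
for raw in (4132, 882, 69):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```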

14
Matching statistics - the E-value
  • If one aligns a random word with another random
    word, the score is likely to be generally low.
  • However, if this is repeated billions of times,
    some high scores will occasionally occur by
    chance.
  • In a database scan, each word of the query
    sequence is compared to each word of the
    database.
  • For a query sequence of size m and a database of
    size n, the search space (number of word pair
    comparisons) is thus N = n*m.
  • FastA and BLAST estimate, for each score, the
    number of matches that would be obtained by
    chance alone, given the size of the database.
    This is called the E-value.
  • The E-value is the product of the P-value by the
    size of the search space.
  • For a given score S, the expected number of
    random matches thus increases with the size of
    the database.
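
A small numerical illustration, assuming the usual relation between bit score and E-value, E = N * 2^(-S'), which is not spelled out on the slide. The search space used here is the "effective search space" printed in the BLAST output shown later in this presentation (690039644), and 31.2 bits is the score of the weakest match reported there.

```python
def e_value(bit_score, search_space):
    """Expected number of chance matches scoring at least bit_score
    (assumed relation: E = N * 2 ** -S')."""
    return search_space * 2.0 ** (-bit_score)


SEARCH_SPACE = 690039644  # effective search space from the BLAST output

e = e_value(31.2, SEARCH_SPACE)
print(round(e, 2))

# Same score, database twice as large: twice as many chance matches expected.
print(round(e_value(31.2, 2 * SEARCH_SPACE), 2))
```

The first value agrees with the E-value of 0.28 reported for that match, and the second shows how the same score becomes less significant in a larger search space.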

15
Threshold on E-value
  • The lower the E-value, the more significant the
    match.
  • A high E-value (> 1) indicates that the match
    should not be trusted too much.
  • One essential parameter of FastA and BLAST is the
    threshold on E-value.
  • Beware
  • On the BLAST Web server at NCBI, the default
    threshold value is 10.
  • This means that each query is expected to return
    about 10 matches by chance alone.
  • If this default value is used, we already know
    that the answer is likely to contain about 10
    false positives.

16
Matching statistics - database-wise P-value
(Family-Wise Error Rate)
  • From the E-value (E), one can estimate the
    probability of observing at least X matches by
    chance in random sequences.
  • This is a simple application of the Poisson
    distribution: calculate the probability of
    observing X occurrences of an event whilst
    expecting E.
  • A particular case is the probability of observing
    at least one match by chance (X >= 1).
  • This probability is called the database-wise
    P-value.
  • This P-value represents the probability of finding
    at least one spurious match in the whole database
    search, with a score greater than or equal to S.
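
The Poisson calculation described above can be sketched as follows. The function name is hypothetical; the two E-values used in the example are the 0.28 of the weakest match in the BLAST result shown later and the default threshold of 10 mentioned earlier.

```python
import math


def prob_at_least(x, e_value):
    """P(X >= x) for a Poisson variable with mean e_value."""
    below = sum(math.exp(-e_value) * e_value ** i / math.factorial(i)
                for i in range(x))
    return 1.0 - below


# Database-wise P-value: probability of at least one chance match.
print(prob_at_least(1, 0.28))  # equals 1 - exp(-0.28)
print(prob_at_least(1, 10))    # with E = 10, a chance match is almost certain
```

For small E-values the database-wise P-value is close to E itself, while for large E-values it saturates at 1.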

17
Distribution of probability for matching scores
  • When one performs a database similarity search,
    the distribution of scores follows an extreme
    value distribution. This distribution is
    asymmetric, and should thus not be modelled with
    a normal (Gaussian) distribution.

18
Interpreting similarity search results
  • Bioinformatics

19
Score distribution
  • The histogram shows the number of database
    matches for each score.
  • For scores higher than 92, the number of matches
    is very small.
  • A higher resolution histogram is shown beside
    the main histogram.
  • Asterisks (*) represent the random expectation
    (E-value) for each score.

FastA output from Pearson (2000)
20
BLAST result
  • The text shows the result of a BLAST search.
  • Query: the E. coli protein MetL, a bifunctional
    enzyme combining aspartokinase and homoserine
    dehydrogenase activities.
  • Database: all proteins from Escherichia coli K12.
  • The BLAST result file starts with a summary of:
  • the parameters used for the search;
  • the matching sequences and the score of each
    match.

BLASTP 2.2.6 Apr-09-2003

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query: metL gi|16131778|ref|NP_418375.1| aspartokinase II and
homoserine dehydrogenase II bifunctional aspartokinase II
(N-terminal) homoserine dehydrogenase II (C-terminal)
Escherichia coli K12 (810 letters)

Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
4242 sequences; 1,351,322 total letters

Searching.........done

Sequences producing significant alignments                            Score (bits)  E Value
gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine deh...   1596          0.0
gi|16127996|ref|NP_414543.1| bifunctional aspartokinase I (N-te...    344           2e-95
gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive...   122           7e-29
gi|16128228|ref|NP_414777.1| gamma-glutamate kinase Escherichia...    31            0.28

>gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine
dehydrogenase II bifunctional aspartokinase II (N-terminal)
homoserine dehydrogenase II (C-terminal) Escherichia coli K12
Length = 810
Score = 1596 bits (4132), Expect = 0.0
Identities = 810/810 (100%), Positives = 810/810 (100%)
Query 1
MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA
GSTTNQLINW 60 MSVIAQAGAKGRQLHKFGGSSLADV
KCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW Sbjct 1
MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA
GSTTNQLINW 60 Query 61 LKLSQTDRLSAHQVQQTLRRYQCD
LISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120
LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLER
LAALLDSGINDA Sbjct 61 LKLSQTDRLSAHQVQQTLRRYQCDLI
SGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120 Query
121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQ
VDEGLSYPLLQQLL 180 VYAEVVGHGEVWSARLMSAV
LNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL Sbjct
121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQ
VDEGLSYPLLQQLL 180 Query 181 VQHPGKRLVVTGFISRNNA
GETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240
VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGV
SRVTIWSDVAGVYSADP Sbjct 181 VQHPGKRLVVTGFISRNNAGE
TVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP
240 Query 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQ
PVSGSEIDLQLRCSYTPDQGSTRIER 300
RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYT
PDQGSTRIER Sbjct 241 RKVKDACLLPLLRLDEASELARLAAPVL
HARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300 Query 301
VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLA
VGVHNDRQLL 360 VLASGTGARIVTSHDDVCLIEFQV
PASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL Sbjct 301
VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLA
VGVHNDRQLL 360 Query 361 QFCYTSEVADSALKILDEAGLPG
ELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420
QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLH
CHRFWQQLKGQPV Sbjct 361 QFCYTSEVADSALKILDEAGLPGEL
RLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420 Query
421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGK
GNIGSRWLELFARE 480 EFTWQSDDGISLVAVLRTGP
TESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE Sbjct
421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGK
GNIGSRWLELFARE 480 Query 481 QSTLSARTGFEFVLAGVVD
SRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540
QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEA
VEQDEESLFLWMRAHPY Sbjct 481 QSTLSARTGFEFVLAGVVDSR
RSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY
540 Query 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKL
AGASDSNKYRQIHDAFEKTGRHWLYN 600
DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAF
EKTGRHWLYN Sbjct 541 DDLVVLDVTASQQLADQYLDFASHGFHV
ISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600 Query 601
ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTEL
VDQAWQQGLT 660 ATVGAGLPINHTVRDLIDSGDTIL
SISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT Sbjct 601
ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTEL
VDQAWQQGLT 660 Query 661 EPDPRDDLSGKDVMRKLVILARE
AGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720
EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGS
IDHFFENGDELNE Sbjct 661 EPDPRDDLSGKDVMRKLVILAREAG
YNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720 Query
721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLP
CDNVFAIESRWYRD 780 QMVQRLEAAREMGLVLRYVA
RFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD Sbjct
721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLP
CDNVFAIESRWYRD 780 Query 781 NPLVIRGPGAGRDVTAGAI
QSDINRLAQLL 810 NPLVIRGPGAGRDVTAGAIQSDI
NRLAQLL Sbjct 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL
810 gtgi16127996refNP_414543.1 bifunctional
aspartokinase I (N-terminal)
homoserine dehydrogenase I (C-terminal)
Escherichia coli K12 Length 820
Score 344 bits (882), Expect 2e-95
Identities 247/821 (30), Positives 410/821
(49), Gaps 44/821 (5) Query 16
KFGGSSLADVKCYLRVAGIMAEYSQPDDMM-VVSAAGSTTNQLINWLKLS
QTDRLSAHQV 74 KFGGSA LRVA I
VSA TN L Sbjct 5
KFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKT
ISGQDALPNI 64 Query 75 QQTLRRYQCDLISGLLPAEEADSL
--ISAFVSDLERLAALLDSGIN------DAVYAEVV 126
R LGL A L FV GI
D A Sbjct 65 SDAERIF-AELLTGLAAAQPGFPLAQ
LKTFVDQEFAQIKHVLHGISLLGQCPDSINAALI 123 Query
127 GHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAER---AAQPQVDE
GLSYPLLQQLLVQH 183 GE S M VL G
D E L A E H Sbjct
124 CRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAE
STRRIAASRIPADH 183 Query 184 PGKRLVVTGFISRNNAGET
VLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243
GF N GE VLGRNGSDYSA A
IWDV GVY DPRV Sbjct 184 ---MVLMAGFTAGNEKGELVV
LGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
240 Query 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVS
GSEIDLQLRCSYTPDQ-----GSTRI 298 DA LL
EA EL A VLH RT P I P
GR Sbjct 241 PDARLLKSMSYQEAMELSYFGAKVLHPRTITPI
AQFQIPCLIKNTGNPQAPGTLIGASRD 300 Query 299
ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRP
LAVGVHNDRQ 358 E L
P RA Sbjct 301
EDELP----VKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISV
VLITQSSSEY 356 Query 359 LLQFCYTSEVADSALKILDEA--
-----GLPGELRLRQGLALVAMVGAGVTRNPLHCHRF 411
FC A E GL L
LAVG G F Sbjct 357
SISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGM
RTLRGISAKF 416 Query 412 WQQLKGQPVEFTW--QSDDGISL
VAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNI 469
L Q S V HQ F
I G G Sbjct 417 FAALARANINIVAIAQGSSERSIS
VVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGV 476 Query
470 GSRWLELFAREQSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRA
LAFFNDEAVEQDEE 529 G LE RQS L
GV S L GL L E E Sbjct
477 GGALLEQLKRQQSWLKNKH-IDLRVCGVANSKALLTNVHGLN----
LENWQEELAQAKEP 531 Query 530 ----SLFLWMRAHPYDDLV
VLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQ 585
L VD TSQ ADQY DF
GFHV NK A S Y Q Sbjct 532
FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKA
NTSSMDYYHQ 591 Query 586 IHDAFEKTGRHWLYNATVGAGLP
INHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSV 645
A EK R LY VGAGLP LGD SGI
SGLSF D Sbjct 592 LRYAAEKSRRKFLYDTNVGAGLP
VIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGM 651 Query
646 PFTELVDQAWQQGLTEPDPRDDLSGKDVMRKLVILAREAGYNIEPD
QVRVESLVPAHCEG 705 FE A G
TEPDPRDDLSG DV RKLILARE G E E PA
Sbjct 652 SFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARE
TGRELELADIEIEPVLPAEFNA 711 Query 706
-GSIDHFFENGDELNEQMVQRLEAAREMGLVLRYVARFDANGKARVGVEA
VREDHPLASL 764 G F N L R
AR G VLRYV D G RV V PL Sbjct 712
EGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAE
VDGNDPLFKV 771 Query 765 LPCDNVFAIESRWYRDNPLVIRG
PGAGRDVTAGAIQSDINR 805 N A S Y
PLVRG GAG DVTA D R Sbjct 772
KNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR
812 gtgi16131850refNP_418448.1 aspartokinase
III, lysine sensitive aspartokinase
III, lysine-sensitive Escherichia coli
K12 Length 449 Score 122 bits
(307), Expect 7e-29 Identities 121/452
(26), Positives 194/452 (42), Gaps 25/452
(5) Query 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVS
AAGSTTNQLINWLK-LSQTDRLSAHQV 74
KFGGSAD R A I VSA TN L L
R Sbjct 8 KFGGTSVADFDAMNRSADIVLSDANVR-
LVVLSASAGITNLLVALAEGLEPGERF---EK 63 Query 75
QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAE
VVGHGEVWSA 134 R Q L
I LA A EV HGE S Sbjct 64
LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDE
LVSHGELMST 123 Query 135 RLMSAVLNQQGLPAAWLDAREFL
RA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192
L L A W D R R R D L
L LVT G Sbjct 124 LLFVEILRERDVQAQWFDVRKVMR
TNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183 Query
193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSA
DPRKVKDACLLPLL 252 FI N G T LGR
GSDYA SRV IWDV GY DPR V A
Sbjct 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIW
TDVPGIYTTDPRVVSAAKRIDEI 243 Query 253
RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI----
-----ERVLA 303 EAEA A VLH TL P
SI S P G T R LA Sbjct 244
AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKT
ENPPLFRALA 303 Query 304 SGTGARIVTSHDDVCLIEFQVPA
SQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363
T H L A LA I L
A L Sbjct 304 LRRNQTLLTLHSLNMLHSRGFLA
EVFGILARHNISVDLITTSEVSVAL-------TLDTT 356 Query
364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLH
CHRFWQQLKGQPVE 421 D L L E
GLALVAG L Sbjct
357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGV
GKEVFGVLEPFNIR 416 Query 422 FTWQSDDGISLVAVLRTGP
TESLIQGLHQSVF 453 L
E Q LH F Sbjct 417 MICYGASSHNLCFLVPGEDAEQVVQK
LHSNLF 448 gtgi16128228refNP_414777.1
gamma-glutamate kinase Escherichia
coli K12 Length 367 Score 31.2
bits (69), Expect 0.28 Identities 17/56
(30), Positives 29/56 (51) Query 194
ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV
KDACLL 249 I NA T D
LAG D GYADPR A L Sbjct 133
INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSN
PQAELI 188

Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
Posted date: Sep 8, 2004 12:13 PM
Number of letters in database: 1,351,322
Number of sequences in database: 4242

Lambda   K        H
0.320    0.136    0.397

Gapped
Lambda   K        H
0.267    0.0410   0.140

Matrix: BLOSUM62
Gap Penalties: Existence 11, Extension 1
Number of Hits to DB: 2,199,628
Number of Sequences: 4242
Number of extensions: 96525
Number of successful extensions: 290
Number of sequences better than 1.0: 4
Number of HSP's better than 1.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 279
Number of HSP's gapped (non-prelim): 5
length of query: 810
length of database: 1,351,322
effective HSP length: 92
effective length of query: 718
effective length of database: 961,058
effective search space: 690039644
effective search space used: 690039644
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.8 bits)
S2: 65 (29.6 bits)
21
BLAST result - first match
The first match is the query sequence itself
(metL). This is not surprising, since we scanned
the set of all E. coli proteins with a protein
from E. coli. The E-value (0) means that, with
this level of similarity, one would expect 0
false positives by chance.
22
BLAST result - second match
The second match is another bifunctional protein,
product of the gene thrA. This protein contains
the same two domains as metL (aspartokinase and
homoserine dehydrogenase). The alignment covers
almost the complete sequences (820 aa), with 30%
identity and 49% similarity. The E-value is
very low (2e-95), indicating that thrA and metL
are likely to be true homologs.
prelim test 0 Number of HSP's that attempted
gapping in prelim test 279 Number of HSP's
gapped (non-prelim) 5 length of query
810 length of database 1,351,322 effective HSP
length 92 effective length of query
718 effective length of database
961,058 effective search space
690039644 effective search space used
690039644 T 11 A 40 X1 16 ( 7.4 bits) X2 38
(14.6 bits) X3 64 (24.7 bits) S1 41 (21.8
bits) S2 65 (29.6 bits)
23
BLAST result - third match
The third match is the product of the gene lysC:
aspartokinase III. This protein contains the
aspartokinase domain, but not the homoserine
dehydrogenase domain. Consequently, the alignment
only extends over the first half of the query
protein (453 aa). On this segment, there is a good
level of identity (26%) and similarity (42%). The
E-value is very low (7e-29), indicating that the
two domains are likely to be true homologs.
>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive; aspartokinase III, lysine-sensitive [Escherichia coli K12]
Length = 449
Score = 122 bits (307), Expect = 7e-29
Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)

Query  16  KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV 74
Sbjct   8  KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK 63

Query  75  QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA 134
Sbjct  64  LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST 123

Query 135  RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192
Sbjct 124  LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183

Query 193  FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL 252
Sbjct 184  FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI 243

Query 253  RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA 303
Sbjct 244  AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA 303

Query 304  SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363
Sbjct 304  LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT 356

Query 364  YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE 421
Sbjct 357  GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR 416

Query 422  FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF 453
Sbjct 417  MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF 448
24
BLAST result - fourth match
The fourth match is a gamma-glutamate kinase,
product of proB. The match has the same level of
identity (30%) and similarity (51%) as the second
match (thrA). However, this match only extends
over 56 aa, whereas the alignment between thrA and
metL extends over 821 aa. The significance of the
match is thus much lower: the E-value is quite
high (0.28), suggesting that the similarity could
be an artefact. This clearly illustrates that the
important parameter for evaluating the
significance of an alignment is the E-value, not
the percentage of similarity!
>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia coli K12]
Length = 367
Score = 31.2 bits (69), Expect = 0.28
Identities = 17/56 (30%), Positives = 29/56 (51%)

Query 194  ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249
Sbjct 133  INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188
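The relation between score and E-value can be checked with a small calculation. The sketch below assumes the usual Karlin-Altschul relations, bit score S' = (lambda * S - ln K) / ln 2 and E = N * 2^(-S'), and takes the gapped lambda and K and the effective search space N from the statistics printed at the end of this same BLAST report. The function names are illustrative and are not part of BLAST.

```python
import math

# Values copied from the statistics at the end of the BLAST report.
GAPPED_LAMBDA = 0.267
GAPPED_K = 0.0410
EFFECTIVE_SEARCH_SPACE = 690039644


def bit_score(raw_score, lam=GAPPED_LAMBDA, k=GAPPED_K):
    """Convert a raw alignment score into a normalised bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, search_space=EFFECTIVE_SEARCH_SPACE):
    """Expected number of chance alignments with at least this bit score."""
    return search_space * 2.0 ** (-bits)


# Raw scores (in parentheses in the report) of the third and fourth matches.
for gene, raw in [("lysC", 307), ("proB", 69)]:
    bits = bit_score(raw)
    print(f"{gene}: raw score {raw}, {bits:.1f} bits, E = {e_value(bits):.2g}")
```

With these inputs the calculation gives roughly 122.9 bits and an E-value close to 7e-29 for lysC, and roughly 31.2 bits and an E-value close to 0.28 for proB, in line with the report. Because the E-value falls exponentially with the score, a short alignment with the same percentage of identity is far less significant.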
25
BLAST result - summary
Database: /Users/jvanheld/rsa-tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa
  Posted date: Sep 8, 2004 12:13 PM
  Number of letters in database: 1,351,322
  Number of sequences in database: 4242

Lambda     K        H
 0.320     0.136    0.397

Gapped
Lambda     K        H
 0.267     0.0410   0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 2,199,628
Number of Sequences: 4242
Number of extensions: 96525
Number of successful extensions: 290
Number of sequences better than 1.0: 4
Number of HSP's better than 1.0 without gapping: 4
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 279
Number of HSP's gapped (non-prelim): 5
length of query: 810
length of database: 1,351,322
effective HSP length: 92
effective length of query: 718
effective length of database: 961,058
effective search space: 690039644
effective search space used: 690039644
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.8 bits)
S2: 65 (29.6 bits)
The last part of the BLAST result gives some
statistics about the search, such as the number of
hits and the number of sequences in the database.
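The "effective" lengths in this summary can be reproduced from the other figures. The sketch below assumes the classical edge-effect correction, in which the effective HSP length is subtracted once from the query and once per database sequence. The function name is illustrative, and more recent BLAST versions may compute the correction differently.

```python
def effective_search_space(query_len, db_letters, db_sequences, hsp_len):
    """Return (effective query length, effective database length, search space)."""
    eff_query = query_len - hsp_len
    eff_db = db_letters - db_sequences * hsp_len
    return eff_query, eff_db, eff_query * eff_db


# Figures taken from the summary of the BLAST report.
eff_query, eff_db, space = effective_search_space(
    query_len=810, db_letters=1351322, db_sequences=4242, hsp_len=92
)
print(eff_query, eff_db, space)  # 718 961058 690039644
```

The three values match the "effective length of query", "effective length of database" and "effective search space" lines of the report.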
26
Traps for BLAST searches
  • Spurious domains
  • Some domains are found in many proteins. This
    does not mean that these proteins have the same
    function. The width of the alignment should thus
    be analyzed to assess whether the alignment
    covers most of the sequence length, or only a
    small segment.
  • Low complexity regions (repetitive sequences)
  • These can return multiple matches with no apparent
    functional relationship.
  • Cloning vectors
  • Some entries in the database contain a fragment
    of the cloning vector. This can return many
    apparent matches, where the matching region is
    restricted to the cloning vector.
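One simple way to check the width of an alignment is to compute the fraction of the query that it covers. The sketch below uses the query coordinates of the third and fourth matches from the BLAST report shown earlier (query length 810). The function name and the 80% cut-off are arbitrary assumptions made for illustration, not a recommended standard.

```python
def query_coverage(q_start, q_end, query_len):
    """Fraction of the query sequence covered by the alignment (1-based, inclusive)."""
    return (q_end - q_start + 1) / query_len


QUERY_LEN = 810
# Query start and end positions read from the alignments in the report.
hits = {"lysC": (16, 453), "proB": (194, 249)}

for gene, (start, end) in hits.items():
    cov = query_coverage(start, end, QUERY_LEN)
    # Hypothetical threshold: below it, the match may concern a single domain only.
    note = "covers most of the query" if cov >= 0.8 else "partial match, check domains"
    print(f"{gene}: {cov:.0%} of the query ({note})")
```

Here lysC covers a little more than half of the query, consistent with a shared aspartokinase domain, while proB covers less than a tenth of it.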

27
DNA versus protein searches
  • When the query is a coding DNA sequence, it is
    recommended to search with the translated
    sequence rather than with the raw DNA sequence.
  • This makes it possible to use a substitution
    matrix (PAM, BLOSUM, ...), which better reflects
    the evolutionary changes.
  • It has been shown that some distant relationships
    can be detected with translated searches, but
    escape detection with DNA searches.
  • It is easier to filter out low complexity regions
    from proteins than from DNA sequences.
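Translating the query before the search can be sketched as follows. This is a minimal example that assumes the standard genetic code, a known reading frame and an unspliced coding sequence. The function name and the short test sequence are made up for illustration; in practice a library routine or a translated-search program would normally be used.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a coding DNA sequence, starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons with ambiguous bases give 'X'.
    A trailing incomplete codon is ignored.
    """
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGAAATTTGGTGGTTAA"))  # MKFGG*
```

The resulting protein sequence can then be compared to a protein database with a substitution matrix such as BLOSUM62.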