Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Slides: 87
Provided by: musashi

1
Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs
  • Presented by
  • ???
  • ???
  • ???
  • ???

2
Outline
  • BLAST 1.0
  • BLAST 2.0
  • The two-hit method
  • Gapped alignment
  • PSI-BLAST
  • Performance evaluation
  • Discussion and Conclusion
  • NCBI website

3
Statistical preliminaries
  • HSP: high-scoring segment pair
  • Locally optimal pair
  • $S' = (\lambda S - \ln K) / \ln 2$
  • $S'$: normalized score; $S$: raw score
  • $P_i$: background probability with which amino acid
    $i$ occurs randomly at any position
  • $s_{ij}$: score for aligning the pair of amino acids
    $i$ and $j$
  • $K$: minor constant
  • $\lambda$: constant to adjust for the scoring matrix
  • $s_{ij}$ and $P_i$ determine $K$ and $\lambda$

4
  • $E = N / 2^{S'}$
  • $E$: expected number of distinct HSPs with normalized
    score at least $S'$
  • $N = mn$ is the search space
  • $S' = \log_2(N/E)$
  • $q_{ij} = P_i P_j e^{\lambda_u s_{ij}}$
  • $q_{ij}$: target frequency of the aligned pair of letters
    $(i, j)$ within HSPs (high-scoring segment pairs)
  • $\lambda_u$: the unique positive number for which the
    $q_{ij}$ sum to 1
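
The relations between the raw score, the normalized score and the E-value can be tried out numerically. The following is a minimal sketch, not part of the slides: the values of `LAMBDA` and `K` are assumed, typical ungapped BLOSUM62 parameters, and the function names are made up for illustration.

```python
import math

# Assumed statistical parameters for ungapped BLOSUM62 scoring.
# They are not given on the slides and are used here for illustration only.
LAMBDA = 0.3176
K = 0.134


def normalized_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2, the normalized (bit) score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hsps(bit_score, m, n):
    """E = N / 2^S', with search space N = m * n."""
    return (m * n) / 2.0 ** bit_score


def bits_needed(e_value, m, n):
    """S' = log2(N / E): bit score needed to reach a given E-value."""
    return math.log2((m * n) / e_value)


if __name__ == "__main__":
    m, n = 250, 50_000_000  # hypothetical query and database lengths
    print(round(bits_needed(0.01, m, n), 1), "bits needed for E = 0.01")
    print(round(normalized_score(100), 1), "bits for raw score 100")
```

Because `bits_needed` is the inverse of `expected_hsps`, feeding one into the other returns the original E-value, which is a quick sanity check on the two formulas.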

5
BLAST
  • Basic Local Alignment Search Tool (by Altschul,
    Gish, Miller, Myers and Lipman)
  • The BLAST programs are widely used tools for
    searching protein and DNA databases for sequence
    similarities.
  • BLAST is a heuristic that attempts to optimize a
    specific similarity measure.
  • The central idea of the BLAST algorithm is that a
    statistically significant alignment is likely to
    contain a high-scoring pair of aligned words.

6
The maximal segment pair measure
  • MSP (maximal segment pair): the highest-scoring
    pair of identical-length segments chosen from 2
    sequences
  • For DNA: identities +5, mismatches -4
  • For protein: BLOSUM62
  • BLAST heuristically attempts to calculate the MSP
    score.
  • DP is O(mn), but BLAST is O(m)

the highest scoring pair
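
For comparison with the heuristic, the exact MSP score can be computed by scanning every diagonal, which takes O(mn) time. This is a small sketch using the DNA scores from the slide (+5 / -4); the function name `msp_score` is made up for illustration, and this is the exhaustive computation, not what BLAST itself does.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact maximal segment pair score of a and b (ungapped, O(mn))."""
    best = 0
    m, n = len(a), len(b)
    # Each diagonal d = j - i is one ungapped way of laying a against b.
    for d in range(-(m - 1), n):
        i = max(0, -d)
        j = i + d
        running = 0
        while i < m and j < n:
            running += match if a[i] == b[j] else mismatch
            if running < 0:
                running = 0  # a segment never starts with a negative prefix
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("ACGTACGT", "ACGTTCGT"))
```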
7
BLAST 1.0
  • Build the hash table for Sequence A.
  • Scan Sequence B for hits.
  • Extend hits.

8
Step 1: Build the hash table for Sequence A
(3-tuple example)
  • For DNA
  • Seq. A:   ACGTAGTA
  • Position: 12345678
  • Hash table (word, then its positions in Seq. A):
  • AAA
  • AAC
  • ..
  • ACG 1
  • ..
  • AGT 5
  • ..
  • CGT 2
  • ..
  • GTA 3 6
  • ..
  • TAG 4
  • ..
  • TTT
  • For protein
  • Seq. A: YGGFM
  • Add xyz to the hash table if Score(xyz, YGG) ≥ T
  • Add xyz to the hash table if Score(xyz, GGF) ≥ T
  • Add xyz to the hash table if Score(xyz, GFM) ≥ T
  • T: threshold parameter
  • A high T yields greater speed,
  • but weak similarities may be missed

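
The DNA case of steps 1 and 2 can be sketched in a few lines. This is an illustrative sketch only: a Python dictionary stands in for the hash table, and the function names are made up here.

```python
from collections import defaultdict


def build_word_table(seq_a, w=3):
    """Map each length-w word of seq_a to its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(seq_a) - w + 1):
        table[seq_a[start:start + w]].append(start + 1)
    return dict(table)


def scan_for_hits(table, seq_b, w=3):
    """Return (position in A, position in B) for every shared word."""
    hits = []
    for start in range(len(seq_b) - w + 1):
        for pos_a in table.get(seq_b[start:start + w], []):
            hits.append((pos_a, start + 1))
    return hits


if __name__ == "__main__":
    table = build_word_table("ACGTAGTA")
    print(table)  # GTA occurs at positions 3 and 6, as on the slide
    print(scan_for_hits(table, "TTGTAC"))
```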
9
List all words in query
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Words: YGG GGF GFM FMT MTS TSE SEK ...
10
Augment word list
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Words: YGG GGF GFM FMT MTS TSE SEK ...
Candidate words: AAA AAB AAC ... YYY
11
BLOSUM62 scores
GGF vs. AAA: 0 + 0 + (-2) = -2 (non-match)
GGF vs. GGY: 6 + 6 + 3 = 15 (match)
A user-specified threshold determines which
three-letter words are considered matches and
non-matches.
12
Query: YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Words: YGG GGF GFM FMT MTS TSE SEK ...
Words matching GGF: GGI GGL GGM GGF GGW GGY
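
The augmentation step can be sketched as a brute-force search over all candidate words. To keep the example short, it assumes a five-letter alphabet (A, F, G, W, Y) and only the BLOSUM62 entries for those residues; a real search uses all 20 amino acids, so the resulting list is shorter than the one on the slide. The function names and the threshold value 13 are illustrative.

```python
from itertools import product

# Assumption: alphabet restricted to five residues for brevity.
ALPHABET = "AFGWY"
# BLOSUM62 entries for these residues (upper triangle only).
SCORES = {
    ("A", "A"): 4, ("A", "F"): -2, ("A", "G"): 0, ("A", "W"): -3, ("A", "Y"): -2,
    ("F", "F"): 6, ("F", "G"): -3, ("F", "W"): 1, ("F", "Y"): 3,
    ("G", "G"): 6, ("G", "W"): -2, ("G", "Y"): -3,
    ("W", "W"): 11, ("W", "Y"): 2,
    ("Y", "Y"): 7,
}


def pair_score(x, y):
    """Symmetric lookup in the partial matrix."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def word_score(u, v):
    """Sum of pairwise scores for two words of equal length."""
    return sum(pair_score(x, y) for x, y in zip(u, v))


def neighborhood(word, threshold):
    """All words over ALPHABET scoring at least threshold against word."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return sorted(c for c in candidates if word_score(c, word) >= threshold)


if __name__ == "__main__":
    print(word_score("GGF", "AAA"), word_score("GGF", "GGY"))
    print(neighborhood("GGF", threshold=13))
```

Enumerating every candidate word is fine for three-letter words (8000 candidates with 20 residues), which is why a small word size keeps this step cheap.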
13
Store words in search tree
Augmented list of query words → search tree
Lookup: "Does this query contain GGF?" → search tree (O(1) time) → "Yes, at position 2."
14
Search tree
G → G → {L, M, F, W, Y}
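
A search tree like this one can be sketched with nested dictionaries. This is a hypothetical minimal version, not the data structure BLAST actually uses; the `"$"` key marking the end of a word is an arbitrary choice. Because every word has the same fixed length, a lookup takes a constant number of steps.

```python
def build_trie(word_positions):
    """Build a nested-dict search tree from {word: [query positions]}."""
    root = {}
    for word, positions in word_positions.items():
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = positions  # "$" marks the end of a stored word
    return root


def lookup(trie, word):
    """Return the query positions stored for word, or [] if absent."""
    node = trie
    for letter in word:
        node = node.get(letter)
        if node is None:
            return []
    return node.get("$", [])


if __name__ == "__main__":
    # Words matching GGF (query position 2), taken from the slides.
    words = {w: [2] for w in ["GGI", "GGL", "GGM", "GGF", "GGW", "GGY"]}
    trie = build_trie(words)
    print(lookup(trie, "GGF"))  # the query contains GGF at position 2
    print(lookup(trie, "GGA"))
```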
15
Scan the database
[Figure: dot plot of the query sequence against a database sequence; each x marks a word hit]
16
Extend hit
Query sequence:    L P P Q G L L
Database sequence: M P P E G L L
BLOSUM62 scores:   2 7 7 2 6 4 4
<word>: P Q G / P E G, word score 7 + 2 + 6 = 15
<--- extend --->: HSP score 2 + 7 + 7 + 2 + 6 + 4 + 4 = 32
A hit is extended in both directions until the running
alignment's score has dropped more than X below the
maximum score yet attained.
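
The extension of this example hit can be sketched as follows. The code is a simplified illustration, not BLAST's implementation: the score table holds only the BLOSUM62 entries needed for these two sequences, and the function names and the drop-off value `x_drop=5` are assumptions.

```python
# Only the BLOSUM62 entries needed for the slide's example.
PAIR_SCORES = {
    ("L", "M"): 2, ("P", "P"): 7, ("E", "Q"): 2, ("G", "G"): 6, ("L", "L"): 4,
}


def pair_score(x, y):
    return PAIR_SCORES[tuple(sorted((x, y)))]


def extend_hit(query, subject, q_start, s_start, w=3, x_drop=5):
    """Ungapped extension of a word hit in both directions.

    Returns (hsp_score, q_begin, q_end) with q_end exclusive (0-based).
    """
    word = sum(pair_score(query[q_start + k], subject[s_start + k])
               for k in range(w))

    def extend(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += pair_score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score fell more than X below the best seen
            qi += step
            si += step
        return best, best_len

    right, right_len = extend(q_start + w, s_start + w, +1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return word + left + right, q_start - left_len, q_start + w + right_len


if __name__ == "__main__":
    print(extend_hit("LPPQGLL", "MPPEGLL", q_start=2, s_start=2))
```

On the slide's sequences the word PQG/PEG scores 15, the left extension adds 7 + 2 and the right extension adds 4 + 4, giving the HSP score of 32.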
17
BLAST 2.0: The two-hit method
  • BLAST 1.0
  • The extension step typically accounts for >90% of
    BLAST execution time
  • Observations
  • An HSP of interest is much longer than a single
    word pair
  • It may therefore entail multiple hits on the same
    diagonal and within a short distance of one another
  • Invoke an extension only when two non-overlapping
    hits are found within distance A on the same
    diagonal

18
  • $Recent_i$: the most recent hit found on the $i$-th
    diagonal (hit coordinates are always increasing)

overlap
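
The two-hit rule with one "most recent hit" entry per diagonal can be sketched as below. This is an illustrative simplification: hits are assumed to arrive in order of increasing database position, the window default `window=40` is an assumed value for A, and the function name is made up.

```python
def two_hit_triggers(hits, w=3, window=40):
    """Return the hits that would trigger an extension.

    hits: (query_pos, db_pos) pairs in order of increasing db_pos.
    A hit triggers an extension when an earlier, non-overlapping hit
    lies on the same diagonal within `window` positions.
    """
    recent = {}  # diagonal -> database coordinate of its most recent hit
    triggers = []
    for q_pos, db_pos in hits:
        diagonal = db_pos - q_pos
        last = recent.get(diagonal)
        if last is not None:
            distance = db_pos - last
            if distance < w:
                continue  # overlaps the most recent hit: ignore it
            if distance <= window:
                triggers.append((q_pos, db_pos))
        recent[diagonal] = db_pos
    return triggers


if __name__ == "__main__":
    hits = [(0, 10), (1, 11), (5, 15), (2, 40), (30, 90)]
    print(two_hit_triggers(hits))
```

In the usage example only the hit at (5, 15) triggers an extension: (1, 11) overlaps the first hit on the same diagonal, and the last two hits are alone on their diagonals.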
19
  • T must be lowered
  • One-hit: W = 3, T = 13
  • Two-hit: W = 3, T = 11
  • More single hits are generated, but the majority are
    dismissed
  • Sensitivity
  • For HSPs with at least 33 bits, the two-hit
    heuristic is more sensitive
  • Speed (two-hit)
  • Generates on average 3.2 times as many hits, but
    only 0.14 times as many hit extensions (it must still
    decide whether each hit need be extended)
  • Twice as rapid as one-hit

20
Gapped alignment
  • The original BLAST finds several distinct HSPs
  • In the original, all HSPs related to one alignment
    should be found
  • Gapped BLAST can tolerate a much higher chance of
    missing any single moderately scoring HSP
  • It seeks a single gapped alignment, rather than a
    collection of ungapped ones
  • For example, suppose the result should be found with
    probability > 0.95, where p is the probability of
    missing an HSP
  • Original, with 2 HSPs: $(1-p)(1-p) > 0.95 \Rightarrow p < 0.025$
  • Now: $p^2 < 0.05 \Rightarrow p < 0.22$
  • T can be raised, so the search is faster
  • Now
  • Find only one HSP to serve as the seed, then use the two-hit method
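
The two bounds on p follow from a short calculation, sketched here with made-up function names. It assumes, as the slide does, an alignment made of two HSPs that are each missed independently with probability p.

```python
import math


def max_miss_prob_both(target=0.95):
    """Largest p with (1 - p)^2 >= target: both HSPs must be found."""
    return 1 - math.sqrt(target)


def max_miss_prob_either(target=0.95):
    """Largest p with 1 - p^2 >= target: finding one HSP is enough."""
    return math.sqrt(1 - target)


if __name__ == "__main__":
    print(round(max_miss_prob_both(), 3))    # about 0.025
    print(round(max_miss_prob_either(), 2))  # about 0.22
```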

21
Gapped alignment (contd)
  • A gapped extension takes much longer to execute
    than an ungapped extension, but by performing
    very few of them the fraction of the total time
    could be kept low.
  • Trigger a gapped extension for any HSP exceeding
    score $S_g$
  • $S_g$ should be set at 22 bits (1/50)

22
Original BLAST locates only the first and the
last ungapped alignments, giving an E-value more than 50 times
higher than that of the gapped alignment
23
Gapped Local Alignments
http://binfo.ym.edu.tw/post/internet/gap_blast.htm

24
Before gap insertion:
actaactattacagactaactattacagactaactataca
actaactattacggactaacttacagactaactaaaca
Percent identity: 24/40 = 0.6
After gap insertion:
actaactattacagactaactattacagactaactataca
actaactattacggactaact--tacagactaactaaaca
Percent identity: 36/40 = 0.9
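
The two identity figures can be reproduced by comparing the sequences position by position. This is a small sketch with a made-up function name; it divides by the length of the longer row, which is how the slide arrives at a denominator of 40.

```python
def identity(top, bottom):
    """Count identical aligned positions and the fraction of the longer row."""
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return matches, matches / max(len(top), len(bottom))


if __name__ == "__main__":
    top = "actaactattacagactaactattacagactaactataca"
    ungapped = "actaactattacggactaacttacagactaactaaaca"
    gapped = "actaactattacggactaact--tacagactaactaaaca"
    print(identity(top, ungapped))  # before gap insertion
    print(identity(top, gapped))    # after gap insertion
```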
25
Gapped Local Alignments
  • Start from a single aligned pair of residues,
    called the seed.

26
Gapped expansion
  • Find the ungapped region with the highest alignment
    score.
  • If the score of the ungapped region is larger than
    Sg, then try a gapped extension using DP.
  • Use its central residue pair as the seed.
  • Gapped extension is invoked less than once per 50
    database sequences.
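
To show what the DP step computes, here is a plain Smith-Waterman local alignment score. This is a swapped-in, simpler technique, not the seeded extension that gapped BLAST performs: it fills the whole matrix rather than growing outward from a seed, and it uses a linear gap penalty. The DNA scores +5 / -4 come from an earlier slide; the gap penalty of -6 is an assumption.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score with gaps (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(0, diagonal, up, left))
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # One deleted base: seven matches and one gap beat any ungapped segment.
    print(local_alignment_score("ACGTACGT", "ACGTCGT"))
```

Making the gap penalty prohibitive (for example `gap=-100`) reduces the result to the best ungapped segment, which shows how much of the score comes from allowing the gap.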

27
PSSM
28
  • conserved regions
  • same protein family
  • some regions are very similar
  • the structure and functionality typical to this
    family

29
PSI-BLAST (Position-Specific Iterated BLAST)
1. Select a query and search it against a protein database.
2. PSI-BLAST constructs a multiple sequence alignment, then creates a
   profile, or specialized position-specific scoring matrix (PSSM).
3. The PSSM is used as a query against the database.
4. PSI-BLAST estimates statistical significance (E values).
5. Repeat steps 3 and 4 iteratively, typically 5 times. At each new
   search, a new profile is used as the query.
From http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
30
Score matrix architecture
  • Each matrix has length precisely equal to that of
    the original query sequence.

31
Multiple alignment construction
  • Database sequence segments aligned with E-value < 0.01
    in the BLAST output are included.
  • Any row identical to the query segment with which
    it aligns is purged.
  • Only one copy is retained of any rows that are
    more than 98% identical to one another.

32
Multiple alignment construction
  • Pairwise alignment columns that involve gap
    characters inserted into the query are simply
    ignored.
  • So M has exactly the same length as the query.

33
Multiple alignment construction
  • The matrix scores for a given alignment column
    should depend not only upon the residues
    appearing there, but also upon those in other columns.
  • For each column C, define a reduced multiple alignment
    M_C: the set R of sequences it includes is exactly
    those that contribute a residue to column C.
  • The columns of M_C are just those columns of M
    in which all the sequences of R are represented.
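
The selection of R and of the columns of the reduced alignment can be sketched as follows. The representation is an assumption made for illustration: each row is a string as long as the query, `.` marks positions that a sequence's local alignment does not cover, and `-` is a gap character that counts as a residue.

```python
def reduced_alignment(rows, c, absent="."):
    """Return (R, columns) of the reduced alignment for column c.

    R: indices of the rows that contribute a character to column c.
    columns: indices of the columns in which every row of R is represented.
    """
    members = [r for r, row in enumerate(rows) if row[c] != absent]
    columns = [j for j in range(len(rows[0]))
               if all(rows[r][j] != absent for r in members)]
    return members, columns


if __name__ == "__main__":
    rows = [
        "MKVLAT",  # query
        "MKIL..",
        "..VLAT",
        ".RVL-T",
    ]
    print(reduced_alignment(rows, c=0))
    print(reduced_alignment(rows, c=3))
```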

34
(No Transcript)
35
Sequence weights
  • A large set of closely related sequences carries
    little more information than a single member, but
    its size may allow it to outvote a small number of
    more divergent sequences.
  • One remedy is to assign weights to the sequences.
  • Gap characters are treated as a 21st distinct
    character.

36
Sequence weights
  • In constructing matrix scores, not only a
    column's observed residue frequencies are
    important.
  • Estimate the relative number N_C of independent
    observations constituted by the alignment M_C.
  • N_C: the mean number of different residue types,
    including gap characters, observed in the columns of M_C.
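
The mean number of residue types per column is straightforward to compute. This is a minimal sketch, assuming the reduced alignment is given as equal-length strings with `-` counted as a 21st character; the function name is made up.

```python
def mean_residue_types(rows):
    """Mean number of distinct characters per column of an alignment."""
    columns = list(zip(*rows))
    return sum(len(set(column)) for column in columns) / len(columns)


if __name__ == "__main__":
    # Columns contain {A}, {C, G} and {D, E}: (1 + 2 + 2) / 3
    print(mean_residue_types(["ACD", "ACE", "AGE"]))
```

A column of identical residues contributes 1 however many sequences are aligned, so a large set of near-identical sequences does not inflate the estimate.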

37
  • Given a large number of independent sequences, the
    estimate of Q_i should converge simply to the
    observed frequency of residue i in that column.
  • Pseudocount frequencies
  • Estimate Q_i by
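
The formula itself is not present in this transcript. A sketch of the estimate, assuming it follows the data-dependent pseudocount form described in the PSI-BLAST paper; the symbols $f_i$ (weighted observed frequency of residue $i$ in the column), $g_i$ (pseudocount frequency), $\alpha$ and $\beta$ (relative weights) do not appear on the slides:

```latex
Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta},
\qquad
g_i = \sum_j \frac{f_j}{P_j}\, q_{ij},
\qquad
\alpha = N_C - 1
```

Under this form, a column with many independent observations (large $N_C$) makes $\alpha$ dominate, so $Q_i$ approaches the observed frequency $f_i$, while $\beta$ is an empirically chosen pseudocount weight. The score for residue $i$ in the column is then $\ln(Q_i / P_i) / \lambda_u$.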

38
Performance Evaluation
39
(No Transcript)
40
Gapped BLAST
1. 3X faster than original BLAST, finds more
2. >100X faster than S-W, misses only 8, same scores
PSI-BLAST
1. faster than original BLAST, 40X faster than S-W, much more
sensitive
2. multiple iterations are even better; better for the
non-redundant database of NCBI
3. slower than gapped BLAST: time is needed for
construction of the PSSM
41
PSI-BLAST Examples (1)
With a HIT protein as the query:
1. A BLAST search of SWISS-PROT reveals hits with E < 0.01
only to other HIT proteins.
2. A PSI-BLAST search, using the PSSM generated by ...
yields an E-value of $2 \times 10^{-4}$ for
uridylyltransferase.
42
PSI-BLAST Examples (2): BRCT proteins
43
(No Transcript)
44
Seven recent additions to the protein databases
as members of BRCT superfamily
45
Discussion
46
Possible future improvement: Gap costs
  • Allow a gap to involve residues in both
    sequences rather than just one
  • A gap in which k residues are inserted or deleted
    and j pairs of residues are left unaligned
    receives the score $-(a + bk + cj)$

47
Possible future improvement: Realignment
  • Instead of building the multiple alignment from all
    pairwise alignments above the threshold, use only
    the most significant ones to build an initial
    multiple alignment and PSSM, then rescore and
    realign database sequences that received lower
    scores
  • Improves weaker pairwise alignments
  • False positives can be downgraded by an improved
    matrix
  • False negatives can have their scores increased

48
Conclusion
  • The gapped version of BLAST is faster than the
    original one, and is able to produce gapped alignments.
  • PSI-BLAST greatly increases sensitivity to weak
    but biologically relevant sequence relationships.
  • PSI-BLAST retains the ability to report accurate
    statistics, runs each iteration in a time not much
    greater than gapped BLAST, and can be used both
    iteratively and fully automatically.

49
NCBI
  • Books
  • PubMed
  • Blast
  • (1)Nucleotide
  • -- Quickly search for highly similar sequences
  • -- Nucleotide-nucleotide BLAST
  • (2)Protein
  • -- Protein-protein BLAST
  • (3)Translated
  • -- Translated query vs. Protein database
  • (4)Special
  • -- Align two sequences

57
Database searching
[Figure: a query and a sequence database are passed to a sequence comparison algorithm, which returns targets ranked by score]
86
Question Set of Final Exam
  • 1. ??? blast ????? database ??? sequence ???
  • 2. Two hit ? One hit ???????
  • 3. ???PSI-BLAST?BLAST???????