BLAST: Database Search

Provided by: sch17
Transcript and Presenter's Notes

1
BLAST: Database Search
  • Heuristic Algorithm
  • Some slides courtesy of Dr. Pevsner and Dr. Dirk
    Husmeier

2
What is BLAST?
  • BLAST (Basic Local Alignment Search Tool) allows
    rapid comparison of a query sequence against a
    database.
  • Smith-Waterman (S-W) is rigorous and is guaranteed
    to find an optimal alignment.
  • But it is also time and space consuming. It is
    especially inefficient in database searches.
  • BLAST provides a rapid alternative to S-W.

3
Why do we use BLAST?
  • To understand the relatedness of any protein
    or DNA sequence (query sequence) to other known
    sequences (database)
  • Identify sequences with a common ancestor
    (orthologs and paralogs)
  • Discover new genes or proteins
  • Explore protein structure and function

4
The BLAST Algorithm
  • S. F. Altschul et al., 1997, Nucleic Acids
    Research, 25:3389

"The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T." Altschul et al. (1990)
5
How the original BLAST algorithm works
Step 1: list the words of size w in the query sequence
Look at the query sequence through a moving window
of size w. Example for a human RBP query: FSGTWYA
(the query word was highlighted in yellow on the slide).
The moving window yields the words
FSG, SGT, GTW, TWY, WYA
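The moving window in Step 1 can be sketched in a few lines of Python. The function name `query_words` and the default w = 3 are illustrative choices, not part of the slides.

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every overlapping word of length w in the query, left to right."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# The RBP fragment from the slide, with a window of size 3.
print(query_words("FSGTWYA"))
# ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```

A query of length L yields L - w + 1 words, so a query shorter than w yields none.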
6
Step 1: compile a list of words scoring at least
T with the query word (T = 11)

Word   Per-position scores   Total
GTW    6, 5, 11              22
GSW    6, 1, 11              18
ATW    0, 5, 11              16
NTW    0, 5, 11              16
GTY    6, 5, 2               13
(word hits above the threshold)
GNW                          10
GAW                           9
(word hits below the threshold)
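As a rough sketch of how the word list is scored against the query word GTW: the pair scores below are only the handful of per-position values shown on the slide, not a complete substitution matrix, and the function names are hypothetical. The second word is taken to be GSW, which is what the listed per-position scores 6, 1, 11 imply.

```python
# Pair scores copied from the per-position values on the slide. This is a
# partial table for illustration only, not a full substitution matrix.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("G", "A"): 0, ("G", "N"): 0, ("T", "S"): 1, ("W", "Y"): 2,
}


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup: score(a, b) equals score(b, a)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the pair scores position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def words_at_least(query_word: str, candidates: list[str], T: int) -> list[tuple[str, int]]:
    """Keep the candidate words whose score with the query word is at least T."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= T]


print(words_at_least("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY"], T=11))
# [('GTW', 22), ('GSW', 18), ('ATW', 16), ('NTW', 16), ('GTY', 13)]
```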
7
Step 2: scan the database for entries that contain
any word from the compiled hit list.
Step 3: extend. When you find a hit, extend it in
both directions. Keep track of the score (using a
scoring matrix) and stop when the score drops below
some cutoff.

KENFDKARFSGTWYAMAKKDPEG 50   RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44   lactoglobulin (hit)
(Hit! at the shared word, with extension to the left
and to the right)
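The extension step can be sketched as follows. Two things here are assumptions rather than content from the slides: a simple match/mismatch score (+5/-4) stands in for a real scoring matrix, and the stopping rule is written as an "X-drop" (stop once the running score falls more than `x_drop` below the best score seen), which is one common way to realize "stop when the score drops below some cutoff".

```python
def match_mismatch(a: str, b: str) -> int:
    """Toy scoring function (an assumption): +5 for identity, -4 otherwise."""
    return 5 if a == b else -4


def extend_hit(query, subject, q_pos, s_pos, w, score_fn, x_drop):
    """Ungapped extension of a word hit of length w, first right, then left.

    Returns (best_score, q_start, q_end), with q_end exclusive.
    """
    # Score of the seed word itself.
    best = sum(score_fn(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    # Extend to the right, remembering where the best score was reached.
    running, right, i = best, 0, 0
    while q_pos + w + i < len(query) and s_pos + w + i < len(subject):
        running += score_fn(query[q_pos + w + i], subject[s_pos + w + i])
        i += 1
        if running > best:
            best, right = running, i
        elif best - running > x_drop:
            break

    # Extend to the left, starting from the best score found so far.
    running, left, i = best, 0, 0
    while q_pos - 1 - i >= 0 and s_pos - 1 - i >= 0:
        running += score_fn(query[q_pos - 1 - i], subject[s_pos - 1 - i])
        i += 1
        if running > best:
            best, left = running, i
        elif best - running > x_drop:
            break

    return best, q_pos - left, q_pos + w + right


query = "KENFDKARFSGTWYAMAKKDPEG"    # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"   # lactoglobulin fragment (trailing "." dropped)
score, start, end = extend_hit(query, subject, 10, 10, 3, match_mismatch, x_drop=10)
print(score, query[start:end])
# 20 GTWY
```

With a real substitution matrix, conservative substitutions would score positively and the extension would typically run further than it does with this toy scoring.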
8
Alignment Score
It is important to assess the statistical
significance of search results. For local
alignments (including BLAST search results), the
scores follow an extreme value distribution
(EVD). The E-value is closely related to the
analysis of the distribution of alignment
scores.
Karlin, S. and Altschul, S.F. (1990)
"Methods for assessing the statistical
significance of molecular sequence features by
using general scoring schemes." Proc. Natl. Acad.
Sci. USA 87:2264-2268.
9
Alignment score as a random walk
10
Max Score in an Excursion
$p_i$: frequency of residue i in the 1st sequence.
$p_k$: frequency of residue k in the 2nd sequence.
11
Protein Scoring
12
Distribution of any excursion Y:
an exponential distribution
13
The Max of n variables
  • $Y_1, Y_2, \ldots, Y_n$ are independent and
    identically distributed (i.i.d.)
  • $Y_{\max}$ is the maximum of all of the above. Then
  • $\mathrm{Prob}(Y_{\max} < y) = (\mathrm{Prob}(Y < y))^n$

14
The Max of n Exponential variables
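  • $Y_1, Y_2, \ldots, Y_n$ are independent exponential
    variables
  • $Y_{\max}$ is the maximum of all of the above. Then
  • $\mathrm{Prob}(Y_{\max} < y) = (\mathrm{Prob}(Y < y))^n = (1 - e^{-\lambda y})^n$
  • $\mathrm{Prob}(Y_{\max} > y) = 1 - (\mathrm{Prob}(Y < y))^n = 1 - (1 - e^{-\lambda y})^n$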

15
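In a database of n sequences
  • Number of sequences: n
  • $Y_1, Y_2, \ldots, Y_n$ are i.i.d. exponential
  • What happens when n is large?
  • Using a widely used rule: $(1 + x/n)^n \approx \exp(x)$
  • $\mathrm{Prob}(Y_{\max} > y) \approx 1 - \exp(-nCe^{-\lambda y})$,
    where C is a constant with
    $\mathrm{Prob}(Y > y) \approx Ce^{-\lambda y}$ for a single excursion
  • This is the probability of scores/excursions higher than y
  • The distribution of $Y_{\max}$ follows an extreme value
    distribution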
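A small numerical check of the large-n rule, as a sketch only: the function names are made up for illustration, C defaults to 1 (the pure exponential case), and the values of n, λ and y are arbitrary.

```python
import math


def p_max_exact(y: float, n: int, lam: float) -> float:
    """Prob(Ymax > y) for n i.i.d. exponential variables: 1 - (1 - e^(-lam*y))^n."""
    return 1.0 - (1.0 - math.exp(-lam * y)) ** n


def p_max_evd(y: float, n: int, lam: float, C: float = 1.0) -> float:
    """Large-n approximation: 1 - exp(-n * C * e^(-lam*y))."""
    return 1.0 - math.exp(-n * C * math.exp(-lam * y))


# Arbitrary illustrative values: 10,000 sequences, lam = 1, threshold y = 10.
exact = p_max_exact(10.0, 10_000, 1.0)
approx = p_max_evd(10.0, 10_000, 1.0)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The two values agree to several decimal places here, and the agreement improves as n grows.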

16
The sum of a large number of independent
identically distributed (i.i.d.) random variables
tends to a normal distribution.
[Figure: normal distribution; probability (0 to 0.40) plotted against x (-5 to 5)]
17
The maximum of a large number of i.i.d. random
variables tends to an extreme value distribution.
[Figure: normal distribution and extreme value distribution; probability (0 to 0.40) plotted against x (-5 to 5)]
18
Expected number of better scores/higher excursions
E-value
p-value
19
$E = Kmn\,e^{-\lambda S}$
E is the number of hits you would expect from
your search with scores greater than S, where
  • K is a constant
  • m is the size of the query
  • n is the size of the database being searched
  • λ scales for the specific scoring matrix used
    (decay constant from the extreme value distribution)
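The E-value equation translates directly into code. This is a minimal sketch: the function name is hypothetical, and the numbers in the example are chosen only to make the arithmetic easy to follow, not taken from any real search.

```python
import math


def expect_value(S: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)


# Illustrative values only: with lam = ln 2, each extra score unit halves E.
lam = math.log(2)
print(expect_value(S=10, m=100, n=1000, K=0.1, lam=lam))  # about 9.77
print(expect_value(S=11, m=100, n=1000, K=0.1, lam=lam))  # about 4.88
```

Doubling either m or n doubles E, which is why the same raw score is less significant in a larger database.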
20
How to interpret BLAST E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values.

E        p = 1 - exp(-E)
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
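The table can be reproduced with a short Python sketch. The function name is illustrative; `math.expm1` is used so that very small E values do not lose precision.

```python
import math


def p_from_e(E: float) -> float:
    """p = 1 - exp(-E), computed as -expm1(-E) for accuracy at small E."""
    return -math.expm1(-E)


for E in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{E:<8} {p_from_e(E):.8f}")
```

For small E the two numbers are nearly equal because 1 - exp(-E) is approximately E when E is close to 0.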
21
  • End

22
Extreme value distribution
  • The distribution
  • The area to the right of S
  • Scaling to a particular type of score
  • where µ is the mode and λ is a scale factor.

23
Extreme value distribution
Compute this value for x = 0.
  • The distribution
  • The area to the right of S
  • Scaling to a particular type of score
  • where µ is the mode and λ is a scale factor.

24
Extreme value distribution
Compute this value for x = 0. Solution: $\exp(-1) = 0.368$
  • The distribution
  • The area to the right of S
  • Scaling to a particular type of score
  • where µ is the mode and λ is a scale factor.
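The formulas that accompanied these bullets are missing from the transcript. As a hedged reconstruction, assuming the slide used the standard (Gumbel-type) extreme value form, which is consistent with the solution exp(-1) = 0.368 at x = 0 and with µ as the mode and λ as a scale factor, the three bullets would correspond to:

```latex
% The distribution (density); at x = 0 this equals e^{-1}, about 0.368
f(x) = e^{-x}\, e^{-e^{-x}}

% The area to the right of a score x
P(S \ge x) = 1 - \exp\!\left(-e^{-x}\right)

% Scaling to a particular type of score, with mode \mu and scale factor \lambda
P(S \ge x) = 1 - \exp\!\left(-e^{-\lambda (x - \mu)}\right)
```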

25
An example
  • You run BLAST and get a score of 45. You then
    run BLAST on a shuffled version of the database,
    and fit an extreme value distribution to the
    resulting empirical distribution. The parameters
    of the EVD are µ = 25 and λ = 0.693. What is the
    p-value associated with 45?
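The slide's own answer is not in the transcript. One way to work the example, assuming the scaled extreme value form P(S ≥ x) = 1 - exp(-e^(-λ(x - µ))), is the sketch below; the function name is hypothetical.

```python
import math


def evd_p_value(score: float, mu: float, lam: float) -> float:
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu))) under the assumed EVD form."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


p = evd_p_value(45, mu=25, lam=0.693)
print(f"{p:.2e}")  # roughly 9.6e-07
```

Because 0.693 is approximately ln 2, each score unit above the mode halves the tail probability, so a score 20 units above µ gives a p-value close to 2^-20.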

27
Another example
  • You run BLAST and get a score of 23. You then
    run BLAST on a shuffled version of the database,
    and fit an extreme value distribution to the
    resulting empirical distribution. The parameters
    of the EVD are µ = 20 and λ = 0.744. What is the
    p-value associated with 23?
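The slide's own answer is not in the transcript. Under the assumption that the scaled extreme value form P(S ≥ x) = 1 - exp(-e^(-λ(x - µ))) applies, the arithmetic would go as follows:

```latex
\lambda (x - \mu) = 0.744 \times (23 - 20) = 2.232

e^{-2.232} \approx 0.1073

p = 1 - \exp(-0.1073) \approx 0.102
```

A score only 3 units above the mode is therefore not significant: roughly one search in ten against shuffled data would score this well.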

29
BLAST optional parameters
You can...
  • choose the organism to search
  • turn filtering on/off
  • change the substitution matrix
  • change the expect (e) value
  • change the word size
  • change the output format
30
Choosing Gap Penalty
  • Gap penalties must be chosen to suit each type of
    scoring system, so that gaps are placed only where
    they increase the overall alignment score.
  • There are no hard and fast rules for choosing gap
    penalties.
  • Both g (opening penalty) and r (extension
    penalty) should be non-zero.
  • The value of g + r should be greater than the
    maximum score used for a match if insertions and
    deletions are considered to be rarer than
    nucleotide substitutions.
  • The value of g strongly influences the number of
    gaps introduced into a region separating two
    closely matching regions.
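A minimal sketch of an affine gap penalty shows why g controls the number of gaps. It assumes the convention in which a gap of k residues costs g + r·k (so a single-residue gap costs g + r); some programs instead charge g + r·(k - 1). The values g = 11 and r = 1 are illustrative only.

```python
def gap_cost(length: int, g: int, r: int) -> int:
    """Penalty for one gap of `length` residues: g to open it, plus r per residue."""
    if length <= 0:
        return 0
    return g + r * length


# One gap of 4 residues versus two separate gaps of 2 residues each.
one_long = gap_cost(4, g=11, r=1)        # 15
two_short = 2 * gap_cost(2, g=11, r=1)   # 26
print(one_long, two_short)
```

The larger g is relative to r, the more strongly an alignment favors a few long gaps over many short ones.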

31
Comparing Scoring Matrices
  • PAM
    • Homologous sequences during evolution
    • Based on extrapolation from a small evolutionary period
    • Used to track evolutionary origins
  • BLOSUM
    • Conserved blocks
    • Based on a range of evolutionary periods
    • Used to find conserved domains

32
Another way to compare
  • Perform a search of a sequence database with a
    known member of a protein family and count how
    many members of the family are found. When gap
    penalties were not considered, the BLOSUM62 matrix
    outperformed the PAM250 matrix, finding more
    members of 504 different families in the Prosite
    database.

33
BLAST Phase 3: Extension
In the original (1990) implementation of BLAST,
hits were extended in either direction. In a 1997
refinement of BLAST, two independent hits are
required, and they must occur in close proximity
to each other. With this modification, only one
seventh as many extensions occur, greatly reducing
the time required for a search.
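The two-hit requirement can be sketched as a filter over word hits. Several details are assumptions beyond what the slide states: hits are given as (query position, database position) pairs, "close proximity" is interpreted as two non-overlapping hits on the same diagonal within a distance `window`, and the function name and the window of 40 are illustrative.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, window):
    """Return (diagonal, first_q_pos, second_q_pos) for pairs of word hits that
    lie on the same diagonal, do not overlap, and are at most `window` apart."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one on the same diagonal.
        for first, second in zip(positions, positions[1:]):
            if w <= second - first <= window:
                seeds.append((diagonal, first, second))
    return seeds


hits = [(10, 30), (25, 45), (12, 90), (100, 120), (300, 320)]
print(two_hit_seeds(hits, w=3, window=40))
# [(20, 10, 25)]
```

Only the pair at query positions 10 and 25 triggers an extension; the isolated hits are skipped, which is where the saving in extensions comes from.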
34
How a BLAST search works: threshold
You can modify the threshold parameter. The
default value for blastp is 11. To change it,
enter -f 16 or -f 5 in the advanced options.
35
How to interpret a BLAST search: expect value
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. An E value is related to a
probability value p. The key equation describing
an E value is $E = Kmn\,e^{-\lambda S}$
36
$E = Kmn\,e^{-\lambda S}$
This equation is derived from a description of
the extreme value distribution.
  • S: the score
  • E: the expect value, the expected number of
    high-scoring segment pairs (HSPs) with a score > S
  • m, n: the lengths of the two sequences
  • λ, K: Karlin-Altschul statistics
37
Some properties of the equation $E = Kmn\,e^{-\lambda S}$
  • The value of E decreases exponentially with
    increasing S (higher S values correspond to better
    alignments). Very high scores correspond to very
    low E values.
  • The expected score for aligning a pair of random
    residues must be negative. Otherwise, long random
    alignments would acquire high scores.
  • Parameter K describes the search space
    (database).
  • For E = 1, one match with a similar score is
    expected to occur by chance. For a very much larger
    or smaller database, you would expect E to vary
    accordingly.

38
From raw scores to bit scores
  • There are two kinds of scores: raw scores
    (calculated from a substitution matrix) and
    bit scores (normalized scores).
  • Bit scores are comparable between different
    searches because they are normalized to account
    for the use of different scoring matrices and
    different database sizes.
  • Bit score: $S' = (\lambda S - \ln K) / \ln 2$
  • The E value corresponding to a given bit score
    is $E = mn\,2^{-S'}$
  • Bit scores allow you to compare results between
    different database searches, even using different
    scoring matrices.
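A short sketch shows that the bit-score form of the E value agrees with the raw-score form. The function names are hypothetical, and the values of K, λ, m and n below are illustrative, not taken from the slides.

```python
import math


def bit_score(raw_score: float, K: float, lam: float) -> float:
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Illustrative parameter values only.
K, lam, m, n = 0.041, 0.267, 250, 1_000_000
bits = bit_score(100, K, lam)
print(f"bit score = {bits:.1f}")
print(f"E from bits = {e_from_bits(bits, m, n):.3e}")
print(f"E from raw  = {e_from_raw(100, m, n, K, lam):.3e}")
```

Both routes give the same E because 2^(-S') equals K·e^(-λS); the bit score simply folds K and λ into the score itself.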

39
How to interpret BLAST E values and p values
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. A p value is a different way
of representing the significance of an
alignment: $p = 1 - e^{-E}$