Title: Alignments Database Searching
1Alignments -> Database Searching
2Sequence searching problems
- Task
- Query: new sequence (300 aa)
- Database (search space): very many sequences
- Goal: find seqs related to query
- We want
- a fast tool
- primarily a filter: most sequences will be unrelated to the query
- fine-tune the alignment later
3Parenthesis: do you remember the a.a. (symbols and properties)?
4Parenthesis: do you remember the a.a. (symbols and properties)?
5Parenthesis: substitution matrices
Seq 1: L V N R K P V V P
Seq 2: G V C R R P L K C
6Amino acid exchange matrices
- How do we get one?
- And how do we get associated gap penalties?
- First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1978), Atlas of Protein Sequence and Structure
20 × 20
9The PAM250 matrix
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
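The matrix is listed as a lower triangle because it is symmetric: the score for (R, N) equals the score for (N, R). As a minimal sketch, assuming only the first five rows of the PAM250 listing (A, R, N, D, C), the triangle could be expanded into a symmetric lookup like this. The function name `parse_lower_triangle` is illustrative, not part of any library.

```python
def parse_lower_triangle(text):
    """Expand a lower-triangular matrix listing into a symmetric dict."""
    scores = {}
    labels = []
    for line in text.strip().splitlines():
        fields = line.split()
        label, values = fields[0], [int(v) for v in fields[1:]]
        labels.append(label)
        # The i-th value on a row is the score against the i-th label seen so far.
        for other, value in zip(labels, values):
            scores[(label, other)] = value
            scores[(other, label)] = value
    return scores

# First five rows of the PAM250 listing (A, R, N, D, C).
PAM250_EXCERPT = """
A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
"""

pam = parse_lower_triangle(PAM250_EXCERPT)
print(pam[("C", "C")], pam[("D", "R")], pam[("R", "D")])
```

Both orders of a pair are stored so that later lookups do not have to care which residue comes first.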
10PAM model
- The scores derived through the PAM model are an accurate description of the information content (or the relative entropy) of an alignment (Altschul, 1991).
- But the matrix contains information from each residue of every sequence, while some regions might be more closely related evolutionarily than others
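To make the "relative entropy" remark concrete: substitution scores of this kind are log-odds scores, and the relative entropy is the expected score per aligned pair under the model. The sketch below assumes a hypothetical two-letter alphabet with made-up frequencies; it is not the PAM data, only an illustration of the two formulas.

```python
import math

def log_odds_scores(pair_freq, background):
    """s(a, b) = log2( q_ab / (p_a * p_b) ), in bits."""
    return {
        (a, b): math.log2(q / (background[a] * background[b]))
        for (a, b), q in pair_freq.items()
    }

def relative_entropy(pair_freq, scores):
    """H = sum over pairs of q_ab * s(a, b): expected score per aligned pair."""
    return sum(q * scores[pair] for pair, q in pair_freq.items())

# Hypothetical two-letter alphabet; these frequencies are invented for illustration.
background = {"X": 0.5, "Y": 0.5}
pair_freq = {("X", "X"): 0.4, ("Y", "Y"): 0.4, ("X", "Y"): 0.1, ("Y", "X"): 0.1}

scores = log_odds_scores(pair_freq, background)
H = relative_entropy(pair_freq, scores)
print(scores[("X", "X")], scores[("X", "Y")], H)
```

Pairs seen more often than chance get positive scores, pairs seen less often get negative scores, and H summarizes how much information each aligned column carries on average.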
11Blosum series
- The most widely used family of amino acid exchange matrices today is the BLOSUM series (BLOSUM50, BLOSUM62), an alternative to the PAM matrices that uses the same log-odds scoring idea.
- The BLOSUM matrices are derived from the BLOCKS database of multiple alignments (Henikoff & Henikoff, 1992).
- BLOSUM50 is derived from BLOCKS (core) alignment regions in which sequences sharing at least 50% identity are clustered and counted as one; BLOSUM62 uses a 62% clustering threshold, etc.
12BLAST
Basic Local Alignment Search Tool
BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences
13A few Definitions
Query: Your sequence
Subject: The database against which you search
Heuristic: Algorithm that does not guarantee the optimal solution
14Other Important Definitions
Identity: Proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: % id
Similarity: Proportion of SIMILAR residues. Two residues are similar if their substitution score is higher than 0. Depends on the matrix. Unit: % similarity
Homology: Sequences SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY = COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be homologous
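The first two definitions can be computed directly from an alignment. The sketch below assumes an ungapped alignment of equal-length sequences and reuses the Query/Subject pair and per-column scores shown on the "Step 3: Extending exact matches" slide; the function names are illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns holding the same residue."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def percent_similarity(column_scores):
    """Percentage of aligned columns whose substitution score is > 0."""
    positives = sum(score > 0 for score in column_scores)
    return 100.0 * positives / len(column_scores)

seq1 = "LVNRKPVVP"
seq2 = "GVCRRPLKC"
# Per-column substitution scores as listed on the extension slide.
column_scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]

identity = percent_identity(seq1, seq2)          # 3 of 9 columns are identical
similarity = percent_similarity(column_scores)   # 5 of 9 columns score above 0
print(round(identity, 1), round(similarity, 1))
```

Identity depends only on the alignment, while similarity changes if a different matrix assigns different scores to the same columns.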
15More Important Definitions
Hit: A sequence that matches your sequence and is reported by BLAST.
E-Value (Expectation value): How many times would you expect to find a hit by chance only?
Depends on the alignment.
Depends on the matrix.
Depends on the database.
Sensitive to low complexity regions.
Unit: must be lower than 0.0001 to mean something
16A Good Hit Is Something You Would Not Expect by
Chance
17Database Search
1-Query
3-Database
4-Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
18What is BLAST
- Basic Local Alignment Search Tool
- Bad news: it is only a heuristic method
- Heuristic: "A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees." Perkins, DN (1981) The Mind's Best Work
- Basic idea
- High scoring segments have well conserved (almost identical) parts
- Once well conserved parts are identified, extend them to the real alignment
19BLAST
Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST 3 STEPS
1-Decide who will be aligned completely
20BLAST
Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST 3 STEPS
1-Decide who will be compared
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
21BLAST
A Bit of History
- Smith and Waterman
- Exact Local Dynamic Programming, 1981
- FASTA
- Lipman and Pearson, 1985
- Looks for similar words (k-tup) on the same diagonal.
- Comparison of the sequences one by one
- BLAST
- Altschul et al., 1990
- The most widely cited tool in Biology
- www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
22What does well conserved mean for BLAST?
- BLAST works with k-words (words of length k)
- k is a parameter
- different for DNA (> 10) and proteins (2..4)
- word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T = 12)
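As a rough illustration of T-similarity, the sketch below enumerates the words that are T-similar to RKP with T = 12. It assumes PAM250 scores copied from the matrix on slide 9 and, to stay short, a reduced five-residue alphabet (R, K, Q, E, P); a full version would range over all 20 residues. The helper names are illustrative.

```python
from itertools import product

# PAM250 scores copied from the matrix on slide 9, restricted to five residues.
PAIR_SCORES = {
    "RR": 6, "RK": 3, "RQ": 1, "RE": -1, "RP": 0,
    "KK": 5, "KQ": 1, "KE": 0, "KP": -1,
    "QQ": 4, "QE": 2, "QP": 0,
    "EE": 4, "EP": -1,
    "PP": 6,
}
ALPHABET = "RKQEP"  # assumption: reduced alphabet, for brevity only

def pair_score(a, b):
    """Symmetric lookup: score(a, b) == score(b, a)."""
    return PAIR_SCORES.get(a + b, PAIR_SCORES.get(b + a))

def t_similar_words(word, threshold):
    """All words over ALPHABET whose summed pair score against `word` is >= threshold."""
    neighbours = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            neighbours.append("".join(candidate))
    return sorted(neighbours)

neighbours = t_similar_words("RKP", 12)
print(neighbours)
```

Under these assumptions the result contains the word itself plus a handful of close variants; raising T shrinks the list and lowering it grows the list quickly.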
23BLAST algorithm: 3 basic steps
- Preprocess the query
- Scan for short, exact matches in database
- Extend them to alignments
24BLAST, Step 1: Preprocess the query
- Take the query (e.g. LVNRKPVVP)
- Chop it into overlapping k-words (k = 3 in this case)
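Chopping the query into overlapping k-words is a sliding window of width k. A minimal sketch (the function name `k_words` is illustrative):

```python
def k_words(sequence, k):
    """Return every overlapping word of length k, with its start position."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

words = k_words("LVNRKPVVP", 3)
print([w for _, w in words])
# ['LVN', 'VNR', 'NRK', 'RKP', 'KPV', 'PVV', 'VVP']
```

A query of length L yields L - k + 1 words; keeping the start position makes it possible to place later hits on a diagonal.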
25BLAST, Step 2: Find exact matches. Method A: Scanning
- Create finite-state automata
- Use all the T-similar k-words to build the automata
- Scan for exact matches
Table of k-words in the query: QKP KKP RQP REP RRP RKP ...
Database sequence (scanned in the direction of movement): ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
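The scan can be sketched with the word table and database fragment shown on this slide. Note the substitution: instead of a finite-state automaton, this sketch tests each window against a Python set, which gives the same hits for this small case but is not how an automaton-based scanner is built.

```python
def scan(subject, words):
    """Slide a window over `subject` and report exact matches to any word."""
    k = len(next(iter(words)))
    hits = []
    for pos in range(len(subject) - k + 1):
        window = subject[pos:pos + k]
        if window in words:
            hits.append((pos, window))
    return hits

neighbourhood = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
subject = "VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA"
hits = scan(subject, neighbourhood)
print(hits)
```

Each hit records where in the database sequence a word from the table occurs exactly; these positions are the seeds for the extension step.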
26Hashing
- Indexing based on the content
- Finding hash function
- Transforms the object into a number
- The number is used to access the array
- Expected distribution should be flat
- otherwise all objects land in a few slots
- Example
- In offices: drawers labelled with the first letter of the name
- Library: fiction 1st floor, poetry 2nd floor, science 3rd floor
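For k-words over a fixed alphabet, one simple way to "transform the object into a number" is to read the word as a base-20 number. The sketch below is one possible hash function, not necessarily the one used by any particular tool; the residue ordering in `AMINO_ACIDS` is an arbitrary choice.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 standard residues, arbitrary order
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def word_to_index(word):
    """Read a k-word as a base-20 number: a collision-free hash into 20**k slots."""
    index = 0
    for residue in word:
        index = index * len(AMINO_ACIDS) + CODE[residue]
    return index

table = [[] for _ in range(20 ** 3)]            # one slot per possible 3-word
table[word_to_index("RKP")].append("example entry")
print(word_to_index("AAA"), word_to_index("RKP"), word_to_index("VVV"))
```

Because every 3-word maps to its own slot, lookups need no collision handling; the price is an array of 8000 slots, most of which may stay empty for a small database.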
27Hashing
- DEFINITION: Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
- As a simple example of the use of hashing in databases, a group of people could be arranged in a database like this:
Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
(and many more sorted into alphabetical order)
- Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
(and so forth)
- A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.
28BLAST, Step 2: Find exact matches. Method B: Hashing
- Preprocess the database
- For each k-word, store in which sequences it appears
- It is a hashed database keyed by k-word
k-words: QKP KKP RQP REP RRP RKP
Hashed db: Gene134, IG_30, haemoglobin, ...
29BLAST, Step 2: Scan the database. Hashing Method
- The database is preprocessed only once! (independently from the query)
- In constant time we can retrieve database sequences where a k-word from the query appears
k-words: QKP KKP RQP REP RRP RKP
Hashed db: Gene134, IG_30, haemoglobin, ...
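A hashed database of this kind can be sketched as a dictionary from k-word to the set of sequences containing it. The entry names below echo the slide, but the sequences themselves are invented for illustration, and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_index(database, k):
    """Map each k-word to the set of sequence names in which it occurs."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

# Names echo the slide; the sequences are made up for this example.
database = {
    "Gene134": "MKVLQKPLAA",
    "IG_30": "GGRKPTTW",
    "haemoglobin": "VLSPADKTNV",
}
index = build_index(database, 3)   # built once, reused for every query

query_words = ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]
candidates = set().union(*(index.get(w, set()) for w in query_words))
print(sorted(candidates))
```

Each query word costs one dictionary lookup regardless of database size, which is the "constant time" retrieval mentioned above; sequences that share no word with the query are never touched.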
30BLAST, Step 3: Extending exact matches
- Having the list of exact matches, we extend the alignment in both directions
Query:    L  V  N  R  K  P  V  V  P
Subject:  G  V  C  R  R  P  L  K  C
Score:   -3  4 -3  5  2  7  1 -2 -3
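The extension can be sketched on the per-column scores listed above, starting from the RKP/RRP seed (columns 3 to 5, counting from 0). The stopping rule used here, giving up once the running score drops more than `x_drop` below the best score seen, is an assumption chosen for illustration; the slide does not say how the extension is terminated.

```python
def extend(scores, seed_start, seed_end, x_drop):
    """Ungapped extension of scores[seed_start:seed_end] in both directions.

    Stops in a direction once the running score falls more than `x_drop`
    below the best score seen; returns (best_score, start, end).
    """
    best = sum(scores[seed_start:seed_end])
    start, end = seed_start, seed_end

    running = best
    for i in range(seed_end, len(scores)):        # extend to the right
        running += scores[i]
        if running > best:
            best, end = running, i + 1
        elif best - running > x_drop:
            break

    running = best
    for i in range(seed_start - 1, -1, -1):       # extend to the left
        running += scores[i]
        if running > best:
            best, start = running, i
        elif best - running > x_drop:
            break

    return best, start, end

column_scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]
result = extend(column_scores, 3, 6, x_drop=5)
print(result)
```

With these numbers the seed score of 14 grows to 16 over columns 1 to 6: the extension absorbs one negative column on the left because the V/V match beyond it more than pays for it.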
31Inside BLAST
Step 1 finding the worthy words
Query
REL
32Inside BLAST
Step 2 Eliminate the database sequences that do
not contain any interesting word
Sequences within the database
Look for interesting words
ACT RSL TVF
...
...
List of interesting words (score > T)
- Sequences containing interesting words (Hits)
33Inside BLAST the end
Step 3 Extension of the Hits
Database sequence
Query
X
- 2 "Hits" on the same diagonal, less than X apart
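The two-hit condition can be sketched by grouping word hits by diagonal (subject position minus query position) and keeping neighbouring hits that are closer than X. The hit coordinates below are hypothetical, and the sketch ignores details such as overlapping hits.

```python
from collections import defaultdict

def two_hit_pairs(hits, max_distance):
    """Find pairs of word hits on the same diagonal, less than `max_distance` apart.

    Each hit is (query_pos, subject_pos); its diagonal is subject_pos - query_pos.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for left, right in zip(positions, positions[1:]):
            if right - left < max_distance:
                pairs.append((diagonal, left, right))
    return pairs

# Hypothetical hit coordinates, for illustration only.
hits = [(3, 21), (10, 28), (5, 40), (50, 68)]
pairs = two_hit_pairs(hits, max_distance=40)
print(pairs)
```

Three of the four hits share diagonal 18, but only the first two are close enough to trigger an extension; the isolated hit on diagonal 35 is discarded.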
34BLAST Statistics Raw Score
- Evaluation of the score
- Raw Score
- Sum of the substitutions and gap penalties.
- Not very informative
35BLAST Statistics P Values
- Derived Statistics
- p-value
- Probability of finding an alignment with such a score, by chance.
- The lower, the better
36BLAST Statistics P-Values
Just as the sum of a large number of independent
identically distributed (i.i.d) random variables
tends to a normal distribution, the maximum of a
large number of i.i.d. random variables tends to
an extreme value distribution.
Extreme value distribution (Gumbel)
normal distribution
37BLAST Statistics P-Values
Sequences m and n with at least score S
P-Value: Probability that a random alignment obtains a score greater than or equal to X
K must be calibrated with the database composition
Lambda is calibrated with the matrix being used
38BLAST Statistics E-Values
- Derived Statistics
- E-value
- Number of alignments expected by chance
- The lower, the better (< 0.00001)
Sequences m and n with at least score S
For values lower than 0.0001, E-Value ≈ P-Value
E-Values are easier to compare than P-Values: E-values of 5 and 10 correspond to P-values of 0.993 and 0.99995
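The relationship between the two statistics can be checked numerically. The sketch assumes the usual Karlin-Altschul forms, E = K·m·n·e^(-λS) and P = 1 - e^(-E); the values of K and λ below are placeholders, not calibrated parameters for any matrix or database.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring >= score: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return 1.0 - math.exp(-e)

for e in (5, 10, 0.0001):
    print(e, p_value(e))

# Placeholder parameters: K and lam are NOT calibrated values.
small_db = e_value(score=60, m=300, n=1_000_000, K=0.1, lam=0.3)
large_db = e_value(score=60, m=300, n=2_000_000, K=0.1, lam=0.3)
print(small_db, large_db)
```

The first loop reproduces the figures quoted above: large E-values are squeezed towards P = 1, while for small values E and P agree. The last two lines show that doubling the search space n doubles E for the same score.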
39BLAST Statistics limits
The E-Value depends on N, the Database size. If
N increases, some Hits can be lost
44Database Against Database: Farm-Blast
Genome 1
Genome 2
Ideal for finding Orthologues
45The Many Flavors of BLAST
Program   Query     Database
blastp    protein   protein
46The Many Flavors of BLAST
Program
Query
Database
47If your Sequence is a Protein
48If your Sequence is made of DNA
49BLASTing with DNA Asking the right question.
50Keeping an Eye on the Public Servers.
51Using BLAST The Basic Way
52Database Search
Database Search Result: Prediction
Protein X IS or IS NOT homologous to the QUERY.
53Submitting your Query
54Understanding the BLAST Output
Graphic Display
Hit List
Alignments
55Understanding the Graphic Display
56Understanding the Hit List
57Understanding the Alignments
58Low Complexity Regions
- Regions with a single residue repeated many times (like the AFGP) can produce meaningless alignments.
- The statistics expect ALL the regions to look the same "on average".
- By default, BLAST replaces these regions with Xs
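The effect of masking can be illustrated with a deliberately simple stand-in: replacing any run of identical residues above a minimum length with Xs. BLAST's own filters (SEG for proteins) use a more general complexity measure, so this sketch only mimics the single-residue-repeat case described above; `mask_runs` and its `min_run` parameter are illustrative.

```python
import re

def mask_runs(sequence, min_run=5):
    """Replace any run of >= min_run identical residues with the same number of X's."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group(0)), sequence)

example = "MKT" + "A" * 8 + "GLV"   # a toy sequence with a poly-A stretch
masked = mask_runs(example)
print(masked)
```

The masked stretch keeps its length, so alignment coordinates are unchanged, but an X cannot seed a word hit and therefore cannot produce a spurious high-scoring match.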
59Reproducing The Experiment
Everything you need to know to reproduce your
search is at the bottom.
60Database Searches A few Guidelines
61DataBase Search According to Pearson
62 Using BLAST Trouble Shooting
69Advanced Blast on the EMBnet
www.ch.embnet.org/software/aBLAST.html
- More choice on the databases
- Change all the parameters
70Adapting BLAST to your Problem
73Domain-Flavored BLAST
75Psi-BLAST
76BLAST latest Flavor
PSI-BLAST
- Position Specific Iterated version of BLAST.
- Uses Profiles.
- More Sensitive.
PSI-BLAST performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found.
77Psi-BLAST Iteration
78Psi-BLAST Iteration
79BLAST PSSM or weight matrix
      M   Y   C   E   Q   U   E   N   C   E   S  . .
A     0   2  -1   0   0   0   0  -1   0  -1   3
S    -1  -1  -1   0  -1   0   0   0   5  -1  -1
C    -1  -1  10   1  -1   0   0   5   5   4  -1
. .
Y    -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V    -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
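Scoring a candidate against a PSSM differs from scoring with a substitution matrix in that the score of a residue depends on the column it falls in. The sketch below uses only the five rows shown on the slide (A, S, C, Y, V), so the candidate sequence is restricted to those residues; it is a made-up string chosen for illustration.

```python
# Rows copied from the slide; only the residues shown there are available.
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}

def pssm_score(sequence, pssm):
    """Sum the position-specific score of each residue at its own column."""
    return sum(pssm[residue][pos] for pos, residue in enumerate(sequence))

total = pssm_score("AYCAAAACCCA", PSSM)   # hypothetical 11-residue candidate
print(total)
```

The same residue contributes differently by position: C is worth 10 in the third column but -1 in the first, which is what lets a profile reward conservation where it matters.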
80Asking a Question With Psi-BLAST
81Asking a Question With Psi-BLAST
Is Leghemoglobin related to Human Hemoglobin?
82Asking a Question With Psi-BLAST
83Asking a Question With Psi-BLAST
84Asking a Question With Psi-BLAST
85CONCLUSIONS
86Searching Databases
- BLAST computes the Statistical Significance of the Alignments (E-Value, P-Value).
87Searching Databases