Alignments Database Searching - PowerPoint PPT Presentation

Title:

Alignments Database Searching

Description:

BLAST is a Heuristic Smith and Waterman. Basic Local Alignment Search Tool. BLAST = 3 STEPS ... Smith and Waterman. Exact Local Dynamic Programming, 1981. FASTA ...

Slides: 88
Provided by: FF8
Transcript and Presenter's Notes

Title: Alignments Database Searching


1
Alignments -> Database Searching
  • Sequence Analysis

2
Sequence searching problems
  • Task
  • Query new sequence (300 aa)
  • Database (searching space) very many sequences
  • Goal find seqs related to query
  • We want
  • fast tool
  • primarily a filter most sequences will be
    unrelated to the query
  • fine-tune the alignment later

3
Parenthesis: do you remember the amino acids
(symbols and properties)?
5
Parenthesis: substitution matrices
Seq 1  L V N R K P V V P
Seq 2  G V C R R P L K C
6
Amino acid exchange matrices
  • How do we get one?
  • And how do we get associated gap penalties?
  • First systematic method to derive a.a. exchange
    matrices by Margaret Dayhoff et al. (1978)
    Atlas of Protein Structure

20 × 20
9
The PAM250 matrix
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
10
PAM model
  • The scores derived through the PAM model are an
    accurate description of the information content
    (or the relative entropy) of an alignment
    (Altschul, 1991).
  • But the matrix contains information from each
    residue of every sequence, while some regions
    might be more closely related evolutionarily
    than others
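
As a reminder of what "information content" refers to here: exchange-matrix entries of this kind are usually written as log-odds scores. The notation below is introduced for illustration and is not on the original slide: q_ab is the frequency of the aligned pair (a, b) in related sequences, p_a is the background frequency of residue a, and lambda is a scaling constant.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b},
\qquad
H = \sum_{a,b} q_{ab}\,\log\frac{q_{ab}}{p_a\,p_b}
  = \lambda \sum_{a,b} q_{ab}\, s(a,b)
```

Under these assumptions H is the relative entropy of the matrix: up to the scale factor lambda, it is the expected score per aligned residue pair in truly related sequences.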

11
Blosum series
  • Currently the most widely used family of amino
    acid exchange matrices is the BLOSUM series
    (BLOSUM50, BLOSUM62). It is built on the same
    log-odds scoring idea as the PAM model.
  • The BLOSUM matrices are derived from the BLOCKS
    database of multiple alignments (Henikoff &
    Henikoff, 1992).
  • BLOSUM50 is derived from BLOCKS (core) alignment
    regions in which sequences sharing more than 50%
    identity are clustered together; BLOSUM62 uses a
    62% threshold, etc.

12
BLAST
Basic Local Alignment Search Tool
BLAST is a Program Designed for RAPIDLY Comparing
Your Sequence With every Sequence in a database
and REPORTING the most SIMILAR sequences
13
A few Definitions
Query: your sequence
Subject: the database against which you search
Heuristic: an algorithm that does not guarantee
the optimal solution
14
Other Important Definitions
Identity: proportion of IDENTICAL residues
between two sequences. Depends on the
alignment. Unit: % identity
Similarity: proportion of SIMILAR residues. Two
residues are similar if their substitution score
is higher than 0. Depends on the matrix. Unit:
% similarity
Homology: sequences SIMILAR enough are sometimes
HOMOLOGOUS. HOMOLOGY = COMMON ANCESTOR. Unit: Yes
or No! DIFFERENT sequences can also be
homologous
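
A minimal sketch of the two percentages, assuming an ungapped alignment and per-column substitution scores that are already known. The data are the Query/Subject/Score example from the extension slide of this deck; the function name `identity_and_similarity` is made up for this illustration.

```python
def identity_and_similarity(column_scores, query, subject):
    """Return (% identity, % similarity) for an ungapped alignment.

    A column counts as 'similar' when its substitution score is > 0,
    which is the definition given on the slide.
    """
    if not (len(query) == len(subject) == len(column_scores)):
        raise ValueError("sequences and scores must have equal length")
    n = len(query)
    identical = sum(q == s for q, s in zip(query, subject))
    similar = sum(score > 0 for score in column_scores)
    return 100.0 * identical / n, 100.0 * similar / n


query = "LVNRKPVVP"
subject = "GVCRRPLKC"
scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]

pid, psim = identity_and_similarity(scores, query, subject)
print(f"identity {pid:.1f}%, similarity {psim:.1f}%")
# identity 33.3%, similarity 55.6%
```

Identity only needs the alignment; similarity also needs the matrix, which is why the two numbers can differ for the same pair of sequences.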
15
More Important Definitions
Hit A sequence that matches your sequence and
reported by BLAST. E-Value Expectation
value How many times would you expect to find a
hit by chance only? Depends on the
alignment. Depends on the matrix Depends on
the database Sensitive to Low complexity
regions Unit must be lower than 0.0001 to
mean something
16
A Good Hit Is Something You Would Not Expect by
Chance
17
Database Search
1-Query
3-Database
4-Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
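
To make "too slow" concrete, here is a small sketch of the Smith and Waterman score computation. It assumes a toy match/mismatch score and a linear gap penalty instead of a real exchange matrix; those parameter values are illustrative only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    Every cell of a len(a) x len(b) table is filled, so the work grows
    with the product of the two lengths, for each database sequence.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6: the shared ACG
```

Running this against every sequence of a large database is what BLAST tries to avoid by filtering first.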
18
What is BLAST
  • Basic Local Alignment Search Tool
  • Bad news: it is only a heuristic method
  • Heuristic: A rule of thumb that often helps in
    solving a certain class of problems, but makes no
    guarantees. Perkins, DN (1981) The Mind's Best
    Work
  • Basic idea
  • High scoring segments have well conserved (almost
    identical) parts
  • Once well conserved parts are identified, extend
    them to the full alignment

19
BLAST
Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST = 3 STEPS
1-Decide who will be aligned completely
20
BLAST
Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST = 3 STEPS
1-Decide who will be compared
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
21
BLAST
A Bit of History
  • Smith and Waterman
  • Exact Local Dynamic Programming, 1981
  • FASTA
  • Lipman and Pearson, 1985
  • Looks for similar words (k-tup) on the same
    diagonal.
  • Compares the sequences one by one
  • BLAST
  • Altschul et al., 1990
  • The most widely cited tool in Biology
  • www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

22
What does "well conserved" mean for BLAST?
  • BLAST works with k-words (words of length k)
  • k is a parameter
  • different for DNA (>10) and proteins (2..4)
  • word w1 is T-similar to w2 if the sum of pair
    scores is at least T (e.g. T=12)
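
A sketch of T-similarity for the word RKP with T=12. Two assumptions are made for this illustration: the pair scores are taken from the PAM250 matrix shown earlier in this deck (the slide does not say which matrix it uses), and the alphabet is restricted to five residues so that the whole neighbourhood can be listed.

```python
from itertools import product

# Subset of PAM250 (symmetric), restricted to R, K, Q, E, P.
PAM250_SUBSET = {
    ("R", "R"): 6, ("R", "K"): 3, ("R", "Q"): 1, ("R", "E"): -1, ("R", "P"): 0,
    ("K", "K"): 5, ("K", "Q"): 1, ("K", "E"): 0, ("K", "P"): -1,
    ("Q", "Q"): 4, ("Q", "E"): 2, ("Q", "P"): 0,
    ("E", "E"): 4, ("E", "P"): -1,
    ("P", "P"): 6,
}
ALPHABET = "RKQEP"


def pair_score(a, b):
    """Look up a symmetric pair score regardless of argument order."""
    return PAM250_SUBSET.get((a, b), PAM250_SUBSET.get((b, a)))


def word_score(w1, w2):
    """Sum of pair scores, position by position."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def t_similar_words(word, threshold):
    """All words over ALPHABET whose score against `word` is >= threshold."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(word)))
    return sorted(w for w in candidates if word_score(word, w) >= threshold)


print(t_similar_words("RKP", 12))
# ['KKP', 'KRP', 'QKP', 'REP', 'RKP', 'RQP', 'RRP']
```

With the full 20-letter alphabet the list would be longer; the point is only that one query word expands into a small set of acceptable database words.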

23
BLAST algorithm: 3 basic steps
  • Preprocess the query
  • Scan for short, exact matches in database
  • Extend them to alignments

24
BLAST, Step 1: Preprocess the query
  • Take the query (e.g. LVNRKPVVP)
  • Chop it into overlapping k-words (k=3 in this
    case)
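
The chopping step can be sketched in a few lines. The function name `k_words` is made up for this illustration; each word is returned with its start position in the query, since the position is needed later for extension.

```python
def k_words(sequence, k):
    """Return (position, word) for every overlapping k-word."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


for position, word in k_words("LVNRKPVVP", 3):
    print(position, word)
# 0 LVN
# 1 VNR
# 2 NRK
# 3 RKP
# 4 KPV
# 5 PVV
# 6 VVP
```

A query of length L yields L - k + 1 words, here 9 - 3 + 1 = 7.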

25
BLAST, Step 2: Find exact matches
Method A: Scanning
  • Create finite-state automata
  • Use all the T-similar k-words to build the
    automata
  • Scan for exact matches

QKP KKP RQP REP RRP RKP ...
Table of k-words in the query
movement
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
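
A simplified sketch of the scan, using the word table and the database fragment shown on this slide. Note the substitution: instead of the finite-state automaton named on the slide, it tests each window against a Python set, which finds the same exact matches but is not how a production scanner would be written.

```python
def scan(subject, words):
    """Slide a window over `subject`; report (position, word) for each hit."""
    lengths = {len(w) for w in words}
    if len(lengths) != 1:
        raise ValueError("all words must have the same length k")
    k = lengths.pop()
    hits = []
    for i in range(len(subject) - k + 1):
        window = subject[i:i + k]
        if window in words:
            hits.append((i, window))
    return hits


words = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
subject = "VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA"
print(scan(subject, words))
# [(2, 'QKP'), (6, 'KKP'), (13, 'RQP'), (21, 'RKP')]
```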
26
Hashing
  • Indexing based on the content
  • Finding hash function
  • Transforms the object into a number
  • The number is used to access the array
  • Expected distribution should be flat
  • otherwise all objects land in a few slots
  • Example
  • In offices drawers with the first letter of the
    name
  • Library fiction 1st floor, poetry 2nd floor,
    science 3rd floor

27
Hashing
  • DEFINITION - Hashing is the transformation of a
    string of characters into a usually shorter
    fixed-length value or key that represents the
    original string. Hashing is used to index and
    retrieve items in a database because it is faster
    to find the item using the shorter hashed key
    than to find it using the original value. It is
    also used in many encryption algorithms.
  • As a simple example of the using of hashing in
    databases, a group of people could be arranged in
    a database like this
  • Abernathy, Sara Epperdingle, Roscoe Moore,
    Wilfred Smith, David (and many more sorted
    into alphabetical order)
  • Each of these names would be the key in the
    database for that person's data. A database
    search mechanism would first have to start
    looking character-by-character across the name
    for matches until it found the match (or ruled
    the other entries out). But if each of the names
    were hashed, it might be possible (depending on
    the number of names in the database) to generate
    a unique four-digit key for each name. For
    example
  • 7864 Abernathy, Sara 9802
    Epperdingle, Roscoe 1990 Moore, Wilfred
    8822 Smith, David (and so forth)
  • A search for any name would first consist of
    computing the hash value (using the same hash
    function used to store the item) and then
    comparing for a match using that value. It would,
    in general, be much faster to find a match across
    four digits, each having only 10 possibilities,
    than across an unpredictable value length where
    each character had 26 possibilities.
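
A toy version of the name example, assuming a made-up hash function (`four_digit_key` is not from the slide, and the keys it produces will not match the illustrative keys 7864, 9802, ... above). Names that collide on the same key share a bucket, so lookups stay correct even when keys are not unique.

```python
def four_digit_key(name):
    """Toy hash: fold the character codes into a value in 0..9999."""
    value = 0
    for ch in name:
        value = (value * 31 + ord(ch)) % 10000
    return value


names = [
    "Abernathy, Sara",
    "Epperdingle, Roscoe",
    "Moore, Wilfred",
    "Smith, David",
]

# key -> list of names stored under that key (a bucket)
table = {}
for name in names:
    table.setdefault(four_digit_key(name), []).append(name)


def lookup(name):
    """Hash the name with the same function, then check only one bucket."""
    return name in table.get(four_digit_key(name), [])


print(lookup("Smith, David"), lookup("Nobody, Nemo"))  # True False
```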

28
BLAST, Step 2: Find exact matches
Method B: Hashing
  • Preprocess the database
  • For each k-word, store the sequences in which
    it appears
  • The result is a hash table keyed by k-word

QKP KKP RQP REP RRP RKP
Gene134, IG_30, haemoglobin, ... Hashed db
29
BLAST, Step 2: Scan the database
Hashing Method
  • The database is preprocessed only once!
    (independently from the query)
  • In constant time we can retrieve database
    sequences where a k-word from query appears

QKP KKP RQP REP RRP RKP
Gene134, IG_30, haemoglobin, ... Hashed db
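
A sketch of the hashed database, assuming a tiny in-memory database. The sequence names `seq_a`, `seq_b`, `seq_c` and their contents are invented for this illustration; they stand in for entries such as Gene134 or IG_30 on the slide.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


# Built once, independently of any query.
database = {
    "seq_a": "VLQKPLKKPP",
    "seq_b": "CCEVVRKPLV",
    "seq_c": "MKTAYIAKQR",
}
index = build_index(database, 3)

# At query time each word costs one dictionary lookup.
query_words = ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]
hits = {w: index[w] for w in query_words if w in index}
print(hits)
# {'QKP': [('seq_a', 2)], 'KKP': [('seq_a', 6)], 'RKP': [('seq_b', 5)]}
```

Sequences that never show up in `hits` (here `seq_c`) are dropped without being aligned at all.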
30
BLAST, Step 3: Extending exact matches
  • Given the list of exact matches, we extend the
    alignment in both directions

Query     L   V   N   R   K   P   V   V   P
Subject   G   V   C   R   R   P   L   K   C
Score    -3   4  -3   5   2   7   1  -2  -3
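
A sketch of ungapped extension on the example above, starting from the seed RKP/RRP (columns 3 to 5). The stopping rule is an assumption made for this illustration: extension in one direction stops once the running score has fallen more than `max_drop` below the best score seen so far. The parameter name and its value are hypothetical.

```python
def extend_seed(scores, seed_start, seed_end, max_drop=5):
    """Extend the seed [seed_start, seed_end) left and right without gaps.

    `scores` holds one substitution score per alignment column.
    Returns (start, end, total_score) of the best extension, end exclusive.
    """
    seed_score = sum(scores[seed_start:seed_end])

    def best_gain(columns):
        best = running = 0
        best_len = 0
        for n, s in enumerate(columns, start=1):
            running += s
            if running > best:
                best, best_len = running, n
            elif best - running > max_drop:
                break  # fallen too far below the best: give up
        return best, best_len

    right_gain, right_len = best_gain(scores[seed_end:])
    left_gain, left_len = best_gain(reversed(scores[:seed_start]))
    return (seed_start - left_len,
            seed_end + right_len,
            seed_score + left_gain + right_gain)


scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]
print(extend_seed(scores, 3, 6))
# (1, 7, 16): columns 1..6, i.e. VNRKPV against VCRRPL
```

The seed alone scores 14; the extension accepts the -3 column on the left because the +4 behind it more than pays for it.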
31
Inside BLAST
Step 1: finding the worthy words
Query
REL
32
Inside BLAST
Step 2: Eliminate the database sequences that do
not contain any interesting word
Sequences within the database
Look for interesting words
ACT RSL TVF
...
...
List of "interesting" words (score > T)
  • Sequences containing interesting words (Hits)

33
Inside BLAST: the end
Step 3: Extension of the Hits
Database sequence
Query
X
  • 2 "Hits" on the same diagonal distant by less
    than X
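
A sketch of the two-hit condition. It assumes each hit is stored as (query position, database position), so that hits on the same diagonal share the same difference between the two positions; `max_distance` plays the role of X. The function name and the sample coordinates are invented for this illustration.

```python
from collections import defaultdict


def two_hit_pairs(hits, max_distance):
    """Return (diagonal, first_q, second_q) for consecutive hits on the
    same diagonal that are less than max_distance apart."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if second - first < max_distance:
                pairs.append((diagonal, first, second))
    return sorted(pairs)


hits = [(3, 10), (12, 19), (40, 47), (5, 30)]
print(two_hit_pairs(hits, max_distance=20))
# [(7, 3, 12)]
```

Only the pair that passes this test would be handed to the extension step; isolated hits such as (5, 30) are ignored.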

34
BLAST Statistics: Raw Score
  • Evaluation of the score
  • Raw Score
  • Sum of the substitution scores and gap penalties.
  • Not very informative

35
BLAST Statistics: P-Values
  • Derived Statistics
  • p-value
  • Probability of finding an alignment with such a
    score, by chance.
  • The lower, the better

36
BLAST Statistics: P-Values
Just as the sum of a large number of independent
identically distributed (i.i.d) random variables
tends to a normal distribution, the maximum of a
large number of i.i.d. random variables tends to
an extreme value distribution.
Extreme value distribution (Gumbel)
normal distribution
37
BLAST Statistics: P-Values
Sequences of lengths m and n, alignment with at
least score S
P-Value: probability that a random alignment
obtains a score greater than or equal to S
K must be calibrated with the database
composition
Lambda is calibrated with the matrix being used
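
The formula itself did not survive in this transcript. The form usually quoted for BLAST statistics (Karlin-Altschul), given here as an assumption about what the slide showed, is:

```latex
E = K\,m\,n\,e^{-\lambda S},
\qquad
P(\text{score} \ge S) = 1 - e^{-E}
```

Here m and n are the two sequence lengths, S is the raw score, and K and lambda are the calibration constants mentioned above. For small E, 1 - e^{-E} is approximately E.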
38
BLAST Statistics: E-Values
  • Derived Statistics
  • E-value
  • Number of alignments expected by chance
  • The lower, the better (< 0.00001)

Sequences m and n with at least score S
For values lower than 0.0001, E-Value ≈ P-Value
The E-Values are easier to compare than P-Values:
E-values of 5 and 10 correspond to P-values of
0.993 and 0.99995
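
The numbers on this slide can be checked with a short sketch. It assumes the usual relations E = K·m·n·exp(-lambda·S) and P = 1 - exp(-E); the K and lambda values passed in the example are arbitrary placeholders, not calibrated constants.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments with at least this score."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment.

    -expm1(-e) equals 1 - exp(-e) but stays accurate for tiny e.
    """
    return -math.expm1(-e)


for e in (0.0001, 5, 10):
    print(e, p_value(e))

# Placeholder constants: doubling the search space doubles E.
small = e_value(score=50, m=300, n=1_000_000, K=0.1, lam=0.3)
large = e_value(score=50, m=300, n=2_000_000, K=0.1, lam=0.3)
print(large / small)
```

P saturates near 1 while E keeps growing, which is why E-values of 5 and 10 are easier to tell apart than the corresponding P-values.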
39
BLAST Statistics: limits
The E-Value depends on N, the Database size. If
N increases, the same score gives a higher
E-Value, so some Hits can be lost
44
Database Against Database  Farm-Blast 
Genome 1
Genome 2
Ideal for finding Orthologues
45
The Many Flavors of BLAST
Program: blastp
Query: protein
Database: protein
46
The Many Flavors of BLAST
Program
Query
Database
47
If your Sequence is a Protein
48
If your Sequence is made of DNA
49
BLASTing with DNA: Asking the right question.
50
Keeping an Eye on the Public Servers.
51
Using BLAST The Basic Way
52
Database Search
Database Search Result: Prediction
Protein X IS or IS NOT homologous to the QUERY.
53
Submitting your Query
54
Understanding the BLAST Output
Graphic Display
Hit List
Alignments
55
Understanding the Graphic Display
56
Understanding the Hit List
57
Understanding the Alignments
58
Low Complexity Regions
  • Regions with a single residue repeated many times
    (like the AFGP) can produce meaningless
    alignments.
  • The statistics expect ALL the regions to look the
    same "on average".
  • By default, BLAST replaces these regions with Xs
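
A deliberately simple sketch of masking, covering only the case named above (a single residue repeated many times). This is not the filter BLAST actually uses, which also handles more general compositional bias; the function name and the `min_run` threshold are assumptions made for this illustration.

```python
import re


def mask_runs(sequence, min_run=5):
    """Replace every run of min_run or more identical residues with Xs."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group(0)), sequence)


print(mask_runs("MK" + "A" * 8 + "TPLV"))  # MKXXXXXXXXTPLV
print(mask_runs("MKAATPLV"))               # MKAATPLV (run too short)
```

Masked positions keep the sequence length unchanged, so alignment coordinates still refer to the original sequence.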

59
Reproducing The Experiment
Everything you need to know to reproduce your
search is at the bottom.
60
Database Searches A few Guidelines
61
DataBase Search According to Pearson
62

Using BLAST: Troubleshooting
69
Advanced Blast on the EMBnet
www.ch.embnet.org/software/aBLAST.html
  • More choice on the databases
  • Change all the parameters

70
Adapting BLAST to your Problem
73
Domain-Flavored BLAST
75
Psi-BLAST
76
BLAST latest Flavor
PSI-BLAST
-Position Specific Iterated Version of BLAST.
-Uses Profiles.
-More Sensitive.
PSI-BLAST performs a gapped BLAST database
search. The PSI-BLAST program uses the information
from any significant alignments returned to
construct a position-specific score matrix,
which replaces the query sequence for the next
round of database searching. PSI-BLAST may be
iterated until no new significant alignments are
found.
77
Psi-BLAST Iteration
(Figure: column of residues, all C)
78
Psi-BLAST Iteration
(Figure: column of residues, C with two S)
79
BLAST PSSM or weight matrix
     M   Y   C   E   Q   U   E   N   C   E   S
. .
A    0   2  -1   0   0   0   0  -1   0  -1   3
S   -1  -1  -1   0  -1   0   0   0   5  -1  -1
C   -1  -1  10   1  -1   0   0   5   5   4  -1
. .
Y   -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V   -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
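
A sketch of how such a weight matrix is used: unlike a 20 × 20 exchange matrix, the score of a residue depends on the position it is aligned to. Only the five rows visible on the slide are included, and the function name `pssm_score` is made up for this illustration.

```python
# One row per residue, one column per position of the profile.
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}


def pssm_score(pssm, segment, offset=0):
    """Score `segment` aligned without gaps starting at profile column `offset`."""
    return sum(pssm[res][offset + i] for i, res in enumerate(segment))


print(pssm_score(PSSM, "AYC"))            # 0 + 6 + 10 = 16
print(pssm_score(PSSM, "CCC", offset=7))  # 5 + 5 + 4 = 14
print(pssm_score(PSSM, "CCC", offset=0))  # -1 - 1 + 10 = 8
```

The same residue C scores 10 at column 2 but -1 at column 0, which is the extra information a profile carries over a single query sequence.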
80
Asking a Question With Psi-BLAST
81
Asking a Question With Psi-BLAST
Is Leghemoglobin related to Human
Hemoglobin?
82
Asking a Question With Psi-BLAST
83
Asking a Question With Psi-BLAST
84
Asking a Question With Psi-BLAST
85
CONCLUSIONS
86
Searching Databases
-BLAST computes the Statistical Significance of
the Alignments (E-Value, P-Value).