Title: No slide title
1Using BLAST to Search Sequence Databases
Searching Biological Sequence Databases
Cédric Notredame
2Outline
-Evolution and Sequence Similarity
-The inside of BLAST
-Using BLAST
-Adapting BLAST to your needs
-Searching Protein Domains with BLAST
-Digging Genomes
3Two Minutes of the Evolutionary Clock
5An Alignment is a STORY
6How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special
Role
7Why Does It Make Sense To Align Sequences ?
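8How Can We Compare Sequences ? The Twilight Zone
[Figure: Sequence Identity plotted against Length, with axis marks at 30 (identity) and 100 (length); the plot separates the "Same 3D Fold" region from the "Twilight Zone".]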
9Different molecular clocks for different
proteins--another prediction
10A few Basic Definitions
11A few Definitions
- Query: your sequence
- Subject: the database against which you search
- Heuristic: an algorithm that does not guarantee the optimal solution
12Other Important Definitions
- Identity: proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: % identity.
- Similarity: proportion of SIMILAR residues. Two residues are similar if their substitution score is higher than 0. Depends on the matrix. Unit: % similarity.
- Homology: sequences that are SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY = COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be homologous.
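The identity and similarity definitions above can be sketched in a few lines. This is only an illustration: the scoring groups below are a hypothetical stand-in for a real substitution matrix, and the denominator (aligned columns without gaps) is one convention among several.

```python
# Toy similarity groups standing in for a real substitution matrix.
# These scores are hypothetical; they are NOT BLOSUM or PAM values.
GROUPS = ["ILMV", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]


def toy_score(a: str, b: str) -> int:
    """Return 4 for identical residues, 1 within a group, -2 otherwise."""
    if a == b:
        return 4
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -2


def identity_and_similarity(aln_a: str, aln_b: str):
    """Percent identity and percent similarity over ungapped aligned columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    # "Similar" means a substitution score higher than 0 (identities included).
    similar = sum(toy_score(a, b) > 0 for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


ident, simil = identity_and_similarity("MKV-LDE", "MRVALEQ")
print(f"identity = {ident:.1f}%, similarity = {simil:.1f}%")
```

Both numbers change if the alignment or the scoring scheme changes, which is the point made on the slide.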
13More Important Definitions
- Hit: a sequence that matches your sequence and is reported by BLAST.
- E-Value (Expectation value): how many times would you expect to find a hit by chance only?
  - Depends on the alignment
  - Depends on the matrix
  - Depends on the database
  - Sensitive to low complexity regions
  - Unit: must be lower than 0.0001 to mean something
14A Good Hit Is Something You Would Not Expect by
Chance
15What is BLAST ?
16BLAST
Basic Local Alignment Search Tool
BLAST is a Program Designed for RAPIDLY Comparing
Your Sequence With every Sequence in a database
and REPORTING the most SIMILAR sequences
17Database Search
1-Query
3-Database
4-Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
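To see why exact local alignment is slow, here is a minimal sketch of Smith and Waterman scoring. The match, mismatch and gap values are hypothetical, and a linear gap penalty is assumed for brevity. The double loop fills one cell per pair of residues, so a single comparison costs len(a) x len(b) operations, and a database search repeats this for every database sequence.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```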
18Database Search
20BLAST
Basic Local Alignment Search Tool
BLAST is a Heuristic Smith and Waterman
BLAST 3 STEPS
1-Decide who will be compared
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
21BLAST
Heuristic Algorithms
A Bit of History
- Smith and Waterman
- Exact Local Dynamic Programming, 1981
- FASTA
- Lipman and Pearson, 1985
- Looks for similar words (k-tup) on the same diagonal.
- Comparison of the sequences one by one
- BLAST
- Altschul et al., 1990
- The most widely cited tool in Biology
- www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
22The Inside of BLAST
23Inside BLAST
Step 1: finding the worthy words
Query
REL
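A minimal sketch of Step 1, under stated assumptions: the query word REL is taken from the slide, but the scoring scheme is a hypothetical toy (not BLOSUM62) and the threshold T = 8 is chosen only for illustration. The idea is to list every word whose score against the query word is greater than T.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups standing in for a real substitution matrix.
GROUPS = ["ILMV", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]


def toy_score(a, b):
    """Return 4 for identical residues, 1 within a group, -2 otherwise."""
    if a == b:
        return 4
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -2


def neighborhood(word, threshold):
    """All words of the same length scoring more than threshold against word."""
    return {
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(toy_score(x, y) for x, y in zip(word, candidate)) > threshold
    }


words = neighborhood("REL", threshold=8)
print(sorted(words))
```

Raising T shrinks the list (faster search, fewer seeds); lowering it grows the list (slower, more sensitive).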
24Inside BLAST
Step 2: Eliminate the database sequences that do
not contain any interesting word
Sequences within the database
Look for interesting words
ACT RSL TVF
...
...
List of interesting words (score > T)
- Sequences containing interesting words (Hits)
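Step 2 can be sketched as a simple scan. The three words come from the slide; the database sequences and their names are hypothetical. Only sequences containing at least one interesting word are kept for the next step.

```python
def word_hits(sequence, words, w=3):
    """Return (position, word) pairs where a listed word occurs in sequence."""
    return [
        (i, sequence[i:i + w])
        for i in range(len(sequence) - w + 1)
        if sequence[i:i + w] in words
    ]


interesting = {"ACT", "RSL", "TVF"}
# Hypothetical database entries, for illustration only.
database = {
    "seq1": "MKTACTGGRSLPP",
    "seq2": "GGGGPPPPGGGG",
    "seq3": "AATVFKK",
}

all_hits = {name: word_hits(seq, interesting) for name, seq in database.items()}
candidates = {name: hits for name, hits in all_hits.items() if hits}
print(candidates)
```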
25Inside BLAST the end
Step 3: Extension of the Hits
Database sequence
Query
X
- 2 "Hits" on the same diagonal, less than X apart
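The trigger for Step 3 can be sketched as follows. This is a simplification: it uses exact word matches rather than the scored word list, and the sequences are hypothetical. A diagonal is the difference between the position in the database sequence and the position in the query; two non-overlapping hits on the same diagonal, less than X apart, start an extension.

```python
from collections import defaultdict
from itertools import combinations


def two_hit_triggers(query, subject, w=3, max_dist=10):
    """Return (diagonal, first_start, second_start) for qualifying hit pairs."""
    # Index every word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - w + 1):
        positions[query[i:i + w]].append(i)
    # Group exact word hits by diagonal (subject position - query position).
    by_diagonal = defaultdict(list)
    for j in range(len(subject) - w + 1):
        for i in positions.get(subject[j:j + w], ()):
            by_diagonal[j - i].append(i)
    triggers = []
    for diagonal, starts in sorted(by_diagonal.items()):
        for first, second in combinations(sorted(starts), 2):
            # Non-overlapping (at least w apart) and closer than max_dist.
            if w <= second - first < max_dist:
                triggers.append((diagonal, first, second))
    return triggers


print(two_hit_triggers("ACTGGRSL", "PPACTAARSLPP"))
```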
26The Statistics in BLAST
27BLAST Statistics Raw Score
- Evaluation of the score
- Raw Score
- Sum of the substitutions and gap penalties.
- Not very informative
28BLAST Statistics P Values
- Derived Statistics
- p-value
- Probability of finding an alignment with such a score, by chance.
- The lower, the better
29BLAST Statistics P-Values
Just as the sum of a large number of independent
identically distributed (i.i.d.) random variables
tends to a normal distribution, the maximum of a
large number of i.i.d. random variables tends to
an extreme value distribution.
Extreme value distribution (Gumbel)
normal distribution
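A small simulation can make the contrast visible. It is only a sketch: exponential variables, the sample sizes and the seed are arbitrary choices, not anything taken from BLAST. Averages of many draws come out nearly symmetric, while maxima show the long right tail typical of an extreme value distribution.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 100, 2000
means, maxima = [], []
for _ in range(trials):
    draws = [rng.expovariate(1.0) for _ in range(n)]
    means.append(statistics.fmean(draws))  # behaves like the sum
    maxima.append(max(draws))              # behaves like a best score

print("skewness of means :", round(skewness(means), 2))
print("skewness of maxima:", round(skewness(maxima), 2))
```

The score of the best local alignment is a maximum over many candidate alignments, which is why the extreme value curve, not the normal curve, is the relevant one.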
30BLAST Statistics P-Values
- P-Value: probability that a random alignment obtains a score greater than or equal to X
- K must be calibrated with the database composition
- Lambda is calibrated with the matrix being used
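The formula on this slide is not part of the transcript. As an assumption about what it showed, the usual Karlin-Altschul form is given below, where m is the query length, n the database length, S a raw score, and K and lambda the two calibrated parameters.

```latex
E = K\,m\,n\,e^{-\lambda S},
\qquad
P(S \ge x) = 1 - \exp\!\left(-K\,m\,n\,e^{-\lambda x}\right)
```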
31BLAST Statistics E-Values
- Derived Statistics
- E-value
- Number of alignments expected by chance
- The lower, the better (< 0.00001)
For values lower than 0.0001, E-Value ≈ P-Value
The E-Values are easier to compare than P-Values
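A short numeric check of that statement, assuming the usual Poisson relation P = 1 - exp(-E) between the expected number of chance hits and the probability of at least one:

```python
import math


def p_from_e(e_value):
    """P-value for at least one chance hit, given an E-value (Poisson model)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for tiny E.
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 0.01, 0.0001, 0.00001):
    print(f"E = {e:<8g} P = {p_from_e(e):.8f}")
```

For large E the two diverge (P can never exceed 1), which is one reason E-values are easier to compare.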
32BLAST Statistics Bit-Score
- Bit Score
- Evaluates the amount of information in the alignment
- Makes it possible to compare alignments
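As a sketch of how a bit score relates to a raw score, the standard normalisation S' = (lambda * S - ln K) / ln 2 is used below. The lambda and K values, the raw score and the search-space sizes are illustrative assumptions. Once the score is expressed in bits, the E-value no longer needs lambda or K, which is what makes alignments comparable across scoring systems.

```python
import math

# Illustrative parameter values, assumed for this example only.
LAMBDA, K = 0.267, 0.041


def bit_score(raw, lam=LAMBDA, k=K):
    """Normalise a raw score into bits."""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), with no matrix-specific parameter left."""
    return m * n * 2.0 ** (-bits)


raw = 100
bits = bit_score(raw)
print(f"raw = {raw}, bits = {bits:.1f}")
print(evalue_from_raw(raw, 250, 1e8), evalue_from_bits(bits, 250, 1e8))
```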
33BLAST Statistics Booby Trap!
The E-Value depends on N, the Database size. If
N increases, some Hits can be lost
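The trap can be put in numbers. Assuming E = m * n * 2^(-bits), with hypothetical values for the query length m, the database sizes n and the bit score, the same alignment passes a 0.0001 cut-off in a small database and fails it in one ten times larger.

```python
import math


def evalue(bits, m, n):
    """Expected number of chance hits for a bit score in a search space m * n."""
    return m * n * 2.0 ** (-bits)


def min_bits(m, n, e_threshold):
    """Smallest bit score that keeps the E-value at or below e_threshold."""
    return math.log2(m * n / e_threshold)


m, bits = 300, 50          # hypothetical query length and bit score
for n in (1e8, 1e9):       # hypothetical database sizes (total residues)
    print(f"N = {n:.0e}: E = {evalue(bits, m, n):.2e}, "
          f"bits needed for E <= 1e-4: {min_bits(m, n, 1e-4):.1f}")
```

Each tenfold increase in N raises the required bit score by log2(10), about 3.3 bits.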
35The Many Flavors of BLAST
39Database Against Database: "Farm-Blast"
Genome 1
Genome 2
Ideal for finding Orthologues
40The Classics: 1 Sequence Vs A Sequence Db
41The Many Flavors of BLAST
Program   Query     Database
blastp    protein   protein
42The Many Flavors of BLAST
Program
Query
Database
43If your Sequence is a Protein
44If your Sequence is made of DNA
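The choice of flavor can be sketched as a lookup on the query type and the database type. Only blastp appears in this transcript; the other entries are the standard BLAST programs, added here as background.

```python
# (query type, database type) -> candidate programs.
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("protein", "dna"): ["tblastn"],        # database translated in 6 frames
    ("dna", "protein"): ["blastx"],         # query translated in 6 frames
    ("dna", "dna"): ["blastn", "tblastx"],  # tblastx translates both sides
}


def choose_programs(query_type, database_type):
    """Return the BLAST programs that fit a query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return BLAST_PROGRAMS[key]


print(choose_programs("protein", "protein"))
print(choose_programs("DNA", "protein"))
```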
45BLASTing with DNA Asking the right question.
46Keeping an Eye on the Public Servers.
47Using BLAST The Basic Way
48Database Search
Database Search Result: Prediction
Protein X IS or IS NOT homologous to the QUERY.
49Submitting your Query
50Understanding the BLAST Output
Graphic Display
Hit List
Alignments
51Understanding the Graphic Display
52Understanding the Hit List
53Understanding the Alignments
54Low Complexity Regions
- Regions with a single residue repeated many times (like the AFGP) can produce meaningless alignments.
- The statistics expect ALL the regions to look the same on average.
- By default, BLAST replaces these regions with Xs
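The masking idea can be sketched for the simplest case named above, a single residue repeated many times. This is a deliberately crude stand-in: BLAST's own filters (such as SEG for proteins) rely on compositional complexity, not on a regular expression, and the minimum run length of 6 is an arbitrary assumption.

```python
import re


def mask_homopolymers(seq, min_run=6):
    """Replace runs of at least min_run identical residues with X."""
    # (.) captures one residue; \1{min_run-1,} demands the same residue again.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda match: "X" * len(match.group(0)), seq)


print(mask_homopolymers("MKT" + "A" * 8 + "VLDE"))
print(mask_homopolymers("MKTAAVLDE"))
```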
55Reproducing The Experiment
Everything you need to know to reproduce your
search is at the bottom.
56Database Searches A few Guidelines
57DataBase Search According to Pearson
58DataBase Search According to Pearson
59DataBase Search According to Pearson
60Using Weak Matches To Identify Domains
61Three Short-Sighted Witnesses are more
Informative than a single eagle-eye witness
62Using BLAST: Troubleshooting
69Advanced Blast on the EMBnet
www.ch.embnet.org/software/aBLAST.html
- More choice on the databases
- Change all the parameters
75Domain-Flavored BLAST
77Psi-BLAST
78BLAST latest Flavor
PSI-BLAST
- Position Specific Iterated version of BLAST.
- Uses Profiles.
- More Sensitive.
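How a profile is derived from aligned hits can be sketched as below. This is a simplified assumption-laden version, not PSI-BLAST's actual procedure: it uses a uniform background, a flat pseudocount and no sequence weighting, and the aligned segments are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Column-wise log2-odds scores against a uniform background."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


# Hypothetical aligned segments; the second and fourth columns are always C.
profile = build_profile(["ACDC", "SCEC", "ACDC", "TCEC"])
print({aa: round(score, 2) for aa, score in profile[1].items() if score > 0})
```

A strictly conserved column rewards its residue and penalises all others, which is what lets the next iteration find more distant relatives.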
79Psi-BLAST Iteration
80Psi-BLAST Iteration
81Psi-BLAST Iteration
[Figure, repeated over slides 79 to 81: an alignment column of C residues with an occasional S]
82BLAST PSSM or weight matrix
     M   Y   C   E   Q   U   E   N   C   E   S
.
.
A    0   2  -1   0   0   0   0  -1   0  -1   3
S   -1  -1  -1   0  -1   0   0   0   5  -1  -1
C   -1  -1  10   1  -1   0   0   5   5   4  -1
.
.
Y   -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V   -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
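Scoring a segment against such a matrix amounts to summing one cell per position. The sketch below uses the five rows that survive in the transcript (A, S, C, Y, V); the score of -1 for residues whose row was elided is an assumption, and the test segments are hypothetical.

```python
# Rows transcribed from the slide; one score per position of MYCEQUENCES.
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}


def pssm_score(segment, pssm=PSSM, unknown=-1):
    """Sum the position-specific scores of a segment as long as the matrix."""
    width = len(next(iter(pssm.values())))
    if len(segment) != width:
        raise ValueError(f"segment must have {width} residues")
    # Residues without a row get the assumed default score.
    return sum(
        pssm[res][pos] if res in pssm else unknown
        for pos, res in enumerate(segment)
    )


print(pssm_score("AYCAAAACCCA"))  # places C where the matrix rewards it
print(pssm_score("AAAAAAAAAAA"))  # same length, no conserved residues
```

Unlike a substitution matrix, the same residue scores differently at each position: C is worth 10 at the third position and -1 at the first.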
83Asking a Question With Psi-BLAST
84Asking a Question With Psi-BLAST
Is the Leghemoglobin related to the Human
Hemoglobin ?
85Asking a Question With Psi-BLAST
86Asking a Question With Psi-BLAST
87Asking a Question With Psi-BLAST
89Which Domain Organisation For Your
Protein ? (Reverse PSI-BLAST)
90Asking a Question With RPS-BLAST
- PSI-BLAST: Discovering Domains
- RPS-BLAST: Which KNOWN Domain in my protein ?
[Figure: Sequence searched against a Domain Database]
91Asking a Question With RPS-BLAST
93RPS-BLAST: Filtering Or Not Filtering Low
Complexity
95How Many Proteins Have the same Domain Structure
as Mine ? (CDART)
96Asking a Question With CDART
CDART: Conserved Domain Architecture Retrieval Tool
Finds the proteins that contain the same
domains as your protein.
97Asking a Question With CDART
- PSI-BLAST: Discovering Domains
- RPS-BLAST: Which known Domain in my protein ?
- CDART:
  - Which proteins have the SAME DOMAIN ORGANIZATION as my protein ?
  - Which domains are COMMONLY ASSOCIATED with the domain I am interested in ?
100Filtering
- By Domain
- By Species
101-I want to find all the Insect proteins
containing a Jun/Fos organisation.
102Asking a Question With CDART
-I want to see all the Insect proteins containing
a Jun/Fos organisation.
105Genome-Flavored BLAST
107Standard Blastn with long word size
108MegaBLAST: Longer Words
Faster BUT Less sensitive
Query
REL
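The speed versus sensitivity trade-off of longer words can be sketched with exact word matching on a toy example. The sequence is hypothetical, and the word sizes 11 and 28 are used here as typical blastn and MegaBLAST settings. A single substitution in the middle of a 40-base match leaves plenty of shared 11-mers but not a single shared 28-mer.

```python
def shared_words(query, subject, w):
    """Number of distinct words of length w present in both sequences."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    subject_words = {subject[i:i + w] for i in range(len(subject) - w + 1)}
    return len(query_words & subject_words)


# Hypothetical 40-base query; the subject differs by one base at index 20.
query = "ATGGCGTACGTTAGCCTAGGATCCGATTACGCAAGTCTGA"
subject = query[:20] + "C" + query[21:]

for w in (11, 28):
    print(f"word size {w}: {shared_words(query, subject, w)} shared words")
```

Fewer seeds mean less work, hence the speed; the price is that diverged matches never get seeded at all.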
110The NcBi BlAsT GEnoMe SecTion is MesSy
112Makes it possible to select predicted proteomes
114Venter-BLAST
115When it comes to BLASTing Eukaryotic
Genomes: WWW.ENSEMBL.ORG
116Asking a Question With ENSEMBL-BLAST
ENSEMBL: WHERE are the genes coding for
homologues of my protein located ?
120CONCLUSION
121Searching Databases
123A few Extra Resources
127Tuning BLAST
128BLAST Tuning