Slides: 130
Provided by: cedricno
Learn more at: https://tcoffee.org
Transcript and Presenter's Notes

1
Using BLAST to Search Sequence Databases
Searching Biological Sequence Databases
Cédric Notredame
2
Outline
-Evolution and Sequence Similarity
-The inside of BLAST
-Using BLAST
-Adapting BLAST to your needs
-Searching Protein Domains with BLAST
-Digging Genomes
3
Two Minutes of the Evolutionary Clock
5
An Alignment is a STORY
6
How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special
Role
7
Why Does It Make Sense To Align Sequences ?
8
How Can We Compare Sequences? The Twilight Zone
[Figure: Sequence Identity (axis mark at 30) plotted against Length
(axis mark at 100), with one region labelled "Same 3D Fold" and the
other "Twilight Zone"]
9
Different molecular clocks for different
proteins--another prediction
10
A few Basic Definitions
11
A few Definitions
Query: your sequence
Subject: the database against which you search
Heuristic: an algorithm that does not guarantee the optimal solution
12
Other Important Definitions
Identity: proportion of IDENTICAL residues between two sequences.
Depends on the alignment. Unit: % id
Similarity: proportion of SIMILAR residues. Two residues are similar
if their substitution score is higher than 0. Depends on the matrix.
Unit: % similarity
Homology: sequences SIMILAR enough are sometimes HOMOLOGOUS.
HOMOLOGY = COMMON ANCESTOR. Unit: Yes or No!
DIFFERENT sequences can also be homologous.
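As a minimal sketch of the two percentages defined above: the score table below is a toy (hypothetical values, not a real matrix), and counting only ungapped columns in the denominator is an assumption, since conventions vary between programs.

```python
# Toy substitution scores (hypothetical values, not a real matrix such as BLOSUM62).
TOY_SCORES = {
    frozenset("IL"): 2, frozenset("IV"): 3, frozenset("LV"): 1,
    frozenset("DE"): 2, frozenset("KR"): 2, frozenset("ST"): 1,
}

def substitution_score(a, b):
    """Score a residue pair: 4 for identity, the table value or -1 otherwise."""
    if a == b:
        return 4
    return TOY_SCORES.get(frozenset((a, b)), -1)

def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if "-" not in (a, b)]
    identical = sum(a == b for a, b in columns)
    # "Similar" means a substitution score higher than 0, so identities count too.
    similar = sum(substitution_score(a, b) > 0 for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)

identity, similarity = identity_and_similarity("KVLDS-A", "RVIETGA")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

With this toy table the two sequences are only about a third identical yet fully similar, which illustrates why similarity depends on the matrix while identity depends only on the alignment.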
13
More Important Definitions
Hit: a sequence that matches your sequence and is reported by BLAST.
E-Value: expectation value. How many times would you expect to find
a hit by chance only?
  • Depends on the alignment
  • Depends on the matrix
  • Depends on the database
  • Sensitive to low complexity regions
  • Unit: must be lower than 0.0001 to mean something
14
A Good Hit Is Something You Would Not Expect by
Chance
15
What is BLAST ?
16
BLAST
Basic Local Alignment Search Tool
BLAST is a Program Designed for RAPIDLY Comparing
Your Sequence With every Sequence in a database
and REPORTING the most SIMILAR sequences
17
Database Search
1-Query
3-Database
4-Statistical Evaluation (E-Value)
PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
18
Database Search
20
BLAST
Basic Local Alignment Search Tool
BLAST is a heuristic version of Smith and Waterman
BLAST: 3 STEPS
1-Decide who will be compared
2-Check the most promising Hits
3-Compute the E-value of the most interesting Hits
21
BLAST
Heuristic Algorithms
A Bit of History
  • Smith and Waterman
    Exact Local Dynamic Programming, 1981
  • FASTA
    Lipman and Pearson, 1985
    Looks for similar words (k-tup) on the same diagonal.
    Compares the sequences one by one
  • BLAST
    Altschul et al., 1990
    The most widely cited tool in Biology
    www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
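For reference, the exact local dynamic programming attributed above to Smith and Waterman can be sketched as follows. This is a score-only version with a linear gap penalty; the match, mismatch and gap values are illustrative assumptions, not parameters taken from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the query-by-subject table is filled, so the cost grows with the product of the two lengths. That is the reason an exact search against a whole database is slow and heuristics such as FASTA and BLAST were introduced.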

22
The Inside of BLAST
23
Inside BLAST
Step 1: finding the worthy words
[Figure: a word (REL) taken from the Query]
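A rough sketch of how a list of "worthy" words could be built from a query word such as REL. The scoring function, the residue groups and the threshold value are all hypothetical stand-ins; a real search would use a substitution matrix and the program's own threshold T.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical groups of similar residues, used only for this illustration.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]

def toy_score(a, b):
    """5 for identity, 2 within a group, -2 otherwise (assumed values)."""
    if a == b:
        return 5
    if any(a in group and b in group for group in GROUPS):
        return 2
    return -2

def neighborhood_words(query, word_size=3, threshold=11):
    """Map each word scoring above the threshold to its query positions."""
    words = {}
    for pos in range(len(query) - word_size + 1):
        seed = query[pos:pos + word_size]
        for candidate in product(AMINO_ACIDS, repeat=word_size):
            score = sum(toy_score(x, y) for x, y in zip(seed, candidate))
            if score > threshold:
                words.setdefault("".join(candidate), []).append(pos)
    return words

print(sorted(neighborhood_words("REL")))
```

Under these toy scores the word REL yields itself plus the six words that differ by one conservative substitution. Raising the threshold shrinks the list and speeds up the search at the cost of sensitivity.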
24
Inside BLAST
Step 2: Eliminate the database sequences that do
not contain any interesting word
Sequences within the database
Look for interesting words
ACT RSL TVF
...
...
List of "interesting" words (> T)
  • Sequences containing interesting words (Hits)
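The filtering in Step 2 can be pictured with the short sketch below. The three words come from the slide; the database sequences and the function name are made up for the example, and a real implementation would use a precomputed index rather than a plain scan.

```python
def find_word_hits(interesting_words, database, word_size=3):
    """Return {sequence name: [(word, position), ...]} for sequences with a hit."""
    hits = {}
    for name, sequence in database.items():
        for pos in range(len(sequence) - word_size + 1):
            word = sequence[pos:pos + word_size]
            if word in interesting_words:
                hits.setdefault(name, []).append((word, pos))
    return hits

# Hypothetical toy database; seq2 contains no interesting word and is eliminated.
database = {"seq1": "MKACTLLV", "seq2": "GGGGGGGG", "seq3": "PRSLQTVF"}
hits = find_word_hits({"ACT", "RSL", "TVF"}, database)
print(hits)
```

Only the sequences that appear in the result go on to the extension step, which is where most of the speed-up over an exhaustive comparison comes from.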

25
Inside BLAST: the end
Step 3: Extension of the Hits
[Figure: the Query plotted against a Database sequence, with a
distance X marked along a diagonal]
  • 2 "Hits" on the same diagonal, less than X apart
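One way to sketch the two-hit criterion stated above: word hits are grouped by diagonal (subject position minus query position), and two consecutive hits on one diagonal trigger an extension when they are less than X apart. The value of X and the input coordinates are assumptions for the example, and the extension itself is not shown.

```python
from collections import defaultdict

def two_hit_seeds(word_hits, max_distance=40):
    """word_hits: list of (query_pos, subject_pos). Returns (diagonal, first, second)."""
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in word_hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one on the same diagonal.
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)

hits = [(5, 105), (20, 120), (8, 300), (90, 190)]
print(two_hit_seeds(hits))
```

Here three hits share diagonal 100, but only the first two are close enough to seed an extension; the isolated hit on the other diagonal is ignored.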

26
The Statistics in BLAST
27
BLAST Statistics: Raw Score
  • Evaluation of the score
  • Raw Score
  • Sum of the substitutions and gap penalties.
  • Not very informative
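A small sketch of a raw score as described above, summing substitution scores and gap penalties along an alignment. The scoring values are assumptions, and the gap model is simplified: the opening penalty covers the first gapped column and adjacent gaps in either sequence are treated as one run.

```python
def raw_score(aligned_a, aligned_b, match=5, mismatch=-4, gap_open=-10, gap_extend=-1):
    """Sum of substitution scores and gap penalties over alignment columns."""
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

print(raw_score("ACGTACG", "AC--ACT"))
```

The same alignment would get a different raw score under a different matrix or different gap penalties, which is why the raw score on its own says little.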

28
BLAST Statistics: P-Values
  • Derived Statistics
  • p-value
  • Probability of finding an alignment with such a
    score, by chance.
  • The lower, the better

29
BLAST Statistics: P-Values
Just as the sum of a large number of independent
identically distributed (i.i.d.) random variables
tends to a normal distribution, the maximum of a
large number of i.i.d. random variables tends to
an extreme value distribution.
[Figure: extreme value distribution (Gumbel) compared with a normal
distribution]
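The contrast between sums and maxima can be checked with a quick simulation. This is only a sketch: the choice of exponential variables, the sample sizes and the seed are arbitrary assumptions. It compares the skewness of the two samples, since a normal distribution is symmetric while a Gumbel distribution has a long right tail.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def simulate(n_variables=100, n_trials=5000, seed=0):
    """Draw n_trials batches of i.i.d. variables; keep each batch's sum and maximum."""
    rng = random.Random(seed)
    sums, maxima = [], []
    for _ in range(n_trials):
        draws = [rng.expovariate(1.0) for _ in range(n_variables)]
        sums.append(sum(draws))
        maxima.append(max(draws))
    return skewness(sums), skewness(maxima)

skew_sum, skew_max = simulate()
print(f"skewness of sums:   {skew_sum:.2f}")
print(f"skewness of maxima: {skew_max:.2f}")
```

The sums come out nearly symmetric, while the maxima stay clearly right-skewed. A best alignment score is a maximum over many candidate alignments, so it is the second shape that matters for BLAST.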
30
BLAST Statistics: P-Values
P-Value: probability that a random alignment
obtains a score superior or equal to X.
K must be calibrated with the database composition.
Lambda is calibrated with the matrix being used.
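The formula shown on this slide is not part of the transcript. For reference, the standard Karlin-Altschul form is sketched below, on the assumption (not confirmed by the transcript) that this is what the slide displayed. Here m is the query length and N the database size; the symbol m is introduced for this note, while K, Lambda, X and N are the quantities named in the slides.

```latex
E = K \, m \, N \, e^{-\lambda X},
\qquad
P(S \ge X) = 1 - e^{-E}
```

Under this form, K and Lambda are the only two parameters to calibrate, and the score enters only through the exponent.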
31
BLAST Statistics: E-Values
  • Derived Statistics
  • E-value
  • Number of alignments expected by chance
  • The lower, the better (< 0.00001)

For values lower than 0.0001, E-Value ≈ P-Value
The E-Values are easier to compare than P-Values
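A short numerical check of how close the two statistics are, assuming the usual Poisson relation P = 1 - exp(-E) between them (the relation is not spelled out in the transcript).

```python
import math

def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), computed with expm1 to stay accurate for tiny E."""
    return -math.expm1(-e_value)

for e in (10.0, 1.0, 0.1, 1e-4, 1e-10):
    print(f"E = {e:<8g} P = {p_value_from_e_value(e):.6g}")
```

For small values the two numbers agree to several digits, while for large values P saturates near 1 and E keeps growing. This is one reason E-values are easier to compare.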
32
BLAST Statistics: Bit-Score
  • Bit Score
  • Evaluates the amount of information in the
    alignment
  • Makes it possible to compare alignments
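A minimal sketch of the usual conversion from raw score to bit score, S' = (Lambda * S - ln K) / ln 2. The Lambda and K values below are commonly quoted figures for gapped BLOSUM62 searches and are used here purely as illustrative assumptions.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalise a raw score with the scoring system's Lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameters (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041
print(f"raw score 100 -> {bit_score(100, LAMBDA, K):.1f} bits")
```

Because Lambda and K are folded into the bit score, two alignments produced with different matrices or gap penalties can be compared directly, which raw scores do not allow.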

33
BLAST Statistics: Booby Trap!
The E-Value depends on N, the database size. If
N increases, some hits can be lost.
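The trap can be illustrated numerically, assuming the standard relation E = m * N * 2^(-bit score) with m the query length. The bit score, query length, database sizes and cutoff below are made-up values chosen for the example.

```python
def e_value(bit_score, query_length, database_size):
    """Expected number of chance alignments reaching this bit score."""
    return query_length * database_size * 2.0 ** (-bit_score)

CUTOFF = 1e-3  # hypothetical reporting threshold
for database_size in (1e8, 1e9, 1e10, 1e11):
    e = e_value(50.0, 300, database_size)
    status = "reported" if e <= CUTOFF else "lost"
    print(f"N = {database_size:.0e}  E = {e:.2e}  {status}")
```

The alignment and its bit score never change, yet the same hit drops below the cutoff once the database has grown a hundredfold. E-values from searches against databases of different sizes are therefore not directly comparable.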
35
The Many Flavors of BLAST
39
Database Against Database: "Farm-Blast"
Genome 1
Genome 2
Ideal for finding Orthologues
40
The Classics: 1 Sequence Vs A Sequence Db
41
The Many Flavors of BLAST
Program    Query      Database
blastp     protein    protein
The Many Flavors of BLAST
Program
Query
Database
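The rows of this table did not survive in the transcript. As a hedged reminder, the five classic BLAST programs and the kind of query and database each expects can be written as a small lookup. The dictionary and function names are invented for this note; the program names are the standard ones.

```python
# (query type, database type) expected by each classic BLAST program.
BLAST_PROGRAMS = {
    "blastp":  ("protein", "protein"),
    "blastn":  ("nucleotide", "nucleotide"),
    "blastx":  ("nucleotide", "protein"),     # query translated in six frames
    "tblastn": ("protein", "nucleotide"),     # database translated in six frames
    "tblastx": ("nucleotide", "nucleotide"),  # both sides translated
}

def programs_for(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database) in BLAST_PROGRAMS.items()
        if (query, database) == (query_type, database_type)
    )

print(programs_for("protein", "nucleotide"))
print(programs_for("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database has two options: a direct comparison of the DNA, or a comparison of the translations on both sides.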
43
If your Sequence is a Protein
44
If your Sequence is made of DNA
45
BLASTing with DNA: Asking the right question.
46
Keeping an Eye on the Public Servers.
47
Using BLAST: The Basic Way
48
Database Search
Database Search Result: Prediction
Protein X IS or IS NOT homologous to the QUERY.
49
Submitting your Query
50
Understanding the BLAST Output
Graphic Display
Hit List
Alignments
51
Understanding the Graphic Display
52
Understanding the Hit List
53
Understanding the Alignments
54
Low Complexity Regions
  • Regions with a single residue repeated many times
    (like the AFGP) can produce meaningless
    alignments.
  • The statistics expect ALL the regions to look the
    same "on average".
  • By default, BLAST replaces these regions with Xs
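To give an idea of what such a filter does, here is a simplified sketch that masks windows of low Shannon entropy with Xs. This is not the algorithm BLAST actually uses (protein searches rely on the SEG program); the window size, the entropy threshold and the test sequence are arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace every residue covered by a low-entropy window with X."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

sequence = "MKTLVDE" + "A" * 16 + "GHWPRSC"
print(mask_low_complexity(sequence))
```

The run of alanines is masked together with a few flanking residues, because any window dominated by the repeat falls below the threshold. A sequence made of varied residues is left untouched.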

55
Reproducing The Experiment
Everything you need to know to reproduce your
search is at the bottom.
56
Database Searches: A few Guidelines
57
DataBase Search According to Pearson
58
DataBase Search According to Pearson
59
DataBase Search According to Pearson
60
Using Weak Matches To Identify Domains
61
Three Short-Sighted Witnesses are more
Informative than a single eagle-eye witness
62

Using BLAST: Troubleshooting
69
Advanced Blast on the EMBnet
www.ch.embnet.org/software/aBLAST.html
  • More choice on the databases
  • Change all the parameters

75
Domain-Flavored BLAST
77
Psi-BLAST
78
BLAST: latest Flavor
PSI-BLAST
-Position Specific Iterated Version of BLAST.
-Uses Profiles.
-More Sensitive.
79
Psi-BLAST Iteration
80
Psi-BLAST Iteration
81
Psi-BLAST Iteration
82
BLAST PSSM or weight matrix
      M   Y   C   E   Q   U   E   N   C   E   S  ...
A     0   2  -1   0   0   0   0  -1   0  -1   3
S    -1  -1  -1   0  -1   0   0   0   5  -1  -1
C    -1  -1  10   1  -1   0   0   5   5   4  -1
...
Y    -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V    -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
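A sketch of how a sequence is scored against such a weight matrix, using only the five rows legible on this slide (A, S, C, Y, V). The function name and the test sequence are made up for the example; a complete PSSM has one row per amino acid.

```python
# Rows copied from the slide: one score per profile position (11 columns).
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}
PROFILE_LENGTH = 11

def pssm_score(window):
    """Sum the position-specific score of each residue in an 11-residue window."""
    if len(window) != PROFILE_LENGTH:
        raise ValueError("window must match the profile length")
    return sum(PSSM[residue][position] for position, residue in enumerate(window))

print(pssm_score("VYCSAAAACCA"))
```

Unlike a fixed substitution matrix, the score of a residue depends on where it falls: a cysteine is worth 10 at the third position but -1 at the first. This position dependence is what lets a profile capture the conservation pattern of a family.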
83
Asking a Question With Psi-BLAST
84
Asking a Question With Psi-BLAST
Is Leghemoglobin related to Human Hemoglobin?
85
Asking a Question With Psi-BLAST
86
Asking a Question With Psi-BLAST
87
Asking a Question With Psi-BLAST
89
Which Domain Organisation For Your Protein?
(Reverse PSI-BLAST)
90
Asking a Question With RPS-BLAST
PSI-BLAST: Discovering Domains
RPS-BLAST: Which KNOWN Domain in my protein?
Sequence
Domain Database
91
Asking a Question With RPS-BLAST
93
RPS-BLAST: Filtering Or Not Filtering Low
Complexity
95
How Many Proteins Have the same Domain Structure
as Mine ? (CDART)
96
Asking a Question With CDART
CDART: Conserved Domain Architecture Retrieval Tool
Finds the proteins that contain the same
domains as your protein.
97
Asking a Question With CDART
PSI-BLAST: Discovering Domains
RPS-BLAST: Which known Domain in my protein?
CDART:
-Which proteins have the SAME DOMAIN ORGANIZATION
as my protein?
-Which domains are COMMONLY ASSOCIATED with the
domain I am interested in?
100
Filtering
-By Domain
-By Species
101
-I want to find all the Insect proteins
containing a Jun/Fos organisation.
102
Asking a Question With CDART
-I want to see all the Insect proteins containing
a Jun/Fos organisation.
103
Asking a Question With CDART
-I want to see all the Insect proteins containing
a Jun/Fos organisation.
104
Asking a Question With CDART
-I want to see all the Insect proteins containing
a Jun/Fos organisation.
105
Genome-Flavored BLAST
107
Standard Blastn with long word size
108
MegaBLAST: Longer Words
Faster BUT Less sensitive
[Figure: a word (REL) taken from the Query]
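The trade-off can be sketched by counting exact word matches between a DNA sequence and a diverged copy of itself at several word sizes. The sequence and the mutation pattern are made up; 11 and 28 are the commonly cited default word sizes of blastn and MegaBLAST, given here as an assumption.

```python
def count_seed_words(query, subject, word_size):
    """Number of query positions whose word occurs exactly in the subject."""
    subject_words = {
        subject[i:i + word_size] for i in range(len(subject) - word_size + 1)
    }
    return sum(
        query[i:i + word_size] in subject_words
        for i in range(len(query) - word_size + 1)
    )

def mutate_every(sequence, step=10):
    """Substitute one base in every `step` positions (a toy 10% divergence)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if i % step == step - 1 else base
        for i, base in enumerate(sequence)
    )

subject = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACG"  # hypothetical 40-mer
query = mutate_every(subject)
for word_size in (7, 11, 28):
    print(word_size, count_seed_words(query, subject, word_size))
```

With one mismatch every ten bases, the longest exact stretch is nine bases, so short words still seed the comparison while long words have nothing to match. Long words suit nearly identical sequences, where the reduced number of seeds is what makes the search faster.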
110
The NcBi BlAsT GEnoMe SecTion is MesSy
112
Makes it possible to select predicted proteomes
114
Venter-BLAST
115
When it comes to BLASTing Eukaryotic
Genomes: WWW.ENSEMBL.ORG
116
Asking a Question With ENSEMBL-BLAST
ENSEMBL: WHERE are the genes coding for
homologues of my protein located?
120
CONCLUSION
121
Searching Databases
123
A few Extra Resources
127
Tuning BLAST
128
BLAST Tuning