Title: BLAST II
1BLAST (II)
Basic Local Alignment Search Tool
email: jimfann_at_itri.org.tw 03/11/2008
2Reference Sources
- Ye J, McGinnis S, Madden TL (2006) "BLAST: improvements for better sequence analysis." Nucleic Acids Res. 34 (Web Server issue): W6-W9.
- McGinnis S, Madden TL (2004) "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic Acids Res. 32 (Web Server issue): W20-W25.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25: 3389-3402.
- http://www.ncbi.nlm.nih.gov/BLAST/
- ftp://ftp.ncbi.nih.gov/blast/
- Joseph Bedell, Ian Korf, Mark Yandell (2003) BLAST. O'Reilly. http://www.oreilly.com/catalog/blast/
- Jonathan Pevsner (2003) Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. http://www.bioinfbook.org
3Contents
- blastp
- Protein-protein BLAST
- PSI-BLAST
- Position-Specific Iterated BLAST
- PSSM/profile
- PHI-BLAST
- Pattern-Hit Initiated BLAST
- pattern/motif
- Choose the BLAST program
- Tips to improve BLAST searches
- Stand-alone BLAST
4BLAST programs
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
5blastp
6blastp
- Enter Query Sequence
- Choose Search Set
- Program Selection
- Algorithm parameters
- General Parameters
- Scoring Parameters
- Filters and Masking
7Peptide Sequence Databases (FASTA format)
8BLAST programs
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
9Consensus sequences - Patterns - PSSM
- Multiple sequence alignment (MSA) to detect conserved regions in protein or DNA sequences and to build models of these conserved regions
- Consensus sequences
- Patterns
- Position Specific Score Matrices (PSSMs), Profiles
- etc.
10Consensus sequences
- the simplest method to build a model from a multiple sequence alignment
- Majority wins: the most frequent residue in each column is kept
- Positions with too much variation are skipped
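The "majority wins" rule above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `consensus`, the `min_fraction` cutoff, and the `x` placeholder for skipped columns are assumptions made for this example, and the toy alignment is invented.

```python
from collections import Counter


def consensus(alignment, min_fraction=0.5, unknown="x"):
    """Return a majority-rule consensus for equal-length aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        # Majority wins: keep the most frequent residue if it is common enough.
        # Otherwise the column varies too much and gets the placeholder.
        if residue != "-" and count / len(column) >= min_fraction:
            result.append(residue)
        else:
            result.append(unknown)
    return "".join(result)


if __name__ == "__main__":
    toy_alignment = ["ACDEG", "ACDQG", "ACNEG", "SCDKG"]  # invented example
    print(consensus(toy_alignment))
    print(consensus(toy_alignment, min_fraction=0.75))
```

Raising `min_fraction` makes the model stricter: more columns are treated as too variable, which shows how much information a consensus sequence throws away compared with a pattern or a PSSM.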
11Pattern
- a set of alternative sequences, using regular expressions
- Prosite (http://www.expasy.org/prosite/)
12The Prosite syntax for patterns
- uses the standard IUPAC one-letter codes for amino acids (G=Gly, P=Pro, ...),
- each element in a pattern is separated from its neighbor by a "-",
- the symbol "x" is used where any amino acid is accepted,
- ambiguities are indicated by square brackets ([AG] means Ala or Gly),
- amino acids that are not accepted at a given position are listed between a pair of curly brackets ({AG} means any amino acid except Ala and Gly),
- repetitions are indicated between parentheses ( ): [AG](2,4) means Ala or Gly between 2 and 4 times, x(2) means any amino acid twice,
- a pattern is anchored to the N-term and/or C-term by the symbols "<" and ">" respectively.
13Pattern
- <A-x-[ST](2)-x(0,1)-{V}
- an Ala in the N-term,
- followed by any amino acid,
- followed by a Ser or Thr twice,
- followed or not by any residue,
- followed by any amino acid except Val.
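Because a Prosite pattern is essentially a regular expression, the syntax rules can be made concrete with a small translator. The sketch below is an assumption-laden illustration (the function name `prosite_to_regex` is hypothetical, and only the elements listed in the syntax slide are handled); it is not the parser used by Prosite itself. Note that `.` matches any character, not only the 20 amino acids.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")   # "<" anchors to the N-terminus
    anchored_end = pattern.endswith(">")       # ">" anchors to the C-terminus
    pattern = pattern.lstrip("<").rstrip(">")

    parts = []
    for element in pattern.split("-"):
        # Separate an optional repetition such as "(2)" or "(0,1)" from the residues.
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        residues, low, high = match.groups()

        if residues.lower() == "x":
            regex = "."                             # any amino acid
        elif residues.startswith("{") and residues.endswith("}"):
            regex = "[^" + residues[1:-1] + "]"     # residues that are not accepted
        else:
            regex = residues                        # single letter or [..] alternatives

        if low is not None:
            regex += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(regex)

    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


if __name__ == "__main__":
    regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
    print(regex)
    print(bool(re.search(regex, "ASSTKL")))   # invented test sequence
```

For the example pattern, the translation reads the same way as the bullet list: an anchored Ala, any residue, Ser or Thr twice, an optional residue, then anything except Val.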
14PSSM (Position Specific Scoring Matrix)
15PSSM (Position Specific Scoring Matrix)
16PSSM (Position Specific Scoring Matrix)
17PSSM (Position Specific Scoring Matrix)
18PSI-BLAST
19BLAST programs
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
20PSI-BLAST (Position-Specific Iterated BLAST)
21PSI-BLAST
22PSSM (NCBI)
- To save a PSSM file:
  - Run a protein BLAST search.
  - Check the PSI-BLAST box on the formatting page.
  - Click the "Format" button.
  - On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
  - Now, on the Format page, select "PSSM" from the "Show" pull-down menu.
  - Click the "Format" button.
  - This will display text output with the ASCII-encoded PSSM. The "Save as..." option of the browser can be used to save this to a plain text file on your hard drive.
- To use the PSSM in a new protein BLAST search against other databases:
  - Copy the above PSSM from the browser.
  - Open a new protein BLAST page.
  - Paste the PSSM into the PSSM field on the page.
  - Provide the SAME query in the search box.
  - Select a different target database.
  - Click the "BLAST" button to start the search.
- If the database is the same as when the PSSM was stored, you'll reproduce the iteration on which you saved the PSSM. A different database will yield a different hit list.
23PSI-BLAST
24PSI-BLAST (Position-Specific Iterated BLAST)
1. Select a query and search it against a protein database.
2. PSI-BLAST constructs a multiple sequence alignment, then creates a profile or specialized position-specific scoring matrix (PSSM).
3. The PSSM is used as a query against the database.
4. PSI-BLAST estimates statistical significance (E values).
5. Repeat steps 3 and 4 iteratively, typically 5 times. At each new search, a new profile is used as the query.
[Figure: PSSM]
From http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
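To make the idea of a position-specific scoring matrix concrete, here is a deliberately simplified sketch of how a matrix can be derived from an alignment and used to score a sequence. It assumes uniform background frequencies and a flat pseudocount, and it ignores gaps; PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme, so this is not its actual procedure. The names `build_pssm` and `score_sequence` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            # Log-odds: positive when the residue is enriched at this position.
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped, same-length sequence."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column[res] for column, res in zip(pssm, sequence))


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD"]   # invented example
    pssm = build_pssm(toy_alignment)
    print(score_sequence(pssm, "ACD"), score_sequence(pssm, "WWW"))
```

The key difference from a substitution matrix such as BLOSUM62 is visible in the data structure: the score for a residue depends on the column, so a conserved position penalizes substitutions more heavily than a variable one.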
25PSI-BLAST vs BLASTp
- PSI-BLAST can find more distant homologs than a simple BLAST search.
- PSI-BLAST uses two E-values:
  - the threshold E-value for the initial BLAST (-e option). The default is 10, as in standard BLAST.
  - the inclusion E-value for accepting sequences into the PSSM construction (-h option). The default is 0.005.
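The two thresholds play different roles: one decides which hits are reported, the other decides which of those hits feed the next PSSM. A minimal sketch of that split follows; the function name `split_hits` is hypothetical, the hit names and E-values are invented, and the use of "less than or equal" at the boundary is an assumption.

```python
def split_hits(hits, report_evalue=10.0, inclusion_evalue=0.005):
    """Separate reported hits from the subset used to build the next PSSM.

    hits is a list of (name, evalue) pairs.
    """
    # Hits at or below the reporting threshold (-e) appear in the output.
    reported = [(name, e) for name, e in hits if e <= report_evalue]
    # Only hits at or below the inclusion threshold (-h) enter the PSSM.
    included = [name for name, e in reported if e <= inclusion_evalue]
    return reported, included


if __name__ == "__main__":
    toy_hits = [("hitA", 1e-30), ("hitB", 0.003), ("hitC", 0.8), ("hitD", 42.0)]
    reported, included = split_hits(toy_hits)
    print([name for name, _ in reported])
    print(included)
```

A hit such as `hitC` is listed in the report but does not influence the profile, which is why the inclusion threshold is kept much stricter than the reporting threshold.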
26PSI-BLAST advantages
- Fast because of the BLAST heuristic.
- Allows PSSM searches on large databases.
- A particularly efficient algorithm for sequence weighting.
- A very sophisticated statistical treatment of the match scores.
- Single software package.
- User friendly interface.
27PSI-BLAST pitfalls
- Avoid sequences that are too closely related: they overfit the profile!
- It can include false homologs! Therefore check the matches carefully: include or exclude sequences based on biological knowledge.
- The E-value reflects the significance of the match to the previous training set, not to the original sequence!
- Choose your query sequence carefully.
- Try the reverse experiment to confirm.
28BLAST programs
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
29PHI-BLAST (Pattern-Hit Initiated BLAST)
30PHI-BLAST (Pattern-Hit Initiated BLAST)
This dual requirement is intended to reduce the
number of database hits that contain the pattern,
but are likely to have no true homology to the
query.
From http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
31Choose the BLAST program
Program   Query     Database   Frame combinations
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
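The choice of program follows directly from the query and database types, which can be written as a small lookup. This is only a sketch: the function name `choose_blast_program` and the `translated` flag (used to pick tblastx over blastn for DNA against DNA) are hypothetical names introduced for illustration.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Pick a BLAST program from the query and database sequence types.

    Types are "DNA" or "protein". For DNA against DNA, translated=True
    selects tblastx (both sides translated in six frames) instead of blastn.
    """
    key = (query_type.lower(), database_type.lower())
    if key == ("dna", "dna"):
        return "tblastx" if translated else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",      # query translated in six frames
        ("protein", "dna"): "tblastn",     # database translated in six frames
    }
    if key not in programs:
        raise ValueError(f"unsupported combination: {query_type} vs {database_type}")
    return programs[key]


if __name__ == "__main__":
    print(choose_blast_program("DNA", "protein"))
    print(choose_blast_program("DNA", "DNA", translated=True))
```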
32Choose the BLAST program
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
33Choose the BLAST program
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
34Choose the BLAST program
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
36BLAST searches
- design experiment
- query sequences
- target databases
- choose BLAST program
- set parameters
- run BLAST
- data analysis
37Tips to improve BLAST searches (1/3)
- Don't use the default parameters
- Treat BLAST searches as scientific experiments
- Perform controls, especially in the twilight zone
- View BLAST reports graphically
- use the Karlin-Altschul equation to design experiments
- when troubleshooting, read the footer first
(Bedell et al. 2003)
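The Karlin-Altschul equation mentioned above relates the expected number of chance hits E to the raw score S, the query length m, and the database length n: E = K * m * n * exp(-lambda * S). The sketch below evaluates it directly. The lambda and K values are illustrative assumptions only (they depend on the scoring matrix and gap costs and are printed in the footer of a BLAST report), and the sketch ignores the effective-length corrections that BLAST applies.

```python
import math


def expected_hits(raw_score, query_length, database_length, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lambda * S - ln K) / ln 2, so E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    lam, k = 0.267, 0.041   # illustrative values, not taken from a real report
    for database_length in (1e6, 1e9):
        e = expected_hits(100, 250, database_length, lam, k)
        print(f"n = {database_length:.0e}: E = {e:.3g}")
    print(f"bit score = {bit_score(100, lam, k):.1f}")
```

Seen this way, the design question becomes quantitative: the same raw score is a thousand times less significant in a database that is a thousand times larger, so the search space is part of the experiment.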
38Tips to improve BLAST searches (2/3)
- know when to use complexity filters
- mask repeats in genomic DNA
- segment large genomic sequences
- be skeptical of hypothetical proteins
- expect contaminants in EST databases
- use caution when searching raw sequencing reads
- look for stop codons and frame-shifts to find
pseudo-genes
(Bedell et al. 2003)
39Tips to improve BLAST searches (3/3)
- consider using ungapped alignment for BLASTX, TBLASTN, and TBLASTX
- look for gaps in coverage as a sign of missed exons
- parse BLAST reports with BioPerl
- perform pilot experiments
- examine statistical outliers
- how to lie with BLAST statistics
(Bedell et al. 2003)
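The slide recommends BioPerl for parsing reports; the same idea can be sketched in Python for the simplest case. The example assumes the 12-column tab-delimited layout produced by the tabular output option of blastall (-m 8, or -m 9 with comment lines), and the sample rows are invented. The function name `parse_tabular` and the field names are labels chosen for this illustration.

```python
import csv
import io

FIELDS = ["query", "subject", "identity", "length", "mismatches", "gap_opens",
          "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]
FLOAT_FIELDS = ("identity", "evalue", "bit_score")


def parse_tabular(handle, max_evalue=None):
    """Parse tab-delimited BLAST hits into dicts, optionally filtering by E-value."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and rows that are too short.
        if len(row) < len(FIELDS) or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        for key in FIELDS[2:]:
            hit[key] = float(hit[key]) if key in FLOAT_FIELDS else int(hit[key])
        if max_evalue is None or hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    sample = io.StringIO(
        "# invented example rows\n"
        "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t380\n"
        "q1\ts2\t35.00\t60\t39\t2\t5\t64\t100\t159\t2.5\t28.1\n"
    )
    for hit in parse_tabular(sample, max_evalue=0.01):
        print(hit["subject"], hit["evalue"], hit["bit_score"])
```

Once hits are plain records, the other tips on this slide (pilot experiments, examining outliers) become simple filtering and sorting steps.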
40Download from NCBI
Installing Stand-alone BLAST
The main advantage of stand-alone BLAST is being able to create your own BLAST databases.
- Executables
- ftp://ftp.ncbi.nlm.nih.gov/blast/executables/
- database in FASTA (un/formatted)
- ftp://ftp.ncbi.nlm.nih.gov/blast/db/
41formatdb
- formatdb - //to display arguments
- formatdb -i ecoli.nt -p F -o T //ecoli DNA
Query > ABCD database \n
The smallest query/database
42run blast
- add path C:\blast
- blastall - //to display options
- blastall -p blastp -i query -d database -o output
- blastall -p blastn -d ecoli.nt -i test.txt -o test.out
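When many searches have to be run, the command lines above can be assembled and launched from a script. The sketch below uses only the blastall options shown on this slide plus -e for the E-value threshold; the helper names `blastall_command` and `run_blastall` are hypothetical, and the script assumes the legacy `blastall` executable is on the PATH.

```python
import shutil
import subprocess


def blastall_command(program, database, query, output, evalue=None):
    """Build the argument list for a legacy blastall run."""
    command = ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]
    if evalue is not None:
        command += ["-e", str(evalue)]   # E-value threshold
    return command


def run_blastall(command):
    """Run the command, failing early if blastall is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    command = blastall_command("blastn", "ecoli.nt", "test.txt", "test.out")
    print(" ".join(command))
    # run_blastall(command)   # uncomment once blastall and the database exist
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.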
43Thank You!
44http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
45http://www.ncbi.nlm.nih.gov/blast/producttable.shtml