Title: Lab 3.2: Database Similarity Searching
1Lab 3.2 Database Similarity Searching
- The BLAST Buffet
- Stephanie Minnema
- University of Calgary
2Our Goal
- Take a tour of NCBI BLAST
- Review practicalities of submitting BLAST queries
- Understand BLAST output
- Do sequence comparisons using basic and advanced
BLAST methods
3BLAST is Good For You
4Database Similarity Searching
- The method you'll use most!
- Scans a database for alignments to a query sequence
- Can get tons of information
- functionality
- evolutionary history
- important residues
- Basis for many forms of bioinformatic analysis
5Most Common Tool
- BLAST: Basic Local Alignment Search Tool
- NCBI and others
- Based on fast local alignment methods
- Global alignment is computationally intensive
- Global alignment is not always biologically significant
- Breaks query down into words (k-tuples)
- Finds regions of similarity
- NCBI uses BLAST 2.0 (gapped BLAST)
- Balances speed and sensitivity
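The word step above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI code: the function name `query_words` and the example peptide are made up here, and real BLAST also expands each word into a neighbourhood of similar words scored with the substitution matrix. A word size of 3 is assumed because it is a common choice for protein searches.

```python
def query_words(query: str, k: int = 3) -> list[str]:
    """Return every overlapping word (k-tuple) of length k in the query."""
    if k <= 0:
        raise ValueError("word size must be positive")
    # A query of length n yields n - k + 1 overlapping words (none if n < k).
    return [query[i:i + k] for i in range(len(query) - k + 1)]


if __name__ == "__main__":
    # Hypothetical peptide fragment, for illustration only.
    for word in query_words("MESSAKM"):
        print(word)
```

Each word is then looked up in the database to seed a local alignment, which is why larger word sizes run faster but can miss weaker similarities.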
6www.ncbi.nlm.nih.gov/BLAST/
8Basic BLAST Flavors
- blastp: protein query vs. protein sequence database
- blastn: nucleotide query vs. nucleotide sequence database
- blastx: translated nucleotide query vs. protein sequence database
- tblastn: protein query vs. translated nucleotide sequence database
- tblastx: translated nucleotide query vs. translated nucleotide sequence database
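The five flavors above amount to a small lookup on query type and database type. The sketch below encodes that table in Python; the names `BLAST_FLAVORS` and `choose_flavor` and the `translate_both` flag are hypothetical helpers invented for this example, not part of any NCBI tool.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
BLAST_FLAVORS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_flavor(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Pick a basic BLAST program for the given query and database types."""
    if translate_both:
        # tblastx translates both a nucleotide query and a nucleotide database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_FLAVORS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"no flavor for {query_type} vs. {db_type}") from None
```

For example, `choose_flavor("nucleotide", "protein")` returns `"blastx"`.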
9What's Your Favorite Flavor?
- What program will best suit your query and desired output?
- Protein comparisons give the most meaningful results
- Sequence complexity: 20 aa vs. 4 nt
- Moderately similar nucleotide sequences could encode a highly similar protein sequence!
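The last point can be checked with a toy example. The sketch below translates two made-up DNA fragments with the standard genetic code and compares identity at both levels. The sequences are deliberately extreme and purely illustrative, and the helper names `translate` and `percent_identity` are assumptions of this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a: str, b: str) -> float:
    """Percent of positions that are identical in two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Two hypothetical fragments that use different codons for Leu-Ser-Arg.
    seq1, seq2 = "CTGAGCCGT", "TTATCAAGA"
    print(percent_identity(seq1, seq2))                        # nucleotide level
    print(percent_identity(translate(seq1), translate(seq2)))  # protein level
```

Because several codons encode the same amino acid, the protein-level comparison recovers similarity that the nucleotide-level comparison hides.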
10Takeout Message 1
- Compare sequences on the protein level unless you
know your query does not encode a protein product
11Using Basic BLAST Methods
- Example: MASH-1 protein sequence from mouse
- Can I find similar proteins in human?
15Submitting Your Query
- Input query sequence
- FASTA
- Raw
- Accession/ID
- Choose Database
- Many available; varies with program
- For the complete list, follow the link to http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases
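Of the input formats listed above, FASTA is the one most often handled in scripts: a header line starting with ">" followed by one or more lines of sequence. Below is a minimal parsing sketch. The function name `parse_fasta` and the example record are hypothetical, and established libraries handle more edge cases than this does.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: dict[str, str] = {}
    header = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = ">example_query hypothetical protein\nMKTAYIAK\nQRQISFVK\n"
    print(parse_fasta(example))
```

A "raw" query is simply the sequence lines without the header.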
16Finds Conserved Domains
Limit results with Entrez query
E-value cutoff
17Submitting Your Query
- CD Search
- Finds conserved domains in query sequence
- Compares to patterns and profiles of CDs
- Limit by Entrez query
- Restricts results to a single organism, etc.
- E-value cutoff
- Restricts results to ones falling below a defined E-value
- Default: 10
- Will revisit the concept of E-value
18Filtering
Matrix
Gap Penalties
19Submitting Your Query
- Low complexity filtering
- Low complexity sequence can lead to spurious alignments
- Filtering hides these regions
- On by default
- SEG (proteins) or DUST (nucleic acids)
- Should turn it off in some cases: what if your entire sequence gets filtered?
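To get a feel for what filtering does, here is a rough stand-in in Python that masks windows with low Shannon entropy. It is not the SEG or DUST algorithm. The window size, the threshold, and the names `shannon_entropy` and `mask_low_complexity` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residues in a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Replace every residue in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("QQQQQQQQQQQQ"))  # a homopolymer run is fully masked
    print(mask_low_complexity("ACDEFGHIKLMN"))  # a diverse stretch is left alone
```

A query made up mostly of repeats would come back almost entirely masked, which is the situation the last bullet warns about.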
20Submitting Your Query
- Choice of scoring matrix
- Different ones available
- BLOSUM matrices are based on observed frequencies of amino acid substitutions
- Each is tailored to a different level of sequence divergence and length
- BLOSUM62 is the default
- Shown to be best at detecting most protein similarities; you don't usually need to change it
- Follow the link for detailed information
21Submitting Your Query
- Gap Penalties
- Accounts for insertions and deletions in different sequences
- Scores are penalized for gaps to prevent aberrant alignments
- Opening penalty is high; extension penalty is lower
- Defaults may change depending on matrix choice
- Rarely need to change default value
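The opening/extension scheme is usually called an affine gap penalty. The sketch below assumes the convention in which a gap of length k costs open + k × extend, with 11 and 1 as example values of the kind commonly paired with BLOSUM62. Other tools fold the first residue into the opening cost, so treat both the convention and the numbers as assumptions. The name `gap_cost` is made up for this example.

```python
def gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine penalty for one gap: a fixed opening cost plus a per-residue cost."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    print(gap_cost(4))      # one gap of four residues
    print(4 * gap_cost(1))  # four separate single-residue gaps
```

One long gap is much cheaper than several short ones, which matches the biology: a single insertion or deletion event often spans several residues.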
23Click for more info
Take note
24Formatting Options
26Understanding Your Results
- Graphic representation of results
- Top of graph represents query sequence
- Underlying bars show where hits occur
- Colors represent alignment scores
- Grey areas represent non-similar regions surrounded by similar regions
- Mousing over a bar shows the accession and description of the hit
- Clicking on a bar takes you to its alignment with the query
28Understanding Your Results
- Bit scores
- Normalized raw score
- Raw score: sum of substitution scores and gap penalties
- Normalized on the basis of the scoring method
- Can compare searches scored using different matrices
- Higher is better, but bit scores alone don't adequately represent the significance of an alignment
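The normalization can be written as S' = (λS − ln K) / ln 2, where S is the raw score and λ and K are statistical parameters that depend on the scoring matrix and gap penalties. The Python sketch below applies that formula. The function name `bit_score` is made up, and the λ and K values in the example are illustrative assumptions of the order reported for gapped BLOSUM62 searches, not values taken from the lecture.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S to a bit score S' = (lam*S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Illustrative parameter values only (assumed, not from the slides).
    print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because λ and K absorb the scoring system, bit scores from searches run with different matrices can be compared directly.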
29Understanding Your Results
- E-values
- Indicator of alignment significance
- Expected number of alignments with the same or a better score that could have arisen by chance in a database of this size
- Lower is better
- E-values decrease exponentially as scores for an
alignment increase
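The exponential relationship can be made explicit with the standard formula E = m × n × 2^(−S'), where S' is the bit score, m the query length, and n the total database length. The sketch below is a simplification: real BLAST uses corrected "effective" lengths, and the function name `expect_value` and the example sizes are assumptions.

```python
def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bits`."""
    if query_len <= 0 or db_len <= 0:
        raise ValueError("lengths must be positive")
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical search: 250-residue query against 50 million database residues.
    for bits in (30, 40, 50):
        print(bits, expect_value(bits, 250, 50_000_000))
```

Every extra bit of score halves the E-value, and the same bit score is less significant in a larger database.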
30Examine Results
31Understanding Your Results
- Alignments
- Important to inspect them
- Take note of percent identity and similarity between query and aligned sequence
- Examine regions of similarity and gaps
- What if a sub-optimal alignment is the most
functionally significant one?
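Percent identity and gap counts are easy to recompute from an alignment when inspecting results by hand. The sketch below works on two aligned strings of equal length with "-" for gaps. The function name `alignment_stats` and the example alignment are made up, and percent similarity is left out because it needs a substitution matrix.

```python
def alignment_stats(aligned_query: str, aligned_subject: str,
                    gap_char: str = "-") -> dict[str, float]:
    """Count identities and gapped columns in a pairwise alignment."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    pairs = list(zip(aligned_query, aligned_subject))
    identities = sum(q == s and q != gap_char for q, s in pairs)
    gaps = sum(q == gap_char or s == gap_char for q, s in pairs)
    return {
        "length": len(pairs),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(pairs),
    }


if __name__ == "__main__":
    # Hypothetical alignment, for illustration only.
    print(alignment_stats("MKT-LLV", "MKSALLV"))
```

Here identity is computed over the full alignment length, gapped columns included; other conventions divide by the shorter sequence or by ungapped columns only.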
32Takeout Message 2
- Don't trust your computer blindly! Examine and think about your results
33Homology: Some Rules to Consider
- Similarity can be indicative of homology
- Generally, if two sequences are significantly similar over their entire length, they are likely homologous
- 50% similarity over a short sequence often occurs by chance
- Low complexity regions can be highly similar without being homologous
- Homologous sequences are not always highly similar
34Takeout Message 3
- Homology is like pregnancy: sequences are either homologous or they are not
35Basic BLAST Flavors for Special Occasions
- BLAST 2 Sequences (bl2seq)
- Aligns two sequences of your choice
- Can do different types of comparison, e.g. blastx
- Gives dot-plot-like output
- VecScreen
- Compares query with sequences of known cloning vectors
- Both very handy for sequencing!
36Basic BLAST Flavors for Special Occasions
- BLAST against genomes
- Many available
- BLAST parameters pre-optimized
- Handy for mapping query to genome
- Search for short exact matches
- BLAST parameters pre-optimized
- Great for checking probes and primers
37Basic BLAST Flavors for Special Occasions
- megaBLAST
- For aligning sequences that differ slightly due to sequencing errors, etc.
- Very efficient for long query sequences
- Uses big word (k-tuple) sizes to start search
- Very fast
- Accepts batch submissions of ESTs
- Can upload files of sequences as queries
- For more detailed info, see the megaBLAST pages
38Time to Sample the Buffet
- Try questions 1-4, found at the end of the lab notes accompanying this lecture.
- We'll discuss them in 15-20 minutes
39Advanced BLAST Methods
- The NCBI BLAST pages have several advanced BLAST methods available
- PSI-BLAST
- PHI-BLAST
- RPS-BLAST
- All are powerful methods based on protein
similarities
40More Complex Flavor: PSI-BLAST
- Position-Specific Iterated BLAST
- A cycling/iterative method
- Gives increased sensitivity for detecting distantly related proteins
- Can give insight into functional relationships
- Very refined statistical methods
- Fast: still based on BLAST methods
- Simple to use
41PSI-BLAST Principle
- First, a standard blastp is performed
- The highest-scoring hits are used to generate a multiple alignment
- A PSSM (position-specific scoring matrix) is generated from the multiple alignment
- Highly conserved residues get high scores
- Less conserved residues get lower scores
- Another similarity search is performed, this time using the new PSSM
- Steps 2-4 can be repeated until convergence
- Convergence: no new sequences appear after an iteration
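The PSSM step can be sketched as a log-odds table built column by column from the multiple alignment. The Python below is a deliberately simplified illustration, not the PSI-BLAST implementation. It assumes a uniform background frequency, a flat pseudocount, no sequence weighting, and gap-free columns. The names `build_pssm` and `score_segment` and the toy alignment are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build one log-odds score per residue for every alignment column."""
    if not alignment or any(len(s) != len(alignment[0]) for s in alignment):
        raise ValueError("alignment must contain equal-length sequences")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(s[col] for s in alignment if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm: list[dict[str, float]], segment: str) -> float:
    """Score a segment of the same length as the PSSM, position by position."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must match the PSSM length")
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["GHA", "GHS", "GHT"])  # toy alignment
    print(score_segment(pssm, "GHA"), score_segment(pssm, "AAA"))
```

Fully conserved columns (G and H in the toy alignment) earn the highest scores, variable columns earn modest ones, and unobserved residues score below zero, which is what lets the next search pass reward matches at the positions that matter.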
42Example: Aminoacyl-tRNA Synthetases
- 20 enzymes for 20 amino acids
- Each is very different
- Big, small, monomers, tetramers, strange globs
- All bind to their appropriate tRNAs with high specificity
- Each binds all tRNAs for its amino acid, but none of the others
- TrpRS and TyrRS share only 13% sequence identity
- BUT, overall structures of TrpRS and TyrRS are similar
- Structure-function relationship
43Same SCOP family based on catalytic domain
44TyrRS and TrpRS are Similar
- Sequence similarity expected, right?
- BUT blastp of E. coli TyrRS against bacterial sequences in SwissProt does not show similarity with TrpRS
- E-value cutoff of 10
45No TrpRS!?
46Try Using PSI-BLAST
- PSI-BLAST available from BLAST main page
- Query form just like for blastp
- BUT one extra formatting option must be used
- Format for PSI-BLAST: check it off!
- Second E-value cutoff (the threshold for inclusion) determines which alignments will be used for the PSSM build
- First search using TyrRS as query
- Db: SwissProt, limit: Bacteria[ORGN]
- Threshold for inclusion: 0.005
49After A Few Iterations
50TyrRS Similarity to TrpRS!
51Power of PSI-BLAST
- We knew TyrRS and TrpRS were similar
- Functionally and structurally
- blastp gave no indication
- PSI-BLAST was able to detect their weak sequence similarity
- A word of caution: be sure to inspect and think about the results included in the PSSM build.
- Include/exclude sequences on the basis of biological knowledge
52Query
Does the query really have a relationship with
the results?
Results
53Takeout Message 4
- Use your biological knowledge when doing PSI-BLAST to yield the most significant results
54Another Complex Flavour: PHI-BLAST
- Pattern-Hit Initiated BLAST
- PHI-BLAST principle
- Same method as PSI-BLAST
- Starts first search with the query sequence plus a pattern for a motif in the query
- PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence
- Highly specific
55Example: TyrRS
- TyrRS contains the aaRS class-I signature
- Want to find sequences containing that motif, and regional similarity to TyrRS
- First get the Prosite pattern for the class-I signature
- Prosite: db of protein families and domains
56http://ca.expasy.org/prosite
57P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
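A Prosite pattern like the class-I signature reads much like a regular expression: `x` is any residue, square brackets list allowed residues, and `(0,2)` is a repeat range. The Python sketch below converts the basic syntax to a regex. It assumes the pattern is written with its square brackets, as in standard Prosite notation, and it ignores anchors and other less common features. The function name `prosite_to_regex` and the test peptide are made up for illustration.

```python
import re

# Class-I aaRS signature from the slide, written with Prosite-style brackets.
CLASS_I_PATTERN = ("P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-"
                   "[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]")


def prosite_to_regex(pattern: str) -> str:
    """Translate basic Prosite syntax into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element.strip())
        if match is None:
            raise ValueError(f"empty or malformed element in {pattern!r}")
        core, low, high = match.groups()
        if core == "x":
            part = "."                        # any residue
        elif core.startswith("[") and core.endswith("]"):
            part = core                       # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            part = "[^" + core[1:-1] + "]"    # any residue except those listed
        elif len(core) == 1 and core.isalpha() and core.isupper():
            part = core                       # a single fixed residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if low is not None:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(part)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex(CLASS_I_PATTERN)
    # Made-up peptide containing a match, for illustration only.
    print(regex, bool(re.search(regex, "MKPGDALHLGHLAA")))
```

PHI-BLAST goes further than a plain pattern match: it also requires significant sequence similarity around each motif occurrence.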
58Insert Query Sequence
Insert PHI Pattern
59PHI-BLAST Results
- After first search, PHI-BLAST functions the same as PSI-BLAST
- Result page is the same
- Can iterate in the same way.
- Try it later if you like
60The Key to PHI- and PSI-BLAST
- Generating the multiple alignments to create PSSMs
- Refines scoring in searches
- Annotated collections of multiple alignments defining domains exist
- Conserved Domain Database (CDD)
- Contains 18039 alignments (10013 last year)
- Can search the CDD using CD search
- Uses RPS-BLAST
61RPS-BLAST
- Reverse Position Specific BLAST
- Opposite of PSI-BLAST
- CDD multiple alignments converted to PSSMs
- PSSMs are processed and turned into a searchable database
- Queries are searched against PSSMs using RPS-BLAST
- Output indicates conserved domains within the query sequence
62Example: CRADD protein
63Click on picture to see CDD multiple alignment
Click to see alignment with query
64Summary of Advanced BLAST Methods
- PSI-BLAST
- Input: SEQUENCE
- Database: SEQUENCES
- Algorithm: constructs a PSSM from an initial pass and uses this in the next pass
- Output: distantly related sequences
- More sensitive, less specific
- PHI-BLAST
- Input: PROFILE + SEQUENCE
- Database: SEQUENCES
- Algorithm: same as PSI-BLAST, except it starts with a profile
- Output: sequences containing the domain that are similar in the domain region
- Less sensitive, more specific
- RPS-BLAST
- Input: SEQUENCE
- Database: DOMAINS
- Output: domains found in the sequence
- Sensitive, specific
65Back for Another Helping
- Try the remaining questions in the notes!
66Enlightenment begins with a BLAST
Special Thanks to Sohrab Shah for the aaRS
example and further BLAST enlightenment