Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
2BLAST
- Introduction
- What is BLAST?
- Query Sequence Formats
- What does BLAST tell you?
- Choices
- Variety of BLAST
- BLAST Programs Which One to Use?
- Commonly Used BLAST programs
- BLAST Databases Which One to Search?
- Understanding the Output
- Database Search with BLAST
- Blast Steps How It Works
Acknowledgement: The presentation includes adaptations from NCBI's Introduction to Molecular Biology Information Resources Modules.
3What is BLAST?
- Basic Local Alignment Search Tool
- The Google™ of bioinformatics
- Query is a DNA or protein sequence, not a text term
- Character string comparison against all the sequences in the target database
- Rigorous statistics used to identify statistically significant matches
4Query Sequence Formats
- Bare sequence
- QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
  KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
  VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
  FLFLIKHNPTNTIVYFGRYWSP
-   1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
   61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
  121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
  181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
- Identifiers
- accession, accession.version or gi's
- e.g., p01013, AAA68881.1, 129295, gi|129295
- FASTA format
- GenBank format
5Query Sequence in FASTA Format
- FASTA definition line ("def line") that begins with a ">", followed by some text that briefly describes the query sequence on a single line
- Up to 80 nucleotide bases or amino acids per line
- Blank lines not allowed in the middle
- Example
- >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
  QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
  KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
  VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
  FLFLIKHNPTNTIVYFGRYWSP
- Additional information
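The FASTA rules on this slide can be illustrated with a small reader. This is a minimal sketch, not the parser BLAST itself uses; the function name `parse_fasta` is hypothetical, and the choice to skip stray blank lines rather than reject them is an assumption made for convenience.

```python
def parse_fasta(text):
    """Return a list of (def_line, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: silently skip blank lines
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `parse_fasta` on the example above would yield one record whose first element is the def line without the leading ">" and whose second element is the sequence joined into a single string.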
6What does BLAST tell you?
- Putative identity and function of your query sequence
- Helps to direct experimental design to prove the function
- Find similar sequences in model organisms (e.g., yeast, C. elegans, mouse), which can be used to further study the gene
- Compare complete genomes against each other to identify similarities and differences among organisms
7Variety of BLASTs
http://www.ncbi.nlm.nih.gov/BLAST/
8BLAST Programs Which One to Use?
- Depends on
- What type of query sequence you have (nucleotide or protein)
- What type of database you will search against (nucleotide or protein)
- BLAST program descriptions
- brief list
- BLAST program selection guide
9Commonly Used BLAST Programs
- Examples of BLAST programs
- BLASTN
- Nucleic acids against nucleic acids
- BLASTP
- Protein query against protein database
- Usually better to use than nucleotide-nucleotide BLAST
- Since the genetic code is degenerate, blastn can often give less specific results than blastp
- ...but what if we don't have a protein query sequence? What are our options?
- BLASTX
- Translated nucleic acids against protein database
- One way to do a protein BLAST search if you have a nucleotide query sequence
- The BLAST program does the translating for you, in all 6 reading frames
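To make "all 6 reading frames" concrete, here is a minimal sketch of six-frame translation using the standard genetic code. It is only an illustration of the idea, not BLASTX's actual implementation; the function names and the choice to emit "X" for codons containing ambiguous bases are assumptions.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X' (assumption)."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to protein strings ('*' marks a stop)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Three frames come from the query as given (starting at bases 1, 2, and 3) and three from its reverse complement, which is why a nucleotide query produces six candidate protein sequences.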
10BLAST Databases Which One to Search?
- What type of data do you want to search against? For example:
- Characterized sequences?
- Specialized sequences?
- Complete genomes or chromosomes?
- BLAST database descriptions are available in the
- BLAST help document
- BLAST program selection guide
11Request ID RID
- An RID is like a ticket number that allows you to retrieve your search results and format them in many different ways over the next 24 hours.
- If you've saved RIDs from your recent searches, you can enter the RIDs directly using the "Retrieve results with a Request ID" page, which is accessible from the bottom of the BLAST home page.
12Search Results Understanding the Output
- Reference to BLAST paper
- Reminders about your specific query
- RID
- query sequence reminder (contains the information from your FASTA def line)
- what database you searched against
- Graphical summary
- shows where the hits aligned to your query
- colors indicate score range
- mouse over a colored bar to see info about that hit
- Text summary (GI numbers and Def lines)
- GI links to complete record in Entrez
- Score links to pairwise alignment between your query sequence and the hit
- Pairwise alignments
- BLAST statistics for your search
13Database Search w/ BLAST
- Finding similar sequences is a primary use of bioinformatics
- BLAST
Enter sequence → Choose DB → Hit
Acknowledgement: Slides 15-19 are adapted from lecture notes of Professor Chau-Wen Tseng of the CS Department at the University of Maryland, with permission.
14Database Search w/ BLAST
- Versions of BLAST
- BLASTN
- Nucleic acids against nucleic acids
- BLASTP
- Protein query against protein database
- BLASTX
- Translated nucleic acids against protein database
- TBLASTN
- Protein query against translated nucleic acid database
- TBLASTX
- Translated nucleic acids against translated nucleic acids
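The choice among these programs depends only on the query type, the database type, and whether the comparison is done at the nucleotide or protein level. The lookup below is a hypothetical helper sketched for illustration; `choose_blast_program` and its argument names are not part of any BLAST distribution.

```python
# (query type, database type, level at which sequences are compared) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", "nucleotide"): "blastn",
    ("protein", "protein", "protein"): "blastp",
    ("nucleotide", "protein", "protein"): "blastx",      # query is translated
    ("protein", "nucleotide", "protein"): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", "protein"): "tblastx",  # both are translated
}


def choose_blast_program(query_type, db_type, compare_as):
    """Return the program name for a combination, or raise if none applies."""
    key = (query_type, db_type, compare_as)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]
```

For example, `choose_blast_program("nucleotide", "protein", "protein")` selects the translated-query search described above.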
15Database Search w/ BLAST
16Database Search w/ BLAST
17Database Search w/ BLAST
- BLAST result
- Matching sequences w/ bit-score and E-value
- Hyperlinks to database entry for sequence
- Example
Sequence (hyperlink)                     Bit score   E-value
gi|17330420|gb|BH384278.1|BH384278 ...   153         3e-36
gi|17320126|gb|BH373984.1|BH373984 ...   140         9e-34
gi|17338337|gb|BH392196.1|BH392196 ...   112         8e-25
gi|20373967|gb|BH771010.1|BH771010 ...   105         1e-21
gi|17314411|gb|BH368367.1|BH368367 ...   104         2e-21
gi|17332712|gb|BH386570.1|BH386570 ...   64          3e-21
18BLAST Statistical Evaluation
- E Value
- The number of different alignments with scores equivalent to or better than the alignment score that are expected to occur in a database search by chance.
- The lower the E value, the more significant the score.
19BLAST How It Works
- Find high-scoring local alignments between query sequence and target database
- Assumption
- True match alignments are very likely to contain within them very high-scoring matches
- Steps
- Seeding
- Searching
- Extension
- Evaluation
20BLAST Steps
- Seeding
- For each word of length w in the query (w-mer), generate a list of all possible words (neighbors) with a score of at least threshold T (determined by using the scoring matrix)
- Default
- w = 3 for protein
- w = 11 for DNA
- BLASTn, however, does not find neighbors. It uses the 11-mers from the query only.
21Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (score ≥ T): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
This example uses BLOSUM 62.
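The neighborhood list for the query word PQG can be reproduced by brute force: score every possible 3-letter word against PQG and keep those scoring at least T. The sketch below is illustrative only. To stay short it hard-codes just the three BLOSUM62 rows needed for P, Q, and G (transcribed by hand, so treat them as an assumption to verify against a full matrix), and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues of the example word only (columns follow
# AMINO_ACIDS). Assumption: copied by hand; check against a full matrix.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def pair_score(query_residue, other):
    """Substitution score of a query residue (P, Q, or G here) against any residue."""
    return BLOSUM62_ROWS[query_residue][AMINO_ACIDS.index(other)]


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs with score >= threshold, best first."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

Enumerating all 20^3 = 8000 candidate words is cheap for w = 3; a production implementation would prune the search rather than score every word.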
22BLOSUM 62
23BLAST Steps
- Searching
- Determine the locations of all common words between the query and the database (word hits)
- Identifies all word hits
24Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (score ≥ T): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
25Implementation of BLAST Steps 1 and 2
- (Using DNA as an example)
- Given a query sequence q, a database sequence d, and the word size w, the following steps are needed
- Find w-mers from q and create a table (such as a hash table) to store them and record the positions of their occurrences.
- Scan d to find if a w-mer in d is also in q, i.e., in the table.
26Implementation of BLAST Steps 1 and 2
Example: Suppose q = ACACAT, d = GTACACGTT, and w = 3.
Build the table: The 3-mers of q are ACA, CAC, ACA, and CAT. Their positions are stored in the table. Positions in a sequence start at 1.
3-mer   Positions in q
ACA     1, 3
CAC     2
CAT     4
27Implementation of BLAST Steps 1 and 2
Example (cont'd)
Match w-mers: Use the 3-mers in d one at a time to find their position(s) in q using the hash table. Here is the result:
GTA: not found
TAC: not found
ACA: 1, 3
CAC: 2
ACG: not found
CGT: not found
GTT: not found
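The two steps in this example (build a table of w-mers from q, then scan d) can be sketched in a few lines. This is a minimal illustration using a Python dictionary as the hash table; the function names are hypothetical, and positions are 1-based to match the slides.

```python
from collections import defaultdict


def build_word_table(q, w):
    """Map each w-mer of q to the list of 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(q) - w + 1):
        table[q[i:i + w]].append(i + 1)
    return dict(table)


def find_word_hits(d, table, w):
    """Scan d once; return (word, position in q, position in d) for every hit."""
    hits = []
    for j in range(len(d) - w + 1):
        word = d[j:j + w]
        for i in table.get(word, []):  # empty list when the word is not in q
            hits.append((word, i, j + 1))
    return hits


table = build_word_table("ACACAT", 3)
hits = find_word_hits("GTACACGTT", table, 3)
```

With q = ACACAT and d = GTACACGTT, the table holds three distinct 3-mers, and only the words ACA and CAC from d produce hits.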
28Implementation of BLAST Steps 1 and 2
- Questions
- How many w-mers are there in q?
- How many w-mers are there in d?
- What is the time required to build the table?
- What is the time required to find all the positions?
- An alternative is building the table using w-mers in d and then scanning q to find positions of the w-mers. Why is it not recommended?
29BLAST Steps
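- Extension
- Extend hits to find HSPs (high-scoring segment pairs) that have scores higher than a threshold
- Introduce gaps using dynamic programming
- Problem of extension
- Time-consuming to find the highest score
- Solution (heuristic)
- Extend until the score drops by X below the best score seen so far
Example (match = +1, mismatch = -1, X = 5):
Query     A B C D E F G H I J K L M N O P Q R S T
Subject   A B C D E F Z Y I J K L M X W V U T A B
Score     1 2 3 4 5 6 5 4 5 6 7 8 9 8 7 6 5 4
Drop-off  0 0 0 0 0 0 1 2 1 0 0 0 0 1 2 3 4 5
The drop-off reaches X = 5 at position 18, so extension stops there and the segment is trimmed back to the best-scoring position (position 13, score 9).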
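The X-drop heuristic can be sketched as a single left-to-right pass that tracks the running score and the best score so far. This is an ungapped, one-directional illustration only; the function name is hypothetical, and stopping when the drop-off is greater than or equal to X (rather than strictly greater) is an assumption.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped extension that stops once the score falls x_drop below the best.

    Returns (best_score, best_length, scores, dropoffs), where best_length is
    the number of aligned positions kept in the reported segment.
    """
    score = best = best_length = 0
    scores, dropoffs = [], []
    for position, (a, b) in enumerate(zip(query, subject), start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_length = score, position
        scores.append(score)
        dropoffs.append(best - score)
        if best - score >= x_drop:  # assumption: stop at a drop of exactly X
            break
    return best, best_length, scores, dropoffs


best, length, scores, dropoffs = extend_right(
    "ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB", x_drop=5
)
```

The function reports the best score and where it occurred, so the caller can trim the segment back to that position after the extension stops. A larger X explores further past a run of mismatches at the cost of more work.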
30Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (score ≥ T): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
         LA  L    TP G R    W   P  D     ER     A
Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
31BLAST Steps
- Evaluation
- Select all HSPs from step 3 having a score of at least S (for some parameter S). The value of S should be large enough that unrelated sequences shouldn't have HSPs with scores ≥ S.
- Evaluate the statistical significance of extended hits
- Report only those above the determined threshold
32BLAST Statistical Evaluation
- For local, ungapped alignments
- m = size of query
- n = size of database
- E = expected # of HSPs with scores at least S
- p = prob. of finding at least one HSP with score ≥ S
- good tutorial at
- http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
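The formulas themselves are not in the slide text. For local, ungapped alignments the standard Karlin-Altschul result is E = K·m·n·e^(−λS) and p = 1 − e^(−E), where K and λ depend on the scoring system. The sketch below simply evaluates those two expressions; the function names are hypothetical, and the K and λ values in the usage lines are illustrative assumptions, not parameters taken from the slides.

```python
import math


def expected_hsps(score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S): expected number of chance HSPs."""
    return k * m * n * math.exp(-lam * score)


def prob_at_least_one(e_value):
    """p = 1 - exp(-E), computed with expm1 so tiny E values keep precision."""
    return -math.expm1(-e_value)


# Illustrative inputs only (assumed): a 250-residue query, a database of
# one million residues, and K and lambda picked for demonstration.
e = expected_hsps(score=50, m=250, n=1_000_000, k=0.134, lam=0.318)
p = prob_at_least_one(e)
```

Two properties are worth noting: E grows linearly with both m and n, so the same score is less significant in a larger database, and for small E the probability p is approximately equal to E.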
33Interpretations of Expected Value
- Expected value ranges
- E < 10^-100 → very low; homologs or identical genes
- E < 10^-3 → moderate; may be related genes
- E > 1 → high; probably (or may be) unrelated
- 0.5 < E < 1 → in the twilight zone; try a detailed search
- In a database search
- Long list of gradually declining E values → large gene family
- Long regions of moderate similarity → more significant than short regions of high identity
- Biological relevance
- Still need to determine biological significance!
34Other BLAST Algorithms
- BLAST2
- Gapped-BLAST
- PSI-BLAST
- BLAT
- Mega BLAST
- etc.