Introduction to Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Bioinformatics

Slides: 35

Provided by: ChiChe7

Transcript and Presenter's Notes

1
Introduction to Bioinformatics

BLAST

2
BLAST

Introduction
What is BLAST?
Query Sequence Formats
What does BLAST tell you?
Choices
Variety of BLAST
BLAST Programs: Which One to Use?
Commonly Used BLAST Programs
BLAST Databases: Which One to Search?
Understanding the Output
Database Search with BLAST
BLAST Steps: How It Works

Acknowledgement: This presentation includes
adaptations from NCBI's Introduction to Molecular
Biology Information Resources modules.
3
What is BLAST?

Basic Local Alignment Search Tool
The Google(TM) of bioinformatics
Query is a DNA or protein sequence, not a text
term
Character string comparison against all the
sequences in the target database
Rigorous statistics used to identify
statistically significant matches

4
Query Sequence Formats

Bare sequence
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQES
KPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERI
EKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDL
FIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSP
ESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP
  1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
 61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
Identifiers
accession, accession.version, or GIs
e.g., p01013, AAA68881.1, 129295, gi|129295
FASTA format

GenBank format
5
Query Sequence in FASTA Format

FASTA definition line ("def line") that begins
with a greater-than sign (>), followed by some text
that briefly describes the query sequence on a single line
Up to 80 nucleotide bases or amino acids per line
Blank lines not allowed in the middle
Example
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQES
KPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERI
EKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDL
FIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSP
ESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP
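As a rough illustration of the FASTA rules on this slide, the Python sketch below splits a single record into its definition line and sequence. The helper name parse_fasta_record is made up for this example and is not part of any BLAST tool; the record shown uses only the first 100 residues of the sequence above.

```python
def parse_fasta_record(text):
    """Split one FASTA record into (definition, sequence).

    Hypothetical helper for illustration only.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' definition line")
    definition = lines[0][1:].strip()
    body = lines[1:]
    # Blank lines are not allowed in the middle of the record.
    if any(not line.strip() for line in body):
        raise ValueError("blank line inside the FASTA record")
    # Join the sequence lines and drop any spaces inside them.
    sequence = "".join("".join(line.split()) for line in body)
    return definition, sequence


record = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQES
KPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERI
"""
definition, sequence = parse_fasta_record(record)
print(definition)
print(len(sequence), "residues")
```

A real pipeline would more likely rely on an established parser; the point here is only to show how the def line and the sequence lines relate.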
Additional information

6
What does BLAST tell you?

Putative identity and function of your query
sequence
Helps to direct experimental design to prove the
function
Find similar sequences in model organisms (e.g.,
yeast, C. elegans, mouse), which can be used to
further study the gene
Compare complete genomes against each other to
identify similarities and differences among
organisms

7
Variety of BLASTs
http://www.ncbi.nlm.nih.gov/BLAST/
8
BLAST Programs: Which One to Use?

Depends on
What type of query sequence you have (nucleotide
or protein)
What type of database you will search against
(nucleotide or protein)
BLAST program descriptions
brief list
BLAST program selection guide

9
Commonly Used BLAST Programs

Examples of BLAST programs
BLASTN
Nucleic acids against nucleic acids
BLASTP
Protein query against protein database
Usually better to use than nucleotide-nucleotide
BLAST
Since the genetic code is degenerate, blastn can
often give less specific results than blastp
But what if we don't have a protein query
sequence? What are our options?
BLASTX
Translated nucleic acids against protein database
One way to do a protein BLAST search if you have
a nucleotide query sequence
The BLAST program does the translating for you,
in all 6 reading frames
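To make "all 6 reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is not BLASTX's actual implementation; the function names and the short test sequence are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical 9-base query, used only to exercise the function.
frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be compared against the protein database, which is why a translated search costs more than a plain protein-protein search.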

10
BLAST Databases: Which One to Search?

What type of data do you want to search against?
For example
Characterized sequences?
Specialized sequences?
Complete genomes or chromosomes?
BLAST database descriptions are available in the
BLAST help document
BLAST program selection guide

11
Request ID (RID)

An RID is like a ticket number that allows you to
retrieve your search results and format them in
many different ways over the next 24 hours.
If you've saved RIDs from your recent searches,
you can enter the RIDs directly using the
Retrieve results with a Request ID page, which is
accessible from the bottom of the BLAST home page

12
Search Results: Understanding the Output

Reference to BLAST paper
Reminders about your specific query
RID
query sequence reminder (contains the information
from your FASTA def line)
what database you searched against
Graphical summary
shows where the hits aligned to your query
colors indicate score range
mouse over a colored bar to see info about that
hit
Text summary (GI numbers and Def lines)
GI links to complete record in Entrez
Score links to pairwise alignment between your
query sequence and the hit
Pairwise alignments
BLAST statistics for your search

13
Database Search w/ BLAST

Finding similar sequences is a primary use of
bioinformatics
BLAST

Enter sequence -> Choose DB -> Hit
Acknowledgement: Slides 15-19 are adapted from the
lecture notes of Professor Chau-Wen Tseng of the CS
Department at the University of Maryland, with
permission.
14
Database Search w/ BLAST

Versions of BLAST
BLASTN
Nucleic acids against nucleic acids
BLASTP
Protein query against protein database
BLASTX
Translated nucleic acids against protein database
TBLASTN
Protein query against translated nucleic acid
database
TBLASTX
Translated nucleic acids against translated
nucleic acids
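The list above boils down to a lookup on the query type and the database type. A minimal sketch in Python follows; the function name and the translate_both flag are hypothetical, and the returned strings are simply the program names from this slide.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a program name; hypothetical helper for illustration only.

    translate_both=True asks for a translated query against a translated
    nucleotide database, which corresponds to tblastx.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```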

15
Database Search w/ BLAST
16
Database Search w/ BLAST

BLAST graphic result

17
Database Search w/ BLAST

BLAST result
Matching sequences with bit score and E-value
Hyperlinks to database entry for each sequence
Example
Hyperlink to sequence                     Bit score   E-value
gi|17330420|gb|BH384278.1|BH384278 ...    153         3e-36
gi|17320126|gb|BH373984.1|BH373984 ...    140         9e-34
gi|17338337|gb|BH392196.1|BH392196 ...    112         8e-25
gi|20373967|gb|BH771010.1|BH771010 ...    105         1e-21
gi|17314411|gb|BH368367.1|BH368367 ...    104         2e-21
gi|17332712|gb|BH386570.1|BH386570 ...    64          3e-21

18
BLAST Statistical Evaluation

E Value
The number of different alignments with scores
equivalent to or better than the alignment score
that are expected to occur in a database search by
chance.
The lower the E value, the more significant the
score.

19
BLAST: How It Works

Find high scoring local alignments between query
sequence and target database
Assumption
True match alignments are very likely to contain
very high-scoring matches within them
Steps
Seeding
Searching
Extension
Evaluation

20
BLAST Steps

Seeding
For each word of length w in the query (w-mer),
generate a list of all possible words (neighbors)
with a score of at least threshold T (determined
by using the scoring matrix)
Default
w = 3 for protein
w = 11 for DNA
BLASTn, however, does not find neighbors. It
uses the 11-mers from the query only.

21
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words for PQG (word, score):
PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
This example uses BLOSUM62.
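The neighborhood list for the query word PQG can be reproduced with a short Python sketch. To keep it self-contained, only the three BLOSUM62 rows needed for P, Q, and G are embedded (copied from the standard matrix); the function name neighborhood is an assumption for this example, and a real implementation would handle every query word and the full matrix.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only, in the column order of AMINO_ACIDS.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs scoring at least `threshold` against `word`."""
    scored = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(word, candidate)
        )
        if score >= threshold:
            scored.append(("".join(candidate), score))
    # Highest score first, ties in alphabetical order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


neighbors = neighborhood("PQG", threshold=13)
for word, score in neighbors:
    print(word, score)
```

Enumerating all 8,000 possible 3-mers is affordable for w = 3; this brute-force loop is only meant to show what the threshold T selects.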
22
BLOSUM 62
23
BLAST Steps

Searching
Determine the locations of all common words
between the query and the database (word hits)
Identifies all word hits

24
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words for PQG (word, score):
PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
25
Implementation of BLAST Steps 1 & 2

(Using DNA as an example)
Given a query sequence q, a database sequence d,
and the word size w, the following steps are
needed
Find w-mers from q and create a table (such as
hash table) to store them and record the position
of their occurrences.
Scan d to find if a w-mer in d is also in q,
i.e., in the table.

26
Implementation of BLAST Steps 1 & 2
Example: Suppose q = ACACAT, d = GTACACGTT, and w = 3.
Build the table: The 3-mers of q are ACA, CAC, ACA, and CAT.
Their positions are stored in the table. Positions in a
sequence start at 1.
3-mer   Positions in q
ACA     1, 3
CAC     2
CAT     4
27
Implementation of BLAST Steps 1 & 2
Example (cont'd)
Match w-mers: Use the 3-mers in d one at a time to find
their position(s) in q using the hash table. Here is the result:
GTA   not found
TAC   not found
ACA   1, 3
CAC   2
ACG   not found
CGT   not found
GTT   not found
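The two steps of this example (build the table from q, then scan d) can be sketched in a few lines of Python. The function names are made up for illustration, and a dictionary stands in for the hash table; positions are 1-based to match the slides.

```python
from collections import defaultdict


def build_word_table(query, w):
    """Map each w-mer of the query to the list of its 1-based positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i + 1)
    return table


def find_word_hits(query, database_seq, w):
    """Return (word, position in query, position in database) for every word hit."""
    table = build_word_table(query, w)
    hits = []
    for j in range(len(database_seq) - w + 1):
        word = database_seq[j:j + w]
        for i in table.get(word, []):
            hits.append((word, i, j + 1))
    return hits


q, d, w = "ACACAT", "GTACACGTT", 3
print(dict(build_word_table(q, w)))
print(find_word_hits(q, d, w))
```

Building the table takes one pass over q and the scan takes one pass over d, with an expected constant-time lookup per w-mer, assuming a well-behaved hash function.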
28
Implementation of BLAST Steps 1 & 2

Questions
How many w-mers are there in q?
How many w-mers are there in d?
What is the time required to build the table?
What is the time required to find all the
positions?
An alternative is to build the table from the w-mers
in d and then scan q to find the positions of the
w-mers. Why is it not recommended?

29
BLAST Steps

Extension
Extend hits to find HSPs (high-scoring segment
pairs) that have scores higher than a threshold
Introduce gaps using dynamic programming
Problem of extension
Time-consuming to find the highest score
Solution (heuristic)
Extend until the score drops by X below the best score seen so far

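Example (Match = +1, Mismatch = -1, X = 5)
Query:     A B C D E F G H I J K L M N O P Q R S T
Subject:   A B C D E F Z Y I J K L M X W V U T A B
Score:     1 2 3 4 5 6 5 4 5 6 7 8 9 8 7 6 5 4
Drop-off:  0 0 0 0 0 0 1 2 1 0 0 0 0 1 2 3 4 5
Extension stops at position 18, where the drop-off reaches X.
The best score, 9, is reached at position 13.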
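The drop-off heuristic can be sketched as a one-directional, ungapped extension in Python. This is a simplified illustration, not BLAST's actual code: the function name is hypothetical, it extends to the right only, and it assumes extension stops as soon as the score has fallen X or more below the best score seen so far.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best_score, best_length): the highest running score and the
    number of aligned positions at which it was reached.
    """
    score = 0
    best_score = 0
    best_length = 0
    for length, (a, b) in enumerate(zip(query, subject), start=1):
        score += match if a == b else mismatch
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score >= x_drop:
            break  # the score has dropped X below the best: give up
    return best_score, best_length


best_score, best_length = extend_right(
    "ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB", x_drop=5
)
print(best_score, best_length)
```

The trade-off is that a larger X tolerates longer poor stretches before giving up, at the cost of more time spent on extensions that never recover.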
30
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words for PQG (word, score):
PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Below the threshold: PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
31
BLAST Steps

Evaluation
Select all HSPs from step 3 having a score of at
least S (for some parameter S). The value of S
should be large enough that unrelated sequences
shouldn't have HSPs with scores of at least S.
Evaluate the statistical significance of extended
hits
Report only those above the determined threshold

32
BLAST Statistical Evaluation

For local, ungapped alignments
m = size of query
n = size of database
E = expected number of HSPs with scores of at least S
p = probability of finding at least one HSP with a score of at least S
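The quantities above can be tied together with the standard Karlin-Altschul formulas for local, ungapped alignments, E = K m n e^(-lambda S) and p = 1 - e^(-E). The slide's own equations are not reproduced in this transcript, so these forms are an assumption based on the usual presentation of BLAST statistics. The Python sketch below uses them; the K and lambda values, the sequence sizes, and the function names are illustrative assumptions, not values taken from the slides.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E: expected number of HSPs with score >= `score` (E = K*m*n*exp(-lam*S))."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """p: probability of at least one such HSP (p = 1 - exp(-E))."""
    return -math.expm1(-e_value)


def bit_score(score, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2, so that E = m*n*2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


# Illustrative parameters only (assumed, not from the slides).
K, lam = 0.134, 0.318
m, n = 250, 1_000_000  # hypothetical query and database sizes

for raw_score in (40, 60, 80):
    e = expected_hsps(raw_score, m, n, K, lam)
    print(raw_score, round(bit_score(raw_score, K, lam), 1), e, p_value(e))
```

For small E the two numbers are nearly equal (p is approximately E), which is one reason BLAST reports E values rather than p values.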

A good tutorial is available at
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

33
Interpretations of Expected Value

Expected value ranges
E < 10^-100 -> very low; homologs or identical genes
E < 10^-3 -> moderate; may be related genes
E > 1 -> high; probably (or may be) unrelated
0.5 < E < 1 -> in the twilight zone; try a detailed search
If database search
Long list of gradually declining E values -> large gene family
Long regions of moderate similarity -> more
significant than short regions of high identity
Biological relevance
Still need to determine biological significance!!!

34
Other BLAST Algorithms