BCB 444/544

Slides: 24

Provided by: dobbslabG

Transcript and Presenter's Notes

1
BCB 444/544
Lecture 8: Substitution Matrices & BLAST (#8, Sep 12)

Thanks to Drena Dobbs (ISU) for many borrowed &
modified PPTs

2
Required Reading (before lecture)

Fri Sep 12 - for Lecture 8: Chp 4
Mon Sep 15 - for Lecture 9: Chp 4

3
Homework Assignment 2

Posted on the course webpage
Due today

4
PAM Matrix: Point Accepted Mutation

Relies on an "evolutionary model" based on observed
differences in closely related proteins (Dayhoff, 1978)
Model includes a defined rate for each type of
sequence change
Suffix number (n) reflects amount of "time" passed:
rate of mutation expected if n% of amino acids
had changed
e.g., PAM1 matrix estimates what rate of
substitution would be expected if 1% of the amino
acids had changed
PAM1 matrix is used as the basis for calculating
other matrices; this assumes that repeated mutations
follow the same pattern as those in the PAM1
matrix, and that multiple substitutions can occur at
the same site
PAM1 - for less divergent sequences (shorter time)
PAM250 - for more divergent sequences (longer time)
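
The extrapolation from PAM1 to higher-numbered matrices can be illustrated by repeated matrix multiplication: a PAMn mutation probability matrix is PAM1 multiplied by itself n times. The sketch below uses a hypothetical 3-letter alphabet with made-up probabilities (not Dayhoff's data), and the helper names `mat_mul` and `mat_pow` are illustrative only. Real PAM scoring matrices are additionally converted to log-odds scores, which this sketch does not do.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical "PAM1" for a toy 3-letter alphabet (assumed values).
# Row i, column j = probability that residue i is replaced by residue j
# after 1 PAM unit; each residue has a 1% chance of changing.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

# "PAM250" under the same model: apply the PAM1 step 250 times.
toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

Each row of the result still sums to 1, but the diagonal (probability of seeing the same residue) is much smaller than in the toy PAM1, reflecting the longer evolutionary "time".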

5
BLOSUM: BLOck SUbstitution Matrix

Based on aa substitutions observed in blocks of
conserved sequences within evolutionarily
divergent proteins (in BLOCKS database)
(Henikoff & Henikoff, 1992)
Doesn't rely on a specific evolutionary model
Suffix number (n) reflects expected similarity:
avg % aa identity in the MSA from which the matrix was
generated
e.g., BLOSUM62 is derived from sequence
alignments of proteins with no more than 62%
identity
Blocks database contains ungapped aligned
segments corresponding to the most highly
conserved regions of proteins
BLOSUM45 - for more divergent sequences
BLOSUM62 - for less divergent sequences

6
(No Transcript)
7
Scoring Matrices: What are the scores?

See Xiong Textbook:
Fig 3.5 PAM250
Fig 3.6 BLOSUM62
Usually only 1/2 of the matrix is displayed (it is
symmetric)
s(a,b) corresponds to the score of aligning character
a with character b

These are log-odds scores; each entry is
s(a,b) = log( freq(observed) / freq(expected) )
+ -> more likely than random
0 -> at random base rate
- -> less likely than random
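
As a small worked illustration of the log-odds idea on this slide, the sketch below computes a score from an observed and an expected pair frequency. The frequencies are invented for illustration, and the choice of log base 2 is an assumption (the slide does not specify a base or a scaling factor).

```python
import math


def log_odds(freq_observed, freq_expected):
    """Log-odds score: log2 of observed over expected pair frequency."""
    return math.log2(freq_observed / freq_expected)


# Hypothetical frequencies, for illustration only.
print(log_odds(0.02, 0.01))    # pair seen twice as often as chance: positive
print(log_odds(0.01, 0.01))    # pair seen as often as chance: zero
print(log_odds(0.005, 0.01))   # pair seen half as often as chance: negative
```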

8
Which is Better? PAM or BLOSUM

PAM matrices
derived from evolutionary model
often used in reconstructing phylogenetic trees -
but, not very good for highly divergent sequences
BLOSUM matrices
based on direct observations
more "realistic" - and outperform PAM matrices in
terms of accuracy in local alignment

9
How Should Gaps be Scored?

So far, we've used a
simple linear gap penalty function:
a gap of length k
incurs penalty k × δ (δ = penalty per gap position)
However, in biological sequences, gaps often
occur in clusters
AGKLAVRSTMIESTRVILTWRKW
AGKLAVRS------RVILTWRKW
More realistic? "Affine" gap penalty:
penalty for one long gap
is smaller than penalty
for many smaller gaps
that add up to same size

w(k) = γ + (k - 1) × δ
γ = gap opening penalty
δ = gap extension penalty
10
Affine Gap Penalty Functions

Affine Gap Penalties = "Differential" Gap
Penalties, used to reflect cost differences
between opening a gap and extending an existing
gap
Total gap penalty is a function of gap length:
W = γ + δ × (k - 1)
where γ = gap opening penalty
δ = gap extension penalty
k = length of gap
Sometimes a Constant Gap Penalty is used, but it
is usually less realistic than the Affine Gap
Penalty
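
To make the difference between gap models concrete, here is a minimal sketch comparing one long gap with several short gaps of the same total length. The function names are illustrative, and the parameter values (opening = 10, extension = 2) are borrowed from the worked example on the next slide rather than being recommended defaults.

```python
def linear_gap_penalty(k, per_position):
    """Linear model: every gap position costs the same."""
    return k * per_position


def affine_gap_penalty(k, gap_open, gap_extend):
    """Affine model: opening cost once, extension cost for each extra position."""
    if k <= 0:
        return 0
    return gap_open + gap_extend * (k - 1)


# One gap of length 6 versus six separate gaps of length 1.
one_long_gap = affine_gap_penalty(6, gap_open=10, gap_extend=2)
six_short_gaps = 6 * affine_gap_penalty(1, gap_open=10, gap_extend=2)

print(one_long_gap)    # 10 + 2 * 5 = 20
print(six_short_gaps)  # 6 * 10 = 60
```

Under the linear model both cases would cost the same (6 × per-position penalty), which is why the affine model is the one that favors clustered gaps.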

11
Calculating an Alignment Score using a
Substitution Matrix & an Affine Gap Penalty

Alignment score is the sum of all match/mismatch
scores (from substitution matrix) with an affine
penalty subtracted for each gap

a b c - - d
a c c e f d
Match scores (values from substitution matrix): 9 + 2 + 7 + 6 = 24
Gap penalty (opening + extension): 10 + 2 = 12
Alignment score: 24 - (10 + 2) = 12
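
The calculation on this slide can be reproduced with a short sketch. The substitution values (9, 2, 7, 6) and the gap parameters (opening 10, extension 2) are the ones shown on the slide; the `SCORES` table and the `alignment_score` function are hypothetical helpers, and the table only contains the four pairs needed for this example.

```python
# Toy substitution scores taken from the slide's example (not a real matrix).
SCORES = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}


def alignment_score(top, bottom, scores, gap_open, gap_extend):
    """Sum substitution scores and subtract an affine penalty for each gap.

    Simplification: adjacent gap columns are treated as one gap, even if
    the gap switches from one sequence to the other.
    """
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First column of a gap pays the opening cost, the rest pay extension.
            total -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            # The matrix is symmetric, so look the pair up in either order.
            total += scores[(x, y)] if (x, y) in scores else scores[(y, x)]
            in_gap = False
    return total


print(alignment_score("abc--d", "accefd", SCORES, gap_open=10, gap_extend=2))
# 9 + 2 + 7 + 6 - (10 + 2) = 12
```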
12
Parameter Selection in Sequence Alignment

Optimal alignment between a pair of sequences
depends critically
on the selection of substitution matrix & gap
penalty function
In using alignment software, it is important to
understand and, sometimes, to adjust these
parameters (default is NOT always best!)
How do we pick parameters that give the most
biologically meaningful alignments and alignment
scores?

13
How Do We Assess the Statistical Significance of
an Alignment?

Compare score of an alignment with distribution
of scores of alignments for many "randomized"
(shuffled) versions of the original sequence
If score is in the extreme margin, then it is unlikely
to be due to random chance
P-value = probability that an alignment score this high
arises by random chance (lower P is better)
P = 10^-5 to 10^-50: sequences have clear homology
P > 10^-1: no better than random

Check out PRSS (Probability of Random
Shuffles): http://www.ch.embnet.org/software/PRSS_form.html
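
The shuffling idea described on this slide can be sketched in a few lines. This is not the PRSS program itself: it is a minimal illustration that assumes a simple Smith-Waterman local score with match/mismatch values and a linear gap penalty, a fixed random seed, and an empirical P-value estimated as the fraction of shuffles scoring at least as high as the original. All function names and parameter values are illustrative.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_p_value(seq_a, seq_b, trials=200, seed=0):
    """Estimate how often a shuffled seq_b scores as well as the real one."""
    rng = random.Random(seed)
    observed = local_score(seq_a, seq_b)
    letters = list(seq_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, randomized order
        if local_score(seq_a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


score, p = shuffle_p_value("MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ")
print(score, p)
```

With only a few hundred shuffles the smallest P-value this sketch can report is about 1/(trials + 1), so very small values such as 10^-50 require fitting a distribution to the shuffled scores rather than counting.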
14
Chp 4- Database Similarity Searching

SECTION II SEQUENCE ALIGNMENT
Xiong Chp 4
Database Similarity Searching
Unique Requirements of Database Searching
Heuristic Database Searching
Basic Local Alignment Search Tool (BLAST)
FASTA
Comparison of FASTA and BLAST
Database Searching with Smith-Waterman Method

15
Database searching
[Diagram: Query Sequence + Sequence database -> Sequence comparison algorithm -> Target sequences ranked by score]
16
Why search a database?

Given a newly discovered gene,
Does it occur in other species?
Is its function known in another species?
Given a newly sequenced genome, which regions
align with genomes of other organisms?
Identification of potential genes
Identification of other functional parts of
chromosomes
Find members of a multigene family

17
Recall: There are 3 Basic Types of Alignment
Algorithms

SECTION II SEQUENCE ALIGNMENT
1) Dot Matrix (Xiong Chp 3)
2) Dynamic Programming (Xiong Chp 3)
3) Word or k-tuple methods (BLAST & FASTA) (Xiong Chp 4)
Wikipedia:
"Word methods, also known as k-tuple methods, are
heuristic methods that are not guaranteed to find
an optimal alignment solution, but are
significantly more efficient than dynamic
programming."

18
Exhaustive vs Heuristic Methods

Exhaustive - tests every possible solution
guaranteed to give best answer
(identifies optimal solution)
can be very time/space intensive!
e.g., Dynamic Programming
(as in Smith-Waterman algorithm)
Heuristic - does NOT test every possibility
no guarantee that answer is best
(but, often can identify optimal
solution)
sacrifices accuracy (potentially) for speed
uses "rules of thumb" or "shortcuts"
e.g., BLAST & FASTA

19
Why do we Need Fast Search Algorithms?

Your query is 200 amino acids long (N)
You are searching a non-redundant database, which
currently contains > 10^6 proteins (K)
If proteins in database have avg length 200 aa
(M), then you
must fill in 200 × 200 × 10^6 = 4 × 10^10 DP
entries!!
4 × 10^10 operations just to fill in the DP
matrix!
DP for pairwise alignment is O(NM)
Searching in a database is O(NMK)
Need faster algorithms for searching in large
databases!
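
The arithmetic on this slide is easy to check with a short sketch. The function name `dp_cells` is illustrative; the inputs are the slide's own numbers (N = 200, M = 200, K = 10^6).

```python
def dp_cells(query_len, avg_target_len, num_targets):
    """Number of DP matrix entries for an exhaustive database search: N * M * K."""
    return query_len * avg_target_len * num_targets


cells = dp_cells(query_len=200, avg_target_len=200, num_targets=10**6)
print(cells)            # 40000000000
print("%.0e" % cells)   # 4e+10
```

Doubling either the query length or the database size doubles the work, which is what the O(NMK) bound expresses.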

20
BLAST - Statistical Significance?

1. E-value: E = m × n × P
m = total number of residues in database
n = number of residues in query sequence
P = probability that an HSP is the result of random
chance
lower E-value: less likely to result from random
chance, thus higher significance
2. Bit Score: S'
normalized score, to account for differences in
size of database (m) & sequence length (n) - more
later
3. Low Complexity Masking
remove repeats that confound scoring

21
BLAST algorithms can generate both "global" and
"local" alignments
Global alignment
Local alignment
22
BLAST - a Family of Programs: Different BLAST
"flavors"

BLASTP - protein sequence query against protein DB
BLASTN - DNA/RNA seq query against DNA DB (GenBank)
BLASTX - 6-frame translated DNA seq query against protein DB
TBLASTN - protein query against 6-frame DNA translation
TBLASTX - 6-frame DNA query to 6-frame DNA translation
PSI-BLAST - protein "profile" query against protein DB
PHI-BLAST - protein pattern against protein DB
Newest: MEGA-BLAST - optimized for highly similar sequences

Which tool should you use?
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
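
As a study aid, the five basic flavors listed on this slide can be written as a lookup from (query type, database type) to program name. The labels "protein", "dna", and "translated_dna" and the helper `pick_blast` are made up for this sketch; this is not an NCBI API, and it leaves out PSI-BLAST, PHI-BLAST, and MEGA-BLAST.

```python
# (query kind, database kind) -> BLAST flavor, as listed on the slide.
BLAST_FLAVORS = {
    ("protein", "protein"): "BLASTP",
    ("dna", "dna"): "BLASTN",
    ("translated_dna", "protein"): "BLASTX",
    ("protein", "translated_dna"): "TBLASTN",
    ("translated_dna", "translated_dna"): "TBLASTX",
}


def pick_blast(query_kind, db_kind):
    """Return the basic BLAST flavor for a query/database combination."""
    try:
        return BLAST_FLAVORS[(query_kind, db_kind)]
    except KeyError:
        raise ValueError("no basic BLAST flavor for %s vs %s" % (query_kind, db_kind))


print(pick_blast("protein", "protein"))         # BLASTP
print(pick_blast("translated_dna", "protein"))  # BLASTX
```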
23
BLAST: Basic Local Alignment Search Tool

STEPS
1. Create list of every possible "word" (e.g., 3-11
letters) from query sequence
2. Search database to identify sequences that
contain matching words
3. Score match of word with sequence, using a
substitution matrix
4. Extend match (seed) in both directions, while
calculating alignment score at each step
5. Continue extension until score drops below a
threshold (due to mismatches)
High Scoring Segment Pair (HSP) - contiguous
aligned segment pair (no gaps)
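
The steps above can be sketched as a toy seed-and-extend search. This is a heavily simplified illustration, not the BLAST implementation: it assumes exact word matches only, a plain match/mismatch score (+2 / -1) in place of a substitution matrix, a single subject sequence instead of a database, and a stopping rule that ends extension once the running score falls a fixed amount (`drop`) below the best score seen. All names and parameter values are illustrative.

```python
def build_word_index(query, w):
    """Step 1: map every length-w word in the query to its start positions."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def pair_score(a, b, match=2, mismatch=-1):
    """Stand-in for a substitution matrix lookup."""
    return match if a == b else mismatch


def extend_seed(query, subject, qi, si, w, drop=4):
    """Steps 3-5: score the seed, then extend without gaps in both directions."""
    best = sum(pair_score(query[qi + k], subject[si + k]) for k in range(w))
    # Extend to the right of the seed.
    run, best_end = best, qi + w
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject):
        run += pair_score(query[i], subject[j])
        if run > best:
            best, best_end = run, i + 1
        elif best - run >= drop:
            break
        i, j = i + 1, j + 1
    # Extend to the left of the seed, starting from the best right extension.
    run, best_start = best, qi
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0:
        run += pair_score(query[i], subject[j])
        if run > best:
            best, best_start = run, i
        elif best - run >= drop:
            break
        i, j = i - 1, j - 1
    subject_start = si - (qi - best_start)
    return best, best_start, subject_start, best_end - best_start


def best_hsp(query, subject, w=3):
    """Step 2 plus extension: return (score, query_start, subject_start, length)."""
    index = build_word_index(query, w)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], []):
            hsps.add(extend_seed(query, subject, qi, si, w))
    return max(hsps) if hsps else None


print(best_hsp("MKTAYIAKQR", "GGGMKTAYIAKQRGGG"))
```

In this example the query occurs intact inside the subject, so the reported HSP covers all 10 query residues starting at subject position 3 (0-based).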