Title: BCB 444/544
1 BCB 444/544
Lecture 8: Substitution Matrices & BLAST (#8, Sep 12)
- Thanks to Drena Dobbs (ISU) for many borrowed & modified PPTs
2 Required Reading (before lecture)
- Fri Sep 12 - for Lecture 8
- Chp 4
- Mon Sep 15 - for Lecture 9
- Chp 4
3 Homework Assignment 2
- Posted on the course webpage
- Due today
4 PAM Matrix: Point Accepted Mutation
- Relies on an "evolutionary model" based on observed differences in closely related proteins [Dayhoff78]
- Model includes a defined rate for each type of sequence change
- Suffix number (n) reflects amount of "time" passed: the rate of mutation expected if n% of amino acids had changed
- e.g., PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed
- PAM1 matrix is used as the basis for calculating other matrices: assumes that repeated mutations would follow the same pattern as those in the PAM1 matrix, and that multiple substitutions can occur at the same site
- PAM1 - for less divergent sequences (shorter time)
- PAM250 - for more divergent sequences (longer time)
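A minimal sketch of how a higher-numbered PAM matrix can be extrapolated from PAM1: the PAM1 mutation probability matrix is multiplied by itself n times. The 3-letter alphabet and the probabilities below are invented for illustration (the real matrix is 20 x 20 and its values come from Dayhoff's data), and the published PAM250 scoring matrix is a further log-odds transformation of the probabilities shown here.

```python
# Toy PAM extrapolation: PAM-n mutation probabilities = (PAM1 probabilities)^n.
# The alphabet and the numbers are hypothetical, chosen only so rows sum to 1.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def pam_n(pam1, n):
    """Apply the PAM1 mutation step n times (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

# Row i, column j: probability that residue i is replaced by residue j
# after 1 PAM unit. Diagonal 0.99 means 1% of residues change.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = pam_n(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the chance that a residue is unchanged has dropped far below 0.99, which is why the higher-numbered matrix suits more divergent sequences.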
5 BLOSUM: BLOck SUbstitution Matrix
- Based on aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins (in BLOCKS database) [Henikoff & Henikoff92]
- Doesn't rely on a specific evolutionary model
- Suffix number (n) reflects expected similarity: avg % aa identity in the MSA from which the matrix was generated
- e.g., BLOSUM62 is derived from sequence alignments of proteins with no more than 62% identity
- Blocks database contains ungapped aligned segments corresponding to the most highly conserved regions of proteins
- BLOSUM45 - for more divergent sequences
- BLOSUM62 - for less divergent sequences
7 Scoring Matrices: What are the scores?
- See Xiong Textbook
- Fig 3.5 PAM250
- Fig 3.6 BLOSUM62
- Usually only 1/2 of matrix is displayed (it is symmetric)
- s(a,b) corresponds to score of aligning character a with character b
- These are log-odds scores
- each entry = log(freq(observed) / freq(expected))
- positive score: more likely than random
- 0: at random base rate
- negative score: less likely than random
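A small sketch of the log-odds idea: the score is the logarithm of how often a pair is observed relative to how often it would be expected by chance. The base-2 logarithm and the frequencies used here are assumptions for illustration; published matrices also apply a scaling factor and rounding that are not shown.

```python
import math

def log_odds(freq_observed, freq_expected):
    """Log-odds score (base 2 is an assumption; the slide only says 'log')."""
    return math.log2(freq_observed / freq_expected)

# Hypothetical frequencies for three residue pairs, each with the same
# expected (chance) frequency of 0.01.
print(log_odds(0.02, 0.01))   # pair seen twice as often as chance: positive
print(log_odds(0.01, 0.01))   # pair seen exactly as often as chance: zero
print(log_odds(0.005, 0.01))  # pair seen half as often as chance: negative
```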
8 Which is Better? PAM or BLOSUM
- PAM matrices
- derived from evolutionary model
- often used in reconstructing phylogenetic trees, but not very good for highly divergent sequences
- BLOSUM matrices
- based on direct observations
- more "realistic", and outperform PAM matrices in terms of accuracy in local alignment
9 How Should Gaps be Scored?
- So far, we've used a simple linear gap penalty function
- Gap of length k incurs penalty = k x δ (δ = penalty per gap position)
- However, in biological sequences, gaps often occur in clusters
- AGKLAVRSTMIESTRVILTWRKW
- AGKLAVRS------RVILTWRKW
- More realistic? "Affine" gap penalty
- penalty for one long gap is smaller than penalty for many smaller gaps that add up to same size
- w(k) = (k - 1) x δ + γ
- γ = gap opening penalty, δ = gap extension penalty
10 Affine Gap Penalty Functions
- Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap
- Total Gap Penalty is a function of gap length:
- W = γ + δ x (k - 1)
- where γ = gap opening penalty
- δ = gap extension penalty
- k = length of gap
- Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
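A short sketch of the affine penalty, total penalty = gap opening + gap extension x (k - 1), showing why one long gap costs less than several short gaps of the same total length. The function name and the penalty values (opening 10, extension 2) are illustrative assumptions, not defaults of any particular program.

```python
def affine_gap_penalty(k, opening, extension):
    """Penalty for one gap of length k: opening + extension * (k - 1)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return opening + extension * (k - 1)

OPENING, EXTENSION = 10, 2  # illustrative values

one_long_gap = affine_gap_penalty(6, OPENING, EXTENSION)
six_short_gaps = 6 * affine_gap_penalty(1, OPENING, EXTENSION)

print(one_long_gap)    # one gap of length 6
print(six_short_gaps)  # six separate gaps of length 1
```

With these values the single 6-residue gap is penalized 20, while six 1-residue gaps are penalized 60, so clustered gaps are favored.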
11 Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty
- Alignment score is sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap
- Example alignment:
  a b c - - d
  a c c e f d
- Match score (values from substitution matrix): 9 + 2 + 7 + 6 = 24
- Gap opening + extension: 10 + 2 = 12
- Alignment score: 24 - 12 = 12
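The worked example on this slide can be reproduced with a few lines of code. This is a sketch under stated assumptions: the substitution values (9, 2, 7, 6) and the gap penalties (opening 10, extension 2) are the ones on the slide, while the function name and the dictionary layout are hypothetical.

```python
def alignment_score(row1, row2, subst, opening, extension):
    """Sum substitution scores and subtract an affine penalty per gap run."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row (1 or 2) holds the gap run in progress
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            current = 1 if x == "-" else 2
            # First column of a gap run pays the opening penalty,
            # each later column of the same run pays the extension penalty.
            score -= extension if current == gap_row else opening
            gap_row = current
        else:
            score += subst[(x, y)] if (x, y) in subst else subst[(y, x)]
            gap_row = None
    return score

# Substitution values taken from the slide's example.
SLIDE_SCORES = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}

print(alignment_score("abc--d", "accefd", SLIDE_SCORES, opening=10, extension=2))
```

The result is 24 - (10 + 2) = 12, matching the slide.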
12 Parameter Selection in Sequence Alignment
- Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function
- In using alignment software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)
- How do we pick parameters that give the most biologically meaningful alignments and alignment scores?
13 How Do We Assess the Statistical Significance of an Alignment?
- Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
- If score is in extreme margin, then unlikely due to random chance
- P-value = probability that original alignment is due to random chance (lower P is better)
- P = 10^-5 to 10^-50: sequences have clear homology
- P > 10^-1: no better than random
Check out PRSS (Probability of Random Shuffles): http://www.ch.embnet.org/software/PRSS_form.html
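A minimal sketch of the shuffling idea: score the real pair, score the query against many shuffled copies of the target, and report the fraction of shuffles that score at least as well. This is not how PRSS itself is implemented; the simple match/mismatch/linear-gap scoring, the number of shuffles, and the function names are all assumptions made to keep the example short.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_p_value(query, target, n_shuffles=200, seed=0):
    """Empirical P-value: share of shuffled targets scoring >= the real one."""
    rng = random.Random(seed)
    observed = local_score(query, target)
    letters = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_score(query, "".join(letters)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

score, p = shuffle_p_value("AGKLAVRSTMIESTRVILTWRKW", "AGKLAVRSTMIESTRVILTWRKW")
print(score, p)
```

With only 200 shuffles the smallest P-value this can report is 1/201, so very small P-values such as 10^-50 require fitting a distribution to the shuffled scores rather than counting.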
14 Chp 4 - Database Similarity Searching
- SECTION II: SEQUENCE ALIGNMENT
- Xiong: Chp 4
- Database Similarity Searching
- Unique Requirements of Database Searching
- Heuristic Database Searching
- Basic Local Alignment Search Tool (BLAST)
- FASTA
- Comparison of FASTA and BLAST
- Database Searching with Smith-Waterman Method
15 Database searching
[Figure: Query Sequence and Sequence database are fed to a Sequence comparison algorithm, which returns Target sequences ranked by score]
16 Why search a database?
- Given a newly discovered gene,
- Does it occur in other species?
- Is its function known in another species?
- Given a newly sequenced genome, which regions align with genomes of other organisms?
- Identification of potential genes
- Identification of other functional parts of chromosomes
- Find members of a multigene family
17 Recall: There are 3 Basic Types of Alignment Algorithms
- SECTION II: SEQUENCE ALIGNMENT
- Xiong Chp 3: 1) Dot Matrix
- 2) Dynamic Programming
- Xiong Chp 4: 3) Word or k-tuple methods (BLAST & FASTA)
- Wikipedia: Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming.
18 Exhaustive vs Heuristic Methods
- Exhaustive - tests every possible solution
- guaranteed to give best answer (identifies optimal solution)
- can be very time/space intensive!
- e.g., Dynamic Programming (as in Smith-Waterman algorithm)
- Heuristic - does NOT test every possibility
- no guarantee that answer is best (but, often can identify optimal solution)
- sacrifices accuracy (potentially) for speed
- uses "rules of thumb" or "shortcuts"
- e.g., BLAST & FASTA
19 Why do we Need Fast Search Algorithms?
- Your query is 200 amino acids long (N)
- You are searching a non-redundant database, which currently contains >10^6 proteins (K)
- If proteins in database have avg length 200 aa (M), then:
- Must fill in 200 x 200 x 10^6 = 4 x 10^10 DP entries!!
- 4 x 10^10 operations just to fill in the DP matrix!
- DP for pairwise alignment is O(NM)
- Searching in a database is O(NMK)
- Need faster algorithms for searching in large databases!
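The arithmetic on this slide, written out as a sketch. N, M and K are the slide's values; the rate of 10^9 cell updates per second is a hypothetical figure used only to turn the count into a rough time.

```python
def dp_cells(query_len, avg_target_len, n_targets):
    """DP matrix entries for one query against every database sequence: N * M * K."""
    return query_len * avg_target_len * n_targets

N, M, K = 200, 200, 10**6          # values from the slide
cells = dp_cells(N, M, K)

CELLS_PER_SECOND = 10**9           # hypothetical throughput, not a measurement
print(cells)                       # total DP entries for one query
print(cells / CELLS_PER_SECOND)    # rough seconds per query at that rate
```

The count grows linearly with database size K, so every doubling of the database doubles the cost of an exhaustive search.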
20 BLAST - Statistical Significance?
- 1. E-value: E = m x n x P
- m = total number of residues in database
- n = number of residues in query sequence
- P = probability that an HSP is result of random chance
- lower E-value, less likely to result from random chance, thus higher significance
- 2. Bit Score: S'
- normalized score, to account for differences in size of database (m) & sequence length (n) - more later
- 3. Low Complexity Masking
- remove repeats that confound scoring
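A sketch of the E-value formula from this slide, E = m x n x P. The database size, query length and P used below are hypothetical numbers chosen for illustration, not output from a real BLAST run.

```python
def e_value(db_residues, query_residues, p_hsp):
    """E = m * n * P, using the definitions on the slide."""
    return db_residues * query_residues * p_hsp

# Hypothetical inputs: 10^6 proteins of about 200 residues each,
# a 200-residue query, and an HSP with P = 1e-12.
m = 200 * 10**6
n = 200
print(e_value(m, n, 1e-12))   # same hit against the full database
print(e_value(m // 10, n, 1e-12))  # same hit against a database 10x smaller
```

The same P gives a larger E-value in a larger database, which is why E-values from searches against different databases are not directly comparable.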
21 BLAST algorithms can generate both "global" and "local" alignments
[Figure: Global alignment vs. Local alignment]
22 BLAST - a Family of Programs: Different BLAST "flavors"
- BLASTP - protein sequence query against protein DB
- BLASTN - DNA/RNA seq query against DNA DB (GenBank)
- BLASTX - 6-frame translated DNA seq query against protein DB
- TBLASTN - protein query against 6-frame DNA translation
- TBLASTX - 6-frame DNA query to 6-frame DNA translation
- PSI-BLAST - protein "profile" query against protein DB
- PHI-BLAST - protein pattern against protein DB
- Newest: MEGA-BLAST - optimized for highly similar sequences
Which tool should you use? http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
23 BLAST: Basic Local Alignment Search Tool
- STEPS:
- Create list of every possible "word" (e.g., 3-11 letters) from query sequence
- Search database to identify sequences that contain matching words
- Score match of word with sequence, using a substitution matrix
- Extend match (seed) in both directions, while calculating alignment score at each step
- Continue extension until score drops below a threshold (due to mismatches)
- High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)
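The steps above can be sketched as a toy seed-and-extend search against a single target sequence. This is a simplified illustration, not the BLAST implementation: it seeds on exact word matches only (no substitution-matrix scoring of neighboring words), uses +1/-1 match/mismatch scores, and stops extending once the running score falls a fixed amount below the best score seen, which is one common way to apply the threshold. All function names and parameter values are hypothetical.

```python
def words(seq, k):
    """All overlapping words of length k with their start positions."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def extend(query, target, qi, ti, k, match=1, mismatch=-1, drop=2):
    """Ungapped extension of a seed at query[qi:qi+k] / target[ti:ti+k]."""
    best = k * match
    left = right = 0  # extension lengths that produced the best score

    # Extend to the right until the score falls `drop` below the best.
    score, q, t, steps = best, qi + k, ti + k, 0
    while q < len(query) and t < len(target) and best - score < drop:
        score += match if query[q] == target[t] else mismatch
        steps += 1
        if score > best:
            best, right = score, steps
        q, t = q + 1, t + 1

    # Extend to the left, starting again from the best score so far.
    score, q, t, steps = best, qi - 1, ti - 1, 0
    while q >= 0 and t >= 0 and best - score < drop:
        score += match if query[q] == target[t] else mismatch
        steps += 1
        if score > best:
            best, left = score, steps
        q, t = q - 1, t - 1

    return best, qi - left, ti - left, k + left + right

def best_hsp(query, target, k=3):
    """Return (score, query_start, target_start, length) of the best HSP."""
    index = {}
    for word, pos in words(target, k):
        index.setdefault(word, []).append(pos)
    hsps = [extend(query, target, qi, ti, k)
            for word, qi in words(query, k)
            for ti in index.get(word, [])]
    return max(hsps) if hsps else None

score, q_start, t_start, length = best_hsp("AGKLAVRS", "TTGKLAVTT", k=3)
print(score, "AGKLAVRS"[q_start:q_start + length],
      "TTGKLAVTT"[t_start:t_start + length])
```

Indexing the words once and extending only around hits is what lets this approach skip most of the DP matrix, at the cost of possibly missing alignments that contain no exact word match.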