Title: Sequence Alignment
1Sequence Alignment
BCB 444/544 - Introduction to Bioinformatics
Slides adapted from D. Fernandez-Baca ISU
2Assignments: Reading Exercises (before lecture)
- Fri Sept 8
- CH Chp 2 pp 34-59
- Also, DQs MMs
- Re: Sequence alignment, Dynamic programming
- http://en.wikipedia.org/wiki/Sequence_alignment
- (read all except sections on Multiple sequence alignment, Structural alignment, Phylogenetic analysis)
- Mon Sept 11
- Re: Predicting Protein Function
- Read: Friedberg I, Harder T, Godzik A. (2006) JAFA: a protein function annotation meta-server. Nucleic Acids Res. 34 (Web Server issue): W379-81. PMID: 16845030
- http://nar.oxfordjournals.org/cgi/content/full/34/suppl_2/W379
- Visit: http://jafa.burnham.org/
3Why compare sequences?
- To determine whether two (or more) genes or proteins are evolutionarily related to each other
- To identify structurally or functionally similar regions within proteins
- Other?
4Sequence Comparison Methods
- Dot Matrix Analysis
- Dynamic Programming
- Word or k-tuple methods (BLAST and FASTA)
5Dot matrices
[Figure: dot matrix; sequence letters along the axis: c g g a c a c a c g]
6Dot matrix comparison
7Interpretation
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to diagonal) indicate inversions
- Reverse diagonals crossing diagonals (Xs) indicate palindromes
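A dot matrix is simple enough to build by hand. The following is a minimal sketch, assuming a dot is placed wherever a window of `window` letters is identical in both sequences; the function name `dot_matrix` and the `window` parameter are illustrative, not taken from any particular tool.

```python
def dot_matrix(s, t, window=1):
    """Return one string per row: '*' where s[i:i+window] == t[j:j+window], '.' elsewhere."""
    rows = []
    for i in range(len(s) - window + 1):
        row = "".join(
            "*" if s[i:i + window] == t[j:j + window] else "."
            for j in range(len(t) - window + 1)
        )
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Comparing a sequence with itself always gives an unbroken main diagonal.
    for line in dot_matrix("ACGACG", "ACGACG"):
        print(line)
```

Larger window values filter out isolated chance matches, so only longer diagonal runs remain.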
8Dynamic Programming
9Pair-wise sequence alignments
Idea: Display one sequence above another, with spaces inserted in both to reveal similarity
A: C A T - T C A - C
B: C - T C G C A G C
10Two types of alignment
S: CTGTCGCTGCACG
T: TGCCGTG
Global alignment:
CTGTCG-CTGCACG
-TGC-CG-TG----
Local alignment:
CTGTCGCTGCACG--
-------TGC-CGTG
Alternative global alignment:
CTGTCG-CTGCACG
-TGCCG--TG----
Is this a better alignment?
11Global alignment Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: α   Mismatch penalty: β   Space penalty: γ
score(A) = αw - βx - γy
w = number of matches, x = number of mismatches, y = number of spaces
12Global alignment Scoring
Reward for matches: 10   Mismatch penalty: 2   Space penalty: 5
 C   T   G   T   C   G   -   C   T   G   C
 -   T   G   C   -   C   G   -   T   G   -
-5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total: 11
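The column-by-column total can be checked mechanically. A minimal sketch, assuming the two rows of the alignment are given as equal-length strings with '-' standing for a space; the function name `score_alignment` is hypothetical, and the default values are the ones quoted on this slide.

```python
def score_alignment(row_s, row_t, match=10, mismatch=2, space=5):
    """Score an alignment as match*w - mismatch*x - space*y."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    w = x = y = 0  # counts of matches, mismatches, spaces
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            y += 1
        elif a == b:
            w += 1
        else:
            x += 1
    return match * w - mismatch * x - space * y


if __name__ == "__main__":
    # First 11 columns of the global alignment of CTGTCGCTGCACG and TGCCGTG.
    print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 4 matches, 2 mismatches, 5 spaces
```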
13Optimum Alignment
- The score of an alignment is a measure of its quality
- Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score
14Alignment algorithms
- Global: Needleman-Wunsch
- Local: Smith-Waterman
- NW and SW use dynamic programming
- Variations
- Gap penalty functions
- Scoring matrices
15Global Alignment Algorithm
16Theorem. C(i,j) satisfies the following relationships:
Initial conditions
Recurrence relation: For 1 ≤ i ≤ n, 1 ≤ j ≤ m
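A hedged reconstruction of these relationships in the standard Needleman-Wunsch form, assuming C(i,j) is the optimum score for aligning S_1..S_i with T_1..T_j and w(a,b) is the score of a single column (with "-" denoting a space). The symbols w, S_i and T_j are assumed notation:

```latex
\begin{align*}
C(0,0) &= 0 \\
C(i,0) &= C(i-1,0) + w(S_i,\text{-}) && 1 \le i \le n \\
C(0,j) &= C(0,j-1) + w(\text{-},T_j) && 1 \le j \le m \\
C(i,j) &= \max
  \begin{cases}
    C(i-1,j-1) + w(S_i,T_j) \\
    C(i-1,j) + w(S_i,\text{-}) \\
    C(i,j-1) + w(\text{-},T_j)
  \end{cases}
  && 1 \le i \le n,\ 1 \le j \le m
\end{align*}
```

The three branches of the maximum correspond to lining up S_i with T_j, S_i with a space, and T_j with a space.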
17Justification
18Example
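Case 1: Line up Si with Tj
S: C A T T C A C    (last two letters: i-1, i)
T: C - T T C A G    (last two letters: j-1, j)
Case 2: Line up Si with space
S: C A T T C A - C    (last two letters: i-1, i)
T: C - T T C A G -    (last letter: j)
Case 3: Line up Tj with space
S: C A T T C A C -    (last letter: i)
T: C - T T C A - G    (last two letters: j-1, j)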
19Computation Procedure
[Figure: dynamic programming table with cells C(0,0), C(i,j), and C(n,m) marked]
20
      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C
A
T
T
C
A
C
Entries computed so far: 10, 5
10 for match, -2 for mismatch, -5 for space
21
[Figure: completed table with columns λ C T C G C A G C and rows λ C A T T C A C]
Traceback can yield both optimum alignments
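The table fill and traceback can be sketched in a few lines. This is a minimal illustration of Needleman-Wunsch, assuming the scores from the example (10 for match, -2 for mismatch, -5 for space); the function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made here, so only one of the optimum alignments is returned.

```python
def needleman_wunsch(s, t, match=10, mismatch=-2, space=-5):
    """Return (table C, aligned s, aligned t) for a global alignment."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leftmost column: s aligned with spaces
        C[i][0] = C[i - 1][0] + space
    for j in range(1, m + 1):          # top row: t aligned with spaces
        C[0][j] = C[0][j - 1] + space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + sub,   # s[i] with t[j]
                          C[i - 1][j] + space,     # s[i] with space
                          C[i][j - 1] + space)     # t[j] with space
    # Traceback from C[n][m] to C[0][0].
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and C[i][j] == C[i - 1][j] + space:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return C, "".join(reversed(row_s)), "".join(reversed(row_t))


if __name__ == "__main__":
    C, a, b = needleman_wunsch("CATTCAC", "CTCGCAGC")
    print(C[-1][-1])
    print(a)
    print(b)
```

Collecting every branch that attains the maximum, rather than only the first, would enumerate all optimum alignments.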
22Local Alignment Motivation
- Ignoring stretches of non-coding DNA
- Non-coding (or "non-functional") regions may be more likely to contain mutations than coding regions.
- Local alignment between two gene sequences is likely to be between two exons
- Locating protein domains
- Proteins of different kinds and of different species often exhibit local similarities
- Local similarities may indicate functional subunits
23Local alignment Example
S: g g t c t g a g
T: a a a c g a
Match: 2   Mismatch and space: -1
Best local alignment (region c t g a over c - g a):
g g t c t g a g
a a a c - g a -
Score: 5
24Local Alignment Algorithm
C[i, j] = score of optimally aligning a suffix of s[1..i] with a suffix of t[1..j].
- Initialize top row and leftmost column to zero.
25
[Figure: local alignment table with columns λ C T C G C A G C and rows λ C A T T C A C]
1 for a match, -1 for a mismatch, -5 for a space
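A minimal sketch of the local (Smith-Waterman) variant, assuming the usual formulation in which every cell is floored at zero and the traceback starts at the highest-scoring cell. The function name is hypothetical, and the defaults are the scores from the earlier local alignment example (match 2, mismatch and space -1).

```python
def smith_waterman(s, t, match=2, mismatch=-1, space=-1):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]   # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(0,
                          C[i - 1][j - 1] + sub,
                          C[i - 1][j] + space,
                          C[i][j - 1] + space)
            if C[i][j] > best:
                best, best_pos = C[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    i, j = best_pos
    row_s, row_t = [], []
    while i > 0 and j > 0 and C[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif C[i][j] == C[i - 1][j] + space:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


if __name__ == "__main__":
    print(smith_waterman("ggtctgag", "aaacga"))
```

The only differences from the global algorithm are the zero initialization, the extra 0 in the maximum, and where the traceback starts and stops.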
26Some Results
- Most pairwise sequence alignment problems can be solved in O(mn) time.
- Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88].
- Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86].
27Affine Gap Penalty Functions
- Gap penalty = h + gk
- where
- k = length of gap
- h = gap opening penalty
- g = gap continuation penalty
Can also be solved in O(nm) time using dynamic programming
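One common way to reach O(nm) with affine gaps is to keep three tables, one per kind of last column (often attributed to Gotoh). The sketch below computes only the optimum global score under that assumption. The function name, the table names M, X, Y, and the default values of h and g are illustrative choices, not taken from the slides.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=10, mismatch=-2, h=5, g=1):
    """Optimum global score when a gap of length k costs h + g*k."""
    n, m = len(s), len(t)
    # M: last column pairs s[i] with t[j]; X: s[i] with a space; Y: a space with t[j].
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + g * i)          # one gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(h + g * j)          # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - h - g,   # open a new gap
                          Y[i - 1][j] - h - g,   # open a new gap
                          X[i - 1][j] - g)       # continue the current gap
            Y[i][j] = max(M[i][j - 1] - h - g,
                          X[i][j - 1] - h - g,
                          Y[i][j - 1] - g)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # Two matches plus a single gap of length 2.
    print(affine_global_score("AAAA", "AA"))
```

With these defaults, one gap of length 2 costs 7 while two separate gaps of length 1 cost 12, so the scheme favors fewer, longer gaps.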
28Database Searches
29BLAST
- Basic Local Alignment Search Tool
- Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
- Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, Nucleic Acids Res. 25 (1997)
- Main ideas
- Increase search speed by finding fewer, but better, hot spots during initial screening phase
- Uses longer word sizes
- Integrate scoring matrix into first phase
- Compare with FASTA, which requires exact matches
30BLAST
31Hits
- For each word, evaluate score of match (exact or not) according to BLOSUM62 substitution matrix
- e.g., for PQG exact match with PQG
- score is 7 + 5 + 6 = 18
- There are 20^w possible w-length words, but considering only those with score > t greatly reduces number of matches
- e.g., there are 20^3 = 8000 possible matches to PQG,
- but only 50 achieve score > t = 13
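The enumeration of candidate words can be sketched as follows. The full BLOSUM62 matrix is not reproduced here, so the example uses a hypothetical toy scoring function (`toy_score`) purely to exercise the code; only the three diagonal BLOSUM62 entries needed for the PQG self-score are included. With the real matrix plugged in as `score`, the same loop would produce the neighborhood described above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Diagonal BLOSUM62 entries for P, Q and G only (enough for the PQG self-score).
BLOSUM62_DIAGONAL = {"P": 7, "Q": 5, "G": 6}


def neighborhood(word, score, t, alphabet=AMINO_ACIDS):
    """All words of len(word) whose summed per-position score against word exceeds t."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > t:
            hits.append("".join(candidate))
    return hits


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix: 6 for identity, -1 otherwise."""
    return 6 if a == b else -1


if __name__ == "__main__":
    print(sum(BLOSUM62_DIAGONAL[c] for c in "PQG"))   # self-score of PQG
    print(len(AMINO_ACIDS) ** 3)                      # number of 3-letter words
    print(len(neighborhood("PQG", toy_score, 10)))    # toy neighborhood size
```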
32BLAST Hits
- A hit is a w-length word in the database that aligns with a word from the query sequence with score > t
- BLAST looks for hits instead of exact matches
- Allows word size to be kept high for speed, without sacrificing sensitivity
- Typically, w = 3-5 for amino acids,
- w = 11-12 for DNA
- t is the most critical parameter
- Higher t: fewer background hits (faster)
- Lower t: greater ability to detect more distant relationships (at cost of increased noise)
33Extending a hit
- After locating a hit, BLAST attempts to extend the hit in both directions, until the score drops more than X below the maximum score yet attained.
- Extension step typically accounts for > 90% of execution time.
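The stopping rule can be illustrated with a rightward, ungapped extension. This is a minimal sketch under assumed conventions: `q_end` and `s_end` index the last position of the hit in the query and the subject, `score` is any per-residue scoring function, and the names are hypothetical. Extension to the left would mirror it.

```python
def extend_right(query, subject, q_end, s_end, start_score, x_drop, score):
    """Extend an ungapped hit to the right.

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best score, number of positions added).
    """
    best = current = start_score
    best_len = 0
    k = 1
    while q_end + k < len(query) and s_end + k < len(subject):
        current += score(query[q_end + k], subject[s_end + k])
        if current > best:
            best, best_len = current, k
        elif best - current > x_drop:
            break
        k += 1
    return best, best_len


def unit_score(a, b):
    """Hypothetical toy scoring: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


if __name__ == "__main__":
    # The hit is the first three letters (score 3); four more matches follow.
    print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 2, 2, 3, 2, unit_score))
```

Only the best-scoring extension is kept, so the trailing mismatches examined before the cutoff are not part of the reported segment.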
34Extending a hit
35Improvement: 2-hit method
- Do extensions only when there are two hits on the same diagonal within some distance A of each other (e.g., A = 40)
- Reduces sensitivity (ability to detect distantly related sequences)
- To compensate, use lower t value (e.g., 11 rather than 13)
- Because we only extend when there are two nearby hits, many fewer regions are extended
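The diagonal test is easy to express in code. A minimal sketch, assuming each hit is a (query position, subject position) pair so that hits on the same diagonal share the same difference. The function name is hypothetical, and requiring the two hits not to overlap (distance at least w) is an assumption made here.

```python
from collections import defaultdict


def two_hit_seeds(hits, A=40, w=3):
    """Return the second hit of every pair of same-diagonal hits at most A apart."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)
    seeds = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        # Compare each hit with its nearest predecessor on the same diagonal.
        for previous, current in zip(q_positions, q_positions[1:]):
            if w <= current - previous <= A:
                seeds.append((current, current + diagonal))
    return sorted(seeds)


if __name__ == "__main__":
    hits = [(0, 5), (10, 15), (100, 105), (3, 50)]
    print(two_hit_seeds(hits))  # only (10, 15) has a close partner on its diagonal
```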
36BLAST Terminology
- Segment pair: equal-length substrings of sequences S1 and S2
- Locally maximal segment pair: segment pair whose alignment score cannot be improved by extending or shortening it
- Maximum segment pair (MSP): segment pair with maximum score over all segment pairs in the sequences S1 and S2
- High-scoring segment pair (HSP): a segment pair with score higher than some cutoff score, s.
- w is the length parameter; t is the threshold parameter
37Gapped BLAST
- Allows local alignments with indels (similar to FASTA)
- Local alignments from different diagonals are merged: one local alignment, followed by some indels, followed by a second local alignment, etc.
- Equivalent to a path through the dynamic programming matrix composed of alternating diagonal sections and paths connecting them
38Gapped BLAST
- Original BLAST implicitly handled gaps by finding several distinct HSPs and calculating a statistical assessment of the combined result
- Two or more HSPs each below the cutoff value might in combination rise to statistical significance
- Gapped BLAST: extend hits by allowing gaps when hits are promising (exceed sg)
- Advantage: We can afford to miss some HSPs as long as at least one is found
- Use dynamic programming, starting from center of each high-scoring region if s > sg
- sg is chosen such that gapped alignment is triggered in about 1/50 of the sequences compared
39PSI-BLAST
- Position-Specific Iterated BLAST
- Generates a multiple alignment from statistically significant alignments produced by BLAST
- Produces a Position-Specific Scoring Matrix (PSSM)
- Can search the database using the PSSM
- Match sequences to profile
- Generate new profiles
- Repeat (iteration)
- Search gradually extends to increasingly divergent sequences
40Flavors of BLAST
- BLASTP - protein query against protein DB
- BLASTN - DNA/RNA query against DNA (GenBank)
- BLASTX - 6-frame translated DNA query against protein DB
- TBLASTN - protein query against 6-frame DNA translation
- TBLASTX - 6-frame DNA query to 6-frame DNA translation
- PSI-BLAST - protein "profile" query against protein DB
- PHI-BLAST - protein pattern against protein DB
41Questions?
- What are substitution matrices?
- 2 major types: PAM and BLOSUM
- PAM: Point Accepted Mutation - relies on "evolutionary model" based on observed differences in closely related proteins
- Model includes defined rate for each type of sequence change
- Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
- PAM1 - for less divergent sequences (shorter time)
- PAM250 - for more divergent sequences (longer time)
- BLOSUM: BLOck SUbstitution Matrix - based on aa substitutions observed in evolutionarily divergent proteins
- Doesn't rely on a specific evolutionary model
- Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
- BLOSUM45 - for more divergent sequences
- BLOSUM60 - for less divergent sequences
42Questions?
- What does 6-frame translation mean?
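One way to make the question concrete: a DNA sequence can be read in three reading frames on the given strand (starting at offsets 0, 1, 2) and three more on its reverse complement. The sketch below assumes the standard genetic code, built from the usual 64-letter table in T, C, A, G order; the function names are hypothetical and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translations of frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


if __name__ == "__main__":
    for frame in six_frames("ATGGCCTAA"):
        print(frame)
```

Programs such as BLASTX and TBLASTN compare against all six translations because the correct strand and frame of a DNA sequence are usually not known in advance.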