Title: E. Sequence Alignment
1E. Sequence Alignment
- Part I Overview and foundation A. Central
Dogma, B. Accessing Databases, C. Data Mining,
D. Tree of life. - Part II Gene Sequence E. Sequence Alignment, F.
Blasting NCBI Tutorial, G. SNP Video, H.
Methods for Manipulating DNA and RNA Ch 8. MBoC
pp 469-513, I. Hidden Markov Model Handouts,
J. Gene Finding Handout, K. Microarray Analysis
Ch 4 and Ch 5 DGPB - Part III Gene Regulation
- L. Cell Chemistry and Biosynthesis Ch 2 MBoC,
M. Proteins Ch 3 MBoC, N. DNA and Chromosomes
Ch 4 MBoC, O. From DNA to Protein Ch 6. MBoC,
P. Control of Gene Expression Ch 7 MBoC - Part IV Proteonomics
- Q. Introduction Ch. 6 DGPB, R. Subcellular
Localization Handout, S. Structural Prediction
Handout, T. Protein Interaction Handout, U.
Mass Spec Proteomic Informatics Handout - Part V Whole Genome Perspective
- V. Protein and Gene Networking Handout Some
parts of Unit Three DGPB - Part V Applications and Conclusion
- W. Cancer Ch. 23 MBoC, X. Pathogens Ch. 25
MBoC, Y. Case Studies Some Parts of Unit Four
DGPB, Z. Future Directions Handout -
2E. Sequence Alignment
- E. Sequence Alignment
- E.1. Sequence Similarity
- E.2. Dynamic Programming
- E.3. Blasting and (Psi-Blast)
- Reading
- The NCBI BLAST education pages (all three
sections). - Altschul, et al on PSI-BLAST and on improving
PSI-BLAST (using ROC curves)
3E. Sequence Alignment Purpose of Sequence
Alignment and Blasting
- What is an alignment, and why might it be
significant? - An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor - A good alignment is an indication of homology
- Alignments are NOT exact matches. We will need a
method to find good alignments in a database...
4E. Sequence Alignment Phylogenetic Tree
5E.1 Sequence Similarity Similarity vs. Homology
Paralogs vs. Orthologs
- Homology is an evolutionary relationship that
either exists or does not. It cannot be partial. - An ortholog is a homolog with shared function.
- A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function. - Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.
6E.1 Sequence Similarity How do we compute
similarity?
- Similarity can be defined by counting positions
that are identical between two sequences - Gaps (insertions/deletions) can be important
abcdef abcdef abcdef
abceef acdef a-cdef
7E.1 Sequence Similarity Not all mismatches are
the same
- Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and lysine. - We can introduce "mismatch costs" for handling
different substitutions. - We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.
8E.1 Sequence Similarity Many possible alignments
to consider
- Without gaps, there are are NxM possible
alignments between sequences of length N and M - Once we start allowing gaps, there are many
possible arrangements to consider abcbcd
abcbcd abcbcd
abc--d a--bcd ab--cd - This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements there are
roughly NM possible alignments.
9E.1 Sequence Similarity Avoiding random
alignments with a score function
- Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce sometimesquipsentice - Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance. - Define a score function that accounts for both
element mismatches and a gap penalty
10E.1 Sequence Similarity Match scores
- Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences. - We can transform substitution frequencies into
log odds scores, which can then be added together.
11E.1 Sequence Similarity Local vs. Global
alignments
- A global alignment includes all elements of a
sequence, and includes gaps - A global alignment may or may not include "end
gap" penalties. - A local alignment is includes only subsequences,
and sometimes computed without gaps. - Local alignments can find shared domains in
divergent proteins and are fast to compute - Global alignments are better indicators of
homology and take longer to compute.
12E.1 Sequence Similarity An alignment score
- An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap. - Gap penalties are usually "affine" meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size. - a b c - - da c c e f d9 2 7 6 gt 24 - (10
2) 12
Gap start continuationpenalty
AlignmentScore
Matchscore
13E.2 Dynamic Programming Finding the optimal
alignment
- Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences. - Remember, exponential number of possible
alignments (most with terrible scores). - Computer science to the rescue dynamic
programming identifies optimal alignments in time
proportional to the sum of the lengths of the
sequences
14E.2. Dynamic programming
- The name comes from an operations research task,
and has nothing to do with writing programs. - Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work. - Called Needleman-Wunch or Smith-Waterman
15E.2. Dynamic Programming The recurrence
- The idea Recursion for the cost that you want to
optimize - Boundary Condition
- Recurrence
- Example Pairwise sequence alignment
- Ci,j optimal cost for aligning sequence
A1..i and B1..j where A and B are input
sequences of length n and m. - Boundary Condition
- C0,j Ci,0 0 for i,j 0.
- C0,j Ci,0 cost of deletion for i,j gt 0.
- Recurrence Si,j
- Si-1,j-1 cost of substituting Ai with Bj.
- Si-1,j cost of deletion
- Si,j-1 cost of insertion
16E.2. Dynamic Programming Filling up the table
- Using the recurrence compute all Ci,j in a
table C - At the same time, keep track of which case you
use to get Ci,j in another table F - Use F to backtrack and construct the solution.
Ci.j
For backtracking
(1)
(1)
(2)
(2)
(3)
(3)
Ci,j
17E.2. Dynamic Programming Dynamic programming
alignment
- Each cell has the score for the best aligned
sequence prefix up to that position. - Start by filling in initial gap and first element
to first element match score - Use arrow to indicate path to that alignment
18E.2. Dynamic Programming Continue filling in
optimalpath scores
- For each cell, have three choices for how to get
there from the last optimal alignment (match, gap
sequence 1, gap sequence 2). - Best score(s) are selected, and arrows added
indicated route.-5 5 0 5 -5 0-7 -5
-12
19E.2. Dynamic Programming Optimal alignment by
traceback
- We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or columnto
the first. - Otherwise we needto include the top and bottom
corners - AACADCD---A-CD
20E.2. Dynamic Programming How do we pick match
scores?
- For match scores, two main options
- PAM based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives. - BLOSUM based on local alignments in much more
diverse sequences - Picking the right distance is important, and may
be hard to do. BLOSUM seems to work better for
more evolutionarily distant sequences. BLOSUM62
is a good default.
21E.2. Dynamic Programming Picking gap penalties
- Many different possible forms
- Most common is affine (gap open gap continue
penalities) - More complex penalties have been proposed.
- Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty - Most alignment programs suggest appropriate
penalties for each match score option.
22E.2. Dynamic Programming Searching for optimal
scores
- One possibility is to try several different match
score and gap penalties, and choose the best - In general, this is called parameter space search
and it is important in many areas. - Problems
- requires a lot computation
- we need some principled way to compare the
results. - Use significance testing to compare...
23E.2. Dynamic Programming The significance of an
alignment
- Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance. - How do we calculate the probability that an
alignment occurred by chance? - Either with a model of evolution, or
- Empirically, by scrambling our sequences and
calculating scores on many randomized sequences. - Extreme value statistics (max, not sum)
24E.2. Dynamic Programming Metric-space database
search
- A metric is a function with these
characteristics - f(A, B) f (B,A)
- f(A,A) 0
- f(A,B) f(B,C) ? f(A,C) the triangle inequality
- Log(n) database search for closest (inexact)
match can be done for metrics - Miranker, et al. have applied this to sequence
databases (transforming score matrix)
25E.3 Blasting Why BLAST?
- Dynamic programming solutions to alignment
problems are relatively slow, and don't lend
themselves to efficient database search. - Need some way to search a large database to find
sequences that have an inexact match to a query
sequence - Competing solutions FASTA BLAST.
- Both imperfect approximations to DP. DP finds
some distantly related sequences the
approximations don't - BLAST is more commonly used, although both are
fine.
26E.3 Blasting Sequence search basics
- BLAST/FASTA are 50-100x faster than DP
- If searching for coding regions, always translate
nucleotide to amino acid sequence. - Use appropriate substitution and gap scores
- BLOSUM62 is good for weak protein similarities
- Use PAM30, PAM70 or BLOSUM45 for better results
on more similar sequences, BLOSUM80 for most
distant - Use Low-complexity filters and, for human
sequence, filter out human repeats (ALUs, etc)
27E.3 Blasting How does BLAST work
- BLAST2 (gapped BLAST)
- Break sequence into overlapping words, by
default of length 3. n-l1 l-size words for
sequence of length n. ABCDE ? ABC, BCD, CDE - For each word, define 50 other words that are
similar (use substitution matrix threshold T) - Repeat for each of the n-l1 words, giving about
50n words (out of 2038000 possible) - Use a hash table to find all places in DB with
exact match to any of those words.
28E.3 Blasting Blasting Extending alignments
- Identify database sequences that contain several
matching words on the same diagonal (think DP
alignments) and within a short distance. - Extend these short, ungapped alignments in both
directions along the sequence so long as score of
alignment increases. - Call these extended alignments HSP's for high
scoring pairs
29E.3 Blasting PSI Blast
- Position Specific Iterated BLAST
- Intuition substitution matrices should be
specific to a particular site. Penalize
alanine?glycine more in a helix. - Idea Use BLAST with high stringency to get a set
of closely related sequences. Align those
sequences to create a new substitution matrix for
each position. Then use that matrix
(iteratively) to find additional sequences.
30E.3 Blasting Why (not) PSI-BLAST
- If the sequences used to construct the Position
Specific Scoring Matrices (PSSMs) are all
homologous, the sensitivity at a given
specificity improves significantly. - However, if non-homologous sequences are included
in the PSSMs, they are corrupted. Then they
pull in more non-homologous sequences, and become
worse than generic
31E.3 Blasting How to use PSI BLAST
- Set initial thresholds high. Inspect each
iteration's result for suspicious sequences. - Do several iterations (5), or until no new
sequences are found - Even if only looking for a small set of
sequences, make the initial search very broad - First, use NR with up to 5 iterations to set PSSM
- Then use that PSSM to search in restricted domain
32E.3 Blasting PSI-BLAST example
33E.3 Blasting First Iteration
...
34E.3 Blasting Second iteration
...
...
35E.3 Blasting PSI-BLAST caveats
- Increased ability to find distant homologues
- Cost of additional required care to prevent
non-homologous sequences from being included in
the PSSM calculation. - When in doubt, leave it out!
- Examine sequences with moderate similarity
carefully. - Be particularly cautious about matches to
sequences with highly biased amino acid content
36E.3 Blasting Some notes on sequence-based
database searching
- Matches of gt50 identity in a 20-40 amino acid
region occur frequently by chance. - Most sequences that share statistically
significant similarity throughout their entire
lengths are homologous. - If A is homologous to B, and B to C, then A must
be homologous to C, even if they share no
significant sequence similarity. - Low complexity regions, transmembrane regions and
coiled-coil regions often display significant
similarity without homology - Screen them out of your query sequences!