Title: Sequence Alignment
1Sequence Alignment
2Sequence Alignment
- Why:
- To match a new sequence to others with known functions
- To search for ESTs and other signs of gene expression
- To understand population dynamics and evolutionary relationships between genes and species
- To find important regions within proteins
- Issues:
- Alignment should mimic evolutionary descent: the actual history of mutation and selection that led to this gene
- But it is too complicated to get perfectly correct
- Protein alignments work over larger evolutionary distances than nucleotide alignments
- How to treat substitutions, insertions, and deletions (gaps)
- How to score possible alignments
- Global vs. local alignment
- Multiple alignment (as an extension of pairwise alignment)
- Hidden Markov Models and other ways of abstracting multiple alignment information
- Homology: related by evolutionary descent, as opposed to similarity, which is not necessarily based on descent from a common ancestor
- But in practice, long aligned sequences seem to arise only by evolution
- Short alignments can be due to chance or convergent evolution
3Example Alignments
- THISSEQUENCE vs. THATSEQUENCE
- Same length, just 2 mismatches
- THISISASEQUENCE vs. THATSEQUENCE
- Length is different, need to introduce gaps to
maximize identities.
4Scoring by Identity
- One simple way to score an alignment is by counting the number of perfect matches.
- Get the percentage of identities by dividing the number of matches by the total number of positions (including gap positions). This is a measure of relatedness between 2 proteins.
- For the previous example, 11 matches over 16 positions gives 68.75% (69%) identity.
- Length matters: it is harder to get a high percentage of identities in a long sequence than in a short one.
- Problem of random matches: for nucleotides, 25% of all positions in random sequences match, and it is 5% for proteins.
- General rule, based on proteins with known structural similarity:
- Two proteins are probably structurally similar (and thus probably homologous) if they have 30% or more identical amino acids over their whole length when aligned.
- Less than 20% amino acid identity means probably not homologous.
- Between 20% and 30% is a gray zone.
- My personal happiness with matches increases when it is above 35%.
- Except for very unusual proteins, 100% identity doesn't occur between homologous proteins in different species.
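A minimal sketch of the identity calculation, assuming the two sequences are already aligned and padded with hyphens to equal length. The function name and the particular gapped alignment of THISISASEQUENCE vs. THATSEQUENCE are illustrative assumptions, not taken from the slides; the alignment is one placement of gaps that gives the 11 matches over 16 positions quoted above.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns (gap columns included) that are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues are equal and not a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical gapped alignment of THISISASEQUENCE vs. THATSEQUENCE.
top = "THISISA-SEQUENCE"
bottom = "TH----ATSEQUENCE"
print(percent_identity(top, bottom))  # 68.75

# Ungapped example: 10 matches over 12 positions.
print(percent_identity("THISSEQUENCE", "THATSEQUENCE"))
```

Dividing by the full alignment length, gaps included, follows the rule stated on the slide; other conventions (for example, dividing by the shorter sequence length) would give different percentages.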
5Dotplots
- Dotplots are a simple way of seeing alignments.
- We really like to see good visual demonstrations, not just tables of numbers.
- It's a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.
- You see the alignment as a diagonal.
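The grid described above can be sketched as text. This is only an illustration: the function name `dotplot` and the choice of `*` and `.` as symbols are assumptions, and a real dotplot would normally be drawn as an image.

```python
def dotplot(top: str, side: str) -> list[str]:
    """Return one text row per residue of `side`; '*' marks a match with `top`."""
    return [
        "".join("*" if s == t else "." for t in top)
        for s in side
    ]


top_seq = "THISSEQUENCE"
side_seq = "THATSEQUENCE"
rows = dotplot(top_seq, side_seq)

# Print the grid with the top sequence as a header and the side sequence as row labels.
print("  " + top_seq)
for residue, row in zip(side_seq, rows):
    print(residue + " " + row)
```

The alignment shows up as the run of `*` along the main diagonal, broken at the two mismatched positions.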
6Dotplot Noise
- A big problem is noise: there are lots of random matches (roughly 5% for proteins) that confuse the image.
- Standard solution: create a sliding window (say 10 residues) and only mark a dot if a minimum number of matches occur in that window (say 3).
- A lot of noise goes away.
- This is a sequence compared to itself, so there is a perfect diagonal.
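A sketch of the sliding-window filter, under the assumption that the window runs along the diagonal and that the dot is recorded at the window's starting position (the slides do not say where in the window the dot goes). The function name and the default values are illustrative.

```python
def windowed_dotplot(top, side, window=10, min_matches=3):
    """Return the set of (side_index, top_index) window starts that pass the filter."""
    dots = set()
    for i in range(len(side) - window + 1):
        for j in range(len(top) - window + 1):
            # Count matches along the diagonal of length `window` starting at (i, j).
            matches = sum(side[i + k] == top[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


seq = "THISISASEQUENCE"
# A strict filter on a self-comparison leaves only the main diagonal.
strict = windowed_dotplot(seq, seq, window=4, min_matches=4)
# A permissive filter lets many chance matches through.
loose = windowed_dotplot(seq, seq, window=4, min_matches=1)
print(len(strict), len(loose))
```

Raising `min_matches` relative to `window` removes more noise but can also erase weak, genuine similarity, so the two values are usually tuned together.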
7A Real Dotplot
- Two haptoglobin sequences. (Haptoglobin is a blood protein that binds to hemoglobin that has escaped from the red blood cells.)
- You can see a gap in one sequence, a region of poor similarity just before it, and a simple sequence repeat near the beginning.
8Similarity Matching
- In proteins, many substitutions occur that have little effect on structure or function.
- Or, they alter the protein to make it more adapted for the lifestyles of the different species.
- This depends on where in the protein they occur and on the chemical and physical properties of the amino acids.
- Substitution matrices: scores of the probability of changing one amino acid into another.
- Amino acids are similar if they can frequently be substituted for each other.
- These are just overall numbers compiled over many sequences, not adapted to specific cases.
- Early attempts were based on amino acid properties, or on the number of nucleotide substitutions needed to change from one amino acid to the other.
- Now they are based on actual comparisons between sequences.
- The two most popular types: PAM and BLOSUM.
- There are other, more specialized substitution matrices, for comparing transmembrane regions, for example.
9BLOSUM62 Matrix
10Similarity Matrix Theory
- Think about aligning 2 proteins from similar species that are orthologs: same function and syntenic. At some point back in evolutionary time, there was a single DNA sequence that is the common ancestor of both proteins.
- Most paired amino acids are identical, but a few are different.
- Reduce the problem: consider a single aligned pair of amino acids that are not identical, T-S.
- We are comparing 2 theories of how these amino acids were derived from a common ancestor.
- Random mutation followed by natural selection: some substitutions will happen more frequently than others because they lead to functional proteins more often.
- The frequency with which T and S are substituted for each other by evolution is derived from counting them in well-aligned sequences: freq(T-S).
- Completely random changes: every possible substitution happens in proportion to the relative frequencies of the different amino acids; the two amino acids are unrelated to each other.
- In this case, the frequency of a T paired with an S is just the product of the frequency of Ts and the frequency of Ss in the entire protein (or proteome): freq(T) × freq(S).
- The odds ratio is the evolutionary theory (observed data) frequency divided by the random theory frequency: OR = freq(T-S) / (freq(T) × freq(S)).
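As a small worked example of the odds ratio, the sketch below divides an observed pair frequency by the product of the two individual frequencies. The function name and all three frequency values are hypothetical, chosen only to make the arithmetic easy to follow; they are not measured data.

```python
def odds_ratio(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """Observed pair frequency divided by the frequency expected by chance."""
    return pair_freq / (freq_a * freq_b)


# Hypothetical frequencies, for illustration only.
freq_T = 0.05    # assumed overall frequency of T
freq_S = 0.07    # assumed overall frequency of S
freq_TS = 0.007  # assumed frequency of aligned T-S pairs

# Chance expectation is 0.05 * 0.07 = 0.0035, so the pair is about twice as common as chance.
print(odds_ratio(freq_TS, freq_T, freq_S))
```

An odds ratio above 1 favors the evolutionary theory for that pair; a ratio below 1 means the pair is seen less often than chance predicts.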
11More Theory
- We want to get the odds that a given alignment fits the evolutionary model better than a random model.
- Good alignments give high odds ratios.
- Need to multiply the ORs for all amino acids in the alignment.
- It is easier (and doesn't overflow the computer's floating-point arithmetic) to take the logarithm of the odds ratio for each amino acid, and then add the logarithms.
- This is the lod score (log of odds).
- A negative score means that the given substitution is less likely than chance, and a positive score means it is more likely than chance.
- You can score each possible alignment by adding up over the whole protein.
- Some fooling with constants (which don't distort the results but are either more pleasing to the human eye or make further calculations easier): multiply the lod score by 10, or add a constant to make all values 0 or greater.
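The equivalence between multiplying odds ratios and adding their logarithms can be checked directly. This is a sketch with hypothetical odds ratios; the function name is an assumption, and base 10 is used here only as one possible choice of logarithm.

```python
import math


def lod_alignment_score(odds_ratios):
    """Sum of log10 odds ratios over all aligned pairs."""
    return sum(math.log10(r) for r in odds_ratios)


# Hypothetical per-position odds ratios for a 4-column alignment.
ratios = [2.0, 0.5, 4.0, 1.0]

product = math.prod(ratios)          # multiplying the ORs directly
score = lod_alignment_score(ratios)  # adding the logs instead

# Both routes describe the same alignment: log10(product) equals the summed score.
print(math.isclose(score, math.log10(product)))

# Ratios below 1 give negative lod scores, ratios above 1 give positive ones.
print(math.log10(0.5) < 0 < math.log10(2.0))
```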
12PAM
- PAM: Point Accepted Mutations, meaning single amino acid substitutions (point mutations) that have been accepted by natural selection: they are functional in different species.
- Derived by Dayhoff and colleagues in the 1960s and 1970s (although there are some newer versions around).
- They give a measure of the frequency of changing from one amino acid to another, as compared to the frequency of random change.
- Derived from global alignments of homologous sequences from different, but closely related, species. The sequences had an average of 1 amino acid change per hundred residues. Thus we assume at most 1 mutation has occurred at each position.
- Do a phylogenetic analysis of the sequences to determine which mutations have occurred.
- Calculate the lod scores. Then multiply all of them by 10 and round to integers.
- This set of scores derived from sequence alignments is the PAM1 matrix.
- Since most sequences being aligned are not between such closely related species, the PAM1 matrix is multiplied by itself many times to mimic lots of small changes.
- This concept is a serious weakness: multiplying errors magnifies them.
- The number after PAM is the number of times the matrix has been multiplied by itself.
- Common ones: PAM30, PAM70, PAM120, PAM250. A bigger number is better for more distant relationships.
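To illustrate what repeated multiplication does, here is a sketch using a toy 3-letter alphabet. It assumes that what gets multiplied is a matrix of per-step change probabilities (each row summing to 1), with lod scores computed afterwards; the matrix values and the names `toy_pam1`, `mat_mul`, and `mat_pow` are invented for illustration and are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, n):
    """Multiply m by itself so that it appears n times in the product (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: row i gives the chance that residue i
# stays the same (diagonal) or changes to another residue (off-diagonal).
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After 250 steps the diagonal entries have dropped well below their one-step values and the off-diagonal entries have grown, which is why higher PAM numbers suit more distant relationships. Any error in the one-step values is carried through every multiplication.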
13BLOSUM
- BLOcks SUbstitution Matrix. Derived in the 1990s by Henikoff and Henikoff.
- Based on local alignments of Blocks, which are short, highly homologous regions, with no gaps.
- Sequences were grouped together if they were very similar, and then comparisons were made between the groups as in the PAM matrices.
- No attempt at phylogenetic trees.
- The different BLOSUM matrices use specific cutoffs for amino acid identity when grouping. For example, for the BLOSUM62 matrix, sequences with at least 62% identity were grouped together.
- The odds ratio for each substitution is calculated, but instead of taking the base 10 log and multiplying the result by 10 as in PAM, BLOSUM takes the base 2 log and multiplies by 2. This gives scores in half-bits.
- Bigger numbers imply closer evolutionary distance, so BLOSUM80 is better for closely related species than BLOSUM45.
- BLOSUM seems to work better than PAM.
- BLOSUM62 is the default used in BLAST protein searches.
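The two scaling conventions described above can be compared side by side. This is a sketch only: the function names are assumptions, and the input odds ratios are hypothetical round numbers rather than values from either matrix.

```python
import math


def pam_style_score(odds_ratio: float) -> int:
    """Base-10 log of the odds ratio, multiplied by 10 and rounded."""
    return round(10 * math.log10(odds_ratio))


def blosum_style_score(odds_ratio: float) -> int:
    """Base-2 log of the odds ratio, multiplied by 2 and rounded (half-bits)."""
    return round(2 * math.log2(odds_ratio))


# Hypothetical odds ratios: less likely than chance, chance, more likely than chance.
for ratio in (0.25, 1.0, 2.0):
    print(ratio, pam_style_score(ratio), blosum_style_score(ratio))
```

An odds ratio of 2 is one bit of evidence, so it scores 2 half-bits in the BLOSUM-style scale. The two scales differ only by a constant factor before rounding, so they rank substitutions the same way.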
14BLOSUM62 and PAM120 Matrices
The colors represent different physicochemical properties. Note that some substitution scores are positive, which indicates that they occur more frequently than chance. The average value is negative: it is more likely that an amino acid will stay the same than change. The diagonal values are unchanged amino acids, all of which have positive values. Some are less changeable than others: tryptophan and cysteine especially.
15Gaps
- Gaps occur with roughly 1/10 the frequency of base substitutions, so they are common in most alignments.
- Symbolized by hyphens ( --- ) paired with residues, like a mismatch with a blank space.
- You can assign a penalty for each gap position.
- This is called a linear gap penalty: the total penalty is proportional to the gap length.
- The problem is, once you start putting them in, you can get almost anything aligned.
- Alignment programs usually distinguish between creating a gap and extending a gap. Thus, there is a gap opening penalty and a (smaller) gap extension penalty.
- This is called an affine gap penalty.
- Although substitutions have a lot of theory behind them, gap penalties are generally determined by heuristic means.
- Heuristic: a method or value determined by trial-and-error experiments, without a strong guiding theory.
- In this case, gap opening and extension penalties are the result of trying many possibilities and seeing which ones give the most pleasing alignments.
- The BLAST default is a -11 penalty for opening the gap and -1 for each additional position of gap (11/1).
- Other options on BLAST at NCBI are 7/2, 8/2, 9/2, 10/1, and 12/1.
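The difference between linear and affine penalties can be sketched as two small functions. The function names are assumptions. The affine version follows the wording above (the first gap position costs the opening penalty, each additional position costs the extension penalty); this is an assumed convention, and some programs instead charge the opening penalty plus an extension penalty for every position, so check the documentation of the tool you use.

```python
def linear_gap_penalty(length: int, per_position: int = 6) -> int:
    """Total penalty proportional to gap length."""
    return -per_position * length


def affine_gap_penalty(length: int, opening: int = 11, extension: int = 1) -> int:
    """First gap position costs `opening`; each further position costs `extension`."""
    if length <= 0:
        return 0
    return -(opening + extension * (length - 1))


# Compare how the two schemes treat gaps of increasing length.
for gap_length in (1, 2, 5, 10):
    print(gap_length, linear_gap_penalty(gap_length), affine_gap_penalty(gap_length))
```

With these example values a single-position gap is cheaper under the linear scheme, while a long gap is much cheaper under the affine scheme, which favors a few long gaps over many short ones.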
16
- Comparing 2 distantly related sequences with different gap penalties:
- The top alignment has fewer gaps and longer matches.
- The bottom alignment has more identities and similarities overall, but lots of little gaps. The matches near the C-terminus are absurd.
- Look at the short segment after the first gap in the lower alignment: it gained 3 identities.
17How Do We Make Alignments?
- We have been working on scoring an alignment: identities and similarities, and gap penalties.
- But how do you get an alignment to score in the first place?
- Trying all possibilities is one of those "more possibilities than there are atoms in the Universe" problems.
- The general solution: dynamic programming, a technique first applied to sequence alignment by Needleman and Wunsch (1970).
- Their original method gave global alignments.
- Smith and Waterman (1981) provided a slight (but critical) modification that produced local alignments, which work better than global for most genes.
- These methods provide an optimal alignment, for a given substitution matrix and set of gap penalties.
- They are much faster than trying all possibilities, but still not quick enough. Various refinements and heuristic methods improve the speed.
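To get a feel for why trying all possibilities is hopeless, the sketch below counts the paths across an alignment grid when each step may be diagonal, horizontal, or vertical. Treating every such path as a distinct alignment is an assumption of this illustration, and the function name is invented.

```python
def count_alignment_paths(m: int, n: int) -> int:
    """Number of corner-to-corner paths on an m-by-n grid using
    diagonal, horizontal, and vertical steps."""
    prev = [1] * (n + 1)  # only one path along the top edge
    for _ in range(m):
        cur = [1]  # only one path down the left edge
        for j in range(1, n + 1):
            # Arrive from above, from the left, or along the diagonal.
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[n]


print(count_alignment_paths(3, 3))
# Number of decimal digits in the count for two 300-residue sequences.
print(len(str(count_alignment_paths(300, 300))))
```

The counting itself is a small dynamic program: each cell is filled from its three neighbors, so the work grows with the grid area even though the number of paths grows exponentially.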
18Smith-Waterman Algorithm
- Start with a 2-dimensional matrix with one sequence along the top and the other sequence down the left side. All possible pairs of nucleotides or amino acids are represented by the cells of the matrix.
- Add edge rows along the top and left side.
- All possible alignments are represented by the paths through the matrix:
- A diagonal step is an alignment between the query and the subject sequences at that position.
- A vertical step is a gap in the query sequence.
- A horizontal step is a gap in the subject sequence.
- Have a match reward and penalties for mismatches, gap openings, and gap extensions. For our example, we will use the BLOSUM62 matrix, with a linear gap penalty of -6.
- Initialize the edge rows to scores of 0.
19BLOSUM62 With positive scores marked
20Calculating Cell Scores
      T   A
  T   5   7
  G   2   ?

- The cell at row i and column j has a score S(i, j).
- Starting at the top left cell, proceed row-by-row, calculating each cell's score S(i, j). S(i, j) is the maximum of:
- 0 (i.e. set to 0 if the calculated score is less than 0)
- S(i-1, j-1) + match/mismatch score for cell (i, j)
- S(i, j-1) + match/mismatch score for cell (i, j) + gap penalty
- S(i-1, j) + match/mismatch score for cell (i, j) + gap penalty

For the cell in question, the bases don't match, so it starts with a match/mismatch score of -1. The gap penalty in this small example is -4. There are 3 possible alignment paths to this cell:
1. diagonal (query/subject alignment). Score: 5 - 1 = 4.
2. vertical (query gap). Score: 7 - 4 - 1 = 2.
3. horizontal (subject gap). Score: 2 - 4 - 1 = -3 (set to 0).
Since 4 is the maximum, the cell's value is set to 4.
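The cell calculation can be sketched as a single function. It follows the rule as worked through on this slide, where a gap step adds both the cell's match/mismatch score and the gap penalty; many textbook statements of the recurrence add only the gap penalty on gap steps, so treat this as a sketch of the slide's variant. The function and parameter names are assumptions.

```python
def cell_score(diag: int, left: int, up: int, pair_score: int, gap_penalty: int) -> int:
    """Score of one cell from its three neighbors.

    diag, left, up: scores of the diagonal, left, and upper neighbor cells.
    pair_score: match/mismatch score for this cell.
    gap_penalty: a negative number, added on vertical and horizontal steps.
    """
    return max(
        0,                                # never go below 0
        diag + pair_score,                # diagonal: align the two residues
        left + pair_score + gap_penalty,  # horizontal step (subject gap)
        up + pair_score + gap_penalty,    # vertical step (query gap)
    )


# The "?" cell from the small grid above: neighbors 5 (diagonal), 2 (left), 7 (up).
print(cell_score(diag=5, left=2, up=7, pair_score=-1, gap_penalty=-4))  # 4
```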
21Smith-Waterman Details
- Start at the first row: T doesn't match anything, and looking at BLOSUM62, the only positive score for a mismatch is 1 with S.
- We keep track of the 0 -> 1 diagonal.
- Second row: H matches N (1), but nothing else.
- The diagonal starting with the 1 in the previous row is an H-A mismatch (-2), so 1 - 2 = -1, which is scored as 0.
- Third row: I gives positive scores with M, L, and V. But nothing builds on the previous row.
22More S-W
- Fourth row: S has positive scores with N, A, and T.
- S-S: 4 match, added to 4 from the diagonal = 8.
- S-A: 1. For a horizontal move (subject gap), 8 + 1 - 6 = 3.
- S-I is a -2 mismatch, added to 2 from the diagonal = 0.
- S-G: 0 mismatch, added to 4 from the diagonal = 4.
23More S-W
24Still More!
25Traceback
- Then, start at the highest score in the matrix and trace back the path leading through the highest previous scores to 0. Go left and up only, preferring the diagonal path if a choice needs to be made.
- High score is 16, in the bottom row (but it could have been elsewhere).
- Write the alignment starting at the top.
- It doesn't cover the entire sequence: it is a local alignment, not global.
- It isn't perfect: the strong diagonal from LI and the 0 mismatch score from a G-N pairing overcame the gap penalty needed to put a gap where the G is.
- Nevertheless, given the BLOSUM62 matrix and the -6 linear gap penalty, this is an optimal alignment:
ISALIGNE
IS-LIN-E
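The fill and traceback steps can be put together in a compact sketch. Two simplifications are assumed here and are not what the slides use: a plain match/mismatch score stands in for BLOSUM62, and gap steps add only the gap penalty (the common textbook form of the recurrence), whereas the worked example on these slides also adds the cell's substitution score on gap steps. Scores from this sketch will therefore not reproduce the 16 above. The function name and default scores are illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # edge row and column stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest score until a 0 is reached,
    # preferring the diagonal when more than one move is possible.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("XXHELLOYY", "ZZHELLOWW"))
print(smith_waterman("GATTACA", "GATACA"))
```

Because the traceback stops at the first 0, the unrelated flanking residues in the first example are left out of the result, which is the defining feature of a local alignment.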
26Changing the Gap Penalty
- The top one has a -4 gap penalty and the bottom
one has a -8 gap penalty (both linear). They
give somewhat different alignments.
27A Needleman-Wunsch Alignment
28Speeding Things Up