Title: Why do we care about a sequence alignment
1Why do we care about a sequence alignment?
- For a new protein (or DNA) whose sequence is
related to another one about which something is
already known, a lot can be learned with little
effort. This can include
- Biochemical function (e.g. protein kinase)
- Organismal function (e.g. cell-cycle regulator)
- Structural information (e.g. modeling against a
known X-ray crystal structure)
- To determine gene structure (e.g. splicing
pattern) or likely subcellular or tissue
expression pattern.
(cont)
2(why do we care, cont.)
- To study evolutionary relationships among genes
and organisms. - Sequence assembly in genome projects
- Find sequence overlaps for assembly
- Quality control using multiple sequence runs
3Homology, orthology, paralogy, similarity
- Two sequences are homologous if they derive from
a common ancestral sequence (includes paralogs
and orthologs). - Two homologous sequences in the same organism
are called paralogous. They arose by duplication
and divergence from a shared ancestral sequence. - Two homologous sequences in different organisms
may be orthologous (a strict definition is
thornier because of paralogy more in later
class). - Two sequences that can be aligned in a
convincing manner are said to have sequence
similarity. - If two sequences have similarity over a
substantial length, this nearly always reflects
homology (convergence is unlikely).
4Major alignment methods
- Hand alignment. Surprisingly effective but
LABORIOUS. - Dynamic programming methods. There are many
closely-related algorithms with specific names,
including Smith-Waterman and Needleman-Wunsch. - Hidden Markov Models and related strict
probabilistic methods. (In algorithmic terms,
these are dynamic programming as well but are
rarely called that in the literature.) - Dialign a method that aligns ungapped blocks
(diagonals) and then arranges them with gaps
between them. GS554/Papers/SequenceAlignment/Diali
gn1A.pdf
5Dynamic Programming Alignment
- Variants of this method are used by most
pairwise and multiple alignment programs (e.g.
Clustal). - Fundamental problem is how to produce an optimal
alignment of related protein or DNA sequences,
given some assumptions about sequence
conservation. - What is "optimal"? What assumptions have to be
made to make the alignment feasible?
6Optimal Alignments
- One definition of optimal an alignment in which
each aligned sequence residue descended from the
same ancestral residue. - Another definition each aligned sequence
residue plays the same functional role for the
two proteins. - Typically, these two definitions are closely
related.
7A self-evidently true set of related sequences
CLUSTAL W(1.4) multiple sequence alignment
identical in all
. conservative changes worm
CaMKII EARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDI
VAREFYSEADASHCIQQI fly CaMKII
EARICRKLHHPNIVRLHDSIQEENYHYLVFDLVTGGELFEDIVAREFYSE
ADASHCIQQI rat CaMKII EARICRLLKHPNIVRLHDSISEEGHH
YLIFDLVTGGELFEDIVAREYYSEADASHCIQQI
. ..
worm CaMKII LESIAYCHSNGIVHRDLKPENLLLA
SKAKGAAVKLADFGLAIEVN-DSEAWHGFAGTPGY fly CaMKII
LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVKLADFGLAIEVQGDHQA
WFGFAGTPGY rat CaMKII LEAVLHCHQMGVVHRDLKPENLLLAS
KLKGAAVKLADFGLAIEVEGEQQRWFGFAGTPGY
.. . . .
Among them, these proteins have a total of
approximately 2 billion years of evolution (!!)
8A dubious case
CLUSTAL W(1.4) multiple sequence
alignment ATTSTDGLISNGAERLRLQGSRLQTSRFACFRCCGNIIT
YLVRLRSTPEELEQRYKSKEI EARICRKLHHPNIVRLHDSIQEENYHYL
VFDLVTGGELFEDIVAREFYSEADASHCIQQI . . .
. . .. . . .
.. DKFLE--KEKHTFRRQVK--LLLLGAGESGKSTFLKQMRIIHGVN
FDYELLLEYQS---V LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVK
LADFGLAIEVQGDHQAWFGFAGTPGY . .. . .
. . . . . .
Are these sequences evolutionarily related?
No - they are a G-protein alpha subunit and a CaM
kinase
9Assumptions Made for (most) Alignments
- That there is a useful alignment at all. (Any
two proteins with gt 20 amino acid identity over
their full length are certainly derived from a
common ancestral sequence.) - That each residue evolves independently of
others. - That the primary sequence is the predominant
determinant of function (biological assumption). - That residues remain in the same order (no
changes in domain order etc.).
10Graph theory and terminology
- A sequence graph represents residues as edges
between nodes (also called vertices).
- Such a graph can represent a single sequence
(above) or can be 2-dimensional (or more) and
represent a comparison between two sequences.
thanks to Phil Green for some of these slide
figures
112-dimensional graph representing alignments
12Graphs and sequence alignments
- Paths through the graph correspond to
alignments of the sequences, with each edge on
the path corresponding to a column of the
alignment. - Diagonal edges correspond to two aligned
residues. - Horizontal and vertical edges correspond to a
residue in one sequence and a gap in the other.
13A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
Exercise convince yourself that there are 3mn
possible alignments.
14Edge Weights on Graphs
- Edge weights correspond to scores for an aligned
residue or gap (a simple version would be 1 for a
match, 0 otherwise). - The weight of a path is the sum of weights for
each edge on the path. - The highest weight path corresponds to the
highest scoring alignment for that scoring
system. - Weights are assigned using a substitution score
matrix, such as the BLOSUM62 matrix (the
Needleman-Wunsch paper just uses 1 for each
match).
15How do we find the highest weight path?
16Needleman-Wunsch / Smith-Waterman Algorithm
- Start at the top left corner on the graph (or
the bottom right corner). - Proceed across and down the graph, adding up
match scores for every path and recording in a
2-D array (note in fact you dont have to
record them all). - At the end, choose the path that gives the best
score at the bottom row or right edge. - For alignment, trace back through best score
path. - It is possible to prove that this is the best
alignment.
17A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
18Informal inductive proof of best alignment path
Consider the last step in the best alignment path
to node a below. This path must come from one of
the three nodes shown, where X, Y, and Z are the
cumulative scores of the best alignments up to
those nodes. We can reach node a by three
possible paths an A-B match, a gap in sequence A
or a gap in sequence B
The best-scoring path to a is the maximum of X
match Y gap Z gap
BUT the best paths to X, Y, and Z are analogously
the max of their three upstream possibilities,
etc. Inductively QED.
19Traceback for best alignment path
For alignment, node a must have two associated
values the score of the best path leading to a
and which node that score came from (X, Y, or Z).
A similar pair of values can be kept for every
node in the graph.
At the end, search along the bottom and right
edges for the highest scoring node and start
traceback from that node. Follow the node
pointers backward from there. Et voila! Notes
there are minor variants on the algorithm that
permit global, local, and repeat alignments,
improved gap models etc. These include always
starting traceback from the lower right corner,
never letting the best score drop below zero, and
keeping track of gap open vs. gap extension
(requires keeping three best scores at each node).
20Protein score matrices
- DNA score matrices are much simpler (and we
wont cover them). - Quantitatively represent the degree of
conservation of typical amino acid residues over
evolutionary time. - All possible amino acid changes are represented
(matrix of size 20 x 20). - Most commonly used are three different BLOSUM
matrices tuned for different degrees of
evolutionary divergence. (Henikoff and Henikoff)
21BLOSUM62 Score Matrix
22Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
23BLOSUM62 Score Matrix example D
Bad scores very dissimilar
24Amino acid structures
Hydrophobic
Polar
Charged
25How are BLOSUM scores inferred?
- Find sets of sequences whose alignment is
thought to be correct (this is largely
bootstrapped by alignment). - Measure how often various amino acid pairs occur
in all columns of all the alignments. - Normalize this to the expected frequency of such
pairs randomly in the same set. - Derive a log-odds score (usually in half bits).
26Example of a short block of 29 sequences
Column 1 29/29 H (histidine)
Column 15 24/29 P (proline)
3 H (histidine) 2 S (serine)
Column 12 mostly L (leucine), I (isoleucine), or
V (valine).
Column 18 very diverse.
27Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
28Pair frequency vs. expectation
Sample column count
29Log-odds score calculation (so adding scores
multiplying probabilities)
For use in alignment programs computational speed
is important so these are often rounded to the
nearest integer and (to reduce round-off error)
they are often multiplied by 2 (or more) first,
giving a half-bit score
30Sample frequencies from this BLOSUM matrix
(half-bit scores)
Frequency of C over all proteins 0.0162
Reverse calculation of aligned C-C pair frequency
in BLOSUM data set
C - C
you have to look this up
31Constructing Blocks
- Blocks are ungapped alignments of multiple
sequences, usually 20 to 100 amino acids long. - Cluster the members of each block according to
their percent identity. - Make pair counts and score matrix from similarly
clustered blocks. - Each BLOSUM matrix is named for the percent
identity cutoff in step 2 (e.g. BLOSUM70).
(Note that there is some circularity to the
process the blocks are aligned using score
matrices!)
32Log-odds Scores - Reminder
Where is the alignment score, qij is the real
pair frequency, eij is the expected random pair
frequency.
33Probabilistic Interpretation of Scores (ungapped)
- By converting scores back to probabilities, we
can give a probabilistic interpretation to an
alignment score.
34The need for gaps and gap penalties
- one kind of mutation is an insertion or deletion
of one or more nucleotides, often called an
indel. - most of the time indels in protein sequences are
eliminated by selection, but not always. - represented by gap symbols, which mean a
deletion in the gap-bearing sequence OR an
insertion in the other sequence (usually you
cant tell which).
TACTTGGATCCGATCAGGAGGAACCGATTC TACTTGGATCCGA--AGGA
GGAACCGATTC
35BUT we need to penalize gaps!
TACTTGGATCCGATCAGGAGGAACCGATTC T--TT---T--G----GG-
G-------TT-
- convince yourself that if we allow gaps without
affecting score, we can always generate a nearly
perfect alignment (constrained only by residue
frequency and sequence length). - thinking biologically, like an amino acid
change, an indel indicates evolutionary
divergence.
36Simple and Affine Gap Scores
- Simple gap score - using the same dynamic
programming method, add a negative score for each
gap added (often described as subtracting a
penalty). - However in real sequences, positions where a
single gap occurs are much more likely to have a
longer gap (note biochemical basis). - Affine gap penalties have an open penalty and a
much smaller extend penalty. - Choice of penalties is relatively arbitrary
(heuristic) and not based on good evolutionary
models (as match scores are).
37Randomly Distributed Gaps
(probability of a gap at each position in the
sequence)
note - the slope of the line on a log-linear
plot will vary according to the frequency of
gaps, but it will always be linear
38Distribution of alignment gap lengths in large
set of structurally-aligned proteins
log-log plot
- a multi-tiered affine gap score is required to
fit existing data sets well. - nearly all affine gap scores currently are one
tiered (open and extend).
observed
39Alignment gap lengths regraphed
log-linear plot
Reasonably fit by 4 affine steps
40Assignment for Tues.
Download and install the Bonsai alignment
program. http//depts.washington.edu/jtlab/softwar
e/softwareIndex.html (also on course web links
page near bottom) Multiple align sample sequence
set 3 (Help menu). There should be 45
sequences. Save the tree and multiple alignment
as JPG images and send those to me. (note there
may be issues with viewing the tree on
Macs) Select a column that is moderately
conserved in the multiple alignment and look at
the column scores for the column. Pick one each
of the best and worst scoring amino acids and
write a sentence explaining (qualitatively) why
the score makes sense. Bonsai has documentation
files that will help you do all this and to
understand it (note they are badly out of date
and need rewriting).
41Assignment part 2
F Y Y L F L F F Y
- Find out the actual frequency of the 20 amino
acids in proteins (overall). - Use this column of data to construct a 3 x 3
amino acid score matrix, using rounded half-bit
scores. - (dont use pseudocounts)