Title: ALIGNMENT OF NUCLEOTIDE
1ALIGNMENT OF NUCLEOTIDE AND AMINO-ACID SEQUENCES
3Homology: The term was coined by Richard Owen in 1843.
Definition: Similarity resulting from common ancestry.
4Homology: A qualitative statement
- Homology designates a relationship of common descent between entities.
- Two genes are either homologs or not.
- It doesn't make sense to say two genes are 43% homologous.
- It doesn't make sense to say Linda is 24% pregnant.
5Homology
By comparing homologous characters, we can
reconstruct the evolutionary events that have led
to the formation of the extant sequences from the
common ancestor.
6Homology
When dealing with sequences, we are interested in
POSITIONAL HOMOLOGY. We identify positional
homology by ALIGNMENT.
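7Ancestral sequence: ACTGGGCCCAAATC
Descendant with 1 insertion and 1 substitution: AACAGGGCCCAAATC
Descendant with 1 deletion and 1 substitution: CTGGGCCCAGATC
Correct alignment (dots mark mismatched columns):
--CTGGGCCCAGATC
AACAGGGCCCAAATC
   .       .
Incorrect alignment (dots mark mismatched columns):
CTGGGCCCAGATC--
AACAGGGCCCAAATC
.... .. .. ..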
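8Ancestral sequence: Unknown!
Descendant (unknown processes): AACAGGGCCCAAATC
Descendant (unknown processes): CTGGGCCCAGATC
Correct alignment? Incorrect alignment? (dots mark mismatched columns)
CTGGGCCCAGATC--
AACAGGGCCCAAATC
.... .. .. ..
Correct alignment? Incorrect alignment? (dots mark mismatched columns)
--CTGGGCCCAGATC
AACAGGGCCCAAATC
   .       .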
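9Ancestral sequence: ACCTGAATTTGCCC
Events in lineage 1: deletion of T9, substitution G5T, insertion of ACA at position 12. Descendant: ACCTTAATTGCACACC
Events in lineage 2: deletion of A6, deletion of A7, substitution T8A, insertion of G at position 2. Descendant: AGCCTGATTGCCC
Alignment of the two descendants:
ACCTTAATTGCACACC
AGCCTGATTGCCC---
Differences implied by this alignment: C2G, T4C, A6G, A12C, -ACC14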
10Positional homology: A pair of nucleotides from two aligned sequences that have descended from one nucleotide in the ancestor of the two sequences.
Alignment: A hypothesis concerning positional homology among residues in two or more sequences.
11An alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches: the same nucleotide appears in both sequences.
(2) mismatches: different nucleotides are found in the two sequences.
(3) gaps: a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
   ..       . .
12Sequence alignment: The identification of the locations of deletions or insertions that might have occurred in either of the two lineages since their divergence from a common ancestor.
An insertion or a deletion is called an indel or a gap.
13Sequence alignment:
1. Pairwise alignment
2. Multiple alignment
14- Two DNA sequences A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
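The quantities x, y and z can be read directly off an alignment. The sketch below is an illustration, not part of the original slides: the function name `alignment_counts` is hypothetical, and the alignment is the one shown on slide 11. It also checks a relation that follows from the definitions: every matched or mismatched pair uses two bases and every gap position uses one, so m + n = 2(x + y) + z.

```python
def alignment_counts(row_a, row_b, gap="-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases opposite a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two null bases")
        if a == gap or b == gap:
            z += 1          # one real base facing a null base
        elif a == b:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = alignment_counts(row_a, row_b)

# Unaligned lengths m and n are the rows with the gap symbols removed.
m = len(row_a.replace("-", ""))
n = len(row_b.replace("-", ""))
assert m + n == 2 * (x + y) + z
print(x, y, z)
```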
15There are terminal and internal gaps.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
16A terminal gap may indicate missing data.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
17An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
18The alignment is the first step in many
evolutionary and functional studies. Errors in
alignment tend to amplify in later computational
stages.
19Methods of alignment:
1. Manual
2. Dot matrix
3. Distance matrix
4. Combined (distance + manual)
20- Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGTTCCATCAGGTGGTTGGTGTG
              .
21Advantages of manual alignment:
(1) use of a powerful and trainable tool (the brain, well, some brains).
(2) ability to integrate additional data, e.g., domain structure, biological function.
24Protein Alignment may be guided by Tertiary
Structures
Escherichia coli DjlA protein
Homo sapiens DjlA protein
25Disadvantages of manual alignment: (1) The method is subjective and unscalable.
26The dot-matrix method: The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
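A minimal sketch of this construction follows. The function names `dot_matrix` and `render` and the two short example sequences are illustrative assumptions, not taken from the slides.

```python
def dot_matrix(top, left):
    """Rows follow `left`, columns follow `top`; True marks identical residues."""
    return [[a == b for b in top] for a in left]


def render(top, left):
    """Draw the matrix as text, with the two sequences as headings."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_matrix(top, left)):
        cells = " ".join("." if hit else " " for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


print(render("CTGGGCCC", "ACAGGGCC"))
```

Runs of dots along a diagonal correspond to stretches where the two sequences agree without gaps.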
27The alignment is defined by a path from the
upper-left element to the lower-right element.
28There are 4 possible steps in the path:
- (1) a diagonal step through a dot: a match.
- (2) a diagonal step through an empty element of the matrix: a mismatch.
- (3) a horizontal step: a gap in the sequence on the top of the matrix.
- (4) a vertical step: a gap in the sequence on the left of the matrix.
29Forbidden directions
Allowed directions
30A dot matrix may become cluttered. With DNA sequences, 25% of the elements will be occupied by dots by chance alone.
31Window size = 1, stringency = 1, alphabet size = 4
The number of spurious matches is determined by window size, stringency, and alphabet size.
32Window size = 1, stringency = 1, alphabet size = 4
Window size = 3, stringency = 2, alphabet size = 4
33Window size = 1, stringency = 1, alphabet size = 20
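The effect of these three parameters on clutter can be estimated with a short calculation. The sketch below rests on assumptions that the slides do not state: all residues are equally frequent and independent, and a dot is drawn when at least `stringency` of the `window` positions compared along a diagonal are identical. The function name is hypothetical.

```python
from math import comb


def spurious_dot_probability(window, stringency, alphabet_size):
    """Chance that two unrelated windows reach the stringency threshold."""
    p = 1.0 / alphabet_size  # chance that one pair of random residues matches
    # Binomial tail: probability of `stringency` or more matches in `window` positions.
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


for settings in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    print(settings, spurious_dot_probability(*settings))
```

Under these assumptions, a window of 1 on a four-letter alphabet fills one element in four by chance, while a longer window with a stricter threshold, or a larger alphabet such as the 20 amino acids, leaves fewer spurious dots.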
34Dot-matrix methods. Advantages: May unravel information on the evolution of sequences.
35Window size: 60 amino acids. Stringency: 24 matches.
Advantages: Highlighting information
36Window size: 60 amino acids. Stringency: 24 matches.
Advantages: Highlighting information
The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene.
37Dot-matrix methods. Disadvantage: May not identify the best alignment.
38Distance and similarity methods
39The best possible alignment (optimal alignment)
is the one in which the numbers of mismatches and
gaps are minimized according to certain criteria.
40Unfortunately, reducing the number of mismatches
results in an increase in the number of gaps, and
vice versa.
41α = matches, β = mismatches, γ = nucleotides in gaps, δ = gaps
42Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are multiplied to make the gaps equivalent in value to the mismatches. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.
43Mismatch penalty is an assessment of how
frequently substitutions occur.
44- The distance (dissimilarity) index (D) between two sequences in an alignment is
$$D = \sum_i m_i y_i + \sum_k w_k z_k$$
where y_i is the number of mismatches of type i, m_i is the mismatch penalty for an i-type of mismatch, z_k is the number of gaps of length k, and w_k is a positive number representing the penalty for gaps of length k.
45- The similarity index (S) between two sequences in an alignment is
$$S = x - \sum_k w_k z_k$$
where x is the number of matches, z_k is the number of gaps of length k, and w_k is a positive number representing the penalty for gaps of length k.
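Both indices can be computed from a finished alignment: D adds the penalized mismatches to the penalized gaps, and S subtracts the penalized gaps from the number of matches. The sketch below is a simplified illustration: it assumes a single mismatch type (one penalty for every mismatch), the penalty values are hypothetical, and the alignment is the one from slide 11.

```python
from itertools import groupby


def column_labels(row_a, row_b, gap="-"):
    """Label each column 'match', 'mismatch', 'gap_a' or 'gap_b'."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    labels = []
    for a, b in zip(row_a, row_b):
        if a == gap:
            labels.append("gap_a")
        elif b == gap:
            labels.append("gap_b")
        else:
            labels.append("match" if a == b else "mismatch")
    return labels


def distance_and_similarity(row_a, row_b, mismatch_penalty, gap_penalty):
    """Return (D, S); gap_penalty(k) is the penalty w_k for a gap of length k."""
    labels = column_labels(row_a, row_b)
    x = labels.count("match")
    y = labels.count("mismatch")
    # A gap is a run of consecutive null bases in the same sequence.
    gap_lengths = [
        len(list(run)) for label, run in groupby(labels) if label.startswith("gap")
    ]
    gap_cost = sum(gap_penalty(k) for k in gap_lengths)
    return mismatch_penalty * y + gap_cost, x - gap_cost


# Hypothetical penalties: 1 per mismatch, 2 to open a gap, 0.5 per extra position.
D, S = distance_and_similarity(
    "GCGGCCCATCAGGTAGTTGGTG-G",
    "GCGTTCCATC--CTGGTTGGTGTG",
    mismatch_penalty=1,
    gap_penalty=lambda k: 2 + 0.5 * (k - 1),
)
print(D, S)
```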
46The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.
47Three main systems:
(1) Fixed gap-penalty system: the gap-extension cost is 0.
(2) Linear gap-penalty system: the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system: the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly.
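The three systems can be written as functions of the gap length k. This is a sketch under assumptions: the parameter names `opening` and `extension` are illustrative, and the logarithmic form shown (opening penalty plus a constant times the natural logarithm of k) is one plausible reading of the description, not a formula given on the slide.

```python
from math import log


def fixed_gap_penalty(k, opening):
    """Every gap costs the same, whatever its length."""
    return opening


def linear_gap_penalty(k, opening, extension):
    """Opening cost plus a constant for each position beyond the first."""
    return opening + extension * (k - 1)


def logarithmic_gap_penalty(k, opening, extension):
    """Extension cost grows with ln(k), so long gaps are penalized more gently."""
    return opening + extension * log(k)


# Hypothetical parameter values, chosen only to compare the three shapes.
for k in (1, 2, 5, 20):
    print(
        k,
        fixed_gap_penalty(k, opening=2),
        linear_gap_penalty(k, opening=2, extension=0.5),
        round(logarithmic_gap_penalty(k, opening=2, extension=0.5), 3),
    )
```

For a gap of length 1 all three systems give the opening penalty alone; they diverge only as the gap lengthens.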
49Further complications: Distinguishing among different matches and mismatches. For example, a mismatched pair consisting of Leu and Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg and Glu, which are very dissimilar from each other.
50(Figure: one mismatched pair is given a lesser penalty than another.)
51Alignment algorithms
52Aim: Find the alignment associated with the smallest D (or largest S) from among all possible alignments.
53The number of possible alignments may be astronomical. For example, when two sequences 300 residues long each are compared, there are 10^88 possible alignments. In comparison, the number of elementary particles in the universe is only 10^80.
54There are computer algorithms for finding the
optimal alignment between two sequences that do
not require an exhaustive search of all the
possibilities.
55The Needleman-Wunsch algorithm uses dynamic programming
56Dynamic programming: a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
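The three conditions map onto a global-alignment table as follows: the first row and column are trivial (a prefix aligned against nothing is all gaps), each other cell depends on only three neighbouring cells, and the bottom-right cell holds the score of the full alignment. The sketch below is a common textbook simplification, not necessarily the scoring used in these slides: it maximizes a similarity score with the same penalty for every gap position and no separate gap-opening cost, and the default scores are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # (1) Trivial initial stage: a prefix against an empty sequence is all gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # (2) Each cell is computed from only three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # a[i-1] opposite a gap
                score[i][j - 1] + gap,                   # b[j-1] opposite a gap
            )

    # (3) The last cell holds the overall solution; trace back to recover it.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("CTGGGCCCAGATC", "AACAGGGCCCAAATC")
print(best)
print(top)
print(bottom)
```

The table has (m + 1)(n + 1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.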
57Multiple Sequence Alignment
58Alignments can be easy or difficult
Easy:
GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA
Difficult:
TTGACATG CCGGGG---A AACCG
T-GACATG CCGGTG--GT AAGCC
TTGGCATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG
60Multiple Alignment
- 2 methods
- Dynamic programming (exhaustive, exact)
- Consider 2 protein sequences of 100 amino acids in length.
- If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 to align 4 sequences, etc.
- More time than the universe has existed to align 20 sequences exhaustively.
- Progressive alignment (heuristic, approximate)
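The arithmetic behind this comparison can be checked directly. The sketch below takes the slide's premise at face value (k sequences of length 100 need 100^k seconds) and assumes an age of the universe of roughly 13.8 billion years, a figure that does not appear in the slides. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate value, an assumption not in the slides
AGE_OF_UNIVERSE_SECONDS = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR


def exhaustive_seconds(num_sequences, length=100):
    """Time for exhaustive alignment under the premise length ** num_sequences."""
    return length ** num_sequences


def first_count_exceeding_universe_age(length=100):
    """Smallest number of sequences whose exhaustive alignment outlasts the universe."""
    k = 2
    while exhaustive_seconds(k, length) <= AGE_OF_UNIVERSE_SECONDS:
        k += 1
    return k


print(exhaustive_seconds(20) / AGE_OF_UNIVERSE_SECONDS)
print(first_count_exceeding_universe_age())
```

Under these assumptions the limit is reached well before 20 sequences, which is why exhaustive multiple alignment is reserved for very small problems.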
61Progressive Alignment
- Devised by Feng and Doolittle in 1987.
- Essentially a heuristic method and as such is not guaranteed to find the optimal alignment.
- Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) pairwise alignments as a starting point.
- Most successful implementation is Clustal (Des Higgins).
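The number of starting pairwise alignments is the number of unordered pairs among n sequences, which sums to n(n-1)/2. A small check, with a hypothetical function name and placeholder sequence labels:

```python
from itertools import combinations


def pairwise_alignment_count(n):
    """(n-1) + (n-2) + ... + 1, written in closed form."""
    return n * (n - 1) // 2


labels = ["seq1", "seq2", "seq3", "seq4", "seq5"]  # placeholder names
pairs = list(combinations(labels, 2))
print(len(pairs), pairwise_alignment_count(len(labels)))
```

The count grows quadratically, so this step stays cheap compared with exhaustive multiple alignment.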
62Overview of Clustal Procedure
1. Quick pairwise alignments
2. Distances for each pair
3. Distance matrix:
Hbb_Human 1 -
Hbb_Horse 2 .17 -
Hba_Human 3 .59 .60 -
Hba_Horse 4 .59 .59 .13 -
Myg_Whale 5 .77 .77 .75 .75 -
Neighbor-joining tree (guide tree) with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale (the numbers 1 to 4 label its nodes).
Progressive alignment following guide tree:
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
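The step from distance matrix to guide tree can be illustrated with the distances shown above. Note the substitution: the slide names neighbor-joining, whereas the sketch below uses UPGMA (average-linkage clustering), a simpler method chosen here only to show how a join order is derived from pairwise distances. The function name and data layout are hypothetical.

```python
def upgma_join_order(names, distances):
    """distances maps frozenset({a, b}) to the distance between leaves a and b."""
    clusters = {name: (name,) for name in names}  # cluster label -> member leaves
    joins = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Average distance over every leaf pair spanning the two clusters.
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                mean = sum(distances[frozenset(pair)] for pair in pairs) / len(pairs)
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        mean, a, b = best
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        joins.append((a, b, mean))
    return joins


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
lower_triangle = [[], [0.17], [0.59, 0.60], [0.59, 0.59, 0.13], [0.77, 0.77, 0.75, 0.75]]
distances = {
    frozenset((names[i], names[j])): d
    for i, row in enumerate(lower_triangle)
    for j, d in enumerate(row)
}

joins = upgma_join_order(names, distances)
for a, b, mean in joins:
    print(f"join {a} + {b} at {mean:.4f}")
```

The closest pairs are joined first, and a progressive aligner would align sequences and groups in that same order.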
63Clustal good points/bad points
- Advantages
- Speed.
- Disadvantages
- No way of knowing if the alignment is correct.
64Effect of gap penalties on amino-acid alignment: Human pancreatic hormone precursor versus chicken pancreatic hormone.
(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is w_k = 1 + 0.1k.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.
65An Alignment
GCGGCTCA TCAGGTAGTT GGTG-G Spinach
GCGGCCCA TCAGGTAGTT GGTG-G Rice
GCGTTCCA TC--CT-GTT GGTGTG Mosquito
GCGTCCCA TCAGCTAGTT GTTG-G Monkey
GCGGCGCA TTAGCTAGTT GGTG-A Human