Title: blaat
1Ronald L. Westra, Department of Mathematics,
Maastricht University
Introduction to BioInformatics III
Maastricht, February 7, 2006
2LECTURE 3. Sequences and sequence alignments
BB-C1, C9
3Items in this Lecture
- Molecular Evolution
- Basic problem of sequence alignment
- Sequence alignment
- Dynamic Programming
- BLAST
- Representation HMM and Probabilistic Graphs
41. Molecular Evolution
5Molecular Evolution
6Molecular Evolution
7Molecular Evolution
8Molecular Evolution
9Molecular Evolution
10Molecular Evolution
A homeobox gene, HOX-A1, has been implicated as a
key to autism (Patricia Rodier, Scientific
American, February 2000).
11Homeobox gene HOX-A1 and Autism
12Principles of Molecular Evolution
13Principles of Molecular Evolution
142. The basic problem of sequence alignment
15The biological problem of sequence alignment
query    tcctctgcctctgccatcat---caaccccaaagt
database tcctgtgcatctgcaatcatgggcaaccccaaagt
17Sequence alignment - definition
Sequence alignment is an arrangement of two or
more sequences, highlighting their similarity.
The sequences are padded with gaps (dashes) so
that, wherever possible, columns contain identical
characters from the sequences involved.
tcctctgcctctgccatcat---caaccccaaagt
tcctgtgcatctgcaatcatgggcaaccccaaagt
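As a small illustration of the definition, the columns of the alignment shown on this slide can be classified as matches, mismatches, or gaps. This is only a sketch; the function name `alignment_columns` is an assumption made for this example and is not part of the lecture.

```python
def alignment_columns(row_a: str, row_b: str) -> dict:
    """Count match, mismatch and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["gap"] += 1        # insertion or deletion
        elif a == b:
            counts["match"] += 1      # identical characters
        else:
            counts["mismatch"] += 1   # substitution
    return counts


# The two aligned rows from the slide.
query = "tcctctgcctctgccatcat---caaccccaaagt"
database = "tcctgtgcatctgcaatcatgggcaaccccaaagt"
print(alignment_columns(query, database))
```

For these two rows the counts are 29 matching columns, 3 mismatches and 3 gap columns.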
18Sequence alignment - meaning
Sequence alignment is used to study the evolution
of sequences, such as protein or DNA sequences,
from a common ancestor. Mismatches in the
alignment correspond to mutations, and gaps
correspond to insertions or deletions.
Sequence alignment also refers to the process
of constructing significant alignments in a
database of potentially unrelated sequences.
19Special Alignments - promoter
In genetics, a promoter is a DNA sequence that
enables a gene to be transcribed. The promoter
is recognized by RNA polymerase, which then
initiates transcription. In RNA synthesis,
promoters are a means to demarcate which genes
should be used for messenger RNA creation - and,
by extension, control which proteins the cell
manufactures. The perfect promoter is called a
canonical sequence.
20Special Alignments - promoter
<-- upstream                      downstream -->
5'-XXXXXXXPPPPPXXXXXXPPPPPPXXXXGGGGGGGXXXX-3'
          -35        -10       Gene to be transcribed
The optimal spacing between the -35 and -10
sequences is 17 nt.
21Special Alignments - promoter
Probability of occurrence of each nucleotide

for the -10 sequence:
  T    A    T    A    A    T
 77%  76%  60%  61%  56%  82%

for the -35 sequence:
  T    T    G    A    C    A
 69%  79%  61%  56%  54%  54%

Example: the TATA box (sequence TATAAA).
223. Sequence Alignment
23Pairwise alignment
Pairwise sequence alignment methods are concerned
with finding the best-matching piecewise local or
global alignments of protein (amino acid) or DNA
(nucleic acid) sequences. Typically, the purpose
of this is to find homologues (relatives) of a
gene or gene-product in a database of known
examples. This information is useful for
answering a variety of biological questions:
1. The identification of sequences of unknown
structure or function.
2. The study of molecular evolution.
24Global alignment
A global alignment between two sequences is an
alignment in which all the characters in both
sequences participate in the alignment. Global
alignments are useful mostly for finding
closely-related sequences. As these sequences
are also easily identified by local alignment
methods, global alignment is now somewhat
deprecated as a technique. Further, there are
several complications in molecular evolution
(such as domain shuffling) which limit the
usefulness of these methods.
25Local alignment
Local alignment methods find related regions
within sequences - they can consist of a subset
of the characters within each sequence. For
example, positions 20-40 of sequence A might be
aligned with positions 50-70 of sequence
B. This is a more flexible technique than global
alignment and has the advantage that related
regions which appear in a different order in the
two proteins (which is known as domain shuffling)
can be identified as being related. This is not
possible with global alignment methods.
27Significance of alignment
The basic assumption behind alignment is the
mechanism of molecular evolution. DNA carries
genetic material from generation to generation by
the mechanism of duplication. Changes in the
material are introduced by occasional errors
and mutations in the duplication, and by
viruses, which can move sub-sequences within the
chromosome and between individuals. An
alignment between sequences suggests that the
sequences evolved from a common ancestor which
contained the matching subsequences.
28Significance of alignment
Using assumptions about the probabilities of
these change events, we can estimate the time
when sequences diverged from a common ancestor or
the time required for changing one sequence into
another. However, there is disagreement about
the value and nature of these probabilities for
biological evolution. One school of thought
(gradualism) assumes a simple, constant rate of
change, while the other school (punctuated
equilibrium) assumes short evolutionary periods
in which the rate of change was extremely high.
29Significance of alignment
The actual biological meaning of any alignment
can never be absolutely guaranteed. Statistical
methods can be used to assess the likelihood of
finding an alignment between two regions (or
sequences) by chance, given the size of the
database and its composition. Two important
related issues for sequence alignment are:
1. How to choose the best alignment between two
sequences (or regions)?
2. How to rank the alignments between a query and
a database according to their significance (such
as biological significance)?
31The meaning of alignment
The evolutionary models are derived empirically
from related sequences, and are expressed as
substitution matrices. These matrices, along with
gap penalties, are used by the algorithms to
evaluate alternative alignments between two
sequences. The actual biological quality of the
alignments then depends upon the evolutionary
model used to generate the score. Pairwise
alignment programs such as BLAST use simulation
to estimate the parameters of the score
distribution given a particular query, database,
substitution matrix and certain other parameters.
Alignments can then be given a statistical
significance value, allowing inference of
possible relationships between sequences.
32Structural alignment
In structural alignments the emphasis is on
amino-acid sequence rather than on nucleotide
sequence. Because protein structure is more
conserved through evolution than nucleotide
sequence, structural alignments are more reliable
over long evolutionary distances, when the
sequences have diverged so much that simple
sequence comparison cannot detect their
similarity. Structure-based sequence alignments
are useful for identifying structurally conserved
regions among a family of closely or distantly
related proteins when visualization software is
available.
33Multiple alignment
Multiple alignment is an extension of pairwise
alignment to incorporate more than two sequences
into an alignment. Multiple alignment methods try
to align all of the sequences in a specified set.
Alignments help in the identification of common
regions between the sequences. There are several
approaches to creating multiple sequence
alignments, one of the most popular being the
progressive alignment strategy used by the
Clustal family of programs. Clustal is used in
cladistics to build phylogenetic trees, and to
build sequence profiles which are used by
PSI-BLAST and Hidden Markov model- (HMM-) methods
to search sequence databases for more distant
relatives. Multiple sequence alignment is
computationally difficult and is classified as an
NP-Hard problem.
34Algorithms
- Needleman-Wunsch: Pairwise global alignment
only.
- Smith-Waterman: Pairwise, local or global
alignment.
- Framesearch: This is an extension of
Smith-Waterman, for pairwise alignment between a
protein sequence and a nucleotide sequence.
35The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm (1970, J Mol Biol.
48(3):443-53) performs a global alignment of two
sequences (A and B) and is applied to align
protein or nucleotide sequences. The
Needleman-Wunsch algorithm is an example of
dynamic programming, and is guaranteed to find
the alignment with the maximum score. Scores
for aligned characters are specified by a
similarity matrix S, where S(i,j) is the
similarity of characters i and j. Gaps are scored
with a linear gap penalty called d.
36The Smith-Waterman algorithm
The Smith-Waterman algorithm (1981) determines
similar regions between two nucleotide or protein
sequences. Smith-Waterman is also a dynamic
programming algorithm and builds on
Needleman-Wunsch. As such, it has the
desirable property that it is guaranteed to find
the optimal local alignment with respect to the
scoring system being used (which includes the
substitution matrix and the gap-scoring scheme).
However, the Smith-Waterman algorithm is
demanding of time and memory resources: in order
to align two sequences of lengths m and n, O(mn)
time and space are required. As a result, it
has largely been replaced in practical use by the
BLAST algorithm; although not guaranteed to find
optimal alignments, BLAST is much more efficient.
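The local-alignment idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the match/mismatch/gap values (2, -1, -2) and the function name `smith_waterman` are assumptions chosen for the example, and a real application would use a substitution matrix. The key differences from global alignment are that cell values never drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # O(nm) time and space
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # a[i-1] against a gap
                          H[i][j - 1] + gap)      # b[j-1] against a gap
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

With these assumed scores, the shared region ACG is recovered with score 6 even though the flanking characters of the two sequences do not match.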
37Dynamic Programming Approach to Sequence Alignment
The dynamic programming approach to sequence
alignment always builds on the best prior result
so far. It tries to align two sequences by
inserting gaps at different locations, so as to
maximize the score of the alignment.
The score is determined by a "match award", a
"mismatch penalty" and a "gap penalty". The
higher the score, the better the alignment. If
both penalties are set to 0, it finds an
alignment with the maximum number of matches.
Maximum match: the largest number of matches one
sequence can have with another, allowing all
possible deletions.
It is used to compare two DNA or protein
sequences, to predict the similarity of their
functions.
Examples: Needleman-Wunsch (1970), Sellers (1974),
Smith-Waterman (1981)
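The "maximum match" case, with a match award of 1 and both penalties set to 0, can be sketched as a small dynamic program; it is equivalent to computing the length of the longest common subsequence. The function name `max_matches` is an assumption made for this illustration. Each cell is built from the best prior results, and only two rows are kept in memory because only the count is needed.

```python
def max_matches(a: str, b: str) -> int:
    """Largest number of matching columns over all gapped alignments
    of a and b (match award 1, mismatch and gap penalties 0)."""
    prev = [0] * (len(b) + 1)          # row for the empty prefix of a
    for ch_a in a:
        curr = [0]                      # column for the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)              # extend with a match
            else:
                curr.append(max(prev[j], curr[j - 1]))    # skip a character
        prev = curr
    return prev[-1]


print(max_matches("AGCAT", "GAC"))
```

For AGCAT and GAC at most two characters can be matched (for example G and A).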
38Sequence alignment software
- SSearch: Implements the standard Smith-Waterman
algorithm. It is considerably slower than the
more modern BLAST and FASTA methods. However,
Smith-Waterman remains the gold standard for
protein-protein or nucleotide-nucleotide pairwise
alignment; its speed can be improved by using
specialized hardware or a computer cluster.
- BLAST (Basic Local Alignment Search Tool): This
method uses a pre-computed hash table to serve as
an index for short sequences. Given a query
sequence, its sub-sequences are looked up in the
index to reduce the amount of time and searching
involved. Several parameters need to be provided
to make this method faster or more accurate. Once
patterns that match the search sequence are
found, more accurate and intensive algorithms may
be applied. BLAST performs a pairwise local
search and uses a number of methods to increase
the speed of the original Smith-Waterman
algorithm.
39Sequence alignment software
- FASTA: Pairwise local search. Much slower but
more sensitive than BLAST. Recent versions of the
FASTA suite include specialized algorithms for
translated searches that can align low-quality
nucleotide sequence to protein sequence despite
large numbers of indels (translated BLAST
searches do not handle insertion or deletion
errors very well).
- Clustal: Progressive multiple alignment method.
Comes in several varieties (ClustalW, ClustalX
etc.).
40The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm (1970, J Mol Biol.
48(3):443-53) performs a global alignment of two
sequences (A and B) and is applied to align
protein or nucleotide sequences. The
Needleman-Wunsch algorithm is an example of
dynamic programming, and is guaranteed to find
the alignment with the maximum score. Scores
for aligned characters are specified by a
similarity matrix S, where S(i,j) is the
similarity of characters i and j. Gaps are scored
with a linear gap penalty called d.
41The Needleman-Wunsch algorithm
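For example, if the similarity matrix was

      A    G    C    T
 A   10   -1   -3   -4
 G   -1    7   -5   -3
 C   -3   -5    9    0
 T   -4   -3    0    8

then the alignment
AGACTAGTTAC
CGA---GACGT
with a gap penalty of d = -5, would have the
following score:
S(A,C) + S(G,G) + S(A,A) + 3d + S(G,G) + S(T,A)
+ S(T,C) + S(A,G) + S(C,T)
= -3 + 7 + 10 - 15 + 7 - 4 + 0 - 1 + 0 = 1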
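The score of this example alignment can be checked with a short script. It is only a sketch: the similarity matrix and the gap penalty are the ones given on the slide, while the nested-dictionary layout and the function name `alignment_score` are assumptions made for the example.

```python
# Similarity matrix from the slide, indexed as S[x][y].
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
d = -5  # linear gap penalty (a negative number that is added per gap column)


def alignment_score(row_a, row_b, S, d):
    """Sum the column scores of a given alignment: d for a gap column,
    S[a][b] for a column that pairs two characters."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        total += d if "-" in (a, b) else S[a][b]
    return total


score = alignment_score("AGACTAGTTAC", "CGA---GACGT", S, d)
print(score)
```

The three gap columns contribute 3 x (-5) = -15 and the eight paired columns contribute 16, giving a total of 1.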
42The Needleman-Wunsch algorithm
To find the alignment with the highest score, a
two-dimensional array (or matrix) is allocated.
This matrix is often called the F matrix, and its
(i,j)th entry is often denoted Fij. There is one
column for each character in sequence A, and one
row for each character in sequence B. Thus, if we
are aligning sequences of sizes n and m, the
running time of the algorithm is O(nm) and the
amount of memory used is O(nm). (However,
there is a modified version of the algorithm
which uses only O(m + n) space, at the cost of a
higher running time. This modification is in
fact a general technique which applies to many
dynamic programming algorithms; this method was
introduced in Hirschberg's algorithm for solving
the longest common subsequence problem.) As the
algorithm progresses, Fij is assigned the
optimal score for the alignment of the
first i characters in A and the first j
characters in B. The principle of optimality is
then applied as follows.
43The Needleman-Wunsch algorithm
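Basis: $F_{00} = 0$, $F_{0j} = 0$, $F_{i0} = 0$.
With this basis, gaps at the start of either
sequence are not penalized; to penalize them like
any other gap, use $F_{0j} = d \cdot j$ and
$F_{i0} = d \cdot i$ instead.
Recursion, based on the principle of optimality:
$F_{ij} = \max(F_{i-1,j-1} + S(A_i, B_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d)$
44The Needleman-Wunsch algorithm
Pseudo-code for computing the F matrix (the gap
penalty d is a negative number, so it is added):
  for i = 0 to length(A)
    F(i,0) <- 0
  for j = 0 to length(B)
    F(0,j) <- 0
  for i = 1 to length(A)
    for j = 1 to length(B)
      Choice1 <- F(i-1,j-1) + S(A(i-1), B(j-1))
      Choice2 <- F(i-1,j) + d
      Choice3 <- F(i,j-1) + d
      F(i,j) <- max(Choice1, Choice2, Choice3)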
45The Needleman-Wunsch algorithm
Once the F matrix is computed, note that the
bottom right hand corner of the matrix is the
maximum score for any alignment. To compute
which alignment actually gives this score, you
can start from the bottom right cell, and compare
the value with the three possible
sources (Choice1, Choice2, and Choice3 above) to
see which it came from. If it was Choice1, then
A(i) and B(j) are aligned; if it was Choice2, then
A(i) is aligned with a gap; and if it was
Choice3, then B(j) is aligned with a gap. Move to
the cell the value came from and repeat until the
top left cell is reached.
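The matrix fill and the traceback described above can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the lecture's own implementation: the function name `needleman_wunsch` is made up for the example, the similarity matrix and d = -5 are taken from the earlier example slide, and the boundary cells are set to cumulative gap penalties (F(i,0) = i*d, F(0,j) = j*d) so that leading gaps are scored like any other gap.

```python
# Similarity matrix and gap penalty from the earlier example slide.
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
d = -5


def needleman_wunsch(a, b, S, d):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: leading gaps cost d each.
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    # Fill the matrix using the three choices of the recursion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[a[i - 1]][b[j - 1]],  # Choice1
                          F[i - 1][j] + d,                          # Choice2
                          F[i][j - 1] + d)                          # Choice3
    # Trace back from the bottom right cell to the top left cell.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S[a[i - 1]][b[j - 1]]:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("AGT", "AT", S, d))
```

For the short sequences AGT and AT the best global alignment places G against a gap, scoring 10 - 5 + 8 = 13. When several alignments share the maximum score, the order of the checks in the traceback decides which one is returned.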
46The Needleman-Wunsch algorithm
57Conclusions