Title: Sequence Alignment
1Sequence Alignment
Motivation
- Storing, retrieving and comparing DNA sequences
in Databases. - Comparing two or more sequences for similarities.
- Searching databases for related sequences and
subsequences. - Exploring frequently occurring patterns of
nucleotides. - Finding informative elements in protein and DNA
sequences. - Various experimental applications (reconstruction
of DNA, etc.)
2Seq.Align. Protein Function
Protein1
Protein2
More than 25 sequence identity
?
Similar function
3Exact Pattern Matching
- Given a pattern P of length m and a (longer)
string T of length n, find all the occurrences of
P in T. - Naïve algorithm O(mn)
- Boyer-Moore, Knuth-Pratt-Morris O(nm)
4Alignment - inexact matching
- Substitution - replacing a sequence base by
another. - Insertion - an insertion of a base (letter)
or several bases to the sequence. - Deletion - deleting a base (or more) from
the sequence. - (Insertion and deletion are the reverse of one
another)
5Seq. Align. Score
Commonly used matrices PAM250, BLOSUM64
6Global Alignment
Global Alignment INPUT Two sequences S and T of
roughly the same length. QUESTION What is the
maximum similarity between them?
Find one of the best alignments.
7The IDEA
s1n
t1m
To align s1...i with t1j we have three
choices align s1i-1 with t1j-1 and match
si with tj align s1i with t1j-1 and
match a space with tj align s1i-1 with
t1j and match si with a space
s1i-1 i
t1j-1 j
s1 i -
t1j-1 j
s1i-1 i
t1 j -
8Recursive Relation
Define scoring matrix m(a,b) a,b ? ? U
- Define Hij the best score of alignment
between s1i and t1j for 1 lt i lt n, 1 lt
j lt m Hi-1j-1 m(si,tj) Hij
max Hij-1 m(-,tj) Hi-1j
m(si,-) Hi0 ?0,k m(sk,-) H0j ?0,k
m(-,tk)
Optimal alignment score Hnm
Needleman-Wunsch 1970
9t
I T
s3 t4
I -
s3 -
- T
- t4
s
10(No Transcript)
11Local Alignment
Local Alignment INPUT Two sequences S and T
. QUESTION What is the maximum similarity
between a subsequence of S and a
subsequence of T ? Find most
similar subsequences.
12Recursive Relation
for 1 lt i lt n, 1 lt j lt m Hi-1j-1
m(si,tj) Hij-1
m(-,tj) Hi-1j m(si,-)
0 Hi0 0 H0j 0
Hij max
Optimal alignment score maxij Hij
Smith-Waterman 1981
Penalties should be negative
13Sequence Alignment
Complexity Time O(nm) Space O(nm) (exist
algorithm with O(min(n,m)) )
14Ends free alignment
Ends free alignment
INPUT Two equences
S and T (possibly of different length).
QUESTION Find one of the best alignments between
subsequences of S and T when at least one of
these subsequences is a prefix of the original
sequence and one (not necessarily the other) is a
suffix.
or
15Gap Alignment
Definition A gap is the maximal contiguous run
of spaces in a single sequence within a given
alignment.The length of a gap is the number of
indel operations on it. A gap penalty function is
a function that measure the cost of a gap as a
(nonlinear) function of its length.
Gap penalty
INPUT Two sequences S and T (possibly of
different length). QUESTION Find one of the
best alignments between the two
sequences using the gap penalty function.
Affine Gap
Wtotal Wg qWs Wg weight to open the gap Ws
weight to extend the gap
16What Kind of Alignment to Use?
- The same protein from the different organisms.
- Two different proteins sharing the same function.
- Protein domain against a database of complete
proteins. - Protein against a database of small patterns
(functional units) - ...
17Sequence Alignment vs. Database
- Task Given a query sequence and millions of
database records, find the optimal alignment
between the query and a record
ACTTTTGGTGACTGTAC
18Sequence Alignment vs. Database
- Tool Given two sequences,there exists an
algorithm tofind the best alignment. - Naïve Solution Apply algorithm to each of the
records, one by one
19Sequence Alignment vs. Database
- Problem An exact algorithm is too slowto run
millions of times (even linear time algorithm
will run slowly on a huge DB) - Solution
- Run in parallel (expensive).
- Use a fast (heuristic) method to discard
irrelevant records. Then apply the exact
algorithm to the remaining few.
20Sequence Alignment vs. Database
- General Strategy of Heuristic Algorithms
- Homologous sequences are expected to contain
un-gapped (at least) short segments (probably
with substitutions, but without ins/dels) - Preprocess DB into some fast access data
structure of short segments.
21FASTA Idea
- Idea a good alignment probably matches some
identical words (ktups) - Example
- Database record
- ACTTGTAGATACAAAATGTG
- Aligned query sequence
- A-TTGTCG-TACAA-ATCTGT
- Matching words of size 4
22Dictionaries of Words
- ACTTGTAGATAC Is translated to the dictionary
- ACTT,
- CTTG,
- TTGT,
- TGTA
- Dictionaries of well aligned sequences share
words.
23FASTA Stage I
- Prepare dictionary for db sequence (in advance)
- Upon query
- Prepare dictionary for query sequence
- For each DB record
- Find matching words
- Search for long diagonal runs of matching words
- Init-1 score longest run
- Discard record if low score
matching word
Position in DB record
Position in query
24FASTA stage II
- Good alignment path through many runs,
withshort connections - Assign weights to runs()and connections(-)
- Find a path of max weight
- Init-n score total path weight
- Discard record if low score
25FASTA Stage III
- Improve Init-1. Apply an exact algorithm around
Init-1 diagonal within a given width band. - Init-1 Opt-score new weight
- Discard record if low score
26FASTA final stage
- Apply an exact algorithm to surviving records,
computing the final alignment score.
27BLAST (Basic Local Alignment Search Tool)
Approximate Matches
- BLAST
- Words are allowed to contain inexact matching.
- Example
- In the polypeptide sequence IHAVEADREAM
- The 4-long word HAVE starting at position 2 may
match - HAVE,RAVE,HIVE,HALE,
28Approximate Matches
- For each word of length w from a Data Base
generate all similar words. - Similar means score( word, word ) gt T
- Store all similar words in a look-up table.
29DB search
- 1) For each word of length w from a query
sequence generate all similar words. - 2) Access DB.
- 3) Each hit extend as much as possible -gt
High-scoring Segment Pair (HSP) - score(HSP) gt V
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
30DB search (2)
4) Around HSP perform DP. At each step
alignment score should be gt T
s-db
starting point (seed pair)
s-query
31Homework
- Write a cgi script (using Perl) that performs
pairwise Local/Global Alignment for DNA
sequences. All I/O is via HTML only. - Input
- Choice for Local/Global alignment.
- Two sequences text boxes.
- Values for match, mismatch, ins/dels.
- Number of iterations for computing random scores.
- Output
- Alignment score.
- z-score value (z (score-average)/standard
deviation.)
Remarks 1) you are allowed to use only linear
space. 2)To compute z-score perform random
shuffling srand(time ) init,
-proc.id int(rand(i)) returns rand. number
between 0,i. 3)Shuffling is done in windows
(non-overlapping) of 10 bases length. Number of
shuffling for each window is random 0,10.