Title: EE3J2 Data Mining Lecture 12: Sequence Analysis (2) Martin Russell
2 Objectives
- Revise dynamic programming
- Examples
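3 Alignment path
[Figure: an alignment path through a grid, with the sequence A C X C C D along one axis and the sequence A B C D along the other]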
4 Accumulated Distance
- The accumulated distance along a path p is the sum of the distances along its length
- Large accumulated distance: poor matches between symbols, so a poor path
- Small accumulated distance: good matches between symbols, so a good path
- The path with the smallest accumulated distance is called the optimal path
- The optimal path is computed using Dynamic Programming
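
To make the definition concrete, here is a minimal Python sketch that sums the symbol distances along two hand-picked paths for the sequences A B C D and A C X C C D. The function names, the 0/1 symbol distance, and both paths are illustrative assumptions, not part of the lecture material.

```python
def symbol_distance(a, b):
    """Assumed 0/1 distance: 0 if the symbols match, 1 otherwise."""
    return 0 if a == b else 1


def accumulated_distance(path, s1, s2):
    """Sum the symbol distances over the grid points visited by a path.

    path is a list of (i, j) pairs meaning s1[i] is aligned with s2[j].
    """
    return sum(symbol_distance(s1[i], s2[j]) for i, j in path)


s1, s2 = "ABCD", "ACXCCD"

# A path that stays close to the diagonal (hypothetical example).
good_path = [(0, 0), (1, 1), (2, 2), (2, 3), (2, 4), (3, 5)]
# A path that runs along the top row and then straight down (hypothetical example).
poor_path = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 5), (2, 5), (3, 5)]

print(accumulated_distance(good_path, s1, s2))
print(accumulated_distance(poor_path, s1, s2))
```

The first path accumulates a much smaller distance than the second, which is what makes it the better of the two alignments.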
5 Dynamic Programming
[Figure: the alignment grid for the sequences A C X C C D and A B C D, with one grid point annotated "Accumulated distance to this point"]
6 Formally
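
One way to write the recursion, consistent with the worked edit-distance examples in this lecture. The notation is an assumption: d(i,j) is the distance between the i-th symbol of the vertical sequence and the j-th symbol of the horizontal sequence, ad(i,j) is the accumulated distance to the point (i,j), and K_DEL and K_INS are taken to penalise vertical and horizontal steps respectively.

```latex
\begin{aligned}
ad(1,1) &= d(1,1) \\
ad(i,j) &= d(i,j) + \min \bigl\{\, ad(i-1,j-1),\;\; ad(i-1,j) + K_{DEL},\;\; ad(i,j-1) + K_{INS} \,\bigr\}
\end{aligned}
```

Terms that would refer to points outside the grid are left out of the minimum.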
7 Example application: sequence retrieval
Corpus of sequential data
query sequence Q
AAGDTDTDTDD AABBCBDAAAAAAA BABABABBCCDF GGGGDDG
DGDGDGDTDTD DGDGDGDGD AABCDTAABCDTAABCDTAAB CDCDCD
TGGG GGAACDTGGGGGAAA . .
BBCCDDDGDGDGDCDTCDTTDCCC
8 Example: Edit Distance
S1 = AABCD, S2 = ABCCD, KDEL = 0, KINS = 0

Distance matrix
     A  B  C  C  D
A    0  1  1  1  1
A    0  1  1  1  1
B    1  0  1  1  1
C    1  1  0  0  1
D    1  1  1  1  0

Accumulated distance matrix
     A  B  C  C  D
A    0  1  2  3  4
A    0  1  2  3  4
B    1  0  1  2  3
C    2  1  0  0  1
D    2  1  1  1  0

Forward path matrix
     A  B  C  C  D
A    \  _  _  _  _
A    |  _  _  _  _
B    |  \  _  _  _
C    |  |  \  _  _
D    |  |  |  |  \

Optimal path
[Figure: the forward path matrix with the optimal path highlighted]

Optimal alignment
AABCCD
AABCCD
9 Example 2: Edit Distance
S1 = AABCD, S2 = ABCCD, KDEL = 2, KINS = 2

Distance matrix
     A  B  C  C  D
A    0  1  1  1  1
A    0  1  1  1  1
B    1  0  1  1  1
C    1  1  0  0  1
D    1  1  1  1  0

Accumulated distance matrix
     A  B  C  C  D
A    0  3  6  9 12
A    2  1  4  7 10
B    5  2  2  5  8
C    8  5  2  2  5
D   11  8  5  3  2

Forward path matrix
     A  B  C  C  D
A    \  _  _  _  _
A    |  \  _  _  _
B    |  \  \  _  _
C    |  |  \  \  _
D    |  |  |  \  \

Optimal path
[Figure: the forward path matrix with the optimal path highlighted]

Optimal alignment
ABCCD
ABCCD
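
The accumulated distance matrix of Example 2 can be reproduced with a short Python sketch. This is not the course's edit-dist.c: the function name is hypothetical, and the choice to charge k_del for vertical steps and k_ins for horizontal steps is an assumption (the two penalties are equal in both examples, so the examples do not settle it).

```python
def edit_distance_matrices(s1, s2, k_del=0, k_ins=0):
    """Return (distance matrix, accumulated distance matrix).

    s1 runs down the rows and s2 along the columns, as in the examples.
    """
    rows, cols = len(s1), len(s2)
    d = [[0 if s1[i] == s2[j] else 1 for j in range(cols)] for i in range(rows)]
    ad = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            candidates = []
            if i > 0 and j > 0:
                candidates.append(ad[i - 1][j - 1])       # diagonal step
            if i > 0:
                candidates.append(ad[i - 1][j] + k_del)   # vertical step (assumed to cost k_del)
            if j > 0:
                candidates.append(ad[i][j - 1] + k_ins)   # horizontal step (assumed to cost k_ins)
            # The top-left point has no predecessor, so it only pays its own distance.
            ad[i][j] = d[i][j] + (min(candidates) if candidates else 0)
    return d, ad


d, ad = edit_distance_matrices("AABCD", "ABCCD", k_del=2, k_ins=2)
for row in ad:
    print(" ".join(f"{value:2d}" for value in row))
```

The bottom-right entry of the accumulated distance matrix is the edit distance between the two sequences under the chosen penalties.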
10 edit-dist.c
- New C program on course website
- Computes the edit distance between two sequences
- Prints out
  - Distance matrix
  - Forward accumulated distance matrix
  - Forward path matrix
  - Optimal path
  - Optimal alignment
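
The last two items, the optimal path and the optimal alignment, come from tracing back through the stored path information. The Python sketch below illustrates the idea; it is not the edit-dist.c program, the function name `align` is hypothetical, ties between equally good predecessors are broken arbitrarily, and vertical steps are assumed to cost k_del and horizontal steps k_ins.

```python
def align(s1, s2, k_del=0, k_ins=0):
    """Return (distance, optimal path, aligned s1, aligned s2)."""
    rows, cols = len(s1), len(s2)
    ad = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]   # best predecessor of each grid point
    for i in range(rows):
        for j in range(cols):
            options = []
            if i > 0 and j > 0:
                options.append((ad[i - 1][j - 1], (i - 1, j - 1)))    # diagonal
            if i > 0:
                options.append((ad[i - 1][j] + k_del, (i - 1, j)))    # vertical
            if j > 0:
                options.append((ad[i][j - 1] + k_ins, (i, j - 1)))    # horizontal
            cost, back[i][j] = min(options) if options else (0, None)
            ad[i][j] = cost + (0 if s1[i] == s2[j] else 1)

    # Trace back from the bottom-right corner to the top-left corner.
    path, point = [], (rows - 1, cols - 1)
    while point is not None:
        path.append(point)
        point = back[point[0]][point[1]]
    path.reverse()

    # A symbol is repeated whenever the path stays in the same row or column.
    aligned1 = "".join(s1[i] for i, _ in path)
    aligned2 = "".join(s2[j] for _, j in path)
    return ad[-1][-1], path, aligned1, aligned2


distance, path, a1, a2 = align("AABCD", "ABCCD")
print(distance)
print(a1)
print(a2)
```

With both penalties set to 0 this gives the alignment AABCCD against AABCCD, as in the first edit-distance example.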
11 edit-dist.c
- Format
  - edit-dist seq1 seq2 <Kdel> <Kins>
  - seq1 and seq2 are the sequences
  - <Kdel> and <Kins> are optional, default 0
12 Matching partial sequences
- In some applications the interest is in whether one sequence matches a subsequence of another sequence
- Example: Bioinformatics
  - Look for examples of a simple DNA sequence within a more complex sequence
  - Infer evolutionary relationship between two organisms
13 Partial alignment
- Simple intuitive solution is to allow Dynamic Programming to
  - Start at any point in the first row
  - End at any point in the final row
- Then proceed as before
- Unfortunately this has limitations
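
The simple solution described above can be sketched in Python as follows. The function name is hypothetical, the pattern ABBC and the sequence XZABCCYZ are taken from the matching-subsequence figure later in the lecture, and the tie-breaking (first best end point in the final row, arbitrary choice among equally good predecessors) is an assumption. It only illustrates the simple approach, not a remedy for its limitations.

```python
def best_partial_match(pattern, seq, k_del=0, k_ins=0):
    """Align pattern (rows) against any section of seq (columns).

    Returns (accumulated distance, best-matching section of seq).
    """
    rows, cols = len(pattern), len(seq)
    ad = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            options = []
            if i > 0:   # first-row points have no predecessor: a path may start at any of them
                if j > 0:
                    options.append((ad[i - 1][j - 1], (i - 1, j - 1)))   # diagonal
                options.append((ad[i - 1][j] + k_del, (i - 1, j)))       # vertical
                if j > 0:
                    options.append((ad[i][j - 1] + k_ins, (i, j - 1)))   # horizontal
            cost, back[i][j] = min(options) if options else (0, None)
            ad[i][j] = cost + (0 if pattern[i] == seq[j] else 1)

    # The path may end at any point in the final row: take the first best one.
    end = min(range(cols), key=lambda j: ad[rows - 1][j])

    # Trace back to the first row to find where the match starts.
    point = (rows - 1, end)
    while back[point[0]][point[1]] is not None:
        point = back[point[0]][point[1]]
    start = point[1]
    return ad[rows - 1][end], seq[start:end + 1]


print(best_partial_match("ABBC", "XZABCCYZ"))
```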
14 Finding matching sub-sequences
[Figure: DP grid with one point annotated "Start DP from here"]
15 Backwards Pass DP
16 Backwards Pass DP
- Starts in the bottom row, works right-to-left and bottom-to-top
- Otherwise, the backwards accumulated distance matrix and backwards path matrix calculations are analogous to forward-pass DP
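
A minimal Python sketch of a backward pass, assuming the simplest case in which the pass starts from the bottom-right corner and aligns the two whole sequences. The function name and the assignment of k_del to vertical steps and k_ins to horizontal steps are assumptions.

```python
def backward_accumulated(s1, s2, k_del=0, k_ins=0):
    """Backward accumulated distance matrix, filled right-to-left and bottom-to-top."""
    rows, cols = len(s1), len(s2)
    bad = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            candidates = []
            if i < rows - 1 and j < cols - 1:
                candidates.append(bad[i + 1][j + 1])       # diagonal step
            if i < rows - 1:
                candidates.append(bad[i + 1][j] + k_del)   # vertical step
            if j < cols - 1:
                candidates.append(bad[i][j + 1] + k_ins)   # horizontal step
            d = 0 if s1[i] == s2[j] else 1
            # The bottom-right point has no successor, so it only pays its own distance.
            bad[i][j] = d + (min(candidates) if candidates else 0)
    return bad


bad = backward_accumulated("AABCD", "ABCCD", k_del=2, k_ins=2)
print(bad[0][0])
```

In this whole-sequence setting the top-left entry of the backward matrix equals the bottom-right entry of the forward matrix, since both are the cost of the best complete path.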
17 Forward-backward DP
- Suppose that we have done a complete forward DP and a complete backward DP
- We will have two path matrices
  - Forward path matrix
  - Backward path matrix
- For any point in the bottom row we can trace back through the forward path matrix and recover a path ending in the top row
- For any point in the top row we can trace back through the backward path matrix and recover a path ending in the bottom row
18 Matching sub-sequences
Are the two paths the same? If so, then we have a matching subsequence
19 Matching subsequences
- If the same path is obtained both by tracing back through the forward path matrix and by tracing back through the backward path matrix, then the corresponding section of the horizontal sequence is called a matching subsequence
- The matching subsequences are those which achieve a good match with the vertical pattern
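
The definition above can be sketched in Python as follows. This is one possible reading rather than the lecture's own implementation: the forward pass is assumed to start freely anywhere in the top row, the backward pass freely anywhere in the bottom row, ties between equally good predecessors are broken arbitrarily, the default penalties of 1 are an arbitrary choice, and all function names are hypothetical.

```python
def dp_pass(pattern, seq, k_del, k_ins, step):
    """One DP pass over the grid: step=+1 is forward, step=-1 is backward.

    Returns the accumulated distances and, for every point, a link to its
    best neighbour in the previous row or column of the pass.
    """
    rows, cols = len(pattern), len(seq)
    first_row = 0 if step == 1 else rows - 1
    row_order = range(rows) if step == 1 else range(rows - 1, -1, -1)
    col_order = range(cols) if step == 1 else range(cols - 1, -1, -1)
    acc = [[0] * cols for _ in range(rows)]
    link = [[None] * cols for _ in range(rows)]
    for i in row_order:
        for j in col_order:
            options = []
            pi, pj = i - step, j - step
            if i != first_row:   # points in the starting row have no link
                if 0 <= pj < cols:
                    options.append((acc[pi][pj], (pi, pj)))          # diagonal
                options.append((acc[pi][j] + k_del, (pi, j)))        # vertical
                if 0 <= pj < cols:
                    options.append((acc[i][pj] + k_ins, (i, pj)))    # horizontal
            cost, link[i][j] = min(options) if options else (0, None)
            acc[i][j] = cost + (0 if pattern[i] == seq[j] else 1)
    return acc, link


def trace(link, start):
    """Follow the links from start until a point with no link is reached."""
    path, point = [], start
    while point is not None:
        path.append(point)
        point = link[point[0]][point[1]]
    return path


def matching_subsequences(pattern, seq, k_del=1, k_ins=1):
    rows, cols = len(pattern), len(seq)
    _, f_link = dp_pass(pattern, seq, k_del, k_ins, +1)
    _, b_link = dp_pass(pattern, seq, k_del, k_ins, -1)
    # Forward paths are traced bottom-to-top, so reverse them before comparing.
    forward = {tuple(reversed(trace(f_link, (rows - 1, j)))) for j in range(cols)}
    backward = {tuple(trace(b_link, (0, j))) for j in range(cols)}
    common = forward & backward
    # Each common path covers the columns from its first point to its last point.
    return sorted({seq[p[0][1]:p[-1][1] + 1] for p in common})


print(matching_subsequences("ABBC", "XZABCCYZ"))
```

Only paths found by both trace-backs are kept, so a section of the horizontal sequence is reported when the forward and backward passes agree on how it aligns with the pattern.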
20 Matching subsequences
[Figure: the horizontal sequence X Z A B C C Y Z aligned against the vertical pattern A B B C, with the matching subsequence marked]
We say that this subsequence most closely resembles the original sequence ABBC
21 Summary
- Revision of Dynamic Programming
- Examples: Edit distance
- Motivation for interest in optimal subsequences
- Forward and backward dynamic programming
- Matching subsequences: subsequences which most closely resemble a given sequence