EE3J2 Data Mining Lecture 12: Sequence Analysis (2) Martin Russell

1
EE3J2 Data Mining
Lecture 12: Sequence Analysis (2)
Martin Russell
2
Objectives
  • Revise dynamic programming
  • Examples

3
Alignment path
[Figure: alignment path between the sequences A C X C C D and A B C D]
4
Accumulated Distance
  • The accumulated distance along the path p is the
    sum of distances along its length
  • Large accumulated distance means poor matches
    between symbols, so a poor path
  • Small accumulated distance means good matches
    between symbols, so a good path
  • The path with the smallest accumulated distance
    is called the optimal path
  • Computed using Dynamic Programming
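
The definition above can be sketched in a few lines of Python. This is only an illustration: the 0/1 symbol distance, the function names and the two example paths are assumptions, and the sequences are the ones shown on the alignment path slide (A B C D taken as the vertical sequence, A C X C C D as the horizontal one).

```python
def local_distance(a, b):
    """Assumed 0/1 distance: 0 if the symbols match, 1 otherwise."""
    return 0 if a == b else 1


def accumulated_distance(path, vertical, horizontal):
    """Sum the local distances at every (row, column) point on the path."""
    return sum(local_distance(vertical[i], horizontal[j]) for i, j in path)


vertical = "ABCD"
horizontal = "ACXCCD"

# Two hypothetical alignment paths from the top-left to the bottom-right point.
path_a = [(0, 0), (1, 1), (1, 2), (2, 3), (2, 4), (3, 5)]
path_b = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 4), (3, 5)]

print(accumulated_distance(path_a, vertical, horizontal))  # 2
print(accumulated_distance(path_b, vertical, horizontal))  # 3
```

The first path has the smaller accumulated distance, so it is the better of the two; dynamic programming finds the best path without enumerating them all.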

5
Dynamic Programming
[Figure: DP grid for the sequences A C X C C D and A B C D, with the accumulated distance to the current point marked]
6
Formally
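One way to write the recurrence is sketched below under stated assumptions: d(i,j) is the distance between the i-th symbol of the vertical sequence and the j-th symbol of the horizontal sequence, ad(i,j) is the accumulated distance to the point (i,j), and K_DEL and K_INS are penalties added to vertical and horizontal steps respectively. The symbol names and the assignment of the penalties to step directions are assumptions; this form is chosen because it reproduces the accumulated distance values of Example 2 (KDEL = KINS = 2). Terms whose indices fall outside the grid are omitted.

```latex
\[
ad(i,j) = d(i,j) + \min
\begin{cases}
ad(i-1,j-1) \\
ad(i-1,j) + K_{DEL} \\
ad(i,j-1) + K_{INS}
\end{cases}
\qquad ad(1,1) = d(1,1)
\]
```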
7
Example application: sequence retrieval
Corpus of sequential data
query sequence Q
AAGDTDTDTDD AABBCBDAAAAAAA BABABABBCCDF GGGGDDG
DGDGDGDTDTD DGDGDGDGD AABCDTAABCDTAABCDTAAB CDCDCD
TGGG GGAACDTGGGGGAAA . .
BBCCDDDGDGDGDCDTCDTTDCCC
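A minimal sketch of retrieval by edit distance, assuming a 0/1 symbol distance and penalties kdel and kins added to vertical and horizontal steps. The corpus is a subset of the sequences on this slide; the query GGGDDG, the penalty values and the function names are hypothetical and chosen only for illustration.

```python
def edit_distance(s1, s2, kdel=0, kins=0):
    """Accumulated distance of the best full alignment of s1 (rows) with s2 (columns)."""
    prev = None
    for i, a in enumerate(s1):
        row = []
        for j, b in enumerate(s2):
            d = 0 if a == b else 1
            candidates = []
            if i > 0 and j > 0:
                candidates.append(prev[j - 1])        # diagonal step
            if i > 0:
                candidates.append(prev[j] + kdel)     # vertical step
            if j > 0:
                candidates.append(row[j - 1] + kins)  # horizontal step
            row.append(d + (min(candidates) if candidates else 0))
        prev = row
    return prev[-1]


def retrieve(query, corpus, kdel=0, kins=0):
    """Return (distance, sequence) pairs, closest match first."""
    return sorted((edit_distance(query, s, kdel, kins), s) for s in corpus)


corpus = ["AAGDTDTDTDD", "AABBCBDAAAAAAA", "BABABABBCCDF", "GGGGDDG",
          "DGDGDGDTDTD", "DGDGDGDGD", "CDCDCD", "TGGG"]

ranked = retrieve("GGGDDG", corpus, kdel=2, kins=2)
print(ranked[0])  # (2, 'GGGGDDG')
```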
8
Example: Edit Distance
S1 = AABCD (rows), S2 = ABCCD (columns), KDEL = 0, KINS = 0

Distance matrix
      A  B  C  C  D
  A   0  1  1  1  1
  A   0  1  1  1  1
  B   1  0  1  1  1
  C   1  1  0  0  1
  D   1  1  1  1  0

Accumulated distance matrix
      A  B  C  C  D
  A   0  1  2  3  4
  A   0  1  2  3  4
  B   1  0  1  2  3
  C   2  1  0  0  1
  D   2  1  1  1  0

Forward path matrix
      A  B  C  C  D
  A   \  _  _  _  _
  A   |  _  _  _  _
  B   |  \  _  _  _
  C   |  |  \  _  _
  D   |  |  |  |  \

Optimal alignment
  AABCCD
  AABCCD
9
Example 2: Edit Distance
S1 = AABCD (rows), S2 = ABCCD (columns), KDEL = 2, KINS = 2

Distance matrix
      A  B  C  C  D
  A   0  1  1  1  1
  A   0  1  1  1  1
  B   1  0  1  1  1
  C   1  1  0  0  1
  D   1  1  1  1  0

Accumulated distance matrix
      A  B  C  C  D
  A   0  3  6  9 12
  A   2  1  4  7 10
  B   5  2  2  5  8
  C   8  5  2  2  5
  D  11  8  5  3  2

Forward path matrix
      A  B  C  C  D
  A   \  _  _  _  _
  A   |  \  _  _  _
  B   |  \  \  _  _
  C   |  |  \  \  _
  D   |  |  |  \  \

Optimal alignment
  ABCCD
  ABCCD
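The accumulated distances of Example 2 can be reproduced with a short forward pass. This is a sketch in Python, not the course's edit-dist.c: the function name, the 0/1 symbol distance, the convention that KDEL is added to vertical steps and KINS to horizontal steps, and the tie-breaking order (horizontal, then vertical, then diagonal) are all assumptions.

```python
def forward_dp(s1, s2, kdel=0, kins=0):
    """Return (accumulated distance matrix, path matrix) for s1 (rows) vs s2 (columns)."""
    rows, cols = len(s1), len(s2)
    acc = [[0] * cols for _ in range(rows)]
    path = [[""] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            d = 0 if s1[i] == s2[j] else 1  # assumed 0/1 local distance
            candidates = []
            if j > 0:
                candidates.append((acc[i][j - 1] + kins, "_"))  # horizontal step
            if i > 0:
                candidates.append((acc[i - 1][j] + kdel, "|"))  # vertical step
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], "\\"))    # diagonal step
            if not candidates:
                candidates.append((0, "\\"))                    # top-left corner
            # min() keeps the first of several equal candidates (tie-break order above)
            best, path[i][j] = min(candidates, key=lambda c: c[0])
            acc[i][j] = d + best
    return acc, path


acc, path = forward_dp("AABCD", "ABCCD", kdel=2, kins=2)
for row in acc:
    print(row)
for row in path:
    print(" ".join(row))
```

The bottom-right entry of the accumulated distance matrix, 2, is the edit distance between the two sequences under these penalties.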
10
edit-dist.c
  • New C program on course website
  • Computes the edit distance between two sequences
  • Prints out:
      • Distance matrix
      • Forward accumulated distance matrix
      • Forward path matrix
      • Optimal path
      • Optimal alignment

11
edit-dist.c
  • Format:
      • edit-dist seq1 seq2 <Kdel> <Kins>
      • seq1 and seq2 are the sequences
      • <Kdel> and <Kins> are optional, default 0

12
Matching partial sequences
  • In some applications the interest is in whether
    one sequence matches a subsequence of another
    sequence
  • Example: Bioinformatics
      • Look for examples of a simple DNA sequence within
        a more complex sequence
      • Infer evolutionary relationship between two
        organisms

13
Partial alignment
  • A simple, intuitive solution is to allow Dynamic
    Programming to:
      • Start at any point in the first row
      • End at any point in the final row
  • Then proceed as before
  • Unfortunately this has limitations
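
The simple solution above can be sketched as follows. The code assumes the short pattern is the vertical sequence and the longer sequence is horizontal, uses a 0/1 symbol distance, and carries the start column of the best path along with each accumulated distance so that no separate trace-back is needed. The function name and the example sequences (pattern ABBC, sequence XZABCCYZ) are illustrative assumptions.

```python
def best_partial_match(pattern, sequence, kdel=0, kins=0):
    """Best alignment of pattern (rows) with any stretch of sequence (columns).

    Returns (accumulated distance, start column, end column).
    """
    rows, cols = len(pattern), len(sequence)
    acc = [[0] * cols for _ in range(rows)]
    start = [[0] * cols for _ in range(rows)]  # start column of the best path to each point
    for i in range(rows):
        for j in range(cols):
            d = 0 if pattern[i] == sequence[j] else 1
            if i == 0:
                # A path may start at any point in the first row.
                acc[i][j], start[i][j] = d, j
                continue
            candidates = []
            if j > 0:
                candidates.append((acc[i - 1][j - 1], start[i - 1][j - 1]))  # diagonal
            candidates.append((acc[i - 1][j] + kdel, start[i - 1][j]))       # vertical
            if j > 0:
                candidates.append((acc[i][j - 1] + kins, start[i][j - 1]))   # horizontal
            cost, start[i][j] = min(candidates, key=lambda c: c[0])
            acc[i][j] = d + cost
    # A path may end at any point in the final row: take the smallest entry.
    end = min(range(cols), key=lambda j: acc[-1][j])
    return acc[-1][end], start[-1][end], end


distance, first, last = best_partial_match("ABBC", "XZABCCYZ")
print(distance, "XZABCCYZ"[first:last + 1])  # 0 ABC
```

With both penalties at 0 a repeated symbol costs nothing, which is why ABC matches ABBC with distance 0 here.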

14
Finding matching sub-sequences
Start DP from here
15
Backwards Pass DP
16
Backwards Pass DP
  • Starts in the bottom row, works right-to-left and
    bottom-to-top
  • Otherwise, the backward accumulated distance matrix
    and backward path matrix calculations are analogous
    to forward-pass DP
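
A minimal sketch of a backward pass, assuming the same 0/1 symbol distance and step penalties as a forward pass, and assuming for simplicity that every path ends at the bottom-right point. Each entry holds the accumulated distance from that point to the end; the function name is hypothetical.

```python
def backward_dp(s1, s2, kdel=0, kins=0):
    """Backward accumulated distance matrix for s1 (rows) vs s2 (columns)."""
    rows, cols = len(s1), len(s2)
    acc = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):        # bottom-to-top
        for j in range(cols - 1, -1, -1):    # right-to-left
            d = 0 if s1[i] == s2[j] else 1
            candidates = []
            if i + 1 < rows and j + 1 < cols:
                candidates.append(acc[i + 1][j + 1])      # diagonal step
            if i + 1 < rows:
                candidates.append(acc[i + 1][j] + kdel)   # vertical step
            if j + 1 < cols:
                candidates.append(acc[i][j + 1] + kins)   # horizontal step
            acc[i][j] = d + (min(candidates) if candidates else 0)
    return acc


back = backward_dp("AABCD", "ABCCD", kdel=2, kins=2)
print(back[0][0])  # 2
```

For a complete alignment the top-left entry of the backward matrix equals the bottom-right entry of the forward matrix, because both are the cost of the best full path.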

17
Forward-backward DP
  • Suppose that we have done a complete forward DP
    and a complete backward DP
  • We will have two path matrices:
      • Forward path matrix
      • Backward path matrix
  • From any point in the bottom row we can trace back
    through the forward path matrix and recover a path
    ending in the top row
  • From any point in the top row we can trace back
    through the backward path matrix and recover a path
    ending in the bottom row

18
Matching sub-sequences
Are paths the same? If so, then we have a
matching subsequence
19
Matching subsequences
  • If a path occurs as a consequence of tracing-back
    through the forward path matrix and tracing-back
    through the backward path matrix, then the
    corresponding section of the horizontal sequence
    is called a matching subsequence
  • The matching subsequences are those which achieve
    a good match with the vertical pattern
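
One possible reading of this definition is sketched below. It assumes a forward pass that may start anywhere in the top row, a backward pass that may start anywhere in the bottom row, a 0/1 symbol distance, and ties broken in the order diagonal, vertical, horizontal. All function names are hypothetical, and the example uses the pattern ABBC against the sequence XZABCCYZ.

```python
def dp_pass(pattern, sequence, kdel=0, kins=0, backward=False):
    """One DP pass with a free start row; returns the matrix of chosen neighbours."""
    rows, cols = len(pattern), len(sequence)
    step = 1 if backward else -1
    row_order = range(rows - 1, -1, -1) if backward else range(rows)
    col_order = range(cols - 1, -1, -1) if backward else range(cols)
    start_row = rows - 1 if backward else 0
    acc = [[0] * cols for _ in range(rows)]
    ptr = [[None] * cols for _ in range(rows)]
    for i in row_order:
        for j in col_order:
            d = 0 if pattern[i] == sequence[j] else 1
            if i == start_row:
                acc[i][j] = d  # a path may start anywhere in this row
                continue
            pi, pj = i + step, j + step
            candidates = []
            if 0 <= pj < cols:
                candidates.append((acc[pi][pj], (pi, pj)))       # diagonal
            candidates.append((acc[pi][j] + kdel, (pi, j)))      # vertical
            if 0 <= pj < cols:
                candidates.append((acc[i][pj] + kins, (i, pj)))  # horizontal
            cost, ptr[i][j] = min(candidates, key=lambda c: c[0])
            acc[i][j] = d + cost
    return ptr


def trace(ptr, i, j):
    """Follow the stored neighbours from (i, j) until the start row is reached."""
    path = [(i, j)]
    while ptr[i][j] is not None:
        i, j = ptr[i][j]
        path.append((i, j))
    return path


def matching_subsequences(pattern, sequence, kdel=0, kins=0):
    """Spans (start, end, text) whose path is recovered by both trace-backs."""
    fptr = dp_pass(pattern, sequence, kdel, kins)
    bptr = dp_pass(pattern, sequence, kdel, kins, backward=True)
    last = len(pattern) - 1
    forward_paths = {tuple(reversed(trace(fptr, last, e)))
                     for e in range(len(sequence))}
    matches = []
    for s in range(len(sequence)):
        path = tuple(trace(bptr, 0, s))  # runs from the top row to the bottom row
        if path in forward_paths:
            e = path[-1][1]
            matches.append((s, e, sequence[s:e + 1]))
    return matches


print(matching_subsequences("ABBC", "XZABCCYZ"))  # [(2, 4, 'ABC')]
```

With both penalties left at 0, the only path recovered by both trace-backs covers columns 2 to 4 of the horizontal sequence; other penalty values or tie-breaking rules may give different spans.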

20
Matching subsequences
[Figure: the pattern A B B C aligned against the sequence X Z A B C C Y Z, with the matching subsequence marked]
We say that this subsequence most closely
resembles the original sequence ABBC
21
Summary
  • Revision of Dynamic Programming
  • Examples: Edit distance
  • Motivation for interest in optimal subsequences
  • Forward and backward dynamic programming
  • Matching subsequences: subsequences which most
    closely resemble a given sequence