Title: Text Comparison of Genetic Sequences
1Text Comparison of Genetic Sequences
- Shiri Azenkot
- Pomona College
- DIMACS REU 2004
2Comparing Two Strings
- Definition: A string is a sequence of consecutive
characters.
- Examples
- hello world
- 0123456
- DNA sequences
- text file
3Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
- Allowed operations
- Insert a character
- Delete a character
- Replace a character
- Running time: O(mn) with a dynamic programming
algorithm, where m and n are the lengths of X and Y
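A minimal sketch of the dynamic programming algorithm mentioned above, assuming each insert, delete, and replace costs 1. The function name edit_distance is illustrative, not from the slides. It fills the table row by row and keeps only two rows in memory.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic edit distance with unit-cost insert, delete, and replace."""
    m, n = len(x), len(y)
    # prev[j] holds the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            replace_cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # delete x[i-1]
                curr[j - 1] + 1,             # insert y[j-1]
                prev[j - 1] + replace_cost,  # replace or keep
            )
        prev = curr
    return prev[n]


print(edit_distance("abcdef", "defabc"))  # prints 6
```

The two nested loops visit each of the (m + 1)(n + 1) table cells once, which is where the O(mn) running time comes from.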
4Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = abcdef, Y = defabc, d(X, Y) = ? operations
5Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = bcdef, Y = defabc, d(X, Y) = ? operations
Operations so far: 1
6Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = cdef, Y = defabc, d(X, Y) = ? operations
Operations so far: 2
7Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = def, Y = defabc, d(X, Y) = ? operations
Operations so far: 3
8Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = defa, Y = defabc, d(X, Y) = ? operations
Operations so far: 4
9Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = defab, Y = defabc, d(X, Y) = ? operations
Operations so far: 5
10Comparing Two Strings
- If X and Y are strings, how similar are they?
- Edit distance, d(X, Y): smallest number of
operations needed to make X look like Y.
X = defabc, Y = defabc, d(X, Y) = 6 operations
Operations so far: 6 (3 deletions, then 3 insertions, starting from X = abcdef)
Does this seem too high?
11Edit Distance with Moves
- d(X, Y): smallest number of operations to make X
look like Y.
- New operation: move a substring
X = abcdef, Y = defabc, d(X, Y) = 1
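A small illustration of the move operation, assuming a move cuts a substring out of X and reinserts it at another position. The function name move_substring and its parameters are hypothetical, chosen only for this sketch.

```python
def move_substring(s: str, start: int, length: int, dest: int) -> str:
    """Cut s[start:start+length] out of s and reinsert it at index dest
    of the remaining string. Counts as one operation."""
    piece = s[start:start + length]
    rest = s[:start] + s[start + length:]
    if not 0 <= dest <= len(rest):
        raise ValueError("dest must be a position in the remaining string")
    return rest[:dest] + piece + rest[dest:]


# Moving "abc" to the end turns abcdef into defabc in one operation.
print(move_substring("abcdef", 0, 3, 3))  # prints defabc
```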
12Edit Distance with Moves
- d(X, Y): smallest number of operations to make X
look like Y.
- New operation: move a substring
- Some applications
- Computational biology: DNA sequences
- Text editing
- Webpage updating
13Edit Distance with Moves
- Edit Sensitive Parsing (ESP) Algorithm
- Parse each string into a 2-3 tree
- Compare nodes (substrings) of the trees to
compute edit distance approximation
- Computing edit distance with moves exactly is NP-hard
- Algorithm approximates d(X, Y) deterministically
- Running time: O(n log n)
14Edit Distance with Moves: Algorithm
- Parse each string into a 2-3 tree
- Every node represents a substring
- X = bagcabagehead
15Edit Distance with Moves: Algorithm
- Parse each string into a 2-3 tree
- Every node represents a substring
- Y = cabageheadbag
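To make "every node represents a substring" concrete, here is a rough sketch that groups a string bottom-up into blocks of 2 or 3 until one root remains. This is NOT the Edit Sensitive Parsing rule: ESP chooses block boundaries from local context so that similar strings get similar trees, while this sketch just groups left to right. The name parse_levels is hypothetical.

```python
def parse_levels(s: str) -> list[list[str]]:
    """Return the levels of a naive 2-3 tree over s, leaves first.

    Each node is stored as the substring it spans. Nodes are grouped
    left to right in pairs, with a final triple when three remain
    (an assumption for illustration, not the ESP parsing rule).
    """
    if not s:
        raise ValueError("string must be non-empty")
    levels = [list(s)]
    while len(levels[-1]) > 1:
        prev, nxt, i = levels[-1], [], 0
        while i < len(prev):
            size = 3 if len(prev) - i == 3 else 2
            nxt.append("".join(prev[i:i + size]))
            i += size
        levels.append(nxt)
    return levels


for level in parse_levels("bagcabagehead"):
    print(level)
```

Every internal node has two or three children, so the tree has O(log n) levels. The parse on the slides differs because ESP picks the block boundaries differently.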
16Edit Distance with Moves: Algorithm
- Compare nodes (substrings) of the trees to
compute edit distance approximation
- 2.1 Find frequencies of occurrence of each
substring.
- X = bagcabagehead
- Leaves of the parse tree of X: b a g c a b a g e h e a d
17Edit Distance with Moves: Algorithm
- Compare nodes (substrings) of the trees to
compute edit distance approximation
- 2.1 Find frequencies of occurrence of each
substring.
- Y = cabageheadbag
- Level 2 nodes: caba, gehea, dbag (frequency 1 each)
- Level 1 nodes: ca, ba, geh, ea, db, ag (frequency 1 each)
- Leaves: c a b a g e h e a d b a g
18Edit Distance with Moves: Algorithm
- Compare nodes (substrings) of the trees to
compute edit distance approximation
- 2.1 Find frequencies of occurrence of each
substring.
- 2.2 Subtract characteristic vectors to get
approximation for d(X, Y)
- X level 2 nodes: bagca, bagehead (frequency 1 each)
- X level 1 nodes: bag, ca, ba, geh, ead (frequency 1 each)
- minus
- Y level 2 nodes: caba, gehea, dbag (frequency 1 each)
- Y level 1 nodes: ca, ba, geh, ea, db, ag (frequency 1 each)
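Steps 2.1 and 2.2 can be sketched as follows, taking the node substrings for X and Y from the slides as given input instead of computing the parse. The name l1_difference is hypothetical, and leaving the two root nodes out is an assumption of this sketch.

```python
from collections import Counter


def l1_difference(nodes_x, nodes_y):
    """Count how often each substring labels a node in each tree (2.1),
    then sum the absolute differences of those counts (2.2)."""
    freq_x, freq_y = Counter(nodes_x), Counter(nodes_y)
    return sum(abs(freq_x[s] - freq_y[s]) for s in freq_x.keys() | freq_y.keys())


# Node labels as they appear on the slides: leaves, level 1, level 2.
x_nodes = (list("bagcabagehead")
           + ["bag", "ca", "ba", "geh", "ead"]
           + ["bagca", "bagehead"])
y_nodes = (list("cabageheadbag")
           + ["ca", "ba", "geh", "ea", "db", "ag"]
           + ["caba", "gehea", "dbag"])

print(l1_difference(x_nodes, y_nodes))  # prints 10
```

The leaves cancel because Y is a rearrangement of X, and the shared nodes ca, ba, and geh cancel too. The remaining count is an approximation of d(X, Y), not the exact distance.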
19Edit Distance with Moves: Algorithm
- Compare nodes (substrings) of the trees to
compute edit distance approximation
- 2.1 Find frequencies of occurrence of each
substring.
- 2.2 Subtract characteristic vectors to get
approximation for d(X, Y)
Actual edit distance with moves?
d(bagcabagehead, cabageheadbag) = 1
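The value 1 can be checked by brute force: try every substring of X at every reinsertion point and see whether any single move produces Y. This is only a sanity check for short strings, with the hypothetical name one_move_apart. It is not part of the approximation algorithm.

```python
def one_move_apart(x: str, y: str) -> bool:
    """Return True if moving one substring of x yields y.

    Tries every (start, length, destination) triple, so it is only
    practical for short strings.
    """
    n = len(x)
    for start in range(n):
        for length in range(1, n - start + 1):
            piece = x[start:start + length]
            rest = x[:start] + x[start + length:]
            for dest in range(len(rest) + 1):
                if rest[:dest] + piece + rest[dest:] == y:
                    return True
    return False


# Moving the leading "bag" to the end of X gives Y.
print(one_move_apart("bagcabagehead", "cabageheadbag"))  # prints True
```

Since X and Y differ, at least one operation is needed, so a single successful move confirms that the distance is exactly 1.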
20Edit Distance with Moves
- Goals for this project
- Implement this algorithm
- Test algorithm on DNA sequences
- Questions to think about
- How accurate is the approximation?
- How applicable is this technique for comparing
large biological sequences?
- This algorithm finds repeating structures within
the sequences when comparing them. Do these
structures have significance?
- Do such structures exist for real sequences?
21Acknowledgements
- Mentor: Graham Cormode, DIMACS Postdoc
- DIMACS REU 2004
- References
- Benedetto, D., Caglioti, E., Loreto, V., Language
Trees and Zipping. Physical Review Letters, 2002.
- Cormode, G., Muthukrishnan, S., The String Edit
Distance Matching Problem with Moves.