Presentation For VLSI Application of Suffix Trees - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

Presentation For VLSI Application of Suffix Trees

Description:

... that becomes a palindrome/tandem after k or fewer characters are changed ... Run k successive longest common extension queries forward from h and q and run k ... – PowerPoint PPT presentation

Number of Views:96
Avg rating:3.0/5.0
Slides: 21
Provided by: chunk
Category:

less

Transcript and Presenter's Notes

Title: Presentation For VLSI Application of Suffix Trees


1
Presentation For VLSIApplication of Suffix Trees
  • Chunkai Yin
  • Apr 2002

2
Topics
  • Approximate palindromes and repeats
  • Multiple common substring problem

3
Approximate palindromes and repeats
  • Definitions
  • Find all tandem repeats (naive method)
  • Find all tandem repeats (Landau-Schmidt)
  • Find all k-mismatch tandem repeats (L-S)

4
Definitions
  • Tandem repeat
  • string written as AA, where A is a substring
  • abcabc
  • Specified by starting position and length of A
  • K-mismatch palindrome/tandem
  • a substring that becomes a palindrome/tandem
    after k or fewer characters are changed
  • zzcbbcca 2-mismatch palindrome pzcb bcca
  • axabaybb 2-mismatch tamdem axab aybb

5
Find all tandem repeats (naive method)
  • Guess a start position i , middle position j for
    the tandem
  • Do a longest common extension query from i and j.
    If the extension from i reaches j, the tandem
    exists starting at position i.
  • Time complexity O(n2)
  • EX1
  • fababd

6
Problem Find all tandem repeats
  • The Landau-Schmidt method divide the problem
  • into four subproblems
  • (Recursive, divide-and-conquer)
  • Find all tandem repeats contained entirely in the
    first half of S (up to h n/2)
  • Find all tandem repeats contained entirely in the
    second half of S (after h )
  • Find all tandem repeats where the first copy
    contains position h of S.
  • Find all tandem repeats where the second copy
    contains position h of S.

7
How to handle this ?
  • Subproblem A, B will be solved by recursively
    applying the method
  • Subproblem C and D are symmetric, just consider
    subproblem C.
  • Therefore, the core problem is subproblem C.

8
Solve the subproblem 3
  • Initialization
  • h n/2 qhL L is a fixed number each time
  • Compute the longest common extension (from left
    to right) from h and q.
  • L1 is the length of extension.
  • Compute the longest common extension (from right
    to left) from h-1 and q-1. L2

9
Solve the subproblem 3
  • If and only if L1L2gtL, a tandem repeat exists
    whose first copy contains position h with the
    length 2L, and it can begin at any position
    between h-L2 and hL1-L inclusive. The second
    copy begins L places to the right. Output the
    starting point.
  • EX2

10
L1, L2 and range for starting point
L
q
h
A
B
  • ?1L-L2, L ?1 L2
  • Ah-L2
  • ?2L-L1, L ?2 L1
  • Bh- ?2hL1-L

L1
L1
?2
?2
L2
L2
?1
?1
11
An Example
  • .b d c a b b d c a b b t t t.
  • h q
  • L5 L13 L23
  • A1 B2
  • Two forms available
  • bdcab bdcab
  • dcabb dcabb

12
Time Complexity
  • Time used for output O(Z)
  • Z is the total number of tandem repeats in S
  • Time used for the extension queries
  • which is proportional to the number of
    extension queries ext.
  • T(n)T(n/2) T(n/2) n n nn/2 (-gt)
    n/2(lt-)
  • A B C D
  • T(n) O(nlogn), totalO(nlogn)Z

13
Find all k-mismatch tandem repeats (L-S)
  • Immediate extension of the O(nlognZ) method for
    finding exact tandem repeats.
  • Run k successive longest common extension queries
    forward from h and q and run k successive longest
    common extension queries backward from h-1 and
    q-1.
  • Then try to find t, a midpoint of the tandem
    repeat.
  • In the interval between h and q, the number
    of mismatches from h to t (found during the
    forward extension) plus the the number of
    mismatches from t1 to q-1(found during the
    backward ext) is ltk.

14
Time complexity
  • O(knlogn Z)

15
Multiple common substring prob.
  • Problem
  • What substring are common to a large number of
    distinct strings?
  • Definitions
  • T a generalized suffix tree for K strings
  • The identical suffixes appearing in more
  • than one string end at distinct leaves.
    Each
  • leaf has one of K unique string
    identifier.

16
  • C(v) the number of distinct leaf string
    identifiers in the subtree of v
  • S(v) the total number of leaves in the subtree
    of v
  • U(v) how many duplicate suffixes from the same
    string occur in vs subtree.
  • C(v)S(v)-U(v)
  • L(k) the length of the longest substring common
    to at least k of the strings.
  • U(v)?ini(v)gt0(ni(v)-1)
  • ni(v) the number of leaves with identifier i
  • in the subtree rooted at v

17
The key work we need do
  • Our object get C(v)
  • What is the algorithm?
  • Define Li the list of leaves with identifier i,
    in increasing order of their DFS numbers

18
Algorithm for the problem
  • Build a generalized suffix tree T for the K
    strings.
  • Number the leaves of T as they are encountered
    in a depth-first traversal of T
  • For each string identifier i, extract the ordered
    list Li, of leaves with identifier i.
  • For each identifier i, compute the lca of each
    consecutive pair of leaves in Li, and increment
    by one each time that w is the computed lca.

19
Algorithm Cont.
  • Lemma
  • If we compute the lca for each consecutive
    pair of leaves in Li, then for any node v,
    exactly ni(v)-1 of the computed lcas will lie in
    the substring of v.
  • With a bottom-up traversal of T, compute, for
    each node v, S(v), and U(v)?ini(v)gt0(ni(v)-1)
    ?h(w)w is in the subtree of v
  • Set C(v)S(v)-U(v) for each v
  • Accumulate the table of L(k) values.
  • Time complexity O(n)

20
End of the presentation
  • Thank you
  • Thank you
  • Thank you
Write a Comment
User Comments (0)
About PowerShow.com