Title: Processing Repetitive Sequence Structures at Streaming Rate
Albert A. Conti, Tom Van Court, Martin C. Herbordt
Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215
{herbordt, alconti, tvancour} _at_ bu.edu
String Matching for Bioinformatics
Our Model, Problems we address
In this first study, we examined what could be
done with the simplest algorithmic models. Our
program is to investigate techniques for
analyzing repetitive sequence structure by
feeding sequences through the FPGA at streaming
rate. By streaming rate we mean that
characters are processed systolically with
emphasis on simple logic.
Repeating patterns make up a significant fraction
of DNA and protein molecules. These repeating
regions are important to biological function
because they may act as catalytic, regulatory or
evolutionary sites and because they have been
implicated in human diseases such as Fragile-X
mental retardation and Huntington's disease [1].
While identifying exact-matching repetitive
structures is a task easily handled by a standard
PC, identifying structures with a variable number
of mismatches, insertions and/or deletions is
computationally prohibitive. Existing solutions
include expensive dedicated platforms and
inaccurate heuristic methods.
In our system, an Avnet Virtex II Pro Development
Board housing a Xilinx XC2VP20 FPGA (right) acts
as a coprocessor. Designs implemented on the
FPGA for each task are all organized in a
two-tier structure (left). Input is streamed
through arrays of comparators/counters in the
first tier. In the second tier, which we call
post-processing, we decide what information to
send off chip and determine higher-order
structures such as arrays of repeats.
[Figure: example structures. "C G A T G C G C T G": a tandem repeat of length 5 with 1 mismatch. "G T T C A A C T G": an even palindrome of length 4/5 with 1 insertion/deletion.]
[Figure: two-tier design. Data input feeds Tier 1: structure-specific comparator arrays and systolic logic surrounded by shift registers for the input stream.]
Implementations for detection
[Figure label, between Tier 1 and Tier 2: high-bandwidth intermediate results]
Palindromes: Our method here is simple.
Pair-wise comparisons are made for all character
positions 1 to n/2. The results of these
comparisons are added systolically to arrive at
the number of matching characters n/2 clock cycles later.
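The pair-wise comparison step can be illustrated in software. The following is a minimal sequential sketch, not the FPGA design: it assumes a textual palindrome (each character compared with its mirror-image partner in an even-length window), and the names `palindrome_mismatches` and `scan_palindromes` are hypothetical. On the FPGA the comparisons run in parallel and the sum is accumulated systolically; here a plain loop stands in for both.

```python
def palindrome_mismatches(window):
    """Count mirrored pairs that differ in an even-length window.

    Assumption: character i is compared with character n-1-i
    (textual palindrome), for i = 0 .. n/2 - 1.
    """
    n = len(window)
    if n % 2:
        raise ValueError("window length must be even")
    return sum(window[i] != window[n - 1 - i] for i in range(n // 2))


def scan_palindromes(seq, n, k):
    """Yield (start, mismatches) for every length-n window with <= k mismatches."""
    for start in range(len(seq) - n + 1):
        m = palindrome_mismatches(seq[start:start + n])
        if m <= k:
            yield start, m


hits = list(scan_palindromes("AAGTCAACTGAA", 8, 0))
```

Here `hits` holds the single exact palindrome window, which starts at index 2. Raising `k` admits windows with that many mismatching pairs.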
[Figure: Tier 2, post-processing filters, producing low-bandwidth output. Results can be sent off chip or processed further. Diagram labels: len2, len3, len4.]
Tandem repeats: Our method of detecting repeats
is similar to the method for detecting
palindromes. The difference is that we can take
advantage of comparisons made in previous steps
through the string. Note below that when our
frame of reference shifts for length 4, there is
only one comparison that was not made in the
previous step. Because only a single comparison
enters (and a single one expires) at every step
through the string, the number of mismatches (k)
for any given length can change by no more than
one. k is updated for each length at each step
according to the table below. We can perform this
computation for each length up to n/2 by
replicating the logic as shown.
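The incremental update described above can be sketched in software as follows. This is an illustrative sequential model under stated assumptions, not the hardware design: a repeat of length p at position `start` is taken to mean comparing `seq[start:start+p]` with `seq[start+p:start+2p]`, and the function name `tandem_mismatch_stream` is hypothetical. The outer loop over p stands in for the replicated logic, one copy per length.

```python
def tandem_mismatch_stream(seq, max_len):
    """Track the mismatch count k for each repeat length p in 1..max_len.

    Returns {p: [k at start 0, k at start 1, ...]}.
    """
    result = {}
    for p in range(1, max_len + 1):
        if len(seq) < 2 * p:
            continue
        # The first window is counted directly.
        k = sum(seq[i] != seq[i + p] for i in range(p))
        ks = [k]
        for start in range(1, len(seq) - 2 * p + 1):
            # One comparison leaves the window, one enters it.
            expired = seq[start - 1] != seq[start - 1 + p]
            new = seq[start + p - 1] != seq[start + 2 * p - 1]
            k += int(new) - int(expired)  # k changes by -1, 0, or +1
            ks.append(k)
        result[p] = ks
    return result


counts = tandem_mismatch_stream("CGATGCGCTGAACT", 7)
```

For the string "CGATGCGCTGAACT", `counts[5][0]` is 1, matching the length-5 repeat with one mismatch at the start of the string. Each step costs two comparisons per length regardless of p, which is the property the replicated logic relies on.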
The following tasks were examined on an FPGA and
analyzed. Each of these tasks enumerates
quantities for strings of arbitrary length but
with n determined by available hardware.
1. tandem repeats of length 1 to n with k or fewer mismatches
2. palindromes of length 1 to n with k or fewer mismatches
3. tandem repeats of length 1 to n with k or fewer mismatches and one edit error
4. palindromes of length 1 to n with k or fewer mismatches and one edit error
5. tandem arrays of arbitrary length with period from 1 to n
[Figure: comparison window stepping through the string C G A T G C G C T G A A C T, with the new compare and the expired compare marked.]
Results: >500x speedup
The following tables report the maximum size and
minimum clock period (post place-and-route
timing) of each problem that will fit on our
target FPGA. The serial-version times are those
of a C program running on a 3 GHz Xeon-based
workstation-class PC.
[Table: Δk as a function of the expired comparison and the new comparison]
Task                                                                              max n
1. tandem repeats of length 1 to n with k or fewer mismatches                     1024
2. palindromes of length 1 to n with k or fewer mismatches                        256
3. tandem repeats of length 1 to n with k or fewer mismatches and one edit error  64
4. palindromes of length 1 to n with k or fewer mismatches and one edit error     40
5. tandem arrays of arbitrary length with period from 1 to n                      1024
[Figure labels, repeated four times: k ± 1]
Extending these models for edit errors: The basic
cells are modified to look at registers to the
left and right of their pair-wise matches. In
addition, a combinatorial network is used to
detect every possible insertion/deletion point
for each length. The diagram to the right shows
a cell for palindrome detection with a single
insertion or deletion.
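As a software reference for what the modified cells compute, the sketch below tries every possible deletion point in an odd-length window and keeps the best mirrored-mismatch count. It is a brute-force illustration under assumptions, not the cell design: palindromes are treated as textual, an insertion on one side is modeled as a deletion on the other, and the names `mirrored_mismatches` and `best_single_deletion` are hypothetical. The hardware evaluates all candidate points at once in its combinatorial network; the loop here visits them one by one.

```python
def mirrored_mismatches(s):
    """Count mirrored pairs that differ (the middle character of an odd string is ignored)."""
    return sum(s[i] != s[len(s) - 1 - i] for i in range(len(s) // 2))


def best_single_deletion(window):
    """Return (fewest mismatches, deletion index) over all single deletions.

    Assumption: the window has odd length 2h+1, so removing one
    character leaves an even palindrome candidate of length 2h.
    Ties are resolved in favor of the leftmost deletion index.
    """
    if len(window) % 2 == 0:
        raise ValueError("window length must be odd")
    best = None
    for p in range(len(window)):
        m = mirrored_mismatches(window[:p] + window[p + 1:])
        if best is None or m < best[0]:
            best = (m, p)
    return best


result = best_single_deletion("GTTCAACTG")
```

For the string "G T T C A A C T G", removing one of the two adjacent T characters leaves "GTCAACTG", an exact palindrome, so `result` reports zero mismatches.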
Precise Tandem Arrays: An additional level of
counters counts successive shifts with mismatches
below a certain threshold. Dividing the value in
one of these counters by the length of the repeat
it tracks gives the number of consecutive
repeated cycles detected.
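The counter level can be sketched as a run-length counter over the per-shift mismatch counts. This is a minimal model with assumptions labeled: the threshold test is taken to be inclusive (`k <= threshold`), the division is an integer division, and the name `tandem_array_cycles` is hypothetical. For brevity the mismatch count is recomputed directly at each shift rather than updated incrementally.

```python
def tandem_array_cycles(seq, p, threshold):
    """Return (longest run of qualifying shifts, that run divided by p).

    A shift qualifies when the two adjacent length-p blocks starting
    there differ in at most `threshold` positions (assumed inclusive).
    """
    run = best = 0
    for start in range(len(seq) - 2 * p + 1):
        k = sum(seq[start + i] != seq[start + p + i] for i in range(p))
        run = run + 1 if k <= threshold else 0  # counter resets on a miss
        best = max(best, run)
    return best, best // p


run_length, cycles = tandem_array_cycles("ACGACGACGACG", 3, 0)
```

With this window definition, an exact array of c copies of a period-p unit yields a run of (c - 2) * p + 1 qualifying shifts, so the four-copy example gives a run of 7 and a quotient of 2. How that quotient maps onto a copy count depends on the convention chosen in post-processing.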
max n   Serial Version for Task 1   FPGA Version for Task 1   Serial Version for Task 3   FPGA Version for Task 3
32      1.3 us                      5 ns                      10.5 us                     5 ns
64      2.3 us                      5 ns                      36.0 us                     5 ns
128     4.6 us                      5 ns
256     8.8 us                      5 ns
512     17.1 us                     5 ns
1024    33.1 us                     5 ns
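The speedup implied by the Task 1 figures can be worked out with a few lines of arithmetic. The sketch below assumes that each serial time is the cost of one step through the string, so that it is directly comparable with one 5 ns FPGA clock period; that interpretation is an assumption and is not stated in the table. The names `SERIAL_US_TASK1`, `FPGA_NS`, and `speedup` are illustrative.

```python
# Task 1 timings as tabulated: max n -> serial time in microseconds.
SERIAL_US_TASK1 = {32: 1.3, 64: 2.3, 128: 4.6, 256: 8.8, 512: 17.1, 1024: 33.1}
FPGA_NS = 5.0  # FPGA clock period in nanoseconds, the same for every n


def speedup(serial_us, fpga_ns=FPGA_NS):
    """Ratio of serial time to FPGA time, after converting us to ns."""
    return serial_us * 1000.0 / fpga_ns


task1_speedups = {n: speedup(t) for n, t in SERIAL_US_TASK1.items()}
```

Under that assumption the ratio is roughly 260 at n = 32 and roughly 6600 at n = 1024, and it exceeds 500 from n = 128 upward.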
[Figure: cell diagram with signal labels cIN, cNEXT, eq, eq-1, eq1]
Please note that while designs were tested for
correctness on the Xilinx XC2VP20, maximum size
and timing figures are based on the Xilinx
XC2VP100.
[1] G. Benson. A space efficient algorithm for
finding the best nonoverlapping alignment score.
In M. Crochemore and D. Gusfield, editors, Proc.
5th Annual Symp. on Combinatorial Pattern
Matching, Lecture Notes in Computer Science,
volume 807, pages 1-14. Springer-Verlag, 1994.