Title: High Performance Direct Pairwise Comparison of Genomic Sequences
 1High Performance Direct Pairwise Comparison of 
Genomic Sequences
- Christopher Mueller, Mehmet Dalkilic, Andrew 
Lumsdaine  - HiCOMB 
 - April 4, 2005 
 - Denver, Colorado
 
  2Introduction
- Goals 
 - Generate data for large format visualization 
 - Exploit parallel features present in commodity 
hardware  - SIMD/vector processors 
 - SMP/multiple processors per machine 
 - Clusters 
 - Genome Comparison 
 - Dot plot is the only complete method for 
comparing genomes  - Often ruled out due to quadratic running time 
 - Size of data has an upper bound and modern 
hardware is approaching the point where this 
bound is (almost) within reach  - Target Data 
 - DNA sequences, one direction (5 to 3) 
 - Target Platform 
 - Apple dual processor G5, Altivec vector processor
 
  3Related Work
- BLAST 
 - Apple and Genentech (AGBLAST), 5x speedup using 
Altivec  - Smith-Waterman 
 - Rognes and Seeberg, 6x speedup using MMX 
 -  HMMER 
 - Erik Lindahl, 30 improvement using Altivec 
 - Hardware Solutions 
 - Various commercial FPGA solutions exist for 
different algorithms (e.g., TimeLogics DeCypher 
platform offers BLAST, HMM, SW) 
  4SIMD Overview
- Single Instruction, Multiple Data 
 - Perform the same operation on many data items at 
once  - Vector registers can be divided according to the 
data type  - The Altivec registers in the G5 are 128 bits 
wide.  - Vector programming using gcc on Apple G5s is one 
step removed from assembly programming  - Functions are thin wrappers around assembly calls 
 - The optimizer does not cover vector operations 
 - Memory loads and stores are handled by the 
programmer and must be properly byte aligned  
Image from http//developer.apple.com/hardware/ve 
 5The Dot Plot
qseq
NAÏVE_DOTPLOT(qseq, sseq, win, strig) // qseq 
- column sequence // sseq - row sequence // 
win - number of elements to compare // 
for each point // strig - number of matches 
required // for a point for each q 
in qseq for each s in sseq 
score  0 for each (q, s) in 
(qseqqqwin, 
ssswin) if q  s 
score  1 end if q end for each 
(q,s) if score gt strig 
AddDot(q, s) end if score end for 
each s end for each q 
sseq
win  3 strig  2
Dotplot comparing the human and fly mitochondrial 
genomes (generated by DOTTER) 
 6The Standard Algorithm
STD_DOTPLOT(qScores, s, win, strig) dotvec  
zeros(len(q)) for each char c in s dotvec 
 shift(dotvec, 1) dotvec  qScoresc 
if index(c) gt win delchar  sindex(c) - 
win dotvec - shift(qScoresdelchar, 
 win) for each dot in dotvec 
gt strig display(dot) end for each 
dot end for i end DOTPLOT 
 7Data Parallel Dot Plot
VECTOR_DOTPLOT(qScores, s, win, strig) // 
Group diagonals by the upper and lower // 
triangular sections of the martix for each 
vector diagonal D runningScore  vector(0) 
 for each char c in s score  
VecLoad(qScoresc) runningScore  
VecAdd(score, r_score) if index(c) gt win 
 delChar  sindex(c) - win 
delscore  VecLoad(qScoresdelChar) 
runningScore  VecSub(score, delscore) if 
VecAnyElementGte(runningScore, strig) 
scores  VectorUnpack(runningScore) for 
each score in scores gt strig 
Output(row(c), col(score), score) end for 
each score end if VecGte() end for 
each c end for each D end VECTOR_DOTPLOT 
 8Coarse Grained Parallelism
- Block Level Parallelism 
 - Block the matrix into columns 
 - Overlap by the number of characters in the window 
 - Single Machine 
 - Run one thread per processor 
 - Create one memory mapped file per processor 
 - Cluster 
 - Run one instance per machine and one thread per 
processor.  - Store results locally (e.g. /tmp) 
 
  9Model-driven Implementation
Goal Break the algorithm into basic operations 
that can be modeled independently to understand 
the performance issues at each step.
Data Streams (data read speed)
Vector Operations (instruction throughput)
Sparse Matrix Format
Data output
(data write speed) 
 10Data Stream Models
Data Stream Performance (Mops)
// Base case // S-sequence is one stream 
pointer s // Q-sequence is four 
streams uchar qScore4 // Option 1 Four 
Pointers // Keep pointers to the current // 
position in the score vectors qScore0 qScore
1 qScore2 qScore3 score  
qScores // Option 2 Index // Index the 
score vectors with // a counter i score  
qScoresi 
- Single stream pointer is similar to indexing, but 
a little slower  - For the four score streams, indexed 1/4 of the 
time, maintaining the pointers costs more than 
lookup  - Pointer vs. Index numbers varied based on the 
compiler version  
  11Vector Performance Models
// Model Variables uchar data  randseq(), 
out16 long i  0, l  len(data) vector uchar 
sum  0, value // VecAdd for(i  0 i lt l - 
16 i)  value  VecLoad(datai) sum  
VecAdd(value, sum)  // StoreAll for(i  0 i lt 
l - 16 i)  value  VecLoad(datai) sum 
  VecAdd(value, sum) out  VecStore(sum) 
Save(out)  // StoreFreq int freq  l  
alpha for(i  0 i lt l - 16 i)  value  
VecLoad(datai) sum  VecAdd(value, sum) 
if(i  freq)  // Pipeline stall! out  
VecStore(sum) Save(out)  
Vector Model Performance (Mops)
- Attempts to model infrequent write operations 
were unsuccessful  - Storing all dots yields high performance, but 
this is not practical for large comparisons  - StoreFreq provides a lower bound on performance
 
  12Pipeline Management
// Sequence of Vector Operations // score  
VecLoad(qScoresc) score1  vec_ld(0, ptemp) 
// unalgined score2  vec_ld(16, ptemp) // 
loads vperm  vec_lvsl(0, ptemp) score  
vec_perm(score1, score2, vperm) runningScore  
vec_add(score, r_score) // delscore  
VecLoad(qScoresdelChar) score1  vec_ld(0, 
ptemp) score2  vec_ld(16, ptemp) vperm  
vec_lvsl(0, ptemp) delscore  
vec_perm(score1, score2, vperm) runningScore  
vec_sub(score, delscore) if(vec_any_ge(runningSc
ore, strig))  scores  vec_st(runningScore) 
// Main processor for(i  0 i lt 16 i)  
if(hiti gt ustrig )  dm.AddDot(y, x  i, 
hiti)    
Cycle-accurate Plots of the Instructions in Flight
Each line shows each cycle for one instruction. 
Instructions are offset (x-axis) based on 
starting time. Time flows from top to bottom 
(y-axis). The left plot shows a series of 
add/delete steps with no dots generated. The 
bottom plot shows the pipeline being interrupted 
when a dot is generated. 
 13Sparse Matrix Format
Sparse Matrix Format Performance (Mops)
// Option 1 // stdvector CSR-eqse Sparse 
Matrix struct Dot  int col int 
value  struct Row  int num vectorltDotgt 
cols  typedef vectorltRowgt DotMatrixVec // 
Option 2 // Memory Mapped Coordinate-wise // 
Sparse Matrix struct RowDot  int row int 
col int value  RowDot out  
(RowDot)mmap() 
6.78x
3.85x
1.0x
- Both approaches required some maintenance to 
avoid exhausting main memory  - mmap avoids a second pass through the data during 
the save step  
  14Data Location
Data Location Performance (Mops)
- Large, shared data is often located on network 
drives  - This adds a network hop for all disk I/O 
 - Even for infrequent I/O, this can significantly 
affect performance  
1.98x
1.35x
1.0x
1.0x
- The stdvector sparse matrix had a slight 
benefit.  - The mmap sparse matrix improved significantly 
with local data storage. 
  15Traditional Manual Optimizations
- Prefetch 
 - G5 hardware prefetch is very good 
 - Attempts to optimize had negative impact 
 - Blocking 
 - Slight negative impact due to burps in the stream 
 - Unrolling 
 - Complicated code very quickly 
 - No measurable improvement
 
  16System Details
- Apple Dual 2.0 GHz G5, 3.5 GB RAM 
 - 100 Mbit network to file server 
 - OS X 10.3.5 (Darwin Kernel Version 7.5.0) 
 - g 3.3 (build 1620) 
 - -O3 
 - -fast (different compiler, aggressive 
optimizations)  - -altivec (limited optimizations) 
 - Upgrade from 1614 to 1620 improved DOTTERs 
performance by 30  - Libraries 
 - Boostthread 
 - Data (from GenBank) 
 - Mitochondrial genomes 
 - E. Coli, Listeria bacterial genomes 
 
  17Results
Final Results (Mops)
13.0x
- Single Machine 
 - Mitochondrial (20 kbp) 
 - DOTTER vs. Data-parallel 
 - Bacterial (4.5 Mbp) 
 - Data-parallel only 
 - Cluster 
 - (16 dual processor 2.3 GHz G5s) 
 - Bacterial Comparison 
 - 92 min, 8 sec (1 node) 
 - 5 min, 42 sec (16 nodes) 
 
7.0x
1.0x
Scalability 
Scalability (time/nodes) 
 18Visualization
- Results rendered to PDF 
 - Target Displays 
 - 2x4, 6400x2400 tiled display wall 
 - IBM T221, 3840x2400, 204 dpi display 
 - Magnifying glass required 
 - High resolution formats 
 - 600 dpi laser printer 
 - 1200 dpi ink jet printer 
 - High resolution, no interactivity
 
  19Conclusions
- Modern commodity hardware is close to providing 
the performance necessary for large direct 
genomic comparisons.  - 5,000,000 base pair sequences are realistic 
(bacteria)  - 50,000,000 base pair sequences are possible 
(small human chromosomes)  - It is important to take a careful, experimental 
approach to implementation and to test all 
assumptions.  
  20Acknowledgements
- Jeremiah Willcock helped develop the initial 
prototype  - Eric Wernert, Craig Jacobs, and Charlie Moad from 
the UITS Advanced Visualization Lab at Indiana 
University provided visualization support  - This work was supported by a grant from the Lilly 
Endowment  - References 
 -  Apple Developers Connection, Velocity Engine 
and Xcode, from, Apple Developer Connection, 
Cupertino, CA, 2004. http//developer.apple.com/ha
rdware/ve http//developer.apple.com/tools/xcode  -  
 -  A. J. Gibbs and G. A. McIntyre, The diagram, a 
method for comparing sequences. Its use with 
amino acid and nucleotide sequences, Eur J 
Biochem, 16 (1970), pp. 1-11.  -  
 -  E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix 
Program with Dynamic Threshold Control Suited for 
Genomic DNA and Protein-Sequence Analysis, 
Gene-Combis, 167 (1995), pp. 1-10.