DARPA Sorting Benchmark On SRC Platform presentation

1
DARPA Sorting Benchmark On SRC Platform

Gang Quan, Allen Michalski, Duncan Buell, James Davis
Department of CSE
University of South Carolina

2
Outline

Introduction
DARPA Benchmark Suite
SRC Platform
Integer Sorting Algorithm Implementations
Experiments and Results
Discussions

3
DARPA Benchmark Suite

Six benchmarks to measure the performance of high-productivity computing systems
Benchmark suite
Large shared memory random access
Matrix multiplication with multiprecision modular coefficients
Dynamic programming
Data transposition
Integer sort
Bit string pattern matching

4
DARPA Integer Sorting Benchmark

Sorting a stream of n-bit unsigned integers of
length N
In-place sorting is not required
Non-unique integers
In-core sort
N = 10^6, n = 64
N = 5 × 10^7, n = 128
Secondary memory sort
N = 5 × 10^7, n = 64

5
SRC Architecture

Host
1 GB main memory
512 KB L2 cache
MAP
2 Xilinx XC2V6000 FPGAs
6 memory banks (24 MB total)
800 MB/s to/from main memory
4800 MB/s to/from on-board memory

6
Software Platform

System
Linux (Red Hat 7.3)
Driver and Library additions
Compilers
Intel Compilers (C/C++, Fortran), static and run-time libraries
SRC Compilers (C/C++, Fortran) and FPGA Micro Tools
FPGA
Synplicity Synplify Pro
Xilinx Alliance ISE

7
SRC Compilation Process
8
Integer Sorting Implementation

Software only (Proc_only)
FPGA Implementation
Multi-threading

9
Software Only

Radix Sort
radix_sort(A, radix_size)
Let b be the maximum number of binary bits in each element of A
for (i = 1; i <= ceil(b / radix_size); i++)
    bucket sort A on digit i
Priority queue sort
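
The radix sort outlined above can be sketched in C as follows. This is a minimal illustration, not the authors' code: it assumes 64-bit keys, processes digits from least to most significant, and uses a stable counting pass as the bucket sort. The function name radix_sort_u64 and the 16-bit cap on the digit width are choices made for this sketch only.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort of 64-bit unsigned keys: one stable bucket (counting)
 * pass per digit of radix_bits bits. Returns 0 on success, -1 on an
 * unsupported digit width or allocation failure.
 * Assumption: digit width is capped at 16 bits to bound the bucket table. */
int radix_sort_u64(uint64_t *a, size_t n, unsigned radix_bits)
{
    const unsigned key_bits = 64;
    if (radix_bits == 0 || radix_bits > 16)
        return -1;
    if (n < 2)
        return 0;

    size_t buckets = (size_t)1 << radix_bits;
    unsigned passes = (key_bits + radix_bits - 1) / radix_bits; /* ceil */
    uint64_t mask = (uint64_t)buckets - 1;

    uint64_t *tmp = malloc(n * sizeof *tmp);
    size_t *count = malloc((buckets + 1) * sizeof *count);
    if (!tmp || !count) {
        free(tmp);
        free(count);
        return -1;
    }

    for (unsigned p = 0; p < passes; p++) {
        unsigned shift = p * radix_bits;
        memset(count, 0, (buckets + 1) * sizeof *count);
        /* Histogram: count[d + 1] holds the number of keys with digit d. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & mask) + 1]++;
        /* Prefix sum turns the counts into bucket start offsets. */
        for (size_t d = 0; d < buckets; d++)
            count[d + 1] += count[d];
        /* Stable scatter into tmp, then copy back for the next digit. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & mask]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    free(count);
    return 0;
}
```

Calling radix_sort_u64(keys, count, 8) sorts the array in ascending order using eight passes of 256 buckets each. A larger digit width means fewer passes but a larger bucket table.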

10
Priority Queue Sort
(Figure: priority queue built from "<" comparator nodes)
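
A software priority queue sort can be sketched with a binary min-heap, as below. This is an assumption-laden illustration: the slide does not say which priority queue structure the software version uses, and the key128 struct (two 64-bit halves) and the name pq_sort are hypothetical. Because in-place sorting is not required by the benchmark, the sketch writes to a separate output array and consumes the input.

```c
#include <stddef.h>
#include <stdint.h>

/* 128-bit unsigned key stored as two 64-bit halves (hypothetical layout). */
typedef struct { uint64_t hi, lo; } key128;

static int key128_less(key128 x, key128 y)
{
    return x.hi < y.hi || (x.hi == y.hi && x.lo < y.lo);
}

/* Restore the min-heap property below index i in heap[0..n). */
static void sift_down(key128 *heap, size_t n, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = l + 1;
        if (l < n && key128_less(heap[l], heap[smallest])) smallest = l;
        if (r < n && key128_less(heap[r], heap[smallest])) smallest = r;
        if (smallest == i) return;
        key128 t = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = t;
        i = smallest;
    }
}

/* Priority queue sort: heapify `in`, then repeatedly remove the minimum
 * and append it to `out`. The contents of `in` are destroyed. */
void pq_sort(key128 *in, size_t n, key128 *out)
{
    for (size_t i = n / 2; i-- > 0; )
        sift_down(in, n, i);
    for (size_t k = 0; k < n; k++) {
        out[k] = in[0];            /* current minimum */
        in[0] = in[n - 1 - k];     /* move last element to the root */
        sift_down(in, n - 1 - k, 0);
    }
}
```

Each extraction costs O(log N) comparisons, so the whole sort is O(N log N), in contrast to the quadratic bubble sort discussed later in the deck.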
11
128-bit FPGA Bubble Sorting
12
128-bit FPGA Bubble Sorting (Contd)

Comparator cell (user macro)
Pipelined
Data valid bit for updating data in the
comparator register
O(N^2)
Resource usage
90 slices
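
To make the comparator-cell idea concrete, here is a software model of a linear chain of cells, each holding one register and a data-valid bit. It is only a sketch under assumptions: the slides do not give the cell's exact behavior, so the rule used here (an empty cell latches the incoming value; a full cell keeps the smaller value and forwards the larger) is one common arrangement, and all names are hypothetical. In hardware the cells would operate concurrently in a pipeline; this model steps through them sequentially.

```c
#include <stddef.h>
#include <stdint.h>

/* 128-bit unsigned key stored as two 64-bit halves (hypothetical layout). */
typedef struct { uint64_t hi, lo; } key128;

typedef struct {
    key128 reg;  /* value currently held by the cell */
    int valid;   /* data-valid bit: 0 until the cell has latched a value */
} cmp_cell;

static int key128_less(key128 x, key128 y)
{
    return x.hi < y.hi || (x.hi == y.hi && x.lo < y.lo);
}

/* Push one value into the chain. Each full cell keeps the smaller of
 * (stored, incoming) and forwards the larger; the first empty cell
 * latches whatever reaches it. Returns the number of comparisons. */
size_t chain_insert(cmp_cell *cells, size_t ncells, key128 v)
{
    size_t compares = 0;
    for (size_t i = 0; i < ncells; i++) {
        if (!cells[i].valid) {
            cells[i].reg = v;
            cells[i].valid = 1;
            return compares;
        }
        compares++;
        if (key128_less(v, cells[i].reg)) {
            key128 t = cells[i].reg;
            cells[i].reg = v;
            v = t;
        }
    }
    return compares; /* chain full: the largest value falls off the end */
}

/* Sort n keys with a chain of n cells; returns total comparisons made. */
size_t chain_sort(const key128 *in, size_t n, cmp_cell *cells, key128 *out)
{
    size_t compares = 0;
    for (size_t i = 0; i < n; i++)
        cells[i].valid = 0;
    for (size_t i = 0; i < n; i++)
        compares += chain_insert(cells, n, in[i]);
    for (size_t i = 0; i < n; i++)
        out[i] = cells[i].reg;
    return compares;
}
```

In this model the total number of comparisons for N keys is N(N-1)/2, which is where the O(N^2) cost comes from.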

13
128-bit FPGA Priority Queue Sorting

Map C function
Number of leaf nodes: 8
Resource Usage
60 slices
53 IOBs

14
Multi-threading implementation
Input Data Array
Data Partitioning
Host Radix Sorting (PC)
FPGA Bubble Sorting
Host Heap Sorting (PC)
FPGA Heap Sorting
Data Synchronization
Data sorting
2-way Merge Sorting (PC)
Output
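
The partition, sort, and merge flow above can be sketched sequentially in C. This is an illustration under assumptions, not the presented implementation: it uses 64-bit keys, sorts every block with the standard qsort as a stand-in for whichever host or FPGA sorter handles that block, and runs the blocks one after another where the real system would use separate threads. The name block_sort_merge is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* 2-way merge of sorted runs a[0..na) and b[0..nb) into out. */
static void merge2(const uint64_t *a, size_t na,
                   const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Partition data into blocks, sort each block, then merge pairwise.
 * Returns 0 on success, -1 on a zero block size or allocation failure. */
int block_sort_merge(uint64_t *data, size_t n, size_t block)
{
    if (block == 0) return -1;
    if (n < 2) return 0;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;

    /* Stage 1: sort each block independently (one worker per block in a
     * threaded version; qsort is only a placeholder sorter here). */
    for (size_t lo = 0; lo < n; lo += block) {
        size_t len = (n - lo < block) ? n - lo : block;
        qsort(data + lo, len, sizeof *data, cmp_u64);
    }

    /* Stage 2: once every block is sorted (the synchronization point),
     * merge neighbouring runs, doubling the run width each round. */
    for (size_t width = block; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge2(data + lo, mid - lo, data + mid, hi - mid, tmp + lo);
        }
        memcpy(data, tmp, n * sizeof *data);
    }
    free(tmp);
    return 0;
}
```

With 8 blocks, as in the experiments later in the deck, the merge stage takes three rounds. Smaller blocks mean more merge rounds on the host, which is one reason block size matters for total time.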
15
Experiments and Results

Data set
Randomly generated
N = 524288 (512K)
n = 128 bits
Total memory needed: 512K × 16 bytes = 8 MB (fits into two on-board memory banks)
Iterations: one
Time measurement
CPU time
getrusage()
Not a valid measurement since FPGA computation
time is not counted
Wall clock time
gettimeofday()
Not accurate in a multi-user environment
Results averaged over 5 runs for a rough estimate
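
The two measurement approaches can be sketched in C with the POSIX calls named on the slide. This is a minimal example under assumptions: the helper names (cpu_seconds, wall_seconds, time_runs) and the busy_work placeholder workload are hypothetical and merely stand in for the sort being timed.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* User + system CPU time consumed by this process, in seconds.
 * Time spent computing on an attached FPGA is not included. */
double cpu_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Wall-clock time in seconds since the epoch. */
double wall_seconds(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return -1.0;
    return (double)tv.tv_sec + tv.tv_usec / 1e6;
}

typedef void (*work_fn)(void *);

/* Placeholder workload (hypothetical): spins for *arg iterations. */
void busy_work(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long iters = *(unsigned long *)arg;
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
}

/* Run work(arg) `runs` times and report mean wall and CPU time per run. */
void time_runs(work_fn work, void *arg, int runs,
               double *mean_wall, double *mean_cpu)
{
    double w = 0.0, c = 0.0;
    for (int r = 0; r < runs; r++) {
        double w0 = wall_seconds(), c0 = cpu_seconds();
        work(arg);
        w += wall_seconds() - w0;
        c += cpu_seconds() - c0;
    }
    *mean_wall = w / runs;
    *mean_cpu = c / runs;
}
```

Calling time_runs(busy_work, &iters, 5, &wall, &cpu) mirrors the five-run average. A wall time noticeably larger than the CPU time points to time spent outside the host process, whether waiting on the FPGA or on other users' jobs.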

16
Some Parameters Measured

Average map allocation time
0.26 sec
Average map release time
0.000041 sec
Data In time
0.030878 sec
Data Out time
0.051951 sec
Multi-threading overhead
0.075 sec
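
As a quick arithmetic check, the measured parameters can be combined into a fixed overhead per FPGA run. The sketch below copies the values from the slide. It also derives an implied transfer rate, but only under the assumption (not stated on the slide) that the Data In and Data Out times refer to moving the full 8 MB data set. All function names are hypothetical.

```c
#include <stdio.h>

/* Values copied from the slide, in seconds. */
static const double MAP_ALLOC   = 0.26;
static const double MAP_RELEASE = 0.000041;
static const double DATA_IN     = 0.030878;
static const double DATA_OUT    = 0.051951;
static const double MT_OVERHEAD = 0.075;

/* Fixed cost of one FPGA run: allocate, move data in and out, release. */
double fpga_fixed_overhead(void)
{
    return MAP_ALLOC + MAP_RELEASE + DATA_IN + DATA_OUT;
}

/* Same, plus the measured multi-threading overhead. */
double mt_fixed_overhead(void)
{
    return fpga_fixed_overhead() + MT_OVERHEAD;
}

/* Implied transfer rate in MB/s (1 MB = 10^6 bytes).
 * Assumption: `bytes` is the amount moved during `seconds`. */
double implied_mb_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* Print the derived figures; 8 MB is assumed to be 8 * 1024 * 1024 bytes. */
void print_overheads(void)
{
    double bytes = 8.0 * 1024.0 * 1024.0;
    printf("FPGA fixed overhead: %.6f s\n", fpga_fixed_overhead());
    printf("With multi-threading: %.6f s\n", mt_fixed_overhead());
    printf("Implied data-in rate: %.1f MB/s\n",
           implied_mb_per_sec(bytes, DATA_IN));
    printf("Implied data-out rate: %.1f MB/s\n",
           implied_mb_per_sec(bytes, DATA_OUT));
}
```

The fixed overhead comes to about 0.343 s and is dominated by map allocation, so it weighs most heavily on short runs.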

17
Execution Times

Include time for
Data processing
Map alloc/release time
Data in/out
Thread creation/scheduling/removal
FPGA Only
As much as 1.5× for small block sizes
Proc_only wins out for large block size due to
large cache effects
Multi-threading
Generally a good trade-off between FPGA-only and Proc_only
Not efficient when overhead becomes significant

(Figure: execution times, in sec)
18
Detailed Timing
(Figure: detailed timing, in sec)

O(N^2) effect for hardware bubble sorting
When block size = 65536 (8 blocks)
Hw priority queue: 0.083 sec
Sw priority queue: 0.261 sec
Speedup: about 3×

19
Lessons learned

SRC Platform is generally easy to program
Easy to explore different design alternatives
Performance
Overhead
Map alloc/release, data movement, etc
Flexibility vs. performance
Hw priority queue
Knowledge of the MAP C compiler
Extra cycles for each loop in hw priority queue
Measurement accuracy
Elapsed time

20
Work in progress

Using all hw resources
2nd FPGA and all memory banks
Other, more hw-efficient algorithms
Hw radix sort
Optimizing performance
Parallel execution
