DARPA Sorting Benchmark On SRC Platform


Title: DARPA Sorting Benchmark On SRC Platform


1
DARPA Sorting Benchmark On SRC Platform
  • Gang Quan, Allen Michalski, Duncan Buell, James
    Davis
  • Department of CSE
  • University of South Carolina

2
Outline
  • Introduction
  • DARPA Benchmark Suite
  • SRC Platform
  • Integer Sorting Algorithm Implementations
  • Experiments and Results
  • Discussions

3
DARPA Benchmark Suite
  • Six benchmarks to measure performance of high
    productivity computing systems
  • Benchmark suite
  • Large shared memory random access
  • Matrix multiplication with multiprecise modular
    coefficients
  • A dynamic programming
  • Data transposition
  • Integer sort
  • Bit string pattern matching

4
DARPA Integer Sorting Benchmark
  • Sorting a stream of n-bit unsigned integers of
    length N
  • In-place sorting is not required
  • Non-unique integers
  • In-core sort
  • N = 10^6, n = 64
  • N = 5×10^7, n = 128
  • Secondary memory sort
  • N = 5×10^7, n = 64

5
SRC Architecture
  • Host
  • 1GB main memory
  • 512K L2 cache
  • MAP
  • 2 XC2V6000
  • 6 memory banks (24MB total)
  • 800MB/s to/from main memory
  • 4800MB/s to/from on-board memory

6
Software Platform
  • System
  • Linux (Red Hat 7.3)
  • Driver and Library additions
  • Compilers
  • Intel Compilers (C/C++, Fortran), static and
    run-time libraries
  • SRC Compilers (C/C++, Fortran) and FPGA Micro
  • Tools
  • FPGA
  • Synplicity Synplify Pro
  • Xilinx Alliance ISE

7
SRC Compilation Process
8
Integer Sorting Implementation
  • Software only (Proc_only)
  • FPGA Implementation
  • Multi-threading

9
Software Only
  • Radix Sort
  • radix_sort(A, radix_size)
  • Let b be the maximum number of binary bits in
    each element of A
  • for (i = 1; i <= ceil(b / radix_size); i++)
  • bucket_sort(A, on digit i)
  • Priority queue sort
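
The radix sort loop above can be sketched in C as follows. This is a minimal illustration, not the presenters' code: it assumes 64-bit keys, a least-significant-digit-first order, and a counting pass as the per-digit bucket sort. The names `bucket_pass` and `radix_bits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One stable bucket (counting) pass on the digit of radix_bits bits
 * that starts at bit offset shift. Reads src, writes dst. */
static void bucket_pass(const uint64_t *src, uint64_t *dst, size_t n,
                        unsigned shift, unsigned radix_bits)
{
    size_t buckets = (size_t)1 << radix_bits;
    uint64_t mask = (uint64_t)buckets - 1;
    size_t *count = calloc(buckets + 1, sizeof *count);
    assert(count != NULL);

    for (size_t i = 0; i < n; i++)            /* histogram of digits */
        count[((src[i] >> shift) & mask) + 1]++;
    for (size_t b = 0; b < buckets; b++)      /* prefix sums = start offsets */
        count[b + 1] += count[b];
    for (size_t i = 0; i < n; i++)            /* stable scatter */
        dst[count[(src[i] >> shift) & mask]++] = src[i];
    free(count);
}

/* Sort a[0..n) ascending using ceil(64 / radix_bits) bucket passes. */
void radix_sort(uint64_t *a, size_t n, unsigned radix_bits)
{
    assert(radix_bits >= 1 && radix_bits <= 16);
    if (n < 2)
        return;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    uint64_t *src = a, *dst = tmp;
    unsigned passes = (64 + radix_bits - 1) / radix_bits;

    for (unsigned i = 0; i < passes; i++) {
        bucket_pass(src, dst, n, i * radix_bits, radix_bits);
        uint64_t *t = src; src = dst; dst = t;  /* latest data is now in src */
    }
    if (src != a)                               /* odd number of passes */
        memcpy(a, src, n * sizeof *a);
    free(tmp);
}
```

The scratch buffer is acceptable here because the benchmark does not require in-place sorting. A larger `radix_bits` means fewer passes but a larger count table, which is one way cache effects enter the software-only timings.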

10
Priority Queue Sort
  • [Figure: diagram of "<" comparator nodes]
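
A software priority queue sort can be sketched as below. The transcript does not preserve the structure of the queue, so this sketch assumes a binary min-heap over 64-bit keys; `pq_sort` and `sift_down` are hypothetical names, not functions from the presentation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Restore the min-heap property below index i in heap[0..n). */
static void sift_down(uint64_t *heap, size_t n, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = l + 1;
        if (l < n && heap[l] < heap[smallest]) smallest = l;
        if (r < n && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i)
            return;
        uint64_t t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
        i = smallest;
    }
}

/* Sort in[0..n) into out[0..n) in ascending order:
 * load every key into the queue, then repeatedly remove the minimum. */
void pq_sort(const uint64_t *in, uint64_t *out, size_t n)
{
    if (n == 0)
        return;
    uint64_t *heap = malloc(n * sizeof *heap);
    assert(heap != NULL);

    for (size_t i = 0; i < n; i++)
        heap[i] = in[i];
    for (size_t i = n / 2; i-- > 0; )       /* bottom-up heap construction */
        sift_down(heap, n, i);

    for (size_t k = 0, m = n; k < n; k++) {
        out[k] = heap[0];                   /* current minimum */
        heap[0] = heap[--m];                /* move last key to the root */
        sift_down(heap, m, 0);
    }
    free(heap);
}
```

Each removal costs O(log N) comparisons, so the whole sort is O(N log N), in contrast to the O(N^2) comparator chain used for the bubble sort.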
11
128-bit FPGA Bubble Sorting
12
128-bit FPGA Bubble Sorting (Contd)
  • Comparator cell (user micro)
  • Pipelined
  • Data valid bit for updating data in the
    comparator register
  • O(N^2)
  • Resource usage
  • 90 slices
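
A software model of a chain of comparator cells may help make the slide concrete. The sketch below is an assumption about how such cells could be connected, not the actual user micro: each cell holds one 128-bit key plus a data valid bit, keeps the smaller of its register and the incoming key, and forwards the larger to the next cell. The names `key128`, `cell`, and `chain_push` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } key128;   /* 128-bit unsigned key */

typedef struct {
    key128 reg;    /* value held in the comparator register */
    bool valid;    /* data valid bit: false until the register is loaded */
} cell;

/* Unsigned 128-bit comparison: high word first, then low word. */
static bool key_less(key128 a, key128 b)
{
    return a.hi < b.hi || (a.hi == b.hi && a.lo < b.lo);
}

/* Push one key through a chain of ncells cells.
 * Returns false if every cell is already full, in which case the largest
 * key seen by the chain falls off the end and is dropped. */
bool chain_push(cell *cells, size_t ncells, key128 in)
{
    assert(cells != NULL);
    for (size_t i = 0; i < ncells; i++) {
        if (!cells[i].valid) {              /* empty cell: latch the key */
            cells[i].reg = in;
            cells[i].valid = true;
            return true;
        }
        if (key_less(in, cells[i].reg)) {   /* keep smaller, forward larger */
            key128 t = cells[i].reg;
            cells[i].reg = in;
            in = t;
        }
    }
    return false;
}
```

After N keys are pushed into N zero-initialized cells, cell i holds the i-th smallest key. The model performs up to N comparisons per key, which is where the O(N^2) work comes from; in hardware the cells would operate concurrently as a pipeline rather than in this sequential loop.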

13
128-bit FPGA Priority Queue Sorting
  • Map C function
  • Leaf node number 8
  • Resource Usage
  • 60 slices
  • 53 IOBs

14
Multi-threading implementation
  • Input Data Array
  • Data Partitioning
  • Host Radix Sorting (PC)
  • FPGA Bubble Sorting
  • Host Heap Sorting (PC)
  • FPGA Heap Sorting
  • Data Synchronization
  • Data sorting
  • 2-way Merge Sorting (PC)
  • Output
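
The partition, sort, and merge flow can be sketched in C as follows. This is a sequential illustration under stated assumptions: the blocks are sorted one after another with the standard `qsort` as a stand-in for the host and FPGA sorters, keys are 64-bit, and no threads are created. The names `block_sort` and `merge2` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* 2-way merge of sorted runs a[0..na) and b[0..nb) into out. */
static void merge2(const uint64_t *a, size_t na,
                   const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Partition data[0..n) into blocks, sort each block, then merge
 * neighbouring sorted runs pairwise until one run remains. */
void block_sort(uint64_t *data, size_t n, size_t block)
{
    assert(block > 0);
    if (n < 2)
        return;

    /* Data partitioning + per-block sort (stand-in for host/FPGA sorters). */
    for (size_t off = 0; off < n; off += block) {
        size_t len = (n - off < block) ? n - off : block;
        qsort(data + off, len, sizeof *data, cmp_u64);
    }

    /* 2-way merge passes; the run width doubles on each pass. */
    uint64_t *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t width = block; width < n; width *= 2) {
        for (size_t off = 0; off < n; off += 2 * width) {
            size_t mid = (off + width < n) ? off + width : n;
            size_t end = (off + 2 * width < n) ? off + 2 * width : n;
            merge2(data + off, mid - off, data + mid, end - mid, tmp + off);
        }
        memcpy(data, tmp, n * sizeof *data);
    }
    free(tmp);
}
```

In a threaded version, the per-block sort loop is the part that would be split between host and FPGA workers, with a synchronization point before the merge passes begin.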
15
Experiments and Results
  • Data set
  • Randomly generated
  • N = 524288 (512K)
  • n = 128 bits
  • Total memory needed: 512K × 16 bytes = 8 MB (fits into
    two on-board memory banks)
  • Iterations: one
  • Time measurement
  • CPU time
  • getrusage()
  • Not a valid measurement since FPGA computation
    time is not counted
  • Wall clock time
  • gettimeofday()
  • Not accurate in a multi-user environment
  • Data averaged over 5 runs for a rough estimate
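
The two measurements can be sketched with the POSIX calls named on the slide. The helper names `wall_seconds` and `cpu_seconds` are hypothetical; the sketch only shows how the two clocks are read, not how the presenters instrumented their code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Wall clock time in seconds, read with gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* User + system CPU time of the calling process in seconds,
 * read with getrusage(). Time spent computing on an attached FPGA
 * is not charged to the process, so it does not appear here. */
double cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}
```

A typical use is to read both clocks before and after the region of interest and subtract. Repeating the run several times and averaging the wall clock differences reduces, but does not remove, interference from other users.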

16
Some Parameters Measured
  • Average map allocation time
  • 0.26 sec
  • Average map release time
  • 0.000041 sec
  • Data In time
  • 0.030878 sec
  • Data Out time
  • 0.051951 sec
  • Multi-threading overhead
  • 0.075 sec
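
To see how these parameters add up, the sketch below sums them into a fixed overhead per run. It assumes the costs are simply additive and that each is paid once per run; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Measured per-run costs in seconds (values come from the caller). */
typedef struct {
    double map_alloc;    /* average map allocation time */
    double map_release;  /* average map release time */
    double data_in;      /* data in time */
    double data_out;     /* data out time */
    double threading;    /* multi-threading overhead */
} overheads;

/* Sum of the non-compute costs, assuming each is paid exactly once.
 * The threading cost is included only when multithreaded is nonzero. */
double fixed_overhead(const overheads *o, int multithreaded)
{
    assert(o != NULL);
    double t = o->map_alloc + o->map_release + o->data_in + o->data_out;
    return multithreaded ? t + o->threading : t;
}
```

With the values listed above, 0.26 + 0.000041 + 0.030878 + 0.051951 gives about 0.343 sec without threading and about 0.418 sec with it, and map allocation is the largest single term.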

17
Execution Times
  • Include time for
  • Data processing
  • Map alloc/release time
  • Data in/out
  • Thread creation/scheduling/removal
  • FPGA Only
  • As much as 1.5 for small block size
  • Proc_only wins out for large block size due to
    large cache effects
  • Multi-threading
  • Generally, a good trade-off between FPGA-only and
    Proc-only
  • Not efficient when overhead becomes significant

  • [Figure: execution times (sec)]
18
Detailed Timing
  • [Figure: detailed timing (sec)]
  • O(N^2) effect for hardware bubble sorting
  • When block size = 65536 (8 blocks)
  • Hw priority queue: 0.083 sec
  • Sw priority queue: 0.261 sec
  • Speedup: about 3 times

19
Lessons learned
  • SRC Platform is generally easy to program
  • Easy to explore different design alternatives
  • Performance
  • Overhead
  • Map alloc/release, data movement, etc
  • Flexibility vs. performance
  • Hw priority queue
  • Knowledge of the Map C compiler
  • Extra cycles for each loop in hw priority queue
  • Measurement accuracy
  • Elapsed time

20
Work in progress
  • Using all hw resources
  • 2nd FPGA and all memory banks
  • Other, more hw-efficient algorithms
  • Hw radix sort
  • Optimizing performance
  • Parallel execution