The DARPA Data Transposition Benchmark on a Reconfigurable Computer
Sreesa Akella, Duncan A. Buell, Luis E. Cordova, and Jeff Hammes
Department of Computer Science and Engineering, University of South Carolina
MAPLD 2005/243
128-bit Data Transfer Implementation
DARPA Data Transposition Benchmark
Modifications to the C Map Implementation
Let A_i be a stream of n-bit integers of length L. Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits so that bit b_ij is interchanged with bit b_ji.
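For concreteness, here is a minimal plain-C sketch of this operation (not from the slides; the function names and the choice n = 32 are illustrative): each successive block of 32 words is treated as a 32 x 32 bit matrix and transposed.

#include <stdint.h>

/* Transpose one 32 x 32 bit matrix held as 32 words:
   bit j of row i is moved to bit i of row j. */
static void transpose_block32(uint32_t block[32])
{
    uint32_t out[32] = {0};
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            out[j] |= ((block[i] >> j) & 1u) << i;
    for (int i = 0; i < 32; i++)
        block[i] = out[i];
}

/* Apply the transposition to every successive block of a stream A of
   length L (assumed here to be a multiple of 32). */
static void transpose_stream32(uint32_t *A, long L)
{
    for (long b = 0; b + 32 <= L; b += 32)
        transpose_block32(&A[b]);
}

The software and hardware implementations described in the following slides compute this same operation; they differ in how the loops are structured and how data is moved.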
- Parallel sections for computation and data transfer (sketched below).
- Unrolled the inner loop.
- In n cycles we get all n outputs.
- In n cycles we read these n values back to memory.
- All benchmarks were implemented.
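A rough sketch of the modified loop structure in plain C (not SRC Carte syntax; the double-buffering arrangement and all names here are assumptions, since the slides only state that computation and data transfer run as parallel sections):

#include <stdint.h>
#include <string.h>

#define N 32   /* illustrative block size; the real code also handles 64 and 1024 */

/* Stand-in for the transposition unit; in hardware the inner j-loop is
   fully unrolled, so all N output words are updated once per input word. */
static void transpose_block(const uint32_t in[N], uint32_t out[N])
{
    memset(out, 0, N * sizeof(uint32_t));
    for (int i = 0; i < N; i++)              /* N cycles: one input word each */
        for (int j = 0; j < N; j++)          /* unrolled in hardware */
            out[j] |= ((in[i] >> j) & 1u) << i;
}

/* Conceptual shape of the modified map loop.  The two commented "sections"
   run concurrently on the MAP; here they are written sequentially with a
   double buffer (one plausible arrangement, not confirmed by the slides). */
static void map_body(const uint32_t *in, uint32_t *out, long nblocks)
{
    uint32_t buf[2][N];

    if (nblocks <= 0)
        return;
    memcpy(buf[0], in, sizeof(buf[0]));                 /* prologue: block 0 */
    for (long b = 0; b < nblocks; b++) {
        /* section 1: data transfer -- fetch block b+1 into the other buffer */
        if (b + 1 < nblocks)
            memcpy(buf[(b + 1) & 1], in + (b + 1) * N, sizeof(buf[0]));
        /* section 2: computation -- transpose block b; N cycles produce the
           N outputs, another N cycles write them back to on-board memory */
        transpose_block(buf[b & 1], out + b * N);
    }
}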
- 128-bit word transfers to 4 OBMs (unpacking sketched below)
  - Effectively 2 words transferred per cycle
- Transposition
  - 2 units for 32/64-bit, 4 units for 1024-bit
  - 32-bit: read 8 words from 4 banks, use 4 bit shifts
  - 64-bit: read 4 words from 4 banks, use 2 bit shifts
  - 1024-bit: read 4 words and use 4 units in parallel
  - 4 OBMs for input and 2 for output
- 2 memory loop-dependency cycles added to latency
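To illustrate the packing behind the shift counts above, here is a small C sketch (the two-halves representation of a 128-bit word and the field layout are assumptions, not from the slides): one 128-bit transfer word carries four 32-bit values or two 64-bit values, extracted by shifting to the appropriate sub-word position.

#include <stdint.h>

/* One 128-bit transfer word, modeled as two 64-bit halves (standard C has no
   portable 128-bit integer type).  The mapping of halves onto OBM banks is an
   assumption of this sketch. */
typedef struct {
    uint64_t lo;   /* bits   0..63  */
    uint64_t hi;   /* bits 64..127  */
} word128;

/* 32-bit benchmark: four packed 32-bit values per 128-bit word, peeled off
   with shifts of 0 and 32 on each half (the slides' "4 bit shifts"). */
static void unpack_32(word128 w, uint32_t v[4])
{
    v[0] = (uint32_t)w.lo;
    v[1] = (uint32_t)(w.lo >> 32);
    v[2] = (uint32_t)w.hi;
    v[3] = (uint32_t)(w.hi >> 32);
}

/* 64-bit benchmark: two packed 64-bit values per 128-bit word (the slides'
   "2 bit shifts" on the full 128-bit word). */
static void unpack_64(word128 w, uint64_t v[2])
{
    v[0] = w.lo;
    v[1] = w.hi;
}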
n       L       Iterations
32      10^7    400
64      10^7    230
1024    10^7    12
Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     244               400                 0.61
n = 64-bit     129               230                 0.56
n = 1024-bit   97                12                  8.08
Software Implementation
Written in C, using a two-loop structure.
SRC-6 Verilog Map Implementation
Timing Results
- The main program calls the map function.
- The map function calls a Verilog macro.
- The Verilog macro implements the transposition.
- Performance was better than the C Map implementation.
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     5609              400                 14.02
n = 64-bit     8160              230                 35.47
n = 1024-bit   2004              12                  187.67
Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     55                400                 0.13
n = 64-bit     54                230                 0.23
n = 1024-bit   30                12                  2.50

Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     179               400                 0.44
n = 64-bit     98                230                 0.42
n = 1024-bit   72.7              12                  6.05
SRC-6 Reconfigurable Computer
Performance Analysis
Parallel 3-unit Implementation
Speedup over software
Benchmark    A    B    C    D    E
32-bit       15   21   41   46   68
64-bit       25   33   55   52   61
1024-bit     23   31   51   75   75
- Utilizes all 6 available memory banks: 3 for input and 3 for output
- Only one macro call from the map function
- The Verilog macro has 3 units working in parallel
- Theoretically a 3x computational speedup; roughly a 2x overall speedup (a conceptual sketch follows below)
A = C Map, B = Verilog Map, C = Parallel 3-unit, D = 128-bit, E = Parallel 2-unit 128-bit
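A conceptual C sketch of the parallel 3-unit arrangement (the even split of blocks across banks and all names are assumptions; in the real design the three units run concurrently inside a single Verilog macro rather than being called in sequence):

#include <stdint.h>

#define N       32   /* illustrative block size */
#define UNITS   3    /* three transposition units, one per input/output bank pair */

/* One transposition unit: processes the blocks held in its own input bank
   and writes the results to its own output bank (same kernel as earlier). */
static void unit_run(const uint32_t *in_bank, uint32_t *out_bank, long nblocks)
{
    for (long b = 0; b < nblocks; b++) {
        const uint32_t *in = in_bank + b * N;
        uint32_t *out = out_bank + b * N;
        for (int j = 0; j < N; j++)
            out[j] = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[j] |= ((in[i] >> j) & 1u) << i;
    }
}

/* Conceptual model of the single macro call: the work is split across the
   three input banks, and in hardware the three unit_run() instances execute
   concurrently (here they are simply called in sequence). */
static void parallel_3unit(const uint32_t *in_bank[UNITS],
                           uint32_t *out_bank[UNITS],
                           long nblocks_per_bank)
{
    for (int u = 0; u < UNITS; u++)
        unit_run(in_bank[u], out_bank[u], nblocks_per_bank);
}

Because the data still has to be staged into the three input banks and drained from the three output banks, the theoretical 3x computational speedup shows up as roughly a 2x overall speedup.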
SRC-6 Implementations
Analysis - Parallelism
- The SRC implementation was done in two ways:
  - Transposition function in C (C Map)
  - Transposition function in Verilog (Verilog Map)
- Parallel 3-unit
  - 32-bit: 30, 64-bit: 53, 1024-bit: 47
- Parallel 2-unit 128-bit
  - 32-bit: 26, 64-bit: 40, 1024-bit: 59
- Could have more parallel units
  - Would lead to bank conflicts
  - More memory banks would run out of I/O pins on the FPGA
SRC-6 C Map Implementation
- The main program calls a C map function.
- The parameters passed are the A and E arrays: A has the input values, E has the output values.
- The two-loop structure was used for transposition.
- This implementation was slower than software.
// Assigning values
for (i = 0; i < m; i++) {
    fscanf(in, "%lld", &temp);
    A[i] = temp;
    E[i] = 0;
}
for (j = 0; j < 230; j++)
    for (k = 0; k < nblocks; k++) {
        // assign values in blocks of half
        // the bank capacity
        // call map function
        dt(A, E, m, &time, 0);
    }
Timing Results
Conclusions
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     95                400                 0.23
n = 64-bit     60                230                 0.26
n = 1024-bit   44                12                  3.66
- The SRC-6 computer provides substantial speedup: up to 75 times for the 1024-bit benchmark.
- Parallelism was exploited to a certain degree.
- Could explore:
  - Highly parallel multi-PE architectures
  - Distributed memory architectures