The DARPA Data Transposition Benchmark on a Reconfigurable Computer
Sreesa Akella, Duncan A. Buell, Luis E. Cordova, and Jeff Hammes
Department of Computer Science and Engineering, University of South Carolina
MAPLD 2005/243
128-bit Data Transfer Implementation
DARPA Data Transposition Benchmark
Modifications to the C Map Implementation
Let A_i be a stream of n-bit integers of length L. Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits so that bit b_ij is interchanged with bit b_ji.
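For concreteness, here is a minimal plain-C sketch of this operation (not from the slides; the function names and the choice n = 32 are illustrative): each successive block of 32 words is treated as a 32 x 32 bit matrix and transposed.

#include <stdint.h>

/* Transpose one 32 x 32 bit matrix held as 32 words:
   bit j of row i is moved to bit i of row j. */
static void transpose_block32(uint32_t block[32])
{
    uint32_t out[32] = {0};
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            out[j] |= ((block[i] >> j) & 1u) << i;
    for (int i = 0; i < 32; i++)
        block[i] = out[i];
}

/* Apply the transposition to every successive block of a stream A of
   length L (assumed here to be a multiple of 32). */
static void transpose_stream32(uint32_t *A, long L)
{
    for (long b = 0; b + 32 <= L; b += 32)
        transpose_block32(&A[b]);
}

The software and hardware implementations described in the following slides compute this same operation; they differ in how the loops are structured and how data is moved.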
- Parallel sections for computation and data transfer (sketched below).
- Unrolled the inner loop.
- In n cycles we get all n outputs.
- In n cycles we read these n values back to memory.
- All benchmarks were implemented.
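A rough sketch of the modified loop structure in plain C (not SRC Carte syntax; the double-buffering arrangement and all names here are assumptions, since the slides only state that computation and data transfer run as parallel sections):

#include <stdint.h>
#include <string.h>

#define N 32   /* illustrative block size; the real code also handles 64 and 1024 */

/* Stand-in for the transposition unit; in hardware the inner j-loop is
   fully unrolled, so all N output words are updated once per input word. */
static void transpose_block(const uint32_t in[N], uint32_t out[N])
{
    memset(out, 0, N * sizeof(uint32_t));
    for (int i = 0; i < N; i++)              /* N cycles: one input word each */
        for (int j = 0; j < N; j++)          /* unrolled in hardware */
            out[j] |= ((in[i] >> j) & 1u) << i;
}

/* Conceptual shape of the modified map loop.  The two commented "sections"
   run concurrently on the MAP; here they are written sequentially with a
   double buffer (one plausible arrangement, not confirmed by the slides). */
static void map_body(const uint32_t *in, uint32_t *out, long nblocks)
{
    uint32_t buf[2][N];

    if (nblocks <= 0)
        return;
    memcpy(buf[0], in, sizeof(buf[0]));                 /* prologue: block 0 */
    for (long b = 0; b < nblocks; b++) {
        /* section 1: data transfer -- fetch block b+1 into the other buffer */
        if (b + 1 < nblocks)
            memcpy(buf[(b + 1) & 1], in + (b + 1) * N, sizeof(buf[0]));
        /* section 2: computation -- transpose block b; N cycles produce the
           N outputs, another N cycles write them back to on-board memory */
        transpose_block(buf[b & 1], out + b * N);
    }
}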
- 128-bit word transfers to 4 OBMs (unpacking sketched below)
  - Effectively 2 words transferred per cycle
- Transposition
  - 2 units for 32/64-bit, 4 units for 1024-bit
  - 32-bit: read 8 words from 4 banks, use 4 bit shifts
  - 64-bit: read 4 words from 4 banks, use 2 bit shifts
  - 1024-bit: read 4 words and use 4 units in parallel
  - 4 OBMs for input and 2 for output
- 2 memory loop-dependency cycles added to latency
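To illustrate the packing behind the shift counts above, here is a small C sketch (the two-halves representation of a 128-bit word and the field layout are assumptions, not from the slides): one 128-bit transfer word carries four 32-bit values or two 64-bit values, extracted by shifting to the appropriate sub-word position.

#include <stdint.h>

/* One 128-bit transfer word, modeled as two 64-bit halves (standard C has no
   portable 128-bit integer type).  The mapping of halves onto OBM banks is an
   assumption of this sketch. */
typedef struct {
    uint64_t lo;   /* bits   0..63  */
    uint64_t hi;   /* bits 64..127  */
} word128;

/* 32-bit benchmark: four packed 32-bit values per 128-bit word, peeled off
   with shifts of 0 and 32 on each half (the slides' "4 bit shifts"). */
static void unpack_32(word128 w, uint32_t v[4])
{
    v[0] = (uint32_t)w.lo;
    v[1] = (uint32_t)(w.lo >> 32);
    v[2] = (uint32_t)w.hi;
    v[3] = (uint32_t)(w.hi >> 32);
}

/* 64-bit benchmark: two packed 64-bit values per 128-bit word (the slides'
   "2 bit shifts" on the full 128-bit word). */
static void unpack_64(word128 w, uint64_t v[2])
{
    v[0] = w.lo;
    v[1] = w.hi;
}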
n       L       Iterations
32      10^7    400
64      10^7    230
1024    10^7    12
Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     244               400                 0.61
n = 64-bit     129               230                 0.56
n = 1024-bit   97                12                  8.08
Software Implementation
Written in C, using a two-loop structure.
SRC-6 Verilog Map Implementation
Timing Results
- The main program calls the map function.
- The map function calls a Verilog macro.
- The Verilog macro implements the transposition.
- Performance was better than the C Map implementation.
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     5609              400                 14.02
n = 64-bit     8160              230                 35.47
n = 1024-bit   2004              12                  187.67
Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     55                400                 0.13
n = 64-bit     54                230                 0.23
n = 1024-bit   30                12                  2.50

Timing Results
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     179               400                 0.44
n = 64-bit     98                230                 0.42
n = 1024-bit   72.7              12                  6.05
SRC-6 Reconfigurable Computer
Performance Analysis
Parallel 3-unit Implementation
Speedup over software
Benchmark    A    B    C    D    E
32-bit       15   21   41   46   68
64-bit       25   33   55   52   61
1024-bit     23   31   51   75   75
- Utilizes all 6 available memory banks: 3 for input and 3 for output
- Only one macro call from the map function
- The Verilog macro has 3 units working in parallel
- Theoretically a 3x computational speedup; roughly a 2x overall speedup (a conceptual sketch follows below)
A = C Map, B = Verilog Map, C = Parallel 3-unit, D = 128-bit, E = Parallel 2-unit 128-bit
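A conceptual C sketch of the parallel 3-unit arrangement (the even split of blocks across banks and all names are assumptions; in the real design the three units run concurrently inside a single Verilog macro rather than being called in sequence):

#include <stdint.h>

#define N       32   /* illustrative block size */
#define UNITS   3    /* three transposition units, one per input/output bank pair */

/* One transposition unit: processes the blocks held in its own input bank
   and writes the results to its own output bank (same kernel as earlier). */
static void unit_run(const uint32_t *in_bank, uint32_t *out_bank, long nblocks)
{
    for (long b = 0; b < nblocks; b++) {
        const uint32_t *in = in_bank + b * N;
        uint32_t *out = out_bank + b * N;
        for (int j = 0; j < N; j++)
            out[j] = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[j] |= ((in[i] >> j) & 1u) << i;
    }
}

/* Conceptual model of the single macro call: the work is split across the
   three input banks, and in hardware the three unit_run() instances execute
   concurrently (here they are simply called in sequence). */
static void parallel_3unit(const uint32_t *in_bank[UNITS],
                           uint32_t *out_bank[UNITS],
                           long nblocks_per_bank)
{
    for (int u = 0; u < UNITS; u++)
        unit_run(in_bank[u], out_bank[u], nblocks_per_bank);
}

Because the data still has to be staged into the three input banks and drained from the three output banks, the theoretical 3x computational speedup shows up as roughly a 2x overall speedup.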
SRC-6 Implementations
Analysis - Parallelism
- The SRC implementation was done in two ways:
  - Transposition function in C (C Map)
  - Transposition function in Verilog (Verilog Map)
- Parallel 3-unit
  - 32-bit: 30, 64-bit: 53, 1024-bit: 47
- Parallel 2-unit 128-bit
  - 32-bit: 26, 64-bit: 40, 1024-bit: 59
- Could have more parallel units
  - Would lead to bank conflicts
  - More memory banks would run out of I/O pins on the FPGA
SRC-6 C Map Implementation
- The main program calls a C map function.
- The parameters passed are the A and E arrays: A has the input values, E has the output values.
- The two-loop structure was used for transposition.
- This implementation was slower than software.
// Assigning values
for (i = 0; i < m; i++) {
    fscanf(in, "%lld", &temp);
    A[i] = temp;
    E[i] = 0;
}
for (j = 0; j < 230; j++)
    for (k = 0; k < nblocks; k++) {
        // assign values in blocks of half
        // the bank capacity
        // call map function
        dt(A, E, m, &time, 0);
    }
Timing Results
Conclusions
Benchmark      Total user time   No. of iterations   Time per iteration
n = 32-bit     95                400                 0.23
n = 64-bit     60                230                 0.26
n = 1024-bit   44                12                  3.66
- The SRC-6 computer provides substantial speedup: up to 75 times for the 1024-bit benchmark.
- Parallelism was exploited to a certain degree.
- Could explore:
  - Highly parallel multi-PE architectures
  - Distributed memory architectures