South Carolina - PowerPoint PPT Presentation

About This Presentation
Title:

South Carolina

Description:

Title: PowerPoint Presentation
Author: Michael A. Matthews
Last modified by: rk
Created Date: 4/21/2000 2:27:42 AM
Document presentation format: Custom

Slides: 2
Provided by: Micha895

Transcript and Presenter's Notes

The DARPA Data Transposition Benchmark on a Reconfigurable Computer
Sreesa Akella, Duncan A. Buell, Luis E. Cordova, and Jeff Hammes
Department of Computer Science and Engineering, University of South Carolina
MAPLD 2005/243
DARPA Data Transposition Benchmark
Let A_i be a stream of n-bit integers of length L. Consider each
successive block of n integers as an n x n matrix of bits. For each
such matrix, transpose the bits such that bit b_ij is interchanged
with bit b_ji.

128-bit Data transfer Implementation
  • 128-bit word transfers to 4 OBMs
  • Effectively 2 words per cycle transfer
  • Transposition
    • 2 units for 32/64-bit, 4 units for 1024-bit
    • 32-bit: read 8 words from 4 banks, use 4 bit shifts
    • 64-bit: read 4 words from 4 banks, use 2 bit shifts
    • 1024-bit: read 4 words and use 4 units in parallel
  • 4 OBMs for input and 2 for output
  • 2 memory loop dependency cycles added to latency

Modifications to the C Map Implementation
- Parallel sections for computation and data transfer.
- Unrolled the inner loop.
- In n cycles we get all the n outputs.
- In n cycles we read these n values back to memory.
- All benchmarks were implemented.
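The "unrolled the inner loop" item in the C Map modifications can be modelled in plain C as follows. This is a sketch of the idea, not the authors' MAP code: the function name `transpose_stream32`, the fixed block size `N = 32`, and the bit-ordering convention (bit j of `in[i]` is b_ij) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define N 32  /* block size for this sketch; the benchmark also uses 64 and 1024 */

/* Software model of a streamed transposition: one input word is consumed
   per outer iteration ("cycle"), and every output accumulator is updated
   in that same iteration. On reconfigurable hardware the inner loop over j
   would be fully unrolled (assumption); here it is an ordinary loop. */
void transpose_stream32(const uint32_t in[N], uint32_t out[N])
{
    assert(in != out);                   /* in-place use would clobber the input */

    for (int j = 0; j < N; j++)
        out[j] = 0;

    for (int i = 0; i < N; i++) {        /* one iteration per input word */
        uint32_t w = in[i];
        for (int j = 0; j < N; j++)      /* bit j of word i becomes bit i of word j */
            out[j] |= ((w >> j) & 1u) << i;
    }
    /* After N outer iterations all N output words are complete. */
}
```

In this model the N output words become available together after the N-th input word, which mirrors the "in n cycles we get all the n outputs" item.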

n = 32, L = 10^7, ITER = 400
n = 64, L = 10^7, ITER = 230
n = 1024, L = 10^7, ITER = 12
Timing Results
Benchmark   Total User time   No. of Iterations   Time per Iteration
32-bit      244               400                 0.61
64-bit      129               230                 0.56
1024-bit    97                12                  8.08
Software Implementation
Written in C using a two-loop structure.
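A two-loop bit-matrix transposition of the kind described here might look like the sketch below. It is not the authors' source: the function name `transpose_block`, the `uint64_t` storage (which limits it to n up to 64, so the 1024-bit case is not covered), and the convention that bit j of `in[i]` is matrix element b_ij are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Transpose one n x n bit matrix stored as n words of n bits each.
   Assumed convention: bit j of in[i] is matrix element b_ij.
   Only the low n bits of each word are used; requires 1 <= n <= 64. */
void transpose_block(const uint64_t *in, uint64_t *out, int n)
{
    assert(n >= 1 && n <= 64);

    for (int j = 0; j < n; j++) {          /* one output word per column j */
        uint64_t word = 0;
        for (int i = 0; i < n; i++) {      /* gather bit j from every input row */
            uint64_t bit = (in[i] >> j) & 1u;
            word |= bit << i;
        }
        out[j] = word;
    }
}
```

For a stream of length L, a caller would apply `transpose_block` to each successive group of n words. The two nested loops touch every one of the n x n bits once per block.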
Timing Results
Benchmark   Total User time   No. of Iterations   Time per Iteration
32-bit      5609              400                 14.02
64-bit      8160              230                 35.47
1024-bit    2004              12                  187.67

SRC-6 Verilog Map Implementation
- The main program calls the map function.
- The map function calls a Verilog macro.
- The Verilog macro implements the transposition.
- Performance was better than the C Map implementation.
Timing Results
Benchmark   Total User time   No. of Iterations   Time per Iteration
32-bit      55                400                 0.13
64-bit      54                230                 0.23
1024-bit    30                12                  2.50

Timing Results
Benchmark   Total User time   No. of Iterations   Time per Iteration
32-bit      179               400                 0.44
64-bit      98                230                 0.42
1024-bit    72.7              12                  6.05
SRC-6 Reconfigurable Computer
Parallel 3-unit Implementation
- Utilizes all 6 available memory banks: 3 for input and 3 for output
- Only one macro call from the map function
- Verilog macro has 3 units working in parallel
- Theoretically a threefold computational speedup
- Overall a twofold speedup

Performance Analysis
Speedup over software
Benchmark   A    B    C    D    E
32-bit      15   21   41   46   68
64-bit      25   33   55   52   61
1024-bit    23   31   51   75   75
A: C Map, B: Verilog Map, C: Parallel 3-unit, D: 128-bit, E: Parallel 2-unit 128-bit
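The slides do not state the exact formula behind the speedup figures. A minimal sketch of how such numbers are commonly derived, assuming time per iteration is total user time divided by the iteration count and speedup is baseline time per iteration divided by accelerated time per iteration, is given below. The function names are hypothetical.

```c
#include <assert.h>

/* Time per iteration = total user time / number of iterations. */
double time_per_iteration(double total_user_time, int iterations)
{
    assert(iterations > 0);
    return total_user_time / iterations;
}

/* Speedup of an accelerated run relative to a baseline run
   (assumed definition: baseline time divided by accelerated time). */
double speedup(double baseline_per_iter, double accelerated_per_iter)
{
    assert(accelerated_per_iter > 0.0);
    return baseline_per_iter / accelerated_per_iter;
}
```

As a check against the 1024-bit rows of the timing tables, `time_per_iteration(30, 12)` gives 2.50, and `speedup(187.67, 2.50)` is roughly 75. That is in line with the 75 reported for columns D and E, assuming the 187.67 row is the software baseline.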
SRC-6 Implementations
- The SRC implementation was done in two ways.
- Transposition function in C: C Map.
- Transposition function in Verilog: Verilog Map.

Analysis - Parallelism
  • Parallel 3 unit
    • 32-bit: 30, 64-bit: 53, 1024-bit: 47
  • Parallel 2 unit 128-bit
    • 32-bit: 26, 64-bit: 40, 1024-bit: 59
  • Can have more parallel units
    • Will lead to bank conflicts
    • More memory banks run out of I/O pins on FPGA

SRC-6 C Map Implementation
  • The main program calls a C map function.
  • The parameters passed are the A, E values.
  • A has the input values, E has the output values.
  • The two loop structure was used for
    transposition.
  • Implementation was slower than software.

// Assigning values
for (i = 0; i < m; i++) {
    fscanf(in, "%lld", &temp);
    A[i] = temp;
    E[i] = 0;
}
for (j = 0; j < 230; j++) {
    for (k = 0; k < nblocks; k++) {
        // assign values in blocks of half the bank capacity
        // call map function
        dt(A, E, m, &time, 0);
    }
}
...
Timing Results
Benchmark   Total User time   No. of Iterations   Time per Iteration
32-bit      95                400                 0.23
64-bit      60                230                 0.26
1024-bit    44                12                  3.66

Conclusions
  • SRC-6 computer provides great speedup
    • 75 times for 1024-bit benchmark
  • Parallelism exploited to a certain degree
  • Could explore
    • Highly parallel multi-PE architectures
    • Distributed memory architecture
