Title: Coarse Grain Reconfigurable Architectures
1. Coarse Grain Reconfigurable Architectures
2. Announcements
- Group meetings this week (sign-up sheet)
- Starting October 8th: PSYCH304, 3:30-6pm
- Today: Coarse Grain Architectures
3. Motivation for coarse-grained architectures
- Definition
  - An FPGA with granularity greater than 1 bit.
  - Architectures seen so far vary from 2 bits to 32 bits.
- Disadvantages of low granularity
  - Large routing area (80% of chip area!).
  - Large volume of configuration data.
  - Low area efficiency for arithmetic operations and RAM.
  - Reduced clock speed, bandwidth.
- Advantages of coarse-grained architectures
  - Fewer PEs, so the place-and-route (PAR) problem is less complex and much faster.
  - Less chip area devoted to routing.
  - Easier to correlate software functions with hardware.
- Tradeoffs
  - Flexibility/generality of 1-bit configuration.
  - Possible underutilization.
4. Coarse-grained architectures
- DP-FPGA
  - LUT-based
  - LUTs share configuration bits
- CHESS
  - Hexagonal mesh (chessboard layout) of ALUs
- MATRIX
  - 2D array of ALUs
- RaPiD
  - Specialized ALUs, multipliers
  - 1D array
- RAW
  - Full RISC core as basic block
  - 2D mesh
- PADDI
  - Clusters of 8 arithmetic EXUs, 16 bits wide, 8-word SRAM
  - Central crossbar switch
5. Different Coarse-Grained Architectures
[Comparison table of ALU-based coarse-grained architectures: DP-FPGA, CHESS, MATRIX, PADDI, RaPiD (1D array, 16-bit), RAW (2D mesh; instruction memory, data memory, and configurable logic per tile)]
6. RAW: Motivation
- It takes on the order of two clock cycles for a signal to travel edge-to-edge (roughly fifteen mm) of a 2-GHz processor.
- Compaq's Alpha 21264 was forced to split the integer unit into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters.
- Intel Pentium 4 architects had to allocate two pipeline stages solely for the traversal of long wires.
- RAW
  - uses a scalable ISA that provides a parallel software interface to the gate, wire, and pin resources of a chip.
  - An architecture with direct, first-class analogs to all of these physical resources lets programmers extract the maximum amount of performance and energy efficiency in the face of wire delay.
  - tries to minimize the ISA gap by exposing the underlying physical resources as architectural entities.
7. What is Raw?
- The Raw microprocessor consumes 122 million transistors,
- executes 16 different load, store, integer, or floating-point instructions every cycle,
- controls 25 Gbytes/s of input/output (I/O) bandwidth, and
- has 2 Mbytes of on-chip distributed L1 static RAM, providing on-chip memory bandwidth of 57 Gbytes/s.
- It took only a handful of graduate students at the Laboratory for Computer Science at MIT to design and implement Raw.
8. RAW: Reconfigurable Architecture Workstation (@MIT)
- Challenges
  - Short internal chip wires to allow a high clock speed
  - Quick verification of new designs
  - Workloads on processors (multimedia)
- Multi-granular processing elements (tiles)
  - a RISC processor,
  - configurable logic (CL),
  - instruction and data memories,
  - a programmable switch
- Parallelizing compiler (to distribute workload)
- Supports static and dynamic routing
9. RAW Microprocessor
Each tile includes a RISC processor and configurable logic.
10. RAW: Comparison
11. RAW
12. RAW: Datapath of an individual tile
- Intertile communication latency similar to that of register accesses
- Distributed registers → more ILP
- Memory access time close to the processor clock
- No hardware-based register-renaming logic or instruction issue (different from superscalar)
- Focus is on computation (NOT control)
- Switch integrated into the processor pipeline
- Static and dynamic scheduling
- Multigranular and configurable
13. RAW Pipeline
14. RAW Software Support
15. RAW Compiler
- (a) shows the original code. (b) shows the memories of a 4-processor Raw machine and how array A is distributed across them. (c) shows the code after unrolling: each access now refers to locations on only one processor (see the sketch below).
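To make the transformation concrete, here is a minimal C sketch (the array size, names, and per-tile assignment are illustrative assumptions; the actual Raw compiler applies this to its own intermediate representation):

```c
#include <stdio.h>

#define N      16   /* hypothetical array size           */
#define NTILES 4    /* the 4-processor Raw machine above */

int main(void) {
    int A[N];

    /* (a) Original loop: consecutive iterations touch memory
     *     banks on different tiles (A is interleaved).      */
    for (int i = 0; i < N; i++)
        A[i] = i;

    /* (c) Loop unrolled by NTILES: within one unrolled body,
     *     A[i+0] always lives on tile 0, A[i+1] on tile 1, etc.,
     *     so each access refers to exactly one tile's memory
     *     and can be scheduled onto that tile statically.   */
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < N; i += NTILES) {
        sum0 += A[i + 0];   /* placed on tile 0 */
        sum1 += A[i + 1];   /* placed on tile 1 */
        sum2 += A[i + 2];   /* placed on tile 2 */
        sum3 += A[i + 3];   /* placed on tile 3 */
    }
    printf("%d\n", sum0 + sum1 + sum2 + sum3);
    return 0;
}
```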
16. RAW Parallelizing Compiler
Source: "Parallelizing Applications into Silicon," 1999.
17. RAW: Experimental Results
18. RAW: Performance
19. RAW Fabricated
20. RAW vs. FPGA
- FPGAs
  - exploit fine-grained parallelism and fast static communication.
  - Software has access to their low-level details, allowing it to optimize the mapping of the user application.
  - Users can also bind commonly used instruction sequences into configurable logic.
  - As a result, these special-purpose instruction sequences can execute in a single cycle.
  - But FPGAs require loading an entire bitstream to reprogram for a new operation.
  - Compilation for FPGAs is also slow because of their fine granularity.
- Raw
  - supports instruction sequencing.
  - It is more flexible, merely pointing to a new instruction to execute a new operation.
  - Compilation for Raw is fast because the hardware contains commonly used compute mechanisms such as ALUs and memory paths.
  - This eliminates repeated, low-level compilation of these units.
  - Binding common mechanisms into hardware also yields faster execution speed, lower area, and better power efficiency than FPGA systems.
21. CHESS
- A Reconfigurable Arithmetic Array for Multimedia Applications
22. The CHESS Architecture: Introduction
- Developed by HP.
- Aims at speeding up arithmetic operations for multimedia applications.
- Also tries to improve memory density.
- Salient features
  - 4-bit ALUs (why 4 bits?)
  - 4-bit buses
  - Switchboxes with 2 modes
  - Chessboard layout → strong local connectivity
  - Embedded block RAMs: 256 x 8 per 16 ALUs
  - Speed and hierarchical line lengths (buffers)
  - Small configuration memories, so configuration speed is high
  - No run-time reconfiguration (why?)
23. Application Benefits of CHESS
- High computational density on a single chip
- Delayed design commitment
- Wide on-chip interfaces
- Does not do run-time reconfiguration
- Memory bandwidth
  - Distributed on-chip memory
- Flexibility
  - Can convert a switchbox to RAM (RAM/routability tradeoff; prefer embedded memory)
  - Different ALU instructions can be programmed
- Rapid reconfiguration
24. The CHESS Layout of ALUs
- 4-bit ALUs
- 4-bit buses
- Hexagonal array
- Good local connectivity
25. Distributed RAM
- Distributed memory
- 1 RAM per 16 ALU blocks
- Cascaded ALUs possible
26. ALU Bit Slice and Sample Instruction Set
- Performs addition, subtraction, and logical operations
- 4-bit inputs A and B
- Single-bit carry
- Variety of carry-conditioning options (illustrated below):
  - Carry input as data signal: arithmetic
  - Carry input as control signal: testing equality of numbers wider than 4 bits
  - Carry input as control signal: can drive local resets, enables, etc.
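As a rough illustration of the bit-slice behaviour, here is a small C model of a 4-bit ALU slice with a ripple carry (the opcode set and encoding are assumptions, not the actual CHESS instruction set); cascading two slices through the carry tests equality of an 8-bit value, as in the carry-conditioning option above:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative 4-bit ALU slice: 4-bit A and B, 1-bit carry in,
 * 4-bit result, 1-bit carry out. Opcodes are assumptions.     */
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } op_t;

static uint8_t alu4(op_t op, uint8_t a, uint8_t b,
                    int cin, int *cout) {
    a &= 0xF; b &= 0xF;
    unsigned r;
    switch (op) {
    case OP_ADD: r = a + b + cin;          break;
    case OP_SUB: r = a + (~b & 0xF) + cin; break; /* cin=1 gives a-b */
    case OP_AND: r = a & b;                break;
    case OP_OR:  r = a | b;                break;
    default:     r = a ^ b;                break;
    }
    *cout = (r >> 4) & 1;   /* carry ripples to the next slice */
    return r & 0xF;
}

/* Cascading two slices compares numbers wider than 4 bits:
 * subtract nibble by nibble and test that every result is 0. */
int main(void) {
    int c;
    uint8_t lo = alu4(OP_SUB, 0x9, 0x9, 1, &c);  /* low nibble   */
    uint8_t hi = alu4(OP_SUB, 0x5, 0x5, c, &c);  /* high nibble  */
    printf("equal = %d\n", lo == 0 && hi == 0);  /* 0x59 == 0x59 */
    return 0;
}
```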
27. ALU and Switchbox
- Switchbox memory can be used as storage, as shown: A is the address, B is the write data, and F is the read data (modelled below)
- Additional 2-bit registers for use as buffers on long wires
- ALU core for computation
- The ALU instruction can be changed on a cycle-by-cycle basis using the I input
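A tiny C model of the switchbox-as-RAM mode described above (the 16 x 4-bit size is an illustrative assumption): A drives the address, B carries the write data, and F is the read data.

```c
#include <stdint.h>
#include <stdio.h>

/* Switchbox configuration memory reused as a small data RAM;
 * the 16 x 4-bit size is an assumption for illustration.     */
typedef struct { uint8_t mem[16]; } switchbox_t;

static uint8_t sb_access(switchbox_t *sb, uint8_t A, uint8_t B, int we) {
    A &= 0xF;
    if (we) sb->mem[A] = B & 0xF;   /* write B at address A */
    return sb->mem[A];              /* F = read data        */
}

int main(void) {
    switchbox_t sb = {{0}};
    sb_access(&sb, 3, 9, 1);                     /* write 9 to addr 3 */
    printf("F = %u\n", sb_access(&sb, 3, 0, 0)); /* read back -> 9    */
    return 0;
}
```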
28. The CHESS Architecture: Interconnection
- Different types of wire segments depending on the length of the connection:
  - First 2: local connections
  - Next 2: connected to long wires as feeders
  - L1 (2): diagonally adjacent switchboxes
  - L2 (4): connect to the ends of feeder buses
  - L4 (2), L8 (2), L16 (2)
- Rent's rule
29. DP-FPGA
- For regularly structured datapaths like those used in DSP and communication.
- Control: a regular FPGA
- Memory: banks of SRAM
- Datapath: 4-bit data plus 1-bit control; programming-bit sharing; carry chain
30. DP-FPGA vs. CHESS
- Both share configuration bits and have dedicated carry chains
- DP-FPGA uses LUTs vs. ALUs for CHESS
- CHESS uses uniform/balanced wiring
- DP-FPGA uses a dedicated shifter
31. Performance of CHESS
- Uses metrics to evaluate computational power.
- Efficient multiplies due to the embedded ALUs
- Process-independent estimate
32. Summary: The CHESS Architecture
- Achieves high computational density
- CHESS is a highly flexible and scalable design.
- It can feed ALUs with instructions generated within the array
- Has embedded block RAM and can trade routing switches for memory.
33. General-Purpose Computing: Two Important Questions
- 1. How are general-purpose processing resources controlled?
- 2. How much area is dedicated to holding the instructions which control these resources?
- SIMD, MIMD, VLIW, FPGAs, and reconfigurable ALUs each answer these differently.
- MATRIX is a reconfigurable device architecture which allows these questions to be answered by the application rather than by the device architect.
34. Another Way of Criticizing FPGAs
- FPGAs allow finer-granularity control over operation and dedicate minimal area to instruction distribution.
- They can deliver more computations per unit of silicon than processors on a wide range of regular operations.
- However, the lack of resources for instruction distribution makes them efficient only when functional diversity is low,
  - i.e., when the same operation is required repeatedly
  - and that entire operation can fit spatially onto the FPGA or FPGAs in the system.
35. MATRIX Approach
- Rather than separating the resources for instruction storage, data storage, and computation, and dedicating silicon to each at fabrication time, the MATRIX architecture unifies these resources.
36. MATRIX Approach
- Traditional instruction and control resources are decomposed along with computing resources and can be deployed in an application-specific manner.
- They can support active computation or control the reuse of computational resources, depending on the needs of the application and the available hardware resources.
37. The MATRIX Architecture
- Developed at MIT
- 2D array of ALUs
- 8-bit granularity
- Each Basic Functional Unit contains an ALU and memory
- Ideal for systolic and VLIW computation
- Unified configurable network
38. The MATRIX Basic Functional Unit (BFU)
- Granularity: 8 bits
- Contains an ALU, memory, and control logic
- Memory holds instructions (configurations) and data
39. The MATRIX Interconnection Network
- Hierarchical interconnection network
- Level-two interconnect: medium-distance interconnection
- Global lines spanning an entire row / column
40. The MATRIX Port Structure
- Different sources of ALU inputs (modelled below):
  - Static value mode
  - Static source mode
  - Dynamic source mode
- Basic port architecture
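The three port modes can be modelled in a few lines of C (encodings and names are assumptions, not the actual MATRIX configuration format): a static value behaves as a constant, a static source always reads the same network input, and a dynamic source lets run-time data select the input each cycle.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the three MATRIX port modes. */
typedef enum { STATIC_VALUE, STATIC_SOURCE, DYNAMIC_SOURCE } port_mode_t;

typedef struct {
    port_mode_t mode;
    uint8_t     constant;  /* used in STATIC_VALUE mode           */
    int         source;    /* fixed network input, STATIC_SOURCE  */
} port_t;

/* net[] holds this cycle's network values; dyn_src is computed at
 * run time (DYNAMIC_SOURCE lets data pick the source each cycle). */
static uint8_t read_port(const port_t *p, const uint8_t net[], int dyn_src) {
    switch (p->mode) {
    case STATIC_VALUE:  return p->constant;    /* port is a constant   */
    case STATIC_SOURCE: return net[p->source]; /* fixed input wire     */
    default:            return net[dyn_src];   /* source chosen per cycle */
    }
}

int main(void) {
    uint8_t net[4] = {10, 20, 30, 40};
    port_t a = {STATIC_VALUE, 7, 0};
    port_t b = {STATIC_SOURCE, 0, 2};
    port_t c = {DYNAMIC_SOURCE, 0, 0};
    printf("%u %u %u\n", read_port(&a, net, 1),
                         read_port(&b, net, 1),
                         read_port(&c, net, 1));   /* 7 30 20 */
    return 0;
}
```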
41. Convolution Mapping in MATRIX: Systolic Array
Example of convolution mapping
- The sample values are passed through the first row
- Produces a result every two cycles
- Needs 4k cells to implement
42. Convolution Mapping in MATRIX: Microcoded Example
Functional diversity vs. performance: µP vs. FPGA?
Example of convolution mapping (microcoded)
- Coefficients stored in BFU memory
- Takes 8K + 9 cycles to implement
- Consumes only 8 cells (sketched below)
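A C sketch of the microcoded variant (sizes and names are illustrative assumptions): the coefficients sit in one BFU's local memory and a single multiply-accumulate datapath is reused K times per output, trading cycles for cells, in contrast to the systolic version that spends cells per tap.

```c
#include <stdio.h>

#define K 8     /* number of coefficients (taps), assumption */
#define N 16    /* number of output samples, assumption      */

int main(void) {
    int w[K], x[N + K], y[N];

    for (int i = 0; i < K; i++)     w[i] = i + 1; /* coefficients in "BFU memory" */
    for (int i = 0; i < N + K; i++) x[i] = i;     /* input samples */

    /* Microcoded convolution: one MAC unit reused K times per
     * output (order-K cycles each) instead of K systolic cells. */
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < K; j++)
            acc += w[j] * x[i + j];
        y[i] = acc;
    }
    printf("y[0] = %d\n", y[0]);
    return 0;
}
```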
43. Convolution Mapping in MATRIX: VLIW/MSIMD
- Separate BFUs for Xptr and Wptr
Example of convolution mapping (VLIW)
- Performs 6 convolutions simultaneously (MSIMD)
44. Granularity and Performance
- 8-bit granularity is optimal, but bit-level operations will result in underutilization
- Extremely flexible
- Area breakdown: network 50%, memory 30%, control 12%, ALU 8%
45. Summary: The MATRIX Architecture
- Application-tailored datapaths
- Dynamic control
- Regularity is exploitable
- Deployable resources
- High flexibility in implementation
- Instruction-stream compression
46. CHESS vs. MATRIX
- Both use a 2D array of ALUs
- For both, instructions can be generated within the array
- Both are flexible
- CHESS is 4-bit, MATRIX is 8-bit. What is the tradeoff?
- CHESS does not support run-time reconfiguration but configures very quickly because few bits are required
- CHESS has high computational density
- CHESS is aimed at arithmetic; MATRIX is more general-purpose
47. RaPiD
- Reconfigurable Pipelined Datapath
- Optimized for highly repetitive, computation-intensive tasks
- Very deep application-specific computation pipelines can be configured in RaPiD
- These pipelines make much more efficient use of silicon than traditional FPGAs
- and yield much higher performance for a wide range of applications
48. RaPiD (Reconfigurable Pipelined Datapath): Motivation
- University of Washington
- Motivation
  - Configurable computing: performance close to an ASIC,
  - flexibility close to a general-purpose processor.
  - Yet FPGA-based systems have some problems: they are platforms for implementing random logic functions.
- Automatic compilation (high-level synthesis)
- Suitable for DSP applications
- Datapath with static and dynamic signals
49. RaPiD
- A coarse-grained FPGA architecture that allows deeply pipelined computational datapaths to be constructed dynamically from a mix of ALUs, multipliers, registers, and local memories.
- The goal of RaPiD is to compile regular computations like those found in DSP applications into both an application-specific datapath and the program for controlling that datapath.
50. RaPiD: Datapath
- Uses static and dynamic control signals.
- Static control determines the underlying structure of the datapath, which remains constant for a particular application.
  - Generated by static RAM cells that are changed only between applications.
- Dynamic control signals can change from cycle to cycle and specify the variable operations performed and the data used by those operations.
  - Provided by a control program. (A sketch of this split follows.)
51. RaPiD: Introduction
- Computational bandwidth is extremely high and scales with array size
- I/O operations limit the speedup an application can get
- Suited for highly repetitive, computation-intensive tasks
- Control flow is handled by another processor (RISC)
- Restricted to linear arrays; NOT a 2D architecture
52. RaPiD Basic Block (Cell)
- Each cell contains 1 integer (16-bit) multiplier, 2 ALUs, 6 general-purpose registers, and 3 small local memories
- The complete array (RaPiD-I) consists of 16 cells
- Ten segmented buses run through the length of the datapath
53. RaPiD Interconnect
Bus connector
- Input to any functional unit is driven through a multiplexer (8 lines)
- Output of a functional unit can fan out to any number of buses (8)
- Buses in different tracks are segmented to different lengths
- A bus connector joins adjacent segments through a register or buffer (in either direction)
- These registers can also be used for pipelining
54. RaPiD: More About Cells
- Functional unit outputs can be registered or unregistered
- Granularity: 16 bits; different fixed-point representations
- ALUs: logical and arithmetic operations
- 2 ALUs and a multiplier can be pipelined for a MAC operation (32-bit)
- Datapath registers are expensive (area and bus utilization)
  - Used to connect to different tracks
- Local memory can be used to store constant arrays
  - Includes address generators
- I/O streams for data transfer (FIFO), asynchronous
55. RaPiD: Register, Memory
Datapath register
Local memory
56. RaPiD: Control Path
- Control pipeline
  - Static control
  - Dynamic control (example: initialization)
- LUTs provide simple programmability.
- Cells can be chained together to form a continuous pipe.
- 230 control signals per cell; 80 are dynamic.
57. RaPiD: Weighted Filter Example
- FIR filter algorithm, 4 taps:
  - y[i] = W[0]*X[i] + W[1]*X[i+1] + W[2]*X[i+2] + W[3]*X[i+3]
- Filter weights are stored in the W array
- Input is stored in the X array (a plain C version follows)
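A plain C version of this 4-tap computation (array sizes and weight values are illustrative assumptions); on RaPiD the inner loop would unroll into stages of the pipelined datapath rather than execute sequentially.

```c
#include <stdio.h>

#define TAPS  4     /* T = 4 taps, as above         */
#define NSAMP 16    /* number of outputs, assumption */

int main(void) {
    int W[TAPS] = {1, 2, 3, 4};         /* filter weights */
    int X[NSAMP + TAPS], Y[NSAMP];

    for (int i = 0; i < NSAMP + TAPS; i++)
        X[i] = i;                       /* input samples */

    /* y[i] = sum over j of W[j] * X[i+j]; on RaPiD each tap
     * becomes one stage of the pipelined datapath.         */
    for (int i = 0; i < NSAMP; i++) {
        int acc = 0;
        for (int j = 0; j < TAPS; j++)
            acc += W[j] * X[i + j];
        Y[i] = acc;
    }
    printf("Y[0] = %d\n", Y[0]);
    return 0;
}
```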
58. RaPiD: Mapping to RaPiD (a)
Schematic
59. RaPiD: Mapping to RaPiD (b)
- Performance depends upon clock rate (t, in MHz), number of cells (S), and memory locations per cell (M)
- Results are in MOPS/GOPS; a single MAC counts as one operation
- FIR filter with T taps: t x min(T, S) MOPS
  - For t = 100, S = 16, M = 96, T > 16 → sustained rate of 1.6 GOPS
- Matrix multiplication of X x Y and Y x Z matrices
  - Sustained rate is t x min(Y, M/3, S) MOPS
- Conclusions: no compiler yet, I/O challenges, integration with a RISC host (the formulas are checked below)
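The slide's formulas can be checked with a few lines of C (the matrix dimension Y = 32 is an illustrative assumption; the other values are from the slide):

```c
#include <stdio.h>

static long min2(long a, long b) { return a < b ? a : b; }

int main(void) {
    long t = 100, S = 16, M = 96, T = 17;  /* t = 100 MHz, T > 16 */

    /* FIR filter: t x min(T, S) MOPS (MACs per second). */
    long fir_mops = t * min2(T, S);
    printf("FIR: %ld MOPS (= %.1f GOPS)\n", fir_mops, fir_mops / 1000.0);

    /* Matrix multiply (X x Y times Y x Z): t x min(Y, M/3, S). */
    long Y = 32;   /* assumed dimension */
    long mm_mops = t * min2(min2(Y, M / 3), S);
    printf("MatMul: %ld MOPS\n", mm_mops);
    return 0;
}
```

With these values the FIR case gives 100 x 16 = 1600 MOPS, matching the 1.6 GOPS sustained rate on the slide.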
60. Issues
- The domain of applicability must be explored by mapping more problems from different domains to RaPiD.
- Thus far all RaPiD applications have been designed by hand. The next step will be to apply compiler technology, particularly loop-transformation theory and systolic-array compilation methods, to build a compiler for RaPiD.
- A memory architecture must be designed which can support the I/O bandwidth required by RaPiD over a wide range of applications.
- Although it is clear that RaPiD should be closely coupled to a generic RISC processor, it is not clear exactly how this should be done. This is a problem being faced by other reconfigurable computers.
61. RaPiD: OFDM Implementation
- RaPiD-C (assembly-like programming) used for mapping.
Source: "Implementing OFDM Receiver on the RaPiD Reconfigurable Architecture," 2003.
62. RaPiD Performance (OFDM)
Performance results for the OFDM implementation
63. RaPiD Performance
64. RaPiD Performance
- A 16-cell RaPiD array computes an 8x8 2D-DCT as two matrix multiplies
  - For images > 256x256 pixels: a sustained rate of 1.6 billion MACs/s,
  - including reconfiguration overhead between images, with an average of 2 memory accesses per cycle.
- Motion estimation (vs. matrix multiply and DCT):
  - a sustained rate of 1.6 billion difference/absolute-value/accumulate operations per second,
  - but with an average of 0.1 memory accesses per cycle.
- Motion-picture compression (both motion estimation and DCT on each frame):
  - reconfiguration time of 2000 cycles (20 µsec), so
  - little performance is lost to reconfiguration and pipeline stalling.
  - For a standard 720x576 frame: 12 frames per second when executing both full motion estimation and DCT (including 4000 reconfiguration cycles per frame and pipelining). (These figures are sanity-checked below.)
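A quick sanity check of these figures in C, assuming the 100 MHz clock (t = 100) from slide 59 and one MAC per cell per cycle (both assumptions):

```c
#include <stdio.h>

int main(void) {
    double clock_hz = 100e6;    /* 100 MHz, per t = 100 on slide 59 (assumption) */
    double reconfig_cycles = 2000.0;
    double cells = 16.0, macs_per_cell_cycle = 1.0;  /* 1 MAC/cell/cycle assumed */

    /* 2000 cycles at 100 MHz -> 20 microseconds of reconfiguration. */
    printf("reconfig time = %.0f usec\n", reconfig_cycles / clock_hz * 1e6);

    /* 16 cells x 1 MAC/cycle x 100 MHz -> 1.6 billion MACs/s peak. */
    printf("peak rate = %.1f billion MACs/s\n",
           cells * macs_per_cell_cycle * clock_hz / 1e9);
    return 0;
}
```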
65. Comparison with Existing Architectures
- PADDI: 16-bit
- Advanced fine-granularity: VEGA, Dharma
- DP-FPGA
66. VEGA
- Virtual Element Gate Array
- Designed for circuit emulation
- Multiplexes a single LUT over time to simulate an array of LUTs
67. Dharma
68. PADDI
- Each EXU contains dedicated hardware support for fast 16-bit arithmetic
- Global broadcasting
- Local memory in the form of register files within EXUs