Title: Coarse Grain Reconfigurable Architectures
1. Coarse Grain Reconfigurable Architectures
2. Announcements
- Group meetings this week (sign-up sheet)
- Starting October 8th: PSYCH304, 3:30-6pm
- Today: Coarse Grain Architectures
3. Motivation for coarse-grained architectures
- Definition
  - An FPGA with granularity greater than 1 bit.
  - Architectures seen so far vary from 2 bits to 32 bits.
- Disadvantages of low granularity
  - Large routing area (80% of chip area!).
  - Large volume of configuration data.
  - Low area efficiency for arithmetic operations and RAM.
  - Reduced clock speed, bandwidth.
- Advantages of coarse-grained architectures
  - Fewer PEs, so the place-and-route (PAR) problem is less complex and much faster.
  - Less chip area devoted to routing.
  - Easier to correlate software functions with hardware.
- Tradeoffs
  - Flexibility/generality of 1-bit configuration.
  - Possible underutilization.
4. Coarse-grained architectures
- DP-FPGA
  - LUT-based
  - LUTs share configuration bits
- CHESS
  - Hexagonal mesh (chessboard layout) of ALUs
- MATRIX
  - 2D array of ALUs
- RaPiD
  - Specialized ALUs, multipliers
  - 1D array
- RAW
  - Full RISC core as basic block
  - 2D mesh
- PADDI
  - Clusters of 8 arithmetic EXUs, 16 bits wide, 8-word SRAM
  - Central crossbar switch
5. Different Coarse-Grained Architectures
[Comparison table of ALU-based coarse-grained architectures: DP-FPGA, CHESS, MATRIX, PADDI, RaPiD (1D array, 16-bit), RAW (2D mesh; instruction memory, data memory, and configurable logic per tile)]
6. RAW: Motivation
- It takes on the order of two clock cycles for a signal to travel edge-to-edge (roughly fifteen mm) of a 2-GHz processor.
- Compaq's Alpha 21264 was forced to split the integer unit into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters.
- Intel Pentium 4 architects had to allocate two pipeline stages solely for the traversal of long wires.
- RAW
  - uses a scalable ISA that provides a parallel software interface to the gate, wire, and pin resources of a chip.
  - An architecture with direct, first-class analogs to all of these physical resources lets programmers extract the maximum amount of performance and energy efficiency in the face of wire delay.
  - tries to minimize the ISA gap by exposing the underlying physical resources as architectural entities.
7. What is Raw?
- The Raw microprocessor consumes 122 million transistors,
- executes 16 different load, store, integer, or floating-point instructions every cycle,
- controls 25 Gbytes/s of input/output (I/O) bandwidth, and
- has 2 Mbytes of on-chip distributed L1 static RAM, providing on-chip memory bandwidth of 57 Gbytes/s.
- It took only a handful of graduate students at the Laboratory for Computer Science at MIT to design and implement Raw.
8. RAW: Reconfigurable Architecture Workstation (@MIT)
- Challenges
  - Short internal chip wires to allow a high clock speed
  - Quick verification of new designs
  - Workloads on processors (multimedia)
- Multi-granular processing elements (tiles)
  - a RISC processor,
  - configurable logic (CL),
  - instruction and data memories,
  - a programmable switch
- Parallelizing compiler (to distribute workload)
- Supports static and dynamic routing
9. RAW Microprocessor
Each tile includes a RISC processor and configurable logic.
10. RAW: Comparison
11. RAW
12. RAW: Datapath of an individual tile
- Intertile communication latency similar to that of register accesses
- Distributed registers → more ILP
- Memory access time close to the processor clock
- No hardware-based register-renaming logic or instruction issue (different from superscalar)
- Focus is on computation (NOT control)
- Switch integrated into the processor pipeline
- Static and dynamic scheduling
- Multigranular and configurable
13. RAW Pipeline
14. RAW Software Support
15. RAW Compiler
- (a) shows the original code. (b) shows the memories of a 4-processor Raw machine and how array A is distributed across them. (c) shows the code after unrolling: each access now refers to locations on only one processor (see the sketch below).
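To make the transformation concrete, here is a minimal C sketch (the array size, names, and per-tile assignment are illustrative assumptions; the actual Raw compiler applies this to its own intermediate representation):

```c
#include <stdio.h>

#define N      16   /* hypothetical array size           */
#define NTILES 4    /* the 4-processor Raw machine above */

int main(void) {
    int A[N];

    /* (a) Original loop: consecutive iterations touch memory
     *     banks on different tiles (A is interleaved).      */
    for (int i = 0; i < N; i++)
        A[i] = i;

    /* (c) Loop unrolled by NTILES: within one unrolled body,
     *     A[i+0] always lives on tile 0, A[i+1] on tile 1, etc.,
     *     so each access refers to exactly one tile's memory
     *     and can be scheduled onto that tile statically.   */
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < N; i += NTILES) {
        sum0 += A[i + 0];   /* placed on tile 0 */
        sum1 += A[i + 1];   /* placed on tile 1 */
        sum2 += A[i + 2];   /* placed on tile 2 */
        sum3 += A[i + 3];   /* placed on tile 3 */
    }
    printf("%d\n", sum0 + sum1 + sum2 + sum3);
    return 0;
}
```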
16. RAW Parallelizing Compiler
Source: "Parallelizing Applications into Silicon," 1999.
17. RAW: Experimental Results
18. RAW: Performance
19. RAW Fabricated
20. RAW vs. FPGA
- FPGAs
  - exploit fine-grained parallelism and fast static communication.
  - Software has access to their low-level details, allowing it to optimize the mapping of the user application.
  - Users can also bind commonly used instruction sequences into configurable logic.
  - As a result, these special-purpose instruction sequences can execute in a single cycle.
  - But FPGAs require loading an entire bitstream to reprogram for a new operation.
  - Compilation for FPGAs is also slow because of their fine granularity.
- Raw
  - supports instruction sequencing.
  - It is more flexible, merely pointing to a new instruction to execute a new operation.
  - Compilation for Raw is fast because the hardware contains commonly used compute mechanisms such as ALUs and memory paths.
  - This eliminates repeated, low-level compilation of these units.
  - Binding common mechanisms into hardware also yields faster execution speed, lower area, and better power efficiency than FPGA systems.
21. CHESS
- A Reconfigurable Arithmetic Array for Multimedia Applications
22. The CHESS Architecture: Introduction
- Developed by HP.
- Aims at speeding up arithmetic operations for multimedia applications.
- Also tries to improve memory density.
- Salient features
  - 4-bit ALUs (why 4 bits?)
  - 4-bit buses
  - Switchboxes with 2 modes
  - Chessboard layout → strong local connectivity
  - Embedded block RAMs: 256 x 8 per 16 ALUs
  - Speed and hierarchical line lengths (buffers)
  - Small configuration memories, so configuration speed is high
  - No run-time reconfiguration (why?)
23. Application Benefits of CHESS
- High computational density on a single chip
- Delayed design commitment
- Wide on-chip interfaces
- Does not do run-time reconfiguration
- Memory bandwidth
  - Distributed on-chip memory
- Flexibility
  - Can convert a switchbox to RAM (RAM/routability tradeoff; prefer embedded memory)
  - Different ALU instructions can be programmed
- Rapid reconfiguration
24. The CHESS Layout of ALUs
- 4-bit ALUs
- 4-bit buses
- Hexagonal array
- Good local connectivity
25. Distributed RAM
- Distributed memory
- 1 RAM per 16 ALU blocks
- Cascaded ALUs possible
26. ALU Bit Slice and Sample Instruction Set
- Performs addition, subtraction, and logical operations
- 4-bit inputs A and B
- Single-bit carry
- Variety of carry-conditioning options (illustrated below):
  - Carry input as data signal: arithmetic
  - Carry input as control signal: testing equality of numbers wider than 4 bits
  - Carry input as control signal: can drive local resets, enables, etc.
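As a rough illustration of the bit-slice behaviour, here is a small C model of a 4-bit ALU slice with a ripple carry (the opcode set and encoding are assumptions, not the actual CHESS instruction set); cascading two slices through the carry tests equality of an 8-bit value, as in the carry-conditioning option above:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative 4-bit ALU slice: 4-bit A and B, 1-bit carry in,
 * 4-bit result, 1-bit carry out. Opcodes are assumptions.     */
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } op_t;

static uint8_t alu4(op_t op, uint8_t a, uint8_t b,
                    int cin, int *cout) {
    a &= 0xF; b &= 0xF;
    unsigned r;
    switch (op) {
    case OP_ADD: r = a + b + cin;          break;
    case OP_SUB: r = a + (~b & 0xF) + cin; break; /* cin=1 gives a-b */
    case OP_AND: r = a & b;                break;
    case OP_OR:  r = a | b;                break;
    default:     r = a ^ b;                break;
    }
    *cout = (r >> 4) & 1;   /* carry ripples to the next slice */
    return r & 0xF;
}

/* Cascading two slices compares numbers wider than 4 bits:
 * subtract nibble by nibble and test that every result is 0. */
int main(void) {
    int c;
    uint8_t lo = alu4(OP_SUB, 0x9, 0x9, 1, &c);  /* low nibble   */
    uint8_t hi = alu4(OP_SUB, 0x5, 0x5, c, &c);  /* high nibble  */
    printf("equal = %d\n", lo == 0 && hi == 0);  /* 0x59 == 0x59 */
    return 0;
}
```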
27. ALU and Switchbox
- Switchbox memory can be used as storage, as shown: A is the address, B is the write data, and F is the read data (modelled below)
- Additional 2-bit registers for use as buffers on long wires
- ALU core for computation
- The ALU instruction can be changed on a cycle-by-cycle basis using the I input
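A tiny C model of the switchbox-as-RAM mode described above (the 16 x 4-bit size is an illustrative assumption): A drives the address, B carries the write data, and F is the read data.

```c
#include <stdint.h>
#include <stdio.h>

/* Switchbox configuration memory reused as a small data RAM;
 * the 16 x 4-bit size is an assumption for illustration.     */
typedef struct { uint8_t mem[16]; } switchbox_t;

static uint8_t sb_access(switchbox_t *sb, uint8_t A, uint8_t B, int we) {
    A &= 0xF;
    if (we) sb->mem[A] = B & 0xF;   /* write B at address A */
    return sb->mem[A];              /* F = read data        */
}

int main(void) {
    switchbox_t sb = {{0}};
    sb_access(&sb, 3, 9, 1);                     /* write 9 to addr 3 */
    printf("F = %u\n", sb_access(&sb, 3, 0, 0)); /* read back -> 9    */
    return 0;
}
```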
28. The CHESS Architecture: Interconnection
- Different types of wire segments depending on the length of the connection:
  - First 2: local connections
  - Next 2: connected to long wires as feeders
  - L1 (2): diagonally adjacent switchboxes
  - L2 (4): connect to the ends of feeder buses
  - L4 (2), L8 (2), L16 (2)
- Rent's rule
29. DP-FPGA
- For regularly structured datapaths like those used in DSP and communication.
- Control: a regular FPGA
- Memory: banks of SRAM
- Datapath: 4-bit data plus 1-bit control; programming-bit sharing; carry chain
30. DP-FPGA vs. CHESS
- Both share configuration bits and have dedicated carry chains
- DP-FPGA uses LUTs vs. ALUs for CHESS
- CHESS uses uniform/balanced wiring
- DP-FPGA uses a dedicated shifter
31. Performance of CHESS
- Uses metrics to evaluate computational power.
- Efficient multiplies due to the embedded ALUs
- Process-independent estimate
32. Summary: The CHESS Architecture
- Achieves high computational density
- CHESS is a highly flexible and scalable design.
- It can feed ALUs with instructions generated within the array
- Has embedded block RAM and can trade routing switches for memory.
33. General-Purpose Computing: Two Important Questions
- 1. How are general-purpose processing resources controlled?
- 2. How much area is dedicated to holding the instructions which control these resources?
- SIMD, MIMD, VLIW, FPGAs, and reconfigurable ALUs each answer these differently.
- MATRIX is a reconfigurable device architecture which allows these questions to be answered by the application rather than by the device architect.
34. Another Way of Criticizing FPGAs
- FPGAs allow finer-granularity control over operation and dedicate minimal area to instruction distribution.
- They can deliver more computations per unit of silicon than processors on a wide range of regular operations.
- However, the lack of resources for instruction distribution makes them efficient only when functional diversity is low,
  - i.e., when the same operation is required repeatedly
  - and that entire operation can fit spatially onto the FPGA or FPGAs in the system.
35. MATRIX Approach
- Rather than separating the resources for instruction storage, data storage, and computation, and dedicating silicon to each at fabrication time, the MATRIX architecture unifies these resources.
36. MATRIX Approach
- Traditional instruction and control resources are decomposed along with computing resources and can be deployed in an application-specific manner.
- They can support active computation or control the reuse of computational resources, depending on the needs of the application and the available hardware resources.
37. The MATRIX Architecture
- Developed at MIT
- 2D array of ALUs
- 8-bit granularity
- Each Basic Functional Unit contains an ALU and memory
- Ideal for systolic and VLIW computation
- Unified configurable network
38. The MATRIX Basic Functional Unit (BFU)
- Granularity: 8 bits
- Contains an ALU, memory, and control logic
- Memory holds instructions (configurations) and data
39. The MATRIX Interconnection Network
- Hierarchical interconnection network
- Level-two interconnect: medium-distance interconnection
- Global lines spanning an entire row / column
40. The MATRIX Port Structure
- Different sources of ALU inputs (modelled below):
  - Static value mode
  - Static source mode
  - Dynamic source mode
- Basic port architecture
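The three port modes can be modelled in a few lines of C (encodings and names are assumptions, not the actual MATRIX configuration format): a static value behaves as a constant, a static source always reads the same network input, and a dynamic source lets run-time data select the input each cycle.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the three MATRIX port modes. */
typedef enum { STATIC_VALUE, STATIC_SOURCE, DYNAMIC_SOURCE } port_mode_t;

typedef struct {
    port_mode_t mode;
    uint8_t     constant;  /* used in STATIC_VALUE mode           */
    int         source;    /* fixed network input, STATIC_SOURCE  */
} port_t;

/* net[] holds this cycle's network values; dyn_src is computed at
 * run time (DYNAMIC_SOURCE lets data pick the source each cycle). */
static uint8_t read_port(const port_t *p, const uint8_t net[], int dyn_src) {
    switch (p->mode) {
    case STATIC_VALUE:  return p->constant;    /* port is a constant   */
    case STATIC_SOURCE: return net[p->source]; /* fixed input wire     */
    default:            return net[dyn_src];   /* source chosen per cycle */
    }
}

int main(void) {
    uint8_t net[4] = {10, 20, 30, 40};
    port_t a = {STATIC_VALUE, 7, 0};
    port_t b = {STATIC_SOURCE, 0, 2};
    port_t c = {DYNAMIC_SOURCE, 0, 0};
    printf("%u %u %u\n", read_port(&a, net, 1),
                         read_port(&b, net, 1),
                         read_port(&c, net, 1));   /* 7 30 20 */
    return 0;
}
```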
41. Convolution Mapping in MATRIX: Systolic Array
Example of convolution mapping
- The sample values are passed through the first row
- Produces a result every two cycles
- Needs 4k cells to implement
42. Convolution Mapping in MATRIX: Microcoded Example
Functional diversity vs. performance: µP vs. FPGA?
Example of convolution mapping (microcoded)
- Coefficients stored in BFU memory
- Takes 8K + 9 cycles to implement
- Consumes only 8 cells (sketched below)
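A C sketch of the microcoded variant (sizes and names are illustrative assumptions): the coefficients sit in one BFU's local memory and a single multiply-accumulate datapath is reused K times per output, trading cycles for cells, in contrast to the systolic version that spends cells per tap.

```c
#include <stdio.h>

#define K 8     /* number of coefficients (taps), assumption */
#define N 16    /* number of output samples, assumption      */

int main(void) {
    int w[K], x[N + K], y[N];

    for (int i = 0; i < K; i++)     w[i] = i + 1; /* coefficients in "BFU memory" */
    for (int i = 0; i < N + K; i++) x[i] = i;     /* input samples */

    /* Microcoded convolution: one MAC unit reused K times per
     * output (order-K cycles each) instead of K systolic cells. */
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < K; j++)
            acc += w[j] * x[i + j];
        y[i] = acc;
    }
    printf("y[0] = %d\n", y[0]);
    return 0;
}
```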
43. Convolution Mapping in MATRIX: VLIW/MSIMD
- Separate BFUs for Xptr and Wptr
Example of convolution mapping (VLIW)
- Performs 6 convolutions simultaneously (MSIMD)
44. Granularity and Performance
- 8-bit granularity is optimal, but bit-level operations will result in underutilization
- Extremely flexible
- Area breakdown: network 50%, memory 30%, control 12%, ALU 8%
45. Summary: The MATRIX Architecture
- Application-tailored datapaths
- Dynamic control
- Regularity is exploitable
- Deployable resources
- High flexibility in implementation
- Instruction-stream compression
46. CHESS vs. MATRIX
- Both use a 2D array of ALUs
- For both, instructions can be generated within the array
- Both are flexible
- CHESS is 4-bit, MATRIX is 8-bit. What is the tradeoff?
- CHESS does not support run-time reconfiguration but configures very quickly because few bits are required
- CHESS has high computational density
- CHESS is aimed at arithmetic; MATRIX is more general-purpose
47. RaPiD
- Reconfigurable Pipelined Datapath
- Optimized for highly repetitive, computation-intensive tasks
- Very deep application-specific computation pipelines can be configured in RaPiD
- These pipelines make much more efficient use of silicon than traditional FPGAs
- and yield much higher performance for a wide range of applications
48. RaPiD (Reconfigurable Pipelined Datapath): Motivation
- University of Washington
- Motivation
  - Configurable computing: performance close to an ASIC,
  - flexibility close to a general-purpose processor.
  - Yet FPGA-based systems have some problems: they are platforms for implementing random logic functions.
- Automatic compilation (high-level synthesis)
- Suitable for DSP applications
- Datapath with static and dynamic signals
49. RaPiD
- A coarse-grained FPGA architecture that allows deeply pipelined computational datapaths to be constructed dynamically from a mix of ALUs, multipliers, registers, and local memories.
- The goal of RaPiD is to compile regular computations like those found in DSP applications into both an application-specific datapath and the program for controlling that datapath.
50. RaPiD: Datapath
- Uses static and dynamic control signals.
- Static control determines the underlying structure of the datapath, which remains constant for a particular application.
  - Generated by static RAM cells that are changed only between applications.
- Dynamic control signals can change from cycle to cycle and specify the variable operations performed and the data used by those operations.
  - Provided by a control program. (A sketch of this split follows.)
51. RaPiD: Introduction
- Computational bandwidth is extremely high and scales with array size
- I/O operations limit the speedup an application can get
- Suited for highly repetitive, computation-intensive tasks
- Control flow is handled by another processor (RISC)
- Restricted to linear arrays; NOT a 2D architecture
52. RaPiD Basic Block (Cell)
- Each cell contains 1 integer (16-bit) multiplier, 2 ALUs, 6 general-purpose registers, and 3 small local memories
- The complete array (RaPiD-I) consists of 16 cells
- Ten segmented buses run through the length of the datapath
53. RaPiD Interconnect
Bus connector
- Input to any functional unit is driven through a multiplexer (8 lines)
- Output of a functional unit can fan out to any number of buses (8)
- Buses in different tracks are segmented to different lengths
- A bus connector joins adjacent segments through a register or buffer (in either direction)
- These registers can also be used for pipelining
54. RaPiD: More About Cells
- Functional unit outputs can be registered or unregistered
- Granularity: 16 bits; different fixed-point representations
- ALUs: logical and arithmetic operations
- 2 ALUs and a multiplier can be pipelined for a MAC operation (32-bit)
- Datapath registers are expensive (area and bus utilization)
  - Used to connect to different tracks
- Local memory can be used to store constant arrays
  - Includes address generators
- I/O streams for data transfer (FIFO), asynchronous
55. RaPiD: Register, Memory
Datapath register
Local memory
56. RaPiD: Control Path
- Control pipeline
  - Static control
  - Dynamic control (example: initialization)
- LUTs provide simple programmability.
- Cells can be chained together to form a continuous pipe.
- 230 control signals per cell; 80 are dynamic.
57. RaPiD: Weighted Filter Example
- FIR filter algorithm, 4 taps:
  - y[i] = W[0]*X[i] + W[1]*X[i+1] + W[2]*X[i+2] + W[3]*X[i+3]
- Filter weights are stored in the W array
- Input is stored in the X array (a plain C version follows)
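A plain C version of this 4-tap computation (array sizes and weight values are illustrative assumptions); on RaPiD the inner loop would unroll into stages of the pipelined datapath rather than execute sequentially.

```c
#include <stdio.h>

#define TAPS  4     /* T = 4 taps, as above         */
#define NSAMP 16    /* number of outputs, assumption */

int main(void) {
    int W[TAPS] = {1, 2, 3, 4};         /* filter weights */
    int X[NSAMP + TAPS], Y[NSAMP];

    for (int i = 0; i < NSAMP + TAPS; i++)
        X[i] = i;                       /* input samples */

    /* y[i] = sum over j of W[j] * X[i+j]; on RaPiD each tap
     * becomes one stage of the pipelined datapath.         */
    for (int i = 0; i < NSAMP; i++) {
        int acc = 0;
        for (int j = 0; j < TAPS; j++)
            acc += W[j] * X[i + j];
        Y[i] = acc;
    }
    printf("Y[0] = %d\n", Y[0]);
    return 0;
}
```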
58. RaPiD: Mapping to RaPiD (a)
Schematic
59. RaPiD: Mapping to RaPiD (b)
- Performance depends upon clock rate (t, in MHz), number of cells (S), and memory locations per cell (M)
- Results are in MOPS/GOPS; a single MAC counts as one operation
- FIR filter with T taps: t x min(T, S) MOPS
  - For t = 100, S = 16, M = 96, T > 16 → sustained rate of 1.6 GOPS
- Matrix multiplication of X x Y and Y x Z matrices
  - Sustained rate is t x min(Y, M/3, S) MOPS
- Conclusions: no compiler yet, I/O challenges, integration with a RISC host (the formulas are checked below)
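The slide's formulas can be checked with a few lines of C (the matrix dimension Y = 32 is an illustrative assumption; the other values are from the slide):

```c
#include <stdio.h>

static long min2(long a, long b) { return a < b ? a : b; }

int main(void) {
    long t = 100, S = 16, M = 96, T = 17;  /* t = 100 MHz, T > 16 */

    /* FIR filter: t x min(T, S) MOPS (MACs per second). */
    long fir_mops = t * min2(T, S);
    printf("FIR: %ld MOPS (= %.1f GOPS)\n", fir_mops, fir_mops / 1000.0);

    /* Matrix multiply (X x Y times Y x Z): t x min(Y, M/3, S). */
    long Y = 32;   /* assumed dimension */
    long mm_mops = t * min2(min2(Y, M / 3), S);
    printf("MatMul: %ld MOPS\n", mm_mops);
    return 0;
}
```

With these values the FIR case gives 100 x 16 = 1600 MOPS, matching the 1.6 GOPS sustained rate on the slide.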
60. Issues
- The domain of applicability must be explored by mapping more problems from different domains to RaPiD.
- Thus far all RaPiD applications have been designed by hand. The next step will be to apply compiler technology, particularly loop-transformation theory and systolic-array compilation methods, to build a compiler for RaPiD.
- A memory architecture must be designed which can support the I/O bandwidth required by RaPiD over a wide range of applications.
- Although it is clear that RaPiD should be closely coupled to a generic RISC processor, it is not clear exactly how this should be done. This is a problem being faced by other reconfigurable computers.
61. RaPiD: OFDM Implementation
- RaPiD-C (assembly-like programming) used for mapping.
Source: "Implementing OFDM Receiver on the RaPiD Reconfigurable Architecture," 2003.
62. RaPiD Performance (OFDM)
Performance results for the OFDM implementation
63. RaPiD Performance
64. RaPiD Performance
- A 16-cell RaPiD array computes an 8x8 2D-DCT as two matrix multiplies
  - For images > 256x256 pixels: a sustained rate of 1.6 billion MACs/s,
  - including reconfiguration overhead between images, with an average of 2 memory accesses per cycle.
- Motion estimation (vs. matrix multiply and DCT):
  - a sustained rate of 1.6 billion difference/absolute-value/accumulate operations per second,
  - but with an average of 0.1 memory accesses per cycle.
- Motion-picture compression (both motion estimation and DCT on each frame):
  - reconfiguration time of 2000 cycles (20 µsec), so
  - little performance is lost to reconfiguration and pipeline stalling.
  - For a standard 720x576 frame: 12 frames per second when executing both full motion estimation and DCT (including 4000 reconfiguration cycles per frame and pipelining). (These figures are sanity-checked below.)
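A quick sanity check of these figures in C, assuming the 100 MHz clock (t = 100) from slide 59 and one MAC per cell per cycle (both assumptions):

```c
#include <stdio.h>

int main(void) {
    double clock_hz = 100e6;    /* 100 MHz, per t = 100 on slide 59 (assumption) */
    double reconfig_cycles = 2000.0;
    double cells = 16.0, macs_per_cell_cycle = 1.0;  /* 1 MAC/cell/cycle assumed */

    /* 2000 cycles at 100 MHz -> 20 microseconds of reconfiguration. */
    printf("reconfig time = %.0f usec\n", reconfig_cycles / clock_hz * 1e6);

    /* 16 cells x 1 MAC/cycle x 100 MHz -> 1.6 billion MACs/s peak. */
    printf("peak rate = %.1f billion MACs/s\n",
           cells * macs_per_cell_cycle * clock_hz / 1e9);
    return 0;
}
```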
65. Comparison with Existing Architectures
- PADDI: 16-bit
- Advanced fine-granularity: VEGA, Dharma
- DP-FPGA
66. VEGA
- Virtual Element Gate Array
- Designed for circuit emulation
- Multiplexes a single LUT over time to simulate an array of LUTs
67. Dharma
68. PADDI
- Each EXU contains dedicated hardware support for fast 16-bit arithmetic
- Global broadcasting
- Local memory in the form of register files within EXUs