1
High-End Reconfigurable Computing
  • Berkeley Wireless Research Center
  • January 2004
  • John Wawrzynek, Robert W. Brodersen, Chen Chang,
    Vason Srini, Brian Richards

2
Berkeley Emulation Engine
  • FPGA-based system for real-time hardware
    emulation
  • Emulation speeds up to 60 MHz
  • Emulation capacity of 10 million ASIC
    gate-equivalents, corresponding to 600 Gops of
    16-bit adds (although BEE is not a logic-gate
    emulator)
  • 2400 external parallel I/Os providing 192 Gbps
    raw bandwidth (see the back-of-envelope sketch
    below)
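
A back-of-envelope check of these figures, as a small
Python sketch. The 80 Mbps-per-pin signaling rate is an
assumption chosen so that the raw-bandwidth number works
out; it is not stated on the slide.

  # Back-of-envelope check of the BEE headline numbers.
  io_pins = 2400
  mbps_per_pin = 80                      # assumed per-pin signaling rate
  print(io_pins * mbps_per_pin / 1000)   # 192.0 Gbps raw bandwidth

  # 600 Gops of 16-bit adds at the 60 MHz emulation clock implies
  # roughly 10,000 concurrent 16-bit adders active in the fabric.
  print(600e9 / 60e6)                    # 10000.0 parallel adders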

3
Status
  • Four BEE processing units built
  • Three in continuous production use
  • Supported universities
    • CMU, USC, Tampere, UMass, Stanford
  • Successful tapeouts
    • 3.2M-transistor PicoRadio chip
    • 1.8M-transistor LDPC decoder chip
  • Systems emulated
    • QPSK radio transceiver
    • BCJR decoder
    • MPEG IDCT
  • On-going projects
    • UWB mixed-signal SoC
    • MPEG transcoder
    • PicoRadio multi-node system
    • Infineon SIMD processor for SDR

4
Lessons from BEE
  • Simulink-based tool flow is a very effective FPGA
    programming model in the DSP domain.
  • Many system emulation tasks are significant
    computations in their own right:
    high-performance emulation hardware makes for
    high-performance general computing.
  • Is this the right way to build supercomputers?
  • BEE could be scaled up with the latest FPGAs and
    by using multiple boards → TeraBEE (B2).

5
High-End Reconfigurable Computer (HERC)
  • A machine with supercomputer-level performance,
    configured on a per-problem basis to match the
    structure of the task by exploiting spatial
    parallelism.
  • All data paths, control paths, memory ports and
    controllers, and communication channels and
    controllers are wired to match the needs of a
    particular problem.

6
Applications of Interest
  • High-performance DSP/communication systems
    • Cognitive radio or SDR
    • Hyper-spectral imaging
    • Image processing and navigation
  • Real-time scientific computation and simulation
    • EM (electromagnetic) simulation
    • Molecular dynamics
  • CAD acceleration
    • FPGA place and route
  • Others
    • Bioinformatics

7
High-performance DSP
  • Stream-based computation model
  • Usually real-time requirement
  • High-bandwidth data I/O
  • Low numerical precision requirements: fixed-point
    or reduced-precision floating point (see the
    fixed-point sketch below)
  • Dominated by data processing, with few control
    branch points
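
To make the fixed-point, stream-based style concrete,
here is a minimal Python sketch of a 4-tap FIR filter in
Q1.15 fixed point. The taps, word width, and input are
illustrative only; on an FPGA each tap would map to a
dedicated multiply-accumulate unit.

  # Minimal sketch of a stream-style DSP kernel: a 4-tap FIR
  # filter in Q1.15 fixed point (16-bit signed samples).
  def to_q15(x):
      # Convert a float in [-1, 1) to a 16-bit fixed-point integer.
      return int(round(x * 32768))

  TAPS = [to_q15(c) for c in (0.25, 0.5, 0.5, 0.25)]   # illustrative taps

  def fir_stream(samples):
      delay = [0, 0, 0, 0]                  # shift-register state
      for s in samples:
          delay = [s] + delay[:-1]
          acc = sum(c * d for c, d in zip(TAPS, delay))
          yield acc >> 15                   # rescale back to Q1.15

  print(list(fir_stream([to_q15(0.1)] * 8)))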

8
Scientific Computing
  • Computationally demanding
  • Double-precision floating point
  • Traditional methods require FFTs, matrix
    operations, and linear-system solvers (LINPACK)
  • Often regular or adaptive grid structure
  • Traditionally not real-time processing, but
    real-time processing would offer new
    applications.
  • Opportunities to innovate on the
    algorithm/mapping for reconfigurable hardware
    (a regular-grid sketch follows below).
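
A minimal Python sketch of the regular-grid style this
slide refers to: one Jacobi relaxation kernel for the 2-D
Laplace equation. Grid size, boundary values, and
iteration count are arbitrary; the point is that the same
update is applied at every interior point, which is the
spatial regularity a reconfigurable machine can exploit.

  # Jacobi relaxation for the 2-D Laplace equation on a regular
  # grid: every interior point takes the average of its four
  # neighbours.
  import numpy as np

  def jacobi_step(u):
      v = u.copy()
      v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
      return v

  u = np.zeros((64, 64))
  u[0, :] = 1.0                 # fixed boundary condition on one edge
  for _ in range(100):          # iteration count is arbitrary here
      u = jacobi_step(u)
  print(u[32, 32])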

9
CAD acceleration
  • The existing low-level tool flow is currently too
    slow to be practical for HERC systems.
  • HERC machines should be used to accelerate their
    own tools. Some starting ideas (a toy placement
    sketch follows the references below):
  • Hardware-Assisted Fast Routing. André DeHon,
    Randy Huang, and John Wawrzynek. In Proceedings
    of the IEEE Symposium on Field-Programmable
    Custom Computing Machines (FCCM '02), April
    22-24, 2002.
  • Hardware-Assisted Simulated Annealing with
    Application for Fast FPGA Placement. Michael
    Wrighton and André DeHon. In Proceedings of the
    International Symposium on Field Programmable
    Gate Arrays, pages 33-42, February 2003.
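
For context, a toy software version of the
simulated-annealing placement loop that the second
reference accelerates in hardware. This is the generic
textbook algorithm, not the scheme from the paper, and
the four-cell netlist and cost function below are made up
for illustration.

  # Toy simulated-annealing placement: swap two cells, keep the
  # swap if it lowers total wirelength, otherwise keep it only
  # with a probability that shrinks as the temperature falls.
  import math, random

  nets = [(0, 1), (1, 2), (2, 3), (0, 3)]            # made-up netlist
  pos = {c: (random.random(), random.random()) for c in range(4)}

  def wirelength():
      return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                 for a, b in nets)

  temp = 1.0
  for _ in range(1000):
      a, b = random.sample(range(4), 2)
      old = wirelength()
      pos[a], pos[b] = pos[b], pos[a]                # trial swap
      delta = wirelength() - old
      if delta > 0 and random.random() > math.exp(-delta / temp):
          pos[a], pos[b] = pos[b], pos[a]            # reject: undo swap
      temp *= 0.995                                  # cooling schedule
  print(wirelength())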

10
Bioinformatics
  • Implicitly parallel algorithms
  • Stream-like data processing
  • Integer operations sufficient
  • History of success with reconfigurable
    architectures.
  • High-capacity persistent storage is required for
    matching against large databases (a
    sequence-matching sketch follows below)
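
To illustrate the integer-only, stream-friendly nature of
sequence matching, a minimal Python sketch of
Smith-Waterman local-alignment scoring. The
match/mismatch/gap values are arbitrary; on an FPGA each
column of the dynamic-programming recurrence would
typically map to one cell of a systolic array.

  # Smith-Waterman local-alignment score using only small-integer
  # arithmetic; scoring constants are arbitrary.
  MATCH, MISMATCH, GAP = 2, -1, -2

  def sw_score(query, target):
      prev = [0] * (len(target) + 1)
      best = 0
      for q in query:
          curr = [0]
          for j, t in enumerate(target, start=1):
              sub = prev[j - 1] + (MATCH if q == t else MISMATCH)
              curr.append(max(0, sub, prev[j] + GAP, curr[j - 1] + GAP))
              best = max(best, curr[j])
          prev = curr
      return best

  print(sw_score("GATTACA", "GCATGCU"))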

11
Conventional High-end Computers
Clusters of commodity microprocessors
  • System performance in the 100s of GFLOPS to 10s
    of TFLOPS range.
  • Using commodity components is a key idea
  • Low-volume production makes it difficult to
    justify custom silicon.
  • Commodity components ride technology curve.
  • But Microprocessors are the wrong component!

12
Computation Density of Processors
  • Serial instruction stream limits parallelism
  • Power consumption limits performance

13
Xilinx Platform FPGA Roadmap
  • Reconfigurable devices drive the next process
    step
  • Simple performance scaling

14
FPGA Density and Flexibility
  • @ 200 MHz: 20 GFLOPS
  • FPGAs already offer a density advantage
  • Offer problem-specific operators

15
Other Characteristics of High-end Microprocessor
systems
  • Memory is a problem
  • Serial von Neumann execution places heavy demands
    on the memory system
  • The processor-memory speed gap widens with
    Moore's Law
  • Multiple layers of caches are necessary to keep
    up.
  • Most HEC applications derive little or no benefit
    from caches, but caches add power, latency, cost,
    and unpredictability.
  • Real-time processing is impossible because of
    unpredictable delays in the memory hierarchy and
    communication network
  • Microprocessors are inherently fault-intolerant
    and costly

16
Characteristics of Reconfigurable Computer Systems
  • Increased performance density, lower clock rate,
    and reduced power per node.
  • High spatial parallelism and circuit
    specialization within nodes.
  • No cache, computing elements operate at the same
    speed as memory
  • Multiple independently addressed memory banks per
    node
  • Internal FPGA SRAM can be user-controlled cache
    if needed.
  • Flexible interconnection network (circuit/packet
    switching).
  • Predictable memory and network latency permit
    static scheduling in real-time applications.
  • FPGAs are inherently tolerant of manufacturing
    faults

17
B2 Design
  • Approach: look at hardware configurations,
    evaluate them against the programming model and
    applications, and iterate.
  • Starting Constraints
  • Use all COTS components
  • FPGAs, memory, connectors/cables
  • Highly modular
  • Scalable from a single module to approximately 1K
    FPGA chips in a system (8 TFLOPS)

18
Computing node and memory
  • Single Xilinx Virtex 2 Pro 70 FPGA
  • 75K logic cells (4-LUT + FF pairs, roughly 0.5M
    logic gates)
  • 1704-pin package with 996 user I/O pins
  • 2 PowerPC cores
  • 500 dedicated multipliers (18-bit)
  • 700KBytes SRAM on-chip
  • 20 10-Gbit/s serial communication links (MGTs)
  • 4 physical DDR 400 banks
  • Each bank has 72 data bits with ECC
  • Independently addressed, with 16 logical banks
    total
  • 12.8 GBps memory bandwidth, with up to 8 GB
    capacity (checked in the sketch below)
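
A quick check of the per-node memory figures as a Python
sketch. It assumes the quoted 12.8 GBps counts only the
64 data bits per bank, with the remaining 8 of the 72
bits carrying ECC; that assumption is not stated on the
slide.

  # Per-node memory bandwidth: 4 DDR-400 banks, 64 data bits each
  # (the other 8 of the 72 bits are assumed to be ECC).
  banks = 4
  data_bits = 64
  transfers_per_sec = 400e6              # DDR-400
  bandwidth_gbytes = banks * (data_bits / 8) * transfers_per_sec / 1e9
  print(bandwidth_gbytes, "GBps")        # 12.8

  # 8 GB maximum capacity spread over the 16 logical banks.
  print(8 / 16, "GB per logical bank")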

19
Inter-node Connections
  • Module: each group of four nodes shares a control
    node and forms a computational cluster.
  • Point-to-point connection between control node
    and processing node
  • 144-bit, 300 MHz DDR
  • 38.4 Gb/s per link
  • Uplinks connect to other modules to form a 4-ary
    tree.
  • Downlinks for I/O on leaf nodes and for tree
    connection on switch modules.

B2 module
20
4-ary Tree Connection
  • 4-ary tree configuration
  • High-bandwidth, high-latency: 12X InfiniBand,
    10 Gbps duplex
  • Low-bandwidth, low-latency: 64-pin (32-bit) LVDS
    @ 200 MHz DDR
  • Some B2 modules act as switch nodes and
    aggregation computing points.

21
Fat-trees balancing computation and
communication
  • Uplink bandwidth can be partitioned to allow a
    family of fat-tree structures.
  • A Rent's-rule-type analysis can be used to
    characterize an application's connection locality
    (see the fitting sketch below).
  • The machine can be built or configured to match
    the appropriate Rent constant (which may differ
    at different levels of the tree).
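
A minimal Python sketch of what a Rent's-rule analysis
looks like in software: fit the exponent p in T = t * G^p
from (partition size, external-connection count) pairs
measured on a design. The sample data below is made up
for illustration.

  # Fit Rent's rule T = t * G**p on a log-log scale from measured
  # (partition size G, external connection count T) pairs.
  import numpy as np

  G = np.array([16, 64, 256, 1024, 4096])    # partition sizes (made up)
  T = np.array([40, 110, 310, 860, 2400])    # external connections

  p, log_t = np.polyfit(np.log(G), np.log(T), 1)
  print("Rent exponent p  =", round(float(p), 3))
  print("terminals/node t =", round(float(np.exp(log_t)), 2))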

22
Non-fat-tree Configuration (64 nodes)
  • Constant cross-section bandwidth at each tree
    level

23
Fat-tree Configuration (64 nodes)
  • Cross-section bandwidth grows towards tree root.
  • Uses a higher ratio of switch modules to leaf
    modules.
  • The tree structure is configured by how modules
    are wired (see the sizing sketch below).
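
A small Python sketch of how a Rent exponent p translates
into uplink sizing for the 64-node, 4-ary tree: a subtree
with n leaves needs external bandwidth roughly
proportional to n^p, so p = 1 reproduces the full
fat-tree and smaller exponents thin the upper levels. The
10 Gb/s per-leaf figure is illustrative only.

  # Uplink sizing per tree level for a 4-ary tree with 64 leaves,
  # driven by a Rent exponent p: external bandwidth of a subtree
  # with n leaves is taken as proportional to n**p.
  ARITY, LEVELS = 4, 3                   # 4**3 = 64 leaf nodes
  PER_LEAF_GBPS = 10.0                   # illustrative per-leaf bandwidth

  for p in (1.0, 0.7):
      print(f"Rent exponent p = {p}")
      for level in range(1, LEVELS + 1):
          leaves_below = ARITY ** level
          uplink = PER_LEAF_GBPS * leaves_below ** p      # per subtree
          total = uplink * (ARITY ** (LEVELS - level))    # whole level cut
          print(f"  level {level}: {uplink:7.1f} Gb/s per subtree, "
                f"{total:7.1f} Gb/s across the level")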

24
B2 Module board layout
  • 4 computing nodes, 1 control node, up to 40 GBytes
    of DRAM
  • 8 SATA connections for up to 8 hard disks

25
Example B2 System
  • Two modules (8 nodes) per 1U rack unit (19" x 27"),
    or
  • One module plus up to 4 disks.
  • Single cabinet
    • 256-node tree-connected B2 system, with
    • up to 3.4 TB DDR DRAM
    • >40 TOPS or 2 TFLOPS
    • (not counting tree nodes; reconstructed in the
      sketch below)
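
A rough reconstruction of where the single-cabinet
numbers could come from, as a Python sketch. It assumes
the 3.4 TB DRAM total counts the switch modules of the
4-ary tree as well as the 64 leaf modules, and that the
2 TFLOPS figure uses roughly 8 GFLOPS per node
(consistent with the 1K-chip, 8 TFLOPS target on slide
17). Both are assumptions, not figures stated here.

  # 256 leaf nodes = 64 leaf modules; a 4-ary tree over them adds
  # 16 + 4 + 1 = 21 switch modules, each with up to 40 GB of DRAM.
  leaf_nodes = 256
  leaf_modules = leaf_nodes // 4                       # 4 nodes per module

  switch_modules, groups = 0, leaf_modules
  while groups > 1:
      groups //= 4
      switch_modules += groups

  modules = leaf_modules + switch_modules              # 85 modules total
  print(modules * 40 / 1000, "TB of DRAM")             # 3.4
  print(leaf_nodes * 8 / 1000, "TFLOPS at ~8 GFLOPS per node")  # ~2.0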

26
Summary
  • Supercomputer-level computation at a fraction of
    the cost and size
  • High computational density enables small physical
    size.
  • Low-level redundancy enables manufacturing-fault
    tolerance and drastic cost reduction.
  • Platform for
    • extending the BEE approach to real-time emulation
    • experimenting with reconfigurable computing
      programming models and application domains
  • Scalable
    • Computation/memory capacity varies with the
      number of modules and FPGA generation
    • Wiring options vary the computation/communication
      balance

27
Spare Slides
28
Alternative Switch Scheme
  • Specialized crossbar switch implemented as an ASIC
    (Mellanox)
  • 200 ns latency
  • Fat-tree organization with constant cross-section
    bandwidth

29
Disk Storage Schemes
  • Intra-module working storage at each B2 module
  • User disk storage schemes
  • Connection to existing NAS through Gigabit
    Ethernet from all B2 modules
  • Direct high bandwidth storage nodes attached to
    the main crossbar network
  • SAN bridge attached to the main crossbar network
    adapting to existing SAN