Title: High-End Reconfigurable Computing at BWRC
1. High-End Reconfigurable Computing at BWRC
- John Wawrzynek
- University of California, Berkeley
- Berkeley Wireless Research Center
2. Berkeley Emulation Engine (BEE), 2002
- FPGA-based system for real-time hardware emulation
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents, corresponding to 600 Gops (16-bit adds), although not a logic-gate emulator
- 2400 external parallel I/Os providing 192 Gbps raw bandwidth
- 20 Xilinx Virtex-E 2000 chips, 16 1-MB ZBT SRAM chips
3. Real-Time Processing Allows In-System Emulation
4. Matlab/Simulink Programming Tools
Discrete-time block diagrams with FSMs
- Tool flow developed by MathWorks, Xilinx, and UCB.
- User specifies the design as block diagrams (for datapaths) and finite state machines (for control).
- Tools automatically map to FPGA and ASIC implementations.
- User-assisted partitioning with automatic system-level routing.
Matlab/Simulink is a standard simulation specification system in many industries. Automatic FPGA mapping has been very successful with BEE in the DSP domain.
5. BEE Status
- Four BEE processing units built
  - Three in near-continuous production use
- Other supported universities
  - CMU, USC, Tampere, UMass, Stanford
- Successful tapeouts
  - 3.2M-transistor PicoRadio chip
  - 1.8M-transistor LDPC decoder chip
- Systems emulated
  - QPSK radio transceiver
  - BCJR decoder
  - MPEG IDCT
- On-going projects
  - UWB mixed-signal SoC
  - MPEG/PRISM transcoder
  - PicoRadio multi-node system
  - Infineon SIMD processor for SDR
6. Lessons from BEE
- High-performance real-time emulation vastly eases the debugging/verification/tuning process.
- The Simulink-based tool flow is a very effective FPGA programming model in the DSP domain.
- System emulation tasks are significant computations in their own right: high-performance emulation hardware makes for high-performance general computing.
- Is this the right way to build high-end (super)computers?
BEE could be scaled up with the latest FPGAs and by using multiple boards: BEE2 (and beyond).
7. The Team
- Faculty in charge
  - John Wawrzynek
  - Bob W. Brodersen
- Graduate students
  - Kevin Camera (tools)
  - Chen Chang (arch, tools, apps)
  - Pierre-Yves Droz (arch)
  - Alexander Krasnov (apps)
  - Zohair Hyder (arch)
  - Yury Markovskiy (apps)
  - Adam Megacz (tools)
  - Hayden So (tools)
  - Norm Zhou (apps, tools)
- Industrial liaisons
  - Bob Conn (Xilinx)
  - Ivo Bolsens (Xilinx)
- Research associates
  - Dan Werthimer (SSL)
  - Melvyn Wright (UCB, RAL)
- Technical staff
  - Brian Richards
  - Susan H. Mellers
- Undergraduate students
  - John Conner
  - Greg Gibeling
8. BEE2 Prototype
- Modular design scalable from a few to hundreds of FPGAs.
- High memory capacity and bandwidth to support general computing applications.
- High-bandwidth, low-latency inter-module communication to support massive parallelism.
- All off-the-shelf components; no custom chips.
- Thanks to Xilinx for engineering assistance, FPGAs, and interaction on application development.
9. Basic Computing Element
- Single Xilinx Virtex-II Pro 70 FPGA
  - 70K logic cells
  - 1704-pin package with 996 user I/O pins
  - 2 PowerPC 405 cores
  - 326 dedicated multipliers (18-bit)
  - 5.8 Mbit on-chip SRAM
  - 20 3.125-Gbit/s duplex serial communication links (MGTs)
  - 4 physical DDR2-400 banks
- Per FPGA: up to 12.8 GByte/s memory bandwidth and a maximum 8 GByte capacity.
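The 12.8 GByte/s figure follows from the DDR2 banks; a back-of-envelope check, assuming each DDR2-400 bank has a 64-bit (8-byte) data bus (the function name and parameters are illustrative):

```python
def ddr2_bandwidth_gbytes(n_banks=4, transfer_rate_mts=400, bus_bytes=8):
    """Peak memory bandwidth in GByte/s for n_banks DDR2 banks.

    DDR2-400 performs 400 million transfers per second per bank,
    each transfer moving one full bus width of data.
    """
    return n_banks * transfer_rate_mts * 1e6 * bus_bytes / 1e9

# 4 banks x 400 MT/s x 8 bytes = 12.8 GByte/s per FPGA
print(ddr2_bandwidth_gbytes())
```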
10. BEE2 Module
- 4 computing nodes + 1 control node form one large virtual computing element.
- 2D mesh connection between computing FPGAs
  - 140 bits @ 150 MHz DDR, 42 Gbps per link
- Star connection from control node to computing nodes
  - 64 bits @ 150 MHz DDR, 19.2 Gbps per link
- 18 InfiniBand 4X connectors (10 Gbps duplex each) for inter-module communication
- Modules are directly connected or connected through a commercial switch.
Single PC board; 1.0 TeraOp (16-bit) range.
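The quoted link rates follow directly from the signaling parameters; a small check (illustrative helper, not part of the BEE2 tool flow), using the fact that DDR signaling transfers data on both clock edges:

```python
def ddr_link_gbps(width_bits, clock_mhz=150):
    """Raw bandwidth in Gbps of a DDR parallel link:
    width x clock x 2 (data moves on both clock edges)."""
    return width_bits * clock_mhz * 1e6 * 2 / 1e9

# Mesh link: 140 bits @ 150 MHz DDR -> 42 Gbps
# Star link:  64 bits @ 150 MHz DDR -> 19.2 Gbps
print(ddr_link_gbps(140), ddr_link_gbps(64))
```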
11. Compute Module
Completed 12/04.
- Module also includes I/O for administration and maintenance
  - 10/100 Ethernet
  - HDMI / DVI
  - USB
14x17-inch, 22-layer PCB
12. Inter-Module Connections
Figure: global communication tree; links carry stream packets plus admin, UI, and NFS traffic.
13. Alternative Topology: 3D Mesh or Torus
- The 4 compute FPGAs can be used to extend to a 3D mesh/torus
- 6 directional links
  - 4 MGT links
  - 2 on-board LVCMOS links
14. 19" Rack Example
- 40 compute modules in 5 chassis (8U) per rack
- Over 32-40 TOPS (2 TFLOPS) performance per rack
- 250-Watt AC/DC power supply to each blade
- 12.5 kW max power consumption
- Hardware cost: $800K
- Currently collaborating with Paul Wright and students on packaging for a desktop single-module version and a rack-mount version.
15. Why Are These Systems Interesting?
- Best solution in several domains
  - Emulation for custom chip design
  - Extreme DSP tasks
  - Scientific and supercomputing
- Good model for how to build future chips and systems
  - Massively parallel
- Fine-grained reconfigurability enables
  - Robust performance/power efficiency on a wide range of problems
  - Fault and defect tolerance
16. Wireless-Network Simulation
- SDR, cognitive radio, and ad-hoc networks
  - Platform for developing soft-radio techniques,
  - validation of network protocols,
  - chip-level validation in the context of real data and network/environment.
- Requires real-time
  - simulation of complex channel and environment models,
  - simultaneous simulation of 100s to 1000s of network nodes (with real-time sensor input).
17. FPGA-Based Emulation Platforms
- Continue to evolve the Simulink design flow
  - New library elements, automatic partitioning of designs across FPGAs
- Several new efforts leverage the fine-grained programmability of FPGAs to extend the tool set
  - Hayden So: BORPH, a general-purpose, concurrent, multi-user operating system designed for reconfigurable computing systems built from arrays of FPGAs.
  - Kevin Camera: design-specific in-system debugging and verification tools.
  - Norm Zhou: interactive and incremental design environment for multiple-FPGA systems.
18. Extreme Digital Signal Processing
BEE2 is a promising computing platform for the Allen Telescope Array (ATA) and proposed Square Kilometer Array (SKA) SETI spectrometers, and for image formation in radio astronomy research.
- Massive arithmetic-operations-per-second requirement
- Stream-based computation model
  - Usually hard real-time requirements
  - High-bandwidth data I/O
- Low numerical-precision requirements
  - Mostly fixed-point operations
  - Rarely needs reduced floating point
- Data-flow processing dominated
  - Few control branch points
19. Comparison with DSP Chips
- Spectrometer polyphase filter bank (PFB): 16-bit mult, 32-bit acc; correlator: 4-bit mult, 24-bit acc.
- Cost based on street price.
- Assume peak numbers for DSPs, mapped designs for FPGAs.
- TI DSPs
  - C6415-7E, 130 nm (720 MHz)
  - C6415T-1G, 90 nm (1 GHz)
- FPGAs
  - 130 nm, freq. 200-250 MHz
- Metrics include chips only (not system)
Metrics compared: performance, energy efficiency, and cost-performance.
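The PFB benchmark above is a windowed filter followed by an FFT; a minimal floating-point sketch of the idea (function name, tap count, and window choice are illustrative assumptions; the hardware version uses 16-bit multiplies with 32-bit accumulators):

```python
import numpy as np

def pfb_spectrum(x, n_chan=8, n_taps=4):
    """One output frame of a polyphase filter bank channelizer.

    A prototype low-pass FIR filter of n_chan * n_taps coefficients is
    split into n_taps branches of n_chan samples; input samples are
    weighted and summed down the taps, then an FFT separates channels.
    """
    n = n_chan * n_taps
    # Windowed-sinc prototype filter (Hamming window, illustrative choice).
    proto = np.sinc(np.arange(n) / n_chan - n_taps / 2) * np.hamming(n)
    w = proto.reshape(n_taps, n_chan)
    frames = np.asarray(x[:n], dtype=float).reshape(n_taps, n_chan)
    # Polyphase multiply-accumulate, then FFT across the channels.
    return np.fft.fft((frames * w).sum(axis=0))
```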
20. Scientific and Supercomputing
Standard architecture: clusters of commodity microprocessors
- System performance in the 100s of GFLOPS to 10s of TFLOPS range.
- Physically large, expensive.
- Using commodity components is a key idea
  - Low-volume production makes it difficult to justify custom silicon.
  - Commodity components ride the technology curve (microprocessors, DRAM).
- But microprocessors are the wrong computing element!
21. Processor versus FPGA Performance Trend
(Assumes 75% utilization on FPGAs.)
- Processors are losing the performance battle.
- FPGAs track Moore's Law better: they drive the next process step and have simple performance-scaling properties.
- Potential for even more gain with the specialized datapath widths of FPGAs.
22. FDTD for Antenna Design
- Typical problem has 500^3 grid points and a 10,000-time-step simulation
  - 20 hours on a workstation
- Yee cell engine
  - 21K LUTs, 6.4 GB/s @ 120 MHz
  - Uses FP units from eda.org.
- Single FPGA (V2P70/100)
  - With 3 engines per chip: 1.9 hours
- On BEE2: less than a minute.
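The Yee-cell update is a simple leapfrog stencil; a minimal 1-D sketch of the scheme (illustrative only; the BEE2 engine implements the full 3-D Yee cell in hardware, and the grid size, step count, and source here are arbitrary assumptions):

```python
import numpy as np

def fdtd_1d(n_cells=200, n_steps=500):
    """1-D FDTD on a staggered (Yee) grid with a Courant factor of 0.5.

    E lives on integer grid points, H on the half points between them;
    each step updates H from the spatial derivative of E, then E from H.
    """
    ez = np.zeros(n_cells)
    hy = np.zeros(n_cells - 1)
    for t in range(n_steps):
        hy += 0.5 * (ez[1:] - ez[:-1])
        ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])
        # Soft Gaussian source injected at the center of the grid.
        ez[n_cells // 2] += np.exp(-((t - 30) / 10) ** 2)
    return ez
```

The hardware engine evaluates one such stencil update per grid point per cycle, which is where the 6.4 GB/s @ 120 MHz streaming requirement comes from.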
23. Bioinformatics
- Implicitly parallel algorithms
- Often stream-like data processing
- Integer operations sufficient
- History of success with reconfigurable/ASIC architectures (TimeLogic, Paracel).
- High-quality brute-force Smith-Waterman technique practical on BEE but not on PC clusters.
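For reference, the brute-force Smith-Waterman recurrence is just an integer dynamic program; a minimal sketch (scoring parameters are illustrative; hardware versions pipeline one anti-diagonal of this matrix per cycle, which is what makes the brute-force approach practical on FPGAs):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b.

    h[i][j] is the best score of an alignment ending exactly at
    a[i-1], b[j-1]; scores are clamped at 0 so alignments can restart.
    """
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

# Identical 4-character sequences score 4 matches x 2:
print(smith_waterman("ACGT", "ACGT"))  # → 8
```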
BLAST (Basic Local Alignment Search Tool): a preliminary implementation (simulation only) indicates BEE2 provides 100-1000x faster execution time running the BLAST algorithm, and over 1000x better price-performance than existing PC cluster solutions.
24. Full-Chip SPICE-Level Circuit Simulation
- Conventional implementation
  - Turns the circuit into a large 2D conductance matrix (representing connections between nodes with circuit elements).
  - A Gaussian-elimination-like method solves the matrix for voltages; this requires global communication and floating-point arithmetic.
  - Possible on BEE2, but we are trying an alternative approach.
- On BEE2 (Bob Conn, Xilinx)
  - The circuit is mapped spatially across the computation elements, each responsible for updating a set of nodes iteratively.
  - Communication is localized, and fixed-point arithmetic suffices.
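The iterative node update can be illustrated with a Jacobi-style nodal relaxation on a resistive network (a sketch under assumed interfaces; names and structure are not the actual Xilinx mapping, and the hardware would use fixed-point rather than floats):

```python
def relax_node_voltages(g, i_src, v_fixed, n_iter=2000):
    """Jacobi relaxation of node voltages in a resistive network.

    g[i][j] is the conductance between nodes i and j, i_src[i] the
    current injected at node i, and v_fixed maps node index -> clamped
    voltage. Each sweep updates every node from its neighbors' previous
    values (purely local communication).
    """
    n = len(g)
    v = [0.0] * n
    for _ in range(n_iter):
        nxt = []
        for i in range(n):
            if i in v_fixed:
                nxt.append(v_fixed[i])
                continue
            # KCL at node i: v_i = (I_inj + sum_j g_ij * v_j) / sum_j g_ij
            gsum = sum(g[i][j] for j in range(n) if j != i)
            flow = sum(g[i][j] * v[j] for j in range(n) if j != i)
            nxt.append((i_src[i] + flow) / gsum)
        v = nxt
    return v

# Voltage divider: 1 V -- R -- node 1 -- R -- 0 V, so v[1] settles at 0.5 V.
g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(relax_node_voltages(g, [0, 0, 0], {0: 1.0, 2: 0.0}))
```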
25. Defect Tolerance on BEE2
- The FPGAs we use have large area (like microprocessors), hence poor yields.
- Unlike microprocessors, FPGAs have an inherent potential for defect (and fault) tolerance, and thus dramatic yield enhancement (and price reduction).
  - Fine-grained reconfigurability allows faulty resources to be avoided with insignificant effect on performance and area.
  - The majority of the area is in interconnect and switches (95%), and only a small fraction is used by any specific design.
- Our approach
  - Per-design testing
    - Test the specific routing resources used in a design.
    - Easy, since only a small fraction of all routing resources is used.
  - Defect tolerance through swapping designs; possible because of the symmetry in the system. Statistically very high probability of finding a successful mapping.
26. Effect of Design Swapping
- baseP is the probability that any user design (bit file) will successfully map to any FPGA.
- successP is the probability that at least one permutation of designs successfully maps the machine.
- Anecdotal evidence suggests that baseP is around 50%.
- The threshold is much lower for other topologies with more symmetry and more cost (e.g., crossbar); it goes as log n / n for n FPGAs.
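The relationship between baseP and successP can be sketched with a small Monte Carlo experiment (an illustrative model, assuming each design/FPGA pair maps independently with probability baseP and that any permutation of designs onto FPGAs is allowed):

```python
import random
from itertools import permutations

def success_prob(n_fpgas=5, base_p=0.5, trials=2000, seed=0):
    """Estimate successP: chance that some permutation of n_fpgas designs
    lands every design on an FPGA whose defects it avoids."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # ok[d][f]: design d maps onto FPGA f (independent coin flips).
        ok = [[rng.random() < base_p for _ in range(n_fpgas)]
              for _ in range(n_fpgas)]
        if any(all(ok[d][f] for d, f in enumerate(p))
               for p in permutations(range(n_fpgas))):
            hits += 1
    return hits / trials
```

Even with baseP around 0.5, trying all permutations makes successP much larger than the probability that one fixed assignment works, which is the point of design swapping.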
27. Programming Models
Three primary approaches:
- Matlab/Simulink
  - Extends the system developed for BEE1.
- Domain-specific languages and tools
  - High-level programming and specification languages.
  - Compilers leverage known communication and memory-access patterns for given domains.
  - Example: 3D grid problems.
- Port of Fortran/MPI
  - Standard programming model for supercomputing.
  - Optimized MPI library for FPGAs; hand-mapped libraries, but automatic support for Fortran acceleration. The main SPMD thread runs on a processor core.
28. Development of petaBEE
- Based on concepts demonstrated in the BEE2 prototype, 1 petaOPS (10^15 operations per second) is attainable within 4 years.
- BEE4
  - Assumes 65 nm by late 2006.
  - Special masks may be needed to provide the proper balance of I/O, memory, and logic.
  - Special masks could boost floating-point performance (5x) if needed by applications.
  - Memory die stacked on FPGAs to gain 4x in density.
29. Structure of Continuing Development Effort
- Successful development requires an integrated effort
  - Especially here, where we are simultaneously innovating on all fronts.
  - Feedback between activities is crucial.
- Several key applications drive the development.