Title: High-End Reconfigurable Computing at BWRC
1. High-End Reconfigurable Computing at BWRC
- John Wawrzynek
- University of California, Berkeley
- Berkeley Wireless Research Center
2. Berkeley Emulation Engine (BEE), 2002
- FPGA-based system for real-time hardware emulation
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents, corresponding to 600 Gops (16-bit adds), although not a logic-gate emulator
- 2400 external parallel I/Os providing 192 Gbps raw bandwidth
- 20 Xilinx Virtex-E 2000 chips, 16 1-MB ZBT SRAM chips
3. Real-Time Processing Allows In-System Emulation
4. Matlab/Simulink Programming Tools
Discrete-time block diagrams with FSMs
- Tool flow developed by MathWorks, Xilinx, and UCB.
- User specifies the design as block diagrams (for datapaths) and finite state machines (for control).
- Tools automatically map to FPGA and ASIC implementations.
- User-assisted partitioning with automatic system-level routing.
Matlab/Simulink is a standard simulation specification system in many industries. Automatic FPGA mapping has been very successful with BEE in the DSP domain.
5. BEE Status
- Four BEE processing units built
  - Three in near-continuous production use
- Other supported universities
  - CMU, USC, Tampere, UMass, Stanford
- Successful tapeouts
  - 3.2M-transistor PicoRadio chip
  - 1.8M-transistor LDPC decoder chip
- Systems emulated
  - QPSK radio transceiver
  - BCJR decoder
  - MPEG IDCT
- On-going projects
  - UWB mixed-signal SoC
  - MPEG/PRISM transcoder
  - PicoRadio multi-node system
  - Infineon SIMD processor for SDR
6. Lessons from BEE
- High-performance real-time emulation vastly eases the debugging/verification/tuning process.
- The Simulink-based tool flow is a very effective FPGA programming model in the DSP domain.
- System emulation tasks are significant computations in their own right: high-performance emulation hardware makes for high-performance general computing.
- Is this the right way to build high-end (super)computers?
BEE could be scaled up with the latest FPGAs and by using multiple boards: BEE2 (and beyond).
7. The Team
- Faculty in charge
  - John Wawrzynek
  - Bob W. Brodersen
- Graduate students
  - Kevin Camera (tools)
  - Chen Chang (arch, tools, apps)
  - Pierre-Yves Droz (arch)
  - Alexander Krasnov (apps)
  - Zohair Hyder (arch)
  - Yury Markovskiy (apps)
  - Adam Megacz (tools)
  - Hayden So (tools)
  - Norm Zhou (apps, tools)
- Industrial liaisons
  - Bob Conn (Xilinx)
  - Ivo Bolsens (Xilinx)
- Research associates
  - Dan Werthimer (SSL)
  - Melvyn Wright (UCB, RAL)
- Technical staff
  - Brian Richards
  - Susan H. Mellers
- Undergraduate students
  - John Conner
  - Greg Gibeling
8. BEE2 Prototype
- Modular design scalable from a few to hundreds of FPGAs.
- High memory capacity and bandwidth to support general computing applications.
- High-bandwidth, low-latency inter-module communication to support massive parallelism.
- All off-the-shelf components; no custom chips.
- Thanks to Xilinx for engineering assistance, FPGAs, and interaction on application development.
9. Basic Computing Element
- Single Xilinx Virtex-II Pro 70 FPGA
  - 70K logic cells
  - 1704-pin package with 996 user I/O pins
  - 2 PowerPC 405 cores
  - 326 dedicated multipliers (18-bit)
  - 5.8 Mbit on-chip SRAM
  - 20 3.125-Gbit/s duplex serial communication links (MGTs)
  - 4 physical DDR2-400 banks
- Per FPGA: up to 12.8 GByte/s memory bandwidth and a maximum 8 GByte capacity.
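The 12.8 GByte/s figure follows from the DDR2 banks; a back-of-envelope check, assuming each DDR2-400 bank has a 64-bit (8-byte) data bus (the function name and parameters are illustrative):

```python
def ddr2_bandwidth_gbytes(n_banks=4, transfer_rate_mts=400, bus_bytes=8):
    """Peak memory bandwidth in GByte/s for n_banks DDR2 banks.

    DDR2-400 performs 400 million transfers per second per bank,
    each transfer moving one full bus width of data.
    """
    return n_banks * transfer_rate_mts * 1e6 * bus_bytes / 1e9

# 4 banks x 400 MT/s x 8 bytes = 12.8 GByte/s per FPGA
print(ddr2_bandwidth_gbytes())
```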
10. BEE2 Module
- 4 computing nodes + 1 control node form one large virtual computing element.
- 2D mesh connection between computing FPGAs
  - 140 bits @ 150 MHz DDR, 42 Gbps per link
- Star connection from control node to computing nodes
  - 64 bits @ 150 MHz DDR, 19.2 Gbps per link
- 18 InfiniBand 4X connectors (10 Gbps duplex each) for inter-module communication
- Modules are directly connected or connected through a commercial switch.
Single PC board; 1.0 TeraOp (16-bit) range.
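The quoted link rates follow directly from the signaling parameters; a small check (illustrative helper, not part of the BEE2 tool flow), using the fact that DDR signaling transfers data on both clock edges:

```python
def ddr_link_gbps(width_bits, clock_mhz=150):
    """Raw bandwidth in Gbps of a DDR parallel link:
    width x clock x 2 (data moves on both clock edges)."""
    return width_bits * clock_mhz * 1e6 * 2 / 1e9

# Mesh link: 140 bits @ 150 MHz DDR -> 42 Gbps
# Star link:  64 bits @ 150 MHz DDR -> 19.2 Gbps
print(ddr_link_gbps(140), ddr_link_gbps(64))
```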
11. Compute Module
Completed 12/04.
- Module also includes I/O for administration and maintenance
  - 10/100 Ethernet
  - HDMI / DVI
  - USB
14x17-inch, 22-layer PCB
12. Inter-Module Connections
Figure: global communication tree; links carry stream packets plus admin, UI, and NFS traffic.
13. Alternative Topology: 3D Mesh or Torus
- The 4 compute FPGAs can be used to extend to a 3D mesh/torus
- 6 directional links
  - 4 MGT links
  - 2 on-board LVCMOS links
14. 19" Rack Example
- 40 compute modules in 5 chassis (8U) per rack
- Over 32-40 TOPS (2 TFLOPS) performance per rack
- 250-Watt AC/DC power supply to each blade
- 12.5 kW max power consumption
- Hardware cost: $800K
- Currently collaborating with Paul Wright and students on packaging for a desktop single-module version and a rack-mount version.
15. Why Are These Systems Interesting?
- Best solution in several domains
  - Emulation for custom chip design
  - Extreme DSP tasks
  - Scientific and supercomputing
- Good model for how to build future chips and systems
  - Massively parallel
- Fine-grained reconfigurability enables
  - Robust performance/power efficiency on a wide range of problems
  - Fault and defect tolerance
16. Wireless-Network Simulation
- SDR, cognitive radio, and ad-hoc networks
  - Platform for developing soft-radio techniques,
  - validation of network protocols,
  - chip-level validation in the context of real data and network/environment.
- Requires real-time
  - simulation of complex channel and environment models,
  - simultaneous simulation of 100s to 1000s of network nodes (with real-time sensor input).
17. FPGA-Based Emulation Platforms
- Continue to evolve the Simulink design flow
  - New library elements, automatic partitioning of designs across FPGAs
- Several new efforts leverage the fine-grained programmability of FPGAs to extend the tool set
  - Hayden So: BORPH, a general-purpose, concurrent, multi-user operating system designed for reconfigurable computing systems built from arrays of FPGAs.
  - Kevin Camera: design-specific in-system debugging and verification tools.
  - Norm Zhou: interactive and incremental design environment for multiple-FPGA systems.
18. Extreme Digital Signal Processing
BEE2 is a promising computing platform for the Allen Telescope Array (ATA) and proposed Square Kilometer Array (SKA) SETI spectrometers, and for image formation in radio astronomy research.
- Massive arithmetic-operations-per-second requirement
- Stream-based computation model
  - Usually hard real-time requirements
  - High-bandwidth data I/O
- Low numerical-precision requirements
  - Mostly fixed-point operations
  - Rarely needs reduced floating point
- Data-flow processing dominated
  - Few control branch points
19. Comparison with DSP Chips
- Spectrometer polyphase filter bank (PFB): 16-bit mult, 32-bit acc; correlator: 4-bit mult, 24-bit acc.
- Cost based on street price.
- Assume peak numbers for DSPs, mapped designs for FPGAs.
- TI DSPs
  - C6415-7E, 130 nm (720 MHz)
  - C6415T-1G, 90 nm (1 GHz)
- FPGAs
  - 130 nm, freq. 200-250 MHz
- Metrics include chips only (not system)
Metrics compared: performance, energy efficiency, and cost-performance.
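The PFB benchmark above is a windowed filter followed by an FFT; a minimal floating-point sketch of the idea (function name, tap count, and window choice are illustrative assumptions; the hardware version uses 16-bit multiplies with 32-bit accumulators):

```python
import numpy as np

def pfb_spectrum(x, n_chan=8, n_taps=4):
    """One output frame of a polyphase filter bank channelizer.

    A prototype low-pass FIR filter of n_chan * n_taps coefficients is
    split into n_taps branches of n_chan samples; input samples are
    weighted and summed down the taps, then an FFT separates channels.
    """
    n = n_chan * n_taps
    # Windowed-sinc prototype filter (Hamming window, illustrative choice).
    proto = np.sinc(np.arange(n) / n_chan - n_taps / 2) * np.hamming(n)
    w = proto.reshape(n_taps, n_chan)
    frames = np.asarray(x[:n], dtype=float).reshape(n_taps, n_chan)
    # Polyphase multiply-accumulate, then FFT across the channels.
    return np.fft.fft((frames * w).sum(axis=0))
```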
20. Scientific and Supercomputing
Standard architecture: clusters of commodity microprocessors
- System performance in the 100s of GFLOPS to 10s of TFLOPS range.
- Physically large, expensive.
- Using commodity components is a key idea
  - Low-volume production makes it difficult to justify custom silicon.
  - Commodity components ride the technology curve (microprocessors, DRAM).
- But microprocessors are the wrong computing element!
21. Processor versus FPGA Performance Trend
(Assumes 75% utilization on FPGAs.)
- Processors are losing the performance battle.
- FPGAs track Moore's Law better: they drive the next process step and have simple performance-scaling properties.
- Potential for even more gain with the specialized datapath widths of FPGAs.
22. FDTD for Antenna Design
- Typical problem has 500^3 grid points and a 10,000-time-step simulation
  - 20 hours on a workstation
- Yee cell engine
  - 21K LUTs, 6.4 GB/s @ 120 MHz
  - Uses FP units from eda.org.
- Single FPGA (V2P70/100)
  - With 3 engines per chip: 1.9 hours
- On BEE2: less than a minute.
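The Yee-cell update is a simple leapfrog stencil; a minimal 1-D sketch of the scheme (illustrative only; the BEE2 engine implements the full 3-D Yee cell in hardware, and the grid size, step count, and source here are arbitrary assumptions):

```python
import numpy as np

def fdtd_1d(n_cells=200, n_steps=500):
    """1-D FDTD on a staggered (Yee) grid with a Courant factor of 0.5.

    E lives on integer grid points, H on the half points between them;
    each step updates H from the spatial derivative of E, then E from H.
    """
    ez = np.zeros(n_cells)
    hy = np.zeros(n_cells - 1)
    for t in range(n_steps):
        hy += 0.5 * (ez[1:] - ez[:-1])
        ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])
        # Soft Gaussian source injected at the center of the grid.
        ez[n_cells // 2] += np.exp(-((t - 30) / 10) ** 2)
    return ez
```

The hardware engine evaluates one such stencil update per grid point per cycle, which is where the 6.4 GB/s @ 120 MHz streaming requirement comes from.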
23. Bioinformatics
- Implicitly parallel algorithms
- Often stream-like data processing
- Integer operations sufficient
- History of success with reconfigurable/ASIC architectures (TimeLogic, Paracel).
- High-quality brute-force Smith-Waterman technique practical on BEE but not on PC clusters.
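For reference, the brute-force Smith-Waterman recurrence is just an integer dynamic program; a minimal sketch (scoring parameters are illustrative; hardware versions pipeline one anti-diagonal of this matrix per cycle, which is what makes the brute-force approach practical on FPGAs):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b.

    h[i][j] is the best score of an alignment ending exactly at
    a[i-1], b[j-1]; scores are clamped at 0 so alignments can restart.
    """
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

# Identical 4-character sequences score 4 matches x 2:
print(smith_waterman("ACGT", "ACGT"))  # → 8
```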
BLAST (Basic Local Alignment Search Tool): a preliminary implementation (simulation only) indicates BEE2 provides 100-1000x faster execution time running the BLAST algorithm, and over 1000x better price-performance than existing PC cluster solutions.
24. Full-Chip SPICE-Level Circuit Simulation
- Conventional implementation
  - Turns the circuit into a large 2D conductance matrix (representing connections between nodes with circuit elements).
  - A Gaussian-elimination-like method solves the matrix for voltages; this requires global communication and floating-point arithmetic.
  - Possible on BEE2, but we are trying an alternative approach.
- On BEE2 (Bob Conn, Xilinx)
  - The circuit is mapped spatially across the computation elements, each responsible for updating a set of nodes iteratively.
  - Communication is localized, and fixed-point arithmetic suffices.
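The iterative node update can be illustrated with a Jacobi-style nodal relaxation on a resistive network (a sketch under assumed interfaces; names and structure are not the actual Xilinx mapping, and the hardware would use fixed-point rather than floats):

```python
def relax_node_voltages(g, i_src, v_fixed, n_iter=2000):
    """Jacobi relaxation of node voltages in a resistive network.

    g[i][j] is the conductance between nodes i and j, i_src[i] the
    current injected at node i, and v_fixed maps node index -> clamped
    voltage. Each sweep updates every node from its neighbors' previous
    values (purely local communication).
    """
    n = len(g)
    v = [0.0] * n
    for _ in range(n_iter):
        nxt = []
        for i in range(n):
            if i in v_fixed:
                nxt.append(v_fixed[i])
                continue
            # KCL at node i: v_i = (I_inj + sum_j g_ij * v_j) / sum_j g_ij
            gsum = sum(g[i][j] for j in range(n) if j != i)
            flow = sum(g[i][j] * v[j] for j in range(n) if j != i)
            nxt.append((i_src[i] + flow) / gsum)
        v = nxt
    return v

# Voltage divider: 1 V -- R -- node 1 -- R -- 0 V, so v[1] settles at 0.5 V.
g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(relax_node_voltages(g, [0, 0, 0], {0: 1.0, 2: 0.0}))
```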
25. Defect Tolerance on BEE2
- The FPGAs we use have large area (like microprocessors), hence poor yields.
- Unlike microprocessors, FPGAs have an inherent potential for defect (and fault) tolerance, and thus dramatic yield enhancement (and price reduction).
  - Fine-grained reconfigurability allows faulty resources to be avoided with insignificant effect on performance and area.
  - The majority of the area is in interconnect and switches (95%), and only a small fraction is used by any specific design.
- Our approach
  - Per-design testing
    - Test the specific routing resources used in a design.
    - Easy, since only a small fraction of all routing resources is used.
  - Defect tolerance through swapping designs; possible because of the symmetry in the system. Statistically very high probability of finding a successful mapping.
26. Effect of Design Swapping
- baseP is the probability that any user design (bit file) will successfully map to any FPGA.
- successP is the probability that at least one permutation of designs successfully maps the machine.
- Anecdotal evidence suggests that baseP is around 50%.
- The threshold is much lower for other topologies with more symmetry and more cost (e.g., crossbar); it goes as log n / n for n FPGAs.
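The relationship between baseP and successP can be sketched with a small Monte Carlo experiment (an illustrative model, assuming each design/FPGA pair maps independently with probability baseP and that any permutation of designs onto FPGAs is allowed):

```python
import random
from itertools import permutations

def success_prob(n_fpgas=5, base_p=0.5, trials=2000, seed=0):
    """Estimate successP: chance that some permutation of n_fpgas designs
    lands every design on an FPGA whose defects it avoids."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # ok[d][f]: design d maps onto FPGA f (independent coin flips).
        ok = [[rng.random() < base_p for _ in range(n_fpgas)]
              for _ in range(n_fpgas)]
        if any(all(ok[d][f] for d, f in enumerate(p))
               for p in permutations(range(n_fpgas))):
            hits += 1
    return hits / trials
```

Even with baseP around 0.5, trying all permutations makes successP much larger than the probability that one fixed assignment works, which is the point of design swapping.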
27. Programming Models
Three primary approaches:
- Matlab/Simulink
  - Extends the system developed for BEE1.
- Domain-specific languages and tools
  - High-level programming and specification languages.
  - Compilers leverage known communication and memory-access patterns for given domains.
  - Example: 3D grid problems.
- Port of Fortran/MPI
  - Standard programming model for supercomputing.
  - Optimized MPI library for FPGAs; hand-mapped libraries, but automatic support for Fortran acceleration. The main SPMD thread runs on a processor core.
28. Development of petaBEE
- Based on concepts demonstrated in the BEE2 prototype, 1 petaOPS (10^15 operations per second) is attainable within 4 years.
- BEE4
  - Assumes 65 nm by late 2006.
  - Special masks may be needed to provide the proper balance of I/O, memory, and logic.
  - Special masks could boost floating-point performance (5x) if needed by applications.
  - Memory die stacked on FPGAs to gain 4x in density.
29. Structure of Continuing Development Effort
- Successful development requires an integrated effort
  - Especially here, where we are simultaneously innovating on all fronts.
  - Feedback between activities is crucial.
- Several key applications drive the development.