Title: High-End Reconfigurable Computing
1. High-End Reconfigurable Computing
- Berkeley Wireless Research Center
- January 2004
- John Wawrzynek, Robert W. Brodersen, Chen Chang,
Vason Srini, Brian Richards
2. Berkeley Emulation Engine
- FPGA-based system for real-time hardware emulation
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents, corresponding to 600 Gops (16-bit adds), although it is not a logic-gate emulator (arithmetic sketch below)
- 2400 external parallel I/Os providing 192 Gbps raw bandwidth
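As a sanity check on these figures, here is a back-of-the-envelope sketch; the assumption that every counted operation is a single-cycle 16-bit add, and that all 2400 pins signal at the same rate, is ours rather than the slide's:

```python
# Back-of-the-envelope check of the BEE capacity and I/O figures above.
# Assumption (not stated on the slide): each emulated operation is a
# 16-bit add completing in one emulation clock cycle.

emulation_clock_hz = 60e6          # peak emulation speed, 60 MHz
total_ops_per_s = 600e9            # quoted 600 Gops (16-bit adds)
parallel_adders = total_ops_per_s / emulation_clock_hz
print(f"implied parallel 16-bit adders: {parallel_adders:.0f}")   # 10000

io_pins = 2400                     # external parallel I/Os
raw_bandwidth_bps = 192e9          # quoted 192 Gbps
per_pin_rate = raw_bandwidth_bps / io_pins
print(f"implied per-pin signaling rate: {per_pin_rate/1e6:.0f} Mbps")  # 80
```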
3. Status
- Four BEE processing units built
- Three in continuous production use
- Supported universities
- CMU, USC, Tampere, UMass, Stanford
- Successful tapeouts:
- 3.2M-transistor PicoRadio chip
- 1.8M-transistor LDPC decoder chip
- Systems emulated:
- QPSK radio transceiver
- BCJR decoder
- MPEG IDCT
- On-going projects
- UWB mixed-signal SoC
- MPEG transcoder
- PicoRadio multi-node system
- Infineon SIMD processor for SDR
4. Lessons from BEE
- Simulink-based tool flow is a very effective FPGA programming model in the DSP domain
- Many system emulation tasks are significant computations in their own right; high-performance emulation hardware makes for high-performance general computing
- Is this the right way to build supercomputers?
- BEE could be scaled up with the latest FPGAs and by using multiple boards → TeraBEE (B2)
5. High-End Reconfigurable Computer (HERC)
- A machine with supercomputer-level performance, configured on a per-problem basis to match the structure of the task by exploiting spatial parallelism
- All data paths, control paths, memory ports and controllers, and communication channels and controllers are wired to match the needs of a particular problem
6. Applications of Interest
- High-performance DSP/communication systems
- Cognitive radio or SDR
- Hyper-spectral imaging
- Image processing and navigation
- Real-time scientific computation and simulation
- Electromagnetic (EM) simulation
- Molecular dynamics
- CAD acceleration
- FPGA place and route
- Others
- Bioinformatics
7. High-performance DSP
- Stream-based computation model
- Usually a real-time requirement
- High-bandwidth data I/O
- Low numerical-precision requirements: fixed-point or reduced-precision floating point (see the sketch after this list)
- Data-processing dominated, few control branch points
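A minimal illustration of the fixed-point style of arithmetic these bullets refer to; the Q1.15 format and the example operands are illustrative assumptions, not taken from the slide:

```python
# Minimal Q1.15 fixed-point multiply, the kind of low-precision
# arithmetic a DSP datapath on an FPGA implements directly.
# The Q1.15 format choice is an illustrative assumption.

FRAC_BITS = 15                      # Q1.15: 1 sign bit, 15 fraction bits

def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a 16-bit two's-complement integer."""
    return max(-32768, min(32767, round(x * (1 << FRAC_BITS))))

def q15_mul(a: int, b: int) -> int:
    """16x16 -> 16-bit multiply with truncation, as a hardware multiplier would."""
    return (a * b) >> FRAC_BITS

a, b = to_q15(0.5), to_q15(-0.25)
print(q15_mul(a, b) / (1 << FRAC_BITS))   # ~ -0.125
```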
8. Scientific Computing
- Computationally demanding
- Double-precision floating point
- Traditional methods require FFTs, matrix operations, and linear-system solvers (LINPACK)
- Often regular or adaptive grid structure
- Traditionally not real-time processing, but real-time processing would enable new applications
- Opportunities to innovate on the algorithm/mapping for reconfigurable hardware
9. CAD acceleration
- The existing low-level tool flow is currently too slow to be practical for HERC systems
- HERC machines should be used to accelerate their own tools. Some starting ideas:
- Hardware-Assisted Fast Routing. André DeHon, Randy Huang, and John Wawrzynek. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '02), April 22-24, 2002.
- Hardware-Assisted Simulated Annealing with Application for Fast FPGA Placement. Michael Wrighton and André DeHon. In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 33-42, February 2003.
10. Bioinformatics
- Implicitly parallel algorithms
- Stream-like data processing
- Integer operations sufficient
- History of success with reconfigurable architectures (e.g., sequence alignment; sketch below)
- High-capacity persistent storage devices required for matching against large databases
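The slide does not name a specific algorithm; as one concrete example of the integer, stream-like processing it describes, here is a minimal sketch of the Smith-Waterman local-alignment recurrence, whose anti-diagonal cells are independent and can update in parallel on an FPGA. The scoring values are illustrative assumptions.

```python
# Hedged sketch: Smith-Waterman scoring, a classic bioinformatics
# kernel mapped to FPGAs. Integer-only, and every cell on an
# anti-diagonal is independent, so a reconfigurable fabric can
# compute one full anti-diagonal per clock. Scores are illustrative.

MATCH, MISMATCH, GAP = 2, -1, -1    # assumed scoring scheme

def smith_waterman_score(s1: str, s2: str) -> int:
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = MATCH if s1[i - 1] == s2[j - 1] else MISMATCH
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # match/mismatch
                          H[i - 1][j] + GAP,       # deletion
                          H[i][j - 1] + GAP)       # insertion
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))
```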
11. Conventional High-End Computers
Clusters of commodity microprocessors
- System performance in the 100s-of-GFLOPS to 10s-of-TFLOPS range
- Using commodity components is a key idea
- Low-volume production makes it difficult to justify custom silicon
- Commodity components ride the technology curve
- But microprocessors are the wrong component!
12. Computation Density of Processors
- Serial instruction stream limits parallelism
- Power consumption limits performance
13. Xilinx Platform FPGA Roadmap
- Reconfigurable devices drive the next process step
- Simple performance scaling
14. FPGA Density + Flexibility
- @ 200 MHz: 20 GFLOPS
- FPGAs already offer a density advantage (worked arithmetic below)
- Offer problem-specific operators
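The 20 GFLOPS @ 200 MHz figure implies a particular amount of spatial parallelism; the assumption that each operator is fully pipelined and retires one result per cycle is ours, not the slide's:

```python
# How many floating-point operators must complete a result every cycle
# to reach 20 GFLOPS at 200 MHz? Assumes fully pipelined operators,
# one result per cycle (our assumption).

clock_hz = 200e6
flops = 20e9
fp_ops_per_cycle = flops / clock_hz
print(f"concurrent FP operators implied: {fp_ops_per_cycle:.0f}")   # 100
```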
15. Other Characteristics of High-End Microprocessor Systems
- Memory is a problem
- Serial von Neumann execution forces heavy demands on the memory system
- The processor-memory speed gap widens with Moore's Law
- Multiple layers of caches are necessary to keep up
- Most HEC applications derive little or no benefit from caches, but caches add power, latency, cost, and unpredictability
- Real-time processing is impossible because of unpredictable delays in the memory hierarchy and communication network
- Microprocessors are inherently fault-intolerant and costly
16. Characteristics of Reconfigurable Computer Systems
- Increased performance density, lower clock rate, reduced power per node
- High spatial parallelism and circuit specialization within nodes
- No cache; computing elements operate at the same speed as memory
- Multiple independently addressed memory banks per node
- Internal FPGA SRAM can serve as a user-controlled cache if needed
- Flexible interconnection network (circuit/packet switching)
- Predictable memory and network latency permits static scheduling in real-time applications
- FPGAs are inherently tolerant of manufacturing faults
17. B2 Design
- Approach: look at hardware configurations, evaluate based on programming model and applications, and iterate
- Starting constraints:
- Use all COTS components
- FPGAs, memory, connectors/cables
- Highly modular
- Scalable from a single module to approximately 1K FPGA chips in a system (8 TFLOPS)
18. Computing Node and Memory
- Single Xilinx Virtex-II Pro 70 FPGA
- 75K logic cells (4-LUT + FF → ~0.5M logic gates)
- 1704-pin package with 996 user I/O pins
- 2 PowerPC cores
- 500 dedicated 18-bit multipliers
- 700 KBytes of on-chip SRAM
- 20 10-Gbit/s serial communication links (MGTs)
- 4 physical DDR-400 banks
- Each bank has 72 data bits with ECC
- Independently addressed, with 16 logical banks total
- 12.8 GBps memory bandwidth, with up to 8 GB capacity (arithmetic check below)
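The aggregate bandwidth figure follows from the bank specs above; excluding the 8 ECC bits per bank from payload bandwidth is our assumption:

```python
# Check of the 12.8 GBps aggregate figure: four DDR-400 banks, each
# with 64 data bits plus 8 ECC bits (ECC bits excluded from payload
# bandwidth -- our assumption).

banks = 4
data_bits = 64                    # 72 bits per bank minus 8 ECC bits
transfers_per_s = 400e6           # DDR-400: 400 MT/s
bytes_per_s = banks * (data_bits // 8) * transfers_per_s
print(f"{bytes_per_s / 1e9:.1f} GB/s")   # 12.8
```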
19. Inter-node Connections
- Module: each group of four nodes shares a control node and forms a computational cluster
- Point-to-point connections between the control node and each processing node
- 144-bit, 300 MHz DDR
- 38.4 Gb/s per link (see the arithmetic sketch after the figure)
- Uplinks connect to other modules to form a 4-ary tree
- Downlinks carry I/O on leaf nodes and tree connections on switch modules
(Figure: B2 module)
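The quoted 38.4 Gb/s is consistent with the following reading of the link; treating the 144 wires as 72 bits per direction, 64 of them payload, is our interpretation rather than something the slide states:

```python
# 38.4 Gb/s per link matches 64 payload bits per direction at
# 300 MHz DDR. The 72-bits-each-way (64 data + 8 ECC) split is an
# assumption, not stated on the slide.

payload_bits = 64
clock_hz = 300e6
ddr_factor = 2                    # two transfers per clock
print(payload_bits * clock_hz * ddr_factor / 1e9, "Gb/s")   # 38.4
```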
20. 4-ary Tree Connection
- 4-ary tree configuration
- High-bandwidth, high-latency: 12X InfiniBand, 10 Gbps duplex
- Low-bandwidth, low-latency: 64-pin (32-bit) LVDS @ 200 MHz DDR
- Some B2 modules act as switch nodes and aggregation computing points
21. Fat-trees: Balancing Computation and Communication
- Uplink bandwidth can be partitioned to allow a family of fat-tree structures
- A Rent's-rule-type analysis can be used to characterize an application's connection locality (sketch below)
- The machine can be built or configured to match the appropriate Rent constant (possibly different at different levels)
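A minimal sketch of the kind of Rent's-rule analysis meant here: Rent's rule says the external bandwidth of a partition of n nodes scales as B(n) = B(1) · n^p for a Rent exponent p. The per-node bandwidth reuses the link figure from the inter-node slide; the exponent is an illustrative assumption.

```python
# Rent's-rule sizing sketch for the tree uplinks. B(n) = b1 * n**p is
# the external bandwidth a subtree of n leaf nodes needs; the exponent
# p is an illustrative assumption, not a measured value.

b1_gbps = 38.4          # per-node link bandwidth (inter-node slide)
p = 0.7                 # assumed Rent exponent for the application

for level in range(5):            # 4-ary tree: 1, 4, 16, 64, 256 leaves
    n = 4 ** level
    print(f"subtree of {n:3d} nodes needs ~{b1_gbps * n**p:8.1f} Gb/s uplink")
```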
22. Non-Fat-Tree (64 nodes)
- Constant cross-section bandwidth at each tree
level
23. Fat-Tree Configuration (64 nodes)
- Cross-section bandwidth grows towards tree root.
- Uses a higher ratio of switch/leaf modules.
- Tree structure is configured by how modules are
wired.
24. B2 Module Board Layout
- 4 computing nodes, 1 control node, up to 40 GBytes of DRAM
- 8 SATA connections for up to 8 hard disks
25. Example B2 System
- Two modules (8 nodes) per 1U rack unit (19" × 27"), or
- One module plus up to 4 disks
- Single cabinet:
- 256-node tree-connected B2 system, with
- up to 3.4 TB of DDR DRAM
- > 40 TOPS or 2 TFLOPS
- (not counting tree nodes; arithmetic sketch below)
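These headline numbers are consistent with the per-module figures earlier in the deck; the switch-module count for a 4-ary tree over 64 leaf modules is our inference, not something the slide states:

```python
# Consistency check of the single-cabinet numbers against the
# per-module specs. The switch-module count (16 + 4 + 1 for a 4-ary
# tree over 64 leaf modules) is our inference.

leaf_modules = 256 // 4                 # 4 computing nodes per module
switch_modules = 16 + 4 + 1             # assumed three switch levels
dram_tb = (leaf_modules + switch_modules) * 40 / 1000   # 40 GB/module
print(f"DRAM: {dram_tb:.1f} TB")        # 3.4

gflops_per_node = 8                     # from "1K FPGA chips ~ 8 TFLOPS"
print(f"{256 * gflops_per_node / 1000:.1f} TFLOPS")     # 2.0
```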
26. Summary
- Supercomputer-level computation at a fraction of the cost and size
- High computational density enables small physical size
- Low-level redundancy enables manufacturing-fault tolerance and drastic cost reduction
- Platform for:
- Extending the BEE approach to real-time emulation
- Experimenting with reconfigurable-computing programming models and application domains
- Scalable:
- Computation/memory capacity varies with the number of modules and the FPGA generation
- Wiring options vary the computation/communication balance
27. Spare Slides
28. Alternative Switch Scheme
- Specialized crossbar switch implemented as an ASIC (Mellanox)
- 200 ns latency
- Fat-tree organization with constant cross-section bandwidth
29. Disk Storage Schemes
- Intra-B2-module working storage at each module
- User disk storage schemes:
- Connection to existing NAS through Gigabit Ethernet from all B2 modules
- Direct high-bandwidth storage nodes attached to the main crossbar network
- SAN bridge attached to the main crossbar network, adapting to existing SANs