HighEnd Reconfigurable Computing at BWRC - PowerPoint PPT Presentation

Provided by: ChenC5


1
High-End Reconfigurable Computing at BWRC
  • John Wawrzynek
  • University of California, Berkeley
  • Berkeley Wireless Research Center

2
Berkeley Emulation Engine (BEE), 2002
  • FPGA-based system for real-time hardware
    emulation
  • Emulation speeds up to 60 MHz
  • Emulation capacity of 10 million ASIC
    gate-equivalents, corresponding to 600 Gops
    (16-bit adds), although it is not a logic-gate
    emulator.
  • 2400 external parallel I/Os providing 192 Gbps
    raw bandwidth.
  • 20 Xilinx VirtexE 2000 chips, 16 1MB ZBT SRAM
    chips.

3
Realtime Processing Allows In-System Emulation
4
Matlab/Simulink Programming Tools
Discrete-Time-Block-Diagrams with FSMs
  • Tool flow developed by Mathworks, Xilinx, and
    UCB.
  • User specifies design as block diagrams (for
    datapaths) and finite state machines for control.
  • Tools automatically map designs to FPGA and ASIC
    implementations.
  • User-assisted partitioning with automatic
    system-level routing.

Matlab/Simulink is a standard simulation and
specification system in many industries. Automatic
FPGA mapping has been very successful with BEE
in the DSP domain.
5
BEE Status
  • Four BEE processing units built
    • Three in near-continuous production use
  • Other supported universities
    • CMU, USC, Tampere, UMass, Stanford
  • Successful tapeouts
    • 3.2M-transistor pico-radio chip
    • 1.8M-transistor LDPC decoder chip
  • Systems emulated
    • QPSK radio transceiver
    • BCJR decoder
    • MPEG IDCT
  • Ongoing projects
    • UWB mixed-signal SOC
    • MPEG/PRISM transcoder
    • Pico-radio multi-node system
    • Infineon SIMD processor for SDR

6
Lessons from BEE
  • High real-time performance vastly eases the
    debugging/verification/tuning process.
  • The Simulink-based tool flow is a very effective
    FPGA programming model in the DSP domain.
  • System emulation tasks are significant
    computations in their own right:
    high-performance emulation hardware makes for
    high-performance general computing.
  • Is this the right way to build high-end (super)
    computers?

BEE could be scaled up with the latest FPGAs and by
using multiple boards → BEE2 (and beyond).
7
The Team
  • Faculty in charge
    • John Wawrzynek
    • Bob W. Brodersen
  • Graduate students
    • Kevin Camera (tools)
    • Chen Chang (arch, tools, apps)
    • Pierre-Yves Droz (arch)
    • Alexander Krasnov (apps)
    • Zohair Hyder (arch)
    • Yury Markovskiy (apps)
    • Adam Megacz (tools)
    • Hayden So (tools)
    • Norm Zhou (apps, tools)
  • Industrial Liaison
    • Bob Conn (Xilinx)
    • Ivo Bolsens (Xilinx)
  • Research associates
    • Dan Werthimer (SSL)
    • Melvyn Wright (UCB, RAL)
  • Technical staff
    • Brian Richards
    • Susan H. Mellers
  • Undergraduate students
    • John Conner
    • Greg Gibeling

8
BEE2 Prototype
  • Modular design scalable from a few to hundreds of
    FPGAs.
  • High memory capacity and bandwidth to support
    general computing applications.
  • High bandwidth / low-latency inter-module
    communication to support massive parallelism.
  • All off-the-shelf components; no custom chips.
  • Thanks to Xilinx for engineering assistance,
    FPGAs, and interaction on application
    development.

9
Basic Computing Element
  • Single Xilinx Virtex 2 Pro 70 FPGA
    • 70K logic cells
    • 1704-pin package with 996 user I/O pins
    • 2 PowerPC 405 cores
    • 326 dedicated multipliers (18-bit)
    • 5.8 Mbit on-chip SRAM
    • 20 × 3.125 Gbit/s duplex serial communication
      links (MGTs)
  • 4 physical DDR2-400 banks
    • Per FPGA: up to 12.8 GByte/s memory bandwidth
      and a maximum capacity of 8 GByte.
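
The 12.8 GByte/s figure is consistent with four DDR2-400 banks if each bank has a 64-bit data bus. The bus width is an assumption here; the slide does not state it. A quick check in Python:

```python
# Peak DRAM bandwidth per FPGA.
# BUS_WIDTH_BITS is an assumption (not given on the slide).
BANKS = 4
TRANSFERS_PER_SEC = 400e6   # DDR2-400: 400 million transfers per second
BUS_WIDTH_BITS = 64         # assumed data-bus width of each bank


def peak_bandwidth_gbytes(banks=BANKS, rate=TRANSFERS_PER_SEC,
                          width_bits=BUS_WIDTH_BITS):
    """Return peak bandwidth in GByte/s (10^9 bytes per second)."""
    return banks * rate * (width_bits / 8) / 1e9


print(peak_bandwidth_gbytes())
```

Under that assumption each bank contributes 3.2 GByte/s, and four banks give the quoted total.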

10
BEE2 Module
  • 4 computing nodes + 1 control node form one
    large virtual computing element.
  • 2D mesh connection between computing FPGAs
    • 140 bits @ 150 MHz DDR, 42 Gbps per link
  • Star connection from control node to computing
    nodes
    • 64 bits @ 150 MHz DDR, 19.2 Gbps per link
  • 18 Infiniband 4X connectors (10 Gbps duplex each)
    for inter-module communication
    • Modules are directly connected or connected
      through a commercial switch.
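
The per-link figures follow from width × clock × 2, since a DDR link transfers data on both clock edges. A small sketch (the function name is only illustrative):

```python
def ddr_link_gbps(width_bits, clock_mhz):
    """Raw bandwidth of a parallel DDR link: two transfers per clock cycle."""
    return width_bits * clock_mhz * 2 / 1000


mesh = ddr_link_gbps(140, 150)   # link between computing FPGAs
star = ddr_link_gbps(64, 150)    # link from control node to a computing node
print(mesh, star)
```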

Single PC Board
1.0 TeraOp (16-bit) range.
11
Compute Module
Completed 12/04.
  • Module also includes I/O for administration and
    maintenance
    • 10/100 Ethernet
    • HDMI / DVI
    • USB

14 × 17 inch, 22-layer PCB
12
Inter-Module Connections
[Figure labels: Global Communication Tree; Stream Packets; Admin, UI, NFS]
13
Alternative topology: 3D mesh or torus
  • The 4 compute FPGAs can be used to extend to a 3D
    mesh/torus
  • 6 directional links
    • 4 MGT links
    • 2 on-board LVCMOS links
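
Six directional links per node match the ±x, ±y, ±z neighbors of a 3D torus. The sketch below only illustrates that addressing; the coordinate scheme and grid size are hypothetical, and it does not say which axes would use MGT versus LVCMOS links.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of `node` in a 3D torus with wraparound."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),   # +x, -x
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),   # +y, -y
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),   # +z, -z
    ]


print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

For a mesh instead of a torus, the wraparound neighbors at the edges would simply be omitted.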

14
19-inch Rack Example
  • 40 compute modules in 5 chassis (8U) per rack
  • Over 32-40 TOPS (2 TFLOPS) performance per rack
  • 250 W AC/DC power supply to each blade
  • 12.5 kW max power consumption
  • Hardware cost: 800K
  • Currently collaborating with Paul Wright and
    students on packaging for a desktop single-module
    version and a rack-mount version.
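
The rack figures can be cross-checked with simple arithmetic, assuming one blade holds one compute module and taking about 1 TOPS (16-bit) per module from the BEE2 Module slide. Both are assumptions for this sketch.

```python
MODULES_PER_RACK = 40
CHASSIS_PER_RACK = 5
WATTS_PER_BLADE = 250      # power supply rating per blade
TOPS_PER_MODULE = 1.0      # assumed: "1.0 TeraOp (16-bit) range" per module

modules_per_chassis = MODULES_PER_RACK // CHASSIS_PER_RACK
supply_kw = MODULES_PER_RACK * WATTS_PER_BLADE / 1000   # blade supplies only
rack_tops = MODULES_PER_RACK * TOPS_PER_MODULE

print(modules_per_chassis, supply_kw, rack_tops)
```

The blade supplies add up to 10 kW; the slide does not break down the remainder of the 12.5 kW maximum.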

15
Why are these systems interesting?
  • Best solution in several domains
    • Emulation for custom chip design
    • Extreme DSP tasks
    • Scientific and supercomputing
  • Good model for how to build future chips and
    systems
    • Massively parallel
    • Fine-grained reconfigurability enables
      • Robust performance/power efficiency on a
        wide range of problems.
      • Fault and defect tolerance.

16
Wireless-Network Simulation
  • SDR, cognitive radio, and ad-hoc networks
    • Platform for developing soft-radio techniques,
    • validation of network protocols,
    • chip-level validation in context of real data
      and network/environment.
  • Requires real-time
    • Simulation of complex channel and environment
      models,
    • Simultaneous simulation of 100s to 1000s of
      network nodes (with real-time sensor input).

17
FPGA-based Emulation Platforms
  • Continue to evolve the Simulink design flow
    • New library elements, automatic partitioning of
      designs across FPGAs
  • Several new efforts leverage the fine-grained
    programmability of FPGAs to extend the tool set
    • Hayden So: BORPH, a general-purpose, concurrent,
      multi-user operating system designed for
      reconfigurable computing systems built from
      arrays of FPGAs.
    • Kevin Camera: design-specific in-system
      debugging and verification tools.
    • Norm Zhou: interactive and incremental design
      environment for multiple-FPGA systems.

18
Extreme Digital-Signal-Processing
BEE2 is a promising computing platform for the
Allen Telescope Array (ATA) and the proposed Square
Kilometer Array (SKA): SETI spectrometer, image
formation for radio astronomy research
  • Massive arithmetic-operations-per-second
    requirement.
  • Stream-based computation model
    • Usually hard real-time requirement
    • High-bandwidth data I/O
  • Low numerical precision requirements
    • Mostly fixed-point operations
    • Rarely needs reduced floating point
  • Data-flow processing dominated
    • Few control branch points

19
Comparison with DSP Chips
  • Spectrometer polyphase filter bank (PFB): 16
    mult, 32-bit acc; correlator: 4-bit mult, 24-bit
    acc.
  • Cost based on street price.
  • Assume peak numbers for DSPs, mapped for FPGAs.
  • TI DSPs
    • C6415-7E, 130 nm (720 MHz)
    • C6415T-1G, 90 nm (1 GHz)
  • FPGAs
    • 130 nm, freq. 200-250 MHz.
  • Metrics include chips only (not system)

[Figure panels: Performance; Energy Efficiency; Cost-Performance]
20
Scientific and Supercomputing
Standard architecture: clusters of commodity
microprocessors
  • System performance in the 100s of GFLOPS to 10s
    of TFLOPS range.
  • Physically large, expensive.
  • Using commodity components is a key idea
    • Low-volume production makes it difficult to
      justify custom silicon.
    • Commodity components ride the technology curve
      (microprocessors, DRAM).
  • But microprocessors are the wrong computing
    element!

21
Processor versus FPGA performance trend
75% utilization on FPGAs
  • Processors are losing the performance battle.
  • FPGAs track Moore's Law better: they drive the
    next process step and have simple
    performance-scaling properties.
  • Potential for even more gain with specialized
    data-path widths of FPGAs.

22
FDTD for Antenna Design
  • Typical problem has 500^3 grid points and a 10,000
    time-step simulation
    • 20 hours on a workstation
  • Yee cell engine
    • 21K LUTs, 6.4 GB/s @ 120 MHz
    • Uses FP units from eda.org.
  • Single FPGA (V2P70/100)
    • With 3 engines per chip: 1.9 hours
  • On BEE2, less than a minute.
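
To make the workload concrete, the sketch below counts the cell updates in the typical problem and shows the leapfrog Yee update pattern on a 1D vacuum grid in normalized units. It is a software illustration only, not the 3D hardware engine described on the slide; the grid size, step count, and Courant number are arbitrary choices.

```python
# Work in the "typical problem": grid points times time steps.
cell_updates = 500**3 * 10_000
# Single-FPGA speedup over the workstation, from the quoted run times.
speedup_single_fpga = 20 / 1.9


def fdtd_1d(n_cells=200, n_steps=100, courant=0.5):
    """Leapfrog Yee update for a 1D vacuum grid (normalized units)."""
    ez = [0.0] * n_cells          # E field at integer grid points
    hy = [0.0] * (n_cells - 1)    # H field at half-integer grid points
    ez[n_cells // 2] = 1.0        # initial pulse in the middle of the grid
    for _ in range(n_steps):
        # H update from the spatial difference of E
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # E update from the spatial difference of H;
        # the two end points are never updated and stay at 0
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
    return ez, hy


ez, hy = fdtd_1d()
print(cell_updates, round(speedup_single_fpga, 1))
```

Each cell update touches only its nearest neighbors, which is the property that lets several engines stream through the grid in parallel.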

23
Bioinformatics
  • Implicitly parallel algorithms
  • Often stream-like data processing
  • Integer operations sufficient
  • History of success with reconfigurable/ASIC
    architectures. (TimeLogic, Paracell)
  • High-quality, brute-force Smith-Waterman
    technique is practical on BEE but not on PC clusters.
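
For reference, a minimal Smith-Waterman local-alignment score in Python. It uses only integer operations, in line with the point above. The scoring values (match, mismatch, gap) are illustrative assumptions, and this software sketch says nothing about how the algorithm would be mapped to FPGAs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)     # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
```

Every cell depends only on its left, upper, and upper-left neighbors, so cells along an anti-diagonal can be computed at the same time.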

BLAST (Basic Local Alignment Search Tool)
Preliminary implementation (simulation only)
indicates that BEE2 provides 100-1000 times faster
execution of the BLAST algorithm, and an over
1000X lower price-performance ratio than existing
PC cluster solutions.
24
Full-chip SPICE-level Circuit Simulation
  • Conventional implementation
    • Turns the circuit into a large 2D conductance
      matrix (representing connections between nodes
      with circuit elements).
    • A Gaussian-elimination-like method is used to
      solve the matrix for voltages; it requires
      global communication and floating-point
      arithmetic.
    • Possible on BEE2, but we are trying an
      alternative approach.
  • On BEE2 (Bob Conn, Xilinx)
    • Circuit is mapped spatially across the
      computation elements, each responsible for
      updating a set of nodes iteratively.
    • Communication is localized and fixed-point
      arithmetic suffices.
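
The slide does not name the iterative method. As one illustration of local, iterative node updates, here is a Jacobi relaxation on a purely resistive network: each free node repeatedly takes the conductance-weighted average of its neighbors' voltages. The data layout is hypothetical, and Python floats are used for brevity rather than fixed point.

```python
def relax(conductances, fixed, n_nodes, sweeps=2000):
    """Jacobi relaxation of node voltages in a resistor network.

    conductances: dict {(i, j): g} for each resistor between nodes i and j
    fixed: dict {node: voltage} for nodes tied to voltage sources
    """
    neighbors = {i: [] for i in range(n_nodes)}
    for (i, j), g in conductances.items():
        neighbors[i].append((j, g))
        neighbors[j].append((i, g))
    v = [fixed.get(i, 0.0) for i in range(n_nodes)]
    for _ in range(sweeps):
        new_v = list(v)
        for i in range(n_nodes):
            if i in fixed:
                continue              # source nodes keep their voltage
            total_g = sum(g for _, g in neighbors[i])
            # Only values from directly connected nodes are needed
            new_v[i] = sum(g * v[j] for j, g in neighbors[i]) / total_g
        v = new_v
    return v


# Four equal resistors in series between 1 V and ground
chain = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0}
print(relax(chain, {0: 1.0, 4: 0.0}, 5))
```

Unlike Gaussian elimination, each update needs only neighbor values, which is what keeps communication local.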

25
Defect Tolerance on BEE2
  • The FPGAs we use have large area (like
    microprocessors), hence poor yields.
  • Unlike microprocessors, FPGAs have an inherent
    potential for defect (and fault) tolerance, and
    thus dramatic yield enhancement (and price
    reduction).
    • Fine-grained reconfigurability allows faulty
      resources to be avoided with insignificant
      effect on performance and area.
    • The majority of area is in interconnect and
      switches (95%), and only a small fraction is
      used by any specific design.
  • Our approach
    • Per-design testing
      • Test the specific routing resources used in
        the design.
      • Easy, since only a small fraction of all
        routing resources is used.
    • Defect tolerance through swapping designs. Can
      be done because of symmetry in the system.
      Statistically very high probability of finding
      a successful mapping.

26
Effect of Design Swapping
  • baseP is the probability that any user design (bit
    file) will successfully map to any FPGA.
  • successP is the probability that at least one
    permutation of designs successfully maps onto
    the machine.
  • Anecdotal evidence suggests that baseP is around
    50%.
  • Threshold is much lower for other topologies with
    more symmetry and more cost (e.g., crossbar). Goes
    as log n / n for n FPGAs.
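
A rough Monte Carlo sketch of the swapping idea for the fully symmetric (crossbar-like) case, where any design may be placed on any FPGA. It assumes each design/FPGA pair works independently with probability baseP, which is a modeling assumption and not something the slide states; function names are illustrative. A trial succeeds if some one-to-one assignment of designs to FPGAs exists, checked with a standard augmenting-path bipartite matching.

```python
import random


def has_perfect_matching(ok):
    """Augmenting-path matching on an n x n boolean compatibility matrix."""
    n = len(ok)
    design_on = [-1] * n   # design_on[f] = design currently placed on FPGA f

    def try_assign(d, seen):
        for f in range(n):
            if ok[d][f] and not seen[f]:
                seen[f] = True
                # Use FPGA f if free, or if its design can move elsewhere
                if design_on[f] == -1 or try_assign(design_on[f], seen):
                    design_on[f] = d
                    return True
        return False

    return all(try_assign(d, [False] * n) for d in range(n))


def success_p(base_p, n, trials=2000, seed=0):
    """Estimate the chance that n designs can be permuted onto n FPGAs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ok = [[rng.random() < base_p for _ in range(n)] for _ in range(n)]
        hits += has_perfect_matching(ok)
    return hits / trials


print(success_p(0.5, 4))
```

Topologies with less symmetry allow fewer permutations, so this estimate is an upper bound for them under the same assumptions.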

27
Programming models
  • Three primary approaches
    • Matlab/Simulink
      • Extends the system developed for BEE1.
    • Domain-specific languages and tools
      • These are high-level programming and
        specification languages.
      • Compilers leverage known communication and
        memory access patterns for given domains.
      • Example: 3D grid problems.
    • Port of Fortran/MPI
      • Standard programming model for supercomputing.
      • Optimized MPI library for FPGAs, hand-mapped
        libraries, but automatic support for Fortran
        acceleration. Main SPMD thread runs on
        processor core.

28
Development of petaBEE
  • Based on concepts demonstrated in the BEE2
    prototype, 1 petaOPS (10^15 operations per second)
    is attainable within 4 years.
  • BEE4
    • Assumes 65 nm by late 2006.
    • Special masks may be needed to provide proper
      balance of I/O, memory, and logic.
    • Special masks could boost floating-point
      performance (5x) if needed by applications.
    • Memory die stacked on FPGAs to gain 4x in
      density.

29
Structure of Continuing Development Effort
  • Successful development requires an integrated
    effort
  • Especially here where we are simultaneously
    innovating on all fronts.
  • Feedback between activities is crucial.
  • Several key applications drive the development.