1
High-End Reconfigurable Computing
  • Berkeley Wireless Research Center
  • January 2004
  • John Wawrzynek, Robert W. Brodersen, Chen Chang,
    Vason Srini, Brian Richards

2
Berkeley Emulation Engine
  • FPGA-based system for real-time hardware
    emulation
  • Emulation speeds up to 60 MHz
  • Emulation capacity of 10 million ASIC
    gate-equivalents, corresponding to 600 Gops of
    16-bit adds (although BEE is not a logic-gate
    emulator)
  • 2400 external parallel I/Os providing 192 Gbps
    raw bandwidth (see the back-of-envelope sketch
    below)
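
A back-of-envelope check of these figures, as a small
Python sketch. The 80 Mbps-per-pin signaling rate is an
assumption chosen so that the raw-bandwidth number works
out; it is not stated on the slide.

  # Back-of-envelope check of the BEE headline numbers.
  io_pins = 2400
  mbps_per_pin = 80                      # assumed per-pin signaling rate
  print(io_pins * mbps_per_pin / 1000)   # 192.0 Gbps raw bandwidth

  # 600 Gops of 16-bit adds at the 60 MHz emulation clock implies
  # roughly 10,000 concurrent 16-bit adders active in the fabric.
  print(600e9 / 60e6)                    # 10000.0 parallel adders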

3
Status
  • Four BEE processing units built
  • Three in continuous production use
  • Supported universities
    • CMU, USC, Tampere, UMass, Stanford
  • Successful tapeouts
    • 3.2M-transistor PicoRadio chip
    • 1.8M-transistor LDPC decoder chip
  • Systems emulated
    • QPSK radio transceiver
    • BCJR decoder
    • MPEG IDCT
  • On-going projects
    • UWB mixed-signal SoC
    • MPEG transcoder
    • PicoRadio multi-node system
    • Infineon SIMD processor for SDR

4
Lessons from BEE
  • Simulink-based tool flow is a very effective FPGA
    programming model in the DSP domain.
  • Many system emulation tasks are significant
    computations in their own right:
    high-performance emulation hardware makes for
    high-performance general computing.
  • Is this the right way to build supercomputers?
  • BEE could be scaled up with the latest FPGAs and
    by using multiple boards → TeraBEE (B2).

5
High-End Reconfigurable Computer (HERC)
  • A machine with supercomputer-level performance,
    configured on a per-problem basis to match the
    structure of the task by exploiting spatial
    parallelism.
  • All data paths, control paths, memory ports and
    controllers, and communication channels and
    controllers are wired to match the needs of a
    particular problem.

6
Applications of Interest
  • High-performance DSP/communication systems
    • Cognitive radio or SDR
    • Hyper-spectral imaging
    • Image processing and navigation
  • Real-time scientific computation and simulation
    • EM (electromagnetic) simulation
    • Molecular dynamics
  • CAD acceleration
    • FPGA place and route
  • Others
    • Bioinformatics

7
High-performance DSP
  • Stream-based computation model
  • Usually real-time requirement
  • High-bandwidth data I/O
  • Low numerical precision requirements: fixed-point
    or reduced-precision floating point (see the
    fixed-point sketch below)
  • Dominated by data processing, with few control
    branch points
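
To make the fixed-point, stream-based style concrete,
here is a minimal Python sketch of a 4-tap FIR filter in
Q1.15 fixed point. The taps, word width, and input are
illustrative only; on an FPGA each tap would map to a
dedicated multiply-accumulate unit.

  # Minimal sketch of a stream-style DSP kernel: a 4-tap FIR
  # filter in Q1.15 fixed point (16-bit signed samples).
  def to_q15(x):
      # Convert a float in [-1, 1) to a 16-bit fixed-point integer.
      return int(round(x * 32768))

  TAPS = [to_q15(c) for c in (0.25, 0.5, 0.5, 0.25)]   # illustrative taps

  def fir_stream(samples):
      delay = [0, 0, 0, 0]                  # shift-register state
      for s in samples:
          delay = [s] + delay[:-1]
          acc = sum(c * d for c, d in zip(TAPS, delay))
          yield acc >> 15                   # rescale back to Q1.15

  print(list(fir_stream([to_q15(0.1)] * 8)))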

8
Scientific Computing
  • Computationally demanding
  • Double-precision floating point
  • Traditional methods require FFTs, matrix
    operations, and linear-system solvers (LINPACK)
  • Often regular or adaptive grid structure
  • Traditionally not real-time processing, but
    real-time processing would offer new
    applications.
  • Opportunities to innovate on the
    algorithm/mapping for reconfigurable hardware
    (a regular-grid sketch follows below).
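
A minimal Python sketch of the regular-grid style this
slide refers to: one Jacobi relaxation kernel for the 2-D
Laplace equation. Grid size, boundary values, and
iteration count are arbitrary; the point is that the same
update is applied at every interior point, which is the
spatial regularity a reconfigurable machine can exploit.

  # Jacobi relaxation for the 2-D Laplace equation on a regular
  # grid: every interior point takes the average of its four
  # neighbours.
  import numpy as np

  def jacobi_step(u):
      v = u.copy()
      v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
      return v

  u = np.zeros((64, 64))
  u[0, :] = 1.0                 # fixed boundary condition on one edge
  for _ in range(100):          # iteration count is arbitrary here
      u = jacobi_step(u)
  print(u[32, 32])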

9
CAD acceleration
  • The existing low-level tool flow is currently too
    slow to be practical for HERC systems.
  • HERC machines should be used to accelerate their
    own tools. Some starting ideas (a toy placement
    sketch follows the references below):
  • Hardware-Assisted Fast Routing. André DeHon,
    Randy Huang, and John Wawrzynek. In Proceedings
    of the IEEE Symposium on Field-Programmable
    Custom Computing Machines (FCCM '02), April
    22-24, 2002.
  • Hardware-Assisted Simulated Annealing with
    Application for Fast FPGA Placement. Michael
    Wrighton and André DeHon. In Proceedings of the
    International Symposium on Field Programmable
    Gate Arrays, pages 33-42, February 2003.
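
For context, a toy software version of the
simulated-annealing placement loop that the second
reference accelerates in hardware. This is the generic
textbook algorithm, not the scheme from the paper, and
the four-cell netlist and cost function below are made up
for illustration.

  # Toy simulated-annealing placement: swap two cells, keep the
  # swap if it lowers total wirelength, otherwise keep it only
  # with a probability that shrinks as the temperature falls.
  import math, random

  nets = [(0, 1), (1, 2), (2, 3), (0, 3)]            # made-up netlist
  pos = {c: (random.random(), random.random()) for c in range(4)}

  def wirelength():
      return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                 for a, b in nets)

  temp = 1.0
  for _ in range(1000):
      a, b = random.sample(range(4), 2)
      old = wirelength()
      pos[a], pos[b] = pos[b], pos[a]                # trial swap
      delta = wirelength() - old
      if delta > 0 and random.random() > math.exp(-delta / temp):
          pos[a], pos[b] = pos[b], pos[a]            # reject: undo swap
      temp *= 0.995                                  # cooling schedule
  print(wirelength())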

10
Bioinformatics
  • Implicitly parallel algorithms
  • Stream-like data processing
  • Integer operations sufficient
  • History of success with reconfigurable
    architectures.
  • High-capacity persistent storage is required for
    matching against large databases (a
    sequence-matching sketch follows below)
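
To illustrate the integer-only, stream-friendly nature of
sequence matching, a minimal Python sketch of
Smith-Waterman local-alignment scoring. The
match/mismatch/gap values are arbitrary; on an FPGA each
column of the dynamic-programming recurrence would
typically map to one cell of a systolic array.

  # Smith-Waterman local-alignment score using only small-integer
  # arithmetic; scoring constants are arbitrary.
  MATCH, MISMATCH, GAP = 2, -1, -2

  def sw_score(query, target):
      prev = [0] * (len(target) + 1)
      best = 0
      for q in query:
          curr = [0]
          for j, t in enumerate(target, start=1):
              sub = prev[j - 1] + (MATCH if q == t else MISMATCH)
              curr.append(max(0, sub, prev[j] + GAP, curr[j - 1] + GAP))
              best = max(best, curr[j])
          prev = curr
      return best

  print(sw_score("GATTACA", "GCATGCU"))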

11
Conventional High-end Computers
Clusters of commodity microprocessors
  • System performance in the 100s of GFLOPS to 10s
    of TFLOPS range.
  • Using commodity components is a key idea
  • Low-volume production makes it difficult to
    justify custom silicon.
  • Commodity components ride technology curve.
  • But Microprocessors are the wrong component!

12
Computation Density of Processors
  • Serial instruction stream limits parallelism
  • Power consumption limits performance

13
Xilinx Platform FPGA Roadmap
  • Reconfigurable devices drive the next process
    step
  • Simple performance scaling

14
FPGA Density and Flexibility
  • @ 200 MHz: 20 GFLOPS
  • FPGAs already offer a density advantage
  • Offer problem-specific operators

15
Other Characteristics of High-end Microprocessor
systems
  • Memory is a problem
  • Serial von Neumann execution places heavy demands
    on the memory system
  • The processor-memory speed gap widens with
    Moore's Law
  • Multiple layers of caches are necessary to keep
    up.
  • Most HEC applications derive little or no benefit
    from caches, but caches add power, latency, cost,
    and unpredictability.
  • Real-time processing is impossible because of
    unpredictable delays in the memory hierarchy and
    communication network
  • Microprocessors are inherently fault-intolerant
    and costly

16
Characteristics of Reconfigurable Computer Systems
  • Increased performance density, lower clock rate,
    and reduced power per node.
  • High spatial parallelism and circuit
    specialization within nodes.
  • No cache, computing elements operate at the same
    speed as memory
  • Multiple independently addressed memory banks per
    node
  • Internal FPGA SRAM can be user-controlled cache
    if needed.
  • Flexible interconnection network (circuit/packet
    switching).
  • Predictable memory and network latency permit
    static scheduling in real-time applications.
  • FPGAs are inherently tolerant of manufacturing
    faults

17
B2 Design
  • Approach: look at hardware configurations,
    evaluate them against the programming model and
    applications, and iterate.
  • Starting Constraints
  • Use all COTS components
  • FPGAs, memory, connectors/cables
  • Highly modular
  • Scalable from a single module to approximately 1K
    FPGA chips in a system (8 TFLOPS)

18
Computing node and memory
  • Single Xilinx Virtex 2 Pro 70 FPGA
  • 75K logic cells (4-LUT + FF pairs, roughly 0.5M
    logic gates)
  • 1704-pin package with 996 user I/O pins
  • 2 PowerPC cores
  • 500 dedicated multipliers (18-bit)
  • 700KBytes SRAM on-chip
  • 20 10-Gbit/s serial communication links (MGTs)
  • 4 physical DDR 400 banks
  • Each bank has 72 data bits with ECC
  • Independently addressed, with 16 logical banks
    total
  • 12.8 GBps memory bandwidth, with up to 8 GB
    capacity (checked in the sketch below)
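
A quick check of the per-node memory figures as a Python
sketch. It assumes the quoted 12.8 GBps counts only the
64 data bits per bank, with the remaining 8 of the 72
bits carrying ECC; that assumption is not stated on the
slide.

  # Per-node memory bandwidth: 4 DDR-400 banks, 64 data bits each
  # (the other 8 of the 72 bits are assumed to be ECC).
  banks = 4
  data_bits = 64
  transfers_per_sec = 400e6              # DDR-400
  bandwidth_gbytes = banks * (data_bits / 8) * transfers_per_sec / 1e9
  print(bandwidth_gbytes, "GBps")        # 12.8

  # 8 GB maximum capacity spread over the 16 logical banks.
  print(8 / 16, "GB per logical bank")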

19
Inter-node Connections
  • Module: each group of four nodes shares a control
    node and forms a computational cluster.
  • Point-to-point connection between control node
    and processing node
  • 144-bit, 300 MHz DDR
  • 38.4 Gb/s per link
  • Uplinks connect to other modules to form a 4-ary
    tree.
  • Downlinks for I/O on leaf nodes and for tree
    connection on switch modules.

B2 module
20
4-ary Tree Connection
  • 4-ary tree configuration
  • High-bandwidth, high-latency: 12X InfiniBand,
    10 Gbps duplex
  • Low-bandwidth, low-latency: 64-pin (32-bit) LVDS
    @ 200 MHz DDR
  • Some B2 modules act as switch nodes and
    aggregation computing points.

21
Fat-trees balancing computation and
communication
  • Uplink bandwidth can be partitioned to allow a
    family of fat-tree structures.
  • A Rent's-rule-type analysis can be used to
    characterize an application's connection locality
    (see the fitting sketch below).
  • The machine can be built or configured to match
    the appropriate Rent constant (which may differ
    at different levels of the tree).
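
A minimal Python sketch of what a Rent's-rule analysis
looks like in software: fit the exponent p in T = t * G^p
from (partition size, external-connection count) pairs
measured on a design. The sample data below is made up
for illustration.

  # Fit Rent's rule T = t * G**p on a log-log scale from measured
  # (partition size G, external connection count T) pairs.
  import numpy as np

  G = np.array([16, 64, 256, 1024, 4096])    # partition sizes (made up)
  T = np.array([40, 110, 310, 860, 2400])    # external connections

  p, log_t = np.polyfit(np.log(G), np.log(T), 1)
  print("Rent exponent p  =", round(float(p), 3))
  print("terminals/node t =", round(float(np.exp(log_t)), 2))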

22
Non-fat-tree Configuration (64 nodes)
  • Constant cross-section bandwidth at each tree
    level

23
Fat-tree Configuration (64 nodes)
  • Cross-section bandwidth grows towards tree root.
  • Uses a higher ratio of switch modules to leaf
    modules.
  • The tree structure is configured by how modules
    are wired (see the sizing sketch below).
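
A small Python sketch of how a Rent exponent p translates
into uplink sizing for the 64-node, 4-ary tree: a subtree
with n leaves needs external bandwidth roughly
proportional to n^p, so p = 1 reproduces the full
fat-tree and smaller exponents thin the upper levels. The
10 Gb/s per-leaf figure is illustrative only.

  # Uplink sizing per tree level for a 4-ary tree with 64 leaves,
  # driven by a Rent exponent p: external bandwidth of a subtree
  # with n leaves is taken as proportional to n**p.
  ARITY, LEVELS = 4, 3                   # 4**3 = 64 leaf nodes
  PER_LEAF_GBPS = 10.0                   # illustrative per-leaf bandwidth

  for p in (1.0, 0.7):
      print(f"Rent exponent p = {p}")
      for level in range(1, LEVELS + 1):
          leaves_below = ARITY ** level
          uplink = PER_LEAF_GBPS * leaves_below ** p      # per subtree
          total = uplink * (ARITY ** (LEVELS - level))    # whole level cut
          print(f"  level {level}: {uplink:7.1f} Gb/s per subtree, "
                f"{total:7.1f} Gb/s across the level")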

24
B2 Module board layout
  • 4 computing nodes, 1 control node, up to 40 GBytes
    of DRAM
  • 8 SATA connections for up to 8 hard disks

25
Example B2 System
  • Two modules (8 nodes) per 1U rack unit (19" x 27"),
    or
  • One module plus up to 4 disks.
  • Single cabinet
    • 256-node tree-connected B2 system, with
    • up to 3.4 TB DDR DRAM
    • >40 TOPS or 2 TFLOPS
    • (not counting tree nodes; reconstructed in the
      sketch below)
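
A rough reconstruction of where the single-cabinet
numbers could come from, as a Python sketch. It assumes
the 3.4 TB DRAM total counts the switch modules of the
4-ary tree as well as the 64 leaf modules, and that the
2 TFLOPS figure uses roughly 8 GFLOPS per node
(consistent with the 1K-chip, 8 TFLOPS target on slide
17). Both are assumptions, not figures stated here.

  # 256 leaf nodes = 64 leaf modules; a 4-ary tree over them adds
  # 16 + 4 + 1 = 21 switch modules, each with up to 40 GB of DRAM.
  leaf_nodes = 256
  leaf_modules = leaf_nodes // 4                       # 4 nodes per module

  switch_modules, groups = 0, leaf_modules
  while groups > 1:
      groups //= 4
      switch_modules += groups

  modules = leaf_modules + switch_modules              # 85 modules total
  print(modules * 40 / 1000, "TB of DRAM")             # 3.4
  print(leaf_nodes * 8 / 1000, "TFLOPS at ~8 GFLOPS per node")  # ~2.0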

26
Summary
  • Supercomputer-level computation at a fraction of
    the cost and size
  • High computational density enables small physical
    size.
  • Low-level redundancy enables manufacturing-fault
    tolerance and drastic cost reduction.
  • Platform for
    • extending the BEE approach to real-time emulation
    • experimenting with reconfigurable computing
      programming models and application domains
  • Scalable
    • Computation/memory capacity varies with the
      number of modules and FPGA generation
    • Wiring options vary the computation/communication
      balance

27
Spare Slides
28
Alternative Switch Scheme
  • Specialized crossbar switch implemented as an ASIC
    (Mellanox)
  • 200 ns latency
  • Fat-tree organization with constant cross-section
    bandwidth

29
Disk Storage Schemes
  • Intra-module working storage at each B2 module
  • User disk storage schemes
  • Connection to existing NAS through Gigabit
    Ethernet from all B2 modules
  • Direct high bandwidth storage nodes attached to
    the main crossbar network
  • SAN bridge attached to the main crossbar network
    adapting to existing SAN