Title: Reconfigurable Computing: FPGAs for Ultrascale Science Sandia National Laboratories
1 Reconfigurable Computing: FPGAs for Ultrascale Science
Sandia National Laboratories
Craig Ulmer SNL/CA cdulmer_at_sandia.gov
SOS-8 Workshop April 14, 2004
2 Motivation: CPU Efficiency Trend
While CPU performance has been increasing, processing efficiency has been decreasing.
Efficiency: MFLOPS/MHz/Mtransistors
[Figure: efficiency plotted by processor]
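As a minimal sketch of the efficiency metric named on this slide (MFLOPS per MHz per million transistors), the helper below computes it for two made-up processors. The function name `efficiency` and all numbers are illustrative assumptions, not data from the talk.

```python
def efficiency(mflops: float, clock_mhz: float, transistors_millions: float) -> float:
    """Return MFLOPS per MHz per million transistors."""
    if clock_mhz <= 0 or transistors_millions <= 0:
        raise ValueError("clock and transistor count must be positive")
    return mflops / clock_mhz / transistors_millions


# Hypothetical processors: illustrative numbers only, not measurements.
older = efficiency(mflops=100.0, clock_mhz=100.0, transistors_millions=2.0)
newer = efficiency(mflops=4000.0, clock_mhz=2000.0, transistors_millions=50.0)

# The newer part delivers far more MFLOPS, yet scores lower on this metric.
print(f"older: {older:.3f}, newer: {newer:.3f}")
```

Under these assumed numbers, raw performance rises 40x while the metric falls, which is the shape of the trend the slide describes.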
3 Looking Ahead
- For commodity clusters, should we be nervous?
- Significant increases in technology effort
- Diminishing returns
- Should we depend on CPU manufacturers for HPC?
- Sandia has many HPC interests
- Investigate computing alternatives and accelerators
- FPGAs: Modern Reconfigurable Computing
4 Outline
- Reconfigurable computing
- Use FPGAs to accelerate computations
- Strategy and examples
- Approaches to scientific computing
- Challenges for ultrascale science
- Double-precision floating-point performance
- System integration and network aspects
5 Reconfigurable Computing: Background
6 Computing Spectrum
7 Reconfigurable Hardware Devices
Devices that can be programmed to emulate
hardware circuitry
- Tile architecture
- Logic blocks (LBs)
- Routing elements
- Field-Programmable Gate Arrays
- Fine granularity
- LBs are bit-level operators
- Commercial trend
- Coarse granularity
- LBs are ALUs, FPUs
- QuickSilver, Pact XPP, ClearSpeed
8 Common Acceleration Techniques
Key: Designing in Hardware
- Processing concurrency
- Hardware pipelines
- Custom memory interactions
- Partial evaluation
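Partial evaluation, the last technique listed above, can be illustrated in software. The sketch below specializes a multiply by a known constant into a fixed set of shifts and adds, which is roughly what happens when a constant-coefficient multiplier is built in hardware instead of a general one. The function name `specialize_multiply` is hypothetical, and this is an analogy in Python, not the talk's own design flow.

```python
from typing import Callable


def specialize_multiply(constant: int) -> Callable[[int], int]:
    """Partially evaluate x * constant for a fixed, non-negative constant.

    The set bits of the constant are found once, up front. The returned
    function only performs the shifts and adds those bits require.
    """
    if constant < 0:
        raise ValueError("constant must be non-negative")
    shifts = [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]

    def multiply(x: int) -> int:
        return sum(x << shift for shift in shifts)

    return multiply


# 10 is 0b1010, so x * 10 becomes (x << 1) + (x << 3).
times_ten = specialize_multiply(10)
print(times_ten(7))
```

The general multiplier disappears: only the work that depends on the runtime input remains, which is the resource saving partial evaluation aims for.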
9 Reconfigurable Computing for Ultrascale Science: HPC Strategy and Examples
- Enhancing HPC Performance
10 HPC Strategy at Sandia for RC
- RC resources work best as accelerators in HPC
- Clusters are inexpensive and work well for many applications
- Add RC devices to enhance performance
- Port key portions of algorithms to RC hardware
- Focus on hotspots and inner loops
- Move data to/from FPGAs in pipelined fashion
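The emphasis on hotspots and inner loops can be made concrete with Amdahl's law, which bounds the overall gain from accelerating only part of a program. The sketch below is a generic illustration; the function name and the example fractions are assumptions, not figures from the talk.

```python
def overall_speedup(hotspot_fraction: float, hotspot_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only a fraction is accelerated.

    hotspot_fraction: share of original runtime spent in the offloaded code (0..1).
    hotspot_speedup: how much faster that code runs on the accelerator.
    """
    if not 0.0 <= hotspot_fraction <= 1.0:
        raise ValueError("hotspot_fraction must be between 0 and 1")
    if hotspot_speedup <= 0.0:
        raise ValueError("hotspot_speedup must be positive")
    remaining = (1.0 - hotspot_fraction) + hotspot_fraction / hotspot_speedup
    return 1.0 / remaining


# Hypothetical application: 90% of runtime sits in one inner loop.
for accel in (10.0, 100.0):
    print(f"hotspot {accel:.0f}x faster -> overall {overall_speedup(0.9, accel):.2f}x")
```

With 90% of the runtime offloaded, the overall gain can never exceed 10x no matter how fast the hardware is, so choosing which loops to port matters as much as how fast the FPGA runs them.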
11 Scientific Computing Examples
- Pattern recognition
- ATLAS project at CERN
- Reduced 2500 CPUs to 120 nodes with FPGAs
- Visualization
- Vizard II project at University of Tübingen
- Direct volume rendering for 512^3 datasets
- Molecular dynamics (MD)
- Preliminary work at Los Alamos National Laboratory
- 20 cells in an FPGA yield 5.69 GFLOPS
- Computational fluid dynamics (CFD) analysis for jet engines
- Smith and Schnore at GE Global Research
12 Challenges
- Hard to program
- Hardware design
- Must be significant parallelism
- Limited chip capacity
- Lack of HPC building blocks
- Our users need DP-FP
- System integration
- How do we add to our clusters?
13 Reconfigurable Computing for Ultrascale Science: Double-Precision Floating-Point Cores
- Addressing the need for HPC building blocks
14 Double-Precision Floating-Point Cores
- Floating point has been a historical weakness for FPGAs
- FP cores consume significant amounts of hardware
- Previous FPGAs lacked capacity
- Significant improvements in recent commercial FPGAs
- Increased capacity, faster clocks, and better building blocks
- Keith Underwood at SNL/NM
- Re-evaluating FP performance in FPGAs
- Constructing high-speed DP-FP cores
15 Peak Performance Results
From Underwood, "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance," in FPGA '04
16 Double-Precision Multiply Performance Trends
17 Reconfigurable Computing for Ultrascale Science: Networking Aspects
- Addressing capacity and system integration issues
18 Data Exchange: Multi-Gigabit Transceivers (MGTs)
- How do we rapidly move data into/out of FPGA?
- Xilinx Virtex-II/Pro FPGA has MGTs
- Channel data rate: 3.125 Gbps
- Up to 24 channels
- V2/ProX: twenty 10 Gbps channels
- Configured for different physical layers
- InfiniBand, FC, GigE, 10GigE
- S-ATA, PCI-Express, HT
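The channel counts and line rates above imply an aggregate per-chip bandwidth, which the short calculation below works out. The `coding_efficiency` parameter is an assumption added for illustration: the slide gives raw line rates only, and the payload fraction depends on the physical layer chosen (for example, 8b/10b line coding carries 8 payload bits per 10 line bits).

```python
def aggregate_gbps(channels: int, line_rate_gbps: float,
                   coding_efficiency: float = 1.0) -> float:
    """Aggregate one-direction bandwidth across all transceiver channels.

    coding_efficiency is the assumed payload fraction of the line rate
    (1.0 means raw line rate, 0.8 models 8b/10b line coding).
    """
    if channels < 0 or line_rate_gbps < 0:
        raise ValueError("channels and line rate must be non-negative")
    if not 0.0 < coding_efficiency <= 1.0:
        raise ValueError("coding_efficiency must be in (0, 1]")
    return channels * line_rate_gbps * coding_efficiency


# Raw figures taken directly from the slide.
raw_24_channels = aggregate_gbps(24, 3.125)
raw_20_channels = aggregate_gbps(20, 10.0)

# Payload estimate under an assumed 8b/10b coding (not stated in the talk).
payload_24_channels = aggregate_gbps(24, 3.125, coding_efficiency=0.8)

print(f"24 x 3.125 Gbps: {raw_24_channels:.1f} Gbps raw")
print(f"20 x 10 Gbps:    {raw_20_channels:.1f} Gbps raw")
print(f"24 x 3.125 Gbps: {payload_24_channels:.1f} Gbps payload (assumed 8b/10b)")
```

The raw totals come to 75 Gbps and 200 Gbps per direction, which gives a sense of scale for the "fat pipes" description of MGTs.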
19 Importance of MGTs
- Increase Raw Capacity
- Connect FPGAs together
- MGTs provide fat pipes
- Cables, not PCB traces
- System Integration
- Connect FPGA to SAN
- Implement NI in FPGA
- FPGA is global resource
20 Recent Sandia Work: SNL OpenTOE
- At Sandia we are interested in connecting FPGAs to SANs
- Main target: InfiniBand
- Must implement network protocols for reliable transfer
- Initial work: GigE and TCP
- Implemented GigE core and basic TCP offload engine
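The slides do not say which TCP functions the offload engine handles. One operation that TCP requires on every segment, and that is a common candidate for hardware offload, is the 16-bit one's-complement Internet checksum (RFC 1071). The sketch below is a plain software reference for that checksum only, offered as an illustration; it is not the OpenTOE implementation.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words

    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    return ~total & 0xFFFF


# Worked example bytes from RFC 1071, section 3.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(sample)
print(f"checksum = 0x{checksum:04x}")
```

A receiver can verify a segment by running the same routine over the data with the checksum appended and checking that the result is zero. The word-wise add-and-fold structure maps naturally onto a hardware pipeline.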
21 Concluding Remarks
- Improvements in commercial FPGAs make RC attractive
- FPGAs provide better sustained performance than CPUs
- FPGA performance growing faster than Moore's Law
- Near-term strategy: accelerator-based approach
- Offload key operations into hardware
- Sandia National Labs investigating RC for HPC acceleration
- Enabling scientific computing through fast DP-FP cores
- Addressing system integration/capacity issues via network