Title: RC Device Characterizations
1 RC Device Characterizations Tradeoff Analysis
2 Introduction
- Reconfigurable Computing (RC) is an emerging field that utilizes devices with a programmable fabric, allowing the hardware to be configured and adapted to solve changing problems
- RC systems have typically been built using Field Programmable Gate Arrays (FPGAs), but other architectures could also implement RC systems, such as Field Programmable Object Arrays (FPOAs) and Field Programmable Compute Arrays (FPCAs, e.g. MONARCH)
3 Subject and Purpose
- Subject
- To survey the landscape of various RC devices
- Characterize these devices using various metrics (performance, price, power)
- Create a comparison framework using the characterizations
- Purpose
- Will give the end user a quantitative framework to aid in the selection of an appropriate RC device to meet their application needs
- Lays groundwork for understanding performance impacts of architectural components
4 Problem Definition
- Problems
- RC devices can be vastly different from one another
- Various architectural differences and very few standard/common parameters
- Memory example: Xilinx BRAM vs. Altera M-RAM/M4K/M512 vs. FPOA RF/IRAM vs. CPU cache
- RC devices differ from traditional microprocessors
- Typically slower clock rates
- Potential for massive parallelism
- Different power consumption trends
- Different on-die memory configurations
- All of these differences make direct device comparisons difficult
5 Problem Background
- Users have a variety of requirements/concerns. What key parameters do we need to compare?
- Computational performance (integer/fixed point, floating point, fine grained/bit level)
- On-chip memory performance (latency, bandwidth)
- Off-chip communications and I/O
- Power consumption
- Price
6 Scope Statement
- Devices to be included in study
- Xilinx Virtex 4 LX200, LX100, SX55
- Altera Stratix II S180
- Freescale PowerPC MPC7447 AltiVec
- MathStar Arrix FPOA (1 GHz)
- Raytheon Monarch PCA
- Sony/Toshiba/IBM Cell
7 Methods
- Literature review
- Apply and extend characterizations and metrics to devices under study
- Datasheet analysis
- Experiments using vendor development tools/simulation environments
- Example: utilization and timing analysis results from post place and route for common ALU/FP structures
- Combine characterization study results into a QFD style matrix
8 FPGA Theoretical Floating Point Performance
- Methodology
- Adapted from Jeff Mason's (Xilinx) presentation at RSSI '07, "FPGA HPC: The road beyond processors", with input from Dave Strenski (Cray). Similar methodology also reported in "An overview of FPGAs and FPGA programming: Initial experiences at Daresbury", Richard Wain, Ian Bush, Martyn Guest, Miles Deegan, Igor Kozin and Christine Kitchen. November 2006. Distributed Computing Group at Daresbury Laboratory.
- Using datasheet information, Altera and Xilinx Floating Point cores, ISE and Quartus, estimate FP add and FP multiply performance.
9 FPGA Floating Point Performance
- Xilinx Example
- Data from Virtex 4 Family Overview (DS112) and Coregen Floating Point Operator v3.0 (DS335)
- Assumptions
- 15% slice overhead (routing, I/O, etc.)
- Use DSP resources first, then logic only implementation to fill device.
- Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
- Assume 2 storage elements (BRAM) per operation (operands, overwrite with result). Limit the number of operations if there is not enough BRAM to support them.
- Use speed optimized, highest effort for Synthesis, Map, PAR.
10 FPGA Floating Point Performance
- Xilinx Example Continued (LX200 -10)
- Double Precision Floating Point Multiply
- 96 / 16 = 6 DSP Multipliers
- 151449 - (774 × 6) = 146805 remaining LUT for Logic Multipliers
- 146805 / 2457 = 59 Logic Only Multipliers (rounded down)
- 65 total multipliers in 1 context at 185 MHz = 12 Gflop/s
- Limit total number of multipliers to 85 due to BRAM limitation: 11.1 Gflop/s
- LX100 has 336 18Kb dual port BRAM. For 64-bit (DP), ((336 × 2) / 4) / 2, about 85 function units
Per Instance         DSP Implementation  Logic Only Implementation  Device Maximum (less 15% LUT for overhead)
Max Frequency (MHz)  303                 185                        500
DSPs Used            16                  0                          96
LUTs Used            550                 2311                       178176 (151449)
FF Used              774                 2457                       178176 (151449)
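The LX200 multiplier count and peak rate worked through on this slide can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' tooling: the function and parameter names are hypothetical, and it follows the slide in budgeting by the flip-flop figures (774 and 2457) against the 151449 cells left after the overhead allowance. The BRAM cap is deliberately left out here.

```python
def peak_fp_units(dsp_total, dsp_per_unit, cells_avail,
                  cells_per_dsp_unit, cells_per_logic_unit):
    """Count functional units: fill DSP blocks first, then leftover logic."""
    dsp_units = dsp_total // dsp_per_unit
    remaining = cells_avail - dsp_units * cells_per_dsp_unit
    logic_units = remaining // cells_per_logic_unit
    return dsp_units, logic_units


def peak_gflops(units, clock_mhz):
    """Assume one result per unit per cycle; report Gflop/s."""
    return units * clock_mhz / 1000.0


# Values taken from the LX200 double precision multiply example.
dsp_units, logic_units = peak_fp_units(96, 16, 151449, 774, 2457)
total_units = dsp_units + logic_units
clock_mhz = min(303, 185)  # lower of the DSP and logic-only clocks
print(dsp_units, logic_units, total_units)
print(round(peak_gflops(total_units, clock_mhz), 1))
```

The integer division mirrors the rounding down used on the slide, and the `min` call applies the "lower of the two clock speeds" assumption.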
11 Theoretical Floating Point Performance
- Methodology
- FPOA floating point performance is reported as 0. This device could have a floating point core designed for it, but its architecture (16 bit ALUs) would not implement FP efficiently.
- PowerPC, AltiVec, MONARCH, and Cell floating point performance numbers are available/derivable from their respective datasheets
12 Floating Point Performance Results
13 Floating Point Performance Results
14 Floating Point Performance Results
[Chart: Theoretical Floating Point Performance (GFlops, BRAM Limitation)]
[Chart: Theoretical Floating Point Performance (GFlops, No BRAM Limitation)]
15 Floating Point Conclusions
- For FPGAs, floating point performance depends on the FP core implementation. This impacts resource utilization and maximum achievable frequency.
- For Xilinx devices, available on-chip memory also greatly impacts performance if we assume there has to be enough on-chip memory to buffer operands and results. Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
- Xilinx adder cores can use on-chip DSP resources; Altera adder cores do not.
- MONARCH only supports single precision floating point.
- Cell is the clear leader in theoretical floating point performance (using all processing elements).
16 Theoretical Integer Performance
- Utilize same basic methodology as Floating Point Performance Comparison
- 15% slice overhead (routing, I/O, etc.).
- Use DSP resources first, then logic only implementation to fill device.
- Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
- Use vendor software (Quartus, ISE) to find resource utilization for 1 functional unit. Calculate the number of parallel functional units that fit in 1 context using datasheet values.
- Assume 2 storage elements (BRAM) per functional unit (operands, overwrite with result). Limit the number of parallel functional units if there is not enough BRAM to support 2 storage elements per functional unit.
- Use speed optimized, highest effort for Synthesis, Map, PAR.
- Use standard integer widths (32 bit and 16 bit).
- Analyze Addition and Multiplication operations separately.
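The clock-selection and BRAM-cap rules in this methodology can be sketched as follows. The helper names are hypothetical. The clock figures are the 32 bit multiply results quoted in the integer conclusions of this deck. The unit count and the number of storage elements are made-up placeholders, since how many storage elements a device provides for a given operand width is not spelled out here.

```python
def effective_clock_mhz(dsp_mhz, logic_mhz):
    """A mixed DSP + logic design runs at the slower of the two clocks."""
    return min(dsp_mhz, logic_mhz)


def units_after_bram_cap(units_that_fit, storage_elements, per_unit=2):
    """Cap parallel units so each one gets `per_unit` storage elements."""
    return min(units_that_fit, storage_elements // per_unit)


def peak_gops(units, clock_mhz):
    """Assume one operation per unit per cycle; report GOP/s."""
    return units * clock_mhz / 1000.0


ise_clock = effective_clock_mhz(421, 249)      # ISE: DSP vs. logic only
quartus_clock = effective_clock_mhz(500, 500)  # Quartus: both at 500 MHz

# Placeholder counts (assumptions, not datasheet values).
units = units_after_bram_cap(units_that_fit=200, storage_elements=336)
print(ise_clock, quartus_clock, units)
print(peak_gops(units, ise_clock))
```

Running the same calculation with and without the cap gives the two variants (BRAM limitation, no BRAM limitation) that the results charts report.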
17 Theoretical Integer Performance
- Methodology
- FPOA 32 bit integer performance is reported as 0. This device could have a 32 bit ALU core designed for it, but it is natively a 16 bit device.
- PowerPC, AltiVec, MONARCH, and Cell integer performance numbers are available/derivable from their respective datasheets
18 Integer Performance Results
19 Integer Performance Results
20 Integer Performance Results
[Chart: Theoretical Integer Performance (GOPs, BRAM Limitation)]
[Chart: Theoretical Integer Performance (GOPs, No BRAM Limitation)]
21 Integer Performance Conclusions
- In some cases, BRAM limitation is again an important performance limiter for Xilinx devices. Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
- Quartus II 6.0 typically reports higher maximum achievable frequency for post place and route timing analysis versus ISE 9.2.
- Used speed grade -10 for Virtex 4 devices.
- Used speed grade -3 for Stratix II device.
- 32 bit multiply example: Quartus reports 500 MHz for both DSP and Logic Only implementations, ISE reports 421 MHz for DSP, 249 MHz for Logic Only.
- Xilinx adder cores can use on-chip DSP resources, which could improve add performance if there were enough memory support. Altera adder cores do not support DSP utilization and therefore suffer a performance hit compared to Xilinx devices.
- Without the BRAM limitation, Xilinx devices show the highest performance for Integer Add operations.
- With the BRAM limitation, the FPOA has the highest 16 bit integer performance.
- Cell has the highest 32 bit integer performance (using all processing elements).
22 Bit-level Computational Performance
- Methodology
- Based on DeHon's Computational Density calculations
- Computational Density
- Normalizes performance by die (or package) area and minimum feature size/process technology
- Bit operations for FPGAs are the number of 4 input LUTs
- Bit operations for GPP and other hybrid devices are based on number of cores, number of issued instructions, and width of ALU/Functional Units
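One way to express this normalization in code is shown below. It is a sketch under stated assumptions: area is converted to units of λ², with λ taken as half the minimum feature size, which is an assumption about the normalization rather than something the slide states. The function names, die areas, clock rates and the processor configuration are hypothetical placeholders. Only the 178176 LUT count appears elsewhere in this deck.

```python
def area_in_lambda_sq(die_area_mm2, feature_size_nm):
    """Die area in units of lambda^2 (lambda assumed = feature size / 2)."""
    lam_nm = feature_size_nm / 2.0
    area_nm2 = die_area_mm2 * 1e12  # 1 mm^2 = 1e12 nm^2
    return area_nm2 / (lam_nm ** 2)


def computational_density(bit_ops_per_cycle, clock_hz,
                          die_area_mm2, feature_size_nm):
    """Bit operations per second per lambda^2 of silicon."""
    area = area_in_lambda_sq(die_area_mm2, feature_size_nm)
    return bit_ops_per_cycle * clock_hz / area


def processor_bit_ops(cores, issue_width, alu_width_bits):
    """Bit operations per cycle for a GPP-style device."""
    return cores * issue_width * alu_width_bits


# FPGA: bit operations per cycle = number of 4 input LUTs.
# Area, clock and feature size below are placeholders.
fpga = computational_density(178176, 500e6, 400.0, 90.0)
# Hypothetical processor: 1 core, 4 instructions issued, 32 bit ALUs.
gpp = computational_density(processor_bit_ops(1, 4, 32), 1e9, 100.0, 90.0)
print(fpga > gpp)
```

Because both devices are divided by area in λ², parts built on different process nodes can be compared on the same scale.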
23 Bit-level Computational Performance
- As expected, fine-grained FPGAs dominate
performance in this metric
24 External Memory Bandwidth Methodology
- Methodology varies by platform due to available information and architecture differences.
- In all cases, choose maximum throughput available based on vendor IP for memory controllers.
- Saturated Case uses maximum amount of I/O for external memory interface; Balanced Case assumes a balance of I/O and memory interface.
- Altera Stratix II
- Influenced by speed grade, number of I/O
- Used new high performance ALTMEMPHY core (vs. legacy memory interface core)
- Support for 333 MHz DDR2 RAM
- Number of controllers limited by the number of on-chip delay-locked loops (2)
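A peak external bandwidth figure for a DDR-style interface follows from the controller count, the memory clock and the bus width. The sketch below is illustrative only. The 333 MHz clock and the limit of 2 controllers come from the slide, while the 64 bit bus width and the function name are assumptions.

```python
def ddr_bandwidth_gbytes(controllers, clock_mhz, bus_width_bits):
    """Peak bandwidth in GB/s; DDR moves two transfers per clock cycle."""
    transfers_per_s = clock_mhz * 1e6 * 2
    bytes_per_transfer = bus_width_bits / 8
    return controllers * transfers_per_s * bytes_per_transfer / 1e9


# 2 controllers at 333 MHz (from the slide), 64 bit bus (assumed).
print(ddr_bandwidth_gbytes(controllers=2, clock_mhz=333, bus_width_bits=64))
```

The same function shows why a slower controller can still come out ahead: doubling `bus_width_bits` outweighs the step from 220 MHz to 333 MHz.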
25 External Memory Bandwidth Methodology
- Xilinx Virtex 4
- Influenced by speed grade, number of I/O
- Memory Interface Generator v1.73 (Coregen) forces use of slower Direct Clocking to support multiple banks vs. SERDES strobe implementation; for -10 speed grade, maximum frequency is 220-240 MHz (depending on bus width)
- MathStar FPOA
- Datasheet information for total external memory interface bandwidth (RLDRAM II)
- Cell
- External Memory Bandwidth (Rambus XDRAM) reported in presentation "Introduction to the Cell Processor" from Dr. Michael Perrone (IBM)
- MONARCH
- External Memory Bandwidth (DDR2) reported in presentation "World's First Polymorphic Computer: MONARCH" from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
26 External Memory Bandwidth Results
27 External Memory Bandwidth Conclusions
- External Memory Bandwidth is important to prevent a data bottleneck into the device.
- For FPGAs, the type and speed of external memory supported depends on the family and speed grade of the device.
- In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between saturated and balanced.
- Stratix II S180 and Virtex 4 SX55 configurations support 2 simultaneous controllers; Virtex 4 LX100 and LX200 support 3 simultaneous controllers, which shows in the performance difference for the saturated case.
- Although the Stratix II controller supports faster DDR2 RAM (333 MHz vs. 220 MHz in this configuration), Virtex 4 SX55 has higher bandwidth due to support for a wider bus.
- Xilinx claims higher bandwidth on its website, but assumes a wider bus than existing memories.
- For the balanced case, Cell is the performance leader, primarily due to its specialized RAM format (XDRAM).
28 I/O Bandwidth Methodology
- Methodology varies by platform due to available information and architecture differences.
- In all cases, choose the maximum throughput protocol/signaling level available.
- Saturated Case uses maximum amount of I/O for I/O interface; Balanced Case assumes a balance of I/O and 1 memory interface.
- Altera Stratix II
- Datasheet information for concurrent receive pairs and transmit pairs at 1.040 Gb/s per pair.
- Xilinx Virtex 4
- Datasheet information for concurrent receive pairs and transmit pairs at 1 Gb/s per pair.
- MathStar FPOA
- Datasheet information for concurrent total transmit and receive bandwidth.
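For the FPGA devices, the I/O figure reduces to counting differential pairs and multiplying by the per-pair rate. A minimal sketch follows. The per-pair rates (1.040 Gb/s and 1 Gb/s) are from the slide, while the pair counts and function names are placeholders rather than datasheet values.

```python
def pair_bandwidth_gbits(rx_pairs, tx_pairs, gbits_per_pair):
    """Aggregate rate when all receive and transmit pairs run concurrently."""
    return (rx_pairs + tx_pairs) * gbits_per_pair


def gbits_to_gbytes(gbits):
    """Convert Gb/s to GB/s."""
    return gbits / 8.0


# Pair counts are hypothetical; substitute the datasheet numbers.
stratix = pair_bandwidth_gbits(rx_pairs=100, tx_pairs=100, gbits_per_pair=1.040)
virtex = pair_bandwidth_gbits(rx_pairs=100, tx_pairs=100, gbits_per_pair=1.0)
print(gbits_to_gbytes(stratix), gbits_to_gbytes(virtex))
```

A balanced-case figure would use the pair count left over after the pins for one memory interface are set aside.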
29 I/O Bandwidth Methodology
- Cell
- I/O Bandwidth reported in presentation "Introduction to the Cell Processor" from Dr. Michael Perrone (IBM)
- MONARCH
- I/O Bandwidth reported in presentation "World's First Polymorphic Computer: MONARCH" from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
30 I/O Bandwidth Results
31 I/O Bandwidth Conclusions
- I/O Bandwidth is important to prevent I/O and data bottlenecks.
- In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between saturated and balanced.
- All devices except for FPOA have at least 40 GB/s throughput.
- FPGAs are shown in both fully utilized and balanced cases.
- Stratix II uses separate I/O for single ended memory interface and differential pairs, so there is no distinction between saturated and balanced cases.
- Cell has the highest I/O performance for both cases.
32 Internal Device Memory Bandwidth
- Methodology
- FPGAs
- Xilinx: all BRAMs are the same; calculation: number of BRAMs × port width × number of ports × memory access frequency
- Altera: 3 levels of internal memory hierarchy; calculation similar to above for all levels of hierarchy
- FPOA: similar to above with 2 levels of memory hierarchy (Register File and Internal RAM)
- GPP: bus width × frequency × ports
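The internal bandwidth product described above, summed over the levels of a memory hierarchy, can be sketched like this. The function names are hypothetical, and the port width, clock and hierarchy entries are placeholder values chosen only to exercise the formula.

```python
def level_bandwidth_gbytes(blocks, port_width_bits, ports, clock_mhz):
    """blocks x port width x ports x access frequency, in GB/s."""
    bits_per_s = blocks * port_width_bits * ports * clock_mhz * 1e6
    return bits_per_s / 8 / 1e9


def hierarchy_bandwidth_gbytes(levels):
    """Sum the bandwidth of every level (for multi-level devices)."""
    return sum(level_bandwidth_gbytes(*level) for level in levels)


# Single-level case: identical block RAMs (placeholder width and clock).
single = level_bandwidth_gbytes(blocks=336, port_width_bits=36,
                                ports=2, clock_mhz=500)

# Multi-level case with made-up (blocks, width, ports, MHz) per level.
multi = hierarchy_bandwidth_gbytes([(100, 18, 2, 400), (10, 144, 2, 400)])
print(single, multi)
```

The product grows with the number of independently addressable blocks, which is where devices with many small memories gain their advantage.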
33 Internal Memory Bandwidth
- The large number of parallel accesses gives FPGAs the advantage in this metric
34 Device Characterization Matrix
- Goal: enable comparison of different devices on key parameters
- Tie all device characterizations into unifying framework
- User weights allow adjustment to specific application needs
- Scores quickly show comparison results based on input weights
- Approach
- Scale each characterization study from 1 to 10
- Generate weighted average score for each device taking into account user weights
- Justification
- Significant architectural differences have historically made these devices difficult to compare
- Single-Precision Floating-Point scaling example
- Use min and max values to scale from 1 to 10
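The scaling and weighted-average steps might look like the following. This is a sketch, assuming a linear min-max mapping onto 1 to 10 and an inverted scale for metrics where lower is better (such as power or cost). Neither detail is confirmed by the slides, and the device names and raw values are placeholders.

```python
def scale_1_to_10(values, higher_is_better=True):
    """Linearly map raw metric values onto a 1..10 scale via min and max."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {dev: 10.0 for dev in values}
    scaled = {}
    for dev, v in values.items():
        frac = (v - lo) / (hi - lo)
        if not higher_is_better:
            frac = 1.0 - frac
        scaled[dev] = 1.0 + 9.0 * frac
    return scaled


def weighted_scores(scaled_by_metric, weights):
    """Weighted average of the scaled metrics for each device."""
    total_weight = sum(weights.values())
    devices = next(iter(scaled_by_metric.values())).keys()
    return {
        dev: sum(weights[m] * scaled_by_metric[m][dev] for m in weights)
        / total_weight
        for dev in devices
    }


# Placeholder devices and raw values (not results from this study).
perf = scale_1_to_10({"A": 0.0, "B": 5.0, "C": 10.0})
power = scale_1_to_10({"A": 30.0, "B": 10.0, "C": 20.0},
                      higher_is_better=False)
scores = weighted_scores({"perf": perf, "power": power},
                         {"perf": 5, "power": 10})
print(max(scores, key=scores.get))
```

Changing the weight dictionary is all that is needed to re-rank the devices for a different application profile.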
35 Device Characterization Matrix
- Examples with other weights
- Power and cost (10), internal and external memory BW (5), 16-bit integer performance (7)
- FPOA and V4SX55 lead
- DP FP performance (5), power (10)
- Stratix-II S180 and V4LX200 lead
- External I/O BW (10), power (10), cost (10)
- MONARCH and Cell lead
36 References
- DeHon, A. The Density Advantage of Configurable Computing. Computer, vol. 33, no. 4, pp. 41-49, Apr 2000.
- DeHon, A. Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586, Massachusetts Institute of Technology, 1996.
- Compton, K. and Hauck, S. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv. 34, 2 (Jun. 2002), 171-210.
- Memory Bandwidth, http://en.wikipedia.org/wiki/Memory_bandwidth.
- Mason, J. FPGA HPC: The road beyond processors. Xilinx Corporation. RSSI 2007.
- Wain, R., Bush, I., Guest, M., Deegan, M., Kozin, I. and Kitchen, C. An overview of FPGAs and FPGA programming: Initial experiences at Daresbury. November 2006. Distributed Computing Group at Daresbury Laboratory.
- Bolsens, I. Programming Modern FPGAs. Xilinx Corporation. MPSOC August, 2006.
- Underwood, K. 2004. FPGAs vs. CPUs: trends in peak floating-point performance. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (Monterey, California, USA, February 22-24, 2004). FPGA '04. ACM Press, New York, NY, 171-180.
- HPEC Challenge Benchmarks. http://www.ll.mit.edu/HPECchallenge.
- Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Virtex-4 Family Overview (DS112), January 23, 2007.
- Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Floating-Point Operator v3.0 (DS335). September 28, 2006.
- Perrone, M. (IBM). Introduction to the Cell Processor.
- Prager, K., Lewis, L., Vahey, M. and Groves, G. (Raytheon). World's First Polymorphic Computer: MONARCH.
- Strenski, Dave. FPGA Floating Point Performance: a pencil and paper evaluation. http://www.hpcwire.com/hpc/1195762.html.
- Strenski, Dave. 2006. Computational Bottlenecks and Hardware Decisions for FPGAs. FPGA and Structured ASIC Journal.
- Altera Corporation. 101 Innovation Drive, San Jose, CA 95134. Stratix II Device Handbook v 4.3, May 2007.
- Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. MPC7450 RISC Microprocessor Family Reference Manual, Rev. 5. January 2005.
- Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. AltiVec Technology Programming Environments Manual, Rev. 3. April 2006.
- MathStar Corporation. 19075 NW Tanasbourne Dr. Suite 200, Hillsboro, OR 97124. Arrix Family Product Brief, August 2006.