RC Device Characterizations - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: RC Device Characterizations


1
RC Device Characterizations & Tradeoff Analysis
  • Jason Williams

2
Introduction
  • Reconfigurable Computing (RC) is an emerging
    field that utilizes devices with a programmable
    fabric, allowing the hardware to be configured and
    adapted to solve changing problems.
  • RC systems have typically been built using Field
    Programmable Gate Arrays (FPGAs), but there are
    other architectures that could implement RC
    systems, such as Field Programmable Object Arrays
    (FPOAs) and Field Programmable Compute Arrays
    (FPCAs, e.g. MONARCH).

3
Subject & Purpose
  • Subject
  • To survey the landscape of various RC devices
  • Characterize these devices using various metrics
    (performance, price, power)
  • Create a comparison framework using the
    characterizations
  • Purpose
  • Will give the end user a quantitative framework
    to aid in the selection of an appropriate RC
    device to meet their application needs
  • Lays groundwork for understanding performance
    impacts of architectural components

4
Problem Definition
  • Problems
  • RC devices can be vastly different from one
    another
  • Various architectural differences and very few
    standard/common parameters
  • Memory example: Xilinx BRAM vs. Altera
    M-RAM/M4K/M512 vs. FPOA RF/IRAM vs. CPU cache
  • RC devices differ from traditional
    microprocessors
  • Typically slower clock rates
  • Potential for massive parallelism
  • Different power consumption trends
  • Different on-die memory configurations
  • All of these differences make direct device
    comparisons difficult

5
Problem Background
  • Users have a variety of requirements/concerns.
    What key parameters do we need to compare?
  • Computational performance (integer/fixed point,
    floating point, fine grained/bit level)
  • On-chip memory performance (latency, bandwidth)
  • Off-chip communications and I/O
  • Power consumption
  • Price

6
Scope Statement
  • Devices to be included in study
  • Xilinx Virtex 4 LX200, LX100, SX55
  • Altera Stratix II S180
  • Freescale PowerPC MPC7447 AltiVec
  • MathStar Arrix FPOA (1 GHz)
  • Raytheon Monarch PCA
  • Sony/Toshiba/IBM Cell

7
Methods
  • Literature review
  • Apply and extend characterizations and metrics to
    devices under study
  • Datasheet analysis
  • Experiments using vendor development
    tools/simulation environments
  • Example: utilization and timing analysis results
    from post-place-and-route for common ALU/FP
    structures
  • Combine characterization study results into a
    QFD-style matrix

8
FPGA Theoretical Floating Point Performance
  • Methodology
  • Adapted from Jeff Mason's (Xilinx) presentation
    at RSSI '07, "FPGA HPC: The road beyond
    processors", with input from Dave Strenski (Cray).
    A similar methodology is also reported in "An
    overview of FPGAs and FPGA programming: Initial
    experiences at Daresbury", Richard Wain, Ian Bush,
    Martyn Guest, Miles Deegan, Igor Kozin and
    Christine Kitchen, November 2006, Distributed
    Computing Group at Daresbury Laboratory.
  • Using datasheet information, Altera and Xilinx
    Floating Point cores, ISE and Quartus, estimate
    FP add and FP multiply performance.

9
FPGA Floating Point Performance
  • Xilinx Example
  • Data from Virtex 4 Family Overview (DS112) and
    Coregen Floating Point Operator v3.0 (DS335)
  • Assumptions
  • 15% slice overhead (routing, I/O, etc.)
  • Use DSP resources first, then logic only
    implementation to fill device.
  • Use lower of the two clock speeds for all
    calculations (DSP vs. Logic only).
  • Assume 2 storage elements (BRAM) per operation
    (operands, overwrite with result). Limit the
    number of operations if there is not enough BRAM
    to support.
  • Use speed optimized, highest effort for
    Synthesis, Map, PAR.

10
FPGA Floating Point Performance
  • Xilinx Example Continued (LX200, -10 speed grade)
  • Double Precision Floating Point Multiply
  • 96 / 16 = 6 DSP multipliers
  • 151449 - (774 × 6) = 146805 remaining LUTs for
    logic-only multipliers
  • 146805 / 2457 = 59 logic-only multipliers
  • 65 total multipliers in 1 context @ 185 MHz = 12
    Gflop/s
  • Limit the total number of multipliers to 85 due
    to the BRAM limitation: 11.1 Gflop/s
  • LX200 has 336 18Kb dual-port BRAMs. For 64-bit
    (DP), ((336 × 2) / 4) / 2 ≈ 85 function units (see
    the sketch after the table)

Per Instance          DSP Implementation   Logic-Only Implementation   Device Maximum (less 15% LUTs for overhead)
Max Frequency (MHz)   303                  185                         500
DSPs Used             16                   0                           96
LUTs Used             550                  2311                        178176 (151449)
FFs Used              774                  2457                        178176 (151449)
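  • A minimal back-of-the-envelope Python sketch of
    the LX200 (-10) double-precision multiply estimate
    above, reproducing the slide's arithmetic; using
    the larger of the per-instance LUT/FF figures from
    the table (774 and 2457) as the budgeting unit is
    an assumption of this sketch:

    # Back-of-the-envelope sketch of the LX200 DP-multiply estimate above.
    TOTAL_DSP        = 96       # DSP48 slices on the V4 LX200
    USABLE_RESOURCES = 151449   # 178176 LUTs/FFs less 15% overhead
    DSP_MULT_COST    = 774      # per DSP-based multiplier (16 DSPs each)
    LOGIC_MULT_COST  = 2457     # per logic-only multiplier
    FMAX_MHZ         = 185      # lower of the two reported clock speeds

    dsp_mults   = TOTAL_DSP // 16                               # 6
    remaining   = USABLE_RESOURCES - dsp_mults * DSP_MULT_COST  # 146805
    logic_mults = remaining // LOGIC_MULT_COST                  # 59
    total_mults = dsp_mults + logic_mults                       # 65
    print(total_mults, total_mults * FMAX_MHZ / 1000.0)         # 65, ~12.0 Gflop/s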
11
Theoretical Floating Point Performance
  • Methodology
  • FPOA floating point performance is reported as 0.
    This device could have a floating point core
    designed for it, but its architecture (16 bit
    ALUs) would not implement FP efficiently.
  • PowerPC, AltiVec, MONARCH, and Cell floating
    point performance numbers are available/derivable
    from their respective datasheets

12
Floating Point Performance Results
13
Floating Point Performance Results
14
Floating Point Performance Results
Theoretical Floating Point Performance (GFlops,
BRAM Limitation)
Theoretical Floating Point Performance (GFlops,
No BRAM Limitation)
15
Floating Point Conclusions
  • For FPGAs, floating point performance is dependent
    on the FP core implementation, which impacts
    resource utilization and maximum achievable
    frequency.
  • For Xilinx devices, available on-chip memory also
    greatly impacts performance if we assume there has
    to be enough on-chip memory to buffer operands and
    results. The Stratix II S180 has more on-chip RAM
    (1.5x the V4 LX200) and a more flexible memory
    hierarchy (a larger number of smaller blocks to
    support more individual registers, and higher
    device memory bandwidth), so it does not have this
    issue.
  • Xilinx adder cores can use on-chip DSP resources;
    Altera adder cores do not.
  • MONARCH only supports single precision floating
    point.
  • Cell is the clear leader in theoretical floating
    point performance (using all processing
    elements).

16
Theoretical Integer Performance
  • Utilize same basic methodology as Floating Point
    Performance Comparison
  • 15% slice overhead (routing, I/O, etc.).
  • Use DSP resources first, then logic only
    implementation to fill device.
  • Use lower of the two clock speeds for all
    calculations (DSP vs. Logic only).
  • Use vendor software (Quartus, ISE) to find
    resource utilization for 1 functional unit.
    Calculate the number of parallel functional units
    that fit in 1 context using datasheet values.
  • Assume 2 storage elements (BRAM) per functional
    unit (operands, overwrite with result). Limit
    the number of parallel functional units if there
    is not enough BRAM to support 2 storage elements
    per functional unit.
  • Use speed optimized, highest effort for
    Synthesis, Map, PAR.
  • Use standard integer widths (32-bit and 16-bit).
  • Analyze addition and multiplication operations
    separately (the counting scheme is sketched below).
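  • A minimal Python sketch of the counting scheme
    above. The 2-storage-elements-per-unit rule is
    from the slides; folding it and the operand-width
    accounting into a single bram_per_fu parameter is
    a simplifying assumption of this sketch:

    def count_functional_units(total_dsp, dsp_per_fu, usable_luts,
                               cost_dsp_fu, cost_logic_fu,
                               total_bram, bram_per_fu):
        # Fill the device with DSP-based units first, then logic-only units
        dsp_fus = total_dsp // dsp_per_fu if dsp_per_fu else 0
        logic_fus = max(usable_luts - dsp_fus * cost_dsp_fu, 0) // cost_logic_fu
        unconstrained = dsp_fus + logic_fus
        # Cap by on-chip memory: bram_per_fu = BRAMs needed per functional unit
        bram_limited = total_bram // bram_per_fu
        return unconstrained, min(unconstrained, bram_limited)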

17
Theoretical Integer Performance
  • Methodology
  • FPOA 32 bit integer performance is reported as 0.
    This device could have a 32 bit ALU core
    designed for it, but it is natively a 16 bit
    device.
  • PowerPC, AltiVec, MONARCH, and Cell integer
    performance numbers are available/derivable from
    their respective datasheets

18
Integer Performance Results
19
Integer Performance Results
20
Integer Performance Results
Theoretical Integer Performance (GOPs, BRAM
Limitation)
Theoretical Integer Performance (GOPs, No BRAM
Limitation)
21
Integer Performance Conclusions
  • In some cases, the BRAM limitation is again an
    important performance limiter for Xilinx devices.
    The Stratix II S180 has more on-chip RAM (1.5x the
    V4 LX200) and a more flexible memory hierarchy (a
    larger number of smaller blocks to support more
    individual registers, and higher device memory
    bandwidth), so it does not have this issue.
  • Quartus II 6.0 typically reports higher maximum
    achievable frequency for post place and route
    timing analysis versus ISE 9.2.
  • Used speed grade -10 for Virtex 4 devices.
  • Used speed grade -3 for the Stratix II device.
  • 32-bit multiply example: Quartus reports 500 MHz
    for both the DSP and Logic-Only implementations;
    ISE reports 421 MHz for DSP and 249 MHz for
    Logic-Only.
  • Xilinx adder cores can use on-chip DSP resources,
    which could improve add performance if there was
    enough memory support. Altera adder cores do not
    support DSP utilization and therefore suffer a
    performance hit compared to Xilinx devices.
  • Without the BRAM limitation, Xilinx devices show
    the highest performance for Integer Add
    operations.
  • With the BRAM limitation, the FPOA has the
    highest 16 bit integer performance.
  • Cell has the highest 32 bit integer performance
    (using all processing elements).

22
Bit-level Computational Performance
  • Methodology
  • Based on DeHon's computational density
    calculations
  • Computational Density
  • Normalizes performance by die (or package) area
    and minimum feature size/process technology
  • Bit operations for FPGAs are based on the number
    of 4-input LUTs
  • Bit operations for GPPs and other hybrid devices
    are based on the number of cores, number of issued
    instructions, and width of the ALU/functional
    units (a density sketch follows this list)
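  • A minimal Python sketch of a density-style figure
    of merit in the spirit of DeHon's metric; expressing
    area in units of the squared feature size captures
    the general idea, but the exact normalization used
    in the study may differ:

    def computational_density(bit_ops_per_cycle, clock_hz, area_mm2, feature_nm):
        # Normalize die (or package) area by the process feature size
        feature_mm = feature_nm * 1e-6
        area_in_lambda2 = area_mm2 / feature_mm ** 2
        # Peak bit operations per second per normalized unit of area
        return bit_ops_per_cycle * clock_hz / area_in_lambda2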

23
Bit-level Computation Performance
  • As expected, fine-grained FPGAs dominate
    performance in this metric

24
External Memory Bandwidth Methodology
  • Methodology varies by platform due to available
    information and architecture differences.
  • In all cases, choose maximum throughput available
    based on vendor IP for memory controllers.
  • The Saturated Case uses the maximum amount of I/O
    for the external memory interface; the Balanced
    Case assumes a balance of I/O and memory interface
    (the per-controller DDR bandwidth math is sketched
    after the per-device notes below).
  • Altera Stratix II
  • Influenced by speed grade, number of I/O
  • Used new high performance ALTMEMPHY core (vs.
    legacy memory interface core)
  • Support for 333 MHz DDR2 RAM
  • Number of controllers limited by the number of
    on-chip delay-locked loops (2)

25
External Memory Bandwidth Methodology
  • Xilinx Virtex 4
  • Influenced by speed grade, number of I/O
  • Memory Interface Generator v1.73 (Coregen) forces
    use of the slower Direct Clocking scheme to
    support multiple banks (vs. the SERDES strobe
    implementation); for the -10 speed grade the
    maximum frequency is 220-240 MHz (depending on bus
    width)
  • Mathstar FPOA
  • Datasheet information for total external memory
    interface bandwidth (RLDRAM II)
  • Cell
  • External Memory Bandwidth (Rambus XDRAM) reported
    in the presentation "Introduction to the Cell
    Processor" by Dr. Michael Perrone (IBM)
  • MONARCH
  • External Memory Bandwidth (DDR2) reported in the
    presentation "World's First Polymorphic Computer:
    MONARCH" by K. Prager, L. Lewis, M. Vahey, G.
    Groves (Raytheon)
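  • For the DDR-style interfaces, the per-device
    totals reduce to clock rate × 2 transfers per
    cycle × bus width × number of controllers. A
    minimal Python sketch; the bus width and
    controller count in the example call are
    illustrative assumptions, not the configurations
    used in the study:

    def ddr_bandwidth_gb_s(mem_clock_mhz, bus_bits, controllers):
        transfers_per_s = mem_clock_mhz * 1e6 * 2      # DDR: 2 transfers per clock
        return transfers_per_s * (bus_bits / 8) * controllers / 1e9

    # e.g. a 333 MHz DDR2 interface with a 64-bit bus and 2 controllers:
    # ddr_bandwidth_gb_s(333, 64, 2) -> ~10.7 GB/s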

26
External Memory Bandwidth Results
27
External Memory Bandwidth Conclusions
  • External memory bandwidth is important to prevent
    a data bottleneck into the device.
  • For FPGAs, the type and speed of external memory
    supported depends on the family and speed grade
    of the device.
  • In this study, non-FPGA devices have separate I/O
    and memory controllers/interfaces, so there is
    not a distinction between saturated and balanced.
  • Stratix II S180 and Virtex 4 SX55 configurations
    support 2 simultaneous controllers; Virtex 4 LX100
    and LX200 support 3 simultaneous controllers,
    which shows up in the performance difference for
    the saturated case.
  • Although Stratix II controller supports faster
    DDR2 RAM (333 MHz vs. 220 MHz in this
    configuration), Virtex 4 SX55 has higher
    bandwidth due to support for a wider bus.
  • Xilinx claims higher bandwidth on its website,
    which assumes a wider bus than existing memories
    provide.
  • For the balanced case, Cell is the performance
    leader, primarily due to specialized RAM format
    (XDRAM).

28
I/O Bandwidth Methodology
  • Methodology varies by platform due to available
    information and architecture differences.
  • In all cases, choose the maximum throughput
    available at the protocol/signaling level.
  • The Saturated Case uses the maximum amount of I/O
    for the I/O interface; the Balanced Case assumes a
    balance of I/O and 1 memory interface (the
    pair-count tally is sketched after this list).
  • Altera Stratix II
  • Datasheet information for concurrent receive
    pairs and transmit pairs @ 1.040 Gb/s per pair.
  • Xilinx Virtex 4
  • Datasheet information for concurrent receive
    pairs and transmit pairs @ 1 Gb/s per pair.
  • Mathstar FPOA
  • Datasheet information for concurrent total
    transmit and receive bandwidth.
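  • A minimal Python sketch of the serial-I/O tally
    described above; the pair counts in the example
    call are placeholders, not the datasheet values
    used in the study:

    def serial_io_gb_s(rx_pairs, tx_pairs, gb_s_per_pair):
        # Concurrent receive and transmit pairs times the per-pair line rate
        return (rx_pairs + tx_pairs) * gb_s_per_pair / 8.0   # Gb/s -> GB/s

    # e.g. 20 RX + 20 TX pairs at 1 Gb/s each: serial_io_gb_s(20, 20, 1.0) -> 5.0 GB/s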

29
I/O Bandwidth Methodology
  • Cell
  • I/O Bandwidth reported in the presentation
    "Introduction to the Cell Processor" by Dr.
    Michael Perrone (IBM)
  • MONARCH
  • I/O Bandwidth reported in the presentation
    "World's First Polymorphic Computer: MONARCH" by
    K. Prager, L. Lewis, M. Vahey, G. Groves
    (Raytheon)

30
I/O Bandwidth Results
31
I/O Bandwidth Conclusions
  • I/O bandwidth is important to prevent I/O and
    data bottlenecks.
  • In this study, non-FPGA devices have separate I/O
    and memory controllers/interfaces, so there is
    not a distinction between saturated and balanced.
  • All devices except for FPOA have at least 40 GB/s
    throughput.
  • FPGAs are shown in both fully utilized and
    balanced cases.
  • Stratix II uses separate I/O for the single-ended
    memory interface and the differential pairs, so
    there is no distinction between saturated and
    balanced cases.
  • Cell has the highest I/O performance for both
    cases.

32
Internal Device Memory Bandwidth
  • Methodology
  • FPGAs
  • Xilinx: all BRAMs are the same; calculation =
    number of BRAMs × port width × number of ports ×
    memory access frequency (formula sketched below)
  • Altera: 3 levels of internal memory hierarchy;
    calculation similar to the above for all levels of
    the hierarchy
  • FPOA: similar to the above with 2 levels of memory
    hierarchy (Register File and Internal RAM)
  • GPP: bus width × frequency × number of ports
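  • A minimal Python sketch of the aggregate
    internal-bandwidth formula above; the block count,
    port count, port width, and access frequency
    passed in are placeholders:

    def internal_bw_gb_s(num_blocks, ports_per_block, port_width_bits, access_mhz):
        # number of blocks x ports x port width x access frequency
        bits_per_second = num_blocks * ports_per_block * port_width_bits * access_mhz * 1e6
        return bits_per_second / 8 / 1e9

    # For a multi-level hierarchy (Altera, FPOA), sum the per-level results.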

33
Internal Memory Bandwidth
  • The large number of parallel accesses gives FPGAs
    the advantage in this metric

34
Device Characterization Matrix
  • Goal: enable comparison of different devices on
    key parameters
  • Tie all device characterizations into unifying
    framework
  • User weights allow adjustment to specific
    application needs
  • Scores quickly show comparison results based on
    input weights
  • Approach
  • Scale each characterization study from 1 to 10
  • Generate weighted average score for each device
    taking into account user weights
  • Justification
  • Significant architectural differences have
    historically made these devices difficult to
    compare
  • Single-Precision Floating-Point scaling example:
    use min and max values to scale from 1 to 10 (as
    sketched below)
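  • A minimal Python sketch of the scaling and
    weighted scoring described above; the metric names
    and weights used with it would be the user's
    inputs, not values from this study:

    def scale_1_to_10(value, vmin, vmax):
        # Map a raw characterization value onto a 1-10 scale using min/max
        if vmax == vmin:
            return 10.0
        return 1.0 + 9.0 * (value - vmin) / (vmax - vmin)

    def weighted_score(scaled, weights):
        # scaled and weights are dicts keyed by metric name (e.g. "SP GFlop/s")
        total = sum(weights.values())
        return sum(scaled[m] * weights[m] for m in weights) / total

  • Re-running the score with different weight sets
    reproduces the kind of sensitivity shown on the
    next slide.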

35
Device Characterization Matrix
  • Examples with other weights
  • Power & cost (10), internal & external memory BW
    (5), 16-bit integer performance (7)
  • FPOA & V4SX55 lead
  • DP FP performance (5), power (10)
  • Stratix-II S180 and V4LX200 lead
  • External I/O BW (10), power (10), cost (10)
  • MONARCH and Cell lead

36
References
  • DeHon, A. The Density Advantage of Configurable
    Computing. Computer , vol.33, no.4, pp.41-49, Apr
    2000.
  • DeHon, A. Reconfigurable Architectures for
    General-Purpose Computing. A.I. Technical Report
    No. 1586, Massachusetts Institute of Technology,
    1996.
  • Compton, K. and Hauck, S. Reconfigurable
    computing: a survey of systems and software. ACM
    Comput. Surv. 34, 2 (Jun. 2002), 171-210.
  • Memory Bandwidth,
    http://en.wikipedia.org/wiki/Memory_bandwidth.
  • Mason, J. FPGA HPC: The road beyond processors.
    Xilinx Corporation. RSSI 2007.
  • Wain, R., Bush, I., Guest, M., Deegan, M., Kozin,
    I. and Kitchen, C. An overview of FPGAs and FPGA
    programming: Initial experiences at Daresbury.
    November 2006. Distributed Computing Group at
    Daresbury Laboratory.
  • Bolsens, I. Programming Modern FPGAs. Xilinx
    Corporation. MPSOC August, 2006.
  • Underwood, K. 2004. FPGAs vs. CPUs: trends in
    peak floating-point performance. In Proceedings
    of the 2004 ACM/SIGDA 12th International
    Symposium on Field Programmable Gate Arrays
    (Monterey, California, USA, February 22-24,
    2004). FPGA '04. ACM Press, New York, NY,
    171-180.
  • HPEC Challenge Benchmarks.
    http://www.ll.mit.edu/HPECchallenge.
  • Xilinx Corporation. 2100 Logic Drive, San Jose,
    CA 95124-3400. Virtex-4 Family Overview (DS112),
    January 23, 2007.
  • Xilinx Corporation. 2100 Logic Drive, San Jose,
    CA 95124-3400. Floating-Point Operator v3.0
    (DS335). September 28, 2006.
  • Perrone, M. (IBM). Introduction to the Cell
    Processor (presentation).
  • Prager, K., Lewis, L., Vahey, M. and Groves, G.
    (Raytheon). World's First Polymorphic Computer:
    MONARCH (presentation).
  • Strenski, Dave. FPGA Floating Point Performance:
    a pencil and paper evaluation.
    http://www.hpcwire.com/hpc/1195762.html.
  • Strenski, Dave. 2006. Computational Bottlenecks
    and Hardware Decisions for FPGAs. FPGA and
    Structured ASIC Journal.
  • Altera Corporation. 101 Innovation Drive, San
    Jose, CA 95134. Stratix II Device Handbook v 4.3,
    May 2007.
  • Freescale Semiconductor Inc. 6501 William Cannon
    Drive West, Austin, TX 78735. MPC7450 RISC
    Microprocessor Family Reference Manual, Rev. 5.
    January 2005.
  • Freescale Semiconductor Inc. 6501 William Cannon
    Drive West, Austin, TX 78735. AltiVec Technology
    Programming Environments Manual, Rev. 3. April
    2006.
  • MathStar Corporation. 19075 NW Tanasbourne Dr.
    Suite 200, Hillsboro, OR 97124. Arrix Family
    Product Brief, August 2006.