Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems: Q3 Status Report
1
Performance Modeling and Simulation for Tradeoff
Analyses in Advanced HPC Systems Q3 Status
Report
  • Modeling & Simulation (M&S) Group
  • HCS Research Laboratory
  • ECE Department
  • University of Florida
  • Principal Investigator: Professor Alan D. George
  • Sr. Research Assistant: Mr. Eric Grobelny

2
Objectives and Motivations
  • High-performance computing involves applications
    that require parallelization for feasible
    completion time
  • HPC systems used for distributed computing
  • Issues with heterogeneity
  • Efficiency of hardware resource usage
  • Execution time
  • Overhead
  • Challenge: Find optimum configuration of
    resources and task distribution for key
    applications under study
  • Nearly impossible and too expensive to determine
    experimentally
  • Simulation tools are required
  • Challenges with simulation approach
  • Large, complex systems
  • Balance speed and fidelity

3
FASE Overview
  • FASE: Fast and Accurate Simulation Environment
  • Goals
  • Find optimum system configuration on which to run
    specific application
  • Performance analysis of specific application
    running on a system
  • Identify bottlenecks in this configuration and
    optimize program
  • Use mixture of pre-simulation and simulation
  • Pre-simulation: extraction of key characteristics
    of the application and abstraction of less
    influential components
  • Simulation: use pre-simulation results to
    determine overall performance on currently
    unavailable systems
  • MLDesigner: discrete-event simulation
    environment
  • Block-oriented, hierarchical modeling paradigm to
    minimize development time
  • Sacrifices some speed for user-friendly interface
  • Related work conducted at San Diego Supercomputer
    Center (PMaC project), European Center for
    Parallelism of Barcelona (Dimemas), U of Illinois
    (SvPablo), U of Wisconsin (Paradyn), and U of
    Oregon (TAU)

4
FASE Process Flow Diagram
  • Input parallel program into Script Generator
  • Code instrumented and executed
  • Scripts created during execution
  • Post-processing conducted
  • Script files read in by MLDesigner
  • One script per simulated processor
  • When all script files have been completed,
    simulation is complete and statistics are reported

5
Script Generator
  • Extracts key characteristics from program to
    drive simulation environment
  • Features
  • Supported languages: C and C++
  • Supported programming models: MPI and Cray SHMEM
  • Automatic instrumentation of selected subset of
    MPI and SHMEM function libraries (see the
    instrumentation sketch below)
  • Supported MPI functions
  • MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall,
    MPI_Bcast, MPI_Reduce
  • Supported SHMEM functions
  • shmem_get, shmem_put
  • Foundation in place for easy addition of other
    functions
  • Non-communication events abstracted by simple
    timing
  • Times scaled during simulation to represent
    machine with different computational capabilities
  • Scripts generated by running binary executable
  • Traced events from instrumentation are written to
    output files
  • Script files drive simulation models
  • Post-processing
  • Overhead of the timing function is measured and
    reported during application execution
  • Average overhead is subtracted from all
    non-communication events
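As a hedged illustration of the instrumentation above: the sketch
below shows how a selected MPI call could be wrapped via MPI's
standard PMPI profiling interface; the script format and file
naming are hypothetical, not FASE's actual trace format.

    /* Hypothetical PMPI-based wrapper: logs compute time between
     * traced events plus the key parameters of each MPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE  *script;       /* one script file per process */
    static double last_event;   /* end time of previous event  */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char name[64];
        snprintf(name, sizeof name, "script.%d.txt", rank);
        script = fopen(name, "w"); /* one script per simulated CPU */
        last_event = MPI_Wtime();
        return rc;
    }

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double now = MPI_Wtime();
        /* Time since the last traced event becomes a compute
         * delay, scaled later during simulation. */
        fprintf(script, "COMPUTE %f\n", now - last_event);

        int rc = PMPI_Send(buf, count, type, dest, tag, comm);

        int size;
        MPI_Type_size(type, &size);
        /* Key parameters needed to replay this send in simulation. */
        fprintf(script, "MPI_SEND dest=%d bytes=%d tag=%d\n",
                dest, count * size, tag);
        last_event = MPI_Wtime();
        return rc;
    }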

6
FASE Models
  • End node
  • Reads in script file, routes data structures to
    corresponding modules
  • Script reader/processor
  • Reads in the script file
  • Converts text information to MLD data structures
  • Routes created data structures by type (see the
    dispatch sketch below)
  • Computational unit
  • Simulates computational events
  • Effectively delays simulated machine for
    specified time
  • Communication events are output to the COMM
    interface
  • RC events passed to RC interface
  • More on this in slide 13
  • COMM interface
  • Provides interface between end node and specific
    network model
  • Translates MPI, SHMEM, and UPC communication
    events into series of network-specific
    transactions
  • Each network model has unique COMM interface
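To make the routing step concrete, here is a minimal C sketch of
the dispatch logic a script reader could apply; actual FASE models
are MLDesigner blocks, and the event names follow the hypothetical
script format sketched on the previous slide.

    /* Illustrative dispatcher: parses one script line and routes
     * the resulting event to the matching module. */
    #include <stdio.h>
    #include <string.h>

    typedef enum { EV_COMPUTE, EV_COMM, EV_RC } event_type;

    typedef struct {
        event_type type;
        double     delay;        /* compute: simulated busy time */
        int        dest, bytes;  /* comm: message parameters     */
    } sim_event;

    int process_line(const char *line, sim_event *ev)
    {
        if (sscanf(line, "COMPUTE %lf", &ev->delay) == 1) {
            ev->type = EV_COMPUTE;    /* -> computational unit */
        } else if (sscanf(line, "MPI_SEND dest=%d bytes=%d",
                          &ev->dest, &ev->bytes) == 2) {
            ev->type = EV_COMM;       /* -> COMM interface     */
        } else if (strncmp(line, "RC_", 3) == 0) {
            ev->type = EV_RC;         /* -> RC interface       */
        } else {
            return 0;                 /* unrecognized line     */
        }
        return 1;
    }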

7
Network Models
  • Packet-level simulation to minimize simulation
    time
  • Plug-and-play capabilities with different
    supported interconnects
  • Scale hardware parameters to determine benefits
    of future generations
  • Allows a network model to be used in contexts
    beyond its original intent
  • e.g. SCI network for embedded systems
  • Models
  • SCI: Scalable Coherent Interface
  • High-speed interconnect up to 8 Gbps
  • Mainly direct topologies (1D, 2D, or 3D tori)
  • InfiniBand
  • High-speed interconnect from 2.5 to 30 Gbps
  • Switched network, can be used in embedded systems
  • RapidIO
  • High-speed (1-60 Gbps) embedded network
    interconnect
  • Switched network
  • TCP
  • Configured as TCP-Reno

8
InfiniBand Model Modifications
  • Old model did not incorporate important
    InfiniBand features
  • New model created to incorporate features and
    allow for more flexibility
  • Added features
  • Flow control
  • Link layer: receiver-controlled to avoid buffer
    overflow
  • Flow control packets announce receiver-side
    buffer-size changes to the transmitter
  • Transport layer: end-to-end mechanism
  • Ensures the target of a transfer is ready to
    receive
  • Arbitration mechanisms
  • Maintains packet priorities of different packet
    types (weighted round-robin; see the arbiter
    sketch below)
  • Allows for QoS capabilities (not currently
    implemented)
  • Flexibility
  • Arbitrary number of queue pairs, HCA ports, and
    virtual lanes
  • Many user customizable parameters
  • Switch model (NEW)
  • User customizable number of input/output ports
  • Customizable crossbar for internal routing
  • Dynamically configured routing table
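As a rough sketch of the arbitration mechanism, the C fragment
below implements a weighted round-robin arbiter in the spirit of
InfiniBand's VL arbitration table; the table contents and the
vl_has_packet() queue probe are assumptions, not the model's code.

    #include <stdbool.h>

    #define TABLE_LEN 8

    typedef struct { int vl; int weight; } arb_entry;

    /* Each entry names a virtual lane and how many packets it may
     * send before the arbiter advances (illustrative weights). */
    static arb_entry table[TABLE_LEN] = {
        {0, 4}, {1, 2}, {2, 1}, {3, 1},
        {0, 4}, {1, 2}, {2, 1}, {3, 1},
    };
    static int cur    = 0;  /* current table entry              */
    static int credit = 4;  /* packets left (= table[0].weight) */

    extern bool vl_has_packet(int vl);  /* assumed queue probe */

    /* Return the VL that transmits next, or -1 if all are idle. */
    int arbitrate(void)
    {
        for (int scanned = 0; scanned < TABLE_LEN; scanned++) {
            if (credit > 0 && vl_has_packet(table[cur].vl)) {
                credit--;
                return table[cur].vl;
            }
            cur = (cur + 1) % TABLE_LEN;   /* advance the table */
            credit = table[cur].weight;    /* reload its weight */
        }
        return -1;
    }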

9
InfiniBand Model Components
  • FASE IBA node
  • FASE endnode (red oval)
  • Described on slide 6
  • FASE IBA consumer (blue oval)
  • COMM interface for IBA
  • Channel interface (green oval)
  • Manages and stores work queue elements
  • HCA (purple oval)
  • Network interface
  • Switch
  • Port mapper (orange oval)
  • Assigns incoming packet a destination port based
    on routing table
  • Routing mechanism (gold oval)
  • Management unit for crossbar
  • Example 2-node FASE IBA system

10
RapidIO Model
  • Embedded systems switched interconnect
  • Three-layer architecture
  • Physical, transport, logical layers
  • Project goals
  • Determine the optimal means by which to develop
    RapidIO for space systems
  • Perform RIO switch, board and system tradeoff
    studies
  • Identify limitations of space-based RIO design
  • Determine design feasibility using SBR case study
  • GMTI and SAR algorithms
  • Provide assistance for Honeywell proposal efforts
  • Lay the groundwork for future Honeywell system
    prototyping

RapidIO four-switch backplane
11
Experimental Setup
  • System configurations
  • 16-node dual 1.4 GHz Opteron cluster, 1GB RAM per
    node
  • Red Hat Linux 9 with kernel version 2.4.20-8,
    patched for InfiniBand support
  • InfiniBand equipment: Voltaire ISR9024 switch and
    HCA400LPs, Ohio State MPICH
  • Voltaire info at www.voltaire.com
  • Experiments
  • InfiniBand validation using Ping test (left
    figure; a minimal ping-pong sketch follows below)
  • BW dip at 2048 bytes from protocol switch
  • Investigating experimental BW dip at 8MB
  • 2-, 4-, and 8-node systems
  • Average of 25 iterations
  • Matrix multiply
  • 3 sizes: 250x250, 500x500, and 1000x1000
  • Bench 12
  • 3 main table sizes: 2^15, 2^20, and 2^25
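For reference, below is a minimal MPI ping-pong sketch of the kind
typically used for such bandwidth validation; the 8 MB upper bound
and 25-iteration average mirror the slide, but this is not
necessarily the exact benchmark used.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS   25            /* average of 25 iterations */
    #define MAX_MSG (8 << 20)     /* up to 8 MB messages      */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        if (np < 2) { MPI_Finalize(); return 1; }

        char *buf = malloc(MAX_MSG);
        for (int bytes = 1; bytes <= MAX_MSG; bytes *= 2) {
            double t0 = MPI_Wtime();
            for (int i = 0; i < ITERS; i++) {
                if (rank == 0) {        /* send, then await echo */
                    MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) { /* echo back to rank 0   */
                    MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            /* One-way time per message; report achieved bandwidth. */
            double t = (MPI_Wtime() - t0) / (2.0 * ITERS);
            if (rank == 0)
                printf("%8d bytes  %10.2f MB/s\n", bytes, bytes / t / 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }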

12
Results Matrix Multiply and Bench 12
  • Errors in simulative vs. experimental execution
    times (ET) range from 0.3% to 16.1% for Matrix
    Multiply, and from 0.6% to 21.4% for Bench 12
  • Overall, remarkably accurate and fast simulations!
  • Analysis
  • Matrix Multiply
  • Errors grow for smaller dataset sizes as system
    size grows
  • Accumulation of errors inherent to each node
  • Small execution times can lead to errors due to
    deviations between values collected during script
    creation and experimental measurements
  • Matrix Multiply is computation-bound
  • Errors decrease as dataset sizes increase
  • Less simulated time spent exercising the network
    model (where most error is incurred)
  • Bench 12
  • Errors for small dataset sizes larger due to
    reasons similar to those explained for Matrix
    Multiply
  • More network dependent than Matrix Multiply
  • Errors acceptable (< 5%) in most cases
  • Collective MPI functions in program accurately
    modeled
  • Sim. slowdown ratio = simulation time ÷ actual
    execution time on experimental testbed (e.g., a run
    taking 2 s on the testbed and 800 s to simulate has
    a slowdown ratio of 400)
  • Ratio ranged from 1.6 to 401 for Matrix Multiply
  • Ratio ranged from 125 to 2500 for Bench 12
  • Remarkably fast, accurate simulations!

13
RC Model
  • The RC (reconfigurable computing) arena has been
    dominated by experimentation, with little done in
    simulation
  • Simulation
  • Predict performance gains on future and more
    advanced systems
  • Determine optimal workload and data distribution
  • Predict performance in emerging realms of RC
  • Resource management
  • Independent RC fabric communication (i.e. without
    processor)
  • Large-scale HPC (e.g. Cray XD1, SRC MAP
    processor, and SGI Altix with FPGA bricks)
  • Current models
  • RC fabric with management unit (red oval) and
    dynamically created and reconfigured functional
    units (blue oval)
  • Dynamically created RC fabrics (green oval)
  • Interface with FASE for script support (purple
    oval)
  • Multiple host processor/fabric interconnects
    supported
  • RC node figure illustrates a PCI bus interface
    (orange oval), but RIO, IBA, SCI, etc. could be
    plugged in
  • Support for inter-fabric communication
  • Potential for exploration into grid-level use of
    RC devices

RC fabric
RC node
14
Low-Locality Applications Research
  • Research motivation
  • Cache effectiveness depends on locality of
    application
  • Programs that exhibit poor locality may actually
    incur a performance penalty from data caching
  • Overhead associated with moving entire blocks of
    data into cache
  • Performance degradation when useful data is
    evicted from the cache and replaced by data that
    will not be reused
  • Investigated solutions
  • Static non-caching scheme
  • Based on introducing new load/store instructions
    into ISA
  • Memory references known to exhibit poor locality
    replaced with non-caching load/store
  • Dynamic non-caching scheme
  • Insertion of a Dynamic Bypass Table before the
    data cache (see the sketch below)
  • Tracks hit/miss history of instructions
  • After observing consistent misses, instructions
    dynamically marked to bypass cache
  • Does not require editing application code

Static scheme - main loop of Bench 12
Dynamic scheme - Dynamic Bypass Table
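To illustrate the dynamic scheme, below is a minimal sketch of a
Dynamic Bypass Table indexed by instruction PC; the table size and
saturation threshold are assumptions, not the studied design's
exact parameters.

    #include <stdint.h>
    #include <stdbool.h>

    #define DBT_ENTRIES 256
    #define SATURATE    8    /* misses before bypassing (assumed) */

    typedef struct {
        uint64_t tag;        /* instruction PC           */
        uint8_t  miss_count; /* consecutive-miss history */
    } dbt_entry;

    static dbt_entry dbt[DBT_ENTRIES];

    /* Called on each memory access with the instruction's PC and
     * whether the data cache hit; returns true once accesses from
     * this PC should bypass the cache and go straight to memory. */
    bool dbt_update(uint64_t pc, bool cache_hit)
    {
        dbt_entry *e = &dbt[pc % DBT_ENTRIES];
        if (e->tag != pc) {        /* new instruction: reset */
            e->tag = pc;
            e->miss_count = 0;
        }
        if (cache_hit)
            e->miss_count = 0;     /* hits clear the history */
        else if (e->miss_count < SATURATE)
            e->miss_count++;
        return e->miss_count >= SATURATE;
    }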
15
Low-Locality Applications Research
  • Simulation results
  • Bench 12 used as main target application
  • Static and dynamic solutions provide similar
    performance for Bench 12 (Figure 1)
  • L2 cache misses result in longer access latencies
    than non-caching accesses
  • Dynamic solution turns all accesses in main loop
    into non-cacheable references, static solution
    leaves one load as cacheable
  • Several simulation bugs were fixed, though
    results still show static solution performing
    worse as substitution table size increases
  • Discovered that highest load latency determines
    how quickly main loop may be iterated
  • If cached references of the static solution miss
    every time, performance will begin to approach
    that of the unmodified system
  • Due to the nature of Bench 12, caching some
    references provides no benefit, as at least one
    necessary operand for each iteration must still
    be fetched from main memory
  • Waiting on this load to complete each iteration
    hides any benefit of other fast accesses which
    complete earlier
  • More complex applications should benefit more
    from static approach
  • Several possible variations of dynamic approach
    open to investigation
  • Classify instructions based on reference address,
    not PC [Memi03, John97]
  • Monitor other statistics instead of, or in
    addition to, miss count
  • Allow instructions which have saturated to become
    un-saturated [Gonz95]

Comparison of static and dynamic
16
Conclusions
  • FASE
  • Pre-simulation process characterizes application
  • Captures key parameters of communication events
    (subset of MPI and SHMEM library calls) to drive
    simulation
  • Non-communication regions captured by simple
    timing; a scaling factor models other
    computational components
  • Uses hardware for timing, keeping characterization
    fast
  • Simulation
  • Used to accurately model communication events and
    scale other events
  • Can use any network model in FASE library in a
    system configuration
  • User-definable parameters to customize network
    settings and computational unit parameters
  • Network models
  • InfiniBand model modified to more accurately
    follow the InfiniBand standard
  • Current library consists of RapidIO, SCI,
    InfiniBand and TCP
  • RC model
  • Support for multiple RC nodes, multiple RC
    fabrics on each node, and multiple,
    reconfigurable functional units in each fabric
  • Model still in infancy stage, but results
    obtained are promising
  • Low-locality case study
  • Instruction-based solutions offer improvement for
    Bench 12
  • Dynamic and static techniques show similar
    behavior
  • When the substitution table becomes larger than
    the L2 cache, the dynamic scheme performs better

17
Future Work
  • FASE
  • Support other programming languages
  • More completely support the MPI and SHMEM
    programming models
  • Devise scheme to support UPC
  • Model other components
  • Continue enhancing RC models
  • Potential modeling of memory hierarchy, storage
    devices, WAN and grid computing components, etc.
  • Optimize existing network models for speed
    without sacrificing accuracy
  • Implementation and MLD-specific optimizations
  • Run experiments with larger systems
  • Low-locality case study
  • Memory address-based classification
  • Extend scalar simulations to include
    multi-processor systems
  • Benchmarks intended to be run on parallel
    machines
  • Extended memory hierarchy adds another layer of
    complexity to locality
  • Possible integration with FASE, model memory
    hierarchy in node components

18
References
  • [Gonz95] A. Gonzalez, C. Aliagas, and M. Valero,
    "A Data Cache with Multiple Caching Strategies
    Tuned to Different Types of Locality," Department
    of Computer Architecture, Polytechnic University
    of Catalonia, Barcelona, 1995.
  • [John97] T. Johnson, M. Merten, and W. Hwu,
    "Run-time Spatial Locality Detection and
    Optimization," Center for Reliable and
    High-Performance Computing, University of Illinois
    at Urbana-Champaign, IL, 1997.
  • [Memi03] G. Memik, M. Kandemir, A. Choudhary, and
    I. Kadayif, "An Integrated Approach for Improving
    Cache Behavior," Proceedings of the Design,
    Automation and Test in Europe Conference and
    Exhibition, 2003.