Slide 1: Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems, Q3 Status Report
- Modeling and Simulation (MS) Group
- HCS Research Laboratory
- ECE Department
- University of Florida
- Principal Investigator: Professor Alan D. George
- Sr. Research Assistant: Mr. Eric Grobelny
Slide 2: Objectives and Motivations
- High-performance computing involves applications that require parallelization for feasible completion times
- HPC systems used for distributed computing
- Issues with heterogeneity
  - Efficiency of hardware resource usage
  - Execution time
  - Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
  - Nearly impossible and too expensive to determine experimentally
  - Simulation tools are required
- Challenges with the simulation approach
  - Large, complex systems
  - Balancing speed and fidelity
Slide 3: FASE Overview
- FASE: Fast and Accurate Simulation Environment
- Goals
  - Find the optimum system configuration on which to run a specific application
  - Performance analysis of a specific application running on a system
  - Identify bottlenecks in this configuration and optimize the program
- Uses a mixture of pre-simulation and simulation
  - Pre-simulation: extraction of the key characteristics of the application and abstraction of less influential components
  - Simulation: use pre-simulation results to determine overall performance on currently unavailable systems
- MLDesigner discrete-event simulation environment
  - Block-oriented, hierarchical modeling paradigm to minimize development time
  - Sacrifices some speed for a user-friendly interface
- Related work conducted at the San Diego Supercomputer Center (PMaC project), European Center for Parallelism of Barcelona (Dimemas), U. of Illinois (SvPablo), U. of Wisconsin (Paradyn), and U. of Oregon (TAU)
Slide 4: FASE Process Flow Diagram
- Input parallel program into the Script Generator
  - Code instrumented and executed
  - Scripts created during execution
  - Post-processing conducted
- Script files read in by MLDesigner
  - One script per simulated processor
- When all script files have been completed, the simulation is complete and statistics are reported
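As a purely illustrative sketch of how a script-driven simulated processor might consume trace records, the parser below reads one event line per call. The record layout, tags, and field names are assumptions; the actual FASE script format is not described in this report.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical script-event record. The real FASE script format is not
 * shown in this report, so the tags and fields below are illustrative. */
typedef enum { EV_COMP, EV_SEND, EV_RECV, EV_UNKNOWN } event_type;

typedef struct {
    event_type type;
    double     seconds; /* duration of a COMP (computation) event */
    int        peer;    /* partner rank for SEND/RECV events      */
    long       bytes;   /* payload size for SEND/RECV events      */
} script_event;

/* Parse one line of a script file into an event record.
 * Returns 1 on success, 0 on a malformed or unknown line. */
int parse_event(const char *line, script_event *ev)
{
    char tag[16];
    ev->type = EV_UNKNOWN;
    if (sscanf(line, "%15s", tag) != 1)
        return 0;
    if (strcmp(tag, "COMP") == 0) {
        ev->type = EV_COMP;
        return sscanf(line, "%*s %lf", &ev->seconds) == 1;
    }
    if (strcmp(tag, "SEND") == 0 || strcmp(tag, "RECV") == 0) {
        ev->type = (strcmp(tag, "SEND") == 0) ? EV_SEND : EV_RECV;
        return sscanf(line, "%*s %d %ld", &ev->peer, &ev->bytes) == 2;
    }
    return 0;
}
```

In the flow above, each simulated processor would loop over its own script file, handing COMP records to the computational unit and SEND/RECV records to the COMM interface.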
Slide 5: Script Generator
- Extracts key characteristics from the program to drive the simulation environment
- Features
  - Supported languages: C and C++
  - Supported programming models: MPI and Cray SHMEM
  - Automatic instrumentation of a selected subset of the MPI and SHMEM function libraries
    - Supported MPI functions: MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
    - Supported SHMEM functions: shmem_get, shmem_put
    - Foundation in place for easy addition of other functions
  - Non-communication events abstracted by simple timing
    - Times scaled during simulation to represent machines with different computational capabilities
- Scripts generated by running the binary executable
  - Traced events from instrumentation output to files
  - Script files drive the simulation models
- Post-processing
  - Overhead from the timing function calculated and reported during application execution
  - Average overhead subtracted from all non-communication events
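The post-processing and scaling steps above reduce to simple transforms over the recorded computation intervals. The sketch below assumes a clamp-at-zero policy and these function names for illustration; neither is taken from the FASE implementation.

```c
/* Subtract the measured average instrumentation overhead from each
 * recorded non-communication (computation) interval, clamping at zero
 * since the overhead can exceed a very short interval. */
void subtract_overhead(double *comp_times, int n, double avg_overhead)
{
    for (int i = 0; i < n; i++) {
        comp_times[i] -= avg_overhead;
        if (comp_times[i] < 0.0)
            comp_times[i] = 0.0;
    }
}

/* Scale computation times to represent a machine with different
 * computational capability (e.g. a CPU twice as fast => factor 0.5). */
void scale_comp_times(double *comp_times, int n, double factor)
{
    for (int i = 0; i < n; i++)
        comp_times[i] *= factor;
}
```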
Slide 6: FASE Models
- End node
  - Reads in the script file and routes data structures to the corresponding modules
- Script reader/processor
  - Reads in the script file
  - Converts text information to MLD data structures
  - Routes created data structures depending on type
- Computational unit
  - Simulates computational events
  - Effectively delays the simulated machine for the specified time
  - Communication events output to the COMM interface
  - RC events passed to the RC interface (more on this in slide 13)
- COMM interface
  - Provides the interface between the end node and a specific network model
  - Translates MPI, SHMEM, and UPC communication events into a series of network-specific transactions
  - Each network model has a unique COMM interface
Slide 7: Network Models
- Packet-level simulation to minimize simulation time
- Plug-and-play capabilities with different supported interconnects
- Scale hardware parameters to determine the benefits of future generations
- Provides the capability of using a network model where not specifically intended (e.g., an SCI network for embedded systems)
- Models
  - SCI (Scalable Coherent Interface)
    - High-speed interconnect up to 8 Gbps
    - Mainly direct topologies (1D, 2D, or 3D tori)
  - InfiniBand
    - High-speed interconnect from 2.5 to 30 Gbps
    - Switched network; can be used in embedded systems
  - RapidIO
    - High-speed (1-60 Gbps) embedded network interconnect
    - Switched network
  - TCP
    - Configured as TCP Reno
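At a first order, scaling a link's hardware parameters amounts to varying the terms of a latency-plus-serialization model. The data rates quoted above (e.g. 2.5 Gbps for a basic InfiniBand link) can be plugged in directly; the latency argument here is a placeholder parameter, not a measured value.

```c
/* First-order single-link transfer model: startup latency plus
 * serialization time at the link's data rate. */
double transfer_time_s(long bytes, double gbps, double latency_s)
{
    double bytes_per_s = gbps * 1e9 / 8.0;      /* Gbps -> bytes/s */
    return latency_s + (double)bytes / bytes_per_s;
}
```

Doubling `gbps` halves only the serialization term, which is one way such a model exposes the diminishing returns of faster future links for small messages.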
Slide 8: InfiniBand Model Modifications
- Old model did not incorporate important InfiniBand features
- New model created to incorporate these features and allow for more flexibility
- Added features
  - Flow control
    - Link layer: receiver-controlled to avoid buffer overflow
      - Flow control packets announce receiver-side buffer-size changes to the transmitter
    - Transport layer: end-to-end mechanism
      - Ensures the target of a transfer is ready to receive
  - Arbitration mechanisms
    - Maintain priorities of different packet types (weighted round-robin)
    - Allow for QoS capabilities (not currently implemented)
- Flexibility
  - Arbitrary number of queue pairs, HCA ports, and virtual lanes
  - Many user-customizable parameters
- Switch model (NEW)
  - User-customizable number of input/output ports
  - Customizable crossbar for internal routing
  - Dynamically configured routing table
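The weighted round-robin arbitration mentioned above can be sketched as a per-virtual-lane credit scheme. The state layout, weights, and skip-idle-lane policy below are simplifications chosen for illustration; the real IBA arbitration tables are more elaborate.

```c
#define NUM_VL 4

/* Weighted round-robin arbiter across virtual lanes (VLs). */
typedef struct {
    int weights[NUM_VL]; /* packets each VL may send per turn       */
    int credit;          /* packets left in the current VL's turn   */
    int current;         /* VL currently holding the grant          */
} wrr_arbiter;

/* Pick the next VL allowed to transmit; pending[vl] != 0 means that
 * VL has a packet waiting. Returns -1 when nothing is pending.
 * A VL with no pending packet forfeits the rest of its turn. */
int wrr_next(wrr_arbiter *a, const int pending[NUM_VL])
{
    for (int tried = 0; tried < NUM_VL; tried++) {
        if (a->credit > 0 && pending[a->current]) {
            a->credit--;
            return a->current;
        }
        a->current = (a->current + 1) % NUM_VL; /* next VL's turn */
        a->credit  = a->weights[a->current];
    }
    return -1;
}
```

With weights {2, 1, 1, 1}, VL 0 sends two packets for every one sent by each other lane, which is the prioritization effect the slide describes.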
Slide 9: InfiniBand Model Components
- FASE IBA node
  - FASE end node (red oval)
    - Described on slide 6
  - FASE IBA consumer (blue oval)
    - COMM interface for IBA
  - Channel interface (green oval)
    - Management and storage of work queue elements
  - HCA (purple oval)
    - Network interface
- Switch
  - Port mapper (orange oval)
    - Assigns each incoming packet a destination port based on the routing table
  - Routing mechanism (gold oval)
    - Management unit for the crossbar
- Example: 2-node FASE IBA system
Slide 10: RapidIO Model
- Embedded-systems switched interconnect
- Three-layer architecture
  - Physical, transport, and logical layers
- Project goals
  - Determine the optimal means by which to develop RapidIO for space systems
  - Perform RIO switch, board, and system tradeoff studies
  - Identify limitations of a space-based RIO design
  - Determine design feasibility using an SBR case study
    - GMTI and SAR algorithms
  - Provide assistance for Honeywell proposal efforts
  - Lay the groundwork for future Honeywell system prototyping

Figure: RapidIO four-switch backplane
Slide 11: Experimental Setup
- System configurations
  - 16-node dual 1.4 GHz Opteron cluster, 1 GB RAM per node
  - Red Hat Linux 9 with kernel version 2.4.20-8, patched for InfiniBand support
  - InfiniBand equipment: Voltaire ISR9024 switch and HCA400LPs, Ohio State MPICH
    - Voltaire info at www.voltaire.com
- Experiments
  - InfiniBand validation using ping test (left figure)
    - BW dip at 2048 bytes from protocol switch
    - Investigating experimental BW dip at 8 MB
  - 2-, 4-, and 8-node systems
  - Average of 25 iterations
  - Matrix multiply
    - 3 sizes: 250x250, 500x500, and 1000x1000
  - Bench 12
    - 3 main table sizes: 2^15, 2^20, and 2^25
Slide 12: Results for Matrix Multiply and Bench 12
- Errors in simulative vs. experimental execution times (ETs) range from 0.3% to 16.1% for Matrix Multiply and 0.6% to 21.4% for Bench 12
- Analysis
  - Matrix Multiply
    - Errors grow for smaller dataset sizes as system size grows
      - Accumulation of the errors inherent to each node
      - Small execution times can lead to errors due to deviations between values collected during script creation and experimental measurements
    - Computationally bound
      - Errors decrease as dataset sizes increase
      - Less simulated time spent stimulating the network model (where most error is incurred)
  - Bench 12
    - Errors for small dataset sizes are larger, for reasons similar to those explained for Matrix Multiply
    - More network-dependent than Matrix Multiply
    - Errors acceptable (under 5%) in most cases
    - Collective MPI functions in the program accurately modeled
- Simulation slowdown ratio: simulation time divided by actual execution time on the experimental testbed
  - Ratio ranged from 1.6 to 401 for Matrix Multiply
  - Ratio ranged from 125 to 2500 for Bench 12
- Overall, remarkably fast and accurate simulations!
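The two metrics quoted above reduce to simple arithmetic. The function names below are illustrative; the formulas are as stated on the slide.

```c
/* Percent error between simulated and experimentally measured
 * execution times, relative to the measured time. */
double pct_error(double sim_et, double exp_et)
{
    double diff = sim_et - exp_et;
    if (diff < 0.0)
        diff = -diff;                 /* absolute difference */
    return 100.0 * diff / exp_et;
}

/* Simulation slowdown ratio: simulation wall time over the actual
 * execution time on the testbed (e.g. 401 => 401x slower). */
double slowdown_ratio(double sim_wall_time, double exp_et)
{
    return sim_wall_time / exp_et;
}
```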
Slide 13: RC Model
- The RC arena has been dominated by experimentation, with little done in simulation
- Simulation
  - Predict performance gains on future and more advanced systems
  - Determine optimal workload and data distribution
  - Predict performance in emerging realms of RC
    - Resource management
    - Independent RC fabric communication (i.e., without the processor)
    - Large-scale HPC (e.g., Cray XD1, SRC MAP processor, and SGI Altix with FPGA bricks)
- Current models
  - RC fabric with management unit (red oval) and dynamically created and reconfigured functional units (blue oval)
  - Dynamically created RC fabrics (green oval)
  - Interface with FASE for script support (purple oval)
  - Multiple host processor/fabric interconnects supported
    - The RC node figure illustrates a PCI bus interface (orange oval), but RIO, IBA, SCI, etc. could be plugged in
  - Support for inter-fabric communication
    - Potential for exploration into grid-level use of RC devices

Figures: RC fabric; RC node
Slide 14: Low-Locality Applications Research
- Research motivation
  - Cache effectiveness depends on the locality of the application
  - Programs that exhibit poor locality may actually experience a performance penalty through data caching
    - Overhead associated with moving entire blocks of data into the cache
    - Performance degradation from good data being evicted from the cache and replaced with bad data
- Investigated solutions
  - Static non-caching scheme
    - Based on introducing new load/store instructions into the ISA
    - Memory references known to exhibit poor locality replaced with non-caching loads/stores
  - Dynamic non-caching scheme
    - Insertion of a Dynamic Bypass Table before the data cache
    - Tracks the hit/miss history of instructions
    - After observing consistent misses, instructions are dynamically marked to bypass the cache
    - Does not require editing application code

Figures: static scheme (main loop of Bench 12); dynamic scheme (Dynamic Bypass Table)
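A minimal sketch of a Dynamic Bypass Table of the kind described above, assuming a direct-mapped table of per-PC saturating miss counters. The entry count, miss threshold, and reset-on-hit (un-saturation) policy are illustrative choices, not details of the studied design.

```c
#include <string.h>

#define DBT_ENTRIES    64  /* table size: illustrative, not from the study */
#define MISS_THRESHOLD 3   /* consecutive misses before bypassing          */

/* Direct-mapped Dynamic Bypass Table indexed by low PC bits. */
typedef struct {
    unsigned long pc[DBT_ENTRIES];     /* owning instruction address */
    int           misses[DBT_ENTRIES]; /* saturating miss counter    */
} dbt;

void dbt_init(dbt *t) { memset(t, 0, sizeof *t); }

/* Record the outcome of one load at 'pc'. A conflicting PC steals the
 * entry; a hit resets the counter (one possible un-saturation policy). */
void dbt_record(dbt *t, unsigned long pc, int was_miss)
{
    int i = (int)(pc % DBT_ENTRIES);
    if (t->pc[i] != pc) {
        t->pc[i] = pc;
        t->misses[i] = 0;
    }
    if (was_miss) {
        if (t->misses[i] < MISS_THRESHOLD)
            t->misses[i]++;            /* saturate at the threshold */
    } else {
        t->misses[i] = 0;
    }
}

/* Should the next access from 'pc' bypass the data cache? */
int dbt_bypass(const dbt *t, unsigned long pc)
{
    int i = (int)(pc % DBT_ENTRIES);
    return t->pc[i] == pc && t->misses[i] >= MISS_THRESHOLD;
}
```

Because the table keys on the PC rather than the data address, it captures per-instruction behavior without any change to application code, which is the property the dynamic scheme relies on.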
Slide 15: Low-Locality Applications Research (cont.)
- Simulation results
  - Bench 12 used as the main target application
  - Static and dynamic solutions provide similar performance for Bench 12 (Figure 1)
    - L2 cache misses result in longer access latencies than non-caching accesses
    - The dynamic solution turns all accesses in the main loop into non-cacheable references; the static solution leaves one load as cacheable
  - Several simulation bugs were fixed, though results still show the static solution performing worse as substitution table size increases
    - Discovered that the highest load latency determines how quickly the main loop may be iterated
    - If the cached references of the static solution miss each time, performance will begin to approach that of the unmodified system
    - Due to the nature of Bench 12, caching some references provides no benefit, as at least one necessary operand for each iteration must still be fetched from main memory
      - Waiting on this load to complete each iteration hides any benefit of other fast accesses that complete earlier
    - More complex applications should benefit more from the static approach
- Several possible variations of the dynamic approach are open to investigation
  - Classify instructions based on the reference address, not the PC [Memi03, John97]
  - Monitor other statistics besides, or in addition to, the miss count
  - Allow instructions that have saturated to become un-saturated [Gonz95]

Figure: comparison of static and dynamic schemes
Slide 16: Conclusions
- FASE
  - Pre-simulation process characterizes the application
    - Captures key parameters of communication events (subset of MPI and SHMEM library calls) to drive simulation
    - Non-communication areas use simple timing; relies on a scaling factor to model other computational components
    - Using hardware for timing is FAST
  - Simulation
    - Used to accurately model communication events and scale other events
    - Can use any network model in the FASE library in a system configuration
    - User-definable parameters to customize network settings and computational-unit parameters
- Network models
  - InfiniBand model modified to more accurately follow the InfiniBand standard
  - Current library consists of RapidIO, SCI, InfiniBand, and TCP
- RC model
  - Support for multiple RC nodes, multiple RC fabrics on each node, and multiple reconfigurable functional units in each fabric
  - Model still in its infancy, but results obtained are promising
- Low-locality case study
  - Instruction-based solutions offer improvement for Bench 12
  - Dynamic and static techniques show similar behavior
  - When the substitution table becomes larger than the L2 cache, the dynamic scheme performs better
Slide 17: Future Work
- FASE
  - Support other programming languages
  - More completely support the MPI and SHMEM programming models
  - Devise a scheme to support UPC
  - Model other components
    - Continue enhancing RC models
    - Potential modeling of the memory hierarchy, storage devices, WAN and grid computing components, etc.
  - Optimize existing network models for speed without sacrificing accuracy
    - Implementation and MLD-specific optimizations
  - Run experiments with larger systems
- Low-locality case study
  - Memory address-based classification
  - Extend scalar simulations to include multiprocessor systems
    - Benchmarks intended to be run on parallel machines
    - Extended memory hierarchy adds another layer of complexity to locality
  - Possible integration with FASE; model memory hierarchy in node components
Slide 18: References
- [Gonz95] A. Gonzalez, C. Aliagas, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Department of Computer Architecture, Polytechnic University of Catalonia, Barcelona, 1995.
- [John97] T. Johnson, M. Merten, and W. Hwu, "Run-time Spatial Locality Detection and Optimization," Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, IL, 1997.
- [Memi03] G. Memik, M. Kandemir, A. Choudhary, and I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003.