Slide 1: Performance Modeling and Simulation for Tradeoff Analyses in Advanced HPC Systems, Q3 Status Report
- Modeling and Simulation (MS) Group
- HCS Research Laboratory
- ECE Department
- University of Florida
- Principal Investigator: Professor Alan D. George
- Sr. Research Assistant: Mr. Eric Grobelny
Slide 2: Objectives and Motivations
- High-performance computing involves applications that require parallelization for feasible completion times
- HPC systems used for distributed computing
- Issues with heterogeneity
  - Efficiency of hardware resource usage
  - Execution time
  - Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
  - Nearly impossible and too expensive to determine experimentally
  - Simulation tools are required
- Challenges with the simulation approach
  - Large, complex systems
  - Balancing speed and fidelity
Slide 3: FASE Overview
- FASE: Fast and Accurate Simulation Environment
- Goals
  - Find the optimum system configuration on which to run a specific application
  - Performance analysis of a specific application running on a system
  - Identify bottlenecks in this configuration and optimize the program
- Uses a mixture of pre-simulation and simulation
  - Pre-simulation: extraction of the key characteristics of the application and abstraction of less influential components
  - Simulation: use pre-simulation results to determine overall performance on currently unavailable systems
- MLDesigner discrete-event simulation environment
  - Block-oriented, hierarchical modeling paradigm to minimize development time
  - Sacrifices some speed for a user-friendly interface
- Related work conducted at the San Diego Supercomputer Center (PMaC project), European Center for Parallelism of Barcelona (Dimemas), U. of Illinois (SvPablo), U. of Wisconsin (Paradyn), and U. of Oregon (TAU)
Slide 4: FASE Process Flow Diagram
- Input parallel program into the Script Generator
  - Code instrumented and executed
  - Scripts created during execution
  - Post-processing conducted
- Script files read in by MLDesigner
  - One script per simulated processor
- When all script files have been completed, the simulation is complete and statistics are reported
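As a purely illustrative sketch of how a script-driven simulated processor might consume trace records, the parser below reads one event line per call. The record layout, tags, and field names are assumptions; the actual FASE script format is not described in this report.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical script-event record. The real FASE script format is not
 * shown in this report, so the tags and fields below are illustrative. */
typedef enum { EV_COMP, EV_SEND, EV_RECV, EV_UNKNOWN } event_type;

typedef struct {
    event_type type;
    double     seconds; /* duration of a COMP (computation) event */
    int        peer;    /* partner rank for SEND/RECV events      */
    long       bytes;   /* payload size for SEND/RECV events      */
} script_event;

/* Parse one line of a script file into an event record.
 * Returns 1 on success, 0 on a malformed or unknown line. */
int parse_event(const char *line, script_event *ev)
{
    char tag[16];
    ev->type = EV_UNKNOWN;
    if (sscanf(line, "%15s", tag) != 1)
        return 0;
    if (strcmp(tag, "COMP") == 0) {
        ev->type = EV_COMP;
        return sscanf(line, "%*s %lf", &ev->seconds) == 1;
    }
    if (strcmp(tag, "SEND") == 0 || strcmp(tag, "RECV") == 0) {
        ev->type = (strcmp(tag, "SEND") == 0) ? EV_SEND : EV_RECV;
        return sscanf(line, "%*s %d %ld", &ev->peer, &ev->bytes) == 2;
    }
    return 0;
}
```

In the flow above, each simulated processor would loop over its own script file, handing COMP records to the computational unit and SEND/RECV records to the COMM interface.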
Slide 5: Script Generator
- Extracts key characteristics from the program to drive the simulation environment
- Features
  - Supported languages: C and C++
  - Supported programming models: MPI and Cray SHMEM
  - Automatic instrumentation of a selected subset of the MPI and SHMEM function libraries
    - Supported MPI functions: MPI_Send, MPI_Ssend, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
    - Supported SHMEM functions: shmem_get, shmem_put
    - Foundation in place for easy addition of other functions
  - Non-communication events abstracted by simple timing
    - Times scaled during simulation to represent machines with different computational capabilities
- Scripts generated by running the binary executable
  - Traced events from instrumentation output to files
  - Script files drive the simulation models
- Post-processing
  - Overhead from the timing function calculated and reported during application execution
  - Average overhead subtracted from all non-communication events
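The post-processing and scaling steps above reduce to simple transforms over the recorded computation intervals. The sketch below assumes a clamp-at-zero policy and these function names for illustration; neither is taken from the FASE implementation.

```c
/* Subtract the measured average instrumentation overhead from each
 * recorded non-communication (computation) interval, clamping at zero
 * since the overhead can exceed a very short interval. */
void subtract_overhead(double *comp_times, int n, double avg_overhead)
{
    for (int i = 0; i < n; i++) {
        comp_times[i] -= avg_overhead;
        if (comp_times[i] < 0.0)
            comp_times[i] = 0.0;
    }
}

/* Scale computation times to represent a machine with different
 * computational capability (e.g. a CPU twice as fast => factor 0.5). */
void scale_comp_times(double *comp_times, int n, double factor)
{
    for (int i = 0; i < n; i++)
        comp_times[i] *= factor;
}
```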
Slide 6: FASE Models
- End node
  - Reads in the script file and routes data structures to the corresponding modules
- Script reader/processor
  - Reads in the script file
  - Converts text information to MLD data structures
  - Routes created data structures depending on type
- Computational unit
  - Simulates computational events
  - Effectively delays the simulated machine for the specified time
  - Communication events output to the COMM interface
  - RC events passed to the RC interface (more on this in slide 13)
- COMM interface
  - Provides the interface between the end node and a specific network model
  - Translates MPI, SHMEM, and UPC communication events into a series of network-specific transactions
  - Each network model has a unique COMM interface
Slide 7: Network Models
- Packet-level simulation to minimize simulation time
- Plug-and-play capabilities with different supported interconnects
- Scale hardware parameters to determine the benefits of future generations
- Provides the capability of using a network model where not specifically intended (e.g., an SCI network for embedded systems)
- Models
  - SCI (Scalable Coherent Interface)
    - High-speed interconnect up to 8 Gbps
    - Mainly direct topologies (1D, 2D, or 3D tori)
  - InfiniBand
    - High-speed interconnect from 2.5 to 30 Gbps
    - Switched network; can be used in embedded systems
  - RapidIO
    - High-speed (1-60 Gbps) embedded network interconnect
    - Switched network
  - TCP
    - Configured as TCP Reno
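At a first order, scaling a link's hardware parameters amounts to varying the terms of a latency-plus-serialization model. The data rates quoted above (e.g. 2.5 Gbps for a basic InfiniBand link) can be plugged in directly; the latency argument here is a placeholder parameter, not a measured value.

```c
/* First-order single-link transfer model: startup latency plus
 * serialization time at the link's data rate. */
double transfer_time_s(long bytes, double gbps, double latency_s)
{
    double bytes_per_s = gbps * 1e9 / 8.0;      /* Gbps -> bytes/s */
    return latency_s + (double)bytes / bytes_per_s;
}
```

Doubling `gbps` halves only the serialization term, which is one way such a model exposes the diminishing returns of faster future links for small messages.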
Slide 8: InfiniBand Model Modifications
- Old model did not incorporate important InfiniBand features
- New model created to incorporate these features and allow for more flexibility
- Added features
  - Flow control
    - Link layer: receiver-controlled to avoid buffer overflow
      - Flow control packets announce receiver-side buffer-size changes to the transmitter
    - Transport layer: end-to-end mechanism
      - Ensures the target of a transfer is ready to receive
  - Arbitration mechanisms
    - Maintain priorities of different packet types (weighted round-robin)
    - Allow for QoS capabilities (not currently implemented)
- Flexibility
  - Arbitrary number of queue pairs, HCA ports, and virtual lanes
  - Many user-customizable parameters
- Switch model (NEW)
  - User-customizable number of input/output ports
  - Customizable crossbar for internal routing
  - Dynamically configured routing table
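The weighted round-robin arbitration mentioned above can be sketched as a per-virtual-lane credit scheme. The state layout, weights, and skip-idle-lane policy below are simplifications chosen for illustration; the real IBA arbitration tables are more elaborate.

```c
#define NUM_VL 4

/* Weighted round-robin arbiter across virtual lanes (VLs). */
typedef struct {
    int weights[NUM_VL]; /* packets each VL may send per turn       */
    int credit;          /* packets left in the current VL's turn   */
    int current;         /* VL currently holding the grant          */
} wrr_arbiter;

/* Pick the next VL allowed to transmit; pending[vl] != 0 means that
 * VL has a packet waiting. Returns -1 when nothing is pending.
 * A VL with no pending packet forfeits the rest of its turn. */
int wrr_next(wrr_arbiter *a, const int pending[NUM_VL])
{
    for (int tried = 0; tried < NUM_VL; tried++) {
        if (a->credit > 0 && pending[a->current]) {
            a->credit--;
            return a->current;
        }
        a->current = (a->current + 1) % NUM_VL; /* next VL's turn */
        a->credit  = a->weights[a->current];
    }
    return -1;
}
```

With weights {2, 1, 1, 1}, VL 0 sends two packets for every one sent by each other lane, which is the prioritization effect the slide describes.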
Slide 9: InfiniBand Model Components
- FASE IBA node
  - FASE end node (red oval)
    - Described on slide 6
  - FASE IBA consumer (blue oval)
    - COMM interface for IBA
  - Channel interface (green oval)
    - Management and storage of work queue elements
  - HCA (purple oval)
    - Network interface
- Switch
  - Port mapper (orange oval)
    - Assigns each incoming packet a destination port based on the routing table
  - Routing mechanism (gold oval)
    - Management unit for the crossbar
- Example: 2-node FASE IBA system
Slide 10: RapidIO Model
- Embedded-systems switched interconnect
- Three-layer architecture
  - Physical, transport, and logical layers
- Project goals
  - Determine the optimal means by which to develop RapidIO for space systems
  - Perform RIO switch, board, and system tradeoff studies
  - Identify limitations of a space-based RIO design
  - Determine design feasibility using an SBR case study
    - GMTI and SAR algorithms
  - Provide assistance for Honeywell proposal efforts
  - Lay the groundwork for future Honeywell system prototyping

Figure: RapidIO four-switch backplane
Slide 11: Experimental Setup
- System configurations
  - 16-node dual 1.4 GHz Opteron cluster, 1 GB RAM per node
  - Red Hat Linux 9 with kernel version 2.4.20-8, patched for InfiniBand support
  - InfiniBand equipment: Voltaire ISR9024 switch and HCA400LPs, Ohio State MPICH
    - Voltaire info at www.voltaire.com
- Experiments
  - InfiniBand validation using ping test (left figure)
    - BW dip at 2048 bytes from protocol switch
    - Investigating experimental BW dip at 8 MB
  - 2-, 4-, and 8-node systems
  - Average of 25 iterations
  - Matrix multiply
    - 3 sizes: 250x250, 500x500, and 1000x1000
  - Bench 12
    - 3 main table sizes: 2^15, 2^20, and 2^25
Slide 12: Results for Matrix Multiply and Bench 12
- Errors in simulative vs. experimental execution times (ETs) range from 0.3% to 16.1% for Matrix Multiply and 0.6% to 21.4% for Bench 12
- Analysis
  - Matrix Multiply
    - Errors grow for smaller dataset sizes as system size grows
      - Accumulation of the errors inherent to each node
      - Small execution times can lead to errors due to deviations between values collected during script creation and experimental measurements
    - Computationally bound
      - Errors decrease as dataset sizes increase
      - Less simulated time spent stimulating the network model (where most error is incurred)
  - Bench 12
    - Errors for small dataset sizes are larger, for reasons similar to those explained for Matrix Multiply
    - More network-dependent than Matrix Multiply
    - Errors acceptable (under 5%) in most cases
    - Collective MPI functions in the program accurately modeled
- Simulation slowdown ratio: simulation time divided by actual execution time on the experimental testbed
  - Ratio ranged from 1.6 to 401 for Matrix Multiply
  - Ratio ranged from 125 to 2500 for Bench 12
- Overall, remarkably fast and accurate simulations!
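The two metrics quoted above reduce to simple arithmetic. The function names below are illustrative; the formulas are as stated on the slide.

```c
/* Percent error between simulated and experimentally measured
 * execution times, relative to the measured time. */
double pct_error(double sim_et, double exp_et)
{
    double diff = sim_et - exp_et;
    if (diff < 0.0)
        diff = -diff;                 /* absolute difference */
    return 100.0 * diff / exp_et;
}

/* Simulation slowdown ratio: simulation wall time over the actual
 * execution time on the testbed (e.g. 401 => 401x slower). */
double slowdown_ratio(double sim_wall_time, double exp_et)
{
    return sim_wall_time / exp_et;
}
```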
Slide 13: RC Model
- The RC arena has been dominated by experimentation, with little done in simulation
- Simulation
  - Predict performance gains on future and more advanced systems
  - Determine optimal workload and data distribution
  - Predict performance in emerging realms of RC
    - Resource management
    - Independent RC fabric communication (i.e., without the processor)
    - Large-scale HPC (e.g., Cray XD1, SRC MAP processor, and SGI Altix with FPGA bricks)
- Current models
  - RC fabric with management unit (red oval) and dynamically created and reconfigured functional units (blue oval)
  - Dynamically created RC fabrics (green oval)
  - Interface with FASE for script support (purple oval)
  - Multiple host processor/fabric interconnects supported
    - The RC node figure illustrates a PCI bus interface (orange oval), but RIO, IBA, SCI, etc. could be plugged in
  - Support for inter-fabric communication
    - Potential for exploration into grid-level use of RC devices

Figures: RC fabric; RC node
Slide 14: Low-Locality Applications Research
- Research motivation
  - Cache effectiveness depends on the locality of the application
  - Programs that exhibit poor locality may actually experience a performance penalty through data caching
    - Overhead associated with moving entire blocks of data into the cache
    - Performance degradation from good data being evicted from the cache and replaced with bad data
- Investigated solutions
  - Static non-caching scheme
    - Based on introducing new load/store instructions into the ISA
    - Memory references known to exhibit poor locality replaced with non-caching loads/stores
  - Dynamic non-caching scheme
    - Insertion of a Dynamic Bypass Table before the data cache
    - Tracks the hit/miss history of instructions
    - After observing consistent misses, instructions are dynamically marked to bypass the cache
    - Does not require editing application code

Figures: static scheme (main loop of Bench 12); dynamic scheme (Dynamic Bypass Table)
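A minimal sketch of a Dynamic Bypass Table of the kind described above, assuming a direct-mapped table of per-PC saturating miss counters. The entry count, miss threshold, and reset-on-hit (un-saturation) policy are illustrative choices, not details of the studied design.

```c
#include <string.h>

#define DBT_ENTRIES    64  /* table size: illustrative, not from the study */
#define MISS_THRESHOLD 3   /* consecutive misses before bypassing          */

/* Direct-mapped Dynamic Bypass Table indexed by low PC bits. */
typedef struct {
    unsigned long pc[DBT_ENTRIES];     /* owning instruction address */
    int           misses[DBT_ENTRIES]; /* saturating miss counter    */
} dbt;

void dbt_init(dbt *t) { memset(t, 0, sizeof *t); }

/* Record the outcome of one load at 'pc'. A conflicting PC steals the
 * entry; a hit resets the counter (one possible un-saturation policy). */
void dbt_record(dbt *t, unsigned long pc, int was_miss)
{
    int i = (int)(pc % DBT_ENTRIES);
    if (t->pc[i] != pc) {
        t->pc[i] = pc;
        t->misses[i] = 0;
    }
    if (was_miss) {
        if (t->misses[i] < MISS_THRESHOLD)
            t->misses[i]++;            /* saturate at the threshold */
    } else {
        t->misses[i] = 0;
    }
}

/* Should the next access from 'pc' bypass the data cache? */
int dbt_bypass(const dbt *t, unsigned long pc)
{
    int i = (int)(pc % DBT_ENTRIES);
    return t->pc[i] == pc && t->misses[i] >= MISS_THRESHOLD;
}
```

Because the table keys on the PC rather than the data address, it captures per-instruction behavior without any change to application code, which is the property the dynamic scheme relies on.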
Slide 15: Low-Locality Applications Research (cont.)
- Simulation results
  - Bench 12 used as the main target application
  - Static and dynamic solutions provide similar performance for Bench 12 (Figure 1)
    - L2 cache misses result in longer access latencies than non-caching accesses
    - The dynamic solution turns all accesses in the main loop into non-cacheable references; the static solution leaves one load as cacheable
  - Several simulation bugs were fixed, though results still show the static solution performing worse as substitution table size increases
    - Discovered that the highest load latency determines how quickly the main loop may be iterated
    - If the cached references of the static solution miss each time, performance will begin to approach that of the unmodified system
    - Due to the nature of Bench 12, caching some references provides no benefit, as at least one necessary operand for each iteration must still be fetched from main memory
      - Waiting on this load to complete each iteration hides any benefit of other fast accesses that complete earlier
    - More complex applications should benefit more from the static approach
- Several possible variations of the dynamic approach are open to investigation
  - Classify instructions based on the reference address, not the PC [Memi03, John97]
  - Monitor other statistics besides, or in addition to, the miss count
  - Allow instructions that have saturated to become un-saturated [Gonz95]

Figure: comparison of static and dynamic schemes
Slide 16: Conclusions
- FASE
  - Pre-simulation process characterizes the application
    - Captures key parameters of communication events (subset of MPI and SHMEM library calls) to drive simulation
    - Non-communication areas use simple timing; relies on a scaling factor to model other computational components
    - Using hardware for timing is FAST
  - Simulation
    - Used to accurately model communication events and scale other events
    - Can use any network model in the FASE library in a system configuration
    - User-definable parameters to customize network settings and computational-unit parameters
- Network models
  - InfiniBand model modified to more accurately follow the InfiniBand standard
  - Current library consists of RapidIO, SCI, InfiniBand, and TCP
- RC model
  - Support for multiple RC nodes, multiple RC fabrics on each node, and multiple reconfigurable functional units in each fabric
  - Model still in its infancy, but results obtained are promising
- Low-locality case study
  - Instruction-based solutions offer improvement for Bench 12
  - Dynamic and static techniques show similar behavior
  - When the substitution table becomes larger than the L2 cache, the dynamic scheme performs better
Slide 17: Future Work
- FASE
  - Support other programming languages
  - More completely support the MPI and SHMEM programming models
  - Devise a scheme to support UPC
  - Model other components
    - Continue enhancing RC models
    - Potential modeling of the memory hierarchy, storage devices, WAN and grid computing components, etc.
  - Optimize existing network models for speed without sacrificing accuracy
    - Implementation and MLD-specific optimizations
  - Run experiments with larger systems
- Low-locality case study
  - Memory address-based classification
  - Extend scalar simulations to include multiprocessor systems
    - Benchmarks intended to be run on parallel machines
    - Extended memory hierarchy adds another layer of complexity to locality
  - Possible integration with FASE; model memory hierarchy in node components
Slide 18: References
- [Gonz95] A. Gonzalez, C. Aliagas, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Department of Computer Architecture, Polytechnic University of Catalonia, Barcelona, 1995.
- [John97] T. Johnson, M. Merten, and W. Hwu, "Run-time Spatial Locality Detection and Optimization," Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, IL, 1997.
- [Memi03] G. Memik, M. Kandemir, A. Choudhary, and I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003.