Title: Modeling and Simulation MS Group
1. Modeling and Simulation (MS) Group
The MS Group was formerly known as the CGS group.
- Modeling and Simulation for Tradeoff Analysis in Advanced Cluster and Grid Systems - Appendix for Q3 Status Report
- DoD Project MDA904-03-R-0507
- February 5, 2004
2. Outline
- Objectives and motivations
- Related research
- FASE overview
- FASE components
- Script generator (Version 2)
- MPI collective communication
- Processor modeling
- InfiniBand and TCP models
- Dynamic application considerations
- Tool evaluation results
- Case study
- Architectural modifications of conventional processors for low-locality applications
- Conclusions and future plans
3. Objectives and Motivations
- High-performance computing involves applications that require parallelization to achieve feasible completion times
- Clusters and grids are used for distributed computing
- Issues with heterogeneity of clusters and grids
- Efficiency of hardware resource usage
- Execution time
- Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
- Nearly impossible and too expensive to determine experimentally
- Simulation tools are required
- Challenges with the simulation approach
- Large, complex systems
- Balancing speed and fidelity
4. Related Research
- Performance Modeling and Characterization (PMaC, developed at the San Diego Supercomputer Center)
- Mission statement: to bring scientific rigor to the prediction and understanding of factors affecting the performance of current and projected HPC platforms
- Funded by the DOE, DoD (NAVO MSRC PET program), DARPA, and NSF
- Follows two rules of thumb
- The memory subsystem dominates per-processor performance
- An application's use of an interconnect dictates the scalability of that application
- Three steps for prediction
- Machine profile: extracts fundamental characteristics of a machine independent of the application
- Generated by Memory Access Pattern Signature (MAPS), a custom program that determines the load and store rates of a machine
- Future architectures can be considered by tweaking simulation parameters
- Application signature: extracts fundamental characteristics of an application independent of the architecture on which it was formed
- Generated using MetaSim Tracer, a custom simulator built on top of the ATOM toolkit for Alpha machines
- Convolution: integrates the machine profile and application signature to predict the performance of a machine
- Produced by MetaSim Convolver, a custom program that maps the application signature onto the machine profile
- MPI trace and convolution result are fed to the Dimemas simulator described below
- Papers [1], [2], and [3] show very good results from this method
- Dimemas (developed by the European Center for Parallelism of Barcelona)
- Analyzes performance of message-passing programs
5. FASE Overview
- FASE: Fast and Accurate Simulation Environment for Clusters and Grids
- Trace tools: evaluated TAU, MPE, Paradyn, SvPablo, and VampirTrace for parallel applications
- Initially chose MPE for communication events; however, it did not provide enough detail on important MPI functions (e.g., MPI_Alltoall and MPI_Bcast)
- Solution: a custom program to instrument source code
- Script generator
- Instrumented source code outputs scripts that can be read by the Script Reader/Processor in MLD
- More details provided in the following slides
- Performance statistics generator
- Generates statistics that characterize the behavior of a device while running a particular program
- Example statistics
- Cache misses, CPI, percentages of instruction types executed, instruction count, disk I/O
- Can be a stand-alone program or part of the trace tool
- Models
- Processors: single and multiprocessors, processors in memory, reconfigurable
- Networks: Ethernet, Myrinet, SCI, InfiniBand, Rapid I/O, HyperTransport, SONET and other optical protocols, TCP/IP
- MPI interface
- Currently supports a small subset of MPI functions: MPI_Send, MPI_Recv, MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Reduce
- Created for each network model in the library
- Implementation: speed vs. fidelity tradeoff
Main components in FASE
FASE component interactions
FASE interfaces
6. Script Generator (Version 2)
- Old script generator
- Required the application to be compiled and run using the MPE libraries
- Read MPE log files and converted them to scripts to drive MLD
- MPE seemed to provide inaccurate numbers when compared to other tools
- Mainly during startup
- New script generator
- Instruments applications
- Supported MPI functions (listed below)
- Timing between MPI functions
- Produces new instrumented source code to be compiled with a standard MPI compiler
- Scripts are generated by running the binary
- Features
- Single-file program: easy to manage and modify
- Supported languages: C and C++
- Supported MPI functions
- MPI_Send, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
- Command-line options
- Input file name
- Output instrumented file name
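The instrumentation pass described above can be sketched as a small source-to-source rewrite. The following Python sketch is a hypothetical illustration, not the group's actual generator: the `fase_log_*` probe names are invented stand-ins for whatever timing/logging calls the real tool emits, and the regex handles only single-line calls.

```python
import re

# MPI calls the hypothetical instrumenter recognizes (per the slide's list).
SUPPORTED = ("MPI_Send", "MPI_Recv", "MPI_Alltoall", "MPI_Bcast", "MPI_Reduce")

def instrument(source: str) -> str:
    """Wrap each supported MPI call with probes that record the compute
    time since the previous event and the call's own duration."""
    # A sketch, not a C parser: matches single-line calls only.
    pattern = re.compile(r"^(\s*)((?:%s)\s*\(.*\)\s*;)" % "|".join(SUPPORTED),
                         re.MULTILINE)
    def wrap(m):
        indent, call = m.group(1), m.group(2)
        return (f"{indent}fase_log_compute();   /* time since last MPI event */\n"
                f"{indent}fase_log_begin();\n"
                f"{indent}{call}\n"
                f"{indent}fase_log_end();       /* duration of the MPI call */")
    return pattern.sub(wrap, source)

example = """\
    compute(a, b);
    MPI_Send(buf, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
"""
print(instrument(example))
```

Running the instrumented binary would then emit one script record per probe, which the Script Reader/Processor in MLD replays.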
7. Tree-based Collective Communication
- FASE now supports both unicast and selected collective communication functions
- MPI_Send, MPI_Recv, MPI_Barrier, MPI_Alltoall, MPI_Reduce, MPI_Bcast
- MPE had limited capabilities for collecting pertinent characteristics of collective communications
- A new script generator was developed to remedy this
- Current implementation
- Breaks each collective function into one or more unicast functions
- Some collective functions are broken up into multiple collective functions (e.g., MPI_Alltoall broken into one or more MPI_Bcast)
- Assumes that the entire MPI_COMM_WORLD is used as the group in the collective function calls
- These algorithms will be leveraged for SHMEM and UPC collective communications as well
- Green oval: original MPI interface
- Red oval: new additions to support collective communications
- Blue circle (inside red oval): critical module with the algorithms used for each supported collective function
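The decomposition of a collective into unicasts can be illustrated with a standard binomial-tree schedule. The slides do not say which tree algorithm the critical module implements, so this Python sketch simply shows the common binomial pattern for a broadcast, listing the (source, destination) send pairs per round:

```python
def bcast_tree(nprocs: int, root: int = 0):
    """Return the (source, destination) unicast pairs, round by round,
    for a binomial-tree broadcast over `nprocs` ranks."""
    rounds = []
    mask = 1
    while mask < nprocs:
        pairs = []
        for r in range(nprocs):
            rel = (r - root) % nprocs          # rank relative to the root
            # Ranks already holding the data forward it one "mask" away.
            if rel < mask and rel + mask < nprocs:
                pairs.append((r, (r + mask) % nprocs))
        rounds.append(pairs)
        mask <<= 1
    return rounds

# With 8 ranks rooted at 0: 3 rounds, doubling the informed set each round.
for i, pairs in enumerate(bcast_tree(8)):
    print(f"round {i}: {pairs}")
```

For p ranks this schedule issues p - 1 unicasts in ceil(log2 p) rounds, which is why a tree decomposition is preferred over a naive root-sends-to-all loop.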
8. Processor Modeling
- Statistical modeling [5]
- Splits the application into code chunks, forming an execution profile in which specific paths are given a probability
- Convolution method [1, 2, 3]
- Determines a scaling factor based on the assumption that memory is the main component of an application's execution time
- Extracts the application signature and machine profile and relates the two using an algebraic formula
- Three steps for prediction
- Machine profile: extracts fundamental characteristics of a machine independent of the application
- Application signature: extracts fundamental characteristics of an application independent of the architecture on which it was formed
- Convolution: integrates the machine profile and application signature to predict the performance of a machine
- Bounded method (new approach)
- Classify the application
- Memory bound, I/O bound, CPU bound, etc.
- Use Kojak, Paradyn, or another tool to help classify the application
- Use specific benchmarks to capture machine characteristics
- Based on the application classification
- Use any high-fidelity simulator or the actual machine (if in possession)
- Run the specific benchmark on the machine used to gather the trace
- Scaling factor obtained by dividing the modeled time by the trace machine time
- Do once and form a database
- Use execution time gathered by the trace program
- CON: the bound can change from machine to machine
- Simulation method (new approach)
- Run the actual application using a processor simulator
- Simulate only portions of code with no communication
- Could require source code modification
- Need to relate the uniprocessor simulation to the actual parallel application
- Use a scaling factor
Application classification using Kojak [6]
9. InfiniBand Model Design
- InfiniBand ports and Queue Pairs
- Multiple IBA ports increase the bandwidth of a single channel adapter and operate at the Network Layer and below
- Every port has at least two QPs
- More QPs may be present per adapter
- QPs operate at the Transport Layer
- Each QP generates request packets, services returning response packets, and responds to arriving request packets through exactly one port at any point in time
- Each connected or reliable-transport QP remains bound to a single port until path migration for error recovery and/or load balancing occurs or the connection goes away
Courtesy of [7]
- Abbreviations used in InfiniBand
- IBA: InfiniBand Architecture; CA: Channel Adapter
- HCA: Host Channel Adapter; TCA: Target Channel Adapter
- QP: Queue Pair; VL: Virtual Lane; SM: Subnet Manager
- Host Channel Adapter (HCA)
- Resides in the host processor node and connects the host to the IBA fabric
- Functions as the interface to consumer processes (the operating system's message and data service)
- Queue Pair details
- Work queues: one for send and one for receive make up a QP
- The Send Work Queue contains instructions that cause data to be transferred from one consumer's memory to another's
- The Receive Work Queue holds instructions describing where to place data received from a remote consumer
- Currently only the Receive Queues have been modeled
- Subnet Manager (SM)
- Responsible for communication establishment and connection management between end-nodes
- Monitors and reports well-defined performance counters; for our model, we have specified the performance measure to be queue load
10. InfiniBand HCA Model
- Components that will complete modeling of the HCA
- Two-way communication, including the Send and Receive Queues
- Completion of the SM architecture
- Introduction of appropriate delays based on service instructions such as Send, RDMA Read, RDMA Write, etc.
- HCA model description
- A single-directional HCA has been modeled so far; i.e., InfiniBand packets travel from ports through VLs and finally through QPs before reaching consumers, thus simulating the Receive Queues
- A bidirectional HCA will have the very same components but connected in the opposite direction, allowing flow of packets from consumers to ports in addition to the existing direction, thus simulating both the Send and Receive Queues
- Components I, II, and III shown in the figure simulate the sets of ports, VLs, and QPs respectively, modeled as FIFO queues
- II determines VLID validity; III assigns packets to the appropriate QPs
- IV simulates the SM Agent, which, based on statistics collected from I, II, and III, calculates the least congested paths when necessary
- V is a memory pool that stores queue statistics for IV
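The component structure just described (FIFO stages consulted by an SM Agent through a statistics pool) can be sketched abstractly. This is not the MLD model itself; the class names and the two-port configuration are assumptions chosen for illustration:

```python
from collections import deque

class Stage:
    """One FIFO component of the receive path; components I (ports),
    II (VLs), and III (QPs) are each a set of these."""
    def __init__(self, name):
        self.name, self.fifo = name, deque()
    def push(self, pkt): self.fifo.append(pkt)
    def load(self):      return len(self.fifo)

stats = {}                            # V: memory pool of queue statistics
def sm_agent(stages):                 # IV: SM Agent picks the least-congested stage
    for s in stages:
        stats[s.name] = s.load()      # snapshot loads into the pool
    return min(stages, key=Stage.load)

ports = [Stage(f"port{i}") for i in range(2)]   # hypothetical 2-port HCA
for pkt in range(5):
    sm_agent(ports).push(pkt)                   # route via the lightest queue
print(sorted((s.name, s.load()) for s in ports))
```

Because the agent always routes to the least-loaded queue, the port loads stay balanced, which is the behavior the "queue load" performance measure is meant to expose.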
- Modeling results and experiments
- HCAs have been modeled for one-way communication; links, traffic sources, and queue analyzers have been modeled for this
- Simulation results will be collected once the model is complete, as mentioned above
- Functional testing of QPs and validation testing using experimental benchmark tests on the testbed from InfiniCon, comprising an InfinIO 2000 (8-port) InfiniBand switch and HCAs, will be done next
- TCAs are specialized HCAs and will not be modeled
11. TCP Model
- TCP model overview
- Abstract model can be placed between an arbitrary application layer and data link layer
- A large number of parameters allow customization and experimentation
- Recent modifications
- Model has been reworked to more accurately reflect real TCP performance
- More accurate timing
- Modified parameters to more closely resemble real TCP
- Validated using Gigabit Ethernet
- Netperf stream test
- Two dual 2.4 GHz Xeons as hosts
- Switched by a Cisco Catalyst 4503
- Interfaced with MPI for the FASE framework
- Easily extendable to further collective communication models
- FASE TCP closely simulates TCP over Gigabit Ethernet via models
- Both ramp up exponentially
- Both eventually level off at about 75% of line rate to prevent flooding the network
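The ramp-up/level-off shape observed in both the real and simulated curves can be captured by a toy throughput model: exponential growth (as in TCP slow start) that saturates at about 75% of line rate. The step count and starting rate below are arbitrary illustration values, not parameters of the FASE model:

```python
def tcp_throughput(line_rate_gbps=1.0, cap=0.75, start=0.01, steps=12):
    """Abstract ramp: throughput doubles each interval until it reaches
    `cap` of line rate, where it levels off."""
    rate, curve = start, []
    for _ in range(steps):
        curve.append(min(rate, cap * line_rate_gbps))
        rate *= 2                      # exponential ramp-up
    return curve

curve = tcp_throughput()
print([round(r, 3) for r in curve])    # rises 0.01, 0.02, ... then holds at 0.75
```

Real TCP's steady state also oscillates with congestion-avoidance sawtooth behavior; the abstract model trades that detail for simulation speed, which is the speed-vs-fidelity tradeoff noted in the FASE overview.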
12. Dynamic Applications
- Solutions to be explored
- Model program dynamics in MLD
- In-depth study approach
- Requires detailed knowledge of the program to be modeled
- Each new application must be dissected for dynamic behavior
- Implementation/integration is simple once the dynamic portions of the program are identified and understood
- Leads to an accurate, host/topology-portable study of a specific application
- Use the Berkeley sockets library in MLD to interface MLD with the host machine
- Dynamic control does not have to be modeled, yet the approach is highly accurate
- The program physically runs on the machine, but the network is simulated in MLD
- Allows the host to respond dynamically to simulated network performance
- One-time library development followed by unlimited portability to new applications
- Host simulation constrained by available processors/architectures
- SimpleScalar interface
- Extend the Berkeley sockets library to interface with SimpleScalar
- Very small development cost inside MLD
- Leverage previous work in SimpleScalar to make the necessary adaptations
- Allows modeling of emerging and prohibitively expensive processors
- Host/architecture simulation is not constrained by available resources
Chameleon: adapting to its environment
13. Dynamic Applications
- Berkeley sockets solution
- Redirect application data into MLD (decode the packets to get destinations and sizes)
- Build data structures to represent packets
- Send MLD data structures through the simulated network
- Pass data back to the receiving application
- Challenges
- Intercepting program communication calls to MLD
- Accounting for the delay incurred by redirecting packets to MLD and running them through the simulation
Other resources are also simulated: RAID storage units, RC units, etc.
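The redirection steps above can be sketched as a stand-in network object that an intercepted send() would call into. This is illustrative only: the real interface would wrap the Berkeley sockets calls and hand packets to MLD, whereas here the "network" is a plain event queue with a fixed, assumed latency:

```python
import heapq

class SimNetwork:
    """Minimal stand-in for the MLD network model: an event queue that
    delivers packets after a fixed modeled latency."""
    def __init__(self, latency_s=0.001):
        self.latency, self.clock, self.events = latency_s, 0.0, []

    def send(self, dest, payload):
        # Decode destination and size, build the packet data structure,
        # and schedule delivery on the simulated network.
        pkt = {"dest": dest, "size": len(payload), "data": payload}
        heapq.heappush(self.events, (self.clock + self.latency, id(pkt), pkt))

    def deliver(self):
        t, _, pkt = heapq.heappop(self.events)
        self.clock = t          # charge the simulated delay before handing back
        return pkt

net = SimNetwork()
net.send(dest=3, payload=b"hello")   # intercepted in place of a socket send()
pkt = net.deliver()                  # passed back to the receiving application
print(pkt["dest"], pkt["size"], net.clock)
```

The second challenge on the slide shows up in `deliver()`: the simulated delay must be charged back to the application without also counting the wall-clock cost of the detour through the simulator.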
14. Performance Analysis Tool Evaluation
- Analysis of tracing and profiling tools for characterization of a program in an architecture-independent manner
- Tools evaluated
- PAPI: used by several other performance analysis tools as a standard API for accessing processor performance event counters
- Perfometer: graphical view of PAPI data
- Real-time view of event counter data
- CON: can only display one metric at a time
- SvPablo: captures and analyzes performance data that reflects the interaction of hardware and software
- PAPI support allows capture of hardware performance counter events such as cache misses, floating-point instructions, and branch instructions
- Collects statistical data on function calls and loops, such as mean, max, min, loop/function duration, and task counts
- Log file is portable, extensible, compact, and promotes scalability
- CON: no correlation between time and data
- Vampir: graphically analyzes runtime event traces produced by MPI applications
- Collects both statistical and trace data
- Monitors communication transmission
- Log file allows fast random access and easy extraction of data
- CON: provides no information on hardware performance event counters
- Paradyn: allows dynamic performance analysis
- Supports dynamic instrumentation
- Uses the application binary file, so source code is not needed
- Allows automated searches for performance bottlenecks
- CON: provides limited statistical analysis of data
- CON: provides no information on hardware performance event counters
- SvPablo (with PAPI) and Vampir
- Provide the best tool combination for determining an architecture-independent characterization of a program
- Can allow different architectures or combinations of architectures to be plugged into a simulation
- Options
- Leverage these tools to extract important information
- Incorporate techniques used by the tools into the new script generator
- Use to classify applications for the bounded method for processor modeling
15. Case Study: Architectural Modifications of Conventional Processors for Low-locality Applications
- Introduction
- Current memory hierarchies are built to take advantage of applications that exhibit good locality in their data streams
- Low-locality applications do not use the memory hierarchy efficiently
- The overhead involved in data caching could be detrimental to the performance of such applications
- Simulation of alternate memory configurations and caching techniques will identify practical architecture modifications to enhance memory performance
- Simulations done using SimpleScalar
- C-based execution-driven simulator; simulates a 64-bit out-of-order processor
- SimpleScalar allows modeling of arbitrary memory hierarchy configurations
Figure courtesy of [9]
- Approach
- Use ideas adopted from effective approaches taken in the related literature
- Determine which modifications produce the best performance gain for the provided benchmarks
- Focus on small tweaks and added features as opposed to radically different architectures
- SimpleScalar's provided compiler produces assembly-level code that can be edited
- New load/store instructions added to the ISA, in addition to simulated hardware modifications
- Simulation measurements include execution times, data cache miss rates at each level, average memory access times, and various other statistics
- Baseline results from each benchmark run on the unaltered simulator will be compared with results obtained after various simulator modifications
16. Case Study: Possible Solutions
- Instruction- or data-address-based cache bypassing
- Memory accesses of program segments that exhibit poor locality will be made to bypass the cache entirely
- Reduces unnecessary overhead and helps ensure that useful data in the cache is not replaced with data that should not be cached
- Accomplished by manually examining code or by memory access pattern profiling
- Multithreading
- Newer and future processors contain enough functional units to execute multiple programs concurrently
- Also possible to run multiple instances of the same program
- Cache misses and stalls are not avoided; instead, they are hidden to some degree by the existence of more operational functional units
- Intelligent prefetching
- Reference Prediction Tables record the most recent misses of a given instruction
- Using the stride, the distance between misses, and other factors, a block of memory likely to contain a future access is identified
- Just before the instruction is executed, this block is retrieved from main memory
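The Reference Prediction Table idea can be sketched in a few lines: track the last miss address and stride per instruction (PC), and predict the next block once the stride repeats. The 64-byte block size and the repeat-once confirmation rule are assumptions for illustration, not the case study's actual design:

```python
class ReferencePredictionTable:
    """Per-instruction stride predictor: remembers the last miss address
    and stride for each load PC and predicts the next block to prefetch."""
    def __init__(self, block=64):
        self.block, self.table = block, {}

    def miss(self, pc, addr):
        last, stride = self.table.get(pc, (None, None))
        new_stride = None if last is None else addr - last
        self.table[pc] = (addr, new_stride)
        # Predict only once the stride has repeated (steady state).
        if new_stride is not None and new_stride == stride:
            return (addr + new_stride) // self.block * self.block
        return None

rpt = ReferencePredictionTable()
trace = [0x1000, 0x1100, 0x1200, 0x1300]   # hypothetical strided misses
for a in trace:
    print(hex(a), "->", rpt.miss(pc=0x400a10, addr=a))
```

After two misses establish the 0x100 stride, each further miss yields the block address to fetch ahead of time; an irregular access pattern would keep resetting the stride and suppress predictions, which is the desired behavior for avoiding useless prefetches.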
17. Conclusions and Future Plans
- Script generator
- Custom built to support MPI functions
- SHMEM extensions: short-term goal
- UPC support: long-term goal
- Difficulty determining how to support programming models
- Processor modeling
- Timing and scaling factor vs. instruction count
- PAPI implementation to be explored
- Bounded and simulation methods will be explored
- Network models
- InfiniBand model
- Preliminary model is almost complete
- Run simulations and experiments to gather results
- TCP model
- Tweak parameters for accurate results
- MPI interface and simulation results
- Dynamic applications
- Work with the Berkeley sockets interface for HWIL
- Feasibility study of other options
- Low-locality case study
- SimpleScalar modifications to study the results of the different techniques introduced
- Simulate memory architecture modifications and measure performance gains on benchmarks
- Suggest practical solutions to increase memory performance on the supplied low-locality benchmarks
- Future potential of FASE
- Expand the FASE framework to incorporate other simulated resources
- Reconfigurable devices, storage devices, WAN and grid computing components, etc.
Refer to the RC group's Q3 status report for more details
18. References
- [1] L. Carrington, A. Snavely, X. Gao, and N. Wolter, "A Performance Prediction Framework for Scientific Applications," Workshop on Performance Modeling, ICCS, Melbourne, June 2003.
- [2] A. Snavely, N. Wolter, and L. Carrington, "Modeling Application Performance by Convolving Machine Signatures with Application Profiles," IEEE 4th Annual Workshop on Workload Characterization, Austin, December 2001.
- [3] M. McCracken, A. Snavely, and A. Malony, "Performance Modeling for Dynamic Algorithm Selection," Workshop on Performance Modeling, ICCS, Melbourne, June 2003.
- [4] R. Badia, J. Labarta, J. Gimenez, and F. Escale, "DIMEMAS: Predicting MPI Applications' Behavior in Grid Environments," CEPBA-IBM Research Institute.
- [5] D. Noonburg and J. Shen, "A Framework for Statistical Modeling of Superscalar Processor Performance," HPCA 1997.
- [6] F. Wolf and B. Mohr, "Automatic Performance Analysis of Hybrid MPI/OpenMP Applications," Forschungszentrum Jülich, Germany.
- [7] CrossRoads Systems Inc., "Introduction to the InfiniBand (SM) Architecture," Version 1.3, updated 02/04/2000, http://www.rioworks.co.jp/productsinfo/InfiniBand01.pdf
- [8] T.M. Pinkston, A.F. Benner, M. Krause, I.M. Robinson, and T. Sterling, "InfiniBand: The De Facto Future Standard for System and Local Area Networks or Just a Scalable Replacement for PCI Buses?" Cluster Computing, Vol. 6, No. 2, April 2003, pp. 95-105.
- [9] S. Chang, "Performance Profiling and Optimization on the SGI Origins," Numerical Aerospace Simulation Facility, NASA, 2001.
- [10] G. Memik, M. Kandemir, A. Choudhary, and I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, IEEE, 2003.
- [11] W. Wong, "Hardware Techniques for Data Placement," Dept. of Computer Science and Engineering, University of Washington, 1997.