Title: Modeling and Simulation MS Group
1. Modeling and Simulation (MS) Group
The MS Group was formerly known as the CGS group.
- Modeling and Simulation for Tradeoff Analysis in Advanced Cluster and Grid Systems - Appendix for Q3 Status Report
- DoD Project MDA904-03-R-0507
- February 5, 2004
2. Outline
- Objectives and motivations
- Related research
- FASE overview
- FASE components
- Script generator (Version 2)
- MPI collective communication
- Processor modeling
- InfiniBand and TCP models
- Dynamic application considerations
- Tool evaluation results
- Case study
- Architectural modifications of conventional processors for low-locality applications
- Conclusions and future plans
3. Objectives and Motivations
- High-performance computing involves applications that require parallelization to achieve feasible completion times
- Clusters and grids are used for distributed computing
- Issues with heterogeneity of clusters and grids
- Efficiency of hardware resource usage
- Execution time
- Overhead
- Challenge: find the optimum configuration of resources and task distribution for key applications under study
- Nearly impossible and too expensive to determine experimentally
- Simulation tools are required
- Challenges with the simulation approach
- Large, complex systems
- Balancing speed and fidelity
4. Related Research
- Performance Modeling and Characterization (PMaC, developed at the San Diego Supercomputer Center)
- Mission statement: to bring scientific rigor to the prediction and understanding of factors affecting the performance of current and projected HPC platforms
- Funded by the DOE, DoD (NAVO MSRC PET program), DARPA, and NSF
- Follows two rules of thumb
- The memory subsystem dominates per-processor performance
- An application's use of an interconnect dictates the scalability of that application
- Three steps for prediction
- Machine profile: extracts fundamental characteristics of a machine independent of the application
- Generated by Memory Access Pattern Signature (MAPS), a custom program that determines the load and store rates of a machine
- Future architectures can be considered by tweaking simulation parameters
- Application signature: extracts fundamental characteristics of an application independent of the architecture on which it was formed
- Generated using MetaSim Tracer, a custom simulator built on top of the ATOM toolkit for Alpha machines
- Convolution: integrates the machine profile and application signature to predict the performance of a machine
- Produced by MetaSim Convolver, a custom program that maps the application signature onto the machine profile
- MPI trace and convolution result are fed to the Dimemas simulator described below
- Papers [1], [2], and [3] show very good results from this method
- Dimemas (developed by the European Center for Parallelism of Barcelona)
- Analyzes performance of message-passing programs
5. FASE Overview
- FASE: Fast and Accurate Simulation Environment for Clusters and Grids
- Trace tools: evaluated TAU, MPE, Paradyn, SvPablo, and VampirTrace for parallel applications
- Initially chose MPE for communication events; however, it did not provide enough detail on important MPI functions (e.g., MPI_Alltoall and MPI_Bcast)
- Solution: a custom program to instrument source code
- Script generator
- Instrumented source code outputs scripts that can be read by the Script Reader/Processor in MLD
- More details provided in the following slides
- Performance statistics generator
- Generates statistics that characterize the behavior of a device while running a particular program
- Example statistics
- Cache misses, CPI, percentages of instruction types executed, instruction count, disk I/O
- Can be a stand-alone program or part of the trace tool
- Models
- Processors: single and multiprocessors, processors in memory, reconfigurable
- Networks: Ethernet, Myrinet, SCI, InfiniBand, Rapid I/O, HyperTransport, SONET and other optical protocols, TCP/IP
- MPI interface
- Currently supports a small subset of MPI functions: MPI_Send, MPI_Recv, MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Reduce
- Created for each network model in the library
- Implementation: speed vs. fidelity tradeoff
Main components in FASE
FASE component interactions
FASE interfaces
6. Script Generator (Version 2)
- Old script generator
- Required the application to be compiled and run using the MPE libraries
- Read MPE log files and converted them to scripts to drive MLD
- MPE seemed to provide inaccurate numbers when compared to other tools
- Mainly during startup
- New script generator
- Instruments applications
- Supported MPI functions (listed below)
- Timing between MPI functions
- Produces new instrumented source code to be compiled with a standard MPI compiler
- Scripts are generated by running the binary
- Features
- Single-file program: easy to manage and modify
- Supported languages: C and C++
- Supported MPI functions
- MPI_Send, MPI_Recv, MPI_Alltoall, MPI_Bcast, MPI_Reduce
- Command-line options
- Input file name
- Output instrumented file name
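The instrumentation pass described above can be sketched as a small source-to-source rewrite. The following Python sketch is a hypothetical illustration, not the group's actual generator: the `fase_log_*` probe names are invented stand-ins for whatever timing/logging calls the real tool emits, and the regex handles only single-line calls.

```python
import re

# MPI calls the hypothetical instrumenter recognizes (per the slide's list).
SUPPORTED = ("MPI_Send", "MPI_Recv", "MPI_Alltoall", "MPI_Bcast", "MPI_Reduce")

def instrument(source: str) -> str:
    """Wrap each supported MPI call with probes that record the compute
    time since the previous event and the call's own duration."""
    # A sketch, not a C parser: matches single-line calls only.
    pattern = re.compile(r"^(\s*)((?:%s)\s*\(.*\)\s*;)" % "|".join(SUPPORTED),
                         re.MULTILINE)
    def wrap(m):
        indent, call = m.group(1), m.group(2)
        return (f"{indent}fase_log_compute();   /* time since last MPI event */\n"
                f"{indent}fase_log_begin();\n"
                f"{indent}{call}\n"
                f"{indent}fase_log_end();       /* duration of the MPI call */")
    return pattern.sub(wrap, source)

example = """\
    compute(a, b);
    MPI_Send(buf, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
"""
print(instrument(example))
```

Running the instrumented binary would then emit one script record per probe, which the Script Reader/Processor in MLD replays.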
7. Tree-based Collective Communication
- FASE now supports both unicast and selected collective communication functions
- MPI_Send, MPI_Recv, MPI_Barrier, MPI_Alltoall, MPI_Reduce, MPI_Bcast
- MPE had limited capabilities for collecting pertinent characteristics of collective communications
- A new script generator was developed to remedy this
- Current implementation
- Breaks each collective function into one or more unicast functions
- Some collective functions are broken up into multiple collective functions (e.g., MPI_Alltoall broken into one or more MPI_Bcast)
- Assumes that the entire MPI_COMM_WORLD is used as the group in the collective function calls
- These algorithms will be leveraged for SHMEM and UPC collective communications as well
- Green oval: original MPI interface
- Red oval: new additions to support collective communications
- Blue circle (inside red oval): critical module with the algorithms used for each supported collective function
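The decomposition of a collective into unicasts can be illustrated with a standard binomial-tree schedule. The slides do not say which tree algorithm the critical module implements, so this Python sketch simply shows the common binomial pattern for a broadcast, listing the (source, destination) send pairs per round:

```python
def bcast_tree(nprocs: int, root: int = 0):
    """Return the (source, destination) unicast pairs, round by round,
    for a binomial-tree broadcast over `nprocs` ranks."""
    rounds = []
    mask = 1
    while mask < nprocs:
        pairs = []
        for r in range(nprocs):
            rel = (r - root) % nprocs          # rank relative to the root
            # Ranks already holding the data forward it one "mask" away.
            if rel < mask and rel + mask < nprocs:
                pairs.append((r, (r + mask) % nprocs))
        rounds.append(pairs)
        mask <<= 1
    return rounds

# With 8 ranks rooted at 0: 3 rounds, doubling the informed set each round.
for i, pairs in enumerate(bcast_tree(8)):
    print(f"round {i}: {pairs}")
```

For p ranks this schedule issues p - 1 unicasts in ceil(log2 p) rounds, which is why a tree decomposition is preferred over a naive root-sends-to-all loop.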
8. Processor Modeling
- Statistical modeling [5]
- Splits the application into code chunks, forming an execution profile in which specific paths are given a probability
- Convolution method [1, 2, 3]
- Determines a scaling factor based on the assumption that memory is the main component of an application's execution time
- Extracts the application signature and machine profile and relates the two using an algebraic formula
- Three steps for prediction
- Machine profile: extracts fundamental characteristics of a machine independent of the application
- Application signature: extracts fundamental characteristics of an application independent of the architecture on which it was formed
- Convolution: integrates the machine profile and application signature to predict the performance of a machine
- Bounded method (new approach)
- Classify the application
- Memory bound, I/O bound, CPU bound, etc.
- Use Kojak, Paradyn, or another tool to help classify the application
- Use specific benchmarks to capture machine characteristics
- Based on the application classification
- Use any high-fidelity simulator or the actual machine (if in possession)
- Run the specific benchmark on the machine used to gather the trace
- Scaling factor obtained by dividing the modeled time by the trace machine time
- Do once and form a database
- Use execution time gathered by the trace program
- CON: the bound can change from machine to machine
- Simulation method (new approach)
- Run the actual application using a processor simulator
- Simulate only portions of code with no communication
- Could require source code modification
- Need to relate the uniprocessor simulation to the actual parallel application
- Use a scaling factor
Application classification using Kojak [6]
9. InfiniBand Model Design
- InfiniBand ports and Queue Pairs
- Multiple IBA ports increase the bandwidth of a single channel adapter and operate at the Network Layer and below
- Every port has at least two QPs
- More QPs may be present per adapter
- QPs operate at the Transport Layer
- Each QP generates request packets, services returning response packets, and responds to arriving request packets through exactly one port at any point in time
- Each connected or reliable-transport QP remains bound to a single port until path migration for error recovery and/or load balancing occurs or the connection goes away
Courtesy of [7]
- Abbreviations used in InfiniBand
- IBA: InfiniBand Architecture; CA: Channel Adapter
- HCA: Host Channel Adapter; TCA: Target Channel Adapter
- QP: Queue Pair; VL: Virtual Lane; SM: Subnet Manager
- Host Channel Adapter (HCA)
- Resides in the host processor node and connects the host to the IBA fabric
- Functions as the interface to consumer processes (the operating system's message and data service)
- Queue Pair details
- Work queues: one for send and one for receive make up a QP
- The Send Work Queue contains instructions that cause data to be transferred from one consumer's memory to another's
- The Receive Work Queue holds instructions describing where to place data received from a remote consumer
- Currently only the Receive Queues have been modeled
- Subnet Manager (SM)
- Responsible for communication establishment and connection management between end-nodes
- Monitors and reports well-defined performance counters; for our model, we have specified the performance measure to be queue load
10. InfiniBand HCA Model
- Components that will complete modeling of the HCA
- Two-way communication, including the Send and Receive Queues
- Completion of the SM architecture
- Introduction of appropriate delays based on service instructions such as Send, RDMA Read, RDMA Write, etc.
- HCA model description
- A single-directional HCA has been modeled so far; i.e., InfiniBand packets travel from ports through VLs and finally through QPs before reaching consumers, thus simulating the Receive Queues
- A bidirectional HCA will have the very same components but connected in the opposite direction, allowing flow of packets from consumers to ports in addition to the existing direction, thus simulating both the Send and Receive Queues
- Components I, II, and III shown in the figure simulate the sets of ports, VLs, and QPs respectively, modeled as FIFO queues
- II determines VLID validity; III assigns packets to the appropriate QPs
- IV simulates the SM Agent, which, based on statistics collected from I, II, and III, calculates the least congested paths when necessary
- V is a memory pool that stores queue statistics for IV
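The component structure just described (FIFO stages consulted by an SM Agent through a statistics pool) can be sketched abstractly. This is not the MLD model itself; the class names and the two-port configuration are assumptions chosen for illustration:

```python
from collections import deque

class Stage:
    """One FIFO component of the receive path; components I (ports),
    II (VLs), and III (QPs) are each a set of these."""
    def __init__(self, name):
        self.name, self.fifo = name, deque()
    def push(self, pkt): self.fifo.append(pkt)
    def load(self):      return len(self.fifo)

stats = {}                            # V: memory pool of queue statistics
def sm_agent(stages):                 # IV: SM Agent picks the least-congested stage
    for s in stages:
        stats[s.name] = s.load()      # snapshot loads into the pool
    return min(stages, key=Stage.load)

ports = [Stage(f"port{i}") for i in range(2)]   # hypothetical 2-port HCA
for pkt in range(5):
    sm_agent(ports).push(pkt)                   # route via the lightest queue
print(sorted((s.name, s.load()) for s in ports))
```

Because the agent always routes to the least-loaded queue, the port loads stay balanced, which is the behavior the "queue load" performance measure is meant to expose.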
- Modeling results and experiments
- HCAs have been modeled for one-way communication; links, traffic sources, and queue analyzers have been modeled for this
- Simulation results will be collected once the model is complete, as mentioned above
- Functional testing of QPs and validation testing using experimental benchmark tests on the testbed from InfiniCon, comprising an InfinIO 2000 (8-port) InfiniBand switch and HCAs, will be done next
- TCAs are specialized HCAs and will not be modeled
11. TCP Model
- TCP model overview
- Abstract model can be placed between an arbitrary application layer and data link layer
- A large number of parameters allow customization and experimentation
- Recent modifications
- Model has been reworked to more accurately reflect real TCP performance
- More accurate timing
- Modified parameters to more closely resemble real TCP
- Validated using Gigabit Ethernet
- Netperf stream test
- Two dual 2.4 GHz Xeons as hosts
- Switched by a Cisco Catalyst 4503
- Interfaced with MPI for the FASE framework
- Easily extendable to further collective communication models
- FASE TCP closely simulates TCP over Gigabit Ethernet via models
- Both ramp up exponentially
- Both eventually level off at about 75% of line rate to prevent flooding the network
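The ramp-up/level-off shape observed in both the real and simulated curves can be captured by a toy throughput model: exponential growth (as in TCP slow start) that saturates at about 75% of line rate. The step count and starting rate below are arbitrary illustration values, not parameters of the FASE model:

```python
def tcp_throughput(line_rate_gbps=1.0, cap=0.75, start=0.01, steps=12):
    """Abstract ramp: throughput doubles each interval until it reaches
    `cap` of line rate, where it levels off."""
    rate, curve = start, []
    for _ in range(steps):
        curve.append(min(rate, cap * line_rate_gbps))
        rate *= 2                      # exponential ramp-up
    return curve

curve = tcp_throughput()
print([round(r, 3) for r in curve])    # rises 0.01, 0.02, ... then holds at 0.75
```

Real TCP's steady state also oscillates with congestion-avoidance sawtooth behavior; the abstract model trades that detail for simulation speed, which is the speed-vs-fidelity tradeoff noted in the FASE overview.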
12. Dynamic Applications
- Solutions to be explored
- Model program dynamics in MLD
- In-depth study approach
- Requires detailed knowledge of the program to be modeled
- Each new application must be dissected for dynamic behavior
- Implementation/integration is simple once the dynamic portions of the program are identified and understood
- Leads to an accurate, host/topology-portable study of a specific application
- Use the Berkeley sockets library in MLD to interface MLD with the host machine
- Dynamic control does not have to be modeled, yet the approach is highly accurate
- The program physically runs on the machine, but the network is simulated in MLD
- Allows the host to respond dynamically to simulated network performance
- One-time library development followed by unlimited portability to new applications
- Host simulation constrained by available processors/architectures
- SimpleScalar interface
- Extend the Berkeley sockets library to interface with SimpleScalar
- Very small development cost inside MLD
- Leverage previous work in SimpleScalar to make the necessary adaptations
- Allows modeling of emerging and prohibitively expensive processors
- Host/architecture simulation is not constrained by available resources
Chameleon: adapting to its environment
13. Dynamic Applications
- Berkeley sockets solution
- Redirect application data into MLD (decode the packets to get destinations and sizes)
- Build data structures to represent packets
- Send MLD data structures through the simulated network
- Pass data back to the receiving application
- Challenges
- Intercepting program communication calls to MLD
- Accounting for the delay incurred by redirecting packets to MLD and running them through the simulation
Other resources are also simulated: RAID storage units, RC units, etc.
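The redirection steps above can be sketched as a stand-in network object that an intercepted send() would call into. This is illustrative only: the real interface would wrap the Berkeley sockets calls and hand packets to MLD, whereas here the "network" is a plain event queue with a fixed, assumed latency:

```python
import heapq

class SimNetwork:
    """Minimal stand-in for the MLD network model: an event queue that
    delivers packets after a fixed modeled latency."""
    def __init__(self, latency_s=0.001):
        self.latency, self.clock, self.events = latency_s, 0.0, []

    def send(self, dest, payload):
        # Decode destination and size, build the packet data structure,
        # and schedule delivery on the simulated network.
        pkt = {"dest": dest, "size": len(payload), "data": payload}
        heapq.heappush(self.events, (self.clock + self.latency, id(pkt), pkt))

    def deliver(self):
        t, _, pkt = heapq.heappop(self.events)
        self.clock = t          # charge the simulated delay before handing back
        return pkt

net = SimNetwork()
net.send(dest=3, payload=b"hello")   # intercepted in place of a socket send()
pkt = net.deliver()                  # passed back to the receiving application
print(pkt["dest"], pkt["size"], net.clock)
```

The second challenge on the slide shows up in `deliver()`: the simulated delay must be charged back to the application without also counting the wall-clock cost of the detour through the simulator.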
14. Performance Analysis Tool Evaluation
- Analysis of tracing and profiling tools for characterization of a program in an architecture-independent manner
- Tools evaluated
- PAPI: used by several other performance analysis tools as a standard API for accessing processor performance event counters
- Perfometer: graphical view of PAPI data
- Real-time view of event counter data
- CON: can only display one metric at a time
- SvPablo: captures and analyzes performance data that reflects the interaction of hardware and software
- PAPI support allows capture of hardware performance counter events such as cache misses, floating-point instructions, and branch instructions
- Collects statistical data on function calls and loops, such as mean, max, min, loop/function duration, and task counts
- Log file is portable, extensible, compact, and promotes scalability
- CON: no correlation between time and data
- Vampir: graphically analyzes runtime event traces produced by MPI applications
- Collects both statistical and trace data
- Monitors communication transmission
- Log file allows fast random access and easy extraction of data
- CON: provides no information on hardware performance event counters
- Paradyn: allows dynamic performance analysis
- Supports dynamic instrumentation
- Uses the application binary file, so source code is not needed
- Allows automated searches for performance bottlenecks
- CON: provides limited statistical analysis of data
- CON: provides no information on hardware performance event counters
- SvPablo (with PAPI) and Vampir
- Provide the best tool combination for determining an architecture-independent characterization of a program
- Can allow different architectures or combinations of architectures to be plugged into a simulation
- Options
- Leverage these tools to extract important information
- Incorporate techniques used by the tools into the new script generator
- Use to classify applications for the bounded method for processor modeling
15. Case Study: Architectural Modifications of Conventional Processors for Low-locality Applications
- Introduction
- Current memory hierarchies are built to take advantage of applications that exhibit good locality in their data streams
- Low-locality applications do not use the memory hierarchy efficiently
- The overhead involved in data caching could be detrimental to the performance of such applications
- Simulation of alternate memory configurations and caching techniques will identify practical architecture modifications to enhance memory performance
- Simulations done using SimpleScalar
- C-based execution-driven simulator; simulates a 64-bit out-of-order processor
- SimpleScalar allows modeling of arbitrary memory hierarchy configurations
Figure courtesy of [9]
- Approach
- Use ideas adopted from effective approaches taken in the related literature
- Determine which modifications produce the best performance gain for the provided benchmarks
- Focus on small tweaks and added features as opposed to radically different architectures
- SimpleScalar's provided compiler produces assembly-level code that can be edited
- New load/store instructions added to the ISA, in addition to simulated hardware modifications
- Simulation measurements include execution times, data cache miss rates at each level, average memory access times, and various other statistics
- Baseline results from each benchmark run on the unaltered simulator will be compared with results obtained after various simulator modifications
16. Case Study: Possible Solutions
- Instruction- or data-address-based cache bypassing
- Memory accesses of program segments that exhibit poor locality will be made to bypass the cache entirely
- Reduces unnecessary overhead and helps ensure that useful data in the cache is not replaced with data that should not be cached
- Accomplished by manually examining code or by memory access pattern profiling
- Multithreading
- Newer and future processors contain enough functional units to execute multiple programs concurrently
- Also possible to run multiple instances of the same program
- Cache misses and stalls are not avoided; instead, they are hidden to some degree by the existence of more operational functional units
- Intelligent prefetching
- Reference Prediction Tables record the most recent misses of a given instruction
- Using the stride, the distance between misses, and other factors, a block of memory likely to contain a future access is identified
- Just before the instruction is executed, this block is retrieved from main memory
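The Reference Prediction Table idea can be sketched in a few lines: track the last miss address and stride per instruction (PC), and predict the next block once the stride repeats. The 64-byte block size and the repeat-once confirmation rule are assumptions for illustration, not the case study's actual design:

```python
class ReferencePredictionTable:
    """Per-instruction stride predictor: remembers the last miss address
    and stride for each load PC and predicts the next block to prefetch."""
    def __init__(self, block=64):
        self.block, self.table = block, {}

    def miss(self, pc, addr):
        last, stride = self.table.get(pc, (None, None))
        new_stride = None if last is None else addr - last
        self.table[pc] = (addr, new_stride)
        # Predict only once the stride has repeated (steady state).
        if new_stride is not None and new_stride == stride:
            return (addr + new_stride) // self.block * self.block
        return None

rpt = ReferencePredictionTable()
trace = [0x1000, 0x1100, 0x1200, 0x1300]   # hypothetical strided misses
for a in trace:
    print(hex(a), "->", rpt.miss(pc=0x400a10, addr=a))
```

After two misses establish the 0x100 stride, each further miss yields the block address to fetch ahead of time; an irregular access pattern would keep resetting the stride and suppress predictions, which is the desired behavior for avoiding useless prefetches.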
17. Conclusions and Future Plans
- Script generator
- Custom built to support MPI functions
- SHMEM extensions: short-term goal
- UPC support: long-term goal
- Difficulty determining how to support programming models
- Processor modeling
- Timing and scaling factor vs. instruction count
- PAPI implementation to be explored
- Bounded and simulation methods will be explored
- Network models
- InfiniBand model
- Preliminary model is almost complete
- Run simulations and experiments to gather results
- TCP model
- Tweak parameters for accurate results
- MPI interface and simulation results
- Dynamic applications
- Work with the Berkeley sockets interface for HWIL
- Feasibility study of other options
- Low-locality case study
- SimpleScalar modifications to study the results of the different techniques introduced
- Simulate memory architecture modifications and measure performance gains on benchmarks
- Suggest practical solutions to increase memory performance on the supplied low-locality benchmarks
- Future potential of FASE
- Expand the FASE framework to incorporate other simulated resources
- Reconfigurable devices, storage devices, WAN and grid computing components, etc.
Refer to the RC group's Q3 status report for more details
18. References
- [1] L. Carrington, A. Snavely, X. Gao, and N. Wolter, "A Performance Prediction Framework for Scientific Applications," Workshop on Performance Modeling, ICCS, Melbourne, June 2003.
- [2] A. Snavely, N. Wolter, and L. Carrington, "Modeling Application Performance by Convolving Machine Signatures with Application Profiles," IEEE 4th Annual Workshop on Workload Characterization, Austin, December 2001.
- [3] M. McCracken, A. Snavely, and A. Malony, "Performance Modeling for Dynamic Algorithm Selection," Workshop on Performance Modeling, ICCS, Melbourne, June 2003.
- [4] R. Badia, J. Labarta, J. Gimenez, and F. Escale, "DIMEMAS: Predicting MPI Applications' Behavior in Grid Environments," CEPBA-IBM Research Institute.
- [5] D. Noonburg and J. Shen, "A Framework for Statistical Modeling of Superscalar Processor Performance," HPCA 1997.
- [6] F. Wolf and B. Mohr, "Automatic Performance Analysis of Hybrid MPI/OpenMP Applications," Forschungszentrum Jülich, Germany.
- [7] CrossRoads Systems Inc., "Introduction to the InfiniBand (SM) Architecture," Version 1.3, updated 02/04/2000, http://www.rioworks.co.jp/productsinfo/InfiniBand01.pdf
- [8] T.M. Pinkston, A.F. Benner, M. Krause, I.M. Robinson, and T. Sterling, "InfiniBand: The De Facto Future Standard for System and Local Area Networks or Just a Scalable Replacement for PCI Buses?" Cluster Computing, Vol. 6, No. 2, April 2003, pp. 95-105.
- [9] S. Chang, "Performance Profiling and Optimization on the SGI Origins," Numerical Aerospace Simulation Facility, NASA, 2001.
- [10] G. Memik, M. Kandemir, A. Choudhary, and I. Kadayif, "An Integrated Approach for Improving Cache Behavior," Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, IEEE, 2003.
- [11] W. Wong, "Hardware Techniques for Data Placement," Dept. of Computer Science and Engineering, University of Washington, 1997.