1
Modeling and Simulation (MS) Group
The MS Group was formerly known as the CGS
group.
  • Modeling and Simulation for Tradeoff Analysis in
    Advanced Cluster and Grid Systems
  • Appendix for Q3 Status Report
  • DoD Project MDA904-03-R-0507
  • February 5, 2004

2
Outline
  • Objectives and motivations
  • Related research
  • FASE overview
  • FASE components
  • Script generator (Version 2)
  • MPI collective communication
  • Processor modeling
  • InfiniBand and TCP models
  • Dynamic application considerations
  • Tool evaluation results
  • Case Study
  • Architectural modifications of conventional
    processors for low-locality applications
  • Conclusions and future plans

3
Objectives and Motivations
  • High-performance computing involves applications
    that require parallelization to achieve feasible
    completion times
  • Clusters and grids used for distributed computing
  • Issues with heterogeneity of clusters and grids
  • Efficiency of hardware resource usage
  • Execution time
  • Overhead
  • Challenge: Find the optimum configuration of
    resources and task distribution for key
    applications under study
  • Nearly impossible and too expensive to determine
    experimentally
  • Simulation tools are required
  • Challenges with simulation approach
  • Large, complex systems
  • Balance speed and fidelity

4
Related Research
  • Performance Modeling and Characterization (PMaC,
    developed at the San Diego Supercomputer Center)
  • Mission statement: to bring scientific rigor to
    the prediction and understanding of factors
    affecting the performance of current and
    projected HPC platforms
  • Funded by the DOE, DoD (NAVO MSRC PET program),
    DARPA, and NSF
  • Follows two rules of thumb
  • Memory subsystem dominates per-processor
    performance
  • An application's use of an interconnect dictates
    the scalability of that application
  • Three steps for prediction:
  • Machine profile: extracts fundamental
    characteristics of a machine independent of the
    application
  • Generated by Memory Access Pattern Signature
    (MAPS), a custom program that determines load and
    store rates of a machine
  • Future architectures can be considered by
    tweaking simulation parameters
  • Application signature: extracts fundamental
    characteristics of an application independent of
    the architecture on which it runs
  • Generated using MetaSim Tracer, custom simulator
    built on top of ATOM toolkit for Alpha machines
  • Convolution: integrates the machine profile and
    application signature to predict the performance
    of a machine
  • Produced by MetaSim Convolver, custom program
    that maps application signature onto machine
    profile
  • MPI trace and convolution result fed to the
    Dimemas simulator [4], described below
  • Papers [1], [2], and [3] show very good results
    from this method
  • Dimemas (developed by the European Center for
    Parallelism of Barcelona)
  • Analyzes performance of message-passing programs

5
FASE Overview
  • FASE: Fast and Accurate Simulation Environment
    for Clusters and Grids
  • Trace tools: evaluated TAU, MPE, Paradyn,
    SvPablo, and VampirTrace for parallel
    applications
  • Initially chose MPE for communication events;
    however, it did not provide enough detail on
    important MPI functions (e.g., MPI_Alltoall and
    MPI_Bcast)
  • Solution: custom program to instrument source
    code
  • Script Generator
  • Instrumented source code outputs scripts that can
    be read by Script Reader/Processor in MLD
  • More details provided in the following slides
  • Performance statistics generator
  • Generates statistics that can characterize
    behavior of a device while running a particular
    program
  • Example statistics
  • Cache misses, CPI, percentages of instruction
    types executed, instruction count, disk I/O
  • Can be a stand-alone program or part of the trace
    tool
  • Models
  • Processors: single and multiprocessors,
    processors-in-memory, reconfigurable
  • Networks: Ethernet, Myrinet, SCI, InfiniBand,
    Rapid I/O, HyperTransport, SONET and other
    optical protocols, TCP/IP
  • MPI Interface
  • Currently supports a small subset of MPI
    functions: MPI_Send, MPI_Recv, MPI_Barrier,
    MPI_Bcast, MPI_Alltoall, MPI_Reduce
  • Created for each network model in library
  • Implementation - Speed vs. Fidelity Tradeoff

Figures: main components in FASE; FASE component
interactions; FASE interfaces
6
Script Generator (Version 2)
  • Old script generator
  • Required the application to be compiled and run
    using MPE libraries
  • Read MPE log files and converted them to scripts
    to drive MLD
  • MPE seemed to provide inaccurate numbers when
    compared to other tools
  • Mainly during startup
  • New script generator
  • Instruments applications
  • Supported MPI functions (listed below)
  • Timing between MPI functions
  • Produces new instrumented source code to be
    compiled with a standard MPI compiler (a minimal
    sketch of such instrumentation follows this list)
  • Scripts generated by running binary
  • Features
  • Single-file program: easy to manage and modify
  • Supported languages: C and C++
  • Supported MPI functions:
  • MPI_Send, MPI_Recv, MPI_Alltoall, MPI_Bcast,
    MPI_Reduce
  • Command-line options
  • Input file name
  • Output instrumented file name
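
As a concrete illustration, here is a minimal sketch in C of the
kind of instrumented code the script generator might produce; the
fase_log_event helper and the script format are illustrative
assumptions, not the tool's actual output.

/* Hypothetical instrumentation around MPI calls: log the compute
 * gap since the last MPI event, then the operation, to a per-rank
 * script that the Script Reader/Processor in MLD could consume. */
#include <mpi.h>
#include <stdio.h>

static double last_event;   /* time of the previous MPI event */
static FILE  *script;       /* per-rank script file           */

static void fase_log_event(const char *op, int peer, int bytes)
{
    double now = MPI_Wtime();
    fprintf(script, "COMPUTE %f\n", now - last_event);
    fprintf(script, "%s peer=%d bytes=%d\n", op, peer, bytes);
    last_event = MPI_Wtime();
}

int main(int argc, char **argv)
{
    int rank, x = 42;
    char name[32];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(name, sizeof name, "fase_%d.script", rank);
    script = fopen(name, "w");
    last_event = MPI_Wtime();

    if (rank == 0) {              /* instrumented MPI_Send */
        fase_log_event("MPI_SEND", 1, (int)sizeof x);
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {       /* instrumented MPI_Recv */
        fase_log_event("MPI_RECV", 0, (int)sizeof x);
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    fclose(script);
    MPI_Finalize();
    return 0;
}

Running the binary (e.g., mpirun -np 2 ./a.out) then yields the
per-rank scripts that drive the simulation.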

7
Tree-based Collective Communication
  • FASE now supports both unicast and selected
    collective communication functions
  • MPI_Send, MPI_Recv, MPI_Barrier, MPI_Alltoall,
    MPI_Reduce, MPI_Bcast
  • MPE had limited capabilities for collecting
    pertinent characteristics of collective
    communications
  • New script generator developed to remedy this
  • Current implementation
  • Breaks each collective function into one or more
    unicast functions (a minimal sketch follows this
    list)
  • Some collective functions are broken up into
    multiple collective functions (e.g., MPI_Alltoall
    broken into one or more MPI_Bcast calls)
  • Assumes that the entire MPI_COMM_WORLD is used as
    the group in the collective function calls
  • These algorithms will be leveraged for SHMEM and
    UPC collective communications as well
  • Green oval: original MPI interface
  • Red oval: new additions to support collective
    communications
  • Blue circle (inside red oval): critical module
    with the algorithms used for each supported
    collective function
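
As one concrete possibility (an assumption about the decomposition,
not necessarily the group's exact algorithm), a broadcast over
MPI_COMM_WORLD can be expressed as a binomial tree of unicast sends
and receives:

/* Binomial-tree broadcast built from unicast MPI_Send/MPI_Recv.
 * Each process receives from its parent once, then forwards to
 * its children at decreasing power-of-two distances. */
#include <mpi.h>

void tree_bcast(void *buf, int count, MPI_Datatype type,
                int root, MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel = (rank - root + size) % size;  /* rank relative to root */

    /* receive once from the parent (the root skips this) */
    while (mask < size) {
        if (rel & mask) {
            int parent = (rel - mask + root) % size;
            MPI_Recv(buf, count, type, parent, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* forward to children at distances mask/2, mask/4, ... */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int child = (rel + mask + root) % size;
            MPI_Send(buf, count, type, child, 0, comm);
        }
        mask >>= 1;
    }
}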

8
Processor Modeling
  • Statistical modeling [5]
  • Split the application into code chunks, forming
    an execution profile in which specific paths are
    assigned probabilities
  • Convolution method [1, 2, 3]
  • Determines a scaling factor based on the
    assumption that memory is the main component of
    an application's execution time
  • Extract application signature and machine profile
    and relate the two using algebraic formula
  • Three steps for prediction
  • Machine profile: extracts fundamental
    characteristics of a machine independent of the
    application
  • Application signature: extracts fundamental
    characteristics of an application independent of
    the architecture on which it runs
  • Convolution: integrates the machine profile and
    application signature to predict the performance
    of a machine
  • Bounded method (new approach)
  • Classify application
  • Memory-bound, I/O-bound, CPU-bound, etc.
  • Use Kojak, Paradyn, or other tool to help
    classify application
  • Use specific benchmarks to capture machine
    characteristics
  • Based on application classification
  • Use any high-fidelity simulator or actual machine
    (if in possession)
  • Run specific benchmark on machine used to gather
    trace
  • Scaling factor obtained by dividing modeled time
    by trace-machine time (see the sketch after this
    list)
  • Do once and form database
  • Use execution time gathered by trace program
  • Con: the bound can change from machine to machine
  • Simulation method (new approach)
  • Run actual application using processor simulator
  • Simulate only portions of code with no
    communication
  • Could require source code modification
  • Need to relate uniprocessor simulation to actual
    parallel application
  • Use scaling factor
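
A small worked example of the bounded method's scaling step; the
benchmark times below are hypothetical placeholders, not measured
values.

/* Sketch: scale a traced compute interval by the ratio of
 * benchmark runtimes on the modeled (target) machine and on the
 * trace machine. */
#include <stdio.h>

int main(void)
{
    double t_bench_target = 4.2; /* s, benchmark on modeled machine */
    double t_bench_trace  = 6.0; /* s, benchmark on trace machine   */
    double scale = t_bench_target / t_bench_trace;

    double t_traced = 1.5;            /* s, interval from trace */
    double t_predicted = t_traced * scale;

    printf("scale=%.3f predicted=%.3f s\n", scale, t_predicted);
    return 0;
}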

Application Classification Using Kojak [6]
9
InfiniBand Model Design
  • InfiniBand Ports and Queue Pairs
  • Multiple IBA Ports increase bandwidth of a single
    channel adapter and operate at the Network Layer
    and below
  • Every port has at least two QPs
  • More QPs may be present per adapter
  • QPs operate at the Transport Layer
  • Each QP generates request packets, services
    returning response packets, and responds to
    arriving request packets through exactly one
    port, at any point in time
  • Each connected or reliable transport QP remains
    bound to a single port until path migration for
    error recovery and/or load balancing occurs or
    the connection goes away

Courtesy of [7]
  • Abbreviations used in InfiniBand
  • IBA: InfiniBand Architecture; CA: Channel Adapter
  • HCA: Host Channel Adapter; TCA: Target Channel
    Adapter
  • QP: Queue Pair; VL: Virtual Lane; SM: Subnet
    Manager
  • Host Channel Adapter (HCA)
  • Resides in host processor node and connects host
    to IBA fabric
  • Functions as the interface to consumer processes
    (the operating system's message and data service)
  • Queue Pair Details
  • Work Queues: one for send and one for receive,
    making up a QP
  • Send Work Queue contains instructions that cause
    data to be transferred from one consumer's memory
    to another's
  • Receive Work Queue holds instructions describing
    where to place data received from a remote
    consumer
  • Currently only the Receive queues have been
    modeled
  • Subnet Manager (SM)
  • Responsible for communication establishment and
    connection management between end-nodes
  • Monitors and reports well-defined performance
    counters; for our model, we have specified the
    performance measure to be queue load

10
InfiniBand HCA Model
  • Components that will complete modeling of HCA
  • Two-way communication including the Send and
    Receive Queues
  • Completion of SM Architecture
  • Introduction of appropriate delays based on
    service instructions like Send, RDMA Read, RDMA
    Write, etc.
  • HCA model description
  • A unidirectional HCA has been modeled so far;
    i.e., InfiniBand packets travel from ports
    through VLs and finally through QPs before
    reaching consumers, thus simulating the Receive
    Queues
  • A bidirectional HCA will have the same components
    connected in the opposite direction as well,
    allowing flow of packets from consumers to ports
    in addition to the existing direction, thus
    simulating both the Send and Receive Queues
  • Components I, II, and III shown in the figure
    simulate the sets of ports, VLs, and QPs,
    respectively, modeled as FIFO queues (a minimal
    sketch of one such FIFO stage follows this list)
  • II determines VL ID validity; III assigns packets
    to appropriate QPs
  • IV simulates the SM Agent, which, based on
    statistics collected from I, II, and III,
    calculates the least congested paths when
    necessary
  • V is a memory pool that stores queue statistics
    for IV
  • Modeling results and experiments
  • HCAs have been modeled for one-way communication;
    links, traffic sources, and queue analyzers have
    been modeled for this
  • Simulation results will be collected once the
    model is complete, as mentioned above
  • Functional testing of QPs and validation testing
    using experimental benchmark tests on the testbed
    from InfiniCon, comprising an InfinIO 2000
    (8-port) InfiniBand switch and HCAs, will be done
    next
  • TCAs are specialized HCAs and will not be modeled
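
A minimal C sketch of one FIFO stage (a port, VL, or QP in
components I-III); the type names, fields, and queue-load counter
are illustrative assumptions rather than the MLD model's actual
interface.

#include <stddef.h>

typedef struct packet {
    int vl_id;                /* virtual lane this packet targets */
    int qp_id;                /* destination queue pair           */
    int bytes;
    struct packet *next;
} packet_t;

typedef struct {
    packet_t *head, *tail;
    int  depth;               /* current occupancy                */
    long enqueued;            /* queue-load statistic for the SM  */
} fifo_t;

static void fifo_push(fifo_t *q, packet_t *p)
{
    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
    q->depth++;
    q->enqueued++;            /* fed to the SM Agent (IV) via V */
}

static packet_t *fifo_pop(fifo_t *q)
{
    packet_t *p = q->head;
    if (!p) return NULL;
    q->head = p->next;
    if (!q->head) q->tail = NULL;
    q->depth--;
    return p;
}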

11
TCP Model
  • TCP model overview
  • An abstract model that can be placed between an
    arbitrary application layer and data link layer
  • A large number of parameters allows customization
    and experimentation
  • Recent modifications
  • Model has been reworked to more accurately
    reflect real TCP performance
  • More accurate timing
  • Modified parameters to more closely resemble real
    TCP
  • Validated using Gigabit Ethernet
  • Netperf stream test
  • Two dual-2.4 GHz Xeon machines as hosts
  • Switched by a Cisco Catalyst 4503
  • Interfaced with MPI for FASE framework
  • Easily extensible to further collective
    communication models
  • FASE TCP closely simulates TCP over Gigabit
    Ethernet via models (a toy illustration follows
    this list)
  • Both ramp up exponentially
  • Both eventually level off at about 75% of line
    rate to prevent flooding the network
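
A toy illustration (not the FASE TCP model itself) of the
qualitative behavior described above: the sending rate doubles each
RTT during slow start and levels off near 75% of Gigabit Ethernet
line rate.

#include <stdio.h>

int main(void)
{
    double line_rate = 1000.0;           /* Mb/s, Gigabit Ethernet  */
    double ceiling   = 0.75 * line_rate; /* observed leveling point */
    double rate      = 10.0;             /* initial sending rate    */

    for (int rtt = 0; rtt < 12; rtt++) {
        printf("RTT %2d: %7.1f Mb/s\n", rtt, rate);
        rate *= 2.0;                     /* exponential ramp-up */
        if (rate > ceiling) rate = ceiling;
    }
    return 0;
}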

12
Dynamic Applications
  • Solutions to be explored
  • Model program dynamics in MLD
  • In-depth study approach
  • Requires detailed knowledge of program to be
    modeled
  • Each new application must be dissected for
    dynamic behavior
  • Implementation/integration simple once dynamic
    portions of program are identified and understood
  • Leads to an accurate, host/topology-portable
    study of a specific application
  • Use Berkeley sockets library in MLD to interface
    MLD with host machine
  • Dynamic control does not have to be modeled but
    is highly accurate
  • Program will physically run on the machine but
    the network will be simulated in MLD
  • Allows the host to dynamically respond to
    simulated network performance
  • One-time library development followed by
    unlimited portability to new applications
  • Host simulation constrained by available
    processors/architectures
  • SimpleScalar interface
  • Extend Berkeley sockets library to interface with
    SimpleScalar
  • Very small development cost inside MLD
  • Leverage previous work in SimpleScalar to make
    necessary adaptations
  • Allows modeling of emerging and prohibitively
    expensive processors
  • Host/architecture simulation is not constrained
    by available resources

Chameleon: adapting to its environment
13
Dynamic Applications
  • Berkeley Sockets solution
  • Redirect application data into MLD (decode the
    packets to get destinations and sizes)
  • Build data structures to represent packets
  • Send MLD data structures through simulated
    network
  • Pass data back to receiving application
  • Challenges
  • Intercepting program communication calls into MLD
    (one possible interposition mechanism is sketched
    after this list)
  • Accounting for delay incurred by redirecting
    packets into MLD and running them through the
    simulation
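
One possible interception mechanism (an assumption; the report does
not specify the approach) is a Linux LD_PRELOAD shim that interposes
on the Berkeley sockets send() call and hands each payload to MLD
before forwarding it:

/* Build: gcc -shared -fPIC -o shim.so shim.c -ldl
 * Run:   LD_PRELOAD=./shim.so ./app
 * The mld_enqueue stub stands in for the hook that would hand the
 * payload to the MLD simulation. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

typedef ssize_t (*send_fn)(int, const void *, size_t, int);

static void mld_enqueue(int fd, const void *buf, size_t len)
{
    (void)buf;
    fprintf(stderr, "[mld] fd=%d bytes=%zu\n", fd, len);
}

ssize_t send(int sockfd, const void *buf, size_t len, int flags)
{
    static send_fn real_send;
    if (!real_send)
        real_send = (send_fn)dlsym(RTLD_NEXT, "send"); /* libc send */

    mld_enqueue(sockfd, buf, len);  /* model the transfer in MLD */
    return real_send(sockfd, buf, len, flags);
}

This sketch mirrors traffic into the simulator; a full
implementation would also delay or throttle the real transfer
according to the simulated network's feedback.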

Other resources are also simulated: RAID storage
units, RC units, etc.
14
Performance Analysis Tool Evaluation
  • Analysis of tracing and profiling tools for
    characterization of a program in an
    architecture-independent manner
  • Tools Evaluated
  • PAPI: used by several other performance analysis
    tools as a standard API for accessing processor
    performance event counters (a minimal usage
    sketch follows this list)
  • Perfometer: graphical view of PAPI data
  • Real-time view of event counter data
  • - Can only display one metric at a time
  • SvPablo: captures and analyzes performance data
    that reflects the interaction of hardware and
    software
  • PAPI support allows capture of hardware
    performance counter events such as cache misses,
    floating point instructions and branch
    instructions
  • Collects statistical data on function calls and
    loops, such as mean, max, min, loop/function
    duration, and task counts
  • Log file is portable, extensible, compact and
    promotes scalability
  • - No correlation between time and data
  • Vampir: graphically analyzes runtime event traces
    produced by MPI applications
  • Collects both statistical and trace data
  • Monitors communication transmission
  • Log file allows fast random access and easy
    extraction of data
  • - Provides no information on hardware performance
    event counters
  • Paradyn: allows dynamic performance analysis
  • Supports dynamic instrumentation
  • Uses the application binary file so that source
    code is not needed
  • Allows automated searches for performance
    bottlenecks
  • - Provides limited statistical analysis of data
  • Provides no information on hardware performance
    event counters
  • SvPablo (with PAPI) and Vampir
  • Provide the best tool combination for determining
    an architecture-independent characterization of a
    program
  • Can allow different architectures or combinations
    of architectures to be plugged into a simulation
  • Options
  • Leverage these tools to extract important
    information
  • Incorporate techniques used by tools into new
    script generator
  • Use them to classify applications for the bounded
    method of processor modeling
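
For reference, a minimal PAPI usage sketch (assumed setup code,
independent of any particular tool above) that reads two hardware
counters around a region of interest:

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    long long counts[2];
    int events[2] = { PAPI_L1_DCM,    /* L1 data cache misses */
                      PAPI_TOT_INS }; /* total instructions   */
    int es = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&es);
    PAPI_add_events(es, events, 2);

    PAPI_start(es);
    volatile double x = 0.0;          /* region being characterized */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(es, counts);

    printf("L1 D-cache misses: %lld, instructions: %lld\n",
           counts[0], counts[1]);
    return 0;
}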

15
Case Study: Architectural Modifications of
Conventional Processors for Low-Locality
Applications
  • Introduction
  • Current memory hierarchies are built to take
    advantage of applications that exhibit good
    locality in their data streams
  • Low-locality applications do not use the memory
    hierarchy efficiently
  • The overhead involved in data caching could
    possibly be detrimental to the performance of
    such applications
  • Simulation of alternate memory configurations and
    caching techniques will identify practical
    architecture modifications to enhance memory
    performance
  • Simulations done using SimpleScalar
  • A C-based, execution-driven simulator that models
    a 64-bit out-of-order processor
  • SimpleScalar allows modeling of arbitrary memory
    hierarchy configurations

Figure courtesy of [9]
  • Approach
  • Use ideas adopted from effective approaches taken
    in the related literature [10, 11]
  • Determine which modifications produce best
    performance gain for provided benchmarks
  • Focus on small tweaks and added features as
    opposed to radically different architectures
  • SimpleScalar's provided compiler produces
    assembly-level code that can be edited
  • New load/store instructions added to the ISA, in
    addition to simulated hardware modifications
  • Simulation measurements include execution times,
    data cache miss rates at each level, average
    memory access times, and various other statistics
  • Baseline results from each benchmark run on the
    unaltered simulator will be compared with results
    obtained after various simulator modifications

16
Case Study: Possible Solutions
  • Instruction or data address-based cache bypassing
  • Memory accesses of program segments that exhibit
    poor locality will be made to bypass the cache
    entirely
  • Reduces unnecessary overhead, helps ensure that
    useful data in cache is not replaced with data
    that should not be cached
  • Accomplished by manually examining code or by
    memory access pattern profiling
  • Multithreading
  • Newer and future processors contain enough
    functional units to execute multiple programs
    concurrently
  • Also possible to run multiple instances of same
    program
  • Cache misses and stalls are not avoided but are
    instead hidden to some degree because additional
    threads keep the functional units busy
  • Intelligent prefetching
  • Reference Prediction Tables record the most
    recent misses of a given instruction (a
    simplified sketch follows this list)
  • Using the stride, the distance between misses,
    and other factors, a block of memory likely to
    contain a future access is identified
  • Just before the instruction is executed, this
    block is retrieved from main memory
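
A simplified sketch of a Reference Prediction Table entry and its
stride-based prediction step; the field names, table size, and
indexing are illustrative assumptions, not a specific published
design.

#include <stdint.h>

#define RPT_ENTRIES 256

typedef struct {
    uint64_t pc;         /* load/store instruction address (tag) */
    uint64_t last_addr;  /* most recent data address it touched  */
    int64_t  stride;     /* last observed address delta          */
    int      confident;  /* stride repeated at least once        */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

/* On each memory access by instruction pc: update its entry and,
 * when the stride is stable, return a prefetch address (0 = none). */
uint64_t rpt_access(uint64_t pc, uint64_t addr)
{
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];

    if (e->pc != pc) {              /* new instruction: reset entry */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confident = 0;
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confident = (stride != 0 && stride == e->stride);
    e->stride    = stride;
    e->last_addr = addr;

    return e->confident ? addr + stride : 0;  /* block to prefetch */
}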

17
Conclusions and Future Plans
  • Script Generator
  • Custom built to support MPI functions
  • SHMEM extensions: short-term goal
  • UPC support: long-term goal
  • Difficulty in determining how to support these
    programming models
  • Processor modeling
  • Timing and scaling factor vs. instruction count
  • PAPI implementation to be explored
  • Bounded and simulation methods will be explored
  • Network models
  • InfiniBand model
  • Preliminary model is almost complete
  • Run simulations and experiments to gather results
  • TCP model
  • Tweak parameters for accurate results
  • MPI interface and simulation results
  • Dynamic applications
  • Work with Berkeley sockets interface for HWIL
  • Feasibility study of other options
  • Low-locality case study
  • SimpleScalar modifications to study results of
    different techniques introduced
  • Simulate memory architecture modifications,
    measure performance gains on benchmarks
  • Suggest practical solutions to increase memory
    performance on supplied low-locality benchmarks
  • Future potential of FASE
  • Expand FASE framework to incorporate other
    simulated resources
  • Reconfigurable devices, storage devices, WAN and
    grid computing components, etc.

Refer to the RC group's Q3 status report for more
details.
18
References
  • [1] L. Carrington, A. Snavely, X. Gao, and N.
    Wolter, "A Performance Prediction Framework for
    Scientific Applications," Workshop on Performance
    Modeling, ICCS, Melbourne, June 2003.
  • [2] A. Snavely, N. Wolter, and L. Carrington,
    "Modeling Application Performance by Convolving
    Machine Signatures with Application Profiles,"
    IEEE 4th Annual Workshop on Workload
    Characterization, Austin, December 2001.
  • [3] M. McCracken, A. Snavely, and A. Malony,
    "Performance Modeling for Dynamic Algorithm
    Selection," Workshop on Performance Modeling,
    ICCS, Melbourne, June 2003.
  • [4] R. Badia, J. Labarta, J. Gimenez, and F.
    Escale, "DIMEMAS: Predicting MPI Applications'
    Behavior in Grid Environments," CEPBA-IBM
    Research Institute.
  • [5] D. Noonburg and J. Shen, "A Framework for
    Statistical Modeling of Superscalar Processor
    Performance," HPCA, 1997.
  • [6] F. Wolf and B. Mohr, "Automatic Performance
    Analysis of Hybrid MPI/OpenMP Applications,"
    Forschungszentrum Jülich, Germany.
  • [7] CrossRoads Systems Inc., "Introduction to the
    InfiniBand (SM) Architecture," Version 1.3,
    updated 02/04/2000,
    http://www.rioworks.co.jp/productsinfo/InfiniBand01.pdf
  • [8] T. M. Pinkston, A. F. Benner, M. Krause, I. M.
    Robinson, and T. Sterling, "InfiniBand: The De
    Facto Future Standard for System and Local Area
    Networks or Just a Scalable Replacement for PCI
    Buses?," Cluster Computing, Vol. 6, No. 2, April
    2003, pp. 95-105.
  • [9] S. Chang, "Performance Profiling and
    Optimization on the SGI Origins," Numerical
    Aerospace Simulation Facility, NASA, 2001.
  • [10] G. Memik, M. Kandemir, A. Choudhary, and I.
    Kadayif, "An Integrated Approach for Improving
    Cache Behavior," Proceedings of the Design,
    Automation and Test in Europe Conference and
    Exhibition, IEEE, 2003.
  • [11] W. Wong, "Hardware Techniques for Data
    Placement," Dept. of Computer Science and
    Engineering, University of Washington, 1997.