Advantages of Processor Virtualization and AMPI

1
Advantages of Processor Virtualization and AMPI
  • Laxmikant Kale
  • CS320
  • Spring 2003
  • kale@cs.uiuc.edu
  • http://charm.cs.uiuc.edu
  • Parallel Programming Laboratory
  • Department of Computer Science
  • University of Illinois at Urbana Champaign

2
Overview
  • Processor Virtualization
  • Motivation
  • Realization in AMPI and Charm++
  • Part I: Benefits
  • Better Software Engineering
  • Message Driven Execution
  • Flexible and dynamic mapping to processors
  • Principle of Persistence

3
Motivation
  • We need to Improve Performance and Productivity
    in parallel programming
  • Parallel Computing/Programming is about
  • Coordination between processes
  • Information exchange
  • Synchronization
  • (knowing when the other guy has done something)
  • Resource management
  • Allocating work and data to processors

4
Coordination
  • Processes, each with possibly local data
  • How do they interact with each other?
  • Data exchange and synchronization
  • Solutions proposed
  • Message passing
  • Shared variables and locks
  • Global Arrays / shmem
  • UPC
  • Asynchronous method invocation
  • Specifically shared variables
  • readonly, accumulators, tables
  • Others: Linda, ...

Each is probably suitable for different
applications and for the subjective tastes of
different programmers
5
Resource Management
  • Coordination is one aspect
  • But parallel computing is also about resource
    management
  • Who needs resources
  • Work units
  • Threads, function-calls, method invocations, loop
    iterations
  • Data units
  • Array segments, cache lines, stack-frames,
    messages, object variables
  • What are the resources
  • Processors, floating point units, thread-units
  • Memories: caches, SRAMs, DRAMs, ...
  • Idea
  • Programmer should not have to manage resources
    explicitly, even within one program

6
Processor Virtualization
  • Basic Idea
  • Divide the computation into a large number of
    pieces
  • Independent of number of processors
  • Typically larger than number of processors
  • Let the system map these virtual processors to
    processors
  • Old idea? G. Fox Book (86?),
  • DRMS (IBM), Data Parallel C (Michael Quinn),
    MPVM/UPVM/MIST
  • Our approach is virtualization
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt
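
The basic idea above can be sketched in a few lines (names are hypothetical, not the Charm++ API): the computation is split into many virtual processors, and the mapping to physical processors is a separate, system-owned decision.

```cpp
#include <vector>

// Illustrative sketch: the computation is divided into numVPs pieces,
// chosen independently of (and typically larger than) numPEs. The
// round-robin rule stands in for the system's mapping decision; a real
// runtime can choose and change the mapping dynamically.
std::vector<int> mapVPsToPEs(int numVPs, int numPEs) {
    std::vector<int> assignment(numVPs);
    for (int vp = 0; vp < numVPs; ++vp)
        assignment[vp] = vp % numPEs;   // VP vp placed on PE (vp mod numPEs)
    return assignment;
}
```

For example, 16 virtual processors on 4 physical processors places VPs 0, 4, 8, 12 on PE 0; the same decomposition runs unchanged on 8 PEs.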

7
Virtualization: Object-based Parallelization
The user is concerned only with the interaction
between objects (VPs).
[Diagram: user's view of interacting virtual processors]
8
Technical Approach
  • Seek optimal division of labor between system
    and programmer

Decomposition done by the programmer, everything
else automated.
[Diagram: spectrum from full automation (HPF) through Charm++ and AMPI to full specialization (MPI)]
9
Why Virtualization?
  • Advertisement
  • Virtualization is ready and powerful enough to
    meet the needs of tomorrow's applications and
    machines
  • Specifically
  • Virtualization and associated techniques that we
    have been exploring for the past decade are ready
    and powerful enough to meet the needs of high-end
    parallel computing and complex and dynamic
    applications
  • These techniques are embodied in
  • Charm++
  • AMPI
  • Frameworks (structured grids, unstructured grids,
    particles)
  • Virtualization of other coordination languages
    (UPC, GA, ..)

10
Realizations: Charm++
  • Charm++
  • Parallel C++ with Data-Driven Objects (Chares)
  • Asynchronous method invocation
  • Prioritized scheduling
  • Object Arrays
  • Object Groups
  • Information sharing abstractions: readonly,
    tables, ..
  • Mature, robust, portable (http://charm.cs.uiuc.edu)

11
Object Arrays
  • A collection of data-driven objects
  • With a single global name for the collection
  • Each member addressed by an index
  • sparse 1D, 2D, 3D, tree, string, ...
  • Mapping of element objects to processors handled
    by the system

User's view
[Diagram: elements A[0], A[1], A[2], A[3], ... of one globally named array]
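
The abstraction can be sketched as a toy (illustrative only, not the Charm++ interface): one globally named collection, elements addressed only by index, with placement kept out of the user's hands.

```cpp
#include <map>

// Illustrative sketch, not the Charm++ API. An object array is a single
// globally named collection; the user addresses elements by index, while
// the element-to-processor mapping stays hidden inside the runtime.
struct Element {
    int data = 0;
    void recv(int x) { data += x; }       // a method invoked via the index
};

class ObjectArray {
    std::map<int, Element> elements;      // sparse: only touched indices exist
    std::map<int, int> homePE;            // hidden: system-chosen placement
public:
    void send(int index, int x) {         // user-level: "send to A[index]"
        homePE.emplace(index, index % 4); // placeholder placement rule
        elements[index].recv(x);
    }
    int dataAt(int index) { return elements[index].data; }
};
```

Because the user never names a processor, the runtime is free to place (and later migrate) elements however it likes.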
12
Object Arrays (contd.)
[Diagram: the same user's view as above, plus the system view: the runtime places elements (e.g. A[3], A[0]) on physical processors and may migrate them]
14
Adaptive MPI
  • A migration path for legacy MPI codes
  • AMPI = MPI + Virtualization
  • Uses Charm object arrays and migratable threads
  • Existing MPI programs
  • Minimal modifications needed to convert existing
    MPI programs
  • Bindings for
  • C, C++, and Fortran90
  • We will focus on AMPI
  • Ignoring Charm++ for now..

15
AMPI
16
AMPI
Implemented as virtual processors (user-level
migratable threads)
17
Benefits of Virtualization
  • Modularity and Better Software Engineering
  • Message Driven Execution
  • Flexible and dynamic mapping to processors
  • Principle of Persistence
  • Enables Runtime Optimizations
  • Automatic Dynamic Load Balancing
  • Communication Optimizations
  • Other Runtime Optimizations

18
1. Modularization
  • Logical units decoupled from number of
    processors
  • E.g., oct-tree nodes for particle data
  • No artificial restriction on the number of
    processors
  • such as a cube or a power of 2
  • Modularity
  • Software engineering: cohesion and coupling
  • MPI's 'on the same processor' is a bad
    coupling principle
  • Objects liberate you from that
  • E.g., solid and fluid modules in a rocket
    simulation

19
Example Rocket Simulation
  • Large Collaboration headed by Prof. M. Heath
  • DOE supported ASCI center
  • Challenge
  • Multi-component code,
  • with modules from independent researchers
  • MPI was common base
  • AMPI: new wine in an old bottle
  • Easier to convert
  • Can still run original codes on MPI, unchanged
  • Example of modularization
  • RocFlo: fluids code
  • RocSolid: structures code
  • RocFace: data transfer at the boundary

20
Rocket simulation via virtual processors
21
AMPI and Roc Communication
Using separate sets of virtual processors for
RocFlo and RocSolid eliminates unnecessary
coupling.
[Diagram: RocFlo virtual processors spread across physical processors]
22
2. Benefits of Message-Driven Execution
Virtualization leads to message-driven execution,
since there are potentially multiple objects on
each processor. This in turn leads to automatic,
adaptive overlap of computation and communication.
23
Adaptive Overlap via Data-driven Objects
  • Problem
  • Processors wait for too long at receive
    statements
  • Routine communication optimizations in MPI
  • Move sends up and receives down
  • Use irecvs, but be careful
  • With Data-driven objects
  • Adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which is likely to arrive first
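
The contrast can be sketched with a toy message-driven scheduler (hypothetical names, nothing like the real runtime's internals): handlers run in arrival order, so the processor never sits blocked in one particular receive.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Illustrative sketch: a per-processor scheduler pops whichever message
// has arrived and invokes the bound object method. No object blocks in a
// receive, so there is no need to guess which message arrives first.
struct Message {
    std::function<void()> handler;   // an object method, bound to its data
};

class Scheduler {
    std::queue<Message> ready;       // messages that have already arrived
public:
    void deliver(Message m) { ready.push(std::move(m)); }
    void run() {                     // data-driven execution loop
        while (!ready.empty()) {
            ready.front().handler();
            ready.pop();
        }
    }
};
```

Whichever object's message is delivered first is processed first; work for the other objects on the same processor fills what would otherwise be idle receive time.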

24
Adaptive overlap and modules
SPMD and Message-Driven Modules (From A. Gursoy,
Simplified expression of message-driven programs
and quantification of their impact on
performance, Ph.D Thesis, Apr 1994.)
25
Handling Random Load Variations via MDE
  • MDE encourages asynchrony
  • Asynchronous reductions, for example
  • Only data dependence should force synchronization
  • One benefit
  • Consider an algorithm with N steps
  • Each step has a different load balance (Tij)
  • Loose dependence between steps
  • (on neighbors, for example)
  • Sum-of-max (MPI) vs max-of-sum (MDE)
  • OS jitter
  • Causes random processors to add delays in each
    step
  • Handled automatically by MDE
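
The sum-of-max vs max-of-sum point can be checked numerically (timings below are made up for illustration): with a barrier after every step the cost is the sum over steps of the per-step maximum, while loosely coupled message-driven execution is bounded by the busiest processor's total work.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// T[i][j] = time processor j spends computing step i (arbitrary units).

// Barrier after every step (lock-step style): each step costs as much
// as its slowest processor.
double sumOfMax(const std::vector<std::vector<double>>& T) {
    double total = 0;
    for (const auto& step : T)
        total += *std::max_element(step.begin(), step.end());
    return total;
}

// Loose dependence between steps: delays hitting different processors
// in different steps can overlap, so the bound is the busiest processor.
double maxOfSum(const std::vector<std::vector<double>>& T) {
    double best = 0;
    for (std::size_t j = 0; j < T[0].size(); ++j) {
        double s = 0;
        for (const auto& step : T) s += step[j];
        best = std::max(best, s);
    }
    return best;
}
```

For T = {{10, 14}, {15, 9}, {12, 11}}, where a random delay hits a different processor each step, sumOfMax gives 41 while maxOfSum gives 37: per-step jitter no longer accumulates.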

26
Asynchronous reductions in Jacobi
[Diagram: processor timelines. With a synchronous reduction, a gap separates the reduction from the next compute phase; with an asynchronous reduction, computation overlaps the reduction and the gap is avoided.]
27
Virtualization/MDE leads to predictability
  • Ability to predict
  • Which data is going to be needed and
  • Which code will execute
  • Based on the ready queue of object method
    invocations
  • So, we can
  • Prefetch data accurately
  • Prefetch code if needed
  • Out-of-core execution
  • Caches vs controllable SRAM

28
3. Flexible Dynamic Mapping to Processors
  • The system can migrate objects between processors
  • Vacate processor used by a parallel program
  • Dealing with extraneous loads on shared
    workstations
  • Adapt to speed difference between processors
  • E.g. a cluster with 500 MHz and 1 GHz processors
  • Automatic checkpointing
  • Checkpointing = migrate to disk!
  • Restart on a different number of processors
  • Shrink and Expand the set of processors used by
    an app
  • Shrink from 1000 to 900 procs. Later expand to
    1200.
  • Adaptive job scheduling for better System
    utilization

29
Inefficient Utilization Within A Cluster
16 Processor system
Current job schedulers can yield low system
utilization: a competitive problem in the
context of Faucets-like systems.
30
Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of
processors they use at runtime, by migrating
virtual processors.
16 Processor system
31
AQS Features
  • AQS: Adaptive Queuing System
  • Multithreaded
  • Reliable and robust
  • Supports most features of standard queuing
    systems
  • Can manage adaptive jobs, currently
    implemented in Charm++ and MPI
  • Handles regular (non-adaptive) jobs

32
Cluster Utilization
[Charts: experimental and simulated cluster utilization]
33
Experimental Mean Response Time
34
4. Principle of Persistence
  • Once the application is expressed in terms of
    interacting objects
  • Object communication patterns and
    computational loads tend to persist over time
  • In spite of dynamic behavior
  • Abrupt, large, but infrequent changes (e.g. AMR)
  • Slow and small changes (e.g. particle migration)
  • Parallel analog of the principle of locality
  • A heuristic that holds for most CSE applications
  • Learning / adaptive algorithms
  • Adaptive Communication libraries
  • Measurement based load balancing

35
Measurement Based Load Balancing
  • Based on Principle of persistence
  • Runtime instrumentation
  • Measures communication volume and computation
    time
  • Measurement based load balancers
  • Use the instrumented database periodically to
    make new decisions
  • Many alternative strategies can use the database
  • Centralized vs distributed
  • Greedy improvements vs complete reassignments
  • Taking communication into account
  • Taking dependences into account (More complex)
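
One of the simpler "complete reassignment" strategies above can be sketched as follows (illustrative only, not one of the actual Charm++ balancers; it uses measured compute load only, ignoring communication and dependences):

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy reassignment from measured loads: repeatedly place the heaviest
// remaining object on the currently least loaded processor. loads[i] is
// the measured compute time of object i; the result maps objects to PEs.
std::vector<int> greedyAssign(const std::vector<double>& loads, int numPEs) {
    // Object indices, heaviest first.
    std::vector<int> order(loads.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return loads[a] > loads[b]; });

    // Min-heap of (current load, PE id): top is the least loaded PE.
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

    std::vector<int> assignment(loads.size());
    for (int obj : order) {
        Entry e = pes.top(); pes.pop();
        assignment[obj] = e.second;
        pes.push({e.first + loads[obj], e.second});
    }
    return assignment;
}
```

Because placement is recomputed from the measured database, the same strategy works after loads drift or change abruptly; taking communication into account would add edge weights to the same skeleton.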

36
Load balancer in action
Automatic load balancing in crack propagation
[Diagram: 1. elements added; 2. load balancer invoked; 3. chunks migrated]
37
Optimizing for Communication Patterns
  • The parallel-objects Runtime System can observe,
    instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers use this to optimize object
    placement
  • Communication libraries can optimize
  • By substituting most suitable algorithm for each
    operation
  • Learning at runtime
  • E.g., each-to-all individualized sends
  • Performance depends on many runtime
    characteristics
  • Library switches between different algorithms

V. Krishnan, MS Thesis, 1996
38
All-to-all on Lemieux for a 76-byte message
39
Impact on Application Performance
Molecular Dynamics (NAMD) Performance on Lemieux,
with the transpose step implemented using
different all-to-all algorithms
40
Overhead of Virtualization
Isn't there significant overhead from
virtualization? No! Not in most cases. Here, an
application is run with an increasing degree of
virtualization.
Performance actually improves with virtualization
because of better cache performance
41
How to decide the granularity
  • How many virtual processors should you use?
  • This (typically) does not depend on the number
    of physical processors available
  • Granularity
  • Simple definition: amount of computation per
    message
  • Guiding principle
  • Make (the work for) each virtual processor as
    small as possible, while making sure it is
    sufficiently large compared with the
    scheduling/messaging overhead.
  • In practice, today
  • Average computation per message > 100
    microseconds is enough
  • 0.5 msec to several msecs is typically used

42
How to decide the granularity (contd.)
  • Exceptions
  • Memory overhead
  • Virtualization may lead to a large area of memory
    devoted to ghosts
  • Reduce the number of virtual processors
  • OR fuse chunks on individual processors to
    avoid ghost regions.
  • Large messages
  • Modify the rule
  • Calculate message overhead
  • Ensure granularity is more than 10 times this
    overhead
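
The two rules of thumb (the ~100 microsecond floor from the previous slide and the 10x-overhead rule for large messages) can be combined into one check; this helper and its constants are a sketch of the talk's guidance, not a universal formula.

```cpp
#include <algorithm>

// Hypothetical helper combining the talk's two rules of thumb: accept a
// grain when its computation per message exceeds both a ~100 microsecond
// scheduling/messaging floor and 10x the per-message overhead (which
// grows with message size).
bool grainOK(double computeUsPerMsg, double msgOverheadUs) {
    const double schedulingFloorUs = 100.0;   // talk's suggested minimum
    return computeUsPerMsg > std::max(schedulingFloorUs,
                                      10.0 * msgOverheadUs);
}
```

So a grain with 500 us of computation passes against a 5 us message overhead, but fails against a 60 us overhead from a large message, matching the modified rule above.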

43
Benefits of Virtualization Summary
  • Software Engineering
  • Number of virtual processors can be independently
    controlled
  • Separate VPs for modules
  • Message Driven Execution
  • Adaptive overlap
  • Modularity
  • Predictability
  • Automatic Out-of-core
  • Cache management
  • Dynamic mapping
  • Heterogeneous clusters
  • Vacate, adjust to speed, share
  • Automatic checkpointing
  • Change the set of processors
  • Principle of Persistence
  • Enables Runtime Optimizations
  • Automatic Dynamic Load Balancing
  • Communication Optimizations
  • Other Runtime Optimizations

More info: http://charm.cs.uiuc.edu