Title: Advantages of Processor Virtualization and AMPI
1Advantages of Processor Virtualization and AMPI
- Laxmikant Kale
- CS320
- Spring 2003
- kale@cs.uiuc.edu
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Overview
- Processor Virtualization
- Motivation
- Realization in AMPI and Charm++
- Part I: Benefits
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
3Motivation
- We need to improve performance and productivity in parallel programming
- Parallel computing/programming is about
- Coordination between processes
- Information exchange
- Synchronization
- (knowing when the other guy has done something)
- Resource management
- Allocating work and data to processors
4Coordination
- Processes, each with possibly local data
- How do they interact with each other?
- Data exchange and synchronization
- Solutions proposed
- Message passing
- Shared variables and locks
- Global Arrays / shmem
- UPC
- Asynchronous method invocation
- Specifically shared variables
- readonly, accumulators, tables
- Others: Linda, ...
Each is probably suitable for different applications and the subjective tastes of programmers
5Resource Management
- Coordination is one aspect
- But parallel computing is also about resource management
- Who needs resources?
- Work units
- Threads, function calls, method invocations, loop iterations
- Data units
- Array segments, cache lines, stack frames, messages, object variables
- What are the resources?
- Processors, floating point units, thread units
- Memories: caches, SRAMs, DRAMs, ...
- Idea
- Programmer should not have to manage resources explicitly, even within one program
6Processor Virtualization
- Basic Idea
- Divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
- Let the system map these virtual processors to physical processors
- Old idea? G. Fox's book (1986?)
- DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
- Our approach is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt
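The mapping idea above can be sketched in a few lines. This is a toy illustration only (not the Charm++ runtime's actual placement code; all names are made up): the program is decomposed into many virtual processors, and a trivial round-robin rule assigns them to however many physical processors are available.

```python
# Toy sketch (not the Charm++ runtime's actual placement code): map many
# virtual processors (VPs) onto a few physical processors (PEs),
# independently of how many PEs there happen to be.
def map_vps(num_vps, num_pes):
    """Round-robin placement: VP i runs on PE i mod num_pes."""
    return {vp: vp % num_pes for vp in range(num_vps)}

# 32 VPs on 8 PEs: the decomposition is fixed by the problem, not the machine.
placement = map_vps(num_vps=32, num_pes=8)
per_pe = [sum(1 for pe in placement.values() if pe == p) for p in range(8)]
# Each PE gets 4 VPs here; the same 32 VPs could run unchanged on 4 or 16 PEs.
```

The point of the sketch: changing `num_pes` changes nothing about the decomposition, which is exactly the decoupling the slide describes.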
7Virtualization: Object-based Parallelization
User is only concerned with interaction between
objects (VPs)
User View
8Technical Approach
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
[Figure: spectrum from specialization to automation, placing MPI, AMPI, Charm++, and HPF in increasing order of automation]
9Why Virtualization?
- Advertisement
- Virtualization is ready and powerful enough to meet the needs of tomorrow's applications and machines
- Specifically
- Virtualization and the associated techniques that we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and of complex and dynamic applications
- These techniques are embodied in
- Charm++
- AMPI
- Frameworks (structured grids, unstructured grids, particles)
- Virtualization of other coordination languages (UPC, GA, ...)
10Realizations: Charm++
- Charm++
- Parallel C++ with data-driven objects (chares)
- Asynchronous method invocation
- Prioritized scheduling
- Object arrays
- Object groups
- Information-sharing abstractions: readonly, tables, ...
- Mature, robust, portable (http://charm.cs.uiuc.edu)
11Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of the array elements A[0], A[1], A[2], A[3], ...]
12Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of the array elements A[0], A[1], A[2], A[3], ..., and the system's view of the same elements scattered across processors]
14Adaptive MPI
- A migration path for legacy MPI codes
- AMPI = MPI + Virtualization
- Uses Charm++ object arrays and migratable threads
- Existing MPI programs
- Minimal modifications needed to convert existing MPI programs
- Bindings for C, C++, and Fortran 90
- We will focus on AMPI
- Ignoring Charm++ for now...
15AMPI
16AMPI
Implemented as virtual processors (user-level
migratable threads)
17Benefits of Virtualization
- Modularity and Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Enables Runtime Optimizations
- Automatic Dynamic Load Balancing
- Communication Optimizations
- Other Runtime Optimizations
18Benefit 1: Modularization
- Logical units decoupled from the number of processors
- E.g., oct-tree nodes for particle data
- No artificial restriction on the number of processors
- (such as a cube or a power of 2)
- Modularity
- Software engineering: cohesion and coupling
- Relying on "these MPI processes are on the same processor" is a bad coupling principle
- Objects liberate you from that
- E.g., solid and fluid modules in a rocket simulation
19Example: Rocket Simulation
- Large collaboration headed by Prof. M. Heath
- DOE-supported ASCI center
- Challenge
- Multi-component code,
- with modules from independent researchers
- MPI was the common base
- AMPI: new wine in an old bottle
- Easier to convert
- Can still run original codes on MPI, unchanged
- Example of modularization
- RocFlo: fluids code
- RocSolid: structures code
- RocFace: data transfer at the boundary
20Rocket simulation via virtual processors
21AMPI and Roc* Communication
Using separate sets of virtual processors for RocFlo and RocSolid eliminates unnecessary coupling.
[Figure: an array of RocFlo virtual processors mapped across the physical processors]
22Benefit 2: Message-Driven Execution
Virtualization leads to message-driven execution, since there are potentially multiple objects on each processor.
This in turn leads to automatic, adaptive overlap of computation and communication.
23Adaptive Overlap via Data-driven Objects
- Problem
- Processors wait too long at receive statements
- Routine communication optimizations in MPI
- Move sends up and receives down
- Use irecvs, but be careful
- With data-driven objects
- Adaptive overlap of computation and communication
- No object or thread holds up the processor
- No need to guess which message is likely to arrive first
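The data-driven idea can be sketched with a toy scheduler (invented names and structure, not Charm++'s actual scheduler): messages are dispatched in whatever order they arrive, so no receive statement ever blocks the processor waiting for a particular sender.

```python
import heapq

# Toy illustration of message-driven execution: deliveries are processed
# in arrival order, whatever that turns out to be, so the processor never
# has to guess which message will arrive first.
def run_scheduler(arrivals, handlers):
    """arrivals: list of (arrival_time, object_id, payload) tuples."""
    heap = list(arrivals)
    heapq.heapify(heap)              # earliest arrival dispatched first
    order = []
    while heap:
        _, obj, payload = heapq.heappop(heap)
        order.append(obj)
        handlers[obj](payload)       # run the target object's method
    return order

state = {}
handlers = {"A": lambda p: state.setdefault("A", p),
            "B": lambda p: state.setdefault("B", p)}
# B's message arrives before A's; the scheduler simply takes them as they come.
order = run_scheduler([(5, "A", 1), (2, "B", 2)], handlers)
```

Contrast this with a fixed `recv(A); recv(B)` sequence, which would idle until A's late message showed up before handling B's early one.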
24Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, "Simplified expression of message-driven programs and quantification of their impact on performance," Ph.D. thesis, Apr 1994)
25Handling Random Load Variations via MDE
- MDE encourages asynchrony
- Asynchronous reductions, for example
- Only data dependence should force synchronization
- One benefit
- Consider an algorithm with N steps
- Each step has a different load balance: T(i,j)
- Loose dependence between steps
- (on neighbors, for example)
- Sum-of-max (MPI) vs. max-of-sum (MDE)
- OS jitter
- Causes random processors to add delays in each step
- Handled automatically by MDE
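The sum-of-max vs. max-of-sum point above can be made concrete with a small numeric experiment (the timings T[i][j] are made up random jitter, not measured data):

```python
import random

# Numeric illustration of sum-of-max vs. max-of-sum with invented
# per-step, per-processor times T[i][j].  A lockstep (MPI-style) run
# pays the slowest processor at every step; a message-driven run with
# loose dependences only pays the slowest processor's own total.
random.seed(0)
steps, procs = 10, 4
T = [[random.uniform(0.9, 1.1) for _ in range(procs)] for _ in range(steps)]

sum_of_max = sum(max(step) for step in T)                         # lockstep cost
max_of_sum = max(sum(T[i][j] for i in range(steps)) for j in range(procs))

# max-of-sum can never exceed sum-of-max; with random jitter it is
# strictly smaller, because the random delays no longer accumulate
# across every step.
assert max_of_sum <= sum_of_max
```

This is exactly the mechanism by which MDE absorbs OS jitter: a processor delayed at one step can be ahead at the next, and only its total matters.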
26Asynchronous reductions in Jacobi
[Figure: two processor timelines. With a synchronous reduction, a gap opens between compute phases while processors wait for the reduction; with an asynchronous reduction, the reduction overlaps the next compute phase and the gap is avoided]
27Virtualization/MDE leads to predictability
- Ability to predict
- Which data is going to be needed and
- Which code will execute
- Based on the ready queue of object method invocations
- So, we can
- Prefetch data accurately
- Prefetch code if needed
- Out-of-core execution
- Caches vs controllable SRAM
28Benefit 3: Flexible Dynamic Mapping to Processors
- The system can migrate objects between processors
- Vacate a processor used by a parallel program
- Dealing with extraneous loads on shared workstations
- Adapt to speed differences between processors
- E.g., a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
- Checkpointing = migrate to disk!
- Restart on a different number of processors
- Shrink and expand the set of processors used by an app
- Shrink from 1000 to 900 procs; later expand to 1200
- Adaptive job scheduling for better system utilization
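Shrinking the processor set reduces, at its core, to migrating the virtual processors that lived on the vacated processors. A hypothetical sketch (invented names; a real runtime would also rebalance load and handle in-flight messages):

```python
# Hypothetical sketch of shrinking a job by migrating VPs off vacated
# processors.  Only the VP-to-PE mapping changes; the application's
# decomposition into VPs is untouched.
def shrink(placement, new_num_pes):
    """Reassign every VP on an evicted PE, round-robin over survivors."""
    moved, out = 0, {}
    for vp, pe in placement.items():
        if pe >= new_num_pes:        # this PE is being vacated
            pe = moved % new_num_pes
            moved += 1
        out[vp] = pe
    return out

before = {vp: vp % 8 for vp in range(16)}  # 16 VPs on 8 PEs
after = shrink(before, 6)                  # shrink to 6 PEs
# All VPs now live on PEs 0..5; the application itself is unchanged.
```

Expanding is symmetric: newly granted processors start empty and the load balancer migrates VPs onto them.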
29Inefficient Utilization Within A Cluster
16-processor system
Current job schedulers can yield low system utilization: a competitive problem in the context of Faucets-like systems
30Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors.
16-processor system
31AQS Features
- AQS: Adaptive Queuing System
- Multithreaded
- Reliable and robust
- Supports most features of standard queuing systems
- Has the ability to manage adaptive jobs, currently implemented in Charm++ and MPI
- Handles regular (non-adaptive) jobs
32Cluster Utilization
[Figure: cluster utilization, experimental and simulated results]
33Experimental Mean Response Time
34Benefit 4: Principle of Persistence
- Once the application is expressed in terms of interacting objects:
- Object communication patterns and computational loads tend to persist over time
- In spite of dynamic behavior
- Abrupt and large, but infrequent, changes (e.g., AMR)
- Slow and small changes (e.g., particle migration)
- Parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
- Enables learning / adaptive algorithms
- Adaptive communication libraries
- Measurement-based load balancing
35Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
- Measures communication volume and computation time
- Measurement-based load balancers
- Use the instrumented database periodically to make new decisions
- Many alternative strategies can use the database
- Centralized vs. distributed
- Greedy improvements vs. complete reassignments
- Taking communication into account
- Taking dependences into account (more complex)
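One of the simplest strategies in this family is a greedy reassignment: given measured per-object loads, place the heaviest remaining object on the currently least-loaded processor. A minimal sketch (not an actual Charm++ load balancer, and ignoring communication and dependences):

```python
import heapq

# Minimal greedy strategy in the spirit of measurement-based load
# balancing: repeatedly place the heaviest remaining object on the
# processor with the smallest accumulated load.
def greedy_balance(object_loads, num_pes):
    pes = [(0.0, p) for p in range(num_pes)]      # (accumulated load, PE id)
    heapq.heapify(pes)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(pes)             # least-loaded PE so far
        assignment[obj] = p
        heapq.heappush(pes, (total + load, p))
    return assignment

# Measured loads would come from runtime instrumentation; these are made up.
loads = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}
where = greedy_balance(loads, 2)   # PE totals come out as 8.0 and 7.0
```

The principle of persistence is what justifies this: because loads persist, an assignment computed from last phase's measurements remains good for the next phase.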
36Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated
37Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication is from/to objects, not processors
- Load balancers use this to optimize object placement
- Communication libraries can optimize
- By substituting the most suitable algorithm for each operation
- Learning at runtime
- E.g., each-to-all individualized sends
- Performance depends on many runtime characteristics
- Library switches between different algorithms
- V. Krishnan, MS Thesis, 1996
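The switching decision can be sketched with a simple cost model (the constants and the mesh formula are illustrative assumptions, not measurements from Lemieux): a direct all-to-all sends P-1 separate messages, while a 2D-mesh combining scheme sends far fewer, larger combined messages, which wins when per-message overhead dominates.

```python
import math

# Toy cost model for runtime algorithm switching in an all-to-all.
# Direct exchange: P-1 messages of m bytes each.
# 2D-mesh combining: ~2*(sqrt(P)-1) messages, each carrying ~m*sqrt(P)
# bytes of combined data.  Constants below are assumed, not measured.
ALPHA, BETA = 10e-6, 1e-9          # per-message / per-byte costs (assumed)

def direct_cost(P, m):
    return (P - 1) * (ALPHA + BETA * m)

def mesh_cost(P, m):
    s = math.isqrt(P)              # assume P is a perfect square
    return 2 * (s - 1) * (ALPHA + BETA * m * s)

def pick(P, m):
    return "mesh" if mesh_cost(P, m) < direct_cost(P, m) else "direct"

# Under this model, a 76-byte message on 1024 processors favors the
# combining scheme, while megabyte messages favor the direct exchange.
```

A learning library does essentially this, except with constants measured at runtime rather than assumed.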
38All-to-all on Lemieux for a 76-byte Message
39Impact on Application Performance
Molecular Dynamics (NAMD) Performance on Lemieux,
with the transpose step implemented using
different all-to-all algorithms
40Overhead of Virtualization
Isn't there significant overhead of virtualization? No! Not in most cases. Here, an application is run with an increasing degree of virtualization.
Performance actually improves with virtualization, because of better cache performance.
41How to decide the granularity
- How many virtual processors should you use?
- This (typically) does not depend on the number of physical processors available
- Granularity
- Simple definition: amount of computation per message
- Guiding principle
- Make (the work for) each virtual processor as small as possible, while making sure it is sufficiently large compared with the scheduling/messaging overhead
- In practice, today
- Average computation per message > 100 microseconds is enough
- 0.5 ms to several ms is typically used
42How to Decide the Granularity (contd.)
- Exceptions
- Memory overhead
- Virtualization may lead to a large area of memory devoted to ghosts
- Reduce the number of virtual processors
- OR "fuse" chunks on individual processors to avoid ghost regions
- Large messages
- Modify the rule
- Calculate the message overhead
- Ensure granularity is more than 10 times this overhead
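The two rules above amount to simple arithmetic. A back-of-the-envelope check (the per-message overhead value here is assumed for illustration; on a real machine it would be measured):

```python
# Back-of-the-envelope check of the granularity rules on these slides.
OVERHEAD_US = 10.0                 # assumed scheduling/messaging overhead (us)

def grain_ok(compute_us, factor=10):
    """Modified rule: computation per message > ~10x the message overhead."""
    return compute_us > factor * OVERHEAD_US

# 0.5 ms of computation per message is comfortably coarse enough under
# this assumed overhead; 50 us per message is too fine-grained.
assert grain_ok(500.0)
assert not grain_ok(50.0)
```

With a larger measured overhead (e.g., for large messages), the same rule simply demands a proportionally coarser grain.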
43Benefits of Virtualization: Summary
- Software engineering
- Number of virtual processors can be independently controlled
- Separate VPs for modules
- Message-driven execution
- Adaptive overlap
- Modularity
- Predictability
- Automatic out-of-core execution
- Cache management
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change the set of processors
- Principle of persistence
- Enables runtime optimizations
- Automatic dynamic load balancing
- Communication optimizations
- Other runtime optimizations
More info: http://charm.cs.uiuc.edu