Title: Advantages of Processor Virtualization and AMPI
1Advantages of Processor Virtualization and AMPI
- Laxmikant Kale
- CS320
- Spring 2003
- kale@cs.uiuc.edu
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Overview
- Processor Virtualization
- Motivation
- Realization in AMPI and Charm++
- Part I: Benefits
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
3Motivation
- We need to improve performance and productivity in parallel programming
- Parallel computing/programming is about
- Coordination between processes
- Information exchange
- Synchronization
- (knowing when the other guy has done something)
- Resource management
- Allocating work and data to processors
4Coordination
- Processes, each with possibly local data
- How do they interact with each other?
- Data exchange and synchronization
- Solutions proposed
- Message passing
- Shared variables and locks
- Global Arrays / shmem
- UPC
- Asynchronous method invocation
- Specifically shared variables
- readonly, accumulators, tables
- Others: Linda, ...
Each is probably suitable for different applications and the subjective tastes of programmers
5Resource Management
- Coordination is one aspect
- But parallel computing is also about resource management
- Who needs resources?
- Work units
- Threads, function calls, method invocations, loop iterations
- Data units
- Array segments, cache lines, stack frames, messages, object variables
- What are the resources?
- Processors, floating point units, thread units
- Memories: caches, SRAMs, DRAMs, ...
- Idea
- Programmer should not have to manage resources explicitly, even within one program
6Processor Virtualization
- Basic Idea
- Divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
- Let the system map these virtual processors to physical processors
- Old idea? G. Fox's book (1986?)
- DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
- Our approach is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt
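The mapping idea above can be sketched in a few lines. This is a toy illustration only (not the Charm++ runtime's actual placement code; all names are made up): the program is decomposed into many virtual processors, and a trivial round-robin rule assigns them to however many physical processors are available.

```python
# Toy sketch (not the Charm++ runtime's actual placement code): map many
# virtual processors (VPs) onto a few physical processors (PEs),
# independently of how many PEs there happen to be.
def map_vps(num_vps, num_pes):
    """Round-robin placement: VP i runs on PE i mod num_pes."""
    return {vp: vp % num_pes for vp in range(num_vps)}

# 32 VPs on 8 PEs: the decomposition is fixed by the problem, not the machine.
placement = map_vps(num_vps=32, num_pes=8)
per_pe = [sum(1 for pe in placement.values() if pe == p) for p in range(8)]
# Each PE gets 4 VPs here; the same 32 VPs could run unchanged on 4 or 16 PEs.
```

The point of the sketch: changing `num_pes` changes nothing about the decomposition, which is exactly the decoupling the slide describes.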
7Virtualization: Object-based Parallelization
User is only concerned with interaction between
objects (VPs)
User View
8Technical Approach
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
[Figure: spectrum from specialization to automation, placing MPI, AMPI, Charm++, and HPF in increasing order of automation]
9Why Virtualization?
- Advertisement
- Virtualization is ready and powerful enough to meet the needs of tomorrow's applications and machines
- Specifically
- Virtualization and the associated techniques that we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and of complex and dynamic applications
- These techniques are embodied in
- Charm++
- AMPI
- Frameworks (structured grids, unstructured grids, particles)
- Virtualization of other coordination languages (UPC, GA, ...)
10Realizations: Charm++
- Charm++
- Parallel C++ with data-driven objects (chares)
- Asynchronous method invocation
- Prioritized scheduling
- Object arrays
- Object groups
- Information-sharing abstractions: readonly, tables, ...
- Mature, robust, portable (http://charm.cs.uiuc.edu)
11Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of the array elements A[0], A[1], A[2], A[3], ...]
12Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of the array elements A[0], A[1], A[2], A[3], ..., and the system's view of the same elements scattered across processors]
14Adaptive MPI
- A migration path for legacy MPI codes
- AMPI = MPI + Virtualization
- Uses Charm++ object arrays and migratable threads
- Existing MPI programs
- Minimal modifications needed to convert existing MPI programs
- Bindings for C, C++, and Fortran 90
- We will focus on AMPI
- Ignoring Charm++ for now...
15AMPI
16AMPI
Implemented as virtual processors (user-level
migratable threads)
17Benefits of Virtualization
- Modularity and Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Enables Runtime Optimizations
- Automatic Dynamic Load Balancing
- Communication Optimizations
- Other Runtime Optimizations
18Benefit 1: Modularization
- Logical units decoupled from the number of processors
- E.g., oct-tree nodes for particle data
- No artificial restriction on the number of processors
- (such as a cube or a power of 2)
- Modularity
- Software engineering: cohesion and coupling
- Relying on "these MPI processes are on the same processor" is a bad coupling principle
- Objects liberate you from that
- E.g., solid and fluid modules in a rocket simulation
19Example: Rocket Simulation
- Large collaboration headed by Prof. M. Heath
- DOE-supported ASCI center
- Challenge
- Multi-component code,
- with modules from independent researchers
- MPI was the common base
- AMPI: new wine in an old bottle
- Easier to convert
- Can still run original codes on MPI, unchanged
- Example of modularization
- RocFlo: fluids code
- RocSolid: structures code
- RocFace: data transfer at the boundary
20Rocket simulation via virtual processors
21AMPI and Roc* Communication
Using separate sets of virtual processors for RocFlo and RocSolid eliminates unnecessary coupling.
[Figure: an array of RocFlo virtual processors mapped across the physical processors]
22Benefit 2: Message-Driven Execution
Virtualization leads to message-driven execution, since there are potentially multiple objects on each processor.
This in turn leads to automatic, adaptive overlap of computation and communication.
23Adaptive Overlap via Data-driven Objects
- Problem
- Processors wait too long at receive statements
- Routine communication optimizations in MPI
- Move sends up and receives down
- Use irecvs, but be careful
- With data-driven objects
- Adaptive overlap of computation and communication
- No object or thread holds up the processor
- No need to guess which message is likely to arrive first
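The data-driven idea can be sketched with a toy scheduler (invented names and structure, not Charm++'s actual scheduler): messages are dispatched in whatever order they arrive, so no receive statement ever blocks the processor waiting for a particular sender.

```python
import heapq

# Toy illustration of message-driven execution: deliveries are processed
# in arrival order, whatever that turns out to be, so the processor never
# has to guess which message will arrive first.
def run_scheduler(arrivals, handlers):
    """arrivals: list of (arrival_time, object_id, payload) tuples."""
    heap = list(arrivals)
    heapq.heapify(heap)              # earliest arrival dispatched first
    order = []
    while heap:
        _, obj, payload = heapq.heappop(heap)
        order.append(obj)
        handlers[obj](payload)       # run the target object's method
    return order

state = {}
handlers = {"A": lambda p: state.setdefault("A", p),
            "B": lambda p: state.setdefault("B", p)}
# B's message arrives before A's; the scheduler simply takes them as they come.
order = run_scheduler([(5, "A", 1), (2, "B", 2)], handlers)
```

Contrast this with a fixed `recv(A); recv(B)` sequence, which would idle until A's late message showed up before handling B's early one.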
24Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, "Simplified expression of message-driven programs and quantification of their impact on performance," Ph.D. thesis, Apr 1994)
25Handling Random Load Variations via MDE
- MDE encourages asynchrony
- Asynchronous reductions, for example
- Only data dependence should force synchronization
- One benefit
- Consider an algorithm with N steps
- Each step has a different load balance: T(i,j)
- Loose dependence between steps
- (on neighbors, for example)
- Sum-of-max (MPI) vs. max-of-sum (MDE)
- OS jitter
- Causes random processors to add delays in each step
- Handled automatically by MDE
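The sum-of-max vs. max-of-sum point above can be made concrete with a small numeric experiment (the timings T[i][j] are made up random jitter, not measured data):

```python
import random

# Numeric illustration of sum-of-max vs. max-of-sum with invented
# per-step, per-processor times T[i][j].  A lockstep (MPI-style) run
# pays the slowest processor at every step; a message-driven run with
# loose dependences only pays the slowest processor's own total.
random.seed(0)
steps, procs = 10, 4
T = [[random.uniform(0.9, 1.1) for _ in range(procs)] for _ in range(steps)]

sum_of_max = sum(max(step) for step in T)                         # lockstep cost
max_of_sum = max(sum(T[i][j] for i in range(steps)) for j in range(procs))

# max-of-sum can never exceed sum-of-max; with random jitter it is
# strictly smaller, because the random delays no longer accumulate
# across every step.
assert max_of_sum <= sum_of_max
```

This is exactly the mechanism by which MDE absorbs OS jitter: a processor delayed at one step can be ahead at the next, and only its total matters.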
26Asynchronous reductions in Jacobi
[Figure: two processor timelines. With a synchronous reduction, a gap opens between compute phases while processors wait for the reduction; with an asynchronous reduction, the reduction overlaps the next compute phase and the gap is avoided]
27Virtualization/MDE leads to predictability
- Ability to predict
- Which data is going to be needed and
- Which code will execute
- Based on the ready queue of object method invocations
- So, we can
- Prefetch data accurately
- Prefetch code if needed
- Out-of-core execution
- Caches vs controllable SRAM
28Benefit 3: Flexible Dynamic Mapping to Processors
- The system can migrate objects between processors
- Vacate a processor used by a parallel program
- Dealing with extraneous loads on shared workstations
- Adapt to speed differences between processors
- E.g., a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
- Checkpointing = migrate to disk!
- Restart on a different number of processors
- Shrink and expand the set of processors used by an app
- Shrink from 1000 to 900 procs; later expand to 1200
- Adaptive job scheduling for better system utilization
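Shrinking the processor set reduces, at its core, to migrating the virtual processors that lived on the vacated processors. A hypothetical sketch (invented names; a real runtime would also rebalance load and handle in-flight messages):

```python
# Hypothetical sketch of shrinking a job by migrating VPs off vacated
# processors.  Only the VP-to-PE mapping changes; the application's
# decomposition into VPs is untouched.
def shrink(placement, new_num_pes):
    """Reassign every VP on an evicted PE, round-robin over survivors."""
    moved, out = 0, {}
    for vp, pe in placement.items():
        if pe >= new_num_pes:        # this PE is being vacated
            pe = moved % new_num_pes
            moved += 1
        out[vp] = pe
    return out

before = {vp: vp % 8 for vp in range(16)}  # 16 VPs on 8 PEs
after = shrink(before, 6)                  # shrink to 6 PEs
# All VPs now live on PEs 0..5; the application itself is unchanged.
```

Expanding is symmetric: newly granted processors start empty and the load balancer migrates VPs onto them.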
29Inefficient Utilization Within A Cluster
16-processor system
Current job schedulers can yield low system utilization: a competitive problem in the context of Faucets-like systems
30Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors.
16-processor system
31AQS Features
- AQS: Adaptive Queuing System
- Multithreaded
- Reliable and robust
- Supports most features of standard queuing systems
- Has the ability to manage adaptive jobs, currently implemented in Charm++ and MPI
- Handles regular (non-adaptive) jobs
32Cluster Utilization
[Figure: cluster utilization, experimental and simulated results]
33Experimental Mean Response Time
34Benefit 4: Principle of Persistence
- Once the application is expressed in terms of interacting objects:
- Object communication patterns and computational loads tend to persist over time
- In spite of dynamic behavior
- Abrupt and large, but infrequent, changes (e.g., AMR)
- Slow and small changes (e.g., particle migration)
- Parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
- Enables learning / adaptive algorithms
- Adaptive communication libraries
- Measurement-based load balancing
35Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
- Measures communication volume and computation time
- Measurement-based load balancers
- Use the instrumented database periodically to make new decisions
- Many alternative strategies can use the database
- Centralized vs. distributed
- Greedy improvements vs. complete reassignments
- Taking communication into account
- Taking dependences into account (more complex)
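One of the simplest strategies in this family is a greedy reassignment: given measured per-object loads, place the heaviest remaining object on the currently least-loaded processor. A minimal sketch (not an actual Charm++ load balancer, and ignoring communication and dependences):

```python
import heapq

# Minimal greedy strategy in the spirit of measurement-based load
# balancing: repeatedly place the heaviest remaining object on the
# processor with the smallest accumulated load.
def greedy_balance(object_loads, num_pes):
    pes = [(0.0, p) for p in range(num_pes)]      # (accumulated load, PE id)
    heapq.heapify(pes)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(pes)             # least-loaded PE so far
        assignment[obj] = p
        heapq.heappush(pes, (total + load, p))
    return assignment

# Measured loads would come from runtime instrumentation; these are made up.
loads = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}
where = greedy_balance(loads, 2)   # PE totals come out as 8.0 and 7.0
```

The principle of persistence is what justifies this: because loads persist, an assignment computed from last phase's measurements remains good for the next phase.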
36Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated
37Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication is from/to objects, not processors
- Load balancers use this to optimize object placement
- Communication libraries can optimize
- By substituting the most suitable algorithm for each operation
- Learning at runtime
- E.g., each-to-all individualized sends
- Performance depends on many runtime characteristics
- Library switches between different algorithms
- V. Krishnan, MS Thesis, 1996
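The switching decision can be sketched with a simple cost model (the constants and the mesh formula are illustrative assumptions, not measurements from Lemieux): a direct all-to-all sends P-1 separate messages, while a 2D-mesh combining scheme sends far fewer, larger combined messages, which wins when per-message overhead dominates.

```python
import math

# Toy cost model for runtime algorithm switching in an all-to-all.
# Direct exchange: P-1 messages of m bytes each.
# 2D-mesh combining: ~2*(sqrt(P)-1) messages, each carrying ~m*sqrt(P)
# bytes of combined data.  Constants below are assumed, not measured.
ALPHA, BETA = 10e-6, 1e-9          # per-message / per-byte costs (assumed)

def direct_cost(P, m):
    return (P - 1) * (ALPHA + BETA * m)

def mesh_cost(P, m):
    s = math.isqrt(P)              # assume P is a perfect square
    return 2 * (s - 1) * (ALPHA + BETA * m * s)

def pick(P, m):
    return "mesh" if mesh_cost(P, m) < direct_cost(P, m) else "direct"

# Under this model, a 76-byte message on 1024 processors favors the
# combining scheme, while megabyte messages favor the direct exchange.
```

A learning library does essentially this, except with constants measured at runtime rather than assumed.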
38All-to-all on Lemieux for a 76-byte Message
39Impact on Application Performance
Molecular Dynamics (NAMD) Performance on Lemieux,
with the transpose step implemented using
different all-to-all algorithms
40Overhead of Virtualization
Isn't there significant overhead of virtualization? No! Not in most cases. Here, an application is run with an increasing degree of virtualization.
Performance actually improves with virtualization, because of better cache performance.
41How to decide the granularity
- How many virtual processors should you use?
- This (typically) does not depend on the number of physical processors available
- Granularity
- Simple definition: amount of computation per message
- Guiding principle
- Make (the work for) each virtual processor as small as possible, while making sure it is sufficiently large compared with the scheduling/messaging overhead
- In practice, today
- Average computation per message > 100 microseconds is enough
- 0.5 ms to several ms is typically used
42How to Decide the Granularity (contd.)
- Exceptions
- Memory overhead
- Virtualization may lead to a large area of memory devoted to ghosts
- Reduce the number of virtual processors
- OR "fuse" chunks on individual processors to avoid ghost regions
- Large messages
- Modify the rule
- Calculate the message overhead
- Ensure granularity is more than 10 times this overhead
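The two rules above amount to simple arithmetic. A back-of-the-envelope check (the per-message overhead value here is assumed for illustration; on a real machine it would be measured):

```python
# Back-of-the-envelope check of the granularity rules on these slides.
OVERHEAD_US = 10.0                 # assumed scheduling/messaging overhead (us)

def grain_ok(compute_us, factor=10):
    """Modified rule: computation per message > ~10x the message overhead."""
    return compute_us > factor * OVERHEAD_US

# 0.5 ms of computation per message is comfortably coarse enough under
# this assumed overhead; 50 us per message is too fine-grained.
assert grain_ok(500.0)
assert not grain_ok(50.0)
```

With a larger measured overhead (e.g., for large messages), the same rule simply demands a proportionally coarser grain.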
43Benefits of Virtualization: Summary
- Software engineering
- Number of virtual processors can be independently controlled
- Separate VPs for modules
- Message-driven execution
- Adaptive overlap
- Modularity
- Predictability
- Automatic out-of-core execution
- Cache management
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change the set of processors
- Principle of persistence
- Enables runtime optimizations
- Automatic dynamic load balancing
- Communication optimizations
- Other runtime optimizations
More info: http://charm.cs.uiuc.edu