Title: Processor Virtualization for Scalable Parallel Computing
1 Processor Virtualization for Scalable Parallel Computing
- Laxmikant Kale
- kale@cs.uiuc.edu
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2 Acknowledgements
- Graduate students, including
- Gengbin Zheng
- Orion Lawlor
- Milind Bhandarkar
- Arun Singla
- Josh Unger
- Terry Wilmarth
- Sameer Kumar
- Recent Funding
- NSF (NGS, Frederica Darema)
- DOE (ASCI Rocket Center)
- NIH (Molecular Dynamics)
3 Overview
- Processor Virtualization
- Motivation
- Realization in AMPI and Charm++
- Part I: Benefits
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Application Examples
- Part II
- PetaFLOPS Machines
- Emulator
- Programming Environments
- Simulator
- Performance prediction
- Part III
- Programming Models
4 Motivation
- Research Group Mission
- Improve performance and productivity in parallel programming
- Via application-oriented but computer-science-centered research
- Parallel Computing/Programming
- Coordination between processes
- Resource management
5 Coordination
- Processes, each possibly with local data
- How do they interact with each other?
- Data exchange and synchronization
- Solutions proposed
- Message passing
- Shared variables and locks
- Global Arrays / shmem
- UPC
- Asynchronous method invocation
- Specifically shared variables
- readonly, accumulators, tables
- Others: Linda, ...
Each is probably suited to different applications and to different programmers' subjective tastes
6 Parallel Computing Is About Resource Management
- Who needs resources?
- Work units
- Threads, function calls, method invocations, loop iterations
- Data units
- Array segments, cache lines, stack frames, messages, object variables
- Resources
- Processors, floating-point units, thread units
- Memories: caches, SRAMs, DRAMs, ...
- The programmer should not have to manage resources explicitly, even within one program
7 Processor Virtualization
- Basic Idea
- Divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
- Let the system map these virtual processors to physical processors
- Old idea? G. Fox's book (1986?)
- DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
- Our approach is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt
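The basic idea can be sketched in a few lines of C++: create many more work pieces (VPs) than processors, and let a mapping function assign each VP a home processor. The round-robin rule and the names here are illustrative stand-ins for the runtime's real, load-aware strategy; this is not Charm++ API.

```cpp
#include <cstddef>
#include <vector>

// Map V virtual processors onto P physical processors.
// Round-robin is a placeholder for the runtime's real mapping strategy.
std::vector<int> map_vps(std::size_t num_vps, int num_procs) {
    std::vector<int> home(num_vps);
    for (std::size_t vp = 0; vp < num_vps; ++vp)
        home[vp] = static_cast<int>(vp % num_procs);
    return home;
}
```

Because the number of VPs is fixed by the problem, not the machine, the same program runs unchanged when `num_procs` changes.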
8 Virtualization: Object-based Parallelization
The user is only concerned with the interaction between objects (VPs)
[Figure: user view]
9 Technical Approach
- Seek an optimal division of labor between the system and the programmer
- Decomposition done by the programmer, everything else automated
[Figure: spectrum from automation to specialization: HPF, Charm++, AMPI, MPI]
10 Message From This Talk
- Virtualization and the associated techniques that we have been exploring for the past decade are ready, and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications
- These techniques are embodied in
- Charm++
- AMPI
- Frameworks (structured grids, unstructured grids, particles)
- Virtualization of other coordination languages (UPC, GA, ...)
11 Realizations: Charm++
- Charm++
- Parallel C++ with data-driven objects (chares)
- Asynchronous method invocation
- Prioritized scheduling
- Object arrays
- Object groups
- Information-sharing abstractions: readonly, tables, ...
- Mature, robust, portable (http://charm.cs.uiuc.edu)
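For flavor, a minimal Charm++-style chare array might look as follows. This is a non-runnable sketch: `Hello`, `sayHi`, and `numElements` are hypothetical names, and compiling it requires the Charm++ translator and runtime.

```cpp
// hello.ci (interface file, processed by the Charm++ translator):
//   mainmodule hello {
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi(int from);
//     };
//   };

// hello.C: sayHi() is invoked asynchronously through a proxy; the runtime
// delivers the message to wherever element `thisIndex` currently lives.
class Hello : public CBase_Hello {
public:
  Hello() {}
  void sayHi(int from) {
    // pass the greeting along a ring of array elements
    // (numElements: assumed to be set at startup, e.g. as a readonly)
    thisProxy[(thisIndex + 1) % numElements].sayHi(thisIndex);
  }
};
```

Note that `sayHi` returns immediately on the caller's side; the invocation is a message, which is what makes prioritized, message-driven scheduling possible.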
12 Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of array elements A[0], A[1], A[2], A[3], ...]
13 Object Arrays
- (same slide, with the system view added)
[Figure: user's view of A[0], A[1], A[2], A[3], ...; system view: elements such as A[3] and A[0] placed on a physical processor]
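The user/system split above can be illustrated with a toy location manager: users address elements only by index, while the system owns the index-to-processor mapping and may migrate elements behind the scenes. This is a sketch of the concept, not Charm++'s actual location-management protocol, and all names are invented.

```cpp
#include <cstddef>
#include <unordered_map>

// Toy "object array" location manager: every element has a home processor
// computed from its index; a forwarding table records migrated elements.
class ObjectArrayMap {
public:
    explicit ObjectArrayMap(int num_procs) : num_procs_(num_procs) {}

    // Where does element `index` live right now?
    int locate(std::size_t index) const {
        auto it = moved_.find(index);
        return it != moved_.end() ? it->second : home(index);
    }

    // The system migrates an element; users keep addressing it by index.
    void migrate(std::size_t index, int new_proc) { moved_[index] = new_proc; }

private:
    int home(std::size_t index) const {
        return static_cast<int>(index % num_procs_);
    }
    int num_procs_;
    std::unordered_map<std::size_t, int> moved_;
};
```

Because callers only ever use indices, migration (for load balancing, vacating, or checkpointing) never changes user code.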
15 Adaptive MPI
- A migration path for legacy MPI codes
- AMPI = MPI + Virtualization
- Uses Charm++ object arrays and migratable threads
- Minimal modifications needed to convert existing MPI programs
- Conversion automated via AMPIzer
- Based on the Polaris compiler framework
- Bindings for C, C++, and Fortran90
16 AMPI
[Figure]
17 AMPI
Implemented as virtual processors (user-level migratable threads)
18 Benefits of Virtualization
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Enables Runtime Optimizations
- Automatic Dynamic Load Balancing
- Communication Optimizations
- Other Runtime Optimizations
19 Modularization
- Logical units decoupled from the number of processors
- E.g., oct-tree nodes for particle data
- No artificial restriction on the number of processors
- (such as a cube, or a power of 2)
- Modularity
- Software engineering: cohesion and coupling
- MPI's "are on the same processor" is a bad coupling principle
- Objects liberate you from that
- E.g., solid and fluid modules in a rocket simulation
20 Rocket Simulation
- Large collaboration headed by Mike Heath
- DOE-supported ASCI center
- Challenge
- Multi-component code, with modules from independent researchers
- MPI was the common base
- AMPI: new wine in an old bottle
- Easier to convert
- Can still run the original codes on MPI, unchanged
21 Rocket Simulation via Virtual Processors
22 AMPI and Roc Communication
[Figure: communication among Rocflo virtual processors]
23 Message-Driven Execution
Virtualization leads to message-driven execution, which leads to automatic, adaptive overlap of computation and communication
24 Adaptive Overlap via Data-Driven Objects
- Problem
- Processors wait too long at receive statements
- Routine communication optimizations in MPI
- Move sends up and receives down
- Sometimes: use irecvs, but be careful
- With data-driven objects
- Adaptive overlap of computation and communication
- No object or thread holds up the processor
- No need to guess which message is likely to arrive first
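A minimal sketch of this execution model: incoming messages are queued, and the scheduler simply runs whichever handler is ready next, so no single pending receive can hold up the processor. The class and method names are illustrative, not Charm++'s scheduler API.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Minimal message-driven scheduler: handlers run in whatever order their
// messages arrive, instead of the processor blocking on one particular
// receive.
class Scheduler {
public:
    using Handler = std::function<void()>;

    // A message arrives: its handler joins the ready queue.
    void deliver(Handler h) { ready_.push(std::move(h)); }

    // The scheduler loop: run whatever is ready, in arrival order.
    void run() {
        while (!ready_.empty()) {
            Handler h = std::move(ready_.front());
            ready_.pop();
            h();   // a handler may itself deliver further messages
        }
    }

private:
    std::queue<Handler> ready_;
};
```

With many objects per processor, whichever object's data arrives first runs first, giving the adaptive overlap automatically.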
25 Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance", Ph.D. thesis, Apr 1994)
26 Handling Random Load Variations via MDE
- MDE encourages asynchrony
- Asynchronous reductions, for example
- Only data dependence should force synchronization
- One benefit
- Consider an algorithm with N steps
- Each step has a different load balance T_ij
- Loose dependence between steps
- (on neighbors, for example)
- Sum-of-max (MPI) vs. max-of-sum (MDE)
- OS jitter
- Causes random processors to add delays in each step
- Handled automatically by MDE
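The sum-of-max vs. max-of-sum contrast can be made concrete. If T[i][j] is the time of step i on processor j, a barrier after every step costs sum over i of (max over j of T[i][j]), while barrier-free, message-driven execution finishes in roughly the max over j of (sum over i of T[i][j]), which is never larger. A toy calculation with made-up numbers:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// T[i][j]: time of step i on processor j.

// Barrier after every step: every step waits for its slowest processor.
double sum_of_max(const std::vector<std::vector<double>>& T) {
    double total = 0;
    for (const auto& step : T)
        total += *std::max_element(step.begin(), step.end());
    return total;
}

// No barriers (loose dependences): each processor accumulates its own work,
// and the finish time is set by the most loaded processor overall.
double max_of_sum(const std::vector<std::vector<double>>& T) {
    std::vector<double> per_proc(T[0].size(), 0.0);
    for (const auto& step : T)
        for (std::size_t j = 0; j < step.size(); ++j) per_proc[j] += step[j];
    return *std::max_element(per_proc.begin(), per_proc.end());
}
```

For two processors whose load swaps between steps (T = {{3,1},{1,3}}), the barriered version takes 6 units while the asynchronous one takes 4; random OS-jitter delays fall out of the sum the same way.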
27 Example: Molecular Dynamics in NAMD
- Collection of charged atoms, with bonds
- Newtonian mechanics
- Thousands of atoms (1,000 to 500,000)
- 1 femtosecond time step; millions of steps needed!
- At each time step
- Calculate forces on each atom
- Bonds
- Non-bonded: electrostatic and van der Waals
- Calculate velocities and advance positions
- Multiple time stepping: PME (3D FFT) every 4 steps
Collaboration with K. Schulten, R. Skeel, and coworkers
28 Parallel Molecular Dynamics
[Figure: decompositions with 192 and 144 VPs, 700 VPs, and 30,000 VPs]
29 Performance: NAMD on Lemieux
ATPase: 320,000 atoms, including water
30 (figure-only slide)
31 Molecular Dynamics: Benefits of Avoiding Barriers
- In NAMD
- The energy reductions were made asynchronous
- No other global barriers are used in cut-off simulations
- This came in handy when
- Running on Pittsburgh's Lemieux (3,000 processors)
- The machine (or our way of using the communication layer) produced unpredictable, random delays in communication
- A send call would remain stuck for 20 ms, for example
- How did the system handle it?
- See the timeline plots
32 (figure-only slide: timeline plots)
33 Asynchronous Reductions in Jacobi
[Figure: processor timeline with a synchronous reduction shows a gap between compute phases; with an asynchronous reduction that gap is avoided]
34 Virtualization/MDE Leads to Predictability
- Ability to predict
- Which data is going to be needed, and
- Which code will execute
- Based on the ready queue of object method invocations
- So we can
- Prefetch data accurately
- Prefetch code if needed
- Out-of-core execution
- Caches vs. controllable SRAM
35 Programmable SRAMs
- Problems with caches
- Cache management is based on the principle of locality
- A heuristic, not a perfect predictor
- Cache-miss handling is in the critical path
- Our approach (message-driven execution)
- Can exploit a programmable SRAM very effectively
- Load the relevant data into the SRAM just in time
36 Example: Jacobi Relaxation
Each processor may have hundreds of such objects (a few tens of KB each, say). When all the boundary data for an object is available, it is added to the ready queue.
[Figure: prefetch/SRAM management feeding the scheduler's queue and ready queue from DRAM]
37 Flexible, Dynamic Mapping to Processors
- The system can migrate objects between processors
- Vacate a processor used by a parallel program
- Deals with extraneous loads on shared workstations
- Adapt to speed differences between processors
- E.g., a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
- Checkpointing = migrate to disk!
- Restart on a different number of processors
- Shrink and expand the set of processors used by an app
- Shrink from 1000 to 900 procs; later, expand to 1200
- Adaptive job scheduling for better system utilization
38 Faucets: Optimizing Utilization Within/Across Clusters
[Figure: job submission and job monitoring across multiple clusters]
- http://charm.cs.uiuc.edu/research/faucets
39 Inefficient Utilization Within a Cluster
[Figure: 16-processor system]
Current job schedulers can yield low system utilization; a competitive problem in the context of Faucets-like systems
40 Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors
[Figure: 16-processor system]
41 Job Monitoring: Appspector
42 AQS Features
- AQS: Adaptive Queuing System
- Multithreaded
- Reliable and robust
- Supports most features of standard queuing systems
- Can manage adaptive jobs, currently implemented in Charm++ and MPI
- Handles regular (non-adaptive) jobs
43 Cluster Utilization
[Figure: experimental and simulated cluster utilization]
44 Experimental MRT (Mean Response Time)
45 Principle of Persistence
- Once the application is expressed in terms of interacting objects:
- Object communication patterns and computational loads tend to persist over time
- In spite of dynamic behavior
- Abrupt and large, but infrequent, changes (e.g., AMR)
- Slow and small changes (e.g., particle migration)
- Parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
- Enables learning / adaptive algorithms
- Adaptive communication libraries
- Measurement-based load balancing
46 Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
- Measures communication volume and computation time
- Measurement-based load balancers
- Use the instrumented database periodically to make new decisions
- Many alternative strategies can use the database
- Centralized vs. distributed
- Greedy improvements vs. complete reassignments
- Taking communication into account
- Taking dependences into account (more complex)
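One of the simplest strategies the measured database can drive is a centralized greedy reassignment: place objects, heaviest first, on the currently least-loaded processor. The sketch below ignores communication and dependences, which (as the slide notes) make the real problem harder, and its names are illustrative rather than Charm++'s balancer API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy rebalancing from measured per-object loads: assign objects,
// heaviest first, to the currently least-loaded processor.
std::vector<int> greedy_balance(std::vector<double> loads, int num_procs) {
    // sort object ids by measured load, heaviest first
    std::vector<std::size_t> order(loads.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return loads[a] > loads[b]; });

    std::vector<double> proc_load(num_procs, 0.0);
    std::vector<int> assignment(loads.size());
    for (std::size_t obj : order) {
        int least = static_cast<int>(
            std::min_element(proc_load.begin(), proc_load.end()) -
            proc_load.begin());
        assignment[obj] = least;          // migrate object to that processor
        proc_load[least] += loads[obj];
    }
    return assignment;
}
```

Because objects (not processes) are the unit of work, acting on the new assignment is just a set of migrations; user code is untouched.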
47 Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated
48 Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication is from/to objects, not processors
- Load balancers use this to optimize object placement
- Communication libraries can optimize
- By substituting the most suitable algorithm for each operation
- Learning at runtime
- E.g., each-to-all individualized sends
- Performance depends on many runtime characteristics
- The library switches between different algorithms
V. Krishnan, MS Thesis, 1996
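The runtime switching described here can be sketched with a toy cost model: a direct all-to-all sends P-1 messages per processor, while a mesh-style combining scheme sends far fewer messages but moves extra data, so it wins for small messages. The formulas and constants below are illustrative assumptions, not measurements from Lemieux.

```cpp
#include <cmath>
#include <string>

// Toy per-processor cost models: alpha = per-message cost (s),
// beta = per-byte cost (s/byte). Direct: P-1 messages. 2D-mesh combining:
// ~2*(sqrt(P)-1) messages, each carrying ~sqrt(P) times the data.
// All constants are made up for illustration.
std::string choose_all_to_all(int P, double bytes,
                              double alpha = 10e-6, double beta = 4e-9) {
    double direct = (P - 1) * (alpha + beta * bytes);
    double s = std::sqrt(static_cast<double>(P));
    double mesh = 2.0 * (s - 1) * (alpha + beta * bytes * s);
    return mesh < direct ? "mesh" : "direct";
}
```

Under this model, a 76-byte message on 1024 processors favors the combining scheme, while a megabyte message favors direct sends; a learning library measures rather than assumes these constants.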
49 All-to-All on Lemieux for a 76-Byte Message
50 Impact on Application Performance
NAMD performance on Lemieux, with the transpose step implemented using different all-to-all algorithms
51 Overhead of Virtualization
Isn't there significant overhead to virtualization? No! Not in most cases.
52 Ongoing Research
- Fault tolerance
- Much easier at the object level: TMR, efficient variations
- However, checkpointing used to be such an efficient alternative (low forward-path cost)
- Resurrect past research
- Programming petaFLOPS machines
- Programming environment
- Simulation and performance prediction
- Communication optimizations: grids
- Dealing with limited virtual memory space
53 Applications on the Current Emulator
- Using Charm++
- LeanMD
- Research-quality molecular dynamics
- Version 0: only electrostatics and van der Waals
- Simple AMR kernel
- Adaptive tree to generate millions of objects
- Each holding a 3D array
- Communication with neighbors
- The tree makes it harder to find neighbors, but Charm++ makes it easy
54 Applications: Funded Collaborations
- Molecular dynamics for biophysics: NAMD
- QM/MM: Car-Parrinello
- Materials
- Microstructure: dendritic growth
- Bridging the gap between atomistic and FEM models
- Space-time meshing
- Rocket simulation
- DOE ASCI Center
- Computational astrophysics
Developing CS enabling technology in the context of real applications
55 QM Using the Car-Parrinello Method (Glenn Martyna, Mark Tuckerman et al.)
56 Evolution of a Galaxy in Its Cosmological Context
Thomas Quinn et al.
Need to bridge the length gap: multiple modules, communication optimizations, dynamic load balancing
57 Ongoing Research
- Load balancing
- The Charm++ framework allows both distributed and centralized strategies
- In recent years, we focused on centralized
- Still OK at 3,000 processors for NAMD
- Reverting to older work on distributed balancing
- Need to handle locality of communication
- Topology-sensitive placement
- Need to work with global information
- Approximate global info
- Incomplete global info (only neighborhood)
- Achieving global effects by local action
58 [Figure: application components A, B, C, D connected by orchestration support and data transfer, layered over framework components (Unmesh, MBlock, Particles, AMR support, Solvers) and parallel standard libraries]
59 Benefits of Virtualization: Summary
- Software engineering
- Number of virtual processors can be independently controlled
- Separate VPs for modules
- Message-driven execution
- Adaptive overlap
- Modularity
- Predictability
- Automatic out-of-core execution
- Cache management
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change the set of processors
- Principle of persistence
- Enables runtime optimizations
- Automatic dynamic load balancing
- Communication optimizations
- Other runtime optimizations
More info: http://charm.cs.uiuc.edu