Title: Processor Virtualization for Scalable Parallel Computing
1 Processor Virtualization for Scalable Parallel Computing
- Laxmikant Kale
- kale@cs.uiuc.edu
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2 Acknowledgements
- Graduate students, including
- Gengbin Zheng
- Orion Lawlor
- Milind Bhandarkar
- Arun Singla
- Josh Unger
- Terry Wilmarth
- Sameer Kumar
- Recent Funding
- NSF (NGS, Frederica Darema)
- DOE (ASCI Rocket Center)
- NIH (Molecular Dynamics)
3 Overview
- Processor Virtualization
- Motivation
- Realization in AMPI and Charm++
- Part I: Benefits
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Application Examples
- Part II
- PetaFLOPS Machines
- Emulator
- Programming Environments
- Simulator
- Performance prediction
- Part III
- Programming Models
4 Motivation
- Research Group Mission
- Improve performance and productivity in parallel programming
- Via application-oriented but computer-science-centered research
- Parallel Computing/Programming
- Coordination between processes
- Resource management
5 Coordination
- Processes, each possibly with local data
- How do they interact with each other?
- Data exchange and synchronization
- Solutions proposed
- Message passing
- Shared variables and locks
- Global Arrays / shmem
- UPC
- Asynchronous method invocation
- Specifically shared variables
- readonly, accumulators, tables
- Others: Linda, ...
Each is probably suited to different applications and to different programmers' subjective tastes
6 Parallel Computing Is About Resource Management
- Who needs resources?
- Work units
- Threads, function calls, method invocations, loop iterations
- Data units
- Array segments, cache lines, stack frames, messages, object variables
- Resources
- Processors, floating-point units, thread units
- Memories: caches, SRAMs, DRAMs, ...
- The programmer should not have to manage resources explicitly, even within one program
7 Processor Virtualization
- Basic Idea
- Divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
- Let the system map these virtual processors to physical processors
- Old idea? G. Fox's book (1986?)
- DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
- Our approach is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt
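The basic idea can be sketched in a few lines of C++: create many more work pieces (VPs) than processors, and let a mapping function assign each VP a home processor. The round-robin rule and the names here are illustrative stand-ins for the runtime's real, load-aware strategy; this is not Charm++ API.

```cpp
#include <cstddef>
#include <vector>

// Map V virtual processors onto P physical processors.
// Round-robin is a placeholder for the runtime's real mapping strategy.
std::vector<int> map_vps(std::size_t num_vps, int num_procs) {
    std::vector<int> home(num_vps);
    for (std::size_t vp = 0; vp < num_vps; ++vp)
        home[vp] = static_cast<int>(vp % num_procs);
    return home;
}
```

Because the number of VPs is fixed by the problem, not the machine, the same program runs unchanged when `num_procs` changes.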
8 Virtualization: Object-based Parallelization
The user is only concerned with the interaction between objects (VPs)
[Figure: user view]
9 Technical Approach
- Seek an optimal division of labor between the system and the programmer
- Decomposition done by the programmer, everything else automated
[Figure: spectrum from automation to specialization: HPF, Charm++, AMPI, MPI]
10 Message From This Talk
- Virtualization and the associated techniques that we have been exploring for the past decade are ready, and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications
- These techniques are embodied in
- Charm++
- AMPI
- Frameworks (structured grids, unstructured grids, particles)
- Virtualization of other coordination languages (UPC, GA, ...)
11 Realizations: Charm++
- Charm++
- Parallel C++ with data-driven objects (chares)
- Asynchronous method invocation
- Prioritized scheduling
- Object arrays
- Object groups
- Information-sharing abstractions: readonly, tables, ...
- Mature, robust, portable (http://charm.cs.uiuc.edu)
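For flavor, a minimal Charm++-style chare array might look as follows. This is a non-runnable sketch: `Hello`, `sayHi`, and `numElements` are hypothetical names, and compiling it requires the Charm++ translator and runtime.

```cpp
// hello.ci (interface file, processed by the Charm++ translator):
//   mainmodule hello {
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi(int from);
//     };
//   };

// hello.C: sayHi() is invoked asynchronously through a proxy; the runtime
// delivers the message to wherever element `thisIndex` currently lives.
class Hello : public CBase_Hello {
public:
  Hello() {}
  void sayHi(int from) {
    // pass the greeting along a ring of array elements
    // (numElements: assumed to be set at startup, e.g. as a readonly)
    thisProxy[(thisIndex + 1) % numElements].sayHi(thisIndex);
  }
};
```

Note that `sayHi` returns immediately on the caller's side; the invocation is a message, which is what makes prioritized, message-driven scheduling possible.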
12 Object Arrays
- A collection of data-driven objects
- With a single global name for the collection
- Each member addressed by an index
- (sparse) 1D, 2D, 3D, tree, string, ...
- Mapping of element objects to processors handled by the system
[Figure: user's view of array elements A[0], A[1], A[2], A[3], ...]
13 Object Arrays
- (same slide, with the system view added)
[Figure: user's view of A[0], A[1], A[2], A[3], ...; system view: elements such as A[3] and A[0] placed on a physical processor]
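The user/system split above can be illustrated with a toy location manager: users address elements only by index, while the system owns the index-to-processor mapping and may migrate elements behind the scenes. This is a sketch of the concept, not Charm++'s actual location-management protocol, and all names are invented.

```cpp
#include <cstddef>
#include <unordered_map>

// Toy "object array" location manager: every element has a home processor
// computed from its index; a forwarding table records migrated elements.
class ObjectArrayMap {
public:
    explicit ObjectArrayMap(int num_procs) : num_procs_(num_procs) {}

    // Where does element `index` live right now?
    int locate(std::size_t index) const {
        auto it = moved_.find(index);
        return it != moved_.end() ? it->second : home(index);
    }

    // The system migrates an element; users keep addressing it by index.
    void migrate(std::size_t index, int new_proc) { moved_[index] = new_proc; }

private:
    int home(std::size_t index) const {
        return static_cast<int>(index % num_procs_);
    }
    int num_procs_;
    std::unordered_map<std::size_t, int> moved_;
};
```

Because callers only ever use indices, migration (for load balancing, vacating, or checkpointing) never changes user code.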
15 Adaptive MPI
- A migration path for legacy MPI codes
- AMPI = MPI + Virtualization
- Uses Charm++ object arrays and migratable threads
- Minimal modifications needed to convert existing MPI programs
- Conversion automated via AMPIzer
- Based on the Polaris compiler framework
- Bindings for C, C++, and Fortran90
16 AMPI
[Figure]
17 AMPI
Implemented as virtual processors (user-level migratable threads)
18 Benefits of Virtualization
- Better Software Engineering
- Message Driven Execution
- Flexible and dynamic mapping to processors
- Principle of Persistence
- Enables Runtime Optimizations
- Automatic Dynamic Load Balancing
- Communication Optimizations
- Other Runtime Optimizations
19 Modularization
- Logical units decoupled from the number of processors
- E.g., oct-tree nodes for particle data
- No artificial restriction on the number of processors
- (such as a cube, or a power of 2)
- Modularity
- Software engineering: cohesion and coupling
- MPI's "are on the same processor" is a bad coupling principle
- Objects liberate you from that
- E.g., solid and fluid modules in a rocket simulation
20 Rocket Simulation
- Large collaboration headed by Mike Heath
- DOE-supported ASCI center
- Challenge
- Multi-component code, with modules from independent researchers
- MPI was the common base
- AMPI: new wine in an old bottle
- Easier to convert
- Can still run the original codes on MPI, unchanged
21 Rocket Simulation via Virtual Processors
22 AMPI and Roc Communication
[Figure: communication among Rocflo virtual processors]
23 Message-Driven Execution
Virtualization leads to message-driven execution, which leads to automatic, adaptive overlap of computation and communication
24 Adaptive Overlap via Data-Driven Objects
- Problem
- Processors wait too long at receive statements
- Routine communication optimizations in MPI
- Move sends up and receives down
- Sometimes: use irecvs, but be careful
- With data-driven objects
- Adaptive overlap of computation and communication
- No object or thread holds up the processor
- No need to guess which message is likely to arrive first
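A minimal sketch of this execution model: incoming messages are queued, and the scheduler simply runs whichever handler is ready next, so no single pending receive can hold up the processor. The class and method names are illustrative, not Charm++'s scheduler API.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Minimal message-driven scheduler: handlers run in whatever order their
// messages arrive, instead of the processor blocking on one particular
// receive.
class Scheduler {
public:
    using Handler = std::function<void()>;

    // A message arrives: its handler joins the ready queue.
    void deliver(Handler h) { ready_.push(std::move(h)); }

    // The scheduler loop: run whatever is ready, in arrival order.
    void run() {
        while (!ready_.empty()) {
            Handler h = std::move(ready_.front());
            ready_.pop();
            h();   // a handler may itself deliver further messages
        }
    }

private:
    std::queue<Handler> ready_;
};
```

With many objects per processor, whichever object's data arrives first runs first, giving the adaptive overlap automatically.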
25 Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance", Ph.D. thesis, Apr 1994)
26 Handling Random Load Variations via MDE
- MDE encourages asynchrony
- Asynchronous reductions, for example
- Only data dependence should force synchronization
- One benefit
- Consider an algorithm with N steps
- Each step has a different load balance T_ij
- Loose dependence between steps
- (on neighbors, for example)
- Sum-of-max (MPI) vs. max-of-sum (MDE)
- OS jitter
- Causes random processors to add delays in each step
- Handled automatically by MDE
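The sum-of-max vs. max-of-sum contrast can be made concrete. If T[i][j] is the time of step i on processor j, a barrier after every step costs sum over i of (max over j of T[i][j]), while barrier-free, message-driven execution finishes in roughly the max over j of (sum over i of T[i][j]), which is never larger. A toy calculation with made-up numbers:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// T[i][j]: time of step i on processor j.

// Barrier after every step: every step waits for its slowest processor.
double sum_of_max(const std::vector<std::vector<double>>& T) {
    double total = 0;
    for (const auto& step : T)
        total += *std::max_element(step.begin(), step.end());
    return total;
}

// No barriers (loose dependences): each processor accumulates its own work,
// and the finish time is set by the most loaded processor overall.
double max_of_sum(const std::vector<std::vector<double>>& T) {
    std::vector<double> per_proc(T[0].size(), 0.0);
    for (const auto& step : T)
        for (std::size_t j = 0; j < step.size(); ++j) per_proc[j] += step[j];
    return *std::max_element(per_proc.begin(), per_proc.end());
}
```

For two processors whose load swaps between steps (T = {{3,1},{1,3}}), the barriered version takes 6 units while the asynchronous one takes 4; random OS-jitter delays fall out of the sum the same way.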
27 Example: Molecular Dynamics in NAMD
- Collection of charged atoms, with bonds
- Newtonian mechanics
- Thousands of atoms (1,000 to 500,000)
- 1 femtosecond time step; millions of steps needed!
- At each time step
- Calculate forces on each atom
- Bonds
- Non-bonded: electrostatic and van der Waals
- Calculate velocities and advance positions
- Multiple time stepping: PME (3D FFT) every 4 steps
Collaboration with K. Schulten, R. Skeel, and coworkers
28 Parallel Molecular Dynamics
[Figure: decompositions with 192 and 144 VPs, 700 VPs, and 30,000 VPs]
29 Performance: NAMD on Lemieux
ATPase: 320,000 atoms, including water
30 (figure-only slide)
31 Molecular Dynamics: Benefits of Avoiding Barriers
- In NAMD
- The energy reductions were made asynchronous
- No other global barriers are used in cut-off simulations
- This came in handy when
- Running on Pittsburgh's Lemieux (3,000 processors)
- The machine (or our way of using the communication layer) produced unpredictable, random delays in communication
- A send call would remain stuck for 20 ms, for example
- How did the system handle it?
- See the timeline plots
32 (figure-only slide: timeline plots)
33 Asynchronous Reductions in Jacobi
[Figure: processor timeline with a synchronous reduction shows a gap between compute phases; with an asynchronous reduction that gap is avoided]
34 Virtualization/MDE Leads to Predictability
- Ability to predict
- Which data is going to be needed, and
- Which code will execute
- Based on the ready queue of object method invocations
- So we can
- Prefetch data accurately
- Prefetch code if needed
- Out-of-core execution
- Caches vs. controllable SRAM
35 Programmable SRAMs
- Problems with caches
- Cache management is based on the principle of locality
- A heuristic, not a perfect predictor
- Cache-miss handling is in the critical path
- Our approach (message-driven execution)
- Can exploit a programmable SRAM very effectively
- Load the relevant data into the SRAM just in time
36 Example: Jacobi Relaxation
Each processor may have hundreds of such objects (a few tens of KB each, say). When all the boundary data for an object is available, it is added to the ready queue.
[Figure: prefetch/SRAM management feeding the scheduler's queue and ready queue from DRAM]
37 Flexible, Dynamic Mapping to Processors
- The system can migrate objects between processors
- Vacate a processor used by a parallel program
- Deals with extraneous loads on shared workstations
- Adapt to speed differences between processors
- E.g., a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
- Checkpointing = migrate to disk!
- Restart on a different number of processors
- Shrink and expand the set of processors used by an app
- Shrink from 1000 to 900 procs; later, expand to 1200
- Adaptive job scheduling for better system utilization
38 Faucets: Optimizing Utilization Within/Across Clusters
[Figure: job submission and job monitoring across multiple clusters]
- http://charm.cs.uiuc.edu/research/faucets
39 Inefficient Utilization Within a Cluster
[Figure: 16-processor system]
Current job schedulers can yield low system utilization; a competitive problem in the context of Faucets-like systems
40 Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors
[Figure: 16-processor system]
41 Job Monitoring: Appspector
42 AQS Features
- AQS: Adaptive Queuing System
- Multithreaded
- Reliable and robust
- Supports most features of standard queuing systems
- Can manage adaptive jobs, currently implemented in Charm++ and MPI
- Handles regular (non-adaptive) jobs
43 Cluster Utilization
[Figure: experimental and simulated cluster utilization]
44 Experimental MRT (Mean Response Time)
45 Principle of Persistence
- Once the application is expressed in terms of interacting objects:
- Object communication patterns and computational loads tend to persist over time
- In spite of dynamic behavior
- Abrupt and large, but infrequent, changes (e.g., AMR)
- Slow and small changes (e.g., particle migration)
- Parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
- Enables learning / adaptive algorithms
- Adaptive communication libraries
- Measurement-based load balancing
46 Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
- Measures communication volume and computation time
- Measurement-based load balancers
- Use the instrumented database periodically to make new decisions
- Many alternative strategies can use the database
- Centralized vs. distributed
- Greedy improvements vs. complete reassignments
- Taking communication into account
- Taking dependences into account (more complex)
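One of the simplest strategies the measured database can drive is a centralized greedy reassignment: place objects, heaviest first, on the currently least-loaded processor. The sketch below ignores communication and dependences, which (as the slide notes) make the real problem harder, and its names are illustrative rather than Charm++'s balancer API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy rebalancing from measured per-object loads: assign objects,
// heaviest first, to the currently least-loaded processor.
std::vector<int> greedy_balance(std::vector<double> loads, int num_procs) {
    // sort object ids by measured load, heaviest first
    std::vector<std::size_t> order(loads.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return loads[a] > loads[b]; });

    std::vector<double> proc_load(num_procs, 0.0);
    std::vector<int> assignment(loads.size());
    for (std::size_t obj : order) {
        int least = static_cast<int>(
            std::min_element(proc_load.begin(), proc_load.end()) -
            proc_load.begin());
        assignment[obj] = least;          // migrate object to that processor
        proc_load[least] += loads[obj];
    }
    return assignment;
}
```

Because objects (not processes) are the unit of work, acting on the new assignment is just a set of migrations; user code is untouched.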
47 Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated
48 Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication is from/to objects, not processors
- Load balancers use this to optimize object placement
- Communication libraries can optimize
- By substituting the most suitable algorithm for each operation
- Learning at runtime
- E.g., each-to-all individualized sends
- Performance depends on many runtime characteristics
- The library switches between different algorithms
V. Krishnan, MS Thesis, 1996
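The runtime switching described here can be sketched with a toy cost model: a direct all-to-all sends P-1 messages per processor, while a mesh-style combining scheme sends far fewer messages but moves extra data, so it wins for small messages. The formulas and constants below are illustrative assumptions, not measurements from Lemieux.

```cpp
#include <cmath>
#include <string>

// Toy per-processor cost models: alpha = per-message cost (s),
// beta = per-byte cost (s/byte). Direct: P-1 messages. 2D-mesh combining:
// ~2*(sqrt(P)-1) messages, each carrying ~sqrt(P) times the data.
// All constants are made up for illustration.
std::string choose_all_to_all(int P, double bytes,
                              double alpha = 10e-6, double beta = 4e-9) {
    double direct = (P - 1) * (alpha + beta * bytes);
    double s = std::sqrt(static_cast<double>(P));
    double mesh = 2.0 * (s - 1) * (alpha + beta * bytes * s);
    return mesh < direct ? "mesh" : "direct";
}
```

Under this model, a 76-byte message on 1024 processors favors the combining scheme, while a megabyte message favors direct sends; a learning library measures rather than assumes these constants.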
49 All-to-All on Lemieux for a 76-Byte Message
50 Impact on Application Performance
NAMD performance on Lemieux, with the transpose step implemented using different all-to-all algorithms
51 Overhead of Virtualization
Isn't there significant overhead to virtualization? No! Not in most cases.
52 Ongoing Research
- Fault tolerance
- Much easier at the object level: TMR, efficient variations
- However, checkpointing used to be such an efficient alternative (low forward-path cost)
- Resurrect past research
- Programming petaFLOPS machines
- Programming environment
- Simulation and performance prediction
- Communication optimizations: grids
- Dealing with limited virtual memory space
53 Applications on the Current Emulator
- Using Charm++
- LeanMD
- Research-quality molecular dynamics
- Version 0: only electrostatics and van der Waals
- Simple AMR kernel
- Adaptive tree to generate millions of objects
- Each holding a 3D array
- Communication with neighbors
- The tree makes it harder to find neighbors, but Charm++ makes it easy
54 Applications: Funded Collaborations
- Molecular dynamics for biophysics: NAMD
- QM/MM: Car-Parrinello
- Materials
- Microstructure: dendritic growth
- Bridging the gap between atomistic and FEM models
- Space-time meshing
- Rocket simulation
- DOE ASCI Center
- Computational astrophysics
Developing CS enabling technology in the context of real applications
55 QM Using the Car-Parrinello Method (Glenn Martyna, Mark Tuckerman et al.)
56 Evolution of a Galaxy in Its Cosmological Context
Thomas Quinn et al.
Need to bridge the length gap: multiple modules, communication optimizations, dynamic load balancing
57 Ongoing Research
- Load balancing
- The Charm++ framework allows both distributed and centralized strategies
- In recent years, we focused on centralized
- Still OK at 3,000 processors for NAMD
- Reverting to older work on distributed balancing
- Need to handle locality of communication
- Topology-sensitive placement
- Need to work with global information
- Approximate global info
- Incomplete global info (only neighborhood)
- Achieving global effects by local action
58 [Figure: application components A, B, C, D connected by orchestration support and data transfer, layered over framework components (Unmesh, MBlock, Particles, AMR support, Solvers) and parallel standard libraries]
59 Benefits of Virtualization: Summary
- Software engineering
- Number of virtual processors can be independently controlled
- Separate VPs for modules
- Message-driven execution
- Adaptive overlap
- Modularity
- Predictability
- Automatic out-of-core execution
- Cache management
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change the set of processors
- Principle of persistence
- Enables runtime optimizations
- Automatic dynamic load balancing
- Communication optimizations
- Other runtime optimizations
More info: http://charm.cs.uiuc.edu