Transcript and Presenter's Notes

Title: Processor Virtualization for Scalable Parallel Computing


1
Processor Virtualization for Scalable Parallel
Computing
  • Laxmikant Kale
  • kale@cs.uiuc.edu
  • http://charm.cs.uiuc.edu
  • Parallel Programming Laboratory
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Acknowledgements
  • Graduate students including
  • Gengbin Zheng
  • Orion Lawlor
  • Milind Bhandarkar
  • Arun Singla
  • Josh Unger
  • Terry Wilmarth
  • Sameer Kumar
  • Recent Funding
  • NSF (NGS Frederica Darema)
  • DOE (ASCI Rocket Center)
  • NIH (Molecular Dynamics)

3
Overview
  • Processor Virtualization
  • Motivation
  • Realization in AMPI and Charm++
  • Part I: Benefits
  • Better Software Engineering
  • Message Driven Execution
  • Flexible and dynamic mapping to processors
  • Principle of Persistence
  • Application Examples
  • Part II
  • PetaFLOPS Machines
  • Emulator
  • Programming Environments
  • Simulator
  • Performance prediction
  • Part III
  • Programming Models

4
Motivation
  • Research Group Mission
  • Improve Performance and Productivity in parallel
    programming
  • Via Application-oriented but Computer-Science
    Centered research
  • Parallel Computing/Programming
  • Coordination between processes
  • Resource management

5
Coordination
  • Processes, each with possibly local data
  • How do they interact with each other?
  • Data exchange and synchronization
  • Solutions proposed
  • Message passing
  • Shared variables and locks
  • Global Arrays / shmem
  • UPC
  • Asynchronous method invocation
  • Specifically shared variables
  • readonly, accumulators, tables
  • Others: Linda, ...

Each is probably suitable for different
applications and subjective tastes of programmers
6
Parallel Computing Is About Resource Management
  • Who needs resources?
  • Work units
  • Threads, function-calls, method invocations, loop
    iterations
  • Data units
  • Array segments, cache lines, stack-frames,
    messages, object variables
  • Resources
  • Processors, floating point units, thread-units
  • Memories: caches, SRAMs, DRAMs, ...
  • Programmer should not have to manage resources
    explicitly, even within one program

7
Processor Virtualization
  • Basic Idea
  • Divide the computation into a large number of
    pieces
  • Independent of number of processors
  • Typically larger than number of processors
  • Let the system map these virtual processors to
    processors
  • Old idea? G. Fox Book (86?),
  • DRMS (IBM), Data Parallel C (Michael Quinn),
    MPVM/UPVM/MIST
  • Our approach is virtualization
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt

8
Virtualization: Object-based Parallelization
The user is only concerned with the interaction between objects (VPs).
(Figure: user view of interacting virtual processors)
9
Technical Approach
  • Seek optimal division of labor between system
    and programmer

(Figure: a spectrum from Specialization to Automation, with MPI, AMPI, Charm++,
and HPF placed along it. Decomposition is done by the programmer; everything
else is automated.)
10
Message From This Talk
  • Virtualization is ready and powerful enough to meet
    the needs of tomorrow's applications and machines
  • Virtualization and the associated techniques that we
    have been exploring for the past decade are ready
    and powerful enough to meet the needs of high-end
    parallel computing and of complex and dynamic
    applications
  • These techniques are embodied in
  • Charm++
  • AMPI
  • Frameworks (Structured Grids, Unstructured Grids,
    Particles)
  • Virtualization of other coordination languages
    (UPC, GA, ..)

11
Realizations: Charm++
  • Charm++
  • Parallel C++ with data-driven objects (chares)
  • Asynchronous method invocation
  • Prioritized scheduling
  • Object Arrays
  • Object Groups
  • Information sharing abstractions: readonly,
    tables, ...
  • Mature, robust, portable (http://charm.cs.uiuc.edu)
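To make the asynchronous-method-invocation model concrete, here is a minimal sketch of a Charm++ object array; it follows the standard Charm++ hello-world pattern, but the module, class, and method names (hello, Hello, sayHi) are hypothetical for this example, and the interface file is shown only in comments.

```cpp
// Interface file (hello.ci), shown as comments for reference (hypothetical names):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     readonly int nElements;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi(int from);
//     };
//   };

#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int nElements;

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    delete m;
    nElements = 1000;                 // many more VPs than physical processors
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(nElements);
    arr[0].sayHi(-1);                 // asynchronous invocation: returns immediately
  }
  void done() { CkExit(); }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage* m) {}       // needed so elements can migrate
  void sayHi(int from) {
    if (thisIndex < nElements - 1)
      thisProxy[thisIndex + 1].sayHi(thisIndex);  // message-driven: next element fires
    else
      mainProxy.done();
  }
};

#include "hello.def.h"
```

The runtime maps the 1000 elements to however many processors are available and schedules whichever element has a message waiting.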

12
Object Arrays
  • A collection of data-driven objects
  • With a single global name for the collection
  • Each member addressed by an index
  • sparse 1D, 2D, 3D, tree, string, ...
  • Mapping of element objects to processors handled by
    the system

User's view: A[0], A[1], A[2], A[3], ...
13
Object Arrays
  • A collection of data-driven objects
  • With a single global name for the collection
  • Each member addressed by an index
  • sparse 1D, 2D, 3D, tree, string, ...
  • Mapping of element objects to processors handled by
    the system

User's view: A[0], A[1], A[2], A[3], ...
System view: elements (e.g., A[3], A[0]) placed on physical processors by the runtime
14
Object Arrays
(Figure: same user and system views as the previous slide)
15
Adaptive MPI
  • A migration path for legacy MPI codes
  • AMPI = MPI + Virtualization
  • Uses Charm++ object arrays and migratable threads
  • Minimal modifications to convert existing MPI
    programs
  • Automated via AMPizer
  • Based on Polaris Compiler Framework
  • Bindings for
  • C, C++, and Fortran90
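As a rough sketch (not from the slides), an ordinary MPI program such as the one below can be compiled for AMPI without source changes; the ampicxx/charmrun invocation in the trailing comments reflects typical AMPI usage and is an assumption, so check the AMPI manual for the exact flags.

```cpp
// Ordinary MPI code: under AMPI each "rank" becomes a migratable user-level
// thread (a virtual processor), so MPI_Comm_size reports the number of VPs.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nvp;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nvp);

  double local = 1.0, total = 0.0;
  MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("running with %d virtual processors, sum = %g\n", nvp, total);

  MPI_Finalize();
  return 0;
}

// Assumed typical build/run (AMPI provides compiler wrappers and a +vp flag):
//   ampicxx -o pgm pgm.C
//   ./charmrun +p4 ./pgm +vp64    # 64 virtual processors on 4 physical ones
```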

16
AMPI
17
AMPI
Implemented as virtual processors (user-level
migratable threads)
18
Benefits of Virtualization
  • Better Software Engineering
  • Message Driven Execution
  • Flexible and dynamic mapping to processors
  • Principle of Persistence
  • Enables Runtime Optimizations
  • Automatic Dynamic Load Balancing
  • Communication Optimizations
  • Other Runtime Optimizations

19
Modularization
  • Logical Units decoupled from Number of
    processors
  • E.g., oct-tree nodes for particle data
  • No artificial restriction on the number of
    processors (e.g., a cube or a power of 2)
  • Modularity
  • Software engineering: cohesion and coupling
  • MPI's "are on the same processor" is a bad
    coupling principle
  • Objects liberate you from that
  • E.g., solid and fluid modules in a rocket
    simulation

20
Rocket Simulation
  • Large collaboration headed by Mike Heath
  • DOE-supported ASCI center
  • Challenge
  • Multi-component code, with modules from
    independent researchers
  • MPI was common base
  • AMPI: new wine in an old bottle
  • Easier to convert
  • Can still run original codes on MPI, unchanged

21
Rocket simulation via virtual processors
22
AMPI and Roc Communication
(Figure: multiple Rocflo instances communicating via AMPI)
23
Message Driven Execution
Virtualization leads to message-driven execution, which in turn leads to
automatic adaptive overlap of computation and communication.
24
Adaptive Overlap via Data-driven Objects
  • Problem
  • Processors wait for too long at receive
    statements
  • Routine communication optimizations in MPI
  • Move sends up and receives down
  • Sometimes use irecvs, but be careful
  • With Data-driven objects
  • Adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which is likely to arrive first
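For contrast, here is a hedged sketch of the manual overlap one writes in plain MPI (post irecvs early, send the boundaries, compute on the interior, then wait); the neighbor ranks and buffers are hypothetical. With data-driven objects this scheduling is done by the runtime instead: whichever message arrives first simply triggers its handler.

```cpp
#include <mpi.h>
#include <vector>

// Manual computation/communication overlap in MPI for a 1D halo exchange.
void exchange_and_compute(int left, int right,              // hypothetical neighbor ranks
                          std::vector<double>& halo_left,
                          std::vector<double>& halo_right,
                          std::vector<double>& interior) {
  MPI_Request reqs[4];
  // Post receives as early as possible so the network can make progress.
  MPI_Irecv(halo_left.data(),  (int)halo_left.size(),  MPI_DOUBLE, left,  0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(halo_right.data(), (int)halo_right.size(), MPI_DOUBLE, right, 0,
            MPI_COMM_WORLD, &reqs[1]);
  // "Move sends up": send boundary data before the bulk computation.
  MPI_Isend(interior.data(), (int)halo_left.size(), MPI_DOUBLE, left, 0,
            MPI_COMM_WORLD, &reqs[2]);
  MPI_Isend(interior.data() + interior.size() - halo_right.size(),
            (int)halo_right.size(), MPI_DOUBLE, right, 0,
            MPI_COMM_WORLD, &reqs[3]);

  // ... compute on interior points while the messages are in flight ...

  MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  // ... only now compute the boundary points that need halo_left / halo_right ...
}
```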

25
Adaptive overlap and modules
SPMD and message-driven modules (from A. Gursoy, "Simplified Expression of
Message-Driven Programs and Quantification of Their Impact on Performance,"
Ph.D. thesis, April 1994)
26
Handling Random Load Variations via MDE
  • MDE encourages asynchrony
  • Asynchronous reductions, for example
  • Only data dependence should force synchronization
  • One benefit
  • Consider an algorithm with N steps
  • Each step has a different load balance: Tij
    (the time of step i on processor j)
  • Loose dependence between steps
  • (on neighbors, for example)
  • Sum-of-max (MPI) vs max-of-sum (MDE)
  • OS Jitter
  • Causes random processors to add delays in each
    step
  • Handled Automatically by MDE
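Written out (with $T_{ij}$ the time of step $i$ on processor $j$, as assumed above), the cost with a barrier after every step versus the message-driven cost is roughly:

```latex
T_{\mathrm{barrier}} \;=\; \sum_{i=1}^{N} \max_{j} T_{ij}
\qquad\text{vs.}\qquad
T_{\mathrm{MDE}} \;\approx\; \max_{j} \sum_{i=1}^{N} T_{ij},
\qquad\text{and}\qquad
\max_{j} \sum_{i} T_{ij} \;\le\; \sum_{i} \max_{j} T_{ij}.
```

So a random delay on one processor in one step is absorbed, unless that same processor also happens to be the slowest one in the other steps.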

27
Example: Molecular Dynamics in NAMD
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • Thousands of atoms (1,000 - 500,000)
  • 1 femtosecond time-step, millions needed!
  • At each time-step
  • Calculate forces on each atom
  • Bonds
  • Non-bonded: electrostatic and van der Waals
  • Calculate velocities and advance positions
  • Multiple time stepping: PME (3D FFT) every 4
    steps

Collaboration with K. Schulten, R. Skeel, and
coworkers
28
Parallel Molecular Dynamics
(Figure: NAMD decompositions with 192 144 VPs, 700 VPs, and 30,000 VPs)
29
Performance: NAMD on Lemieux
ATPase: 320,000 atoms including water
30
(No Transcript)
31
Molecular Dynamics: Benefits of Avoiding Barriers
  • In NAMD
  • The energy reductions were made asynchronous
  • No other global barriers are used in cut-off
    simulations
  • This came handy when
  • Running on Pittsburgh Lemieux (3000 processors)
  • The machine (and our way of using the communication
    layer) produced unpredictable, random delays in
    communication
  • A send call would remain stuck for 20 ms, for
    example
  • How did the system handle it?
  • See timeline plots

32
(No Transcript)
33
Asynchronous reductions in Jacobi
(Figure: processor timelines. With a synchronous reduction there is an idle gap
between compute phases while waiting for the reduction; with an asynchronous
reduction the compute phases overlap the reduction and the gap is avoided.)
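A hedged sketch of what the asynchronous reduction looks like in Charm++: contribute() and CkCallback/CkReductionTarget are standard Charm++ calls, while the JacobiChunk and Main classes and their methods are hypothetical names for this example (reportResidual would be declared as a [reductiontarget] entry method in the interface file).

```cpp
// Each array element contributes its local value; contribute() returns at once,
// so the element keeps processing other messages while the runtime combines the
// contributions and later delivers the result to the callback.
void JacobiChunk::doStep() {
  double localResidual = computeLocalResidual();   // hypothetical local work
  contribute(sizeof(double), &localResidual, CkReduction::sum_double,
             CkCallback(CkReductionTarget(Main, reportResidual), mainProxy));
  sendBoundaries();   // hypothetical: next iteration's halo exchange proceeds immediately
}

// Delivered asynchronously to the main chare once all contributions are in.
void Main::reportResidual(double globalResidual) {
  if (globalResidual < tolerance)
    chunkProxy.finish();             // hypothetical termination entry method
}
```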
34
Virtualization/MDE leads to predictability
  • Ability to predict
  • Which data is going to be needed and
  • Which code will execute
  • Based on the ready queue of object method
    invocations
  • So, we can
  • Prefetch data accurately
  • Prefetch code if needed
  • Out-of-core execution
  • Caches vs controllable SRAM

35
Programmable SRAMs
  • Problems with Caches
  • Cache management is based on principle of
    locality
  • A heuristic, not a perfect predictor
  • Cache miss handling is in the critical path
  • Our approach (Message-driven execution)
  • Can exploit a programmable SRAM very effectively
  • Load the relevant data into the SRAM just-in-time

36
Example: Jacobi Relaxation
Each processor may have hundreds of such objects
(a few tens of KB each, say).
When all the boundary data for an object is
available, it is added to the ready queue.
(Figure: scheduler with a scheduler's queue and a ready queue, DRAM, and
prefetch/SRAM management)
37
Flexible Dynamic Mapping to Processors
  • The system can migrate objects between processors
  • Vacate processor used by a parallel program
  • Dealing with extraneous loads on shared
    workstations
  • Adapt to speed difference between processors
  • E.g., a cluster with 500 MHz and 1 GHz processors
  • Automatic checkpointing
  • Checkpointing = migrate to disk!
  • Restart on a different number of processors
  • Shrink and Expand the set of processors used by
    an app
  • Shrink from 1000 to 900 procs. Later expand to
    1200.
  • Adaptive job scheduling for better System
    utilization
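Migration, processor vacation, and checkpoint-to-disk all rely on objects being able to serialize their state; in Charm++ this is done through a pup() routine. A minimal fragment follows; the Chunk class and its members are hypothetical, and some Charm++ versions also expect the base class pup to be called first.

```cpp
#include "pup_stl.h"   // PUP support for STL containers

// A migratable array element: the same pup() routine is used when the runtime
// migrates the element, checkpoints it to disk, or restarts the job on a
// different number of processors.
class Chunk : public CBase_Chunk {
  int nRows, nCols;
  std::vector<double> grid;            // local portion of the data
public:
  Chunk() { usesAtSync = true; }       // opt in to runtime load balancing
  Chunk(CkMigrateMessage* m) {}        // migration constructor (required)

  void pup(PUP::er& p) {
    p | nRows;
    p | nCols;
    p | grid;                          // packs on the source, unpacks on the destination
  }
};
```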

38
Faucets: Optimizing Utilization Within/Across Clusters
(Figure: job submission and job monitoring across multiple clusters)
  • http://charm.cs.uiuc.edu/research/faucets

39
Inefficient Utilization Within A Cluster
16 Processor system
Current job schedulers can yield low system utilization, a competitive problem
in the context of Faucets-like systems.
40
Two Adaptive Jobs
Adaptive jobs can shrink or expand the number of processors they use at
runtime, by migrating virtual processors.
16 Processor system
41
Job Monitoring: Appspector
42
AQS Features
  • AQS: Adaptive Queuing System
  • Multithreaded
  • Reliable and robust
  • Supports most features of standard queuing
    systems
  • Has the ability to manage adaptive jobs, currently
    implemented in Charm++ and MPI
  • Handles regular (non-adaptive) jobs

43
Cluster Utilization
(Figure: experimental and simulated cluster utilization)
44
Experimental MRT (Mean Response Time)
45
Principle of Persistence
  • Once the application is expressed in terms of
    interacting objects
  • Object communication patterns and
    computational loads tend to persist over time
  • In spite of dynamic behavior
  • Abrupt and large, but infrequent changes (e.g., AMR)
  • Slow and small changes (e.g., particle migration)
  • Parallel analog of the principle of locality
  • A heuristic that holds for most CSE applications
  • Learning / adaptive algorithms
  • Adaptive Communication libraries
  • Measurement based load balancing

46
Measurement Based Load Balancing
  • Based on Principle of persistence
  • Runtime instrumentation
  • Measures communication volume and computation
    time
  • Measurement based load balancers
  • Use the instrumented database periodically to
    make new decisions
  • Many alternative strategies can use the database
  • Centralized vs distributed
  • Greedy improvements vs complete reassignments
  • Taking communication into account
  • Taking dependences into account (More complex)
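In Charm++ this is typically driven from the application with periodic AtSync() calls, with the concrete strategy picked at launch time. A rough sketch follows: AtSync, ResumeFromSync, and the +balancer option are standard Charm++ mechanisms, while the Chunk class, LB_PERIOD, and method names are hypothetical (the element must also set usesAtSync, as in the earlier Chunk sketch).

```cpp
// Periodically hand control to the measurement-based load balancer. The runtime
// has been instrumenting each element's compute time and communication volume;
// at AtSync() it may migrate elements according to the chosen strategy.
void Chunk::iterate() {
  doOneTimestep();                        // hypothetical per-step work
  ++step;
  if (step % LB_PERIOD == 0) {
    AtSync();                             // element may be migrated after this call
  } else {
    thisProxy[thisIndex].iterate();       // continue asynchronously
  }
}

// Called by the runtime on the (possibly new) processor once balancing is done.
void Chunk::ResumeFromSync() {
  thisProxy[thisIndex].iterate();
}

// Strategy selection at launch (assumed typical usage):
//   ./charmrun +p64 ./app +balancer GreedyLB   # centralized, complete reassignment
//   ./charmrun +p64 ./app +balancer RefineLB   # greedy incremental improvements
```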

47
Load balancer in action
Automatic Load Balancing in Crack Propagation
(Figure: 1. elements added; 2. load balancer invoked; 3. chunks migrated)
48
Optimizing for Communication Patterns
  • The parallel-objects Runtime System can observe,
    instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers use this to optimize object
    placement
  • Communication libraries can optimize
  • By substituting the most suitable algorithm for each
    operation
  • Learning at runtime
  • E.g., each-to-all individualized sends
  • Performance depends on many runtime
    characteristics
  • Library switches between different algorithms

V. Krishnan, MS Thesis, 1996
49
All-to-All on Lemieux for a 76-Byte Message
50
Impact on Application Performance
NAMD performance on Lemieux, with the transpose step implemented using
different all-to-all algorithms
51
Overhead of Virtualization
Isn't there a significant overhead to virtualization? No! Not in most cases.
52
Ongoing Research
  • Fault Tolerance
  • Much easier at the object level: TMR, efficient
    variations
  • However, checkpointing used to be such an
    efficient alternative (low forward-path cost)
  • Resurrect past research
  • Programming petaFLOPS machines
  • Programming Environment
  • Simulation and Performance prediction
  • Communication optimizations: grids
  • Dealing with limited Virtual Memory Space

53
Applications on the current Emulator
  • Using Charm++
  • LeanMD
  • Research-quality molecular dynamics
  • Version 0: only electrostatics + van der Waals
  • Simple AMR kernel
  • Adaptive tree to generate millions of objects
  • Each holding a 3D array
  • Communication with neighbors
  • Tree makes it harder to find neighbors, but Charm++
    makes it easy

54
Applications: Funded Collaborations
  • Molecular dynamics for biophysics: NAMD
  • QM/MM: Car-Parrinello
  • Materials
  • Microstructure: dendritic growth
  • Bridging the gap between Atomistic and FEM models
  • Space-time Meshing
  • Rocket Simulation
  • DOE ASCI Center
  • Computational Astrophysics

Developing CS enabling Technology in the Context
of Real Applications
55
QM using the Car-Parrinello method (Glenn Martyna, Mark Tuckerman, et al.)
56
Evolution of a Galaxy in its Cosmological Context (Thomas Quinn et al.)
Need to bridge the length gap; multiple modules, communication optimizations,
dynamic load balancing
57
Ongoing Research
  • Load balancing
  • Charm++ framework allows distributed and
    centralized strategies
  • In recent years, we focused on centralized
  • Still OK at 3000 processors for NAMD
  • Reverting to older work on distributed
    balancing
  • Need to handle locality of communication
  • Topology sensitive placement
  • Need to work with global information
  • Approximate global info
  • Incomplete global info (only neighborhood)
  • Achieving global effects by local action

58
(Figure: application architecture. Application components (A, B, C, D) interact
through orchestration support and data transfer; framework components (Unmesh,
MBlock, Particles, AMR support, Solvers) are built on parallel standard
libraries.)
59
Benefits of Virtualization Summary
  • Software Engineering
  • Number of virtual processors can be independently
    controlled
  • Separate VPs for modules
  • Message Driven Execution
  • Adaptive overlap
  • Modularity
  • Predictability
  • Automatic Out-of-core
  • Cache management
  • Dynamic mapping
  • Heterogeneous clusters
  • Vacate, adjust to speed, share
  • Automatic checkpointing
  • Change the set of processors
  • Principle of Persistence
  • Enables Runtime Optimizations
  • Automatic Dynamic Load Balancing
  • Communication Optimizations
  • Other Runtime Optimizations

More info: http://charm.cs.uiuc.edu