Title: Programming Environment and Performance Modeling for million-processor machines
1 Programming Environment and Performance Modeling for million-processor machines
- Laxmikant (Sanjay) Kale
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- http://charm.cs.uiuc.edu
2 Context: Group Mission and Approach
- To enhance performance and productivity in programming complex parallel applications
- Performance: scalable to a very large number of processors
- Productivity: of human programmers
- Complex: irregular structure, dynamic variations
- Approach: application-oriented yet CS-centered research
- Develop enabling technology for a wide collection of apps
- Develop, use and test it in the context of real applications
- Develop a standard library of reusable parallel components
3 Project Objective and Overview
- Focus on extremely large parallel machines
- Exemplified by Blue Gene/Cyclops
- Issues
- Programming Environment
- Objects, threads, compiler support
- Runtime performance adaptation
- Performance modeling
- Coarse grained models
- Fine grained models
- Hybrid
- Applications
- Unstructured Meshes (FEM/Crack Propagation), ..
David Padua
Sanjay Kale
Sarita Adve
Philippe Geubelle
5 Multi-partition Decomposition
- Idea: divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
- Let the system map entities to processors
- Optimal division of labor between system and programmer
- Decomposition done by programmer
- Everything else automated
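The division of labor described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `decompose` and `map_round_robin` are hypothetical names, and round-robin placement stands in for whatever mapping the runtime system actually chooses.

```python
def decompose(n_items, n_pieces):
    """Programmer's job: split range(n_items) into n_pieces near-equal chunks."""
    base, extra = divmod(n_items, n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def map_round_robin(n_pieces, n_procs):
    """Stand-in for the system's job: piece i is placed on processor i % n_procs."""
    return {piece: piece % n_procs for piece in range(n_pieces)}


# The number of pieces is chosen without reference to the processor count;
# only the mapping step needs to know how many processors exist.
pieces = decompose(n_items=1000, n_pieces=64)
placement = map_round_robin(n_pieces=len(pieces), n_procs=8)
```

Running the same 64 pieces on 16 processors only changes the call to `map_round_robin`; the decomposition is untouched.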
6 Object-based Parallelization
- User is only concerned with interaction between objects
[Figure: "User View" and "System implementation" panels]
7 Charm++
- Parallel C++ with data-driven objects
- Object arrays / object collections
- Object groups
- Global object with a representative on each PE
- Asynchronous method invocation
- Prioritized scheduling
- Information-sharing abstractions: readonly, tables, ..
- Mature, robust, portable
- http://charm.cs.uiuc.edu
8 Data-driven execution
[Figure: two schedulers, each with its own message queue]
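A rough sketch of data-driven execution with asynchronous method invocation and prioritized scheduling follows. It is a single-process toy, not the runtime's actual scheduler: the `Scheduler` and `Counter` classes and the convention that a lower number means higher priority are assumptions made for illustration.

```python
import heapq
import itertools


class Scheduler:
    """One scheduler with a prioritized message queue (lower number runs first)."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO among equal priorities

    def send(self, obj, method, *args, priority=0):
        """Asynchronous invocation: enqueue the message and return immediately."""
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        """Repeatedly pick a message and invoke the method it names on its target."""
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Counter:
    """Toy object whose method may trigger further invocations."""

    def __init__(self, sched):
        self.sched = sched
        self.log = []

    def add(self, value):
        self.log.append(value)
        if value < 3:
            self.sched.send(self, "add", value + 1)
```

No object ever blocks waiting for a reply here; work happens only when the scheduler delivers a message, which is the property the slide's scheduler and message-queue picture conveys.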
9 Load Balancing Framework
- Based on object migration
- Partitions implemented as objects (or threads) are mapped to available processors by the LB framework
- Measurement-based load balancers
- Principle of persistence
- Computational loads and communication patterns
- Runtime system measures actual computation times of every partition, as well as communication patterns
- Variety of plug-in LB strategies available
- Scalable to a few thousand processors
- Including those for situations when the principle of persistence does not apply
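As one example of what a plug-in strategy might look like, here is a minimal greedy sketch that uses measured per-object times from the previous phase to place objects for the next one. It is an assumption-laden illustration: `greedy_rebalance` is a hypothetical name, communication patterns are ignored, and the slides do not say which strategies the framework ships.

```python
import heapq


def greedy_rebalance(measured_load, n_procs):
    """Place objects heaviest-first, each on the currently least-loaded processor.

    measured_load: dict of object id -> computation time measured last phase.
    Returns (placement dict, list of resulting per-processor loads).
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    placement = {}
    for obj, load in sorted(measured_load.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)
        placement[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    loads = [0.0] * n_procs
    for obj, proc in placement.items():
        loads[proc] += measured_load[obj]
    return placement, loads
```

The sketch leans on the principle of persistence: last phase's measurements are treated as a prediction of the next phase's loads.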
10 Building on Object-based Parallelism
- Application-induced load imbalances
- Environment-induced performance issues
- Dealing with extraneous loads on shared machines
- Vacating workstations
- Heterogeneous clusters
- Shrinking and expanding jobs to available PEs
- Object migration: novel uses
- Automatic checkpointing
- Automatic prefetching for out-of-core execution
- Reuse: object-based components
11 Applications
- Charm++ developed in the context of real applications
- Current applications we are involved with
- Molecular dynamics
- Crack propagation
- Rocket simulation: fluid dynamics, structures
- QM/MM: material properties via quantum mechanics
- Cosmology simulations: parallel analysis and visualization
- Cosmology: gravitational with multiple timestepping
12 Molecular Dynamics
- Collection of charged atoms, with bonds
- Newtonian mechanics
- At each time-step
- Calculate forces on each atom
- Bonds
- Non-bonded: electrostatic and van der Waals
- Calculate velocities and advance positions
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1,000 - 100,000)
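The per-step loop above (forces, then velocities, then positions) can be sketched as follows. The integrator (semi-implicit Euler) and the one-dimensional harmonic bond in `spring_forces` are assumptions chosen for brevity; the slides do not specify the integrator or force field used.

```python
def md_step(pos, vel, mass, force_fn, dt):
    """One time step: compute forces, update velocities, advance positions."""
    forces = force_fn(pos)
    new_vel = [v + f / m * dt for v, f, m in zip(vel, forces, mass)]
    new_pos = [x + v * dt for x, v in zip(pos, new_vel)]
    return new_pos, new_vel


def spring_forces(pos, k=1.0, rest=1.0):
    """Hypothetical bonded term: harmonic bonds between consecutive atoms on a line."""
    forces = [0.0] * len(pos)
    for i in range(len(pos) - 1):
        stretch = (pos[i + 1] - pos[i]) - rest
        forces[i] += k * stretch       # pulled toward the neighbor when stretched
        forces[i + 1] -= k * stretch   # equal and opposite force
    return forces


def steps_needed(simulated_seconds, dt_seconds=1e-15):
    """Number of steps to cover a simulated interval at a 1 fs time step."""
    return round(simulated_seconds / dt_seconds)
```

With a 1 fs step, `steps_needed(1e-9)` shows that a single nanosecond of simulated time already takes a million steps, which is why the cost of each step matters so much.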
13 Object Based Parallelization for MD
14 Performance Data: SC2000
15 Charm++ Is a Good Match for M-PIM
- Encapsulation: objects
- Cost model
- Object data, read-only data, remote data
- Migration and resource management: automatic
- One-sided communication: since the beginning
- Asynchronous global operations (reductions, ..)
- Modularity
- See the 1996 paper for why DD objects enable modularity
- Acceptability
- C++
- Now also AMPI on top of Charm++
16 Higher-level Models
- Do programmers find Charm++/AMPI easy/good?
- We think so
- Certainly a good intermediate level model
- Higher level abstractions can be built on it
- But what kinds of abstractions?
- We think domain-specific ones
17 Domain-specific frameworks
[Figure: domain-specific frameworks built on Charm++/AMPI]
18 Further Match With M-PIM
- Ability to predict
- Which data is going to be needed and
- Which code will execute
- Based on the ready queue of object method invocations
- So, we can
- Prefetch data accurately
- Prefetch code if needed
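To make the prediction idea concrete, the sketch below scans the front of a ready queue and lists the objects whose data is not yet local. Everything here is hypothetical: the `(object_id, method_name)` queue layout, the `resident` set, and the `lookahead` parameter are illustrative choices, not part of any API described in the slides.

```python
from collections import deque


def prefetch_plan(ready_queue, resident, lookahead=4):
    """Return object ids to fetch, in queue order, for the next `lookahead` invocations.

    ready_queue: deque of (object_id, method_name) pairs awaiting execution.
    resident: set of object ids whose data is already in local memory.
    """
    plan = []
    for obj_id, _method in list(ready_queue)[:lookahead]:
        if obj_id not in resident and obj_id not in plan:
            plan.append(obj_id)
    return plan
```

Because the queue names both the target object and the method, the same scan could also list the code to prefetch; the sketch covers data only.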
19 So, What Are We Doing About It?
- How to develop any programming environment for a machine that isn't built yet?
- Blue Gene/C emulator using Charm++
- Completed last year
- Implements low-level BG/C API
- Packet sends, extract packet from comm buffers
- Emulation runs on machines with hundreds of normal processors
- Charm++ on Blue Gene/C emulator
20 Structure of the Emulators
[Figure: software layers labeled "Blue Gene/C Low-level API", "Charm++", and "Converse"]
21 Emulation on a Parallel Machine
22 Extensions to Charm++ for BG/C
- Microtasks
- Objects may fire microtasks that can be executed by any thread on the same node
- Increases parallelism
- Overhead: sub-microsecond
- Issue
- Object affinity: map to thread or node?
- Thread, currently.
- Microtasks ease load balancing within a node
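The microtask idea, small units of work that any thread on the node may pick up, can be pictured with a shared pool drained by several workers. This sketch shows only the structure: `run_microtasks` is a hypothetical helper, and Python threads neither run in parallel nor approach the sub-microsecond overhead quoted above.

```python
import queue
import threading


def run_microtasks(tasks, n_threads):
    """Run callables from one node-level pool on n_threads worker threads."""
    pool = queue.Queue()
    for index, task in enumerate(tasks):
        pool.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = pool.get_nowait()
            except queue.Empty:
                return  # pool drained; this sketch never adds tasks later
            results[index] = task()  # each index is written by exactly one thread

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Since any idle thread takes the next microtask, work spreads across the node's threads without the objects themselves having to move.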
23 Emulation efficiency
- How much time does it take to run an emulation?
- 8 million processors being emulated on 100
- In addition, lower cache performance
- Lots of tiny messages
- On a Linux cluster
- Emulation shows good speedup
24 Emulation efficiency
1000 BG/C nodes (10x10x10), each with 200 threads (total of 200,000 user-level threads).
But the data is preliminary, based on one simulation.
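The scale figures on these two slides are easy to check with a little arithmetic. The helper names below are made up for illustration, and the first calculation assumes the "100" refers to 100 host processors.

```python
def emulated_per_host(n_emulated, n_host):
    """Average number of emulated processors each host processor must run."""
    return n_emulated // n_host


def total_threads(dims, threads_per_node):
    """Total user-level threads for a 3D grid of emulated nodes."""
    x, y, z = dims
    return x * y * z * threads_per_node
```

Under that assumption, `emulated_per_host(8_000_000, 100)` gives 80,000 emulated processors per host processor, and `total_threads((10, 10, 10), 200)` reproduces the 200,000 threads quoted for the 1000-node run.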
25 Emulator to Simulator
- Step 1: Coarse-grained simulation
- Simulation: performance prediction capability
- Models contention for processor/thread
- Also models communication delay based on distance
- Doesn't model memory access on chip, or network
- How to do this in spite of out-of-order message delivery?
- Rely on determinism of Charm++ programs
- Time-stamped messages and threads
- Parallel time-stamp correction algorithm
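A distance-based communication delay could be modeled along the following lines. This is a guess at the general shape, not the simulator's actual model: it assumes nodes sit on a 3D grid without wraparound links, and `per_hop` and `overhead` are hypothetical parameters in arbitrary time units.

```python
def hops(src, dst):
    """Manhattan distance between two node coordinates on a 3D grid."""
    return sum(abs(a - b) for a, b in zip(src, dst))


def arrival_time(send_time, src, dst, per_hop, overhead):
    """Timestamp a message at its destination: fixed overhead plus a per-hop cost."""
    return send_time + overhead + per_hop * hops(src, dst)
```

On a 10x10x10 grid the farthest pair of nodes under this assumption is 27 hops apart, so the delay term is bounded by `overhead + 27 * per_hop`.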
26 Timestamp correction
- Basic execution
- Timestamped messages
- Correction needed when
- A message arrives with an earlier timestamp than other messages processed already
- Cases
- Messages to handlers or simple objects
- MPI-style threads, without wildcard or irecvs
- Charm++ with dependence expressed via Structured Dagger
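The simplest case, messages to handlers or simple objects, can be illustrated with a sequential toy. This is not the parallel algorithm the slides refer to: `TimelineObject` is a hypothetical class, it handles a single object, and it does not propagate corrected times to messages that the object sent. It does rely on the same determinism assumption, namely that processing messages in timestamp order yields the same behavior as the order in which they happened to arrive.

```python
import bisect


class TimelineObject:
    """Virtual-time log of one object that handles messages one at a time."""

    def __init__(self):
        self.events = []  # kept sorted: (arrival_time, sequence number, duration)
        self._seq = 0

    def deliver(self, arrival_time, duration):
        """Record a message; return True if it arrived out of timestamp order."""
        late = bool(self.events) and arrival_time < self.events[-1][0]
        bisect.insort(self.events, (arrival_time, self._seq, duration))
        self._seq += 1
        return late

    def finish_times(self):
        """Recompute, in timestamp order, the virtual time each message finishes."""
        clock, out = 0, []
        for arrival_time, _seq, duration in self.events:
            clock = max(clock, arrival_time) + duration
            out.append(clock)
        return out
```

When `deliver` returns True, every event after the inserted one may shift, which is the correction step; in a full simulator those shifted times would in turn change the timestamps of outgoing messages.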
27-30 Timestamp Correction (figure slides)
31 Applications on the current system
- Using BG Charm++
- LeanMD
- Research-quality molecular dynamics
- Version 0: only electrostatics and van der Waals
- Simple AMR kernel
- Adaptive tree to generate millions of objects
- Each holding a 3D array
- Communication with neighbors
- Tree makes it harder to find neighbors, but Charm++ makes it easy
32 Emulator to Simulator
- Step 2: Add fine-grained processor simulation
- Sarita Adve: RSIM-based simulation of a node
- SMP node simulation completed
- Also simulation of interconnection network
- Millions of thread units/caches to simulate in detail?
- Step 3: Hybrid simulation
- Instead, use detailed simulation to build a model
- Drive coarse simulation using model behavior
- Further help from compiler and RTS
33 Modeling layers
- Applications
- Libraries/RTS
- Chip architecture
- Network model
For each layer, we need a detailed simulation and a simpler (e.g. table-driven) model, and methods for combining them.
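One way to picture a table-driven model is to run the detailed simulation at a handful of input sizes, tabulate the results, and let the coarse simulation interpolate. The sketch below does exactly that; `build_table` and `predict` are hypothetical names, and linear interpolation is an assumption rather than a method taken from the slides.

```python
import bisect


def build_table(detailed_sim, sample_sizes):
    """Run the expensive detailed simulation at a few sizes and tabulate (size, time)."""
    return sorted((n, detailed_sim(n)) for n in sample_sizes)


def predict(table, n):
    """Cheap model: interpolate linearly between table entries, clamping at the ends."""
    sizes = [s for s, _ in table]
    if n <= sizes[0]:
        return table[0][1]
    if n >= sizes[-1]:
        return table[-1][1]
    hi = bisect.bisect_left(sizes, n)
    (n0, t0), (n1, t1) = table[hi - 1], table[hi]
    return t0 + (t1 - t0) * (n - n0) / (n1 - n0)
```

The detailed simulator is called only while building the table; afterwards each prediction is a lookup, which is what makes it affordable inside a coarse simulation of millions of threads.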
34 Summary
- Charm++ (data-driven migratable objects)
- Is a well-matched candidate programming model for M-PIMs
- We have developed an emulator/simulator
- For BG/C
- Runs on parallel machines
- We have implemented multi-million object applications using Charm++
- And tested them on the emulated Blue Gene/C
- More info: http://charm.cs.uiuc.edu
- Emulator is available for download, along with Charm++