Programming Environment and Performance Modeling for million-processor machines

Description:

NGS/IBM: April 2002. PPL, Dept. of Computer Science, UIUC. Programming Environment and ... Exemplified by Blue Gene/Cyclops. Issues: Programming Environment: ...

Slides: 35

Provided by: KALE2

Learn more at: http://charm.cs.uiuc.edu

Transcript and Presenter's Notes

Title: Programming Environment and Performance Modeling for million-processor machines

1
Programming Environment and Performance Modeling
for million-processor machines

Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu

2
Context: Group Mission and Approach

To enhance Performance and Productivity in
programming complex parallel applications
Performance: scalable to very large numbers of
processors
Productivity: of human programmers
Complex: irregular structure, dynamic variations
Approach: Application-oriented yet CS-centered
research
Develop enabling technology for a wide
collection of apps
Develop, use and test it in the context of real
applications
Develop a standard library of reusable parallel
components

3
Project Objective and Overview

Focus on extremely large parallel machines
Exemplified by Blue Gene/Cyclops
Issues
Programming Environment
Objects, threads, compiler support
Runtime performance adaptation
Performance modeling
Coarse grained models
Fine grained models
Hybrid
Applications
Unstructured Meshes (FEM/Crack Propagation), ..

David Padua
Sanjay Kale
Sarita Adve
Philippe Geubelle

4
Project Objective and Overview

Focus on extremely large parallel machines
Exemplified by Blue Gene/Cyclops
Issues
Programming Environment
Runtime performance adaptation
Performance modeling
Coarse grained models
Fine grained models
Hybrid
Applications
Unstructured Meshes (FEM/Crack Propagation), ..

David Padua
Sanjay Kale
Sarita Adve
Philippe Geubelle

5
Multi-partition Decomposition

Idea: divide the computation into a large number
of pieces
Independent of number of processors
Typically larger than number of processors
Let the system map entities to processors
Optimal division of labor between system and
programmer
Decomposition done by programmer,
everything else automated
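
The division of labor on this slide can be illustrated with a minimal Python sketch. The function names `decompose` and `map_round_robin`, and the round-robin placement itself, are assumptions made for illustration and are not the runtime's actual mapping strategy. The point is only that the number of pieces is chosen without reference to the processor count, and the mapping is left to the system.

```python
def decompose(n_items, n_pieces):
    """Programmer's job: split the index range [0, n_items) into n_pieces chunks."""
    base, extra = divmod(n_items, n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)
        pieces.append(range(start, start + size))
        start += size
    return pieces


def map_round_robin(n_pieces, n_procs):
    """System's job (hypothetical placeholder strategy): assign pieces to processors."""
    return {piece: piece % n_procs for piece in range(n_pieces)}


# 64 pieces are chosen without regard to the processor count (8 here).
pieces = decompose(n_items=1000, n_pieces=64)
mapping = map_round_robin(len(pieces), n_procs=8)
```

Because the pieces outnumber the processors, the system is free to replace the placeholder mapping with any other placement without changes to the decomposition.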

6
Object-based Parallelization

User is only concerned with interaction between
objects
[Figure: user view of the objects versus the system implementation]
7
Charm++

Parallel C++ with Data Driven Objects
Object Arrays / Object Collections
Object Groups
Global object with a representative on each PE
Asynchronous method invocation
Prioritized scheduling
Information sharing abstractions: readonly,
tables, ...
Mature, robust, portable
http://charm.cs.uiuc.edu
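
Asynchronous method invocation with prioritized scheduling can be sketched in a few lines of Python. This is a toy, single-processor illustration and not the Charm++ API: the `Scheduler` and `Counter` classes and the convention that a smaller number means higher priority are assumptions introduced here.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler: repeatedly picks the highest-priority
    pending message and invokes the target object's method."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def send(self, obj, method, *args, priority=0):
        # Asynchronous invocation: enqueue the message and return immediately.
        heapq.heappush(self._queue, (priority, next(self._order), obj, method, args))

    def run(self):
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Counter:
    def __init__(self):
        self.log = []

    def add(self, value):
        self.log.append(value)


sched = Scheduler()
counter = Counter()
sched.send(counter, "add", "low", priority=5)   # sent first, lower priority
sched.send(counter, "add", "high", priority=1)  # sent second, higher priority
sched.run()
```

Neither `send` call executes the method; execution happens only when the scheduler dequeues the message, which is what makes the execution data driven.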

8
Data driven execution
[Figure: schedulers and their message queues (labels: Scheduler, Message Q)]
9
Load Balancing Framework

Based on object migration
Partitions implemented as objects (or threads)
are mapped to available processors by LB
framework
Measurement based load balancers
Principle of persistence:
computational loads and communication patterns
tend to persist over time
Runtime system measures actual computation times
of every partition, as well as communication
patterns
Variety of plug-in LB strategies available
Scalable to a few thousand processors
Including strategies for situations when the
principle of persistence does not apply
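
A measurement-based strategy can be sketched as follows. This is a generic greedy heuristic (heaviest object first, onto the least-loaded processor), given as one plausible plug-in strategy and not as the framework's actual implementation. The name `greedy_balance` and the sample loads are assumptions, and communication patterns are ignored for brevity.

```python
import heapq


def greedy_balance(measured_loads, n_procs):
    """Assign objects to processors, heaviest first, always to the least-loaded one.

    measured_loads: dict of object id -> compute time measured in earlier steps.
    Returns (assignment dict, list of per-processor total loads).
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(measured_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    totals = [0.0] * n_procs
    for obj, proc in assignment.items():
        totals[proc] += measured_loads[obj]
    return assignment, totals


# Made-up measurements for six objects.
loads = {"a": 4.0, "b": 3.0, "c": 3.0, "d": 2.0, "e": 2.0, "f": 2.0}
assignment, totals = greedy_balance(loads, n_procs=2)
```

The sketch relies on the principle of persistence: loads measured in the recent past are used as predictions for the near future.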

10
Building on Object-based Parallelism

Application induced load imbalances
Environment induced performance issues
Dealing with extraneous loads on shared machines
Vacating workstations
Heterogeneous clusters
Shrinking and expanding jobs to available PEs
Object migration: novel uses
Automatic checkpointing
Automatic prefetching for out-of-core execution
Reuse object based components

11
Applications

Charm++ developed in the context of real
applications
Current applications we are involved with
Molecular dynamics
Crack propagation
Rocket simulation: fluid dynamics, structures
QM/MM: material properties via quantum mechanics
Cosmology simulations: parallel analysis and viz
Cosmology: gravitational, with multiple
timestepping

12
Molecular Dynamics

Collection of charged atoms, with bonds
Newtonian mechanics
At each time-step:
Calculate forces on each atom
Bonds
Non-bonded: electrostatic and van der Waals
Calculate velocities and advance positions
1 femtosecond time-step, millions needed!
Thousands of atoms (1,000 - 100,000)
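
The per-time-step loop can be sketched in plain Python. This is a minimal illustration under stated assumptions: reduced units with the Coulomb constant set to 1, electrostatic forces only (bonds and van der Waals terms omitted), a simple update of velocities followed by positions, and hypothetical function names `coulomb_forces` and `step`.

```python
def coulomb_forces(positions, charges):
    """Pairwise electrostatic forces in reduced units (Coulomb constant = 1)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            scale = charges[i] * charges[j] / (r2 * r2 ** 0.5)  # q_i q_j / r^3
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]  # Newton's third law
    return forces


def step(positions, velocities, charges, masses, dt):
    """One time-step: compute forces, update velocities, then advance positions."""
    forces = coulomb_forces(positions, charges)
    for i in range(len(positions)):
        for k in range(3):
            velocities[i][k] += dt * forces[i][k] / masses[i]
            positions[i][k] += dt * velocities[i][k]


# Two like charges, initially at rest, one unit apart.
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
for _ in range(10):
    step(pos, vel, charges=[1.0, 1.0], masses=[1.0, 1.0], dt=0.01)
```

The all-pairs force loop is the expensive part, which is why the force computation is the natural target for parallel decomposition.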

13

Object Based Parallelization for MD
14
Performance Data: SC2000
15
Charm++ Is a Good Match for M-PIM

Encapsulation: objects
Cost model
Object data, read-only data, remote data
Migration and resource management: automatic
One-sided communication since the beginning
Asynchronous global operations (reductions, ...)
Modularity
See 1996 paper for why DD Objects enable
modularity
Acceptability
C++
Now also AMPI on top of Charm++

16
Higher-level Models

Do programmers find Charm++/AMPI easy/good?
We think so
Certainly a good intermediate level model
Higher level abstractions can be built on it
But what kinds of abstractions?
We think domain-specific ones

17
Domain specific frameworks
Charm++/AMPI
18
Further Match With M-PIM

Ability to predict
Which data is going to be needed and
Which code will execute
Based on the ready queue of object method
invocations
So, we can
Prefetch data accurately
Prefetch code if needed
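
The prediction step can be sketched as a look-ahead over the ready queue. This is a simplified illustration, not the actual runtime mechanism: the queue layout as `(object id, method name)` pairs, the `prefetch_candidates` helper, and the `lookahead` parameter are all assumptions.

```python
from collections import deque


def prefetch_candidates(ready_queue, in_cache, lookahead=2):
    """Return object ids whose data should be fetched for the next `lookahead`
    queued invocations, skipping data that is already local."""
    wanted = []
    for obj_id, _method in list(ready_queue)[:lookahead]:
        if obj_id not in in_cache and obj_id not in wanted:
            wanted.append(obj_id)
    return wanted


# Hypothetical ready queue of pending method invocations.
queue = deque([
    ("obj3", "compute"),
    ("obj7", "recvForces"),
    ("obj3", "integrate"),
    ("obj9", "compute"),
])
to_fetch = prefetch_candidates(queue, in_cache={"obj7"}, lookahead=3)
```

The same look-ahead could report the queued method names to decide which code to prefetch.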

19
So, What Are We Doing About It?

How to develop any programming environment for a
machine that isn't built yet?
Blue Gene/C emulator using Charm++
Completed last year
Implements low-level BG/C API
Packet sends, extract packet from comm buffers
Emulation runs on machines with hundreds of
normal processors
Charm++ on Blue Gene/C emulator

20
Structure of the Emulators
Blue Gene/C Low-level API
Charm++
Converse
21
Emulation on a Parallel Machine
22
Extensions to Charm++ for BG/C

Microtasks
Objects may fire microtasks that can be executed
by any thread on the same node
Increases parallelism
Overhead sub-microsecond
Issue:
Object affinity: map to thread or node?
Thread, currently.
Microtasks alleviate load balancing within a node

23
Emulation efficiency

How much time does it take to run an emulation?
8 million processors being emulated on 100 processors
In addition, lower cache performance
Lots of tiny messages
On a Linux cluster
Emulation shows good speedup

24
Emulation efficiency
1000 BG/C nodes (10x10x10), each with 200
threads (total of 200,000 user-level threads)
But data is preliminary, based on one simulation
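
The scale of the emulation follows from simple arithmetic on the figures quoted on the emulation-efficiency slides. The helper name `emulated_per_host` is an assumption, and an even spread of emulated processors over host processors is assumed.

```python
def emulated_per_host(emulated_units, host_processors):
    """Average number of emulated units each host processor must carry."""
    return emulated_units / host_processors


bgc_nodes = 10 * 10 * 10          # 10x10x10 grid of BG/C nodes
threads_per_node = 200
total_threads = bgc_nodes * threads_per_node

# 8 million emulated processors spread over 100 host processors.
load_8m_on_100 = emulated_per_host(8_000_000, 100)
```

Each host processor therefore carries on the order of tens of thousands of emulated processors, which is consistent with the slide's remarks about cache performance and many tiny messages.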
25
Emulator to Simulator

Step 1: Coarse-grained simulation
Simulation: performance prediction capability
Models contention for processor/thread
Also models communication delay based on distance
Doesn't model memory access on chip, or network
How to do this in spite of out-of-order message
delivery?
Rely on determinism of Charm++ programs
Time-stamped messages and threads
Parallel time-stamp correction algorithm
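
A distance-based communication delay model can be sketched as follows. The 3D mesh without wraparound, the hop-count metric, and the parameters `per_hop_latency` and `base_overhead` are illustrative assumptions and not measured BG/C values. As the slide notes, contention inside the network is not modeled.

```python
def hops(src, dst):
    """Manhattan distance between two nodes on a 3D mesh (no wraparound assumed)."""
    return sum(abs(a - b) for a, b in zip(src, dst))


def arrival_time(send_time, src, dst, per_hop_latency, base_overhead=0.0):
    """Predicted arrival timestamp under a distance-only delay model."""
    return send_time + base_overhead + per_hop_latency * hops(src, dst)


# Corner-to-corner message on a 10x10x10 mesh: 27 hops.
t = arrival_time(send_time=100.0, src=(0, 0, 0), dst=(9, 9, 9), per_hop_latency=0.5)
```

The predicted arrival time becomes the timestamp carried by the message in the coarse-grained simulation.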

26
Timestamp correction

Basic execution
Timestamped messages
Correction needed when
A message arrives with an earlier timestamp than
other messages processed already
Cases:
Messages to handlers or simple objects
MPI-style threads, without wildcard receives or irecvs
Charm++ with dependence expressed via Structured
Dagger
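
The condition that triggers a correction can be illustrated with a sequential Python sketch for a single simulated thread. This is not the parallel algorithm the slides refer to. The `Timeline` class, its fields, and the rule that an execution starts at the later of its arrival timestamp and the end of the previous execution are assumptions made for illustration.

```python
import bisect


class Timeline:
    """Predicted execution timeline of one simulated thread (sequential sketch)."""

    def __init__(self):
        self.events = []   # sorted list of (arrival_timestamp, msg_id, duration)
        self.start = {}    # msg_id -> predicted start time

    def deliver(self, arrival, msg_id, duration):
        """Record a message in emulation order; return ids whose start time changed."""
        index = bisect.bisect(self.events, (arrival, msg_id, duration))
        self.events.insert(index, (arrival, msg_id, duration))
        # Time at which the thread becomes free before the inserted message.
        free_at = 0.0
        if index > 0:
            _, prev_id, prev_duration = self.events[index - 1]
            free_at = self.start[prev_id] + prev_duration
        corrected = []
        for arr, mid, dur in self.events[index:]:
            new_start = max(arr, free_at)
            if mid != msg_id and self.start[mid] != new_start:
                corrected.append(mid)  # this execution's predicted time must be fixed
            self.start[mid] = new_start
            free_at = new_start + dur
        return corrected


tl = Timeline()
tl.deliver(arrival=10.0, msg_id="m1", duration=5.0)         # processed first by the emulator
late = tl.deliver(arrival=8.0, msg_id="m0", duration=4.0)   # earlier timestamp, arrives later
```

In a full simulator, every corrected execution would also need to propagate new timestamps to the messages it sent. The sketch stops at detecting which executions are affected.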

27
Timestamps Correction
28
Timestamps Correction
29
Timestamps Correction
30
Timestamps Correction
31
Applications on the current system

Using BG Charm++
LeanMD
Research-quality molecular dynamics
Version 0: only electrostatics and van der Waals
Simple AMR kernel
Adaptive tree to generate millions of objects
Each holding a 3D array
Communication with neighbors
Tree makes it harder to find neighbors, but Charm++
makes it easy

32
Emulator to Simulator

Step 2: Add fine-grained processor simulation
Sarita Adve: RSIM-based simulation of a node
SMP node simulation completed
Also simulation of interconnection network
Millions of thread units/caches to simulate in
detail?
Step 3: Hybrid simulation
Instead, use detailed simulation to build a model
Drive coarse simulation using model behavior
Further help from compiler and RTS
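
The hybrid idea can be sketched as a table-driven model: run the expensive detailed simulation at a few sample points, tabulate the results, and let the coarse simulation interpolate. Everything named here is an assumption for illustration, including `build_table`, `predict`, and the stand-in `fake_detailed_sim` with its made-up cost formula. It is not RSIM output.

```python
import bisect


def build_table(detailed_sim, sample_sizes):
    """Run the (expensive) detailed simulation at a few sizes and tabulate results."""
    return [(n, detailed_sim(n)) for n in sorted(sample_sizes)]


def predict(table, n):
    """Table-driven model: linear interpolation between tabulated points,
    clamped to the first/last entry outside the sampled range."""
    sizes = [size for size, _ in table]
    i = bisect.bisect_left(sizes, n)
    if i == 0:
        return table[0][1]
    if i == len(table):
        return table[-1][1]
    (n0, t0), (n1, t1) = table[i - 1], table[i]
    return t0 + (t1 - t0) * (n - n0) / (n1 - n0)


def fake_detailed_sim(n):
    # Stand-in for a detailed node simulation; the cost formula is made up.
    return 2.0 * n + 10.0


table = build_table(fake_detailed_sim, [100, 200, 400])
estimate = predict(table, 300)
```

The coarse simulation would call `predict` where it would otherwise need a detailed run, trading accuracy for speed.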

33
Modeling layers

Applications
Libraries/RTS
Chip Architecture
Network model
For each, need a detailed simulation and a
simpler (e.g. table-driven) model
And methods for combining them
34
Summary