Programming Environment and Performance Modeling for million-processor machines

NGS/IBM, April 2002. PPL, Dept. of Computer Science, UIUC.



1
Programming Environment and Performance Modeling
for million-processor machines
  • Laxmikant (Sanjay) Kale
  • Parallel Programming Laboratory
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • http://charm.cs.uiuc.edu

2
Context: Group Mission and Approach
  • To enhance performance and productivity in
    programming complex parallel applications
  • Performance: scalable to a very large number of
    processors
  • Productivity: of human programmers
  • Complex: irregular structure, dynamic variations
  • Approach: application-oriented yet CS-centered
    research
  • Develop enabling technology for a wide
    collection of applications
  • Develop, use, and test it in the context of real
    applications
  • Develop a standard library of reusable parallel
    components

3
Project Objective and Overview
  • Focus on extremely large parallel machines
  • Exemplified by Blue Gene/Cyclops
  • Issues
  • Programming Environment
  • Objects, threads, compiler support
  • Runtime performance adaptation
  • Performance modeling
  • Coarse-grained models
  • Fine-grained models
  • Hybrid
  • Applications
  • Unstructured meshes (FEM/crack propagation), ...

David Padua
Sanjay Kale
Sarita Adve
Philippe Geubelle
4
Project Objective and Overview
  • Focus on extremely large parallel machines
  • Exemplified by Blue Gene/Cyclops
  • Issues
  • Programming Environment
  • Runtime performance adaptation
  • Performance modeling
  • Coarse-grained models
  • Fine-grained models
  • Hybrid
  • Applications
  • Unstructured meshes (FEM/crack propagation), ...

David Padua
Sanjay Kale
Sarita Adve
Philippe Geubelle
5
Multi-partition Decomposition
  • Idea: divide the computation into a large number
    of pieces
  • Independent of the number of processors
  • Typically larger than the number of processors
  • Let the system map entities to processors
  • Optimal division of labor between system and
    programmer
  • Decomposition done by the programmer
  • Everything else automated
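
The division of labor above can be sketched as follows. This is a minimal illustration, not the Charm++ runtime: the names `decompose` and `map_pieces` are hypothetical, and the round-robin mapping is an assumption standing in for whatever mapping the system actually chooses.

```python
def decompose(n_items, n_pieces):
    """Programmer's job: split range(n_items) into n_pieces contiguous chunks."""
    base, extra = divmod(n_items, n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)   # spread the remainder evenly
        chunks.append(range(start, start + size))
        start += size
    return chunks


def map_pieces(n_pieces, n_procs):
    """System's job: assign piece ids to processors (round-robin is an assumption)."""
    return {piece: piece % n_procs for piece in range(n_pieces)}
```

Note that `decompose` never sees the processor count: the number of pieces is chosen from the problem, and only `map_pieces` depends on the machine.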

6
Object-based Parallelization
User is only concerned with interaction between
objects
Figure: User View vs. System Implementation
7
Charm++
  • Parallel C++ with data-driven objects
  • Object arrays / object collections
  • Object groups
  • Global object with a representative on each PE
  • Asynchronous method invocation
  • Prioritized scheduling
  • Information sharing abstractions: readonly,
    tables, ...
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu

8
Data driven execution
Figure: a Scheduler and a Message Q on each processor
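
Data-driven execution with a per-processor scheduler and message queue might look roughly like the sketch below. It is written in Python for brevity and is not the Charm++ API: `Scheduler`, `send`, and `Counter` are hypothetical names, and "lower number means higher priority" is an assumption.

```python
import heapq
import itertools


class Scheduler:
    """One per processor: repeatedly picks the next message and invokes its target."""

    def __init__(self):
        self._queue = []                 # heap of (priority, seq, obj, method, args)
        self._seq = itertools.count()    # tie-breaker: FIFO order within a priority

    def send(self, obj, method, *args, priority=0):
        """Asynchronous method invocation: enqueue the message and return at once."""
        heapq.heappush(self._queue, (priority, next(self._seq), obj, method, args))

    def run(self):
        """Drain the queue; handlers may enqueue further messages while running."""
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Counter:
    """Example object whose method sends a follow-up message to itself."""

    def __init__(self, sched):
        self.sched = sched
        self.log = []

    def ping(self, n):
        self.log.append(n)
        if n > 0:
            self.sched.send(self, "ping", n - 1)
```

No object ever blocks waiting for a reply; which object runs next is decided entirely by what is in the queue.
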
9
Load Balancing Framework
  • Based on object migration
  • Partitions implemented as objects (or threads)
    are mapped to available processors by the LB
    framework
  • Measurement-based load balancers
  • Principle of persistence: computational loads and
    communication patterns tend to persist over time
  • Runtime system measures actual computation times
    of every partition, as well as communication
    patterns
  • Variety of plug-in LB strategies available,
    including those for situations when the principle
    of persistence does not apply
  • Scalable to a few thousand processors
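
As an illustration of a measurement-based strategy, here is a minimal greedy sketch. The slides do not name a specific strategy, so the greedy heuristic is an assumption, `greedy_rebalance` is a hypothetical name, and communication patterns are ignored for simplicity.

```python
import heapq


def greedy_rebalance(measured_load, n_procs):
    """Map objects to processors using loads measured in the previous phase.

    measured_load: dict object_id -> measured compute time (seconds).
    Returns dict object_id -> processor index.
    """
    procs = [(0.0, p) for p in range(n_procs)]   # min-heap of (assigned load, proc)
    heapq.heapify(procs)
    placement = {}
    # Heaviest objects first, each onto the currently least-loaded processor.
    for obj, load in sorted(measured_load.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(procs)
        placement[obj] = p
        heapq.heappush(procs, (total + load, p))
    return placement
```

The principle of persistence is what justifies feeding last phase's measurements into the next phase's placement.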

10
Building on Object-based Parallelism
  • Application-induced load imbalances
  • Environment-induced performance issues
  • Dealing with extraneous loads on shared machines
  • Vacating workstations
  • Heterogeneous clusters
  • Shrinking and expanding jobs to available PEs
  • Object migration: novel uses
  • Automatic checkpointing
  • Automatic prefetching for out-of-core execution
  • Reuse: object-based components

11
Applications
  • Charm++ developed in the context of real
    applications
  • Current applications we are involved with
  • Molecular dynamics
  • Crack propagation
  • Rocket simulation: fluid dynamics + structures
  • QM/MM: material properties via quantum mechanics
  • Cosmology simulations: parallel analysis + viz
  • Cosmology: gravitational, with multiple
    timestepping

12
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step:
  • Calculate forces on each atom
  • Bonds
  • Non-bonded: electrostatic and van der Waals
  • Calculate velocities and advance positions
  • 1 femtosecond time-step; millions of steps needed!
  • Thousands of atoms (1,000 - 100,000)
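
The per-time-step loop above can be sketched as follows. This is a toy version under stated assumptions: reduced units, non-bonded forces only (Coulomb plus Lennard-Jones as the van der Waals term), no bonded terms, no cutoff, and a simple update that advances velocities and then positions. The names `pair_force` and `step` are hypothetical.

```python
def pair_force(ri, rj, qi, qj, eps=1.0, sigma=1.0, k=1.0):
    """Force on atom i due to atom j: Coulomb plus Lennard-Jones (reduced units)."""
    d = [a - b for a, b in zip(ri, rj)]          # vector from j to i
    r2 = sum(x * x for x in d)
    r = r2 ** 0.5
    coulomb = k * qi * qj / r2                   # positive means repulsive
    s6 = (sigma * sigma / r2) ** 3
    lj = 24.0 * eps * (2.0 * s6 * s6 - s6) / r   # -dU/dr for U = 4*eps*(s12 - s6)
    scale = (coulomb + lj) / r
    return [scale * x for x in d]


def step(pos, vel, charge, mass, dt):
    """One time-step: compute forces, then update velocities and positions in place."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            f = pair_force(pos[i], pos[j], charge[i], charge[j])
            for a in range(3):
                forces[i][a] += f[a]
                forces[j][a] -= f[a]             # Newton's third law
    for i in range(n):
        for a in range(3):
            vel[i][a] += dt * forces[i][a] / mass[i]
            pos[i][a] += dt * vel[i][a]
```

The all-pairs force loop dominates the cost, which is why the non-bonded computation is the natural target for parallel decomposition.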

13

Object Based Parallelization for MD
14
Performance Data: SC2000
15
Charm++ Is a Good Match for M-PIM
  • Encapsulation: objects
  • Cost model
  • Object data, read-only data, remote data
  • Migration and resource management: automatic
  • One-sided communication: since the beginning
  • Asynchronous global operations (reductions, ...)
  • Modularity
  • See 1996 paper for why data-driven objects enable
    modularity
  • Acceptability
  • C++
  • Now also AMPI on top of Charm++

16
Higher-level Models
  • Do programmers find Charm++/AMPI easy/good?
  • We think so
  • Certainly a good intermediate level model
  • Higher level abstractions can be built on it
  • But what kinds of abstractions?
  • We think domain-specific ones

17
Domain-specific frameworks
/AMPI
18
Further Match With M-PIM
  • Ability to predict
  • Which data is going to be needed and
  • Which code will execute
  • Based on the ready queue of object method
    invocations
  • So, we can
  • Prefetch data accurately
  • Prefetch code if needed
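
A small sketch of queue-driven prefetching, under stated assumptions: `run_with_prefetch` is a hypothetical name, a Python dict stands in for slow (remote or off-chip) memory, the cache never evicts, and the lookahead depth is an arbitrary parameter.

```python
from collections import deque


def run_with_prefetch(ready_queue, slow_store, lookahead=2):
    """Process queued invocations, prefetching data for the next few entries.

    ready_queue: deque of (object_id, method) pairs awaiting execution.
    slow_store: dict object_id -> data, standing in for remote or off-chip memory.
    Returns (results, number of demand misses).
    """
    cache, results, misses = {}, [], 0
    while ready_queue:
        # The queue says exactly which objects run next, so fetch their data now.
        for obj_id, _ in list(ready_queue)[1:1 + lookahead]:
            cache.setdefault(obj_id, slow_store[obj_id])
        obj_id, method = ready_queue.popleft()
        if obj_id not in cache:
            misses += 1                  # only unpredicted accesses miss
            cache[obj_id] = slow_store[obj_id]
        results.append(method(cache[obj_id]))
    return results, misses
```

With any lookahead of at least one, only the very first invocation misses in this toy setup; with `lookahead=0` every first access to an object is a demand miss.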

19
So, What Are We Doing About It?
  • How to develop any programming environment for a
    machine that isn't built yet?
  • Blue Gene/C emulator using Charm++
  • Completed last year
  • Implements low-level BG/C API
  • Packet sends, extract packet from comm buffers
  • Emulation runs on machines with hundreds of
    normal processors
  • Charm++ on Blue Gene/C emulator

20
Structure of the Emulators
Blue Gene/C Low-level API
Charm
Converse
21
Emulation on a Parallel Machine
22
Extensions to Charm++ for BG/C
  • Microtasks
  • Objects may fire microtasks that can be executed
    by any thread on the same node
  • Increases parallelism
  • Overhead: sub-microsecond
  • Issue
  • Object affinity: map to thread or node?
  • Thread, currently.
  • Microtasks alleviate load balancing within a node
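
The microtask idea (work fired by an object that any thread on the same node may pick up) can be sketched with a shared node-level queue. This only illustrates the structure: `run_node` is a hypothetical name, and Python threads say nothing about the sub-microsecond overhead quoted above.

```python
import queue
import threading


def run_node(microtasks, n_threads=4):
    """Execute (function, argument) microtasks from one shared queue on any free thread."""
    tasks = queue.Queue()
    for t in microtasks:
        tasks.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                fn, arg = tasks.get_nowait()
            except queue.Empty:
                return                   # no work left on this node
            value = fn(arg)
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because idle threads pull from the shared queue, work balances itself within the node even when the objects themselves stay pinned to one thread.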

23
Emulation efficiency
  • How much time does it take to run an emulation?
  • 8 million processors being emulated on 100 processors
  • In addition, lower cache performance
  • Lots of tiny messages
  • On a Linux cluster
  • Emulation shows good speedup

24
Emulation efficiency
1000 BG/C nodes (10x10x10), each with 200
threads (total of 200,000 user-level threads)
But the data is preliminary, based on one simulation
25
Emulator to Simulator
  • Step 1: coarse-grained simulation
  • Simulation: performance prediction capability
  • Models contention for processor/thread
  • Also models communication delay based on distance
  • Doesn't model memory access on chip, or network
  • How to do this in spite of out-of-order message
    delivery?
  • Rely on determinism of Charm++ programs
  • Time-stamped messages and threads
  • Parallel time-stamp correction algorithm
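
The distance-based communication delay mentioned in Step 1 could be modeled along the following lines. Everything specific here is an assumption: a 10x10x10 grid without wraparound links, Manhattan hop counts, and the illustrative constants `alpha` (fixed per-message cost) and `per_hop`; the slides do not give the actual cost function.

```python
def node_coords(node_id, dims=(10, 10, 10)):
    """Convert a linear node id to (x, y, z) coordinates on the grid."""
    x_dim, y_dim, _ = dims
    return (node_id % x_dim, (node_id // x_dim) % y_dim, node_id // (x_dim * y_dim))


def hops(src, dst, dims=(10, 10, 10)):
    """Manhattan distance between two nodes (assumes a mesh, not a torus)."""
    a, b = node_coords(src, dims), node_coords(dst, dims)
    return sum(abs(p - q) for p, q in zip(a, b))


def message_delay(src, dst, alpha=1.0, per_hop=0.1, dims=(10, 10, 10)):
    """Predicted delay: fixed overhead plus a cost proportional to distance."""
    return alpha + per_hop * hops(src, dst, dims)
```

A coarse-grained simulator would add this delay to a message's send time to obtain the timestamp at which it becomes available at the destination.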

26
Timestamp correction
  • Basic execution
  • Timestamped messages
  • Correction needed when
  • A message arrives with an earlier timestamp than
    other messages processed already
  • Cases
  • Messages to handlers or simple objects
  • MPI-style threads, without wildcard receives or irecvs
  • Charm++ with dependences expressed via Structured
    Dagger
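
For the simplest case (messages to a handler or simple object), the correction step can be sketched as below. This is a sequential toy, not the parallel algorithm from the slides: `TimelineObject` is a hypothetical name, each object is assumed to process messages one at a time in timestamp order, and corrections are not propagated to messages that the corrected executions themselves sent.

```python
import bisect


class TimelineObject:
    """Virtual-time log for one object that handles one message at a time."""

    def __init__(self):
        # Entries are [recv_time, msg_id, duration, start, end], sorted by recv_time.
        self.log = []

    def deliver(self, recv_time, msg_id, duration):
        """Record a message; return ids of earlier-processed messages whose times changed."""
        entry = [recv_time, msg_id, duration, None, None]
        idx = bisect.bisect_right([e[0] for e in self.log], recv_time)
        self.log.insert(idx, entry)
        corrected = []
        free_at = self.log[idx - 1][4] if idx > 0 else 0.0
        # Recompute start/end for the new message and everything after it.
        for e in self.log[idx:]:
            start = max(e[0], free_at)   # wait for arrival and for the object to be free
            end = start + e[2]
            if e is not entry and (e[3], e[4]) != (start, end):
                corrected.append(e[1])
            e[3], e[4] = start, end
            free_at = end
        return corrected

    def end_time(self, msg_id):
        """Predicted completion time of the given message."""
        return next(e[4] for e in self.log if e[1] == msg_id)
```

A late-arriving message with an earlier timestamp shifts only those later entries whose start time actually depended on the object being busy; entries separated by idle time are left untouched.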

27
Timestamps Correction
28
Timestamps Correction
29
Timestamps Correction
30
Timestamps Correction
31
Applications on the current system
  • Using BG Charm++
  • LeanMD
  • Research-quality molecular dynamics
  • Version 0: only electrostatics + van der Waals
  • Simple AMR kernel
  • Adaptive tree to generate millions of objects
  • Each holding a 3D array
  • Communication with neighbors
  • Tree makes it harder to find neighbors, but Charm++
    makes it easy

32
Emulator to Simulator
  • Step 2: add fine-grained processor simulation
  • Sarita Adve: RSIM-based simulation of a node
  • SMP node simulation completed
  • Also simulation of interconnection network
  • Millions of thread units/caches to simulate in
    detail?
  • Step 3: hybrid simulation
  • Instead, use detailed simulation to build a model
  • Drive coarse simulation using model behavior
  • Further help from compiler and RTS
33
Modeling layers
  • Applications
  • Libraries/RTS
  • Chip architecture
  • Network model
For each layer, need a detailed simulation and a
simpler (e.g. table-driven) model, and methods for
combining them
34
Summary
  • Charm++ (data-driven migratable objects)
  • is a well-matched candidate programming model
    for M-PIMs
  • We have developed an emulator/simulator
  • For BG/C
  • Runs on parallel machines
  • We have implemented multi-million-object
    applications using Charm++
  • And tested on emulated Blue Gene/C
  • More info: http://charm.cs.uiuc.edu
  • Emulator is available for download, along with
    Charm++