Transcript: Emulating Massively Parallel (PetaFLOPS) Machines

1
Emulating Massively Parallel (PetaFLOPS) Machines
  • Neelam Saboo, Arun Kumar Singla
  • Joshua Mostkoff Unger, Gengbin Zheng,
  • Laxmikant V. Kalé

http://charm.cs.uiuc.edu
Department of Computer Science, Parallel
Programming Laboratory
2
Roadmap
  • BlueGene Architecture
  • Need for an Emulator
  • Charm++ BlueGene
  • Converse BlueGene
  • Future Work

3
Blue Gene: Processor-in-Memory Case Study
  • Five steps to a PetaFLOPS, taken from
  • http://www.research.ibm.com/bluegene/

FUNCTIONAL MODEL: a 34 x 34 x 36 cube of shared-memory
nodes, each having 25 processors.
4
SMP Node
  • 25 processors
  • 200 processing elements
  • Input/Output Buffer
  • 32 x 128 bytes
  • Network
  • Connected to six neighbors via duplex links
  • 16 bits @ 500 MHz = 1 Gigabyte/s
  • Latencies (see the sketch below)
  • 5 cycles per hop
  • 75 cycles per turn
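A back-of-the-envelope sketch of what those figures imply,
assuming dimension-order routing on the 34 x 34 x 36 torus;
the model and function names are illustrative, not from the
slides:

    #include <cstdio>
    #include <cstdlib>

    // Cycles for a packet traveling (dx, dy, dz) hops under
    // dimension-order routing: 5 cycles/hop, 75 cycles/turn.
    long messageCycles(int dx, int dy, int dz) {
        int hops  = abs(dx) + abs(dy) + abs(dz);
        int turns = (dx != 0) + (dy != 0) + (dz != 0) - 1;  // direction changes
        if (turns < 0) turns = 0;
        return 5L * hops + 75L * turns;
    }

    int main() {
        // Worst case on the torus: roughly half of each dimension.
        long cyc = messageCycles(17, 17, 18);
        printf("~%ld cycles = %.2f us at 500 MHz\n", cyc, cyc / 500.0);
    }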

5
Processor
  • STATS
  • 500 MHz
  • Memory-side cache eliminates coherency problems
  • 10 cycles local cache
  • 20 cycles remote cache
  • 10 cycles cache miss
  • 8 integer units sharing 2 floating
    point units
  • 8 x 25 x 40,000 = 8 x 10^6 processing elements!

6
Need for Emulator
  • The emulator enables programmers to develop,
    compile, and run software using the programming
    interface that will be used in the actual machine.

7
Emulator Objectives
  • Emulate Blue Gene and other PetaFLOPS machines.
  • Memory and time limitations on a single
    processor mean that the emulation MUST be
    performed on a parallel architecture.
  • Issues
  • Assume that a program written for a
    processor-in-memory machine will handle
    out-of-order execution and messaging.
  • Therefore we don't need a complex event
    queue/rollback.

8
Emulator Implementation
  • What are the basic data structures/interfaces?
    (see the interface sketch below)
  • Machine configuration (topology), handler
    registration
  • Nodes with node-level shared data
  • Threads (associated with each node) representing
    processing elements
  • Communication between nodes
  • How do we manage all these objects on a parallel
    architecture? How do we handle object-to-object
    communication?
  • The difficulties of implementation are eased by using
    Charm++, an object-oriented parallel programming
    paradigm.
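As a rough illustration of that interface, a hypothetical
single-node stub; the names (registerHandler, sendPacket) are
invented for this sketch and are not the emulator's actual API:

    #include <cstdio>
    #include <vector>

    typedef void (*BgHandler)(char *msg);        // message-handler signature

    static std::vector<BgHandler> handlerTable;  // handler registration

    int registerHandler(BgHandler h) {           // returns the handler's ID
        handlerTable.push_back(h);
        return (int)handlerTable.size() - 1;
    }

    // Stub "send": a real emulator would enqueue this at node (x, y, z);
    // here we dispatch locally just to show the control flow.
    void sendPacket(int x, int y, int z, int handlerID, char *data) {
        printf("packet to node (%d,%d,%d): ", x, y, z);
        handlerTable[handlerID](data);
    }

    void printHandler(char *msg) { printf("got \"%s\"\n", msg); }

    int main() {
        int h = registerHandler(printHandler);
        char msg[] = "hello";
        sendPacket(1, 2, 3, h, msg);
    }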

9
Experiments on Emulator
  • Sample applications implemented:
  • Primes
  • Jacobi relaxation (a serial sketch of the kernel
    follows below)
  • MD prototype
  • 40,000 atoms, no bonds calculated, nearest-
    neighbor cutoff
  • Ran the full Blue Gene configuration (with 8 x 10^6
    threads) on 100 ASCI Red processors

[Figure: ApoA-I, 92k atoms]
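For concreteness, a minimal serial version of the Jacobi
relaxation kernel listed above (illustrative only; the emulated
version splits the grid across emulated nodes and exchanges
boundary values as messages):

    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<double> a = {1, 0, 0, 0, 1}, b = a;
        for (int it = 0; it < 10; ++it) {            // relax toward steady state
            for (size_t i = 1; i + 1 < a.size(); ++i)
                b[i] = 0.5 * (a[i - 1] + a[i + 1]);  // average of neighbors
            std::swap(a, b);
        }
        printf("center value after 10 sweeps: %.3f\n", a[2]);
    }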
10
Collective Operations
  • Explore different algorithms for broadcasts and
    reductions (step counts compared in the sketch below)

[Figure: OCTREE, LINE, and RING broadcast patterns on the
x/y/z mesh]
Used a primitive 30 x 30 x 20 (10 threads) Blue Gene
emulation on a 50-processor Linux cluster.
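One rough way to compare the three patterns is forwarding depth;
the formulas below are back-of-the-envelope estimates for the
30 x 30 x 20 mesh, not measurements from the slides:

    #include <cmath>
    #include <cstdio>

    int main() {
        int X = 30, Y = 30, Z = 20;                // emulated mesh
        int n = X * Y * Z;                         // 18,000 nodes
        int line   = (X - 1) + (Y - 1) + (Z - 1);  // LINE: sweep each dimension
        int ring   = X / 2 + Y / 2 + Z / 2;        // RING: forward both ways
        int octree = (int)std::ceil(std::log((double)n) / std::log(8.0));
        printf("LINE %d, RING %d, OCTREE %d steps\n", line, ring, octree);
    }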
11
Converse BlueGene Emulator: Objectives
  • Performance estimation (with proper time
    stamping; see the virtual-clock sketch below)
  • Provide an API for building Charm++ on top of the
    emulator.
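One reading of "proper time stamping," sketched as a virtual-clock
rule (an illustration of the idea, not the emulator's code): each
message carries a virtual send time, and the receiver advances its
clock to the later of its local time and the message's arrival
time, plus the handler's cost.

    #include <algorithm>
    #include <cstdio>

    struct Msg { double sentAt, networkDelay; };   // virtual times, seconds

    // Advance a node's virtual clock for one received message.
    double receive(double nodeClock, const Msg &m, double handlerCost) {
        double arrival = m.sentAt + m.networkDelay;       // virtual arrival
        return std::max(nodeClock, arrival) + handlerCost;
    }

    int main() {
        double clock = 0.0;
        clock = receive(clock, {0.0, 0.82e-6}, 2.0e-6);   // one message
        printf("virtual time after handler: %.2f us\n", clock * 1e6);
    }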

12
Bluegene Emulator
[Figure: node structure, showing communication threads, worker
threads, the inBuffer, affinity message queues, and a
non-affinity message queue]
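A hypothetical C++ rendering of that node structure; the field
names and per-worker affinity queues are one plausible
interpretation of the diagram, not the emulator's actual types:

    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Message { int handlerID; int targetThread; };  // -1 = any worker

    struct BgNode {
        std::queue<Message> inBuffer;                // packets from network
        std::vector<std::queue<Message>> affinityQ;  // one per worker thread
        std::queue<Message> nonAffinityQ;            // work any worker takes

        // Communication threads drain inBuffer, routing each message to
        // its target worker's affinity queue or to the shared queue.
        void route(const Message &m) {
            if (m.targetThread >= 0) affinityQ[m.targetThread].push(m);
            else                     nonAffinityQ.push(m);
        }
    };

    int main() {
        BgNode node;
        node.affinityQ.resize(8);            // e.g. 8 worker threads
        node.route({0, 3});                  // message bound to worker 3
        node.route({1, -1});                 // message any worker may take
        printf("affinity[3]=%zu shared=%zu\n",
               node.affinityQ[3].size(), node.nonAffinityQ.size());
    }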
13
Performance
  • Pingpong
  • Close to Converse pingpong:
  • 81-103 us vs. 92 us RTT
  • Charm++ pingpong:
  • 116 us RTT
  • Charm++ Bluegene pingpong:
  • 134-175 us RTT

14
Charm++ on top of Emulator
  • A BlueGene thread represents a Charm++ node
  • Name conflicts:
  • Cpv, Ctv
  • MsgSend, etc.
  • CkMyPe(), CkNumPes(), etc.

15
Future Work: Simulator
  • LeanMD: fully functional MD with only cutoff
  • How can we examine the performance of algorithms on
    variants of the processor-in-memory design in a massive
    system?
  • Several layers of detail to measure:
  • Basic: correctly model performance; timestamp
    messages with correction for out-of-order
    execution
  • More detailed: network performance, memory
    access, modeling sharing of floating-point units,
    estimation techniques