Emulating Massively Parallel (PetaFLOPS) Machines

1
Emulating Massively Parallel (PetaFLOPS) Machines

Neelam Saboo, Arun Kumar Singla
Joshua Mostkoff Unger, Gengbin Zheng,
Laxmikant V. Kalé

http://charm.cs.uiuc.edu
Department of Computer Science
Parallel Programming Laboratory
2
Roadmap

BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work

3
Blue Gene: Processor-in-Memory Case Study

Five steps to a PetaFLOPS, taken from
http://www.research.ibm.com/bluegene/

Functional model: 34 x 34 x 36 cube of shared-memory
nodes, each having 25 processors.
4
SMP Node

25 processors
200 processing elements
Input/Output Buffer
32 x 128 bytes
Network
Connected to six neighbors via duplex link
16 bit @ 500 MHz = 1 Gigabyte/s

Latencies
5 cycles per hop
75 cycles per turn
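
The link and latency figures above can be checked with a little arithmetic. The following is a minimal sketch, not the emulator's own cost model: the slide does not define a "turn", so the code assumes dimension-ordered routing on a mesh without wraparound and charges one turn each time the route switches dimension. The function names are hypothetical.

```python
# Figures quoted on the slide.
CLOCK_HZ = 500e6          # 500 MHz
LINK_WIDTH_BITS = 16      # 16-bit duplex link
CYCLES_PER_HOP = 5
CYCLES_PER_TURN = 75


def link_bandwidth_bytes_per_s():
    """16 bits per cycle at 500 MHz, converted to bytes per second."""
    return LINK_WIDTH_BITS * CLOCK_HZ / 8


def route_cycles(src, dst):
    """Estimated network cycles between two node coordinates.

    ASSUMPTION: dimension-ordered routing, no wraparound links, and one
    'turn' charged for every change of dimension along the route.
    """
    deltas = [abs(a - b) for a, b in zip(src, dst)]
    hops = sum(deltas)
    dims_used = sum(1 for d in deltas if d)
    turns = max(dims_used - 1, 0)
    return hops * CYCLES_PER_HOP + turns * CYCLES_PER_TURN


if __name__ == "__main__":
    print(link_bandwidth_bytes_per_s())
    print(route_cycles((0, 0, 0), (3, 2, 0)))
```

Under these assumptions a single turn costs as much as 15 hops, so the number of dimension changes matters more than the distance for short routes.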

5
Processor

STATS
500 MHz
Memory-side cache eliminates coherency problems
10 cycles local cache
20 cycles remote cache
10 cycles cache miss
8 integer units sharing 2 floating-point units

8 x 25 x 40,000 = 8 x 10^6 processing elements!
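
The processing-element count can be reproduced from the functional model. This is a small sketch using only the numbers on the slides; the function and key names are hypothetical. Note that the exact 34 x 34 x 36 cube has slightly more than the rounded 40,000 nodes used in the slide's estimate.

```python
# Machine parameters quoted on the slides.
DIMS = (34, 34, 36)           # cube of shared-memory nodes
PROCESSORS_PER_NODE = 25
THREADS_PER_PROCESSOR = 8     # 8 integer units per processor


def machine_size(dims=DIMS, procs=PROCESSORS_PER_NODE,
                 threads=THREADS_PER_PROCESSOR):
    """Return node, processor, and processing-element counts."""
    nodes = dims[0] * dims[1] * dims[2]
    return {
        "nodes": nodes,
        "processors": nodes * procs,
        "processing_elements": nodes * procs * threads,
        "elements_per_node": procs * threads,
    }


if __name__ == "__main__":
    print(machine_size())
```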

6
Need for Emulator

The emulator enables programmers to develop,
compile, and run software using the programming
interface that will be used on the actual machine

7
Emulator Objectives

Emulate Blue Gene and other petaFLOPS machines.
Memory and time limitations on a single
processor require that the simulation MUST be
performed on a parallel architecture.
Issues
Assume that a program written for a
processor-in-memory machine will handle
out-of-order execution and messaging.
Therefore we don't need a complex event
queue/rollback.

8
Emulator Implementation

What are the basic data structures/interface?
Machine configuration (topology), handler
registration
Nodes with node-level shared data
Threads (associated with each node) representing
processing elements
Communication between nodes
How to handle all these objects on a parallel
architecture? How to handle object-to-object
communication?
Difficulties of implementation are eased by using
Charm++, an object-oriented parallel programming
paradigm.
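
The pieces listed above (topology, handler registration, nodes with shared data, node-to-node messages) can be sketched in a few lines. This is an illustrative toy, not the Charm++ emulator API: every class, method, and parameter name is hypothetical, it runs sequentially in one process, and the round-robin placement of emulated nodes on host processors is an assumption.

```python
from collections import deque


class EmulatedNode:
    """One emulated node: coordinates, node-level shared data, an inbox."""

    def __init__(self, coords, threads_per_node):
        self.coords = coords
        self.threads_per_node = threads_per_node
        self.shared = {}          # node-level shared data
        self.inbox = deque()      # pending (handler_id, payload) messages


class Emulator:
    def __init__(self, dims, threads_per_node, host_procs):
        self.dims = dims                  # machine configuration (topology)
        self.host_procs = host_procs      # real processors running the emulation
        self.handlers = []                # registered message handlers
        self.nodes = {
            (x, y, z): EmulatedNode((x, y, z), threads_per_node)
            for x in range(dims[0])
            for y in range(dims[1])
            for z in range(dims[2])
        }

    def register_handler(self, fn):
        """Register a handler and return the id carried inside messages."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def host_of(self, coords):
        """ASSUMED placement: linearize coordinates, then round-robin."""
        x, y, z = coords
        linear = (x * self.dims[1] + y) * self.dims[2] + z
        return linear % self.host_procs

    def send(self, dst, handler_id, payload):
        self.nodes[dst].inbox.append((handler_id, payload))

    def run(self):
        """Deliver messages until every inbox is empty; return the count."""
        delivered = 0
        pending = True
        while pending:
            pending = False
            for node in self.nodes.values():
                while node.inbox:
                    handler_id, payload = node.inbox.popleft()
                    self.handlers[handler_id](self, node, payload)
                    delivered += 1
                    pending = True
        return delivered


def hop_along_x(emu, node, payload):
    """Example handler: count the visit, then forward to the +x neighbor."""
    node.shared["visits"] = node.shared.get("visits", 0) + 1
    x, y, z = node.coords
    if x + 1 < emu.dims[0]:
        emu.send((x + 1, y, z), payload["handler"], payload)


emu = Emulator(dims=(4, 2, 2), threads_per_node=8, host_procs=3)
hid = emu.register_handler(hop_along_x)
emu.send((0, 0, 0), hid, {"handler": hid})
delivered = emu.run()
```

Because handlers are looked up by id and messages carry no ordering guarantees, the sketch needs no global event queue, which matches the assumption on the objectives slide that programs tolerate out-of-order delivery.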

9
Experiments on Emulator

Sample applications implemented
Primes
Jacobi relaxation
MD prototype

40,000 atoms, no bonds calculated, nearest-neighbor
cutoff
Ran full Blue Gene (with 8 x 10^6 threads) on 100
ASCI-Red processors
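
To give a sense of the load in the full-machine run, the sketch below divides the emulated threads among the host processors. The slides do not say how threads were placed; a contiguous block decomposition is assumed here purely for illustration, and `block_range` is a hypothetical helper.

```python
def block_range(host, n_hosts, n_items):
    """Half-open range [start, end) of emulated items owned by `host`.

    ASSUMPTION: contiguous block decomposition, with any remainder
    spread one item at a time over the lowest-numbered hosts.
    """
    base, extra = divmod(n_items, n_hosts)
    start = host * base + min(host, extra)
    end = start + base + (1 if host < extra else 0)
    return start, end


EMULATED_THREADS = 8_000_000   # 8 x 10^6 threads, from the slide
HOST_PROCESSORS = 100          # ASCI-Red processors, from the slide

start, end = block_range(0, HOST_PROCESSORS, EMULATED_THREADS)
threads_on_first_host = end - start
```

With these figures each real processor carries 80,000 emulated threads, which is why lightweight threads and cheap scheduling matter for the emulator.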

[Figure: ApoA-I, 92k atoms]
10
Collective Operations

Explore different algorithms for broadcasts and
reductions

OCTREE
LINE
RING
[Figure: collective patterns drawn on x, y, z axes]
Use primitive 30 x 30 x 20 (10 threads) Blue
Gene emulation on 50-processor Linux cluster
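
The slides name the OCTREE pattern without defining it. One plausible reading, sketched below as an assumption rather than the authors' algorithm, is a recursive broadcast: the node holding the message splits its box of nodes into up to eight sub-boxes and sends to the low corner of each one, and every receiver repeats the process on its own sub-box. The function names are hypothetical, and the code only counts tree levels; it does not model message timing.

```python
def split(lo, hi):
    """Split the half-open range [lo, hi) into one or two non-empty halves."""
    if hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


def octree_broadcast(box, level=0, received=None):
    """Simulate an octree-style broadcast over a 3D box of nodes.

    `box` is ((x0, x1), (y0, y1), (z0, z1)) with half-open ranges, and the
    node at its low corner already holds the message. Returns a dict that
    maps each node to the tree level at which it received the message.
    """
    holder = tuple(lo for lo, _ in box)
    if received is None:
        received = {holder: 0}
    xs, ys, zs = (split(lo, hi) for lo, hi in box)
    children = [(x, y, z) for x in xs for y in ys for z in zs]
    if len(children) == 1:
        return received               # single node, nothing left to send
    for child in children:
        corner = tuple(lo for lo, _ in child)
        if corner != holder:
            received[corner] = level + 1   # one message from the holder
        octree_broadcast(child, level + 1, received)
    return received


# The emulated machine size used on the slide: 30 x 30 x 20 nodes.
received = octree_broadcast(((0, 30), (0, 30), (0, 20)))
```

The depth of the tree grows with the logarithm of the longest dimension, whereas a pattern that forwards the message node by node along a line grows linearly with it.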
11
Converse BlueGene Emulator Objective

Performance estimation (with proper time
stamping)
Provide API for building Charm++ on top of
emulator.

12
Bluegene Emulator: Node Structure

Communication threads
Worker threads
inBuffer
Affinity message queue
Non-affinity message queue
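
The node structure above can be read as a small routing pipeline. The sketch below is an assumption-laden illustration, not the Converse implementation: it uses plain deques instead of real threads, `ANY_THREAD` is a hypothetical marker for messages without a target thread, and the policy that a worker drains its own affinity queue before the shared queue is assumed.

```python
from collections import deque

ANY_THREAD = -1   # hypothetical marker: message may run on any worker


class Node:
    def __init__(self, n_workers):
        self.in_buffer = deque()                            # inBuffer
        self.affinity = [deque() for _ in range(n_workers)] # one per worker
        self.non_affinity = deque()                         # shared queue

    def deliver(self, target_thread, payload):
        """Network side: an arriving message lands in the inBuffer."""
        self.in_buffer.append((target_thread, payload))

    def communication_step(self):
        """One pass of a communication thread: drain inBuffer into queues."""
        while self.in_buffer:
            target, payload = self.in_buffer.popleft()
            if target == ANY_THREAD:
                self.non_affinity.append(payload)
            else:
                self.affinity[target].append(payload)

    def next_work(self, worker):
        """ASSUMED policy: own affinity queue first, then the shared queue."""
        if self.affinity[worker]:
            return self.affinity[worker].popleft()
        if self.non_affinity:
            return self.non_affinity.popleft()
        return None


node = Node(n_workers=2)
node.deliver(1, "a")
node.deliver(ANY_THREAD, "b")
node.deliver(0, "c")
node.communication_step()
```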
13
Performance

Pingpong
Close to Converse pingpong
81-103 us vs. 92 us RTT
Charm++ pingpong
116 us RTT
Charm++ Bluegene pingpong
134-175 us RTT
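
As a rough reading of the last two measurements, the sketch below expresses the emulated round-trip time as a relative overhead over the native one. It uses only the two figures quoted on the slide; treating the native pingpong as the baseline is an interpretation, and the names are hypothetical.

```python
NATIVE_RTT_US = 116             # native pingpong RTT, from the slide
EMULATED_RTT_US = (134, 175)    # pingpong RTT on the emulator, from the slide


def overhead_percent(baseline, measured):
    """Relative overhead of `measured` over `baseline`, in percent."""
    return 100.0 * (measured - baseline) / baseline


low, high = (overhead_percent(NATIVE_RTT_US, m) for m in EMULATED_RTT_US)
```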

14
Charm++ on top of Emulator

BlueGene thread represents Charm++ node
Name conflicts
Cpv, Ctv
MsgSend, etc.
CkMyPe(), CkNumPes(), etc.

15
Future Work: Simulator

LeanMD: fully functional MD with only cutoff
How can we examine the performance of algorithms on
variants of the processor-in-memory design in a massive
system?
Several layers of detail to measure
Basic: correctly model performance, timestamp
messages with correction for out-of-order
execution
More detailed: network performance, memory
access, modeling sharing of floating-point unit,
estimation techniques
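
The "basic" layer of message timestamping can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned simulator: each emulated thread keeps a virtual clock in cycles, a message carries its predicted arrival time, and the receiver starts work at the later of its own clock and that arrival time. The class and method names are hypothetical, and the correction for messages processed out of timestamp order, which the slide lists as future work, is deliberately not implemented.

```python
class VirtualThread:
    """Emulated thread with a virtual clock measured in cycles."""

    def __init__(self, clock=0):
        self.clock = clock

    def send(self, latency):
        """Timestamp an outgoing message with its predicted arrival time."""
        return self.clock + latency

    def receive(self, arrival_time, work_cycles):
        """Charge the work for one message to the virtual clock.

        Work cannot start before the message arrives, nor before the
        thread has finished what it was already doing.
        """
        start = max(self.clock, arrival_time)
        self.clock = start + work_cycles
        return self.clock


sender = VirtualThread(clock=100)
receiver = VirtualThread()
stamp = sender.send(latency=15)
first_done = receiver.receive(stamp, work_cycles=40)
# A second message that "arrived" earlier in virtual time is still
# charged from the receiver's current clock in this simple model.
second_done = receiver.receive(120, work_cycles=10)
```

The second call shows where a correction step would be needed: the message stamped 120 is executed after work that ended at a later virtual time, so its completion time depends on the order in which the emulator happened to deliver it.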