1
New Experimental Results in Communication-Aware
Processor Allocation for Supercomputers
  • Michael Bender, SUNY Stony Brook
  • David Bunde, Knox College
  • Vitus Leung, Sandia National Laboratories
  • Kevin Pedretti, Sandia National Laboratories
  • Cynthia Phillips, Sandia National Laboratories

Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin
Company, for the United States Department of
Energy under contract DE-AC04-94AL85000.
2
Computational Plant (Cplant)
  • Commodity-based supercomputers at Sandia National
    Laboratories (off-the-shelf components)
  • Up to 2048 processors
  • Production computing environment
  • Our job: improve parallel node allocation on
    Cplant to optimize performance.

3
The Cplant System
  • DEC Alpha processors
  • Myrinet interconnect (Sandia-modified)
  • MPI
  • Different sizes/topologies: usually 2D or 3D grid
    with toroidal wraps
  • Ross: 2048 processors, 3D mesh
  • Zermatt: 128 processors, 2D mesh
  • Alaska: 600 processors, heavily augmented 2D mesh
    (cannibalized)
  • Modified Linux OS (now public domain)
  • Four processors/switch (compute, I/O, service
    nodes)

4
Scheduling Environment
  • Users submit jobs to queue (online)
  • Users specify number of processors and runtime
    estimate
  • If a job runs past this estimate by 5 min, it is
    killed
  • No preemption, no migration, no multitasking
    (security)
  • Actual runtime depends on set of processors
    allocated and placement of other jobs
  • Goals:
  • User: minimum response time
  • Bureaucracy (GAO): high utilization

5
Scheduler/Allocator Association
  • Scheduler and allocator affect each other's
    performance.

[Diagram: performance dependencies between the scheduler and the allocator]
6
Scheduler/Allocator Dissociation
[Diagram: a job (user, executable, number of processors, requested time) enters the queue, is selected by the PBS Scheduler, assigned processors by the Node Allocator, and runs on Cplant]
  • Scheduler enforces policy
  • Management sets priorities for access and the
    utilization policy
  • Allocator can optimize performance (a minimal
    sketch of this split follows)

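To make the split concrete, here is a minimal, hypothetical sketch (Python; not PBS or Cplant code) of the division of labor described above: the scheduler owns the queue and the policy for which job runs next, while the allocator independently chooses which processors that job gets. The names Job, Scheduler, and Allocator are illustrative assumptions, not the actual system software.

from dataclasses import dataclass

@dataclass
class Job:
    user: str
    executable: str
    procs: int            # number of processors requested
    est_runtime: float    # user's runtime estimate (seconds)

class Scheduler:
    """Enforces policy: decides WHICH queued job runs next (here, first job that fits)."""
    def __init__(self):
        self.queue = []
    def submit(self, job):
        self.queue.append(job)
    def next_job(self, free_count):
        for job in self.queue:
            if job.procs <= free_count:   # first queued job that fits the free processors
                self.queue.remove(job)
                return job
        return None

class Allocator:
    """Optimizes placement: decides WHERE the chosen job runs."""
    def allocate(self, job, free_procs):
        # Placeholder policy: take the lowest-numbered free processors.
        # The point of the talk is to replace this with a locality-aware choice.
        return sorted(free_procs)[:job.procs]

Because the two components only exchange a job and a set of free processors, the allocator can be improved without changing the scheduling policy, which is exactly the separation the slide describes.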
7
What's a Good Allocation?
[Figure: a good allocation vs. a bad allocation on a 2D mesh]
  • Objective: allocate jobs to processors to
    minimize network contention → processor locality
  • Especially important for commodity networks

8
Quantitative Effect of Processor Locality
[Figure: runtime comparison of two placements, differing by roughly 2x]
  • But note a speed-up anomaly: one placement runs
    faster than an alternative that includes an empty processor
9
Communication Hops on a 2D grid
[Figure: two example routes of 5 and 4 hops on the grid]
  • L1 distance = hops (switches) between 2
    processors on the grid

10
Allocation Problem
  • Given n available points on a grid (others are
    unavailable)
  • Find a set of k available points with minimum
    average (or total) pairwise L1 distance.
  • Example: the green allocation has total distance
    3(2) + 3(1) = 9 (see the sketch below).

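As a worked check of the objective, the short Python sketch below sums pairwise L1 distances over a candidate allocation. The four points are a hypothetical placement chosen only because it reproduces the slide's arithmetic of 3(2) + 3(1) = 9; the actual green allocation in the figure is not shown in this transcript.

from itertools import combinations

def total_l1(points):
    """Total (summed) pairwise L1 distance of an allocation on a grid."""
    return sum(abs(x1 - x2) + abs(y1 - y2)
               for (x1, y1), (x2, y2) in combinations(points, 2))

alloc = [(0, 0), (1, 0), (2, 0), (1, 1)]   # 3 pairs at distance 1, 3 pairs at distance 2
print(total_l1(alloc))                      # 9 = 3(2) + 3(1)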
11
Empirical Correlation
Leung et al., 2002. Related support: Mache and Lo, 1996.
12
Previous Work
  • Various work forcing a convex allocation set
  • Insufficient processor utilization
  • Mache, Lo, Windisch: MC algorithm
  • Krumke et al.: 2-approximation; NP-hard with a
    general metric
  • Complexity open for grids
  • Dispersion problem (max distance): linear time for
    fixed k (Fekete and Meijer)

13
Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004]
Almost a circle, but not quite: only 0.05 percent
difference in area.
[Figure: the optimal allocation shape compared with a circle]
14
Previous Results (Bender et al. 2005)
  • 7/4-approximation (2 − … in d dimensions)
  • PTAS ((1+ε)-approximation in polynomial time for
    fixed ε)
  • MC is a 4-approximation
  • Linear-time exact dynamic program in 1D
  • O(n log n) time for k = 3
  • Simulations (performance on job streams)

15
Experiments: Placement Algorithm MC
  • Search in shells outward from a minimum-size region
    of the preferred shape (a simplified sketch follows
    this list).
  • Weight processors by shell.
  • Return the processor set with minimum weight.

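The sketch below is a simplified rendering of the shell idea, assuming a 2D grid and using a single free processor as the candidate center rather than a minimum-size region of the preferred shape; it is not the production MC implementation.

def shell_allocate(free, k):
    """free: set of (x, y) free processors; return k of them with small total shell weight."""
    best_cost, best_set = None, None
    for cx, cy in free:
        # Sort free processors by shell, i.e., by L1 distance from the candidate center.
        by_shell = sorted(free, key=lambda p: abs(p[0] - cx) + abs(p[1] - cy))
        chosen = by_shell[:k]
        cost = sum(abs(x - cx) + abs(y - cy) for x, y in chosen)
        if best_cost is None or cost < best_cost:
            best_cost, best_set = cost, chosen
    return best_set

# Example: allocate 4 processors from a partially occupied 4x4 grid
# (the occupied positions are made up for illustration).
free = {(x, y) for x in range(4) for y in range(4)} - {(1, 1), (2, 2)}
print(shell_allocate(free, 4))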
16
Alternative One-Dimensional Reduction
(Presenter note: illustrate the algorithms; exact
optimization is unlikely to be efficiently solvable;
more motivation for why the default allocator is not
good enough.)
  • Order processors so that close in linear order
    implies close in the physical processor graph
  • Consider one-dimensional processor allocation
  • Bin packing (best fit, first fit, sum of squares)
  • Pack jobs onto the line (or ring), allowing
    fragmentation (a best-fit sketch follows this list)

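A minimal sketch of the one-dimensional reduction, under the assumption that processors have already been ordered along a curve so that index distance approximates physical distance. Best fit picks the tightest run of free processors that holds the job; if no run is large enough, the job is fragmented across the largest runs. The function names are illustrative, not the Cplant code.

def free_runs(free_flags):
    """Return (start, length) pairs for maximal runs of free processors in linear order."""
    runs, start = [], None
    for i, f in enumerate(free_flags + [False]):   # sentinel closes the last run
        if f and start is None:
            start = i
        elif not f and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def best_fit_allocate(free_flags, k):
    """Return indices of k free processors, preferring the smallest run that fits."""
    runs = sorted(free_runs(free_flags), key=lambda r: r[1])
    fitting = [r for r in runs if r[1] >= k]
    if fitting:                                    # best fit: the tightest sufficient run
        s, _ = fitting[0]
        return list(range(s, s + k))
    chosen = []                                    # otherwise fragment across the largest runs
    for s, length in sorted(runs, key=lambda r: -r[1]):
        take = min(length, k - len(chosen))
        chosen += list(range(s, s + take))
        if len(chosen) == k:
            break
    return chosen

# Example: 12 processors in curve order, False marks a busy processor.
flags = [True, True, False, True, True, True, False, True, False, True, True, True]
print(best_fit_allocate(flags, 3))   # picks the tightest free run of length >= 3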
17
New System Red Storm
  • 12,960 dual-core AMD Opteron processors at 2.4 GHz
  • 39.19 TB memory, 340 TB disk
  • 124 TF peak performance
  • 3D Mesh

18
Impact
  • Changed the node allocator on Cplant
  • 1D default allocator
  • 2D algorithms implemented
  • Carried over to Red Storm system software
  • 1D and 2D algorithms implemented
  • Selectable at compile time
  • R&D 100 Award winner (Leung, Bender, Bunde, Pedretti,
    Phillips 2006)

19
Red Storm Development Machine
[Figure: cabinet node map with I/O and compute nodes marked]
  • 1 Cray XT3/4 Cabinet

20
Does Bandwidth Make a Difference?
  • Yes!

21
Red Storm Development Machine
[Figure: YZ S-curve allocation ordering traced over the cabinet node map (I/O and compute nodes)]

22
Red Storm Development Machine
[Figure: ZY S-curve allocation ordering traced over the cabinet node map (I/O and compute nodes); a 2D sketch of a serpentine ordering follows]

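As a rough illustration, a serpentine (S-curve) ordering over a small mesh can be generated as below. The actual YZ/ZY traversals run over the machine's 3D node layout; this 2D version is only an assumption used to show the idea that consecutive indices stay adjacent in the mesh.

def s_curve_order(width, height):
    """Boustrophedon ('S-curve') ordering of a width x height mesh:
    sweep each row, reversing direction on alternate rows so that
    consecutive indices are always adjacent in the mesh."""
    order = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        order.extend((x, y) for x in xs)
    return order

print(s_curve_order(4, 3))   # row 0 left-to-right, row 1 right-to-left, row 2 left-to-right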
23
Hilbert (Space-Filling) Curves
  • For 2D and 3D grids
  • Previous applications
  • I/O efficient and cache-oblivious computation
  • Compression (images)
  • Domain decomposition (a 2D construction sketch follows this list)

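For reference, here is the standard d2xy bit-manipulation construction that maps a position along a 2D Hilbert curve to grid coordinates. The Red Storm allocator takes its curves from the Zoltan library, so this routine is only an illustrative stand-in, not the system's code.

def d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/reflect the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Order the nodes of an 8x8 mesh along the curve; allocating consecutive
# curve positions then tends to yield compact 2D regions.
order = [d2xy(8, d) for d in range(8 * 8)]
print(order[:8])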
24
Red Storm Development Machine
[Figure: Zoltan Hilbert space-filling-curve ordering traced over the cabinet node map (I/O and compute nodes)]

25
Red Storm Development Machine
[Figure: spliced Hilbert space-filling-curve ordering traced over the cabinet node map (I/O and compute nodes)]

26
Results (Makespan in Seconds)
  • Consistent with simulations (Bender et al. 2005)

27
Results (Makespan Normalized)
28
Red Storm Development Machine
[Figure: cabinet node map with I/O and compute nodes marked]
  • Is it I/O or interprocess communication?

29
Results (Makespan Normalized)
  • Not I/O
  • Consistent with Cplant experiments (Leung et al.
    2002)
  • Consistent with Pittsburgh Supercomputing Center
    experiments (Weisser et al. 2006)

30
Experiments: Test Set
  • All-to-all communications (a minimal kernel sketch
    follows this list)
  • Job mix:
        Job size    Number of jobs
            2            1820
            5             660
           15             620
           20             660
  • High communication: the best case for runtime
    improvements
  • Small number of repetitions (3)

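For context, a minimal mpi4py sketch of the kind of all-to-all kernel such a test exercises. The message size, timing loop, and structure here are assumptions for illustration; this is not the benchmark Sandia ran.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

count = 1 << 16                                   # assumed per-pair message size (doubles)
send = np.empty(size * count, dtype='d')
send.fill(comm.Get_rank())
recv = np.empty_like(send)

t0 = MPI.Wtime()
comm.Alltoall(send, recv)                         # every rank exchanges a block with every other rank
elapsed = MPI.Wtime() - t0
if comm.Get_rank() == 0:
    print(f"all-to-all of {count} doubles per pair took {elapsed:.6f} s")

Because every rank talks to every other rank, runtime is dominated by contention on the links shared among the allocated nodes, which is why this workload is the best case for showing placement effects.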
31
Questions
  • What's the right allocation for a stream of jobs
    (online)?
  • Scheduling + allocation combined?