Title: New Experimental Results in Communication-Aware Processor Allocation for Supercomputers
1New Experimental Results in Communication-Aware
Processor Allocation for Supercomputers
- Michael Bender, SUNY Stony Brook
- David Bunde, Knox College
- Vitus Leung, Sandia National Laboratories
- Kevin Pedretti, Sandia National Laboratories
- Cynthia Phillips, Sandia National Laboratories
Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin
Company, for the United States Department of
Energy under contract DE-AC04-94AL85000.
2Computational Plant (Cplant)
- Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components)
- Up to 2048 processors
- Production computing environment
- Our job: improve parallel node allocation on Cplant to optimize performance.
3The Cplant System
- DEC Alpha processors
- Myrinet interconnect (Sandia modified)
- MPI
- Different sizes/topologies, usually 2D or 3D grid with toroidal wraps
- Ross: 2048-proc 3D mesh
- Zermatt: 128-proc 2D mesh
- Alaska: 600-proc, heavily augmented 2D mesh (cannibalized)
- Modified Linux OS (now public domain)
- Four processors/switch (compute, I/O, service nodes)
4Scheduling Environment
- Users submit jobs to queue (online)
- Users specify number of processors and runtime estimate
- If a job runs past this estimate by 5 min, it is killed
- No preemption, no migration, no multitasking (security)
- Actual runtime depends on set of processors allocated and placement of other jobs
- Goals:
- User: minimum response time
- Bureaucracy (GAO): high utilization
5Scheduler/Allocator Association
- Scheduler and allocator affect each other's performance.
Performance dependencies
6Scheduler/Allocator Dissociation
[Figure: diagram with labels Job (user, executable, processors, requested time), queue, PBS Scheduler, Node Allocator, Cplant]
- Scheduler enforces policy
- Management sets priorities for access, utilization policy
- Allocator can optimize performance
7What's a Good Allocation?
[Figure: a good allocation and a bad allocation for a 2D mesh]
- Objective: allocate jobs to processors to minimize network contention ⇒ processor locality.
- Especially important for commodity networks
8Quantitative Effect of Processor Locality
[Figure: effect of processor locality on runtime; surviving labels: "2× faster than", "empty processor"]
9Communication Hops on a 2D grid
[Figure: 2D grid with example hop counts 5 and 4]
- L1 distance = number of hops (switches) between 2 processors on grid
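As a small illustration of the hop metric, the following Python sketch computes the L1 (Manhattan) distance between two grid coordinates. The function name `l1_hops` and the coordinate-tuple representation are assumptions made for this example, and toroidal wraparound links are ignored.

```python
def l1_hops(p, q):
    """L1 (Manhattan) distance between two grid coordinates of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Two processors that differ by 3 columns and 2 rows are 5 hops apart.
print(l1_hops((0, 0), (3, 2)))  # 5
```

The same function works unchanged for 3D coordinates, since it sums over however many axes the tuples have.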
10Allocation Problem
- Given n available points on grid (some unavailable)
- Find a set of k available points with minimum average (or total) L1 distance.
- Example: green allocation 3(2) + 3(1) = 9
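To make the objective concrete, here is a minimal brute-force sketch in Python. It is exponential in k and only meant for tiny instances; it is not one of the algorithms discussed in the talk. The names `total_l1` and `best_allocation` are hypothetical, and the T-shaped point set is an assumed instance chosen because it has three pairs at distance 2 and three pairs at distance 1.

```python
from itertools import combinations


def total_l1(points):
    """Sum of pairwise L1 distances over a set of 2D grid points."""
    return sum(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in combinations(points, 2))


def best_allocation(available, k):
    """Exhaustively pick k available points minimizing total pairwise L1 distance."""
    return min(combinations(sorted(available), k), key=total_l1)


# Assumed example: a T-shaped set of four processors.
t_shape = [(0, 0), (1, 0), (2, 0), (1, 1)]
print(total_l1(t_shape))  # 9

# With a 2x2 block and one distant processor free, the block is chosen for k = 4.
available = {(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)}
print(total_l1(best_allocation(available, 4)))  # 8
```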
11Empirical Correlation
Leung et al., 2002. Related support: Mache and Lo, 1996.
12Previous Work
- Various: work forcing a convex set
- Insufficient processor utilization
- Mache, Lo, Windisch: MC algorithm
- Krumke et al.: 2-approximation, NP-hard w/general metric
- Complexity open for grids
- Dispersion problem (max distance): linear time for fixed k (Fekete and Meijer)
13Optimal Unconstrained Shape (Bender, Bender, Demaine, Fekete 2004)
Almost a circle but not quite. Only 0.05 percent difference in area.
0.650 245 952 951
14Previous Results (Bender et al 2005)
- 7/4-approximation (2 - 1/(2d) in d dimensions)
- PTAS ((1+ε)-approximation in poly time for fixed ε)
- MC is a 4-approximation
- Linear-time exact dynamic program in 1D
- O(n log n) time for k = 3
- Simulations (performance on job streams)
15Experiments Placement Algorithm MC
- Search in shells from minimum-size region of preferred shape.
- Weight processors by shells
- Return processor set with minimum weight.
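The three steps above can be sketched roughly as follows. This is a simplified illustration, not the Cplant implementation: it assumes shell 0 is a single free processor rather than a region shaped to the job, treats shell i as the processors at Chebyshev distance i from that center, and uses the shell number as the weight. The function name `mc_allocate` and the set-of-tuples representation are hypothetical.

```python
def mc_allocate(free, k):
    """Simplified MC-style search on a 2D grid.

    free: iterable of (x, y) free processors.
    Returns (weight, chosen) with the minimum total shell weight, or None
    if fewer than k processors are free.
    """
    free = sorted(set(free))
    if k > len(free):
        return None

    best = None
    for cx, cy in free:  # try every free processor as the shell-0 center
        def shell(p):
            # Assumed shell definition: Chebyshev distance from the center.
            return max(abs(p[0] - cx), abs(p[1] - cy))

        # Take the k free processors in the innermost shells.
        chosen = sorted(free, key=lambda p: (shell(p), p))[:k]
        weight = sum(shell(p) for p in chosen)
        if best is None or weight < best[0]:
            best = (weight, chosen)
    return best


# A free 3x3 block plus one distant processor: the block wins for k = 9.
free = {(x, y) for x in range(3) for y in range(3)} | {(10, 10)}
weight, chosen = mc_allocate(free, 9)
print(weight)  # 8
```

Starting from a single processor keeps the sketch short; starting shell 0 from a region of the preferred job shape, as the slide describes, would change only how `shell` is computed.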
16Alternative One-Dimensional Reduction
- Order processors so that close in linear order ⇒ close in physical processor graph
- Consider one-dimensional processor allocation
- Bin packing (best fit, first fit, sum of squares)
- Pack jobs onto the line (or ring), allowing
fragmentation
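A minimal sketch of the one-dimensional reduction, assuming the processors have already been placed in some linear order and are described by a list of free/busy flags. The names `free_intervals` and `allocate_1d` are hypothetical, only best fit and first fit are shown, the order is treated as a line rather than a ring, and the fragmentation fallback (take the first k free processors along the line) is an assumption of this example.

```python
def free_intervals(order_free):
    """Return (start, length) runs of consecutive free processors."""
    runs, start = [], None
    for i, is_free in enumerate(order_free):
        if is_free and start is None:
            start = i
        elif not is_free and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(order_free) - start))
    return runs


def allocate_1d(order_free, k, policy="best"):
    """Pick k free positions along the linear order, or None if too few are free."""
    runs = free_intervals(order_free)
    if sum(length for _, length in runs) < k:
        return None
    fitting = [r for r in runs if r[1] >= k]
    if fitting:
        # Best fit: smallest run that is large enough. First fit: earliest such run.
        start, _ = min(fitting, key=lambda r: r[1]) if policy == "best" else fitting[0]
        return list(range(start, start + k))
    # No single run is large enough: fragment the job across runs.
    return [i for i, is_free in enumerate(order_free) if is_free][:k]


line = [True, True, True, False, True, True, False, True]
print(allocate_1d(line, 2, "best"))   # [4, 5]
print(allocate_1d(line, 2, "first"))  # [0, 1]
print(allocate_1d(line, 4))           # [0, 1, 2, 4]
```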
17New System Red Storm
- 12,960 Dual-Core AMD Opteron 2.4 GHz
- 39.19 TB Memory, 340 TB disk
- 124 TF peak performance
- 3D Mesh
18Impact
- Changed the node allocator on Cplant
- 1D default allocator
- 2D algorithms implemented
- Carried over to Red Storm system software
- 1D and 2D algorithms implemented
- Selectable at compilation
- R&D 100 winner (Leung, Bender, Bunde, Pedretti, Phillips 2006)
19Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
20Does Bandwidth Make a Difference?
21Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
22Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
23Hilbert (Space-Filling) Curves
- For 2D and 3D grids
- Previous applications
- I/O efficient and cache-oblivious computation
- Compression (images)
- Domain decomposition
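For readers unfamiliar with the curve, the sketch below uses the standard bit-manipulation conversion from a position along a 2D Hilbert curve to grid coordinates. It is a generic textbook construction, not the Zoltan or Red Storm code; the function name `hilbert_d2xy` is hypothetical and the grid side `n` is assumed to be a power of two.

```python
def hilbert_d2xy(n, d):
    """Map position d (0 <= d < n*n) along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square.
                x, y = s - 1 - x, s - 1 - y
            # Transpose the current sub-square.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


print([hilbert_d2xy(2, d) for d in range(4)])  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Consecutive positions along the curve are always grid neighbors, which is the locality property that makes the curve a candidate linear order for processors.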
24Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
- Zoltan Hilbert space-filling curve
25Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
- Spliced Hilbert space-filling curve
26Results (Makespan in Seconds)
- Consistent with simulations (Bender et al 2005)
27Results (Makespan Normalized)
28Red Storm Development Machine
[Figure: machine layout; legend: I/O node, Compute node]
- Is it I/O or interprocess communication?
29Results (Makespan Normalized)
- Not I/O
- Consistent with Cplant experiments (Leung et al. 2002)
- Consistent with Pittsburgh Supercomputing Center experiments (Weisser et al. 2006)
30Experiments- Test Set
- All-to-All Communications

Job Size    Number of Jobs
2           1820
5           660
15          620
20          660

- High communication, best-case for runtime improvements
- Small number of repetitions (3)
31Questions
- What's the right allocation for a stream (online)?
- Scheduling + Allocation