Slide 1: Empirical Evaluation of Shared Parallel Execution on Independently Scheduled Clusters
Mala Ghanesh, Satish Kumar, Jaspal Subhlok
University of Houston
CCGrid, May 2005
Slide 2: Scheduling Parallel Threads
- Space Sharing / Gang Scheduling
  - All parallel threads of an application are scheduled together by a global scheduler
- Independent Scheduling
  - Threads are scheduled independently on each node of a parallel system by the local scheduler
Slide 3: Space Sharing and Gang Scheduling
[Figure: schedules on nodes N1-N4 over time slices T1-T6, contrasting gang scheduling with space sharing. Threads of application A are a1-a4; threads of application B are b1-b4.]
Slide 4: Independent Scheduling and Gang Scheduling
[Figure: schedules on nodes N1-N4 over time slices T1-T6, contrasting gang scheduling with independent scheduling.]
Slide 5: Gang versus Independent Scheduling
- Gang scheduling is the de facto standard for parallel computation clusters
- How does independent scheduling compare?
  - More flexible: no central scheduler required
  - Potentially uses resources more efficiently
  - Potentially increases synchronization overhead
Slide 6: Synchronization/Communication with Independent Scheduling
- With strict independent round-robin scheduling, parallel threads may never be able to communicate! Fortunately, scheduling is never strictly round-robin, but this is a significant performance issue (see the sketch below).
[Figure: nodes N1-N4 over time slices T1-T6, showing misaligned schedules of communicating threads.]
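To make the issue concrete, the sketch below (ours, not from the paper) shows the kind of blocking MPI exchange that is sensitive to scheduling: the MPI_Recv on one node cannot complete until the peer's thread gets a time slice and issues the matching MPI_Send.

    /* Illustrative sketch: a blocking ping-pong between ranks 0 and 1.
     * Under independent scheduling, the MPI_Recv on one node may stall
     * until the peer's thread is scheduled by its local OS and sends. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, msg = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 1000; i++) {
            if (rank == 0) {
                MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                /* Blocks until rank 0's thread gets a time slice and sends. */
                MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }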
Slide 7: Research in This Paper
- How does node sharing with independent scheduling perform in practice?
- Improved resource utilization versus higher synchronization overhead?
- Dependence on application characteristics?
- Dependence on CPU time slice values?
Slide 8: Experiments
- All experiments with NAS benchmarks on 2 clusters
- Benchmark programs executed
  - In dedicated mode on a cluster
  - With node sharing with competing applications
- Slowdown due to sharing analyzed (defined below)
- Above experiments conducted with
  - Various node and thread counts
  - Various CPU time slice values
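Slowdown here means the percentage increase in execution time over a dedicated run; in symbols (our notation, matching how the results are plotted):

\[
\text{slowdown} = \frac{T_{\text{shared}} - T_{\text{dedicated}}}{T_{\text{dedicated}}} \times 100\%
\]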
Slide 9: Experimental Setup
- Two clusters are used
  - 10 nodes, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect
  - 18 nodes, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect
- NAS Parallel Benchmarks 2.3, Class B, MPI versions
  - CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
  - SP and BT compiled for 4, 9, 16, 36 threads
- IS (Integer Sort) and CG (Conjugate Gradient) are the most communication-intensive benchmarks; EP (Embarrassingly Parallel) has no communication.
Slide 10: Experiment 1
- NAS benchmarks compiled for 4, 8/9, and 16 threads
- Benchmarks first executed in dedicated mode with one thread per node
- Then executed with 2 additional competing threads on each node
  - Each node has 2 CPUs, so a minimum of 3 total threads is needed to cause contention
- Competing load threads are simple compute loops with no communication (see the sketch below)
- Slowdown (percentage increase in execution time) plotted
- Nominal slowdown is 50%, used for comparison as the gang scheduling slowdown: 3 threads sharing 2 CPUs get 2/3 of a CPU each, stretching execution time by 1.5x
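The slides do not show the load generator's code; the following is a minimal sketch, assuming nothing beyond the description above (a plain compute loop with no communication), of what such a competing thread could look like:

    /* Hypothetical load thread: a pure compute loop with no communication
     * or I/O. It runs until killed, competing with the benchmark for a CPU
     * on its node. (Illustrative sketch, not the code used in the paper.) */
    int main(void) {
        volatile double x = 0.0;   /* volatile keeps the loop from being optimized away */
        for (;;) {
            x += 1.0;
            if (x > 1e15)
                x = 0.0;           /* keep the accumulator bounded */
        }
        return 0;                  /* never reached; process is killed externally */
    }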
Slide 11: Results on the 10-node cluster
[Figure: percentage slowdown (0-80%) for CG, EP, IS, LU, MG, SP, BT, and the average, on 4 and 8/9 nodes; a reference line marks the 50% slowdown expected with gang scheduling.]
- Slowdown ranges around 50%
- Some increase in slowdown going from 4 to 8 nodes
Slide 12: Results on the 18-node cluster
- Broadly similar
- Slow increase in slowdown from 4 to 16 nodes
Slide 13: Remarks
- Why is slowdown not much higher?
  - Scheduling is not strict round-robin: a blocked application thread gets scheduled again on message arrival
  - This leads to self-synchronization: threads of the same application across nodes get scheduled together
  - Applications often have significant wait times that competing applications use under sharing
- The increase in slowdown with more nodes is expected, as communication operations become more complex
  - The rate of increase is modest
Slide 14: Experiment 2
- Similar to the previous batch of experiments, except
  - 2 application threads per node
  - 1 load thread per node
- Nominal slowdown is still 50%
Slide 15: Performance with 1 and 2 app threads/node
[Figure: percentage slowdown (0-80%) for CG, EP, IS, LU, MG, SP, BT, and the average, comparing 1 app thread per node (4 and 8/9 nodes) with 2 app threads per node (4/5 and 8 nodes); a reference line marks the 50% slowdown expected with gang scheduling.]
- Slowdown is lower for 2 threads/node
Slide 16: Performance with 1 and 2 app threads/node (continued)
[Figure: same chart as the previous slide.]
- Slowdown is lower for 2 threads/node
  - Each app thread competes with one 100% compute thread (not 2)
  - Scaling a fixed-size problem to more threads means each thread uses the CPU less efficiently, hence more free cycles are available
Slide 17: Experiment 3
- Similar to the previous batch of experiments, except
  - CPU time slice quantum varied from 30 to 200 ms (default was 50 ms)
- The CPU time slice quantum is the amount of time a process gets when others are waiting in the ready queue (a query sketch follows below)
- Intuitively, a longer time slice quantum means
  - a communication operation between nodes is less likely to be interrupted due to swapping (good)
  - a node may have to wait longer for a peer to be scheduled before communicating (bad)
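The slides do not say how the quantum was changed; on the 2.4-era kernels used here that typically meant rebuilding the kernel with a different timeslice constant (our assumption). As a small illustration, a process can at least query a round-robin quantum through the POSIX interface:

    /* Sketch: query the SCHED_RR time slice quantum via the POSIX API.
     * This only reads the quantum; changing it is kernel-dependent and is
     * not shown (the slides do not give the mechanism used). */
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        struct timespec ts;
        if (sched_rr_get_interval(getpid(), &ts) == 0) {
            /* For a SCHED_OTHER process the reported value is kernel-dependent. */
            printf("RR quantum: %ld ms\n",
                   ts.tv_sec * 1000L + ts.tv_nsec / 1000000L);
        } else {
            perror("sched_rr_get_interval");
        }
        return 0;
    }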
Slide 18: Performance with different CPU time slice quanta
- Small time slices are uniformly bad
- Medium time slices (50 ms and 100 ms) generally good
- Longer time slices good for communication-intensive codes
Slide 19: Conclusions
- Performance with independent scheduling is competitive with gang scheduling for small clusters
- Key is passive self-synchronization of application threads across the cluster
- Steady but slow increase in slowdown with a larger number of nodes
- Given the flexibility of independent scheduling, it may be a good choice for some scenarios
Slide 20: Broader Picture: Distributed Applications on Networks
- Resource selection, mapping, adapting
- Which nodes offer the best performance?
[Figure: an application mapped onto a network.]
Slide 21: End of Talk!
For more information: www.cs.uh.edu/jaspal
jaspal@uh.edu
Slide 22: Mapping Distributed Applications on Networks: State of the Art
Mapping for best performance:
1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos)
2. Find the best nodes for execution based on network status
- But this approach has significant limitations
  - Knowing network status is not the same as knowing how an application will perform
  - Frequent measurements are expensive; less frequent measurements mean stale data
Slide 23: Discovered Communication Structure of NAS Benchmarks
[Figure: per-benchmark communication graphs over threads 0-3 for BT, CG, IS, LU, MG, SP, and EP; EP shows no communication edges.]
Slide 24: CPU Behavior of NAS Benchmarks