1
Empirical Evaluation of Shared Parallel
Execution on Independently Scheduled Clusters
Mala Ghanesh, Satish Kumar, Jaspal Subhlok
University of Houston
CCGrid, May 2005
2
Scheduling Parallel Threads
  • Space Sharing/Gang Scheduling
    • All parallel threads of an application are scheduled
      together by a global scheduler
  • Independent Scheduling
    • Threads are scheduled independently on each node of a
      parallel system by the local scheduler
  (A toy comparison of the two policies is sketched below.)
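A minimal toy simulation (our illustration, not from the talk) of how the two
policies assign threads to four nodes per time slice; the two applications,
the phase offsets, and the strict round-robin local policy are assumptions
made only for this sketch.

```python
# Toy comparison of gang vs. independent scheduling on 4 nodes.
# Two 4-thread applications, A and B, with one thread of each per node.

NODES = 4
SLICES = 6

def gang_schedule():
    """A global scheduler runs all threads of one application together."""
    timeline = []
    for t in range(SLICES):
        app = "a" if t % 2 == 0 else "b"          # cluster-wide alternation
        timeline.append([f"{app}{n + 1}" for n in range(NODES)])
    return timeline

def independent_schedule(offsets):
    """Each node's local scheduler alternates its own two threads,
    possibly out of phase with the other nodes."""
    timeline = []
    for t in range(SLICES):
        row = []
        for n in range(NODES):
            app = "a" if (t + offsets[n]) % 2 == 0 else "b"
            row.append(f"{app}{n + 1}")
        timeline.append(row)
    return timeline

if __name__ == "__main__":
    print("gang:       ", gang_schedule())
    # With unequal offsets, threads of A are rarely co-scheduled across nodes.
    print("independent:", independent_schedule(offsets=[0, 1, 0, 1]))
```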

3
Space Sharing and Gang Scheduling
[Diagram: nodes N1-N4 over time slices T1-T6 under space sharing and under
gang scheduling. Threads of application A are a1-a4; threads of application
B are b1-b4.]
4
Independent Scheduling and Gang Scheduling
[Diagram: nodes N1-N4 over time slices T1-T6 under gang scheduling and under
independent scheduling.]
5
Gang versus Independent Scheduling
  • Gang scheduling is the de facto standard for parallel
    computation clusters
  • How does independent scheduling compare?
  • + More flexible: no central scheduler required
  • + Potentially uses resources more efficiently
  • - Potentially increases synchronization overhead

6
Synchronization/Communication with Independent
Scheduling

With strict independent round-robin scheduling,
parallel threads may never be able to communicate!
Fortunately, scheduling is never strictly round
robin, but this is a significant performance issue
(a toy illustration follows below).
[Diagram: threads on nodes N1-N4 over time slices T1-T6 under strict
round-robin local scheduling.]
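A small back-of-the-envelope check (assumed parameters, not from the talk) of
why strict round robin can starve communication: with k runnable threads per
node, an application thread only runs in slices where t mod k equals its
local phase, so peers with mismatched phases are never co-scheduled.

```python
# Strict round-robin local scheduling: the app thread on node n runs in
# slice t iff t % k == phase[n], where k = runnable threads per node.

def co_scheduled_slices(k, phase_a, phase_b, slices=100):
    """Count time slices in which the application's threads on two nodes
    are scheduled simultaneously."""
    return sum(1 for t in range(slices)
               if t % k == phase_a and t % k == phase_b)

# k = 3 runnable threads per node (1 app thread + 2 load threads).
print(co_scheduled_slices(3, phase_a=0, phase_b=0))  # aligned phases: 34
print(co_scheduled_slices(3, phase_a=0, phase_b=1))  # misaligned phases: 0
```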
7
Research in This Paper
  • How does node sharing with independent scheduling
    perform in practice?
  • Improved resource utilization versus higher
    synchronization overhead?
  • Dependence on application characteristics?
  • Dependence on CPU time slice values?

8
Experiments
  • All experiments with NAS benchmarks on 2 clusters
  • Benchmark programs executed
    • Dedicated mode on a cluster
    • With node sharing with competing applications
  • Slowdown due to sharing analyzed
  • Above experiments conducted with
    • Various node and thread counts
    • Various CPU time slice values

9
Experimental Setup
  • Two clusters are used
    • 10 nodes, 1 GB RAM, dual Pentium Xeon processors,
      RedHat Linux 7.2, GigE interconnect
    • 18 nodes, 1 GB RAM, dual AMD Athlon processors,
      RedHat Linux 7.3, GigE interconnect
  • NAS Parallel Benchmarks 2.3, Class B, MPI versions
    • CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
    • SP and BT compiled for 4, 9, 16, 36 threads
  • IS (Integer Sort) and CG (Conjugate Gradient) are the
    most communication-intensive benchmarks.
  • EP (Embarrassingly Parallel) has no communication.

10
Experiment 1
  • NAS benchmarks compiled for 4, 8/9, and 16 threads
  • Benchmarks first executed in dedicated mode with
    one thread per node
  • Then executed with 2 additional competing threads
    on each node
  • Each node has 2 CPUs, so a minimum of 3 total threads
    is needed to cause contention
  • Competing load threads are simple compute loops
    with no communication (a minimal sketch follows below)
  • Slowdown (percentage increase in execution time)
    plotted
  • Nominal slowdown is 50% - used for comparison as the
    gang scheduling slowdown
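A minimal sketch (our code, not the authors') of a competing load thread and
of the slowdown metric defined above; the busy-loop body and the example
timings are illustrative only.

```python
def load_thread():
    """Competing load: a simple compute loop with no communication."""
    x = 0.0
    while True:
        x = x * 1.0000001 + 1.0   # arbitrary floating-point busywork

def percentage_slowdown(t_dedicated, t_shared):
    """Slowdown = percentage increase in execution time when the benchmark
    shares its nodes with load threads."""
    return 100.0 * (t_shared - t_dedicated) / t_dedicated

# Example: 100 s dedicated vs. 150 s shared gives the nominal 50% slowdown
# expected under gang scheduling (3 threads sharing 2 CPUs -> 1.5x time).
print(percentage_slowdown(100.0, 150.0))   # -> 50.0
```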

11
Results: 10-node cluster
[Chart: percentage slowdown (0-80%) for CG, EP, IS, LU, MG, SP, BT, and the
average, on 4 nodes and on 8/9 nodes; the expected slowdown with gang
scheduling (50%) is marked for reference.]
  • Slowdown ranges around 50%
  • Some increase in slowdown going from 4 to 8 nodes

12
Results: 18-node cluster
  • Broadly similar
  • Slow increase in slowdown from 4 to 16 nodes

13
Remarks
  • Why is slowdown not much higher?
  • Scheduling is not strict round robin: a blocked
    application thread will get scheduled again on
    message arrival
  • This leads to self-synchronization - threads of the
    same application across nodes get scheduled
    together (sketched below)
  • Applications often have significant wait times
    that are used by competing applications with
    sharing
  • Increase in slowdown with more nodes is expected
    as communication operations are more complex
  • The rate of increase is modest
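A toy model (our assumptions, not the paper's) of the self-synchronization
effect on two nodes: each application thread computes for one slice, then
blocks at a barrier with its peer plus one slice of message latency, and the
local scheduler is assumed to run a just-woken application thread ahead of
the load thread. Even when the threads start out of phase, they end up
co-scheduled after the first barrier.

```python
def simulate(slices=10, offset=1):
    """Return, per time slice, which thread runs on each of two nodes.
    After the first barrier the two app threads run in the same slices."""
    ready = [0, offset]         # first slice each app thread may run in
    waiting = [False, False]    # True while a thread sits at the barrier
    arrived = [None, None]      # slice in which each thread hit the barrier
    schedule = []
    for t in range(slices):
        row = []
        for n in (0, 1):
            if not waiting[n] and t >= ready[n]:
                row.append(f"app{n}")    # app thread computes this slice
                waiting[n] = True
                arrived[n] = t
            else:
                row.append(f"load{n}")   # app thread blocked or descheduled
        if all(waiting):
            # Both peers reached the barrier: one slice of message latency,
            # then both app threads wake together.
            release = max(arrived) + 2
            ready = [release, release]
            waiting = [False, False]
        schedule.append(row)
    return schedule

for t, row in enumerate(simulate()):
    print(t, row)
```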

14
Experiment 2
  • Similar to the previous batch of experiments, except
    • 2 application threads per node
    • 1 load thread per node
  • Nominal slowdown is still 50%

15
Performance: 1 and 2 app threads/node
[Chart: percentage slowdown (0-80%) for CG, EP, IS, LU, MG, SP, BT, and the
average, comparing 1 app thread per node (4 and 8/9 nodes) with 2 app threads
per node (4/5 and 8 nodes); the expected slowdown with gang scheduling is
marked for reference.]
Slowdown is lower for 2 threads/node
16
Performance: 1 and 2 app threads/node
[Same chart as on the previous slide.]
  • Slowdown is lower for 2 threads/node
    • competing with one 100% compute thread (not 2)
    • scaling a fixed-size problem to more threads
      means each thread uses the CPU less efficiently
    • hence more free cycles are available

17
Experiment 3
  • Similar to the previous batch of experiments, except
    • CPU time slice quantum varied from 30 to 200 ms
      (default was 50 ms)
  • CPU time slice quantum is the amount of time a
    process gets when others are waiting in the ready
    queue
  • Intuitively, a longer time slice quantum means
    • a communication operation between nodes is less
      likely to be interrupted due to swapping - good
    • a node may have to wait longer for a peer to be
      scheduled before communicating - bad
  (A rough model of this trade-off is sketched below.)
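A rough, assumed model of this trade-off (ours, not the paper's): with
quantum q and k runnable threads per node, a message arriving while the peer
is descheduled waits roughly (k-1)*q/2 on average, while a short
communication phase of length c is cut by a context switch with probability
on the order of c/q.

```python
# Crude illustration of the time-slice trade-off; all numbers are assumed.

def expected_peer_wait_ms(q_ms, k):
    """Average wait for a descheduled peer: uniform over 0..(k-1)*q."""
    return (k - 1) * q_ms / 2.0

def prob_interrupted(c_ms, q_ms):
    """Chance a communication phase of length c straddles a quantum boundary."""
    return min(1.0, c_ms / q_ms)

for q in (30, 50, 100, 200):
    print(f"q={q:3d} ms  peer wait ~{expected_peer_wait_ms(q, k=3):5.1f} ms"
          f"  P(interrupt, c=5 ms) ~{prob_interrupted(5, q):.2f}")
```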

18
Performance with different CPU time slice quanta
  • Small time slices are uniformly bad
  • Medium time slices (50 ms and 100 ms) are generally
    good
  • Longer time slices are good for communication-intensive
    codes

19
Conclusions
  • Performance with independent scheduling is
    competitive with gang scheduling for small
    clusters
  • The key is passive self-synchronization of
    application threads across the cluster
  • Steady but slow increase in slowdown with larger
    numbers of nodes
  • Given the flexibility of independent scheduling,
    it may be a good choice for some scenarios

20
Broader Picture: Distributed Applications on Networks
Resource selection, mapping, adapting: which nodes offer the best
performance?
[Diagram: an application mapped onto a network.]
21
End of Talk!
FOR MORE INFORMATION www.cs.uh.edu/jaspal
jaspal@uh.edu
22
Mapping Distributed Applications on Networks:
State of the Art
Mapping for Best Performance
  • 1. Measure and model network properties, such as
    available bandwidth and CPU loads (with tools
    like NWS, Remos)
  • 2. Find the best nodes for execution based on network
    status
  • But the approach has significant limitations
    • Knowing network status is not the same as knowing
      how an application will perform
    • Frequent measurements are expensive; less
      frequent measurements mean stale data
  (A toy node-selection sketch based on such measurements follows below.)
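A toy node-selection sketch based on such measurements; the measurement
values and the scoring function are hypothetical, and a real system would
obtain the data from tools such as NWS or Remos rather than a hard-coded
table.

```python
# Toy node selection from measured network/CPU status (hypothetical data).

measurements = {
    # node: (available bandwidth in Mb/s, CPU load fraction)
    "n01": (940.0, 0.10),
    "n02": (880.0, 0.75),
    "n03": (910.0, 0.20),
    "n04": (450.0, 0.05),
    "n05": (920.0, 0.90),
}

def score(bw_mbps, cpu_load):
    """Higher is better: favor high available bandwidth and idle CPUs."""
    return bw_mbps * (1.0 - cpu_load)

def select_nodes(meas, count):
    """Rank nodes by score and return the best `count` of them."""
    ranked = sorted(meas, key=lambda n: score(*meas[n]), reverse=True)
    return ranked[:count]

print(select_nodes(measurements, 3))   # -> ['n01', 'n03', 'n04']
```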

23
Discovered Communication Structure of NAS
Benchmarks
[Diagrams: communication graphs over threads 0-3 for BT, CG, IS, LU, MG, SP,
and EP.]
24
CPU Behavior of NAS Benchmarks