Title: Designing Parallel Operating Systems via Parallel Programming
1Designing Parallel Operating Systems via Parallel Programming
Eitan Frachtenberg (1), Kei Davis (1), Fabrizio Petrini (1), Juan Fernández (1,2) and José Carlos Sancho (1)
(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA
URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN
URL: http://www.ditec.um.es
email: juanf_at_um.es
2Motivation
- Clusters have been the most successful player in
high-performance computing in the last decade
[Figure: a cluster of independent nodes, each running its own OS]
HARDWARE: Independent Nodes, High-speed Network
SOFTWARE: Commodity OS, Parallel Apps, System Software
3Motivation
- Ever-increasing demand for computing capability
is driving the construction of ever-larger
clusters
- Earth Simulator: 5120 processors
- Thunder (LLNL): 4096 processors
- ASCI Q (LANL): 8192 processors
Systems are becoming more complex, less efficient
and less reliable
4Motivation
- Clusters are loosely coupled systems used for solving inherently tightly coupled problems
- Parallel software keeps all the pieces together
- Development of parallel software is a time- and resource-consuming task due to its complexity
PROBLEM: parallel software has neither evolved nor scaled in step with cluster sizes
SOLUTION: a new approach to the design of parallel software for large-scale clusters
5Goals
- Target
- New methodology for the design of parallel software
- Simplicity, performance, scalability, reliability
- Backbone to integrate all nodes into a parallel OS
- Vision
- BSP-like system running MIMD applications
- (variable granularity in the order of hundreds of µs)
- Approach
- BSP-like global control and coordination of all system activities
- Small set of collective communication primitives for global coordination
6Outline
- Motivation and Goals
- Toward a Parallel Operating System
- Core Primitives
- Parallel Software Design
- Case Studies
- Concluding remarks
7Toward a Parallel OS
- Designing a Parallel OS
- Lack of global coordination (loose coupling)
- Redundant/missing functionality (complexity)
[Figure: layered stack, top to bottom: Resource Management, Parallel Application, ..., Parallel File System; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Hardware]
8Toward a Parallel OS
- Scientific applications are tightly coupled
- Data dependencies between nodes
- They exchange messages very often
- but the processing nodes are bolted together
in a loosely coupled fashion
Need for global control and coordination of all
the system activities, enforced by global
collective communication primitives
9Toward a Parallel OS
- Designing a Parallel OS
- System-level, global control and coordination of
all application and system software activities
10Toward a Parallel OS
- Parallel applications use point-to-point and collective communication
- System software tasks are either collective operations or can be cast in terms of them
Parallel applications and system software can be
built atop the same communication primitives
11Toward a Parallel OS
- Designing a Parallel OS
- Least common denominator of system and application software → Core Primitives
[Figure: layered stack, top to bottom: Resource Management, Parallel Application, ..., Parallel File System; Global control and coordination; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Core Primitives; Hardware]
13Core Primitives
- Parallel software built atop three primitives
- Xfer-And-Signal
- Transfer block of data to a set of nodes
- Optionally signal local/remote event upon completion
- Test-Event
- Poll local event
- Compare-And-Write
- Compare global variable on a set of nodes
- Optionally write global variable on the same set of nodes
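The three primitives listed above can be modeled in a few lines of Python. This is only an in-memory sketch, not the QsNet implementation: the `Node` class, the function names (`poll_event` stands in for Test-Event), and the rule that the write happens only when the comparison holds on every node are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node: global variables, received data, event flags."""
    variables: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    events: set = field(default_factory=set)


def xfer_and_signal(src, dests, key, data, local_event=None, remote_event=None):
    """Xfer-And-Signal: copy data to a set of nodes, optionally raising events."""
    for node in dests:
        node.memory[key] = data
        if remote_event is not None:
            node.events.add(remote_event)
    if local_event is not None:
        src.events.add(local_event)


def poll_event(node, event):
    """Test-Event: poll a local event, consuming it if it was signaled."""
    if event in node.events:
        node.events.remove(event)
        return True
    return False


def compare_and_write(nodes, var, op, value, write_var=None, write_value=None):
    """Compare-And-Write: compare var on all nodes; optionally write on success.

    Assumption: the write is performed only if the comparison holds everywhere.
    """
    holds = all(op(node.variables[var], value) for node in nodes)
    if holds and write_var is not None:
        for node in nodes:
            node.variables[write_var] = write_value
    return holds


def demo():
    """Node S sends a block to D1..D4, then checks a flag on all of them."""
    src, dests = Node(), [Node() for _ in range(4)]
    xfer_and_signal(src, dests, "buf", b"payload", "sent", "arrived")
    arrived = [poll_event(d, "arrived") for d in dests]
    for d in dests:
        d.variables["ready"] = 1
    ok = compare_and_write(dests, "ready", operator.ge, 1, "go", 1)
    return arrived, ok
```

Calling `demo()` exercises all three primitives once: one transfer to four destinations, one event poll per destination, and one global comparison followed by a conditional write.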
14Core Primitives
- Parallel software built atop three primitives
- Xfer-And-Signal (QsNet)
- Node S transfers block of data to nodes D1, D2, D3 and D4
[Figure: source node S and destination nodes D1 to D4]
15Core Primitives
- Parallel software built atop three primitives
- Xfer-And-Signal (QsNet)
- Node S transfers block of data to nodes D1, D2, D3 and D4
- Events triggered at source and destinations
[Figure: source node S and destination nodes D1 to D4]
16Core Primitives
- Parallel software built atop three primitives
- Compare-And-Write (QsNet)
- Node S compares variable V on nodes D1, D2, D3 and D4
- Is V <op> Value, for a relational operator <op> such as >?
[Figure: node S and nodes D1 to D4]
17Core Primitives
- Parallel software built atop three primitives
- Compare-And-Write (QsNet)
- Node S compares variable V on nodes D1, D2, D3 and D4
- Partial results are combined in the switches
[Figure: node S and nodes D1 to D4]
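Combining partial results in the switches can be pictured as a tree reduction: each node evaluates the comparison locally, and each switch forwards the logical AND of its children. The sketch below is a software model under stated assumptions; the function names and the default fan-out of 4 per switch are illustrative and are not taken from the slides.

```python
import operator


def combine_in_switches(local_results, arity=4):
    """AND-reduce per-node results level by level, as a tree of switches might.

    Returns (global_result, levels), where levels counts the switch stages
    traversed. The arity (children per switch) is an illustrative assumption.
    """
    level = list(local_results)
    if not level:
        raise ValueError("need at least one node")
    levels = 0
    while len(level) > 1:
        # Each switch combines the partial results of up to `arity` children.
        level = [all(level[i:i + arity]) for i in range(0, len(level), arity)]
        levels += 1
    return level[0], levels


def compare_on_nodes(values, op, target, arity=4):
    """Evaluate `op(V, target)` on every node, then combine in the switches."""
    local = [op(v, target) for v in values]
    return combine_in_switches(local, arity)
```

With 16 nodes and a fan-out of 4, the answer reaches the root after two combining stages, and a single node failing the comparison makes the global result false. The number of stages grows logarithmically with the node count, which is the point of combining in the network rather than at node S.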
19Toward a Parallel OS
- Global control/coordination of all system activities
[Figure: timeline of one time slice (hundreds of µs)]
- Global Strobe (time slice starts)
- Task 1
- Task 2
- Task 3
- Global Strobe (time slice ends)
20Parallel Software Design
- Using the core primitives
- Global control and coordination
- Strobe sent at regular intervals (time slices)
- Compare-And-Write + Xfer-And-Signal (Master)
- Test-Event (Slaves)
- All system activities are tightly coupled
- Global information is required to schedule resources; global synchronization facilitates the task, but it is not enough
- Global resource scheduling
- Exchange of requirements/restrictions
- Xfer-And-Signal + Test-Event
- Resource scheduling
SYSTEM SOFTWARE
22Parallel Software Design
- Using the core primitives
23Parallel Software Design
Can we really build system software using this
new approach?
25Case Studies
26Case Studies
- STORM (Scalable TOol for Resource Management)
- Architecture
- Set of dæmons running on the management/compute nodes
- Built atop the three core primitives
- BSP-like behavior: management activities are synchronized and scheduled every few hundred microseconds
- Functionality
- Job Launching
- Job Scheduling (FCFS, gang scheduling and others)
- New scheduling algorithms can be plugged in
- Resource Accounting
27Case Studies
- Job Launching: send/execute/check for completion
- 40 times faster than the best reported result!
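The send/execute/check sequence can be illustrated with a toy model. This is not STORM code: `ComputeNode`, `launch`, and the `chunk_size` parameter are hypothetical, and "executing" is reduced to setting a flag. The sketch only shows why the launcher's work need not grow with the node count when each chunk is delivered by one multicast transfer.

```python
class ComputeNode:
    """Hypothetical compute node that receives and 'runs' a program image."""

    def __init__(self):
        self.image = bytearray()   # program image received so far
        self.done = 0              # global variable checked by the launcher

    def execute(self):
        # Stand-in for starting the process: succeed only if an image arrived.
        self.done = 1 if self.image else 0


def launch(binary, nodes, chunk_size=1024):
    """Send, execute, and check for completion.

    Returns (ok, multicasts), where multicasts counts the transfer operations
    issued by the launcher.
    """
    multicasts = 0
    # 1. Send: one Xfer-And-Signal per chunk reaches the whole node set.
    for offset in range(0, len(binary), chunk_size):
        chunk = binary[offset:offset + chunk_size]
        for node in nodes:          # models hardware multicast delivery
            node.image += chunk
        multicasts += 1
    # 2. Execute on every node.
    for node in nodes:
        node.execute()
    # 3. Check for completion: a single Compare-And-Write over the node set.
    ok = all(node.done == 1 for node in nodes)
    return ok, multicasts
```

Launching a 2560-byte image with 1024-byte chunks takes three multicasts whether there are 8 nodes or 64; only the simulated delivery loop, which the network would perform in hardware, scales with the node count.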
28Case Studies
- BCS-MPI (Buffered CoScheduled MPI)
- Architecture
- Set of cooperative threads running in the NIC
- Built atop the three core primitives
- BSP-like behavior: communications are synchronized and scheduled every few hundred microseconds
- Functionality
- Subset of the MPI standard
- Paves the way to provide:
- Traffic segregation
- Deterministic replay of user applications
- System-level fault tolerance
29Case Studies
- SWEEP3D and SAGE Performance (IA32)
- Production-level MPI versus BCS-MPI
[Figure: performance plots, annotated "0.5 SPEEDUP" and "2 SPEEDUP"]
31Concluding Remarks
- Methodology for designing parallel software
- Coordination of all system and application software activities in a BSP-like fashion
- Parallel applications and system software built atop a basic set of collective primitives for global coordination
- Backbone to integrate all nodes into a parallel OS
- Promising preliminary results demonstrate that this approach is indeed feasible
32Future Work
- Kernel-level implementation
- User-level solution is already working
- Deterministic replay of MPI programs
- Ordered resource scheduling may enforce reproducibility
- Transparent fault tolerance
- Global coordination simplifies the state of the machine
35Case Studies
- Job Scheduling: gang scheduling
- Very small time slices give high responsiveness
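Gang scheduling is commonly described with a scheduling matrix: each row is one time slice, each column one node, and all processes of a job sit in the same row so that they run simultaneously. The sketch below is a generic first-fit illustration of that idea, not STORM's scheduler; `build_rows`, `time_slices`, and the job tuples are hypothetical.

```python
def build_rows(jobs, num_nodes):
    """Pack jobs into rows of a scheduling matrix with a first-fit rule.

    jobs is a list of (name, width) pairs, where width is the number of nodes
    the job needs. Each row maps node index -> job name.
    """
    rows = []
    for name, width in jobs:
        if width > num_nodes:
            raise ValueError(f"job {name} needs more nodes than available")
        for row in rows:
            free = [n for n in range(num_nodes) if n not in row]
            if len(free) >= width:
                for n in free[:width]:
                    row[n] = name
                break
        else:
            # No existing row has room: open a new time slice for this job.
            rows.append({n: name for n in range(width)})
    return rows


def time_slices(rows, num_slices):
    """Rotate through the rows, one row per global strobe."""
    if not rows:
        return []
    return [rows[t % len(rows)] for t in range(num_slices)]
```

For jobs A (4 nodes), B (2 nodes) and C (2 nodes) on a 4-node machine, A gets a row to itself while B and C share the second row, and the schedule alternates between the two rows. With time slices of a few hundred microseconds, every job is revisited quickly, which is where the responsiveness comes from.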
36Toward a Parallel OS
- BCS-MPI: real-time communication scheduling
[Figure: timeline of one time slice (hundreds of µs)]
- Global Strobe (time slice starts)
- Exchange of comm requirements
- Communication scheduling
- Real transmission
- Global Strobe (time slice ends)
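The three phases inside a time slice can be sketched as one function that is called once per strobe. This is a simplified model rather than BCS-MPI itself: the descriptor tuples, the matching rule on (source, destination, tag), and the per-slice `budget` of messages are assumptions chosen to keep the example small.

```python
def run_time_slice(sends, recvs, inbox, budget=2):
    """Process one time slice of buffered, coscheduled communication.

    sends:  list of (src, dst, tag, payload) descriptors posted so far
    recvs:  list of (src, dst, tag) descriptors posted so far
    inbox:  dict mapping dst -> list of delivered (src, tag, payload)
    budget: hypothetical cap on messages transmitted per slice
    Returns the sends and receives left over for the next slice.
    """
    # Phase 1: exchange of communication requirements (posted receives).
    waiting = list(recvs)

    # Phase 2: communication scheduling. A send is scheduled only if a
    # matching receive exists and the slice still has capacity.
    scheduled, deferred = [], []
    for send in sends:
        key = send[:3]
        if len(scheduled) < budget and key in waiting:
            waiting.remove(key)
            scheduled.append(send)
        else:
            deferred.append(send)

    # Phase 3: real transmission of the scheduled messages.
    for src, dst, tag, payload in scheduled:
        inbox.setdefault(dst, []).append((src, tag, payload))

    return deferred, waiting
```

Sends without a matching receive, or beyond the budget, are carried into the next slice unchanged. Because all nodes make the same decisions at the same strobe, the order of transmissions is fixed by the schedule rather than by timing, which is the property the deck connects to deterministic replay.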
37Toward a Parallel OS
- BCS-MPI: real-time communication scheduling