Resource Management in Cluster Systems

Description:

8 Computing Cabinet. 4GB Memory for. 4 Computing Nodes ... 6 Nodes,2 Service Cabinets. 4 Way. 400MHZ PowerPC RS64-III. 4GB Memory, 9GB Disk. 3 1TB FC-AL RAID ...

Slides: 19

Provided by: griffi

1
Resource Management in Cluster Systems

The Scalability and Availability Issues

Jin Xiong, Ninghui Sun
Institute of Computing Technology, Chinese Academy of Sciences
2
Overview

Resource management covers the resources that
computational applications use in a cluster
system:
Computing Nodes
Parallel Tasks
Communication Ports

3
The RMS

RMS is the resource management software that we
implemented for the Dawning3000 supercomputer.
Pool Management
Nodes are organized into non-overlapping
pools
Any parallel task should run in a pool
Pool Attributes
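
The pool rules above (nodes belong to non-overlapping pools, and a parallel task runs inside one pool) can be illustrated with a minimal sketch. The class name PoolRegistry and all of its methods are hypothetical and are not taken from the RMS code; the slide does not say how RMS stores pools or picks nodes.

```python
class PoolRegistry:
    """Hypothetical registry that keeps node pools non-overlapping."""

    def __init__(self):
        self._pools = {}    # pool name -> set of node names
        self._pool_of = {}  # node name -> pool name

    def create_pool(self, pool, nodes):
        """Create a pool; reject it if any node already belongs to a pool."""
        nodes = set(nodes)
        if pool in self._pools:
            raise ValueError(f"pool {pool!r} already exists")
        taken = sorted(n for n in nodes if n in self._pool_of)
        if taken:
            raise ValueError(f"nodes already in a pool: {taken}")
        self._pools[pool] = nodes
        for node in nodes:
            self._pool_of[node] = pool

    def pool_of(self, node):
        """Return the pool a node belongs to, or None."""
        return self._pool_of.get(node)

    def allocate(self, pool, count):
        """Pick `count` nodes for one parallel task, all from a single pool.

        Assumption: nodes are chosen in name order; a real allocator
        would also consider load and availability.
        """
        members = sorted(self._pools[pool])
        if count > len(members):
            raise ValueError("not enough nodes in pool")
        return members[:count]
```

Because the overlap check runs before any state is changed, a rejected pool leaves the registry untouched.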

4
The RMS (continued)

Task Management
Node allocation, access authentication and
load balancing
Parallel task loading
Handling of standard I/O
Resource release when a task terminates abnormally
Communication
Allocation of communication ports
Maintenance of the mapping between
processes and ports
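
Port allocation and the process-to-port mapping can be pictured as a small table. The sketch below is an illustration only: PortTable, the (task_id, rank) key, and the lowest-free-port policy are assumptions, not details of the RMS communication service.

```python
class PortTable:
    """Hypothetical table mapping the processes of a task to ports."""

    def __init__(self, num_ports):
        self._free = list(range(num_ports))  # free port numbers, ascending
        self._port_of = {}                   # (task_id, rank) -> port

    def allocate(self, task_id, rank):
        """Give a process a port; repeated calls return the same port."""
        key = (task_id, rank)
        if key in self._port_of:
            return self._port_of[key]
        if not self._free:
            raise RuntimeError("no free communication ports")
        port = self._free.pop(0)
        self._port_of[key] = port
        return port

    def lookup(self, task_id, rank):
        """Return the port held by a process, or None."""
        return self._port_of.get((task_id, rank))

    def release_task(self, task_id):
        """Free every port of a task, e.g. after abnormal termination."""
        keys = [k for k in self._port_of if k[0] == task_id]
        for key in keys:
            self._free.append(self._port_of.pop(key))
        self._free.sort()
        return len(keys)
```

Releasing by task rather than by process means one call is enough to clean up after a task that terminates abnormally.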

5
Architecture of Dawning3000
6
Dawning3000
7
Implementation

Five Components
SDR (System Data Repository)
RMD (Resource Management Daemon)
CSD (Communication Service Daemon)
A library librms.a
Commands

8
Scalability ( I )

Objectives
RMS must work without incurring errors
Performance should not degrade
Problems as system size increases
The number of connections between CSDs is
proportional to the system size
The number of connections between user
processes and the I/O servers is proportional
to the number of processes
TCP/IP resources run out

9
Scalability ( II )

Possible Solutions
UDP: no connections needed, but unreliable
TCP established connection: simple, but poor
scalability
TCP run-time connection: simple, but poor
performance
TCP on-demand connection: both good
performance and good scalability
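
The on-demand idea can be sketched as a connection cache: a TCP connection to a peer is opened the first time a message is sent to it and reused afterwards, so peers that never talk never hold a connection. OnDemandLinks and its methods are hypothetical names; the slides do not show how the CSDs implement this, and the sketch omits error handling and idle-connection cleanup.

```python
import socket


class OnDemandLinks:
    """Hypothetical cache of TCP connections opened on first use."""

    def __init__(self, connect=socket.create_connection):
        # `connect` takes a (host, port) pair and returns a socket-like
        # object; it is injectable so the logic can be tried offline.
        self._connect = connect
        self._links = {}  # (host, port) -> connected socket

    def send(self, peer, payload):
        """Send bytes to a peer, connecting only if no link exists yet."""
        sock = self._links.get(peer)
        if sock is None:
            sock = self._connect(peer)
            self._links[peer] = sock
        sock.sendall(payload)

    def open_count(self):
        """Number of connections currently held."""
        return len(self._links)

    def close_all(self):
        """Close and forget every cached connection."""
        for sock in self._links.values():
            sock.close()
        self._links.clear()
```

With this scheme the first message to a peer pays the connection setup cost once, and the number of open connections tracks the peers actually contacted rather than the system size.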

10
Scalability ( III )

Our Solutions
TCP on-demand connection for
communication among CSDs
UDP for standard error
I/O agents: a two-level method for standard
I/O

11
Availability ( I )

Failures
Crash of OS, Hardware Failures, Network
Disconnection
Crash of RMS daemons
Objective
RMS can continue to provide services on the
remaining nodes when failures occur
Problems
How to isolate the nodes where failures
occur?
RMS is a single point of failure

12
Availability ( II )

Failure Detection
We developed a tool to check the status
of nodes and daemons
Isolation of failed node
Setting the state of the failed node to
Isolated
Cutting off the connections to the daemon
on the node
Discarding requests to the daemon on the
node
RMD backup
Master/Slave Method
Internal data mirror between the master
and slave RMDs
Taking over when the master RMD crashes
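
The isolation steps and the master/slave backup can be sketched together. Everything below is a hypothetical illustration: NodeTable, RmdPair, the "Up" state name, and the dictionary standing in for RMD internal data are assumptions; only the state name Isolated comes from the slide.

```python
class NodeTable:
    """Hypothetical node table implementing the three isolation steps."""

    def __init__(self, nodes):
        self.state = {n: "Up" for n in nodes}            # "Up" is assumed
        self.links = {n: f"link-to-{n}" for n in nodes}  # stand-in connections

    def isolate(self, node):
        """Mark the node Isolated and cut the connection to its daemon."""
        self.state[node] = "Isolated"
        self.links.pop(node, None)

    def dispatch(self, node, request):
        """Return True if the request is forwarded, False if discarded."""
        return self.state.get(node) == "Up"


class RmdPair:
    """Hypothetical master/slave pair with mirrored internal data."""

    def __init__(self):
        self.data = {"master": {}, "slave": {}}
        self.active = "master"

    def update(self, key, value):
        """Write to the active RMD; mirror to the slave while the master runs."""
        self.data[self.active][key] = value
        if self.active == "master":
            self.data["slave"][key] = value

    def master_crashed(self):
        """The slave takes over with the data mirrored so far."""
        self.active = "slave"

    def read(self, key):
        return self.data[self.active].get(key)
```

Mirroring on every update is what lets the slave answer from current data immediately after takeover; this sketch does not cover how the crash is detected or how a restarted master rejoins.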

13
Performance Evaluation ( I )
14
Performance Evaluation ( II )
15
Performance Evaluation ( III )
16
Performance Evaluation ( IV )
17
Performance Evaluation ( V )
18
What did we learn?

On-line node addition, reconfiguration and node
isolation proved to be useful and convenient
Scalability for even larger systems that contain
thousands of nodes needs to be reconsidered
Resource management for commercial computing
applications needs to be considered
