Resource Management in Cluster Systems - PowerPoint PPT Presentation

Description:

8 Computing Cabinet. 4GB Memory for. 4 Computing Nodes ... 6 Nodes,2 Service Cabinets. 4 Way. 400MHZ PowerPC RS64-III. 4GB Memory, 9GB Disk. 3 1TB FC-AL RAID ...

Slides: 19
Provided by: griffi
Transcript and Presenter's Notes

1
Resource Management in Cluster Systems
  • The Scalability and Availability Issues

Jin Xiong, Ninghui Sun
Institute of Computing Technology, Chinese Academy of Sciences
2
Overview
  • Resource management means managing the resources
    that computational applications use in a cluster
    system:
      • Computing Nodes
      • Parallel Tasks
      • Communication Ports

3
The RMS
  • RMS is the resource management software that we
    implemented for the Dawning3000 supercomputer.
  • Pool Management
      • Nodes are organized into non-overlapping
        pools
      • Every parallel task must run in a pool
      • Pool Attributes
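
The pool rules above can be sketched in a few lines. This is only an illustration, not the RMS implementation: the class name PoolManager, its methods, and the node names are all hypothetical. It shows the two constraints stated on the slide: a node belongs to at most one pool, and a parallel task gets all of its nodes from a single pool.

```python
class PoolManager:
    """Toy model of non-overlapping node pools (hypothetical names)."""

    def __init__(self):
        self.pools = {}      # pool name -> set of node names
        self.node_pool = {}  # node name -> pool name (enforces non-overlap)

    def create_pool(self, name, nodes):
        nodes = set(nodes)
        if name in self.pools:
            raise ValueError(f"pool {name!r} already exists")
        # Reject the whole request if any node already belongs to a pool.
        taken = nodes.intersection(self.node_pool)
        if taken:
            raise ValueError(f"nodes already in a pool: {sorted(taken)}")
        self.pools[name] = nodes
        for node in nodes:
            self.node_pool[node] = name

    def allocate(self, pool, count):
        """Pick `count` nodes for one parallel task, all from one pool."""
        candidates = sorted(self.pools[pool])
        if count > len(candidates):
            raise ValueError("not enough nodes in pool")
        return candidates[:count]


pm = PoolManager()
pm.create_pool("batch", ["n1", "n2", "n3"])
pm.create_pool("interactive", ["n4"])
print(pm.allocate("batch", 2))
```

A real allocator would also take load and pool attributes into account; the sketch picks nodes in name order only to stay deterministic.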

4
The RMS (continued)
  • Task Management
      • Node allocation, access authentication and
        load balancing
      • Parallel task loading
      • Handling of standard I/O
      • Resource release when a task terminates
        abnormally
  • Communication
      • Allocation of communication ports
      • Maintenance of the mapping between
        processes and ports
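
As a rough illustration of the communication items above, the sketch below keeps a table that maps each process of a task to a communication port and frees all of a task's ports at once, as would be needed after abnormal termination. The names (PortTable, task_id, rank) and the use of plain integers as ports are assumptions made for the example, not details taken from RMS.

```python
class PortTable:
    """Toy mapping between processes and communication ports."""

    def __init__(self, num_ports):
        self.free = list(range(num_ports))  # ports not yet handed out
        self.owner = {}                     # port -> (task_id, rank)

    def allocate(self, task_id, rank):
        if not self.free:
            raise RuntimeError("no free communication ports")
        port = self.free.pop(0)
        self.owner[port] = (task_id, rank)
        return port

    def lookup(self, task_id, rank):
        """Return the port held by a process, or None if it has none."""
        for port, who in self.owner.items():
            if who == (task_id, rank):
                return port
        return None

    def release_task(self, task_id):
        """Free every port held by a task, e.g. after it dies abnormally."""
        ports = [p for p, (t, _) in self.owner.items() if t == task_id]
        for p in ports:
            del self.owner[p]
            self.free.append(p)
        return ports


table = PortTable(4)
table.allocate("t1", 0)
table.allocate("t1", 1)
table.allocate("t2", 0)
print(table.release_task("t1"))
```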

5
Architecture of Dawning3000
6
Dawning3000
7
Implementation
  • Five Components
      • SDR (System Data Repository)
      • RMD (Resource Management Daemon)
      • CSD (Communication Service Daemon)
      • A library, librms.a
      • Commands

8
Scalability ( I )
  • Objectives
      • RMS must work without incurring errors
      • Performance should not degrade
  • Problems as system size increases
      • The number of connections between CSDs is
        proportional to the system size
      • The number of connections between user
        processes and the I/O servers is proportional
        to the number of processes
      • TCP/IP resources run out

9
Scalability ( II )
  • Possible Solutions
      • UDP: no connections needed, but unreliable
      • TCP with pre-established connections: simple,
        but poor scalability
      • TCP with run-time connections: simple, but
        poor performance
      • TCP with on-demand connections: both good
        performance and good scalability
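
The on-demand option can be sketched as a small connection cache: nothing is opened at start-up, a connection to a peer is opened the first time a message is sent to it, and it is then reused. This is a minimal sketch under assumptions, not the CSD code. The class name OnDemandConnections and its methods are hypothetical; socket.create_connection is the standard-library call used by default, and the connect function is injectable so that the idea can be tried without a network.

```python
import socket


class OnDemandConnections:
    """Open a TCP connection to a peer only when first needed, then reuse it."""

    def __init__(self, connect=socket.create_connection):
        self._connect = connect  # called with a (host, port) tuple
        self._conns = {}         # (host, port) -> open connection

    def get(self, host, port):
        key = (host, port)
        if key not in self._conns:
            # First use of this peer: pay the connection cost once.
            self._conns[key] = self._connect(key)
        return self._conns[key]

    def send(self, host, port, payload):
        self.get(host, port).sendall(payload)

    def drop(self, host, port):
        """Close and forget a connection, e.g. to a failed peer."""
        conn = self._conns.pop((host, port), None)
        if conn is not None:
            conn.close()

    def open_count(self):
        return len(self._conns)
```

Compared with connecting to every peer in advance, the number of open connections follows the actual communication pattern. Compared with opening a connection per message, repeated sends to the same peer avoid the set-up cost.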

10
Scalability ( III )
  • Our Solutions
      • TCP on-demand connections for
        communication among CSDs
      • UDP for standard error
      • I/O agents: a two-level method for standard
        I/O
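
The slide does not say how the I/O agents are arranged. The sketch below assumes one agent per node that collects the standard output of the local processes and forwards it over a single connection, so the I/O server holds one connection per node instead of one per process. The function names and the tagging format are hypothetical.

```python
def server_connections(nodes, procs_per_node, use_agents):
    """Connections the I/O server must hold for one parallel task."""
    if use_agents:
        return nodes               # assumed: one connection per node-level agent
    return nodes * procs_per_node  # one connection per user process


def agent_multiplex(node, lines_by_rank):
    """Merge the output of local processes into one tagged stream."""
    return [
        f"[{node}:{rank}] {line}"
        for rank, lines in sorted(lines_by_rank.items())
        for line in lines
    ]


print(server_connections(64, 4, use_agents=False))
print(server_connections(64, 4, use_agents=True))
print(agent_multiplex("n1", {1: ["world"], 0: ["hello"]}))
```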

11
Availability ( I )
  • Failures
      • OS crashes, hardware failures, network
        disconnection
      • Crash of RMS daemons
  • Objective
      • RMS can continue to provide service on the
        remaining nodes when failures occur
  • Problems
      • How to isolate the nodes where failures
        occur?
      • RMS is a single point of failure

12
Availability ( II )
  • Failure Detection
      • We developed a tool to check the status
        of nodes and daemons
  • Isolation of failed nodes
      • Setting the state of the failed node to
        Isolated
      • Cutting off the connections to the daemon
        on the node
      • Discarding requests to the daemon on the
        node
  • RMD backup
      • Master/slave method
      • Internal data mirrored between the master
        and slave RMDs
      • The slave takes over when the master RMD
        crashes
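
The isolation and backup steps above can be illustrated with a toy model. It is a sketch under assumptions, not the RMD code: the class MirroredRMD, its methods, and the "Up" state name are hypothetical, while the Isolated state comes from the slide. Every state change is written to both copies, requests for isolated nodes are discarded, and the slave copy serves requests once the master is gone.

```python
class MirroredRMD:
    """Toy master/slave pair of resource management daemons."""

    def __init__(self):
        # Each daemon keeps its own copy of the node-state table.
        self.state = {"master": {}, "slave": {}}
        self.active = "master"

    def set_node_state(self, node, state):
        # Mirror every change to all surviving copies.
        for copy in self.state.values():
            copy[node] = state

    def isolate(self, node):
        self.set_node_state(node, "Isolated")

    def dispatch(self, node, request):
        """Forward a request, or discard it if the node is not usable."""
        if self.state[self.active].get(node) != "Up":
            return None
        return (node, request)

    def master_crashed(self):
        # The slave takes over using its mirrored copy of the data.
        self.state.pop("master", None)
        self.active = "slave"


rmd = MirroredRMD()
for name in ("n1", "n2"):
    rmd.set_node_state(name, "Up")
rmd.isolate("n2")
rmd.master_crashed()
print(rmd.dispatch("n1", "load task"), rmd.dispatch("n2", "load task"))
```

Because the isolation of n2 was mirrored before the crash, the slave keeps discarding requests for it after taking over.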

13
Performance Evaluation ( I )
14
Performance Evaluation ( II )
15
Performance Evaluation ( III )
16
Performance Evaluation ( IV )
17
Performance Evaluation ( V )
18
What we learned
  • On-line node addition, reconfiguration and node
    isolation proved to be useful and convenient
  • Scalability for even larger systems, containing
    thousands of nodes, needs to be reconsidered
  • Resource management for commercial computing
    applications needs to be considered