Title: The Sun Cluster Grid Architecture (Sun Grid Engine Project)
1. The Sun Cluster Grid Architecture (Sun Grid Engine Project)
- Adam Belloum
- Computer Architecture Parallel Systems group
- University of Amsterdam
- adam@science.uva.nl
2. Sun Cluster Architecture
- The architecture includes:
- front-end access nodes,
- middle-tier management nodes,
- back-end compute nodes.
3. Access Tier
- The access tier provides access and authentication services to Cluster Grid users.
- Access methods: telnet, rlogin, or ssh can be used to grant access.
- Web-based services can be provided to permit easy (tightly controlled) access to the facility.
4. Management Tier
- This middle tier includes one or more servers which run:
- the server elements of client-server software such as Distributed Resource Management (DRM),
- hardware diagnosis software,
- system performance monitors.
- File servers provide NFS service to other nodes in the Cluster Grid.
- License key servers manage software license keys for the Cluster Grid.
- Software provisioning servers manage the operating system, application software versioning, and patch application on other nodes in the Cluster Grid.
5. Compute Tier
- Supplies the compute power for the Cluster Grid. Jobs submitted through upper tiers in the architecture are scheduled to run on one or more nodes in the compute tier.
- Nodes in this tier run:
- the client side of the DRM software,
- the daemons associated with message-passing environments,
- any agents for system health monitoring.
- The compute tier communicates with the management tier, receiving jobs to run and reporting job completion status and accounting details.
6. Hardware Considerations
- The essential hardware components of a Cluster Grid are:
- computing systems,
- networking equipment,
- storage.
- The choice of hardware at each tier depends on a number of factors, primarily:
- What are the required services?
- What kind of service level is needed?
- What is the expected user load?
- What is the expected application type and mix?
7. Access, Compute, and Management Nodes
- Nodes in the access tier are utilized by users to submit, control, and monitor jobs.
- Nodes in the compute tier are used to execute jobs.
- Nodes in the management tier run the majority of the software needed to implement the Cluster Grid.
- The hardware requirements for each node depend, in part, on its location in the system architecture.
8. Access Node Requirements
- Access nodes typically require no special configuration.
- Any desktop or server that is connected to the network can be configured to allow direct access to the Cluster Grid.
- Introducing or modifying nodes in the access tier is a simple operation that does not affect other tiers in the architecture.
- Users without a system directly connected to the local area network can interface with an access node via conventional methods (e.g., telnet, rlogin, ftp, and ssh).
9. Compute Node Requirements
- Compute nodes run the jobs that are submitted to the Cluster Grid, and the design of this tier is crucial to maximizing application performance.
- The Cluster Grid software itself places little load upon nodes in the compute tier.
- Access nodes can also be configured as compute nodes, which may be appropriate in certain environments.
- For example, desktop machines used as access nodes can also be tapped for their spare compute cycles after business hours or when the CPU is otherwise idle.
10. Management Node Requirements
- Cluster Grids can be designed with one or more systems in the management tier.
- Running the management services on a single system is simplest, and this may be the best choice for small Cluster Grids.
- Running the management services on multiple systems provides greater scalability and can provide increased performance, especially for larger Cluster Grids.
11. Management Node Requirements
- The system requirements for the master node depend on:
- the size of the compute cluster,
- the volume of jobs being submitted,
- the complexity of any scheduling decisions that must be made.
- In large clusters, the DRM master node is dedicated: it should not perform compute-tier duties or act in any other capacity.
- This is particularly relevant in clusters running large MPI jobs.
- If the DRM server is acting as a compute node, system services can continually interrupt the MPI job in progress, thereby delaying a large job running across many nodes.
12. Networking Infrastructure
- A typical Cluster Grid can be configured with three separate types of network interconnects:
- Ethernet,
- a serial interconnect,
- a specialized low-latency, high-bandwidth interconnect.
13. Networking Infrastructure
- Compute, management, and access nodes in a Cluster Grid are typically connected by a local area network utilizing Fast Ethernet or Gigabit Ethernet technology.
- This network is used for file sharing, interprocess communication, and system management.
- If these functions share network hardware, care should be taken to separate standard Ethernet traffic from compute-related communications as far as possible.
14. Networking Infrastructure
- For resiliency and increased reliability, the network infrastructure can be configured to ensure that no single point of failure can compromise availability:
- the network can be designed with redundant switches,
- and multiple network interfaces can be used for increased throughput and to help meet network bandwidth requirements.
- An additional serial interconnect can be used for administrative convenience.
- A serial network can connect the system console of all compute nodes in the Cluster Grid to one or more terminal concentrators, which are in turn connected to the local area network.
- A specialized low-latency, high-bandwidth system interconnect is crucial to the performance of large, communication-intensive MPI jobs.
15. Networking Infrastructure
- Using a separate high-performance interconnect also reduces the networking load on the server CPU, freeing it for other tasks.
- An additional network can be added to provide rapid data delivery to and from compute nodes if required.
- This network can utilize high-speed Ethernet, or a Storage Area Network (SAN) can be implemented.
16. Software Integration
- Resilience
- Interoperability
- Manageability
17. Software Integration
- Software integration includes writing utility scripts or modifying the scripts that do application setup.
- In the ideal case, applications can be submitted to the DRM without requiring recompilation or linking with special libraries.
- Software integration also includes integrating the DRM with parallel environments such as:
- Parallel Virtual Machine (PVM),
- Message Passing Interface (MPI).
- With this integration, parallel jobs submitted by users can be controlled and properly accounted for by the DRM, as the sketch below illustrates.
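- As an illustration, a tightly integrated parallel environment is described in Sun Grid Engine by a PE object; the following sketch shows representative fields of such a definition (the startmpi.sh/stopmpi.sh paths are hypothetical and would point at the site's MPI setup scripts):

    # Sketch of a parallel environment, as displayed by "qconf -sp mpi"
    pe_name            mpi
    slots              64
    start_proc_args    /sge/mpi/startmpi.sh $pe_hostfile
    stop_proc_args     /sge/mpi/stopmpi.sh
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE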
18. Software Integration
- Other aspects of software integration can include the design of special interfaces at the access tier which automate or simplify the submission of tasks to the management tier for running on the compute tier.
- This can include writing specialized wrapper scripts, Web interfaces, or more fully featured graphical user interfaces (GUIs).
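- For example, a minimal submit wrapper might look like the following sketch (the queue name, resource value, and script names are assumptions, not part of the original material):

    #!/bin/sh
    # submit_app.sh -- hypothetical wrapper that hides qsub details from end users.
    # Usage: submit_app.sh <input-file>
    INPUT=$1
    # Submit to the medium queue, ask for 2 GB of free memory, and merge
    # stdout/stderr into a single log file per input.
    qsub -q medium -l mem_free=2G -j y -o "$HOME/results/$INPUT.log" run_app.sh "$INPUT"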
19. Resilience
- On the compute tier, nodes are anonymous and independent.
- If one node fails, the remaining nodes are unaffected and remain available to execute user jobs.
- The cluster can be configured to redo any work that is lost if a server fails mid-job, making users unaware of any individual node failures and providing increased availability.
- The RAS (Reliability, Availability, and Serviceability) features of the hardware and software elements are most relevant to the management tier.
20. Resilience
- The system operating environment can also contribute to high availability with features like:
- live upgrades,
- automatic dynamic reconfiguration,
- file system logging,
- and IP network failover.
- The availability of data can be increased with redundant, hot-swappable storage components, multiple paths to data storage, and hardware or software RAID capabilities.
21. High Availability
- If required, High Availability (HA) software can provide even greater levels of availability. For example:
- HA software can be used to provide a highly available NFS service to the Cluster Grid.
- If the primary NFS server should fail for any reason, NFS data services are automatically and transparently failed over to a backup server.
- Similar to the compute tier, the access tier generally contains many systems or devices, thus providing inherent redundancy.
22. Interoperability
- Cluster Grid implementations work on the principle of an integratable stack and should be able to run across a heterogeneous environment.
- Servers running different operating environments should be permitted to belong to the same compute cluster.
- Users should be able to submit jobs to any available architecture by simply submitting their job to the DRM software.
- If the job must run on a particular architecture, users can specify this as a resource requirement when submitting the job, as in the sketch below.
- The DRM software can then ensure that this job runs only on the correct system types and dispatch it appropriately.
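- A sketch of such a submission (the architecture string is an assumption; valid values depend on the installation):

    # Request a specific architecture; SGE dispatches only to matching hosts.
    qsub -l arch=sol-sparc64 job.sh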
23. Manageability
- The scalability of the Cluster Grid architecture can result in hundreds or even thousands of managed nodes.
- Management tools must scale with the size of a Cluster Grid, provide a single point of management, offer flexibility, and ensure security in a distributed environment.
- Proactive system management that monitors the health and functionality of systems can provide improved service.
- System management costs can be reduced significantly by utilizing installation and deployment technologies that help minimize the amount of time administrators spend installing and patching systems and software.
24. Cluster Grid Components
25. Cluster Grid Components
- One of the most important features of the Cluster Grid architecture is its modular and open design.
- Components are separate and have unique roles within the architecture.
- This design is commonly referred to as a software stack, with each layer in the stack representing a different functionality.
27. Sun Grid Engine
- Distributed Resource Management
- Cluster Queues
- Hostgroup and Hostlist
- Scheduler
28. Sun Grid Engine
- The Sun Grid Engine distributed resource management software is the essential component of any Cluster Grid.
- It optimizes utilization of software and hardware resources.
- It aggregates the compute power available in cluster grids and presents a unified and simple access point to users needing compute cycles.
- Sun Grid Engine software provides dependable, consistent, and pervasive access to both high-throughput and highly parallel computational capabilities.
29. Sun Grid Engine
- Sun Grid Engine can also provide:
- job accounting information,
- statistics that are used to monitor resource utilization and determine how to improve resource allocation.
- Administrators can specify job options, illustrated in the sketch below:
- priority,
- hardware and license requirements,
- dependencies,
- and can define and control user access to compute resources.
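- A sketch of a submission combining these options (the job ID and the app_license consumable are hypothetical):

    # -p sets the POSIX priority, -l states hardware/license requirements,
    # and -hold_jid makes the job wait for job 1234 to finish first.
    qsub -p -100 -l mem_free=4G,app_license=1 -hold_jid 1234 job.sh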
30. Distributed Resource Management
- The basis for DRM is the batch queuing mechanism.
- In the normal operation of a cluster, if the proper resources are not currently available to execute a job, the job is queued until the resources become available.
- DRM further enhances batch queuing by monitoring host computers in the cluster for properly balanced load conditions.
- Sun Grid Engine software provides the DRM functions: batch queuing, load balancing, job accounting statistics, user-specifiable resources, suspending and resuming jobs, and cluster-wide resources.
31. Cluster Queues
- The new cluster queue design is based on three major points:
- multiple hosts per queue configuration,
- different queue attributes per execution host,
- introduction of the concept of hostgroups.
32. Cluster Queues
- The cluster queue named big serves three different hosts: balrog, durin, and ori.
- The seq_no attribute value is:
- 1 for balrog,
- 2 for durin,
- and 0 for ori.
- Both the load_thresholds and suspend_thresholds attributes are the same for all execution hosts, as the configuration sketch below shows.
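- In SGE 6 queue-configuration syntax this corresponds to something like the following sketch (the threshold value is a placeholder); the leading value is the default and the bracketed entries are per-host overrides:

    # Excerpt of "qconf -sq big"
    qname               big
    hostlist            balrog durin ori
    seq_no              0,[balrog=1],[durin=2]   # 0 applies to ori
    load_thresholds     np_load_avg=1.75         # identical on all hosts
    suspend_thresholds  NONE                     # identical on all hosts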
33. Hostgroup and Hostlist
- A hostgroup contains a list of grid engine execution hosts and is referred to by an at sign ('@') followed by a string.
- A hostlist is a cluster queue attribute that contains execution hosts and/or hostgroups.
- The figure illustrates an example where the two created hostgroups @solaris64 and @linux belong to the queue named big.
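- A hostgroup itself is a small object; a sketch of what "qconf -shgrp @solaris64" might return (the member hostnames are assumptions):

    group_name @solaris64
    hostlist   balrog durin   # may also contain other hostgroups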
34. How to dispatch jobs?
- The scheduler selects queues for the submitted jobs via attribute matching in an N1 Grid Engine 6 cluster, as opposed to submitting to a specific queue, which is popular in other DRM products.
- Users can still submit to a specific queue if desired. An example would be a job that requests memory or CPU resources being submitted to a queue set up to fulfill this type of request; see the sketch below.
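- A sketch of attribute matching (resource values are illustrative):

    # No queue is named; SGE picks queue instances satisfying the requests.
    qsub -l mem_free=4G,num_proc=2 job.sh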
35. How to dispatch jobs?
- N1 Grid Engine 6 provides the capability to use regular expressions for matching resource requests.
- qsub -q medium job.sh → submits job.sh to the medium queue.
- qsub -q fast@@solaris64 job.sh → submits job.sh to the fast queue with the @solaris64 hostlist.
- qsub -q fast@sf15k job.sh → submits job.sh to the queue instance fast that belongs to the sf15k host.
- qmod -e big → enables the queue big.
- qmod -c big@@linux → clears the alarm state from the queue big which is attached to the hosts in the @linux hostgroup.
36. Scheduler
- Scheduler internal status creation is optimized for performance, and the task of sending tickets from the scheduler to qmaster is streamlined.
- The scheduler has look-ahead features, such as:
- resource reservation,
- backfilling.
- New prioritization scheme.
- Improved algorithms.
- Scheduling profile choices at install time.
38. Scheduler
- With the new scheduler, a high-priority job can use resource reservation to block the resources it needs (see the sketch below).
- Although the new scheduler ensures proper prioritization of jobs, resource reservation alone may leave resources idle for extended periods of time.
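- In N1 Grid Engine 6, a job asks for a reservation with the -R flag; a sketch (the PE name and slot count are assumptions):

    # Reserve resources so smaller jobs cannot starve this 32-way parallel job.
    qsub -R y -pe mpi 32 big_job.sh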
39. Scheduler (backfilling)
- Grid Engine notices that queues 2 and 3 will be idle because the 3-CPU Job 2 will have to wait until Job 1 finishes.
- It then scans the wait list for short jobs that could be run on queues 2 and 3 without delaying Job 2.
- After this analysis, Jobs 3 and 4 are started.
- Finally, Job 1 finishes after Jobs 3 and 4, freeing up the resources to start Job 2.
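- Backfilling works best when candidate jobs declare their run time, so the scheduler knows they will finish before the reserved resources are needed; a sketch:

    # A hard runtime limit of 30 minutes makes this job a backfill candidate.
    qsub -l h_rt=0:30:0 short_job.sh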
40. Scheduler
- Wait lists are controlled by three factors:
- priority (from POSIX priority),
- urgency,
- number of tickets.
- Priority = normalized(urgency) × weight_urgency + normalized(tickets) × weight_ticket + normalized(POSIX priority) × weight_priority
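- The three weights come from the scheduler configuration; the values shown below are typical defaults and may differ per installation:

    # Inspect the weighting factors in the scheduler configuration.
    qconf -ssconf | egrep 'weight_(priority|urgency|ticket)'
    weight_priority   1.000000
    weight_urgency    0.100000
    weight_ticket     0.010000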
41. Scheduler
- The scheduler has two new parameters to obtain more information about scheduling activities:
- PROFILE: if set to true, the scheduler will show how much time it spent on each step of a scheduling run.
- MONITOR: if set to true, the scheduler will dump all the information necessary to reproduce job resource utilization.
- Both are set in the scheduler configuration, as sketched below.
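- A sketch of enabling both switches in the params attribute of the scheduler configuration:

    # Edit the scheduler configuration (qconf -msconf) and set:
    params PROFILE=1,MONITOR=1
    # PROFILE timings appear in the messages file; MONITOR output is written
    # to the "schedule" file in the cell's common directory.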
42. ARCo
- ARCo (the Accounting and Reporting Console) has several predefined reports, such as:
- Accounting per Department
- Accounting per Project
- Accounting per User
- Host Load
- Statistics
- Average Job Turnaround Time
- Average Job Wait Time per Day
- Job Log
- Number of Jobs Completed
- Queue Consumables
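- ARCo reports are produced through its Web interface; comparable raw accounting data can be pulled on the command line with qacct, as in this sketch:

    # Summarize usage per owner over the last 30 days.
    qacct -o -d 30
    # Show the full accounting record of a single job.
    qacct -j 1234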
43. Sun Grid Engine Architecture
- Master host
- A single host is selected to be the Sun Grid Engine master host.
- This host handles all requests from users, makes job scheduling decisions, and dispatches jobs to execution hosts.
- Execution hosts
- Systems in the cluster that are available to execute jobs are called execution hosts.
- Submit hosts
- Submit hosts are machines configured to submit, monitor, and administer jobs, and to manage the entire cluster.
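- Host roles are assigned with qconf; a sketch using a hypothetical hostname:

    # Add an execution host (opens an editor to fill in the host definition).
    qconf -ae
    # Register desk05 as a submit host.
    qconf -as desk05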
44. Sun Grid Engine Architecture
- Software Job flow
- Security
- High Availability
45. Sun Grid Engine Architecture
- Administration hosts
- Sun Grid Engine administrators use administration hosts to make changes to the cluster configuration, such as:
- changing DRM parameters,
- adding new nodes,
- adding or changing users.
- Shadow master host
- While there is only one master host, other machines in the cluster can be designated as shadow master hosts to provide greater availability.
- A shadow master host continually monitors the master host, and automatically and transparently assumes control in the event that the master host fails.
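- Shadow master configuration is file-based; a sketch (paths assume a default installation, and $ARCH stands for the local binary architecture):

    # 1. List the backup hosts, one per line, in
    #    $SGE_ROOT/$SGE_CELL/common/shadow_masters
    # 2. Start the shadow daemon on each listed host:
    $SGE_ROOT/bin/$ARCH/sge_shadowd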
46. Software Job Flow
- Jobs are submitted to the master host and are held in a spooling area until the scheduler determines that the job is ready to run.
- Sun Grid Engine software matches available resources to job requirements, such as available memory, CPU speed, and available software licenses.
- The requirements of the jobs may be very different, and only certain hosts may be able to provide the corresponding service.
47. Software Job Flow
- Job submission
- A user submits a job from a submit host, and the job submission request is sent to the master host.
- Job scheduling
- The master host determines the host to which the job will be assigned. It assesses the load, checks for licenses, and evaluates any other job requirements.
- Job execution
- After obtaining scheduling information, the master host sends the job to the selected execution host. The execution host saves the job in a job information database and starts a shepherd process, which starts the job and waits for completion.
- Accounting information
- When the job is complete, the shepherd process returns the job information, and the execution host then reports the job completion to the master host and removes the job from the job information database. The master host updates the job accounting database to reflect job completion.
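- The four stages can be observed from a submit host; a sketch (the job ID is an example):

    # 1. Submission: qsub prints the assigned job ID, e.g. "Your job 1234 ...".
    qsub job.sh
    # 2./3. Scheduling and execution: watch the state move from qw to r.
    qstat -j 1234
    # 4. Accounting: after completion, inspect the recorded usage.
    qacct -j 1234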
48. Security
- To control access to the cluster, the Sun Grid Engine master host maintains information about eligible submit and administration hosts.
- Systems which have been explicitly listed as eligible submit hosts are able to submit jobs to the cluster.
- Systems which have been added to the list of eligible administration hosts can be used to modify the cluster configuration.
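- A sketch of inspecting and editing these lists with qconf (hostnames are examples):

    qconf -ss          # show eligible submit hosts
    qconf -sh          # show eligible administration hosts
    qconf -as desk05   # grant submit rights
    qconf -ah admin01  # grant administration rights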
49. High Availability
- The cluster can be configured with one or more shadow master hosts, eliminating the master host as a single point of failure and providing increased availability to users.
- If the master goes down, the shadow master host automatically and transparently takes over as the master.
- Shadow master host functionality is a fully integrated part of the Sun Grid Engine software.
- The only prerequisite for its use is a highly available file system on which to install the software and configuration files.
50. Development Tools and Run-Time Libraries
- Sun HPC ClusterTools
- Parallel Application Development
- Sun HPC ClusterTools Software
- Integration with Sun Grid Engine
- Forte for High Performance Computing
- Technical Computing Portal
51. Development Tools and Run-Time Libraries
- Sun HPC ClusterTools and Forte for High Performance Computing (HPC) software are commonly used to develop and run applications on Cluster Grids.
- Sun HPC ClusterTools provides an integrated software environment for developing and deploying parallel distributed applications.
- Forte HPC provides support for developing high-performance (non-parallel) applications in the Fortran, C, and C++ programming languages.
52. Sun HPC ClusterTools
- Sun HPC ClusterTools 4 software is a complete, integrated environment for parallel application development.
- It delivers an end-to-end software development environment for parallel distributed applications and provides middleware to manage a workload of highly resource-intensive applications.
- Sun HPC ClusterTools software enables users to develop and deploy distributed parallel applications with continuous scalability from one to 2048 processes within a single, well-integrated parallel development environment.
53. Parallel Application Development
- Two primary high-performance parallel programming models are supported: the single-process model and the multi-process model.
- The single-process model includes all types of multi-threaded applications.
- These may be automatically parallelized by high-performance compilers using parallelization directives (e.g., OpenMP) or explicitly parallelized with user-inserted Solaris or POSIX threads.
- The multi-process model supports the MPI standard for parallel applications that run both on single SMPs and on clusters of SMPs or thin nodes.
54. Parallel Application Development
- Sun HPC ClusterTools software includes:
- a high-performance, multi-protocol implementation of the industry-standard MPI,
- a full implementation of the MPI I/O protocol,
- tools for executing, debugging, performance analysis, and tuning of technical computing applications.
- Sun HPC ClusterTools software is thread-safe, facilitating a third, hybrid parallel application model:
- the mixing of threads and MPI parallelism to create applications that use MPI for communication between cooperating processes and threads within each process.
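- Under Sun CRE, MPI programs are compiled against the Sun MPI library and launched with mprun; a minimal sketch (the program name is hypothetical):

    # Build against Sun MPI and launch 16 processes under CRE.
    cc -o ring ring.c -lmpi
    mprun -np 16 ./ring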
55. Sun HPC ClusterTools Software
- Sun HPC ClusterTools software provides the features to effectively develop, deploy, and manage a workload of highly resource-intensive, MPI-parallel applications.
- Sun HPC ClusterTools is integrated to work with Sun Grid Engine software for use in Cluster Grid environments.
- Sun HPC ClusterTools software supports standard programming paradigms like MPI message passing, and includes a parallel file system that delivers high-performance, scalable I/O.
56. Integration with Sun Grid Engine
- Sun CRE provides Sun Grid Engine with the relevant information about parallel applications in which multiple resources are reserved for a single job.
- The Sun Grid Engine software uses the Sun CRE component to handle the details of launching MPI jobs, while still presenting the familiar Sun Grid Engine interface to the user.
- Integration of Sun HPC ClusterTools with the Sun Grid Engine framework provides a distinct advantage to users of a Sun Cluster Grid.
- By running parallel jobs with Sun CRE under the DRM of Sun Grid Engine, users achieve both efficient resource utilization and effective control over parallel applications.
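- With the integration in place, the MPI job is submitted through Sun Grid Engine and launched by Sun CRE inside the job script; a sketch (the PE name cre is an assumption about the local setup):

    # Ask SGE for 16 slots in the CRE parallel environment...
    qsub -pe cre 16 mpi_job.sh
    # ...where mpi_job.sh itself runs:  mprun -np 16 ./ring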
57. Forte for High Performance Computing (HPC)
- 64-bit application development: 64-bit technology offers many benefits, including:
- address space to handle large problems,
- 64-bit integer arithmetic to increase the calculation speed for mathematical operations,
- support for files greater than 4 GB in size.
- Sun Performance Library compatibility: compatibility with the Sun Performance Library helps provide optimized performance for matrix algebra and signal processing tasks on single-processor and multiprocessor systems.
58. Forte for High Performance Computing (HPC)
- Integrated programming environment: Forte HPC includes an integrated programming environment that enables developers to browse, edit, compile, debug, and tune applications efficiently.
- Software configuration management tools: Forte HPC provides software configuration management tools to enable development teams to work together effectively and efficiently.
59. Forte for High Performance Computing (HPC)
- Multi-threading technology: Forte HPC software enables developers to develop and tune multi-threaded/multi-processing applications using capabilities such as OpenMP API support for C and Fortran programs.
- Performance analysis tools: performance analysis tools enable developers to evaluate code performance, spot potential performance issues, and locate problems quickly.
60. Technical Computing Portal
- The Technical Computing Portal is a services-centric, Web-based, shared-everything approach to technical computing.
- It offers an easy-to-use interface for job submission, job control, and access to results via the Sun ONE Portal Server (formerly iPlanet Portal Server) and the Sun Grid Engine software.
- The Sun ONE Portal Server is a community-based server application that securely provides an aggregation of key content, applications, and services, personalized based on user role/identity, user preferences, and system-determined relevancy.
61. System Management Center
- Sun Management Center
- Intelligent Agent-Based Architecture
- Sun Validation Test Suite
- Installation and Deployment Technologies
- Web Start Flash
- Solaris JumpStart software
- Solaris Live Upgrade
62. System Management
- Cluster Grids can contain large numbers of distributed systems, and ensuring efficient and effective system management is essential.
- Powerful system administration tools such as Sun Management Center provide comprehensive administrative and management operations.
- Other tools include the Sun Validation Test Suite (SunVTS) to test and verify hardware functionality across a network, and automated installation and deployment technologies like the Solaris Web Start product line to help reduce the amount of time administrators spend installing and patching systems and software in a Cluster Grid.
63. Sun Management Center
- Sun Management Center software is an advanced system management tool designed to support Sun systems.
- It offers a single point of management for Sun systems, the Solaris Operating Environment, applications, and services for data center and highly distributed computing environments.
- Sun Management Center software enables system administrators to:
- perform remote system management,
- monitor performance,
- isolate hardware/software faults for hundreds of Sun systems,
- all through an easy-to-use Web interface.
- Enhanced, proactive event/alarm management provides early notification of potential service problems.
64. Intelligent Agent-Based Architecture
- Sun Management Center is based on an intelligent agent-based architecture:
- a manager monitors and controls managed entities by sending requests to agents residing on the managed nodes.
- Agents are key software components that collect management data on behalf of the manager.
65. Intelligent Agent-Based Architecture
- Scalability: distributing responsibility to the agents improves the Sun Management Center software's ability to scale as the number of managed nodes increases.
- Increased reliability and availability: because agents process data locally and are not dependent on other software components, reliability and availability are enhanced.
66. Intelligent Agent-Based Architecture
- Flexibility and extensibility: additional modules can be dynamically loaded into Sun Management Center agents.
- Decreased bandwidth requirements: intelligent agents offer savings in network bandwidth, as agents collect data on the managed nodes and only report status and significant events when necessary.
- Security: all users are authenticated, limiting administrators' access to and management of only the systems within their control.
67. Sun Validation Test Suite
- SunVTS is a comprehensive diagnostic tool that tests and validates Sun hardware by verifying the connectivity and functionality of most system hardware.
- SunVTS can be tailored to run on various types of machines, ranging from desktops to servers, and supports testing in both 32-bit and 64-bit Solaris operating environments.
- Tests examine subsystems such as processors, peripherals, storage, network, memory, graphics and video, audio, and communication.
68. Sun Validation Test Suite
- The primary goal of the SunVTS software is to create an environment in which Sun systems can be thoroughly tested to verify their proper operation or to find elusive problems.
- SunVTS can be used to validate a system during development or production, as well as for troubleshooting, periodic maintenance, and system or subsystem stressing.
69. Installation and Deployment Technologies
- With Solaris Web Start software and Solaris Web Start Wizards technology, the Solaris Operating Environment and other applications can be installed interactively with a browser-based interface.
- Solaris JumpStart software provides automated installation and setup of multiple systems over the network.
- Web Start Flash, Solaris JumpStart, and Solaris Live Upgrade technologies are particularly relevant to the Cluster Grid environment, where large numbers of similarly configured systems must be managed.
70. Web Start Flash
- Web Start Flash takes a complete system image of the Solaris Operating Environment, application stack, and system configuration, and replicates that reference server configuration image onto multiple servers.
- It is applicable to Cluster Grid environments that contain large numbers of identical systems.
- Complete system replication: system administrators can capture a snapshot image of a complete server, as sketched below.
- Rapid deployment: Web Start Flash technology can reduce configuration complexity, improve deployment scalability, and significantly reduce installation time for rapid deployment.
71. Web Start Flash
- Layered Flash deployment: Web Start Flash technology provides the ability to layer Flash Archives, increasing the flexibility of the Web Start Flash installation while also reducing the disk space required to store Flash Archives.
- FRU server snapshot: Web Start Flash technology can also be used to store existing server configurations, thus making them a field replaceable unit (FRU).
72. Solaris JumpStart software
- Solaris JumpStart installs and sets up a Solaris system anywhere on the network without any user interaction.
- The Solaris Operating Environment and application software can be placed on centralized servers, and the install process can be customized by system administrators.
- It is highly customizable: administrators can set rules which automatically match the characteristics of the node being installed to an installation method, as sketched below.
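- Rule matching is driven by a plain-text rules file; a sketch of one entry (the network, profile, and finish-script names are hypothetical):

    # rules file:  match-conditions  begin-script  profile  finish-script
    network 192.168.1.0 && karch sun4u  -  compute_prof  setup_grid.sh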
73. Solaris Live Upgrade
- Solaris Live Upgrade promotes greater availability by:
- providing a mechanism to upgrade and manage multiple on-disk instances of the Solaris Operating Environment,
- allowing operating system upgrades to take place while the system continues to operate.
- It can be used for patch testing and roll-out, and can also provide a safe fall-back environment to quickly recover from upgrade problems or failures.
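- A sketch using the standard Live Upgrade commands (device, path, and boot-environment names are examples):

    # Create an alternate boot environment on a second disk slice.
    lucreate -n newBE -m /:/dev/dsk/c0t1d0s0:ufs
    # Upgrade the inactive environment while the system keeps running.
    luupgrade -u -n newBE -s /net/install/solaris9
    # Activate the new environment; it is used at the next reboot.
    luactivate newBE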