Title: The Sun Cluster Grid Architecture (Sun Grid Engine Project)
1. The Sun Cluster Grid Architecture (Sun Grid Engine Project)
- Adam Belloum
- Computer Architecture Parallel Systems group
- University of Amsterdam
- adam@science.uva.nl
2. Sun Cluster Architecture
- The architecture includes:
- front-end access nodes,
- middle-tier management nodes,
- back-end compute nodes.
3. Access Tier
- The access tier provides access and authentication services to Cluster Grid users.
- Access methods: telnet, rlogin, or ssh can be used to grant access.
- Web-based services can be provided to permit easy (tightly controlled) access to the facility.
4. Management Tier
- This middle tier includes one or more servers which run:
- the server elements of client-server software such as Distributed Resource Management (DRM),
- hardware diagnosis software,
- system performance monitors.
- File servers provide NFS service to other nodes in the Cluster Grid.
- License key servers manage software license keys for the Cluster Grid.
- Software provisioning servers manage the operating system, application software versioning, and patch application on other nodes in the Cluster Grid.
5. Compute Tier
- Supplies the compute power for the Cluster Grid. Jobs submitted through upper tiers in the architecture are scheduled to run on one or more nodes in the compute tier.
- Nodes in this tier run:
- the client side of the DRM software,
- the daemons associated with message-passing environments,
- any agents for system health monitoring.
- The compute tier communicates with the management tier, receiving jobs to run and reporting job completion status and accounting details.
6. Hardware Considerations
- The essential hardware components of a Cluster Grid are:
- computing systems,
- networking equipment,
- storage.
- The choice of hardware at each tier depends on a number of factors, primarily:
- What are the required services?
- What kind of service level is needed?
- What is the expected user load?
- What is the expected application type and mix?
7. Access, Compute, and Management Nodes
- Nodes in the access tier are utilized by users to submit, control, and monitor jobs.
- Nodes in the compute tier are used to execute jobs.
- Nodes in the management tier run the majority of the software needed to implement the Cluster Grid.
- The hardware requirements for each node depend, in part, on its location in the system architecture.
8. Access Node Requirements
- Access nodes typically require no special configuration.
- Any desktop or server that is connected to the network can be configured to allow direct access to the Cluster Grid.
- Introducing or modifying nodes in the access tier is a simple operation that does not affect other tiers in the architecture.
- Users without a system directly connected to the local area network can interface with an access node via conventional methods (e.g., telnet, rlogin, ftp, and ssh).
9. Compute Node Requirements
- Compute nodes run the jobs that are submitted to the Cluster Grid, and the design of this tier is crucial to maximizing application performance.
- The Cluster Grid software itself places little load upon nodes in the compute tier.
- Access nodes can also be configured as compute nodes, which may be appropriate in certain environments.
- For example, desktop machines used as access nodes can also be tapped for their spare compute cycles after business hours or when the CPU is otherwise idle.
10. Management Node Requirements
- Cluster Grids can be designed with one or more systems in the management tier.
- Running the management services on a single system is simplest, and this may be the best choice for small Cluster Grids.
- Running the management services on multiple systems provides greater scalability and can provide increased performance, especially for larger Cluster Grids.
11. Management Node Requirements
- The system requirements for the master node depend on:
- the size of the compute cluster,
- the volume of jobs being submitted,
- the complexity of any scheduling decisions that must be made.
- In large clusters, the DRM master node is dedicated: it should not perform compute-tier duties or act in any other capacity.
- This is particularly relevant in clusters running large MPI jobs.
- If the DRM server is acting as a compute node, system services can continually interrupt the MPI job in progress, thereby delaying a large job running across many nodes.
12. Networking Infrastructure
- A typical Cluster Grid can be configured with three separate types of network interconnects:
- Ethernet,
- a serial interconnect,
- a specialized low-latency, high-bandwidth interconnect.
13. Networking Infrastructure
- Compute, management, and access nodes in a Cluster Grid are typically connected by a local area network utilizing Fast Ethernet or Gigabit Ethernet technology.
- This network is used for file sharing, interprocess communication, and system management.
- If these functions share network hardware, care should be taken to separate standard Ethernet traffic from compute-related communications as far as possible.
14. Networking Infrastructure
- For resiliency and increased reliability, the network infrastructure can be configured to ensure that no single point of failure can compromise availability:
- the network can be designed with redundant switches,
- and multiple network interfaces can be used for increased throughput and to help meet network bandwidth requirements.
- An additional serial interconnect can be used for administrative convenience.
- A serial network can connect the system console of all compute nodes in the Cluster Grid to one or more terminal concentrators, which are in turn connected to the local area network.
- A specialized low-latency, high-bandwidth system interconnect is crucial to the performance of large, communication-intensive MPI jobs.
15. Networking Infrastructure
- Using a separate high-performance interconnect also reduces the networking load on the server CPU, freeing it for other tasks.
- An additional network can be added to provide rapid data delivery to and from compute nodes if required.
- This network can utilize high-speed Ethernet, or a Storage Area Network (SAN) can be implemented.
16. Software Integration
- Resilience
- Interoperability
- Manageability
17. Software Integration
- Software integration includes writing utility scripts or modifying the scripts that do application setup.
- In the ideal case, applications can be submitted to the DRM without requiring recompilation or linking with special libraries.
- Software integration also includes integrating the DRM with parallel environments such as:
- Parallel Virtual Machine (PVM),
- Message Passing Interface (MPI).
- With this integration, parallel jobs submitted by users can be controlled and properly accounted for by the DRM, as the sketch below illustrates.
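- As an illustration, a tightly integrated parallel environment is described in Sun Grid Engine by a PE object; the following sketch shows representative fields of such a definition (the startmpi.sh/stopmpi.sh paths are hypothetical and would point at the site's MPI setup scripts):

    # Sketch of a parallel environment, as displayed by "qconf -sp mpi"
    pe_name            mpi
    slots              64
    start_proc_args    /sge/mpi/startmpi.sh $pe_hostfile
    stop_proc_args     /sge/mpi/stopmpi.sh
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE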
18. Software Integration
- Other aspects of software integration can include the design of special interfaces at the access tier which automate or simplify the submission of tasks to the management tier for running on the compute tier.
- This can include writing specialized wrapper scripts, Web interfaces, or more fully featured graphical user interfaces (GUIs).
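- For example, a minimal submit wrapper might look like the following sketch (the queue name, resource value, and script names are assumptions, not part of the original material):

    #!/bin/sh
    # submit_app.sh -- hypothetical wrapper that hides qsub details from end users.
    # Usage: submit_app.sh <input-file>
    INPUT=$1
    # Submit to the medium queue, ask for 2 GB of free memory, and merge
    # stdout/stderr into a single log file per input.
    qsub -q medium -l mem_free=2G -j y -o "$HOME/results/$INPUT.log" run_app.sh "$INPUT"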
19. Resilience
- On the compute tier, nodes are anonymous and independent.
- If one node fails, the remaining nodes are unaffected and remain available to execute user jobs.
- The cluster can be configured to redo any work that is lost if a server fails mid-job, making users unaware of any individual node failures and providing increased availability.
- The RAS (Reliability, Availability, and Serviceability) features of the hardware and software elements are most relevant to the management tier.
20. Resilience
- The system operating environment can also contribute to high availability with features like:
- live upgrades,
- automatic dynamic reconfiguration,
- file system logging,
- and IP network failover.
- The availability of data can be increased with redundant, hot-swappable storage components, multiple paths to data storage, and hardware or software RAID capabilities.
21. High Availability
- If required, High Availability (HA) software can provide even greater levels of availability. For example:
- HA software can be used to provide a highly available NFS service to the Cluster Grid.
- If the primary NFS server should fail for any reason, NFS data services are automatically and transparently failed over to a backup server.
- Similar to the compute tier, the access tier generally contains many systems or devices, thus providing inherent redundancy.
22. Interoperability
- Cluster Grid implementations work on the principle of an integratable stack and should be able to run across a heterogeneous environment.
- Servers running different operating environments should be permitted to belong to the same compute cluster.
- Users should be able to submit jobs to any available architecture by simply submitting their job to the DRM software.
- If the job must run on a particular architecture, users can specify this as a resource requirement when submitting the job, as in the sketch below.
- The DRM software can then ensure that this job runs only on the correct system types and dispatch it appropriately.
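- A sketch of such a submission (the architecture string is an assumption; valid values depend on the installation):

    # Request a specific architecture; SGE dispatches only to matching hosts.
    qsub -l arch=sol-sparc64 job.sh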
23. Manageability
- The scalability of the Cluster Grid architecture can result in hundreds or even thousands of managed nodes.
- Management tools must scale with the size of a Cluster Grid, provide a single point of management, offer flexibility, and ensure security in a distributed environment.
- Proactive system management that monitors the health and functionality of systems can provide improved service.
- System management costs can be reduced significantly by utilizing installation and deployment technologies that help minimize the amount of time administrators spend installing and patching systems and software.
24. Cluster Grid Components
25. Cluster Grid Components
- One of the most important features of the Cluster Grid architecture is its modular and open design.
- Components are separate and have unique roles within the architecture.
- This design is commonly referred to as a software stack, with each layer in the stack representing a different functionality.
27. Sun Grid Engine
- Distributed Resource Management
- Cluster Queues
- Hostgroup and Hostlist
- Scheduler
28. Sun Grid Engine
- The Sun Grid Engine distributed resource management software is the essential component of any Cluster Grid.
- It optimizes utilization of software and hardware resources.
- It aggregates the compute power available in cluster grids and presents a unified and simple access point to users needing compute cycles.
- Sun Grid Engine software provides dependable, consistent, and pervasive access to both high-throughput and highly parallel computational capabilities.
29. Sun Grid Engine
- Sun Grid Engine can also provide:
- job accounting information,
- statistics that are used to monitor resource utilization and determine how to improve resource allocation.
- Administrators can specify job options, illustrated in the sketch below:
- priority,
- hardware and license requirements,
- dependencies,
- and can define and control user access to compute resources.
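- A sketch of a submission combining these options (the job ID and the app_license consumable are hypothetical):

    # -p sets the POSIX priority, -l states hardware/license requirements,
    # and -hold_jid makes the job wait for job 1234 to finish first.
    qsub -p -100 -l mem_free=4G,app_license=1 -hold_jid 1234 job.sh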
30. Distributed Resource Management
- The basis for DRM is the batch queuing mechanism.
- In the normal operation of a cluster, if the proper resources are not currently available to execute a job, the job is queued until the resources become available.
- DRM further enhances batch queuing by monitoring host computers in the cluster for properly balanced load conditions.
- Sun Grid Engine software provides the DRM functions: batch queuing, load balancing, job accounting statistics, user-specifiable resources, suspending and resuming jobs, and cluster-wide resources.
31. Cluster Queues
- The new cluster queue design is based on three major points:
- multiple hosts per queue configuration,
- different queue attributes per execution host,
- introduction of the concept of hostgroups.
32. Cluster Queues
- The cluster queue named big serves three different hosts: balrog, durin, and ori.
- The seq_no attribute value is:
- 1 for balrog,
- 2 for durin,
- and 0 for ori.
- Both the load_thresholds and suspend_thresholds attributes are the same for all execution hosts, as the configuration sketch below shows.
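- In SGE 6 queue-configuration syntax this corresponds to something like the following sketch (the threshold value is a placeholder); the leading value is the default and the bracketed entries are per-host overrides:

    # Excerpt of "qconf -sq big"
    qname               big
    hostlist            balrog durin ori
    seq_no              0,[balrog=1],[durin=2]   # 0 applies to ori
    load_thresholds     np_load_avg=1.75         # identical on all hosts
    suspend_thresholds  NONE                     # identical on all hosts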
33. Hostgroup and Hostlist
- A hostgroup contains a list of grid engine execution hosts and is referred to by an at sign ('@') followed by a string.
- A hostlist is a cluster queue attribute that contains execution hosts and/or hostgroups.
- The figure illustrates an example where the two created hostgroups @solaris64 and @linux belong to the queue named big.
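- A hostgroup itself is a small object; a sketch of what "qconf -shgrp @solaris64" might return (the member hostnames are assumptions):

    group_name @solaris64
    hostlist   balrog durin   # may also contain other hostgroups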
34. How to dispatch jobs?
- The scheduler selects queues for the submitted jobs via attribute matching in an N1 Grid Engine 6 cluster, as opposed to submitting to a specific queue, which is popular in other DRM products.
- Users can still submit to a specific queue if desired. An example would be a job that requests memory or CPU resources being submitted to a queue set up to fulfill this type of request; see the sketch below.
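- A sketch of attribute matching (resource values are illustrative):

    # No queue is named; SGE picks queue instances satisfying the requests.
    qsub -l mem_free=4G,num_proc=2 job.sh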
35. How to dispatch jobs?
- N1 Grid Engine 6 provides the capability to use regular expressions for matching resource requests.
- qsub -q medium job.sh → submits job.sh to the medium queue.
- qsub -q fast@@solaris64 job.sh → submits job.sh to the fast queue with the @solaris64 hostlist.
- qsub -q fast@sf15k job.sh → submits job.sh to the queue instance fast that belongs to the sf15k host.
- qmod -e big → enables the queue big.
- qmod -c big@@linux → clears the alarm state from the queue big which is attached to the hosts in the @linux hostgroup.
36. Scheduler
- Scheduler internal status creation is optimized for performance, and the task of sending tickets from the scheduler to qmaster is streamlined.
- The scheduler has look-ahead features, such as:
- resource reservation,
- backfilling.
- New prioritization scheme.
- Improved algorithms.
- Scheduling profile choices at install time.
38. Scheduler
- With the new scheduler, a high-priority job can use resource reservation to block the resources it needs (see the sketch below).
- Although the new scheduler ensures proper prioritization of jobs, resource reservation alone may leave resources idle for extended periods of time.
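- In N1 Grid Engine 6, a job asks for a reservation with the -R flag; a sketch (the PE name and slot count are assumptions):

    # Reserve resources so smaller jobs cannot starve this 32-way parallel job.
    qsub -R y -pe mpi 32 big_job.sh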
39. Scheduler (backfilling)
- Grid Engine notices that queues 2 and 3 will be idle because the 3-CPU Job 2 will have to wait until Job 1 finishes.
- It then scans the wait list for short jobs that could be run on queues 2 and 3 without delaying Job 2.
- After this analysis, Jobs 3 and 4 are started.
- Finally, Job 1 finishes after Jobs 3 and 4, freeing up the resources to start Job 2.
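- Backfilling works best when candidate jobs declare their run time, so the scheduler knows they will finish before the reserved resources are needed; a sketch:

    # A hard runtime limit of 30 minutes makes this job a backfill candidate.
    qsub -l h_rt=0:30:0 short_job.sh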
40. Scheduler
- Wait lists are controlled by three factors:
- priority (from POSIX priority),
- urgency,
- number of tickets.
- Priority = normalized(urgency) × weight_urgency + normalized(tickets) × weight_ticket + normalized(POSIX priority) × weight_priority
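- The three weights come from the scheduler configuration; the values shown below are typical defaults and may differ per installation:

    # Inspect the weighting factors in the scheduler configuration.
    qconf -ssconf | egrep 'weight_(priority|urgency|ticket)'
    weight_priority   1.000000
    weight_urgency    0.100000
    weight_ticket     0.010000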
41. Scheduler
- The scheduler has two new parameters to obtain more information about scheduling activities:
- PROFILE: if set to true, the scheduler will show how much time it spent on each step of a scheduling run.
- MONITOR: if set to true, the scheduler will dump all the information necessary to reproduce job resource utilization.
- Both are set in the scheduler configuration, as sketched below.
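- A sketch of enabling both switches in the params attribute of the scheduler configuration:

    # Edit the scheduler configuration (qconf -msconf) and set:
    params PROFILE=1,MONITOR=1
    # PROFILE timings appear in the messages file; MONITOR output is written
    # to the "schedule" file in the cell's common directory.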
42. ARCo
- ARCo (the Accounting and Reporting Console) has several predefined reports, such as:
- Accounting per Department
- Accounting per Project
- Accounting per User
- Host Load
- Statistics
- Average Job Turnaround Time
- Average Job Wait Time per Day
- Job Log
- Number of Jobs Completed
- Queue Consumables
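- ARCo reports are produced through its Web interface; comparable raw accounting data can be pulled on the command line with qacct, as in this sketch:

    # Summarize usage per owner over the last 30 days.
    qacct -o -d 30
    # Show the full accounting record of a single job.
    qacct -j 1234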
43. Sun Grid Engine Architecture
- Master host
- A single host is selected to be the Sun Grid Engine master host.
- This host handles all requests from users, makes job scheduling decisions, and dispatches jobs to execution hosts.
- Execution hosts
- Systems in the cluster that are available to execute jobs are called execution hosts.
- Submit hosts
- Submit hosts are machines configured to submit, monitor, and administer jobs, and to manage the entire cluster.
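- Host roles are assigned with qconf; a sketch using a hypothetical hostname:

    # Add an execution host (opens an editor to fill in the host definition).
    qconf -ae
    # Register desk05 as a submit host.
    qconf -as desk05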
44. Sun Grid Engine Architecture
- Software Job flow
- Security
- High Availability
45. Sun Grid Engine Architecture
- Administration hosts
- Sun Grid Engine administrators use administration hosts to make changes to the cluster configuration, such as:
- changing DRM parameters,
- adding new nodes,
- adding or changing users.
- Shadow master host
- While there is only one master host, other machines in the cluster can be designated as shadow master hosts to provide greater availability.
- A shadow master host continually monitors the master host, and automatically and transparently assumes control in the event that the master host fails.
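- Shadow master configuration is file-based; a sketch (paths assume a default installation, and $ARCH stands for the local binary architecture):

    # 1. List the backup hosts, one per line, in
    #    $SGE_ROOT/$SGE_CELL/common/shadow_masters
    # 2. Start the shadow daemon on each listed host:
    $SGE_ROOT/bin/$ARCH/sge_shadowd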
46. Software Job Flow
- Jobs are submitted to the master host and are held in a spooling area until the scheduler determines that the job is ready to run.
- Sun Grid Engine software matches available resources to job requirements, such as available memory, CPU speed, and available software licenses.
- The requirements of the jobs may be very different, and only certain hosts may be able to provide the corresponding service.
47. Software Job Flow
- Job submission
- A user submits a job from a submit host, and the job submission request is sent to the master host.
- Job scheduling
- The master host determines the host to which the job will be assigned. It assesses the load, checks for licenses, and evaluates any other job requirements.
- Job execution
- After obtaining scheduling information, the master host sends the job to the selected execution host. The execution host saves the job in a job information database and starts a shepherd process, which starts the job and waits for completion.
- Accounting information
- When the job is complete, the shepherd process returns the job information, and the execution host then reports the job completion to the master host and removes the job from the job information database. The master host updates the job accounting database to reflect job completion.
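- The four stages can be observed from a submit host; a sketch (the job ID is an example):

    # 1. Submission: qsub prints the assigned job ID, e.g. "Your job 1234 ...".
    qsub job.sh
    # 2./3. Scheduling and execution: watch the state move from qw to r.
    qstat -j 1234
    # 4. Accounting: after completion, inspect the recorded usage.
    qacct -j 1234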
48. Security
- To control access to the cluster, the Sun Grid Engine master host maintains information about eligible submit and administration hosts.
- Systems which have been explicitly listed as eligible submit hosts are able to submit jobs to the cluster.
- Systems which have been added to the list of eligible administration hosts can be used to modify the cluster configuration.
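- A sketch of inspecting and editing these lists with qconf (hostnames are examples):

    qconf -ss          # show eligible submit hosts
    qconf -sh          # show eligible administration hosts
    qconf -as desk05   # grant submit rights
    qconf -ah admin01  # grant administration rights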
49. High Availability
- The cluster can be configured with one or more shadow master hosts, eliminating the master host as a single point of failure and providing increased availability to users.
- If the master goes down, the shadow master host automatically and transparently takes over as the master.
- Shadow master host functionality is a fully integrated part of the Sun Grid Engine software.
- The only prerequisite for its use is a highly available file system on which to install the software and configuration files.
50. Development Tools and Run-Time Libraries
- Sun HPC ClusterTools
- Parallel Application Development
- Sun HPC ClusterTools Software
- Integration with Sun Grid Engine
- Forte for High Performance Computing
- Technical Computing Portal
51. Development Tools and Run-Time Libraries
- Sun HPC ClusterTools and Forte for High Performance Computing (HPC) software are commonly used to develop and run applications on Cluster Grids.
- Sun HPC ClusterTools provides an integrated software environment for developing and deploying parallel distributed applications.
- Forte HPC provides support for developing high-performance (non-parallel) applications in the Fortran, C, and C++ programming languages.
52. Sun HPC ClusterTools
- Sun HPC ClusterTools 4 software is a complete, integrated environment for parallel application development.
- It delivers an end-to-end software development environment for parallel distributed applications and provides middleware to manage a workload of highly resource-intensive applications.
- Sun HPC ClusterTools software enables users to develop and deploy distributed parallel applications with continuous scalability from one to 2048 processes within a single, well-integrated parallel development environment.
53. Parallel Application Development
- Two primary high-performance parallel programming models are supported: the single-process model and the multi-process model.
- The single-process model includes all types of multi-threaded applications.
- These may be automatically parallelized by high-performance compilers using parallelization directives (e.g., OpenMP) or explicitly parallelized with user-inserted Solaris or POSIX threads.
- The multi-process model supports the MPI standard for parallel applications that run both on single SMPs and on clusters of SMPs or thin nodes.
54. Parallel Application Development
- Sun HPC ClusterTools software includes:
- a high-performance, multi-protocol implementation of the industry-standard MPI,
- a full implementation of the MPI I/O protocol,
- tools for executing, debugging, performance analysis, and tuning of technical computing applications.
- Sun HPC ClusterTools software is thread-safe, facilitating a third, hybrid parallel application model:
- the mixing of threads and MPI parallelism to create applications that use MPI for communication between cooperating processes and threads within each process.
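- Under Sun CRE, MPI programs are compiled against the Sun MPI library and launched with mprun; a minimal sketch (the program name is hypothetical):

    # Build against Sun MPI and launch 16 processes under CRE.
    cc -o ring ring.c -lmpi
    mprun -np 16 ./ring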
55. Sun HPC ClusterTools Software
- Sun HPC ClusterTools software provides the features to effectively develop, deploy, and manage a workload of highly resource-intensive, MPI-parallel applications.
- Sun HPC ClusterTools is integrated to work with Sun Grid Engine software for use in Cluster Grid environments.
- Sun HPC ClusterTools software supports standard programming paradigms like MPI message passing, and includes a parallel file system that delivers high-performance, scalable I/O.
56. Integration with Sun Grid Engine
- Sun CRE provides Sun Grid Engine with the relevant information about parallel applications in which multiple resources are reserved for a single job.
- The Sun Grid Engine software uses the Sun CRE component to handle the details of launching MPI jobs, while still presenting the familiar Sun Grid Engine interface to the user.
- Integration of Sun HPC ClusterTools with the Sun Grid Engine framework provides a distinct advantage to users of a Sun Cluster Grid.
- By running parallel jobs with Sun CRE under the DRM of Sun Grid Engine, users achieve both efficient resource utilization and effective control over parallel applications.
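- With the integration in place, the MPI job is submitted through Sun Grid Engine and launched by Sun CRE inside the job script; a sketch (the PE name cre is an assumption about the local setup):

    # Ask SGE for 16 slots in the CRE parallel environment...
    qsub -pe cre 16 mpi_job.sh
    # ...where mpi_job.sh itself runs:  mprun -np 16 ./ring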
57. Forte for High Performance Computing (HPC)
- 64-bit application development: 64-bit technology offers many benefits, including:
- address space to handle large problems,
- 64-bit integer arithmetic to increase the calculation speed for mathematical operations,
- support for files greater than 4 GB in size.
- Sun Performance Library compatibility: compatibility with the Sun Performance Library helps provide optimized performance for matrix algebra and signal processing tasks on single-processor and multiprocessor systems.
58. Forte for High Performance Computing (HPC)
- Integrated programming environment: Forte HPC includes an integrated programming environment that enables developers to browse, edit, compile, debug, and tune applications efficiently.
- Software configuration management tools: Forte HPC provides software configuration management tools to enable development teams to work together effectively and efficiently.
59. Forte for High Performance Computing (HPC)
- Multi-threading technology: Forte HPC software enables developers to develop and tune multi-threaded/multi-processing applications using capabilities such as OpenMP API support for C and Fortran programs.
- Performance analysis tools: performance analysis tools enable developers to evaluate code performance, spot potential performance issues, and locate problems quickly.
60. Technical Computing Portal
- The Technical Computing Portal is a services-centric, Web-based, shared-everything approach to technical computing.
- It offers an easy-to-use interface for job submission, job control, and access to results via the Sun ONE Portal Server (formerly iPlanet Portal Server) and the Sun Grid Engine software.
- The Sun ONE Portal Server is a community-based server application that securely provides an aggregation of key content, applications, and services, personalized based on user role/identity, user preferences, and system-determined relevancy.
61. System Management Center
- Sun Management Center
- Intelligent Agent-Based Architecture
- Sun Validation Test Suite
- Installation and Deployment Technologies
- Web Start Flash
- Solaris JumpStart software
- Solaris Live Upgrade
62. System Management
- Cluster Grids can contain large numbers of distributed systems, and ensuring efficient and effective system management is essential.
- Powerful system administration tools such as Sun Management Center provide comprehensive administrative and management operations.
- Other tools include the Sun Validation Test Suite (SunVTS) to test and verify hardware functionality across a network, and automated installation and deployment technologies like the Solaris Web Start product line to help reduce the amount of time administrators spend installing and patching systems and software in a Cluster Grid.
63. Sun Management Center
- Sun Management Center software is an advanced system management tool designed to support Sun systems.
- It offers a single point of management for Sun systems, the Solaris Operating Environment, applications, and services for data center and highly distributed computing environments.
- Sun Management Center software enables system administrators to:
- perform remote system management,
- monitor performance,
- isolate hardware/software faults for hundreds of Sun systems,
- all through an easy-to-use Web interface.
- Enhanced, proactive event/alarm management provides early notification of potential service problems.
64. Intelligent Agent-Based Architecture
- Sun Management Center is based on an intelligent agent-based architecture:
- a manager monitors and controls managed entities by sending requests to agents residing on the managed nodes.
- Agents are key software components that collect management data on behalf of the manager.
65. Intelligent Agent-Based Architecture
- Scalability: distributing responsibility to the agents improves the Sun Management Center software's ability to scale as the number of managed nodes increases.
- Increased reliability and availability: because agents process data locally and are not dependent on other software components, reliability and availability are enhanced.
66. Intelligent Agent-Based Architecture
- Flexibility and extensibility: additional modules can be dynamically loaded into Sun Management Center agents.
- Decreased bandwidth requirements: intelligent agents offer savings in network bandwidth, as agents collect data on the managed nodes and only report status and significant events when necessary.
- Security: all users are authenticated, limiting administrators' access to and management of only the systems within their control.
67. Sun Validation Test Suite
- SunVTS is a comprehensive diagnostic tool that tests and validates Sun hardware by verifying the connectivity and functionality of most system hardware.
- SunVTS can be tailored to run on various types of machines, ranging from desktops to servers, and supports testing in both 32-bit and 64-bit Solaris operating environments.
- Tests examine subsystems such as processors, peripherals, storage, network, memory, graphics and video, audio, and communication.
68. Sun Validation Test Suite
- The primary goal of the SunVTS software is to create an environment in which Sun systems can be thoroughly tested to verify their proper operation or to find elusive problems.
- SunVTS can be used to validate a system during development or production, as well as for troubleshooting, periodic maintenance, and system or subsystem stressing.
69. Installation and Deployment Technologies
- With Solaris Web Start software and Solaris Web Start Wizards technology, the Solaris Operating Environment and other applications can be installed interactively with a browser-based interface.
- Solaris JumpStart software provides automated installation and setup of multiple systems over the network.
- Web Start Flash, Solaris JumpStart, and Solaris Live Upgrade technologies are particularly relevant to the Cluster Grid environment, where large numbers of similarly configured systems must be managed.
70. Web Start Flash
- Web Start Flash takes a complete system image of the Solaris Operating Environment, application stack, and system configuration, and replicates that reference server configuration image onto multiple servers.
- It is applicable to Cluster Grid environments that contain large numbers of identical systems.
- Complete system replication: system administrators can capture a snapshot image of a complete server, as sketched below.
- Rapid deployment: Web Start Flash technology can reduce configuration complexity, improve deployment scalability, and significantly reduce installation time for rapid deployment.
71. Web Start Flash
- Layered Flash deployment: Web Start Flash technology provides the ability to layer Flash Archives, increasing the flexibility of the Web Start Flash installation while also reducing the disk space required to store Flash Archives.
- FRU server snapshot: Web Start Flash technology can also be used to store existing server configurations, thus making them a field replaceable unit (FRU).
72. Solaris JumpStart software
- Solaris JumpStart installs and sets up a Solaris system anywhere on the network without any user interaction.
- The Solaris Operating Environment and application software can be placed on centralized servers, and the install process can be customized by system administrators.
- It is highly customizable: administrators can set rules which automatically match the characteristics of the node being installed to an installation method, as sketched below.
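- Rule matching is driven by a plain-text rules file; a sketch of one entry (the network, profile, and finish-script names are hypothetical):

    # rules file:  match-conditions  begin-script  profile  finish-script
    network 192.168.1.0 && karch sun4u  -  compute_prof  setup_grid.sh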
73. Solaris Live Upgrade
- Solaris Live Upgrade promotes greater availability by:
- providing a mechanism to upgrade and manage multiple on-disk instances of the Solaris Operating Environment,
- allowing operating system upgrades to take place while the system continues to operate.
- It can be used for patch testing and roll-out, and can also provide a safe fall-back environment to quickly recover from upgrade problems or failures.
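- A sketch using the standard Live Upgrade commands (device, path, and boot-environment names are examples):

    # Create an alternate boot environment on a second disk slice.
    lucreate -n newBE -m /:/dev/dsk/c0t1d0s0:ufs
    # Upgrade the inactive environment while the system keeps running.
    luupgrade -u -n newBE -s /net/install/solaris9
    # Activate the new environment; it is used at the next reboot.
    luactivate newBE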