Title: TRIUMF Computing Services
Steven McDonald
- TRIUMF public compute cluster
- OpenMosix & PBS
- Wide area networking
- Traffic shaping & bandwidth management
- TRIUMF wireless network
TRIUMF Public Compute Cluster
The State of Public Computing at TRIUMF
- Since the slow and very painful demise of VMS clustering during the 90s, TRIUMF has not provided an alternative to true cluster computing.
- We began using Linux in the mid 90s; typically a public server with the latest and greatest CPU was purchased, and everyone got an account.
- The lifetime was typically two years before the hardware needed to be replaced, disk capacity doubled, and accounts migrated.
- This continued for a while until we started using small NIS clusters, but the tear-down-and-replace approach typically continued.
- The challenge was to find a cluster that could be useful to the majority of our users, yet affordable and maintainable with our limited resources.
What Type of Cluster Do We Need?
- Focus on the configuration of the cluster and how it has addressed a number of issues that were important to TRIUMF:
  - Satisfies a spectrum of user requirements
  - Most efficient use of its resources
  - Manageable and maintained by one or two people
- What type of cluster is required?
  - HPC, load-balancing, fail-over, CluMPs, parallel, storage, database, SSI
- What type of use is expected?
  - Interactive use (30-50 users): web browsing, e-mail, etc.
  - Program development
  - Batch jobs (long & short)
- WestGrid removed the need for a large batch cluster
Cluster Hardware & Software Configuration
- IBM x330s: 12 x 1.4 GHz CPUs
- 1 TB of SCSI-attached IDE RAID5 disk
- Red Hat 7.3
- openMosix kernel
  - Transparent process migration
- OpenPBS with the Maui scheduler (same as WestGrid)
  - Batch queue (see the sketch after this list)
- xCAT, Ganglia, openMosixview, xpbs for monitoring status
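The following is a minimal sketch of how a user job might be handed to the OpenPBS/Maui batch queue mentioned above. The job name, resource limits, and the my_analysis program are illustrative assumptions, not values taken from the TRIUMF setup; the slides do not reproduce any real job script.

```python
import subprocess

# Hypothetical job script: the name, resource request, and program
# are assumptions made for illustration only.
JOB_SCRIPT = """\
#!/bin/sh
#PBS -N demo_job
#PBS -l nodes=1:ppn=1,walltime=01:00:00
cd $PBS_O_WORKDIR
./my_analysis
"""

def submit(script: str) -> str:
    """Pipe a job script to qsub and return the job id it prints."""
    result = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted", submit(JOB_SCRIPT))
```

qsub also accepts the script as a file argument; piping it on stdin simply keeps the example self-contained.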
How Is It Different from Others?
- So what is unique?
- It is both:
  - a traditional batch cluster
  - an interactive load-sharing cluster with transparent process migration
- Configuration:
  - Head node + 2 compute nodes always dedicated to openMosix with load balancing
  - 3 compute nodes (6 processors) allocated to PBS batch queues
  - When no PBS jobs are queued, the PBS post-execution script turns the openMosix properties of the kernel back on, allowing the node to rejoin the openMosix load-sharing cluster (a sketch of this per-node toggle follows below)
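As a rough illustration of that per-node toggle, the sketch below shows how a script might switch a node's participation in openMosix load sharing off and on. The use of the mosctl utility and these particular subcommands is an assumption about a typical openMosix installation; the actual TRIUMF scripts are not shown in the slides.

```python
import subprocess

def _mosctl(*args: str) -> None:
    """Run an openMosix administration command (assumes openmosix-tools)."""
    subprocess.run(["mosctl", *args], check=True)

def leave_openmosix() -> None:
    """Dedicate this node to PBS work: stop taking part in load sharing."""
    _mosctl("block")    # refuse new guest processes from other nodes
    _mosctl("expel")    # send any resident guest processes back home
    _mosctl("lstay")    # keep locally started (PBS) processes on this node

def join_openmosix() -> None:
    """Return this node to the openMosix load-sharing pool."""
    _mosctl("nolstay")  # allow local processes to migrate again
    _mosctl("noblock")  # accept guest processes again
```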
Cluster Logic
1. Initial condition:
   - All nodes running openMosix with process migration turned ON.
   - PBS turned OFF on the head node and 2 compute nodes, which prevents PBS from ever submitting jobs to these nodes.
2. User submits a job.
3. The PBS prologue script removes the node from participating in openMosix load balancing, thereby dedicating the node to PBS jobs only.
4. If more PBS jobs are submitted and a CPU is available on an existing allocated PBS node, the job is submitted there.
5. Else, the job is submitted to another PBS node, and that node is also removed from openMosix membership.
6. When a PBS job ends, the PBS epilogue script checks whether any other PBS jobs are running on that node; if not, the node returns to participating in openMosix membership (steps 3 and 6 are sketched below).
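Below is a minimal sketch of how steps 3 and 6 might be wired into PBS prologue/epilogue hooks, reusing the leave_openmosix()/join_openmosix() helpers from the earlier sketch (the node_toggle module name is hypothetical). Detecting other running jobs by parsing pbsnodes output is likewise an assumption made for illustration; the slides do not show TRIUMF's actual scripts.

```python
import socket
import subprocess

# Hypothetical module holding the helpers from the earlier sketch.
from node_toggle import leave_openmosix, join_openmosix

def other_pbs_jobs_running() -> bool:
    """Return True if 'pbsnodes -a' still lists jobs on this host.

    Assumes node names in the pbsnodes output match socket.gethostname();
    adjust for short vs. fully qualified names as needed.
    """
    host = socket.gethostname()
    out = subprocess.run(["pbsnodes", "-a"],
                         capture_output=True, text=True, check=True).stdout
    current = None
    for line in out.splitlines():
        if line and not line[0].isspace():
            current = line.strip()                      # node name header
        elif current == host and line.strip().startswith("jobs ="):
            return line.split("=", 1)[1].strip() != ""  # any job listed?
    return False

def prologue() -> None:
    """Step 3: pull the node out of openMosix before the PBS job starts."""
    leave_openmosix()

def epilogue() -> None:
    """Step 6: after the job ends, rejoin openMosix if the node is idle."""
    if not other_pbs_jobs_running():
        join_openmosix()
```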
Final Words
- While by no means an impressive cluster in terms of size, we have managed to combine a traditional batch-queuing cluster using OpenPBS with a load-sharing cluster such as openMosix in a unique way.
- However, it is scalable, thereby avoiding the time-consuming tear-down-and-replace approach of the past.
- It satisfies a spectrum of user requirements, and it is only one cluster to manage and maintain with limited resources.