Title: TRIUMF Computing Services
Steven McDonald
- TRIUMF public compute cluster
- OpenMosix & PBS
- Wide area networking
- Traffic shaping & bandwidth management
- TRIUMF wireless network
TRIUMF Public Compute Cluster
The State of Public Computing at TRIUMF
- Since the slow and very painful demise of VMS clustering during the 90s, TRIUMF has not provided an alternative to true cluster computing.
- We began using Linux in the mid 90s; typically a public server with the latest and greatest CPU was purchased, and everyone got an account.
- The lifetime was typically two years before the hardware needed to be replaced, disk capacity doubled, and accounts migrated.
- This continued for a while until we started using small NIS clusters, but the tear-down-and-replace approach typically continued.
- The challenge was to find a cluster that could be useful to the majority of our users, yet affordable and maintainable with our limited resources.
What Type of Cluster Do We Need?
- Focus on the configuration of the cluster and how it has addressed a number of issues that were important to TRIUMF:
  - Satisfies a spectrum of user requirements
  - Most efficient use of its resources
  - Manageable and maintained by one or two people
- What type of cluster is required?
  - HPC, load-balancing, fail-over, CluMPs, parallel, storage, database, SSI
- What type of use is expected?
  - Interactive use (30-50 users): web browsing, e-mail, etc.
  - Program development
  - Batch jobs (long & short)
- WestGrid removed the need for a large batch cluster
Cluster Hardware & Software Configuration
- IBM x330s: 12 x 1.4 GHz CPUs
- 1 TB of SCSI-attached IDE RAID5 disk
- Red Hat 7.3
- openMosix kernel
  - Transparent process migration
- OpenPBS with the Maui scheduler (same as WestGrid)
  - Batch queue (see the sketch after this list)
- xCAT, Ganglia, openMosixview, xpbs for monitoring status
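The following is a minimal sketch of how a user job might be handed to the OpenPBS/Maui batch queue mentioned above. The job name, resource limits, and the my_analysis program are illustrative assumptions, not values taken from the TRIUMF setup; the slides do not reproduce any real job script.

```python
import subprocess

# Hypothetical job script: the name, resource request, and program
# are assumptions made for illustration only.
JOB_SCRIPT = """\
#!/bin/sh
#PBS -N demo_job
#PBS -l nodes=1:ppn=1,walltime=01:00:00
cd $PBS_O_WORKDIR
./my_analysis
"""

def submit(script: str) -> str:
    """Pipe a job script to qsub and return the job id it prints."""
    result = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted", submit(JOB_SCRIPT))
```

qsub also accepts the script as a file argument; piping it on stdin simply keeps the example self-contained.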
How Is It Different from Others?
- So what is unique?
- It is both:
  - a traditional batch cluster
  - an interactive load-sharing cluster with transparent process migration
- Configuration:
  - Head node + 2 compute nodes always dedicated to openMosix with load balancing
  - 3 compute nodes (6 processors) allocated to PBS batch queues
  - When no PBS jobs are queued, the PBS post-execution script turns the openMosix properties of the kernel back on, allowing the node to rejoin the openMosix load-sharing cluster (a sketch of this per-node toggle follows below)
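As a rough illustration of that per-node toggle, the sketch below shows how a script might switch a node's participation in openMosix load sharing off and on. The use of the mosctl utility and these particular subcommands is an assumption about a typical openMosix installation; the actual TRIUMF scripts are not shown in the slides.

```python
import subprocess

def _mosctl(*args: str) -> None:
    """Run an openMosix administration command (assumes openmosix-tools)."""
    subprocess.run(["mosctl", *args], check=True)

def leave_openmosix() -> None:
    """Dedicate this node to PBS work: stop taking part in load sharing."""
    _mosctl("block")    # refuse new guest processes from other nodes
    _mosctl("expel")    # send any resident guest processes back home
    _mosctl("lstay")    # keep locally started (PBS) processes on this node

def join_openmosix() -> None:
    """Return this node to the openMosix load-sharing pool."""
    _mosctl("nolstay")  # allow local processes to migrate again
    _mosctl("noblock")  # accept guest processes again
```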
Cluster Logic
1. Initial condition:
   - All nodes running openMosix with process migration turned ON.
   - PBS turned OFF on the head node and 2 compute nodes, which prevents PBS from ever submitting jobs to these nodes.
2. User submits a job.
3. The PBS prologue script removes the node from participating in openMosix load balancing, thereby dedicating the node to PBS jobs only.
4. If more PBS jobs are submitted and a CPU is available on an existing allocated PBS node, the job is submitted there.
5. Else, the job is submitted to another PBS node, and that node is also removed from openMosix membership.
6. When a PBS job ends, the PBS epilogue script checks whether any other PBS jobs are running on that node; if not, the node returns to participating in openMosix membership (steps 3 and 6 are sketched below).
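Below is a minimal sketch of how steps 3 and 6 might be wired into PBS prologue/epilogue hooks, reusing the leave_openmosix()/join_openmosix() helpers from the earlier sketch (the node_toggle module name is hypothetical). Detecting other running jobs by parsing pbsnodes output is likewise an assumption made for illustration; the slides do not show TRIUMF's actual scripts.

```python
import socket
import subprocess

# Hypothetical module holding the helpers from the earlier sketch.
from node_toggle import leave_openmosix, join_openmosix

def other_pbs_jobs_running() -> bool:
    """Return True if 'pbsnodes -a' still lists jobs on this host.

    Assumes node names in the pbsnodes output match socket.gethostname();
    adjust for short vs. fully qualified names as needed.
    """
    host = socket.gethostname()
    out = subprocess.run(["pbsnodes", "-a"],
                         capture_output=True, text=True, check=True).stdout
    current = None
    for line in out.splitlines():
        if line and not line[0].isspace():
            current = line.strip()                      # node name header
        elif current == host and line.strip().startswith("jobs ="):
            return line.split("=", 1)[1].strip() != ""  # any job listed?
    return False

def prologue() -> None:
    """Step 3: pull the node out of openMosix before the PBS job starts."""
    leave_openmosix()

def epilogue() -> None:
    """Step 6: after the job ends, rejoin openMosix if the node is idle."""
    if not other_pbs_jobs_running():
        join_openmosix()
```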
Final Words
- While by no means an impressive cluster in terms of size, we have managed to combine a traditional batch-queuing cluster using OpenPBS with a load-sharing cluster such as openMosix in a unique way.
- However, it is scalable, thereby avoiding the time-consuming tear-down-and-replace approach of the past.
- It satisfies a spectrum of user requirements, and it is only one cluster to manage and maintain with limited resources.