Caging the CCLRC Compute Zoo

1
Caging the CCLRC Compute Zoo (Activities at CCLRC)
John Kewley
j.kewley_at_dl.ac.uk
http://www.e-science.clrc.ac.uk/web/staff/john_kewley
2
Outline
  • What is a Compute Zoo?
  • Caging Problems
  • A Trip to the Zoo
  • Uses for a Compute Zoo

3
What is a Compute Zoo?
4
Compute Farm
  • Homogeneous: large numbers of (near-)identical
    resources
  • Often physically co-located: a training room, lab
    workstations or a large cluster
  • Centrally managed, often by dedicated staff
  • Typical of many Condor pools; excellent for High
    Throughput Computing (HTC)

5
Compute Farm
6
Compute Zoo
  • Heterogeneous: resources span many different
    operating systems and architectures
  • Located across a site
  • Individually or variously managed
  • Of minimal use for HTC

7
Compute Zoo
8
Caging Problems (Firewall Mirroring)
9
Firewalls within a Condor Pool
  • Some resource owners have firewalls on their
    personal workstations
  • Condor needs each submit node to be able to
    talk to every potential execute node, so whenever
    a submit node is added, every firewall in the
    pool must be opened to it.
  • Between adding the new node and the firewalls
    being updated, the firewalled nodes will be
    unavailable for use.
  • Or are they?
  • Maybe someone should tell Condor!

10
Adding a new machine to the pool
  • If we add a new machine to the pool, the existing
    firewalls may not have anticipated this.
  • The firewalls will likely block this new machine
  • A job from the newly added machine may still be
    matched to a firewalled resource.
  • This job will not be able to run.
  • Parts of the system can jam as a result:
      • condor_q on the submitting node
      • Subsequent parts of the submit script
      • (maybe also parts of the central node)

11
Private networks
  • Similar "jams" occur if part of your pool (or
    flock of pools) is on a network that is
    unavailable to some of the other nodes
  • How can we permit jobs from submit nodes that can
    access the private network to run on these nodes
    whilst preventing Condor sending jobs from other
    submit nodes there?

12
How can we get round this?
  1. Restrict the number of submit nodes
  2. Automatically update the firewall files
  3. Ensure everything is up-to-date
  4. Permit the pool to evolve whilst persuading Condor to
    avoid sending jobs to nodes where they can't run

13
Firewall Mirroring (1)
  • Each machine with a firewall declares the fact in
    its ClassAd:
      • HAS_FIREWALL = TRUE
  • It also declares which machines and/or subnets it
    permits to access its Condor ports (mirroring its
    firewall table settings):
      • FW_ALLOWS_113 = TRUE
      • FW_ALLOWS_rjavig6 = TRUE
  • Finally, it needs to export these settings:
      • STARTD_EXPRS = HAS_FIREWALL, FW_ALLOWS_113, \
        FW_ALLOWS_rjavig6
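
The mirroring step on this slide could be automated along the following lines. This is only a minimal sketch: the helper name firewall_classad_lines and the idea of feeding it a plain list of permitted hosts/subnets are assumptions for illustration, not part of the original setup. It simply emits configuration lines in the form shown on the slide.

```python
def firewall_classad_lines(allowed):
    """Return Condor config lines mirroring a firewall allow-list.

    `allowed` is a list of hostnames or subnet labels (hypothetical input),
    e.g. ["113", "rjavig6"].
    """
    # One attribute for the firewall itself, then one per permitted host/subnet.
    attrs = ["HAS_FIREWALL"] + [f"FW_ALLOWS_{name}" for name in allowed]
    lines = [f"{attr} = TRUE" for attr in attrs]
    # Export every attribute so it appears in the machine's ClassAd.
    lines.append("STARTD_EXPRS = " + ", ".join(attrs))
    return lines


for line in firewall_classad_lines(["113", "rjavig6"]):
    print(line)
```

Regenerating these lines whenever the firewall table changes would keep the advertised attributes in step with what the firewall really permits.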

14
Firewall Mirroring (2)
  • To ensure that jobs can only go to resources they
    can reach, submit machines declare their subnet
    and hostname:
      • MY_SUBNET = 113
      • MY_HOST = condor
  • These values are used in the following expression,
    which is added to the REQUIREMENTS of all jobs from
    this machine:
      • APPEND_REQUIREMENTS = ( \
          (HAS_FIREWALL =!= TRUE) || \
          (FW_ALLOWS_$(MY_HOST) == TRUE) || \
          (FW_ALLOWS_$(MY_SUBNET) == TRUE) )
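
The effect of the appended requirement can be illustrated with a small model. This is a sketch only: it assumes the three clauses are alternatives (joined by logical OR), represents a machine ClassAd as a plain Python dict, and treats a missing attribute as undefined. The function name can_reach and the example hosts "newbox" and subnet "114" are hypothetical.

```python
def can_reach(machine_ad, my_host, my_subnet):
    """Approximate the appended requirement for one machine ClassAd (a dict)."""
    # dict.get returns None for a missing attribute, standing in for UNDEFINED.
    no_firewall = machine_ad.get("HAS_FIREWALL") is not True
    host_allowed = machine_ad.get(f"FW_ALLOWS_{my_host}") is True
    subnet_allowed = machine_ad.get(f"FW_ALLOWS_{my_subnet}") is True
    return no_firewall or host_allowed or subnet_allowed


open_node = {}  # advertises no firewall attributes at all
guarded = {"HAS_FIREWALL": True, "FW_ALLOWS_113": True, "FW_ALLOWS_rjavig6": True}

print(can_reach(open_node, "condor", "113"))  # unfirewalled node: always reachable
print(can_reach(guarded, "condor", "113"))    # subnet 113 is permitted
print(can_reach(guarded, "newbox", "114"))    # new machine not yet in the firewall table
```

The last case is the one that matters: a newly added submit node is simply never matched to a firewalled resource until that resource advertises that it allows it, so nothing jams in the meantime.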

15
And Private Networks?
  • The same solution can be used for private networks:
    pretend their nodes have a firewall and declare
    which other nodes have access to that network

16
A Trip to the Zoo (Viewing the Pool)
17
The CCLRC Compute Zoo
  • 2x Windows XP Professional
  • 2x Windows 2000 Professional
  • 1x Windows NT 4.0 Workstation
  • 7x SuSE Linux 9.0
  • 2x SuSE Linux 8.0
  • 1x SuSE Linux 9.1
  • 5x White Box Enterprise Linux 3.0
  • 1x Red Hat Enterprise Linux AS release 3.0
  • 1x Red Hat Enterprise Linux WS release 3.0
  • 3x Red Hat Linux 9
  • 2x Red Hat Linux 8.0
  • 2x Red Hat Linux 7.3
  • 1x Mandrake Linux 10.1
  • 1x Gentoo Linux 1.4

18
Viewing the Pool
  • http://tardis.dl.ac.uk/Condor/cgi-bin/CondorStatus.cgi
  • http://tardis.dl.ac.uk/Condor/cgi-bin/WiscStatus.cgi

19
Uses of a Zoo
20
Build and Test
  • The CCLRC pool was part of the UK Grid
    Engineering Task Force Build and Test project.
  • Software bundles were distributed to a variety of
    OS types around the flocked pool for building and
    testing.
  • This type of (flocked) pool relies on
    heterogeneity; only small numbers of each OS type
    are required.
  • http://polaris.ecs.soton.ac.uk:65000/
  • http://wiki.nesc.ac.uk/read/sfct?HomePage

21
Other non-HTC Uses
  • I want to ensure my code compiles without
    warnings and/or runs its basic tests:
      • On as many OSs as possible
      • With as many different compilers as possible
  • I want to perform a release build of my product
    for platform X, but I only have accounts on A, B
    and C
  • I have several server-licensed products and many
    potential occasional users. How can these be made
    available to them more easily (within the bounds
    of the licence, of course)?
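
The first two uses above amount to submitting the same build-and-test job once per platform. A minimal sketch of how that might be scripted follows. It is an illustration only: the helper name submit_description, the executable name build_and_test, and the output file naming are assumptions, and the OpSys/Arch values are examples that would need to match what the machines in a given pool actually advertise.

```python
def submit_description(opsys, arch, executable="build_and_test"):
    """Build a Condor submit description targeting one OpSys/Arch pair."""
    # Restrict matching to machines advertising this platform.
    requirements = f'(OpSys == "{opsys}") && (Arch == "{arch}")'
    return "\n".join([
        "universe = vanilla",
        f"executable = {executable}",  # hypothetical build/test driver
        f"requirements = {requirements}",
        f"output = build_{opsys}_{arch}.out",
        f"error = build_{opsys}_{arch}.err",
        "queue",
    ])


# Example platform list (assumed values); one description per platform.
platforms = [("LINUX", "INTEL"), ("WINNT51", "INTEL")]
for opsys, arch in platforms:
    print(submit_description(opsys, arch))
    print()
```

Each description could be written to a file and passed to condor_submit. Only one machine of each type is needed, which is why a zoo suits this kind of work better than a farm.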

22
What other uses are there for a Compute Zoo?