1
Condor Usage at Brookhaven National Lab
Alexander Withers (talk given by Tony Chan) RHIC
Computing Facility Condor Week - March 15, 2005
2
About Brookhaven National Lab
  • One of a handful of national laboratories supported and managed by
    the U.S. government through the DOE.
  • Multi-disciplinary lab with 2,700 employees; Physics is the largest
    department.
  • Physics Dept. has its own computing division (30
    FTEs) to support physics (HEP) projects.
  • RHIC (nuclear) and ATLAS (HEP) are largest
    projects currently being supported.

3
Computing Facility Resources
  • Full-service facility: central/distributed storage, a large Linux
    Farm, a robotic system for data storage, data backup, etc.
  • 6 PB permanent tape storage capacity.
  • 500 TB central/distributed disk storage
    capacity.
  • 1.4 million SpecInt2000 aggregate computing power in the Linux
    Farm.

4
History of Condor at Brookhaven
  • First looked at Condor in 2003 as a replacement
    for LSF and in-house batch software.
  • Installed 6.4.7 in August 2003.
  • Upgraded to 6.6.0 in February 2004.
  • Upgraded to 6.6.6 (with 6.7.0 startd binary) in
    August 2004.
  • User base grew from 12 (April 2004) to 50 (March
    2005).

5
The Rise in Condor Usage
6
The Rise in Condor Usage
7
Condor Cluster Usage
8
BNL's modified CondorView
9
Overview of Computing Resources
  • Total of 2,750 CPUs (growing to 3,400 in 2005).
  • Two central managers, with one acting as a backup.
  • Three specialized submit machines, each handling 600 simultaneous
    jobs on average.
  • 131 of the execute nodes can also act as
    submission nodes.
  • One monitoring/Condorview server.

10
Overview of Computing Resources, cont.
  • Six GLOBUS gateway machines for remote job
    submission.
  • Most machines run SL 3.0.2 on the x86 platform; some still use RH
    7.3.
  • Running 6.6.6 with the 6.7.0 startd binary to take advantage of the
    multiple-VM feature.

11
Overview of Configuration
  • Computing resources divided into 6 pools.
  • Two configuration models:
    - Split pool resources into two parts and restrict which jobs can
      run in each part.
    - A more complex version of the Bologna Batch System.
  • A pool uses one or both of these models.
  • Some pools employ user priority preemption.
  • Use a "drop queue" method to fill fast machines first.
  • Have tools to easily reconfigure nodes.
  • All jobs use the vanilla universe (no checkpointing); a submit-file
    sketch follows this list.
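For illustration only, a minimal vanilla-universe submit file might look
like the sketch below. The executable name is hypothetical, and ranking
on the Mips benchmark is just one generic way to prefer faster machines,
not BNL's actual "drop queue" mechanism.

    # Hypothetical vanilla-universe submit file (not from the talk).
    # rank = Mips asks the negotiator to prefer machines with a higher
    # benchmark rating, roughly "fill fast machines first".
    universe   = vanilla
    executable = analysis.sh                 # hypothetical job script
    output     = analysis.$(Cluster).$(Process).out
    error      = analysis.$(Cluster).$(Process).err
    log        = analysis.log
    rank       = Mips
    queue 1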

12
Two Part Model
  • Nodes are assigned one of two tasks, independent of Condor:
    analysis or reconstruction.
  • Within Condor, a node advertises itself as either
    an analysis node or a reconstruction node.
  • A job must advertise itself in the same manner to
    match with an appropriate node.
  • Only certain users may run reconstruction jobs, but anyone can run
    an analysis job (a configuration sketch follows this list).
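A minimal sketch of how this advertising and matching could be
expressed in the Condor configuration and submit languages; the
attribute names NodeType and JobType are assumptions for illustration,
not necessarily the ones used at BNL.

    # startd (execute node) side; an analysis node would instead set
    # NodeType = "analysis"
    NodeType     = "reconstruction"
    STARTD_EXPRS = NodeType          # publish NodeType in the machine ClassAd
    # only accept jobs that advertise a matching type; restrictions on
    # which users may run reconstruction could be added to START as well
    START        = (TARGET.JobType =?= NodeType)

    # submit file side: the job advertises its type and requires a
    # matching node
    +JobType     = "reconstruction"
    requirements = (TARGET.NodeType =?= "reconstruction")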

13
Analysis/Reconstruction
(Diagram: node groups ordered from Group 5 (fast) down to Group 1
(slow), each node split into two VMs (vm1/vm2). No suspension, no
preemption; a job starts whenever a CPU is free. Example: a
reconstruction job requesting group < 2.)
14
A More Complex Version of the Bologna Model
  • Dual-CPU nodes, each presenting 8 VMs (4 per CPU).
  • Only two jobs running at a time (one per CPU).
  • Four job categories, each with its own priority.
  • A high-priority VM will suspend a random VM of lower priority.
  • The random aspect prevents the same VM from always getting
    suspended (a configuration sketch follows this list).
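A rough sketch of how a dual-CPU node might be carved up this way in
the 6.6/6.7 configuration language. The macro values and the suspension
expression are illustrative assumptions: only the MC-versus-high-priority
case is shown, and the random choice of the victim VM is omitted.

    # present 8 virtual machines on a dual-CPU node (illustrative values)
    NUM_CPUS             = 8
    NUM_VIRTUAL_MACHINES = 8
    # cross-publish each VM's State so one VM's policy can see the others
    STARTD_VM_EXPRS      = State
    # vm1/vm2 are the lowest-priority (MC) slots: suspend their job when
    # a high-priority slot (vm7/vm8) becomes busy
    SUSPEND  = (VirtualMachineID <= 2) && \
               (vm7_State == "Claimed" || vm8_State == "Claimed")
    CONTINUE = ($(SUSPEND)) == FALSE
    PREEMPT  = FALSE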

15
(Diagram: analysis/reconstruction node groups, Group 5 (fast) down to
Group 1 (slow), combined with four VM priority categories per node:
High (vm7/vm8), Med (vm5/vm6), Low (vm3/vm4), MC (vm1/vm2).
Lower-priority VMs are suspended; no preemption; a job starts if a CPU
is free or the job is of higher priority. Example: a reconstruction job
requesting group 3 runs at Med. priority on vm5/vm6.)
16
Issues We've Had to Deal With
  • Tune parameters to alleviate scalability problems (placeholder
    values are sketched after this list):
    - MATCH_TIMEOUT
    - MAX_CLAIM_ALIVES_MISSED
  • Panasas (proprietary file system) creates kernel threads with
    whitespace in the process name, which broke an fscanf in procapi.C;
    Panasas fixed the bug.
  • High-volume users can dominate pool, partially
    solved with PREEMPTION_REQUIREMENTS.
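The values below are placeholders showing only the form of this tuning;
the talk does not give the numbers actually used at BNL.

    # scalability-related tuning (placeholder values)
    MATCH_TIMEOUT           = 300   # seconds a matched startd waits to be claimed
    MAX_CLAIM_ALIVES_MISSED = 10    # keepalives a startd may miss before dropping a claim

    # negotiator policy: let a job preempt a running one only when the
    # running user's fair-share priority value is substantially worse
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2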

17
Issues We've Had to Deal With, cont.
  • DAGMan problems (latency, termination) -> switched from DAGMan to
    plain Condor.
  • Created our own ClassAds and job ads to build batch queues and
    handy management tools (i.e., our own version of condor_off).
  • Modified CondorView to meet our accounting and monitoring
    requirements.

18
Issues Not Yet Resolved
  • Need a job ClassAd attribute that gives the user's primary group
    -> better control over cluster usage.
  • Transfer output files for debugging when a job is evicted.
  • Need an option to force the schedd to release its claim after each
    job.
  • Allow the schedd to set a mandatory periodic_remove policy -> avoid
    manual cleanup (a per-job example follows this list).
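For context, the per-job form of such an expression already exists in
the submit language; the slide asks for a way to have the schedd impose
it on every job. A hypothetical per-job example:

    # hypothetical submit-file policy: remove a job that has sat in the
    # Held state (JobStatus == 5) for more than one day
    periodic_remove = (JobStatus == 5) && (CurrentTime - EnteredCurrentStatus > 86400)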

19
Issues Not Yet Resolved, cont.
  • The shadow seems to make a large number of NIS calls; possibly a
    caching problem -> address shadow behavior in the vanilla universe?
  • Need Kerberos support to comply with security mandates.
  • Interested in Computing on Demand (COD), but lack of functionality
    prevents more usage.
  • Need more (and more effective) cluster management tools ->
    condor_off works?

20
Near-Term Plans Summary
  • Waiting for 6.8.x series (late 2005?) to upgrade.
  • Scalability concerns as usage rises.
  • High availability more critical as usage rises.
  • Integration of BNL Condor pools with external
    pools, but concerned about security.
  • Need some of the functionality listed above for a meaningful
    upgrade and to improve cluster management capability.