Upgrading Condor Best Practices

Slides: 23
Provided by: con92
1
Upgrading Condor Best Practices
2
The problem
  • More frequent releases of Condor
  • Every six to nine months?
  • Understand this is a problem for users
  • We're willing to help out

3
Overview
  • Config file management
  • Condor testing strategies
  • Standard Universe issues

4
Config files
  • LOCAL_CONFIG_FILE
  • Used for include-like behaviour
  • LOCAL_CONFIG_FILE = \
      $(HOSTS), $(GLOBAL), $(POLICY)

5
Typical Config file
  • ## Try to save this much swap space by not
    ## starting new shadows.
  • ## Specified in megabytes.
  • #RESERVED_SWAP = 5
  • A commented-out setting shows the default value

6
Config file editing
  • Never edit base condor_config file
  • Except to specify the local file
  • Put all edits in a local file
  • One local file per config type
  • E.g. for schedds, CMs, types of execute machines
  • Can mix and match
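
The mix-and-match idea can be sketched as a small generator for the LOCAL_CONFIG_FILE line. This is only an illustration: the role names, the /etc/condor/local directory, and the file names are hypothetical and do not come from the talk.

```python
from typing import Dict, List

# Hypothetical mapping from machine role to local config fragments.
ROLE_FILES: Dict[str, List[str]] = {
    "central_manager": ["global.config", "cm.config"],
    "schedd": ["global.config", "schedd.config"],
    "execute": ["global.config", "execute.config", "policy.config"],
}


def local_config_line(roles: List[str], config_dir: str = "/etc/condor/local") -> str:
    """Build a LOCAL_CONFIG_FILE line for a machine that holds several roles."""
    files: List[str] = []
    for role in roles:
        for name in ROLE_FILES[role]:
            path = f"{config_dir}/{name}"
            if path not in files:  # keep first occurrence, preserve order
                files.append(path)
    return "LOCAL_CONFIG_FILE = " + ", ".join(files)
```

A machine that is both a submit point and an execute node would call local_config_line(["schedd", "execute"]) and get the shared fragment listed only once.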

7
Dealing with a new config
  • Diff base config with your config
  • Understand new items
  • Documented in manual version-history
  • Existing ones rarely change
  • Usually capacity changes
  • Almost always, overwriting base file works
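
Diffing the old and new base configs can be sketched roughly as below. The parser is deliberately simplified (no macro expansion, no line continuations), and the function names are made up for this example.

```python
import re
from typing import Dict, Tuple

# Matches simple "NAME = value" lines (value may be empty).
_ASSIGN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*)$")


def parse_config(text: str) -> Dict[str, str]:
    """Collect NAME = value settings, skipping comments and blank lines.

    Simplified on purpose: no macro expansion, no line continuations.
    """
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = _ASSIGN.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def compare_configs(
    old_text: str, new_text: str
) -> Tuple[Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Return (settings only in the new file, settings whose value changed)."""
    old, new = parse_config(old_text), parse_config(new_text)
    added = {k: v for k, v in new.items() if k not in old}
    changed = {k: (old[k], new[k]) for k in new if k in old and old[k] != new[k]}
    return added, changed
```

The "added" result is the list of new items to look up in the version history; the "changed" result shows defaults that moved between releases.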

8
Managing config files
  • Centralized management is key
  • Cfengine, rsync, NFS (!), etc.

9
Testing new versions
10
Compatibility Guarantees
  • No guarantees
  • But we try very hard!
  • Both forward and backward
  • Especially within one machine
  • Federation techniques require this

11
Incremental testing!
  • Three basic components of Condor
  • Central Manager
  • Submit points
  • Execute machines
  • Test each independently

12
Testing Central Manager
  • Take advantage of statelessness
  • Condor HAD can help out here
  • If it breaks, existing jobs keep running

13
Testing schedds
  • Adding a new test schedd is easy
  • Real test jobs are useful too, not just sleep jobs
  • The schedd can be a bottleneck
  • Probably the only place you need to check CPU
    performance

14
Testing startds
  • Easy to test a few at once
  • Be careful when running standard universe jobs
  • Glidein can be very helpful
  • But beware of root-specific issues
  • Admin slots are helpful

15
Now that we've tested
  • Always be undo-able!
  • (never overwrite files)
  • Rely on the master restarting daemons when stat shows the binaries changed

16
Big bang approach
  • What we do at CS
  • Just change a symlink to the binaries
  • Master does the rest
  • Can be a big hit on shared filesystems
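
A symlink switch that stays undo-able might look like the sketch below. The function name and directory layout are assumptions for illustration; the talk only says that a symlink to the binaries is changed.

```python
import os


def switch_release(link_path: str, new_target: str) -> str:
    """Repoint link_path at new_target and return the previous target.

    The new link is created under a temporary name and renamed over the
    old one, so the link always points at one complete install tree.
    """
    previous = os.readlink(link_path) if os.path.islink(link_path) else ""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # atomic rename on POSIX filesystems
    return previous
```

Because the old install tree is never overwritten, rolling back is just another call with the returned previous target.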

17
Incremental restart
  • First, restart CM
  • No jobs lost
  • Second, reboot schedd
  • If restart happens in 20 minutes, jobs keep
    running
  • What about the startds?
  • Might be OK for standard uni
  • Work on this coming soon

18
Standard Universe
  • More sensitive to backward compatibility
  • CheckpointPlatform clarifications
  • condor_qedit -constraint 'LastCheckpointPlatform
    =?= "LINUX INTEL 2.6.x normal"'
    LastCheckpointPlatform '"LINUX INTEL 2.6.x normal
    0xffffe000"'
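
The nested quoting in that command is easy to get wrong, so here is a hedged sketch that builds the argument list programmatically. It assumes the comparison operator in the constraint is the ClassAd `=?=`; the helper names are invented for this example, and nothing here actually runs condor_qedit.

```python
import shlex
from typing import List


def qedit_argv(old_platform: str, new_platform: str) -> List[str]:
    """Build the condor_qedit argument list for retagging checkpoints.

    ClassAd string literals need their own double quotes, which is why
    the values are wrapped in quotes inside each argument.
    """
    constraint = f'LastCheckpointPlatform =?= "{old_platform}"'
    return [
        "condor_qedit",
        "-constraint", constraint,
        "LastCheckpointPlatform", f'"{new_platform}"',
    ]


def as_shell(argv: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Passing the list to subprocess without a shell avoids the outer layer of quoting entirely; as_shell is only for printing the command for review first.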

19
Draining old standard universe jobs
  • Keep a few old startds around
  • To finish old standard universe jobs
  • Set START to JobUniverse == 1
  • Or maybe use RANK
  • Only on the old platforms

20
When to upgrade?
  • Zeroth law of software engineering
  • Development series actually pretty stable
  • We'll let you know about security issues
  • Probably don't need every minor version
  • Don't be more than one major stable version behind

21
In summary
  • Keep config files under control
  • Test each component in isolation
  • Be aware of standard universe issues

22
Any questions?
  • Thank you!