Title: What
1What's new in Condor? What's coming up? Condor Week 2008
2Release Situation
- Stable Series
- Current Condor v7.0.1 (Feb 27th 2008)
- Last Year Condor ver 6.8.4. (Feb 5th 2007)
- Development Series
- Current Condor v7.1.0 (April 1st 2008)
- Last Year Condor ver 6.9.2. (April 10th 2007)
- v6.9 Series: 14 months
4Special Condor Week Edition
6How many cores in one new UW Condor cluster rack?
7New Ports
- RHEL 5 x86 x86_64 with stduniv and glibc 2.5
- Playstation 3
- HPUX 11i Itanium (almost done)
- Cross testing on x86-like platforms
- Debian clipped port
- Out with the old.
- Red Hat Linux 7.x systems on the x86 processor.
- Digital Unix systems on the Alpha processor.
- Yellow Dog Linux 3.0 systems on the PPC processor.
- MacOS 10.3 systems on the PPC processor.
8Big v7.0 Goodies
- Scalability Improvements
- GCB Improvements
- Privilege Separation
- New Quill
- Virtual Machine Universe
9Scalability
16Condor's Privilege Separation
- Apply principle of least privilege to Condor
- No more root / super-user privilege required
- Currently completed on execute side
- Use glexec or Condor's own sudo
- Can still run the old way if you want
17Quill Take Two in v7.x
- Shared databases
- More than just the JobAd, e.g.
- Startd Machine ClassAds
- Negotiator matches
- Run Job User Log information
- More than just PostgreSQL DBMS
- All the details
- http://www.cs.wisc.edu/condor/quill_overview_07-18-2007.pdf
18Disk
19Virtual Machine Universe
- Submit a job that consists of a virtual machine image
- Condor schedules, manages, and monitors the VM job
- Works w/ VMware Server and Xen
- Matchmaking
- Checkpoint/Restart/Migration
- Data Movement
- Plug: BoF Session 1:30pm tomorrow
20What else? GCB Improvements!
23- Improved Scalability: Only use the broker if required!
- Local Host Optimizations
- Bypass GCB if two daemons are talking on the same host
- Local Network Optimizations
- Two hosts on the same private net bypass the broker
- Every network is assigned a unique network name
- Daemons advertise (a) publicly accessible IP, (b) real IP, (c) network name.
- Names match -> use real IP; else use public IP.
- Improved Robustness
- Broker dies -> master finds another broker and restarts.
- When master starts up, it pings a list of brokers and randomly chooses from those that respond.
- Bug fixes
- Improved Logging: now they are helpful and sane.
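The address choice and broker selection described on this slide can be sketched roughly as follows. This is an illustration only, not GCB's actual code: the `DaemonAddress` type, the function names, and the `is_alive` callback (standing in for the ping) are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class DaemonAddress:
    """Hypothetical record of what a daemon advertises."""
    public_ip: str      # (a) publicly accessible IP
    real_ip: str        # (b) real IP on the private network
    network_name: str   # (c) unique name of the network


def choose_target_ip(local: DaemonAddress, remote: DaemonAddress) -> str:
    """Names match -> use the real IP and bypass the broker; else use the public IP."""
    if local.network_name == remote.network_name:
        return remote.real_ip
    return remote.public_ip


def pick_broker(brokers, is_alive, rng=random):
    """Ping every broker (via the assumed is_alive callback) and pick
    randomly among those that respond; None if nobody answers."""
    responsive = [b for b in brokers if is_alive(b)]
    if not responsive:
        return None
    return rng.choice(responsive)
```

For example, two daemons that both advertise the network name `lab-net` would talk over their real IPs, while a daemon on a different network would be reached through its public IP.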
24Process Tracking Guarantee
- Iron-clad tracking of process groups
- Even if running as the job submitter
- Uses supplementary group ids
- Linux only
- Also as a standalone-daemon for OSG
- USE_GID_PROCESS_TRACKING = True
- MIN_TRACKING_GID = 750
- MAX_TRACKING_GID = 757
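One way to picture the GID range: each tracked job borrows one supplementary group id from the configured range and returns it when the job ends. The sketch below assumes the range is inclusive (so 750 to 757 gives eight ids) and that ids are handed out one per job. The `TrackingGidPool` class is hypothetical and is not how Condor itself is implemented.

```python
class TrackingGidPool:
    """Hypothetical pool of supplementary GIDs, one per tracked job."""

    def __init__(self, min_gid: int, max_gid: int):
        if min_gid > max_gid:
            raise ValueError("min_gid must not exceed max_gid")
        # Assumption: both ends of the range are usable.
        self._free = list(range(min_gid, max_gid + 1))
        self._in_use = {}

    def acquire(self, job_id: str):
        """Return a dedicated GID for job_id, or None if the range is exhausted."""
        if not self._free:
            return None
        gid = self._free.pop(0)
        self._in_use[job_id] = gid
        return gid

    def release(self, job_id: str) -> None:
        """Give the job's GID back to the pool once all its processes are gone."""
        self._free.append(self._in_use.pop(job_id))


pool = TrackingGidPool(750, 757)
first = pool.acquire("job0")  # first is 750
```

Under these assumptions the range should be at least as large as the number of jobs that can run on the machine at once.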
25Better Collector Authorization
- New authorization levels to allow different rules for submission vs. execution
- ADVERTISE_STARTD, ADVERTISE_SCHEDD
- New config setting COLLECTOR_REQUIREMENTS: expression must evaluate to true for the Collector to accept the ad.
26- Well-known ports for the trusted daemons
- Use the below ports if launching the condor_master as root; else, pick 3 ports above 1024.
- MASTER_PORT = 890
- SCHEDD_PORT = 891
- STARTD_PORT = 892
- MASTER_ARGS = -p $(MASTER_PORT)
- SCHEDD_ARGS = -p $(SCHEDD_PORT)
- STARTD_ARGS = -p $(STARTD_PORT)
- COLLECTOR_REQUIREMENTS = \
- ( MyType =?= "Machine" && \
- regexp( "<[0-9.]+:$(STARTD_PORT)>" , MyAddress ) ) || \
- ( MyType =?= "Scheduler" && \
- regexp( "<[0-9.]+:$(SCHEDD_PORT)>" , MyAddress ) ) || \
- ( MyType =?= "DaemonMaster" && \
- regexp( "<[0-9.]+:$(MASTER_PORT)>" , MyAddress ) ) || \
- ( MyType =!= "Machine" && MyType =!= "Scheduler" \
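The idea behind the expression above is that an ad claiming to come from a trusted daemon is accepted only if its address shows the well-known (privileged) port for that daemon type. A rough Python rendering of that check follows. The function names are hypothetical, the `<ip:port>` address pattern is an assumption about the garbled regular expression, and ad types beyond the three listed are left undecided because the slide's final clause is cut off.

```python
import re

# Ports taken from the slide; the dictionary itself is only an illustration.
WELL_KNOWN_PORTS = {"DaemonMaster": 890, "Scheduler": 891, "Machine": 892}


def address_uses_port(my_address: str, port: int) -> bool:
    """True if an address of the assumed form "<ip:port>" carries this port."""
    return re.search(r"<[0-9.]+:%d>" % port, my_address) is not None


def trusted_daemon_ad_ok(my_type: str, my_address: str):
    """Check the three daemon types named on the slide.

    Returns None for any other type, since the slide's last clause is truncated.
    """
    port = WELL_KNOWN_PORTS.get(my_type)
    if port is None:
        return None
    return address_uses_port(my_address, port)
```

A startd ad from `<128.105.1.10:892>` would pass, while one from an unprivileged port such as `<128.105.1.10:40123>` would be rejected.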
27Handy New Attributes
- In your machine ad
- TotalTimeBackfillBusy, TotalTimeBackfillIdle, TotalTimeBackfillKilling
- TotalTimeClaimedBusy, TotalTimeClaimedIdle
- TotalTimeClaimedRetiring, TotalTimeClaimedSuspended
- TotalTimeMatchedIdle, TotalTimeOwnerIdle
- TotalTimePreemptingKilling, TotalTimePreemptingVacating, TotalTimeUnclaimedBenchmarking, TotalTimeUnclaimedIdle
- In your job ad
- NumJobStarts
- NumJobReconnects
- NumShadowExceptions
- NumShadowStarts
28And last but not least
- Leases added to COD.
- Simple best-fit algorithm added to dedicated scheduler.
- Can reference resource usage and quota information in preemption policy.
- condor_config_val -dump -v
- Chirp improvements
- Jobs can write messages into the user log
- Can use proc 0 ClassAd as a scratch pad
- Condor shutdown via expressions
- External Awareness
29 and finally
- File Transfer I/O Throttling
- MAX_CONCURRENT_DOWNLOADS and MAX_CONCURRENT_UPLOADS
- More types of jobs can survive across a shutdown/crash of the submit machine
- Such as jobs that stream stdout/err.
- User's job log changes.
- Can have a centralized job log file.
- Get values of any job ad attribute in log.
- Cron like job scheduling (Crondor?)
- Job Router shipped (Dan's talk)
- License Change
- Source code publicly released on web
30 and finally
and before shipping the new stable release
We squashed LOTS of bugs!
32Shiny new bug free Condor v7.0.x stable series!
33Enough already, Todd. Tell me about what is cooking with v7.1.x and beyond.
34Terms of License: Any and all dates in these
slides are relative from a date hereby
unspecified in the event of a likely situation
involving a frequent condition. Viewing, use,
reproduction, display, modification and
redistribution of these slides, with or without
modification, in source and binary forms, is
permitted only after a deposit by said user into
PayPal accounts registered to Todd Tannenbaum.
35Generalizing the Startd/Starter Architecture
- Making the startd more generic with the underlying system.
- How about running without a starter, running w/o a schedd/shadow, pulling jobs, running starter-less jobs that it does not fork/exec, ...
- Lightweight Jobs
- Examples
- Work Fetch: Ref to Derek's Talk
- Blue Heron Project: Ref to Tom, Amanda, and Greg's Talk
36Some Love for Windows
- Jobs can write to the registry
- Condor allocates HKEY_CURRENT_USER.
- Problems w/ the Batch Login approach sessions on Windows Server 2003 fixed (by not using them)
- Interoperability with Samba (as a PDC) has been improved
- Arch ClassAd attribute now reflects the wide range of architectures available to the Windows world; it no longer simply returns INTEL
37Green Computing
- The startd has the ability to place a machine into a low power state. (Standby, Hibernate, Soft-Off, etc.)
- HIBERNATE, HIBERNATE_CHECK_INTERVAL
- If all slots return non-zero, then the machine is powered down; otherwise it continues running.
- Machine ClassAd contains all information required for a client to wake it up
- Condor can wake it up, also a standalone tool.
- This was NOT as easy as it should be.
- This was NOT as easy as it should be.
- Machines in Offline State
- Lots of other uses
- Wake-up on Matchmaking Pressure
- Future Work ?
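The power-down rule stated on this slide (every slot must return non-zero) can be written as a one-line check. The sketch assumes each slot's HIBERNATE evaluation yields an integer and treats a machine with no slot results as "keep running". Both are assumptions, and `should_power_down` is a hypothetical helper, not a Condor function.

```python
def should_power_down(slot_results) -> bool:
    """Power down only if every slot's HIBERNATE evaluation is non-zero.

    Assumption: an empty list of results means the machine keeps running.
    """
    results = list(slot_results)
    return bool(results) and all(r != 0 for r in results)
```

A single slot returning 0 is enough to keep the whole machine running until the next check interval.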
38Plugins
- Think Firefox
- Callouts from Condor daemons on appropriate events
- Plugin could re-implement or modify action (different than a client API)
- Will only build as needed as refactoring happens to add features
- Miron: I don't want your plugs, I want new features!
- Examples: Collector, Accountant, File Transfers, Scheduling Algorithms, ...
39Scheduling in Condor Today
[Diagram labels: CM (x2), schedd (x5)]
- Distributed Ownership
- Settings reflect 3 separate viewpoints
- Pool manager, Resource Owner, Job Submitter
40But some sites want to use Condor like this
[Diagram label: schedd]
- Just one submission point (schedd)
- All resources owned by one entity
- We can do better for these sites.
- Policy configurations are complicated.
- Some useful policies are not present because they are hard to do in a wide-area distributed system.
- Today the dedicated scheduler only supports FIFO and a naive Best Fit algorithm.
41So what to do?
[Diagram label: schedd]
- Give the schedd more scheduling options.
- Examples: why can't the schedd do priority preemption without the matchmaker's help? Or move jobs from slow to fast claimed resources?
- Pluggable scheduler routines.
42DAGMan Improvements
- Automatic running of rescue DAGs (useful for nested DAGs)
- Significantly improved speed of DAG recovery mode
- Assignment of node categories and category throttles
- Added generic node priorities; Depth First Traversal algorithm
43DAGMan Depth First Example
44Category Example
Run < 2
Run < 5
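A category throttle caps how many nodes of one category run at the same time. The sketch below shows the general idea under stated assumptions: limits are treated as a maximum number of simultaneously running nodes per category (the diagram labels do not say whether the bounds are inclusive), and `select_nodes_to_submit` and its arguments are hypothetical names, not DAGMan internals.

```python
from collections import Counter


def select_nodes_to_submit(ready_nodes, running_per_category, category_of, max_jobs):
    """Pick ready nodes in order, skipping any whose category is already at its limit.

    ready_nodes: node names whose parents have finished
    running_per_category: category -> number of nodes currently running
    category_of: node name -> category (nodes without one are never throttled)
    max_jobs: category -> assumed maximum number of running nodes
    """
    running = Counter(running_per_category)
    chosen = []
    for node in ready_nodes:
        category = category_of.get(node)
        limit = max_jobs.get(category)
        if limit is not None and running[category] >= limit:
            continue  # category is full; the node waits for a later pass
        chosen.append(node)
        if category is not None:
            running[category] += 1
    return chosen
```

With limits of 2 and 5 for two categories, a third ready node in the first category simply waits until one of the two running nodes finishes.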
45DAGMan Future Work
- DAG Splicing
- Allowing custom attributes in node ClassAds
- Fixing condor_hold semantics
- Configurable job start rate
- Node iteration
46DAGMan Future Work
- Scalability
- Current: potential about 1 million nodes
- Future: up to 10 million nodes
- Submit files which generate more than one cluster
47EC2 / VM Universe Next Steps: Impregnate Condor into the Image
- When? On Demand. How?
- Job Router, GlideIn Factory,
- File Transfer To/From S3 (Plugin!)
- Options to handle Amazon's looming threat: NAT only
- Overlay Network?
- GCB
- OpenVPN
- Communicate by way of S3?
48Negotiation Performance
- v6.8 -> automatic significant attributes, match caching
- v7.1.0 -> resource request ads
- Simple explanation: Resource request ad = a count plus all significant attributes.
- Inserted into a schedd submitter ad.
- Give me 400 resources like this, and 200 resources like that, etc.
- Matchmaking algorithm remains the same; just how it learns about jobs changes.
- Disabled by default.
- Possibilities, possibilities...
- More robust against unresponsive schedds
- No startd Rank preemption?
- Others?
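The "count plus all significant attributes" idea amounts to grouping jobs that look identical to the matchmaker and sending one summary per group. A minimal sketch, assuming jobs are plain dictionaries; the attribute names and the `Count` key are illustrative, not the attribute names Condor actually uses.

```python
from collections import Counter


def build_resource_requests(jobs, significant_attrs):
    """Collapse jobs with identical significant attributes into one request each.

    Returns a list of dicts: the shared attribute values plus a hypothetical
    "Count" key saying how many resources of that shape are wanted.
    """
    counts = Counter(
        tuple((attr, job.get(attr)) for attr in significant_attrs)
        for job in jobs
    )
    return [{"Count": n, **dict(key)} for key, n in counts.items()]


jobs = (
    [{"Owner": "alice", "Memory": 1024, "Cmd": "sim"}] * 400
    + [{"Owner": "alice", "Memory": 4096, "Cmd": "fit"}] * 200
)
requests = build_resource_requests(jobs, ["Owner", "Memory"])
# requests now holds two entries, one with Count 400 and one with Count 200
```

Six hundred job ads shrink to two request ads here, which is why the negotiator can learn about the queue with far less traffic.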
50And
- The End of the NFS Locking issue
- Avoid redundant copies of the same executable in the Condor spool
- Maybe more?
- The Stamping of a Passport
- End-to-End Security: Ref Ian's Talk
- A web site design from this decade.
51Thank you for being such an awesome audience and
an awesome user community!!!
Jason Stowe, enjoying free bacon at a local pub.
Only in Wisconsin.