Title: High-Throughput Computing on Commodity Systems.
1. High-Throughput Computing on Commodity Systems.
2. The Good News
- Raw computing power is everywhere: on desktops, shelves, racks, and in your pockets. It is
  - Cheap
  - Plentiful
  - Mass-produced
3. The Bad News
- (GFLOPs per year) / (GFLOPS per second)  <<  30,000,000 seconds/year
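As a back-of-the-envelope illustration of this gap (the 1 GFLOPS figure below is assumed for the arithmetic, not taken from the talk): a machine with a peak rate of 1 GFLOPS would deliver

    1 GFLOPS x 30,000,000 seconds/year = 30,000,000 GFLOPs per year

only if it ran flat out, unattended, for every second of the year. The yearly totals that real installations sustain fall far short of that ideal, which is the bad news.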
4. A variation on a chestnut
- What is the throughput of your system?
5. Answer
- The throughput which your system is guaranteed never to exceed!
6. Why?
- A community of commodity computers can be difficult to manage:
  - Dynamic: state and availability change over time.
  - Evolving: new hardware and software is continuously acquired and installed.
  - Heterogeneous: a mix of hardware and software platforms.
  - Distributed ownership: each machine has a different owner with different requirements and preferences.
7. Why?
- Even traditionally static systems (such as professionally managed clusters) suffer the same problems when viewed on a yearly scale:
  - Power failures
  - Hardware failures
  - Software upgrades
  - Load imbalance
  - Network imbalance
8. How do we measure computer performance?
- High-Performance Computing: achieve maximum GFLOPS per second under ideal circumstances.
- High-Throughput Computing: achieve maximum GFLOPs over months or years under whatever conditions prevail.
9. High-Throughput Computing
- Focuses on maximizing
  - simulations run before the paper deadline,
  - crystal lattices per week,
  - reconstructions per week,
  - video frames rendered per year,
  - all without babysitting from the user.
- Cannot depend on ideal circumstances.
10. High-Throughput Computing
- Is achieved by
  - expanding the CPUs available,
  - silently adapting to inevitable changes,
  - robust software.
- Is only marginally affected by
  - MB, MHz, MIPS, FLOPS,
  - robust hardware.
11. Solution: Condor
- Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.
12. Who are we?
13. The Condor Project (Established '85)
- Distributed systems CS research performed by a team that faces
  - software engineering challenges in a UNIX/Linux/NT environment,
  - active interaction with users and collaborators,
  - daily maintenance and support challenges of a distributed production environment,
  - and educating and training students.
- Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
14. Users and collaborators
- Scientists: biochemistry, high-energy physics, computer sciences, genetics, ...
- Engineers: hardware design, software building and testing, animation, ...
- Educators: hardware design tools, distributed systems, networking, ...
15. National Grid Efforts
- National Technology Grid - NCSA Alliance (NSF-PACI)
- Information Power Grid - IPG (NASA)
- Particle Physics Data Grid - PPDG (DoE)
- Grid Physics Network - GriPhyN (NSF-ITR)
16. Condor CPUs on the UW Campus
17. Some Numbers: UW-CS Pool
- 6/98-6/00: 4,000,000 hours (450 years)
- Real Users: 1,700,000 hours (260 years)
  - CS-Optimization: 610,000 hours
  - CS-Architecture: 350,000 hours
  - Physics: 245,000 hours
  - Statistics: 80,000 hours
  - Engine Research Center: 38,000 hours
  - Math: 90,000 hours
  - Civil Engineering: 27,000 hours
  - Business: 970 hours
- External Users: 165,000 hours (19 years)
  - MIT: 76,000 hours
  - Cornell: 38,000 hours
  - UCSD: 38,000 hours
  - CalTech: 18,000 hours
18. Start slow, but think BIG
19. Start slow, but think big!
- 1 machine on your desktop: One Personal Condor
- 100 machines in your department: a Condor Pool
- 1,000 machines in the Grid: Condor-G
20. Start slow, but think big!
- Personal Condor
  - Manage just your machine with Condor: fault tolerance, policy control, logging. Sleep soundly at night.
- Condor Pool
  - Take advantage of your friends and colleagues: share cycles and gain 100x throughput.
- Condor-G
  - Jobs from your pool migrate to other computational facilities around the world: gain 1000x throughput. (Record-breaking results!)
21. Key Condor User Services
- Local control: jobs are stored and managed locally by a personal scheduler.
- Priority scheduling: execution order controlled by a priority ranking assigned by the user.
- Job preemption: re-linked jobs can be checkpointed, suspended, held, and resumed.
- Local execution environment preserved: re-linked jobs can have their I/O redirected to the submission site.
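To make these services concrete, here is a minimal sketch of a Condor submit description file; the program name, file names, and numbers are illustrative, not taken from the talk:

    # Hypothetical submit description file
    universe    = standard            # re-linked job: checkpointing and remote I/O
    executable  = sim                 # illustrative program name
    arguments   = -seed $(Process)
    output      = sim.$(Process).out
    error       = sim.$(Process).err
    log         = sim.log             # Condor records all job activity here
    priority    = 10                  # ordering among this user's own jobs
    queue 100                         # submit 100 instances of the job

Handing this file to condor_submit stores the jobs with the user's personal scheduler, which manages, orders, and logs them locally as described above.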
22. More Condor User Services
- Powerful and flexible means for selecting the execution site (requirements and preferences).
- Logging of job activities.
- Management of large numbers (10K+) of jobs per user.
- Support for jobs with dependencies: DAGMan (Directed Acyclic Graph Manager).
- Support for dynamic MW (PVM and File) applications.
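As an illustration of the dependency support, a minimal DAGMan input file might look like the following; the DAG shape, node names, and submit file names are hypothetical:

    # diamond.dag: B and C run after A completes; D runs after both finish
    JOB A prepare.sub
    JOB B simulate_left.sub
    JOB C simulate_right.sub
    JOB D collect.sub
    PARENT A CHILD B C
    PARENT B C CHILD D

Submitting it with condor_submit_dag diamond.dag lets DAGMan release each job only after its parents have completed successfully.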
23. How does it work?
24. Basic HTC Mechanisms
- Matchmaking: enables requests for services and offers to provide services to find each other (ClassAds).
- Fault tolerance: checkpointing enables preemptive-resume scheduling (go ahead and use a machine as long as it is available!).
- Remote execution: enables transparent access to resources from any machine in the world.
- Asynchronicity: enables management of dynamic (opportunistic) resources.
25. Every Community needs a Matchmaker!
26. Why? Because ...
- ... someone has to bring together community members who have requests for goods and services with members who offer them.
- Both sides are looking for each other.
- Both sides have constraints.
- Both sides have preferences.
27. ClassAd - Properties
- Type        = "Machine"
- Activity    = "Idle"
- KbdIdle     = '00:22:31'
- Disk        = 2.1G      // 2.1 Gigs
- Memory      = 64M       // 64 Megs
- State       = "Unclaimed"
- LoadAverage = 0.042969
- Arch        = "INTEL"
- OpSys       = "SOLARIS251"
28. ClassAd - Policy
- RsrchGrp  = { "raman", "miron", "solomon" }
- Friends   = { "dilbert", "wally" }
- Untrusted = { "rival", "riffraff", "TPHB" }
- Tier = member(RsrchGrp, other.Owner) ? 2
         : ( member(Friends, other.Owner) ? 1 : 0 )
- Requirements = !member(Untrusted, other.Owner)
    && ( Tier == 2 ? True
       : Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' )
       : ( DayTime() < '08:00' || DayTime() > '18:00' ) )
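The machine ad above advertises resources and owner policy; a job submitted to Condor carries an ad of its own, and the matchmaker pairs the two. The job-side sketch below is illustrative (the owner, memory threshold, and Rank expression are assumptions; the attribute names follow the machine ad above):

    Type         = "Job"
    Owner        = "raman"
    Requirements = other.Arch == "INTEL" && other.OpSys == "SOLARIS251"
                   && other.Memory >= 32
    Rank         = other.Memory

A match is made only when each ad's Requirements are satisfied by the other ad, and Rank is used to pick the most preferred of the candidates that qualify.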
29. Advantages of Matchmaking
- Hybrid (centralized/distributed) resource allocation algorithm
- End-to-end verification
- Bilateral specialization
- Weak consistency requirements
- Authentication
- Fault tolerance
- Incremental system evolution
30. Fault-Tolerance
- Condor can checkpoint a program by writing its image to disk.
- If a machine fails, the program may resume from the last checkpoint.
- If a job must vacate a machine, it may resume from where it left off.
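A minimal sketch of how a job is prepared for checkpointing; the program and file names are illustrative, and the exact compiler invocation may differ:

    # Re-link the program against the Condor checkpointing library
    % condor_compile gcc -o myjob myjob.c

    # Submit the re-linked binary to the standard universe, where it can be
    # checkpointed, migrated, and resumed (see the submit file sketch earlier)
    % condor_submit myjob.sub

This re-linking is what the earlier slides mean by re-linked jobs: with the checkpoint library bound into the executable, Condor can write the process image to disk and restart it on another machine.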
31. Remote Execution
- Condor might run your jobs on machines spread around the world; not all of them will have your files.
- Condor provides an adapter: a library which converts your job's I/O operations into remote I/O back to your home machine.
- No matter where your job runs, it sees the same environment.
32. Asynchronicity
- A fact of life in a system of 1000s of machines:
  - Power on/off
  - Lunch breaks
  - Jobs start and finish
- Condor never depends on a fixed configuration; it works with what is available.
33. Does it work?
34. An example - NUG28
- We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well-known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is, to our knowledge, the largest instance from the nugxx series ever provably solved to optimality.
- The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.
35. NUG30 Personal Condor
- For the run we will be flocking to:
  - the main Condor pool at Wisconsin (600 processors)
  - the Condor pool at Georgia Tech (190 Linux boxes)
  - the Condor pool at UNM (40 processors)
  - the Condor pool at Columbia (16 processors)
  - the Condor pool at Northwestern (12 processors)
  - the Condor pool at NCSA (65 processors)
  - the Condor pool at INFN (200 processors)
- We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
- We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
36. It works!!!
- Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
  From: Jeff Linderoth <linderot@mcs.anl.gov>
  To: Miron Livny <miron@cs.wisc.edu>
  Subject: Re: Priority
- This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great.
- Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
  From: Jeff Linderoth <linderot@mcs.anl.gov>
- Still rolling along. Over three billion nodes in about 1 day!
37. Up to a Point
- Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
  From: Jeff Linderoth <linderot@mcs.anl.gov>
- Hi Gang,
  The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.
38. Back in Business
- Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
  From: Jeff Linderoth <linderot@mcs.anl.gov>
- Hi Gang,
  We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master core dump (only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.
39. The First 600K seconds
40. We made it!!!
- Sender: goux@dantec.ece.nwu.edu
  Subject: Re: Let the festivities begin.
- Hi dear Condor Team,
  You all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days!
- More stats tomorrow!!! We are off celebrating!
- Condor rules!
- Cheers,
- JP.
41. Do not be picky, be agile!!!