Title: The Condor Story (why it is worth developing the plot further)
1 The Condor Story (why it is worth developing the plot further)
2 Regardless of what we call IT (distributed computing, eScience, grid, cyberinfrastructure, ...), IT is not easy!
3 Therefore, if we want IT to happen, we MUST join forces and work together.
4 Working Together
- Each of us must act as both a consumer and a provider, and view others in the same way
- We have to know each other
- We have to trust each other
- We have to understand each other
5 The Condor Project (Established 1985)
- Distributed Computing research performed by a team of 40 faculty, full-time staff and students who
  - face software/middleware engineering challenges,
  - are involved in national and international collaborations,
  - interact with users in academia and industry,
  - maintain and support a distributed production environment (more than 2300 CPUs at UW),
  - and educate and train students.
- Funding (~$4.5M annual budget)
  - DoE, NASA, NIH, NSF, EU, INTEL, Micron, Microsoft and the UW Graduate School
6 (No Transcript)
7 [Diagram: Excellence, Support, Functionality, Research]
8 - Since the early days of mankind, the primary
motivation for the establishment of communities
has been the idea that by being part of an
organized group the capabilities of an individual
are improved. The great progress in the area of
inter-computer communication led to the
development of means by which stand-alone
processing sub-systems can be integrated into
multi-computer communities.
Miron Livny, "Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems", Ph.D. thesis, July 1983.
9 Claims for benefits provided by Distributed Processing Systems
- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978.
10 Benefits to Science
- Democratization of Computing: you do not have to be a SUPER person to do SUPER computing. (accessibility)
- Speculative Science: since the resources are there, let's run it and see what we get. (unbounded computing power)
- Function shipping: find the image that has a red car in this 3 TB collection. (computational mobility; see the sketch after this list)
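A toy sketch of the function-shipping idea, in Python: instead of pulling 3 TB of images to the user, the user ships a small predicate to where the data lives, and only the names of the matches travel back. The predicate, the collection, and find_matches are hypothetical stand-ins, and the remote-execution plumbing is omitted.

```python
def has_red_car(image_bytes: bytes) -> bool:
    """Hypothetical detector; a real one would run an image classifier."""
    return b"red_car" in image_bytes  # placeholder logic

def find_matches(collection, predicate):
    """Runs at the storage site: only the names of hits cross the network."""
    return [name for name, data in collection.items() if predicate(data)]

# The 3 TB collection stays put; the small predicate is what moves.
collection = {"img1": b"...red_car...", "img2": b"...blue_truck..."}
print(find_matches(collection, has_red_car))  # ['img1']
```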
11 High Throughput Computing
- For many experimental scientists, scientific
progress and quality of research are strongly
linked to computing throughput. In other words,
they are less concerned about instantaneous
computing power. Instead, what matters to them is
the amount of computing they can harness over a
month or a year --- they measure computing power
in units of scenarios per day, wind patterns per
week, instructions sets per month, or crystal
configurations per year.
12 High Throughput Computing is a 24-7-365 activity
FLOPY ≠ (60 * 60 * 24 * 7 * 52) FLOPS
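To make the inequality concrete, here is a minimal Python sketch: multiplying a machine's peak FLOPS by the seconds in a year gives the naive FLOPY figure, but the computing actually harvested over a year is discounted by how often the machine is available and how efficiently jobs run. The availability and efficiency parameters are illustrative assumptions, not Condor terminology.

```python
# Seconds in a year, as spelled out on the slide:
# 60 s * 60 min * 24 h * 7 days * 52 weeks
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52  # 31,449,600

def naive_flopy(peak_flops: float) -> float:
    """The conversion the slide warns against: peak rate times wall-clock time."""
    return peak_flops * SECONDS_PER_YEAR

def sustained_flopy(peak_flops: float, availability: float, efficiency: float) -> float:
    """Floating-point work actually harvested over a year (illustrative model)."""
    return peak_flops * SECONDS_PER_YEAR * availability * efficiency

# A 1 GFLOPS workstation that is free 60% of the time and runs jobs at
# 80% efficiency delivers less than half the naive figure.
print(f"{naive_flopy(1e9):.3e}")                # 3.145e+16
print(f"{sustained_flopy(1e9, 0.6, 0.8):.3e}")  # 1.510e+16
```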
13 Every community needs a Matchmaker!
(or a Classified section in the newspaper, or an eBay)
14 We use Matchmakers to build Computing Communities out of Commodity Components
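A toy Python sketch of the matchmaking idea: each party publishes an ad of attributes plus a Requirements expression over the other party's ad, and the matchmaker pairs ads whose requirements hold in both directions. The attribute names and lambda-based evaluation are simplifications for illustration, not Condor's actual ClassAd language.

```python
# A job ad and machine ads, each carrying a Requirements test on the other side.
job_ad = {
    "Owner": "alice",
    "ImageSize": 512,  # MB
    "Requirements": lambda other: other["Memory"] >= 512 and other["Arch"] == "X86_64",
}

machine_ads = [
    {"Name": "slot1@node01", "Arch": "X86_64", "Memory": 1024,
     "Requirements": lambda other: other["ImageSize"] <= 2048},
    {"Name": "slot1@node02", "Arch": "PPC", "Memory": 4096,
     "Requirements": lambda other: True},
]

def match(a, b):
    """Symmetric matchmaking: each ad's Requirements must hold against the other."""
    return a["Requirements"](b) and b["Requirements"](a)

for m in machine_ads:
    if match(job_ad, m):
        print("matched:", m["Name"])  # -> matched: slot1@node01
```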
15 CERN '92
16 The '94 Worldwide Condor Flock
[Map: flock sites at Madison, Amsterdam, Delft, Warsaw, Geneva, and Dubna/Berlin, with pool sizes ranging from 3 to 200 machines]
17 The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way
we think about and use computing. This
infrastructure will connect multiple regional and
national computational grids, creating a
universal source of pervasive and dependable
computing power that supports dramatically new
classes of applications. The Grid provides a
clear vision of what computational grids are, why
we need them, who will use them, and how they
will be programmed.
18 - We claim that these mechanisms, although
originally developed in the context of a cluster
of workstations, are also applicable to
computational grids. In addition to the required
flexibility of services in these grids, a very
important concern is that the system be robust
enough to run in production mode continuously
even in the face of component failures.
Miron Livny and Rajesh Raman, "High Throughput Resource Management", in The Grid: Blueprint for a New Computing Infrastructure.
19 - Grid computing is a partnership between
clients and servers. Grid clients have more
responsibilities than traditional clients, and
must be equipped with powerful mechanisms for
dealing with and recovering from failures,
whether they occur in the context of remote
execution, work management, or data output. When
clients are powerful, servers must accommodate
them by using careful protocols.
Douglas Thain and Miron Livny, "Building Reliable Clients and Services", in The Grid: Blueprint for a New Computing Infrastructure, 2nd edition.
20 (No Transcript)
21 [Diagram: Grid and WWW]
22 Being a Master
- Customer deposits task(s) with the master, which is responsible for the following (sketched below):
  - Obtaining resources and/or workers
  - Deploying and managing workers on obtained resources
  - Assigning and delivering work units to obtained/deployed workers
  - Receiving and processing results
  - Notifying the customer
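A minimal Python sketch of this loop, assuming a hypothetical Worker interface with a single execute method (not Condor's actual MW API); obtaining resources and deploying the workers are taken as already done.

```python
from queue import Queue

def run_master(tasks, workers):
    """Hand out work units, collect results, requeue failures, return results."""
    work = Queue()
    for t in tasks:
        work.put(t)
    results = []
    while not work.empty():
        for w in workers:
            if work.empty():
                break
            unit = work.get()
            try:
                results.append(w.execute(unit))  # assign and deliver a work unit
            except Exception:
                work.put(unit)  # recover from a failed worker: requeue the unit
    return results  # notify the customer with the processed results

class EchoWorker:
    """Stand-in worker; real MW workers run on matched remote machines."""
    def execute(self, unit):
        return unit * 2

print(run_master([1, 2, 3], [EchoWorker(), EchoWorker()]))  # [2, 4, 6]
```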
23 Our answer to High Throughput MW (Master-Worker) Computing on commodity resources
24 (No Transcript)
25 The Layers of Condor
[Diagram: layered architecture, with the Matchmaker at the top]
26 [Diagram: layers connecting a PSE or user, local Condor, Condor-G (schedd), flocking, and remote Condor]
27 Cycle Delivery at the Madison campus
28 Yearly Condor usage at UW-CS
[Chart: y-axis from 2,000,000 to 10,000,000]
29 Yearly Condor CPUs at UW
30 (Inter)national science
31 U.S. Trillium Grid Partnership
- Trillium = PPDG + GriPhyN + iVDGL
  - Particle Physics Data Grid: $12M (DOE) (1999-2004)
  - GriPhyN: $12M (NSF) (2000-2005)
  - iVDGL: $14M (NSF) (2001-2006)
- Basic composition (~150 people)
  - PPDG: 4 universities, 6 labs
  - GriPhyN: 12 universities, SDSC, 3 labs
  - iVDGL: 18 universities, SDSC, 4 labs, foreign partners
  - Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO
- Complementarity of projects
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - PPDG: end-to-end Grid services, monitoring, analysis
  - iVDGL: Grid laboratory deployment using VDT
  - Experiments provide frontier challenges
  - Unified entity when collaborating internationally
32 Grid2003: An Operational National Grid
- 28 sites: universities and national labs (including a site in Korea)
- 2800 CPUs, 400-1300 jobs
- Running since October 2003
- Applications in HEP, LIGO, SDSS, Genomics
http://www.ivdgl.org/grid2003
33 Contributions to Grid3
- Condor-G: your window to Grid3 resources
- GRAM 1.5 + GASS Cache
- Directed Acyclic Graph Manager (DAGMan)
- Packaging, distribution and support of the Virtual Data Toolkit (VDT)
- Troubleshooting
- Technical road-map/blueprint
34 Contributions to EDG/EGEE
- Condor-G
- DAGMan
- VDT
- Design of gLite
- Testbed
35 VDT Growth
- VDT 1.0: Globus 2.0b, Condor 6.3.1
- VDT 1.1.3, 1.1.4, 1.1.5: pre-SC 2002
- VDT 1.1.7: switch to Globus 2.2
- VDT 1.1.8: first real use by LCG
- VDT 1.1.11: Grid2003
36 The Build Process
[Diagram: sources from CVS and contributors (VDS, etc.) are built and tested on a build-and-test Condor pool of 40 computers; the resulting binaries are packaged and patched by the VDT into a Pacman cache, RPMs, and GPT source bundles]
37 Tools in the VDT 1.2.0
Components built by NMI
- Condor Group
  - Condor/Condor-G
  - Fault Tolerant Shell
  - ClassAds
- Globus Alliance
  - Job submission (GRAM)
  - Information service (MDS)
  - Data transfer (GridFTP)
  - Replica Location (RLS)
- EDG/LCG
  - Make Gridmap
  - Certificate Revocation List Updater
  - Glue Schema/Info provider
- ISI & UC
  - Chimera & Pegasus
- NCSA
  - MyProxy
  - GSI OpenSSH
  - UberFTP
- LBL
  - PyGlobus
  - Netlogger
- Caltech
  - MonALISA
- VDT
  - VDT System Profiler
  - Configuration software
- Others
  - KX509 (U. Mich.)
  - DRM 1.2
  - Java
  - FBSng job manager
38 Tools in the VDT 1.2.0
Components built by contributors (same component list as above)
39 Tools in the VDT 1.2.0
Components built by VDT (same component list as above)
40 Health
41 Condor at Noregon
> At 10:14 AM 7/15/2004 -0400, xxx wrote:
> Dr. Livny,
> I wanted to update you on our progress with our grid computing project. We have about 300 nodes deployed presently, with the ability to deploy up to 6,000 total nodes whenever we are ready. The project has been getting attention in the local press and has gained the full support of the public school system and generated a lot of excitement in the business community.
- Noregon has entered into a partnership with Targacept Inc. to develop a system to efficiently perform molecular dynamics simulations. Targacept is a privately held pharmaceutical company located in Winston-Salem's Triad Research Park whose efforts are focused on creating drug therapies for neurological, psychiatric, and gastrointestinal diseases.
- Using the Condor grid middleware, Noregon is designing and implementing an ensemble Car-Parrinello simulation tool for Targacept that will allow a simulation to be distributed across a large grid of inexpensive Windows PCs. Simulations can be completed in a fraction of the time without the use of high-performance (expensive) hardware.
42 Electronics
43 Condor at Micron
44 Condor at Micron
- The Chief Officer value proposition
  - Info Week 2004 IT Survey includes Grid questions!
  - Makes our CIO look good by letting him answer yes
  - Micron's 2003 rank: 23rd
- Without Condor we only get about 25% of PC value today
  - Didn't tell our CFO a $1000 PC really costs $4000!
  - Doubling utilization to 50% doubles the CFO's return on capital
  - Micron's goal: 66% monthly average utilization
- Providing a personal supercomputer to every engineer
  - The CTO appreciates the cool factor
  - The CTO really gets it when his engineers say, "I don't know how I would have done that without the Grid"
45 Condor at Micron: Example Value
- 73,606 job hours / 24 / 30 ≈ 103 Solaris boxes (worked out below)
- 103 × $10,000/box ≈ $1,022,306
- And that's just for one application, not considering decreased development time, increased uptime, etc.
- Chances are, if you have Micron memory in your PC, it was processed by Condor!
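The slide's back-of-the-envelope arithmetic, reproduced in Python so the rounding is visible; the $10,000-per-box price is the slide's own figure.

```python
import math

job_hours = 73_606           # Condor job hours delivered for one application
hours_per_month = 24 * 30    # one fully dedicated box for a month
boxes = job_hours / hours_per_month  # ~102.23 machine-months
print(math.ceil(boxes))              # 103 Solaris boxes, rounding up
print(round(boxes * 10_000))         # 1022306, i.e. ~$1,022,306 of hardware
```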
46 Software Engineering
47 Condor at Oracle
- Condor is used within Oracle's Automated Integration Management Environment (AIME) to perform automated build and regression testing of multiple components for Oracle's flagship Database Server product. Each day, nearly 1,000 developers make contributions to the code base of Oracle Database Server. Just the compilation alone of these software modules would take over 11 hours on a capable workstation. But in addition to building, AIME must control repository labelling/tagging, configuration publishing, and last but certainly not least, regression testing. Oracle is very serious about the stability and correctness of their products. Therefore, the AIME daily regression test suite currently covers 90,000 testable items divided into over 700 test packages. The entire process must complete within 12 hours to keep development moving forward. About five years ago, Oracle selected Condor as the resource manager underneath AIME because they liked the maturity of Condor's core components. In total, 3000 CPUs at Oracle are managed by Condor today.
48 GRIDS Center: Enabling Collaborative Science (Grid Research, Integration, Development and Support)
49 Procedures, Tools and Facilities
- Build: generate executable versions of a component
- Package: integrate executables into a distribution
- Test: verify the functionality of
  - a component
  - a set of components
  - a distribution
50 Build
- Reproducibility: build the version we released 2 years ago!
  - Well-managed source repository
  - Know your externals and keep them around
- Portability: build the component on build17.nmi.wisc.edu!
  - No dependencies on local capabilities
  - Understand your hardware requirements
- Manageability: run the build daily and email me the outcome
51 [Diagram: per-component build DAG: fetch the component, move source files to the build sites, build the component at each site, retrieve executables from each build site, then report outcome and clean up]
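A sketch of that per-component workflow as a dependency graph of the kind DAGMan manages, with two hypothetical build sites; the node names are illustrative, and a real NMI run would express this as DAGMan job descriptions rather than a Python dict.

```python
# Each node maps to the nodes it depends on (must run after).
dag = {
    "fetch": [],
    "move_src_siteA": ["fetch"],
    "move_src_siteB": ["fetch"],
    "build_siteA": ["move_src_siteA"],
    "build_siteB": ["move_src_siteB"],
    "retrieve_siteA": ["build_siteA"],
    "retrieve_siteB": ["build_siteB"],
    "report_cleanup": ["retrieve_siteA", "retrieve_siteB"],
}

def topo_order(dag):
    """A run order that respects dependencies (Kahn's algorithm)."""
    pending = {node: set(deps) for node, deps in dag.items()}
    order = []
    while pending:
        ready = sorted(n for n, deps in pending.items() if not deps)
        order.extend(ready)
        for n in ready:
            del pending[n]
        for deps in pending.values():
            deps.difference_update(ready)
    return order

print(topo_order(dag))
# ['fetch', 'move_src_siteA', 'move_src_siteB', 'build_siteA', ...]
```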
52 Goals of the Build Facility
- Design, develop and deploy a build system (HW and software) capable of performing daily builds of a suite of middleware packages on a heterogeneous (HW, OS, libraries, ...) collection of platforms
- Dependable
- Traceable
- Manageable
- Portable
- Extensible
- Schedulable
53 Using our own technologies
- Using GRIDS technologies to automate the build, deploy, and test cycle:
  - Condor: schedule build and testing tasks
  - DAGMan: manage build and testing workflow
  - GridFTP: copy/move files
  - GSI-OpenSSH: remote login, start/stop services, etc.
- Constructed and manage a dedicated heterogeneous and distributed facility
54 NMI Build facility
- Build resources:
  - nmi-aix.cs.wisc.edu
  - nmi-hpux.cs.wisc.edu
  - nmi-irix.cs.wisc.edu
  - nmi-rh72-alpha.cs.wisc.edu
  - nmi-redhat72-ia64.cs.wisc.edu
  - nmi-sles8-ia64.cs.wisc.edu
  - nmi-redhat72-build.cs.wisc.edu
  - nmi-redhat72-dev.cs.wisc.edu
  - nmi-redhat80-ia32.cs.wisc.edu
  - nmi-redhat9-ia32.cs.wisc.edu
  - nmi-test-1.cs.wisc.edu (rh9 x86)
  - vger.cs.wisc.edu (production system, rh73 x86)
  - nmi-dux40f.cs.wisc.edu
  - nmi-tru64.cs.wisc.edu
  - nmi-macosx.local
  - nmi-solaris6.cs.wisc.edu
  - nmi-solaris7.cs.wisc.edu
[Diagram: web interface, database, build manager, build generator, email notification]
55 The VDT operation
[Diagram: the same build/test/package flow as slide 36, with a build-and-test Condor pool of 37 computers]
56 Test
- Reproducibility: run last year's test harness on last week's build! (a sketch follows below)
  - Separation between build and test processes
  - Well-managed repository of test harnesses
  - Know your externals and keep them around
- Portability: run the test harness of component A on test17.nmi.wisc.edu!
  - Automatic install and de-install of the component
  - No dependencies on local capabilities
  - Understand your hardware requirements
- Manageability: run the test suite daily and email me the outcome
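A Python sketch of the discipline this slide asks for, assuming a hypothetical run-tests.sh entry point shipped inside the build tarball (not NMI's actual tooling): install into a scratch area, run a pinned harness version, and always de-install, success or failure.

```python
import shutil
import subprocess
import tempfile

def run_component_test(build_tarball: str, harness_version: str) -> int:
    """Install a named build in a sandbox, run a pinned harness, clean up."""
    sandbox = tempfile.mkdtemp(prefix="nmi-test-")
    try:
        # Automatic install: unpack into a scratch area so the test never
        # depends on what happens to be installed on the machine.
        subprocess.run(["tar", "-xzf", build_tarball, "-C", sandbox], check=True)
        # Pinned harness version: last year's harness can be replayed
        # against last week's build.
        result = subprocess.run(["./run-tests.sh", "--harness", harness_version],
                                cwd=sandbox)
        return result.returncode
    finally:
        shutil.rmtree(sandbox)  # automatic de-install, success or failure
```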
57 Testing Tools
- Current focus on component testing
- Developed scripts and procedures to verify deployment and very basic operations
- Multi-component, multi-version, multi-platform test harness and procedures
- Testing as a bottom-feeder activity
- Short- and long-term testing cycles
58 Movies
59 C.O.R.E. Digital Pictures
Film credits include X-Men, X-Men II, The Time Machine, Blade, Blade II, and The Nutty Professor II.
- There has been a lot of buzz in the industry about something big going on here at C.O.R.E. We're really, really, really pleased to make the following announcement: Yes, it's true. C.O.R.E. Digital Pictures has spawned a new division: C.O.R.E. Feature Animation. We're in production on a CG animated feature film being directed by Steve "Spaz" Williams. The script is penned by the same writers who brought you There's Something About Mary, Ed Decter and John Strauss.
60 How can we accommodate an unbounded need for computing with an unbounded amount of resources?
61 GCB
DAGMan
HawkEye
Parrot
Condor-G
Stork
BirdBath
NeST
Chirp
Condor-C