Condor-G - Your Window to the Grid - PowerPoint PPT Presentation

Slides: 45

Provided by: Miro172

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

Title: Condor-G - Your Window to the Grid

1
Condor-G - Your Window to the Grid
2
The Condor Project (Established '85)

Distributed systems CS research performed by a
team that faces
software engineering challenges in a
UNIX/Linux/NT environment,
active interaction with users and collaborators,
daily maintenance and support challenges of a
distributed production environment,
and educating and training students.
Funding - NSF, NASA, DoE, DoD, IBM, INTEL,
Microsoft and the UW Graduate School.

3
National Grid Efforts

National Technology Grid - NCSA Alliance
(NSF-PACI)
Information Power Grid (NASA)
Particle Physics Data Grid (DoE)
Grid Physics Network (NSF-ITR)

4
Driving Concepts
5

Since the early days of mankind the primary
motivation for the establishment of communities
has been the idea that by being part of an
organized group the capabilities of an individual
are improved. The great progress in the area of
inter-computer communication led to the
development of means by which stand-alone
processing sub-systems can be integrated into
multi-computer communities.

Miron Livny, "Study of Load Balancing Algorithms
for Decentralized Distributed Processing
Systems," Ph.D. thesis, July 1983.
6
Every Community needs a Matchmaker!
7
Why? Because ...

... someone has to bring together members of the
community who have requests for goods and
services with members who offer them.
Both sides are looking for each other
Both sides have constraints
Both sides have preferences
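
The three points above can be sketched as a two-sided match: each side publishes attributes, a constraint on the other side, and a preference used to rank acceptable partners. This is a minimal illustration only; the `Ad` class, the `match` function, and all attribute names are hypothetical and are not Condor's actual matchmaking interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ad:
    """Hypothetical advertisement: attributes, a constraint, and a preference."""
    name: str
    attrs: dict
    constraint: Callable[[dict], bool]  # must hold for the other side's attributes
    rank: Callable[[dict], float]       # higher value means more preferred


def match(request: Ad, offers: List[Ad]) -> Optional[Ad]:
    """Return the offer the request prefers most among mutually acceptable ones."""
    acceptable = [
        offer for offer in offers
        if request.constraint(offer.attrs) and offer.constraint(request.attrs)
    ]
    if not acceptable:
        return None
    return max(acceptable, key=lambda offer: request.rank(offer.attrs))


# A job that needs at least 128 MB and prefers faster machines.
job = Ad("job", {"owner": "alice"},
         constraint=lambda m: m["memory"] >= 128,
         rank=lambda m: m["kflops"])

machines = [
    Ad("small", {"memory": 64, "kflops": 900}, lambda j: True, lambda j: 0),
    Ad("fast", {"memory": 256, "kflops": 800}, lambda j: True, lambda j: 0),
    Ad("picky", {"memory": 512, "kflops": 999},
       lambda j: j["owner"] == "bob", lambda j: 0),
    Ad("slow", {"memory": 256, "kflops": 300}, lambda j: True, lambda j: 0),
]

best = match(job, machines)
print(best.name if best else "no match")
```

Here "small" fails the job's constraint, "picky" rejects the job, and the job's preference chooses between the two remaining machines.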

8
High Throughput Computing

For many experimental scientists, scientific
progress and quality of research are strongly
linked to computing throughput. In other words,
they are less concerned about instantaneous
computing power. Instead, what matters to them is
the amount of computing they can harness over a
month or a year --- they measure computing power
in units of scenarios per day, wind patterns per
week, instructions sets per month, or crystal
configurations per year.

9
HW is a Commodity

Raw computing power is everywhere - on desk-tops,
shelves, and racks. It is
cheap,
dynamic,
distributively owned,
heterogeneous and
evolving.

10
Master-Worker (MW) computing is common and
Naturally Parallel. It is by no means
Embarrassingly Parallel. Doing it right is by no
means trivial.
11
The Tool
12
Our Answer to High Throughput MW Computing on
commodity resources
13
The Condor System

A High Throughput Computing system that
supports large dynamic MW applications on large
collections of distributively owned resources
developed, maintained and supported by the Condor
Team at the University of Wisconsin - Madison
since '86.
Originally developed for UNIX workstations
Based on matchmaking technology.
Fully integrated NT version is available.
Deployed world-wide by academia and industry.
More than 1300 CPUs at the U of Wisconsin.
Available at www.cs.wisc.edu/condor.

14
Condor CPUs on the UW Campus
15
Some Numbers: UW-CS Pool

Total since 6/98            4,000,000 hours   450 years
Real Users                  1,700,000 hours   260 years
  CS-Optimization             610,000 hours
  CS-Architecture             350,000 hours
  Physics                     245,000 hours
  Statistics                   80,000 hours
  Engine Research Center       38,000 hours
  Math                         90,000 hours
  Civil Engineering            27,000 hours
  Business                        970 hours
External Users                165,000 hours    19 years
  MIT                          76,000 hours
  Cornell                      38,000 hours
  UCSD                         38,000 hours
  CalTech                      18,000 hours

16
I have a job-parallel MW application with 600
workers. How can I benefit from Condor?
17
The Application

Study the behavior of F(x,y,z) for 20 values of
x, 10 values of y and 3 values of z
(20*10*3 = 600)
F takes on the average 3 hours to compute on a
typical workstation (total 1800 hours)
F requires a moderate (128MB) amount of memory
F performs little I/O - (x,y,z) is 15 MB and
F(x,y,z) is 40 MB
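
The sizing above can be checked with a few lines of arithmetic. The sketch below enumerates a stand-in parameter grid (the integer ranges are placeholders, since the slide does not give the actual values of x, y and z) and estimates the cost of running every combination on one workstation.

```python
from itertools import product

# Placeholder parameter values: only the counts (20, 10, 3) come from the slide.
x_values = range(20)
y_values = range(10)
z_values = range(3)

combos = list(product(x_values, y_values, z_values))

hours_per_run = 3                        # average cost of one F(x,y,z) evaluation
total_hours = len(combos) * hours_per_run
days_on_one_machine = total_hours / 24   # running around the clock on one workstation

print(len(combos), total_hours, days_on_one_machine)
```

Run back to back on a single workstation, the 1800 hours come to 75 days, or about two and a half months.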

18
Step I - get organized!

Turn your workstation into a Personal Condor
Write a script that creates 600 input files for
each of the (x,y,z) combinations
Submit a cluster of 600 jobs to your personal
Condor
Write a script that collects the data from the
600 output files
Go on a long vacation (2.5 months)
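
The two scripts mentioned in this step might look like the sketch below. It assumes one directory per job named worker_dir.<n>, with the input in a file called "in" and the result in a file called "out"; the one-line text format of the input file and the placeholder parameter values are hypothetical.

```python
import tempfile
from itertools import product
from pathlib import Path


def create_inputs(base: Path, combos) -> int:
    """Create worker_dir.<n>/in holding one (x, y, z) combination per job."""
    count = 0
    for n, (x, y, z) in enumerate(combos):
        job_dir = base / f"worker_dir.{n}"
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "in").write_text(f"{x} {y} {z}\n")
        count += 1
    return count


def collect_outputs(base: Path, n_jobs: int) -> dict:
    """Read worker_dir.<n>/out for each job; skip jobs that have no output yet."""
    results = {}
    for n in range(n_jobs):
        out_file = base / f"worker_dir.{n}" / "out"
        if out_file.exists():
            results[n] = out_file.read_text().strip()
    return results


# Placeholder grid: only the counts (20, 10, 3) come from the slides.
combos = list(product(range(20), range(10), range(3)))

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    created = create_inputs(base, combos)
    # Stand-in for a finished job; in a real run the worker writes this file.
    (base / "worker_dir.0" / "out").write_text("F = 1.0\n")
    finished = collect_outputs(base, created)
    print(created, len(finished))
```

Skipping jobs without an "out" file lets the collection script be run at any time while the cluster is still in progress.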

19
A Condor Job-Parallel Submit File

executable = worker
requirements = ((OpSys == "Linux2.2") && (Memory >= 128))
rank = KFlops
initialdir = worker_dir.$(Process)
input = in
output = out
error = err
log = log
queue 600

20
Your Personal Condor will ...

... keep an eye on your jobs and will keep you
posted on their progress
... implement your policy on when the jobs can
run on your workstation
... implement your policy on the execution order
of the jobs
... add fault tolerance to your jobs
... keep a log of your job activities

21
(No Transcript)
22
Condor Layers
Application
Application Agent
Customer Agent
Environment Agent
Owner Agent
Local Resource Management
Resource
23
Step II - build your personal Grid

Install Condor on the desk-top machine next door.
Install Condor on the machines in the class room.
Install Condor on the O2K in the basement.
Configure these machines to be part of your
Condor pool.
Go on a shorter vacation ...

24
(No Transcript)
25
Step III - Take advantage of your friends

Get permission from friendly Condor pools to
access their resources
Configure your personal Condor to flock to
these pools
Reconsider your vacation plans ...

26
(No Transcript)
27
Think big. Go to the Grid
28
Condor-G

A Grid enabled version of Condor that uses the
inter-domain services of Globus to bring Grid
resources into the domain of your Personal-Condor
Supports Grid Universe jobs
Uses GSIFTP to move glide-in software
Uses MDS for submit information

29
Condor-glide-in

Enable an application to dynamically turn
allocated grid resources into members of a Condor
pool for the duration of the allocation.
Easy to use on different platforms
Robust
Supports SMPs

30
X509 Certificates

We are in the process of adding X509 based
authentication capabilities to Condor services.
Job submission
Local file access
Access to Condor-glide-in software
Resource authentication

31
GSIFTP

Enable Condor I/O services to use remote GSIFTP
servers.
Move glide-in tar files
Read executables
Move Data from/to data repositories
Access disk caches

32
Grid Universe

Grid Universe jobs submitted to Condor are
transformed into Globus jobs and submitted (via
GlobusRun) to a grid resource.
Use MDS to locate resource
Monitor status of job on remote resource
Report status via Condor services
Rewrite in progress with new Globus library.

33
Step IV - Think big (Grid)!

Get access (account(s) and certificate(s)) to a
Computational Grid
Submit 599 Grid Universe Condor-glide-in jobs
to your personal Condor
Take the rest of the afternoon off ...

34
(No Transcript)
35
Does it work?
36
An example - NUG28

We are pleased to announce the exact solution of
the nug28 quadratic assignment problem (QAP).
This problem was derived from the well known
nug30 problem using the distance matrix from a 4
by 7 grid, and the flow matrix from nug30 with
the last 2 facilities deleted. This is to our
knowledge the largest instance from the nugxx
series ever provably solved to optimality.
The problem was solved using the branch-and-bound
algorithm described in the paper "Solving
quadratic assignment problems using convex
quadratic programming relaxations," N.W. Brixius
and K.M. Anstreicher. The computation was
performed on a pool of workstations using the
Condor high-throughput computing system in a
total wall time of approximately 4 days, 8 hours.
During this time the number of active worker
machines averaged approximately 200. Machines
from UW, UNM and INFN all participated in the
computation.

37
NUG30 Personal Condor

For the run we will be flocking to
-- the main Condor pool at Wisconsin (600
processors)
-- the Condor pool at Georgia Tech (190 Linux
boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12
processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin
2000 (through LSF ) at NCSA.
We will use "hobble_in" to access the Chiba City
Linux cluster and Origin
2000 here at Argonne.

38
It works!!!

Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
From: Jeff Linderoth <linderot_at_mcs.anl.gov>
To: Miron Livny <miron_at_cs.wisc.edu>
Subject: Re: Priority
This has been a great day for metacomputing!
Everything is going wonderfully. We've had over
900 machines (currently around 890), and all the
pieces are working great.
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth <linderot_at_mcs.anl.gov>
Still rolling along. Over three billion nodes in
about 1 day!

39
Up to a Point

Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
From: Jeff Linderoth <linderot_at_mcs.anl.gov>
Hi Gang,
The glory days of metacomputing are over. Our job
just crashed. I watched it happen right before my
very eyes. It was what I was afraid of -- they
just shut down denali, and losing all of those
machines at once caused other connections to time
out -- and the snowball effect had bad
repercussions for the Schedd.

40
Back in Business

Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
From: Jeff Linderoth <linderot_at_mcs.anl.gov>
Hi Gang,
We are back up and running. And, yes, it took me
all afternoon to get it going again. There was a
(brand new) bug in the QAP "read checkpoint"
information that was making the master coredump.
(Only with optimization level -O4). I was nearly
reduced to tears, but with some supportive words
from Jean-Pierre, I made it through.

41
The First 600K seconds
42
The First 35K seconds
43
We made it!!!

Sender: goux_at_dantec.ece.nwu.edu
Subject: Re: Let the festivities begin.
Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9
years of Condor Time. In just seven days !
More stats tomorrow !!! We are off celebrating !
condor rules !
cheers,
JP.
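
The figures quoted in this message imply a rough average pool size. The arithmetic below is a back-of-the-envelope check only, assuming a 365-day year and treating "Condor Time" as accumulated machine time; it is not a statistic reported by the team.

```python
cpu_years = 10.9   # Condor time quoted in the message
wall_days = 7      # elapsed time quoted in the message

# Average number of machines that must be busy at once to accumulate
# that much machine time in that many days (365-day year assumed).
avg_workers = cpu_years * 365 / wall_days

print(round(avg_workers))
```

On these assumptions the run kept somewhere between 550 and 600 machines busy on average.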

44
Do not be picky, be agile!!!
