Condor-G - Your Window to the Grid

Slides: 45
Provided by: Miro172
1
Condor-G - Your Window to the Grid
2
The Condor Project (Established '85)
  • Distributed systems CS research performed by a
    team that faces
      • software engineering challenges in a
        UNIX/Linux/NT environment,
      • active interaction with users and collaborators,
      • daily maintenance and support challenges of a
        distributed production environment,
      • and educating and training students.
  • Funding - NSF, NASA, DoE, DoD, IBM, INTEL,
    Microsoft and the UW Graduate School

3
National Grid Efforts
  • National Technology Grid - NCSA Alliance
    (NSF-PACI)
  • Information Power Grid (NASA)
  • Particle Physics Data Grid (DoE)
  • Grid Physics Network (NSF-ITR)

4
Driving Concepts
5
  • Since the early days of mankind the primary
    motivation for the establishment of communities
    has been the idea that by being part of an
    organized group the capabilities of an individual
    are improved. The great progress in the area of
    inter-computer communication led to the
    development of means by which stand-alone
    processing sub-systems can be integrated into
    multi-computer communities.

Miron Livny, "Study of Load Balancing Algorithms
for Decentralized Distributed Processing
Systems," Ph.D. thesis, July 1983.
6
Every Community needs a Matchmaker!
7
Why? Because ...
  • ... someone has to bring together members of the
    community who have requests for goods and
    services with members who offer them.
  • Both sides are looking for each other
  • Both sides have constraints
  • Both sides have preferences
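
The two-sided matching described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Condor's matchmaking code: the attribute names (`memory_mb`, `kflops`, `owner`) and the machine list are hypothetical. Each side carries a constraint and a preference; the matchmaker keeps only the pairs that satisfy both constraints and then picks the one the requester prefers most.

```python
# Illustrative two-sided matchmaking; every name and value here is hypothetical.

def match(request, offers):
    """Return the mutually acceptable offer that the request prefers most."""
    acceptable = [
        offer for offer in offers
        if request["constraint"](offer)      # requester's constraint on the offer
        and offer["constraint"](request)     # provider's constraint on the request
    ]
    if not acceptable:
        return None
    # Requester's preference first; provider's preference breaks ties.
    return max(
        acceptable,
        key=lambda o: (request["preference"](o), o["preference"](request)),
    )


def anyone(job):
    return True


def no_preference(job):
    return 0


job = {
    "owner": "alice",
    "constraint": lambda m: m["memory_mb"] >= 128,   # needs at least 128 MB
    "preference": lambda m: m["kflops"],             # faster is better
}

machines = [
    {"name": "a", "memory_mb": 64, "kflops": 900,
     "constraint": anyone, "preference": no_preference},
    {"name": "b", "memory_mb": 256, "kflops": 500,
     "constraint": anyone, "preference": no_preference},
    {"name": "c", "memory_mb": 512, "kflops": 700,
     "constraint": lambda j: j["owner"] != "alice", "preference": no_preference},
    {"name": "d", "memory_mb": 256, "kflops": 300,
     "constraint": anyone, "preference": no_preference},
]

best = match(job, machines)
print(best["name"])  # b
```

Machine `a` is too small for the job, machine `c` refuses this particular owner, and of the remaining two the job prefers the faster one.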

8
High Throughput Computing
  • For many experimental scientists, scientific
    progress and quality of research are strongly
    linked to computing throughput. In other words,
    they are less concerned about instantaneous
    computing power. Instead, what matters to them is
    the amount of computing they can harness over a
    month or a year --- they measure computing power
    in units of scenarios per day, wind patterns per
    week, instruction sets per month, or crystal
    configurations per year.

9
HW is a Commodity
  • Raw computing power is everywhere - on desk-tops,
    shelves, and racks. It is
  • cheap,
  • dynamic,
  • distributively owned,
  • heterogeneous and
  • evolving.

10
Master-Worker (MW) computing is common and
Naturally Parallel. It is by no means
Embarrassingly Parallel. Doing it right is by no
means trivial.
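
A minimal Master-Worker sketch in Python, assuming local threads stand in for remote workers. The function names (`master`, `square`) and the retry policy are assumptions made for illustration and do not come from Condor or from any MW library. Even in this toy form the master has to track which tasks are outstanding and resubmit the ones a worker failed to finish, which hints at why doing it right is not trivial.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def master(tasks, worker, n_workers=4, max_retries=2):
    """Hand tasks to workers, collect results, and resubmit tasks that fail."""
    results = {}
    pending = list(tasks)
    attempts = {task: 0 for task in pending}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending:
            futures = {pool.submit(worker, task): task for task in pending}
            pending = []
            for future in as_completed(futures):
                task = futures[future]
                try:
                    results[task] = future.result()
                except Exception:
                    attempts[task] += 1
                    if attempts[task] > max_retries:
                        raise
                    pending.append(task)  # the master reschedules lost work
    return results


def square(task):
    """Stand-in for one unit of real computation."""
    return task * task


print(master(range(5), square))
```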
11
The Tool
12
Our Answer to High Throughput MW Computing on
commodity resources
13
The Condor System
  • A High Throughput Computing system that
    supports large dynamic MW applications on large
    collections of distributively owned resources.
    Developed, maintained and supported by the Condor
    Team at the University of Wisconsin - Madison
    since '86.
  • Originally developed for UNIX workstations
  • Based on matchmaking technology.
  • Fully integrated NT version is available.
  • Deployed world-wide by academia and industry.
  • More than 1300 CPUs at the U of Wisconsin.
  • Available at www.cs.wisc.edu/condor.

14
Condor CPUs on the UW Campus
15
Some Numbers: UW-CS Pool
  • Total since 6/98: 4,000,000 hours (450 years)
  • Real Users: 1,700,000 hours (260 years)
      • CS-Optimization: 610,000 hours
      • CS-Architecture: 350,000 hours
      • Physics: 245,000 hours
      • Statistics: 80,000 hours
      • Engine Research Center: 38,000 hours
      • Math: 90,000 hours
      • Civil Engineering: 27,000 hours
      • Business: 970 hours
  • External Users: 165,000 hours (19 years)
      • MIT: 76,000 hours
      • Cornell: 38,000 hours
      • UCSD: 38,000 hours
      • CalTech: 18,000 hours

16
I have a job-parallel MW application with 600
workers. How can I benefit from Condor?
17
The Application
  • Study the behavior of F(x,y,z) for 20 values of
    x, 10 values of y and 3 values of z
    (20 × 10 × 3 = 600)
  • F takes on the average 3 hours to compute on a
    typical workstation (total 1800 hours)
  • F requires a moderate (128MB) amount of memory
  • F performs little I/O - (x,y,z) is 15 MB and
    F(x,y,z) is 40 MB
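
The workload arithmetic on this slide can be checked with a short Python calculation. The variable names are chosen here for illustration, and the data totals assume 1 GB = 1000 MB; the per-run figures are the ones quoted above.

```python
# Figures taken from the slide; variable names are illustrative.
n_x, n_y, n_z = 20, 10, 3
hours_per_run = 3
input_mb, output_mb = 15, 40

runs = n_x * n_y * n_z                        # number of (x, y, z) combinations
cpu_hours = runs * hours_per_run              # total compute on a typical workstation
days_on_one_machine = cpu_hours / 24          # running the jobs back to back
total_input_gb = runs * input_mb / 1000       # assuming 1 GB = 1000 MB
total_output_gb = runs * output_mb / 1000

print(runs, cpu_hours, days_on_one_machine)   # 600 1800 75.0
print(total_input_gb, total_output_gb)        # 9.0 24.0
```

Seventy-five days of back-to-back execution on a single workstation is about 2.5 months.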

18
Step I - get organized!
  • Turn your workstation into a Personal Condor
  • Write a script that creates 600 input files, one
    for each of the (x,y,z) combinations
  • Submit a cluster of 600 jobs to your personal
    Condor
  • Write a script that collects the data from the
    600 output files
  • Go on a long vacation (2.5 months)
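
The two scripts mentioned above could look roughly like the following Python sketch. The directory and file names (`worker_dir.<n>`, `in`, `out`) follow the submit file on the next slide; the parameter values and the one-line input format are assumptions, since the slides give only the counts (20, 10, 3).

```python
import itertools
from pathlib import Path

# Hypothetical parameter values; only the counts come from the slides.
X_VALUES = [i * 0.5 for i in range(20)]
Y_VALUES = list(range(10))
Z_VALUES = [1, 10, 100]


def create_inputs(root):
    """Create worker_dir.<n>/in for every (x, y, z) combination; return the count."""
    root = Path(root)
    combos = itertools.product(X_VALUES, Y_VALUES, Z_VALUES)
    count = 0
    for n, (x, y, z) in enumerate(combos):
        job_dir = root / f"worker_dir.{n}"
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "in").write_text(f"{x} {y} {z}\n")  # assumed input format
        count += 1
    return count


def collect_outputs(root):
    """Gather the contents of every worker_dir.<n>/out that exists so far."""
    results = {}
    for out_file in Path(root).glob("worker_dir.*/out"):
        n = int(out_file.parent.name.split(".")[-1])
        results[n] = out_file.read_text().strip()
    return results
```

Calling `create_inputs(".")` before submitting and `collect_outputs(".")` afterwards covers both scripting steps; jobs that have not finished simply do not appear in the collected results.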

19
A Condor Job-Parallel Submit File
  • executable = worker
  • requirements = ((OS == "Linux2.2") &&
    (Memory >= 128))
  • rank = KFlops
  • initialdir = worker_dir.$(process)
  • input = in
  • output = out
  • error = err
  • log = log
  • queue 600
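
In a Condor submit file, `queue 600` creates 600 jobs numbered 0 to 599, and the `$(process)` macro expands to that number in each job, so every job gets its own initial directory. The Python sketch below only mimics that substitution for the `initialdir` line; the helper name `expand` is made up for this illustration.

```python
def expand(template, n_jobs):
    """Substitute the per-job process number, as Condor does for $(process)."""
    return [template.replace("$(process)", str(p)) for p in range(n_jobs)]


dirs = expand("worker_dir.$(process)", 600)
print(dirs[0], dirs[-1], len(dirs))  # worker_dir.0 worker_dir.599 600
```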

20
Your Personal Condor will ...
  • ... keep an eye on your jobs and will keep you
    posted on their progress
  • ... implement your policy on when the jobs can
    run on your workstation
  • ... implement your policy on the execution order
    of the jobs
  • ... add fault tolerance to your jobs
  • ... keep a log of your job activities

22
Condor Layers
Application
Application Agent
Customer Agent
Environment Agent
Owner Agent
Local Resource Management
Resource
23
Step II - build your personal Grid
  • Install Condor on the desk-top machine next door.
  • Install Condor on the machines in the classroom.
  • Install Condor on the O2K in the basement.
  • Configure these machines to be part of your
    Condor pool.
  • Go on a shorter vacation ...

25
Step III - Take advantage of your friends
  • Get permission from friendly Condor pools to
    access their resources
  • Configure your personal Condor to flock to
    these pools
  • Reconsider your vacation plans ...

27
Think big. Go to the Grid
28
Condor-G
  • A Grid-enabled version of Condor that uses the
    inter-domain services of Globus to bring Grid
    resources into the domain of your Personal-Condor
  • Supports Grid Universe jobs
  • Uses GSIFTP to move glide-in software
  • Uses MDS for submit information

29
Condor-glide-in
  • Enable an application to dynamically turn
    allocated grid resources into members of a Condor
    pool for the duration of the allocation.
  • Easy to use on different platforms
  • Robust
  • Supports SMPs

30
X509 Certificates
  • We are in the process of adding X509 based
    authentication capabilities to Condor services.
  • Job submission
  • Local file access
  • Access to Condor-glide-in software
  • Resource authentication

31
GSIFTP
  • Enable Condor I/O services to use remote GSIFTP
    servers.
  • Move glide-in tar files
  • Read executables
  • Move Data from/to data repositories
  • Access disk caches

32
Grid Universe
  • Grid Universe jobs submitted to Condor are
    transformed into Globus jobs and submitted (via
    GlobusRun) to a grid resource.
  • Use MDS to locate resource
  • Monitor status of job on remote resource
  • Report status via Condor services
  • Rewrite in progress with new Globus library.

33
Step IV - Think big (Grid)!
  • Get access (account(s) and certificate(s)) to a
    Computational Grid
  • Submit 599 Grid Universe Condor-glide-in jobs
    to your personal Condor
  • Take the rest of the afternoon off ...

35
Does it work?
36
An example - NUG28
  • We are pleased to announce the exact solution of
    the nug28 quadratic assignment problem (QAP).
    This problem was derived from the well known
    nug30 problem using the distance matrix from a 4
    by 7 grid, and the flow matrix from nug30 with
    the last 2 facilities deleted. This is to our
    knowledge the largest instance from the nugxx
    series ever provably solved to optimality.
  • The problem was solved using the branch-and-bound
    algorithm described in the paper "Solving
    quadratic assignment problems using convex
    quadratic programming relaxations," N.W. Brixius
    and K.M. Anstreicher. The computation was
    performed on a pool of workstations using the
    Condor high-throughput computing system in a
    total wall time of approximately 4 days, 8 hours.
    During this time the number of active worker
    machines averaged approximately 200. Machines
    from UW, UNM and INFN all participated in the
    computation.
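
As a rough check of scale, the wall time and average worker count quoted above imply the following totals. This is a back-of-the-envelope Python estimate, not a figure from the announcement; it assumes the average of 200 machines held for the whole run and a 365-day year.

```python
# Inputs are the approximate figures from the announcement.
wall_hours = 4 * 24 + 8                 # 4 days, 8 hours
average_workers = 200

machine_hours = wall_hours * average_workers
machine_years = machine_hours / (365 * 24)

print(wall_hours, machine_hours, round(machine_years, 1))  # 104 20800 2.4
```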

37
NUG30 Personal Condor
  • For the run we will be flocking to
      • the main Condor pool at Wisconsin (600
        processors)
      • the Condor pool at Georgia Tech (190 Linux
        boxes)
      • the Condor pool at UNM (40 processors)
      • the Condor pool at Columbia (16 processors)
      • the Condor pool at Northwestern (12
        processors)
      • the Condor pool at NCSA (65 processors)
      • the Condor pool at INFN (200 processors)
  • We will be using glide_in to access the Origin
    2000 (through LSF) at NCSA.
  • We will use "hobble_in" to access the Chiba City
    Linux cluster and Origin 2000 here at Argonne.

38
It works!!!
  • Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
    From: Jeff Linderoth <linderot_at_mcs.anl.gov>
    To: Miron Livny <miron_at_cs.wisc.edu>
    Subject: Re: Priority
  • This has been a great day for metacomputing!
    Everything is going wonderfully. We've had over
    900 machines (currently around 890), and all the
    pieces are working great.
  • Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
    From: Jeff Linderoth <linderot_at_mcs.anl.gov>
  • Still rolling along. Over three billion nodes in
    about 1 day!

39
Up to a Point
  • Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
    From: Jeff Linderoth <linderot_at_mcs.anl.gov>
  • Hi Gang,
  • The glory days of metacomputing are over. Our job
    just crashed. I watched it happen right before my
    very eyes. It was what I was afraid of -- they
    just shut down denali, and losing all of those
    machines at once caused other connections to time
    out -- and the snowball effect had bad
    repercussions for the Schedd.

40
Back in Business
  • Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
    From: Jeff Linderoth <linderot_at_mcs.anl.gov>
  • Hi Gang,
  • We are back up and running. And, yes, it took me
    all afternoon to get it going again. There was a
    (brand new) bug in the QAP "read checkpoint"
    information that was making the master coredump.
    (Only with optimization level -O4). I was nearly
    reduced to tears, but with some supportive words
    from Jean-Pierre, I made it through.

41
The First 600K seconds
42
The First 35K seconds
43
We made it!!!
  • Sender: goux_at_dantec.ece.nwu.edu
    Subject: Re: Let the festivities begin.
  • Hi dear Condor Team,
  • you all have been amazing. NUG30 required 10.9
    years of Condor Time. In just seven days !
  • More stats tomorrow !!! We are off celebrating !
  • condor rules !
  • cheers,
  • JP.
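
The two figures in this message give a sense of the average pool size during the run. The Python calculation below is an estimate only; it assumes a 365-day year and that "seven days" means exactly 168 hours of wall time.

```python
# Figures from the message above; the averaging assumptions are ours.
condor_years = 10.9
wall_days = 7

cpu_hours = condor_years * 365 * 24
wall_hours = wall_days * 24
average_machines = cpu_hours / wall_hours

print(round(average_machines))  # 568
```

An average of roughly 570 busy machines is consistent with the earlier report of a peak of over 900.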

44
Do not be picky, be agile!!!