Title: Condor, CondorG, CondorC and Stork
1Condor, (Condor-G), Condor-C and Stork
2Resource Allocation vs. Work Delegation
4Resource Allocation
- A limited assignment of the ownership of a resource
- Owner is charged for allocation regardless of actual consumption
- Owner can allocate the resource to others
- Owner has the right and means to revoke an allocation
- Allocation is governed by an agreement between the consumer and the owner
- Allocation is always a lease
- Trees of allocations can be formed
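The properties above (lease, revocation, re-allocation to others, trees) can be sketched in a few lines of Python. This is only an illustration: the `Allocation` class, its field names, and the rule that a sub-lease cannot outlive its parent are assumptions made for the example, not part of Condor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Allocation:
    """Hypothetical model of one leased allocation of a resource."""
    holder: str
    expires_at: float                      # end of the lease, in seconds
    revoked: bool = False
    children: List["Allocation"] = field(default_factory=list)

    def sub_allocate(self, holder: str, expires_at: float) -> "Allocation":
        """Allocate the resource to someone else, forming a tree."""
        if self.revoked:
            raise ValueError("cannot sub-allocate a revoked allocation")
        # Assumption: a sub-lease may not outlive the lease it comes from.
        child = Allocation(holder, min(expires_at, self.expires_at))
        self.children.append(child)
        return child

    def revoke(self) -> None:
        """Revoke this allocation and everything allocated beneath it."""
        self.revoked = True
        for child in self.children:
            child.revoke()

    def is_valid(self, now: float) -> bool:
        """An allocation holds only while unrevoked and within its lease."""
        return not self.revoked and now < self.expires_at


site = Allocation("site", expires_at=100.0)
group = site.sub_allocate("group", expires_at=500.0)   # clamped to 100.0
user = group.sub_allocate("user", expires_at=50.0)
group.revoke()                                         # cascades to user
```

After `group.revoke()`, the allocation held by `site` is still valid while those of `group` and `user` are not, which is the tree behavior the slide describes.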
5
- We present some principles that we believe should apply in any compute resource management system. The first, P1, speaks to the need to avoid resource leaks of all kinds, as might result, for example, from a monitoring system that consumes a nontrivial number of resources.
- P1: It must be possible to monitor and control all resources consumed by a CE, whether for computation or management.
- Our second principle is a corollary of P1.
- P2: A system should incorporate circuit breakers to protect both the compute resource and clients. For example, negotiating with a CE consumes resources. How do we prevent an eager client from turning into a denial of service attack?
Ian Foster and Miron Livny, "Virtualization and Management of Compute Resources: Principles and Architecture", a working document (February 2005)
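One possible reading of the circuit breaker in P2 is a per-client limit on negotiation requests. The sketch below is an assumption-laden illustration, not the mechanism from the working document: the class `NegotiationBreaker`, its parameters, and the policy (trip after too many requests in a time window, then refuse for a cooldown period) are all hypothetical.

```python
class NegotiationBreaker:
    """Hypothetical per-client circuit breaker for negotiation requests."""

    def __init__(self, max_requests: int, window: float, cooldown: float):
        self.max_requests = max_requests   # requests tolerated per window
        self.window = window               # window length, in seconds
        self.cooldown = cooldown           # refusal period after tripping
        self._recent = {}                  # client -> recent request times
        self._open_until = {}              # client -> end of refusal period

    def allow(self, client: str, now: float) -> bool:
        """Return True if the client may negotiate at time `now`."""
        if now < self._open_until.get(client, 0.0):
            return False                   # breaker is open: refuse cheaply
        recent = [t for t in self._recent.get(client, [])
                  if now - t < self.window]
        if len(recent) >= self.max_requests:
            # Too eager: trip the breaker and forget the request history.
            self._open_until[client] = now + self.cooldown
            self._recent[client] = []
            return False
        recent.append(now)
        self._recent[client] = recent
        return True


breaker = NegotiationBreaker(max_requests=3, window=10.0, cooldown=30.0)
eager = [breaker.allow("eager-client", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

The refusal path does no work beyond a dictionary lookup, so a tripped client costs the compute element very little until the cooldown ends.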
6Work Delegation
- A limited assignment of the responsibility to perform the work
- Delegation involves a definition of these responsibilities
- Responsibilities may be further delegated
- Delegation always consumes resources
- Delegation is always a lease
- Tree of delegations can be formed
7(Diagram; components labeled: DAGMan, schedD, shadow, grid manager, GAHP-Globus, Globus, NorduGrid, startD, starter; arrows numbered 1 to 6)
8Some details
9Condor-C to Condor-G
10Condor-G to Condor-C
11
1. Glide-in
2. Submit jobs
12Matchmaking
- In all of these examples, Condor-C (and Condor-G) went to a specific remote schedD (or remote site)
- This is not required: you can do matchmaking!
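As a rough illustration of what matchmaking means here, the sketch below picks a remote schedD for a job instead of naming one in advance. It is a simplified stand-in written in Python, not the ClassAd language or the Condor matchmaker: the dictionary layout, the attribute names, and the host names are all hypothetical.

```python
def match(job, machines):
    """Return the best machine whose requirements and the job's both hold."""
    candidates = [m for m in machines
                  if job["requirements"](m) and m["requirements"](job)]
    if not candidates:
        return None
    # Among mutually acceptable candidates, prefer the highest job rank.
    return max(candidates, key=job["rank"])


job = {
    "owner": "alice",
    "requirements": lambda m: m["arch"] == "x86" and m["memory_mb"] >= 2048,
    "rank": lambda m: m["memory_mb"],
}

machines = [
    {"name": "schedd-a", "arch": "x86", "memory_mb": 1024,
     "requirements": lambda j: True},
    {"name": "schedd-b", "arch": "x86", "memory_mb": 4096,
     "requirements": lambda j: j["owner"] != "alice"},
    {"name": "schedd-c", "arch": "x86", "memory_mb": 3072,
     "requirements": lambda j: True},
]

chosen = match(job, machines)
```

The match is two-sided: `schedd-b` has the most memory but refuses this owner, and `schedd-a` fails the job's own requirement, so the job goes to `schedd-c`.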
13Matchmaking with Condor-C
14What about other types of work and Resources?
- Make data placement jobs first class citizens
- Manage storage space
- Manage FTP connections
- Bridge protocols
- Manage network connections
- Across private networks
- Through firewalls
- Through shared gateways
15Customer requests: Place y = F(x) at L! Master delivers.
16Data Placement
- Management of storage space and bulk data transfers plays a key role in the end-to-end performance of an application.
- Data Placement (DaP) operations must be treated as first class jobs and explicitly expressed in the job flow
- Fabric must provide services to manage storage space and connections
- Data Placement schedulers are needed.
- Data Placement and computing must be coordinated
- Smooth transition of CPU-I/O interleaving across software layers
- Error handling and garbage collection
17A simple DAG for y=F(x)→L
- Allocate (size(x)+size(y)+size(F)) at SE(i)
- Move x from SE(j) to SE(i)
- Place F on CE(k)
- Compute F(x) at CE(k)
- Move y to L
- Release allocated space
Storage Element (SE), Compute Element (CE)
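The six steps listed above can be written down as a small dependency graph and put into a valid execution order. The sketch uses Python's standard `graphlib` module rather than DAGMan, and the step names and dependency edges are an assumed reading of the slide (for instance, that placing F does not have to wait for the space allocation).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps that must finish before it starts.
# These edges are an assumption for illustration, not taken from the slide.
dag = {
    "allocate": set(),                 # allocate space at SE(i)
    "move_x": {"allocate"},            # move x from SE(j) to SE(i)
    "place_F": set(),                  # place F on CE(k)
    "compute": {"move_x", "place_F"},  # compute F(x) at CE(k)
    "move_y": {"compute"},             # move y to L
    "release": {"move_y"},             # release allocated space
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dag).static_order())
```

Three of the six steps are data placement rather than computation, which is why the deck argues for treating such operations as first class jobs.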
18Data Placement Jobs
Computational Jobs
19The Concept
(Diagram; labels: DAG specification, DAGMan, Condor Job Queue, Stork Job Queue)
DAG specification:
DaP A A.submit
DaP B B.submit
Job C C.submit
..
Parent A child B
Parent B child C
Parent C child D, E
..
20Current Status
- Implemented a first version of a framework that unifies the management of compute and data placement activities.
- DaP aware Job Flow (DAGMan).
- Stork: A DaP scheduler
- Parrot: A tool that speaks a variety of distributed I/O services
- NeST: A portable grid enabled storage appliance (lot and connection management)
21(Diagram; components labeled: Planner, MM, SchedD, Stork, StartD, SchedD, RFT, GridFTP)
22Failure Recovery and Efficient Resource Utilization
- Fault tolerance
- Just submit a bunch of data placement jobs, and then go away..
- Control number of concurrent transfers from/to any storage system
- Prevents overloading
- Space allocation and De-allocations
- Make sure space is available
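Limiting the number of concurrent transfers to a storage system can be pictured with a semaphore. The following is a minimal sketch in plain Python threads, not Stork's implementation: `TransferThrottle`, `fake_transfer`, and the file names are hypothetical, and a real scheduler would keep one such limit per storage system.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TransferThrottle:
    """Hypothetical cap on simultaneous transfers to one storage system."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0          # highest concurrency observed so far

    def run(self, transfer, *args):
        with self._slots:      # blocks while all slots are taken
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_transfer(name: str) -> str:
    """Stand-in for a real transfer; it only sleeps briefly."""
    time.sleep(0.01)
    return name


throttle = TransferThrottle(max_concurrent=2)
names = [f"file{i}" for i in range(6)]
with ThreadPoolExecutor(max_workers=8) as pool:
    done = list(pool.map(lambda n: throttle.run(fake_transfer, n), names))
```

Eight worker threads are available, but the throttle never lets more than two transfers run at once, which is the overload protection the slide refers to.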
23Support for Heterogeneity
Protocol translation using Stork memory buffer.
24Support for Heterogeneity
Protocol translation using Stork Disk Cache.
25Flexible Job Representation and Multilevel Policy Support
- Type = Transfer
- Src_Url = srb://ghidorac.sdsc.edu/kosart.condor/x.dat
- Dest_Url = nest://turkey.cs.wisc.edu/kosart/x.dat
- Max_Retry = 10
- Restart_in = 2 hours
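To show how a per-job policy such as Max_Retry might be applied, here is a small Python sketch. It is not Stork code: the dictionary form of the job, the `run_job` helper, and the `flaky_transfer` stand-in are hypothetical, and the sketch assumes Max_Retry bounds the total number of attempts. Restart_in is not modeled.

```python
job = {
    "Type": "Transfer",
    "Src_Url": "srb://ghidorac.sdsc.edu/kosart.condor/x.dat",
    "Dest_Url": "nest://turkey.cs.wisc.edu/kosart/x.dat",
    "Max_Retry": 10,
}


def run_job(job, transfer):
    """Try the transfer up to Max_Retry times; return the winning attempt.

    Returns None if every attempt fails. Treating Max_Retry as the total
    number of attempts is an assumption made for this sketch.
    """
    for attempt in range(1, job["Max_Retry"] + 1):
        try:
            transfer(job["Src_Url"], job["Dest_Url"])
            return attempt
        except OSError:
            continue           # transient failure: try again
    return None


calls = []


def flaky_transfer(src, dest):
    """Stand-in transfer that fails twice, then succeeds."""
    calls.append((src, dest))
    if len(calls) < 3:
        raise OSError("simulated network failure")


attempts_needed = run_job(job, flaky_transfer)
```

Because the policy travels with the job description, each job can carry its own limit without changing the scheduler.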
26Real life Data Pipelines
- Astronomy data processing pipeline
- 3 TB (2611 x 1.1 GB files)
- Joint work with Robert Brunner, Michael Remijan et al. at NCSA
- WCER educational video pipeline
- 6 TB (13 GB files)
- Joint work with Chris Thorn et al. at WCER
27DPOSS Data
- Palomar-Oschin photographic plates used to map one half of the celestial sphere
- Each photographic plate digitized into a single image
- Calibration done by software pipeline at Caltech
- Want to run SExtractor on the images
28NCSA Pipeline
(Diagram; labels: Staging Site @UW, Staging Site @NCSA, Unitree @NCSA, Condor Pool @Starlight, Processing; legend: Input Data flow, Output Data flow)
29NCSA Pipeline
- Moved and processed 3 TB of DPOSS image data in under 6 days
- Most powerful astronomy data processing facility!
- Adapt for other datasets (Petabytes): Quest2, CARMA, NOAO, NRAO, LSST
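For a sense of scale, the figures quoted for this pipeline (2611 files of 1.1 GB, about 3 TB, in under 6 days) imply a modest sustained rate. The short calculation below assumes decimal units (1 TB = 10^12 bytes); since the run took under 6 days, the result is a lower bound on the average rate.

```python
GB = 10**9    # decimal gigabyte (assumption)
TB = 10**12   # decimal terabyte (assumption)

# Dataset size from the file count and per-file size quoted in the deck.
dataset_gb = 2611 * 1.1                 # about 2872 GB, i.e. roughly 3 TB

# Average rate needed to move 3 TB in exactly 6 days.
seconds = 6 * 24 * 3600                 # 518400 seconds
min_rate_mb_s = 3 * TB / seconds / 10**6
```

That works out to a little under 6 MB/s sustained over the whole run, covering staging, processing and archiving end to end.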
30WCER Pipeline
- Need to convert DV videos to MPEG-1, MPEG-2 and MPEG-4
- Each 1 hour video is 13 GB
- Videos accessible through Transana software
- Need to stage the original and processed videos to SDSC
31WCER Pipeline
- First attempt at such large scale distributed video processing
- Decoder problems with large 13 GB files
- Uses bleeding edge technology
32WCER Pipeline
(Diagram; labels: Staging Site @UW, SRB Server @SDSC)
33Current status
- The Stork binaries are included with the condor-6.7.-linux-x86-glibc23 releases. These are at least compatible with RedHat9, FedoraCore and ScientificLinux.
- The list of supported Stork protocols is on http://www.cs.wisc.edu/condor/stork.
- Stork was tested against the following remote servers: GridFTP v2.x, v3.x, SRB v3.1.2, dCache SRM v1.5.2, FTP, HTTP, NeST, Unitree, DiskRouter, Castor, SRM, LBNL SRM, JLAB SRM
34How can we accommodate an unbounded need for computing with an unbounded amount of resources?