Managing Network Resources in Condor

Provided by: jimb184
Learn more at: http://www.cs.wisc.edu
1
Managing Network Resources in Condor
  • Jim Basney
  • Computer Sciences Department
  • University of Wisconsin-Madison
  • jbasney_at_cs.wisc.edu

2
Why is the Network Important?
  • Increase in physical memory per CPU
  • Larger checkpoints
  • Increase in size of Condor pools
  • 700 CPUs in our local pool
  • Increase in remote execution across WAN
  • WAN pools (INFN)
  • Flocking: UW, NCSA, UNM, INFN
  • Remote Submitters: Personal Condor

3
Types of Network Usage
  • Placement
  • Periodic Checkpoints
  • Preemption
  • Remote I/O

4
Network Management Goals
  • Provide Administrative Control
  • HTC applications must co-exist with other network
    users
  • Improve Application Efficiency (Goodput)

5
Monitoring Network Usage
  • Configure Network Routing Info
  • Monitor Network Usage Per User and Subnet
  • Checkpoint and Executable Transfers
  • Remote System Calls
  • CondorView Visualization

6
Network and CPU Co-Allocation
  • For each Subnet, configure
  • Available capacity
  • Allocation window
  • Job Placement requires capacity for
  • Checkpoint and Executable Transfer
  • Remote I/O (estimated)
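
The co-allocation rule above can be sketched as a simple admission check. This is a minimal illustration, not Condor's implementation: the `Subnet` class, its field names, and the units (megabits per second, megabytes, seconds) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Subnet:
    """Hypothetical per-subnet configuration: capacity and allocation window."""
    capacity_mbps: float       # bandwidth made available on this subnet (megabits/s)
    window_s: float            # length of the allocation window (seconds)
    reserved_mb: float = 0.0   # megabytes already promised in the current window

    def budget_mb(self) -> float:
        # Total megabytes that fit in one window at the configured capacity.
        return self.capacity_mbps / 8.0 * self.window_s

    def try_place(self, ckpt_mb: float, exe_mb: float, est_remote_io_mb: float) -> bool:
        # A placement needs room for the checkpoint and executable transfer
        # plus the estimated remote I/O.
        need = ckpt_mb + exe_mb + est_remote_io_mb
        if self.reserved_mb + need > self.budget_mb():
            return False           # not enough network capacity: do not place the job
        self.reserved_mb += need   # reserve the capacity together with the CPU
        return True


subnet = Subnet(capacity_mbps=8.0, window_s=60.0)   # 60 MB per window
first = subnet.try_place(ckpt_mb=40.0, exe_mb=5.0, est_remote_io_mb=5.0)
second = subnet.try_place(ckpt_mb=10.0, exe_mb=1.0, est_remote_io_mb=0.0)
```

In this sketch the first placement fits in the window and the second is refused, so the CPU is only granted when the network can carry the job's transfers.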

[Figure: diagram with labels T and TW]
7
  • Goodput = Allocation - Network Overhead

[Figure: diagram with labels X, Placement, Periodic Checkpoint, Remote I/O, Preemption Checkpoint]
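
As a worked example of the goodput relation on this slide, the sketch below subtracts the four kinds of network overhead named in the deck from the length of an allocation. The function name and the sample numbers are illustrative assumptions, not measurements.

```python
def goodput_s(allocation_s, placement_s, periodic_ckpt_s, remote_io_s, preempt_ckpt_s):
    """Goodput = allocation time minus time spent on network overhead (all in seconds)."""
    overhead_s = placement_s + periodic_ckpt_s + remote_io_s + preempt_ckpt_s
    return allocation_s - overhead_s


# Hypothetical one-hour allocation.
useful = goodput_s(allocation_s=3600, placement_s=120, periodic_ckpt_s=180,
                   remote_io_s=60, preempt_ckpt_s=120)
fraction = useful / 3600   # share of the allocation that did useful work
```

With these made-up numbers, 480 of the 3600 seconds go to network overhead, leaving 3120 seconds of goodput.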
8
Use Network Efficiently
  • Compressed Checkpoints
  • CPU vs. network resources
  • Incremental Checkpointing
  • Record changes since last checkpoint
  • Buffered Remote I/O (Doug Thain)
  • Latency Hiding
  • Avoid multiple reads/writes of same file data
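
The "CPU vs. network resources" trade-off for compressed checkpoints can be illustrated with a back-of-the-envelope model. It is a sketch under stated assumptions: compression finishes before the transfer starts (no overlap), and the compression ratio and throughput figures are hypothetical.

```python
def plain_time_s(size_mb, bandwidth_mb_s):
    # Time to send the checkpoint uncompressed.
    return size_mb / bandwidth_mb_s


def compressed_time_s(size_mb, ratio, compress_mb_s, bandwidth_mb_s):
    # Time to compress on the CPU, then send the smaller image.
    # ratio = compressed size / original size (0.5 means half the size).
    return size_mb / compress_mb_s + (size_mb * ratio) / bandwidth_mb_s


def should_compress(size_mb, ratio, compress_mb_s, bandwidth_mb_s):
    return (compressed_time_s(size_mb, ratio, compress_mb_s, bandwidth_mb_s)
            < plain_time_s(size_mb, bandwidth_mb_s))


# Slow link (1 MB/s): compression pays off. Fast link (64 MB/s): it does not.
slow_link = should_compress(size_mb=256, ratio=0.5, compress_mb_s=16, bandwidth_mb_s=1)
fast_link = should_compress(size_mb=256, ratio=0.5, compress_mb_s=16, bandwidth_mb_s=64)
```

The same checkpoint is worth compressing over a slow WAN link but not over a fast local network, which is why the choice depends on where the job runs.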

9
Ckpt and Filesystem Domains
  • Provide local access to checkpoint and file
    storage

[Figure: checkpoint and filesystem domains separated by a WAN]
10
Checkpoint Domains
  • Resource offer includes nearest server
  • CkptServer = "ckpt.cs.wisc.edu"
  • Job must remain in checkpoint domain
  • LastCkptServer = "ckpt.cs.wisc.edu"
  • Requirements = My.LastCkptServer == Target.CkptServer

11
Checkpoint Domains (cont.)
  • Job may migrate if no CPUs available in domain
  • Requirements = (My.LastCkptServer == Target.CkptServer) ||
    (CurrentTime - My.LastPreemptTime > 86400)
  • Rank = My.LastCkptServer == Target.CkptServer
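
The matching behaviour described on the two checkpoint-domain slides can be mimicked in a few lines. This is only a sketch of the logic, not the ClassAd matchmaker: it assumes the Requirements expression combines the same-server test and the 86400-second test with a logical OR, it models ads as plain dictionaries, and the host `ckpt.example.org` is made up.

```python
DAY_S = 86400  # one day, the threshold used on the slide


def requirements(job, machine, now):
    # Stay in the checkpoint domain, unless the job was preempted over a day ago.
    same_domain = job["LastCkptServer"] == machine["CkptServer"]
    waited_long_enough = now - job["LastPreemptTime"] > DAY_S
    return same_domain or waited_long_enough


def rank(job, machine):
    # Prefer machines in the job's own checkpoint domain.
    return 1 if job["LastCkptServer"] == machine["CkptServer"] else 0


def pick_machine(job, machines, now):
    candidates = [m for m in machines if requirements(job, m, now)]
    return max(candidates, key=lambda m: rank(job, m), default=None)


job = {"LastCkptServer": "ckpt.cs.wisc.edu", "LastPreemptTime": 1000}
local = {"Name": "local", "CkptServer": "ckpt.cs.wisc.edu"}
remote = {"Name": "remote", "CkptServer": "ckpt.example.org"}  # hypothetical host
```

Shortly after preemption only `local` matches; once more than a day has passed, `remote` becomes acceptable, but `local` still wins on rank when both are offered.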

12
Filesystem Domains
  • Resource offer includes filesystem domain
  • FileSystemDomain = "cs.wisc.edu"
  • Job runs where input data is staged
  • Requirements = Target.FilesystemDomain == "cs.wisc.edu"

13
Filesystem Domains (cont.)
  • Resource offer may include staged datasets
  • HasDataSet174 = True
  • Job runs where dataset is staged
  • Requirements = Target.HasDataSet174

14
Co-Allocation Revisited
  • Network-Aware CPU Requests
  • Requirements = CPUBW > 8.0 && RSCBW > 4.0
  • Rank = RestartBW
  • Rank = 0 - RSCHops
  • Time-based capacity specification
  • Limit WAN bandwidth used during work hours
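
A rough sketch of both ideas on this slide follows. It assumes the two bandwidth conditions are joined with a logical AND, treats the two Rank lines as alternative preferences, and models offers as dictionaries; the attribute units are not stated in the deck, and the work-hour boundaries and capacity values are invented for the example.

```python
def meets_requirements(offer):
    # Network-aware request: both bandwidth attributes must exceed the thresholds.
    return offer["CPUBW"] > 8.0 and offer["RSCBW"] > 4.0


def best_offer(offers, rank):
    # Keep offers that satisfy the requirements, then take the highest-ranked one.
    candidates = [o for o in offers if meets_requirements(o)]
    return max(candidates, key=rank, default=None)


def by_restart_bw(offer):
    return offer["RestartBW"]      # Rank = RestartBW


def by_fewest_hops(offer):
    return 0 - offer["RSCHops"]    # Rank = 0 - RSCHops


def wan_capacity_mbps(hour, work_start=8, work_end=17, work_cap=2.0, off_cap=10.0):
    # Time-based capacity: a lower WAN limit during (hypothetical) work hours.
    return work_cap if work_start <= hour < work_end else off_cap


offers = [
    {"Name": "a", "CPUBW": 10.0, "RSCBW": 5.0, "RestartBW": 20.0, "RSCHops": 3},
    {"Name": "b", "CPUBW": 12.0, "RSCBW": 6.0, "RestartBW": 15.0, "RSCHops": 1},
    {"Name": "c", "CPUBW": 6.0, "RSCBW": 9.0, "RestartBW": 99.0, "RSCHops": 0},
]
```

Offer `c` is filtered out by the requirements regardless of rank; the two rank functions then choose `a` and `b` respectively.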

15
Scheduling Preemption Ckpts
  • Time to checkpoint is limited when preempted
  • Preempting user doesn't want to wait
  • Simultaneous preemptions
  • Heavy network demand
  • Slow checkpointing
  • Missed deadlines / Failed checkpoints

16
Scheduling Preemption Ckpts
  • Many preemption events may be anticipated
  • Start of class for lab workstation
  • Start of work hours for office workstation
  • System maintenance
  • Schedule preemption checkpoints in advance of
    reservations
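
One way to picture scheduling checkpoints ahead of reservations is to work backwards from each known deadline. The sketch below is an assumption-laden illustration, not the deck's algorithm: it supposes one shared link that carries a single checkpoint at a time at a fixed bandwidth, and the job names and sizes are invented.

```python
def schedule_checkpoints(jobs, bandwidth_mb_s):
    """Plan latest-possible checkpoint transfers that finish before each deadline.

    jobs: list of (name, deadline_s, ckpt_mb) tuples.
    Returns {name: (start_s, end_s)}. A start time earlier than "now"
    would mean the plan is infeasible; the caller has to check that.
    """
    plan = {}
    next_end = float("inf")   # the link is free after the last planned transfer
    # Walk deadlines from latest to earliest, packing transfers back to back.
    for name, deadline_s, ckpt_mb in sorted(jobs, key=lambda j: j[1], reverse=True):
        end = min(deadline_s, next_end)
        start = end - ckpt_mb / bandwidth_mb_s
        plan[name] = (start, end)
        next_end = start
    return plan


# Two lab machines reserved for a class at t=3600 s, one office machine at t=3000 s.
plan = schedule_checkpoints(
    [("lab1", 3600, 600), ("lab2", 3600, 300), ("office", 3000, 1000)],
    bandwidth_mb_s=10,
)
```

Because the two lab checkpoints cannot share the link, the second one is moved earlier so that both complete before the class starts.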

17
Scheduling Preemption Ckpts
[Figure: diagram with labels R, Subnet A, Subnet B, Subnet C]
18
Scheduling Periodic Ckpts
  • Goals
  • Complete checkpoint quickly
  • Don't interfere with more important transfers
  • Perform when network is otherwise idle
  • Avoid synchronized periodic checkpoints
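
A simple way to avoid synchronized periodic checkpoints is to give each job a different offset within the checkpoint period. The sketch below spreads jobs evenly; it is one possible approach chosen for illustration (random jitter would be another), and the function name is hypothetical.

```python
def staggered_offsets(job_ids, period_s):
    """Assign each job a start offset so checkpoints are spread across the period."""
    ordered = sorted(job_ids)   # deterministic order, so offsets are reproducible
    n = len(ordered)
    return {job: i * period_s / n for i, job in enumerate(ordered)}


# Three jobs checkpointing hourly start 20 minutes apart instead of all at once.
offsets = staggered_offsets(["c", "a", "b"], period_s=3600)
```

Each job still checkpoints once per period, but the transfers no longer hit the network at the same moment.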

19
Network Scheduling
  • Fit jobs to network topology
  • Place network-intensive jobs on fast networks
  • Place jobs near their data
  • Locate best checkpoint and file servers at
    run-time
  • Pre-fetch and store-behind application data when
    network capacity is available

20
Network Scheduling (cont.)
  • Balance checkpoint costs with expected allocation
    time
  • Preempt or migrate heavy network users
  • Backfill pool with light network users to fully
    utilize CPUs

21
Summary
  • Making the network a Condor-managed resource
  • Provide administrative control over HTC network
    usage
  • Improve execution efficiency by co-scheduling
    network and CPU resources