Title: Fairness in Job Scheduling on CPlant


1
Fairness in Job Scheduling on CPlant
  • Vitus Leung
  • Sandia National Labs
  • Gerald Sabin
  • RNET Technologies, Inc.
  • P. Sadayappan
  • The Ohio State University

Sandia is a multiprogram laboratory operated by
Sandia Corporation, a Lockheed Martin
Company, for the United States Department of
Energy's National Nuclear Security
Administration under contract DE-AC04-94AL85000.
2
Table Of Contents
  • Introduction to Job Scheduling
  • The original C-Plant/Ross scheduler
  • Improving fairness
  • Simulation Environment
  • Results
  • Conclusion
  • Questions

3
Introduction to Job Scheduling
  • Independent parallel jobs
  • Job specifies number of nodes and expected
    runtime
  • Jobs run on a parallel machine with a fixed
    number of nodes (C-Plant/Ross)
  • Examples: PBS, MAUI, ...

4
Introduction to Job Scheduling
  • Primary focus of research in job scheduling has
    been to increase utilization and improve desired
    user metrics
  • Very little research so far has addressed
    fairness in job scheduling
  • CPlant scheduler uses a fair-share measure to
    order jobs in the queue. How fair is the
    scheduler?

5
Assessing Fairness
  • Possible approach: For each job, find the number
    of jobs with higher usage-count that are serviced
    while the job waits
  • Problem: How to account for benign back-filling
    that uses slots in the schedule not usable by this
    job?
  • Proposed approach: Assign a fair-start time to
    each job when it is submitted
  • by generating a non-backfilling, in-order
    schedule based on fairness priority
  • if actual start time does not exceed fair-start
    time, job is considered fairly treated, else
    unfair
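
The fair-start idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the simulator used in the study: the `Job` record, the `fair_start_times` and `is_fairly_treated` names, exact runtimes, and the convention that a lower `priority` value is served first are all hypothetical.

```python
from dataclasses import dataclass
import heapq


@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    runtime: float    # expected runtime in hours
    priority: float   # fairness priority; lower is served first (assumption)


def fair_start_times(total_nodes, running, queue, now=0.0):
    """Build a non-backfilling, in-order schedule and return start times.

    running: list of (end_time, nodes) for jobs already executing.
    queue:   waiting jobs, started strictly in priority order.
    """
    free = total_nodes - sum(n for _, n in running)
    ends = list(running)            # min-heap of (end_time, nodes)
    heapq.heapify(ends)
    t = now
    starts = {}
    for job in sorted(queue, key=lambda j: j.priority):
        if job.nodes > total_nodes:
            raise ValueError(f"{job.name} can never fit on this machine")
        # Wait for enough nodes; no lower-priority job may jump ahead.
        while free < job.nodes:
            end, n = heapq.heappop(ends)
            t = max(t, end)
            free += n
        starts[job.name] = t
        free -= job.nodes
        heapq.heappush(ends, (t + job.runtime, job.nodes))
    return starts


def is_fairly_treated(actual_start, fair_start):
    # A job is treated fairly if it starts no later than its fair-start time.
    return actual_start <= fair_start
```

Because time only moves forward inside the loop, a job can never start before a higher-priority job, which is what makes the resulting schedule a backfill-free reference point.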

6
Introduction to Job Scheduling (options)
  • Reservation Depth: Number of jobs which are
    reserved/blocked
  • 0 (No Guarantee Backfilling)
  • 1 (Aggressive/EASY Backfilling)
  • Unlimited (Conservative Backfilling)
  • Queue Priority: Sorting of waiting jobs
  • FCFS, SJF, LJF, Fairness
  • Starvation Limits/Selective Reservations (provide
    artificial starvation limits)
  • Wait time limit
  • Usage (node-hours)
  • How many jobs can starve (per system or per
    user)?
  • Only fair jobs can starve?

7
Depth of Reservation
Conservative Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q5, R1, R2]
8
Depth of Reservation
Conservative Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q5, R1, R2]
9
Depth of Reservation
Conservative Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q5, R1, R2]
  • No job is delayed by a later-arriving job
  • Higher priority jobs have a better chance of
    backfilling
  • Guaranteed starvation free and bounded delays

10
Depth of Reservation
Aggressive Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q3, R1, R2]
11
Depth of Reservation
Aggressive Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q5, R1, R2]
12
Depth of Reservation
Aggressive Backfilling
[Figure: schedule chart, Processors vs. Time, with the queue sorted in priority order; labels Q1-Q5, R1, R2]
  • Possibility for longer narrow jobs to start
  • All but the first job can be continually unfairly
    delayed
  • Starvation free (assuming progress in queue
    priority) but unbounded delays

13
Depth of Reservation
No Guarantee Backfilling
[Figure: schedule chart, Processors vs. Time; labels Q1-Q3, R1, R2]
14
Depth of Reservation
No Guarantee Backfilling
[Figure: schedule chart, Processors vs. Time; labels Q1-Q4, R1, R2]
15
Depth of Reservation
No Guarantee Backfilling
[Figure: schedule chart, Processors vs. Time; labels Q1-Q5, R1, R2]
16
Depth of Reservation
No Guarantee Backfilling
[Figure: schedule chart, Processors vs. Time; labels Q1-Q5, R1, R2]
  • First job which fits is selected
  • Starvation is a problem
  • All jobs can be continually unfairly delayed
  • Possibly good utilization
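
The three reservation depths (no guarantee, aggressive/EASY, conservative) can be compared with one small routine. The sketch below is an assumption-laden illustration rather than the CPlant code: `jobs_starting_now`, its tuple layouts, and the use of `None` for unlimited depth are hypothetical, and runtimes are taken as exact.

```python
def jobs_starting_now(total_nodes, running, queue, depth, now=0.0):
    """Return the names of queued jobs that start at `now`.

    running: list of (end_time, nodes) for executing jobs.
    queue:   list of (name, nodes, runtime), already sorted by priority.
    depth:   how many blocked jobs receive a reservation
             (0 = no guarantee, 1 = aggressive/EASY, None = conservative).
    """
    limit = float("inf") if depth is None else depth
    allocs = [(now, end, n) for end, n in running]   # (start, end, nodes)
    started, reserved = [], 0

    def fits(t, nodes, runtime):
        # Usage only rises at allocation starts, so test t and those points.
        points = [t] + [s for s, _, _ in allocs if t < s < t + runtime]
        return all(
            nodes + sum(n for s, e, n in allocs if s <= p < e) <= total_nodes
            for p in points
        )

    for name, nodes, runtime in queue:
        if fits(now, nodes, runtime):
            # Starting now delays no running job and no existing reservation.
            allocs.append((now, now + runtime, nodes))
            started.append(name)
        elif reserved < limit:
            # The earliest feasible start is the end of some allocation.
            for t in sorted({e for _, e, _ in allocs if e > now}):
                if fits(t, nodes, runtime):
                    allocs.append((t, t + runtime, nodes))
                    reserved += 1
                    break
    return started


# 10-node machine, one running job holding 4 nodes until t = 4.
queue = [("Q1", 8, 2.0), ("Q2", 9, 6.0), ("Q3", 2, 10.0), ("Q4", 4, 5.0)]
running = [(4.0, 4)]
print(jobs_starting_now(10, running, queue, depth=0))     # ['Q3', 'Q4']
print(jobs_starting_now(10, running, queue, depth=1))     # ['Q3']
print(jobs_starting_now(10, running, queue, depth=None))  # []
```

In this example, deeper reservations let fewer jobs backfill: with depth 0 both small jobs start, with depth 1 only the job that does not delay Q1 starts, and with unlimited depth Q3 is also held back because it would delay Q2's reservation.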

17
Queuing Priority
  • FCFS
  • fair on a per job basis
  • guarantees a static queue order
  • SJF/LJF/WJF
  • Reorder jobs for backfilling order
  • Attempt to improve average user metrics and
    utilization by sorting jobs in an intelligent
    way
  • Newly arriving jobs can move other jobs back in
    the queue
  • Possibility of starvation unless all jobs have
    static reservations
  • Fair Share
  • Reorders jobs
  • Attempts to improve user fairness

18
Starvation Thresholds
  • Scheduler changes normal policy for a starving
    job when some threshold is crossed, e.g.
    wait-time of 1 day
  • Selective reservations for a starving job
  • Attempt to eliminate starvation with a scheduling
    policy which is not starvation free
  • Not needed if policy is starvation free
  • Many free variables which need tweaking (and
    are not dynamic)
  • Can adversely affect fairness for other jobs

19
Starvation cont.
  • When is a job starving?
  • Exceeded wait time?
  • Exceeded slowdown?
  • What value of wait time/slowdown?
  • Can a user who has used more than their fair
    share be considered starving?
  • What binary limit do you place on fair share?
  • How many starving jobs get a reservation?
  • Per user or per system?

20
Table Of Contents
  • Introduction to Job Scheduling
  • The original C-Plant/Ross scheduler
  • Increasing fairness
  • Simulation Environment
  • Results
  • Conclusion
  • Questions

21
C-Plant scheduler
  • No Guarantee backfilling
  • Fair share queue priority (decaying node-hours)
  • Jobs with a wait time > 24/72 hours are considered
    starving and
  • Are placed in a virtual queue by receiving a
    higher priority than non-starving jobs
  • Are sorted in FCFS order instead of by fairshare
  • Head of queue has a reservation (aggressive
    backfilling)
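
The queue order described on this slide might look like the sketch below. It is a reading of the bullets, not the actual C-Plant source: `WaitingJob`, `cplant_queue_order`, and the precomputed `usage` field (standing in for decaying node-hours) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WaitingJob:
    name: str
    submit_time: float   # hours
    usage: float         # owner's decayed node-hours, computed elsewhere (assumption)


def cplant_queue_order(jobs, now, starvation_hours=24.0):
    """Starving jobs first in FCFS order, then the rest in fair-share order."""
    starving = [j for j in jobs if now - j.submit_time > starvation_hours]
    normal = [j for j in jobs if now - j.submit_time <= starvation_hours]
    starving.sort(key=lambda j: j.submit_time)   # virtual starvation queue: FCFS
    normal.sort(key=lambda j: j.usage)           # fair share: lightest user first
    return starving + normal
```

The job at the head of the returned list is the one that would receive the single (aggressive backfilling) reservation.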

22
C-Plant scheduler (implications)
  • Jobs do not necessarily run in fair share order
  • Allows for unfair use of the machine
  • No Guarantee Backfilling
  • Starvation Queue/FCFS order
  • Unbounded wait times and starvation force system
    admins to start jobs manually
  • Good utilization and average user metrics

23
Table Of Contents
  • Introduction to Job Scheduling
  • The original C-Plant/Ross scheduler
  • Increasing fairness
  • Simulation Environment
  • Results
  • Conclusion
  • Questions

24
Suggestions to Improve Fairness
  • Runtime Limitation
  • Cap runtimes at 72 hours
  • Improves fairness by allowing preemption
  • Improves user metrics by allowing preemption
  • Scripts have been developed to help ease the
    burden on the user
  • Minimal impact on fair long jobs expected
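
One way a user-side script could work with a 72 hour cap is to split a long run into restartable segments. The helper below is a hypothetical sketch (`split_runtime` is not one of the scripts mentioned on the slide) and assumes the application can checkpoint and resume between segments.

```python
def split_runtime(total_hours, cap_hours=72.0):
    """Split a long run into consecutive segments no longer than the cap."""
    if total_hours <= 0 or cap_hours <= 0:
        raise ValueError("hours must be positive")
    segments = []
    remaining = total_hours
    while remaining > cap_hours:
        segments.append(cap_hours)
        remaining -= cap_hours
    segments.append(remaining)   # final, possibly shorter, segment
    return segments
```

For example, a 200 hour run becomes three submissions of 72, 72, and 56 hours, and the scheduler gets a chance to reorder the queue between them.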

25
Suggestions to Improve Fairness
  • Increase starvation limit from 24 to 72 hours (or
    greater?)
  • Reduces unfairness due to FCFS queue
  • Does not address lack of fairness due to no
    guarantees
  • Prevents jobs from starving forever
  • Minimal impact on standard average user metrics
    and utilization

26
Suggestions to Improve Fairness
  • Do not allow a starving reservation for users
    who are hogging the machine
  • Introduces fairness to the virtual starvation
    queue
  • Very minimal impact on standard user metrics and
    utilization
  • Only tracks usage through system time windows
  • Simple change to existing scheduler, minimal
    impact to normal users
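
A minimal sketch of this rule follows. The slides do not define "hogging", so the `usage_limit` cutoff and the function name are assumptions; the check simply combines the wait-time threshold with a cap on the user's recent usage.

```python
def may_use_starvation_queue(wait_hours, user_usage, usage_limit,
                             starvation_hours=24.0):
    """Decide whether a waiting job may enter the starvation queue.

    user_usage:  the owner's node-hours within the tracked time window.
    usage_limit: hypothetical cutoff above which a user counts as hogging.
    """
    is_starving = wait_hours > starvation_hours
    is_hogging = user_usage > usage_limit
    return is_starving and not is_hogging
```

Jobs from users over the limit stay in the normal fair-share queue no matter how long they have waited.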

27
Suggestions to Improve Fairness
  • Conservative Backfilling
  • Eliminates starvation
  • Queue still sorted by fair-share, fairness
    still matters
  • Deterministic worst case start time upon
    submittal of each job
  • FCFS feel, each job receives an initial
    reservation in arrival order
  • An unfair job can still delay a fair job during
    backfilling

28
Suggestions to Improve Fairness
  • Conservative backfilling with dynamic
    reservations
  • Removes FCFS feel from conservative backfilling
  • A job can never delay a more fair job
  • Starvation is possible
  • User has control
  • If the user does not submit jobs (or adds artificial
    dependencies), progress in the queue is
    guaranteed, eliminating starvation
  • Implements the spirit of the fair share policy
  • No unfair job will ever delay a more fair job

29
Table Of Contents
  • Introduction to Job Scheduling
  • The original C-Plant/Ross scheduler
  • Increasing fairness
  • Simulation Environment
  • Results
  • Conclusion
  • Questions

30
Simulation Environment
  • Event driven simulator
  • Actual CPlant traces are used as input to the
    simulator
  • CPlant/Ross
  • December 2002 to June 2003

31
Table Of Contents
  • Introduction to Job Scheduling
  • The original C-Plant/Ross scheduler
  • Increasing fairness
  • Simulation Environment
  • Results
  • Conclusion
  • Questions

32
Results
  • Original CPlant/Ross policy
  • 24 hours starvation, any job can starve, no
    maximum runtime
  • Small tweaks
  • 72 hour starvation, any job can starve, no max
    runtime
  • 24 hour starvation, unfair jobs can not starve,
    no max runtime
  • 24 hour starvation, any job can starve, 72 hour
    max runtime
  • 72 hours starvation, unfair jobs can not starve,
    72 hours max runtime

33
Results
  • Reduce jobs which miss fair start time
  • Loss of capacity is generally slightly lower
  • Combining all three enhancements shows the most
    improvement
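
The "jobs which miss fair start time" figure can be computed from two tables of start times. This is a hedged sketch: the function name and dictionary layout are assumptions, and the fair-start times are taken as already computed when each job was submitted.

```python
def percent_missing_fair_start(actual_starts, fair_starts):
    """Percentage of jobs whose actual start is later than their fair start.

    Both arguments map job name -> start time (same time unit).
    """
    if not actual_starts:
        return 0.0
    missed = sum(
        1 for name, start in actual_starts.items() if start > fair_starts[name]
    )
    return 100.0 * missed / len(actual_starts)
```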

34
Results
  • Heavy users with high wait times benefit
  • Reduce extreme wait time for mid-range users
  • A very light user actually gets worse
  • Still seems unfair

36
Results
  • Fundamental changes
  • Conservative backfilling
  • Conservative backfilling with 72 hr runtime
    limits
  • Conservative backfilling with dynamic
    reservations
  • Conservative backfilling with dynamic
    reservations and 72hr runtime limits

37
Results
  • Possible to further reduce percent of jobs which
    miss fair start time
  • Conservative backfilling (static) can be bad for
    fairness
  • Small increase (3%) in loss of capacity (for
    dynamic reservations)

38
Results
  • Heavy users are appropriately penalized
  • Light users are given better treatment
  • Medium users can still perform worse than heavy
    users

40
Results
  • Previously unfair users improve the most
  • No dramatic increase in wait time
  • Most users improve

43
Conclusions
  • Proposed a new way of quantitatively assessing
    how well a fair-share policy is implemented by a
    scheduler
  • The original scheduling policy causes unfair
    treatment of about 10% of jobs
  • Effect of several possible changes to scheduling
    policy were evaluated through simulations
  • Change of starvation threshold (from 24 to 72
    hours)
  • Imposition of maximum time limit for jobs
  • Disallowing unfair jobs from starvation-queue
  • Use of reservations for all jobs (conservative
    back-fill variations) instead of starvation-queue
    mechanism
  • Simulations show that modifications can reduce
    unfairness to under 3% of jobs
  • Several issues for further investigation

44
Future Work
  • More detailed analysis
  • Trade-off between fairness and average response
    time
  • Extent of unfairness experienced by different job
    categories
  • Robustness of scheduler under high-load and
    user-hogging scenarios
  • Perform analysis on other Sandia traces
    (West/Alaska) to find any possible
    inconsistencies and trace-dependent results
  • Determine desirable scheduling strategy for the
    Institutional Cluster
  • Improve utilization while maintaining fairness
  • Slack based backfilling
  • Selective reservations
  • Generate blocking without backfilling
  • Use expansion factor/wait time to influence
    priority

45
Future Work
  • Effects of limiting job submissions
  • Number of jobs
  • Node hour limits
  • User/Admin control over fairness
  • Allow a more flexible priority to take into
    account user needs
  • User defined checkpointing
  • Allow users to inform scheduler of checkpoint and
    send a signal
  • Checkpointed jobs can achieve lower turnaround
    time by taking advantage of currently unused
    cycles (improve utilization)
  • Transparency
  • Give user estimates (how accurate can we be under
    different scheduling policies)

46
Acknowledgments
  • Thanks to Jeanette Johnston for the discussions
    regarding the current policy and possible
    improvements
  • Thanks to Jon Stearly for going out of his way to
    discuss fairness and for getting the raw Cplant
    logs

47
Questions?