1
Scheduling Remote Processing Capacity in a
Workstation-Processor Bank Network
(Up-Down Algorithm in Condor)
  • By
  • Matt Mutka and Miron Livny (1990)
  • The 7th International Conference on Distributed
    Computing Systems
  • Reviewed by
  • Paskorn Champrasert

2
Problems
  • Workstations are powerful machines capable of
    executing millions of instructions each second.
  • The processing demands of the owner are much
    smaller than the capacity of the workstation.
  • Some of the users face the problem that the
    capacity of their workstations is much too small
    to meet their processing demands.
  • Can we provide a high quality of service in a
    highly utilized network of workstations?

3
Condor Up-Down Algorithm
  • The Condor system schedules long-running
    background jobs at idle workstations.
  • Background jobs are long-running processes that
    do not require interaction with the user.
  • The paper explores algorithms for managing
    idle workstation capacity:
  • - Jobs from users who request large amounts
    of capacity should be granted as much as possible
    without inhibiting the access to capacity of
    other users who want smaller amounts.
  • - The Up-Down algorithm is designed to allow
    fair access to remote capacity
  • (fair between the jobs of light users and the jobs
    of heavy users).

4
System Design
  • Background jobs require several hours of CPU time
    and little interaction with their users.
  • A system has been designed and implemented to
    execute background jobs remotely at idle
    workstations.

5
Scheduling Structure
  • A centralized coordinator will assign background
    jobs to execute at available remote workstations.
  • The coordinator gets system information in order
    to implement the meta-scheduling policy:
  • the list of running jobs and waiting jobs
  • the locations of idle stations.

6
(No Transcript)
7
Scheduling Structure
  • Each workstation contains
  • a local scheduler
  • (each workstation makes its own decision about
    which job should be executed next)
  • a process queue
  • One workstation works as the coordinator
  • it holds the central coordinator process
  • Every 2 minutes the central coordinator gets
    information from the workstations to see
  • which workstations are available
  • which workstations have background jobs waiting
  • If a background job (remote process) is running on
    a workstation
  • the local scheduler in that workstation checks
    every ½ minute to see if the background job
    should be preempted because the local user has
    resumed using the station.
  • If so, the local scheduler immediately preempts
    the background job
  • and checkpointing is invoked
  • (checkpointing of a program is the saving of an
    intermediate state of the program so that its
    execution can be restarted from this intermediate
    state).
  • The coordinator submits the preempted background
    job to another idle workstation (see the sketch
    below).
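
A minimal Python sketch of the local scheduler's polling loop described
above; owner_is_active, checkpoint, notify_coordinator, and the
background_job object's is_running/stop methods are hypothetical
placeholders, not Condor's actual interfaces.

  import time

  CHECK_INTERVAL = 30  # seconds: the local scheduler checks every half minute

  def local_scheduler_loop(background_job, owner_is_active,
                           checkpoint, notify_coordinator):
      # Poll every half minute while a remote (background) job runs here.
      while background_job.is_running():
          time.sleep(CHECK_INTERVAL)
          if owner_is_active():
              # The owner resumed using the station: preempt immediately,
              # save the job's intermediate state, and hand it back to the
              # coordinator to be placed at another idle workstation.
              checkpoint(background_job)
              background_job.stop()
              notify_coordinator(background_job)
              break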

8
Fair Access to Remote Cycles
  • The authors observed that the users can be
    divided into 2 groups:
  • 1. Heavy users
  • try to consume all available capacity for long
    periods.
  • 2. Light users
  • consume remote cycles occasionally.
  • All users should be served fairly.
  • Heavy users should not inhibit light users from
    accessing remote cycles.

9
Up-down algorithm
  • The Up-Down algorithm enables heavy users to
    maintain steady access to remote cycles while
    providing fair access to cycles for light users.
  • The Up-Down algorithm protects the rights of light
    users when a few heavy users try to monopolize
    all free resources.

10
Up-Down Algorithm
  • Each workstation has a Schedule Index (SI).
  • A Schedule Index table (the SI of all workstations)
    is maintained at the Condor coordinator.
  • The value of SI is used to decide which
    workstation is next to be allocated a remote
    processor.
  • Workstations with smaller SI entries are given
    priority over workstations with larger SI entries
  • (lower SI, higher priority; see the sketch below).
  • Initially, SI is set to zero.
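
A minimal sketch of the priority rule on this slide: the coordinator keeps
an SI table and serves the waiting workstation with the smallest SI first.
The workstation names and SI values below are made up for illustration.

  # SI table kept at the coordinator (hypothetical values).
  si_table = {"ws1": 0.0, "ws2": 12.5, "ws3": 3.0}

  def next_workstation_to_serve(waiting):
      # 'waiting' is the set of workstations with queued background jobs;
      # lower SI means higher priority.
      return min(waiting, key=lambda ws: si_table[ws])

  print(next_workstation_to_serve({"ws2", "ws3"}))   # -> ws3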

11
Up-Down Algorithm
  • Each workstation maintains its scheduling index
    (SI), which acts as its priority for obtaining
    remote cycles.
  • At the beginning SI is 0.
  • In each scheduling interval,
  • each workstation's SI is increased or decreased:
  • a workstation with a light user has a very low SI
    (high priority)
  • a workstation with a heavy user has a very high SI
    (low priority)
  • Periodically, the coordinator checks whether any
    workstations have new background jobs to execute.
  • If a workstation with high priority has a job to
    execute and there is no idle workstation, the
    coordinator preempts a remotely executing
    background job that belongs to a low-priority
    workstation
  • - the preempted job is checkpointed (its
    intermediate state is saved)
  • - the coordinator's scheduler gives the freed
    station to the higher-priority workstation's job,
    and the preempted job of the lower-priority
    workstation waits to be rescheduled (see the
    sketch below).
  • The Up-Down algorithm is the rule for dynamically
    updating SI:
  • Light users that increase their loads (number of
    background jobs) will have their SI increased
    (lower priority),
  • so they will be treated as heavy users.
  • Heavy users that decrease their loads will have
    their SI decreased (higher priority),
  • so they will be treated as light users.
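
A sketch of the allocation step just described, assuming the coordinator
knows each workstation's SI, which stations are idle, and which
workstation owns each running remote job. All names and data structures
here are illustrative, not Condor's actual code.

  def allocate(waiting, idle_stations, running, si):
      # waiting:       workstations with queued background jobs
      # idle_stations: stations currently free to run a remote job
      # running:       dict {station: owner_workstation} of executing jobs
      # si:            dict of Schedule Indexes (lower = higher priority)
      placements = []
      for ws in sorted(waiting, key=lambda w: si[w]):       # lowest SI first
          if idle_stations:
              placements.append((ws, idle_stations.pop()))  # use a free station
              continue
          # No idle station: look for a running job whose owner has a
          # higher SI (lower priority) than the requester.
          victims = [s for s, owner in running.items() if si[owner] > si[ws]]
          if victims:
              station = max(victims, key=lambda s: si[running[s]])
              # The victim's job is checkpointed and returned to its owner's
              # queue; the requester's job takes over the station.
              del running[station]
              placements.append((ws, station))
      return placements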

12
(Figure: a node requesting and receiving available remote processing
cycles; f, g, h, l are the SI-updating functions.)
f increases SI when a light user increases its load.
g decreases SI when the process has to wait for remote cycles:
SI is decreased if a station wants a remote processor but was denied.
h and l stabilize the SI when workstations do not want remote cycles.
13
(Figure: one station's SI over time, illustrating the update rules.)
SI is initialized to 0.
When a job arrives and no allocation is given -> SI decreases by g(SI).
After an allocation is made -> SI increases by f(SI).
If two allocations are given -> SI increases twice as fast (2f(SI)).
Completion of one job -> SI increases by f(SI) (one allocation remains).
Second job completes -> SI decreases by h(SI).
Once a station's SI reaches zero it will stay there until it asks
for a node.
Scheduling interval: 10 min. (A sketch of this update rule follows.)
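
A sketch of the per-interval SI update just listed. The slides do not give
numeric forms for f, g, and h, so simple linear placeholders are assumed
here; all variable names are illustrative.

  # Placeholder SI-update functions (the real f, g, h are not specified here).
  def f(si): return si + 1.0   # charge for each allocation currently held
  def g(si): return si - 1.0   # credit while a request is denied (waiting)
  def h(si): return si - 1.0   # decay toward zero when no remote cycles wanted

  def update_si(si, allocations, was_denied, wants_remote):
      # allocations:  number of remote stations currently granted to this user
      # was_denied:   the station asked for a node this interval and was denied
      # wants_remote: the station still wants remote cycles
      if allocations > 0:
          for _ in range(allocations):   # two allocations -> SI rises at 2*f
              si = f(si)
      elif was_denied:
          si = g(si)                     # job waiting for remote cycles
      elif not wants_remote:
          si = max(h(si), 0.0)           # stabilize; SI stays at zero once there
      return si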
14
Algorithms Used for Comparisons
  • For comparison with the Up-Down algorithm, the
    Random algorithm and the Round-Robin algorithm
    are used.
  • Random and Round-Robin are non-preemptive
    algorithms: after a process is allocated to a
    remote resource, it runs until it terminates.
  • The Random algorithm
  • makes all its decisions without reference to
    past decisions.
  • When a workstation wants to place a process on a
    remote workstation, the Random algorithm randomly
    picks one of the available remote workstations.
  • The Round-Robin algorithm
  • gives each workstation a chance, in a particular
    order, to receive remote cycles (see the sketch
    below).
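
Minimal sketches of the two baseline policies as described above; both are
non-preemptive, so a placed job runs to completion. Function names are
illustrative.

  import random

  def random_pick(available_stations):
      # Ignore all past decisions: pick any available remote workstation.
      return random.choice(list(available_stations))

  def make_round_robin(all_workstations):
      # Give each workstation a chance, in a fixed order, to receive
      # remote cycles.
      order = list(all_workstations)
      state = {"next": 0}
      def next_requester(requesting):
          for _ in range(len(order)):
              ws = order[state["next"] % len(order)]
              state["next"] += 1
              if ws in requesting:
                  return ws
          return None
      return next_requester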

15
Simulation Configuration
  • Number of workstations: 13
  • 11 workstations have a light user
  • (number of background jobs = 1)
  • 1 workstation has a heavy user
  • (number of background jobs varied from 2 to 13)
  • 1 workstation has a medium user
  • (number of background jobs = 2)
  • The background jobs have a mean service demand of
    2 hours for all workstations
  • Scheduling interval: 10 minutes
  • Job transfer cost (time to transfer a job): 1
    minute
  • The simulation is run for 2 years of simulated time

16
Simulation Configuration
SI will be very high. The authors want to reduce
SI to zero faster.
17
Simulation Results
  • The Remote Cycle Wait Ratio is calculated as
  • the remote execution time a workstation received
    divided by its wait time (higher is better); a
    small sketch follows.
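
A one-line sketch of the metric as defined above (variable names assumed):

  def remote_cycle_wait_ratio(remote_exec_time, wait_time):
      # Remote execution time received divided by wait time; higher is better.
      return remote_exec_time / wait_time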

18
Extra Slides
19
Performance
  • 23 workstations were observed over one month.
  • The workstations operate under BSD 4.3 Unix.
  • One workstation works as the coordinator.

20
(Figure: average job demand per user.)
User A: heavy user. Users B-E: light users.
21
Total queue length: the heavy user kept more than 30
jobs in the queue.
Wait ratio = the amount of time that a job waits / its
service time.
The light users did not wait; their wait ratio is very
small. The Up-Down algorithm allocated remote capacity
to the light users on demand and preempted the heavy
user's jobs.
22
12438 machine-hours were available for remote
execution; 4771 machine-hours were utilized by
Condor. The average local utilization is only 25%.
Almost 200 machine-days of capacity that otherwise
would have been lost were consumed by Condor.
Utilization of the system over one working week
(Mon-Fri): about 20% in the evening and night, and 50%
for a short peak period in the afternoon.
23
Queue length
24
Impact on Local Workstations
  • Some local capacity is used for
  • placement and checkpointing of remote jobs
  • the local scheduler
  • The coordinator also consumes some resources.
  • Results
  • - The local scheduler consumes less than 1% of a
    station's capacity.
  • - The coordinator consumes less than 1% of a
    station's capacity.
  • - The checkpointing cost depends on the size of
    the jobs:
  • - the size of a checkpoint is approximately 0.5
    megabyte
  • - at approximately 5 seconds per megabyte of
    checkpoint file,
  • that is about 2.5 seconds for one checkpoint.

25
Leverage is the ratio of the capacity consumed by
a job remotely to the capacity consumed on the
home station to support remote execution.
Large -> the job consumes more remote capacity;
small -> the job consumes more local capacity.
The average leverage is 1300: for 1 minute of local
capacity, about 1300 minutes (roughly 22 hours) of
remote capacity are received. A small sketch follows.
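
A small sketch of the leverage metric defined above, with the reported
average plugged in (names are illustrative):

  def leverage(remote_capacity_minutes, local_support_minutes):
      # Capacity a job consumed remotely / capacity consumed at home to
      # support its remote execution.
      return remote_capacity_minutes / local_support_minutes

  # A leverage around 1300 means 1 minute of local capacity buys about
  # 1300 minutes (~22 hours) of remote capacity.
  print(leverage(22 * 60.0, 1.0))   # -> 1320.0, roughly the reported 1300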
26
Conclusions
  • Almost 200 machine-days of capacity that
    otherwise would have been lost were consumed by
    Condor.
  • Users dedicate a small amount of local resources
    to access a huge amount of remote resources.

27
The Remote Unix (RU) Facility
  • Workstations operate under BSD 4.3 Unix.
  • Remote Unix turns idle workstations into cycle
    servers.
  • A shadow process runs locally as the substitute
    for the process running on the remote machine.
  • Any Unix system call (e.g., reading/writing files)
    made by the program on the remote machine is
    forwarded to the shadow process:
  • a message indicating the type of system call is
    sent to the shadow process on the local machine;
    this can be viewed as a remote procedure call
    (a sketch follows).
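
A sketch of the shadow-process idea: system-call requests from the remote
job arrive as messages and are executed on the home machine, much like a
remote procedure call. The message format and dispatch below are
assumptions for illustration, not Condor's actual protocol.

  import os

  def shadow_handle(message):
      # message: {"syscall": <name>, "args": [...]} sent by the remote job.
      name, args = message["syscall"], message["args"]
      if name == "open":
          path, flags = args
          return os.open(path, flags)    # open the file on the home machine
      if name == "read":
          fd, size = args
          return os.read(fd, size)
      if name == "write":
          fd, data = args
          return os.write(fd, data)      # data must be bytes
      raise ValueError("unsupported system call: %s" % name)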

28
Checkpointing
  • When a job is removed from a remote location,
    checkpointing is invoked.
  • Checkpointing of a program is the saving of the
    state of the program so that its execution can be
    restarted.
  • The state of a process contains
  • the text (executable code)
  • the data (variables of the program)
  • the stack segments of the program
  • the values of the registers
  • the status of the open files
  • messages sent from the program to its shadow for
    which no reply has been received yet
  • (a sketch of this state record follows).
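
A sketch of the checkpoint contents listed above as a plain data record
(field names are assumptions for illustration):

  from dataclasses import dataclass, field

  @dataclass
  class Checkpoint:
      text: bytes = b""          # executable code
      data: bytes = b""          # variables of the program
      stack: bytes = b""         # stack segments
      registers: dict = field(default_factory=dict)   # register values
      open_files: list = field(default_factory=list)  # status of open files
      # messages sent to the shadow for which no reply has been received
      pending_messages: list = field(default_factory=list)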