Improving Goodput by Coscheduling CPU and Network Capacity

Provided by: lagh

1
Improving Goodput by Co-scheduling CPU and Network Capacity
By Jim Basney and Miron Livny
Presented by Prasanna Laghate
CS 599 Grid Computing, USC, Fall 2000
2
Talk Overview
  • Introduction
  • Goodput
  • Condor Environment
  • Approaches for Improving Goodput
      • Co-matching
      • Checkpoint Scheduling
      • Reducing Overall Network Demand
  • Monitoring Goodput
  • Future Work
  • Conclusion

3
Introduction
  • A high-throughput computing (HTC) environment
    strives to provide large amounts of processing
    capacity over long periods of time by harnessing
    available resources on the network.
  • Definition of goodput: the allocation time during
    which a remotely executing application uses the
    CPU to make forward progress.
  • To maximize the amount of available resources,
    HTC uses:
      • non-dedicated, distributively owned resources
      • an application checkpoint facility
        (preempt-resume scheduling)
  • Factors that affect goodput:
      • Changes in customer applications
      • Network infrastructure
      • Workstation availability
      • Resource owners' policies

4
Introduction (contd)
  • HTC applications do not use the allocated CPU
    while performing n/w operations.
  • Why co-schedule CPU and n/w capacity?
  • To improve goodput, you can take advantage of:
      • the variety of application n/w capacity
        requirements
      • n/w capacity between hosts
      • deadlines for n/w transfers
      • n/w demand over time in a cluster
  • We will look at the co-scheduling techniques
    developed and implemented in the Condor
    environment to improve goodput.

5
Goodput (contd)
  • Network wait time due to:
      • Application placements
      • Periodic checkpoints
      • Migrations
      • Remote file access
  • Rollbacks due to:
      • Failed migrations
6
Goodput (contd)
  • Application placement
      • involves transfer of executable and checkpoint
        data
      • amount of data to be transferred is known in
        advance
      • occurs at the start of the allocation
  • Periodic checkpoint
      • transfers application checkpoint data to a file
        system
      • reduces the risk of lost work (e.g., from a
        failed migration)
      • the application doesn't make forward progress
        while checkpointing
  • Application migration
      • involves transfer of checkpoint data to a file
        system or to the memory of the new workstation
      • triggered when:
          • the owner reclaims the workstation
          • the resource manager preempts the application
            to enforce customer priorities

7
Goodput (contd)
  • Application migration (contd)
      • Migration may fail due to:
          • the preemption deadline
          • a limit on resource consumption during
            preemption
  • Remote file access
      • Required to enable the application to read
        input files and write results
      • I/O operations are initiated by read and write
        system calls
      • Hence, their timing is not known in advance
  • Network overload
      • Increases application wait time
      • Increases migration failure probability
      • Avoid it by scheduling n/w transfers
8
Condor Environment
  • Entities in Condor
      • Matchmaker: uses the matchmaking protocol to
        match resource requests with resource offers
      • Customer agent: makes a resource request
      • Resource agent: makes a resource offer
  • Matchmaker may break an existing match or create
    a new one -> application migration
9
Condor Environment (contd)
  • The 5 components of the matchmaking framework
      • Classad specification: defines a language for
        expressing characteristics and constraints, and
        the semantics of evaluating these attributes
      • Advertising protocol: defines the basic
        conventions regarding what a matchmaker expects
        to find in a classad
      • Matchmaking algorithm: defines how the contents
        of ads and the state of the system relate to the
        outcome of matchmaking
      • Matchmaking protocol: defines how matched
        entities are notified
      • Claiming protocol: defines what actions the
        matched entities take to enable discharge of
        service
  • Two noteworthy distinguishing aspects
      • Allows service providers to express constraints
      • Mutual introduction of the two entities, with a
        separate claiming protocol
10
Condor Environment (contd)
  • Resource owner agent
      • Controls an opportunistic resource by
        implementing owner policies
      • Has a start policy and a preemption policy
      • These policies depend on time of day,
        keyboard/mouse activity, etc.
      • They are distributively and dynamically defined
        by resource owners
      • The owner has complete control over the policy
  • Application resource manager
      • An agent that does application-level scheduling
        for the customer
      • Directs the application to the appropriate
        server for transferring executable and
        checkpoint files
      • Responsible for scheduling periodic checkpoints
11
Condor Environment (contd)
  • Checkpoint server
      • File server developed for bulk transfer of
        large checkpoint files
      • Safeguards against failed transfers overwriting
        earlier checkpoints
      • Application resource manager directs the
        application to the appropriate server
12
Progress
  • Introduction
  • Goodput
  • Condor Environment
  • Approaches for Improving Goodput
      • Co-matching
      • Checkpoint Scheduling
      • Reducing Overall Network Demand
  • Monitoring Goodput
  • Future Work
  • Conclusion

13
Approaches for Improving Goodput
  • Allocation efficiency: the ratio of goodput to
    allocated throughput
      • It is the percentage of allocated time during
        which an application uses the CPU to make
        forward progress (i.e., it is a measure of
        goodput)
  • Objective
      • To improve allocation efficiency with minimal
        impact on allocated throughput
  • A set of co-scheduling techniques to improve
    goodput:
      • Co-matching
      • Checkpoint scheduling
      • Data compression, etc.
  • For each, we consider the impact on allocation
    efficiency and the possible negative impact on
    allocated throughput
14
Co-matching
  • Matchmaker may cause bursts of application
    placements and migrations.
  • The resulting high network demand:
      • Slows application placements and migrations ->
        underutilization of CPU for long periods.
      • Slows unrelated migrations -> failed migrations
        and lost application forward progress.
  • To control this bursty n/w demand, the matchmaker
    is modified to use co-matching, i.e., to match a
    resource request with both a resource offer and
    a n/w bandwidth allocation.
  • N/w bandwidth needed = placement cost of the new
    application (sum of executable and checkpoint
    file sizes) + estimated migration cost (estimated
    checkpoint size) of the application to be
    preempted.
  • If the n/w bandwidth allocation cannot be
    obtained, the match fails.
  • N/w bandwidth allocation is obtained per subnet.
15
Co-matching (contd)
  • Implementation
      • Resource request: size and location of the
        application's executable and data files
      • Resource offer: estimate of the active
        application's checkpoint size and location of
        the checkpoint server
      • Matchmaker configuration file specifies n/w
        capacity and topology.
      • Resource owner agent is responsible for
        enforcing the allocation limits.
  • CPU and bandwidth are allocated to resource
    requests in priority order.
  • Priority calculation may include historical CPU
    and n/w usage.
  • CPUs may remain unused if there are only large
    applications in the system.
  • Aggressive limits on n/w usage -> decrease in
    goodput.
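
The effect of allocating CPU and bandwidth together in priority order, including the case where a CPU stays idle, can be shown with a small Python sketch. The tuple layout, the convention that a lower number means higher priority, and the choice to let a smaller request pass a blocked larger one are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# (name, priority, transfer_mb); a lower priority value is served first here.
RequestT = Tuple[str, int, float]


def allocate(requests: List[RequestT], free_cpus: int,
             budget_mb: float) -> Dict[str, List[str]]:
    """Hand out CPUs and bandwidth together, best priority first."""
    placed, skipped = [], []
    for name, _prio, transfer_mb in sorted(requests, key=lambda r: r[1]):
        if free_cpus > 0 and transfer_mb <= budget_mb:
            free_cpus -= 1
            budget_mb -= transfer_mb
            placed.append(name)
        else:
            skipped.append(name)
    return {"placed": placed, "skipped": skipped}


result = allocate([("big1", 1, 400.0), ("big2", 2, 400.0), ("small", 3, 50.0)],
                  free_cpus=3, budget_mb=500.0)
print(result)  # {'placed': ['big1', 'small'], 'skipped': ['big2']}
```

Three CPUs are free but only two jobs are placed: `big2` does not fit in the remaining bandwidth budget, so one CPU stays unused. A tighter budget would leave more CPUs idle, which is how aggressive limits can reduce goodput.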

16
Checkpoint Scheduling
  • Checkpoint server provides scheduling for
    periodic checkpoints.
  • Avoids bursts of periodic checkpoint traffic.
17
Checkpoint Scheduling (contd)
  • Choice of periodic checkpoint interval is
    important
      • A short interval increases checkpoint overhead
        but decreases losses due to failed migrations.
  • To maintain balance:
      • Weigh the cost of periodic checkpointing
        (checkpoint size, n/w capacity) against the
        likelihood and severity of failed migrations
        (checkpoint size, n/w capacity, and resource
        owners' policies)
  • Additional opportunities for n/w bandwidth
    management
      • Checkpoint server can prioritize streams
        (migration vs. checkpoint streams)
      • Checkpoint server can serialize streams
        (reduces n/w contention)
      • Checkpoint servers may be deployed throughout
        the n/w to localize traffic.
  • Pre-scheduling preemptions -> decreases goodput
    if too aggressive.
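
The interval trade-off can be made concrete with a standard first-order model (Young's approximation). The talk does not give this formula; it is brought in here as an assumption. It takes the checkpoint cost C as checkpoint size divided by available n/w bandwidth, and M as the mean time between events that lose work, such as failed migrations.

```python
import math


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      mean_time_between_failures_s: float) -> float:
    """First-order fraction of allocated time lost, for a given interval.

    checkpoint_cost_s / interval_s : time spent writing checkpoints
    interval_s / (2 * M)           : expected work rolled back, assuming a
                                     failure lands mid-interval on average
    """
    return (checkpoint_cost_s / interval_s
            + interval_s / (2.0 * mean_time_between_failures_s))


def young_interval(checkpoint_cost_s: float,
                   mean_time_between_failures_s: float) -> float:
    """Interval minimizing overhead_fraction: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_failures_s)


# Example: a 60 s checkpoint transfer and one failed migration every 2 hours.
best = young_interval(60.0, 7200.0)
print(round(best))  # 930
```

Under these assumed numbers a checkpoint roughly every 15 minutes balances the two costs; a slower network (larger C) or more patient resource owners (larger M) both push the interval up.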

18
Reducing Overall N/w Demand
  • Minimize the amount of checkpoint and file data
    sent over the n/w
      • Fast checkpointing techniques: compressing
        checkpoints, checkpointing only dirty pages.
      • Data staging techniques (e.g., transfer data
        over the n/w when resources are available).
      • Data caching policy reduces I/O overhead during
        allocation.
  • These techniques do not have a negative impact
    on allocated throughput.
19
Monitoring Goodput
  • Goodput is a measure of the health of the
    system.
      • Helps in detecting problems with specific
        applications and subnets in the cluster.
      • System configuration can be adjusted to enhance
        the service provided.
      • Supports quality control by the system
        administrator.
  • A method:
      • keep a small number of representative
        applications running at all times
      • form a goodput index that measures change in
        overall goodput and in goodput for specific
        application classes
      • helpful in detecting a number of efficiency
        problems in the system
20
Monitoring Goodput (contd)
  • Statistics maintained by the checkpoint server
      • The server maintains a record of all attempted
        file transfers.
      • The record is used to report n/w usage by the
        system.
      • Statistics can be used to set scheduling
        policies in the matchmaker and checkpoint
        servers.
  • Estimates of future n/w utilization for
    checkpoints and remote file access can be logged
    and compared to actual utilization.
21
Future Work
  • Develop an effective model of the n/w and I/O
    capabilities of a Condor pool.
      • Enables appropriate setting of scheduling
        policies.
  • Develop a multi-resource, consumption-based
    priority scheme to replace the ad hoc mechanisms
    currently used to manage n/w bandwidth.
  • Investigate mechanisms and policies that keep
    resource owners satisfied and improve goodput.
      • A number of techniques exist to reduce the
        impact of application migration on the
        workstation.
      • They allow preemption deadlines to be relaxed.
22
Conclusion
  • A goodput metric for measuring the efficiency of
    remote execution was introduced.
  • Mechanisms for improving goodput by
    co-scheduling CPU and n/w capacity were discussed.