1
Queue Wait Estimation, User Activity Profiling
and Resource Usage Modeling Using the Integrated
Resource Information Service
http://iris.cs.uh.edu
  • Archit Shivaprakash
  • Department of Computer Science
  • University of Houston
  • archit@cs.uh.edu

2
Grid Environments
  • Collaborative in nature
  • Permit the sharing of geographically distributed
    resources
  • Advantages: capital savings, better resource
    utilization, fault tolerance, cooperation
    amongst the user community
  • Result: competition amongst users for
    available resources
  • Environment is often dynamic
    and unpredictable
  • Presence of an information service that informs
    users about the state of their operating
    environment can be very helpful!

3
Globus Middleware
IRIS is an information service that can operate
in a grid environment. It is unrelated to Globus
in its current form.
4
Information Services
  • Types of Information
  • Static: information that seldom changes (E.g. the OS
    on a resource)
  • Dynamic: information that changes continually (E.g.
    queue lengths)
  • Current State of Information Services
  • Services like MDS are good at providing static
    information
  • Dynamic information services are a work in
    progress!
  • It is against such a backdrop that IRIS is
    introduced as a dynamic information service for
    grid environments.

5
Job Execution on Grids
  • Life-cycle of a grid job
  • Submission by the User to a Resource Broker
  • Broker decides which site is best for job and
    forwards it to the RMS for the chosen site
    (Independent of IRIS)
  • RMS checks for availability of requested
    resources
  • Job remains in wait state until the resources are
    available
  • Once resources are available, job executes and
    returns
  • Job Turnaround Time = Wait Time + Execution Time
  • How much does queue wait contribute to the job
    turnaround time?
  • And is there any way of reducing it to improve
    overall throughput?
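A minimal sketch of the turnaround relation above, using hypothetical
accounting timestamps (the field names are illustrative, not the RMS's
actual accounting format):

  from datetime import datetime

  # Hypothetical accounting record for one job (illustrative field names).
  job = {
      "submitted": datetime(2003, 12, 18, 9, 0),    # entered the queue
      "started":   datetime(2003, 12, 18, 10, 30),  # requested resources granted
      "finished":  datetime(2003, 12, 18, 13, 30),  # execution completed
  }

  wait_time = (job["started"] - job["submitted"]).total_seconds()
  exec_time = (job["finished"] - job["started"]).total_seconds()
  turnaround = wait_time + exec_time                # equals finished - submitted

  print(f"wait={wait_time:.0f}s  exec={exec_time:.0f}s  turnaround={turnaround:.0f}s")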

6
Queue Wait Problem
  • Scenario at UH-HPCC during Year 2003
  • Number of jobs submitted by users: 8929
  • Total execution time recorded: 57988
    hours
  • Total queue wait time recorded: 27925 hours

Queue wait accounted for about half of the
recorded execution time, increasing job
turnaround time by a staggering ~50%!
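A quick arithmetic check of that figure, using the totals above:

  exec_hours = 57988   # total execution time recorded in 2003
  wait_hours = 27925   # total queue wait recorded in 2003

  # Queue wait relative to execution time: roughly 48%, i.e. close to half.
  print(f"{wait_hours / exec_hours:.0%}")
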
7
Analyzing the QWP
  • Queue wait during the months of Year 2003 @ HPCC

Average queue wait, Year 2003: 11,260 sec
Busy months like May and June see noticeably
longer queue waits than months like December,
where user activity is relatively low.
8
Degree of Parallelism
  • The whole point of grid environments is to
    exploit the available resources while running
    user jobs.
  • Thus users strive for higher degrees of
    parallelism.
  • Is this always good? No!
  • Why?
  • Larger resource requests can lead to longer
    queue waits, often offsetting the performance
    gain the user was aiming to achieve through the
    parallel execution of his task.

9
Parallelism and QWP
  • How is QWP affected by Parallelism of user jobs?

UH-HPCC Year 2003
The increase in queue wait at higher degrees of
parallelism is clearly evident!
10
Problem Summary
  • Queue wait time encountered by users is
    significant
  • This is especially true for higher degrees of
    parallelism and during months of high grid
    activity
  • Solution
  • An information service that helps users make
    good resource requests while running their jobs.
  • What do we mean by Good Resource Request?
  • The ability to maximize parallelism while
    minimizing queue wait!

11
Good Resource Request
  • Mean QWT reported by IRIS on 12/18/2003 (UH-HPCC)
  • NC 47 -> 559 seconds; NC 48 -> 3301 seconds
  • A resource request of 48 nodes is not advisable, as
    it increases the queue wait nearly six-fold compared
    to 47 (see the sketch below)!
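One way a user (or a submission script) might act on such estimates is
sketched below; the selection rule (largest request whose estimated wait
fits a budget) and the values other than 47/48 are illustrative, not part
of IRIS itself:

  # Hypothetical IRIS report: node count -> mean estimated queue wait (seconds).
  estimates = {46: 540, 47: 559, 48: 3301, 49: 3420}

  def best_request(estimates, wait_budget):
      """Largest node count whose estimated queue wait fits the budget."""
      feasible = [nc for nc, qwt in estimates.items() if qwt <= wait_budget]
      return max(feasible) if feasible else None

  print(best_request(estimates, wait_budget=600))    # -> 47
  print(best_request(estimates, wait_budget=3600))   # -> 49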

12
What is IRIS?
  • Stands for Integrated Resource Information
    Service
  • It is a real-time information service for
    distributed environments (grids)
  • Aims at providing
  • Dynamic information about the state of the
    resource queues
  • Statistical summaries of user activity on the
    monitored resource (user activity profiling)
  • Modeling resource activity as a whole (meant for
    administrators)

13
IRIS Objectives
  • Functional Objectives
  • Provide users with queue wait estimates (primary
    objective)
  • Profile user activity with respect to a resource
  • Model resource usage as a whole
  • Supplementary Objectives
  • Ease of use, ubiquitous, fault tolerant,
    extensible, easy to maintain
  • Many of these are required of any service (tool)
    that operates in a grid environment.

14
IRIS Development Testbed
  • Uses the High Performance Computing Center (HPCC)
    as a development testbed
  • Thus, monitors the HPCC resources
  • Interfaces with the SUN Grid Engine (SGE)
  • HPCC is a non-preemptive environment and jobs do
    not begin execution until all of the requested
    resources are available
  • Note: all data presented in the next few slides
    was collected on UH-HPCC and pertains to the year
    2003 unless otherwise mentioned
  • The contents of this slide are relevant later on
    when we discuss the future of IRIS

15
IRIS Methodology
  • Submit dummy probe jobs on the monitored resource
  • Probes are submitted for all the possible user
    requests
  • Determine the queue wait encountered by IRIS
    probes
  • Report the QWT to the user
  • Note:
  • The probe jobs impose no computational overhead
    on the monitored resource.
  • Additionally, a probe for a node count (say x) is
    not resubmitted until the previous probe (for x)
    returns.
  • The above approach is Active in nature (see the
    sketch below).
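A minimal sketch of the probe cycle described above; the submission and
return hooks are stand-ins (the real service interfaces with SGE, and how
a probe reports back is site-specific):

  import time

  NODE_COUNTS = range(1, 49)                        # every request size a user could make
  outstanding = {nc: None for nc in NODE_COUNTS}    # at most one probe in flight per count
  latest_qwt = {}                                   # node count -> last observed wait (s)

  def submit_probe(node_count):
      """Stand-in for submitting a no-op job requesting `node_count` nodes.
      The real probe exits as soon as it is scheduled, so it imposes no
      computational load on the monitored resource."""
      return {"node_count": node_count, "submitted": time.time()}

  def record_probe_return(handle, start_time):
      """Called when a probe begins executing: its time in the queue is the QWT."""
      nc = handle["node_count"]
      latest_qwt[nc] = start_time - handle["submitted"]
      outstanding[nc] = None        # only now may a new probe for `nc` be submitted

  def probe_cycle():
      """Resubmit probes only for node counts whose previous probe has returned."""
      for nc, handle in outstanding.items():
          if handle is None:
              outstanding[nc] = submit_probe(nc)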

16
Why Active Approach?
  • Can we determine the QWT without submitting any
    probe jobs on the resource? (Passive approach)
  • In other words, can we estimate QWT based on the
    queue waits encountered by user jobs in the
    recent past?
  • The passive approach is not viable because:
  • Job submission on grids is sporadic
  • QWT is not consistent and can fluctuate widely
  • Determining QWT for a resource specification is
    challenging
  • It requires a complex mathematical function that
    could prove to be a bottleneck during concurrent
    requests.

17
Example of QWT Variation
  • Observed on 13th October 2003 on UH-HPCC for
    12-node jobs

18
Active Approach Overhead
  • Interestingly, the Active approach is not very
    intrusive on the monitored resource.

[Chart: measured probe overhead figures of 0.045 and 0.009]
This approach does cause some networking and
bookkeeping overheads. However, it is more viable
than the passive approach.
19
IRIS Organization Model
20
IRIS Functional Units
21
IRIS Architecture
22
IRIS Workflow
23
IRIS Website
  • http://iris.cs.uh.edu/

24
Current Implementation of IRIS
  • Uses the High Performance Computing Center (HPCC)
    as a development testbed
  • Thus, monitors the HPCC resources
  • Interfaces with the SUN Grid Engine (SGE)
  • HPCC is a non-preemptive environment and jobs do
    not begin execution until all of the requested
    resources are available
  • Note: all data presented in the next few slides
    was collected on UH-HPCC and pertains to the year
    2003 unless otherwise mentioned
  • The contents of this slide are relevant later on
    when we discuss the future of IRIS

25
IRIS Results - 1
  • Analyzing relationship between QWT and Degree of
    Parallelism

Mean QWT reported by IRIS on 12/18/2003
(UH-HPCC). Note the occurrence of Surge Thresholds
(ST).
26
IRIS Results 1 (contd)
  • The relationship between QWT and Node Count can
    be divided into multiple linear segments
  • At the boundaries of these segments, we see an
    exponential increase in QWT with a unit increase
    in Node Count
  • E.g. NC 47 -> 559 seconds; NC 48 ->
    3301 seconds
  • The points at which there is an exponential
    increase in QWT are termed Surge Thresholds (see
    the detection sketch below)
  • For maximum parallelism with a minimal penalty
    (queue wait), the user's resource specification
    should sit at the inner boundary of an ST
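A surge threshold can be spotted mechanically by scanning the reported
estimates for a jump between consecutive node counts; the cutoff factor
and the value for 49 nodes below are illustrative assumptions:

  def surge_thresholds(estimates, jump_factor=3.0):
      """Node counts where estimated QWT jumps sharply versus the previous count.
      `estimates` maps node count -> estimated queue wait (seconds);
      `jump_factor` is an arbitrary illustrative cutoff for a 'surge'."""
      counts = sorted(estimates)
      return [nc for prev, nc in zip(counts, counts[1:])
              if estimates[prev] > 0 and estimates[nc] / estimates[prev] >= jump_factor]

  # Using the 12/18/2003 figures quoted above: the surge sits at 48 nodes.
  print(surge_thresholds({47: 559, 48: 3301, 49: 3420}))   # -> [48]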

27
IRIS Results 1 (contd)
  • It is important to note that Surge Thresholds are
    dynamic in nature and largely depend on the jobs
    currently executing and the resources allocated
    to them by the RMS
  • Does IRIS inform users about the node count at
    which the QWT surges? No.
  • It presents the QWT for all the possible resource
    requests and counts on the user to make an
    informed decision.

IRIS is positioned as a best-effort information
service.
28
IRIS Results - 2
  • Analyzing the relationship between QWT and degree
    of parallelism over periods of time (Month of
    December 2003)

Results similar to what we saw over the 24-hour
period previously.
29
Validating IRIS Results
  • Compare the actual and projected QWT over three
    test periods
  • 10 random samples considered during each of the
    test periods
  • Test Period 1: 15-16 November 2003 / 2 days
  • Test Period 2: 26-28 November 2003 / 3 days
  • Test Period 3: 09-12 December 2003 / 4 days
  • Evaluation Metrics
  • Accuracy
  • Timeliness
  • Adaptability

Error (+/-) = | Estimated QWT - Actual Wait Time |
Accuracy = 1 - (Σ Error / Σ Actual Wait Time)
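The accuracy metric above, written out as a small sketch (the sample
values are illustrative, not the recorded validation data):

  def accuracy(samples):
      """Accuracy = 1 - (sum of absolute errors / sum of actual wait times).
      `samples` is a list of (estimated_qwt, actual_wait) pairs in seconds."""
      total_error = sum(abs(est - actual) for est, actual in samples)
      total_actual = sum(actual for _, actual in samples)
      return 1.0 - total_error / total_actual

  # Illustrative samples only:
  print(f"{accuracy([(550, 600), (3200, 3000), (90, 120)]):.0%}")   # -> 92%
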
30
Validation Test Period 1
31
Validation Test Period 1 (contd)
  • Mean accuracy of QWT estimation: 89%
  • Entries in red (previous slide) indicate cases
    where information was not timely

32
Validation Test Period 2
33
Validation Test Period 2 (contd)
  • Mean accuracy of QWT estimation: 82%
  • Entries in red (previous slide) indicate cases
    where information was not timely

34
Validation Test Period 3
35
Validation Test Period 3 (contd)
  • Mean accuracy of QWT estimation: 72%
  • Entries in red (previous slide) indicate cases
    where information was not timely

36
What can we infer about the QWT estimates?
  • Accuracy
  • Accuracy is observed to be lower for higher
    degrees of parallelism
  • This is especially true during periods of high
    grid activity
  • Timeliness
  • Better for lower degrees of parallelism. High
    node-count probes experience long queue waits
    themselves
  • Adaptability
  • IRIS methodology allows it to adapt to the
    dynamic grid environment (E.g. Test Period 2,
    Cases 5 and 6)

37
Profiling User Activity
  • Users like to obtain a statistical summary of
    their past activity on a resource
  • Users can view the execution times recorded by
    their jobs in the past. If they submit a similar
    job, they can add the current QWT estimate to the
    execution time recorded in the past to get an
    approximate job turnaround time (see the sketch
    below)
  • Grid jobs are often similar, except that they use
    different data sets during each execution. The
    above approach holds if the execution times are
    similar.
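A sketch of that back-of-the-envelope estimate; the helper function and
the numbers are hypothetical:

  from statistics import mean

  def estimate_turnaround(past_exec_times, current_qwt_estimate):
      """Approximate turnaround for a job similar to ones the user ran before:
      current IRIS queue wait estimate + mean of past execution times (seconds)."""
      return current_qwt_estimate + mean(past_exec_times)

  # Three earlier runs of a similar job plus a 559 s queue wait estimate.
  print(estimate_turnaround([7100, 7250, 6980], 559))   # -> ~7669 s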

38
Sample User Profile
39
User Activity on UH-HPCC (1)
  • Six largest users of HPCC resources (in terms of
    jobs submitted)

6 users account for 58% of jobs submitted. The
other 58 users account for 42%.
40
Mean Execution/QWT observed for 6 major users of
UH-HPCC
Users A-F are the largest contributors of jobs on
HPCC
41
User Activity on UH-HPCC (2)
  • Six largest users of HPCC resources (in terms of
    QWT observed)

6 users account for 65% of the total queue wait. The
other 58 users account for 35%.
42
Grid Modeling (Month)
  • Essentially deals with modeling resource usage
  • Submission of jobs during various months of Year
    2003 (HPCC)

Number of jobs submitted is not a good indicator
of resource usage. We are more interested in the
execution times recorded by the user jobs!
43
Grid Modeling (Month) contd
Total Execution/QWT time (monthly)
Mean Execution/QWT per job (monthly)
Notice similar trends
44
Grid Modeling (Day)
  • Submission of jobs during various times of the
    day
  • Break 24 hours into eight 3-hour periods
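A minimal sketch of the binning step, assuming submission hours pulled
from the accounting log (the sample hours below are made up):

  from collections import Counter

  def period_of_day(hour):
      """Map an hour (0-23) to one of eight 3-hour periods: 0 = 00:00-02:59, ..."""
      return hour // 3

  # Illustrative submission hours; the real input is the 2003 accounting log.
  submission_hours = [1, 2, 9, 10, 10, 14, 22, 23, 23]
  print(Counter(period_of_day(h) for h in submission_hours))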

45
Grid Modeling (Day) contd
  • Analyzing Mean Execution and QWT during the day

46
Benefits of Grid Modeling
  • Administrators
  • Can decide when to schedule maintenance/upgrades
    with minimal disruption
  • Helps in accounting and bookkeeping operations
  • Users
  • Increases awareness amongst grid users
  • Can determine periods of lower activity to run
    their jobs, resulting in reduced QWT and higher
    resource availability

47
Future of IRIS
  • Interface with other RMSs such as PBS
  • Monitor several distributed resources, thereby
    servicing a larger set of users or users with
    accounts on different resources
  • Users can thus submit jobs on the resource
    with the least QWT (see the sketch below)
  • Meta-scheduling
  • These extensions will make IRIS more relevant to a
    grid environment
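A sketch of the simplest such meta-scheduling rule, assuming each monitored
site reports a QWT estimate for the desired request size (site names and
numbers are hypothetical):

  def pick_resource(qwt_by_resource):
      """Choose the monitored resource reporting the lowest estimated queue wait."""
      return min(qwt_by_resource, key=qwt_by_resource.get)

  # Hypothetical estimates for a 16-node request on three sites.
  print(pick_resource({"uh-hpcc": 900, "site-b": 300, "site-c": 1500}))   # -> site-b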

48
Future of IRIS
49
Points to Note
  • IRIS is not a generic information service that
    can be applied to any environment!
  • IRIS would have to be customized to suit the
    requirements of each of the local sites it is
    supposed to monitor.
  • Current implementation is designed to work on
    UH-HPCC.
  • Issues that impact the applicability of IRIS:
  • RMS scheduling algorithms used
  • Number of queues that are to be monitored
  • Number of users and what priorities they operate
    with

50
Thank You!
  • Please visit http://iris.cs.uh.edu to use IRIS or
    to learn more about it!
  • Archit Shivaprakash
  • archit@cs.uh.edu