Title: Queue Wait Estimation, User Activity Profiling and Resource Usage Modeling Using the Integrated Reso
1Queue Wait Estimation, User Activity Profiling
and Resource Usage Modeling Using the Integrated
Resource Information Service
http://iris.cs.uh.edu
- Archit Shivaprakash
- Department of Computer Science
- University of Houston
- archit_at_cs.uh.edu
2Grid Environments
- Collaborative in nature
- Permit the sharing of geographically distributed
resources
- Advantages: capital saving, better resource
utilization, fault tolerance, cooperation
amongst the user community
- Result: Competition amongst users for
available resources
- Environment is often dynamic and unpredictable
- Presence of an information service that informs
users about the state of their operating
environment can be very helpful!
3Globus Middleware
IRIS is an information service that can operate
in a grid environment. It is unrelated to Globus
in its current form.
4Information Services
- Types of Information
- Static: Information that seldom changes (e.g. OS
on resource)
- Dynamic: Information that changes continually (e.g.
Queue Lengths)
- Current State of Information Services
- Services like MDS are good at providing static
information
- Dynamic information services are a work in
progress!
- It is against such a backdrop that IRIS is
introduced as a dynamic information service for
grid environments.
5Job Execution on Grids
- Life-cycle of a grid job
- Submission by the User to a Resource Broker
- Broker decides which site is best for the job and
forwards it to the RMS for the chosen site
(Independent of IRIS)
- RMS checks for availability of requested
resources
- Job remains in wait state until the resources are
available
- Once resources are available, job executes and
returns
- Job Turnaround Time = Wait Time + Execution Time
- How much does queue wait contribute to the job
turnaround time?
- And is there any way of reducing it to improve
overall throughput?
6Queue Wait Problem
- Scenario at UH-HPCC during Year 2003
- Number of jobs submitted by users: 8929
- Total Execution time recorded: 57988 hours
- Total Queue Wait time recorded: 27925 hours
Queue Wait accounted for about half of the
recorded execution time, increasing job
turnaround time by a staggering 50%!
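The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the three numbers reported above; the function and key names are illustrative and not part of IRIS.

```python
# Figures reported on the slide for UH-HPCC, year 2003.
JOBS_SUBMITTED = 8929
TOTAL_EXECUTION_HOURS = 57988
TOTAL_QUEUE_WAIT_HOURS = 27925


def queue_wait_summary(jobs, exec_hours, wait_hours):
    """Relate total queue wait to execution time and to turnaround time."""
    turnaround_hours = exec_hours + wait_hours  # turnaround = wait + execution
    return {
        "wait_vs_execution": wait_hours / exec_hours,
        "wait_share_of_turnaround": wait_hours / turnaround_hours,
        "mean_wait_seconds_per_job": wait_hours * 3600 / jobs,
    }


summary = queue_wait_summary(
    JOBS_SUBMITTED, TOTAL_EXECUTION_HOURS, TOTAL_QUEUE_WAIT_HOURS
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Queue wait comes to roughly 48% of execution time, which is the "about half" quoted above, and to roughly a third of total turnaround time. The mean wait per job works out to about 11,259 seconds.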
7Analyzing the QWP
- Queue wait during months of Year 2003 @ HPCC
Average queue wait, Year 2003: 11,260 sec
Busy months like May and June see longer queue
waits than months like December, when user
activity is relatively low.
8Degree of Parallelism
- The whole point of grid environments is to
exploit the available resources while running
user jobs.
- Thus users strive for higher degrees of
parallelism.
- Is this always good? No!
- Why?
- Larger resource requests can lead to longer
queue waits, often offsetting the performance
gain the user was aiming to achieve through the
parallel execution of the task.
9Parallelism and QWP
- How is QWP affected by Parallelism of user jobs?
UH-HPCC Year 2003
Increase in queue wait for higher degrees of
parallelism is sufficiently evident!
10Problem Summary
- Queue wait time encountered by users is
significant
- This is especially true for higher degrees of
parallelism and during months of high grid
activity
- Solution
- An information service that helps users make
good resource requests while running their jobs.
- What do we mean by a Good Resource Request?
- The ability to maximize parallelism while
minimizing queue wait!
11Good Resource Request
- Mean QWT reported by IRIS on 12/18/2003 (UH-HPCC)
- NC 47 -> 559 seconds; NC 48 -> 3301 seconds
- A resource request of 48 nodes is not advisable, as
it increases queue wait nearly six-fold compared to 47!
12What is IRIS?
- Stands for Integrated Resource Information
Service
- It is a real-time information service for
distributed environments (grids)
- Aims at providing
- Dynamic information about the state of the
resource queues
- Statistical summaries of user activity on the
monitored resource (user activity profiling)
- A model of resource activity as a whole (meant for
administrators)
13IRIS Objectives
- Functional Objectives
- Provide users with queue wait estimates (primary
objective)
- Profile user activity with respect to a resource
- Model resource usage as a whole
- Supplementary Objectives
- Ease of use, ubiquitous, fault tolerant,
extensible, easy to maintain
- Many of these are required of any service (tool)
that operates in a grid environment.
14IRIS Development Testbed
- Uses the High Performance Computing Center (HPCC)
as a development testbed
- Thus, monitors the HPCC resources
- Interfaces with the SUN Grid Engine (SGE)
- HPCC is a non-preemptive environment and jobs do
not begin execution until all of the requested
resources are available
- Note: All data presented in the next few slides
was collected on UH-HPCC and pertains to the year
2003 unless otherwise mentioned
- The contents of this slide are relevant later on
when we discuss the future of IRIS
15IRIS Methodology
- Submit dummy probe jobs on the monitored resource
- Probes are submitted for all the possible user
requests
- Determine the queue wait encountered by IRIS
probes
- Report the QWT to the user
- Note
- The probe jobs impose no computational overheads
on the monitored resource.
- Additionally, a probe for a node count (say x) is
not resubmitted until the previous probe (for x)
returns.
- The above approach is Active in nature.
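The bookkeeping behind the rule "one outstanding probe per node count" can be sketched as follows. This is a minimal illustration, not the IRIS implementation: the class and method names are hypothetical, and timestamps are passed in explicitly instead of being read from a real resource management system.

```python
class ProbeTracker:
    """Tracks one outstanding probe per node count and the last observed QWT."""

    def __init__(self, node_counts):
        self.node_counts = list(node_counts)
        self.pending = {}     # node count -> submission timestamp (seconds)
        self.latest_qwt = {}  # node count -> last measured queue wait (seconds)

    def due_for_submission(self):
        """Node counts with no probe currently waiting in the queue."""
        return [nc for nc in self.node_counts if nc not in self.pending]

    def mark_submitted(self, nc, submit_time):
        """Record a new probe; refuse to resubmit while one is still queued."""
        if nc in self.pending:
            raise ValueError(f"probe for {nc} nodes is still queued")
        self.pending[nc] = submit_time

    def mark_started(self, nc, start_time):
        """Probe left the queue: its wait becomes the reported QWT for nc."""
        submit_time = self.pending.pop(nc)
        self.latest_qwt[nc] = start_time - submit_time
        return self.latest_qwt[nc]


tracker = ProbeTracker([1, 2, 4])  # hypothetical set of node counts
for nc in tracker.due_for_submission():
    tracker.mark_submitted(nc, submit_time=0)

print(tracker.due_for_submission())            # all probes are still queued
print(tracker.mark_started(2, start_time=30))  # the 2-node probe waited 30 s
print(tracker.due_for_submission())            # only the 2-node probe may be resubmitted
```

Because a probe is only resubmitted after the previous one for the same node count returns, the number of probe jobs in the queue never exceeds the number of distinct node counts being monitored.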
16Why Active Approach?
- Can we determine the QWT without submitting any
probe jobs on the resource? (Passive approach)
- In other words, can we estimate QWT based on the
queue waits encountered by user jobs in the recent
past?
- Passive approach is not viable because
- Job submission on grids is sporadic
- QWT is not consistent and can fluctuate widely
- Determining QWT for a resource specification is
challenging
- Requires the presence of a complex mathematical
function that could prove to be a bottleneck
during concurrent requests.
17Example of QWT Variation
- Observed on 13th October 2003 on UH-HPCC for
12-node jobs
18Active Approach Overhead
- Interestingly, the Active approach is not very
intrusive on the monitored resource.
0.045 0.009
This approach does cause some networking and
bookkeeping overheads. However, it is more viable
than the passive approach.
19IRIS Organization Model
20IRIS Functional Units
21IRIS Architecture
22IRIS Workflow
23IRIS Website
24Current Implementation of IRIS
- Uses the High Performance Computing Center (HPCC)
as a development testbed
- Thus, monitors the HPCC resources
- Interfaces with the SUN Grid Engine (SGE)
- HPCC is a non-preemptive environment and jobs do
not begin execution until all of the requested
resources are available
- Note: All data presented in the next few slides
was collected on UH-HPCC and pertains to the year
2003 unless otherwise mentioned
- The contents of this slide are relevant later on
when we discuss the future of IRIS
25IRIS Results - 1
- Analyzing relationship between QWT and Degree of
Parallelism
Mean QWT reported by IRIS on 12/18/2003
(UH-HPCC) Note the Occurrence of Surge Thresholds
(ST)
26IRIS Results 1 (contd)
- The relationship between QWT and Node Count can
be divided into multiple linear segments
- At the boundaries of these segments, we see an
exponential increase in QWT with a unit increase
in Node Count
- E.g. NC 47 -> 559 seconds; NC 48 -> 3301 seconds
- The points at which there is an exponential
increase in QWT are termed Surge Thresholds
- For Maximum Parallelism with Minimal Penalty
(Queue-wait)
- User resource specification should be at the inner
boundary of the ST
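One way a user could locate a surge threshold in the reported QWT values is sketched below. This is an illustrative assumption, not something IRIS does: the ratio test, the `surge_factor` value and the function names are all hypothetical. Only the values for 47 and 48 nodes come from the slides; the entries for 46 and 49 nodes are made up for the example.

```python
def find_surge_thresholds(mean_qwt_by_nc, surge_factor=3.0):
    """Return node counts whose QWT is at least surge_factor times the QWT
    of the next lower probed node count (assumed definition of a surge)."""
    node_counts = sorted(mean_qwt_by_nc)
    surges = []
    for prev_nc, nc in zip(node_counts, node_counts[1:]):
        prev_qwt = mean_qwt_by_nc[prev_nc]
        if prev_qwt > 0 and mean_qwt_by_nc[nc] / prev_qwt >= surge_factor:
            surges.append(nc)
    return surges


def inner_boundary(mean_qwt_by_nc, surge_nc):
    """Largest probed node count below a surge threshold, or None."""
    below = [nc for nc in mean_qwt_by_nc if nc < surge_nc]
    return max(below) if below else None


# 47 and 48 are the values quoted on the slide; 46 and 49 are hypothetical.
mean_qwt = {46: 540, 47: 559, 48: 3301, 49: 3350}

surges = find_surge_thresholds(mean_qwt)
print(surges)                               # node counts where QWT jumps
print(inner_boundary(mean_qwt, surges[0]))  # request size just below the jump
```

With these numbers the jump is detected at 48 nodes, so a request of 47 nodes sits at the inner boundary.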
27IRIS Results 1 (contd)
- It is important to note that Surge Thresholds are
dynamic in nature and they are largely dependent
on the jobs currently executing and the resources
allocated to them by the RMS
- Does IRIS inform users about the Node count at
which the QWT surges?
- No
- It presents the QWT for all the possible resource
requests and counts on the user to make an
informed decision.
IRIS is touted to be a best-effort information
service
28IRIS Results - 2
- Analyzing the relationship between QWT and degree
of parallelism over periods of time (Month of
December 2003)
Results similar to what we saw over the 24-hour
period previously.
29Validating IRIS Results
- Compare the actual and projected QWT over three
test periods
- 10 random samples considered during each of the
test periods
- Test Period 1: 15-16 November 2003 / 2 Days
- Test Period 2: 26-28 November 2003 / 3 Days
- Test Period 3: 09-12 December 2003 / 4 Days
- Evaluation Metrics
- Accuracy
- Timeliness
- Adaptability
Error (+/-) = Est. QWT - Wait Time (Actual)
Accuracy = (1 - ΣError / ΣActual Wait Time)
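The accuracy metric above can be computed as in the sketch below. Two points are assumptions rather than statements from the slides: errors are summed as absolute values so that over- and under-estimates do not cancel, and the sample wait times are invented purely to show the calculation.

```python
def qwt_accuracy(estimated, actual):
    """Accuracy = 1 - (sum of errors / sum of actual wait times).

    Assumption: each error is taken as an absolute value before summing.
    """
    if len(estimated) != len(actual):
        raise ValueError("need one estimate per actual wait time")
    total_error = sum(abs(est - act) for est, act in zip(estimated, actual))
    return 1 - total_error / sum(actual)


# Hypothetical samples, in seconds (not data from the validation periods).
estimated_qwt = [500, 900, 3000]
actual_wait = [550, 1000, 3300]

accuracy = qwt_accuracy(estimated_qwt, actual_wait)
print(f"accuracy: {accuracy:.1%}")
```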
30Validation Test Period 1
31Validation Test Period 1 (contd)
- Mean Accuracy of QWT Estimation: 89%
- Entries in red (previous slide) indicate cases
where information was not timely
32Validation Test Period 2
33Validation Test Period 2 (contd)
- Mean Accuracy of QWT Estimation: 82%
- Entries in red (previous slide) indicate cases
where information was not timely
34Validation Test Period 3
35Validation Test Period 3 (contd)
- Mean Accuracy of QWT Estimation: 72%
- Entries in red (previous slide) indicate cases
where information was not timely
36What can we infer about QWT estimates?
- Accuracy
- Accuracy is observed to be lower for higher
degrees of parallelism
- This is especially true during periods of high
grid activity
- Timeliness
- Better for lower degrees of parallelism. High
node-count probes experience long queue waits
themselves
- Adaptability
- IRIS methodology allows it to adapt to the
dynamic grid environment (e.g. Test Period 2,
Cases 5 & 6)
37Profiling User Activity
- Users like to obtain a statistical summary of
their past activity on a resource
- Users can view execution times encountered by
their jobs in the past. If they submit a similar
job, they can add the QWT to the execution time
(recorded in the past) to get the approximate job
turnaround time
- Grid jobs are often similar with the exception
that they use different data sets during each
execution. The above estimate holds if the
execution times are similar.
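The turnaround estimate described above amounts to adding the current QWT to a typical past execution time. The sketch below is one possible reading: it uses the mean of past execution times and flags whether the history is consistent enough for the estimate to be meaningful. The coefficient-of-variation check, the `max_cv` cutoff and all names are assumptions. The 559-second QWT is the value quoted earlier for 47 nodes, while the execution times are hypothetical.

```python
from statistics import mean, pstdev


def estimate_turnaround(current_qwt_s, past_exec_times_s, max_cv=0.25):
    """Estimate turnaround as QWT plus mean past execution time (seconds)."""
    avg_exec = mean(past_exec_times_s)
    # Coefficient of variation: spread of past runs relative to their mean.
    cv = pstdev(past_exec_times_s) / avg_exec if avg_exec > 0 else float("inf")
    return {
        "turnaround_s": current_qwt_s + avg_exec,
        "history_is_consistent": cv <= max_cv,
    }


past_runs = [3600, 3500, 3700]  # hypothetical execution times of similar jobs
estimate = estimate_turnaround(559, past_runs)
print(estimate)
```

If past execution times vary widely, the consistency flag is false and the estimate should be treated with caution.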
38Sample User Profile
39User Activity on UH-HPCC (1)
- Six largest users of HPCC resources (in terms of
jobs submitted)
6 users account for 58% of jobs submitted. The other
58 users account for 42%.
40Mean Execution/QWT observed for 6 major users of
UH-HPCC
Users A-F are the largest contributors of jobs on
HPCC
41User Activity on UH-HPCC (2)
- Six largest users of HPCC resources (in terms of
QWT observed)
6 users account for 65% of total QWP. The other 58
users account for 35%.
42Grid Modeling (Month)
- Essentially deals with modeling resource usage
- Submission of jobs during various months of Year
2003 (HPCC)
Number of jobs submitted is not a good indicator
of resource usage. We are more interested in the
execution times recorded by the user jobs!
43Grid Modeling (Month) contd
Total Execution/QWT time (monthly)
Mean Execution/QWT per job (monthly)
Notice similar trends
44Grid Modeling (Day)
- Submission of jobs during various times of the
day
- Break 24 hours into eight 3-hour periods
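Splitting the day into eight 3-hour periods can be done by integer-dividing the submission hour by three. The sketch below illustrates that binning; the function names and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def period_of_day(timestamp):
    """Map a submission time to one of eight 3-hour periods (0..7)."""
    return timestamp.hour // 3


def period_label(period):
    """Human-readable label for a period index, e.g. 0 -> '00:00-03:00'."""
    start = period * 3
    return f"{start:02d}:00-{start + 3:02d}:00"


def jobs_per_period(submit_times):
    """Count submitted jobs in each of the eight periods."""
    counts = Counter(period_of_day(t) for t in submit_times)
    return [counts[p] for p in range(8)]


# Hypothetical submission times for illustration only.
sample_submissions = [
    datetime(2003, 12, 18, 0, 15),
    datetime(2003, 12, 18, 2, 59),
    datetime(2003, 12, 18, 3, 0),
    datetime(2003, 12, 18, 23, 40),
]

for period, count in enumerate(jobs_per_period(sample_submissions)):
    print(period_label(period), count)
```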
45Grid Modeling (Day) contd
- Analyzing Mean Execution and QWT during the day
46Benefits of Grid Modeling
- Administrators
- Can decide periods to schedule maintenance/upgrades
with minimal disruption
- Helps in accounting and bookkeeping operations
- Users
- Increases awareness amongst grid users
- Can determine periods of lower activity to run
their jobs, resulting in reduced QWT and higher
resource availability
47Future of IRIS
- Interface with other RMS like PBS
- Monitor several distributed resources, thereby
servicing a larger set of users or users with
accounts on different resources
- Users can thus submit jobs on resources
with the least QWT
- Meta-scheduling
- Extensions will make IRIS more relevant to a
grid environment
48Future of IRIS
49Points to Note
- IRIS is not a generic information service that
can be applied to any environment!
- IRIS would have to be customized to suit the
requirements of each of the local sites it is
supposed to monitor.
- Current implementation is designed to work on
UH-HPCC.
- Issues that impact the applicability of IRIS
- RMS Scheduling Algorithms used
- Number of queues that are to be monitored
- Number of users and what priorities they operate
with
50Thank You!
- Please visit http://iris.cs.uh.edu to use IRIS or
to learn more about it!
- Archit Shivaprakash
- archit_at_cs.uh.edu