1
Queue Wait Estimation, User Activity Profiling
and Resource Usage Modeling Using the Integrated
Resource Information Service
http://iris.cs.uh.edu

Archit Shivaprakash
Department of Computer Science
University of Houston
archit@cs.uh.edu

2
Grid Environments

Collaborative in nature
Permit the sharing of geographically distributed
resources
Advantages: capital savings, better resource
utilization, fault tolerance, cooperation
among the user community
Result: competition among users for
available resources
Environment is often dynamic
and unpredictable
Presence of an information service that informs
users about the state of their operating
environment can be very helpful!

3
Globus Middleware
IRIS is an information service that can operate
in a grid environment. It is unrelated to Globus
in its current form.
4
Information Services

Types of Information
Static: information that seldom changes (e.g., OS
on a resource)
Dynamic: information that changes continually (e.g.,
queue lengths)
Current State of Information Services
Services like MDS are good at providing static
information
Dynamic information services are a work in
progress!
It is against such a backdrop that IRIS is
introduced as a dynamic information service for
grid environments.

5
Job Execution on Grids

Life-cycle of a grid job
Submission by the User to a Resource Broker
Broker decides which site is best for job and
forwards it to the RMS for the chosen site
(Independent of IRIS)
RMS checks for availability of requested
resources
Job remains in wait state until the resources are
available
Once resources are available, job executes and
returns
Job Turnaround Time = Wait Time + Execution Time
How much does queue wait contribute to the job
turnaround time?
And is there any way of reducing it to improve
overall throughput?

6
Queue Wait Problem

Scenario at UH-HPCC during Year 2003
Number of jobs submitted by users: 8929
Total execution time recorded: 57988
hours
Total queue wait time recorded: 27925 hours

Queue wait amounted to about half of the
recorded execution time, increasing job
turnaround time by nearly 50%!
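
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the three totals quoted above; the variable names are illustrative. It gives a wait-to-execution ratio of roughly 48% and a mean wait per job close to the yearly average quoted on the next slide.

```python
# Totals quoted on the slide for UH-HPCC, year 2003.
jobs_submitted = 8929
execution_hours = 57988
queue_wait_hours = 27925

# Queue wait as a fraction of recorded execution time.
wait_to_exec = queue_wait_hours / execution_hours

# Turnaround = wait + execution, so the relative increase in turnaround
# caused by queueing is the same ratio.
turnaround_hours = queue_wait_hours + execution_hours
turnaround_increase = turnaround_hours / execution_hours - 1

# Mean queue wait per job, converted from hours to seconds.
mean_wait_seconds = queue_wait_hours * 3600 / jobs_submitted

print(f"wait / execution: {wait_to_exec:.1%}")
print(f"turnaround increase: {turnaround_increase:.1%}")
print(f"mean wait per job: {mean_wait_seconds:.0f} s")
```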
7
Analyzing the QWP

Queue wait during months of Year 2003 at HPCC

Average queue wait, Year 2003: 11,260 sec
Busy months like May and June witness
comparatively longer queue waits as compared to
months like December where user activity is
relatively lower.
8
Degree of Parallelism

The whole point of grid environments is to
exploit the available resources while running
user jobs.
Thus users strive for higher degrees of
parallelism.
Is this always good? No!
Why?
Larger resource requests can lead to longer
queue waits, often offsetting the performance
gain the user was aiming to achieve through the
parallel execution of his task.

9
Parallelism and QWP

How is QWP affected by Parallelism of user jobs?

UH-HPCC Year 2003
Increase in queue wait for higher degrees of
parallelism is sufficiently evident!
10
Problem Summary

Queue wait time encountered by users is
significant
This is especially true for higher degrees of
parallelism and during months of high grid
activity
Solution
An information service that helps users make
good resource requests while running their jobs.
What do we mean by Good Resource Request?
The ability to maximize parallelism while
minimizing queue wait!

11
Good Resource Request

Mean QWT reported by IRIS on 12/18/2003 (UH-HPCC)
NC = 47 -> 559 seconds; NC = 48 -> 3301 seconds
A resource request of 48 nodes is not advisable, as
it increases queue wait nearly six-fold compared to 47!

12
What is IRIS?

Stands for Integrated Resource Information
Service
It is a real-time information service for
distributed environments (grids)
Aims at providing
Dynamic information about the state of the
resource queues
Statistical summaries of user activity on the
monitored resource (user activity profiling)
Modeling resource activity as a whole (meant for
administrators)

13
IRIS Objectives

Functional Objectives
Provide users with queue wait estimates (primary
objective)
Profile user activity with respect to a resource
Model resource usage as a whole
Supplementary Objectives
Ease of use, ubiquitous, fault tolerant,
extensible, easy to maintain
Many of these are required of any service (tool)
that operates in a
grid environment.

14
IRIS Development Testbed

Uses the High Performance Computing Center (HPCC)
as a development testbed
Thus, monitors the HPCC resources
Interfaces with the SUN Grid Engine (SGE)
HPCC is a non-preemptive environment and jobs do
not begin execution until all of the requested
resources are available
Note: All data presented in the next few slides
was collected on UH-HPCC and pertains to the year
2003 unless otherwise mentioned
The contents of this slide are relevant later on
when we discuss the future of IRIS

15
IRIS Methodology

Submit dummy probe jobs on the monitored resource
Probes are submitted for all the possible user
requests
Determine the queue wait encountered by IRIS
probes
Report the QWT to the user
Note
The probe jobs impose no computational overheads
on the
monitored resource.
Additionally, a probe for a node count (say x) is
not resubmitted
until the previous probe (for x) returns.
The above approach is Active in nature.
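
The probing rules above can be sketched as a small bookkeeping class. This is a minimal illustration, not the IRIS implementation: `ProbeTracker` and its method names are hypothetical, timestamps are plain seconds, and because a probe does no real work, the moment it returns is assumed to approximate the moment it left the queue.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeTracker:
    """Tracks at most one outstanding probe per node count (hypothetical helper)."""

    submitted_at: dict = field(default_factory=dict)  # node count -> submit time (s)
    latest_wait: dict = field(default_factory=dict)   # node count -> last observed wait (s)

    def try_submit(self, node_count: int, now: float) -> bool:
        """Record a probe submission unless one is still pending for this node count."""
        if node_count in self.submitted_at:
            return False  # the previous probe for this node count has not returned
        self.submitted_at[node_count] = now
        return True

    def probe_returned(self, node_count: int, now: float) -> float:
        """Store and return the queue wait of the probe that just returned."""
        wait = now - self.submitted_at.pop(node_count)
        self.latest_wait[node_count] = wait
        return wait


tracker = ProbeTracker()
for nc in (1, 2, 4):
    tracker.try_submit(nc, now=0.0)

resubmitted = tracker.try_submit(2, now=10.0)    # refused: probe for 2 nodes still pending
wait_for_2 = tracker.probe_returned(2, now=30.0)  # probe came back after 30 s
accepted = tracker.try_submit(2, now=31.0)        # a new probe for 2 nodes is allowed again
```

The `latest_wait` mapping is what a service of this kind would report to users as the current QWT per node count.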

16
Why Active Approach?

Can we determine the QWT by not submitting any
probe jobs on
the resource? (Passive approach)
In other words, can we estimate QWT based on the
queue waits
encountered by user jobs in the recent past?
Passive approach is not viable because
Job submission on grids is sporadic
QWT is not consistent and can fluctuate widely
Determining QWT for a resource specification is
challenging
Requires the presence of a complex mathematical
function that could prove to be a bottleneck
during concurrent requests.

17
Example of QWT Variation

Observed on 13th October 2003 on UH-HPCC for
12-node jobs

18
Active Approach Overhead

Interestingly, the Active approach is not very
intrusive on the monitored resource.

0.045 0.009
This approach does cause some networking and
bookkeeping overheads. However, it is more viable
than the passive approach.
19
IRIS Organization Model
20
IRIS Functional Units
21
IRIS Architecture
22
IRIS Workflow
23
IRIS Website

http://iris.cs.uh.edu/

24
Current Implementation of IRIS

Uses the High Performance Computing Center (HPCC)
as a development testbed
Thus, monitors the HPCC resources
Interfaces with the SUN Grid Engine (SGE)
HPCC is a non-preemptive environment and jobs do
not begin execution until all of the requested
resources are available
Note: All data presented in the next few slides
was collected on UH-HPCC and pertains to the year
2003 unless otherwise mentioned
The contents of this slide are relevant later on
when we discuss the future of IRIS

25
IRIS Results - 1

Analyzing relationship between QWT and Degree of
Parallelism

Mean QWT reported by IRIS on 12/18/2003
(UH-HPCC) Note the Occurrence of Surge Thresholds
(ST)
26
IRIS Results - 1 (contd)

The relationship between QWT and Node Count can
be divided into multiple linear segments
At the boundaries of these segments, we see
exponential increase in QWT with a unit increase
in Node Count
E.g. NC = 47 -> 559 seconds; NC = 48 ->
3301 seconds
The points at which there is an exponential
increase in QWT are termed Surge Thresholds
For Maximum Parallelism with Minimal Penalty
(Queue-wait)
User resource specification should be at inner
boundary of ST
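
One way to locate Surge Thresholds in a table of mean QWT per node count is to flag any unit increase in node count that multiplies the QWT by some factor. This is only a sketch: the function name and the cutoff `factor=3.0` are assumptions, and only the values for 47 and 48 nodes come from the slide.

```python
def find_surge_thresholds(qwt_by_node_count, factor=3.0):
    """Return node counts whose QWT is at least `factor` times that of the
    immediately preceding node count. `factor` is an illustrative cutoff."""
    surges = []
    counts = sorted(qwt_by_node_count)
    for prev, cur in zip(counts, counts[1:]):
        prev_qwt = qwt_by_node_count[prev]
        cur_qwt = qwt_by_node_count[cur]
        # Only compare adjacent node counts (a unit increase).
        if cur == prev + 1 and prev_qwt > 0 and cur_qwt >= factor * prev_qwt:
            surges.append(cur)
    return surges


# Mean QWT in seconds. Only 47 and 48 are from the slide; the rest are made up.
mean_qwt = {45: 540, 46: 550, 47: 559, 48: 3301, 49: 3320}

surges = find_surge_thresholds(mean_qwt)
# The inner boundary of each threshold is the largest request before the surge.
best_requests = [nc - 1 for nc in surges]
print(surges, best_requests)
```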

27
IRIS Results - 1 (contd)

It is important to note that Surge Thresholds are
dynamic in nature and they are largely dependent
on the jobs currently executing and the resources
allocated to them by the RMS
Does IRIS inform users about the Node count at
which the QWT
surges? - No
It presents the QWT for all the possible resource
requests and
counts on the user to make an informed decision.

IRIS is intended to be a best-effort information
service
28
IRIS Results - 2

Analyzing the relationship between QWT and degree
of parallelism over periods of time (Month of
December 2003)

Results similar to what we saw over the 24-hour
period previously.
29
Validating IRIS Results

Compare the actual and projected QWT over three
test periods
10 random samples considered during each of the
test periods
Test Period 1: 15-16 November 2003 (2 days)
Test Period 2: 26-28 November 2003 (3 days)
Test Period 3: 09-12 December 2003 (4 days)
Evaluation Metrics
Accuracy
Timeliness
Adaptability

Error (+/-) = Est. QWT - Wait Time (Actual)
Accuracy = 1 - (Σ Error / Σ Actual Wait Time)
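
The accuracy metric on this slide can be computed as follows. This sketch reads the metric as one minus the summed error divided by the summed actual wait time, and it assumes the errors are summed as absolute values so that over- and under-estimates do not cancel; the slide does not state this explicitly. The sample values are hypothetical.

```python
def qwt_accuracy(estimated, actual):
    """Accuracy = 1 - sum(|estimate - actual|) / sum(actual), over paired samples.

    Summing absolute errors is an assumption, not something the slides state.
    """
    if len(estimated) != len(actual):
        raise ValueError("estimated and actual must have the same length")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("total actual wait time must be non-zero")
    total_error = sum(abs(e - a) for e, a in zip(estimated, actual))
    return 1 - total_error / total_actual


# Hypothetical samples in seconds, not data from the validation slides.
estimated = [100, 250, 900]
actual = [120, 200, 1000]
accuracy = qwt_accuracy(estimated, actual)
print(f"accuracy: {accuracy:.1%}")
```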
30
Validation Test Period 1
31
Validation Test Period 1 (contd)

Mean Accuracy of QWT Estimation: 89%
Entries in red (previous slide) indicate cases
where information was not timely

32
Validation Test Period 2
33
Validation Test Period 2 (contd)

Mean Accuracy of QWT Estimation: 82%
Entries in red (previous slide) indicate cases
where information was not timely

34
Validation Test Period 3
35
Validation Test Period 3 (contd)

Mean Accuracy of QWT Estimation: 72%
Entries in red (previous slide) indicate cases
where information was not timely

36
What can we infer about QWT estimates?

Accuracy
Accuracy is observed to be lower for higher
degrees of parallelism
This is especially true during periods of high
grid activity
Timeliness
Better for lower degrees of parallelism. High
node-count probes experience long queue wait
themselves
Adaptability
The IRIS methodology allows it to adapt to the
dynamic grid environment (e.g., Test Period 2,
Cases 5 and 6)

37
Profiling User Activity

Users like to obtain a statistical summary of
their past activity on a resource
Users can view execution times encountered by
their jobs in the past. If a similar job is
submitted by them, they can add the QWT and the
execution time (recorded in the past) and get the
approximate job turnaround time
Grid jobs are often similar with the exception
that they use different data sets during each
execution. The above theory holds true if the
execution times are similar.
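
The turnaround estimate described above amounts to adding the current QWT to a typical past execution time. The sketch below uses the mean of past runs as the "typical" value, which is an assumption; the function name and the execution times are hypothetical, while 559 seconds is the mean QWT quoted earlier for 47 nodes.

```python
from statistics import mean


def estimate_turnaround(current_qwt, past_execution_times):
    """Approximate turnaround (s) as current QWT plus mean past execution time."""
    if not past_execution_times:
        raise ValueError("need at least one past execution time")
    return current_qwt + mean(past_execution_times)


# Execution times of similar past jobs, in seconds (made up for illustration).
past_runs = [3600, 3700, 3500]
turnaround = estimate_turnaround(559, past_runs)
print(f"estimated turnaround: {turnaround} s")
```

The estimate is only as good as the similarity of the past runs, which is the caveat the slide itself raises.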

38
Sample User Profile
39
User Activity on UH-HPCC (1)

Six largest users of HPCC resources (in terms of
jobs submitted)

6 users account for 58% of jobs submitted. The other
58 users account for 42%.
40
Mean Execution/QWT observed for 6 major users of
UH-HPCC
Users A-F are the largest contributors of jobs on
HPCC
41
User Activity on UH-HPCC (2)

Six largest users of HPCC resources (in terms of
QWT observed)

6 users account for 65% of total QWP. The other 58
users account for 35%.
42
Grid Modeling (Month)

Essentially deals with modeling resource usage
Submission of jobs during various months of Year
2003 (HPCC)

Number of jobs submitted is not a good indicator
of resource usage. We are more interested in the
execution times recorded by the user jobs!
43
Grid Modeling (Month) contd
Total Execution/QWT time (monthly)
Mean Execution/QWT per job (monthly)
Notice similar trends
44
Grid Modeling (Day)

Submission of jobs during various times of the
day
Break 24 hours into eight 3-hour periods
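
Bucketing submissions into eight 3-hour periods can be sketched as below. The function names and timestamps are hypothetical; period 0 covers 00:00 to 02:59 and period 7 covers 21:00 to 23:59.

```python
from collections import Counter
from datetime import datetime


def three_hour_period(ts: datetime) -> int:
    """Map a timestamp to one of eight periods (0 = 00:00-02:59, ..., 7 = 21:00-23:59)."""
    return ts.hour // 3


def jobs_per_period(submit_times):
    """Count submissions in each of the eight 3-hour periods of the day."""
    counts = Counter(three_hour_period(ts) for ts in submit_times)
    return [counts.get(period, 0) for period in range(8)]


# Hypothetical submission timestamps.
submits = [
    datetime(2003, 12, 18, 0, 15),
    datetime(2003, 12, 18, 2, 59),
    datetime(2003, 12, 18, 14, 30),
    datetime(2003, 12, 18, 23, 5),
]
histogram = jobs_per_period(submits)
print(histogram)
```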

45
Grid Modeling (Day) contd

Analyzing Mean Execution and QWT during the day

46
Benefits of Grid Modeling

Administrators
Can decide periods to schedule maintenance/upgrades
with minimal disruption
Helps in accounting and bookkeeping operations
Users
Increases awareness amongst grid users
Can determine periods of lower activity to run
their jobs, resulting in reduced QWT and higher
resource availability

47
Future of IRIS

Interface with other RMS like PBS
Monitor several distributed resources, thereby
servicing larger set of users or users with
accounts on different resources
Users can thus submit jobs on resources
with least QWT
Meta-scheduling
Extensions will make IRIS more relevant to a
grid environment

48
Future of IRIS
49
Points to Note

IRIS is not a generic information service that
can be applied to any environment!
IRIS would have to be customized to suit the
requirements of each of the local sites it is
supposed to monitor.
Current implementation is designed to work on
UH-HPCC.
Issues that impact the applicability of IRIS
RMS Scheduling Algorithms used
Number of queues that are to be monitored
Number of users and what priorities they operate
with

50
Thank You!