Workload Management

Title:

Workload Management

Description:

This decision has been praised by Bob Jones and Frederic Hemmer. ... Tiziana & Costas. 5th July 2005. Workload Management. David Colling, Imperial College London ...

Provided by: owenma

1
Workload Management
  • Status of current activity
  • GridPP 13, Durham, 6th July 2005

2
Activity
  • Scalability testing
  • Analysis of current middleware performance
  • SGE integration
  • GridCC

3
Scalability Testing
People involved: Janusz Martyniak, Luke Dickens,
Barry MacEvoy, Steve McGough, David Colling
4
Scalability Testing
Why
  • From EDG we knew that it was easy to build a
    system capable of running 5 jobs concurrently.
  • Not so easy to build one capable of running 500
    jobs or 5000 jobs concurrently.
  • The plan was to perform testing to find software
    bottlenecks and hot spots
  • Feed the results back to the developers in a
    virtuous circle

5
Scalability Testing
  • The methodology
  • Original plan was to build a testbed across 2
    sites (Imperial HEP and LeSC). This was
    deliverable X.Y
  • Take an engineering approach, i.e. submit tests
    to the testbed and monitor how the different
    components respond.
  • Metrics to be tested to evolve in complexity as
    the stability grew.

6
Scalability Testing
  • What happened
  • Decided to join the JRA1 testbed instead of
    forming our own. This gave us better access to
    the developers and much support on other parts of
    the system that we were not directly testing but
    which are needed to run the tests, e.g. VOMS and
    RGMA. It also made a contribution to the wider
    community. This decision has been praised by Bob
    Jones and Frederic Hemmer.
  • Still decided to run two sites (as per
    deliverable) as this gave a better testing
    environment for scalability tests

7
Scalability Testing
  • What happened
  • We were delayed by the late release of the WMS in
    EGEE
  • However, we have had two sites in JRA1 testing since
    immediately after the Athens meeting. The two
    sites are maintained by JM and LD and they
    consist of:
  • Machines: 1 WMS, 2 CEs (1), 2 WNs (1)
  • Install: apt
  • Config: Site
  • Version: R1.1 (QF78)

  • Machines: 1 WMS, 1 CE, 2 WNs, 1 RGMA Server, 1 IO Server, 1 UI
  • Install: Manual
  • Config: Site (mostly)
  • Version: R1.1

Site 2
Site 1
8
Scalability Testing
  • To add to these sites
  • SEs
  • VOMS
  • Second RGMA server (to complete split)

9
Scalability Testing
  • Actual testing
  • Only really started writing scalability tests a
    couple of weeks ago
  • Have defined some basic metrics
  • Time to submit as a function of number of jobs
    for serial submission
  • Time to submit for parallel submission
  • Failure rates as function of active jobs
  • etc
  • Use LB database and system monitoring on WMS node
    to reconstruct what is going on
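
A minimal sketch of how two of the basic metrics above could be computed once job timestamps have been extracted. The `JobRecord` type and its field names are hypothetical and are not the LB database schema; the failure rate shown is an overall rate for a sample, not yet the rate as a function of active jobs.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class JobRecord:
    # Hypothetical record; field names are illustrative, not the LB schema.
    job_id: str
    submit_start: float  # seconds: when submission of this job began
    registered: float    # seconds: when the job was registered
    final_state: str     # e.g. "Done" or "Aborted"


def cumulative_submit_time(jobs: Sequence[JobRecord]) -> List[Tuple[int, float]]:
    """Return (number of jobs registered, elapsed seconds) pairs.

    Elapsed time is measured from the start of the first submission,
    which gives time-to-submit as a function of number of jobs.
    """
    if not jobs:
        return []
    t0 = min(j.submit_start for j in jobs)
    ordered = sorted(jobs, key=lambda j: j.registered)
    return [(i + 1, j.registered - t0) for i, j in enumerate(ordered)]


def failure_rate(jobs: Sequence[JobRecord],
                 failed_states: Tuple[str, ...] = ("Aborted",)) -> float:
    """Fraction of jobs whose final state counts as a failure."""
    if not jobs:
        return 0.0
    failed = sum(1 for j in jobs if j.final_state in failed_states)
    return failed / len(jobs)


# Made-up sample data, for illustration only.
sample = [
    JobRecord("a", 0.0, 2.0, "Done"),
    JobRecord("b", 2.0, 5.0, "Aborted"),
    JobRecord("c", 5.0, 6.5, "Done"),
]
curve = cumulative_submit_time(sample)
rate = failure_rate(sample)
```

The same curve computed for serial and for parallel submission would allow the two submission modes to be compared directly.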

10
Scalability Testing
  • So, 100 simple jobs submitted sequentially
  • Results are preliminary
  • Example of what we are trying to do
  • Bypassed known problems, especially cross matching
  • Summary

11
Scalability Testing
Summary
  • 28 Success
  • 53 Proxy expired (12 hours after the jobs were submitted!)
  • 3 Aborted due to reaching retry count
  • 16 Ready state
In this sample the greatest source of failure is
CondorC.
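
As a quick check of the breakdown above, the four outcome counts can be tallied and turned into fractions of the sample. This is only an illustrative sketch: the counts are the ones quoted on the slide, while the dictionary keys and the `fractions` helper are names chosen here.

```python
from collections import Counter
from typing import Dict, Mapping

# Outcome counts as quoted on the slide for the 100-job sample.
outcomes = Counter({
    "Success": 28,
    "Proxy expired": 53,
    "Aborted (retry count reached)": 3,
    "Ready state": 16,
})


def fractions(counts: Mapping[str, int]) -> Dict[str, float]:
    """Return each outcome as a fraction of all jobs in the sample."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


total_jobs = sum(outcomes.values())
shares = fractions(outcomes)
for state, share in shares.items():
    print(f"{state}: {share:.0%}")
```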
12
Scalability testing
100 jobs submitted sequentially
All registered in 3 minutes
13
Scalability Testing
Long tail of retries
Greatest number < 5000 s (Excel binning)
14
Scalability Testing
100 jobs submitted sequentially
Can plot for individual or groups of processes
Still activity 1 hour later
5 Minutes
15
Scalability Testing
  • Future Plans
  • Automate testing scripts
  • Output directed to web-pages
  • Expand metrics as appropriate

16
Performance of middleware
  • We have access to the job data through the LB
    databases, so why not have a look?
  • People involved: Gidon Moont and David Colling

17
Performance of middleware
Long tail
18
Performance of middleware
[Plot; axis labels: Number of entries, Efficiency, RunTime (s)]
19
Performance of middleware
  • Future plans
  • Keep monitoring this across different releases
  • Low level activity
  • Feedback into JRA2

20
SGE Porting
  • People involved: David McBride, Mona Aggarwal and
    Owen Maroney

21
SGE Porting
LCG Integration with Sun Grid Engine (SGE)
  • Wish to add LCG as an additional entry point for
    our existing SGE cluster
  • Problem: the LCG installation assumes the use of PBS
    as the cluster management system.
  • Solution: replace PBS-specific components with
    SGE-specific components.

22
SGE Porting
PBS-specific components in LCG (that need
replacing)
  • Globus JobManager
  • Already have an existing alternative Globus
    JobManager for Sun Grid Engine to replace lcgpbs
    version.
  • Implemented in Perl, well understood.
  • Supports 5.x, 6.x revisions of SGE.
  • Currently installed, about to enter the first run
    of testing as part of an LCG CE installation.

23
SGE Porting
PBS-specific components in LCG (that need
replacing)
  • Information Reporter
  • Have developed first-pass attempt at an SGE
    information reporter.
  • Again, developed in Perl, small, relatively
    straightforward. (Existing PBS code wasn't very
    clear, but GLUE Schema is public.)
  • Installed on site CE, about to enter first run of
    validation and iterative improvement.

24
SGE Porting
PBS-specific components in LCG (that need
replacing)
  • Accounting (APEL)
  • APEL accounting uses PBS event logs.
  • SGE does have advanced accounting records, but they
    are not stored in the same format as PBS!
  • Existing Java-based tooling seems large and
    complex for what should be a fairly
    straightforward task; it is not obvious where
    changes could/should be made.
  • Refactored version exists in gLite, but would
    still require new implementation of SGE-specific
    backend.
  • Using updated gLite revision on site may well
    work, but would introduce manageability issues at
    upgrade-time.
  • Currently wondering whether APEL can simply be
    replaced with a small Perl script(!) Currently
    looking for documentation on the APEL/R-GMA
    reporting interface.
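
To illustrate why reading the SGE accounting records might be a small task, here is a hedged sketch in Python (not the Perl script mentioned above, and not APEL code). It assumes the colon-separated record layout described in the SGE accounting(5) manual page, and the field order should be checked against the installed SGE version. The sample lines are made up, and the sketch covers only parsing, not the APEL/R-GMA reporting side.

```python
from typing import Dict, Iterable, Iterator

# Leading fields of an SGE accounting record, as described in accounting(5).
# Assumption: verify the order against the man page of the installed version.
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]


def parse_accounting(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per accounting record, skipping comments and short lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(":")
        if len(parts) < len(FIELDS):
            continue  # malformed or truncated record
        yield dict(zip(FIELDS, parts))


def wallclock_by_owner(records: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Sum wall-clock seconds per job owner."""
    totals: Dict[str, float] = {}
    for rec in records:
        owner = rec["owner"]
        totals[owner] = totals.get(owner, 0.0) + float(rec["ru_wallclock"])
    return totals


# Made-up sample records, for illustration only.
SAMPLE = [
    "# comment line",
    "all.q:node01:hep:alice:test:101:sge:0:1120000000:1120000010:1120000070:0:0:60",
    "all.q:node02:hep:bob:test:102:sge:0:1120000100:1120000110:1120000140:0:0:30",
    "all.q:node01:hep:alice:test:103:sge:0:1120000200:1120000210:1120000240:0:0:30",
]
usage = wallclock_by_owner(parse_accounting(SAMPLE))
```

In practice the lines would be read from the cluster's accounting file rather than from an in-memory list.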

25
SGE Porting
  • Community of Interest formed
  • Code available from
  • http://www.lesc.ic.ac.uk/projects/SGE-LCG.html
  • Mailing list
  • coi-sge-lcg_at_imperial.ac.uk

26
GridCC
  • People involved
  • Marko Krznaric, Janusz Martyniak, Luke
    Dickens, John Darlington, Steve McGough, David
    McBride and David Colling
  • Tiziana & Costas

27
GridCC
  • Lot about GridCC at GridPP12 so brief update
  • Discussions between GridCC and EGEE (Bob Jones
    and Frederic Hemmer)
  • Agreed to collaborate (e.g. use EGEE CVS); GridCC
    relies on EGEE
  • First release September this year
  • Review October this year

28
GridCC
Bits in red from UK WMS activity
29
Summary
  • Activity in 4 areas
  • testing,
  • analysis,
  • SGE port,
  • GridCC