HEP Applications Evaluation of the EDG Testbed and Middleware

1
HEP Applications Evaluation of the EDG Testbed
and Middleware
  • Stephen Burke (EDG HEP Applications WP8)
  • s.burke@rl.ac.uk

2
Introduction
  • Updated from the CHEP talk 1 year ago
  • Some things have changed, some not!
  • Based on D8.4 report (EDG only here, 2.0/2.1
    releases)
  • Achievements of WP8
  • Updated use case analysis mapping HEPCAL to EDG
  • Lessons learnt

5
Use Case Analysis
  • EDG release 2.0 has been evaluated against the
    HEPCAL Use Cases
  • Of the 43 Use Cases:
  • 13 (was 10) are fully implemented
  • 4 (was 8) are largely satisfied, but with some
    restrictions or complications
  • 11 (was 8) are partially implemented, but have
    significant missing features
  • 15 (was 17) are not implemented
  • Missing functionality is mainly in:
  • Virtual data (not considered by EDG)
  • Metadata catalogues and file collections (still
    needs more work)
  • Authorisation, job control and optimisation
    (partly delivered but not integrated)
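The breakdown above can be tallied against the stated total of 43 Use Cases; a minimal check (old counts from the previous evaluation in comments):

```python
# HEPCAL Use Case evaluation against EDG release 2.0, as quoted above.
fully = 13        # was 10
largely = 4       # was 8
partially = 11    # was 8
missing = 15      # was 17
total = fully + largely + partially + missing
print(total)  # 43
```

Both the old and the new breakdowns sum to 43, so the categories partition the full Use Case set.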

6
Lessons Learnt - General
  • Having real users on an operating testbed on a
    fairly large scale is vital: many problems
    emerged which had not been seen in local testing.
  • Problems with configuration are at least as
    important as bugs - integrating the middleware
    into a working system takes as long as writing
    it!
  • Grids need different ways of thinking by users
    and system managers. A job must run anywhere it
    lands. Sites are not uniform so jobs should make
    as few demands as possible.

7
Job Submission
  • Limitations seen in 1.4 are largely gone
  • Efficiency over 90% in stress tests (1600 jobs)
  • Failures are ~1% in normal use (after
    resubmission)
  • Most failures now at globus/site level, not
    broker
  • Can still be sensitive to poor or incorrect
    information from Information Providers
  • Info providers have improved, configuration
    generally better
  • No black hole sites lately (but still possible)
  • Still hard to diagnose errors ("invalid script
    response"???)
  • Advanced features (checkpointing, DAGMan,
    interactivity, accounting, ...) largely untested,
    some not integrated
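The two failure figures above are mutually consistent: with per-attempt efficiency around 90% (the stress-test figure) and one automatic resubmission, and assuming attempts fail independently, the residual failure rate comes out at about the quoted 1%:

```python
# Rough consistency check of the job-submission figures quoted above.
# Assumes independent attempts; 0.10 is the per-attempt failure rate
# implied by "efficiency over 90%" in the stress tests.
p_fail = 0.10
residual = p_fail ** 2   # initial submission plus one resubmission
print(f"{residual:.0%}")  # 1%
```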

8
Information Systems
  • R-GMA is a big improvement on MDS
  • Tables, SQL queries, much easier to publish, ...
  • Largely a personal view, experiments have mostly
    not used it yet
  • Took a very long time to become stable: during
    the D8.4 evaluation R-GMA availability was O(75%)
  • Latest version installed for the EU review looks
    much better: total end-to-end efficiency is now
    > 95%, and R-GMA is 100% (but the testbed is now
    lightly loaded)
  • NO SECURITY!
  • And no Registry/schema replication
  • Need to check published information for accuracy
    (or at least sanity!)
  • GLUE schema is not in EDG/LCG control, and has
    proved very hard to change
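The "tables, SQL queries" point above is R-GMA's main advance over MDS: published monitoring tuples look like relational tables and are queried with plain SQL. A minimal sketch of that model, using Python's built-in sqlite3 as a stand-in for R-GMA's consumer interface (the table and column names here are illustrative, not the actual GLUE schema):

```python
import sqlite3

# Stand-in for R-GMA's relational view of published information.
# GlueCE-style table with hypothetical columns, not the real schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE GlueCE (UniqueID TEXT, FreeCPUs INTEGER)")
con.executemany("INSERT INTO GlueCE VALUES (?, ?)", [
    ("ce01.example.org", 12),
    ("ce02.example.org", 0),
])

# The kind of query R-GMA makes easy: ordinary SQL over published tuples.
rows = con.execute(
    "SELECT UniqueID FROM GlueCE WHERE FreeCPUs > 0"
).fetchall()
print(rows)  # [('ce01.example.org',)]
```

Compare this with MDS, where the same question requires an LDAP search filter against a hierarchical tree rather than a one-line SQL query.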

9
Replica Management
  • Now mostly just works
  • Command line tools are fairly intuitive
  • Sometimes processes can hang
  • Orphan processes sometimes left behind when job
    ends
  • Some inconsistencies found when used with POOL
  • Interaction with SE schema is still unclear
  • Works, but gives artificial restrictions on NFS
    access
  • Bulk operations, mirroring and client-server
    architecture lost with GDMP
  • Java command-line tools are very slow (tens of
    seconds)
  • Fault tolerance is important: error conditions
    should leave things in a consistent state, and
    failures should be retried where possible

10
Replica Catalogues
  • Oracle/MySQL catalogues are much better than LDAP
    in 1.4
  • Tested up to O(100k) entries, no degradation seen
  • But need to cope with millions
  • At 10 seconds per file it would take 4 months
    to register a million files!
  • Queries can be very slow due to inefficient
    transport of data
  • 30 minutes to return 45k entries
  • Java runs out of memory on bigger queries
  • Distributed LRC/RLI not deployed
  • NO SECURITY! (Integrated but not deployed)
  • Still no consistency checking against SE content
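The "4 months" figure above follows directly from the per-file cost; the arithmetic is worth spelling out:

```python
# Time to register a million files at 10 seconds per file,
# as claimed on the slide above.
seconds_per_file = 10
files = 1_000_000
days = seconds_per_file * files / 86_400   # 86 400 seconds per day
print(f"{days:.0f} days, roughly {days / 30:.0f} months")  # 116 days, roughly 4 months
```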

11
Mass Storage
  • Always the most problematic area, and still not
    solved
  • LCG2 is still using the "classic" SE, but it is
    only a stop-gap
  • SRM should be the solution (?), WP5 SE is the EDG
    version
  • Works, but many rough edges, really still a
    prototype
  • No disk space management
  • Error reporting is poor, not fault-tolerant
  • Too much logging, not helpful for a system
    manager
  • Configuration is complex and fragile
  • Also dCache, CASTOR SRM, Enstore SRM
  • But still not production-quality?
  • What is the way forward?

12
VO Management
  • Current LDAP-based system works fairly well, but
    has many limitations
  • VO servers are a single point of failure
  • VOMS looks good, but not yet deployed or fully
    integrated
  • Or documented!
  • Middleware groups seem to have a different
    security model to VOMS designers
  • E.g. they usually assume one and only one VO
  • VO defines service (Replica Catalogue, SE
    namespace) and not authorisation
  • Experiments will need to gain experience about
    how a VO should be run

13
User View of the Testbed
  • Site configuration is very complex; there is
    usually one way to get it right and many ways to
    get it wrong
  • LCFG is a big help in ensuring uniform
    configuration
  • Middleware should be self-configuring (and
    self-checking) as far as possible
  • Need well-defined certification procedures,
    checked on an ongoing basis (sites decay with a
    half-life of a few weeks)
  • Services should fail gracefully when they hit
    resource limits
  • The grid must be robust against failures and
    misconfiguration. Large grids will always be
    broken, so errors are not exceptional!
  • Many HEP experiments require outbound IP
    connectivity from worker nodes
  • Still no solution, discussion is needed
  • Scalability? Still only 20 sites and 1 job/minute!
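The "half-life of a few weeks" remark above can be made concrete. Assuming exponential decay with a hypothetical 3-week half-life (the slide only says "a few weeks") and the ~20 sites on the testbed:

```python
# Expected number of sites still correctly configured after t weeks,
# assuming exponential decay. The 3-week half-life is an assumed
# illustrative value, not a measured one.
half_life = 3.0   # weeks
sites = 20
for weeks in (0, 3, 6):
    print(weeks, sites * 0.5 ** (weeks / half_life))
```

After six weeks only a quarter of the sites would still be in a good state, which is why ongoing certification checks, not one-off ones, are needed.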

14
Gaps
  • Disk space management on worker nodes
  • Some discussion, nothing appeared
  • Analysis of scheduling algorithms
  • EstimatedResponseTime is not optimal
  • Pre-replication by the broker
  • Information about networking at the LAN level
  • Where are the network bottlenecks?
  • Distribution of experiment software (now being
    tackled in LCG)
  • Enforcement of quotas (whose job is this?)
  • Documentation
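The EstimatedResponseTime point above concerns how the broker ranks candidate Computing Elements. A minimal sketch of rank-by-EstimatedResponseTime matchmaking shows why it can be sub-optimal: the broker simply picks the CE advertising the smallest estimate, ignoring data location, network cost, and the staleness of the published value (CE names and numbers here are illustrative, not real published data):

```python
# Naive rank-by-EstimatedResponseTime matchmaking sketch.
# Values are hypothetical estimates in seconds, as a broker might
# read them from the information system.
estimated_response_time = {
    "ce01.example.org": 120,
    "ce02.example.org": 30,
    "ce03.example.org": 600,
}
best = min(estimated_response_time, key=estimated_response_time.get)
print(best)  # ce02.example.org
```

If the input data lives next to ce03, sending the job to ce02 may still be the slower choice overall, which is the kind of gap the slide is pointing at.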