

1
GridPP use-interoper-communic-ability
Tony Doyle
2
Introduction
  1. Is the system usable?
  2. How will GridPP and NGS interoperate?
  3. Communication and discussion introduction

3
A. Usability (Prequel)
  • GridPP runs a major part of the EGEE/LCG Grid,
    which supports 3000 users
  • The Grid is not (yet) as transparent as end-users
    want it to be
  • The underlying overall failure rate is ~10%
  • User (interface)s, middleware and operational
    procedures (need to) adapt
  • (see talk by Jeremy for more info on performance
    and operations)
  • Procedures to manage the underlying problems such
    that the system is usable are highlighted (a
    minimal retry sketch follows below)
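
To make the last point concrete, here is a minimal sketch of the kind of client-side retry procedure that keeps a ~10% per-attempt failure rate workable. The edg-job-submit client is the one named later in this talk; the retry policy itself is an illustrative assumption, not GridPP's actual procedure:

```python
import subprocess
import time

def submit_with_retries(jdl_path, max_attempts=3, backoff_s=60):
    """Retry grid job submission to mask transient (~10%) failures.

    Assumes the EGEE-era 'edg-job-submit' CLI is on PATH; the retry
    policy (3 attempts, fixed backoff) is illustrative, not GridPP's.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["edg-job-submit", jdl_path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout          # contains the job identifier
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        time.sleep(backoff_s)
    raise RuntimeError(f"submission failed after {max_attempts} attempts")
```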

4
EGEE CPU hours (1 April 2006 to 31 July 2006)
  • ~5 million CPU hours delivered in total
  • An active user requires thousands of CPU hours
5
Virtual Organisations
  • Users are grouped into Virtual Organisations
  • Users/VO varies from 1 to 806 members (and
    growing..)
  • Broadly four classes of VO
  • LHC experiments
  • EGEE supported
  • Worldwide (mainly non-LHC particle physics)
  • Local/regional e.g. UK PhenoGrid
  • Sites can choose which VOs to support, subject to
    MOU/funding commitments
  • Most GridPP sites support ~20 VOs
  • GridPP nominally allocates 1% of resources to
    EGEE non-HEP VOs
  • GridPP currently contributes 30% of the EGEE CPU
    resources

6
User View?
  • Perspective matters
  • This is not
  • a usability survey
  • unbiased
  • representative
  • Straw poll
  • users overcame initial registration hurdles
    within two weeks
  • users adapt to Grid in (un-)coordinated
    ways
  • The Grid was sufficiently flexible for many
    analysis applications

7
Physics Analysis
  [Diagram: physics analysis data flow. Raw data and
  calibration data feed collaboration-wide tasks
  (event selection, ESD data or Monte Carlo, event
  tags); analysis groups produce analyses and skims of
  physics objects; individual physicists perform the
  final physics analysis. Data flow increases towards
  the raw-data end of the chain.]
8
User evolution
  • Number of UK Grid users (exc. Deployment Team)

        Quarter  05Q4  06Q2  06Q3
        Users    1342  1831  2777

  • Many EGEE VOs supported c.f. 3000 EGEE target
  • Number of active users (> 10 jobs per month)

        Quarter   05Q4  06Q1  06Q2
        Active      83   166   201
        Fraction  6.2%     -  11.0%

  • Viewpoint: growing fairly rapidly, but not as
    active as they could be? Depends on the "active"
    definition (quick check below)
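
As a quick check, the quoted fractions follow directly from the two tables (83/1342 and 201/1831):

```python
# Active fraction = active users / registered users in the same quarter
print(f"05Q4: {83 / 1342:.1%}")   # -> 6.2%
print(f"06Q2: {201 / 1831:.1%}")  # -> 11.0%
```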

9
Know your users? UK-enabled VOs
  • 806 atlas, 763 dzero, 577 cms, 566 dteam,
    150 lhcb, 131 alice, 75 bio, 65 dteamsgm,
    41 esr, 31 ilc, 27 atlassgm, 27 alicesgm,
    21 cmsprg, 18 atlasprg, 17 fusn, 15 zeus,
    13 dteamprg, 13 cmssgm, 11 hone, 9 pheno,
    9 geant, 7 babar, 6 aliceprg, 5 lhcbsgm,
    5 biosgm, 3 babarsgm, 2 zeussgm, 2 t2k,
    2 geantsgm, 2 cedar, 1 phenosgm, 1 minossgm,
    1 lhcbprg, 1 ilcsgm, 1 honesgm, 1 cdf

10
User Interface
  [Screenshot of the Ganga GUI, showing dockable
  windows]
  • The GUI is relatively low-level (jobs, file
    collections)
  • Dynamic panels for higher level functions (a
    minimal scripted equivalent follows below)
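
The GUI sits on top of Ganga's Python job model, so the same operations can be scripted. A minimal sketch as it might be typed at the ganga prompt; the Job/Executable/LCG names follow Ganga's public examples, and details vary by version:

```python
# Run inside the 'ganga' interactive shell, where Job, Executable and
# LCG are pre-defined names; a minimal, version-dependent illustration.
j = Job()
j.name = "hello-grid"
j.application = Executable(exe="/bin/echo", args=["hello"])
j.backend = LCG()      # submit through the EGEE/LCG workload system
j.submit()
print(j.status)        # 'submitted', later 'running' then 'completed'
```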

11
Complex Applications
12
WLCG MoU
  • Particle physicists collaborate, play roles and
    delegate
  • e.g. prg = production group, sgm = software
    group managers
  • Underpinned by Memoranda of Understanding
  • Current MoU signatories
  • China, France, Germany, Italy, India, Japan,
    Netherlands, Pakistan, Portugal, Romania, Taiwan,
    UK, USA
  • Pending signatures
  • Australia, Belgium, Canada, Czech Republic,
    Nordic, Poland, Russia, Spain, Switzerland,
    Ukraine
  • Negotiation w.r.t. resource and service level

13
Resource allocation
  • Need to assign quotas and priorities to VOs and
    measure delivery
  • VOMS provides group/role information in the proxy
    (see the FQAN sketch after this list)
  • Tools to control quotas and priorities in site
    services being developed
  • So far only at whole-VO level
  • Maui batch scheduler is flexible, easy to map to
    groups/roles
  • Sites set the target shares
  • Can publish VO/group-specific values in GLUE
    schema, hence the RB can use them for scheduling
  • Accounting tool (APEL) measures CPU use at global
    level (UK task)
  • Storage accounting currently being added
  • GridPP monitors storage across UK
  • Privacy issues around user-level accounting,
    being solved by encryption
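
To illustrate the VOMS point in the list above: group/role attributes travel as FQANs inside the proxy, which tools can read with the standard voms-proxy-info client. The mapping to a fair-share group below is hypothetical (the "prg" suffix mirrors the VO names on slide 9):

```python
import subprocess

def proxy_fqans():
    """List the FQANs (group/role attributes) carried in the VOMS proxy.

    Uses the standard 'voms-proxy-info -fqan' client, which prints one
    FQAN per line, e.g. '/atlas/Role=production/Capability=NULL'.
    """
    out = subprocess.run(["voms-proxy-info", "-fqan"],
                         capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def fairshare_group(fqans):
    """Hypothetical site-side mapping from FQANs to a batch fair-share group."""
    for fqan in fqans:
        vo = fqan.split("/")[1]            # '/atlas/...' -> 'atlas'
        if "Role=production" in fqan:
            return vo + "prg"              # e.g. 'atlasprg', as on slide 9
    return fqans[0].split("/")[1] if fqans else None
```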

14
User Support
  • Becoming vital as the number of users grows
  • But modest effort available in the various
    projects
  • Global Grid User Support (GGUS) portal at
    Karlsruhe provides a central ticket interface
  • Problems are categorised
  • Tickets are classified by an on-duty Ticket
    Process Manager, and assigned to an appropriate
    support unit
  • UK (GridPP) contributes support effort
  • GGUS has a web-service interface to ticketing
    systems at each ROC
  • Other support units are local mailing lists
  • Mostly best-effort support, working hours only
  • Currently tens of tickets/week
  • Manageable, but may not scale much further
  • Some tickets slip through the net

15
Documentation & Training
  • Need documentation and training for both system
    managers and users
  • Mostly expert users up to now, but user community
    is expanding
  • Induction of new VOs is a particular problem: no
    peer support
  • EGEE is running User Fora for users to share
    experience
  • Next in Manchester in May 07 (with OGF)
  • EGEE has a dedicated training activity run by
    NeSC/Edinburgh
  • Documentation is often a low priority, little
    dedicated effort
  • The rapid pace of change means that material
    requires constant review
  • Effort on documentation is now increasing
  • GridPP has appointed a documentation officer
  • GridPP web site, wiki
  • Installation manual for admins is good
  • There is also a wiki for admins to share
    experience
  • Focus is now on user documentation
  • New EGEE web site coming soon

16
Alternative view?
  • The number of users in the Grid School for the
    Gifted is manageable now
  • The system may be too complex, requiring too much
    work by the average user?
  • Or the (virtual) help desk may not be enough?
  • Or the documentation may be misleading?
  • Or..
  • Having smart users helps (the current ones are)

17
B. Interoperability
  • GridPP/NGS meeting - Nottingham EMCC, September
    2006
  • Present: Tony Doyle, David Britton, Paul
    Jeffreys, David Wallom, Robin Middleton, Andy
    Richards, Stephen Pickles, Steven Young, Dave
    Colling, Peter Clarke, Neil Geddes
  • Agenda
  • Ultimate goals and the model for achieving them
    and any constraints
  • Timetables
  • Required software (in both directions)

18
B. Interoperability
  • Goals: a general discussion on what we might hope
    to achieve, and why
  • Several key points made...
  • Open question whether we ever need to actually
    have any closer partnership
  • GridPP is focused on a relatively immediate goal
    and will always be constrained in some way by the
    broader LCG requirements
  • NGS should be further from the bleeding edge in
    grid developments
  • NGS affiliation and partnership model exists
  • GridPP T2s all have MoUs which will need
    revamping under GridPP3. This will be an ideal
    opportunity to formalise any relationship between
    GridPP (T2s) and the NGS.
  • It is unclear who is using EGEE (in the UK) and
    who could or would want to use it
  • EGEE-UKI needs to do a better PR job within the
    UK
  • PhenoGrid are registering with EGEE

19
B. Interoperability
  • The current "minimal software stack" approach of
    NGS is being reviewed as a greater variety of
    partner resources are considered (data centres
    and research facilities)
  • Different "stacks" will be relevant to different
    sorts of partners, i.e. there is likely to be a
    range of "NGS Profiles"
  • For the foreseeable future, NGS is likely to
    exist in a world with multiple parallel software
    stacks and it will not be possible to merge them
  • Installing parallel stacks or profiles is not a
    problem if they are easy to install and do not
    interfere
  • One possibility is that the different NGS
    profiles would reflect different stacks, such as
    GT4 or gLite
  • Operations: can we present accounting information
    consistently?

20
B. Interoperability
  • What benefit is there in a GridPP site joining
    NGS?
  • Much less relevant for sites where the resources
    are essentially dedicated to HEP. Where there
    are shared facilities with other fields, the
    generic and shared nature of the NGS can provide
    ready-made interfaces for the broader
    communities. We are clearly a long way from being
    able to merge both activities completely, e.g.
    GridPP requirements on monitoring and accounting
    could not currently be met by NGS nodes, and NGS
    would not require all partners to report a la
    GridPP. (Of course this does not preclude
    project-specific layers, such as this accounting,
    on top of the basic NGS profiles for relevant
    partners.)
  • There is a concern that "joining" the NGS would
    put an additional load on the GridPP sites.
    Looking further ahead of course, the intention is
    that this is not the case, but that supporting
    the standard NGS profiles is exactly the same
    work as required to meet (a subset of) the GridPP
    requirements. This can only be guaranteed if
    there is sufficient representation of GridPP
    sites within the NGS.

21
B. Interoperability
  • Next steps/timetable
  • GridPP3 MoUs - No action required. Can wait until
    next year and should be informed by lessons
    learned over the next 6-12 months. GridPP sites
    currently meet the minimal requirements for NGS
    through the standard GridPP installations.
  • If sites enable the NGS VO then this effectively
    gives NGS affiliation, if they wish
  • Formal affiliation would, however, require that
    the interface be monitored by NGS. Agreed that
    the next step should be to understand in detail
    what is actually required for NGS partnership.

22
B. Interoperability
  • Next steps/timetable
  • Agreed to focus on two sites, Glasgow and LeSC.
    Aim to be ready to achieve NGS partnership by
    Christmas 2006.
  • The decision as to whether or not to actually
    apply for formal partnership can be left to later
    in the year.
  • The principal goal is to understand the steps and
    requirements etc.
  • It was agreed that NGS should provide a gLite CE
    for core NGS nodes, which would allow the nodes
    to be part of the EGEE/LCG SAM infrastructure
  • Accounting and monitoring are areas which are
    still developing and where it is not clear what
    the best solution is (for NGS)
  • Meet once more before Christmas..

23
→ Implementation
  • GU should concentrate on delivering:
    1. A job submission mechanism
    2. A method to prepare the job's environment
       (what input files, etc.)
  • This means we can offer (sketched in code below):
    1. gsissh login to the head node, with access to
       some shared space (e.g. the home directory for
       the NGS pool accounts)
    2. Job submission from the head node to the
       gatekeeper, which can use either GRAM
       (globus-job-submit) or EGEE methods
       (edg-job-submit)
  • This would seem to qualify us as an NGS partner
    site, comparing with
  • http://www.grid-support.ac.uk/index.php?option=content&task=view&id=143
  • The SLAs on offer seem none too onerous
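
A sketch of the two submission paths listed above, driven through gsissh; the head-node and gatekeeper host names are placeholders, and the wrapper itself is illustrative:

```python
import subprocess

HEAD_NODE = "ngs-head.example.ac.uk"      # placeholder head node

def submit_from_head_node(job_file, method="egee"):
    """Log in to the head node with gsissh and submit the job from there.

    method='gram' uses globus-job-submit (contact string + executable);
    method='egee' uses edg-job-submit (JDL file).  Names are illustrative.
    """
    if method == "gram":
        remote = f"globus-job-submit gatekeeper.example.ac.uk {job_file}"
    else:
        remote = f"edg-job-submit {job_file}"
    return subprocess.run(["gsissh", HEAD_NODE, remote],
                          capture_output=True, text=True)
```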

24
C. Communicability
  1. "T0-T1-T2 Service Challenges" Panel Members
    Tony Cass, Jeremy Coles, Dave Colling, John
    Gordon, Dave Kant, Mark Leese, Jamie Shiers.
    notes recorded by Neasan O'Neill
  2. "Analysis on the Grid" Panel Members Roger
    Barlow, Giuliano Castelli, David Grellscheid,
    Mike Kenyon, Gennady Kuznetsov, Steve Lloyd,
    Andrew McNab, Caitriana Nicholson, James Werner.
    notes recorded by Giuseppe Mazza
  3. "How is/will data be managed at the T1/T2s?"
    Panel Members Phil Clark, Greig Cowan, Brian
    Davies, Alessandra Forti, David Martin, Paul
    Millar, Jens Jensen, Sam Skipsey, Gianfranco
    Sciacca, Robin Tasker, Paul Trepka. notes
    recorded by Tom Doherty
  4. "Experiment Service Challenges" Panel Members
    Dave Colling, Catalin Condurache, Peter Hobson,
    Roger Jones, Raja Nandakumar, Glenn Patrick.
    notes recorded by Caitriana Nicholson
  1. "Beyond GridPP2 and e-Infrastructure" Panel
    Members Pete Clarke, Dave Britton, Tony Doyle,
    Neil Geddes, John Gordon, Neasan O'Neill, Joanna
    Schmidt, John Walsh, Pete Watkins. notes
    recorded by Duncan Rand
  2. "Site Installation and Management" Panel
    Members Tony Cass, Pete Gronbech, Dave Kelsey,
    Winnie Lacesso, Colin Morey, Mark Nelson, Derek
    Ross, Graeme Stewart, Steve Thorn, John Walsh.
    notes recorded by Mark Leese
  3. "What is a workable Tier-2 Deployment Model?"
    Panel Members Olivier van der Aa, Jeremy Coles,
    Santanu Das, Alessandra Forti, Pete Gronbech,
    Peter Love, Giuseppe Mazza, Duncan Rand, Graeme
    Stewart, Pete Watkins. notes recorded by
    Gianfranco Sciacca
  4. "What is Middleware Support?" Panel Members
    Mona Aggarwal, Tom Doherty, Barney Garrett, Jens
    Jensen, Andrew McNab, Robin Middleton, Paul
    Millar, Robin Tasker. notes recorded by
    Catalin Condurache

25
1. "LCG Service Challenges"
  • This was a session which brought out the detailed
    planning of Service Challenges.

  1. SC is a great idea, a kind of reality check -
     reality is imminent: data, increasing complexity
     of experiment-led initiatives, and more users
  2. Need more documentation and support - still
     true(!) despite effort
  3. Time scales and deadlines are needed for
     deployment - well known and widely communicated
     via Jamie and Jeremy
  4. Storage model is an important issue, especially
     for the storage group - increasingly large
     issue; dedicated discussion
  5. Communication on experience - forthcoming
     discussions will be discussed at DTeam and PMB
     meetings
  6. Networks will play an important part in SC4 -
     underpins file transfer tests, but needs to be
     embedded within these - disk performance (being
     understood) v network performance (many hidden
     variables)
26
There was a list of specific actions
  • Implement a better user support model - ONGOING
  • Support the deployment of an SRM at every Tier-2
    site - DONE
  • Revisit site plans for implementing promised
    resources - DONE
  • Support the installation of any required local
    catalogues at sites - GENERALLY LIMITED TO
    TIER-1. DONE
  • Investigate the experiment VO box requests. Make
    a recommendation to Tier-2s. Revisit as GridPP. -
    NOT REQD. (CURRENTLY)
  • Better understand network links to sites (we do
    not want to saturate links) - ONGOING
  • Schedule transfer tests from Tier-1 to Tier-2 to
    test rates and stability - DONE AND ONGOING
  • Work closer with experiments? - CAN IMPROVE

27
There was a list of specific actions
  • user support (mail lists, web form, TPMs, GGUS
    integration) - NEED TO ENSURE USERS KNOW (AND
    KEEP REMINDING THEM)
  • SRM at T2 (almost done) - DONE
  • site plans revised (SRIF3, FEC) - ONGOING
  • local catalogues (wiki, SC3, plan for rest)
  • VO boxes (review group) - DISAPPEARING..
  • network links (10 easy questions, wiki) -
    FIREWALL/GRID: http://www.ggf.org/documents/GFD.83.pdf
  • T1-T2 tests (plan, stalled, dcache/dpm) - DONE
  • Experiment links (some progress) - MORE REQD.

28
2. "Running Applications on the Grid"
  • (Why won't my jobs run?)
  • Summary
  • A number of people say things are working well -
    pleasant surprise - easier than LSF! A SUBSET OF
    USERS ATTEND GRIDPP MEETINGS
  • VO setup and requirements: don't want each VO to
    have to talk to each site. VO should provide a
    list of requirements for the site to support the
    VO. THERE ARE A LARGE NUMBER OF RESPONSIBILITIES
    TO BE HANDLED BY EACH EXPT.
  • Certificates: need to improve the situation. Once
    over this hurdle, using the grid is plain
    sailing. INTRINSIC TIME DEPENDENCE OF CA-RA-USER
    TRUST ESTABLISHMENT (NECESSARY)
  • Data management issues are more of a problem than
    job or RB problems. How to get information to the
    user re failures and support channels?
    INCREASINGLY TRUE: MANY AD-HOC DELETIONS
    FOLLOWING E.G. FTS FAILURES
  • Monitoring real file transfers would be an
    interesting addition. USER MECHANISMS TO TRACE
    OVERALL PROGRESS, BUT NOT MANY INDIVIDUAL USER
    TOOLS/SCRIPTS APPEARING; E.G. TNT (Tag Navigator
    Tool) PLUG-IN TO GANGA FOR ATLAS FILE COLLECTIONS
    WOULD NEED TO COMMUNICATE WITH THE MonAMI FTS
    PLUG-IN

29
3. "Grid Documentation"
  • (What documentation is needed/missing? Is it a
    question of organisation?)
  • Could updates to documents be raised at meetings?
  • A mailing list specifically for document updates
    may be useful.
  • Competition between different solutions to one
    problem.
  • For all experiments - link in all documentation
    and give responsibility to a line manager (for
    example) to oversee its maintenance.
  • What are the mechanisms - how do we find out what
    is inadequate within a document? A document
    should be checked every few months to point out
    its inadequacies → should a review process be
    set up by SB?
  • Roles and responsibilities should be established.
  • Important documents should be highlighted - an
    index of useful docs and what sources of
    documents are available may be useful
  • Much progress made by Stephen Burke in many of
    these areas. Steve attends PMB

30
5. "Beyond GridPP2 and e-Infrastructure"
  • (What is the current status of planning?)
  • EGEE II may be superseded by European
    infrastructure - EGEE III NOW BEING PLANNED
  • DTI planning a UK infrastructure
  • Integrate better with NGS - SEE EARLIER SLIDES
  • More things developed by GridPP will be supported
    centrally - NEED TO CONVINCE UK COMMUNITY OF THE
    USEFULNESS AND ADAPTABILITY OF GLITE AS A
    COMPONENT PART OF PERVASIVE INFRASTRUCTURE

31
6. "Managing Large Facilities in the LHC era"
  • (What works? What doesn't? What won't)
  • Sys admins seem happy with their package
    managers.
  • We should share common knowledge (about software
    tools) more. ONGOING
  • Extra Costs (over and above the price of the
    hardware) involved in having large clusters.
    ONGOING
  • IMPROVED, BUT CAN IMPROVE FURTHER - METRIC: ΔT
    (INSTALL → USER AVAILABILITY), AVAILABILITY

32
7. "What is a workable Tier-2 Deployment Model?
  • Conclusion Deployment is under control
  • testing has made good progress
  • operations still an issue
  • METRIC: ΔT (INSTALL → USER AVAILABILITY),
    OVERALL AVAILABILITY, SYSTEM MANAGER(S)
  • EXCELLENT T2 SUPPORT STRUCTURE REQD.

33
8. "What is Middleware Support?"
  • (what is it really all about?)
  • gLite test bed
  • EGEE2 - dedicated testing/certification system
  • Using the wiki was a good idea. Consolidate into
    documents.
  • Need some structure to make sure the wiki doesn't
    get out of control
  • Need some moderators for the wiki
  • Developers not getting correct requirements for
    s/w: sysadmin questions are not the same questions
    that were in the minds of the developers
  • Bad if the wiki is incorrect
  • Need someone to move what is in the wiki to some
    sort of more formal docs (LaTeX or DocBook) which
    have been properly checked and signed off by the
    developers
  • ONGOING, LIMITED PROGRESS - INTRINSIC LIMITATION?
    (THERE WILL ALWAYS BE OUT-OF-DATE/LIMITED
    DOCUMENTATION?)
  • NEED A DOCUMENTATION REVIEW CHALLENGE?

34
Conclusion
  • All sessions were felt to be worthwhile
  • Some produced hard actions
  • Some areas have made progress since
  • Positive correlation between subjects which made
    progress and where GridPP had existing structures
    in place (Deployment, Documentation)
  • Counter-examples: middleware, experiments
  • Let's do this again, but next time take more care
    to task people with subsequent progress and look
    for new structures to deliver results
  • MAKE IT SO
  • The logical end of a talk on "Gridability" (or
    the emperor's new clothes?)