

1
GridPP use-interoper-communic-ability
Tony Doyle
2
Introduction
  1. Is the system usable?
  2. How will GridPP and NGS interoperate?
  3. Communication and discussion introduction

3
A. Usability (Prequel)
  • GridPP runs a major part of the EGEE/LCG Grid,
    which supports 3000 users
  • The Grid is not (yet) as transparent as end-users
    want it to be
  • The underlying overall failure rate is ~10%
  • User (interface)s, middleware and operational
    procedures (need to) adapt
  • (see talk by Jeremy for more info on performance
    and operations)
  • Procedures to manage the underlying problems such
    that the system is usable are highlighted (a
    minimal retry sketch follows below)
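
To make the last point concrete, here is a minimal sketch of the kind of client-side retry procedure that keeps a ~10% per-attempt failure rate workable. The edg-job-submit client is the one named later in this talk; the retry policy itself is an illustrative assumption, not GridPP's actual procedure:

```python
import subprocess
import time

def submit_with_retries(jdl_path, max_attempts=3, backoff_s=60):
    """Retry grid job submission to mask transient (~10%) failures.

    Assumes the EGEE-era 'edg-job-submit' CLI is on PATH; the retry
    policy (3 attempts, fixed backoff) is illustrative, not GridPP's.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["edg-job-submit", jdl_path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout          # contains the job identifier
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        time.sleep(backoff_s)
    raise RuntimeError(f"submission failed after {max_attempts} attempts")
```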

4
EGEE CPU hours (1 April 2006 to 31 July 2006)
  • ~5 million CPU hours delivered in total
  • An active user requires thousands of CPU hours
5
Virtual Organisations
  • Users are grouped into Virtual Organisations
  • Users/VO varies from 1 to 806 members (and
    growing..)
  • Broadly four classes of VO
  • LHC experiments
  • EGEE supported
  • Worldwide (mainly non-LHC particle physics)
  • Local/regional e.g. UK PhenoGrid
  • Sites can choose which VOs to support, subject to
    MOU/funding commitments
  • Most GridPP sites support ~20 VOs
  • GridPP nominally allocates 1% of resources to
    EGEE non-HEP VOs
  • GridPP currently contributes 30% of the EGEE CPU
    resources

6
User View?
  • Perspective matters
  • This is not
  • a usability survey
  • unbiased
  • representative
  • Straw poll
  • users overcame initial registration hurdles
    within two weeks
  • users adapt to Grid in (un-)coordinated
    ways
  • The Grid was sufficiently flexible for many
    analysis applications

7
Physics Analysis
  [Diagram: physics analysis data flow. Raw data and
  calibration data feed collaboration-wide tasks
  (event selection, ESD data or Monte Carlo, event
  tags); analysis groups produce analyses and skims of
  physics objects; individual physicists perform the
  final physics analysis. Data flow increases towards
  the raw-data end of the chain.]
8
User evolution
  • Number of UK Grid users (exc. Deployment Team)

        Quarter  05Q4  06Q2  06Q3
        Users    1342  1831  2777

  • Many EGEE VOs supported c.f. 3000 EGEE target
  • Number of active users (> 10 jobs per month)

        Quarter   05Q4  06Q1  06Q2
        Active      83   166   201
        Fraction  6.2%     -  11.0%

  • Viewpoint: growing fairly rapidly, but not as
    active as they could be? Depends on the "active"
    definition (quick check below)
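
As a quick check, the quoted fractions follow directly from the two tables (83/1342 and 201/1831):

```python
# Active fraction = active users / registered users in the same quarter
print(f"05Q4: {83 / 1342:.1%}")   # -> 6.2%
print(f"06Q2: {201 / 1831:.1%}")  # -> 11.0%
```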

9
Know your users? UK-enabled VOs
  • 806 atlas, 763 dzero, 577 cms, 566 dteam,
    150 lhcb, 131 alice, 75 bio, 65 dteamsgm,
    41 esr, 31 ilc, 27 atlassgm, 27 alicesgm,
    21 cmsprg, 18 atlasprg, 17 fusn, 15 zeus,
    13 dteamprg, 13 cmssgm, 11 hone, 9 pheno,
    9 geant, 7 babar, 6 aliceprg, 5 lhcbsgm,
    5 biosgm, 3 babarsgm, 2 zeussgm, 2 t2k,
    2 geantsgm, 2 cedar, 1 phenosgm, 1 minossgm,
    1 lhcbprg, 1 ilcsgm, 1 honesgm, 1 cdf

10
User Interface
  [Screenshot of the Ganga GUI, showing dockable
  windows]
  • The GUI is relatively low-level (jobs, file
    collections)
  • Dynamic panels for higher level functions (a
    minimal scripted equivalent follows below)
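
The GUI sits on top of Ganga's Python job model, so the same operations can be scripted. A minimal sketch as it might be typed at the ganga prompt; the Job/Executable/LCG names follow Ganga's public examples, and details vary by version:

```python
# Run inside the 'ganga' interactive shell, where Job, Executable and
# LCG are pre-defined names; a minimal, version-dependent illustration.
j = Job()
j.name = "hello-grid"
j.application = Executable(exe="/bin/echo", args=["hello"])
j.backend = LCG()      # submit through the EGEE/LCG workload system
j.submit()
print(j.status)        # 'submitted', later 'running' then 'completed'
```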

11
Complex Applications
12
WLCG MoU
  • Particle physicists collaborate, play roles and
    delegate
  • e.g. prg = production group, sgm = software
    group managers
  • Underpinned by Memoranda of Understanding
  • Current MoU signatories
  • China, France, Germany, Italy, India, Japan,
    Netherlands, Pakistan, Portugal, Romania, Taiwan,
    UK, USA
  • Pending signatures
  • Australia, Belgium, Canada, Czech Republic,
    Nordic, Poland, Russia, Spain, Switzerland,
    Ukraine
  • Negotiation w.r.t. resource and service level

13
Resource allocation
  • Need to assign quotas and priorities to VOs and
    measure delivery
  • VOMS provides group/role information in the proxy
    (see the FQAN sketch after this list)
  • Tools to control quotas and priorities in site
    services being developed
  • So far only at whole-VO level
  • Maui batch scheduler is flexible, easy to map to
    groups/roles
  • Sites set the target shares
  • Can publish VO/group-specific values in GLUE
    schema, hence the RB can use them for scheduling
  • Accounting tool (APEL) measures CPU use at global
    level (UK task)
  • Storage accounting currently being added
  • GridPP monitors storage across UK
  • Privacy issues around user-level accounting,
    being solved by encryption
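
To illustrate the VOMS point in the list above: group/role attributes travel as FQANs inside the proxy, which tools can read with the standard voms-proxy-info client. The mapping to a fair-share group below is hypothetical (the "prg" suffix mirrors the VO names on slide 9):

```python
import subprocess

def proxy_fqans():
    """List the FQANs (group/role attributes) carried in the VOMS proxy.

    Uses the standard 'voms-proxy-info -fqan' client, which prints one
    FQAN per line, e.g. '/atlas/Role=production/Capability=NULL'.
    """
    out = subprocess.run(["voms-proxy-info", "-fqan"],
                         capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def fairshare_group(fqans):
    """Hypothetical site-side mapping from FQANs to a batch fair-share group."""
    for fqan in fqans:
        vo = fqan.split("/")[1]            # '/atlas/...' -> 'atlas'
        if "Role=production" in fqan:
            return vo + "prg"              # e.g. 'atlasprg', as on slide 9
    return fqans[0].split("/")[1] if fqans else None
```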

14
User Support
  • Becoming vital as the number of users grows
  • But modest effort available in the various
    projects
  • Global Grid User Support (GGUS) portal at
    Karlsruhe provides a central ticket interface
  • Problems are categorised
  • Tickets are classified by an on-duty Ticket
    Process Manager, and assigned to an appropriate
    support unit
  • UK (GridPP) contributes support effort
  • GGUS has a web-service interface to ticketing
    systems at each ROC
  • Other support units are local mailing lists
  • Mostly best-effort support, working hours only
  • Currently tens of tickets/week
  • Manageable, but may not scale much further
  • Some tickets slip through the net

15
Documentation & Training
  • Need documentation and training for both system
    managers and users
  • Mostly expert users up to now, but user community
    is expanding
  • Induction of new VOs is a particular problem: no
    peer support
  • EGEE is running User Fora for users to share
    experience
  • Next in Manchester in May 07 (with OGF)
  • EGEE has a dedicated training activity run by
    NeSC/Edinburgh
  • Documentation is often a low priority, little
    dedicated effort
  • The rapid pace of change means that material
    requires constant review
  • Effort on documentation is now increasing
  • GridPP has appointed a documentation officer
  • GridPP web site, wiki
  • Installation manual for admins is good
  • There is also a wiki for admins to share
    experience
  • Focus is now on user documentation
  • New EGEE web site coming soon

16
Alternative view?
  • The number of users in the Grid School for the
    Gifted is manageable now
  • The system may be too complex, requiring too much
    work by the average user?
  • Or the (virtual) help desk may not be enough?
  • Or the documentation may be misleading?
  • Or..
  • Having smart users helps (the current ones are)

17
B. Interoperability
  • GridPP/NGS meeting - Nottingham EMCC, September
    2006
  • Present: Tony Doyle, David Britton, Paul
    Jeffreys, David Wallom, Robin Middleton, Andy
    Richards, Stephen Pickles, Steven Young, Dave
    Colling, Peter Clarke, Neil Geddes
  • Agenda
  • Ultimate goals and the model for achieving them
    and any constraints
  • Timetables
  • Required software (in both directions)

18
B. Interoperability
  • Goals: a general discussion on what we might hope
    to achieve, and why
  • Several key points made...
  • Open question whether we ever need to actually
    have any closer partnership
  • GridPP is focused on a relatively immediate goal
    and will always be constrained in some way by the
    broader LCG requirements
  • NGS should be further from the bleeding edge in
    grid developments
  • NGS affiliation and partnership model exists
  • GridPP T2s all have MoUs which will need
    revamping under GridPP3. This will be an ideal
    opportunity to formalise any relationship between
    GridPP (T2s) and the NGS.
  • It is unclear who is using EGEE (in the UK) and
    who could or would want to use it
  • EGEE-UKI needs to do a better PR job within the
    UK
  • PhenoGrid are registering with EGEE

19
B. Interoperability
  • The current "minimal software stack" approach of
    NGS is being reviewed as a greater variety of
    partner resources are considered (data centres
    and research facilities)
  • Different "stacks" will be relevant to different
    sorts of partners, i.e. there is likely to be a
    range of "NGS Profiles"
  • For the foreseeable future, NGS is likely to
    exist in a world with multiple parallel software
    stacks and it will not be possible to merge them
  • Installing parallel stacks or profiles is not a
    problem if they are easy to install and do not
    interfere
  • One possibility is that the different NGS
    profiles would reflect different stacks, such as
    GT4 or gLite
  • Operations: can we present accounting information
    consistently?

20
B. Interoperability
  • What benefit is there in a GridPP site joining
    NGS?
  • Much less relevant for sites where the resources
    are essentially dedicated to HEP. Where there
    are shared facilities with other fields, the
    generic and shared nature of the NGS can provide
    ready-made interfaces for the broader
    communities. We are clearly a long way from being
    able to merge both activities completely, e.g.
    GridPP requirements on monitoring and accounting
    could not currently be met by NGS nodes, and NGS
    would not require all partners to report a la
    GridPP. (Of course this does not preclude
    project-specific layers, such as this accounting,
    on top of the basic NGS profiles for relevant
    partners.)
  • There is a concern that "joining" the NGS would
    put an additional load on the GridPP sites.
    Looking further ahead of course, the intention is
    that this is not the case, but that supporting
    the standard NGS profiles is exactly the same
    work as required to meet (a subset of) the GridPP
    requirements. This can only be guaranteed if
    there is sufficient representation of GridPP
    sites within the NGS.

21
B. Interoperability
  • Next steps/timetable
  • GridPP3 MoUs - No action required. Can wait until
    next year and should be informed by lessons
    learned over the next 6-12 months. GridPP sites
    currently meet the minimal requirements for NGS
    through the standard GridPP installations.
  • If sites enable the NGS VO then this effectively
    gives NGS affiliation, if they wish
  • Formal affiliation would, however, require that
    the interface be monitored by NGS. Agreed that
    the next step should be to understand in detail
    what is actually required for NGS partnership.

22
B. Interoperability
  • Next steps/timetable
  • Agreed to focus on two sites, Glasgow and LeSC.
    Aim to be ready to achieve NGS partnership by
    Christmas 2006.
  • The decision as to whether or not to actually
    apply for formal partnership can be left to later
    in the year.
  • The principal goal is to understand the steps and
    requirements etc.
  • It was agreed that NGS should provide a gLite CE
    for core NGS nodes, which would allow the nodes
    to be part of the EGEE/LCG SAM infrastructure
  • Accounting and monitoring are areas which are
    still developing and where it is not clear what
    the best solution is (for NGS)
  • Meet once more before Christmas..

23
→ Implementation
  • GU should concentrate on delivering:
    1. A job submission mechanism
    2. A method to prepare the job's environment
       (what input files, etc.)
  • This means we can offer (sketched in code below):
    1. gsissh login to the head node, with access to
       some shared space (e.g. the home directory for
       the NGS pool accounts)
    2. Job submission from the head node to the
       gatekeeper, which can use either GRAM
       (globus-job-submit) or EGEE methods
       (edg-job-submit)
  • This would seem to qualify us as an NGS partner
    site, comparing with
  • http://www.grid-support.ac.uk/index.php?option=content&task=view&id=143
  • The SLAs on offer seem none too onerous
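
A sketch of the two submission paths listed above, driven through gsissh; the head-node and gatekeeper host names are placeholders, and the wrapper itself is illustrative:

```python
import subprocess

HEAD_NODE = "ngs-head.example.ac.uk"      # placeholder head node

def submit_from_head_node(job_file, method="egee"):
    """Log in to the head node with gsissh and submit the job from there.

    method='gram' uses globus-job-submit (contact string + executable);
    method='egee' uses edg-job-submit (JDL file).  Names are illustrative.
    """
    if method == "gram":
        remote = f"globus-job-submit gatekeeper.example.ac.uk {job_file}"
    else:
        remote = f"edg-job-submit {job_file}"
    return subprocess.run(["gsissh", HEAD_NODE, remote],
                          capture_output=True, text=True)
```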

24
C. Communicability
  1. "T0-T1-T2 Service Challenges" Panel Members
    Tony Cass, Jeremy Coles, Dave Colling, John
    Gordon, Dave Kant, Mark Leese, Jamie Shiers.
    notes recorded by Neasan O'Neill
  2. "Analysis on the Grid" Panel Members Roger
    Barlow, Giuliano Castelli, David Grellscheid,
    Mike Kenyon, Gennady Kuznetsov, Steve Lloyd,
    Andrew McNab, Caitriana Nicholson, James Werner.
    notes recorded by Giuseppe Mazza
  3. "How is/will data be managed at the T1/T2s?"
    Panel Members Phil Clark, Greig Cowan, Brian
    Davies, Alessandra Forti, David Martin, Paul
    Millar, Jens Jensen, Sam Skipsey, Gianfranco
    Sciacca, Robin Tasker, Paul Trepka. notes
    recorded by Tom Doherty
  4. "Experiment Service Challenges" Panel Members
    Dave Colling, Catalin Condurache, Peter Hobson,
    Roger Jones, Raja Nandakumar, Glenn Patrick.
    notes recorded by Caitriana Nicholson
  1. "Beyond GridPP2 and e-Infrastructure" Panel
    Members Pete Clarke, Dave Britton, Tony Doyle,
    Neil Geddes, John Gordon, Neasan O'Neill, Joanna
    Schmidt, John Walsh, Pete Watkins. notes
    recorded by Duncan Rand
  2. "Site Installation and Management" Panel
    Members Tony Cass, Pete Gronbech, Dave Kelsey,
    Winnie Lacesso, Colin Morey, Mark Nelson, Derek
    Ross, Graeme Stewart, Steve Thorn, John Walsh.
    notes recorded by Mark Leese
  3. "What is a workable Tier-2 Deployment Model?"
    Panel Members Olivier van der Aa, Jeremy Coles,
    Santanu Das, Alessandra Forti, Pete Gronbech,
    Peter Love, Giuseppe Mazza, Duncan Rand, Graeme
    Stewart, Pete Watkins. notes recorded by
    Gianfranco Sciacca
  4. "What is Middleware Support?" Panel Members
    Mona Aggarwal, Tom Doherty, Barney Garrett, Jens
    Jensen, Andrew McNab, Robin Middleton, Paul
    Millar, Robin Tasker. notes recorded by
    Catalin Condurache

25
1. "LCG Service Challenges"
  • This was a session which brought out the detailed
    planning of Service Challenges.

  1. SC is a great idea, a kind of reality check -
     reality is imminent: data, increasing complexity
     of experiment-led initiatives, and more users
  2. Need more documentation and support - still
     true(!) despite effort
  3. Time scales and deadlines are needed for
     deployment - well known and widely communicated
     via Jamie and Jeremy
  4. Storage model is an important issue, especially
     for the storage group - increasingly large
     issue; dedicated discussion
  5. Communication on experience - forthcoming
     discussions will be discussed at DTeam and PMB
     meetings
  6. Networks will play an important part in SC4 -
     underpins file transfer tests, but needs to be
     embedded within these - disk performance (being
     understood) v network performance (many hidden
     variables)
26
There was a list of specific actions
  • Implement a better user support model - ONGOING
  • Support the deployment of an SRM at every Tier-2
    site - DONE
  • Revisit site plans for implementing promised
    resources - DONE
  • Support the installation of any required local
    catalogues at sites - GENERALLY LIMITED TO
    TIER-1. DONE
  • Investigate the experiment VO box requests. Make
    a recommendation to Tier-2s. Revisit as GridPP. -
    NOT REQD. (CURRENTLY)
  • Better understand network links to sites (we do
    not want to saturate links) - ONGOING
  • Schedule transfer tests from Tier-1 to Tier-2 to
    test rates and stability - DONE AND ONGOING
  • Work closer with experiments? - CAN IMPROVE

27
There was a list of specific actions
  • user support (mail lists, web form, TPMs, GGUS
    integration) - NEED TO ENSURE USERS KNOW (AND
    KEEP REMINDING THEM)
  • SRM at T2 (almost done) - DONE
  • site plans revised (SRIF3, FEC) - ONGOING
  • local catalogues (wiki, SC3, plan for rest)
  • VO boxes (review group) - DISAPPEARING..
  • network links (10 easy questions, wiki) -
    FIREWALL/GRID: http://www.ggf.org/documents/GFD.83.pdf
  • T1-T2 tests (plan, stalled, dcache/dpm) - DONE
  • Experiment links (some progress) - MORE REQD.

28
2. "Running Applications on the Grid"
  • (Why won't my jobs run?)
  • Summary
  • A number of people say things are working well -
    pleasant surprise - easier than LSF! A SUBSET OF
    USERS ATTEND GRIDPP MEETINGS
  • VO setup and requirements: don't want each VO to
    have to talk to each site. VO should provide a
    list of requirements for the site to support the
    VO. THERE ARE A LARGE NUMBER OF RESPONSIBILITIES
    TO BE HANDLED BY EACH EXPT.
  • Certificates: need to improve the situation. Once
    over this hurdle, using the grid is plain
    sailing. INTRINSIC TIME DEPENDENCE OF CA-RA-USER
    TRUST ESTABLISHMENT (NECESSARY)
  • Data management issues are more of a problem than
    job or RB problems. How to get information to the
    user re failures and support channels?
    INCREASINGLY TRUE: MANY AD-HOC DELETIONS
    FOLLOWING E.G. FTS FAILURES
  • Monitoring real file transfers would be an
    interesting addition. USER MECHANISMS TO TRACE
    OVERALL PROGRESS, BUT NOT MANY INDIVIDUAL USER
    TOOLS/SCRIPTS APPEARING; E.G. TNT (Tag Navigator
    Tool) PLUG-IN TO GANGA FOR ATLAS FILE COLLECTIONS
    WOULD NEED TO COMMUNICATE WITH THE MonAMI FTS
    PLUG-IN

29
3. "Grid Documentation"
  • (What documentation is needed/missing? Is it a
    question of organisation?)
  • Could updates to documents be raised at meetings?
  • A mailing list specifically for document updates
    may be useful.
  • Competition between different solutions to one
    problem.
  • For all experiments - link in all documentation
    and give responsibility to a line manager (for
    example) to oversee its maintenance.
  • What are the mechanisms - how do we find out what
    is inadequate within a document? A document
    should be checked every few months to point out
    its inadequacies → should a review process be
    set up by SB?
  • Roles and responsibilities should be established.
  • Important documents should be highlighted - an
    index of useful docs and what sources of
    documents are available may be useful
  • Much progress made by Stephen Burke in many of
    these areas. Steve attends PMB

30
5. "Beyond GridPP2 and e-Infrastructure"
  • (What is the current status of planning?)
  • EGEE II may be superseded by European
    infrastructure - EGEE III NOW BEING PLANNED
  • DTI planning a UK infrastructure
  • Integrate better with NGS - SEE EARLIER SLIDES
  • More things developed by GridPP will be supported
    centrally - NEED TO CONVINCE UK COMMUNITY OF THE
    USEFULNESS AND ADAPTABILITY OF GLITE AS A
    COMPONENT PART OF PERVASIVE INFRASTRUCTURE

31
6. "Managing Large Facilities in the LHC era"
  • (What works? What doesn't? What won't)
  • Sys admins seem happy with their package
    managers.
  • We should share common knowledge (about software
    tools) more. ONGOING
  • Extra Costs (over and above the price of the
    hardware) involved in having large clusters.
    ONGOING
  • IMPROVED, BUT CAN IMPROVE FURTHER - METRIC: ΔT
    (INSTALL → USER AVAILABILITY), AVAILABILITY

32
7. "What is a workable Tier-2 Deployment Model?
  • Conclusion Deployment is under control
  • testing has made good progress
  • operations still an issue
  • METRIC: ΔT (INSTALL → USER AVAILABILITY),
    OVERALL AVAILABILITY, SYSTEM MANAGER(S)
  • EXCELLENT T2 SUPPORT STRUCTURE REQD.

33
8. "What is Middleware Support?"
  • (what is it really all about?)
  • gLite test bed
  • EGEE2 - dedicated testing/certification system
  • Using the wiki was a good idea. Consolidate into
    documents.
  • Need some structure to make sure the wiki doesn't
    get out of control
  • Need some moderators for the wiki
  • Developers not getting correct requirements for
    s/w: sysadmin questions are not the same questions
    that were in the minds of the developers
  • Bad if the wiki is incorrect
  • Need someone to move what is in the wiki to some
    sort of more formal docs (LaTeX or DocBook) which
    have been properly checked and signed off by the
    developers
  • ONGOING, LIMITED PROGRESS - INTRINSIC LIMITATION?
    (THERE WILL ALWAYS BE OUT-OF-DATE/LIMITED
    DOCUMENTATION?)
  • NEED A DOCUMENTATION REVIEW CHALLENGE?

34
Conclusion
  • All sessions were felt to be worthwhile
  • Some produced hard actions
  • Some areas have made progress since
  • Positive correlation between subjects which made
    progress and where GridPP had existing structures
    in place (Deployment, Documentation)
  • Counter-examples: middleware, experiments
  • Let's do this again, but next time take more care
    to task people with subsequent progress and look
    for new structures to deliver results
  • MAKE IT SO
  • The logical end of a talk on "Gridability" (or
    the emperor's new clothes?)