Title: Status of PDC
1. Status of PDC07 and user analysis issues (from the admin point of view)
2. The ALICE Grid
- Powered by AliEn
- Interfaces to gLite, ARC and (future) OSG WMS
- As of today: 65 entry points (62 sites) on 4 continents - Africa (1), Asia (4), Europe (53), North America (4) - 21 countries, 1 consortium (NDGF)
- 6 Tier-1 sites (with MSS capacity), 58 Tier-2
- Altogether 5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape - contributions range from 4 to 1200 CPUs per site
- PIII, PIV, Itanium, Xeon, AMD
- All Linux: Mandriva, SuSE to Ubuntu, mostly SL3/4, no Gentoo - all possible kernel/gcc combinations
3. The ALICE Grid (2)
62 active sites
4. Operation
- ALICE offline is
  - Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
  - Organising (guided by the requirements of the PWGs) and running the production
  - Updating and operating the AliEn site services (together with the regional experts)
  - Supporting user analysis
- Sites are
  - Hosting the VO-boxes (interface to site services)
  - Operating the local services (gLite and site fabric)
  - Providing CPU and storage
- This model
  - Has been in operation with minor modifications for several years and is working quite well for production
  - Requires minor modifications to support a large user community - mostly in the area of user support
5. History of PDCs
- Exercise of the ALICE production model
- Data production / storage/ replication
- Validation of AliRoot
- Validation of Grid software and operation
- User analysis (not yet integral part of the PDC)
- Since April 2006 the PDC has been running continuously
6. PDC job history
Average of 1500 CPUs running continuously since
April 2006
7. PDC job history - zoom on the last 2 months
2900 jobs on average, saturating all available resources
8. Site performance
- Typical operation
  - Up to 10% of the sites are not in production at any given moment
  - Half of these are undergoing scheduled upgrades
  - The other half suffer Grid or local services failures
  - T1s are in general more stable than T2s
  - Some T2s are much better than any of the T1s
- Achieving better stability of the services at the computing centres is a top priority of all parties involved
- The central services availability is better than 95%
9. Production status
Total of 85,837,100 events as of 26/08/2007, 2400 hours
10. Site contributions
Standard distribution 50/50 T1/T2 contribution
11. Relative contribution - Germany
Standard distribution 50/50 T1/T2 contribution
15% of total
12. Efficiencies/debugging
- Workload management for production
  - Under control and near production quality
  - We keep saying that, but this time we really mean it
  - Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
- Support and debugging
  - The overall situation is much less fragile now
  - Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  - gLite services at the sites are (mostly) well understood and supported
  - User support is still very much in need of improvement
  - The issues with user analysis are often unique and sometimes lead to the development of new functionality
  - But at least the response time (if not the solution) is quick
13. General
- The Grid is getting better
  - Running conditions are improving
  - The Grid middleware in general, and AliEn in particular, is quite stable - after long and hard work by the developers
  - Even user analysis, much derided in the past few months, is finally not a painful exercise
- The operation is more streamlined now
  - Better understanding of running conditions and problems by the experts
- We continue with the usual PDC07 programme
  - Simulation/reconstruction of MC events
  - Validation of new middleware components
  - User analysis
  - And in addition the Full Dress Rehearsal (FDR)
14. User analysis issues - short list
- Major issues - February/June 2007
  - Jobs do not start / are lost / output is missing
  - Input data collections are difficult to handle and impossible to process at once
  - Priorities are not set - a single user can grab all resources
  - Unclear definition of storage elements (Disk/MSS)
15. User analysis issues - short list (2)
- What has been done
  - Failover CE for the user queue (Grid partition "Analysis")
  - Since 20 June: 100% availability
  - Pre-staging of data (available on spinning media) and central creation of XML collections
  - The availability of the pre-staged files is checked periodically
  - More robust central services (see previous slides)
  - Use of a dedicated SE for user files - this will be transparently extended to multiple SEs with quotas
  - Priority mechanism (not the final version) put in place
  - We haven't had reports of unfair use
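For orientation, a centrally created XML collection is essentially a list of logical file names with their Grid access URLs. A minimal illustrative sketch is shown below; the collection name, paths, and attribute names are hypothetical and may differ from the exact AliEn schema:

```xml
<?xml version="1.0"?>
<alien>
  <!-- Illustrative sketch only: names and attributes are hypothetical -->
  <collection name="tpc_raw_sample">
    <event name="1">
      <file name="AliESDs.root"
            lfn="/alice/sim/2007/LHC07a/001/AliESDs.root"
            turl="alien:///alice/sim/2007/LHC07a/001/AliESDs.root"/>
    </event>
  </collection>
</alien>
```

A user job then takes the collection as input instead of enumerating files by hand, which is what makes large input data sets manageable.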
16. Job completion chart
User jobs
17. User analysis issues - current
- Storage availability and consistency
  - Still very few working SEs - common storage solutions are not yet production quality
  - The effort is now concentrated on CASTOR2 with xrootd
  - Sites (e.g. GSI) are installing large xrootd pools - these are tested and working
  - With more SEs holding replicas of the data, the Grid will naturally become more stable
- Availability of specific data sets
  - Dependent on the storage capacity in operation
  - Currently TPC RAW data is being replicated to GSI
  - With CASTOR2 + xrootd working, the number of events on spinning media will increase 20x
18. User analysis issues - current (2)
- User applications
  - Compatibility of the user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
  - All sites are installed with lowest-common-denominator middleware and packages - currently SLC3 with gcc v.3.2, while most users have gcc v.3.4
  - There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
  - Meanwhile, the experts are looking into repackaging the Grid applications (most notably gshell)
  - Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid
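A user can catch the compiler mismatch before submitting. The sketch below is a hypothetical pre-submission check, not part of the PDC tooling; the assumed Grid-wide compiler (gcc 3.2 on SLC3) is taken from the slide above:

```shell
#!/bin/sh
# Hypothetical pre-submission sanity check (illustrative, not PDC tooling):
# compare the local gcc version against the Grid's lowest-common-denominator
# compiler before submitting a locally built analysis.
GRID_GCC="3.2"                                      # assumed Grid-wide gcc
LOCAL_GCC=$(gcc -dumpversion 2>/dev/null | cut -d. -f1,2)

if [ "$LOCAL_GCC" = "$GRID_GCC" ]; then
    echo "OK: local gcc ($LOCAL_GCC) matches the Grid compiler"
else
    echo "MISMATCH: local gcc is '${LOCAL_GCC:-not found}', Grid uses gcc $GRID_GCC"
    echo "Rebuild ROOT and the analysis code with gcc $GRID_GCC before submitting"
fi
```

The same check could be wired into a submission wrapper so that a mismatched build never reaches the task queue.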