Title: Status of PDC
1. Status of PDC07 and user analysis issues (from the admin point of view)
2. The ALICE Grid
- Powered by AliEn
- Interfaces to gLite, ARC and (future) OSG WMS
- As of today: 65 entry points (62 sites) on 4 continents - Africa (1), Asia (4), Europe (53), North America (4) - 21 countries, 1 consortium (NDGF)
- 6 Tier-1 sites (with MSS capacity), 58 Tier-2
- Altogether 5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape - contributions range from 4 to 1200 CPUs per site
- PIII, PIV, Itanium, Xeon, AMD
- All Linux: Mandriva, SuSE to Ubuntu, mostly SL3/4, no Gentoo - all possible kernel/gcc combinations
3. The ALICE Grid (2)
62 active sites
4. Operation
- ALICE offline is
  - Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
  - Organising (guided by the requirements of the PWGs) and running the production
  - Updating and operating the AliEn site services (together with the regional experts)
  - Supporting user analysis
- Sites are
  - Hosting the VO-boxes (interface to site services)
  - Operating the local services (gLite and site fabric)
  - Providing CPU and storage
- This model
  - Has been in operation with minor modifications for several years and is working quite well for production
  - Requires minor modifications to support a large user community - mostly in the area of user support
5. History of PDCs
- Exercise of the ALICE production model
- Data production / storage/ replication
- Validation of AliRoot
- Validation of Grid software and operation
- User analysis (not yet integral part of the PDC)
- Since April 2006 the PDC has been running continuously
6. PDC job history
Average of 1500 CPUs running continuously since
April 2006
7. PDC job history - zoom on the last 2 months
2900 jobs on average, saturating all available resources
8. Site performance
- Typical operation
  - Up to 10% of the sites are not in production at any given moment
  - Half of these are undergoing scheduled upgrades
  - The other half suffer Grid or local services failures
  - T1s are in general more stable than T2s
  - Some T2s are much better than any of the T1s
- Achieving better stability of the services at the computing centres is a top priority of all parties involved
- The central services availability is better than 95%
9. Production status
Total of 85,837,100 events as of 26/08/2007, 2400 hours
10. Site contributions
Standard distribution 50/50 T1/T2 contribution
11. Relative contribution - Germany
Standard distribution 50/50 T1/T2 contribution
15% of total
12. Efficiencies/debugging
- Workload management for production
  - Under control and near production quality
  - We keep saying that, but this time we really mean it
  - Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
- Support and debugging
  - The overall situation is much less fragile now
  - Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  - gLite services at the sites are (mostly) well understood and supported
  - User support is still very much in need of improvement
  - The issues with user analysis are often unique and sometimes lead to the development of new functionality
  - But at least the response time (if not the solution) is quick
13. General
- The Grid is getting better
  - Running conditions are improving
  - The Grid middleware in general, and AliEn in particular, is quite stable - after long and hard work by the developers
  - Even user analysis, much derided in the past few months, is finally not a painful exercise
- The operation is more streamlined now
  - Better understanding of running conditions and problems by the experts
- We continue with the usual PDC07 programme
  - Simulation/reconstruction of MC events
  - Validation of new middleware components
  - User analysis
  - And in addition the Full Dress Rehearsal (FDR)
14. User analysis issues - short list
- Major issues - February/June 2007
  - Jobs do not start / are lost / output is missing
  - Input data collections are difficult to handle and impossible to process at once
  - Priorities are not set - a single user can grab all resources
  - Unclear definition of storage elements (Disk/MSS)
15. User analysis issues - short list (2)
- What has been done
  - Failover CE for the user queue (Grid partition "Analysis")
  - Since 20 June: 100% availability
  - Pre-staging of data (available on spinning media) and central creation of XML collections
  - The availability of the pre-staged files is checked periodically
  - More robust central services (see previous slides)
  - Use of a dedicated SE for user files - this will be transparently extended to multiple SEs with quotas
  - Priority mechanism (not the final version) put in place
  - We haven't had reports of unfair use
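For orientation, a centrally created XML collection is essentially a list of logical file names with their Grid access URLs. A minimal illustrative sketch is shown below; the collection name, paths, and attribute names are hypothetical and may differ from the exact AliEn schema:

```xml
<?xml version="1.0"?>
<alien>
  <!-- Illustrative sketch only: names and attributes are hypothetical -->
  <collection name="tpc_raw_sample">
    <event name="1">
      <file name="AliESDs.root"
            lfn="/alice/sim/2007/LHC07a/001/AliESDs.root"
            turl="alien:///alice/sim/2007/LHC07a/001/AliESDs.root"/>
    </event>
  </collection>
</alien>
```

A user job then takes the collection as input instead of enumerating files by hand, which is what makes large input data sets manageable.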
16. Job completion chart
User jobs
17. User analysis issues - current
- Storage availability and consistency
  - Still very few working SEs - common storage solutions are not yet production quality
  - The effort is now concentrated on CASTOR2 with xrootd
  - Sites (e.g. GSI) are installing large xrootd pools - these are tested and working
  - With more SEs holding replicas of the data, the Grid will naturally become more stable
- Availability of specific data sets
  - Dependent on the storage capacity in operation
  - Currently TPC RAW data is being replicated to GSI
  - With CASTOR2 + xrootd working, the number of events on spinning media will increase 20x
18. User analysis issues - current (2)
- User applications
  - Compatibility of the user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
  - All sites are installed with lowest-common-denominator middleware and packages - currently SLC3 with gcc v.3.2, while most users have gcc v.3.4
  - There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
  - Meanwhile, the experts are looking into repackaging the Grid applications (most notably gshell)
  - Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid
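A user can catch the compiler mismatch before submitting. The sketch below is a hypothetical pre-submission check, not part of the PDC tooling; the assumed Grid-wide compiler (gcc 3.2 on SLC3) is taken from the slide above:

```shell
#!/bin/sh
# Hypothetical pre-submission sanity check (illustrative, not PDC tooling):
# compare the local gcc version against the Grid's lowest-common-denominator
# compiler before submitting a locally built analysis.
GRID_GCC="3.2"                                      # assumed Grid-wide gcc
LOCAL_GCC=$(gcc -dumpversion 2>/dev/null | cut -d. -f1,2)

if [ "$LOCAL_GCC" = "$GRID_GCC" ]; then
    echo "OK: local gcc ($LOCAL_GCC) matches the Grid compiler"
else
    echo "MISMATCH: local gcc is '${LOCAL_GCC:-not found}', Grid uses gcc $GRID_GCC"
    echo "Rebuild ROOT and the analysis code with gcc $GRID_GCC before submitting"
fi
```

The same check could be wired into a submission wrapper so that a mismatched build never reaches the task queue.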