1
Status of PDC07 and user analysis issues (from the
admin point of view)
  • L. Betev
  • August 28, 2007

2
The ALICE Grid
  • Powered by AliEn
  • Interfaces to gLite, ARC and (future) OSG WMS
  • As of today 65 entry points (62 sites), 4
    continents
  • Africa (1), Asia (4), Europe (53), North America
    (4)
  • 21 countries, 1 consortium (NDGF)
  • 6 Tier-1 (MSS capacity) sites, 58 Tier-2
  • Altogether 5000 CPUs (pledged), 1.5 PB disk,
    1.5 PB tape
  • Contributions range from 4 to 1200 CPUs
  • PIII, PIV, Itanium, Xeon, AMD
  • All Linux, from Mandriva and SuSE to Ubuntu,
    mostly SL3/4 (no Gentoo); all possible
    kernel/gcc combinations

3
The ALICE Grid (2)
62 active sites
4
Operation
  • ALICE offline is:
  • Hosting the central AliEn services: Grid
    catalogue, task queue, job handling,
    authentication, API services, user registration
  • Organising (guided by the requirements of the
    PWGs) and running the production
  • Updating and operating the AliEn site services
    (together with the regional experts)
  • Supporting user analysis
  • Sites are:
  • Hosting the VO-boxes (the interface to the site
    services)
  • Operating the local services (gLite and site
    fabric)
  • Providing CPU and storage
  • This model:
  • Has been in operation with minor modifications
    for several years and is working quite well for
    production
  • Requires minor modifications to support a large
    user community, mostly in the area of user
    support

5
History of PDCs
  • Exercise of the ALICE production model
  • Data production / storage / replication
  • Validation of AliRoot
  • Validation of Grid software and operation
  • User analysis (not yet an integral part of the
    PDC)
  • Since April 2006 the PDC has been running
    continuously

6
PDC job history
Average of 1500 CPUs running continuously since
April 2006
7
PDC job history - zoom on last 2 months
2900 jobs on average, saturating all available
resources
8
Site performance
  • Typical operation:
  • Up to 10% of the sites are not in production at
    any given moment
  • Half of these are undergoing scheduled upgrades
  • The other half suffer Grid or local services
    failures
  • T1s are in general more stable than T2s
  • Some T2s are much better than any of the T1s
  • Achieving better stability of the services at
    the computing centres is a top priority of all
    parties involved

The central services availability is better than
95%
9
Production status
Total of 85,837,100 events as of 26/08/2007, 24:00
hours
10
Site contributions
Standard distribution: 50/50 T1/T2 contribution
11
Relative contribution - Germany
Standard distribution: 50/50 T1/T2 contribution,
15% of total
12
Efficiencies/debugging
  • Workload management for production:
  • Under control and near production quality
  • We keep saying that, but this time we really mean
    it
  • Improvements (speed, stability) are expected with
    the new gLite version 3.1, still untested
  • Support and debugging
  • The overall situation is much less fragile now
  • Substantial improvements in AliEn and monitoring
    are making the work of the experts supporting the
    operations easier
  • gLite services at the sites are (mostly) well
    understood and supported
  • User support is still very much in need of
    improvement
  • The issues with user analysis are often unique
    and sometimes lead to development of new
    functionality
  • But at least the response time (if not the
    solution) is quick

13
General
  • The Grid is getting better
  • Running conditions are improving
  • The Grid middleware in general and AliEn in
    particular are quite stable
  • After long and hard work by the developers
  • Even user analysis, much derided in the past few
    months, is finally not a painful exercise
  • The operation is more streamlined now
  • Better understanding of running conditions and
    problems by the experts
  • We continue with the usual PDC07 programme
  • Simulation/reconstruction of MC events
  • Validation of new middleware components
  • User analysis
  • And in addition the Full Dress Rehearsal (FDR)

14
User analysis issues - short list
  • Major issues - February/June 2007
  • Jobs do not start / get lost / output is missing
  • Input data collections are difficult to handle
    and impossible to process at once
  • Priorities are not set - a single user can grab
    all resources
  • Unclear definition of storage elements (Disk/MSS)

15
User analysis issues - short list (2)
  • What has been done
  • Failover CE for the user queue (Grid partition
    "Analysis")
  • Since 20 June: 100% availability
  • Pre-staging of data (available on spinning
    media) and central creation of XML collections
  • The availability of the pre-staged files is
    checked periodically (see the sketch after this
    list)
  • More robust central services (see previous
    slides)
  • Use of a dedicated SE for user files - this will
    be transparently extended to multiple SEs with
    quotas
  • Priority mechanism (not the final version) put in
    place
  • We haven't had reports of unfair use
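
The periodic availability check can be as simple as
trying to open every file in a pre-staged
collection. Below is a minimal sketch in ROOT/C++;
the input file name "staged_files.txt" and the
logic are assumptions of this sketch, not the
actual AliEn implementation:

    // check_staged.C - sketch: verify that pre-staged files still open.
    // Run with: root -l -b -q check_staged.C
    #include <fstream>
    #include <iostream>
    #include <string>
    #include "TFile.h"

    void check_staged(const char* listfile = "staged_files.txt")
    {
       std::ifstream list(listfile);   // one xrootd URL per line (assumed)
       std::string url;
       int ok = 0, missing = 0;
       while (std::getline(list, url)) {
          if (url.empty()) continue;
          // TFile::Open understands root:// URLs; it returns a null
          // or "zombie" pointer when the SE does not serve the file
          TFile* f = TFile::Open(url.c_str());
          if (f && !f->IsZombie()) { ++ok; f->Close(); }
          else { ++missing; std::cout << "MISSING " << url << std::endl; }
          delete f;
       }
       std::cout << ok << " available, " << missing << " missing" << std::endl;
    }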

16
Job completion chart
User jobs, standard distribution: 50/50 T1/T2
contribution
17
User analysis issues - current
  • Storage availability and consistency
  • Still very few working SEs - common storage
    solutions are not yet production quality
  • The effort is now concentrated on CASTOR2 with
    xrootd
  • Sites (e.g. GSI) are installing large xrootd
    pools - these are tested and working
  • With more SEs holding replicas of the data, the
    Grid will naturally become more stable (see the
    sketch after this list)
  • Availability of specific data sets
  • Dependent on the storage capacity in operation
  • Currently TPC RAW data is being replicated to GSI
  • With CASTOR2 + xrootd working, the number of
    events on spinning media will increase 20x
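
To illustrate why replicas stabilise the Grid: with
several SEs serving copies of the same file, a
client can fall back to the next replica when one
SE is down. A minimal sketch in ROOT/C++; the SE
hostnames and the file path are hypothetical:

    // open_replica.C - sketch of client-side failover between replicas
    #include <iostream>
    #include "TFile.h"

    TFile* OpenFirstAvailable()
    {
       // two hypothetical replicas of the same file on different SEs
       const char* replicas[2] = {
          "root://castor.cern.ch//alice/raw/tpc_run123.root",
          "root://xrd.gsi.de//alice/raw/tpc_run123.root"
       };
       for (int i = 0; i < 2; ++i) {
          TFile* f = TFile::Open(replicas[i]);
          if (f && !f->IsZombie()) return f; // first reachable copy wins
          std::cout << "replica unavailable: " << replicas[i] << std::endl;
          delete f;
       }
       return 0; // fails only if every SE holding a copy is down
    }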

18
User analysis issues - current (2)
  • User applications
  • Compatibility of the user installation of ROOT,
    gcc version and OS - locally compiled
    applications will not necessarily run on the
    Grid
  • All sites are installed with lowest common
    denominator middleware and packages - currently
    SLC3 and gcc v.3.2, while most users have gcc
    v.3.4
  • There is no easy way out, until the centres
    migrate to SL(C)4 and gcc v.3.4
  • Meanwhile, the experts are looking into
    repackaging the Grid apps (most notably gshell)
  • Currently the only solution is to always compile
    ROOT and the user application with the same
    compiler before submitting to the Grid (a
    compile-time guard is sketched below)
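
One defensive measure is to make a compiler
mismatch fail at compile time rather than at run
time on a worker node: gcc publishes its version in
the __GNUC__ / __GNUC_MINOR__ macros, so a header
included by the user code can refuse to build with
anything but the Grid baseline. A sketch, assuming
the gcc 3.2 / SLC3 baseline quoted above:

    // grid_abi_guard.h - refuse to build with a compiler other than
    // the Grid's lowest-common-denominator gcc 3.2 (SLC3 baseline)
    #ifndef GRID_ABI_GUARD_H
    #define GRID_ABI_GUARD_H

    #ifndef __GNUC__
    #  error "user code for the Grid is expected to be built with gcc"
    #elif __GNUC__ != 3 || __GNUC_MINOR__ != 2
    #  error "worker nodes run gcc 3.2; this binary may not run there"
    #endif

    #endif // GRID_ABI_GUARD_H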