
Title: NorthGrid


1
NorthGrid
  • Alessandra Forti
  • GridPP 16
  • 27 June 2006

2
Outline
  • Current problems
  • Sheffield support
  • Good news
  • Conclusions

3
Current problems
  • Lancaster and Manchester have been heavily
    affected by the infamous 4444 problem.
  • Caused by heavy load on the CE.
  • Torque server hangs -> Maui server hangs -> Info
    System gives wrong responses.
  • Affects job submission from users
  • Different solutions have been more or less
    effective; some with drawbacks, like caching
    torque queries, had to be removed.
  • Manchester seems ok since installation of nscd
    (S. Traylen suggestion).
  • Sheffield is having support post problems (see
    next slide)
  • Manchester dcache unstable since upgrade to 2_7_0
  • Lately it has gone from unstable to broken; it is
    not yet understood why.
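
The caching workaround mentioned on this slide can be pictured as a small time-to-live cache sitting in front of the batch-system query. This is only a minimal sketch: the `TTLCache` class, the 60-second lifetime and the `fake_torque_query` function are hypothetical stand-ins, not the sites' actual setup. It also illustrates the drawback, since between refreshes the cache keeps serving a stale answer.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Cache the result of an expensive query for a fixed number of seconds."""

    def __init__(self, query: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query = query          # the expensive call being protected
        self._ttl = ttl_seconds      # how long a cached answer stays valid
        self._clock = clock          # injectable clock, handy for testing
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> str:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Only hit the real server when the cached answer is too old.
            self._value = self._query()
            self._fetched_at = now
        return self._value


calls = 0


def fake_torque_query() -> str:
    """Hypothetical stand-in for a real batch-system query."""
    global calls
    calls += 1
    return f"jobs queued: {calls}"


cache = TTLCache(fake_torque_query, ttl_seconds=60)
print(cache.get())  # queries the (fake) server
print(cache.get())  # served from the cache, no second query
```

A longer lifetime reduces load on the server but widens the window in which the published information is wrong.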

4
Current problems
  • Manchester has put online only a 3rd of the nodes
  • computing room rearrangement
  • But mostly 4444 problem and dcache instability
  • If they are caused by the increasing load, it
    doesn't do any good to increase it further
  • Liverpool has had problems with an unscheduled
    downtime
  • They were still receiving jobs
  • Problem has been solved by adding the downtime to
    the Freedom of Choice tool?
  • Is it usable by normal users?
  • Sheffield: adding a few VOs has affected their
    Classic SE

5
Sheffield support
  • Sheffield support post is probably going
    away.
  • Meeting with computing centre people with whom
    the post is shared.
  • Replaced by someone still from the university
    Computing Centre
  • Person located at Computing Centre not involved
    enough with PP community
  • PP people will have a weekly meeting to follow the
    situation with the new person (or the old one if
    he stays)
  • Explained LCG requirements
  • Software upgrades vs service uptime.

6
Concerns
  • Lancaster
  • Storage: ratio of 1 TB for every 2 kSpecInt. Even
    if we "dcache-up" all our spare WN disk we will
    have about half this, and that's giving it all to
    Atlas! Even if we get the funding for the extra
    disk it'll be hell finding somewhere to put it.
  • Network: Gb/s links between the WN and SE are
    going to be challenging to get, particularly with
    the NAT.
  • Sheffield
  • Importing data for local Atlas users with both
    lcg utils and DQ2 tools has a 50% failure rate
  • Manchester
  • SFT partial failures and site suspension:
    Manchester risked being suspended due to RM test
    failures despite the fact that the cluster is
    constantly loaded with running jobs.
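
The Lancaster storage concern comes down to simple arithmetic on the quoted ratio of 1 TB per 2 kSpecInt. A rough sketch follows; the CPU capacity and available disk in the example are made-up numbers chosen only to reproduce the "about half" situation described above, not Lancaster's real figures.

```python
# Ratio quoted on the slide: 1 TB of storage for every 2 kSpecInt of CPU.
TB_PER_KSPECINT = 1.0 / 2.0


def required_storage_tb(cpu_kspecint: float) -> float:
    """Storage needed to match a given CPU capacity at the quoted ratio."""
    return cpu_kspecint * TB_PER_KSPECINT


def shortfall_tb(cpu_kspecint: float, available_tb: float) -> float:
    """How much storage is still missing (never negative)."""
    return max(0.0, required_storage_tb(cpu_kspecint) - available_tb)


# Hypothetical site: 400 kSpecInt of CPU and 100 TB of usable disk.
print(required_storage_tb(400))   # 200.0 TB needed
print(shortfall_tb(400, 100))     # 100.0 TB short, i.e. about half is covered
```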

7
Good news
  • All the sites participated in dteam SC4
  • It helped to understand bottlenecks
  • Atlas SC4
  • Lancaster will participate
  • Networking work to put UkLight/SRM switch on the
    same subnet as the cluster
  • Manchester was volunteered before the dcache
    problems manifested themselves
  • hasn't been contacted yet by Atlas anyway
  • 3 sites have already upgraded to glite3.0
  • Lancaster and Liverpool SFTs are almost
    completely green carpets.
  • Liverpool working on networking, firewall
    bottlenecks
  • Manchester now has a 1 Gb/s dedicated link
    directly to NNW
  • skips the campus network completely and should be
    upgraded to 10 Gb/s
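
To give a feel for what the link upgrade means in practice, here is a back-of-the-envelope estimate of bulk transfer time at the two quoted speeds. It assumes decimal units (1 TB = 10^12 bytes), a fully available link and no protocol overhead, so real transfers would be slower; the 1 TB dataset size is an arbitrary illustration.

```python
def transfer_time_hours(data_tb: float, link_gbps: float) -> float:
    """Ideal time to move data_tb terabytes over a link of link_gbps Gb/s."""
    bits = data_tb * 1e12 * 8              # terabytes -> bits (decimal units)
    seconds = bits / (link_gbps * 1e9)     # Gb/s -> bits per second
    return seconds / 3600


for speed in (1, 10):
    hours = transfer_time_hours(1, speed)
    print(f"1 TB at {speed} Gb/s: {hours:.2f} hours")
```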

8
Q1 2006 36

9
Q2 2006 51

10
Conclusions
  • Despite a number of problems NorthGrid is
    delivering resources quite successfully
  • Biggest issues
  • SE stability
  • Support post stability

11
VOMS deployment
  • Alessandra Forti
  • Sergey Dolgobrodov

12
Glite 1.5 VOMS production
  • 1 production machine, 1 backup and 1 public
    testing machine
  • 9 VOs supported, local and regional
  • manmace, ralpp, ltwo, gridpp, t2k, minos, cedar,
    gridcc, mice
  • 28 users
  • However not much load from the users
  • Mostly from services building gridmap files
  • A few bugs make support difficult
  • Users can't have more than one role in a VO
  • Users in VO admin role cannot be cleanly deleted
  • Admin interface hangs easily after simple VO
    management, requiring reinstallation from scratch
  • Tools to mirror database content don't work
    properly, making it difficult to maintain a backup
  • Developers respond but slowly (mostly don't
    bother to acknowledge)
  • Some problems with the same VO supported across
    the Atlantic

13
Glite3.0 VOMS tests
  • 1 machine is dedicated to trash-and-test use and
    is not publicly accessible
  • Currently used for glite3.0 evaluation
  • Production configuration (9 VOs, 28 users)
  • Testing has shown
  • Incomplete deletion of VO Admin user has been
    corrected
  • A user can now have more than one role in a VO
  • The administrator service improved significantly.
    One can manipulate separate VOs and accounts
    without pain, e.g. without the threat of hanging
    the whole Admin interface.

14
VOMS
  • The configuration process was also improved and
    simplified.
  • There is a single configuration file rather than
    several in this version.
  • Many parameters are automatically defined by the
    system.
  • Looks like YAIM :-)
  • Stability and performance under load haven't been
    tested.
  • A test with fake requests has been planned
  • New bugs
  • Wrong permissions for /etc/cron.d files resulted
    in "crl" files not being updated and some proxies
    being refused.
  • As with the previous version the log entries are
    not helpful.
  • Waiting for the summer to upgrade the production
    system.
  • Possibly already in a position to upgrade the
    public test machine
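
The cron permission bug noted above is the kind of problem a small check can catch early. The sketch below assumes the cron daemon ignores files in /etc/cron.d that are writable by group or others; this is a common rule, but the exact checks differ between cron implementations, and the slides do not say which permissions were actually wrong. The function name is illustrative.

```python
import stat
from pathlib import Path
from typing import List


def find_suspect_cron_files(directory: str = "/etc/cron.d") -> List[str]:
    """Return names of regular files writable by group or others.

    Assumption: such files may be silently skipped by cron, so jobs
    like CRL updates would never run.
    """
    suspect = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        mode = entry.stat().st_mode
        if mode & (stat.S_IWGRP | stat.S_IWOTH):
            suspect.append(entry.name)
    return suspect


if __name__ == "__main__":
    if Path("/etc/cron.d").is_dir():
        for name in find_suspect_cron_files():
            print(f"check permissions on /etc/cron.d/{name}")
```

Running such a check after each installation or upgrade would flag the files before stale CRLs start causing proxies to be refused.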