London Tier2 Status (PowerPoint transcript)
1
London Tier2 Status
  • Olivier van der Aa

LT2 Team: M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, W. Hay, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh
2
Outline
  • LT2 Usage
  • LT2 Sites updates
  • LT2 SC4 activity
  • Conclusion

3
Number of Running Jobs
[Plots: running jobs, January and February]
4
Number of Running Jobs
[Plots: running jobs, March and April]
5
Number of Running Jobs
[Plot: running jobs, May]
  • LHCb's usage of the infrastructure increased last
    month
  • This stressed the system and caused very slow MDS
    responses

6
Usage and efficiency per VO (2006-01-01 to 2006-04-30)
  • WallTime consumption
  • ATLAS, LHCB, BIOMED, CMS are the top consumers
  • Efficiency: fraction of total time that results
    in a successful job state
  • Efficiency, in decreasing order: BIOMED, ATLAS,
    LHCB, CMS
  • The efficiency pattern is not yet understood. Why
    is BIOMED more efficient (i.e. causes fewer
    middleware failures)?
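As a sketch of how this efficiency figure can be computed from per-job accounting records (the tuples, VO names and numbers below are illustrative stand-ins, not real accounting data):

```python
# Hypothetical accounting records: (VO, wall-clock seconds, job succeeded).
records = [
    ("biomed", 2400, True),
    ("atlas", 3600, True), ("atlas", 1800, False),
    ("lhcb", 1200, True), ("lhcb", 600, False),
]

def efficiency_by_vo(records):
    """Fraction of each VO's wall time spent in jobs that ended successfully."""
    total, good = {}, {}
    for vo, wall, ok in records:
        total[vo] = total.get(vo, 0) + wall
        if ok:
            good[vo] = good.get(vo, 0) + wall
    return {vo: good.get(vo, 0) / total[vo] for vo in total}

print(efficiency_by_vo(records))
```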

7
Usage and efficiency per CE (2006-01-01 to 2006-04-30)
  • WallTime
  • QMUL provides 55% of the total WallTime
  • Difficult to meet the Service Level Agreement of
    95% availability for the LT2 with 1 FTE at QMUL
  • Efficiency
  • UCL-CENTRAL has the highest job success rate
  • This can be explained by the fact that it mainly
    attracts BIOMED jobs

8
CE / VO view
[Plots: per-CE VO usage for Brunel, IC-LESC, ucl-central, QMUL, RHUL, IC-HEP, UCL-HEP]
  • In London we support 18 VOs (sixt has not been
    used)
  • The plots on the right show the relative VO usage
    for each CE
  • The size of each box is proportional to the total
    Wall Clock Time

9
GridLoad
https://gfe03.hep.ph.ic.ac.uk:4175/cgi-bin/load
  • Tool to monitor the sites
  • Updates every 5 minutes
  • Uses the RTM data and stores it in RRD files
  • Shows the number of jobs in any state
  • VO view: stacks the jobs by VO
  • CE view: stacks the jobs by CE
  • Still a prototype. Will add:
  • View by GOC and ROC
  • Error checking
  • Usage (running CPU / total CPU)
  • Improved look and feel
  • Could interface with NAGIOS for raising alarms
    (high abort rate)
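The VO and CE stacking described above can be sketched as follows; the job tuples and names are hypothetical stand-ins for the RTM feed that GridLoad actually consumes:

```python
from collections import Counter

# Hypothetical RTM-style job snapshot: (CE, VO, state). The real tool polls
# this every 5 minutes and writes the counts to RRD files for plotting.
jobs = [
    ("QMUL", "atlas", "Running"), ("QMUL", "lhcb", "Running"),
    ("RHUL", "lhcb", "Running"), ("RHUL", "biomed", "Aborted"),
]

def stack(jobs, by, state="Running"):
    """Count jobs in `state`, stacked by CE (by=0) or by VO (by=1)."""
    return Counter(job[by] for job in jobs if job[2] == state)

print(stack(jobs, by=1))  # VO view
print(stack(jobs, by=0))  # CE view
```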

10
GridLoad (cont)
  • The GridLoad plots can be useful to spot
    problems
  • Example: observed a high abort rate at one site
    for LHCb jobs
  • It helped us be proactive for the VO: we could
    spot a problem before receiving a ticket
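A minimal sketch of the NAGIOS-style abort-rate alarm mentioned on the previous slide; the 10% warning and 25% critical thresholds are invented for illustration, not real settings:

```python
def abort_rate_status(aborted, finished, warn=0.10, crit=0.25):
    """Return a Nagios-style exit code: 0=OK, 1=WARNING, 2=CRITICAL.

    Thresholds are illustrative defaults, not operational values.
    """
    total = aborted + finished
    if total == 0:
        return 0  # no jobs seen, nothing to alarm on
    rate = aborted / total
    if rate >= crit:
        return 2
    if rate >= warn:
        return 1
    return 0
```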

[Plots: aborted jobs and running jobs]
11
LT2 Usage Conclusions
  • We now have an additional tool to monitor LT2
    CPU activity in real time
  • The overall usage is increasing
  • We need to understand the efficiency patterns:
    what causes these differences between the VOs?
  • We need similar real-time monitoring tools for
    the storage

Jan - May
12
Outline
  • LT2 Usage
  • LT2 Sites updates
  • LT2 SC4 activity
  • Conclusion

13
Brunel site update
  • New cluster provided by Streamline Computing
  • Supermicro dual-processor dual-core AMD Opteron
    nodes
  • 40 x 1.8 GHz, 4 GB memory, 80 GB disk
  • Head node: 2 GHz, 8 GB memory, 320 GB disk
  • Total: 164 cores
  • Is in the process of being configured
  • Gb connection?
  • 1 Gb WAN at Brunel in 65 days from now
  • They are currently buying appropriate switches
    and related hardware
  • Will have a throttling router that limits the LCG
    traffic if the university demand is high. If the
    university demand is low, LCG will get a higher
    allocation
  • The Brunel site is expected to have a 10 times
    faster connection (200 Mb) by September
  • SRM
  • Best rate was 59 Mb/s
  • Will remove any NFS-mounted filesystem. No real
    showstopper there

14
IC Sites update
  • HEP
  • Old IBM (60 CPU) cluster running smoothly, almost
    full of jobs for the last two months
  • Will build a new cluster with off-the-shelf boxes
  • 40 dual-core AMD CPUs
  • 40 TB of disk (non-RAID)
  • Will use SGE as the job manager

15
IC Sites update
  • Investigated FTS performance issues with
    dCache-to-dCache transfers
  • FTS using urlcopy causes high iowait

[Plots: iowait and block I/O over time, FTS/urlcopy (130 Mb/s) vs FTS/srmcp (179 Mb/s)]
16
IC Sites update
  • LESC
  • 33 of 400 1.8 GHz Opterons
  • Running RHEL3 64-bit
  • SGE job manager
  • DPM storage with a small disk partition
  • Currently porting DPM to Solaris to avoid
    NFS-mounting the filesystems used for the SRM
  • See the work in progress at
    http://www.gridpp.ac.uk/wiki/DPM-on-Solaris
  • Difficulties: improving usage. Several VOs are
    not comfortable with the 64-bit arch even though
    32-bit libraries are there
  • ICT
  • Deploying a new 200-Xeon cluster running PBS for
    college use
  • LCG will have a 30% share of that cluster
  • 30 TB of RAID storage that will be shared
  • Difficulties: they want to use GT 3.2.1

17
QMUL site update
  • Lots of activity with the commissioning of their
    new cluster provided by Viglen
  • 280 dual-core Opteron 270s at 2 GHz
  • All nodes have 2 x 250 GB disks: 140 TB!
  • What filesystem to use in that environment?
    Will consider Lustre
  • All nodes are 1 Gb connected, with 10 Gb
    inter-switch links
  • Now online with 1600 job slots
  • Problems
  • Site stability under high job load: the
    NFS-mounted software area is not coping
  • RAID boxes giving hardware errors. Seemed to be
    due to loose SATA connectors. The disks tested OK
    with SMART. Not yet clear what the cause is
  • Reliability of DPM on Poolfs

18
UCL sites update
  • CCC
  • Have successfully moved to the SGE job manager to
    serve 364 slots (91 dual-CPU, hyper-threading)
  • Improved their SRM performance by using a direct
    Fibre Channel link from the head node to the RAID
    array
  • Write bandwidth moved from 90 Mb/s to 238 Mb/s
  • Will have 40 additional nodes (160 slots) soon
  • Moving their cluster from one building to another
    will start on July 3 and take 1 week
  • HEP
  • New Gb switches have been bought. Need to cable
    them to the head node
  • Will have 1-2 boxes with mirrored 120 GB disks
    and a DPM pool installed on them to support
    non-ATLAS VOs
  • ATLAS will still use NFS mounts
  • Problem: performance of the ATLAS storage

19
RHUL site update
  • Cluster running smoothly
  • 142 job slots, almost full for two months. All
    VOs targeting that site
  • No more NFS-mounted disks with write access from
    DPM
  • Broad VO usage
  • Update on the 1 Gb connection
  • Purchase order was signed yesterday
  • Discussions are now starting as to when it will
    be installed
  • Problems
  • Need to be able to drain a pool to remove the
    read-only NFS-mounted filesystem

20
Transfers throughput status

Site         Inbound (Mb/s)  Outbound (Mb/s)  Update
Brunel       57              59               Gb connection signed (200 Mb by September)
IC-HEP       80              190              FTS performance problem not yet understood
IC-LeSC      156             95               DPM being built for Solaris
QMUL         118             172              Poolfs needs to be recompiled with the round-robin feature
RHUL         59              58               Gb connection signed
UCL-HEP      71              63               Gb switches there
UCL-CENTRAL  90              309              Moved to direct Fibre Channel connection; rate is now 238 Mb/s
21
Outline
  • LT2 Usage
  • LT2 Sites updates
  • LT2 SC4 activity
  • Conclusion

22
SC4 Activity
  • CMS target is CSA06 (Computing, Software and
    Analysis challenge)
  • CSA06 objective: a 50 million event exercise to
    test the workflow and dataflow associated with
    the data handling and data access model of CMS
  • Will test the new CMS reconstruction framework
    for large production
  • Needs 20 MB/s bandwidth to T2 storage
  • Will start on 15 Sept
  • More information can be found at
    https://twiki.cern.ch/twiki/bin/view/CMS/CSA06
  • IC-HEP and IC-LESC preparing for CSA06
  • Strategy is to help other sites once IC is OK
  • New PhEDEx installed that uses FTS
  • Need to solve the FTS performance issues
  • ProdAgent configuration prepared for IC-LESC and
    IC-HEP
  • Brunel involved in PhEDEx
  • ATLAS
  • No commitment yet
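As a quick sanity check of what the 20 MB/s target implies per day (only the 20 MB/s figure comes from the slide; the daily-volume arithmetic is a back-of-envelope addition):

```python
# Sustained import volume at the CSA06 target rate of 20 MB/s.
rate_mb_per_s = 20
seconds_per_day = 86_400
tb_per_day = rate_mb_per_s * seconds_per_day / 1_000_000  # MB -> TB (decimal)
print(f"{tb_per_day:.3f} TB/day")  # about 1.7 TB/day into T2 storage
```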

23
Conclusions
  • Real-time monitoring of the LT2 job states is in
    place
  • The usage is increasing
  • Site evolution
  • SGE deployed at UCL-CENTRAL
  • QMUL more than doubled its number of job slots
  • Brunel Gb connection on the right track;
    commissioning a new cluster (160 cores)
  • IC spotted FTS performance issues; porting of
    DPM to Solaris ongoing; will commission a new
    cluster at HEP
  • RHUL a very stable site; Gb connection signed
  • General storage evolution: in the process of
    removing NFS mounts
  • SC4: involvement in the CMS SC4 activity is going
    on. Need a volunteer for the ATLAS SC4

24
LT2
Thanks to all of the team: M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh