1
CMS production status at CERN
T.Wildish 18 November 2009
2
Introduction
  • Production tests in Spring with LXSHARE
  • Use of RRP in production tests
  • 3Ware-based EIDE disk servers
  • Moving to Castor-managed storage
  • Summary/conclusions

3
Production test configuration
[Diagram: data flows in the test configuration. Signal is served at 1 MB/minute/job from a testbed server fed from HPSS or Castor; further flows of 4 MB/minute/job and 35 MB/minute/job connect the farm of up to 180 batch nodes, the 2-4 federation/journal/lockserver node(s), and the output path to Castor. N.B. the widths of the solid arrows in the original diagram are to scale.]
4
Production tests with LXSHARE
  • Ran with > 300 batch jobs writing into the same
    federation
  • Were able to run, but throughput flattens out at
    120 jobs. No real gain in running more than that
  • For large-scale productions we expect we will
    have to divide our farm(s) into manageable chunks
  • Conceptually no different to running distributed
    production and combining the results with grid
    tools
  • Ran into filehandle limits on both AMS (4096) and
    lockserver (1024)
  • Worked around both limits this time; we won't be
    able to if we keep increasing the size of the
    farm (a sketch of raising the per-process
    file-descriptor limit follows this list)
  • Series of rather contrived tests explored the
    hardware limits, using 120 jobs only
  • Separate federation for each batch node, local to
    the node, local metadata, writing either
    centrally or locally depending on the exact
    details of the test
  • Equivalent to 120 users analysing on private
    shallow-copies, this at least shows that that
    model scales well
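
As a purely illustrative aside (not necessarily the work-around actually used here), the per-process file-descriptor limit that AMS and the lockserver ran into can be inspected and raised from Python roughly as follows; the 4096 target is just an example value:

    import resource

    # Query the current soft and hard limits on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft =", soft, "hard =", hard)

    # Raise the soft limit towards 4096, clamped to the hard limit so an
    # unprivileged process stays within what it is allowed to set.
    new_soft = 4096
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))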

5
Hardware performance
  • Serve pileup at 20 MB/sec sustained per server
  • We need 0.5 MB/sec/job, so one server supports 40
    jobs
  • For 300 jobs we would need 8 servers (see the
    sizing sketch at the end of this slide)
  • CPU usage seen to be 25% user, 140% system (dual
    CPU)
  • Not limited by the AMS CPU consumption itself;
    either the kernel or the I/O is the limiting step
  • 300 filehandles open on the AMS at this time
  • Serving signal is no effort, need 1 MB/min/job
  • Output server writing at up to 24 GB/hour (6-7
    MB/sec)
  • Then mostly busy in I/O, probably saturated
  • Tests with 120 jobs did not seem to reach these
    hardware limits, but not clear what limit they
    did reach
  • Ran out of time for investigating further with
    full farm
  • Most of the time spent running real production,
    not tests
  • Smaller scale tests still going on
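
The server-count figures above follow from simple arithmetic; a minimal worked sketch in Python, using the rates quoted on this slide:

    import math

    rate_per_server = 20.0   # MB/sec sustained pileup rate per server (from this slide)
    rate_per_job    = 0.5    # MB/sec of pileup needed per batch job

    jobs_per_server = rate_per_server / rate_per_job      # 40 jobs per server
    servers_needed  = math.ceil(300 / jobs_per_server)    # 8 servers for 300 jobs
    print(jobs_per_server, servers_needed)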

6
(No Transcript)
7
The Request Redirection Protocol
  • We have been using the RRP since late last summer
  • All catalogue entries point at the federation
    server, federation server redirects AMS to the
    real server(s)
  • Simple implementation driven by a configuration
    file (an example table is sketched at the end of
    this slide)
  • server.cern.ch /dir/file
  • Serve /dir/file from host server.cern.ch
  • Match on full filename for now, partial matching
    soon to come
  • server.cern.ch may redirect the request again
  • defaultserver server.cern.ch
  • Files not in the RRP table can be served by
    default servers
  • Multiple or repeated default servers are OK
  • Multiple default servers are selected on a
    round-robin basis
  • Repeated servers are selected proportionately
    more frequently
  • updateinterval 30
  • Check the redirection table for changes every 30
    seconds
  • Unrecognised entries are silently ignored
  • An empty or missing RRP table is also legal
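
Putting the directives above together, a complete RRP table might look like the following; this is purely illustrative, and the hostnames and file paths are invented:

    updateinterval 30
    server2.cern.ch /cms/signal/run42.db
    server3.cern.ch /cms/pileup/minbias01.db
    defaultserver serverC.cern.ch
    defaultserver serverD.cern.ch

With such a table the two named files are redirected to server2 and server3, any other file goes to serverC or serverD in round-robin fashion, and the table is re-read every 30 seconds.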

8
Request Redirection algorithm
  • If the file exists locally, serve it directly
  • Ignore any entry for this file in the table
  • If a named server exists for the file, redirect
    to that server, else
  • If there are any default servers specified,
    choose the next one (round-robin) and redirect to
    it, else
  • Attempt to serve the file locally (assume an MSS
    backend will stage it in on demand or that I can
    create it if required)
  • When choosing a default server there are two
    variations possible
  • Remember the server for this file (add it to the
    table)
  • Suitable for staging systems, may reduce tape
    mounts
  • Don't record this server/file pair
  • Suitable for load-balancing for disk-resident
    data (pileup)
  • Not yet a run-time option, but will be very soon
    (the full decision logic is sketched below)
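
A minimal sketch of this decision logic (Python; the class, names and structure are illustrative only, not the actual AMS implementation):

    import itertools
    import os

    class Redirector:
        def __init__(self, table, default_servers, remember=False):
            self.table = dict(table)       # full filename -> named server
            self.defaults = itertools.cycle(default_servers) if default_servers else None
            self.remember = remember       # staging variant vs. pure load-balancing

        def decide(self, filename):
            # 1. If the file exists locally, serve it directly and ignore
            #    any entry for it in the table.
            if os.path.exists(filename):
                return ("serve-locally", None)
            # 2. If a named server exists for the file, redirect to it.
            if filename in self.table:
                return ("redirect", self.table[filename])
            # 3. Otherwise pick the next default server, round-robin.
            if self.defaults is not None:
                server = next(self.defaults)
                if self.remember:
                    # Staging variant: remember the choice, may reduce tape mounts.
                    self.table[filename] = server
                return ("redirect", server)
            # 4. No redirection possible: serve locally and assume an MSS
            #    backend will stage the file in, or create it if required.
            return ("serve-locally", None)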

9
RRP use scenarios
  • Server1 serves all metadata, redirects all signal
    data to server2 and all pileup to server3
  • Server2 redirects signal data to server21 and
    server22 alternately, remembering which of them
    serves a given file (needs only 2
    defaultserver entries)
  • Data staged in, and eventually purged from disk
  • Server3 redirects pileup data to server31,
    server32, and server33 alternately, using the
    load-balancing variation (needs only 3
    defaultserver entries; an example table for this
    case follows this list)
  • Each of server31/32/33 has a full local copy of
    the pileup files
  • ServerA redirects all existing files in the
    federation to serverB, has default entries for
    serverC and serverD
  • ServerB serves data as it likes (e.g. like
    server1 above?)
  • ServerC/D have no redirection, only new files are
    created on them (they are output servers).
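
For the pileup load-balancing case (server3 above), the RRP table needs nothing beyond default entries; an illustrative sketch, with invented hostnames:

    updateinterval 30
    defaultserver server31.cern.ch
    defaultserver server32.cern.ch
    defaultserver server33.cern.ch

Used with the non-remembering (load-balancing) variation, successive pileup requests are spread across the three servers in turn.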

10
Pros and cons of RRP (Pros)
  • Extremely flexible data management
  • Can increase/decrease load on servers trivially
  • Add/remove (duplicate) defaultserver entries
  • Bring servers up/take them down w/o interrupting
    data access
  • Take servers offline when they crash, so improving
    data-availability
  • No lock contention
  • Don't alter the federation catalogue, no locks to
    care about
  • Propagates changes to all clients
  • Many users in CMS take shallow-copies of the
    federation for their own purposes (e.g. adding
    tag databases). They automatically see the
    changes in the server configuration
  • Changes are propagated extremely quickly
  • On the timescale of the updateinterval an entire
    set of data servers can be replaced by new ones
  • N.B. Still have open connections from existing
    clients
  • RRP table easy to build and manage
  • on a given node, that is (but see the next slide)

11
Pros and cons of RRP (Cons)
  • RRP table very difficult to manage!
  • Need local knowledge about other hosts, not
    always simple
  • Need care organising the server hierarchy for
    efficiency
  • Avoid circular dependencies
  • Avoid putting weak servers near the top of the
    hierarchy
  • Data-access problems can be hard to trace
  • Don't have tools that can trace through the RRP
    to determine who really serves (or served) a given
    file (or who failed to)
  • Much trawling through logfiles; expert-intensive
  • Cannot safely redirect a file that is being
    written to!
  • Not surprising, that's what database locks are
    for.
  • RRP can relocate files with no knowledge of
    federation locks. If the DB is being written at
    the time, all bets are off
  • But I do bet that I won't enjoy the result!
    (caveat emptor)
  • Pro/Con statement largely a matter of Hats
  • With power (Pro) comes responsibility (Con). You
    can do what you like, but it is up to you to do
    something useful!

12
3Ware-based disk servers (I)
  • Severe problems with 3Ware-based disk servers
  • Data-loss at a rate 10⁴ times higher than 3Ware
    quotes
  • Many failures seen in configurations not tested
    by IT, but failures also seen in the supported
    configuration
  • The same configuration ALICE used, same machines,
    same disks, directly before CMS. ALICE saw no
    such problems!
  • Cause unclear. Suspect bad disks, but also
    unclear if CMS data-access patterns are
    triggering specific problems
  • CMS have multiple asynchronous writers in many
    files
  • ALICE were streaming data, at much higher rate
    than CMS
  • Bad-blocks should cause array-degradation, not
    data-loss!
  • No tool to reproduce data-access patterns
    reliably.
  • Need to run the whole farm to reproduce effects!

13
3Ware-based disk servers (II)
  • CMS intend to continue testing these servers
  • Performance is excellent, clearly on a par with
    SCSI
  • Performance per buck is unbeatable
  • 3Ware provide frequent firmware upgrades
  • Both good news and bad!
  • They care, they are active
  • They do not have a fully mature product
  • Latest version of firmware has not yet been seen
    to fail
  • Tested artificially, but for several days.
    Cautiously optimistic

14
Moving to Castor
  • Castor is the IT-proposed solution to replace
    HPSS
  • Experiments should make the move this year
    (September?)
  • CMS are part-way through the transition
  • Serving lots of data to users via Castor instead
    of HPSS
  • Serve all existing data from Castor by end of
    summer
  • New data going to Castor instead of HPSS since
    May
  • Not yet using Castor-managed pools for
    data-writing
  • Doesn't yet work for us, fix expected very soon
  • Castor and CMS still learning to understand each
    other
  • Castor will simplify our data-management in the
    future
  • Our problems will become theirs!
  • Exploiting the potential of properly managed
    storage may affect our production model

15
Conclusions
  • To make efficient use of worldwide resources we
    have coherent but separate federations
  • By extension, to make efficient use of in-farm
    resources we expect to have to work in
    partitioned federations
  • Therefore, must foresee deep copying at
    collection level to present coherent collections
    to users
  • Plan for the same tools on WAN and LAN (Grid!)
  • Production clustering requirements and analysis
    clustering requirements differ
  • Production:
  • Low latency on disk (exposure to error).
  • Weak coupling between Db files (single event
    failures destroy small amounts of data)
  • Analysis:
  • Cluster data according to use pattern(s)
  • Therefore, must foresee reclustering phases

16
More Conclusions
  • We can test the scaling in a given farm
    configuration up to some unit size (200 nodes?
    300 nodes? ... 1 node?)
  • Configure to have all HW and SW limits under
    control
  • We can test the scaling of building the coherent
    collections and reclustering, just simulating the
    output of many of the farm units
  • This makes it inherently more feasible to determine
    the future limitations without needing an
    impossibly sized facility
  • Of course, there are more copying steps, but we
    suspect that each part of the farm can be more
    efficiently used to reach higher overall
    performance