1
CMS production status at CERN
T.Wildish 18 November 2009
2
Introduction
  • Production tests in Spring with LXSHARE
  • Use of RRP in production tests
  • 3Ware-based EIDE disk servers
  • Moving to Castor-managed storage
  • Summary/conclusions

3
Production test configuration
[Diagram: data flows in the test configuration. Signal is served at 1 MB/minute/job from a testbed server fed from HPSS or Castor; further flows of 4 MB/minute/job and 35 MB/minute/job connect the farm of up to 180 batch nodes, the 2-4 federation/journal/lockserver node(s), and the output path to Castor. N.B. the widths of the solid arrows in the original diagram are to scale.]
4
Production tests with LXSHARE
  • Ran with > 300 batch jobs writing into the same
    federation
  • Were able to run, but throughput flattens out at
    120 jobs. No real gain in running more than that
  • For large-scale productions we expect we will
    have to divide our farm(s) into manageable chunks
  • Conceptually no different to running distributed
    production and combining the results with grid
    tools
  • Ran into filehandle limits on both AMS (4096) and
    lockserver (1024)
  • Worked around both limits this time; we won't be
    able to if we keep increasing the size of the
    farm (a sketch of raising the per-process
    file-descriptor limit follows this list)
  • Series of rather contrived tests explored the
    hardware limits, using 120 jobs only
  • Separate federation for each batch node, local to
    the node, local metadata, writing either
    centrally or locally depending on the exact
    details of the test
  • Equivalent to 120 users analysing on private
    shallow-copies, this at least shows that that
    model scales well
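
As a purely illustrative aside (not necessarily the work-around actually used here), the per-process file-descriptor limit that AMS and the lockserver ran into can be inspected and raised from Python roughly as follows; the 4096 target is just an example value:

    import resource

    # Query the current soft and hard limits on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft =", soft, "hard =", hard)

    # Raise the soft limit towards 4096, clamped to the hard limit so an
    # unprivileged process stays within what it is allowed to set.
    new_soft = 4096
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))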

5
Hardware performance
  • Serve pileup at 20 MB/sec sustained per server
  • We need 0.5 MB/sec/job, so one server supports 40
    jobs
  • For 300 jobs we would need 8 servers (see the
    sizing sketch at the end of this slide)
  • CPU usage seen to be 25% user, 140% system (dual
    CPU)
  • Not limited by the AMS CPU consumption itself;
    either the kernel or the I/O is the limiting step
  • 300 filehandles open on the AMS at this time
  • Serving signal is no effort, need 1 MB/min/job
  • Output server writing at up to 24 GB/hour (6-7
    MB/sec)
  • Then mostly busy in I/O, probably saturated
  • Tests with 120 jobs did not seem to reach these
    hardware limits, but not clear what limit they
    did reach
  • Ran out of time for investigating further with
    full farm
  • Most of the time spent running real production,
    not tests
  • Smaller scale tests still going on
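
The server-count figures above follow from simple arithmetic; a minimal worked sketch in Python, using the rates quoted on this slide:

    import math

    rate_per_server = 20.0   # MB/sec sustained pileup rate per server (from this slide)
    rate_per_job    = 0.5    # MB/sec of pileup needed per batch job

    jobs_per_server = rate_per_server / rate_per_job      # 40 jobs per server
    servers_needed  = math.ceil(300 / jobs_per_server)    # 8 servers for 300 jobs
    print(jobs_per_server, servers_needed)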

6
(No Transcript)
7
The Request Redirection Protocol
  • We have been using the RRP since late last summer
  • All catalogue entries point at the federation
    server, federation server redirects AMS to the
    real server(s)
  • Simple implementation driven by a configuration
    file (an example table is sketched at the end of
    this slide)
  • server.cern.ch /dir/file
  • Serve /dir/file from host server.cern.ch
  • Match on full filename for now, partial matching
    soon to come
  • server.cern.ch may redirect the request again
  • defaultserver server.cern.ch
  • Files not in the RRP table can be served by
    default servers
  • Multiple or repeated default servers are OK
  • Multiple default servers are selected on a
    round-robin basis
  • Repeated servers are selected proportionately
    more frequently
  • updateinterval 30
  • Check the redirection table for changes every 30
    seconds
  • Unrecognised entries are silently ignored
  • An empty or missing RRP table is also legal
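
Putting the directives above together, a complete RRP table might look like the following; this is purely illustrative, and the hostnames and file paths are invented:

    updateinterval 30
    server2.cern.ch /cms/signal/run42.db
    server3.cern.ch /cms/pileup/minbias01.db
    defaultserver serverC.cern.ch
    defaultserver serverD.cern.ch

With such a table the two named files are redirected to server2 and server3, any other file goes to serverC or serverD in round-robin fashion, and the table is re-read every 30 seconds.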

8
Request Redirection algorithm
  • If the file exists locally, serve it directly
  • Ignore any entry for this file in the table
  • If a named server exists for the file, redirect
    to that server, else
  • If there are any default servers specified,
    choose the next one (round-robin) and redirect to
    it, else
  • Attempt to serve the file locally (assume an MSS
    backend will stage it in on demand or that I can
    create it if required)
  • When choosing a default server there are two
    variations possible
  • Remember the server for this file (add it to the
    table)
  • Suitable for staging systems, may reduce tape
    mounts
  • Don't record this server/file pair
  • Suitable for load-balancing for disk-resident
    data (pileup)
  • Not yet a run-time option, but will be very soon
    (the full decision logic is sketched below)
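
A minimal sketch of this decision logic (Python; the class, names and structure are illustrative only, not the actual AMS implementation):

    import itertools
    import os

    class Redirector:
        def __init__(self, table, default_servers, remember=False):
            self.table = dict(table)       # full filename -> named server
            self.defaults = itertools.cycle(default_servers) if default_servers else None
            self.remember = remember       # staging variant vs. pure load-balancing

        def decide(self, filename):
            # 1. If the file exists locally, serve it directly and ignore
            #    any entry for it in the table.
            if os.path.exists(filename):
                return ("serve-locally", None)
            # 2. If a named server exists for the file, redirect to it.
            if filename in self.table:
                return ("redirect", self.table[filename])
            # 3. Otherwise pick the next default server, round-robin.
            if self.defaults is not None:
                server = next(self.defaults)
                if self.remember:
                    # Staging variant: remember the choice, may reduce tape mounts.
                    self.table[filename] = server
                return ("redirect", server)
            # 4. No redirection possible: serve locally and assume an MSS
            #    backend will stage the file in, or create it if required.
            return ("serve-locally", None)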

9
RRP use scenarios
  • Server1 serves all metadata, redirects all signal
    data to server2 and all pileup to server3
  • Server2 redirects signal data to server21 and
    server22 alternately, remembering which of them
    serves a given file (needs only 2
    defaultserver entries)
  • Data staged in, and eventually purged from disk
  • Server3 redirects pileup data to server31,
    server32, and server33 alternately, using the
    load-balancing variation (needs only 3
    defaultserver entries; an example table for this
    case follows this list)
  • Each of server31/32/33 has a full local copy of
    the pileup files
  • ServerA redirects all existing files in the
    federation to serverB, has default entries for
    serverC and serverD
  • ServerB serves data as it likes (e.g. like
    server1 above?)
  • ServerC/D have no redirection, only new files are
    created on them (they are output servers).
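
For the pileup load-balancing case (server3 above), the RRP table needs nothing beyond default entries; an illustrative sketch, with invented hostnames:

    updateinterval 30
    defaultserver server31.cern.ch
    defaultserver server32.cern.ch
    defaultserver server33.cern.ch

Used with the non-remembering (load-balancing) variation, successive pileup requests are spread across the three servers in turn.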

10
Pros and cons of RRP (Pros)
  • Extremely flexible data management
  • Can increase/decrease load on servers trivially
  • Add/remove (duplicate) defaultserver entries
  • Bring servers up/take them down w/o interrupting
    data access
  • Take servers offline when they crash, so improving
    data-availability
  • No lock contention
  • Don't alter the federation catalogue, no locks to
    care about
  • Propagates changes to all clients
  • Many users in CMS take shallow-copies of the
    federation for their own purposes (e.g. adding
    tag databases). They automatically see the
    changes in the server configuration
  • Changes are propagated extremely quickly
  • On the timescale of the updateinterval an entire
    set of data servers can be replaced by new ones
  • N.B. Still have open connections from existing
    clients
  • RRP table easy to build and manage
  • on a given node, that is (but see the next slide)

11
Pros and cons of RRP (Cons)
  • RRP table very difficult to manage!
  • Need local knowledge about other hosts, not
    always simple
  • Need care organising the server hierarchy for
    efficiency
  • Avoid circular dependencies
  • Avoid putting weak servers near the top of the
    hierarchy
  • Data-access problems can be hard to trace
  • Don't have tools that can trace through the RRP
    to determine who really serves (or served) a given
    file (or who failed to)
  • Much trawling through logfiles; expert-intensive
  • Cannot safely redirect a file that is being
    written to!
  • Not surprising, that's what database locks are
    for.
  • RRP can relocate files with no knowledge of
    federation locks. If the DB is being written at
    the time, all bets are off
  • But I do bet that I won't enjoy the result!
    (caveat emptor)
  • Pro/Con statement largely a matter of Hats
  • With power (Pro) comes responsibility (Con). You
    can do what you like, but it is up to you to do
    something useful!

12
3Ware-based disk servers (I)
  • Severe problems with 3Ware-based disk servers
  • Data-loss at a rate 10⁴ times higher than 3Ware
    quotes
  • Many failures seen in configurations not tested
    by IT, but failures also seen in the supported
    configuration
  • The same configuration ALICE used, same machines,
    same disks, directly before CMS. ALICE saw no
    such problems!
  • Cause unclear. Suspect bad disks, but also
    unclear if CMS data-access patterns are
    triggering specific problems
  • CMS have multiple asynchronous writers in many
    files
  • ALICE were streaming data, at much higher rate
    than CMS
  • Bad-blocks should cause array-degradation, not
    data-loss!
  • No tool to reproduce data-access patterns
    reliably.
  • Need to run the whole farm to reproduce effects!

13
3Ware-based disk servers (II)
  • CMS intend to continue testing these servers
  • Performance is excellent, clearly on a par with
    SCSI
  • Performance per buck is unbeatable
  • 3Ware provide frequent firmware upgrades
  • Both good news and bad!
  • They care, they are active
  • They do not have a fully mature product
  • Latest version of firmware has not yet been seen
    to fail
  • Tested artificially, but for several days.
    Cautiously optimistic

14
Moving to Castor
  • Castor is the IT-proposed solution to replace
    HPSS
  • Experiments should make the move this year
    (September?)
  • CMS are part-way through the transition
  • Serving lots of data to users via Castor instead
    of HPSS
  • Serve all existing data from Castor by end of
    summer
  • New data going to Castor instead of HPSS since
    May
  • Not yet using Castor-managed pools for
    data-writing
  • Doesn't yet work for us, fix expected very soon
  • Castor and CMS still learning to understand each
    other
  • Castor will simplify our data-management in the
    future
  • Our problems will become theirs!
  • Exploiting the potential of properly managed
    storage may affect our production model

15
Conclusions
  • To make efficient use of worldwide resources we
    have coherent but separate federations
  • By extension, to make efficient use of in-farm
    resources we expect to have to work in
    partitioned federations
  • Therefore, must foresee deep copying at
    collection level to present coherent collections
    to users
  • Plan for the same tools on WAN and LAN (Grid!)
  • Production clustering requirements and analysis
    clustering requirements differ
  • Production:
  • Low latency on disk (exposure to error).
  • Weak coupling between Db files (single event
    failures destroy small amounts of data)
  • Analysis:
  • Cluster data according to use pattern(s)
  • Therefore, must foresee reclustering phases

16
More Conclusions
  • We can test the scaling in a given farm
    configuration up to some unit size (200 nodes?
    300 nodes? ... 1 node?)
  • Configure to have all HW and SW limits under
    control
  • We can test the scaling of building the coherent
    collections and reclustering, just simulating the
    output of many of the farm units
  • This makes it inherently more feasible to determine
    the future limitations without needing an
    impossibly sized facility
  • Of course, there are more copying steps, but we
    suspect that each part of the farm can be more
    efficiently used to reach higher overall
    performance