Title: ATLAS DDM Operations Activities Review
1. ATLAS DDM Operations Activities Review
Alexander Zaytsev, Sergey Pirogov
Budker Institute of Nuclear Physics (Budker INP), Novosibirsk
On behalf of the DDM Operations Team
RDIG-ATLAS Meeting @ IHEP (20 September 2007)
2. Summary of Our Visits to the ATLAS DDM Operations Team (2006-2007)
- Nov-Dec 2006 (A. Zaytsev, S. Pirogov)
  - Implementing the LFC/LRC Test Suite and applying it to measuring the performance of the existing version of the production LFC server and a non-GSI-enabled LRC testbed
- May-Aug 2007 (A. Zaytsev, S. Pirogov)
  - Continuing the development of the LFC/LRC Test Suite and applying it to measuring the performance of the updated version of the production LFC server and a new GSI-enabled LRC testbed
  - Extending the functionality of, and documenting, the DDM Data Transfer Request Web Interface
  - Installing and configuring a complete PanDA server and a new implementation of the PanDA Scheduler Server (AutoPilot) at CERN, and assisting the LYON Tier-1 site to do the same
  - Contributing to the recent DDM/DQ2 Functional Tests activity (Aug 2007), developing tools for statistical analysis of the results and applying them to the data gathered during the tests
3. LFC/LRC Test Suite
LFC = LCG File Catalog, LRC = Local Replica Catalog
4. Motivation
- LFC and LRC are the most commonly used virtual catalog solutions exploited by the LCG, OSG and NorduGrid
- These two solutions provide almost the same basic functionality, though they were implemented in different ways and are supported by independent groups of developers
- Although the developers do provide some validation and benchmarking tools for their products, the ATLAS DDM Operations Group needs to study specific use cases derived from everyday DDM experience and try to maximize the performance of the ATLAS data management system as a whole
- So the idea was to develop an integrated LFC/LRC test suite in order to be able to measure the performance of the catalogs and supply the developers with hints on what has to be optimized
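As an illustration of the kind of measurement involved, the sketch below times a series of catalog lookups and reports an aggregate rate; the lookup callable is a hypothetical stand-in for the LFC or LRC Python API calls used by the actual suite.

    import time

    def measure_lookup_rate(lookup, guids, label="catalog"):
        # `lookup` is a stand-in for an LFC or LRC query call: it takes one
        # GUID and returns the replicas found for it (or an empty result).
        start = time.time()
        found = sum(1 for guid in guids if lookup(guid))
        elapsed = time.time() - start
        rate = len(guids) / elapsed if elapsed > 0 else float("inf")
        print("%s: %d GUIDs queried, %d with replicas, %.1f lookups/s"
              % (label, len(guids), found, rate))
        return rate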
5. Test Suite Development History (1)
- Nov 21 - Dec 25, 2006
  - Design and development of the basic test tools and their distribution kit
  - Adapting the test suite to the standard LFC CLI and Python API and to the lightweight LRC Python API
  - Building the LFC/CASTOR cross-validation tools, which are aware of the datasets and their naming convention
  - Obtaining preliminary results on the performance of the CERN LFC server, using local and remote machines at different sites (presented during the ATLAS Software Week in Dec 2006). The data obtained were used to support a proposal for adding bulk operations to the LFC API
  - Obtaining preliminary results on the performance of the non-GSI-enabled LRC installation within the testbed at CERN
- Jan 2007
  - The test suite was adapted to the new LFC API (J-P. Baud et al., A. Klimentov)
  - Measuring the performance of the new bulk-operations-enabled version of the LFC API (presented during the ATLAS Distributed Data Management Workshop in Jan 2007)
6. Test Suite Development History (2)
- Jan - Apr 2007
  - Installing a GSI-enabled MySQL server on the testbed machine lxmrrb5310.cern.ch (A. Vanyashin)
- May 8 - Jun 28, 2007
  - Building the GSI-enabled MySQL Python client on the testbed machine (A. Vanyashin, W. Deng)
  - Implementing bulk LFC tests using the standard production LFC API (by J-P. Baud et al.)
  - Designing a simple test LRC DB schema and implementing the scripts that create and fill the tables (a sketch follows this list)
  - Implementing bulk/non-bulk GSI/non-GSI LRC tests to be used for the reference performance measurements
  - Improving the usability of the test scripts run by users
  - CVS release of the test suite: offline/Production/swing/lfctest
  - Performing local bulk/non-bulk GSI/non-GSI LRC tests
  - Performing bulk/non-bulk LFC tests against the production LFC server at CERN from ATLAS VO boxes worldwide
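A minimal sketch of what such a test schema and fill script might look like is given below; the table layout and column names are illustrative only (not the actual LRC schema), and sqlite3 is used purely to keep the example self-contained, whereas the real testbed ran a GSI-enabled MySQL server.

    import sqlite3
    import uuid

    # Illustrative schema: one table mapping a GUID to a logical and a
    # physical file name, roughly what a local replica catalog stores.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS replicas (
        guid TEXT PRIMARY KEY,
        lfn  TEXT NOT NULL,
        pfn  TEXT NOT NULL
    );
    """

    def fill_test_catalog(db_path, n_entries):
        # Create the test table and populate it with synthetic entries.
        conn = sqlite3.connect(db_path)
        conn.execute(SCHEMA)
        rows = []
        for i in range(n_entries):
            guid = str(uuid.uuid4())
            lfn = "test.dataset._%06d.root" % i
            pfn = "srm://example.org/path/%s" % lfn  # hypothetical SE path
            rows.append((guid, lfn, pfn))
        conn.executemany("INSERT INTO replicas VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()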
7. Preliminary Results on Performance Testing of the LFC @ CERN (14/12/2006)
8. Results on Recent Performance Testing of the LFC @ CERN (21-27/06/2007)
The plateau bulk rate is stable up to about 28500 GUIDs (1.0 MB of request size), while for larger requests get_replicas() returns an empty output list. This is a limitation of the underlying DB server.
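One client-side way to live with such a limit is to split large requests into chunks below the observed threshold, as sketched below; the get_replicas callable and the default chunk size are assumptions, not part of the LFC API.

    def bulk_lookup_chunked(get_replicas, guids, chunk_size=20000):
        # Query replicas in chunks to stay below the observed request-size
        # limit of the bulk call (empty result above ~28500 GUIDs).
        replicas = []
        for i in range(0, len(guids), chunk_size):
            chunk = guids[i:i + chunk_size]
            replicas.extend(get_replicas(chunk))
        return replicas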
9. Results on Recent Performance Testing of the LRC @ CERN (21-27/06/2007)
LFC (unified and sophisticated; CERN) and LRC (homegrown and fast; NG, BNL) solutions coexist in the LHC-related Grid environment, and that complicates centralized data management a lot!
10. ATLAS Requirements for LFC Functionality Enhancement Memo (May 9, 2007)
An LFC testbed powered by the new pre-production version of the LFC server, supplied with the requested methods, was provided in Aug 2007 and is being validated (the feedback channel is established!)
The optimization process is far from over, and we are aiming to prepare another request covering more use cases on behalf of the DDM group
A test-suite-based centralized LFC performance monitoring system is still on the schedule!
11. DDM Data Transfer Request Interface
12. DDM Req. I/F: a high-level tool for ATLAS production task and data replication management (based on DQ2 and LFC/LRC)
[Architecture diagram: User Registration I/F, Registration Data Browser, Job Definition Interface, Dataset Browser, DDM Req. I/F]
13. Overview / Recent Development Activities
Each dataset replication request is defined by a regular expression (pattern) on the dataset name; the known patterns can be looked up manually or via a dedicated filter interface
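For illustration, selecting dataset names against such a pattern amounts to a standard regular-expression match; the pattern and dataset names below are invented for the example.

    import re

    # Illustrative only: the pattern and dataset names are made up to show
    # the selection mechanism, they are not real request patterns.
    pattern = re.compile(r"^mc12\.005300\..*\.AOD\..*$")

    datasets = [
        "mc12.005300.sample_A.recon.AOD.v1",
        "mc12.005300.sample_A.recon.ESD.v1",
        "mc12.005301.sample_B.recon.AOD.v1",
    ]

    matching = [name for name in datasets if pattern.match(name)]
    print(matching)  # -> ['mc12.005300.sample_A.recon.AOD.v1']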
14. Overview / Recent Development Activities
Before you can submit a (Task/DDM) request you need to register; then your registration has to be approved by the Grid/Cloud administrator
15. Overview / Recent Development Activities
Submitting a new DDM request (which must be approved by a cloud/physics group administrator before any data replication takes place!)
16. Overview / Recent Development Activities
Before a DDM request is approved it is assigned the pending state. Once it is approved, it is scheduled for actual data replication (transfer state).
Unless the request is done, it can be deleted.
Only administrators are allowed to modify the status of a request.
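The state handling described above can be summarized by a small transition table; the helper below is an illustration of the rules on this slide, not the actual Req. I/F code.

    # Allowed request state transitions as described on the slide.
    TRANSITIONS = {
        "pending":  {"transfer"},   # approved by an administrator
        "transfer": {"done"},       # replication finished
    }
    DELETABLE = {"pending", "transfer"}  # anything not yet "done"

    def can_delete(request):
        # A request may be deleted at any point before it reaches "done".
        return request["state"] in DELETABLE

    def change_state(request, new_state, is_admin):
        # Apply a state change, enforcing the admin-only rule.
        if not is_admin:
            raise PermissionError("only administrators may modify request status")
        if new_state not in TRANSITIONS.get(request["state"], set()):
            raise ValueError("illegal transition %s -> %s"
                             % (request["state"], new_state))
        request["state"] = new_state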
17. Summary of Development Activities
- May-Aug 2007
  - User registration interface has been implemented
  - Dataset browsing functionality within the DDM requests section
  - ATLAS offline software release patterns
  - Filtering functionality for dataset name patterns
  - Extending the DDM request interface
  - Adding new request controls and states
  - Monitor performance evaluation
  - Server-side caching of the data transfer summary pages
  - Adding filtering functionality for the lists of replicas
  - Documenting the monitor DB schema being used
  - Many LFC/LRC, DQ2 and DB access optimizations
  - Refactoring of the scripts that validate subscribed datasets by checking the local LFC/LRC catalogs: a 24-fold (!) performance improvement by exploiting LFC bulk operations (see the sketch after this list)
  - More minor bug fixes and feature additions
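The gain comes from replacing one catalog round trip per file with a single bulk call per dataset, as sketched below; get_replica and get_replicas are hypothetical stand-ins for the per-GUID and bulk LFC calls.

    def validate_per_file(get_replica, guids):
        # Original approach: one catalog round trip per GUID.
        return [get_replica(guid) for guid in guids]

    def validate_bulk(get_replicas, guids):
        # Refactored approach: a single bulk call per dataset, which removes
        # most of the per-call network latency (the source of the ~24x gain).
        return get_replicas(guids)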
The DDM Req. I/F can be considered the most suitable and user-friendly tool for ATLAS DDM operations available so far
18. PanDA Server Installation at CERN
PanDA = Production and Distributed Analysis system
19. Summary of Development Activities
- May-Aug 2007
  - Configuration of the testbed machines at CERN
  - PanDA DB servers configuration and DB schema deployment
  - PanDA configuration system revision
  - PanDA Scheduler Server (AutoPilot) installation at CERN
  - Sending test pilots to various sites, including the SARA cloud
  - Assisting LYON to deploy their own AutoPilot instance
  - A production installation of AutoPilot is being established at CERN (and at LYON as well)
The new implementation of PanDA (AutoPilot) is well designed and can be configured and installed on many LCG sites without major effort
The production version of the complete server is not so easy to deploy, since it has many roots going deep into the OSG infrastructure and the BNL Tier-1 site in particular
20. DDM/DQ2 Functional Tests
DQ2 = Don Quijote 2
21. DDM/DQ2 Functional Tests (Aug 2007)
- Test scope
  - Data transfer from CERN to the Tiers for datasets with average file sizes of 0.5 GB and 4 GB. This step simulated the flow of ATLAS data from CERN to the sites
  - Step 1: Data transfer from CERN to the Tiers
    - a) Data transfer from Tier-0 to Tier-1
    - b) Data replication within clouds (Tier-1 to multiple Tier-2s)
  - Step 2: MC production data flow simulation. Data transfer from multiple Tier-2s to a single Tier-1
  - Data transfer between regional centers
    - Tier-1 to Tier-1
    - from a Tier-2 to the Tier-1 of another cloud
    - from a Tier-1 to a Tier-2 of another cloud
22. DDM/DQ2 Functional Tests (Aug 2007)
- DDM/DQ2 functional test
  - (Step 1a) Tier-0 >> all Tier-1s was started on Aug 7, 2007
  - (Step 1a) Large files (4 GB per file) were subscribed on Aug 9, 2007
  - (Step 1b) Tier-1 >> Tier-2s within each cloud was started on Aug 13, 2007
  - FT was stopped at 13:00 UTC on Aug 20, 2007
  - Summary of Tier-0 >> Tier-1s transfers (Step 1a): 76 datasets (2724 files in total, 1334 files transferred)
  - The activity is reflected on the DDM Wiki page https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationsGroup#Data_transfer_functional_test_Ti . Please find the initial FT proposal DQ2_LFC_LRC_tests_proposal_Aug2007.pdf attached
  - The monitoring is done using a dedicated monitoring tool located in a subsection of the DDM Req. I/F monitor: http://panda.atlascomp.org/?mode=listFunctionalTests&testType=T1toT2s
  - Subscription and test dataset management: A. Klimentov, P. Nevski
  - Tools for data gathering from the LFC/LRC catalogs, web monitoring tools: S. Pirogov
  - ROOT-based statistical analysis tools: A. Zaytsev
23. Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
10 datasets: no files transferred (not analyzed); 66 datasets with at least one file transferred; 25% of the test datasets were completed
PRELIMINARY RESULTS
24. Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
PRELIMINARY RESULTS
25. Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
Average transfer rates imply software- and hardware-induced delays and the impact of the overall stability of the system!
PRELIMINARY RESULTS
< 1 MB/s region
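Such an end-to-end average can be derived directly from the catalog timestamps, as in the sketch below; the argument names are assumptions about how the per-file quantities are labeled.

    def average_rate_mb_per_s(file_size_bytes, guid_created, replica_created):
        # End-to-end average rate from catalog timestamps (Unix seconds).
        # It includes every software/hardware delay between registration of
        # the GUID and creation of the replica, not just the raw network
        # transfer, which is why the low-rate region is so populated.
        elapsed = replica_created - guid_created
        if elapsed <= 0:
            return None
        return file_size_bytes / (1024.0 * 1024.0) / elapsed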
26. Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
Average transfer rates imply software- and hardware-induced delays and the impact of the overall stability of the system!
[Per-site plot panels: CNAF, NDGFT1, BNL, FZK, ASGC, NIKHEF, TRIUMF, SARA, RAL, PIC, LYON]
PRELIMINARY RESULTS
27. FT Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA, NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour time bins); overall (all sites, 6-hour time bins) in grey
Beginning of Step 1a (Aug 7, 2007)
Large DS subscribed, RAL is back (2.0 d)
Step 1b begins (6.5 d)
FT is stopped (13.3 d)
INFORMATION RETRIEVAL TIME: 26 Aug, 20:38
28 files (46 GB in total) were transferred after the official FT stop
PRELIMINARY RESULTS
Dataset subscription activity: GUID creation with no replica creation time defined
28. FT Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA, NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour time bins); overall (all sites, 6-hour time bins) in grey
Large DS subscribed, RAL is back (2.0 d)
Step 1b begins (6.5 d)
INFORMATION RETRIEVAL TIME: 26 Aug, 20:38
FT is stopped (13.3 d)
PRELIMINARY RESULTS
Dataset subscription activity: GUID creation with no replica creation time defined
29. FT Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
1st dip
2nd dip
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA, NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour time bins); overall (all sites, 6-hour time bins) in grey
PRELIMINARY RESULTS
30. FT Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
1st dip
2nd dip
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA, NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour time bins); overall (all sites, 6-hour time bins) in grey
Large files (> 1 GB) / Small files (< 1 GB)
PRELIMINARY RESULTS
31. FT Step 1a (Tier-0 to Tier-1s: 76 DS, 1334 CpFiles)
All files involved in FT Step 1a included (LFC only)
PRELIMINARY RESULTS
The stability and reliability of the DDM/DQ2 services should be increased significantly in order to match the future challenges of full-scale ATLAS DDM activities!
32. Summary and Future Plans
- LFC/LRC Test Suite
  - Maintaining the optimization loop
  - Keeping the test suite up to date
  - Validating new versions of the LFC server
  - Establishing a continuous LFC performance monitoring system
- DDM Req. I/F
  - Adding more advanced DDM request controls
  - More sophisticated request validation mechanisms
  - More documentation
  - DB schema optimization
- PanDA @ CERN
  - Complete PanDA server deployment at CERN
  - Production installation of AutoPilot at CERN and LYON
  - PanDA server installation manuals
- DDM/DQ2 Functional Tests
  - M4 data replication monitoring
  - Automatic data analysis and plot generation tools
  - Better understanding of the bottlenecks in the existing DDM infrastructure
  - More functional test sessions in the future
[Slide annotations: CHEP2007, ATLAS SW 06/2007]
33. Backup Slides
34. Infrastructure Involved (updated)
- LFC server: prod-lfc-atlas-local.cern.ch
- LFC Python API: /afs/cern.ch/project/gd/LCG-share/3.0.21-0/lcg/lib/python (production version with bulk operations enabled)
- LRC testbed (GSI-enabled MySQL server): lxmrrb5310.cern.ch
- LRC Python API: exploiting local builds of Python 2.4, mysql-openssl, and mysql-gsi-5.0.37
- Machine used for running the local tests: lxmrrb5310.cern.ch
  - CPUs: 2x Intel Xeon 3.0 GHz (2 MB L2 cache)
  - RAM: 4 GB
  - NIC: 1 Gbps Ethernet
- On the remote sites, similar dual-CPU ATLAS VO boxes were used
- Local test conditions
  - Background load: < 2% (CPUs), < 45% (RAM)
  - Ping to the LFC (LRC) server: 6.8 (0.05) ms
35. Functional Test Aug 2007: Dataset Replication Status within Clouds