1
ATLAS DDM Operations Activities Review
  • Alexander Zaytsev, Sergey Pirogov
    Budker Institute of Nuclear Physics (Budker INP),
    Novosibirsk

On behalf of the DDM Operations Team
RDIG-ATLAS Meeting @ IHEP (20 September 2007)
2
Summary on Our Visits to ATLAS DDM Operations
Team (2006-2007)
  • Nov-Dec 2006 (A.Zaytsev, S.Pirogov)
  • Implementing the LFC/LRC Test Suite and applying
    it to measuring the performance of the existing
    version of the production LFC server and a
    non-GSI-enabled LRC testbed
  • May-Aug 2007 (A.Zaytsev, S.Pirogov)
  • Continuing the development of the LFC/LRC Test
    Suite and applying it to measuring the performance
    of the updated version of the production LFC
    server and a new GSI-enabled LRC testbed
  • Extending functionality and documenting the DDM
    Data Transfer Request Web Interface
  • Installing and configuring a complete PanDA
    server and a new implementation of the PanDA
    Scheduler Server (AutoPilot) at CERN and
    assisting the LYON Tier-1 site to do the same
  • Contributing to the recent DDM/DQ2 Functional
    Tests (Aug 2007) activity, developing tools for
    statistical analysis of the results and applying
    them to the data gathered during the tests

3
LFC/LRC Test Suite
LFC = LCG File Catalog, LRC = Local Replica Catalog
4
Motivation
  • LFC and LRC are the most commonly used virtual
    catalog solutions exploited by the LCG, OSG and
    NorduGrid
  • These two solutions provide almost the same basic
    functionality, though they were implemented in
    different ways and are supported by independent
    groups of developers
  • Although the developers do provide some
    validation and benchmarking tools for their
    products, the ATLAS DDM Operations Group needs to
    study some specific use cases derived from
    everyday DDM experience and to try to maximize
    the performance of the ATLAS data management
    system as a whole
  • So the idea was to develop an integrated LFC/LRC
    test suite in order to be able to measure the
    catalogs' performance and supply the developers
    with hints on what has to be optimized

5
Test Suite Development History (1)
  • Nov 21 - Dec 25, 2006
  • Design and development of the basic test tools
    and their distribution kit
  • Adapting the test suite to the standard LFC CLI
    and Python API and the lightweight LRC Python API
  • Building the LFC/CASTOR cross-validation tools,
    which are aware of the datasets and their naming
    convention
  • Obtaining the preliminary results on testing the
    performance of the CERN LFC server by using local
    and remote machines on the different sites
    (presented during the ATLAS Software Week in Dec
    2006). Data obtained were used to support a
    proposal for adding bulk operations to the LFC
    API
  • Getting the preliminary results on performance of
    the non-GSI-enabled LRC installation within the
    testbed at CERN
  • Jan 2007
  • The test suite was adapted to the new LFC API
    (J-P. Baud et al., A. Klimentov)
  • Measuring performance of the new
    bulk-operations-enabled version of the LFC API
    (presented during the ATLAS Distributed Data
    Management Workshop in Jan 2007)

6
Test Suite Development History (2)
  • Jan - Apr 2007
  • Installing a GSI-enabled MySQL server on the
    testbed machine lxmrrb5310.cern.ch (A. Vanyashin)
  • May 8 - Jun 28, 2007
  • Building the GSI-enabled MySQL Python client on
    the testbed machine (A. Vanyashin, W. Deng)
  • Implementing bulk LFC tests by using the standard
    production LFC API (by J-P. Baud et al.)
  • Designing a simple test LRC DB schema and
    implementing the scripts that create and fill
    the tables
  • Implementing bulk/non-bulk GSI/non-GSI LRC
    tests to be used for the reference performance
    measurements
  • Improving the usability of the test scripts run
    by users
  • CVS release of the test suite:
    offline/Production/swing/lfctest
  • Performing local bulk/non-bulk GSI/non-GSI LRC
    tests
  • Performing bulk/non-bulk LFC tests with the
    production LFC server at CERN from ATLAS VO boxes
    worldwide (see the timing sketch below)
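The bulk LFC measurements above boil down to timing a single bulk
replica lookup from a VO box. Below is a minimal sketch of such a
measurement, assuming the bulk-enabled LFC Python bindings expose
lfc_getreplicas(guids, se) returning a status code and a replica
list (the exact call name and signature depend on the LFC release);
it is an illustration, not the production lfctest code.

    # Minimal sketch (not the production lfctest suite): time one bulk
    # replica lookup against an LFC server via the LFC Python bindings.
    # Assumption: the bulk-enabled bindings expose
    # lfc.lfc_getreplicas(guids, se) returning (rc, replica_list);
    # LFC_HOST selects the server to query.
    import os
    import time

    import lfc  # LFC Python API shipped with the LCG middleware

    os.environ.setdefault("LFC_HOST", "prod-lfc-atlas-local.cern.ch")

    def bulk_query_rate(guids):
        """Return (rc, number of replicas found, GUIDs queried per second)."""
        start = time.time()
        rc, replicas = lfc.lfc_getreplicas(guids, "")  # "" = any storage element
        elapsed = time.time() - start
        found = len(replicas) if rc == 0 and replicas else 0
        return rc, found, len(guids) / elapsed if elapsed > 0 else 0.0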

7
Preliminary Results on Performance Testing of the
LFC @ CERN (14/12/2006)
8
Results on Recent Performance Testing of the LFC
@ CERN (21-27/06/2007)
The plateau bulk rate is stable up to ~28500
GUIDs (1.0 MB of request size), while for larger
requests get_replicas() returns an empty output
list. This is a limitation of the underlying DB server.
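One practical consequence: a client has to split large GUID lists so
that each bulk request stays below this ceiling, otherwise the call
silently returns nothing. A minimal, library-independent sketch of
such chunking (the 20000-GUID margin is illustrative, chosen below
the observed ~28500-GUID plateau):

    # Split a large GUID list into request-sized chunks so a bulk
    # get_replicas()-style call never enters the empty-output regime.
    # 20000 is an illustrative safety margin below the observed ceiling.
    def chunked(guids, max_per_request=20000):
        for i in range(0, len(guids), max_per_request):
            yield guids[i:i + max_per_request]

    # usage sketch:
    # results = [bulk_query(chunk) for chunk in chunked(all_guids)]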
9
Results on Recent Performance Testing of the LRC
@ CERN (21-27/06/2007)
LFC (unified, sophisticated; CERN) and LRC
(homegrown and fast; NG, BNL) solutions coexist
in the LHC-related Grid environment,
and that complicates centralized data
management a lot!
10
ATLAS Requirements for LFC Functionality
Enhancement Memo (May 9, 2007)
An LFC testbed powered by the new pre-production
version of the LFC server, which is supplied with
the requested methods, was provided in Aug 2007
and is being validated (the feedback channel
is established!)
The optimization process is far from over, and we
are aiming to prepare another request covering
more use cases on behalf of the DDM group
A test-suite-based centralized LFC performance
monitoring system is still on the schedule!
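A minimal sketch of what such a centralized monitoring loop could
look like: run a measurement (for example a bulk query rate probe
like the one sketched earlier) at a fixed interval and append
timestamped results to a file for later plotting. All names and the
CSV layout are illustrative, not the planned design.

    # Periodic LFC performance probe: call a measurement function at a
    # fixed interval and append timestamped results to a CSV history file.
    import csv
    import time

    def monitor(measure, guids, interval_s=3600, out="lfc_rate_history.csv"):
        while True:
            rc, nreplicas, rate = measure(guids)
            with open(out, "a") as f:
                csv.writer(f).writerow(
                    [int(time.time()), rc, nreplicas, "%.1f" % rate])
            time.sleep(interval_s)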
11
DDM Data Transfer Request Interface
12
DDM Req. I/F: a high-level tool for ATLAS
production tasks and data replication management
(based on DQ2 and LFC/LRC)
[Diagram: User Registration I/F, Registration Data
Browser, Job Definition Interface, Dataset Browser,
DDM Req. I/F]
13
Overview / Recent Development Activities
Each dataset replication request is defined by a
regular expression (pattern) on the dataset name,
and the known patterns can be looked up manually
or via a dedicated filter interface
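For illustration, pattern matching of this kind reduces to applying a
regular expression to each known dataset name; the pattern and
dataset names below are made-up examples, not actual ATLAS
nomenclature.

    # Select the datasets covered by a replication request defined as a
    # regular expression on the dataset name (names here are invented).
    import re

    pattern = re.compile(r"^mc12\..*\.AOD\..*$")   # hypothetical request pattern

    datasets = [
        "mc12.005001.pythia_example.AOD.v12000601",
        "mc12.005001.pythia_example.ESD.v12000601",
    ]
    selected = [name for name in datasets if pattern.match(name)]
    # -> only the AOD dataset matches and would be scheduled for replication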
14
Overview / Recent Development Activities
Before you can submit a (Task/DDM) request you
need to register; then your registration has to
be approved by the Grid/Cloud administrator.
15
Overview / Recent Development Activities
Submitting a new DDM request (which must be
approved by a cloud/physics group administrator
before any data replication takes place!)
16
Overview / Recent Development Activities
Before a DDM request is approved, it is assigned
the pending state. Once it is approved, it is
scheduled for actual data replication to be
performed (transfer state).
Unless the request is done, it can be deleted.
Only administrators are allowed to modify the
status of a request.
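The state rules above can be summarized as a small state machine; the
sketch below is only an illustration of those rules (pending to
transfer to done, deletion allowed unless done, administrator-only
changes), not the Req. I/F implementation.

    # Illustrative request life cycle: pending -> transfer -> done.
    ALLOWED = {"pending": {"transfer"}, "transfer": {"done"}}

    class DDMRequest:
        def __init__(self):
            self.state = "pending"

        def set_state(self, new_state, is_admin):
            if not is_admin:
                raise PermissionError("only administrators may modify a request")
            if new_state not in ALLOWED.get(self.state, set()):
                raise ValueError("illegal transition %s -> %s"
                                 % (self.state, new_state))
            self.state = new_state

        def can_delete(self):
            return self.state != "done"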
17
Summary on Development Activities
  • May-Aug 2007
  • User registration interface has been implemented
  • Dataset browsing functionality within the DDM
    requests section
  • ATLAS offline software release patterns.
  • Filtering functionality for dataset name patterns
  • Extending the DDM request interface
  • Adding new request controls and states
  • Monitor performance evaluation
  • Server-side caching of the data transfer summary
    pages
  • Adding filtering functionality for the lists of
    replicas
  • Documenting the monitor DB Schema being used
  • Many LFC/LRC, DQ2 and DB access optimizations
  • Refactoring of the scripts validating subscribed
    datasets by checking the local LFC/LRC catalogs:
    a 24-times (!) improvement of performance by
    exploiting LFC bulk operations (see the sketch
    below)
  • Elimination of more minor bugs and addition of
    features

The DDM Req. I/F can be considered the most
suitable and user-friendly tool for ATLAS DDM
operations available so far
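The 24-times speed-up mentioned above comes from replacing one
catalog round trip per file with a single bulk lookup per dataset. A
minimal sketch of that idea (bulk_query is assumed to behave like a
bulk get_replicas call: GUID list in, replica records with a guid
attribute out; all names are illustrative):

    # Validate a subscribed dataset with one bulk catalog lookup instead
    # of one lookup per file; returns whether all files have replicas and
    # which GUIDs are still missing. bulk_query is an assumed bulk call.
    def validate_dataset(file_guids, bulk_query):
        rc, replicas = bulk_query(file_guids)   # single round trip for all files
        present = set(r.guid for r in replicas) if rc == 0 and replicas else set()
        missing = [g for g in file_guids if g not in present]
        return len(missing) == 0, missing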
18
PanDA Server Installation at CERN
PanDA = Production and Distributed Analysis System
19
Summary on Development Activities
  • May-Aug 2007
  • Testbed machines configuration at CERN
  • PanDA DB servers configuration and DB schema
    deployment
  • PanDA configuration system revision
  • PanDA Scheduler Server (AutoPilot) installation
    at CERN
  • Sending test pilots to various sites, including
    SARA cloud
  • Assisting LYON to deploy their own AutoPilot
    instance
  • A production installation of the AutoPilot is
    being established at CERN (and at LYON as well)

The new implementation of PanDA (AutoPilot) is
well designed and can be configured and installed
on many LCG sites without major effort.
The production version of the complete server is
not so easy to deploy, since it has many roots
going deep into the OSG infrastructure and the
BNL Tier-1 site in particular.
20
DDM/DQ2 Functional Tests
DQ2 = Don Quijote DMS 2
21
DDM/DQ2 Functional Tests (Aug 2007)
  • Tests Scope
  • Data transfer from CERN to Tiers for datasets
    with average file size 0.5 GB and 4 GB. This step
    simulated the flow of ATLAS data from CERN to the
    sites
  • Step 1: Data transfer from CERN to Tiers
  • 1a: Data transfer from Tier-0 to Tier-1
  • 1b: Data replication within clouds (Tier-1 to
    multiple Tier-2s)
  • Step 2: MC production data flow simulation. Data
    transfer from multiple Tier-2s to a single Tier-1.
  • Data transfer between regional centers
  • Tier-1 to Tier-1
  • from Tier-2 to Tier-1 of another cloud
  • from Tier-1 to Tier-2 of another cloud

22
DDM/DQ2 Functional Tests (Aug 2007)
  • DDM/DQ2 functional test
  • (Step 1a) Tier-0 >> all Tier-1s was started on
    Aug 7, 2007
  • (Step 1a) Large files (4 GB per file) were
    subscribed on Aug 9, 2007
  • (Step 1b) Tier-1 >> Tier-2s within each cloud was
    started on Aug 13, 2007
  • FT was stopped at 13:00 UTC on Aug 20, 2007
  • Summary on Tier-0 >> Tier-1s transfers (Step 1a):
    76 datasets (2724 files in total, 1334 files
    transferred)
  • The activity is reflected on the DDM Wiki page:
    https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationsGroup#Data_transfer_functional_test_Ti
    Please find the initial FT proposal
    DQ2_LFC_LRC_tests_proposal_Aug2007.pdf attached
  • The monitoring is done by using a dedicated
    monitoring tool located at the DDM Req. I/F
    monitor subsection:
    http://panda.atlascomp.org/?mode=listFunctionalTests
    http://panda.atlascomp.org/?mode=listFunctionalTests&testType=T1toT2s
  • Subscription and test dataset management:
    A. Klimentov, P. Nevski
  • Tools for data gathering from the LFC/LRC
    catalogs, web monitoring tools: S. Pirogov
  • ROOT-based statistical analysis tools: A. Zaytsev
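For illustration, the ROOT-based analysis of per-file transfer rates
can be reduced to filling and saving a histogram with PyROOT; the
input values and binning below are invented placeholders, not FT
data.

    # Fill a histogram of per-file average transfer rates (MB/s) and save it.
    import ROOT

    rates_mb_per_s = [0.4, 1.2, 3.5, 0.8, 2.1]   # placeholder values, not FT results

    h = ROOT.TH1F("rate", "FT Step 1a file transfer rate;MB/s;files",
                  50, 0.0, 10.0)
    for r in rates_mb_per_s:
        h.Fill(r)

    c = ROOT.TCanvas("c", "ft_rates", 800, 600)
    h.Draw()
    c.SaveAs("ft_step1a_rates.png")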

23
Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
10 datasets: no files transferred (not analyzed);
66 datasets with at least one file transferred;
25% of test datasets were completed
PRELIMINARY RESULTS
24
Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
PRELIMINARY RESULTS
25
Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
Average transfer rates imply software- and
hardware-induced delays and the impact of the
overall stability of the system!
PRELIMINARY RESULTS
< 1 MB/s region
26
Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
Average transfer rates imply software- and
hardware-induced delays and the impact of the
overall stability of the system!
[Plot site labels: CNAF, NDGFT1, BNL, FZK, ASGC,
NIKHEF, TRIUMF, SARA, RAL, PIC, LYON]
PRELIMINARY RESULTS
27
FT Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA,
NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour
time bins); overall (all sites, 6-hour time bins)
in grey
Beginning of Step 1a (Aug 7, 2007)
Large DS subscribed, RAL is back (2.0 d)
Step 1b begins (6.5 d)
FT is stopped (13.3 d)
INFORMATION RETRIEVAL TIME: 26 Aug, 20:38
28 files (46 GB in total) were transferred after
the official FT stop
PRELIMINARY RESULTS
Dataset subscription activity: GUID creation with
no replica creation time defined
28
FT Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA,
NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour
time bins); overall (all sites, 6-hour time bins)
in grey
Large DS subscribed, RAL is back (2.0 d)
Step 1b begins (6.5 d)
INFORMATION RETRIEVAL TIME: 26 Aug, 20:38
FT is stopped (13.3 d)
PRELIMINARY RESULTS
Dataset subscription activity: GUID creation with
no replica creation time defined
29
FT Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
1st dip
2nd dip
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA,
NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour
time bins); overall (all sites, 6-hour time bins)
in grey
PRELIMINARY RESULTS
30
FT Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
1st dip
2nd dip
Legend (Tier-1 sites): LYON, PIC, CNAF, RAL, SARA,
NIKHEF, FZK, TRIUMF, ASGC, BNL, NDGFT1 (1-hour
time bins); overall (all sites, 6-hour time bins)
in grey
Large files (> 1 GB) / Small files (< 1 GB)
PRELIMINARY RESULTS
31
FT Step 1a (Tier0-Tier1s 76 DS, 1334 CpFiles)
All files involved in the FT Step 1a included
(LFC only)
PRELIMINARY RESULTS
The stability and reliability of the DDM/DQ2
services should be increased significantly in
order to match the future challenges of the
full-scale ATLAS DDM activities!
32
Summary / Future Plans
  • LFC/LRC Test Suite
  • Maintaining the optimization loop
  • Keeping the test suite up-to-date
  • Validating new versions of the LFC server
  • Establishing continuous LFC performance
    monitoring system
  • DDM Req. I/F
  • Adding more advanced DDM request controls
  • More sophisticated request validation mechanisms
  • More documentation
  • DB schema optimization
  • PanDA _at_ CERN
  • Complete PanDA server deployment at CERN
  • Production installation of AutoPilot at CERN and
    LYON
  • PanDA Server installation manuals
  • DDM/DQ2 Functional Tests
  • M4 data replication monitoring
  • Automatic data analysis and plot generation tools
  • Better understanding of the bottlenecks in the
    existing DDM infrastructure
  • More functional test sessions in the future

[Slide annotations: CHEP2007, ATLAS SW 06/2007]
33
Backup Slides
34
Infrastructure Involved (updated)
  • LFC server: prod-lfc-atlas-local.cern.ch
  • LFC Python API:
    /afs/cern.ch/project/gd/LCG-share/3.0.21-0/lcg/lib/python
    (production version with bulk operations enabled)
  • LRC testbed (GSI-enabled MySQL server):
    lxmrrb5310.cern.ch
  • LRC Python API: exploiting the local builds of
    Python 2.4, mysql-openssl, and mysql-gsi-5.0.37
  • Machine used for running the local tests:
    lxmrrb5310.cern.ch
  • CPUs: 2x Intel Xeon 3.0 GHz (2 MB L2 cache)
  • RAM: 4 GB
  • NIC: 1 Gbps Ethernet

On the remote sites, similar dual-CPU ATLAS VO
boxes were used.
  • Local test conditions
  • Background load: < 2% (CPUs), < 45% (RAM)
  • Ping to the LFC (LRC) server: 6.8 (0.05) ms

35
Functional Test Aug 2007
Datasets Replication Status within clouds