Title: DDM
1. DDM
- ATLAS Software Week, 23-27 May 2005
2. Outline
- Existing Tools
- Lessons Learned
- DQ2 Architecture
- Datasets
- Catalogs, mappings and interactions
- DQ2 Prototype
- Technology
- Servers, CLI, web page
- Data Movement, subscriptions and datablocks
- Plans for future
- SC3
- Evolution of prototype
3. Tools
- dms2
  - First version of the tool for end-users
  - Uses the DQ1 Production Servers
  - Issues: slow servers and Grid catalogs, problems with file movement
- dms3 (current version)
  - Uses an improved version of the DQ1 Servers, separate from Production
  - Includes Reliable File Transfer
  - Issues: slow Grid catalogs, dealing only with individual files is difficult
- dms4
  - Will use the new Distributed Data Management infrastructure (DQ2 Servers)
  - New Grid catalogs, native support for Datasets, automatic movement of blocks of data
4. dms3
- dms3 is currently the tool to get data from the Rome or DC-2 production
- Still based on the older DQ Production Servers and existing Grid catalogs
- Requires Grid Certificates
  - http://lcg-registrar.cern.ch
- Documentation on Twiki, including installation notes and common use cases
  - https://uimon.cern.ch/twiki/bin/view/Atlas/DonQuijoteDms3
  - (please, feel free to add notes to this page!)
- Software
  - CERN
    - /afs/cern.ch/atlas/offline/external/DQClient/dms3/dms3.py
  - External users
    - Download http://cern.ch/mbranco/cern/dms3.tar.gz
    - or contact your site administrator to have a shared installation
5. Reliable File Transfer
- DQ includes a simple file transfer service, RFT
  - MySQL backend database with transfer definitions and queue
  - Set of transfer agents fetching requests from the database (sketched below)
  - Uses only GridFTP to move data; supports SRM (get, put) and transfer priorities
  - Uses the existing DQ Servers to interface with the Grid catalogs
- Two client interfaces
  - dms3, using the replicate command: the transfer is queued into RFT
    - for end-users only; transfer priority is limited
  - super-user client
    - allows priorities to be set, transfers to be rescheduled or cancelled, tagging of transfers, ...
    - access limited to Production
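- As a rough illustration of the agent model above, the sketch below shows a transfer agent polling a MySQL queue table and handing requests to GridFTP. The table layout, column names and the submit_gridftp() helper are hypothetical, not the actual RFT schema.

    # Minimal sketch of an RFT-style transfer agent (hypothetical schema).
    import time
    import MySQLdb  # DB-API 2.0 driver for the MySQL backend

    def submit_gridftp(source_surl, dest_surl):
        """Placeholder for the actual GridFTP/SRM transfer call."""
        raise NotImplementedError

    def agent_loop(db):
        while True:
            cur = db.cursor()
            # Take the highest-priority queued request
            cur.execute("SELECT id, source_surl, dest_surl FROM transfer_queue "
                        "WHERE state='QUEUED' ORDER BY priority DESC LIMIT 1")
            row = cur.fetchone()
            if row is None:
                time.sleep(30)            # nothing queued, poll again later
                continue
            req_id, src, dst = row
            cur.execute("UPDATE transfer_queue SET state='ACTIVE' WHERE id=%s", (req_id,))
            db.commit()
            try:
                submit_gridftp(src, dst)
                state = 'DONE'
            except Exception:
                state = 'FAILED'          # a real agent would retry and log
            cur.execute("UPDATE transfer_queue SET state=%s WHERE id=%s", (state, req_id))
            db.commit()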
6. Reliable File Transfer
- Transfer rate depends mostly on the status of sites
  - Can go from smooth running to a very large number of failures
  - critical for files with a single copy on the Grid
- Bottlenecks
  - Grid catalogs for querying
  - Transferring individual files, not blocks of files
  - Lack of people to monitor transfer failures and sites
  - A single machine being used
    - 7000 transfers/day (mostly requests to transfer data to CERN)
    - Will not increase, otherwise it kills Castor@CERN for other users
    - Before Christmas, with an additional machine, we did double that amount
    - But it was killing the Castor GridFTP front-end machines very often
7. Lessons learned
- Catalogs were provided by the Grid providers and used as-is
  - Granularity: file-level
    - No datasets, no file collections
  - No scoping of queries (difficult to find data, slow)
  - No bulk operations
  - Metadata support not usable
    - Too slow
    - No valid workaround to query data per site, MD5 checksums, file sizes
    - Logical Collection Name as metadata, e.g. /datafiles/rome/
  - Catalogs not always geographically distributed
    - Single point of failure (middleware, people/timezones)
- No ATLAS resources information system (with known/negotiated QoS)
  - and unreliable information systems from the Grid providers
8. Lessons learned
- No managed and transparent data access, unreliable GridFTP
  - SRM (and GridFTP with mass storage) still not sufficient
  - Difficult to handle the different mass storage stagers from the Grid
- DQ
  - Single point of failure
  - Naïve validation procedure
    - No self-validation at sites, between site contents and global catalogs
- Operations level
  - Too centralized
  - Insufficient man-power
    - Still need to identify site contacts, at least for major sites
  - Insufficient training for users/production managers
    - Lack of coordination: launching requests for files not staged, ...
  - Also due to the lack of automatic connections between Data Management and Production System tasks
  - Monitoring: insufficient tools and people!
9. Lessons learned
- Multiple flavors of Grid Catalogs with slightly different interfaces
  - Effort wasted on developing common interfaces
  - Minimal functionality with maximum error propagation!
- No single data management tool for
  - Production
  - End-user analysis
  - (common across all Grids!)
- No reliable file transfer plugged into the Production System
  - Moving individual files is non-optimal!
- Too many sites used for permanent storage
  - Should restrict the list and comply with the Computing Model and Tier organization
10. Distributed Data Management - outline
- The Database (and Data Management) project recently took responsibility in this area (formerly Production)
- Approach: proceed by evolving Don Quijote, while revisiting requirements, design and implementation
  - Provide continuity of the needed functionality
- Add dataset management above file management (see the data-model sketch below)
  - Dataset: named collection of files + descriptive metadata
  - Container Dataset: named collection of datasets + descriptive metadata
- Design, implementation and component selection driven by startup requirements for performance and functionality
  - Covering end-user analysis (with priority) as well as production
  - Make decisions on implementation and component selection accordingly, to achieve the most capable system
  - Foresee progressive integration of new middleware over time
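- To make the dataset terminology above concrete, here is a minimal Python sketch of the data model; the class and attribute names are illustrative only, not the actual DQ2 schema.

    # Illustrative data model for the dataset concepts above (hypothetical names).

    class Dataset:
        """Named collection of files plus descriptive metadata, versioned."""
        def __init__(self, name, metadata=None):
            self.name = name
            self.metadata = metadata or {}
            self.versions = {}        # version number -> list of (lfn, guid)

        def add_version(self, version, files):
            self.versions[version] = list(files)

    class ContainerDataset:
        """Named collection of datasets plus descriptive metadata."""
        def __init__(self, name, metadata=None):
            self.name = name
            self.metadata = metadata or {}
            self.datasets = []        # names of the constituent datasets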
11. Don Quijote 2
- Moves from a file-based system to one based on datasets
- Hides file-level granularity from users
  - A hierarchical structure makes cataloging more manageable
  - However, file-level access is still possible
- Scalable global data discovery and access via a catalog hierarchy
- No global physical file replica catalog (but a global dataset replica catalog and a global logical file catalog)
12. Catalog architecture and interactions
13. Global catalogs
- Holds all dataset names and unique IDs (+ system metadata)
- Maintains versioning information and information on container datasets, i.e. datasets consisting of other datasets
- Maps each dataset to its constituent files: this catalog holds info on every logical file, so it must be highly scalable; however, it can be highly partitioned using metadata etc. (see the lookup sketch below)
- Stores the locations of each dataset
- All logically global, but may be distributed physically
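- A rough sketch of how a lookup could flow through these global catalogs, using simple in-memory mappings; the dataset names and the resolve() helper are invented for illustration.

    # Illustrative resolution flow through the global catalogs (all names hypothetical).

    # dataset name -> unique dataset ID (+ system metadata)
    dataset_repository = {"rome.A.recon.v1": "dsid-0001"}

    # dataset ID -> constituent files as (lfn, guid) pairs
    dataset_content = {"dsid-0001": [("rome.A.recon._0001.pool.root", "guid-1"),
                                     ("rome.A.recon._0002.pool.root", "guid-2")]}

    # dataset ID -> sites holding a replica of the dataset
    dataset_location = {"dsid-0001": ["CERN", "BNL"]}

    def resolve(dataset_name):
        """Find the logical files and replica locations of a dataset by name."""
        dsid = dataset_repository[dataset_name]
        return dataset_content[dsid], dataset_location[dsid]

    files, sites = resolve("rome.A.recon.v1")   # no global physical file catalog needed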
14. Local Catalogs
- Per grid/site/tier logical-to-physical file name mapping. Implementations of this catalog are Grid-specific but must use a standard interface (see the sketch below).
- Per-site storage of user claims on files and datasets. Claims are used to manage stage lifetime and resources, and to provide accounting.
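- The "standard interface" requirement could look roughly like the following; the class and method names are hypothetical and only illustrate the idea of Grid-specific back-ends behind one common interface.

    # Sketch of a common local-catalog interface with Grid-specific back-ends
    # (hypothetical names).

    class LocalReplicaCatalog:
        """Standard interface: map logical files (GUID/LFN) to physical file names at a site."""
        def get_pfns(self, guid):
            raise NotImplementedError
        def register_pfn(self, guid, lfn, pfn):
            raise NotImplementedError

    class GridNativeCatalog(LocalReplicaCatalog):
        """Back-end wrapping whatever replica catalog a given Grid flavour provides."""
        def get_pfns(self, guid):
            return []   # ... query the Grid-specific catalog service ...
        def register_pfn(self, guid, lfn, pfn):
            pass        # ... insert the mapping via the Grid-specific API ...

    class PoolXMLCatalog(LocalReplicaCatalog):
        """Back-end using a local XML POOL file catalog."""
        def get_pfns(self, guid):
            return []   # ... parse the local XML POOL FC ...
        def register_pfn(self, guid, lfn, pfn):
            pass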
15. (Some) DDM Use Cases (1)
- Data acquisition and publication
  - Publish replica locations
  - Publish dataset info
  - Publish file replica locations
  - Publish dataset file content
16. DDM Use Cases (2)
- Select datasets based on physics attributes
- Versioning / container dataset info
- Get the locations of datasets
- Local file information
- Get the files in the datasets
17. DDM Use Cases (3)
- Dataset replication (see also subscriptions later)
  - Get current dataset location, replicate, then publish the new replica info
  - Get/publish local file info
  - Get the files to replicate
- For more use cases and details see https://uimon.cern.ch/twiki/bin/view/Atlas/DonQuijoteUseCases
18. Implementation - Prototype Development Status
- Technology choices
  - Python clients/servers based on HTTP GET/POST (see the client sketch below)
  - The POOL FC interface gives us a choice of back-end (all our catalogs fit the LFN-GUID-PFN mapping scheme)
  - For the prototype a MySQL DB is used (with a planned future evaluation of the LCG File Catalog, which would give us ACLs, support for user-defined catalogs, etc.)
- Servers
  - Use HTTPS (with Globus proxy certs) for POSTs and HTTP for GETs, i.e. world-readable data (can be made secure to e.g. the ATLAS VO if required)
- Clients
  - Python command-line client per server and an overall UI client, dq2
  - Web page interface directly to the HTTP servers for querying
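- As an illustration of the HTTP GET side of this scheme, a query client can be as simple as the sketch below; the server host, path and parameter names are hypothetical, not the actual prototype API.

    # Minimal sketch of a catalog query over plain HTTP GET (Python 2-style urllib);
    # host, path and parameters are made up for illustration.
    import urllib

    def list_files_in_dataset(dataset_name, server="http://ddm-server.example.org:8000"):
        params = urllib.urlencode({"dsn": dataset_name, "version": "latest"})
        url = "%s/content/listFiles?%s" % (server, params)
        response = urllib.urlopen(url)   # world-readable query, no certificate needed
        return response.read()           # e.g. a list of LFN/GUID pairs

    # Writes (POSTs) would instead go over HTTPS with a Globus proxy certificate.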
19. dq2 commands
- Usage: dq2 <command> <args>
- Commands:
  - registerNewDataset <dataset name> <lfn1 guid1 lfn2 guid2 ...>
  - registerDatasetLocations <-i|-c> [-v <dataset version>] <dataset name> <location(s)>
  - registerNewVersion <dataset name> [<new files: lfn1 guid1 lfn2 guid2 ...>]
  - listDatasetReplicas [-i|-c] [-v <dataset version>] <dataset name>
  - listFilesInDataset [-v <dataset version>] <dataset name>
  - listDatasetsInSite [-i|-c] <site name>
  - listFileReplicas <logical file name>
  - listDatasets [-v <dataset version>] <dataset name>
  - eraseDataset <dataset name>
- -i and -c signify incomplete and complete datasets respectively (mandatory for adds, optional for queries; the default is to return both). If no -v option is supplied, the latest version is used. (Example invocation below.)
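- An illustrative invocation sequence using the commands above; the dataset name, file entries and site are made up.

    dq2 registerNewDataset rome.mydata.test lfn1 guid1 lfn2 guid2
    dq2 registerDatasetLocations -c rome.mydata.test CERN
    dq2 listDatasetReplicas rome.mydata.test
    dq2 listFilesInDataset rome.mydata.test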
20. Web browser interface
21. Datablocks
- Datablocks are defined as immutable and unbreakable collections of files
  - They are a special case of datasets
  - A site cannot hold partial datablocks
  - There are no versions for datablocks
- Used to aggregate files for convenient distribution
  - Files are grouped together by physics properties, run number, etc. (see the sketch after this list)
  - Much more scalable than file-level distribution
- The principal means of data distribution and data discovery
  - Immutability avoids consistency problems when distributing data
  - Moving data in blocks improves data distribution (bulk SRM requests)
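- A toy illustration of grouping files into datablocks by run number, as described above; the file names, run numbers and datablock naming scheme are invented.

    # Toy example: group files into immutable datablocks by run number
    # (all names and numbers made up).
    files = [
        ("rome.recon._0001.pool.root", "guid-a", 1234),   # (lfn, guid, run number)
        ("rome.recon._0002.pool.root", "guid-b", 1234),
        ("rome.recon._0003.pool.root", "guid-c", 1235),
    ]

    datablocks = {}
    for lfn, guid, run in files:
        name = "rome.recon.run%06d" % run                 # one datablock per run
        datablocks.setdefault(name, []).append((lfn, guid))

    # Once published, a datablock is never modified (no versions) and sites
    # replicate it only as a whole (no partial datablocks).
    for name, content in datablocks.items():
        print("%s: %d files" % (name, len(content)))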
22. Subscriptions
- A site can subscribe to data
  - When a new version is available, the latest version of the dataset is automatically made available through site-local specific services carrying out the required replication
- Subscriptions can be made to datasets (for file distribution) or to container datasets (for datablock distribution)
- Use cases
  - Automatic distribution of datasets holding a variable collection of datablocks (container datasets)
  - Automatic replication of files by subscribing to a mutable dataset (e.g. file-based calibration data distribution)
(Diagram: subscriptions between Site X and Site Y)
23. Subscriptions
- The system supports subscriptions for
  - Datasets
    - latest version of a dataset (triggers automatic updates whenever a new version appears; see the sketch after this list)
  - Container Datasets
    - which in turn contain datablocks or datasets
    - supports subscriptions to the latest version of a container dataset (automatically triggers updates whenever e.g. the set of datablocks making up the container dataset changes)
  - Datablocks (immutable sets of files)
  - Databuckets (see details next)
    - replication of a set of files using a notification model (whenever new content appears in the databucket, replication is triggered)
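- A rough sketch of the "latest version" subscription logic described above: a site-local agent periodically compares the latest catalogued version of each subscribed dataset with what the site last replicated, and triggers replication on change. The catalog and replication helpers are placeholders.

    # Sketch of a site-local subscription check (all helpers hypothetical).

    subscriptions = {"rome.calib.conditions": 0}   # dataset name -> last replicated version

    def latest_version(dataset_name):
        """Placeholder: query the global dataset repository/content catalogs."""
        raise NotImplementedError

    def replicate_to_local_site(dataset_name, version):
        """Placeholder: queue the missing files into the file transfer service."""
        raise NotImplementedError

    def check_subscriptions():
        for dsn, have_version in subscriptions.items():
            latest = latest_version(dsn)
            if latest > have_version:              # a new version appeared
                replicate_to_local_site(dsn, latest)
                subscriptions[dsn] = latest        # record what the site now holds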
24. Subscription Agents
(Table: the subscription agents, their function, and the file state they track in the local XML POOL FC)
25. Data buckets
(Diagram: a file-based data bucket and a remote site)
26. DQ concepts vs DQ2 concepts
- DQ
  - File, identified by GUID or by LFN
  - The only unit for data movement, querying and identifying sites (PFN), ...
27. Claims
28. Plans for future development
- Service Challenge 3: July - December
- Prototype evolution
  - Fill catalogs with real data (Rome) and test robustness and scalability
  - Implement the catalogs not yet done (hierarchy, claims)
- External components
  - Testing of gLite FTS to start soon
  - POOL FC interfaces for LFC should be available soon; will evaluate it as a suitable backend based on performance
- Users
  - Agreed with TDAQ to start discussions on whether DDM can/should be applied to EF -> T0 data movement
  - Support commissioning in the near term
  - Gradual release to the user community for analysis
29. What still needs to be done / Milestones
- To be done
  - Finish the hierarchical cataloguing system
  - Monitoring/logging of user operations
  - Security policies
  - Claims management system
- Milestones
  - June, July, August, September, October ...