Transcript and Presenter's Notes

Title: WLCG File Transfer Service


1
WLCG File Transfer Service
  • Sophie Lemaitre, Gavin McCance
  • Joint EGEE and OSG Workshop on Data Handling in
    Production Grids, Monterey
  • 25 June 2007

2
FTS overview
  • The File Transfer Service (FTS) is a data
    movement fabric service
  • It is a multi-VO service, used to balance usage
    of site resources according to VO and site
    policies
  • Why is it needed?
  • For the user, the service it provides is the
    reliable point to point movement of files
  • For the site manager, it provides a reliable and
    manageable way of serving file movement requests
    from their experiments
  • For the production manager, it provides the
    ability to control requests coming from their users
  • Re-ordering, prioritization, etc.
  • The focus is on the service
  • It should make it easy to do these things well

3
Who uses it (1)
  • The sites use it as part of their fabric
  • It's designed to make it easier for a multi-VO
    site to run the transfers of its VOs
  • Tier-1 sites run the FTS servers and are
    responsible for processing the transfer requests
    from tier-2s and transferring data between
    tier-1s
  • Tier-0 export is run from CERN
  • The focus is on the service delivered, the ease
    of manageability and service monitoring

4
Who uses it (2)
  • FTS is used by experiment frameworks
  • Typically end-users do not interact directly with
    it; they interact with their experiment
    framework
  • Production managers sometimes query it directly
    to debug / chase problems
  • Experiment framework decides it wants to move a
    set of files
  • The expt. framework is responsible for staging-in
    (for now..)
  • It packages up a set of source/destination file
    pairs and submits transfer jobs to FTS
  • The state of each job is tracked as it progresses
    through the various transfer stages
  • The experiment framework can poll the status at
    any time (a minimal sketch of the packaging and
    submission step follows below)
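A minimal sketch of the packaging-and-submit step described above.
Everything here is illustrative: package_job and submit_job are
hypothetical stand-ins for the experiment-framework logic and the
FTS submission call, and the storage element names are invented.

def package_job(files, source_se, dest_se):
    """Pair up source and destination SURLs for one bulk transfer job."""
    return [(f"srm://{source_se}/{path}", f"srm://{dest_se}/{path}")
            for path in files]

def submit_job(file_pairs):
    # Stand-in: the real call goes to the FTS web service, which
    # returns a unique job ID that the framework polls later.
    print(f"submitting {len(file_pairs)} file pair(s)")
    return "a1b2c3d4-example-job-id"

pairs = package_job(["data/run1.root", "data/run2.root"],
                    "srm.cern.ch", "srm.example-t1.org")
job_id = submit_job(pairs)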

5
Service APIs
  • FTS has 3 basic API groups
  • Job submission / tracking
  • Used by experiment frameworks to submit requests
  • Service / channel management
  • Used by admins and VO production managers to
    control the service
  • Statistics tools
  • Providing aggregate statistics on what the
    service has been doing: current failure rates,
    failure classes, etc.
  • This is being done as part of the WLCG monitoring
    group to make sure the information is available
    to all interested stakeholders

6
Security model
  • Transfers are always run using the user's
    credential
  • VOMS credential is now used (and renewed as
    necessary) in FTS 2.0
  • Authorization to the service is done using
  • The grid mapfile mechanism, or
  • VOMS roles
  • VO production manager roles
  • Channel administrator roles
  • Service manager role (a toy check covering the
    mapfile and VOMS paths is sketched below)
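A toy version of the two authorization paths listed above: a
grid-mapfile lookup and a VOMS-role check. The DNs, FQANs and the
dictionary-based map file are illustrative assumptions, not the exact
FTS implementation.

# Illustrative grid mapfile: subject DN -> local account.
GRID_MAPFILE = {
    "/DC=org/DC=example/CN=Jane Doe": "atlas001",
}

# Roles granting management rights (per the slide: VO production
# manager, channel administrator, service manager). Invented FQANs.
PRIVILEGED_ROLES = {
    "/atlas/Role=production",
    "/dteam/Role=channel-admin",
    "/ops/Role=service-admin",
}

def authorized(subject_dn, voms_fqans):
    """Allow access via the map file or via a matching VOMS role."""
    if subject_dn in GRID_MAPFILE:
        return True
    return any(fqan in PRIVILEGED_ROLES for fqan in voms_fqans)

print(authorized("/DC=org/DC=example/CN=Jane Doe", []))                     # True
print(authorized("/DC=org/DC=example/CN=Bob", ["/atlas/Role=production"]))  # True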

7
User API
  • Uses a submit / poll pattern with unique job ID
  • Jobs can contain multiple copy requests
  • Various polling methods with different detail (a
    toy version is sketched below)
  • Overall job status (is it done yet?)
  • Job summary
  • Detailed status of individual files, including
    per-file failures
  • Job cancelation and priority reshuffling by
    suitably authorised users
  • i.e. VO production managers
  • No notification mechanism yet
  • The submit/poll pattern isn't so efficient
  • Much commonality with Globus RFT API
  • We've been talking
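A toy model of this user API, with one polling method per level of
detail the slide lists. The class and method names are hypothetical;
the real FTS exposes equivalent web-service operations.

import itertools

class FtsUserApi:
    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}  # job ID -> {file pair: state}

    def submit(self, file_pairs):
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {pair: "Submitted" for pair in file_pairs}
        return job_id

    def job_status(self, job_id):
        """Coarsest poll: is it done yet?"""
        states = set(self._jobs[job_id].values())
        return "Done" if states == {"Done"} else "Active"

    def job_summary(self, job_id):
        """Counts of files per state."""
        summary = {}
        for state in self._jobs[job_id].values():
            summary[state] = summary.get(state, 0) + 1
        return summary

    def file_status(self, job_id):
        """Most detailed poll: the state of each individual file pair."""
        return dict(self._jobs[job_id])

    def cancel(self, job_id):
        """Restricted to suitably authorised users in the real service."""
        for pair in self._jobs[job_id]:
            self._jobs[job_id][pair] = "Canceled"

api = FtsUserApi()
jid = api.submit([("srm://a/f1", "srm://b/f1"), ("srm://a/f2", "srm://b/f2")])
print(api.job_status(jid), api.job_summary(jid))  # Active {'Submitted': 2}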

8
Channel concept
  • For management ease, the service supports
    splitting jobs onto multiple channels
  • Once a job is submitted to the FTS it is assigned
    to a suitable channel for serving
  • A channel may be
  • A point to point network link (e.g. we manage all
    the T0-export links in WLCG on a separate
    channel)
  • Various catch-all channels
  • (e.g. "everything else coming to me", or
    "everything to one of my tier-2 sites")
  • More flexible grouping of sites in channel
    definitions is on the way
  • Channels are uni-directional
  • e.g. at CERN we have one set for the export and
    one set for the import (a toy channel-matching
    rule is sketched below)
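One way the channel assignment described above could work: dedicated
point-to-point channels are matched first, then catch-alls, and the
ordered (source, destination) pairs reflect the uni-directional nature
of channels. The channel names and matching order are assumptions.

CHANNELS = [
    # (source site, destination site) -> channel name; "*" is a wildcard.
    (("CERN", "RAL"), "CERN-RAL"),   # dedicated T0-export link
    (("CERN", "*"),   "CERN-STAR"),  # everything else leaving CERN
    (("*", "RAL"),    "STAR-RAL"),   # everything else coming to RAL
]

def assign_channel(source_site, dest_site):
    for (src, dst), channel in CHANNELS:
        if src in ("*", source_site) and dst in ("*", dest_site):
            return channel
    raise LookupError("no channel serves this site pair")

print(assign_channel("CERN", "RAL"))  # CERN-RAL (dedicated channel)
print(assign_channel("CERN", "PIC"))  # CERN-STAR (catch-all export)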

9
Channels
  • "Channel": it's not a great name
  • It always causes confusion... (but we're stuck
    with the name now)
  • It isn't tied to a physical network path
  • It's just a management concept
  • "Queue" might be a better name?
  • All file transfer jobs on the same channel are
    served as part of the same queue
  • Inter-VO priorities for the queue (ATLAS gets
    75%, CMS gets the rest)
  • Internal-VO priorities within a VO
  • Each channel has its own set of transfer
    parameters
  • Number of concurrent files running, number of
    streams, TCP buffer size, etc.
  • Given the transfers your FTS server is required
    to support (as defined by experiment computing
    models and WLCG), channels allow you to split up
    the management of these as you see fit (a toy
    share-based scheduler is sketched below)
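A toy share-based scheduler for one channel, using the "ATLAS gets
75%, CMS gets the rest" example: when a transfer slot frees up, a VO
with a non-empty queue is picked with probability proportional to its
share. The weighted random pick is an illustrative policy, not the
exact FTS scheduling algorithm.

import random

SHARES = {"atlas": 0.75, "cms": 0.25}

def next_vo(queues):
    """Pick a VO with pending work, weighted by its channel share."""
    candidates = {vo: share for vo, share in SHARES.items() if queues.get(vo)}
    if not candidates:
        return None
    vos = list(candidates)
    return random.choices(vos, weights=[candidates[vo] for vo in vos])[0]

queues = {"atlas": ["f1", "f2"], "cms": ["f3"]}
counts = {"atlas": 0, "cms": 0}
for _ in range(10000):
    counts[next_vo(queues)] += 1
print(counts)  # roughly {'atlas': 7500, 'cms': 2500}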

10
FTS topology
  • Simplified tiered infrastructure
  • FTS servers are located at CERN and Tier-1 sites
  • To provide full coverage, WLCG defines what
    transfers a given FTS server has to support
  • FTS servers are independent

11
FTS and data scheduling
  • FTS provides the reliable and manageable
    transport layer
  • It does not (and will not) provide more complex
    data scheduling
  • Multi-hop transfers
  • Broadcast transfers
  • Dataset collation
  • But it may be used as the underlying management
    layer for services providing this (a multi-hop
    decomposition is sketched below)
  • Much of this extra functionality is currently
    provided in the experiment layer
  • It's quite computing-model dependent
  • e.g. PhEDEx from CMS
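To illustrate the layering, here is how a higher-level service could
decompose a multi-hop transfer into the point-to-point jobs that FTS
does provide. The route and hostnames are invented, and this
decomposition is not an FTS feature.

def multihop_jobs(route, path):
    """Turn a route like T0 -> T1 -> T2 into point-to-point job specs."""
    return [{"source": f"srm://{a}/{path}", "dest": f"srm://{b}/{path}"}
            for a, b in zip(route, route[1:])]

# Each dict would be submitted as a separate FTS job, each hop only
# after the previous one reports Done.
for job in multihop_jobs(["t0.cern.ch", "t1.example.org", "t2.example.org"],
                         "data/file1"):
    print(job)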

12
FTS server architecture
  • All components are decoupled from each other
  • Each interacts only with the database
  • Experiments interact via a web-service
  • VO agents do VO-specific operations (1 per VO)
  • Channel agents do channel-specific operations
    (e.g. the transfers)
  • Monitoring and statistics can be collected via
    the DB (this DB-centred decoupling is sketched
    below)
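A toy of the DB-centred decoupling: the web service and a channel
agent never talk to each other, only to the database. SQLite stands in
for the production Oracle back-end; the table and column names are
assumptions.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT, channel TEXT, state TEXT)")

# Web service: record the submission and return (no direct agent call).
db.execute("INSERT INTO jobs VALUES ('job-1', 'CERN-RAL', 'Submitted')")

# Channel agent: independently poll the DB for work on its channel.
rows = db.execute("SELECT id FROM jobs "
                  "WHERE channel = 'CERN-RAL' AND state = 'Submitted'").fetchall()
for (job_id,) in rows:
    # ...run the transfer, then record the outcome back in the DB.
    db.execute("UPDATE jobs SET state = 'Done' WHERE id = ?", (job_id,))

print(db.execute("SELECT * FROM jobs").fetchall())  # [('job-1', 'CERN-RAL', 'Done')]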

13
FTS server architecture
  • Designed for high availability and scalability
  • User front-end web-service is stateless and
    (should be) load balanced to provide
    availability and scalability
  • Service interventions that don't require a DB
    schema upgrade can be made with zero user-visible
    downtime
  • Agent daemons are designed to scale over multiple
    nodes as necessary with load
  • Critical component is central DB
  • WLCG production services run on Oracle RAC to
    provide availability and scalability

14
FTS 2.0
  • FTS 2.0 server new features
  • Delegation of proxy from the client to the FTS
    service
  • Improved monitoring capabilities
  • Critical to the overall transfer service
    operational stability
  • Much more data retained in the database, with some
    new methods in the admin API to access it
  • Beta SRM 2.2 support
  • This is now being tested on the EGEE
    pre-production service as part of the SRM 2.2
    testing activity
  • Better administration tools
  • Make it easier to run the service
  • Better database model
  • Improve the performance and scalability
  • Placeholders for future functionality
  • Minimise the impact of future upgrade
    interventions

15
FTS developments
  • FTS developments
  • Evolve the SRM 2.2 code as we understand the SRM
    2.2 implementations (based on feedback from PPS)
  • Incrementally improve service monitoring
  • FTS will have the capacity to give very detailed
    measurements about the current service level and
    problems currently being observed with sites
  • Integration with experiment and operations
    dashboards
  • Design work ongoing
  • Site grouping in channel definitions ('clouds')
  • To make it easier to implement the computing
    models of CMS and ALICE
  • Code exists to be tested on pilot service
  • Incrementally improve service administration
    tools
  • SRM/gridFTP split
  • Notification of job state changes
  • Not planned
  • Not planning to produce a non-Oracle version
  • Sites with lower production requirements can use
    the restricted Oracle XE

16
FTS current status
  • Current FTS production status
  • CERN has just moved to FTS 2.0
  • All T1 sites currently using FTS 1.5
  • > 10 petabytes exported from CERN since SC4
  • A few more petabytes moved between tier-1 sites
    and from tier-1 to tier-2 sites
  • FTS infrastructure runs well
  • CERN and T1 sites understand the software
  • Most problems ironed out last year
  • The remainder of the problems are understood with
    the experiments, and we have a plan to address
    them
  • There are still problems with the overall
    transfer service

17
Issues 1
  • There are still problems with the overall
    transfer service
  • The overall system is very complex
  • Understanding the cross-site end to end transfer
    service is still an issue
  • Experiment layer, FTS, SRM at source, SRM at
    destination, gridFTP servers, network, tape
    backends
  • It can be done, but the manpower required is
    significant and is not sustainable in the long
    term
  • The number of retries needed to get files from A
    to B is still rather high, which reduces
    efficiency
  • Improving service stability is critical (FTS
    included?)
  • Monitoring will help
  • Understanding the whole system is our primary
    focus
  • Can we coordinate the logging / monitoring of FTS
    and SRMs to improve this situation?

18
Issues 2
  • Behaviour under error conditions is different for
    different SRM implementations
  • This took a lot of effort to resolve in SRM 1.1
  • The hope is that the SRM 2.2 standard is better
    in this regard
  • Still, a conservative deployment schedule must
    anticipate problems of this type for SRM 2.2
    deployment in production
  • The overall production service will not be
    stable until any such integration problems are
    understood

19
Issues 3
  • FTS easily lets you throttle channels writing to
    your storage
  • This was a deployment choice of WLCG
  • But source overloading is still a problem
  • Recently reported by ATLAS (e.g. BNL)
  • It would be good if the SRMs could indicate their
    busy-ness to FTS by some mechanism, so it could
    back off (a toy back-off loop is sketched below)
  • The other proposed solution of having all the FTS
    servers and other SRM clients cooperating (in a
    data scheduler model) so as not to overload an
    SRM is not seen as credible by WLCG
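A toy of the back-off behaviour proposed above: if the source SRM
could signal "busy", the transfer agent could retry with an
exponentially growing, jittered delay. Both the busy signal and the
policy here are assumptions; no such mechanism existed at the time of
this talk.

import random
import time

def transfer_with_backoff(run_transfer, max_attempts=5, base_delay=1.0):
    """Retry a transfer, roughly doubling the wait each time the SRM is busy."""
    for attempt in range(max_attempts):
        if run_transfer():
            return True
        # Jittered exponential back-off before asking the SRM again.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return False

# Toy transfer that reports "busy" (False) 70% of the time.
print(transfer_with_backoff(lambda: random.random() > 0.7, base_delay=0.1))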

20
Summary
  • FTS is designed as a highly available and
    scalable service to help sites manage the file
    transfer requests from their VOs
  • Focus is upon service management
  • Current WLCG FTS infrastructure runs well
  • Problems with overall transfer service
  • Complexity: cross-site debugging is expensive
  • Resilience: too easy to overload services, and
    standard interfaces are not always quite standard,
    especially under error conditions
  • This is where we need to focus