Transcript and Presenter's Notes

Title: WLCG File Transfer Service


1
WLCG File Transfer Service
  • Sophie Lemaitre, Gavin McCance
  • Joint EGEE and OSG Workshop on Data Handling in
    Production Grids, Monterey
  • 25 June 2007

2
FTS overview
  • The File Transfer Service (FTS) is a data
    movement fabric service
  • It is a multi-VO service, used to balance usage
    of site resources according to VO and site
    policies
  • Why is it needed?
  • For the user, the service it provides is the
    reliable point to point movement of files
  • For the site manager, it provides a reliable and
    manageable way of serving file movement requests
    from their experiments
  • For the production manager, it provides the
    ability to control requests coming from their users
  • Re-ordering, prioritization, etc.
  • The focus is on the service
  • It should make it easy to do these things well

3
Who uses it (1)
  • The sites use it as part of their fabric
  • It's designed to make it easier for a multi-VO
    site to run the transfers of its VOs
  • Tier-1 sites run the FTS servers and are
    responsible for processing the transfer requests
    from tier-2s and transferring data between
    tier-1s
  • Tier-0 export is run from CERN
  • The focus is on the service delivered, the ease
    of manageability and service monitoring

4
Who uses it (2)
  • FTS is used by experiment frameworks
  • Typically end-users do not interact directly with
    it; they interact with their experiment
    framework
  • Production managers sometimes query it directly
    to debug / chase problems
  • Experiment framework decides it wants to move a
    set of files
  • The expt. framework is responsible for staging-in
    (for now..)
  • It packages up a set of source/destination file
    pairs and submits transfer jobs to FTS
  • The state of each job is tracked as it progresses
    through the various transfer stages
  • The experiment framework can poll the status at
    any time (a minimal sketch of the packaging and
    submission step follows below)
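A minimal sketch of the packaging-and-submit step described above.
Everything here is illustrative: package_job and submit_job are
hypothetical stand-ins for the experiment-framework logic and the
FTS submission call, and the storage element names are invented.

def package_job(files, source_se, dest_se):
    """Pair up source and destination SURLs for one bulk transfer job."""
    return [(f"srm://{source_se}/{path}", f"srm://{dest_se}/{path}")
            for path in files]

def submit_job(file_pairs):
    # Stand-in: the real call goes to the FTS web service, which
    # returns a unique job ID that the framework polls later.
    print(f"submitting {len(file_pairs)} file pair(s)")
    return "a1b2c3d4-example-job-id"

pairs = package_job(["data/run1.root", "data/run2.root"],
                    "srm.cern.ch", "srm.example-t1.org")
job_id = submit_job(pairs)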

5
Service APIs
  • FTS has 3 basic API groups
  • Job submission / tracking
  • Used by experiment frameworks to submit requests
  • Service / channel management
  • Used by admins and VO production managers to
    control the service
  • Statistics tools
  • Providing aggregate statistics on what the
    service has been doing: current failure rates,
    failure classes, etc.
  • This is being done as part of the WLCG monitoring
    group to make sure the information is available
    to all interested stakeholders

6
Security model
  • Transfers are always run using the user's
    credential
  • VOMS credential is now used (and renewed as
    necessary) in FTS 2.0
  • Authorization to the service is done using
  • The grid mapfile mechanism, or
  • VOMS roles
  • VO production manager roles
  • Channel administrator roles
  • Service manager role (a toy check covering the
    mapfile and VOMS paths is sketched below)
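A toy version of the two authorization paths listed above: a
grid-mapfile lookup and a VOMS-role check. The DNs, FQANs and the
dictionary-based map file are illustrative assumptions, not the exact
FTS implementation.

# Illustrative grid mapfile: subject DN -> local account.
GRID_MAPFILE = {
    "/DC=org/DC=example/CN=Jane Doe": "atlas001",
}

# Roles granting management rights (per the slide: VO production
# manager, channel administrator, service manager). Invented FQANs.
PRIVILEGED_ROLES = {
    "/atlas/Role=production",
    "/dteam/Role=channel-admin",
    "/ops/Role=service-admin",
}

def authorized(subject_dn, voms_fqans):
    """Allow access via the map file or via a matching VOMS role."""
    if subject_dn in GRID_MAPFILE:
        return True
    return any(fqan in PRIVILEGED_ROLES for fqan in voms_fqans)

print(authorized("/DC=org/DC=example/CN=Jane Doe", []))                     # True
print(authorized("/DC=org/DC=example/CN=Bob", ["/atlas/Role=production"]))  # True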

7
User API
  • Uses a submit / poll pattern with unique job ID
  • Jobs can contain multiple copy requests
  • Various polling methods with different detail (a
    toy version is sketched below)
  • Overall job status (is it done yet?)
  • Job summary
  • Detailed status of individual files, including
    per-file failures
  • Job cancelation and priority reshuffling by
    suitably authorised users
  • i.e. VO production managers
  • No notification mechanism yet
  • The submit/poll pattern isn't so efficient
  • Much commonality with Globus RFT API
  • We've been talking
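A toy model of this user API, with one polling method per level of
detail the slide lists. The class and method names are hypothetical;
the real FTS exposes equivalent web-service operations.

import itertools

class FtsUserApi:
    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}  # job ID -> {file pair: state}

    def submit(self, file_pairs):
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {pair: "Submitted" for pair in file_pairs}
        return job_id

    def job_status(self, job_id):
        """Coarsest poll: is it done yet?"""
        states = set(self._jobs[job_id].values())
        return "Done" if states == {"Done"} else "Active"

    def job_summary(self, job_id):
        """Counts of files per state."""
        summary = {}
        for state in self._jobs[job_id].values():
            summary[state] = summary.get(state, 0) + 1
        return summary

    def file_status(self, job_id):
        """Most detailed poll: the state of each individual file pair."""
        return dict(self._jobs[job_id])

    def cancel(self, job_id):
        """Restricted to suitably authorised users in the real service."""
        for pair in self._jobs[job_id]:
            self._jobs[job_id][pair] = "Canceled"

api = FtsUserApi()
jid = api.submit([("srm://a/f1", "srm://b/f1"), ("srm://a/f2", "srm://b/f2")])
print(api.job_status(jid), api.job_summary(jid))  # Active {'Submitted': 2}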

8
Channel concept
  • For management ease, the service supports
    splitting jobs onto multiple channels
  • Once a job is submitted to the FTS it is assigned
    to a suitable channel for serving
  • A channel may be
  • A point to point network link (e.g. we manage all
    the T0-export links in WLCG on a separate
    channel)
  • Various catch-all channels
  • (e.g. "everything else coming to me", or
    "everything to one of my tier-2 sites")
  • More flexible grouping of sites in channel
    definitions is on the way
  • Channels are uni-directional
  • e.g. at CERN we have one set for the export and
    one set for the import (a toy channel-matching
    rule is sketched below)
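One way the channel assignment described above could work: dedicated
point-to-point channels are matched first, then catch-alls, and the
ordered (source, destination) pairs reflect the uni-directional nature
of channels. The channel names and matching order are assumptions.

CHANNELS = [
    # (source site, destination site) -> channel name; "*" is a wildcard.
    (("CERN", "RAL"), "CERN-RAL"),   # dedicated T0-export link
    (("CERN", "*"),   "CERN-STAR"),  # everything else leaving CERN
    (("*", "RAL"),    "STAR-RAL"),   # everything else coming to RAL
]

def assign_channel(source_site, dest_site):
    for (src, dst), channel in CHANNELS:
        if src in ("*", source_site) and dst in ("*", dest_site):
            return channel
    raise LookupError("no channel serves this site pair")

print(assign_channel("CERN", "RAL"))  # CERN-RAL (dedicated channel)
print(assign_channel("CERN", "PIC"))  # CERN-STAR (catch-all export)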

9
Channels
  • "Channel": it's not a great name
  • It always causes confusion... (but we're stuck
    with the name now)
  • It isn't tied to a physical network path
  • It's just a management concept
  • "Queue" might be a better name?
  • All file transfer jobs on the same channel are
    served as part of the same queue
  • Inter-VO priorities for the queue (ATLAS gets
    75%, CMS gets the rest)
  • Internal-VO priorities within a VO
  • Each channel has its own set of transfer
    parameters
  • Number of concurrent files running, number of
    streams, TCP buffer size, etc.
  • Given the transfers your FTS server is required
    to support (as defined by experiment computing
    models and WLCG), channels allow you to split up
    the management of these as you see fit (a toy
    share-based scheduler is sketched below)
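A toy share-based scheduler for one channel, using the "ATLAS gets
75%, CMS gets the rest" example: when a transfer slot frees up, a VO
with a non-empty queue is picked with probability proportional to its
share. The weighted random pick is an illustrative policy, not the
exact FTS scheduling algorithm.

import random

SHARES = {"atlas": 0.75, "cms": 0.25}

def next_vo(queues):
    """Pick a VO with pending work, weighted by its channel share."""
    candidates = {vo: share for vo, share in SHARES.items() if queues.get(vo)}
    if not candidates:
        return None
    vos = list(candidates)
    return random.choices(vos, weights=[candidates[vo] for vo in vos])[0]

queues = {"atlas": ["f1", "f2"], "cms": ["f3"]}
counts = {"atlas": 0, "cms": 0}
for _ in range(10000):
    counts[next_vo(queues)] += 1
print(counts)  # roughly {'atlas': 7500, 'cms': 2500}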

10
FTS topology
  • Simplified tiered infrastructure
  • FTS servers are located at CERN and Tier-1 sites
  • To provide full coverage, WLCG defines what
    transfers a given FTS server has to support
  • FTS servers are independent

11
FTS and data scheduling
  • FTS provides the reliable and manageable
    transport layer
  • It does not (and will not) provide more complex
    data scheduling
  • Multi-hop transfers
  • Broadcast transfers
  • Dataset collation
  • But it may be used as the underlying management
    layer for services providing this (a multi-hop
    decomposition is sketched below)
  • Much of this extra functionality is currently
    provided in the experiment layer
  • It's quite computing-model dependent
  • e.g. PhEDEx from CMS
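To illustrate the layering, here is how a higher-level service could
decompose a multi-hop transfer into the point-to-point jobs that FTS
does provide. The route and hostnames are invented, and this
decomposition is not an FTS feature.

def multihop_jobs(route, path):
    """Turn a route like T0 -> T1 -> T2 into point-to-point job specs."""
    return [{"source": f"srm://{a}/{path}", "dest": f"srm://{b}/{path}"}
            for a, b in zip(route, route[1:])]

# Each dict would be submitted as a separate FTS job, each hop only
# after the previous one reports Done.
for job in multihop_jobs(["t0.cern.ch", "t1.example.org", "t2.example.org"],
                         "data/file1"):
    print(job)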

12
FTS server architecture
  • All components are decoupled from each other
  • Each interacts only with the database
  • Experiments interact via a web-service
  • VO agents do VO-specific operations (1 per VO)
  • Channel agents do channel-specific operations
    (e.g. the transfers)
  • Monitoring and statistics can be collected via
    the DB (this DB-centred decoupling is sketched
    below)
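A toy of the DB-centred decoupling: the web service and a channel
agent never talk to each other, only to the database. SQLite stands in
for the production Oracle back-end; the table and column names are
assumptions.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT, channel TEXT, state TEXT)")

# Web service: record the submission and return (no direct agent call).
db.execute("INSERT INTO jobs VALUES ('job-1', 'CERN-RAL', 'Submitted')")

# Channel agent: independently poll the DB for work on its channel.
rows = db.execute("SELECT id FROM jobs "
                  "WHERE channel = 'CERN-RAL' AND state = 'Submitted'").fetchall()
for (job_id,) in rows:
    # ...run the transfer, then record the outcome back in the DB.
    db.execute("UPDATE jobs SET state = 'Done' WHERE id = ?", (job_id,))

print(db.execute("SELECT * FROM jobs").fetchall())  # [('job-1', 'CERN-RAL', 'Done')]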

13
FTS server architecture
  • Designed for high availability and scalability
  • User front-end web-service is stateless and
    (should be) load balanced to provide
    availability and scalability
  • Service interventions that don't require a DB
    schema upgrade can be made with zero user-visible
    downtime
  • Agent daemons are designed to scale over multiple
    nodes as necessary with load
  • Critical component is central DB
  • WLCG production services run on Oracle RAC to
    provide availability and scalability

14
FTS 2.0
  • FTS 2.0 server new features
  • Delegation of proxy from the client to the FTS
    service
  • Improved monitoring capabilities
  • Critical to the overall transfer service
    operational stability
  • Much more data retained in the database, with some
    new methods in the admin API to access it
  • Beta SRM 2.2 support
  • This is now being tested on the EGEE
    pre-production service as part of the SRM 2.2
    testing activity
  • Better administration tools
  • Make it easier to run the service
  • Better database model
  • Improve the performance and scalability
  • Placeholders for future functionality
  • Minimise the impact of future upgrade
    interventions

15
FTS developments
  • FTS developments
  • Evolve the SRM 2.2 code as we understand the SRM
    2.2 implementations (based on feedback from PPS)
  • Incrementally improve service monitoring
  • FTS will have the capacity to give very detailed
    measurements about the current service level and
    problems currently being observed with sites
  • Integration with experiment and operations
    dashboards
  • Design work ongoing
  • Site grouping in channel definitions ('clouds')
  • To make it easier to implement the computing
    models of CMS and ALICE
  • Code exists to be tested on pilot service
  • Incrementally improve service administration
    tools
  • SRM/gridFTP split
  • Notification of job state changes
  • Not planned
  • Not planning to produce a non-Oracle version
  • Sites with lower production requirements can use
    the restricted Oracle XE

16
FTS current status
  • Current FTS production status
  • CERN has just moved to FTS 2.0
  • All T1 sites currently using FTS 1.5
  • > 10 petabytes exported from CERN since SC4
  • A few more petabytes moved between tier-1 sites
    and from tier-1 to tier-2 sites
  • FTS infrastructure runs well
  • CERN and T1 sites understand the software
  • Most problems ironed out last year
  • The remainder of the problems are understood with
    the experiments, and we have a plan to address
    them
  • There are still problems with the overall
    transfer service

17
Issues 1
  • There are still problems with the overall
    transfer service
  • The overall system is very complex
  • Understanding the cross-site end to end transfer
    service is still an issue
  • Experiment layer, FTS, SRM at source, SRM at
    destination, gridFTP servers, network, tape
    backends
  • It can be done, but the manpower required is
    significant and is not sustainable in the long
    term
  • The number of retries needed to get files from A
    to B is still rather high, which reduces
    efficiency
  • Improving service stability is critical (FTS
    included?)
  • Monitoring will help
  • Understanding the whole system is our primary
    focus
  • Can we coordinate the logging / monitoring of FTS
    and SRMs to improve this situation?

18
Issues 2
  • Behaviour under error conditions is different for
    different SRM implementations
  • This took a lot of effort to resolve in SRM 1.1
  • The hope is that the SRM 2.2 standard is better
    in this regard
  • Still, a conservative deployment schedule must
    anticipate problems of this type for SRM 2.2
    deployment in production
  • The overall production service will not be
    stable until any such integration problems are
    understood

19
Issues 3
  • FTS easily lets you throttle channels writing to
    your storage
  • This was a deployment choice of WLCG
  • But source overloading is still a problem
  • Recently reported by ATLAS (e.g. BNL)
  • It would be good if the SRMs could indicate their
    busy-ness to FTS by some mechanism, so it could
    back off (a toy back-off loop is sketched below)
  • The other proposed solution of having all the FTS
    servers and other SRM clients cooperating (in a
    data scheduler model) so as not to overload an
    SRM is not seen as credible by WLCG
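A toy of the back-off behaviour proposed above: if the source SRM
could signal "busy", the transfer agent could retry with an
exponentially growing, jittered delay. Both the busy signal and the
policy here are assumptions; no such mechanism existed at the time of
this talk.

import random
import time

def transfer_with_backoff(run_transfer, max_attempts=5, base_delay=1.0):
    """Retry a transfer, roughly doubling the wait each time the SRM is busy."""
    for attempt in range(max_attempts):
        if run_transfer():
            return True
        # Jittered exponential back-off before asking the SRM again.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return False

# Toy transfer that reports "busy" (False) 70% of the time.
print(transfer_with_backoff(lambda: random.random() > 0.7, base_delay=0.1))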

20
Summary
  • FTS is designed as a highly available and
    scalable service to help sites manage the file
    transfer requests from their VOs
  • Focus is upon service management
  • Current WLCG FTS infrastructure runs well
  • Problems with overall transfer service
  • Complexity: cross-site debugging is expensive
  • Resilience: too easy to overload services, and
    standard interfaces are not always quite standard,
    especially under error conditions
  • This is where we need to focus