Data Management Services in GT2 and GT3 - PowerPoint PPT Presentation

Provided by: annc170
Learn more at: http://www.isi.edu
1
Data Management Services in GT2 and GT3
2
Requirements for Grid Data Management
  • Terabytes or petabytes of data
  • Often read-only data, published by experiments
  • Other systems need to maintain data consistency
  • Large data storage and computational resources
    shared by researchers around the world
  • Distinct administrative domains
  • Respect local and global policies governing how
    resources may be used
  • Access raw experimental data
  • Run simulations and analysis to create derived
    data products

3
Requirements for Grid Data Management (Cont.)
  • Locate data
  • Record and query for existence of data
  • Data access based on metadata
  • High-level attributes of data
  • Support high-speed, reliable data movement
  • E.g., for efficient movement of large
    experimental data sets
  • Support flexible data access
  • E.g., databases, hierarchical data formats (HDF),
    aggregation of small objects
  • Data Filtering
  • Process data at storage system before transferring

4
Requirements for Grid Data Management (Cont.)
  • Planning, scheduling and monitoring execution of
    data requests and computations
  • Management of data replication
  • Register and query for replicas
  • Select the best replica for a data transfer
  • Security
  • Protect data on storage systems
  • Support secure data transfers
  • Protect knowledge about existence of data
  • Virtual data
  • Desired data may be stored on a storage system
    (materialized) or created on demand

5
Functional View of Grid Data Management
Location based on data attributes


Location of one or more physical replicas
State of grid resources, performance
measurements and predictions



6
Architecture Layers
Collective 2: Services for coordinating multiple
resources that are specific to an application
domain or virtual organization (e.g.,
Authorization, Consistency, Workflow)
Collective 1: General services for coordinating
multiple resources (e.g., RLS, MCS, RFT,
Federation, Brokering)
Resource: sharing single resources (e.g.,
GridFTP, SRM, DBMS)
Connectivity (e.g., TCP/IP, GSI)
Fabric (e.g., storage, compute nodes, networks)
7
Outline: Data Services for Grids
  • The Replica Location Service (RLS)
  • A distributed registry of replicas for data
    discovery; maintains mappings between logical
    names for data and physical locations of replicas
  • The Metadata Catalog Service (MCS)
  • A catalog that associates descriptive attributes
    (metadata) that describe data items with logical
    names for data items
  • The GridFTP data transport protocol
  • Extends basic ftp protocol to provide parallel
    transfers, striped transfers, grid security,
    third-party transfers, control of TCP buffer
    sizes
  • The Reliable File Transfer (RFT) service
  • A grid service (extension of web service) that
    maintains state about outstanding transfers, is
    able to retry and restart after client failures

8
Replica Management in Grids
  • Data intensive applications
  • Produce Terabytes or Petabytes of data
  • Replicate data at multiple locations
  • Fault tolerance
  • Performance: avoid wide area data transfer
    latencies, achieve load balancing
  • Issues
  • Locating replicas of desired files
  • Creating new replicas
  • Scalability
  • Reliability

9
A Replica Location Service
  • A Replica Location Service (RLS) is a distributed
    registry service that records the locations of
    data copies and allows discovery of replicas
  • Maintains mappings between logical identifiers
    and target names
  • Physical targets: map to exact locations of
    replicated data
  • Logical targets: map to another layer of logical
    names, allowing storage systems to move data
    without informing the RLS
  • RLS was designed and implemented in a
    collaboration between the Globus project and the
    DataGrid project

10
  • LRCs contain consistent information about
    logical-to-target mappings on a site
  • RLI nodes aggregate information about LRCs
  • Soft state updates from LRCs to RLIs: relaxed
    consistency of index information, used to rebuild
    index after failures
  • Arbitrary levels of RLI hierarchy

11
Giggle: A Replica Location Service Framework
  • We define a flexible RLS framework
  • Allows users to make tradeoffs among
  • consistency
  • space overhead
  • reliability
  • update costs
  • query costs
  • By different combinations of 5 essential
    elements, the framework supports a variety of RLS
    designs

12
A Flexible RLS Framework
  • Five elements
  • 1. Consistent Local State: records mappings
    between logical names and target names and
    answers queries
  • 2. Global State with relaxed consistency: global
    index supports discovery of replicas at multiple
    sites
  • 3. Soft state mechanisms for maintaining global
    state: LRCs send information about their mappings
    (state) to RLIs using soft state protocols
  • 4. Compression of state updates (optional):
    reduces communication, CPU and storage overheads
  • 5. Membership service: for location of
    participating LRCs and RLIs and dealing with
    changes in membership

13
1. Reliable Local State: Local Replica Catalog
  • Maintains consistent information about replicas
    at a single replica site (may aggregate multiple
    storage resources)
  • Contains mappings between logical names and
    target names
  • Answers queries
  • What target names are associated with a logical
    name?
  • What logical names are associated with a target
    name?
  • Associates user-defined attributes with logical
    and target names and mappings
  • Sends soft state updates describing LRC mappings
    to global index nodes
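
The two LRC query directions above can be illustrated with a minimal in-memory sketch. It is not the RLS implementation (which uses a relational database back end); the class name, method names, and example URLs are hypothetical.

```python
from collections import defaultdict


class LocalReplicaCatalog:
    """Toy in-memory LRC: logical name <-> target name mappings."""

    def __init__(self):
        self._targets = defaultdict(set)   # logical name -> target names
        self._logicals = defaultdict(set)  # target name -> logical names

    def add(self, logical, target):
        # Record the mapping in both directions so either query is cheap.
        self._targets[logical].add(target)
        self._logicals[target].add(logical)

    def targets_of(self, logical):
        """What target names are associated with a logical name?"""
        return sorted(self._targets.get(logical, ()))

    def logicals_of(self, target):
        """What logical names are associated with a target name?"""
        return sorted(self._logicals.get(target, ()))

    def soft_state_update(self):
        """Complete list of logical names, as it might be sent to an index node."""
        return sorted(self._targets)


lrc = LocalReplicaCatalog()
lrc.add("lfn://run42/events", "gsiftp://siteA.example.org/data/events.dat")
lrc.add("lfn://run42/events", "gsiftp://siteB.example.org/store/events.dat")
print(lrc.targets_of("lfn://run42/events"))
```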

14
2. Global State with Relaxed Consistency:
Replica Location Index
  • Require a global index to support discovery of
    replicas at multiple sites
  • Consists of set of one or more Replica Location
    Index Nodes (RLIs)
  • Each RLI must
  • Contain mappings between logical names and LRCs
  • Accept periodic state updates from LRCs
  • Answer queries for mappings associated with a
    logical name
  • Implement time outs of information stored in
    index
  • Global index has relaxed consistency
  • RLIs are not required to maintain persistent
    state

15
2. The Replica Location Index (Cont.)
  • Can construct a wide range of index
    configurations by varying framework parameters
  • Number of RLIs
  • Redundancy of RLIs
  • Can guarantee that all LRCs send soft state
    updates to at least n RLIs
  • Partitioning of RLIs
  • Divide logical file namespace or storage systems
    among RLIs

16
An RLS with No Redundancy, Partitioning of Index
by Storage Sites
[Diagram: two Replica Location Indexes (RLI) and five Local Replica Catalogs (LRC)]
17
An RLS with Redundancy
18
3. Soft State Mechanisms for Maintaining Global
State
  • LRCs send information about their mappings
    (state) to RLIs using soft state protocols
  • Soft state information times out and must be
    periodically refreshed
  • Advantages of soft state mechanisms
  • Stale information in RLIs removed implicitly via
    timeouts
  • RLIs need not maintain persistent state; can
    reconstruct state from soft state updates
  • Some delay in propagating changes in LRC state to
    RLIs
  • Provides relaxed consistency
  • Soft state update strategies
  • Complete state or incremental updates
  • Send immediately after LRC state changes or
    periodically
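
A minimal sketch of the soft state idea, assuming a single in-memory index node: entries carry an expiry time, are refreshed by each update, and disappear on their own if an LRC stops sending. The class, the URLs, and the explicit `now` argument (used to keep the example deterministic) are illustrative choices, not part of RLS.

```python
class ReplicaLocationIndex:
    """Toy RLI: remembers which LRCs claim a logical name, with timeouts."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._expiry = {}  # (logical name, LRC URL) -> expiry time

    def receive_update(self, lrc_url, logical_names, now):
        # A soft state update refreshes the expiry of every name it lists.
        for name in logical_names:
            self._expiry[(name, lrc_url)] = now + self.timeout

    def query(self, logical_name, now):
        # Stale entries are removed implicitly: anything not refreshed in time.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        return sorted(lrc for (name, lrc) in self._expiry if name == logical_name)


rli = ReplicaLocationIndex(timeout_seconds=60)
rli.receive_update("rls://lrc1.example.org", ["f1", "f2"], now=0)
rli.receive_update("rls://lrc2.example.org", ["f1"], now=30)
print(rli.query("f1", now=50))  # both LRCs still fresh
print(rli.query("f1", now=70))  # lrc1 was not refreshed and has timed out
```

Because the index can always be rebuilt from the next round of updates, nothing here needs to be written to persistent storage.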

19
4. Compression of State Updates
  • Optional mechanism for reducing
  • communication requirements for state updates
  • storage system requirements on RLIs
  • Compression options
  • Hash digest techniques (e.g., Bloom filters)
  • Use structural or semantic information in logical
    names (e.g., logical collection names)
  • Others
  • Lossy compression
  • May lose accuracy about mappings
  • E.g., bloom filters
  • Small probability of false positives on RLI
    queries
  • Lose ability to do wildcard searches on logical
    names in RLIs
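
The Bloom filter trade-off can be seen in a small sketch. The filter size, the number of hash functions, and the use of SHA-256 are assumptions made for illustration; the slides do not say which parameters or hash functions RLS uses.

```python
import hashlib


class BloomFilter:
    """Fixed-size bitmap summary of a set of logical names."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, name):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, name):
        # True for every added name; occasionally True for a name never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(name))


summary = BloomFilter(num_bits=8192, num_hashes=3)
for lfn in ("run42/events", "run42/calib", "run43/events"):
    summary.add(lfn)
print(summary.might_contain("run42/events"))  # True: no false negatives
print(len(summary.bits))                      # 1024 bytes, however long the names are
```

Only the bitmap is sent, so the index node can answer exact-name lookups but has no names to match a wildcard against.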

20
5. Membership Service
  • Used for the following
  • (Currently we provide only static membership
    configuration)
  • Locating participating LRCs and RLIs
  • Keeping track of which servers send and receive
    soft state updates from one another
  • Dealing with changes in membership (RLI leaves or
    joins)
  • Membership service notifies LRCs of change in
    RLI(s) to which they send state
  • May repartition LFNs among set of RLIs

21
Replica Location Service In Context
  • The Replica Location Service is one component in
    a layered data management architecture
  • Provides a simple, distributed registry of
    mappings
  • Consistency management provided by higher-level
    services

22
Components of RLS Implementation
  • Front-End Server
  • Multi-threaded
  • Supports GSI Authentication
  • Common implementation for LRC and RLI
  • Back-end Server
  • MySQL or PostgreSQL Relational Database
  • Holds logical name to target name mappings
  • Client APIs: C and Java
  • Client command line tool

23
Implementation Features
  • Two types of soft state updates from LRCs to RLIs
  • Complete list of logical names registered in LRC
  • Bloom filter summaries of LRC
  • Immediate mode
  • When active, sends updates of new entries after
    30 seconds (default) or after 100 updates
  • User-defined attributes
  • May be associated with logical or target names
  • Partitioning (without bloom filters)
  • Divide LRC soft state updates among RLI index
    nodes using pattern matching of logical names
  • Currently, static configuration only
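
Partitioning by pattern matching might look like the sketch below, which splits one LRC's list of logical names among index nodes. Shell-style wildcards via `fnmatch` are an assumption; the slides do not specify the pattern syntax, and the function name and URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def partition_update(logical_names, rli_patterns):
    """Return {RLI URL: names to send}, given {RLI URL: [patterns]}."""
    plan = {rli: [] for rli in rli_patterns}
    for name in logical_names:
        for rli, patterns in rli_patterns.items():
            # A name goes to every RLI with at least one matching pattern.
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                plan[rli].append(name)
    return plan


# Static configuration: which logical-name patterns each index node receives.
config = {
    "rls://rli-a.example.org": ["climate/*"],
    "rls://rli-b.example.org": ["physics/*", "climate/2003/*"],
}
names = ["climate/2002/t.nc", "climate/2003/t.nc", "physics/run1.dat"]
print(partition_update(names, config))
```

Overlapping patterns, as for `climate/2003/*` here, give redundancy for part of the namespace.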

24
Installing the LRC and RLI
  • First requires installing the underlying database
  • PostgreSQL, MySQL
  • For each of these, must install both database and
    ODBC driver
  • See RLS installation guide for instructions on
    RLS server installation
  • Requires latest Globus Packaging Toolkit (GPT)
  • Source and binary bundles
  • Clients
  • C
  • Java (JNI wrapper, native Java client in
    progress)
  • Command line client tool

25
RLS Server and Soft State Update Configuration
  • RLS server configuration
  • Whether an LRC or RLI or both
  • If LRC, configure
  • Method of soft state update to send (stored in
    database, set via command line tool)
  • May send updates of different types to different
    RLIs
  • Frequency of soft state updates (in config file)
  • If RLI, configure
  • Method of soft state update to accept (in config
    file)
  • Can configure RLS server to act as a service
    provider to the MDS (Monitoring and Discovery
    Service)

26
Configuring Soft State Updates (Cont.)
  • LFN List
  • Send list of Logical Names stored on LRC
  • Can do exact and wildcard searches on RLI
  • RLI must maintain a database and update database
    whenever new soft state update arrives
  • Soft state updates get increasingly expensive
    (space, network transfer time, CPU time on RLI to
    update RLI DB) as number of LRC entries increases
  • E.g., with 1 million entries, takes 20 minutes to
    update mySQL on dual-processor 2 GHz machine
    (CPU-limited in this case)

27
Configuring Soft State Updates (Cont.)
  • Bloom filters
  • Construct a summary of LRC state by hashing
    logical names, creating a bitmap
  • Compression
  • Updates much smaller, faster
  • Can be stored in memory on RLI, no database
  • E.g., with 1 million entries, update takes less
    than 1 second
  • Supports higher query rate
  • Small probability of false positives (lossy
    compressions)
  • Lose ability to do wildcard queries

28
Configuring soft state updates (cont.)
  • Whether or not to use Immediate Mode
  • Send updates after 30 seconds (configurable) or
    after fixed number (100 default) of updates
  • Full updates are sent at a reduced rate
  • Immediate mode usually sends less data
  • Because of less frequent full updates
  • Tradeoffs depend on volatility of data
  • Frequency of updates
  • Need to have fast updates of RLI vs. allowing
    some inconsistency between LRC and RLI content
  • Usually advantageous
  • An exception would be initially loading of large
    database
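
The immediate-mode trigger (flush after a time interval or after a fixed number of pending changes, whichever comes first) can be sketched as follows. The class is hypothetical and takes the current time as an argument so the behaviour is easy to follow; the defaults mirror the 30 seconds and 100 updates quoted above.

```python
class ImmediateModeBuffer:
    """Collects LRC changes and decides when an incremental update is due."""

    def __init__(self, max_pending=100, max_age_seconds=30.0):
        self.max_pending = max_pending
        self.max_age_seconds = max_age_seconds
        self.pending = []
        self.oldest = None  # time the oldest pending change was recorded

    def record(self, change, now):
        if not self.pending:
            self.oldest = now
        self.pending.append(change)
        return self.flush_if_due(now)

    def flush_if_due(self, now):
        """Return the batch to send to the RLIs, or [] if nothing is due yet."""
        if not self.pending:
            return []
        too_many = len(self.pending) >= self.max_pending
        too_old = now - self.oldest >= self.max_age_seconds
        if too_many or too_old:
            batch, self.pending, self.oldest = self.pending, [], None
            return batch
        return []


buf = ImmediateModeBuffer(max_pending=3, max_age_seconds=30.0)
buf.record("add lfn1", now=0)
buf.record("add lfn2", now=1)
print(buf.record("add lfn3", now=2))   # count threshold reached: batch of 3
buf.record("add lfn4", now=10)
print(buf.flush_if_due(now=40))        # age threshold reached: batch of 1
```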

29
Wide Area Complete Soft State Update Performance
  • LRCs in Geneva and Pisa updating RLI at Glasgow
  • Full soft state updates quite slow for large
    databases, dominated by update costs on RLI
    database
  • Performance does not scale as LRCs grow; need
    compression of soft state updates

30
Soft State Performance With Bloom Filters
  • Sending bloom filter bitmap summarizing 1 million
    LRC mapping entries
  • Store bloom filters in RLI memory
  • Takes less than 1 millisecond to send updates on
    LAN
  • Currently measuring wide area performance
  • Bloom filter advantages
  • Reduce size of soft state updates
  • Reduce associated storage overheads and network
    requirements
  • Sending updates is faster and scales better with
    size of LRC

31
globus-rls-admin Command Line Administration Tool
  • globus-rls-admin option rli server
  • -p verifies that server is responding
  • -A add RLI to list of servers to which LRC sends
    updates
  • -s shows list of servers to which updates are
    sent
  • -c all retrieves all configuration options
  • -S show statistics for RLS server
  • -e clear LRC database

32
globus-rls-cli Command Line Tool
  • globus-rls-cli [-c] [-h] [-l reslimit]
    [-s] [-t timeout] [-u] [command]
    rls-server
  • If command is not specified, enters interactive
    mode
  • Create an initial mapping from a logical name to
    a target name
  • globus-rls-cli create logicalName targetName1
    rls://myrls.isi.edu
  • Add a mapping from same logical name to a second
    replica/target name
  • globus-rls-cli add logicalName targetName2
    rls://myrls.isi.edu

33
globus-rls-cli (cont.)
  • Attribute Functions
  • globus-rls-cli attribute add <object> <attr>
    <obj-type> <attr-type>
  • Add an attribute to an object
  • object should be the lfn or pfn name
  • obj-type should be one of lfn or pfn
  • attr-type should be one of date, float, int, or
    string
  • attribute modify <object> <attr> <obj-type>
    <attr-type>
  • attribute query <object> <attr> <obj-type>

34
globus-rls-cli (cont.)
  • Bulk Operations
  • bulk add <lfn> <pfn> <lfn> <pfn>
  • Bulk add lfn, pfn mappings.
  • bulk delete <lfn> <pfn> <lfn> <pfn>
  • Bulk delete lfn, pfn mappings.
  • bulk query lrc lfn <lfn> ...
  • Bulk query lrc for lfns.
  • bulk query lrc pfn <pfn> ...
  • Bulk query lrc for pfns.
  • bulk query rli lfn <lfn> ...
  • Bulk query rli for lfns.

35
globus-rls-cli (cont.)
  • Bulk Attribute Operations
  • globus-rls-cli attribute bulk add <object> <attr>
    <obj-type>
  • Bulk add attribute values
  • globus-rls-cli attribute bulk delete <object>
    <attr> <obj-type>
  • globus-rls-cli attribute bulk query <attr>
    <obj-type> <object>
  • globus-rls-cli attribute define <attr> <obj-type>
    <attr-type>
  • globus-rls-cli attribute delete <object> <attr>
    <obj-type>

36
Registering a mapping using C API
  • globus_module_activate(GLOBUS_RLS_CLIENT_MODULE)
  • globus_rls_client_connect (serverURL,
    serverHandle)
  • globus_rls_client_lrc_create (serverHandle,
    logicalName, targetName1)
  • globus_rls_client_lrc_add (serverHandle,
    logicalName, targetName2)
  • globus_rls_client_close (serverHandle)

37
Registering a mapping using Java API
  • RLSClient rls = new RLSClient(URLofServer)
  • RLSClient.LRC lrc = rls.getLRC()
  • lrc.create(logicalName, targetName1)
  • lrc.add(logicalName, targetName2)
  • rls.Close()

38
Status of RLS and Future Work
  • Continued development of RLS
  • Code available as source and binary bundles at
  • www.globus.org/rls
  • RLS is part of GT3.0 (as a GT2 service)
  • RLS will become an OGSI-compliant grid service
  • Replica location grid service specification will
    be standardized through Global Grid Forum
  • First step may be wrapping the current GT2
    services in a GT3 wrapper
  • Significant changes related to treatment of data
    entities as first-class OGSI-compliant services

39
Higher-Level OGSA Replication Services
  • Registration and Copy Service
  • Calls RFT to perform reliable file transfer
  • Calls RLS to register newly created replicas
  • Atomic operations: roll back to previous
    consistent state if part of operation fails
  • General replication services with various
    consistency levels/guarantees
  • Subscription-based model
  • Updates of data items must be propagated to all
    replicas according to update policies
  • Plan is also to standardize these through GGF
    OGSA Data Replication Services Working Group
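
The registration-and-copy pattern (transfer, then register, undoing the copy if registration fails) can be sketched with plain callables standing in for RFT and RLS. All function names and URLs are hypothetical; the in-memory stand-ins exist only so the sketch runs without any grid services.

```python
def copy_and_register(logical_name, source_url, dest_url,
                      transfer, register, remove_copy):
    """Copy a file, then register the new replica; undo the copy on failure."""
    transfer(source_url, dest_url)        # stands in for a call to RFT
    try:
        register(logical_name, dest_url)  # stands in for a call to RLS
    except Exception:
        remove_copy(dest_url)             # roll back to the previous state
        raise


# In-memory stand-ins for storage and the replica catalog.
storage = {"gsiftp://siteA.example.org/f": b"data"}
catalog = {}


def fake_transfer(src, dst):
    storage[dst] = storage[src]


def fake_register(lfn, url):
    catalog.setdefault(lfn, []).append(url)


def failing_register(lfn, url):
    raise RuntimeError("catalog unavailable")


def fake_remove(url):
    storage.pop(url, None)


copy_and_register("lfn-f", "gsiftp://siteA.example.org/f",
                  "gsiftp://siteB.example.org/f",
                  fake_transfer, fake_register, fake_remove)
print(catalog)
```

If the transfer itself raises, nothing has been created yet, so there is nothing to roll back.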

40
Outline: Data Services for Grids
  • The Replica Location Service (RLS)
  • A distributed registry of replicas for data
    discovery; maintains mappings between logical
    names for data and physical locations of replicas
  • The Metadata Catalog Service (MCS)
  • A catalog that associates descriptive attributes
    (metadata) that describe data items with logical
    names for data items
  • The GridFTP data transport protocol
  • Extends basic ftp protocol to provide parallel
    transfers, striped transfers, grid security,
    third-party transfers, control of TCP buffer
    sizes
  • The Reliable File Transfer (RFT) service
  • A grid service (extension of web service) that
    maintains state about outstanding transfers, is
    able to retry and restart after client failures

41
Grid Infrastructure for Metadata Cataloguing and
Discovery
  • Metadata is information that describes data sets
  • Distinguish between logical metadata and physical
    metadata
  • Logical metadata: describes the contents of files
    and collections
  • Variables contained in the data set, annotations
  • Provenance information
  • Applies to all physical file instances or
    replicas
  • Stored in Metadata Catalog Service
  • Physical metadata: describes a particular
    physical instance of a file
  • Mappings from physical to logical names stored in
    a Replica Location Service
  • Physical file information such as size, owner,
    modifier, etc. is typically stored in a file
    system or storage service

42
Metadata Examples
  • Application-specific
  • Temperature, longitude, latitude, depth
  • Time, duration, sensor
  • Application-independent
  • creator, logical name, time created, access
    control
  • notion of a data collection: data collected during
    an experiment, data collected over a certain time
    interval
  • notion of a view: users might want to group the
    data in a way that they want to look at it

43
Metadata Service Requirements
  • Storing attributes associated with logical files
  • Responding to queries based on logical file name
    or on attribute names and values
  • Extensibility to support user-defined and
    application-specific attributes
  • Consistency of content
  • Security: authentication and authorization
  • Support for logical collections: aggregations of
    logical files
  • Support for logical views
  • Provenance information: history of creation and
    transformation
  • Auditing

44
Use of Metadata Catalogs in ESG
45
History of Metadata Catalog Service Development
  • Identified need for a stand-alone metadata
    service
  • Designed a general schema for metadata attributes
  • General attributes (based largely on Storage
    Resource Broker)
  • Ability to specify user-defined attributes
  • Implemented a prototype system in mid 2002
  • Used the prototype in several projects in late
    2002
  • Earth Systems Grid
  • GriPhyN LIGO (Gravitational Wave Physics)
  • Gathered lessons from use in these systems
  • Currently re-designing the Metadata Catalog
    Service for greater functionality, extensibility
    and performance

46
Data Model
Logical file
Logical Collection
Logical View
47
MCS Data Model and Implementation
  • Logical files, logical collections and logical
    views
  • May associate pre-defined or user-defined
    attributes with files, collections or views
  • Prototype is a centralized service based on open
    source web service and database technology

SOAP/HTTP
MCS Server/ Apache Axis
SOAP Engine/ Apache Axis
MySQL DB
MCS Java Client API
48
Experience with MCS within the Earth System
Grid Project
  • Store climate model metadata corresponding to
    ESG schema
  • ESG metadata in XML format
  • Parse or shred the metadata and store in MCS
    relational tables
  • Create new user-defined attributes for
    domain-specific metadata schema
  • Shredding is fairly slow and cumbersome
  • Query performance is acceptable
  • Can recreate the original XML documents
  • Used in SC2002 ESG Demo and in subsequent
    demonstrations

49
MCS and GriPhyN
  • Provide on-demand data derivation based on
    existing data recipes
  • If data products already available, no need to
    recompute
  • Data easily stored in relational db
  • Used to find the existing data products
  • Query MCS based on application-specific
    attributes, receive list of logical file names
  • Store information about newly created data
    products

50
For 2003: Redesigning the MCS
  • New implementation will be based on OGSA Database
    Access and Integration (DAI) Service
  • Being standardized through Global Grid Forum
  • Reference implementation involving IBM, Oracle,
    UK eScience researchers, academic institutions
  • Provides both relational and native XML back ends
  • Provides a grid service front end with grid
    security
  • Provides a general pass-through SQL query
    interface
  • Testing OGSA DAI services with ESG metadata
  • Supporting provenance information
  • Common schema with the Chimera project
  • Provenance information describes data
    transformations

51
Redesigning the MCS (Cont.)
  • Extensibility of the metadata service
  • Need rich, efficient mechanisms for adding
    user-defined attributes
  • Reconsider usefulness of pre-defined attributes
  • Distribution and federation of heterogeneous
    metadata services
  • Will explore relaxed consistency models:
    heterogeneous metadata services export discovery
    information to aggregating index nodes

52
Current Functionality
  • Data Access
  • Querying Database based on attributes
  • Querying attributes of an object
  • Querying collection or view contents
  • Querying based on user defined attributes
  • Retrieving XML metadata
  • Data Publishing
  • Creating a logical file, collection or a view
  • Modifying attributes
  • Deleting a logical file, collection or a view
  • Annotating a logical file, collection or a view
  • Adding contents to a view
  • Storing XML metadata
  • Grant/revoke authorization (dn based)

53
Command line tools
  • The following slides illustrate the command line
    tools available for accessing the MCS
  • The command line tools are wrappers around Java
    classes for accessing the MCS.
  • The MCS server location has to be specified in a
    configuration file.

54
Creating an object
  • To create an object in the MCS
  • create l <logical_file_name>
  • create c <collection_name>
  • create v <view_name>
  • Attributes can also be specified at creation time
  • create l|c|v <object_name> -f <attributes_file>

The following slides will be shown along the
demonstration of the capabilities
55
Adding attributes
  • To add an attribute to a logical object
  • add_att l <logical_file_name> <att_name>
    <att_type> <att_value>
  • add_att c <collection_name> <att_name>
    <att_type> <att_value>
  • add_att v <view_name> <att_name> <att_type>
    <att_value>
  • For adding bulk attributes to an object
  • add_blk l|c|v <object_name> -f <file>

56
Modifying attributes
  • To modify an attribute value
  • modify_att l <logical_file_name> <att_name>
    <att_type> <att_value>
  • modify_att c <collection_name> <att_name>
    <att_type> <att_value>
  • modify_att v <view_name> <att_name> <att_type>
    <att_value>
  • For modifying bulk attributes
  • modify_blk l|c|v <object_name> -f <file>

57
Deleting attributes
  • To delete an attribute from the MCS
  • delete_att l <logical_file_name> <att_name>
    <att_type> <att_value>
  • delete_att c <collection_name> <att_name>
    <att_type> <att_value>
  • delete_att v <view_name> <att_name> <att_type>
    <att_value>
  • For deleting bulk attributes
  • delete_blk l|c|v <object_name> -f <file>

58
Querying objects based on attributes
  • To query a file in the MCS
  • query l <att_name> <att_type> <att_value>
  • To query a collection
  • query c <att_name> <att_type> <att_value>
  • To query a view
  • query v <att_name> <att_type> <att_value>
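
As a rough model of what the create, add_att, and query commands do, the sketch below keeps attributes for logical files (`l`), collections (`c`), and views (`v`) in memory and answers attribute-equality queries. It is an illustration only: the real MCS is a web service over a relational database, and the class, method names, and sample attributes here are hypothetical.

```python
class MetadataCatalog:
    """Toy attribute store keyed by object kind and name."""

    KINDS = ("l", "c", "v")  # logical file, collection, view

    def __init__(self):
        self._objects = {}  # (kind, name) -> {attribute name: value}

    def create(self, kind, name, attributes=None):
        if kind not in self.KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        self._objects[(kind, name)] = dict(attributes or {})

    def add_att(self, kind, name, att_name, att_value):
        self._objects[(kind, name)][att_name] = att_value

    def query(self, kind, att_name, att_value):
        """Names of objects of this kind whose attribute equals the value."""
        return sorted(name for (k, name), attrs in self._objects.items()
                      if k == kind and attrs.get(att_name) == att_value)

    def list_attributes(self, kind, name):
        return dict(self._objects[(kind, name)])


mcs = MetadataCatalog()
mcs.create("l", "ocean_temp_2002.nc", {"variable": "temperature"})
mcs.create("l", "ocean_salt_2002.nc", {"variable": "salinity"})
mcs.add_att("l", "ocean_temp_2002.nc", "depth", 100)
print(mcs.query("l", "variable", "temperature"))
```

A query returns logical file names, which a client would then resolve to physical replicas through a replica catalog.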

59
Listing objects in the MCS
  • To get a listing of all logical files in MCS
  • list l
  • To get a listing of all collections
  • list c
  • To get a listing of all views
  • list v

60
Listing objects within a collection or view
  • To find a listing of objects in a collection
  • list_coll <coll_name>
  • To find a listing of objects in a view
  • list_view <view_name>

61
Querying the attributes of an object
  • To query attributes of a logical file
  • listattributes l <logical_file_name>
  • To query attributes of a logical collection
  • listattributes c <collection_name>
  • To query attributes of a view
  • listattributes v <view_name>

62
Adding an object to a collection
  • To add a logical file under a collection
  • add_coll <parent_coll> -l <logical_file>
  • To add a collection under a collection
  • add_coll <parent_coll> -c <child_coll>

63
Deleting an object from MCS
  • Deleting an object from the MCS also deletes the
    related user defined attributes of the object
  • To delete an object from the MCS
  • delete l|c|v <object_name>

64
Outline: Data Services for Grids
  • The Replica Location Service (RLS)
  • A distributed registry of replicas for data
    discovery; maintains mappings between logical
    names for data and physical locations of replicas
  • The Metadata Catalog Service (MCS)
  • A catalog that associates descriptive attributes
    (metadata) that describe data items with logical
    names for data items
  • The GridFTP data transport protocol
  • Extends basic ftp protocol to provide parallel
    transfers, striped transfers, grid security,
    third-party transfers, control of TCP buffer
    sizes
  • The Reliable File Transfer (RFT) service
  • A grid service (extension of web service) that
    maintains state about outstanding transfers, is
    able to retry and restart after client failures

65
GridFTP
  • Data-intensive grid applications need to transfer
    and replicate large data sets (terabytes,
    petabytes)
  • GridFTP Features
  • Third party (client mediated) transfer
  • Parallel transfers
  • Striped transfers
  • TCP buffer optimizations
  • Grid security

66
GridFTP Basic Approach
  • FTP protocol is defined by several IETF RFCs
  • Start with most commonly used subset
  • Standard FTP get/put etc., 3rd-party transfer
  • Implement standard but often unused features
  • GSS binding, extended directory listing, simple
    restart
  • Extend in various ways, while preserving
    interoperability with existing servers
  • Striped/parallel data channels, partial file,
    automatic/manual TCP buffer setting, progress
    monitoring, extended restart

67
The GridFTP Protocol
  • Based on 4 RFCs and our extensions
  • RFC 959: The base FTP protocol document
  • RFC 2228: Security Extensions
  • RFC 2389: Feature Negotiation and support for
    command options
  • IETF Draft: Stream Mode restarts, standard file
    listings

68
GridFTP Implementation
  • The GT2 GridFTP is based on the wuftpd server and
    client
  • Ours is the only implementation right now
  • Likely to be others in the future
  • Important feature is separation of control and
    data channels
  • GridFTP is a Command Response Protocol
  • Issue a command
  • Get only responses to that command until it is
    completed
  • Then can issue another command

69
Command line tool: globus-url-copy
  • This is the GridFTP client tool provided with the
    Globus Toolkit
  • It takes a source URL and destination URL and
    will do protocol conversion for http, https, FTP,
    gsiftp, and file (file must be local).
  • globus-url-copy sourceURL destURL
  • globus-url-copy gsiftp://sourceHostName:port/dir1/dir2/file17
    gsiftp://destHostName:port/dirX/dirY/fileA

70
Demonstration: globus-url-copy Command Line Tool
  • globus-url-copy [options] sourceURL destURL
  • OPTIONS
  • -b | -binary
  • Do not apply any conversion to the files
    (default)
  • -tcp-bs <size> | -tcp-buffer-size <size>
  • Specify the size (in bytes) of the buffer
    to be used by the underlying ftp data channels
  • -bs <block size> | -block-size <block size>
  • Specify the size (in bytes) of the buffer
    to be used by the underlying transfer methods
71
Globus-url-copy (cont.)
  • -p <parallelism> | -parallel <parallelism>
  • Specify the number of streams to be used
    in the ftp transfer
  • -notpt | -no-third-party-transfers
  • Turn third-party transfers off (on by
    default)
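
For scripted transfers, the options above can be assembled into an argument list. The helper below is a hypothetical convenience that only builds the list, using the short option names shown on these slides; it does not run the tool, and the host names are placeholders.

```python
def build_url_copy_args(source_url, dest_url, parallelism=None,
                        tcp_buffer_size=None, block_size=None,
                        third_party=True):
    """Build an argv list for globus-url-copy from the options listed above."""
    args = ["globus-url-copy"]
    if parallelism is not None:
        args += ["-p", str(parallelism)]           # number of streams
    if tcp_buffer_size is not None:
        args += ["-tcp-bs", str(tcp_buffer_size)]  # bytes
    if block_size is not None:
        args += ["-bs", str(block_size)]           # bytes
    if not third_party:
        args.append("-notpt")                      # third-party is on by default
    return args + [source_url, dest_url]


argv = build_url_copy_args(
    "gsiftp://source.example.org/dir1/file17",
    "gsiftp://dest.example.org/dirX/fileA",
    parallelism=4,
    tcp_buffer_size=1048576,
)
print(" ".join(argv))
```

The resulting list could be handed to `subprocess.run` on a machine where the Globus Toolkit client is installed.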

72
GridFTP APIs
  • Under the covers, two APIs
  • globus_ftp_control
  • Provides access to low-level GridFTP control and
    data channel operations.
  • globus_ftp_client
  • Provides typical GridFTP client operations.

73
globus_ftp_control API
  • Low level GridFTP driver
  • Control channel management
  • Both client and server sides
  • Handles message framing, security, etc
  • Data channel management
  • Symmetric for client and server sides
  • Designed for performance: caller controls buffer
    management, no data copies needed
  • Must understand details of GridFTP protocol to
    use this API
  • Intended for custom GridFTP client and server
    developers

74
globus_ftp_client
  • Functionality
  • get, put, third_party_transfer
  • Variants: normal, partial file, extended
  • delete, mkdir, rmdir, move
  • Note: no cd. All operations use URLs with full
    paths
  • list, verbose_list
  • modification_time, size, exists
  • Hides the state machine
  • PlugIn Architecture provides access to
    interesting events.
  • All data transfer is to/from memory buffers
  • Facilitates wide range of clients

75
Example globus_ftp_client call
  • globus_ftp_client_put/get/3rd Party
  • Function signature
  • globus_result_t globus_ftp_client_get
    (globus_ftp_client_handle_t *handle,
  • const char *url,
  • globus_ftp_client_operationattr_t *attr,
    globus_ftp_client_restart_marker_t *restart,
    globus_ftp_client_complete_callback_t
    complete_callback,
  • void *callback_arg)

76
Components of a GridFTP Client
  • Module Activation / Initialization
  • Set Attributes (determine much of advanced
    functionality)
  • Select Mode (stream or extended)
  • Enable any needed plug-ins
  • Execute the operation
  • Module Deactivation / Clean up

77
Attributes
  • Control much of advanced GridFTP functionality
  • Functions
  • globus_ftp_client_operationattr_set_<attribute>
    (attr, <attribute_struct>)
  • globus_ftp_client_operationattr_get_<attribute>
    (attr, <attribute_struct>)
  • Two types of attributes
  • Handle Attributes: apply for an entire session
    and independent of any specific operation
  • Operation Attributes: apply for a single
    operation

78
Attributes (Cont.)
  • Handle Attributes
  • Initialize/Destroy/Copy Attribute Handle
  • Connection Caching
  • Plug-in Management: Add/Remove Plug-ins
  • Operation Attributes
  • Parallelism
  • Striped Data Movement
  • Striped File Layout
  • TCP Buffer Control
  • File Type
  • Transfer Mode
  • Authorization/Privacy/Protection

79
Example Code: Setting Parallelism Attributes
    globus_ftp_client_handle_t         handle;
    globus_ftp_client_operationattr_t  attr;
    globus_ftp_client_handleattr_t     handle_attr;
    globus_size_t                      parallelism_level = 4;
    globus_ftp_control_parallelism_t   parallelism;
    /* Activate the client module before any other call. */
    globus_module_activate(GLOBUS_FTP_CLIENT_MODULE);
    globus_ftp_client_handleattr_init(&handle_attr);
    globus_ftp_client_operationattr_init(&attr);
    /* Request a fixed number of parallel data streams. */
    parallelism.mode = GLOBUS_FTP_CONTROL_PARALLELISM_FIXED;
    parallelism.fixed.size = parallelism_level;
    /* Parallelism requires extended block mode (Mode E). */
    globus_ftp_client_operationattr_set_mode(&attr,
        GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK);
    globus_ftp_client_operationattr_set_parallelism(&attr, &parallelism);
    globus_ftp_client_handle_init(&handle, &handle_attr);

80
Mode S versus Mode E
  • Mode S is stream mode as defined by RFC 959
  • No advanced features except simple restart
  • Mode E (extended mode) enables advanced
    functionality
  • Adds 64-bit offset and length fields to the
    header
  • This allows discontiguous, out-of-order
    transmission and enables parallelism and striping
  • Command:
    globus_ftp_client_operationattr_set_mode(&attr,
        GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK);
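
To make the role of the offset and length fields concrete, here is a minimal sketch of a Mode E block header and of placing out-of-order blocks into a file buffer. The slide only states that 64-bit offset and length fields are added; the 17-byte layout used here (one descriptor byte, then a 64-bit count, then a 64-bit offset, both big-endian) is an assumption, and every name (`eblock_header_t`, `eblock_pack`, `eblock_unpack`, `eblock_place`) is hypothetical rather than part of the Globus API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed wire layout: 1 descriptor byte + 8-byte count + 8-byte offset. */
#define EBLOCK_HEADER_LEN 17

/* Hypothetical in-memory view of one extended-mode block header. */
typedef struct {
    uint8_t  descriptor;  /* flag bits, treated as opaque in this sketch */
    uint64_t count;       /* number of payload bytes in this block */
    uint64_t offset;      /* position of the payload within the file */
} eblock_header_t;

/* Write a 64-bit value in big-endian (network) byte order. */
static void put_u64_be(uint8_t *out, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        out[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* Read a 64-bit big-endian value. */
static uint64_t get_u64_be(const uint8_t *in)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | in[i];
    }
    return v;
}

/* Serialize a header into its assumed 17-byte wire form. */
void eblock_pack(const eblock_header_t *h, uint8_t out[EBLOCK_HEADER_LEN])
{
    out[0] = h->descriptor;
    put_u64_be(out + 1, h->count);
    put_u64_be(out + 9, h->offset);
}

/* Parse the assumed 17-byte wire form back into a header. */
void eblock_unpack(const uint8_t in[EBLOCK_HEADER_LEN], eblock_header_t *h)
{
    h->descriptor = in[0];
    h->count = get_u64_be(in + 1);
    h->offset = get_u64_be(in + 9);
}

/*
 * Copy a block's payload to its offset in the destination buffer.
 * Because each block names its own position, blocks may arrive in any
 * order and over any data connection.
 * Returns 0 on success, -1 if the block does not fit in the buffer.
 */
int eblock_place(uint8_t *file_buf, size_t file_len,
                 const eblock_header_t *h, const uint8_t *payload)
{
    assert(file_buf != NULL && h != NULL && payload != NULL);
    if (h->offset > file_len || h->count > file_len - h->offset) {
        return -1;
    }
    memcpy(file_buf + h->offset, payload, (size_t)h->count);
    return 0;
}
```

Since the receiver never depends on arrival order, the same placement logic would serve both parallel streams from one server and striped transfers from several.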

81
Plug-Ins
  • Interface to one or more plug-ins
  • Callouts for all interesting protocol events
  • Allows performance and failure monitoring
  • Callins to restart a transfer
  • Can build custom restart logic
  • Included plug-ins
  • Debug: writes event log
  • Restart: parameterized automatic restart
  • Retry N times, with a certain delay between each
    try
  • Give up after some amount of time
  • Performance: real-time performance data
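
The restart behavior described above (retry N times, wait between tries, give up after some amount of time) can be sketched as a small decision function. This is an illustration only: `restart_policy_t` and `restart_should_retry` are hypothetical names and do not reflect the interface of the restart plug-in shipped with Globus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy parameters for this sketch. */
typedef struct {
    int    max_retries;       /* retry at most this many times */
    double delay_seconds;     /* wait this long between tries */
    double deadline_seconds;  /* give up once this much total time has passed */
} restart_policy_t;

/*
 * Decide what to do after a failed attempt.
 *   attempts_made   - tries so far, including the one that just failed
 *   elapsed_seconds - time since the first try started
 * Returns true and sets *wait_seconds if another try should be made;
 * returns false if the retry budget or the deadline is exhausted.
 */
bool restart_should_retry(const restart_policy_t *p, int attempts_made,
                          double elapsed_seconds, double *wait_seconds)
{
    assert(p != NULL && wait_seconds != NULL);

    int retries_used = attempts_made - 1;  /* the first try is not a retry */
    if (retries_used >= p->max_retries) {
        return false;
    }
    /* Do not start a wait that would end at or past the deadline. */
    if (elapsed_seconds + p->delay_seconds >= p->deadline_seconds) {
        return false;
    }
    *wait_seconds = p->delay_seconds;
    return true;
}
```

A caller would invoke this from its failure callout, sleep for the returned delay, and then use the restart callin to resume the transfer.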

82
End-to-end transfer performance may be limited
by several factors
  • OS Limitations on streams and buffers
  • Buffer size limits (default, max)
  • We use 64K default, 8MB max per socket
  • Number of sockets per process and in total
  • Striping and parallelism may require lots of
    memory and streams
  • NICs vary widely in performance
  • Buses: moving a lot of data on/off disk and in/out
    of the NIC
  • CPUs: fast network connections and software RAID
    require a lot of CPU
  • Disk can be the biggest bottleneck
  • RAID helps
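
A rough sketch of how the buffer limits above interact with parallelism. It assumes the common rule of thumb that the total buffering needed is about the bandwidth-delay product, which the slide does not state, and it treats the 64K and 8MB figures from the slide as a floor and a ceiling per socket. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_DEFAULT_BYTES (64u * 1024u)          /* 64K default per socket (from the slide) */
#define BUF_MAX_BYTES     (8u * 1024u * 1024u)   /* 8MB max per socket (from the slide) */

/* Bandwidth-delay product in bytes: an assumed sizing rule, not from the slides. */
uint64_t bdp_bytes(uint64_t bandwidth_bits_per_sec, uint64_t rtt_ms)
{
    return bandwidth_bits_per_sec / 8 * rtt_ms / 1000;
}

/* Split the bandwidth-delay product across streams, clamped to the per-socket limits. */
uint64_t per_stream_buffer(uint64_t bandwidth_bits_per_sec, uint64_t rtt_ms,
                           unsigned streams)
{
    assert(streams > 0);
    uint64_t share = bdp_bytes(bandwidth_bits_per_sec, rtt_ms) / streams;
    if (share < BUF_DEFAULT_BYTES) {
        return BUF_DEFAULT_BYTES;
    }
    if (share > BUF_MAX_BYTES) {
        return BUF_MAX_BYTES;
    }
    return share;
}

/* Memory committed to socket buffers for one transfer. */
uint64_t total_buffer_memory(uint64_t per_stream, unsigned streams)
{
    return per_stream * streams;
}
```

Under these assumptions, a 1 Gbit/s path with a 100 ms round trip needs about 12.5 MB of buffering in total. A single socket capped at 8MB cannot hold that, while four streams of roughly 3 MB each can, which is one reason parallelism helps and also why it consumes memory.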

83
GridFTP Development For GT3
  • Major redesign planned
  • Part 1: Replace existing globus_io libraries
    with XIO libraries (under development)
  • Pluggable protocol stack
  • TCP, reliable UDP, HTTP, GSI
  • Part 2: GridFTP OGSA Service (?)
  • Based on redesign of GRAM job submission, service
    level agreements
  • Data transfer is just another type of job to be
    executed

84
RFT: Reliable File Transfer
  • GT3 service
  • Multiple-file version available in current
    release
  • Allows monitoring and control of third-party data
    transfer operations between two GridFTP servers

85
RFT
  • A client issues a request to an RFT factory
  • Factory instantiates an RFT service instance
  • The RFT instance does the following:
  • Communicates with two storage resources running
    GridFTP servers
  • Initiates a third-party transfer from source to
    destination GridFTP server
  • Monitors status of the transfer, updating the
    state describing the transfer in a database
  • If the transfer fails because the client or one
    of the storage resources fails
  • Transfer state in RFT database is sufficient to
    resume or restart when resources become available
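
The idea that persisted transfer state is enough to resume can be sketched as follows. The slides do not describe the RFT database schema, so the record below is a hypothetical stand-in for one database row, and all names (`xfer_record_t`, `xfer_record_progress`, `xfer_record_failure`, `xfer_resume_offset`) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { XFER_PENDING, XFER_ACTIVE, XFER_FAILED, XFER_DONE } xfer_status_t;

/* Hypothetical stand-in for one row of persisted transfer state. */
typedef struct {
    const char   *source_url;      /* source GridFTP server URL */
    const char   *dest_url;        /* destination GridFTP server URL */
    uint64_t      total_bytes;     /* size of the file being moved */
    uint64_t      bytes_committed; /* bytes known to be safely at the destination */
    xfer_status_t status;
} xfer_record_t;

/* Record monitoring progress; the committed count only moves forward. */
void xfer_record_progress(xfer_record_t *r, uint64_t bytes_committed)
{
    assert(r != NULL);
    if (bytes_committed > r->bytes_committed && bytes_committed <= r->total_bytes) {
        r->bytes_committed = bytes_committed;
    }
    r->status = (r->bytes_committed == r->total_bytes) ? XFER_DONE : XFER_ACTIVE;
}

/* Record that the client or a storage resource failed mid-transfer. */
void xfer_record_failure(xfer_record_t *r)
{
    assert(r != NULL);
    if (r->status != XFER_DONE) {
        r->status = XFER_FAILED;
    }
}

/*
 * After a restart, decide where to pick up.
 * Returns 1 and sets *resume_offset if work remains, 0 if the transfer is done.
 */
int xfer_resume_offset(const xfer_record_t *r, uint64_t *resume_offset)
{
    assert(r != NULL && resume_offset != NULL);
    if (r->status == XFER_DONE) {
        return 0;
    }
    *resume_offset = r->bytes_committed;
    return 1;
}
```

In this sketch the record survives independently of the client, so whichever party comes back first can read it and continue from the committed offset instead of starting over.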

86
Tutorial Outline
  • Introduction: Grids, data management services,
    component overview (20 minutes)
  • GridFTP and Reliable File Transfer (RFT) service
    (25 minutes)
  • The Replica Location Service (RLS) (25 minutes)
  • The Metadata Catalog Service (MCS) (25 minutes)
  • Break (15 minutes)
  • The Chimera system (30 minutes)
  • The Pegasus system (30 minutes)
  • Summary (10 minutes)

87
OGSA Data Access and Integration Service (OGSA-DAI)
  • OGSI-compliant grid service for access to
    existing databases
  • GSI security, lifetime management, service data
    elements, etc.
  • Provides both relational and native XML database
    back ends (MySQL, Xindice, DB2 in progress)
  • Provides a general pass-through SQL query
    interface
  • Being standardized through Global Grid Forum
  • Reference implementation by UK researchers, IBM