Title: Data Management Services in GT2 and GT3
1Data Management Services in GT2 and GT3
2Requirements for Grid Data Management
- Terabytes or petabytes of data
- Often read-only data, published by experiments
- Other systems need to maintain data consistency
- Large data storage and computational resources shared by researchers around the world
- Distinct administrative domains
- Respect local and global policies governing how resources may be used
- Access raw experimental data
- Run simulations and analysis to create derived data products
3Requirements for Grid Data Management (Cont.)
- Locate data
- Record and query for existence of data
- Data access based on metadata
- High-level attributes of data
- Support high-speed, reliable data movement
- E.g., for efficient movement of large experimental data sets
- Support flexible data access
- E.g., databases, hierarchical data formats (HDF), aggregation of small objects
- Data filtering
- Process data at storage system before transferring
4Requirements for Grid Data Management (Cont.)
- Planning, scheduling and monitoring execution of data requests and computations
- Management of data replication
- Register and query for replicas
- Select the best replica for a data transfer
- Security
- Protect data on storage systems
- Support secure data transfers
- Protect knowledge about existence of data
- Virtual data
- Desired data may be stored on a storage system (materialized) or created on demand
5Functional View of Grid Data Management
Location based on data attributes
Location of one or more physical replicas
State of grid resources, performance
measurements and predictions
6Architecture Layers
- Collective 2: Services for coordinating multiple resources that are specific to an application domain or virtual organization (e.g., Authorization, Consistency, Workflow)
- Collective 1: General services for coordinating multiple resources (e.g., RLS, MCS, RFT, Federation, Brokering)
- Resource: Sharing single resources (e.g., GridFTP, SRM, DBMS)
- Connectivity (e.g., TCP/IP, GSI)
- Fabric (e.g., storage, compute nodes, networks)
7Outline Data Services for Grids
- The Replica Location Service (RLS)
- A distributed registry of replicas for data discovery; maintains mappings between logical names for data and physical locations of replicas
- The Metadata Catalog Service (MCS)
- A catalog that associates descriptive attributes (metadata) with logical names for data items
- The GridFTP data transport protocol
- Extends the basic FTP protocol to provide parallel transfers, striped transfers, grid security, third-party transfers, and control of TCP buffer sizes
- The Reliable File Transfer (RFT) service
- A grid service (extension of web service) that maintains state about outstanding transfers and is able to retry and restart after client failures
8Replica Management in Grids
- Data intensive applications
- Produce Terabytes or Petabytes of data
- Replicate data at multiple locations
- Fault tolerance
- Performance: avoid wide area data transfer latencies, achieve load balancing
- Issues
- Locating replicas of desired files
- Creating new replicas
- Scalability
- Reliability
9A Replica Location Service
- A Replica Location Service (RLS) is a distributed registry service that records the locations of data copies and allows discovery of replicas
- Maintains mappings between logical identifiers and target names
- Physical targets: Map to exact locations of replicated data
- Logical targets: Map to another layer of logical names, allowing storage systems to move data without informing the RLS
- RLS was designed and implemented in a collaboration between the Globus project and the DataGrid project
10
- Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings on a site
- Replica Location Index (RLI) nodes aggregate information about LRCs
- Soft state updates from LRCs to RLIs: relaxed consistency of index information, used to rebuild index after failures
- Arbitrary levels of RLI hierarchy
11Giggle A Replica Location Service Framework
- We define a flexible RLS framework
- Allows users to make tradeoffs among
- consistency
- space overhead
- reliability
- update costs
- query costs
- By combining 5 essential elements in different ways, the framework supports a variety of RLS designs
12A Flexible RLS Framework
- Five elements
- 1. Consistent Local State: Records mappings between logical names and target names and answers queries
- 2. Global State with relaxed consistency: Global index supports discovery of replicas at multiple sites, with relaxed consistency
- 3. Soft state mechanisms for maintaining global state: LRCs send information about their mappings (state) to RLIs using soft state protocols
- 4. Compression of state updates (optional): Reduces communication, CPU and storage overheads
- 5. Membership service: For locating participating LRCs and RLIs and dealing with changes in membership
131. Reliable Local State Local Replica Catalog
- Maintains consistent information about replicas at a single replica site (may aggregate multiple storage resources)
- Contains mappings between logical names and target names
- Answers queries
- What target names are associated with a logical name?
- What logical names are associated with a target name?
- Associates user-defined attributes with logical and target names and mappings
- Sends soft state updates describing LRC mappings to global index nodes
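The LRC responsibilities listed on this slide can be illustrated with a small in-memory sketch. This is not the RLS implementation (which stores its mappings in a relational database); the class name, method names, and example URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class LocalReplicaCatalog:
    """Toy in-memory LRC: a two-way index between logical and target names."""

    def __init__(self):
        self._targets_by_logical = defaultdict(set)
        self._logicals_by_target = defaultdict(set)
        self._attributes = {}  # (name, attribute) -> value

    def add(self, logical_name, target_name):
        """Record one logical-to-target mapping."""
        self._targets_by_logical[logical_name].add(target_name)
        self._logicals_by_target[target_name].add(logical_name)

    def targets_for(self, logical_name):
        """What target names are associated with a logical name?"""
        return sorted(self._targets_by_logical.get(logical_name, ()))

    def logicals_for(self, target_name):
        """What logical names are associated with a target name?"""
        return sorted(self._logicals_by_target.get(target_name, ()))

    def set_attribute(self, name, attribute, value):
        """Attach a user-defined attribute to a logical or target name."""
        self._attributes[(name, attribute)] = value

    def get_attribute(self, name, attribute):
        return self._attributes.get((name, attribute))

    def soft_state_update(self):
        """Complete list of logical names, as would be sent to an RLI."""
        return sorted(self._targets_by_logical)
```

Keeping both directions of the mapping indexed makes each of the two query types on the slide a single dictionary lookup.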
142. Global State with Relaxed Consistency
Replica Location Index
- Require a global index to support discovery of replicas at multiple sites
- Consists of a set of one or more Replica Location Index nodes (RLIs)
- Each RLI must
- Contain mappings between logical names and LRCs
- Accept periodic state updates from LRCs
- Answer queries for mappings associated with a logical name
- Implement timeouts of information stored in index
- Global index has relaxed consistency
- RLIs are not required to maintain persistent state
152. The Replica Location Index (Cont.)
- Can construct a wide range of index configurations by varying framework parameters
- Number of RLIs
- Redundancy of RLIs
- Can guarantee that all LRCs send soft state updates to at least n RLIs
- Partitioning of RLIs
- Divide logical file namespace or storage systems among RLIs
16An RLS with No Redundancy, Partitioning of Index
by Storage Sites
[Figure: Replica Location Indexes (two RLI nodes) and Local Replica Catalogs (five LRC nodes)]
17An RLS with Redundancy
183. Soft State Mechanisms for Maintaining Global
State
- LRCs send information about their mappings (state) to RLIs using soft state protocols
- Soft state information times out and must be periodically refreshed
- Advantages of soft state mechanisms
- Stale information in RLIs removed implicitly via timeouts
- RLIs need not maintain persistent state: can reconstruct state from soft state updates
- Some delay in propagating changes in LRC state to RLIs
- Provides relaxed consistency
- Soft state update strategies
- Complete state or incremental updates
- Send immediately after LRC state changes or periodically
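A minimal sketch of the soft state idea, assuming complete-state updates and a single fixed timeout. The class and method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow; a real RLI would use its own clock and storage.

```python
class ReplicaLocationIndex:
    """Toy RLI: maps logical names to LRCs, with entries that expire."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._expiry = {}  # (logical_name, lrc_id) -> expiry time

    def receive_update(self, lrc_id, logical_names, now):
        """Apply a complete soft state update from one LRC at time `now`."""
        for name in logical_names:
            self._expiry[(name, lrc_id)] = now + self.timeout_seconds

    def expire(self, now):
        """Drop entries that were not refreshed before their timeout."""
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}

    def query(self, logical_name, now):
        """Return the LRCs believed to hold a mapping for `logical_name`."""
        self.expire(now)
        return sorted(lrc for (name, lrc) in self._expiry if name == logical_name)
```

Nothing is ever deleted explicitly: a mapping that an LRC stops reporting simply ages out, which is why the index needs no persistent state and can be rebuilt from the next round of updates.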
194. Compression of State Updates
- Optional mechanism for reducing
- communication requirements for state updates
- storage system requirements on RLIs
- Compression options
- Hash digest techniques (e.g., Bloom filters)
- Use structural or semantic information in logical names (e.g., logical collection names)
- Others
- Lossy compression
- May lose accuracy about mappings
- E.g., Bloom filters
- Small probability of false positives on RLI queries
- Lose ability to do wildcard searches on logical names in RLIs
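To make the trade-off concrete, here is a small Bloom filter sketch. The bit-array size, the number of hash functions, and the use of salted SHA-256 are illustrative assumptions, not the parameters or hash functions used by RLS.

```python
import hashlib


class BloomFilter:
    """Fixed-size bitmap summarizing a set of logical names."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, name):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, name):
        """False means definitely absent; True means probably present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(name)
        )
```

The bitmap has the same size no matter how many names are added, which is where the compression comes from. Because only hashed bit positions are stored, the names themselves cannot be recovered or matched against a wildcard, and an unregistered name can occasionally hit bits set by other names (a false positive).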
205. Membership Service
- Used for the following
- (Currently we provide only static membership configuration)
- Locating participating LRCs and RLIs
- Keeping track of which servers send and receive soft state updates from one another
- Dealing with changes in membership (RLI leaves or joins)
- Membership service notifies LRCs of change in RLI(s) to which they send state
- May repartition LFNs among set of RLIs
21Replica Location Service In Context
- The Replica Location Service is one component in a layered data management architecture
- Provides a simple, distributed registry of mappings
- Consistency management provided by higher-level services
22Components of RLS Implementation
- Front-End Server
- Multi-threaded
- Supports GSI Authentication
- Common implementation for LRC and RLI
- Back-end Server
- MySQL or PostgreSQL relational database
- Holds logical name to target name mappings
- Client APIs: C and Java
- Client command line tool
23Implementation Features
- Two types of soft state updates from LRCs to RLIs
- Complete list of logical names registered in LRC
- Bloom filter summaries of LRC
- Immediate mode
- When active, sends updates of new entries after 30 seconds (default) or after 100 updates
- User-defined attributes
- May be associated with logical or target names
- Partitioning (without Bloom filters)
- Divide LRC soft state updates among RLI index nodes using pattern matching of logical names
- Currently, static configuration only
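The partitioning feature can be pictured as follows. This sketch assumes shell-style wildcard patterns (via Python's `fnmatch`) and hypothetical RLI names; the pattern syntax that RLS itself accepts may differ.

```python
import fnmatch


def partition_updates(logical_names, rli_patterns):
    """Group logical names by the RLIs whose pattern they match.

    rli_patterns maps an RLI name to a shell-style wildcard pattern.
    A name matching several patterns is sent to each matching RLI;
    a name matching none is not sent anywhere.
    """
    updates = {rli: [] for rli in rli_patterns}
    for name in logical_names:
        for rli, pattern in rli_patterns.items():
            if fnmatch.fnmatchcase(name, pattern):
                updates[rli].append(name)
    return updates
```

With a static table such as `{"rli-1": "ligo/*", "rli-2": "esg/*"}`, each RLI receives only the part of the logical namespace it is responsible for, which keeps individual soft state updates smaller.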
24Installing the LRC and RLI
- First requires installing the underlying database
- PostgreSQL, MySQL
- For each of these, must install both database and ODBC driver
- See RLS installation guide for instructions on RLS server installation
- Requires latest Globus Packaging Toolkit (GPT)
- Source and binary bundles
- Clients
- C
- Java (JNI wrapper, native Java client in progress)
- Command line client tool
25RLS Server and Soft State Update Configuration
- RLS server configuration
- Whether an LRC or RLI or both
- If LRC, configure
- Method of soft state update to send (stored in database, set via command line tool)
- May send updates of different types to different RLIs
- Frequency of soft state updates (in config file)
- If RLI, configure
- Method of soft state update to accept (in config file)
- Can configure RLS server to act as a service provider to the MDS (Monitoring and Discovery Service)
26Configuring Soft State Updates (Cont.)
- LFN List
- Send list of Logical Names stored on LRC
- Can do exact and wildcard searches on RLI
- RLI must maintain a database and update it whenever a new soft state update arrives
- Soft state updates get increasingly expensive (space, network transfer time, CPU time on RLI to update RLI DB) as number of LRC entries increases
- E.g., with 1 million entries, takes 20 minutes to update MySQL on dual-processor 2 GHz machine (CPU-limited in this case)
27Configuring Soft State Updates (Cont.)
- Bloom filters
- Construct a summary of LRC state by hashing logical names, creating a bitmap
- Compression
- Updates much smaller, faster
- Can be stored in memory on RLI, no database
- E.g., with 1 million entries, update takes less than 1 second
- Supports higher query rate
- Small probability of false positives (lossy compression)
- Lose ability to do wildcard queries
28Configuring soft state updates (cont.)
- Whether or not to use Immediate Mode
- Send updates after 30 seconds (configurable) or after fixed number (100 default) of updates
- Full updates are sent at a reduced rate
- Immediate mode usually sends less data
- Because of less frequent full updates
- Tradeoffs depend on volatility of data
- Frequency of updates
- Need to have fast updates of RLI vs. allowing some inconsistency between LRC and RLI content
- Usually advantageous
- An exception would be initial loading of a large database
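The two immediate mode thresholds (elapsed time or number of pending updates) can be sketched as a small buffer. The class name and the `send` callback are hypothetical; the defaults of 30 seconds and 100 updates are the ones quoted on the slide.

```python
class ImmediateModeBuffer:
    """Buffers new LRC entries and flushes on a time or count threshold."""

    def __init__(self, send, max_delay=30.0, max_pending=100):
        self.send = send              # callable taking a list of logical names
        self.max_delay = max_delay    # seconds since the oldest pending entry
        self.max_pending = max_pending
        self.pending = []
        self.first_pending_at = None

    def record(self, logical_name, now):
        """Note a newly registered name, then flush if a threshold is hit."""
        if not self.pending:
            self.first_pending_at = now
        self.pending.append(logical_name)
        self.tick(now)

    def tick(self, now):
        """Flush if either threshold has been reached."""
        if not self.pending:
            return
        too_many = len(self.pending) >= self.max_pending
        too_old = now - self.first_pending_at >= self.max_delay
        if too_many or too_old:
            self.send(list(self.pending))
            self.pending.clear()
            self.first_pending_at = None
```

Only the new entries travel in these incremental updates, so full updates can be sent less often. During an initial bulk load, however, the count threshold would fire continuously, which is the exception noted on the slide.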
29Wide Area Complete Soft State Update Performance
- LRCs in Geneva and Pisa updating RLI at Glasgow
- Full soft state updates quite slow for large databases, dominated by update costs on RLI database
- Performance does not scale as LRCs grow: need compression of soft state updates
30Soft State Performance With Bloom Filters
- Sending Bloom filter bitmap summarizing 1 million LRC mapping entries
- Store Bloom filters in RLI memory
- Takes less than 1 millisecond to send updates on LAN
- Currently measuring wide area performance
- Bloom filter advantages
- Reduce size of soft state updates
- Reduce associated storage overheads and network requirements
- Sending updates is faster and scales better with size of LRC
31globus-rls-admin Command Line Administration Tool
- globus-rls-admin option rli server
- -p verifies that server is responding
- -A add RLI to list of servers to which LRC sends updates
- -s shows list of servers to which updates are sent
- -c all retrieves all configuration options
- -S show statistics for RLS server
- -e clear LRC database
32globus-rls-cli Command Line Tool
- globus-rls-cli -c -h -l reslimit -s -t timeout -u command rls-server
- If command is not specified, enters interactive mode
- Create an initial mapping from a logical name to a target name
- globus-rls-cli create logicalName targetName1 rls://myrls.isi.edu
- Add a mapping from same logical name to a second replica/target name
- globus-rls-cli add logicalName targetName2 rls://myrls.isi.edu
33globus-rls-cli (cont.)
- Attribute Functions
- globus-rls-cli attribute add <object> <attr> <obj-type> <attr-type>
- Add an attribute to an object
- object should be the lfn or pfn name
- obj-type should be one of lfn or pfn
- attr-type should be one of date, float, int, or string
- attribute modify <object> <attr> <obj-type> <attr-type>
- attribute query <object> <attr> <obj-type>
34globus-rls-cli (cont.)
- Bulk Operations
- bulk add <lfn> <pfn> <lfn> <pfn>
- Bulk add lfn, pfn mappings.
- bulk delete <lfn> <pfn> <lfn> <pfn>
- Bulk delete lfn, pfn mappings.
- bulk query lrc lfn <lfn> ...
- Bulk query lrc for lfns.
- bulk query lrc pfn <pfn> ...
- Bulk query lrc for pfns.
- bulk query rli lfn <lfn> ...
- Bulk query rli for lfns.
35globus-rls-cli (cont.)
- Bulk Attribute Operations
- globus-rls-cli attribute bulk add <object> <attr> <obj-type>
- Bulk add attribute values
- globus-rls-cli attribute bulk delete <object> <attr> <obj-type>
- globus-rls-cli attribute bulk query <attr> <obj-type> <object>
- globus-rls-cli attribute define <attr> <obj-type> <attr-type>
- globus-rls-cli attribute delete <object> <attr> <obj-type>
36Registering a mapping using C API
- globus_module_activate(GLOBUS_RLS_CLIENT_MODULE);
- globus_rls_client_connect(serverURL, &serverHandle);
- globus_rls_client_lrc_create(serverHandle, logicalName, targetName1);
- globus_rls_client_lrc_add(serverHandle, logicalName, targetName2);
- globus_rls_client_close(serverHandle);
37Registering a mapping using Java API
- RLSClient rls = new RLSClient(URLofServer);
- RLSClient.LRC lrc = rls.getLRC();
- lrc.create(logicalName, targetName1);
- lrc.add(logicalName, targetName2);
- rls.Close();
38Status of RLS and Future Work
- Continued development of RLS
- Code available as source and binary bundles at www.globus.org/rls
- RLS is part of GT3.0 (as a GT2 service)
- RLS will become an OGSI-compliant grid service
- Replica location grid service specification will be standardized through Global Grid Forum
- First step may be wrapping the current GT2 services in a GT3 wrapper
- Significant changes related to treatment of data entities as first-class OGSI-compliant services
39Higher-Level OGSA Replication Services
- Registration and Copy Service
- Calls RFT to perform reliable file transfer
- Calls RLS to register newly created replicas
- Atomic operations: roll back to previous consistent state if part of operation fails
- General replication services with various consistency levels/guarantees
- Subscription-based model
- Updates of data items must be propagated to all replicas according to update policies
- Plan is also to standardize these through GGF OGSA Data Replication Services Working Group
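The roll-back behaviour of a registration and copy operation might look like the sketch below. The three callables are hypothetical stand-ins for an RFT transfer request, an RLS registration, and removal of the copied file; this is not the interface of the planned OGSA service.

```python
def copy_and_register(logical_name, source, destination,
                      transfer, register, remove):
    """Copy a file, then register the new replica; undo the copy on failure.

    transfer(source, destination) stands in for a reliable file transfer,
    register(logical_name, destination) for a replica catalog registration,
    and remove(destination) for deleting an unregistered copy.
    """
    transfer(source, destination)
    try:
        register(logical_name, destination)
    except Exception:
        # Roll back so that no unregistered replica is left behind.
        remove(destination)
        raise
```

If the transfer itself fails, nothing has been registered yet, so the exception simply propagates; the roll-back step only matters once a copy exists without a catalog entry.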
41Grid Infrastructure for Metadata Cataloguing and Discovery
- Metadata is information that describes data sets
- Distinguish between logical metadata and physical metadata
- Logical metadata: Describes the contents of files and collections
- Variables contained in the data set, annotations
- Provenance information
- Applies to all physical file instances or replicas
- Stored in Metadata Catalog Service
- Physical metadata: Describes a particular physical instance of a file
- Mappings from physical to logical names stored in a Replica Location Service
- Physical file information such as size, owner, modifier, etc. is typically stored in a file system or storage service
42Metadata Examples
- Application-specific
- Temperature, longitude, latitude, depth
- Time, duration, sensor
- Application-independent
- creator, logical name, time created, access control
- notion of a data collection: data collected during an experiment, data collected over a certain time interval
- notion of a view: users might want to group the data in a way that they want to look at it
43Metadata Service Requirements
- Storing attributes associated with logical files
- Responding to queries based on logical file name or on attribute names and values
- Extensibility to support user-defined and application-specific attributes
- Consistency of content
- Security: authentication and authorization
- Support for logical collections: aggregations of logical files
- Support for logical views
- Provenance information: history of creation and transformation
- Auditing
44Use of Metadata Catalogs in ESG
45History of Metadata Catalog Service Development
- Identified need for a stand-alone metadata service
- Designed a general schema for metadata attributes
- General attributes (based largely on Storage Resource Broker)
- Ability to specify user-defined attributes
- Implemented a prototype system in mid 2002
- Used the prototype in several projects in late 2002
- Earth System Grid
- GriPhyN LIGO (Gravitational Wave Physics)
- Gathered lessons from use in these systems
- Currently re-designing the Metadata Catalog Service for greater functionality, extensibility and performance
46Data Model
Logical file
Logical Collection
Logical View
47MCS Data Model and Implementation
- Logical files, logical collections and logical views
- May associate pre-defined or user-defined attributes with files, collections or views
- Prototype is a centralized service based on open source web service and database technology
[Figure labels: MCS Java Client API; SOAP/HTTP; SOAP Engine / Apache Axis; MCS Server / Apache Axis; MySQL DB]
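The data model on this slide (files, collections, and views, each carrying attributes) can be sketched in a few lines. The class and method names are hypothetical and are not the MCS Java client API; the prototype described here is a web service backed by MySQL rather than an in-memory dictionary.

```python
class MetadataCatalog:
    """Toy metadata catalog: typed objects with attribute dictionaries."""

    KINDS = ("file", "collection", "view")

    def __init__(self):
        self._attributes = {}  # (kind, name) -> {attribute: value}

    def create(self, kind, name, **attributes):
        """Create a logical file, collection, or view with initial attributes."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown object kind: {kind}")
        self._attributes[(kind, name)] = dict(attributes)

    def add_attribute(self, kind, name, attribute, value):
        """Attach a user-defined attribute to an existing object."""
        self._attributes[(kind, name)][attribute] = value

    def query(self, kind, attribute, value):
        """Names of objects of `kind` whose attribute equals `value`."""
        return sorted(
            name
            for (k, name), attrs in self._attributes.items()
            if k == kind and attrs.get(attribute) == value
        )
```

A query by attribute returns logical names only; locating physical copies of those names is left to the Replica Location Service.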
48Experience with MCS within the Earth System
Grid Project
- Store climate model metadata corresponding to ESG schema
- ESG metadata in XML format
- Parse or shred the metadata and store in MCS relational tables
- Create new user-defined attributes for domain-specific metadata schema
- Shredding is fairly slow and cumbersome
- Query performance is acceptable
- Can recreate the original XML documents
- Used in SC2002 ESG Demo and in subsequent demonstrations
49MCS and GriPhyN
- Provide on-demand data derivation based on existing data recipes
- If data products already available, no need to recompute
- Data easily stored in relational database
- Used to find the existing data products
- Query MCS based on application-specific attributes, receive list of logical file names
- Store information about newly created data products
50For 2003 Redesigning the MCS
- New implementation will be based on OGSA Database Access and Integration (DAI) Service
- Being standardized through Global Grid Forum
- Reference implementation involving IBM, Oracle, UK eScience researchers, academic institutions
- Provides both relational and native XML back ends
- Provides a grid service front end with grid security
- Provides a general pass-through SQL query interface
- Testing OGSA DAI services with ESG metadata
- Supporting provenance information
- Common schema with the Chimera project
- Provenance information describes data transformations
51Redesigning the MCS (Cont.)
- Extensibility of the metadata service
- Need rich, efficient mechanisms for adding user-defined attributes
- Reconsider usefulness of pre-defined attributes
- Distribution and federation of heterogeneous metadata services
- Will explore relaxed consistency models: heterogeneous metadata services export discovery information to aggregating index nodes
52Current Functionality
- Data Access
- Querying Database based on attributes
- Querying attributes of an object
- Querying collection or view contents
- Querying based on user defined attributes
- Retrieving XML metadata
- Data Publishing
- Creating a logical file, collection or a view
- Modifying attributes
- Deleting a logical file, collection or a view
- Annotating a logical file, collection or a view
- Adding contents to a view
- Storing XML metadata
- Grant/revoke authorization (dn based)
53Command line tools
- The following slides illustrate the command line tools available for accessing the MCS
- The command line tools are wrappers around Java classes for accessing the MCS
- The MCS server location has to be specified in a configuration file
54Creating an object
- To create an object in the mcs
- create l <logical_file_name>
- create c <collection_name>
- create v <view_name>
- Attributes can also be specified at creation time
- create l|c|v <object_name> -f <attributes_file>
The following slides accompany the demonstration of these capabilities
55Adding attributes
- To add an attribute to a logical object
- add_att l <logical_file_name> <att_name> <att_type> <att_value>
- add_att c <collection_name> <att_name> <att_type> <att_value>
- add_att v <view_name> <att_name> <att_type> <att_value>
- For adding bulk attributes to an object
- add_blk l|c|v <object_name> -f <file>
56Modifying attributes
- To modify an attribute value
- modify_att l <logical_file_name> <att_name> <att_type> <att_value>
- modify_att c <collection_name> <att_name> <att_type> <att_value>
- modify_att v <view_name> <att_name> <att_type> <att_value>
- For modifying bulk attributes
- modify_blk l|c|v <object_name> -f <file>
57Deleting attributes
- To delete an attribute from the mcs
- delete_att l <logical_file_name> <att_name> <att_type> <att_value>
- delete_att c <collection_name> <att_name> <att_type> <att_value>
- delete_att v <view_name> <att_name> <att_type> <att_value>
- For deleting bulk attributes
- delete_blk l|c|v <object_name> -f <file>
58Querying objects based on attributes
- To query a file in the MCS
- query l <att_name> <att_type> <att_value>
- To query a collection
- query c <att_name> <att_type> <att_value>
- To query a view
- query v <att_name> <att_type> <att_value>
59Listing objects in the MCS
- To get a listing of all logical files in MCS
- list l
- To get a listing of all collections
- list c
- To get a listing of all views
- list v
60Listing objects within a collection or view
- To find a listing of objects in a collection
- list_coll <coll_name>
- To find a listing of objects in a view
- list_view <view_name>
61Querying the attributes of an object
- To query attributes of a logical file
- listattributes l <logical_file_name>
- To query attributes of a logical collection
- listattributes c <collection_name>
- To query attributes of a view
- listattributes v <view_name>
62Adding an object to a collection
- To add a logical file under a collection
- add_coll <parent_coll> -l <logical_file>
- To add a collection under a collection
- add_coll <parent_coll> -c <child_coll>
63Deleting an object from MCS
- Deleting an object from the MCS also deletes the related user-defined attributes of the object
- To delete an object from the MCS
- delete l|c|v <object_name>
65GridFTP
- Data-intensive grid applications need to transfer and replicate large data sets (terabytes, petabytes)
- GridFTP Features
- Third party (client mediated) transfer
- Parallel transfers
- Striped transfers
- TCP buffer optimizations
- Grid security
66GridFTP Basic Approach
- FTP protocol is defined by several IETF RFCs
- Start with most commonly used subset
- Standard FTP get/put etc., 3rd-party transfer
- Implement standard but often unused features
- GSS binding, extended directory listing, simple restart
- Extend in various ways, while preserving interoperability with existing servers
- Striped/parallel data channels, partial file, automatic/manual TCP buffer setting, progress monitoring, extended restart
67The GridFTP Protocol
- Based on 4 RFCs and our extensions
- RFC 959: The base FTP protocol document
- RFC 2228: Security Extensions
- RFC 2389: Feature Negotiation and support for command options
- IETF Draft: Stream Mode restarts, standard file listings
68GridFTP Implementation
- The GT2 GridFTP is based on the wuftpd server and
client - Ours is the only implementation right now
- Likely to be others in the future
- Important feature is separation of control and data channels
- GridFTP is a command/response protocol
- Issue a command
- Get only responses to that command until it is completed
- Then can issue another command
69Command line tool globus-url-copy
- This is the GridFTP client tool provided with the Globus Toolkit
- It takes a source URL and destination URL and will do protocol conversion for http, https, FTP, gsiftp, and file (file must be local)
- globus-url-copy sourceURL destURL
- globus-url-copy gsiftp://sourceHostName:port/dir1/dir2/file17 gsiftp://destHostName:port/dirX/dirY/fileA
70Demonstration globus-url-copy Command Line Tool
- globus-url-copy options sourceURL destURL
- OPTIONS
- -b | -binary
- Do not apply any conversion to the files (default)
- -tcp-bs <size> | -tcp-buffer-size <size>
- Specify the size (in bytes) of the buffer to be used by the underlying ftp data channels
- -bs <block size> | -block-size <block size>
- Specify the size (in bytes) of the buffer to be used by the underlying transfer methods
71Globus-url-copy (cont.)
- -p <parallelism> | -parallel <parallelism>
- Specify the number of streams to be used in the ftp transfer
- -notpt | -no-third-party-transfers
- Turn third-party transfers off (on by default)
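As a usage illustration, the options above can be assembled into a command line programmatically. The helper below is a hypothetical convenience function, not part of the Globus Toolkit; it uses only the flags listed on these slides and returns the argument list without running anything.

```python
def build_url_copy_command(source_url, dest_url, parallelism=None,
                           tcp_buffer_size=None, block_size=None,
                           third_party=True):
    """Assemble an argument list for globus-url-copy (does not run it)."""
    args = ["globus-url-copy"]
    if parallelism is not None:
        args += ["-p", str(parallelism)]          # number of parallel streams
    if tcp_buffer_size is not None:
        args += ["-tcp-bs", str(tcp_buffer_size)]  # TCP buffer size in bytes
    if block_size is not None:
        args += ["-bs", str(block_size)]           # transfer block size in bytes
    if not third_party:
        args.append("-notpt")                      # disable third-party transfers
    return args + [source_url, dest_url]
```

For example, `build_url_copy_command(src, dst, parallelism=4, tcp_buffer_size=1048576)` yields a list that could be handed to `subprocess.run` on a machine where the tool is installed and a valid grid proxy exists.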
72GridFTP APIs
- Under the covers, two APIs
- globus_ftp_control
- Provides access to low-level GridFTP control and data channel operations.
- globus_ftp_client
- Provides typical GridFTP client operations.
73globus_ftp_control API
- Low level GridFTP driver
- Control channel management
- Both client and server sides
- Handles message framing, security, etc
- Data channel management
- Symmetric for client and server sides
- Designed for performance: caller controls buffer management, no data copies needed
- Must understand details of GridFTP protocol to use this API
- Intended for custom GridFTP client and server developers
74globus_ftp_client
- Functionality
- get, put, third_party_transfer
- Variants: normal, partial file, extended
- delete, mkdir, rmdir, move
- Note: no cd. All operations use URLs with full paths
- list, verbose_list
- modification_time, size, exists
- Hides the state machine
- Plug-in architecture provides access to interesting events.
- All data transfer is to/from memory buffers
- Facilitates wide range of clients
75Example globus_ftp_client call
- globus_ftp_client_put/get/3rd Party
- Function signature
- globus_result_t globus_ftp_client_get(globus_ftp_client_handle_t *handle, const char *url, globus_ftp_client_operationattr_t *attr, globus_ftp_client_restart_marker_t *restart, globus_ftp_client_complete_callback_t complete_callback, void *callback_arg)
76Components of a GridFTP Client
- Module Activation / Initialization
- Set Attributes (determine much of advanced functionality)
- Select Mode (stream or extended)
- Enable any needed plug-ins
- Execute the operation
- Module Deactivation / Clean up
77Attributes
- Control much of advanced GridFTP functionality
- Functions
- globus_ftp_client_operationattr_set_<attribute>(attr, <attribute_struct>)
- globus_ftp_client_operationattr_get_<attribute>(attr, <attribute_struct>)
- Two types of attributes
- Handle Attributes: Apply for an entire session and are independent of any specific operation
- Operation Attributes: Apply for a single operation
78Attributes (Cont)
- Handle Attributes
- Initialize/Destroy/Copy Attribute Handle
- Connection Caching
- Plugin Management Add/Remove Plugins
- Operation Attributes
- Parallelism
- Striped Data Movement
- Striped File Layout
- TCP Buffer Control
- File Type
- Transfer Mode
- Authorization/Privacy/Protection
79Example Code Setting Parallelism Attributes
- globus_ftp_client_handle_t handle;
- globus_ftp_client_operationattr_t attr;
- globus_ftp_client_handleattr_t handle_attr;
- globus_size_t parallelism_level = 4;
- globus_ftp_control_parallelism_t parallelism;
- globus_module_activate(GLOBUS_FTP_CLIENT_MODULE);
- globus_ftp_client_handleattr_init(&handle_attr);
- globus_ftp_client_operationattr_init(&attr);
- parallelism.mode = GLOBUS_FTP_CONTROL_PARALLELISM_FIXED;
- parallelism.fixed.size = parallelism_level;
- globus_ftp_client_operationattr_set_mode(&attr, GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK);
- globus_ftp_client_operationattr_set_parallelism(&attr, &parallelism);
- globus_ftp_client_handle_init(&handle, &handle_attr);
80Mode S versus Mode E
- Mode S is stream mode as defined by RFC 959
- No advanced features except simple restart
- Mode E (extended mode) enables advanced functionality
- Adds 64 bit offset and length fields to the header
- This allows discontiguous, out-of-order transmission and enables parallelism and striping
- Command
- globus_ftp_client_operationattr_set_mode(attr, GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK)
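Why offset and length fields enable out-of-order delivery can be shown with a simplified framing sketch. The 16-byte header used here (offset, then length) is an assumption made for illustration; the actual Mode E wire format is defined by the GridFTP protocol documents and contains more than these two fields.

```python
import struct

# Simplified header: 64-bit offset and 64-bit length, big-endian.
HEADER = struct.Struct(">QQ")


def frame_block(offset, payload):
    """Prefix a data block with its position in the file and its length."""
    return HEADER.pack(offset, len(payload)) + payload


def reassemble(frames, total_size):
    """Rebuild a file from blocks that may arrive in any order."""
    out = bytearray(total_size)
    for frame in frames:
        offset, length = HEADER.unpack_from(frame)
        out[offset:offset + length] = frame[HEADER.size:HEADER.size + length]
    return bytes(out)
```

Because every block says where it belongs, blocks can travel over several parallel streams, or come from several striped servers, and be written into place as they arrive. Stream mode has no such header, so its bytes must arrive in order on a single connection.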
81Plug-Ins
- Interface to one or more plug-ins
- Callouts for all interesting protocol events
- Allows performance and failure monitoring
- Callins to restart a transfer
- Can build custom restart logic
- Included plug-ins
- Debug: Writes event log
- Restart: Parameterized automatic restart
- Retry N times, with a certain delay between each try
- Give up after some amount of time
- Performance: Real time performance data
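The policy of the restart plug-in (retry N times, wait between tries, give up after a time limit) can be sketched generically. This is not the plug-in's API; the function name and parameters are hypothetical, and a real restart would resume from a restart marker rather than simply calling the operation again.

```python
import time


def run_with_restart(operation, max_retries, delay, deadline,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` up to max_retries times, pausing `delay` seconds
    between tries, and give up once `deadline` seconds would be exceeded."""
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            # Would waiting for the next try take us past the deadline?
            out_of_time = clock() - start + delay > deadline
            if attempt > max_retries or out_of_time:
                raise
            sleep(delay)
```

The `clock` and `sleep` parameters default to the standard library functions and exist only so the policy can be exercised without real waiting.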
82End-to-end transfer performance may be limited
by several factors
- OS limitations on streams and buffers
- Buffer size limits (defaults, max)
- We use 64K default, 8MB max per socket
- Number of sockets per process and total
- Striping and parallelism may require lots of memory and streams
- NICs vary widely in performance
- Buses: Moving a lot of data on/off disk, in/out the NIC
- CPUs: Fast network connections and software RAID require a lot of CPU
- Disk can be the biggest bottleneck
- RAID helps
83GridFTP Development For GT3
- Major redesign planned
- Part 1: Replace existing globus_io libraries with XIO libraries (under development)
- Pluggable protocol stack
- TCP, reliable UDP, HTTP, GSI
- Part 2: GridFTP OGSA Service (?)
- Based on redesign of GRAM job submission, service level agreements
- Data transfer is just another type of job to be executed
84RFT Reliable File Transfer
- GT3 service
- Multiple-file version available in current release
- Allows monitoring and control of third-party data transfer operations between two GridFTP servers
85RFT
- A client issues a request to an RFT factory
- Factory instantiates an RFT service instance
- The RFT instance does the following
- Communicates with two storage resources running GridFTP servers
- Initiates a third-party transfer from source to destination GridFTP server
- Monitors status of the transfer, updating the state describing the transfer in a database
- If the transfer fails because the client or one of the storage resources fails
- Transfer state in RFT database is sufficient to resume or restart when resources become available
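The role of the database in RFT can be illustrated with a small sketch that records transfer progress durably. The table layout, class name, and use of SQLite are assumptions for illustration; they do not describe the actual RFT schema or service interface.

```python
import sqlite3


class TransferLog:
    """Toy persistent record of transfer state, in the spirit of RFT."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS transfers ("
            "source TEXT, destination TEXT, bytes_done INTEGER, status TEXT, "
            "PRIMARY KEY (source, destination))"
        )

    def start(self, source, destination):
        self.db.execute(
            "INSERT OR REPLACE INTO transfers VALUES (?, ?, 0, 'active')",
            (source, destination),
        )
        self.db.commit()

    def checkpoint(self, source, destination, bytes_done):
        """Record how far the transfer has progressed."""
        self.db.execute(
            "UPDATE transfers SET bytes_done = ? "
            "WHERE source = ? AND destination = ?",
            (bytes_done, source, destination),
        )
        self.db.commit()

    def finish(self, source, destination):
        self.db.execute(
            "UPDATE transfers SET status = 'done' "
            "WHERE source = ? AND destination = ?",
            (source, destination),
        )
        self.db.commit()

    def unfinished(self):
        """Transfers to resume after a failure, with their restart offsets."""
        rows = self.db.execute(
            "SELECT source, destination, bytes_done FROM transfers "
            "WHERE status = 'active' ORDER BY source, destination"
        )
        return rows.fetchall()
```

With a file path instead of `":memory:"`, the log survives a crash of the process that wrote it, so a restarted service can read `unfinished()` and resume each transfer from its last recorded offset without any help from the original client.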
86Tutorial Outline
- Introduction: Grids, data management services, component overview (20 minutes)
- GridFTP and Reliable File Transfer (RFT) service (25 minutes)
- The Replica Location Service (RLS) (25 minutes)
- The Metadata Catalog Service (MCS) (25 minutes)
- Break (15 minutes)
- The Chimera system (30 minutes)
- The Pegasus system (30 minutes)
- Summary (10 minutes)
87OGSA Data Access and Integration Service (OGSA
DAI)
- OGSI-compliant grid service for access to existing databases
- GSI security, lifetime management, service data elements, etc.
- Provides both relational and native XML database back ends (MySQL, Xindice, DB2 in progress)
- Provides a general pass-through SQL query interface
- Being standardized through Global Grid Forum
- Reference implementation by UK researchers, IBM