Title: Distributed Data Management and Processing 2.3
1 Distributed Data Management and Processing 2.3

2 Introduction to 2.3, Distributed Data Management and Processing
- 2.3 Distributed Data Management and Processing develops software tools to support the CMS Distributed Computing Model.
[Diagram: the tiered CMS computing model — the Online System feeds Tier 0 at PBytes/sec; 100 MBytes/sec links connect to the Tier 1 regional centers (US Center at Fermilab and the Italy, France, and UK regional centers); below these sit Tier 2 centers, Tier 3 institutes with physics data caches, and Tier 4.]
3 Introduction to 2.3, Distributed Data Management and Processing
- Supporting the CMS Distributed Computing Model is a daunting task.
- 1/3 of the processing capability is at Tier 0, 1/3 at the Tier 1s, and 1/3 at the Tier 2s.
- Centers are spread globally over networks of variable bandwidth.
- Most physicists will be performing analysis at remote centers that locally have only a portion of the raw or reconstructed data.
- Creating tools that will allow efficient access to data at local sites and to resources at remote sites is a complicated task.
- This is a larger dataset, using more computing power, spread over a greater distance, than HEP has previously attempted, and it requires a more advanced set of tools.
4 Introduction to 2.3, Distributed Data Management and Processing
- DDMP attempts to break the project into 5 manageable pieces for efficient development:
  - 2.3.1 Distributed Process Management
  - 2.3.2 Distributed Database Management
  - 2.3.3 Load Balancing
  - 2.3.4 Distributed Production Tools
  - 2.3.5 System Simulation
- The combination of these should allow CMS to take full advantage of the grid of Distributed Computing Resources.
- CMS has immediate needs for simulation for use in completing TDRs (HLT, Physics, ...). Wherever possible the project attempts to develop tools that are useful in production at existing facilities while developing more advanced tools for the future.
5 Introduction to 2.3 Distributed Data Management and Processing
- Distributed Data Management and Processing attempts to integrate software developed elsewhere whenever possible, exploiting a number of tools being developed for grid computing.
6 Introduction to Distributed Process Management 2.3.1
- The goal of Distributed Process Management is to develop tools that enable physicists to make efficient use of computing resources distributed world wide.
- There are a number of tools available with similar goals (LSF, PBS, DQS, Condor, ...), but at the moment none are judged adequate to meet the long-term needs of CMS.
- Important issues:
  - Keeping track of long-running jobs
  - Supporting collaboration among multiple physicists
  - Conserving limited network bandwidth
  - Maintaining high availability
  - Tolerating the partition failures that are common to WANs
7 Distributed Process Management Prototype Development
- The prototype introduces the concept of a session, which is a container for interrelated jobs. This allows submission, monitoring, and termination with a single command. Sessions can be shared. (A rough sketch of the session concept follows this slide's bullets.)
- Processors can be chosen based on data availability, processor type, and load.
- Replicated state is maintained so that computations will not be lost if a server fails.
- The prototype is based on the functional language ML and the Group Communications Toolkit. The Group Communications Toolkit aids writing distributed programs.
8 Distributed Process Management Current Status
- A working prototype exists with the features described on the previous slide.
- The system has been tested with 32 processors performing CMS ORCA production.
- Some scalability issues were encountered and repaired.
- The system has been tested on 65 processors with no scalability problems encountered.
9 Distributed Process Management Prototype Future Plans
- In the next few months Distributed Process Management will move development efforts to the CMS Tier 2 Prototype Computation Center at Caltech/UCSD.
- The unique split nature of the center and the large number of processors make it a nearly ideal place to work on scalability, remote submission, and more complex ORCA scenarios.
- In spring 2001 there are plans to support multiple users and to develop a queuing system for when resources are unavailable.
- The first prototype is expected to be complete in the summer of 2001.
10 Distributed Process Management Fully Functional Development
- Milestones are tied to deliverables to CMS for use in production.
- The program starts with algorithm development for use in process management, including data-aware Self-Organizing Neural Network agents for scheduling.
- The fully functional system should be completed sometime in 2003.
11 Introduction to Distributed Database Management 2.3.2
- Distributed Database Management develops tools, external to the ODBMS, that control replication and synchronization of data over the grid, as well as monitoring and improving the performance of database access.
- As event production becomes less CERN-centric there is an immediate need for tools to replicate data produced at remote sites. There is also a need to evaluate and improve the performance of database access.
- In the future, Distributed Production will need tools to automatically synchronize some databases over all sites analyzing results, and to replicate databases on demand for remote analysis jobs.
- Distributed Database Management attempts to meet both these needs.
12 Distributed Database Management Prototype Development
- To meet the long- and short-term goals two paths were pursued: an investigational prototype written in Perl, and development with the Grid Data Management Pilot (GDMP) of a functional prototype based on Globus middleware.
- Both require high-speed transfers, secure data access, transfer verification, integration of the data upon arrival, and remote catalogue querying and publishing.
13 Investigational Prototype Goals
- One-way bulk replication of read-only (static) datafiles.
- A simple prototype using available software:
  - RFIO from HPSS to disk at CERN
  - SCP from disk at CERN to disk at Fermilab
  - FMSS to archive from disk at Fermilab to tape
  - Objectivity tools (oodumpcatalog, oodumpschema, oonewfd, ooschemaupgrade, ooattachdb)
  - all wrapped up in Perl with HTTP and TCP/IP
- The aim is automation, not performance: transferring one 2 GB file is easy, transferring 1000 is not.
- The objective is to clone (part of) a federation from CERN at Fermilab.
- Automated MSS-to-MSS transfer via (small) disk pools.
- Intended as a possible fallback solution.
- Documentation at http://home.cern.ch/wildish
14 Investigational Prototype Steps
- The basic steps (a schematic sketch of the loop follows the list):
  1. Create an empty federation with the right schema and pagesize. Get the schema directly from the source federation via a web-enabled ooschemadump.
  2. Find out what data is available. Use a web-enabled oodumpcatalog to list the source federation catalogue.
  3. Determine what is new with respect to your local federation. Use a catalogue-diff, based on DB name or ID.
  4. Request the files you want from a server at the source site. The server will stage files from HPSS to a local disk buffer, then send them to you.
  5. Process files as they arrive. Attach them to your federation, archive them to MSS, and purge them when your local disk buffer fills up.
- Repeat steps 2, 3, and 4 as desired, and step 5 as desired.
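
As a schematic illustration of steps 2-5, the Python sketch below diffs the remote catalogue against the local one, fetches the new files, and attaches and archives them. The endpoints, catalogue format, and the fmss_archive command are hypothetical; the real prototype was Perl wrapped around the Objectivity tools and site-specific staging and archiving commands.

    # Schematic sketch of the catalogue-diff replication loop described above.
    import subprocess
    import urllib.request

    SOURCE = "http://source.example.org/catalogue"   # assumed web-enabled oodumpcatalog endpoint


    def remote_catalogue():
        """Step 2: list the source federation catalogue (assumed 'dbid name' per line)."""
        with urllib.request.urlopen(SOURCE) as resp:
            lines = resp.read().decode().splitlines()
        return dict(line.split(maxsplit=1) for line in lines if line.strip())


    def local_catalogue():
        """Parse the local federation catalogue, e.g. from oodumpcatalog output (format assumed)."""
        out = subprocess.run(["oodumpcatalog"], capture_output=True, text=True).stdout
        return dict(line.split(maxsplit=1) for line in out.splitlines() if line.strip())


    def catalogue_diff(remote, local):
        """Step 3: databases present at the source but not yet in the local federation."""
        return {dbid: name for dbid, name in remote.items() if dbid not in local}


    def replicate(new_files):
        """Steps 4-5: fetch each new file, attach it, archive it to MSS."""
        for dbid, name in new_files.items():
            subprocess.run(["scp", f"source.example.org:/buffer/{name}", "/buffer/"], check=True)
            subprocess.run(["ooattachdb", f"/buffer/{name}"], check=True)    # attach to the federation (arguments assumed)
            subprocess.run(["fmss_archive", f"/buffer/{name}"], check=True)  # hypothetical MSS archive command
            # purging of /buffer when it fills up is omitted


    if __name__ == "__main__":
        replicate(catalogue_diff(remote_catalogue(), local_catalogue()))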
15 Exporting Data Prototype Design
[Diagram: an HTTP catalogue-server at CERN serves ooschemadump and oodumpcatalog output to a remote client at Fermilab, which uses ooschemaupgrade and oonewfd to build a new cloned federation alongside the user federation; traffic may pass through firewalls on either side.]
16 Distributed Database Management Investigational Prototype Exporting
[Diagram: at CERN, a catalogue-server (oodumpcatalog over HTTP) and a DBServer stage files from HPSS into a disk pool with rfcp; a catalogue-diff at Fermilab requests files over a TCP socket, the files are copied with scp (secure shell copy) through the firewalls to the Fermilab disk pool, processed as new DBs, attached to the local cloned federation with ooattachdb, and archived to the Fermilab MSS.]
17 Distributed Database Management Investigational Prototype Results
- 600 GB transferred in 9 days, from SHIFT20 (200 GB disk) to CMSUN1 (280 GB disk).
- The federation was built automatically as data arrived, and the data was archived automatically to FMSS.
- A peak rate of 2.7 MB/sec was sustainable for several hours.
- Performance was unaffected by batch jobs running on the Fermilab client or the CERN server.
- Best results were obtained with ~40 simultaneous copies running.
- Transfers were monitored with the production-monitoring system; the Fermilab client could be monitored from a desktop at CERN.
- Investigational Prototype development is frozen, but parts of the code are being reused for Distributed Production Tools, an updated monitoring system, and database comparisons.
18 Distributed Database Management Functional Prototype
- A flexible, layered, and modular architecture designed to support modifications and extensions, using Globus as the basic middleware.
- Data Model (see the sketch after this list):
  - Export Catalog
    - Contains information about the newly produced files which are ready to be accessed by other sites.
    - The export catalog is published to all the subscribed sites.
    - A new export catalog, containing only the newly generated files, is produced every time a site wants to publish its files.
  - Import Catalog
    - Contains information about the files which have been published by other sites but not yet transferred locally.
    - As soon as a file is transferred locally, validated, and attached to the federation, it is removed from the import catalog.
  - Subscription Service
    - All the sites that subscribe to a particular site get notified whenever there is an update in its catalog. Supports both push and pull mechanisms.
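
As a rough illustration of this data model, the following Python sketch (hypothetical names, not GDMP code) shows per-publication export catalogs, an import catalog of files not yet transferred, and push-style notification of subscribers.

    # Rough sketch of the export/import catalog and subscription model.
    from dataclasses import dataclass, field


    @dataclass
    class FileEntry:
        lfn: str          # logical file name
        size: int
        checksum: str


    @dataclass
    class Site:
        name: str
        import_catalog: dict = field(default_factory=dict)   # lfn -> FileEntry, published but not yet local
        subscribers: list = field(default_factory=list)      # sites subscribed to this site's catalog

        def publish(self, new_files):
            """Generate a fresh export catalog containing only the new files
            and push it to every subscribed site."""
            export_catalog = {f.lfn: f for f in new_files}
            for subscriber in self.subscribers:
                subscriber.notify(export_catalog)             # push mechanism
            return export_catalog                             # kept so pull-style clients can fetch it too

        def notify(self, export_catalog):
            """Merge a remote export catalog into the local import catalog."""
            self.import_catalog.update(export_catalog)

        def mark_transferred(self, lfn):
            """Once a file is transferred, validated, and attached to the
            federation, it is removed from the import catalog."""
            self.import_catalog.pop(lfn, None)


    # Example: Fermilab subscribes to CERN; CERN publishes two new files.
    cern, fnal = Site("CERN"), Site("FNAL")
    cern.subscribers.append(fnal)
    cern.publish([FileEntry("run1.db", 2_000_000_000, "abc"), FileEntry("run2.db", 1_500_000_000, "def")])
    fnal.mark_transferred("run1.db")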
19 Database Replicator Functional Prototype Architecture
- Communication
  - Control messages
- Data Mover
  - File transfers
  - Logging incoming and outgoing files
  - Resuming file transfers
  - Progress meters
  - Error checks
- Security
  - Authentication and authorization
- Replica Manager
  - Handling the replica catalogue
  - Replica selection and synchronization
[Diagram: "Layered Architecture for Distributed Data Management" — an Application layer sits above the Request Manager, DB Manager, Information Service, Replica Manager, Security, Control Comm., and Data Mover components, which build on Globus middleware (Globus-threads, Globus-dc, Globus Rep. Manager, gssapi, GIS, Objy API, Globus-ftp, Globus_io).]
20 Database Replicator Functional Prototype Architecture
- Information Service
  - Publishes data and network resources at sites.
- DB Manager
  - Backend to database-specific functions.
- Request Manager
  - Generates requests on the client side and handles requests on the server side.
- Application
  - Multi-threaded server handling clients.
21 Integration into the CMS Environment
[Diagram: integration of GDMP with the CMS production environment at two sites. At Site A, the CMS physics software writes DBs into the production federation and its catalog; a CheckDB script performs a DB completeness check and, through the CMS/GDMP interface, triggers the GDMP system to generate and publish a new export catalog to the subscribers list. At Site B, the GDMP server reads the published catalog, generates an import catalog, and replicates the files over the WAN (optionally staging them); stage/purge scripts copy the files to MSS, the files are transferred and attached to the user federation, the catalog is updated, and the disk copies are purged.]
22 Database Replicator Functional Prototype Current Status
- The decision was made to use the Functional Prototype in the fall ORCA production. This required adding some features and making it more fault tolerant:
  - Parallel transfers to improve performance.
  - Resumption of file transfers from checkpoints to handle network interruptions (sketched below).
  - Catalogue filtering to allow more choices for files to import from and export to remote sites.
  - A user guide.
- It is being used at remote centers for ORCA fall production to handle replication of Objectivity files.
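
The sketch below illustrates the idea of resuming a transfer from a checkpoint; it is not the GDMP implementation, and it assumes an HTTP server (hypothetical URL) that honours Range requests.

    # Illustrative sketch of resuming a file transfer from a checkpoint: the
    # receiver records how many bytes have already arrived and asks the
    # sender to continue from that offset.
    import os
    import urllib.request


    def resume_download(url, dest, chunk=1 << 20):
        """Fetch `url` into `dest`, restarting from the current size of the
        partial file after an interruption."""
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
        with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
            while True:
                data = resp.read(chunk)
                if not data:
                    break
                out.write(data)           # the appended bytes act as the checkpoint


    # Usage: call again after a network interruption and the transfer
    # continues where the previous attempt stopped.
    # resume_download("http://source.example.org/buffer/run1.db", "/buffer/run1.db")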
23 Database Replicator Prototype Future Plans
- When the GDMP tools were written they were tightly coupled to Objectivity applications and were unable to replicate non-Objectivity files. With the addition of the Globus Replica Catalogue, they should be able to perform file-format-independent replication in January of 2001.
- In May 2001 integration and development of Grid Information Services should begin. At the moment the data replicator cannot make an intelligent choice about which copy to access when several are available. This decision should be made based on the current network bandwidth, the latency between the two given nodes, the load on the data servers, etc. A toy scoring sketch follows.
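
As a toy illustration of such a choice, the sketch below scores each available replica by bandwidth, latency, and server load and picks the best; the metrics and weights are invented and are not GDMP code.

    # Toy replica selection: given several sites holding a copy, pick the one
    # with the best combination of bandwidth, latency, and server load.
    from dataclasses import dataclass


    @dataclass
    class Replica:
        site: str
        bandwidth_mbps: float   # measured bandwidth to this site
        latency_ms: float       # round-trip latency to this site
        server_load: float      # load average on the data server


    def score(r: Replica) -> float:
        """Higher is better: reward bandwidth, penalize latency and load.
        The weights are arbitrary placeholders."""
        return r.bandwidth_mbps / (1.0 + 0.01 * r.latency_ms) / (1.0 + r.server_load)


    def choose_replica(replicas):
        return max(replicas, key=score)


    print(choose_replica([
        Replica("CERN", bandwidth_mbps=12.0, latency_ms=120.0, server_load=2.5),
        Replica("FNAL", bandwidth_mbps=8.0, latency_ms=30.0, server_load=0.5),
    ]).site)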
24 Fully Functional Prototype Development
- Development toward a fully functional prototype is foreseen to start after the summer of 2001 and continue until 2003.
- This involves the testing and integration of grid tools currently under development:
  - Mobile agents that float on the network independently, communicate, and make intelligent decisions when triggered.
  - Virtual data: the concept that everything except irreproducible raw experimental data need exist only as specifications for how to derive it.
25 Request Redirection Protocol
- The second goal of Distributed Database Management is to evaluate and improve database access.
- The performance and capabilities of the Objectivity AMS server can be improved by writing plugins that conform to a well-defined interface.
- To improve the availability of the database servers, one such plugin, the Request Redirection Protocol, has been implemented. When the federated database has determined that an AMS has crashed (due to a disk failure, etc.), jobs can be automatically redirected to an alternate server. This has been running on the CERN AMS servers for a month. (A simple failover sketch follows this list.)
- In early 2001, a security protocol plugin will be implemented.
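
The Objectivity AMS plugin interface is not described here, so the following is only an illustrative failover sketch (hypothetical host names): try the primary data server first and redirect to an alternate when the primary appears to be down.

    # Illustrative failover sketch, not the AMS plugin interface.
    import socket

    SERVERS = ["ams1.example.org", "ams2.example.org"]   # hypothetical host names


    def open_data_connection(servers=SERVERS, port=5000, timeout=5.0):
        """Return a socket to the first server that accepts a connection,
        mimicking the redirection of jobs away from a crashed AMS."""
        last_error = None
        for host in servers:
            try:
                return socket.create_connection((host, port), timeout=timeout)
            except OSError as err:              # connection refused, timeout, ...
                last_error = err                # fall through to the alternate server
        raise RuntimeError(f"all data servers unreachable: {last_error}")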
26 Introduction to Load Balancing 2.3.3
- Balancing the use of resources in a distributed computing environment is difficult. It requires the integration and augmentation of elements of Distributed Process Management and Distributed Database Management with intelligent algorithms that determine the most efficient course of action.
- In a distributed computing system, jobs can be submitted to the computing resources where the data is available, or the data can be moved to available computing resources.
- Deciding between these two cases so as to efficiently complete all requests and balance the load over the whole computing grid requires good algorithms and lots of information about network traffic, CPU loads, and data availability. A toy cost comparison is sketched below.
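
As a toy illustration of the trade-off, the sketch below compares an estimated completion time for running the job where the data already sits against moving the data to a less loaded site; the cost formulas and numbers are invented.

    # Toy cost model for "move the job" vs "move the data".
    def time_at_data_site(cpu_seconds, load):
        """Run at the busy site that holds the data: the job shares CPUs."""
        return cpu_seconds * (1.0 + load)


    def time_if_data_moved(cpu_seconds, data_gb, bandwidth_mb_s, remote_load):
        """Transfer the data over the WAN, then run at the less loaded site."""
        transfer = data_gb * 1024.0 / bandwidth_mb_s
        return transfer + cpu_seconds * (1.0 + remote_load)


    def decide(cpu_seconds, data_gb, bandwidth_mb_s, local_load, remote_load):
        stay = time_at_data_site(cpu_seconds, local_load)
        move = time_if_data_moved(cpu_seconds, data_gb, bandwidth_mb_s, remote_load)
        return ("run at data site", stay) if stay <= move else ("move data", move)


    # A 2-hour job over 10 GB of data, 2.5 MB/s WAN link, loaded local farm, idle remote farm.
    print(decide(cpu_seconds=7200, data_gb=10, bandwidth_mb_s=2.5, local_load=3.0, remote_load=0.2))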
27 Load Balancing Current Status
- While there has been considerable work on Distributed Process Management and Distributed Database Management, and some effort on information services and algorithm development, most of the work on Load Balancing is still to come.
- Preliminary work has been done on a prototype of Grid Information Services using Globus middleware (a sketch of the published record follows this list).
  - Publishes outside-domain resources so that they can be accessed inside the domain.
  - Static:
    - CPU power
    - Operating system details
    - Software versions
    - Available memory
  - Dynamic:
    - CPU load
    - Network bandwidth
    - Network latency
    - Updates every few seconds
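
The record format actually published by the prototype is not given, so the sketch below only illustrates the idea: gather static attributes once and refresh the dynamic ones every few seconds (hypothetical schema and placeholder values).

    # Sketch of the kind of record a site might publish to an information service.
    import json
    import os
    import platform
    import time


    def static_info():
        return {
            "cpu_count": os.cpu_count(),
            "os": platform.platform(),
            "software": {"orca": "4.x", "objectivity": "5.x"},   # placeholder versions
        }


    def dynamic_info():
        load1, load5, load15 = os.getloadavg()                   # Unix only
        return {
            "cpu_load": load1,
            "timestamp": time.time(),
            # bandwidth/latency probes to neighbouring sites would go here
        }


    def publish_loop(interval=5.0, cycles=3):
        record = {"static": static_info()}
        for _ in range(cycles):
            record["dynamic"] = dynamic_info()
            print(json.dumps(record))     # stand-in for pushing to the information service
            time.sleep(interval)


    if __name__ == "__main__":
        publish_loop()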
28 Load Balancing Future Plans
- Algorithm development should start in the summer of 2001, using conventional and Self-Organizing Neural Network techniques.
- Integration of Distributed Process Management and Distributed Database Management should begin as those projects enter the fully functional prototype phase.
29 Introduction to Distributed Production Tools 2.3.4
- The goal of Distributed Production Tools is to develop tools for immediate use to aid CMS production at existing computing facilities:
  - Job submission
  - Transferring and archiving results
  - System monitoring
- Until recently, US-CMS had no dedicated production facilities. Production in the US was performed on existing facilities with a wide variety of capabilities, platforms, and configurations.
- CMS has an immediate need for simulated events to complete the Trigger TDR and later the Physics TDR. This project helps to meet the immediate need, while lessons learned help the long-term goals as well.
30 Distributed Production Tools Current Status
- Based on the database replicator investigational prototype, tools have been designed to automatically record and archive the results of production performed at remote sites and to transfer these results to the CERN mass storage system. This has primarily been used for archiving CMSIM production performed at Padua, Moscow, IN2P3, Caltech, Fermilab, Bristol, and Helsinki.
- Tools have been developed to utilize existing facilities in the US. The aging HP X-class Exemplar system has been used for CMSIM production, and the Wisconsin Condor system, a scavenger system using spare cycles of Linux machines, has been used for CMSIM production and will be used for ORCA production this fall.
31 Distributed Production Tools Current Status of System Monitoring
- Tools have been developed to monitor production systems.
  - This helps to evaluate and repair bottlenecks in the production systems.
  - This provides realistic input parameters to the system simulation tools and improves the quality of simulation.
  - This provides information for making intelligent choices about the requirements of future production facilities.
- Monitoring uses Perl/bash scripts running on each node (a sketch of the collection path follows this list).
  - Information is generated in a netlogger-inspired format.
  - UDP datagrams transmit results to collection machines.
  - Numerical quantities are histogrammed every n minutes and put on the web.
- During spring production it was used to monitor 150 nodes with 25 MB of ASCII logging per day.
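
A minimal sketch of the per-node sender side is shown below; the field names, port, and collector host are assumptions, not the actual scripts.

    # Per-node monitoring sender: readings are formatted as key=value records
    # in a netlogger-inspired style and shipped to a collector as UDP datagrams.
    import os
    import socket
    import time

    COLLECTOR = ("monitor.example.org", 9999)   # hypothetical collection machine


    def reading():
        load1, _, _ = os.getloadavg()           # Unix only; one sample metric
        return {
            "DATE": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "HOST": socket.gethostname(),
            "EVENT": "node.status",
            "LOAD1": f"{load1:.2f}",
        }


    def send_forever(interval=60):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            record = " ".join(f"{k}={v}" for k, v in reading().items())
            sock.sendto(record.encode(), COLLECTOR)   # fire-and-forget datagram
            time.sleep(interval)


    if __name__ == "__main__":
        send_forever()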
32 Distributed Production Tools Current Status of System Monitoring
- A goal of the project is to understand how best to arrange the data for fast access.
- Monitor standard things on data servers:
  - CPU, network, disk I/O, paging, swap, load average, etc.
- Monitor the AMS:
  - Which files the user reads (including those already on disk).
  - Number of open filehandles (also for the lockserver).
- Monitor the lockserver:
  - Transaction ages, hosts holding locks, etc.
- Monitor the staging system:
  - Names of files staged in.
  - Time it takes for them to arrive.
  - Names of purged files.
33 System Monitoring Results
[Figure: AMS activity on 6 AMS servers, measured by the simple means of counting the number of filehandles that each server had open at a given time.]
34 Distributed Production Tools Future Plans
- Tools are being developed to support generic job submission over diverse existing computing facilities, to improve ease of use. The first of these, which is based on LSF, will be available in the spring of 2001.
- It is a relatively small extension of the system monitoring tools to initiate an action when the monitoring measures certain kinds of problems. A system already exists to send e-mail to the appropriate people. Tools are being developed so that all the jobs in a batch queue can be cleanly stopped or paused if the system monitoring tools determine that a server has crashed or that a disk has filled up. A sketch of such a monitoring-triggered action is given below.
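
The sketch below illustrates such a monitoring-triggered action: when a disk is nearly full, e-mail the appropriate people and pause the batch queue. The threshold, addresses, mail relay, and queue command are placeholders, not the actual tools.

    # Sketch of a monitoring-triggered action.
    import shutil
    import smtplib
    import subprocess
    from email.message import EmailMessage

    ADMINS = ["cms-prod-ops@example.org"]        # hypothetical contact list


    def disk_nearly_full(path="/data", threshold=0.95):
        usage = shutil.disk_usage(path)
        return usage.used / usage.total > threshold


    def notify(subject, body):
        msg = EmailMessage()
        msg["Subject"], msg["To"], msg["From"] = subject, ", ".join(ADMINS), "monitor@example.org"
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
            smtp.send_message(msg)


    def pause_batch_queue(queue="cms_prod"):
        # Placeholder for the real batch-system command (e.g. an LSF queue-stop call).
        subprocess.run(["echo", f"pause queue {queue}"], check=True)


    if __name__ == "__main__":
        if disk_nearly_full():
            notify("Disk nearly full", "Pausing the production queue until space is freed.")
            pause_batch_queue()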
35 Introduction to System Simulation 2.3.5
- Distributed computing systems of the scope and complexity proposed by CMS do not yet exist. The System Simulation project attempts to evaluate distributed computing plans by performing simulations of large-scale computing systems.
- The MONARC simulation toolkit is used. The goals of MONARC are:
  - To provide realistic modeling of distributed computing systems, customized for specific HEP applications.
  - To reliably model the behavior of computing facilities and networks, using specific application software and usage patterns.
  - To offer a dynamic and flexible simulation environment.
  - To provide a design framework to evaluate a range of possible computing systems, as measured by the ability to provide physicists with the requested data within the required time.
  - To narrow down a region of parameter space in which viable models can be chosen.
- The toolkit is Java based, to take advantage of Java's built-in support for multi-threading for concurrent processing. (A toy sketch of this style of modeling follows.)
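
MONARC itself is a Java, process-oriented simulation toolkit; purely as a toy illustration of the style of model involved (and not MONARC code), the sketch below simulates jobs that first read their data over a shared network link and then occupy one of a fixed pool of CPUs, reporting total completion time and CPU utilization. All numbers are invented.

    # Toy discrete-event sketch of a farm serving a stream of jobs.
    import heapq


    def simulate(n_jobs=100, n_cpus=20, cpu_time=3600.0, data_gb=1.0, link_mb_s=12.5):
        transfer = data_gb * 1024.0 / link_mb_s          # transfers serialized on the shared link
        link_free = 0.0                                   # time when the link next becomes free
        cpus = [0.0] * n_cpus                             # time when each CPU next becomes free
        heapq.heapify(cpus)
        busy = 0.0                                        # total CPU-seconds consumed
        finish = 0.0
        for _ in range(n_jobs):
            link_free += transfer                         # job's data arrives at this time
            cpu_free = heapq.heappop(cpus)
            start = max(link_free, cpu_free)              # wait for both the data and a CPU
            end = start + cpu_time
            heapq.heappush(cpus, end)
            busy += cpu_time
            finish = max(finish, end)
        utilization = busy / (n_cpus * finish)
        return finish, utilization


    total, util = simulate()
    print(f"all jobs done after {total / 3600:.1f} h, CPU utilization {util:.0%}")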
36 System Simulation Current Status
- MONARC is currently in its third phase and was recently updated to be able to handle larger-scale simulations.
- The simulation of the spring 2000 ORCA production served as a nice validation of the toolkit. Using inputs from the system monitoring tools, the simulation was able to accurately reproduce the behavior of the computing farm: CPU utilization, network traffic, and the total time to complete jobs.
- As an indication of the maturity of the simulation tools, a simulation of Distributed Process Management using Self-Organizing Neural Networks is being performed. Since full-scale production facilities will not be available for some time, it is nice to get a head start on algorithm development using the simulation.
37 System Simulation Current Status
[Figure slide: plots only; no text recovered.]

38 System Simulation Spring HLT Production
- Below are simulation examples of network traffic and CPU efficiency.
[Figure: side-by-side panels labeled "Measurement" and "Simulation".]
39 System Simulation Future Plans
- There are plans to update the estimated CMS computing needs in December.
- In early 2001 there are plans to update the MONARC package with modules for Distributed Process Management and Distributed Database Management.
40 System Simulation Future Plans
- The upgraded package should allow better simulation of distributed computing systems. Two studies are planned for spring 2001:
  - A study of the role of tapes in Tier 1-Tier 2 interactions, which should help describe the interactions and evaluate storage needs.
  - A complex study of Tier 0-Tier 1-Tier 2 interactions to evaluate a complete CMS data processing scenario, including all the major tasks distributed among the regional centers.
- During the remainder of 2001 the System Simulation project will aid in the development of load balancing schemes.
41 Conclusions
- The CMS Distributed Computing Model is complex, and advanced software is needed to make it work.
  - Tools are needed to submit, monitor, and control groups of jobs at remote and local sites.
  - Data needs to be moved over the computing grid to the processes that need it.
  - An intelligent system needs to exist to determine the most efficient split between moving data and exporting processes.
- CMS has TDRs due which require large numbers of simulated events for analysis, and tools are needed to facilitate production.
- We are trying to deliver both.