Title: ARDA Report to the LCG SC2
1 ARDA Report to the LCG SC2
- Philippe Charpentier
- For the RTAG-11/ARDA group
2 Repeat of the last talk, emphasizing the main points
3 ARDA roadmap to distributed analysis
- Analysis model
- How to implement Hepcal-II use cases
- Ev. Proviso for Hepcal-II use cases?
4 API to Grid services
- Importance of the API to interface:
- Experiment framework
- Analysis shells, e.g. ROOT
- Grid portals and other forms of user interaction with the environment
- Advanced services, e.g. virtual data, analysis logbooks etc.
- Experiment-specific services, e.g. data and metadata management systems
5 ARDA and Grid services architecture
- OGSI gives the framework in which to run LHC/ARDA services
- Addresses the architecture for communication, lifetime support, etc.
- Provides a framework for advanced interactions with the Grid
- This is outside of the analysis services API, but to be implemented in standard ways
- Need to address issues of OGSI performance and scalability up-front
- Importance of modeling, plan for scaling up, engineering of the underlying services infrastructure
6 Roadmap to a Grid Services Architecture for the LHC
- Transition to Grid services explicitly addressed in several existing projects
- Clarens and Caltech GAE, MonALISA
- Based on web services for communication, Jini-based agent architecture
- Dirac
- Based on intelligent agents working within batch environments
- AliEn
- Based on web services and communication to a database proxy
- Initial work on OGSA within LCG-GTA
- GT3 prototyping
- No evolutionary path from GT2-based grids, but augmenting LCG-1 and other grid services
- Grid services interface to CE, SE, VO management interfaces
- OGSI-based services speak JDL, DAGMan etc.
- ARDA provides a decomposition into those services that address the LHC distributed analysis use cases
7 ARDA's model of interfacing the applications to the Grid UI/API
- Stress importance of providing an API that others can program against
- Benefits of a common API to the framework
- Goes beyond traditional UIs à la Ganga, Grid portals, etc.
- Benefits in interfacing to physics applications like ROOT et al.
- Process to get a common API between the experiments --> prototype
8 ARDA Roadmap for the Prototype
- The prototype provides the initial blueprint -- do not aim for a full specification of all the interfaces
- 4-pronged approach:
- Re-factoring of the AliEn web services into ARDA
- Initial release with an OGSI-lite/GT3 proxy, consolidation of the API, release
- Implementation of agreed interfaces, testing, release
- GT3 modeling and testing, eventually quality assurance
- Interfacing to POOL, analysis shells, ...
- Also an opportunity for early interfacing to complementary projects
- Interfacing to the experiments' frameworks
- Metadata handlers, experiment-specific services
- Provide interaction points with the community
- Early releases and workshops every few months
- Early, strong feedback on the API and services
- Decouple from deployment issues
9 Experiments and LCG Involved in Prototyping
- The ARDA prototype would define the initial set of services and their interfaces
- Important to involve the experiments and LCG at the right level
- Initial modeling of GT3-based services
- Interface to major cross-experiment packages: POOL, ROOT, PROOF, others
- Program experiment frameworks against the ARDA API, integrate with experiment environments
- Expose services and UI/API to other LHC projects to allow synergies
- Spend appropriate effort to document, package, release, deploy
- After the prototype is delivered, in Spring 2004:
- Scale up and re-engineer as needed: OGSI, databases, information services
- Deployment and interfaces to site and grid operations, VO management etc.
- Build higher-level services and experiment-specific functionality
- Work on interactive analysis interfaces and new functionalities
10 Possible Strawman
- Strawman workplan for ARDA prototype
11 Synergy with and Engagement of other Projects
12 Action Items
- Develop ARDA work plan, schedule, milestones
- Identify effort and build team(s)
- Develop a plan for interfacing to and engaging the LHC and Grid community
- Release the ARDA RTAG document
13 Slides from previous presentations
14 ARDA Mandate
15 ARDA Mandate
Long list of projects being looked at, analyzing how their components and services would map to the ARDA services, synthesized to provide a description of the ARDA components
GAG discussed an initial internal working draft, GAG to follow up
Both of these are in progress --- will provide a technical annex that documents these
This is a main thrust of the ARDA roadmap
Will be part of the technical annex -- e.g. security, auditing etc.
Main deliverable of ARDA, approach to be described in this talk
16 Makeup of the ARDA RTAG
- Requirements and Technical Assessment Group of the SC2
- Gives recommendations to the SC2, and thus to the LCG and the four experiments
- Members:
- ALICE: Fons Rademakers and Predrag Buncic
- ATLAS: Roger Jones and Rob Gardner
- CMS: Lothar Bauerdick and Lucia Silvestris
- LHCb: Philippe Charpentier and Andrei Tsaregorodtsev
- LCG GTA: David Foster, stand-in Massimo Lamanna
- LCG AA: Torre Wenaus
- GAG: Federico Carminati (CMS members in GAG: Rick, Claudio)
17 ARDA mode of operation
- Constructive and open-minded committee
- Series of weekly meetings in July and August, mini-workshop in September
- Invited talks from existing experiments' projects:
- Summary of the Caltech GAE workshop (Torre)
- PROOF (Fons)
- AliEn (Predrag)
- DIAL (David Adams)
- GAE and Clarens (Conrad Steenberg)
- Ganga (Pere Mato)
- Dirac (Andrei)
- Cross-check of the emerging ARDA decomposition of services with other projects:
- Magda, DIAL -- Torre, Rob
- EDG, NorduGrid -- Andrei, Massimo
- SAM, MCRunjob -- Roger, Lothar
- BOSS, MCRunjob -- Lucia, Lothar
- Clarens, GAE -- Lucia, Lothar
- Ganga -- Rob, Torre
- PROOF -- Fons
- AliEn -- Predrag
18 Initial Picture of Distributed Analysis (Torre, Caltech workshop)
19 Hepcal-II Analysis Use Cases
- Scenarios based on the GAG HEPCAL-II report
- Determine data sets and eventually event components
- Input data are selected via a query to a metadata catalogue
- Perform iterative analysis activity
- Selection and algorithm are passed to a workload management system, together with a specification of the execution environment
- Algorithms are executed on one or many nodes
- The user monitors the progress of the job execution
- Results are gathered together and passed back to the job owner
- Resulting datasets can be published to be accessible to other users
- Specific requirements from Hepcal-II:
- Job traceability, provenance, logbooks
- Also discussed: support for finer-grained access control and enabling data sharing within physics groups
20 Analysis Scenario
- This scenario represents the analysis activity from the user's perspective. However, some other actions are done behind the scenes of the user interface
- To carry out the analysis tasks, users access shared computing resources. To do so, they must be registered with their Virtual Organization (VO) and authenticated, and their actions must be authorized according to their roles within the VO
- The user specifies the necessary execution environment (software packages, databases, system requirements, etc.) and the system ensures it on the execution node. In particular, the necessary environment can be installed according to the needs of a particular job
- The execution of the user job may trigger transfers of various datasets between a user interface computer, execution nodes and storage elements. These transfers are transparent to the user
21 Example: Asynchronous Analysis
- Running Grid-based analysis from inside ROOT (adapted from an AliEn example)
- ROOT calling the ARDA API from the command prompt:
- // connect and authenticate to the Grid service "arda" as "lucia"
- TGrid *arda = TGrid::Connect("arda", "lucia", "", "");
- // create a new analysis object (<unique ID>, <title>, #subjobs)
- TArdaAnalysis *analysis = new TArdaAnalysis("pass001", "MyAnalysis", 10);
- // set the program which executes the analysis macro/script
- analysis->Exec("ArdaRoot.sh", "file:/home/vincenzo/test.C"); // script to execute
- // set up the event metadata query
- analysis->Query("2003-09/V6.08.Rev.04/00110/gjetmet.root?pt>0.2");
- // specify job splitting and run
- analysis->OutputFileAutoMerge(true); // merge all produced .root files
- analysis->Split(); // split the task into subjobs
- analysis->Run(); // submit all subjobs to the ARDA queue
- // asynchronously, at any time, get the (partial or complete) results
- analysis->GetResults(); // download partial/final results and merge them
- analysis->Info(); // display job information
22 Asynchronous Analysis Model
- Extract a subset of the datasets from the virtual file catalogue using metadata conditions provided by the user.
- Split the tasks according to the location of the data sets.
- A trade-off has to be found between best use of the available resources and minimal data movement. Ideally, jobs should be executed where the data are stored. Since one cannot expect a uniform storage location distribution for every subset of data, the analysis framework has to negotiate with dedicated Grid services the balance between local data access and data replication.
- Spawn sub-jobs and submit them to Workload Management with precise job descriptions.
- The user can check the results while and after the data are processed.
- Collect and merge the available results from all terminated sub-jobs on request.
- Analysis objects associated with the analysis task remain persistent in the Grid environment, so the user can go offline and reload an analysis task at a later date, check its status, merge the current results or resubmit the same task with modified analysis code (a sketch follows below).
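A minimal sketch of this persistence from the ROOT prompt, reusing the hypothetical TArdaAnalysis interface of the previous slide; the TArdaAnalysis::Open() lookup by task ID and the modified macro name are illustrative assumptions, not part of any agreed API:

    // later session: reconnect and authenticate to the Grid service
    TGrid *arda = TGrid::Connect("arda", "lucia", "", "");
    // reload the persistent analysis task by its unique ID (hypothetical call)
    TArdaAnalysis *analysis = TArdaAnalysis::Open("pass001");
    analysis->Info();         // check the status of the sub-jobs
    analysis->GetResults();   // merge whatever partial results are available
    // optionally resubmit the same task with modified analysis code (illustrative macro name)
    analysis->Exec("ArdaRoot.sh", "file:/home/vincenzo/test_v2.C");
    analysis->Run();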
23 Synchronous Analysis
- Scenario using PROOF in the Grid environment
- Parallel ROOT Facility, main developer Maarten Ballintijn/MIT
- PROOF already provides a ROOT-based framework to use (local) cluster computing resources
- balances the workload dynamically, with the goal of optimizing CPU exploitation and minimizing data transfers
- makes use of the inherent parallelism in event data
- works in heterogeneous clusters with distributed storage
- Extend this to the Grid using interactive analysis services that could be based on the ARDA services (see the sketch below)
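For comparison, a minimal sketch of a plain PROOF session on a local cluster from the ROOT prompt; the master host name, data files and selector are illustrative, and the Grid-enabled variant proposed here would obtain the data set and workers via the ARDA services instead:

    // open a PROOF session on a (local) cluster master -- host name illustrative
    TProof *proof = TProof::Open("proofmaster.cern.ch");
    // describe the data set to be analyzed: a set of trees spread over several files
    TDSet *events = new TDSet("TTree", "Events");
    events->Add("root://se.cern.ch//data/gjetmet_001.root");
    events->Add("root://se.cern.ch//data/gjetmet_002.root");
    // run a TSelector over all events, in parallel on the worker nodes
    proof->Process(events, "MySelector.C+");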
24 ARDA Roadmap Informed by DA Implementations
- Following SC2 advice, reviewed the major existing DA projects
- Clearly AliEn today provides the most complete implementation of distributed analysis services that is fully functional -- it also already interfaces to PROOF
- Implements the major Hepcal-II use cases
- Presents a clean API to experiment applications, web portals, ...
- Should address most requirements for upcoming experiment physics studies
- Existing and fully functional interface to a complete analysis package --- ROOT
- Interface to the PROOF cluster-based interactive analysis system
- Interfaces to any other system well defined and certainly feasible
- Based on web services, with a global (federated) database as a backend to give state and persistency to the system
- ARDA approach:
- Re-factor AliEn, using the experience of the other projects, to generalize it into an architecture; consider OGSI as a natural foundation for that
- Confront the ARDA services with existing projects (notably EDG, SAM, Dirac, etc.)
- Synthesize service definitions, defining their contracts and behavior
- Blueprint for the initial distributed analysis service infrastructure
- The ARDA services blueprint gains credibility with a functional prototypical implementation
25 ARDA Distributed Analysis Services
- Distributed analysis in a Grid-services-based architecture
- ARDA services should be OGSI compliant -- built upon OGSI middleware
- Frameworks and applications use an ARDA API with bindings to C, Java, Python, Perl, ...
- Interface through a UI/API factory -- authentication, persistent session
- Fabric interface to resources through CE, SE services
- Job description language based on Condor ClassAds and matchmaking
- Database backend(s) provide statefulness and persistence (accessed through a proxy)
- We arrived at a decomposition into the following key services:
- API and User Interface
- Authentication, Authorization, Accounting and Auditing services
- Workload Management and Data Management services
- File and (event) Metadata Catalogues
- Information service
- Grid and Job Monitoring services
- Storage Element and Computing Element services
- Package Manager and Job Provenance services
26 AliEn (re-factored)
27
28 ARDA Key Services for the Distributed Analysis Framework
29 API and User Interface
30 API and User Interface
- ARDA services present an API, called by applications like the experiment frameworks, interactive analysis packages, Grid portals, Grid shells, etc.
- This allows a wide variety of different applications to be implemented. One example is a command-line interface similar to a UNIX file system. Similar functionality can be provided by graphical user interfaces.
- Using these interfaces, it will be possible to access the catalogue, submit jobs and retrieve the output. Web portals can be provided as an alternative user interface, where one can check the status of current and past jobs, submit new jobs and interact with them.
- Web portals should also offer additional functionality to power users: Grid administrators can check the status of all services and monitor, start and stop them, while VO administrators (production users) can submit and manipulate bulk jobs.
- The user interface can use Condor ClassAds as a Job Description Language
- This will maintain compatibility with existing job execution services, in particular LCG-1.
- The JDL defines the executable, its arguments, the software packages or data, and the resources that are required by the job (an illustrative fragment is sketched below)
- The Workload Management service can modify a job's JDL entry by adding or elaborating requirements, based on the detailed information it can get from the system, like the exact location of the dataset and its replicas, or client and service capabilities.
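As an illustration of the ClassAd-style JDL, a hedged sketch of how a job description could be built up and handed to the system from the ROOT session of slide 21; the attribute names follow the Condor/EDG convention, but the specific attributes, values and the Submit() call are illustrative assumptions rather than a defined ARDA interface:

    // build a ClassAd-style job description (values invented for illustration)
    TString jdl;
    jdl += "Executable   = \"ArdaRoot.sh\";";
    jdl += "Arguments    = \"test.C\";";
    jdl += "InputData    = {\"lfn:/2003-09/gjetmet.root\"};";
    jdl += "Requirements = member(\"MyExpSW-1.2\", other.InstalledPackages);";
    // hand the description to the workload management system (hypothetical call)
    arda->Submit(jdl);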
31 File Catalogue and Data Management
- Input and output associated with any job can be registered in the File Catalogue, a virtual file system in which a logical name is assigned to a file.
- Unlike real file systems, the File Catalogue does not own the files; it only keeps an association between the Logical File Name (LFN) and (possibly more than one) Physical File Name (PFN) on a real file or mass storage system. PFNs describe the physical location of the files and include the name of the Storage Element and the path to the local file (a sketch of the mapping is given below).
- This could be extended to the more general case of object collections that are denoted by a metadata system (Dirk Düllmann)
- The system supports file replication and caching, and will use file location information when it comes to scheduling jobs for execution.
- The directories and files in the File Catalogue have privileges for owner, group and world. This means that every user can have exclusive read and write privileges for his portion of the logical file namespace (home directory).
- Etc. pp.
32 Job Provenance Service
- The File Catalogue is not meant to support only data sets: it is extended to include information about running processes in the system (in analogy with the /proc directory on Linux systems) and to support virtual data services
- Each job sent for execution gets a unique id and a corresponding /proc/id directory where it can register temporary files, standard input and output, as well as all job products. In a typical production scenario, the job products are renamed and registered in their final destination in the File Catalogue only after a separate process has verified the output.
- The entries (LFNs) in the File Catalogue have an immutable unique file id attribute that is required to support long references (for instance in ROOT) and symbolic links.
33 Package Manager Service
- Allows dynamic installation of application software released by the VO (e.g. the experiment or a physics group).
- Each VO can provide the packages and commands that can subsequently be executed
- Once the corresponding files with bundled executables and libraries are published in the File Catalogue and registered, the Package Manager will install them automatically as soon as a job becomes eligible to run on a site whose policy accepts these jobs.
- While installing a package in a shared package repository, the Package Manager will resolve the dependencies on other packages and, taking package versions into account, install them as well.
- This means that old versions of packages can be safely removed from the shared repository and, if they are needed again at some later point, they will be re-installed automatically by the system.
- This provides a convenient and automated way to distribute the experiment-specific software across the Grid and assures accountability in the long term (a sketch of the interaction follows below).
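A rough sketch of how this interaction could look, with hypothetical TArdaPackageManager calls and invented package names; it only illustrates the behaviour described above and is not a defined interface:

    // the VO librarian publishes a bundled package whose tarball is already in the File Catalogue
    TArdaPackageManager *pm = arda->GetPackageManager();   // hypothetical accessor
    pm->DefinePackage("MyExpSW", "1.2",
                      "/vo/packages/MyExpSW-1.2.tar.gz",   // LFN of the bundle
                      "ExtLib/5.1");                       // declared dependency, installed as well
    // a user job only declares what it needs; installation happens on the worker node on demand
    analysis->AddPackage("MyExpSW/1.2");                   // hypothetical helper on the analysis object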
34 Computing Element
- The Computing Element is a service representing a computing resource. Its interface should allow submission of a job to be executed on the underlying computing facility, access to the job status information, as well as high-level job manipulation commands. The interface should also provide access to the dynamic status of the computing resource, like its available capacity, load, and number of waiting and running jobs (a sketch of such an interface is given below).
- This service should be available on a per-VO basis.
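A minimal sketch of what such a per-VO Computing Element interface might look like; this is an illustration only, and none of these names are defined by ARDA:

    #include <string>

    // illustrative abstract interface of a Computing Element service
    class ArdaComputingElement {
    public:
        virtual std::string Submit(const std::string &jdl) = 0;         // returns a job id
        virtual std::string GetJobStatus(const std::string &jobId) = 0;
        virtual void        Cancel(const std::string &jobId) = 0;       // high-level job manipulation
        virtual int         RunningJobs() = 0;                          // dynamic resource status ...
        virtual int         WaitingJobs() = 0;
        virtual int         FreeSlots() = 0;                            // ... and available capacity
        virtual ~ArdaComputingElement() {}
    };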
35 Workload Management
- Pull approach, with jobs submitted to a central task queue
- A central service component manages all the tasks (described by JDL)
- Computing elements are defined as "remote queues", providing access to a cluster of computers, to a single machine dedicated to running a specific task, or even to an entire foreign Grid
- The workload manager optimizes the task queue, taking the JDL into account
- The JDL describes job requirements like input files, CPU time, architecture, disk space etc.
- Makes a job eligible to run on one or more computing elements
- Active nodes fetch jobs from the central task queue and start them (see the pull-loop sketch below)
- Job Monitoring service to access job progression and stdout, stderr
- Optimizers inspect the JDL and try to fulfill requests and resolve conflicts
- This can result in triggering file replication, etc.
- Other optimizers can be constructed for special purposes
- E.g. implement policy monitors to enforce VO policies by altering job priorities
- E.g. estimate the time to complete work, taking into account other work going on, resolve specific conflicts, or optimize in a heuristic or neural-net way etc.
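A minimal sketch of the pull model as seen by an agent running at a computing element; the TaskQueue interface, the ClassAd string and the local-batch helper are illustrative assumptions, not part of the ARDA design:

    #include <string>

    // illustrative view of the central task queue as seen from a site agent
    struct TaskQueue {
        // advertise the site's capabilities (ClassAd-style) and ask for a matching job;
        // returns an empty JDL if nothing eligible is queued
        virtual std::string FetchMatchingJob(const std::string &siteClassAd) = 0;
        virtual void ReportStatus(const std::string &jobId, const std::string &status) = 0;
        virtual ~TaskQueue() {}
    };

    // stand-in for the site-local batch submission (LSF, PBS, ...) -- illustrative only
    std::string SubmitToLocalBatch(const std::string &jdl);

    void AgentLoop(TaskQueue &queue, const std::string &siteClassAd) {
        for (;;) {
            // pull: the site asks for work instead of having work pushed onto it
            std::string jdl = queue.FetchMatchingJob(siteClassAd);
            if (jdl.empty()) break;                       // nothing eligible right now
            std::string jobId = SubmitToLocalBatch(jdl);  // hand the job to the local batch system
            queue.ReportStatus(jobId, "RUNNING");
        }
    }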
36 Auditing Services
- The auditing service provides a central "syslog"
- Can be queried by monitors or agents with specific intelligence
- Should allow the implementation of "event handling" or fault recovery
- Together with the monitoring services, allows specific tools and approaches for operation and debugging to be implemented
37 Etc. pp.
38 General ARDA Roadmap
- Emerging picture of the waypoints on the ARDA roadmap
- ARDA RTAG report
- Review of existing projects, component decomposition and re-factoring, capturing of common architectures, synthesis of existing approaches
- Recommendations for a prototypical architecture and definition of prototypical functionality and a development strategy
- Development of a prototype and first release
- Re-factoring the AliEn web services, studying the ARDA architecture in an OGSI context, based on the existing implementation
- POOL and other LCG components (VO, CE, SE, ...) interface to ARDA
- Adaptation of specific ARDA services to the experiments' requirements
- E.g. file catalogs, package manager, metadata handling for different data models
- Integration with and deployment on LCG-1 resources and services
- This will give CMS an (initially ROOT-based) distributed analysis environment
- including PROOF-based interactive analysis
- Re-engineering of prototypical ARDA services, as required
- Evolving services: scaling up and adding functionality, robustness, resilience, etc.
39 Talking Points
- Horizontally structured system of services with a well-defined API and a database backend
- Can easily be extended with additional services; new implementations can be moved in, alternative approaches tested and commissioned
- Interface to the LCG-1 infrastructure
- VDT/EDG interface through CE, SE and the use of JDL, compatible with the existing infrastructure
- ARDA VO services can build on the emerging LCG VO management infrastructure
- ARDA initially looked at file-based datasets, not object collections
- Talk with POOL about how to extend the file concept to a more generic collection concept
- Investigate the experiments' metadata/file catalog interaction
- VO system and site security
- Jobs are executed on behalf of the VO; however, users are fully traceable
- How do policies get implemented, e.g. analysis priorities, MoU contributions etc.
- Auditing and accounting system, priorities through special optimizers
- Accounting of site contributions, which depends on what resources sites expose
- Database backend for the prototype
- Address latency, stability and scalability issues up-front; good experience exists
- In a sense, the system is the database (possibly federated and distributed) that contains all there is to know about all jobs, files, metadata and algorithms of all users within a VO
- A set of OGSI grid services provides "windows"/views into the database, while the API provides the user access
- Allows structuring into federated grids and dynamic workspaces
40 Major Role for Middleware Engineering
- ARDA roadmap based on a well-factored prototype implementation that allows evolutionary development into a complete system that evolves to the full LHC scale
- David Foster: let's recognize that very little work has so far been done on the underlying mechanisms needed to provide the appropriate foundations (message passing structures, fault recovery procedures, component instrumentation etc.)
- The ARDA prototype would be pretty lightweight
- Stability through using a global database as a backend, to which services talk through a database proxy
- People know how to do large databases -- a well-founded principle (see e.g. SAM for Run II), with many possible migration paths
- HEP-specific services, however based on generic OGSI-compliant services
- Expect the LCG/EGEE middleware effort to play a major role in evolving this foundation, its concepts and implementation
- Re-casting the (HEP-specific, event-data-analysis oriented) services into more general services, from which the ARDA services would be derived
- Addressing major issues like a solid OGSI foundation, robustness, resilience, fault recovery, operation and debugging
41 Framework for evolution and end-to-end services
- The ARDA services architecture would allow end-to-end services to be implemented on top, and to interact with the ARDA base services in interesting ways
- Experiments and physicists doing production and analysis will need services that look across and help manage the system
- E.g. the accounting services allow policies to be implemented through optimizers
- E.g. the auditing services allow the implementation of Grid Operations and expert systems that work on the global syslog and intervene in case of problems
- That could even allow artificial intelligence to be deployed (e.g. in the form of agents) to help users decide what to do, and eventually to take remedial action automatically for problems that have been adequately diagnosed (Harvey's email)
- Experiment-wide prioritization can be implemented and managed
- The ARDA prototype would provide an initial implementation of a multi-session, many-user, interactive grid
- Allows the architecture of the services to evolve and scale
- Allows the OGSI services infrastructure functionality to evolve
- Allows the implementation of "views" of the Grid, user/system interaction and eventually some user-initiated steering (Harvey's email)
42 Conclusions
- ARDA is identifying a service-oriented architecture and an initial decomposition of the services required for distributed analysis
- Recognize a central role for a Grid API which provides a factory of user interfaces for experiment frameworks, applications, portals, etc.
- The ARDA prototype would provide a distributed physics analysis environment for distributed experimental data
- for experiment-framework-based analysis
- COBRA, Athena, Gaudi, AliRoot, ...
- for ROOT-based analysis
- Interfacing to other analysis packages like JAS, event displays like IGUANA, grid portals etc. can be easily implemented