Transcript and Presenter's Notes

Title: ARDA Report to the LCG SC2


1
ARDA Report to the LCG SC2
  • Philippe Charpentier
  • For the RTAG-11/ARDA group

2
Repeat of last talk, emphasizing the main points
3
ARDA roadmap to distributed analysis
  • Analysis model
  • How to implement Hepcal-II use cases
  • Ev. Proviso for Hepcal-II use cases?

4
API to Grid services
  • Importance of an API to interface to:
  • Experiment frameworks
  • Analysis shells, e.g. ROOT
  • Grid portals and other forms of user interactions
    with the environment
  • Advanced services, e.g. virtual data, analysis
    logbooks, etc.
  • Experiment-specific services, e.g. data and
    metadata management systems

5
ARDA and Grid services architecture
  • OGSI gives the framework in which to run LHC/ARDA
    services
  • Addresses the architecture for communication,
    lifetime support, etc.
  • Provides framework for advanced interactions with
    the Grid
  • This is outside of the analysis services API,
    but to be implemented in standard ways
  • Need to address issues of OGSI performance and
    scalability up-front
  • Importance of modeling, plan for scaling up,
    engineering of underlying services infrastructure

6
Roadmap to a Grid Services Architecture for the
LHC
  • Transition to grid services explicitly addressed
    in several existing projects
  • Clarens and Caltech GAE, MonALISA
  • Based on web services for communication,
    Jini-based agent architecture
  • Dirac
  • Based on intelligent agents working within
    batch environments
  • AliEn
  • Based on web services and communication to a
    database proxy
  • Initial work on OGSA within LCG-GTA
  • GT3 prototyping
  • No evolutionary path from GT2-based grids, but
    augmenting LCG-1 and other grid services
  • Grid services interface to CE, SE, VO management
    interfaces
  • OGSI-based services speak JDL, DAGman etc
  • ARDA provides decomposition into those services
    that address the LHC distributed analysis use
    cases

7
ARDA's model of interfacing the Applications to
the Grid UI/API
  • Stress importance of providing an API that others
    can program against
  • Benefits of common API to framework
  • Goes beyond traditional UIs à la GANGA, Grid
    portals, etc
  • Benefits in interfacing to physics applications
    like ROOT et al
  • Process to get a common API between experiments
    --> prototype

8
ARDA Roadmap for Prototype
  • Prototype provides the initial blueprint -- do
    not aim for a full specification of all the
    interfaces
  • 4-prong approach
  • re-factoring of AliEn web services into ARDA
  • Initial release with OGSI-light/GT3 proxy,
    consolidation of API, release
  • implementation of agreed interfaces, testing,
    release
  • GT3 modeling and testing, ev. quality assurance
  • Interfacing to POOL, analysis shells,
  • Also opportunity for early interfacing to
    complementary projects
  • Interfacing to experiments' frameworks
  • metadata handlers, experiment-specific services
  • Provide interaction points with community
  • early releases and workshops every few months
  • Early strong feedback on API and services
  • Decouple from deployment issues

9
Experiments and LCG Involved in Prototyping
  • ARDA prototype would define the initial set of
    services and their interfaces
  • Important to involve experiments and LCG at the
    right level
  • Initial modeling of GT3-based services
  • Interface to major cross-experiment packages:
    POOL, ROOT, PROOF, others
  • Program experiment frameworks against ARDA API,
    integrate with experiment environments
  • Expose services and UI/API to other LHC projects
    to allow synergies
  • Spend appropriate effort to document, package,
    release, deploy
  • After the prototype is delivered, in Spring 2004,
  • Scale up and re-engineer as needed: OGSI,
    databases, information services
  • Deployment and interfaces to site and grid
    operations, VO management etc
  • Build higher-level services and experiment
    specific functionality
  • Work on interactive analysis interfaces and new
    functionalities

10
Possible Strawman
  • Strawman workplan for ARDA prototype

11
Synergy with and Engagement of other Projects
  • e.g. GANGA --> Rob

12
Action Items
  • Develop ARDA work plan, schedule, milestones
  • Identify effort and build team(s)
  • Develop plan for interfacing to and engaging of
    LHC and Grid community
  • Release ARDA RTAG document

13
Slides from previous presentations
14
ARDA Mandate

15
ARDA Mandate
Long list of projects being looked at, analyzing
how their components and services would map to
the ARDA services, synthesized to provide a
description of the ARDA components

GAG discussed an initial internal working draft,
GAG to follow up
Both of these are in progress --- will provide a
technical annex that documents these
This is a main thrust of the ARDA roadmap
Will be part of the technical annex -- e.g.
security, auditing etc
Main deliverable of ARDA, approach to be
described in this talk
16
Makeup of ARDA RTAG
  • Requirements and Technical Assessment Group of
    the SC2
  • Give recommendations to the SC2, thus the LCG and
    the four experiments
  • Members
  • ALICE: Fons Rademakers and Predrag Buncic
  • ATLAS: Roger Jones and Rob Gardner
  • CMS: Lothar Bauerdick and Lucia Silvestris
  • LHCb: Philippe Charpentier and Andrei
    Tsaregorodtsev
  • LCG GTA: David Foster, stand-in Massimo Lamanna
  • LCG AA: Torre Wenaus
  • GAG: Federico Carminati (CMS members in GAG:
    Rick, Claudio)

17
ARDA mode of operation
  • constructive and open-minded committee
  • Series of weekly meetings July and August,
    mini-workshop in September
  • Invited talks from existing experiments'
    projects
  • Summary of Caltech GAE workshop (Torre)
  • PROOF (Fons)
  • AliEn (Predrag)
  • DIAL (David Adams)
  • GAE and Clarens (Conrad Steenberg)
  • Ganga (Pere Mato)
  • Dirac (Andrei)
  • Cross-check with other projects of the emerging
    ARDA decomposition of services
  • Magda, DIAL -- Torre, Rob
  • EDG, NorduGrid -- Andrei, Massimo
  • SAM, MCRunjob -- Roger, Lothar
  • BOSS, MCRunjob -- Lucia, Lothar
  • Clarens, GAE -- Lucia, Lothar
  • Ganga -- Rob, Torre
  • PROOF -- Fons
  • AliEn -- Predrag

18
Initial Picture of Distributed Analysis (Torre,
Caltech workshop)
19
Hepcal-II Analysis Use Cases
  • Scenarios based on GAG HEPCAL-II report
  • Determine data sets and eventually event
    components
  • Input data are selected via a query to a metadata
    catalogue
  • Perform iterative analysis activity
  • Selection and algorithm are passed to a workload
    management system, together with spec of the
    execution environment
  • Algorithms are executed on one or many nodes
  • User monitors progress of job execution
  • Results are gathered together and passed back to
    the job owner
  • Resulting datasets can be published to be
    accessible to other users
  • Specific requirements from Hepcal-II
  • Job traceability, provenance, logbooks
  • Also discussed support for finer-grained access
    control and for sharing data within physics
    groups

20
Analysis Scenario
  • This scenario represents the analysis activity
    from the user perspective. However, some other
    actions are done behind the scenes of the user
    interface
  • To carry out the analysis tasks users are
    accessing shared computing resources. To do so,
    they must be registered with their Virtual
    Organization (VO), authenticated and their
    actions must be authorized according to their
    roles within the VO
  • The user specifies the necessary execution
    environment (software packages, databases, system
    requirements, etc) and the system ensures it on
    the execution node. In particular, the necessary
    environment can be installed according to the
    needs of a particular job
  • The execution of the user job may trigger
    transfers of various datasets between a user
    interface computer, execution nodes and storage
    elements. These transfers are transparent to the
    user

21
Example Asynchronous Analysis
  • Running Grid-based analysis from inside ROOT
    (adapted from AliEn example)
  • ROOT calling the ARDA API from the command prompt
    // connect + authenticate to the GRID service "arda" as "lucia"
    TGrid *arda = TGrid::Connect("arda", "lucia", "", "");
    // create a new analysis object ( <unique ID>, <title>, #subjobs )
    TArdaAnalysis *analysis = new TArdaAnalysis("pass001", "MyAnalysis", 10);
    // set the program which executes the analysis macro/script
    analysis->Exec("ArdaRoot.sh", "file:/home/vincenzo/test.C");  // script to execute
    // set up the event metadata query
    analysis->Query("2003-09/V6.08.Rev.04/00110/gjetmet.root?pt>0.2");
    // specify job splitting and run
    analysis->OutputFileAutoMerge(true);  // merge all produced .root files
    analysis->Split();       // split the task in subjobs
    analysis->Run();         // submit all subjobs to the ARDA queue
    // asynchronously, at any time get the (partial or complete) results
    analysis->GetResults();  // download partial/final results and merge them
    analysis->Info();        // display job information

22
Asynchronous Analysis Model
  • Extract a subset of the datasets from the virtual
    file catalogue using metadata conditions provided
    by the user.
  • Split the tasks according to the location of data
    sets (a sketch of this step follows at the end of
    this slide).
  • A trade-off has to be found between best use of
    available resources and minimal data movements.
    Ideally jobs should be executed where the data
    are stored. Since one cannot expect a uniform
    storage location distribution for every subset of
    data, the analysis framework has to negotiate
    with dedicated Grid services the balancing
    between local data access and data replication.
  • Spawn sub-jobs and submit to Workload Management
    with precise job descriptions
  • User can control the results while and after data
    are processed
  • Collect and Merge available results from all
    terminated sub-jobs on request
  • Analysis objects associated with the analysis
    task remain persistent in the Grid environment
    so the user can go offline and reload an analysis
    task at a later date, check the status, merge
    current results or resubmit the same task with
    modified analysis code.
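A minimal sketch of the splitting step described above; the types and the task-queue handling are invented for illustration and are not part of the RTAG text:

    // Illustrative only: group the selected LFNs by the storage element
    // holding a replica, then create one sub-job input list per group so
    // that jobs run close to their data.
    #include <map>
    #include <string>
    #include <vector>

    struct FileReplica {
       std::string lfn;             // logical file name from the catalogue
       std::string storageElement;  // SE holding a physical replica
    };

    std::vector<std::vector<std::string> >
    SplitByLocation(const std::vector<FileReplica>& selection)
    {
       // one bucket of LFNs per storage element
       std::map<std::string, std::vector<std::string> > buckets;
       for (const auto& r : selection)
          buckets[r.storageElement].push_back(r.lfn);

       // each bucket becomes the input of one sub-job
       std::vector<std::vector<std::string> > subJobInputs;
       for (const auto& b : buckets)
          subJobInputs.push_back(b.second);
       return subJobInputs;
    }

In a real system the split would also weigh replication against local access, as the trade-off bullet above notes; this sketch only shows the grouping.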

23
Synchronous Analysis
  • Scenario using PROOF in the Grid environment
  • Parallel ROOT Facility, main developer Maarten
    Ballintijn/MIT
  • PROOF already provides a ROOT-based framework to
    use the computing resources of a (local) cluster
  • balancing dynamically the workload, with the goal
    of optimizing CPU exploitation and minimizing
    data transfers
  • makes use of the inherent parallelism in event
    data
  • works in heterogeneous clusters with distributed
    storage
  • Extend this to the Grid using interactive
    analysis services that could be based on the
    ARDA services

24
ARDA Roadmap Informed By DA Implementations
  • Following SC2 advice, reviewed major existing DA
    projects
  • Clearly AliEn today provides the most complete,
    fully functional implementation of distributed
    analysis services -- it also already interfaces
    to PROOF
  • Implements the major Hepcal-II use cases
  • Presents a clean API to experiments' applications,
    Web portals, ...
  • Should address most requirements for upcoming
    experiments' physics studies
  • Existing and fully functional interface to
    complete analysis package --- ROOT
  • Interface to PROOF cluster-based interactive
    analysis system
  • Interfaces to any other system are well defined
    and certainly feasible
  • Based on Web-services, with a global (federated)
    database as a backend to give state and
    persistency to the system
  • ARDA approach
  • Re-factoring AliEn, using the experience of the
    other projects, to generalize it into an
    architecture; consider OGSI as a natural
    foundation for that
  • Confront ARDA services with existing projects
    (notably EDG, SAM, Dirac, etc)
  • Synthesize service definitions, defining their
    contracts and behavior
  • Blueprint for initial distributed analysis
    service infrastructure
  • ARDA services blueprint gains credibility with a
    functional prototypical implementation

25
ARDA Distributed Analysis Services
  • Distributed Analysis in a Grid Services based
    architecture
  • ARDA Services should be OGSI compliant -- built
    upon OGSI middleware
  • Frameworks and applications use an ARDA API with
    bindings to C, Java, Python, PERL,
  • interface through UI/API factory --
    authentication, persistent session
  • Fabric Interface to resources through CE, SE
    services
  • job description language, based on Condor
    ClassAds and matchmaking
  • Database(s) backend provides statefulness and
    persistence (accessed through a proxy)
  • We arrived at a decomposition into the following
    key services (a hedged interface sketch follows
    this list)
  • API and User Interface
  • Authentication, Authorization, Accounting and
    Auditing services
  • Workload Management and Data Management services
  • File and (event) Metadata Catalogues
  • Information service
  • Grid and Job Monitoring services
  • Storage Element and Computing Element services
  • Package Manager and Job Provenance services
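A hedged sketch of how an experiment framework might obtain these services through the UI/API factory; every class and method name below is invented for illustration and is not part of the RTAG document:

    // Illustrative only: hypothetical C++ binding of the ARDA UI/API factory.
    // Authentication yields a persistent session, from which handles to the
    // key services are obtained.
    #include <memory>
    #include <string>

    class FileCatalogue;      // LFN <-> PFN mapping (see slide 31)
    class WorkloadManager;    // central task queue (see slide 35)
    class PackageManager;     // VO software installation (see slide 33)
    class JobMonitor;         // job progress, stdout/stderr access

    class ArdaSession {
    public:
       // authenticate against the VO and open a persistent session
       static std::unique_ptr<ArdaSession> Connect(const std::string& service,
                                                   const std::string& user);
       FileCatalogue*   Catalogue();
       WorkloadManager* Workload();
       PackageManager*  Packages();
       JobMonitor*      Monitor();
    };

The point of the sketch is the factory pattern the slide names: one authenticated entry point handing out the individual service interfaces, regardless of whether the caller is a framework, ROOT, or a portal.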

26
AliEn (re-factored)

27

28
ARDA Key Services for Distributed Analysis
Framework
29
API and User Interface

30
API and User Interface
  • ARDA services present an API, called by
    applications like the experiments' frameworks,
    interactive analysis packages, Grid portals, Grid
    shells, etc
  • allows a wide variety of different applications
    to be implemented. An example is a command-line
    interface similar to a UNIX file system. Similar
    functionality can be provided by graphical user
    interfaces.
  • Using these interfaces, it will be possible to
    access the catalogue, submit jobs and retrieve
    the output. Web portals can be provided as an
    alternative user interface, where one can check
    the status of the current and past jobs, submit
    new jobs and interact with them.
  • Web portals should also offer additional
    functionality to power users: Grid
    administrators can check the status of all
    services, monitor, start and stop them, while VO
    administrators (production users) can submit and
    manipulate bulk jobs.
  • The user interface can use Condor ClassAds as
    a Job Description Language (an illustrative entry
    is sketched at the end of this slide)
  • This will maintain compatibility with existing
    job execution services, in particular LCG-1.
  • The JDL defines the executable, its arguments and
    the software packages or data and the resources
    that are required by the job
  • The Workload Management service can modify the
    job's JDL entry by adding or elaborating
    requirements based on the detailed information it
    can get from the system, like the exact location
    of the dataset and replicas, client and service
    capabilities.
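For illustration only, a ClassAd-style JDL entry of the kind described on this slide might look as follows; the attribute names and the package tag are assumptions, not a normative ARDA schema (the file names are reused from the ROOT example on slide 21):

    [
      Executable   = "ArdaRoot.sh";
      Arguments    = "file:/home/vincenzo/test.C";
      Packages     = { "ROOT::v3.05" };                  // hypothetical package tag
      InputData    = { "LF:2003-09/V6.08.Rev.04/00110/gjetmet.root" };
      Requirements = member("ROOT::v3.05", other.Packages) && other.Type == "CE";
    ]

The Workload Management service would then elaborate such an entry, e.g. adding replica locations to the requirements before matchmaking.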

31
File Catalogue and Data Management
  • Input and output associated with any job can be
    registered in the File Catalogue, a virtual file
    system in which a logical name is assigned to a
    file.
  • Unlike real file systems, the File Catalogue does
    not own the files; it only keeps an association
    between the Logical File Name (LFN) and (possibly
    more than one) Physical File Names (PFNs) on a
    real file or mass storage system. PFNs describe
    the physical location of the files and include
    the name of the Storage Element and the path to
    the local file (a hedged interface sketch appears
    at the end of this slide)
  • This could be extended to the more general case
    of object collections, which are denoted by a
    metadata system (Dirk Düllmann)
  • The system supports file replication and caching
    and will use file location information when it
    comes to scheduling jobs for execution.
  • The directories and files in the File Catalogue
    have privileges for owner, group and the world.
    This means that every user can have exclusive
    read and write privileges for his portion of the
    logical file namespace (home directory).
  • Etc pp
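A hedged sketch of the LFN-to-PFN mapping described above; the names are illustrative assumptions, not the actual ARDA interface:

    // Illustrative only: the catalogue does not own files, it maps a logical
    // name to one or more physical locations (Storage Element + local path).
    #include <string>
    #include <vector>

    struct PhysicalFileName {
       std::string storageElement;   // name of the Storage Element
       std::string path;             // path to the local file within that SE
    };

    class FileCatalogue {
    public:
       // register one more replica for a logical file name
       virtual void AddReplica(const std::string& lfn,
                               const PhysicalFileName& pfn) = 0;
       // all known physical copies of an LFN (possibly more than one)
       virtual std::vector<PhysicalFileName>
       GetReplicas(const std::string& lfn) = 0;
       virtual ~FileCatalogue() {}
    };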

32
Job Provenance service
  • The File Catalogue is not meant to support only
    data sets; it is extended to include
    information about running processes in the system
    (in analogy with the /proc directory on Linux
    systems) and to support virtual data services
  • Each job sent for execution gets a unique id and
    a corresponding /proc/id directory where it can
    register temporary files, standard input and
    output as well as all job products (an
    illustrative layout is sketched at the end of
    this slide). In a typical production scenario,
    only after a separate process has verified the
    output will the job products be renamed and
    registered in their final destination in the File
    Catalogue.
  • The entries (LFNs) in the File Catalogue have an
    immutable unique file id attribute that is
    required to support long references (for instance
    in ROOT) and symbolic links.
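For illustration only, a job's /proc entry in the File Catalogue might contain entries such as the following (the job id and file names are invented):

    /proc/0001234/stdout              (standard output registered by the job)
    /proc/0001234/stderr
    /proc/0001234/job.jdl             (the JDL the job was run with)
    /proc/0001234/output/histos.root  (job product awaiting verification)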

33
Package Manager Service
  • Allows dynamic installation of application
    software released by the VO (e.g. the experiment
    or a physics group).
  • Each VO can provide the Packages and Commands
    that can be subsequently executed
  • Once the corresponding files with bundled
    executables and libraries are published in the
    File Catalogue and registered, the Package
    Manager will install them automatically as soon
    as a job becomes eligible to run on a site whose
    policy accepts these jobs.
  • While installing the package in a shared package
    repository, the Package Manager will resolve the
    dependencies on other packages and, taking into
    account package versions, install them as well
    (see the sketch at the end of this slide).
  • This means that old versions of packages can be
    safely removed from the shared repository and, if
    these are needed again at some point later, they
    will be re-installed automatically by the system.
  • This provides a convenient and automated way to
    distribute the experiment specific software
    across the Grid and assures accountability in the
    long term.
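A hedged sketch of the installation step described above, with invented names and no claim to match the actual Package Manager interface:

    // Illustrative only: install a published package and, recursively, the
    // packages it depends on, into a shared repository on the site.
    #include <string>
    #include <vector>

    struct Package { std::string name; std::string version; };

    class PackageManager {
    public:
       // dependencies as published alongside the package in the File Catalogue
       virtual std::vector<Package> Dependencies(const Package& p) = 0;
       virtual void InstallSingle(const Package& p) = 0;  // unpack into shared area
       virtual bool IsInstalled(const Package& p) = 0;

       void Install(const Package& p) {
          if (IsInstalled(p)) return;       // may have been removed and re-requested
          for (const Package& dep : Dependencies(p))
             Install(dep);                  // resolve dependencies first
          InstallSingle(p);
       }
       virtual ~PackageManager() {}
    };

The recursion captures the dependency resolution the slide describes; a production implementation would also handle version conflicts and dependency cycles.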

34
Computing Element
  • Computing Element is a service representing a
    computing resource. Its interface should allow
    submission of a job to be executed on the
    underlying computing facility, access to the job
    status information as well as high level job
    manipulation commands. The interface should also
    provide access to the dynamic status of the
    computing resource, such as its available
    capacity, load, and number of waiting and running
    jobs (a hedged interface sketch follows below).
  • This service should be available on a per VO
    basis.
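A hedged sketch of such a Computing Element interface; all names below are assumptions for illustration:

    // Illustrative only: per-VO service fronting a computing resource.
    #include <string>

    struct CEStatus {
       int    freeSlots;     // available capacity
       int    runningJobs;
       int    waitingJobs;
       double load;
    };

    class ComputingElement {
    public:
       // submit a job described by a JDL/ClassAd string; returns a job id
       virtual std::string Submit(const std::string& jdl) = 0;
       virtual std::string JobStatus(const std::string& jobId) = 0;
       virtual void        Cancel(const std::string& jobId) = 0;
       // dynamic state of the underlying computing facility
       virtual CEStatus    Status() = 0;
       virtual ~ComputingElement() {}
    };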

35
Workload Management
  • pull approach, with jobs submitted to the
    central task queue
  • central service component manages all the tasks
    (described by JDL)
  • computing elements are defined as remote
    queues, to provide access to a cluster of
    computers, to a single machine dedicated to run a
    specific task, or even an entire foreign Grid
  • Workload manager optimizes the task queue taking
    into account JDL
  • JDL describes job requirements like input files,
    CPU time, architecture, disk space etc.
  • Makes job eligible to run on one or more
    computing elements
  • Active nodes fetch jobs from the central task
    queue and start them (see the sketch at the end
    of this slide)
  • Job Monitoring service to access job progression
    and stdout, stderr
  • Optimizers inspect JDL and try to fulfill
    requests and resolve conflicts
  • results in triggering file replication, etc
  • Other optimizers can be constructed for special
    purposes
  • E.g. implement policy monitors to enforce VO
    policies by altering job priorities
  • E.g. estimate time to complete work, taking into
    account other work going on, resolve specific
    conflicts or optimize in a heuristic or neural
    net way etc.
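A minimal sketch of the pull model described above, with invented types; the real matchmaking happens via ClassAds in the central task queue:

    // Illustrative only: an agent on an active node repeatedly asks the
    // central task queue for a job whose JDL requirements match the node's
    // capabilities, then runs it and reports back.
    #include <string>

    class TaskQueue {
    public:
       // returns an empty string if nothing currently matches this node
       virtual std::string FetchMatchingJob(const std::string& nodeClassAd) = 0;
       virtual void        ReportDone(const std::string& jobId) = 0;
       virtual ~TaskQueue() {}
    };

    void AgentLoop(TaskQueue& queue, const std::string& nodeClassAd)
    {
       for (;;) {
          std::string jobId = queue.FetchMatchingJob(nodeClassAd);
          if (jobId.empty()) break;   // nothing eligible right now
          // ... stage input, run the job, register outputs in /proc/<id> ...
          queue.ReportDone(jobId);
       }
    }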

36
Auditing Services
  • The auditing service provides a central syslog
  • can be queried by monitors or agents with
    specific intelligence
  • Should allow implementation of "Event Handling"
    or fault recovery
  • Together with monitoring services, allows
    implementation of specific tools and approaches
    for operation and debugging

37
Etc. pp

38
General ARDA Roadmap
  • Emerging picture of waypoints on the ARDA
    roadmap
  • ARDA RTAG report
  • review of existing projects, component
    decomposition re-factoring, capturing of common
    architectures, synthesis of existing approaches
  • recommendations for a prototypical architecture
    and definition of prototypical functionality and
    a development strategy
  • development of a prototype and first release
  • Re-factoring AliEn web services, studying the
    ARDA architecture in an OGSI context, based on
    the existing implementation
  • POOL and other LCG components (VO, CE, SE, ...)
    interface to ARDA
  • Adaptation of specific ARDA services to
    experiments' requirements
  • E.g. File catalogs, package manager, metadata
    handling for different data models
  • Integration with and deployment on LCG-1
    resources and services
  • This will give CMS an (initially ROOT-based)
    distributed analysis environment
  • including PROOF-based interactive analysis
  • Re-engineering of prototypical ARDA services, as
    required
  • Evolving services: scaling up and adding
    functionality, robustness, resilience, etc.

39
Talking Points
  • Horizontally structured system of services with a
    well-defined API and a database backend
  • Can easily be extended with additional services,
    new implementations can be moved in, alternative
    approaches tested and commissioned
  • Interface to LCG-1 infrastructure
  • VDT/EDG interface through CE, SE and the use of
    JDL, compatible with existing infrastructure
  • ARDA VO services can build on emerging LCG VO
    management infrastructure
  • ARDA initially looked at file-based datasets, not
    object collections
  • talk with POOL about how to extend the file
    concept to a more generic collection concept
  • investigate experiments' metadata/file catalog
    interaction
  • VO system and site security
  • Jobs are executed on behalf of the VO; however,
    users are fully traceable
  • How do policies get implemented, e.g. analysis
    priorities, MoU contributions etc
  • Auditing and accounting system, priorities
    through special optimizers
  • accounting of site contributions, which depends
    on what resources sites expose
  • Database backend for the prototype
  • Address latency, stability and scalability issues
    up-front -- good experience exists
  • In a sense, the system is the database (possibly
    federated and distributed) that contains all
    there is to know about all jobs, files, metadata,
    algorithms of all users within a VO
  • a set of OGSI grid services provides
    windows/views into the database, while the
    API provides the user access
  • allows structuring into federated grids and
    dynamic workspaces

40
Major Role for Middleware Engineering
  • ARDA roadmap based on a well-factored prototype
    implementation that allows evolutionary
    development into a complete system that evolves
    to the full LHC scale
  • David Foster: let's recognize that very little
    work has so far been done on the underlying
    mechanisms needed to provide the appropriate
    foundations (message passing structures, fault
    recovery procedures, component instrumentation
    etc)
  • ARDA prototype would be pretty lightweight
  • Stability through using a global database as a
    backend, to which services talk through a
    database proxy
  • people know how to do large databases -- a
    well-founded principle (see e.g. SAM for Run II),
    with many possible migration paths
  • HEP-specific services, however, based on generic
    OGSI-compliant services
  • Expect LCG/EGEE middleware effort to play major
    role to evolve this foundation, concepts and
    implementation
  • re-casting the (HEP-specific event-data analysis
    oriented) services into more general services,
    from which the ARDA services would be derived
  • addressing major issues like a solid OGSI
    foundation, robustness, resilience, fault
    recovery, operation and debugging

41
Framework for evolution and end-to-end services
  • The ARDA services architecture would allow
    implementing end-to-end services on top, and
    interacting with the ARDA base services in
    interesting ways
  • experiments and physicists doing production and
    analysis will need services that look across and
    help manage the system
  • E.g. the accounting service allows implementing
    policies through optimizers
  • E.g. the auditing service allows implementing
    Grid Operations and expert systems that work on
    the global syslog and intervene in case of
    problems
  • That could even allow implementing artificial
    intelligence (e.g. in the form of agents) to help
    users decide what to do, and eventually to take
    remedial action automatically for problems that
    have been adequately diagnosed (Harvey's email)
  • Experiment-wide prioritization can be implemented
    and managed
  • The ARDA prototype would provide an initial
    implementation of a multi-session, many-user
    interactive grid
  • Allows evolving the architecture of the
    services, to scale
  • Allows evolving the OGSI services infrastructure
    functionality
  • Allows implementing views of the Grid,
    user/system interaction and eventually some
    user-initiated steering (Harvey's email)

42
Conclusions
  • ARDA is identifying a services-oriented
    architecture and an initial decomposition of the
    services required for distributed analysis
  • Recognize a central role for a Grid API which
    provides a factory of user interfaces for
    experiment frameworks, applications, portals, etc
  • The ARDA Prototype would provide a distributed
    physics analysis environment for distributed
    experimental data
  • for experiment framework based analysis
  • Cobra, Athena, Gaudi, AliRoot,
  • for ROOT based analysis
  • interfacing to other analysis packages like JAS,
    event displays like Iguana, grid portals, etc.
    can easily be implemented