Title: The SAMGrid Fabric Services


1
The SAM-Grid Fabric Services
  • Gabriele Garzoglio (for the SAM-Grid team)
  • Computing Division
  • Fermilab

2
Overview
  • Introduction
  • The grid-level services: an overview
    • Job Management
  • The fabric-level services
    • Local batch system adaptation
    • Dynamic product retrieval
    • Local sandbox management
    • Job complex-status logging

3
Introduction
  • SAM is a Data Handling System for HEP; the
    project was started in 1997 by DZero
  • The SAM-Grid project started in 2001-2002 to handle
    DZero's expanded needs for globally distributed
    computing
  • CDF joined SAM-Grid at the end of 2002
  • JIM complements the data handling system (SAM)
    with Job and Info Management:
    SAM-Grid = JIM + SAM
  • JIM is funded by PPDG and GridPP
  • Participated at SC02 and SC03

5
Job Management
[Diagram. Components shown: User Interface, Submission Client,
Match Making Service, Broker, Queuing System, Information Collector,
Data Handling System, JOB, and Execution Sites 1 to n, each with
Computing Element, Storage Elements and Grid Sensors]

7
Running jobs on Grid resources: the trend
  • Grid resources are not dedicated to a single
    experiment
  • Translation:
    • no daemons running on the worker nodes of a Batch
      System
    • no experiment-specific software installed

8
Running jobs on Grid resources: today
  • The situation is transitioning
  • Generally, experiments can install specific
    services on a node close to the cluster.
  • Worker nodes typically access the software via
    a shared FS: not scalable!
  • Local resource configuration still too diverse to
    easily plug into the Grid
  • Today, most of our efforts are directed to coping
    with (the lack of) standard local fabric services

10
Motivation
  • Problem: standard grid batch system adapters
    (globus job-managers) are too restrictive to fit
    all the local configurations
  • Examples:
    • the terms of the agreement for using the batch
      system can be expressed with special directives
      to the batch system
    • system administrators end up writing wrappers
      around the standard batch system commands

11
SAM Batch System Adapter
  • We factor out the local batch system
    configuration using an intermediate layer that
    abstracts the basic interactions with the batch
    system:
    • submit command
    • lookup command
    • remove command
  • For each of the commands above, the administrator
    can specify how to parse the output to fish out
    the relevant information, e.g. the local job id when
    submitting
  • We have written JIM globus job managers that use
    this layer
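
The intermediate layer described above can be pictured as a small, site-configurable table of commands and output parsers. The following is only a sketch under stated assumptions: the command names (`site_submit`, `site_lookup`, `site_remove`), their output formats and the regular expressions are hypothetical and are not taken from SAM-Grid or from any real batch system.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class BatchCommand:
    """One batch system interaction: a command template plus an output parser."""
    template: str  # command line with {placeholders}, chosen by the administrator
    pattern: str   # regular expression whose first group is the relevant information

    def parse(self, output: str) -> str:
        """Fish the relevant information out of the command output."""
        match = re.search(self.pattern, output, re.MULTILINE)
        if match is None:
            raise ValueError(f"cannot parse batch system output: {output!r}")
        return match.group(1)

    def run(self, **params: str) -> str:
        """Fill in the template, run the command, and parse its stdout."""
        argv = [arg.format(**params) for arg in shlex.split(self.template)]
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return self.parse(done.stdout)


# Hypothetical site configuration: commands and output formats are invented
# for illustration and would differ at every site.
SITE_CONFIG = {
    "submit": BatchCommand("site_submit {script}", r"job (\d+)"),
    "lookup": BatchCommand("site_lookup {job_id}", r"^\d+\s+(\w+)"),
    "remove": BatchCommand("site_remove {job_id}", r"(removed)"),
}

# Parsing sample outputs without touching a real batch system.
local_job_id = SITE_CONFIG["submit"].parse("Your job 4711 has been submitted\n")
job_state = SITE_CONFIG["lookup"].parse("4711  running\n")
```

With a table like this, a job manager only ever calls `SITE_CONFIG["submit"].run(script=...)` and never needs to know which batch system sits behind the command.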

13
Motivation
  • Portability of the software for DZero and CDF is
    still not a completely solved problem.
  • Most of the CDF and DZero applications still rely
    on the offline software to be preinstalled at the
    site.
  • Administrators need to install and maintain the
    software at each site
  • A job submitted to the grid must be able to
    execute at a site where its dependencies are
    installed

14
Old solution: software advertisement
  • Administrators install the software at each site
  • The JIM advertisement framework senses the new
    product and advertises it to the broker as one of
    the characteristics of the site
  • Drawbacks:
    • the administrators still need to install the
      software
    • increased complexity of the advertisement
      framework: it needs to know how to detect the
      list of installed products
    • increased complexity of the broker: it needs to
      enforce the matching to the eligible sites
    • jobs running on old software versions may not
      find an eligible site
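
The matching that the broker has to enforce in this scheme can be sketched as a set comparison between what a job requires and what each site advertises. This is an illustrative sketch only: the site names, product names and versions are hypothetical, and the real broker is not implemented this way as far as these slides say.

```python
from typing import Dict, Iterable, List, Set, Tuple

Product = Tuple[str, str]  # (product name, version)


def eligible_sites(required: Iterable[Product],
                   advertised: Dict[str, Set[Product]]) -> List[str]:
    """Return the sites whose advertised products cover every requirement."""
    needed = set(required)
    return sorted(site for site, products in advertised.items()
                  if needed <= products)


# Hypothetical advertisements collected from two sites.
ADS = {
    "site_a": {("prod_a", "v2"), ("prod_b", "v1")},
    "site_b": {("prod_a", "v2")},
}

current_job = eligible_sites([("prod_a", "v2")], ADS)
old_version_job = eligible_sites([("prod_a", "v1")], ADS)
```

The second call illustrates the last drawback: once no site advertises the old version any more, the job has nowhere to run.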

15
New solution: dynamic software retrieval
  • Product developers store the software into SAM
    with appropriate metadata
  • Before running a job at a site, the
    infrastructure asks SAM for the delivery of the
    dependent products
  • The products live in the SAM cache and are
    automatically managed
  • Drawbacks:
    • increased complexity of local job submission
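
The delivery step can be sketched as "fetch whatever is not already in the local cache before the job starts". The cache here is a plain dictionary and `fetch` is a caller-supplied function; both are assumptions made for illustration and do not represent the SAM interface.

```python
from typing import Callable, Dict, Iterable, List


def deliver_products(dependencies: Iterable[str],
                     cache: Dict[str, str],
                     fetch: Callable[[str], str]) -> List[str]:
    """Make sure every dependency is in the cache; return what was fetched."""
    fetched = []
    for product in dependencies:
        if product not in cache:
            # Hypothetical delivery call: returns the local path of the product.
            cache[product] = fetch(product)
            fetched.append(product)
    return fetched


# Hypothetical cache content and product names.
cache = {"prod_a": "/cache/prod_a"}
newly_fetched = deliver_products(["prod_a", "prod_b"], cache,
                                 fetch=lambda name: f"/cache/{name}")
```

Only `prod_b` is transferred in this example, since `prod_a` is already cached at the site.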

17
Nomenclature
  • Input sandbox
    • from the client (user sandbox)
      • the executable
      • configuration files
      • special dependencies (libraries, products, ...)
    • from the local site
      • the product dependencies
  • Output sandbox
    • stdout, stderr
    • log files
    • small custom output (e.g. histograms)

18
Requirements
  • We want an infrastructure that:
    • locally stores the user sandbox (from the Grid)
      at the site
    • transports and installs the input sandbox to the
      worker node
    • packages the output and hands it over to the Grid

19
Limitations to overcome
  • the file transport mechanism of a batch system is
    site specific and needs to be factored out
  • shared file systems have scalability limits: we
    want to rely on them as little as possible
  • the worker nodes may have connectivity
    restrictions (firewalls)

20
The sandbox management 1
  • It creates a sandbox area (reorganizing the
    native globus gass cache)
  • It starts up a gridftp server for the
    communications between worker nodes and head node
    (no shared FS)
  • It requests the delivery of the product
    dependencies
  • It creates a self-extracting archive that
    contains the gridftp client and a bootstrapping
    script; when run, the script transfers and installs
    the product dependencies, then passes control to
    the application

21
The sandbox management 2
  • It submits to the batch system parallel instances
    of the self extracting archive
  • The job relies on SAM for large input/output
    files transfers
  • When the job finishes, stdout, stderr and the custom
    output are packaged at the head node to be
    transported back to the submission site via grid
    mechanisms
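
The packaging idea can be sketched with a plain compressed tar archive that carries a bootstrap script next to the user sandbox and is unpacked into node-local scratch space. This is a simplification made under loudly stated assumptions: it swaps the self-extracting archive and the gridftp transfer for Python's `tarfile` and a local copy, and the file names and bootstrap script are hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical bootstrap script; a real one would first transfer and install
# the product dependencies, then pass control to the application.
BOOTSTRAP = b'#!/bin/sh\nexec "$@"\n'


def build_sandbox(user_files: Dict[str, bytes], archive: Path) -> Path:
    """Pack the bootstrap script and the user sandbox into one archive."""
    members = {"bootstrap.sh": BOOTSTRAP, **user_files}
    with tarfile.open(archive, "w:gz") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mode = 0o755
            tar.addfile(info, io.BytesIO(data))
    return archive


def unpack_sandbox(archive: Path, scratch: Path) -> List[str]:
    """Unpack the archive into node-local scratch space; return file names."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(scratch)
    return sorted(p.name for p in scratch.iterdir())


# Round trip in a temporary directory standing in for head node and worker node.
with tempfile.TemporaryDirectory() as tmp:
    archive = build_sandbox({"run_app": b"echo hi\n", "job.cfg": b"n=1\n"},
                            Path(tmp) / "sandbox.tar.gz")
    scratch = Path(tmp) / "scratch"
    scratch.mkdir()
    unpacked = unpack_sandbox(archive, scratch)
```

The same two functions could be reused in the opposite direction to package the output sandbox at the end of the job.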

22
Open problems
  • Not all batch systems allow the selection of a
    node with sufficient scratch space to install the
    needed software
  • We would greatly simplify this infrastructure if
    there were a standard local storage service at
    all the sites (e.g. DiskFarm)

24
Motivation
  • Distributed logging of job status/history
  • Web monitoring
  • Statistics on historical data
  • Grid scheduling based upon job status/history at
    a certain site

25
The XML DB Status Logger
  • The status of the job is reported to an XML
    database deployed at each execution site
  • The information comes from the local batch system
    (simple job status, e.g. idle, running, ...) AND
    from the application (complex status, e.g.
    "Processing executable X in the chain")
  • The XML database gives flexible remote access via
    standard mechanisms, such as XPath
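
To make the XPath access concrete, the sketch below queries a small status document with Python's `xml.etree.ElementTree`, which supports a limited subset of XPath. The XML layout (element and attribute names, site name, job ids) is an assumption made for illustration; the slides do not show the actual schema of the status database.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical status document for one execution site.
STATUS_XML = """
<jobs site="example_site">
  <job id="1001">
    <batch_status>running</batch_status>
    <app_status>Processing executable 2 in the chain</app_status>
  </job>
  <job id="1002">
    <batch_status>idle</batch_status>
  </job>
</jobs>
""".strip()


def jobs_with_status(root: ET.Element, status: str) -> List[str]:
    """Return the ids of all jobs whose simple (batch) status matches."""
    query = f".//job[batch_status='{status}']"
    return [job.get("id") for job in root.findall(query)]


root = ET.fromstring(STATUS_XML)
running = jobs_with_status(root, "running")
complex_status = root.find(".//job[@id='1001']/app_status").text
```

The first query uses the simple status from the batch system and the second reads the complex status reported by the application, the two sources of information named above.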

26
Conclusions
  • The SAM-Grid offers an extensible working
    framework for Grid-level Job/Data/Info Management
  • The SAM-Grid adopts Fabric-level configurable
    solutions for batch system adaptation, product
    delivery, sandboxing and job complex-status
    logging
  • The community needs to come up with standard
    fabric-level services to make any Grid usable

27
More info at
  • http://www-d0.fnal.gov/computing/grid/
  • http://samgrid.fnal.gov:8080/
  • Morag Burgon-Lyon's talk on SAM-Grid for CDF!