TDDS Session

Learn more at: http://chep03.ucsd.edu
1
Vandy Berten, Luc Goossens (CERN/EP/ATC), Alvin
Tan (University of Birmingham)
2
  • Pre-history of AtCom project
  • test productions in Jan 2002 at CERN (LSF)
    enjoyed a nearly 100% success rate
  • no need for tools to clean up/resubmit the odd
    failed job
  • real production in summer 2002 suffered an
    average failure rate of 20%
  • many factors were against us
  • at 300 CPU slots capacity, cleanup became
    overwhelming
  • Sep 2002 Vandy Berten (technical student)
  • needed a well-defined project
  • AtCom (Atlas Commander) project was born

3
  • What is the Atlas Commander?
  • graphical interactive tool to support the
    production manager
  • define jobs in large quantities
  • submit and monitor progress
  • scan log files for (un)known errors
  • update bookkeeping Databases (AMI, Magda)
  • clean up in case of failures

4
  • History of AtCom project
  • total resource count is about 5 man-months
  • ideal situation: two persons in the same
    office, one CS student as developer/designer
    plus one software engineer as
    client/designer/developer
  • successful multi-cluster live demo at the Atlas
    SW workshop in Nov
  • has been in continuous production use at CERN
    since Oct 2002
  • v1.0 released end of Jan -> production quality
  • v1.2 released early March
  • removed last important limitation

5
  • AtCom has its own web site
  • http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/
  • contains user guide, developers guide,
    documentation, downloads, relevant contact
    e-mails, etc.

6
  • Architecture: application plug-ins

(diagram: plug-ins around the AtCom core)
7
  • Architecture (continued)
  • plug-in implements abstract cluster interface
    for specific clusters
  • e.g. LSF
  • a plug-in is a Java class plus configuration
    parameters
  • e.g. LSF@TIMBUKTU
  • the AtCom configuration file defines all existing
    plug-ins and allows each to have its own
    configuration section
  • they are loaded at run-time
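The plug-in pattern described above can be sketched as follows. This is a minimal illustration, not AtCom's actual API: the interface, class and method names (`ClusterInterface`, `LsfCluster`, `PluginLoader`) are assumptions, and only the general shape (abstract interface, configured concrete class, loading by name at run-time) comes from the slides.

```java
// Sketch of the plug-in architecture: an abstract cluster interface,
// a concrete LSF implementation, and run-time loading by class name
// (as the AtCom configuration file would drive it). All names here are
// illustrative assumptions.
interface ClusterInterface {
    String submit(String wrapperPath);   // returns a cluster job ID
    String status(String jobId);         // e.g. PENDING / RUNNING / DONE
}

class LsfCluster implements ClusterInterface {
    private final String queue;
    LsfCluster(String queue) { this.queue = queue; }
    public String submit(String wrapperPath) {
        // a real plug-in would shell out to `bsub`; here we fake an ID
        return "lsf-job-1";
    }
    public String status(String jobId) { return "PENDING"; }
}

class PluginLoader {
    // load a plug-in class by name, as the config file enumerates them
    static ClusterInterface load(String className, String queue) {
        try {
            Class<?> c = Class.forName(className);
            return (ClusterInterface) c.getDeclaredConstructor(String.class)
                                       .newInstance(queue);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load plug-in " + className, e);
        }
    }
}
```

Loading by name is what lets the configuration file introduce new cluster back-ends (LSF, PBS, EDG, ...) without recompiling the core.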

8
  • Available plug-ins
  • LSF
  • well understood and supported
  • NG
  • development suspended after the last SW workshop
  • PBS
  • developed by Alvin Tan
  • EDG
  • working, but no EDG based clusters used in
    production
  • BQS
  • developed by Jerome Fulachier

9
  • Bookkeeping databases
  • 5 logical database domains, two physical
    databases

10
  • Concepts: datasets

(diagram: abstract transformation processes produce abstract
datasets; evgen -> evgen.2000, simul -> simul.2000 and
simul.2099, pileup -> lumi02.2000)
11
  • Concepts: partitions

(diagram: concrete transformation processes (jobs) produce
concrete partitions (files); job evgen546 -> partition
evgen.2000.0001, simul876 -> simul.2000.0035, pileup760 ->
lumi02.2000.0035, with simul.2099.0812 as an additional
pile-up input)
12
  • Two main functions of AtCom
  • definition of jobs
  • job submission/monitoring

13
  • Definition of jobs
  • select a dataset
  • with SQL query composer
  • dataset determines transformation
  • select a version of its transformation
  • version determines uses, signature, outputs, ...
  • for each partition you want to create/define
  • some constant attributes
  • the logical values for the parameters of the
    transformation
  • the LocalFileName -> LogicalFileName mapping of
    all outputs
  • the destination of stdout/stderr
  • AtCom allows you to define a counter range and
    then use the counter value in expressions for the
    parameter values
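The counter-range mechanism above can be sketched as a simple expansion. The `${N}` placeholder syntax and zero-padding width are assumptions chosen to match partition names like evgen.2000.0001; AtCom's actual expression language may differ.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionDefiner {
    // Expand an expression containing a ${N} counter placeholder over a
    // range, zero-padded to 4 digits as in partition names such as
    // evgen.2000.0001. The ${N} syntax is an illustrative assumption.
    static List<String> expand(String expr, int from, int to) {
        List<String> out = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            out.add(expr.replace("${N}", String.format("%04d", i)));
        }
        return out;
    }
}
```

One such expression per parameter lets the production manager define hundreds of partitions from a single template.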

14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
  • submission
  • select a number of defined partitions
  • using SQL query composer
  • select a target cluster
  • jobs are submitted
  • for most clusters this means a number of
    auxiliary files are created (wrappers, jdl/xrsl
    files, ...)

21
(No Transcript)
22
(No Transcript)
23
  • What happens on submission? (LSF@CERN)
  • if the partition is unreserved in the DB,
    reserve it and create a part_run_info record
  • the transformation definition path is resolved
  • just prepend an AFS path
  • a wrapper is created
  • insert commands to set up correct environment
  • logical to physical value resolution
  • insert line calling core script
  • insert commands to copy outputs to final
    destination
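The wrapper-assembly steps above could look roughly like this. A sketch only: the method name, the `cp` copy commands and all paths are illustrative assumptions; the slide fixes only the order (environment setup, core script call, output copy).

```java
class WrapperBuilder {
    // Assemble the wrapper script text in the order the slide describes:
    // environment setup, the line calling the core script, then commands
    // copying outputs to their final destination. Names and paths are
    // illustrative.
    static String build(String envSetup, String coreScriptCall,
                        String[] outputs, String destDir) {
        StringBuilder sb = new StringBuilder("#!/bin/sh\n");
        sb.append(envSetup).append("\n");        // set up correct environment
        sb.append(coreScriptCall).append("\n");  // call the core script
        for (String out : outputs) {             // copy outputs to destination
            sb.append("cp ").append(out).append(" ").append(destDir).append("\n");
        }
        return sb.toString();
    }
}
```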

24
  • What happens when you submit? (continued)
  • the job is submitted to LSF using the right
    queue (conf file), using -o and -e to specify
    temp locations for stdout and stderr in the
    same dir
  • the wrapper code is saved with a unique name in a
    dir with a unique name (in the dir specified in
    the AtCom.conf file)
  • the jobID returned by LSF is recorded in the
    part_run_info record together with the temp
    locations for stdout and stderr.
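The submission step above can be sketched as building a `bsub` argument list. `-q`, `-o` and `-e` are real bsub options (queue, stdout file, stderr file); the class name and the path/suffix conventions are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

class LsfSubmitCommand {
    // Build the `bsub` command line the slide describes: the queue from
    // the configuration file, and -o/-e pointing stdout/stderr at temp
    // locations in the same directory as the wrapper. The ".stdout" /
    // ".stderr" naming is an assumption.
    static List<String> build(String queue, String wrapperDir, String wrapperName) {
        return Arrays.asList(
            "bsub",
            "-q", queue,
            "-o", wrapperDir + "/" + wrapperName + ".stdout",
            "-e", wrapperDir + "/" + wrapperName + ".stderr",
            wrapperDir + "/" + wrapperName);
    }
}
```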

25
  • Monitoring
  • jobs you submit are automatically added to list
    of monitored jobs
  • running jobs can be recovered from the
    part_run_info table if needed
  • e.g. after having closed AtCom
  • any other partition can be added to the list as
    well
  • using SQL query composer
  • also allows you to see finished and defined
    jobs
  • for the bar charts, of course

26
(No Transcript)
27
  • When a job moves from RUNNING to DONE,
    post-processing commences
  • resolve validation script logical name into
    physical name and apply it to stdout/stderr in
    temp locations
  • returns 1 (OK), 2 (Undecided) or 3 (Failed)
  • if OK
  • register output files with Magda replica catalog
  • resolve the extract script and apply it to
    stdout
  • it writes a set of attribute-value pairs to
    stdout
  • AtCom will attempt an UPDATE query with this on
    the partition table
  • copy/move logfiles to final destination
  • set status of partition to Validated

28
  • if Failed
  • delete output files
  • if Undecided
  • mark job as such
  • production manager can look at output of
    validation script or at the logfiles themselves
    and then force a decision as OK or Failed
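The post-processing decision logic of the last two slides can be sketched as a mapping from the validation script's return code (1 = OK, 2 = Undecided, 3 = Failed) to an action. The enum and method names are illustrative assumptions; the code paths in the comments are the ones the slides describe.

```java
class PostProcessor {
    enum Decision { VALIDATED, UNDECIDED, FAILED }

    // Map the validation script's return code onto the action taken.
    static Decision decide(int validationCode) {
        switch (validationCode) {
            case 1:  return Decision.VALIDATED; // register outputs with Magda,
                                                // run extract script, move logs
            case 2:  return Decision.UNDECIDED; // leave for the production
                                                // manager to force a decision
            case 3:  return Decision.FAILED;    // delete output files
            default: throw new IllegalArgumentException(
                         "unknown validation code " + validationCode);
        }
    }
}
```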

29
(No Transcript)
30
(No Transcript)
31
  • Questions ?

32
  • Thank you!