Virtual Data Management for CMS Simulation Production
PowerPoint Presentation Transcript and Presenter's Notes
1
Virtual Data Management for CMS Simulation
Production
  • A GriPhyN Prototype

2
Goals
  • Explore
  • virtual data dependency tracking
  • data derivability
  • integrate virtual data catalog functionality
  • use of DAGs in virtual data production
  • Identify
  • architectural issues: planners, catalogs,
    interfaces
  • hard issues in executing real production physics
    applications
  • Create prototypes
  • tools that can go into the VDT
  • Test virtual data concepts on something real

3
Which Part of GriPhyN
(Architecture diagram, annotated "initial solution is operational".
Components shown: Application, Catalog Services, Planner, Executor,
Monitoring, Info Services, Replica Management, Policy/Security,
Reliable Transfer Service, Compute Resource, Storage Resource; an
abstract DAG (aDAG) flows from the application to the planner, and a
concrete DAG (cDAG) from the planner to the executor.)
4
What Was Done
  • Created
  • A virtual data catalog
  • A catalog scheme for a RDBMS
  • A virtual data language: VDL
  • A VDL command interpreter
  • Simple DAGs for the CMS pipeline
  • Complex DAGs for a canonical test application
  • Kanonical executable for GriPhyN: keg
  • These DAGs actually execute on a Condor-Globus
    Grid

5
The CMS Challenge
  • Remember Rick's slides and the complexity!
  • Types of executables (4)
  • Parameters, inputs, and outputs
  • Templates of parameter lists
  • Sensitivities of binaries
  • Dynamic libraries
  • Environment variables
  • Condor-related environment issues are less obvious

6
Assumptions
  • Grid activity takes place as sub-jobs
  • Some subset of sub-jobs create tracked, durable
    data products; these are tracked in the virtual
    data catalog
  • Job execution mechanisms execute VDL functions to
    describe their virtual data manipulations:
    dependencies and derivations
  • The products of sub-jobs are physical instances of
    logical files (i.e., physical files)
  • Planners decide where physical files should be
    created
  • Physical copies of logical files are tracked in
    the replica catalog

7
The VDL
  • begin v /bin/cat arg -n file i filename1
    file i filename2 stdout filename3 env
    key=value end

(Figure: filename1 and filename2 feed "setenv; /bin/cat -n", which
writes filename3.)
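Stripped of VDL syntax details, one such record captures a program, its arguments, input files, stdout target, and environment. A minimal sketch of that structure in Python (the class shape is an assumption for illustration, not the GriPhyN implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Derivation:
    """One VDL 'begin ... end' record: a program plus its files."""
    program: str
    args: list = field(default_factory=list)
    inputs: list = field(default_factory=list)   # logical input files
    stdout: str = ""                             # logical output file
    env: dict = field(default_factory=dict)

    def command_line(self):
        """Reconstruct the shell invocation the record describes."""
        parts = [self.program] + self.args + self.inputs
        if self.stdout:
            parts += [">", self.stdout]
        return " ".join(parts)

# The /bin/cat record from the slide.
cat = Derivation("/bin/cat", args=["-n"],
                 inputs=["filename1", "filename2"],
                 stdout="filename3", env={"key": "value"})
print(cat.command_line())  # /bin/cat -n filename1 filename2 > filename3
```

Because the record names every file the program touches, the catalog can later answer "how was filename3 derived?" without re-reading any script.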
8
Dependent Programs
  • begin v /bin/phys1 arg -n file i f1 file i
    f2 stdout f3 env key=value end

begin v /bin/phys2 arg -m file i f1 file i
f3 file o f4 env key=value end
Note that dependencies can be complex graphs.
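The dependency between the two records falls out of the file names alone: f3 appears as an output of phys1 and an input of phys2. A sketch of that inference (the dict shapes are assumptions for illustration):

```python
# The two records above, reduced to their inputs and outputs.
derivations = {
    "phys1": {"inputs": ["f1", "f2"], "outputs": ["f3"]},
    "phys2": {"inputs": ["f1", "f3"], "outputs": ["f4"]},
}

# Map each logical file to the derivation that produces it.
producer = {out: name
            for name, d in derivations.items()
            for out in d["outputs"]}

def upstream(name):
    """Derivations that must run before `name`: those producing one
    of its inputs. Files with no producer (f1, f2) are raw inputs."""
    return {producer[f] for f in derivations[name]["inputs"] if f in producer}

print(upstream("phys2"))  # {'phys1'}: f3 links the two programs
```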
9
The Interpreter
  • How program invocations are formed
  • Environment variables
  • Regular Parameters
  • Input files
  • Output file
  • How DAGs are formed
  • Recursive determination of dependencies
  • Parallel execution
  • How scripts are formed
  • Recursive determination of dependencies
  • Serial execution (now); parallel is possible
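The "recursive determination of dependencies" step can be sketched on a small diamond-shaped pipeline (names and dict shapes are assumptions, not the actual interpreter code); this is the serial-script strategy, emitting each derivation after everything it depends on:

```python
# Four derivations forming a diamond: random -> f.a; each half reads
# f.a; sum combines f.b and f.c into f.d.
derivations = {
    "random": {"inputs": [], "outputs": ["f.a"]},
    "half_b": {"inputs": ["f.a"], "outputs": ["f.b"]},
    "half_c": {"inputs": ["f.a"], "outputs": ["f.c"]},
    "sum":    {"inputs": ["f.b", "f.c"], "outputs": ["f.d"]},
}
producer = {f: name for name, d in derivations.items() for f in d["outputs"]}

def order_for(logical_file, seen=None):
    """Return the derivations needed to materialize `logical_file`,
    dependencies first, each emitted exactly once (serial order)."""
    seen = [] if seen is None else seen
    name = producer[logical_file]
    for f in derivations[name]["inputs"]:
        order_for(f, seen)            # recurse into each input's producer
    if name not in seen:
        seen.append(name)
    return seen

print(order_for("f.d"))  # ['random', 'half_b', 'half_c', 'sum']
```

For parallel execution the same walk would emit edges instead of a list, which is exactly what the DAG formation path does.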

10
Virtual Data Catalog: Relational Database
Structure As Implemented
11
Virtual Data Catalog: Conceptual Data Structure
12
DAGs Data Structures
  • DAGMan Example
  • TOP generates even random number
  • LEFT and RIGHT divide number by 2
  • BOTTOM sums

(Figure: diamond DAG. TOP is random, writing f.a; LEFT and RIGHT are
the two half jobs, each reading f.a and writing f.b or f.c; BOTTOM is
sum, reading f.b and f.c and writing f.d.)
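A diamond of this shape maps directly onto a Condor DAGMan input file; a hedged sketch (the node and submit-file names here are hypothetical):

```
# diamond.dag -- one JOB line per node, PARENT/CHILD lines for edges
JOB top     random.sub
JOB left    half.sub
JOB right   half.sub
JOB bottom  sum.sub

PARENT top         CHILD left right
PARENT left right  CHILD bottom
```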

13
DAGs Data Structures II
  • begin v random stdout f.a end
    begin v half stdin f.a stdout f.b end
    begin v half stdin f.a stdout f.c end
    begin v sum file i f.b file i f.c stdout f.d end

(Figure: the same diamond DAG as on the previous slide.)

  • rc f.a out.a
    rc f.b out.b
    rc f.c out.c
    rc f.d out.d

14
DAGs Data Structures III
XFORM             PARAM        DERIVED                 RC
xid  cu  prg      pid  value   xid pid ddid pos flg    pid  pfn
1    v   rnd      1    f.a     1   1   1    -1  O      1    out.a
2    v   half     2    f.b     2   1   2    -1  I      2    out.b
3    v   sum      3    f.c     2   2   2    -1  O      3    out.c
                  4    f.d     2   1   3    -1  I      4    out.d
                               2   3   3    -1  O
                               3   2   4    0   i
                               3   3   4    1   i
                               3   4   4    -1  O
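The four tables can be loaded into any RDBMS; a sketch with SQLite, keeping the table and column names from the slide (the column types are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE xform   (xid INTEGER, cu TEXT, prg TEXT);
CREATE TABLE param   (pid INTEGER, value TEXT);
CREATE TABLE derived (xid INTEGER, pid INTEGER, ddid INTEGER,
                      pos INTEGER, flg TEXT);
CREATE TABLE rc      (pid INTEGER, pfn TEXT);
""")
con.executemany("INSERT INTO xform VALUES (?,?,?)",
                [(1, "v", "rnd"), (2, "v", "half"), (3, "v", "sum")])
con.executemany("INSERT INTO param VALUES (?,?)",
                [(1, "f.a"), (2, "f.b"), (3, "f.c"), (4, "f.d")])
con.executemany("INSERT INTO derived VALUES (?,?,?,?,?)",
                [(1, 1, 1, -1, "O"), (2, 1, 2, -1, "I"), (2, 2, 2, -1, "O"),
                 (2, 1, 3, -1, "I"), (2, 3, 3, -1, "O"), (3, 2, 4, 0, "i"),
                 (3, 3, 4, 1, "i"), (3, 4, 4, -1, "O")])
con.executemany("INSERT INTO rc VALUES (?,?)",
                [(1, "out.a"), (2, "out.b"), (3, "out.c"), (4, "out.d")])

# Which logical and physical file does derivation 2 write?
# Join DERIVED to PARAM and to the replica catalog RC on pid.
row = con.execute("""
    SELECT p.value, r.pfn FROM derived d
    JOIN param p ON p.pid = d.pid
    JOIN rc    r ON r.pid = d.pid
    WHERE d.ddid = 2 AND d.flg = 'O'
""").fetchone()
print(row)  # ('f.b', 'out.b')
```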
15
Abstract and Concrete DAGs
  • Abstract DAGs
  • Resource locations unspecified
  • File names are logical
  • Data destinations unspecified
  • Concrete DAGs
  • Resource locations determined
  • Physical file names specified
  • Data delivered to and returned from physical
    locations
  • Translation is the job of the planner
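A toy sketch of that translation step: bind logical inputs to existing physical replicas where the catalog has one, and assign physical destinations to outputs (the catalog contents, site names, and URL scheme here are all hypothetical):

```python
# Hypothetical replica catalog: logical file name -> physical URL.
replica_catalog = {"f.a": "gsiftp://siteA/store/f.a"}

def concretize(job, site="siteA"):
    """Turn an abstract DAG node into a concrete one: fix the site,
    resolve inputs via the replica catalog, place the outputs."""
    inputs = {f: replica_catalog.get(f, "gsiftp://%s/scratch/%s" % (site, f))
              for f in job["inputs"]}
    outputs = {f: "gsiftp://%s/store/%s" % (site, f)
               for f in job["outputs"]}
    return {"site": site, "inputs": inputs, "outputs": outputs}

abstract_job = {"inputs": ["f.a"], "outputs": ["f.b"]}
concrete_job = concretize(abstract_job)
print(concrete_job["inputs"]["f.a"])    # gsiftp://siteA/store/f.a
print(concrete_job["outputs"]["f.b"])   # gsiftp://siteA/store/f.b
```

A real planner would also weigh resource load and policy when choosing the site; here the placement policy is deliberately trivial.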

16
What We Tested
  • DAG structures
  • Diamond DAG
  • Canonical keg app in complex DAGs
  • The CMS pipeline
  • Execution environments
  • Local execution
  • Grid execution via DAGMan

17
Generality
  • simple fabric → very powerful DAGs

DAGs of this pattern with >260 nodes were run.
18
What We Have Learned
  • UNIX program execution semantics are messy but
    manageable
  • Command line execution is manageable
  • File accesses can be trapped and tracked
  • Dynamic loading makes reproducibility more
    difficult; it should be avoided if possible
  • Object handling clearly needs a concentrated
    research effort

19
Future Work
  • Working with OO Databases
  • Handling navigational access
  • Refining notion of signatures
  • Dealing with fuzzy dependencies and equivalence
  • Cost tracking and calculations (w/ planner)
  • Automating the cataloging process
  • Integration with portals
  • Uniform execution language
  • Analysis of scripts (shell, Perl, Python, Tcl)
  • Refinement of data staging paradigms
  • Handling shell details
  • Pipes, 3>&1 (fds)

20
Future Work II: Design of Staging Semantics
  • What files need to be moved where to start a
    computation?
  • How do you know (exactly) where the computation
    will run, and how to get the files there (NFS,
    local, etc.)?
  • How/when to get the results back?
  • How/when to trust the catalog?
  • Double-check files' existence / safe arrival when
    you get there to use them
  • DB marking of files' existence: schema, timing
  • Mechanisms to audit and correct consistency of
    catalog vs. reality
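The "double-check before use" point can be sketched as a small guard run at the execution site before trusting the catalog (function name and error handling are hypothetical):

```python
import os
import tempfile

def verify_staged(expected_files):
    """Before trusting the catalog, confirm that each file it claims
    was staged actually exists at the execution site."""
    missing = [f for f in expected_files if not os.path.exists(f)]
    if missing:
        raise FileNotFoundError("catalog vs. reality mismatch: %s" % missing)
    return True

with tempfile.TemporaryDirectory() as d:
    staged = os.path.join(d, "f.b")
    open(staged, "w").close()           # simulate a safe arrival
    print(verify_staged([staged]))      # True
    try:
        verify_staged([os.path.join(d, "f.c")])   # never arrived
    except FileNotFoundError as e:
        print("stage check failed:", e)
```

An auditing pass would do the inverse walk, checking every replica-catalog entry against reality and repairing stale rows.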