Title: Virtual Data Management for CMS Simulation Production
1. Virtual Data Management for CMS Simulation Production
2. Goals
- Explore
  - virtual data dependency tracking
  - data derivability
  - integration of virtual data catalog functionality
  - use of DAGs in virtual data production
- Identify
  - architectural issues: planners, catalogs, interfaces
  - hard issues in executing real production physics applications
- Create prototypes
  - tools that can go into the VDT
- Test virtual data concepts on something real
3. Which Part of GriPhyN?
[Architecture diagram: an Application produces an abstract DAG (aDAG); the Planner, consulting Catalog Services, Monitoring, and Info Services, translates it into a concrete DAG (cDAG); the Executor, with Replica Management, Policy/Security, and the Reliable Transfer Service, runs it on Compute and Storage Resources. The initial solution is operational.]
4. What Was Done
- Created
  - A virtual data catalog
  - A catalog schema for an RDBMS
  - A virtual data language (VDL)
  - A VDL command interpreter
  - Simple DAGs for the CMS pipeline
  - Complex DAGs for a canonical test application
    - Kanonical executable for GriPhyN (keg)
- These DAGs actually execute on a Condor-Globus Grid
5. The CMS Challenge
- Remember Rick's slides and the complexity!
- Types of executables (4)
- Parameters, inputs, and outputs
- Templates of parameter lists
- Sensitivities of binaries
  - Dynamic libraries
  - Environment variables
- Condor-related environment issues are less obvious
6. Assumptions
- Grid activity takes place as sub-jobs
- Some subset of sub-jobs create durable data products; these are tracked in the virtual data catalog
- Job execution mechanisms execute VDL functions to describe their virtual data manipulations, dependencies, and derivations
- The products of sub-jobs are physical instances of logical files (i.e., physical files)
- Planners decide where physical files should be created
- Physical copies of logical files are tracked in the replica catalog
7. The VDL
[Diagram: filename1 and filename2 flow into "setenv; /bin/cat -n", producing filename3]
- begin v /bin/cat arg -n file i filename1 file i filename2 stdout filename3 env key=value
  end
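To make the record structure concrete, here is a minimal sketch (not the actual GriPhyN interpreter) of parsing the slide's `begin v ... end` syntax into a derivation record; the dict layout and the function name `parse_vdl` are assumptions for illustration:

```python
import shlex

def parse_vdl(text):
    """Parse 'begin v ... end' records from the slide's VDL-like syntax
    into dicts. A simplified sketch; the real VDL grammar differs."""
    derivations = []
    for block in text.split("end"):
        tokens = shlex.split(block)
        if not tokens:
            continue
        assert tokens[0] == "begin" and tokens[1] == "v"
        d = {"prog": tokens[2], "args": [], "inputs": [], "outputs": [], "env": {}}
        i = 3
        while i < len(tokens):
            kw = tokens[i]
            if kw == "arg":                     # regular parameter
                d["args"].append(tokens[i + 1]); i += 2
            elif kw == "file":                  # file i <name> / file o <name>
                mode, name = tokens[i + 1], tokens[i + 2]
                (d["inputs"] if mode == "i" else d["outputs"]).append(name)
                i += 3
            elif kw in ("stdin", "stdout"):     # redirected files
                (d["inputs"] if kw == "stdin" else d["outputs"]).append(tokens[i + 1])
                i += 2
            elif kw == "env":                   # key=value environment setting
                k, v = tokens[i + 1].split("=", 1)
                d["env"][k] = v; i += 2
            else:
                i += 1
        derivations.append(d)
    return derivations

recs = parse_vdl("begin v /bin/cat arg -n file i filename1 "
                 "file i filename2 stdout filename3 env key=value end")
# recs[0]["inputs"] == ["filename1", "filename2"]; outputs == ["filename3"]
```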
8. Dependent Programs
- begin v /bin/phys1 arg -n file i f1 file i f2 stdout f3 env key=value
  end
- begin v /bin/phys2 arg -m file i f1 file i f3 file o f4 env key=value
  end
- Note that dependencies can be complex graphs (here, f3 produced by phys1 is consumed by phys2)
9. The Interpreter
- How program invocations are formed
  - Environment variables
  - Regular parameters
  - Input files
  - Output files
- How DAGs are formed
  - Recursive determination of dependencies
  - Parallel execution
- How scripts are formed
  - Recursive determination of dependencies
  - Serial execution (now); parallel is possible
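The recursive dependency determination above can be sketched as follows. This is an illustration, not the actual interpreter: starting from a target file, recurse on the inputs of its producing derivation, emitting derivations in serial execution order (the derivation dicts and the diamond example below mirror the DAGMan example shown later in the talk):

```python
def producer_of(derivations, filename):
    """Return the derivation whose outputs include filename, if any."""
    for d in derivations:
        if filename in d["outputs"]:
            return d
    return None

def serial_order(derivations, target):
    """Recursively resolve the derivations needed to materialize `target`,
    emitting them in dependency (serial execution) order."""
    order = []
    def visit(filename):
        d = producer_of(derivations, filename)
        if d is None or d in order:   # raw input, or already scheduled
            return
        for f in d["inputs"]:         # recurse on each input file first
            visit(f)
        order.append(d)
    visit(target)
    return order

# A diamond-shaped set of derivations:
derivs = [
    {"prog": "random", "inputs": [], "outputs": ["f.a"]},
    {"prog": "half",   "inputs": ["f.a"], "outputs": ["f.b"]},
    {"prog": "half",   "inputs": ["f.a"], "outputs": ["f.c"]},
    {"prog": "sum",    "inputs": ["f.b", "f.c"], "outputs": ["f.d"]},
]
print([d["prog"] for d in serial_order(derivs, "f.d")])
# → ['random', 'half', 'half', 'sum']
```

The same traversal, emitting parent/child edges instead of a flat list, yields the DAG form for parallel execution.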
10. Virtual Data Catalog: Relational Database Structure As Implemented
11. Virtual Data Catalog: Conceptual Data Structure
12. DAGs & Data Structures
- DAGMan example (a diamond DAG)
  - TOP (random) generates an even random number into f.a
  - LEFT and RIGHT (half) each divide the number by 2, producing f.b and f.c
  - BOTTOM (sum) sums them into f.d
[Diagram: random → f.a; f.a feeds two half nodes → f.b and f.c; sum → f.d]
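A tiny sketch of the diamond's semantics (illustrative Python, not the actual test programs) shows why TOP must generate an even number: only then do the two halves sum back to the original value, which makes the result easy to verify:

```python
import random as rnd

def top():            # TOP: generate an even random number
    return 2 * rnd.randint(1, 100)

def half(x):          # LEFT / RIGHT: integer-divide by 2
    return x // 2

def bottom(a, b):     # BOTTOM: sum the two halves
    return a + b

f_a = top()
f_b, f_c = half(f_a), half(f_a)   # the two independent branches
f_d = bottom(f_b, f_c)
assert f_d == f_a                 # halves of an even number sum back
```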
13. DAGs & Data Structures II
- begin v random stdout f.a
  end
  begin v half stdin f.a stdout f.b
  end
  begin v half stdin f.a stdout f.c
  end
  begin v sum file i f.b file i f.c stdout f.d
  end
[Diagram: random → f.a; f.a feeds two half nodes → f.b and f.c; sum → f.d]
- rc f.a out.a
  rc f.b out.b
  rc f.c out.c
  rc f.d out.d
14. DAGs & Data Structures III

  XFORM            PARAM         DERIVED                  RC
  xid  cu  prg     pid  value    xid pid ddid pos  flg    pid  pfn
  1    v   rnd     1    f.a      1   1   1    -1   O      1    out.a
  2    v   half    2    f.b      2   1   2    -1   I      2    out.b
  3    v   sum     3    f.c      2   2   2    -1   O      3    out.c
                   4    f.d      2   1   3    -1   I      4    out.d
                                 2   3   3    -1   O
                                 3   2   4    0    i
                                 3   3   4    1    i
                                 3   4   4    -1   O
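The four tables above can be rendered directly in SQL. A minimal sketch using an in-memory SQLite database (the schema names follow the slide; column types are assumptions), with a join recovering the input files of the sum derivation and their physical locations:

```python
import sqlite3

# Hypothetical minimal rendering of the slide's four catalog tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE xform   (xid INTEGER, cu TEXT, prg TEXT);
CREATE TABLE param   (pid INTEGER, value TEXT);
CREATE TABLE derived (xid INTEGER, pid INTEGER, ddid INTEGER,
                      pos INTEGER, flg TEXT);
CREATE TABLE rc      (pid INTEGER, pfn TEXT);
""")
con.executemany("INSERT INTO xform VALUES (?,?,?)",
                [(1, "v", "rnd"), (2, "v", "half"), (3, "v", "sum")])
con.executemany("INSERT INTO param VALUES (?,?)",
                [(1, "f.a"), (2, "f.b"), (3, "f.c"), (4, "f.d")])
con.executemany("INSERT INTO derived VALUES (?,?,?,?,?)",
                [(1, 1, 1, -1, "O"), (2, 1, 2, -1, "I"), (2, 2, 2, -1, "O"),
                 (2, 1, 3, -1, "I"), (2, 3, 3, -1, "O"), (3, 2, 4, 0, "i"),
                 (3, 3, 4, 1, "i"), (3, 4, 4, -1, "O")])
con.executemany("INSERT INTO rc VALUES (?,?)",
                [(1, "out.a"), (2, "out.b"), (3, "out.c"), (4, "out.d")])

# Which logical files does derivation 4 (the sum) read, and where are they?
rows = con.execute("""
    SELECT p.value, rc.pfn FROM derived d
    JOIN param p ON p.pid = d.pid
    JOIN rc     ON rc.pid = d.pid
    WHERE d.ddid = 4 AND d.flg IN ('I', 'i')
    ORDER BY d.pos
""").fetchall()
print(rows)   # → [('f.b', 'out.b'), ('f.c', 'out.c')]
```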
15. Abstract and Concrete DAGs
- Abstract DAGs
  - Resource locations unspecified
  - File names are logical
  - Data destinations unspecified
- Concrete DAGs
  - Resource locations determined
  - Physical file names specified
  - Data delivered to and returned from physical locations
- Translation is the job of the planner
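The planner's translation step can be sketched as a per-node rewrite, assuming a replica catalog mapping logical names to physical URLs (all names and URLs below are hypothetical, and real planners also handle site selection and data movement):

```python
# Hypothetical replica catalog: logical file name -> physical file URL.
replica_catalog = {"f.a": "gsiftp://se.example.org/out.a",
                   "f.b": "gsiftp://se.example.org/out.b"}

def concretize(abstract_node, site):
    """Turn one abstract DAG node into a concrete one: fix the resource
    location and replace logical file names with physical ones."""
    return {
        "prog": abstract_node["prog"],
        "site": site,                         # resource location now determined
        "inputs": [replica_catalog[f] for f in abstract_node["inputs"]],
        # Outputs not yet in the catalog get a destination at the chosen site.
        "outputs": [replica_catalog.get(f, f"gsiftp://{site}/{f}")
                    for f in abstract_node["outputs"]],
    }

node = concretize({"prog": "half", "inputs": ["f.a"], "outputs": ["f.b"]},
                  "cluster.example.org")
```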
16. What We Tested
- DAG structures
- Diamond DAG
- Canonical keg app in complex DAGs
- The CMS pipeline
- Execution environments
- Local execution
- Grid execution via DAGMan
17. Generality
- Simple fabric → very powerful DAGs
- DAGs of this pattern with >260 nodes were run.
18. What We Have Learned
- UNIX program execution semantics are messy but manageable
- Command-line execution is manageable
- File accesses can be trapped and tracked
- Dynamic loading makes reproducibility more difficult; it should be avoided if possible
- Object handling obviously needs a concentrated research effort
19. Future Work
- Working with OO databases
- Handling navigational access
- Refining the notion of signatures
- Dealing with fuzzy dependencies and equivalence
- Cost tracking and calculations (with the planner)
- Automating the cataloging process
- Integration with portals
- Uniform execution language
- Analysis of scripts (shell, Perl, Python, Tcl)
- Refinement of data staging paradigms
- Handling shell details
- Pipes, 3>&1 (file descriptors)
20. Future Work II: Design of Staging Semantics
- What files need to be moved where to start a computation
- How do you know (exactly) where the computation will run, and how to get the file there (NFS, local, etc.)
- How/when to get the results back
- How/when to trust the catalog
- Double-check a file's existence/safe arrival when you get there to use it
- DB marking of a file's existence: schema, timing
- Mechanisms to audit and correct consistency of catalog vs. reality
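The audit step above can be sketched as a cross-check of catalog entries against reality. This illustrative sketch assumes locally mounted physical files (a real audit would also probe remote storage and check sizes/checksums):

```python
import os

def audit(replica_catalog):
    """Cross-check replica-catalog entries against the filesystem:
    report entries whose physical file is missing, rather than
    trusting the catalog blindly."""
    missing = []
    for lfn, pfn in replica_catalog.items():
        if not os.path.exists(pfn):      # catalog says it exists; verify
            missing.append((lfn, pfn))
    return missing

# A stale entry should be flagged before a job tries to read it.
stale = audit({"f.a": "/no/such/path/out.a"})
```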