DIAL: Distributed Interactive Analysis of Large datasets

1
DIAL: Distributed Interactive Analysis of Large datasets
GAE workshop, Caltech
  • David Adams
  • BNL
  • June 23, 2003

2
Contents
  • Goals of DIAL
  • What is DIAL?
  • Design
  • Application
  • Task
  • Result
  • Scheduler
  • Dataset properties
  • Exchange format
  • Status
  • Development plans
  • Interactivity issues

3
Goals of DIAL
  • 1. Demonstrate the feasibility of interactive
    analysis of large datasets
  • How much data can we analyze interactively?
  • 2. Set requirements for GRID services
  • Datasets, schedulers, jobs, results, resource
    discovery, authentication, allocation, ...
  • 3. Provide ATLAS with a useful analysis tool
  • For current and upcoming data challenges
  • Real world deliverable
  • Another experiment would show generality

4
What is DIAL?
  • Distributed
  • Data and processing
  • Interactive
  • Iterative processing with prompt response
  • (seconds rather than hours)
  • Analysis of
  • Fill histograms, select events, ...
  • Large datasets
  • Any event data (not just ntuples or tag)

5
What is DIAL? (cont)
  • DIAL provides a connection between
  • Interactive analysis framework
  • Fitting, presentation graphics, ...
  • E.g. ROOT
  • and Data processing application
  • Natural to the data of interest
  • E.g. athena for ATLAS
  • DIAL distributes processing
  • Among sites, farms, nodes
  • To provide user with desired response time
  • Look to other projects to provide most
    infrastructure

7
Design
  • DIAL has the following major components
  • Dataset describing the data of interest
  • Application defined by experiment/site
  • Task is user extension to the application
  • Result (generated by applying application and
    task to an input dataset)
  • Scheduler directs the processing
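
The relationship between these components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not show DIAL's actual class definitions, and every name below (`Dataset`, `Application`, `Task`, `Result`, `process`, `count_positive`) is hypothetical, chosen to mirror the vocabulary of the slide. The scheduler is left out here; the sketch only shows a result being generated by applying an application and a task to an input dataset.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Dataset:
    """Describes the data of interest (here: just a list of event values)."""
    events: List[float]


@dataclass
class Application:
    """Defined by the experiment or site; identified by name and version."""
    name: str
    version: str


@dataclass
class Task:
    """User extension to the application: here, a per-event fill function."""
    fill: Callable[[float, Dict[str, int]], None]


@dataclass
class Result:
    """Filled during processing and returned to the user."""
    counts: Dict[str, int] = field(default_factory=dict)


def process(app: Application, task: Task, dataset: Dataset) -> Result:
    """Generate a result by applying application and task to a dataset.

    In this toy version the application is carried along only to mirror
    the (app, task, dataset) triple; the loop itself stands in for it.
    """
    result = Result()
    for event in dataset.events:
        task.fill(event, result.counts)
    return result


def count_positive(event: float, counts: Dict[str, int]) -> None:
    # Hypothetical user code: count all events and those above zero.
    counts["all"] = counts.get("all", 0) + 1
    if event > 0:
        counts["positive"] = counts.get("positive", 0) + 1


result = process(Application("demo_app", "0.1"), Task(count_positive),
                 Dataset([1.5, -0.2, 3.0]))
print(result.counts)
```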

8
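(Diagram: DIAL processing flow between User Analysis (e.g. ROOT), the Scheduler, and Jobs running the application (e.g. athena) with the task code)
  • 1. Create or locate dataset
  • 2. Select application
  • 3. Create or select task
  • 4. Select scheduler
  • 5. submit(app,tsk,ds)
  • 6. Split dataset into Dataset 1 and Dataset 2
  • 7. Create result
  • 8. run(app,tsk,ds1) as Job 1 and run(app,tsk,ds2) as Job 2
  • 9. Each job fills its result
  • 10. Gather the job results into the result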
9
Application
  • User specifies application with
  • Name and version
  • E.g. dial_cbnt 0.3
  • Scheduler
  • Maps this specification to an executable
  • E.g. ChildScheduler uses mapping files at each
    site or node
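
As a rough illustration of mapping a name and version to an executable, the sketch below parses a small mapping file and looks up an application. The slides do not describe the real mapping-file format, so the one-line-per-application layout, the paths, and the function names `load_mapping` and `find_executable` are all assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping-file contents: "name version executable" per line.
# The format and the paths are invented for illustration only.
MAPPING_TEXT = """\
# application  version  executable
dial_cbnt 0.3 /opt/dial/bin/dial_cbnt-0.3
dialproc 0.3 /opt/dial/bin/dialproc-0.3
"""


def load_mapping(text: str) -> Dict[Tuple[str, str], str]:
    """Parse the mapping text into {(name, version): executable}."""
    mapping: Dict[Tuple[str, str], str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, version, executable = line.split()
        mapping[(name, version)] = executable
    return mapping


def find_executable(mapping: Dict[Tuple[str, str], str],
                    name: str, version: str) -> Optional[str]:
    """Return the executable for an application, or None if unavailable."""
    return mapping.get((name, version))


mapping = load_mapping(MAPPING_TEXT)
print(find_executable(mapping, "dial_cbnt", "0.3"))
```

Because each site or node would hold its own file, the same specification could resolve to different executables in different places, and a missing entry gives a natural "application not available" answer.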

10
Task
  • Task allows users to extend application
  • Application
  • Defines the task syntax
  • Provides means to build and install task
  • E.g. compile and dynamic load
  • Examples
  • Empty histograms plus code to fill them
  • Sequence of algorithms

11
Result
  • Result is filled during processing
  • Examples
  • Histogram
  • Event list
  • New dataset
  • Combination of the above
  • Returned to the user
  • Should be small
  • Logical file identifiers rather than the data in
    those files

12
Schedulers
  • A DIAL scheduler provides means to
  • Submit a job
  • Terminate a job
  • Monitor a job
  • Status
  • Events processed
  • Partial results
  • Verify availability of an application
  • Install and verify the presence of a task for a
    given application
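
The operations listed above suggest an abstract interface. The sketch below is one possible shape for it, not DIAL's actual API: the class names `Scheduler` and `InProcessScheduler`, the method names, and the dictionary returned by `monitor` are assumptions. Task installation is omitted for brevity, and the toy implementation runs each job immediately, so `terminate` never has anything to stop.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Scheduler(ABC):
    """Hypothetical interface mirroring the operations listed above."""

    @abstractmethod
    def has_application(self, app: str) -> bool:
        """Verify availability of an application."""

    @abstractmethod
    def submit(self, app: str, task: Callable[[int], int],
               events: List[int]) -> int:
        """Submit a job and return its identifier."""

    @abstractmethod
    def terminate(self, job_id: int) -> None:
        """Terminate a job if it is still running."""

    @abstractmethod
    def monitor(self, job_id: int) -> Dict[str, object]:
        """Report status, events processed and the partial result."""


class InProcessScheduler(Scheduler):
    """Toy scheduler that runs each job at once in the caller's process."""

    def __init__(self, applications: List[str]) -> None:
        self._applications = set(applications)
        self._jobs: Dict[int, Dict[str, object]] = {}

    def has_application(self, app: str) -> bool:
        return app in self._applications

    def submit(self, app, task, events):
        if not self.has_application(app):
            raise ValueError(f"application not available: {app}")
        job_id = len(self._jobs) + 1
        partial = sum(task(event) for event in events)
        self._jobs[job_id] = {"status": "done",
                              "events_processed": len(events),
                              "partial_result": partial}
        return job_id

    def terminate(self, job_id: int) -> None:
        job = self._jobs[job_id]
        if job["status"] == "running":  # never true here: jobs finish at once
            job["status"] = "terminated"

    def monitor(self, job_id: int) -> Dict[str, object]:
        return dict(self._jobs[job_id])


scheduler = InProcessScheduler(["dialproc"])
job = scheduler.submit("dialproc", lambda event: 1, [10, 11, 12])
print(scheduler.monitor(job))
```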

13
Schedulers (cont)
  • Schedulers form a hierarchy
  • Corresponding to that of compute nodes
  • Grid, site, farm, node
  • Each scheduler splits job into sub-jobs and
    distributes these over lower-level schedulers
  • Lowest level ChildScheduler starts processes to
    carry out the sub-jobs
  • Scheduler concatenates results for its sub-jobs
  • User may enter the hierarchy at any level
  • Client-server communication
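
The split-and-concatenate behaviour of the hierarchy can be sketched as a composite pattern. The code below is a minimal illustration, assuming a job is just a function applied to a list of events; `CompositeScheduler` is a hypothetical name, and the in-process `ChildScheduler` here only borrows the name from the slides. The equal-slice splitting rule is also an assumption; a real scheduler would account for data location and available compute.

```python
from typing import Callable, List, Sequence


class ChildScheduler:
    """Lowest level: actually carries out a sub-job (here, in-process)."""

    def run(self, task: Callable[[int], int],
            events: Sequence[int]) -> List[int]:
        return [task(event) for event in events]


class CompositeScheduler:
    """Splits a job over lower-level schedulers and concatenates results."""

    def __init__(self, children) -> None:
        self.children = list(children)

    def run(self, task: Callable[[int], int],
            events: Sequence[int]) -> List[int]:
        n = len(self.children)
        # Contiguous, near-equal slices, one per lower-level scheduler.
        bounds = [len(events) * i // n for i in range(n + 1)]
        results: List[int] = []
        for child, lo, hi in zip(self.children, bounds, bounds[1:]):
            results.extend(child.run(task, events[lo:hi]))
        return results


# A site made of two farms; the first farm has two nodes, the second one.
farm_a = CompositeScheduler([ChildScheduler(), ChildScheduler()])
farm_b = CompositeScheduler([ChildScheduler()])
site = CompositeScheduler([farm_a, farm_b])

squares = site.run(lambda e: e * e, list(range(7)))
print(squares)

# The user may also enter the hierarchy at a lower level.
farm_only = farm_b.run(lambda e: e * e, [2, 3])
```

Because every level exposes the same `run` method, a user-facing client does not need to know whether it is talking to a site, a farm or a single node.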

15
Dataset properties
  • Datasets have the following properties
  • 0. Identity
  • 1. Content
  • 2. Location (of the data)
  • 3. Mapping (content to location)
  • 4. Provenance (of the dataset)
  • 5. History (of the production)
  • 6. Labels (describing dataset)

16
Dataset properties (cont)
  • All the above are metadata
  • Dataset also has data
  • Details
  • 0. Identity
  • Index or name to distinguish datasets
  • 1. Content
  • Tells what kind of data is carried by the dataset
  • E.g. for HEP event data
  • List of event identifiers
  • Content for each event (raw, tracks, jets, ...)

17
Dataset properties (cont)
  • 2. Location
  • Most interesting example is a collection of
    logical files
  • Could also be
  • physical files
  • DB
  • data store such as POOL
  • 3. Mapping
  • For each piece of content, where (in the
    location) is the associated data
  • For collection of logical files, which file and
    where in the file

18
Dataset properties (cont)
  • 4. Provenance
  • Parent datasets
  • Transformation
  • Applied to parent dataset to obtain this dataset
  • Sufficient to construct dataset (virtual data)
  • 5. History
  • Production history beyond provenance
  • Division into jobs, which compute nodes, time
    stamps, ...

19
Dataset properties (cont)
  • 6. Labels
  • Additional data characterizing the dataset
  • For example
  • Motivation for dataset
  • E.g. starting point for Higgs searches
  • Results of dataset evaluation
  • E.g. approved for basis of publications
  • Other global properties
  • E.g. integrated luminosity for an event dataset
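
The seven properties can be gathered into a single record to make their roles concrete. This is a sketch only: the class name `EventDataset`, the field types, and all the example values (dataset names, `lfn:` file names, job descriptions) are assumptions for illustration and do not come from the DIAL dataset package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EventDataset:
    identity: str                        # 0. index or name
    content: Dict[int, List[str]]        # 1. event id -> kinds of data held
    location: List[str]                  # 2. e.g. a collection of logical files
    mapping: Dict[int, Tuple[str, int]]  # 3. event id -> (file, entry in file)
    provenance: Dict[str, object] = field(default_factory=dict)  # 4.
    history: List[str] = field(default_factory=list)             # 5.
    labels: Dict[str, object] = field(default_factory=dict)      # 6.

    def locate(self, event_id: int) -> Tuple[str, int]:
        """Use the mapping to find where an event's data lives."""
        return self.mapping[event_id]


# All values below are invented for the example.
dataset = EventDataset(
    identity="example-001",
    content={1: ["raw", "tracks"], 2: ["raw", "tracks"]},
    location=["lfn:file_a", "lfn:file_b"],
    mapping={1: ("lfn:file_a", 0), 2: ("lfn:file_b", 0)},
    provenance={"parents": ["example-000"],
                "transformation": "reconstruction"},
    history=["job 1 on node 7"],
    labels={"purpose": "starting point for Higgs searches"},
)
print(dataset.locate(2))
```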

20
Dataset operations
  • Operations that datasets should support
  • Content selection
  • Splitting
  • Merging
  • Content selection
  • User can create a new dataset by selecting
    content from an existing dataset
  • For an event dataset, there are two categories
  • Event selection
  • Apply an algorithm which flags which events to
    accept
  • Event content selection
  • e.g. keep tracks, drop jets
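
The two categories of content selection can be illustrated on a toy event dataset held as a dictionary. The representation (event identifier mapped to a dictionary of content kinds) and the function names `select_events` and `select_content` are assumptions for this sketch, not part of DIAL.

```python
from typing import Callable, Dict, Set

Event = Dict[str, int]        # content kind -> (toy) payload
EventData = Dict[int, Event]  # event identifier -> event


def select_events(events: EventData,
                  accept: Callable[[Event], bool]) -> EventData:
    """Event selection: keep the events flagged by the algorithm."""
    return {eid: ev for eid, ev in events.items() if accept(ev)}


def select_content(events: EventData, keep: Set[str]) -> EventData:
    """Event content selection: keep only the named content per event."""
    return {eid: {kind: data for kind, data in ev.items() if kind in keep}
            for eid, ev in events.items()}


events = {
    1: {"tracks": 12, "jets": 2},
    2: {"tracks": 3, "jets": 0},
    3: {"tracks": 25, "jets": 4},
}

# Event selection: accept events with at least ten tracks.
many_tracks = select_events(events, lambda ev: ev["tracks"] >= 10)
# Event content selection: keep tracks, drop jets.
tracks_only = select_content(many_tracks, {"tracks"})
print(tracks_only)
```

Both functions return a new dataset and leave the input untouched, which matches the idea of creating a new dataset from an existing one.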

21
Dataset operations (cont)
  • Splitting
  • Distributed processing requires that the input
    dataset be split into sub-datasets for processing
  • Natural to split along file boundaries
  • For event datasets, the split is along event
    boundaries
  • Merging
  • Combine multiple datasets to form a new dataset
  • For event datasets, there are again two
    dimensions
  • Same event content and data for different events
  • E.g. data from two different run periods
  • Same events and different content
  • E.g. raw data and the corresponding reconstructed
    data
  • Or reconstructed data and relevant conditions
    data
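
Splitting and the two dimensions of merging can be sketched on the same kind of toy representation. The function names and the dictionary layout below are assumptions made for the example; in particular, `split_by_file` assumes each event lives in exactly one file, so that file boundaries are also event boundaries.

```python
from collections import defaultdict
from typing import Dict, List

Event = Dict[str, int]        # content kind -> (toy) payload
EventData = Dict[int, Event]  # event identifier -> event


def split_by_file(event_to_file: Dict[int, str]) -> Dict[str, List[int]]:
    """Split along file boundaries: one sub-dataset (event list) per file."""
    parts: Dict[str, List[int]] = defaultdict(list)
    for event_id, file_name in sorted(event_to_file.items()):
        parts[file_name].append(event_id)
    return dict(parts)


def merge_events(a: EventData, b: EventData) -> EventData:
    """Same content, different events (e.g. two run periods)."""
    if a.keys() & b.keys():
        raise ValueError("datasets share events")
    return {**a, **b}


def merge_content(a: EventData, b: EventData) -> EventData:
    """Same events, different content (e.g. raw plus reconstructed)."""
    if a.keys() != b.keys():
        raise ValueError("datasets hold different events")
    return {eid: {**a[eid], **b[eid]} for eid in a}


parts = split_by_file({1: "f1", 2: "f1", 3: "f2"})
both_periods = merge_events({1: {"raw": 5}}, {2: {"raw": 6}})
raw_and_reco = merge_content({1: {"raw": 5}}, {1: {"tracks": 2}})
print(parts, both_periods, raw_and_reco)
```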

22
Dataset operations (cont)
(Figure: event dataset operations)
23
Dataset implementation
  • The dataset properties span many realms
  • Different properties will be stored in different
    ways
  • Properties used for selection naturally reside in
    relational DB tables
  • Content, provenance, metadata
  • End applications on worker nodes might expect the
    properties they require to reside in files
  • Content, location mapping
  • Some property data will likely be replicated
  • both RDB and files
  • Identification of the different properties is a
    first step in implementing datasets

24
Dataset implementation (cont)
  • Interface for schedulers
  • Central problem is how to split dataset
  • Account for
  • Data location
  • Compute cycle locations
  • Matching these
  • Interactive analysis (fast response) is a special
    challenge
  • How can different dataset providers share a
    scheduler?
  • Options
  • Common dataset interface which provides the
    required information
  • Logical file mapping or
  • Composite structure (sub-datasets)
  • Each provider also provides a service to do the
    splitting

25
Exchange format
  • DIAL components are exchanged
  • Between
  • User and scheduler
  • Scheduler and scheduler
  • Scheduler and application executable
  • Components have an XML representation
  • Exchange mechanism can be
  • C++ objects
  • SOAP
  • Files
  • Mechanism defined by scheduler
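
To make the idea of an XML representation concrete, the sketch below round-trips an application specification through XML using Python's standard library. The element and attribute names (`application`, `name`, `version`) are assumptions; the slides do not show DIAL's actual XML schema.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def application_to_xml(name: str, version: str) -> str:
    """Serialize an application specification to an XML string."""
    element = ET.Element("application", {"name": name, "version": version})
    return ET.tostring(element, encoding="unicode")


def application_from_xml(text: str) -> Tuple[str, str]:
    """Recover (name, version) from the XML produced above."""
    element = ET.fromstring(text)
    if element.tag != "application":
        raise ValueError(f"unexpected element: {element.tag}")
    return element.get("name"), element.get("version")


xml_text = application_to_xml("dial_cbnt", "0.3")
print(xml_text)
round_trip = application_from_xml(xml_text)
```

The same string could be handed over in memory, sent in a SOAP message or written to a file, which is why a single text representation suits all three exchange mechanisms.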

26
Status
  • All DIAL components in place
  • http://www.usatlas.bnl.gov/dladams/dial
  • Release 0.30 made last week
  • Includes demo to demonstrate look and feel
  • But scheduler is very simple
  • Only local ChildScheduler is implemented
  • Grid, site, farm and client-server schedulers not
    yet implemented
  • More details in CHEP paper at
  • http://www.usatlas.bnl.gov/dladams/dial/talks/dial_chep2003.pdf

27
Status (cont)
  • Dataset implemented as a separate system
  • http://www.usatlas.bnl.gov/dladams/dataset
  • Implementations
  • ATLAS combined ntuple hbook file
  • CbntDataset
  • in release 0.3
  • ATLAS AthenaRoot file
  • Holds Monte Carlo generator information
  • In 0.2 not in 0.3 (easy to add)
  • Athena-Pool files
  • when they are available

28
Status (cont)
  • DIAL and dataset classes imported to ROOT
  • ROOT can be used as interactive user interface
  • All DIAL and dataset classes and methods
    available at command prompt
  • DIAL and dataset libraries must be loaded
  • Import done with ACLiC
  • Need to add result for TH1 and any other classes
    of interest

29
Status (cont)
  • Applications
  • Test program dialproc counts events
  • Wrapper around PAW dial_cbnt
  • Processes CbntDataset
  • Task provides
  • HBOOK file with empty histograms
  • Fortran subroutine to fill histograms

30
Status (cont)
  • Interface for logical files was added recently
  • Keep results small
  • Carry logical file instead of data in the file
  • Includes abstract interface for a file catalog
  • Implemented
  • Local directory
  • AFS
  • Planned
  • Magda (under development)
  • RLS (add flavors as required)
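
An abstract file catalog with a local-directory implementation might look like the sketch below. The class names `FileCatalog` and `LocalDirectoryCatalog`, the single `resolve` method, and the file names are assumptions for illustration; the slides do not show the actual DIAL interface.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class FileCatalog(ABC):
    """Hypothetical abstract interface: logical name -> physical file."""

    @abstractmethod
    def resolve(self, logical_name: str) -> Optional[Path]:
        """Return a physical path for a logical file name, or None."""


class LocalDirectoryCatalog(FileCatalog):
    """Catalog backed by one directory: the logical name is the file name."""

    def __init__(self, directory) -> None:
        self.directory = Path(directory)

    def resolve(self, logical_name: str) -> Optional[Path]:
        candidate = self.directory / logical_name
        return candidate if candidate.is_file() else None


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as directory:
    (Path(directory) / "events.dat").write_text("toy data")
    catalog = LocalDirectoryCatalog(directory)
    found = catalog.resolve("events.dat")
    missing = catalog.resolve("absent.dat")

print(found is not None, missing is None)
```

A result can then carry only the logical name `events.dat`; whoever needs the data asks a catalog for its physical location, and other catalog flavors can be added behind the same interface.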

31
Development plans
  • Farm Scheduler
  • Distribute processing over a single farm
  • Condor, LSF, ssh, STAR, ?
  • Provides useful tool for distributed processing
  • Remote access to scheduler
  • Job submission from anywhere
  • Web service
  • Add policy to scheduler interface
  • Response time
  • How to split dataset for distributed processing

32
Development plans (cont)
  • Grid schedulers
  • Distribute data and processing over multiple
    sites
  • Interact with dataset, file and replica catalogs
  • Authentication, authorization, resource location
    and allocation, ...
  • ATLAS POOL dataset
  • After ATLAS incorporates POOL
  • ATLAS athena as application

33
Interactivity issues
  • Important aspect is latency
  • Interactive system provides means for user to
    specify maximum acceptable response time
  • All actions must take place within this time
  • Locate data and resources
  • Splitting and matchmaking
  • Job submission
  • Gathering of results
  • Longer latency for first pass over a dataset
  • Record state for later passes
  • Still must be able to adjust to changing
    conditions
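
One way to read the response-time requirement is as a budget that fixes how finely a job must be split. The sketch below uses a deliberately simple model that is an assumption of this example, not something stated in the slides: a fixed overhead (locating data, matchmaking, submission, gathering) plus processing time that scales inversely with the number of parallel sub-jobs. The function name `jobs_needed` and all numbers are hypothetical.

```python
import math


def jobs_needed(n_events: int, events_per_second: float,
                overhead_seconds: float,
                max_response_seconds: float) -> int:
    """Smallest number of parallel sub-jobs that meets the response time.

    Assumed model: response = overhead + n_events / (rate * n_jobs).
    """
    budget = max_response_seconds - overhead_seconds
    if budget <= 0:
        raise ValueError("fixed overhead alone exceeds the response time")
    return max(1, math.ceil(n_events / (events_per_second * budget)))


# One million events, 1000 events/s per job, 5 s overhead, 30 s allowed.
print(jobs_needed(1_000_000, 1000, 5, 30))
```

Under this model the overhead directly eats into the budget, which is why recording state from a first pass (and so shrinking the overhead of later passes) reduces the number of nodes needed for the same response time.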

34
Interactivity issues (cont)
  • Interactive and batch must share resources
  • Sharing implies more available resources for both
  • Interactive use varies significantly
  • Time of day
  • Time to the next conference
  • Discovery of interesting events
  • Interactive request must be able to preempt
    long-running batch jobs
  • But allocation determined by sites, experiments, ...