Data Centric Issues

Transcript and Presenter's Notes
1
Data Centric Issues
  • Particle Physics and
  • Grid Data Management
  • Tony Doyle
  • University of Glasgow

2
Outline: Data to Metadata to Data
  • Introduction
  • Yesterday .. all my troubles seemed so far
    away
  • (non-Grid) Database Access
  • Data Hierarchy
  • Today .. is the greatest day I've ever known
  • Grids and Metadata Management
  • File Replication
  • Replica Optimisation
  • Tomorrow .. never knows
  • Event Replication
  • Query Optimisation

3
GRID Services Context
[Layered services diagram:]
  • Applications: e.g. Chemistry, Cosmology, Environment, Biology, High Energy Physics
  • Application Toolkits: data-intensive, remote visualisation, distributed computing, problem solving, remote instrumentation and collaborative applications toolkits
  • Grid Services (Middleware): resource-independent and application-independent services, e.g. authentication, authorisation, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
  • Grid Fabric (Resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass
4
Online Data Rate vs Size
[Scatter plot: Level 1 trigger rate (Hz, 10^2 to 10^6) versus event size (bytes, 10^4 to 10^7) for LHCb, ATLAS, CMS, HERA-B, KLOE, CDF II, CDF, H1, ZEUS, ALICE, NA49, UA1 and LEP. Annotations: high Level-1 trigger rate (1 MHz); high number of channels and high bandwidth (500 Gbit/s); high data archive (PetaByte).]
How can this data reach the end user? It doesn't: a factor of O(1000) online data reduction is applied via trigger selection.
5
Offline Data Hierarchy
RAW, ESD, AOD, TAG
  • RAW (~1 MB/event): recorded by the DAQ; triggered events; detector digitisation.
  • ESD (~100 kB/event): reconstructed information; pseudo-physical information such as clusters and track candidates (electrons, muons), etc.
  • AOD (~10 kB/event): selected physical information; transverse momentum, association of particles, jets, (best) particle identification; physical info for relevant objects.
  • TAG (~1 kB/event): analysis information; relevant information for fast event selection.
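The per-event sizes above drive the storage and replication picture in the rest of the talk. A minimal sketch (Java; the tier names and per-event sizes come from the slide, the event count is an assumption for illustration) of how those sizes scale to full-sample volumes:

```java
// Illustrative sketch of the RAW/ESD/AOD/TAG hierarchy.
// Tier names and per-event sizes come from the slide; the event count is assumed.
public class DataHierarchy {
    enum Tier {
        RAW(1_000_000),   // ~1 MB/event: DAQ output, detector digitisation
        ESD(100_000),     // ~100 kB/event: reconstructed information
        AOD(10_000),      // ~10 kB/event: physical information for analysis
        TAG(1_000);       // ~1 kB/event: fast event-selection information

        final long bytesPerEvent;
        Tier(long bytesPerEvent) { this.bytesPerEvent = bytesPerEvent; }
    }

    public static void main(String[] args) {
        long events = 1_000_000_000L;  // e.g. one year of data-taking (cf. slide 9)
        for (Tier t : Tier.values()) {
            double tb = events * (double) t.bytesPerEvent / 1e12;
            System.out.printf("%s: %.0f TB for %d events%n", t, tb, events);
        }
    }
}
```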
6
Physics Analysis
[Diagram: analysis flow across the tiers, with data flow increasing towards the collaboration-wide resources.]
  • Tier 0, 1 (collaboration wide): raw data, calibration data, ESD data or Monte Carlo, event tags, event selection.
  • Tier 2 (analysis groups): analysis and skims producing physics objects.
  • Tier 3, 4 (physicists): physics analysis on the physics objects.
7
Data Structure
[Diagram: the trigger system and data acquisition (run conditions, Level 3 trigger, calibration data, trigger tags) feed raw data into reconstruction, producing Event Summary Data (ESD) and event tags.]
REAL and SIMULATED data are required, with both central and distributed production.
8
A running (non-Grid) experiment
  • Three steps to select an event today
  • Remote access to O(100) TBytes of ESD data
  • via remote access to 100 GBytes of TAG data
  • using offline selection, e.g. ZeusIO-Variable (Ee>20.0)and(Ntrks>4) (sketched below)
  • Access to the remote store via batch job
  • 1% database event-finding overhead
  • O(1M) lines of reconstruction code
  • No middleware
  • 20k lines of C++ glue from the Objectivity (TAG) to the ADAMO (ESD) database

ESD and TAG:
  • 100 million selected events from 5 years of data
  • TAG selection via 250 variables/event
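A minimal sketch of the TAG-level cut in the bullets above, assuming hypothetical field names (ee, ntrks) and an in-memory TAG record; the real chain went through Objectivity (TAG) and ADAMO (ESD), not this code:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical TAG record carrying two of the ~250 per-event selection variables.
record TagRecord(long eventId, double ee, int ntrks) {}

public class TagSelection {
    // Offline selection analogous to (Ee>20.0)and(Ntrks>4): scan the compact TAG
    // store and return only the event IDs that pass, so that the much larger ESD
    // records need to be fetched for those events alone.
    static List<Long> select(List<TagRecord> tags) {
        return tags.stream()
                   .filter(t -> t.ee() > 20.0 && t.ntrks() > 4)
                   .map(TagRecord::eventId)
                   .collect(Collectors.toList());
    }
}
```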

9
A future (Grid) experiment
  • Three steps to (analysis) heaven
  • 10 (1) PByte of RAW (ESD) data/yr
  • 1 TByte of TAG data (local access)/yr
  • Offline selection, e.g. ATLASIO-Variable (Mee>100.0)and(Njets>4)
  • Interactive access to the local TAG store
  • Automated batch jobs to distributed Tier-0, -1, -2 centres
  • O(1M) lines of reconstruction code
  • O(1M) lines of middleware (NEW)
  • O(20k) lines of Java/C++ provide the glue from the TAG to the ESD database
  • All working?
  • Efficiently?

[Cartoon: "DataBase Solutions Inc."]
  • 1000 million events from 1 year's data-taking
  • TAG selection via 250 variables
10
Grid Data Management Requirements
  • Robust: software development infrastructure
  • Secure: via Grid certificates
  • Scalable: non-centralised
  • Efficient: optimised replication
  • Examples

11
Robust? Development Infrastructure
  • CVS Repository
  • management of DataGrid source code
  • all code available (some mirrored)
  • Bugzilla
  • Package Repository
  • public access to packaged DataGrid code
  • Development of Management Tools
  • statistics concerning DataGrid code
  • auto-building of DataGrid RPMs
  • publishing of generated API documentation
  • latest build Release 1.2 (August 2002)

140,506 lines of code in 10 languages (Release 1.0)
12
Robust? Software Evaluation
ETT Extensively Tested in Testbed
UT Unit Testing
IT Integrated Testing
NI Not Installed
NFF Some Non-Functioning Features
MB Some Minor Bugs
SD Successfully Deployed
Component ETT UT IT NI NFF MB SD
Resource Broker v v v l
Job Desc. Lang. v v v l
Info. Index v v v l
User Interface v v v l
Log. Book. Svc. v v v l
Job Sub. Svc. v v v l
Broker Info. API v v l
SpitFire v v l
GDMP l
Rep. Cat. API v v l
Globus Rep. Cat. v v l
Component ETT UT IT NI NFF MB SD
SE Info. Prov. v v l
File Elem. Script l
Info. Prov. Config. v v l
RFIO v v l
MSS Staging l
Mkgridmap daemon v l
CRL update daemon v l
Security RPMs v l
EDG Globus Config. v v l
Component ETT UT IT NI NFF MB SD
Schema v v v l
FTree v v l
R-GMA v v l
Archiver Module v v l
GRM/PROVE v v l
LCFG v v v l
CCM v l
Image Install. v l
PBS Info. Prov. v v v l
LSF Info. Prov. v v l
Component ETT UT IT NI NFF MB SD
PingER v v l
UDPMon v v l
IPerf v v l
Globus2 Toolkit v v l
13
Robust? Middleware Testbed(s)
Validation/Maintenance > Testbed(s): EU-wide development
14
1. Robust? Code Development Issues
  • Reverse engineering (C++ code analysis and restructuring, coding standards) > abstraction of existing code to UML architecture diagrams
  • Language choice (currently 10 languages used in DataGrid)
  • Java ≈ C++ minus features (global variables, pointer manipulation, goto statements, etc.)
  • Constraints (performance, libraries, legacy code)
  • Testing (automation, object-oriented testing)
  • Industrial strength?
  • OGSA-compliant?
  • O(20 year) future proof??

15
Data Management on the Grid
  • Data in particle physics is centred on events stored in a database. Groups of events are collected in (typically GByte-sized) files. In order to utilise additional resources and minimise data analysis time, Grid replication mechanisms are currently used at the file level.
  • Access to a database via Grid certificates (Spitfire/OGSA-DAI)
  • Replication of files on the Grid (GDMP/Giggle)
  • Replication and optimisation simulation (Reptor/Optor)

16
2. Spitfire
Secure? At the level required in Particle Physics.
[Architecture diagram: a servlet container exposes a Security Servlet and a Translator Servlet in front of an RDBMS. An SSLServletSocketFactory with a TrustManager checks client certificates against the trusted CAs and a revoked-certificates repository. An Authorization Module asks whether the user specifies a role, consults a role repository, maps the role to a connection id via role/connection mappings, and hands the translated query to a connection pool.]
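A minimal sketch (Java) of the role-to-connection step in the diagram; the class name, roles and connection ids are all assumptions, not the actual Spitfire implementation:

```java
import java.util.Map;

// Hypothetical illustration of the Authorization Module's role mapping:
// the role asserted with the client's Grid certificate selects which database
// connection (and hence which privileges) the translated query runs with.
public class RoleMapper {
    // role -> connection id known to the connection pool (assumed configuration)
    private final Map<String, String> roleToConnection = Map.of(
            "reader", "conn-readonly",
            "writer", "conn-readwrite",
            "admin",  "conn-admin");

    /** Returns the connection id for a role, or refuses if the role is unknown. */
    public String connectionFor(String role) {
        String conn = roleToConnection.get(role);
        if (conn == null) {
            throw new SecurityException("No database mapping for role: " + role);
        }
        return conn;
    }
}
```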
17
2. Database client API
  • A database client API has been defined
  • Implemented as a grid service using standard web service technologies
  • Ongoing development with OGSA-DAI
  • Talk: Project Spitfire - Towards Grid Web Service Databases
18
3. GDMP and the Replica Catalogue
Replica Catalogue TODAY: centralised and LDAP-based.
[Diagram: a central Globus 2.0 Replica Catalogue (LDAP) serving StorageElement1, StorageElement2 and StorageElement3.]
GDMP 3.0: a file mirroring/replication tool, originally for replicating CMS Objectivity files for High Level Trigger studies, now used widely in HEP.
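Independently of the LDAP implementation, what a replica catalogue records is a mapping from a logical file name to its physical copies on storage elements. A minimal sketch with invented names (GDMP/Globus hold this in LDAP, not in a HashMap):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy replica catalogue: logical file name (LFN) -> physical file names (PFNs).
public class ReplicaCatalogue {
    private final Map<String, List<String>> replicas = new HashMap<>();

    public void register(String lfn, String pfn) {
        replicas.computeIfAbsent(lfn, k -> new ArrayList<>()).add(pfn);
    }

    public List<String> lookup(String lfn) {
        return replicas.getOrDefault(lfn, List.of());
    }

    public static void main(String[] args) {
        ReplicaCatalogue cat = new ReplicaCatalogue();
        cat.register("lfn:run1234.esd", "pfn:se1.example.org/data/run1234.esd");
        cat.register("lfn:run1234.esd", "pfn:se2.example.org/data/run1234.esd");
        System.out.println(cat.lookup("lfn:run1234.esd"));
    }
}
```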
19
3. Giggle: Hierarchical P2P
Hierarchical indexing: the higher-level RLI contains pointers to lower-level RLIs or LRCs, and each LRC sits in front of a Storage Element.
RLI = Replica Location Index; LRC = Local Replica Catalog.
Scalable? Trade-off: consistency versus efficiency.
[Diagram: a tree of RLIs above a layer of LRCs, one per Storage Element.]
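A minimal sketch of the two-level lookup the hierarchy implies, with hypothetical class names; real RLIs hold periodically refreshed soft-state summaries rather than exact, always-consistent sets (hence the consistency/efficiency trade-off above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy Giggle-style hierarchy: the RLI only knows WHICH LRCs hold a given LFN;
// each LRC resolves the LFN to concrete PFNs on its own Storage Element.
class LocalReplicaCatalog {
    final Map<String, List<String>> lfnToPfns = new HashMap<>();

    List<String> resolve(String lfn) {
        return lfnToPfns.getOrDefault(lfn, List.of());
    }
}

class ReplicaLocationIndex {
    // LFN -> LRCs believed to hold a copy (in reality a soft-state summary)
    final Map<String, Set<LocalReplicaCatalog>> index = new HashMap<>();

    List<String> locate(String lfn) {
        List<String> pfns = new ArrayList<>();
        for (LocalReplicaCatalog lrc : index.getOrDefault(lfn, Set.of())) {
            pfns.addAll(lrc.resolve(lfn));   // drill down only where the index points
        }
        return pfns;
    }
}
```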
20
4. Reptor/Optor: File Replication and Simulation
Efficient? Requires simulation studies.
  • Tests file replication strategies, e.g. an economic model (sketched below)
  • Reptor: replica architecture
  • Optor: tests file replication strategies with an economic model
  • Demo and Poster: Studying Dynamic Grid Optimisation Algorithms for File Replication
21
Application Requirements
  • "The current EMBL production database is 150 GB, which takes over four hours to download at full bandwidth capability at the EBI. The EBI's data repositories receive 100,000 to 250,000 hits per day, with 20% from UK sites; 563 unique UK domains, with 27 sites having more than 50 hits per day." (MyGrid Proposal)
  • This suggests:
  • less emphasis on efficient data access and data hierarchy aspects (application specific);
  • large gains in biological applications from efficient file replication;
  • larger gains from application-specific replication?
22
Events.. to Files.. to Events
Event 1, Event 2, Event 3, ...
[Diagram: each event's data is spread across data files at every level of the hierarchy; RAW data files at Tier-0 (International), ESD data files at Tier-1 (National), AOD data files at Tier-2 (Regional) and TAG data files at Tier-3 (Local), all feeding an Interesting Events List.]
Not all pre-filtered events are interesting, and events that were not pre-filtered may be: hence a File Replication Overhead.
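A minimal sketch of why file-level replication carries this overhead: replicating whole files to reach the interesting events also copies every uninteresting event stored in the same files (the file layout and interesting-event list are invented for illustration):

```java
import java.util.List;

// Illustrative only: the cost of replicating at file granularity.
public class FileReplicationOverhead {
    // Each file holds a contiguous block of events (assumed layout).
    record DataFile(String name, long firstEvent, long lastEvent) {
        boolean contains(long event) { return event >= firstEvent && event <= lastEvent; }
        long events() { return lastEvent - firstEvent + 1; }
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(
                new DataFile("aod-000", 0, 999),
                new DataFile("aod-001", 1000, 1999),
                new DataFile("aod-002", 2000, 2999));
        List<Long> interesting = List.of(17L, 42L, 2500L);   // invented event list

        long eventsCopied = 0;
        for (DataFile f : files) {
            if (interesting.stream().anyMatch(f::contains)) {
                eventsCopied += f.events();   // the whole file must be replicated
            }
        }
        System.out.printf("Copied %d events to obtain %d interesting ones%n",
                          eventsCopied, interesting.size());
    }
}
```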
23
Events.. to Events: Event Replication and Query Optimisation
Event 1, Event 2, Event 3, ...
[Diagram: a distributed (replicated) database holding RAW at Tier-0 (International), ESD at Tier-1 (National), AOD at Tier-2 (Regional) and TAG at Tier-3 (Local), feeding an Interesting Events List. Caption: "Knowledge Stars in Stripes".]
24
Data Grid for the Scientist
In order to get back to the real (or simulated) data...
[Cartoon: the scientist ("@!") reaches the physics ("E = mc²") through the Grid Middleware.]
An incremental process.
At what level is the metadata? File? Event? Sub-event?
25
Summary
  • Yesterday's data access issues are still here
  • They just got bigger (by a factor of 100)
  • A data hierarchy is required to access more data more efficiently, but is insufficient on its own
  • Today's Grid tools are developing rapidly
  • They enable replicated file access across the grid
  • File replication is standard (lfn:, pfn:)
  • Emerging standards for Grid Data Access..
  • Tomorrow .. never knows
  • Replicated events on the Grid?..
  • Distributed databases?..
  • or did that diagram look a little too monolithic?