1. DØ RACE
DØ Internal Computing Review, May 9-10, 2002
Jae Yu
- Introduction
- Current Status
- DØRAM Architecture
- Regional Analysis Centers
- Conclusions
2. How Do You Want to Do Your Analysis?
- John Krane would say, "I want to measure the inclusive jet cross section at my desk at ISU!!"
- Chip Brock would say, "I want to measure the W cross section at MSU!!"
- Meena would say, "I want to find the Higgs at BU!!"
- All of the above should be possible in the near future!!!
- What do we need to do to accomplish the above?
3. What is DØRACE, and Why Do We Need It?
- DØ Remote Analysis Coordination Efforts
- In existence to accomplish:
  - Setting up and maintaining a remote analysis environment
  - Promoting institutional contributions from remote sites
  - Allowing remote institutions to participate in data analysis
- To prepare for the future of data analysis:
  - More efficient and faster delivery of multi-PB data
  - More efficient sharing of processing resources
  - Preparation for possible massive re-processing and MC production, to expedite the process
  - Expedited physics result production
4. DØRACE (cont'd)
- Maintain self-sustained support amongst the remote institutions to build a broader base of knowledge
  - Alleviate the load on the experts by sharing the knowledge, allowing them to concentrate on preparing for the future
- Improve communication between the experiment site and the remote institutions
- Minimize travel around the globe for data access
- Sociological issues for HEP people at their home institutions and within the field
- The primary goal is to allow individual desktop users to make significant contributions without being at the lab
5. From the Nov. Survey
- Difficulties:
  - Hard time setting up initially
  - Lack of updated documentation
  - Rather complicated setup procedure
  - Lack of experience? No forum to share experiences
  - OS version differences (RH6.2 vs 7.1), let alone different OSs
  - Most of the established sites have an easier time updating releases
  - Network problems affecting successful completion of large releases; a 4 GB release takes a couple of hours (SA)
  - No specific responsible persons to ask questions
  - Availability of all the necessary software via UPS/UPD
  - Time differences between continents affecting efficiency
6. DØRACE Strategy
- Categorized the remote analysis system setup by functionality:
  - Desktop only
  - A modest analysis server
- Linux installation
- UPS/UPD installation and deployment
- External package installation via UPS/UPD
  - CERNLIB
  - KAI-lib
  - Root
- Download and install a DØ release
  - Tar-ball for ease of initial setup?
  - Use of existing utilities for latest-release download
- Installation of cvs
- Code development
- KAI C++ compiler
- SAM station setup

Phase 0: Preparation
Phase I: Rootuple Analysis
Phase II: Executables
Phase III: Code Dev.
Phase IV: Data Delivery
7. What has been accomplished?
- Regular bi-weekly meetings on on-week Thursdays
- Remote participation through video conferencing (ISDN) → moving toward switching over to VRVS per the VCTF's recommendation
- Keeping up with progress via site reports
- Providing a forum to share experiences
- DØRACE home page established (http://www-hep.uta.edu/d0race) to lower the barrier posed by the initial setup difficulties
- Updated and simplified setup instructions available on the web → many institutions have participated in refining the instructions
- Tools for DØ software download and installation made available
- More tools identified and in the works (need to automate download and installation as much as we can, if possible a one-button operation)
8. What has been accomplished? (cont'd)
- Release-ready notification system activated
  - Success is defined by the institutions
  - Pull system → you can decide whether to download and install a specific release
- Build error log and dependency-tree utility in place
- Release packets split to minimize network dependence
- Automated one-button release download and installation utility in the works (a rough sketch follows below)
- Held a DØRACE workshop with a hands-on session in Feb.
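As a rough illustration of the one-button idea, here is a minimal sketch of what a pull-style release installer could look like; the server URL, tar-ball naming, and directory layout are all hypothetical assumptions, not the actual utility under development:

```python
#!/usr/bin/env python
# Minimal sketch of a hypothetical pull-style "one-button" release
# installer. The server URL, file naming, and paths are assumptions.
import os
import sys
import tarfile
import urllib.request

RELEASE_SERVER = "http://www-d0.example.edu/releases"  # hypothetical

def install_release(version, dest="/d0/releases"):
    """Download one release tar-ball and unpack it under dest/<version>."""
    url = f"{RELEASE_SERVER}/d0release-{version}.tar.gz"
    local = f"/tmp/d0release-{version}.tar.gz"
    print(f"Fetching {url} (a full release is ~4 GB)...")
    urllib.request.urlretrieve(url, local)
    target = os.path.join(dest, version)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(local) as tarball:  # unpack into the release area
        tarball.extractall(target)
    os.remove(local)
    print(f"Release {version} installed in {target}")

if __name__ == "__main__":
    # Pull model: the user decides which release to install.
    install_release(sys.argv[1])  # e.g. install.py p13.05.00
```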
9. [Figure slide: no transcript available]
10. Where are we?
- DØRACE is entering the next stage:
  - Compilation and running
  - Active code development
  - Propagation of the setup to all institutions
- The instructions seem to be taking shape well
  - Need to maintain them and keep them up to date
- Support to help with problems people encounter
- DØGRID
  - Prepare SAM and other utilities for transparent and efficient remote contribution
  - Need to establish Regional Analysis Centers
11. Proposed DØRAM Architecture
[Diagram: a tiered architecture]
- Central Analysis Center (CAC)
- Regional Analysis Centers (RACs), which provide various services
- Institutional Analysis Centers (IACs)
- Desktop Analysis Stations
12. Why do we need a DØRAM?
- Total Run II data size reaches multiple PB
  - 300 TB and 2.8 PB of RAW data for Run IIa and IIb, respectively
  - 410 TB and 3.8 PB for RAW+DST+TMB
  - 1.0x10^9 / 1.0x10^9 events total
- At the fully optimized 10 sec/event (40 SpecInt95) reconstruction → 1.0x10^10 seconds for a one-time reprocessing of Run IIa
  - Takes 7.6 months using 500 750 MHz (40 SpecInt95) machines at 100% CPU efficiency
  - 1.5 months with 500 4 GHz machines for Run IIa
  - 7.5 to 9 months with 500 4 GHz machines for Run IIb
- Time for data transfer occupying 100% of a gigabit (125 MB/s) network
  - 3.2x10^6 / 3.2x10^7 seconds to transfer the entire data set (a full year with 100% of OC3 bandwidth); a quick arithmetic check follows
13. Why do we need a DØRAM? (cont'd)
- Data should be readily available for expeditious analyses
  - Preferably disk-resident, so that time for caching is minimized
- Analysis compute power should be available without the users having to rely on the CAC
- MC generation should be done transparently
  - Should exploit compute resources at remote sites
  - Should exploit human resources at remote sites
- Minimize resource needs at the CAC
  - Different resources will be needed
14. What is a DØRAC?
- An institute with large, concentrated, and available computing resources
  - Many 100s of CPUs
  - Many 10s of TB of disk cache
  - Many 100s of Mbytes/sec of network bandwidth
  - Possibly equipped with HPSS
- An institute willing to provide services to a few smaller institutes in the region
- An institute willing to provide increased infrastructure as the data from the experiment grows
- An institute willing to provide support personnel if necessary
15. [Figure slide: Chip's W x-sec measurement; diagram only]
16. What services do we want a DØRAC to do?
- Provide intermediary code distribution
- Generate and reconstruct MC data set
- Accept and execute analysis batch job requests
- Store data and deliver it upon request
- Participate in re-reconstruction of data
- Provide database access
- Provide manpower support for the above activities
17. Code Distribution Service
- Current releases are 4 GB total → will grow to >8 GB?
- Why needed?
  - Downloading 8 GB once every week is not a big load on network bandwidth
  - Efficiency of release updates relies on network stability
  - Exploits remote human resources
- What is needed?
  - Release synchronization must be done at all RACs every time a new release becomes available
  - Potentially large disk space to keep releases
  - UPS/UPD deployment at the RACs
    - FNAL-specific
    - Interaction with other systems?
  - Administrative support for bookkeeping
- The current DØRACE procedure works well, even for individual users → do not see the need for this service
18. Generate and Reconstruct MC Data
- Currently done 100% at remote sites
- Why needed?
  - Extremely self-contained: code distribution is done via a tar-ball
  - Demand will grow
  - Exploits available compute resources
- What is needed?
  - A mechanism to automate request processing
  - A Grid that can (sketched schematically below):
    - Accept job requests
    - Package the job
    - Identify and locate the necessary resources
    - Assign the job to the located institution
    - Provide status to the users
    - Deliver or keep the results
- Perhaps the most indisputable task, but do we need a DØRAC for it?
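The six Grid capabilities listed above amount to a simple request-broker life cycle. The sketch below is purely schematic; the class names and site table are invented for illustration and do not correspond to SAM or any existing DØ tool:

```python
# Schematic sketch of the job life cycle listed above (illustrative only).
from dataclasses import dataclass, field

@dataclass
class JobRequest:
    generator_card: str               # MC generation parameters
    n_events: int
    status: str = "queued"
    results: list = field(default_factory=list)

class Broker:
    def __init__(self, sites):
        self.sites = sites            # {site name: free CPU count}
        self.jobs = []

    def accept(self, request):        # 1. accept the job request
        self.jobs.append(request)
        return len(self.jobs) - 1     # job id handed back to the user

    def dispatch(self, job_id):
        job = self.jobs[job_id]
        package = {"card": job.generator_card,
                   "events": job.n_events}          # 2. package the job
        site = max(self.sites, key=self.sites.get)  # 3. locate resources
        self.sites[site] -= 1                       # 4. assign to a site
        job.status = f"running at {site}"           # 5. status for users
        return package, site

    def collect(self, job_id, output):  # 6. deliver or keep the results
        self.jobs[job_id].results.append(output)
        self.jobs[job_id].status = "done"

# Example: broker = Broker({"UTA": 64, "Lancaster": 128}); then accept,
# dispatch, and collect a request as sketched above.
```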
19. Batch Job Processing
- Currently relies on FNAL resources
  - D0mino, ClueD0, CLUBS, etc.
- Why needed?
  - Bring the compute resources closer to the user
  - Distribute the computing load to available resources
  - Allow remote users to process their jobs expeditiously
  - Exploit the available compute resources
  - Minimize the resource load at the CAC
  - Exploit remote human resources
20. Batch Job Processing (cont'd)
- What is needed?
  - Sufficient computing infrastructure to process requests
    - Network
    - CPU
    - Cache storage
    - Access to relevant databases
  - A Grid that can (the same life cycle sketched above):
    - Accept job requests
    - Package the job
    - Identify and locate the necessary resources
    - Assign the job to the located institution
    - Provide status to the users
    - Deliver or keep the results
- This task definitely needs a DØRAC
- What do we do with the input? Keep it at the RACs?
21. Data Caching and Delivery
- Currently only at FNAL
- Why needed?
  - Limited disk cache at FNAL
    - Tape access needed
    - Latencies involved, sometimes very long
  - Delivering data over the network to satisfy all requests within a reasonable time is imprudent
  - Reduce the resource load on the CAC
  - Data should be readily available to the users with minimal delivery latency
22. Data Caching and Delivery (cont'd)
- What is needed?
  - Need to know what data, and how much of it, we want to store
    - 100% of TMB
    - 10-20% of DST?
    - Any RAW data at all?
    - What about MC? 50% of the actual data
  - Data should be on disk to minimize caching latency
    - How much disk space? (50 TB if 100% TMB and 10% DST for Run IIa; see the estimate below)
  - Constant shipment of data to all RACs from the CAC
    - Constant bandwidth occupation (14 MB/sec for Run IIa RAW)
    - Resources from the CAC needed
  - A Grid that can:
    - Locate the data (SAM can do this already)
    - Tell the requester the extent of the request
    - Decide whether to move the data or pull the job over
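A back-of-envelope check of the two numbers in parentheses above, using the Run IIa totals from slide 12 (300 TB RAW, 410 TB RAW+DST+TMB); the DST/TMB split is an assumption for illustration, not a quoted figure:

```python
# Rough check of the ~50 TB cache and ~14 MB/s shipment figures.
# ASSUMPTION: the 110 TB of Run IIa non-RAW data splits into roughly
# ~70 TB of DST and ~40 TB of TMB; the split is not quoted on the slide.
dst_tb, tmb_tb = 70.0, 40.0
print(1.00 * tmb_tb + 0.10 * dst_tb)  # 100% TMB + 10% DST -> ~47 TB

# Shipping the full 410 TB Run IIa set over roughly one year of running
print(410e12 / 3.15e7 / 1e6)          # -> ~13 MB/s, near the quoted 14
```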
23. Data Reprocessing Services
- These include:
  - Re-reconstruction of the data
    - From DST?
    - From RAW?
  - Re-streaming of the data
  - Re-production of TMB data sets
  - Re-production of root-trees
  - Ab initio reconstruction
- Currently done only at the CAC offline farm
24. Reprocessing Services (cont'd)
- Why needed?
  - The CAC offline farm will be busy with fresh-data reconstruction
    - Only 50% of the projected capacity is used for this, but it is going to get harder to re-reconstruct as more data accumulates
  - We will have to:
    - Reconstruct a few times (>2) to improve the data
    - Re-stream TMB
    - Re-produce TMBs from DST and RAW
    - Re-produce root-trees
  - It will take many months to re-reconstruct the large amount of data
    - 1.5 months with 500 4 GHz machines for Run IIa
    - 7.5 to 9 months for full reprocessing of Run IIb
  - Exploit the large resources at remote institutions
  - Expedite re-processing for expeditious analyses
    - Cutting down the time by a factor of 2 to 3 will make a difference
  - Reduce the load on the CAC offline farm
  - In case the CAC offline farm runs into trouble, the RACs can even help out with ab initio reconstruction
25. Reprocessing Services (cont'd)
- What is needed?
  - Permanent storage of the necessary data, because it would take a long time just to transfer it
    - DSTs
    - RAW
  - Large data storage
  - Constant data transfer from the CAC to the RACs as we take and reconstruct data
    - A dedicated file server for data distribution to the RACs
    - Constant bandwidth occupation
    - Sufficient buffer storage at the CAC in case the network goes down (see the estimate below)
    - A reliable and stable network
  - Access to relevant databases
    - Calibration
    - Luminosity
    - Geometry and magnetic field map
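How much buffer storage counts as "sufficient" follows directly from the shipment rate; a one-line estimate at the ~14 MB/s Run IIa rate quoted on slide 22 (the outage durations are arbitrary examples):

```python
# CAC buffer needed to ride out a network outage at ~14 MB/s shipment.
rate_mb_per_s = 14.0
for days in (1, 3, 7):                # example outage lengths
    print(days, "day(s):", rate_mb_per_s * 86400 * days / 1e6, "TB")
    # -> about 1.2 TB per day of outage
```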
26. Database Access Service
- Currently done only at the CAC
- Why needed?
  - For data analysis
  - For reconstruction of data
  - To exploit available resources
- What is needed?
  - Remote DB access software services
  - Some copy of the DB at the RACs
    - A substitute for the Oracle DB at remote sites
    - A means of synchronizing the DBs
27. Reprocessing Services (cont'd)
- Transfer of new TMBs and root-trees to other sites
- Well-synchronized reconstruction code
- A Grid that can:
  - Identify resources on the net
  - Optimize resource allocation for the most expeditious reproduction
  - Move data around if necessary
- A dedicated block of time for concentrated CPU usage if disaster strikes
- Questions:
  - Do we keep copies of all data at the CAC?
  - Do we ship DSTs and TMBs back to the CAC?
- This service is perhaps the most debatable one, but I strongly believe it is one of the most valuable functionalities of a RAC.
28. Progress on the DØRAC Proposal
- Working group members:
  - I. Bertram, R. Brock, F. Filthaut, L. Lueking, P. Mattig, M. Narain, P. Lebrun, B. Thooris, J. Yu, C. Zeitnitz
- A proposal document has been worked on
  - Target is to release it within two weeks, sufficiently prior to the Director's review in June
  - Doc. at http://www-hep.uta.edu/d0race/d0rac-wg/d0rac-spec-050602.pdf
29. DØRAC Implementation Timescale
- Implement the first RAC by Oct. 1, 2002
  - Cluster the associated IACs
  - Transfer the Thumbnail data set constantly from the CAC to the RAC
- Workshop on RACs in Nov. 2002
- Implement the next set of RACs by Apr. 1, 2003
30. Conclusions
- DØRACE has been rather successful
- DØ must prepare for the large-data-set era
  - Need to expedite analyses in a timely fashion
  - Need to distribute data sets throughout the collaboration
- The DØRAC proposal is almost ready for release
- Establishing Regional Analysis Centers will be the first step toward a DØ Grid → by the end of Run IIa (2-3 years)