1. PROOF/Xrootd Tests in GLOW-ATLAS (Wisconsin)
- Alfredo Castañeda, Mengmeng Chen, Annabelle Leung, Bruce Mellado, Neng Xu and Sau Lan Wu
- University of Wisconsin
- Special thanks to Gerri Ganis, Andy Hanushevsky, Jan Iwaszkiewicz and Fons Rademakers, and the BNL team
- Physics Analysis Tools meeting, 24/10/07
2. PROOF/Xrootd
- When the data comes, it will not be possible for a physicist to do the analysis with ROOT on a single node, due to the large data volumes
- We need to move to a model that allows parallel processing for data analysis, i.e. distributed analysis
- As far as software for distributed analysis goes, US ATLAS is going for the Xrootd/PROOF system
- Xrootd is a distributed file system, maintained by SLAC, which is proven to support up to 1000 nodes with no scalability problems within this range
- PROOF (the Parallel ROOT Facility, CERN) is an extension of ROOT allowing transparent analysis of large sets of ROOT files in parallel on compute clusters or multi-core computers
3. PROOF in a Slide
- PROOF: a dynamic approach to end-user HEP analysis on distributed systems, exploiting the intrinsic parallelism of HEP data
[Diagram: PROOF at an Analysis Facility / Tier3]
4. Some Technical Details
- Structure of the PROOF pool
- Redirector
- Worker
- Supervisor
- Procedure of a PROOF job (see the sketch after this list)
- The user submits the PROOF job
- The redirector finds the exact location of each file
- The workers validate each file
- The workers process the ROOT files
- The master collects the results and sends them to the user
- The user makes the plots
- Packetizers: they work like job schedulers
- TAdaptivePacketizer (the default, with dynamic packet size)
- TPacketizer (optional, with fixed packet size)
- TForceLocalPacketizer (special: no network traffic between workers; workers only deal with the files stored locally)
- To be optimized for the Tier3
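
As an illustration of the job procedure above, here is a minimal PyROOT sketch of a PROOF session; the redirector host, file names, tree name and selector are hypothetical placeholders, not our actual setup.

# Minimal PyROOT sketch of the PROOF job procedure described above.
# The redirector host, file names, tree name and selector are placeholders.
import ROOT

# The user opens a session against the PROOF redirector (master)
proof = ROOT.TProof.Open("proof-redirector.example.edu")

# Optionally pick a packetizer, e.g. the fixed-packet-size TPacketizer
proof.SetParameter("PROOF_Packetizer", "TPacketizer")

# Build a chain of ntuple files served by xrootd and attach it to PROOF;
# the redirector locates the files and the workers validate them
chain = ROOT.TChain("CollectionTree")
chain.Add("root://proof-redirector.example.edu//data/cbnt/ntuple_0001.root")
chain.Add("root://proof-redirector.example.edu//data/cbnt/ntuple_0002.root")
chain.SetProof()

# The workers process the chain with a TSelector; the master merges the
# results, which the user then plots locally
chain.Process("MyAnalysisSelector.C+")
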
5. PROOF test farms at GLOW-ATLAS
- Big pool
- 1 redirector + 86 computers
- 47 AMD 4x2.0 GHz cores, 4 GB memory
- 39 Pentium 4 2x2.8 GHz, 2 GB memory
- We use just the local disk for performance tests
- Only one PROOF worker runs on each node
- Small pool A
- 1 redirector + 2 computers
- 4 x AMD 2.0 GHz cores, 4 GB memory, 70 GB disk
- Best performance with 8 workers running on each node
- Small pool B
- 1 redirector + 2 computers
- 8 x Intel 2.66 GHz cores, 16 GB memory, 8x750 GB on RAID 5
- Best performance with 8 workers running on each node; used mainly for high-performance tests
6. Xrootd/PROOF Tests at GLOW-ATLAS
- Focused on the needs of a university-based Tier3
- Dedicated farms for data analysis, including detector calibration and performance, and physics analysis with high-level objects
- Various performance tests and optimizations
- Performance in various hardware configurations
- Response to different data formats, volumes and file multiplicities
- Understanding the system with multiple users
- Developing new ideas with the PROOF team
- Tests and optimization of packetizers
- Understanding the complexities of the packetizers
7. PROOF test webpage
- http://www-wisconsin.cern.ch/nengxu/proof/
8. Tests with CBNT Ntuples
- Pool B: Intel 8x2.66 GHz cores, 16 GB memory, RAID 5 with 8x750 GB disks
9. Tests with EV H??? ntuples on Pool B
- Intel 8x2.66 GHz cores, 16 GB memory, RAID 5 with 8x750 GB disks
10. Monitoring System (MonALISA)
[MonALISA plots: CPU and memory usage with TAdaptivePacketizer vs. TPacketizer on a single disk]
11. Our Views on a Tier3 at GLOW
Putting PROOF into Perspective
12. Main Issues to Address
- Network traffic
- Avoiding empty CPU cycles
- Urgent need for CPU resources
- Bookkeeping, management and processing of large amounts of data
Core Technologies
- Condor
- Job management
- MySQL
- Bookkeeping and file management
- Xrootd
- Storage
- PROOF
- Data analysis
13. One possible way to go...
- GRID
- Computing pool: computing nodes with small local disks.
- The gatekeeper: takes the production jobs from the Grid and submits them to the local pool.
- Batch system: normally Condor, PBS, LSF, etc.
- The users: submit their own jobs to the local pool.
- Heavy I/O load.
- Dedicated PROOF pool: CPU cores + big disks.
- Storage pool: centralized storage servers (NFS, Xrootd, dCache, CASTOR).
- CPUs are idle most of the time.
14. The way we want to go...
- GRID
- The gatekeeper: takes the production jobs from the Grid and submits them to the local pool.
- Local job submission: users' own jobs go to the whole pool.
- Pure computing pool: CPU cores + small local disks.
- Xrootd pool: CPU cores + big disks.
- Less I/O load.
- PROOF job submission: users' PROOF jobs go to the Xrootd pool.
- Storage pool: very big disks.
15. The Multi-layer Condor System
- PROOF Queue: for PROOF jobs; covers all the CPUs; does not affect the other Condor queues; jobs get a CPU immediately.
- PROOF job submission: users' PROOF jobs go to the Xrootd pool.
- Fast Queue: for high-priority private jobs; no limit on the number of jobs; run-time limit; covers all the CPUs, half with suspension and half without; highest priority.
- I/O Queue: for I/O-intensive jobs; no limit on the number of jobs; no run-time limit; covers the CPUs in the Xrootd pool; higher priority.
- Local job submission: users' own jobs go to the whole pool.
- Local Job Queue: for private jobs; no limit on the number of jobs; no run-time limit; covers all the CPUs; higher priority.
- Production Queue: no pre-emption; covers all the CPUs; maximum run time of 3 days; no limit on the number of jobs.
- The gatekeeper: takes the production jobs from the Grid and submits them to the local pool.
A sketch of how a job could be tagged for one of these queues is shown below.
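
The snippet below is only a hedged sketch of how a user job could be tagged for one of these queues with a custom ClassAd attribute in a Condor submit description; the attribute name, executable and the machine-side policy that interprets it are assumptions, not our actual configuration.

# Hypothetical sketch: tag a job for the Fast Queue with a custom ClassAd
# attribute.  The attribute name, executable and the startd-side policy
# that interprets it are assumptions.
import subprocess

submit_description = """\
universe    = vanilla
executable  = run_analysis.sh
arguments   = --input myfile.root
output      = fast_job.out
error       = fast_job.err
log         = fast_job.log
# Custom attribute an assumed machine policy uses to give these jobs the
# highest priority (suspending lower-priority jobs on half of the CPUs)
+JobQueue   = "fast"
queue
"""

with open("fast_job.sub", "w") as submit_file:
    submit_file.write(submit_description)

# Hand the description to the local Condor scheduler
subprocess.check_call(["condor_submit", "fast_job.sub"])
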
16. Our Xrootd File Tracking System Framework
[Diagram: Xrootd file tracking framework, showing Local_xrd.sh, the DATA area, and an Xrootd_sync.py instance on each data server]
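
The internals of Local_xrd.sh and Xrootd_sync.py are not shown on the slide; purely as an illustration, the sketch below shows what a per-node sync script could look like, assuming a hypothetical MySQL table files(node, path, size) and the MySQLdb module.

# Rough sketch of what a per-node Xrootd_sync.py could do: scan the local
# xrootd data area and record each file in a central MySQL catalogue.
# The table layout, host names and paths are assumptions, not the actual
# GLOW-ATLAS implementation.
import os
import socket
import MySQLdb

DATA_DIR = "/xrootd/data"   # local xrootd storage area (assumed path)
NODE = socket.gethostname()

db = MySQLdb.connect(host="tracking-db.example.edu", user="xrdsync",
                     passwd="secret", db="xrootd_files")
cur = db.cursor()

for dirpath, dirnames, filenames in os.walk(DATA_DIR):
    for name in filenames:
        path = os.path.join(dirpath, name)
        size = os.path.getsize(path)
        # One row per file; REPLACE keeps the catalogue in sync on re-runs
        cur.execute("REPLACE INTO files (node, path, size) "
                    "VALUES (%s, %s, %s)", (NODE, path, size))

db.commit()
db.close()
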
17. The I/O Queue
[Diagram: submitting node, MySQL database server, Condor master and the Xrootd pool (CPU cores + big disks), with numbered arrows corresponding to the steps below]
0. The tracking system provides file locations in the Xrootd pool.
1. The submission node asks the MySQL database for the input file location.
2. The database provides the location of the file and also its validation info.
3. The submission node adds the location to the job requirements and submits to the Condor system (see the sketch below).
4. Condor sends the job to the node where the input file is stored.
5. The node runs the job and puts the output file on the local disk.
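
Below is a minimal sketch of the submission-node side of steps 1-4, assuming the same hypothetical files table and custom attribute as above; it is meant to illustrate the flow, not to reproduce the actual tools.

# Sketch of steps 1-4 on the submitting node: look up the input file in the
# MySQL catalogue, pin the Condor job to the machine that stores it, submit.
# Table layout, attribute names and host names are assumptions.
import subprocess
import MySQLdb

input_file = "/xrootd/data/cbnt/ntuple_0001.root"

# Steps 1-2: ask the database where the (validated) file lives
db = MySQLdb.connect(host="tracking-db.example.edu", user="reader",
                     passwd="secret", db="xrootd_files")
cur = db.cursor()
cur.execute("SELECT node FROM files WHERE path = %s", (input_file,))
row = cur.fetchone()
db.close()

if row is None:
    raise SystemExit("Input file not in the catalogue; job not submitted.")
node = row[0]

# Step 3: restrict the job to that node via the Requirements expression
submit_description = """\
universe     = vanilla
executable   = run_io_job.sh
arguments    = %s
requirements = (Machine == "%s")
+JobQueue    = "io"
output       = io_job.out
error        = io_job.err
log          = io_job.log
queue
""" % (input_file, node)

with open("io_job.sub", "w") as submit_file:
    submit_file.write(submit_description)

# Step 4: Condor sends the job to the node holding the file
subprocess.check_call(["condor_submit", "io_job.sub"])
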
18. Benefits of the I/O queue
- Reduces a large amount of data transfer, especially over the network. Even local file transfers can be avoided: input/output files can be accessed directly from the local storage disk.
- Better usage of CPU resources: CPU cycles won't be wasted waiting for file transfers.
- During job submission, users can see which files are unavailable, and a job won't be submitted if its input file cannot be found. This reduces the waste of CPU cycles.
- Better use of the Xrootd/PROOF pool. (Before, the PROOF pool only stored analyzable data, like CBNT, AOD, DPD. Now it can also store RDO, ESD and raw data for production.)
- Users can use dq2-like commands to submit jobs.
- Transparent to the user.
19. Outlook
- We are gaining experience with PROOF/Xrootd and understanding its performance
- Extensive tests are being performed with different hardware/software configurations and data formats. A webpage has been set up for people to follow them:
- http://www-wisconsin.cern.ch/nengxu/proof/
- Installed the MonALISA monitoring system
- We have spelled out our views on a university-based Tier3 facility for data analysis and production
- Use Condor/MySQL/PROOF/Xrootd
- Have developed a prototype of the multi-layer Condor system, including the I/O queue for I/O-intensive jobs
- Have developed a file tracking system for Xrootd
- Aiming at compatibility with Panda and ATLAS DDM