1
PROOF/Xrootd Tests in GLOW-ATLAS (Wisconsin)
  • Alfredo Castañeda, Mengmeng Chen, Annabelle
    Leung,
  • Bruce Mellado, Neng Xu and Sau Lan Wu
  • University of Wisconsin
  • Special thanks to Gerri Ganis, Andy Hanushevsky,
    Jan Iwaszkiewicz and Fons Rademakers and the BNL
    team
  • Physics Analysis Tools meeting, 24/10/07

2
PROOF/XROOTD
  • When the data arrives, the data volumes will be
    too large for a physicist to do analysis with
    ROOT on a single node
  • Need to move to a model that allows parallel
    processing for data analysis, i.e. distributed
    analysis
  • For distributed-analysis software, US ATLAS is
    adopting the Xrootd/PROOF system
  • Xrootd is a distributed file system maintained
    by SLAC, proven to support up to 1000 nodes with
    no scalability problems within this range
  • PROOF (the Parallel ROOT Facility, CERN) is an
    extension of ROOT allowing transparent analysis
    of large sets of ROOT files in parallel on
    compute clusters or multi-core computers (a
    minimal usage sketch follows)
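
For illustration only (not from the slides), a minimal PyROOT sketch of how an end user drives a PROOF session; the redirector hostname, file paths, tree name and selector are placeholders:

```python
# Minimal PyROOT sketch of a PROOF session; hostnames, file paths,
# the tree name and the selector are placeholders, not from the slides.
import ROOT

# Connect to the PROOF master (hypothetical redirector hostname)
proof = ROOT.TProof.Open("proof://redirector.hep.wisc.edu")

# Build a chain of ntuple files served by Xrootd (hypothetical paths)
chain = ROOT.TChain("CollectionTree")
chain.Add("root://redirector.hep.wisc.edu//data/cbnt/sample_0.root")
chain.Add("root://redirector.hep.wisc.edu//data/cbnt/sample_1.root")

# Attach the chain to the PROOF session; Process() distributes the
# entries over the workers transparently and merges the results
chain.SetProof()
chain.Process("MySelector.C+")
```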

Bruce Mellado, 24/10/07
3
PROOF in a Slide
  • PROOF: a dynamic approach to end-user HEP
    analysis on distributed systems, exploiting the
    intrinsic parallelism of HEP data

Analysis Facility, Tier3
Bruce Mellado, 24/10/07
4
  • Structure of the PROOF pool
  • Redirector
  • Worker
  • Supervisor
  • Procedure of a PROOF job
  • User submits the PROOF job
  • Redirector finds the exact location of each file
  • Workers validate each file
  • Workers process the ROOT files
  • Master collects the results and sends them to
    the user
  • User makes the plots
  • Packetizers. They work like job schedulers (a
    selection sketch follows below).
  • TAdaptivePacketizer (the default, with dynamic
    packet size)
  • TPacketizer (optional, with fixed packet size)
  • TForceLocalPacketizer (special, with no network
    traffic between workers; workers only process
    the files stored locally)

Some Technical Details
To be optimized for the Tier3
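
As an illustration (not from the slides), a hedged PyROOT sketch of switching the packetizer for a PROOF session via the PROOF_Packetizer session parameter; the hostname is a placeholder:

```python
# Hedged sketch: pick the packetizer for a PROOF session.  The
# redirector hostname is a placeholder; PROOF_Packetizer is the
# session parameter used to select the packetizer class.
import ROOT

proof = ROOT.TProof.Open("proof://redirector.hep.wisc.edu")

# Override the default (adaptive) packetizer with the fixed-size one
proof.SetParameter("PROOF_Packetizer", "TPacketizer")

# ... then define a chain/dataset and Process() as usual; the chosen
# packetizer decides how entries are grouped into packets for workers.
```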
Bruce Mellado, 24/10/07
5
PROOF test farms at GLOW-ATLAS
  • Big pool
  • 1 redirector + 86 computers
  • 47 with AMD 4x2.0 GHz cores, 4 GB memory
  • 39 with Pentium 4 2x2.8 GHz, 2 GB memory
  • We use just the local disk for performance tests
  • Only one PROOF worker runs on each node
  • Small pool A
  • 1 redirector + 2 computers
  • 4 x AMD 2.0 GHz cores, 4 GB memory, 70 GB disk
  • Best performance with 8 workers running on each
    node
  • Small pool B
  • 1 redirector + 2 computers
  • 8 x Intel 2.66 GHz cores, 16 GB memory, 8x750 GB
    on RAID 5
  • Best performance with 8 workers running on each
    node; used mainly for high-performance tests

Bruce Mellado, 24/10/07
6
Xrootd/PROOF Tests at GLOW-ATLAS
  • Focused on needs of a university-based Tier3
  • Dedicated farms for data analysis, including
    detector calibration and performance, and physics
    analysis with high level objects
  • Various performance tests and optimizations
  • Performance in various hardware configurations
  • Response to different data formats, volumes and
    file multiplicities
  • Understanding the system with multiple users
  • Developing new ideas with the PROOF team
  • Tests and optimization of packetizers
  • Understanding the complexities of the packetizers

Bruce Mellado, 24/10/07
7
PROOF test webpage
  • http://www-wisconsin.cern.ch/nengxu/proof/

Bruce Mellado, 24/10/07
8
Tests with CBNT Ntuples
  • Pool B: Intel 8x2.66 GHz cores, 16 GB memory,
    RAID5 with 8x750 GB disks

Bruce Mellado, 24/10/07
9
  • Test with EV H??? ntuples on Pool B
  • Intel 8x2.66 GHz cores, 16 GB memory, RAID5 with
    8x750 GB disks

Bruce Mellado, 24/10/07
10
Monitoring System (MonALISA)
(Plots: CPU and memory usage with TAdaptivePacketizer vs. TPacketizer, single disk)
Bruce Mellado, 24/10/07
11
Our Views on a Tier3 at GLOW
Putting PROOF into Perspective
12
Main Issues to Address
  • Network Traffic
  • Avoiding Empty CPU cycles
  • Urgent need for CPU resources
  • Bookkeeping, management and processing of large
    amounts of data

Core Technologies
  • CONDOR
  • Job management
  • MySQL
  • Bookkeeping and file management
  • XROOTD
  • Storage
  • PROOF
  • Data analysis

Bruce Mellado, 24/10/07
13
One possible way to go...
  • GRID
  • Computing pool: computing nodes with small local
    disks
  • The gatekeeper: takes the production jobs from
    the Grid and submits them to the local pool
  • Batch system: normally Condor, PBS, LSF, etc.
  • The users: submit their own jobs to the local
    pool
  • Heavy I/O load
  • Dedicated PROOF pool: CPU cores + big disks
  • Storage pool: centralized storage servers (NFS,
    Xrootd, dCache, CASTOR)
  • CPUs are idle most of the time
Bruce Mellado, 24/10/07
14
The way we want to go...
  • GRID
  • The gatekeeper: takes the production jobs from
    the Grid and submits them to the local pool
  • Local job submission: users' own jobs to the
    whole pool
  • Pure computing pool: CPU cores + small local
    disks
  • Xrootd pool: CPU cores + big disks
  • Less I/O load
  • PROOF job submission: users' PROOF jobs to the
    Xrootd pool
  • Storage pool: very big disks
Bruce Mellado, 24/10/07
15
The Multi-layer Condor System
  • PROOF Queue: for PROOF jobs; covers all the
    CPUs; no effect on the Condor queues; jobs get
    the CPUs immediately
  • PROOF job submission: users' PROOF jobs to the
    Xrootd pool
  • Fast Queue: for high-priority private jobs; no
    limit on the number of jobs; run-time limited;
    covers all the CPUs, half with suspension and
    half without; highest priority
  • I/O Queue: for I/O-intensive jobs; no limit on
    the number of jobs; no run-time limit; covers
    the CPUs in the Xrootd pool; higher priority
  • Local job submission: users' own jobs to the
    whole pool
  • Local Job Queue: for private jobs; no limit on
    the number of jobs; no run-time limit; covers
    all the CPUs; higher priority
  • Production Queue: no pre-emption; covers all the
    CPUs; maximum 3 days; no limit on the number of
    jobs
  • The gatekeeper: takes the production jobs from
    the Grid and submits them to the local pool
Bruce Mellado, 24/10/07
16
Our Xrootd File Tracking System Framework
(Diagram: Local_xrd.sh and an Xrootd_sync.py instance on each data node tracking the DATA stored locally)
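
For illustration only, a hedged Python sketch of what a sync script in the spirit of Xrootd_sync.py might do: walk the local Xrootd data area on a storage node and record each file in the central MySQL catalogue. The database host, credentials, table and column names, and the data path are assumptions, not taken from the slides.

```python
# Hypothetical sketch of an xrootd_sync.py-style file tracker.
# All names (paths, DB host, table/columns) are illustrative only.
import os
import socket
import MySQLdb  # MySQL-python bindings

DATA_DIR = "/xrootd/data"  # local Xrootd storage area (assumed)

db = MySQLdb.connect(host="dbserver.example.edu", user="xrootd",
                     passwd="secret", db="file_catalog")
cur = db.cursor()
node = socket.gethostname()

# Walk the local data area and record every file in the catalogue
for dirpath, dirnames, filenames in os.walk(DATA_DIR):
    for name in filenames:
        path = os.path.join(dirpath, name)
        cur.execute(
            "REPLACE INTO files (lfn, node, size) VALUES (%s, %s, %s)",
            (path, node, os.path.getsize(path)))

db.commit()
db.close()
```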
Bruce Mellado, 24/10/07
17
The I/O Queue
(Diagram: submitting node, Condor master, MySQL database server and the Xrootd pool (CPU cores + big disks), connected by the numbered steps 0-5 below)
0. The tracking system provides file locations in
   the Xrootd pool.
1. The submitting node asks the MySQL database for
   the input file location.
2. The database returns the file's location and its
   validation info.
3. The submitting node adds the location to the job
   requirements and submits to the Condor system.
4. Condor sends the job to the node where the input
   file is stored.
5. The node runs the job and puts the output file on
   the local disk.
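
For illustration only, a hedged Python sketch of steps 1-4: query the MySQL catalogue for the input file's node, pin the Condor job to that node via its requirements, and submit it. The database host, table and column names, and the submit-file layout are assumptions, not taken from the slides.

```python
# Hypothetical sketch of the I/O queue submission path (steps 1-4).
# DB host, table/column names and file paths are illustrative only.
import subprocess
import MySQLdb

def locate(lfn):
    """Steps 1-2: ask the catalogue which Xrootd node holds a valid copy."""
    db = MySQLdb.connect(host="dbserver.example.edu", user="reader",
                         passwd="secret", db="file_catalog")
    cur = db.cursor()
    cur.execute("SELECT node FROM files WHERE lfn = %s AND valid = 1", (lfn,))
    row = cur.fetchone()
    db.close()
    return row[0] if row else None

def submit(lfn, executable):
    """Steps 3-4: add the file's node to the requirements and submit."""
    node = locate(lfn)
    if node is None:
        # Unavailable input file: the job is not submitted (see next slide)
        raise RuntimeError("input file not found in catalogue: " + lfn)
    job = ('universe     = vanilla\n'
           'executable   = %s\n'
           'arguments    = %s\n'
           'requirements = (Machine == "%s")\n'
           'queue\n') % (executable, lfn, node)
    with open("io_job.submit", "w") as f:
        f.write(job)
    subprocess.call(["condor_submit", "io_job.submit"])

submit("/xrootd/data/cbnt/sample_0.root", "run_analysis.sh")
```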
Bruce Mellado, 24/10/07
18
Benefits of the I/O queue
  • Reduces the amount of data transferred,
    especially over the network; even local file
    transfers can be avoided, since input/output
    files are accessed directly from the local
    storage disk
  • Better usage of CPU resources: CPU cycles won't
    be wasted waiting for file transfers
  • During job submission, users can see which files
    are unavailable, and a job won't be submitted if
    its input file cannot be found; this reduces
    wasted CPU cycles
  • Better use of the Xrootd/PROOF pool (before, the
    PROOF pool stored only analysis-ready data such
    as CBNT, AOD and DPD; now it can also store RDO,
    ESD and raw data for production)
  • Users can use dq2-like commands to submit jobs
  • Transparent to the user

Bruce Mellado, 24/10/07
19
Outlook
  • We are gaining experience in PROOF/Xrootd and
    understanding its performance
  • Extensive tests are being performed using
    different hardware/software configurations and
    data formats; a webpage has been set up for
    people to follow them:
  • http://www-wisconsin.cern.ch/nengxu/proof/
  • Installed the MonALISA monitoring system
  • We have spelled out our views on a
    university-based Tier3 facility for data
    analysis and production
  • Uses Condor/MySQL/PROOF/Xrootd
  • Have developed a prototype of the multi-layer
    Condor system, including the I/O queue for
    I/O-intensive jobs
  • Have developed a file tracking system for Xrootd
  • Aiming at compatibility with Panda and ATLAS DDM

Bruce Mellado, 24/10/07