Title: The DØ Computing Model
1 The DØ Computing Model
- Overview
- The picture
- Planning history
- Status of acquisitions
- Performance
- More detail
- On the current operation
- On the R&D
- General Status
- Future plan
2 Overview
- The data handling system
- SAM -> ENSTORE -> Robot(s)
- The offline user computing systems
- dØmino - O(20 TB) disk
- linux analysis server(s) - O(2 TB) disk
- linux development machines - O(0.2 TB)
- build cluster
- ClueDØ
- remote linux machines
- non-development desktops
- Associated systems
- Fermilab production farm (raw data reconstruction)
- Remote production farms (simulation)
- Database servers
3 [Figure: diagram of the computing systems; surviving labels: ClueDØ Server, High speed Network]
4 Planning history
- Original plan: January 97
- DØ Internal Review: February 97
- External review: Von Rüden Committee
- Mar 97, Oct 97, Jun 98, Jan 99, Jun 99
- Funding profile (DMNAG - Joint with CDF) approved 97
- Plan updates
- January 99 for VR IV
- Global Computing Model reports (98-99); addition of Analysis Servers to plan
- Plan implementation 97 - 01
- Run II Computing and Software Project: co-leaders, Computing Planning Board
5 Status of acquisitions
- Analysis cpu
- dØmino: 192 proc O2000, complete (except add memory)
- Desktops: responsibility of institutions
- Analysis Clusters/Servers: 1 purchased of (6?)
- Reconstruction cpu
- 200 processors acquired of 400 planned
- 40 Hz cap @ current reco cpu perf., 80 Hz @ target reco perf
- Disk storage
- 30 TB total - complete (plan was 15 TB)
- See allocation slide
6 Disk space in the offline systems
Total available disk space: 30 Tbytes
3 Tbytes are on D0test, d0lxac1, d0lxbld; 27 Tbytes are on D0MINO

Disk space on D0MINO (all units are Tbytes)

Area                               Allocated  Used      Available
Scratch, releases & other config.  1          1         1
SAM cache                          6          variable  6
DST/mDST                           12         variable  12
Project disks                      2.6        2.0?      4
Tmp (group space)                  0.9        ?         2
Contingency                        -          -         2
TOTAL                              22.5       -         27
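
The column totals in the table above can be cross-checked with a few lines of Python. This is only a sketch: the row values are copied from the table, the dictionary and variable names are illustrative, and the Used column is left out because most of its entries are "variable" or "?".

```python
# Disk space on D0MINO in Tbytes: (allocated, available) per area.
# Values are copied from the slide; all names here are illustrative.
d0mino_disk = {
    "Scratch, releases, other config.": (1.0, 1.0),
    "SAM cache": (6.0, 6.0),
    "DST/mDST": (12.0, 12.0),
    "Project disks": (2.6, 4.0),
    "Tmp (group space)": (0.9, 2.0),
    "Contingency": (0.0, 2.0),  # held back, nothing allocated
}

# Round to one decimal to hide floating-point noise in the sums.
total_allocated = round(sum(a for a, _ in d0mino_disk.values()), 1)
total_available = round(sum(v for _, v in d0mino_disk.values()), 1)
other_systems = 3.0  # D0test, d0lxac1, d0lxbld

print(f"allocated on D0MINO: {total_allocated} TB")
print(f"available on D0MINO: {total_available} TB")
print(f"all offline systems: {total_available + other_systems} TB")
```

The sums reproduce the TOTAL row (22.5 allocated, 27 available) and the 30 Tbyte overall figure.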
7 Status of acquisitions (contd)
- Robotic tape storage
- 1 ADIC robot (750 TB capacity) - complete
- 18 Mammoth II tape drives - will be retired
- 6 LTO drives - now
- 2 STK robots (600 TB capacity) - FY02
- 9 STK 9940 drives - FY02
- Post shutdown stopgap - use existing STKen w/ 4 drives
- Database servers - complete
- 2 SUN systems w/ 600 GB disk
8 Performance
- Farm production stats
- dØmino cpu & mem stats
- AC1 cpu & mem stats
- SAM encp stats
- Disk usage stats
- Conclusion: chief needs
- More memory for dØmino
- More reliable tape drives
- More farm nodes
- More linux cpu
- Open questions
- DB server upgrades?
9 Farm Production Statistics
- See web link from Main DØ Computing for weekly reports
- Week of 08/31 - 09/06: 800,000 evts proc / 140,000 from data collected in that week; 1.9 M events collected in that week
- Problems in this week: encp problem (code change from ENSTORE); disk failure on dØbbin (the farm IO server); several other problems as well...
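
The weekly numbers above can be turned into simple ratios. The sketch below assumes that "800,000 evts proc" is the total processed that week and that 140,000 of those came from data collected in the same week. The backlog estimate also assumes no reprocessing. Variable names are illustrative.

```python
# Event counts for the week of 08/31 - 09/06, as quoted on the slide.
events_collected = 1_900_000      # events collected that week
events_processed = 800_000        # events processed on the farm that week
processed_same_week = 140_000     # processed events that were collected that week

# How well the farm kept up with data taking.
processed_fraction = events_processed / events_collected
same_week_fraction = processed_same_week / events_collected

# Assumption: backlog growth is collected minus processed (no reprocessing).
backlog_growth = events_collected - events_processed

print(f"farm throughput vs. collection: {processed_fraction:.0%}")
print(f"same-week data processed:       {same_week_fraction:.1%}")
print(f"net backlog growth:             {backlog_growth:,} events")
```

Under these assumptions the farm processed well under half of what was collected that week, which is consistent with "more farm nodes" appearing among the chief needs.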
10 The Current Operation
- Code release model
- Mapping activities to systems
- ClueD0 operation
- Remote farm operation
- Role of the ORB
11 The code release model
- Weekly test releases
- Production releases every three months
- Weekly subsystem coordinators meeting; minutes to d0rug mailing list
- Rules for interface changes
- Schedules for big disruptive changes (e.g. switch to KAI 4.0)
12 Mapping activities to systems
- Code development: your Linux box, if possible; d0mino is the backup solution
- Large sample processing: a SAM station
- d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)
- Small sample processing: create derived DS on SAM station, transfer to desktop
- Office/Web browsing: use your desktop!
- Remote users: new position to address needs
13 Mapping activities to systems
- Disk usage
- Home areas - backed up; you can ask for up to 250 MB (possibility of more for good reason) BUT NFS-mounted - don't use for data files!
- TMP areas - not backed up. Code development and/or data files, allocated per institution. 37 institutions are using it so far. A good place to start off if you are not working with a well-defined project.
- PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools: commissioning, algorithm development, simulation, plus physics and ID groups and some smaller projects.
- Web pages - DØ Main Computing (SAM Data Handling section) --> General description of where data samples are stored in our system
14 ClueD0 Operation
- The current population is 111 nodes with 138 CPUs and a total memory of 37 GB; 396 users
- Rules for joining and policies can be found at http://www-clued0.fnal.gov/clued0/ and http://www-clued0.fnal.gov/clued0/policies.html
- Current difficulties from the lack of Redhat 7.1 builds are being actively worked on
15 Monte Carlo Production Status
- Current Software: mcp07
- p07.00.05a: Generator, DØgstar, DØsim
- p08.12.00: DØreco, recoanalyze
- 950 kevents generated at reco level
- Run IIB Simulation is a major effort
- Will move to p08.13.00 to remove memory leak
- Future Releases: p09.10.00
- Problem running DØgstar under investigation
- Plate level available
- p10 certification will be available by the end of the month
16 The Offline Resources Board
- Charge: allocate offline resources according to the experiment's priorities
- Project & tmp disk
- Sample priorities for simulation on remote farms
- Partitions in SAM cache
- Batch queues
- Chair: Nick Hadley
- Web Page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
- Institutions which have no tmp disk allocation and have active users
- email to hadley@fnal.gov
- 18 GB will be allocated
17 R&D
- Analysis clusters - one in service
- ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed
- Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests
- Remote farms for raw data reconstruction and analysis
- Remote desktop analysis
18 Institutional contributions
- Desktop seats
- Backup tapes
- Remote simulation capacity
- Disk for dØmino via budget code - issues
- How to allocate between project & tmp?
- Lifetime for contribution?
- Unit of contribution: 1 rack of disk
- Analysis cluster for Feynman via budget code
- Similar issues
- Analysis cluster for ClueDØ - all the above issues, plus SAM bandwidth, networking, sysadmin, ...
19 General Status - Where are the limits/problems?
- Online
- Max rate tested: 40 Hz to tape
- Max rate sustained for a shift, to date: 25 Hz to tape
- Max rate expected with next iteration: 60 Hz to tape
- Final limitation: tape budget (FY02: 400 TB)
- Running p10 on the farms
- Processes raw data @ 23 sec/event
- Thanks to Alg Group - worked out of box on raw data
- Limits: 2-3 Hz w/ current nodes and cpu perf of reco; output size HUGE - writing too much tape, breaking DB model, using more than allocated network and disk resources all down the line
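
To put the 23 sec/event figure next to the online rates, the sketch below estimates how many CPUs would be needed to keep up. It rests on an idealized assumption that is not in the slides: each CPU reconstructs one event at a time in exactly 23 seconds, with perfect scaling and no I/O or scheduling overhead. The function name is illustrative.

```python
import math

SEC_PER_EVENT = 23.0  # p10 reconstruction time per event, from the slide


def cpus_needed(rate_hz: float, sec_per_event: float = SEC_PER_EVENT) -> int:
    """CPUs needed to sustain rate_hz, assuming ideal linear scaling."""
    return math.ceil(rate_hz * sec_per_event)


# Online rates to tape quoted on the slide.
online_rates = [
    ("sustained for a shift", 25),
    ("max tested", 40),
    ("expected next iteration", 60),
]

for label, rate in online_rates:
    print(f"{label:>24}: {rate} Hz -> {cpus_needed(rate)} CPUs")

# The quoted farm limit of 2-3 Hz corresponds, under the same assumption,
# to this many fully busy CPUs.
for rate in (2, 3):
    print(f"{rate} Hz farm limit -> {cpus_needed(rate)} CPUs busy")
```

Real farm throughput depends on node speed and efficiency, so these counts only indicate the scale of the gap between reconstruction and data taking.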
20 Expected Farm Performance
21 General Status - Where are the limits/problems?
- SAM/ENSTORE status
- Working for many months with servers on automatic recovery
- Not all features complete (pick events)
- 5 GB interfaces -> can deliver 150 MB/sec to dØmino
- Robot status
- Design rates met, but robustness severely limited by M II drive error rate - plan switchover by end of shutdown
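
For a sense of scale, the quoted 150 MB/sec delivery rate can be converted into the time needed to move a given volume of data. The sketch below uses the 6 Tbyte SAM cache from the disk allocation slide as an example. It assumes decimal units (1 TB = 10^6 MB) and a constant, uncontended rate; the function name is illustrative.

```python
DELIVERY_RATE_MB_S = 150.0  # quoted delivery rate to dØmino
SAM_CACHE_TB = 6.0          # SAM cache size from the disk allocation slide


def transfer_hours(size_tb: float, rate_mb_s: float = DELIVERY_RATE_MB_S) -> float:
    """Hours to move size_tb at rate_mb_s, assuming 1 TB = 10**6 MB."""
    seconds = size_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


print(f"fill SAM cache ({SAM_CACHE_TB} TB): {transfer_hours(SAM_CACHE_TB):.1f} h")
```

At the full quoted rate, refreshing the entire cache would take on the order of half a day.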
22 Future Plan
- Major purchases still in FY02
- New robot and reliable drives
- New farm nodes
- More memory for dØmino
- Some linux cpu
- Continue R&D for linux analysis strategies
- Hope to establish effectiveness and practicality of the three proposed models: AC, CS, AC@DØ
- Operational improvements
- SAM personnel @ DØ
- RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, output size issues