The DØ Computing Model

Sept 12 2001. Wyatt Merritt. DØ Collaboration Meeting Plenary Session.

Provided by: kwyattm
1
The DØ Computing Model
  • Overview
  • The picture
  • Planning history
  • Status of acquisitions
  • Performance
  • More detail
  • On the current operation
  • On the R&D
  • General Status
  • Future plan

2
Overview
  • The data handling system
  • SAM → ENSTORE → Robot(s)
  • The offline user computing systems
  • dØmino - O (20 TB) disk
  • linux analysis server(s) - O (2 TB) disk
  • linux development machines - O (0.2 TB)
  • build cluster
  • ClueDØ
  • remote linux machines
  • non-development desktops
  • Associated systems
  • Fermilab production farm (raw data
    reconstruction)
  • Remote production farms (simulation)
  • Database servers

3
[Figure: system diagram; surviving labels: ClueDØ Server, High speed Network]
4
Planning history
  • Original plan: January 97
  • DØ Internal Review: February 97
  • External review: Von Rüden Committee
  • Mar 97, Oct 97, Jun 98, Jan 99, Jun 99
  • Funding profile (DMNAG - Joint with CDF) approved 97
  • Plan updates
  • January 99 for VR IV
  • Global Computing Model reports (98-99): addition
    of Analysis Servers to plan
  • Plan implementation: 97 - 01
  • Run II Computing and Software Project
    co-leaders, Computing Planning Board

5
Status of acquisitions
  • Analysis cpu
  • dØmino: 192-processor O2000 - complete (except for
    adding memory)
  • Desktops: responsibility of institutions
  • Analysis Clusters/Servers: 1 purchased of (6?)
  • Reconstruction cpu
  • 200 processors acquired of 400 planned: 40 Hz
    cap @ current reco cpu perf., 80 Hz @ target
    reco perf.
  • Disk storage
  • 30 TB total - complete (plan was 15 TB)
  • See allocation slide

6
Disk space in the offline systems
Total available disk space: 30 Tbytes
3 Tbytes are on D0test, d0lxac1, d0lxbld; 27 Tbytes are on D0MINO

Disk space on D0MINO (all units are Tbytes)

Area                               Allocated   Used       Available
Scratch, releases & other config.  1           1          1
SAM cache                          6           variable   6
DST/mDST                           12          variable   12
Project disks                      2.6         2.0?       4
Tmp (group space)                  0.9         ?          2
Contingency                                               2
TOTAL                              22.5                   27
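
The totals quoted for D0MINO can be cross-checked against the per-area figures. The following is a minimal sketch, assuming the first number in each row is the allocation and the last is the available space; the names `D0MINO_AREAS`, `OTHER_SYSTEMS_TB` and `column_total` are hypothetical and used only for illustration.

```python
# Per-area figures for D0MINO in Tbytes, transcribed from the slide as
# (allocated, available). None marks a cell the slide leaves empty.
# The "Used" column is omitted because the slide gives it as variable or unknown.
D0MINO_AREAS = {
    "Scratch, releases & other config.": (1.0, 1.0),
    "SAM cache": (6.0, 6.0),
    "DST/mDST": (12.0, 12.0),
    "Project disks": (2.6, 4.0),
    "Tmp (group space)": (0.9, 2.0),
    "Contingency": (None, 2.0),
}

# Disk on D0test, d0lxac1 and d0lxbld combined, in Tbytes.
OTHER_SYSTEMS_TB = 3.0


def column_total(rows, index):
    """Sum one column of the table, skipping empty cells."""
    values = [row[index] for row in rows.values() if row[index] is not None]
    return round(sum(values), 1)


allocated_total = column_total(D0MINO_AREAS, 0)
available_total = column_total(D0MINO_AREAS, 1)
offline_total = available_total + OTHER_SYSTEMS_TB
unallocated = round(available_total - allocated_total, 1)

print(f"allocated on D0MINO: {allocated_total} TB")
print(f"available on D0MINO: {available_total} TB")
print(f"all offline systems: {offline_total} TB")
print(f"not yet allocated:   {unallocated} TB")
```

Under that reading the columns reproduce the slide's TOTAL row (22.5 allocated, 27 available) and the 30 Tbyte figure for the offline systems as a whole.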
7
Status of acquisitions (cont'd)
  • Robotic tape storage
  • 1 ADIC robot (750 TB capacity) - complete
  • 18 Mammoth II tape drives - will be retired
  • 6 LTO drives - now
  • 2 STK robots (600 TB capacity) - FY02
  • 9 STK 9940 drives - FY02
  • Post shutdown stopgap - use existing STKen w/ 4
    drives
  • Database servers - complete
  • 2 SUN systems w/ 600 GB disk

8
Performance
  • Farm production stats
  • dØmino cpu & mem stats
  • AC1 cpu & mem stats
  • SAM encp stats
  • Disk usage stats
  • Conclusion: Chief needs
  • More memory for dØmino
  • More reliable tape drives
  • More farm nodes
  • More linux cpu
  • Open questions - DB server upgrades?

9
Farm Production Statistics
  • See web link from Main DØ Computing for weekly
    reports
  • Week of 08/31 - 09/06
  • 800,000 evts proc / 140,000 from data collected
    in that week
  • 1.9 M events collected in that week
  • Problems in this week
  • encp problem (code change from ENSTORE)
  • disk failure on dØbbin (the farm IO server)
  • several other problems as well...
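
To compare the weekly event counts with the Hz figures quoted elsewhere in the talk, they can be converted to average rates. This is a rough sketch, assuming the week is seven full days of continuous running, which overstates the live time; `average_rate_hz` is a hypothetical helper, not part of any DØ tool.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def average_rate_hz(events, seconds=SECONDS_PER_WEEK):
    """Average event rate over the period, in events per second."""
    return events / seconds


# Counts for the week of 08/31 - 09/06, taken from the slide.
processed = 800_000        # events processed on the farm
processed_fresh = 140_000  # of those, events collected in the same week
collected = 1_900_000      # events collected in that week

processed_hz = average_rate_hz(processed)
collected_hz = average_rate_hz(collected)
fresh_fraction = processed_fresh / collected

print(f"farm processing: {processed_hz:.2f} Hz averaged over the week")
print(f"data collection: {collected_hz:.2f} Hz averaged over the week")
print(f"share of the week's data already processed: {fresh_fraction:.1%}")
```

Averaged this way the farm processed about 1.3 Hz against about 3.1 Hz collected, and roughly 7% of that week's new data was reconstructed within the week.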

10
The Current Operation
  • Code release model
  • Mapping activities to systems
  • ClueD0 operation
  • Remote farm operation
  • Role of the ORB

11
The code release model
  • Weekly test releases
  • Production releases every three months
  • Weekly subsystem coordinators meeting: minutes to
    d0rug mailing list
  • Rules for interface changes
  • Schedules for big disruptive changes (e.g. switch
    to KAI 4.0)

12
Mapping activities to systems
  • Code development: your Linux box, if possible;
    d0mino is the backup solution
  • Large sample processing: a SAM station
  • d0mino, lxac1, special farm allocation (gtr),
    (ClueD0 - in R&D)
  • Small sample processing: create derived data set
    on SAM station, transfer to desktop
  • Office/Web browsing: use your desktop!
  • Remote users: new position to address needs

13
Mapping activities to systems
  • Disk usage
  • Home areas - backed up; you can ask for up to
    250 MB (possibility of more for good reason), BUT
    NFS-mounted - don't use for data files!
  • TMP areas - not backed up. Code development
    and/or data files, allocated per institution. 37
    institutions are using it so far. A good place
    to start off if you are not working with a
    well-defined project.
  • PRJ areas - not backed up. Code development
    and/or data files, allocated per project. 3 large
    pools (commissioning, algorithm development,
    simulation), plus physics and ID groups and some
    smaller projects.
  • Web pages - DØ Main Computing (SAM Data Handling
    section) --> General description of where data
    samples are stored in our system

14
ClueD0 Operation
  • The current population is 111 nodes with 138
    CPUs and a total memory of 37 GB; 396 users
  • Rules for joining and policies can be found at
    http://www-clued0.fnal.gov/clued0/ and
    http://www-clued0.fnal.gov/clued0/policies.html
  • Current difficulties from the lack of Redhat 7.1
    builds are being actively worked on

15
Monte Carlo Production Status
  • Current Software: mcp07
  • p07.00.05a: Generator, DØgstar, DØsim
  • p08.12.00: DØreco, recoanalyze
  • 950 kevents generated at reco level
  • Run IIB Simulation is a major effort
  • Will move to p08.13.00 to remove memory leak
  • Future Releases: p09.10.00
  • Problem running DØgstar under investigation
  • Plate level available
  • p10 certification will be available by the end of
    the month

16
The Offline Resources Board
  • Charge: Allocate offline resources according to
    the experiment's priorities
  • Project & tmp disk
  • Sample priorities for simulation on remote farms
  • Partitions in SAM cache
  • Batch queues
  • Chair: Nick Hadley
  • Web Page:
    http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
  • Institutions which have no tmp disk allocation
    and have active users:
  • email to hadley@fnal.gov - 18 GB will be allocated

17
R&D
  • Analysis clusters - one in service
  • ClueD0 servers (a relocated analysis cluster) -
    software being tested; networking strategy being
    developed
  • Compute servers for dØmino (a user-accessible
    farm) - 2 nodes available for tests
  • Remote farms for raw data reconstruction and
    analysis
  • Remote desktop analysis

18
Institutional contributions
  • Desktop seats
  • Backup tapes
  • Remote simulation capacity
  • Disk for dØmino via budget code - issues
  • How to allocate between project & tmp?
  • Lifetime for contribution?
  • Unit of contribution: 1 rack of disk
  • Analysis cluster for Feynman via budget code
  • Similar issues
  • Analysis cluster for ClueDØ - all the above
    issues, plus SAM bandwidth, networking, sysadmin, ...

19
General Status - Where are the limits/problems?
  • Online
  • Max rate tested: 40 Hz to tape
  • Max rate sustained for a shift, to date: 25 Hz
    to tape
  • Max rate expected with next iteration: 60 Hz to
    tape
  • Final limitation: tape budget (FY02: 400 TB)
  • Running p10 on the farms
  • Processes raw data @ 23 sec/event
  • Thanks to Alg Group - worked out of box on raw
    data
  • Limits: 2-3 Hz w/ current nodes & cpu perf of
    reco
  • Output size HUGE - writing too much tape,
    breaking DB model, using more than allocated
    network and disk resources all down the line
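
The gap between the online rates and the farm rate follows from simple arithmetic on the reconstruction time per event. The sketch below assumes an idealized farm in which each processor reconstructs one event at a time with no I/O or scheduling overhead; the function names are hypothetical.

```python
import math


def farm_rate_hz(processors, sec_per_event):
    """Steady-state throughput of an ideal farm, in events per second."""
    return processors / sec_per_event


def processors_needed(rate_hz, sec_per_event):
    """Processors an ideal farm needs to keep up with the given input rate."""
    return math.ceil(rate_hz * sec_per_event)


SEC_PER_EVENT = 23  # p10 reconstruction time on raw data, from the slide

# Processors implied by the quoted 2-3 Hz farm limit.
low = processors_needed(2, SEC_PER_EVENT)
high = processors_needed(3, SEC_PER_EVENT)

# Processors required to follow the online rates quoted on the slide.
for_sustained = processors_needed(25, SEC_PER_EVENT)  # best shift so far
for_tested = processors_needed(40, SEC_PER_EVENT)     # max rate tested

print(f"2-3 Hz corresponds to {low}-{high} busy processors")
print(f"25 Hz needs {for_sustained} processors, 40 Hz needs {for_tested}")
```

At 23 sec/event, keeping up with 25 Hz would take 575 processors in this idealized model, which is why the slide treats reco cpu performance as a limit alongside the number of farm nodes.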

20
Expected Farm Performance
21
General Status - Where are the limits/problems?
  • SAM/ENSTORE status
  • Working for many months with servers on automatic
    recovery
  • Not all features complete (pick events)
  • 5 GB interfaces → can deliver 150 MB/sec to
    dØmino
  • Robot status
  • Design rates met, but robustness severely limited
    by M II drive error rate - plan switchover by end
    of shutdown

22
Future Plan
  • Major purchases still in FY02
  • New robot and reliable drives
  • New farm nodes
  • More memory for dØmino
  • Some linux cpu
  • Continue R&D for linux analysis strategies
  • Hope to establish effectiveness and practicality
    of the three proposed models: AC, CS, AC@DØ
  • Operational improvements
  • SAM personnel @ DØ
  • RECO: continue with current release schedules;
    emphasize quality control and testing for
    releases; push on cpu, memory, output size issues