D0 Central Systems - PowerPoint PPT Presentation

Provided by: tim150

1
D0 Central Systems
2
Project Drivers & Scope
  • Migrate off of expensive SGI hardware
      • D0mino was main D0 machine
      • Reliable homes and application areas on d02ka
      • Crusty code on d0world
  • Retire unique hardware
      • Several 4 and 8 processor Dell machines
  • Push towards Scientific Linux everywhere
      • Disk servers
      • Compute nodes
      • Infrastructure machines
  • Automate system administration tasks!!!
      • Monitor performance
      • Error reporting
      • Tracking availability
  • Continuing quest for cheap reliable disk that
    isn't a headache

3
Project Deliverables
  • D0mino powered off for over a month
  • PBS server and SAM stations coexist on Dell
    hardware (offsite routers separate)
  • SAM cache servers deployed on 3ware based IDE
    raid arrays
  • Linux login pool with a shared scratch space on
    SAN via GFS
  • Commodity NFS servers with fiber attached
    Infortrend IDE raid arrays replaced 270 separate
    project and user disk areas
  • Pick events runs on CAB nodes and we haven't had a
    cache flushing problem
  • Recycling d0mino disks
  • Retired d0lxac1 and working on d0lomite
  • Plans for Scientific Linux everywhere (almost)
      • On all disk servers and infrastructure machines
      • Turning off d0lxbldXX machines soon
      • Latest CAB KOI nodes installed with SL and in use
      • Four release machines for official D0 code builds
        with no clear plan
  • Constantly looking for ways to automate
      • Using Ganglia to monitor distributed computing
      • Web pages display PBS batch usage at a glance
      • Central SYSLOG database to track system errors
        en masse
      • Cfengine to automate changes
      • DHCP based installs on everything
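
As an illustration of the "central SYSLOG database" idea, here is a minimal Python sketch that counts error-like messages per host from classic BSD-style syslog lines. It is not the D0 implementation: the sample lines, host names, the error keywords and the errors_per_host helper are all assumptions made for this example, and a real setup would load the parsed records into a database rather than a counter.

```python
import re
from collections import Counter

# Hypothetical sample in BSD syslog layout; hosts and messages are invented.
SAMPLE = """\
Mar  3 04:12:01 node01 kernel: EXT3-fs error (device sda1): ext3_find_entry
Mar  3 04:12:07 node02 sshd[2211]: Accepted publickey for user1
Mar  3 04:13:55 node01 kernel: I/O error, dev sda, sector 12345
Mar  3 04:15:20 node03 smartd[801]: Device: /dev/hda, 1 Currently unreadable sectors
"""

# timestamp, host, program tag (optional [pid]), then the free-text message
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<tag>[^:\[\s]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)

# Assumed keyword list; tune it to whatever counts as an error locally.
ERROR_RE = re.compile(r"error|fail|unreadable", re.IGNORECASE)


def errors_per_host(text):
    """Return a Counter mapping host name to number of error-like lines."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and ERROR_RE.search(match.group("msg")):
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    for host, n in errors_per_host(SAMPLE).most_common():
        print(host, n)
```

Grouping by host first makes it easy to spot a single misbehaving node among many otherwise quiet ones.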

4
Effort Profile
  • Approx 3.5 system administrators for D0 offline
  • Much D0mino retirement work done by SAM team and
    Gustaff (Hurray!)
  • Number of CAB nodes should level off someday and
    shouldn't be a problem
  • Need to make the 3ware raids behave a little
    better
  • Helpdesk requests from users don't seem to be
    increasing

5
Ongoing work/Risks
  • RunII sysadmin team not fully cross-trained yet
  • Need to take the best of both worlds to have
    highly automated and scalable architecture with
    each person being able to do any job
  • Replace d02ka and d0world??
  • Get rid of last big Dell box
  • Roll out SL on all CAB nodes
  • Need better tracking of problems and downtimes
  • Tools may not scale up to number of computers
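
One possible starting point for tracking downtimes is to record outages as start/end pairs and derive an availability figure per machine. The Python sketch below does that under stated assumptions: the availability function, its arguments and the example dates are hypothetical, and overlapping outage records are merged so the same downtime is not counted twice.

```python
from datetime import datetime


def availability(window_start, window_end, outages):
    """Fraction of the window during which a host was up.

    outages is a list of (start, end) datetime pairs. Records are clipped
    to the window, and overlapping records are merged before summing.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")

    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
    )

    down = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if start >= end:
            continue  # outage lies entirely outside the window
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the merged outage
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()

    return 1.0 - down / total


if __name__ == "__main__":
    # Invented example: two overlapping reports covering one day in ten.
    outages = [
        (datetime(2005, 1, 2, 0, 0), datetime(2005, 1, 2, 12, 0)),
        (datetime(2005, 1, 2, 6, 0), datetime(2005, 1, 3, 0, 0)),
    ]
    print(availability(datetime(2005, 1, 1), datetime(2005, 1, 11), outages))
```

Keeping the raw outage records, rather than only the computed percentage, leaves room to attach a cause to each downtime later.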

6
Questions/Additional Info
  • http://d0om.fnal.gov/d0admin/cab/
  • http://d0om.fnal.gov/d0admin/syslog/
  • http://d0om.fnal.gov/d0admin/