Managing Mature White Box Clusters at CERN - PowerPoint PPT Presentation

Slides: 15
Provided by: TimS104
Transcript and Presenter's Notes

1
Managing Mature White Box Clusters at CERN
  • LCW: Practical Experience
  • Tim Smith CERN/IT

2
Contents
  • Scale
  • Behind the Scenes
  • Hardware
  • Complexity
  • Dynamics
  • Practical Steps
  • Software
  • Legacy
  • Projects

3
Scale
  • 1000 boxes
  • 140k Jobs/wk
  • 2400 interactive users
  • 50 parallel reinstalls
  • Parallel cmd engines
  • 350 kSI2000
  • 7/38 in top 500 clusters
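The "50 parallel reinstalls" and "parallel cmd engines" items describe running one action across many boxes with a cap on concurrency. A minimal sketch of that idea follows. The function names, the node names, and the `fake_reinstall` stand-in are all hypothetical and not taken from the presentation, which does not describe how CERN's engines were built.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_nodes(nodes, action, max_parallel=50):
    """Apply `action` to every node, with at most `max_parallel` running at once.

    `action` is a hypothetical callable (for example, a wrapper that triggers a
    reinstall); it takes a node name and returns a result or raises on failure.
    """
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(action, node): node for node in nodes}
        for future in as_completed(futures):
            node = futures[future]
            try:
                succeeded[node] = future.result()
            except Exception as exc:  # one bad box must not stop the sweep
                failed[node] = str(exc)
    return succeeded, failed


def fake_reinstall(node):
    # Stand-in for a real reinstall trigger; fails for one invented node name.
    if node == "node0013":
        raise RuntimeError("no response")
    return "reinstalled"


if __name__ == "__main__":
    farm = [f"node{i:04d}" for i in range(1000)]
    ok, bad = run_on_nodes(farm, fake_reinstall, max_parallel=50)
    print(len(ok), "succeeded;", len(bad), "failed")
```

Collecting failures per node instead of aborting matters at this scale, because some boxes are likely to be unreachable during any sweep.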

4
Complexity
  • Hardware
  • 12 hardware acquisitions
  • 38 combinations of CPU/Mem/Disk
  • Software
  • 4 versions of RedHat OS
  • 37 clusters (independent configurations)
  • User Communities
  • 30 experiments/user communities, plus Public
  • 12,000 users
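A figure such as "38 combinations of CPU/Mem/Disk" can be derived from an inventory by counting distinct hardware tuples. The sketch below assumes a hypothetical inventory format; the field names and sample values are invented for illustration.

```python
from collections import Counter


def hardware_combinations(inventory):
    """Count nodes per distinct (cpu, memory, disk) combination."""
    return Counter((n["cpu_mhz"], n["mem_mb"], n["disk_gb"]) for n in inventory)


if __name__ == "__main__":
    # Invented sample records, not CERN data.
    inventory = [
        {"cpu_mhz": 800, "mem_mb": 512, "disk_gb": 20},
        {"cpu_mhz": 800, "mem_mb": 512, "disk_gb": 20},
        {"cpu_mhz": 1000, "mem_mb": 1024, "disk_gb": 40},
    ]
    combos = hardware_combinations(inventory)
    print(len(combos), "distinct combinations")
    for combo, count in combos.most_common():
        print(combo, count)
```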

5
Dynamics
  • Hardware Drift
  • e.g. missing after reboot
  • CPUs, Memory, Disks
  • Ethernet speed wrong
  • Volatile configurations
  • e.g. passwd file changes every couple of hours
  • Hardware Failures
  • Up to 4% of farm on holiday
  • Replacements generate new configurations

Monitoring
Inventory Tracking
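Hardware drift of the kind listed above (CPUs, memory or disks missing after a reboot, wrong Ethernet speed) can be caught by comparing what a node reports against its inventory record. This is a minimal sketch under that assumption; the attribute names and values are hypothetical, and how the observed values are collected is left out.

```python
def detect_drift(expected, observed):
    """Return {attribute: (expected, observed)} for every attribute that differs."""
    drift = {}
    for key, want in expected.items():
        have = observed.get(key)  # None if the node did not report the attribute
        if have != want:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Invented example: half the memory is missing and the NIC negotiated 10 Mb/s.
    inventory_record = {"cpus": 2, "mem_mb": 1024, "disks": 2, "eth_mbps": 100}
    after_reboot = {"cpus": 2, "mem_mb": 512, "disks": 2, "eth_mbps": 10}
    print(detect_drift(inventory_record, after_reboot))
```

An empty result means the node still matches its record; anything else is a candidate for a hardware intervention.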
6
Vendor Call Analysis
1 every 2 days!
7
Acquisition Cycles
8
Addressing the Challenge
  • Interactive: Refresh from uniform batch machines
  • Batch: One large production facility
  • Shares (and priorities)
  • Selectable resources
  • Flexibility
  • Redundancy to reduce sensitivity to failures
  • Remedy Hardware workflows
  • But intractable
  • Scatter in job return times
  • Assumed but undeclared job requirements
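The combination of "selectable resources" and "assumed but undeclared job requirements" can be illustrated with a simple matcher. The sketch below is hypothetical: the resource names, node names, and values are invented, and it is not the batch system's actual selection logic.

```python
def eligible_nodes(job_requirements, nodes):
    """Return names of nodes that meet every declared minimum requirement."""
    return [
        name
        for name, resources in nodes.items()
        if all(resources.get(key, 0) >= minimum
               for key, minimum in job_requirements.items())
    ]


if __name__ == "__main__":
    # Invented node descriptions.
    nodes = {
        "node-a": {"mem_mb": 512, "scratch_gb": 10},
        "node-b": {"mem_mb": 1024, "scratch_gb": 40},
    }
    # A job that declares its memory need is only matched to a suitable node.
    print(eligible_nodes({"mem_mb": 1024}, nodes))
    # A job that declares nothing matches every node, including ones too small
    # for what it silently assumes.
    print(eligible_nodes({}, nodes))
```

With a mixed farm, the second case is one plausible source of the scatter in job return times: the same job can land on very different hardware.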

9
SW Legacy from Maturity
  • Diagram labels: BIS, Mgmt Tools, ASIS, Applications, OS, SUE, KickStart
  • File system areas: /home, /usr/cute, /usr/local, /var, /opt
10
SW Legacy from Maturity
  • Diagram labels: acrontabs, crontabs, BIS, Local, Mgmt Tools, ASIS, AFS, Applications, OS, SUE, Oracle, BIS DB, KickStart
  • File system areas: /home, /usr/cute, /usr/local, /var, /opt
  • Multiple locations
  • Multiple owners, methods, formats
11
A Clean Restart
  • Fault Mgmt System
  • Monitoring System
  • Node
  • Configuration System
  • Installation System
12
A Clean Restart SnapShot
  • Diagram labels: Fault Mgmt System, HW, SW, Monitoring System, Node, Configuration System, API, Function State, RPM, Installation System, PXE Kickstart, Software Update, Base Installation
13
State and Configuration Mgt
  • Clean Initial State
  • Linux Standards Base, RPM
  • Externally Specified
  • Configuration System, local cache
  • Versioned Repository
  • CVS
  • No inherent drift
  • No external crontabs
  • No unregistered, application-provider-triggered updates
  • Update verification nodes release cycle
  • Procedures and Workflows
  • Transactions
  • Notifications
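The idea of an externally specified state applied through transactions can be sketched as a comparison between the desired package list and what a node actually has installed. The code below is an illustration only: the package names and versions are invented, and versions are compared as opaque strings, not with RPM's version-ordering rules.

```python
def plan_update(desired, installed):
    """Compare desired {package: version} with installed; return the actions needed."""
    actions = []
    for name in sorted(desired.keys() | installed.keys()):
        want, have = desired.get(name), installed.get(name)
        if want == have:
            continue  # already in the specified state
        if have is None:
            actions.append(("install", name, want))
        elif want is None:
            actions.append(("remove", name, have))  # unregistered package
        else:
            actions.append(("change", name, f"{have} -> {want}"))
    return actions


if __name__ == "__main__":
    # Invented example data.
    desired = {"kernel": "2.4.18", "openssh": "3.1"}
    installed = {"kernel": "2.4.9", "openssh": "3.1", "localtool": "0.1"}
    for action in plan_update(desired, installed):
        print(action)
```

An empty plan means the node matches its specification, which is what "no inherent drift" requires; a non-empty plan is the content of one update transaction.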

14
Conclusions
  • Maturity brings
  • Degradation of initial state definition
  • HW and SW
  • Accumulation of innocuous temporary procedures
  • Scale brings
  • Marginal activities become full time
  • Many hands on the systems
  • Combat with strong management automation