Title: Infrastructure and Provisioning at the Fermilab High Density Computing Facility
1. Infrastructure and Provisioning at the Fermilab High Density Computing Facility
- Steven C. Timm
- Fermilab
- HEPiX conference
- May 24-26, 2004
2. Outline
- Current Fermilab facilities
- Expected need for future Fermilab facilities
- Construction activity at the High Density Computing Facility
- Networking and power infrastructure
- Provisioning and management at remote location
3. A cast of thousands
- HDCF design done by Fermilab Facilities Engineering
- Construction by outside contractor
- Managed by CD Operations (G. Bellendir et al.)
- Requirements planning by a task force of Computing Division personnel including system administrators, department heads, networking, and facilities people
- Rocks development work by S. Timm, M. Greaney, J. Kaiser
4. Current Fermilab facilities
- Feynman Computing Center built in 1988 (to house a large IBM-compatible mainframe)
- 18000 square feet of computer rooms
- 200 tons of cooling
- Maximum input current 1800A
- Computer rooms backed up with UPS
- Full building backed up with generator
- 1850 dual-CPU compute servers and 200 multi-TB IDE RAID servers in FCC right now
- Many other general-purpose servers, file servers, and tape robots
5. Current facilities, continued
- Satellite computing facility in the former experimental hall New Muon Lab
- Historically for Lattice QCD clusters (208-512 nodes)
- Now contains >320 other nodes waiting for construction of the new facility
6. The long hot summer
- In summer it takes considerably more energy to run the air conditioning
- Dependent on a shallow pond for cooling water
- In May the building already came within 25A (out of 1800A) of having to shut down equipment to shed power load and avoid a brownout
- Current equipment exhausts the cooling capacity of the Feynman Computing Center as well as its electrical capacity
- No way to increase either in the existing building without long service outages
7. Computers just keep getting hotter
- Anticipate that in fall 2004 we can buy dual Intel 3.6 GHz Nocona chips, 105W apiece
- Expect at least 2.5A current draw per node, maybe more: 12-13 kVA per rack of 40 nodes (see the sketch below)
- In FCC we have 32 computers per rack, 8-9 kVA
- Have problems cooling the top nodes even now
- New facility will have 5x more cooling: 270 tons for 2000 square feet
- New facility will have up to 3000A of electrical current available
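A minimal back-of-the-envelope sketch in Python of the rack power arithmetic behind these figures, assuming the 120VAC circuits that the PM-10 power strips supply (slide 18); the 2.2A figure used for the older FCC nodes is an assumption chosen only to reproduce the quoted 8-9 kVA.

```python
# Back-of-the-envelope rack power estimate using the per-node current
# quoted on this slide and 120 VAC circuits (PM-10 strips, slide 18).
# Purely illustrative arithmetic, not a facility engineering calculation.

def rack_kva(nodes_per_rack: int, amps_per_node: float, volts: float = 120.0) -> float:
    """Apparent power per rack in kVA = nodes x amps x volts / 1000."""
    return nodes_per_rack * amps_per_node * volts / 1000.0

if __name__ == "__main__":
    # New HDCF racks: 40 nodes at >= 2.5 A each -> roughly 12 kVA and up
    print(f"HDCF rack: {rack_kva(40, 2.5):.1f} kVA, {40 * 2.5:.0f} A")
    # Existing FCC racks: 32 nodes at ~2.2 A each (assumed) -> roughly 8-9 kVA
    print(f"FCC rack:  {rack_kva(32, 2.2):.1f} kVA, {32 * 2.2:.0f} A")
```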
8. We keep needing more computers
- Moore's law doubling time isn't holding true in the commodity market
- Computing needs are growing faster than Moore's law and must be met with more computers
- 5-year projections are based on plans from experiments
9. Fermi Cycles as a function of time
Fermi Cycles grow roughly as Y = 2^(X/F). Moore's law says F = 1.5 years; the commodity market shows F = 2.02 years and growing. 1000 Fermi Cycles ≈ one PIII at 1 GHz.
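A small sketch in Python of the doubling-time relation above, using the slide's two F values; the 5-year horizon matches the experiment-driven projections on the previous slide.

```python
# Performance after X years grows as 2**(X / F), where F is the doubling
# time.  F = 1.5 yr is the textbook Moore's-law figure; F = 2.02 yr is the
# value reported here for the commodity market.
def growth_factor(years: float, doubling_time: float) -> float:
    return 2.0 ** (years / doubling_time)

for f in (1.5, 2.02):
    print(f"F = {f:4.2f} yr: x{growth_factor(5.0, f):.1f} over a 5-year projection")
```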
10. Fermi Cycles per ampere as a function of time
11. Fermi Cycles per dollar as a function of time
12. Strategy
- Feynman Computing Center will be the UPS- and generator-backed facility for important servers
- New HDCF will have UPS for graceful shutdown but no generator backup; designed for high-density compute nodes (plus a few tape robots)
- 10-20 racks of existing 1U nodes will be moved to the new facility and reracked
- Anticipate 10-15 racks of new purchases this fall, also in the new building
13. Location of HDCF
1.5 miles away from FCC. No administrators will be housed there; we will manage it lights-out.
14. Floor plan of HDCF
Room for 72 racks in each of 2 computer rooms.
15. Cabling plan
Network infrastructure will use bundles of individual Cat-6 cables.
16. Current status
- Construction began early May
- Occupancy Nov/Dec 2004 (est).
- Phase III, space for 56 racks at that time.
- Expected cost US$2.8M.
17. Power/console infrastructure
- Cyclades AlterPath series
- Includes console servers, network-based KVM adapters, and power strips
- AlterPath ACS48 runs PPC Linux
- Supports Kerberos 5 authentication
- Access control can be configured per port
- Any number of power strip outlets can be associated with each machine on each console port
- All configurable via command line or Java-based GUI
18. Power/console infrastructure
PM-10 power strip: 120VAC, 30A, 10 nodes per circuit, four units per rack
19. Installation with NPACI Rocks
- NPACI (National Partnership for Advanced Computational Infrastructure); lead institution is the San Diego Supercomputer Center
- Rocks: the ultimate cluster-in-a-box tool. Combines a Linux distribution, a database, a highly modified installer, and a large number of parallel computing applications such as PBS, Maui, SGE, MPICH, Atlas, PVFS
- Rocks 3.0 based on Red Hat Linux 7.3
- Rocks 3.1 and greater based on SRPMS of Red Hat Enterprise Linux 3.0
20. Rocks vs. Fermi Linux comparison
Both Fermi Linux and Rocks 3.0 are based on Red Hat 7.3.
- Fermi Linux adds: Workgroups, Yum, OpenAFS, Fermi Kerberos/OpenSSH
- Rocks 3.0 adds: extended kickstart, HPC applications, MySQL database
21. Rocks architecture vs. Fermi application
Rocks architecture:
- Expects all compute nodes on a private net behind a firewall
- Reinstall a node if anything changes
- All network services (DHCP, DNS, NIS) supplied by the frontend
Fermi application:
- Nodes are on the public net
- Users won't allow downtime for frequent reinstalls
- Use yum and other Fermi Linux tools for security updates
- Configure Rocks to use our external network services
22. Fermi extensions to Rocks
- Fermi production farms currently have 752 nodes, all installed with Rocks
- This Rocks cluster has the most CPUs registered of any cluster at rocksclusters.org
- Added extra tables to the database for customizing kickstart configuration (we have 14 different disk configurations; see the sketch after this list)
- Added Fermi Linux comps files to make all Fermi workgroups available in installs, plus all added Fermi RPMS
- Made slave frontends into install servers during mass reinstall phases; during normal operation one install server is enough
- Added logic to recreate Kerberos keytabs
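A hypothetical sketch of the disk-configuration lookup idea; the table name, layout labels, hostname, and partition sizes below are invented for illustration (the production farms keep this in the Rocks MySQL database, not SQLite).

```python
# Hypothetical illustration (not the actual Fermi schema) of keying each
# node's kickstart partitioning block off a disk-configuration label
# stored in the cluster database.
import sqlite3  # in-memory stand-in for the Rocks MySQL database

LAYOUTS = {  # invented example layouts; the real farms track 14 of these
    "ide-40gb": "part / --size 8000\npart swap --size 1024\npart /local --size 1 --grow",
    "ide-80gb": "part / --size 8000\npart swap --size 2048\npart /local --size 1 --grow",
}

def kickstart_partitioning(db: sqlite3.Connection, node: str) -> str:
    """Look up the node's disk configuration and return its kickstart block."""
    row = db.execute(
        "SELECT disk_config FROM node_disk_config WHERE node = ?", (node,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no disk configuration recorded for {node}")
    return LAYOUTS[row[0]]

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE node_disk_config (node TEXT, disk_config TEXT)")
    db.execute("INSERT INTO node_disk_config VALUES ('compute-0-0', 'ide-80gb')")
    print(kickstart_partitioning(db, "compute-0-0"))
```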
23. S.M.A.R.T. monitoring
- smartd daemon from the smartmontools package gives early warning of disk failures
- Disk failures are 70% of all hardware failures in our farms over the last 5 years
- Run a short self-test on all disks every day (sketched below)
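A minimal sketch, not the production setup, of a daily SMART check using smartctl from smartmontools: start a short self-test on each disk and read back the overall health verdict. The device names are illustrative, and in practice smartd itself can schedule the self-tests.

```python
# Kick off a SMART short self-test on each disk and report overall health
# using smartctl from the smartmontools package.  Illustrative only.
import subprocess

DISKS = ["/dev/hda", "/dev/hdb"]  # example device names on an IDE farm node

def short_self_test(disk: str) -> None:
    """Start a SMART short self-test (runs in the background on the drive)."""
    subprocess.run(["smartctl", "-t", "short", disk], check=True)

def healthy(disk: str) -> bool:
    """Return True if smartctl reports the drive's overall health as PASSED."""
    out = subprocess.run(["smartctl", "-H", disk], capture_output=True, text=True)
    return "PASSED" in out.stdout

if __name__ == "__main__":
    for d in DISKS:
        short_self_test(d)
        print(d, "OK" if healthy(d) else "WARNING: SMART health check failed")
```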
24. Temperature/power monitoring
- Wrappers for lm_sensors feed NGOP and Ganglia
- Measure the average temperature of nodes over a month
- Alarm when 5C or 10C above average (see the sketch after this list)
- Page when 50% of any group is 10C above average
- Automated shutdown script activates when any single node is over the emergency temperature
- Building-wide signal will provide notice that we are on UPS power and have 5 minutes to shut down
- Automated OS shutdown and SNMP poweroff scripts
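A minimal sketch, not the production NGOP/Ganglia wrapper, of the alarm logic described above: read a CPU temperature from lm_sensors and classify it against the node's monthly average. The emergency threshold and the 38C average used in the example are illustrative assumptions.

```python
# Compare a node's current CPU temperature, read from lm_sensors, against
# its monthly average and escalate at +5C, +10C, and an emergency cutoff.
import re
import subprocess

EMERGENCY_C = 70.0  # assumed emergency temperature, not a Fermilab figure

def cpu_temp_c() -> float:
    """Parse the first temperature reading out of `sensors` (lm_sensors)."""
    out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    match = re.search(r"([+-]?\d+(?:\.\d+)?)\s*°?C", out)
    if not match:
        raise RuntimeError("no temperature found in sensors output")
    return float(match.group(1))

def classify(current: float, monthly_avg: float) -> str:
    if current >= EMERGENCY_C:
        return "EMERGENCY: trigger automated shutdown"
    if current >= monthly_avg + 10:
        return "ALARM: 10C above average"
    if current >= monthly_avg + 5:
        return "ALARM: 5C above average"
    return "ok"

if __name__ == "__main__":
    print(classify(cpu_temp_c(), monthly_avg=38.0))  # 38C average is illustrative
```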
25. Reliability is key
- Can only successfully manage remote clusters if the hardware is reliable
- All new contracts are written with the vendor providing a 3-year warranty on parts and labor; they only make money if they build good hardware
- 30-day acceptance test is critical to identify hardware problems and fix them before production begins
- With 750 nodes and 99% reliability, roughly 8 nodes would still be down on any given day
- Historically reliability is closer to 96%, but the new Intel Xeon-based nodes are much better