Title: Infrastructure for Terascale Computing
1Infrastructure for Terascale Computing and Beyond
Henry Gun-Why - Daresbury Science and Innovation Campus - Friday 27th July 2007
2- National Science and Innovation Campus - one of only two in the country
- Fourth Generation Light Source (4GLS) - the only one in the world
- Home to the National Supercomputer - one of the top two computational science facilities
- National Centre for Electron Spectroscopy and Surface Analysis
- Cockcroft Institute - National Centre for Accelerator Science
3Daresbury - HPCx
New 3.5 trillion-a-second computer to unlock secrets - The Guardian
- Tim Radford, science editor, Tuesday December 31, 2002
- Scientists have just switched on the most powerful academic computer outside the United States.
- The HPCx at Daresbury laboratory in Warrington, Cheshire - the main lab of the government-funded research council - has enough electronic brainpower to complete a full year's maths homework and classwork for every child in Britain in one fifth of a second.
- When in full service, the £53m network of 1,280 powerful processors running in parallel will be able to undertake the hitherto undreamed of: model the air turbulence behind a jet as it lands, or the machinery of a cell as it manages the business of life.
- The Daresbury installation ranks ninth in the world's top 500 supercomputer league table, and can handle 3.5 trillion number-crunching operations a second.
4Definitions - teraflop
- A teraflop is a measure of a computer's speed and can be expressed as
- a trillion floating point operations per second
- 10 to the 12th power floating-point operations per second
- roughly 2 to the 40th power FLOPS
- Today's fastest parallel computers are capable of teraflop speeds. Scientists have begun to envision computers operating at petaflop speeds.
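A quick arithmetic check (a minimal Python sketch; purely illustrative) shows how close the decimal and binary definitions are, and what the 3.5 teraflop headline figure means per processor across HPCx's 1,280 processors:

```python
# Decimal vs binary definitions of a teraflop
decimal_tflop = 10**12          # one trillion FLOPS
binary_tflop = 2**40            # ~1.0995e12 FLOPS, about 10% larger

print(f"10^12 = {decimal_tflop:.4e}")
print(f"2^40  = {binary_tflop:.4e}")
print(f"ratio = {binary_tflop / decimal_tflop:.4f}")

# HPCx headline figure: 3.5 trillion operations/second over 1,280 processors
hpcx_total = 3.5e12
processors = 1280
print(f"per-processor rate ~ {hpcx_total / processors / 1e9:.2f} GFLOPS")
```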
5Definitions - petaflop
- A petaflop is a measure of a computer's processing speed and can be expressed as
- a thousand trillion floating point operations per second (FLOPS)
- a thousand teraflops
- 10 to the 15th power FLOPS
- roughly 2 to the 50th power FLOPS
- Today's fastest parallel computers are capable of teraflop speeds. A petaflop computer would require a massive number of processors working in parallel on the same problem.
- Applications might include real-time nuclear magnetic resonance imaging during surgery, computer-based drug design, astrophysical simulation, the modelling of environmental pollution, and the study of long-term climate changes.
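To make "a massive number of processors" concrete, a hedged back-of-envelope sketch, assuming processors running at roughly the HPCx per-processor rate computed above:

```python
# How many HPCx-class processors would a petaflop machine need?
# (illustrative only; assumes ~2.7 GFLOPS per processor, i.e. the
# HPCx headline 3.5 TFLOPS spread over its 1,280 processors)
petaflop = 1e15
per_processor = 3.5e12 / 1280   # ~2.73 GFLOPS

print(f"processors needed ~ {petaflop / per_processor:,.0f}")
# ~366,000 processors - hence "a massive number of processors
# working in parallel on the same problem"
```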
6Presentation Overview of High Performance Computing - HPCx
- Development of the infrastructure provision for the HPCx facility on the Daresbury Science and Innovation Campus, and a look forward to the next generation of petaflop systems
- Costs and VAT - £53m
- Programme - Nov 2002
- Building Construction - existing and ancillary plant areas
- Electrical Services, supplies and resilience - Pillar UK
- Mechanical Services - cooling and BMS monitoring
- Insurance - leading edge facility
- Security - fire, intrusion detection, CCTV and access
- Lessons Learnt
- The Big Issue
- Petascale and beyond
7Daresbury - HPCx Collaboration
8Project Development - N+1
- The N+1 philosophy underpins resilience, minimising the risk of business disruption.
- This means we provide one standby unit in addition to the normal (N) operational requirement (a capacity check is sketched after this list).
- Where one primary mains power feed is the norm, two separate feeds are provided; where one standby generator unit is required, an additional unit provides backup.
- N+1 minimum is standard across our critical service systems and equipment, including
- primary HV power
- generators and Uninterruptible Power Supplies (UPS)
- A/B electrical switchboards
- chillers
- pumps
- air handling equipment
- fire detection and suppression
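A minimal sketch of an N+1 capacity check; the unit sizes are illustrative assumptions, while the 1 MW load matches the heat-rejection capacity quoted in the refurbishment plan later in this deck:

```python
import math

def n_plus_one_ok(load_kw: float, unit_kw: float, installed: int) -> bool:
    """True if the installed units cover the load with one spare (N+1)."""
    n_required = math.ceil(load_kw / unit_kw)
    return installed >= n_required + 1

# e.g. a 1 MW cooling load served by 300 kW chillers: N = 4, so N+1 = 5
print(n_plus_one_ok(load_kw=1000, unit_kw=300, installed=5))  # True
print(n_plus_one_ok(load_kw=1000, unit_kw=300, installed=4))  # False - no spare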
9 - 2 No 2 MVA transformers
10Project Development - Power
- Resilience depends on diverse and durable supplies of power, designed to minimise the risk of power failure and to be responsive to individual power requirements. Even during prolonged blackouts, it may be critical to keep HPC systems up and running.
- Standard features include
- primary HV power provision
- LV switchboards for power provision to the room
- on-site sub-stations
- standby generators with a minimum 24 hours autonomy at full capacity (see the fuel sketch below)
- Uninterruptible Power Supplies (UPS)
- sufficient fuel on site to run for days, together with priority re-fuelling contracts
- planned preventative maintenance programmes
- process-oriented engineers and on-site controllers available 24x7
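A rough fuel-autonomy estimate behind that 24-hour target - a hedged sketch in which the 2 MVA rating echoes the transformer slide, while the 0.8 power factor and ~0.25 litres/kWh specific consumption are typical assumed values, not project figures:

```python
# Rough diesel fuel requirement for generator autonomy (illustrative).
# Assumptions (not project figures): 2 MVA set at 0.8 power factor,
# ~0.25 litres of diesel per kWh generated.
def fuel_litres(gen_kva: float, power_factor: float,
                hours: float, litres_per_kwh: float = 0.25) -> float:
    return gen_kva * power_factor * hours * litres_per_kwh

# 24 hours at full capacity, per the slide's minimum autonomy target
print(f"{fuel_litres(2000, 0.8, 24):,.0f} litres")  # ~9,600 litres
```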
11Rotary UPS and Power Supplies Layout
12UPS being craned in situ
13Electrical Services, supplies and resilience
- UPS Systems
- Technical factors for consideration
- Choice - static versus rotary
- Types of UPS
- Reliability
- Summary
14Types of UPS Technologies - Series on-line
15Reliability - Types of UPS Technologies
- Static UPS: MTBF 75,000 to 100,000 hrs
- Rotary UPS: MTBF 610,000 to 700,000 hrs
- Source: Electricity in Buildings, CIBSE Guide K, 2004
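These MTBF figures translate into availability once a repair time is assumed; a minimal sketch, where the 8-hour mean time to repair is an illustrative assumption, not a figure from CIBSE Guide K:

```python
# Availability from MTBF/MTTR (illustrative; 8 h mean time to repair assumed)
def availability(mtbf_hrs: float, mttr_hrs: float = 8.0) -> float:
    return mtbf_hrs / (mtbf_hrs + mttr_hrs)

for name, mtbf in [("static UPS", 100_000), ("rotary UPS", 700_000)]:
    a = availability(mtbf)
    downtime_min = (1 - a) * 365 * 24 * 60
    print(f"{name}: availability {a:.6f}, ~{downtime_min:.0f} min/yr downtime")
```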
16Reliability - Number of Components
17Performance Characteristics - Harmonics and Galvanic Isolation
- IT loads produce harmonics
- Harmonics on the input of the UPS must be attenuated to approximately 5% (see the THD sketch below)
- In data centre design it is preferable to electrically isolate IT loads from mechanical loads
- Galvanic isolation of IT loads for optimum earthing
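As a concrete illustration of that ~5% figure, total harmonic distortion (THD) of a current waveform can be computed from its harmonic amplitudes; a hedged sketch with made-up amplitudes:

```python
import math

def thd_percent(fundamental: float, harmonics: list[float]) -> float:
    """THD = sqrt(sum of squared harmonic amplitudes) / fundamental."""
    return 100 * math.sqrt(sum(h * h for h in harmonics)) / fundamental

# Illustrative amplitudes (amps): fundamental plus 3rd/5th/7th harmonics
print(f"{thd_percent(100.0, [3.0, 2.5, 2.0]):.1f}% THD")  # ~4.4%, under the ~5% target
```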
18Types of UPS Technologies - Technology Summary
19UPS Control Panel
20UPS Battery string
21Rotary UPS
22Rotary UPS
23Uni Block Rotary UPS
24UPS and Switchgear housing
25UPS and Switchgear housing
26UPS and Switchgear housing
27UPS and Switchgear housing slab
28Switchgear delivery
29Switchgear
30Power Distribution
31Fire Alarm Panels
32Cable Management in Sub-floor
33Computer Room Refurbishment - 31st May 2002
Remove kit/ partitions/ ceiling
30 years of history - remove air-conditioning
vents, electrical services and fire-suppressant
34Computer Room Refurbishment - 10th June 2002
Remove floor and services - think about girder!
35Computer Room Refurbishment - 30th June 2002
Room stripped back to shell - start to remove
girder
36Computer Room Refurbishment - 12th July 2002
Girder removed - spur walls trimmed back
37Computer Room Refurbishment - August September
2002
Sockets/ underfloor power distribution/ yellow
spots for tile pedestals/ trays for fibre channel
38Transition to service - Installation and Commissioning - 2002
- 4-9th Oct delivery of hardware
- 10-17th Oct power-up tests. IBM internal
diagnostics. Computer room power and
air-conditioning commissioning. - 18-22nd Oct Software reinstall
- Logical partition O/S
- Parallel System Support Program
- General Parallel File System
- LoadLeveler - batch configuration
- Tivoli Storage Manager
- High Availability Cluster Multi-processing
39Vigorous Future Plans - Nov 2002
- Complete fire and security insurance enhancements - November 2002
- Explore Gang Scheduling - January 2003
- Complete Standard Operating Procedures - February 2003
- UPS commissioning / Move Network - February 2003
- GPFS Swap / Switch Plane Optimisation - March 2003
- Development System - separate development system to manage evaluation of new software releases - April 2003
40Cold Aisle
41IBM Installation - PHASE 1
42IBM Installation - PHASE 2
43Project Development - Controlled Environment
- Infrastructure must be designed to cope with environmental and cooling requirements - even the extraordinary demands of the latest high density server technologies.
- optimise equipment layout
- balance air flows where necessary
- deliver all cooling and environmental requirements using modelling (see the airflow sketch below)
- primary cooling infrastructure, centrally managed and linked to BMS. Other key features include
- Room Air Conditioning Units (RACUs)
- regulated humidity within a constant range
- HVAC system with minimum impact on the environment
- Down Flow Air Conditioning Units
- adjustable floor grilles to achieve balanced air flows
- planned preventative maintenance and 24x7 on-site engineers
- critical Service Level Agreements (SLAs) on cooling provision
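A first-order check that sits behind such modelling: the airflow needed to remove a given heat load at a chosen air temperature rise. A hedged sketch - the 1 MW load matches the refurbishment plan's heat-rejection capacity, while the 12 K rise and the air properties are assumed values:

```python
# Sensible-heat airflow estimate: V = P / (rho * cp * dT)
RHO_AIR = 1.2      # kg/m^3, approximate density of air at room conditions
CP_AIR = 1005.0    # J/(kg*K), specific heat of air

def airflow_m3s(heat_kw: float, delta_t_k: float) -> float:
    return heat_kw * 1000.0 / (RHO_AIR * CP_AIR * delta_t_k)

# 1 MW of heat rejection with an assumed 12 K air temperature rise
v = airflow_m3s(1000.0, 12.0)
print(f"~{v:.0f} m^3/s (~{v * 2118.88:,.0f} CFM)")  # ~69 m^3/s
```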
44HPCx Project Outline of Requirements
- Computer Room Infrastructure at DL - January 2001
- HPCx Core Programme Refurbishment Plan
- Timetable - original / revised
- IBM Phase 1/2/3 installation floor plans - note Phase 2a added later
- Computer Room Refurbishment - photographs
- Transition to Service - timetable
- installation and commissioning
- acceptance and early user service
- Systems Management - capability computing
- Reliability, Availability and Serviceability Management
45HPCx Computing Requirements
- Phase 1 - 40 Regatta H compute frames, 2 I/O frames - 400 kVA power - 400 kW cooling - 2160 square feet
- Phase 2 - 48 Regatta H compute frames, 3 I/O frames, Federation switch - 500 kVA power - 500 kW cooling - 2160 square feet
- Phase 3 - 96 Regatta H compute frames, 4 I/O frames, alternative technologies (IPF, BlueGene/L?) - 960 kVA power - 960 kW cooling - 3769 square feet
- 15-inch under-floor gap for 26 miles of cabling!
- air floor grilles to enhance air-flow (power densities sketched below)
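Straight arithmetic on the figures above shows how floor power density grows across the phases; the per-frame figure uses the cooling load and ignores the I/O frames:

```python
# Power density per phase from the slide's figures
phases = [
    # (name, compute frames, cooling load kW, floor area sq ft)
    ("Phase 1", 40, 400, 2160),
    ("Phase 2", 48, 500, 2160),
    ("Phase 3", 96, 960, 3769),
]
for name, frames, kw, sqft in phases:
    print(f"{name}: {kw * 1000 / sqft:.0f} W/sq ft, "
          f"~{kw / frames:.1f} kW per compute frame")
# Phase 1: ~185 W/sq ft; Phase 3: ~255 W/sq ft - the compaction trend
# that the Big Issues slide returns to
```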
46Computer Room Infrastructure at DL - January 2001
- Designated site for HPCx primary operations (Q2/02)
- Approx 5000 sq ft of space
- Built in the 1960s to house IBM mainframes and the Cray-1 - multiple floor levels / small underfloor voids - largely used for communications equipment and various servers
- Power / cooling / fire-protection
- 200 kVA / 200 kW sufficient for existing requirements
- HALON 1301 system in need of replacement with modern, environmentally-acceptable gases
- Security - sufficient for the existing programme, but a review was required
- CLRC invested £1M of Infrastructure budget to refurbish the computer room (floor / more power and air-conditioning)
47HPCx Core Programme Refurbishment Plan
- Replacement of computer room floor - uniform depth (18 inches) and bonded stringer construction
- Fire-protection - new detection / FM200 suppression system
- Cooling - multiple CCRUs using direct-expansion refrigeration technology with a delivered capacity of 1 MW heat-rejection, plus standby capability
- Uprating of power capability - 1 MVA capacity through the UPS to be available approximately end of January 2003
- Security
- Reviewed in May 2002 - all brokers' recommendations accepted (access control, new doors/locks, intruder alarms, CCTV, fencing, removal of fire hazards)
- Re-reviewed in September 2002 - underwriters' recommendations under discussion (off-site CCTV and intruder alarm monitoring / VESDA fire system, additional detectors, roof)
48Programme Timetable - January 2002
- Actual contract signature - July 12th 2002
- 3TF build - October 4th 2002
- Service start date - 1st December 2002
49HPC(x) Daresbury Lab Room - recommendations
- 1. Tile flowrates look good.
- 2. I would move the p575 to the middle and not over to one side as it is shown now. This should reduce short-circuiting of cold air back to the CRAC units.
- 3. Can you make the cold aisle width for the p575s 3 tiles wide? This would help if you have room.
- 4. You may have some hot spots at the ends of the rows since some of those tiles are close to the CRAC units. You may need to play with the tile layout some when you get this installed.
50IBM Installation - PHASE 2a Scheme - Technology Refresh reduced from 52 No to 25 No
51Project Development - Building Management Systems
- Manage and monitor all our critical building services using highly sophisticated Building Management Systems (BMS).
- These allow our on-site facilities management team to monitor all key parameters - from power supply to security access.
- audio and visual alarms in the event of discrepancies from the 'norm' (a threshold-alarm sketch follows this list)
- customer infrastructure interface to monitor individual room/suite systems
- monitoring and management of power, cooling and humidity
- power monitoring for consumption statistics and billing
- leak detection from cooling systems
- generation of system performance and facility data
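A minimal sketch of the kind of threshold check a BMS applies before raising those alarms; the sensor names and setpoints are illustrative assumptions, not the site's configuration:

```python
# Illustrative BMS-style threshold check (names and setpoints assumed)
LIMITS = {
    "supply_air_temp_c": (15.0, 25.0),
    "relative_humidity_pct": (40.0, 60.0),
    "ups_load_pct": (0.0, 80.0),
}

def check(readings: dict[str, float]) -> list[str]:
    """Return alarm messages for readings outside their limits."""
    alarms = []
    for name, value in readings.items():
        lo, hi = LIMITS[name]
        if not lo <= value <= hi:
            alarms.append(f"ALARM: {name} = {value} outside [{lo}, {hi}]")
    return alarms

print(check({"supply_air_temp_c": 27.5,
             "relative_humidity_pct": 55.0,
             "ups_load_pct": 62.0}))
```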
52Project Development - Fire Detection and Suppression
- All fire detection and suppression systems are built to N+1.
- If a fire breaks out, these will react rapidly to minimise the impact and reduce the chance of it spreading to other areas.
- As standard, we provide three-stage detection systems in plant and technical areas, and fire detection in every room, below raised floors and in ceiling voids. Other key features include
- VESDA (Very Early Smoke Detection Apparatus) systems
- environmentally-friendly gas suppression systems using Argonite or Inergen
- gas and smoke extraction in conjunction with pressure relief systems
- fire alarms and wet-pipe sprinkler systems in ancillary areas
- fire detection and suppression systems linked to BMS
- on-site 24x7 monitoring
53VESDA System
54FM 200 Fire Suppressant
55Project Development - Security
- Security is a top priority, but we also recognise that security requirements must be balanced against your need for quick, easy access to your HPC and IT equipment.
- Our security strategy maximises security without compromising convenience.
- Key features of our multi-level security infrastructure include
- door access controls at main site and building entrances
- proximity activation cards to authorise access levels
- movement logs on all proximity card usage
- internal and external CCTV cameras and digital image archiving
- internal and external intruder detection devices
- vehicle entrance barriers with vehicle number plate recognition and secure loading bays
- strict policies on handling customers' postal packages
- security systems linked to central BMS
- 24x7 monitoring by dedicated security teams
- Security extends beyond our site to include wider security authorities such as the local police and estate security. We exchange information and work together to counter potential threats.
56How and why does Mission Critical cooling differ from common air conditioners?
- Today's HPC machine rooms require precise, stable environments in order for sensitive equipment to operate optimally. Standard comfort cooling is ill-suited to this application and, if applied, would lead to system shutdowns and component failures. Applying air conditioning to HPC requires large volumes of cold air to be shifted to take the heat away from the processors and provide environmental stability, whilst minimising business downtime.
57HPC(x) Daresbury Lab Room - simulated layout with TileFlow vs actual layout provided (Canatal 9AD26 CRAC units, cold aisle, 40 perforated tiles, 25 sq in cable openings)
58HPC(x) Daresbury Lab Room - isometric view (18-inch raised floor)
59HPC(x) Daresbury Lab Room flowrates
60HPC(x) Daresbury Lab Room velocity and pressure
distribution
61Typical Hot and Cold Aisle Arrangement
- All hot air is contained within the hot aisles, keeping all the inlet temperatures at around 20ºC even at 1.8 m (42U)
62HPC(x) Daresbury Lab Room - row 19 perforated tile distribution
Air flow drops near the CRAC units
63A CFD Model to show the typical movement of air
within a standard server rack.
64A CFD Model to show the unpredictable movement of
air within a typical HPC centre.
65Mechanical Services - cooling and BMS monitoring
- Today, cooling a computer room is not enough.
- Since it is the processors in the HPC that represent the actual load, airflow cooling is the challenge.
- This means carefully considering air distribution - like the choice of racks and the layout of the room.
- What can you do to solve your cooling problems, including energy use?
- Air Distribution
- In-Row precision air conditioning
- ISX High Density Solutions
66Lessons Learnt - 1
- Infrastructure projects have a long lead time. Leave a sensible time for installation of infrastructure. Machine rooms are not just waiting for an order to be placed. HPCY will require a major new investment.
- Don't change installation requirements on the fly (e.g. 1TF to 3TF) - make sure that the infrastructure people are at the negotiating table. Now 12TF with HPCx.
- Refurbishment costs are very difficult to quantify up front against performance requirements and require extensive contingency.
- Computing technology changes quickly - infrastructure requirements for current technology may be radically different from the infrastructure required in 4 years' time (e.g. change from air-cooling to water-cooling, inert gas to water suppression systems) - don't box yourself in.
67Lessons Learnt - 2
- Electricity is a significant fraction of capital project cost. A commodity, but subject to large fluctuations in supply, demand and price.
- Don't underestimate the cost of insurance - since 9/11, insurers are reluctant to take any risks, in particular with respect to fire, theft and professional indemnity for large scale installations.
- Don't forget Education and Training - vital if the project is to succeed in its implementation phases.
- Insist on a full user acceptance test before the system leaves the factory.
- Insist on full service documentation covering systems architecture / software system image / operating procedures / support processes.
68The Big Issues - and ?
- The compaction of HPC and Information Technology equipment, and simultaneous increases in processor power consumption, are creating challenges for designers and project managers in ensuring adequate distribution of cool air, removal of hot air, and sufficient cooling capacity.
- Ref: next presentation by Robin Stone
- Energy and energy recovery
- Cost in use?
- How business critical?
- N+1?
- Can you just provide sufficient power to take the processors down without damage?
- Etc.?
69The world's fastest commercial supercomputer has been launched by computer giant IBM.
- Blue Gene/P is three times more potent than the current fastest machine, Blue Gene/L, also built by IBM.
- The latest number cruncher is capable of operating at so-called "petaflop" speeds - the equivalent of 1,000 trillion calculations per second.
- Approximately 100,000 times more powerful than a PC, the first machine has been bought by the US government.
- It will be installed at the Department of Energy's (DOE) Argonne National Laboratory in Illinois later this year.
- Two further machines are planned for US laboratories and a fourth has been bought by the UK Science and Technology Facilities Council for its Daresbury Laboratory in Cheshire.
- The ultra-powerful machines will be used for complex simulations to study everything from particle physics to nanotechnology.
70Daresbury Laboratory - Blue Gene/P - Petascale and beyond