1
Infrastructure for Terascale Computing and Beyond
Henry Gun-Why
Daresbury Science and Innovation Campus
Friday 27th July 2007
2
  • National Campus - one of only two in the country
  • Fourth Generation Light Source (4GLS)
  • Only one in the world
  • Home to the National Supercomputer
  • One of the top two computational science facilities
  • National Centre for Electron Spectroscopy and Surface Analysis
  • Cockcroft Institute - National Centre for Accelerator Science

3
Daresbury - HPCx: "New 3.5 trillion-a-second computer to unlock secrets" - The Guardian
  • Tim Radford, science editor, Tuesday December 31, 2002
  • Scientists have just switched on the most
    powerful academic computer outside the United
    States.
  • The HPCx at Daresbury laboratory in Warrington,
    Cheshire - the main lab of the government-funded
    research council - has enough electronic
    brainpower to complete a full year's maths
    homework and classwork for every child in Britain
    in one fifth of a second.
  • When in full service, the £53m network of 1,280 powerful processors running in parallel will be able to undertake the hitherto undreamed of: model the air turbulence behind a jet as it lands, or the machinery of a cell as it manages the business of life.
  • The Daresbury installation ranks ninth in the
    world's top 500 supercomputer league table, and
    can handle 3.5 trillion number-crunching
    operations a second.
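A rough, hedged sanity check of the newspaper's arithmetic; the number of school-age children and the operations-per-child figure below are illustrative assumptions, not data from the article:

```python
# Back-of-envelope check of the "year's maths homework in one fifth of a second" claim.
# Child count and per-child operation count are assumptions for illustration only.
peak_flops = 3.5e12   # HPCx peak quoted in the article: 3.5 trillion operations per second
window_s = 0.2        # "one fifth of a second"
children = 10e6       # assumed number of school-age children in Britain

ops_available = peak_flops * window_s
print(f"Operations available in 0.2 s: {ops_available:.2e}")
print(f"Operations per child: {ops_available / children:,.0f}")
# ~70,000 operations per child per year - a plausible order of magnitude for the claim.
```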

4
Definitions - teraflop
  • A teraflop is a measure of a computer's speed and can be expressed as:
  • a trillion floating-point operations per second
  • 10 to the 12th power floating-point operations per second
  • approximately 2 to the 40th power FLOPS
  • Today's fastest parallel computers are capable of teraflop speeds. Scientists have begun to envision computers operating at petaflop speeds.

5
Definitions - petaflop
  • A petaflop is a measure of a computer's processing speed and can be expressed as:
  • a thousand trillion floating-point operations per second (FLOPS)
  • a thousand teraflops
  • 10 to the 15th power FLOPS
  • approximately 2 to the 50th power FLOPS
  • Today's fastest parallel computers are capable of teraflop speeds.
  • A petaflop computer would require a massive number of computers working in parallel on the same problem.
  • Applications might include real-time nuclear magnetic resonance imaging during surgery, computer-based drug design, astrophysical simulation, the modelling of environmental pollution, and the study of long-term climate change.
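A minimal sketch (plain Python, nothing assumed beyond the definitions above) making the tera/peta relationships concrete and showing why the powers of two are only approximations:

```python
TERAFLOP = 1e12   # floating-point operations per second
PETAFLOP = 1e15

print(PETAFLOP / TERAFLOP)      # 1000.0 - a petaflop is a thousand teraflops
print(2**40 / TERAFLOP)         # ~1.100 - 2^40 slightly overshoots 10^12
print(2**50 / PETAFLOP)         # ~1.126 - 2^50 slightly overshoots 10^15

# Time to complete a fixed job of 1e18 operations, assuming ideal scaling:
work = 1e18
print(f"{work / 3.5e12 / 3600:.1f} hours at 3.5 teraflops (HPCx)")
print(f"{work / PETAFLOP / 60:.1f} minutes at 1 petaflop")
```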

6
Presentation Overview - High Performance Computing (HPCx)
  • Development of the infrastructure provision for the HPCx facility on the Daresbury Science and Innovation Campus, and a look forward to the next generation of petaflop systems
  • Costs and VAT - £53m
  • Programme - Nov 2002
  • Building Construction - existing and ancillary plant areas
  • Electrical Services, supplies and resilience - Pillar UK
  • Mechanical Services - cooling and BMS monitoring
  • Insurance - leading-edge facility
  • Security - fire, intrusion detection, CCTV and access
  • Lessons Learnt
  • The Big Issue
  • Petascale and beyond

7
Daresbury - HPCx Collaboration
8
Project Development - N+1
  • The N+1 philosophy underpins resilience, minimising the risk of business disruption.
  • This means we provide one standby unit in addition to the normal (N) operational requirement (a small sizing sketch follows this list).
  • Where one primary mains power feed is the norm, two separate feeds are provided; where one standby generator unit is required, an additional unit provides backup.
  • N+1 as a minimum is standard across our critical service systems and equipment, including:
  • primary HV power
  • generators and Uninterruptible Power Supplies (UPS)
  • A/B electrical switchboards
  • chillers
  • pumps
  • air handling equipment
  • fire detection and suppression
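A minimal sketch of the N+1 arithmetic described above; the chiller capacity used in the example is an illustrative assumption, not the actual Daresbury plant figure:

```python
import math

def units_required(load_kw: float, unit_capacity_kw: float, standby: int = 1) -> int:
    """Units needed to carry the load (N) plus the standby allowance (+1 by default)."""
    return math.ceil(load_kw / unit_capacity_kw) + standby

# Illustrative example: 960 kW of heat rejection met by 250 kW chillers.
print(units_required(960, 250))              # N = 4, so N+1 means 5 chillers installed
print(units_required(960, 250, standby=2))   # an N+2 scheme would need 6
```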

9
2 No. 2 MVA transformers
10
Project Development - Power
  • Power
  • Resilience depends on diverse and durable supplies of power, designed to minimise the risk of power failure and to be responsive to individual power requirements. Even during prolonged blackouts, it may be critical to keep HPC systems up and running.
  • Standard features include:
  • primary HV power provision
  • LV switchboards for power provision to the room
  • on-site sub-stations
  • standby generators with a minimum of 24 hours' autonomy at full capacity (a fuel estimate follows this list)
  • Uninterruptible Power Supplies (UPS)
  • sufficient fuel on site to run for days, together with priority re-fuelling contracts
  • planned preventative maintenance programmes
  • process-oriented engineers and on-site controllers available 24x7
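A hedged estimate of the fuel implied by "24 hours autonomy at full capacity"; the 1 MW load and the specific consumption figure are assumptions chosen for illustration, not plant data:

```python
def generator_fuel_litres(load_kw: float, hours: float, litres_per_kwh: float = 0.28) -> float:
    """Approximate diesel burned by a standby generator set.
    0.28 L/kWh is a typical rule-of-thumb figure for a large genset near full load."""
    return load_kw * hours * litres_per_kwh

# Illustrative example: 1 MW of critical load for the 24-hour minimum autonomy.
print(f"{generator_fuel_litres(1000, 24):,.0f} litres")   # roughly 6,700 litres on site
```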

11
Rotary UPS and Power Supplies Layout
12
UPS being craned into position
13
Electrical Services, supplies and resilience
  • UPS Systems
  • Technical factors for consideration:
  • Choice: static versus rotary
  • Types of UPS
  • Reliability
  • Summary

14
Types of UPS Technologies - Series On-line
15
Reliability - Types of UPS Technologies
MTBF 75,000 to 100,000 hours
MTBF 610,000 to 700,000 hours
Source: Electricity in Buildings, CIBSE Guide K, 2004
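To put the MTBF ranges above in context, a small sketch converting MTBF and an assumed 8-hour repair time (MTTR) into steady-state availability; the MTTR is an illustrative assumption:

```python
HOURS_PER_YEAR = 8766  # average year, including leap years

def availability(mtbf_hours: float, mttr_hours: float = 8.0) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

for mtbf in (75_000, 100_000, 610_000, 700_000):
    a = availability(mtbf)
    downtime_min = (1 - a) * HOURS_PER_YEAR * 60
    print(f"MTBF {mtbf:>7,} h -> availability {a:.5f}, ~{downtime_min:.0f} min downtime/year")
```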
16
Reliability - Number of Components
17
Performance Characteristics - Harmonic and Galvanic Isolation
  • IT loads produce harmonics
  • Harmonics on the input of the UPS must be attenuated to approximately 5% (see the sketch after this list)
  • In Data Centre design it is preferable to
    electrically isolate IT loads from mechanical
    loads
  • Galvanic isolation of IT loads for optimum
    earthing
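A small sketch of the total harmonic distortion (THD) figure referred to above, computed from an assumed set of input-current harmonics (the values are purely illustrative):

```python
import math

def thd_percent(fundamental: float, harmonics: list[float]) -> float:
    """Total harmonic distortion relative to the fundamental, as a percentage."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental

# Illustrative residual harmonic currents (amps): 3rd, 5th, 7th and 11th after filtering.
print(f"{thd_percent(100.0, [3.0, 2.5, 2.0, 1.5]):.1f}% THD")   # ~4.6% - within the ~5% target
```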

18
Types of UPS Technologies - Technology Summary
19
UPS Control Panel
20
UPS Battery string
21
Rotary UPS
22
Rotary UPS
23
Uni Block Rotary UPS
24
UPS and Switchgear housing
25
UPS and Switchgear housing
26
UPS and Switchgear housing
27
UPS and Switchgear housing slab
28
Switchgear delivery
29
Switchgear
30
Power Distribution
31
Fire Alarm Panels
32
Cable Management in Sub-floor
33
Computer Room Refurbishment - 31st May 2002
Remove kit/ partitions/ ceiling
30 years of history - remove air-conditioning
vents, electrical services and fire-suppressant
34
Computer Room Refurbishment - 10th June 2002
Remove floor and services - think about girder!
35
Computer Room Refurbishment - 30th June 2002
Room stripped back to shell - start to remove
girder
36
Computer Room Refurbishment - 12th July 2002
Girder removed - spur walls trimmed back
37
Computer Room Refurbishment - August/September 2002
Sockets/ underfloor power distribution/ yellow
spots for tile pedestals/ trays for fibre channel
38
Transition to Service - Installation and Commissioning - 2002
  • 4-9th Oct delivery of hardware
  • 10-17th Oct power-up tests. IBM internal
    diagnostics. Computer room power and
    air-conditioning commissioning.
  • 18-22nd Oct Software reinstall
  • Logical partition O/S
  • Parallel System Support Program
  • General Parallel File System
  • LoadLeveller - batch configuration
  • Tivoli Storage Manager
  • High Availability Cluster Multi-processing

39
Vigorous Future Plans - Nov 2002
  • Complete fire and security insurance enhancements - November 2002
  • Explore Gang Scheduling - January 2003
  • Complete Standard Operating Procedures - February 2003
  • UPS commissioning / Move Network - February 2003
  • GPFS Swap / Switch Plane Optimisation - March 2003
  • Development System - separate development system to manage evaluation of new software releases - April 2003

40
Cold Aisle
41
IBM Installation - PHASE 1
42
IBM Installation - PHASE 2
43
Project Development - Controlled Environment
  • Infrastructure must be designed to cope with
    environmental and cooling requirements - even the
    extraordinary demands of the latest high density
    server technologies.
  • maximise equipment layout
  • balance air flows where necessary
  • deliver all cooling and environmental
    requirements using modelling
  • primary cooling infrastructure, centrally managed and linked to the BMS. Other key features include:
  • Room Air Conditioning Units (RACUs)
  • regulated humidity within a constant range
  • HVAC system with minimum impact on the
    environment
  • Down Flow Air Conditioning Units
  • adjustable floor grills to achieve balanced air
    flows
  • planned preventative maintenance and 24x7 on-site
    engineers
  • critical Service Level Agreements (SLAs) on cooling provision

44
HPCx Project Outline of Requirements
  • Computer Room Infrastructure at DL - January
    2001
  • HPCx Core Programme Refurbishment Plan
  • Timetable - original/ revised
  • IBM Phase 1/2/3 installation floor plans - note: Phase 2a added later
  • Computer Room Refurbishment
  • Photographs
  • Transition to Service - timetable
  • installation and commissioning
  • acceptance and early user service
  • Systems Management
  • - capability computing
  • Reliability, Availability and Serviceability
    Management

45
HPCx Computing Requirements
  • Phase 1 - 40 Regatta H compute frames, 2 I/O frames; 400 kVA power, 400 kW cooling, 2,160 square feet
  • Phase 2 - 48 Regatta H compute frames, 3 I/O frames, Federation switch; 500 kVA power, 500 kW cooling, 2,160 square feet
  • Phase 3 - 96 Regatta H compute frames, 4 I/O frames, alternative technologies (IPF, BlueGene/L?); 960 kVA power, 960 kW cooling, 3,769 square feet
  • 15" under-floor gap for 26 miles of cabling!
  • air floor grills to enhance air-flow (a power-density estimate follows this list)
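A quick check of the floor power density implied by the phase figures above (plain arithmetic from the slide; the W/m² conversion is added for convenience):

```python
SQFT_PER_SQM = 10.7639

phases = {
    "Phase 1": (400, 2160),   # (kW of cooling, square feet)
    "Phase 2": (500, 2160),
    "Phase 3": (960, 3769),
}

for name, (kw, sqft) in phases.items():
    w_per_sqft = kw * 1000 / sqft
    print(f"{name}: {w_per_sqft:.0f} W/sq ft (~{w_per_sqft * SQFT_PER_SQM:.0f} W/m^2)")
```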

46
Computer Room Infrastructure at DL - January 2001
  • Designated site for HPC-X primary operations
    (Q2/02)
  • Approx 5000 sq ft of space
  • Built in the 1960s to house IBM mainframes and the Cray-1
  • multiple floor levels/ small underfloor voids -
    largely used for communications equipment and
    various servers
  • Power/cooling/fire-protection
  • 200 kVA / 200 kW - sufficient for existing requirements
  • HALON 1301 system in need of upgrade with modern
    environmentally-acceptable gases
  • Security
  • sufficient for existing programme but a review
    was required
  • CLRC invested £1M of its infrastructure budget to refurbish the computer room (floor / more power and air-conditioning)

47
HPCx Core Programme Refurbishment Plan
  • Replacement of computer room floor - uniform depth (18") and bonded-stringer construction
  • Fire-protection - new detection / FM200 suppression system
  • Cooling - multiple CCRUs using direct-expansion refrigeration technology with a delivered capacity of 1 MW heat-rejection, plus standby capability
  • Uprating of power capability - 1 MVA capacity through the UPS, to be available approximately end of January 2003
  • Security
  • Reviewed in May 2002 - all brokers' recommendations accepted (access control, new doors/locks, intruder alarms, CCTV, fencing, removal of fire hazards)
  • Re-reviewed in September 2002 - underwriters' recommendations under discussion (off-site CCTV and intruder alarm monitoring / VESDA fire system, additional detectors, roof)

48
Programme Timetable - January 2002
Actual contract signature: July 12th 2002
3TF build: October 4th 2002
Service start date: 1st December 2002
49
HPC(x) Daresbury Lab Room - recommendations
  • 1. Tile flowrates look good.
  • 2. I would move the p575 to the middle and not over to one side as it is shown now. This should reduce short-circuiting of cold air back to the CRAC units.
  • 3. Can you make the cold aisle width for the p575s 3 tiles wide? This would help if you have room.
  • 4. You may have some hot spots at the ends of the rows, since some of those tiles are close to the CRAC units. You may need to play with the tile layout some when you get this installed.

50
IBM Installation - PHASE 2a Scheme - Technology Refresh (reduced from 52 No. to 25 No.)
51
Project Development Building Management Systems
  • Manage and monitor all our critical building
    services using highly sophisticated Building
    Management Systems (BMS).
  • These allow our on-site facilities management
    team to monitor all key parameters - from power
    supply to security access.
  • audio and visual alarms in the event of
    discrepancies from the 'norm'
  • customer infrastructure interface to monitor
    individual room/suite systems
  • monitoring and management of power, cooling and humidity (a simple limit-check sketch follows this list)
  • power monitoring for consumption statistics and
    billing
  • leak detection from cooling systems
  • generation of system performance and facility
    data
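A minimal sketch of the kind of limit check a BMS applies to monitored parameters; the set-points and readings below are illustrative assumptions, not the Daresbury configuration:

```python
# Allowed bands for a few monitored parameters (illustrative values only).
LIMITS = {
    "supply_air_temp_c": (18.0, 24.0),
    "relative_humidity_pct": (40.0, 60.0),
    "ups_load_pct": (0.0, 80.0),
}

def check(readings: dict[str, float]) -> list[str]:
    """Return an alarm message for every reading outside its allowed band."""
    alarms = []
    for name, value in readings.items():
        low, high = LIMITS[name]
        if not (low <= value <= high):
            alarms.append(f"ALARM: {name} = {value} outside {low}-{high}")
    return alarms

print(check({"supply_air_temp_c": 26.5, "relative_humidity_pct": 45.0, "ups_load_pct": 72.0}))
```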

52
Project Development - Fire Detection and Suppression
  • All fire detection and suppression systems are built to N+1.
  • If a fire breaks out, these will react rapidly to minimise the impact and reduce the chance of it spreading to other areas.
  • As standard, we provide three-stage detection systems in plant and technical areas, and fire detection in every room, below raised floors and in ceiling voids. Other key features include:
  • VESDA (Very Early Smoke Detection Apparatus)
    systems
  • environmentally-friendly gas suppression systems
    using Argonite or Inergen
  • gas and smoke extraction in conjunction with
    pressure relief systems
  • fire alarms and wet-pipe sprinkler systems in
    ancillary areas
  • fire detection and suppression systems linked to
    BMS
  • on-site 24x7 monitoring

53
VESDA System
54
FM 200 Fire Suppressant
55
Project Development - Security
  • Security is a top priority, but we also recognise that security requirements must be balanced against your need for quick, easy access to your HPC and IT equipment.
  • Our security strategy maximises security without compromising convenience.
  • Key features of our multi-level security infrastructure include:
  • door access controls at main site and building
    entrances
  • proximity activation cards to authorise access
    levels
  • movement logs on all proximity card usage
  • internal and external CCTV cameras and digital
    image archiving
  • internal and external intruder detection devices
  • vehicle entrance barriers with vehicle number
    plate recognition and secure loading bays
  • strict policies on handling customers' postal packages
  • security systems linked to central BMS
  • 24x7 monitoring by dedicated security teams
  • Security extends beyond our site to include wider
    security authorities such as the local police and
    estate security. We exchange information and work
    together to counter potential threats.

56
How and why does Mission Critical cooling differ from common air conditioning?
  • Today's HPC machine rooms require precise, stable environments in order for sensitive equipment to operate optimally. Standard comfort cooling is ill-suited for these applications and, if applied, would lead to system shut-downs and component failures. The application of air conditioning to HPC requires lots of cold air to be shifted to take the heat away from the processors and provide environmental stability, whilst minimising business downtime.

57
HPC(x) Daresbury Lab Room
Simulated Layout with Tileflow
Actual Layout Provided
Canatal 9AD26 CRAC Units
Cold Aisle
40 Perf Tiles
25 sq in. Cable Opening
58
HPC(x) Daresbury Lab Room isometric view
18 inch raised floor
59
HPC(x) Daresbury Lab Room flowrates
60
HPC(x) Daresbury Lab Room velocity and pressure
distribution
61
Typical Hot and Cold Aisle Arrangement
  • All hot air is contained within the hot aisles, keeping all the inlet temperatures at around 20°C even at 1.8 m (42U) - an airflow estimate follows below
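A hedged sketch of the airflow needed to hold rack inlet temperatures around 20°C; the 20 kW rack power and the 12 K air temperature rise are assumptions chosen for illustration:

```python
RHO_AIR = 1.2    # kg/m^3, roughly the density of air at 20 C
CP_AIR = 1.005   # kJ/(kg.K), specific heat capacity of air

def airflow_m3_per_s(heat_kw: float, delta_t_k: float) -> float:
    """Volume flow needed to remove heat_kw with a delta_t_k rise in air temperature."""
    return heat_kw / (RHO_AIR * CP_AIR * delta_t_k)

# Illustrative example: a 20 kW rack with a 12 K rise across the servers.
flow = airflow_m3_per_s(20.0, 12.0)
print(f"{flow:.2f} m^3/s (~{flow * 3600:.0f} m^3/h per rack)")
```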

62
HPC(x) Daresbury Lab Room row 19 perf tile
distribution
Air flow drops near CRAC units
63
A CFD Model to show the typical movement of air
within a standard server rack.
64
A CFD Model to show the unpredictable movement of
air within a typical HPC centre.
65
Mechanical Services - cooling and BMS
monitoring
  • Today, cooling a computer room is not enough.
  • Since it's the processors in the HPC that represent the actual load, airflow cooling is the challenge.
  • This means carefully considering air distribution, like the choice of racks and the layout of the room.
  • What can you do to solve your cooling problems, including energy?
  • Air Distribution
  • In-Row precision air conditioning
  • ISX High Density Solutions

66
Lessons Learnt - 1
  • Infrastructure projects have a long lead time. Leave a sensible time for the installation of infrastructure - machine rooms are not just waiting for an order to be placed. HPCy will require a major new investment.
  • Don't change installation requirements on the fly (e.g. 1TF to 3TF) - make sure that the infrastructure people are at the negotiating table. Now 12TF with HPCx.
  • Refurbishment costs are very difficult to quantify up front against performance requirements and require extensive contingency.
  • Computing technology changes quickly - infrastructure requirements for current technology may be radically different from the infrastructure required in 4 years' time (e.g. a change from air-cooling to water-cooling, or from inert-gas to water suppression systems) - don't box yourself in.

67
Lessons Learnt - 2
  • Electricity is a significant fraction of capital project cost. It is a commodity, but one subject to large fluctuations in supply, demand and price.
  • Don't underestimate the cost of insurance - since 9/11, insurers are reluctant to take any risks, in particular with respect to fire, theft and professional indemnity for large-scale installations.
  • Don't forget Education and Training - vital if the project is to succeed in its implementation phases.
  • Insist on a full user acceptance test before the system leaves the factory.
  • Insist on full service documentation covering systems architecture / software system image / operating procedures / support processes.

68
The Big Issues - and ...?
  • The compaction of HPC and Information Technology equipment, and simultaneous increases in processor power consumption, are creating challenges for designers and project managers in ensuring adequate distribution of cool air, removal of hot air and sufficient cooling capacity.
  • Ref: next presentation by Robin Stone
  • Energy and energy recovery
  • Cost in use?
  • How business critical?
  • N+1?
  • Can you just provide sufficient power to take the processors down without damage?
  • Etc.?

69
The world's fastest commercial supercomputer has
been launched by computer giant IBM.
  • Blue Gene/P is three times more potent than the
    current fastest machine, BlueGene/L, also built
    by IBM.
  • The latest number-cruncher is capable of operating at so-called "petaflop" speeds - the equivalent of 1,000 trillion calculations per second.
  • Approximately 100,000 times more powerful than a
    PC, the first machine has been bought by the US
    government.
  • It will be installed at the Department of
    Energy's (DOE) Argonne National Laboratory in
    Illinois later this year.
  • Two further machines are planned for US laboratories, and a fourth has been bought by the UK Science and Technology Facilities Council for its Daresbury Laboratory in Cheshire.
  • The ultra powerful machines will be used for
    complex simulations to study everything from
    particle physics to nanotechnology.

70
Daresbury Laboratory Blue Gene/P Petascale
and beyond