Title: Next decade in supercomputing
1Next decade in supercomputing
- José M. Cela
- Director, CASE Department
- BSC-CNS
- josem.cela_at_bsc.es
2Talk outline
- Supercomputing from the past…
- Architecture evolution
- Applications and algorithms
- …Supercomputing for the future
- Technology trends
- Multidisciplinary top-down approach
- BSC-CNS activities
- Conclusions
3Once upon a time: ENIAC, 1946
ENIAC, 1946, Moore School: 18,000 vacuum tubes, 70,000 resistors and 5 million soldered connections. Power consumption: 140 kW. Dimensions: 8x3x100 feet. Weight: > 30 tons. Compute capacity: 5,000 additions and 360 multiplications per second.
4Technological Achievements
- Transistor (Bell Labs, 1947)
- DEC PDP-1 (1957)
- IBM 7090 (1960)
- Integrated circuit (1958)
- IBM System 360 (1965)
- DEC PDP-8 (1965)
- Microprocessor (1971)
- Intel 4004
- 2,300 transistors
- Could access 300 bytes of memory
5Technology Trends: Microprocessor Capacity
Moore's Law: 2x transistors per chip every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful. And not just processors: bandwidth, storage, etc.
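Read as a formula, the doubling rule is plain exponential growth; a small sketch, anchored (for illustration only) at the 4004's 2,300 transistors of 1971:

```python
def transistors(year, base_count=2300, base_year=1971, doubling_years=1.5):
    """Projected transistor count under the 'double every 18 months'
    reading of Moore's law. The 4004 anchor point is illustrative."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

# three years later -> two doublings -> 4x the anchor count
projected_1974 = transistors(1974)
```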
6Pipeline (H. Ford)
7DRAM access bottleneck
- Not everything is scaling up equally fast
- DRAM access speed has hardly improved
8Latencies and Pipelines
9Hybrid SMP-cluster parallel systems
- Most modern high-performance computing systems are clusters of SMP nodes (performance/cost trade-off)
- MPI parallel level
- Threads (OpenMP) parallel level
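The two parallel levels above can be sketched in plain Python purely as an analogy: a loop over disjoint partitions stands in for the MPI ranks, and a thread pool inside each partition stands in for the OpenMP threads. This illustrates the decomposition only, not real MPI/OpenMP code; the function names are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def node_sum(chunk, threads=4):
    # inner, "OpenMP-like" level: threads share one node's chunk
    step = max(1, len(chunk) // threads)
    pieces = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(sum, pieces))

def cluster_sum(data, nodes=4):
    # outer, "MPI-like" level: disjoint partitions, one per rank;
    # a real code would exchange halos and reduce with MPI_Allreduce
    step = max(1, len(data) // nodes)
    partials = [node_sum(data[i:i + step]) for i in range(0, len(data), step)]
    return sum(partials)  # stands in for the final MPI reduction
```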
10TOP500
11TOP500
12Technology Outlook
- High-volume manufacturing year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
- Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
- Integration capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
- Delay (CV/I) scaling: 0.7, 0.7, >0.7 — delay scaling will slow down
- Energy/logic-op scaling: >0.35, >0.5, >0.5 — energy scaling will slow down
- Bulk planar CMOS: high probability, moving to low probability over the decade
- Alternate (3G etc.): low probability, moving to high probability
- Variability: medium, then high, then very high
- ILD (K): ~3, <3, reducing slowly towards 2-2.5
- RC delay: ~1 across generations
- Metal layers: 6-7, 7-8, 8-9, then +0.5 to 1 layer per generation
Shekhar Borkar, Micro37
13Increasing CPU performance: a delicate balancing act
Increasing the number of gates on a chip while decreasing the cycle time of the processor. We have seen an increasing number of gates per chip and increasing clock speeds, but heat is becoming an unmanageable problem (Intel processors > 100 Watts). We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.
14Moore's law
15Multicore chips
17ORNL Computing Power and Cooling 2006-2011
- Immediate need to add 8 MW to prepare for 2007 installs of new systems
- NLCF petascale system could require an additional 10 MW by 2008
- Need a total of 40-50 MW for projected systems by 2011
- Numbers are just for the computers; add ~75% for cooling
- Cooling will require 12,000-15,000 tons of chiller capacity
Cost estimates based on $0.05 per kWh.
Data taken from the Energy Management System-4 (EMS4), the DOE corporate system for collecting energy information from the sites. EMS4 is a web-based system that collects energy consumption and cost information for all energy sources used at each DOE site. Information is entered into EMS4 by each site and reviewed at Headquarters for accuracy.
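As a back-of-the-envelope check of the scale involved — assuming, hypothetically, that the 40 MW IT load runs continuously all year, with the ~75% cooling overhead and the $0.05/kWh rate from the slide:

```python
def annual_power_cost(load_mw, price_per_kwh=0.05, cooling_overhead=0.75):
    """Rough annual electricity cost for a machine room.

    Assumes the IT load runs 24x365 and that cooling adds a fixed
    fraction on top of the IT power (75% per the slide)."""
    hours_per_year = 24 * 365
    it_kwh = load_mw * 1000 * hours_per_year
    total_kwh = it_kwh * (1 + cooling_overhead)
    return total_kwh * price_per_kwh

cost_40mw = annual_power_cost(40)  # about $30.7M per year with cooling
```

At 40 MW this comes to roughly $30M per year including cooling, which makes clear why energy dominates the planning on the following slides.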
18View from the Computer Room
19How to reduce energy but not performance?
- Reduce the amount of DRAM memory per core and redesign everything for energy saving
- The Blue Gene solution
- Eliminate cache coherency in a multicore chip and use accelerators instead of general-purpose cores
- The Cell/B.E. solution
- The GPU solution
- The FPGA solution
20Blue Gene/P
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
- Chip: 4 processors; 13.6 GF/s, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 (or 4.0) GB DDR; supports 4-way SMP
- Node card: 32 compute cards; 435 GF/s, 64 GB
- Rack: 32 node cards (1024 chips, 4096 procs); 14 TF/s, 2 TB
- System: 72 racks, cabled 8x8x16; final system 1 PF/s, 144 TB (November 2007: 0.596 PF/s)
- HPC SW: Compilers, GPFS, ESSL, LoadLeveler
- Front-end node / service node: JS21 / Power5, Linux SLES10
21Cell Broadband Engine architecture
235 Mtransistors, 235 mm²
22Cell Broadband Engine Architecture (CBEA) Technology Competitive Roadmap
2006-2010:
- Cell BE (1+8), 90 nm SOI
- Cost reduction: Cell BE (1+8), 65 nm SOI
- Advanced Cell BE (1+8 eDP SPE), 65 nm SOI
- Next gen (2 PPE + 32 SPE), 45 nm SOI, 1 TFlop (est.); performance enhancements/scaling
All future dates and specifications are estimations only, subject to change without notice. Dashed outlines indicate concept designs.
23First PetaFlop computer (Nov. 2008): Roadrunner at LANL
- 7,000 dual-core Opterons → 50 TeraFlop/s (total)
- 13,000 eDP Cell chips → 1.4 PetaFlop/s (Cell)
- Connected Unit cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
24How are we going to program it?
- The MPI layer will continue
- Hybrid codes will be mandatory, if only for load balancing
- OpenMP on homogeneous processors
- But with heterogeneous processors:
- OpenCL
- CUDA
- …
- SIMD code should be provided by the compiler
26Barcelona Supercomputing Center / Centro Nacional de Supercomputación
- Mission
- Investigate, develop and manage technology to facilitate the advancement of science.
- Objectives
- Operate the national supercomputing facility
- R&D in supercomputing
- Collaborate in R&D for e-Science
- Public consortium
- the Spanish Government (MEC): 51%
- the Catalan Government (DURSI): 37%
- the Technical University of Catalonia (UPC): 12%
27Location
29Blades, blade center and racks
30Network: Myrinet
- 10 Clos 256x256 switches; each Clos has 256 links (1 to each node), 250 MB/s in each direction
- 2 Spine 1280 switches; 128 links connect the Clos level to the spines
(Figure: fat-tree diagram, nodes 0-255 per Clos switch.)
31MareNostrum
- 2560 JS21 blades
- 2 PowerPC 970MP, 2.3 GHz
- 8 Gigabytes of memory (20 TB total)
- 36 Gigabytes SAS HD
- Myrinet daughter card
- 2x1 Gb Ethernet on board
- Myrinet
- 10 Clos 256x256 switches
- 2 Spine 1280 switches
- 20 storage nodes
- 2 P615, 2 Power4, 4 GigaBytes
- 28 SATA disks of 512 GBytes (280 TB total)
- Performance summary
- 4 instructions per cycle, 2.3 GHz
- 10240 processors
- 94.21 TFlops
- 20 TB memory, 300 TB disk
32Additional Systems
- Tape facility
- 6 Petabytes
- LTO4 Technology
- HSM and Backup
- Shared memory system (ALTIX)
- 128 cores Montecito
- 2.5 TByte Main Memory
33Spanish Supercomputing Network
34RES services
- The Red Española de Supercomputación (RES) supercomputers can be accessed free of charge by any public Spanish research group. MareNostrum is the main RES node.
- The web application form and instructions can be found on the web page www.bsc.es (Support & Services / RES)
- An external committee evaluates the proposals
- Access is reviewed every 4 months
- For any question contact the BSC operations director
- Sergi Girona (sergi.girona_at_bsc.es)
35Top500 who is who?
36Can Europe compete?
37ESFRI European Infrastructure Roadmap
- The high-end (capability) resources should be implemented every 2-3 years in a renewal spiral process
- Tier-0 centre total cost over a 5-year period shall be in the range of 200-400 M€
- With supporting actions in the national/regional centers to maintain the transfer of knowledge and feed projects to the top capability layer
38PRACE
(Figure: the PRACE ecosystem — GENCI, the tier-1 layer, with principal, general and associated partners.)
39BSC-IBM MareIncognito project
- Our 10 Petaflop research project for BSC (2011)
- Port/develop applications to reduce time-to-production once installed
- Programming models
- Tools for application development and to support previous evaluations
- Evaluate interconnect options
40BSC Departments
- Computational Mechanics
- Applied Computer Science
- Optimization
41What are the CASE objectives?
- Identify scientific communities with supercomputing needs and help them to develop software
- Material Science (SIESTA)
- Fusion (EUTERPE, EIRENE, BIT1)
- Spectroscopy (OCTOPUS, ALYA)
- Atmospheric modeling (ALYA, WRF)
- Geophysics (BSIT, ALYA)
- Develop our own technology in Computational Mechanics
- ALYA, BSIT, …
- Perform technology transfer with companies
- REPSOL, AIRBUS, …
42Who needs 10 Petaflops?
43Airbus 380 Design
44Seismic Imaging: RTM (REPSOL)
45RTM Performance in Cell
Platform Gflops Power (W) Gflops/W
JS21 8.3 267 0.03
QS20 108.2 315 0.34
QS21 116.6 370 0.32
22.1 GB/s of memory BW used
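The last column of the table is simply the ratio of the first two; a trivial check of the figures (function name invented for the sketch):

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as sustained Gflops divided by measured power."""
    return gflops / watts

# figures from the slide's table: the Cell blades are roughly an
# order of magnitude more power-efficient than the JS21 on RTM
js21 = gflops_per_watt(8.3, 267)
qs20 = gflops_per_watt(108.2, 315)
```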
46ALYA
Computational Mechanics and Design. In-house development. Parallel, coupled multiphysics: fluid dynamics, structure dynamics, heat transfer, wave propagation, excitable media.
47Alya Multiphysics Code
Services
- Optima: optimization
- Dodeme: domain decomposition
- Parall: parallelization
- Solmum: MUMPS sparse direct solver
Kernel
- Mesh
- Coupling
- Solvers
- Input/output
Modules
- Nastin: incompressible Navier-Stokes
- Nastal: compressible Navier-Stokes
- Turbul: turbulence models
- Temper: heat transfer
- Exmedi: excitable media
- Apelme: fracture mechanics
- Solidz: structure dynamics
- Wavequ: wave propagation
- Gotita: droplet impingement (icing)
48ALYA keywords
- Multi-physics modular code for High-Performance Computational Mechanics
- Numerical solution of PDEs
- Variational methods are preferred (FEM)...
- Coupling between multi-physics (loose or strong)
- Explicit and implicit formulations
- Hybrid meshes, non-conforming meshes
- Advanced meshing issues
- Parallelization by MPI + OpenMP
- Automatic mesh partition using Metis
- Portability is a must (compiled on Windows, Linux, MacOS)
- Porting to new architectures: Cell, …
- Scalability tested on
- IBM JS21 blades on MareNostrum (BSC): 10,000 CPUs
- IBM Blue Gene/P and /L (IBM Labs Montpellier and Watson): 4,000 CPUs
- SGI Altix shared memory (BSC, Barcelona): 128 CPUs
- PC clusters: 10-80 CPUs
49Alya speed-up
MareNostrum (IBM blades). Boundary-layer flow, 25M hexahedra.
- NASTAL module: explicit compressible flow, fractional step
- NASTIN module: implicit incompressible flow, fractional step
50CASE R&D Aero-Acoustics
51CASE R&D Automotive
- Ahmed body benchmark
- Wind speed: 120 km/h
52CASE R&D Building Energy
53CASE R&D Aerospace
- Icing Simulation
- Subsonic / Transonic / Supersonic flows
- Adjoint methods in Shape Optimization
54CASE R&D Aerospace
- Subsonic cavity flow (0.82 Mach)
55CASE R&D Free surface problems
56CASE R&D Mesh generation
57CASE R&D Mesh adaptivity
58CASE R&D Atmospheric Flows
- San Antonio Quarter (Barcelona)
59CASE R&D Meteo Mesh
- Surface from topography
- Semi-structured in volume
60CASE R&D Biomechanics
61CASE R&D Biomechanics
62Scalability problems: the deflated PCG
63The deflated PCG
- The mesh partitioner slices arteries → two neighbours per subdomain
- But there exist fat meeting points of arteries → more neighbours
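One common formulation of deflated CG builds a coarse space Z (e.g. one vector per subdomain) and runs CG on the projected operator, adding the coarse component back at the end. The slides do not show Alya's actual implementation, so the sketch below — with `deflated_cg` and its arguments invented for illustration — only shows the projection idea:

```python
import numpy as np

def deflated_cg(A, b, Z, tol=1e-10, maxit=500):
    """Deflated CG sketch: CG on the projected system, coarse part added back.

    A must be symmetric positive definite, Z of full column rank so that
    the coarse matrix E = Z^T A Z is invertible."""
    AZ = A @ Z
    E = Z.T @ AZ
    Einv = np.linalg.inv(E)

    def P(v):
        # deflation projector P = I - A Z E^{-1} Z^T
        return v - AZ @ (Einv @ (Z.T @ v))

    x = np.zeros_like(b)
    r = P(b)                       # projected initial residual (x0 = 0)
    p = r.copy()
    rr = r @ r
    for _ in range(maxit):
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
        Ap = P(A @ p)
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    # recover the full solution: x* = Z E^{-1} Z^T b + P^T x
    Qb = Z @ (Einv @ (Z.T @ b))
    return Qb + x - Z @ (Einv @ (AZ.T @ x))
```

The coarse solve with E is exactly the global step whose small All_reduce messages show up in the traces on the following slides.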
64Parallel footprint
512 proc Efficiency Load balance
Overall 0.67 0.92
GMRES 0.74 0.92
Deflated CG 0.43 0.83
(Trace figure: momentum and pressure solver phases with Sendrecv and All_reduce operations; 120 ms, 6.6 ms, 170 µs — very fine grain.)
Hardware support for fast 8-byte reductions would be useful.
65Solver continuity: Deflated CG
- Subdomains with lots of neighbours
(Communication diagram: per iteration, Sendrecv exchanges, an All_reduce of 500x8 B, and All_reduce operations of 8 B.)
67The accelerator era
(Figure: performance landscape with a "wedge of opportunity" between conventional multi-core / multi-threading and accelerators: Cell, FPGAs, vector units, GPUs.)
68Near Future Supercomputing Trends
- Performance will be provided by
- Multi-core
- Without cache coherency
- With accelerators (top-down approach)
- Programming is going to suffer a revolution
- OpenCL
- CUDA
- …
- Compilers should provide the SIMD parallelism level
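To illustrate what SIMD parallelism means — one operation applied to many data elements at once — compare a scalar AXPY loop with its whole-array form. numpy here merely stands in for the vector instructions a compiler would generate, and the function names are invented for the sketch:

```python
import numpy as np

def axpy_scalar(a, x, y):
    # one element at a time: what unvectorized scalar code does
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_simd(a, x, y):
    # one whole-array operation: the same multiply-add applied to every
    # element at once — the pattern SIMD units (SSE, AltiVec, ...) exploit
    return a * np.asarray(x) + np.asarray(y)
```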
69Thank you !