Title: Next decade in supercomputing
1Next decade in supercomputing
- José M. Cela
- Director, CASE Department
- BSC-CNS
- josem.cela_at_bsc.es
2Talk outline
- Supercomputing from the past…
- Architecture evolution
- Applications and algorithms
- …Supercomputing for the future
- Technology trends
- Multidisciplinary top-down approach
- BSC-CNS activities
- Conclusions
3Once upon a time: ENIAC, 1946
ENIAC, 1946, Moore School: 18,000 vacuum tubes, 70,000 resistors and 5 million soldered connections. Power consumption: 140 kW. Dimensions: 8x3x100 feet. Weight: > 30 tons. Compute capacity: 5,000 additions and 360 multiplications per second.
4Technological Achievements
- Transistor (Bell Labs, 1947)
- DEC PDP-1 (1957)
- IBM 7090 (1960)
- Integrated circuit (1958)
- IBM System 360 (1965)
- DEC PDP-8 (1965)
- Microprocessor (1971)
- Intel 4004
- 2,300 transistors
- Could access 300 bytes of memory
5Technology Trends: Microprocessor Capacity
Moore's Law: 2x transistors per chip every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful. And not just processors: bandwidth, storage, etc.
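Read as a formula, the doubling rule is plain exponential growth; a small sketch, anchored (for illustration only) at the 4004's 2,300 transistors of 1971:

```python
def transistors(year, base_count=2300, base_year=1971, doubling_years=1.5):
    """Projected transistor count under the 'double every 18 months'
    reading of Moore's law. The 4004 anchor point is illustrative."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

# three years later -> two doublings -> 4x the anchor count
projected_1974 = transistors(1974)
```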
6Pipeline (H. Ford)
7DRAM access bottleneck
- Not everything is scaling up equally fast
- DRAM access speed has hardly improved
8Latencies and Pipelines
9Hybrid SMP-cluster parallel systems
- Most modern high-performance computing systems are clusters of SMP nodes (performance/cost trade-off)
- MPI parallel level
- Threads (OpenMP) parallel level
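The two parallel levels above can be sketched in plain Python purely as an analogy: a loop over disjoint partitions stands in for the MPI ranks, and a thread pool inside each partition stands in for the OpenMP threads. This illustrates the decomposition only, not real MPI/OpenMP code; the function names are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def node_sum(chunk, threads=4):
    # inner, "OpenMP-like" level: threads share one node's chunk
    step = max(1, len(chunk) // threads)
    pieces = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(sum, pieces))

def cluster_sum(data, nodes=4):
    # outer, "MPI-like" level: disjoint partitions, one per rank;
    # a real code would exchange halos and reduce with MPI_Allreduce
    step = max(1, len(data) // nodes)
    partials = [node_sum(data[i:i + step]) for i in range(0, len(data), step)]
    return sum(partials)  # stands in for the final MPI reduction
```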
10TOP500
11TOP500
12Technology Outlook
- High-volume manufacturing year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
- Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
- Integration capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
- Delay (CV/I) scaling: 0.7, 0.7, >0.7 — delay scaling will slow down
- Energy/logic-op scaling: >0.35, >0.5, >0.5 — energy scaling will slow down
- Bulk planar CMOS: high probability, moving to low probability over the decade
- Alternate (3G etc.): low probability, moving to high probability
- Variability: medium, then high, then very high
- ILD (K): ~3, <3, reducing slowly towards 2-2.5
- RC delay: ~1 across generations
- Metal layers: 6-7, 7-8, 8-9, then +0.5 to 1 layer per generation
Shekhar Borkar, Micro37
13Increasing CPU performance: a delicate balancing act
Increasing the number of gates on a chip while decreasing the cycle time of the processor. We have seen an increasing number of gates per chip and increasing clock speeds, but heat is becoming an unmanageable problem (Intel processors > 100 Watts). We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.
14Moore's law
15Multicore chips
17ORNL Computing Power and Cooling 2006-2011
- Immediate need to add 8 MW to prepare for 2007 installs of new systems
- NLCF petascale system could require an additional 10 MW by 2008
- Need a total of 40-50 MW for projected systems by 2011
- Numbers are just for the computers; add ~75% for cooling
- Cooling will require 12,000-15,000 tons of chiller capacity
Cost estimates based on $0.05 per kWh.
Data taken from the Energy Management System-4 (EMS4), the DOE corporate system for collecting energy information from the sites. EMS4 is a web-based system that collects energy consumption and cost information for all energy sources used at each DOE site. Information is entered into EMS4 by each site and reviewed at Headquarters for accuracy.
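As a back-of-the-envelope check of the scale involved — assuming, hypothetically, that the 40 MW IT load runs continuously all year, with the ~75% cooling overhead and the $0.05/kWh rate from the slide:

```python
def annual_power_cost(load_mw, price_per_kwh=0.05, cooling_overhead=0.75):
    """Rough annual electricity cost for a machine room.

    Assumes the IT load runs 24x365 and that cooling adds a fixed
    fraction on top of the IT power (75% per the slide)."""
    hours_per_year = 24 * 365
    it_kwh = load_mw * 1000 * hours_per_year
    total_kwh = it_kwh * (1 + cooling_overhead)
    return total_kwh * price_per_kwh

cost_40mw = annual_power_cost(40)  # about $30.7M per year with cooling
```

At 40 MW this comes to roughly $30M per year including cooling, which makes clear why energy dominates the planning on the following slides.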
18View from the Computer Room
19How to reduce energy but not performance?
- Reduce the amount of DRAM memory per core and redesign everything for energy saving
- The Blue Gene solution
- Eliminate cache coherency in a multicore chip and use accelerators instead of general-purpose cores
- The Cell/B.E. solution
- The GPU solution
- The FPGA solution
20Blue Gene/P
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
- Chip: 4 processors; 13.6 GF/s, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 (or 4.0) GB DDR; supports 4-way SMP
- Node card: 32 compute cards; 435 GF/s, 64 GB
- Rack: 32 node cards (1024 chips, 4096 procs); 14 TF/s, 2 TB
- System: 72 racks, cabled 8x8x16; final system 1 PF/s, 144 TB (November 2007: 0.596 PF/s)
- HPC SW: Compilers, GPFS, ESSL, LoadLeveler
- Front-end node / service node: JS21 / Power5, Linux SLES10
21Cell Broadband Engine architecture
235 Mtransistors, 235 mm²
22Cell Broadband Engine Architecture (CBEA) Technology Competitive Roadmap
2006-2010:
- Cell BE (1+8), 90 nm SOI
- Cost reduction: Cell BE (1+8), 65 nm SOI
- Advanced Cell BE (1+8 eDP SPE), 65 nm SOI
- Next gen (2 PPE + 32 SPE), 45 nm SOI, 1 TFlop (est.); performance enhancements/scaling
All future dates and specifications are estimations only, subject to change without notice. Dashed outlines indicate concept designs.
23First PetaFlop computer (Nov. 2008): Roadrunner at LANL
- 7,000 dual-core Opterons → 50 TeraFlop/s (total)
- 13,000 eDP Cell chips → 1.4 PetaFlop/s (Cell)
- Connected Unit cluster: 192 Opteron nodes (180 with 2 dual-Cell blades connected with 4 PCIe x8 links)
24How are we going to program it?
- The MPI layer will continue
- Hybrid codes will be mandatory, if only for load balancing
- OpenMP on homogeneous processors
- But with heterogeneous processors:
- OpenCL
- CUDA
- …
- SIMD code should be provided by the compiler
26Barcelona Supercomputing Center / Centro Nacional de Supercomputación
- Mission
- Investigate, develop and manage technology to facilitate the advancement of science.
- Objectives
- Operate the national supercomputing facility
- R&D in supercomputing
- Collaborate in R&D for e-Science
- Public consortium
- the Spanish Government (MEC): 51%
- the Catalan Government (DURSI): 37%
- the Technical University of Catalonia (UPC): 12%
27Location
29Blades, blade center and racks
30Network: Myrinet
- 10 Clos 256x256 switches; each Clos has 256 links (1 to each node), 250 MB/s in each direction
- 2 Spine 1280 switches; 128 links connect the Clos level to the spines
(Figure: fat-tree diagram, nodes 0-255 per Clos switch.)
31MareNostrum
- 2560 JS21 blades
- 2 PowerPC 970MP, 2.3 GHz
- 8 Gigabytes of memory (20 TB total)
- 36 Gigabytes SAS HD
- Myrinet daughter card
- 2x1 Gb Ethernet on board
- Myrinet
- 10 Clos 256x256 switches
- 2 Spine 1280 switches
- 20 storage nodes
- 2 P615, 2 Power4, 4 GigaBytes
- 28 SATA disks of 512 GBytes (280 TB total)
- Performance summary
- 4 instructions per cycle, 2.3 GHz
- 10240 processors
- 94.21 TFlops
- 20 TB memory, 300 TB disk
32Additional Systems
- Tape facility
- 6 Petabytes
- LTO4 Technology
- HSM and Backup
- Shared memory system (ALTIX)
- 128 cores Montecito
- 2.5 TByte Main Memory
33Spanish Supercomputing Network
34RES services
- The Red Española de Supercomputación (RES) supercomputers can be accessed free of charge by any public Spanish research group. MareNostrum is the main RES node.
- The web application form and instructions can be found on the web page www.bsc.es (Support & Services / RES)
- An external committee evaluates the proposals
- Access is reviewed every 4 months
- For any question contact the BSC operations director
- Sergi Girona (sergi.girona_at_bsc.es)
35Top500 who is who?
36Can Europe compete?
37ESFRI European Infrastructure Roadmap
- The high-end (capability) resources should be implemented every 2-3 years in a renewal spiral process
- Tier-0 centre total cost over a 5-year period shall be in the range of 200-400 M€
- With supporting actions in the national/regional centers to maintain the transfer of knowledge and feed projects to the top capability layer
38PRACE
(Figure: the PRACE ecosystem — GENCI, the tier-1 layer, with principal, general and associated partners.)
39BSC-IBM MareIncognito project
- Our 10 Petaflop research project for BSC (2011)
- Port/develop applications to reduce time-to-production once installed
- Programming models
- Tools for application development and to support previous evaluations
- Evaluate interconnect options
40BSC Departments
- Computational Mechanics
- Applied Computer Science
- Optimization
41What are the CASE objectives?
- Identify scientific communities with supercomputing needs and help them to develop software
- Material Science (SIESTA)
- Fusion (EUTERPE, EIRENE, BIT1)
- Spectroscopy (OCTOPUS, ALYA)
- Atmospheric modeling (ALYA, WRF)
- Geophysics (BSIT, ALYA)
- Develop our own technology in Computational Mechanics
- ALYA, BSIT, …
- Perform technology transfer with companies
- REPSOL, AIRBUS, …
42Who needs 10 Petaflops?
43Airbus 380 Design
44Seismic Imaging: RTM (REPSOL)
45RTM Performance in Cell
Platform Gflops Power (W) Gflops/W
JS21 8.3 267 0.03
QS20 108.2 315 0.34
QS21 116.6 370 0.32
22.1 GB/s of memory BW used
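The last column of the table is simply the ratio of the first two; a trivial check of the figures (function name invented for the sketch):

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as sustained Gflops divided by measured power."""
    return gflops / watts

# figures from the slide's table: the Cell blades are roughly an
# order of magnitude more power-efficient than the JS21 on RTM
js21 = gflops_per_watt(8.3, 267)
qs20 = gflops_per_watt(108.2, 315)
```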
46ALYA
Computational Mechanics and Design. In-house development. Parallel, coupled multiphysics: fluid dynamics, structure dynamics, heat transfer, wave propagation, excitable media.
47Alya Multiphysics Code
Services
- Optima: optimization
- Dodeme: domain decomposition
- Parall: parallelization
- Solmum: MUMPS sparse direct solver
Kernel
- Mesh
- Coupling
- Solvers
- Input/output
Modules
- Nastin: incompressible Navier-Stokes
- Nastal: compressible Navier-Stokes
- Turbul: turbulence models
- Temper: heat transfer
- Exmedi: excitable media
- Apelme: fracture mechanics
- Solidz: structure dynamics
- Wavequ: wave propagation
- Gotita: droplet impingement (icing)
48ALYA keywords
- Multi-physics modular code for High-Performance Computational Mechanics
- Numerical solution of PDEs
- Variational methods are preferred (FEM)...
- Coupling between multi-physics (loose or strong)
- Explicit and implicit formulations
- Hybrid meshes, non-conforming meshes
- Advanced meshing issues
- Parallelization by MPI + OpenMP
- Automatic mesh partition using Metis
- Portability is a must (compiled on Windows, Linux, MacOS)
- Porting to new architectures: Cell, …
- Scalability tested on
- IBM JS21 blades on MareNostrum (BSC): 10,000 CPUs
- IBM Blue Gene/P and /L (IBM Labs Montpellier and Watson): 4,000 CPUs
- SGI Altix shared memory (BSC, Barcelona): 128 CPUs
- PC clusters: 10-80 CPUs
49Alya speed-up
MareNostrum (IBM blades). Boundary-layer flow, 25M hexahedra.
- NASTAL module: explicit compressible flow, fractional step
- NASTIN module: implicit incompressible flow, fractional step
50CASE R&D Aero-Acoustics
51CASE R&D Automotive
- Ahmed body benchmark
- Wind speed: 120 km/h
52CASE R&D Building Energy
53CASE R&D Aerospace
- Icing Simulation
- Subsonic / Transonic / Supersonic flows
- Adjoint methods in Shape Optimization
54CASE R&D Aerospace
- Subsonic cavity flow (0.82 Mach)
55CASE R&D Free surface problems
56CASE R&D Mesh generation
57CASE R&D Mesh adaptivity
58CASE R&D Atmospheric Flows
- San Antonio Quarter (Barcelona)
59CASE R&D Meteo Mesh
- Surface from topography
- Semi-structured in volume
60CASE R&D Biomechanics
61CASE R&D Biomechanics
62Scalability problems: the deflated PCG
63The deflated PCG
- The mesh partitioner slices arteries → two neighbours per subdomain
- But there exist fat meeting points of arteries → more neighbours
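One common formulation of deflated CG builds a coarse space Z (e.g. one vector per subdomain) and runs CG on the projected operator, adding the coarse component back at the end. The slides do not show Alya's actual implementation, so the sketch below — with `deflated_cg` and its arguments invented for illustration — only shows the projection idea:

```python
import numpy as np

def deflated_cg(A, b, Z, tol=1e-10, maxit=500):
    """Deflated CG sketch: CG on the projected system, coarse part added back.

    A must be symmetric positive definite, Z of full column rank so that
    the coarse matrix E = Z^T A Z is invertible."""
    AZ = A @ Z
    E = Z.T @ AZ
    Einv = np.linalg.inv(E)

    def P(v):
        # deflation projector P = I - A Z E^{-1} Z^T
        return v - AZ @ (Einv @ (Z.T @ v))

    x = np.zeros_like(b)
    r = P(b)                       # projected initial residual (x0 = 0)
    p = r.copy()
    rr = r @ r
    for _ in range(maxit):
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
        Ap = P(A @ p)
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    # recover the full solution: x* = Z E^{-1} Z^T b + P^T x
    Qb = Z @ (Einv @ (Z.T @ b))
    return Qb + x - Z @ (Einv @ (AZ.T @ x))
```

The coarse solve with E is exactly the global step whose small All_reduce messages show up in the traces on the following slides.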
64Parallel footprint
512 proc Efficiency Load balance
Overall 0.67 0.92
GMRES 0.74 0.92
Deflated CG 0.43 0.83
(Trace figure: momentum and pressure solver phases with Sendrecv and All_reduce operations; 120 ms, 6.6 ms, 170 µs — very fine grain.)
Hardware support for fast 8-byte reductions would be useful.
65Solver continuity: Deflated CG
- Subdomains with lots of neighbours
(Communication diagram: per iteration, Sendrecv exchanges, an All_reduce of 500x8 B, and All_reduce operations of 8 B.)
67The accelerator era
(Figure: performance landscape with a "wedge of opportunity" between conventional multi-core / multi-threading and accelerators: Cell, FPGAs, vector units, GPUs.)
68Near Future Supercomputing Trends
- Performance will be provided by
- Multi-core
- Without cache coherency
- With accelerators (top-down approach)
- Programming is going to suffer a revolution
- OpenCL
- CUDA
- …
- Compilers should provide the SIMD parallelism level
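To illustrate what SIMD parallelism means — one operation applied to many data elements at once — compare a scalar AXPY loop with its whole-array form. numpy here merely stands in for the vector instructions a compiler would generate, and the function names are invented for the sketch:

```python
import numpy as np

def axpy_scalar(a, x, y):
    # one element at a time: what unvectorized scalar code does
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_simd(a, x, y):
    # one whole-array operation: the same multiply-add applied to every
    # element at once — the pattern SIMD units (SSE, AltiVec, ...) exploit
    return a * np.asarray(x) + np.asarray(y)
```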
69Thank you !