1
Next decade in supercomputing
  • José M. Cela
  • Director of the CASE Department
  • BSC-CNS
  • josem.cela_at_bsc.es

2
Talk outline
  • Supercomputing from the past...
  • Architecture evolution
  • Applications and algorithms
  • ...Supercomputing for the future
  • Technology trends
  • Multidisciplinary top-down approach
  • BSC-CNS activities
  • Conclusions

3
Once upon a time: ENIAC, 1946
ENIAC, 1946, Moore School. 18,000 vacuum tubes, 70,000 resistors and 5 million soldered connections. Power consumption: 140 kW. Dimensions: 8 x 3 x 100 feet. Weight: > 30 tons. Computing capacity: 5,000 additions and 360 multiplications per second.
4
Technological Achievements
  • Transistor (Bell Labs, 1947)
  • DEC PDP-1 (1957)
  • IBM 7090 (1960)
  • Integrated circuit (1958)
  • IBM System 360 (1965)
  • DEC PDP-8 (1965)
  • Microprocessor (1971)
  • Intel 4004
  • 2,300 transistors
  • Could access 300 bytes of memory

5
Technology Trends: Microprocessor Capacity (Moore's Law)
2X transistors per chip every 1.5 years: called Moore's Law.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful. And not just processors: bandwidth, storage, etc. have improved as well.
6
Pipeline (H. Ford)
7
DRAM access bottleneck
  • Not everything is scaling up equally fast
  • DRAM access speed has hardly improved (a minimal latency sketch follows below)
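Not from the talk: a minimal pointer-chasing microbenchmark in C that makes this bottleneck visible. Once the working set is far larger than the caches, every load pays the full DRAM latency. The array size and the small xorshift PRNG below are our own choices, purely for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static unsigned long long xs = 88172645463325252ULL;   /* xorshift64 PRNG state */
    static unsigned long long xrand(void) { xs ^= xs << 13; xs ^= xs >> 7; xs ^= xs << 17; return xs; }

    int main(void)
    {
        size_t n = (size_t)1 << 24;                 /* 16M entries (128 MB): well beyond the caches */
        size_t *next = malloc(n * sizeof *next);
        size_t *idx  = malloc(n * sizeof *idx);
        if (!next || !idx) return 1;

        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle of the visit order */
            size_t j = (size_t)(xrand() % (i + 1));
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t k = 0; k < n; k++)              /* link the shuffled order into one big cycle */
            next[idx[k]] = idx[(k + 1) % n];

        clock_t t0 = clock();
        size_t p = idx[0];
        for (size_t k = 0; k < n; k++) p = next[p]; /* every load depends on the previous one */
        clock_t t1 = clock();

        printf("average load latency ~ %.1f ns (checksum %zu)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n, p);
        free(next); free(idx);
        return 0;
    }

On typical machines the reported latency is tens of nanoseconds, i.e. hundreds of core cycles per load.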

8
Latencies and Pipelines
9
Hybrid SMP-cluster parallel systems
  • Most modern high-performance computing systems are clusters of SMP nodes (a performance/cost trade-off)
  • An MPI parallel level across the nodes
  • A threads (OpenMP) parallel level within each node (see the hybrid sketch below)
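A minimal hybrid MPI + OpenMP sketch (our illustration, not code from the talk): MPI ranks map to the SMP nodes and OpenMP threads share the work inside each node. The array size and the replicated data layout are arbitrary simplifications. Compile with something like mpicc -fopenmp.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        enum { N = 1 << 20 };
        static double x[N];
        double local = 0.0, global = 0.0;

        /* Each rank keeps the whole (replicated) array here for simplicity;
         * OpenMP threads share the node-local loop. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++) {
            x[i] = (double)(i % 100) / 100.0;
            local += x[i] * x[i];
        }

        /* One MPI reduction combines the per-node partial sums. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%f\n",
                   nranks, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }

Typically launched with one rank per node and OMP_NUM_THREADS set to the number of cores per node, e.g. mpirun -np 4 ./a.out.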

10
TOP500
11
TOP500
12
Technology Outlook
High Volume Manufacturing   2004    2006    2008    2010    2012    2014    2016    2018
Technology Node (nm)          90      65      45      32      22      16      11       8
Integration Capacity (BT)      2       4       8      16      32      64     128     256
Delay = CV/I scaling         0.7     0.7    >0.7    delay scaling will slow down
Energy/Logic Op scaling    >0.35    >0.5    >0.5    energy scaling will slow down
Bulk Planar CMOS            high probability  ->  low probability
Alternate, 3G etc.          low probability   ->  high probability
Variability                 medium  ->  high  ->  very high
ILD (K)                       ~3      <3     reduce slowly towards 2 - 2.5
RC Delay                       1       1       1       1       1       1       1       1
Metal Layers                 6-7     7-8     8-9     0.5 to 1 layer per generation
Source: Shekhar Borkar, Micro37 keynote
13
Increasing CPU performance: a delicate balancing act
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor.
We have seen increasing numbers of gates on a chip and increasing clock speeds. Heat is becoming an unmanageable problem (Intel processors > 100 Watts). We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.
14
Moore's law
15
Multicore chips
17
ORNL Computing Power and Cooling 2006 - 2011
  • Immediate need to add 8 MW to prepare for 2007
    installs of new systems
  • NLCF petascale system could require an additional
    10 MW by 2008
  • Need total of 40-50 MW for projected systems by
    2011
  • Numbers are just for the computers; add 75% for cooling
  • Cooling will require 12,000 - 15,000 tons of chiller capacity

Cost estimates based on $0.05 per kWh.
Data taken from Energy Management System-4
(EMS4). EMS4 is the DOE corporate system for
collecting energy information from the sites.
EMS4 is a web-based system that collects energy
consumption and cost information for all energy
sources used at each DOE site. Information is
entered into EMS4 by the site and reviewed at
Headquarters for accuracy.
18
View from the Computer Room
19
How to reduce energy but not performance?
  • Reduce the amount of DRAM memory per core and
    redesign everything for energy saving
  • Blue Gene Solution
  • Eliminate the cache coherency in a multicore chip
    and use accelerators instead of general purpose
    cores
  • Cell/B.E. solution
  • GPU solution
  • FPGA solution

20
Blue Gene/P
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
  • Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
  • Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 (or 4.0) GB DDR, supports 4-way SMP
  • Node Card: 32 compute cards, 435 GF/s, 64 GB
  • Rack: 32 node cards (1,024 chips, 4,096 procs), 14 TF/s, 2 TB
  • System: 72 racks, cabled 8x8x16; final system 1 PF/s, 144 TB (November 2007: 0.596 PF/s)
  • HPC software: compilers, GPFS, ESSL, LoadLeveler
  • Front End Node / Service Node: JS21 / Power5, Linux SLES10
21
Cell Broadband Engine architecture
235 Mtransistors, 235 mm²
22
Cell Broadband Engine Architecture (CBEA) Technology Competitive Roadmap (2006-2010)
  • 2006: Cell BE (1+8), 90 nm SOI
  • 2007: Cell BE (1+8), 65 nm SOI (cost reduction)
  • ~2008: Advanced Cell BE (1+8 with eDP SPE), 65 nm SOI (performance enhancements / scaling)
  • ~2010: Next Gen (2 PPE + 32 SPE), 45 nm SOI, 1 TFlop (est.)
All future dates and specifications are estimates only, subject to change without notice. Dashed outlines indicate concept designs.
23
First PetaFlop computer (November 2008): Roadrunner at LANL
7,000 dual-core Opterons → 50 TeraFlop/s (total)
13,000 eDP Cell chips → 1.4 PetaFlop/s (Cell)
Connected Unit cluster: 192 Opteron nodes (180 w/ 2 dual-Cell blades connected w/ 4 PCIe x8 links)
24
How are we going to program it?
  • The MPI layer will continue
  • Hybrid codes will be mandatory, if only for load balancing
  • OpenMP on homogeneous processors
  • But with heterogeneous processors:
  • OpenCL
  • CUDA
  • SIMD code should be provided by the compiler (see the vectorization sketch after this list)
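As a small illustration (ours, not from the slides) of leaving SIMD to the compiler: a dependence-free, unit-stride loop over restrict-qualified arrays is exactly what auto-vectorizers handle well, e.g. with gcc -O3.

    #include <stdio.h>
    #include <stdlib.h>

    /* y = a*x + y : unit stride, no aliasing, no branches -> easily vectorized */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        size_t n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(n, 0.5f, x, y);
        printf("y[0] = %.1f\n", y[0]);   /* expect 2.5 */

        free(x); free(y);
        return 0;
    }

The restrict qualifiers promise the compiler that x and y do not overlap, which is what allows it to emit packed SIMD instructions without run-time aliasing checks.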

25
  • BSC-CNS activities

26
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
  • Mission
  • Investigate, develop and manage technology to facilitate the advancement of science
  • Objectives
  • Operate the national supercomputing facility
  • R&D in Supercomputing
  • Collaborate in R&D in e-Science
  • Public Consortium
  • the Spanish Government (MEC): 51%
  • the Catalonian Government (DURSI): 37%
  • the Technical University of Catalonia (UPC): 12%

27
Location
29
Blades, blade center and racks
30
Network: Myrinet
  • 2 Spine 1280 switches
  • 10 Clos 256x256 switches, with 128 links up to the spines
  • 256 links per Clos (1 to each node, nodes 0-255), 250 MB/s in each direction
31
MareNostrum
  • 2,560 JS21 blades
  • 2 PowerPC 970MP processors at 2.3 GHz
  • 8 GB of memory (20 TB total)
  • 36 GB SAS hard disk
  • Myrinet daughter card
  • 2x1 Gb Ethernet on board
  • Myrinet network
  • 10 Clos 256x256 switches
  • 2 Spine 1280 switches
  • 20 storage nodes
  • 2 p615, 2 Power4, 4 GB
  • 28 SATA disks of 512 GB (280 TB total)
  • Performance summary
  • 4 instructions per cycle, 2.3 GHz
  • 10,240 processors
  • 94.21 TFlops
  • 20 TB memory, 300 TB disk

32
Additional Systems
  • Tape facility
  • 6 Petabytes
  • LTO4 Technology
  • HSM and Backup
  • Shared memory system (ALTIX)
  • 128 Montecito cores
  • 2.5 TB main memory

33
Spanish Supercomputing Network
34
RES services
  • Red Española de Supercomputación (RES) supercomputers can be accessed free of charge by any Spanish public research group. MareNostrum is the main RES node.
  • The web application form and instructions can be found on the web page www.bsc.es (Support Services / RES)
  • An external committee evaluates the proposals
  • Access is reviewed every 4 months
  • For any question contact the BSC operations director
  • Sergi Girona (sergi.girona_at_bsc.es)

35
TOP500: who is who?
36
Can Europe compete?
37
ESFRI European Infrastructure Roadmap
  • The high-end (capability) resources should be
    implemented every 2-3 years in a renewal spiral
    process
  • A Tier-0 centre's total cost over a 5-year period shall be in the range of 200-400 M€
  • With supporting actions in the national/regional
    centers to maintain the transfer of knowledge and
    feed projects to the top capability layer

38
PRACE
PRACE and the surrounding ecosystem of tier-1 national centres (e.g. GENCI); partners are grouped into Principal Partners, General Partners and Associated Partners.
39
BSC-IBM MareIncognito project
  • Our 10 Petaflop research project for BSC (2011)
  • Port/develop applications to reduce
    time-to-production once installed
  • Programming models
  • Tools for application development and to support the previous evaluations
  • Evaluate node architecture
  • Evaluate interconnect options

40
BSC Departments
  • Computational Mechanics
  • Applied Computer Science
  • Optimization

41
What are the CASE objectives?
  • Identify scientific communities with
    supercomputing needs and help them to develop
    software
  • Material Science (SIESTA)
  • Fusion (EUTERPE, EIRENE, BIT1)
  • Spectroscopy (OCTOPUS, ALYA)
  • Atmospheric modeling (ALYA, WRF)
  • Geophysics (BSIT, ALYA)
  • Develop our own technology in Computational
    Mechanics
  • ALYA, BSIT,
  • Perform technology transfer with companies
  • REPSOL, AIRBUS,

42
Who needs 10 Petaflops?
43
Airbus 380 Design
44
Seismic Imaging: RTM (REPSOL)
45
RTM Performance in Cell
Platform   GFlops   Power (W)   GFlops/W
JS21          8.3        267       0.03
QS20        108.2        315       0.34
QS21        116.6        370       0.32
22.1 GB/s of memory BW used
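Not from the slides: RTM propagates a wave field with a finite-difference stencil, so its inner loop looks roughly like the sketch below (second-order in time and space here for brevity; production codes use higher-order stencils). The grid size, velocity, time step and source are illustrative only, not REPSOL's code.

    #include <stdio.h>
    #include <stdlib.h>

    #define NX 128
    #define NY 128
    #define NZ 128
    #define IDX(i,j,k) (((size_t)(i)*NY + (j))*NZ + (k))

    int main(void)
    {
        size_t n = (size_t)NX * NY * NZ;
        float *p0 = calloc(n, sizeof *p0);   /* wave field at t-dt */
        float *p1 = calloc(n, sizeof *p1);   /* wave field at t    */
        float *p2 = calloc(n, sizeof *p2);   /* wave field at t+dt */
        if (!p0 || !p1 || !p2) return 1;

        float c = 2000.0f, dt = 0.001f, h = 10.0f;   /* velocity (m/s), time step (s), spacing (m) */
        float coef = c * c * dt * dt / (h * h);

        p1[IDX(NX/2, NY/2, NZ/2)] = 1.0f;            /* point source at the centre */

        for (int t = 0; t < 100; t++) {              /* time-stepping loop */
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    for (int k = 1; k < NZ - 1; k++) {
                        float lap = p1[IDX(i-1,j,k)] + p1[IDX(i+1,j,k)]
                                  + p1[IDX(i,j-1,k)] + p1[IDX(i,j+1,k)]
                                  + p1[IDX(i,j,k-1)] + p1[IDX(i,j,k+1)]
                                  - 6.0f * p1[IDX(i,j,k)];
                        p2[IDX(i,j,k)] = 2.0f * p1[IDX(i,j,k)] - p0[IDX(i,j,k)] + coef * lap;
                    }
            float *tmp = p0; p0 = p1; p1 = p2; p2 = tmp;   /* rotate the three time levels */
        }

        printf("sample value: %e\n", (double)p1[IDX(NX/2, NY/2, NZ/2 + 1)]);
        free(p0); free(p1); free(p2);
        return 0;
    }

The kernel streams large arrays with little reuse per point, which is why the memory bandwidth figure above matters as much as the peak GFlops.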
46
ALYA: Computational Mechanics and Design
  • In-house development
  • Parallel
  • Coupled multiphysics: fluid dynamics, structure dynamics, heat transfer, wave propagation, excitable media
47
Alya Multiphysics Code
  • Kernel services: mesh, coupling, solvers, input/output
  • Service modules: Solmum (MUMPS sparse direct solver), Dodeme (domain decomposition), Parall (parallelization), Optima (optimization)
  • Physics modules: Nastin (incompressible Navier-Stokes), Nastal (compressible Navier-Stokes), Temper (heat transfer), Turbul (turbulence models), Exmedi (excitable media), Solidz (structure dynamics), Wavequ (wave propagation), Gotita (droplet impingement, icing), Apelme (fracture mechanics)
48
ALYA keywords
  • Multi-physics modular code for High Performance Computational Mechanics
  • Numerical solution of PDEs
  • Variational methods are preferred (FEM)...
  • Coupling between multi-physics (loose or strong)
  • Explicit and implicit formulations
  • Hybrid meshes, non-conforming meshes
  • Advanced meshing issues
  • Parallelization by MPI + OpenMP
  • Automatic mesh partition using Metis (a partitioning sketch follows this list)
  • Portability is a must (compiled on Windows, Linux, MacOS)
  • Porting to new architectures: Cell, ...
  • Scalability tested on:
  • IBM JS21 blades on MareNostrum (BSC): 10,000 CPUs
  • IBM Blue Gene/P and /L (IBM Lab. Montpellier and Watson): 4,000 CPUs
  • SGI Altix shared memory (BSC, Barcelona): 128 CPUs
  • PC clusters: 10 - 80 CPUs
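Not part of the slide: a minimal sketch of graph partitioning with METIS, the kind of step used to split a mesh across MPI ranks. It assumes the METIS 5.x C API (link with -lmetis); the toy 4-vertex ring graph and the choice of METIS_PartGraphKway over the mesh-specific entry points are our own, purely for illustration. A real code would build xadj/adjncy from the mesh connectivity.

    #include <stdio.h>
    #include <metis.h>

    int main(void)
    {
        idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
        /* CSR description of a ring graph: 0-1-2-3-0 */
        idx_t xadj[]   = {0, 2, 4, 6, 8};
        idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 0, 2};
        idx_t part[4];

        int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                         NULL, NULL, NULL,          /* no weights */
                                         &nparts, NULL, NULL, NULL, /* default options */
                                         &objval, part);
        if (status != METIS_OK) return 1;

        printf("edge-cut = %d\n", (int)objval);
        for (idx_t v = 0; v < nvtxs; v++)
            printf("vertex %d -> subdomain %d\n", (int)v, (int)part[v]);
        return 0;
    }

Each subdomain in part[] then becomes the local mesh of one MPI rank, and the cut edges define the halo that has to be exchanged between neighbours.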

49
Alya speed-up
MareNostrum (IBM blades), boundary-layer flow, 25M hexahedral elements
  • NASTAL module: explicit compressible flow, fractional step
  • NASTIN module: implicit incompressible flow, fractional step
50
CASE R&D: Aero-Acoustics
  • High-speed train

51
CASE R&D: Automotive
  • Ahmed body benchmark
  • Wind speed 120 km/h

52
CASE R&D: Building Energy
  • Benchmark cavity
  • MareNostrum Cooling

53
CASE R&D: Aerospace
  • Icing Simulation
  • Subsonic / Transonic / Supersonic flows
  • Adjoint methods in Shape Optimization

54
CASE R&D: Aerospace
  • Subsonic cavity flow (0.82 Mach)

55
CASE R&D: Free surface problems
  • Level set method

56
CASE R&D: Mesh generation
  • Meshing boundary layer

57
CASE R&D: Mesh adaptivity
  • Meshing

58
CASE R&D: Atmospheric Flows
  • San Antonio Quarter (Barcelona)

59
CASE R&D: Meteo Mesh
  • Surface from topography
  • Semi-structured in volume

60
CASE R&D: Biomechanics
  • Cardiac simulator
  • By-pass flow
  • Brain arterial system

61
CASE R&D: Biomechanics
  • Nose air flow

62
Scalability problems: the deflated PCG
63
The deflated PCG
  • The mesh partitioner slices the arteries → two neighbours per subdomain
  • But there exist "fat" meeting points of arteries → more neighbours

64
Parallel footprint (512 processors)

              Efficiency   Load balance
Overall          0.67         0.92
GMRES            0.74         0.92
Deflated CG      0.43         0.83

Trace detail: the momentum and pressure solver phases (time spans of roughly 120 ms and 6.6 ms in the trace) are dominated by All_reduce and Sendrecv operations at a granularity of about 170 µs: very fine grain. 8-byte support for fast reductions would be useful (see the dot-product sketch below).
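Not from the talk: the 8-byte reductions mentioned above come from the dot products inside the Krylov solvers. A minimal MPI version of one such dot product is sketched below; the local array size is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { NLOCAL = 100000 };          /* this rank's share of the distributed vectors */
        static double x[NLOCAL], y[NLOCAL];
        for (int i = 0; i < NLOCAL; i++) { x[i] = 1.0; y[i] = 2.0; }

        double local = 0.0, global = 0.0;
        for (int i = 0; i < NLOCAL; i++)   /* local contribution */
            local += x[i] * y[i];

        /* One 8-byte all-reduce: every rank needs the global scalar
         * before the CG/GMRES iteration can continue. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("global dot product = %f\n", global);
        MPI_Finalize();
        return 0;
    }

Because every iteration blocks on this tiny collective, its latency, not its bandwidth, sets the limit at fine granularity.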
65
Solver for the continuity equation: Deflated CG
Trace view of one iteration: Sendrecv halo exchanges, one All_reduce of about 500 x 8 bytes for the coarse (deflated) system, and 8-byte All_reduce calls for the scalars. Subdomains at the meeting points have many neighbours. A communication skeleton is sketched below.
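Not the ALYA implementation: a communication skeleton of one deflated-CG iteration, showing where the halo Sendrecv, the small coarse-system All_reduce (on the order of 500 doubles) and the 8-byte scalar All_reduce appear. The numerics are stubbed out, and the neighbour list, halo size and coarse dimension are placeholders.

    #include <mpi.h>

    #define NCOARSE 500    /* placeholder size of the deflation (coarse) space */
    #define NHALO   64     /* placeholder halo size per neighbour */

    static void deflated_cg_iteration(MPI_Comm comm,
                                      const int *neigh, int nneigh,
                                      double *halo_send, double *halo_recv)
    {
        /* 1. Halo exchange with every neighbouring subdomain (Sendrecv). */
        for (int k = 0; k < nneigh; k++)
            MPI_Sendrecv(halo_send + k * NHALO, NHALO, MPI_DOUBLE, neigh[k], 0,
                         halo_recv + k * NHALO, NHALO, MPI_DOUBLE, neigh[k], 0,
                         comm, MPI_STATUS_IGNORE);

        /* ... local sparse matrix-vector product on the subdomain ... */

        /* 2. Coarse (deflated) correction: combine the local contributions
         *    to the coarse right-hand side with one small All_reduce. */
        double coarse_local[NCOARSE] = {0.0}, coarse_global[NCOARSE];
        MPI_Allreduce(coarse_local, coarse_global, NCOARSE,
                      MPI_DOUBLE, MPI_SUM, comm);
        /* ... every rank solves the small coarse system redundantly ... */

        /* 3. Scalars of the CG recurrence: 8-byte all-reduces. */
        double dot_local = 0.0, dot_global;
        MPI_Allreduce(&dot_local, &dot_global, 1, MPI_DOUBLE, MPI_SUM, comm);
        /* ... update x, r, p with alpha/beta computed from dot_global ... */
        (void)dot_global; (void)coarse_global;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int neigh[1] = { rank };                    /* self-exchange so the demo runs on any rank count */
        double send[NHALO] = {0.0}, recv[NHALO];
        deflated_cg_iteration(MPI_COMM_WORLD, neigh, 1, send, recv);

        MPI_Finalize();
        return 0;
    }

Subdomains at the "fat" meeting points simply have a longer neigh[] list, so step 1 costs more for them, which is one source of the load imbalance measured on the previous slide.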
66
  • The conclusions

67
The accelerator era
Performance will come from a "wedge of opportunity" spanning multi-core, multi-threading, vector units, Cell, GPUs and FPGAs.
68
Near Future Supercomputing Trends
  • Performance will be provided by:
  • Multi-core
  • Without cache coherency
  • With accelerators (top-down approach)
  • Programming is going to undergo a revolution
  • OpenCL
  • CUDA
  • Compilers should provide the SIMD parallelism level

69
Thank you!