Title: An Overview of High Performance Computing and Performance Issues
1. An Overview of High Performance Computing and Performance Issues
- Jack Dongarra
- University of Tennessee and Oak Ridge National Laboratory
3. Outline
- Not going to talk about . . .
- Linear algebra algorithms or software
- Message passing systems like PVM or MPI
- Distributed network computing
- Focus on . . .
- Trends in High-Performance Computing
- Tools for building efficient software for numerical kernels
4. High-Performance Computing Today
- In the past decade, the world has experienced one of the most exciting periods in computer development.
- Microprocessors have become smaller, denser, and more powerful.
- The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.
5Growth of Microprocessor Performance
10000
Cray T90
Cray C90
1000
Cray 2
Cray Y-MP
Alpha
RS6000/590
Cray X-MP
Alpha
100
RS6000/540
Cray 1S
i860
2X transistors/Chip Every 1.5 years Moores
Law Microprocessors have become smaller,
denser, and more powerful.
Performance in Mflop/s
10
R2000
1
80387
0.1
6881
80287
8087
0.01
1998
1980
1982
1986
1988
1990
1992
1994
1996
Year
6. Linpack-HPC Benchmark Over Time
(Chart: Linpack Highly-Parallel benchmark performance in Gflop/s, 1991-1999, scale to 2000 Gflop/s; the Cray Y-MP (8) marks the starting point.)
7. Different Architectures
- SIMD
- Vector Computers
- MIMD
- Shared Memory
- SUN Enterprise Shared Memory
- Cray J90/T90 Vector Computer
- Distributed Shared Memory
- SGI Origin Distributed Shared Memory
- HP-Convex Distributed Shared Memory
- NEC SX Vector Computer
- Distributed Memory
- Cray T3E Distributed Memory
- IBM SP Distributed Memory
- Fujitsu VPP Vector Distributed Memory
- Clusters of Processors (COWs, NOWs, Beowulf)
8. Vectors
- Fujitsu VPP-700
- Distributed memory vector multi-processor
- Crossbar connected, up to 256 processors
- 2.4 Gflop/s per processor
- VPP-5000 (512 procs, 9.6 Gflop/s per proc)
- NEC SX-5
- Distributed memory vector multi-processor
- Multistage crossbar, up to 512 processors
- 8 Gflop/s per processor
- Hitachi SR-8000
- RISC-based distributed memory multi-processor (PVP)
- Multi-dimensional crossbar, up to 128 processors
- 8 Gflop/s per processor
- SGI SV1
- Vector based
- Ring based, up to 1024 processors
- 4.8 Gflop/s per processor
9. Scalar
- IBM SP
- RISC-based distributed-memory multi-processor
- Omega switch, message passing
- 200 MHz, 4 ops/cycle, 800 Mflop/s per processor
- 2 processors/node
- ASCI machine: 5,836 processors
- SGI Origin 2000
- Distributed shared memory, CC-NUMA, message passing
- Crossbar (4 processors/node), nodes connected in a hypercube
- 500 Mflop/s per processor
- ASCI machine: 6,144 processors
- Sun Enterprise
- SMP, 64-way crossbar
- 800 Mflop/s per processor
10. Scalar (Continued)
- HP Exemplar
- Distributed shared memory, CC-NUMA, message passing
- 32-processor SMPs interconnected via ring
- 1.76 Gflop/s per processor
- Compaq (DEC)
- Memory Channel switch
- 112 processors
- 1.2 Gflop/s per processor
11. High-Performance Computing Directions
- Move toward distributed shared memory
- Distributed Shared Memory (clusters of processors connected)
- Shared address space w/deep memory hierarchy
- Clustering of shared memory machines for scalability
- Emergence of PC commodity systems
- Pentium/Alpha based, NT or Linux driven
- Efficiency of message passing and data parallel programming
- Helped by standards efforts such as PVM, MPI, OpenMP and HPF
- In many cases used as single-user environments
- Pure COTS (commodity off-the-shelf) systems
12. Top500: Fastest Installed Computers
- Lists the 500 most powerful supercomputers at sites worldwide.
- Provides a snapshot of the supercomputers installed around the world.
- Began in 1993; published every 6 months.
- Measures performance on the TPP Linpack benchmark.
- Provides a way to measure trends.
13. TOP10
(Table of the ten fastest installed systems. Notes: the top 10 all exceed roughly 1/2 Tflop/s; 3 are DOE ASCI machines; 7 exceed 1 Tflop/s peak; 1 is in Japan.)
14. Performance Development
(Chart of aggregate Top500 performance over time. Notes: about 1/2 of the machines were replaced since the previous list; 358 MPPs, 58 PVPs; 241 industrial sites; 18 machines in France, including the 116-processor Meteo-France VPP700.)
15. Continents
17. Producers
18. Manufacturer
19. Customer Type
20. Processor Type
21. Chip Technology
22. Chip Technology
23. Architectures
24. Excerpt from the TOP500
25. Where Do the Flops Go? Who Cares About the Memory Hierarchy?
(Chart: Processor-DRAM memory gap (latency), 1980-2000, log performance scale. CPU performance grows 60%/year (2x every 1.5 years, Moore's Law); DRAM performance grows only 9%/year (2x every 10 years); the processor-memory performance gap grows about 50%/year.)
26. Performance Issues: Cache Bandwidth
- Performance instability: small changes may cause dramatic changes in delivered performance (see the example below).
- Latency-tolerant and bandwidth-parsimonious algorithms and software are critical.
- Recompute rather than store/load.
- Need to help the compiler.
- We have a hard time getting performance today, and it is only going to get harder.
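To make the instability point concrete, here is a minimal sketch (not from the talk) showing how one seemingly trivial change, swapping two loops, changes cache behavior and therefore delivered performance. The array size and timing harness are illustrative.

#include <stdio.h>
#include <time.h>

#define N 1024
static double a[N][N];   /* zero-initialized static array, 8 MB */

/* Row-major traversal: consecutive accesses hit the same cache line. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: the same arithmetic, but each access touches
   a new cache line, so it typically runs far slower. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    clock_t t0 = clock();
    double s1 = sum_row_major();
    clock_t t1 = clock();
    double s2 = sum_col_major();
    clock_t t2 = clock();
    /* printing the sums keeps the compiler from removing the loops */
    printf("row-major: %.3fs  col-major: %.3fs  (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}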
27. Where Do the Flops Go? Memory Hierarchy
- Can only do arithmetic on data at the top of the hierarchy
- Higher-level BLAS let us do this
(Figure: memory hierarchy pyramid, from fast, small, expensive storage at the top to slow, large, cheap storage at the bottom.)
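The standard operation and data counts explain why higher-level BLAS keep data at the top of the hierarchy (counts for length-n or n x n operands):

Level 1 (e.g., DAXPY):  2n   flops on 3n   words  ->  O(1) flops per word moved
Level 2 (e.g., DGEMV):  2n^2 flops on n^2  words  ->  O(1) flops per word moved
Level 3 (e.g., DGEMM):  2n^3 flops on 4n^2 words  ->  O(n) flops per word moved

Only Level 3 operations perform enough arithmetic per word of memory traffic to amortize the cost of moving data up the hierarchy.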
28. BLAS for Performance
- Development of blocked algorithms is important for performance (a minimal blocked multiply is sketched below).
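As an illustration only (not code from the talk), a bare-bones blocked C += A*B in C. The block size NB is a placeholder that would be tuned per machine so that three NB x NB tiles fit in cache, which is exactly what lets each tile be reused many times before it is evicted.

#define NB 64  /* illustrative block size; tuned per machine in practice */

/* Blocked (tiled) matrix multiply, C += A*B, square n x n matrices
   in row-major order. */
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* multiply one tile pair; the tiles stay in cache */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}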
29. How To Get Performance From Commodity Processors?
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Routines have a large design space with many parameters:
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- A few months ago there was no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines.
- ATLAS: Automatically Tuned Linear Algebra Software
- PhiPAC from Berkeley
- FFTW from MIT
30. Why ATLAS Is Needed
- Hand-tuned BLAS require many man-hours per platform
- Only done if the financial incentive is there
- Many platforms will never have an optimal version
- Tuning lags behind hardware
- May not be affordable by everyone
- Improves on vendor code
- Operations may be important, but not general enough to standardize
- Allows for portably optimal codes
31. Adaptive Approach for Level 3 BLAS
- Do a parameter study of the operation on the target machine, done once.
- The only generated code is the on-chip multiply.
- Each BLAS operation is written in terms of the generated on-chip multiply.
- All transpose cases are coerced through data copy to one case of the on-chip multiply.
- Only one case is generated per platform (see the sketch below).
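A hypothetical sketch of this structure (the names, storage format, and kernel case are illustrative, not ATLAS internals): the full GEMM copies each block into contiguous buffers, transposing as needed, so that every input case funnels into the single generated kernel. For brevity it assumes n is a multiple of NB.

#define NB 64  /* block size found by the one-time parameter study */

/* Stand-in for the generated on-chip multiply: c += a*b on contiguous
   NB x NB tiles, with c stored at leading dimension ldc.  In an
   ATLAS-like system this body is emitted and tuned per platform. */
static void on_chip_multiply(const double *a, const double *b,
                             double *c, int ldc) {
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i*ldc + j] += a[i*NB + k] * b[k*NB + j];
}

/* Copy block (r,c) of op(M) into a contiguous buffer, transposing if
   asked, so the kernel always sees one storage format. */
static void copy_block(int trans, const double *M, int ld,
                       int r, int c, double *buf) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            buf[i*NB + j] = trans ? M[(c + j)*ld + (r + i)]
                                  : M[(r + i)*ld + (c + j)];
}

/* C += op(A)*op(B), n a multiple of NB: every transpose case is
   coerced through copy_block into the single kernel case. */
void gemm(int transA, int transB, int n,
          const double *A, const double *B, double *C) {
    double abuf[NB*NB], bbuf[NB*NB];
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB) {
                copy_block(transA, A, n, i, k, abuf);
                copy_block(transB, B, n, k, j, bbuf);
                on_chip_multiply(abuf, bbuf, &C[i*n + j], n);
            }
}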
32. Code Generation Strategy
- On-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
- Takes a couple of hours to run.
- Code is iteratively generated and timed until the optimal case is found (see the sketch below). We try:
- Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling
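A toy sketch of the generate-and-time search over NB (illustrative only; a real generator emits and compiles specialized source per variant rather than passing a runtime parameter):

#include <stdio.h>
#include <time.h>

#define N 512
static double A[N*N], B[N*N], C[N*N];

/* Candidate kernel, parameterized by block size nb (nb must divide N). */
static void mm_variant(int nb) {
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; i++)
                    for (int k = kk; k < kk + nb; k++)
                        for (int j = jj; j < jj + nb; j++)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

int main(void) {
    int candidates[] = {16, 32, 64, 128}, best_nb = 0;
    double best_t = 1e30;
    for (int v = 0; v < 4; v++) {
        clock_t t0 = clock();
        mm_variant(candidates[v]);   /* result discarded; only the time matters */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB=%3d  %.3fs\n", candidates[v], t);
        if (t < best_t) { best_t = t; best_nb = candidates[v]; }
    }
    printf("selected NB=%d\n", best_nb);  /* install this variant */
    return 0;
}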
33. ATLAS 500x500 DGEMM Across Various Architectures
34. 500x500 LU, Right-Looking
35. Recursive Approach for Other Level 3 BLAS
Recursive TRMM (sketched below):
- Recur down to L1 cache block size
- Need kernel at bottom of recursion
- Use gemm-based kernel for portability
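One way to realize these bullets, as a minimal sketch (illustrative, not the talk's code), for B := A*B with A lower triangular: split A into quadrants, recurse on the diagonal blocks, let a GEMM-style kernel handle the off-diagonal block, and stop at an L1-sized block.

#define L1_BLOCK 32  /* recursion cutoff; tuned to the L1 cache in practice */

/* C += A*B: the gemm-based kernel the recursion bottoms out on. */
static void gemm_add(int m, int n, int k, const double *A, int lda,
                     const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
}

/* B := A*B in place, A lower triangular n x n, B n x ncols, row-major. */
void trmm_lower(int n, int ncols, const double *A, int lda,
                double *B, int ldb) {
    if (n <= L1_BLOCK) {
        /* small triangular multiply done in cache; rows updated bottom-up
           so each row still reads the old values of the rows above it */
        for (int i = n - 1; i >= 0; i--)
            for (int j = 0; j < ncols; j++) {
                double s = 0.0;
                for (int k = 0; k <= i; k++)
                    s += A[i*lda + k] * B[k*ldb + j];
                B[i*ldb + j] = s;
            }
        return;
    }
    int h = n / 2;
    /* A = [A11 0; A21 A22]: B2 := A22*B2, then B2 += A21*B1 (old B1),
       then B1 := A11*B1 */
    trmm_lower(n - h, ncols, &A[h*lda + h], lda, &B[h*ldb], ldb);
    gemm_add(n - h, ncols, h, &A[h*lda], lda, B, ldb, &B[h*ldb], ldb);
    trmm_lower(h, ncols, A, lda, B, ldb);
}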
36. 500x500 Recursive BLAS on a 433 MHz DEC 21164
37. Multithreaded BLAS for Performance
(Chart: DGEMM performance of ATLAS on 2 processors vs. ATLAS on 1 processor vs. the Intel BLAS; a simple threading sketch follows.)
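One simple way to layer threads over a tuned serial kernel (an illustrative OpenMP sketch, not ATLAS's actual threading): split the rows of C across processors. Each slice is independent, so no synchronization is needed beyond the parallel loop; in practice each thread would call the tuned one-processor GEMM on its slice rather than the plain loop shown.

#include <omp.h>

/* C += A*B (all n x n, row-major) with rows of C divided among
   threads; each row's result is independent of the others. */
void dgemm_threaded(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
}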
38. ATLAS and the Computational Grid
- Keep a repository of kernels for specific machines.
- Develop a means of dynamically downloading code.
- Extend the work to allow sparse matrix operations.
- Extend the work to include arbitrary code segments.
- www.netlib.org/atlas/
39. High Performance Computing: 60 Years From 1 to 10^13 Flop/s

Year  Flop/s                              Machine
1941  1                   (1 FLOPS)       ABC
1945  100                                 ENIAC
1949  1,000               (1 KiloFLOPS)   BINAC
1951  10,000                              UNIVAC
1961  100,000                             IBM 7030
1964  1,000,000           (1 MegaFLOPS)   CDC 6600
1968  10,000,000                          CDC 7600
1975  100,000,000                         Cray 1
1987  1,000,000,000       (1 GigaFLOPS)   NEC SX-2
1992  10,000,000,000                      NEC SX-3
1993  100,000,000,000                     TMC CM-5
1997  1,000,000,000,000   (1 TeraFLOPS)   Intel ASCI Red
2000  10,000,000,000,000                  ???
40. Performance Development
41. Performance Development
42. Summary
- Hans Meuer, Mannheim University
- Aad van der Steen, Utrecht
- Antoine Petitet, Erich Strohmaier, Clint Whaley, UTK
- http://www.netlib.org/utk/papers/advanced-computers/paper.html
- http://www.netlib.org/benchmark/top500.html
- http://www.netlib.org/atlas/
43. Europe - Countries
44. Fujitsu VPP700
- Distributed memory vector multi-processor
- Crossbar connected
- VPP800: 4 ns cycle
- http://www.fujitsu.co.jp/hypertext/Products/Info_process/hpc/vpp-e/index.html
45. NEC SX-5
- Distributed memory vector multi-processor
- Multistage crossbar
- http://www.ess.nec.de
46. IBM SP
- RISC-based distributed-memory multi-processor
- Omega switch, message passing
- New processors, available today
- 200 MHz, 4 ops/cycle, 800 Mflop/s
- 2 processors/node
- http://www.rs6000.ibm.com/hardware/largescale/index.html
47. SGI Origin
- Distributed shared memory, CC-NUMA, message passing
- Crossbar (4 proc/node), nodes connected in a hypercube
- http://www.sgi.com/origin/2000/index.html
- SV1: vector based, 1.2 Gflop/s per processor, 4 to a board
48. SUN Enterprise
- RISC-based SMP
- 2.5 ns (400 MHz), 800 Mflop/s
- 64 way crossbar, 12.8 GB/s bandwidth
- http://www.sun.com/servers/ultra_enterprise/10000/
49. Hitachi SR8000
- RISC-based distributed memory multi-processor
- Multi-dimensional crossbar
- www.hitachi.co.jp/Prod/comp/hpc/eng/sr81e.html
50. HP Exemplar
- RISC-based distributed-memory multi-processor
- DSM/Message passing
- http://www.enterprisecomputing.hp.com/index.html
51. Compaq/DEC Alpha
- COWs (clusters of workstations) coupled with your interconnect of choice