Title: An Overview of High Performance Computing and Performance Issues
1. An Overview of High Performance Computing and Performance Issues
- Jack Dongarra
- University of Tennessee and Oak Ridge National Laboratory
3. Outline
- Not going to talk about . . .
- Linear algebra algorithms or software
- Message passing systems like PVM or MPI
- Distributed network computing
- Focus on . . .
- Trends in High-Performance Computing
- Tools for building efficient software for numerical kernels
4. High-Performance Computing Today
- In the past decade, the world has experienced one of the most exciting periods in computer development.
- Microprocessors have become smaller, denser, and more powerful.
- The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.
5Growth of Microprocessor Performance
10000
Cray T90
Cray C90
1000
Cray 2
Cray Y-MP
Alpha
RS6000/590
Cray X-MP
Alpha
100
RS6000/540
Cray 1S
i860
2X transistors/Chip Every 1.5 years Moores
Law Microprocessors have become smaller,
denser, and more powerful.
Performance in Mflop/s
10
R2000
1
80387
0.1
6881
80287
8087
0.01
1998
1980
1982
1986
1988
1990
1992
1994
1996
Year
6. Linpack-HPC Benchmark Over Time
(Chart: Linpack Highly-Parallel benchmark performance in Gflop/s, 1991-1999, scale to 2000 Gflop/s; the Cray Y-MP (8) marks the starting point.)
7. Different Architectures
- SIMD
- Vector Computers
- MIMD
- Shared Memory
- SUN Enterprise Shared Memory
- Cray J90/T90 Vector Computer
- Distributed Shared Memory
- SGI Origin Distributed Shared Memory
- HP-Convex Distributed Shared Memory
- NEC SX Vector Computer
- Distributed Memory
- Cray T3E Distributed Memory
- IBM SP Distributed Memory
- Fujitsu VPP Vector Distributed Memory
- Clusters of Processors (COWs, NOWs, Beowulf)
8. Vectors
- Fujitsu VPP-700
- Distributed memory vector multi-processor
- Crossbar connected, up to 256 processors
- 2.4 Gflop/s per processor
- VPP-5000 (512 procs, 9.6 Gflop/s per proc)
- NEC SX-5
- Distributed memory vector multi-processor
- Multistage crossbar, up to 512 processors
- 8 Gflop/s per processor
- Hitachi SR-8000
- RISC-based distributed memory multi-processor (PVP)
- Multi-dimensional crossbar, up to 128 processors
- 8 Gflop/s per processor
- SGI SV1
- Vector based
- Ring based, up to 1024 processors
- 4.8 Gflop/s per processor
9. Scalar
- IBM SP
- RISC-based distributed-memory multi-processor
- Omega switch, message passing
- 200 MHz, 4 ops/cycle, 800 Mflop/s per processor
- 2 processors/node
- ASCI machine: 5,836 processors
- SGI Origin 2000
- Distributed shared memory, CC-NUMA, message passing
- Crossbar (4 processors/node), nodes connected in a hypercube
- 500 Mflop/s per processor
- ASCI machine: 6,144 processors
- Sun Enterprise
- SMP, 64-way crossbar
- 800 Mflop/s per processor
10. Scalar (Continued)
- HP Exemplar
- Distributed shared memory, CC-NUMA, message passing
- 32-processor SMPs interconnected via ring
- 1.76 Gflop/s per processor
- Compaq (DEC)
- Memory Channel switch
- 112 processors
- 1.2 Gflop/s per processor
11. High-Performance Computing Directions
- Move toward distributed shared memory
- Distributed Shared Memory (clusters of processors connected)
- Shared address space w/deep memory hierarchy
- Clustering of shared memory machines for scalability
- Emergence of PC commodity systems
- Pentium/Alpha based, NT or Linux driven
- Efficiency of message passing and data parallel programming
- Helped by standards efforts such as PVM, MPI, OpenMP and HPF
- In many cases used as single-user environments
- Pure COTS (commodity off-the-shelf) systems
12. Top500: Fastest Installed Computers
- Lists the 500 most powerful supercomputers at sites worldwide.
- Provides a snapshot of the supercomputers installed around the world.
- Began in 1993; published every 6 months.
- Measures performance on the TPP Linpack benchmark.
- Provides a way to measure trends.
13. TOP10
(Table of the ten fastest installed systems. Notes: the top 10 all exceed roughly 1/2 Tflop/s; 3 are DOE ASCI machines; 7 exceed 1 Tflop/s peak; 1 is in Japan.)
14. Performance Development
(Chart of aggregate Top500 performance over time. Notes: about 1/2 of the machines were replaced since the previous list; 358 MPPs, 58 PVPs; 241 industrial sites; 18 machines in France, including the 116-processor Meteo-France VPP700.)
15. Continents
17. Producers
18. Manufacturer
19. Customer Type
20. Processor Type
21. Chip Technology
22. Chip Technology
23. Architectures
24. Excerpt from the TOP500
25. Where Do the Flops Go? Who Cares About the Memory Hierarchy?
(Chart: Processor-DRAM memory gap (latency), 1980-2000, log performance scale. CPU performance grows 60%/year (2x every 1.5 years, Moore's Law); DRAM performance grows only 9%/year (2x every 10 years); the processor-memory performance gap grows about 50%/year.)
26. Performance Issues: Cache Bandwidth
- Performance instability: small changes may cause dramatic changes in delivered performance (see the example below).
- Latency-tolerant and bandwidth-parsimonious algorithms and software are critical.
- Recompute rather than store/load.
- Need to help the compiler.
- We have a hard time getting performance today, and it is only going to get harder.
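To make the instability point concrete, here is a minimal sketch (not from the talk) showing how one seemingly trivial change, swapping two loops, changes cache behavior and therefore delivered performance. The array size and timing harness are illustrative.

#include <stdio.h>
#include <time.h>

#define N 1024
static double a[N][N];   /* zero-initialized static array, 8 MB */

/* Row-major traversal: consecutive accesses hit the same cache line. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: the same arithmetic, but each access touches
   a new cache line, so it typically runs far slower. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    clock_t t0 = clock();
    double s1 = sum_row_major();
    clock_t t1 = clock();
    double s2 = sum_col_major();
    clock_t t2 = clock();
    /* printing the sums keeps the compiler from removing the loops */
    printf("row-major: %.3fs  col-major: %.3fs  (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}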
27. Where Do the Flops Go? Memory Hierarchy
- Can only do arithmetic on data at the top of the hierarchy
- Higher-level BLAS let us do this
(Figure: memory hierarchy pyramid, from fast, small, expensive storage at the top to slow, large, cheap storage at the bottom.)
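The standard operation and data counts explain why higher-level BLAS keep data at the top of the hierarchy (counts for length-n or n x n operands):

Level 1 (e.g., DAXPY):  2n   flops on 3n   words  ->  O(1) flops per word moved
Level 2 (e.g., DGEMV):  2n^2 flops on n^2  words  ->  O(1) flops per word moved
Level 3 (e.g., DGEMM):  2n^3 flops on 4n^2 words  ->  O(n) flops per word moved

Only Level 3 operations perform enough arithmetic per word of memory traffic to amortize the cost of moving data up the hierarchy.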
28. BLAS for Performance
- Development of blocked algorithms is important for performance (a minimal blocked multiply is sketched below).
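As an illustration only (not code from the talk), a bare-bones blocked C += A*B in C. The block size NB is a placeholder that would be tuned per machine so that three NB x NB tiles fit in cache, which is exactly what lets each tile be reused many times before it is evicted.

#define NB 64  /* illustrative block size; tuned per machine in practice */

/* Blocked (tiled) matrix multiply, C += A*B, square n x n matrices
   in row-major order. */
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* multiply one tile pair; the tiles stay in cache */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}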
29. How To Get Performance From Commodity Processors?
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Routines have a large design space with many parameters:
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
- A few months ago there was no tuned BLAS for the Pentium under Linux.
- Need for quick/dynamic deployment of optimized routines.
- ATLAS: Automatically Tuned Linear Algebra Software
- PhiPAC from Berkeley
- FFTW from MIT
30. Why ATLAS Is Needed
- Hand-tuned BLAS require many man-hours per platform
- Only done if the financial incentive is there
- Many platforms will never have an optimal version
- Tuning lags behind hardware
- May not be affordable by everyone
- Improves on vendor code
- Operations may be important, but not general enough to standardize
- Allows for portably optimal codes
31. Adaptive Approach for Level 3 BLAS
- Do a parameter study of the operation on the target machine, done once.
- The only generated code is the on-chip multiply.
- Each BLAS operation is written in terms of the generated on-chip multiply.
- All transpose cases are coerced through data copy to one case of the on-chip multiply.
- Only one case is generated per platform (see the sketch below).
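A hypothetical sketch of this structure (the names, storage format, and kernel case are illustrative, not ATLAS internals): the full GEMM copies each block into contiguous buffers, transposing as needed, so that every input case funnels into the single generated kernel. For brevity it assumes n is a multiple of NB.

#define NB 64  /* block size found by the one-time parameter study */

/* Stand-in for the generated on-chip multiply: c += a*b on contiguous
   NB x NB tiles, with c stored at leading dimension ldc.  In an
   ATLAS-like system this body is emitted and tuned per platform. */
static void on_chip_multiply(const double *a, const double *b,
                             double *c, int ldc) {
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i*ldc + j] += a[i*NB + k] * b[k*NB + j];
}

/* Copy block (r,c) of op(M) into a contiguous buffer, transposing if
   asked, so the kernel always sees one storage format. */
static void copy_block(int trans, const double *M, int ld,
                       int r, int c, double *buf) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            buf[i*NB + j] = trans ? M[(c + j)*ld + (r + i)]
                                  : M[(r + i)*ld + (c + j)];
}

/* C += op(A)*op(B), n a multiple of NB: every transpose case is
   coerced through copy_block into the single kernel case. */
void gemm(int transA, int transB, int n,
          const double *A, const double *B, double *C) {
    double abuf[NB*NB], bbuf[NB*NB];
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB) {
                copy_block(transA, A, n, i, k, abuf);
                copy_block(transB, B, n, k, j, bbuf);
                on_chip_multiply(abuf, bbuf, &C[i*n + j], n);
            }
}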
32. Code Generation Strategy
- On-chip multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
- Takes a couple of hours to run.
- Code is iteratively generated and timed until the optimal case is found (see the sketch below). We try:
- Differing NBs
- Breaking false dependencies
- M, N and K loop unrolling
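A toy sketch of the generate-and-time search over NB (illustrative only; a real generator emits and compiles specialized source per variant rather than passing a runtime parameter):

#include <stdio.h>
#include <time.h>

#define N 512
static double A[N*N], B[N*N], C[N*N];

/* Candidate kernel, parameterized by block size nb (nb must divide N). */
static void mm_variant(int nb) {
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; i++)
                    for (int k = kk; k < kk + nb; k++)
                        for (int j = jj; j < jj + nb; j++)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

int main(void) {
    int candidates[] = {16, 32, 64, 128}, best_nb = 0;
    double best_t = 1e30;
    for (int v = 0; v < 4; v++) {
        clock_t t0 = clock();
        mm_variant(candidates[v]);   /* result discarded; only the time matters */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB=%3d  %.3fs\n", candidates[v], t);
        if (t < best_t) { best_t = t; best_nb = candidates[v]; }
    }
    printf("selected NB=%d\n", best_nb);  /* install this variant */
    return 0;
}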
33. ATLAS 500x500 DGEMM Across Various Architectures
34. 500x500 LU, Right-Looking
35. Recursive Approach for Other Level 3 BLAS
Recursive TRMM (sketched below):
- Recur down to L1 cache block size
- Need kernel at bottom of recursion
- Use gemm-based kernel for portability
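One way to realize these bullets, as a minimal sketch (illustrative, not the talk's code), for B := A*B with A lower triangular: split A into quadrants, recurse on the diagonal blocks, let a GEMM-style kernel handle the off-diagonal block, and stop at an L1-sized block.

#define L1_BLOCK 32  /* recursion cutoff; tuned to the L1 cache in practice */

/* C += A*B: the gemm-based kernel the recursion bottoms out on. */
static void gemm_add(int m, int n, int k, const double *A, int lda,
                     const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
}

/* B := A*B in place, A lower triangular n x n, B n x ncols, row-major. */
void trmm_lower(int n, int ncols, const double *A, int lda,
                double *B, int ldb) {
    if (n <= L1_BLOCK) {
        /* small triangular multiply done in cache; rows updated bottom-up
           so each row still reads the old values of the rows above it */
        for (int i = n - 1; i >= 0; i--)
            for (int j = 0; j < ncols; j++) {
                double s = 0.0;
                for (int k = 0; k <= i; k++)
                    s += A[i*lda + k] * B[k*ldb + j];
                B[i*ldb + j] = s;
            }
        return;
    }
    int h = n / 2;
    /* A = [A11 0; A21 A22]: B2 := A22*B2, then B2 += A21*B1 (old B1),
       then B1 := A11*B1 */
    trmm_lower(n - h, ncols, &A[h*lda + h], lda, &B[h*ldb], ldb);
    gemm_add(n - h, ncols, h, &A[h*lda], lda, B, ldb, &B[h*ldb], ldb);
    trmm_lower(h, ncols, A, lda, B, ldb);
}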
36. 500x500 Recursive BLAS on a 433 MHz DEC 21164
37. Multithreaded BLAS for Performance
(Chart: DGEMM performance of ATLAS on 2 processors vs. ATLAS on 1 processor vs. the Intel BLAS; a simple threading sketch follows.)
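One simple way to layer threads over a tuned serial kernel (an illustrative OpenMP sketch, not ATLAS's actual threading): split the rows of C across processors. Each slice is independent, so no synchronization is needed beyond the parallel loop; in practice each thread would call the tuned one-processor GEMM on its slice rather than the plain loop shown.

#include <omp.h>

/* C += A*B (all n x n, row-major) with rows of C divided among
   threads; each row's result is independent of the others. */
void dgemm_threaded(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
}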
38. ATLAS and the Computational Grid
- Keep a repository of kernels for specific machines.
- Develop a means of dynamically downloading code.
- Extend the work to allow sparse matrix operations.
- Extend the work to include arbitrary code segments.
- www.netlib.org/atlas/
39. High Performance Computing: 60 Years From 1 to 10^13 Flop/s

Year  Flop/s                              Machine
1941  1                   (1 FLOPS)       ABC
1945  100                                 ENIAC
1949  1,000               (1 KiloFLOPS)   BINAC
1951  10,000                              UNIVAC
1961  100,000                             IBM 7030
1964  1,000,000           (1 MegaFLOPS)   CDC 6600
1968  10,000,000                          CDC 7600
1975  100,000,000                         Cray 1
1987  1,000,000,000       (1 GigaFLOPS)   NEC SX-2
1992  10,000,000,000                      NEC SX-3
1993  100,000,000,000                     TMC CM-5
1997  1,000,000,000,000   (1 TeraFLOPS)   Intel ASCI Red
2000  10,000,000,000,000                  ???
40. Performance Development
41. Performance Development
42. Summary
- Hans Meuer, Mannheim University
- Aad van der Steen, Utrecht
- Antoine Petitet, Erich Strohmaier, Clint Whaley, UTK
- http://www.netlib.org/utk/papers/advanced-computers/paper.html
- http://www.netlib.org/benchmark/top500.html
- http://www.netlib.org/atlas/
43. Europe - Countries
44. Fujitsu VPP700
- Distributed memory vector multi-processor
- Crossbar connected
- VPP800: 4 ns cycle
- http://www.fujitsu.co.jp/hypertext/Products/Info_process/hpc/vpp-e/index.html
45. NEC SX-5
- Distributed memory vector multi-processor
- Multistage crossbar
- http://www.ess.nec.de
46. IBM SP
- RISC-based distributed-memory multi-processor
- Omega switch, message passing
- New processors, available today
- 200 MHz, 4 ops/cycle, 800 Mflop/s
- 2 processors/node
- http://www.rs6000.ibm.com/hardware/largescale/index.html
47. SGI Origin
- Distributed shared memory, CC-NUMA, message passing
- Crossbar (4 proc/node), nodes connected in a hypercube
- http://www.sgi.com/origin/2000/index.html
- SV1: vector based, 1.2 Gflop/s per processor, 4 to a board
48. SUN Enterprise
- RISC-based SMP
- 2.5 ns (400 MHz), 800 Mflop/s
- 64 way crossbar, 12.8 GB/s bandwidth
- http://www.sun.com/servers/ultra_enterprise/10000/
49. Hitachi SR8000
- RISC-based distributed memory multi-processor
- Multi-dimensional crossbar
- www.hitachi.co.jp/Prod/comp/hpc/eng/sr81e.html
50. HP Exemplar
- RISC-based distributed-memory multi-processor
- DSM/Message passing
- http://www.enterprisecomputing.hp.com/index.html
51. Compaq/DEC Alpha
- COWs (clusters of workstations) coupled with your interconnect of choice