An Overview of High Performance Computing and Performance Issues

1
An Overview of High Performance Computing and Performance Issues
  • Jack Dongarra
  • University of Tennessee
  • and
  • Oak Ridge National Laboratory

2
(No Transcript)
3
Outline
  • Not going to talk about . . .
  • Linear algebra algorithms or software
  • Message passing systems like PVM or MPI
  • Distributed network computing
  • Focus on . . .
  • Trends in High-Performance Computing
  • Tools for building efficient numerical kernel software

4
High-Performance Computing Today
  • In the past decade, the world has experienced one
    of the most exciting periods in computer
    development.
  • Microprocessors have become smaller, denser, and
    more powerful.
  • The result is that microprocessor-based
    supercomputing is rapidly becoming the technology
    of preference in attacking some of the most
    important problems of science and engineering.

5
Growth of Microprocessor Performance
[Chart: performance in Mflop/s (log scale, 0.01 to 10,000) vs. year (1980-1998). Cray vector machines (Cray 1S, X-MP, Y-MP, Cray 2, C90, T90) are plotted against microprocessors (8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha). Moore's Law: 2X transistors/chip every 1.5 years; microprocessors have become smaller, denser, and more powerful.]
6
Linpack-HPC Benchmark Over Time
[Chart: Linpack-HPC performance in Gflop/s (0 to 2000) vs. year (1991-1999), with the Cray Y-MP (8 processors) marked as an early reference point.]
7
Different Architectures
  • SIMD
  • Vector Computers
  • MIMD
    • Shared Memory
      • SUN Enterprise
      • Cray J90/T90 (vector)
    • Distributed Shared Memory
      • SGI Origin
      • HP-Convex
      • NEC SX (vector)
    • Distributed Memory
      • Cray T3E
      • IBM SP
      • Fujitsu VPP (vector)
    • Clusters of Processors (COWs, NOWs, Beowulf)

8
Vectors
  • Fujitsu VPP-700
  • Distributed-memory vector multi-processor
  • Crossbar connected, up to 256
  • 2.4 Gflop/s per processor
  • VPP-5000: up to 512 processors, 9.6 Gflop/s per processor
  • NEC SX-5
  • Distributed-memory vector multi-processor
  • Multistage crossbar, up to 512
  • 8 Gflop/s per processor
  • Hitachi SR-8000
  • RISC-based distributed-memory multi-processor (PVP)
  • Multi-dimensional crossbar, up to 128
  • 8 Gflop/s per processor
  • SGI SV1
  • RISC based
  • Ring based, up to 1024
  • 4.8 Gflop/s per processor

9
Scalar
  • IBM SP
  • RISC-based distributed-memory multi-processor
  • Omega switch, message passing
  • 200 MHz, 4 ops/cycle, 800 Mflop/s per processor
  • 2 processors/node
  • ASCI machine: 5836 processors
  • SGI Origin 2000
  • Distributed shared memory, CC-NUMA, message passing
  • Crossbar (4 processors/node), nodes connected in a hypercube
  • 500 Mflop/s per processor
  • ASCI machine: 6144 processors
  • Sun Enterprise
  • SMP, 64-way crossbar
  • 800 Mflop/s per processor

10
Scalar (Continued)
  • HP Exemplar
  • Distributed shared memory, CC-NUMA, message passing
  • 32-processor SMPs interconnected via a ring
  • 1.76 Gflop/s per processor
  • Compaq (DEC)
  • Memory Channel switch
  • 112 processors
  • 1.2 Gflop/s per processor

11
High-Performance Computing Directions
  • Move toward distributed shared memory
  • Distributed shared memory (clusters of processors connected)
  • Shared address space with a deep memory hierarchy
  • Clustering of shared-memory machines for scalability
  • Emergence of PC commodity systems
  • Pentium/Alpha based, NT or Linux driven
  • Efficiency of message passing and data-parallel programming
  • Helped by standards efforts such as PVM, MPI, OpenMP, and HPF
  • In many cases used as single-user environments
  • Pure COTS

12
Top500 Fastest Installed Computers
  • Lists the 500 fastest supercomputers installed at sites worldwide.
  • Provides a snapshot of the supercomputers installed around the world.
  • Began in 1993; published every 6 months.
  • Measures performance on the TPP Linpack benchmark.
  • Provides a way to measure trends.

13
TOP10
[Table: the top 10 systems. The top 10 are at roughly 1/2 Tflop/s; 3 are DOE ASCI machines; 7 exceed 1 Tflop/s peak; 1 is in Japan.]
14
Performance Development
[Chart: half of the machines were replaced since the previous list (MPP: 358, PVP: 58; industrial: 241). 18 machines are in France, including the Meteo VPP700 (116).]
15
Continents
16
(No Transcript)
17
Producers
18
Manufacturer
19
Customer Type
20
Processor Type
21
Chip Technology
22
Chip Technology
23
Architectures
24
Excerpt from TOP500
25
Where do the Flops go? Who Cares About the Memory Hierarchy?
[Chart: the processor-DRAM memory gap (latency). Relative performance (log scale, 1 to 1000) vs. year (1980-2000). CPU performance grows 60%/yr (2X/1.5 yr, Moore's Law); DRAM performance grows 9%/yr (2X/10 yrs); the processor-memory performance gap grows about 50% per year.]
26
Performance Issues - Cache Bandwidth
  • Performance instability
  • Small changes may cause dramatic changes in delivered
    performance (a sketch follows below).
  • Latency-tolerant and bandwidth-parsimonious algorithms
    and software are critical
  • Recompute rather than store/load
  • Need to help the compiler
  • Getting performance is hard today, and it is only going
    to get harder.

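To make the instability concrete, here is a minimal C sketch (not from the slides; the size N and the timing scaffolding are illustrative assumptions). Both loops do identical arithmetic, but the second walks a row-major array with stride N, so a one-line change in loop order can swing delivered performance by a large factor once the matrix outgrows cache.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048

/* Same arithmetic, different memory access order: the row-order
   traversal is unit-stride and cache-friendly; the column-order
   traversal misses in cache on nearly every access for large N. */
static double sum_rows(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];          /* unit stride */
    return s;
}

static double sum_cols(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];          /* stride N: bandwidth-hostile */
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    for (size_t k = 0; k < (size_t)N * N; k++) a[k] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_rows(a);
    clock_t t1 = clock();
    double s2 = sum_cols(a);
    clock_t t2 = clock();

    printf("row order: %.3fs  column order: %.3fs  (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}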
27
Where Do the Flops Go? Memory Hierarchy
  • Can only do arithmetic on data at the top of the
    hierarchy
  • Higher level BLAS lets us do this

[Figure: memory hierarchy pyramid, from fast, small, expensive storage (registers, cache) at the top to slow, large, cheap storage (memory, disk) at the bottom.]
28
BLAS for Performance
  • Development of blocked algorithms is important for
    performance (a sketch follows below)

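A minimal sketch of a blocked (tiled) algorithm, assuming row-major storage and an illustrative blocking factor NB = 64 (a real implementation tunes NB per machine): each NB x NB tile is reused from cache on the order of NB times, which is the reuse that blocked algorithms and the Level 3 BLAS exploit.

#define NB 64   /* illustrative blocking factor; real codes tune this */

/* C += A * B for n x n row-major matrices, tiled so that each
   NB x NB block of A and B is reused from cache ~NB times. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB) {
                int imax = ii + NB < n ? ii + NB : n;
                int kmax = kk + NB < n ? kk + NB : n;
                int jmax = jj + NB < n ? jj + NB : n;
                for (int i = ii; i < imax; i++)
                    for (int k = kk; k < kmax; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jmax; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
            }
}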
29
How To Get Performance From Commodity Processors?
  • Today's processors can achieve high performance, but
    this requires extensive machine-specific hand tuning.
  • Routines have a large design space with many parameters:
  • Blocking sizes, loop-nesting permutations, loop-unrolling
    depths, software-pipelining strategies, register
    allocations, and instruction schedules (one such variant
    is sketched below).
  • Complicated interactions with the increasingly
    sophisticated micro-architectures of new microprocessors.
  • Until a few months ago there was no tuned BLAS for the
    Pentium under Linux.
  • Need for quick/dynamic deployment of optimized routines:
  • ATLAS - Automatically Tuned Linear Algebra Software
  • PHiPAC from Berkeley
  • FFTW from MIT

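To make the design space concrete, here is one hand-written point in it: a matrix-multiply kernel with the i and j loops unrolled 2x2 and the accumulators held in registers. The 2x2 depths and the assumption that n is even are arbitrary illustrations; tuning systems such as ATLAS generate and time many such variants rather than committing to one.

/* One point in the tuning design space: i and j loops unrolled
   by 2 with accumulators kept in registers (assumes n is even).
   A generator would emit many variants like this -- different
   unroll depths, loop orders, blockings -- and keep the fastest. */
void matmul_unrolled2x2(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            double c00 = C[i*n + j],       c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j],   c11 = C[(i+1)*n + j + 1];
            for (int k = 0; k < n; k++) {
                double a0 = A[i*n + k],     a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j],     b1 = B[k*n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i*n + j] = c00;        C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10;    C[(i+1)*n + j + 1] = c11;
        }
}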
30
Why ATLAS is needed
  • BLAS require many man-hours per platform
  • Only done where there is a financial incentive
  • Many platforms will never have an optimal version
  • Lags behind hardware
  • May not be affordable by everyone
  • Improves vendor code
  • Operations may be important, but not general enough
    for a standard
  • Allows for portably optimal codes

31
Adaptive Approach for Level 3 BLAS
  • Do a parameter study of the operation on the target
    machine; this is done once.
  • The only generated code is the on-chip multiply
  • The BLAS operation is written in terms of the generated
    on-chip multiply (a sketch follows below)
  • All transpose cases are coerced through data copy to one
    case of on-chip multiply
  • Only one case is generated per platform

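A hypothetical sketch of this structure; the names on_chip_multiply and copy_block, the fixed NB, and the simplification that n is a multiple of NB are illustrative assumptions, not ATLAS's actual interfaces. The full GEMM is expressed as data copies into one canonical block format plus calls to a single on-chip multiply, so only that one kernel ever needs to be generated.

#define NB 64   /* L1 blocking factor; in ATLAS this is found by search */

/* Reference on-chip multiply: C (NB x NB, row stride ldc) += Ablk * Bblk.
   In ATLAS this one routine is the only machine-generated, tuned code. */
static void on_chip_multiply(const double *Ablk, const double *Bblk,
                             double *C, int ldc) {
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            double a = Ablk[i * NB + k];
            for (int j = 0; j < NB; j++)
                C[i * ldc + j] += a * Bblk[k * NB + j];
        }
}

/* Copy an NB x NB block of op(M) (rows i.., columns k..) into the
   kernel's contiguous canonical format; the copy absorbs the transpose. */
static void copy_block(int n, const double *M, int i, int k,
                       int trans, double *blk) {
    for (int r = 0; r < NB; r++)
        for (int c = 0; c < NB; c++)
            blk[r * NB + c] = trans ? M[(k + c) * n + (i + r)]
                                    : M[(i + r) * n + (k + c)];
}

/* C += op(A) * op(B) for n x n row-major matrices, n a multiple of NB.
   All transpose cases reach the single generated on-chip multiply. */
void gemm(int n, int transA, int transB,
          const double *A, const double *B, double *C) {
    double Ablk[NB * NB], Bblk[NB * NB];
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB) {
                copy_block(n, A, i, k, transA, Ablk);
                copy_block(n, B, k, j, transB, Bblk);
                on_chip_multiply(Ablk, Bblk, &C[i * n + j], n);
            }
}

A real implementation copies each block only once and reuses it; the per-iteration copy here just keeps the sketch short.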
32
Code Generation Strategy
  • On-chip multiply optimizes for
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
  • Register reuse
  • Loop overhead minimization
  • Takes a couple of hours to run.
  • Code is iteratively generated and timed until the optimal
    case is found (a timing sketch follows below). We try:
  • Differing blocking factors (NB)
  • Breaking false dependencies
  • M, N, and K loop unrolling

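A minimal sketch of the empirical search loop, under stated assumptions: in ATLAS the candidates are freshly generated code (different unrollings, schedules), whereas here one parameterized stand-in kernel and an arbitrary list of NB values illustrate the generate-time-keep-best cycle.

#include <stdio.h>
#include <time.h>

#define N 512
static double A[N * N], B[N * N], C[N * N];

/* Stand-in candidate kernel, parameterized by the blocking factor nb.
   (C accumulates across runs; only the times matter here.) */
static void candidate(int nb) {
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int i = ii; i < ii + nb; i++)
                for (int k = kk; k < kk + nb; k++)
                    for (int j = 0; j < N; j++)
                        C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void) {
    int nbs[] = {16, 32, 64, 128}, best_nb = 0;
    double best = 1e30;
    for (int t = 0; t < 4; t++) {
        clock_t t0 = clock();
        candidate(nbs[t]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB=%3d: %.3fs\n", nbs[t], secs);
        if (secs < best) { best = secs; best_nb = nbs[t]; }
    }
    printf("best NB on this machine: %d\n", best_nb);
    return 0;
}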
33
ATLAS 500x500 DGEMM Across Various Architectures
34
500 x 500 LU Right-Looking
35
Recursive Approach for Other Level 3 BLAS
Recursive TRMM
  • Recur down to the L1 cache block size
  • Need a kernel at the bottom of the recursion
  • Use a GEMM-based kernel for portability (a sketch
    follows below)

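A sketch of the recursive idea for TRMM (B := A*B with A upper triangular), assuming the non-transposed upper case only; the names rtrmm and gemm_update are hypothetical, and the reference GEMM stands in for a tuned one. Split A into quadrants, recur on the two triangular diagonal blocks, and handle the rectangular off-diagonal block with the GEMM-based kernel; the recursion bottoms out at an assumed L1 block size NB.

#define NB 64   /* assumed L1 block size where the recursion bottoms out */

/* GEMM-based kernel for the off-diagonal update: C += A * B, where
   C is m x n (stride ldc), A is m x k (stride lda), B is k x n
   (stride ldb). In practice a tuned GEMM would be called here. */
static void gemm_update(int m, int n, int k, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++) {
            double a = A[i * lda + p];
            for (int j = 0; j < n; j++)
                C[i * ldc + j] += a * B[p * ldb + j];
        }
}

/* B := A * B, A upper triangular (n x n, row stride lda),
   B n x m (row stride ldb). Recur on n until the block fits in L1. */
void rtrmm(int n, int m, const double *A, int lda, double *B, int ldb) {
    if (n <= NB) {                    /* base case: direct in-place kernel */
        for (int j = 0; j < m; j++)
            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int k = i; k < n; k++)
                    s += A[i * lda + k] * B[k * ldb + j];
                B[i * ldb + j] = s;
            }
        return;
    }
    int h = n / 2;
    /* B1 := A11 * B1, then B1 += A12 * B2, then B2 := A22 * B2 */
    rtrmm(h, m, A, lda, B, ldb);
    gemm_update(h, m, n - h, A + h, lda, B + h * ldb, ldb, B, ldb);
    rtrmm(n - h, m, A + h * lda + h, lda, B + h * ldb, ldb);
}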
36
500x500 Recursive BLAS on 433 MHz DEC 21164
37
Multithreaded BLAS for Performance
[Chart: DGEMM performance with multithreaded BLAS, comparing ATLAS on 2 processors, ATLAS on 1 processor, and the Intel BLAS.]
38
ATLAS and the Computational Grid
  • Keep a repository of kernels for specific
    machines.
  • Develop a means of dynamically downloading code
  • Extend work to allow sparse matrix operations
  • Extend work to include arbitrary code segments
  • www.netlib.org/atlas/

39
High Performance Computing: 60 Years From 1 to 10^13 Flop/s

Year   Flop/s              Milestone      Machine
1941   1                   1 FLOPS        ABC
1945   100                                ENIAC
1949   1,000               1 KiloFLOPS    BINAC
1951   10,000                             UNIVAC
1961   100,000                            IBM 7030
1964   1,000,000           1 MegaFLOPS    CDC 6600
1968   10,000,000                         CDC 7600
1975   100,000,000                        Cray 1
1987   1,000,000,000       1 GigaFLOPS    NEC SX-2
1992   10,000,000,000                     NEC SX-3
1993   100,000,000,000                    TMC CM-5
1997   1,000,000,000,000   1 TeraFLOPS    Intel ASCI Red
2000   10,000,000,000,000                 ???
40
Performance Development
41
Performance Development
42
Summary
  • Hans Meuer
  • Mannheim University
  • Aad van der Steen
  • Utrecht
  • Antoine Petitet
  • Erich Strohmaier
  • Clint Whaley
  • UTK
  • http://www.netlib.org/utk/papers/advanced-computers/paper.html
  • http://www.netlib.org/benchmark/top500.html
  • http://www.netlib.org/atlas/

43
Europe - Countries
44
Fujitsu VPP 700
  • Distributed memory vector multi-processor
  • Cross bar connected
  • VPP800 4ns
  • http://www.fujitsu.co.jp/hypertext/Products/Info_process/hpc/vpp-e/index.html

45
NEC SX-5
  • Distributed memory vector multi-processor
  • Multistage crossbar
  • http://www.ess.nec.de

46
IBM SP
  • RISC-based distributed-memory multi-processor
  • Omega switch, message passing
  • New processors, available today
  • 200 MHz, 4 ops/cycle, 800 Mflop/s
  • 2 processors/node
  • http://www.rs6000.ibm.com/hardware/largescale/index.html

47
SGI Origin
  • Distributed shared memory, CC-NUMA, message
    passing
  • Crossbar (4 proc/node), nodes hypercubed
  • http://www.sgi.com/origin/2000/index.html
  • SV1: vector based, 1.2 Gflop/s per processor, 4 to a board

48
SUN Enterprise
  • RISC-based SMP
  • 2.5 ns (400 MHz), 800 Mflop/s
  • 64 way crossbar, 12.8 GB/s bandwidth
  • http://www.sun.com/servers/ultra_enterprise/10000/

49
Hitachi SR8000
  • RISC-based distributed memory multi-processor
  • Multi-dimensional crossbar
  • www.hitachi.co.jp/Prod/comp/hpc/eng/sr81e.html

50
HP Exemplar
  • RISC-based distributed-memory multi-processor
  • DSM/Message passing
  • http://www.enterprisecomputing.hp.com/index.html

51
Compaq/DEC Alpha
  • COWs coupled with your interconnect