Title: Algorithms and Architecture
1 Algorithms and Architecture
- William D. Gropp, Mathematics and Computer Science, www.mcs.anl.gov/gropp
2 Algorithms
- What is an algorithm?
- A set of instructions to perform a task
- How do we evaluate an algorithm?
- Correctness
- Accuracy
- Not an absolute
- Efficiency
- Relative to current and future machines
- How do we measure efficiency?
- Often by counting floating point operations
- Compare to peak performance
3 Real and Idealized Computer Architectures
- Any algorithm assumes an idealized architecture
- Common choice:
- Floating-point work costs time
- Data movement is free
- Real systems:
- Floating point is free (fully overlapped with other operations)
- Data movement costs time (a lot of time)
- Classical complexity analysis for numerical algorithms is no longer correct (more precisely, no longer relevant)
- Known since at least BLAS2 and BLAS3
4 CPU and Memory Performance
(Figure: CPU performance versus DRAM performance over time)
5 More Recent Results (or meet Y2K, or correctness first)
6 Trends in Computer Architecture I
- Latency to memory will continue to grow relative to CPU speed
- Latency hiding techniques require finding increasing amounts of independent work. Little's law implies:
- Number of concurrent memory references = latency × rate
- For 1 reference per cycle, this is already 100-1000 concurrent references (see the worked example below)
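A worked example of the Little's law bound (the latency figures here are assumed round numbers, not from the slide): if main memory latency is 200 cycles and the processor wants to issue 1 reference per cycle, then 200 × 1 = 200 memory references must be in flight at all times to keep the pipeline busy; at 1000 cycles of effective latency, the requirement grows to 1000 outstanding references.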
7 Trends in Computer Architecture II
- Clock speeds will continue to increase
- The rate of clock rate increase has increased recently
- Light travels 3 cm (in a vacuum) in one cycle of a 10 GHz clock
- CPU chips won't be causally connected within a single clock cycle, i.e., a signal will not cross the chip in a single clock cycle
- Processors will be parallel!
8 Trends in Computer Architecture III
- Power dissipation problems will force more changes
- Current trends imply chips with energy densities greater than a nuclear reactor
- Already a problem: the current issue of Consumer Reports looks at the likelihood of getting a serious burn from your laptop!
- Will force new ways to get performance, such as extensive parallelism
9 Consequences
- Gap between memory and processor performance will continue to grow
- Data motion will dominate the cost of many (most) calculations
- The key is to find a computational cost abstraction that is as simple as possible, but no simpler
10 Architecture Invariants
- Performance is determined by memory performance
- Memory system design for performance makes system performance less predictable
- Fast memories are possible, but
- Expensive ($)
- Large (cubic meters)
- Power hungry (Watts)
- Algorithms that don't take these realities into account may be irrelevant
11 Node Performance
- Current laptops now have a peak speed (based on clock rate) of over 1 Gflops (10 Cray-1s!)
- Observed (sustained) performance is often a small fraction of peak
- Why is the gap between peak and sustained performance so large?
- Let's look at a simple numerical kernel
12 Sparse Matrix-Vector Product
- Common operation for optimal (in floating-point operations) solution of linear systems
- Sample code (compressed sparse row form; a fuller sketch appears below):
      for row = 1, n
         m   = i[row] - i[row-1]
         sum = 0
         for k = 1, m
            sum += *a++ * x[*j++]
         y[row] = sum
- Data structures are a[nnz], j[nnz], i[n], x[n], y[n]
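As a more complete illustration, here is a minimal, self-contained C version of the same kernel. The CSR layout below (a rowptr array with n+1 entries rather than the slide's i[n]) and all variable names are assumptions made for this sketch, not taken from the talk.

    #include <stdio.h>

    /* Sparse matrix-vector product y = A*x with A in compressed sparse
       row (CSR) form:
         val[nnz]    - nonzero values
         colidx[nnz] - column index of each nonzero
         rowptr[n+1] - start of each row in val/colidx                */
    void spmv_csr(int n, const int *rowptr, const int *colidx,
                  const double *val, const double *x, double *y)
    {
        for (int row = 0; row < n; row++) {
            double sum = 0.0;
            for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
                sum += val[k] * x[colidx[k]];  /* one multiply-add per nonzero */
            y[row] = sum;
        }
    }

    int main(void)
    {
        /* 3x3 example: [2 1 0; 0 3 0; 1 0 4] times x = (1,1,1) */
        int    rowptr[] = {0, 2, 3, 5};
        int    colidx[] = {0, 1, 1, 0, 2};
        double val[]    = {2, 1, 3, 1, 4};
        double x[] = {1, 1, 1}, y[3];

        spmv_csr(3, rowptr, colidx, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);  /* expect 3 3 5 */
        return 0;
    }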
13 Simple Performance Analysis
- Memory motion:
- nnz * (sizeof(double) + sizeof(int)) + n * (2*sizeof(double) + sizeof(int)) bytes
- Assume a perfect cache (never load the same data twice)
- Computation:
- nnz multiply-adds (MA)
- Roughly 12 bytes per MA
- Typical workstation node can move 1-4 bytes/MA
- Maximum performance is 8-33% of peak (the arithmetic is spelled out below)
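Spelling out the arithmetic (assuming 8-byte doubles, 4-byte ints, and nnz much larger than n so the per-row terms are negligible): each nonzero requires loading one matrix value (8 bytes) and one column index (4 bytes), i.e., about 12 bytes of memory traffic per multiply-add. A node that can sustain 1 byte of memory traffic per multiply-add of peak floating-point rate is therefore limited to about 1/12 ≈ 8% of peak on this kernel; at 4 bytes/MA the bound is 4/12 ≈ 33%.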
14 More Performance Analysis
- Instruction counts:
- nnz * (2 load-double + load-int + mult-add) + n * (load-int + store-double)
- Roughly 4 instructions per MA
- Maximum performance is 25% of peak (33% if an MA overlaps one load/store)
- (Wide instruction words can help here)
- Changing the matrix data structure (e.g., exploiting small block structure) allows reuse of data in registers, eliminating some loads (of x and j); see the block sketch below
- Implementation improvements (tricks) cannot improve on these limits
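A sketch of the block idea: storing small dense blocks (here fixed 2x2 blocks in a block-CSR layout) lets each loaded x value and column index serve two rows, cutting the loads of x and j that limit plain CSR. The layout and names are illustrative assumptions, not code from the talk.

    /* 2x2 block-CSR (BCSR) matrix-vector product sketch */
    void spmv_bcsr2x2(int nb,              /* number of 2x2 block rows      */
                      const int *browptr,  /* nb+1 entries                  */
                      const int *bcolidx,  /* block column of each block    */
                      const double *bval,  /* 4 values per block, row-major */
                      const double *x, double *y)
    {
        for (int ib = 0; ib < nb; ib++) {
            double sum0 = 0.0, sum1 = 0.0;        /* two rows accumulated at once */
            for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
                const double *b = &bval[4 * k];
                double x0 = x[2 * bcolidx[k]];     /* each x value loaded once,  */
                double x1 = x[2 * bcolidx[k] + 1]; /* reused for both rows       */
                sum0 += b[0] * x0 + b[1] * x1;
                sum1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     = sum0;
            y[2 * ib + 1] = sum1;
        }
    }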
15 Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
(Figure: one vector, matrix size m = 90,708, nonzero entries nz = 5,047,120)
Thanks to Dinesh Kaushik; ORNL and ANL for compute time
16 Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
(Figure: one vector, matrix size m = 90,708, nonzero entries nz = 5,047,120)
17 What About CPU-Bound Operations?
- Dense matrix-matrix product
- Most studied numerical program by compiler writers
- Core of some important applications
- More importantly, the core operation in High Performance Linpack
- The benchmark used to rate the Top 500 fastest systems
- Should give optimal performance (a naive version is sketched below)
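For reference, this is the kind of "natural" code a user might write for C = C + A*B (a generic sketch, not code from the talk); the inner loop walks B with stride n, which is exactly the memory access pattern that defeats the cache.

    /* Naive dense matrix-matrix product: C = C + A*B, all n x n, row-major */
    void dgemm_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i * n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];  /* stride-n access to B */
                C[i * n + j] = sum;
            }
    }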
18 The Compiler Will Handle It (?)
Large gap between natural code and specialized code
Enormous effort required to get good performance
19 DGEMM (n = 500)
20 Performance for Real Applications
- The dense matrix-matrix example shows that even for well-studied, compute-bound kernels, compiler-generated code achieves only a small fraction of available performance
- Fortran code uses natural loops, i.e., what a user would write for most code
- Others use multi-level blocking, careful instruction scheduling, etc. (a blocked sketch follows this list)
- Algorithm design also needs to take into account the capabilities of the system, not just the hardware
- Example: Cache-Oblivious Algorithms (http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html)
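A minimal sketch of one level of cache blocking for C = C + A*B. Tuned libraries add register blocking, multiple cache levels, and careful instruction scheduling; this only shows the basic idea, and the block size BS is an assumed tuning parameter.

    #define BS 64

    /* Blocked dense matrix-matrix product: C = C + A*B, n x n, row-major */
    void dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* multiply the BS x BS sub-blocks (edge blocks handled by the
                       min-style loop bounds) so the working set fits in cache */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }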
21 Challenges in Creating a Performance Model Based on Memory Accesses
- Different levels of the memory hierarchy have significantly different performance
- Cache behavior is sensitive to details of data layout
- Still no good calculus for predicting performance
(Figure: STREAM performance in MB/s versus data size; interleaved data causes data to be displaced while it is still needed for later steps)
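For context, the STREAM figure comes from kernels like the "triad" below, which measure sustainable memory bandwidth. This is a simplified sketch, not the official benchmark code; the array size and timing method are illustrative choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)            /* ~4M doubles (32 MB) per array */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double q = 3.0;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];          /* ~24 bytes moved per iteration */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("triad bandwidth: %.1f MB/s\n", 24.0 * N / sec / 1e6);
        free(a); free(b); free(c);
        return 0;
    }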
22 Parallel Performance Issues
- Coordination of accesses
- Ensuring that two (or more) processes/threads do not access data before it is ready
- Related to main (not cache) memory behavior and timing
- Intrinsically nonscalable operations
- Inner products
- Opportunity for algorithm designers:
- Use nonblocking or split operations
- Arrange algorithms so that other operations can take place while the inner product is being assembled (a sketch of this follows)
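One way to realize the split inner product with modern MPI is a nonblocking reduction. MPI_Iallreduce is an MPI-3 feature that post-dates these slides; the "other useful work" is a placeholder the application would fill in.

    #include <mpi.h>

    /* Overlap a global inner product with other work using a
       nonblocking reduction (sketch only). */
    double split_inner_product(const double *x, const double *y, int n,
                               MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];             /* local part of the dot product */

        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... do other useful work here while the reduction is in flight ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* result needed from here on */
        return global;
    }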
23 Sample Parallel System Architecture
- Systems have an increasingly deep memory hierarchy (1, 2, 3, and more levels of cache)
- Time to reference main memory: 100s of cycles
- Access to shared data requires synchronization
- Better to ensure data is local and unshared when possible
(Figure: SMP nodes connected by an interconnect)
24 Is Parallel Computing Hard?
- Conjecture:
- The difficulty in writing high-performance code is directly related to the ratio of memory access times
- Reason: hiding latency requires blocking, prefetch, and other special techniques
- Parallel computing: now the easy part
- (Easy only relative to performance programming for uniprocessors)
25 Aggregate Parallel Performance of PETSc-FUN3D
(Figure: IBM Power 4, 512 processors (1.3 GHz); Pentium 4 Xeon cluster, 250 processors (2.4 GHz))
26 Algorithms for Grids
- Similar features
- Yet another level of memory hierarchy
- (Unique) features
- Very dynamic resources
27 Is Performance Everything?
- In August 1991, the Sleipner A, an oil and gas platform built in Norway for operation in the North Sea, sank during construction. The total economic loss amounted to about $700 million. After investigation, it was found that the failure of the walls of the support structure resulted from a serious error in the finite element analysis of the linear elastic model. (http://www.ima.umn.edu/arnold/disasters/sleipner.html)
28 Correctness and Accuracy
- Many current algorithms were designed to balance performance and accuracy
- These choices were often made when computers were 10^6 times slower than they are now
- Is it time to re-examine these choices, particularly for applications that are now done on laptops?
29 New (and Not So New) Architectures
- Commodity high-performance processors
- Conventional desktops and laptops
- Game systems
- Cell phones (voice recognition, games)
- DARPA High Productivity Computing Systems project
- Vectors and streams
30 Typical 6-Gflop Commodity System (circa 2000)
Over 100 (32-bit) Petaflops (0.1 Exaflops) already delivered!
31 PIM-based Node
(Figure: PIM-based node with network connection)
- Homogeneous:
- All memory the same
- Heterogeneous:
- Different memory access costs
- Different processors(?)
32 IBM BlueGene/L
Processor-rich system; high-performance, memory-oriented interconnect; processors near memory can reduce the need to move data
33 (Somewhat) Exotic Solutions: HTMT
- Radically new technologies:
- Superconducting (RSFQ) logic (240 GHz)
- Multithreaded/stranded execution model
- Multilevel PIM-enhanced memory
- Optical networks
- Optical (holographic) main store
- http://htmt.caltech.edu/
- Still a (deep) memory hierarchy
- Elements finding their way into new innovative architectures
34 Conventional Supercomputers
(Figure: Earth Simulator, processor nodes PN 636 through PN 639 shown)
35 Algorithms
- Exploit problem behavior at different scales:
- Multigrid
- Domain Decomposition
- Generalizes multigrid (or multigrid generalizes DD)
- Provides a spectrum of robust, optimal methods for a wide range of problems
- Nonlinear versions hold great promise
- Continuation
- Divide and conquer
- Multipole and wavelets
- Cache-sensitive algorithms (a small cache-oblivious sketch follows this list)
- See Karp in SIAM Review 1996
- Even the mathematicians know about this now (McGeoch, AMS Notices, March 2001)
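A small example of the cache-sensitive (here, cache-oblivious) style: a recursive matrix transpose that keeps subdividing the index space until the pieces fit in whatever caches exist, without knowing their sizes. This is a generic illustration, not code from the talk; the base-case threshold is an assumed constant. Call it as transpose(a, b, n, 0, n, 0, n).

    /* Cache-oblivious out-of-place transpose: b = a^T, both n x n, row-major */
    void transpose(const double *a, double *b, int n,
                   int r0, int r1, int c0, int c1)
    {
        if ((r1 - r0) * (c1 - c0) <= 64) {         /* small base case */
            for (int i = r0; i < r1; i++)
                for (int j = c0; j < c1; j++)
                    b[j * n + i] = a[i * n + j];
        } else if (r1 - r0 >= c1 - c0) {           /* split the longer dimension */
            int rm = (r0 + r1) / 2;
            transpose(a, b, n, r0, rm, c0, c1);
            transpose(a, b, n, rm, r1, c0, c1);
        } else {
            int cm = (c0 + c1) / 2;
            transpose(a, b, n, r0, r1, c0, cm);
            transpose(a, b, n, r0, r1, cm, c1);
        }
    }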
36 Conclusions
- Performance models should count data motion, not flops
- Computers will continue to have multiple levels of memory hierarchy
- Algorithms should exploit them
- Computers will be parallel
- Algorithms can make effective use of greater adaptivity to give better time-to-solution and accuracy
- Denial is not a solution