Title: Algorithms and Architecture
1 Algorithms and Architecture
- William D. Gropp, Mathematics and Computer Science, www.mcs.anl.gov/gropp
2 Algorithms
- What is an algorithm?
- A set of instructions to perform a task
- How do we evaluate an algorithm?
- Correctness
- Accuracy
- Not an absolute
- Efficiency
- Relative to current and future machines
- How do we measure efficiency?
- Often by counting floating point operations
- Compare to peak performance
3 Real and Idealized Computer Architectures
- Any algorithm assumes an idealized architecture
- Common choice:
- Floating-point work costs time
- Data movement is free
- Real systems:
- Floating point is free (fully overlapped with other operations)
- Data movement costs time (a lot of time)
- Classical complexity analysis for numerical algorithms is no longer correct (more precisely, no longer relevant)
- Known since at least BLAS2 and BLAS3
4 CPU and Memory Performance
(Figure: CPU performance versus DRAM performance over time)
5 More Recent Results (or meet Y2K, or correctness first)
6 Trends in Computer Architecture I
- Latency to memory will continue to grow relative to CPU speed
- Latency hiding techniques require finding increasing amounts of independent work. Little's law implies:
- Number of concurrent memory references = latency × rate
- For 1 reference per cycle, this is already 100-1000 concurrent references (see the worked example below)
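A worked example of the Little's law bound (the latency figures here are assumed round numbers, not from the slide): if main memory latency is 200 cycles and the processor wants to issue 1 reference per cycle, then 200 × 1 = 200 memory references must be in flight at all times to keep the pipeline busy; at 1000 cycles of effective latency, the requirement grows to 1000 outstanding references.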
7 Trends in Computer Architecture II
- Clock speeds will continue to increase
- The rate of clock rate increase has increased recently
- Light travels 3 cm (in a vacuum) in one cycle of a 10 GHz clock
- CPU chips won't be causally connected within a single clock cycle, i.e., a signal will not cross the chip in a single clock cycle
- Processors will be parallel!
8 Trends in Computer Architecture III
- Power dissipation problems will force more changes
- Current trends imply chips with energy densities greater than a nuclear reactor
- Already a problem: the current issue of Consumer Reports looks at the likelihood of getting a serious burn from your laptop!
- Will force new ways to get performance, such as extensive parallelism
9 Consequences
- Gap between memory and processor performance will continue to grow
- Data motion will dominate the cost of many (most) calculations
- The key is to find a computational cost abstraction that is as simple as possible, but no simpler
10 Architecture Invariants
- Performance is determined by memory performance
- Memory system design for performance makes system performance less predictable
- Fast memories are possible, but
- Expensive ($)
- Large (cubic meters)
- Power hungry (Watts)
- Algorithms that don't take these realities into account may be irrelevant
11 Node Performance
- Current laptops now have a peak speed (based on clock rate) of over 1 Gflops (10 Cray-1s!)
- Observed (sustained) performance is often a small fraction of peak
- Why is the gap between peak and sustained performance so large?
- Let's look at a simple numerical kernel
12 Sparse Matrix-Vector Product
- Common operation for optimal (in floating-point operations) solution of linear systems
- Sample code (compressed sparse row form; a fuller sketch appears below):
      for row = 1, n
         m   = i[row] - i[row-1]
         sum = 0
         for k = 1, m
            sum += *a++ * x[*j++]
         y[row] = sum
- Data structures are a[nnz], j[nnz], i[n], x[n], y[n]
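As a more complete illustration, here is a minimal, self-contained C version of the same kernel. The CSR layout below (a rowptr array with n+1 entries rather than the slide's i[n]) and all variable names are assumptions made for this sketch, not taken from the talk.

    #include <stdio.h>

    /* Sparse matrix-vector product y = A*x with A in compressed sparse
       row (CSR) form:
         val[nnz]    - nonzero values
         colidx[nnz] - column index of each nonzero
         rowptr[n+1] - start of each row in val/colidx                */
    void spmv_csr(int n, const int *rowptr, const int *colidx,
                  const double *val, const double *x, double *y)
    {
        for (int row = 0; row < n; row++) {
            double sum = 0.0;
            for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
                sum += val[k] * x[colidx[k]];  /* one multiply-add per nonzero */
            y[row] = sum;
        }
    }

    int main(void)
    {
        /* 3x3 example: [2 1 0; 0 3 0; 1 0 4] times x = (1,1,1) */
        int    rowptr[] = {0, 2, 3, 5};
        int    colidx[] = {0, 1, 1, 0, 2};
        double val[]    = {2, 1, 3, 1, 4};
        double x[] = {1, 1, 1}, y[3];

        spmv_csr(3, rowptr, colidx, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);  /* expect 3 3 5 */
        return 0;
    }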
13 Simple Performance Analysis
- Memory motion:
- nnz * (sizeof(double) + sizeof(int)) + n * (2*sizeof(double) + sizeof(int)) bytes
- Assume a perfect cache (never load the same data twice)
- Computation:
- nnz multiply-adds (MA)
- Roughly 12 bytes per MA
- Typical workstation node can move 1-4 bytes/MA
- Maximum performance is 8-33% of peak (the arithmetic is spelled out below)
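Spelling out the arithmetic (assuming 8-byte doubles, 4-byte ints, and nnz much larger than n so the per-row terms are negligible): each nonzero requires loading one matrix value (8 bytes) and one column index (4 bytes), i.e., about 12 bytes of memory traffic per multiply-add. A node that can sustain 1 byte of memory traffic per multiply-add of peak floating-point rate is therefore limited to about 1/12 ≈ 8% of peak on this kernel; at 4 bytes/MA the bound is 4/12 ≈ 33%.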
14 More Performance Analysis
- Instruction counts:
- nnz * (2 load-double + load-int + mult-add) + n * (load-int + store-double)
- Roughly 4 instructions per MA
- Maximum performance is 25% of peak (33% if an MA overlaps one load/store)
- (Wide instruction words can help here)
- Changing the matrix data structure (e.g., exploiting small block structure) allows reuse of data in registers, eliminating some loads (of x and j); see the block sketch below
- Implementation improvements (tricks) cannot improve on these limits
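A sketch of the block idea: storing small dense blocks (here fixed 2x2 blocks in a block-CSR layout) lets each loaded x value and column index serve two rows, cutting the loads of x and j that limit plain CSR. The layout and names are illustrative assumptions, not code from the talk.

    /* 2x2 block-CSR (BCSR) matrix-vector product sketch */
    void spmv_bcsr2x2(int nb,              /* number of 2x2 block rows      */
                      const int *browptr,  /* nb+1 entries                  */
                      const int *bcolidx,  /* block column of each block    */
                      const double *bval,  /* 4 values per block, row-major */
                      const double *x, double *y)
    {
        for (int ib = 0; ib < nb; ib++) {
            double sum0 = 0.0, sum1 = 0.0;        /* two rows accumulated at once */
            for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
                const double *b = &bval[4 * k];
                double x0 = x[2 * bcolidx[k]];     /* each x value loaded once,  */
                double x1 = x[2 * bcolidx[k] + 1]; /* reused for both rows       */
                sum0 += b[0] * x0 + b[1] * x1;
                sum1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     = sum0;
            y[2 * ib + 1] = sum1;
        }
    }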
15 Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
(Figure: one vector, matrix size m = 90,708, nonzero entries nz = 5,047,120)
Thanks to Dinesh Kaushik; ORNL and ANL for compute time
16 Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
(Figure: one vector, matrix size m = 90,708, nonzero entries nz = 5,047,120)
17 What About CPU-Bound Operations?
- Dense matrix-matrix product
- Most studied numerical program by compiler writers
- Core of some important applications
- More importantly, the core operation in High Performance Linpack
- The benchmark used to rate the Top 500 fastest systems
- Should give optimal performance (a naive version is sketched below)
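For reference, this is the kind of "natural" code a user might write for C = C + A*B (a generic sketch, not code from the talk); the inner loop walks B with stride n, which is exactly the memory access pattern that defeats the cache.

    /* Naive dense matrix-matrix product: C = C + A*B, all n x n, row-major */
    void dgemm_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i * n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];  /* stride-n access to B */
                C[i * n + j] = sum;
            }
    }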
18 The Compiler Will Handle It (?)
Large gap between natural code and specialized code
Enormous effort required to get good performance
19 DGEMM (n = 500)
20 Performance for Real Applications
- The dense matrix-matrix example shows that even for well-studied, compute-bound kernels, compiler-generated code achieves only a small fraction of available performance
- Fortran code uses natural loops, i.e., what a user would write for most code
- Others use multi-level blocking, careful instruction scheduling, etc. (a blocked sketch follows this list)
- Algorithm design also needs to take into account the capabilities of the system, not just the hardware
- Example: Cache-Oblivious Algorithms (http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html)
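A minimal sketch of one level of cache blocking for C = C + A*B. Tuned libraries add register blocking, multiple cache levels, and careful instruction scheduling; this only shows the basic idea, and the block size BS is an assumed tuning parameter.

    #define BS 64

    /* Blocked dense matrix-matrix product: C = C + A*B, n x n, row-major */
    void dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* multiply the BS x BS sub-blocks (edge blocks handled by the
                       min-style loop bounds) so the working set fits in cache */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }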
21 Challenges in Creating a Performance Model Based on Memory Accesses
- Different levels of the memory hierarchy have significantly different performance
- Cache behavior is sensitive to details of data layout
- Still no good calculus for predicting performance
(Figure: STREAM performance in MB/s versus data size; interleaved data causes data to be displaced while it is still needed for later steps)
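For context, the STREAM figure comes from kernels like the "triad" below, which measure sustainable memory bandwidth. This is a simplified sketch, not the official benchmark code; the array size and timing method are illustrative choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)            /* ~4M doubles (32 MB) per array */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double q = 3.0;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];          /* ~24 bytes moved per iteration */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("triad bandwidth: %.1f MB/s\n", 24.0 * N / sec / 1e6);
        free(a); free(b); free(c);
        return 0;
    }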
22 Parallel Performance Issues
- Coordination of accesses
- Ensuring that two (or more) processes/threads do not access data before it is ready
- Related to main (not cache) memory behavior and timing
- Intrinsically nonscalable operations
- Inner products
- Opportunity for algorithm designers:
- Use nonblocking or split operations
- Arrange algorithms so that other operations can take place while the inner product is being assembled (a sketch of this follows)
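One way to realize the split inner product with modern MPI is a nonblocking reduction. MPI_Iallreduce is an MPI-3 feature that post-dates these slides; the "other useful work" is a placeholder the application would fill in.

    #include <mpi.h>

    /* Overlap a global inner product with other work using a
       nonblocking reduction (sketch only). */
    double split_inner_product(const double *x, const double *y, int n,
                               MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];             /* local part of the dot product */

        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... do other useful work here while the reduction is in flight ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* result needed from here on */
        return global;
    }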
23 Sample Parallel System Architecture
- Systems have an increasingly deep memory hierarchy (1, 2, 3, and more levels of cache)
- Time to reference main memory: 100s of cycles
- Access to shared data requires synchronization
- Better to ensure data is local and unshared when possible
(Figure: SMP nodes connected by an interconnect)
24 Is Parallel Computing Hard?
- Conjecture:
- The difficulty in writing high-performance code is directly related to the ratio of memory access times
- Reason: hiding latency requires blocking, prefetch, and other special techniques
- Parallel computing: now the easy part
- (Easy only relative to performance programming for uniprocessors)
25 Aggregate Parallel Performance of PETSc-FUN3D
(Figure: IBM Power 4, 512 processors (1.3 GHz); Pentium 4 Xeon cluster, 250 processors (2.4 GHz))
26 Algorithms for Grids
- Similar features
- Yet another level of memory hierarchy
- (Unique) features
- Very dynamic resources
27 Is Performance Everything?
- In August 1991, the Sleipner A, an oil and gas platform built in Norway for operation in the North Sea, sank during construction. The total economic loss amounted to about $700 million. After investigation, it was found that the failure of the walls of the support structure resulted from a serious error in the finite element analysis of the linear elastic model. (http://www.ima.umn.edu/arnold/disasters/sleipner.html)
28 Correctness and Accuracy
- Many current algorithms were designed to balance performance and accuracy
- These choices were often made when computers were 10^6 times slower than they are now
- Is it time to re-examine these choices, particularly for applications that are now done on laptops?
29 New (and Not So New) Architectures
- Commodity high-performance processors
- Conventional desktops and laptops
- Game systems
- Cell phones (voice recognition, games)
- DARPA High Productivity Computing Systems project
- Vectors and streams
30 Typical 6-Gflop Commodity System (circa 2000)
Over 100 (32-bit) Petaflops (0.1 Exaflops) already delivered!
31 PIM-based Node
(Figure: PIM-based node with network connection)
- Homogeneous:
- All memory the same
- Heterogeneous:
- Different memory access costs
- Different processors(?)
32 IBM BlueGene/L
Processor-rich system; high-performance, memory-oriented interconnect; processors near memory can reduce the need to move data
33 (Somewhat) Exotic Solutions: HTMT
- Radically new technologies:
- Superconducting (RSFQ) logic (240 GHz)
- Multithreaded/stranded execution model
- Multilevel PIM-enhanced memory
- Optical networks
- Optical (holographic) main store
- http://htmt.caltech.edu/
- Still a (deep) memory hierarchy
- Elements finding their way into new innovative architectures
34 Conventional Supercomputers
(Figure: Earth Simulator, processor nodes PN 636 through PN 639 shown)
35 Algorithms
- Exploit problem behavior at different scales:
- Multigrid
- Domain Decomposition
- Generalizes multigrid (or multigrid generalizes DD)
- Provides a spectrum of robust, optimal methods for a wide range of problems
- Nonlinear versions hold great promise
- Continuation
- Divide and conquer
- Multipole and wavelets
- Cache-sensitive algorithms (a small cache-oblivious sketch follows this list)
- See Karp in SIAM Review 1996
- Even the mathematicians know about this now (McGeoch, AMS Notices, March 2001)
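A small example of the cache-sensitive (here, cache-oblivious) style: a recursive matrix transpose that keeps subdividing the index space until the pieces fit in whatever caches exist, without knowing their sizes. This is a generic illustration, not code from the talk; the base-case threshold is an assumed constant. Call it as transpose(a, b, n, 0, n, 0, n).

    /* Cache-oblivious out-of-place transpose: b = a^T, both n x n, row-major */
    void transpose(const double *a, double *b, int n,
                   int r0, int r1, int c0, int c1)
    {
        if ((r1 - r0) * (c1 - c0) <= 64) {         /* small base case */
            for (int i = r0; i < r1; i++)
                for (int j = c0; j < c1; j++)
                    b[j * n + i] = a[i * n + j];
        } else if (r1 - r0 >= c1 - c0) {           /* split the longer dimension */
            int rm = (r0 + r1) / 2;
            transpose(a, b, n, r0, rm, c0, c1);
            transpose(a, b, n, rm, r1, c0, c1);
        } else {
            int cm = (c0 + c1) / 2;
            transpose(a, b, n, r0, r1, c0, cm);
            transpose(a, b, n, r0, r1, cm, c1);
        }
    }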
36 Conclusions
- Performance models should count data motion, not flops
- Computers will continue to have multiple levels of memory hierarchy
- Algorithms should exploit them
- Computers will be parallel
- Algorithms can make effective use of greater adaptivity to give better time-to-solution and accuracy
- Denial is not a solution