1
Algorithms and Architecture
  • William D. Gropp
    Mathematics and Computer Science
    www.mcs.anl.gov/~gropp

2
Algorithms
  • What is an algorithm?
  • A set of instructions to perform a task
  • How do we evaluate an algorithm?
  • Correctness
  • Accuracy (not an absolute)
  • Efficiency (relative to current and future machines)
  • How do we measure efficiency?
  • Often by counting floating-point operations
  • Compare to peak performance

3
Real and Idealized Computer Architectures
  • Any algorithm assumes an idealized architecture
  • Common choice
  • Floating point work costs time
  • Data movement is free
  • Real systems
  • Floating point is free (fully overlapped with
    other operations)
  • Data movement costs time, a lot of time
  • Classical complexity analysis for numerical
    algorithms is no longer correct (more precisely,
    no longer relevant)
  • Known since at least BLAS2 and BLAS3

4
CPU and Memory Performance
[Chart: DRAM performance over time]
5
More Recent Results (or meet Y2K, or correctness first)
6
Trends in Computer Architecture I
  • Latency to memory will continue to grow relative
    to CPU speed
  • Latency-hiding techniques require finding
    increasing amounts of independent work; Little's
    law implies:
  • Number of concurrent memory references = latency × rate
  • For 1 reference per cycle, this is already
    100-1000 concurrent references (worked example below)
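
A quick worked example of Little's law (a minimal sketch; the 100 ns latency and one reference per 1 GHz cycle are illustrative assumptions, not figures from the slides):

    /* Little's law: outstanding requests = latency x throughput. */
    #include <stdio.h>

    int main(void) {
        double latency = 100e-9;  /* assumed memory latency: 100 ns */
        double rate = 1e9;        /* assumed issue rate: 1 reference per 1 GHz cycle */
        printf("Concurrent references needed: %.0f\n", latency * rate);  /* 100 */
        return 0;
    }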

7
Trends in Computer Architecture II
  • Clock speeds will continue to increase
  • The rate of clock-rate increase has itself
    increased recently
  • Light travels 3 cm (in a vacuum) in one cycle of
    a 10 GHz clock
  • CPU chips won't be causally connected within a
    single clock cycle, i.e., a signal will not cross
    the chip in a single clock cycle
  • Processors will be parallel!

8
Trends in Computer Architecture III
  • Power dissipation problems will force more
    changes
  • Current trends imply chips with energy densities
    greater than a nuclear reactor
  • Already a problem: the current issue of Consumer
    Reports looks at the likelihood of getting a
    serious burn from your laptop!
  • Will force new ways to get performance, such as
    extensive parallelism

9
Consequences
  • Gap between memory and processor performance will
    continue to grow
  • Data motion will dominate the cost of many (most)
    calculations
  • The key is to find a computational cost
    abstraction that is as simple as possible but no
    simpler

10
Architecture Invariants
  • Performance is determined by memory performance
  • Memory system design for performance makes system
    performance less predictable
  • Fast memories are possible, but
  • Expensive ($)
  • Large (meters³)
  • Power hungry (Watts)
  • Algorithms that don't take these realities into
    account may be irrelevant

11
Node Performance
  • Current laptops now have a peak speed (based on
    clock rate) of over 1 Gflops (10 Cray-1s!)
  • Observed (sustained) performance is often a small
    fraction of peak
  • Why is the gap between peak and sustained
    performance so large?
  • Let's look at a simple numerical kernel

12
Sparse Matrix-Vector Product
  • Common operation for optimal (in floating-point
    operations) solution of linear systems
  • Sample code (a runnable C version follows):
      for row = 1, n
         m = i[row] - i[row-1]
         sum = 0
         for k = 1, m
            sum += *a++ * x[*j++]
         y[row] = sum
  • Data structures are a[nnz], j[nnz], i[n], x[n],
    y[n]
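
For reference, a self-contained C version of the same kernel; the 0-based CSR layout with an n+1-entry row-pointer array is an assumption chosen to match the a/j/i data structures above:

    /* Sparse matrix-vector product y = A*x in CSR format.
       a[nnz]  : nonzero values
       j[nnz]  : column index of each nonzero
       i[n+1]  : i[row]..i[row+1]-1 indexes row's nonzeros */
    void spmv_csr(int n, const int *i, const int *j,
                  const double *a, const double *x, double *y)
    {
        for (int row = 0; row < n; row++) {
            double sum = 0.0;
            for (int k = i[row]; k < i[row + 1]; k++)
                sum += a[k] * x[j[k]];  /* one multiply-add per nonzero */
            y[row] = sum;
        }
    }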

13
Simple Performance Analysis
  • Memory motion:
  • nnz × (sizeof(double) + sizeof(int)) +
    n × (2 × sizeof(double) + sizeof(int))
  • Assume a perfect cache (never load the same data
    twice)
  • Computation:
  • nnz multiply-adds (MA)
  • Roughly 12 bytes per MA
  • Typical workstation node can move 1-4 bytes per MA
  • Maximum performance is 8-33% of peak (a worked
    check follows this list)
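
A quick back-of-the-envelope check of these numbers (a sketch; the 1-4 bytes-per-MA machine rates are the slide's own assumptions):

    /* Bandwidth-limited bound for CSR sparse matrix-vector product. */
    #include <stdio.h>

    int main(void) {
        /* Per nonzero: a value (8 B) and a column index (4 B).
           Per row: x and y (8 B each) and a row pointer (4 B). */
        double per_nnz = sizeof(double) + sizeof(int);        /* 12 bytes */
        double per_row = 2.0 * sizeof(double) + sizeof(int);  /* 20 bytes */
        printf("Traffic: %.0f bytes/nonzero + %.0f bytes/row\n", per_nnz, per_row);
        /* With nnz >> n, ~12 bytes move per multiply-add (MA). A machine that
           sustains r bytes of traffic per MA is limited to r/12 of peak. */
        for (double r = 1.0; r <= 4.0; r *= 2.0)
            printf("%.0f bytes/MA -> at most %.0f%% of peak\n",
                   r, 100.0 * r / per_nnz);
        return 0;
    }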

14
More Performance Analysis
  • Instruction counts:
  • nnz × (2 load-double + load-int + mult-add) +
    n × (load-int + store-double)
  • Roughly 4 instructions per MA
  • Maximum performance is 25% of peak (33% if the MA
    overlaps one load/store)
  • (Wide instruction words can help here)
  • Changing the matrix data structure (e.g.,
    exploiting small block structure) allows reuse of
    data in registers, eliminating some loads (of x
    and j); see the sketch after this list
  • Implementation improvements (tricks) cannot
    improve on these limits
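
As an illustration of that data-structure change, a sketch of the same kernel for an assumed block-CSR format with 2×2 blocks (the layout and the names ba/bj/bi are hypothetical); one column index and one pair of x loads now serve four multiply-adds:

    /* y = A*x with A stored as 2x2 blocks in block-CSR form.
       ba[4*bnnz] : blocks, row-major within each block
       bj[bnnz]   : block-column index of each block
       bi[nb+1]   : block-row pointers; nb = n/2 block rows */
    void spmv_bcsr2(int nb, const int *bi, const int *bj,
                    const double *ba, const double *x, double *y)
    {
        for (int br = 0; br < nb; br++) {
            double s0 = 0.0, s1 = 0.0;      /* the two rows of this block row */
            for (int k = bi[br]; k < bi[br + 1]; k++) {
                const double *b = &ba[4 * k];
                double x0 = x[2 * bj[k]];   /* x entries reused by both rows */
                double x1 = x[2 * bj[k] + 1];
                s0 += b[0] * x0 + b[1] * x1;
                s1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * br]     = s0;
            y[2 * br + 1] = s1;
        }
    }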

15
Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
One vector; matrix size m = 90,708; nonzero entries nz = 5,047,120
[Chart]
Thanks to Dinesh Kaushik, and to ORNL and ANL for compute time
16
Realistic Measures of Peak Performance: Sparse Matrix-Vector Product (continued)
One vector; matrix size m = 90,708; nonzero entries nz = 5,047,120
[Chart]
17
What About CPU-Bound Operations?
  • Dense Matrix-Matrix Product
  • Most studied numerical program by compiler
    writers
  • Core of some important applications
  • More importantly, the core operation in High
    Performance Linpack
  • Benchmark used to rate the top 500 fastest
    systems
  • Should give optimal performance; the "natural"
    triple-loop code follows
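
For concreteness, the "natural" code in question is just a triple loop (a minimal C sketch; the studied versions are usually Fortran, but the structure is the same):

    /* Natural triple-loop dense matrix-matrix product: C = C + A*B,
       all matrices n x n, stored row-major. */
    void dgemm_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i * n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }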

18
The Compiler Will Handle It (?)
Large gap between natural code and specialized
code
Enormous effort required to get good performance
19
DGEMM (n = 500)
[Chart]
20
Performance for Real Applications
  • Dense matrix-matrix example shows that even for
    well-studied, compute-bound kernels,
    compiler-generated code achieves only a small
    fraction of available performance
  • Fortran code uses natural loops, i.e., what a
    user would write for most code
  • Others use multi-level blocking, careful
    instruction scheduling, etc.; a blocking sketch
    follows this list
  • Algorithm design also needs to take into account
    the capabilities of the system, not just the
    hardware
  • Example: Cache-Oblivious Algorithms
    (http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html)
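
A taste of what the specialized versions do: a minimal sketch of one level of cache blocking (real implementations add register blocking, careful instruction scheduling, and more; the tile size BS is an assumption):

    /* One level of cache blocking for C = C + A*B (n x n, row-major).
       BS is chosen so three BS x BS tiles fit in cache (an assumption). */
    #define BS 64
    void dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* multiply one tile pair; tiles stay cache-resident */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }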

21
Challenges in Creating a Performance Model Based
on Memory Accesses
  • Different levels of the memory hierarchies have
    significantly different performance
  • Cache behavior sensitive to details of data
    layout
  • Still no good calculus for predicting performance

[Chart: STREAM performance in MB/s versus data size]
Interleaved data causes data to be displaced while still needed for later
steps. (A sketch of the STREAM kernel follows.)
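
The benchmark behind the chart is easy to reproduce in miniature. Below is a sketch of the STREAM "triad" kernel (the array size and the crude clock()-based timing are assumptions; the real benchmark at www.cs.virginia.edu/stream is more careful):

    /* STREAM "triad" kernel: a[i] = b[i] + q*c[i]; 24 bytes of memory
       traffic per iteration (read b, read c, write a).
       Error checking omitted for brevity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t N = 20 * 1000 * 1000;  /* assumed: much larger than cache */
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("Triad: %.0f MB/s (check: %.1f)\n",
               24.0 * N / secs / 1e6, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }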
22
Parallel Performance Issues
  • Coordination of accesses
  • Ensuring that two (or more) processes/threads do
    not access data before it is ready
  • Related to main (not cache) memory behavior and
    timing
  • Intrinsically nonscalable operations
  • Inner products
  • Opportunity for algorithm designers
  • Use nonblocking or split operations (see the MPI
    sketch after this list)
  • Arrange algorithms so that other operations can
    take place while the inner product is being assembled
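
For the "nonblocking or split operations" point, a minimal sketch using MPI_Iallreduce from MPI-3 (a standard that postdates this talk; the overlapped work is whatever the algorithm can schedule there):

    /* Split inner product: start the global sum, do other work while it
       is in flight, then complete it. Build with an MPI compiler (mpicc). */
    #include <mpi.h>

    double split_dot(const double *x, const double *y, int n, MPI_Comm comm)
    {
        double local = 0.0, global;
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];

        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* ... other useful work can be scheduled here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return global;
    }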

23
Sample Parallel System Architecture
  • Systems have an increasingly deep memory
    hierarchy (1, 2, 3, and more levels of cache)
  • Time to reference main memory: 100s of cycles
  • Access to shared data requires synchronization
  • Better to ensure data is local and unshared when
    possible

[Diagram: SMP nodes connected by an interconnect]
24
Is Parallel Computing Hard?
  • Conjecture
  • The difficulty in writing high-performance code
    is directly related to the ratio of memory access
    times
  • Reason hiding latency requires blocking,
    prefetch, and other special techniques
  • Parallel computing is now the "easy" part
  • (Easy only relative to performance programming
    for uniprocessors)

25
Aggregate Parallel Performance of PETSc-FUN3D
IBM Power 4, 512 processors (1.3 GHz); Pentium 4 Xeon cluster, 250 processors (2.4 GHz)
[Chart]
26
Algorithms for Grids
  • Similar features
  • Yet another level of memory hierarchy
  • (Unique) features
  • Very dynamic resources

27
Is Performance Everything?
  • In August 1991, the Sleipner A, an oil and gas
    platform built in Norway for operation in the
    North Sea, sank during construction. The total
    economic loss amounted to about $700 million.
    After investigation, it was found that the
    failure of the walls of the support structure
    resulted from a serious error in the finite
    element analysis of the linear elastic model.
    (http://www.ima.umn.edu/~arnold/disasters/sleipner.html)

28
Correctness and Accuracy
  • Many current algorithms designed to balance
    performance and accuracy
  • These choices were often made when computers were
    10^6 times slower than they are now
  • Is it time to re-examine these choices,
    particularly for applications that are now done
    on laptops?

29
New (and Not So New) Architectures
  • Commodity high-performance processors
  • Conventional desktops and laptops
  • Game systems
  • Cell phones (voice recognition, games)
  • DARPA High Productivity Computing Systems project
  • Vectors and Streams

30
Typical 6-Gflop Commodity System (circa 2000)
Over 100 (32-bit) Petaflops (0.1 Exaflops) already delivered!
31
PIM-based Node
[Diagram: PIM-based node with network connection]
  • Homogenous
  • All memory the same
  • Heterogeneous
  • Different memory access costs
  • Different processors(?)

32
IBM BlueGene/L
Processor-rich system; high-performance, memory-oriented interconnect.
Processors near memory can reduce the need to move data.
33
(Somewhat) Exotic Solutions: HTMT
  • Radically new technologies
  • Superconducting (RSFQ) logic (240 GHz)
  • Multithreaded/stranded execution model
  • Multilevel PIM-enhanced memory
  • Optical networks
  • Optical (holographic) main store
  • http://htmt.caltech.edu/
  • Still a (deep) memory hierarchy
  • Elements finding their way into new innovative
    architectures

34
Conventional Supercomputers
Earth Simulator
[Diagram: processor nodes PN 636 through PN 639 of the 640-node system]
35
Algorithms
  • Exploit problem behavior at different scales
  • Multigrid
  • Domain Decomposition
  • Generalizes multigrid (or multigrid generalizes
    DD)
  • Provides a spectrum of robust, optimal methods
    for a wide range of problems
  • Nonlinear versions hold great promise
  • Continuation
  • Divide and conquer
  • Multipole and Wavelets
  • Cache-sensitive algorithms
  • See Karp in SIAM Review 1996
  • Even the Mathematicians know about this now
    (McGeoch, AMS Notices March 2001)

36
Conclusions
  • Performance models should count data motion, not
    flops
  • Computers will continue to have multiple levels
    of memory hierarchy
  • Algorithms should exploit them
  • Computers will be parallel
  • Algorithms can make effective use of greater
    adaptivity to give better time-to-solution and
    accuracy
  • Denial is not a solution