1
Performance Evaluation of Parallel Processing
Xian-He Sun, Illinois Institute of Technology
sun@iit.edu
2
Outline
  • Performance metrics
  • Speedup
  • Efficiency
  • Scalability
  • Examples
  • Reading: Kumar, Chapter 5

3
Performance Evaluation
(Improving performance is the goal)
  • Performance Measurement
  • Metric, Parameter
  • Performance Prediction
  • Model, Application-Resource
  • Performance Diagnosis/Optimization
  • Post-execution, Algorithm improvement, Architecture improvement, State-of-the-art, Resource management/Scheduling

4
Parallel Performance Metrics (Run-time is the dominant metric)
  • Run-Time (Execution Time)
  • Speed: MFLOPS, MIPS, CPI
  • Efficiency: throughput
  • Speedup
  • Parallel Efficiency
  • Scalability: the ability to maintain performance gain when system and problem size increase
  • Others: portability, programming ability, etc.

5
Models of Speedup
Performance Evaluation of Parallel Processing
  • Speedup
  • Scaled Speedup
  • Parallel processing gain over sequential
    processing, where problem size scales up with
    computing power (having sufficient
    workload/parallelism)

6
Speedup
  • Ts: time for the best serial algorithm
  • Tp: time for the parallel algorithm using p processors
  • Speedup: S = Ts / Tp

7
Example
[Figure: (a) sequential execution takes 100 time units on one processor; (b) four processors take 35 time units each (speedup 100/35 ≈ 2.9); (c) four processors take 25 time units each (speedup 100/25 = 4).]
8
Example (cont.)
[Figure (cont.): two further four-processor cases — a balanced load of 50 time units per processor, and an unbalanced load of 30, 20, 40, and 10 time units, where the finishing time is set by the most loaded processor.]
9
What Is Good Speedup?
  • Linear speedup: S(p) = p
  • Superlinear speedup: S(p) > p
  • Sub-linear speedup: S(p) < p

10
Speedup
[Figure: speedup curves plotted against the number of processors p.]
11
Sources of Parallel Overheads
  • Interprocessor communication
  • Load imbalance
  • Synchronization
  • Extra computation

12
Degradations of Parallel Processing
  • Unbalanced workload
  • Communication delay
  • Overhead increases with the ensemble size
13
Degradations of Distributed Computing
  • Unbalanced computing power and workload
  • Shared computing and communication resources
  • Uncertainty, heterogeneity, and overhead increase with the ensemble size
14
Causes of Superlinear Speedup
  • Cache size increased
  • Overhead reduced
  • Latency hidden
  • Randomized algorithms
  • Mathematical inefficiency of the serial algorithm
  • Higher memory access cost in sequential processing
  • X.-H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Transactions on Parallel and Distributed Systems, Nov. 1995.

15
  • Fixed-Size Speedup (Amdahl's law)
  • Emphasis on turnaround time
  • Problem size, W, is fixed

16
Amdahl's Law
  • The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
  • Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
  • Loop initialization
  • Reading/writing to a single disk
  • Procedure call overhead
  • The parallel run time is then given by Tp = α·Ts + (1 − α)·Ts / p

17
Amdahl's Law
  • Amdahl's law gives a limit on the speedup in terms of α:
    S(p) = Ts / Tp = 1 / (α + (1 − α)/p) ≤ 1/α

18
Enhanced Amdahl's Law
  • To include overhead in the model
  • The overhead includes parallelism and interaction overheads

Amdahl's law is often used as an argument against massively parallel systems
19
Fixed-Size Speedup (Amdahl's Law, '67)
[Figure: fixed-size model — for p = 1…5 the amount of work (serial part W1 plus parallel part Wp) stays constant, while the elapsed time (T1 plus Tp) shrinks as the number of processors grows.]
20
Amdahl's Law
  • The speedup that is achievable on p processors is S(p) = 1 / (α + (1 − α)/p)
  • If we assume that the serial fraction is fixed, then the speedup for infinitely many processors is limited by 1/α
  • For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
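A brief numerical sketch of this bound (illustrative C, not from the lecture; the processor counts are arbitrary):

  #include <stdio.h>

  /* Amdahl's law: speedup on p processors for serial fraction alpha */
  double amdahl_speedup(double alpha, int p) {
      return 1.0 / (alpha + (1.0 - alpha) / p);
  }

  int main(void) {
      double alpha = 0.10;                      /* the 10% serial fraction from the slide */
      int procs[] = { 10, 100, 1000, 100000 };  /* arbitrary processor counts */
      for (int k = 0; k < 4; k++)
          printf("p = %6d  ->  speedup = %.2f\n", procs[k], amdahl_speedup(alpha, procs[k]));
      /* the speedup approaches, but never exceeds, 1/alpha = 10 */
      return 0;
  }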

21
Comments on Amdahl's Law
  • The Amdahl fraction α in practice depends on the problem size n and the number of processors p
  • An effective parallel algorithm has α(n, p) → 0 as n grows, for fixed p
  • For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size
  • Scalable speedup
  • Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer

22
  • Fixed-Time Speedup (Gustafson, '88)
  • Emphasis on work finished in a fixed time
  • Problem size is scaled from W to W'
  • W': work finished within the fixed time with parallel processing

23
Gustafson's Law (Without Overhead)
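The equation on this slide is not preserved in the transcript; in its standard form, Gustafson's fixed-time (scaled) speedup with serial fraction α is

  Speedup_FT = W'/W = (α·W + (1 − α)·p·W) / W = α + (1 − α)·p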
24
Fixed-Time Speedup (Gustafson)
[Figure: fixed-time model — for p = 1…5 the elapsed time (T1 plus Tp) stays constant, while the amount of work (W1 plus Wp) grows with the number of processors.]
25
Converting α between Amdahl's and Gustafson's Laws
Based on this observation, Amdahl's and Gustafson's laws are identical.
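The conversion itself was lost in this transcript; a minimal reconstruction under the usual definitions (α_A = serial fraction of the sequential run, α_G = serial fraction of the fixed-time parallel run) is

  α_A = α_G / (α_G + (1 − α_G)·p)

Substituting this α_A into Amdahl's formula 1/(α_A + (1 − α_A)/p) yields exactly α_G + (1 − α_G)·p, Gustafson's scaled speedup, which is the sense in which the two laws agree.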
26
Gustafson's Law (With Overhead)
[Equation and figure garbled in extraction: the fixed-time scaled work is Work(p) = α·W + (1 − α)·p·W and Speedup_FT = Work(p)/Work(1); the accompanying figure compares the unit-time bar on one processor (fractions α and 1 − α) with the time bar on p processors, which adds a parallelism-overhead segment.]
27
Memory-Constrained Scaling: Sun and Ni's Law
  • Scale to the largest possible solution limited by the memory space; or, fix the memory usage per processor
  • (ex) the N-body problem
  • Problem size is scaled from W to W*
  • W* is the work executed under the memory limitation of a parallel computer
  • For a simple profile, W* = G(p)·W, where G(p) is the increase in parallel workload as the memory capacity increases p times

28
Sun and Ni's Law
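This slide originally carried only the formula; a reconstruction of Sun and Ni's memory-bounded speedup, using the notation above, is

  Speedup_MB = (α·W + (1 − α)·G(p)·W) / (α·W + (1 − α)·G(p)·W / p)

With G(p) = 1 this reduces to Amdahl's fixed-size speedup, and with G(p) = p to Gustafson's fixed-time speedup (compare slide 31).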
29
  • Memory-Bounded Speedup (Sun & Ni, '90)
  • Emphasis on work finished under the current physical limitation
  • Problem size is scaled from W to W*
  • W*: work executed under the memory limitation with parallel processing
  • X.-H. Sun and L. Ni, "Scalable Problems and Memory-Bounded Speedup," Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37, Sept. 1993 (SC'90).

30
Memory-Bounded Speedup (Sun & Ni)
  • Work executed under memory limitation
  • Hierarchical memory

[Figure: memory-bounded model — amount of work (W1 plus Wp) and elapsed time (T1 plus Tp) for p = 1…5; the scaled work grows with the number of processors.]
31
Characteristics
  • Connection to other scaling models:
  • G(p) = 1: problem-constrained (fixed-size) scaling
  • G(p) = p: time-constrained (fixed-time) scaling
  • With overhead
  • G(p) > p can lead to a large increase in execution time
  • (ex) a 10K × 10K matrix factorization needs 800 MB and takes 1 hour on a uniprocessor; with 1024 processors, a 320K × 320K matrix takes 32 hours

32
Why Scalable Computing
  • Scalable:
  • More accurate solution
  • Sufficient parallelism
  • Maintain efficiency
  • Efficient in parallel computing:
  • Load balance
  • Communication
  • Mathematically effective:
  • Adaptive
  • Accuracy

33
  • Memory-Bounded Speedup
  • Natural for domain-decomposition-based computing
  • Shows the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
  • Impacts extend to the architectural design trade-off between memory size and computing speed

34
Why Scalable Computing (2)
Small Work
  • Appropriate for a small machine
  • Parallelism overheads begin to dominate benefits for larger machines:
  • Load imbalance
  • Communication-to-computation ratio
  • May even achieve slowdowns
  • Does not reflect real usage, and is inappropriate for a large machine
  • Can exaggerate the benefits of improvements

35
Why Scalable Computing (3)
Large Work
  • Appropriate for a big machine
  • Difficult to measure improvement
  • May not fit on a small machine:
  • Can't run
  • Thrashing to disk
  • Working set doesn't fit in cache
  • Fits at some p, leading to superlinear speedup

36
Demonstrating Scaling Problems
[Figures: speedup of a big equation-solver problem and of a small Ocean problem on the SGI Origin 2000 — the large problem shows superlinear speedup, while the small one is dominated by parallelism overhead.]
Users want to scale problems as machines grow!
37
How to Scale
  • Scaling a machine
  • Make a machine more powerful
  • Machine size
  • <processor, memory, communication, I/O>
  • Scaling a machine in parallel processing
  • Add more identical nodes
  • Problem size
  • Input configuration
  • Data set size: the amount of storage required to run it on a single processor
  • Memory usage: the amount of memory used by the program

38
Two Key Issues in Problem Scaling
  • Under what constraints should the problem be
    scaled?
  • Some properties must be fixed as the machine
    scales
  • How should the problem be scaled?
  • Which parameters?
  • How?

39
Constraints To Scale
  • Two types of constraints
  • Problem-oriented
  • Ex) Time
  • Resource-oriented
  • Ex) Memory
  • Work to scale
  • Metric-oriented
  • Floating point operations, instructions
  • User-oriented
  • Easy to change, but may be difficult to compare
  • Ex) particles, rows, transactions
  • Difficult cross-comparison

40
Rethinking Speedup
  • Speedup
  • Why is it called speedup when it compares times?
  • Could we compare speeds directly?
  • Generalized speedup
  • X.-H. Sun and J. Gustafson, "Toward a Better Parallel Performance Metric," Parallel Computing, Vol. 17, pp. 1093-1109, Dec. 1991.

41
(No Transcript)
42
Compute π: Problem
  • Consider a parallel algorithm for computing the value of π ≈ 3.1415 through the following numerical integration:
    π = ∫₀¹ 4/(1 + x²) dx

43
Compute π: Sequential Algorithm

  computepi() {
    h = 1.0 / n;
    sum = 0.0;
    for (i = 0; i < n; i++) {
      x = h * (i + 0.5);
      sum = sum + 4.0 / (1 + x * x);
    }
    pi = h * sum;
  }

44
Compute π: Parallel Algorithm
  • Each processor computes on a set of about n/p points, which are allocated to each processor in a cyclic manner
  • Finally, we assume that the local values of π are accumulated among the p processors under synchronization

[Figure: the n points are assigned to processors 0, 1, 2, 3, 0, 1, 2, 3, … in a cyclic (interleaved) manner.]
45
Compute π: Parallel Algorithm

  computepi() {
    id = my_proc_id();
    nprocs = number_of_procs();
    h = 1.0 / n;
    sum = 0.0;
    for (i = id; i < n; i = i + nprocs) {
      x = h * (i + 0.5);
      sum = sum + 4.0 / (1 + x * x);
    }
    localpi = sum * h;
    use_tree_based_combining_for_critical_section();
    pi = pi + localpi;
    end_critical_section();
  }
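For concreteness, a minimal runnable sketch of the same computation is shown below, assuming MPI; MPI_Reduce takes the place of the tree-based combining, and the point count n is hard-coded purely for illustration.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int id, nprocs, i, n = 1000000;           /* n chosen only for illustration */
      double h, x, sum = 0.0, localpi, pi = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      h = 1.0 / n;
      for (i = id; i < n; i += nprocs) {        /* cyclic (interleaved) point assignment */
          x = h * (i + 0.5);
          sum += 4.0 / (1.0 + x * x);
      }
      localpi = sum * h;

      /* the combining is done in O(log p) steps internally by MPI_Reduce */
      MPI_Reduce(&localpi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (id == 0) printf("pi = %.10f\n", pi);
      MPI_Finalize();
      return 0;
  }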

46
Compute π: Analysis
  • Assume that the computation of π is performed over n points
  • The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points, the number of operations executed in the sequential algorithm is 6n

  for (i = 0; i < n; i++) {
    x = h * (i + 0.5);              /* 1 multiplication, 1 addition */
    sum = sum + 4.0 / (1 + x * x);  /* 1 multiplication, 1 division, 2 additions */
  }
47
Compute π: Analysis
  • The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points, which are allocated to it in a cyclic manner
  • The expression for m is m = ⌈n/p⌉ when p does not exactly divide n. The parallel computation of the local values of π therefore takes about 6·⌈n/p⌉ operations per processor

48
Compute π: Analysis
  • The accumulation of the local values of π using tree-based combining can be optimally performed in log₂(p) steps
  • The total runtime for the parallel algorithm, including the parallel computation and the combining, and the resulting speedup, are given in the reconstruction below

49
Compute π: Analysis
  • The Amdahl fraction for this parallel algorithm can be determined by rewriting the speedup expression in the standard form 1/(α + (1 − α)/p)
  • The resulting Amdahl fraction α(n, p) is reconstructed below
  • The parallel algorithm is effective because α(n, p) → 0 as n → ∞ for fixed p
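The original equations on the three analysis slides above were lost in this transcript; under the slides' own operation-count model (6 operations per point, and one unit of time per combining step) a plausible reconstruction is

  Ts = 6n
  Tp = 6·⌈n/p⌉ + log₂ p
  Speedup(n, p) = 6n / (6·⌈n/p⌉ + log₂ p) ≈ p / (1 + p·log₂ p / (6n))
  α(n, p) ≈ p·log₂ p / (6n·(p − 1))

so α(n, p) indeed vanishes as n grows for any fixed p.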

50
Finite Differences: Problem
  • Consider a finite difference iterative method applied to an n × n 2D grid, where each point is updated from its four neighbors:
    x[i,j] = w_1·(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2·x[i,j]

51
Finite Differences: Serial Algorithm

  finitediff() {
    for (t = 0; t < T; t++)
      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
  }

52
Finite Differences: Parallel Algorithm
  • Each processor computes on a sub-grid of (n/√p) × (n/√p) points
  • Synchronization between processors after every iteration ensures that correct values are used in subsequent iterations

53
Finite Differences: Parallel Algorithm

  finitediff() {
    row_id = my_processor_row_id();
    col_id = my_processor_col_id();
    p = number_of_processors();
    sp = sqrt(p);
    rows = cols = ceil(n / sp);
    row_start = row_id * rows;
    col_start = col_id * cols;
    for (t = 0; t < T; t++) {
      for (i = row_start; i < min(row_start + rows, n); i++)
        for (j = col_start; j < min(col_start + cols, n); j++)
          x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
      barrier();
    }
  }
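A minimal shared-memory sketch of the same update is given below, assuming OpenMP rather than the explicit block decomposition above; like the slide's version it updates the grid in place and synchronizes once per time step (here via the implicit barrier of the parallel loop). Boundary handling, which the slides gloss over, is handled by skipping the outermost rows and columns.

  #include <omp.h>

  /* In-place stencil sweep over an n x n grid for T time steps.
     x, n, T, w_1, w_2 correspond to the names used on the slide. */
  void finitediff_omp(double **x, int n, int T, double w_1, double w_2) {
      for (int t = 0; t < T; t++) {
          #pragma omp parallel for
          for (int i = 1; i < n - 1; i++)        /* rows are divided among threads */
              for (int j = 1; j < n - 1; j++)
                  x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] +
                                   x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
          /* the implicit barrier at the end of the parallel for
             plays the role of barrier() in the slide's pseudocode */
      }
  }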

54
Finite Differences: Analysis
  • The sequential algorithm performs 6 operations (2 multiplications, 4 additions) per grid point every iteration. Hence, for an n × n grid and T iterations, the number of operations executed in the sequential algorithm is 6·n²·T

  x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
  /* 2 multiplications, 4 additions per grid point */
55
Finite Differences: Analysis
  • The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m × m sub-grid allocated to it in a blockwise manner
  • The expression for m is m = ⌈n/√p⌉. The parallel computation then takes about 6·m² operations per processor per iteration, or 6·⌈n/√p⌉²·T over T iterations

56
Finite Differences: Analysis
  • The barrier synchronization needed for each iteration can be optimally performed in log₂(p) steps
  • The total runtime for the parallel algorithm and the resulting speedup are given in the reconstruction below

57
Finite Differences: Analysis
  • The Amdahl fraction for this parallel algorithm can be determined by rewriting the speedup expression in the standard form 1/(α + (1 − α)/p)
  • The resulting Amdahl fraction α(n, p) is reconstructed below
  • We finally note that α(n, p) → 0 as n → ∞ for fixed p
  • Hence, the parallel algorithm is effective
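As with the π example, the equations on these analysis slides were lost; with the same operation-count model (6 operations per grid point per iteration, one unit of time per barrier step) a plausible reconstruction is

  Ts = 6·n²·T
  Tp = (6·⌈n/√p⌉² + log₂ p)·T
  Speedup(n, p) ≈ 6n² / (6·n²/p + log₂ p) ≈ p / (1 + p·log₂ p / (6n²))
  α(n, p) ≈ p·log₂ p / (6n²·(p − 1))

which again vanishes as n grows, so the algorithm is effective.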

58
Equation Solver
  procedure solve(A)                     /* iterate on an n x n grid until convergence */
    while (!done) do
      diff = 0
      for i = 1 to n do
        for j = 1 to n do
          temp = A[i,j]
          A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
          diff = diff + abs(A[i,j] - temp)
        end for
      end for
      if (diff / (n*n) < TOL) then done = 1
    end while
  end procedure
59
Workloads
  • Basic properties
  • Memory requirement: O(n²)
  • Computational complexity: O(n³), assuming the number of iterations to converge to be O(n)
  • Assume the speedup equals the number of processors p
  • Grid size
  • Fixed-size: the grid stays n × n
  • Fixed-time: the grid grows to about n·p^(1/3) per side, so that the scaled work finishes in the same time
  • Memory-bound: the grid grows to n·√p per side, so that memory usage scales with the machine
60
Memory Requirement of Equation Solver
  • Fixed-size: O(n²)
  • Fixed-time: O((n·p^(1/3))²) = O(n²·p^(2/3))
  • Memory-bound: O((n·√p)²) = O(n²·p), i.e. O(n²) per processor
61
Time Complexity of Equation Solver
(sequential time complexity: O(n³))
  • Fixed-size: O(n³/p)
  • Fixed-time: O((n·p^(1/3))³/p) = O(n³), constant by construction
  • Memory-bound: O((n·√p)³/p) = O(n³·√p)
62
Concurrency
Concurrency is proportional to the number of grid points
  • Fixed-size: O(n²)
  • Fixed-time: O((n·p^(1/3))²) = O(n²·p^(2/3))
  • Memory-bound: O((n·√p)²) = O(n²·p)
63
Communication-to-Computation Ratio
(per processor, for a √p × √p block decomposition of the grid)
  • Fixed-size: O(√p / n)
  • Fixed-time: O(p^(1/6) / n)
  • Memory-bound: O(1/n), independent of p
64
  • Scalability
  • The Need for New Metrics
  • Comparison of performance with different workloads
  • Availability of massively parallel processing
  • Scalability
  • Ability to maintain parallel processing gain when
    both problem size and system size increase

65
Parallel Efficiency
  • The achieved fraction of the total potential parallel processing gain: E = Speedup / p
  • Assuming linear speedup, speedup = p is the ideal case
  • The ability to maintain efficiency when the problem size increases

66
Maintain Efficiency
  • Efficiency of adding n numbers in parallel: E = 1 / (1 + 2p·log p / n)
  • For an efficiency of 0.80 on 4 processors, n = 64
  • For an efficiency of 0.80 on 8 processors, n = 192
  • For an efficiency of 0.80 on 16 processors, n = 512
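A small check of these numbers (illustrative C; the formula and the (p, n) pairs are taken from the slide, with log interpreted as log₂):

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      int cases[][2] = { {4, 64}, {8, 192}, {16, 512} };   /* {p, n} pairs from the slide */
      for (int k = 0; k < 3; k++) {
          int p = cases[k][0], n = cases[k][1];
          double E = 1.0 / (1.0 + 2.0 * p * log2((double)p) / n);
          printf("p = %2d, n = %3d  ->  E = %.2f\n", p, n, E);   /* prints 0.80 in each case */
      }
      return 0;
  }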
67
  • Ideally Scalable
  • T(m × p, m × W) = T(p, W)
  • T: execution time
  • W: work executed
  • p: number of processors used
  • m: scale up m times
  • Work: flop count based on the best practical serial algorithm
  • Fact
  • T(m × p, m × W) = T(p, W)
  • if and only if
  • the average unit speed is fixed

68
  • Definition
  • The average unit speed is the achieved speed
    divided by the number of processors
  • Definition (Isospeed Scalability)
  • An algorithm-machine combination is scalable if
    the achieved average unit speed can remain
    constant with increasing numbers of processors,
    provided the problem size is increased
    proportionally

69
  • Isospeed Scalability (Sun & Rover, '91)
  • W: work executed when p processors are employed
  • W': work executed when p' > p processors are employed to maintain the average speed
  • Ideal case: W' = (p'/p)·W, i.e. the work grows proportionally to the number of processors
  • Scalability in terms of time (see the reconstruction below)
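The metric itself did not survive extraction; in Sun and Rover's formulation the isospeed scalability of scaling from p to p' processors can be written as

  ψ(p, p') = (p'·W) / (p·W')

which equals 1 in the ideal case (W' = p'·W/p) and is smaller otherwise. Because the average speed W/(p·T) is held fixed, this also equals the time ratio T/T', where T and T' are the execution times of W on p processors and of W' on p' processors.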

70
  • Isospeed Scalability (Sun & Rover)
  • W: work executed when p processors are employed
  • W': work executed when p' > p processors are employed to maintain the average speed
  • Ideal case: W' = (p'/p)·W
  • X.-H. Sun and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations," IEEE Transactions on Parallel and Distributed Systems, May 1994 (Ames TR91).

71
The Relation of Scalability and Time
  • Being more scalable leads to smaller execution time
  • Better initial run-time and higher scalability lead to superior run-time
  • The same initial run-time and the same scalability lead to the same scaled performance
  • Superior initial performance may not last long if scalability is low
  • Range Comparison
  • X.-H. Sun, "Scalability Versus Execution Time in Scalable Systems," Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb. 2002.

72
Range Comparison Via Performance Crossing Point
Assume program 1 is α times slower than program 2 at the initial state.

  Begin (Range Comparison)
    p' = p
    Repeat
      p' = p' + 1
      Compute the scalability of program 1, Φ(p, p')
      Compute the scalability of program 2, Ψ(p, p')
    Until (Φ(p, p') > α·Ψ(p, p') or p' = the limit of ensemble size)
    If Φ(p, p') > α·Ψ(p, p') Then
      p' is the smallest scaled crossing point;
      program 2 is superior at any ensemble size q, p ≤ q < p'
    Else
      program 2 is superior at any ensemble size q, p ≤ q ≤ p'
    End If
  End (Range Comparison)
73
  • Range Comparison

[Figures: range comparison results showing the influence of communication speed and the influence of computing speed.]
  • X.-H. Sun, M. Pantano, and T. Fahringer, "Integrated Range Comparison for Data-Parallel Compilation Systems," IEEE Transactions on Parallel and Distributed Systems, May 1999.

74
The SCALA (SCALability Analyzer) System
  • Design Goals
  • Predict performance
  • Support program optimization
  • Estimate the influence of hardware variations
  • Uniqueness
  • Designed to be integrated into advanced compiler
    systems
  • Based on scalability analysis

75
  • Vienna Fortran Compilation System
  • A data-parallel restructuring compilation system
  • Consists of a parallelizing compiler for VF/HPF
    and tools for program analysis and restructuring
  • Under a major upgrade for HPF2
  • Performance prediction is crucial for appropriate
    program restructuring

76
The Structure of SCALA
77
Prototype Implementation
  • Automatic range comparison for different data distributions
  • The P3T static performance estimator
  • Test cases: Jacobi and Red-Black

[Figures: for one pair of data distributions there is no crossing point; for another there is a crossing point.]
78
Summary
  • Relation between isospeed scalability and isoefficiency scalability
  • Both measure the ability to maintain parallel efficiency, defined as E = speedup / p
  • Isoefficiency's speedup is the traditional speedup, defined as sequential execution time over parallel execution time
  • Isospeed's speedup is the generalized speedup, defined as parallel speed over sequential speed
  • If the sequential execution speed is independent of problem size, isospeed and isoefficiency are equivalent
  • Due to the memory hierarchy, sequential execution performance in fact varies largely with problem size

79
Summary
  • Predicting sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
  • The Memory-LogP model is introduced for data-access cost
  • New challenges arise in distributed computing
  • Generalized isospeed scalability
  • Generalized performance tool: GHS
  • K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.
  • X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.