Title: Lecture 5
1Performance Evaluation of Parallel Processing
Xian-He Sun, Illinois Institute of Technology
sun@iit.edu
2Outline
- Performance metrics
- Speedup
- Efficiency
- Scalability
- Examples
- Reading: Kumar, Ch. 5
3Performance Evaluation
(Improving performance is the goal)
- Performance Measurement
- Metric, Parameter
- Performance Prediction
- Model, Application-Resource
- Performance Diagnosis/Optimization
- Post-execution analysis, Algorithm improvement, Architecture improvement, State-of-the-art comparison, Resource management/Scheduling
4Parallel Performance Metrics (Run-time is the dominant metric)
- Run-Time (Execution Time)
- Speed: mflops, mips, cpi
- Efficiency: throughput
- Speedup
- Parallel Efficiency
- Scalability: the ability to maintain performance gain when system and problem size increase
- Others: portability, programmability, etc.
5Models of Speedup
Performance Evaluation of Parallel Processing
- Speedup
- Scaled Speedup
- Parallel processing gain over sequential
processing, where problem size scales up with
computing power (having sufficient
workload/parallelism)
6Speedup
- Ts: time for the best serial algorithm
- Tp: time for the parallel algorithm using p processors
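The speedup on p processors is the ratio of these two times; a minimal statement of the definition, using the slide's Ts and Tp:

$$
% speedup of a p-processor run relative to the best serial algorithm
S_p = \frac{T_s}{T_p}
$$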
7Example
[Figure: three bar charts (a)-(c) over processors 1-4: a sequential run of 100 time units on Processor 1, a parallel run of 35 time units on each of 4 processors, and a parallel run of 25 time units on each of 4 processors.]
8Example (cont.)
[Figure: two more bar charts (d) and (e) over processors 1-4: a parallel run of 50 time units on each processor, and an unbalanced run of 30, 20, 40, and 10 time units.]
9What Is Good Speedup?
- Linear speedup (Sp = p)
- Superlinear speedup (Sp > p)
- Sub-linear speedup (Sp < p)
10Speedup
[Figure: speedup plotted against the number of processors p.]
11Sources of Parallel Overheads
- Interprocessor communication
- Load imbalance
- Synchronization
- Extra computation
12Degradations of Parallel Processing
Unbalanced Workload
Communication Delay
Overhead Increases with the Ensemble Size
13Degradations of Distributed Computing
Unbalanced Computing Power and Workload
Shared Computing and Communication Resource
Uncertainty, Heterogeneity, and Overhead
Increases with the Ensemble Size
14Causes of Superlinear Speedup
- Cache size increased
- Overhead reduced
- Latency hidden
- Randomized algorithms
- Mathematical inefficiency of the serial algorithm
- Higher memory access cost in sequential processing
- X.H. Sun, and J. Zhu, "Performance
Considerations of Shared Virtual Memory
Machines," - IEEE Trans. on Parallel and Distributed Systems,
Nov. 1995
15Fixed-Size Speedup (Amdahl's Law)
- Emphasis on turnaround time
- Problem size, W, is fixed
16Amdahl's Law
- The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
- Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
- Loop initialization
- Reading/writing to a single disk
- Procedure call overhead
- Parallel run time is given by
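The expression itself did not survive extraction; a minimal sketch of the standard fixed-size (Amdahl) run time, writing the serial time as T1:

$$
% serial fraction \alpha runs sequentially; the rest is split across p processors
T_p = \alpha\,T_1 + \frac{(1-\alpha)\,T_1}{p}
$$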
17Amdahl's Law
- Amdahl's law gives a limit on speedup in terms of α
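A minimal statement of that limit, following from the run time above:

$$
S_p = \frac{T_1}{T_p} = \frac{1}{\alpha + \frac{1-\alpha}{p}} \;\le\; \frac{1}{\alpha} \quad (p \to \infty)
$$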
18Enhanced Amdahl's Law
- To include overhead
- The overhead includes parallelism and interaction overheads
Amdahl's law: the argument against massively parallel systems
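A sketch of the enhanced form, with the combined overhead written as a lumped term T_overhead (the notation is assumed here, not given on the slide):

$$
% fixed-size speedup with an overhead term added to the parallel time
S_p = \frac{T_1}{\alpha T_1 + \frac{(1-\alpha)T_1}{p} + T_{\text{overhead}}} \;<\; \frac{1}{\alpha}
$$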
19 Fixed-Size Speedup (Amdahl's Law, 67)
[Figure: two bar charts versus the number of processors p = 1..5. Left: Amount of Work, split into the sequential component W1 and the parallel component Wp; the total work stays fixed as p grows. Right: Elapsed Time, split into T1 and Tp; the parallel portion shrinks as p grows.]
20Amdahl's Law
- The speedup that is achievable on p processors is
- If we assume that the serial fraction is fixed, then the speedup for infinite processors is limited by 1/α
- For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
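A quick worked check of that example (α = 0.1), also showing how close a large but finite machine gets to the limit; the p = 1024 value is an illustrative calculation, not from the slide:

$$
S_\infty = \frac{1}{0.1} = 10, \qquad
S_{1024} = \frac{1}{0.1 + \frac{0.9}{1024}} \approx 9.91
$$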
21Comments on Amdahl's Law
- The Amdahl fraction α in practice depends on the problem size n and the number of processors p
- An effective parallel algorithm has α(n, p) → 0 as n → ∞ for fixed p
- For such a case, even if one fixes p, we can get linear speedups by choosing a suitably large problem size
- Scalable speedup
- Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer
22Fixed-Time Speedup (Gustafson, 88)
- Emphasis on work finished in a fixed time
- Problem size is scaled from W to W'
- W': work finished within the fixed time with parallel processing
23Gustafson's Law (Without Overhead)
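The slide's formula was a figure in the original; a minimal sketch of the standard fixed-time (Gustafson) speedup, with α the serial fraction measured on the parallel system:

$$
% fixed-time speedup: scaled work divided by original work
S_{FT} = \frac{\alpha + (1-\alpha)\,p}{\alpha + (1-\alpha)} = \alpha + (1-\alpha)\,p
$$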
24 Fixed-Time Speedup (Gustafson)
[Figure: two bar charts versus the number of processors p = 1..5. Left: Amount of Work, split into W1 and Wp; the parallel component Wp grows with p. Right: Elapsed Time, split into T1 and Tp; the total time stays fixed as p grows.]
25Converting α between Amdahl's and Gustafson's laws
Based on this observation, Amdahl's and Gustafson's laws are identical.
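A sketch of the conversion the slide refers to: writing α_G for the serial fraction measured under fixed-time scaling (Gustafson) and α_A for the fraction of the equivalent fixed-size run (Amdahl) — the subscripts are assumed notation —

$$
\alpha_A = \frac{\alpha_G}{\alpha_G + (1-\alpha_G)\,p}
\quad\Longrightarrow\quad
\frac{1}{\alpha_A + \frac{1-\alpha_A}{p}} = \alpha_G + (1-\alpha_G)\,p
$$

so the two laws predict the same speedup once the serial fraction is measured consistently.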
26Gustafson's Law (With Overhead)
$$
\text{Speedup}_{FT} = \frac{\text{Work}(p)}{\text{Work}(1)} = \frac{\alpha + (1-\alpha)\,p}{1 + T_0/T}
$$
[Figure: time bars for the sequential run (serial fraction α, parallel fraction 1-α, total time 1) and for the p-processor run (α, 1-α spread over p processors, plus an overhead segment).]
27Memory Constrained Scaling: Sun and Ni's Law
- Scale to the largest possible solution limited by the memory space. Or, fix memory usage per processor
- (ex) N-body problem
- Problem size is scaled from W to W*
- W* is the work executed under the memory limitation of a parallel computer
- For a simple profile, G(p) is the increase of parallel workload as the memory capacity increases p times
28Sun and Ni's Law
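The formula itself is not in the transcript; a sketch of the standard memory-bounded (Sun-Ni) speedup without overhead, using G(p) as defined on the previous slide:

$$
% memory-bounded speedup: scaled work over scaled parallel time
S_{MB} = \frac{\alpha + (1-\alpha)\,G(p)}{\alpha + \frac{(1-\alpha)\,G(p)}{p}}
$$

Note that it reduces to the fixed-size (Amdahl) form when G(p) = 1 and to the fixed-time (Gustafson) form when G(p) = p, matching the connections listed on the Characteristics slide.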
29Memory-Bounded Speedup (Sun & Ni, 90)
- Emphasis on work finished under current physical limitations
- Problem size is scaled from W to W*
- W*: work executed under memory limitation with parallel processing
- X.H. Sun and L. Ni, "Scalable Problems and Memory-Bounded Speedup," Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37, Sept. 1993 (SC90).
30 Memory-Bounded Speedup (Sun & Ni)
- Work executed under memory limitation
- Hierarchical memory
[Figure: two bar charts versus the number of processors p = 1..5. Left: Amount of Work, split into W1 and Wp; the work grows with p as the larger memory allows larger problems. Right: Elapsed Time, split into T1 and Tp, which also grows with p.]
31Characteristics
- Connection to other scaling models
- G(p) = 1: problem-constrained (fixed-size) scaling
- G(p) = p: time-constrained (fixed-time) scaling
- With overhead
- G(p) > p: can lead to a large increase in execution time
- (ex) 10K x 10K matrix factorization: 800 MB, 1 hour on a uniprocessor; with 1024 processors, a 320K x 320K matrix takes 32 hours
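A quick worked check of that example, assuming memory grows as O(n^2) and factorization work as O(n^3) (the usual dense-factorization costs, not stated on the slide):

$$
\frac{(320\text{K})^2}{(10\text{K})^2} = 1024 \;(\text{memory scales with } p), \qquad
\text{time} = \frac{(320\text{K}/10\text{K})^3}{1024}\times 1\,\text{hr} = \frac{32^3}{1024}\,\text{hr} = 32\,\text{hr}
$$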
32Why Scalable Computing
- Scalable
- More accurate solution
- Sufficient parallelism
- Maintain efficiency
- Efficient in parallel computing
- Load balance
- Communication
- Mathematically effective
- Adaptive
- Accuracy
33Memory-Bounded Speedup
- Natural for domain-decomposition-based computing
- Shows the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
- Impact extends to the architecture design trade-off between memory size and computing speed
34Why Scalable Computing (2)
Small Work
- Appropriate for a small machine
- Parallelism overheads begin to dominate benefits for larger machines
- Load imbalance
- Communication-to-computation ratio
- May even achieve slowdowns
- Does not reflect real usage, and is inappropriate for a large machine
- Can exaggerate benefits of improvements
35Why Scalable Computing (3)
Large Work
- Appropriate for a big machine
- Difficult to measure improvement
- May not fit on a small machine
- Can't run
- Thrashing to disk
- Working set doesn't fit in cache
- Fits at some p, leading to superlinear speedup
36Demonstrating Scaling Problems
[Figure: speedup curves on the SGI Origin2000 for a big equation-solver problem (showing superlinear speedup) and a small Ocean problem (dominated by parallelism overhead).]
Users want to scale problems as machines grow!
37How to Scale
- Scaling a machine
- Make a machine more powerful
- Machine size: <processor, memory, communication, I/O>
- Scaling a machine in parallel processing
- Add more identical nodes
- Problem size
- Input configuration
- Data set size: the amount of storage required to run it on a single processor
- Memory usage: the amount of memory used by the program
38Two Key Issues in Problem Scaling
- Under what constraints should the problem be
scaled? - Some properties must be fixed as the machine
scales - How should the problem be scaled?
- Which parameters?
- How?
39Constraints To Scale
- Two types of constraints
- Problem-oriented
- Ex) Time
- Resource-oriented
- Ex) Memory
- Work to scale
- Metric-oriented
- Floating point operations, instructions
- User-oriented
- Easy to change but may be difficult to compare
- Ex) particles, rows, transactions
- Difficult cross comparison
40Rethinking of Speedup
- Why is it called speedup but compares time?
- Could we compare speed directly?
- Generalized speedup
- X.H. Sun and J. Gustafson, "Toward A Better Parallel Performance Metric," Parallel Computing, Vol. 17, pp. 1093-1109, Dec. 1991.
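A sketch of the generalized speedup from the cited paper, stated here as a ratio of speeds rather than times (the W'/T notation for the scaled parallel run is assumed, not taken from the slide):

$$
S_{\text{generalized}} = \frac{\text{parallel speed}}{\text{sequential speed}}
= \frac{W'/T_p(W')}{W/T_1(W)}
$$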
42Compute π: Problem
- Consider a parallel algorithm for computing the value of π ≈ 3.1415 through the following numerical integration
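The integrand is the one used in the code on the next slide; the integral being approximated, with the midpoint rule the code implements:

$$
\pi = \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; h\sum_{i=0}^{n-1} \frac{4}{1+x_i^2},
\qquad h=\frac{1}{n},\; x_i = h\,(i+0.5)
$$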
43Compute π: Sequential Algorithm
computepi()
{
  h = 1.0/n;
  sum = 0.0;
  for (i = 0; i < n; i++) {
    x = h*(i + 0.5);            /* midpoint of the i-th interval */
    sum = sum + 4.0/(1 + x*x);
  }
  pi = h*sum;
}
44Compute π: Parallel Algorithm
- Each processor computes on a set of about n/p points which are allocated to it in a cyclic manner
- Finally, we assume that the local values of π are accumulated among the p processors under synchronization
[Figure: cyclic (interleaved) assignment of the n points to processors 0, 1, 2, 3, repeating 0 1 2 3 across the interval.]
45Compute π: Parallel Algorithm
computepi()
{
  id = my_proc_id();
  nprocs = number_of_procs();
  h = 1.0/n;
  sum = 0.0;
  for (i = id; i < n; i += nprocs) {   /* cyclic (interleaved) distribution of points */
    x = h*(i + 0.5);
    sum = sum + 4.0/(1 + x*x);
  }
  localpi = sum*h;
  use_tree_based_combining_for_critical_section();
  pi = pi + localpi;                   /* accumulate the local contributions */
  end_critical_section();
}
46Compute π: Analysis
- Assume that the computation of π is performed over n points
- The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points, the number of operations executed in the sequential algorithm is 6n:
  x = h*(i + 0.5);            /* 1 multiplication, 1 addition */
  sum = sum + 4.0/(1 + x*x);  /* 1 multiplication, 1 division, 2 additions */
47Compute π: Analysis
- The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points which are allocated to it in a cyclic manner
- The expression for m is m = ⌈n/p⌉ when p does not exactly divide n. The runtime for the parallel computation of the local values of π is sketched below.
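A sketch of that local-computation time, assuming (as the slide's operation count suggests) a unit cost t_c per arithmetic operation — t_c is assumed notation, not the slide's:

$$
% each processor handles m = \lceil n/p \rceil points at 6 operations per point
T_p^{\text{comp}} = 6\,\Big\lceil \frac{n}{p} \Big\rceil\, t_c
$$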
48Compute π: Analysis
- The accumulation of the local values of π using tree-based combining can be performed optimally in log2(p) steps
- The total runtime for the parallel algorithm, including the parallel computation and the combining, and the resulting speedup are sketched below.
49Compute π: Analysis
- The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation in the form 1/(α + (1-α)/p)
- The resulting fraction α(n, p), and the limit showing that the parallel algorithm is effective, are sketched below.
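A sketch of that derivation, assuming p divides n so the ceiling can be dropped (the slide may carry the ceiling through):

$$
S_p = \frac{6n}{\frac{6n}{p} + \log_2 p} = \frac{1}{\alpha(n,p) + \frac{1-\alpha(n,p)}{p}}
\;\Rightarrow\;
\alpha(n,p) = \frac{p\,\log_2 p}{6n\,(p-1)},
\qquad
\lim_{n\to\infty} \alpha(n,p) = 0
$$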
50Finite Differences: Problem
- Consider a finite-difference iterative method applied to an n x n 2D grid, where each point is updated as sketched below
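The update itself did not survive extraction; the form below is read off the code on the next slide:

$$
x_{i,j} \leftarrow w_1\,\big(x_{i,j-1} + x_{i,j+1} + x_{i-1,j} + x_{i+1,j}\big) + w_2\,x_{i,j}
$$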
51Finite Differences: Serial Algorithm
finitediff()
{
  for (t = 0; t < T; t++)
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
}
52Finite Differences: Parallel Algorithm
- Each processor computes on a sub-grid of (n/√p) x (n/√p) points
- Synchronization between processors after every iteration ensures that correct values are used for subsequent iterations
53Finite Differences: Parallel Algorithm
finitediff()
{
  row_id = my_processor_row_id();
  col_id = my_processor_col_id();
  p = number_of_processors();
  sp = sqrt(p);
  rows = cols = ceil(n/sp);          /* sub-grid dimensions per processor */
  row_start = row_id*rows;
  col_start = col_id*cols;
  for (t = 0; t < T; t++) {
    for (i = row_start; i < min(row_start+rows, n); i++)
      for (j = col_start; j < min(col_start+cols, n); j++)
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
    barrier();                       /* synchronize after each iteration */
  }
}
54Finite Differences: Analysis
- The sequential algorithm performs 6 operations (2 multiplications, 4 additions) per grid point per iteration. Hence, for an n x n grid and T iterations, the number of operations executed in the sequential algorithm is 6n^2T:
  x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];  /* 2 multiplications, 4 additions */
55Finite Differences: Analysis
- The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m x m sub-grid allocated to it in a blockwise manner
- The expression for m is m = ⌈n/√p⌉. The runtime for the parallel computation per processor is sketched below.
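A sketch under the same unit-cost assumption used for the π example (t_c per operation is assumed notation):

$$
T_p^{\text{comp}} = 6\,\Big\lceil \frac{n}{\sqrt{p}} \Big\rceil^2 T\, t_c
$$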
56Finite Differences: Analysis
- The barrier synchronization needed in each iteration can be performed optimally in log2(p) steps
- The total runtime for the parallel algorithm and the resulting speedup are sketched below.
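A sketch, again assuming unit operation cost and one t_c per barrier step:

$$
T_p = \Big(6\Big\lceil \tfrac{n}{\sqrt{p}} \Big\rceil^2 + \log_2 p\Big) T\, t_c,
\qquad
S_p = \frac{6n^2}{6\lceil n/\sqrt{p} \rceil^2 + \log_2 p}
$$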
57Finite Differences: Analysis
- The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation in the Amdahl form
- The resulting fraction α(n, p), and the limit showing that the parallel algorithm is effective, are sketched below.
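A sketch, assuming p divides n^2 evenly so the ceiling can be dropped:

$$
\alpha(n,p) = \frac{p\,\log_2 p}{6n^2\,(p-1)},
\qquad
\lim_{n\to\infty} \alpha(n,p) = 0
$$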
58Equation Solver
procedure solve(A)              /* iterate over an n x n grid A */
  while (!done) do
    diff = 0
    for i = 1 to n do
      for j = 1 to n do
        temp = A[i,j]
        A[i,j] = 0.2*(A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
        diff = diff + abs(A[i,j] - temp)
      end for
    end for
    if (diff/(n*n) < TOL) then done = 1
  end while
end procedure
59Workloads
- Basic properties
- Memory requirement: O(n^2)
- Computational complexity: O(n^3), assuming the number of iterations to converge to be O(n)
- Assume speedup equal to p
- Grid size under each scaling model (see the sketch below)
- Fixed-size: fixed at n
- Fixed-time
- Memory-bound
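A sketch of the scaled grid sizes, assuming linear speedup and the O(n^2) memory / O(n^3) time costs listed above (k denotes the scaled grid dimension; the notation is assumed):

$$
\text{fixed-size: } k = n; \qquad
\text{fixed-time: } \frac{k^3}{p} = n^3 \Rightarrow k = n\,p^{1/3}; \qquad
\text{memory-bound: } k^2 = p\,n^2 \Rightarrow k = n\,\sqrt{p}
$$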
60Memory Requirement of Equation Solver
Fixed-size, Fixed-time, Memory-bound
61Time Complexity of Equation Solver
Sequential time complexity; Fixed-size, Fixed-time, Memory-bound
62Concurrency
Concurrency is proportional to the number of grid points
Fixed-size, Fixed-time, Memory-bound
63Communication to Computation Ratio
Fixed-size, Fixed-time, Memory-bound
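The formulas for these four slides were figures in the original; a sketch of the standard results for the n x n solver, assuming O(n^2) memory, O(n^3) work, a block decomposition over p processors, and the scaled grid sizes k derived earlier (all notation assumed):

$$
\begin{array}{lccc}
 & \text{fixed-size} & \text{fixed-time} & \text{memory-bound} \\
\text{grid size } k & n & n\,p^{1/3} & n\,\sqrt{p} \\
\text{memory } O(k^2) & n^2 & n^2 p^{2/3} & n^2 p \\
\text{parallel time } O(k^3/p) & n^3/p & n^3 & n^3\sqrt{p} \\
\text{concurrency } O(k^2) & n^2 & n^2 p^{2/3} & n^2 p \\
\text{comm/comp ratio } O(\sqrt{p}/k) & \sqrt{p}/n & p^{1/6}/n & 1/n
\end{array}
$$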
64Scalability: The Need for New Metrics
- Comparison of performance with different workloads
- Availability of massively parallel processing
- Scalability: the ability to maintain parallel processing gain when both problem size and system size increase
65Parallel Efficiency
- The achieved fraction of the total potential parallel processing gain: E = Sp/p
- Assuming linear speedup, p is the ideal case (E = 1)
- The ability to maintain efficiency when problem size increases
66Maintain Efficiency
- Efficiency of adding n numbers in parallel: E = 1/(1 + 2p·log p / n)
- For an efficiency of 0.80 on 4 processors, n = 64
- For an efficiency of 0.80 on 8 processors, n = 192
- For an efficiency of 0.80 on 16 processors, n = 512
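A quick check that the numbers above follow from the formula (solving for n at E = 0.80, with the logarithm taken base 2):

$$
E = \frac{1}{1 + 2p\log_2 p / n} = 0.8
\;\Rightarrow\;
n = 8\,p\log_2 p:
\quad n(4)=64,\; n(8)=192,\; n(16)=512
$$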
67Ideally Scalable
- T(m × p, m × W) = T(p, W)
- T: execution time
- W: work executed
- p: number of processors used
- m: scale up m times
- work: flop count based on the best practical serial algorithm
- Fact
- T(m × p, m × W) = T(p, W) if and only if the average unit speed is fixed
68Definition
- The average unit speed is the achieved speed divided by the number of processors
- Definition (Isospeed Scalability)
- An algorithm-machine combination is scalable if the achieved average unit speed can remain constant with increasing numbers of processors, provided the problem size is increased proportionally
69Isospeed Scalability (Sun & Rover, 91)
- W: work executed when p processors are employed
- W': work executed when p' > p processors are employed to maintain the average speed
- Ideal case and scalability in terms of time: see the sketch below
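A sketch of the isospeed scalability function as it is commonly quoted from the Sun-Rover work cited on the next slide (it should be checked against the original paper):

$$
\psi(p, p') = \frac{p'\,W}{p\,W'},
\qquad \text{ideal case: } W' = \frac{p'}{p}\,W \;\Rightarrow\; \psi(p,p') = 1
$$

Under the isospeed constraint (constant average unit speed), this ratio also equals T(p, W)/T(p', W'), which is the "scalability in terms of time" mentioned above.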
70Isospeed Scalability (Sun & Rover)
- W: work executed when p processors are employed
- W': work executed when p' > p processors are employed to maintain the average speed
- Ideal case: W' = (p'/p)·W
- X.H. Sun and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations," IEEE Trans. on Parallel and Distributed Systems, May 1994 (Ames TR91).
71The Relation of Scalability and Time
- Being more scalable leads to a smaller execution time
- Better initial run-time and higher scalability lead to superior run-time
- The same initial run-time and the same scalability lead to the same scaled performance
- Superior initial performance may not last long if scalability is low
- Range Comparison
- X.H. Sun, "Scalability Versus Execution Time in Scalable Systems," Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb. 2002.
72Range Comparison Via Performance Crossing Point
Assume program 1 is α times slower than program 2 at the initial state.
Begin (Range Comparison)
  p' = p
  Repeat
    p' = p' + 1
    Compute the scalability of program 1, Φ(p, p')
    Compute the scalability of program 2, Ψ(p, p')
  Until (Φ(p, p') > α·Ψ(p, p') or p' = the limit of ensemble size)
  If Φ(p, p') > α·Ψ(p, p') Then
    p' is the smallest scaled crossing point;
    program 2 is superior at any ensemble size p*, p ≤ p* < p'
  Else
    program 2 is superior at any ensemble size p*, p ≤ p* ≤ p'
  End If
End (Range Comparison)
73Influence of Communication Speed
Influence of Computing Speed
- X.H. Sun, M. Pantano, and T. Fahringer, "Integrated Range Comparison for Data-Parallel Compilation Systems," IEEE Trans. on Parallel and Distributed Systems, May 1999.
74The SCALA (SCALability Analyzer) System
- Design Goals
- Predict performance
- Support program optimization
- Estimate the influence of hardware variations
- Uniqueness
- Designed to be integrated into advanced compiler
systems - Based on scalability analysis
75Vienna Fortran Compilation System
- A data-parallel restructuring compilation system
- Consists of a parallelizing compiler for VF/HPF and tools for program analysis and restructuring
- Under a major upgrade for HPF2
- Performance prediction is crucial for appropriate program restructuring
76The Structure of SCALA
77Prototype Implementation
- Automatic range comparison for different data distributions
- The P3T static performance estimator
- Test cases: Jacobi and Red-Black
[Figure: measured results — one case with no crossing point, one case with a crossing point.]
78Summary
- Relation between iso-speed scalability and iso-efficiency scalability
- Both measure the ability to maintain parallel efficiency, defined as the achieved speedup divided by the number of processors (see the sketch below)
- Iso-efficiency's speedup is the traditional speedup, the ratio of sequential to parallel execution time
- Iso-speed's speedup is the generalized speedup, the ratio of parallel speed to sequential speed
- If the sequential execution speed is independent of problem size, iso-speed and iso-efficiency are equivalent
- Due to the memory hierarchy, sequential execution performance varies greatly with problem size
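The three definitions in compact form (standard formulations; the W'/T notation for the generalized speedup is assumed, not from the slide):

$$
E = \frac{S}{p}, \qquad
S_{\text{traditional}} = \frac{T_1}{T_p}, \qquad
S_{\text{generalized}} = \frac{\text{parallel speed}}{\text{sequential speed}} = \frac{W'/T_p}{W/T_1}
$$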
79Summary
- Predicting sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
- The memory-logP model is introduced for data access cost
- New challenges in distributed computing
- Generalized iso-speed scalability
- Generalized performance tool: GHS
- K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.
- X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.