1
Performance Evaluation of Parallel Processing
Xian-He Sun, Illinois Institute of Technology
sun@iit.edu
2
Outline
  • Performance metrics
  • Speedup
  • Efficiency
  • Scalability
  • Examples
  • Reading: Kumar, Chapter 5

3
Performance Evaluation
(Improving performance is the goal)
  • Performance Measurement
  • Metric, Parameter
  • Performance Prediction
  • Model, Application-Resource
  • Performance Diagnosis/Optimization
  • Post-execution, Algorithm improvement, Architecture improvement, State-of-the-art, Resource management/Scheduling

4
Parallel Performance Metrics (Run-time is the dominant metric)
  • Run-Time (Execution Time)
  • Speed: MFLOPS, MIPS, CPI
  • Efficiency: throughput
  • Speedup
  • Parallel Efficiency
  • Scalability: the ability to maintain performance gain when system and problem size increase
  • Others: portability, programming ability, etc.

5
Models of Speedup
Performance Evaluation of Parallel Processing
  • Speedup
  • Scaled Speedup
  • Parallel processing gain over sequential
    processing, where problem size scales up with
    computing power (having sufficient
    workload/parallelism)

6
Speedup
  • Ts: time for the best serial algorithm
  • Tp: time for the parallel algorithm using p processors
  • Speedup: S = Ts / Tp

7
Example
[Figure: (a) sequential execution takes 100 time units on one processor; (b) four processors take 35 time units each (speedup 100/35 ≈ 2.9); (c) four processors take 25 time units each (speedup 100/25 = 4).]
8
Example (cont.)
[Figure (cont.): two further four-processor cases — a balanced load of 50 time units per processor, and an unbalanced load of 30, 20, 40, and 10 time units, where the finishing time is set by the most loaded processor.]
9
What Is Good Speedup?
  • Linear speedup: S(p) = p
  • Superlinear speedup: S(p) > p
  • Sub-linear speedup: S(p) < p

10
Speedup
[Figure: speedup curves plotted against the number of processors p.]
11
Sources of Parallel Overheads
  • Interprocessor communication
  • Load imbalance
  • Synchronization
  • Extra computation

12
Degradations of Parallel Processing
  • Unbalanced workload
  • Communication delay
  • Overhead increases with the ensemble size
13
Degradations of Distributed Computing
  • Unbalanced computing power and workload
  • Shared computing and communication resources
  • Uncertainty, heterogeneity, and overhead increase with the ensemble size
14
Causes of Superlinear Speedup
  • Cache size increased
  • Overhead reduced
  • Latency hidden
  • Randomized algorithms
  • Mathematical inefficiency of the serial algorithm
  • Higher memory access cost in sequential processing
  • X.-H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Transactions on Parallel and Distributed Systems, Nov. 1995.

15
  • Fixed-Size Speedup (Amdahl's law)
  • Emphasis on turnaround time
  • Problem size, W, is fixed

16
Amdahl's Law
  • The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
  • Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
  • Loop initialization
  • Reading/writing to a single disk
  • Procedure call overhead
  • The parallel run time is then given by Tp = α·Ts + (1 − α)·Ts / p

17
Amdahl's Law
  • Amdahl's law gives a limit on the speedup in terms of α:
    S(p) = Ts / Tp = 1 / (α + (1 − α)/p) ≤ 1/α

18
Enhanced Amdahl's Law
  • To include overhead in the model
  • The overhead includes parallelism and interaction overheads

Amdahl's law is often used as an argument against massively parallel systems
19
Fixed-Size Speedup (Amdahl's Law, '67)
[Figure: fixed-size model — for p = 1…5 the amount of work (serial part W1 plus parallel part Wp) stays constant, while the elapsed time (T1 plus Tp) shrinks as the number of processors grows.]
20
Amdahl's Law
  • The speedup that is achievable on p processors is S(p) = 1 / (α + (1 − α)/p)
  • If we assume that the serial fraction is fixed, then the speedup for infinitely many processors is limited by 1/α
  • For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
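A brief numerical sketch of this bound (illustrative C, not from the lecture; the processor counts are arbitrary):

  #include <stdio.h>

  /* Amdahl's law: speedup on p processors for serial fraction alpha */
  double amdahl_speedup(double alpha, int p) {
      return 1.0 / (alpha + (1.0 - alpha) / p);
  }

  int main(void) {
      double alpha = 0.10;                      /* the 10% serial fraction from the slide */
      int procs[] = { 10, 100, 1000, 100000 };  /* arbitrary processor counts */
      for (int k = 0; k < 4; k++)
          printf("p = %6d  ->  speedup = %.2f\n", procs[k], amdahl_speedup(alpha, procs[k]));
      /* the speedup approaches, but never exceeds, 1/alpha = 10 */
      return 0;
  }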

21
Comments on Amdahl's Law
  • The Amdahl fraction α in practice depends on the problem size n and the number of processors p
  • An effective parallel algorithm has α(n, p) → 0 as n grows, for fixed p
  • For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size
  • Scalable speedup
  • Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer

22
  • Fixed-Time Speedup (Gustafson, '88)
  • Emphasis on work finished in a fixed time
  • Problem size is scaled from W to W'
  • W': work finished within the fixed time with parallel processing

23
Gustafson's Law (Without Overhead)
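The equation on this slide is not preserved in the transcript; in its standard form, Gustafson's fixed-time (scaled) speedup with serial fraction α is

  Speedup_FT = W'/W = (α·W + (1 − α)·p·W) / W = α + (1 − α)·p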
24
Fixed-Time Speedup (Gustafson)
[Figure: fixed-time model — for p = 1…5 the elapsed time (T1 plus Tp) stays constant, while the amount of work (W1 plus Wp) grows with the number of processors.]
25
Converting α between Amdahl's and Gustafson's Laws
Based on this observation, Amdahl's and Gustafson's laws are identical.
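The conversion itself was lost in this transcript; a minimal reconstruction under the usual definitions (α_A = serial fraction of the sequential run, α_G = serial fraction of the fixed-time parallel run) is

  α_A = α_G / (α_G + (1 − α_G)·p)

Substituting this α_A into Amdahl's formula 1/(α_A + (1 − α_A)/p) yields exactly α_G + (1 − α_G)·p, Gustafson's scaled speedup, which is the sense in which the two laws agree.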
26
Gustafson's Law (With Overhead)
[Equation and figure garbled in extraction: the fixed-time scaled work is Work(p) = α·W + (1 − α)·p·W and Speedup_FT = Work(p)/Work(1); the accompanying figure compares the unit-time bar on one processor (fractions α and 1 − α) with the time bar on p processors, which adds a parallelism-overhead segment.]
27
Memory-Constrained Scaling: Sun and Ni's Law
  • Scale to the largest possible solution limited by the memory space; or, fix the memory usage per processor
  • (ex) the N-body problem
  • Problem size is scaled from W to W*
  • W* is the work executed under the memory limitation of a parallel computer
  • For a simple profile, W* = G(p)·W, where G(p) is the increase in parallel workload as the memory capacity increases p times

28
Sun and Ni's Law
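This slide originally carried only the formula; a reconstruction of Sun and Ni's memory-bounded speedup, using the notation above, is

  Speedup_MB = (α·W + (1 − α)·G(p)·W) / (α·W + (1 − α)·G(p)·W / p)

With G(p) = 1 this reduces to Amdahl's fixed-size speedup, and with G(p) = p to Gustafson's fixed-time speedup (compare slide 31).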
29
  • Memory-Bounded Speedup (Sun & Ni, '90)
  • Emphasis on work finished under the current physical limitation
  • Problem size is scaled from W to W*
  • W*: work executed under the memory limitation with parallel processing
  • X.-H. Sun and L. Ni, "Scalable Problems and Memory-Bounded Speedup," Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37, Sept. 1993 (SC'90).

30
Memory-Bounded Speedup (Sun & Ni)
  • Work executed under memory limitation
  • Hierarchical memory

[Figure: memory-bounded model — amount of work (W1 plus Wp) and elapsed time (T1 plus Tp) for p = 1…5; the scaled work grows with the number of processors.]
31
Characteristics
  • Connection to other scaling models:
  • G(p) = 1: problem-constrained (fixed-size) scaling
  • G(p) = p: time-constrained (fixed-time) scaling
  • With overhead
  • G(p) > p can lead to a large increase in execution time
  • (ex) a 10K × 10K matrix factorization needs 800 MB and takes 1 hour on a uniprocessor; with 1024 processors, a 320K × 320K matrix takes 32 hours

32
Why Scalable Computing
  • Scalable:
  • More accurate solution
  • Sufficient parallelism
  • Maintain efficiency
  • Efficient in parallel computing:
  • Load balance
  • Communication
  • Mathematically effective:
  • Adaptive
  • Accuracy

33
  • Memory-Bounded Speedup
  • Natural for domain-decomposition-based computing
  • Shows the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
  • Impacts extend to the architectural design trade-off between memory size and computing speed

34
Why Scalable Computing (2)
Small Work
  • Appropriate for a small machine
  • Parallelism overheads begin to dominate benefits for larger machines:
  • Load imbalance
  • Communication-to-computation ratio
  • May even achieve slowdowns
  • Does not reflect real usage, and is inappropriate for a large machine
  • Can exaggerate the benefits of improvements

35
Why Scalable Computing (3)
Large Work
  • Appropriate for a big machine
  • Difficult to measure improvement
  • May not fit on a small machine:
  • Can't run
  • Thrashing to disk
  • Working set doesn't fit in cache
  • Fits at some p, leading to superlinear speedup

36
Demonstrating Scaling Problems
[Figures: speedup of a big equation-solver problem and of a small Ocean problem on the SGI Origin 2000 — the large problem shows superlinear speedup, while the small one is dominated by parallelism overhead.]
Users want to scale problems as machines grow!
37
How to Scale
  • Scaling a machine
  • Make a machine more powerful
  • Machine size
  • <processor, memory, communication, I/O>
  • Scaling a machine in parallel processing
  • Add more identical nodes
  • Problem size
  • Input configuration
  • Data set size: the amount of storage required to run it on a single processor
  • Memory usage: the amount of memory used by the program

38
Two Key Issues in Problem Scaling
  • Under what constraints should the problem be
    scaled?
  • Some properties must be fixed as the machine
    scales
  • How should the problem be scaled?
  • Which parameters?
  • How?

39
Constraints To Scale
  • Two types of constraints
  • Problem-oriented
  • Ex) Time
  • Resource-oriented
  • Ex) Memory
  • Work to scale
  • Metric-oriented
  • Floating point operations, instructions
  • User-oriented
  • Easy to change, but may be difficult to compare
  • Ex) particles, rows, transactions
  • Difficult cross-comparison

40
Rethinking Speedup
  • Speedup
  • Why is it called speedup when it compares times?
  • Could we compare speeds directly?
  • Generalized speedup
  • X.-H. Sun and J. Gustafson, "Toward a Better Parallel Performance Metric," Parallel Computing, Vol. 17, pp. 1093-1109, Dec. 1991.

41
(No Transcript)
42
Compute π: Problem
  • Consider a parallel algorithm for computing the value of π ≈ 3.1415 through the following numerical integration:
    π = ∫₀¹ 4/(1 + x²) dx

43
Compute π: Sequential Algorithm

  computepi() {
    h = 1.0 / n;
    sum = 0.0;
    for (i = 0; i < n; i++) {
      x = h * (i + 0.5);
      sum = sum + 4.0 / (1 + x * x);
    }
    pi = h * sum;
  }

44
Compute π: Parallel Algorithm
  • Each processor computes on a set of about n/p points, which are allocated to each processor in a cyclic manner
  • Finally, we assume that the local values of π are accumulated among the p processors under synchronization

[Figure: the n points are assigned to processors 0, 1, 2, 3, 0, 1, 2, 3, … in a cyclic (interleaved) manner.]
45
Compute π: Parallel Algorithm

  computepi() {
    id = my_proc_id();
    nprocs = number_of_procs();
    h = 1.0 / n;
    sum = 0.0;
    for (i = id; i < n; i = i + nprocs) {
      x = h * (i + 0.5);
      sum = sum + 4.0 / (1 + x * x);
    }
    localpi = sum * h;
    use_tree_based_combining_for_critical_section();
    pi = pi + localpi;
    end_critical_section();
  }
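For concreteness, a minimal runnable sketch of the same computation is shown below, assuming MPI; MPI_Reduce takes the place of the tree-based combining, and the point count n is hard-coded purely for illustration.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int id, nprocs, i, n = 1000000;           /* n chosen only for illustration */
      double h, x, sum = 0.0, localpi, pi = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      h = 1.0 / n;
      for (i = id; i < n; i += nprocs) {        /* cyclic (interleaved) point assignment */
          x = h * (i + 0.5);
          sum += 4.0 / (1.0 + x * x);
      }
      localpi = sum * h;

      /* the combining is done in O(log p) steps internally by MPI_Reduce */
      MPI_Reduce(&localpi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (id == 0) printf("pi = %.10f\n", pi);
      MPI_Finalize();
      return 0;
  }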

46
Compute π: Analysis
  • Assume that the computation of π is performed over n points
  • The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points, the number of operations executed in the sequential algorithm is 6n

  for (i = 0; i < n; i++) {
    x = h * (i + 0.5);              /* 1 multiplication, 1 addition */
    sum = sum + 4.0 / (1 + x * x);  /* 1 multiplication, 1 division, 2 additions */
  }
47
Compute π: Analysis
  • The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points, which are allocated to it in a cyclic manner
  • The expression for m is m = ⌈n/p⌉ when p does not exactly divide n. The parallel computation of the local values of π therefore takes about 6·⌈n/p⌉ operations per processor

48
Compute π: Analysis
  • The accumulation of the local values of π using tree-based combining can be optimally performed in log₂(p) steps
  • The total runtime for the parallel algorithm, including the parallel computation and the combining, and the resulting speedup, are given in the reconstruction below

49
Compute π: Analysis
  • The Amdahl fraction for this parallel algorithm can be determined by rewriting the speedup expression in the standard form 1/(α + (1 − α)/p)
  • The resulting Amdahl fraction α(n, p) is reconstructed below
  • The parallel algorithm is effective because α(n, p) → 0 as n → ∞ for fixed p
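The original equations on the three analysis slides above were lost in this transcript; under the slides' own operation-count model (6 operations per point, and one unit of time per combining step) a plausible reconstruction is

  Ts = 6n
  Tp = 6·⌈n/p⌉ + log₂ p
  Speedup(n, p) = 6n / (6·⌈n/p⌉ + log₂ p) ≈ p / (1 + p·log₂ p / (6n))
  α(n, p) ≈ p·log₂ p / (6n·(p − 1))

so α(n, p) indeed vanishes as n grows for any fixed p.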

50
Finite Differences: Problem
  • Consider a finite difference iterative method applied to an n × n 2D grid, where each point is updated from its four neighbors:
    x[i,j] = w_1·(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2·x[i,j]

51
Finite Differences: Serial Algorithm

  finitediff() {
    for (t = 0; t < T; t++)
      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
  }

52
Finite Differences: Parallel Algorithm
  • Each processor computes on a sub-grid of (n/√p) × (n/√p) points
  • Synchronization between processors after every iteration ensures that correct values are used in subsequent iterations

53
Finite Differences: Parallel Algorithm

  finitediff() {
    row_id = my_processor_row_id();
    col_id = my_processor_col_id();
    p = number_of_processors();
    sp = sqrt(p);
    rows = cols = ceil(n / sp);
    row_start = row_id * rows;
    col_start = col_id * cols;
    for (t = 0; t < T; t++) {
      for (i = row_start; i < min(row_start + rows, n); i++)
        for (j = col_start; j < min(col_start + cols, n); j++)
          x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
      barrier();
    }
  }
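A minimal shared-memory sketch of the same update is given below, assuming OpenMP rather than the explicit block decomposition above; like the slide's version it updates the grid in place and synchronizes once per time step (here via the implicit barrier of the parallel loop). Boundary handling, which the slides gloss over, is handled by skipping the outermost rows and columns.

  #include <omp.h>

  /* In-place stencil sweep over an n x n grid for T time steps.
     x, n, T, w_1, w_2 correspond to the names used on the slide. */
  void finitediff_omp(double **x, int n, int T, double w_1, double w_2) {
      for (int t = 0; t < T; t++) {
          #pragma omp parallel for
          for (int i = 1; i < n - 1; i++)        /* rows are divided among threads */
              for (int j = 1; j < n - 1; j++)
                  x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] +
                                   x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
          /* the implicit barrier at the end of the parallel for
             plays the role of barrier() in the slide's pseudocode */
      }
  }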

54
Finite Differences: Analysis
  • The sequential algorithm performs 6 operations (2 multiplications, 4 additions) per grid point every iteration. Hence, for an n × n grid and T iterations, the number of operations executed in the sequential algorithm is 6·n²·T

  x[i][j] = w_1 * (x[i][j-1] + x[i][j+1] + x[i-1][j] + x[i+1][j]) + w_2 * x[i][j];
  /* 2 multiplications, 4 additions per grid point */
55
Finite Differences: Analysis
  • The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m × m sub-grid allocated to it in a blockwise manner
  • The expression for m is m = ⌈n/√p⌉. The parallel computation then takes about 6·m² operations per processor per iteration, or 6·⌈n/√p⌉²·T over T iterations

56
Finite Differences: Analysis
  • The barrier synchronization needed for each iteration can be optimally performed in log₂(p) steps
  • The total runtime for the parallel algorithm and the resulting speedup are given in the reconstruction below

57
Finite Differences: Analysis
  • The Amdahl fraction for this parallel algorithm can be determined by rewriting the speedup expression in the standard form 1/(α + (1 − α)/p)
  • The resulting Amdahl fraction α(n, p) is reconstructed below
  • We finally note that α(n, p) → 0 as n → ∞ for fixed p
  • Hence, the parallel algorithm is effective
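As with the π example, the equations on these analysis slides were lost; with the same operation-count model (6 operations per grid point per iteration, one unit of time per barrier step) a plausible reconstruction is

  Ts = 6·n²·T
  Tp = (6·⌈n/√p⌉² + log₂ p)·T
  Speedup(n, p) ≈ 6n² / (6·n²/p + log₂ p) ≈ p / (1 + p·log₂ p / (6n²))
  α(n, p) ≈ p·log₂ p / (6n²·(p − 1))

which again vanishes as n grows, so the algorithm is effective.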

58
Equation Solver
  procedure solve(A)                     /* iterate on an n x n grid until convergence */
    while (!done) do
      diff = 0
      for i = 1 to n do
        for j = 1 to n do
          temp = A[i,j]
          A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
          diff = diff + abs(A[i,j] - temp)
        end for
      end for
      if (diff / (n*n) < TOL) then done = 1
    end while
  end procedure
59
Workloads
  • Basic properties
  • Memory requirement: O(n²)
  • Computational complexity: O(n³), assuming the number of iterations to converge to be O(n)
  • Assume the speedup equals the number of processors p
  • Grid size
  • Fixed-size: the grid stays n × n
  • Fixed-time: the grid grows to about n·p^(1/3) per side, so that the scaled work finishes in the same time
  • Memory-bound: the grid grows to n·√p per side, so that memory usage scales with the machine
60
Memory Requirement of Equation Solver
  • Fixed-size: O(n²)
  • Fixed-time: O((n·p^(1/3))²) = O(n²·p^(2/3))
  • Memory-bound: O((n·√p)²) = O(n²·p), i.e. O(n²) per processor
61
Time Complexity of Equation Solver
(sequential time complexity: O(n³))
  • Fixed-size: O(n³/p)
  • Fixed-time: O((n·p^(1/3))³/p) = O(n³), constant by construction
  • Memory-bound: O((n·√p)³/p) = O(n³·√p)
62
Concurrency
Concurrency is proportional to the number of grid points
  • Fixed-size: O(n²)
  • Fixed-time: O((n·p^(1/3))²) = O(n²·p^(2/3))
  • Memory-bound: O((n·√p)²) = O(n²·p)
63
Communication-to-Computation Ratio
(per processor, for a √p × √p block decomposition of the grid)
  • Fixed-size: O(√p / n)
  • Fixed-time: O(p^(1/6) / n)
  • Memory-bound: O(1/n), independent of p
64
  • Scalability
  • The Need for New Metrics
  • Comparison of performance with different workloads
  • Availability of massively parallel processing
  • Scalability
  • Ability to maintain parallel processing gain when
    both problem size and system size increase

65
Parallel Efficiency
  • The achieved fraction of the total potential parallel processing gain: E = Speedup / p
  • Assuming linear speedup, speedup = p is the ideal case
  • The ability to maintain efficiency when the problem size increases

66
Maintain Efficiency
  • Efficiency of adding n numbers in parallel: E = 1 / (1 + 2p·log p / n)
  • For an efficiency of 0.80 on 4 processors, n = 64
  • For an efficiency of 0.80 on 8 processors, n = 192
  • For an efficiency of 0.80 on 16 processors, n = 512
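A small check of these numbers (illustrative C; the formula and the (p, n) pairs are taken from the slide, with log interpreted as log₂):

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      int cases[][2] = { {4, 64}, {8, 192}, {16, 512} };   /* {p, n} pairs from the slide */
      for (int k = 0; k < 3; k++) {
          int p = cases[k][0], n = cases[k][1];
          double E = 1.0 / (1.0 + 2.0 * p * log2((double)p) / n);
          printf("p = %2d, n = %3d  ->  E = %.2f\n", p, n, E);   /* prints 0.80 in each case */
      }
      return 0;
  }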
67
  • Ideally Scalable
  • T(m × p, m × W) = T(p, W)
  • T: execution time
  • W: work executed
  • p: number of processors used
  • m: scale up m times
  • Work: flop count based on the best practical serial algorithm
  • Fact
  • T(m × p, m × W) = T(p, W)
  • if and only if
  • the average unit speed is fixed

68
  • Definition
  • The average unit speed is the achieved speed
    divided by the number of processors
  • Definition (Isospeed Scalability)
  • An algorithm-machine combination is scalable if
    the achieved average unit speed can remain
    constant with increasing numbers of processors,
    provided the problem size is increased
    proportionally

69
  • Isospeed Scalability (Sun & Rover, '91)
  • W: work executed when p processors are employed
  • W': work executed when p' > p processors are employed to maintain the average speed
  • Ideal case: W' = (p'/p)·W, i.e. the work grows proportionally to the number of processors
  • Scalability in terms of time (see the reconstruction below)
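The metric itself did not survive extraction; in Sun and Rover's formulation the isospeed scalability of scaling from p to p' processors can be written as

  ψ(p, p') = (p'·W) / (p·W')

which equals 1 in the ideal case (W' = p'·W/p) and is smaller otherwise. Because the average speed W/(p·T) is held fixed, this also equals the time ratio T/T', where T and T' are the execution times of W on p processors and of W' on p' processors.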

70
  • Isospeed Scalability (Sun & Rover)
  • W: work executed when p processors are employed
  • W': work executed when p' > p processors are employed to maintain the average speed
  • Ideal case: W' = (p'/p)·W
  • X.-H. Sun and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations," IEEE Transactions on Parallel and Distributed Systems, May 1994 (Ames TR91).

71
The Relation of Scalability and Time
  • Being more scalable leads to smaller execution time
  • Better initial run-time and higher scalability lead to superior run-time
  • The same initial run-time and the same scalability lead to the same scaled performance
  • Superior initial performance may not last long if scalability is low
  • Range Comparison
  • X.-H. Sun, "Scalability Versus Execution Time in Scalable Systems," Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb. 2002.

72
Range Comparison Via Performance Crossing Point
Assume program 1 is α times slower than program 2 at the initial state.

  Begin (Range Comparison)
    p' = p
    Repeat
      p' = p' + 1
      Compute the scalability of program 1, Φ(p, p')
      Compute the scalability of program 2, Ψ(p, p')
    Until (Φ(p, p') > α·Ψ(p, p') or p' = the limit of ensemble size)
    If Φ(p, p') > α·Ψ(p, p') Then
      p' is the smallest scaled crossing point;
      program 2 is superior at any ensemble size q, p ≤ q < p'
    Else
      program 2 is superior at any ensemble size q, p ≤ q ≤ p'
    End If
  End (Range Comparison)
73
  • Range Comparison

[Figures: range comparison results showing the influence of communication speed and the influence of computing speed.]
  • X.-H. Sun, M. Pantano, and T. Fahringer, "Integrated Range Comparison for Data-Parallel Compilation Systems," IEEE Transactions on Parallel and Distributed Systems, May 1999.

74
The SCALA (SCALability Analyzer) System
  • Design Goals
  • Predict performance
  • Support program optimization
  • Estimate the influence of hardware variations
  • Uniqueness
  • Designed to be integrated into advanced compiler
    systems
  • Based on scalability analysis

75
  • Vienna Fortran Compilation System
  • A data-parallel restructuring compilation system
  • Consists of a parallelizing compiler for VF/HPF
    and tools for program analysis and restructuring
  • Under a major upgrade for HPF2
  • Performance prediction is crucial for appropriate
    program restructuring

76
The Structure of SCALA
77
Prototype Implementation
  • Automatic range comparison for different data distributions
  • The P3T static performance estimator
  • Test cases: Jacobi and Red-Black

[Figures: for one pair of data distributions there is no crossing point; for another there is a crossing point.]
78
Summary
  • Relation between isospeed scalability and isoefficiency scalability
  • Both measure the ability to maintain parallel efficiency, defined as E = speedup / p
  • Isoefficiency's speedup is the traditional speedup, defined as sequential execution time over parallel execution time
  • Isospeed's speedup is the generalized speedup, defined as parallel speed over sequential speed
  • If the sequential execution speed is independent of problem size, isospeed and isoefficiency are equivalent
  • Due to the memory hierarchy, sequential execution performance in fact varies largely with problem size

79
Summary
  • Predicting sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
  • The Memory-LogP model is introduced for data-access cost
  • New challenges arise in distributed computing
  • Generalized isospeed scalability
  • Generalized performance tool: GHS
  • K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.
  • X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.