1
Programming for Performance: Part II
2
Orchestration for Performance
  • Reducing amount of communication
  • Inherent: change logical data sharing patterns in algorithm
  • Artifactual: exploit spatial, temporal locality in extended hierarchy
  • Techniques often similar to those on
    uniprocessors
  • Structuring communication to reduce cost

3
Reducing Artifactual Communication
  • Message passing model
  • Communication and replication are both explicit
  • Even artifactual communication is in explicit
    messages
  • Shared address space model
  • More interesting from an architectural
    perspective
  • Occurs transparently due to interactions of program and system
  • depends on sizes and granularities in extended memory hierarchy
  • Use shared address space to illustrate issues

4
Exploiting Temporal Locality
  • Structure algorithm so working sets map well to
    hierarchy
  • often techniques to reduce inherent communication
    do well here
  • schedule tasks for data reuse once assigned
  • Solver example: blocking (see the sketch below)
  • More useful when O(n^(k+1)) computation on O(n^k) data
  • many linear algebra computations (factorization, matrix multiply)
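
A minimal blocking sketch in C, assuming the nearest-neighbor grid sweep of the solver; the function name and the block size B are illustrative, not from the slides:

    /* Sweep the n x n grid tile by tile so each B x B tile stays
       resident in cache while its elements are reused by the
       nearest-neighbor updates. B is tuned to cache capacity. */
    #define B 64

    void blocked_sweep(double **A, int n) {
        for (int ii = 1; ii < n - 1; ii += B)
            for (int jj = 1; jj < n - 1; jj += B)
                for (int i = ii; i < ii + B && i < n - 1; i++)
                    for (int j = jj; j < jj + B && j < n - 1; j++)
                        /* average of self and four neighbors */
                        A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                                         + A[i][j-1] + A[i][j+1]);
    }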

5
Exploiting Spatial Locality
  • Besides capacity, granularities are important
  • Granularity of allocation
  • Granularity of communication or data transfer
  • Granularity of coherence
  • Major spatial-related causes of artifactual
    communication
  • Conflict misses
  • Data distribution/layout (allocation granularity)
  • Fragmentation (communication granularity)
  • False sharing of data (coherence granularity)
  • All depend on how spatial access patterns
    interact with data structures
  • Fix problems by modifying data structures, or layout/alignment (see the padding sketch below)
  • Examine later in context of architectures
  • one simple example here: data distribution in SAS solver
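
As one hedged illustration of a layout/alignment fix for false sharing (the 128-byte line size, counts, and NPROCS are assumptions, not from the slides):

    /* Pad per-processor counters to the coherence granularity so that
       concurrent writers never share a cache line. */
    #define LINE_SIZE 128
    #define NPROCS 16

    struct padded_count {
        long count;
        char pad[LINE_SIZE - sizeof(long)];
    };

    struct padded_count counts[NPROCS];  /* one line per processor */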

6
Spatial Locality Example
  • Repeated sweeps over 2-d grid, each time adding
    1 to elements
  • Natural 2-d versus higher-dimensional array representation (see the layout sketch below)
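
A hedged sketch of the higher-dimensional layout, assuming a block partitioning among p_rows x p_cols processes with n divisible by both (names are illustrative):

    #include <stdlib.h>

    /* 4D "block-major" layout: conceptually
       grid[p_rows][p_cols][n/p_rows][n/p_cols]. Each process's block is
       contiguous, so it can be placed in that process's local memory,
       and page/line boundaries do not straddle partitions as they do
       in a row-major 2-d array. */
    static inline double *elem(double *g, int n, int p_rows, int p_cols,
                               int i, int j) {
        int br = n / p_rows, bc = n / p_cols;   /* block dimensions */
        size_t blk = (size_t)(i / br) * p_cols + (size_t)(j / bc);
        return &g[blk * br * bc + (size_t)(i % br) * bc + (j % bc)];
    }

    double *alloc_grid(int n) {
        return malloc((size_t)n * n * sizeof(double));
    }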

7
Tradeoffs with Inherent Communication
  • Partitioning grid solver: blocks versus rows
  • Blocks still have a spatial locality problem on
    remote data
  • Row-wise can perform better despite worse
    inherent c-to-c ratio
  • Result depends on n and p

8
Example Performance Impact
  • Equation solver on SGI Origin2000
  • Superlinear speedup
  • Why?
  • Long cache block: 128 bytes

(Plots: results for 512 x 512 and 12K x 12K grids.)
9
Architectural Implications of Locality
  • Communication abstraction that makes exploiting
    it easy
  • For cache-coherent SAS, e.g.
  • Size and organization of levels of memory
    hierarchy
  • cost-effectiveness: caches are expensive
  • caveats: flexibility for different and time-shared workloads
  • Replication in main memory useful? If so, how to
    manage?
  • hardware, OS/runtime, program?
  • Granularities of allocation, communication,
    coherence (?)
  • small granularities => high overheads, but easier to program
  • Machine granularity (resource division among
    processors, memory...)

10
Orchestration for Performance
  • Reducing amount of communication
  • Inherent: change logical data sharing patterns in algorithm
  • Artifactual: exploit spatial, temporal locality in extended hierarchy
  • Techniques often similar to those on
    uniprocessors
  • Structuring communication to reduce cost

11
Structuring Communication
  • Given amount of comm (inherent or artifactual),
    goal is to reduce cost
  • Cost of communication as seen by process
  • C = f * (o + l + (n_c / m) / B + t_c - overlap)
  • f = frequency of messages
  • o = overhead per message (at both ends)
  • l = network delay per message
  • n_c = total data sent
  • m = number of messages
  • B = bandwidth along path (determined by network, NI, assist)
  • t_c = cost induced by contention per message
  • overlap = amount of latency hidden by overlap with comp. or comm.
  • Portion in parentheses is cost of a message (as
    seen by processor)
  • That portion, ignoring overlap, is latency of a
    message
  • Goal: reduce terms in latency and increase overlap (illustrative numbers below)
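
For illustration with assumed numbers (not from the slides): if o = 5 us and l = 5 us, a single 4 KB message on a B = 100 MB/s path adds about 41 us of transfer time, so its latency is roughly 5 + 5 + 41 + t_c us; sending the same 4 KB as 64 separate 64-byte messages pays the 10 us of o + l sixty-four times over.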

12
Reducing Overhead
  • Can reduce no. of messages m or overhead per
    message o
  • o is usually determined by hardware or system
    software
  • Program should try to reduce m by coalescing
    messages
  • More control when communication is explicit
  • Coalescing data into larger messages (see the sketch after this list)
  • Easy for regular, coarse-grained communication
  • Can be difficult for irregular, naturally
    fine-grained communication
  • may require changes to algorithm and extra work
  • coalescing data and determining what and to whom
    to send
  • will discuss more in implications for programming
    models later
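
A hedged message-passing sketch of coalescing, assuming MPI (buffer and function names are illustrative):

    #include <mpi.h>

    /* Naive: n small messages, paying per-message overhead o n times. */
    void send_each(double *u, int n, int dest) {
        for (int i = 0; i < n; i++)
            MPI_Send(&u[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Coalesced: one message of n doubles; o and l are paid once. */
    void send_coalesced(double *u, int n, int dest) {
        MPI_Send(u, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }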

13
Reducing Network Delay
  • Network delay component = f * h * t_h
  • h = number of hops traversed in network
  • t_h = link/switch latency per hop
  • Reducing f: communicate less, or make messages larger
  • Reducing h
  • Map communication patterns to network topology
  • e.g. nearest-neighbor on mesh and ring; all-to-all
  • How important is this?
  • used to be major focus of parallel algorithms
  • depends on no. of processors and how large t_h is relative to other components
  • t_h is a single phit time in pipelined networks, but a whole message time in store-and-forward networks
  • less important on modern machines: overheads, processor count, multiprogramming dominate (illustrative numbers below)
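
For illustration with assumed numbers (not from the slides): at h = 4 hops and t_h = 50 ns per hop, the delay term is only 200 ns per message, small next to per-message overheads of a few microseconds; hence the reduced emphasis on topology mapping.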

14
Reducing Contention
  • All resources have nonzero occupancy
  • Memory, communication controller, network link,
    etc.
  • Finite bandwidth for serving transactions
  • Effects of contention
  • Increased end-to-end cost for messages
  • Reduced available bandwidth for individual
    messages
  • Causes imbalances across processors
  • Particularly insidious performance problem
  • Easy to ignore when programming
  • Slow down messages that don't even need that resource
  • by causing other dependent resources to also congest
  • Effect can be devastating: don't flood a resource!

15
Types of Contention
  • Network contention and end-point contention
    (hot-spots)
  • Location and Module Hot-spots
  • Location: e.g. accumulating into global variable, barrier
  • solution: tree-structured communication (see the sketch below)
  • Module: all-to-all personalized comm. in matrix transpose
  • solution: stagger access by different processors to same node temporally
  • In general, reduce burstiness; may conflict with making messages larger
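
A hedged sketch of tree-structured accumulation, shown sequentially for clarity (in a real program each level's additions run in parallel with a barrier between levels; names are illustrative):

    /* Recursive-doubling reduction: each location is written by at most
       one processor per level, so no single variable becomes a hot-spot
       and the contention depth is O(log p) rather than O(p). */
    double tree_reduce(double *partial, int p) {
        for (int stride = 1; stride < p; stride *= 2)
            for (int id = 0; id < p; id += 2 * stride)  /* one "processor" */
                if (id + stride < p)
                    partial[id] += partial[id + stride];
        return partial[0];
    }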

16
Overlapping Communication
  • Cannot afford to stall for high latencies
  • even on uniprocessors!
  • Overlap with computation or communication to hide
    latency
  • Requires extra concurrency (slackness), higher
    bandwidth
  • Techniques (a non-blocking sketch follows this list)
  • Prefetching
  • Block data transfer
  • Overlap
  • Multithreading
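
A hedged sketch of overlapping communication with computation, assuming MPI non-blocking calls (the halo/interior names are illustrative):

    #include <mpi.h>

    /* Post the receive early, compute on data that does not depend on
       it, then wait: the transfer proceeds during the compute loop. */
    void exchange_and_compute(double *halo, int n, int nbr,
                              double *interior, int m) {
        MPI_Request req;
        MPI_Irecv(halo, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < m; i++)          /* independent work */
            interior[i] = 0.5 * interior[i];
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* halo now usable */
    }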

17
Summary of Tradeoffs
  • Different goals often have conflicting demands
  • Load Balance
  • fine-grain tasks
  • random or dynamic assignment
  • Communication
  • usually coarse grain tasks
  • decompose to obtain locality: not random/dynamic
  • Extra Work
  • coarse grain tasks
  • simple assignment
  • Communication Cost
  • big transfers: amortize overhead and latency
  • small transfers: reduce contention

18
Processor-Centric Perspective
(Figure: execution-time profiles, (a) Sequential and (b) Parallel with four processors P0-P3; time in seconds on each processor is broken into Busy-useful, Busy-overhead, Data-local, Data-remote, and Synchronization components.)
19
Relationship between Perspectives
20
Summary
  • Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy-useful(p) + Data-local(p) + Synch(p) + Data-remote(p) + Busy-overhead(p))
  • Goal is to reduce denominator components
  • Both programmer and system have role to play
  • Architecture cannot do much about load imbalance
    or too much communication
  • But it can
  • reduce incentive for creating ill-behaved
    programs (efficient naming, communication and
    synchronization)
  • reduce artifactual communication
  • provide efficient naming for flexible assignment
  • allow effective overlapping of communication

21
Workload-Driven Architectural Evaluation
22
Evaluation in Uniprocessors
  • Evaluation
  • For existing systems: comparison and procurement evaluation
  • For future systems: careful extrapolation from known quantities
  • Standard benchmarks
  • Measured on wide range of machines and successive
    generations
  • Measurements and technology assessment => features => simulation => new design
  • Simulator: simulate the design with and without a feature
  • Benchmarks: run through the simulator to obtain results
  • Together with cost and complexity, decisions made

23
Difficult Enough for Uniprocessors
  • Workloads need to be renewed and reconsidered
  • Input data sets affect key interactions
  • Changes from SPEC92 to SPEC95 to SPEC98
  • Simulation is time-consuming
  • Accurate simulators costly to develop and verify
  • Good evaluation leads to good design
  • Quantitative evaluation increasingly important
    for multiprocessors
  • Maturity of architecture, and greater continuity
    among generations
  • It's a grounded engineering discipline now
  • Good evaluation is critical, and we must learn to
    do it right

24
More Difficult for Multiprocessors
  • What is a representative workload?
  • Software model has not stabilized
  • Many architectural and application degrees of
    freedom
  • Huge design space: no. of processors, other architectural and application parameters
  • Impact of these parameters and their interactions
    can be huge
  • High cost of communication
  • What are the appropriate metrics?
  • Simulation is expensive
  • Realistic configurations and sensitivity analysis
    difficult
  • Larger design space, but more difficult to cover
  • Understanding of parallel programs as workloads
    is critical
  • Particularly interaction of application and
    architectural parameters

25
A Lot Depends on Sizes
  • Application parameters and no. of procs affect
    inherent properties
  • Load balance, communication, extra work, temporal
    and spatial locality
  • Interactions with organization parameters of
    extended memory hierarchy affect artifactual
    communication and performance
  • Effects often dramatic, sometimes small; application-dependent

(Plots: results for Barnes-Hut and as a function of grid points (n).)
  • Understanding size interactions and scaling
    relationships is key

26
Outline
  • Performance and scaling (of workload and
    architecture)
  • Techniques
  • Implications for behavioral characteristics and
    performance metrics
  • Evaluating a real machine
  • Choosing workloads
  • Choosing workload parameters
  • Choosing metrics and presenting results
  • Evaluating an architectural idea/tradeoff through
    simulation
  • Public-domain workload suites

27
Measuring Performance
  • Absolute performance
  • Most important to end user
  • Performance improvement due to parallelism
  • Speedup(p) = Performance(p) / Performance(1), always
  • Performance = Work / Time, always
  • Work is determined by input configuration of the
    problem
  • If work is fixed, can measure performance as
    1/Time
  • Or retain explicit work measure (e.g.
    transactions/sec, bonds/sec)
  • Still w.r.t. a particular configuration, and what's measured is still time
  • Speedup(p) = Time(1) / Time(p), or operations-per-second(p) / operations-per-second(1) (worked example below)
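
For illustration with assumed numbers (not from the slides): a system that sustains 10,000 transactions/sec on 1 processor and 60,000 transactions/sec on 8 processors has Speedup(8) = 60,000 / 10,000 = 6, the same value the time ratio gives when the work is fixed.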

28
Scaling Why Worry?
  • Fixed problem size is limited
  • Too small a problem
  • May be appropriate for small machine
  • Parallelism overheads begin to dominate benefits
    for larger machines
  • Load imbalance
  • Communication to computation ratio
  • May even achieve slowdowns
  • Doesn't reflect real usage, and inappropriate for large machines
  • Too large a problem
  • Difficult to measure improvement (may not be
    runnable on a single processor)

29
Too Large a Problem
  • Suppose problem realistically large for big
    machine
  • May not fit in small machine
  • Can't run
  • Thrashing to disk
  • Working set doesn't fit in cache
  • Fits at some p, leading to superlinear speedup
  • Finally, users want to scale problems as machines
    grow

30
Demonstrating Scaling Problems
  • Small Ocean and big equation solver problems on
    SGI Origin2000

31
Questions in Scaling
  • Under what constraints to scale the application?
  • What are the appropriate metrics for performance
    improvement?
  • work is not fixed any more, so time alone is not enough
  • How should the application be scaled?
  • Definitions
  • Scaling a machine: can scale power in many ways
  • Assume adding identical nodes, each bringing
    memory
  • Problem size: vector of input parameters, e.g. N = (n, q, Δt)
  • Determines work done
  • Distinct from data set size and memory usage
  • Start by assuming it's only one parameter, n, for simplicity