1
Programming for Performance: Part II
2
Orchestration for Performance
  • Reducing amount of communication
  • Inherent: change logical data sharing patterns in algorithm
  • Artifactual: exploit spatial, temporal locality in extended hierarchy
  • Techniques often similar to those on
    uniprocessors
  • Structuring communication to reduce cost

3
Reducing Artifactual Communication
  • Message passing model
  • Communication and replication are both explicit
  • Even artifactual communication is in explicit
    messages
  • Shared address space model
  • More interesting from an architectural
    perspective
  • Occurs transparently due to interactions of program and system
  • depends on sizes and granularities in extended memory hierarchy
  • Use shared address space to illustrate issues

4
Exploiting Temporal Locality
  • Structure algorithm so working sets map well to
    hierarchy
  • often techniques to reduce inherent communication
    do well here
  • schedule tasks for data reuse once assigned
  • Solver example: blocking (see the sketch below)
  • More useful when O(n^(k+1)) computation on O(n^k) data
  • many linear algebra computations (factorization, matrix multiply)
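
A minimal blocking sketch in C, assuming the nearest-neighbor grid sweep of the solver; the function name and the block size B are illustrative, not from the slides:

    /* Sweep the n x n grid tile by tile so each B x B tile stays
       resident in cache while its elements are reused by the
       nearest-neighbor updates. B is tuned to cache capacity. */
    #define B 64

    void blocked_sweep(double **A, int n) {
        for (int ii = 1; ii < n - 1; ii += B)
            for (int jj = 1; jj < n - 1; jj += B)
                for (int i = ii; i < ii + B && i < n - 1; i++)
                    for (int j = jj; j < jj + B && j < n - 1; j++)
                        /* average of self and four neighbors */
                        A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                                         + A[i][j-1] + A[i][j+1]);
    }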

5
Exploiting Spatial Locality
  • Besides capacity, granularities are important
  • Granularity of allocation
  • Granularity of communication or data transfer
  • Granularity of coherence
  • Major spatial-related causes of artifactual
    communication
  • Conflict misses
  • Data distribution/layout (allocation granularity)
  • Fragmentation (communication granularity)
  • False sharing of data (coherence granularity)
  • All depend on how spatial access patterns
    interact with data structures
  • Fix problems by modifying data structures, or layout/alignment (see the padding sketch below)
  • Examine later in context of architectures
  • one simple example here: data distribution in SAS solver
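
As one hedged illustration of a layout/alignment fix for false sharing (the 128-byte line size, counts, and NPROCS are assumptions, not from the slides):

    /* Pad per-processor counters to the coherence granularity so that
       concurrent writers never share a cache line. */
    #define LINE_SIZE 128
    #define NPROCS 16

    struct padded_count {
        long count;
        char pad[LINE_SIZE - sizeof(long)];
    };

    struct padded_count counts[NPROCS];  /* one line per processor */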

6
Spatial Locality Example
  • Repeated sweeps over 2-d grid, each time adding
    1 to elements
  • Natural 2-d versus higher-dimensional array representation (see the layout sketch below)
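
A hedged sketch of the higher-dimensional layout, assuming a block partitioning among p_rows x p_cols processes with n divisible by both (names are illustrative):

    #include <stdlib.h>

    /* 4D "block-major" layout: conceptually
       grid[p_rows][p_cols][n/p_rows][n/p_cols]. Each process's block is
       contiguous, so it can be placed in that process's local memory,
       and page/line boundaries do not straddle partitions as they do
       in a row-major 2-d array. */
    static inline double *elem(double *g, int n, int p_rows, int p_cols,
                               int i, int j) {
        int br = n / p_rows, bc = n / p_cols;   /* block dimensions */
        size_t blk = (size_t)(i / br) * p_cols + (size_t)(j / bc);
        return &g[blk * br * bc + (size_t)(i % br) * bc + (j % bc)];
    }

    double *alloc_grid(int n) {
        return malloc((size_t)n * n * sizeof(double));
    }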

7
Tradeoffs with Inherent Communication
  • Partitioning grid solver: blocks versus rows
  • Blocks still have a spatial locality problem on
    remote data
  • Row-wise can perform better despite worse
    inherent c-to-c ratio
  • Result depends on n and p

8
Example Performance Impact
  • Equation solver on SGI Origin2000
  • Superlinear speedup
  • Why?
  • Long cache block: 128 bytes

(Plots: results for 512 x 512 and 12K x 12K grids.)
9
Architectural Implications of Locality
  • Communication abstraction that makes exploiting
    it easy
  • For cache-coherent SAS, e.g.
  • Size and organization of levels of memory
    hierarchy
  • cost-effectiveness: caches are expensive
  • caveats: flexibility for different and time-shared workloads
  • Replication in main memory useful? If so, how to
    manage?
  • hardware, OS/runtime, program?
  • Granularities of allocation, communication,
    coherence (?)
  • small granularities => high overheads, but easier to program
  • Machine granularity (resource division among
    processors, memory...)

10
Orchestration for Performance
  • Reducing amount of communication
  • Inherent: change logical data sharing patterns in algorithm
  • Artifactual: exploit spatial, temporal locality in extended hierarchy
  • Techniques often similar to those on
    uniprocessors
  • Structuring communication to reduce cost

11
Structuring Communication
  • Given amount of comm (inherent or artifactual),
    goal is to reduce cost
  • Cost of communication as seen by process
  • C = f * (o + l + (n_c / m) / B + t_c - overlap)
  • f = frequency of messages
  • o = overhead per message (at both ends)
  • l = network delay per message
  • n_c = total data sent
  • m = number of messages
  • B = bandwidth along path (determined by network, NI, assist)
  • t_c = cost induced by contention per message
  • overlap = amount of latency hidden by overlap with comp. or comm.
  • Portion in parentheses is cost of a message (as
    seen by processor)
  • That portion, ignoring overlap, is latency of a
    message
  • Goal: reduce terms in latency and increase overlap (illustrative numbers below)
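
For illustration with assumed numbers (not from the slides): if o = 5 us and l = 5 us, a single 4 KB message on a B = 100 MB/s path adds about 41 us of transfer time, so its latency is roughly 5 + 5 + 41 + t_c us; sending the same 4 KB as 64 separate 64-byte messages pays the 10 us of o + l sixty-four times over.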

12
Reducing Overhead
  • Can reduce no. of messages m or overhead per
    message o
  • o is usually determined by hardware or system
    software
  • Program should try to reduce m by coalescing
    messages
  • More control when communication is explicit
  • Coalescing data into larger messages (see the sketch after this list)
  • Easy for regular, coarse-grained communication
  • Can be difficult for irregular, naturally
    fine-grained communication
  • may require changes to algorithm and extra work
  • coalescing data and determining what and to whom
    to send
  • will discuss more in implications for programming
    models later
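
A hedged message-passing sketch of coalescing, assuming MPI (buffer and function names are illustrative):

    #include <mpi.h>

    /* Naive: n small messages, paying per-message overhead o n times. */
    void send_each(double *u, int n, int dest) {
        for (int i = 0; i < n; i++)
            MPI_Send(&u[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Coalesced: one message of n doubles; o and l are paid once. */
    void send_coalesced(double *u, int n, int dest) {
        MPI_Send(u, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }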

13
Reducing Network Delay
  • Network delay component = f * h * t_h
  • h = number of hops traversed in network
  • t_h = link/switch latency per hop
  • Reducing f: communicate less, or make messages larger
  • Reducing h
  • Map communication patterns to network topology
  • e.g. nearest-neighbor on mesh and ring; all-to-all
  • How important is this?
  • used to be major focus of parallel algorithms
  • depends on no. of processors and how large t_h is relative to other components
  • t_h is a single phit time in pipelined networks, but a whole message time in store-and-forward networks
  • less important on modern machines: overheads, processor count, multiprogramming dominate (illustrative numbers below)
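
For illustration with assumed numbers (not from the slides): at h = 4 hops and t_h = 50 ns per hop, the delay term is only 200 ns per message, small next to per-message overheads of a few microseconds; hence the reduced emphasis on topology mapping.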

14
Reducing Contention
  • All resources have nonzero occupancy
  • Memory, communication controller, network link,
    etc.
  • Finite bandwidth for serving transactions
  • Effects of contention
  • Increased end-to-end cost for messages
  • Reduced available bandwidth for individual
    messages
  • Causes imbalances across processors
  • Particularly insidious performance problem
  • Easy to ignore when programming
  • Slow down messages that don't even need that resource
  • by causing other dependent resources to also congest
  • Effect can be devastating: don't flood a resource!

15
Types of Contention
  • Network contention and end-point contention
    (hot-spots)
  • Location and Module Hot-spots
  • Location: e.g. accumulating into global variable, barrier
  • solution: tree-structured communication (see the sketch below)
  • Module: all-to-all personalized comm. in matrix transpose
  • solution: stagger access by different processors to same node temporally
  • In general, reduce burstiness; may conflict with making messages larger
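
A hedged sketch of tree-structured accumulation, shown sequentially for clarity (in a real program each level's additions run in parallel with a barrier between levels; names are illustrative):

    /* Recursive-doubling reduction: each location is written by at most
       one processor per level, so no single variable becomes a hot-spot
       and the contention depth is O(log p) rather than O(p). */
    double tree_reduce(double *partial, int p) {
        for (int stride = 1; stride < p; stride *= 2)
            for (int id = 0; id < p; id += 2 * stride)  /* one "processor" */
                if (id + stride < p)
                    partial[id] += partial[id + stride];
        return partial[0];
    }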

16
Overlapping Communication
  • Cannot afford to stall for high latencies
  • even on uniprocessors!
  • Overlap with computation or communication to hide
    latency
  • Requires extra concurrency (slackness), higher
    bandwidth
  • Techniques (a non-blocking sketch follows this list)
  • Prefetching
  • Block data transfer
  • Overlap
  • Multithreading
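
A hedged sketch of overlapping communication with computation, assuming MPI non-blocking calls (the halo/interior names are illustrative):

    #include <mpi.h>

    /* Post the receive early, compute on data that does not depend on
       it, then wait: the transfer proceeds during the compute loop. */
    void exchange_and_compute(double *halo, int n, int nbr,
                              double *interior, int m) {
        MPI_Request req;
        MPI_Irecv(halo, n, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < m; i++)          /* independent work */
            interior[i] = 0.5 * interior[i];
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* halo now usable */
    }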

17
Summary of Tradeoffs
  • Different goals often have conflicting demands
  • Load Balance
  • fine-grain tasks
  • random or dynamic assignment
  • Communication
  • usually coarse grain tasks
  • decompose to obtain locality: not random/dynamic
  • Extra Work
  • coarse grain tasks
  • simple assignment
  • Communication Cost
  • big transfers: amortize overhead and latency
  • small transfers: reduce contention

18
Processor-Centric Perspective
(Figure: execution-time profiles, (a) Sequential and (b) Parallel with four processors P0-P3; time in seconds on each processor is broken into Busy-useful, Busy-overhead, Data-local, Data-remote, and Synchronization components.)
19
Relationship between Perspectives
20
Summary
  • Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy-useful(p) + Data-local(p) + Synch(p) + Data-remote(p) + Busy-overhead(p))
  • Goal is to reduce denominator components
  • Both programmer and system have role to play
  • Architecture cannot do much about load imbalance
    or too much communication
  • But it can
  • reduce incentive for creating ill-behaved
    programs (efficient naming, communication and
    synchronization)
  • reduce artifactual communication
  • provide efficient naming for flexible assignment
  • allow effective overlapping of communication

21
Workload-Driven Architectural Evaluation
22
Evaluation in Uniprocessors
  • Evaluation
  • For existing systems: comparison and procurement evaluation
  • For future systems: careful extrapolation from known quantities
  • Standard benchmarks
  • Measured on wide range of machines and successive
    generations
  • Measurements and technology assessment => features => simulation => new design
  • Simulator: simulate the design with and without a feature
  • Benchmarks: run through the simulator to obtain results
  • Together with cost and complexity, decisions made

23
Difficult Enough for Uniprocessors
  • Workloads need to be renewed and reconsidered
  • Input data sets affect key interactions
  • Changes from SPEC92 to SPEC95 to SPEC98
  • Simulation is time-consuming
  • Accurate simulators costly to develop and verify
  • Good evaluation leads to good design
  • Quantitative evaluation increasingly important
    for multiprocessors
  • Maturity of architecture, and greater continuity
    among generations
  • It's a grounded engineering discipline now
  • Good evaluation is critical, and we must learn to
    do it right

24
More Difficult for Multiprocessors
  • What is a representative workload?
  • Software model has not stabilized
  • Many architectural and application degrees of
    freedom
  • Huge design space: no. of processors, other architectural and application parameters
  • Impact of these parameters and their interactions
    can be huge
  • High cost of communication
  • What are the appropriate metrics?
  • Simulation is expensive
  • Realistic configurations and sensitivity analysis
    difficult
  • Larger design space, but more difficult to cover
  • Understanding of parallel programs as workloads
    is critical
  • Particularly interaction of application and
    architectural parameters

25
A Lot Depends on Sizes
  • Application parameters and no. of procs affect
    inherent properties
  • Load balance, communication, extra work, temporal
    and spatial locality
  • Interactions with organization parameters of
    extended memory hierarchy affect artifactual
    communication and performance
  • Effects often dramatic, sometimes small; application-dependent

(Plots: results for Barnes-Hut and as a function of grid points (n).)
  • Understanding size interactions and scaling
    relationships is key

26
Outline
  • Performance and scaling (of workload and
    architecture)
  • Techniques
  • Implications for behavioral characteristics and
    performance metrics
  • Evaluating a real machine
  • Choosing workloads
  • Choosing workload parameters
  • Choosing metrics and presenting results
  • Evaluating an architectural idea/tradeoff through
    simulation
  • Public-domain workload suites

27
Measuring Performance
  • Absolute performance
  • Most important to end user
  • Performance improvement due to parallelism
  • Speedup(p) = Performance(p) / Performance(1), always
  • Performance = Work / Time, always
  • Work is determined by input configuration of the
    problem
  • If work is fixed, can measure performance as
    1/Time
  • Or retain explicit work measure (e.g.
    transactions/sec, bonds/sec)
  • Still w.r.t. a particular configuration, and what's measured is still time
  • Speedup(p) = Time(1) / Time(p), or operations-per-second(p) / operations-per-second(1) (worked example below)
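
For illustration with assumed numbers (not from the slides): a system that sustains 10,000 transactions/sec on 1 processor and 60,000 transactions/sec on 8 processors has Speedup(8) = 60,000 / 10,000 = 6, the same value the time ratio gives when the work is fixed.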

28
Scaling Why Worry?
  • Fixed problem size is limited
  • Too small a problem
  • May be appropriate for small machine
  • Parallelism overheads begin to dominate benefits
    for larger machines
  • Load imbalance
  • Communication to computation ratio
  • May even achieve slowdowns
  • Doesn't reflect real usage, and inappropriate for large machines
  • Too large a problem
  • Difficult to measure improvement (may not be
    runnable on a single processor)

29
Too Large a Problem
  • Suppose problem realistically large for big
    machine
  • May not fit in small machine
  • Can't run
  • Thrashing to disk
  • Working set doesn't fit in cache
  • Fits at some p, leading to superlinear speedup
  • Finally, users want to scale problems as machines
    grow

30
Demonstrating Scaling Problems
  • Small Ocean and big equation solver problems on
    SGI Origin2000

31
Questions in Scaling
  • Under what constraints to scale the application?
  • What are the appropriate metrics for performance
    improvement?
  • work is not fixed any more, so time alone is not enough
  • How should the application be scaled?
  • Definitions
  • Scaling a machine: can scale power in many ways
  • Assume adding identical nodes, each bringing
    memory
  • Problem size: vector of input parameters, e.g. N = (n, q, Δt)
  • Determines work done
  • Distinct from data set size and memory usage
  • Start by assuming it's only one parameter, n, for simplicity