Dynamic Analysis for Optimizing Software Thread-Level Speculation (TLS)

1
Dynamic Analysis for Optimizing Software
Thread-Level Speculation (TLS)
Cosmin E. Oancea
Alan Mycroft
Cambridge Laboratory, University of Cambridge
2
I. Introduction: Thread-Level Speculation (TLS)
Software TLS: higher granularity than hardware TLS; Wmin < W < Wmax
for (i = 0; i < N; i++) B(i)
Seq:      0 1 2 3 4 5 6 7 ...
Unroll_1: (0 1 2 3) (4 5 6 7) ...
Unroll_2: (0 2 4 6) (1 3 5 7) ...
Diagram is slick, but for:
- serial commit vs. in-place
- dependency violations
- TLS memory overhead
Idea: use a hash-like function from addresses to indexes into a smallish vector used to track dependencies at run-time.
[Figure labels: Program (Original) Memory; Hash Function; TLS's Read Vector]
3
I. Introduction: TLS Example
Schedule iterations of B on a CMP:
P = number of processors, C = number of threads, W = window size of concurrent iterations,
σ(i) = iterations executed by thread i, with 0 ≤ σ(i)_j < W.
Find C, W, σ with C maximal: load-balanced threads, good granularity.
for (i = 0; i < N; i++) B(i)
parfor (t = 0; t < C; t++)
  for (k = 0; k < N/W; k++) {
    for_each (j in σ(t)) B_spec(k*W + j);
    cond_wait;
  }
cond_wait implements the invariant that |Ki - Kj| < C, where Ki is the value of k for thread i. (If C = 4, iterations k = 0 and k = 4 or 5 are not executed concurrently.)
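
The schedule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names schedule, may_overlap and sigma are chosen here and are not from the slides, the per-thread iteration set is passed in as a function, and the cond_wait invariant is read as "the window indices of two threads differ by less than C".

```python
def schedule(n_iters, n_threads, window, sigma):
    """Map each thread t to the iterations k*window + j, for j in sigma(t)."""
    plan = {}
    for t in range(n_threads):
        plan[t] = [k * window + j
                   for k in range(n_iters // window)
                   for j in sigma(t)]
    return plan


def may_overlap(k_i, k_j, n_threads):
    """Assumed reading of the cond_wait invariant: |Ki - Kj| < C."""
    return abs(k_i - k_j) < n_threads


# Hypothetical instance: N = 16, C = 4, W = 4, thread t owns offset t.
plan = schedule(16, 4, 4, lambda t: [t])
print(plan[1])
```

With C = 4, may_overlap(0, 4, 4) is False, which matches the remark that iterations for k = 0 and k = 4 are not executed concurrently.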
4
I. Introduction: Goals

Dynamic analysis framework for lightweight TLS:
Find a thread partitioning that satisfies frequent dependencies and addresses granularity needs.
Fine-grained memory partitioning: identify regular access patterns and compute the TLS model's hash function.
Suitable for both on-line and off-line analysis (lightweight).
Related static approaches are relational (complex); our approach is non-relational (simple).

5
IIa. Thread Partitioning: Examples
I. B(i) is a[i+4] = a[i] + 2. Degree of parallelism is 4 (constant dependence distance)!
a) Possible result: C = 4, W = 4, σ(j) = {j}, i.e. iterations (j + 4Z) = {j, j+4, j+8, ...} execute on thread j.
b) Increasing iteration granularity: C = 4, W = 16, σ(j) = {j, j+4, j+8, j+12}
- expanded iteration 0 executes 0, 4, 8, 12
- applicable only to in-place models
// Assume P = 8
parfor (t = 0; t < C; t++)
  for (k = 0; k < N/W; k++) {
    for_each (j in σ(t)) B(k*W + j);
    partial_barrier;
  }
II. B(i) is a[i] = a[i] + 2. Degree of parallelism is 8: dependence-free!
Increased iteration granularity: C = 8, W = 32, σ(j) = {4j, 4j+1, 4j+2, 4j+3}
- expanded iteration 0 executes 0, 1, 2, 3
- OK for both in-place and serial-commit models
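
Why example I needs a strided partition while example II can use consecutive blocks can be checked mechanically. The sketch below assumes a loop whose only cross-iteration dependency has constant distance 4; the helper names owner and deps_stay_local and the three sigma functions are hypothetical, chosen for illustration.

```python
def owner(i, window, sigma, n_threads):
    """Thread that executes iteration i under the partition (window, sigma)."""
    offset = i % window
    for t in range(n_threads):
        if offset in sigma(t):
            return t
    raise ValueError("iteration offset not assigned to any thread")


def deps_stay_local(n_iters, dist, window, sigma, n_threads):
    """True if every dependency i -> i + dist stays on a single thread."""
    return all(
        owner(i, window, sigma, n_threads) == owner(i + dist, window, sigma, n_threads)
        for i in range(n_iters - dist)
    )


sigma_a = lambda j: [j]                          # case I.a, W = 4
sigma_b = lambda j: [j, j + 4, j + 8, j + 12]    # case I.b, W = 16
sigma_block = lambda j: [4 * j, 4 * j + 1, 4 * j + 2, 4 * j + 3]  # blocks of 4

ok_a = deps_stay_local(64, 4, 4, sigma_a, 4)
ok_b = deps_stay_local(64, 4, 16, sigma_b, 4)
ok_block = deps_stay_local(64, 4, 16, sigma_block, 4)
```

Under these assumptions ok_a and ok_b hold, while ok_block does not: blocks of consecutive iterations split the distance-4 dependency across threads, so they are only suitable when the body is dependence-free, as in example II.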
6
IIb. Memory Partitioning: Example

Partition memory into equivalence classes (hash function computation):
hash(x) = ((x - s) quo q) rem Q, x ~ y iff hash(x) = hash(y)
A speculative read/write(x) operation assumes that all locations in x's equivalence class have been read/written!
Challenge: ensure that false positives do not result in violations.

For (I.b), B is a[i+4] = a[i] + 2: hash(x) = ((x - s) quo 4) rem 4
For (II), B is a[i] = a[i] + 2: hash(x) = ((x - s) quo 16) rem 8

Thread j accesses only addresses x with hash(x) = j!
Small overhead, cache-ideal layout, no violations.
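
The claim that each thread stays inside its own equivalence class can be checked with a small sketch. The assumptions are labeled in the code: 4-byte array elements, a made-up base address BASE standing in for s, and the thread partitions of examples (I.b) and (II). The function names are illustrative only.

```python
ELEM = 4        # assumed element size in bytes
BASE = 0x1000   # hypothetical start address s of array a


def tls_hash(x, s, q, Q):
    """hash(x) = ((x - s) quo q) rem Q, for addresses x >= s."""
    return ((x - s) // q) % Q


def addr(i):
    """Byte address of a[i] under the assumptions above."""
    return BASE + ELEM * i


def classes_Ib(j, n_iters=64):
    """Classes touched by thread j in (I.b): iterations j, j+4, j+8, ..."""
    seen = set()
    for i in range(j, n_iters, 4):
        seen.add(tls_hash(addr(i), BASE, 4, 4))       # a[i]
        seen.add(tls_hash(addr(i + 4), BASE, 4, 4))   # a[i+4]
    return seen


def classes_II(j, n_iters=64):
    """Classes touched by thread j in (II): iterations 32k + 4j + r, r < 4."""
    seen = set()
    for k in range(n_iters // 32):
        for r in range(4):
            seen.add(tls_hash(addr(32 * k + 4 * j + r), BASE, 16, 8))
    return seen
```

Under these assumptions classes_Ib(j) and classes_II(j) both evaluate to the single-element set {j}, so no two threads share a tracking slot.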

7
III. Thread Partitioning: Goals

Identify cross-iteration dependencies (dep).
Classify them as rare events (ignore) or likely events (solve). (Execute iterations involved in likely dependencies on the same thread.)
Model dependencies via congruence relations (in Z×Z).

8
III. Profiling: ADDGs

SPP (speculative program point): an access requiring TLS support.
Gather address-iteration pairs per SPP. If q is an SPP, PIA(q) = {(a, i) | address a was accessed by iteration i}.
E.g. RAW: (a, i) ∈ PIA(p), (a, j) ∈ PIA(q), i < j, q reads, p writes: i is the source, j is the sink (source < sink).
Two SPPs induce a directed acyclic dependency graph (ADDG): nodes are iteration numbers, edges are directed from source to sink.
SPP pairs are used because we aim to identify simple patterns.
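
A minimal sketch of how RAW edges of such a graph could be derived from two profiled address-iteration sets. The function name raw_edges and the trace are hypothetical: the trace imitates a loop that writes a[i] and reads a[i-4] for i = 4..11, with array indices standing in for addresses.

```python
def raw_edges(pia_write, pia_read):
    """Edges (source, sink): iteration `source` wrote an address that a
    later iteration `sink` read."""
    writers = {}
    for a, i in pia_write:
        writers.setdefault(a, []).append(i)
    edges = set()
    for a, j in pia_read:
        for i in writers.get(a, []):
            if i < j:
                edges.add((i, j))
    return edges


D, N = 4, 12
pia_w = {(i, i) for i in range(D, N)}        # (address, iteration) for the write a[i]
pia_r = {(i - D, i) for i in range(D, N)}    # (address, iteration) for the read a[i-D]
edges = raw_edges(pia_w, pia_r)
print(sorted(edges))
```

Every edge connects iterations that are congruent modulo 4, which is the kind of regularity the congruence model is meant to capture.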

9
III. ADDGs: Examples
Three basic patterns: parallelism is still extracted.
int D = 4, B = 128;
for (i = D; i < N; i++) {
  a[i] = ...;                      // PP1
  ... = a[i-D];                    // PP2
  if (i % 8 == 1) ... = a[i-1];    // PP3
  a[i%B] = ...;                    // PP4
  ... = a[i%D];                    // PP5
  if (UnlikelyCond) ... = a[i-1];  // PP6
  e[i] = ...;                      // PP7
  ... = e[N-i];                    // PP8
}
[Figures: ADDG(PP1, PP6); ADDG(PP7, PP8)]
10
III. Set-Congruence Model

Modulo generator: (a,b)<M> = {(x,y) | x ≡ a, y ≡ b (mod M)} = {(a+pM, b+qM) | p,q ∈ N}
Step generator: (a,b)[M> = {(a+kM, b+kM) | k ∈ N} if a < b; {(a+kM, b+(k+1)M) | k ∈ N} otherwise
(a,b)[M> ⊆ (a,b)<M>. E.g. (0,8) ∉ (0,0)[4>
ADDG(PP1,PP2) ⊆ ∪ {(i,i)<4> | 0 ≤ i < 4}; same for ADDG(PP4,PP5)
ADDG(PP1,PP3) = (0,1)[8>
Lift the [M>, <M> notations to sets: S<M> = ∪ {(a,b)<M> | (a,b) ∈ S}
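
The two generators can be written as membership tests. This is a sketch of one reading of the definitions on this slide (p, q, k range over the naturals, and the step generator advances the sink by one extra period when a is not smaller than b); the function names are chosen here.

```python
def in_modulo(pair, a, b, M):
    """(x, y) in the modulo generator of (a, b) with modulus M:
    x = a + p*M and y = b + q*M for naturals p, q."""
    x, y = pair
    return x >= a and y >= b and (x - a) % M == 0 and (y - b) % M == 0


def in_step(pair, a, b, M):
    """(x, y) in the step generator of (a, b) with modulus M:
    x = a + k*M and y = b + k*M if a < b, else y = b + (k+1)*M."""
    x, y = pair
    if x < a or (x - a) % M != 0:
        return False
    k = (x - a) // M
    return y == b + (k if a < b else k + 1) * M


# The slide's example: (0, 8) is in the modulo set of (0, 0) with M = 4,
# but not in the corresponding step set.
example = (in_modulo((0, 8), 0, 0, 4), in_step((0, 8), 0, 0, 4))
```

The step set is the tighter description: for the pattern "iteration 8k writes, iteration 8k+1 reads", in_step((8, 9), 0, 1, 8) holds while in_step((8, 17), 0, 1, 8) does not, although both pairs belong to the modulo set.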

11
III. Set-Congruence Algebra

<m>/(Z/MZ) is the additive subgroup generated by m in Z/MZ.
(a,b)[m> = {(a+km, b+km) | k ∈ Z}
  ⊆ {(a+e+KM, b+e+KM) | K ∈ Z, e ∈ <m>/(Z/MZ)}
  = ∪ (a+e, b+e)[M>, where e ∈ <m>/(Z/MZ)
(a,b)<m> ... ⊆ ... (a,b)<gcd(m,M)>
Similar to Granger's congruences in Z.

12
III. Set-Congruence Algebra (Cont.)

Example: (0,1)<8> ∪ (0,1)<18> ⊆ (0,1)<2>
However: (0,1)[8> ∪ (0,1)<18> ⊆ S<18>, where S = {(0,1), (8,9), (16,17), (6,7), (14,15), (4,5), (12,13), (2,3), (10,11)}
Up to 9 concurrent threads: thread i executes iterations {S_i[0], S_i[1]} + 18Z
Degree of parallelism (ParDeg) of S<M> is M/m, where m is the maximum number of nodes of a connected component of S's ADDG.
Formula unification aims for the best ParDeg and conciseness.
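
The set S in the example above can be reproduced by folding the step pattern with period 8 into residues modulo 18. The sketch below does that and then computes the degree of parallelism; it assumes that the graph nodes are the residues 0..M-1 and that components are found with a simple union-find. Both function names are illustrative.

```python
def fold_step(a, b, m, M):
    """Residue pairs modulo M hit by (a + k*m, b + k*m), k = 0, 1, ...,
    listed in order of first appearance."""
    seen = []
    k = 0
    while True:
        pair = ((a + k * m) % M, (b + k * m) % M)
        if pair in seen:
            return seen
        seen.append(pair)
        k += 1


def par_deg(S, M):
    """M divided by the size of the largest connected component of the
    graph on residues 0..M-1 whose edges are the pairs in S."""
    parent = list(range(M))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in S:
        parent[find(u)] = find(v)
    sizes = {}
    for x in range(M):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return M // max(sizes.values())


S = fold_step(0, 1, 8, 18)
degree = par_deg(S, 18)
```

Here S comes out as the nine pairs listed on the slide and degree is 9, compared with a degree of 2 for the coarser modulus-2 description.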

13
III. Thread Partitioning: Concluding Remarks

ADDG(PP4,PP5): orthogonal pattern (assume D ≠ B)
- repetitive structure: several nodes are the source/sink of all dependencies
- exploitable via barrier-semantic execution: [0,3]<128> executed sequentially, [4,127]<128> concurrently
If M = D = B then ADDG(PP4,PP5) ⊆ ∪ {(i,i)<M> | 0 ≤ i < M}: a useful simplification. No need for synchronization.
Time complexity: O(n log n) on average. Sort addresses per SPP, and build ADDGs only between SPP pairs whose addresses overlap.
Loop nests: start the analysis at the outermost loop of suitable granularity. If the analysis fails, repeat for inner loops.

14
IVa. Coarse-Grained Memory Partitioning

One hash function may not effectively describe all access patterns.
Clustering: partition memory into mostly-disjoint intervals.
Predict how intervals grow, compute boundaries, choose a suitable TLS model.
Safe languages: extra (orthogonal) partitioning (type inf., etc.)

15
IVb. Fine-Grained Memory Partitioning

Interval/congruence analysis maps addresses accessed by different threads into mostly disjoint, coset-based equivalence classes (? two non-trivial SPPs).
Addresses a in the i'th interval are in one equivalence class and correspond to the union of cosets iq + QZ, (iq+1) + QZ, ..., (iq+q-1) + QZ.
Consequences: small memory overhead, optimized cache behaviour (L1 cache).
Guided-search heuristic: finds an optimal solution, and is linear for most practical cases.

16
V. FFT Example & Conclusions
4 processors, profiling when dual = 4: hash(x) = ((x - s) quo 8) rem 4
- Dependency tracking is implemented on a vector with 4 elements
- 91% of hand-parallel speed-up
- Masdupuy's trapezoid analysis: a-2a [0,1] mod 4, where a is the index in array x
for (dual in Powers(2))
  for (a = 1; a < dual; a++)
    for (b = 0; b < n; b += 2*dual) {
      int i = 2*(b+a), j = 2*(b+a+dual);
      x[j] = Exp(x[j+1], x[j]);
      x[i] = Exp(x[i+1], x[i]);
    }
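
One way to see why a 4-element tracking vector can suffice for dual = 4 is to enumerate the addresses each value of a touches and hash them. The sketch rests on assumptions that are not stated on the slide: array x holds 4-byte elements, addresses are byte offsets from s, each body accesses x[i], x[i+1], x[j] and x[j+1], and n = 32 is an arbitrary size. The function names are illustrative.

```python
ELEM = 4    # assumed element size in bytes
DUAL = 4    # the profiled value of dual
N = 32      # hypothetical bound n of the b loop


def fft_hash(offset):
    """hash(x) = ((x - s) quo 8) rem 4, with offset = x - s in bytes."""
    return (offset // 8) % 4


def classes_for(a):
    """Hash classes touched by all iterations of the b loop for a fixed a."""
    seen = set()
    for b in range(0, N, 2 * DUAL):
        i = 2 * (b + a)
        j = 2 * (b + a + DUAL)
        for idx in (i, i + 1, j, j + 1):
            seen.add(fft_hash(ELEM * idx))
    return seen


touched = {a: classes_for(a) for a in range(1, DUAL)}
```

Under these assumptions every a maps to the single class a, so iterations with different a never share a tracking slot. With 8-byte elements the same hash would not separate them, which is why the element size is called out as an assumption.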

Performance results that assume this dynamic analysis yield, on average, 84% of the hand-parallel speed-up, and as high as 323% on a four-processor machine.
