Title: Dynamic Analysis for Optimizing Software Thread-Level Speculation (TLS)
Slide 1: Dynamic Analysis for Optimizing Software Thread-Level Speculation (TLS)
Cosmin E. Oancea
Alan Mycroft
Cambridge Laboratory, University of Cambridge
Slide 2: I. Introduction: Thread-Level Speculation (TLS)
Software TLS: higher granularity than hardware TLS: $W_{min} < W < W_{max}$

    for (i = 0; i < N; i++) B(i);

Seq:      0 1 2 3 4 5 6 7 ...
Unroll_1: (0 1 2 3) (4 5 6 7) ...
Unroll_2: (0 2 4 6) (1 3 5 7) ...
Diagram is slick, but for:
- serial commit vs in-place
- dependency violations
- TLS memory overhead
Idea: use a hash-like function from addresses to indexes into a smallish vector (used to track dependencies at run-time).
[Figure: a hash function maps the program's (original) memory to TLS's read vector.]
Slide 3: I. Introduction: TLS Example
Schedule iterations of B on a CMP:
- P: number of processors
- C: number of threads
- W: window size of concurrent iterations
- $\sigma(i)$: iterations executed by thread i, $0 \le \sigma(i)_j < W$
Find C, W, $\sigma$ with C maximal: load-balanced threads, good granularity.
Sequential loop:

    for (i = 0; i < N; i++) B(i);

Speculative schedule:

    parfor (t = 0; t < C; t++)
        for (k = 0; k < N/W; k++) {
            for_each (j in sigma(t))
                B_spec(k*W + j);
            cond_wait;
        }
cond_wait implements the invariant that $|K_i - K_j| < C$, where $K_i$ is the value of k for thread i. (If C = 4, iterations k = 0 and k = 4 (or 5) are not executed concurrently.)
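One way to read the cond_wait invariant, taking it to be |K_i - K_j| < C (an assumption about the slide's intended formula), is as a predicate that a thread evaluates before entering its next window. The sketch below is hypothetical: `may_enter`, the array `K` of per-thread window counters, and the parameter names are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: K[j] is the window index k that thread j is
   currently executing. Thread `self` may enter window k_next only if the
   assumed invariant |K_i - K_j| < C still holds against every other thread. */
int may_enter(const int *K, int nthreads, int self, int k_next, int C) {
    assert(K != NULL && nthreads > 0 && C > 0);
    for (int j = 0; j < nthreads; j++) {
        if (j == self) continue;
        if (abs(k_next - K[j]) >= C) return 0;  /* too far apart: wait */
    }
    return 1;
}
```

With C = 4 and another thread still at k = 0, a thread asking to enter k = 4 is told to wait, which matches the slide's example.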
Slide 4: I. Introduction: Goals
- Dynamic analysis framework for lightweight TLS
- Find a thread partitioning that satisfies frequent dependencies and addresses granularity needs
- Fine-grained memory partitioning: identify regular access patterns, compute the TLS model's hash function
- Suitable for on-line and off-line analysis: light
- Related static approaches: relational (complex). Our approach: non-relational (simple).
Slide 5: IIa. Thread Partitioning: Examples
I. B(i) is a[i+4] = a[i] + 2. Degree of parallelism is 4 (constant dependency distance)!
a) Possible result: C = 4, W = 4, $\sigma(j) = \{j\}$, i.e. iterations $j + 4\mathbb{Z} = \{j, j+4, j+8, \ldots\}$ execute on thread j
b) Increasing iteration granularity: C = 4, W = 16, $\sigma(j) = \{j, j+4, j+8, j+12\}$
- expanded iteration 0 executes 0, 4, 8, 12
- applicable only to in-place models
    // Assume P = 8
    parfor (t = 0; t < C; t++)
        for (k = 0; k < N/W; k++) {
            for_each (j in sigma(t))
                B(k*W + j);
            partial_barrier;
        }
II. B(i) is a[i] = a[i] + 2. Degree of parallelism is 8 (dependency-free!)
Increased iteration granularity: C = 8, W = 32, $\sigma(j) = \{4j, 4j+1, 4j+2, 4j+3\}$
- expanded iteration 0 executes 0, 1, 2, 3
- OK for both in-place and serial-commit models
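The two schedules above can be checked with a small sketch that maps an iteration number to the thread that owns it. The function names (`strided_owner`, `blocked_owner`, `deps_stay_local`) are hypothetical, and the code assumes that W is a multiple of C and that iteration k*W + j runs on the thread whose set contains j.

```c
#include <assert.h>

/* Strided schedule (case I): thread t owns {t, t+C, t+2C, ...} within
   each window of W iterations. Assumes W is a multiple of C. */
int strided_owner(int i, int C, int W) {
    assert(C > 0 && W > 0 && W % C == 0);
    return (i % W) % C;
}

/* Blocked schedule (case II): thread t owns the g = W/C consecutive
   iterations {t*g, ..., t*g + g - 1} within each window. */
int blocked_owner(int i, int C, int W) {
    assert(C > 0 && W > 0 && W % C == 0);
    return (i % W) / (W / C);
}

/* Returns 1 if, under the strided schedule, every dependency of distance
   `dist` between iterations in [0, n) has source and sink on one thread. */
int deps_stay_local(int n, int dist, int C, int W) {
    for (int i = 0; i + dist < n; i++)
        if (strided_owner(i, C, W) != strided_owner(i + dist, C, W))
            return 0;
    return 1;
}
```

For case I.b (C = 4, W = 16), a dependency of distance 4 always stays on one thread, which is why the degree of parallelism is 4. The same distance under a strided schedule with C = 8 would cross threads.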
Slide 6: IIb. Memory Partitioning: Example
- Partition memory into equivalence classes (hash function computation): hash(x) = ((x - s) quo q) rem Q, with x ~ y iff hash(x) = hash(y)
- A speculative read/write(x) operation assumes that all locations in x's equivalence class have been read/written!
- Challenge: false positives must not result in violations
For (I.b), B is a[i+4] = a[i] + 2: hash(x) = ((x - s) quo 4) rem 4
For (II), B is a[i] = a[i] + 2: hash(x) = ((x - s) quo 16) rem 8
- Thread j accesses addresses x with hash(x) = j!
- Small overhead, cache-ideal layout, no violations
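The hash function above is simple enough to write out directly. The sketch below is illustrative only: `tls_hash` and `elem_hash` are hypothetical names, and the 4-byte element size is an assumption (the slides do not state it, though it fits q = 4 and q = 16).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hash(x) = ((x - s) quo q) rem Q, where s is the start address of the
   tracked region, q the interval width in bytes, Q the number of classes. */
size_t tls_hash(uintptr_t x, uintptr_t s, size_t q, size_t Q) {
    assert(x >= s && q > 0 && Q > 0);
    return (size_t)(((x - s) / q) % Q);
}

/* Class of element a[i], ASSUMING 4-byte elements and taking s as the
   address of a[0] (so only the byte offset 4*i matters). */
size_t elem_hash(size_t i, size_t q, size_t Q) {
    const size_t ELEM_SIZE = 4;  /* assumption, not from the slides */
    return tls_hash((uintptr_t)(ELEM_SIZE * i), 0, q, Q);
}
```

Under these assumptions, in case (I.b) iteration 5 reads a[5] and writes a[9], and both land in class 1, the class of thread 5 mod 4. In case (II) element a[7] lands in class 1, the thread that owns iterations 4 to 7.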
Slide 7: III. Thread Partitioning: Goals
- Identify cross-iteration dependencies (dep)
- Classify them: rare events (ignore) and likely events (solve). (Execute iterations involved in a likely dep on the same thread.)
- Model dependencies via congruence relations (in $\mathbb{Z} \times \mathbb{Z}$)
Slide 8: III. Profiling: ADDGs
- SPP (speculative program point): accesses requiring TLS support
- Gather address-iteration pairs per SPP. If q is an SPP: $PIA(q) = \{(a,i) \mid \text{address } a \text{ was accessed by iteration } i\}$
- E.g. of RAW: $(a,i) \in PIA(p)$, $(a,j) \in PIA(q)$, $i < j$, q reads, p writes; i is the source, j is the sink (source < sink)
- Two SPPs induce a directed acyclic dep graph (ADDG): nodes are iteration numbers, edges are directed from source to sink
- SPP pairs, because we aim to identify simple patterns
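The RAW rule above can be sketched as a scan over the two PIA sets of a writer SPP and a reader SPP. The types `Access` and `Edge` and the function `raw_edges` are hypothetical. The quadratic scan is chosen for brevity and is not the authors' method; sorting the pairs by address first would be faster.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { long addr; int iter; } Access;  /* one (a, i) pair of a PIA set */
typedef struct { int src; int sink; } Edge;      /* ADDG edge: source -> sink */

/* Collects RAW edges between a writing SPP (w) and a reading SPP (r):
   same address, writer iteration strictly before reader iteration.
   Stores at most `cap` edges in out[] and returns the total number found. */
size_t raw_edges(const Access *w, size_t nw,
                 const Access *r, size_t nr,
                 Edge *out, size_t cap) {
    size_t n = 0;
    for (size_t x = 0; x < nw; x++)
        for (size_t y = 0; y < nr; y++)
            if (w[x].addr == r[y].addr && w[x].iter < r[y].iter) {
                if (n < cap) {
                    out[n].src = w[x].iter;
                    out[n].sink = r[y].iter;
                }
                n++;
            }
    return n;
}
```

A read that happens in an earlier iteration than the matching write produces no RAW edge, so the resulting graph only has edges from lower to higher iteration numbers and is acyclic.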
Slide 9: III. ADDGs: Examples
Three basic patterns; parallelism is still extracted.
    int D = 4, B = 128;
    for (i = D; i < N; i++) {
        a[i] = ...;                        // PP1
        ... = a[i-D];                      // PP2
        if (i % 8 == 1) ... = a[i-1];      // PP3
        a[i%B] = ...;                      // PP4
        ... = a[i%D];                      // PP5
        if (UnlikelyCond) ... = a[i-1];    // PP6
        e[i] = ...;                        // PP7
        ... = e[N-i];                      // PP8
    }
[Figures: ADDG(PP1, PP6); ADDG(PP7, PP8)]
Slide 10: III. Set-Congruence Model
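- Modulo generator: $(a,b)_{\langle M \rangle} = \{(x,y) \mid x \equiv a,\ y \equiv b \pmod{M}\} = \{(a+pM,\ b+qM) \mid p,q \in \mathbb{N}\}$
- Step generator: $(a,b)_{M\rangle} = \{(a+kM,\ b+kM) \mid k \in \mathbb{N}\}$ if $a < b$; $\{(a+kM,\ b+(k+1)M) \mid k \in \mathbb{N}\}$ otherwise
- $(a,b)_{M\rangle} \subseteq (a,b)_{\langle M \rangle}$. E.g. $(0,8) \notin (0,0)_{4\rangle}$, although $(0,8) \in (0,0)_{\langle 4 \rangle}$
- ADDG(PP1,PP2) $\subseteq \bigcup_{0 \le i < 4} (i,i)_{\langle 4 \rangle}$; same for ADDG(PP4,PP5)
- ADDG(PP1,PP3) $= (0,1)_{8\rangle}$
- Lift the $M\rangle$ and $\langle M \rangle$ notations to sets: $S_{\langle M \rangle} = \bigcup \{(a,b)_{\langle M \rangle} \mid (a,b) \in S\}$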
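The two generators can be compared with a pair of membership tests. This is a minimal sketch under one reading of the definitions: p, q and k range over the non-negative integers, and the step generator uses b + (k+1)M when a is not less than b. The function names are hypothetical.

```c
#include <assert.h>

/* (x, y) in the modulo generator (a,b)<M>: x = a + pM, y = b + qM, p, q >= 0. */
int in_modulo_gen(long x, long y, long a, long b, long M) {
    assert(M > 0);
    return x >= a && y >= b && (x - a) % M == 0 && (y - b) % M == 0;
}

/* (x, y) in the step generator (a,b)M>: x = a + kM and
   y = b + kM if a < b, else y = b + (k+1)M, for one shared k >= 0. */
int in_step_gen(long x, long y, long a, long b, long M) {
    assert(M > 0);
    if (x < a || (x - a) % M != 0) return 0;
    long k = (x - a) / M;
    return y == b + (a < b ? k : k + 1) * M;
}
```

The step generator ties source and sink to the same k, so it describes a chain of edges such as (0,1), (8,9), (16,17), while the modulo generator also admits pairs like (8,17).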
Slide 11: III. Set-Congruence Algebra
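- $\langle m \rangle / (\mathbb{Z}/M\mathbb{Z})$ is the additive subgroup generated by m in $\mathbb{Z}/M\mathbb{Z}$
- $(a,b)_{m\rangle} = \{(a+km,\ b+km) \mid k \in \mathbb{Z}\} \subseteq \bigcup \{(a+e+KM,\ b+e+KM) \mid K \in \mathbb{Z},\ e \in \langle m \rangle / (\mathbb{Z}/M\mathbb{Z})\} = \bigcup (a+e,\ b+e)_{M\rangle}$, where $e \in \langle m \rangle / (\mathbb{Z}/M\mathbb{Z})$
- $(a,b)_{\langle m \rangle} \ldots \subseteq \ldots (a,b)_{\langle \gcd(m,M) \rangle}$
- Similar to Granger's congruences in $\mathbb{Z}$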
Slide 12: III. Set-Congruence Algebra (Cont.)
- Example: $(0,1)_{\langle 8 \rangle} \cup (0,1)_{\langle 18 \rangle} \subseteq (0,1)_{\langle 2 \rangle}$
- However: $(0,1)_{8\rangle} \cup (0,1)_{\langle 18 \rangle} \subseteq S_{\langle 18 \rangle}$, where $S = \{(0,1),(8,9),(16,17),(6,7),(14,15),(4,5),(12,13),(2,3),(10,11)\}$. Up to 9 concurrent threads: thread i executes iterations $\{S_i[0],\ S_i[1]\} + 18\mathbb{Z}$
- Degree of parallelism (ParDeg) of $S_{\langle M \rangle}$ is M/m, where m is the maximum number of nodes of a connected component of S's ADDG
- Formula unification aims for the best ParDeg and conciseness
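The set S in the example is the pair (0,1) shifted by each element of the additive subgroup generated by 8 in Z/18Z. The sketch below enumerates such a subgroup; `subgroup` is a hypothetical helper, not a function from the slides.

```c
#include <assert.h>

/* Writes the elements of <m>, the additive subgroup generated by m in
   Z/MZ, into out[] in generation order 0, m, 2m, ... (mod M).
   Stores at most `cap` elements and returns the subgroup's size. */
int subgroup(int m, int M, int *out, int cap) {
    assert(M > 0 && m >= 0);
    int n = 0, e = 0;
    do {
        if (n < cap) out[n] = e;
        n++;
        e = (e + m) % M;
    } while (e != 0);
    return n;
}
```

For m = 8 and M = 18 this yields 0, 8, 16, 6, 14, 4, 12, 2, 10, so S = {(e, e+1)} has nine pairs, one per concurrent thread. Each pair forms a two-node component, which agrees with ParDeg = 18/2 = 9.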
Slide 13: III. Thread Partitioning: Concluding Remarks
- ADDG(PP4,PP5): orthogonal pattern (assume $D \ll B$)
  - repetitive structure: several nodes are the source/sink of all deps
  - exploitable via barrier-semantic execution: $[0,3]_{\langle 128 \rangle}$ executed sequentially, $[4,127]_{\langle 128 \rangle}$ concurrently
- If M = D = B then ADDG(PP4,PP5) $\subseteq \bigcup_{0 \le i < M} (i,i)_{\langle M \rangle}$: a useful simplification. No need for synchronization.
- Time complexity: average O(n log n). Sort addresses per SPP, build ADDGs only between SPP pairs whose addresses overlap.
- Loop nests: start the analysis at the outermost loop of suitable granularity. If the analysis fails, repeat for inner loops.
Slide 14: IVa. Coarse-Grained Memory Partitioning
- One hash function may not effectively describe all access patterns
- Clustering: partition memory into mostly-disjoint intervals
- Predict how intervals grow, compute boundaries, choose a suitable TLS model
- Safe languages: extra (orthogonal) partitioning (type inf., etc.)
Slide 15: IVb. Fine-Grained Memory Partitioning
- Interval/congruence analysis to map addresses accessed by different threads into mostly disjoint, coset-based equivalence classes (? two non-trivial SPPs)
- Addresses a in the i'th interval are in one equivalence class and correspond to the union of cosets $iq + Q\mathbb{Z},\ (iq+1) + Q\mathbb{Z},\ \ldots,\ (iq+q-1) + Q\mathbb{Z}$
- Consequences: small memory overhead, optimized cache behaviour (L1 cache)
- Guided-search heuristic: finds an optimal solution and is linear for most practical cases
Slide 16: V. FFT Example and Conclusions
4 processors, profiling when dual = 4: hash(x) = ((x - s) quo 8) rem 4
- Dependency tracking is implemented on a vector with 4 elements
- 91% of the hand-parallel speed-up
- Masdupuy's trapezoid analysis: $\alpha - 2a \in [0,1] \bmod 4$, where $\alpha$ is the index in array x
    for (dual in Powers(2))
        for (a = 1; a < dual; a++)
            for (b = 0; b < n; b += 2*dual) {
                int i = 2*(b+a), j = 2*(b+a+dual);
                x[j] = Exp(x[j+1], x[j]);
                x[i] = Exp(x[i+1], x[i]);
            }
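As a sanity check on the FFT hash function, the sketch below walks the two inner loops for dual = 4 and tests whether every element touched by a given value of a falls into the single class a mod 4. It rests on two assumptions that the slides do not state: the elements of x are 4 bytes wide, and s is the address of x[0]. Both function names are hypothetical.

```c
#include <assert.h>

/* hash(x) = ((x - s) quo 8) rem 4 for element index idx,
   ASSUMING 4-byte elements and s = address of x[0]. */
static int class_of(long idx) {
    return (int)(((4 * idx) / 8) % 4);
}

/* Returns 1 if, for dual = 4, all four elements accessed by each (a, b)
   iteration map to class a % 4; returns 0 at the first counterexample. */
int fft_accesses_stay_in_class(long n) {
    const long dual = 4;
    for (long a = 1; a < dual; a++)
        for (long b = 0; b < n; b += 2 * dual) {
            long i = 2 * (b + a), j = 2 * (b + a + dual);
            long idx[4] = { i, i + 1, j, j + 1 };
            for (int t = 0; t < 4; t++)
                if (class_of(idx[t]) != a % 4) return 0;
        }
    return 1;
}
```

Under those assumptions each value of a owns one class, which is consistent with tracking dependencies on a vector of only 4 elements.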
- Performance results that assume this dynamic analysis yield, on average, 84% of the hand-parallel speed-up, and as high as 323% on a four-processor machine