Dynamic Analysis for Optimizing Software Thread-Level Speculation (TLS)

1
Dynamic Analysis for Optimizing Software
Thread-Level Speculation (TLS)
Cosmin E. Oancea
Alan Mycroft
Cambridge Laboratory, University of Cambridge
2
I. Introduction: Thread-Level Speculation (TLS)
Software TLS: higher granularity than hardware TLS; Wmin < W < Wmax
for (i = 0; i < N; i++) B(i)
Seq: 0 1 2 3 4 5 6 7 ...
Unroll_1: (0 1 2 3) (4 5 6 7) ...
Unroll_2: (0 2 4 6) (1 3 5 7) ...
Diagram is slick, but for:
- serial commit vs in-place
- dependency violations
- TLS memory overhead
Idea: use a hash-like function from addresses to indexes into a smallish
vector used to track dependencies at run-time.
[Figure: Program (Original) Memory -> Hash Function -> TLS's Read Vector]
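A minimal Python sketch of the idea on this slide, under assumptions the slide does not state: the vector has a fixed number of slots, each slot remembers the highest iteration that speculatively read any address hashing to it, and a write by an earlier iteration to the same slot is reported as a possible violation. The names ReadVector, spec_read and spec_write, and the plain modulo hash, are hypothetical and chosen for illustration only.

```python
class ReadVector:
    """Hypothetical dependence tracker: one slot per hash bucket."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        # -1 means "no speculative read recorded for this bucket yet".
        self.last_reader = [-1] * num_slots

    def slot(self, address):
        # Assumed hash-like function: plain modulo, for illustration only.
        return address % self.num_slots

    def spec_read(self, address, iteration):
        s = self.slot(address)
        self.last_reader[s] = max(self.last_reader[s], iteration)

    def spec_write(self, address, iteration):
        """Return True if a later iteration already read this bucket,
        i.e. a (possibly false-positive) read-after-write violation."""
        return self.last_reader[self.slot(address)] > iteration


rv = ReadVector(num_slots=4)
rv.spec_read(address=10, iteration=5)                    # iteration 5 reads bucket 2
violation = rv.spec_write(address=10, iteration=3)       # earlier iteration writes it
false_positive = rv.spec_write(address=14, iteration=3)  # other address, same bucket
no_violation = rv.spec_write(address=11, iteration=3)    # bucket 3 was never read
```

Because the vector is much smaller than the program's memory, distinct addresses can share a slot; the false_positive case shows the price paid for the reduced memory overhead.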
3
I. Introduction: TLS Example
Schedule iterations of B on a CMP. P = number of processors,
C = number of threads, W = window size of concurrent iterations,
σ(i) = iterations executed by thread i, 0 ≤ σ(i)_j < W.
Find C, W, σ with: C maximal, load-balanced threads, good granularity.
for (i = 0; i < N; i++) B(i)
parfor (t = 0; t < C; t++)
  for (k = 0; k < N/W; k++) {
    for_each (j in σ(t)) B_spec(k*W + j)
    cond_wait
  }
cond_wait implements the invariant that |Ki - Kj| < C, where Ki is the
value of k for thread i. (If C = 4, iterations k = 0 and k = 4 or 5 are
not executed concurrently.)
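To make the schedule concrete, the following Python sketch lists the iterations each thread would run under the parallel loop above. It is only an illustration: the per-thread iteration set is passed in as a plain function called sigma, iterations_of is a hypothetical helper, and neither speculation nor cond_wait is modelled.

```python
def iterations_of(t, sigma, W, N):
    """Iterations k*W + j run by thread t, for k in [0, N/W) and j in sigma(t)."""
    return [k * W + j for k in range(N // W) for j in sigma(t)]


# Example values (assumed): C = 4 threads, window W = 4, sigma(t) = {t}.
C, W, N = 4, 4, 16
schedule = {t: iterations_of(t, lambda t: [t], W, N) for t in range(C)}

# Every iteration in [0, N) should be executed exactly once across the threads.
covered = sorted(i for its in schedule.values() for i in its)
```

With these values thread 1 runs iterations 1, 5, 9 and 13, and the C threads together cover 0..N-1 without overlap.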
4
I. Introduction: Goals
  • Dynamic analysis framework for lightweight TLS
  • Find a thread partitioning that satisfies frequent
    dependencies and addresses granularity needs
  • Fine-grained memory partitioning: identify regular
    access patterns, compute the TLS model's hash function
  • Suitable for on-line and off-line analysis: light
  • Related static approaches: relational (complex).
    Our approach: non-relational (simple).

5
IIa. Thread Partitioning: Examples
I. B(i) is a[i+4] = a[i] + 2. Degree of parallelism is 4 (constant
dependence distance)!
a) Possible result: C = 4, W = 4, σ(j) = {j}, i.e. iterations
   (j + 4Z) = {j, j+4, j+8, ...} execute on thread j
b) Increasing iteration granularity: C = 4, W = 16,
   σ(j) = {j, j+4, j+8, j+12}
   - expanded iteration 0 executes 0, 4, 8, 12
   - applicable only to in-place models
// Assume P = 8
parfor (t = 0; t < C; t++)
  for (k = 0; k < N/W; k++) {
    for_each (j in σ(t)) B(k*W + j)
    partial_barrier
  }
II. B(i) is a[i] = a[i] + 2. Degree of parallelism is 8: dependence-free!
Increased iteration granularity: C = 8, W = 32,
σ(j) = {4j, 4j+1, 4j+2, 4j+3}
   - expanded iteration 0 executes 0, 1, 2, 3
   - OK for both in-place and serial-commit
6
IIb. Memory Partitioning: Example
  • Partition memory into equivalence classes (hash function
    computation): hash(x) = ((x - s) quo q) rem Q, x ~ y iff
    hash(x) = hash(y)
  • A speculative read/write(x) operation assumes that all locations
    in x's equivalence class have been read/written!
  • Challenge: false positives should not result in
    violations

For (I.b), B is a[i+4] = a[i] + 2: hash(x) = ((x - s) quo 4) rem 4
For (II), B is a[i] = a[i] + 2: hash(x) = ((x - s) quo 16) rem 8
  • Thread j accesses addresses x with hash(x) = j!
  • Small overhead, cache-ideal layout, no
    violations
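
The two hash functions for examples I.b and II can be checked with a short Python sketch. It assumes 4-byte array elements and a base address s = 0, neither of which is stated on the slide, and it assumes that an iteration i of example I.b touches elements i and i+4 while an iteration of example II touches only element i. The names hash_addr and thread_hashes are hypothetical.

```python
ELEM = 4  # assumed element size in bytes
S = 0     # assumed base address s of array a


def hash_addr(x, q, Q, s=S):
    """hash(x) = ((x - s) quo q) rem Q."""
    return ((x - s) // q) % Q


def thread_hashes(iters, touched, q, Q):
    """Set of hash values of all addresses touched by the given iterations."""
    return {hash_addr(S + ELEM * e, q, Q) for i in iters for e in touched(i)}


# Example I.b: thread j runs the iterations congruent to j mod 4;
# iteration i is assumed to touch elements i and i+4. q = 4, Q = 4.
ib = {j: thread_hashes(range(j, 64, 4), lambda i: (i, i + 4), 4, 4)
      for j in range(4)}

# Example II: thread j runs 4j..4j+3 in every window of 32 iterations;
# iteration i is assumed to touch only element i. q = 16, Q = 8.
ii = {j: thread_hashes([k * 32 + 4 * j + r for k in range(4) for r in range(4)],
                       lambda i: (i,), 16, 8)
      for j in range(8)}
```

In both cases every address touched by thread j hashes to j, so under these assumptions each thread works on its own slot of the dependence-tracking vector.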

7
III. Thread Partitioning: Goals
  • Identify cross-iteration dependencies (dep)
  • Classify them: rare events (ignore) and likely
    events (solve). (Execute iterations involved in
    likely dependencies on the same thread.)
  • Model dependencies via congruence relations (in
    Z×Z)

8
III. Profiling: ADDGs
  • SPP (speculative program point): accesses requiring
    TLS support
  • Gather address-iteration pairs per SPP. If q is an
    SPP, PIA(q) = {(a,i) | address a was accessed by
    iteration i}
  • E.g. of RAW: (a,i) ∈ PIA(p), (a,j) ∈ PIA(q), i < j, q
    reads, p writes: i is the source, j is the sink
    (source < sink)
  • Two SPPs induce a directed acyclic dependency graph
    (ADDG): nodes are iteration numbers, edges are
    directed from source to sink
  • SPP pairs, because we aim to identify simple
    patterns
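
As an illustration of the profiling step, the Python sketch below builds the RAW edges of an ADDG from two sets of address-iteration pairs. It is a simplified reading of the slide, not the authors' implementation: build_addg is a hypothetical name, a dictionary lookup stands in for whatever address matching the tool really uses, and the profile data is made up.

```python
def build_addg(pia_write, pia_read):
    """Edges (source, sink) for RAW dependencies between two program points.

    pia_write, pia_read: iterables of (address, iteration) pairs.
    """
    writers = {}
    for addr, it in pia_write:
        writers.setdefault(addr, []).append(it)
    edges = set()
    for addr, sink in pia_read:
        for source in writers.get(addr, []):
            if source < sink:  # the writing iteration must precede the reader
                edges.add((source, sink))
    return edges


# Made-up profile: p writes a[i] and q reads a[i-4], for i in 4..11.
pia_p = [(i, i) for i in range(4, 12)]
pia_q = [(i - 4, i) for i in range(4, 12)]
addg = build_addg(pia_p, pia_q)
```

Here every edge connects an iteration to the one four steps later, which is the kind of regular pattern the congruence model is meant to capture.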

9
III. ADDGs: Examples
Three basic patterns: parallelism is still extracted
int D = 4, B = 128;
for (i = D; i < N; i++) {
  a[i] = ...;                      // PP1
  ... = a[i-D];                    // PP2
  if (i % 8 == 1) ... = a[i-1];    // PP3
  a[i+B] = ...;                    // PP4
  ... = a[i+D];                    // PP5
  if (UnlikelyCond) ... = a[i-1];  // PP6
  e[i] = ...;                      // PP7
  ... = e[N-i];                    // PP8
}
[Figures: ADDG(PP1, PP6) and ADDG(PP7, PP8)]
10
III. Set-Congruence Model
  • Modulo generator: (a,b)<M> = {(x,y) | x ≡ a, y ≡ b
    (mod M)} = {(a+pM, b+qM) | p,q ∈ N}
  • Step generator: (a,b)[M> = {(a+kM, b+kM) | k ∈ N}
    if a < b, {(a+kM, b+(k+1)M) | k ∈ N}
    otherwise
  • (a,b)[M> ⊆ (a,b)<M>. E.g. (0,8) ∉ (0,0)[4>
  • ADDG(PP1,PP2) ⊆ ∪{(i,i)<4> | 0 ≤ i < 4}; same for
    ADDG(PP4,PP5)
  • ADDG(PP1,PP3) = (0,1)[8>
  • Lift the [M>, <M> notations to sets:
    S<M> = ∪{(a,b)<M> | (a,b) ∈ S}
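
The modulo and step generators can be written down in a few lines of Python. This is a sketch of the definitions as they read on this slide (pairs with both components congruent modulo M, versus pairs that advance in lock-step by M); the function names in_modulo and step are hypothetical, and N is taken to include 0.

```python
def in_modulo(pair, a, b, M):
    """True if (x, y) = (a + p*M, b + q*M) for some p, q >= 0."""
    x, y = pair
    return x >= a and y >= b and (x - a) % M == 0 and (y - b) % M == 0


def step(a, b, M, count):
    """First `count` pairs produced by the step generator for (a, b) and M."""
    shift = 0 if a < b else 1  # when a >= b the sink runs one period ahead
    return [(a + k * M, b + (k + shift) * M) for k in range(count)]


pairs_0_1_mod_8 = step(0, 1, 8, 3)
pairs_0_0_mod_4 = step(0, 0, 4, 3)
```

Every pair produced by step is accepted by in_modulo for the same a, b and M, but not the other way round: (0, 8) satisfies the modulo generator for (0, 0) and M = 4, yet the step generator never produces it.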

11
III. Set-Congruence Algebra
  • <m>/(Z/MZ) is the additive subgroup generated by m
    in Z/MZ
  • (a,b)[m> = {(a+km, b+km) | k ∈ Z}
    ⊆ ∪{(a+e+KM, b+e+KM) | K ∈ Z, e ∈ <m>/(Z/MZ)}
    = ∪(a+e, b+e)[M>, where e ∈ <m>/(Z/MZ)
  • (a,b)<m> ... ⊆ ... (a,b)<gcd(m,M)>. Similar to
    Granger's congruences in Z
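
The gcd rule can be sanity-checked numerically. The Python sketch below enumerates modulo-generator pairs up to a bound and confirms that two congruence sets with moduli m and M are both contained in the one with modulus gcd(m, M). The moduli 8 and 18, the bound, and the helper name modulo_pairs are choices made for illustration.

```python
from math import gcd


def modulo_pairs(a, b, M, bound):
    """All (x, y) below `bound` with x = a + p*M and y = b + q*M, p, q >= 0."""
    return {(x, y) for x in range(a, bound, M) for y in range(b, bound, M)}


m, M, bound = 8, 18, 80
g = gcd(m, M)
union = modulo_pairs(0, 1, m, bound) | modulo_pairs(0, 1, M, bound)
joined = modulo_pairs(0, 1, g, bound)
```

The containment is strict: the coarser set also holds pairs such as (2, 3) that neither original set contains, which is the precision lost by merging two formulas into one.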

12
III. Set-Congruence Algebra (Cont.)
  • Example: (0,1)<8> ∪ (0,1)<18> ⊆ (0,1)<2>
  • However: (0,1)[8> ∪ (0,1)<18> ⊆ S<18>, where S =
    {(0,1), (8,9), (16,17), (6,7), (14,15), (4,5), (12,13),
    (2,3), (10,11)}. Up to 9 concurrent threads: thread
    i executes iterations {S_i[0], S_i[1]} + 18Z
  • Degree of parallelism (ParDeg) of S<M> is M/m,
    where m is the maximum number of nodes of a connected
    component of S's ADDG
  • Formula unification aims at the best ParDeg and
    conciseness
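
The set S and its degree of parallelism can be reproduced with a short Python sketch. It reduces the step-generator pairs for (0, 1) with period 8 modulo 18, then applies the ParDeg formula by treating the residues as graph nodes and each pair as an edge. Treating residues as the ADDG nodes is an interpretation of the slide, and step_residues and par_deg are hypothetical names.

```python
def step_residues(a, b, m, M):
    """Distinct residues mod M of the pairs (a + k*m, b + k*m), in order of k."""
    seen, out, k = set(), [], 0
    while True:
        pair = ((a + k * m) % M, (b + k * m) % M)
        if pair in seen:
            return out
        seen.add(pair)
        out.append(pair)
        k += 1


def par_deg(S, M):
    """M divided by the size of the largest connected component of S."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in S:
        parent[find(u)] = find(v)  # union the two endpoints
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return M // max(sizes.values())


S = step_residues(0, 1, 8, 18)
degree = par_deg(S, 18)
```

The nine pairs form nine components of two nodes each, so the formula gives 18/2 = 9, matching the "up to 9 concurrent threads" stated above.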

13
III. Thread Partitioning: Concluding Remarks
  • ADDG(PP4,PP5): orthogonal pattern (assume D ?
    B)
  • Repetitive structure: several nodes are the
    source/sink of all dependencies
  • Exploitable: barrier-semantic execution;
    [0,3]<128> executed sequentially, [4,127]<128>
    concurrently
  • If MDB then ADDG(PP4,PP5) ⊆ ∪{(i,i)<M> | 0 ≤ i < M}:
    useful simplification. No need for
    synchronization.
  • Time complexity: average O(n log(n)). Sort
    addresses per SPP, build ADDGs only between SPP
    pairs whose addresses overlap.
  • Loop nests: start the analysis at the outermost
    loop of suitable granularity. If the analysis fails,
    repeat for inner loops.

14
IVa. Coarse-Grained Memory Partitioning
  • One hash function may not effectively describe
    all access patterns
  • Clustering: partition memory into mostly-disjoint
    intervals
  • Predict how intervals grow, compute boundaries,
    choose a suitable TLS model
  • Safe languages: extra (orthogonal) partitioning (type
    inf, etc.)

15
IVb. Fine-Grained Memory Partitioning
  • Interval/congruence analysis to map addresses
    accessed by different threads into mostly disjoint,
    coset-based equivalence classes (? two
    non-trivial SPPs)
  • Addresses a in the i'th interval are in one equivalence
    class and correspond to the union of cosets
    iq + QZ, (iq+1) + QZ, ..., (iq+q-1) + QZ
  • Consequences: small memory overhead, optimized cache
    behaviour (L1 cache)
  • Guided-search heuristic: finds an optimal solution and is
    linear for most practical cases

16
V. FFT Example and Conclusions
4 processors, profiling when dual = 4:
hash(x) = ((x - s) quo 8) rem 4
- Dependency tracking is implemented on a vector with 4 elements
- 91% of hand-parallel speed-up
- Masdupuy's trapezoid analysis: a-2a 0,1 mod 4, where
  a = index in array x
for (dual in Powers(2))
  for (a = 1; a < dual; a++)
    for (b = 0; b < n; b += 2*dual) {
      int i = 2*(b+a), j = 2*(b+a+dual);
      x[j] = Exp(x[j+1], x[j]);
      x[i] = Exp(x[i+1], x[i]);
    }
  • Performance results that assume this dynamic
    analysis yield on average 84% of the
    hand-parallel speed-up, and as high as 323% on a
    four-processor machine