Title: Compactly Representing Parallel Program Executions
1. Compactly Representing Parallel Program Executions
- Ankit Goel, Abhik Roychoudhury, Tulika Mitra
- National University of Singapore
2. Path profiles
- Profiling a program's execution
- Count based
- Path based
- Count-based profiles are more aggregate
- # of executions of the program's basic blocks
- # of accesses of various memory locations
- Path-based profiles are more accurate
- Sequence of basic blocks executed
- Sequence of memory locations accessed
- Use online compression to generate compact path profiles
3. Organization
- Compressed Path Profiles in Sequential Programs
- Parallel Program Path Profiles
- Compression Efficiency and Overheads
- Data race detection over path profiles
4. Compressed Path - Example
- Uncompressed path: 123123
- Compressed representation: S → AA, A → 123
[Figure: control flow graph with basic blocks 1, 2, 3]
5. Online Path Compression
- A program path is a string over a finite alphabet
- Alphabet decided by what we instrument
- Control flow (basic blocks executed)
- Data flow (memory locations accessed)
- A string s is represented by a context-free grammar Gs; the language of Gs is exactly {s}
- Construction of Gs is online, not post-mortem
- Start with the trivial grammar; modify it for each symbol
- No recursive rules (DAG representation)
- Compression scheme: Nevill-Manning and Witten '97
- Application to program paths: Larus '99
6. Online Compression in action
Path executed → Compressed representation
- 1: S → 1
- 12: S → 12
- 123: S → 123
- 1231: S → 1231
- 12312: S → 12312, rewritten as S → A3A, A → 12 (digram 12 now occurs twice)
7. Online Compression in action
Path executed → Compressed representation
- 123123: S → A3A3, A → 12
- Digram A3 repeats, so: S → BB, B → A3, A → 12
- Rule A is now used only once and is inlined: S → BB, B → 123
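The digram-replacement idea traced above can be sketched offline. Note this is a simplified, non-incremental Re-Pair-style approximation of the Nevill-Manning/Witten scheme (the actual algorithm maintains the grammar online, one symbol at a time); the `compress` helper and rule names are illustrative.

```python
def compress(seq):
    """Repeatedly replace the most frequent digram with a fresh rule
    (offline Re-Pair-style sketch of grammar compression)."""
    rules, names, s = {}, iter("ABCDEFGH"), list(seq)
    while True:
        # Count all adjacent pairs (digrams) in the current string.
        counts = {}
        for pair in zip(s, s[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        best = max(counts, key=counts.get, default=None)
        if best is None or counts[best] < 2:
            break
        name = next(names)
        rules[name] = list(best)
        # Replace non-overlapping occurrences left to right.
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                out.append(name)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    rules["S"] = s
    # Rule utility: inline any rule referenced exactly once.
    changed = True
    while changed:
        changed = False
        for name in [n for n in rules if n != "S"]:
            uses = sum(body.count(name) for k, body in rules.items() if k != name)
            if uses == 1:
                body = rules.pop(name)
                for k in rules:
                    rules[k] = sum(([*body] if sym == name else [sym]
                                    for sym in rules[k]), [])
                changed = True
                break
    return rules
```

On `123123` this yields S → BB, B → 123, the same final grammar as on the slide.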
8. Organization
- Compressed Path Profiles in Sequential Programs
- Parallel Program Path Profiles
- Compression Efficiency and Overheads
- Data race detection over path profiles
9. What to represent?
- Control/data flow in each program thread
- Communication among threads
- Synchronization (locks, barriers)
- Unsynchronized shared variable accesses
- Too costly to observe/record the order of all shared variable accesses
- We will represent
- Compressed flow in each thread (via grammar)
- Communication via synchronizations (how?)
10. Synchronization Pattern (Locks)
[Figure: Message Sequence Chart (MSC) for program P1 || P2 with shared memory: P1 performs lock, Compute, unlock; P2 then performs lock, unlock]
11. Synchronization Pattern (Barrier)
[Figure: MSC of a barrier: P1 and P2 each send ready to memory and block; memory replies go to both; then P1 and P2 compute]
12. Connection to MSCs
- Partial order of the MSC matches the observed ordering
- Total order in each thread
- Ordering across threads visible via synchronization (msg. exchange)
- All synchronization ops. form a total order
[Figure: MSC with Th. 1, Th. 2, and shared memory exchanging lock/unlock messages]
13. A first cut
- Instrument each thread to observe local control/data flow and global synch.
- Represent the path profile of P1 || P2 as
- Each thread's flow as a grammar (G1, G2)
- Contains synch. ops. as well
- All synchronization ops. as a list
- Associate entries in this list with the occurrences of synch. ops. in (G1, G2)
- How to navigate the path profile?
- Zoom in to a specific lock/unlock segment of P1
14. Edge annotations
- Thread's path: a b (lock) c (unlock) x b (lock) c (unlock) y
[Figure: grammar for one thread, drawn as a DAG with rules S and A over terminals a, b, c, x, y; edges are annotated with counts (0, 1, 2, 4) of the synchronization operations they cover]
15. Locating synch. operations
[Figure: the same annotated grammar DAG; with n synch ops. in total, the edge counts guide a root-to-leaf descent]
- Locating the 3rd synchronization operation: follow edge counts down from the root
- Can find synch. segments by looking up the global list
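The count-guided descent can be sketched as follows. Here `RULES` and `SYNC` are hypothetical stand-ins for the annotated grammar of the example thread (path a b(lock) c(unlock) x b(lock) c(unlock) y), and counts are memoized on demand rather than stored on edges as in the slides.

```python
# Hypothetical grammar for the example thread:
# S -> a A x A y, A -> b c; b carries the lock, c the unlock.
RULES = {"S": ["a", "A", "x", "A", "y"], "A": ["b", "c"]}
SYNC = {"b", "c"}
MEMO = {}

def sync_count(sym):
    """Number of synchronization ops produced by expanding sym."""
    if sym not in MEMO:
        if sym in RULES:
            MEMO[sym] = sum(sync_count(ch) for ch in RULES[sym])
        else:
            MEMO[sym] = 1 if sym in SYNC else 0
    return MEMO[sym]

def locate(k):
    """Root-to-leaf path of the k-th (1-based) synch op,
    found without expanding the compressed string."""
    path, sym = ["S"], "S"
    while sym in RULES:
        for ch in RULES[sym]:
            c = sync_count(ch)
            if k <= c:        # k-th op lies inside this child
                path.append(ch)
                sym = ch
                break
            k -= c            # skip past this child's ops
    return path
```

For example, `locate(3)` descends S → A → b, identifying the second lock without decompressing the grammar.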
16. So far
- Control flow of each thread stored as a grammar
- Synchronization ops. form a global list
- Grammar of each thread annotated with counts
- Easy searching of synchronization operations
- What about shared data accesses?
- The sequence of memory locations accessed by a single LD/ST instruction can be compressed
- Use a grammar representation for this sequence as well
17. Further compression
- Locations accessed by a memory operation
- 10, 14, 18, 22, 26, 54, 58, 62, 66, 70, 98
- Online compression of the string as a grammar
- Difference representation + run-length encoding: 10(1), 4(4), 28(1), 4(4), 28(1)
- Useful for detecting regularity of array accesses
- Sweep through an array: a run of constant diffs
- Accessing a sub-grid of a multidimensional array
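The difference-plus-run-length step can be sketched directly (`delta_rle` is an illustrative helper name, not from the paper):

```python
def delta_rle(addrs):
    """Encode an address stream as (first address, run-length
    encoded list of successive differences)."""
    runs = []
    for a, b in zip(addrs, addrs[1:]):
        d = b - a
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + 1)  # extend current run
        else:
            runs.append((d, 1))              # start a new run
    return addrs[0], runs
```

On the stream above it returns `(10, [(4, 4), (28, 1), (4, 4), (28, 1)])`, i.e. the 10(1), 4(4), 28(1), 4(4), 28(1) form on the slide.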
18. Organization
- Compressed Path Profiles in Sequential Programs
- Parallel Program Path Profiles
- Compression Efficiency and Overheads
- Data race detection over path profiles
19. Any better than gzip?
[Chart: compression achieved by our scheme vs. gzip, 2 processors]
20. Scalability of Compression
[Chart: compression for our scheme as the number of processors increases]
21. Concerns about Timing Overheads
- Our scheme does not add substantial time overhead over grammar-based string compression
- Our experiments were conducted using RSIM
- Tracing overheads can be higher on a real multiprocessor
- Can tracing distort program behavior?
- Possible solution
- Trace the minimal number of operations in a parallel program execution (Netzer 1993) needed to ensure deterministic replay
- Collect the compressed path profile during replay
22. Organization
- Compressed Path Profiles in Sequential Programs
- Parallel Program Path Profiles
- Compression Efficiency and Overheads
- Data race detection over path profiles
23. Apparent Data races
- Last unlock in Th. 1 (first unlock)
- Next lock in Th. 1 (second lock)
- Locate root-to-leaf paths of these ops.
- Tree rooted at the least common ancestor of these ops.
- No decompression of the grammar of Th. 1
[Figure: MSC with Th. 1, Th. 2, Th. 3 and memory; lock/unlock pairs delimit the synchronization segments examined for races]
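At the granularity of synchronization segments, an apparent race is a conflicting access pair in two segments unordered by the observed synchronization order. A minimal sketch over hypothetical per-segment read/write sets (not the paper's tree-walk implementation):

```python
from itertools import combinations

def apparent_races(segments, happens_before):
    """segments: seg_id -> {'rd': set, 'wr': set} of shared locations.
    happens_before: transitively closed set of (earlier, later) pairs.
    Reports locations with conflicting accesses in unordered segments."""
    races = []
    for a, b in combinations(segments, 2):
        if (a, b) in happens_before or (b, a) in happens_before:
            continue  # ordered by synchronization: no race
        sa, sb = segments[a], segments[b]
        # Conflict: one segment writes a location the other touches.
        conflict = (sa["wr"] & (sb["rd"] | sb["wr"])) | (sb["wr"] & sa["rd"])
        races += [(loc, a, b) for loc in sorted(conflict)]
    return races
```

Ordering two conflicting segments by a synchronization edge removes the report, matching the lock/unlock reasoning above.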
24. Data race artifacts
- Th. 1: Sub = 1; A[1] = 0
- Th. 2: X = Sub; Y = A[X] (the race on A is an artifact)
- X decides which address is accessed in Y = A[X]
- X is set by Sub = 1, which is itself in a data race
- Detecting artifacts requires data flow, not captured by rd/wr sets in synch. segments
- Captured in our compact path profiles
25. Summary
- Compressed representation of the execution profile of shared-memory parallel programs
- Control and shared data flow per thread
- Synchronization patterns across threads
- Overall compression efficiency: 0.25 -- 9.81
- Compression efficiency scales with increasing number of processors
- Application: post-mortem debugging, such as detecting data races
26. Other Applications
- We do not capture the actual order of unsynchronized shared memory accesses across processors
- Such ordering can be useful in making architectural decisions, such as the choice of cache coherence protocol
- Sufficient to maintain (Netzer 1993):
- transitive reduction of program order on each proc.
- shared variable conflict orders
- Can we capture the transitive reduction relation via annotations of WPP edges?
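Netzer-style transitive reduction keeps only ordering edges not implied by other paths; a minimal sketch for acyclic edge sets (`transitive_reduction` is an illustrative helper, not the paper's algorithm):

```python
from collections import defaultdict

def transitive_reduction(edges):
    """Drop edge (a, b) when b is reachable from a without using it.
    Correct for acyclic orderings such as program/conflict order."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)

    def reachable(src, dst, skipped):
        # DFS from src to dst, ignoring the candidate edge itself.
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            for m in succ[n]:
                if (n, m) == skipped or m in seen:
                    continue
                if m == dst:
                    return True
                seen.add(m)
                stack.append(m)
        return False

    return {e for e in edges if not reachable(e[0], e[1], e)}
```

For example, with program-order edges 1→2→3 plus the redundant 1→3, only the two chain edges survive.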