Data-parallel Abstractions for Irregular Applications

Slides: 50

Provided by: Ping60

Learn more at: https://www.cs.utexas.edu


1
Data-parallel Abstractions for Irregular
Applications

Keshav Pingali
University of Texas, Austin

2
Motivation

Multicore processors are here
but no one knows how to program them
A few domains have succeeded in exploiting
parallelism
Databases: billions of SQL queries are run in
parallel every day
Computational science
Both these domains deal with structured data
Databases: relations
Computational science: mostly dense and sparse
arrays
Universal parallel computing
Unstructured data is the norm: graphs, trees,
lists,
What can we do to make it easier for programs
that manipulate unstructured data to exploit
multicore parallelism?

3
Organization of talk

Two case studies
Delaunay mesh refinement
Agglomerative clustering
=> Irregular programs have generalized
data-parallelism
Galois system exploiting generalized
data-parallelism
Programming model
Implementation
Experimental evaluation
Ongoing work
Exploiting locality
Scheduling

4
Two case studies
5
Delaunay Mesh Refinement

Meshes useful for
Finite element method for solving PDEs
Graphics rendering
Delaunay meshes (2-D)
Triangulation of a surface, given vertices
Delaunay property: circumcircle of any triangle
does not contain another point in the mesh
In practice, want all triangles in mesh to meet
certain quality constraints
(e.g.) no angle > 120°
Mesh refinement
fix bad triangles through iterative refinement
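
A minimal Python sketch of the two geometric checks mentioned on this slide: the Delaunay (empty circumcircle) test and the angle-based quality test. The function names and the use of plain floating-point arithmetic are assumptions made for illustration, not part of the talk.

```python
import math

def circumcenter(a, b, c):
    """Return the center of the circle through the 2-D points a, b, c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("points are collinear")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc."""
    center = circumcenter(a, b, c)
    return math.dist(center, p) < math.dist(center, a)

def largest_angle_deg(a, b, c):
    """Largest interior angle of triangle abc, via the law of cosines."""
    s = sorted([math.dist(b, c), math.dist(a, c), math.dist(a, b)])
    cos_largest = (s[0] ** 2 + s[1] ** 2 - s[2] ** 2) / (2 * s[0] * s[1])
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_largest))))

def is_bad(a, b, c, max_angle=120.0):
    """Quality constraint from the slide: no angle greater than max_angle."""
    return largest_angle_deg(a, b, c) > max_angle
```

A triangle whose circumcircle contains some other mesh point violates the Delaunay property; a triangle for which `is_bad` returns true is a candidate for refinement.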

6
Refinement Algorithm
while there are bad triangles {
    pick a bad triangle
    add new vertex at center of circumcircle
    gather all triangles that no longer satisfy
        Delaunay property into cavity
    re-triangulate affected region, including new point
    // some new triangles may be bad themselves
}

7
Sequential Algorithm
Mesh m = /* read in mesh */;
WorkList wl;
wl.add(m.badTriangles());
while (true) {
    if (wl.empty()) break;
    Element e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);    // determine new cavity
    c.expand();
    c.retriangulate();           // re-triangulate region
    m.update(c);                 // update mesh
    wl.add(c.badTriangles());
}
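
The worklist structure of this loop can be tried out on a much simpler stand-in problem. The sketch below is an assumption-laden analogy, not Delaunay refinement: the "mesh" is a set of 1-D intervals, an interval is "bad" if it is longer than `max_len`, and "retriangulating" means splitting it at its midpoint. The name `refine` and the interval representation are hypothetical.

```python
from collections import deque

def refine(intervals, max_len):
    """Split intervals until none is longer than max_len (worklist style)."""
    mesh = set(intervals)
    # Initial worklist: every bad element currently in the mesh.
    wl = deque(iv for iv in mesh if iv[1] - iv[0] > max_len)
    while wl:
        lo, hi = wl.popleft()
        if (lo, hi) not in mesh:      # element no longer in mesh
            continue
        mid = (lo + hi) / 2
        pieces = [(lo, mid), (mid, hi)]
        mesh.remove((lo, hi))         # update mesh
        mesh.update(pieces)
        # Some of the new elements may be bad themselves.
        wl.extend(p for p in pieces if p[1] - p[0] > max_len)
    return sorted(mesh)
```

As in the slide, the loop discovers new work while it runs, so the number of iterations is not known up front.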
8
Refinement Example
Original Mesh
Refined Mesh
9
Properties of algorithm

Actual code is far more complex
boundaries, especially non-convex boundaries are
a pain
Average work per triangle (measured on Itanium):
1M instructions, 100K floating-pt instructions
Don't-care non-determinism
Cavities of bad triangles may overlap
Therefore final mesh may depend on order in which
bad triangles are processed
Any order will end up with a good mesh (in 2-D)
Number of bad triangles fixed by algorithm may be
different for different orders
Heuristics for ordering bad triangles for
processing are known
Not widely used

10
Parallelization Opportunities

Unit of work: fixing a bad triangle
Data-parallelism: bad triangles with
non-overlapping cavities can be processed in
parallel
No obvious way to tell if cavities of two bad
triangles will overlap without actually building
cavities
=> compile-time parallelization will not work
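
To make the overlap condition concrete, here is a small Python sketch that assumes the cavities have already been built at run time and are given as sets of triangle identifiers. It greedily picks bad triangles whose cavities are pairwise disjoint. The function name and data layout are hypothetical.

```python
def conflict_free_batch(cavities):
    """Pick bad triangles whose cavities do not overlap.

    cavities maps a bad-triangle id to the set of triangle ids in its cavity.
    """
    claimed = set()   # triangles already owned by a selected cavity
    batch = []
    for tri, cavity in cavities.items():
        if claimed.isdisjoint(cavity):
            batch.append(tri)
            claimed |= cavity
    return batch
```

The check needs the cavities themselves, which exist only once the program is running on a particular input mesh.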

11
Agglomerative Clustering

Input
Set of data points
Measure of distance (similarity) between them
Output: dendrogram
Tree that exposes similarity hierarchy
Applications
Data mining
Graphics: lightcuts for rendering with large
numbers of light sources

12
Clustering algorithm

Sequential algorithm: iterative
Find two closest points in data set
Cluster them in dendrogram
Replace pair in data set with a supernode that
represents pair
Placement of supernode: use heuristics like
center of mass
Repeat until there is only one point left

13
Key Data Structures

Priority queue
Elements are pairs <p,n> where
p is point in data set
n is its nearest neighbor
Ordered by increasing distance
kdTree
Answers queries for nearest neighbor of a point
Convention: if there is only one point, nearest
neighbor is point at infinity (ptAtInfinity)
Similar to a binary search tree but in higher
dimensions

14
Clustering algorithm implementation
kdTree = new KDTree(points)
pq = new PriorityQueue()
for each p in points (pq.add(<p, kdTree.nearest(p)>))
while (true) do {
    if (pq.size() == 0) break
    pair <p,n> = pq.get()          // get closest pair
    ...
    Cluster c = new Cluster(p,n)   // create supernode
    dendrogram.add(c)
    kdTree.remove(p)               // update kdTree
    kdTree.remove(n)
    kdTree.add(c)
    Point m = kdTree.nearest(c)    // update priority queue
    ...
    pq.add(<c,m>)
}
15
Clustering algorithm details
kdTree = new KDTree(points)
pq = new PriorityQueue()
for each p in points (pq.add(<p, kdTree.nearest(p)>))
while (true) do {
    if (pq.size() == 0) break
    pair <p,n> = pq.get()
    if (p.isAlreadyClustered()) continue
    if (n.isAlreadyClustered()) {
        pq.add(<p, kdTree.nearest(p)>)
        continue
    }
    Cluster c = new Cluster(p,n)
    dendrogram.add(c)
    kdTree.remove(p)
    kdTree.remove(n)
    kdTree.add(c)
    Point m = kdTree.nearest(c)
    if (m != ptAtInfinity) pq.add(<c,m>)
}
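
A runnable Python sketch of the same loop, under simplifying assumptions: a brute-force search stands in for the kd-tree, `None` plays the role of ptAtInfinity, and the supernode is placed at the midpoint of the pair rather than a true center of mass. `Node`, `nearest` and `cluster` are hypothetical names.

```python
import heapq
import itertools
import math

class Node:
    """A data point or a supernode in the dendrogram."""
    def __init__(self, pos, left=None, right=None):
        self.pos, self.left, self.right = pos, left, right
        self.clustered = False

def nearest(p, live):
    """Brute-force stand-in for kdTree.nearest(p); None means no neighbor."""
    others = [q for q in live if q is not p]
    return min(others, key=lambda q: math.dist(p.pos, q.pos)) if others else None

def cluster(points):
    """Return the dendrogram root for a list of coordinate tuples."""
    live = [Node(p) for p in points]
    tie = itertools.count()           # tie-breaker so the heap never compares Nodes
    pq = []

    def push(p):
        n = nearest(p, live)
        if n is not None:             # skip the "point at infinity" case
            heapq.heappush(pq, (math.dist(p.pos, n.pos), next(tie), p, n))

    for p in list(live):
        push(p)
    root = live[0] if live else None
    while pq:
        _, _, p, n = heapq.heappop(pq)
        if p.clustered:
            continue
        if n.clustered:               # stale neighbor: recompute and retry
            push(p)
            continue
        p.clustered = n.clustered = True
        live.remove(p)
        live.remove(n)
        mid = tuple((a + b) / 2 for a, b in zip(p.pos, n.pos))
        root = Node(mid, p, n)        # create supernode
        live.append(root)
        push(root)
    return root
```

For the 1-D points 0, 1 and 10, the sketch first clusters 0 with 1 and then joins that supernode with 10.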
16
Parallelization Opportunities

Natural unit of work: processing of a pair in PQ
Algorithm appears to be sequential
pair enqueued in one iteration into PQ may be the
pair dequeued in next iteration
However, in example, <a,b> and <c,d> can be
clustered in parallel
Cost per pair in graphics app:
100K instructions, 4K floating-point operations

17
Take-away lessons

Irregular programs have data-parallelism
Data-parallelism has been studied in the context
of arrays
For unstructured data, data-parallelism arises
from work-lists of various kinds
Delaunay mesh refinement: list of bad triangles
Agglomerative clustering: priority queue of pairs
of points
Maxflow algorithms: list of active nodes
Boykov-Kolmogorov algorithm for image
segmentation
Preflow-push algorithm
Approximate SAT solvers
.
Data-parallelism in irregular programs is
obscured within while loops, exit conditions,
etc.
Need transparent syntax similar to FOR loops for
structured data-parallelism

18
Take-away lessons (contd.)

Parallelism may depend on data values
whether or not two potential data-parallel
computations conflict may depend on input data
(e.g.) Delaunay mesh generation depends on shape
of mesh
Optimistic parallelization is necessary in
general
Compile-time approaches using points-to analysis
or shape analysis may be adequate for some cases
In general, runtime conflict-checking is needed
Handling of conflicts depends on the application
Delaunay mesh generation: roll back all but one
conflicting computation
Agglomerative clustering: must respect priority
queue order

19
Current approaches to optimistic parallelization
20
Manual approaches

Time-warp (1986)
Optimistic event-driven simulation
Distributed-memory computing model
Buffering of speculative state/roll-backs/commits
implemented manually for particular application
Pthreads: hand-coded optimistic parallelization
Most current implementations of Delaunay mesh
refinement use this approach
Writing correct fine-grain locking code is tricky
code tends to be very unstructured and complex
tripled software costs for Unreal game engine

21
System support

Hardware/software support for
buffering speculative state
detecting dependence violations by tracking
reads and writes to memory locations
rollback/commit
Implementations
Thread-level speculation (TLS)
Transactional memory
TLS
Speculative execution of DO-loops with irregular
array accesses (Padua/Rauchwerger/Torrellas/)
Most implementations do not target while loops
Only data speculation, no control speculation
Hardware support can be fairly complex
Mis-speculations limit speed-up (SUIF study)
Transactional memory
Leverage cache-coherence hardware (Herlihy/Moss)
Support for optimistic synchronization in
explicitly parallel programming model

22
Limitations of TLS/TM

Applications require unbounded while-loops
Most algorithms involve a work-list of some kind
Detect more work as you traverse data structure
to perform work
Detecting dependence violations by tracking reads
and writes to memory locations results in lots of
spurious conflicts
Example: Delaunay mesh refinement
Regardless of how worklist is implemented, there
must be a location head that points to next bad
triangle in list
Every thread must read and write this location to
get work
When should an update made by a thread be made
visible to other threads?
As soon as thread pulls work from worklist:
cascading roll-backs are possible
Only after thread finishes its work: only one bad
triangle can be processed at a time
Other data structure manipulations
(PQ,kdTree,graph) may have similar problems

23
Galois programming model and implementation
24
Beliefs underlying Galois system

Optimistic parallelism is the only general
approach to parallelizing irregular apps
Static analysis can be used to optimize
optimistic execution
Concurrency should be packaged within syntactic
constructs that are natural for application
programmers and obvious to compilers and runtime
systems
Libraries/runtime system should manage
concurrency (cf. SQL)
Application code should be sequential
Crucial to exploit abstractions provided by
object-oriented languages
in particular, distinction between abstract data
type and its implementation type
Concurrent access to shared mutable objects is
essential

25
Components of Galois approach

Two syntactic constructs for packaging optimistic
parallelism as iteration over sets
Assertions about methods in class libraries
Runtime system for detecting and recovering from
potentially unsafe accesses by optimistic
computations

26
Concurrency constructs: two set iterators

for each e in Set S do B(e)
evaluate block B(e) for each element in set S
sequential implementation
set elements are unordered, so no a priori order
on iterations
there may be dependences between iterations
set S may get new elements during execution
for each e in PoSet S do B(e)
evaluate block B(e) for each element in set S
sequential implementation
perform iterations in order specified by poSet
there may be dependences between iterations
set S may get new elements during execution
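
The sequential semantics of the two iterators can be sketched in Python as follows. This is only an illustration of the behavior described on the slide (iteration over a set that may grow, with or without an order), not the Galois implementation. The names `for_each_unordered` and `for_each_ordered`, and the convention that the body receives an `add` callback, are assumptions.

```python
import heapq
import itertools

def for_each_unordered(initial, body):
    """Run body(e, add) for every element; add(e) may grow the set."""
    work = list(initial)
    while work:
        e = work.pop()                # any element is a legal choice
        body(e, work.append)

def for_each_ordered(initial, body, key=lambda e: e):
    """Like for_each_unordered, but always picks the smallest-key element."""
    tie = itertools.count()           # keeps heap entries comparable
    heap = [(key(e), next(tie), e) for e in initial]
    heapq.heapify(heap)

    def add(e):
        heapq.heappush(heap, (key(e), next(tie), e))

    while heap:
        _, _, e = heapq.heappop(heap)
        body(e, add)
```

In the ordered version, an element added during an iteration is visited before any pending element with a larger key, which is the behavior agglomerative clustering relies on.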

27
Galois version of mesh refinement
Mesh m = /* read in mesh */;
Set wl;
wl.add(m.badTriangles());        // non-deterministic order
for each e in Set wl do {        // unordered iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);    // determine new cavity
    c.expand();                  // determine affected triangles
    c.retriangulate();           // re-triangulate region
    m.update(c);                 // update mesh
    wl.add(c.badTriangles());    // add new bad triangles to workset
}
28
Observations

Application program has a well-defined sequential
semantics
No notion of threads/locks/critical sections etc.
Set iterators
SETL language was probably first to introduce set
iterators
However, SETL set iterators did not permit the
sets being iterated on to grow during execution,
which is important for our applications

29
Parallel computational model

Object-based shared-memory model
Computation performed by some number of threads
Threads can have their own local memory
Threads must invoke methods to access internal
state of objects
mesh refinement: shared objects are
worklist
Mesh
agglomerative clustering
priority queue
kdTree
dendrogram

30
Parallel execution of iterators

Master thread and some number of worker threads
master thread begins execution of program and
executes code between iterators
when it encounters iterator, worker threads help
by executing some iterations concurrently with
master
threads synchronize by barrier synchronization at
end of iterator
Key technical problem
Parallel execution must respect sequential
semantics of application program
result of parallel execution must appear as
though iterations were performed in some
interleaved order
for poSet iterator, this order must correspond to
poSet order
Non-trivial problem
each iteration may access mutable shared objects

31
Implementing semantics of iterators

Concurrent method invocations that modify object
should not step on each other (mutual exclusion)
Library writer uses locks or some other mutex
mechanism
Locks acquired during method invocation and
released when method invocation ends
Uncontrolled interleaving may violate iterator
semantics
In (a), contains?(x) must always return false but
some interleavings will violate this (e.g.,
add(x), contains?(x), remove(x))
Sometimes, interleaving is OK and is needed for
concurrency
In (b) (motivated by Delaunay mesh refinement),
method invocations can be interleaved provided
result of get() is not argument of add()

32
(II) Assertions on methods
Concurrent accesses to a mutable object by
multiple threads are OK provided method
invocations commute

[Figure: threads invoking get() and add() on objects in shared memory; two interleavings are shown, get() add() get() add() and get() get() add() add()]
33
Assertions on methods (contd.)
get() add() get() add()
get() get() add() add()
?

Semantic commutativity vs. concrete commutativity
for most implementations of workset, concrete
data structure will be different for these two
sequences, so commutativity fails
however, at semantic level, these set operations
commute provided they operate on different set
elements
Conclusion
semantic commutativity is crucial
class implementor must specify this information
Commutativity of method invocations, not methods
get() commutes with add() only if element
inserted by add() is not the same as the element
returned by get()
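
A small Python illustration of the difference, assuming a hypothetical list-backed set: adding x then y leaves a different concrete list than adding y then x, yet the abstract set is the same. The `invocations_commute` helper is a deliberately coarse assumption (two invocations commute exactly when they involve different elements) that mirrors the rule stated above for get() and add().

```python
class ListSet:
    """A set whose concrete representation is an insertion-ordered list."""
    def __init__(self):
        self.items = []

    def add(self, x):
        if x not in self.items:
            self.items.append(x)

    def abstract(self):
        """The abstract value: which elements are in the set."""
        return frozenset(self.items)

def run(elements):
    s = ListSet()
    for x in elements:
        s.add(x)
    return s

def invocations_commute(inv1, inv2):
    """Each invocation is (method_name, element); for get(), the element returned."""
    (_, x), (_, y) = inv1, inv2
    return x != y

a, b = run(["x", "y"]), run(["y", "x"])
print(a.items == b.items)            # False: concrete lists differ
print(a.abstract() == b.abstract())  # True: same abstract set
```

A conflict check based on the concrete lists would reject the reordering; one based on the abstract set accepts it.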

34
Assertions on methods (contd.)

Updates to objects happen before iteration
completes (eager commit)
So we need a way of undoing the effect of a
method invocation
Class implementer must provide an inverse
method
As before, semantic inverse is key, not concrete
inverse

35
Example set
Class SetInterface
    void add(Element x)
        conflicts: add(x), remove(x), contains?(x), get() x
        inverse: remove(x)
    void remove(Element x)
        conflicts: add(x), remove(x), contains?(x), get() x
        inverse: add(x)
36
Remarks

Commutativity information is optional
No commutativity information for a mutable object
means only one iteration can manipulate the
object at a time
Inverse method is more or less essential
for a class w/o commutativity information,
inverse methods can be implemented by data
copying
Difficulty of writing specifications
in our apps, most shared objects are collections
(sets, bags, maps)
(e.g.), kdTree is simply a set with a
nearestNeighbor operation
writing specifications is quite easy
Relationship to Abelian group axioms
commutativity, inverse, identity

37
(III) Runtime system: commit pool

Maintains iteration record for each ongoing
iteration in system
Status of iteration
running
ready-to-commit (RTC)
aborted
Life-cycle of iteration
thread goes to commit pool for work
commit pool
obtains next element from iterator
assigns priority to iteration based on priority of
element in set
creates an iteration record with status running
when iteration completes
status of iteration record is set to RTC
when that record has highest priority in system,
it is allowed to commit
if commutativity conflict is detected
commit pool arbitrates to determine which
iteration(s) should be aborted
commit pool executes undo logs of aborted
iterations
Role of commit pool is similar to that of reorder
buffer in out-of-order execution microprocessors
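
The reorder-buffer analogy can be sketched in a few lines of Python. This is a single-threaded toy, assuming that a smaller number means higher priority and leaving out conflict arbitration and undo logs; `CommitPool` and its method names are hypothetical.

```python
import heapq

class CommitPool:
    """Commits finished iterations strictly in priority order."""
    def __init__(self):
        self._heap = []        # (priority, seq) of iterations not yet retired
        self._status = {}      # record -> "running" | "RTC" | "aborted"
        self._seq = 0
        self.committed = []    # priorities, in commit order

    def start(self, priority):
        rec = (priority, self._seq)
        self._seq += 1
        self._status[rec] = "running"
        heapq.heappush(self._heap, rec)
        return rec

    def finish(self, rec):
        self._status[rec] = "RTC"       # ready to commit
        self._drain()

    def abort(self, rec):
        self._status[rec] = "aborted"
        self._drain()

    def _drain(self):
        # Retire records from the head while the head is not still running.
        while self._heap and self._status[self._heap[0]] != "running":
            rec = heapq.heappop(self._heap)
            if self._status.pop(rec) == "RTC":
                self.committed.append(rec[0])
```

An iteration that finishes early waits in the RTC state until every higher-priority iteration has committed or aborted, much as a completed instruction waits in a reorder buffer.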

38
(III) Runtime system: conflict logs

Each object has a conflict log
Contains sequence of method invocations that have
been performed by ongoing iterations
Each thread has undo log that contains sequence
of inverse method invocations it must execute if
it aborts
When thread invokes method m on object O
Check if m commutes with method invocations and
their inverses in conflict log of object O
If so, add m to conflict log of object O, and
m^-1 (the inverse of m) to undo log of thread, and execute method
Otherwise, iteration aborts
When thread commits iteration
Remove its invocations from conflict logs of all
objects it has touched
Zero out its undo log
Easy to extend this to support nested method
invocations
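
The bookkeeping described on this slide can be sketched for a single shared set. The Python below is a single-threaded simulation under stated assumptions: two invocations are treated as commuting exactly when they touch different elements, and all class and method names (`SpeculativeSet`, `Iteration`, `Abort`) are hypothetical.

```python
class Abort(Exception):
    """Raised when an invocation does not commute with an ongoing one."""

class Iteration:
    def __init__(self, name):
        self.name = name
        self.undo_log = []     # inverse invocations, oldest first

class SpeculativeSet:
    def __init__(self, items=()):
        self.items = set(items)
        self.conflict_log = []  # (iteration, element) for ongoing iterations

    def _invoke(self, it, x, do, undo):
        # Commutativity check against other iterations' logged invocations.
        for other, y in self.conflict_log:
            if other is not it and y == x:
                raise Abort(f"{it.name} conflicts with {other.name} on {x!r}")
        self.conflict_log.append((it, x))
        it.undo_log.append(undo)
        do()

    def add(self, it, x):
        if x in self.items:    # semantic inverse of a no-op add is a no-op
            self._invoke(it, x, lambda: None, lambda: None)
        else:
            self._invoke(it, x, lambda: self.items.add(x),
                         lambda: self.items.discard(x))

    def remove(self, it, x):
        if x in self.items:
            self._invoke(it, x, lambda: self.items.discard(x),
                         lambda: self.items.add(x))
        else:
            self._invoke(it, x, lambda: None, lambda: None)

    def _forget(self, it):
        self.conflict_log = [(o, y) for o, y in self.conflict_log if o is not it]
        it.undo_log.clear()

    def commit(self, it):
        self._forget(it)

    def abort(self, it):
        for undo in reversed(it.undo_log):   # undo newest first
            undo()
        self._forget(it)
```

Updates take effect immediately (eager commit), so an aborting iteration must replay its undo log in reverse before its entries leave the conflict log.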

39
Experiments
40
Experimental Setup

Machines
4-processor 1.5 GHz Itanium 2
16 KB L1, 256 KB L2, 3MB L3 cache
no shared cache between processors
Red Hat Linux
Dual processor, dual core 3.0 GHz Xeon
32 KB L1, 4 MB L2 cache
dual cores share L2
Red Hat Linux

41
Delaunay mesh generation

Mesh implemented as a graph
each triangle is a node
edges in graph represent triangle adjacencies
used adjacency list representation of graph
Input mesh
from Shewchuk's Triangle program
10,156 triangles of which 4,837 were bad
Galois work-set implementation
used STL queue first: high abort ratio
Sequential code: 21,918 completed, 0 aborted
Galois(q): 21,736 completed, 28,290 aborted
replaced queue with array + random choice
Galois(r): 21,908 completed, 49 aborted
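
One way to build an "array plus random choice" work-set is sketched below in Python. The class name, the seed parameter, and the swap-with-last removal trick are assumptions for illustration; the slide does not say how the Galois work-set was coded.

```python
import random

class RandomWorkset:
    """Work-set that hands out a uniformly random pending element."""
    def __init__(self, items=(), seed=None):
        self._items = list(items)
        self._rng = random.Random(seed)

    def add(self, x):
        self._items.append(x)

    def empty(self):
        return not self._items

    def get(self):
        # Swap a random element into the last slot, then pop it in O(1).
        i = self._rng.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()
```

Unlike a FIFO queue, this work-set does not hand out elements in the order they were added, so threads running concurrently are less likely to be given items that were enqueued together.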

42
Code versions

Three versions
reference: sequential version without
locks/threads/etc.
FGL: handwritten code that uses fine-grain locks
on triangles
meshgen: Galois version

43
Results
44
Performance Breakdown
4 processor numbers are summed over all
processors
45
Agglomerative clustering

Two versions
reference: sequential version w/o locks/threads
treebuild: Galois version
Data structures
priority queue
kd-tree
dendrogram
Data set
from graphics scene with roughly 50,000 light
sources

46
Speedups

sequential version is best on 1 processor
self-relative speed-up of almost 2.75 on 4
processors

47
Abort ratios and CPI
          Committed iterations   Aborted iterations
1 proc    57486                  n/a
4 proc    57861                  2528

Sequential and treebuild perform almost same
number of instructions
As before, cycles/instruction (CPI) is higher for
treebuild mainly because of L3 cache misses
mainly from kdTree

48
Degree of speculation

Measured number of iterations ready to commit
(RTC) whenever commit pool creates/aborts/commits
an iteration
Histogram shown above
X-axis in figure is truncated to show detail near
origin
maximum number of RTC iterations is 120
Most of the time, we do not need to speculate too
deeply to keep 4 threads busy
but on occasion, we do need to speculate deeply

49
Take-away points

Support for ordering speculative computations is
very useful for some apps
hard to do agglomerative clustering otherwise
May need to speculate deeply in some apps
Domain-specific information is very useful for
proper scheduling
workset implementation made a huge difference in
performance
will probably need to provide hooks for user to
specify scheduling policy
Reducing cache traffic is important to improve
performance further

50
Ongoing work
51
Improving Performance

Locality enhancement
Galois approach can expose data-parallelism in
irregular applications
Scalable exploitation of parallelism requires
attending to locality
Specifying scheduling strategies
Delaunay mesh refinement example shows that
scheduling of iterations can be critical to lower
abort ratios
needed domain knowledge to fix problem

52
Galois methodology

How easy is it to specify commutativity of method
invocations?
How important is the distinction between semantic
and concrete commutativity?
How easy is it to write inverse methods?
Given a specification of the ADT, can we check
commutativity and inverse directives?

53
Benchmarks

Existing benchmarks are useless
Wirth: Program = Algorithm + Data structure
current benchmarks are programs
we need algorithms and data structures
experience with Delaunay mesh generation: STL
queue
variety of input data sets to illustrate range of
behavior

54
Conclusions

Irregular programs have data-parallelism
Work-list based iterative algorithms over
irregular data structures
Data-parallelism may be inherently data-dependent
Pointer/shape analysis cannot work for these apps
Optimistic parallelization is essential for such
apps
Analysis might be useful to optimize parallel
program execution
Exploiting abstractions provided by OO is
critical
Only CS people still worry about F77 and C
anyway.
Exploiting high-level semantic information about
programs is critical
Galois knows about sets and ordered sets
Commutativity information is crucial
Support for ordering speculative computations
important
Concurrent access to mutable objects is important
Benchmark programs are bad
Programs ?
Algorithms + data structures ?

55
Current approaches to parallelization
56
Pessimistic parallelization

Use compiler (static) analyses to produce
parallel schedule
(e.g.) points-to/shape analysis
Problem
Any static technique must produce a parallel
schedule that is valid for all inputs to the
program
In our applications, parallelism is very
dependent on input data
Conclusion: static techniques must serialize all
iterations
accuracy of analysis is not the issue

57
Semi-static techniques

Inspector-executor technique
inspector: small, fast program that uses input
data to produce parallel schedule
executor: runs actual program using input data
and parallel schedule from inspector
Inspector generated by hand or by compiler
Applications
Parallel sparse matrix factorization
Communication schedules for parallel sparse
direct solvers
Problem
data-sets in our problems change as programs
execute
Inspector must do all the work of the executor
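
For contrast, here is a hedged Python sketch of inspector-executor on a problem where it does fit: solving a sparse lower-triangular system L x = b, whose dependence structure is fixed once L is known. The inspector groups rows into wavefronts and the executor solves wavefront by wavefront. The dict-of-dicts matrix layout and both function names are assumptions.

```python
def inspector(L):
    """Group rows of lower-triangular L into wavefronts of independent rows.

    L[i] maps column j to the nonzero L[i][j], diagonal included.
    """
    level = {}

    def depth(i):
        if i not in level:
            deps = (depth(j) for j in L[i] if j != i)
            level[i] = 1 + max(deps, default=-1)
        return level[i]

    waves = {}
    for i in L:
        waves.setdefault(depth(i), []).append(i)
    return [waves[k] for k in sorted(waves)]

def executor(L, b, schedule):
    """Forward substitution following the inspector's schedule."""
    x = {}
    for wave in schedule:
        for i in wave:   # rows within a wave could run in parallel
            s = sum(v * x[j] for j, v in L[i].items() if j != i)
            x[i] = (b[i] - s) / L[i][i]
    return x
```

Here the inspector only reads the sparsity pattern. In mesh refinement or clustering the corresponding "pattern" changes as the program runs, which is the problem named on the slide.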

58
Take-away lessons (contd.)

Updates to data structures by one computation
must be visible to other computations before
first one terminates
(e.g.) worklist
thread grabs a bad triangle from worklist
before bad triangle processing is done, another
thread may want a bad triangle from worklist
modifications to worklist made by first thread
must be visible to second thread before first
thread completes
similar issues arise with priority queue and
kdTree
But how do we prevent cascading roll-backs?