Data-parallel Abstractions for Irregular Applications - PowerPoint PPT Presentation

1
Data-parallel Abstractions for Irregular
Applications
  • Keshav Pingali
  • University of Texas, Austin

2
Motivation
  • Multicore processors are here
  • but no one knows how to program them
  • A few domains have succeeded in exploiting
    parallelism
  • Databases: billions of SQL queries are run in
    parallel every day
  • Computational science
  • Both these domains deal with structured data
  • Databases: relations
  • Computational science: mostly dense and sparse
    arrays
  • Universal parallel computing
  • Unstructured data is the norm: graphs, trees,
    lists, ...
  • What can we do to make it easier for programs
    that manipulate unstructured data to exploit
    multicore parallelism?

3
Organization of talk
  • Two case studies
  • Delaunay mesh refinement
  • Agglomerative clustering
  • ⇒ Irregular programs have generalized
    data-parallelism
  • Galois system: exploiting generalized
    data-parallelism
  • Programming model
  • Implementation
  • Experimental evaluation
  • Ongoing work
  • Exploiting locality
  • Scheduling

4
Two case studies
5
Delaunay Mesh Refinement
  • Meshes useful for
  • Finite element method for solving PDEs
  • Graphics rendering
  • Delaunay meshes (2-D)
  • Triangulation of a surface, given vertices
  • Delaunay property: circumcircle of any triangle
    does not contain another point in the mesh
  • In practice, want all triangles in mesh to meet
    certain quality constraints
  • (e.g.) no angle > 120°
  • Mesh refinement:
  • fix bad triangles through iterative refinement
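The Delaunay property can be checked with the standard in-circle determinant. A minimal sketch (points as (x, y) tuples in counter-clockwise order; the function name is illustrative, not from the talk):

```python
def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of
    triangle (a, b, c), given in counter-clockwise order.
    Uses the 3x3 in-circle determinant: positive means inside."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    return det > 0
```

A triangle is "bad" for refinement if some point violates this test or a quality constraint fails; production code uses exact arithmetic to make the determinant sign robust.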

6
Refinement Algorithm
while there are bad triangles
  • pick a bad triangle
  • add new vertex at center of circumcircle
  • gather all triangles that no longer satisfy
    Delaunay property into cavity
  • re-triangulate affected region, including new
    point
  • // some new triangles may be bad
    themselves

7
Sequential Algorithm
Mesh m = /* read in mesh */;
WorkList wl;
wl.add(mesh.badTriangles());
while (true) {
  if (wl.empty()) break;
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);  // determine new cavity
  c.expand();
  c.retriangulate();         // re-triangulate region
  m.update(c);               // update mesh
  wl.add(c.badTriangles());
}
8
Refinement Example
Original Mesh
Refined Mesh
9
Properties of algorithm
  • Actual code is far more complex
  • boundaries, especially non-convex boundaries are
    a pain
  • Average work per triangle (measured on Itanium)
  • 1M instructions, 100K floating-pt instructions
  • Don't-care non-determinism
  • Cavities of bad triangles may overlap
  • Therefore final mesh may depend on order in which
    bad triangles are processed
  • Any order will end up with a good mesh (in 2-D)
  • Number of bad triangles fixed by algorithm may be
    different for different orders
  • Heuristics for ordering bad triangles for
    processing are known
  • Not widely used

10
Parallelization Opportunities
  • Unit of work: fixing a bad triangle
  • Data-parallelism: bad triangles with
    non-overlapping cavities can be processed in
    parallel
  • No obvious way to tell if cavities of two bad
    triangles will overlap without actually building
    cavities
  • ⇒ compile-time parallelization will not work

11
Agglomerative Clustering
  • Input
  • Set of data points
  • Measure of distance (similarity) between them
  • Output: dendrogram
  • Tree that exposes similarity hierarchy
  • Applications
  • Data mining
  • Graphics: lightcuts for rendering with large
    numbers of light sources

12
Clustering algorithm
  • Sequential algorithm (iterative)
  • Find two closest points in data set
  • Cluster them in dendrogram
  • Replace pair in data set with a supernode that
    represents pair
  • Placement of supernode: use heuristics like
    center of mass
  • Repeat until there is only one point left
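The loop above can be sketched directly. A minimal sequential version (1-D points, brute-force closest-pair search standing in for the kd-tree and priority queue of the real implementation; all names are illustrative):

```python
def cluster(points):
    """Build a dendrogram (nested pair tuples) by repeatedly merging
    the two closest points; a merged pair is replaced by a supernode
    placed at its center of mass."""
    # each active entry: (position, dendrogram node)
    active = [(p, p) for p in points]
    while len(active) > 1:
        # find the closest pair by exhaustive search
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                d = abs(active[i][0] - active[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (pi, ni), (pj, nj) = active[i], active[j]
        supernode = (ni, nj)      # cluster the pair in the dendrogram
        center = (pi + pj) / 2    # center-of-mass heuristic
        active = [a for k, a in enumerate(active) if k not in (i, j)]
        active.append((center, supernode))
    return active[0][1]
```

For points [0, 1, 10, 11] this first merges 0 with 1 and 10 with 11, then merges the two supernodes, yielding the similarity hierarchy as nested pairs.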

13
Key Data Structures
  • Priority queue
  • Elements are pairs <p,n> where
  • p is point in data set
  • n is its nearest neighbor
  • Ordered by increasing distance
  • kdTree
  • Answers queries for nearest neighbor of a point
  • Convention: if there is only one point, nearest
    neighbor is point at infinity (ptAtInfinity)
  • Similar to a binary search tree but in higher
    dimensions
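A minimal 2-D kd-tree with nearest-neighbor queries, to make the idea concrete (a sketch, not the talk's implementation; median-split on alternating axes, with pruning of subtrees that cannot contain a closer point):

```python
def build(points, depth=0):
    """Build a 2-D kd-tree node: (point, split axis, left, right)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or dist2(point, target) < dist2(best, target):
        best = point
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    if diff * diff < dist2(best, target):  # far side may hide a closer point
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
```

Like a binary search tree, each level discriminates on one coordinate; the query descends toward the target, then backtracks only where the splitting plane is closer than the best candidate found so far.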

14
Clustering algorithm implementation
kdTree = new KDTree(points);
pq = new PriorityQueue();
for each p in points
  pq.add(<p, kdTree.nearest(p)>);
while (true) {
  if (pq.size() == 0) break;
  pair <p,n> = pq.get();        // get closest pair
  Cluster c = new Cluster(p,n); // create supernode
  dendrogram.add(c);
  kdTree.remove(p);             // update kdTree
  kdTree.remove(n);
  kdTree.add(c);
  Point m = kdTree.nearest(c);  // update priority queue
  pq.add(<c,m>);
}
15
Clustering algorithm details
kdTree = new KDTree(points);
pq = new PriorityQueue();
for each p in points
  pq.add(<p, kdTree.nearest(p)>);
while (true) {
  if (pq.size() == 0) break;
  pair <p,n> = pq.get();
  if (p.isAlreadyClustered()) continue;
  if (n.isAlreadyClustered()) {
    pq.add(<p, kdTree.nearest(p)>);
    continue;
  }
  Cluster c = new Cluster(p,n);
  dendrogram.add(c);
  kdTree.remove(p);
  kdTree.remove(n);
  kdTree.add(c);
  Point m = kdTree.nearest(c);
  if (m != ptAtInfinity) pq.add(<c,m>);
}
16
Parallelization Opportunities
  • Natural unit of work: processing of a pair in PQ
  • Algorithm appears to be sequential
  • pair enqueued in one iteration into PQ may be the
    pair dequeued in next iteration
  • However, in example, <a,b> and <c,d> can be
    clustered in parallel
  • Cost per pair in graphics app:
  • 100K instructions, 4K floating-point operations

17
Take-away lessons
  • Irregular programs have data-parallelism
  • Data-parallelism has been studied in the context
    of arrays
  • For unstructured data, data-parallelism arises
    from work-lists of various kinds
  • Delaunay mesh refinement: list of bad triangles
  • Agglomerative clustering: priority queue of pairs
    of points
  • Maxflow algorithms: list of active nodes
  • Boykov-Kolmogorov algorithm for image
    segmentation
  • Preflow-push algorithm
  • Approximate SAT solvers
  • ...
  • Data-parallelism in irregular programs is
    obscured within while loops, exit conditions,
    etc.
  • Need transparent syntax similar to FOR loops for
    structured data-parallelism

18
Take-away lessons (contd.)
  • Parallelism may depend on data values
  • whether or not two potential data-parallel
    computations conflict may depend on input data
  • (e.g.) Delaunay mesh generation depends on shape
    of mesh
  • Optimistic parallelization is necessary in
    general
  • Compile-time approaches using points-to analysis
    or shape analysis may be adequate for some cases
  • In general, runtime conflict-checking is needed
  • Handling of conflicts depends on the application
  • Delaunay mesh generation: roll back all but one
    conflicting computation
  • Agglomerative clustering: must respect priority
    queue order

19
Current approaches to optimistic parallelization
20
Manual approaches
  • Time-warp (1986)
  • Optimistic event-driven simulation
  • Distributed-memory computing model
  • Buffering of speculative state/roll-backs/commits
    implemented manually for particular application
  • Pthreads: hand-coded optimistic parallelization
  • Most current implementations of Delaunay mesh
    refinement use this approach
  • Writing correct fine-grain locking code is tricky
  • code tends to be very unstructured and complex
  • tripled software costs for Unreal game engine

21
System support
  • Hardware/software support for
  • buffering speculative state
  • detecting dependence violations by tracking
    reads and writes to memory locations
  • rollback/commit
  • Implementations
  • Thread-level speculation (TLS)
  • Transactional memory
  • TLS
  • Speculative execution of DO-loops with irregular
    array accesses (Padua/Rauchwerger/Torrellas/...)
  • Most implementations do not target while loops
  • Only data speculation, no control speculation
  • Hardware support can be fairly complex
  • Mis-speculations limit speed-up (SUIF study)
  • Transactional memory
  • Leverage cache-coherence hardware (Herlihy/Moss)
  • Support for optimistic synchronization in
    explicitly parallel programming model

22
Limitations of TLS/TM
  • Applications require unbounded while-loops
  • Most algorithms involve a work-list of some kind
  • Detect more work as you traverse data structure
    to perform work
  • Detecting dependence violations by tracking reads
    and writes to memory locations results in lots of
    spurious conflicts
  • Example: Delaunay mesh refinement
  • Regardless of how worklist is implemented, there
    must be a location (head) that points to next bad
    triangle in list
  • Every thread must read and write this location to
    get work
  • When should an update made by a thread be made
    visible to other threads?
  • As soon as thread pulls work from worklist:
    cascading roll-backs are possible
  • Only after thread finishes its work: only one bad
    triangle can be processed at a time
  • Other data structure manipulations
    (PQ, kdTree, graph) may have similar problems

23
Galois programming model and implementation
24
Beliefs underlying Galois system
  • Optimistic parallelism is the only general
    approach to parallelizing irregular apps
  • Static analysis can be used to optimize
    optimistic execution
  • Concurrency should be packaged within syntactic
    constructs that are natural for application
    programmers and obvious to compilers and runtime
    systems
  • Libraries/runtime system should manage
    concurrency (cf. SQL)
  • Application code should be sequential
  • Crucial to exploit abstractions provided by
    object-oriented languages
  • in particular, distinction between an abstract
    data type and its implementation
  • Concurrent access to shared mutable objects is
    essential

25
Components of Galois approach
  1. Two syntactic constructs for packaging optimistic
    parallelism as iteration over sets
  2. Assertions about methods in class libraries
  3. Runtime system for detecting and recovering from
    potentially unsafe accesses by optimistic
    computations

26
Concurrency constructs: two set iterators
  • for each e in Set S do B(e)
  • evaluate block B(e) for each element in set S
  • sequential implementation
  • set elements are unordered, so no a priori order
    on iterations
  • there may be dependences between iterations
  • set S may get new elements during execution
  • for each e in PoSet S do B(e)
  • evaluate block B(e) for each element in set S
  • sequential implementation
  • perform iterations in order specified by poSet
  • there may be dependences between iterations
  • set S may get new elements during execution
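The sequential semantics of the unordered iterator can be sketched in a few lines (a toy illustration, not the Galois runtime; names are illustrative). The key point is that the body may add new elements to the set while the iteration is in progress:

```python
from collections import deque

def for_each(initial, body):
    """Sequential semantics of the unordered set iterator:
    body(e, worklist) may append new elements, and they are
    processed in the same iteration."""
    wl = deque(initial)
    while wl:
        e = wl.popleft()
        body(e, wl)

# toy use: processing an element may create smaller elements,
# much as retriangulating a cavity may create new bad triangles
processed = []
def body(e, wl):
    processed.append(e)
    if e > 1:
        wl.append(e // 2)   # newly created work joins the iteration

for_each([8, 3], body)
```

Since set elements are unordered, the runtime is free to interleave or parallelize iterations, as long as the result matches some sequential order of this loop.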

27
Galois version of mesh refinement
Mesh m = /* read in mesh */;
Set wl;
wl.add(mesh.badTriangles()); // non-deterministic order
for each e in Set wl do {    // unordered iterator
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);  // determine new cavity
  c.expand();                // determine affected triangles
  c.retriangulate();         // re-triangulate region
  m.update(c);               // update mesh
  wl.add(c.badTriangles());  // add new bad triangles to workset
}
28
Observations
  • Application program has a well-defined sequential
    semantics
  • No notion of threads/locks/critical sections etc.
  • Set iterators
  • SETL language was probably first to introduce set
    iterators
  • However, SETL set iterators did not permit the
    sets being iterated on to grow during execution,
    which is important for our applications

29
Parallel computational model
  • Object-based shared-memory model
  • Computation performed by some number of threads
  • Threads can have their own local memory
  • Threads must invoke methods to access internal
    state of objects
  • mesh refinement: shared objects are
  • worklist
  • mesh
  • agglomerative clustering: shared objects are
  • priority queue
  • kdTree
  • dendrogram

30
Parallel execution of iterators
  • Master thread and some number of worker threads
  • master thread begins execution of program and
    executes code between iterators
  • when it encounters iterator, worker threads help
    by executing some iterations concurrently with
    master
  • threads synchronize by barrier synchronization at
    end of iterator
  • Key technical problem
  • Parallel execution must respect sequential
    semantics of application program
  • result of parallel execution must appear as
    though iterations were performed in some
    interleaved order
  • for poSet iterator, this order must correspond to
    poSet order
  • Non-trivial problem
  • each iteration may access mutable shared objects

31
Implementing semantics of iterators
  • Concurrent method invocations that modify object
    should not step on each other (mutual exclusion)
  • Library writer uses locks or some other mutex
    mechanism
  • Locks acquired during method invocation and
    released when method invocation ends
  • Uncontrolled interleaving may violate iterator
    semantics
  • In (a), contains?(x) must always return false but
    some interleavings will violate this (e.g.,
    add(x), contains?(x), remove(x))
  • Sometimes, interleaving is OK and is needed for
    concurrency
  • In (b) (motivated by Delaunay mesh refinement),
    method invocations can be interleaved provided
    result of get() is not argument of add()

32
(II) Assertions on methods
  • Concurrent accesses to a mutable object by
    multiple threads are OK provided method
    invocations commute

[figure: two threads concurrently invoking get() and add() on shared
objects in memory]
33
Assertions on methods (contd.)
get() add() get() add()  ≡?  get() get() add() add()
  • Semantic commutativity vs. concrete commutativity
  • for most implementations of workset, concrete
    data structure will be different for these two
    sequences, so commutativity fails
  • however, at semantic level, these set operations
    commute provided they operate on different set
    elements
  • Conclusion
  • semantic commutativity is crucial
  • class implementor must specify this information
  • Commutativity of method invocations, not methods
  • get() commutes with add() only if element
    inserted by add() is not the same as the element
    returned by get()

34
Assertions on methods (contd.)
  • Updates to objects happen before iteration
    completes (eager commit)
  • So we need a way of undoing the effect of a
    method invocation
  • Class implementer must provide an inverse
    method
  • As before, semantic inverse is key, not concrete
    inverse

35
Example: set
class SetInterface {
  void add(Element x)
    conflicts: add(x), remove(x), contains?(x), get() returning x
    inverse:   remove(x)
  void remove(Element x)
    conflicts: add(x), remove(x), contains?(x), get() returning x
    inverse:   add(x)
}
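The conflict lists above reduce to a simple predicate: two invocations on the same element conflict unless both are read-only. A sketch of that check (representation and names are illustrative, not the Galois API; for get(), the element is the value it returned):

```python
def commutes(inv1, inv2):
    """Semantic commutativity for the set specification:
    each invocation is a (method, element) pair, one of
    'add', 'remove', 'contains', 'get'. Invocations on
    different elements always commute; on the same element,
    they commute only if neither is a mutator."""
    op1, x1 = inv1
    op2, x2 = inv2
    if x1 != x2:
        return True                 # different elements: always commute
    mutators = {"add", "remove"}
    return op1 not in mutators and op2 not in mutators
```

Note this is commutativity of method *invocations*, not methods: add(a) commutes with add(b) but not with add(a).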
36
Remarks
  • Commutativity information is optional
  • No commutativity information for a mutable object
    means only one iteration can manipulate the
    object at a time
  • Inverse method is more or less essential
  • for a class w/o commutativity information,
    inverse methods can be implemented by data
    copying
  • Difficulty of writing specifications
  • in our apps, most shared objects are collections
    (sets, bags, maps)
  • (e.g.), kdTree is simply a set with a
    nearestNeighbor operation
  • writing specifications is quite easy
  • Relationship to Abelian group axioms
  • commutativity, inverse, identity

37
(III) Runtime system: commit pool
  • Maintains iteration record for each ongoing
    iteration in system
  • Status of iteration
  • running
  • ready-to-commit (RTC)
  • aborted
  • Life-cycle of iteration
  • thread goes to commit pool for work
  • commit pool
  • obtains next element from iterator
  • assigns priority to iteration based on priority of
    element in set
  • creates an iteration record with status running
  • when iteration completes
  • status of iteration record is set to RTC
  • when that record has highest priority in system,
    it is allowed to commit
  • if commutativity conflict is detected
  • commit buffer arbitrates to determine which
    iteration(s) should be aborted
  • commit buffer executes undo logs of aborted
    iterations
  • Role of commit pool is similar to that of reorder
    buffer in out-of-order execution microprocessors
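The commit-ordering rule can be sketched compactly: an iteration commits only when it is ready-to-commit (RTC) and has the highest priority among ongoing iterations, just as a reorder buffer retires instructions in order. A toy sketch (class and method names are illustrative; smaller number means higher priority):

```python
import heapq

class CommitPool:
    """Sketch of commit-pool ordering for ordered iterators."""
    def __init__(self):
        self.heap = []          # priorities of ongoing iterations
        self.rtc = set()        # priorities that are ready-to-commit
        self.committed = []

    def start(self, priority):
        """Register a new iteration record (status: running)."""
        heapq.heappush(self.heap, priority)

    def finish(self, priority):
        """Mark the iteration RTC, then commit every iteration that
        has become the highest-priority one in the system."""
        self.rtc.add(priority)
        while self.heap and self.heap[0] in self.rtc:
            p = heapq.heappop(self.heap)
            self.rtc.remove(p)
            self.committed.append(p)

pool = CommitPool()
for p in (1, 2, 3):
    pool.start(p)
pool.finish(2)   # cannot commit yet: iteration 1 is still running
pool.finish(1)   # 1 commits, then 2 becomes eligible and commits
```

Aborts (not shown) would remove a record from the pool after its undo log is executed.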

38
(III) Runtime system: conflict logs
  • Each object has a conflict log
  • Contains sequence of method invocations that have
    been performed by ongoing iterations
  • Each thread has undo log that contains sequence
    of inverse method invocations it must execute if
    it aborts
  • When thread invokes method m on object O
  • Check if m commutes with method invocations and
    their inverses in conflict log of object O
  • If so, add m to conflict log of object O and
    m⁻¹ to undo log of thread, and execute method
  • Otherwise, iteration aborts
  • When thread commits iteration
  • Remove its invocations from conflict logs of all
    objects it has touched
  • Zero out its undo log
  • Easy to extend this to support nested method
    invocations
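The conflict-log/undo-log protocol can be sketched for a shared set object (a toy illustration, not the Galois runtime; here two invocations commute iff they touch different elements, and the inverse of add is remove and vice versa):

```python
class SharedObject:
    """Per-object conflict log with per-iteration undo logs."""
    def __init__(self):
        self.data = set()
        self.conflict_log = []      # entries: (iteration, op, element)

    def invoke(self, it, op, x, undo_log):
        """Attempt op ('add' or 'remove') on behalf of iteration it.
        Returns False (caller must abort) on a commutativity conflict
        with another ongoing iteration's logged invocations."""
        for other_it, _other_op, other_x in self.conflict_log:
            if other_it != it and other_x == x:
                return False        # does not commute -> abort
        self.conflict_log.append((it, op, x))
        if op == "add":
            self.data.add(x)
            undo_log.append(("remove", x))   # inverse method
        else:
            self.data.discard(x)
            undo_log.append(("add", x))
        return True

    def commit(self, it):
        """Remove this iteration's invocations from the conflict log."""
        self.conflict_log = [e for e in self.conflict_log if e[0] != it]

obj = SharedObject()
undo1, undo2 = [], []
ok1 = obj.invoke("i1", "add", "a", undo1)   # succeeds
ok2 = obj.invoke("i2", "add", "b", undo2)   # succeeds: different elements
ok3 = obj.invoke("i2", "add", "a", undo2)   # conflicts with i1's add("a")
```

On abort, the thread would replay its undo log in reverse; on commit, it clears its undo log and its conflict-log entries.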

39
Experiments
40
Experimental Setup
  • Machines
  • 4-processor 1.5 GHz Itanium 2
  • 16 KB L1, 256 KB L2, 3MB L3 cache
  • no shared cache between processors
  • Red Hat Linux
  • Dual processor, dual core 3.0 GHz Xeon
  • 32 KB L1, 4 MB L2 cache
  • dual cores share L2
  • Red Hat Linux

41
Delaunay mesh generation
  • Mesh implemented as a graph
  • each triangle is a node
  • edges in graph represent triangle adjacencies
  • used adjacency list representation of graph
  • Input mesh
  • from Shewchuk's Triangle program
  • 10,156 triangles of which 4,837 were bad
  • Galois work-set implementation
  • used STL queue first: high abort ratio
  • Sequential code: 21,918 completed, 0 aborted
  • Galois(q): 21,736 completed, 28,290 aborted
  • replaced queue with array + random choice
  • Galois(r): 21,908 completed, 49 aborted

42
Code versions
  • Three versions
  • reference: sequential version without
    locks/threads/etc.
  • FGL: handwritten code that uses fine-grain locks
    on triangles
  • meshgen: Galois version

43
Results
44
Performance Breakdown
4-processor numbers are summed over all
processors
45
Agglomerative clustering
  • Two versions
  • reference: sequential version w/o locks/threads
  • treebuild: Galois version
  • Data structures
  • priority queue
  • kd-tree
  • dendrogram
  • Data set
  • from graphics scene with roughly 50,000 light
    sources

46
Speedups
  • sequential version is best on 1 processor
  • self-relative speed-up of almost 2.75 on 4
    processors

47
Abort ratios and CPI
         Committed iterations   Aborted iterations
1 proc   57,486                 n/a
4 proc   57,861                 2,528
  • Sequential and treebuild perform almost same
    number of instructions
  • As before, cycles/instruction (CPI) is higher for
    treebuild mainly because of L3 cache misses
  • mainly from kdTree

48
Degree of speculation
  • Measured number of iterations ready to commit
    (RTC) whenever commit pool creates/aborts/commits
    an iteration
  • Histogram shown above
  • X-axis in figure is truncated to show detail near
    origin
  • maximum number of RTC iterations is 120
  • Most of the time, we do not need to speculate too
    deeply to keep 4 threads busy
  • but on occasion, we do need to speculate deeply

49
Take-away points
  • Support for ordering speculative computations is
    very useful for some apps
  • hard to do agglomerative clustering otherwise
  • May need to speculate deeply in some apps
  • Domain-specific information is very useful for
    proper scheduling
  • workset implementation made a huge difference in
    performance
  • will probably need to provide hooks for user to
    specify scheduling policy
  • Reducing cache traffic is important to improve
    performance further

50
Ongoing work
51
Improving Performance
  • Locality enhancement
  • Galois approach can expose data-parallelism in
    irregular applications
  • Scalable exploitation of parallelism requires
    attending to locality
  • Specifying scheduling strategies
  • Delaunay mesh refinement example shows that
    scheduling of iterations can be critical to lower
    abort ratios
  • needed domain knowledge to fix problem

52
Galois methodology
  • How easy is it to specify commutativity of method
    invocations?
  • How important is the distinction between semantic
    and concrete commutativity?
  • How easy is it to write inverse methods?
  • Given a specification of the ADT, can we check
    commutativity and inverse directives?

53
Benchmarks
  • Existing benchmarks are useless
  • Wirth: Algorithms + Data Structures = Programs
  • current benchmarks are programs
  • we need algorithms and data structures
  • experience with Delaunay mesh generation: STL
    queue
  • variety of input data sets to illustrate range of
    behavior

54
Conclusions
  • Irregular programs have data-parallelism
  • Work-list based iterative algorithms over
    irregular data structures
  • Data-parallelism may be inherently data-dependent
  • Pointer/shape analysis cannot work for these apps
  • Optimistic parallelization is essential for such
    apps
  • Analysis might be useful to optimize parallel
    program execution
  • Exploiting abstractions provided by OO is
    critical
  • Only CS people still worry about F77 and C
    anyway.
  • Exploiting high-level semantic information about
    programs is critical
  • Galois knows about sets and ordered sets
  • Commutativity information is crucial
  • Support for ordering speculative computations
    important
  • Concurrent access to mutable objects is important
  • Benchmark programs are bad
  • Programs ✗
  • Algorithms + data structures ✓

55
Current approaches to parallelization
56
Pessimistic parallelization
  • Use compiler (static) analyses to produce
    parallel schedule
  • (e.g.) points-to/shape analysis
  • Problem
  • Any static technique must produce a parallel
    schedule that is valid for all inputs to the
    program
  • In our applications, parallelism is very
    dependent on input data
  • Conclusion: static techniques must serialize all
    iterations
  • accuracy of analysis is not the issue

57
Semi-static techniques
  • Inspector-executor technique
  • inspector: small, fast program that uses input
    data to produce parallel schedule
  • executor: runs actual program using input data
    and parallel schedule from inspector
  • Inspector generated by hand or by compiler
  • Applications
  • Parallel sparse matrix factorization
  • Communication schedules for parallel sparse
    direct solvers
  • Problem
  • data-sets in our problems change as programs
    execute
  • Inspector must do all the work of the executor

58
Take-away lessons (contd.)
  • Updates to data structures by one computation
    must be visible to other computations before
    first one terminates
  • (e.g.) worklist
  • thread grabs a bad triangle from worklist
  • before bad triangle processing is done, another
    thread may want a bad triangle from worklist
  • modifications to worklist made by first thread
    must be visible to second thread before first
    thread completes
  • similar issues arise with priority queue and
    kdTree
  • But how do we prevent cascading roll-backs?