Title: The Galois Project
1 The Galois Project
- Keshav Pingali
- University of Texas, Austin
- Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong
2 Overview of Galois Project
- Focus of Galois project
- parallel execution of irregular programs
- pointer-based data structures like graphs and trees
- raise abstraction level for Joe programmers
- explicit parallelism is too difficult for most programmers
- performance penalty for abstraction should be small
- Research approach
- study algorithms to find common patterns of parallelism and locality: Amorphous Data-Parallelism (ADP)
- design programming constructs for expressing these patterns
- implement these constructs efficiently: Abstraction-Based Speculation (ABS)
- For more information
- papers in PLDI 2007, ASPLOS 2008, SPAA 2008
- website: http://iss.ices.utexas.edu
3 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
4 Delaunay Mesh Refinement
- Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do
        Pick a bad triangle
        Find its cavity
        Retriangulate cavity   // may create new bad triangles
- Order in which bad triangles are refined:
- final mesh depends on order in which bad triangles are processed
- but all bad triangles will be eliminated ultimately, regardless of order

Mesh m = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
while (true) {
    if (wl.empty()) break;
    Triangle e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();             // determine cavity
    c.retriangulate();      // re-triangulate cavity
    m.update(c);            // update mesh
    wl.add(c.badTriangles());
}
5 Delaunay Mesh Refinement
- Parallelism
- triangles with non-overlapping cavities can be processed in parallel
- if cavities of two triangles overlap, they must be done serially
- in practice, lots of parallelism
- Exploiting this parallelism
- compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (property of algorithm, not program)
- runtime dependence checking is needed
- Galois approach: optimistic parallelization
6 Take-away lessons
- Amorphous data-parallelism
- data structures: graphs, trees, etc.
- iterative algorithm over unordered or ordered work-list
- elements can be added to work-list during computation
- complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
- but many of these computations can be done in parallel
- Contrast: crystalline (regular) data-parallelism
- data structures: dense matrices
- iterative algorithm over fixed integer interval
- simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k] * B[k,j]
7 Take-away lessons (contd.)
- Amorphous data-parallelism is ubiquitous
- Delaunay mesh generation: points to be inserted into mesh
- Delaunay mesh refinement: list of bad triangles
- Agglomerative clustering: priority queue of points from data-set
- Boykov-Kolmogorov algorithm for image segmentation
- Reduction-based interpreters for the λ-calculus: list of redexes
- Iterative dataflow analysis algorithms in compilers
- Approximate SAT solvers: survey propagation, WalkSAT
8 Take-away lessons (contd.)
- Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
- Need transparent syntax, similar to FOR loops for crystalline data-parallelism
- Optimistic parallelization is necessary in general
- Compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
- In general, runtime dependence checking is needed
- Property of algorithms, not programs
- Handling of dependence conflicts depends on the application
- Delaunay mesh generation: roll back any conflicting computation
- Agglomerative clustering: must respect priority queue order
9 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
10 Galois Design Philosophy
- Do not worry about dusty decks (for now)
- restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)
- Evolution, not revolution
- modification of existing programming paradigms: OK
- radical solutions like functional programming: not OK
- No reliance on parallelizing compiler technology
- will not work for many of our applications anyway
- parallelizing compilers are very complex software artifacts
- Support two classes of programmers
- domain experts (Joe)
- should be shielded from complexities of parallel programming
- most programmers will be Joes
- parallel programming experts (Steve)
- small number of highly trained people
- Analogs
- industry model, even for sequential programming
- norm in domains like numerical linear algebra: Steves implement BLAS libraries
11 Galois system
- Application program
- has well-defined sequential semantics
- current implementation: sequential Java
- uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism
- Class libraries
- like the Java collections library, but with additional information for concurrency control
- Runtime system
- manages optimistic parallelism
12 Optimistic set iterators
- for each e in Set S do B(e)
- evaluate block B(e) for each element in set S
- sequential semantics
- set elements are unordered, so no a priori order on iterations
- there may be dependences between iterations
- set S may get new elements during execution
- for each e in OrderedSet S do B(e)
- evaluate block B(e) for each element in set S
- sequential semantics
- perform iterations in the order specified by the OrderedSet
- there may be dependences between iterations
- set S may get new elements during execution (a usage sketch follows below)
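For illustration, a sketch of how the ordered-set iterator might be used for agglomerative clustering (slide 7), written in the same style as the Galois code on the next slide; the names OrderedSet, dataset, nearest, and cluster are assumptions for this sketch, not the actual Galois library API:

OrderedSet wl;                          // priority queue of points
wl.add(dataset.points());
for each Point p in OrderedSet wl do {  // ordered Set iterator
    Point n = nearest(p);               // closest remaining neighbor of p
    Cluster c = cluster(p, n);          // merge p and n into a new cluster
    wl.add(c);                          // new element created during execution
}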
13 Galois version of mesh refinement

Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());          // initialize the Set wl
for each e in Set wl do {             // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());         // add new bad triangles to Set
}

- Scheduling policy for iterator
- controlled by implementation of Set class
- good choice for temporal locality: stack
14 Parallel execution model
- Object-based shared-memory model
- Master thread and some number of worker threads
- master thread begins execution of program and executes code between iterators
- when it encounters an iterator, worker threads help by executing iterations concurrently with the master
- threads synchronize by barrier synchronization at end of iterator (a sketch of this model follows below)
- Threads invoke methods to access internal state of objects
- how do we ensure sequential semantics of program are respected?
[Figure: program with iterators; master and worker threads operating on objects in shared memory]
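A minimal sketch of this execution model in plain Java (not the Galois runtime): the master hands the iterations of one iterator to worker threads over a shared work-list and waits for them to finish, which plays the role of the barrier at the end of the iterator. All names here are illustrative.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MasterWorkerSketch {
    // body of one iteration; may add new work to the work-list
    static void process(int item, Queue<Integer> worklist) {
        if (item % 10 == 0) worklist.add(item + 1);
    }

    // workers drain the shared work-list concurrently with the master
    static void runIterator(Queue<Integer> worklist, int numWorkers) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        for (int t = 0; t < numWorkers; t++) {
            workers.submit(() -> {
                Integer item;
                while ((item = worklist.poll()) != null) {
                    process(item, worklist);
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);  // barrier at end of iterator
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> wl = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 100; i++) wl.add(i);
        runIterator(wl, 4);   // master encounters the iterator; workers help
        // master continues with the code after the iterator here
    }
}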
15 Baseline solution (PLDI 2007)
- Iteration must lock object to invoke method
- Two types of objects
- catch and keep policy
- lock is held even after method invocation completes
- all locks released at end of iteration
- poor performance for programs with collections and accumulators
- catch and release policy
- like Java locking policy
- permits method invocations from different concurrent iterations to be interleaved
- how do we make sure this is safe?
[Figure: iterations i and j acquiring locks on shared objects over time]
16 Catch and keep: iteration rollback
- What does iteration j do if an object is already locked by some other iteration i?
- one possibility: wait and try to acquire the lock again
- but this might lead to deadlock
- our implementation: runtime system rolls back one of the iterations by undoing its updates to shared objects
- Undoing updates: any copy-on-write solution works
- make a copy of the entire object when you acquire the lock (wasteful for large objects)
- runtime system maintains an undo log that holds information for undoing side-effects to objects as they are made (cf. software transactional memory); a sketch follows below
[Figure: iterations i and j accessing objects in shared memory over time]
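A minimal sketch of such an undo log in Java (names are assumptions for illustration, not the Galois runtime API): each side-effect made by an iteration is paired with an action that undoes it, and an aborted iteration replays those actions in reverse order.

import java.util.ArrayDeque;
import java.util.Deque;

class UndoLog {
    private final Deque<Runnable> inverses = new ArrayDeque<>();

    // record the inverse of a side-effect just performed
    void record(Runnable inverse) { inverses.push(inverse); }

    // iteration aborted: undo its side-effects in reverse order
    void rollback() {
        while (!inverses.isEmpty()) inverses.pop().run();
    }

    // iteration committed: nothing to undo
    void commit() { inverses.clear(); }
}

// Usage: an iteration that performs counter[0] += 5 would also call
//   log.record(() -> counter[0] -= 5);
// so the runtime can restore the old state if the iteration is rolled back.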
17 Problem with catch and keep
- Poor performance for programs that deal with mutable collections and accumulators
- work-sets are mutable collections
- accumulators are ubiquitous
- Example: Delaunay refinement
- work-set is a (mutable) collection of bad triangles
- some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
- that thread must retain the lock on the work-set till the iteration completes, which shuts out all other threads (same problem arises with transactional memory)
- Lesson
- for some objects, we need to interleave method invocations from different iterations
- but must not lose serializability
[Figure: iterations i and j contending for the locked work-set object]
18 Galois solution: selective catch and release
- Example: accumulator with two methods
- add(int)
- read()   // return current value
- adds commute with other adds, and reads commute with other reads
- Interleaving of commuting method invocations from different iterations → OK
- Interleaving of non-commuting method invocations from different iterations → trigger abort
- Rolling back side-effects: programmer must provide inverse methods for forward methods
- inverse method for add(n) is subtract(n)
- semantic inverse, not representational inverse
- This solution works for sets as well (a sketch follows the figure below)
[Figure: two iterations interleaving a.add(5), a.add(3), a.add(-4), a.add(8), and a.read() on a shared accumulator]
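A minimal Java sketch of such an accumulator class (the class itself is illustrative; only add, read, and the inverse subtract come from the slide):

class Accumulator {
    private int value = 0;

    // forward method: commutes with other add() invocations
    void add(int n) { value += n; }

    // semantic inverse of add(n): used to roll back an aborted iteration
    void subtract(int n) { value -= n; }

    // forward method: commutes with other read() invocations, but not with add()
    int read() { return value; }
}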
19 Abstraction-based Speculation
- Library writer
- specifies commutativity and inverse information for some classes
- Runtime system
- catch-and-release locking for these classes
- keeps track of forward method invocations
- checks commutativity of forward method invocations (see the sketch below)
- invokes appropriate inverse methods on abort
- More details: PLDI 2007
- Related work
- logical locks in database systems
- Herlihy et al., PPoPP 2008
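A rough sketch of the bookkeeping described above, using the accumulator's commutativity rule; all names are assumptions for illustration, and the actual mechanism is described in the PLDI 2007 paper.

import java.util.ArrayList;
import java.util.List;

class CommutativityChecker {
    private static final class Invocation {
        final int iteration;
        final String method;
        Invocation(int iteration, String method) { this.iteration = iteration; this.method = method; }
    }

    // forward method invocations of all currently executing iterations
    private final List<Invocation> outstanding = new ArrayList<>();

    // accumulator rule: add/add and read/read commute, add/read do not
    private boolean commutes(String m1, String m2) { return m1.equals(m2); }

    // record a forward invocation; returns false if the calling iteration must abort
    synchronized boolean onInvoke(int iteration, String method) {
        for (Invocation inv : outstanding) {
            if (inv.iteration != iteration && !commutes(inv.method, method)) {
                return false;   // non-commuting invocation from a concurrent iteration
            }
        }
        outstanding.add(new Invocation(iteration, method));
        return true;
    }

    // on commit or abort, discard the iteration's outstanding invocations
    synchronized void release(int iteration) {
        outstanding.removeIf(inv -> inv.iteration == iteration);
    }
}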
20 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
21 Scheduling iterators
- Control scheduling by changing the implementation of the work-set class (see the sketch below)
- stack, queue, etc.
- Scheduling can have a profound effect on performance
- Example: Delaunay mesh refinement
- 10,156 triangles, of which 4,837 were bad
- sequential code, work-set is a stack: 21,918 iterations completed, 0 aborted
- 4-processor Itanium 2, work-set implementations:
- stack: 21,736 iterations completed, 28,290 aborted
- array with random choice: 21,908 iterations completed, 49 aborted
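A minimal sketch of how the work-set implementation determines the schedule (illustrative Java, not the Galois library): the iteration body never changes, only the add/get policy of the work-set does.

import java.util.ArrayDeque;
import java.util.Deque;

interface WorkSet<T> {
    void add(T item);
    T get();            // returns null when the work-set is empty
}

// LIFO stack: tends to give good temporal locality
class StackWorkSet<T> implements WorkSet<T> {
    private final Deque<T> items = new ArrayDeque<>();
    public void add(T item) { items.push(item); }
    public T get() { return items.poll(); }
}

// FIFO queue: a different schedule for the same iterator
class QueueWorkSet<T> implements WorkSet<T> {
    private final Deque<T> items = new ArrayDeque<>();
    public void add(T item) { items.addLast(item); }
    public T get() { return items.pollFirst(); }
}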
22 Scheduling iterators (SPAA 2008)
- Crystalline data-parallelism: DO-ALL loops
- main scheduling concerns are locality and load-balancing
- OpenMP: static, dynamic, guided, etc.
- Amorphous data-parallelism: many more issues
- conflicts
- dynamically created work
- algorithmic issues: efficiency of data structures
- SPAA 2008 paper
- scheduling framework for exploiting amorphous data-parallelism
- generalizes OpenMP DO-ALL loop scheduling constructs
23 Data Partitioning (ASPLOS 2008)
- Partition the graph between cores
- Data-centric assignment of work
- core gets bad triangles from its own partitions
- improves locality
- can dramatically reduce conflicts
- Lock coarsening
- associate locks with partitions
- Over-decomposition
- improves core utilization
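A minimal sketch of these ideas in Java (all names are assumptions for illustration): one lock per partition instead of per-element locks, and each core draws work only from the partitions assigned to it, with more partitions than cores to allow over-decomposition.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class Triangle { }

class Partition {
    final ReentrantLock lock = new ReentrantLock();   // lock coarsening: one lock per partition
    final List<Triangle> badTriangles = new ArrayList<>();
}

class DataCentricAssignment {
    // data-centric work assignment: core 'core' owns every numCores-th partition;
    // over-decomposition means all.size() is several times numCores
    static List<Partition> partitionsForCore(int core, List<Partition> all, int numCores) {
        List<Partition> mine = new ArrayList<>();
        for (int i = core; i < all.size(); i += numCores) {
            mine.add(all.get(i));
        }
        return mine;
    }
}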
24 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
25 Small-scale multiprocessor results
- 4-processor Itanium 2
- 16 KB L1, 256 KB L2, 3 MB L3 cache
- Versions
- GAL: using stack as worklist
- PAR: partitioned mesh, data-centric work assignment
- LCO: locks on partitions
- OVD: over-decomposed version (factor of 4)
26 Large-scale multiprocessor results
- Maverick at TACC
- 128-core Sun Fire E25K, 1 GHz
- 64 dual-core processors
- Sun Solaris
- First out-of-the-box results
- speed-up of 20 on 32 cores for refinement
- Mesh partitioning is still sequential
- time for mesh partitioning starts to dominate after 8 processors (32 partitions)
- need parallel mesh partitioning: Par-Metis (Karypis et al.)
27 Galois version of mesh refinement

Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());          // initialize the Set wl
for each e in Set wl do {             // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());         // add new bad triangles to Set
}

[Figure: the same code running on a partitioned work-set and partitioned graph, managed by the Galois runtime system]
28 Results for BK Image Segmentation
- Versions
- GAL: standard Galois version
- PAR: partitioned graph
- LCO: locks on partitions
- OVD: over-decomposed version
29 Related work
- Transactions
- programming model is explicitly parallel
- assumes someone else is responsible for parallelism, locality, load-balancing, and scheduling, and focuses only on synchronization
- Galois: main concerns are parallelism, locality, load-balancing, and scheduling
- catch-and-keep classes can use TM for roll-back, but this is probably overkill
- Thread-level speculation
- not clear where to speculate in C programs
- wastes power in useless speculation
- many schemes require extensive hardware support
- no notion of abstraction-based speculation
- no analogs of data partitioning or scheduling
- overall results are disappointing
30 Ongoing work
- Case studies of irregular programs
- understand parallelism and locality patterns in irregular programs
- Lonestar benchmark suite for irregular programs
- joint work with Calin Cascaval's group at IBM Yorktown Heights
- Optimizing the Galois runtime system
- improve performance for iterators in which work per iteration is relatively low
- Compiler analysis to reduce overheads of optimistic parallel execution
- Scalability studies
- larger number of cores
- Distributed-memory implementation
- billion-element meshes?
- Program analysis to verify assertions about class methods
- need a semantic specification of the class
31 Summary
- Irregular applications have amorphous data-parallelism
- work-list based iterative algorithms over unordered and ordered sets
- amorphous data-parallelism may be inherently data-dependent
- Pointer/shape analysis cannot work for these apps
- optimistic parallelization is essential for such apps
- analysis might be useful to optimize parallel program execution
- Exploiting abstractions and high-level semantics is critical
- Galois knows about sets, ordered sets, accumulators
- Galois approach provides a unified view of data-parallelism in regular and irregular programs
- baseline is optimistic parallelism
- use compiler analysis to make decisions at compile-time whenever possible