Title: The Galois Project
1 The Galois Project
- Keshav Pingali
- University of Texas, Austin
- Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong
2 Overview of Galois Project
- Focus of Galois project
- parallel execution of irregular programs
- pointer-based data structures like graphs and trees
- raise abstraction level for Joe programmers
- explicit parallelism is too difficult for most programmers
- performance penalty for abstraction should be small
- Research approach
- study algorithms to find common patterns of parallelism and locality: Amorphous Data-Parallelism (ADP)
- design programming constructs for expressing these patterns
- implement these constructs efficiently: Abstraction-Based Speculation (ABS)
- For more information
- papers in PLDI 2007, ASPLOS 2008, SPAA 2008
- website: http://iss.ices.utexas.edu
3 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
4 Delaunay Mesh Refinement
- Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do
        Pick a bad triangle
        Find its cavity
        Retriangulate cavity   // may create new bad triangles
- Order in which bad triangles are refined:
- final mesh depends on order in which bad triangles are processed
- but all bad triangles will be eliminated ultimately, regardless of order

Mesh m = /* read in mesh */
WorkList wl;
wl.add(mesh.badTriangles());
while (true) {
    if (wl.empty()) break;
    Triangle e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();             // determine cavity
    c.retriangulate();      // re-triangulate cavity
    m.update(c);            // update mesh
    wl.add(c.badTriangles());
}
5 Delaunay Mesh Refinement
- Parallelism
- triangles with non-overlapping cavities can be processed in parallel
- if cavities of two triangles overlap, they must be done serially
- in practice, lots of parallelism
- Exploiting this parallelism
- compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (property of algorithm, not program)
- runtime dependence checking is needed
- Galois approach: optimistic parallelization
6 Take-away lessons
- Amorphous data-parallelism
- data structures: graphs, trees, etc.
- iterative algorithm over unordered or ordered work-list
- elements can be added to work-list during computation
- complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
- but many of these computations can be done in parallel
- Contrast: crystalline (regular) data-parallelism
- data structures: dense matrices
- iterative algorithm over fixed integer interval
- simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k] * B[k,j]
7 Take-away lessons (contd.)
- Amorphous data-parallelism is ubiquitous
- Delaunay mesh generation: points to be inserted into mesh
- Delaunay mesh refinement: list of bad triangles
- Agglomerative clustering: priority queue of points from data-set
- Boykov-Kolmogorov algorithm for image segmentation
- Reduction-based interpreters for the λ-calculus: list of redexes
- Iterative dataflow analysis algorithms in compilers
- Approximate SAT solvers: survey propagation, WalkSAT
8 Take-away lessons (contd.)
- Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
- Need transparent syntax, similar to FOR loops for crystalline data-parallelism
- Optimistic parallelization is necessary in general
- Compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
- In general, runtime dependence checking is needed
- Property of algorithms, not programs
- Handling of dependence conflicts depends on the application
- Delaunay mesh generation: roll back any conflicting computation
- Agglomerative clustering: must respect priority queue order
9 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
10 Galois Design Philosophy
- Do not worry about dusty decks (for now)
- restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)
- Evolution, not revolution
- modification of existing programming paradigms: OK
- radical solutions like functional programming: not OK
- No reliance on parallelizing compiler technology
- will not work for many of our applications anyway
- parallelizing compilers are very complex software artifacts
- Support two classes of programmers
- domain experts (Joe)
- should be shielded from complexities of parallel programming
- most programmers will be Joes
- parallel programming experts (Steve)
- small number of highly trained people
- Analogs
- industry model, even for sequential programming
- norm in domains like numerical linear algebra: Steves implement BLAS libraries
11 Galois system
- Application program
- has well-defined sequential semantics
- current implementation: sequential Java
- uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism
- Class libraries
- like the Java collections library, but with additional information for concurrency control
- Runtime system
- manages optimistic parallelism
12 Optimistic set iterators
- for each e in Set S do B(e)
- evaluate block B(e) for each element in set S
- sequential semantics
- set elements are unordered, so no a priori order on iterations
- there may be dependences between iterations
- set S may get new elements during execution
- for each e in OrderedSet S do B(e)
- evaluate block B(e) for each element in set S
- sequential semantics
- perform iterations in the order specified by the OrderedSet
- there may be dependences between iterations
- set S may get new elements during execution (a usage sketch follows below)
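For illustration, a sketch of how the ordered-set iterator might be used for agglomerative clustering (slide 7), written in the same style as the Galois code on the next slide; the names OrderedSet, dataset, nearest, and cluster are assumptions for this sketch, not the actual Galois library API:

OrderedSet wl;                          // priority queue of points
wl.add(dataset.points());
for each Point p in OrderedSet wl do {  // ordered Set iterator
    Point n = nearest(p);               // closest remaining neighbor of p
    Cluster c = cluster(p, n);          // merge p and n into a new cluster
    wl.add(c);                          // new element created during execution
}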
13 Galois version of mesh refinement

Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());          // initialize the Set wl
for each e in Set wl do {             // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());         // add new bad triangles to Set
}

- Scheduling policy for iterator
- controlled by implementation of Set class
- good choice for temporal locality: stack
14 Parallel execution model
- Object-based shared-memory model
- Master thread and some number of worker threads
- master thread begins execution of program and executes code between iterators
- when it encounters an iterator, worker threads help by executing iterations concurrently with the master
- threads synchronize by barrier synchronization at end of iterator (a sketch of this model follows below)
- Threads invoke methods to access internal state of objects
- how do we ensure sequential semantics of program are respected?
[Figure: program with iterators; master and worker threads operating on objects in shared memory]
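A minimal sketch of this execution model in plain Java (not the Galois runtime): the master hands the iterations of one iterator to worker threads over a shared work-list and waits for them to finish, which plays the role of the barrier at the end of the iterator. All names here are illustrative.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MasterWorkerSketch {
    // body of one iteration; may add new work to the work-list
    static void process(int item, Queue<Integer> worklist) {
        if (item % 10 == 0) worklist.add(item + 1);
    }

    // workers drain the shared work-list concurrently with the master
    static void runIterator(Queue<Integer> worklist, int numWorkers) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        for (int t = 0; t < numWorkers; t++) {
            workers.submit(() -> {
                Integer item;
                while ((item = worklist.poll()) != null) {
                    process(item, worklist);
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);  // barrier at end of iterator
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> wl = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 100; i++) wl.add(i);
        runIterator(wl, 4);   // master encounters the iterator; workers help
        // master continues with the code after the iterator here
    }
}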
15 Baseline solution (PLDI 2007)
- Iteration must lock object to invoke method
- Two types of objects
- catch and keep policy
- lock is held even after method invocation completes
- all locks released at end of iteration
- poor performance for programs with collections and accumulators
- catch and release policy
- like Java locking policy
- permits method invocations from different concurrent iterations to be interleaved
- how do we make sure this is safe?
[Figure: iterations i and j acquiring locks on shared objects over time]
16 Catch and keep: iteration rollback
- What does iteration j do if an object is already locked by some other iteration i?
- one possibility: wait and try to acquire the lock again
- but this might lead to deadlock
- our implementation: runtime system rolls back one of the iterations by undoing its updates to shared objects
- Undoing updates: any copy-on-write solution works
- make a copy of the entire object when you acquire the lock (wasteful for large objects)
- runtime system maintains an undo log that holds information for undoing side-effects to objects as they are made (cf. software transactional memory); a sketch follows below
[Figure: iterations i and j accessing objects in shared memory over time]
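A minimal sketch of such an undo log in Java (names are assumptions for illustration, not the Galois runtime API): each side-effect made by an iteration is paired with an action that undoes it, and an aborted iteration replays those actions in reverse order.

import java.util.ArrayDeque;
import java.util.Deque;

class UndoLog {
    private final Deque<Runnable> inverses = new ArrayDeque<>();

    // record the inverse of a side-effect just performed
    void record(Runnable inverse) { inverses.push(inverse); }

    // iteration aborted: undo its side-effects in reverse order
    void rollback() {
        while (!inverses.isEmpty()) inverses.pop().run();
    }

    // iteration committed: nothing to undo
    void commit() { inverses.clear(); }
}

// Usage: an iteration that performs counter[0] += 5 would also call
//   log.record(() -> counter[0] -= 5);
// so the runtime can restore the old state if the iteration is rolled back.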
17 Problem with catch and keep
- Poor performance for programs that deal with mutable collections and accumulators
- work-sets are mutable collections
- accumulators are ubiquitous
- Example: Delaunay refinement
- work-set is a (mutable) collection of bad triangles
- some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
- that thread must retain the lock on the work-set till the iteration completes, which shuts out all other threads (same problem arises with transactional memory)
- Lesson
- for some objects, we need to interleave method invocations from different iterations
- but must not lose serializability
[Figure: iterations i and j contending for the locked work-set object]
18 Galois solution: selective catch and release
- Example: accumulator with two methods
- add(int)
- read()   // return current value
- adds commute with other adds, and reads commute with other reads
- Interleaving of commuting method invocations from different iterations → OK
- Interleaving of non-commuting method invocations from different iterations → trigger abort
- Rolling back side-effects: programmer must provide inverse methods for forward methods
- inverse method for add(n) is subtract(n)
- semantic inverse, not representational inverse
- This solution works for sets as well (a sketch follows the figure below)
[Figure: two iterations interleaving a.add(5), a.add(3), a.add(-4), a.add(8), and a.read() on a shared accumulator]
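A minimal Java sketch of such an accumulator class (the class itself is illustrative; only add, read, and the inverse subtract come from the slide):

class Accumulator {
    private int value = 0;

    // forward method: commutes with other add() invocations
    void add(int n) { value += n; }

    // semantic inverse of add(n): used to roll back an aborted iteration
    void subtract(int n) { value -= n; }

    // forward method: commutes with other read() invocations, but not with add()
    int read() { return value; }
}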
19 Abstraction-based Speculation
- Library writer
- specifies commutativity and inverse information for some classes
- Runtime system
- catch-and-release locking for these classes
- keeps track of forward method invocations
- checks commutativity of forward method invocations (see the sketch below)
- invokes appropriate inverse methods on abort
- More details: PLDI 2007
- Related work
- logical locks in database systems
- Herlihy et al., PPoPP 2008
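A rough sketch of the bookkeeping described above, using the accumulator's commutativity rule; all names are assumptions for illustration, and the actual mechanism is described in the PLDI 2007 paper.

import java.util.ArrayList;
import java.util.List;

class CommutativityChecker {
    private static final class Invocation {
        final int iteration;
        final String method;
        Invocation(int iteration, String method) { this.iteration = iteration; this.method = method; }
    }

    // forward method invocations of all currently executing iterations
    private final List<Invocation> outstanding = new ArrayList<>();

    // accumulator rule: add/add and read/read commute, add/read do not
    private boolean commutes(String m1, String m2) { return m1.equals(m2); }

    // record a forward invocation; returns false if the calling iteration must abort
    synchronized boolean onInvoke(int iteration, String method) {
        for (Invocation inv : outstanding) {
            if (inv.iteration != iteration && !commutes(inv.method, method)) {
                return false;   // non-commuting invocation from a concurrent iteration
            }
        }
        outstanding.add(new Invocation(iteration, method));
        return true;
    }

    // on commit or abort, discard the iteration's outstanding invocations
    synchronized void release(int iteration) {
        outstanding.removeIf(inv -> inv.iteration == iteration);
    }
}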
20 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
21 Scheduling iterators
- Control scheduling by changing the implementation of the work-set class (see the sketch below)
- stack, queue, etc.
- Scheduling can have a profound effect on performance
- Example: Delaunay mesh refinement
- 10,156 triangles, of which 4,837 were bad
- sequential code, work-set is a stack: 21,918 iterations completed, 0 aborted
- 4-processor Itanium 2, work-set implementations:
- stack: 21,736 iterations completed, 28,290 aborted
- array with random choice: 21,908 iterations completed, 49 aborted
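A minimal sketch of how the work-set implementation determines the schedule (illustrative Java, not the Galois library): the iteration body never changes, only the add/get policy of the work-set does.

import java.util.ArrayDeque;
import java.util.Deque;

interface WorkSet<T> {
    void add(T item);
    T get();            // returns null when the work-set is empty
}

// LIFO stack: tends to give good temporal locality
class StackWorkSet<T> implements WorkSet<T> {
    private final Deque<T> items = new ArrayDeque<>();
    public void add(T item) { items.push(item); }
    public T get() { return items.poll(); }
}

// FIFO queue: a different schedule for the same iterator
class QueueWorkSet<T> implements WorkSet<T> {
    private final Deque<T> items = new ArrayDeque<>();
    public void add(T item) { items.addLast(item); }
    public T get() { return items.pollFirst(); }
}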
22 Scheduling iterators (SPAA 2008)
- Crystalline data-parallelism: DO-ALL loops
- main scheduling concerns are locality and load-balancing
- OpenMP: static, dynamic, guided, etc.
- Amorphous data-parallelism: many more issues
- conflicts
- dynamically created work
- algorithmic issues: efficiency of data structures
- SPAA 2008 paper
- scheduling framework for exploiting amorphous data-parallelism
- generalizes OpenMP DO-ALL loop scheduling constructs
23 Data Partitioning (ASPLOS 2008)
- Partition the graph between cores
- Data-centric assignment of work
- core gets bad triangles from its own partitions
- improves locality
- can dramatically reduce conflicts
- Lock coarsening
- associate locks with partitions
- Over-decomposition
- improves core utilization
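A minimal sketch of these ideas in Java (all names are assumptions for illustration): one lock per partition instead of per-element locks, and each core draws work only from the partitions assigned to it, with more partitions than cores to allow over-decomposition.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class Triangle { }

class Partition {
    final ReentrantLock lock = new ReentrantLock();   // lock coarsening: one lock per partition
    final List<Triangle> badTriangles = new ArrayList<>();
}

class DataCentricAssignment {
    // data-centric work assignment: core 'core' owns every numCores-th partition;
    // over-decomposition means all.size() is several times numCores
    static List<Partition> partitionsForCore(int core, List<Partition> all, int numCores) {
        List<Partition> mine = new ArrayList<>();
        for (int i = core; i < all.size(); i += numCores) {
            mine.add(all.get(i));
        }
        return mine;
    }
}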
24 Organization
- Case study of amorphous data-parallelism
- Delaunay mesh refinement
- Galois system (PLDI 2007)
- Programming model
- Baseline implementation
- Galois system optimizations
- Scheduling (SPAA 2008)
- Data and computation partitioning (ASPLOS 2008)
- Experimental results
- Ongoing work
25 Small-scale multiprocessor results
- 4-processor Itanium 2
- 16 KB L1, 256 KB L2, 3 MB L3 cache
- Versions
- GAL: using stack as worklist
- PAR: partitioned mesh, data-centric work assignment
- LCO: locks on partitions
- OVD: over-decomposed version (factor of 4)
26 Large-scale multiprocessor results
- Maverick at TACC
- 128-core Sun Fire E25K, 1 GHz
- 64 dual-core processors
- Sun Solaris
- First out-of-the-box results
- speed-up of 20 on 32 cores for refinement
- Mesh partitioning is still sequential
- time for mesh partitioning starts to dominate after 8 processors (32 partitions)
- need parallel mesh partitioning: Par-Metis (Karypis et al.)
27 Galois version of mesh refinement

Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());          // initialize the Set wl
for each e in Set wl do {             // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());         // add new bad triangles to Set
}

[Figure: the same code running on a partitioned work-set and partitioned graph, managed by the Galois runtime system]
28 Results for BK Image Segmentation
- Versions
- GAL: standard Galois version
- PAR: partitioned graph
- LCO: locks on partitions
- OVD: over-decomposed version
29 Related work
- Transactions
- programming model is explicitly parallel
- assumes someone else is responsible for parallelism, locality, load-balancing, and scheduling, and focuses only on synchronization
- Galois: main concerns are parallelism, locality, load-balancing, and scheduling
- catch-and-keep classes can use TM for roll-back, but this is probably overkill
- Thread-level speculation
- not clear where to speculate in C programs
- wastes power in useless speculation
- many schemes require extensive hardware support
- no notion of abstraction-based speculation
- no analogs of data partitioning or scheduling
- overall results are disappointing
30 Ongoing work
- Case studies of irregular programs
- understand parallelism and locality patterns in irregular programs
- Lonestar benchmark suite for irregular programs
- joint work with Calin Cascaval's group at IBM Yorktown Heights
- Optimizing the Galois runtime system
- improve performance for iterators in which work per iteration is relatively low
- Compiler analysis to reduce overheads of optimistic parallel execution
- Scalability studies
- larger number of cores
- Distributed-memory implementation
- billion-element meshes?
- Program analysis to verify assertions about class methods
- need a semantic specification of the class
31 Summary
- Irregular applications have amorphous data-parallelism
- work-list based iterative algorithms over unordered and ordered sets
- amorphous data-parallelism may be inherently data-dependent
- Pointer/shape analysis cannot work for these apps
- optimistic parallelization is essential for such apps
- analysis might be useful to optimize parallel program execution
- Exploiting abstractions and high-level semantics is critical
- Galois knows about sets, ordered sets, accumulators
- Galois approach provides a unified view of data-parallelism in regular and irregular programs
- baseline is optimistic parallelism
- use compiler analysis to make decisions at compile-time whenever possible