The Galois Project

Description:

Title: Optimizing Code for Memory Hierarchies Author: pingali Last modified by: Keshav Pingali Created Date: 1/31/2006 1:55:28 AM Document presentation format – PowerPoint PPT presentation

Transcript and Presenter's Notes


1
The Galois Project
  • Keshav Pingali
  • University of Texas, Austin

Joint work with Milind Kulkarni, Martin
Burtscher, Patrick Carribault, Donald Nguyen,
Dimitrios Prountzos, Zifei Zhong
2
Overview of Galois Project
  • Focus of Galois project
  • parallel execution of irregular programs
  • pointer-based data structures like graphs and
    trees
  • raise abstraction level for Joe programmers
  • explicit parallelism is too difficult for most
    programmers
  • performance penalty for abstraction should be
    small
  • Research approach
  • study algorithms to find common patterns of
    parallelism and locality
  • - Amorphous Data-Parallelism (ADP)
  • design programming constructs for expressing
    these patterns
  • implement these constructs efficiently
  • - Abstraction-Based Speculation (ABS)
  • For more information
  • papers in PLDI 2007, ASPLOS 2008, SPAA 2008
  • website: http://iss.ices.utexas.edu

3
Organization
  • Case study of amorphous data-parallelism
  • Delaunay mesh refinement
  • Galois system (PLDI 2007)
  • Programming model
  • Baseline implementation
  • Galois system optimizations
  • Scheduling (SPAA 2008)
  • Data and computation partitioning (ASPLOS 2008)
  • Experimental results
  • Ongoing work

4
Delaunay Mesh Refinement
  • Iterative refinement to remove badly shaped
    triangles
  • while there are bad triangles do
  • Pick a bad triangle
  • Find its cavity
  • Retriangulate cavity
  • // may create new bad triangles
  • Order in which bad triangles should be refined
  • final mesh depends on order in which bad
    triangles are processed
  • but all bad triangles will be eliminated
    ultimately regardless of order

Mesh m = /* read in mesh */;
WorkList wl;
wl.add(m.badTriangles());
while (true) {
  if (wl.empty()) break;
  Triangle e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();                 // determine cavity
  c.retriangulate();          // re-triangulate cavity
  m.update(c);                // update mesh
  wl.add(c.badTriangles());
}
5
Delaunay Mesh Refinement
  • Parallelism
  • triangles with non-overlapping cavities can be
    processed in parallel
  • if cavities of two triangles overlap, they must
    be done serially
  • in practice, lots of parallelism
  • Exploiting this parallelism
  • compile-time parallelization techniques like
    points-to and shape analysis cannot expose this
    parallelism (property of algorithm, not program)
  • runtime dependence checking is needed
  • Galois approach: optimistic parallelization

6
Take-away lessons
  • Amorphous data-parallelism
  • data structures: graphs, trees, etc.
  • iterative algorithm over unordered or ordered
    work-list
  • elements can be added to work-list during
    computation
  • complex patterns of dependences between
    computations on different work-list elements
    (possibly input-sensitive)
  • but many of these computations can be done in
    parallel
  • Contrast: crystalline (regular) data-parallelism
  • data structures: dense matrices
  • iterative algorithm over fixed integer interval
  • simple dependence patterns: affine subscripts in
    array accesses
  • (mostly input-insensitive)
  • for i = 1, N
  •   for j = 1, N
  •     for k = 1, N
  •       C[i,j] = C[i,j] + A[i,k] * B[k,j]
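
For comparison with the work-list code, the crystalline loop nest above can be written as a minimal Java sketch. The class and method names are illustrative and not from the talk. The iteration space is a fixed integer box and every subscript is an affine function of the loop indices, which is what makes compile-time dependence analysis feasible here.

```java
import java.util.Arrays;

// Illustrative sketch of the crystalline (regular) loop nest; names are hypothetical.
class MatMul {
    // Computes C = C + A * B for square N x N matrices.
    static void multiplyAdd(double[][] c, double[][] a, double[][] b) {
        int n = c.length;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        multiplyAdd(c, a, b);
        System.out.println(Arrays.deepToString(c)); // prints [[19.0, 22.0], [43.0, 50.0]]
    }
}
```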

7
Take-away lessons (contd.)
  • Amorphous data-parallelism is ubiquitous
  • Delaunay mesh generation: points to be inserted
    into mesh
  • Delaunay mesh refinement: list of bad triangles
  • Agglomerative clustering: priority queue of
    points from data-set
  • Boykov-Kolmogorov algorithm for image
    segmentation
  • Reduction-based interpreters for λ-calculus: list
    of redexes
  • Iterative dataflow analysis algorithms in
    compilers
  • Approximate SAT solvers: survey propagation,
    WalkSAT

8
Take-away lessons (contd.)
  • Amorphous data-parallelism is obscured within
    while loops, exit conditions, etc. in
    conventional languages
  • Need transparent syntax similar to FOR loops for
    crystalline data-parallelism
  • Optimistic parallelization is necessary in
    general
  • Compile-time approaches using points-to analysis
    or shape analysis may be adequate for some cases
  • In general, runtime dependence checking is needed
  • Property of algorithms, not programs
  • Handling of dependence conflicts depends on the
    application
  • Delaunay mesh generation: roll back any
    conflicting computation
  • Agglomerative clustering: must respect priority
    queue order

9
Organization
  • Case study of amorphous data-parallelism
  • Delaunay mesh refinement
  • Galois system (PLDI 2007)
  • Programming model
  • Baseline implementation
  • Galois system optimizations
  • Scheduling (SPAA 2008)
  • Data and computation partitioning (ASPLOS 2008)
  • Experimental results
  • Ongoing work

10
Galois Design Philosophy
  • Do not worry about dusty decks (for now)
  • Restructuring existing code to expose amorphous
    data-parallelism not our focus
  • (cf. Google map/reduce)
  • Evolution, not revolution
  • Modification of existing programming paradigms
    OK
  • Radical solutions like functional programming
    not OK
  • No reliance on parallelizing compiler technology
  • will not work for many of our applications anyway
  • parallelizing compilers are very complex software
    artifacts
  • Support two classes of programmers
  • domain experts (Joe)
  • should be shielded from complexities of parallel
    programming
  • most programmers will be Joes
  • parallel programming experts (Steve)
  • small number of highly trained people
  • analogs
  • industry model even for sequential programming
  • norm in domains like numerical linear algebra
  • Steves implement BLAS libraries

11
Galois system
  • Application program
  • Has well-defined sequential semantics
  • current implementation: sequential Java
  • Uses optimistic iterators to highlight for the
    runtime system opportunities for exploiting
    parallelism
  • Class libraries
  • Like Java collections library but with additional
    information for concurrency control
  • Runtime system
  • Managing optimistic parallelism

12
Optimistic set iterators
  • for each e in Set S do B(e)
  • evaluate block B(e) for each element in set S
  • sequential semantics
  • set elements are unordered, so no a priori order
    on iterations
  • there may be dependences between iterations
  • set S may get new elements during execution
  • for each e in OrderedSet S do B(e)
  • evaluate block B(e) for each element in set S
  • sequential semantics
  • perform iterations in order specified by
    OrderedSet
  • there may be dependences between iterations
  • set S may get new elements during execution
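
The sequential semantics of the unordered iterator can be sketched in plain Java. This is a hedged illustration, not the Galois API: `UnorderedWorkSet`, its `forEach` method, and `IteratorDemo` are hypothetical names. The body receives the set itself so that an iteration may add new elements while the loop is running, and the loop ends only when the set is empty.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BiConsumer;

// Hypothetical stand-in for "for each e in Set S do B(e)" with sequential semantics.
class UnorderedWorkSet<T> {
    // Any removal order is legal for an unordered set; this sketch happens to use LIFO.
    private final Deque<T> items = new ArrayDeque<>();

    void add(T item) {
        items.push(item);
    }

    // Runs body on every element, including elements the body adds along the way.
    void forEach(BiConsumer<T, UnorderedWorkSet<T>> body) {
        while (!items.isEmpty()) {
            T e = items.pop();
            body.accept(e, this);
        }
    }
}

class IteratorDemo {
    // Starts with {8}; each iteration on n > 1 adds n / 2 as new work.
    static int run() {
        UnorderedWorkSet<Integer> ws = new UnorderedWorkSet<>();
        ws.add(8);
        int[] iterations = {0};
        ws.forEach((n, set) -> {
            iterations[0]++;
            if (n > 1) {
                set.add(n / 2);
            }
        });
        return iterations[0]; // elements processed: 8, 4, 2, 1
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 4
    }
}
```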

13
Galois version of mesh refinement
Mesh m = /* read in mesh */;
Set wl;
wl.add(m.badTriangles());        // initialize the Set wl
for each e in Set wl do {        // unordered Set iterator
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  m.update(c);
  wl.add(c.badTriangles());      // add new bad triangles to Set
}
  • Scheduling policy for iterator
  • controlled by implementation of Set class
  • good choice for temporal locality: stack

14
Parallel execution model
  • Object-based shared-memory model
  • Master thread and some number of worker threads
  • master thread begins execution of program and
    executes code between iterators
  • when it encounters iterator, worker threads help
    by executing iterations concurrently with master
  • threads synchronize by barrier synchronization at
    end of iterator
  • Threads invoke methods to access internal state
    of objects
  • how do we ensure sequential semantics of program
    are respected?
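
The thread structure described above can be sketched as follows. This is only an illustration of the master/worker shape with a join acting as the barrier at the end of the iterator. It performs no conflict detection or rollback, so on its own it does not preserve sequential semantics for iterations that touch shared objects. `MasterWorker` and `forEachParallel` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch: master and workers drain one shared work-set, then meet at a barrier.
class MasterWorker {
    static <T> void forEachParallel(ConcurrentLinkedQueue<T> workSet,
                                    int numWorkers,
                                    Consumer<T> body) {
        Runnable drain = () -> {
            T e;
            while ((e = workSet.poll()) != null) {
                body.accept(e); // the body may add new elements to workSet
            }
        };
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) {
            Thread t = new Thread(drain);
            workers.add(t);
            t.start();
        }
        drain.run(); // the master executes iterations alongside the workers
        for (Thread t : workers) {
            try {
                t.join(); // barrier at the end of the iterator
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", ex);
            }
        }
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<Integer> work = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= 100; i++) {
            work.add(i);
        }
        AtomicInteger sum = new AtomicInteger();
        forEachParallel(work, 3, x -> sum.addAndGet(x));
        System.out.println(sum.get());
    }
}
```

Work added by a body is never lost in this sketch: the thread that adds an element is still inside its drain loop and will poll again before it exits.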

[Figure: a program (main() containing for-each iterators) run by a master thread and worker threads that access objects in shared memory]
15
Baseline solution: PLDI 2007
  • Iteration must lock object to invoke method
  • Two types of objects
  • catch and keep policy
  • lock is held even after method invocation
    completes
  • all locks released at end of iteration
  • poor performance for programs with collections
    and accumulators
  • catch and release policy
  • like Java locking policy
  • permits method invocations from different
    concurrent iterations to be interleaved
  • how do we make sure this is safe?

[Figure: iterations i and j accessing objects 1, 2, 3 over time]
16
Catch and keep: iteration rollback
  • What does iteration j do if object is already
    locked by some other iteration i?
  • one possibility: wait and try to acquire lock
    again
  • but this might lead to deadlock
  • our implementation: runtime system rolls back one
    of the iterations by undoing its updates to
    shared objects
  • Undoing updates: any copy-on-write solution
    works
  • Make a copy of entire object when you acquire the
    lock (wasteful for large objects)
  • Runtime system maintains an undo log that
    holds information for undoing side-effects to
    objects as they are made (cf. software
    transactional memory)
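
The undo-log option can be illustrated with a small Java sketch. It is an assumption-laden toy, not the Galois runtime: `UndoLog` and `Cell` are hypothetical names, and locking is omitted. Each update records how to reverse itself before it is applied, and rollback replays those records in reverse order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-iteration undo log: most recent update is undone first.
class UndoLog {
    private final Deque<Runnable> undoActions = new ArrayDeque<>();

    void record(Runnable undo) {
        undoActions.push(undo);
    }

    // Called when the iteration is aborted.
    void rollback() {
        while (!undoActions.isEmpty()) {
            undoActions.pop().run();
        }
    }

    // Called when the iteration completes: its updates become permanent.
    void commit() {
        undoActions.clear();
    }
}

// Hypothetical shared object whose updates are logged as they are made.
class Cell {
    private int value;

    Cell(int initial) {
        value = initial;
    }

    int get() {
        return value;
    }

    void set(int newValue, UndoLog log) {
        int old = value;
        log.record(() -> value = old); // remember how to restore the old state
        value = newValue;
    }

    public static void main(String[] args) {
        Cell cell = new Cell(1);
        UndoLog log = new UndoLog();
        cell.set(2, log);
        cell.set(3, log);
        log.rollback();
        System.out.println(cell.get()); // prints 1
    }
}
```

Compared with copying the whole object on lock acquisition, the log costs space proportional to the number of updates rather than to the size of the object.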

[Figure: iterations i and j accessing objects 1, 2, 3 in shared memory over time]
17
Problem with catch and keep
  • Poor performance for programs that deal with
    mutable collections and accumulators
  • work-sets are mutable collections
  • accumulators are ubiquitous
  • Example: Delaunay refinement
  • Work-set is a (mutable) collection of bad
    triangles
  • Some thread grabs lock on work-set object, gets a
    bad triangle and removes it from the work-set
  • That thread must retain the lock on the work-set
    till iteration completes, which shuts out all
    other threads (same problem arises with
    transactional memory)
  • Lesson
  • For some objects, we need to interleave method
    invocations from different iterations
  • But must not lose serializability

[Figure: iterations i and j accessing objects 1, 2, 3, 4]
18
Galois solution: selective catch and release
  • Example: accumulator
  • two methods
  • add(int)
  • read() // return value
  • adds commute with other adds and reads commute
    with other reads
  • Interleaving of commuting method invocations from
    different iterations
  • → OK
  • Interleaving of non-commuting method invocations
    from different iterations
  • → trigger abort
  • Rolling back side-effects: programmer must
    provide inverse methods for forward methods
  • Inverse method for add(n) is subtract(n)
  • Semantic inverse, not representational inverse
  • This solution works for sets as well.

[Figure: an Accumulator object in shared memory (values shown: 0, 5, 8, 3) with two sequences of method invocations: a.add(5), a.add(3), a.add(8), a.read(), a.add(-4) and a.add(5), a.add(3), a.add(-4), a.add(8), a.read()]
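
The accumulator rules can be sketched as single-threaded bookkeeping in Java. Everything here is an assumption made for illustration: the class name, the integer iteration ids, and the choice to report a conflict through the return value instead of aborting inside the runtime. Adds from different iterations interleave freely, an add and a read from different uncommitted iterations conflict, and abort applies the semantic inverse by subtracting what the iteration added.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.OptionalInt;
import java.util.Set;

// Hypothetical sketch of commutativity checks and inverse methods for an accumulator.
class SpeculativeAccumulator {
    private int value = 0;
    // Net amount added by each uncommitted iteration, kept so it can be undone.
    private final Map<Integer, Integer> pendingAdds = new HashMap<>();
    // Uncommitted iterations that have read the value.
    private final Set<Integer> pendingReaders = new HashSet<>();

    // add commutes with other adds, but not with a read by another live iteration.
    boolean add(int iteration, int n) {
        for (int reader : pendingReaders) {
            if (reader != iteration) {
                return false; // conflict: caller should abort one iteration
            }
        }
        value += n;
        pendingAdds.merge(iteration, n, Integer::sum);
        return true;
    }

    // read commutes with other reads, but not with an add by another live iteration.
    OptionalInt read(int iteration) {
        for (int adder : pendingAdds.keySet()) {
            if (adder != iteration) {
                return OptionalInt.empty(); // conflict
            }
        }
        pendingReaders.add(iteration);
        return OptionalInt.of(value);
    }

    // Semantic inverse of add(n) is subtract(n): undo the iteration's net contribution.
    void abort(int iteration) {
        Integer net = pendingAdds.remove(iteration);
        if (net != null) {
            value -= net;
        }
        pendingReaders.remove(iteration);
    }

    void commit(int iteration) {
        pendingAdds.remove(iteration);
        pendingReaders.remove(iteration);
    }

    public static void main(String[] args) {
        SpeculativeAccumulator a = new SpeculativeAccumulator();
        System.out.println(a.add(1, 5));             // true
        System.out.println(a.add(2, 3));             // true: adds commute
        System.out.println(a.read(2).isPresent());   // false: iteration 1 has an uncommitted add
        a.abort(2);                                  // subtracts the 3 added by iteration 2
        a.commit(1);
        System.out.println(a.read(2).getAsInt());    // 5
    }
}
```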
19
Abstraction-based Speculation
  • Library writer
  • specifies commutativity and inverse information
    for some classes
  • Runtime system
  • catch and release locking for these classes
  • keeps track of forward method invocations
  • checks commutativity of forward method
    invocations
  • invokes appropriate inverse methods on abort
  • More details: PLDI 2007
  • Related work
  • logical locks in database systems
  • Herlihy et al., PPoPP 2008

20
Organization
  • Case study of amorphous data-parallelism
  • Delaunay mesh refinement
  • Galois system (PLDI 2007)
  • Programming model
  • Baseline implementation
  • Galois system optimizations
  • Scheduling (SPAA 2008)
  • Data and computation partitioning (ASPLOS 2008)
  • Experimental results
  • Ongoing work

21
Scheduling iterators
  • Control scheduling by changing implementation of
    work-set class
  • stack/queue/etc.
  • Scheduling can have a profound effect on
    performance
  • Example: Delaunay mesh refinement
  • 10,156 triangles of which 4,837 were bad
  • sequential code, work-set is stack
  • 21,918 completed iterations, 0 aborted
  • 4-processor Itanium-2, work-set implementations
  • stack: 21,736 iterations completed, 28,290 aborted
  • array with random choice: 21,908 iterations
    completed, 49 aborted
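
A rough Java sketch of how the scheduling policy can live entirely in the work-set class is given below. The interface and class names are hypothetical and this is not the Galois library. The iterator only calls `get()`, so swapping a stack for an array with random choice changes the schedule without touching the loop body. The `abortRatio` helper reproduces the arithmetic on the counts above: with the stack, 28,290 of 50,026 attempted iterations aborted (a bit over half), while random choice aborted 49 of 21,957.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical work-set interface: the iterator only ever asks for "some" element.
interface WorkSet<T> {
    void add(T item);
    T get();            // removes and returns an element chosen by the policy
    boolean isEmpty();
}

// LIFO policy: good temporal locality, but neighbouring work tends to conflict.
class StackWorkSet<T> implements WorkSet<T> {
    private final List<T> items = new ArrayList<>();

    public void add(T item) { items.add(item); }
    public T get() { return items.remove(items.size() - 1); }
    public boolean isEmpty() { return items.isEmpty(); }
}

// Array with random choice: spreads concurrent iterations across the data structure.
class RandomWorkSet<T> implements WorkSet<T> {
    private final List<T> items = new ArrayList<>();
    private final Random random;

    RandomWorkSet(long seed) { random = new Random(seed); }

    public void add(T item) { items.add(item); }

    public T get() {
        int i = random.nextInt(items.size());
        int last = items.size() - 1;
        T picked = items.get(i);
        items.set(i, items.get(last)); // swap-remove keeps removal O(1)
        items.remove(last);
        return picked;
    }

    public boolean isEmpty() { return items.isEmpty(); }
}

class SchedulingStats {
    // Fraction of attempted iterations that were aborted.
    static double abortRatio(long completed, long aborted) {
        return (double) aborted / (completed + aborted);
    }

    public static void main(String[] args) {
        System.out.printf("stack:  %.4f%n", abortRatio(21_736, 28_290));
        System.out.printf("random: %.4f%n", abortRatio(21_908, 49));
    }
}
```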

22
Scheduling iterators (SPAA 2008)
  • Crystalline data-parallelism: DO-ALL loops
  • main scheduling concerns are locality and
    load-balancing
  • OpenMP: static, dynamic, guided, etc.
  • Amorphous data-parallelism: many more issues
  • Conflicts
  • Dynamically created work
  • Algorithmic issues: efficiency of data structures
  • SPAA 2008 paper
  • Scheduling framework for exploiting amorphous
    data-parallelism
  • Generalizes OpenMP DO-ALL loop scheduling
    constructs

23
Data Partitioning (ASPLOS 2008)
  • Partition the graph between cores
  • Data-centric assignment of work
  • core gets bad triangles from its own partitions
  • improves locality
  • can dramatically reduce conflicts
  • Lock coarsening
  • associate locks with partitions
  • Over-decomposition
  • improves core utilization
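
The three ideas on this slide can be sketched together in Java. This is a simplified illustration with hypothetical names. In particular, it assigns nodes to partitions by node id modulo the partition count, which is only a placeholder: it does not preserve locality the way a real graph partitioner would. What it does show is over-decomposition (more partitions than cores), data-centric assignment (each partition belongs to one core), and lock coarsening (one lock per partition instead of one per element).

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of partition ownership and partition-level locks.
class PartitionMap {
    private final int numCores;
    private final int numPartitions;
    private final ReentrantLock[] partitionLocks;

    // overDecomposition = partitions per core, e.g. 4.
    PartitionMap(int numCores, int overDecomposition) {
        this.numCores = numCores;
        this.numPartitions = numCores * overDecomposition;
        this.partitionLocks = new ReentrantLock[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            partitionLocks[p] = new ReentrantLock();
        }
    }

    int numPartitions() {
        return numPartitions;
    }

    // Placeholder assignment (assumption): a real system would partition the graph itself.
    int partitionOf(int nodeId) {
        return Math.floorMod(nodeId, numPartitions);
    }

    // Data-centric work assignment: a core takes work from the partitions it owns.
    int coreOf(int partition) {
        return partition % numCores;
    }

    // Lock coarsening: touching a node means locking its whole partition.
    boolean tryLockFor(int nodeId) {
        return partitionLocks[partitionOf(nodeId)].tryLock();
    }

    void unlockFor(int nodeId) {
        partitionLocks[partitionOf(nodeId)].unlock();
    }

    public static void main(String[] args) {
        PartitionMap map = new PartitionMap(4, 4);
        int p = map.partitionOf(35);
        System.out.println("node 35 -> partition " + p + " -> core " + map.coreOf(p));
    }
}
```

With several partitions per core, a core whose current partition is locked by a neighbour can switch to another of its partitions instead of idling.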

24
Organization
  • Case study of amorphous data-parallelism
  • Delaunay mesh refinement
  • Galois system (PLDI 2007)
  • Programming model
  • Baseline implementation
  • Galois system optimizations
  • Scheduling (SPAA 2008)
  • Data and computation partitioning (ASPLOS 2008)
  • Experimental results
  • Ongoing work

25
Small-scale multiprocessor results
  • 4-processor Itanium 2
  • 16 KB L1, 256 KB L2, 3 MB L3 cache
  • Versions
  • GAL: using stack as worklist
  • PAR: partitioned mesh, data-centric work
    assignment
  • LCO: locks on partitions
  • OVD: over-decomposed version (factor of 4)

26
Large-scale multiprocessor results
  • Maverick@TACC
  • 128-core Sun Fire E25K, 1 GHz
  • 64 dual-core processors
  • Sun Solaris
  • First out-of-the-box results
  • Speed-up of 20 on 32 cores for refinement
  • Mesh partitioning is still sequential
  • time for mesh partitioning starts to dominate
    after 8 processors (32 partitions)
  • Need parallel mesh partitioning
  • ParMETIS (Karypis et al.)

27
Galois version of mesh refinement
Mesh m = /* read in mesh */;
Set wl;
wl.add(m.badTriangles());        // initialize the Set wl
for each e in Set wl do {        // unordered Set iterator
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  m.update(c);
  wl.add(c.badTriangles());      // add new bad triangles to Set
}
[Figure labels: Partitioned Work-set, Partitioned Graph, Galois runtime system]
28
Results for BK Image Segmentation
  • Versions
  • GAL: standard Galois version
  • PAR: partitioned graph
  • LCO: locks on partitions
  • OVD: over-decomposed version

29
Related work
  • Transactions
  • programming model is explicitly parallel
  • assumes someone else is responsible for
    parallelism, locality, load-balancing, and
    scheduling, and focuses only on synchronization
  • Galois: main concerns are parallelism, locality,
    load-balancing, and scheduling
  • catch and keep classes can use TM for roll-back
    but this is probably overkill
  • Thread level speculation
  • not clear where to speculate in C programs
  • wastes power in useless speculation
  • many schemes require extensive hardware support
  • no notion of abstraction-based speculation
  • no analogs of data partitioning or scheduling
  • overall results are disappointing

30
Ongoing work
  • Case studies of irregular programs
  • understand parallelism and locality patterns in
    irregular programs
  • Lonestar benchmark suite for irregular programs
  • joint work with Calin Cascaval's group at IBM
    Yorktown Heights
  • Optimizing the Galois runtime system
  • improve performance for iterators in which
    work/iteration is relatively low
  • Compiler analysis to reduce overheads of
    optimistic parallel execution
  • Scalability studies
  • larger number of cores
  • Distributed-memory implementation
  • billion element meshes?
  • Program analysis to verify assertions about class
    methods
  • Need a semantic specification of class

31
Summary
  • Irregular applications have amorphous
    data-parallelism
  • Work-list based iterative algorithms over
    unordered and ordered sets
  • Amorphous data-parallelism may be inherently
    data-dependent
  • Pointer/shape analysis cannot work for these apps
  • Optimistic parallelization is essential for such
    apps
  • Analysis might be useful to optimize parallel
    program execution
  • Exploiting abstractions and high-level semantics
    is critical
  • Galois knows about sets, ordered sets,
    accumulators
  • Galois approach provides unified view of
    data-parallelism in regular and irregular
    programs
  • Baseline is optimistic parallelism
  • Use compiler analysis to make decisions at
    compile-time whenever possible