Title: 6.895/SMA5509 Project Presentations
16.895/SMA5509 Project Presentations
2Improving Cilk
3Adaptively Parallel Processor Allocation for Cilk
Jobs
- Kunal Agrawal
- Siddhartha Sen
4Goal
- Design and implement a dynamic processor-allocatio
n system for adaptively parallel jobs (jobs for
which the number of processors that can be used
without waste varies during execution) - The problem of allocating processors to
adaptively parallel jobs is called the adaptively
parallel processor-allocation problem 2. - At any given time, each job j has a desire dj and
an allotment mj
5Goal (cont.)
- We want to design a processor-allocation system
that achieves a fair and efficient allocation
among all jobs. - fair means that whenever a job receives fewer
processors than it desires, no other job receives
more than one more processor than this job
received 2 - efficient means that no job receives more
processors than it desires 2
6Illustration of Problem
dj 6 mj 5
dk 10 mk 6
? ? ? ?
P processors
7Main Algorithmic Questions
- Dynamically determine the desires of each job in
the system - Heuristics on steal rate
- Heuristics on number of visible stack frames
- Heuristics on number of threads in ready deques
- Dynamically determine the allotment for each job
such that the resulting allocation is fair and
efficient - SRLBA algorithm 2
- Cilk macroscheduler algorithm 3
- Lottery scheduling techniques 5
8Assumptions
- All jobs on the system are Cilk jobs
- Jobs can enter and leave the system and change
their parallelism during execution - All jobs are mutually trusting (they will stay
within the bound of their allotments and
communicate their desires honestly) - Each job has at least one processor to start with
9References
- Supercomputing Technologies Group. Cilk 5.3.2
Reference Manual. MIT Lab for Computer Science,
November 2001. - B. Song. Scheduling adaptively parallel jobs.
Master's thesis, Massachusetts Institute of
Technology, January 1998. - R. D. Blumofe, C. E. Leiserson, and B. Song.
Automatic processor allocation for work-stealing
jobs. - C. A. Waldspurger. Lottery and Stride Scheduling
Flexible Proportional-Share Resource Management.
PhD thesis, Massachusetts Institute of
Technology, September 1995. - C. A. Waldspurger and W. E. Weihl. Lottery
scheduling Flexible proportional-share resource
management. In Proceedings of the First Symposium
on Operating System Design and Implementation,
pages 1-11. USENIX, November 1994. - R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E.
Leiserson, K. H. Randall, and Y. Zhou. Cilk An
ecient multithreaded runtime system. In
Proceedings of the Fifth ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming
(PPoPP), pages 207-216, Santa Barbara,
California, July 1995.
10Fast Serial-Append for Cilk
- 6.895 Theory of Parallel Systems
- Project Presentation (Proposal)
- Alexandru Caracas
11Serial-Append (cont'd)
- A Cilk dag
- Numbers represent the serial execution of threads
- For example
- Consider all threads perform file I/O operations
12Serial-Append (cont'd)
- Same Cilk dag
- Numbers represent the parallel execution of
threads - Goal
- We would like to execute the threads in parallel
while still allowing for the serial order to be
reconstructed.
13Serial-Append (cont'd)
- Partitioning of the Cilk dag
- Colors represent the I/O operations done by a
processor between two steal operations.
14Serial-Append
- The Serial-Append of the computation is obtained
by reading the output in the order - orange (I)
- green (II)
- dark-red (III)
- purple (IV)
- red (V)
15Project Goals
- Efficient serial-append algorithm
- possibly using B-trees
- Analyze possible implementation
- kernel level
- formatted file
- Cilk file system
- Implementation
- (one of the above)
- Port Cheerio to Cilk 5.4.
- Comparison with own implementation
- Port serial applications to use serial-append
(analyze their performance)
16References
- Robert D. Blumofe and Charles E. Leiserson.
Scheduling multithreaded computations by work
stealing. In Proceedings of the 35th Annual
Symposium on Foundations of Computer Science,
pages 356-368, Santa Fe, New Mexico, November
1994. - Matthew S. DeBergalis. A parallel file I/O API
for Cilk. Master's thesis, Department of
Electrical Engineering and Computer Science,
Massachusetts Institute of Technology, June 2000. - Supercomputing Technology Group MIT Laboratory
for Computer Science. Cilk 5.3.2 Reference
Manual, November 2001. Available at
http//supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf
.
17A Space-Efficient Global Scheduler for Cilk
- Jason Hickey
- Tyeler Quentmeyer
18Idea
- Implement a new scheduler for Cilk based on
Narlikar and Blellochs Space-Efficient
Scheduling of Nested Parallelism paper
19AsyncDF Scheduler
- Idea minimize peak memory usage by preempting
threads that try to allocate large blocks of
memory - Algorithm Roughly, AsyncDF runs the P left-most
threads in the computation DAG
20Example
- In parallel for i 1 to n
- Temporary Bn
- In parallel for j 1 to n
- F(B,i,j)
- Free B
21Comparison to Cilk
- Cilk
- Distributed
- Run threads like a serial program until a
processor runs out of work and has to steal - Space pS1
- AsyncDF
- Global
- Preempt a thread when it tries to allocate a
large block of memory and replace it with a
computationally heavy thread - Space S1 O(KDp)
22AsyncDF Optmizations
- Two versions of scheduler
- Serial
- Parallel
- Running P left-most threads does not exploit the
locality of a serial program - Solution Look at the cP left-most threads and
group threads together for locality
23Project Plan
- Implement serial version of AsyncDF
- Experiment with algorithms to group threads for
locality - Performance comparison to Cilk
- Different values of AsyncDFs tunable parameter
- Case studies
- Recursive matrix multiplication
- Strassen multiplication
- N-body problem
- Implement parallel version of AsyncDF
- Performance comparison to Cilk
- Additional experiments/modifications as research
suggests
24Questions?
25Automatic conversion of NSP DAGs to SP DAGs
Sharad Ganesh
26- Objectives
- Design of an efficient Algorithm to convert an
arbitrary NSP DAG to SP DAG - - Correctness preserving
- Identifying different graph topologies
- Analysis of Space and Time complexities
- Bound on increase in critical path length
- Perform empirical analysis using cilk
27Transactional Cilk
28Language-Level Complex Transactions C. Scott
Ananian
29An Evaluation of Nested Transactions
- Sean Lie
- sean_lie_at_mit.edu
- 6.895 Theory of Parallel Systems
- Wednesday October 15th, 2003
30Transactional Memory
- The programmer is given the ability to define
atomic regions of arbitrary size. - Start_Transaction
- Flowi Flowi X
- Flowj Flowj X
- End_Transaction
- With hardware support, different transactions can
be executed concurrently when there are no memory
conflicts.
31Nested Transactions
- Nested transactions can be handled by simply
merging all inner transactions into the outermost
transaction.
Start_Transaction .. Start_Transaction ..
End_Transaction .. .. .. End_Transaction
Start_Transaction .. Start_Transaction ..
End_Transaction .. .. .. End_Transaction
32Concurrency
- Merging transactions is simple but may decrease
the program concurrency.
Start_Transaction .. Start_Transaction
Access S End_Transaction .. .. .. End_Transact
ion
Conflict!
Start_Transaction Access S End_Transaction
33Nested Concurrent Transactions
- Nested Concurrent Transactions (NCT)
- Allow the inner transaction to run and complete
independently from the outer transaction. - Questions
- When is NCT actually useful?
- How much overhead is incurred by NCT?
- How much do we gain from using NCT?
- The tool needed to get the answers
- Transactional memory hardware simulator UVSIM
- UVSIM currently supports merging nested
transactions. - How to get the answers
- Identify applications where NCT is useful. Look
at malloc? - Implement necessary infrastructure to run NCTs in
UVSIM. - Evaluate performance of NCT vs. merging
transactions.
34Compiler Support for Atomic Transactions in Cilk
- Jim Sukha
- Tushara Karunaratna
35Determinacy Race Example
36New Keywords in Cilk
It would be convenient if programmers could make
a section of code execute atomically by using
xbegin and xend keywords.
37Simulation of Transactional Memory
38Work Involved
- Modify the Cilk parser to accept xbegin and xend
as keywords. - Identify all load and store operations in user
code, and replace them with calls to functions
from the runtime system. - Implement the runtime system. An initial
implementation is to divide memory into blocks
and to use a hash table to store a lock and
backup values for each block. - Experiment with different runtime implementations.
39Data Race in Transactional Cilk
40Background
- Current Cilk achieve atomicity using lock
- problems such as priority inversion, deadlock,
etc. - Nontrivial to code
- Transactional memory
- Software TM overhead -gt slow
- Hardware TM
41Cilks transaction everywhere
- Every instruction becomes part of the a
transaction - Based on HTM
- Cilk transaction
- cut Cilk program into atomic pieces
- Base on some language construct or compiler
automatic generate
42Data Race
- Very different from traditional definition in
Nondeterminator - Assumption (correct parallelization)
- Definition (1st half of Kais master thesis)
- Efficient detection (v.s. NP-complete proved by
Kai) - Make use of current algorithms/tools
43Non-Determinacy Detection
44Linear-time determinacy race detection
45Motivation
- Nondeterminator has two parts
- Check whether threads are logically parallel
- Use shadow spaces to detect determinacy race
- SP-Bags algorithm uses LCA lookup to determine
whether two threads are parallel - LCA lookup with disjoint-sets data structure
takes O(a(v,v)) time. - We do not care about the LCA. We just want to
know if two threads are logically parallel.
46New algorithm for determinacy-race detection
- Similar to SP-Bags algorithm with two parts
- Check whether threads are logically parallel
- Use shadow spaces to detect determinacy race
- Use order maintenance data-structure to determine
whether threads are parallel - Gain Order maintenance operations are O(1)
amortized time.
47The algorithm
- Maintain two orders
- Regular serial, depth-first, left-to right
execution. - At each spawn, follow spawn thread before
continuation thread - Right-to-left execution.
- At each spawn, follow continuation thread before
spawn thread - Claim Two threads e1 and e2 are parallel if and
only if e1 precedes e2 in one order, and e2
precedes e1 in the other.
48Depth-first, left-to-right execution
e0
e6
e5
e7
F
e1
e4
e2
e3
F3
F1
F2
F2
49Right-to-left execution
e0
e6
e5
e7
F
e1
e4
e2
e3
F3
F1
F2
F2
50How to maintain both orders in serial execution
- Keep two pieces of state for each procedure F
- CF is current thread in F
- SF is next sync statement in F.
- In both orders, insert new threads after current
thread - On spawn, insert continuation thread before spawn
thread in one order. Do the opposite in the
other. - On sync, advance CF to SF
- In any other thread, update shadow spaces based
on current thread
51Example
e0
s1
e2
e5
e4
F
e1
e3
e1
CF
Order
e0
e0
s1
e2
e2
e4
e4
e3
e5
e5
s2
s1
s1
e4
e2
s1
SF
Order
e0
e1
e3
e5
52Project ProposalParallel Nondeterminator
53Objective
- I propose Parallel Nondeterminator to
- Check the determinacy race in the parallel
execution of the program written in the language
like Cilk - Develop efficient algorithm to decide the
concurrency between threads - Develop efficient algorithm to reduce the number
of entries in access history
54Primitive Idea in Concurrency Test
- (1) Labeling Scheme
- (2) Set operation
- Thread representation (fid, tid)
-
55- Two sets
- Parallel set PS(f) (pf, ptid) all threads
with tid gt ptid in function pf is parallel with
the running thread of f - Children set CS(f)fid fid is the descendant
of current function - Operations
- Spawn Thread Tx of function Fi spawn function Fj
- Operations on child Fj
-
-
-
- Operations on parent Fi
56- Sync Function Fi executes sync
- PS(Fi)PS(Fi)-CS(Fi)
- Return Function Fj returns to Function Fi
- CS(Fi) CS(Fi) CS(Fj)
- Release PS(Fj) and CS(Fj)
- Concurrency Test
- Check if (fx, tx) is parallel with the current
running thread (fc, tc)
57Primitive Idea for Access History
- Serial
- read(l), write(l)
- Simplest parallel program without nested
parallelism - Two parallel read records, one write record
- Language structure like Cilk
- read max level of parallelism, one write record
-
- Q Is it possible to keep only two read records
for each shared location in Cilk Parallel
Nondeterminator? - Keep two parallel read records with highest
level of LCA in parent child spawn tree.
58 59Parallel NondeterminatorWang Junqing
60Using Cilk
61Accelerating Multiprocessor Simulation
- 6.895 Project Proposal
- Kenneth C. Barr
62Introduction
- Inspiration
- UVSIM is slow
- FlexSIM is cool
- Interleave short periods of detailed simulation
with long periods of functional warmup (eg.,
cache and predictor updates, but not out-of-order
logic) - Achieve high accuracy in fraction of the time
- Multi-configuration simulation is cool
- Recognize structural properties. E.G., contents
of FA cache are subset of all larger FA caches
with same line size so search small-gtlarge.
Once we hit is small cache, we can stop searching - Simulate many configurations with a single run
63The Meta Directory
- Combine previous ideas to speed multiprocessor
simulation - Sequential Consistency is what matters, perhaps
detailed consistency mechanisms can be
interleaved with a fast, functional equivalent
method
64Meta-directory example
- Detailed simulation
- P1 ld x null -gt Sh, data sent to P1
- P2 ld x Sh -gt Sh, data sent to P2
- P3 ld x Sh- gt Sh data sent to P3
- P2 st x Sh -gt Ex, inv sent to P1 and P3, data
sent to P2 - P1 st x Ex -gt Ex, inv sent to P2, data sent to
P1 - P2 ld x Ex -gt Sh, data sent to P2
- Shortcut
- Record access in a small meta-directory x
P1, 0, r, P2, 1, r, P3, 2, r, P2, 3, w,
P1, 4, w, P2, 5, r - All reads and writes occur in a memory no
messages sent or received, no directory modeled,
no cache model in processor (?) - When it comes time for detailed simulation, we
can reconstruct directory by scanning backwards
x is shared by P1 and P2.
65Challenges
- Accesses stored in circular queue. How many
records needed for each address? - What happens when processor writes back data,
current scheme introduces false hits. - Does this scheme always work? Some proofs would
be nice.
66Methodology
- Platform
- I got Simics to boot SMP Linux
- But Bochs is open-source
- X86 is key if this is to be long-term framework
- Cilk benchmarks?
- Tasks
- Create detailed directory model
- Create meta directory
- Create a way to switch back and forth maintaining
program correctness - Measure the dramatic improvement!
67Conclusion
- Reality
- Sounds daunting for December deadline, but if I
can prove feasibility or fatal flaws, Id be
excited - Suggestions?
68Project Proposal Parallelizing METIS
69What is METIS?
- Graph Partitioning algorithm
- Developed at University of Minnesota by George
Karypis and Vipin Kumar - Information on METIS
- http//www-users.cs.umn.edu/karypis/metis/
70Stages of Algorithm Coarsening
- Task Create sequentially smaller graphs that
make good representation of original graph by
collapsing connected nodes. - Issues
- Minor concurrency issues.
- Maintaining data locality.
- Writing large amount of data to memory in a
scalable fashion.
71Stages of Algorithm Initial partitioning.
- Task partition small graph
- Issues none. Runtime of this portion of
algorithm very small.
72Stages of Algorithm Uncoarsening and Refinement
- Uncoarsening and Refinement project partition of
coarsest graph to original graph and refine
partition. - Issues
- Major concurrency issues.
- Remaining issues under research.
73Parallel SortingPaul Youn
- I propose to parallelize a sorting algorithm.
Initially, looking at Quick Sort, but may
investigate other sorting algorithms. - Sorting is a problem that can potentially be sped
up significantly (mergesort). - As an alternative, considering other algorithms.
74Goals
- Similar to Lab 1 approach.
- Speedy serial algorithm.
- Basic parallel algorithm.
- Runtime Analysis.
- Empirical verification of runtime analysis.
- Parallel speedups.
75HELP!
- Not sure about the scope of the problem.
- Anyone out there interested in this stuff?
- Other appropriate algorithms? Max-flow?
76PROJECT PROPOSALSMA5509Implement FIR filter in
parallel computing using Cilk
- Name Pham Duc Minh
- Matric HT030502H
- Email g0300231_at_nus.edu.sg
- phamducm_at_comp.nus.edu.sg
77Discrete Signal
- the discrete input signal is x(n)
- the output signal is y(n)
- the impulse response h(n)
78FIR filter
- Discrete time LTI (Linearity and Time Invariance)
systems can be classified into FIR and IIR. - An FIR filter has impulse response h(n) that
extends only over a finite time interval, say 0
n M, and is identically zero beyond that - h0, h1, h2 . . . , hM, 0, 0, 0 . . .M is
referred to as the filter order. - The output y(n) in FIR is simplified to the
finite-sum form - or, explicitly
79Methods to process
- 1. Block processing
- a. Convolution
(convolution table form)
80b. Direct FormThe length of the input signal
x(n) is L The length of output y(n) will be Ly
LxLh-1
81c. Matrix formwith the dimension of H is
82- Overlap-Add Block Convolution Method
- above methods are not feasible in infinitive or
extremely long applications - divide the long input into contiguous
non-overlapping blocks of manageable length, say
L samples - y0 hx0
- y1 hx1
- y2 hx2
- Thus the resulting block y0 starts at absolute
time n 0 - block y1 starts at n L, and so on.
- The last M samples of each output block will
overlap with - the first M outputs of the next block .
83- 2. Sampling Processing
- The direct form I/O convolutional equation
- We define the internal states w1(n), w2(2),
w3(n). - w0(n) x(n)
- w1(n) x(n-1) w0(n-1)
- w2(n) x(n-2) w1(n-1)
- ..
- With this definition, we can y(n) in the form
84- Goal of project
- Use Cilk to implement FIR filter in parallel
- Compare with the processes of DSP kit (pipeline)
and the process of FPGA (Field Programmable Gate
Array) (also parallel).
85- How to implement
- Multithreaded programming
- Cilk compiler
86Other things
- If I finish soon, I hope additional work
suggestion from staff.
87Cache-Oblivious Algorithms
88SMA5509 Theory of Parallel Systems
Project Proposal Cache-oblivious sorting for
Burrow-Wheeler Transform
- Sriram Saroop
- Advait D. Karande
89bzip2 compression
- Lossless text compression scheme
90Burrows-Wheeler Transform (1994)
- A pre-processing step for data compression
- Involves sorting of all rotations of the block of
data to be compressed - Rationale Transformed strings compress better
- Worst case complexity of N2log(N), where N size
of block to be sorted - Can be improved to N(logN)2 (Sadakane 98)
91Why make it cache-oblivious?
- Performance of BWT heavily dependent on cache
behavior (Seward 00) - Avoid slowdown for large files with high degree
of repetitiveness - Especially useful in applications like bzip2 that
are required to perform well on different memory
structures
92Project Plan
- Design and implementation of a cache oblivious
algorithm for BWT sorting - Performance comparisons with standard bzip2 for
available benchmarks - Back-up Optimizations for BWT including
parallelization, and incorporating
Cache-Obliviousness in other parts of bzip2
compression
93New Cache Oblivious Algorithms
94Our Goals
- Designing Cache Oblivious Algorithms for
problems in several areas of Compute Science. - Finding theoretical bounds of number of cache
misses for these algorithms. - Trying to parallelize them
95Large Integer Arithmetic Operations
- Basic Operators
- Addition, Subtraction
- Multiplication
- Factorial
- Matrix Multiplication
- Application RSA encryption algorithm
96Large Integer Multiplication
Basic Idea
A B
C D
B x D
A x D
B x C
A x C
97Dynamic Programming Algorithms
- Longest Common Sequence Basic Idea
B D C A B A
A 0 0 0 1 1 1
B 1 1 1 1 2 2
C 1 1 2 2 2 2
B 1 1 2 2 3 3
D 1 2 2 2 3 3
A 1 2 2 3 3 4
B 1 2 2 3 4 4
98Graph Algorithm
- All-pair shortest path algorithm Floyd
- Connectivity Testing of a Graph
- And more
99Computational Geometry
- Finding the closest pair of points
- And More
100Concurrent Cache-Oblivious Algorithms
101Cache-Oblivious Algorithms
- Optimal memory access
- Matrix multiply, Fast Fourier Transform
- B-trees, Priority Queues
- Concurrent?
- wait-free, lock-free, obstruction-free
- Synchronization?
- DCAS, CAS, LL/SC
102Packed Memory Structure
- Operations
- Traverse(k) -- O(k/B)
- Insert -- O(log2n/B)
- Delete -- O(log2n/B)
- Goal
- Design a lock-free version of the cache-oblivious
algorithm using CAS
103More
- Basis for many data structure
- B-tree
- distributed data structures
- Maybe parallel nondeterminator
104The End
105Packed Memory Structure
- size(array) lt 4 x num. elements
- elements well-spaced in array