Title: 6.895/SMA5509 Project Presentations

1
6.895/SMA5509 Project Presentations

October 15-17 2003

2
Improving Cilk
3
Adaptively Parallel Processor Allocation for Cilk
Jobs

Kunal Agrawal
Siddhartha Sen

4
Goal

Design and implement a dynamic processor-allocation
system for adaptively parallel jobs (jobs for
which the number of processors that can be used
without waste varies during execution)
The problem of allocating processors to
adaptively parallel jobs is called the adaptively
parallel processor-allocation problem [2].
At any given time, each job j has a desire d_j and
an allotment m_j

5
Goal (cont.)

We want to design a processor-allocation system
that achieves a fair and efficient allocation
among all jobs.
"Fair" means that whenever a job receives fewer
processors than it desires, no other job receives
more than one more processor than this job
received [2].
"Efficient" means that no job receives more
processors than it desires [2].
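
The two definitions above can be checked mechanically. The following is a minimal sketch, not the allocation algorithm proposed in this project: it assumes a simple greedy policy (give each processor to the unsatisfied job with the smallest allotment), and the names allocate, is_fair, and is_efficient are hypothetical.

```python
def allocate(desires, num_procs):
    """Greedy sketch: hand out processors one at a time to the
    unsatisfied job that currently holds the fewest processors."""
    allot = [0] * len(desires)
    for _ in range(num_procs):
        hungry = [j for j, d in enumerate(desires) if allot[j] < d]
        if not hungry:
            break  # every desire is met; leftover processors stay idle
        j = min(hungry, key=lambda k: (allot[k], k))
        allot[j] += 1
    return allot


def is_fair(desires, allot):
    # A job below its desire must not be exceeded by more than one processor.
    return all(max(allot) <= allot[j] + 1
               for j, d in enumerate(desires) if allot[j] < d)


def is_efficient(desires, allot):
    # No job holds more processors than it desires.
    return all(m <= d for m, d in zip(allot, desires))
```

For example, allocate([2, 10, 10], 12) gives the first job its full desire of 2 and splits the remaining ten processors evenly between the other two jobs.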

6
Illustration of Problem
[Figure: P processors. Job j: d_j = 6, m_j = 5.
Job k: d_k = 10, m_k = 6. Four processors are
marked "?".]

7
Main Algorithmic Questions

Dynamically determine the desires of each job in
the system
Heuristics on steal rate
Heuristics on number of visible stack frames
Heuristics on number of threads in ready deques
Dynamically determine the allotment for each job
such that the resulting allocation is fair and
efficient
SRLBA algorithm [2]
Cilk macroscheduler algorithm [3]
Lottery scheduling techniques [5]

8
Assumptions

All jobs on the system are Cilk jobs
Jobs can enter and leave the system and change
their parallelism during execution
All jobs are mutually trusting (they will stay
within the bound of their allotments and
communicate their desires honestly)
Each job has at least one processor to start with

9
References

[1] Supercomputing Technologies Group. Cilk 5.3.2
Reference Manual. MIT Lab for Computer Science,
November 2001.
[2] B. Song. Scheduling adaptively parallel jobs.
Master's thesis, Massachusetts Institute of
Technology, January 1998.
[3] R. D. Blumofe, C. E. Leiserson, and B. Song.
Automatic processor allocation for work-stealing
jobs.
[4] C. A. Waldspurger. Lottery and Stride Scheduling:
Flexible Proportional-Share Resource Management.
PhD thesis, Massachusetts Institute of
Technology, September 1995.
[5] C. A. Waldspurger and W. E. Weihl. Lottery
scheduling: Flexible proportional-share resource
management. In Proceedings of the First Symposium
on Operating System Design and Implementation,
pages 1-11. USENIX, November 1994.
[6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E.
Leiserson, K. H. Randall, and Y. Zhou. Cilk: An
efficient multithreaded runtime system. In
Proceedings of the Fifth ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming
(PPoPP), pages 207-216, Santa Barbara,
California, July 1995.

10
Fast Serial-Append for Cilk

6.895 Theory of Parallel Systems
Project Presentation (Proposal)
Alexandru Caracas

11
Serial-Append (cont'd)

A Cilk dag
Numbers represent the serial execution order of
threads
Example: suppose all threads perform file I/O
operations

12
Serial-Append (cont'd)

Same Cilk dag
Numbers represent the parallel execution of
threads
Goal
We would like to execute the threads in parallel
while still allowing for the serial order to be
reconstructed.

13
Serial-Append (cont'd)

Partitioning of the Cilk dag
Colors represent the I/O operations done by a
processor between two steal operations.

14
Serial-Append

The Serial-Append of the computation is obtained
by reading the output in the order
orange (I)
green (II)
dark-red (III)
purple (IV)
red (V)

15
Project Goals

Efficient serial-append algorithm
possibly using B-trees
Analyze possible implementation
kernel level
formatted file
Cilk file system
Implementation
(one of the above)

Port Cheerio to Cilk 5.4.
Comparison with own implementation
Port serial applications to use serial-append
(analyze their performance)

16
References

Robert D. Blumofe and Charles E. Leiserson.
Scheduling multithreaded computations by work
stealing. In Proceedings of the 35th Annual
Symposium on Foundations of Computer Science,
pages 356-368, Santa Fe, New Mexico, November
1994.
Matthew S. DeBergalis. A parallel file I/O API
for Cilk. Master's thesis, Department of
Electrical Engineering and Computer Science,
Massachusetts Institute of Technology, June 2000.
Supercomputing Technologies Group, MIT Laboratory
for Computer Science. Cilk 5.3.2 Reference
Manual, November 2001. Available at
http://supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf.

17
A Space-Efficient Global Scheduler for Cilk

Jason Hickey
Tyeler Quentmeyer

18
Idea

Implement a new scheduler for Cilk based on
Narlikar and Blelloch's "Space-Efficient
Scheduling of Nested Parallelism" paper

19
AsyncDF Scheduler

Idea: minimize peak memory usage by preempting
threads that try to allocate large blocks of
memory
Algorithm: roughly, AsyncDF runs the P left-most
threads in the computation DAG

20
Example

In parallel for i = 1 to n
    Temporary B[n]
    In parallel for j = 1 to n
        F(B, i, j)
    Free B

21
Comparison to Cilk

Cilk
Distributed
Run threads like a serial program until a
processor runs out of work and has to steal
Space: p * S1
AsyncDF
Global
Preempt a thread when it tries to allocate a
large block of memory and replace it with a
computationally heavy thread
Space: S1 + O(K D p)

22
AsyncDF Optimizations

Two versions of scheduler
Serial
Parallel
Running P left-most threads does not exploit the
locality of a serial program
Solution: look at the cP left-most threads and
group threads together for locality

23
Project Plan

Implement serial version of AsyncDF
Experiment with algorithms to group threads for
locality
Performance comparison to Cilk
Different values of AsyncDF's tunable parameter
Case studies
Recursive matrix multiplication
Strassen multiplication
N-body problem
Implement parallel version of AsyncDF
Performance comparison to Cilk
Additional experiments/modifications as research
suggests

24
Questions?
25
Automatic conversion of NSP DAGs to SP DAGs

Sajindra Jayasena

Sharad Ganesh
26

Objectives
Design an efficient algorithm to convert an
arbitrary NSP DAG to an SP DAG
- Correctness preserving
Identify different graph topologies
Analyze space and time complexities
Bound the increase in critical path length
Perform empirical analysis using Cilk

27
Transactional Cilk
28
Language-Level Complex Transactions

C. Scott Ananian
29
An Evaluation of Nested Transactions

Sean Lie
sean_lie_at_mit.edu
6.895 Theory of Parallel Systems
Wednesday October 15th, 2003

30
Transactional Memory

The programmer is given the ability to define
atomic regions of arbitrary size.
Start_Transaction
Flow[i] = Flow[i] - X
Flow[j] = Flow[j] + X
End_Transaction
With hardware support, different transactions can
be executed concurrently when there are no memory
conflicts.

31
Nested Transactions

Nested transactions can be handled by simply
merging all inner transactions into the outermost
transaction.

[Figure: two timelines, each showing an inner
Start_Transaction .. End_Transaction pair nested
inside an outer Start_Transaction .. End_Transaction
pair]
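
Merging inner transactions into the outermost one is often implemented with a nesting-depth counter. The following is a minimal software sketch of that idea, not UVSIM's hardware mechanism; the class name FlatTransaction and its buffered-write design are assumptions made for illustration.

```python
class FlatTransaction:
    """Flat nesting: inner begin/end pairs only adjust a depth counter,
    and buffered writes become visible when the outermost end runs."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for memory
        self.depth = 0
        self.writes = {}       # writes buffered until the outer commit

    def begin(self):
        self.depth += 1

    def read(self, key):
        # A transaction sees its own buffered writes first.
        return self.writes.get(key, self.memory.get(key))

    def write(self, key, value):
        assert self.depth > 0, "write outside a transaction"
        self.writes[key] = value

    def end(self):
        self.depth -= 1
        if self.depth == 0:
            self.memory.update(self.writes)  # commit everything at once
            self.writes.clear()
```

In this sketch an inner End_Transaction publishes nothing: all updates appear together at the outermost End_Transaction.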
32
Concurrency

Merging transactions is simple but may decrease
the program concurrency.

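[Figure: an outer transaction encloses an inner
transaction that accesses S (Start_Transaction ..
Start_Transaction, Access S, End_Transaction ..
End_Transaction). A second transaction also
accesses S (Start_Transaction, Access S,
End_Transaction). The two are marked "Conflict!"]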
33
Nested Concurrent Transactions

Nested Concurrent Transactions (NCT)
Allow the inner transaction to run and complete
independently from the outer transaction.
Questions
When is NCT actually useful?
How much overhead is incurred by NCT?
How much do we gain from using NCT?
The tool needed to get the answers
Transactional memory hardware simulator: UVSIM
UVSIM currently supports merging nested
transactions.
How to get the answers
Identify applications where NCT is useful. Look
at malloc?
Implement necessary infrastructure to run NCTs in
UVSIM.
Evaluate performance of NCT vs. merging
transactions.

34
Compiler Support for Atomic Transactions in Cilk

Jim Sukha
Tushara Karunaratna

35
Determinacy Race Example
36
New Keywords in Cilk
It would be convenient if programmers could make
a section of code execute atomically by using
xbegin and xend keywords.
37
Simulation of Transactional Memory
38
Work Involved

Modify the Cilk parser to accept xbegin and xend
as keywords.
Identify all load and store operations in user
code, and replace them with calls to functions
from the runtime system.
Implement the runtime system. An initial
implementation is to divide memory into blocks
and to use a hash table to store a lock and
backup values for each block.
Experiment with different runtime implementations.
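
The initial runtime design described above (memory divided into blocks, with a hash table holding a lock and backup values per block) might look roughly like the sketch below. It is a single-threaded illustration under stated assumptions: the names BlockTM, Conflict, load, store, xend, and abort are hypothetical, loads claim a block exclusively for simplicity, and a conflict raises an exception rather than retrying.

```python
class Conflict(Exception):
    """Raised when a transaction touches a block owned by another one."""


class BlockTM:
    def __init__(self, size, block_size=4):
        self.mem = [0] * size
        self.block_size = block_size
        # Hash table: block number -> (owner transaction, backup values)
        self.table = {}

    def _claim(self, tx, addr):
        block = addr // self.block_size
        entry = self.table.get(block)
        if entry is None:
            start = block * self.block_size
            backup = self.mem[start:start + self.block_size]
            self.table[block] = (tx, backup)
        elif entry[0] != tx:
            raise Conflict(f"block {block} is held by transaction {entry[0]}")

    def load(self, tx, addr):
        self._claim(tx, addr)
        return self.mem[addr]

    def store(self, tx, addr, value):
        self._claim(tx, addr)
        self.mem[addr] = value

    def xend(self, tx):
        # Commit: release the locks and discard the backups.
        for block in [b for b, (o, _) in self.table.items() if o == tx]:
            del self.table[block]

    def abort(self, tx):
        # Roll back: restore each owned block from its backup.
        for block in [b for b, (o, _) in self.table.items() if o == tx]:
            start = block * self.block_size
            backup = self.table.pop(block)[1]
            self.mem[start:start + len(backup)] = backup
```

In the compiler-based scheme, each load or store inside an xbegin/xend region would be replaced by a call such as load(tx, addr) or store(tx, addr, value).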

39
Data Race in Transactional Cilk

Xie Yong

40
Background

Current Cilk achieves atomicity using locks
Problems such as priority inversion, deadlock,
etc.
Nontrivial to code
Transactional memory
Software TM: overhead -> slow
Hardware TM

41
Cilk's transactions everywhere

Every instruction becomes part of a
transaction
Based on HTM
Cilk transaction
Cut the Cilk program into atomic pieces
Based on some language construct, or generated
automatically by the compiler

42
Data Race

Very different from the traditional definition in
the Nondeterminator
Assumption (correct parallelization)
Definition (1st half of Kai's master's thesis)
Efficient detection (vs. NP-complete, as proved by
Kai)
Make use of current algorithms/tools

43
Non-Determinacy Detection
44
Linear-time determinacy race detection

Jeremy Fineman

45
Motivation

Nondeterminator has two parts
Check whether threads are logically parallel
Use shadow spaces to detect determinacy race
SP-Bags algorithm uses LCA lookup to determine
whether two threads are parallel
LCA lookup with disjoint-sets data structure
takes O(α(v,v)) time, where α is the inverse
Ackermann function.
We do not care about the LCA. We just want to
know if two threads are logically parallel.

46
New algorithm for determinacy-race detection

Similar to SP-Bags algorithm with two parts
Check whether threads are logically parallel
Use shadow spaces to detect determinacy race
Use an order-maintenance data structure to determine
whether threads are parallel
Gain: order-maintenance operations take O(1)
amortized time.

47
The algorithm

Maintain two orders
Regular serial, depth-first, left-to-right
execution.
At each spawn, follow spawn thread before
continuation thread
Right-to-left execution.
At each spawn, follow continuation thread before
spawn thread
Claim: two threads e1 and e2 are parallel if and
only if e1 precedes e2 in one order, and e2
precedes e1 in the other.
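
The claim can be tried out on a small series-parallel parse tree. This is a minimal offline sketch, not the on-the-fly algorithm with an order-maintenance structure: the tuple encoding ("S" or "P", left, right) and the function names are assumptions made for illustration.

```python
def leaves(tree, right_to_left):
    """Yield thread names in traversal order. A tree is either a thread
    name or a tuple (kind, left, right) with kind "S" or "P"."""
    if isinstance(tree, str):
        yield tree
        return
    kind, left, right = tree
    if kind == "P" and right_to_left:
        # At a spawn, follow the continuation before the spawned thread.
        left, right = right, left
    yield from leaves(left, right_to_left)
    yield from leaves(right, right_to_left)


def build_orders(tree):
    l2r = {t: i for i, t in enumerate(leaves(tree, False))}
    r2l = {t: i for i, t in enumerate(leaves(tree, True))}
    return l2r, r2l


def parallel(a, b, l2r, r2l):
    # Parallel exactly when the two orders disagree about a and b.
    return (l2r[a] < l2r[b]) != (r2l[a] < r2l[b])
```

For the tree ("S", "e0", ("S", ("P", "e1", "e2"), "e3")), the left-to-right order is e0, e1, e2, e3 and the right-to-left order is e0, e2, e1, e3, so only e1 and e2 are reported as parallel.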

48
Depth-first, left-to-right execution
[Figure: computation dag with threads e0-e7 in
procedures F, F1, F2, and F3]
49
Right-to-left execution
[Figure: the same computation dag with threads
e0-e7 in procedures F, F1, F2, and F3]
50
How to maintain both orders in serial execution

Keep two pieces of state for each procedure F
C_F is the current thread in F
S_F is the next sync statement in F.
In both orders, insert new threads after the
current thread
On spawn, insert the continuation thread before the
spawn thread in one order. Do the opposite in the
other.
On sync, advance C_F to S_F
In any other thread, update shadow spaces based
on the current thread

51
Example
[Figure: worked example on procedure F with threads
e0-e5 and sync nodes s1 and s2, showing C_F, S_F,
and the two maintained orders]
52
Project Proposal: Parallel Nondeterminator

He Yuxiong

53
Objective

I propose a Parallel Nondeterminator to
Check for determinacy races in the parallel
execution of a program written in a language
like Cilk
Develop an efficient algorithm to decide the
concurrency between threads
Develop an efficient algorithm to reduce the number
of entries in the access history

54
Primitive Idea in Concurrency Test

(1) Labeling Scheme
(2) Set operation
Thread representation: (fid, tid)

Two sets
Parallel set PS(f) = {(pf, ptid)}: all threads
with tid > ptid in function pf are parallel with
the running thread of f
Children set CS(f) = {fid}: fid is a descendant
of the current function
Operations
Spawn: thread Tx of function Fi spawns function Fj
Operations on child Fj
Operations on parent Fi

Sync: function Fi executes sync
PS(Fi) = PS(Fi) - CS(Fi)
Return: function Fj returns to function Fi
CS(Fi) = CS(Fi) ∪ CS(Fj)
Release PS(Fj) and CS(Fj)
Concurrency Test
Check if (fx, tx) is parallel with the current
running thread (fc, tc)

57
Primitive Idea for Access History

Serial
read(l), write(l)
Simplest parallel program without nested
parallelism
Two parallel read records, one write record
Language structure like Cilk
Read records up to the maximum level of
parallelism, one write record
Q: Is it possible to keep only two read records
for each shared location in the Cilk Parallel
Nondeterminator?
Keep the two parallel read records with the highest
level of LCA in the parent-child spawn tree.

Thank you very much!

59
Parallel Nondeterminator

Wang Junqing
60
Using Cilk
61
Accelerating Multiprocessor Simulation

6.895 Project Proposal
Kenneth C. Barr

62
Introduction

Inspiration
UVSIM is slow
FlexSIM is cool
Interleave short periods of detailed simulation
with long periods of functional warmup (e.g.,
cache and predictor updates, but not out-of-order
logic)
Achieve high accuracy in a fraction of the time
Multi-configuration simulation is cool
Recognize structural properties, e.g., the contents
of an FA cache are a subset of all larger FA caches
with the same line size, so search small -> large.
Once we hit in a small cache, we can stop searching
Simulate many configurations with a single run
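
The subset property can be exploited with a single recency stack. The sketch below assumes LRU replacement, which the slide does not state, and the function name hits_per_cache_size is hypothetical. A line found at depth d in the stack hits in every fully associative cache holding more than d lines, so one pass counts hits for all sizes.

```python
def hits_per_cache_size(trace, sizes):
    """Count hits for several fully associative LRU caches in one pass."""
    sizes = sorted(sizes)
    stack = []                       # most recently used line first
    hits = {s: 0 for s in sizes}
    for line in trace:
        if line in stack:
            depth = stack.index(line)    # 0 = most recently used
            # Search small -> large: the first cache that hits implies
            # a hit in every larger one, so stop searching there.
            for i, s in enumerate(sizes):
                if depth < s:
                    for larger in sizes[i:]:
                        hits[larger] += 1
                    break
            stack.remove(line)
        stack.insert(0, line)
    return hits
```

For example, the trace a, b, a, c, b, a, a with sizes 1 through 4 gives 1, 2, 4, and 4 hits respectively from a single run.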

63
The Meta Directory

Combine previous ideas to speed multiprocessor
simulation
Sequential consistency is what matters; perhaps
detailed consistency mechanisms can be
interleaved with a fast, functionally equivalent
method

64
Meta-directory example

Detailed simulation
P1 ld x: null -> Sh, data sent to P1
P2 ld x: Sh -> Sh, data sent to P2
P3 ld x: Sh -> Sh, data sent to P3
P2 st x: Sh -> Ex, inv sent to P1 and P3, data
sent to P2
P1 st x: Ex -> Ex, inv sent to P2, data sent to
P1
P2 ld x: Ex -> Sh, data sent to P2
Shortcut
Record accesses in a small meta-directory:
x: {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w},
{P1, 4, w}, {P2, 5, r}
All reads and writes occur in memory: no
messages sent or received, no directory modeled,
no cache model in processor (?)
When it comes time for detailed simulation, we
can reconstruct the directory by scanning backwards:
x is shared by P1 and P2.
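
The backward scan can be sketched as follows. This is an illustration under assumptions, not the proposed simulator: it uses the three states from the example (null, Sh, Ex), assumes the full access history for the address is available, and the function name reconstruct_state is hypothetical.

```python
def reconstruct_state(log):
    """log: (processor, timestamp, 'r' or 'w') records for one address,
    oldest first. Returns (state, set of processors holding the line)."""
    holders = set()
    writer = None
    for proc, _time, op in reversed(log):
        holders.add(proc)
        if op == "w":
            # Anything older than the latest write was invalidated.
            writer = proc
            break
    if not holders:
        return "null", holders
    if writer is not None and holders == {writer}:
        return "Ex", holders
    return "Sh", holders
```

Applied to the six records for x above, the scan visits P2's read and then stops at P1's write, giving a line shared by P1 and P2.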

65
Challenges

Accesses are stored in a circular queue. How many
records are needed for each address?
What happens when a processor writes back data?
The current scheme introduces false hits.
Does this scheme always work? Some proofs would
be nice.

66
Methodology

Platform
I got Simics to boot SMP Linux
But Bochs is open-source
X86 is key if this is to be long-term framework
Cilk benchmarks?
Tasks
Create detailed directory model
Create meta directory
Create a way to switch back and forth maintaining
program correctness
Measure the dramatic improvement!

67
Conclusion

Reality
Sounds daunting for a December deadline, but if I
can prove feasibility or fatal flaws, I'd be
excited
Suggestions?

68
Project Proposal: Parallelizing METIS

Zardosht Kasheff

69
What is METIS?

Graph Partitioning algorithm
Developed at University of Minnesota by George
Karypis and Vipin Kumar
Information on METIS
http://www-users.cs.umn.edu/~karypis/metis/

70
Stages of Algorithm: Coarsening

Task: create successively smaller graphs that
are good representations of the original graph by
collapsing connected nodes.
Issues
Minor concurrency issues.
Maintaining data locality.
Writing large amount of data to memory in a
scalable fashion.

71
Stages of Algorithm: Initial Partitioning

Task: partition the small graph
Issues: none. The runtime of this portion of the
algorithm is very small.

72
Stages of Algorithm: Uncoarsening and Refinement

Task: project the partition of the coarsest graph
onto the original graph and refine the
partition.
Issues
Major concurrency issues.
Remaining issues under research.

73
Parallel Sorting

Paul Youn

I propose to parallelize a sorting algorithm.
Initially, looking at Quick Sort, but may
investigate other sorting algorithms.
Sorting is a problem that can potentially be sped
up significantly (mergesort).
As an alternative, considering other algorithms.

74
Goals

Similar to Lab 1 approach.
Speedy serial algorithm.
Basic parallel algorithm.
Runtime Analysis.
Empirical verification of runtime analysis.
Parallel speedups.

75
HELP!

Not sure about the scope of the problem.
Anyone out there interested in this stuff?
Other appropriate algorithms? Max-flow?

76
PROJECT PROPOSAL
SMA5509
Implement FIR filter in parallel computing using
Cilk

Name: Pham Duc Minh
Matric: HT030502H
Email: g0300231_at_nus.edu.sg
phamducm_at_comp.nus.edu.sg

77
Discrete Signal

the discrete input signal is x(n)
the output signal is y(n)
the impulse response h(n)

78
FIR filter

Discrete-time LTI (linear time-invariant)
systems can be classified into FIR and IIR.
An FIR filter has an impulse response h(n) that
extends only over a finite time interval, say
0 <= n <= M, and is identically zero beyond that:
{h0, h1, h2, ..., hM, 0, 0, 0, ...}
M is referred to as the filter order.
The output y(n) of an FIR filter simplifies to the
finite-sum form
y(n) = \sum_{m=0}^{M} h(m) x(n-m)
or, explicitly,
y(n) = h0 x(n) + h1 x(n-1) + ... + hM x(n-M)

79
Methods to process

1. Block processing
a. Convolution

(convolution table form)
80
b. Direct Form
The length of the input signal x(n) is L.
The length of the output y(n) will be
Ly = Lx + Lh - 1
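
As a serial baseline for the direct form, the convolution sum can be written out directly. This is a minimal sketch in Python rather than Cilk, and the function name fir_direct is hypothetical.

```python
def fir_direct(h, x):
    """Direct-form block convolution: y(n) = sum over m of h(m) x(n-m)."""
    ly = len(x) + len(h) - 1          # Ly = Lx + Lh - 1
    y = [0] * ly
    for n in range(ly):
        for m, hm in enumerate(h):
            if 0 <= n - m < len(x):   # skip terms outside the input
                y[n] += hm * x[n - m]
    return y
```

Each output sample y[n] depends only on h and x, so the outer loop over n is a natural candidate for parallel execution.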
81
c. Matrix form y = Hx, where the dimension of H is
Ly x Lx
82

Overlap-Add Block Convolution Method
The above methods are not feasible for infinite or
extremely long inputs
Divide the long input into contiguous
non-overlapping blocks of manageable length, say
L samples, and convolve each block with h
(* denotes convolution)
y0 = h * x0
y1 = h * x1
y2 = h * x2
Thus the resulting block y0 starts at absolute
time n = 0,
block y1 starts at n = L, and so on.
The last M samples of each output block will
overlap with
the first M outputs of the next block.
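
The overlap-add idea can be sketched as below. This is a serial Python illustration, not the Cilk implementation planned for the project; the function names convolve and overlap_add are hypothetical.

```python
def convolve(h, x):
    """Plain convolution of a filter h with one input block x."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for m, hm in enumerate(h):
            y[i + m] += hm * xi
    return y


def overlap_add(h, x, block_len):
    """Filter x block by block; block k's output starts at n = k * L,
    and overlapping tails are summed into the shared output."""
    y = [0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        for k, v in enumerate(convolve(h, block)):
            y[start + k] += v
    return y
```

The per-block convolutions are independent of one another, which is what makes the method attractive for parallel execution; only the final summation of the overlapping M samples touches shared output.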

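2. Sample Processing
The direct form I/O convolutional equation:
y(n) = h0 x(n) + h1 x(n-1) + ... + hM x(n-M)
We define the internal states w1(n), w2(n),
w3(n), ...
w0(n) = x(n)
w1(n) = x(n-1) = w0(n-1)
w2(n) = x(n-2) = w1(n-1)
...
With this definition, we can write y(n) in the form
y(n) = h0 w0(n) + h1 w1(n) + ... + hM wM(n)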
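
The internal states act as a delay line that is shifted once per input sample. A minimal sketch follows, with the hypothetical name make_fir; it processes one sample at a time and keeps the states w0..wM between calls.

```python
def make_fir(h):
    """Return a function that filters one sample per call."""
    w = [0] * len(h)                 # internal states w0..wM

    def step(x_n):
        w[0] = x_n                   # w0(n) = x(n)
        y_n = sum(hm * wm for hm, wm in zip(h, w))
        # Prepare the next time step: w_i(n+1) = w_{i-1}(n)
        for i in range(len(w) - 1, 0, -1):
            w[i] = w[i - 1]
        return y_n

    return step
```

Feeding the samples 1, 1, 1 through make_fir([1, 2]) yields 1, 3, 3, matching the first three outputs of the block convolution of the same filter and input.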

Goal of project
Use Cilk to implement FIR filter in parallel
Compare with the processes of DSP kit (pipeline)
and the process of FPGA (Field Programmable Gate
Array) (also parallel).

How to implement
Multithreaded programming
Cilk compiler

86
Other things

If I finish early, I would welcome suggestions
from the staff for additional work.

87
Cache-Oblivious Algorithms
88
SMA5509 Theory of Parallel Systems
Project Proposal: Cache-oblivious sorting for the
Burrows-Wheeler Transform

Sriram Saroop
Advait D. Karande

89
bzip2 compression

Lossless text compression scheme

90
Burrows-Wheeler Transform (1994)

A pre-processing step for data compression
Involves sorting of all rotations of the block of
data to be compressed
Rationale: transformed strings compress better
Worst-case complexity of N^2 log(N), where N = size
of block to be sorted
Can be improved to N (log N)^2 (Sadakane '98)
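
For reference, the transform itself can be written as a naive rotation sort. This sketch is neither cache-oblivious nor what bzip2 does internally; the function name bwt and the "$" end marker are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations of the block
    and keep the last character of each sorted rotation."""
    s = text + sentinel              # end marker; assumed absent from text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

Sorting N rotations takes O(N log N) comparisons, and each comparison may scan up to N characters on repetitive input, which is where the N^2 log(N) worst case comes from.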

91
Why make it cache-oblivious?

Performance of BWT is heavily dependent on cache
behavior (Seward '00)
Avoid slowdown for large files with high degree
of repetitiveness
Especially useful in applications like bzip2 that
are required to perform well on different memory
structures

92
Project Plan

Design and implementation of a cache oblivious
algorithm for BWT sorting
Performance comparisons with standard bzip2 for
available benchmarks
Back-up: optimizations for BWT including
parallelization, and incorporating
cache-obliviousness in other parts of bzip2
compression

93
New Cache Oblivious Algorithms

Neel Kamal
Zhang JiaHui

94
Our Goals

Designing Cache Oblivious Algorithms for
problems in several areas of Computer Science.
Finding theoretical bounds of number of cache
misses for these algorithms.
Trying to parallelize them

95
Large Integer Arithmetic Operations

Basic Operators
Addition, Subtraction
Multiplication
Factorial
Matrix Multiplication
Application: RSA encryption algorithm

96
Large Integer Multiplication
Basic Idea
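[Figure: each operand is split into two halves,
A B and C D; the product is assembled from the four
partial products B x D, A x D, B x C, and A x C]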
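
The four partial products suggest a recursive divide-and-conquer scheme. The sketch below is one possible reading of the basic idea for non-negative integers, not the authors' algorithm; the function name multiply and the decimal splitting are assumptions.

```python
def multiply(x, y, base_digits=4):
    """Multiply non-negative integers by splitting x into halves (A, B)
    and y into halves (C, D), then combining the four partial products."""
    if x < 10 ** base_digits or y < 10 ** base_digits:
        return x * y                          # small enough: multiply directly
    half = max(len(str(x)), len(str(y))) // 2
    a, b = divmod(x, 10 ** half)              # x = A * 10^half + B
    c, d = divmod(y, 10 ** half)              # y = C * 10^half + D
    ac = multiply(a, c, base_digits)
    ad = multiply(a, d, base_digits)
    bc = multiply(b, c, base_digits)
    bd = multiply(b, d, base_digits)
    return ac * 10 ** (2 * half) + (ad + bc) * 10 ** half + bd
```

The four recursive calls are independent, so they are natural candidates for parallel execution, and the recursion works on ever smaller operands without referring to any cache parameter.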
97
Dynamic Programming Algorithms

Longest Common Subsequence: Basic Idea

      B  D  C  A  B  A
   A  0  0  0  1  1  1
   B  1  1  1  1  2  2
   C  1  1  2  2  2  2
   B  1  1  2  2  3  3
   D  1  2  2  2  3  3
   A  1  2  2  3  3  4
   B  1  2  2  3  4  4
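
The table above can be reproduced with the standard dynamic-programming recurrence. This is the plain row-by-row version, not a cache-oblivious one; the function name lcs_table is hypothetical.

```python
def lcs_table(x, y):
    """c[i][j] = length of the longest common subsequence of the first
    i characters of x and the first j characters of y."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c
```

Calling lcs_table("ABCBDAB", "BDCABA") and dropping the zero border gives the rows shown in the table, with the bottom-right entry 4 as the length of the longest common subsequence.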
98
Graph Algorithm

All-pairs shortest paths (Floyd's algorithm)
Connectivity Testing of a Graph
And more

99
Computational Geometry

Finding the closest pair of points
And More

100
Concurrent Cache-Oblivious Algorithms

Seth Gilbert

101
Cache-Oblivious Algorithms

Optimal memory access
Matrix multiply, Fast Fourier Transform
B-trees, Priority Queues
Concurrent?
wait-free, lock-free, obstruction-free
Synchronization?
DCAS, CAS, LL/SC

102
Packed Memory Structure

Operations
Traverse(k) -- O(k/B)
Insert -- O(log^2 n / B)
Delete -- O(log^2 n / B)
Goal
Design a lock-free version of the cache-oblivious
algorithm using CAS

103
More

Basis for many data structures
B-tree
distributed data structures
Maybe parallel nondeterminator

104
The End
105
Packed Memory Structure

size(array) < 4 x num. elements
elements well-spaced in array
