6.895/SMA5509 Project Presentations - PowerPoint PPT Presentation

About This Presentation
Title:

6.895/SMA5509 Project Presentations

Slides: 106
Provided by: sgil4

Transcript and Presenter's Notes

Title: 6.895/SMA5509 Project Presentations


1
6.895/SMA5509 Project Presentations
  • October 15-17 2003

2
Improving Cilk
3
Adaptively Parallel Processor Allocation for Cilk
Jobs
  • Kunal Agrawal
  • Siddhartha Sen

4
Goal
  • Design and implement a dynamic processor-allocation
    system for adaptively parallel jobs (jobs for
    which the number of processors that can be used
    without waste varies during execution)
  • The problem of allocating processors to
    adaptively parallel jobs is called the adaptively
    parallel processor-allocation problem [2].
  • At any given time, each job j has a desire dj and
    an allotment mj

5
Goal (cont.)
  • We want to design a processor-allocation system
    that achieves a fair and efficient allocation
    among all jobs.
  • "Fair" means that whenever a job receives fewer
    processors than it desires, no other job receives
    more than one more processor than this job
    received [2]
  • "Efficient" means that no job receives more
    processors than it desires [2]
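
The two definitions above can be checked mechanically. The following is a minimal sketch, not the project's allocator: `is_efficient`, `is_fair`, and the round-robin `allocate` are hypothetical names introduced here for illustration, and jobs are assumed to be identified by dictionary keys.

```python
def is_efficient(desires, allotments):
    """Efficient: no job receives more processors than it desires."""
    return all(allotments[j] <= desires[j] for j in desires)


def is_fair(desires, allotments):
    """Fair: if a job gets fewer processors than it desires, no other
    job gets more than one more processor than that job."""
    for j in desires:
        if allotments[j] < desires[j]:
            for k in desires:
                if k != j and allotments[k] > allotments[j] + 1:
                    return False
    return True


def allocate(desires, num_processors):
    """Hypothetical round-robin allocator (an assumption, not SRLBA):
    hand out one processor at a time to jobs with unmet desire."""
    allotments = {j: 0 for j in desires}
    remaining = num_processors
    while remaining > 0:
        hungry = [j for j in desires if allotments[j] < desires[j]]
        if not hungry:
            break  # every desire is met; leftover processors stay idle
        for j in hungry:
            if remaining == 0:
                break
            allotments[j] += 1
            remaining -= 1
    return allotments
```

For example, `allocate({"j": 6, "k": 10}, 11)` returns `{"j": 6, "k": 5}`, which passes both checks.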

6
Illustration of Problem
dj = 6, mj = 5
dk = 10, mk = 6
? ? ? ?
P processors


7
Main Algorithmic Questions
  • Dynamically determine the desires of each job in
    the system
  • Heuristics on steal rate
  • Heuristics on number of visible stack frames
  • Heuristics on number of threads in ready deques
  • Dynamically determine the allotment for each job
    such that the resulting allocation is fair and
    efficient
  • SRLBA algorithm [2]
  • Cilk macroscheduler algorithm [3]
  • Lottery scheduling techniques [5]

8
Assumptions
  • All jobs on the system are Cilk jobs
  • Jobs can enter and leave the system and change
    their parallelism during execution
  • All jobs are mutually trusting (they will stay
    within the bound of their allotments and
    communicate their desires honestly)
  • Each job has at least one processor to start with

9
References
  • [1] Supercomputing Technologies Group. Cilk 5.3.2
    Reference Manual. MIT Lab for Computer Science,
    November 2001.
  • [2] B. Song. Scheduling adaptively parallel jobs.
    Master's thesis, Massachusetts Institute of
    Technology, January 1998.
  • [3] R. D. Blumofe, C. E. Leiserson, and B. Song.
    Automatic processor allocation for work-stealing
    jobs.
  • [4] C. A. Waldspurger. Lottery and Stride Scheduling:
    Flexible Proportional-Share Resource Management.
    PhD thesis, Massachusetts Institute of
    Technology, September 1995.
  • [5] C. A. Waldspurger and W. E. Weihl. Lottery
    scheduling: Flexible proportional-share resource
    management. In Proceedings of the First Symposium
    on Operating System Design and Implementation,
    pages 1-11. USENIX, November 1994.
  • [6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E.
    Leiserson, K. H. Randall, and Y. Zhou. Cilk: An
    efficient multithreaded runtime system. In
    Proceedings of the Fifth ACM SIGPLAN Symposium on
    Principles and Practice of Parallel Programming
    (PPoPP), pages 207-216, Santa Barbara,
    California, July 1995.

10
Fast Serial-Append for Cilk
  • 6.895 Theory of Parallel Systems
  • Project Presentation (Proposal)
  • Alexandru Caracas

11
Serial-Append (cont'd)
  • A Cilk dag
  • Numbers represent the serial execution of threads
  • For example
  • Consider all threads perform file I/O operations

12
Serial-Append (cont'd)
  • Same Cilk dag
  • Numbers represent the parallel execution of
    threads
  • Goal
  • We would like to execute the threads in parallel
    while still allowing for the serial order to be
    reconstructed.

13
Serial-Append (cont'd)
  • Partitioning of the Cilk dag
  • Colors represent the I/O operations done by a
    processor between two steal operations.

14
Serial-Append
  • The Serial-Append of the computation is obtained
    by reading the output in the order
  • orange (I)
  • green (II)
  • dark-red (III)
  • purple (IV)
  • red (V)

15
Project Goals
  • Efficient serial-append algorithm
  • possibly using B-trees
  • Analyze possible implementation
  • kernel level
  • formatted file
  • Cilk file system
  • Implementation
  • (one of the above)
  • Port Cheerio to Cilk 5.4.
  • Comparison with own implementation
  • Port serial applications to use serial-append
    (analyze their performance)

16
References
  • Robert D. Blumofe and Charles E. Leiserson.
    Scheduling multithreaded computations by work
    stealing. In Proceedings of the 35th Annual
    Symposium on Foundations of Computer Science,
    pages 356-368, Santa Fe, New Mexico, November
    1994.
  • Matthew S. DeBergalis. A parallel file I/O API
    for Cilk. Master's thesis, Department of
    Electrical Engineering and Computer Science,
    Massachusetts Institute of Technology, June 2000.
  • Supercomputing Technologies Group, MIT Laboratory
    for Computer Science. Cilk 5.3.2 Reference
    Manual, November 2001. Available at
    http://supertech.lcs.mit.edu/cilk/manual-5.3.2.pdf

17
A Space-Efficient Global Scheduler for Cilk
  • Jason Hickey
  • Tyeler Quentmeyer

18
Idea
  • Implement a new scheduler for Cilk based on
    Narlikar and Blelloch's "Space-Efficient
    Scheduling of Nested Parallelism" paper

19
AsyncDF Scheduler
  • Idea: minimize peak memory usage by preempting
    threads that try to allocate large blocks of
    memory
  • Algorithm: roughly, AsyncDF runs the P left-most
    threads in the computation DAG

20
Example
  • In parallel for i = 1 to n
  • Temporary B[n]
  • In parallel for j = 1 to n
  • F(B, i, j)
  • Free B

21
Comparison to Cilk
  • Cilk
  • Distributed
  • Run threads like a serial program until a
    processor runs out of work and has to steal
  • Space: p·S1
  • AsyncDF
  • Global
  • Preempt a thread when it tries to allocate a
    large block of memory and replace it with a
    computationally heavy thread
  • Space: S1 + O(K·D·p)

22
AsyncDF Optimizations
  • Two versions of scheduler
  • Serial
  • Parallel
  • Running P left-most threads does not exploit the
    locality of a serial program
  • Solution: look at the cP left-most threads and
    group threads together for locality

23
Project Plan
  • Implement serial version of AsyncDF
  • Experiment with algorithms to group threads for
    locality
  • Performance comparison to Cilk
  • Different values of AsyncDF's tunable parameter
  • Case studies
  • Recursive matrix multiplication
  • Strassen multiplication
  • N-body problem
  • Implement parallel version of AsyncDF
  • Performance comparison to Cilk
  • Additional experiments/modifications as research
    suggests

24
Questions?
25
Automatic conversion of NSP DAGs to SP DAGs
  • Sajindra Jayasena
  • Sharad Ganesh

26
  • Objectives
  • Design an efficient algorithm to convert an
    arbitrary NSP DAG to an SP DAG
  • Correctness preserving
  • Identify different graph topologies
  • Analyze space and time complexities
  • Bound the increase in critical path length
  • Perform empirical analysis using Cilk

27
Transactional Cilk
28
Language-Level Complex Transactions
  • C. Scott Ananian
29
An Evaluation of Nested Transactions
  • Sean Lie
  • sean_lie_at_mit.edu
  • 6.895 Theory of Parallel Systems
  • Wednesday October 15th, 2003

30
Transactional Memory
  • The programmer is given the ability to define
    atomic regions of arbitrary size.
  • Start_Transaction
  • Flow[i] = Flow[i] - X
  • Flow[j] = Flow[j] + X
  • End_Transaction
  • With hardware support, different transactions can
    be executed concurrently when there are no memory
    conflicts.

31
Nested Transactions
  • Nested transactions can be handled by simply
    merging all inner transactions into the outermost
    transaction.

Start_Transaction .. Start_Transaction ..
End_Transaction .. .. .. End_Transaction
Start_Transaction .. Start_Transaction ..
End_Transaction .. .. .. End_Transaction
32
Concurrency
  • Merging transactions is simple but may decrease
    the program concurrency.

Start_Transaction .. Start_Transaction
Access S End_Transaction .. .. .. End_Transaction
Conflict!
Start_Transaction Access S End_Transaction
33
Nested Concurrent Transactions
  • Nested Concurrent Transactions (NCT)
  • Allow the inner transaction to run and complete
    independently from the outer transaction.
  • Questions
  • When is NCT actually useful?
  • How much overhead is incurred by NCT?
  • How much do we gain from using NCT?
  • The tool needed to get the answers
  • Transactional memory hardware simulator: UVSIM
  • UVSIM currently supports merging nested
    transactions.
  • How to get the answers
  • Identify applications where NCT is useful. Look
    at malloc?
  • Implement necessary infrastructure to run NCTs in
    UVSIM.
  • Evaluate performance of NCT vs. merging
    transactions.

34
Compiler Support for Atomic Transactions in Cilk
  • Jim Sukha
  • Tushara Karunaratna

35
Determinacy Race Example
36
New Keywords in Cilk
It would be convenient if programmers could make
a section of code execute atomically by using
xbegin and xend keywords.
37
Simulation of Transactional Memory
38
Work Involved
  • Modify the Cilk parser to accept xbegin and xend
    as keywords.
  • Identify all load and store operations in user
    code, and replace them with calls to functions
    from the runtime system.
  • Implement the runtime system. An initial
    implementation is to divide memory into blocks
    and to use a hash table to store a lock and
    backup values for each block.
  • Experiment with different runtime implementations.
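
The initial runtime design described above (memory divided into blocks, with a hash table holding a lock and backup values per block) could look roughly like the sketch below. It is a single-threaded illustration under stated assumptions: the class `Runtime`, the `Conflict` exception, the method names, and the block size are all hypothetical, and a real implementation would need atomic lock acquisition.

```python
class Conflict(Exception):
    """Raised when a transaction touches a block locked by another one."""


class Runtime:
    def __init__(self, num_words, block_size=4):
        self.memory = [0] * num_words
        self.block_size = block_size
        # Hash table: block index -> (owning transaction id, backup values)
        self.table = {}

    def _acquire(self, txn, addr):
        block = addr // self.block_size
        entry = self.table.get(block)
        if entry is None:
            start = block * self.block_size
            backup = self.memory[start:start + self.block_size]
            self.table[block] = (txn, backup)
        elif entry[0] != txn:
            raise Conflict(f"block {block} is locked by transaction {entry[0]}")

    def load(self, txn, addr):
        """Replacement for a load inside an xbegin/xend region."""
        self._acquire(txn, addr)
        return self.memory[addr]

    def store(self, txn, addr, value):
        """Replacement for a store inside an xbegin/xend region."""
        self._acquire(txn, addr)
        self.memory[addr] = value

    def xend(self, txn):
        """Commit: drop this transaction's locks and backups."""
        for block in [b for b, (owner, _) in self.table.items() if owner == txn]:
            del self.table[block]

    def abort(self, txn):
        """Roll back: restore the backup values, then release the locks."""
        for block in [b for b, (owner, _) in self.table.items() if owner == txn]:
            start = block * self.block_size
            backup = self.table[block][1]
            self.memory[start:start + len(backup)] = backup
            del self.table[block]
```

Locking on loads as well as stores is the simplest choice for a sketch; it is more conservative than necessary, since two transactions that only read the same block do not actually conflict.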

39
Data Race in Transactional Cilk
  • Xie Yong

40
Background
  • Cilk currently achieves atomicity using locks
  • Problems such as priority inversion, deadlock,
    etc.
  • Nontrivial to code
  • Transactional memory
  • Software TM: overhead -> slow
  • Hardware TM

41
Cilk's transactions everywhere
  • Every instruction becomes part of a
    transaction
  • Based on HTM
  • Cilk transaction
  • Cut the Cilk program into atomic pieces
  • Based on some language construct, or generated
    automatically by the compiler

42
Data Race
  • Very different from the traditional definition in
    the Nondeterminator
  • Assumption (correct parallelization)
  • Definition (first half of Kai's master's thesis)
  • Efficient detection (vs. the NP-completeness proved
    by Kai)
  • Make use of current algorithms/tools

43
Non-Determinacy Detection
44
Linear-time determinacy race detection
  • Jeremy Fineman

45
Motivation
  • Nondeterminator has two parts
  • Check whether threads are logically parallel
  • Use shadow spaces to detect determinacy race
  • SP-Bags algorithm uses LCA lookup to determine
    whether two threads are parallel
  • LCA lookup with disjoint-sets data structure
    takes O(α(v,v)) time.
  • We do not care about the LCA. We just want to
    know if two threads are logically parallel.

46
New algorithm for determinacy-race detection
  • Similar to SP-Bags algorithm with two parts
  • Check whether threads are logically parallel
  • Use shadow spaces to detect determinacy race
  • Use order maintenance data-structure to determine
    whether threads are parallel
  • Gain: order maintenance operations are O(1)
    amortized time.

47
The algorithm
  • Maintain two orders
  • Regular serial, depth-first, left-to-right
    execution.
  • At each spawn, follow spawn thread before
    continuation thread
  • Right-to-left execution.
  • At each spawn, follow continuation thread before
    spawn thread
  • Claim: two threads e1 and e2 are parallel if and
    only if e1 precedes e2 in one order, and e2
    precedes e1 in the other.
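
The claim above can be tried out on a small series-parallel parse tree. This is a minimal sketch under assumptions: the `Node` class and the tree encoding are hypothetical, and plain list positions stand in for the order-maintenance data structure, so queries here are not the O(1) amortized operations the proposal aims for.

```python
class Node:
    def __init__(self, kind, children=(), name=None):
        self.kind = kind              # "S" (series), "P" (parallel), or "leaf"
        self.children = list(children)
        self.name = name              # thread name, for leaves only


def leaf(name):
    return Node("leaf", name=name)


def order(node, reverse_parallel, out):
    """Append thread names in execution order. When reverse_parallel is
    true, the continuation is followed before the spawned thread."""
    if node.kind == "leaf":
        out.append(node.name)
        return out
    kids = node.children
    if node.kind == "P" and reverse_parallel:
        kids = reversed(kids)
    for child in kids:
        order(child, reverse_parallel, out)
    return out


def parallel_test(root):
    """Build both orders once and return a query function."""
    left = {t: i for i, t in enumerate(order(root, False, []))}
    right = {t: i for i, t in enumerate(order(root, True, []))}

    def parallel(a, b):
        # Parallel iff the two orders disagree about which thread comes first.
        return (left[a] < left[b]) != (right[a] < right[b])

    return parallel
```

For a procedure that runs e0, spawns e1, continues with e2, syncs, and then runs e3, the tree is `Node("S", [leaf("e0"), Node("P", [leaf("e1"), leaf("e2")]), leaf("e3")])`. The two orders are e0 e1 e2 e3 and e0 e2 e1 e3, so only e1 and e2 are reported as parallel.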

48
Depth-first, left-to-right execution
(Figure: computation DAG of procedure F with threads e0-e7 and spawned procedures F1, F2, F3)
49
Right-to-left execution
(Figure: computation DAG of procedure F with threads e0-e7 and spawned procedures F1, F2, F3)
50
How to maintain both orders in serial execution
  • Keep two pieces of state for each procedure F
  • CF is current thread in F
  • SF is next sync statement in F.
  • In both orders, insert new threads after current
    thread
  • On spawn, insert continuation thread before spawn
    thread in one order. Do the opposite in the
    other.
  • On sync, advance CF to SF
  • In any other thread, update shadow spaces based
    on current thread

51
Example
(Figure: step-by-step example for procedure F with threads e0-e5 and sync threads s1, s2, showing CF, SF, and the two orders as they are built)
52
Project Proposal: Parallel Nondeterminator
  • He Yuxiong

53
Objective
  • I propose a Parallel Nondeterminator to
  • Check for determinacy races in the parallel
    execution of programs written in a language
    like Cilk
  • Develop an efficient algorithm to decide the
    concurrency between threads
  • Develop an efficient algorithm to reduce the number
    of entries in the access history

54
Primitive Idea in Concurrency Test
  • (1) Labeling Scheme
  • (2) Set operation
  • Thread representation (fid, tid)

55
  • Two sets
  • Parallel set PS(f) = {(pf, ptid) : all threads
    with tid > ptid in function pf are parallel with
    the running thread of f}
  • Children set CS(f) = {fid : fid is a descendant
    of the current function}
  • Operations
  • Spawn: thread Tx of function Fi spawns function Fj
  • Operations on child Fj
  • Operations on parent Fi

56
  • Sync: function Fi executes sync
  • PS(Fi) = PS(Fi) - CS(Fi)
  • Return: function Fj returns to function Fi
  • CS(Fi) = CS(Fi) ∪ CS(Fj)
  • Release PS(Fj) and CS(Fj)
  • Concurrency test
  • Check whether (fx, tx) is parallel with the currently
    running thread (fc, tc)

57
Primitive Idea for Access History
  • Serial
  • read(l), write(l)
  • Simplest parallel program without nested
    parallelism
  • Two parallel read records, one write record
  • Language structure like Cilk
  • Read records up to the max level of parallelism,
    one write record
  • Q: Is it possible to keep only two read records
    for each shared location in the Cilk Parallel
    Nondeterminator?
  • Keep the two parallel read records with the highest
    level of LCA in the parent-child spawn tree.

58
  • Thank you very much!

59
Parallel Nondeterminator
  • Wang Junqing
60
Using Cilk
61
Accelerating Multiprocessor Simulation
  • 6.895 Project Proposal
  • Kenneth C. Barr

62
Introduction
  • Inspiration
  • UVSIM is slow
  • FlexSIM is cool
  • Interleave short periods of detailed simulation
    with long periods of functional warmup (e.g.,
    cache and predictor updates, but not out-of-order
    logic)
  • Achieve high accuracy in a fraction of the time
  • Multi-configuration simulation is cool
  • Recognize structural properties. E.g., the contents
    of an FA cache are a subset of those of all larger
    FA caches with the same line size, so search
    small -> large. Once we hit in a small cache, we
    can stop searching
  • Simulate many configurations with a single run
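
The inclusion property described above can be exploited in a single pass over an address trace. The sketch below assumes LRU replacement (for which the subset property holds) and fully associative caches that differ only in size; the function name `hits_per_cache_size` and the trace format are assumptions made for illustration.

```python
def hits_per_cache_size(trace, sizes):
    """Count hits for several fully associative LRU caches in one run.

    trace: sequence of cache-line identifiers
    sizes: cache capacities, in lines
    """
    ordered = sorted(sizes)
    hits = {size: 0 for size in ordered}
    stack = []  # LRU stack, most recently used line first
    for line in trace:
        if line in stack:
            depth = stack.index(line) + 1  # 1 = most recently used
            stack.remove(line)
            # Search small -> large: the first cache that hits implies
            # that every larger cache hits too, so the search can stop.
            for i, size in enumerate(ordered):
                if depth <= size:
                    for larger in ordered[i:]:
                        hits[larger] += 1
                    break
        stack.insert(0, line)
    return hits
```

For the trace `a a b a c b` and sizes 1, 2, and 4, this returns `{1: 1, 2: 2, 4: 3}`, the same counts three separate simulations would give.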

63
The Meta Directory
  • Combine previous ideas to speed multiprocessor
    simulation
  • Sequential consistency is what matters; perhaps
    detailed consistency mechanisms can be
    interleaved with a fast, functionally equivalent
    method

64
Meta-directory example
  • Detailed simulation
  • P1 ld x: null -> Sh, data sent to P1
  • P2 ld x: Sh -> Sh, data sent to P2
  • P3 ld x: Sh -> Sh, data sent to P3
  • P2 st x: Sh -> Ex, inv sent to P1 and P3, data
    sent to P2
  • P1 st x: Ex -> Ex, inv sent to P2, data sent to
    P1
  • P2 ld x: Ex -> Sh, data sent to P2
  • Shortcut
  • Record accesses in a small meta-directory: x:
    {P1, 0, r}, {P2, 1, r}, {P3, 2, r}, {P2, 3, w},
    {P1, 4, w}, {P2, 5, r}
  • All reads and writes occur in memory: no
    messages sent or received, no directory modeled,
    no cache model in processor (?)
  • When it comes time for detailed simulation, we
    can reconstruct the directory by scanning backwards:
    x is shared by P1 and P2.
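
The backward scan can be sketched as follows. This assumes the simple protocol implied by the detailed-simulation example (a read after a write leaves the line shared between the last writer and the later readers); the function name `reconstruct` and the tuple layout are hypothetical.

```python
def reconstruct(records):
    """Rebuild the directory entry for one address from its access log.

    records: list of (processor, time, op) tuples, oldest first,
             where op is "r" (read) or "w" (write).
    Returns (state, holders) with state in {"null", "Sh", "Ex"}.
    """
    sharers = set()
    for proc, _time, op in reversed(records):
        if op == "w":
            if sharers <= {proc}:
                # Nobody else has read since the last write.
                return "Ex", {proc}
            # Later reads downgraded the writer's copy to shared.
            return "Sh", sharers | {proc}
        sharers.add(proc)
    return ("Sh", sharers) if sharers else ("null", set())
```

Applied to the six records in the example, the scan sees P2's read, then stops at P1's write, giving `("Sh", {"P1", "P2"})`. Older records are never examined, which is why a short circular queue per address may be enough.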

65
Challenges
  • Accesses stored in circular queue. How many
    records needed for each address?
  • What happens when a processor writes back data?
    The current scheme introduces false hits.
  • Does this scheme always work? Some proofs would
    be nice.

66
Methodology
  • Platform
  • I got Simics to boot SMP Linux
  • But Bochs is open-source
  • X86 is key if this is to be long-term framework
  • Cilk benchmarks?
  • Tasks
  • Create detailed directory model
  • Create meta directory
  • Create a way to switch back and forth maintaining
    program correctness
  • Measure the dramatic improvement!

67
Conclusion
  • Reality
  • Sounds daunting for a December deadline, but if I
    can prove feasibility or find fatal flaws, I'd be
    excited
  • Suggestions?

68
Project Proposal: Parallelizing METIS
  • Zardosht Kasheff

69
What is METIS?
  • Graph Partitioning algorithm
  • Developed at University of Minnesota by George
    Karypis and Vipin Kumar
  • Information on METIS
  • http://www-users.cs.umn.edu/~karypis/metis/

70
Stages of Algorithm: Coarsening
  • Task: create successively smaller graphs that
    are good representations of the original graph by
    collapsing connected nodes.
  • Issues
  • Minor concurrency issues.
  • Maintaining data locality.
  • Writing large amount of data to memory in a
    scalable fashion.
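
One coarsening step can be illustrated with a greedy matching that collapses pairs of connected nodes. This is a serial sketch for intuition only, not METIS code: the heaviest-edge-first matching rule, the function name `coarsen`, and the edge-dictionary format are assumptions chosen for the example.

```python
def coarsen(edges, num_nodes):
    """Collapse matched pairs of connected nodes into single nodes.

    edges: dict mapping (u, v) with u < v to an edge weight
    Returns (coarse_edges, num_coarse_nodes, coarse_id).
    """
    # Greedy matching: visit edges from heaviest to lightest and match
    # both endpoints if neither has been matched yet.
    match = {}
    for (u, v) in sorted(edges, key=edges.get, reverse=True):
        if u not in match and v not in match:
            match[u] = v
            match[v] = u

    # Give each matched pair (or unmatched node) one coarse node id.
    coarse_id = {}
    next_id = 0
    for u in range(num_nodes):
        if u in coarse_id:
            continue
        coarse_id[u] = next_id
        if u in match:
            coarse_id[match[u]] = next_id
        next_id += 1

    # Rebuild the edge set, summing weights of edges that now coincide
    # and dropping edges that fell inside a collapsed pair.
    coarse_edges = {}
    for (u, v), w in edges.items():
        a, b = coarse_id[u], coarse_id[v]
        if a != b:
            key = (min(a, b), max(a, b))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_edges, next_id, coarse_id
```

The matching loop is where concurrency issues would arise in a parallel version, since two processors must not match the same node, and the final loop is the part that writes a large amount of data.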

71
Stages of Algorithm: Initial Partitioning
  • Task: partition the small graph
  • Issues: none. The runtime of this portion of the
    algorithm is very small.

72
Stages of Algorithm: Uncoarsening and Refinement
  • Uncoarsening and refinement: project the partition of
    the coarsest graph onto the original graph and refine
    the partition.
  • Issues
  • Major concurrency issues.
  • Remaining issues under research.

73
Parallel Sorting
  • Paul Youn
  • I propose to parallelize a sorting algorithm.
    Initially, looking at Quick Sort, but may
    investigate other sorting algorithms.
  • Sorting is a problem that can potentially be sped
    up significantly (mergesort).
  • As an alternative, considering other algorithms.

74
Goals
  • Similar to Lab 1 approach.
  • Speedy serial algorithm.
  • Basic parallel algorithm.
  • Runtime Analysis.
  • Empirical verification of runtime analysis.
  • Parallel speedups.

75
HELP!
  • Not sure about the scope of the problem.
  • Anyone out there interested in this stuff?
  • Other appropriate algorithms? Max-flow?

76
PROJECT PROPOSAL (SMA5509): Implement an FIR filter in
parallel using Cilk
  • Name: Pham Duc Minh
  • Matric: HT030502H
  • Email: g0300231_at_nus.edu.sg
  • phamducm_at_comp.nus.edu.sg

77
Discrete Signal
  • the discrete input signal is x(n)
  • the output signal is y(n)
  • the impulse response h(n)

78
FIR filter
  • Discrete-time LTI (linear time-invariant)
    systems can be classified into FIR and IIR.
  • An FIR filter has an impulse response h(n) that
    extends only over a finite time interval, say
    0 ≤ n ≤ M, and is identically zero beyond that
  • h0, h1, h2, ..., hM, 0, 0, 0, ... M is
    referred to as the filter order.
  • The output y(n) in FIR is simplified to the
    finite-sum form
    $y(n) = \sum_{m=0}^{M} h_m \, x(n-m)$
  • or, explicitly
    $y(n) = h_0 x(n) + h_1 x(n-1) + \dots + h_M x(n-M)$
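
As a serial baseline, the finite-sum form can be written directly as a double loop. This is a minimal sketch: the function name `fir_filter` is an assumption, and the input is treated as zero outside its recorded samples.

```python
def fir_filter(h, x):
    """Direct evaluation of y(n) = sum over m of h[m] * x[n - m].

    h: impulse response h0 .. hM (filter order M = len(h) - 1)
    x: finite input signal
    Returns y with len(x) + M samples.
    """
    M = len(h) - 1
    y = [0] * (len(x) + M)
    for n in range(len(y)):
        for m in range(M + 1):
            if 0 <= n - m < len(x):  # x is zero outside its support
                y[n] += h[m] * x[n - m]
    return y
```

For example, `fir_filter([1, 2], [1, 1, 1])` returns `[1, 3, 3, 2]`. Each output sample depends only on the inputs, not on other outputs, so the outer loop is a natural candidate for parallel execution.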

79
Methods to process
  • 1. Block processing
  • a. Convolution

(convolution table form)
80
b. Direct form: the length of the input signal
x(n) is L. The length of the output y(n) will be
Ly = Lx + Lh - 1
81
c. Matrix form y = Hx, where the dimension of H is Ly x Lx
82
  • Overlap-Add Block Convolution Method
  • The above methods are not feasible for infinite or
    extremely long inputs
  • Divide the long input into contiguous
    non-overlapping blocks of manageable length, say
    L samples
  • y0 = h * x0 (where * denotes convolution)
  • y1 = h * x1
  • y2 = h * x2
  • Thus the resulting block y0 starts at absolute
    time n = 0,
  • block y1 starts at n = L, and so on.
  • The last M samples of each output block will
    overlap with
  • the first M outputs of the next block.
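
The overlap-add method described above can be sketched in a few lines. The helper `convolve` and the function name `overlap_add` are assumptions introduced for this example; the block length L is a free parameter.

```python
def convolve(h, x):
    """Plain convolution of two finite sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def overlap_add(h, x, L):
    """Filter x block by block, L input samples at a time."""
    M = len(h) - 1
    y = [0] * (len(x) + M)
    for start in range(0, len(x), L):
        block = x[start:start + L]
        yb = convolve(h, block)  # len(block) + M output samples
        # Output block starts at absolute time `start`; its last M
        # samples are added onto the first M samples of the next block.
        for k, value in enumerate(yb):
            y[start + k] += value
    return y
```

The result matches a single convolution of the whole input, for instance `overlap_add([1, 2, 3], [1, 2, 3, 4, 5, 6, 7], 3) == convolve([1, 2, 3], [1, 2, 3, 4, 5, 6, 7])`. The per-block convolutions are independent of each other; only the final additions into the overlapping samples touch shared output.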

83
  • 2. Sample Processing
  • The direct form I/O convolutional equation
  • We define the internal states w1(n), w2(n),
    w3(n), ...
  • w0(n) = x(n)
  • w1(n) = x(n-1) = w0(n-1)
  • w2(n) = x(n-2) = w1(n-1)
  • ..
  • With this definition, we can write y(n) in the form
    $y(n) = h_0 w_0(n) + h_1 w_1(n) + \dots + h_M w_M(n)$
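
Sample processing keeps the internal states between calls and produces one output per input sample. The closure-based sketch below is one way to express it; the name `make_fir` is an assumption made for illustration.

```python
def make_fir(h):
    """Return a function that filters one sample at a time."""
    w = [0] * len(h)  # internal states w0 .. wM, initially zero

    def process(sample):
        # Shift the delay line: w_i(n) = w_{i-1}(n-1), updating from
        # the last state backwards so no value is overwritten early.
        for i in range(len(w) - 1, 0, -1):
            w[i] = w[i - 1]
        w[0] = sample  # w0(n) = x(n)
        # y(n) = h0*w0(n) + h1*w1(n) + ... + hM*wM(n)
        return sum(hi * wi for hi, wi in zip(h, w))

    return process
```

Feeding the samples 1, 1, 1, 0 into `make_fir([1, 2])` yields 1, 3, 3, 2, the same values block processing gives for the input 1, 1, 1.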

84
  • Goal of project
  • Use Cilk to implement FIR filter in parallel
  • Compare with the processes of DSP kit (pipeline)
    and the process of FPGA (Field Programmable Gate
    Array) (also parallel).

85
  • How to implement
  • Multithreaded programming
  • Cilk compiler

86
Other things
  • If I finish early, I would welcome suggestions
    for additional work from the staff.

87
Cache-Oblivious Algorithms
88
SMA5509 Theory of Parallel Systems
Project Proposal: Cache-oblivious sorting for the
Burrows-Wheeler Transform
  • Sriram Saroop
  • Advait D. Karande

89
bzip2 compression
  • Lossless text compression scheme

90
Burrows-Wheeler Transform (1994)
  • A pre-processing step for data compression
  • Involves sorting of all rotations of the block of
    data to be compressed
  • Rationale: transformed strings compress better
  • Worst-case complexity of N^2 log(N), where N = size
    of block to be sorted
  • Can be improved to N (log N)^2 (Sadakane '98)
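
The transform itself is short when written naively. The sketch below sorts all rotations by direct string comparison, which is the simple approach with the N^2 log(N) worst case quoted above, not the improved algorithm and not bzip2's implementation. The function name `bwt` is an assumption, and the block is assumed to be non-empty.

```python
def bwt(block):
    """Naive Burrows-Wheeler Transform of a non-empty string.

    Returns (last_column, index), where index is the position of the
    original string among the sorted rotations.
    """
    n = len(block)
    # Sort rotation start positions by the rotation they denote. Each
    # comparison may inspect up to N characters.
    rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
    # The last character of the rotation starting at i is block[i - 1].
    last_column = "".join(block[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)
```

For example, `bwt("banana")` returns `("nnbaaa", 3)`: equal characters end up adjacent, which is what makes the transformed block compress better.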

91
Why make it cache-oblivious?
  • Performance of BWT heavily dependent on cache
    behavior (Seward '00)
  • Avoid slowdown for large files with high degree
    of repetitiveness
  • Especially useful in applications like bzip2 that
    are required to perform well on different memory
    structures

92
Project Plan
  • Design and implementation of a cache oblivious
    algorithm for BWT sorting
  • Performance comparisons with standard bzip2 for
    available benchmarks
  • Back-up: optimizations for BWT, including
    parallelization, and incorporating
    cache-obliviousness in other parts of bzip2
    compression

93
New Cache Oblivious Algorithms
  • Neel Kamal
  • Zhang JiaHui

94
Our Goals
  • Designing Cache Oblivious Algorithms for
    problems in several areas of Computer Science.
  • Finding theoretical bounds of number of cache
    misses for these algorithms.
  • Trying to parallelize them

95
Large Integer Arithmetic Operations
  • Basic Operators
  • Addition, Subtraction
  • Multiplication
  • Factorial
  • Matrix Multiplication
  • Application: RSA encryption algorithm

96
Large Integer Multiplication
Basic Idea
A B
C D
B x D
A x D
B x C
A x C
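
The basic idea in the figure, splitting each operand into a high and a low half and combining the four partial products, can be sketched recursively. The function name `multiply` and the `cutoff_bits` parameter are assumptions; Python's built-in integers stand in for a large-integer representation.

```python
def multiply(x, y, cutoff_bits=64):
    """Multiply non-negative integers by splitting each into two halves.

    With x = A * 2^k + B and y = C * 2^k + D:
        x * y = (A*C) * 2^(2k) + (A*D + B*C) * 2^k + B*D
    """
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # small enough: multiply directly
    k = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << k) - 1
    a, b = x >> k, x & mask  # high and low halves of x
    c, d = y >> k, y & mask  # high and low halves of y
    ac = multiply(a, c, cutoff_bits)
    ad = multiply(a, d, cutoff_bits)
    bc = multiply(b, c, cutoff_bits)
    bd = multiply(b, d, cutoff_bits)
    return (ac << (2 * k)) + ((ad + bc) << k) + bd
```

The four recursive products are independent, so they could be computed in parallel, and the recursion keeps shrinking the operands until they fit in any level of the memory hierarchy, which is the usual route to a cache-oblivious bound.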
97
Dynamic Programming Algorithms
  • Longest Common Subsequence: Basic Idea

         B  D  C  A  B  A
      A  0  0  0  1  1  1
      B  1  1  1  1  2  2
      C  1  1  2  2  2  2
      B  1  1  2  2  3  3
      D  1  2  2  2  3  3
      A  1  2  2  3  3  4
      B  1  2  2  3  4  4
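
The table above is what the standard dynamic program produces for the strings ABCBDAB (rows) and BDCABA (columns). A minimal row-by-row sketch follows; the function name `lcs_table` is an assumption, and this is the plain iterative version rather than a cache-oblivious one.

```python
def lcs_table(x, y):
    """Fill the longest-common-subsequence table for strings x and y.

    table[i][j] is the LCS length of x[:i] and y[:j]; row 0 and
    column 0 are the all-zero boundary.
    """
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table
```

Here `lcs_table("ABCBDAB", "BDCABA")[-1][-1]` is 4. Each entry depends only on its upper, left, and upper-left neighbours, which is the structure a recursive, cache-oblivious traversal of the table would have to respect.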
98
Graph Algorithm
  • All-pairs shortest paths algorithm (Floyd)
  • Connectivity Testing of a Graph
  • And more











99
Computational Geometry
  • Finding the closest pair of points
  • And More

100
Concurrent Cache-Oblivious Algorithms
  • Seth Gilbert

101
Cache-Oblivious Algorithms
  • Optimal memory access
  • Matrix multiply, Fast Fourier Transform
  • B-trees, Priority Queues
  • Concurrent?
  • wait-free, lock-free, obstruction-free
  • Synchronization?
  • DCAS, CAS, LL/SC

102
Packed Memory Structure
  • Operations
  • Traverse(k) -- O(k/B)
  • Insert -- O(log^2 n / B)
  • Delete -- O(log^2 n / B)
  • Goal
  • Design a lock-free version of the cache-oblivious
    algorithm using CAS

103
More
  • Basis for many data structures
  • B-tree
  • distributed data structures
  • Maybe parallel nondeterminator

104
The End
105
Packed Memory Structure
  • size(array) < 4 x num. elements
  • elements well-spaced in array