Title: Companion Slides for The Art of Multiprocessor Programming
1. Introduction
- Companion slides for
- The Art of Multiprocessor Programming
- by Maurice Herlihy & Nir Shavit
- Modified by Rajeev Alur
- for CIS 640 at Penn, Spring 2009
2. Moore's Law
[Graph: transistor count still rising, while clock speed is flattening sharply]
3. Still on some of your desktops: The Uniprocessor
[Diagram: a single CPU connected to memory]
4. In the Enterprise: The Shared-Memory Multiprocessor (SMP)
5. Your New Desktop: The Multicore Processor (CMP)
[Diagram: Sun T2000 Niagara. Multiple cores, each with its own cache, connected by a bus to shared memory, all on the same chip.]
6. Multicores Are Here
- "Intel ups ante with 4-core chip. New microprocessor, due this year, will be faster, use less electricity..." (San Francisco Chronicle)
- "AMD will launch a dual-core version of its Opteron server processor at an event in New York on April 21." (PC World)
- "Sun's Niagara will have eight cores, each core capable of running 4 threads in parallel, for 32 concurrently running threads." (The Inquirer)
7. Why do we care?
- Time no longer cures software bloat
- The free ride is over
- When you double your program's path length
- You can't just wait 6 months
- Your software must somehow exploit twice as much concurrency
8. Traditional Scaling Process
[Graph: the same user code runs 1.8x, 3.6x, then 7x faster on successive generations of uniprocessors, as clock speed rises over time with Moore's Law]
9. Multicore Scaling Process
[Graph: the same user code spread across progressively more cores, hoping for proportional speedup]
Unfortunately, not so simple
10. Real-World Scaling Process
[Graph: actual speedups of 1.8x, 2x, and 2.9x, well short of the core counts]
Parallelization and synchronization require great care
11. Multicore Programming Course Overview
- Fundamentals
- Models, algorithms, impossibility
- Real-world programming
- Architectures
- Techniques
- Topics not in textbook
- Memory models and system-level concurrency libraries
- High-level programming abstractions
12. A Zoo of Terms
- Concurrent
- Parallel
- Distributed
- Multicore
- What do they all mean? How do they differ?
13. Concurrent Computing
- Programs designed as a collection of interacting threads/processes
- Logical/programming abstraction
- May be implemented on a single processor by interleaving, on multiple processors, or on distributed computers
- Coordination/synchronization mechanisms in a model of concurrency may be realized in many ways in an implementation
14. Parallel Computing
- Computations that execute simultaneously to solve a common problem (more efficiently)
- Parallel algorithms: which problems can be sped up given multiple execution units?
- Parallelism can be at many levels (e.g. bit-level, instruction-level, data path)
- Grid computing: a branch of parallel computing where problems are solved on clusters of computers (interacting by message passing)
- Multicore computing: a branch of parallel computing focusing on multiple execution units on the same chip (interacting through shared memory)
15. Distributed Computing
- Involves multiple agents/programs (possibly with different computational tasks) and multiple computational resources (computers, multiprocessors, network)
- Many examples of contemporary software (e.g. web services) are distributed systems
- Their heterogeneous nature, and the range of time scales (web access vs. local access), make design/programming more challenging
16. Sequential Computation
[Diagram: a single thread operating on objects in memory]
17. Concurrent Computation
[Diagram: multiple threads operating on shared objects in memory]
18. Asynchrony
- Sudden unpredictable delays
- Cache misses (short)
- Page faults (long)
- Scheduling quantum used up (really long)
19. Model Summary
- Multiple threads
- Sometimes called processes
- Single shared memory
- Objects live in memory
- Unpredictable asynchronous delays
20. Road Map
- Textbook focuses on principles first, then practice
- Start with idealized models
- Look at simplistic problems
- Emphasize correctness over pragmatism
- Correctness may be theoretical, but incorrectness has practical impact
- In the course, interleaving of chapters from the two parts
21. Concurrency Jargon
- Hardware
- Processors
- Software
- Threads, processes
- Sometimes OK to confuse them, sometimes not.
22. Parallel Primality Testing
- Challenge
- Print primes from 1 to 10^10
- Given
- Ten-processor multiprocessor
- One thread per processor
- Goal
- Get ten-fold speedup (or close)
23. Load Balancing
[Diagram: the range 1..10^10 split at 10^9, 2·10^9, ..., one block per processor P0, P1, ..., P9]
- Split the work evenly
- Each thread tests a range of 10^9 numbers
24. Procedure for Thread i

  void primePrint() {
    int i = ThreadID.get();  // IDs in 0..9
    for (long j = i * 1000000000L + 1; j <= (i + 1) * 1000000000L; j++) {
      if (isPrime(j)) print(j);
    }
  }
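A minimal runnable sketch of this static split, assuming plain Java threads and a simple trial-division isPrime in place of the slide's ThreadID and print helpers (the class name, RANGE constant, and shrunken range size are illustrative, not from the slides):

  // Illustrative harness for the static range split.
  public class StaticSplit {
    static final long RANGE = 1000L;    // stand-in for the slide's 10^9 per thread
    static final int THREADS = 10;

    // Simple trial-division primality test (assumed, not from the slide)
    static boolean isPrime(long n) {
      if (n < 2) return false;
      for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
      return true;
    }

    public static void main(String[] args) throws InterruptedException {
      Thread[] workers = new Thread[THREADS];
      for (int t = 0; t < THREADS; t++) {
        final int i = t;                // plays the role of ThreadID.get()
        workers[t] = new Thread(() -> {
          for (long j = i * RANGE + 1; j <= (i + 1) * RANGE; j++)
            if (isPrime(j)) System.out.println(j);
        });
        workers[t].start();
      }
      for (Thread w : workers) w.join(); // wait for all ranges to finish
    }
  }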
25-26. Issues
- Higher ranges have fewer primes
- Yet larger numbers are harder to test
- Thread workloads
- Uneven
- Hard to predict
- Need dynamic load balancing (the static split above is rejected)
27. Shared Counter
[Diagram: a shared counter handing out successive values 17, 18, 19; each thread takes a number]
28-29. Procedure for Thread i

  Counter counter = new Counter(1);   // shared counter object

  void primePrint() {
    long j = 0;
    while (j < 10000000000L) {        // 10^10
      j = counter.getAndIncrement();
      if (isPrime(j)) print(j);
    }
  }
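In real Java code the shared counter can be a java.util.concurrent.atomic.AtomicLong, whose getAndIncrement is atomic out of the box. A minimal sketch, with the limit shrunk so the demo finishes quickly and isPrime assumed as before:

  import java.util.concurrent.atomic.AtomicLong;

  // Sketch: dynamic load balancing via an atomic shared counter.
  public class SharedCounterPrimes {
    static final AtomicLong counter = new AtomicLong(1);
    static final long LIMIT = 10000L;  // stand-in for the slide's 10^10

    static boolean isPrime(long n) {
      if (n < 2) return false;
      for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
      return true;
    }

    public static void main(String[] args) throws InterruptedException {
      Thread[] workers = new Thread[10];
      for (int t = 0; t < workers.length; t++) {
        workers[t] = new Thread(() -> {
          long j;
          // Each thread grabs the next untested number; fast threads
          // naturally take more work, balancing the load dynamically.
          while ((j = counter.getAndIncrement()) <= LIMIT)
            if (isPrime(j)) System.out.println(j);
        });
        workers[t].start();
      }
      for (Thread w : workers) w.join();
    }
  }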
30. Where Things Reside

  void primePrint() {
    int i = ThreadID.get();  // IDs in 0..9
    for (long j = i * 1000000000L + 1; j <= (i + 1) * 1000000000L; j++) {
      if (isPrime(j)) print(j);
    }
  }

[Diagram: the code and local variables are per-thread; the shared counter, holding 1, lives in shared memory]
31-32. Procedure for Thread i

  Counter counter = new Counter(1);

  void primePrint() {
    long j = 0;
    while (j < 10000000000L) {         // stop when every value is taken
      j = counter.getAndIncrement();   // increment and return each new value
      if (isPrime(j)) print(j);
    }
  }
33-34. Counter Implementation

  public class Counter {
    private long value;
    public long getAndIncrement() {
      return value++;
    }
  }

OK for a single thread, but not for concurrent threads
35-36. What It Means

  public class Counter {
    private long value;
    public long getAndIncrement() {
      return value++;
    }
  }

The expression value++ is shorthand for three separate steps:

  temp = value;
  value = temp + 1;
  return temp;
37. Not so good
An interleaving that loses an update: the shared value goes 1, 2, 3, then back to 2.

  Thread A: read 1 ...................................... write 2
  Thread B: ......... read 1, write 2, read 2, write 3

Thread A read 1 early, stalled, and its late "write 2" overwrites the 3 that Thread B produced.
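A small demo of this lost-update race, using nothing beyond standard Java (the class and method names are illustrative; the unsynchronized counter is the broken one from the slides):

  // Two threads each perform many unsynchronized increments;
  // lost updates make the final count fall short of the expected total.
  public class LostUpdateDemo {
    static long value = 1;   // the slide's broken counter state

    static long getAndIncrement() {
      long temp = value;     // read
      value = temp + 1;      // write (may clobber a concurrent increment)
      return temp;
    }

    public static void main(String[] args) throws InterruptedException {
      Runnable work = () -> {
        for (int k = 0; k < 1000000; k++) getAndIncrement();
      };
      Thread a = new Thread(work), b = new Thread(work);
      a.start(); b.start();
      a.join(); b.join();
      // 2000001 if no update were lost; typically prints something smaller.
      System.out.println("value = " + value);
    }
  }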
38. Is this problem inherent?
[Diagram: each thread's read glued to its following write, so no other operation can slip between them]
If we could only glue reads and writes together...
39-40. Challenge

  public class Counter {
    private long value;
    public long getAndIncrement() {
      long temp = value;
      value = temp + 1;
      return temp;
    }
  }

Make these steps atomic (indivisible)
41. Hardware Solution

  public class Counter {
    private long value;
    public long getAndIncrement() {
      long temp = value;
      value = temp + 1;
      return temp;
    }
  }

A ReadModifyWrite() instruction
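Modern hardware exposes such read-modify-write operations through primitives like compare-and-swap. A sketch of getAndIncrement built on Java's AtomicLong.compareAndSet, one way to realize the idea (not the slides' code; the class name is illustrative):

  import java.util.concurrent.atomic.AtomicLong;

  // Counter whose getAndIncrement is built from a hardware CAS loop.
  public class CasCounter {
    private final AtomicLong value;

    public CasCounter(long start) { value = new AtomicLong(start); }

    public long getAndIncrement() {
      while (true) {
        long temp = value.get();                  // read
        if (value.compareAndSet(temp, temp + 1))  // modify-write, atomically
          return temp;                            // success: no one interfered
        // otherwise another thread won the race; retry
      }
    }
  }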
42-44. An Aside: Java

  public class Counter {
    private long value;
    public long getAndIncrement() {
      synchronized (this) {   // synchronized block
        long temp = value;
        value = temp + 1;
        return temp;
      }
    }
  }

The synchronized block enforces mutual exclusion.
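A quick check that the synchronized version loses no updates, in the same shape as the earlier demo (illustrative harness, not from the slides; it assumes the synchronized Counter above, with value starting at 0):

  // With the synchronized Counter, the two threads' increments never collide.
  public class SynchronizedCounterDemo {
    public static void main(String[] args) throws InterruptedException {
      Counter counter = new Counter();
      Runnable work = () -> {
        for (int k = 0; k < 1000000; k++) counter.getAndIncrement();
      };
      Thread a = new Thread(work), b = new Thread(work);
      a.start(); b.start();
      a.join(); b.join();
      // All 2000000 increments took effect, so the next value is 2000000.
      System.out.println(counter.getAndIncrement());
    }
  }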
45. Why do we care?
- We want as much of the code as possible to execute concurrently (in parallel)
- A larger sequential part implies reduced performance
- Amdahl's Law: this relation is not linear
46-50. Amdahl's Law
Speedup of a computation given n CPUs instead of 1:

  Speedup = 1 / ((1 - p) + p/n)

where p is the parallel fraction, 1 - p the sequential fraction, and n the number of processors.
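A short derivation, normalizing the single-processor running time to 1 (standard reasoning, not spelled out on the slides):

  % Normalize the 1-processor execution time to 1. The sequential
  % part (1-p) is unchanged; the parallel part p is divided among
  % the n processors. Speedup is the ratio of the two times.
  T_1 = 1, \qquad
  T_n = (1 - p) + \frac{p}{n}, \qquad
  \text{Speedup} = \frac{T_1}{T_n} = \frac{1}{(1 - p) + \dfrac{p}{n}}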
51-52. Example
- Ten processors
- 60% concurrent, 40% sequential
- How close to 10-fold speedup?
- Speedup = 1 / (0.4 + 0.6/10) ≈ 2.17
53-54. Example
- Ten processors
- 80% concurrent, 20% sequential
- How close to 10-fold speedup?
- Speedup = 1 / (0.2 + 0.8/10) ≈ 3.57
55-56. Example
- Ten processors
- 90% concurrent, 10% sequential
- How close to 10-fold speedup?
- Speedup = 1 / (0.1 + 0.9/10) ≈ 5.26
57-58. Example
- Ten processors
- 99% concurrent, 1% sequential
- How close to 10-fold speedup?
- Speedup = 1 / (0.01 + 0.99/10) ≈ 9.17
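The whole table in a few lines of Java (illustrative; class and method names are mine):

  // Prints Amdahl's-law speedups for the four example workloads on 10 CPUs.
  public class Amdahl {
    static double speedup(double p, int n) {
      return 1.0 / ((1.0 - p) + p / n);
    }
    public static void main(String[] args) {
      for (double p : new double[] {0.60, 0.80, 0.90, 0.99})
        System.out.printf("p = %.2f: speedup = %.2f%n", p, speedup(p, 10));
    }
  }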
59. The Moral
- Making good use of our multiple processors
(cores) means - Finding ways to effectively parallelize our code
- Minimize sequential parts
- Reduce idle time in which threads wait
60. Multicore Programming
- This is what this course is about
- The fraction that is not easy to make concurrent may yet have a large impact on overall speedup