Title: ECE 1747H : Parallel Programming
1 ECE 1747H Parallel Programming
2 ECE 1747H
- Meeting time: Mon 4-6 PM
- Instructor: Cristiana Amza
- http://www.eecg.toronto.edu/amza
- amza_at_eecg.toronto.edu, office Pratt 484E
3 Material
- Course notes
- Web material (e.g., published papers)
- No required textbook, some recommended
4 Prerequisites
- Programming in C or C++
- Data structures
- Basics of machine architecture
- Basics of network programming
- Please send e-mail to ecehelp_at_ece.toronto.edu to get an eecg account!! (name, student ID, class, instructor)
- Send e-mail to madalin_at_cs.toronto.edu to get a cluster account (this is on our research cluster, for the purpose of the homework).
5 Other than that
- No written homeworks, no exams
- 10% for each small programming assignment (expect 1)
- 10% class participation
- Rest comes from major course project
6 Programming Project
- Parallelizing a sequential program, or improving the performance or functionality of a parallel program
- Project proposal and final report
- In-class project proposal and final report presentation
- Sample project presentation can be posted
7 Parallelism (1 of 2)
- Ability to execute different parts of a single program concurrently on different machines
- Goal: shorter running time
- Grain of parallelism: how big are the parts?
- Can be an instruction, statement, procedure, ...
- Will mainly focus on relatively coarse grain
8 Parallelism (2 of 2)
- Coarse-grain parallelism is mainly applicable to long-running, scientific programs
- Examples: weather prediction, prime number factorization, simulations, ...
9 Lecture material (1 of 4)
- Parallelism
- What is parallelism?
- What can be parallelized?
- Inhibitors of parallelism: dependences
10 Lecture material (2 of 4)
- Standard models of parallelism
- shared memory (Pthreads)
- message passing (MPI)
- shared memory data parallelism (OpenMP)
- Classes of applications
- scientific
- servers
11 Lecture material (3 of 4)
- Transaction processing
- classic programming model for databases
- now being proposed for scientific programs
12 Lecture material (4 of 4)
- Performance of parallel and distributed programs
- architecture-independent optimization
- architecture-dependent optimization
13 Course Organization
- First 2-3 weeks of the semester:
- lectures on parallelism, patterns, models
- small programming assignment, done individually or in teams of up to 3
- Rest of the semester:
- major programming project, done individually or in a small group
- Research paper discussions
14 Parallel vs. Distributed Programming
- Parallel programming has matured
- Few standard programming models
- Few common machine architectures
- Portability between models and architectures
15 Bottom Line
- Programmer can now focus on the program and use a suitable programming model
- Reasonable hope of portability
- Problem: much performance optimization is still platform-dependent
- Performance portability is a problem
16 ECE 1747H Parallel Programming
- Lecture 1-2: Parallelism, Dependences
17 Parallelism
- Ability to execute different parts of a program concurrently on different machines
- Goal: shorten execution time
18 Measures of Performance
- To computer scientists: speedup, execution time.
- To applications people: size of problem, accuracy of solution, etc.
19 Speedup of Algorithm
- Speedup of an algorithm = sequential execution time / execution time on p processors (with the same data set).
(figure: speedup plotted as a function of the number of processors p)
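- Illustrative example (numbers assumed, not from the slides): if the sequential run takes 100 s and the run on 8 processors takes 25 s on the same data set, the speedup is 100 / 25 = 4.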
20 Speedup on Problem
- Speedup on a problem = sequential execution time of the best known sequential algorithm / execution time on p processors.
- A more honest measure of performance.
- Avoids picking an easily parallelizable algorithm with poor sequential execution time.
21 What Speedups Can You Get?
- Linear speedup
- Confusing term: implicitly means a 1-to-1 speedup per processor.
- (Almost always) as good as you can do.
- Sub-linear speedup: more normal, due to overhead of startup, synchronization, communication, etc.
22 Speedup
(figure: actual vs. linear speedup as a function of the number of processors p)
23 Scalability
- No really precise definition.
- Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).
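- Illustrative example (numbers assumed, not from the slides): if the speedup is 5.0 on 7 processors and 5.6 on 8 processors, the gain of 0.6 exceeds an acceptance threshold of 0.5, so the program still scales at p = 8.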
24 Super-linear Speedup?
- Due to cache/memory effects:
- Subparts fit into the cache/memory of each node.
- Whole problem does not fit in the cache/memory of a single node.
- Nondeterminism in search problems:
- One thread finds a near-optimal solution very quickly => leads to drastic pruning of the search space.
25 Cardinal Performance Rule
- Don't leave (too) much of your code sequential!
26 Amdahl's Law
- If 1/s of the program is sequential, then you can never get a speedup better than s.
- (Normalized) sequential execution time = 1/s + (1 - 1/s) = 1
- Best parallel execution time on p processors = 1/s + (1 - 1/s) / p
- When p goes to infinity, parallel execution time -> 1/s
- Speedup <= s.
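- Illustrative example (numbers assumed, not from the slides): with 1/s = 0.1 (10% of the program sequential, so s = 10), the best parallel time on p processors is 0.1 + 0.9/p; as p grows this approaches 0.1, so the speedup can never exceed 1/0.1 = 10.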
27 Why keep something sequential?
- Some parts of the program are not parallelizable (because of dependences)
- Some parts may be parallelizable, but the overhead dwarfs the increased speedup.
28 When can two statements execute in parallel?
- On one processor:
- statement1
- statement2
- On two processors:
- processor1 executes statement1, while processor2 executes statement2
29 Fundamental Assumption
- Processors execute independently: no control over order of execution between processors
30 When can 2 statements execute in parallel?
- Possibility 1: processor1 executes statement1 before processor2 executes statement2
- Possibility 2: processor2 executes statement2 before processor1 executes statement1
31 When can 2 statements execute in parallel?
- Their order of execution must not matter!
- In other words,
- statement1; statement2;
- must be equivalent to
- statement2; statement1;
32 Example 1
- a = 1;
- b = a;
- Statements cannot be executed in parallel.
- Program modifications may make it possible (sketched below).
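One possible modification, shown here only as a sketch of the kind of transformation the slide alludes to (forward substitution of the known value; not necessarily the transformation intended):

    /* Original: true dependence, b must wait for a. */
    a = 1;
    b = a;

    /* After substituting the known value, the two statements
       no longer share data and could execute in parallel.     */
    a = 1;
    b = 1;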
33 Example 2
- a = f(x);
- b = a;
- May not be wise to change the program (sequential execution would take longer).
34 Example 3
- a = 1;
- a = 2;
- Statements cannot be executed in parallel.
35 True dependence
- Statements S1, S2
- S2 has a true dependence on S1
- iff
- S2 reads a value written by S1
36 Anti-dependence
- Statements S1, S2.
- S2 has an anti-dependence on S1
- iff
- S2 writes a value read by S1.
37 Output Dependence
- Statements S1, S2.
- S2 has an output dependence on S1
- iff
- S2 writes a variable written by S1.
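As a concrete illustration of the three definitions above (a minimal sketch; the variables x and y are made up for this example):

    /* True (flow) dependence: S2 reads a value written by S1. */
    x = 1;          /* S1 writes x */
    y = x + 1;      /* S2 reads x  */

    /* Anti-dependence: S2 writes a value read by S1. */
    y = x + 1;      /* S1 reads x  */
    x = 2;          /* S2 writes x */

    /* Output dependence: S2 writes a variable written by S1. */
    x = 1;          /* S1 writes x */
    x = 2;          /* S2 writes x */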
38 When can 2 statements execute in parallel?
- S1 and S2 can execute in parallel
- iff
- there are no dependences between S1 and S2
- true dependences
- anti-dependences
- output dependences
- Some dependences can be removed.
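The last bullet notes that some dependences can be removed. A minimal sketch of the usual idea, under the assumption that the dependence is only a naming conflict (anti- or output dependence): introduce a fresh variable so the two statements no longer touch the same storage.

    /* Anti-dependence: S2 overwrites x, which S1 still needs to read. */
    y = x + 1;      /* S1 reads x  */
    x = 2;          /* S2 writes x */

    /* After renaming, the statements are independent
       (later uses of x must be updated to read x2).   */
    y = x + 1;      /* S1 reads the old x         */
    x2 = 2;         /* S2 writes a fresh variable */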
39 Example 4
- Most parallelism occurs in loops.
- for (i = 0; i < 100; i++)
-   a[i] = i;
- No dependences.
- Iterations can be executed in parallel.
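A minimal sketch of how such a dependence-free loop could be parallelized with OpenMP (one of the shared-memory models covered later); the function name init and the array declaration are assumptions of this sketch:

    int a[100];

    void init(void)
    {
        int i;
        /* Each iteration writes a distinct a[i] and reads nothing
           written by other iterations, so the iterations may run
           in parallel.                                             */
        #pragma omp parallel for
        for (i = 0; i < 100; i++)
            a[i] = i;
    }

With an OpenMP-capable compiler (e.g., gcc -fopenmp), the iterations are divided among the available threads.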
40 Example 5
- for (i = 0; i < 100; i++) {
-   a[i] = i;
-   b[i] = 2 * i;
- }
- Iterations and statements can be executed in parallel.
41 Example 6
- for (i = 0; i < 100; i++) a[i] = i;
- for (i = 0; i < 100; i++) b[i] = 2 * i;
- Iterations and loops can be executed in parallel.
42 Example 7
- for (i = 0; i < 100; i++)
-   a[i] = a[i] + 100;
- There is a dependence of each a[i] on itself!
- Loop is still parallelizable.
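A minimal sketch of parallelizing this loop by hand with Pthreads (the other shared-memory model in the course); the block partitioning, thread count, and helper names are assumptions of this sketch:

    #include <pthread.h>

    #define N        100
    #define NTHREADS 4

    int a[N];

    /* Each thread updates a disjoint block of a[]; the only dependence
       is of each a[i] on itself, within a single iteration.            */
    void *worker(void *arg)
    {
        long id    = (long)arg;
        int  chunk = N / NTHREADS;
        int  start = id * chunk;
        int  end   = (id == NTHREADS - 1) ? N : start + chunk;

        for (int i = start; i < end; i++)
            a[i] = a[i] + 100;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);
        return 0;
    }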
43 Example 8
- for (i = 0; i < 100; i++)
-   a[i] = f(a[i-1]);
- Dependence between a[i] and a[i-1].
- Loop iterations are not parallelizable.
44 Loop-carried dependence
- A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
- Otherwise, we call it a loop-independent dependence.
- Loop-carried dependences prevent loop iteration parallelization.
45 Example 9
- for (i = 0; i < 100; i++)
-   for (j = 0; j < 100; j++)
-     a[i][j] = f(a[i][j-1]);
- Loop-independent dependence on i.
- Loop-carried dependence on j.
- Outer loop can be parallelized, inner loop cannot.
46 Example 10
- for (j = 0; j < 100; j++)
-   for (i = 0; i < 100; i++)
-     a[i][j] = f(a[i][j-1]);
- Inner loop can be parallelized, outer loop cannot.
- Less desirable situation.
- Loop interchange is sometimes possible (sketched below).
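A sketch of the loop interchange mentioned above, applied to the loop nest of Example 10 (here the j loop starts at 1 so that a[i][j-1] stays in bounds, and f is assumed to have no side effects, which is what makes the interchange legal):

    /* Before: the parallelizable i loop is innermost. */
    for (j = 1; j < 100; j++)
        for (i = 0; i < 100; i++)
            a[i][j] = f(a[i][j-1]);

    /* After interchange: the i loop carries no dependence and is now
       outermost, so each value of i can go to a different processor. */
    for (i = 0; i < 100; i++)
        for (j = 1; j < 100; j++)
            a[i][j] = f(a[i][j-1]);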
47 Level of loop-carried dependence
- Is the nesting depth of the loop that carries the dependence.
- Indicates which loops can be parallelized.
48 Be careful: Example 11
- printf(a);
- printf(b);
- Statements have a hidden output dependence due to the output stream.
49 Be careful: Example 12
- a = f(x);
- b = g(x);
- Statements could have a hidden dependence if f and g update the same variable.
- Also depends on what f and g can do to x.
50 Be careful: Example 13
- for (i = 0; i < 100; i++)
-   a[i+10] = f(a[i]);
- Dependence between a[10], a[20], ...
- Dependence between a[11], a[21], ...
- ...
- Some parallel execution is possible (sketched below).
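One way to see the partial parallelism (a sketch; the chain decomposition is an illustration, not code from the slides): iterations whose indices differ by a multiple of 10 form one dependence chain, and the ten chains are independent of each other.

    /* a[i+10] written at iteration i is read at iteration i+10, so
       iterations r, r+10, r+20, ... form one sequential chain.      */
    for (r = 0; r < 10; r++)            /* the 10 chains are independent
                                           and could run in parallel     */
        for (i = r; i < 100; i += 10)   /* within a chain: sequential    */
            a[i + 10] = f(a[i]);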
51 Be careful: Example 14
- for (i = 1; i < 100; i++) {
-   a[i] = ...;
-   ... = a[i-1];
- }
- Dependence between a[i] and a[i-1]
- Complete parallel execution impossible
- Pipelined parallel execution possible
52 Be careful: Example 15
- for (i = 0; i < 100; i++)
-   a[i] = f(a[indexa[i]]);
- Cannot tell for sure.
- Parallelization depends on user knowledge of the values in indexa[].
- User can tell, compiler cannot.
53 Optimizations: Example 16
- for (i = 0; i < 100000; i++)
-   a[i % 1000] = a[i] + 1;
- Cannot be parallelized as is.
- May be parallelized by applying certain code transformations.
54 An aside
- Parallelizing compilers analyze program dependences to decide on parallelization.
- In parallelization by hand, the user does the same analysis.
- Compiler: more convenient and more correct.
- User: more powerful, can analyze more patterns.
55 To remember
- Statement order must not matter.
- Statements must not have dependences.
- Some dependences can be removed.
- Some dependences may not be obvious.