Keynote: Parallel Programming for High Schools (PowerPoint transcript)
1
Keynote: Parallel Programming for High Schools
  • Uzi Vishkin, University of Maryland
  • Ron Tzur, Purdue University
  • David Ellison, University of Maryland and
    University of Indiana
  • George Caragea, University of Maryland
  • CS4HS Workshop, Carnegie Mellon University, July
    26, 2009

2
Why are we here?
  • It's a time of emerging change in what literacy
    in CS means
  • Parallel Algorithmic Thinking (PAT)

3
Goals
  • Nurture your:
  • Sense of urgency about the shift to parallel within
    computational thinking
  • Sense of PAT and of potential student
    understandings
  • Confidence, competence, and enthusiasm in your ability
    to take on the challenge of promoting PAT in your
    students
  • At the end we hope you'll say:
  • "I understand, I want to do it, I can, and I know
    it will not happen without me" (irreplaceable
    member of the jury)

4
Outline (RT)
  • Intro: What's all the fuss about parallelism?
    (UV)
  • Teaching-learning activities in XMT (DE)
  • A teacher's voice (1): It's the future +
    teachable (ST)
  • PAT Module: goals, plan, hands-on pedagogy,
    learning theory (RT)
  • A teacher's voice (2): XMT approach/content (ST)
  • Hands-on: The Merge-Sort Problem (UV)
  • A teacher's voice (3): To begin PAT, use XMTC
    (ST)
  • How to (start)? (GC)
  • Q & A (All, 12 min)

5
Intro: Commodity Computer Systems (UV)
  • Serial general-purpose computing:
  • 1946-2003: 5KHz → 4GHz
  • 2004: clock frequency growth turns flat
  • 2004 onward:
  • Parallelism: the only game in town
  • If you want your program to run significantly
    faster, you're going to have to parallelize it
  • General-purpose computing goes parallel
  • Transistors/chip, 1980-2011: 29K → 30B!
  • #cores: ~d^(y-2003), i.e., growing exponentially with the year y

6
Intro: Commodity Computer Systems
  • 40 years of parallel computing:
  • Never a successful general-purpose parallel
    computer (easy to program + good speedups)
  • Grade from NSF Blue-Ribbon Panel on
    Cyberinfrastructure: F!!!
  • "Programming existing parallel computers is as
    intimidating and time consuming as programming
    in assembly language"

7
Intro: Second Paradigm Shift - Within Parallel
  • Existing parallel paradigm: decomposition-first
  • Too painful to program
  • Needed paradigm: express only what can be done
    in parallel
  • Natural (parallel) algorithm: Parallel
    Random-Access Model (PRAM)
  • Build both machine (HW) and programming (SW)
    around this model

8
Middle School Summer Camp Class Picture, July '09
(20 of 22 students)
9
Demonstration: Exchange Problem (DE)
10
Let's Look at Our First Step
Our first step: X ← A
11
Let's Look at Our Second Step
Our second step: A ← B
12
Let's Look at Our Third Step
Our third step: B ← X
  • Our first algorithm, in pseudo programming code:
  • X ← A
  • A ← B
  • B ← X
  • Serial exchange: 3 steps, 3 operations, 1
    working memory space

How many steps? 3
How many operations? 3
What's the connection between the number of steps
and the number of operations? Equal.
How much working memory space is consumed? 1 space
Hands-on Challenge: Can we exchange the contents
of A (2) and B (5) in fewer steps?
13
What is the hint in this figure?
14
First Step in a Parallel Algorithm
X ← A and simultaneously Y ← B
Can you anticipate the next step?
15
Second Step in a Parallel Algorithm
Step 1: X ← A and Y ← B
Step 2: A ← Y and simultaneously B ← X
How many steps? 2
How many operations? 4
How much working memory space is consumed? 2
Can you make any generalizations with respect to
serial and parallel problem solving?
Parallel algorithms tend to involve fewer steps,
but may cost more operations and may consume more
working memory.
X ← A and Y ← B
A ← Y and B ← X
16
Array Exchange: A and B are arrays with indices
0-9 and input state as shown. Using a single
working memory space X, devise an algorithm to
exchange the contents of cells with the same
index (e.g., replace A[0]=22 with B[0]=12).
Consider the number of steps and operations.
Step 1: X ← A[0]
Step 2: A[0] ← B[0]
Step 3: B[0] ← X
Step 4: X ← A[1]
...
For i = 0 to 9 do: X ← A[i]; A[i] ← B[i]; B[i] ← X; end
How many steps are needed to complete the exchange? 30
How many operations? 30
How much working memory space? 1
Your homework asks for the general case of arrays
A and B of length n
17
Array Exchange Problem: Can you parallelize it?
Step 1: X[0-9] ← A[0-9]
Step 2: A[0-9] ← B[0-9]
Step 3: B[0-9] ← X[0-9]
Parallel algorithm: For i = 0 to n-1 pardo:
X(i) ← A(i); A(i) ← B(i); B(i) ← X(i); end
XMTC program: spawn(0,n-1) { var x; x = A[$];
A[$] = B[$]; B[$] = x; } (where $ denotes the
thread index)
How many steps? How many operations? How much
working memory space consumed? Answer the above
questions for the general case of arrays A and B
of length n?
3
30
10
3 steps, 3n operations, n spaces
18
Array Exchange Algorithm: A highly parallel
approach
Step 1: X[0-9] ← A[0-9] and Y[0-9] ← B[0-9]
Step 2: A[0-9] ← Y[0-9] and B[0-9] ← X[0-9]
For i = 1 to n pardo: X(i) ← A(i) and Y(i) ← B(i);
A(i) ← Y(i) and B(i) ← X(i); end
How many steps? How many operations? How much
working memory space consumed? Answer the above
questions for the general case of arrays A and B
of length n?
2
40
20
2 steps, 4n operations, 2n spaces
19
Intro: Second Paradigm Shift (cont.)
  • Late 1970s: THEORY
  • Figure out how to think algorithmically in
    parallel
  • Huge success. But...
  • 1997 onward: PRAM-On-Chip @ UMD
  • Derive specs for architecture; design and build
  • Above premises contrasted with the
    build-first, figure-out-how-to-program-later
    approach
  • J. Hennessy: "Many of the early parallel ideas
    were motivated by observations of what was easy
    to implement in the hardware rather than what was
    easy to use"

20
Pre Many-Core Parallelism: Three Thrusts
  • Improving single-task completion time for
    general-purpose parallelism was not the main
    target of parallel machines
  • 1. Application-specific:
  • computer graphics
  • Limiting origin
  • GPUs: great performance if you figure out how
  • 2. Parallel machines for high throughput (of
    serial programs)
  • Only choice for HPC → language standards, but
    many issues (F!)
  • HW designers (who dominate vendors): YOU figure
    out how to program (their machines) for locality.

21
Pre Many-Core Parallelism: 3 Thrusts (cont.)
  • Currently, how the future computer will look is
    unknown
  • SW vendor impasse: what can a non-HW entity do
    without betting on the wrong horse?
  • Needed: a successor to the Pentium for the multi-core
    era that
  • Is easy to program (hence, easy to learn; hence,
    easy to teach)
  • Gives good performance with any amount of
    parallelism
  • Supports application programming (VHDL/Verilog,
    OpenGL, MATLAB) AND performance programming
  • Fits current chip technology and scales with it
    (particularly strong speed-ups for single-task
    completion time)
  • Hindsight is always 20/20:
  • Should have used the benchmark of programmability
  • → TEACHABILITY!!!

22
Pre Many-Core Parallelism: 3 Thrusts (cont.)
  • PRAM algorithmic theory:
  • Started with a clean slate; target:
  • Programmability
  • Single-task completion time for general-purpose
    parallel computing
  • Currently the theory common to all parallel
    approaches
  • Necessary level of understanding parallelism
  • As simple as it gets. Ahead of its time:
    avant-garde
  • 1990s common wisdom (LogP): never implementable
  • UMD built the eXplicit Multi-Threaded (XMT) parallel
    computer:
  • 100x speedups for 1000 processors on chip
  • XMTC programming language
  • Linux-based simulator; download to any machine
  • Most importantly: TAUGHT IT
  • Graduate → seniors → freshmen → high school →
    middle school
  • Reality check: the human factor → YOU
  • Teachers → Students

23
One Teacher's Voice (RT)
  • Mr. Shane Torbert (could not join us - sister's
    getting married!)
  • Thomas Jefferson (TJ) High School
  • Two years of trial
  • Interview question: Why did you give Vishkin's XMT a
    try?
  • Observe video segment 1:
  • http://www.umiacs.umd.edu/users/vishkin/TEACHING/S
    HANE-TORBERT-INTERVIEW7-09/01 Shane Why
    XMT.m4v (requires an iTunes installation or
    another m4v player)

24
Summary of Shane's Thesis
  • It's the Future and Teachable!!!

25
Teaching PAT with XMT-C
  • Overarching goal
  • Nurture a (50-year) generation of CS enthusiasts
    ready to think/work in parallel (programmers,
    developers, engineers, theoreticians, etc.)
  • Module goals for student learning
  • Understand what parallel algorithms are
  • Understand the differences, and links, between
    parallel and serial algorithms (serial as a
    special case of parallel - single processor)
  • Understand and master how to
  • Analyze a given problem into the shortest
    sequence of steps within which all possible
    concurrent operations are performed
  • Program (code, run, debug, improve, etc.)
    parallel algorithms
  • Understand and use measures of algorithm
    efficiency
  • Run-time
  • Work: distinguish number of operations vs.
    number of steps
  • Complexity

26
Teaching PAT with XMT-C (cont.)
  • Objectives - students will be able to
  • Program parallel algorithms (that run) in XMTC
  • Solve general-purpose, genuine parallel problems
  • Compare and choose best parallel (and serial)
    algorithms
  • Explain why an algorithm is serial/parallel
  • Propose and execute reasoned improvements to
    their own and/or others' parallel algorithms
  • Reason about correctness of algorithms: why does an
    algorithm provide a solution to a given problem?

27
Hands-on: The Bill Gates Intro Problem (from
Baltimore Polytechnic Institute)
  • Please form small groups (3-4)
  • Consider Bill Gates, the richest person on earth
  • Well, he can hire as many helpers as he wants for any
    task in his life
  • Suggest an algorithm to accomplish the following
    morning tasks in the least number of steps, then go
    out to work

28
10-Year-Old Solves "Bill Gates"
  • Play tape

29
A Solution for Bill Gates
Moral: Parallelism introduces both constraints
and opportunities.
Constraints: We can't just assume we can accomplish
everything at once!
Opportunities: Can be much faster than serial!
5 parallel steps versus 11 serial steps.
30
Pedagogical Considerations (1)
  • In your small groups discuss
  • How might solving the Bill Gates problem
    help students in learning PAT?
  • Will you use it as an intro to a PAT module?
    Why?
  • Be ready to share your ideas with the whole group
  • Whole group discussion of Bill Gates problem to
    initiate PAT

31
A Brain-Based Learning Theory
  • Understanding: anticipation of and reasoning about an
    invariant relationship between an activity and its
    effects (AER)
  • Learning: transformation in such anticipation,
    commencing with the available and proceeding to the
    intended
  • Mechanism: reflection (two types) on
    activity-effect relationships (RefAER)
  • Type-I: comparison between goal and actual effect
  • Type-II: comparison across records of
    experiences/situations in which an AER has been used
    consistently
  • Stages:
  • Participatory (provisional, "oops"), Anticipatory
    (transfer enabling, "succeed")
  • For more, see www.edci.purdue.edu/faculty_profiles/tzur/index.html

32
A Teacher's Voice: XMT Approach/Content
  • Pay attention to his emphasis on student
    development of anticipation of run-time using
    complexity analysis (a deep level of
    understanding, even for serial thinking)
  • Play video segments 2 and 3 (5:30 min):
  • http://www.umiacs.umd.edu/users/vishkin/TEACHING/S
    HANE-TORBERT-INTERVIEW7-09/02 Shane Ease of
    Use.m4v
  • http://www.umiacs.umd.edu/users/vishkin/TEACHING/S
    HANE-TORBERT-INTERVIEW7-09/03 Shane Content
    Focus.m4v
  • Shane's suggested first trial with teaching this
    material:
  • Where: your CS AP class (you most likely ask
    "when")
  • When: between the AP exam and the end of the
    school year.

33
PAT Module Plan
  • Intro tasks: create informal algorithmic
    solutions for problems students can relate to;
    parallelize
  • Bill Gates; way out of a maze; train a dog to
    fetch a ball; standing blindfolded in line; the
    toddler problem; building a sand castle; etc.
  • Discussion:
  • What is serial? Parallel? How do they differ?
    Advantages and disadvantages of both (tradeoffs)?
    Steps vs. operations? Breadth-first vs.
    depth-first searches?
  • Establish XMT environment:
  • Installation (Linux, simulator)
  • Programming syntax (Logo? C? XMT-C?): "Hello
    World" and beyond
  • Algorithms for meaningful problems:
  • For each problem, create parallel and serial
    algorithms that solve it; analyze and compare
    them (individual, pairs, small groups, whole
    class)
  • Revisit discussion of how serial and parallel
    differ

34
Problem Sequence
  • Exchange problems
  • Ranking problems
  • Summation and Prefix-Sums (application
    compaction)
  • Matrix multiplication problems
  • Sorting problems (including merge-sort,
    integer-sort, and sample-sort)
  • Selection problems (finding the median)
  • Minimum problems
  • Nearest-one problems
  • See also:
  • www.umiacs.umd.edu/users/vishkin/XMT/index.shtml
  • www.umiacs.umd.edu/users/vishkin/XMT/index.shtml#tutorial
  • www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html
  • www.umiacs.umd.edu/users/vishkin/XMT/teaching-platform.html

35
PRAM-On-Chip Silicon: 64-processor, 75MHz
prototype
FPGA prototype built: n=4, TCUs=64, m=8,
75MHz. The system consists of 3 FPGA chips: 2
Virtex-4 LX200 + 1 Virtex-4 FX100 (Thanks, Xilinx!)
Block diagram of XMT
36
Some Experimental Results (UV)
  • AMD Opteron 2.6 GHz, RedHat Linux Enterprise 3,
    64KB+64KB L1 cache, 1MB L2 cache (none in XMT),
    memory bandwidth 6.4 GB/s (2.67x that of XMT)
  • M_Mult was 2000x2000; QSort was 20M
  • XMT enhancements: broadcast, prefetch buffer,
    non-blocking store, non-blocking caches
  • XMT wall clock time (in seconds):
  • App.    XMT Basic   XMT     Opteron
  • M-Mult  179.14      63.7    113.83
  • QSort   16.71       6.59    2.61
  • Assume (arbitrary yet conservative):
  • ASIC XMT: 800MHz and 6.4GB/s
  • Reduced bandwidth to .6GB/s and projected back by
    the 800/75 clock ratio
  • XMT projected time (in seconds):
  • App.    XMT Basic   XMT     Opteron
  • M-Mult  23.53       12.46   113.83
  • QSort   1.97        1.42    2.61
  • Simulation of 1024 processors: 100x on standard
    benchmark suite for VHDL gate-level simulation
    for 1024 processors [Gu-V06]
  • Silicon area of 64-processor XMT same as 1
    commodity processor (core)

37
Hands-On Example: Merging
  • Input:
  • Two arrays A[1..n], B[1..n]
  • Elements from a totally ordered domain S
  • Each array is monotonically non-decreasing
  • Merging task (output):
  • Map each of these elements into a monotonically
    non-decreasing array C[1..2n]
  • Serial merging algorithm:
  • SERIAL-RANK(A[1..], B[1..])
  • Starting from A(1) and B(1), in each round:
  • Compare an element from A with an element of B
  • Determine the rank of the smaller among them
  • Complexity: O(n) time (hence, also O(n) work...)
  • Hands-on: How will you parallelize this
    algorithm?

38
Partitioning Approach
  • Input size for a problem: n. Design a 2-stage
    parallel algorithm:
  • Partition the input in each array into a large
    number, say p, of independent small jobs
  • Size of the largest small job is roughly n/p
  • Actual work - do the small jobs concurrently,
    using a separate (possibly serial) algorithm for
    each
  • Surplus-log parallel algorithm for
    Merging/Ranking:
  • for 1 ≤ i ≤ n pardo
  • Compute RANK(i,B) using standard binary search
  • Compute RANK(i,A) using binary search
  • Complexity: W = O(n log n), T = O(log n)

39
Middle School Students Experiment with Merge/Rank
40
Linear work parallel merging using a single spawn
  • Stage 1 of algorithm: Partitioning. for 1 ≤ i ≤
    n/p pardo (p < n/log n and p | n)
  • b(i) = RANK((p(i-1)+1), B) using binary search
  • a(i) = RANK((p(i-1)+1), A) using binary search
  • Stage 2 of algorithm: Actual work
  • Observe: the overall ranking task is broken into 2p
    independent slices
  • Example of a slice:
  • Start at A(p(i-1)+1) and B(b(i))
  • Using serial ranking, advance till the
    termination condition:
  • Either some A(pi+1) or some B(jp+1) loses
  • Parallel program: 2p concurrent threads
    using a single spawn-join for the whole
    algorithm
  • Example: Thread of 20: binary search B.
    Rank as 11 (index of 15 in B) + 9 (index of
    20 in A). Then compare 21 to 22 and rank
    21; compare 23 to 22 to rank 22; compare 23
    to 24 to rank 23; compare 24 to 25, but terminate
    since the thread of 24 will rank 24.

41
Linear work parallel merging (cont'd)
  • Observation: 2p slices; none has more than 2n/p
    elements
  • (not too bad, since the average is 2n/2p = n/p elements)
  • Complexity: partitioning takes W = O(p log n) work and
    T = O(log n) time, or O(n) work and O(log n) time,
    for p < n/log n
  • Actual work employs 2p serial algorithms, each
    taking O(n/p) time
  • Total: W = O(n), and T = O(n/p), for p < n/log n
  • IMPORTANT: Correctness and complexity of parallel
    programs:
  • Same as for the algorithm
  • This is a big deal. Other parallel programming
    approaches do not have a simple concurrency
    model, and need to reason w.r.t. the program

42
A Teacher's Voice: Start PAT with XMT
  • Observe Shane's video segment 4:
  • http://www.umiacs.umd.edu/users/vishkin/TEACHING/S
    HANE-TORBERT-INTERVIEW7-09/04 Shane Word to
    Teachers.m4v

43
How to (start)? (GC)
  • Contact us!!!
  • Observe online teaching sessions (more to be
    added soon)
  • Contact us
  • Download and install the simulator
  • Read the manual
  • Google "XMT" or visit www.umiacs.umd.edu/users/vishkin/XMT
    /index.shtml
  • Solve a few problems on your own
  • Try programming a parallel algorithm in XMTC for
    prefix-sums
  • Contact us
  • Follow the teaching plan (slides 29-30)
  • Did we already say CONTACT US?!?! (entire team
    waiting for your call)

44
???
45
Additional Intro Problems
  • See next slides

46
(No Transcript)
47
How can we direct the computer to search this
maze and help the cat get to the milk? A parallel
algorithm? (depth-first search?)
We might imagine locations in the maze that force
a decision and call these junctions (labeled A
through H in the figure).
We might say the computer is forced to make a
decision at junction A,
and progresses in a left-handed fashion to B,
until it reaches a blockage C,
and must return to B,
and proceed to the next junction D, but that
returns to itself.
Back to A; over to E; back to A; over to F;
back to A; over to G; back to A; over to H.
48
Back-up slide: FPGA 64-processor, 75MHz prototype
Specs and aspirations:
  • Multi-GHz clock rate
  • FPGA prototype built: n=4, TCUs=64, m=8, 75MHz
  • The system consists of 3 FPGA chips:
  • 2 Virtex-4 LX200 + 1 Virtex-4 FX100 (Thanks,
    Xilinx!)

Block diagram of XMT
  • Cache coherence defined away: local cache only
    at the master thread control unit (MTCU)
  • Prefix-sum functional unit (F&A-like) with
    global register file (GRF)
  • Reduced global synchrony
  • Overall design idea: no-busy-wait FSMs