Title: Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmers' Productivity
1. Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmers' Productivity

- Common wisdom (cf. tribal lore collected by DARPA HPCS, 2005): "Programming for parallelism is easy. It is the programming for performance that makes it hard."
2. Reinvention of Computing for Many-Core Parallelism Requires Addressing Productivity

- A less fatalistic position: "Programming for parallelism is easy. But the difficulty of programming for performance depends on the system."
3. Productivity in Parallel Computing

- The large parallel machines story:
- Funding of productivity: $650M HProductivityCS, 2002.
- Met Gflops goals: up by 1000X since the mid-90s. Exascale talk plans. Met power goals. Also groomed eloquent spokespeople.
- Progress on productivity: no agreed benchmarks, no spokesperson. Elusive! In fact, not much has changed since "as intimidating and time consuming as programming in assembly language" (NSF Blue Ribbon Committee, 2003), or even the "parallel software crisis" (CACM, 1991).
- Common sense engineering: an untreated bottleneck → diminished returns on improvements → the bottleneck becomes more critical.
- Next 10 years: new specific programs on flops and power. What about productivity?!
- Reality: an economic island. Cleared by marketing: DOE applications.
- Enter mainstream many-cores:
- Every CS major should be able to program many-cores.
4. Coherence Issue. "When you come to a fork in the road, take it!" (Yogi Berra)

- Camp 1: Many of the US's best minds opt for occupations that do not involve programming.
- NSF tries to lure them to CS in high school by (1) presenting the steady march and broad reach of computing across the sciences, industries, culture and society, correcting the current narrow focus on programming in the introductory course ("New Programs Aim to Lure Young Into Digital Jobs", NYTimes, 12/09), (2) productivity, (3) computational thinking.
- Camp 2: Power/performance → reinvent mainstream computing for parallelism.
- Vendors try to build many-cores that require decomposition-first programming. Railroading to a productivity disaster area. Hacking. Insufficient support from parallel algorithms design and analysis. Short on outreach/productivity/abstraction.
- Unintended outcomes of taking the fork (productivity vs. power/performance):
- Camp cheerleaders: core CS (algorithm design and analysis style) is "radical". Peer review favors both sides over the center. "Centrists as extremists" is an oxymoron!
- Building wrong expectations among prospective CS majors. Disappointment will lead to "Get me out of this major".
- The pool of CS majors to be engaged in decomposition-first is too limited (after subtracting the lured-to-breadth-over-programming and the core).
- Consequences of taking the fork: surrealism. Eventual casualties: students, credibility, productivity.
- Research on, and comparison of, several holistic parallel platforms could (i) prevent much of the damage, (ii) build up the real diversity needed for natural selection, and (iii) advise the NSF on programs that otherwise could cancel one another.
5. Lessons from the Invention of Computing

- "It should be noted that in comparing codes four viewpoints must be kept in mind, all of them of comparable importance:
- Simplicity and reliability of the engineering solutions required by the code;
- Simplicity, compactness and completeness of the code;
- Ease and speed of the human procedure of translating mathematically conceived methods into the code [COMPUTATIONAL THINKING], and also of finding and correcting errors in coding or of applying to it changes that have been decided upon at a later stage;
- Efficiency of the code in operating the machine near its full intrinsic speed."
- H. Goldstine, J. von Neumann. Planning and coding of problems for an electronic computing instrument, 1947.
- Take home:
- Comparing codes is a pivotal and broad issue.
- Concern for productivity is as old as computing (development-time).
- The human process: intellectual (algorithm/planning) plus skill (coding).
- Contrast with the tendency to understand a HW upgrade from application code (even if the machine is not yet built; A. Ghuloum, Intel, CACM 9/09): an unreasonable expectation from application code developers.
6. How Was the Human Procedure Addressed? Answer: Basically, by Abstraction and Induction

- 1. General-purpose computing is about a platform for your future (whatever) program, as opposed to a specific application; a general method for the human procedure was key.
- 2. GvN47 based coding on mathematical induction (known from math proofs and as an axiom of the natural numbers).
- 3. It worked for establishing serial computing. This method led to simplicity, compactness and completeness of the resulting code. References:
- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms. Chapter 1: Basic Concepts. 1.1 Algorithms. 1.2 Mathematical Preliminaries. 1.2.1 Mathematical Induction.
- Algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.
- Gold standards:
- Definiteness: induction.
- Effectiveness: "uniform cost criterion" (AHU74) abstraction.
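As a minimal illustration of the point (a hypothetical example, not from the slides), the inductive gold standard is visible in how a loop invariant justifies even the simplest serial code:

```python
def power(x, n):
    # Inductive specification: power(x, 0) = 1; power(x, n) = x * power(x, n - 1).
    # Loop invariant: after i iterations, result == x**i. It holds for i = 0
    # (result = 1) and is carried from i to i + 1 by each multiplication --
    # mathematical induction as the basis of the code's correctness,
    # in the spirit of GvN47.
    result = 1
    for _ in range(n):
        result *= x
    return result
```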
7. Killer App for General-Purpose Many-Cores. Let the App-Dreamers Do Their Magic

- Oxymoron? General-purpose: no one application in particular.
- Not really. If possible, a killer application would be helpful.
- However, it is wrong as a condition for progress.
- General-purpose computing is an infrastructure for the IT sector and the economy.
- The general-purpose computing infrastructure has been realized by the software spiral (the cyclic process of hardware improvements leading to software improvements that lead back to hardware improvements, and so on; Andy Grove, Intel).
- Instituting a parallel software spiral is a killer application for many-cores: as in the past, app-dreamers will invent uses.
- Not surprisingly, the killer application is also an infrastructure.
- Government has a role in building infrastructure.
- → Instituting a parallel software spiral merits government funding.
- However, there is insufficient empowerment for creating and developing alternative platforms to the point of establishing their merit.
8. Serial Abstraction and a Parallel Counterpart Example

- The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.
- It abstracts away different execution times for different operations (e.g., the memory hierarchy). Used by programmers to conceptualize serial computing, and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
- A rudimentary abstraction for making parallel computing simple: indefinitely many instructions, which are available for concurrent execution, execute immediately. Dubbed Immediate Concurrent Execution (ICE).
- Step-by-step (inductive) explication of the instructions available next for concurrent execution. Processors are not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.
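The ICE abstraction can be sketched in plain Python (an illustrative model, assumed semantics, not XMT code): each step executes all currently available instructions at once, with all reads of a step logically preceding its writes. Here, prefix sums by recursive doubling, one synchronous step per doubling:

```python
def ice_prefix_sums(a):
    # One "step" of ICE: indefinitely many available instructions execute
    # immediately and synchronously. The list comprehension reads the old x
    # and builds the new one, modeling reads-before-writes within a step.
    x = list(a)
    k = 1
    while k < len(x):
        x = [x[i] + (x[i - k] if i >= k else 0) for i in range(len(x))]
        k *= 2  # each synchronous step doubles the span already summed
    return x
```

With n values, only about log2(n) steps are needed, since every element's addition in a step executes "immediately" and concurrently.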
9. CACM10: Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism

- Overall: the old Work-Depth description. Only a minimalist abstraction: ICE builds only on induction, itself a rudimentary concept.
- SV82 conjectured that the rest (the full PRAM algorithm) is just a matter of skill.
- Lots of evidence that work-depth works. Used as the framework in PRAM algorithms texts (JaJa-92, KKT-01).
- ICE is in line with PRAM, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.
- It is widely agreed that work and depth are necessary. The jury is out on what else. Our position: as little as possible.
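To make the work-depth bookkeeping concrete (a hypothetical illustration, not taken from the cited texts), consider summing n values by a balanced tree: work counts all additions performed, depth counts synchronous rounds:

```python
def sum_work_depth(n):
    # Balanced-tree summation of n values: each round pairs up the remaining
    # values, giving about log2(n) rounds (depth) and n - 1 additions (work).
    work, depth = 0, 0
    m = n
    while m > 1:
        pairs = m // 2   # additions performed concurrently in this round
        work += pairs
        m -= pairs       # results, plus one carried element if m was odd
        depth += 1
    return work, depth
```

The work bound matches the serial algorithm (n - 1 additions), while the depth drops from n - 1 to roughly log2(n); that gap is what work-depth analysis exposes.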
10. Workflow from Parallel Algorithms to Programming versus Trial-and-Error

[Workflow diagram. Option 2: parallel algorithmic thinking, PAT (ICE/WD/PRAM) → prove correctness → program → tune (still correct after each step) → compiler → hardware. The trial-and-error path: domain decomposition or task decomposition → program → insufficient inter-thread bandwidth? → rethink the algorithm to take better advantage of cache → hardware.]

- Is Option 1 good enough for the parallel programmer's model? Options 1B and 2 start with a PRAM algorithm, but not Option 1A. Options 1A and 2 represent workflow, but not Option 1B.
- Not possible in the 1990s. Possible now: XMT@UMD. Why settle for less?
11. Mark Twain on the PRAM

- "We should be careful to get out of an experience only the wisdom that is in it and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again, and that is well; but also she will never sit down on a cold one anymore." (Mark Twain)
- PRAM algorithms did not become standard CS knowledge in 1988-90 because of the "hot stove-lid": no 1990s implementable computer architecture allowed programmers to look at a computer as a PRAM.
- The XMT project @UMD changed that.
- PS: NVidia was happy to report success with 2 PRAM algorithms in IPDPS09. Great to see that from a major vendor.
- These 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.
12. The Parallel Programmer's Productivity Landscape. Postulation: A Continental Divide

[Diagram: decomposition-first programming drains to the ocean; work-depth programming drains to the Great Lakes.]

- How different can the productivity of many-core architectures be? Answer: very!
- Metaphor: raindrops falling a short distance apart can have very different outcomes. Think of programmer's productivity as the cost of producing usable water.
- The decomposition-first programming side requires domain decomposition or task decomposition, which have not worked in spite of big investment. (It looks greener, since that is where investment went; but what if its rain goes to the ocean while the arid side's goes to sweet water?)
- The work-depth initial abstraction is decomposition-free. (Arid, under-invested.)
- Requires a leap of faith for investment.
13. Validation of Ease of Programming to Date

- 1. Comparison with MPI by DARPA-HPCS SW engineering leaders (Hochstein, Basili, V, Gilbert).
- 2. Teachability demonstrated so far (Torbert, V, Tzur, Ellison, SIGCSE10, to appear):
- To a freshman class with 11 non-CS students. Some programming assignments: median finding, merge-sort, integer-sort, sample-sort.
- Other teachers:
- A magnet HS teacher downloaded the simulator, assignments and class notes from the XMT page; self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze; gives the ability to anticipate performance (as in serial). Can handle more than just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS09@CMU and the interview with the teacher.
- High school and middle school students (some 10-year-olds) from underrepresented groups, taught by an HS math teacher.
- Teachability is a necessary (but not sufficient) condition for ease of programming, itself a necessary (but not sufficient) condition for productivity.
- Hence, teachability is as good a benchmark as any out there for productivity.
14. Conclusion

- We want future mainstream programmers to embrace general-purpose parallelism (every CS major, for common SW architectures). Yet, in the past:
- Insufficient evidence on productivity. Yet a history of repeated surprise: parallel machines repel programmers.
- Research drivers:
- Empower select holistic (HW+SW) parallel platforms for merit-based comparison. Imagine a new world with the given platform. Consider all aspects, e.g.: is it sufficient for reinstating the SW spiral? Is the barrier to entry for creative applications low enough? How will the CS curriculum look? Who will be attracted to study CS?
- Then, gather evidence:
- Methodically compare the productivity (development-time, run-time) of platforms.
- Ownership-stake role for an Indian partner (Prof. PJ Narayan, IIIT, Hyderabad). India is the largest producer of SW. A new platform requires sufficient Indian interest. Lead benchmarking/comparison for productivity, etc.
- For this session: coming from algorithms, computer vision and computational biology, compare select platforms for performance, for productivity (development-time and run-time), and overall for reinstating the SW spiral. Benchmark algorithms and applications based on their inherent parallelism for future machine platforms, as opposed to using existing code written for yesterday's (serial or parallel) machines. Issue: how to benchmark for productivity?
15. Not Just a Theory: XMT Prototyped in HW and SW

- 64-core, 75 MHz FPGA prototype (SPAA07, Computing Frontiers08). Original explicit multi-threaded (XMT) architecture: SPAA98.
- Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype (HotInterconnects07).
- Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype. The design scales to 1000 cores on-chip.
- There has never been a successful general-purpose parallel computer (easy to program, good speedups, up- and down-scalable). IF you could program it → great speedups.
- Motivation: fix the IF.
16. Programmer's Model and Engineering Workflow

- Arbitrary CRCW Work-Depth algorithm. Reason about correctness and complexity in the synchronous model.
- SPMD reduced synchrony:
- Threads advance at their own speed, not in lockstep.
- Main construct: the spawn-join block. Note: can start any number of processes at once. Can express locality (decomposition-second).
- Prefix-sum (ps). Independence of order semantics (IOS).
- Establish correctness and complexity by relating to the WD analyses.
- Circumvents "The problem with threads" (e.g., Lee).

[Diagram: alternating spawn and join blocks.]

- Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) WD. VCL08.
- Trial-and-error contrast: similar start, then "while insufficient inter-thread bandwidth do rethink the algorithm to take better advantage of cache".
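The spawn-join plus prefix-sum idiom can be sketched in plain Python (assumed semantics for illustration; this is not XMTC). Each virtual thread tests one element; threads that pass grab a unique output slot through an atomic counter, the role ps plays on XMT. IOS means any order of thread execution yields a valid (here, unordered) compaction:

```python
import itertools

def compact(a, keep):
    # Conceptually: spawn(0, len(a) - 1); virtual thread i runs the loop body.
    counter = itertools.count()    # stands in for the ps base register
    out = {}
    for i in range(len(a)):        # serial stand-in for the parallel spawn
        if keep(a[i]):
            slot = next(counter)   # ps(1): fetch-and-increment, unique slot
            out[slot] = a[i]       # no two threads ever write the same slot
    return [out[j] for j in range(len(out))]
```

Because ps hands every passing thread a distinct slot, correctness does not depend on thread order, which is exactly what makes reasoning reducible to the WD analysis.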
17. Performance

- Simulation of 1024 processors: 100X on a standard benchmark suite for VHDL gate-level simulation (GV06).
- SPAA09: 10X relative to an Intel Core 2 Duo, with a 64-processor XMT of the same silicon area as 1 commodity processor (core).
- Promise of 100X with 1024 processors, also for irregular, fine-grained parallelism, with up- and down-scalability.
18. Some Credits

- Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz Saybasili, Alex Tzannes. Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen.
- Industry design experts (pro bono).
- Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
- Gang Qu, VLSI and power. Co-advisor.
- Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
- Ron Tzur, U. Colorado, K12 education. Co-advisor. 2008 NSF seed funding.
- K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School; 2009 Summer Camp, Montgomery County Public Schools.
- Marc Olano, UMBC, computer graphics. Co-advisor.
- Tali Moreshet, Swarthmore College, power. Co-advisor.
- Bernie Brooks, NIH. Co-advisor.
- Marty Peckerar, microelectronics.
- Igor Smolyaninov, electro-optics.
- Funding: NSF, NSA (2008 deployed XMT computer), NIH.
- 6 issued patents. More patent applications.
- Informal industry partner: Intel.
- Reinvention of Computing for Parallelism: selected as a Maryland Research Center of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC, UMBI, UMSOM. Mostly applications.