Title: Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmers' Productivity
1. Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmers' Productivity

- Common wisdom (cf. tribal lore collected by DARPA HPCS, 2005): "Programming for parallelism is easy. It is the programming for performance that makes it hard."
2. Reinvention of Computing for Many-Core Parallelism Requires Addressing Productivity

- A less fatalistic position: "Programming for parallelism is easy. But the difficulty of programming for performance depends on the system."
3. Productivity in Parallel Computing

- The large parallel machines story:
- Funding of productivity: $650M HProductivityCS, 2002.
- Met Gflops goals: up by 1000X since the mid-90s. Exascale talk plans. Met power goals. Also groomed eloquent spokespeople.
- Progress on productivity: no agreed benchmarks, no spokesperson. Elusive! In fact, not much has changed since "as intimidating and time consuming as programming in assembly language" (NSF Blue Ribbon Committee, 2003), or even the "parallel software crisis" (CACM, 1991).
- Common sense engineering: an untreated bottleneck → diminished returns on improvements → the bottleneck becomes more critical.
- Next 10 years: new specific programs on flops and power. What about productivity?!
- Reality: an economic island. Cleared by marketing: DOE applications.
- Enter mainstream many-cores:
- Every CS major should be able to program many-cores.
4. Coherence Issue. "When you come to a fork in the road, take it!" (Yogi Berra)

- Camp 1: Many of the US's best minds opt for occupations that do not involve programming.
- NSF tries to lure them to CS in high school by (1) presenting the steady march and broad reach of computing across the sciences, industries, culture and society, correcting the current narrow focus on programming in the introductory course ("New Programs Aim to Lure Young Into Digital Jobs", NYTimes, 12/09), (2) productivity, (3) computational thinking.
- Camp 2: Power/performance → reinvent mainstream computing for parallelism.
- Vendors try to build many-cores that require decomposition-first programming. Railroading to a productivity disaster area. Hacking. Insufficient support from parallel algorithms design and analysis. Short on outreach/productivity/abstraction.
- Unintended outcomes of taking the fork (productivity vs. power/performance):
- Camp cheerleaders: core CS (algorithm design and analysis style) is "radical". Peer review favors both sides over the center. "Centrists as extremists" is an oxymoron!
- Building wrong expectations among prospective CS majors. Disappointment will lead to "Get me out of this major".
- The pool of CS majors to be engaged in decomposition-first is too limited (after subtracting the lured-to-breadth-over-programming and the core).
- Consequences of taking the fork: surrealism. Eventual casualties: students, credibility, productivity.
- Research on, and comparison of, several holistic parallel platforms could (i) prevent much of the damage, (ii) build up the real diversity needed for natural selection, and (iii) advise the NSF on programs that otherwise could cancel one another.
5. Lessons from the Invention of Computing

- "It should be noted that in comparing codes four viewpoints must be kept in mind, all of them of comparable importance:
- Simplicity and reliability of the engineering solutions required by the code;
- Simplicity, compactness and completeness of the code;
- Ease and speed of the human procedure of translating mathematically conceived methods into the code [COMPUTATIONAL THINKING], and also of finding and correcting errors in coding or of applying to it changes that have been decided upon at a later stage;
- Efficiency of the code in operating the machine near its full intrinsic speed."
- H. Goldstine, J. von Neumann. Planning and coding of problems for an electronic computing instrument, 1947.
- Take home:
- Comparing codes is a pivotal and broad issue.
- Concern for productivity is as old as computing (development-time).
- The human process: intellectual (algorithm/planning) plus skill (coding).
- Contrast with the tendency to understand a HW upgrade from application code (even if the machine is not yet built; A. Ghuloum, Intel, CACM 9/09): an unreasonable expectation from application code developers.
6. How Was the Human Procedure Addressed? Answer: Basically, by Abstraction and Induction

- 1. General-purpose computing is about a platform for your future (whatever) program, as opposed to a specific application; a general method for the human procedure was key.
- 2. GvN47 based coding on mathematical induction (known from math proofs and as an axiom of the natural numbers).
- 3. It worked for establishing serial computing. This method led to simplicity, compactness and completeness of the resulting code. References:
- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms. Chapter 1: Basic Concepts. 1.1 Algorithms. 1.2 Mathematical Preliminaries. 1.2.1 Mathematical Induction.
- Algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.
- Gold standards:
- Definiteness: induction.
- Effectiveness: "uniform cost criterion" (AHU74) abstraction.
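As a minimal illustration of the point (a hypothetical example, not from the slides), the inductive gold standard is visible in how a loop invariant justifies even the simplest serial code:

```python
def power(x, n):
    # Inductive specification: power(x, 0) = 1; power(x, n) = x * power(x, n - 1).
    # Loop invariant: after i iterations, result == x**i. It holds for i = 0
    # (result = 1) and is carried from i to i + 1 by each multiplication --
    # mathematical induction as the basis of the code's correctness,
    # in the spirit of GvN47.
    result = 1
    for _ in range(n):
        result *= x
    return result
```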
7. Killer App for General-Purpose Many-Cores. Let the App-Dreamers Do Their Magic

- Oxymoron? General-purpose: no one application in particular.
- Not really. If possible, a killer application would be helpful.
- However, it is wrong as a condition for progress.
- General-purpose computing is an infrastructure for the IT sector and the economy.
- The general-purpose computing infrastructure has been realized by the software spiral (the cyclic process of hardware improvements leading to software improvements that lead back to hardware improvements, and so on; Andy Grove, Intel).
- Instituting a parallel software spiral is a killer application for many-cores: as in the past, app-dreamers will invent uses.
- Not surprisingly, the killer application is also an infrastructure.
- Government has a role in building infrastructure.
- → Instituting a parallel software spiral merits government funding.
- However, there is insufficient empowerment for creating and developing alternative platforms to the point of establishing their merit.
8. Serial Abstraction and a Parallel Counterpart Example

- The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.
- It abstracts away different execution times for different operations (e.g., the memory hierarchy). Used by programmers to conceptualize serial computing, and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
- A rudimentary abstraction for making parallel computing simple: indefinitely many instructions, which are available for concurrent execution, execute immediately. Dubbed Immediate Concurrent Execution (ICE).
- Step-by-step (inductive) explication of the instructions available next for concurrent execution. Processors are not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.
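The ICE abstraction can be sketched in plain Python (an illustrative model, assumed semantics, not XMT code): each step executes all currently available instructions at once, with all reads of a step logically preceding its writes. Here, prefix sums by recursive doubling, one synchronous step per doubling:

```python
def ice_prefix_sums(a):
    # One "step" of ICE: indefinitely many available instructions execute
    # immediately and synchronously. The list comprehension reads the old x
    # and builds the new one, modeling reads-before-writes within a step.
    x = list(a)
    k = 1
    while k < len(x):
        x = [x[i] + (x[i - k] if i >= k else 0) for i in range(len(x))]
        k *= 2  # each synchronous step doubles the span already summed
    return x
```

With n values, only about log2(n) steps are needed, since every element's addition in a step executes "immediately" and concurrently.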
9. CACM10: Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism

- Overall: the old Work-Depth description. Only a minimalist abstraction: ICE builds only on induction, itself a rudimentary concept.
- SV82 conjectured that the rest (the full PRAM algorithm) is just a matter of skill.
- Lots of evidence that work-depth works. Used as the framework in PRAM algorithms texts (JaJa-92, KKT-01).
- ICE is in line with PRAM, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.
- It is widely agreed that work and depth are necessary. The jury is out on what else. Our position: as little as possible.
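To make the work-depth bookkeeping concrete (a hypothetical illustration, not taken from the cited texts), consider summing n values by a balanced tree: work counts all additions performed, depth counts synchronous rounds:

```python
def sum_work_depth(n):
    # Balanced-tree summation of n values: each round pairs up the remaining
    # values, giving about log2(n) rounds (depth) and n - 1 additions (work).
    work, depth = 0, 0
    m = n
    while m > 1:
        pairs = m // 2   # additions performed concurrently in this round
        work += pairs
        m -= pairs       # results, plus one carried element if m was odd
        depth += 1
    return work, depth
```

The work bound matches the serial algorithm (n - 1 additions), while the depth drops from n - 1 to roughly log2(n); that gap is what work-depth analysis exposes.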
10. Workflow from Parallel Algorithms to Programming versus Trial-and-Error

[Workflow diagram. Option 2: parallel algorithmic thinking, PAT (ICE/WD/PRAM) → prove correctness → program → tune (still correct after each step) → compiler → hardware. The trial-and-error path: domain decomposition or task decomposition → program → insufficient inter-thread bandwidth? → rethink the algorithm to take better advantage of cache → hardware.]

- Is Option 1 good enough for the parallel programmer's model? Options 1B and 2 start with a PRAM algorithm, but not Option 1A. Options 1A and 2 represent workflow, but not Option 1B.
- Not possible in the 1990s. Possible now: XMT@UMD. Why settle for less?
11. Mark Twain on the PRAM

- "We should be careful to get out of an experience only the wisdom that is in it and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again, and that is well; but also she will never sit down on a cold one anymore." (Mark Twain)
- PRAM algorithms did not become standard CS knowledge in 1988-90 because of the "hot stove-lid": no 1990s implementable computer architecture allowed programmers to look at a computer as a PRAM.
- The XMT project @UMD changed that.
- PS: NVidia was happy to report success with 2 PRAM algorithms in IPDPS09. Great to see that from a major vendor.
- These 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.
12. The Parallel Programmer's Productivity Landscape. Postulation: A Continental Divide

[Diagram: decomposition-first programming drains to the ocean; work-depth programming drains to the Great Lakes.]

- How different can the productivity of many-core architectures be? Answer: very!
- Metaphor: raindrops falling a short distance apart can have very different outcomes. Think of programmer's productivity as the cost of producing usable water.
- The decomposition-first programming side requires domain decomposition or task decomposition, which have not worked in spite of big investment. (It looks greener, since that is where investment went; but what if its rain goes to the ocean while the arid side's goes to sweet water?)
- The work-depth initial abstraction is decomposition-free. (Arid, under-invested.)
- Requires a leap of faith for investment.
13. Validation of Ease of Programming to Date

- 1. Comparison with MPI by DARPA-HPCS SW engineering leaders (Hochstein, Basili, V, Gilbert).
- 2. Teachability demonstrated so far (Torbert, V, Tzur, Ellison, SIGCSE10, to appear):
- To a freshman class with 11 non-CS students. Some programming assignments: median finding, merge-sort, integer-sort, sample-sort.
- Other teachers:
- A magnet HS teacher downloaded the simulator, assignments and class notes from the XMT page; self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze; gives the ability to anticipate performance (as in serial). Can handle more than just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS09@CMU and the interview with the teacher.
- High school and middle school students (some 10-year-olds) from underrepresented groups, taught by an HS math teacher.
- Teachability is a necessary (but not sufficient) condition for ease of programming, itself a necessary (but not sufficient) condition for productivity.
- Hence, teachability is as good a benchmark as any out there for productivity.
14. Conclusion

- We want future mainstream programmers to embrace general-purpose parallelism (every CS major, for common SW architectures). Yet, in the past:
- Insufficient evidence on productivity. Yet a history of repeated surprise: parallel machines repel programmers.
- Research drivers:
- Empower select holistic (HW+SW) parallel platforms for merit-based comparison. Imagine a new world with the given platform. Consider all aspects, e.g.: is it sufficient for reinstating the SW spiral? Is the barrier to entry for creative applications low enough? How will the CS curriculum look? Who will be attracted to study CS?
- Then, gather evidence:
- Methodically compare the productivity (development-time, run-time) of platforms.
- Ownership-stake role for an Indian partner (Prof. PJ Narayan, IIIT, Hyderabad). India is the largest producer of SW. A new platform requires sufficient Indian interest. Lead benchmarking/comparison for productivity, etc.
- For this session: coming from algorithms, computer vision and computational biology, compare select platforms for performance, for productivity (development-time and run-time), and overall for reinstating the SW spiral. Benchmark algorithms and applications based on their inherent parallelism for future machine platforms, as opposed to using existing code written for yesterday's (serial or parallel) machines. Issue: how to benchmark for productivity?
15. Not Just a Theory: XMT Prototyped in HW and SW

- 64-core, 75 MHz FPGA prototype (SPAA07, Computing Frontiers08). Original explicit multi-threaded (XMT) architecture: SPAA98.
- Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype (HotInterconnects07).
- Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype. The design scales to 1000 cores on-chip.
- There has never been a successful general-purpose parallel computer (easy to program, good speedups, up- and down-scalable). IF you could program it → great speedups.
- Motivation: fix the IF.
16. Programmer's Model and Engineering Workflow

- Arbitrary CRCW Work-Depth algorithm. Reason about correctness and complexity in the synchronous model.
- SPMD reduced synchrony:
- Threads advance at their own speed, not in lockstep.
- Main construct: the spawn-join block. Note: can start any number of processes at once. Can express locality (decomposition-second).
- Prefix-sum (ps). Independence of order semantics (IOS).
- Establish correctness and complexity by relating to the WD analyses.
- Circumvents "The problem with threads" (e.g., Lee).

[Diagram: alternating spawn and join blocks.]

- Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) WD. VCL08.
- Trial-and-error contrast: similar start, then "while insufficient inter-thread bandwidth do rethink the algorithm to take better advantage of cache".
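The spawn-join plus prefix-sum idiom can be sketched in plain Python (assumed semantics for illustration; this is not XMTC). Each virtual thread tests one element; threads that pass grab a unique output slot through an atomic counter, the role ps plays on XMT. IOS means any order of thread execution yields a valid (here, unordered) compaction:

```python
import itertools

def compact(a, keep):
    # Conceptually: spawn(0, len(a) - 1); virtual thread i runs the loop body.
    counter = itertools.count()    # stands in for the ps base register
    out = {}
    for i in range(len(a)):        # serial stand-in for the parallel spawn
        if keep(a[i]):
            slot = next(counter)   # ps(1): fetch-and-increment, unique slot
            out[slot] = a[i]       # no two threads ever write the same slot
    return [out[j] for j in range(len(out))]
```

Because ps hands every passing thread a distinct slot, correctness does not depend on thread order, which is exactly what makes reasoning reducible to the WD analysis.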
17. Performance

- Simulation of 1024 processors: 100X on a standard benchmark suite for VHDL gate-level simulation (GV06).
- SPAA09: 10X relative to an Intel Core 2 Duo, with a 64-processor XMT of the same silicon area as 1 commodity processor (core).
- Promise of 100X with 1024 processors, also for irregular, fine-grained parallelism, with up- and down-scalability.
18. Some Credits

- Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz Saybasili, Alex Tzannes. Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen.
- Industry design experts (pro bono).
- Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
- Gang Qu, VLSI and power. Co-advisor.
- Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
- Ron Tzur, U. Colorado, K12 education. Co-advisor. 2008 NSF seed funding.
- K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School; 2009 Summer Camp, Montgomery County Public Schools.
- Marc Olano, UMBC, computer graphics. Co-advisor.
- Tali Moreshet, Swarthmore College, power. Co-advisor.
- Bernie Brooks, NIH. Co-advisor.
- Marty Peckerar, microelectronics.
- Igor Smolyaninov, electro-optics.
- Funding: NSF, NSA (2008 deployed XMT computer), NIH.
- 6 issued patents. More patent applications.
- Informal industry partner: Intel.
- Reinvention of Computing for Parallelism: selected as a Maryland Research Center of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC, UMBI, UMSOM. Mostly applications.