Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmer's Productivity

1
Reinvention of Computing for Many-Core
Parallelism Requires Addressing Programmer's
Productivity
  • Uzi Vishkin

Common wisdom [cf. tribal lore collected by DARPA
HPCS, 2005]: "Programming for parallelism is
easy. It is the programming for performance that
makes it hard."
2
Reinvention of Computing for Many-Core
Parallelism Requires Addressing Productivity
  • Uzi Vishkin

A less fatalistic position: Programming for
parallelism is easy. But the difficulty of
programming for performance depends on the system.
3
Productivity in Parallel Computing
  • The large parallel machines story
  • Funding of productivity: $650M, HPCS (High
    Productivity Computing Systems), 2002
  • Met Gflops goals: up by 1000X since the mid-90s.
    Exascale: talk, plans
  • Met power goals. Also groomed eloquent
    spokespeople
  • Progress on productivity: No agreed benchmarks.
    No spokesperson. Elusive! In fact, not much has
    changed since "as intimidating and time
    consuming as programming in assembly
    language" (NSF Blue Ribbon Committee, 2003) or
    even the "parallel software crisis" (CACM 1991).
  • Common-sense engineering: Untreated bottleneck →
    diminished returns on improvements → bottleneck
    becomes more critical
  • Next 10 years: New specific programs on flops and
    power. What about productivity?!
  • Reality: economic island. Cleared by marketing
    DOE applications
  • Enter: mainstream many-cores
  • Every CS major should be able to program
    many-cores

4
Coherence Issue: "When you come to a fork in the
road, take it!" (Yogi Berra)
  • Camp 1: Many of the best US minds opt for
    occupations that do not involve programming
  • NSF tries to lure them to CS in HS by (1)
    "presenting the steady march and broad reach of
    computing across the sciences, industries,
    culture and society, correcting the current
    narrow focus on programming in introductory
    course" [New Programs Aim to Lure Young Into
    Digital Jobs, NYTimes, 12/09], (2) productivity,
    (3) computational thinking
  • Camp 2: Power/performance → Reinvent mainstream
    computing for parallelism
  • Vendors try to build many-cores that require
    decomposition-first programming. Railroading to
    productivity disaster area. Hacking.
    Insufficient support from parallel algorithms
    design & analysis. Short on
    outreach/productivity/abstraction
  • Unintended outcome of taking the fork (prod vs.
    power/perf)
  • Camp cheerleaders: core CS (alg design & analysis
    style) is radical. Peer review favors both sides
    over center. "Centrists as extremists" is an
    oxymoron!
  • Building wrong expectations among prospective CS
    majors. Disappointment will lead to "Get me out
    of this major"
  • Pool of CS majors to be engaged in
    decomposition-first too limited (after
    subtracting the lured-to-breadth-over-programming
    and the core)
  • Consequences of taking the fork: surrealism
  • Eventual casualties: students, credibility,
    productivity
  • Research/comparison of several holistic parallel
    platforms could (i) prevent much of the damage,
    (ii) build up the real diversity needed for
    natural selection, and (iii) advise the NSF on
    programs that otherwise could cancel one another

5
Lessons from Invention of Computing
  • "It should be noted that in comparing codes four
    viewpoints must be kept in mind, all of them of
    comparable importance:
  • Simplicity and reliability of the engineering
    solutions required by the code
  • Simplicity, compactness and completeness of the
    code
  • Ease and speed of the human procedure of
    translating mathematically conceived methods into
    the code [COMPUTATIONAL THINKING], and also of
    finding and correcting errors in coding or of
    applying to it changes that have been decided
    upon at a later stage
  • Efficiency of the code in operating the machine
    near its full intrinsic speed."
  • - H. Goldstine, J. von Neumann. Planning and
    coding of problems for an electronic computing
    instrument, 1947
  • Take home:
  • - Comparing codes is a pivotal and broad issue
  • - Concern for productivity is as old as
    computing (development-time)
  • Human process: intellectual/algorithm/planning
    plus skill/coding
  • Contrast with: Tendency to understand HW upgrade
    from application code (even if machine not yet
    built, A. Ghuloum, Intel, CACM 9/09):
    unreasonable expectation from application code
    developers

6
How was the human procedure addressed? Answer:
Basically, by Abstraction and Induction
  • 1. General-purpose computing is about a platform
    for your future (whatever) program, as opposed
    to a specific application; a general method for
    the human procedure was key
  • 2. [GvN47] based coding on mathematical induction
    (known for math proofs and as an axiom of the
    natural numbers)
  • 3. It worked for establishing serial computing.
    This method led to simplicity, compactness and
    completeness of the resulting code. References:
  • - [Knuth67] The Art of Computer Programming, Vol.
    1: Fundamental Algorithms. Chapter 1: Basic
    concepts. 1.1 Algorithms. 1.2 Math Prelims. 1.2.1
    Math Induction
  • Algorithms: 1. Finiteness. 2. Definiteness. 3.
    Input. 4. Output. 5. Effectiveness.
  • Gold standards:
  • Definiteness: Induction
  • Effectiveness: "Uniform cost criterion" [AHU74]
    abstraction
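
As a minimal sketch of what basing code on induction can look like in practice, the loop below carries its inductive hypothesis as an explicit assertion. The function name `running_max` and the choice of Python are illustrative assumptions, not taken from the slides or from [GvN47].

```python
def running_max(xs):
    """Return the maximum of a non-empty list, with the induction made explicit."""
    best = xs[0]  # base case: best == max(xs[:1])
    for k in range(1, len(xs)):
        # Inductive hypothesis: best is the maximum of the first k elements.
        assert best == max(xs[:k])
        if xs[k] > best:
            best = xs[k]
        # Inductive step done: best is now the maximum of the first k + 1 elements.
    return best
```

The program states only the next instruction to execute at each point. Correctness follows from the base case plus the step, which is the serial pattern that the later slides generalize.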

7
Killer app for general-purpose many-cores: Let
the app-dreamers do their magic
  • Oxymoron? General-purpose: no one application
    in particular
  • Not really: if possible, a killer application
    would be helpful
  • However, wrong as a condition for progress
  • General-purpose computing is an infrastructure
    for the IT sector and the economy
  • The general-purpose computing infrastructure has
    been realized by the software spiral (the cyclic
    process of hardware improvements leading to
    software improvements that lead back to hardware
    improvements and so on; Andy Grove, Intel)
  • Instituting a parallel software spiral is a
    killer application for many-cores: as in the
    past, app-dreamers will invent uses
  • → Not surprisingly, the killer application is also
    an infrastructure
  • Government has a role in building infrastructure
  • → Instituting a parallel software spiral merits
    government funding
  • However, insufficient empowerment for creating
    and developing alternative platforms to the point
    of establishing their merit.

8
Serial Abstraction and a Parallel Counterpart:
Example
  • Rudimentary abstraction that made serial
    computing simple: that any single instruction
    available for execution in a serial program
    executes immediately
  • Abstracts away different execution times for
    different operations (e.g., memory hierarchy).
    Used by programmers to conceptualize serial
    computing and supported by hardware and
    compilers. The program provides the instruction
    to be executed next (inductively)
  • Rudimentary abstraction for making parallel
    computing simple: that indefinitely many
    instructions, which are available for concurrent
    execution, execute immediately, dubbed Immediate
    Concurrent Execution (ICE)
  • → Step-by-step (inductive) explication of the
    instructions available next for concurrent
    execution. Processors not even mentioned. Falls
    back on the serial abstraction if 1
    instruction/step.
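
The ICE idea can be illustrated with a small serial simulation. The sketch below is an assumption-laden illustration, not XMT code: the function name `ice_sum` and the bookkeeping of "work" and "depth" are introduced here for the example. Each pass of the loop lists every addition that is available at that step. Under ICE all of them are taken to execute at once, so the number of passes is the parallel time and no processor count appears anywhere.

```python
def ice_sum(values):
    """Sum values step by step; each step states all additions that may run concurrently."""
    level = list(values)
    work = 0    # total number of additions over all steps
    depth = 0   # number of steps
    while len(level) > 1:
        # Every pair below is independent of the others, so under the ICE
        # abstraction all of these additions execute immediately, together.
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        next_level = [a + b for a, b in pairs]
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # unpaired element carries over unchanged
        work += len(pairs)
        depth += 1
        level = next_level
    total = level[0] if level else 0
    return total, work, depth
```

For eight inputs this performs 7 additions in 3 steps. With one addition per step the same description degenerates to an ordinary serial loop, matching the fall-back noted above.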

9
[CACM10] Using simple abstraction to guide the
reinvention of computing for parallelism
  • Overall: old Work-Depth description. Only
    minimalist abstraction: ICE builds only on
    induction, itself a rudimentary concept
  • [SV82] conjectured that the rest (full PRAM
    algorithm) is just a matter of skill
  • Lots of evidence that work-depth works. Used as
    framework in PRAM algorithms texts [JaJa-92,
    KKT-01]
  • ICE in line with PRAM: Only really successful
    parallel algorithmic theory. Latent, though not
    widespread, knowledgebase
  • Widely agreed: work and depth are necessary. Jury
    is out on what else. Our position: as little as
    possible.
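
As background not stated on the slide, one standard way to connect a work-depth description to a machine with a fixed number of processors is Brent's scheduling bound. The symbols are notation introduced for this sketch: W is the total number of operations (work), D the number of steps (depth), p the number of processors, and T_p the running time on p processors, ignoring scheduling overhead. The worked numbers assume summing 1024 values with a balanced binary tree.

```latex
\[
T_p \;\le\; \left\lfloor \frac{W}{p} \right\rfloor + D,
\qquad \text{e.g. } W = 1023,\; D = 10,\; p = 64
\;\Rightarrow\; T_{64} \le 15 + 10 = 25 .
\]
```

The bound suggests why work and depth alone already say a good deal about performance: an algorithm described without mentioning processors still yields a running-time estimate for any p.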

10
Workflow from parallel algorithms to programming
versus trial-and-error
[Diagram: two workflows from PAT to hardware]
  • Option 1: PAT → Domain decomposition, or task
    decomposition → Program → Insufficient
    inter-thread bandwidth? → Rethink algorithm: take
    better advantage of cache → Compiler → Hardware
  • Option 2: PAT → Prove correctness → Program →
    Still correct → Tune → Still correct → Hardware
  • PAT: Parallel algorithmic thinking (ICE/WD/PRAM)
Is Option 1 good enough for the parallel
programmer's model? Options 1B and 2 start with a
PRAM algorithm, but not option 1A. Options 1A
and 2 represent workflow, but not option 1B.
Not possible in the 1990s. Possible now:
XMT@UMD. Why settle for less?
11
Mark Twain on the PRAM
  • "We should be careful to get out of an experience
    only the wisdom that is in it, and stop there,
    lest we be like the cat that sits down on a hot
    stove-lid. She will never sit down on a hot
    stove-lid again, and that is well, but also she
    will never sit down on a cold one anymore." Mark
    Twain
  • PRAM algorithms did not become standard CS
    knowledge in 1988-90 since (hot stove-lid): No
    1990s implementable computer architecture allowed
    programmers to look at a computer as a PRAM
  • The XMT project @UMD changed that
  • PS: NVidia happy to report success with 2 PRAM
    algorithms in IPDPS09. Great to see that from a
    major vendor
  • These 2 algorithms are decomposition-based,
    unlike most PRAM algorithms. Freshmen programmed
    the same 2 algorithms on our XMT machine

12
The Parallel Programmer's Productivity
Landscape. Postulation: a continental divide
[Diagram: Ocean ← Decomposition-first programming |
Work-depth programming → Great Lakes]
  • How different can productivity of many-core
    architectures be? Answer: very!
  • Metaphor: Dropping rain a short distance apart.
    Very different outcomes. Think of programmer's
    productivity as cost of producing usable water.
  • The decomposition-first programming side requires
    domain-decomposition or task-decomposition that
    have not worked in spite of big investment.
    (Looks greener, since invested; what if it goes
    to the ocean while the arid side goes to sweet
    water?)
  • Work-depth: initial abstraction is
    decomposition-free. (Arid, under-invested)
  • Requires leap-of-faith for investment.

13
Validation of Ease of Programming To Date
  • 1. Comparison with MPI by DARPA-HPCS SW Eng
    leaders [Hochstein-Basili-V-Gilbert]
  • 2. Teachability demonstrated so far
    [Torbert-V-Tzur-Ellison, SIGCSE10, to appear]
  • - To freshman class with 11 non-CS students.
    Some prog. assignments: median finding,
    merge-sort, integer-sort, sample-sort.
  • Other teachers:
  • - Magnet HS teacher. Downloaded simulator,
    assignments, class notes, from XMT page.
    Self-taught. Recommends: Teach XMT first. Easiest
    to set up (simulator), program, analyze: ability
    to anticipate performance (as in serial). Can do
    not just for embarrassingly parallel. Teaches
    also OpenMP, MPI, CUDA. Look up keynote at
    CS4HS09@CMU and interview with teacher.
  • - High school and middle school (some 10 year
    olds) students from underrepresented groups by
    HS Math teacher.
  • Teachability: necessary (but not sufficient)
    condition for ease-of-programming. Itself a
    necessary (but not sufficient) condition for
    productivity.
  • Hence, teachability is as good a benchmark as any
    out there for productivity

14
Conclusion
  • - Want future mainstream programmers to embrace
    general-purpose parallelism (every CS major for
    common SW architectures). Yet, in the past:
  • Insufficient evidence on productivity. Yet,
    history of repeated surprise: Parallel machines
    repel programmers
  • Research Drivers
  • Empower select holistic (HW+SW) parallel
    platforms for merit-based comparison. Imagine a
    new world with the given platform. Consider all
    aspects, e.g.: Is it sufficient for reinstating
    the SW spiral? Is the barrier-to-entry for
    creative applications low enough? How will the CS
    curriculum look? Who will be attracted to
    study CS?
  • Then, gather evidence
  • Methodically compare productivity
    (development-time, run-time) of platforms.
  • → Ownership stake role for Indian partner (Prof.
    PJ Narayan, IIIT, Hyderabad). India: largest
    producer of SW. New platform requires sufficient
    Indian interest. Lead benchmarking/comparison for
    productivity, etc.
  • For session: Coming from algorithms, computer
    vision and computational biology, compare select
    platforms for performance, productivity
    (development-time and run-time), and overall for
    reinstating the SW spiral. Benchmark algorithms
    and applications based on their inherent
    parallelism for future machine platforms, as
    opposed to using existing code written for
    yesterday's (serial or parallel) machines. Issue:
    How to benchmark for productivity?

15
Not just a theory. XMT prototyped HW+SW
64-core, 75MHz FPGA prototype [SPAA07, Computing
Frontiers08]. Original explicit multi-threaded
(XMT) architecture [SPAA98].
Interconnection network for 128-core. 9mmX5mm,
IBM90nm process. 400 MHz prototype
[HotInterconnects07].
Same design as 64-core FPGA. 10mmX10mm, IBM90nm
process. 150 MHz prototype. The design scales to
1000 cores on-chip.
  • Never a successful general-purpose parallel
    computer (easy to program, good speedups, up &
    down scalable). IF you could program it → great
    speedups.
  • Motivation: Fix the IF

16
Programmer's Model: Engineering Workflow
  • Arbitrary CRCW Work-depth algorithm. Reason about
    correctness & complexity in synchronous model
  • SPMD reduced synchrony
  • Threads advance at own speed, not lockstep
  • Main construct: spawn-join block. Note: can start
    any number of processes at once. Can express
    locality (decomposition-second)
  • Prefix-sum (ps). Independence of order semantics
    (IOS).
  • Establish correctness & complexity by relating to
    WD analyses.
  • Circumvents "The problem with threads", e.g.,
    [Lee].

[Diagram: spawn, join, spawn, join]
  • Tune (compiler or expert programmer): (i) Length
    of sequence of round trips to memory, (ii) QRQW,
    (iii) WD. [VCL08]
  • Trial & error contrast: similar start → while
    insufficient inter-thread bandwidth do {rethink
    algorithm to take better advantage of cache}
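
The spawn-join block and the prefix-sum (ps) primitive named on this slide can be imitated in ordinary Python. What follows is a minimal sketch, not XMT code: it assumes ps behaves like an atomic fetch-and-add on a shared base, and the names `PrefixSumBase`, `ps` and `compact` are hypothetical. Array compaction serves as the example because the result is correct whichever thread obtains which output slot, which is presumably the kind of property that independence of order semantics is meant to capture.

```python
import threading


class PrefixSumBase:
    """Hypothetical stand-in for a shared prefix-sum base (assumed fetch-and-add)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def ps(self, increment):
        """Atomically add increment to the base and return the value held before."""
        with self._lock:
            old = self._value
            self._value += increment
            return old

    @property
    def value(self):
        return self._value


def compact(a):
    """Copy the nonzero elements of a into a dense list, in no particular order."""
    b = [None] * len(a)
    base = PrefixSumBase(0)

    def body(i):
        # Each virtual thread handles one index and advances at its own speed.
        if a[i] != 0:
            slot = base.ps(1)   # claim a unique output slot
            b[slot] = a[i]

    # "spawn": start one thread per index, all at once.
    threads = [threading.Thread(target=body, args=(i,)) for i in range(len(a))]
    for t in threads:
        t.start()
    # "join": continue serially only after every thread has finished.
    for t in threads:
        t.join()
    return b[:base.value]
```

For example, `sorted(compact([0, 5, 0, 7, 3, 0]))` yields `[3, 5, 7]`; the unsorted order may differ from run to run. The code never partitions the array among processors, which is the decomposition-free starting point the workflow describes.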

17
Performance
  • Simulation of 1024 processors: 100X on standard
    benchmark suite for VHDL gate-level simulation
    [GV06]
  • [SPAA09]: 10X relative to Intel Core 2 Duo
    with 64-processor XMT; same silicon area as 1
    commodity processor (core)
  • Promise of 100X with 1024 processors also for
    irregular, fine-grained parallelism with up- and
    down-scalability.

18
Some Credits
  • Grad students: George Caragea, James Edwards,
    David Ellison, Fuat Keceli, Beliz Saybasili, Alex
    Tzannes. Recent grads: Aydin Balkan, Mike Horak,
    Xingzhi Wen
  • Industry design experts (pro-bono)
  • Rajeev Barua, Compiler. Co-advisor of 2 CS grad
    students. 2008 NSF grant
  • Gang Qu, VLSI and Power. Co-advisor
  • Steve Nowick, Columbia U., Asynch computing.
    Co-advisor. 2008 NSF team grant.
  • Ron Tzur, U. Colorado, K12 Education. Co-advisor.
    2008 NSF seed funding
  • K12: Montgomery Blair Magnet HS, MD; Thomas
    Jefferson HS, VA; Baltimore (inner city)
    Ingenuity Project Middle School; 2009 Summer Camp,
    Montgomery County Public Schools
  • Marc Olano, UMBC, Computer graphics. Co-advisor.
  • Tali Moreshet, Swarthmore College, Power.
    Co-advisor.
  • Bernie Brooks, NIH. Co-Advisor
  • Marty Peckerar, Microelectronics
  • Igor Smolyaninov, Electro-optics
  • Funding: NSF, NSA (2008: deployed XMT computer), NIH
  • 6 issued patents. More patent applications
  • Informal industry partner: Intel
  • Reinvention of Computing for Parallelism.
    Selected for Maryland Research Center of
    Excellence (MRCE) by USM. Not yet funded. 17
    members, including UMBC, UMBI, UMSOM. Mostly
    applications.