Title: Programming Languages/Models and Compiler Technologies

1. Programming Languages/Models and Compiler Technologies
Moderator: John Mellor-Crummey, Department of Computer Science, Rice University
Microsoft Manycore Workshop, June 21, 2007
2. Panelists
- David August - Princeton University
- Saman Amarasinghe - Massachusetts Institute of Technology
- Guy Blelloch - Carnegie Mellon University
- Charles Leiserson - Massachusetts Institute of Technology
- Uzi Vishkin - University of Maryland, College Park
3. Architectural Challenges
- Significant parallelism
- Multiple kinds of parallelism
  - cores
  - ILP
  - SIMD
- Diversity of cores
- Run-time throttling of cores for power management
- Memory hierarchy
  - bandwidth
    - near term: will continue to be a significant bottleneck
    - long term: 3D stacked memory?
  - long and often non-uniform memory latencies
  - scratch pads
4. Roles of Parallel Programming Models
- Enhance programmer productivity through abstraction
- Manage platform resources to deliver performance
- Provide a standard interface for platform portability
5. The Goal
- Simpler ways of conceptualizing, expressing, debugging, and tuning scalable parallel programs
- Multiple models will be necessary
- Models will necessarily trade off simplicity, expressivity, relevance to legacy code, and performance
6. To Succeed, Parallel Programming Models Must
- Be ubiquitous
  - cross-platform
  - at a minimum, laptops and SMP servers
  - distributed-memory clusters?
- Be expressive
- Be productive
  - easy to write
  - easy to read and maintain
  - easy to reuse
- Have a promise of future availability and longevity
- Be efficient
- Be supported by tools
7. Simplifying Parallel Programming
- A high-level parallel language should:
  - provide a global address space
    - beware exposed buffering
  - separate concerns: partitioning, mapping, and synchronization vs. algorithm specification
    - viscosity comes from premature mingling of these issues
  - enable the programmer to manage locality at a high level
    - locality determines performance
    - affinity between data and computation (a sketch follows this list)
      - e.g., HPF's ON HOME declarations
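As a concrete illustration of data/computation affinity, here is a minimal sketch in plain C++ threads, not HPF; the array size, worker count, and block partition are arbitrary choices for illustration. Each worker updates only the block it owns, the owner-computes discipline that HPF's ON HOME declarations express declaratively.

```cpp
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 20, nworkers = 4;  // illustrative sizes
    std::vector<double> a(n, 1.0);

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < nworkers; ++w) {
        workers.emplace_back([&a, n, nworkers, w] {
            // Worker w touches only the block it owns: the computation is
            // placed where the data lives, rather than the other way around.
            std::size_t lo = w * n / nworkers, hi = (w + 1) * n / nworkers;
            for (std::size_t i = lo; i < hi; ++i) a[i] *= 2.0;
        });
    }
    for (auto& t : workers) t.join();
}
```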
8. Design Issues I
- Ultimate control vs. simplicity of use
  - library developers vs. productivity users
  - should it be the same language for both?
    - extensible language model (Sun's Fortress)
    - kitchen-sink model (X10)
- Implicit vs. explicit parallelism
  - implicit parallelism is often more malleable
    - better supports dynamic adaptation
- Compiler-assisted vs. compiler-centric
  - Co-array Fortran and UPC: user control over work decomposition, data movement, and synchronization
  - HPF: the compiler must deliver or all is lost
- Lazy vs. eager parallelism
  - Cilk's lazy parallelism provides a model for scalable binaries
  - eager parallelism adds unnecessary overhead (see the sketch after this list)
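A hedged sketch of the lazy-vs-eager tradeoff, in standard C++ rather than Cilk: std::async(std::launch::async) stands in for eager spawning, where every spawn pays for a real thread whether or not a core is free. Cilk's lazy model instead records a spawn as a cheap frame that idle workers may steal, so a one-worker run stays near serial speed. The cutoff value below is an arbitrary illustrative choice.

```cpp
#include <future>
#include <iostream>

// Eager parallelism: each spawn below creates an actual thread.
long fib(int n) {
    if (n < 20)                                     // serial cutoff: below this,
        return n < 2 ? n : fib(n - 1) + fib(n - 2); // spawning costs too much
    auto x = std::async(std::launch::async, fib, n - 1); // eager spawn
    long y = fib(n - 2);                            // continue in this thread
    return x.get() + y;                             // join
}

int main() { std::cout << fib(32) << '\n'; }        // prints 2178309
```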
9. Design Issues II
- Deterministic vs. non-deterministic models
  - deterministic Clocked Final model: Saraswat et al. (www.saraswat.org/cf.pdf)
- Static vs. dynamic scheduling
  - dynamic scheduling will be increasingly important
    - irregular computations, task parallelism
    - adaptive scheduling in response to core throttling (a sketch follows this slide)
- Cooperative vs. independent scheduling of work
  - does the benefit of a shared cache outweigh the difficulty of using it?
  - tightly synchronous vs. more loosely synchronous
- Scalable to distributed-memory ensembles?
  - the broad community probably only cares about tightly-coupled platforms
  - some government and industry clients will always have extreme needs
- Importance of managing affinity between cores and data
  - important for highest efficiency for library developers
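A minimal dynamic-scheduling sketch under the same caveat (plain C++ threads, not any particular model): an atomic cursor hands out iterations on demand, so workers that finish cheap iterations, or that simply run faster, come back for more; a static block partition would strand work on slow or throttled cores. The per-item cost function is made up to create irregular work.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <thread>
#include <vector>

int main() {
    const int n = 10000;
    std::vector<double> out(n);
    std::atomic<int> next{0};                      // shared work cursor

    auto worker = [&] {
        // Grab one iteration at a time; load balancing falls out naturally.
        for (int i; (i = next.fetch_add(1)) < n; )
            out[i] = std::pow(std::sin(i), i % 7 + 1); // uneven per-item cost
    };

    unsigned nw = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < nw; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```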
10. Transactions are not THE Answer
- Transactions are a piece of the puzzle: atomicity (sketched below)
- Other aspects of the parallel programming problem remain:
  - identifying concurrency
  - partitioning work
  - ordering actions
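To make the division of labor concrete, here is a sketch using std::atomic as a stand-in for transactional atomicity: the atomic increments guarantee no update is lost, but identifying the parallel loop, partitioning the input between the two threads, and ordering the phases (fill, then count) were all still decided by the programmer.

```cpp
#include <atomic>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1 << 20);
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = i % 16;

    std::vector<std::atomic<long>> hist(16);
    for (auto& h : hist) h.store(0);

    // Atomicity (what transactions solve): concurrent increments to shared
    // bins cannot be lost. Concurrency, partitioning, and ordering (what they
    // do not solve) are still chosen by hand below.
    auto count = [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) hist[data[i]].fetch_add(1);
    };
    std::thread t1(count, 0, data.size() / 2);
    std::thread t2(count, data.size() / 2, data.size());
    t1.join(); t2.join();
}
```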
11. Autotuning
- Seductive idea
- Very successful as a library-based approach
  - FFTW, ATLAS, OSKI, ... (a toy example follows this list)
- Much work needed to apply it to applications rather than kernels
  - huge search space
  - progress in effective truncated search
  - model guidance can be effective
- Autotuning for parallelism
  - dangerously close to automatic parallelization
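A toy autotuner in the spirit of FFTW/ATLAS, though vastly simplified: time one kernel over a small hand-picked parameter space and keep the fastest variant. The kernel, the candidate block sizes, and selection by wall clock are all illustrative stand-ins; real systems prune far larger spaces with models and truncated search.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 1 << 22;
    std::vector<float> a(n, 1.0f);
    int best_block = 0;
    double best_time = 1e30;

    for (int block : {256, 1024, 4096, 16384}) {   // candidate tuning parameters
        auto t0 = std::chrono::steady_clock::now();
        for (int lo = 0; lo < n; lo += block)      // run this kernel variant
            for (int i = lo; i < lo + block && i < n; ++i)
                a[i] = a[i] * 1.0001f + 0.5f;
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        if (dt < best_time) { best_time = dt; best_block = block; } // keep best
    }
    std::cout << "selected block size: " << best_block << '\n';
}
```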
12. Rice Experience: Lessons from HPF
- Good data and computation partitionings are essential
  - without good partitionings, parallelism suffers
  - flexible user control is essential
- Excess communication undermines scalability
  - both frequency and volume must be right
  - embrace user hints to guide communication placement and optimization
    - e.g., HPF/JA directives: REFLECT, LOCAL, PIPELINE, etc.
- Single-processor efficiency is critical
  - must use caches effectively on microprocessors
  - I-cache: beware of complex machine-generated code
  - D-cache: beware of communication footprint
- Optimizing tightly-coupled algorithms can be hard
  - if the compiler doesn't optimize it, performance may be doomed!
13. Rice Experience: HPF vs. Co-array Fortran
- Rice dHPF: a decade of investment in compiler technology
  - not quite: government cut funding here too, just like architecture
  - polyhedral code-generation models (like Lethin described)
- Co-array Fortran for clusters: a few years' effort by a pair of students
- Result: Co-array Fortran bests HPF
  - more expressive
  - higher performance
  - shorter time to solution
  - currently, can be HARDER to program than MPI
14. Principal Compiler and Runtime Challenges
- Exploiting multiple levels of heterogeneous parallelism
- Choreographing parallelism, data movement, and synchronization
- Managing the memory hierarchy
  - cache
  - scratch pad

Warning: Don't try this at home.
15. Programming Model Ecosystem Issues
- Semantic mismatch between programming model and execution model
- Debugging data races and non-determinism
- Performance analysis: why isn't performance scaling?
  - insufficient parallelism
  - parallelism is too fine-grained to be efficient
  - architecture-level issues, e.g., false sharing (illustrated below)
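A sketch of false sharing, assuming 64-byte cache lines: the two threads increment logically independent counters, but if the counters share a cache line, the line ping-pongs between cores and throughput collapses even though neither thread touches the other's data.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// alignas(64) gives each counter its own (assumed 64-byte) cache line.
// Shrink it to alignas(8) and the counters share a line: every increment
// invalidates the other core's copy of that line.
struct Padded { alignas(64) std::atomic<long> v{0}; };

int main() {
    Padded c[2];
    auto bump = [&](int i) { for (long k = 0; k < 10000000; ++k) c[i].v++; };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(bump, 0), b(bump, 1);
    a.join(); b.join();
    std::cout << std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - t0).count() << " s\n";
}
```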
16. A Path Forward
- Kernel-, benchmark-, and application-driven studies
  - assess strengths and weaknesses of models
- Explore alternatives; evaluate their effects on:
  - simplicity
  - expressiveness
  - correctness
  - performance