Title: An IMplicitly PArallel Compiler Technology Based on Phoenix, for thousand-core microprocessors
1. An IMplicitly PArallel Compiler Technology Based on Phoenix
- For thousand-core microprocessors
- Wen-mei Hwu
- with Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro, Lumetta, Frank, Patel
- University of Illinois, Urbana-Champaign
2. Background
- Academic compiler research infrastructure is a tough business
  - IMPACT, Trimaran, and ORC for VLIW and Itanium processors
  - Polaris and SUIF for multiprocessors
  - LLVM for portability and safety
- In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding
  - A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming
  - Infrastructure work has slowed down ground-breaking work
- Timely visit by the Phoenix team in January 2007
  - Rapid progress has since been taking place
  - Future IMPACT research will be built on Phoenix
3. The Next Software Challenge
Big picture
- Today, multi-core chips make more effective use of area and power than large ILP CPUs
- Scaling from 4-core to 1000-core chips could happen in the next 15 years
- All semiconductor market domains are converging to concurrent system platforms
  - PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.
We need to make these systems effectively execute valuable, demanding apps.
4. The Compiler Challenge
Compilers and tools must extend the human ability to manage parallelism by doing the heavy lifting.
- To meet this challenge, the compiler must
  - Allow simple, effective control by programmers
  - Discover and verify parallelism
  - Eliminate tedious effort in performance tuning
  - Reduce the testing and support cost of parallel programs
5. An Initial Experimental Platform
- A quiet revolution and potential build-up
  - Calculation: 450 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
  - Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
  - Until last year, programmable only through graphics APIs
- A GPU in every PC and workstation: massive volume and potential impact
6. GeForce 8800
- 16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
7. Some Hand-Coded Results

App     | Architectural Bottleneck                       | Simultaneous Threads | Kernel Speedup | App Speedup
H.264   | Registers, global memory latency               | 3,936  | 20.2x  | 1.5x
LBM     | Shared memory capacity                         | 3,200  | 12.5x  | 12.3x
RC5-72  | Registers                                      | 3,072  | 17.1x  | 11.0x
FEM     | Global memory bandwidth                        | 4,096  | 11.0x  | 10.1x
RPES    | Instruction issue rate                         | 4,096  | 210.0x | 79.4x
PNS     | Global memory capacity                         | 2,048  | 24.0x  | 23.7x
LINPACK | Global memory bandwidth, CPU-GPU data transfer | 12,288 | 19.4x  | 11.8x
TRACF   | Shared memory capacity                         | 4,096  | 60.2x  | 21.6x
FDTD    | Global memory bandwidth                        | 1,365  | 10.5x  | 1.2x
MRI-Q   | Instruction issue rate                         | 8,192  | 457.0x | 431.0x
HKR HotChips-2007
8. Computing Q Performance
- [Performance chart] GPU (V8): 96 GFLOPS vs. CPU (V6): 230 MFLOPS, a 446x speedup
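For context, the Q computation measured here comes from MRI reconstruction; the sketch below shows only the general shape of that kind of kernel, not the benchmark's actual code (the struct layout, names such as `computeQ` and `KSample`, and the exact formula are assumptions). Each output voxel accumulates an independent sum over all k-space samples, which is why one GPU thread per voxel maps so well and why the kernel ends up limited by instruction issue rate rather than memory.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of an MRI Q-matrix style computation: every output
// voxel accumulates a sum of sinusoids over all k-space sample points.
// Each outer iteration is independent, so on a GPU one thread can own one
// voxel; the work is dominated by floating-point multiply-adds and
// trigonometric operations, i.e., it is compute-bound.
struct KSample { float kx, ky, kz, mu; };   // sample position and magnitude (assumed layout)

void computeQ(const std::vector<KSample>& samples,
              const std::vector<float>& x, const std::vector<float>& y,
              const std::vector<float>& z,
              std::vector<float>& Qr, std::vector<float>& Qi) {  // Qr/Qi pre-sized to x.size()
    const float twoPi = 6.2831853f;
    for (std::size_t v = 0; v < x.size(); ++v) {       // one voxel per iteration (GPU: per thread)
        float accR = 0.0f, accI = 0.0f;
        for (const KSample& s : samples) {              // reduction over k-space samples
            float phase = twoPi * (s.kx * x[v] + s.ky * y[v] + s.kz * z[v]);
            accR += s.mu * std::cos(phase);
            accI += s.mu * std::sin(phase);
        }
        Qr[v] = accR;
        Qi[v] = accI;
    }
}
```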
9. Lessons Learned
- Parallelism extraction requires global understanding
  - Most programmers only understand parts of an application
- Algorithms need to be re-designed
  - Programmers benefit from a clear view of the algorithmic effect on parallelism
- Real but rare dependences often need to be ignored (see the sketch below)
  - Error-checking code, etc.; the parallel code is often not equivalent to the sequential code
- Getting more than a small speedup over sequential code is very tricky
  - Roughly 20 versions were typically tried for each application to move away from architecture bottlenecks
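To make the "real but rare dependence" point concrete, here is a small hypothetical fragment (not drawn from any of the applications above): an error-logging path inside an otherwise fully parallel loop creates a loop-carried dependence through shared state, even though it almost never executes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example of a rare but real dependence. The loop body is
// independent across iterations except for the error path: appending to a
// shared error log (and counting errors) serializes the loop under strict
// sequential equivalence, even though errors almost never occur.
int scaleAll(const std::vector<float>& in, std::vector<float>& out,   // out pre-sized to in.size()
             std::vector<int>& errorLog) {
    int errorCount = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 0.0f) {                          // rare error path
            errorLog.push_back(static_cast<int>(i)); // loop-carried dependence on errorLog
            ++errorCount;                            // and on errorCount
            continue;
        }
        out[i] = 2.0f * in[i];                       // the common case: fully independent
    }
    return errorCount;
}
```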
10. Implicitly Parallel Programming Flow
[Flow diagram]
- Stylized C/C++ or DSL with assertions (example after this list)
- Concurrency discovery: deep analysis with feedback assistance, with the human in the loop
- Visualizable concurrent form (for increased composability)
- Code-gen space exploration: systematic search for the best/correct code generation
- Visualizable sequential assembly code with parallel annotations (for increased scalability)
- Parallel HW with sequential state generation: parallel execution with sequential semantics, supporting a debugger (for increased supportability)
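As an example of the "stylized C/C++ with assertions" entry point, the sketch below shows the flavor of input such a flow might accept: ordinary sequential C++ plus programmer-supplied assertions that state facts the deep analysis cannot easily prove on its own (here, that the two arrays do not overlap). The assertion style is an assumption for illustration only, not the actual IMPACT annotation syntax.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of "stylized" sequential C++ carrying assertions that expose
// parallelism-relevant facts to a deep analysis. The loop itself stays
// sequential; the assertions state that input and output do not overlap,
// so every iteration is independent and a compiler may parallelize it.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    // Programmer-supplied facts the analysis would otherwise have to prove:
    assert(x != nullptr && y != nullptr);
    assert(x + n <= y || y + n <= x);    // no overlap between input and output

    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];          // independent across i, given the assertions
    }
}
```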
11. Key Ideas
- Deep program analyses that extend programmer and DSE knowledge for parallelism discovery
  - Key to reducing programmer parallelization effort
- Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support (see the sketch after this list)
  - Key to successful parallelization of many real applications
- Rich program information maintained in the IR for access by tools and HW
  - Key to integrating multiple programming models and tools
- Intuitive, visual presentation to programmers
  - Key to good programmer understanding of algorithm effects
- Managed search of the parallel execution arrangement space
  - Key to reducing programmer performance-tuning effort
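The STU idea above relies on hardware support, but a purely software analogue of the speculate-and-undo pattern may help explain it: run the loop optimistically in parallel while watching for the rare dependence, and if it actually fires, restore a snapshot (undo) and replay sequentially so sequential semantics are preserved. The names (`process`, `runSpeculative`) and the snapshot mechanism are illustrative assumptions, not the proposed hardware design.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Software analogue of speculate-and-undo: iterations are assumed
// independent; if the rare dependence-carrying path is ever taken, the
// speculative results are discarded (undo) and the loop is replayed
// sequentially so that sequential semantics are preserved.
bool process(std::vector<float>& data, std::size_t i, std::atomic<bool>& rareHit) {
    if (data[i] < 0.0f) { rareHit = true; return false; }  // rare path: abort speculation
    data[i] *= 2.0f;                                       // common, independent case
    return true;
}

void runSpeculative(std::vector<float>& data) {
    std::vector<float> snapshot = data;          // state to restore on undo
    std::atomic<bool> rareHit{false};

    const unsigned nThreads = 4;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            // Strided partition: thread t owns indices t, t+nThreads, ...
            for (std::size_t i = t; i < data.size() && !rareHit; i += nThreads)
                process(data, i, rareHit);
        });
    }
    for (auto& w : workers) w.join();

    if (rareHit) {                               // undo and replay sequentially
        data = snapshot;
        std::atomic<bool> ignored{false};
        for (std::size_t i = 0; i < data.size(); ++i)
            if (!process(data, i, ignored)) { /* handle the rare case in order */ }
    }
}
```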
12. Parallelism in Algorithms (H.263 motion estimation example)
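As a concrete version of the motion-estimation example named in this slide: in block motion estimation, each macroblock searches the reference frame for its best match, reading only immutable frame data and writing only its own motion vector, so every macroblock can be processed in parallel. The simplified full-search sketch below assumes 16x16 macroblocks and invented function names; it is not the encoder's actual code.

```cpp
#include <climits>
#include <cstdlib>
#include <vector>

// Simplified full-search motion estimation for one macroblock: sum of
// absolute differences (SAD) against candidate positions in the reference
// frame. Both frames are read-only and each macroblock writes only its own
// motion vector, so the per-macroblock searches are fully parallel.
struct MotionVector { int dx, dy; };

static int sad16x16(const std::vector<unsigned char>& cur,
                    const std::vector<unsigned char>& ref,
                    int width, int cx, int cy, int rx, int ry) {
    int sum = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sum += std::abs(int(cur[(cy + y) * width + cx + x]) -
                            int(ref[(ry + y) * width + rx + x]));
    return sum;
}

MotionVector searchBlock(const std::vector<unsigned char>& cur,
                         const std::vector<unsigned char>& ref,
                         int width, int height, int cx, int cy, int range) {
    MotionVector best{0, 0};
    int bestSad = INT_MAX;
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = cx + dx, ry = cy + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) continue;
            int s = sad16x16(cur, ref, width, cx, cy, rx, ry);
            if (s < bestSad) { bestSad = s; best = {dx, dy}; }
        }
    }
    return best;   // independent per macroblock: all blocks can be searched in parallel
}
```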
13. MPEG-4 H.263 Encoder: Parallelism Rediscovery
[Figure: five-panel illustration, (a) through (e)]
14. Code-Gen Space Exploration
15. Moving an Accurate Interprocedural Analysis into Phoenix
[Figure: unification-based pointer analysis vs. Fulcra]
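As a hedged illustration of why the choice of pointer analysis matters here: in the classic fragment below, a unification-based (Steensgaard-style) analysis merges the targets of p and q because both are copied into r, so *p and *q appear to alias; an inclusion-based, subset-constraint analysis such as Fulcra keeps p and q precise. The identifiers are invented for the example.

```cpp
// Precision loss in a unification-based (Steensgaard-style) points-to
// analysis: copying p and then q into r unifies the target nodes of p, q,
// and r into one equivalence class, so all three appear to point to {a, b}
// and *p / *q look like potential aliases. An inclusion-based analysis only
// grows r: p -> {a}, q -> {b}, r -> {a, b}, so the two stores below are
// provably independent and easier to schedule in parallel.
int example() {
    int a = 0, b = 0;
    int* p = &a;
    int* q = &b;
    int* r = p;    // unification merges the node pointed to by r with p's target {a}
    r = q;         // ...and then with q's target {b}, dragging p and q into one merged node
    *p = 1;        // precise analysis: writes only a
    *q = 2;        // precise analysis: writes only b
    return a + b + *r;
}
```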
16. Getting Started with Phoenix
- Meetings with the Phoenix team in January 2007
- Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations
- Received a custom build of Phoenix that supports full type information
17. Fulcra to Phoenix Action!
- Four-step process
  1. Convert IMPACT's data structures to Phoenix's equivalents, and from C to C++/CLI.
  2. Create the initial constraint graph using Phoenix's IR instead of IMPACT's IR (see the sketch below).
  3. Convert the solver, the pointer analysis itself.
     - Consists of porting from C to C++/CLI and dealing with any changes to the ported Fulcra data structures.
  4. Annotate the points-to information back into Phoenix's alias representation.
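To make step 2 concrete, here is a minimal, hypothetical sketch of what creating an initial constraint graph for an inclusion-based points-to analysis usually involves: one node per variable and one constraint per pointer-relevant IR operation (address-of, copy, load, store). The class and function names are invented for illustration and do not correspond to Fulcra's or Phoenix's actual APIs.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal constraint-graph representation for an inclusion-based
// (Andersen-style) points-to analysis. Walking the IR, each
// pointer-relevant instruction adds one constraint; the solver later
// propagates points-to sets until a fixed point is reached.
enum class ConstraintKind {
    AddressOf,  // p = &x   pts(p) includes {x}
    Copy,       // p = q    pts(p) includes pts(q)
    Load,       // p = *q   pts(p) includes pts(*q)
    Store       // *p = q   pts(*p) includes pts(q)
};

struct Constraint {
    ConstraintKind kind;
    std::uint32_t dst;   // destination variable node
    std::uint32_t src;   // source variable node
};

class ConstraintGraph {
public:
    std::uint32_t addNode(const std::string& name) {
        names_.push_back(name);
        return static_cast<std::uint32_t>(names_.size() - 1);
    }
    void addConstraint(ConstraintKind kind, std::uint32_t dst, std::uint32_t src) {
        constraints_.push_back({kind, dst, src});
    }
    const std::vector<Constraint>& constraints() const { return constraints_; }
private:
    std::vector<std::string> names_;
    std::vector<Constraint> constraints_;
};

// Usage sketch: the constraints generated for "p = &x; q = p; *q = p;"
void buildExample() {
    ConstraintGraph g;
    std::uint32_t x = g.addNode("x"), p = g.addNode("p"), q = g.addNode("q");
    g.addConstraint(ConstraintKind::AddressOf, p, x);  // p = &x
    g.addConstraint(ConstraintKind::Copy,      q, p);  // q = p
    g.addConstraint(ConstraintKind::Store,     q, p);  // *q = p
}
```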
18. Phoenix Support Wish List
- Access to code across file boundaries
  - LTCG (link-time code generation)
  - Access to multiple files within a pass
- Full (source-code-level) type information
- Feeding results from Fulcra back to Phoenix
  - Need more information on the Phoenix alias representation
- In the long run, we need a highly extensible IR and API for Phoenix
19. Conclusion
- Compiler research for many-cores will require a very high-quality infrastructure with strong engineering support
  - New language extensions, new user models, new functionalities, new analyses, new transformations
- We chose Phoenix based on its robustness, features, and engineering support
- Our current industry partners are also moving to Phoenix
- We also plan to share our advanced extensions with other academic Phoenix users