Title: Titanium: A High Performance Java-Based Language
1Titanium: A High Performance Java-Based Language
Katherine Yelick, Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Carleton Miyamoto, Geoff Pike, Luigi Semenzato
2Talk Outline
- Motivation
- Extensions for uniprocessor performance
- Extensions for parallelism
- A framework for domain-specific languages
- Status and performance
3Programming Challenges on Millennium
- Large scale computations
- Optimized simulation algorithms are complex
- Use of hierarchical parallel machine
- Cost-conscious programming
[Figure: minimization algorithms, unstructured meshes, adaptive meshes]
4Titanium Approach
- Performance is primary goal
- High uniprocessor performance
- Designed for shared and distributed memory
- Parallelism constructs with programmer control
- Optimizing compiler for caches, communication scheduling, etc.
- Expressiveness secondary goal
- Based on safe language Java
- Safety simplifies programming and compiler analysis
- Framework for domain-specific language extensions
5New Language Features
- Immutable classes
- Multidimensional arrays
- also points and index sets as first-class values
- multidimensional iterators
- Memory management
- semi-automated zone-based allocation
- Scalable parallelism
- SPMD model of execution with global address space
- Language-level synchronization
- Support for grid-based computation
6Java Objects
- Primitive scalar types: boolean, double, int, etc.
- access is fast
- Objects: user-defined and from the standard library
- have an implicit level of indirection (pointer to)
- arrays are objects
- all objects support equality checks and a few other operations
[Figure: primitive values 3 and true; an object with fields r: 7.1 and i: 4.3]
7Immutable Classes in Titanium
- For small objects, would sometimes prefer
- to avoid level of indirection
- pass by value
- extends the idea of primitive values (1, 4.2, etc.) to user-defined values
- Titanium introduces immutable classes
- all fields are final (implicitly)
- cannot inherit from (extend) or be inherited by other classes
- needs to have 0-argument constructor, e.g., Complex()
- immutable class Complex { ... }
- Complex c = new Complex(7.1, 4.3);
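The rules above (final fields, no inheritance, a 0-argument constructor) can be approximated in plain Java. The sketch below is only an analogy: standard Java still reaches a Complex through a reference, so it does not remove the level of indirection that Titanium's immutable classes avoid. The field names re and im and the plus method are illustrative assumptions, not taken from the slides.

```java
import java.util.Objects;

// Plain-Java approximation of a Titanium immutable class (illustrative only).
final class Complex {              // final: cannot be extended
    final double re;               // all fields are final
    final double im;

    Complex() {                    // 0-argument constructor, as the slide requires
        this(0.0, 0.0);
    }

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Operations return new values instead of mutating this one.
    Complex plus(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }

    // Value equality, mirroring how primitive values compare.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Complex)) {
            return false;
        }
        Complex c = (Complex) o;
        return Double.compare(re, c.re) == 0 && Double.compare(im, c.im) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(re, im);
    }
}
```

Usage follows the slide's own example: Complex c = new Complex(7.1, 4.3);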
8Arrays in Java
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Array bounds are checked (as in Fortran)
- Multidimensional arrays, built as arrays of arrays, are slow and cannot be transformed into contiguous memory
9Titanium Arrays
- Fast, expressive arrays
- multidimensional
- lower bound, upper bound, stride
- concise indexing: A[p] instead of A(i, j, k)
- Points
- tuple of integers as primitive type
- Domains
- rectangular sets of points (bounds and stride)
- arbitrary sets of points
- Multidimensional iterators
10Example: Point, RectDomain, Array
Point<2> lb = [1, 1];
Point<2> ub = [10, 20];
RectDomain<2> R = [lb : ub : [2, 2]];
double [2d] A = new double[R];
foreach (p in A.domain()) {
    A[p] = B[2 * p];
}
- Standard optimizations
- strength reduction
- common subexpression elimination
- invariant code motion
- removing bounds checks from body
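To make the foreach example concrete, here is a rough plain-Java sketch of what a strided rectangular domain and a contiguous array over it might look like. RectDomain2, Grid2, and ForeachSketch are hypothetical names invented for illustration. The sketch assumes inclusive bounds and row-major storage, and says nothing about how the Titanium compiler actually lays out arrays.

```java
// A 2D rectangular domain [lb : ub : stride] with inclusive bounds (assumed).
final class RectDomain2 {
    final int lx, ly, ux, uy, sx, sy;

    RectDomain2(int lx, int ly, int ux, int uy, int sx, int sy) {
        this.lx = lx; this.ly = ly;
        this.ux = ux; this.uy = uy;
        this.sx = sx; this.sy = sy;
    }

    int nx() { return (ux - lx) / sx + 1; }   // points along the first dimension
    int ny() { return (uy - ly) / sy + 1; }   // points along the second dimension
    int size() { return nx() * ny(); }

    // Map a point of the domain to an offset in one contiguous block.
    int offset(int x, int y) {
        return ((x - lx) / sx) * ny() + (y - ly) / sy;
    }
}

// An array of doubles indexed by the points of a domain, stored contiguously.
final class Grid2 {
    final RectDomain2 domain;
    final double[] data;

    Grid2(RectDomain2 domain) {
        this.domain = domain;
        this.data = new double[domain.size()];
    }

    double get(int x, int y) { return data[domain.offset(x, y)]; }
    void set(int x, int y, double v) { data[domain.offset(x, y)] = v; }
}

final class ForeachSketch {
    // Hand-written equivalent of: foreach (p in A.domain()) { A[p] = B[2 * p]; }
    static Grid2 copyScaled(Grid2 b, RectDomain2 r) {
        Grid2 a = new Grid2(r);
        for (int x = r.lx; x <= r.ux; x += r.sx) {
            for (int y = r.ly; y <= r.uy; y += r.sy) {
                a.set(x, y, b.get(2 * x, 2 * y));
            }
        }
        return a;
    }
}
```

The explicit loop nest and offset arithmetic are the kind of code that the optimizations listed above (strength reduction, invariant code motion, bounds-check removal) would target.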
11Memory Management
- Java implemented with garbage collection
- Distributed GC too unpredictable
- Compile-time analysis can improve performance
- Zone-based memory management
- extends existing model
- good performance
- safe
- easy to use
12Zone-Based Memory Management
- Allocate objects in zones
- Release zones manually
Zone Z1 = new Zone();
Zone Z2 = new Zone();
T x = new(Z1) T();
T y = new(Z2) T();
x.field = y;
x = y;
delete Z1;
delete Z2; // error
[Figure: zones Z1 and Z2 holding the objects referenced by x and y]
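One way to picture why the last delete is rejected is a zone that refuses to be released while live references still point into it. The sketch below is a toy model under that assumption. The Zone class here and its allocate, retain, release, and delete methods are hypothetical, and the slides do not say how Titanium tracks references into a zone.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a zone: objects are grouped and released together.
final class Zone {
    private final List<Object> objects = new ArrayList<>();
    private int liveReferences = 0;   // references into this zone from live variables
    private boolean deleted = false;

    // Register an object as belonging to this zone.
    <T> T allocate(T obj) {
        if (deleted) {
            throw new IllegalStateException("zone already deleted");
        }
        objects.add(obj);
        return obj;
    }

    // Explicit bookkeeping stands in for whatever tracking a real system does.
    void retain() { liveReferences++; }
    void release() { liveReferences--; }

    // Release the whole zone at once; refuse if something still points into it.
    boolean delete() {
        if (liveReferences > 0) {
            return false;             // corresponds to "delete Z2 // error"
        }
        objects.clear();
        deleted = true;
        return true;
    }

    int size() { return objects.size(); }
}
```

In this model, deleting Z1 succeeds once x has been redirected to y, while deleting Z2 fails because x and y both still refer to the object allocated there.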
13Sequential Performance
Times in seconds (lower is better).
14Sequential Performance
On an Ultrasparc
Benchmark      C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY          1.4s            6.8s          1.5s              7%
3D multigrid   12s             --            22s               83%
2D multigrid   5.4s            --            6.2s              15%
EM3D           0.7s            1.8s          1.0s              42%

On a Pentium II
Benchmark      C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY          1.8s            --            2.3s              27%
3D multigrid   23.0s           --            20.0s             -13%
2D multigrid   7.3s            --            5.5s              -25%
EM3D           1.0s            --            1.6s              60%
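The Overhead figures appear to be the relative slowdown of Titanium against the C/C++/FORTRAN time. The slides do not state the formula, so the helper below is an assumption. It reproduces the listed percentages to within about one point, and the small differences are presumably because the listed times are themselves rounded. A negative result means Titanium was faster.

```java
// Relative overhead of Titanium over a baseline, in whole percent (assumed formula).
final class Overhead {
    static long percent(double baselineSeconds, double titaniumSeconds) {
        return Math.round(100.0 * (titaniumSeconds - baselineSeconds) / baselineSeconds);
    }
}
```

For example, the Ultrasparc DAXPY times of 1.4s and 1.5s give Overhead.percent(1.4, 1.5), and the Pentium II 3D multigrid times of 23.0s and 20.0s give a negative overhead.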
15Model of Parallelism
- Single Program, Multiple Data
- fixed number of processes
- each process has own local data
- global synchronization (barrier)
[Figure: n processes run from start to end, synchronizing at each barrier]
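The SPMD pattern above can be imitated in standard Java with threads and java.util.concurrent.CyclicBarrier. This is a shared-memory sketch of the execution model only, not of Titanium's runtime. The class SpmdSketch, its run method, and the two-phase exchange are illustrative assumptions.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

final class SpmdSketch {
    // Runs the same two-phase program on n threads and returns what each one observed.
    static int[] run(int n) {
        final int[] local = new int[n];   // slot i plays the role of process i's local data
        final int[] seen = new int[n];
        final CyclicBarrier barrier = new CyclicBarrier(n);   // fixed number of processes
        Thread[] workers = new Thread[n];

        for (int id = 0; id < n; id++) {
            final int me = id;
            workers[id] = new Thread(() -> {
                try {
                    local[me] = 10 * me;                // phase 1: compute on own data
                    barrier.await();                    // everyone finishes phase 1 first
                    seen[me] = local[(me + 1) % n];     // phase 2: read a neighbour's result
                    barrier.await();                    // global synchronization before the end
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[id].start();
        }

        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }
}
```

Without the first barrier, a thread could read its neighbour's slot before the neighbour had written it. The barrier is what makes the phase-2 reads well defined.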
16Global Address Space
- Each process has its own heap
- References can span process boundaries
[Figure: Process 0 and the other processes, each with its own local heap]
Given class T:
T gv;
T lv = null;
if (thisProc() == 0) {
    lv = new T();          // allocate locally
}
gv = broadcast lv from 0;  // distribute
... gv.field ...
17Global vs. Local References
- Global references may be slow
- distributed memory: overhead of a few instructions when using a global reference to access a local object
- shared memory: no performance implications
- Solution: use local qualifier
- statically restrict references to local objects
- example: T local lv = null;
- use only in critical sections
18Global Synchronization Analysis
- In Titanium, processes must synchronize at the same textual instances of barrier()
doThis();
barrier();
boolean x = someCondition();
if (x) {
    doThat();
    barrier();
}
doSomeMore();
barrier();
19Global Synchronization Analysis
- In Titanium, processes must synchronize at the same textual instances of barrier()
- Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow
doThis();
barrier();
boolean single x = someCondition();
if (x) {
    doThat();
    barrier();
}
doSomeMore();
barrier();
20Support for Grid-Based Computation
Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> R = [lb : ub : [2, 2]];
Domain<2> red = R + (R + [1, 1]);
foreach (p in red) { ... }
[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red, their union, spans (0, 0) to (7, 5)]
Gauss-Seidel relaxation with red-black ordering
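The red domain can be enumerated by hand to check the construction. The sketch below assumes inclusive bounds and builds R and its copy shifted by [1, 1] as explicit point lists. RedBlackSketch and its methods are hypothetical names. Concatenating the two lists stands in for the domain union only because the two sets are disjoint.

```java
import java.util.ArrayList;
import java.util.List;

final class RedBlackSketch {
    // Points of [(lx, ly) : (ux, uy) : (stride, stride)], each shifted by (dx, dy).
    static List<int[]> points(int lx, int ly, int ux, int uy, int stride, int dx, int dy) {
        List<int[]> out = new ArrayList<>();
        for (int x = lx; x <= ux; x += stride) {
            for (int y = ly; y <= uy; y += stride) {
                out.add(new int[] {x + dx, y + dy});
            }
        }
        return out;
    }

    // red = R + (R + [1, 1]) for the domain R used on the slide.
    static List<int[]> red() {
        List<int[]> red = points(0, 0, 6, 4, 2, 0, 0);   // R: both coordinates even
        red.addAll(points(0, 0, 6, 4, 2, 1, 1));         // R + [1, 1]: both coordinates odd
        return red;
    }
}
```

Every point produced this way has an even coordinate sum, which is the usual "red" half of a checkerboard colouring. The "black" points would be those with an odd sum.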
21Implementation
- Strategy
- compile Titanium into C (currently C++)
- Posix threads for SMPs (currently Solaris threads)
- Lightweight Active Messages for communication
- Status
- runs on Sun Enterprise 8-way SMP
- runs on Berkeley NOW
- trivial ports to half a dozen other architectures
- tuning for sequential performance
22Titanium Status
- Titanium language definition complete.
- Titanium compiler running.
- Compiles for uniprocessors and NOW; others soon.
- Application developments ongoing.
- Many research opportunities.
23Applications
- Three-D AMR Poisson Solver (AMR3D)
- block-structured grids with multigrid computation on each
- 2000-line program
- algorithm not yet fully implemented in other languages
- tests performance and effectiveness of language features
- Three-D Electromagnetic Waves (EM3D)
- unstructured grids
- Several smaller benchmarks
24Parallel Performance
- Numbers from Ultrasparc SMP
- Parallel efficiency good
- EM3D (unstructured kernel)
- 3D AMR limited by algorithm
[Figure: speedup versus number of processors]
25New Compiler Analyses for Parallelism
- Analysis of synchronization
- finds unmatched barriers, parallel code blocks
- extends traditional control flow analysis
- Analysis of communication
- reorder and pipeline memory operations without observed effect
- extends traditional dependence analysis
- Analyses extended to domain-specific constructs
- arrays indexed by domains of points
- looping constructs provide summarized information
26Future Directions
- Use of framework for domain-specific languages
- Fluids and AMR done
- Unstructured meshes and sparse solvers
- Better programming tools
- debuggers, performance analysis
- Optimizations
- analysis of parallel code and synchronization done
- optimizations for caches on uniprocessors and SMPs underway
- load balancing on clusters of SMPs
27Conclusions
- Performance
- sequential performance consistently close to C/FORTRAN
- currently 80% slower to 25% faster
- sequential efficiency very high
- Expressiveness
- safety of Java with small set of performance features
- extensible to new application domains
- Portability, compatibility, etc.
- no gratuitous departures from Java standard
- compilation model easily supports new platforms