Title:
1Towards an SSI for HP Java
- Francis Lau
- The University of Hong Kong
- With contributions from C.L. Wang, Ricky Ma, and
W.Z. Zhu
2Cluster Coming of Age
- HPC
- Cluster the de facto standard equipment
- Grid?
- Clusters
- Fortran or C + MPI the norm
- 99% on top of bare-bone Linux or the like
- Ok if application is embarrassingly parallel and
regular
3Cluster for the Mass
- Commercial: Data mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation
- Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling
- Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction
- Two modes
- For number crunching in Grande type applications (superman)
- As a CPU farm to support high-throughput computing (poor man)
4Cluster Programming
- Auto-parallelization tools have limited success
- Parallelization a chore but have to do it (or let's hire someone)
- Optimization for performance not many users' cup of tea
- Partitioning and parallelization
- Mapping
- Remapping (experts?)
5Amateur Parallel Programming
- Common problems
- Poor parallelization: few large chunks or many small chunks
- Load imbalances: large and small chunks
- Meeting the amateurs half-way
- They do crude parallelization
- System does the rest: mapping/remapping (automatic optimization)
- And I/O?
6Automatic Optimization
- Feed the fat boy with two spoons, and a few slim ones with one spoon
- But load information could be elusive
- Need smart runtime support
- Goal is to achieve high performance with good resource utilization and load balancing
- Large chunks that are single-threaded a problem
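One way to read the "two spoons" idea is as load-proportional allocation: a chunk receives nodes in proportion to its estimated load. The sketch below is only an illustration under that assumption. The class `SpoonAllocator`, the method `allocate`, and the load figures are hypothetical, and the code presumes load estimates are available, which the slide itself warns may not be the case.

```java
import java.util.Arrays;

/** Hypothetical sketch: hand out nodes ("spoons") in proportion to chunk load. */
final class SpoonAllocator {

    /**
     * Returns how many of totalNodes each chunk receives, using the
     * largest-remainder method so the counts always sum to totalNodes.
     * loads are assumed (not measured) positive load estimates.
     */
    static int[] allocate(long[] loads, int totalNodes) {
        long sum = Arrays.stream(loads).sum();
        int n = loads.length;
        int[] nodes = new int[n];
        long[] remainder = new long[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            long scaled = loads[i] * totalNodes;
            nodes[i] = (int) (scaled / sum);   // whole nodes earned by chunk i
            remainder[i] = scaled % sum;       // fractional claim, kept exact
            assigned += nodes[i];
        }
        // Give leftover nodes to the chunks with the largest fractional claims.
        while (assigned < totalNodes) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (remainder[i] > remainder[best]) {
                    best = i;
                }
            }
            nodes[best]++;
            remainder[best] = -1;              // each chunk gets at most one extra node
            assigned++;
        }
        return nodes;
    }

    public static void main(String[] args) {
        // One fat chunk (load 6) and two slim ones (load 2 each) on 5 nodes.
        System.out.println(Arrays.toString(allocate(new long[] {6, 2, 2}, 5)));
    }
}
```

A chunk can only use more than one node if it is multithreaded, which is why the last bullet above calls large single-threaded chunks a problem.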
7The Good Fat Boys
- Large chunks that span multiple nodes
- Must be a program with multiple execution threads
- Threads can be in different nodes: program expands and shrinks
- Threads/programs can roam around: dynamic migration
- This encourages fine-grain programming
[Figure labels: cluster node; amoeba]
8Mechanism and Policy
- Mechanism for migration
- Traditional process migration
- Thread migration
- Redirection of I/O and messages
- Object sharing between nodes for threads
- Policy for good dynamic load balancing
- Message traffic a crucial parameter
- Predictive
- Towards the single system image ideal
9Single System Image
- If user does only crude parallelization and system does the rest
- If processes/threads can roam, and processes expand/shrink
- If I/O (including sockets) can be at any node anytime
- We achieve at least 50% of SSI
- The rest is difficult
[Figure: SSI components: Single Entry Point; File System; Virtual Networking; I/O and Memory Space; Process Space; Management / Programming View]
10Bon Java!
- Java (for HPC) in good hands
- JGF Numerics Working Group, IBM Ninja, ...
- JGF Concurrency/Applications Working Group (benchmarking, MPI, ...)
- The workshops
- Java has many advantages (vs. Fortran and C/C++)
- Performance not an issue any more
- Threads as first-class citizens!
- JVM can be modified
"Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer" (The Java Grande Forum Charter)
11Process vs. Thread Migration
- Process migration easier than thread migration
- Threads are tightly coupled
- They share objects
- Two styles to explore
- Process, MPI (distributed computing)
- Thread, shared objects (parallel computing)
- Or combined
- Boils down to messages vs. distributed shared
objects
12Two Projects @ HKU
- M-JavaMPI: M for Migration
- Process migration
- I/O redirection
- Extension to grid
- No modification of JVM and MPI
- JESSICA: Java-Enabled Single System Image Computing Architecture
- By modifying JVM
- Thread migration, Amoeba mode
- Global object space, I/O redirection
- JIT mode (Version 2)
13Design Choices
- Bytecode instrumentation
- Insert code into programs, manually or via pre-processor
- JVM extension
- Make thread state accessible from Java program
- Non-transparent
- Modification of JVM is required
- Checkpointing the whole JVM process
- Powerful but heavy penalty
- Modification of JVM
- Runtime support
- Totally transparent to the applications
- Efficient but very difficult to implement
14M-JavaMPI
- Support transparent Java process migration and provide communication redirection services
- Communication using MPI
- Implemented as a middleware on top of standard JVM
- No modifications of JVM and MPI
- Checkpointing the Java process, with code insertion by preprocessor
15System Architecture
16Preprocessing
- Bytecode is modified before passing to JVM for execution
- Restoration functions are inserted as exception handlers, in the form of encapsulated try-catch statements
- Re-arrangement of bytecode, and addition of local variables
17The Layers
- Java-MPI API layer
- Restorable MPI layer
- Provides restorable MPI communications
- No modification of MPI library
- Migration Layer
- Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node
- Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application
18State Capturing and Restoring
- Program code: re-used in the destination node
- Data: captured and restored by using the object serialization mechanism
- Execution context: captured by using JVMDI and restored by inserted exception handlers
- Eager (all) strategy: for each frame, local variables, referenced objects, the name of the class and class method, and program counter are saved using object serialization
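The per-frame record described above can be illustrated with standard Java object serialization. This is a minimal sketch, not M-JavaMPI's code: the class `FrameSnapshot` and its fields are hypothetical, and the values are filled in by hand here, whereas the slides say the real system obtains them through JVMDI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical snapshot of one stack frame; field names are illustrative. */
final class FrameSnapshot implements Serializable {
    private static final long serialVersionUID = 1L;

    final String className;
    final String methodName;
    final int pc;                                   // saved program counter
    final Map<String, Serializable> locals = new LinkedHashMap<>();

    FrameSnapshot(String className, String methodName, int pc) {
        this.className = className;
        this.methodName = methodName;
        this.pc = pc;
    }

    /** Source node: turn the snapshot into bytes that can be shipped. */
    static byte[] capture(FrameSnapshot frame) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(frame);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Destination node: rebuild the snapshot from the received bytes. */
    static FrameSnapshot restore(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (FrameSnapshot) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FrameSnapshot frame = new FrameSnapshot("A", "run", 42);
        frame.locals.put("a", 7);
        frame.locals.put("b", 'x');
        FrameSnapshot copy = restore(capture(frame));
        System.out.println(copy.className + "." + copy.methodName
                + " pc=" + copy.pc + " locals=" + copy.locals);
    }
}
```

Only the data travels this way. The program code is re-used on the destination node, so the class must already be loadable there.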
19State Capturing using JVMDI
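Original class:
public class A {
    int a;
    char b;
    ...
}

After preprocessing (pseudocode; the jump is done at the bytecode level, since Java source has no goto):
public class A {
    try {
        ...
    } catch (RestorationException e) {
        a = saved value of local variable a;
        b = saved value of local variable b;
        pc = saved value of program counter when the program is suspended;
        jump to the location where the program is suspended;
    }
}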
20Message Redirection Model
- MPI daemon in each node to support message passing between distributed Java processes
- IPC between Java program and MPI daemon in the same node through shared memory and semaphores
[Figure label: client-server]
21Process migration steps
Source Node
Destination Node
22Experiments
- PC Cluster
- 16-node cluster
- 300 MHz Pentium II with 128MB of memory
- Linux 2.2.14 with Sun JDK 1.3.0
- 100Mb/s fast Ethernet
- All Java programs executed in interpreted mode
23
- Native MPI: 10.5 MB/s
- Direct Java-MPI binding: 9.2 MB/s
- Restorable MPI layer: 7.6 MB/s
24Latency PingPong Test
- Native MPI: 0.2 ms
- Direct Java-MPI binding: 0.23 ms
- Restorable MPI layer: 0.26 ms
25Migration Cost: capturing and restoring objects
26Migration Cost: capturing and restoring frames
27Application Performance
- PI calculation
- Recursive ray-tracing
- NAS integer sort
- Parallel SOR
28Time spent in calculating PI and ray-tracing with and without the migration layer
29Execution time of NAS program with different problem sizes (16 nodes)
- No noticeable overhead introduced in the computation part, while in the communication part there is an overhead of about 10-20%
30Time spent in executing SOR using different
numbers of nodes with and without migration layer
31Cost of Migration
Time spent in executing the SOR program on an
array of size 256x256 without and with one
migration during the execution
32Cost of Migration
- Time spent in migration (in seconds) for
different applications
33Dynamic Load Balancing
- A simple test
- SOR program was executed using six nodes in an unevenly loaded environment with one of the nodes executing a computationally intensive program
- Without migration: 319s
- With migration: 180s
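Read as a ratio, the two timings above give the gain from migration in this one test. This is a worked calculation from the figures on the slide and nothing more:

```latex
\[
\frac{T_{\text{without}}}{T_{\text{with}}} = \frac{319\ \text{s}}{180\ \text{s}} \approx 1.77,
\qquad
\frac{319 - 180}{319} = \frac{139}{319} \approx 43.6\%\ \text{less execution time}
\]
```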
34In Progress
- M-JavaMPI in JIT mode
- Develop system modules for automatic dynamic load balancing
- Develop system modules for effective fault-tolerance support
35Java Virtual Machine
Application Class File
Java API Class File
- Class Loader
- Loads class files
- Interpreter
- Executes bytecode
- Runtime Compiler
- Converts bytecode to native code
[Figure: Class loader -> Bytecode -> Interpreter, or Runtime compiler -> Native code]
36Threads in JVM
A Multithreaded Java Program
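public class ProducerConsumerTest {
    public static void main(String[] args) {
        // CubbyHole, Producer and Consumer are defined elsewhere.
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();   // producer thread
        c1.start();   // consumer thread
    }
}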
[Figure: inside the JVM, Threads 1-3 each have a PC and stack frames; they share the Java Method Area (code), filled from class files by the class loader, the Execution Engine, and the Heap (data) holding objects]
37JMM
Java Memory Model (how to maintain memory consistency between threads)
[Figure: threads T1 and T2 each have a per-thread working memory; main memory (the heap area) holds each object's master copy; a variable is modified in T1's working memory. Other label: Garbage Bin]
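The working-memory picture above is why shared variables need synchronization in Java. The following minimal sketch (class and method names are hypothetical) relies on a standard Java guarantee: a write made before releasing a lock is visible to any thread that later acquires the same lock.

```java
/** Hypothetical shared counter; the lock keeps working copies consistent with main memory. */
final class SharedCounter {
    private int value;                 // master copy lives in the heap

    synchronized void increment() {    // releasing the lock publishes the update
        value++;
    }

    synchronized int get() {           // acquiring the lock sees the latest published value
        return value;
    }

    /** Runs two threads (T1, T2) that each increment the counter perThread times. */
    static int runTwoThreads(int perThread) {
        SharedCounter counter = new SharedCounter();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        };
        Thread t1 = new Thread(work, "T1");
        Thread t2 = new Thread(work, "T2");
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runTwoThreads(100_000));
    }
}
```

Without the `synchronized` keyword the two threads could lose updates, and neither would be guaranteed to see the other's writes. A distributed JVM has to preserve the same guarantee when T1 and T2 run on different nodes.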
38Problems in Existing DJVMs
- Mostly based on interpreters
- Simple but slow
- Layered design using distributed shared memory system (DSM) → cannot be tightly coupled with JVM
- JVM runtime information cannot be channeled to DSM
- False sharing if page-based DSM is employed
- Page faults block the whole JVM
- Programmer to specify thread distribution → lack of transparency
- Need to rewrite multithreaded Java applications
- No dynamic thread distribution (preemptive thread migration) for load balancing
39Related Work
- Method shipping: IBM cJVM
- Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
- Executed in interpreter mode.
- Load balancing problem: affected by the object distribution.
- Page shipping: Rice U. Java/DSM, HKU JESSICA
- Simple. GOS was supported by some page-based Distributed Shared Memory (e.g., TreadMarks, JUMP, JiaJia)
- JVM runtime information can't be channeled to DSM.
- Executed in interpreter mode.
- Object shipping: Hyperion, Jackal
- Leverage some object-based DSM
- Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code
40Distributed Java Virtual Machine (DJVM)
A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads.
[Figure: JESSICA2. Java threads created in a program run on top of a Global Object Space spanning four PCs, each with its own OS, connected by a high-speed network]
41JESSICA2 Main Features
- Transparent Java thread migration
- Runtime capturing and restoring of thread execution context
- No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
- Enables dynamic load balancing on clusters
- Operated in Just-In-Time (JIT) compilation mode
- Global Object Space
- A shared global heap spanning all cluster nodes
- Adaptive object home migration protocol
- I/O redirection
[Figure labels: JESSICA2; Transparent migration; JIT; GOS]
42Transparent Thread Migration in JIT Mode
- Simple for interpreters (e.g., JESSICA)
- Interpreter sits in the bytecode decoding loop, which can be stopped when a migration flag check succeeds
- The full state of a thread is available in the data structure of the interpreter
- No register allocation
- JIT mode execution makes things complex (JESSICA2)
- Native code has no clear bytecode boundary
- How to deal with machine registers?
- How to organize the stack frames (all are in native form now)?
- How to make extracted thread states portable and recognizable by the remote JVM?
- How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
Need to modify JIT compiler to instrument native code
43Approaches
- Using JVMDI (e.g., M-JavaMPI)?
- Only newer JDKs (Aug., 2002) provide full speed debugging to support the capturing of thread status
- Portable but too heavy
- Need large data structures to keep debug information
- Only using JVMDI cannot support full function of DJVM
- How to access remote object?
- Put a DSM under it? But you can't control Sun JVM's memory allocation unless you get the latest JDK source codes
- Our lightweight approach
- Provide the minimum functions required to capture and restore Java threads to support Java thread migration
44An overview of JESSICA2 Java thread migration
[Figure: thread migration between a source node and a destination node. Components shown: Load Monitor, Thread Scheduler, Migration Manager, JVM, Method Area, GOS (heap), thread frames and PC. Labelled steps: (1) Alert; (2) Stack analysis and stack capturing; (3) Frames transferred, followed by frame parsing and restore execution; (4a) Object access; (4b) Load method from NFS]
45Essential Functions
- Migration points selection
- At the start of loop, basic block or method
- Register context handler
- Spill dirty registers at migration point without invalidation so that native code can continue the use of registers
- Use register recovering stub at restoring phase
- Variable type deduction
- Spill type in stacks using compression
- Java frames linking
- Discover consecutive Java frames
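JESSICA2 places its migration checks in JIT-generated native code, so the functions above cannot be reproduced in plain Java. The sketch below is only a Java-level analog of the first item, a migration point at the start of a loop. The names `MigratableSum`, `State`, and `run` are hypothetical. The loop polls a flag at the top of each iteration and returns with its state intact when the flag is set, so the same call can resume it later.

```java
import java.io.Serializable;
import java.util.function.BooleanSupplier;

/** Hypothetical Java-level analog of a migration point at the start of a loop. */
final class MigratableSum {

    /** The state that must survive a migration: loop index and running total. */
    static final class State implements Serializable {
        private static final long serialVersionUID = 1L;
        int next = 1;
        long total = 0;
        boolean finished = false;
    }

    /**
     * Sums 1..limit. At the start of every iteration it polls the migration
     * flag; if set, it returns early with the state left intact so that the
     * same call can resume the loop elsewhere.
     */
    static State run(State s, int limit, BooleanSupplier migrationRequested) {
        while (s.next <= limit) {
            if (migrationRequested.getAsBoolean()) {
                return s;              // migration point: state is consistent here
            }
            s.total += s.next;
            s.next++;
        }
        s.finished = true;
        return s;
    }

    public static void main(String[] args) {
        State s = new State();
        int[] polls = {0};
        // Pretend a load monitor raises the flag on the 51st poll.
        run(s, 100, () -> ++polls[0] == 51);
        System.out.println("suspended at i=" + s.next + ", total=" + s.total);
        // "Destination node": resume from the captured state, no further migration.
        run(s, 100, () -> false);
        System.out.println("finished=" + s.finished + ", total=" + s.total);
    }
}
```

Here the programmer decides what goes into `State`. The point of the register, type, and frame handling listed above is that JESSICA2 recovers the equivalent information from native stack frames with no change to the program.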
46Dynamic Thread State Capturing and Restoring in
JESSICA2
[Figure: JIT compilation pipeline instrumented for migration. Stages: bytecode verifier, migration point selection, bytecode translation, intermediate code, register allocation, code generation, native code, linking and constant resolution. Instrumentation added: 1. migration checking (cmp mflag,0; jz ...); 2. object checking (cmp objoffset,0; jz ...); 3. type and register spilling (mov 0x110182, slot ...). Capturing: native stack scanning over the native thread stack (Java frames and C frames). Restoring: register recovering (mov slot1->reg1; mov slot2->reg2; ...) and global object access]
47How to Maintain Memory Consistency in a
Distributed Environment?
[Figure: threads T1-T8 spread over four PCs, each with its own OS, connected by a high-speed network; the heaps sit on separate nodes]
48Embedded Global Object Space (GOS)
- Take advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
- Use threaded I/O interface inside JVM for communication to hide the latency → Non-blocking GOS access
- OO-based to reduce false sharing
- Home-based, compliant with JVM Memory Model (Lazy Release Consistency)
- Master heap (home objects) and cache heap (local and cached objects): reduce object access latency
49Object Cache
50Adaptive object home migration
- Definition
- Home of an object: the JVM that holds the master copy of an object
- Problems
- Cache objects need to be flushed and re-fetched from the home whenever synchronization happens
- Adaptive object home migration
- If the number of accesses from a thread dominates the total number of accesses to an object, the object home will be migrated to the node where the thread is running
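The migration rule above can be sketched as simple per-object bookkeeping. This is an illustration under stated assumptions, not the JESSICA2 protocol: the class `HomeTracker`, the dominance threshold, and the minimum access count are hypothetical, and accesses are counted per node rather than per thread for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping for one shared object in a home-based GOS. */
final class HomeTracker {
    private final double threshold;    // assumed dominance ratio, e.g. 0.5
    private final int minAccesses;     // assumed warm-up before any decision
    private final Map<Integer, Integer> accessesByNode = new HashMap<>();
    private int totalAccesses;
    private int homeNode;

    HomeTracker(int initialHome, double threshold, int minAccesses) {
        this.homeNode = initialHome;
        this.threshold = threshold;
        this.minAccesses = minAccesses;
    }

    int home() {
        return homeNode;
    }

    /** Records one access from nodeId and migrates the home if that node dominates. */
    void recordAccess(int nodeId) {
        int count = accessesByNode.merge(nodeId, 1, Integer::sum);
        totalAccesses++;
        boolean dominates = count > threshold * totalAccesses;
        if (nodeId != homeNode && totalAccesses >= minAccesses && dominates) {
            homeNode = nodeId;         // the master copy moves to the dominant node
            accessesByNode.clear();    // start a fresh observation window
            totalAccesses = 0;
        }
    }

    public static void main(String[] args) {
        HomeTracker tracker = new HomeTracker(0, 0.5, 4);
        int[] accesses = {0, 1, 1, 1};
        for (int node : accesses) {
            tracker.recordAccess(node);
        }
        System.out.println("home is now node " + tracker.home());
    }
}
```

Once the home has moved, the dominant thread works on the master copy directly and avoids the flush and re-fetch at each synchronization point.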
51I/O redirection
- Timer
- Use the time in master node as the standard time
- Calibrate the time in worker nodes when they register with the master node
- File I/O
- Use half word of fd as node number
- Open file
- For read, check local first, then master node
- For write, go to master node
- Read/Write
- Go to the node specified by the node number in fd
- Network I/O
- Connectionless send: do it locally
- Others, go to master
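The "half word of fd as node number" scheme can be illustrated with bit packing. The slide does not say which half holds the node number or how wide the descriptor is. The sketch assumes a 32-bit descriptor with the node number in the upper 16 bits, and the class `GlobalFd` is hypothetical.

```java
/** Hypothetical layout: upper 16 bits of a 32-bit fd = node number, lower 16 bits = local fd. */
final class GlobalFd {
    private static final int HALF_WORD_BITS = 16;
    private static final int HALF_WORD_MASK = 0xFFFF;

    /** Packs a node number and a node-local descriptor into one global descriptor. */
    static int encode(int nodeNumber, int localFd) {
        if (nodeNumber < 0 || nodeNumber > HALF_WORD_MASK
                || localFd < 0 || localFd > HALF_WORD_MASK) {
            throw new IllegalArgumentException("node number and local fd must fit in 16 bits");
        }
        return (nodeNumber << HALF_WORD_BITS) | localFd;
    }

    /** Node that owns the file; reads and writes are forwarded there. */
    static int nodeOf(int globalFd) {
        return (globalFd >>> HALF_WORD_BITS) & HALF_WORD_MASK;
    }

    /** Descriptor to use on the owning node. */
    static int localFdOf(int globalFd) {
        return globalFd & HALF_WORD_MASK;
    }

    public static void main(String[] args) {
        int fd = encode(3, 42);
        System.out.println("global fd " + fd + " -> node " + nodeOf(fd)
                + ", local fd " + localFdOf(fd));
    }
}
```

With this encoding a migrated thread can keep using the same descriptor value, since the node number inside it says where each read or write must go.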
52Experimental Setting
- Modified Kaffe Open JVM version 1.0.6
- Linux PC clusters
- Pentium II PCs at 540MHz (Linux 2.2.1 kernel) connected by Fast Ethernet
- HKU Gideon 300 Cluster (for the Ray Tracing demo)
53Parallel Ray Tracing on JESSICA2 (Using 64 nodes of the Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Redhat 7.3)
- 64 nodes: 108 seconds
- 1 node: 4420 seconds (> 1 hour)
- Speedup: 4402/108 = 40.75
54Micro Benchmarks
(PI Calculation)
55Java Grande Benchmark
56SPECjvm98 Benchmark
- M-: disabling migration mechanism
- M+: enabling migration
- I+: enabling pseudo-inlining
- I-: disabling pseudo-inlining
57JESSICA2 vs JESSICA (CPI)
58Application Performance
59Effect of Adaptive Object Home Migration (SOR)
60Work in Progress
- New optimization techniques for GOS
- Incremental Distributed GC
- Load balancing module
- Enhanced single I/O space to benefit more real-life applications
- Parallel I/O support
61Conclusion
- Effective HPC for the mass
- They supply the (parallel) program, system does the rest
- Let's hope for parallelizing compilers
- Small to medium grain programming
- SSI the ideal
- Java the choice
- Poor man mode too
- Thread distribution and migration feasible
- Overhead reduction
- Advances in low-latency networking
- Migration as intrinsic function (JVM, OS, hardware)
- Grid and pervasive computing
62Some Publications
- W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers," ICPP 2003, Taiwan, October 2003.
- W.J. Fang, C.L. Wang, and F.C.M. Lau, "On the Design of Global Object Space for Efficient Multi-threading Java Computing on Clusters," Parallel Computing, to appear.
- W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support," CLUSTER 2002, Chicago, September 2002, 381-388.
- R. Ma, C.L. Wang, and F.C.M. Lau, "M-JavaMPI: A Java-MPI Binding with Process Migration Support," CCGrid 2002, Berlin, May 2002.
- M.J.M. Ma, C.L. Wang, and F.C.M. Lau, "JESSICA: Java-Enabled Single-System-Image Computing Architecture," Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, 1194-1222.
63THE END And Thanks!