Title: ICPP; Author: HKU; Last modified by: HKU; Created: 10/5/2003 10:01:10 AM; Company: HKU


1
Towards an SSI for HP Java

Francis Lau
The University of Hong Kong
With contributions from C.L. Wang, Ricky Ma, and
W.Z. Zhu

2
Cluster Coming of Age

HPC
Cluster: the de facto standard equipment
Grid?
Clusters
Fortran or C + MPI: the norm
99% on top of bare-bones Linux or the like
OK if the application is embarrassingly parallel and
regular

3
Cluster for the Mass
Commercial: Data mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation
Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling
Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction

Two modes
For number crunching in Grande type applications
(superman)
As a CPU farm to support high-throughput
computing (poor man)

4
Cluster Programming

Auto-parallelization tools have limited success
Parallelization: a chore, but one has to do it (or
let's hire someone)
Optimization for performance: not many users' cup
of tea
Partitioning and parallelization
Mapping
Remapping (experts?)

5
Amateur Parallel Programming

Common problems
Poor parallelization: few large chunks or many
small chunks
Load imbalances: large and small chunks
Meeting the amateurs half-way
They do crude parallelization
System does the rest: mapping/remapping
(automatic optimization)
And I/O?

6
Automatic Optimization

Feed the fat boy with two spoons, and a few slim
ones with one spoon
But load information could be elusive
Need smart runtime supports
Goal is to achieve high performance with good
resource utilization and load balancing
Large chunks that are single-threaded: a problem

7
The Good Fat Boys

Large chunks that span multiple nodes
Must be a program with multiple execution
threads
Threads can be in different nodes: the program
expands and shrinks
Threads/programs can roam around: dynamic
migration
This encourages fine-grain programming

[Figure labels: cluster node, amoeba]

8
Mechanism and Policy

Mechanism for migration
Traditional process migration
Thread migration
Redirection of I/O and messages
Objects sharing between nodes for threads
Policy for good dynamic load balancing
Message traffic a crucial parameter
Predictive
Towards the single system image ideal

9
Single System Image

If user does only crude parallelization and
system does the rest
If processes/threads can roam, and processes
expand/shrink
If I/O (including sockets) can be at any node
anytime
We achieve at least 50% of SSI
The rest is difficult

[Figure: SSI components: Single Entry Point, File System, Virtual Networking, I/O and Memory Space, Process Space, Management / Programming View]

10
Bon Java!

Java (for HPC) in good hands
JGF Numerics Working Group, IBM Ninja,
JGF Concurrency/Applications Working Group
(benchmarking, MPI, ...)
The workshops
Java has many advantages (vs. Fortran and C/C++)
Performance not an issue any more
Threads as first-class citizens!
JVM can be modified

"Java has the greatest potential to deliver an
attractive productive programming environment
spanning the very broad range of tasks needed by
the Grande programmer" (The Java Grande Forum
Charter)

11
Process vs. Thread Migration

Process migration easier than thread migration
Threads are tightly coupled
They share objects
Two styles to explore
Process, MPI (distributed computing)
Thread, shared objects (parallel computing)
Or combined
Boils down to messages vs. distributed shared
objects

12
Two Projects @ HKU

M-JavaMPI: "M" for "Migration"
Process migration
I/O redirection
Extension to grid
No modification of JVM and MPI
JESSICA: Java-Enabled Single System Image
Computing Architecture
By modifying JVM
Thread migration, Amoeba mode
Global object space, I/O redirection
JIT mode (Version 2)

13
Design Choices

Bytecode instrumentation
Insert code into programs, manually or via
pre-processor
JVM extension
Make thread state accessible from Java program
Non-transparent
Modification of JVM is required
Checkpointing the whole JVM process
Powerful but heavy penalty
Modification of JVM
Runtime support
Totally transparent to the applications
Efficient but very difficult to implement

14
M-JavaMPI

Support transparent Java process migration and
provide communication redirection services
Communication using MPI
Implemented as a middleware on top of standard
JVM
No modifications of JVM and MPI
Checkpointing the Java process, with code insertion
by preprocessor

15
System Architecture
16
Preprocessing

Bytecode is modified before passing to JVM for
execution
Restoration functions are inserted as exception
handlers, in the form of encapsulated try-catch
statements
Re-arrangement of bytecode, and addition of local
variables

17
The Layers

Java-MPI API layer
Restorable MPI layer
Provides restorable MPI communications
No modification of MPI library
Migration Layer
Captures and saves the execution state of the
migrating process in the source node, and
restores the execution state of the migrated
process in the destination node
Cooperates with the Restorable MPI layer to
reconstruct the communication channels of the
parallel application

18
State Capturing and Restoring

Program code re-used in the destination node
Data captured and restored by using the object
serialization mechanism
Execution context captured by using JVMDI and
restored by inserted exception handlers
Eager (all) strategy: for each frame, the local
variables, referenced objects, the name of the
class and class method, and the program counter are
saved using object serialization

19
State Capturing using JVMDI
public class A {
    try {
        ...
    } catch (RestorationException e) {
        a = saved value of local variable a;
        b = saved value of local variable b;
        pc = saved value of program counter when the program is suspended;
        jump to the location where the program is suspended
    }
}

public class A {
    int a;
    char b;
    ...
}
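
The restoration idea on the last two slides can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the M-JavaMPI implementation: the class RestorableSum, its Frame and Suspended types, and the ship() helper are hypothetical names, the frame is captured by hand rather than through JVMDI, and the handler is written in source code rather than inserted into bytecode. Java source has no goto, so restoring the loop index plays the role of the saved program counter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class RestorableSum {

    /** Saved locals of sum(), captured at the suspension point (hypothetical type). */
    static final class Frame implements Serializable {
        private static final long serialVersionUID = 1L;
        final int i;
        final long total;
        Frame(int i, long total) { this.i = i; this.total = total; }
    }

    /** Thrown at method entry on the destination side to trigger the restoration handler. */
    static final class RestorationException extends RuntimeException {
        final Frame frame;
        RestorationException(Frame frame) { this.frame = frame; }
    }

    /** Thrown on the source side at the suspension point; carries the captured frame. */
    static final class Suspended extends RuntimeException {
        final Frame frame;
        Suspended(Frame frame) { this.frame = frame; }
    }

    /** Sums 0..n-1; suspends before adding index suspendAt; resumes from saved if non-null. */
    static long sum(int n, int suspendAt, Frame saved) {
        int i = 0;
        long total = 0;
        try {
            if (saved != null) {
                throw new RestorationException(saved);
            }
        } catch (RestorationException e) {
            // Restoration handler: reload the saved local variables.
            i = e.frame.i;
            total = e.frame.total;
        }
        // Restoring i is what "jumps" back to the point of suspension.
        for (; i < n; i++) {
            if (i == suspendAt) {
                throw new Suspended(new Frame(i, total));
            }
            total += i;
        }
        return total;
    }

    /** Serializes and deserializes a frame, standing in for shipping it between nodes. */
    static Frame ship(Frame f) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(f);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Frame) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Frame captured = null;
        try {
            sum(10, 4, null);                 // "source node": suspends at i == 4
        } catch (Suspended s) {
            captured = s.frame;
        }
        System.out.println(sum(10, -1, ship(captured)));  // "destination node": resumes
    }
}
```

The same program code runs on both sides; only the serialized frame travels, which mirrors the "program code re-used in the destination node" point on slide 18.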

20
Message Redirection Model

MPI daemon in each node to support message
passing between distributed java processes
IPC between Java program and MPI daemon in the
same node through shared memory and semaphores

[Figure label: client-server]
21
Process Migration Steps
[Figure: steps at the source node and the destination node]
22
Experiments

PC Cluster
16-node cluster
300 MHz Pentium II with 128MB of memory
Linux 2.2.14 with Sun JDK 1.3.0
100Mb/s fast Ethernet
All Java programs executed in interpreted mode

23
Bandwidth PingPong Test

Native MPI: 10.5 MB/s
Direct Java-MPI binding: 9.2 MB/s
Restorable MPI layer: 7.6 MB/s

24
Latency PingPong Test

Native MPI: 0.2 ms
Direct Java-MPI binding: 0.23 ms
Restorable MPI layer: 0.26 ms

25
Migration Cost: capturing and restoring objects

26
Migration Cost: capturing and restoring frames
27
Application Performance

PI calculation
Recursive ray-tracing
NAS integer sort
Parallel SOR

28
Time spent in calculating PI and ray-tracing with
and without the migration layer

29
Execution time of NAS program with different
problem sizes (16 nodes)

No noticeable overhead is introduced in the
computation part, while the communication part
shows an overhead of about 10-20%
30
Time spent in executing SOR using different
numbers of nodes with and without migration layer
31
Cost of Migration

Time spent in executing the SOR program on an
array of size 256x256 without and with one
migration during the execution
32
Cost of Migration

Time spent in migration (in seconds) for
different applications

33
Dynamic Load Balancing

A simple test
SOR program was executed using six nodes in an
unevenly loaded environment with one of the nodes
executing a computationally intensive program
Without migration: 319 s
With migration: 180 s
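
Worked out from the two timings above (derived figures, not numbers stated on the slide):

```latex
\[
\text{speedup} = \frac{319\ \text{s}}{180\ \text{s}} \approx 1.77,
\qquad
\text{time saved} = \frac{319 - 180}{319} \approx 43.6\%
\]
```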

34
In Progress

M-JavaMPI in JIT mode
Develop system modules for automatic dynamic load
balancing
Develop system modules for effective
fault-tolerant supports

35
Java Virtual Machine
Application Class File
Java API Class File

Class Loader
Loads class files
Interpreter
Executes bytecode
Runtime Compiler
Converts bytecode to native code

[Figure: the class loader feeds bytecode to the interpreter or to the runtime compiler, which converts it to native code]
36
Threads in JVM
A Multithreaded Java Program
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
[Figure labels: Threads 1 to 3, PC, stack frames, Java method area (code), heap (data) holding objects, class loader, class files, execution engine]
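
The slide's program relies on CubbyHole, Producer, and Consumer classes that the transcript does not show. The following is a self-contained sketch of how they could look, assuming a one-slot buffer guarded by wait/notifyAll; the class bodies, the count of ten items, and the runOnce() helper are assumptions for illustration, not the slide's code.

```java
final class ProducerConsumerSketch {

    /** One-slot shared buffer: put() blocks while full, get() blocks while empty. */
    static final class CubbyHole {
        private int contents;
        private boolean available = false;

        synchronized int get() {
            while (!available) {
                awaitChange();
            }
            available = false;
            notifyAll();            // wake a producer waiting for the slot to empty
            return contents;
        }

        synchronized void put(int value) {
            while (available) {
                awaitChange();
            }
            contents = value;
            available = true;
            notifyAll();            // wake a consumer waiting for a value
        }

        // Caller must hold this object's monitor.
        private void awaitChange() {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    static final class Producer extends Thread {
        private final CubbyHole hole;
        Producer(CubbyHole hole, int id) { super("Producer-" + id); this.hole = hole; }
        @Override public void run() {
            for (int i = 0; i < 10; i++) {
                hole.put(i);
            }
        }
    }

    static final class Consumer extends Thread {
        private final CubbyHole hole;
        int sum;                    // read by the caller only after join()
        Consumer(CubbyHole hole, int id) { super("Consumer-" + id); this.hole = hole; }
        @Override public void run() {
            for (int i = 0; i < 10; i++) {
                sum += hole.get();
            }
        }
    }

    /** Runs one producer and one consumer to completion and returns the consumed total. */
    static int runOnce() {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
        try {
            p1.join();
            c1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c1.sum;
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Both threads share the one CubbyHole object on the heap, while each has its own stack frames and program counter, which is the layout the slide's figure depicts.
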
37
JMM
Java Memory Model (How to maintain memory
consistency between threads)
Variable is modified in T1's working memory.
[Figure labels: threads T1 and T2, per-thread working memory, main memory, heap area, object master copy, variable, garbage bin]
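
A small single-JVM illustration of the consistency question on this slide, assuming standard Java semantics only (it does not involve a distributed JVM). The class VisibilityDemo and its field names are hypothetical. A write made by T1 is only guaranteed to be seen by T2 once a synchronization action, here a volatile write and read, orders the two threads.

```java
final class VisibilityDemo {
    private int data;                // plain field, written before the flag
    private volatile boolean ready;  // volatile flag that publishes the write

    /** Starts a writer thread (T1) and reads from the calling thread (T2). */
    int runOnce() {
        Thread t1 = new Thread(() -> {
            data = 42;     // the update may initially exist only in T1's view
            ready = true;  // volatile write: makes the earlier write visible
        });
        t1.start();
        while (!ready) {
            Thread.yield();  // T2 keeps re-reading the volatile flag
        }
        int seen = data;     // ordered after the volatile read, so it observes 42
        try {
            t1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(new VisibilityDemo().runOnce());
    }
}
```

Without the volatile modifier on ready, the Java Memory Model would not guarantee that T2 ever sees either write.
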
38
Problems in Existing DJVMs

Mostly based on interpreters
Simple but slow
Layered design using distributed shared memory
system (DSM) → cannot be tightly coupled with JVM
JVM runtime information cannot be channeled to
DSM
False sharing if page-based DSM is employed
Page faults block the whole JVM
Programmer to specify thread distribution → lack
of transparency
Need to rewrite multithreaded Java applications
No dynamic thread distribution (preemptive thread
migration) for load balancing

39
Related Work

Method shipping: IBM cJVM
Like remote method invocation (RMI): when
accessing object fields, the proxy redirects the
flow of execution to the node where the object's
master copy is located.
Executed in interpreter mode.
Load balancing is affected by the object
distribution.
Page shipping: Rice U. Java/DSM, HKU JESSICA
Simple. The GOS is supported by a page-based
Distributed Shared Memory (e.g., TreadMarks,
JUMP, JiaJia).
JVM runtime information can't be channeled to
the DSM.
Executed in interpreter mode.
Object shipping: Hyperion, Jackal
Leverage some object-based DSM.
Executed in native mode: Hyperion translates Java
bytecode to C; Jackal compiles Java source code
directly to native code.

40
Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java Virtual Machine
(DJVM) spanning multiple cluster nodes can
provide a true parallel execution environment for
multithreaded Java applications with a Single
System Image illusion to Java threads.
[Figure: Java threads created in a program run on a Global Object Space that spans four PCs, each with its own OS, connected by a high-speed network]
41
JESSICA2 Main Features

Transparent Java thread migration
Runtime capturing and restoring of thread
execution context.
No source code modification, no bytecode
instrumentation (preprocessing), no new API
introduced
Enables dynamic load balancing on clusters
Operated in Just-In-Time (JIT) compilation Mode
Global Object Space
A shared global heap spanning all cluster nodes
Adaptive object home migration protocol
I/O redirection

[Figure labels: transparent migration, JIT, GOS]
42
Transparent Thread Migration in JIT Mode

Simple for interpreters (e.g., JESSICA)
The interpreter sits in the bytecode decoding loop,
which can be stopped when a migration flag check
succeeds
The full state of a thread is available in the
data structure of interpreter
No register allocation
JIT mode execution makes things complex
(JESSICA2)
Native code has no clear bytecode boundary
How to deal with machine registers?
How to organize the stack frames (all are in
native form now)?
How to make extracted thread states portable and
recognizable by the remote JVM?
How to restore the extracted states (rebuild the
stack frames) and restart the execution in native
form?

Need to modify JIT compiler to instrument native
code
43
Approaches

Using JVMDI (e.g., M-JavaMPI)?
Only newer JDKs (Aug. 2002) provide full-speed
debugging to support the capturing of thread
status
Portable but too heavy:
needs large data structures to keep debug
information
JVMDI alone cannot support the full functionality
of a DJVM
How to access remote objects?
Put a DSM under it? But you can't control the Sun
JVM's memory allocation unless you get the latest
JDK source code
Our lightweight approach
Provide the minimum functions required to capture
and restore Java threads to support Java thread
migration

44
An overview of JESSICA2 Java thread migration

[Figure: a thread migrates from the source node to the destination node. Labels: (1) Alert (Load Monitor); (2) stack analysis and stack capturing (Thread Scheduler, source JVM); (3) frames passed through the Migration Manager, then frame parsing and restore execution; (4a) object access through the GOS (heap); (4b) load method from NFS. Each node shows a method area, frames, and a PC.]
45
Essential Functions

Migration points selection
At the start of a loop, basic block, or method
Register context handler
Spill dirty registers at migration points without
invalidation so that native code can continue to
use the registers
Use a register-recovering stub in the restoring phase
Variable type deduction
Spill types onto the stack using compression
Java frames linking
Discover consecutive Java frames
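
The migration-point idea can be illustrated at the Java source level. This is only an analogy under stated assumptions: JESSICA2 instruments JIT-generated native code, whereas the sketch below polls a flag at the head of a loop and returns the live variables as a State object. The names MigrationPointSketch, State, and run are hypothetical.

```java
import java.util.function.BooleanSupplier;

final class MigrationPointSketch {

    /** Live variables of the loop, i.e. what would be spilled at a migration point. */
    static final class State {
        final int next;          // index of the next iteration to execute
        final long acc;          // running sum of squares
        final boolean finished;  // true once the loop has run to completion
        State(int next, long acc, boolean finished) {
            this.next = next;
            this.acc = acc;
            this.finished = finished;
        }
    }

    /** Runs the loop from the given state until it finishes or the flag is raised. */
    static State run(State from, int n, BooleanSupplier migrationFlag) {
        int i = from.next;
        long acc = from.acc;
        while (i < n) {
            // Migration point at the start of each loop iteration:
            // the source-level counterpart of a "compare flag, branch" check.
            if (migrationFlag.getAsBoolean()) {
                return new State(i, acc, false);  // capture state and stop here
            }
            acc += (long) i * i;
            i++;
        }
        return new State(i, acc, true);
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Hypothetical load monitor: raises the flag on the fifth poll.
        State captured = run(new State(0, 0, false), 10, () -> ++polls[0] == 5);
        // "Destination node": resume from the captured state with the flag down.
        State done = run(captured, 10, () -> false);
        System.out.println(captured.next + " " + done.acc);
    }
}
```

Checking only at loop heads keeps the polling cost low while bounding how long a thread can run before it notices a migration request.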

46
Dynamic Thread State Capturing and Restoring in
JESSICA2
[Figure: JIT compilation pipeline with instrumentation. Stages: bytecode verifier, migration point selection, bytecode translation, intermediate code, register allocation, code generation, native code, linking and constant resolution, invoke. Instrumentation: 1. add migration checking (cmp mflag,0; jz ...); 2. add object checking (cmp objoffset,0; jz ...); 3. add type and register spilling (mov 0x110182, slot ...). Capturing: native stack scanning over the Java frames and C frames of the native thread stack. Restoring: register recovering (mov slot1->reg1; mov slot2->reg2; ...). Global object access.]
47
How to Maintain Memory Consistency in a
Distributed Environment?
[Figure: threads T1 to T8 and their heaps spread over four PCs, each running its own OS, connected by a high-speed network]
48
Embedded Global Object Space (GOS)

Take advantage of JVM runtime information for
optimization (e.g., object types, accessing
threads, etc.)
Use threaded I/O interface inside JVM for
communication to hide the latency → non-blocking
GOS access
OO-based to reduce false sharing
Home-based, compliant with JVM Memory Model
(Lazy Release Consistency)
Master heap (home objects) and cache heap (local
and cached objects) reduce object access latency
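
The master heap / cache heap split can be sketched as follows. This is a deliberately simplified, single-threaded model under stated assumptions, not the JESSICA2 protocol: objects are reduced to integers keyed by a hypothetical string id, acquire() conservatively drops every clean cached copy, and release() writes all dirty values back to the home. The names GosSketch, Home, and Node are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class GosSketch {

    /** Master heap: holds the home (master) copy of every object. */
    static final class Home {
        private final Map<String, Integer> master = new HashMap<>();

        synchronized int fetch(String id) {
            return master.getOrDefault(id, 0);
        }

        synchronized void writeBack(Map<String, Integer> updates) {
            master.putAll(updates);
        }
    }

    /** One node's cache heap in front of the home. */
    static final class Node {
        private final Home home;
        private final Map<String, Integer> cache = new HashMap<>();
        private final Map<String, Integer> dirty = new HashMap<>();

        Node(Home home) { this.home = home; }

        /** Reads hit the local cache; a miss faults the object in from its home. */
        int read(String id) {
            return cache.computeIfAbsent(id, home::fetch);
        }

        /** Writes stay local until the next release. */
        void write(String id, int value) {
            cache.put(id, value);
            dirty.put(id, value);
        }

        /** Lock acquire: discard clean cached copies so later reads re-fetch from home. */
        void acquire() {
            cache.keySet().retainAll(dirty.keySet());
        }

        /** Lock release: flush local modifications to the home copies. */
        void release() {
            home.writeBack(dirty);
            dirty.clear();
        }
    }

    public static void main(String[] args) {
        Home home = new Home();
        Node a = new Node(home);
        Node b = new Node(home);
        b.read("x");                      // b caches the initial value 0
        a.acquire();
        a.write("x", 7);
        a.release();                      // 7 reaches the home copy
        System.out.println(b.read("x"));  // still the cached copy
        b.acquire();
        System.out.println(b.read("x"));  // re-fetched from home
    }
}
```

Between synchronization points a node works entirely on its cache heap, which is where the reduced access latency comes from; the cost is the flush and re-fetch at each acquire and release.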

49
Object Cache
50
Adaptive object home migration

Definition
home of an object: the JVM that holds the
master copy of the object
Problems
cached objects need to be flushed and re-fetched
from the home whenever synchronization happens
Adaptive object home migration
if the number of accesses from a thread dominates
the total number of accesses to an object, the
object's home is migrated to the node where the
thread is running
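
A minimal sketch of the home-migration rule, under stated assumptions: accesses are counted per node rather than per thread, "dominates" is taken to mean exceeding a configurable fraction of all accesses in the current observation window, and a minimum sample count avoids migrating on the first few accesses. The slide does not give the actual threshold or bookkeeping, so the names and parameters here (HomeMigrationSketch, ObjectHome, threshold, minSamples) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class HomeMigrationSketch {

    /** Tracks accesses to one object and moves its home when one node dominates. */
    static final class ObjectHome {
        private int homeNode;
        private int total;
        private final Map<Integer, Integer> accessesByNode = new HashMap<>();
        private final double threshold;  // assumed dominance fraction, e.g. 0.5
        private final int minSamples;    // assumed minimum window size

        ObjectHome(int initialHome, double threshold, int minSamples) {
            this.homeNode = initialHome;
            this.threshold = threshold;
            this.minSamples = minSamples;
        }

        /** Records one access from the given node and returns the (possibly new) home. */
        int recordAccess(int node) {
            int count = accessesByNode.merge(node, 1, Integer::sum);
            total++;
            boolean dominates = total >= minSamples && count > threshold * total;
            if (node != homeNode && dominates) {
                homeNode = node;          // migrate the master copy to the dominant node
                accessesByNode.clear();   // start a fresh observation window
                total = 0;
            }
            return homeNode;
        }

        int home() { return homeNode; }
    }

    public static void main(String[] args) {
        ObjectHome x = new ObjectHome(0, 0.5, 4);
        x.recordAccess(0);
        x.recordAccess(1);
        x.recordAccess(1);
        System.out.println(x.recordAccess(1));  // node 1 now has 3 of 4 accesses
    }
}
```

Once the home sits on the node that uses the object most, that node's accesses no longer need the flush and re-fetch round trips listed under "Problems".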

51
I/O redirection

Timer
Use the time in master node as the standard time
Calibrate the time in each worker node when it
registers with the master node
File I/O
Use half word of fd as node number
Open file
For read, check local first, then master node
For write, go to master node
Read/Write
Go to the node specified by the node number in fd
Network I/O
Connectionless send: do it locally
Others, go to master
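
The file-descriptor scheme above can be sketched with simple bit operations. The assumptions are stated up front: descriptors are 32-bit integers, the node number sits in the upper half word and the node-local descriptor in the lower half word (the slide does not say which half is which), and the master node has id 0. The class GlobalFd and its methods are hypothetical.

```java
final class GlobalFd {

    static final int MASTER = 0;  // assumed id of the master node

    /** Packs a node number and a node-local descriptor into one global descriptor. */
    static int encode(int node, int localFd) {
        if (node < 0 || node > 0x7FFF || localFd < 0 || localFd > 0xFFFF) {
            throw new IllegalArgumentException("node or fd out of range");
        }
        return (node << 16) | localFd;
    }

    /** Node that owns the descriptor: where read/write requests are sent. */
    static int nodeOf(int fd) {
        return fd >>> 16;
    }

    /** Descriptor number as known to the owning node. */
    static int localFdOf(int fd) {
        return fd & 0xFFFF;
    }

    /** Open policy from the slide: reads try the local node first, writes go to master. */
    static int nodeForOpen(boolean forWrite, boolean existsLocally, int localNode) {
        if (forWrite) {
            return MASTER;
        }
        return existsLocally ? localNode : MASTER;
    }

    public static void main(String[] args) {
        int fd = encode(3, 5);
        System.out.println(nodeOf(fd) + " " + localFdOf(fd));
    }
}
```

Because the owning node is recoverable from the descriptor itself, a migrated thread can keep using the same descriptor value on any node.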

52
Experimental Setting

Modified Kaffe Open JVM version 1.0.6
Linux PC clusters
Pentium II PCs at 540MHz (Linux 2.2.1 kernel)
connected by Fast Ethernet
HKU Gideon 300 Cluster (for the Ray Tracing demo)

53
Parallel Ray Tracing on JESSICA2 (using 64 nodes
of the Gideon 300 cluster)
Linux 2.4.18-3 kernel (Red Hat 7.3)
64 nodes: 108 seconds
1 node: 4420 seconds (over 1 hour)
Speedup = 4402/108 = 40.75
54
Micro Benchmarks
(PI Calculation)
55
Java Grande Benchmark
56
SPECjvm98 Benchmark
M-: migration mechanism disabled; M+: migration
enabled; I+: pseudo-inlining enabled; I-:
pseudo-inlining disabled
57
JESSICA2 vs JESSICA (CPI)
58
Application Performance
59
Effect of Adaptive Object Home Migration (SOR)
60
Work in Progress

New optimization techniques for GOS
Incremental Distributed GC
Load balancing module
Enhanced single I/O space to benefit more
real-life applications
Parallel I/O support

61
Conclusion

Effective HPC for the mass
They supply the (parallel) program, the system does
the rest
Let's hope for parallelizing compilers
Small to medium grain programming
SSI: the ideal
Java: the choice
Poor man mode too
Thread distribution and migration feasible
Overhead reduction
Advances in low-latency networking
Migration as intrinsic function (JVM, OS,
hardware)
Grid and pervasive computing

62
Some Publications

W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers," ICPP 2003, Taiwan, October 2003.
W.J. Fang, C.L. Wang, and F.C.M. Lau, "On the Design of Global Object Space for Efficient Multi-threading Java Computing on Clusters," Parallel Computing, to appear.
W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support," CLUSTER 2002, Chicago, September 2002, 381-388.
R. Ma, C.L. Wang, and F.C.M. Lau, "M-JavaMPI: A Java-MPI Binding with Process Migration Support," CCGrid 2002, Berlin, May 2002.
M.J.M. Ma, C.L. Wang, and F.C.M. Lau, "JESSICA: Java-Enabled Single-System-Image Computing Architecture," Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, 1194-1222.