1
Towards an SSI for HP Java
  • Francis Lau
  • The University of Hong Kong
  • With contributions from C.L. Wang, Ricky Ma, and
    W.Z. Zhu

2
Cluster Coming of Age
  • HPC
  • Cluster the de facto standard equipment
  • Grid?
  • Clusters
  • Fortran or C + MPI the norm
  • 99% on top of bare-bone Linux or the like
  • OK if the application is embarrassingly parallel
    and regular

3
Cluster for the Mass
Commercial: Data Mining, Financial Modeling, Oil
Reservoir Simulation, Seismic Data Processing,
Vehicle and Aircraft Simulation
Government: Nuclear Stockpile Stewardship, Climate
and Weather, Satellite Image Processing, Forces
Modeling
Academic: Fundamental Physics (particles,
relativity, cosmology), Biochemistry,
Environmental Engineering, Earthquake Prediction
  • Two modes
  • For number crunching in Grande-type applications
    (superman)
  • As a CPU farm to support high-throughput
    computing (poor man)

4
Cluster Programming
  • Auto-parallelization tools have limited success
  • Parallelization a chore, but we have to do it (or
    let's hire someone)
  • Optimization for performance not many users' cup
    of tea
  • Partitioning and parallelization
  • Mapping
  • Remapping (experts?)

5
Amateur Parallel Programming
  • Common problems
  • Poor parallelization: few large chunks or many
    small chunks
  • Load imbalance: large and small chunks mixed
  • Meeting the amateurs half-way
  • They do crude parallelization
  • System does the rest: mapping/remapping
    (automatic optimization)
  • And I/O?

6
Automatic Optimization
  • Feed the fat boy with two spoons, and a few slim
    ones with one spoon
  • But load information could be elusive
  • Need smart runtime support
  • Goal is to achieve high performance with good
    resource utilization and load balancing
  • Large chunks that are single-threaded are a
    problem

7
The Good Fat Boys
  • Large chunks that span multiple nodes
  • Must be a program with multiple execution
    threads
  • Threads can be in different nodes: the program
    expands and shrinks
  • Threads/programs can roam around: dynamic
    migration
  • This encourages fine-grain programming

(Figure: an "amoeba" program whose threads span cluster nodes)
8
Mechanism and Policy
  • Mechanism for migration
  • Traditional process migration
  • Thread migration
  • Redirection of I/O and messages
  • Object sharing between nodes for threads
  • Policy for good dynamic load balancing (a simple
    sketch follows below)
  • Message traffic a crucial parameter
  • Predictive
  • Towards the single system image ideal
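A minimal sketch (my illustration, not the slides' actual policy) of
a migration decision that treats message traffic as a crucial input:
a thread on an overloaded node is a migration candidate only if it
communicates little, since heavy traffic ties it to its peers. The
threshold values are assumptions; a real policy would be predictive,
using load and traffic histories rather than instantaneous values.

class MigrationPolicy {
  static final double OVERLOAD_FACTOR = 1.25;  // assumed threshold

  static boolean shouldMigrate(double nodeLoad, double avgLoad,
                               double msgRate, double msgRateLimit) {
    boolean overloaded = nodeLoad > OVERLOAD_FACTOR * avgLoad;
    boolean looselyCoupled = msgRate < msgRateLimit;  // few ties to peers
    return overloaded && looselyCoupled;
  }
}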

9
Single System Image
  • If user does only crude parallelization and
    system does the rest
  • If processes/threads can roam, and processes
    expand/shrink
  • If I/O (including sockets) can be at any node
    anytime
  • We achieve at least 50% of SSI
  • The rest is difficult

(Figure: SSI layers: single entry point, file system, virtual
networking, I/O and memory space, process space, management /
programming view)
10
Bon Java!
  • Java (for HPC) in good hands
  • JGF Numerics Working Group, IBM Ninja, ...
  • JGF Concurrency/Applications Working Group
    (benchmarking, MPI, ...)
  • The workshops
  • Java has many advantages (vs. Fortran and C/C++)
  • Performance not an issue any more
  • Threads as first-class citizens!
  • JVM can be modified

"Java has the greatest potential to deliver an
attractive productive programming environment
spanning the very broad range of tasks needed by
the Grande programmer." (The Java Grande Forum
Charter)
11
Process vs. Thread Migration
  • Process migration easier than thread migration
  • Threads are tightly coupled
  • They share objects
  • Two styles to explore
  • Process + MPI (distributed computing)
  • Thread + shared objects (parallel computing)
  • Or combined
  • Boils down to messages vs. distributed shared
    objects

12
Two Projects @ HKU
  • M-JavaMPI (M for Migration)
  • Process migration
  • I/O redirection
  • Extension to grid
  • No modification of JVM and MPI
  • JESSICA Java-Enabled Single System Image
    Computing Architecture
  • By modifying JVM
  • Thread migration, Amoeba mode
  • Global object space, I/O redirection
  • JIT mode (Version 2)

13
Design Choices
  • Bytecode instrumentation
  • Insert code into programs, manually or via a
    pre-processor
  • JVM extension
  • Make thread state accessible from the Java
    program
  • Non-transparent; modification of JVM is required
  • Checkpointing the whole JVM process
  • Powerful but heavy penalty
  • Modification of JVM (runtime support)
  • Totally transparent to the applications
  • Efficient but very difficult to implement

14
M-JavaMPI
  • Support transparent Java process migration and
    provide communication redirection services
  • Communication using MPI
  • Implemented as a middleware on top of standard
    JVM
  • No modification of JVM or MPI
  • Checkpointing of the Java process via code
    insertion by a preprocessor

15
System Architecture
16
Preprocessing
  • Bytecode is modified before being passed to the
    JVM for execution
  • Restoration functions are inserted as exception
    handlers, in the form of encapsulated try-catch
    statements
  • Re-arrangement of bytecode, and addition of local
    variables

17
The Layers
  • Java-MPI API layer
  • Restorable MPI layer
  • Provides restorable MPI communications
  • No modification of MPI library
  • Migration Layer
  • Captures and saves the execution state of the
    migrating process in the source node, and
    restores the execution state of the migrated
    process in the destination node
  • Cooperates with the Restorable MPI layer to
    reconstruct the communication channels of the
    parallel application

18
State Capturing and Restoring
  • Program code is re-used in the destination node
  • Data are captured and restored using the object
    serialization mechanism
  • Execution context is captured using JVMDI and
    restored by inserted exception handlers
  • Eager (all) strategy: for each frame, the local
    variables, referenced objects, the name of the
    class and class method, and the program counter
    are saved using object serialization (sketched
    below)
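A minimal sketch of this eager capture using plain Java object
serialization; the FrameState class and its fields are illustrative
only (the real layer extracts these values through JVMDI and
inserted bytecode).

import java.io.*;

class FrameState implements Serializable {
  String className;    // class of the executing method
  String methodName;   // method this frame belongs to
  int pc;              // program counter at suspension
  Object[] locals;     // local variables and referenced objects
}

class StateCapture {
  // Eager (all) strategy: serialize every frame at once.
  static byte[] capture(FrameState[] frames) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(frames);
    oos.close();
    return bos.toByteArray();
  }

  // At the destination node, rebuild the frames from the byte stream.
  static FrameState[] restore(byte[] data)
      throws IOException, ClassNotFoundException {
    ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(data));
    return (FrameState[]) ois.readObject();
  }
}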

19
State Capturing using JVMDI
public class A {
  int a;
  char b;
  try {
    ...
  } catch (RestorationException e) {
    a = ...;   // saved value of local variable a
    b = ...;   // saved value of local variable b
    pc = ...;  // saved value of the program counter
               // when the program was suspended
    // jump to the location where the program was
    // suspended
  }
}

20
Message Redirection Model
  • An MPI daemon in each node supports message
    passing between distributed Java processes
    (sketched below)
  • IPC between the Java program and the MPI daemon
    in the same node goes through shared memory and
    semaphores

(Figure: message redirection; the Java process and the MPI daemon
form a client-server pair on each node)
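A minimal sketch (hypothetical names) of the redirection idea: the
Java process never calls MPI directly but hands requests to the
local daemon, which owns the physical connections, so a migrated
process keeps its logical rank. A BlockingQueue stands in here for
the shared-memory-and-semaphore IPC of the real system.

import java.util.concurrent.*;

class MpiRequest {
  final int destRank;       // logical rank of the receiver
  final byte[] payload;
  MpiRequest(int destRank, byte[] payload) {
    this.destRank = destRank;
    this.payload = payload;
  }
}

class MpiDaemonStub implements Runnable {
  final BlockingQueue<MpiRequest> ipc =
      new LinkedBlockingQueue<MpiRequest>();

  public void run() {
    try {
      while (true) {
        MpiRequest r = ipc.take();       // request from the Java process
        int node = lookup(r.destRank);   // logical rank -> current node
        forward(node, r.payload);        // the real MPI send happens here
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  int lookup(int logicalRank) { return logicalRank; }  // updated on migration
  void forward(int node, byte[] data) { /* native MPI send elided */ }
}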
21
Process migration steps
(Figure: the migration steps performed at the source node and at
the destination node)
22
Experiments
  • PC Cluster
  • 16-node cluster
  • 300 MHz Pentium II with 128MB of memory
  • Linux 2.2.14 with Sun JDK 1.3.0
  • 100 Mb/s Fast Ethernet
  • All Java programs executed in interpreted mode

23
  • Bandwidth (PingPong test)
  • Native MPI: 10.5 MB/s
  • Direct Java-MPI binding: 9.2 MB/s
  • Restorable MPI layer: 7.6 MB/s

24
Latency (PingPong test)
  • Native MPI: 0.2 ms
  • Direct Java-MPI binding: 0.23 ms
  • Restorable MPI layer: 0.26 ms

25
  • Migration cost: capturing and restoring objects

26
Migration cost: capturing and restoring frames
27
Application Performance
  • PI calculation
  • Recursive ray-tracing
  • NAS integer sort
  • Parallel SOR

28
  • Time spent in calculating PI and ray-tracing with
    and without the migration layer

29
  • Execution time of NAS program with different
    problem sizes (16 nodes)

No noticeable overhead is introduced in the
computation part, while in the communication part
there is an overhead of about 10-20%
30
Time spent in executing SOR using different
numbers of nodes with and without migration layer
31
Cost of Migration
Time spent in executing the SOR program on an
array of size 256x256, without and with one
migration during the execution
32
Cost of Migration
  • Time spent in migration (in seconds) for
    different applications

33
Dynamic Load Balancing
  • A simple test
  • The SOR program was executed using six nodes in
    an unevenly loaded environment, with one of the
    nodes executing a computationally intensive
    program
  • Without migration: 319 s
  • With migration: 180 s (a speedup of about 1.8)

34
In Progress
  • M-JavaMPI in JIT mode
  • Develop system modules for automatic dynamic load
    balancing
  • Develop system modules for effective
    fault-tolerance support

35
Java Virtual Machine
  • Class Loader
  • Loads class files
  • Interpreter
  • Executes bytecode
  • Runtime Compiler
  • Converts bytecode to native code

(Figure: application and Java API class files enter through the
class loader; bytecode is either decoded by the interpreter or
translated by the runtime compiler into native code)
36
Threads in JVM
A Multithreaded Java Program
public class ProducerConsumerTest {
  public static void main(String[] args) {
    CubbyHole c = new CubbyHole();
    Producer p1 = new Producer(c, 1);
    Consumer c1 = new Consumer(c, 1);
    p1.start();
    c1.start();
  }
}
(Figure: several threads inside one JVM; each thread has its own
PC and stack frames in the execution engine, while the class
loader, the Java method area (code), and the heap (data objects)
are shared)
37
JMM
Java Memory Model (How to maintain memory
consistency between threads)
(Figure: threads T1 and T2 each have a per-thread working
memory; a variable modified in T1's working memory must be
written back to the master copy in the main-memory heap area
before T2 can observe it)
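To make the hazard concrete, here is a small self-contained Java
example (my illustration, not from the slides): without the
volatile keyword, T1's write below may stay in its working memory
and T2 may spin forever; volatile forces the write-back to the
master copy in main memory.

class VisibilityDemo {
  static volatile boolean done = false;  // remove volatile to see the hazard

  public static void main(String[] args) throws InterruptedException {
    Thread t2 = new Thread(new Runnable() {
      public void run() {
        while (!done) { }                // spin until T1's write is visible
        System.out.println("T2 saw T1's update");
      }
    });
    t2.start();
    Thread.sleep(100);
    done = true;                         // T1 modifies the variable
    t2.join();
  }
}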
38
Problems in Existing DJVMs
  • Mostly based on interpreters
  • Simple but slow
  • Layered design using a distributed shared memory
    system (DSM) → cannot be tightly coupled with JVM
  • JVM runtime information cannot be channeled to
    DSM
  • False sharing if page-based DSM is employed
  • Page faults block the whole JVM
  • Programmer to specify thread distribution → lack
    of transparency
  • Need to rewrite multithreaded Java applications
  • No dynamic thread distribution (preemptive thread
    migration) for load balancing

39
Related Work
  • Method shipping: IBM cJVM
  • Like remote method invocation (RMI): when
    accessing object fields, the proxy redirects the
    flow of execution to the node where the object's
    master copy is located
  • Executed in interpreter mode
  • Load balancing problem: affected by the object
    distribution
  • Page shipping: Rice U. Java/DSM, HKU JESSICA
  • Simple; the GOS is supported by a page-based
    distributed shared memory (e.g., TreadMarks,
    JUMP, JiaJia)
  • JVM runtime information can't be channeled to
    the DSM
  • Executed in interpreter mode
  • Object shipping: Hyperion, Jackal
  • Leverage some object-based DSM
  • Executed in native mode: Hyperion translates Java
    bytecode to C; Jackal compiles Java source code
    directly to native code

40
Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java Virtual Machine
(DJVM) spanning multiple cluster nodes to provide
a true parallel execution environment for
multithreaded Java applications with a Single
System Image illusion to Java threads.
(Figure: Java threads created in a program spread over the
Global Object Space, which spans several PCs, each running its
own OS, connected by a high-speed network)
41
JESSICA2 Main Features
  • Transparent Java thread migration
  • Runtime capturing and restoring of thread
    execution context.
  • No source code modification, no bytecode
    instrumentation (preprocessing), no new API
    introduced
  • Enables dynamic load balancing on clusters
  • Operated in Just-In-Time (JIT) compilation mode
  • Global Object Space
  • A shared global heap spanning all cluster nodes
  • Adaptive object home migration protocol
  • I/O redirection

42
Transparent Thread Migration in JIT Mode
  • Simple for interpreters (e.g., JESSICA); sketched
    below
  • The interpreter sits in the bytecode decoding
    loop, which can be stopped upon checking a
    migration flag
  • The full state of a thread is available in the
    data structures of the interpreter
  • No register allocation
  • JIT mode execution makes things complex
    (JESSICA2)
  • Native code has no clear bytecode boundary
  • How to deal with machine registers?
  • How to organize the stack frames (all are in
    native form now)?
  • How to make extracted thread states portable and
    recognizable by the remote JVM?
  • How to restore the extracted states (rebuild the
    stack frames) and restart the execution in native
    form?

Need to modify JIT compiler to instrument native
code
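A minimal sketch (my illustration, not JESSICA source) of why the
interpreter case is simple: the decode loop can poll a migration
flag between bytecodes, where the pc, locals, and operand stack are
already explicit data structures, so none of the JIT-mode questions
above arise.

class InterpreterSketch {
  volatile boolean migrationFlag;   // set asynchronously by the load monitor
  int pc;                           // program counter, explicit here
  final int[] locals = new int[8];  // local variables, explicit here

  void run(byte[] code) {
    while (pc < code.length) {
      if (migrationFlag) {          // clean bytecode boundary: safe to stop
        captureState();
        return;
      }
      execute(code[pc++]);
    }
  }

  void execute(byte opcode) { /* decode and execute one bytecode */ }

  void captureState() {
    // pc and locals can be serialized directly; there are no machine
    // registers or native stack frames to parse, which is the hard
    // part in JIT mode.
    System.out.println("capturing at pc=" + pc);
  }
}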
43
Approaches
  • Using JVMDI (e.g., M-JavaMPI)?
  • Only newer JDKs (Aug. 2002) provide the
    full-speed debugging needed to capture thread
    status
  • Portable but too heavy
  • Needs large data structures to keep debug
    information
  • JVMDI alone cannot support the full functionality
    of a DJVM
  • How to access remote objects?
  • Put a DSM under it? But you can't control the Sun
    JVM's memory allocation unless you get the latest
    JDK source code
  • Our lightweight approach
  • Provide the minimum functions required to capture
    and restore Java threads to support Java thread
    migration

44
An overview of JESSICA2 Java thread migration
(Figure: (1) the load monitor alerts the thread scheduler in the
source JVM; (2) stack analysis and stack capturing extract the
thread's frames against the method area; (3) the migration
manager ships the frames to the destination node, where frame
parsing and execution restoration rebuild the thread; (4a)
objects are accessed through the GOS heap; (4b) method code is
loaded from NFS)
45
Essential Functions
  • Migration point selection
  • At the start of a loop, basic block, or method
  • Register context handler (sketched below)
  • Spill dirty registers at a migration point
    without invalidation, so that native code can
    continue to use the registers
  • Use a register-recovering stub at the restoring
    phase
  • Variable type deduction
  • Spill types in stacks using compression
  • Java frame linking
  • Discover consecutive Java frames
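A minimal sketch (hypothetical BasicBlock/CodeEmitter types, not
Kaffe's internals) of the instrumentation pass these functions
imply: at each selected migration point the JIT emits a flag check,
spills dirty registers without invalidating them, and falls through
to the unmodified fast path.

interface Instr { }

interface BasicBlock {
  boolean isMigrationPoint();      // loop head, basic block, or method entry
  String continueLabel();
  java.util.List<Instr> instrs();
}

interface CodeEmitter {
  void emitCompareFlag(String flag);    // e.g., cmp mflag, 0
  void emitJumpIfZero(String label);    // e.g., jz <continue>
  void emitSpillDirtyRegisters();       // spill without invalidation
  void emitCall(String runtimeHelper);  // capture frames and migrate
  void emitLabel(String label);
  void emit(Instr i);
}

class MigrationInstrumentation {
  static void compile(BasicBlock bb, CodeEmitter out) {
    if (bb.isMigrationPoint()) {
      out.emitCompareFlag("mflag");
      out.emitJumpIfZero(bb.continueLabel());
      out.emitSpillDirtyRegisters();        // registers stay live afterwards
      out.emitCall("captureAndMigrate");
      out.emitLabel(bb.continueLabel());
    }
    for (Instr i : bb.instrs()) out.emit(i);  // normal translation
  }
}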

46
Dynamic Thread State Capturing and Restoring in
JESSICA2
(Figure: the JIT compilation pipeline with migration support.
After bytecode verification and migration point selection,
bytecode translation with register allocation produces
intermediate code, to which three things are added: 1. migration
checking (cmp mflag,0; jz ...), 2. object checking (cmp
objoffset,0; jz ...), 3. type/register spilling (mov reg, slot).
Code generation then emits the native code. On capturing, the
native thread stack is scanned to separate Java frames from C
frames, with linking and constant resolution; on restoring, a
register-recovering stub (mov slot1->reg1, mov slot2->reg2, ...)
rebuilds the registers.)
47
How to Maintain Memory Consistency in a
Distributed Environment?
(Figure: threads T1-T8 run on four PCs, each with its own heap
and OS, connected by a high-speed network; the per-node heaps
must appear to the threads as one consistent memory)
48
Embedded Global Object Space (GOS)
  • Takes advantage of JVM runtime information for
    optimization (e.g., object types, accessing
    threads, etc.)
  • Uses a threaded I/O interface inside the JVM for
    communication to hide the latency → non-blocking
    GOS access
  • OO-based to reduce false sharing
  • Home-based, compliant with the JVM memory model
    (lazy release consistency); access path sketched
    below
  • Master heap (home objects) and cache heap (local
    and cached objects) reduce object access latency
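A minimal sketch (hypothetical names) of home-based object access in
the GOS: a node reads its own master copies directly, faults in a
cached copy otherwise, and discards cached copies at acquire, in the
spirit of lazy release consistency.

class GosObject {
  int homeNode;          // node holding the master copy
  boolean cacheValid;    // is the local cached copy up to date?
  Object cachedCopy;
}

class Gos {
  final int myNode;
  Gos(int myNode) { this.myNode = myNode; }

  Object read(GosObject o) {
    if (o.homeNode == myNode) return o.cachedCopy;  // master heap hit
    if (!o.cacheValid) {                            // cache heap miss
      o.cachedCopy = fetchFromHome(o);              // remote fetch
      o.cacheValid = true;
    }
    return o.cachedCopy;
  }

  void acquire(GosObject o) {      // at monitor enter: drop stale caches
    if (o.homeNode != myNode) o.cacheValid = false;
  }

  Object fetchFromHome(GosObject o) { /* network fetch elided */ return null; }
}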

49
Object Cache
50
Adaptive object home migration
  • Definition
  • The home of an object is the JVM that holds the
    master copy of the object
  • Problems
  • Cached objects need to be flushed and re-fetched
    from the home whenever synchronization happens
  • Adaptive object home migration
  • If the # of accesses from one thread dominates
    the total # of accesses to an object, the object
    home is migrated to the node where that thread is
    running (decision rule sketched below)
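A minimal sketch (my illustration; the dominance threshold is an
assumption) of the decision rule: per-node access counters are kept
for an object, and when one node's share dominates, the home moves
there so its threads stop paying the flush-and-refetch cost.

class HomeMigration {
  static final double DOMINANCE = 0.8;   // assumed dominance threshold

  // counts[n] = accesses to this object from node n since last check
  static int chooseHome(int[] counts, int currentHome) {
    int total = 0, best = currentHome, bestCount = 0;
    for (int n = 0; n < counts.length; n++) {
      total += counts[n];
      if (counts[n] > bestCount) { bestCount = counts[n]; best = n; }
    }
    if (total > 0 && bestCount >= DOMINANCE * total) return best;
    return currentHome;                  // no dominant accessor: stay put
  }
}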

51
I/O redirection
  • Timer
  • Use the time in the master node as the standard
    time
  • Calibrate the time in worker nodes when they
    register with the master node
  • File I/O
  • Use half a word of the fd as the node number
    (encoding sketched below)
  • Open file
  • For read, check locally first, then the master
    node
  • For write, go to the master node
  • Read/Write
  • Go to the node specified by the node number in
    the fd
  • Network I/O
  • Connectionless send: do it locally
  • Others go to the master
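A minimal sketch of the fd encoding described above; the exact
field widths are my assumption.

class VirtualFd {
  // Upper half-word: node number; lower half-word: node-local fd.
  static int encode(int node, int localFd) {
    return (node << 16) | (localFd & 0xFFFF);
  }
  static int nodeOf(int fd)  { return fd >>> 16; }
  static int localOf(int fd) { return fd & 0xFFFF; }
}

A read or write on fd is then forwarded to nodeOf(fd) and performed
there on localOf(fd).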

52
Experimental Setting
  • Modified Kaffe open JVM, version 1.0.6
  • Linux PC clusters
  • Pentium II PCs at 540 MHz (Linux 2.2.1 kernel)
    connected by Fast Ethernet
  • HKU Gideon 300 Cluster (for the Ray Tracing demo)

53
Parallel Ray Tracing on JESSICA2 (using 64 nodes
of the Gideon 300 cluster)
Linux 2.4.18-3 kernel (Red Hat 7.3); 64 nodes: 108
seconds; 1 node: 4402 seconds (over an hour);
speedup = 4402/108 = 40.75
54
Micro Benchmarks
(PI Calculation)
55
Java Grande Benchmark
56
SPECjvm98 Benchmark
Legend: M- = migration mechanism disabled; M =
migration enabled; I = pseudo-inlining enabled;
I- = pseudo-inlining disabled
57
JESSICA2 vs JESSICA (CPI)
58
Application Performance
59
Effect of Adaptive Object Home Migration (SOR)
60
Work in Progress
  • New optimization techniques for GOS
  • Incremental Distributed GC
  • Load balancing module
  • Enhanced single I/O space to benefit more
    real-life applications
  • Parallel I/O support

61
Conclusion
  • Effective HPC for the mass
  • They supply the (parallel) program, the system
    does the rest
  • Let's hope for parallelizing compilers
  • Small- to medium-grain programming
  • SSI the ideal
  • Java the choice
  • Poor man mode too
  • Thread distribution and migration feasible
  • Overhead reduction
  • Advances in low-latency networking
  • Migration as an intrinsic function (JVM, OS,
    hardware)
  • Grid and pervasive computing

62
Some Publications
  • W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "A
    Lightweight Solution for Transparent Java Thread
    Migration in Just-in-Time Compilers," ICPP 2003,
    Taiwan, October 2003.
  • W.J. Fang, C.L. Wang, and F.C.M. Lau, "On the
    Design of Global Object Space for Efficient
    Multi-threading Java Computing on Clusters,"
    Parallel Computing, to appear.
  • W.Z. Zhu, C.L. Wang, and F.C.M. Lau, "JESSICA2:
    A Distributed Java Virtual Machine with
    Transparent Thread Migration Support," CLUSTER
    2002, Chicago, September 2002, 381-388.
  • R. Ma, C.L. Wang, and F.C.M. Lau, "M-JavaMPI: A
    Java-MPI Binding with Process Migration
    Support," CCGrid 2002, Berlin, May 2002.
  • M.J.M. Ma, C.L. Wang, and F.C.M. Lau, "JESSICA:
    Java-Enabled Single-System-Image Computing
    Architecture," Journal of Parallel and
    Distributed Computing, Vol. 60, No. 10, October
    2000, 1194-1222.

63
THE END And Thanks!