Title: SSI Team Progress
1. SSI Team Progress Status
- 2002. 7. 9 
- NRL SSI TEAM 
- [author names unreadable in source]
2. Contents
- Introduction 
- Load Balancing Design 
- Opensource Functionality and Status 
- Placement System on Opensource SSI Cluster 
- Socket Migration Integration within Opensource
3. Introduction
- Load balancing
  - even distribution of workloads among nodes
  - improve system performance and throughput through full utilization of diverse resources
  - minimize the average completion time of applications
- Classification of dynamic load balancing
  - remote execution (non-preemptive migration)
    - some new processes are (possibly automatically) invoked on remote nodes
    - only newly created processes are migrated
  - process migration (preemptive migration)
    - running processes may be suspended, moved to a remote node, and restarted
4. Load Balancing Design (1)
- Load balancing system over a single system image
  - multi-user environment
  - interactive (or sequential) and parallel jobs coexist
- The dynamic load balancing system must support both placement and process migration mechanisms
  - preemptive process migration alone is not efficient, especially for short-lived jobs
[Figure label: placement layer]
5. Load Balancing Design (2)
- The dynamic load balancing system must consider the characteristics of parallel workloads
  - minimize communication cost
  - process selection
  - communication-to-computation ratio
- The dynamic load balancing system must consider the communication pattern of parallel workloads
  - avoid communication delay
[Figure labels: host A, host B, waiting]
6. Load Balancing Design (3)
7. Load Balancing Design (4)
- CCR (Communication-to-Computation Ratio) Recorder
  - assists the process-selection decision in MOSIX LL
  - heuristic process selection: lowest communication overhead
  - modification of the kernel process descriptor structure
    - accumulated read/write bytes during a time quantum
  - the more a process communicates, the closer to its peer it should be placed
  - realizing this requires real socket migration
[Figure label: process migration with shipback mechanism]
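
The CCR bookkeeping on this slide can be illustrated with a small user-space sketch. The slides do not give the exact formula, so the ratio used here (read/write bytes divided by CPU ticks within one quantum) is an assumption, and `ProcStats` and `select_for_migration` are hypothetical names rather than MOSIX or SSI kernel symbols.

```python
from dataclasses import dataclass


@dataclass
class ProcStats:
    """Hypothetical per-process counters kept for one time quantum."""
    pid: int
    cpu_ticks: int = 0
    io_bytes: int = 0  # accumulated read + write bytes in this quantum

    def record_io(self, read_bytes: int, write_bytes: int) -> None:
        self.io_bytes += read_bytes + write_bytes

    def ccr(self) -> float:
        # Assumed definition: communication bytes per CPU tick.
        # A process that used no CPU is treated as purely communicating.
        if self.cpu_ticks == 0:
            return float("inf")
        return self.io_bytes / self.cpu_ticks

    def reset(self) -> None:
        # Called at the start of each new quantum.
        self.cpu_ticks = 0
        self.io_bytes = 0


def select_for_migration(procs):
    """Pick the process with the lowest CCR, i.e. the one whose
    migration should add the least communication overhead."""
    return min(procs, key=lambda p: p.ccr())
```

Choosing the lowest-CCR process follows the slide's heuristic: a compute-bound process loses little when moved away from its peers, while a communication-heavy one is better left close to them.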
8. Opensource Functionality and Status
- SSI patch V0.5.2
  - GFS as root filesystem (using GNBD)
  - clusterwide device space (/devfs/)
  - clusterwide process management
  - clusterwide PID
  - clusterwide IPC (signal, fifo, pipe)
  - process migration with shipback socket
- SSI patch V0.6.0
  - patch V0.5.2 + MOSIX LL integration
9. Placement System on Opensource SSI Cluster
10. Overview
- Combined with the kernel-level migration system
- Components
  - Interpreter / List manager / Load manager
[Figure labels: placement system, MOSIX Load Leveler]
11. Component - Interpreter
- Interpreter
  - reads and parses user commands
  - built by modifying bash-2.05
  - requests an eligibility check from the list manager
  - executes the task locally or remotely
  - measures the execution time of a job the first time it runs
12. Component - List Manager
- List manager
  - determines eligibility for remote execution
  - maintains a local list and a remote list per user
  - adds long jobs to the remote list
  - receives eligibility check requests from the interpreter
  - checks whether a user command is eligible to run remotely
  - answers the interpreter with local or remote execution
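
The list manager's bookkeeping could look roughly like the sketch below. The class name, the method names, and the threshold that separates a "long" job from a short one are all assumptions for illustration; the slides only state that long jobs go on the remote list and that lists are kept per user.

```python
class ListManager:
    """Hypothetical per-user local/remote job lists."""

    def __init__(self, long_job_threshold_sec: float = 1.0):
        # The threshold value is an assumption; the slides do not give one.
        self.long_job_threshold_sec = long_job_threshold_sec
        self._lists = {}  # user -> {"local": set(), "remote": set()}

    def _lists_for(self, user: str):
        return self._lists.setdefault(user, {"local": set(), "remote": set()})

    def record_first_run(self, user: str, command: str, exe_time_sec: float) -> None:
        """Classify a command once the interpreter has timed its first run."""
        lists = self._lists_for(user)
        if exe_time_sec >= self.long_job_threshold_sec:
            lists["remote"].add(command)
        else:
            lists["local"].add(command)

    def local_or_remote(self, user: str, command: str) -> str:
        """Eligibility check: 'remote', 'local', or 'new' if never seen."""
        lists = self._lists_for(user)
        if command in lists["remote"]:
            return "remote"
        if command in lists["local"]:
            return "local"
        return "new"
```

A command reported as `"new"` would be run locally and timed by the interpreter, after which `record_first_run` places it on one of the two lists.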
13. Component - Load Manager
- Load Manager
  - runs as a daemon process
  - maintains the load information held by the kernel-level MOSIX Load Leveler (LL)
  - invokes a system call to fetch load information from MOSIX LL whenever MOSIX updates it
14. Overall Operations
[Figure: overall operations. User level: Interpreter (lightLodedNode(), localOrRemote(task), local exe, remote exe, local exe time check), List Manager (local / remote / new), Load Manager daemon (lightest loaded node, shared via SHM). Kernel level: MOSIX LL, which returns load info through a new syscall.]
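
The placement step in this flow can be sketched as follows. The function names echo the figure labels but are hypothetical, and the load table is modelled as a plain dictionary; in the real system the load manager obtains it from MOSIX LL through the new system call and shares it via SHM.

```python
def lightest_loaded_node(load_info: dict) -> str:
    """Return the node with the smallest load value.
    Ties go to the node listed first in the table."""
    return min(load_info, key=load_info.get)


def place(decision: str, local_node: str, load_info: dict) -> str:
    """Choose where to run a task.

    decision is the list manager's answer: 'local', 'remote', or 'new'.
    Only 'remote' tasks are sent to the lightest-loaded node; 'local'
    and 'new' tasks stay on the node where they were invoked.
    """
    if decision != "remote":
        return local_node
    return lightest_loaded_node(load_info)
```

For example, `place("remote", "node1", {"node1": 0.9, "node2": 0.2, "node3": 0.5})` selects `node2`, while the same call with `"new"` keeps the task on `node1` so that its first run can be timed.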
15. Performance Evaluation - Overhead
- Program (pi value solver)
  - exe-time 4.1612 sec
- Exe-time comparison
  - eligibility check time increases as the list grows
16. Performance Evaluation - Speedup
- Environment
  - P-III 700 MHz, 256 MB, 5 nodes
  - 100 Mbps Ethernet
  - Linux 2.4.16 with the Opensource SSI V0.6 patch
- Program: pi value solver
  - exe-time 28.1 sec
- Test
  - invoke jobs on one node at random arrival times
  - measure speedup as nodes are added
17. Performance Evaluation - Speedup (cont'd)
- 28.1 sec / 50 jobs / random arrival (210 sec)
[Figure: results chart, axis in sec]
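
As a back-of-the-envelope reference for reading the chart, the total work in this test and its ideal even split across nodes can be computed as below. This is arithmetic on the stated parameters only, not a measured result: it ignores arrival times, placement overhead, and load imbalance, and `ideal_busy_time` is a hypothetical helper name.

```python
JOB_TIME_SEC = 28.1  # single-job execution time from the slide
JOBS = 50            # number of jobs in the test


def ideal_busy_time(nodes: int) -> float:
    """Total CPU work divided evenly over the nodes, in seconds."""
    return JOBS * JOB_TIME_SEC / nodes


for n in range(1, 6):
    print(f"{n} node(s): {ideal_busy_time(n):.1f} sec of work per node")
```

Measured completion times can only approach these values from above, since jobs arrive over time and placement is not perfectly even.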
18. Limitation and Future Work
- Limitation
  - jobs invoked at the same time may all go to the same lightly loaded node
  - coarse-grained placement
    - the unit of placement is a job
    - no consideration of the placement of processes spawned by a single job
- Future work
  - more detailed evaluation
    - job characteristics / different jobs
  - fine-grained placement
    - jobs that generate several processes
  - adaptive load index that depends on application characteristics
19. Socket Migration Integration within Opensource
20. Opensource Shipback Socket
- Socket migration by file op. function shipping
  - the migrated process ships its file op. functions to the original node
- Real socket migration
[Figure: two diagrams across Node 1, Node 2, and Node 3, showing Process A and Process B with their VPROC and PPROC. Shipback: after migration, op. functions are shipped back and the result is returned. Real socket migration: after migration, the connection is closed and then reopened.]
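
The shipback idea can be illustrated in user space: the real socket stays on the original node, and the migrated process holds only a stand-in that forwards each operation back. This is a minimal sketch under that reading of the slide; `OriginNode` and `ShipbackSocket` are hypothetical names, a direct method call stands in for the inter-node message, and nothing here reflects the actual SSI kernel code.

```python
import socket


class OriginNode:
    """Keeps the real socket and executes operations shipped back to it."""

    ALLOWED_OPS = {"sendall", "recv"}

    def __init__(self, sock: socket.socket):
        self._sock = sock
        self.shipped_ops = 0  # how many operations were forwarded here

    def handle(self, op: str, *args):
        if op not in self.ALLOWED_OPS:
            raise ValueError(f"unsupported op: {op}")
        self.shipped_ops += 1
        return getattr(self._sock, op)(*args)


class ShipbackSocket:
    """Stand-in held by the migrated process; it owns no socket itself."""

    def __init__(self, origin: OriginNode):
        self._origin = origin

    def send(self, data: bytes) -> None:
        self._origin.handle("sendall", data)

    def recv(self, nbytes: int) -> bytes:
        return self._origin.handle("recv", nbytes)


def demo():
    real, peer = socket.socketpair()
    try:
        origin = OriginNode(real)
        proxy = ShipbackSocket(origin)
        proxy.send(b"ping")            # forwarded to the origin's socket
        seen_by_peer = peer.recv(4)
        peer.sendall(b"pong")
        seen_by_proxy = proxy.recv(4)  # read on the origin, returned here
        return seen_by_peer, seen_by_proxy, origin.shipped_ops
    finally:
        real.close()
        peer.close()
```

Every operation costs an extra hop through the original node, which is the overhead that real socket migration removes by moving the socket itself.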
21. CRAK 2001
[Figure: checkpoint and restart flow across user level and kernel level, with the numbered steps below]
1. Stop the peer process to be checkpointed using rsh
2. ioctl
3. Save address space, register set, open files/pipes/socket, ...
4. Recover socket and file descriptor
5. Try to bind to the same port
6. Use rsh to change socket info of remote process
7. ioctl
8. Load the checkpointed file and copy it into memory
9. Set information of socket and file structures constructed above
10. Let the stopped process continue to run
22. Our Socket Migration Flow (1)
[Figure: sequence diagram over time across Node A, Node B, and Node C (task and kernel on each). Messages: SIGMIGRATE, ICS communication, SIGSTOP, RPC, RPC response. Parts of the step text are unreadable in the source and are marked [...].]
1. Migration [...] checkpoint [...]
2. Process descriptor
3. Exporting the process's root and current working directory
4. [...] virtual process [...]
5. Process context [...]
6. File reopen [...] descriptor [...] export; file descriptor [...] socket descriptor [...] socket migration [...]
7. [...] descriptor [...] exported path [...] file reopen
8. [...] open [...] socket descriptor [...] file [...] checkpoint
23. Our Socket Migration Flow (2)
[Figure: sequence diagram over time across Node A, Node B, and Node C (task and kernel on each). Messages: RPC response, ICS communication, SIGCONT (twice). Parts of the step text are unreadable in the source and are marked [...].]
9. TCP_TIME_WAIT
10. Socket structure [...]
11. [...] dest. port [...] bind
12. (saddr, sport, daddr, dport) [...] socket [...] process [...]
13. saddr, sport [...] Node C [...]
14. Destination socket information [...]
16. SS_CONNECTING
17. daddr, dport [...] Node C [...]
18. Hash table update
19. SS_CONNECTED
20. TCP_ESTABLISHED
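
The readable part of steps 16 to 20 suggests that the peer's socket is retargeted at Node C and re-inserted into the connection hash table under its new address tuple. The sketch below models only that idea; `FourTuple` and `ConnTable` are hypothetical, the table is a plain dictionary rather than the kernel's TCP hash table, and the addresses are documentation examples.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FourTuple:
    saddr: str
    sport: int
    daddr: str
    dport: int


class ConnTable:
    """Toy connection table keyed by (saddr, sport, daddr, dport)."""

    def __init__(self):
        self._conns = {}

    def add(self, key: FourTuple, state: str = "SS_CONNECTED") -> None:
        self._conns[key] = state

    def state(self, key: FourTuple):
        return self._conns.get(key)

    def retarget_peer(self, old: FourTuple, new_daddr: str, new_dport: int) -> FourTuple:
        """Point an existing connection at the migrated process's new node."""
        if old not in self._conns:
            raise KeyError(old)
        self._conns[old] = "SS_CONNECTING"   # step 16: mark as in transition
        new = replace(old, daddr=new_daddr, dport=new_dport)  # step 17
        del self._conns[old]                 # step 18: re-key the entry
        self._conns[new] = "SS_CONNECTED"    # step 19: connection usable again
        return new
```

Re-keying matters because incoming segments are matched against the table by address tuple; an entry left under the old destination would never match traffic from Node C.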
24. Measurement (1)
- Environment
  - Pentium III 850 MHz, 512 MB, 3 nodes
  - 100 Mbps Ethernet
  - Linux 2.4.10-ac4 with the Opensource SSI V0.5.2 patch
- Test process
  - consists of a sender and a receiver
  - single socket connection, 64K buffer
  - a small message is sent to the receiver 10,000 times
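
A test process of this shape might look like the sketch below. Several details are assumptions: the message size (16 bytes), reading "64K buffer" as the socket buffer size, and the use of a local `socket.socketpair()` in place of a TCP connection between two cluster nodes so that the example runs on one machine.

```python
import socket
import threading
import time

MSG = b"x" * 16      # "small message"; the actual size is not given
LOOPS = 10_000       # number of sends, from the slide
BUF = 64 * 1024      # 64K buffer


def run_test(loops: int = LOOPS, msg: bytes = MSG):
    """Send msg `loops` times over one connection; return (bytes_received, seconds)."""
    sender, receiver = socket.socketpair()
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
    receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
    expected = loops * len(msg)
    received = []

    def receive_all():
        total = 0
        while total < expected:
            chunk = receiver.recv(BUF)
            if not chunk:  # sender closed early
                break
            total += len(chunk)
        received.append(total)

    worker = threading.Thread(target=receive_all)
    worker.start()
    start = time.perf_counter()
    for _ in range(loops):
        sender.sendall(msg)
    worker.join()
    elapsed = time.perf_counter() - start
    sender.close()
    receiver.close()
    return received[0], elapsed
```

Running the same loop before and after migrating one endpoint gives the kind of comparison the following measurement slides report.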
25. Measurement (2)
- Total migration cost
  - Opensource SSI vs. Opensource SSI + real socket migration
  - per socket file descriptor: [unreadable symbol] 8 ms overhead
26. Measurement (3)
- Communication cost
  - one-way message transfer, 10,000 iterations
  - data copy cost on the original node
[Figure label: ICS_CHANNEL]
27. Limitation and Future Work
- Limitation
  - socket migration within a 2-node configuration is not yet supported
    - ICS channel vs. IPC
  - some bugs
    - pts or tty devices are not activated quickly on a certain node (the receiver)
- Future work
  - debugging
  - more tests and analysis under various communication conditions
  - 2-node socket migration support