Title: LOTS: A Software DSM Supporting Large Object Space
1LOTS: A Software DSM Supporting Large Object Space
- Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau
Department of Computer Science, The University of Hong Kong
September, 2004
2Presentation Outline
- Why LOTS? (Objectives)
- DSM Background and Related Work
- Design of LOTS
- Performance Testing and Results
- Conclusion and Future Work
3The Problem in Current DSM
- Lack of shared object (memory) space
- Another major problem apart from performance
- Fixed address mapping in virtual memory
- Shared object space size < process space
- TreadMarks: minimum RAM size among all machines
- JIAJIA V1.0: 128 MB
- 32-bit machines → max 4 GB shared space
- Unscalable: fixed regardless of the number of machines
- Large problems (needing > 4 GB of shared memory) can't be run directly → the programmer needs to change the application code to reduce memory utilization.
4Objectives of LOTS
- Using 64-bit machines is not a total solution!
- 32-bit machines are dominating the market (poor man's clusters)
- Problems keep increasing memory consumption
[Figure labels: rich man's cluster; poor man's cluster]
5Objectives of LOTS
- Hence we introduce LOTS
- Large Shared Object Space > 4 GB
- Dynamic run-time memory mapping technique
- Local disk as the backing store for temporarily unused objects
- Shared space size now limited by disk space
- Lazy disk read/write → reasonable performance
6Some DSM Background
- Memory Consistency Issues
- Memory Consistency Models
- Sequential Consistency (IVY) performs poorly
- Relaxed models reduce redundant data traffic
- Lazy Release Consistency (TreadMarks)
- Scope Consistency (JIAJIA)
[Figure: process P executes Y=5, Acq(L), X=3, Rel(L); process Q then executes Acq(L), reads X and Y, Rel(L).]
In Scope Consistency, Q sees X to be 3, but Y may not be 5.
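The P/Q example on this slide can be mimicked with a toy model. The sketch below is not LOTS or JIAJIA code: `Lock`, `Process`, and their members are hypothetical names, everything runs in one address space, and a value of 0 stands for "the old value". It only illustrates the rule that an acquire makes visible the writes performed under the same lock.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy, single-address-space model of Scope Consistency (all names hypothetical).
struct Lock {
    std::map<std::string, int> updates;  // writes made while this lock was held
};

struct Process {
    std::map<std::string, int> local;    // this process's view of shared variables
    Lock* held = nullptr;                // the toy model allows one lock at a time

    void acquire(Lock& l) {
        assert(held == nullptr);
        held = &l;
        // Only updates made in the scope of the same lock become visible here.
        for (const auto& kv : l.updates) local[kv.first] = kv.second;
    }

    void release() { held = nullptr; }

    void write(const std::string& name, int value) {
        local[name] = value;
        if (held) held->updates[name] = value;  // recorded in the lock's scope
    }

    int read(const std::string& name) const {
        auto it = local.find(name);
        return it == local.end() ? 0 : it->second;  // 0 stands for "old value"
    }
};
```

If P writes Y outside the lock and X inside it, a later acquire by Q brings in X only. In this toy model Y is never propagated; in a real system it merely is not guaranteed to be.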
7Some DSM Background
- More Memory Consistency Issues
- Coherence Protocols
- Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP)
- Write-update vs Write-invalidate
- Adaptive Protocol (DOSA, ADSM)
- Coherence Protocol has to match the memory model for higher efficiency
- No DSM deals with Large Object Space!
8Related Work
- Large object space support
- Pointer swizzling
- Artificial, invalid addresses are translated to machine-addressable form during access
- Used in persistent stores (QuickStore, Thor-1)
[Figure: Process Space. Unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk. Compiler-generated addresses cause page faults at runtime and are translated to valid ones.]
9Design of LOTS
- Dynamic Memory Mapping (DMM)
- Uses C++ Operator Overloading as the interface
- Overloads operators such as [], +, -, *, /, ++, --, >, <, !, etc.
- Purely runtime
[Figure: dynamic memory mapping. Labels: Program, Virtual Memory Area, Heap Area (A->ctrl), Data (DMM) Area (Array A, A[57]), Network, Remote Memory, Local Hard Disk.]
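As a rough illustration of how operator overloading can hide the mapping step, the sketch below routes every element access through `operator[]`, which checks whether the data is currently mapped and swaps it in if not. `SharedArray`, `swap_in`, `swap_out`, and `faults` are hypothetical names, and two `std::vector`s stand in for the local hard disk and the DMM area. This is not the LOTS interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a shared array with run-time access checks.
template <typename T>
class SharedArray {
public:
    explicit SharedArray(std::size_t n) : disk_(n, T()), mapped_(false), faults_(0) {}

    // Each access is checked at run time; unmapped data is brought in lazily.
    T& operator[](std::size_t i) {
        if (!mapped_) swap_in();
        assert(i < memory_.size());
        return memory_[i];
    }

    // Write the data back to the backing store and drop the in-memory copy.
    void swap_out() {
        if (!mapped_) return;
        disk_ = memory_;
        memory_.clear();
        memory_.shrink_to_fit();
        mapped_ = false;
    }

    bool mapped() const { return mapped_; }
    int faults() const { return faults_; }  // number of swap-ins so far

private:
    void swap_in() {
        memory_ = disk_;  // stands in for a lazy disk read
        mapped_ = true;
        ++faults_;
    }

    std::vector<T> disk_;    // stands in for the local hard disk
    std::vector<T> memory_;  // stands in for the data (DMM) area
    bool mapped_;
    int faults_;
};
```

With this shape, application code keeps writing `a[57] = 42;` as it would for a plain array, and the cost of the check on every access is the kind of overhead the performance slides attribute to access checking.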
10LOTS Shared Objects Creation
- Through the LOTS memory allocator
- Exists as a C++ class
- Memory allocation through alloc() function
- Put data into specific part of process space
- Object control info in heap area
[Figure: Process Space, showing Array A (data) and A->Ctrl (control info).]
11LOTS Memory Allocator
- Bypasses Doug Lea's memory allocator used in original C/C++
- Uses mmap() to get physical memory, and maps the shared object data to the process space.
- Free queues and used queues
- Small and large objects allocated separately
[Figure: process space layout with Twin and Control Area, Heap Area, and DMM Area (addresses 0x50000000 and 0x70000000 marked); a free queue and a used queue link the free blocks and used blocks.]
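A minimal sketch of an mmap()-backed allocator with a free queue and a used queue follows. It makes several assumptions that the slide does not state: the class name `RegionAllocator`, a first-fit policy, no coalescing of freed blocks, no separate handling of small and large objects, and a kernel-chosen address instead of the fixed DMM address range shown in the figure. It targets Linux/POSIX.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical region allocator; not the LOTS implementation.
class RegionAllocator {
public:
    explicit RegionAllocator(std::size_t bytes) : base_(nullptr), size_(bytes) {
        // Anonymous private mapping; the kernel picks the address here.
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) {
            base_ = static_cast<char*>(p);
            free_.push_back({0, bytes});
        }
    }
    ~RegionAllocator() { if (base_) munmap(base_, size_); }
    RegionAllocator(const RegionAllocator&) = delete;
    RegionAllocator& operator=(const RegionAllocator&) = delete;

    // First fit: carve the request from the first free block that is big enough.
    void* alloc(std::size_t bytes) {
        assert(bytes > 0);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->len < bytes) continue;
            Block used{it->off, bytes};
            it->off += bytes;
            it->len -= bytes;
            if (it->len == 0) free_.erase(it);
            used_.push_back(used);
            return base_ + used.off;
        }
        return nullptr;  // no free block is large enough
    }

    // Move a block from the used queue back to the free queue (no coalescing).
    bool release(void* p) {
        for (auto it = used_.begin(); it != used_.end(); ++it) {
            if (base_ + it->off == p) {
                free_.push_back(*it);
                used_.erase(it);
                return true;
            }
        }
        return false;
    }

    std::size_t used_blocks() const { return used_.size(); }
    std::size_t free_blocks() const { return free_.size(); }

private:
    struct Block { std::size_t off; std::size_t len; };
    char* base_;
    std::size_t size_;
    std::list<Block> free_;  // free queue
    std::list<Block> used_;  // used queue
};
```

Keeping the block bookkeeping outside the mapped region mirrors the slide's split between object data in the DMM area and control information elsewhere.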
12Shared Memory Behavior
- Goal: Reduce redundant data traffic
- Memory Consistency Model: Scope
- Memory Coherence Protocol: Mixed
- Lock-synchronized objects: Homeless write-update
- Barrier-synchronized objects: Migrating-Home write-invalidate
- Principle: To eliminate as much all-to-all data communication as possible
13Mixed Coherence Protocol
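[Figure: timeline of processes P0 to P3, with legend "Updates Movement" and "Home Token Movement". One process starts as the home of X and Y; processes write elements of x and y; at the barrier a new home is designated and the other processes receive "Inv X, Y".]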
When the processes arrive at the barrier, the
process that holds the token of the object will
become the new home of that object, and other
processes will send the updates to the home.
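The barrier step described above can be sketched as a small model. Everything here is an assumption made for illustration: the `BarrierObject` type, the choice to let the caller set `token` directly (the slide does not say how the token moves), and the single `home_copy` vector that stands for the data held by whichever process is currently home.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model of one barrier-synchronized object.
struct BarrierObject {
    int home;                    // process currently acting as home
    int token;                   // process currently holding the token
    std::vector<int> home_copy;  // data held by whichever process is home
    std::vector<std::map<std::size_t, int>> pending;  // writes since last barrier
    std::vector<bool> valid;     // does process p hold a valid copy?

    BarrierObject(int procs, std::size_t len, int initial_home)
        : home(initial_home), token(initial_home), home_copy(len, 0),
          pending(procs), valid(procs, false) {
        valid[initial_home] = true;
    }

    void write(int p, std::size_t index, int value) {
        assert(p >= 0 && static_cast<std::size_t>(p) < pending.size());
        assert(index < home_copy.size());
        pending[p][index] = value;
    }

    // At the barrier the token holder becomes the new home, the other
    // processes send it their updates, and their copies are invalidated.
    // Returns the number of update messages sent to the home.
    int barrier() {
        home = token;
        int messages = 0;
        for (std::size_t p = 0; p < pending.size(); ++p) {
            bool is_home = (static_cast<int>(p) == home);
            if (!pending[p].empty() && !is_home) ++messages;
            for (const auto& kv : pending[p]) home_copy[kv.first] = kv.second;
            pending[p].clear();
            valid[p] = is_home;
        }
        return messages;
    }
};
```

Because updates flow only to the home, the number of update messages grows with the number of writers rather than with the number of process pairs, which is the all-to-all traffic the protocol tries to avoid.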
14Making LOTS More Efficient
- Eliminating Diff Accumulation Problem
- Lock and timestamp info in DSM control area
- Calculate diff on request, no redundancy
[Figure: four update intervals T1 (len 6), T2 (len 4), T3 (len 4), T4 (len 3) writing elements of X; a table of each element's value and last-updated time; panels "Traditional Method" and "LOTS Method".]
Traditional method: all updates above need to be sent (17 units of data and 8 units of control).
LOTS method: only 7 units of data and 8 units of control data are sent.
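One way to picture "calculate diff on request" is to keep a last-updated timestamp next to each element and build the diff only when another process asks for it. The sketch below assumes a per-element integer timestamp and a logical clock that advances once per interval. `TimestampedObject`, `apply_interval`, and `diff_since` are hypothetical names, not LOTS functions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical object that records when each element was last written.
struct TimestampedObject {
    std::vector<int> value;
    std::vector<unsigned> stamp;  // last-updated time per element
    unsigned now;                 // logical clock, one tick per interval

    explicit TimestampedObject(std::size_t n) : value(n, 0), stamp(n, 0), now(0) {}

    // One synchronization interval writes a set of (index, value) pairs.
    void apply_interval(const std::vector<std::pair<std::size_t, int>>& writes) {
        ++now;
        for (const auto& w : writes) {
            assert(w.first < value.size());
            value[w.first] = w.second;
            stamp[w.first] = now;
        }
    }

    // Diff computed on request: only elements newer than what the requester
    // has already seen, each sent once with its latest value.
    std::vector<std::pair<std::size_t, int>> diff_since(unsigned seen) const {
        std::vector<std::pair<std::size_t, int>> out;
        for (std::size_t i = 0; i < value.size(); ++i)
            if (stamp[i] > seen) out.push_back({i, value[i]});
        return out;
    }
};
```

An element overwritten in several intervals appears once in the result, whereas replaying every stored interval diff would resend it each time.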
15Other Components in LOTS
- C++ runtime library in Linux
- Minimal set of functions as interface
- Retains as much C++ syntax as possible to improve programmability
- Synchronization: Locks and Barriers
- Barriers: With/Without memory effect
- Communication: Sockets with UDP/IP
- SIGIO handler for incoming messages
16Performance Testing
- Two Kinds of Testing
- Without invoking large object space support
- Compare performance with other DSM (JIAJIA V1.0, as both have similar communication protocol)
- Report no. of messages and bytes sent
- Calculate large object space support overhead
- 16 Pentium IV 2GHz machines with 100Mbps Fast Ethernet connection, 128MB mem, Linux Fedora
- With large object space support
- Use an application with large memory demand
- Run on different platforms for analysis
- Expect disk read/write overhead to dominate
17Test 1 Timing Performance
[Figure: four timing plots comparing LOTS and JIAJIA. Execution time is LOTS < JIAJIA in three plots and LOTS > JIAJIA in one. Legend: LOTS: LOTS enabled; LOTS-x: LOTS disabled. x-axis: problem size; y-axis: execution time in seconds.]
18Performance Results Summary
- LOTS beats JIAJIA V1.0 in most applications
- Mixed protocol and diff accumulation elimination reduce data traffic
- Large object space support and access checking incur a considerable overhead
- About 5-15% of total execution time (application dependent)
19Test 2 Large Object Space
- Using 4-node PC and server clusters
- Test program: simple matrix operations
- With a 120GB (SCSI) hard disk in each machine, able to claim 117.77GB of Shared Object Space
- Disk read and write time is closely related to the OS version.
20Conclusions
- LOTS succeeds in:
- Providing a large shared object space larger than the local process space during runtime
- Performing reasonably well by reducing data traffic through Scope Consistency, a mixed coherence protocol and the diff accumulation elimination technique
- Offering a programming interface similar to C++
21Future Work
- A Number of Optimizations
- Further increase shared object space
- → the minimum hard disk space × number of processes / 2.
- Recent progress: 64GB (4GB x 16) of shared objects can be allocated in 16 machines, each having a 9GB hard disk.
- Reduce disk overhead
- Reduce overloading overhead (access check)
- Load-aware migrating-home protocol: coherence protocol adapting to network traffic and processor loading (e.g., avoid too many homes in a single machine)
22Questions ?
23Test 1 No. of Messages Sent
The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.
[Figure: percentage plotted against no. of procs (p).]
24Test 1 No. of Bytes Sent
The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.
[Figure: percentage plotted against no. of procs (p).]
25Test 2 Large Object Space
- Allocate shared objects with total size > 4 GB, and another process accesses each of them once (array addition with p = 4)