Title: Memory Coherence in Shared Virtual Memory Systems
1Memory Coherence in Shared Virtual Memory Systems
- by
- Kai Li, Princeton University
- Paul Hudak, Yale University
- Presented by Shu Du
- Assisted by Charles Reis
2Motivation
- Parallel computing platform
  - Supercomputer
    - Powerful, but expensive
  - Cluster of workstations, PCs
    - Cheap and scalable
- Parallel programming model
  - Message passing
  - Shared memory
3Virtual Memory
[Figure: a mapping manager maps the applications' view of a virtual memory space (pages g1 to g7) onto main memory and secondary storage.]
4Shared Virtual Memory's Architecture
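[Figure: each node N has its own CPU and a local memory holding some pages (g1, g7, g5 in the example); a mapping manager gives the applications a view of one shared virtual memory containing pages g1 to g7.]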
5The Main Problem: Memory Coherence
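- A memory is coherent if a read always returns the value of the most recent write to the same address
[Figure: Node1 (CPU1, holding g1, g2, g3) and Node2 (CPU2, holding g1, g5, g6) share pages g1 to g7. Step 1: write g1. Step 2: read g1.]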
6Overview
- Introduction to the solutions
- Centralized Manager Algorithm
- Dynamic Manager Algorithm
- Experiment and results
- Conclusion
7Introduction to the possible solutions
8Page Synchronization Solutions (1)
- Page invalidation
  - When a processor Q has a write fault on page p, Q invalidates all other copies of p
[Figure: Node1 (CPU1, holding g1, g2, g3) and Node2 (CPU2, holding g1, g5, g6) share pages g1 to g7. Step 1: write fault to g6.]
9Page Synchronization Solutions (2)
- Write broadcast
  - When a processor Q has a write fault on page p, Q writes to all copies of p
[Figure: Node1 (CPU1, holding g1, g2, g6) and Node2 (CPU2, holding g1, g5, g6) share pages g1 to g7. Step 1: write fault to g6.]
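The two page synchronization strategies can be contrasted with a small single-process simulation. This is only a sketch: the `Node` class, the dictionary used as a page store, and both function names are assumptions made for illustration and do not come from the slides.

```python
class Node:
    """A hypothetical node holding local copies of some shared pages."""
    def __init__(self, name, pages):
        self.name = name
        self.pages = dict(pages)  # page id -> value

def write_invalidate(writer, others, page, value):
    """Invalidation: drop every other copy, then write locally."""
    for node in others:
        node.pages.pop(page, None)  # the other copy becomes invalid
    writer.pages[page] = value

def write_broadcast(writer, others, page, value):
    """Write broadcast: update every existing copy in place."""
    writer.pages[page] = value
    for node in others:
        if page in node.pages:
            node.pages[page] = value

# Invalidation: after the write, Node2 no longer holds g6.
n1 = Node("Node1", {"g1": 0, "g6": 0})
n2 = Node("Node2", {"g1": 0, "g6": 0})
write_invalidate(n1, [n2], "g6", 42)

# Write broadcast: after the write, both copies of g6 hold the new value.
n3 = Node("Node1", {"g6": 0})
n4 = Node("Node2", {"g6": 0})
write_broadcast(n3, [n4], "g6", 42)
```

With invalidation, Node2 has to fault again before it can read g6; with write broadcast, every write costs one update per existing copy.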
10Page Ownership Solutions
- Owner: the processor with write access to the page
- Fixed ownership
- Dynamic ownership
  - Centralized manager
  - Distributed manager
    - Fixed managing server
    - Dynamic managing server
11Solutions to the Memory Coherence Problem
12Centralized Manager Algorithm(CMA)
13CMA's Data Structures
- PTable
  - Kept by each processor
  - Fields: Access, Lock
- Info table
  - Kept by the centralized manager
  - Fields: Owner, Copy set, Lock
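The two tables can be sketched as Python dataclasses. The class names, the `Access` values (`NIL`, `READ`, `WRITE`), and the use of `threading.Lock` are assumptions chosen for illustration; the slides only name the fields.

```python
from dataclasses import dataclass, field
from enum import Enum
from threading import Lock

class Access(Enum):
    """Assumed access levels for a page on one processor."""
    NIL = "nil"
    READ = "read"
    WRITE = "write"

@dataclass
class PTableEntry:
    """Per-processor entry, one per page."""
    access: Access = Access.NIL
    lock: Lock = field(default_factory=Lock)

@dataclass
class InfoEntry:
    """Manager-side entry, one per page."""
    owner: str
    copy_set: set = field(default_factory=set)
    lock: Lock = field(default_factory=Lock)

# Hypothetical state: page p6 is owned by node N2 and has no read copies yet.
ptable = {"p6": PTableEntry()}
info = {"p6": InfoEntry(owner="N2")}
```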
14CMA's Read Fault Handler
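[Figure: read fault on p6 under the centralized manager, with Node1 (CPU1) and Node2 (CPU2). Step 1: the faulting node asks the manager for p6. Manager tables: Owner P6 -> N2; Copy Set P6 -> (empty).]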
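A read fault under a centralized manager can be sketched as follows. This is a simplified single-process model, not the handler from the paper: messages, locking and confirmations are left out, and the `Manager` and `Node` classes and the `read_fault` function are hypothetical names.

```python
class Manager:
    """Hypothetical centralized manager: owner and copy set per page."""
    def __init__(self):
        self.owner = {}     # page -> owning Node
        self.copy_set = {}  # page -> set of node names holding read copies

class Node:
    def __init__(self, name):
        self.name = name
        self.access = {}    # page -> "read" or "write"
        self.data = {}      # page -> contents

def read_fault(manager, requester, page):
    """Serve a read fault: record the reader, then copy the page from the owner."""
    owner = manager.owner[page]
    manager.copy_set.setdefault(page, set()).add(requester.name)
    owner.access[page] = "read"              # owner is downgraded to read access
    requester.data[page] = owner.data[page]  # owner sends a copy of the page
    requester.access[page] = "read"

# Assumed scenario: N2 owns p6 and N1 takes a read fault on it.
mgr = Manager()
n1, n2 = Node("N1"), Node("N2")
n2.data["p6"], n2.access["p6"] = "contents of p6", "write"
mgr.owner["p6"] = n2
read_fault(mgr, n1, "p6")
```

After the call, N2 is still the owner and N1 appears in the copy set of p6.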
15CMA's Write Fault Handler
[Figure: write fault on p6 under the centralized manager, with Node1, Node2 and Node3. Step 1: the faulting node asks the manager for p6. Manager tables: Owner P6 -> N3; Copy Set P6 -> N2.]
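A write fault under a centralized manager can be sketched in the same simplified style. The scenario (N2 owns p6, N1 holds a read copy, N3 faults), the `Node` class, and the `write_fault` function are assumptions for illustration; real handlers exchange messages and hold locks, which this sketch omits.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.access = {}  # page -> "read" or "write"; absent means no access
        self.data = {}    # page -> contents

def write_fault(owner_of, copy_set, nodes, requester, page):
    """Serve a write fault: invalidate copies, then move the page and its ownership."""
    old_owner = nodes[owner_of[page]]
    contents = old_owner.data[page]          # fetch before anything is dropped
    for name in copy_set.get(page, set()):
        if name != requester.name:           # invalidate every read copy
            nodes[name].access.pop(page, None)
            nodes[name].data.pop(page, None)
    copy_set[page] = set()
    if old_owner is not requester:           # old owner gives up the page
        old_owner.access.pop(page, None)
        old_owner.data.pop(page, None)
    requester.data[page] = contents
    requester.access[page] = "write"
    owner_of[page] = requester.name

nodes = {n: Node(n) for n in ("N1", "N2", "N3")}
for n in ("N1", "N2"):
    nodes[n].data["p6"], nodes[n].access["p6"] = "contents of p6", "read"
owner_of = {"p6": "N2"}    # manager's owner table
copy_set = {"p6": {"N1"}}  # manager's copy set
write_fault(owner_of, copy_set, nodes, nodes["N3"], "p6")
```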
16Summary of the CMA
- Straightforward and easy to implement
- However, the central manager becomes a traffic bottleneck
17Distributed Manager Algorithm(DMA)
18Fixed Manager
- Each processor manages a predetermined subset of the pages
- Difficulty
  - Finding an appropriate mapping from pages to processors, since different applications may have different page access patterns
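One simple fixed mapping spreads pages evenly over the processors with a modulo function. The slides do not say which mapping is used, so the function below is an assumed example, and `manager_of` is a hypothetical name.

```python
def manager_of(page_number, num_processors):
    """Assumed fixed mapping: page p is managed by processor p mod N."""
    return page_number % num_processors

# With 4 processors, pages 0..7 are assigned round-robin.
managers = [manager_of(p, 4) for p in range(8)]
```

A mapping like this balances the number of pages per manager, but not the number of faults per manager, which is the difficulty noted above.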
19Dynamic Manager
- A simple way to locate the manager is to broadcast the request, but that adds a lot of overhead
- Use a probOwner chain instead
  - Kept in each processor's local PTable
  - Initially, every probOwner is set to the same processor
  - Changes on a write-page fault as well as a read-page fault
  - Points to the true owner or to a probable one
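Following a probOwner chain on a write fault can be sketched as below. The dictionaries, the node names, and the `true_owner` table are assumptions: `true_owner` is a simulation-only oracle, whereas a real node knows locally whether it owns the page. Read faults, locking and messages are left out.

```python
def find_owner(prob_owner, true_owner, start, page):
    """Follow probOwner hints from `start` until the true owner is reached."""
    path = [start]
    node = start
    while node != true_owner[page]:
        node = prob_owner[node][page]  # forward the request along the hint
        path.append(node)
    return path

def write_fault(prob_owner, true_owner, requester, page):
    """Locate the owner, then point every visited node at the new owner."""
    path = find_owner(prob_owner, true_owner, requester, page)
    for node in path:
        prob_owner[node][page] = requester
    true_owner[page] = requester
    return path

# Assumed initial hints: N1 -> N2 -> N3, and N3 is the true owner of p6.
prob_owner = {"N1": {"p6": "N2"}, "N2": {"p6": "N3"}, "N3": {"p6": "N3"}}
true_owner = {"p6": "N3"}
path = write_fault(prob_owner, true_owner, "N1", "p6")
```

In this run the request visits N1, N2 and N3 in turn; afterwards all three hints point directly at N1, so later requests need fewer forwarding steps.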
20Dynamic DMA's Read Fault Handling
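[Figure: read fault on p6 with dynamic distributed managers across Node 1, Node 2 and Node 3, sharing pages p1 to p8. Step 1: the faulting node asks for p6. Each node keeps ProbOwner, Copy Set and Access entries for P6; values appearing in the figure: ProbOwner P6 -> N2, N3, N3; Copy Set P6 -> N2, N2, (empty); Access P6 -> N/A, READ, READ.]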
21Dynamic DMA's Write Fault Handling
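[Figure: write fault on p6 with dynamic distributed managers across Node 1, Node 2 and Node 3, sharing pages p1 to p8. Step 1: the faulting node asks for p6. Each node keeps ProbOwner, Copy Set and Access entries for P6; values appearing in the figure: ProbOwner P6 -> N2, N3, N3; Copy Set P6 -> N2, N2, (empty); Access P6 -> N/A, READ, READ.]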
22Summary of the Distributed Manager Algorithm
- Fixed DMA alleviates the centralized manager's bottleneck, but a good page allocation scheme may be hard to find
- Dynamic DMA is more flexible and may adapt more easily to the locality of memory accesses
23Experiment and results
24Experimental System: IVY
- Integrated Shared Virtual Memory at Yale
- Distributed environment
  - Apollo DOMAIN computers
  - Modified Aegis operating system
  - Apollo ring network
- Memory coherence algorithms
  - Centralized manager
  - Fixed distributed manager
  - Dynamic distributed manager
25Benchmarks and Metric
- Practical parallel programs
  - Parallel Jacobi program for 3D partial differential equations (PDEs)
  - Parallel matrix multiply C = AB
  - Parallel dot-product
- Speedup = T_single-processor / T_multi-processor
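The speedup metric is a plain ratio of execution times. The timings below are hypothetical values chosen only to show the arithmetic; they are not results from the experiments.

```python
def speedup(t_single, t_multi):
    """Speedup = single-processor time divided by multi-processor time."""
    return t_single / t_multi

# Hypothetical timings in seconds, for illustration only.
example = speedup(120.0, 20.0)
```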
26Speedup of 3D PDE
27Speedup of dot-product
28Speedup of Matrix multiplication
29Overhead of the algorithms
30Conclusion
- Implementing shared virtual memory on a cluster system is indeed practical
- The dynamic distributed manager algorithm has the most desirable overall features