1
Single System Abstractionsfor Clusters of
Workstations
  • Bienvenido Vélez

2
What is a cluster?
A collection of loosely connected self-contained
computers cooperating to provide the abstraction
of a single one
Possible System Abstractions
  • Massively parallel processor: characterized by fine grain parallelism and fast interconnects
  • Multi-programmed system: characterized by coarse grain concurrency and independent nodes
Transparency is a goal
3
Question
Compare three approaches to providing the abstraction
of a single system for clusters of workstations,
using the following criteria
  • Transparency
  • Availability
  • Scalability

4
Contributions
  • Improvements to the Microsoft Cluster Service
  • better availability and scalability
  • Adaptive Replication
  • Automatically adapting replication levels to
    maintain availability as the cluster grows

5
Outline
  • Comparison of approaches
  • Transparent remote execution (GLUnix)
  • Preemptive load balancing (MOSIX)
  • Highly available servers (Microsoft Cluster
    Service)
  • Contributions
  • Improvements to the MS Cluster Service
  • Adaptive Replication
  • Conclusions

6
GLUnix
7
GLUnix Transparent Remote Execution
[Diagram: the user runs `glurun make` on the home node; the request Execute(make, env) goes to the master daemon on the master node, which selects a remote node; the node daemon there forks and execs make; stdin, stdout/stderr, and signals are forwarded between the home node and the remote node]
  • Dynamic load balancing (sketched below)
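The flow above can be approximated with a short Python sketch. It is a minimal, hypothetical stand-in: the NODE_LOADS table, pick_node(), and the use of ssh are assumptions for illustration, not the actual GLUnix daemons or protocol.

```python
import subprocess

# Hypothetical load table; a real master daemon tracks this dynamically.
NODE_LOADS = {"node1": 0.3, "node2": 1.7, "node3": 0.9}

def pick_node():
    """Master daemon role: dynamic load balancing picks the least-loaded node."""
    return min(NODE_LOADS, key=NODE_LOADS.get)

def glurun(argv, env=None):
    """Home node role: ship the command to the chosen remote node.

    GLUnix uses its own master/node daemons; ssh is only a stand-in here so
    that stdin, stdout/stderr, and signals stay attached to the user's shell."""
    node = pick_node()
    return subprocess.run(["ssh", node] + argv, env=env).returncode

# usage (hypothetical): exit_code = glurun(["make"])
```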

8
GLUnix: Virtues and Limitations
  • Transparency
  • home node transparency limited by user-level
    implementation
  • interactive jobs supported
  • special commands for running cluster jobs
  • Availability
  • detects and masks node failures
  • master process is single point of failure
  • Scalability
  • master process performance bottleneck

9
MOSIX
10
MOSIX: Preemptive Load Balancing
[Diagram: processes spread across cluster nodes 1-5]
  • probabilistic diffusion of load information
  • redirects system calls to home node

11
MOSIX: Preemptive Load Balancing
Periodically (after a delay): exchange local load with a
random node, then consider migrating a process to the node
with minimal cost (sketched below)
  • Keeps load information from a fixed number of nodes
  • load = average size of ready queue
  • cost = f(cpu time) + f(communication) + f(migration time)
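A minimal sketch of this loop in Python, assuming hypothetical weights, a caller-supplied probe() RPC stub, and a simple linear cost function; none of these constants come from MOSIX itself.

```python
import random

# Hypothetical weights and link bandwidth for the cost function on this slide.
W_CPU, W_COMM, W_MIG = 1.0, 0.5, 0.2
LINK_BANDWIDTH = 100e6  # bytes per second, assumed

def exchange_load(load_vector, peers, probe, window=4):
    """Exchange load with one random peer; keep only a fixed number of entries."""
    peer = random.choice(peers)
    load_vector[peer] = probe(peer)          # probe() is a caller-supplied RPC stub
    while len(load_vector) > window:
        load_vector.pop(next(iter(load_vector)))

def migration_cost(cpu_left, comm_rate, mem_bytes):
    """cost = f(cpu time) + f(communication) + f(migration time)."""
    return (W_CPU * cpu_left
            + W_COMM * comm_rate
            + W_MIG * mem_bytes / LINK_BANDWIDTH)

def consider_migration(local_load, processes, load_vector):
    """Pick a less loaded destination and the process with minimal cost.

    `processes` is a list of (pid, cpu_left, comm_rate, mem_bytes) tuples."""
    candidates = {n: l for n, l in load_vector.items() if l < local_load}
    if not candidates or not processes:
        return None
    target = min(candidates, key=candidates.get)
    pid, *_ = min(processes, key=lambda p: migration_cost(*p[1:]))
    return pid, target

# usage (hypothetical):
# loads = {}
# exchange_load(loads, ["node2", "node3"], probe=lambda n: 0.5)
# consider_migration(1.8, [(101, 4.0, 0.1, 64e6)], loads)
```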

12
MOSIX: Virtues and Limitations
  • Transparency
  • limited home node transparency
  • Availability
  • masks node failures
  • no process restart
  • preemptive load balancing limits portability and
    performance
  • Scalability
  • flooding and swinging possible
  • low communication overhead

13
Microsoft Cluster Service
14
Microsoft Cluster Service (MSCS): Highly Available Server Processes
[Diagram: two nodes, each running MSCS and a server (SQL on one, Web on the other), with the status database replicated on both]
  • replicated consistent node/server status database
  • migrates servers from failed nodes

15
Microsoft Cluster Service Hardware Configuration
[Diagram: the nodes running the Web and SQL servers are connected by Ethernet and by a shared SCSI bus to disks holding status data; the shared bus is a bottleneck and the shared disks are single points of failure]
16
MSCS: Virtues and Limitations
  • Transparency
  • server migration transparent to clients
  • Availability
  • servers migrated from failed nodes
  • shared disks are single points of failure
  • Scalability
  • manual static configuration
  • manual static load balancing
  • shared disk bus is performance bottleneck

17
Summary of Approaches
  • GLUnix - Transparency: home node limited; Availability: masks failures, single point of failure, no fail-over; Scalability: load balancing, bottleneck
  • MOSIX - Transparency: home node transparent; Availability: masks failures, no fail-over; Scalability: load balancing
  • MSCS - Transparency: transparent to clients; Availability: server fail-over, single point of failure; Scalability: bottleneck
18
Re-designing MSCS
19
Transaction-based Replication
[Diagram: a logical write(x) operates on the object; replication maps it to write(x1), ..., write(xn), which operate on the copies and execute as transactions]
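As a rough illustration of the picture above, the sketch below maps one logical write to per-copy writes inside an all-or-nothing transaction. The Copy class and the simple prepare/commit protocol are hypothetical simplifications, not the replication service described on the next slide.

```python
class Copy:
    """One replica of the object; prepare/commit/abort are deliberately trivial."""
    def __init__(self):
        self.value, self.staged = None, None
    def prepare(self, value):
        self.staged = value
        return True          # a real resource could refuse here
    def commit(self):
        self.value, self.staged = self.staged, None
    def abort(self):
        self.staged = None

def replicated_write(copies, value):
    """write(x) -> write(x1), ..., write(xn), executed all-or-nothing."""
    if all(copy.prepare(value) for copy in copies):
        for copy in copies:
            copy.commit()
        return True
    for copy in copies:
        copy.abort()
    return False

# usage: replicated_write([Copy(), Copy(), Copy()], b"new data")
```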
20
Re-designing MSCS
  • Idea: a new core resource group fixed on every node
  • special disk resource
  • distributed transaction processing resource
  • transactional replicated file storage resource
  • Implement consensus with transactions
    (El-Abbadi-Toueg algorithm)
  • changes to configuration DB
  • cluster membership service
  • Improvements
  • eliminates complex global update and regroup
    protocols
  • switchover not required for application data
  • provides new generally useful service
  • Transactional replicated object storage

21
Re-designed MSCS with transactional replicated
object storage
[Diagram: each node runs the Cluster Service (node manager, resource manager), which uses RPC to talk to a Resource Monitor hosting the resource DLLs; the node also hosts the new Replicated Storage Service and Transaction Service; nodes communicate over the network]
22
ADAPTIVE REPLICATION
23
Adaptive Replication: Problem
What should a replication service do when nodes
are added to the cluster?
replication vs. migration
Goal: maintain availability
Hypothesis
  • Must alternate migration with replication
  • Replication (R) should happen significantly less
    often than migration (M)

24
Replication increases number of copies of objects
[Diagram: with 2 nodes, objects x and y each have a copy on both nodes; when 2 nodes are added, replication creates copies of x and y on the new nodes as well, so on 4 nodes each object has 4 copies]
25
Migration re-distributes objects across all nodes
[Diagram: with 2 nodes, objects x and y each have a copy on both nodes; when 2 nodes are added, migration spreads the existing copies across the 4 nodes instead of creating new ones, so each object still has 2 copies]
26
Simplifying Assumptions
  • System keeps same number of copies k of each
    object
  • System has n nodes
  • Initially n = k
  • n increases k nodes at a time
  • ignore partitions in computing availability

27
Conjecture: Highest availability is obtained if the
objects are partitioned into q = n / k groups living on
disjoint sets of nodes.
Example: k = 3, n = 6, q = 2
[Diagram: q = 2 groups of k = 3 nodes each; every object in a group has a copy (x) on each node of that group]
Let's call this optimal migration
28
Adaptive Replication Necessary
Let each node have availability p. The availability of
the system is then approximately A(k, n) = 1 - q (1 - p)^k,
with q = n / k.
Since optimal migration always increases q, migration
decreases availability (albeit slowly).
Adaptive replication may be necessary to
maintain availability
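A small worked example of this argument, using the approximation above with independent node failures and an assumed per-node availability of p = 0.99; the numbers are illustrative only.

```python
def availability(k, n, p=0.99):
    """A(k, n) ~ 1 - q * (1 - p)**k with q = n / k disjoint groups of k copies."""
    q = n // k
    return 1 - q * (1 - p) ** k

# Optimal migration alone: k stays 3 while n grows, so q grows and A(k, n) drops.
for n in (3, 6, 12, 24):
    print(n, availability(3, n))   # ~0.999999, 0.999998, 0.999996, 0.999992

# Adaptive replication: occasionally raising k restores availability.
print(availability(4, 24))         # ~0.99999994 with q = 6 groups of 4 copies
```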
29
Adaptive Replication: Further Work
  • determine when it matters in real situations
  • relax assumptions
  • formalize arguments

30
Support Slides
31
Home Node Single System Image
32
Talk focuses on Coarse Grain Layer
  • Berkeley NOW - LCM layers supported: NET, CGP, FGP; Mechanisms used: Active Messages, transparent remote execution, message passing API
  • MOSIX - LCM layers supported: NET, CGP; Mechanisms used: preemptive load balancing, kernel-to-kernel RPC
  • MSCS - LCM layers supported: CGP; Mechanisms used: node regroup, resource failover/switchover
  • ParaStation - LCM layers supported: NET, FGP; Mechanisms used: user level protocol stack with semaphores
33
GLUnix: Characteristics
  • Provides special user commands for managing
    cluster jobs
  • Both batch and interactive jobs can be executed
    remotely
  • Supports dynamic load balancing

34
MOSIX preemptive load balancing
[Flowchart: on each load-balancing step, if a less loaded node exists, select the candidate process p with maximal impact on local load; if p can migrate, signal p to consider migration; otherwise return]
35
xFS: distributed log-based file system
[Diagram: the client accumulates dirty data blocks into a log segment, which is split into data stripes (1, 2, 3) plus a parity stripe and written across a stripe group of storage servers; writes are always sequential]
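To make the write path concrete, here is a minimal sketch of building one log segment under assumed parameters (block size, a stripe group of three, XOR parity). It only illustrates the idea; it is not the actual xFS layout or code.

```python
from functools import reduce

BLOCK = 4096  # assumed block size

def build_segment(dirty_blocks, group_size=3):
    """Pack dirty blocks into one log segment, split it into `group_size`
    data stripes, and compute an XOR parity stripe (RAID-5 style)."""
    data = b"".join(dirty_blocks)
    stripe_len = -(-len(data) // group_size)              # ceiling division
    data = data.ljust(stripe_len * group_size, b"\0")     # pad the last stripe
    stripes = [data[i * stripe_len:(i + 1) * stripe_len] for i in range(group_size)]
    parity = bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*stripes))
    return stripes, parity

# usage: each data stripe and the parity stripe would be written sequentially
# to a different storage server in the stripe group.
stripes, parity = build_segment([b"x" * BLOCK, b"y" * BLOCK, b"z" * BLOCK])
```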
36
xFS: Virtues and Limitations
  • Exploits aggregate bandwidth of all disks
  • No need to buy expensive RAIDs
  • No single point of failure
  • Reliability: relies on accumulating dirty blocks
    to generate large sequential writes
  • Adaptive replication potentially more difficult

37
Microsoft Cluster Service (MSCS) GOAL
[Diagram: a wrapper turns an off-the-shelf server application into a cluster-aware server application, making it highly available]
38
MSCS: Abstractions
  • Node
  • Resource
  • e.g. disks, IP addresses, servers
  • Resource dependency
  • e.g. DBMS depends on disk holding its data
  • Resource group
  • e.g. a server and its IP address
  • Quorum resource
  • logs configuration data
  • breaks ties during membership changes
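A toy data model for the abstractions listed above; the class names and the start_order helper are illustrative assumptions, not the MSCS API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    """An entity MSCS manages, e.g. a disk, an IP address, a server."""
    name: str
    depends_on: List["Resource"] = field(default_factory=list)

    def start_order(self):
        """Dependencies come online before the resource that needs them."""
        order = []
        for dep in self.depends_on:
            order.extend(dep.start_order())
        return order + [self]

@dataclass
class ResourceGroup:
    """Unit of fail-over, e.g. a server together with its IP address."""
    name: str
    resources: List[Resource]

# usage: a DBMS depends on the disk holding its data
disk = Resource("disk0")
dbms = Resource("sql-server", depends_on=[disk])
group = ResourceGroup("sql-group", [disk, dbms])
print([r.name for r in dbms.start_order()])   # ['disk0', 'sql-server']
```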

39
MSCS: General Characteristics
  • Global state of all nodes and resources
    consistently replicated across all nodes (write
    all using atomic multicast protocol)
  • Node and resource failures detected
  • Resources of failed nodes migrated to surviving
    nodes
  • Failed resources restarted

40
MSCS System Architecture
[Diagram: MSCS system architecture; nodes communicate over the network]
41
MSCS virtually synchronous regroup operation
regroup
Activate
  • determine the nodes in its connected component
  • determine whether its component is the primary
  • elect a new tie-breaker
  • if this node is the new tie-breaker, broadcast its
    component as the new membership

Closing
Pruning
  • if not in the new membership, halt

Cleanup 1
  • install the new membership from the new tie-breaker
  • acknowledge ready to commit

Cleanup 2
  • if this node owns the quorum disk, log the membership change

end
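A skeletal walk-through of these phases as straight-line code; every helper on the hypothetical `node` object (connected_component, elect_tie_breaker, and so on) is a stand-in for the real membership machinery, and the phase synchronization between nodes is omitted.

```python
def regroup(node):
    # Activate
    component = node.connected_component()
    node.is_primary(component)                 # primary component rule, next slide
    tie_breaker = node.elect_tie_breaker(component)
    if node.id == tie_breaker:
        node.broadcast_membership(component)
    # Closing / Pruning
    membership = node.receive_membership()
    if node.id not in membership:
        node.halt()
    # Cleanup 1
    node.install_membership(membership)
    node.ack_ready_to_commit()
    # Cleanup 2
    if node.owns_quorum_disk():
        node.log_membership_change(membership)
```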
42
MSCS: Primary Component Determination Rule
A node is in the primary component if one of the
following holds:
  • the node is connected to a majority of the previous
    membership
  • the node is connected to half (> 2) of the previous
    members and one of those is the tie-breaker
  • the node is isolated, the previous membership had two
    nodes, and the node owned the quorum resource during
    the previous membership
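A direct transcription of this rule into a boolean check; the argument names and the handling of corner cases are assumptions.

```python
def in_primary_component(connected, previous_members, tie_breaker, owned_quorum):
    """`connected` is the set of previous members this node can still reach,
    including itself."""
    n_prev = len(previous_members)
    n_conn = len(connected)
    # 1. connected to a majority of the previous membership
    if n_conn > n_prev / 2:
        return True
    # 2. connected to exactly half (> 2 nodes) of the previous members,
    #    and that half contains the tie-breaker
    if n_conn == n_prev / 2 and n_conn > 2 and tie_breaker in connected:
        return True
    # 3. isolated, previous membership had two nodes, and this node
    #    owned the quorum resource during the previous membership
    if n_conn == 1 and n_prev == 2 and owned_quorum:
        return True
    return False

# usage: a 6-node cluster splits 3/3 and this half holds the tie-breaker
print(in_primary_component({"a", "b", "c"},
                           {"a", "b", "c", "d", "e", "f"}, "b", False))  # True
```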

43
MSCS switchover
[Diagram: on a node failure, MSCS switches the failed node's disks over to a surviving node; every shared disk is a single point of failure]
Alternative: replication
44
Summary of Approaches
  • Berkeley NOW - Transparency: home node limited; Availability: single point of failure, no fail-over; Performance: load balancing, bottleneck
  • MOSIX - Transparency: home node transparent; Availability: masks failures, no fail-over, tolerates partitions; Performance: load balancing, low message overhead
  • MSCS - Transparency: server migration transparent to clients; Availability: single point of failure, low MTTR, tolerates partitions; Performance: bottleneck
45
Comparing Approaches: Design Goals
  • Berkeley NOW - LCM layers supported: NET, CGP, FGP; Mechanisms used: Active Messages, transparent remote execution, message passing API
  • MOSIX - LCM layers supported: NET, CGP; Mechanisms used: preemptive load balancing, kernel-to-kernel RPC
  • MSCS - LCM layers supported: CGP; Mechanisms used: cluster membership services, resource fail-over
  • ParaStation - LCM layers supported: NET, FGP; Mechanisms used: user level protocol stack, network interface hardware
46
Comparing Approaches: Global Information Management
  • Berkeley NOW - Approach: centralized; Description: the master daemon collects load and job state for the whole cluster
  • MOSIX - Approach: distributed probabilistic; Description: each node exchanges load information with randomly chosen nodes
  • MSCS - Approach: replicated consistent; Description: node/resource status database consistently replicated on every node
47
Comparing Approaches: Fault-tolerance
  • Berkeley NOW - Single points of failure: master process; Possible solution: process pairs
  • MOSIX - Single points of failure: none; Possible solution: N.A.
  • MSCS - Single points of failure: quorum resource, shared disks; Possible solution: virtual partitions replication algorithm
48
Comparing Approaches: Load Balancing
  • manual (MSCS): the system administrator manually assigns processes to nodes
  • static: processes statically assigned to processors
  • dynamic (Berkeley NOW): uses dynamic load information to assign processes to processors
  • preemptive (MOSIX): migrates processes in the middle of their execution
49
Comparing Approaches: Process Migration
  • none (Berkeley NOW): processes run to completion once assigned to a processor
  • cooperative shutdown/restart (MSCS): processes brought offline at the source and online at the destination
  • transparent (MOSIX): process migrated at any point during execution
50
Example: k = 3, n = 3
[Diagram: one group of objects x with a copy on each of the 3 nodes]
Each letter (e.g. x above) represents a group of
objects with copies in the same subset of nodes
51
[Diagram: taxonomy of redundancy techniques: fail-over/failback and switch-over (MSCS), error-correcting codes (RAID, xFS), and replication, which subdivides into primary copy (HARP), voting (quorum consensus), and voting with views (virtual partitions)]