Title: Distributed File Systems
1. Distributed File Systems
2. Distributed Systems
- Introduction: advantages of distributed systems
- 15. Structures: network types, design
- 16. File Systems: naming, caching, updating
- 17. Coordination: event ordering, mutual exclusion
3. Multi-CPU Systems
[Figure: a shared-memory multiprocessor, a message-passing multicomputer, and a wide-area distributed system (C = CPU, M = memory). Tanenbaum, Modern Operating Systems, 2nd Ed., p. 505]
4. Examples of Multi-CPU Systems
- Multiprocessor: a quad-CPU PC
- Multicomputer: 512 nodes in a room working on pharmaceutical modelling
- Distributed system: thousands of machines loosely cooperating over the Internet
Tanenbaum, p. 549
5. Types of Multi-CPU Systems
6. Interconnect Topologies
[Figure: interconnect topologies (single switch, ring, grid, double torus, cube, hypercube); the annotations note the trade-off between a smaller diameter and a larger number of links. Tanenbaum, Modern Operating Systems, 2nd Ed., p. 528]
7. Chapter 16: Distributed File Systems
- Background
- Naming and Transparency
- Remote File Access
- Stateful versus Stateless Service
- File Replication
- Example Systems
8. Background
- Distributed file system (DFS): a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources.
- A DFS manages a set of dispersed storage devices.
- The overall storage space is composed of different, remotely located, smaller storage spaces.
- A component unit is the smallest set of files that can be stored on a single machine, independently of other units.
- There is usually a correspondence between constituent storage spaces and sets of files.
9. DFS Parts
- Service: a software entity running on one or more machines and providing a particular type of function to a priori unknown clients.
- Server: service software running on a single machine.
- Client: a process that can invoke a service using a set of operations that forms its client interface.
- So a file system provides file services to clients.
10. DFS Features
- Client interface
- A set of primitive file operations (create, delete, read, write).
- Transparency
- Local and remote files are indistinguishable.
- The multiplicity of its servers and storage devices should be invisible.
- Response time is ideally comparable to that of a local file system.
11. DFS Implementation
- Various implementations:
- part of a distributed operating system, or
- a software layer managing communication between conventional operating systems.
Tanenbaum, p. 551
12. Naming and Transparency
- Naming: the mapping between logical and physical objects.
- Multilevel mapping: an abstraction of a file that hides the details of how and where on the disk the file is actually stored.
- A transparent DFS hides the location in the network where the file is stored.
- A file can be replicated at several sites:
- the mapping returns a set of the locations of this file's replicas;
- both the existence of multiple copies and their locations are hidden.
13. Naming Structures
- Location transparency: the file name does not reveal the file's physical storage location.
- e.g. /server1/dir/dir2/x says that the file is located on server1, but it does not tell where that server is located.
- The file name still denotes a specific, although hidden, set of physical disk blocks.
- A convenient way to share data.
- Can expose the correspondence between component units and machines.
- However, if file x is large, the system might want to move x from server1 to server2, but the path name would then change from /server1/dir/dir2/x to /server2/dir/dir2/x.
14. Naming Structures
- Location independence: the file name does not need to be changed when the file's physical storage location changes.
- A better file abstraction.
- Promotes sharing of the storage space itself.
- Separates the naming hierarchy from the storage-devices hierarchy, allowing file migration.
- Difficult to achieve; only a few experimental examples exist (e.g. the Andrew File System).
- Even remote mounting will not achieve location independence, since it is not normally possible to move a file from one file group (the unit of mounting) to another and still be able to use the old path name.
15. Naming Schemes: Three Main Approaches
- Combination names
- Files are named by a combination of their host name and local name.
- Guarantees a unique system-wide name,
- e.g. host:local-name
- Mounting file systems (see the sketch after this list)
- Attach remote directories to local directories, giving the appearance of a coherent directory tree.
- Automount allows mounts to be done on demand.
- Global name structure
- Total integration of the component file systems.
- Spans all the files in the system.
- Location-independent file identifiers link files to component units.
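To make the mounting scheme concrete, here is a minimal sketch (in Python) of a client resolving a path against its mount table; the table contents, server names, and function names are illustrative assumptions, not part of the original slides.

# Minimal sketch (assumed example): resolving a path against a client-side
# mount table, as in the "mounting" naming scheme above.

# Hypothetical mount table: local mount point -> (server, remote export path)
MOUNT_TABLE = {
    "/mnt/projects": ("server1", "/export/projects"),
    "/mnt/home":     ("server2", "/export/home"),
}

def resolve(path: str):
    """Return (server, remote_path) for a path, or None if it is purely local."""
    # Longest-prefix match so that nested mount points behave sensibly.
    for mount_point in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            server, export = MOUNT_TABLE[mount_point]
            return server, export + path[len(mount_point):]
    return None  # not under any mount point: handled by the local file system

if __name__ == "__main__":
    print(resolve("/mnt/projects/dir/dir2/x"))  # ('server1', '/export/projects/dir/dir2/x')
    print(resolve("/usr/local/bin/python"))     # None, so treated as a local file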
16. Types of Middleware
- Document-based
- Each page has a unique address.
- Hyperlinks within each page point to other pages.
- File System based
- The distributed system looks like a local file system.
- Shared Object based
- All items are objects, bundled with access procedures called methods.
- Coordination-based
- The network appears as a large, shared memory.
17. Document-based Middleware
- Makes a distributed system look like a giant collection of hyperlinked documents,
- e.g. hyperlinks on web pages.
- Steps in accessing the web page http://www.acm.org/dl/faq.html (see the sketch after this list):
- The browser asks DNS for the IP address of www.acm.org.
- DNS replies with 199.222.69.151.
- The browser connects by TCP to port 80 of 199.222.69.151.
- The browser requests the file dl/faq.html.
- The TCP connection is released.
- The browser displays all the text in dl/faq.html.
- The browser fetches and displays all the images in dl/faq.html.
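The steps above can be followed almost literally with the standard socket library. The sketch below is purely an illustration of the protocol steps; the real www.acm.org may well redirect this plain-HTTP request to HTTPS.

# Minimal sketch of the page-access steps listed above, standard library only.
import socket

host, path = "www.acm.org", "/dl/faq.html"

ip = socket.gethostbyname(host)                # 1. ask DNS for the IP address
with socket.create_connection((ip, 80)) as s:  # 2. TCP connection to port 80
    request = (f"GET {path} HTTP/1.1\r\n"
               f"Host: {host}\r\n"
               f"Connection: close\r\n\r\n")
    s.sendall(request.encode("ascii"))         # 3. request the file
    reply = b""
    while chunk := s.recv(4096):               # 4. read until the server closes
        reply += chunk
print(reply.split(b"\r\n", 1)[0].decode())     # status line, e.g. "HTTP/1.1 301 ..."
# 5. The connection is released on leaving the "with" block; a browser would now
#    parse the HTML and fetch each embedded image in the same way.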
18. File-system based Middleware
- Makes a distributed system look like one great big file system.
- A single global file system, with users all over the world able to read and write the files for which they have authorization.
[Figure: two models of remote file access: the upload/download model, in which whole files move between client and server (e.g. AFS), and the remote access model, in which individual requests are sent to the server (e.g. NFS)]
19. Remote File Access
- Reduce network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally.
- If the needed data are not already cached, a copy is brought from the server to the user.
- Accesses are performed on the cached copy.
- Files are identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches.
- Cache-consistency problem: keeping the cached copies consistent with the master file.
- In effect, this is network virtual memory, with the backing store at a remote server.
20. Network Cache Location
- Disk cache
- More reliable; survives crashes.
- Main-memory cache
- Permits workstations to be diskless.
- Data can be accessed more quickly.
- The technology trend is toward bigger, less expensive memories.
- Server caches (used to speed up disk I/O) are in main memory regardless of where user caches are located; using main-memory caches on the user machine as well permits a single caching mechanism for servers and users.
- e.g. NFS has memory caching and an optional disk cache.
21. Cache Update Policy
- Write-through policy: write data through to disk as soon as they are placed in any cache.
- Reliable, but gives poor write performance.
- Delayed-write policy: modifications are written to the cache and then written through to the server later (see the sketch after this list).
- Fast: write accesses complete quickly.
- Less reliable: unwritten data are lost whenever a user machine crashes.
- Update on flush from the cache
- but flushes happen at irregular intervals.
- Update on a regular scan
- Scan the cache and flush blocks that have been modified since the last scan (NFS).
- Write-on-close: write data back to the server when the file is closed (AFS).
- Best for files that are open for long periods and frequently modified.
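The following minimal sketch contrasts the write-through and delayed-write policies; the class and method names are assumptions made for illustration.

# Minimal sketch: write-through vs. delayed-write (write-back) block caches.
class WriteThroughCache:
    def __init__(self, server):
        self.server, self.blocks = server, {}

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.server.write(block_no, data)   # reliable: the server is updated at once

class DelayedWriteCache:
    def __init__(self, server):
        self.server, self.blocks, self.dirty = server, {}, set()

    def write(self, block_no, data):
        self.blocks[block_no] = data        # fast: only the cache is touched
        self.dirty.add(block_no)            # remembered for a later flush

    def flush(self):
        """Called on a regular scan (NFS-style) or on close (AFS-style)."""
        for block_no in sorted(self.dirty):
            self.server.write(block_no, self.blocks[block_no])
        self.dirty.clear()

class FakeServer:
    def __init__(self): self.disk = {}
    def write(self, block_no, data): self.disk[block_no] = data

if __name__ == "__main__":
    srv = FakeServer()
    cache = DelayedWriteCache(srv)
    cache.write(7, b"hello")
    print(srv.disk)     # {}  (this data would be lost if the client crashed now)
    cache.flush()
    print(srv.disk)     # {7: b'hello'}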
22. Consistency
- Is the locally cached copy of the data consistent with the master copy? How do we verify the validity of cached data?
- Client-initiated approach (see the sketch after this list)
- The client initiates a validity check.
- The server checks whether the local data are consistent with the master copy.
- Check before every access, or use timed checks.
- Server-initiated approach
- The server records, for each client, the (parts of) files it caches.
- When the server detects a potential inconsistency, it reacts,
- e.g. when the same file is open for reading and writing on different clients.
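As an illustration of the client-initiated approach, the sketch below validates a cached copy by comparing modification times with the server, roughly in the spirit of NFS attribute checks; all class and method names are assumptions.

# Minimal sketch: before using a cached file, the client compares the
# modification time it cached against the server's current one.
import time

class Server:
    def __init__(self):
        self.files = {"x": (b"version 1", time.time())}   # name -> (data, mtime)

    def get_mtime(self, name):   return self.files[name][1]
    def read(self, name):        return self.files[name]
    def write(self, name, data): self.files[name] = (data, time.time())

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}               # name -> (data, mtime)

    def read(self, name):
        if name in self.cache:
            data, cached_mtime = self.cache[name]
            if self.server.get_mtime(name) == cached_mtime:
                return data                                 # cache is still valid
        data, mtime = self.server.read(name)                # (re)fetch from the server
        self.cache[name] = (data, mtime)
        return data

if __name__ == "__main__":
    srv = Server(); c = Client(srv)
    print(c.read("x"))              # fetched: b'version 1'
    srv.write("x", b"version 2")    # another client updates the master copy
    print(c.read("x"))              # validity check fails, refetched: b'version 2'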
23. Caching and Remote Service
- Caching
- Faster, especially with locality in file accesses.
- Servers are contacted only occasionally (rather than for each access).
- Reduced server load and network traffic.
- Enhanced potential for scalability.
- Lower network overhead, as data are transmitted in bigger chunks.
- Remote service method
- Useful for diskless machines.
- Avoids the cache-consistency problem.
- The inter-machine interface mirrors the local user-to-file-system interface.
24. Stateful File Service
- Mechanism
- The client opens a file.
- The server fetches information about the file from its disk, stores it in its memory, and gives the client a connection identifier unique to the client and the open file.
- The identifier is used for subsequent accesses until the session ends.
- The server must reclaim the main-memory space used by clients who are no longer active.
- Increased performance
- Fewer disk accesses.
- A stateful server knows whether a file was opened for sequential access and can thus read ahead the next blocks.
25. Stateless File Server
- Mechanism (see the sketch after this list)
- Each request is self-contained.
- No state information is retained between requests.
- Each request identifies the file and the position in the file.
- File open and close are local to the client.
- Design implications
- Reliable: survives server crashes.
- Slower, with longer request messages.
- System-wide file names are needed, to avoid name translation.
- File requests should be idempotent: repeating a request has the same effect as performing it once.
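The following sketch contrasts a stateful read protocol, where the server tracks the open file and current offset per connection, with a stateless one, where every request names the file, offset, and count; the class names are illustrative assumptions.

# Minimal sketch: stateful vs. stateless read requests.
class StatefulServer:
    """The server remembers open files and the current position per session."""
    def __init__(self, files):
        self.files, self.sessions, self.next_id = files, {}, 0

    def open(self, name):
        self.next_id += 1
        self.sessions[self.next_id] = [name, 0]        # connection id -> [file, offset]
        return self.next_id

    def read(self, conn_id, count):
        name, offset = self.sessions[conn_id]
        data = self.files[name][offset:offset + count]
        self.sessions[conn_id][1] = offset + len(data)  # the server advances the offset
        return data

class StatelessServer:
    """Every request carries the file name, offset, and count; nothing is kept."""
    def __init__(self, files):
        self.files = files

    def read(self, name, offset, count):               # self-contained and idempotent
        return self.files[name][offset:offset + count]

if __name__ == "__main__":
    files = {"/dir/x": b"abcdefgh"}
    sf, sl = StatefulServer(files), StatelessServer(files)
    cid = sf.open("/dir/x")
    print(sf.read(cid, 4), sf.read(cid, 4))             # b'abcd' b'efgh'
    print(sl.read("/dir/x", 0, 4), sl.read("/dir/x", 4, 4))  # same data, no session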
26. Recovery from Failures
- Stateful server
- A server failure loses its volatile state.
- Either restore the state by a recovery protocol in dialog with the clients, or
- abort the operations that were under way when the crash occurred.
- Client failure
- The server needs to be aware of client failures in order to reclaim the space allocated to record the state of crashed client processes (orphan detection and elimination).
- Stateless server
- Server failure and recovery are almost unnoticeable.
- A newly restarted server can respond to a self-contained request without any difficulty.
27. File Replication
- Replicas of the same file reside on failure-independent machines.
- Improves availability and shortens service time.
- A replicated file name is mapped to a particular replica.
- The existence of replicas should be invisible to higher levels.
- Replicas are distinguished from one another by different lower-level names.
- Updates: the replicas of a file denote the same logical entity,
- thus an update to any replica must be reflected in all other replicas (e.g. the Locus OS).
- Demand replication: reading a non-local replica causes it to be cached locally, thereby generating a new non-primary replica (see the sketch after this list).
- Updates are made to the primary copy and cause the other replicas to become invalid (e.g. Ibis).
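A minimal sketch of the primary-copy scheme described above: writes go to the primary and invalidate the other replicas, while a read of an invalid replica refreshes it on demand. The class names are assumptions made for illustration.

# Minimal sketch: primary-copy replication with invalidation on update.
class Replica:
    def __init__(self):
        self.data, self.valid = b"", True

class ReplicatedFile:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.primary = self.replicas[0]

    def write(self, data):
        self.primary.data = data                 # update the primary copy
        for r in self.replicas:
            if r is not self.primary:
                r.valid = False                  # other replicas become invalid

    def read(self, preferred: int):
        r = self.replicas[preferred]
        if not r.valid:                          # demand replication: refresh locally
            r.data, r.valid = self.primary.data, True
        return r.data

if __name__ == "__main__":
    f = ReplicatedFile()
    f.write(b"v1")
    print(f.read(2))     # b'v1': replica 2 refreshed from the primary
    f.write(b"v2")
    print(f.read(2))     # b'v2': replica 2 was invalidated and refreshed again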
28. Andrew Distributed Computing Environment
- History
- Under development since 1983 at Carnegie-Mellon University.
- The name honours Andrew Carnegie and Andrew Mellon.
- Highly scalable
- The system is targeted to span over 5000 workstations.
- Distinguishes between client machines (workstations) and dedicated server machines.
- Servers and clients run slightly modified UNIX.
- Workstation LAN clusters are interconnected by a WAN.
29. Andrew File System (AFS)
- Clients are presented with a partitioned space of file names: a local name space and a shared name space.
- Dedicated servers, collectively called Vice, present the shared name space to the clients as a homogeneous, identical, and location-transparent file hierarchy.
- The local name space is the root file system of a workstation, from which the shared name space descends.
- Workstations run the Virtue protocol to communicate with Vice, and are required to have local disks where they store their local name space.
- The servers collectively are responsible for the storage and management of the shared name space.
30. AFS File Operations
- Andrew caches entire files from servers (see the sketch after this list).
- A workstation interacts with Vice servers only during the opening and closing of files.
- Venus, the cache manager running locally on each workstation,
- caches files from Vice when they are opened, and
- stores modified copies of files back when they are closed.
- It also caches the contents of directories and symbolic links, for path-name translation.
- Reading and writing bytes of a file
- are done by the kernel, without Venus intervention, on the cached copy.
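The whole-file caching behaviour can be sketched as follows: fetch on open, read and write locally, store back on close. The ViceServer and Venus classes here are simplified stand-ins written for illustration, not AFS code.

# Minimal sketch of AFS-style whole-file caching with write-on-close.
class ViceServer:
    def __init__(self):
        self.files = {"/shared/doc": b"draft 1"}

    def fetch(self, name):        return self.files[name]
    def store(self, name, data):  self.files[name] = data

class Venus:
    """Client-side cache manager: contacts the server only on open and close."""
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def open(self, name):
        if name not in self.cache:
            self.cache[name] = bytearray(self.server.fetch(name))  # whole file
        return name                                    # use the name as the handle

    def read(self, name):                              # local only, no server traffic
        return bytes(self.cache[name])

    def write(self, name, offset, data):               # local only, no server traffic
        self.cache[name][offset:offset + len(data)] = data
        self.dirty.add(name)

    def close(self, name):
        if name in self.dirty:
            self.server.store(name, bytes(self.cache[name]))  # write-on-close
            self.dirty.discard(name)

if __name__ == "__main__":
    vice = ViceServer(); venus = Venus(vice)
    f = venus.open("/shared/doc")
    venus.write(f, 6, b"2")
    print(vice.files["/shared/doc"])   # b'draft 1': the server is not yet updated
    venus.close(f)
    print(vice.files["/shared/doc"])   # b'draft 2'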
31. Types of Middleware
- Document-based (e.g. the Web)
- Each page has a unique address.
- Hyperlinks within each page point to other pages.
- File System based (e.g. NFS, AFS)
- The distributed system looks like a local file system.
- Shared Object based (e.g. CORBA, Globe)
- All items are objects, bundled with access procedures called methods.
- Coordination-based (e.g. Linda, Jini)
- The network appears as a large, shared memory.
32. Shared Object based Middleware
- Objects
- Everything is an object: a collection of variables bundled with access procedures called methods.
- Processes invoke methods to access the variables.
- Common Object Request Broker Architecture (CORBA)
- Client processes on client machines can invoke operations on objects on (possibly) remote server machines.
- To match up objects from different machines, Object Request Brokers (ORBs) are interposed between client and server.
- Interface Definition Language (IDL)
- tells what methods the object exports, and
- what parameter types each method expects.
33. CORBA Model
[Figure: the CORBA model: client code and client stub on the client machine; skeleton, object adapter, and server code on the server machine; the client and server ORBs communicate over IIOP. Tanenbaum, p. 567]
34. CORBA
- Allows different client and server applications to communicate,
- e.g. a C program can use CORBA to access a COBOL database.
- ORB (Object Request Broker)
- implements the interface specified by the IDL;
- an ORB sits on both the client and the server side.
- IIOP (Internet Inter-ORB Protocol)
- specifies how ORBs communicate with one another.
- Stub: client-side library of IDL object specifications.
- Skeleton: server-side procedure for an IDL-specified object.
- Object adapter: a wrapper that
- registers the object,
- generates object references, and
- activates the object.
35. Remote Method Invocation
- Procedure (see the sketch after this list)
- A process creates a CORBA object and receives its reference.
- The reference can be passed to other processes, or stored in an object database for lookup.
- A client process acquires a reference to the object.
- The client process marshals the required parameters into a parcel.
- The client process contacts the client ORB.
- The client ORB sends the parcel to the server ORB.
- The server ORB arranges for the invocation of the method on the object.
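A minimal sketch of this invocation path, with plain Python objects standing in for the ORBs and JSON standing in for CORBA marshalling; it illustrates the marshal, forward, and invoke steps, not the real CORBA API.

# Minimal sketch of remote method invocation: marshal, forward, invoke.
import json

class Account:                                   # a server-side object
    def __init__(self, balance): self.balance = balance
    def deposit(self, amount):
        self.balance += amount
        return self.balance

class ServerORB:
    def __init__(self): self.objects = {}
    def register(self, obj):
        ref = f"obj-{len(self.objects)}"         # object reference handed to clients
        self.objects[ref] = obj
        return ref
    def dispatch(self, parcel: bytes) -> bytes:  # skeleton side: unmarshal and invoke
        req = json.loads(parcel)
        result = getattr(self.objects[req["ref"]], req["method"])(*req["args"])
        return json.dumps({"result": result}).encode()

class ClientORB:
    def __init__(self, server_orb): self.server_orb = server_orb
    def invoke(self, ref, method, *args):        # stub side: marshal and forward
        parcel = json.dumps({"ref": ref, "method": method, "args": args}).encode()
        reply = self.server_orb.dispatch(parcel) # stands in for the IIOP transfer
        return json.loads(reply)["result"]

if __name__ == "__main__":
    server_orb = ServerORB()
    ref = server_orb.register(Account(100))      # create the object, obtain a reference
    client_orb = ClientORB(server_orb)
    print(client_orb.invoke(ref, "deposit", 25)) # 125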
36. Globe System
- Scope
- Scales to 1 billion users and 1 trillion objects,
- e.g. stock prices, sports scores.
- Method
- Replicate objects and spread the load over the replicas.
- Every Globe object has a class object containing its methods.
- The object interface is a table of pointers, each a <method pointer, state pointer> pair.
- State pointers can point to interfaces such as mailboxes, each with its own language or function,
- e.g. business mail, personal mail;
- e.g. languages such as C, C++, Java, assembly.
37. Globe Object
[Figure: a Globe object: the class object contains the methods; Mailbox 1 and Mailbox 2 each have their own state and are accessed through separate interfaces]
38. Accessing a Globe Object
- Reading
- The process looks the object up and finds a contact address (e.g. IP address and port).
- After a security check, the process binds to the object:
- the class object (code) is loaded into the caller's address space,
- a copy of the object's state is instantiated, and
- the process receives a pointer to its standard interface.
- The process then invokes methods using the interface pointer.
- Writing
- Handled according to the object's replication policy, for example (see the sketch after this list):
- obtain a sequence number from the sequencer;
- multicast a message containing the sequence number, operation name, and parameters to all other processes bound to the object;
- apply writes in sequence-number order to the master, and update the replicas.
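The write path can be sketched as a sequencer handing out numbers and replicas applying updates in sequence-number order, so every replica sees the same order even if messages arrive out of order; the class names are assumptions made for illustration.

# Minimal sketch: sequencer-based, totally ordered updates to replicas.
import itertools

class Sequencer:
    def __init__(self): self._counter = itertools.count(1)
    def next_number(self): return next(self._counter)

class Replica:
    def __init__(self):
        self.state, self.next_expected, self.pending = [], 1, {}

    def deliver(self, seq_no, operation):
        """Buffer out-of-order messages; apply strictly in sequence-number order."""
        self.pending[seq_no] = operation
        while self.next_expected in self.pending:
            self.state.append(self.pending.pop(self.next_expected))
            self.next_expected += 1

def multicast(replicas, seq_no, operation):
    for r in replicas:
        r.deliver(seq_no, operation)

if __name__ == "__main__":
    seq = Sequencer()
    replicas = [Replica(), Replica()]
    s1 = seq.next_number()                 # writer A gets sequence number 1
    s2 = seq.next_number()                 # writer B gets sequence number 2
    multicast(replicas, s2, "append 'b'")  # B's message arrives first...
    multicast(replicas, s1, "append 'a'")  # ...yet every replica applies a, then b
    print(replicas[0].state, replicas[1].state)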
39. Globe Object
[Figure: the internal structure of a Globe object: behind a single interface sit a control subobject, a semantic subobject, a replication subobject, a communication subobject, and a security subobject, running on the operating system and exchanging messages]
40. Subobjects in a Globe Object
- Control subobject
- Accepts incoming invocations and distributes the work.
- Semantics subobject
- Actually does the work required by the object interface; the only part actually programmed by the developer.
- Replication subobject
- Manages object replication
- (e.g. all replicas active, or master-slave).
- Security subobject: implements the security policy.
- Communication subobject: network protocols (e.g. IPv4).
41. Coordination-based Middleware
- Linda
- Developed at Yale, 1986.
- Users appear to share a big memory, known as the tuple space.
- Processes on any machine can insert tuples into the tuple space or remove tuples from it.
- Publish/subscribe, 1993
- Processes are connected by a broadcast network.
- Each process can be a producer of information, a consumer, or both.
- Jini
- From Sun Microsystems, 1999.
- Self-contained Jini devices are plugged into a network, not into a computer.
- Each device offers or uses services.
42. Linda
- Tuples
- Like a structure in C: pure data, with no associated methods,
- e.g. ("abc", 2, 5)
- ("matrix-1", 1, 6, 3.14)
- ("family", "is-sister", "Stephany", "Roberta")
- Operations (see the sketch after this list)
- out: put a tuple into the tuple space, e.g. out("abc", 2, 5)
- in: retrieve a tuple from the tuple space, e.g. in("abc", 2, ?i)
- Tuples are addressed by content rather than by <name, address>.
- The tuple space is searched for a match to the specified contents.
- The process is blocked until a match is found.
- read: read a tuple, but leave it in the tuple space.
- eval: evaluate the tuple's parameters and put the resulting tuple out.
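A minimal, single-process sketch of a tuple space with out, in, and read and content-based matching; a real Linda in or read blocks until a match appears, whereas this sketch simply raises an error when nothing matches.

# Minimal sketch of a Linda-style tuple space.
ANY = object()   # stands in for Linda's formal parameters such as ?i

class TupleSpace:
    def __init__(self): self.tuples = []

    def out(self, *tup):
        self.tuples.append(tuple(tup))

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is ANY or t == v for t, v in zip(template, tup))

    def read(self, *template):                 # copy a matching tuple, leave it
        for tup in self.tuples:
            if self._match(template, tup):
                return tup
        raise LookupError("no matching tuple (a real 'in'/'read' would block)")

    def in_(self, *template):                  # remove and return a matching tuple
        tup = self.read(*template)
        self.tuples.remove(tup)
        return tup

if __name__ == "__main__":
    ts = TupleSpace()
    ts.out("abc", 2, 5)
    ts.out("matrix-1", 1, 6, 3.14)
    print(ts.in_("abc", 2, ANY))               # ('abc', 2, 5), matched by content
    print(ts.read("matrix-1", ANY, ANY, ANY))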
43. Publish/Subscribe
- Publishing
- New information is broadcast as a tuple on the network.
- The tuple has a subject line with multiple fields separated by periods.
- Processes can subscribe to certain subjects.
- Subscribing
- The tuple daemon on each machine copies all broadcast tuples into its RAM.
- It inspects each subject line and forwards a copy to each interested process (see the sketch below).
[Figure: producers and consumers on LANs connected across a WAN; an information router and a daemon on each machine forward published tuples to interested consumers]
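The subject-line matching can be sketched as follows; the use of "*" as a per-field wildcard in subscriptions is an assumption made for this illustration.

# Minimal sketch of publish/subscribe with dotted subject lines.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscriptions = defaultdict(list)    # pattern -> list of callbacks

    def subscribe(self, pattern, callback):
        self.subscriptions[pattern].append(callback)

    def _matches(self, pattern, subject):
        p, s = pattern.split("."), subject.split(".")
        return len(p) == len(s) and all(pf in ("*", sf) for pf, sf in zip(p, s))

    def publish(self, subject, payload):
        for pattern, callbacks in self.subscriptions.items():
            if self._matches(pattern, subject):
                for cb in callbacks:
                    cb(subject, payload)           # forward to each interested process

if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("sports.football.*", lambda s, p: print("subscriber got", s, p))
    broker.publish("sports.football.scores", "2-1")   # delivered
    broker.publish("finance.stocks.ibm", "135.2")     # ignored, no subscriber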
44. Jini
- Network-centric computing
- An attempt to move away from CPU-centric computing.
- Many self-contained Jini devices offer services to the others,
- e.g. computer, cell phone, printer, palmtop, TV set, stereo.
- A loose confederation of devices, with no central administration.
- Device code runs on the JVM (Java Virtual Machine).
- Joining a Jini federation
- The device broadcasts a message asking for a lookup service,
- using the discovery protocol to find the service.
- The lookup service sends code with which to register the new device.
- The device acquires a lease, registering it for a fixed time.
- The registration proxy can be sent to other devices looking for the service.
45. Jini
- JavaSpaces
- Entries are like Linda tuples, but strongly typed,
- e.g. an Employee entry could have <string, integer, integer, boolean> fields to hold <name, department, telephone, works full-time>.
- Operations (see the sketch after this list)
- write: put an entry into a JavaSpace, specifying the lease time.
- read: copy an entry that matches a template out of the JavaSpace.
- take: copy and remove an entry that matches a template.
- notify: notify the caller when a matching entry is written.
- Transactions can be atomic, so multiple methods can be safely grouped: all or none will execute.
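A minimal sketch of these operations in Python rather than Java (so this is an illustration of the idea, not the real JavaSpaces API): entries are typed records, templates use None for fields that may match anything, writes carry a lease time, and notify registers a callback.

# Minimal sketch of JavaSpaces-style write / read / take / notify.
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Employee:                        # a strongly typed entry
    name: Optional[str] = None         # None in a template means "match anything"
    department: Optional[int] = None
    telephone: Optional[int] = None
    works_fulltime: Optional[bool] = None

class Space:
    def __init__(self):
        self.entries, self.watchers = [], []

    def _matches(self, template, entry):
        return all(t is None or t == e
                   for t, e in zip(vars(template).values(), vars(entry).values()))

    def _live(self):                   # drop entries whose lease has expired
        now = time.time()
        self.entries = [(e, exp) for e, exp in self.entries if exp > now]
        return self.entries

    def write(self, entry, lease_seconds):
        self.entries.append((entry, time.time() + lease_seconds))
        for template, callback in self.watchers:
            if self._matches(template, entry):
                callback(entry)        # notify: a matching entry was written

    def notify(self, template, callback):
        self.watchers.append((template, callback))

    def read(self, template):          # copy a matching entry, leave it in place
        return next((e for e, _ in self._live() if self._matches(template, e)), None)

    def take(self, template):          # copy and remove a matching entry
        for pair in self._live():
            if self._matches(template, pair[0]):
                self.entries.remove(pair)
                return pair[0]
        return None

if __name__ == "__main__":
    space = Space()
    space.notify(Employee(department=42), lambda e: print("notified:", e.name))
    space.write(Employee("Roberta", 42, 5551234, True), lease_seconds=60)
    print(space.read(Employee(department=42)).name)     # Roberta
    print(space.take(Employee(name="Roberta")).name)    # Roberta (and removed)
    print(space.read(Employee(department=42)))          # None: the entry was taken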