Title: Distributed Systems
1Distributed Systems
Lecture 11: Distributed File Systems, 2 July 2002
2Schedule of Today
Distributed File Systems
- Distributed File Systems
- Implementation of Distributed File Systems
- File Usage and FS Structure
- Caching and Replication
- Web Test Case for Replication
- Example DFS
- Sun's Network File System (NFS)
- Sprite File System
- Andrew File System (AFS)
- Coda
- Leases
- Log Structured File System
3File Systems
Introduction
- Goal
- Provide a set of primitives that allow users to
- keep information on persistent media
- e.g. disks, tapes, etc.
- manage accesses to files and directories
- name files or directories
- offer abstractions that shield users from the details of storage access and management
4Distributed File Systems
Introduction
- Promote accessing and sharing of files across machine boundaries
- Offer transparency to users
- Make diskless machines viable
- Increase disk space availability by avoiding duplication
- Balance load among multiple file servers
- Offer mobility
5Transparency
Introduction
- Access transparency
- Location transparency
- Concurrency transparency
- Failure transparency
- Performance transparency
6Access Transparency
Introduction
- Users, i.e. application programmers, do not notice whether the file system is local or distributed
- Accesses to local or remote files are the same, i.e. applications written for a local file system still run under a distributed file system
7Location Transparency
Introduction
- User does not have to know the exact location of a file
- Files can be migrated without affecting users
8Concurrency Transparency
Introduction
- Concurrent accesses to a file from different users, i.e. applications, should not lead to inconsistencies of that file
- To achieve this goal you have to use the concept of transactions
9Failure Transparency
Introduction
- After a client or a server has crashed, the file system should work as before
10Performance Transparency
Introduction
- Delays due to remote access should be as small as possible
- Delays due to remote access should not depend on the current load
11Distributed File Systems
Distributed File Systems
- File and directory naming
- Semantics of file sharing
- Implementation considerations
- Caching
- Update protocols
- Replication
12Naming
Distributed File Systems
- File names for file retrieval
- Name service states the alphabet and syntax of valid file names
- Some file systems may offer names consisting of
- <file name>.extension to distinguish file types
- (Other FSs treat the file type as a file attribute)
13Directory
Distributed File Systems
- Flat directories
- Hierarchical directories
- Directories may contain files and other subdirectories
- Directory tree
- Internal node may never be a file
- Complete file name is a path name
- Relative path name
- Absolute path name
14Distributed File Systems
Distributed File Systems
Assumption: Hierarchical directory tree, consisting of local and remote directories.
What's the view of a user if he/she wants to access a non-local file?
- 3 possibilities
- additional node name, e.g. hostxyz:dir1/dir2/.../file (easy to implement, but no transparency at all)
- mounting of remote directory subtrees; there is transparency (however, each node may have a different view)
- single global name space that looks the same on all nodes (full naming transparency)
15Additional File Descriptors
Distributed File Systems
- Instead of symbolic file names the system uses internal file descriptors (unique file identifiers (UFIDs), i-nodes, etc.)
- UFIDs are short and constant in length → easing their use by system programs
- Directory has to map symbolic names to UFIDs
- A UFID may consist of (see the sketch below)
File number (32 bit)
Random number (32 bit)
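A minimal Python sketch (not part of the original slides) of this idea: a directory maps symbolic names to UFIDs built from a 32-bit file number and a 32-bit random part. All class and field names are illustrative assumptions.

import random
from dataclasses import dataclass

@dataclass(frozen=True)
class UFID:
    file_number: int    # 32-bit sequence number
    random_part: int    # 32-bit random tag, guards against reuse of file numbers

class Directory:
    def __init__(self):
        self._entries = {}            # symbolic name -> UFID
        self._next_file_number = 0

    def create(self, name: str) -> UFID:
        ufid = UFID(self._next_file_number, random.getrandbits(32))
        self._next_file_number = (self._next_file_number + 1) % 2**32
        self._entries[name] = ufid
        return ufid

    def lookup(self, name: str) -> UFID:
        return self._entries[name]    # raises KeyError if the name is unknown

d = Directory()
print(d.create("report.txt"))         # e.g. UFID(file_number=0, random_part=...)
print(d.lookup("report.txt"))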
16Possible File Sharing Semantics
Distributed File Systems
- One-copy semantics (à la Unix)
- Updates are written to the single copy and are available immediately (disregarding delays due to file caching)
- Session semantics
- Copy the file on open, work on your local copy, and copy back on close
- No updates of files
- Any update of a file causes the creation of a new file
- Simple locking
- System offers read- and write-locks; users have to deal with them
- Serializability
- Transaction semantics (locking files shared for reads and exclusive for writes, but without involving the user)
17File System Implementation
Implementation
- File Usage
- System Structure
- Caching
- Replication
18File Usage
Implementation
- Results of Satyanarayanan's analysis at CMU (1990)
- File size < 10 K
- → Feasible to transfer entire files instead of single blocks
- > 80% read operations (most reads and writes are sequential)
- Lifetime: often only one usage
- → Create the file on the client side and wait whether it really survives, reducing network traffic a lot
- Few files are shared
- → Client caching is favorable
- Average application uses only a few files
- Several file classes with different behavior
- → Provide an adequate solution for each of them
Similar results by Mullender, Tanenbaum, and the Unix team
19File System Structure
Implementation
- Clients and Servers on different machines?
- Distinct directory server and file server or not?
- Discuss pros and cons of both designs
- One directory server or several directory servers?
- Iterative lookup on several directory servers
- Automatic lookup on several directory servers
20Iterative Lookup
Implementation
(Diagram: the client looks up the path components a, b, and c one after another on Server 1, Server 2, and Server 3; Server 3 holds the file.)
Analysis: 1. Bad performance: needs a couple of messages. 2. Forces clients to be aware of which server holds which file or directory.
21Automatic Lookup
Implementation
(Diagram: the client sends a single lookup of a/b/c to Server 1, which forwards it along a and b to Server 2 and Server 3; Server 3 holds the file and replies.)
Analysis: Fewer messages, thus more efficient, but cannot use RPCs, since a different server replies to the initial call.
22State of the DFS
Implementation
In local systems an FCB (file control block) is created whenever a user opens a file. The FCB contains all relevant state information (e.g. the file pointer).
- In a DFS we can distinguish three session-like phases
- creation of a usage relation
- the use of the file itself, and
- deletion of the usage relation
- In a DFS the file server may keep the above state information or not.
- Thus we can distinguish between
- stateless servers and
- stateful servers
23Stateless versus Stateful Server
Implementation
- Advantages of stateless servers
- If the server crashes it can restart immediately, because no state information was lost.
- Also, if a client crashes there is no additional overhead on the server side; the server only knows a client during the client's file request
- No need for open and close and their related messages across the net.
- No additional space for status information on the server; this may pay off if the server has to deal with many concurrent clients
- No limit on the number of concurrently opened files.
- Advantages of stateful servers
- Shorter request messages, no need for a symbolic file name in each request
- Better performance, easier to implement read-ahead
- Easier to establish idempotency (controllable via sequence numbers)
- Possibility to set file locks in order to establish a certain consistency model
24File System Caches in a DFS
Implementation
- No caches at all, all files only on the server's disk
- Server's disk should be large enough
- Files are always accessible to all clients
- No additional memory overhead, no consistency problem
(Diagram: the client accesses the file on the server's disk(s) across the network. Performance problem!)
25File System Caches in a DFS
Implementation
- The server uses part of its main memory as a file system cache for the clients' most recently used files
- Data is still transferred via the network, but hopefully most accesses to the server's disk are avoided
(Diagram: a cache in the server's main memory sits between the network and the server disk(s). Conceptual problems?)
26Servers Main Memory as Cache
Implementation
- Cacheable units
- Complete files
- File portions, e.g. blocks or chunks
- Replacement algorithm (see the LRU sketch below)
- What to do if the cache fills up
- LRU?
- FIFO?
- No additional consistency problems from the user's point of view
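A minimal sketch of an LRU replacement policy for a block cache in the server's main memory, assuming fixed-size blocks keyed by (file id, block number); capacity and names are illustrative.

from collections import OrderedDict

class LRUBlockCache:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()          # (file_id, block_no) -> bytes

    def get(self, file_id: int, block_no: int):
        key = (file_id, block_no)
        if key not in self._blocks:
            return None                       # cache miss: caller reads from disk
        self._blocks.move_to_end(key)         # mark as most recently used
        return self._blocks[key]

    def put(self, file_id: int, block_no: int, data: bytes):
        key = (file_id, block_no)
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block

cache = LRUBlockCache(capacity_blocks=2)
cache.put(1, 0, b"block A")
cache.put(1, 1, b"block B")
cache.get(1, 0)                               # touch block A
cache.put(1, 2, b"block C")                   # evicts block B, the LRU entry
print(cache.get(1, 1))                        # None: evicted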
27File System Caches in a DFS
Implementation
- To minimize time-consuming data transfers via the net you can use caching on the client side (in main memory or on the local disk, a matter of performance)
- Consistency problems arise due to the cache implementation
(Diagram: Cache 1 on the server and Caches 2 and 3 on the clients, between the server disk(s) and the clients' local disks.)
28Client Cache Implementation
Implementation
- Placement of caches
- Client's main memory
- Cache within the user address space
- Cache within the kernel
- Separate user-level cache manager
- Client's disk
- Temporarily improves the availability of the file copy
29Client Cache in Main Memory (1)
Implementation
A cache inside the UAS, managed by a library, holds the most recently used files per UAS. Only if a file is reused by the same UAS may it still be in the cache. If the task exits, the contents of modified files are written back to the server and the cache is freed.
(Diagram: cache hits are served inside the user address space.)
Analysis: Little overhead, but only valuable if a file is reused by the same task (e.g. a database manager process); in most other tasks a file is opened once and closed once, so caching within the library gains nothing.
30Client Cache in Main Memory (2)
Implementation
A cache inside the kernel is used by all applications. For each file access a kernel call is necessary, but the cache may survive a task.
(Diagram: cache hits are served by the kernel cache below the client's UAS.)
Analysis: Unix pipelining of user tasks is supported very efficiently, e.g. ls | count, or a 2-phase compiler.
31Client Cache in Main Memory (3)
Implementation
A user-level cache manager frees the kernel from keeping caches for various clients. The cache manager is isolated and easier to test. However, the kernel could decide to page out some of the pages of the cache manager → a cache hit then results in 1 or more page faults.
(Diagram: the client's UAS obtains cache hits from a separate cache manager process.)
- Compare the three methods concerning
- the number of RPCs involved in cache hits and cache misses, and
- applicability to µ-kernel systems
32Cache Consistency
Implementation
Using caches in a DFS → consistency problem. To reduce the network traffic the following policies have been proposed (sketched below):
- Write-through
- The cache is used for reading only.
- Writing is immediately done to the server, i.e. into the original file.
- Delayed write
- Several write operations are collected and passed to the server in a burst.
- Write-on-close
- File updates are delayed until the file is closed.
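A minimal sketch contrasting the three policies as client-side write paths; the Server class and all method names are illustrative assumptions, not a real protocol.

class Server:
    def __init__(self):
        self.files = {}                       # path -> bytes
    def write(self, path: str, data: bytes):
        self.files[path] = data

class WriteThroughCache:
    """Reads come from the cache, every write goes straight to the server."""
    def __init__(self, server):
        self.server, self.cache = server, {}
    def write(self, path, data):
        self.cache[path] = data
        self.server.write(path, data)         # immediate, synchronous update

class DelayedWriteCache:
    """Writes are collected and pushed to the server in a burst."""
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()
    def write(self, path, data):
        self.cache[path] = data
        self.dirty.add(path)
    def flush(self):                          # called periodically or on sync
        for path in self.dirty:
            self.server.write(path, self.cache[path])
        self.dirty.clear()

class WriteOnCloseCache:
    """Updates are propagated only when the file is closed (session semantics)."""
    def __init__(self, server):
        self.server, self.cache = server, {}
    def write(self, path, data):
        self.cache[path] = data
    def close(self, path):
        self.server.write(path, self.cache[path])

srv = Server()
woc = WriteOnCloseCache(srv)
woc.write("/tmp/a", b"draft")
woc.close("/tmp/a")                           # server sees the data only at close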
33Replication
Replication
- Objective
- Store information at multiple sites in a DS
Why? To increase
- Availability (it's there when you want it)
- Reliability (it doesn't get destroyed)
- Performance (it's always nearby)
- How? Replication is
- User initiated (aware of replication process)
- Automatic (transparent replication)
34Replication
From a Centralized Server ...
(Diagram: a single centralized server attached to the network, serving clients in several subnets.)
Remark: As long as only clients or their subnets fail, the rest can still use the centralized server.
35Replication
From a Centralized Server ...
(Diagram: the same centralized server, now crashed or cut off from the network.)
Remark: But what to do when the server or its subnet crashes? Rien ne va plus (nothing works any more)!
36Replication
to Decentralized Servers
Remark: As long as only one replicated server or its entire subnet fails, all clients from other subnets still may get their services via their nearby servers.
37Explicit File Replication
Replication
(Diagram: a client explicitly copies its file to each of the servers s1, s2, and s3.)
Is this a good idea?
- Not at all
- neither user friendly
- nor efficient
38Analysis of Explicit File Replication
Replication
- Not very user friendly
- you have to find all servers (they may vary in place and number)
- if at copy time one server is down, the user may forget to copy file <prog.c> when the failed server becomes available again
- Not very efficient
- to be sure that all replicas have the same state we have to use a transaction mechanism (two- or three-phase commit protocol)
- only when the last copy has been made successfully (has committed) is the information available again
- what to do when a site fails for a longer period of time?
39Lazy Replication
Replication
(Diagram: client c1 updates server s1; s1 later propagates the update lazily to s2 and s3.)
At some time t0 + n·Δt, n > 1, the replication manager on server s1 issues the other lazy replicas → for a while the servers may have different states!
40Replication using a Group
Replication
(Diagram: client c1 sends its request to the group of servers s1, s2, s3 as a whole.)
41Replication and Update-Protocols
Replication
- Approaches
- write to all-available replicas
- primary/backup
- quorum consensus
42Write to all-available Update-Protocol
Replication
(Diagram: client c1 writes prog.c to all available replicas s1, s2, and s3.)
If the update fails for s3, the replicas s1, s2 on the one hand and s3 on the other have different states → inconsistency! Characteristic of this protocol: cheap reads, but expensive writes!
43Primary/Backup
Replication
- Possible options
- Backups are maintained for availability only
- Backups can improve performance for reads
- What is the query semantics?
- How can we achieve one-copy serializability?
- Client interacts with one copy, and if it is a backup, its updates are propagated to the primary
- What is the query semantics with regard to our own updates?
- Clients who don't need current data can read from any site
44Primary/Backup
Replication
- Any client has one primary server (hopefully a nearby and powerful one) within the total system. The other servers only serve as backups.
- Any request of the client goes to its primary server.
- If the primary fails, i.e. a failover occurs, then one of the backup servers becomes the new primary (see the sketch after this list)
- Consequences
- There is at most one primary at any time
- Every client ci has a single site si to which it sends requests
- Any client message arriving at a backup server is ignored
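A minimal sketch of the primary/backup idea from the client's point of view. For simplicity the client itself detects the failed primary (an exception stands in for a timeout) and fails over to the next backup; in the protocols on the following slides, failure detection uses "I am alive" messages and the backup takes over. All names are illustrative.

class ServerDown(Exception):
    pass

class Replica:
    def __init__(self, name):
        self.name, self.alive, self.store = name, True, {}
        self.backups = []                     # set on the current primary only
    def handle(self, op, path, data=None):
        if not self.alive:
            raise ServerDown(self.name)
        if op == "write":
            self.store[path] = data
            for b in self.backups:            # primary forwards updates to its backups
                if b.alive:
                    b.store[path] = data
        return self.store.get(path)

class Client:
    def __init__(self, replicas):
        self.replicas = list(replicas)        # replicas[0] is the initial primary
        self._assign_primary()
    def _assign_primary(self):
        self.replicas[0].backups = self.replicas[1:]
    def request(self, op, path, data=None):
        while self.replicas:
            try:
                return self.replicas[0].handle(op, path, data)
            except ServerDown:
                self.replicas.pop(0)          # failover: next backup becomes primary
                if self.replicas:
                    self._assign_primary()
        raise RuntimeError("no replica available")

s1, s2, s3 = Replica("s1"), Replica("s2"), Replica("s3")
c = Client([s1, s2, s3])
c.request("write", "/prog.c", b"int main(){}")
s1.alive = False                              # primary crashes
print(c.request("read", "/prog.c"))           # served by s2 after failover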
45Alsberg and Day Protocol
Replication
(Diagram: client c1, the primary s1, and the backup s2, each holding a copy of prog.c; the client's update goes to the primary and is propagated to the backup.)
46Alsberg and Day Protocol
Replication
(Diagram: the same configuration with primary s1 and backup s2, each holding prog.c.)
How to detect possible failures? Servers periodically send "I am alive" messages; a timeout is used to detect possible crashes. The backup then takes over control and recruits a new backup.
47Tandem Non-Stop Protocol
Replication
- Symmetric Pair Policy
- One primary process and one backup process
- joined by redundant links
- Client sends request to the primary
- Primary forwards updates to the backup
- Backup acknowledges to primary, only
- Primary acknowledges to client
- Failures detected by timeout.
- Tolerates
- node crashes
- one link failure
48Anti-Entropy Method (Golding 1992)
Replication
- State kept by replicated servers can be weakly consistent, i.e. replicas are allowed to diverge temporarily.
- They will eventually come to agreement.
- From time to time, a server picks another server and these 2 servers exchange updates and converge to the same state (see the sketch below)
- Total ordering is obtained after getting one message from every server (directly)
- Lamport timestamps are used to order messages
49Anti-Entropy Method
Replication
(Diagram: knowledge at s1 and knowledge at s2, each shown as a table of objects A, B, C with Lamport-timestamped updates, together with the summary vectors of s1 and s2.)
Remark: Numbers in the objects refer to Lamport timestamps.
50Anti-Entropy Method
Replication
(Diagram: the same knowledge tables at s1 and s2 after the anti-entropy exchange, together with the summary after the merge.)
51Eventual Path Propagation
Replication
Phase 1: Partitioning
(Diagram: nodes labeled with the updates mx and my, split across separate partitions.)
52Eventual Path Propagation
Replication
Phase 2: Partitioning
(Diagram: the updates mx and my spread within their respective partitions.)
53Eventual Path Propagation
Replication
Phase 3: Merging
(Diagram: after merging, several nodes hold both mx and my, while some still hold only mx or only my.)
54Eventual Path Propagation
Replication
Further merging
(Diagram: after further merging, almost all nodes hold both mx and my.)
55Analysis
Replication
- All primary/backup protocols have some disadvantages
- if the primary fails, no valid updates are possible any longer
- they don't tolerate network partitions
- In 1979 Gifford published another protocol based on majority voting, named the quorum algorithm.
56Quorum Algorithm
Replication
- Any client has to acquire the permission of some subset of the replicated servers before reading from or writing to a replicated file.
- Reads need a read quorum (i.e. at least Nr servers must accept)
- Writes need a write quorum (i.e. at least Nw servers must accept).
- Simplification: There are N servers; then
- Nr + Nw > N (see the sketch below).
- Any write is coupled with an update of the file's version number!
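A minimal sketch of quorum reads and writes with Nr + Nw > N and per-replica version numbers; replica selection and class names are illustrative.

import random

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

class QuorumFile:
    def __init__(self, n, nr, nw):
        assert nr + nw > n, "read and write quorums must overlap"
        self.replicas = [Replica() for _ in range(n)]
        self.nr, self.nw = nr, nw

    def read(self):
        quorum = random.sample(self.replicas, self.nr)
        newest = max(quorum, key=lambda r: r.version)   # overlap guarantees a current copy
        return newest.value

    def write(self, value):
        quorum = random.sample(self.replicas, self.nw)
        new_version = max(r.version for r in quorum) + 1
        for r in quorum:
            r.version, r.value = new_version, value     # version number updated on every write

f = QuorumFile(n=12, nr=3, nw=10)             # the numbers used in the example on the next slide
f.write("new contents")
print(f.read())                               # always returns the latest write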
57Example
Replication
Nr = 3 and Nw = 10
A B C D E F G H I J K L
Suppose at time t0 the servers C, D, ..., and L (marked red) have been updated. If a client wants to read at t0 + Δt, he needs at least 3 sites, e.g. sites A, B, and C. Although sites A and B have a majority with their old version, the client can detect that site C's version number is newer, thus the client reads from C.
58Web Test Case for Replication
Web Replication
- Observation
- The explosion of the web has led to a situation where the majority of the traffic on the Internet is web related.
- Goal
- Offer a set of web servers spread all over the world, lowering the long-distance traffic
- The provider's viewpoint
- As few web servers as possible without bothering clients
- The client's viewpoint
- As many nearby web servers as possible, to get quick answers
59Web Test Case for Replication
Web Replication
- Objective
- Try to establish web replication where each of the replicas resides in a different part of the network
- Problem
- How may the client's web browser automatically and transparently contact the best replica server, taking into account
- Network topology: which replica is closest to the client
- Server availability: which web servers are currently active
- Server load: which one is able to give the most rapid response
60Provider 1 Single Web Server
Web Replication
Analysis: A popular web site being served from only one location → frequent long-distance network transfers → high response times for user requests and waste of available network bandwidth. Moreover: danger of a single point of failure.
(Diagram: a single web server answering web queries and responses from many clients.)
61Solution Caching and Replication
Web Replication
- Caching
- Server side caching (Squid, Harvest, Apache)
- Client side caching (proxy, browser)
- Replication
- Cluster replication
- Wide area replication
- Wide area, cluster replication
- Combination of Caching and Replication
62Caching
Web Replication
- Analysis
- lower latency
- better network utilization
- freshness
- some things cannot be cached (server-side programming, CGI scripts, etc.)
- some things are not meant to be cached (advertising)
63Cluster Replication
Web Replication
- Analysis
- improves performance: load is shared by several servers
- improves availability of the web server as a whole
- moderate effort is required to set up and maintain
- still a single point of failure in the network
- still high latency for clients that are distant (network-wise)
64Wide-Area (Cluster) Replication
Web Replication
- Analysis
- improves performance
- load is shared by several servers
- clients access the best server
- improves availability of the web service
- network availability
- server availability
- complex to implement, deploy and maintain
65The Technical Challenge
Web Replication
- Making wide-area cluster replication work in a Web environment
- get the nearest server
- based on network topology
- from the nearest server to the best server
- server availability
- server load
- do it automatically and seamlessly
- HTTP redirect method - application layer
- DNS round trip method - session layer
- shared IP address method - network layer
A simple and limited method works without knowledge of the network topology and the location of the client within that network, i.e. you might get the overall best server for all potential clients, but not the best server for a specific client; e.g. an overloaded server on the same LAN may be better than a very fast and unloaded server in New South Wales.
For more details on this method see http://www.cnds.jhu.edu
66DNS Round Trip Times Method
Web Replication
(Diagram: over time, the name server ns.bar.edu acting for client foo.bar.edu asks itself: Do I know www.cnds.jhu.edu? No. Do I know the DNS for cnds.jhu.edu? No. Do I know the DNS for jhu.edu? Yes: 128.220.1.5.)
67DNS Round Trip Times Method
Web Replication
- (Diagram legend: DNS plus nearby web server; local DNS serving all local clients; potential replicas; selected replica)
- No special requirements
- Convergence time is linear in the number of replicas
68Practical Implementation Walrus
Web Replication
- A Wide Area Load Re-balancing User-transparent System
- No change to the Web server
- No change to the Web client
- No change to the infrastructure (ISP, DNS, OS)
- Implemented in a Unix environment,
- but can be ported to other environments
- see http://www.cnds.jhu.edu/walrus
69Some Distributed File Systems
Example DFS
- Problem to solve
- Find out the main characteristics of at least 3 major DFSs.
- Discuss the pros and cons of each DFS.
- Explain the typical application of each DFS.
70Network File System (NFS)
Network File Systems
- De facto standard; Sun published its protocol specification to establish a platform-independent DFS
- Mid-1980s
- Widely adopted in academia and industry
- In NFS each node may act as client and/or as server
- Each server holds a file /etc/exports containing a list of the directories the server wants to export to other nodes
- NFS supports heterogeneous systems (DOS, MacOS, VMS)
- mostly in LANs, but also applicable in WANs
- Uses Sun's synchronous RPC and XDR
- Client blocks until it gets the result from the file server
71Characteristics of NFS
Network File System
- Access transparency is reached only within the Unix area, i.e. Unix applications can access local or remote files with the common Unix file operations.
- Location transparency is implemented via the import mechanism. The client specifies the mount point within its local file system where it wants to import a sub-file-system of NFS.
- Concurrency transparency is not supported. There are only some rudimentary locking mechanisms.
- Fault transparency is supported, because an NFS server is stateless.
- Performance transparency? With only a light load in a LAN, remote accesses are hardly slower than accesses to a local file.
72Sun NFS (1)
Network File System
- Architecture
- Server exports n ≥ 1 directory trees for access by remote clients
- Clients may access exported directory trees by mounting them into the client's local tree
- Diskless clients can mount exported directories to their root directory
- Auto-mount (on the first access)
- Remote access is done via Sun's RPC
73SUN NFS (2)
Network File System
- Stateless server
- RPCs are self-contained
- Servers don't need to keep state about previous requests, i.e. they flush all modified data to disk before returning from an RPC call
- Robustness
- No state to recover
- Clients initiate a retry
74NFS Protocols
Network File System
- Mount Protocol
- Hand mounting
- Boot mounting
- Auto mounting
- Directory and File Access Protocol
75 Sun NFS Protocols
Network File System
- Mounting protocol
- Client sends the pathname of the exportable directory to the server (not including the mount place)
- If that pathname is legal and the directory is exportable, then the server returns a file handle to the client
- File handle contains (uniquely identifying the directory; see the sketch below)
- file system type
- disk
- i-node number of the directory
- security information
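A minimal sketch of the mount exchange: a legal, exportable pathname is translated into a file handle carrying the fields listed above. The data layout is illustrative only; it is not the real NFS wire format.

from dataclasses import dataclass

@dataclass(frozen=True)
class FileHandle:
    fs_type: int          # identifies the exported file system
    disk: int             # disk / device on the server
    inode: int            # i-node number of the exported directory
    security: bytes       # opaque security information

EXPORTS = {"/export/home": FileHandle(fs_type=1, disk=0, inode=2, security=b"\x00" * 8)}

def mount(pathname: str) -> FileHandle:
    """Server side of the mount protocol: legal and exportable path -> file handle."""
    if pathname not in EXPORTS:
        raise PermissionError(f"{pathname} is not exportable")
    return EXPORTS[pathname]

handle = mount("/export/home")    # the client stores the handle and uses it in later requests
print(handle)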
76 Sun NFS Protocols
Network File System
- Boot mounting
- The script file /etc/rc containing all mounting commands is executed
- Auto mounting
- A set of remote exportable directories is associated with the client
- When the client opens one of these remote files for the first time, the OS sends a mount message to each file server; the first one to reply wins
- If a server is down during boot mounting, the client hangs
Auto mounting is mostly used for read-only files
77Mount a Remote File System in NFS
Network File System
Result: The client's file name space includes remote files
78Achieving NFS Transparency
Network File System
- Mount service
- Mounts remote file systems into the client's file name space
- A mount service process runs on each node to provide an RPC interface for mounting and unmounting file systems at the client
- Runs at system boot time or user login time
79Achieving NFS Transparency 2
Network File System
- Auto mounter
- Dynamically mounts file systems
- Runs as a user-level process on the client (daemon)
- Resolves references to unmounted pathnames by mounting them on demand
- Maintains a table of mount points and the corresponding server(s); sends probes to the server(s)
- Primitive form of replication
80NFS Transparency ?
Network File System
- Early binding
- Mount system call attaches remote file systems to
local mount point - Client has to deal with the host only once
- But mount needs to happen before remote files become accessible
81NFS Directory and File Access Protocol
Network File System
- Directory and file accessing protocol
- RPCs for reads and writes to files and directories
- No open/close, since the NFS server is stateless
- Each read/write message contains the full path and the file position (see the sketch below)
- The NFS protocol differs from Sun's Remote File System (RFS), where you have to open and close files explicitly
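A minimal sketch of such a stateless interface: there is no open/close, and every request names the file and the position explicitly, so the server keeps no per-client state. Purely illustrative.

class StatelessFileServer:
    def __init__(self):
        self.files = {}                           # path -> bytearray

    def read(self, path: str, offset: int, count: int) -> bytes:
        data = self.files.get(path, bytearray())
        return bytes(data[offset:offset + count])  # no file pointer kept on the server

    def write(self, path: str, offset: int, data: bytes) -> int:
        buf = self.files.setdefault(path, bytearray())
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        return len(data)

srv = StatelessFileServer()
srv.write("/home/a.txt", 0, b"hello world")
print(srv.read("/home/a.txt", 6, 5))              # b'world': each request carries path + position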
82Other NFS Functions
Network File System
- NFS file and directory operations
- Read, write, create, delete, getattr, etc.
- Access control
- File and directory access permission (UNIX)
- Path name translation
- Lookup for each path component
- Caching
83NFS Semantics
Network File System
- Unix
- You cannot open a file and lock it so that no other user can use that file anymore
- In a stateless server, locks cannot be associated with opened files; the server does not know about them
- Additionally, the Network Information Service (NIS) is established to check whether client and server really are who they claim to be; however, data are still transferred without encryption
84NFS Implementation
Network File System
(Diagram: on the client, calls go through the Virtual File System either to the local OS and local disk or, as messages, to the NFS server; the server passes the request through its own Virtual File System to its local OS and local disk.)
85Virtual File System
Network File System
- VFS added to the Unix kernel
- Location-transparent file access
- Distinguishes between local and remote access
- Client
- Executes a file system call to determine whether the access is local or remote
- Server
- NFS server receives the request and passes it to the local FS via the VFS
86VFS 2
Network File System
- If local, translates the file handle to internal file ids (in Unix: i-nodes)
- V-node
- If the file is local: reference to the file's i-node
- If the file is remote: reference to the file handle
- The file handle uniquely identifies a file:
File system id
I-node number
I-node generation number
87NFS Caching
Network File System
- File contents and attributes
- Client versus server caching
88NFS Server Caching
Network File System
- Read
- Same as in the Unix FS
- Caching of file blocks and attributes
- Cache replacement using LRU
- Write
- Write-through (as opposed to delayed writes in a conventional Unix FS)
- Delayed writes: modified blocks are written to disk when buffer space is needed, by an explicit or periodic sync operation, or on every close
89NFS Client Caching 1
Network File System
- Timestamp-based cache invalidation
- Read
- Cached entries have timestamps with the last-modified time
- Blocks are assumed to be valid for a TTL (see the sketch below)
- TTL specified at mount time
- Typically 3 sec for files
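A minimal sketch of TTL-based client caching: cached blocks are trusted for ttl seconds and revalidated against the server's last-modified time afterwards. The server stub and all names are illustrative assumptions.

import time

class TtlClientCache:
    def __init__(self, server, ttl=3.0):          # typically 3 s for files
        self.server, self.ttl, self.entries = server, ttl, {}

    def read_block(self, path, block_no):
        key = (path, block_no)
        entry = self.entries.get(key)
        now = time.time()
        if entry is not None:
            data, mtime, checked_at = entry
            if now - checked_at < self.ttl:
                return data                       # within TTL: use without contacting server
            if self.server.getattr(path) == mtime:
                self.entries[key] = (data, mtime, now)   # still fresh: restart the TTL
                return data
        data, mtime = self.server.read_block(path, block_no)
        self.entries[key] = (data, mtime, now)
        return data

class ServerStub:
    """Minimal stand-in for the server side."""
    def __init__(self):
        self.blocks = {("/etc/motd", 0): b"welcome"}
        self.mtime = {"/etc/motd": 100.0}
    def getattr(self, path):
        return self.mtime[path]
    def read_block(self, path, block_no):
        return self.blocks[(path, block_no)], self.mtime[path]

cache = TtlClientCache(ServerStub())
print(cache.read_block("/etc/motd", 0))           # fetched from the server, then cached
print(cache.read_block("/etc/motd", 0))           # served from the cache within the TTL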
90NFS Client Caching 2
Network File System
- Write
- Modified pages are marked and flushed to the server at file close or at sync
- Consistency
- Not always guaranteed
- E.g. a client modifies a file; the delay for the modification to reach the server plus the 3-sec cache-validation window leaves a window of inconsistency for clients sharing the file
91NFS Cache Validation
Network File System
- Validation check performed when
- First reference to file after TTL expires
- File open or new block fetched from server
- Done for all files (even those not being shared)
- Expensive
- Potentially, file attributes are fetched every 3 sec
- If needed, all blocks are invalidated
- A fresh copy is fetched when the file is accessed again
92Satya's Design Principles (1990): Lessons learned from NFS
Lessons Learned from NFS
- WSs have enough processing power → it's wise to use it, instead of the server's processor, whenever possible
- Caching files can save network bandwidth, since they are likely to be used again.
- Exploit usage properties
- Minimize dependencies on the rest of the system, to limit the impact of changes
- Trust the fewest possible entities
- Perform work in batches whenever possible
93Sprite File System
Sprite FS
- Main memory caching on client and server side
- Write-sharing consistency guarantees.
- Variable sized caches
- VM and FS negotiate amount of memory needed
- According to caching needs, cache size may adapt
Sprite was developed at Berkeley by John Ousterhout, started in 1984 and finished in 1991, as a test bed for research in log-structured file systems, striped file systems, crash recovery, and RAID file systems.
94Sprite File System
Sprite FS
- Sprite supports concurrent writes by disabling caching of write-shared files.
- If a file is write-shared, the server notifies the client that has opened the file for writing to write modified blocks back to the server
- The server notifies all clients that have opened this file for reading that this file is no longer cacheable
- Clients then discard all cached blocks, so that subsequent accesses go through the server
95Sprite File System
Sprite FS
- Sprite servers are stateful
- Need to keep state about current accesses
- Centralized points for cache consistency
- Bottleneck?
- Single point of failure?
- Tradeoff
- consistency versus performance/robustness
96Andrew File System
Andrew File System
- Distributed computing environment
- developed at Carnegie Mellon University (CMU) (again by Satya)
- Campus-wide computing system
- Between 5 K and 10 K workstations (WSs)
- In 1991 already 800 WSs, 40 servers
97Design Goals
Andrew File System
- Information sharing
- Scalability
- Key policy: caching of whole files at the client
- Whole-file serving
- Entire file is sent to the client
- Whole-file caching
- Local copy of the file is cached on the client's local disk
- Survives client reboots and server unavailability
98Andrew File System
Andrew File System
- Supports information sharing on a large scale (> 1000 WSs)
- Uses session semantics
- Provides location transparency and location independence
- First the entire file is copied to the local machine (Venus) from the server (Vice) when it is opened. If the file has been changed, it is copied back to the server when it is closed again.
- The method works because in practice most files are changed by only one person
- Measurements show that only 0.4% of all changed files have been updated by more than one user during one week.
Remark: AFS works only on BSD 4.3 Unix platforms with TCP/IP. Each node in the entire system needs a local hard disk.
99File Cache Consistency
Andrew File System
- File caches hold recently accessed file records
- Caches are consistent when they contain exact copies of the remote data
- File locking prevents simultaneous access to a file
- Writing causes the cached copy on the server to be updated
100Whole File Caching
Andrew File System
- Local cache contains the most recently used files
(Diagram: on open, the client fetches the file from the server into its local cache.)
Subsequent operations on the file apply to the local copy. On close <file>, if the file was modified, it is sent back to the server.
101 AFS Structure
Andrew File System
Venus works as a file cache manager
Vice is a multi-threaded server providing shared
file services
(Diagram: several clients, each running Venus, connected over the network to Vice servers.)
102Implementation 1
Andrew File System
- Network of WSs running BSD 4.3 and Mach
- Implemented as 2 user-level processes
- Vice runs at each Andrew server
- Venus runs at each Andrew client
103Implementation 2
Andrew File System
- Modified BSD 4.3 Unix kernel
- At the client, file system calls (e.g. open, close, etc.) are intercepted and passed to Venus when they refer to shared, non-cached files
- Venus manages the client cache partition on the local disk
- LRU replacement policy
- Cache is large enough for hundreds of average-sized files
104File Sharing
Andrew File System
- Files are shared or local
- Shared files
- Utilities (/bin, /lib): infrequently updated files or files accessed by a single user (e.g. the user's home directory)
- Stored on servers and cached on clients
- Local copies remain valid for a long time
- Local files
- Temporary files (/tmp) and files used for start-up
- Stored on the local machine's disk
105AFS Components
Andrew File System
Namespace: Each local file system can be set up differently; however, the shared file system has a universal look.
Shared files use symbolic links.
106AFS Caching 1
Andrew File System
- AFS-1: timestamp-based cache invalidation
- AFS-2: ditto, plus the use of callbacks
- When serving a file, the Vice server promises to notify the Venus client whenever the file is modified
- Still a stateless server?
- A callback is stored with each cached file
- Valid, or
- Canceled when the client is notified by the server that the file has been modified
107AFS Caching 2
Andrew File System
- Callbacks implemented using RPC
- When accessing a file, Venus checks if the file exists and if the callback is valid; if canceled, it fetches a copy from the server
- Failure recovery
- When restarting after a failure, Venus checks each cached file by sending a validation request to the server
- Also periodic checks in case of communication failures
108AFS Caching 3
Andrew File System
- At file close time, Venus on the client that modified the file sends the update to the Vice server
- The server updates its own copy and sends callback cancellations to all clients caching the file
- Consistency?
- Concurrent updates?
109Andrew File Validation
Andrew File System
- Older AFS versions
- On open, Venus accesses Vice to see if its copy of the file is still valid.
- This causes a substantial delay, even if the copy is valid.
- Vice is stateless
- Newer AFS versions
- Vice maintains lists of valid copies.
- If a file is modified, Vice invalidates the other copies.
- On open, if Venus has a valid copy it can open it immediately.
- If Venus crashes, it has to invalidate its versions or check their validity.
110AFS Replication
Andrew File System
- Read-only replication
- Only read-only files allowed to be replicated at
several servers
111File Identifiers
Andrew File System
A volume is a collection of files managed together to allow ease of movement. A partition may consist of n ≥ 1 volumes.
A file identifier consists of: Volume number | Vnode number | Unique number
- Volume number
- to uniquely identify a single volume in the system
- Vnode number
- to identify a file within a volume (cf. Unix i-nodes)
- can be reused if the old file is deleted
- Unique number
- to cater for reused Vnode numbers in case an old Vnode number is still in use somewhere (see the sketch below)
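A minimal sketch of such file identifiers, assuming the unique number is incremented whenever a Vnode slot is reused, so that stale identifiers can be detected. Names and the reuse logic are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fid:
    volume: int       # which volume the file lives in
    vnode: int        # slot within the volume (like a Unix i-node number)
    uniquifier: int   # bumped each time the vnode slot is reused

class Volume:
    def __init__(self, volume_number: int):
        self.volume_number = volume_number
        self.next_vnode = 1
        self.generation = {}               # vnode -> current uniquifier

    def allocate(self) -> Fid:
        vnode = self.next_vnode
        self.next_vnode += 1
        self.generation[vnode] = 1
        return Fid(self.volume_number, vnode, 1)

    def reuse(self, vnode: int) -> Fid:
        """Reusing a vnode after deletion gets a new uniquifier, so stale Fids no longer match."""
        self.generation[vnode] += 1
        return Fid(self.volume_number, vnode, self.generation[vnode])

vol = Volume(7)
old = vol.allocate()
new = vol.reuse(old.vnode)
print(old == new)                          # False: the old Fid is recognizably stale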
112Example of a System Call (fopen)
Andrew File System
The application requests fopen(filename, ...). Venus parses the filename. If it is a local file, fopen() is treated in a similar way as in Unix. However, if it starts with /afs, Venus has to check several things: Is the requested file already in the local file cache (see /cache)? If so, it checks whether this file is still valid or currently invalid. If still valid, Venus returns the file descriptor to the application. If already invalid, Venus compares the timestamps of the local copy with the server file, and if the local copy is outdated, Venus sends a request to Vice to download this file to the local cache. If the file is not in the local cache, Venus sends a request to Vice to download this file to the local cache.
113Security within AFS
Andrew File System
In AFS all traffic between clients and servers is encrypted. Access to directories is controlled via ACLs. File access is controlled as in Unix (9 rwx bits for owner, group, and others), for compatibility with Unix. Newer versions of AFS use the Kerberos authentication system and also offer ACLs for file accesses.
Remark: More on AFS: see www.homepages.uel.ac.uk/5291n/afs.doc.html
114Coda
Coda File System
- Evolved from AFS
- Goal: constant data availability
- Improved replication
- Replication of read-write volumes
- Disconnected operation: mobility
- Extension of AFS's whole-file caching mechanism
- Access to the shared file repository (servers) versus relying on local resources when the servers are not available
115Replication in Coda
Coda File System
- Replication unit: file volume (set of files)
- Set of replicas of a file volume: volume storage group (VSG)
- Subset of replicas available to a client: AVSG
- Different clients, different AVSGs
- AVSG membership changes as server availability changes
- On write: when the file is closed, copies of the modified file are broadcast to the AVSG
116Optimistic Replication
Coda File System
- Primary goal: availability
- Replicated files are allowed to be modified even in the presence of partitions or during disconnected operation
117Disconnected Operation
Coda File System
- AVSG is empty
- Network/server failures, or the host is on the move
- Rely on the local cache to serve all needed files.
- Loading the cache
- User intervention: list of files to be cached
- Learning usage patterns over time
- Upon reconnection, cached copies are validated against the server's files
118Normal and Disconnected Operation
Coda File System
- During normal operation
- Coda behaves like AFS
- Cache misses are transparent to the user; only a performance penalty
- Load balancing across replicas
- Cost: replica consistency + cache consistency
- Disconnected operation
- No replicas are accessible
- A cache miss prevents further progress
- Need to load the cache before disconnection
119Replication and Caching
Coda File System
- Coda integrates server replication and client caching
- On a cache hit with valid data, Venus does not need to contact the server
- On a cache miss, Venus gets the data from an AVSG server, i.e. the preferred server (PS)
- The PS is chosen at random or based upon proximity and load
- Venus also contacts the other AVSG servers and collects their versions; if there is a conflict, the operation is aborted; if replicas are stale, they are updated off-line
120Summary Caching
Coda File System
- Improves performance in terms of
- response time,
- availability (disconnected operations), and
- fault tolerance
- Price: consistency
- Consistency mechanisms
- Timestamp-based invalidation
- Callbacks
121Leases
Leases
- Time-based cache consistency protocol
- Contract between client and server
- A lease grants its holder control over writes to the corresponding data item during the lease term
- The server must obtain approval from the holder of the lease before modifying the data
- When the holder grants approval for a write, it invalidates its local copy (see the sketch below)
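A minimal sketch of the lease contract described above, assuming a single lease holder per item and wall-clock lease terms; all names are illustrative.

import time

class LeaseServer:
    def __init__(self, term=10.0):
        self.term = term
        self.data = {}                     # item -> value
        self.leases = {}                   # item -> (client, expiry)

    def read(self, client, item):
        self.leases[item] = (client, time.time() + self.term)   # grant/renew a lease
        return self.data.get(item)

    def write(self, item, value):
        holder = self.leases.get(item)
        if holder is not None and holder[1] > time.time():
            holder[0].invalidate(item)     # obtain approval: the holder drops its copy
            del self.leases[item]
        self.data[item] = value

class Client:
    def __init__(self, name):
        self.name, self.cache = name, {}
    def invalidate(self, item):
        self.cache.pop(item, None)
        print(f"{self.name}: invalidated {item}")

srv, c1 = LeaseServer(), Client("c1")
c1.cache["x"] = srv.read(c1, "x")          # c1 now holds a lease on x
srv.write("x", 42)                         # server must contact c1 before modifying x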
122Other Distributed File Systems
Example DFS
- Plan 9 (Pike et al.)
- xFS (based on Berkeley's LFS)
- Secure File System (SFS) (Mazières et al.)
123Log-Structured File System
Log-Structured File System
- Built as extension to Sprite FS (Sprite LFS)
- New disk storage technique that tries to use
disks more efficiently - Assumes main memory cache for files
- Larger memory makes cache more efficient in
satisfying reads - Most of the working set is cached
- Thus, most disk access costs are due to writes
124Main Idea
Log-Structured File System
- Batch multiple writes in the file cache
- Transform many small writes into 1 large one
- Close to the disk's full bandwidth utilization
- Write to disk in one write to a contiguous region of the disk called the log
- Eliminates seeks (i.e. reduces access time)
- Improves crash recovery
- Sequential structure of the log
- Only the most recent portion of the log needs to be examined
125LFS Structure
Log-Structured File System
- 2 key functions
- How to retrieve information from log
- How to manage free disk space
126File Location and Retrieval 1
Log-Structured File System
- Allows random access to information in the log
- Goal is to match or increase read performance
- Keeps indexing structures within the log
- Each file has an i-node containing
- File attributes (type, owner, permissions)
- Disk addresses of the first 10 blocks
- For files > 10 blocks, the i-node contains a pointer to more data
127File Location and Retrieval 2
Log-Structured File System
- In the Unix FS
- Fixed mapping between file and i-node disk address: the i-node's disk address is a function of the file id
- In LFS
- I-nodes are written to the log
- An i-node map keeps the current location of each i-node
- I-node maps usually fit in the main memory cache (i-node → disk address; see the sketch below)
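A minimal sketch of the contrast described above: a fixed i-node address computation for a conventional FS versus an i-node map over an append-only log for an LFS. Sizes and names are illustrative.

INODE_SIZE = 128
INODE_TABLE_START = 4096

def unix_inode_address(file_id: int) -> int:
    """Conventional FS: the i-node's disk address is a fixed function of the file id."""
    return INODE_TABLE_START + file_id * INODE_SIZE

class LogStructuredFS:
    def __init__(self):
        self.log = []            # the log: a list of appended records
        self.inode_map = {}      # file_id -> index of the newest i-node in the log

    def write_inode(self, file_id: int, inode: dict):
        self.log.append(("inode", file_id, inode))     # i-nodes are written to the log
        self.inode_map[file_id] = len(self.log) - 1    # the map remembers the current location

    def read_inode(self, file_id: int) -> dict:
        return self.log[self.inode_map[file_id]][2]

lfs = LogStructuredFS()
lfs.write_inode(7, {"owner": "alice", "blocks": [10, 11]})
lfs.write_inode(7, {"owner": "alice", "blocks": [42, 43]})   # an update appends, never overwrites
print(unix_inode_address(7))        # 4992: always the same place on disk
print(lfs.read_inode(7))            # newest version, found via the i-node map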