Title: Distributed File Systems
1. Distributed File Systems
- CS 519: Operating System Theory
- Computer Science, Rutgers University
- Instructor: Thu D. Nguyen
- TA: Xiaoyan Li
- Spring 2002
2. File Service
- Implemented by a user/kernel process called the file server
- A system may have one or several file servers running at the same time
- Two models for file service:
  - Upload/download: files move between server and clients; few operations (read file, write file); simple; requires storage at the client; good if the whole file is accessed
  - Remote access: files stay at the server; rich interface with many operations; less space needed at the client; efficient for small accesses
3. Directory Service
- Provides naming, usually within a hierarchical file system
- Clients can have the same view (global root directory) or different views of the file system (remote mounting)
- Location transparency: the location of a file does not appear in the name of the file
  - e.g., /server1/dir1/file specifies the server but not where the server is located → the server can move the file in the network without changing the path
- Location independence: a single name space that looks the same on all machines; files can be moved between servers without changing their names → difficult
4. Two-Level Naming
- Symbolic name (external), e.g. prog.c; binary name (internal), e.g. a local i-node number as in Unix
- Directories provide the translation from symbolic to binary names (a minimal lookup sketch follows below)
- Binary name formats:
  - i-node: no cross references among servers
  - (server, i-node): a directory on one server can refer to a file on a different server
  - capability: specifies the address of the server, the file number, access permissions, etc.
  - a set of binary names: refers to the original file and all of its backups
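
A minimal sketch, assuming a toy in-memory directory, of how a lookup translates a symbolic name into a (server, i-node) binary name; the server names and i-node numbers are made up for illustration.

    # Toy directory: symbolic name -> (server, i-node) binary name.
    directory = {
        "prog.c":    ("fs1", 4711),   # local file on server fs1
        "notes.txt": ("fs2", 128),    # entry referring to a file on another server
    }

    def lookup(symbolic_name):
        """Translate a symbolic name into its binary name, as a directory would."""
        try:
            return directory[symbolic_name]          # (server, i-node)
        except KeyError:
            raise FileNotFoundError(symbolic_name)

    print(lookup("prog.c"))   # ('fs1', 4711)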
5. File Sharing Semantics
- UNIX semantics: total ordering of R/W events
  - easy to achieve in a non-distributed system
  - in a distributed system with one server and multiple clients, with no caching at the clients, total ordering is also easily achieved since reads and writes are immediately performed at the server
- Session semantics: writes are guaranteed to become visible only when the file is closed
  - allows caching at the client with lazy updating → better performance
  - if two or more clients simultaneously write one file, the last one (or a non-deterministically chosen one) replaces the other
6. File Sharing Semantics (cont'd)
- Immutable files: only create-file and read-file operations (no write)
  - writing a file means creating a new one and entering it into the directory, replacing the previous one with the same name; these are atomic operations
  - collision in writing: the last copy (or a non-deterministically chosen one) wins
  - what happens if the old copy is being read?
- Transaction semantics: mutual exclusion on file accesses; either all file operations are completed or none is. Good for banking systems
7. File System Properties
- Observed in a study by Satyanarayanan (1981):
  - most files are small (< 10 KB)
  - reading is much more frequent than writing
  - most R/W accesses are sequential (random access is rare)
  - most files have a short lifetime → create the file on the client
  - file sharing is unusual → caching at the client
  - the average process uses only a few files
8. Server System Structure
- File and directory services: combined or separate
- Caching directory hints at the client accelerates path-name lookup; the directory and the hints must be kept coherent
- State information about clients at the server:
  - stateless server: no client information is kept between requests
  - stateful server: the server maintains state information about clients between requests
9. Stateless vs. Stateful
10. Caching
- Three possible places: the server's memory, the client's disk, the client's memory
- Caching in the server's memory: avoids disk access, but still requires a network access
- Caching on the client's disk (if available): tradeoff between disk access and remote memory access
- Caching at the client, usually in main memory:
  - inside each process's address space: no sharing at the client
  - in the kernel: kernel involvement on hits
  - in a separate user-level cache manager: flexible and efficient if paging can be controlled from user level
- Server-side caching eliminates the coherence problem. Client-side cache coherence? Next
11. Client Cache Coherence in DFS
- How to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients
- Write-through: writes are sent to the server as soon as they are performed at the client → high traffic; requires cache managers to check (modification time) with the server before providing cached content to any client (see the sketch below)
- Delayed write: coalesces multiple writes; better performance but ambiguous semantics
- Write-on-close: implements session semantics
- Central control: the file server keeps a directory of open/cached files at clients → UNIX semantics, but problems with robustness and scalability; invalidation messages are also a problem because clients did not solicit them
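
A minimal sketch (not any specific DFS protocol) of the write-through policy with a modification-time validity check; the server object and its stat_mtime/read/write operations are assumptions made for illustration.

    class CachedFile:
        def __init__(self, data, mtime):
            self.data = data      # cached file contents
            self.mtime = mtime    # modification time observed when cached

    def read_through_cache(cache, name, server):
        """Serve from the cache only if the server's modification time still matches."""
        entry = cache.get(name)
        server_mtime = server.stat_mtime(name)     # assumed server RPC
        if entry is not None and entry.mtime == server_mtime:
            return entry.data                      # cache hit, still valid
        data = server.read(name)                   # otherwise fetch a fresh copy
        cache[name] = CachedFile(data, server_mtime)
        return data

    def write_through(cache, name, data, server):
        """Write-through: update the server immediately, then the local cache."""
        server.write(name, data)                   # every write goes straight to the server
        cache[name] = CachedFile(data, server.stat_mtime(name))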
12. File Replication
- Multiple copies are maintained, each copy on a separate file server, for multiple reasons:
  - increase reliability: the file is accessible even if a server is down
  - improve scalability: reduce contention by splitting the workload over multiple servers
- Replication transparency:
  - explicit file replication: the programmer controls replication
  - lazy file replication: copies are made by the server in the background
  - group communication: all copies are made at the same time, in the foreground
- How should replicas be modified? Next
13. Modifying Replicas: Voting Protocol
- Updating all replicas through a coordinator works but is not robust (if the coordinator is down, no updates can be performed) → voting: updates (and reads) can be performed if some specified number of servers agree
- Voting protocol (sketched below):
  - A version number (incremented at each write) is associated with each file
  - To perform a read, a client has to assemble a read quorum of Nr servers; similarly, a write quorum of Nw servers for a write
  - If Nr + Nw > N, then any read quorum will contain at least one copy with the most recent version
  - For reading, the client contacts Nr active servers and chooses the copy with the largest version
  - For writing, the client contacts Nw active servers asking them to write. The write succeeds if they all say yes
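
A minimal sketch of quorum-based voting over N in-memory replicas, assuming made-up quorum sizes with Nr + Nw > N; a real protocol would also handle failures, retries, and concurrent writers.

    import random

    N, NR, NW = 5, 2, 4                 # example sizes with NR + NW > N

    class Replica:
        def __init__(self):
            self.version = 0
            self.data = None

    replicas = [Replica() for _ in range(N)]

    def quorum_read():
        quorum = random.sample(replicas, NR)             # any NR servers
        newest = max(quorum, key=lambda r: r.version)    # largest version wins
        return newest.data

    def quorum_write(data):
        quorum = random.sample(replicas, NW)             # any NW servers
        new_version = max(r.version for r in quorum) + 1
        for r in quorum:                                 # all must accept the write
            r.version, r.data = new_version, data

    quorum_write("v1 of the file")
    assert quorum_read() == "v1 of the file"   # NR + NW > N guarantees quorum overlap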
14. Modifying Replicas: Voting Protocol
- Nr is usually small (reads are frequent), but Nw is usually close to N (we want to make sure all replicas are updated). Problem: achieving a write quorum in the presence of server failures
- Voting with ghosts: allows a write quorum to be established when several servers are down, by temporarily creating dummy (ghost) servers (at least one server in the quorum must be real)
- Ghost servers are not permitted in a read quorum (they don't have any files)
- When a server comes back up, it must first restore its copy by obtaining a read quorum
15. Network File System (NFS)
- A stateless DFS implemented at Sun
- An NFS server exports directories
- Clients access exported directories by mounting them
- Because NFS is stateless, OPEN and CLOSE operations are not needed at the server (they are implemented at the client)
- NFS provides file locking, but UNIX file semantics is not achieved because of client caching
- Write-through protocol, but delay is possible: dirty cache blocks are sent back by clients in chunks, every 30 seconds or at close
- A timer is associated with each cache block at the client (3 seconds for data blocks, 30 seconds for directory blocks). When the timer expires, the entry is discarded (if clean, of course) (see the sketch below)
- When a file is opened, the last modification time at the server is checked
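
A rough sketch, not the actual NFS client code, of the cache aging described above: each cached block carries a timeout (3 s for data, 30 s for directories) and clean entries are dropped once the timer expires.

    import time

    DATA_TIMEOUT, DIR_TIMEOUT = 3.0, 30.0     # seconds, as on the slide

    class CacheEntry:
        def __init__(self, block, is_directory):
            self.block = block
            self.dirty = False
            self.expires = time.time() + (DIR_TIMEOUT if is_directory else DATA_TIMEOUT)

    def evict_expired(cache):
        """Discard clean entries whose timer has expired; dirty ones must be written back first."""
        now = time.time()
        for name in list(cache):
            entry = cache[name]
            if now >= entry.expires and not entry.dirty:
                del cache[name]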
16. Recent Research in DFS
- Petal and Frangipani (DEC SRC): a 2-layer DFS system
17. Petal: Distributed Virtual Disks
- A distributed storage system that provides a virtual disk abstraction separate from the physical resources
- The virtual disk is globally accessible to all Petal clients on the network
- Virtual disks are implemented on a cluster of servers that cooperate to manage a pool of physical disks
- Advantages:
  - recovers from any single failure
  - transparent reconfiguration and expandability
  - load and capacity balancing
  - low-level service (lower than a DFS) that handles distribution problems
18. Petal
19. Virtual to Physical Translation
- <virtual disk, virtual offset> → <server, physical disk, physical offset>
- Three data structures: the virtual disk directory, the global map, and the physical map
- The virtual disk directory and the global map are globally replicated and kept consistent
- The physical map is local to each server
- One level of indirection (virtual disk to global map) is necessary to allow transparent reconfiguration. We'll discuss reconfiguration soon
20. Virtual to Physical Translation (cont'd)
- The virtual disk directory translates the virtual disk identifier (like a volume id) into a global map identifier
- The global map determines the server responsible for translating the given offset (a virtual disk may be spread over multiple physical disks). The global map also specifies the redundancy scheme for the virtual disk
- The physical map at a specific server translates the global map identifier and the offset to a physical disk and an offset within that disk. The physical map is similar to a page table (see the sketch below)
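
A simplified sketch of this three-step translation chain with made-up identifiers; Petal's real structures also encode redundancy information and epochs.

    # Step 1 data: virtual disk id -> global map id
    virtual_disk_directory = {"vdisk0": "gmap0"}

    # Step 2 data: global map id -> rule choosing the responsible server for an offset
    global_maps = {
        "gmap0": lambda offset: f"server{(offset // (64 * 2**20)) % 4}",  # e.g. 64 MB stripes over 4 servers
    }

    # Step 3 data: per-server physical map, (global map id, offset) -> (disk, physical offset)
    physical_maps = {
        "server0": {("gmap0", 0): ("disk1", 8192)},
    }

    def translate(vdisk, offset):
        gmap_id = virtual_disk_directory[vdisk]                 # 1. directory lookup
        server = global_maps[gmap_id](offset)                   # 2. pick the responsible server
        disk, phys = physical_maps[server][(gmap_id, offset)]   # 3. server-local physical map
        return server, disk, phys

    print(translate("vdisk0", 0))   # ('server0', 'disk1', 8192)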
21. Support for Backup
- Petal simplifies a client's backup procedure by providing a snapshot mechanism
- Petal generates snapshots of virtual disks using copy-on-write (the backup points to the old blocks, which are write-protected). Creating a snapshot requires pausing the client's application to guarantee consistency
- A snapshot is a virtual disk that cannot be modified
- Snapshots require a modification to the translation scheme. The virtual disk directory translates a virtual disk id into a pair <global map id, epoch>, where the epoch is incremented at each snapshot
- At each snapshot a new tuple with a new epoch is created in the virtual disk directory. The snapshot keeps the old epoch
- All accesses to the virtual disk are made using the new epoch, so that writes to the original disk create new entries in the new epoch rather than overwriting the blocks in the snapshot (see the sketch below)
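
A toy sketch of epoch-based copy-on-write snapshots in the spirit of this scheme (not Petal's actual data structures): blocks are keyed by (epoch, offset) and reads fall back to older epochs when the current epoch has no entry.

    class SnapshotDisk:
        def __init__(self):
            self.epoch = 0
            self.blocks = {}                       # (epoch, offset) -> data

        def snapshot(self):
            """The snapshot keeps the old epoch; subsequent writes use the new epoch."""
            self.epoch += 1
            return self.epoch - 1                  # id of the read-only snapshot

        def write(self, offset, data):
            self.blocks[(self.epoch, offset)] = data   # never overwrites older epochs

        def read(self, offset, epoch=None):
            e = self.epoch if epoch is None else epoch
            while e >= 0:                          # newest entry at or before this epoch
                if (e, offset) in self.blocks:
                    return self.blocks[(e, offset)]
                e -= 1
            return None

    d = SnapshotDisk()
    d.write(0, "old contents")
    snap = d.snapshot()
    d.write(0, "new contents")
    assert d.read(0) == "new contents" and d.read(0, epoch=snap) == "old contents"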
22. Virtual Disk Reconfiguration
- Needed when a new server is added or the redundancy scheme is changed
- Steps to perform it all at once (not incrementally) and in the absence of any other activity:
  - create a new global map with the desired redundancy scheme and server mapping
  - change all virtual disk directories to point to the new global map
  - redistribute the data to the servers according to the translation specified in the new global map
- The challenge is to perform it incrementally and concurrently with normal client requests
23. Incremental Reconfiguration
- First two steps as before; step 3 is done in the background, starting with the translations in the most recent epoch that have not yet been moved
- The old global map is used to perform read translations that are not found in the new global map (see the sketch below)
- A write request only accesses the new global map, to avoid consistency problems
- Limitation: the mapping of the entire virtual disk must be changed before any data is moved → lots of new-global-map misses on reads → high traffic. Solution: relocate only a portion of the virtual disk at a time. Read requests for the portion of the virtual disk being relocated cause misses, but requests to other areas do not
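
A minimal sketch of the read/write rule during incremental reconfiguration: writes always go through the new map, while reads fall back to the old map on a miss. The maps are plain dictionaries from offset to location, purely for illustration.

    def read_translate(offset, new_map, old_map):
        if offset in new_map:          # already relocated
            return new_map[offset]
        return old_map[offset]         # not yet moved: fall back to the old map

    def write_translate(offset, location, new_map):
        new_map[offset] = location     # writes only ever touch the new map

    old_map = {0: ("server0", 100), 1: ("server1", 200)}
    new_map = {}
    write_translate(1, ("server2", 50), new_map)
    assert read_translate(1, new_map, old_map) == ("server2", 50)   # served by the new map
    assert read_translate(0, new_map, old_map) == ("server0", 100)  # miss -> old map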
24. Redundancy with Chained Data Placement
- Petal uses chained-declustering data placement (sketched after the table below):
  - two copies of each data block are stored on neighboring servers
  - every pair of neighboring servers has data blocks in common
  - if server 1 fails, servers 0 and 2 will share server 1's read load (not server 3)
  server 0   server 1   server 2   server 3
  d0         d1         d2         d3
  d3         d0         d1         d2
  d4         d5         d6         d7
  d7         d4         d5         d6
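
A small sketch of chained-declustering placement over four servers, reproducing the table above: the primary copy of block i goes to server i mod 4 and the secondary copy to the next server in the chain.

    NUM_SERVERS = 4

    def primary(block):
        return block % NUM_SERVERS

    def secondary(block):
        return (block + 1) % NUM_SERVERS      # next neighbor in the chain

    for b in range(8):
        print(f"d{b}: primary = server {primary(b)}, secondary = server {secondary(b)}")
    # d0 -> servers 0 and 1, d3 -> servers 3 and 0, d4 -> servers 0 and 1, ...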
25. Chained Data Placement (cont'd)
- In case of failure, each server can offload some of its original read load to the next/previous server. Offloading can be cascaded across servers to uniformly balance the load
- Advantage: with simple mirrored redundancy, the failure of a server would result in a 100% load increase at another server
- Disadvantage: less reliable than simple mirroring; if a server fails, the failure of either one of its two neighbor servers will result in data becoming unavailable
- In Petal, one copy is called the primary, the other the secondary
- Read requests can be serviced by either of the two servers, while write requests must always try the primary first to prevent deadlock (blocks are locked before reading or writing, and writes require access to both servers)
26. Read Request
- The Petal client tries the primary or the secondary server depending on which one has the shorter queue length. (Each client maintains a small amount of high-level mapping information that is used to route requests to the most appropriate servers. If a request is sent to an inappropriate server, the server returns an error code, causing the client to update its hints and retry the request)
- The server that receives the request attempts to read the requested data
- If it is not successful, the client tries the other server
27. Write Request
- The Petal client tries the primary server first
- The primary server marks the data busy and sends the request to its local copy and the secondary copy
- When both complete, the busy bit is cleared and the operation is acknowledged to the client
- If this is not successful, the client tries the secondary server
- If the secondary server detects that the primary server is down, it marks the data element as stale on stable storage before writing to its local disk
- When the primary server comes back up, it has to bring all data marked stale up to date during recovery
- The scheme is similar if the secondary server is down (a condensed sketch of the write path follows below)
28. Petal Prototype
29. Petal Performance - Latency
A single client generates requests to random disk offsets
30. Petal Performance - Throughput
Each of 4 clients makes random requests to a single virtual disk. In the failed configuration, one of the 4 servers has crashed
31. Petal Performance - Scalability
32. Frangipani
- Petal provides a disk interface → we need a file system
- Frangipani is a file system designed to take full advantage of Petal
- Frangipani's main characteristics:
  - All users are given a consistent view of the same set of files
  - Servers can be added without changing the configuration of existing servers or interrupting their operation
  - It tolerates and recovers from machine, network, and disk failures
  - It is very simple internally: a set of cooperating machines that use a common store and synchronize access to that store with locks
33. Frangipani
- Petal takes much of the complexity out of Frangipani:
  - Petal provides highly available storage that can scale in throughput and capacity
- However, Frangipani improves on Petal, since:
  - Petal has no provision for sharing the storage among multiple clients
  - Applications use a file-based interface rather than the disk-like interface provided by Petal
- Problems with Frangipani on top of Petal:
  - Some logging occurs twice (once in Frangipani and once in Petal)
  - Disk location cannot be used in placing data, because Petal virtualizes the disks
  - Frangipani locks entire files and directories as opposed to individual blocks
34. Frangipani Structure
35. Frangipani Disk Layout
- A Frangipani file system uses only one Petal virtual disk
- Petal provides 2^64 bytes of virtual disk space
- Real disk space is committed only when it is actually used (written)
- Frangipani breaks the disk into regions:
  - 1st region: stores configuration parameters and housekeeping info
  - 2nd region: stores logs; each Frangipani server uses a portion of this region for its log. There can be up to 256 logs
  - 3rd region: holds allocation bitmaps, describing which blocks in the remaining regions are free. Each server locks a different portion
  - 4th region: holds inodes
  - 5th region: holds small data blocks (4 KB each)
  - The remainder of the Petal disk holds large data blocks (1 TB each)
36. Frangipani File Structure
- The first 16 blocks (64 KB) of a file are stored in small blocks
- If the file grows larger, the rest is stored in a 1 TB large block (see the sketch below)
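
A tiny sketch of the layout just described, mapping a byte offset within a file to the block that holds it: offsets in the first 64 KB land in one of sixteen 4 KB small blocks, and everything beyond that falls into the single large block. The returned labels are invented.

    SMALL_BLOCK = 4 * 1024            # 4 KB small blocks
    NUM_SMALL = 16                    # first 16 blocks (64 KB) of a file

    def locate(offset):
        if offset < NUM_SMALL * SMALL_BLOCK:
            return ("small", offset // SMALL_BLOCK, offset % SMALL_BLOCK)
        return ("large", 0, offset - NUM_SMALL * SMALL_BLOCK)

    print(locate(10_000))      # ('small', 2, 1808): third small block
    print(locate(1_000_000))   # ('large', 0, 934464): inside the large block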
37. Frangipani: Dealing with Failures
- Write-ahead redo logging of metadata; user data is not logged (but Petal takes care of that)
- Each Frangipani server has its own private log
- Only after a log record is written to Petal does the server modify the actual metadata in its permanent locations
- If a server crashes, the system detects the failure and another server uses the log to recover (a bare-bones sketch follows below)
- Because the log is on Petal, any server can get to it
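
A bare-bones sketch of write-ahead redo logging for metadata updates, in the spirit of the slide; the RedoLog and Metadata classes are invented, and a real log adds sequence numbers, checksums, and space reclamation.

    class RedoLog:
        def __init__(self):
            self.records = []                 # stands in for the log region on Petal

        def append(self, record):
            self.records.append(record)       # the record must reach stable storage first

    class Metadata:
        def __init__(self):
            self.table = {}

        def apply(self, record):
            key, value = record
            self.table[key] = value

    def update(log, metadata, key, value):
        log.append((key, value))              # 1. write the redo record to the log
        metadata.apply((key, value))          # 2. only then modify the metadata in place

    def recover(log, metadata):
        """Another server can replay the crashed server's log, which lives on Petal."""
        for record in log.records:
            metadata.apply(record)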
38. Frangipani: Synchronization and Coherence
- Frangipani has a lock for each log segment, allocation bitmap segment, and file
- Multiple-reader/single-writer locks. In case of conflicting requests, the owner of the lock is asked to release or downgrade it to remove the conflict
- A read lock allows a server to read data from disk and cache it. If the server is asked to release its read lock, it must invalidate the cache entry before complying
- A write lock allows a server to read or write data and cache it. If a server is asked to release its write lock, it must write dirty data to disk and invalidate the cache entry before complying. If a server is asked to downgrade the lock, it must write dirty data to disk before complying (see the sketch below)
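
A compact sketch of the release/downgrade rules just described, for one cached object guarded by a reader/writer lock; the class and the disk interface are illustrative, not Frangipani's.

    class CachedObject:
        def __init__(self):
            self.mode = None        # None, "read", or "write"
            self.cached = None      # locally cached copy
            self.dirty = False

        def on_release(self, disk):
            """Give up the lock entirely: flush if dirty, then invalidate the cache entry."""
            if self.mode == "write" and self.dirty:
                disk.write(self.cached)
                self.dirty = False
            self.cached = None       # invalidate before complying
            self.mode = None

        def on_downgrade(self, disk):
            """Write lock -> read lock: flush dirty data but keep the now-clean cache entry."""
            if self.mode == "write":
                if self.dirty:
                    disk.write(self.cached)
                    self.dirty = False
                self.mode = "read"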
39. Frangipani: Lock Service
- Fully distributed lock service, for fault tolerance and scalability
- How to release the locks owned by a failed Frangipani server?
- The failure of a server is discovered when its lease expires. A lease is obtained by the server when it first contacts the lock service. All locks it acquires are associated with the lease. Each lease has an expiration time (30 seconds) after its creation or last renewal. A server must renew its lease before it expires (see the sketch below)
- When a server fails, the locks that it owns cannot be released until its log is processed and any pending updates are written to Petal
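
A short sketch of lease bookkeeping on the lock-service side, using the 30-second expiration mentioned above; the Lease class is hypothetical and ignores clock skew and message delays.

    import time

    LEASE_SECONDS = 30.0

    class Lease:
        def __init__(self):
            self.expires = time.time() + LEASE_SECONDS   # set at creation
            self.locks = set()                           # locks tied to this lease

        def renew(self):
            self.expires = time.time() + LEASE_SECONDS   # must happen before expiry

        def expired(self):
            return time.time() >= self.expires           # holder is presumed failed

    def reclaim_if_failed(lease, replay_log):
        """Locks are released only after the failed server's log has been replayed."""
        if lease.expired():
            replay_log()              # push any pending updates to Petal first
            lease.locks.clear()       # now the locks can be granted to others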
40. Frangipani Performance
41. Frangipani Performance
42. Frangipani Scalability
43. Frangipani Scalability
44. Frangipani Scalability