1
CS519 Fall 2003
  • Distributed File Systems
  • Lecturer Ricardo Bianchini

2
File Service
  • Implemented by a user/kernel process called file
    server
  • A system may have one or several file servers
    running at the same time
  • Two models for file services
  • upload/download: files move between server and
    clients, few operations (read file, write file),
    simple, requires storage at the client, good if the
    whole file is accessed
  • remote access: files stay at the server, rich
    interface with many operations, less space needed at
    the client, efficient for small accesses

3
Directory Service
  • Provides naming usually within a hierarchical
    file system
  • Clients can have the same view (global root
    directory) or different views of the file system
    (remote mounting)
  • Location transparency: the location of the file
    does not appear in the name of the file
  • e.g., /server1/dir1/file specifies the server but
    not where the server is located -> the server can
    move the file in the network without changing the
    path
  • Location independence: a single name space that
    looks the same on all machines; files can be
    moved between servers without changing their
    names -> difficult to achieve

4
Two-Level Naming
  • Symbolic name (external), e.g., prog.c; binary
    name (internal), e.g., local i-node number as in
    Unix
  • Directories provide the translation from symbolic
    to binary names (see the sketch below)
  • Binary name formats
  • i-node: no cross references among servers
  • (server, i-node): a directory in one server can
    refer to a file on a different server
  • Capability: specifies the address of the server, the
    file number, access permissions, etc.
  • set of binary names: the binary names refer to the
    original file and all of its backups
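
A minimal Python sketch of two-level naming (all names and structures below are hypothetical, for illustration only): a directory maps symbolic names to (server, i-node) binary names, so a directory on one server can point at a file stored on another.

    # Sketch of two-level naming: directories translate symbolic
    # (external) names into (server, i-node) binary names.

    directory = {
        "prog.c": ("server1", 4711),   # (server, i-node number)
        "prog.o": ("server2", 815),    # may live on a different server
    }

    def lookup(symbolic_name):
        """Translate a symbolic name into its binary name."""
        server, inode = directory[symbolic_name]
        return server, inode

    print(lookup("prog.c"))   # ('server1', 4711)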

5
File Sharing Semantics
  • UNIX semantics: total ordering of R/W events
  • easy to achieve in a non-distributed system
  • in a distributed system with one server and
    multiple clients with no caching at the client, total
    ordering is also easily achieved, since reads and writes
    are immediately performed at the server
  • Session semantics: writes are guaranteed to
    become visible only when the file is closed
  • allows caching at the client with lazy updating ->
    better performance
  • if two or more clients simultaneously write one
    file, one version (the last one, or one chosen
    non-deterministically) replaces the other

6
File Sharing Semantics (contd)
  • Immutable files: only create and read file operations
    (no write)
  • writing a file means creating a new one and
    entering it into the directory, replacing the
    previous one with the same name; both are atomic
    operations
  • collision in writing: the last copy wins, or one is
    chosen non-deterministically
  • what happens if the old copy is being read?
  • Transaction semantics: mutual exclusion on file
    accesses; either all file operations are
    completed or none is. Good for banking systems

7
File System Properties
  • Observed in a study by Satyanarayanan (1981)
  • most files are small (< 10K)
  • reading is much more frequent than writing
  • most R/W accesses are sequential (random access
    is rare)
  • most files have a short lifetime -> create the
    file on the client
  • file sharing is unusual -> caching at the client
  • the average process uses only a few files

8
Server System Structure
  • File and directory services: combined or not
  • Caching directory hints at the client accelerates
    path name lookup; the directory and the hints must be
    kept coherent
  • State information about clients at the server
  • stateless server: no client information is kept
    between requests
  • stateful server: the server maintains state
    information about clients between requests

9
Stateless vs. Stateful
10
Caching
  • Three possible places: the server's memory, the client's
    disk, the client's memory
  • Caching in the server's memory avoids disk access
    but still requires network access
  • Caching at the client's disk (if available): a tradeoff
    between disk access and remote memory access
  • Caching at the client in main memory
  • inside each process address space: no sharing at the
    client
  • in the kernel: kernel involvement on hits
  • in a separate user-level cache manager: flexible
    and efficient if paging can be controlled from
    user level
  • Server-side caching eliminates the coherence problem.
    Client-side cache coherence? Next

11
Client Cache Coherence in DFS
  • How to maintain coherence (according to a model,
    e.g., UNIX semantics or session semantics) of
    copies of the same file at various clients
  • Write-through: writes are sent to the server as soon
    as they are performed at the client -> high
    traffic; requires cache managers to check the
    modification time with the server before they can
    provide cached content to any client
  • Delayed write: coalesces multiple writes; better
    performance but ambiguous semantics
  • Write-on-close: implements session semantics
  • Central control: the file server keeps a directory of
    open/cached files at clients and sends
    invalidations -> Unix semantics, but problems
    with robustness and scalability; also a problem
    with invalidation messages because clients did
    not solicit them

12
File Replication
  • Multiple copies are maintained, each copy on a
    separate file server, for multiple reasons
  • Increase reliability: the file is accessible even if a
    server is down
  • Improve scalability: reduce contention by
    splitting the workload over multiple servers
  • Replication transparency
  • explicit file replication: the programmer controls
    replication
  • lazy file replication: copies are made by the server
    in the background
  • group communication: all copies are made at the
    same time in the foreground
  • How should replicas be modified? Next

13
Modifying Replicas: Voting Protocol
  • Updating all replicas using a coordinator works
    but is not robust (if the coordinator is down, no
    updates can be performed) -> Voting: updates (and
    reads) can be performed only if some specified number of
    servers agree
  • Voting Protocol (see the sketch below)
  • A version number (incremented at each write) is
    associated with each file
  • To perform a read, a client has to assemble a
    read quorum of Nr servers; similarly, a write
    quorum of Nw servers for a write
  • If Nr + Nw > N, then any read quorum will contain
    at least one most recently updated file version
  • For reading, the client contacts Nr active servers
    and chooses the file with the largest version
  • For writing, the client contacts Nw active servers
    asking them to write. The write succeeds if they all say
    yes
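
A minimal Python sketch of the quorum rule above (hypothetical code, with in-memory "servers" that each hold a version and the data of one file): because Nr + Nw > N, every read quorum overlaps every write quorum, so the largest version a reader sees is the latest write.

    # Sketch of quorum-based replication (Nr + Nw > N).
    # Each "server" is just a dict holding (version, data) for one file.

    N, Nr, Nw = 5, 2, 4                      # must satisfy Nr + Nw > N
    servers = [{"version": 0, "data": None} for _ in range(N)]

    def read(quorum):
        """Read from at least Nr servers and return the most recent version."""
        assert len(quorum) >= Nr
        newest = max((servers[i] for i in quorum), key=lambda s: s["version"])
        return newest["version"], newest["data"]

    def write(quorum, data):
        """Write succeeds only if all Nw contacted servers accept."""
        assert len(quorum) >= Nw
        version = max(servers[i]["version"] for i in quorum) + 1
        for i in quorum:
            servers[i] = {"version": version, "data": data}

    write([0, 1, 2, 3], "v1 contents")       # write quorum of Nw = 4
    print(read([3, 4]))                      # read quorum of Nr = 2 sees (1, 'v1 contents')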

14
Modifying Replicas Voting Protocol
  • Nr is usually small (reads are frequent), but Nw
    is usually close to N (we want to make sure all
    replicas are updated). Problem: achieving a
    write quorum in the presence of server failures
  • Voting with ghosts allows a write quorum to be
    established when several servers are down by
    temporarily creating dummy (ghost) servers (at
    least one must be real)
  • Ghost servers are not permitted in a read quorum
    (they don't have any files)
  • When a server comes back, it must first restore its copy
    by obtaining a read quorum

15
Network File System (NFSv3)
  • A stateless DFS from Sun; the only state is the map of
    handles to files
  • An NFS server exports directories
  • Clients access exported directories by mounting
    them
  • Because NFS is stateless, OPEN and CLOSE RPCs are
    not provided by the server (they are implemented at the
    client); clients need to block on close until all
    dirty data are stored on disk at the server
  • NFS provides file locking (through a separate
    network lock manager protocol), but UNIX semantics
    is not achieved due to client caching
  • dirty cache blocks are sent to the server in chunks,
    every 30 sec or at close
  • a timer is associated with each cache block at
    the client (3 sec for data blocks, 30 sec for
    directory blocks). When the timer expires, the
    entry is discarded (if clean, of course); see the
    sketch below
  • when a file is opened, the last modification time at
    the server is checked
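
The timer-based cache expiry can be sketched as follows (hypothetical Python, not the actual NFS client code): each cached entry records when it was last validated, and a hit is honoured only while its timer (3 s for data, 30 s for directories) has not expired.

    import time

    # Sketch of NFS-style cache entry timers (3 s data, 30 s directory).

    TTL = {"data": 3.0, "directory": 30.0}
    cache = {}   # name -> (kind, contents, time_validated)

    def put(name, kind, contents):
        cache[name] = (kind, contents, time.time())

    def get(name):
        """Return cached contents, or None if the entry's timer expired."""
        entry = cache.get(name)
        if entry is None:
            return None
        kind, contents, validated = entry
        if time.time() - validated > TTL[kind]:
            del cache[name]          # discard the expired (clean) entry
            return None              # caller must revalidate with the server
        return contents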

16
Recent Research in DFS
  • Petal/Frangipani (DEC SRC): a 2-layer DFS system
  • xFS (Berkeley): a serverless network file system

17
Petal Distributed Virtual Disks
  • A distributed storage system that provides a
    virtual disk abstraction separate from the
    physical resource
  • The virtual disk is globally accessible to all
    Petal clients on the network
  • Virtual disks are implemented on a cluster of
    servers that cooperate to manage a pool of
    physical disks
  • Advantages
  • recover from any single failure
  • transparent reconfiguration and expandability
  • load and capacity balancing
  • low-level service (lower than a DFS) that handles
    distribution problems

18
Petal
19
Virtual to Physical Translation
  • <virtual disk, virtual offset> -> <server,
    physical disk, physical offset>
  • Three data structures: virtual disk directory,
    global map, and physical map
  • The virtual disk directory and the global map are
    globally replicated and kept consistent
  • The physical map is local to each server
  • One level of indirection (virtual disk to global
    map) is necessary to allow transparent
    reconfiguration. We'll discuss reconfiguration
    soon

20
Virtual to Physical Translation (contd)
  • The virtual disk directory translates the virtual
    disk identifier into a global map identifier
  • The global map determines the server responsible
    for translating the given offset (a virtual disk
    may be spread over multiple physical disks). The
    global map also specifies the redundancy scheme
    for the virtual disk
  • The physical map at a specific server translates
    the global map identifier and the offset to a
    physical disk and an offset within that disk.
    The physical map is similar to a page table
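
The three-step translation can be illustrated with a small Python sketch (hypothetical, simplified data structures; the real Petal maps are more elaborate): the virtual disk directory yields a global map id, the global map picks the responsible server for an offset, and that server's local physical map yields the physical disk and offset.

    # Sketch of Petal's virtual-to-physical translation:
    # <vdisk, voffset> -> <server, physical disk, physical offset>

    vdisk_directory = {"vdisk0": "gmap0"}              # virtual disk -> global map id

    def global_map(gmap_id, voffset, num_servers=4):
        """Pick the server responsible for this offset (simple striping here)."""
        return voffset % num_servers

    physical_maps = {                                  # one physical map per server
        0: {("gmap0", 0): ("disk0", 4096)},
        1: {("gmap0", 1): ("disk1", 8192)},
    }

    def translate(vdisk, voffset):
        gmap_id = vdisk_directory[vdisk]               # step 1: virtual disk directory
        server = global_map(gmap_id, voffset)          # step 2: global map picks the server
        pdisk, poffset = physical_maps[server][(gmap_id, voffset)]  # step 3: server-local map
        return server, pdisk, poffset

    print(translate("vdisk0", 1))                      # (1, 'disk1', 8192)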

21
Support for Backup
  • Petal simplifies a client's backup procedure by
    providing a snapshot mechanism
  • Petal generates snapshots of virtual disks using
    copy-on-write. Creating a snapshot requires
    pausing the client's application to guarantee
    consistency
  • A snapshot is a virtual disk that cannot be
    modified
  • Snapshots require a modification to the
    translation scheme. The virtual disk directory
    translates a virtual disk id into a pair <global
    map id, epoch number>, where the epoch number is
    incremented at each snapshot
  • At each snapshot, a new tuple with a new epoch number is
    created in the virtual disk directory. The
    snapshot keeps the old epoch number
  • All accesses to the virtual disk are made using
    the new epoch number, so that any write to the
    original disk creates new entries in the new
    epoch rather than overwriting the blocks in the
    snapshot (see the sketch below)
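
A minimal Python sketch of the epoch idea (hypothetical, not Petal's actual structures): blocks are keyed by (epoch, offset), a snapshot freezes the old epoch, new writes go to the current epoch, and reads fall back to the newest epoch that has the block.

    # Sketch of snapshot-by-epoch with copy-on-write.

    blocks = {}          # (epoch, offset) -> data
    current_epoch = 0

    def write(offset, data):
        blocks[(current_epoch, offset)] = data        # never overwrites older epochs

    def read(offset):
        """Read the newest copy at or below the current epoch."""
        for epoch in range(current_epoch, -1, -1):
            if (epoch, offset) in blocks:
                return blocks[(epoch, offset)]
        return None

    def snapshot():
        """The snapshot keeps the old epoch; new writes use a new epoch."""
        global current_epoch
        snap = current_epoch
        current_epoch += 1
        return snap

    write(0, "original")
    snap = snapshot()
    write(0, "modified")                              # goes into the new epoch
    print(read(0))                                    # 'modified'
    print(blocks[(snap, 0)])                          # snapshot still sees 'original'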

22
Virtual Disk Reconfiguration
  • Needed when a new server is added or the
    redundancy scheme is changed
  • Steps to perform it at once (not incrementally)
    and in the absence of any other activity
  • create a new global map with desired redundancy
    scheme and server mapping
  • change all virtual disk directories to point to
    the new global map
  • redistribute data to the servers according to the
    translation specified in the new global map
  • The challenge is to perform it incrementally and
    concurrently with normal client requests

23
Incremental Reconfiguration
  • First two steps as before; step 3 is done in the
    background, starting with the translations in the
    most recent epoch that have not yet been moved
  • The old global map is used to perform read
    translations that are not found in the new
    global map
  • A write request only accesses the new global map,
    to avoid consistency problems
  • Limitation: the mapping of the entire virtual
    disk must be changed before any data is moved ->
    lots of new-global-map misses on reads -> high
    traffic. Solution: relocate only a portion of
    the virtual disk at a time. Read requests for the
    portion of the virtual disk being relocated cause
    misses, but requests to other areas do not

24
Redundancy with Chained Data Placement
  • Petal uses chained-declustering data placement
  • two copies of each data block are stored on
    neighboring servers
  • every pair of neighboring servers has data blocks
    in common
  • if server 1 fails, servers 0 and 2 will share
    server 1's read load (not server 3); see the layout
    table and the sketch below

    server 0   server 1   server 2   server 3
    d0         d1         d2         d3         (primary copies)
    d3         d0         d1         d2         (secondary copies)
    d4         d5         d6         d7         (primary copies)
    d7         d4         d5         d6         (secondary copies)
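
The layout above can be generated with a short Python sketch (hypothetical code): the primary copy of block b lives on server b mod N and the secondary copy on the next server, so every pair of neighboring servers shares some blocks.

    # Sketch of chained declustering: primary copy of block b on
    # server b mod N, secondary copy on the next server (b + 1) mod N.

    def placement(block, num_servers):
        primary = block % num_servers
        secondary = (primary + 1) % num_servers
        return primary, secondary

    N = 4
    for b in range(8):
        p, s = placement(b, N)
        print(f"d{b}: primary on server {p}, copy on server {s}")
    # d0: primary on server 0, copy on server 1
    # d1: primary on server 1, copy on server 2  ... matching the table above
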
25
Chained Data Placement (contd)
  • In case of failure, each server can offload some
    of its original read load to the next/previous
    server. Offloading can be cascaded across
    servers to uniformly balance the load
  • Advantage: with simple mirrored redundancy, the
    failure of a server would result in a 100% load
    increase on another server
  • Disadvantage: less reliable than simple mirroring
    - if a server fails, the failure of either one of
    its two neighboring servers will result in data
    becoming unavailable
  • In Petal, one copy is called the primary, the other
    the secondary
  • Read requests can be serviced by either of the two
    servers, while write requests must always try the
    primary first to prevent deadlock (blocks are
    locked before reading or writing, but writes
    require access to both servers)

26
Read Request
  • The Petal client tries primary or secondary
    server depending on which one has the shorter
    queue length. (Each client maintains a small
    amount of high-level mapping information that is
    used to route requests to the most appropriate
    servers. If a request is sent to an
    inappropriate server, the server returns an error
    code, causing the client to update its hints and
    retry the request)
  • The server that receives the request attempts to
    read the requested data
  • If not successful, the client tries the other
    server

27
Write Request
  • The Petal client tries the primary server first
  • The primary server marks the data busy and sends the
    request to its local copy and the secondary copy
  • When both complete, the busy bit is cleared and
    the operation is acknowledged to the client
  • If not successful, the client tries the secondary
    server
  • If the secondary server detects that the primary
    server is down, it marks the data element as
    stale on stable storage before writing to its
    local disk
  • When the primary server comes up, it has to bring
    all data marked stale up-to-date during recovery
  • The procedure is similar if the secondary server is
    down (see the sketch below)
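
A rough Python sketch of the write path just described (hypothetical code; the busy bit, stable storage, and error handling are heavily simplified): the primary updates both copies, and the secondary marks data stale when it must proceed without the primary.

    # Much-simplified sketch of Petal's write path with two replicas.

    stores = {"primary": {}, "secondary": {}}
    stale  = set()                       # elements written while the primary was down
    up     = {"primary": True, "secondary": True}

    def write(element, data):
        if up["primary"]:
            # primary writes its local copy and forwards to the secondary
            stores["primary"][element] = data
            if up["secondary"]:
                stores["secondary"][element] = data
        elif up["secondary"]:
            # secondary detects the primary is down: mark stale, then write
            stale.add(element)
            stores["secondary"][element] = data
        else:
            raise RuntimeError("no replica available")

    def recover_primary():
        """On restart, the primary brings all stale elements up to date."""
        for element in stale:
            stores["primary"][element] = stores["secondary"][element]
        stale.clear()
        up["primary"] = True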

28
Petal Prototype
29
Petal Performance - Latency
Single client generates requests to random disk
offsets
30
Petal Performance - Throughput
Each of 4 clients makes random requests to a
single virtual disk. Failed configuration: one of the
4 servers has crashed
31
Petal Performance - Scalability
32
Frangipani
  • Petal provides a disk interface -> we need a file
    system
  • Frangipani is a file system designed to take full
    advantage of Petal
  • Frangipani's main characteristics
  • All users are given a consistent view of the same
    set of files
  • Servers can be added without changing the
    configuration of existing servers or interrupting
    their operation
  • Tolerates and recovers from machine, network, and
    disk failures
  • Very simple internally: a set of cooperating
    machines that use a common store and synchronize
    access to that store with locks

33
Frangipani
  • Petal takes much of the complexity out of
    Frangipani
  • Petal provides highly available storage that can
    scale in throughput and capacity
  • However, Frangipani improves on Petal, since
  • Petal has no provision for sharing the storage
    among multiple clients
  • Applications use a file-based interface rather
    than the disk-like interface provided by Petal
  • Problems with Frangipani on top of Petal
  • Some logging occurs twice (once in Frangipani and
    once in Petal)
  • Cannot use disk location in placing data, because
    Petal virtualizes disks
  • Frangipani locks entire files and directories as
    opposed to individual blocks

34
Frangipani Structure
35
Frangipani Disk Layout
  • A Frangipani file system uses only one Petal
    virtual disk
  • Petal provides 2^64 bytes of virtual disk space
  • Real disk space is committed only when actually used
    (written)
  • Frangipani breaks the disk into regions
  • 1st region (1 TB): stores configuration parameters and
    housekeeping info
  • 2nd region (1 TB): stores logs; each Frangipani
    server uses a portion of this region for its log.
    There can be up to 256 logs
  • 3rd region (3 TB): holds allocation bitmaps,
    describing which blocks in the remaining regions are
    free. Each server locks a different portion
  • 4th region (1 TB): holds inodes
  • 5th region (128 TB): holds small data blocks (4
    KB each)
  • The remainder of the Petal disk holds large data
    blocks (1 TB each)

36
Frangipani File Structure
  • The first 16 blocks (64 KB) of a file are stored in
    small blocks
  • If the file becomes larger, the rest is stored in a
    1 TB large block (see the sketch below)
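
This placement rule can be written as a tiny Python sketch (hypothetical code; block sizes taken from the layout above): byte offsets below 64 KB fall into one of the sixteen 4 KB small blocks, and everything beyond that lands in the single 1 TB large block.

    # Sketch of Frangipani's file structure: the first 16 x 4 KB = 64 KB
    # of a file live in small blocks, the rest in one 1 TB large block.

    SMALL_BLOCK = 4 * 1024
    NUM_SMALL   = 16

    def locate(offset):
        """Map a byte offset in a file to (block kind, block index, offset in block)."""
        if offset < NUM_SMALL * SMALL_BLOCK:
            return "small", offset // SMALL_BLOCK, offset % SMALL_BLOCK
        return "large", 0, offset - NUM_SMALL * SMALL_BLOCK

    print(locate(10000))       # ('small', 2, 1808)
    print(locate(100000))      # ('large', 0, 34464)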

37
Frangipani Dealing with Failures
  • Write-ahead redo logging of metadata; user data
    is not logged
  • Each Frangipani server has its own private log
  • Only after a log record is written to Petal does
    the server modify the actual metadata in its
    permanent locations
  • If a server crashes, the system detects the
    failure and another server uses the log to
    recover
  • Because the log is on Petal, any server can get
    to it.

38
Frangipani Synchronization and Coherence
  • Frangipani has a lock for each log segment,
    allocation bitmap segment, and each file
  • Multiple-reader/single-writer locks. In case of
    conflicting requests, the owner of the lock is
    asked to release or downgrade it to remove the
    conflict
  • A read lock allows a server to read data from
    disk and cache it. If server is asked to release
    its read lock, it must invalidate the cache entry
    before complying
  • A write lock allows a server to read or write
    data and cache it. If a server is asked to
    release its write lock, it must write dirty data
    to disk and invalidate the cache entry before
    complying. If a server is asked to downgrade the
    lock, it must write dirty data to disk before
    complying

39
Frangipani Lock Service
  • Fully distributed lock service for fault
    tolerance and scalability
  • How to release locks owned by a failed Frangipani
    server?
  • The failure of a server is discovered when its
    lease expires. A lease is obtained by the
    server when it first contacts the lock service.
    All locks acquired are associated with the lease.
    Each lease has an expiration time (30 seconds)
    after its creation or last renewal. A server
    must renew its lease before it expires
  • When a server fails, the locks that it owns
    cannot be released until its log is processed and
    any pending updates are written to Petal

40
Frangipani Performance
41
Frangipani Performance
42
Frangipani Scalability
43
Frangipani Scalability
44
Frangipani Scalability
45
xFS (Context and Motivation)
  • A serverless network file system that works over
    a cluster of cooperating workstations
  • Moving away from a central FS is motivated by three
    factors
  • hardware opportunity: fast switched LANs provide
    aggregate bandwidth that scales with the number
    of machines in the network
  • user demand is increasing, e.g., multimedia
  • limitations of the central FS approach
  • limited scalability
  • expensive
  • replication for availability increases complexity
    and operation latency

46
xFS (Contributions and Limitations)
  • A well-engineered approach that takes advantage
    of several research ideas: RAID, LFS, cooperative
    caching
  • A truly distributed network file system (no
    central bottleneck)
  • control processing is distributed across the system
    at per-file granularity
  • storage is distributed using software RAID and
    log-based network striping (Zebra)
  • cooperative caching uses portions of client
    memory as a large, global file cache
  • Limitation: requires machines to trust each other

47
RAID in xFS
  • RAID partitions a stripe of data into N-1 data
    blocks and a parity block (the exclusive-OR of
    the corresponding bits of the data blocks); see
    the sketch below
  • Data and parity blocks are stored on different
    storage servers
  • Provides both high bandwidth and fault tolerance
  • Traditional RAID drawbacks
  • multiple accesses for small writes
  • hardware RAID is expensive (special hardware to
    compute parity)
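
The parity computation is just a bytewise XOR across the data blocks of a stripe; the hypothetical Python sketch below shows how any single missing block can be rebuilt from the surviving blocks plus the parity block.

    # Sketch of RAID-style parity: parity = XOR of the N-1 data blocks,
    # so any single lost block can be rebuilt from the others + parity.

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]        # N-1 = 3 data blocks
    parity = xor_blocks(data)

    # lose data[1], then reconstruct it from the surviving blocks and parity
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]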

48
LFS in xFS
  • High-performance writes: writes are buffered in memory
    and written to disk in large, contiguous,
    fixed-size groups called log segments
  • Writes are always appended to the log
  • An imap is used to locate i-nodes; it is stored in
    memory and periodically checkpointed to disk
  • Simple recovery procedure: get the last
    checkpoint and then roll forward, reading the
    later segments in the log and updating the imap and
    i-nodes
  • Free disk space management through a log cleaner:
    it coalesces old, partially empty segments into a
    smaller number of full segments -> the cleaning
    overhead can sometimes be large

49
Zebra
  • Combines LFS and RAID: LFS's large writes make
    writes to the network RAID efficient
  • Implements RAID in software
  • Writes are coalesced into a private per-client log
  • Log-based striping
  • a log segment is split into log fragments, which are
    striped over the storage servers
  • parity fragment computation is local (no network
    access)
  • Deltas stored in the log encapsulate
    modifications to file system state that must be
    performed atomically - used for recovery

50
Metadata and Data Distribution
  • A centralized FS stores all data blocks on its
    local disks
  • manages location of metadata
  • maintains a central cache of data blocks in its
    memory
  • manages cache consistency metadata that lists
    which clients in the system are caching each
    block (unlike NFS)

51
xFS Metadata and Data Distribution
  • Stores data on storage servers
  • Splits metadata management among multiple
    managers that can dynamically alter the mapping
    from a file to its manager
  • Uses cooperative caching that forwards data among
    client caches under the control of the managers
  • The key design challenge: how to locate data and
    metadata in such a completely distributed system

52
xFS Data Structures
53
Manager Map
  • Allows clients to determine which manager to
    contact for a file
  • The manager map is globally replicated (it is small)
  • Two translations are necessary to allow manager
    remapping (see the sketch below)
  • external file name -> file index number
    (directory)
  • index number -> manager (manager map)
  • The manager map can also be used for coarse-grained
    workload balancing among managers
  • The file manager controls disk location metadata
    (imap + i-node) and cache consistency state (the list
    of clients caching each block, or which client has
    ownership for writing)
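
The two translations can be shown with a minimal Python sketch (hypothetical names and structures): the directory turns a file name into an index number, and the globally replicated manager map turns the index number into the responsible manager.

    # Sketch of xFS's two-step lookup:
    #   file name -> index number (directory)
    #   index number -> manager   (globally replicated manager map)

    directory   = {"/home/a/paper.tex": 1042, "/home/b/data.bin": 7}
    manager_map = ["mgr0", "mgr1", "mgr2", "mgr3"]     # small, replicated everywhere

    def manager_for(path):
        index = directory[path]                        # translation 1
        return manager_map[index % len(manager_map)]   # translation 2

    print(manager_for("/home/a/paper.tex"))            # 'mgr2'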

54
Read Operation
55
Write Operation
  • Clients buffer writes in their local memory until
    they are committed to a stripe group of storage servers
  • Since xFS uses LFS, a write changes the disk
    address of the modified block
  • After a client commits a segment to a storage
    server, it notifies the modified blocks' managers
    so they can update their index nodes and imaps
  • Index nodes and data blocks do not have to be
    committed simultaneously, because in Zebra the
    client's log includes a delta that allows
    reconstruction of the managers' data structures
    in the event of a crash

56
Cache Consistency
  • Per-block rather than per-file
  • Ownership-based, similar to a DSM scheme (see the
    sketch below)
  • To modify a block, a client must get ownership
    from the manager
  • The manager invalidates any other cached copies
    of the block, then gives write permission
    (ownership) to the client
  • Ownership can be revoked by the manager
  • The manager keeps the list of clients caching each
    block
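
A minimal Python sketch of the per-block ownership protocol (hypothetical code; the manager-side bookkeeping only): before granting write ownership, the manager invalidates all other cached copies and records the new owner.

    # Sketch of per-block, ownership-based cache consistency.
    # The manager tracks which clients cache each block and who owns it.

    cachers = {}    # block -> set of clients caching it
    owner   = {}    # block -> client holding write ownership

    def read_block(client, block):
        cachers.setdefault(block, set()).add(client)

    def request_ownership(client, block):
        """Manager invalidates other copies, then grants write permission."""
        for other in cachers.get(block, set()) - {client}:
            invalidate(other, block)
        cachers[block] = {client}
        owner[block] = client

    def invalidate(client, block):
        print(f"invalidate block {block} at client {client}")

    read_block("c1", 42)
    read_block("c2", 42)
    request_ownership("c1", 42)     # prints: invalidate block 42 at client c2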

57
Log cleaner in xFS
  • Distributed
  • Relies on utilization status, which is also
    distributed: it is maintained by the client that
    wrote each segment
  • A leader in each group initiates cleaning and
    decides which cleaners should clean the stripe
    group's segments
  • Each cleaner receives a subset of segments to
    clean
  • Cleaners use optimistic concurrency control to
    resolve conflicts between cleaner updates and
    normal writes
  • In case of a conflict (because a client is
    writing a block as it is cleaned), the manager
    ensures that the client's update takes precedence
    over the cleaner's update

58
xFS Performance
59
xFS Performance