Title: Chapter 11: File System Implementation
1Chapter 11 File System Implementation
2 Chapter 11 File System Implementation
- File-System Structure
- File-System Implementation
- Directory Implementation
- Allocation Methods
- Free-Space Management
- Efficiency and Performance
- Recovery
- Log-Structured File Systems
- NFS
- Example WAFL File System
3Objectives
- To describe the details of implementing local file systems and directory structures
- To describe the implementation of remote file systems
- To discuss block allocation and free-block algorithms and trade-offs
4File-System Structure
- File structure
- Logical storage unit
- Collection of related information
- File system resides on secondary storage (disks)
- File system organized into layers
- File control block (FCB): storage structure consisting of information about a file
- File control blocks reside on disk and are copied into memory
5Layered File System
6A Typical File Control Block
- Used by the logical file system; includes metadata about the file
- Sometimes called an inode (or a vnode in VFS)
7Other File System Structures
- In-memory partition table
- Contains information about each mounted partition
- In-memory directory structure
- Contains copies of recently accessed directories
- System-wide open file table
- Contains a copy of the FCB for each open file, plus other information
- Per-process open file table
- Contains pointers to the appropriate entries in the system-wide open file table
8Initial File Operations
- To create a file
- The logical file system (LFS) allocates a new FCB, reads the appropriate directory into memory, updates it, and forces it back out to disk
- Some file systems (e.g., Unix) treat a directory just like a file
- i.e., when creating a directory, allocate an FCB / inode / vnode for it, and set a bit to indicate it is a directory
- Other file systems (e.g., Windows) use a different kind of structure
- To open a file (so it can be used)
- Pass the OS the filename; it is looked up in the directory structure
- Copy the file's FCB from disk into memory, into the system-wide open file table (or increment the number of processes that have that file open)
- Update the per-process file table to point to the entry in the system-wide open file table (SWOFT), etc.
- Return to the caller a pointer (file descriptor / file handle) to the appropriate entry in the per-process file table
- Caller uses this handle for all I/O to the file
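The open steps above can be sketched in a few lines. This is only an illustration, not how any particular kernel does it: the class names, the dictionary standing in for the on-disk directory, and the use of a list index as the file descriptor are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class OpenFileEntry:
    fcb: dict            # in-memory copy of the file control block
    open_count: int = 0  # number of opens that share this entry


class OpenFileTables:
    def __init__(self, directory):
        self.directory = directory  # hypothetical "disk": filename -> FCB
        self.system_wide = {}       # filename -> OpenFileEntry
        self.per_process = {}       # pid -> list of filenames, indexed by fd

    def open(self, pid, name):
        entry = self.system_wide.get(name)
        if entry is None:
            # First open: look the name up and copy its FCB into memory.
            entry = OpenFileEntry(fcb=dict(self.directory[name]))
            self.system_wide[name] = entry
        entry.open_count += 1       # otherwise just count one more opener
        table = self.per_process.setdefault(pid, [])
        table.append(name)          # per-process entry refers to the SWOFT entry
        return len(table) - 1       # the file descriptor handed to the caller
```

Two processes opening the same file each get their own descriptor, but share one system-wide entry whose count becomes 2.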
9In-Memory File System Structures
Opening a file
Reading a file
10Virtual File Systems
- Virtual File System (VFS): a common standard that provides an object-oriented way of implementing file systems (common abstraction layer).
- Separates generic file system operations from their implementations by defining a VFS interface (set of APIs)
- VFS allows the same system call interface (the API) to be used for different types of file systems.
- Invoke the generic API on the VFS interface, rather than a special API for each specific type of file system.
- The VFS is based on a file representation structure called a vnode
- Contains a numerical designator for a network-wide unique file (or directory)
- Kernel maintains one vnode structure for each active node (file or directory)
11Schematic View of Virtual File System
The VFS hides file system type and whether file
systems are local or remote
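A rough sketch of the idea: callers use one generic interface, and the VFS layer dispatches to whichever file-system type is mounted at that path. Every name here (`FileSystem`, `LocalFS`, `RemoteFS`, `VFS`) is hypothetical, and the "remote" file system is just a callable standing in for an RPC.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Generic operations every concrete file system must implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class LocalFS(FileSystem):
    def __init__(self, files):
        self.files = files              # path -> contents

    def read(self, path):
        return self.files[path]


class RemoteFS(FileSystem):
    def __init__(self, rpc):
        self.rpc = rpc                  # callable standing in for a network call

    def read(self, path):
        return self.rpc(path)


class VFS:
    def __init__(self):
        self.mounts = {}                # mount prefix -> FileSystem

    def mount(self, prefix, fs):
        self.mounts[prefix] = fs

    def read(self, path):
        # Pick the longest mount prefix that matches, then delegate.
        prefix = max((p for p in self.mounts if path.startswith(p)), key=len)
        return self.mounts[prefix].read(path[len(prefix):])
```

The caller of `VFS.read` cannot tell from the call itself whether the file is local or remote.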
12Secondary Storage Where Files Reside
- Secondary storage: an extension of system storage that provides a large, non-volatile area of storage
- Today magnetic disks; formerly magnetic tape (or cards)
- Fixed head / movable head
- Fixed / removable / RAM
- Platter / cylinder / track / sector
- Drive / controller / subsystem
- Floppies / drums, even CDs / DVDs / thumb drives / USB drives
- Characteristics
- Permanent
- Random-access
- Reusable
- Data stored in files looks like a large contiguous address space
13Disk Structure
14Magnetic Disk Structure
15Logical Disk Structure / Addressing
- Can be viewed as an array of blocks (sectors), like tape
- A mapping scheme exists to map from logical block to physical address (track and sector) (IMPORTANT POINT)
- Sometimes block size = sector size = page size (512 bytes)
- Smallest storage allocation area is a block
- Storage allocated by block
- Internal fragmentation within a block
- Often each disk has a directory (VTOC)
- Exists on disk itself
- Contains information about files on disk
- Filename / date(s)
- Address / length
- Owner / security
- Some systems use other techniques (single-level store)
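One classic form of the logical-to-physical mapping is sketched below, assuming a fixed geometry where every track holds the same number of sectors. The function name and the geometry values used with it are illustrative only; real drives hide their geometry behind the controller.

```python
def lba_to_chs(lba, heads, sectors_per_track):
    """Map a logical block number to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads * sectors_per_track)
    head, sector = divmod(rest, sectors_per_track)
    return cylinder, head, sector + 1   # sectors conventionally count from 1
```

With 16 heads and 63 sectors per track, block 0 is the first sector of the first track, and block 63 is the first sector under the next head.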
16Device Directory Implementation
- Linear list of file names with pointers to the data blocks.
- Simple to program
- Time-consuming to execute
- Can try to improve access by caching, sorting, other techniques
- The difficulty is that this directory is on disk, and is not easy to expand, contract, etc.
- Possible option: hash table (linear list with hash data structure).
- Decreases directory search time
- BUT must handle collisions: situations where two file names hash to the same location
- Hash table usually fixed size
- Limited size and collisions can impact performance
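A minimal sketch of a fixed-size hashed directory that resolves collisions by chaining entries within a slot. The class name, the toy hash (sum of the name's bytes), and the table size are assumptions chosen to make collisions easy to see.

```python
class HashDirectory:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]   # fixed-size table

    def _slot(self, name):
        # Toy hash: names with the same letters collide on purpose.
        return sum(name.encode()) % len(self.buckets)

    def add(self, name, first_block):
        bucket = self.buckets[self._slot(name)]
        for i, (existing, _) in enumerate(bucket):
            if existing == name:
                bucket[i] = (name, first_block)    # overwrite existing entry
                return
        bucket.append((name, first_block))         # chain on collision

    def lookup(self, name):
        for existing, block in self.buckets[self._slot(name)]:
            if existing == name:
                return block
        return None
```

Lookup only scans one slot's chain, but when many names land in the same slot the search degrades toward the linear list.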
17File Allocation Methods
- File is a logical unit of storage
- Collection of related information
- Exists at some stage in main memory, but stored permanently on mass storage (disk / tape)
- May be stored on disk in a variety of ways
- An allocation method refers to how disk blocks are allocated for files on disk
- Contiguous allocation
- Linked allocation
- Indexed allocation
- Look at implementation and advantages/disadvantages of each
18Contiguous Allocation
- Each file occupies a set of contiguous blocks on the disk.
- Simple: only need starting location (block #) and length (number of blocks)
- Random access is quick and easy
- Files cannot grow, so must create extra large (results in internal fragmentation)
- Wasteful of space: external fragmentation
- Allocate by first fit / best fit / worst fit / etc.
- Compaction sometimes necessary
- Like an array data structure
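A small sketch of first-fit allocation over a list of free runs, plus the array-style address calculation that makes random access cheap. The `(start, length)` representation of free space and both function names are assumptions for illustration.

```python
def first_fit(holes, n):
    """Allocate n contiguous blocks from the first hole that is big enough.

    holes is a list of (start, length) free runs and is updated in place.
    Returns the starting block, or None if no single hole is large enough.
    """
    for i, (start, length) in enumerate(holes):
        if length >= n:
            if length == n:
                holes.pop(i)                        # hole used up exactly
            else:
                holes[i] = (start + n, length - n)  # shrink the hole
            return start
    return None


def block_of(start, length, k):
    """Disk block holding logical block k of a contiguous file."""
    if not 0 <= k < length:
        raise IndexError("logical block outside the file")
    return start + k
```

Note that a request can fail even when the total free space is sufficient, because no single hole is large enough: that is external fragmentation.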
19Contiguous Allocation of Disk Space
20Extent-Based Systems
- Some newer file systems (e.g., Veritas File System) use a modified contiguous allocation scheme.
- Extent-based file systems allocate disk blocks in extents.
- An extent is a contiguous run of disk blocks. Extents are allocated for file allocation. A file consists of one or more extents.
- Extents are also handy in saving directory space
- OS/400 / i5/OS has used variable-sized extents for 30 years, along with other innovative techniques
- Will discuss later
21Linked Allocation
- Each file is a linked list of disk blocks
- Blocks in a file may be scattered anywhere on the disk
- Each block contains a pointer (address) to the next block in the file
- Like linked-list data structure
22Linked Allocation (Cont.)
- Files created easily, can grow easily
- Simple: need only starting address
- No external fragmentation, no compaction
- Potential difficulties
- Random access
- Reliability
- Pointer space required in each block (no longer matches page size)
- Addressed somewhat by not chaining individual blocks together, but rather chaining clusters of blocks together
- Cluster size important in internal fragmentation
- Variant: FAT, used by DOS, early Windows, OS/2 (like a linked list)
- File Allocation Table entry contains both ptr to data block on disk, and ptr to next FAT entry for file
- Also improves random access performance, if FAT cached
23Linked Allocation
24File-Allocation Table
Directory entry points to FAT entry. FAT has chain of addresses of disk blocks in file. Usually clusters, not blocks. If FAT is damaged, can lose pieces of file.
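Following a FAT chain can be sketched as below, assuming the table is held in memory as a mapping from block number to next block number. The end-of-chain marker `END` and the block numbers in the usage example are made up for illustration.

```python
END = -1   # assumed end-of-chain marker


def file_blocks(fat, first):
    """Return the file's blocks in order by walking the FAT chain."""
    blocks, seen = [], set()
    block = first
    while block != END:
        if block in seen:
            raise ValueError("damaged FAT: chain loops back on itself")
        seen.add(block)
        blocks.append(block)
        block = fat[block]      # the FAT entry names the next block
    return blocks
```

If the FAT is cached, reaching logical block k needs k table lookups in memory rather than k disk reads, which is why the cached FAT helps random access.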
25Indexed Allocation
- Each file has an index block, which contains pointers to all blocks in the file
- Random access quick, easy
- Easy to expand file
- No external fragmentation
- BUT need extra space for index tables: one table per file, internal fragmentation in index block
- Like table of pointers/references to data structures/objects
- Logical view: index table
26Example of Indexed Allocation
27What if Index Table Gets Full?
- Several possibilities
- Link to another index block
- No theoretical limit; may be physical limit
- Multi-level directories
- Multiple levels of index blocks
- Combinations
- Especially to address performance
- e.g., inode structure
28Indexed Allocation Mapping (Cont.)
[Figure: multi-level index. The outer-index points to index tables, which point to the blocks of the file.]
29Combined Scheme: UNIX (4K bytes per block)
This structure is an inode / vnode
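To see how far the combined scheme reaches, here is a small calculation. Only the 4K block size comes from the slide; the 12 direct pointers, the 4-byte pointer size, and the single / double / triple indirect levels are assumptions in the style of a classic UNIX inode.

```python
BLOCK = 4096                  # from the slide: 4K bytes per block
PTR = 4                       # assumption: 4-byte block pointers
DIRECT = 12                   # assumption: 12 direct pointers in the inode
PER_BLOCK = BLOCK // PTR      # pointers that fit in one index block (1024)


def level_for(n):
    """Which part of the inode reaches logical block n of the file."""
    if n < DIRECT:
        return "direct"
    n -= DIRECT
    if n < PER_BLOCK:
        return "single indirect"
    n -= PER_BLOCK
    if n < PER_BLOCK ** 2:
        return "double indirect"
    n -= PER_BLOCK ** 2
    if n < PER_BLOCK ** 3:
        return "triple indirect"
    raise ValueError("block number beyond maximum file size")


MAX_BLOCKS = DIRECT + PER_BLOCK + PER_BLOCK ** 2 + PER_BLOCK ** 3
MAX_BYTES = MAX_BLOCKS * BLOCK
```

Under these assumptions small files are reached with no extra index reads, while the triple indirect level pushes the maximum file size a little past 4 TB.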
30Free Space Management
- Another question related to disk space allocation
- HOW DO WE KEEP TRACK OF AND MANAGE THE FREE SPACE ON DISK?
- Need free space list / free space directory
- Then how do we implement it?
31Free-Space Management: Bit Map
- Simplest free-space (FS) directory is a bit map: one bit for each block
- Simple, fast, easy to find contiguous space
- BUT can take up much space, especially since it must be kept in mainstore to be very efficient
- Block size = $2^{9}$ = 512 bytes
- Disk size = $2^{30}$ = 1 GB
- Bitmap size = # blocks / # bits per block
- $(2^{30}/2^{9}) / (2^{9} \cdot 2^{3}) = 2^{21}/2^{12} = 2^{9}$ blocks = ¼ MB per GB just for the bitmap
- Problem similar (but worse) for larger disks
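The arithmetic above, plus a simple scan for a free block, can be checked with a short sketch. The function names are hypothetical; the convention that a 1 bit means "allocated" and the least-significant-bit-first ordering within each byte are assumptions.

```python
def bitmap_blocks(disk_bytes, block_bytes):
    """Number of disk blocks needed to hold one bit per disk block."""
    n_blocks = disk_bytes // block_bytes
    bits_per_block = block_bytes * 8
    return -(-n_blocks // bits_per_block)      # ceiling division


def first_free(bitmap):
    """Index of the first 0 bit (free block), or None if all are allocated."""
    for byte_index, byte in enumerate(bitmap):
        if byte != 0xFF:                       # skip fully allocated bytes
            for bit in range(8):
                if not byte & (1 << bit):
                    return byte_index * 8 + bit
    return None
```

For a 1 GB disk with 512-byte blocks, `bitmap_blocks` gives 512 blocks, i.e. 256 KB of bitmap.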
32Other Problems with Bit Maps
- Have multiple copies
- One copy in memory, for quick access
- One copy on disk, to maintain permanent state
- Must keep copies in sync
- Difficult to make updating both copies atomic
- So, practically, the copies in memory and on disk may differ.
- BUT cannot have a situation where the in-memory copy says a block is allocated but the on-disk copy says it is not
- Runtime solution
- Set bit[i] = 1 on disk
- Allocate block[i]
- Set bit[i] = 1 in memory
- But takes time
- AND what happens if crash?
33Free-Space Management: Linked List
- Maintain a pointer in each free block to point to the next free block
- Must also maintain a pointer to head of free space list
- This must be kept on disk, so permanent
- Little wasted space
- Cannot get contiguous space easily, and sometimes this is required
- SLOW: requires substantial I/O time to traverse list
- Cannot take advantage of fact that most systems can read / write multiple contiguous blocks at once
- Unless free space links are to sets of blocks, etc., and then can have fragmentation
34Linked Free Space List on Disk
35Free-Space Management: Other Techniques
- Grouping: free-space (FS) directory in a series of sectors / blocks
- Store blocks of addresses of FS blocks all in same sector
- Like an index block
- Last address points to another block of addresses
- Can find large number of blocks of FS quickly with little I/O
- Counting: to improve on grouping technique above
- Since FS blocks tend to occur in groups
- Keep address of first free block in group, plus number of contiguous following free blocks
- Requires larger FS list entries, but list will be shorter
- Contiguous space will be easier to find
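The counting technique can be sketched as a conversion from a plain list of free block numbers into runs. For simplicity this sketch stores the total length of each run (the first block plus the ones that follow it), which is an assumption about the entry format; the function names are hypothetical.

```python
def to_runs(free_blocks):
    """Collapse free block numbers into (first_block, run_length) entries."""
    runs = []
    for block in sorted(free_blocks):
        if runs and block == runs[-1][0] + runs[-1][1]:
            first, length = runs[-1]
            runs[-1] = (first, length + 1)     # extends the current run
        else:
            runs.append((block, 1))            # starts a new run
    return runs


def find_contiguous(runs, n):
    """First block of a run with at least n free blocks, or None."""
    for first, length in runs:
        if length >= n:
            return first
    return None
```

Eleven free blocks in three groups become three entries, and a request for contiguous space is answered by scanning those entries rather than the disk.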
36Efficiency and Performance
- Efficiency dependent on
- Disk allocation and directory algorithms
- Types of data kept in a file's directory entry
- Location of directories on disk
- Performance
- Disk cache: separate section of main memory for frequently used blocks
- Also, caches in many controllers today
- Free-behind and read-ahead: techniques to optimize sequential access
- Remove page from buffer when next page requested
- Read in several pages past page requested
- Improve performance by dedicating section of memory as virtual disk, or RAM disk.
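Free-behind and read-ahead can be sketched together for a purely sequential reader. The class name, the `read_block` callback standing in for a disk read, and the read-ahead depth of 2 are assumptions; end-of-file handling is left out.

```python
class SequentialCache:
    def __init__(self, read_block, read_ahead=2):
        self.read_block = read_block    # callable: block number -> data
        self.read_ahead = read_ahead    # how many blocks to prefetch
        self.buffer = {}                # block number -> data

    def read(self, n):
        # Free-behind: the previous block will not be needed again.
        self.buffer.pop(n - 1, None)
        # Read-ahead: fetch the requested block and the next few after it.
        for k in range(n, n + 1 + self.read_ahead):
            if k not in self.buffer:
                self.buffer[k] = self.read_block(k)
        return self.buffer[n]
```

After the first request, each further sequential read triggers only one new disk read, and the buffer never holds more than the current block plus the prefetched ones.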
37Various Disk-Caching Locations
38Page Cache
- Have been discussing caching blocks from disk
- Now discuss what happens after the blocks become pages in memory
- A page cache caches pages rather than disk blocks, using virtual memory techniques
- Note: this is a cache of pages; the actual pages that will be used are elsewhere in memory, referenced by the page table, etc.
- Memory-mapped I/O uses a page cache
- Routine I/O through the file system uses the buffer (disk) cache
39I/O Without a Unified Buffer Cache
Memory-mapped I/O goes first to the page cache and then to the common buffer cache. Regular I/O goes directly to the buffer cache. The buffer cache then goes to the file system.
40Unified Buffer Cache
- Note: multiple caches result in cache coherency concerns if the same data is in both caches
- Also, there is the problem of having to cache data for memory-mapped I/O twice
- Have to move much data
- Have to allocate twice as much space in main memory
- A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O.
41I/O Using a Unified Buffer Cache
Here, both memory-mapped I/O and regular I/O go directly to the same buffer cache. The buffer cache then goes to the file system.
42File and Data Recovery
- Consistency checking: compares data in directory structure with data blocks on disk, and tries to fix inconsistencies.
- Use system programs to back up data from disk to another storage device (tape, CD, DVD, SAN, library, etc.).
- Need organized backup plan
- Recover lost file or disk by restoring data from backup.
- Often tradeoffs: minimizing backup or recovery time
- Backups more frequent than recoveries
- So optimize performance for most frequent (backups)
- Most recoveries are not disaster recoveries; rather, single files
- However, must also be able to handle disaster recoveries
- Performance critical in disaster recoveries
- Need to test backups, replace media, disaster recovery processes/plans, etc.
43Log Structured File Systems
- Log structured (or journaling) file systems record each update to the file system as a transaction.
- All transactions are written to a log.
- A transaction is considered committed once it is written to the log.
- However, the file system may not yet be updated.
- The transactions in the log are written to the file system.
- May be asynchronous or may be forced to be synchronous (and appear atomic)
- When the file system is modified again, some systems remove the previous transaction from the log, other systems leave the log as a record/backup.
- If the file system crashes, all remaining transactions in the log must still be performed.
- So, keep log on separate disk / separate file system
- However, if system crashes, may leave filesystem in unknown state, unless logs have been forced out
- Most journaled file systems only journal changes to metadata, NOT changes to data in files
- NTFS, JFS, ext3, ReiserFS, UFS
- iSeries also journals changes to data!
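A toy sketch of the commit / apply / recover cycle described above. The class and its in-memory dictionaries are stand-ins for the on-disk log and file-system structures; each transaction records absolute new values, which is an assumption that makes replay after a crash safe to repeat.

```python
class JournaledStore:
    def __init__(self):
        self.log = []     # committed transactions not yet applied
        self.data = {}    # stands in for the on-disk file-system structures

    def commit(self, updates):
        # The transaction counts as committed once it is in the log,
        # even though the file system itself is not yet updated.
        self.log.append(dict(updates))

    def checkpoint(self):
        # Apply each logged transaction first, and only then remove it.
        while self.log:
            self.data.update(self.log[0])
            self.log.pop(0)

    def recover(self):
        # After a crash, every transaction still in the log is performed.
        self.checkpoint()
```

A crash between `commit` and `checkpoint` loses nothing, because `recover` replays whatever is still in the log.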
44Can Also Use Journals for Backup / HA
- Possible if also journaling data in a filesystem
- Have two physical computer systems, a local (primary) and remote (backup)
- Remote system also has copy of critical data (or database)
- When data changes on the local system, in addition to generating a local journal of changes, also send a copy of all journal entries to the remote system
- Can be done either by system or application software
- Remote system then applies changes to its copy of data
- If primary system fails, can switch to backup system and backup copy of data will be up to date
45Network File System (NFS)
- Network file systems are common
- Use NFS as an example
- An implementation and a specification of a software system for accessing remote files across LANs (or WANs).
- Available in implementations for many operating systems and architectures
- Ubiquitous: expect to be available everywhere
- Both clients and servers (at least expect NFS clients)
- Details of implementations may vary
- Text looks at Sun's implementation
- Applicable to more general implementations as well
46NFS (Cont.)
- NFS views interconnected workstations as a set of independent machines with independent file systems
- Goal is to allow sharing among these file systems in a transparent manner
- A remote directory is mounted over a local file system directory
- To the user, the mounted directory looks like a local mount: like an integral subtree of the local file system, replacing the subtree descending from the local directory
- Specification of the remote directory for the mount operation is nontransparent
- This means the host name of the remote directory has to be provided
- After the remote directory is mounted, however, files in it can then be accessed in a transparent manner using the VFS interface
- Subject to access-rights accreditation, potentially any file system (or directory within a file system) can be mounted remotely on top of any local directory (local mount point)
47NFS (Cont.)
- NFS is designed to operate in a heterogeneous environment of different machines, operating systems, and network architectures
- The NFS specification is independent of these media
- This independence is achieved through the use of RPC primitives built on top of an External Data Representation (XDR) protocol used between two implementation-independent interfaces
- Standard interfaces, standard data representations, standard RPC primitives
- The NFS specification distinguishes between the services provided by a mount mechanism and the actual remote-file-access services
- i.e., the user must know about the remote system to mount the remote file system
- But once the file system is mounted, the user does not really know or care
48Three Independent File Systems
Three independent file systems must be mounted
in order to use
49Mounting in NFS
The new /dir1 is mounted over the old /dir1, covering it up and making it inaccessible until the new /dir1 is unmounted.
Mounts
Cascading mounts
50NFS Mount Protocol
- Establishes initial logical connection between server and client
- Mount operation includes name of remote directory to be mounted and name of server machine storing it
- Mount request is mapped to corresponding RPC and forwarded to mount server running on server machine
- Export list: specifies local file systems that server exports for mounting, along with names of machines that are permitted to mount them
- Following a mount request that conforms to its export list, the server returns a file handle: a key for further accesses
- File handle: a file-system identifier, and an inode number to identify the mounted directory within the exported file system
- The mount operation changes only the user's view and does not affect the server side
- In user's view, under VFS, looks just like a locally mounted file system
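The export-list check and the returned file handle can be sketched as follows. This is not the real mount protocol or its wire format: the `FileHandle` tuple, the `EXPORTS` table, the host names, and the `mount` function are all hypothetical stand-ins.

```python
from typing import NamedTuple


class FileHandle(NamedTuple):
    fsid: int     # file-system identifier
    inode: int    # inode number of the mounted directory


# Hypothetical export list: directory -> identifiers and permitted clients.
EXPORTS = {
    "/export/home": {"fsid": 1, "inode": 2, "clients": {"alpha", "beta"}},
}


def mount(exports, client, directory):
    """Server side: return a file handle if the request matches the export list."""
    entry = exports.get(directory)
    if entry is None or client not in entry["clients"]:
        raise PermissionError(f"{client} may not mount {directory}")
    return FileHandle(entry["fsid"], entry["inode"])
```

The client keeps the returned handle and presents it on later requests; nothing on the server changes as a result of the mount.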
51NFS Protocol
- Provides a set of remote procedure calls for remote file operations. The procedures support the following operations
- Searching for a file within a directory
- Reading a set of directory entries
- Manipulating links and directories
- Accessing file attributes
- Reading and writing files
- NFS servers are stateless: each request has to provide a full set of arguments (NFS V4 is just becoming available: quite different, stateful)
- Modified data must be committed to the server's disk before results are returned to the client (lose advantages of caching)
- The NFS protocol does not provide concurrency-control mechanisms
- Program must handle (e.g., by byte-range locking)
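What "stateless" means for a read can be shown with a small sketch: the server keeps no open-file table or current position, so each request carries the handle, offset, and count. The function names and the dictionary standing in for server storage are assumptions, not the actual NFS procedures.

```python
def nfs_read(files, handle, offset, count):
    """Server side: everything needed arrives in the request itself."""
    return files[handle][offset:offset + count]


def read_all(files, handle, chunk=4):
    """Client side: the client, not the server, remembers the offset."""
    offset, out = 0, b""
    while True:
        data = nfs_read(files, handle, offset, chunk)
        if not data:
            return out
        out += data
        offset += len(data)
```

Because each request is self-contained, a server that restarts between two reads can answer the second one without knowing anything about the first.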
52Three Major Layers of NFS Architecture
- UNIX / Linux / POSIX file-system interface (based on the open, read, write, and close calls, and file descriptors)
- Virtual File System (VFS) layer: distinguishes local files from remote ones, and local files are further distinguished according to their file-system types
- The VFS activates file-system-specific operations to handle local requests according to their file-system types
- Calls the NFS protocol procedures for remote requests
- NFS service layer: bottom layer of the architecture
- Implements the NFS protocol
53Schematic View of NFS Architecture
54NFS Path-Name Translation
- Performed by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode
- To make lookup faster, a directory name lookup cache on the client's side holds the vnodes for remote directory names
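Component-by-component translation with a client-side name cache can be sketched like this. The `lookup` callback stands in for the NFS lookup RPC, and the vnode values are opaque placeholders; both are assumptions for illustration.

```python
def translate(path, root, lookup, cache):
    """Resolve path one component at a time, starting from the root vnode.

    lookup(dir_vnode, name) stands in for one NFS lookup call.
    cache maps (dir_vnode, name) -> vnode and is filled as we go.
    """
    vnode = root
    for name in filter(None, path.split("/")):   # skip empty components
        key = (vnode, name)
        if key not in cache:
            cache[key] = lookup(vnode, name)     # one remote call per miss
        vnode = cache[key]
    return vnode
```

Resolving `/usr/local` costs two lookups; resolving `/usr/bin` afterwards costs only one more, because the `usr` component is already cached.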
55NFS Remote Operations
- Nearly one-to-one correspondence between regular UNIX / Linux system calls and the NFS protocol RPCs (except opening and closing files)
- NFS adheres to the remote-service paradigm, but employs buffering and caching techniques for the sake of performance
- File-blocks cache: when a file is opened, the kernel checks with the remote server whether to fetch or revalidate the cached attributes
- Cached file blocks are used only if the corresponding cached attributes are up to date
- File-attribute cache: the attribute cache is updated whenever new attributes arrive from the server
- Clients do not free delayed-write blocks until the server confirms that the data have been written to disk
56Example WAFL File System
- Used on Network Appliance Filers: distributed file system appliances
- Write-anywhere file layout
- Serves up NFS, CIFS, http, ftp
- Random I/O optimized, write optimized
- NVRAM for write caching
- Similar to Berkeley Fast File System, with
extensive modifications
57The WAFL File Layout
58Snapshots in WAFL
59End of Chapter 11