Title: Disks and Files
1 Disks and Files
- Vivek Pai
- Princeton University
2 Gedankyou
- Imagine the following:
- A disk scheduling policy says: handle the request that is closest to where the disk head currently is
- On a system with lots of disk-intensive jobs, what problem can arise?
- What tweaks can avoid this problem?
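The policy above is usually called shortest seek time first (SSTF). A minimal sketch, assuming each request is just a cylinder number and one request is served per step; the names `sstf_pick` and `simulate` are hypothetical. It lets you watch what happens to a far-away request while nearby requests keep arriving.

```python
def sstf_pick(head, pending):
    """Return the pending cylinder closest to the head (ties go to the lower cylinder)."""
    return min(pending, key=lambda cyl: (abs(cyl - head), cyl))


def simulate(head, initial, arrivals):
    """Serve one request per step using SSTF.

    initial:  cylinders pending at time 0
    arrivals: arrivals[t] lists cylinders that arrive just before step t
    Returns the order in which cylinders were served.
    """
    pending = list(initial)
    served = []
    step = 0
    while pending or step < len(arrivals):
        if step < len(arrivals):
            pending.extend(arrivals[step])
        if pending:
            nxt = sstf_pick(head, pending)
            pending.remove(nxt)
            served.append(nxt)
            head = nxt          # the arm moves to the request just served
        step += 1
    return served


# Head at 50; one request far away at 199, and a steady trickle of nearby ones.
order = simulate(50, [51, 199], [[52], [53], [54], [55]])
print(order)
```

Here the request at cylinder 199 is served last, no matter how early it arrived, for as long as closer requests keep showing up.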
3 Why Files?
- Physical reality
- Block oriented
- Physical sectors
- No protection among users of the system
- Data might be corrupted if machine crashes
- Filesystem model
- Byte oriented
- Named files
- Users protected from each other
- Robust to machine failures
4 File Structures
- Byte sequence
- Read or write a number of bytes
- Unstructured or linear
- Record sequence
- Fixed or variable length
- Read or write a number of records
- Tree
- Records with keys
- Read, insert, delete a record (typically using a B-tree)
5 File Structures Today
- Stream of bytes
- Simplest to implement in kernel
- Easy to manipulate in other forms
- Little performance loss
- More complicated structures
- Hardware assist fell out of favor
- Special-purpose hardware slower, costly
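One reason a plain byte stream is "easy to manipulate in other forms": a record sequence can be layered on top of it in user code. A minimal sketch, assuming a hypothetical fixed-length record of a 4-byte id plus a 12-byte name; record `i` simply lives at byte offset `i * RECORD.size`.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte little-endian id + 12-byte name.
RECORD = struct.Struct("<I12s")


def write_record(stream, rec_id, name):
    """Append one record at the stream's current position (name is padded/truncated to 12 bytes)."""
    stream.write(RECORD.pack(rec_id, name.encode()))


def read_record(stream, index):
    """Read record number `index` by seeking to index * record size."""
    stream.seek(index * RECORD.size)
    data = stream.read(RECORD.size)
    if len(data) < RECORD.size:
        raise IndexError("no such record")
    rec_id, raw_name = RECORD.unpack(data)
    return rec_id, raw_name.rstrip(b"\0").decode()


f = io.BytesIO()          # stands in for an ordinary byte-stream file
write_record(f, 1, "alpha")
write_record(f, 2, "beta")
print(read_record(f, 1))
```

The kernel never needs to know about records; random access to record `i` is just a seek.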
6 File Types
- ASCII plain text
- A Unix executable file
- Header: magic number, sizes, entry point, flags
- Text (code)
- Data
- relocation bits
- symbol table
- Devices
- Everything else in the system
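The magic number at the start of the header is how the system tells an executable apart from arbitrary data. A minimal sketch that checks a small, illustrative subset of well-known magic values (ELF, `#!` scripts, DOS/Windows `MZ`); modern Unix executables use ELF rather than the classic header sketched on the slide, and the names `MAGIC` and `classify` are hypothetical.

```python
# A small, illustrative subset of well-known magic numbers.
MAGIC = [
    (b"\x7fELF", "ELF executable/object"),
    (b"#!", "interpreter script"),
    (b"MZ", "DOS/Windows executable"),
]


def classify(header: bytes) -> str:
    """Guess a file's type from the first bytes of its header."""
    for magic, kind in MAGIC:
        if header.startswith(magic):
            return kind
    return "unknown (treat as data)"


def classify_path(path: str) -> str:
    """Read just enough of the file to check its magic number."""
    with open(path, "rb") as f:
        return classify(f.read(16))
```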
7 So What Makes Filesystems Hard?
- Files grow and shrink in pieces
- Little a priori knowledge
- 6 orders of magnitude in file sizes
- Overcoming disk performance behavior
- Desire for efficiency
- Coping with failure
8 File System Components
- Disk management
- Arrange a collection of disk blocks into files
- Naming
- User gives a file name, not a track or sector number, to locate data
- Security
- Keep information secure
- Reliability/durability
- When the system crashes, we lose what is in memory, but we want files to be durable
[Figure: layers, top to bottom: User, File naming, File access, Disk management, Disk drivers]
9 Some Definitions
- File descriptor (fd): an integer used to represent a file; easier than using names
- Metadata: data about data; bookkeeping data used to eventually access the real data
- Open file table: system-wide list of descriptors in use
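On a POSIX system, Python's `os.open`, `os.read`, and `os.close` are thin wrappers over the system calls of the same name, so they show that a descriptor really is just an integer. A minimal sketch (the file name `demo.txt` is arbitrary): the name is looked up once at open time, and later reads use only the integer while the kernel tracks the file offset.

```python
import os
import tempfile

# Create a small file, then access it through a raw file descriptor.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(b"hello, disk")

    fd = os.open(path, os.O_RDONLY)   # name lookup happens once, here
    try:
        first = os.read(fd, 5)        # later calls pass only the integer
        rest = os.read(fd, 100)       # the kernel remembers the offset (5)
    finally:
        os.close(fd)

print(type(fd).__name__, first, rest)
```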
10 Kinds of Metadata
- inode: index node, a specific set of information kept about each file
- Two forms: on disk and in memory
- Directory: names and location information for files and subdirectories
- Note: stored in files in Unix
- Superblock: contains information describing the file system and disk layout
- Information about free blocks/inodes on disk
11 Contents of an Inode
- Disk inode
- File type, size, blocks on disk
- Owner, group, permissions (r/w/x)
- Reference count
- Times: last access, last modification, last inode change
- Inode generation number
- Padding, other stuff
- 128 bytes on classic Unix
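On a POSIX system, `os.stat` exposes most of the disk-inode fields listed above, apart from the block list. A minimal sketch; the dictionary labels are our own. On Unix `st_ctime` is the inode change time (Windows reports creation time there instead).

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w") as f:
        f.write("x" * 1000)
    st = os.stat(path)

    info = {
        "inode number": st.st_ino,
        "type": "regular" if stat.S_ISREG(st.st_mode) else "other",
        "permissions": stat.filemode(st.st_mode),   # e.g. '-rw-r--r--'
        "owner uid / gid": (st.st_uid, st.st_gid),
        "link (reference) count": st.st_nlink,
        "size in bytes": st.st_size,
        "last access": st.st_atime,
        "last modification": st.st_mtime,
        "last inode change": st.st_ctime,
    }

for key, value in info.items():
    print(f"{key:>24}: {value}")
```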
12 Directories in Unix
- Stored like regular files
- Contents are file names and inode numbers
- Names are nul-terminated strings
- Logic
- Separates file from location in tree
- File can appear in multiple places
- What are the drawbacks?
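Because a directory entry only maps a name to an inode number, one file can appear in several directories. A minimal sketch using `os.link` (POSIX hard links); the directory and file names are arbitrary. Both names report the same inode, and removing one name leaves the file reachable through the other.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "a"))
    os.mkdir(os.path.join(tmp, "b"))
    first = os.path.join(tmp, "a", "report.txt")
    second = os.path.join(tmp, "b", "copy-name.txt")

    with open(first, "w") as f:
        f.write("same bytes, two names")

    os.link(first, second)          # new directory entry, same inode

    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    links_after_link = os.stat(first).st_nlink      # reference count in the inode

    os.remove(first)                # drop one name; the file survives
    links_after_remove = os.stat(second).st_nlink
    with open(second) as f:
        text = f.read()
```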
13 Effects of Corruption
- inode: file gets damaged
- Maybe some free block gets viewed as file data
- Directory: lose files/directories
- Might get to read deleted files
- Superblock: can't figure out anything
- This is why we replicate the superblock
14 Data Structures for A Typical File System
[Figure: process control block with its open file pointer array → open file table (systemwide) → memory inode → disk inode]
15 Opening A File
fd = open(FileName, access)
- File name lookup and authentication
- Copy the file metadata into the in-memory data structure, if it is not in yet
- Create an entry in the open file table (system wide) if there isn't one
- Create an entry in the PCB
- Link up the data structures
- Return the fd to the user
[Figure: open() path: PCB → open file table → metadata → file system on disk, annotated "File name lookup & authenticate" and "Allocate & link up data structures"]
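The steps above can be sketched as plain data structures. This is a toy model, not kernel code: every class and field name (`MemInode`, `OpenFile`, `Process`, `Kernel`) is hypothetical, permission checks are omitted, and it follows the Unix convention that each `open()` gets its own open-file-table entry (with its own offset) while the in-memory inode is shared.

```python
from dataclasses import dataclass, field


@dataclass
class MemInode:                 # in-memory copy of a disk inode
    ino: int
    size: int
    refs: int = 0               # open-file-table entries pointing here


@dataclass
class OpenFile:                 # one system-wide open file table entry
    inode: MemInode
    mode: str
    offset: int = 0


@dataclass
class Process:                  # stand-in for the PCB's open file pointer array
    fds: list = field(default_factory=lambda: [None] * 8)


class Kernel:
    def __init__(self, directory, disk_inodes):
        self.directory = directory        # name -> inode number
        self.disk_inodes = disk_inodes    # inode number -> file size "on disk"
        self.inode_cache = {}             # inode number -> MemInode
        self.open_files = []              # system-wide open file table

    def open(self, proc, name, mode="r"):
        ino = self.directory.get(name)    # 1. name lookup (authentication omitted)
        if ino is None:
            raise FileNotFoundError(name)
        mem = self.inode_cache.get(ino)   # 2. copy metadata into memory if needed
        if mem is None:
            mem = MemInode(ino, self.disk_inodes[ino])
            self.inode_cache[ino] = mem
        entry = OpenFile(mem, mode)       # 3. open file table entry
        mem.refs += 1
        self.open_files.append(entry)
        fd = proc.fds.index(None)         # 4. lowest free PCB slot (ValueError if full)
        proc.fds[fd] = entry              # 5. link PCB -> open file -> inode
        return fd                         # 6. hand the fd back to the user


k = Kernel({"foo": 7}, {7: 1234})
p = Process()
fd1 = k.open(p, "foo")
fd2 = k.open(p, "foo")
```

The first fd is 0 only because this toy PCB starts empty; a real process already has 0, 1, and 2 open.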
16 Reading And Writing
- What happens when you:
- read 10 bytes from a file?
- write 10 bytes into an existing file?
- write 4096 bytes into a file?
- Disk works on blocks (sectors)
- Can have temporary (ephemeral) buffers
- Longer-lasting buffers: disk cache
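Since the disk works on whole blocks, the first question for any read or write is which blocks a byte range covers, and for a write, whether a block is only partly overwritten (so its old contents must be read first). A minimal sketch assuming 4096-byte blocks; `blocks_touched` and `write_plan` are hypothetical helper names.

```python
BLOCK = 4096  # assumed block size in bytes


def blocks_touched(offset, length, block=BLOCK):
    """Logical block numbers covered by bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block
    last = (offset + length - 1) // block
    return list(range(first, last + 1))


def write_plan(offset, length, file_size, block=BLOCK):
    """For each block a write touches, say whether old contents must be read first.

    A block needs read-modify-write when the write covers only part of it
    and that block already holds file data.
    """
    plan = {}
    for b in blocks_touched(offset, length, block):
        start, end = b * block, (b + 1) * block
        covers_whole_block = offset <= start and offset + length >= end
        has_old_data = start < file_size
        if covers_whole_block or not has_old_data:
            plan[b] = "overwrite"
        else:
            plan[b] = "read-modify-write"
    return plan


print(write_plan(100, 10, 8192))     # 10 bytes into an existing file
print(write_plan(100, 4096, 8192))   # 4096 bytes, not block-aligned
```

Note that writing 4096 bytes is cheap only when the write is block-aligned; shifted by 100 bytes it touches two blocks, both partially.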
17 Reading A Block
read(fd, userBuf, size)
[Figure: read path: PCB → open file table → metadata (logical → physical block mapping) → buffer cache ("get physical block to sysBuf, copy to userBuf"); on a miss, read(device, phyBlock, size) goes to the disk device driver]
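The read path in the figure, reduced to its core: map logical to physical blocks, fetch each physical block through the buffer cache into a system buffer, and copy the requested slice to the user. A minimal sketch; the LRU eviction policy, the 512-byte block size, and the names `BufferCache`, `device_read`, and `block_map` are all assumptions for illustration.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny LRU cache of disk blocks keyed by physical block number."""

    def __init__(self, device_read, capacity=4):
        self.device_read = device_read   # callable: phys_block -> bytes (the "driver")
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def get(self, phys):
        if phys in self.blocks:
            self.blocks.move_to_end(phys)         # mark most recently used
            return self.blocks[phys]
        self.misses += 1
        data = self.device_read(phys)             # slow path: go to the disk
        self.blocks[phys] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # evict least recently used
        return data


def read(cache, block_map, offset, size, block_size=512):
    """Copy `size` bytes starting at `offset`; block_map[logical] = physical."""
    out = bytearray()
    while size > 0:
        logical, within = divmod(offset, block_size)
        sys_buf = cache.get(block_map[logical])   # logical -> physical, via cache
        chunk = sys_buf[within:within + size]
        if not chunk:
            break                                 # short block: stop, don't spin
        out += chunk                              # copy to userBuf
        offset += len(chunk)
        size -= len(chunk)
    return bytes(out)


disk = {p: bytes([p]) * 512 for p in range(16)}   # fake device: block p is full of byte p
cache = BufferCache(lambda p: disk[p], capacity=2)
data = read(cache, [9, 4, 7], 500, 20)            # spans logical blocks 0 and 1
```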
18 A Disk Layout for A File System
[Figure: disk layout: boot block, super block, file metadata (i-node in Unix), file data blocks]
- Superblock defines a file system
- size of the file system
- size of the file descriptor area
- free list pointer, or pointer to bitmap
- location of the file descriptor of the root directory
- other metadata such as permissions and various times
- For reliability, replicate the superblock
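A minimal sketch of a superblock with the fields listed above, plus a reader that falls back to a replica when the primary copy is damaged. The on-disk layout, the `MAGIC` value, the 1 KB block size, and the replica locations are all hypothetical; real filesystems spread the copies far apart (BSD FFS, for example, keeps one per cylinder group) rather than in adjacent blocks as here.

```python
import struct

# Hypothetical layout (not any real filesystem's format):
# magic, total blocks, inode-area blocks, bitmap start block, root inode number
SUPER = struct.Struct("<IIIII")
MAGIC = 0x5346_5331       # arbitrary value chosen for this sketch
BLOCK = 1024


def pack_super(total, inode_blocks, bitmap_start, root_ino):
    raw = SUPER.pack(MAGIC, total, inode_blocks, bitmap_start, root_ino)
    return raw.ljust(BLOCK, b"\0")        # pad to a full block


def unpack_super(block):
    magic, total, inode_blocks, bitmap_start, root_ino = SUPER.unpack_from(block)
    if magic != MAGIC:
        raise ValueError("bad superblock")
    return {"total": total, "inode_blocks": inode_blocks,
            "bitmap_start": bitmap_start, "root_ino": root_ino}


def read_super(disk, copies=(1, 2)):
    """Try each replica in turn; `disk` is a list of BLOCK-sized byte strings."""
    for where in copies:
        try:
            return unpack_super(disk[where])
        except ValueError:
            continue                      # damaged copy: try the next one
    raise ValueError("no valid superblock copy")


good = pack_super(8192, 64, 3, 2)
disk = [b"\0" * BLOCK, good, good]        # block 0 plays the boot block
```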
19 File Usage Patterns
- How do users access files?
- Sequential: bytes read in order
- Random: read/write an element out of the middle of arrays
- Whole file or partial file
- How are files used?
- Most files are small
- Large files use up most of the disk space
- Large files account for most of the bytes transferred
- Bad news
- Need everything to be efficient
20 Data Structures for Disk Management
- A header for each file (part of the file metadata)
- Disk sectors associated with each file
- A data structure to represent free space on disk
- Bit map
- 1 bit per block (sector)
- blocks numbered in cylinder-major order, why?
- Linked list
- Others?
- How much space does a bit map need for a 4G disk?
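For the last question, assuming 4 GiB and 512-byte sectors: 2^32 / 2^9 = 2^23 sectors, so 2^23 bits = 1 MiB of bitmap; with 4 KiB blocks it shrinks to 128 KiB. A minimal sketch of the computation and of a first-fit bitmap allocator; `bitmap_bytes` and `FreeMap` are hypothetical names.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """Size in bytes of a 1-bit-per-block free map."""
    blocks = disk_bytes // block_bytes
    return (blocks + 7) // 8


class FreeMap:
    """Free-space bitmap: bit i set means block i is in use."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def in_use(self, i):
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def allocate(self):
        """First fit: return the lowest free block number and mark it used."""
        for i in range(self.nblocks):
            if not self.in_use(i):
                self.bits[i // 8] |= 1 << (i % 8)
                return i
        raise OSError("disk full")

    def free(self, i):
        self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF


print(bitmap_bytes(4 * 2**30, 512), bitmap_bytes(4 * 2**30, 4096))
```

Scanning the bitmap in block-number order is also what makes it easy to spot runs of contiguous free blocks.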
21 Linked Files (Alto)
- File header points to 1st block on disk
- Each block points to next
- Pros
- Can grow files dynamically
- Free list is similar to a file
- Cons
- Random access is horrible
- Unreliable: losing a block means losing the rest
[Figure: file header → first block → . . . → last block → null]
22 Contiguous Allocation
- Request the size of the file in advance
- Search bit map or linked list to locate a space
- File header
- first sector in file
- number of sectors
- Pros
- Fast sequential access
- Easy random access
- Cons
- External fragmentation
- Hard to grow files
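A minimal sketch of contiguous allocation: a first-fit search for a run of free blocks, a two-field file header, and the one-addition logical-to-physical mapping that makes random access easy. The free map is a list of booleans for readability, and the function names are hypothetical.

```python
def find_run(used, n):
    """First-fit search of a free map (list of bools) for n consecutive free blocks."""
    run_start, run_len = 0, 0
    for i, busy in enumerate(used):
        if busy:
            run_start, run_len = i + 1, 0
        else:
            run_len += 1
            if run_len == n:
                return run_start
    return None   # may fail even with enough free blocks in total (external fragmentation)


def allocate_contiguous(used, n):
    start = find_run(used, n)
    if start is None:
        raise OSError(f"no run of {n} free blocks")
    for i in range(start, start + n):
        used[i] = True
    return {"first": start, "count": n}     # the whole file header


def to_physical(header, logical):
    """Random access is one addition plus a bounds check."""
    if not 0 <= logical < header["count"]:
        raise IndexError(logical)
    return header["first"] + logical


used = [True, False, False, True, False, False, False, True]
h = allocate_contiguous(used, 3)
```

Growing this file by one block would require block 7 to be free; since it is not, the whole file would have to move.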
23 Single-Level Indexed Files or Extent-based Filesystems
- A user declares max size
- A file header holds an array of pointers to disk blocks
- Pros
- Can grow up to a limit
- Random access is fast
- Cons
- Clumsy to grow beyond limit
- Periodic cleanup of new files
- Up-front declaration is a real pain
[Figure: file header holding an array of pointers to disk blocks]
24 File Allocation Table (FAT)
- Approach
- A section of disk for each partition is reserved
- One entry for each block
- A file is a linked list of blocks
- A directory entry points to the 1st block of the file
- Pros
- Simple
- Cons
- Always go to FAT
- Wasting space
[Figure: directory entry "foo" points to block 217; in the FAT, entry 217 → 619, entry 619 → 399, entry 399 → EOF]
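The figure's file `foo` can be reproduced with a dictionary standing in for the table. A minimal sketch; the `EOF = -1` sentinel and the names `chain` and `nth_block` are assumptions. Even random access walks the chain, which is why every access has to go to the FAT; the saving compared with plain linked files is that only the (usually cached) table is read, not the data blocks.

```python
EOF = -1                                  # assumed end-of-chain marker

fat = {217: 619, 619: 399, 399: EOF}      # entry for block b holds the next block
directory = {"foo": 217}                  # directory entry -> first block


def chain(fat, first):
    """Follow a file's chain of block numbers through the FAT."""
    blocks = []
    b = first
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks


def nth_block(fat, first, n):
    """Physical block holding the file's n-th logical block (0-based)."""
    b = first
    for _ in range(n):
        b = fat[b]
        if b == EOF:
            raise IndexError(n)
    return b


print(chain(fat, directory["foo"]))
```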
25 Multi-Level Indexed Files (Unix)
- 13 pointers in a header
- Pointers 1 to 10: direct
- Pointer 11: 1-level indirect
- Pointer 12: 2-level indirect
- Pointer 13: 3-level indirect
- Pros & Cons
- In favor of small files
- Can grow
- Limit is 16G and lots of seeks
- What happens to reach block 23, 5, 340?
[Figure: header pointers 1 to 10 lead directly to data blocks; pointers 11, 12, and 13 reach data through one, two, and three levels of indirect blocks]
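A minimal sketch of the lookup, assuming 1 KB blocks and 4-byte block pointers (so 256 pointers per indirect block) and 0-based logical block numbers; those parameters are assumptions, but they reproduce the slide's roughly 16 GB limit. Under them, block 5 needs only the header, block 23 needs one indirect block, and block 340 needs two.

```python
NDIRECT = 10
BLOCK_SIZE = 1024                    # assumed: 1 KB blocks
PTR_SIZE = 4                         # assumed: 4-byte block pointers
PER_BLOCK = BLOCK_SIZE // PTR_SIZE   # 256 pointers per indirect block


def locate(n):
    """Return (level, path) for logical block n (0-based).

    Level 0 means a direct pointer and path is (n,). Levels 1, 2, 3 mean
    single, double, triple indirect; path lists the index to follow in each
    indirect block, top level first.
    """
    if n < NDIRECT:
        return 0, (n,)
    n -= NDIRECT
    span = PER_BLOCK                 # blocks reachable through this level
    for level in (1, 2, 3):
        if n < span:
            digits = []
            for _ in range(level):
                n, d = divmod(n, PER_BLOCK)
                digits.append(d)
            return level, tuple(reversed(digits))
        n -= span
        span *= PER_BLOCK
    raise ValueError("beyond maximum file size")


MAX_BLOCKS = NDIRECT + PER_BLOCK + PER_BLOCK**2 + PER_BLOCK**3
MAX_BYTES = MAX_BLOCKS * BLOCK_SIZE  # 17,247,250,432 bytes, about 16 GiB

for b in (23, 5, 340):
    print(b, locate(b))
```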
26 Reliability In Disk Systems
- Make sure certain actions have occurred before the function completes
- Known as synchronous operation
- Ex: make sure the new inode is on disk and the directory has been modified before declaring a file creation complete
- Drawback: speed
- Some ops are easily asynchronous: access time
- Some filesystems don't care: Linux ext2fs
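A user program can request this kind of synchronous behavior on a POSIX system with `fsync`: once on the new file (data and inode), and once on its parent directory (the new directory entry). A minimal sketch; `create_durably` is a hypothetical helper, opening a directory this way does not work on Windows, and whether data truly reaches the platter also depends on the drive's write cache.

```python
import os
import tempfile


def create_durably(directory, name, data):
    """Create a new file and return only after its contents and its
    directory entry have been flushed with fsync."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:                          # os.write may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                         # file data + inode
    finally:
        os.close(fd)
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)                        # directory entry naming the file
    finally:
        os.close(dfd)
    return path
```

Each `fsync` waits for the disk, which is exactly the speed drawback listed above.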
27 Recovery After Failure
- Need to ensure consistency
- Does the free bitmap match a tree walk?
- Do reference counts in inodes match directory entries?
- Do blocks appear in multiple inodes?
- The time for this kind of recovery grows with disk size
- Clean shutdown: mark as such, no recovery needed
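The three questions above can be phrased as a toy checker in the spirit of fsck. A minimal sketch over an in-memory model: the input shapes and the name `check` are hypothetical, and the directory tree walk is assumed to have been done already. Every check loops over all inodes or blocks, which is why recovery time grows with disk size.

```python
from collections import Counter


def check(inodes, dir_entries, used_blocks):
    """Toy consistency check.

    inodes:       ino -> {"nlink": int, "blocks": [block numbers]}
    dir_entries:  (name, ino) pairs found by walking the directory tree
    used_blocks:  set of blocks the free bitmap marks as in use
    Returns a list of problem descriptions (empty if consistent).
    """
    problems = []

    # Do reference counts match the number of directory entries?
    refs = Counter(ino for _, ino in dir_entries)
    for ino, node in inodes.items():
        if node["nlink"] != refs[ino]:
            problems.append(f"inode {ino}: nlink {node['nlink']}, found {refs[ino]} entries")

    # Does any block appear in more than one inode (or twice in one)?
    owners = Counter(b for node in inodes.values() for b in node["blocks"])
    for b, count in sorted(owners.items()):
        if count > 1:
            problems.append(f"block {b} claimed {count} times")

    # Does the free bitmap match what the tree walk reaches?
    reachable = set(owners)
    for b in sorted(used_blocks - reachable):
        problems.append(f"block {b} marked used but unreferenced")
    for b in sorted(reachable - used_blocks):
        problems.append(f"block {b} referenced but marked free")
    return problems


inodes = {2: {"nlink": 1, "blocks": [10, 11]}, 3: {"nlink": 2, "blocks": [11]}}
entries = [("a", 2), ("b", 3)]
problems = check(inodes, entries, {10, 11, 12})
```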
28 Reducing Synchronous Times
- Write to faster storage
- Nonvolatile memory: expensive, requires some additional OS/firmware support
- Write to a special disk or section: logging
- Only have to examine the log when recovering
- Eventually have to put information in place
- Some information dies in the log itself
- Write in a special order
- Write metadata in a way that is consistent but possibly recovers less
29 Challenges
- Unix filesystem has great flexibility
- Extent-based filesystems have speed
- Seeks kill performance: locality
- Bitmaps show contiguous free space
- Linked lists easy to search
- How do you perform backup/restore?
30 A Quick XOR Overview
- XOR: eXclusive OR
- a XOR a = 0
- a XOR 0 = a
- a XOR b = b XOR a
- (a XOR b) XOR c = a XOR (b XOR c)
- In other words, count the bits:
- even → 0, odd → 1
31 More Fun With XOR
- Result = XOR(a1, a2, a3, a4, ...)
- a2 goes bad
- Can we reconstruct a2?
- a2 = XOR(a1, result, a3, a4, ...)
- What does this imply for disks?
- What kinds of failures does it handle?
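The reconstruction works byte by byte: XORing the parity with all surviving blocks cancels every term except the missing one, using a XOR a = 0 and a XOR 0 = a. A minimal sketch with four made-up 4-byte blocks standing in for a1 to a4; `xor_blocks` is a hypothetical helper.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks, [0] * len(blocks[0])))


data = [b"disk", b"RAID", b"xor!", b"\x00\x01\x02\x03"]   # a1..a4
parity = xor_blocks(data)                                # result

# a2 (index 1) goes bad: XOR the survivors with the parity block.
survivors = [blk for i, blk in enumerate(data) if i != 1]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)
```

This relies on knowing which block is missing (for example, a disk that has stopped responding); a single parity block cannot by itself locate a block that silently returns wrong data.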
32 Bigger, Faster, Stronger
- Making individual disks larger is hard
- Throw more disks at the problem
- Capacity increases
- Effective access speed may increase
- Probability of failure also increases
- Use some disks to provide redundancy
- Generally assume a fail-stop model
- Fail-stop versus Byzantine failures
33 RAID (Redundant Array of Inexpensive Disks)
- Main idea
- Store the error correcting codes on other disks
- General error correcting codes are too powerful
- Use XORs or single parity
- Upon any failure, one can recover the entire block from the spare disk (or any disk) using XORs
- Pros
- Reliability
- High bandwidth
- Cons
- The controller is complex
[Figure: RAID controller computing XOR across the disks]
34 Synopsis of RAID Levels
- RAID Level 0: Non-redundant (striping)
- RAID Level 1: Mirroring
- RAID Level 2: Bit-interleaved, ECC
- RAID Level 3: Byte-interleaved, parity
- RAID Level 4: Block-interleaved, parity
- RAID Level 5: Block-interleaved, distributed parity
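Levels 4 and 5 differ only in where the parity block of each stripe lives. With a dedicated parity disk (level 4), every small write also updates that one disk, which becomes a bottleneck; rotating the parity (level 5) spreads that load. A minimal sketch of one possible rotation, in which parity moves one disk to the left per stripe; several layouts are used in practice, and `raid4_location`/`raid5_location` are hypothetical names.

```python
def raid4_location(stripe_unit, ndisks):
    """Logical block -> (stripe, data disk, parity disk); parity always on the last disk."""
    stripe, k = divmod(stripe_unit, ndisks - 1)
    return stripe, k, ndisks - 1


def raid5_location(stripe_unit, ndisks):
    """Same mapping, but the parity disk rotates from stripe to stripe."""
    stripe, k = divmod(stripe_unit, ndisks - 1)
    parity_disk = (ndisks - 1) - (stripe % ndisks)
    data_disk = k if k < parity_disk else k + 1   # skip over the parity disk
    return stripe, data_disk, parity_disk


for unit in range(12):
    print(unit, raid4_location(unit, 4), raid5_location(unit, 4))
```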
35 Did RAID Work?
- Performance: yes
- Reliability: yes
- Cost: no
- Controller design complicated
- Fewer economies of scale
- High-reliability environments don't care
- Now also software implementations
36 RAID's Real Benefit
- Partly addresses the failure problem
- Backup/restore less of an issue
- Failed disk rebuilt at sector level
- Lower performance during rebuild, but system still online
- Still not perfect
- Geographic problems
- Failure during rebuild