Title: Filesystems
1Filesystems: Metadata, Paths, Caching
- Vivek Pai
- Princeton University
2Diskgedanken
- Assuming you back up and restore files, what factors affect the time involved?
- How are these factors changing?
- What issues affect the rates of change?
- How is total backup time changing over the years?
- What is Occam's razor?
3Today's Overview
- Quiz recap
- Finish up metadata, reliability
- A little discussion of mounting, etc
- Move on to performance
4Quiz 1 Observations
- I'm disappointed
- Quizzes not yet graded, but
- Most people did poorly on question 1
- Lots of dimensional analysis
- Lots of sleepers, chatting, weird faces
- Very little (too little) feedback in general
- Open question looking for a methodical approach
5Occam's Razor
- From William of Occam (philosopher)
- "Entities should not be multiplied unnecessarily"
- Often reduced to other statements
- "One should not increase, beyond what is necessary, the number of entities required to explain anything"
- Make as few assumptions as possible
- "Once you have eliminated all other possible explanations, what remains must be the answer"
6A Reasonable Approach
- Disk size: 40GB (20-80GB common)
- File size: 10KB (5-20KB common)
- Access time: 10ms (5-20ms common)
- Assume 1 seek per file (reasonable)
- 100 files = 1MB; at 0.01 sec per access, that is 1MB/s
- So, 40GB at 1MB/s = 40K sec, or about 11 hours
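The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch in Python that assumes decimal units (1GB = 10^6 KB) and that seek time dominates transfer time; the function name backup_seconds is illustrative, not from the lecture.

```python
def backup_seconds(disk_kb: float, file_kb: float, access_s: float) -> float:
    """Estimate file-by-file backup time, assuming one seek per file
    and that the seek dominates the time spent on each file."""
    num_files = disk_kb / file_kb
    return num_files * access_s

# Slide figures: 40GB disk, 10KB files, 10ms per access (decimal units assumed).
seconds = backup_seconds(disk_kb=40e6, file_kb=10, access_s=0.010)
print(f"{seconds:.0f} sec = {seconds / 3600:.1f} hours")
```

Plugging in the "common" ranges from the slide (20-80GB, 5-20KB, 5-20ms) shows how widely the answer can swing.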
7Changes Over Time
- Disk density doubling each year
- Seek time dropping < 10% per year
- File size growing slowly
- Results
- # of files grows faster than access time reduction
- Backup time increases
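A rough projection makes the trend concrete. The sketch below assumes that capacity doubles each year, that access time drops by exactly 10% per year (an optimistic reading of the slide), and that file size stays constant (a simplification of "growing slowly"). The function name and the starting values are illustrative.

```python
def projected_backup_hours(years: int,
                           disk_gb: float = 40.0,
                           file_kb: float = 10.0,
                           access_ms: float = 10.0) -> list:
    """Return estimated file-by-file backup time in hours for year 0..years."""
    hours = []
    for _ in range(years + 1):
        num_files = disk_gb * 1e6 / file_kb      # decimal units assumed
        hours.append(num_files * access_ms / 1000 / 3600)
        disk_gb *= 2.0       # density doubles each year
        access_ms *= 0.9     # assumed: access time drops 10% per year
    return hours

for year, h in enumerate(projected_backup_hours(3)):
    print(f"year {year}: {h:.1f} hours")
```

Under these assumptions the backup time grows by a factor of 2 x 0.9 = 1.8 every year.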
8Most Common Answer
- Disk size / maximum transfer rate
- In other words, read sectors, not files
- Can this be done?
- Yes, if you have access to raw disk
- Which means that you have root permission
- And that the system has raw disk support
- Faster than file-based dump/restore
- No concept of files, however
- What happens if you restore to a disk with a
different geometry?
9Linked Files (Alto)
- File header points to 1st block on disk
- Each block points to next
- Pros
- Can grow files dynamically
- Free list is similar to a file
- Cons
- Random access is horrible
- Unreliable: losing a block means losing the rest
[Figure: file header pointing to the first block; each block points to the next; the last pointer is null]
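To see why random access is expensive with linked files, consider a toy model in which every hop along the chain costs one disk read. This is only a sketch: the dictionary stands in for the per-block next pointers, and the block numbers are hypothetical.

```python
from typing import Dict, Optional, Tuple

def nth_block(next_ptr: Dict[int, Optional[int]], first: int, n: int) -> Tuple[int, int]:
    """Return (disk block holding logical block n, number of blocks read).

    next_ptr maps a disk block to the block that follows it in the file,
    or None at end of file. Each hop stands for one disk read.
    """
    block, reads = first, 1
    for _ in range(n):
        nxt = next_ptr[block]
        if nxt is None:
            raise IndexError("logical block is past end of file")
        block, reads = nxt, reads + 1
    return block, reads

# Hypothetical file occupying disk blocks 7 -> 2 -> 9 -> 4.
chain = {7: 2, 2: 9, 9: 4, 4: None}
print(nth_block(chain, first=7, n=3))   # (4, 4): four reads to reach the 4th block
```

The same model shows the reliability problem: if any entry in the chain is lost, every block after it becomes unreachable.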
10Contiguous Allocation
- Request in advance for the size of the file
- Search bit map or linked list to locate a space
- File header
- first sector in file
- number of sectors
- Pros
- Fast sequential access
- Easy random access
- Cons
- External fragmentation
- Hard to grow files
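The bitmap search and the external fragmentation problem can be illustrated with a first-fit scan. First-fit is one possible policy, chosen here for brevity; the slide does not say which policy is used, and the bitmap below is made up.

```python
from typing import List, Optional

def first_fit(bitmap: List[bool], size: int) -> Optional[int]:
    """Return the first sector of a free run of `size` sectors, or None.

    bitmap[i] is True when sector i is in use.
    """
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0
        else:
            run_len += 1
            if run_len == size:
                return run_start
    return None

# 10 sectors; U = used, . = free
bitmap = [c == "U" for c in "U..U...U.."]
print(first_fit(bitmap, 3))  # 4
print(first_fit(bitmap, 4))  # None: 7 sectors are free, but no run of 4
```

The second call fails even though more than enough sectors are free in total, which is external fragmentation in miniature.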
11Single-Level Indexed Files or Extent-based Filesystems
- A user declares max size
- A file header holds an array of pointers to disk blocks
- Pros
- Can grow up to a limit
- Random access is fast
- Cons
- Clumsy to grow beyond limit
- Periodic cleanup of new files
- Up-front declaration a real pain
[Figure: file header with an array of pointers to disk blocks]
12File Allocation Table (FAT)
- Approach
- A section of disk for each partition is reserved
- One entry for each block
- A file is a linked list of blocks
- A directory entry points to the 1st block of the file
- Pros
- Simple
- Cons
- Always go to FAT
- Wasting space
[Figure: directory entry foo points to FAT entry 217; the FAT chain is 217 -> 619 -> 399 -> EOF]
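Following a file through the FAT amounts to walking a linked list stored in a table. The sketch below uses the block numbers from the figure (foo starts at 217, then 619, then 399); the EOF encoding and the dictionary representation are assumptions made for illustration.

```python
from typing import Dict, List

EOF = -1  # sentinel for end of chain (hypothetical encoding)

def file_blocks(fat: Dict[int, int], first_block: int) -> List[int]:
    """Follow the FAT chain from a file's first block and list its blocks."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]      # every step consults the FAT
    return blocks

fat = {217: 619, 619: 399, 399: EOF}
directory = {"foo": 217}
print(file_blocks(fat, directory["foo"]))  # [217, 619, 399]
```

Because the pointers live in the table rather than in the data blocks, the whole chain can be followed without touching the file's data, but every access has to go through the FAT.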
13Multi-Level Indexed Files (Unix)
- 13 pointers in a header
- 10 direct pointers
- 11th: 1-level indirect
- 12th: 2-level indirect
- 13th: 3-level indirect
- Pros & Cons
- In favor of small files
- Can grow
- Limit is 16G and lots of seeks
- What does it take to reach block 23, 5, or 340?
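[Figure: file header with pointers 1-13; the direct pointers lead to data blocks, and pointers 11, 12, and 13 lead through one, two, and three levels of indirect blocks to data]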
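The question about blocks 23, 5, and 340 can be worked through in code. The sketch assumes 0-based logical block numbers, 1KB blocks, and 4-byte pointers, which gives 256 pointers per indirect block; these values are not stated on the slide, but they are consistent with its 16G limit (256^3 blocks of 1KB each).

```python
from typing import List, Tuple

NUM_DIRECT = 10
PTRS_PER_BLOCK = 256   # assumed: 1KB blocks, 4-byte pointers

def locate(block_no: int) -> Tuple[str, List[int]]:
    """Map a logical block number to (header pointer used, index at each level)."""
    if block_no < NUM_DIRECT:
        return "direct", [block_no]
    n = block_no - NUM_DIRECT
    span = PTRS_PER_BLOCK          # blocks reachable through the current pointer
    names = ["1-level indirect", "2-level indirect", "3-level indirect"]
    for level, name in enumerate(names, start=1):
        if n < span:
            indices = []
            for _ in range(level):
                indices.append(n % PTRS_PER_BLOCK)
                n //= PTRS_PER_BLOCK
            return name, indices[::-1]
        n -= span
        span *= PTRS_PER_BLOCK
    raise ValueError("block number beyond maximum file size")

for b in (23, 5, 340):
    print(b, locate(b))
# 23  -> ('1-level indirect', [13])
# 5   -> ('direct', [5])
# 340 -> ('2-level indirect', [0, 74])
```

Each entry in the index list corresponds to one extra indirect block that must be read before the data block, which is where the "lots of seeks" come from.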
14Reliability In Disk Systems
- Make sure certain actions have occurred before function completes
- Known as synchronous operation
- Ex: make sure the new inode is on disk and that the directory has been modified before declaring a file creation complete
- Drawback: speed
- Some ops are easily asynchronous: access time
- Some filesystems don't care: Linux ext2fs
15Recovery After Failure
- Need to ensure consistency
- Does free bitmap match tree walk?
- Do reference counts in inodes match directory entries?
- Do blocks appear in multiple inodes?
- The time for this kind of recovery grows with disk size
- Clean shutdown: mark as such, no recovery
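Two of the checks above (free bitmap versus tree walk, and blocks claimed by several inodes) can be sketched as set operations. The data structures are simplified stand-ins, not a real on-disk format: inodes are a dictionary of block lists and the free bitmap is a set.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def check_blocks(inodes: Dict[int, List[int]], free: Set[int],
                 total_blocks: int) -> Tuple[Set[int], Set[int], Set[int]]:
    """Compare the free bitmap with a walk over all inodes.

    Returns (blocks claimed by more than one inode,
             blocks in use but also marked free,
             blocks neither in use nor marked free).
    """
    uses = Counter(b for blocks in inodes.values() for b in blocks)
    duplicated = {b for b, n in uses.items() if n > 1}
    used = set(uses)
    used_but_free = used & free
    leaked = set(range(total_blocks)) - used - free
    return duplicated, used_but_free, leaked

inodes = {1: [0, 1], 2: [1, 2], 3: [4]}   # hypothetical inode -> block lists
free = {4, 5, 6}                           # hypothetical free bitmap, as a set
duplicated, used_but_free, leaked = check_blocks(inodes, free, total_blocks=8)
print(duplicated, used_but_free, leaked)
```

The check has to visit every inode and every block, which is why this style of recovery slows down as disks grow.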
16Reducing Synchronous Times
- Write to faster storage
- Nonvolatile memory: expensive, requires some additional OS/firmware support
- Write to a special disk or section: logging
- Only have to examine log when recovering
- Eventually have to put information in place
- Some information dies in the log itself
- Write in a special order
- Write metadata in a way that is consistent but
possibly recovers less
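The logging option can be sketched as a replay of committed records. This is a simplified in-memory model: the record layout ("write"/"commit" tuples with a transaction id) is an assumption made for illustration and does not correspond to any particular filesystem's log format.

```python
from typing import Dict, List, Tuple

def recover(log: List[Tuple], disk: Dict[str, str]) -> Dict[str, str]:
    """Replay only the updates whose transaction has a commit record."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            disk[key] = value       # put the information in its final place
    return disk

log = [
    ("write", 1, "inode 12", "allocated"),
    ("write", 1, "dir /home", "entry foo -> inode 12"),
    ("commit", 1),
    ("write", 2, "inode 13", "allocated"),   # crash happened before commit
]
recovered = recover(log, disk={})
print(recovered)
```

Only the log is examined during recovery, so recovery time depends on the size of the log rather than the size of the disk. The uncommitted update for transaction 2 is dropped, which leaves the metadata consistent at the cost of losing the most recent operation.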
17Challenges
- Unix filesystem has great flexibility
- Extent-based filesystems have speed
- Seeks kill performance, so locality matters
- Bitmaps show contiguous free space
- Linked lists easy to search
- How do you perform backup/restore?
18Bigger, Faster, Stronger
- Making individual disks larger is hard
- Throw more disks at the problem
- Capacity increases
- Effective access speed may increase
- Probability of failure also increases
- Use some disks to provide redundancy
- Generally assume a fail-stop model
- Fail-stop versus Byzantine failures
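The claim that failure probability rises with the number of disks follows from a short calculation. The sketch assumes independent failures and uses a made-up per-disk failure probability of 3% per year purely for illustration.

```python
def p_any_failure(p_disk: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks

P_DISK = 0.03   # hypothetical per-disk failure probability per year
for n in (1, 4, 16):
    print(f"{n:2d} disks: {p_any_failure(P_DISK, n):.1%}")
```

Without redundancy, losing any one disk loses data, so the array is less reliable than a single disk even though its capacity and bandwidth are higher.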
19RAID (Redundant Array of Inexpensive Disks)
- Main idea
- Store the error correcting codes on other disks
- General error correcting codes are too powerful
- Use XORs or single parity
- Upon any failure, one can recover the entire block from the spare disk (or any disk) using XORs
- Pros
- Reliability
- High bandwidth
- Cons
- The controller is complex
[Figure: RAID controller computing XOR parity across the disks]
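The XOR recovery step can be demonstrated on a single stripe. This is a minimal sketch with three data disks and one parity disk; the block contents are made up, and a real controller works on fixed-size sectors rather than Python byte strings.

```python
from functools import reduce
from typing import List

def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"disk0 data", b"disk1 data", b"disk2 data"]   # one stripe, three data disks
parity = xor_blocks(data)                               # stored on a fourth disk

# Disk 1 fails (fail-stop): rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

Single parity is enough here because the fail-stop model tells the controller which disk is missing; it only has to reconstruct the data, not locate the error.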
20Synopsis of RAID Levels
RAID Level 0: Non-redundant (JBOD)
RAID Level 1: Mirroring
RAID Level 2: Byte-interleaved, ECC
RAID Level 3: Byte-interleaved, parity
RAID Level 4: Block-interleaved, parity
RAID Level 5: Block-interleaved, distributed parity
21Did RAID Work?
- Performance: yes
- Reliability: yes
- Cost: no
- Controller design complicated
- Fewer economies of scale
- High-reliability environments don't care
- Now also software implementations
22RAID's Real Benefit
- Partly addresses the failure problem
- Backup/restore less of an issue
- Failed disk rebuilt at sector level
- Lower performance during rebuild, but system still on-line
- Still not perfect
- Geographic problems
- Failure during rebuild
23Namespace
- Basically, the filesystem hierarchy
- Provides a convenient way of accessing things
- Files
- Devices
- Pseudo-filesystems
- In Unix, a nice, consistent namespace
- No drive names
24A Sample File Tree
- /
- bin/ boot/ proc/ usr/
- home/ local/
- mariah/ vivek/
25What If You Have Two Disks?
- /
- bin/ boot/ proc/ usr/
- home/ local/
- mariah/ vivek/
26As Mariah's Files Grow?
- /
- bin/ boot/ proc/ usr/
- home/ local/
- mariah/ vivek/
27Mount Points
- /
- bin/ boot/ proc/ usr/
- home/ local/
- mariah/ vivek/
28Mount Points
- Original directories get hidden
- Traversal is transparent to user
- OS keeps track of various disks (devices)
- But what happens with big disks?
- Partition (split) them into several logical devices: easier to manage, safer, etc
- Home directories in one partition, startup-related files/programs in another, etc
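One way to picture how the OS keeps track of devices is a mount table searched by longest matching prefix. The sketch below is illustrative only: the table contents and device names are hypothetical, and real kernels resolve mounts during path traversal rather than by string matching.

```python
from typing import Dict, Tuple

def resolve_mount(path: str, mounts: Dict[str, str]) -> Tuple[str, str]:
    """Return (device, path relative to that device's root) for an absolute path.

    Picks the longest mount point that is a prefix of the path on a
    component boundary. The table must contain an entry for "/".
    """
    best = "/"
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best):
            best = mount_point
    rel = path[len(best.rstrip("/")):] or "/"
    return mounts[best], rel

# Hypothetical mount table: mount point -> device.
mounts = {"/": "disk0", "/home": "disk1", "/home/mariah": "disk2"}
print(resolve_mount("/home/vivek/notes", mounts))   # ('disk1', '/vivek/notes')
print(resolve_mount("/home/mariah", mounts))        # ('disk2', '/')
print(resolve_mount("/bin/ls", mounts))             # ('disk0', '/bin/ls')
```

Anything that was stored under /home on disk0 before disk1 was mounted there is no longer reachable by name, which is the "original directories get hidden" point above.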
29Paths
- Each process has current directory
- Convenient shorthand
- Paths that start with / are absolute
- Paths that do not start with / are relative to the current directory
- Path lookup is potentially expensive
- It's also repetitive
- Amenable to caching
- Metadata cache from assigned reading
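Because lookups repeat the same prefixes, caching individual (directory, name) translations pays off quickly. The following sketch models directories as dictionaries and counts simulated disk reads; the inode numbers, the DIRS table, and the cache layout are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical on-disk directories: inode number -> {name: inode number}.
DIRS: Dict[int, Dict[str, int]] = {
    2: {"home": 10, "usr": 11},
    10: {"mariah": 20, "vivek": 21},
    21: {"notes.txt": 30},
}
ROOT_INODE = 2

name_cache: Dict[Tuple[int, str], int] = {}
stats = {"disk_reads": 0}

def lookup(path: str) -> int:
    """Resolve an absolute path to an inode number, one component at a time."""
    inode = ROOT_INODE
    for name in (c for c in path.split("/") if c):
        key = (inode, name)
        if key not in name_cache:
            stats["disk_reads"] += 1        # stands for reading a directory from disk
            name_cache[key] = DIRS[inode][name]
        inode = name_cache[key]
    return inode

lookup("/home/vivek/notes.txt")
print(stats["disk_reads"])   # 3
lookup("/home/vivek/notes.txt")
print(stats["disk_reads"])   # still 3: every component was cached
```

A later lookup of /home/mariah would cost only one more read, since the /home component is already in the cache.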
30Finding Paths
- In Unix, directory contains inode #
- If two directories contain same inode #, file is accessible via different paths (and names)
- Adding another name into the filespace is called linking (via ln command)
- But the directory is a file
- What happens if a directory gets linked?
31Consider The Following
- /
- bin/ boot/ proc/ usr/
- home/ local/
- mariah/ vivek/
32Various Solutions
- Only allow root to link to directory
- Can still be useful
- Hopefully root knows when to do it
- Limit the number of iterations
- Pick some large maximum
- Terminate traversal after that
- Detect loops
- Cost? Utility?
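The last two options (an iteration limit and loop detection) can be combined in one traversal. The sketch below walks a hypothetical directory graph in which vivek/ contains a link back to home/; the graph, the inode numbers, and the depth limit of 40 are made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical directory graph: inode -> {name: child directory inode}.
# Inode 21 (vivek/) contains a link "loop" back to inode 10 (home/).
DIRS: Dict[int, Dict[str, int]] = {
    2: {"home": 10},
    10: {"vivek": 21},
    21: {"loop": 10},
}

def walk(inode: int, path: str, on_path: Set[int],
         max_depth: int, out: List[str]) -> None:
    """Depth-first traversal that stops at a depth limit and at repeated inodes."""
    if max_depth == 0:
        out.append(path + " (depth limit reached)")
        return
    if inode in on_path:
        out.append(path + " (loop detected)")
        return
    out.append(path)
    for name, child in sorted(DIRS.get(inode, {}).items()):
        walk(child, path.rstrip("/") + "/" + name,
             on_path | {inode}, max_depth - 1, out)

visited: List[str] = []
walk(2, "/", set(), max_depth=40, out=visited)
print(visited)
```

The depth limit costs almost nothing but may cut off legitimately deep trees, while loop detection is exact but has to remember every directory on the current path.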
33Does It Do What You Want?
- I create vivek/work/cal/now/mtgs
- I create a link to it via vivek/mtgs
- The month advances, and vivek/work/cal/now/mtgs becomes vivek/cal/Sep01/mtgs
- Create new vivek/work/cal/now/mtgs
- To what does vivek/mtgs point?
34Symbolic Link
- Created via ln -s command
- Dynamically interpreted each use
- Does not create a standard directory entry pointing to the target. Instead:
- Link is a file containing the file/path
- May be stored in inode if link is short
- Standard looping rules apply
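The difference between the two kinds of link shows up in the calendar scenario from the earlier slide. The sketch below replays it in a temporary directory using Python's os.link and os.symlink. It assumes a POSIX system, treats mtgs as a regular file (hard links to directories are usually not allowed), and uses made-up link names and file contents.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    now = os.path.join(root, "work", "cal", "now")
    os.makedirs(now)
    target = os.path.join(now, "mtgs")
    with open(target, "w") as f:
        f.write("September meetings")

    hard = os.path.join(root, "mtgs_hard")
    soft = os.path.join(root, "mtgs_soft")
    os.link(target, hard)       # hard link: a second name for the same inode
    os.symlink(target, soft)    # symbolic link: stores the path, resolved on each use

    # The month advances: rename the old directory and create a fresh one.
    os.rename(now, os.path.join(root, "work", "cal", "Sep01"))
    os.makedirs(now)
    with open(target, "w") as f:
        f.write("October meetings")

    with open(hard) as f:
        hard_sees = f.read()
    with open(soft) as f:
        soft_sees = f.read()

print(hard_sees)   # September meetings
print(soft_sees)   # October meetings
```

The hard link keeps following the original inode, while the symbolic link is reinterpreted on every use and so tracks whatever currently lives at the stored path.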