Title: Log-Structured File System
1. Log-Structured File System
2. Transactions in File System
- Main Points
- reliability from unreliable components
- concepts
- atomicity: all or nothing
- durability: once it happens, it stays
- serializability: transactions appear to happen one by one
- Motivation
- File Systems have lots of data structures
- bitmap for free blocks
- directory
- file header
- indirect blocks
- data blocks
- for performance reasons, all must be cached
- read requests are easy
- what about writes?
3. Transactions in File System
- Write to cache
- a write-through cache does not help
- with write-back, data can be lost on a crash
- Multiple updates that belong to one operation
- what happens if a crash occurs between the updates?
- e.g. 1: move a file between directories
- delete the file from the old directory
- add the file to the new directory
- e.g. 2: create a new file
- allocate space on disk for the header and data
- write the new header to disk
- add the new file to the directory
4. Transactions in File System
- Unix Approach (ad hoc)
- meta-data consistency
- synchronous write-through
- multiple updates are done in a specific order
- after a crash, the fsck program fixes up anything in progress
- e.g.
- file created but not yet in a directory > delete the file
- blocks allocated but not in the bitmap > update the bitmap
- user data consistency
- write back to disk every 30 seconds or on user request
- no guarantee on the order in which blocks reach disk
- no support for transactions
- a user may want multiple file operations done as a unit
5. Transactions in File System
- Write-ahead logging
- almost all file systems since 1985 use write-ahead logging
- Windows NT, Solaris, OSF, etc.
- mechanism
- operation
- write all changes in a transaction to the log
- send the file changes to disk
- reclaim the log space
- on a crash, read the log
- if the log isn't complete, no change!
- if the log is completely written, apply all changes to disk
- if the log is empty, don't worry: all updates have reached disk
- pros and cons
- + reliability
- + asynchronous write-behind
- - all data is written twice
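The commit/recover protocol above can be sketched in a few lines. This is a toy model, assuming an in-memory "disk" and a log kept as a list of records; all names are invented for illustration, not taken from any real file system.

```python
# Write-ahead logging sketch: changes reach the log first, and a crash
# recovers by replaying only a completely written log.

class WALDisk:
    def __init__(self):
        self.blocks = {}        # stable storage: block number -> data
        self.log = []           # write-ahead log records
        self.log_complete = False

    def commit(self, changes):
        # 1. write all changes in the transaction to the log
        self.log = list(changes.items())
        self.log_complete = True    # commit record is written last
        # 2. send the file changes to disk
        for blk, data in self.log:
            self.blocks[blk] = data
        # 3. reclaim the log space
        self.log, self.log_complete = [], False

    def recover(self):
        # crash recovery: an incomplete log is discarded (the
        # transaction never happened); a complete log is replayed
        if self.log and self.log_complete:
            for blk, data in self.log:
                self.blocks[blk] = data
        self.log, self.log_complete = [], False
```

A crash before the commit record leaves the old contents intact; a crash after it is repaired by replay, which gives the all-or-nothing behavior described above.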
6. Log-Structured File Systems
- Idea
- write data once
- the log is the only copy of the data
- as you modify disk blocks, store them in the log
- put everything (data blocks, file headers, etc.) in the log
- Data fetch
- if you need data from disk, get it from the log
- keep a map in memory
- tells you where everything is
- the map should also be in the log for crash recovery
7. Log-Structured File Systems
- Advantage
- all writes are sequential!!
- no seeks, except for reads
- reads can be handled by the cache
- caches are getting bigger
- in the extreme case, disk I/O is only for writes, which are sequential
- same problems as contiguous allocation
- many files are deleted within their first 5 minutes
- need garbage collection
- if the disk fills up, problem!!
- keep the disk under-utilized
8. Log-Structured File Systems
- Mechanism
- issues in implementing the log
- how to retrieve information from the log
- keeping enough free space for the log
- cache file changes, then write them sequentially to disk in a single operation
- fast writes
- Information retrieval
- inode map at a fixed checkpoint region
- indices to the inodes contained in each write
- most of them are cached in memory
- fast reads
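The write and read paths described above might be sketched as follows. This is a minimal model, assuming an append-only log list and an in-memory inode map; the record formats are invented for illustration.

```python
# LFS sketch: the log is the only copy of the data, writes are strictly
# appends, and an in-memory inode map locates each file's latest inode.

class LFS:
    def __init__(self):
        self.log = []          # append-only: all writes are sequential
        self.inode_map = {}    # inode number -> log address of inode

    def write_file(self, inum, data_blocks):
        # append the data blocks, then an inode pointing at them
        addrs = []
        for block in data_blocks:
            addrs.append(len(self.log))
            self.log.append(("data", block))
        self.inode_map[inum] = len(self.log)
        self.log.append(("inode", addrs))
        # the inode map itself is also written to the log for recovery
        self.log.append(("imap", dict(self.inode_map)))

    def read_file(self, inum):
        # reads go through the map; unlike FFS, inodes have no fixed home
        _, addrs = self.log[self.inode_map[inum]]
        return [self.log[a][1] for a in addrs]
```

Overwriting a file simply appends a new copy and repoints the map, which is also why dead blocks accumulate and garbage collection becomes necessary.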
9. Log Examples
- (figure) LFS log: data, i-node, dir, i-node, data, i-node, dir, i-node, map, all appended sequentially to the log
- (figure) FFS: i-nodes at fixed locations, with data and directory blocks scattered across the disk
- In FFS, each inode is at a fixed location on disk
- an index into the inode set is sufficient to find it
- in LFS, a map is needed to locate an inode, since inodes are mixed with data in the log
10. Log-Structured File Systems
- Space management
- holes left by deleting files
- threading
- use the dispersed holes like a linked list
- fragmentation will get worse
- copying
- copy a file out of the log to leave a large hole
- expensive, especially for long-lived files
- Segments
- concept
- clean segments are linked (threading)
- segments with holes may be copied into a clean segment
- collect long-lived files into the same segment
- segment cleaning policy
- when? low watermark for clean segments
- how many segments? high watermark
- which segments? the most fragmented
- how to group files?
- files in the same directory
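The watermark-driven cleaning policy could look roughly like this. The watermark values and the live-fraction metric are illustrative assumptions, not the paper's actual parameters.

```python
# Segment-cleaning policy sketch: start cleaning when the number of
# clean segments falls below a low watermark, clean up to a high
# watermark, and pick the most fragmented (least live) segments first.

def clean_segments(segments, clean_count, low=2, high=4):
    """segments: list of (segment_id, live_fraction) for dirty segments.
    Returns the ids of segments to clean, most fragmented first."""
    if clean_count >= low:
        return []                     # enough clean segments already
    need = high - clean_count         # clean up to the high watermark
    # least live data first: cheapest segments to copy out
    victims = sorted(segments, key=lambda s: s[1])
    return [sid for sid, _ in victims[:need]]
```

Cleaning the emptiest segments first minimizes the live data that must be copied into a clean segment, which is the expensive part of copying-based space reclamation.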
11. Log-Structured File Systems
- Recovery
- checkpoints and roll-forward (NOT roll-back!!)
- possible since all file operations are in the log
- checkpoint
- a point in the log at which the file system is complete and consistent
- contains
- addresses of the inode maps
- segment usage table
- current time
- checkpoint region
- contains the checkpoint
- placed at a fixed location on disk
- operation
- 1. write out all modified information to disk
- 2. write out the checkpoint region
- on a crash
- roll forward through the operations logged after the last checkpoint
- if the crash occurs while writing a checkpoint, keep the old checkpoint
12. Roll-Forward
- Recover as much information as possible
- from the segment summary block:
- a new inode: there must be data blocks before it in the log, so just update the inode map
- data blocks without an inode: ignore them, since we don't know whether the data blocks are complete
- Each inode has a counter indicating how many directories refer to it
- the reference counter may be updated while the directory is not yet written
- the directory may be written while the reference counter is not yet updated
- LFS employs a special write-ahead log for directory changes
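The inode-map portion of roll-forward can be sketched as a scan over the log tail written after the last checkpoint. The record formats here are invented; only the decision rule (inodes recoverable, orphan data blocks ignored) comes from the slide.

```python
# Roll-forward sketch: starting from the checkpointed inode map, scan
# records appended after the checkpoint. A new inode means its data
# blocks preceded it in the log, so the operation is recoverable; data
# blocks with no following inode are dropped, since we cannot tell
# whether they are complete.

def roll_forward(checkpoint_imap, tail_records):
    """tail_records: list of (log_address, (kind, payload)) pairs."""
    imap = dict(checkpoint_imap)
    for addr, (kind, payload) in tail_records:
        if kind == "inode":
            # inode made it to the log: update the inode map
            imap[payload["inum"]] = addr
        # kind == "data" with no later inode is silently ignored
    return imap
```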
13. Informed Prefetching and Caching
14. Introduction
- Prefetching
- memory prefetching (into the cache memory)
- more of an architecture issue
- too fast to be controlled by any intelligence
- disk prefetching (into the memory buffer)
- disk latency is larger by orders of magnitude
- Pros and cons of prefetching
- reduces latency when the prefetched data is accessed
- file cache space may be wasted if the prefetched data goes unused
- it is difficult to know when the prefetched data will be used
- interference with other cached data and virtual memory is difficult to understand
- Assumptions
- disk parallelism is underutilized
- file performance matters more as CPUs get faster
- applications can provide hints
15. Limits of RAID
- RAID increases disk throughput when the workload can be processed in parallel
- very large accesses
- multiple concurrent accesses
- Many real I/O workloads are not parallel
- get a byte from a file
- think
- get another byte from (the same or another) file
- such a workload accesses only a single disk at a time
16. Real I/O Workload
- Recent trends
- faster CPUs generate I/O requests more often
- programs favor larger data objects
- the file cache hit ratio matters more than before
- Most of the workload is reads
- writes can be done behind, in parallel
- reads block the application
- most access patterns are predictable
- let's use that predictability as hints
17. Overview of Informed Prefetching
- The application discloses its future resource requirements
- the system makes the final decisions, not the applications
- Disclosing hints are issued through ioctl
- file specifier
- file name or file descriptor
- pattern specifier
- sequential
- list of <offset, length>
- What to do with the disclosed hints
- parallelize the I/O requests for RAID
- keep the data in the cache
- schedule the disk to reduce seek time
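The two pattern specifiers above can be modeled with a small data structure. This is only an illustration of the hint contents; the real system issued hints through ioctl, and the names and block size below are invented.

```python
# Disclosing-hints sketch: a hint names a file and either a whole-file
# sequential pattern or an explicit list of <offset, length> segments.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Hint:
    file_spec: str                          # file name or descriptor
    sequential: bool = False                # whole-file sequential read
    segments: Optional[List[Tuple[int, int]]] = None  # <offset, length> list

def expand_hint(hint, block_size=4096):
    """Turn a hint into block numbers to prefetch, in access order.
    Returns None for sequential hints (prefetch from block 0 onward)."""
    if hint.sequential:
        return None
    blocks = []
    for offset, length in hint.segments or []:
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.extend(range(first, last + 1))
    return blocks
```

Because the hint discloses future accesses rather than commanding prefetches, the system stays free to batch these blocks across RAID disks, cache them, or reorder them for seek time, as listed above.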
18. Informed Cache Manager Schematic
19. A System Model
- T_disk: latency of the disk fetch
- T_driver: buffer allocation, queueing at the driver, and interrupt service
20. Benefit of a Buffer
- T_stall(x): read stall time when there are x buffers for x prefetches
- T_pf(x): service time for a hinted read when there are x buffers
- benefit of using one more buffer
- (figure: a hint for block a is issued and a is used x accesses later; with x buffers, the application consumes x(T_cpu + T_hit + T_driver) between issue and use, plus T_stall)
21. Stall Time for a Disk Access
- At worst, a fetch takes T_disk
- before the x-th following request is generated, the CPU takes at least x(T_cpu + T_hit + T_driver), assuming all cache hits and no stalls
- so T_stall(x) = max(0, T_disk - x(T_cpu + T_hit + T_driver))
- prefetch horizon P(T_cpu): the distance at which T_stall becomes zero, i.e., there is no benefit in prefetching beyond this point
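The stall-time bullets above amount to two one-line formulas, sketched here with made-up timing numbers; only the relation between the terms comes from the slide.

```python
# Stall-time model sketch: a fetch takes T_disk at worst, while the CPU
# consumes x * (T_cpu + T_hit + T_driver) before the x-th hinted access.

import math

def t_stall(x, t_cpu, t_hit, t_driver, t_disk):
    # time the fetch still needs when the access finally arrives
    return max(0, t_disk - x * (t_cpu + t_hit + t_driver))

def prefetch_horizon(t_cpu, t_hit, t_driver, t_disk):
    # smallest x at which the stall reaches zero: no point
    # prefetching further ahead than this
    return math.ceil(t_disk / (t_cpu + t_hit + t_driver))
```

For example, with per-access CPU, hit, and driver costs of 1 time unit each and a 10-unit disk fetch, the horizon is ceil(10/3) = 4 accesses: hinting 4 or more accesses ahead hides the disk latency entirely.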
22. What Really Happens
- (figure; 3 buffers are assumed)
23. Benefit of a Single Buffer
- when used for prefetching
- when used for a demand miss
24. Model Verification
- The model underestimates the stall time because it neglects
- disk contention
- variation in disk service time (queueing effects)
- overall, it is a good estimator
25. Cost of Shrinking the LRU Buffer Cache
- hit ratio H(n) for a file cache with n buffers
- service time
- the cost of taking a buffer from the file cache
- H(n) varies with the workload
- needs dynamic run-time monitoring
26. Cost of Ejecting a Prefetched Block
- the cost is paid when the ejected block is accessed again later
- cost when that block is prefetched back x accesses in advance
- T_stall can be zero beyond the prefetch horizon
- ejection frees one buffer for y - x accesses
- the increase in service time per access is spread over those y - x accesses
- (figure: timeline showing the block prefetched y accesses ahead, ejected, re-prefetched x accesses ahead, and re-accessed; the region between eject and re-access is affected by the eviction)
28. Seeking the Global Optimum
- Normalize each estimate (LRU, hinted prefetch)
- multiply each by its usage rate
- unhinted demand access rate x LRU cache estimate (T_LRU)
- access rate to the hinted sequence x T_PF
- When the manager needs a new block
- each estimator selects its least valuable block
- hint estimator: the block accessed furthest in the future
- LRU estimator: the block at the bottom of the LRU stack
- the manager selects the least valuable block overall
- compare the benefit with the cost of the least valuable block
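The two-level selection above can be sketched as a small negotiation: each estimator nominates its cheapest block, and the manager evicts the globally cheapest one only if the benefit of the new use exceeds that cost. The candidate structure and cost numbers are invented for illustration.

```python
# Global-optimum sketch: estimators nominate their least valuable
# blocks (costs already normalized by access rate), and the manager
# evicts the cheapest nominee when the benefit outweighs its cost.

def choose_eviction(lru_candidates, hint_candidates, benefit):
    """Each candidate list holds (block_id, estimated_cost) pairs.
    Returns the block to evict, or None if no eviction is worthwhile."""
    nominees = []
    if lru_candidates:
        # LRU nominates the block at the bottom of its stack
        nominees.append(min(lru_candidates, key=lambda b: b[1]))
    if hint_candidates:
        # the hint estimator nominates the block used furthest away
        nominees.append(min(hint_candidates, key=lambda b: b[1]))
    if not nominees:
        return None
    block, cost = min(nominees, key=lambda b: b[1])
    return block if benefit > cost else None
```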
29. Real-World Estimator
- LRU cache hit ratio
- can be measured in the cache, but the cache size varies over time; use ghost buffers to measure up to the maximum cache size
- keeping history for each cache block is too much work; use segments
- Use a system-wide prefetch horizon
- an upper bound
- For T_eject, assume the prefetch happens at the prefetch horizon
30. After 4 Years
- Providing hints is too much of a burden on programmers
- automatic hint generation is desired
- there is idle CPU time while a program blocks on I/O
- speculative execution can provide hints for future I/O accesses
- Approach taken
- a kernel thread performs the speculative execution
- this speculating thread shares the address space
- Issues
- run-time overhead
- incorrectness
- may affect the correctness of the results
- incorrect hints may waste I/O bandwidth
31. Ensuring Program Correctness
- Software copy-on-write
- prevents code/data corruption
- on the first write to a memory region, make a copy
- insert code before every load/store to check whether it targets a copied region
- software fault isolation
- the checks are inserted into a copy of the code (shadow code)
- the original code is unchanged, so there is no overhead for normal execution
- The speculating thread issues no system calls
- system state is not changed by the speculative execution
- Signal handler
- catches all exceptions that might disturb normal execution
32. Generating Correct and Timely Hints
- Problems
- the speculating thread may lag behind, generating stale hints
- the speculating thread may stray from the actual execution path
- How to detect the problems
- the original thread checks the hint log prepared by the speculating thread
- if a hint is wrong, the original thread prepares a copy of its register set and sets a flag
- when the speculating thread is next invoked, it checks the flag
- if the flag is set, it restarts from the saved register set