Title: Journaling File Systems
1. Journaling File Systems
UNIVERSITY of WISCONSIN-MADISON, Computer Sciences Department
CS 537: Introduction to Operating Systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Haryadi S. Gunawi
- Questions answered in this lecture
- VFS and FS operations
- Why is it hard to maintain on-disk consistency?
- How does the FSCK tool help with consistency?
- What information is written to a journal?
- What 3 journaling modes does Linux ext3 support?
2. Virtual File System (VFS)
- Operations
  - File/Dir: open, close, chdir, link, unlink (delete), truncate, rename
  - Data: read, write, lseek
  - Access and info: stat, chmod, chown
- Ext2/3 (or any other file system)
  - Knows its on-disk format
  - Has its own block allocation policies
- VFS layer
  - Structure-independent code
  - Manages buffer cache, directory cache, generic inode descriptor, file descriptor
  - Defines a set of functions every file system has to implement
[Figure: layer diagram with Application on top, then VFS, then the concrete file systems: Linux Ext2/3, SGI XFS, ReiserFS, Sun ZFS, IBM JFS.]
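The idea that the VFS "defines a set of functions every file system has to implement" can be sketched with an abstract base class. This is only an illustration: `FileSystem`, `ToyMemFS`, and `vfs_write_file` are hypothetical names, and the real Linux VFS interface is written in C and is much larger.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical per-file-system interface that the VFS layer calls into."""

    @abstractmethod
    def create(self, path): ...

    @abstractmethod
    def write(self, path, offset, data): ...

    @abstractmethod
    def read(self, path, offset, size): ...

    @abstractmethod
    def unlink(self, path): ...


class ToyMemFS(FileSystem):
    """Toy in-memory file system; a real one would know its on-disk format."""

    def __init__(self):
        self.files = {}

    def create(self, path):
        self.files[path] = bytearray()

    def write(self, path, offset, data):
        buf = self.files[path]
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # zero-fill any hole
        buf[offset:offset + len(data)] = data
        return len(data)

    def read(self, path, offset, size):
        return bytes(self.files[path][offset:offset + size])

    def unlink(self, path):
        del self.files[path]


def vfs_write_file(fs, path, data):
    """Structure-independent code: works with any FileSystem implementation."""
    fs.create(path)
    fs.write(path, 0, data)
    return fs.read(path, 0, len(data))
```

Any other class that implements the same four methods could be passed to `vfs_write_file` without changing it, which is the point of the VFS layer.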
3. Multiple updates / ops
- Write
  - Write to the next byte (to a data block)
  - Update block bitmap
  - Update meta-data
- Delete (e.g. rm /dir/file)
  - Release data blocks of file → update block bitmap (to free space)
  - Update the inode for file
  - Update inode bitmap
  - Update dir data block (remove directory entry)
- And many more
- What happens if a crash occurs in the middle?
4. Review: The I/O Path (Reads)
- Read() from file
  - Check if block is in cache
    - (file cache is sometimes called buffer cache)
  - If so, return block to user (1 in figure)
  - If not, read from disk, insert into cache, return to user (2 in figure)
[Figure: main memory (cache) above the disk. Path 1: block in cache. Path 2: block not in cache, read from disk and leave a copy in cache.]
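A minimal sketch of the read path, assuming the "disk" is just a Python dict from block number to contents. `BufferCache` and its `disk_reads` counter are hypothetical names used only for illustration.

```python
class BufferCache:
    """Toy block cache in front of a 'disk' (a dict: block number -> bytes)."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.disk_reads = 0

    def read(self, block_no):
        if block_no in self.cache:       # path 1: block is in cache
            return self.cache[block_no]
        data = self.disk[block_no]       # path 2: not in cache, go to disk
        self.disk_reads += 1
        self.cache[block_no] = data      # leave a copy in the cache
        return data
```

Reading the same block twice goes to the disk only once; the second read is served from memory.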
5. Review: The I/O Path (Writes)
- Write() to file
  - Write is buffered in memory (write behind) (1 in figure)
  - Sometime later, OS decides to write to disk (2 in figure)
- Why delay writes?
  - Implications for performance
  - Implications for reliability
[Figure: main memory (cache) above the disk. Step 1: buffer the write in memory. Step 2: later, write to disk.]
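Both implications of delaying writes can be seen in a small sketch, again assuming a dict stands in for the disk. `WriteBehindCache`, `flush`, and `crash` are hypothetical names, not an OS interface.

```python
class WriteBehindCache:
    """Toy write-behind buffer in front of a 'disk' (a dict)."""

    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}
        self.disk_writes = 0

    def write(self, block_no, data):
        self.dirty[block_no] = data      # step 1: buffer in memory only

    def flush(self):
        for block_no, data in self.dirty.items():
            self.disk[block_no] = data   # step 2: later, write to disk
            self.disk_writes += 1
        self.dirty.clear()

    def crash(self):
        self.dirty.clear()               # memory contents are lost
```

Two writes to the same block before a flush cost one disk write (performance), but anything still buffered at the time of a crash never reaches the disk (reliability).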
6. Many dirty blocks in memory: What order to write to disk?
- Example: Appending a new block to an existing file
  - Initially there is a data bitmap B, the file's inode I, and an unused data block D
  - After append
    - Write data bitmap B (for the new data block), write inode I of the file (to add the new pointer, update the time), write the new data block D
[Figure: updated B, I, and D in memory; the order in which they reach B, I, and D on disk is unknown.]
7. The Problems
- Writes: Have to update disk with N writes
  - Writes are buffered in the first place, and then are issued together later
  - But the disk performs only a single write atomically
- Disk scheduler
  - Disk scheduler (e.g. C-LOOK) reorders the write sequence
- Crashes: System may crash at an arbitrary point
  - Bad case: In the middle of an update sequence
- Desire: To update on-disk structures atomically
  - Either all should happen or none
8. Example: Bitmap first
- Write ordering: Bitmap (B), Inode (I), Data (D)
- But CRASH after B has reached disk, before I or D
- Result?
  - Inode is still the old inode; it doesn't point to D
  - Data is still the old data
  - D can never be used (the bitmap says D is used, but it actually is not, because no inode points to D)
[Figure: B, I, and D in memory and on disk.]
9. Example: Inode first
- Write ordering: Inode (I), Bitmap (B), Data (D)
- But CRASH after I has reached disk, before B or D
- Result?
  - I points to block D, which contains garbage (not the new data)
  - B is the old bitmap, which says block D is unused (although there is an inode that already points to the data block)
  - Another user (inode I2) requests a block, and the FS gives D to I2
  - I2 gets D, so D is pointed to by both I and I2 (security leak!)
[Figure: B, I, and D in memory and on disk.]
10. Example: Inode first
- Write ordering: Inode (I), Bitmap (B), Data (D)
- CRASH after I AND B have reached disk, before D
- Result?
  - Better than the previous example (no security leak)
  - But D still contains garbage, so I is pointing to garbage data
[Figure: B, I, and D in memory and on disk.]
11. Example: Data first
- Write ordering: Data (D), Bitmap (B), Inode (I)
- CRASH after D has reached disk, before I or B
- Result?
  - Nothing bad happens; everything is consistent
  - Bitmap says the block that holds D is free
  - No inode points to that block D
  - Inode is still the old inode, which does not point to any data block
[Figure: B, I, and D in memory and on disk.]
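The four crash examples can be reproduced with a small classifier. This is a sketch under simplifying assumptions: a write either fully reaches the disk or not at all, and the three outcome strings are hypothetical labels for the cases described in the examples.

```python
from itertools import permutations


def crash_outcome(order, writes_done):
    """Classify the on-disk state when only the first `writes_done` writes landed.

    `order` is a sequence of "B" (bitmap), "I" (inode), "D" (data).
    """
    on_disk = set(order[:writes_done])
    bitmap_used = "B" in on_disk
    inode_points = "I" in on_disk
    data_written = "D" in on_disk
    if inode_points and not bitmap_used:
        return "inode points to a block marked free (block may be handed out twice)"
    if inode_points and not data_written:
        return "inode points to garbage data"
    if bitmap_used and not inode_points:
        return "leaked block (marked used, no inode points to it)"
    return "consistent"


if __name__ == "__main__":
    # Enumerate every write ordering and every possible crash point.
    for order in permutations("BID"):
        for n in range(4):
            print("".join(order), "crash after", n, "writes:", crash_outcome(order, n))
```

Enumerating all orderings shows that every ordering has at least one crash point that is not "consistent", which is why ordering alone does not solve the problem.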
12. Traditional Solution: FSCK
- FSCK: file system checker
- When system boots
  - Make multiple passes over the file system, looking for inconsistencies
    - e.g., inode pointers and bitmaps, directory entries, inode reference counts
    - Ex1: the bitmap says D is used, but no inode is pointing to D; the bitmap is then modified (D is marked not used)
    - Ex2: two inodes point to the same data block; a clone of the data block is created, and one of the inodes is pointed to the new clone (hence, no sharing anymore)
  - Either fix automatically or punt to admin
- Does fsck have to run upon every reboot?
  - Yes, if the FS does not know whether there was a crash in the middle of ops
  - No, if the FS can tell whether an operation was left unfinished
    - E.g. put a dirty bit in the superblock; set the dirty bit to 1 before starting the operations, and clear it once the operations have finished
    - If the dirty bit in the superblock is 1 upon reboot, fsck must run
    - Problem: adds runtime overhead (must write to the superblock for each update sequence)
- Main problem with fsck: Performance
  - Sometimes takes hours to run on large disk volumes
  - Inconsistency can only be detected if the whole content of the file system is checked → must scan the whole file system (more precisely, must scan all metadata in the file system)
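A toy version of the two example checks (Ex1 and Ex2) might look as follows. It is a sketch, not the real fsck: the bitmap is modeled as a set of used block numbers, inodes as a dict of block lists, and the function only reports shared blocks instead of cloning them.

```python
def fsck(bitmap, inodes):
    """Toy consistency pass.

    bitmap: set of block numbers marked as used.
    inodes: dict of inode number -> list of block numbers it points to.
    Returns (repaired bitmap, list of human-readable problems found).
    """
    problems = []
    owners = {}
    for ino, blocks in sorted(inodes.items()):   # must visit every inode
        for blk in blocks:
            owners.setdefault(blk, []).append(ino)

    for blk in sorted(bitmap - set(owners)):
        problems.append(f"block {blk} marked used but no inode points to it: freeing")
    for blk, inos in sorted(owners.items()):
        if blk not in bitmap:
            problems.append(f"block {blk} used by inodes {inos} but marked free: marking used")
        if len(inos) > 1:
            problems.append(f"block {blk} shared by inodes {inos}: needs a clone")
    return set(owners), problems
```

Note that the function has to walk every inode before it can decide anything about a single block, which mirrors why fsck must scan all metadata.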
13. How To Avoid The Long Scan?
- Idea
  - Do not perform the in-place update directly
  - Write something to another area on the disk before updating the file system's data structures
    - Called the write-ahead log or journal
  - If all updates have been successfully written to the journal, then all the updates can be applied to their final locations (this is called the checkpointing process)
- When a crash occurs, look through the log and see what was going on
  - Use contents of the log to fix file system structures
  - The process is called recovery
14. Case Study: Linux ext3
- Journal location
  - EITHER on a separate device partition
  - OR just a special file within ext2
- Three separate modes of operation
  - Data: All data and metadata is journaled
  - Ordered, Writeback: Just metadata is journaled
- First focus: Data journaling mode
15. Transactions in ext3: Data Journaling Mode
- Same example: Update Inode (I), Bitmap (B), Data (D)
- First, write to journal
  - Each write is formed into a transaction
  - A transaction consists of
    - Journal descriptor block (Dr)
      - Marks the beginning of a transaction (Tx begin)
      - Contains the final locations of the blocks saved in the journal data blocks
      - Contains the transaction number
    - Journal data blocks
      - All blocks that must be updated atomically on disk, e.g. in this example I, B, and D
    - Journal commit block (C)
      - Marks the end of the transaction (Tx end)
      - Also contains the transaction number
[Figure: the journal holds Dr, B, I, D, C. Dr records Tx 2 with 3 blocks and their final locations (1: 1000, 2: 2000, 3: 3000). The final on-disk copies of I, B, and D live at blk 1000, blk 2000, and blk 3000.]
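The layout of one transaction can be sketched with two small record types. `Descriptor`, `Commit`, and `build_transaction` are hypothetical names, and the block contents and the block-to-location mapping in the usage example are placeholders; the real ext3 on-disk structures are defined in C and carry more fields.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:            # "Dr": Tx begin
    tx_id: int
    final_locations: list    # final block number of each journaled block, in order


@dataclass
class Commit:                # "C": Tx end
    tx_id: int


def build_transaction(tx_id, updates):
    """Lay out one transaction as the sequence of blocks appended to the journal.

    `updates` maps final block number -> new block contents.
    """
    locations = sorted(updates)
    records = [Descriptor(tx_id, locations)]
    records += [updates[loc] for loc in locations]   # journal data blocks
    records.append(Commit(tx_id))
    return records
```

For example, `build_transaction(2, {1000: b"I", 2000: b"B", 3000: b"D"})` yields five journal blocks: the descriptor, three data blocks, and the commit block.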
16. Write to the journal (sequence)
- I want to write B, I, and D
- Please give me a transaction (e.g. got tx 2)
- Write tx2 to the journal superblock (so that we know later that tx2 is pending)
- Prepare journal descriptor block
  - Set tx 2, set the final locations of the journal data blocks, set the number of blocks
- Write journal descriptor block and journal data blocks
- Write the journal commit block
17. Transactions in ext3: Data Journaling Mode
- Second, checkpoint data to fixed ext3 structures
  - Copy B, I, and D to their fixed file system locations
[Figure: B, I, and D copied from the journal (Dr, B, I, D, C) to blk 1000, blk 2000, and blk 3000.]
- Finally, free Tx in journal
  - Journal is a fixed-size circular buffer; entries must be periodically freed
[Figure: I, B, and D at their final locations (blk 1000, blk 2000, blk 3000); the journal space holding Dr, B, I, D, C is freed.]
18. Upon reboot
- Check the journal superblock: is there any pending transaction?
- If yes (e.g. tx2), scan the journal area to find a journal descriptor for tx2
- After finding the journal descriptor block, ask: Is there a commit block?
  - If not, release the transaction
  - If yes, need to checkpoint the transaction
- If checkpoint is successful, clear the transaction by updating the journal superblock (so that we know there is no pending transaction)
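The reboot steps above can be sketched as a single function. The journal is modeled as a Python list of tuples and the disk as a dict; these representations and the name `recover` are assumptions made for illustration, not the ext3 format.

```python
def recover(journal, pending_tx, disk):
    """Replay `pending_tx` from `journal` into `disk` if it was committed.

    journal: list of records, each one of
             ("begin", tx, [final locations]), ("data", payload), ("commit", tx).
    pending_tx: transaction number found in the journal superblock, or None.
    Returns the new pending value for the journal superblock (None = nothing pending).
    """
    if pending_tx is None:
        return None                           # no pending transaction
    for i, rec in enumerate(journal):
        if rec[0] == "begin" and rec[1] == pending_tx:
            locations = rec[2]
            end = i + 1 + len(locations)
            data = journal[i + 1:end]
            committed = (
                end < len(journal)
                and journal[end] == ("commit", pending_tx)
                and all(r[0] == "data" for r in data)
            )
            if committed:
                for loc, (_, payload) in zip(locations, data):
                    disk[loc] = payload       # checkpoint (redo) to final location
            break                             # not committed: just release it
    return None                               # clear the pending transaction
```

Only the journal is scanned here; the rest of the file system is never read, in contrast to a full fsck pass.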
19. What if there's a Crash?
- Recovery: Go through the log and redo operations that have been successfully committed to the log
- What if
  - Tx begin but not Tx end in log?
    - Discard the transaction
  - Tx begin through Tx end are in log, but I, B, and D have not yet been checkpointed?
    - Keep that transaction on the disk until the journal data blocks have been checkpointed successfully
  - Tx is in log, I, B, D have been checkpointed, but Tx has not been freed from log?
    - In terms of correctness, there is no problem
    - But the journal size is usually fixed (e.g. X MB), so eventually the log will be full and some transactions must be freed
    - Again: a Tx can only be freed if all its journal data blocks have been checkpointed successfully!
- Performance? (As compared to fsck?)
  - Much faster
  - Only read the transactions that haven't been checkpointed
  - Journal size is fixed, so no need to scan the entire file system; simply scan the journal area only (X MB)
20. Complication: Disk Scheduling
- Problem
  - Low levels of the I/O subsystem in the OS, and even the disk/RAID itself, may reorder requests
- How does this affect Tx management?
  - Do we write the journal blocks (e.g. Dr, B, I, D, C) in parallel?
  - No! Because of this reordering, when all these journal blocks are sent to the disk together, Dr and C could be written before the journal data blocks!
- Where is it OK to issue writes in parallel?
  - Tx begin
  - I, B, D
  - Tx end
  - Checkpoint: I, B, D copied to final destinations
  - Tx freed in journal
- Synchronization points
  - Write Tx begin, B, I, D in parallel (then wait until they finish)
  - Write Tx end (wait)
  - Checkpoint B, I, D in parallel (wait)
  - Tx freed in journal
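Why the wait before Tx end matters can be checked by brute force. The sketch below assumes that writes inside one batch may reach the disk in any order and that a crash can happen after any single write; `commit_without_full_tx` is a hypothetical helper name.

```python
from itertools import permutations


def commit_without_full_tx(batches):
    """True if some reordering plus a crash leaves C on disk without Dr, B, I, D.

    `batches` is a list of lists of block names. Writes inside one batch may be
    reordered freely; a batch starts only after the previous batch has fully
    finished (a synchronization point).
    """
    required = {"Dr", "B", "I", "D"}
    done = set()
    for batch in batches:
        for order in permutations(batch):
            on_disk = set(done)
            for blk in order:                 # a crash may follow any write
                on_disk.add(blk)
                if "C" in on_disk and not required <= on_disk:
                    return True
        done |= set(batch)
    return False
```

Issuing all five journal blocks as one batch can leave a commit block for an incomplete transaction, while issuing `["Dr", "B", "I", "D"]` first and `["C"]` only after they finish cannot.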
21. Problem with Data Journaling
- Data journaling: Lots of extra writes
  - All data committed to disk twice (once in journal, once to final location)
  - Overkill if the only goal is to keep metadata consistent
- Instead, use ext3 writeback mode
  - Just journals metadata
  - Writes data to final location directly, at any time
  - Problem: B and I are written to the journal, then a crash occurs (D has not been written to the disk) → I points to a valid block, but the content is garbage
- Solution: Ordered mode
  - Write all data blocks to their final location (e.g. write D to its final location), then wait until finished
  - Write metadata to the journal
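The difference between the three modes can be summarized in a small table-like sketch. The function names and the dictionary keys are hypothetical, and descriptor and commit blocks are left out of the write count for simplicity.

```python
def write_plan(mode, data, metadata):
    """Describe where each block is written under a given journaling mode.

    before_commit: written to its final location and waited for before the
                   journal commit
    journal:       written into the journal (and checkpointed later)
    any_time:      written to its final location with no ordering guarantee
    """
    if mode == "data":
        return {"before_commit": [], "journal": metadata + data, "any_time": []}
    if mode == "ordered":
        return {"before_commit": data, "journal": metadata, "any_time": []}
    if mode == "writeback":
        return {"before_commit": [], "journal": metadata, "any_time": data}
    raise ValueError(f"unknown mode: {mode}")


def block_writes(plan):
    """Count block writes; journaled blocks are written twice (journal + checkpoint)."""
    return 2 * len(plan["journal"]) + len(plan["before_commit"]) + len(plan["any_time"])


def data_durable_at_commit(plan, data):
    """True if every data block is on disk somewhere when the metadata commits."""
    return set(data) <= set(plan["journal"]) | set(plan["before_commit"])
```

With `data=["D"]` and `metadata=["B", "I"]`, data mode costs 6 block writes and the other two modes cost 5, but only writeback mode can commit metadata while D is still missing from the disk.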
22. Conclusions
- Journaling
  - Many modern file systems use journaling to reduce recovery time during startup (e.g., Linux ext3, ReiserFS, SGI XFS, IBM JFS, NTFS)
  - Simple idea: Use a write-ahead log to record some info about what you are going to do before doing it
  - Turns a multi-write update sequence into a single atomic update (all or nothing)
  - Some performance overhead: Extra writes to journal
    - Worth the cost?