Journaling File Systems

1
Journaling File Systems
UNIVERSITY of WISCONSIN-MADISON
Computer Sciences Department
CS 537: Introduction to Operating Systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau,
Haryadi S. Gunawi
  • Questions answered in this lecture:
      • VFS and FS operations
      • Why is it hard to maintain on-disk consistency?
      • How does the FSCK tool help with consistency?
      • What information is written to a journal?
      • What 3 journaling modes does Linux ext3 support?

2
Virtual File System (VFS)
  • Operations
      • File/Dir: open, close, chdir, link, unlink
        (delete), truncate, rename
      • Data: read, write, lseek
      • Access and info: stat, chmod, chown
  • Ext2/3 (or any other file system)
      • Knows its on-disk format
      • Has its own block allocation policies
  • VFS layer
      • Structure-independent code
      • Manages buffer cache, directory cache, generic
        inode descriptor, file descriptor
      • Defines a set of functions every file system has
        to implement

[Figure: the Application calls into VFS, which sits on top of
Linux Ext2/3, SGI XFS, ReiserFS, Sun ZFS, and IBM JFS]
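
One way to picture "a set of functions every file system has to implement" is an abstract interface that generic code calls without knowing the on-disk format. The sketch below is illustrative Python, not the Linux kernel's actual VFS interface; the names `FileSystem`, `ToyFS`, and `vfs_copy` are hypothetical.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical stand-in for the operations a VFS layer expects."""

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def write(self, path, data): ...

    @abstractmethod
    def unlink(self, path): ...


class ToyFS(FileSystem):
    """Toy in-memory file system; a real one would know its on-disk format."""

    def __init__(self):
        self.files = {}  # path -> contents

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data

    def unlink(self, path):
        del self.files[path]


def vfs_copy(fs, src, dst):
    """Structure-independent code: works with any FileSystem implementation."""
    fs.write(dst, fs.read(src))


fs = ToyFS()
fs.write("/a", b"hello")
vfs_copy(fs, "/a", "/b")
fs.unlink("/a")
print(sorted(fs.files))
```

`vfs_copy` never touches `ToyFS` internals, which is the point of the layering: a different file system can be swapped in as long as it implements the same three methods.
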
3
Multiple updates / ops
  • Write
      • Write to the next byte (to a data block)
      • Update block bitmap
      • Update metadata
  • Delete (e.g. rm /dir/file)
      • Release data blocks of file → update block bitmap
        (to free space)
      • Update the inode for file
      • Update inode bitmap
      • Update dir data block (remove directory entry)
  • And many more
  • What happens if a crash occurs in the middle?

4
Review: The I/O Path (Reads)
  • Read() from file
      • Check if block is in cache
        (the file cache is sometimes called the buffer cache)
      • If so, return block to user (1 in figure)
      • If not, read from disk, insert into cache, return
        to user (2 in figure)

[Figure: (1) block in cache: served from main memory (cache);
(2) block not in cache: read from disk, leaving a copy in cache]
5
Review: The I/O Path (Writes)
  • Write() to file
      • Write is buffered in memory (write-behind) (1 in figure)
      • Sometime later, the OS decides to write to disk
        (2 in figure)
  • Why delay writes?
      • Implications for performance
      • Implications for reliability

[Figure: (1) buffer the write in main memory (cache);
(2) later, write to disk]
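
The read and write paths reviewed on the last two slides can be sketched as a small write-behind cache. This is a toy model under stated assumptions: `BufferCache` is a hypothetical name, and a Python dict stands in for the disk. It shows why delayed writes help performance (the write returns once it is in memory) and hurt reliability (a crash before `flush` loses the update).

```python
class BufferCache:
    """Toy write-behind cache; `disk` is a dict standing in for a block device."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}     # block number -> contents
        self.dirty = set()  # blocks modified in memory but not yet on disk

    def read(self, blk):
        if blk not in self.cache:            # miss: fetch from disk, keep a copy
            self.cache[blk] = self.disk[blk]
        return self.cache[blk]

    def write(self, blk, data):
        self.cache[blk] = data               # step 1: buffer in memory only
        self.dirty.add(blk)

    def flush(self):
        for blk in sorted(self.dirty):       # step 2: later, write to disk
            self.disk[blk] = self.cache[blk]
        self.dirty.clear()


disk = {7: b"old"}
cache = BufferCache(disk)
cache.write(7, b"new")
before_flush = disk[7]   # a crash at this point would lose the update
cache.flush()
after_flush = disk[7]
print(before_flush, after_flush)
```
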
6
Many dirty blocks in memory: What order to
write to disk?
  • Example: Appending a new block to an existing file
      • Initially there is a data bitmap B, the file's inode I,
        and an unused data block D
      • After append: write data bitmap B' (for the new data
        block), write inode I' of the file (to add the new
        pointer and update the time), write new data block D'

[Figure: the updated B, I, and D are dirty in memory; in what
order (?, ?, ?) should they overwrite the old B, I, and D on disk?]
7
The Problems
  • Writes: Have to update disk with N writes
      • Writes are buffered in the first place, and then
        performed together later
      • But the disk performs only a single write atomically
      • Disk scheduler (e.g. C-LOOK) reorders the write
        sequence
  • Crashes: System may crash at an arbitrary point
      • Bad case: In the middle of an update sequence
  • Desire: To update on-disk structures atomically
      • Either all should happen or none

8
Example: Bitmap first
  • Write ordering: Bitmap (B'), Inode (I'), Data (D'),
    where a prime marks the updated version
  • But CRASH after B' has reached disk, before I' or D'
  • Result?
      • Inode is still the old inode (I); it doesn't
        point to block D
      • Data is still the old data (D)
      • Block D can never be used (the bitmap says D is used,
        but actually it is not, because no inode is pointing
        to D)

[Figure: B, I, and D in memory and on disk; only the bitmap
update has reached disk]
9
Example: Inode first
  • Write ordering: Inode (I'), Bitmap (B'), Data (D'),
    where a prime marks the updated version
  • But CRASH after I' has reached disk, before B' or D'
  • Result?
      • I' points to block D, which contains garbage
        (not the new data D')
      • B is the old bitmap, which says block D is unused
        (although there is an inode that already points
        to the data block)
      • Another user (I2) requests a block; the FS gives
        D to I2
      • I2 gets D, so D is pointed to by both I' and I2
        (security leak!)

[Figure: B, I, and D in memory and on disk; only the inode
update has reached disk]
10
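Example: Inode first
  • Write ordering: Inode (I'), Bitmap (B'), Data (D'),
    where a prime marks the updated version
  • CRASH after I' AND B' have reached disk, before D'
  • Result?
      • Better than previous example (no security leak)
      • But block D still contains garbage, so I' is
        pointing to garbage data

[Figure: B, I, and D in memory and on disk; the inode and
bitmap updates have reached disk, the data has not]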
11
Example: Data first
  • Write ordering: Data (D'), Bitmap (B'), Inode (I'),
    where a prime marks the updated version
  • CRASH after D' has reached disk, before I' or B'
  • Result?
      • Nothing bad happens; everything is consistent
      • Bitmap says the block that holds D' is free
      • No inode points to that block
      • Inode is still the old inode (I), which does not
        point to any data block

[Figure: B, I, and D in memory and on disk; only the data
block has reached disk]
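
The four crash examples above can be checked mechanically. The sketch below is a toy model, not file system code: the letters B, I, and D stand for the updated bitmap, inode, and data block writes, and `classify` and `crash_outcomes` are hypothetical helper names. For a given write ordering it lists the on-disk outcome at every possible crash point.

```python
def classify(on_disk):
    """Describe the file system state given which updated blocks reached disk."""
    b, i, d = "B" in on_disk, "I" in on_disk, "D" in on_disk
    if i and not b:
        return "inode points to a block the bitmap says is free"
    if b and not i:
        return "block leaked: marked used, but no inode points to it"
    if i and not d:   # at this point the bitmap and inode are both on disk
        return "metadata consistent, but inode points to garbage"
    return "consistent"


def crash_outcomes(order):
    """Map each crash point (prefix of completed writes) to its outcome."""
    return {order[:k]: classify(set(order[:k])) for k in range(len(order) + 1)}


for order in ("BID", "IBD", "DBI"):
    for done, outcome in crash_outcomes(order).items():
        print(f"order {order}, on disk {done or 'nothing'}: {outcome}")
```

Note that no ordering is safe at every crash point: even with data first, a crash after D and B but before I leaks the block. That is the motivation for making the whole update atomic.
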
12
Traditional Solution: FSCK
  • FSCK: file system checker
  • When system boots
      • Make multiple passes over file system, looking
        for inconsistencies
      • e.g., inode pointers and bitmaps, directory
        entries, inode reference counts
      • Ex1: If the bitmap says D is used but no inode is
        pointing to D, the bitmap is modified (D is not
        used)
      • Ex2: If two inodes point to the same data block,
        a clone of the data block is created, and
        one of the inodes will point to the new clone
        (hence, no sharing anymore)
      • Either fix automatically or punt to admin
  • Does fsck have to run upon every reboot?
      • Yes, if the FS does not know whether there was a
        crash in the middle of ops
      • No, if the FS can tell whether an operation was
        left unfinished
          • E.g. put a dirty bit in the superblock; set the
            dirty bit to 1 before starting the operations,
            and clear it once the operations have finished
          • If upon reboot the dirty bit in the superblock
            is 1, must run fsck
          • Problem: adds runtime overhead (must write to
            superblock for each update sequence)
  • Main problem with fsck: Performance
      • Sometimes takes hours to run on large disk
        volumes
      • Inconsistency can only be detected if the whole
        content of the file system is checked → must scan
        the whole file system (more precisely, must scan
        all metadata in the file system)
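
A minimal sketch of the bitmap pass, assuming a toy representation (a dict for the bitmap and a dict of inode block lists); `fsck_bitmap` is a hypothetical name and this is not how the real fsck tool is structured. It covers Ex1 by rebuilding the bitmap from inode pointers, and only reports the shared blocks of Ex2 rather than cloning them.

```python
def fsck_bitmap(bitmap, inodes):
    """Rebuild block usage from inode pointers and repair the bitmap.

    bitmap: dict block number -> bool (True = marked used)
    inodes: dict inode number -> list of block numbers it points to
    Returns (repaired bitmap, blocks referenced by more than one inode).
    """
    owners = {}
    for ino, blocks in inodes.items():   # must visit every inode: the slow part
        for blk in blocks:
            owners.setdefault(blk, []).append(ino)
    repaired = {blk: blk in owners for blk in bitmap}
    shared = {blk: inos for blk, inos in owners.items() if len(inos) > 1}
    return repaired, shared


bitmap = {10: True, 11: True, 12: False}   # block 11 is marked used...
inodes = {1: [10], 2: [10]}                # ...but no inode points to it
repaired, shared = fsck_bitmap(bitmap, inodes)
print(repaired, shared)
```

The loop over every inode is what makes the real tool slow on large volumes: nothing tells it where the damage is, so all metadata has to be read.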

13
How To Avoid The Long Scan?
  • Idea
      • Do not perform in-place update
      • Write something to another area on the disk
        before updating its data structures
      • Called the write-ahead log or journal
      • If all updates have been successfully reflected
        to the journal, then all the updates can be
        reflected to the final place (this is called the
        checkpointing process)
  • When crash occurs, look through log and see what
    was going on
      • Use contents of log to fix file system structures
      • The process is called recovery

14
Case Study: Linux ext3
  • Journal location
      • EITHER on a separate device partition
      • OR just a special file within ext2
  • Three separate modes of operation
      • Data: All data and metadata is journaled
      • Ordered, Writeback: Just metadata is journaled
  • First focus: Data journaling mode

15
Transactions in ext3 Data Journaling Mode
  • Same example: Update Inode (I), Bitmap (B), Data
    (D)
  • First, write to journal
  • Each write is formed into a transaction
  • A transaction consists of
      • Journal descriptor block (Dr)
          • Marks the beginning of a transaction (Tx begin)
          • Contains the final locations of the blocks saved
            in the journal data blocks
          • Contains the transaction number
      • Journal data blocks
          • All blocks that must be updated atomically on
            disk, e.g. in this example I, B, and D
      • Journal commit block (C)
          • Marks the end of the transaction (Tx end)
          • Also contains the transaction number

[Figure: descriptor block Dr for Tx 2 lists 3 blocks and their
final locations (1 → 1000, 2 → 2000, 3 → 3000); the journal
holds Dr, B, I, D, C; the fixed locations of I, B, and D are
blk 1000, blk 2000, and blk 3000]
16
Write to the journal (sequence)
  • I want to write B, I, and D
  • Please give me a transaction (e.g. got tx 2)
  • Write tx 2 to the journal superblock (so that we
    know later tx 2 is pending)
  • Prepare journal descriptor block
      • Set tx 2, set the final locations of the
        journal data blocks, set the number of blocks
  • Write journal descriptor block and journal data
    blocks
  • Write the journal commit block
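
The sequence above can be sketched as follows. This is a toy in-memory model, not ext3's on-disk journal format: the dict layout, the field names, and `journal_write` are all assumptions made for illustration, and the block numbers are placeholders echoing the earlier figure.

```python
def journal_write(journal, tx_id, updates):
    """Append one transaction to `journal` (a dict with 'pending' and 'blocks').

    updates: list of (final_block_number, contents) pairs, e.g. I, B and D.
    """
    journal["pending"] = tx_id                       # journal superblock: tx is pending
    descriptor = {
        "type": "descriptor",                        # marks Tx begin
        "tx": tx_id,
        "locations": [blk for blk, _ in updates],    # final on-disk locations
        "count": len(updates),                       # number of journal data blocks
    }
    journal["blocks"].append(descriptor)
    for _, contents in updates:                      # journal data blocks
        journal["blocks"].append({"type": "data", "tx": tx_id, "contents": contents})
    journal["blocks"].append({"type": "commit", "tx": tx_id})   # marks Tx end


journal = {"pending": None, "blocks": []}
journal_write(journal, 2, [(1000, "I"), (2000, "B"), (3000, "D")])
print([b["type"] for b in journal["blocks"]])
```

Nothing here touches the fixed locations 1000, 2000, or 3000; that only happens later, at checkpoint time.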

17
Transactions in ext3 Data Journaling Mode
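  • Second, checkpoint data to fixed ext3
    structures
      • Copy B, I, and D to their fixed file system
        locations

[Figure: the journal blocks B, I, and D (between Dr and C) are
copied to the fixed locations of I, B, and D at blk 1000,
blk 2000, and blk 3000]
  • Finally, free Tx in journal
      • Journal is a fixed-size circular buffer;
        entries must be periodically freed

[Figure: I, B, and D are now at blk 1000, blk 2000, and
blk 3000; the journal entries Dr, B, I, D, C are freed]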
18
Upon reboot
  • Check the journal superblock: is there any
    pending transaction?
  • If yes (e.g. tx 2), scan the journal area to find
    a journal descriptor for tx 2
  • After finding the journal descriptor block, ask:
    Is there a commit block?
      • If not, release the transaction
      • If yes, need to checkpoint the transaction
      • If checkpoint is successful, clear the
        transaction by updating the journal superblock
        (so that we know there is no pending
        transaction)
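
The reboot procedure can be sketched as a redo-or-discard decision. As before this is a toy model under stated assumptions: the journal is a dict with hypothetical field names, `recover` and `make_journal` are illustrative helpers, and a dict stands in for the disk.

```python
def recover(journal, disk):
    """Replay the pending transaction if it committed; otherwise discard it."""
    tx = journal["pending"]
    if tx is None:
        return "clean"
    blocks = [b for b in journal["blocks"] if b["tx"] == tx]
    descriptor = next((b for b in blocks if b["type"] == "descriptor"), None)
    committed = any(b["type"] == "commit" for b in blocks)
    if descriptor is None or not committed:
        outcome = "discarded"                      # Tx begin without Tx end
    else:
        data = [b["contents"] for b in blocks if b["type"] == "data"]
        for blk, contents in zip(descriptor["locations"], data):
            disk[blk] = contents                   # checkpoint (redo) to final location
        outcome = "replayed"
    journal["pending"] = None                      # no pending transaction any more
    journal["blocks"] = [b for b in journal["blocks"] if b["tx"] != tx]
    return outcome


def make_journal(committed):
    """Build a toy journal holding tx 2, with or without its commit block."""
    blocks = [
        {"type": "descriptor", "tx": 2, "locations": [1000, 2000, 3000]},
        {"type": "data", "tx": 2, "contents": "new I"},
        {"type": "data", "tx": 2, "contents": "new B"},
        {"type": "data", "tx": 2, "contents": "new D"},
    ]
    if committed:
        blocks.append({"type": "commit", "tx": 2})
    return {"pending": 2, "blocks": blocks}


disk_a = {1000: "old I", 2000: "old B", 3000: "old D"}
result_a = recover(make_journal(committed=True), disk_a)
disk_b = {1000: "old I", 2000: "old B", 3000: "old D"}
result_b = recover(make_journal(committed=False), disk_b)
print(result_a, disk_a)
print(result_b, disk_b)
```

Replaying is safe to repeat: copying the same journal blocks to the same final locations a second time leaves the disk unchanged, so a crash during recovery just means recovery runs again.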

19
What if there's a Crash?
  • Recovery: Go through log and redo
    operations that have been successfully committed
    to log
  • What if
      • Tx begin but not Tx end in log?
          • Discard the transaction
      • Tx begin through Tx end are in log, but I, B,
        and D have not yet been checkpointed?
          • Keep that transaction on the disk until the
            journal data blocks have been checkpointed
            successfully
      • Tx is in log, I, B, D have been checkpointed,
        but Tx has not been freed from log?
          • In terms of correctness, there is no problem
          • But the journal size is usually fixed (e.g. X
            MB), so eventually the log will be full and
            some transactions must be freed
          • Again: Tx can only be freed if all its journal
            data blocks have been checkpointed successfully!
  • Performance? (As compared to fsck?)
      • Much faster
      • Only read the transactions that haven't been
        checkpointed
      • Journal size is fixed, so no need to scan the
        entire file system; simply scan the journal area
        only (X MB)

20
Complication: Disk Scheduling
  • Problem
      • Low levels of the I/O subsystem in the OS, and even
        the disk/RAID itself, may reorder requests
  • How does this affect Tx management?
      • Do we write the journal blocks (e.g. Dr, B, I,
        D, C) in parallel?
      • No! Because of this reordering, when all these
        journal blocks are sent to the disk, it could be
        the case that Dr and C are written before the
        journal data blocks!
  • Where is it OK to issue writes in parallel?
      • Tx begin
      • I, B, D
      • Tx end
      • Checkpoint: I, B, D copied to final destinations
      • Tx freed in journal
  • Synchronization points
      • Write Tx begin, B, I, D in parallel (then wait
        until they finish)
      • Write Tx end (wait)
      • Checkpoint B, I, D in parallel (wait)
      • Tx freed in journal
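
The effect of the first synchronization point can be checked by brute force. The sketch below is a toy model, assuming that writes issued together may complete in any order and that a batch starts only after the previous one has finished; `crash_states` and `commit_without_data` are hypothetical helper names.

```python
from itertools import permutations


def crash_states(batches):
    """Enumerate the sets of blocks that may be on disk at a crash.

    Writes within a batch are issued in parallel, so the disk may complete
    them in any order; a batch starts only after the previous one finished.
    """
    states = set()
    done = frozenset()
    for batch in batches:
        for order in permutations(batch):
            for k in range(len(order) + 1):
                states.add(done | frozenset(order[:k]))
        done = done | frozenset(batch)
    return states


def commit_without_data(states):
    """True if some crash state has the commit block C but lacks Dr, B, I or D."""
    needed = {"Dr", "B", "I", "D"}
    return any("C" in s and not needed <= s for s in states)


unsafe = crash_states([["Dr", "B", "I", "D", "C"]])    # everything in parallel
safe = crash_states([["Dr", "B", "I", "D"], ["C"]])    # wait before writing C
print(commit_without_data(unsafe), commit_without_data(safe))
```

With the wait in place, a commit block on disk implies that the descriptor and all journal data blocks are on disk too, which is exactly what recovery relies on when it decides to replay.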

21
Problem with Data Journaling
  • Data journaling: Lots of extra writes
      • All data committed to disk twice (once in
        journal, once to final location)
      • Overkill if only goal is to keep metadata
        consistent
  • Instead, use ext3 writeback mode
      • Just journals metadata
      • Writes data to final location directly, at any
        time
      • Problem: B and I are written to the journal, then
        crash (D has not been written to the disk) → I
        points to a valid block, but the content is
        garbage
  • Solution: Ordered mode
      • Write all data blocks to their final location
        (e.g. write D to its final location), then wait
        until finished
      • Write metadata to the journal
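
The difference between writeback and ordered mode comes down to where the data write sits relative to the metadata commit. The sketch below is a toy model with hypothetical step names; the writeback sequence shown is just one order that mode permits, since it may write data at any time.

```python
def crash_points(steps):
    """Return the set of completed steps at each possible crash point."""
    return [set(steps[:k]) for k in range(len(steps) + 1)]


def garbage_visible(done):
    """Committed metadata points at D's block, but D itself never reached disk."""
    return "commit metadata" in done and "write D" not in done


writeback = ["journal B, I", "commit metadata", "write D"]   # one legal writeback order
ordered = ["write D", "journal B, I", "commit metadata"]     # ordered mode: D first

writeback_bad = any(garbage_visible(s) for s in crash_points(writeback))
ordered_bad = any(garbage_visible(s) for s in crash_points(ordered))
print(writeback_bad, ordered_bad)
```

In the ordered sequence a crash can still lose the append, but it can never leave committed metadata pointing at a block whose data was not written.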

22
Conclusions
  • Journaling
      • Many modern file systems use journaling to reduce
        recovery time during startup (e.g., Linux ext3,
        ReiserFS, SGI XFS, IBM JFS, NTFS)
      • Simple idea: Use write-ahead log to record
        some info about what you are going to do before
        doing it
      • Turns multi-write update sequence into a
        single atomic update (all or nothing)
      • Some performance overhead: Extra writes to
        journal
          • Worth the cost?