Transcript and Presenter's Notes

Title: Improving File System Reliability with I/O Shepherding


1
Improving File System Reliability with I/O Shepherding
  • Haryadi S. Gunawi,
  • Vijayan Prabhakaran, Swetha Krishnan,
  • Andrea C. Arpaci-Dusseau,
  • Remzi H. Arpaci-Dusseau

University of Wisconsin - Madison

2
Storage Reality
  • Complex Storage Subsystem
  • Mechanical/electrical failures, buggy drivers
  • Complex Failures
  • Intermittent faults, latent sector errors,
    corruption, lost writes, misdirected writes, etc.
  • FS Reliability is important
  • Managing disk and individual block failures

File System
3
File System Reality
  • Good news
  • Rich literature
  • Checksum, parity, mirroring
  • Versioning, physical/logical identity
  • Important for both single-disk and multiple-disk settings
  • Bad news
  • File system reliability is broken [SOSP '05]
  • Unlike other components (performance, consistency)
  • Reliability approaches are hard to understand and evolve

4
Broken FS Reliability
  • Lack of good reliability strategy
  • No remapping, checksumming, redundancy
  • Existing strategy is coarse-grained
  • Mount read-only, panic, retry
  • Inconsistent policies
  • Different techniques in similar failure scenarios
  • Bugs
  • Ignored write failures

Let's fix them!
With the current framework? Not so easy.
5
No Reliability Framework
Reliability Policy
  • Diffused
  • Handle each fault in each I/O location
  • Different developers might increase diffusion

File System
Disk Subsystem
  • Inflexible
  • Fixed policies, hard to change
  • But, no policy that fits all diverse settings
  • Less reliable vs. more reliable drives
  • Desktop workload vs. web-server apps
  • The need for a new framework
  • Reliability as a first-class file system concern

6
Localized
  • I/O Shepherd
  • Localized policies
  • More correct, fewer bugs, simpler reliability
    management

File System
Shepherd
Disk Subsystem
7
Flexible
  • I/O Shepherd
  • Localized, flexible policies

File System
Shepherd
Disk Subsystem
8
Powerful
  • I/O Shepherd
  • Localized, flexible, and powerful policies

File System
Shepherd
[Figure: composable policies (add mirror, checksum, more retries, more or less protection) tailored to different settings and drive types: ATA, SCSI, archival, scientific data, networked storage, less reliable, more reliable, and custom drives]
Disk Subsystem
9
Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
  • Evaluation
  • Conclusion

10
Architecture
File System
  • Building reliability framework
  • How to specify reliability policies?
  • How to make powerful policies?
  • How to simplify reliability management?
  • I/O Shepherd layer
  • Four important components
  • Policy table
  • Policy code
  • Policy primitives
  • Policy Metadata

I/O Shepherd
Policy Table
Data Mirror()
Inode
Super
Policy Code
DynMirrorWrite(DiskAddr D, MemAddr A)
    DiskAddr copyAddr
    IOS_MapLookup(MMap, D, copyAddr)
    if (copyAddr == NULL)
        PickMirrorLoc(MMap, D, copyAddr)
        IOS_MapAllocate(MMap, D, copyAddr)
    return (IOS_Write(D, A, copyAddr, A))
Disk Subsystem
11
Policy Table
Policy Table
Block Type Write Policy Read Policy




  • How to specify reliability policies?
  • Different block types, different
    levels of importance
  • Different volumes, different
    reliability levels
  • Need a fine-grained policy
  • Policy table (one possible in-memory representation
    is sketched below)
  • Different policies across different block types
  • Different policy tables across different volumes

Superblock TripleMirror()
Inode ChecksumParity()
Inode Bitmap ChecksumParity()
Data WriteRetry1sec()
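
A minimal C sketch of how a per-block-type policy table such as the one above might be represented; the struct layout, names, and stub handlers are illustrative assumptions, not CrookFS's actual definitions.

#include <stdio.h>

/* Hypothetical policy table: one row per block type, each naming a
   write-policy and a read-policy handler. */
typedef int (*policy_fn)(long block);

static int TripleMirror(long b)   { (void)b; return 0; }  /* stub */
static int ChecksumParity(long b) { (void)b; return 0; }  /* stub */
static int WriteRetry1sec(long b) { (void)b; return 0; }  /* stub */

enum BlockType { SUPERBLOCK, INODE, INODE_BITMAP, DATA, NUM_TYPES };

struct PolicyRow { policy_fn write_policy; policy_fn read_policy; };

/* One table per volume; another volume could install different rows. */
static struct PolicyRow table[NUM_TYPES] = {
    [SUPERBLOCK]   = { TripleMirror,   TripleMirror   },
    [INODE]        = { ChecksumParity, ChecksumParity },
    [INODE_BITMAP] = { ChecksumParity, ChecksumParity },
    [DATA]         = { WriteRetry1sec, WriteRetry1sec },
};

/* On every I/O, the shepherd looks up the row for the block's type. */
int shepherd_write(enum BlockType t, long block) {
    return table[t].write_policy(block);
}

int main(void) {
    printf("data write policy returned %d\n", shepherd_write(DATA, 42));
    return 0;
}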

12
Policy Metadata
  • What support is needed to make powerful policies?
  • Remapping: track bad-block remapping
  • Mirroring: allocate new blocks
  • Sanity check: needs an on-disk structure
    specification
  • Integration with file system
  • Runtime allocation
  • Detailed knowledge of on-disk structures
  • I/O Shepherd Maps
  • Managed by the shepherd
  • Commonly used maps (a sketch of one follows below)
  • Mirror-map
  • Checksum-map
  • Remap-map
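
A minimal C sketch of a shepherd map (here a remap-map) that associates an original block address with its substitute; the array-backed layout and function names are illustrative assumptions, not the shepherd's real map implementation.

typedef unsigned long DiskAddr;
#define MAP_SLOTS 1024
#define ADDR_NULL ((DiskAddr)0)

/* Hypothetical map: pairs of (original block, substitute block). */
typedef struct {
    DiskAddr from[MAP_SLOTS];   /* original block address        */
    DiskAddr to[MAP_SLOTS];     /* remapped block or mirror copy */
    int      used;
} ShepherdMap;

/* Return the substitute for D, or ADDR_NULL if none is recorded. */
DiskAddr map_lookup(const ShepherdMap *m, DiskAddr d) {
    for (int i = 0; i < m->used; i++)
        if (m->from[i] == d)
            return m->to[i];
    return ADDR_NULL;
}

/* Record D -> R; persisting the updated map is what the chained
   transactions of the consistency-management slides take care of. */
int map_allocate(ShepherdMap *m, DiskAddr d, DiskAddr r) {
    if (m->used >= MAP_SLOTS)
        return -1;
    m->from[m->used] = d;
    m->to[m->used]   = r;
    m->used++;
    return 0;
}

int main(void) {
    ShepherdMap m = { .used = 0 };
    map_allocate(&m, 17, 9041);               /* record D=17 -> R=9041 */
    return map_lookup(&m, 17) == 9041 ? 0 : 1;
}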

13
Policy Primitives and Code
Policy Primitives
  • How to make reliability management simple?
  • I/O Shepherd Primitives
  • Rich set and reusable
  • Complexities are hidden
  • Policy writer simply composes primitives into
    Policy Code

[Figure: example primitives, grouped into map operations (Map Update, Map Lookup), computation (Checksum, Parity), and FS-level operations (Layout, Sanity Check, Allocate Near, Allocate Far, Stop FS)]
Policy Code
MirrorData(Addr D)
    Addr M = MapLookup(MMap, D)
    if (M == NULL)
        M = PickMirrorLoc(D)
        MapAllocate(MMap, D, M)
    Copy(D, M)
    Write(D, M)
14
File System
[Diagram: the file system issues a write of data block D to the I/O shepherd]
I/O Shepherd
Policy Table
Data MirrorData()
Inode
Super
Policy Code
MirrorData(Addr D)
    Addr R = MapLookup(MMap, D)
    if (R == NULL)
        R = PickMirrorLoc(D)
        MapAllocate(MMap, D, R)
    Copy(D, R)
    Write(D, R)
Disk Subsystem
[Diagram: the block is written both to its original location D and to the mirror location R]
15
Summary
  • Interposition simplifies reliability management
  • Localized policies
  • Simple and extensible policies
  • Challenge: keeping new data and metadata
    consistent

16
Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
  • Consistency Management
  • Evaluation
  • Conclusion

17
Implementation
  • CrookFS
  • (named for the hooked staff of a shepherd)
  • An ext3 variant with I/O shepherding capabilities
  • Implementation
  • Changes in Core OS
  • Semantic information, layout and allocation
    interface, allocation during recovery
  • Consistency management (data journaling mode)
  • 900 LOC (non-intrusive)
  • Shepherd Infrastructure
  • Shepherd primitives, thread support, maps
    management, etc.
  • 3500 LOC (reusable for other file systems)
  • Well-integrated with the file system
  • Small overhead

18
Data Journaling Mode
[Diagram: data journaling. Dirty blocks (D, I, Bm) in memory are first synced to the journal (the intent is logged), then checkpointed to their fixed on-disk locations (the intent is realized), after which the transaction is released]
19
Reliability Policy Journaling
  • When to run policies?
  • Policies (e.g. mirroring) are executed during
    checkpoint
  • Is the current journaling approach adequate to
    support reliability policies?
  • Could we run remapping/mirroring during
    checkpoint?
  • No: the problem of failed intentions
  • Cannot react to checkpoint failures

20
Failed Intentions
Example policy: remapping
[Diagram: during checkpoint, the write of D fails, so the policy writes the block to R instead and updates the in-memory remap-map to RM: D→R. The transaction (containing D, I, and the old on-disk remap-map entry RM: D→0) has already committed, so adding the new mapping to it is impossible. If a crash occurs before the updated remap-map reaches its fixed location, recovery leaves two inconsistencies: (1) the pointer I→D is invalid, and (2) there is no reference to R]
21
Journaling Flaw
  • Journal: log the intent to the journal
  • If a journal write failure occurs? Simply abort the
    transaction
  • Checkpoint: the intent is realized at its final location
  • If a checkpoint failure occurs? No solution!
  • ext3, IBM JFS: ignore it
  • ReiserFS: stop the FS (coarse-grained recovery)
  • Flaw in the current journaling approach
  • No consistency for any checkpoint recovery that
    changes state
  • Too late: the transaction has already been committed
  • A crash could occur at any time
  • Hopes that checkpoint writes always succeed (wrong!)
  • Consistent reliability with the current journal is
    impossible

22
Chained Transactions
  • Contains all recent changes (e.g., modified
    shepherd metadata)
  • Chained to the previous transaction
  • Rule: only after the chained transaction commits
    can the previous transaction be released (see the
    sketch below)
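
A brief, self-contained C sketch of the chaining rule under stated assumptions: the transaction structure, function names, and the toy main() are illustrative, not the actual CrookFS or journaling-layer code.

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

/* A transaction whose checkpoint recovery changed shepherd metadata
   points to a chained transaction that logs those changes. */
typedef struct Txn {
    int         id;
    bool        committed;   /* has reached the journal */
    struct Txn *chained;     /* carries the recovery changes, if any */
} Txn;

static Txn *txn_new(int id) {
    Txn *t = calloc(1, sizeof(Txn));
    t->id = id;
    return t;
}

static void txn_commit(Txn *t) { t->committed = true; }

/* The rule: release the previous transaction only after its chained
   transaction has committed; otherwise a crash could lose the
   shepherd-metadata changes made during checkpoint recovery. */
static bool can_release(const Txn *t) {
    return t->chained == NULL || t->chained->committed;
}

int main(void) {
    Txn *t1 = txn_new(1);
    txn_commit(t1);                        /* committed, now checkpointing */

    /* Checkpoint hits a failure; the remap policy updates the
       remap-map, which must be logged in a chained transaction. */
    t1->chained = txn_new(2);
    printf("release T1 before the chain commits? %s\n",
           can_release(t1) ? "yes" : "no");          /* no  */

    txn_commit(t1->chained);
    printf("release T1 after the chain commits?  %s\n",
           can_release(t1) ? "yes" : "no");          /* yes */

    free(t1->chained);
    free(t1);
    return 0;
}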

23
Chained Transactions
Example policy: remapping
[Diagram: the failed checkpoint write of D is remapped to R, and the remap-map update RM: D→R is logged in a new transaction chained to the committed one. The old transaction is released only after the chained transaction commits; the checkpoint then completes with D remapped to R at the fixed location]
24
Summary
  • Chained Transactions
  • Handles failed intentions
  • Works for all policies
  • Minimal changes in the journaling layer
  • Repeatable across crashes
  • Idempotent policies: an important property for
    consistency across multiple crashes (illustrated
    in the sketch below)
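
A toy C sketch of why the lookup-before-allocate pattern used in the policies above makes them idempotent; the single-entry map and helper names here are purely illustrative assumptions.

#include <stdio.h>

static long remap_from = 0, remap_to = 0;   /* toy one-entry remap-map */

static long map_lookup(long d)           { return (remap_from == d) ? remap_to : 0; }
static void map_allocate(long d, long r) { remap_from = d; remap_to = r; }
static long pick_new_loc(long d)         { return d + 100; }  /* fake allocator */

/* Remap policy: a second run (e.g., replay after a crash) finds the
   existing mapping and simply rewrites the same target block, so the
   on-disk result is unchanged. */
static long remap_write(long d) {
    long r = map_lookup(d);
    if (r == 0) {                 /* first run: choose and record target */
        r = pick_new_loc(d);
        map_allocate(d, r);
    }
    /* ... write block d's contents to r ... */
    return r;
}

int main(void) {
    printf("first run: D=7 -> %ld\n", remap_write(7));
    printf("replay:    D=7 -> %ld\n", remap_write(7));   /* same target */
    return 0;
}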

25
Outline
  • Introduction
  • I/O Shepherd Architecture
  • Implementation
  • Evaluation
  • Conclusion

26
Evaluation
  • Flexible
  • Change ext3 to all-stop or more-retry policies
  • Fine-Grained
  • Implement gracefully degrading RAID (D-GRAID) [TOS '05]
  • Composable
  • Perform multiple lines of defense
  • Simple
  • Craft 8 policies in a simple manner

27
Flexibility
  • Modify ext3's inconsistent read-recovery policies

Failed block type: indirect block
Workload: path traversal (cd /mnt/fs2/test/a/b/)
Policy observed: detect the failure and propagate it to the application
[Figure: ext3 read-recovery actions observed across failed block types and workloads: propagate, retry, ignore failure, stop]
28
Flexibility
  • Modify ext3 policies to all-stop policies

ext3 vs. All-Stop
Policy Table
Any Block Type AllStopRead()
AllStopRead(Block B)
    if (Read(B) == OK)
        return OK
    else
        Stop()
[Figure: with the all-stop table, every observed recovery action becomes Stop, replacing ext3's mix of no recovery, retry, stop, and propagate]
29
Flexibility
  • Modify ext3 policies to retry-more policies

ext3 vs. Retry-More
Policy Table
Any Block Type RetryMoreRead()
RetryMoreRead(Block B)
    for (int i = 0; i < RETRY_MAX; i++)
        if (Read(B) == SUCCESS)
            return SUCCESS
    return FAILURE
[Figure: with the retry-more table, every observed recovery action becomes Retry, replacing ext3's mix of no recovery, retry, stop, and propagate]
30
Fine-Granularity
  • RAID problem
  • Extreme unavailability
  • Partially available data
  • Unavailable root directory
  • D-GRAID [TOS '05]
  • Degrades gracefully
  • Fault-isolates each file to a single disk
  • Highly replicates metadata
    (a rough sketch of the isolation idea follows below)

[Figure: files f1.pdf and f2.pdf, each fault-isolated to its own disk]
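
A hedged C sketch of the idea behind the IsolateAFileToADisk() policy named on the next slide: route all data blocks of a file to one disk so that a disk failure makes only that file unavailable. The helpers (file_of, home_disk_of) and the modulo placement are hypothetical illustrations, not D-GRAID's or CrookFS's actual logic.

#include <stdio.h>

#define NUM_DISKS 10

/* Toy mapping from a block to its owning file; in the real system this
   comes from the shepherd's semantic knowledge of on-disk structures. */
static int file_of(long block) { return (int)(block / 1000); }

/* Pick one "home" disk per file; metadata blocks would instead be
   widely replicated (the MirrorXway() entries in the policy table). */
static int home_disk_of(int file_id) { return file_id % NUM_DISKS; }

int main(void) {
    long blocks[] = { 1001, 1002, 2003, 2004 };   /* blocks of two files */
    for (int i = 0; i < 4; i++) {
        int f = file_of(blocks[i]);
        printf("block %ld (file %d) -> disk %d\n",
               blocks[i], f, home_disk_of(f));
    }
    /* File 1's blocks land on one disk, file 2's on another, so losing
       a disk degrades availability per file rather than per volume. */
    return 0;
}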
31
Fine-Granularity
[Figure: per-file availability under failures, e.g. F1: 90, F2: 80, F3: 40]
D-GRAID Policy Table
Superblock MirrorXway()
Group Desc MirrorXway()
Bitmaps MirrorXway()
Directory MirrorXway()
Inode MirrorXway()
Indirect MirrorXway()
Data IsolateAFileToADisk()
[Experiment: 10-way linear layout; X = 1, 5, 10]
32
Composability
ReadInode(Block B)
    C = Lookup(Ch-Map, B)
    Read(B, C)
    if (CompareChecksum(B, C) == OK)
        return OK
    M = Lookup(M-Map, B)
    Read(M)
    if (CompareChecksum(M, C) == OK)
        B = M
        return OK
    if (SanityCheck(B) == OK)
        return OK
    if (SanityCheck(M) == OK)
        B = M
        return OK
    RunOnlineFsck()
    return ReadInode(B)
[Figure: recovery time (ms)]
  • Multiple lines of defense
  • Assemble both low-level and high-level recovery
    mechanisms

33
Simplicity
Policy LOC
Propagate 8
Sanity Check 10
Reboot 15
Retry 15
Mirroring 18
Parity 28
Multiple Lines of Defense 39
D-GRAID 79
  • Writing a reliability policy is simple
  • Implemented 8 policies
  • Using reusable primitives
  • The most complex one is under 80 LOC

34
Conclusion
  • Modern storage failures are complex
  • Not just fail-stop; disks also exhibit individual
    block failures
  • An FS reliability framework does not exist
  • With scattered policy code, one can't expect much
    reliability
  • Journaling + block failures → failed intentions
    (a flaw)
  • I/O Shepherding
  • Powerful
  • Deploy disk-level, RAID-level, FS-level policies
  • Flexible
  • Reliability as a function of workload and
    environment
  • Consistent
  • Chained transactions

35
ADvanced Systems Laboratory (www.cs.wisc.edu/adsl)
Thanks to
The I/O Shepherd's shepherd: Frans Kaashoek
Scholarship sponsor
Research sponsor
36
Extra Slides
37
Policy Table
Data RemapMirrorData()
..
..
Policy Code
RemapMirrorData(Addr D)
    Addr R, Q
    R = MapLookup(MMap, D)
    if (R == NULL)
        R = PickMirrorLoc(D)
        MapAllocate(MMap, D, R)
    Copy(D, R)
    Write(D, R)
    if (Fail(R))
        Deallocate(R)
        Q = PickMirrorLoc(D)
        MapAllocate(MMap, D, Q)
        Write(Q)
Disk Subsystem
[Diagram: D is written in place and mirrored to R; when the write to R fails, the copy is remapped to Q]
38
Chained Transactions (2)
Example policy: RemapMirrorData
[Diagram: during checkpoint, the mirror write to R1 fails and is remapped to R2; the remap-map update M: D→R2 is logged in a transaction chained to the committed one, and the previous transaction is released only after the chained transaction commits. The checkpoint then completes with D and its copy at R2 in their fixed locations]
39
Are Existing Solutions Enough?
  • Is machinery in high-end systems enough (e.g.
    disk scrubbing, redundancy, end-to-end
    checksums)?
  • Not pervasive in home environments (storing photos,
    tax returns)
  • New trend: commodity storage clusters (Google,
    EMC Centera)
  • Is RAID enough?
  • Requires more than one disk
  • Does not protect against faults above the disk system
  • Focuses on whole-disk failure
  • Does not enable fine-grained policies