Transcript and Presenter's Notes

Title: IRON File Systems


1
IRON File Systems
  • Remzi Arpaci-Dusseau
  • University of Wisconsin, Madison

2
Understanding How Things Fail Is Important
3
How Disks Fail
4
Classic Failure Model: Fail-Stop
  • As defined in [Schneider 90]
  • Stop: Upon failure, halt
  • Make known: But first, switch to a state s.t. other components can detect that you have failed
  • Very simple model of disk failure
  • Used by all early file and storage systems (once controllers could detect failure)
  • But is it realistic?

5
Assertion: Modern Disks Are Not Whole-Disk Fail-Stop
6
Real Failures
  • Latent sector errors [Kari 93, Bairavasundaram 07]
  • A block or set of blocks becomes inaccessible
  • Data corruption [Weinberg 04, Greene 05, Bairavasundaram 08]
  • Controller bugs, not bit rot
  • Transient errors too [Talagala 99]
  • Bus stuttering, etc.
  • Result: Partial failures are a reality

7
So What Should We Do?
8
High-end Systems: Extra Measures
  • Disk scrubbing [Kari 93]
  • Proactively scan drives in search of latent errors
  • When detected, correct from a redundant copy on another disk
  • Extra redundancy [Corbett 04]
  • RAID systems with two parity disks
  • Checksums [Bartlett 04, Weinberg 04]
  • Extra computation over data
  • Guard against corruption (sketched below)
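
A minimal sketch of the checksum idea in C, assuming a Fletcher-style sum (one common choice; real systems differ in algorithm and in where the sum is stored):

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Fletcher-style checksum over a disk block. */
uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* Store the sum at write time; recompute and compare at read time. */
int block_ok(const uint8_t block[BLOCK_SIZE], uint32_t stored_sum)
{
    return block_checksum(block, BLOCK_SIZE) == stored_sum;  /* 0 => corrupt */
}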

9
But What About Desktop File Systems?
10
Desktop FSs: Lost In The Past?
  • Desktop file systems are important
  • Home use: Photos, movies, tax returns, ...
  • Cluster use too: GoogleFS is built on local FSs
  • Performance policies are well known
  • e.g., the FFS placement policy
  • But what is their fault-handling policy?
  • Do they handle partial disk failures?
  • How can we tell?

11
Two Questions
12
Questions I Will Answer
  • Question 1: How do local file systems react to the more realistic set of disk failures?
  • Question 2: How can we change file systems to better handle these types of faults?

13
How Disks Fail: The Details
14
The Storage Stack
  • Not just the file system on top of the disk
  • Many layers
  • Lots of software
  • Even within the disk!
  • Failures occur at all levels

[Figure: the storage stack, from host software layers down to the disk]
15
Latent Sector Errors
  • Disks experience partial failures
  • A small portion of data on disk becomes temporarily or permanently unavailable [Corbett 04]
  • Root causes: Surface is scratched, inaccurate arm movement, interconnect problems
  • Bottom line: A single read or write can fail

16
Data Corruption
  • Sun's ZFS [Weinberg 04]
  • Misdirected writes: Right data, wrong location
  • Phantom/lost writes: "Yes, I wrote the data!" (but didn't) (defense sketched below)
  • EIDE interface on motherboards [Greene 05]
  • Read reported as done when it was not (race)
  • Similar problem at Google [Ghemawat 03]
  • Network Appliance [Lewis 99]
  • Disk occasionally returns byte-shifted data
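
Misdirected and lost writes can be caught by making blocks self-identifying: embed the intended address and a write generation in each block. A hedged sketch in C; the struct and field names are illustrative, not any real on-disk format:

#include <stdint.h>

struct block_tag {
    uint64_t block_nr;    /* the address this block was written for */
    uint64_t generation;  /* bumped on every write to this block */
};

/* On a read of block 'nr': a wrong block_nr reveals a misdirected write;
 * a stale generation (vs. what the FS recorded) reveals a lost write. */
int tag_ok(const struct block_tag *t, uint64_t nr, uint64_t expected_gen)
{
    return t->block_nr == nr && t->generation == expected_gen;
}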

17
Transient Errors
  • 18-month study of a large disk farm [Talagala 99]
  • Most machines had SCSI timeout errors (loose cables, bad cables?)
  • SCSI parity errors were common too (data corrupted while moving across the bus)
  • Failures can be transient too
  • Might work if just retried

18
Even Worse With ATA (Not SCSI)
  • ATA drives: Less reliable [Anderson 03, Hughes & Murray 05]
  • Few are returned for failure analysis
  • Some are partially flaw-marked during testing
  • Test conditions not as harsh (power, temperature)
  • High-end reliability features missing (filters that remove particles, chemicals, humidity)
  • Cheap disks -> less testing -> less reliability
  • But cost drives many purchasing decisions

19
Trend: More Problems, Not Fewer
  • Denser drives: Capacity sells drives
  • More logic -> more complexity
  • More complexity -> more bugs
  • Cost per byte dominates: Pennies matter
  • Manufacturers will cut corners
  • Reliability features are the first to go
  • Increasing amount of software
  • 400K lines of code in a modern Seagate drive
  • Hard to write, hard to debug

20
The Fail-Partial Failure Model
21
The Fail-Partial Failure Model
  • Disk failure: The entire disk may fail
  • Block failure: Part of the disk may fail
  • Block corruption: Part of the disk may get corrupted
  • All can be either transient or sticky

22
Important Parameters
  • Locality: Are partial faults independent of each other?
  • Frequency: How often do partial faults occur?

23
Frequency of Failures
  • Study of latent sector errors [Bairavasundaram et al. 07]
  • 1.53 million disks, 3 years of data
  • ATA: 8.5%, SCSI: 1.9%
  • Latent sector errors are not independent
  • Spatial locality exists; disk capacity matters
  • Study of block corruption [Bairavasundaram et al. 08]
  • Same data set
  • ATA: 0.6%, SCSI: 0.06%
  • Corruptions within a disk are not independent
  • Spatial locality exists
  • The bad block number problem

24
How Do File Systems React To Partial Failures?
25
How To Detect & Handle Failures?
  • Need: A classification of techniques
  • Detection: Discovering that a failure took place
  • Recovery: Recovering from the failure
  • Detection + Recovery = IRON
  • File systems with Internal RObustNess
  • IRON taxonomy: Classify techniques

26
IRON Detection Taxonomy
  • How to detect block failure or corruption?
  • Possible strategies (composed in the sketch below):
  • Zero: No detection technique used
  • ErrorCode: Check return codes from the disk
  • Sanity: Check data structures for consistency
  • Redundancy: Add checksums or other forms of computed replication to detect problems
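
An illustrative C read path layering these strategies; the magic-number check and the fs_checksum callback are hypothetical stand-ins for real file-system internals:

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

enum detect { DETECT_OK, DETECT_IO_ERROR, DETECT_INSANE, DETECT_CORRUPT };

/* Layer the taxonomy on one read: ErrorCode, then Sanity, then Redundancy. */
enum detect checked_read(int fd, uint64_t nr, uint8_t *buf, uint32_t stored_sum,
                         uint32_t (*fs_checksum)(const uint8_t *, size_t))
{
    /* ErrorCode: check what the storage stack returns */
    if (pread(fd, buf, BLOCK_SIZE, (off_t)(nr * BLOCK_SIZE)) != BLOCK_SIZE)
        return DETECT_IO_ERROR;
    /* Sanity: cheap structural checks, e.g., a magic number (hypothetical) */
    if (buf[0] != 0x53 || buf[1] != 0xEF)
        return DETECT_INSANE;
    /* Redundancy: recompute a checksum and compare to the stored value */
    if (fs_checksum(buf, BLOCK_SIZE) != stored_sum)
        return DETECT_CORRUPT;
    return DETECT_OK;
}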

27
IRON Recovery Taxonomy
  • How to recover from a detected failure?
  • Possible strategies (composed in the sketch below):
  • Zero: Don't do anything
  • Propagate: Pass the error on to a higher level
  • Stop: Halt activity (fail-stop)
  • Guess: Manufacture data, return it to the user
  • Retry: Assume the failure is transient
  • Repair: If an inconsistency is detected
  • Remap: Redirect to another block
  • Redundancy: Use another copy of the block
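
An illustrative composition of Retry, Redundancy, and Propagate on the read path; try_read() and mirror_of() are hypothetical helpers, not a real kernel API:

#include <stdint.h>
#include <errno.h>

#define MAX_RETRIES 3

/* Hypothetical low-level helpers: */
int try_read(uint64_t block_nr, uint8_t *buf);  /* returns 0 on success */
uint64_t mirror_of(uint64_t block_nr);          /* address of a redundant copy */

int recovered_read(uint64_t nr, uint8_t *buf)
{
    for (int i = 0; i < MAX_RETRIES; i++)       /* Retry: fault may be transient */
        if (try_read(nr, buf) == 0)
            return 0;
    if (try_read(mirror_of(nr), buf) == 0)      /* Redundancy: use another copy */
        return 0;
    return -EIO;                                /* Propagate: report upward */
}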

28
What IRON Techniques Do Modern File Systems Use?
29
Fault Injection
  • Typical fault injection:
  • Insert failures at random disk locations/times
  • Watch the system to see what happens
  • Not good enough:
  • May miss interesting behavior
  • May find problems, but without explaining them
  • What we do: Space- and time-aware injection
  • A gray-box approach to testing

30
Space Awareness
  • File systems are composed of many on-disk structures
  • e.g., superblocks, inodes, etc.
  • Idea: Make the fault-injection layer aware of file-system structures (see the sketch below)
  • Inject faults across all block types
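
A hedged sketch of such a type-aware filter in C; classify() stands in for real knowledge of the on-disk layout, and all names here are illustrative:

#include <stdint.h>
#include <errno.h>

enum blk_type { BT_SUPER, BT_INODE, BT_BITMAP, BT_JOURNAL, BT_DATA };

/* Hypothetical: map a block number to its type using knowledge of the
 * file system's layout (superblock location, inode-table extents, ...). */
enum blk_type classify(uint64_t block_nr);

static enum blk_type target_type = BT_INODE;   /* the type to fail this run */

/* Interposed read hook: fail reads of the targeted block type only. */
int filtered_read(uint64_t block_nr, uint8_t *buf,
                  int (*real_read)(uint64_t, uint8_t *))
{
    if (classify(block_nr) == target_type)
        return -EIO;                           /* space-aware injected fault */
    return real_read(block_nr, buf);
}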

31
Time Awareness
  • Time is key to testing as well
  • e.g., the update sequence
  • Idea: Build a model of file-system I/O activity

[Figure: data journaling write sequence (simplified); legend: J = journal, C = commit, K = checkpoint, S = superblock]
  • Use the model to induce faults at crucial times
  • Don't miss interesting behaviors

32
Making It Comprehensive
  • Workloads:
  • Exercise as much of the FS as possible
  • Two types of workloads:
  • Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
  • Generics: Stress common functionality (path traversal, recovery, log writes, etc.)

33
Injecting Faults
  • Disk: Hard to do -> it's hardware
  • Software approach:
  • Easy
  • Desirable
  • Fail-partial faults:
  • Read and write errors
  • Read corruption

[Figure: software fault-injection layer interposed between host and disk]
34
The File Systems We Tested
  • Linux ext3
  • Popular, simple, compatible Linux file system
  • Linux ReiserFS
  • Scalable, database-like file system
  • Linux IBM JFS
  • Big Blue's classic journaling file system
  • Windows NTFS
  • Yes, a non-Linux file system

35
Result Matrix
[Figure: result matrix of workloads x on-disk data structures]
36
Read Errors: Recovery
  • Ext3: Stop and propagate (doesn't tolerate transience)
  • ReiserFS: Mostly propagate
  • JFS: Stop, propagate, retry
  • All: Some cases missed

[Figure: read-error recovery matrices for ext3, ReiserFS, and JFS]
37
Write Errors: Recovery
  • Ext3/JFS: Ignore write faults
  • No detection -> no recovery
  • Can corrupt the entire volume
  • ReiserFS: Always calls panic()
  • Exception: indirect blocks

[Figure: write-error recovery matrices for ext3, ReiserFS, and JFS]
38
Corruption: Recovery
  • Ext3/ReiserFS/JFS:
  • Some sanity checking used
  • Stop/propagate common
  • Sanity checking is not enough

[Figure: corruption recovery matrices for ext3, ReiserFS, and JFS]
39
File System Specific Results
  • Ext3: Overall simplicity
  • Checks error codes, modest sanity checking, propagates errors, aborts operation
  • Overreacts on read errors -> halts instead of propagating
  • But some write errors are ignored
  • ReiserFS: First, do no harm
  • At the slightest sign of failure, panic() the file system
  • Preserves integrity, but overreacts to transients
  • IBM JFS: The kitchen sink
  • Uses the broadest range of techniques
  • Windows NTFS: Persistence is a virtue
  • Liberal retry (understands disks can be flaky)

40
General Results (1 of 3)
  • Illogical inconsistency is common
  • Similar faults -> different reactions (e.g., JFS on a failed read of the superblock)
  • Bugs are common
  • Code not stress-tested enough? (e.g., ReiserFS indirect-block code paths)
  • Error codes are sometimes ignored
  • Highly surprising: These are the easiest failures to detect (but sometimes hard to act upon)

41
General Results (2 of 3)
  • Sanity checking is of limited utility
  • Doesn't help if you read the right type but the wrong block
  • Hard to do for some structures (e.g., bitmaps)
  • Stop is useful (if used correctly)
  • ReiserFS halts on write errors
  • Ext3 tries to do this (but aborts too late)
  • Stop should not be overused
  • Faults can be transient
  • Faults can be sticky, too!

42
General Results (3 of 3)
  • Retry is underutilized
  • JFS does it somewhat, NTFS quite a bit
  • But transient faults do occur
  • Automatic repair is rare
  • Almost all Stop actions require administrator intervention/repair (running fsck, reboot, etc.)
  • Redundancy is rarely used
  • Only superblocks are replicated, and only sometimes

43
Towards an IRON File System
44
IRON ext3: ixt3
  • A prototype of an IRON file system
  • First cut: Many other possibilities still exist
  • Start with Linux ext3
  • Add checksums: To detect corruption
  • Add replication: For important structures (e.g., meta-data)
  • Add parity: For user data
  • Result: IRON ext3 (ixt3)

45
Ixt3 Implementation
  • Checksums:
  • Initially written to the ext3 log, then checkpointed to their final location
  • Meta-data replicas:
  • Written to a replica log, checkpointed later to their final on-disk location
  • Parity protection for data (sketched below):
  • One parity block per file, via an extra pointer in the inode
  • Performance issues:
  • Space overhead: Low
  • Time overhead?
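
A minimal sketch of per-file XOR parity, assuming one parity block per file as above; the function names are illustrative, not ixt3's actual code:

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Fold one data block into the running parity. Called on each write; to
 * update in place, XOR out the old contents first, then XOR in the new. */
void parity_fold(uint8_t parity[BLOCK_SIZE], const uint8_t blk[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= blk[i];
}

/* Reconstruct a single lost block: XOR the parity with every survivor. */
void parity_rebuild(uint8_t out[BLOCK_SIZE], const uint8_t parity[BLOCK_SIZE],
                    const uint8_t *survivors[], size_t n)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        uint8_t v = parity[i];
        for (size_t j = 0; j < n; j++)
            v ^= survivors[j][i];
        out[i] = v;
    }
}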

46
Ixt3 Performance Evaluation
  • For home use or read-mostly workloads: No overhead
  • Has a cost for write-intensive workloads

47
Wrapping Up
48
Summary
  • File systems are important
  • Used everywhere, in many different ways
  • Disks fail in interesting ways
  • New model: The fail-partial failure model
  • Local file systems: Not ready for local faults
  • Illogical inconsistencies, bugs, and little recovery
  • We need IRON file systems
  • ixt3: Low-cost protection from partial failures

49
Challenges and Directions
  • Need to rethink how we build file systems
  • Performance policy isn't the only policy
  • Fault-handling policy is critical
  • Testing, and beyond testing
  • Failure handling must be tested (continuously?)
  • Beyond testing: Code analysis too?
  • Guiding principles
  • Lessons from networking
  • Put simply: Don't trust the disk

50
ADvanced Systems Lab (ADSL)
  • www.cs.wisc.edu/adsl

51
ADvanced Systems Lab (ADSL)
  • Who did the real work
  • Nitin Agrawal
  • Lakshmi Bairavasundaram
  • Haryadi Gunawi
  • Vijayan Prabhakaran

52
Backup Slides
53
Read Errors: Detection Techniques
  • Across all three file systems:
  • Error codes are checked for read errors (rarely ignored)

54
Write Errors: Detection Techniques
  • Ext3 and JFS ignore write errors!
  • Error codes are either ignored altogether or not used meaningfully
  • ReiserFS: Much more careful

55
Corruption: Detection Techniques
  • Sanity checking is used across all three file systems
  • Sanity checking is not sufficient
  • e.g., when you read a block of a similar type

56
File Systems: The Manager of Your Data
57
Why File Systems Are Important
  • The file system: The manager of most data
  • Consists of named files: A linear array of bytes
  • Organized in directories: /this/is/my/file
  • Access methods: open(), read(), write(), close() (see the example below)
  • Where we use them: Everywhere
  • Home use: Photos, tax returns, home movies
  • Servers: Network file servers, the Google search engine
  • Why we use them:
  • Simple, convenient
  • Good performance: Subject of much research
  • Reliable? Depends on how disks fail
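
For concreteness, the access methods in a minimal error-checked C program (the path is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* open() creates or opens the named file */
    int fd = open("/tmp/example.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    /* write() appends bytes to the file's linear array of bytes */
    if (write(fd, "hello\n", 6) != 6)
        perror("write");
    close(fd);    /* close() releases the descriptor */
    return 0;
}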

58
File System Background
  • Meta-data: Structures the file system uses to track what it needs to track (sketched below)
  • Superblock: File-system-wide parameters
  • Inodes: Information about a file
  • Data: Blocks to hold user data
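
A schematic of these structures in C, loosely modeled on ext2/ext3 but heavily simplified; the field names are illustrative:

#include <stdint.h>

struct superblock {            /* file-system-wide parameters */
    uint32_t magic;            /* identifies the FS type */
    uint32_t block_size;
    uint64_t total_blocks;
    uint64_t free_blocks;
};

struct inode {                 /* per-file information */
    uint32_t mode;             /* file type + permissions */
    uint64_t size;             /* length in bytes */
    uint64_t mtime;            /* last modification time */
    uint64_t direct[12];       /* block numbers of user data */
    uint64_t indirect;         /* block holding further block numbers */
};
/* Data blocks are just block_size-byte arrays holding user bytes. */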