Transcript and Presenter's Notes

Title: Beyond the File System


1
Beyond the File System
  • Designing Large Scale File Storage and Serving
  • Cal Henderson

2
Hello!
3
Big file systems?
  • Too vague!
  • What is a file system?
  • What constitutes big?
  • Some requirements would be nice

4
1
  • Scalable
  • Looking at storage and serving infrastructures

5
2
  • Reliable
  • Looking at redundancy, failure rates, on the fly
    changes

6
3
  • Cheap
  • Looking at upfront costs, TCO and lifetimes

7
Four buckets
Storage
Serving
BCP
Cost
8
Storage
9
The storage stack
10
Hardware overview
  • The storage scale

11
Internal storage
  • A disk in a computer
  • SCSI, IDE, SATA
  • 4 disks in 1U is common
  • 8 for half depth boxes

12
DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30: 14 disks in 3U
13
SAN
  • Storage Area Network
  • Dumb disk shelves
  • Clients connect via a fabric
  • Fibre Channel, iSCSI, Infiniband
  • Low level protocols

14
NAS
  • Network Attached Storage
  • Intelligent disk shelf
  • Clients connect via a network
  • NFS, SMB, CIFS
  • High level protocols

15
  • Of course, it's more confusing than that

16
Meet the LUN
  • Logical Unit Number
  • A slice of storage space
  • Originally for addressing a single drive
  • c1t2d3
  • Controller, Target, Disk (Slice)
  • Now means a virtual partition/volume
  • LVM, Logical Volume Management

17
NAS vs SAN
  • With SAN, a single host (initiator) owns a single
    LUN/volume
  • With NAS, multiple hosts own a single LUN/volume
  • NAS head: NAS access to a SAN

18
SAN Advantages
  • Virtualization within a SAN offers some nice
    features
  • Real-time LUN replication
  • Transparent backup
  • SAN booting for host replacement

19
Some Practical Examples
  • There are a lot of vendors
  • Configurations vary
  • Prices vary wildly
  • Let's look at a couple
  • Ones I happen to have experience with
  • Not an endorsement :)

20
NetApp Filers
Heads and shelves, up to 500TB in 260U
FC SAN with 1 or 2 NAS heads
21
Isilon IQ
  • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
  • FC/InfiniBand SAN with NAS head on each node

22
Scaling
  • Vertical vs Horizontal

23
Vertical scaling
  • Get a bigger box
  • Bigger disk(s)
  • More disks
  • Limited by current tech: size of each disk and
    total number in appliance

24
Horizontal scaling
  • Buy more boxes
  • Add more servers/appliances
  • Scales forever
  • sort of

25
Storage scaling approaches
  • Four common models
  • Huge FS
  • Physical nodes
  • Virtual nodes
  • Chunked space

26
Huge FS
  • Create one giant volume with growing space
  • Sun's ZFS
  • Isilon IQ
  • Expandable on-the-fly?
  • Upper limits
  • Always limited somewhere

27
Huge FS
  • Pluses
  • Simple from the application side
  • Logically simple
  • Low administrative overhead
  • Minuses
  • All your eggs in one basket
  • Hard to expand
  • Has an upper limit

28
Physical nodes
  • Application handles distribution to multiple
    physical nodes
  • Disks, Boxes, Appliances, whatever
  • One volume per node
  • Each node acts by itself
  • Expandable on-the-fly: add more nodes
  • Scales forever

29
Physical Nodes
  • Pluses
  • Limitless expansion
  • Easy to expand
  • Unlikely to all fail at once
  • Minuses
  • Many mounts to manage
  • More administration

30
Virtual nodes
  • Application handles distribution to multiple
    virtual volumes, contained on multiple physical
    nodes
  • Multiple volumes per node
  • Flexible
  • Expandable on-the-fly: add more nodes
  • Scales forever

31
Virtual Nodes
  • Pluses
  • Limitless expansion
  • Easy to expand
  • Unlikely to all fail at once
  • Addressing is logical, not physical
  • Flexible volume sizing, consolidation
  • Minuses
  • Many mounts to manage
  • More administration
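To make the contrast with physical nodes concrete, here is a minimal Python sketch (node names, counts, and the mapping scheme are illustrative, not from the talk). In the physical model the application hashes straight to a box; in the virtual model it hashes to one of many volumes, and a small editable table says which box currently holds each volume.

PHYSICAL_NODES = ["store1", "store2", "store3", "store4"]

def physical_node_for(file_id: int) -> str:
    # Physical model: the application addresses a box directly.
    # Adding or replacing a box changes the mapping for existing files.
    return PHYSICAL_NODES[file_id % len(PHYSICAL_NODES)]

# Virtual model: many small volumes, each pinned to a physical node by a
# lookup table that can be edited without touching the files themselves.
NUM_VOLUMES = 1024
VOLUME_TO_NODE = {v: PHYSICAL_NODES[v % len(PHYSICAL_NODES)]
                  for v in range(NUM_VOLUMES)}

def virtual_node_for(file_id: int) -> str:
    volume = file_id % NUM_VOLUMES            # logical address never changes
    return VOLUME_TO_NODE[volume]

# Consolidating or expanding is a table update, not a data re-hash:
VOLUME_TO_NODE[17] = "store5"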

32
Chunked space
  • Storage layer writes parts of files to different
    physical nodes
  • Like RAID striping, at a higher level
  • High performance for large files
  • read multiple parts simultaneously

33
Chunked space
  • Pluses
  • High performance
  • Limitless size
  • Minuses
  • Conceptually complex
  • Can be hard to expand on the fly
  • Can't manually poke it
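A rough Python sketch of the chunked model, assuming a made-up 64 MB chunk size and node list: the storage layer splits a file into fixed-size chunks and spreads them across nodes, so a large file can be read back from several nodes in parallel.

CHUNK_SIZE = 64 * 1024 * 1024   # assumed chunk size, not from the talk
NODES = ["chunk1", "chunk2", "chunk3"]

def place_chunks(file_size: int):
    """Yield (chunk_index, byte_offset, node) placements for one file."""
    chunk_count = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    for index in range(chunk_count):
        yield index, index * CHUNK_SIZE, NODES[index % len(NODES)]

# A 200 MB file lands on three different nodes and can be read in parallel.
for placement in place_chunks(200 * 1024 * 1024):
    print(placement)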

34
Real Life
Case Studies
35
GFS - Google File System
  • Developed by Google
  • Proprietary
  • Everything we know about it is based on talks
    they've given
  • Designed to store huge files for fast access

36
GFS - Google File System
  • Single Master node holds metadata
  • SPF: Shadow master allows warm swap
  • Grid of chunkservers
  • 64-bit filenames
  • 64 MB file chunks

37
GFS - Google File System
[Diagram: the master node in front of a grid of chunkservers holding replicated chunks 1(a), 1(b), 2(a)]
38
GFS - Google File System
  • Client reads metadata from the master, then file
    parts from multiple chunkservers
  • Designed for big files (>100MB)
  • Master server allocates access leases
  • Replication is automatic and self repairing
  • Synchronously for atomicity

39
GFS - Google File System
  • Reading is fast (parallelizable)
  • But requires a lease
  • Master server is required for all reads and writes
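The read path described above, sketched as toy Python against in-memory stand-ins. This follows only what the public talks describe; the class and method names are invented, not Google's. The client asks the master where each 64 MB chunk lives, then reads the chunk bytes straight from a chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

class FakeChunkserver:
    """In-memory stand-in for a chunkserver."""
    def __init__(self):
        self.chunks = {}                        # handle -> bytes
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class FakeMaster:
    """In-memory stand-in for the metadata master."""
    def __init__(self):
        self.metadata = {}                      # (file, chunk index) -> (handle, replicas)
    def locate(self, filename, chunk_index):
        return self.metadata[(filename, chunk_index)]

def read_range(master, filename, offset, length):
    """Read [offset, offset+length) by walking the covering chunks."""
    out, end = b"", offset + length
    while offset < end:
        handle, replicas = master.locate(filename, offset // CHUNK_SIZE)
        within = offset % CHUNK_SIZE
        want = min(end - offset, CHUNK_SIZE - within)
        out += replicas[0].read(handle, within, want)   # any replica serves reads
        offset += want
    return out

server = FakeChunkserver()
server.chunks["h1"] = b"x" * CHUNK_SIZE
master = FakeMaster()
master.metadata[("bigfile", 0)] = ("h1", [server])
assert read_range(master, "bigfile", 10, 100) == b"x" * 100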

40
MogileFS - OMG Files
  • Developed by Danga / SixApart
  • Open source
  • Designed for scalable web app storage

41
MogileFS - OMG Files
  • Single metadata store (MySQL)
  • MySQL Cluster avoids SPF
  • Multiple tracker nodes locate files
  • Multiple storage nodes store files

42
MogileFS - OMG Files
[Diagram: multiple tracker nodes backed by a shared MySQL metadata store]
43
MogileFS - OMG Files
  • Replication of file classes happens
    transparently
  • Storage nodes are not mirrored; replication is
    piecemeal
  • Reading and writing go through trackers, but are
    performed directly upon storage nodes
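The tracker-mediated read, sketched in Python. This is not the real MogileFS client API: the tracker lookup is stubbed out with hardcoded, fictional URLs, but the shape matches the slide: ask a tracker where the key lives, then fetch directly from a storage node, falling back to another replica on failure.

import urllib.request

def get_paths_from_tracker(tracker_host, key):
    # Placeholder: a real client would speak the tracker protocol here and
    # receive one URL per replica. Hardcoded, fictional hosts for illustration.
    return [f"http://storage1.example.com/dev1/{key}.fid",
            f"http://storage2.example.com/dev2/{key}.fid"]

def fetch(tracker_host, key):
    for url in get_paths_from_tracker(tracker_host, key):
        try:
            with urllib.request.urlopen(url) as resp:   # read straight from storage
                return resp.read()
        except OSError:
            continue                                    # try the next replica
    raise IOError(f"no reachable replica for {key}")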

44
Flickr File System
  • Developed by Flickr
  • Proprietary
  • Designed for very large scalable web app storage

45
Flickr File System
  • No metadata store
  • Deal with it yourself
  • Multiple StorageMaster nodes
  • Multiple storage nodes with virtual volumes

46
Flickr File System
SM
SM
SM
47
Flickr File System
  • Metadata stored by app
  • Just a virtual volume number
  • App chooses a path
  • Virtual nodes are mirrored
  • Locally and remotely
  • Reading is done directly from nodes

48
Flickr File System
  • StorageMaster nodes only used for write
    operations
  • Reading and writing can scale separately
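A hypothetical sketch of that scheme in Python (not Flickr's actual code; the hostname pattern and path layout are invented): the app keeps only a virtual volume number per file and derives the path and URL itself, so reads need no metadata service at all.

def photo_path(volume: int, photo_id: int, secret: str) -> str:
    # The application is free to pick any deterministic layout it likes.
    return f"/vol{volume:04d}/{photo_id}_{secret}.jpg"

def photo_url(volume: int, photo_id: int, secret: str) -> str:
    # Reads go straight to whichever node currently hosts the volume.
    host = f"photos{volume % 100}.example.com"
    return f"http://{host}{photo_path(volume, photo_id, secret)}"

print(photo_url(volume=1234, photo_id=98765, secret="ab12cd"))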

49
Serving
50
Serving files
  • Serving files is easy!

[Diagram: a single Apache server reading from one disk]
51
Serving files
  • Scaling is harder

[Diagram: several Apache servers, each with its own disk]
52
Serving files
  • This doesn't scale well
  • Primary storage is expensive
  • And takes a lot of space
  • In many systems, we only access a small number of
    files most of the time

53
Caching
  • Insert caches between the storage and serving
    nodes
  • Cache frequently accessed content to reduce reads
    on the storage nodes
  • Software (Squid, mod_cache)
  • Hardware (Netcache, Cacheflow)

54
Why it works
  • Keep a smaller working set
  • Use faster hardware
  • Lots of RAM
  • SCSI
  • Outer edge of disks (ZCAV)
  • Use more duplicates
  • Cheaper, since they're smaller

55
Two models
  • Layer 4
  • Simple balanced cache
  • Objects in multiple caches
  • Good for few objects requested many times
  • Layer 7
  • URL-balanced cache
  • Objects in a single cache
  • Good for many objects requested a few times
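A tiny Python sketch of the difference (cache names are made up): layer 4 balancing ignores the URL, so any cache may end up holding any object, while layer 7 hashes the URL so each object lives in exactly one cache.

import hashlib
import itertools

CACHES = ["cache1", "cache2", "cache3", "cache4"]
_rotation = itertools.cycle(CACHES)

def pick_cache_l4(url: str) -> str:
    # Layer 4: simple rotation; the URL plays no part in the decision.
    return next(_rotation)

def pick_cache_l7(url: str) -> str:
    # Layer 7: hash the URL so repeat requests always land on the same cache.
    digest = hashlib.md5(url.encode()).hexdigest()
    return CACHES[int(digest, 16) % len(CACHES)]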

56
Replacement policies
  • LRU: Least recently used
  • GDSF: Greedy dual size frequency
  • LFUDA: Least frequently used with dynamic aging
  • All have advantages and disadvantages
  • Performance varies greatly with each
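As a reference point for the first policy, here is a toy LRU cache in Python (GDSF and LFUDA also weigh object size and access frequency, which this deliberately ignores):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)           # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict the least recently used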

57
Cache Churn
  • How long do objects typically stay in cache?
  • If it gets too short, we're doing badly
  • But it depends on your traffic profile
  • Make the cached object store larger

58
Problems
  • Caching has some problems
  • Invalidation is hard
  • Replacement is dumb (even LFUDA)
  • Avoiding caching makes your life (somewhat) easier

59
CDN - Content Delivery Network
  • Akamai, Savvis, Mirror Image Internet, etc
  • Caches operated by other people
  • Already in-place
  • In lots of places
  • GSLB/DNS balancing

60
Edge networks
[Diagram: a single origin server handling all requests]
61
Edge networks
[Diagram: the origin surrounded by many edge caches close to the clients]
62
CDN Models
  • Simple model
  • You push content to them, they serve it
  • Reverse proxy model
  • You publish content on an origin, they proxy and
    cache it

63
CDN Invalidation
  • You don't control the caches
  • Just like those awful ISP ones
  • Once something is cached by a CDN, assume it can
    never change
  • Nothing can be deleted
  • Nothing can be modified

64
Versioning
  • When you start to cache things, you need to care
    about versioning
  • Invalidation & Expiry
  • Naming & Sync

65
Cache Invalidation
  • If you control the caches, invalidation is
    possible
  • But remember ISP and client caches
  • Remove deleted content explicitly
  • Avoid users finding old content
  • Save cache space

66
Cache versioning
  • Simple rule of thumb
  • If an item is modified, change its name (URL)
  • This can be independent of the file system!

67
Virtual versioning
  • Database indicates version 3 of file
  • Web app writes version number into URL
  • Request comes through cache and is cached with
    the versioned URL
  • mod_rewrite converts versioned URL to path

[Flow: database says version 3 -> app emits example.com/foo_3.jpg -> cache stores foo_3.jpg -> mod_rewrite maps foo_3.jpg -> foo.jpg]
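A small Python sketch of that flow (the paths, regex, and domain are illustrative): the app embeds the current version in the URL it emits, and the origin strips the version back off when mapping the request to a file on disk, which the slide does with mod_rewrite.

import re

def versioned_url(name: str, version: int) -> str:
    # What the web app writes into the page: foo at version 3 -> foo_3.jpg
    return f"http://example.com/{name}_{version}.jpg"

def url_to_path(request_path: str) -> str:
    # What the origin does per request: /foo_3.jpg -> /var/photos/foo.jpg
    return re.sub(r"_\d+\.jpg$", ".jpg", f"/var/photos{request_path}")

assert versioned_url("foo", 3) == "http://example.com/foo_3.jpg"
assert url_to_path("/foo_3.jpg") == "/var/photos/foo.jpg"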
68
Authentication
  • Authentication inline layer
  • Apache / perlbal
  • Authentication sideline
  • ICP (CARP/HTCP)
  • Authentication by URL
  • FlickrFS

69
Auth layer
  • Authenticator sits between client and storage
  • Typically built into the cache software

[Diagram: client -> authenticator -> cache -> origin]
70
Auth sideline
[Diagram: the cache sits between client and origin, consulting an authenticator beside it]
  • Authenticator sits beside the cache
  • Lightweight protocol used for authenticator

71
Auth by URL
[Diagram: the web server authenticates the client and hands out URLs; the client fetches through the cache from the origin]
  • Someone else performs authentication and gives
    URLs to client (typically the web app)
  • URLs hold the keys for accessing files
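One common way to put the keys in the URL is a signed, expiring URL; the Python sketch below is a generic example of that idea (a shared secret and HMAC), not the specific scheme FlickrFS uses.

import hashlib
import hmac
import time

SECRET = b"shared-secret"   # known only to the web app and the origin

def signed_url(path: str, ttl: int = 3600) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def verify(path: str, expires: int, sig: str) -> bool:
    expected = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and time.time() < int(expires)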

72
BCP
73
Business Continuity Planning
  • How can I deal with the unexpected?
  • The core of BCP
  • Redundancy
  • Replication

74
Reality
  • On a long enough timescale, anything that can
    fail, will fail
  • Of course, everything can fail
  • True reliability comes only through redundancy

75
Reality
  • Define your own SLAs
  • How long can you afford to be down?
  • How manual is the recovery process?
  • How far can you roll back?
  • How many nodes/boxes can fail at once?

76
Failure scenarios
  • Disk failure
  • Storage array failure
  • Storage head failure
  • Fabric failure
  • Metadata node failure
  • Power outage
  • Routing outage

77
Reliable by design
  • RAID avoids disk failures, but not head or fabric
    failures
  • Duplicated nodes avoid host and fabric failures,
    but not routing or power failures
  • Dual-colo avoids routing and power failures, but
    may need duplication too

78
Tend to all points in the stack
  • Going dual-colo: great
  • Taking a whole colo offline because of a single
    failed disk: bad
  • We need a combination of these

79
Recovery times
  • BCP is not just about continuing when things fail
  • How can we restore after they come back?
  • Host and colo level syncing
  • replication & queuing
  • Host and colo level rebuilding

80
Reliable Reads & Writes
  • Reliable reads are easy
  • 2 or more copies of files
  • Reliable writes are harder
  • Write 2 copies at once
  • But what do we do when we can't write to one?

81
Dual writes
  • Queue up data to be written
  • Where?
  • Needs itself to be reliable
  • Queue up journal of changes
  • And then read data from the disk whose write
    succeeded
  • Duplicate whole volume after failure
  • Slow!
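A hedged sketch of the dual-write-plus-journal idea in Python (the Node class and journal format are invented): write both copies, and if one side fails, record the gap in a journal so a repair job can later copy from the replica that succeeded.

class Node:
    """Stand-in for a storage node; real writes would go over the network."""
    def __init__(self, name, fail=False):
        self.name, self.fail, self.files = name, fail, {}
    def write(self, key, data):
        if self.fail:
            raise OSError(f"{self.name} unreachable")
        self.files[key] = data

def dual_write(nodes, journal, key, data):
    succeeded = []
    for node in nodes:                         # normally exactly two copies
        try:
            node.write(key, data)
            succeeded.append(node)
        except OSError:
            pass
    if not succeeded:
        raise IOError(f"write failed on every node for {key}")
    for node in nodes:
        if node not in succeeded:
            # Journal the miss; the repair job reads from a good copy later.
            journal.append({"key": key, "source": succeeded[0].name,
                            "target": node.name})

journal = []
dual_write([Node("store1"), Node("store2", fail=True)], journal, "photo123", b"...")
print(journal)   # one repair entry: copy photo123 from store1 to store2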

82
Cost
83
Judging cost
  • Per GB?
  • Per GB upfront and per year
  • Not as simple as you'd hope
  • How about an example?

84
Hardware costs
Single cost = Cost of hardware / Usable GB
85
Power costs
Recurring cost = Cost of power per year / Usable GB
86
Power costs
Single cost = Power installation cost / Usable GB
87
Space costs
Recurring cost = (Cost per U x Us needed (inc. network)) / Usable GB
88
Network costs
Single cost = Cost of network gear / Usable GB
89
Misc costs
Single & recurring costs = (Support contracts + spare disks + bus adaptors + cables) / Usable GB
90
Human costs
Recurring cost = (Admin cost per node x Node count) / Usable GB
91
TCO
  • Total cost of ownership in two parts
  • Upfront
  • Ongoing
  • Architecture plays a huge part in costing
  • Don't get tied to hardware
  • Allow heterogeneity
  • Move with the market
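Pulling the preceding cost slides together, here is a small Python sketch that normalises everything to usable GB and splits it into an upfront and a recurring part; every figure below is a placeholder, not a quoted price.

def cost_per_gb(upfront, recurring_per_year, usable_gb, lifetime_years):
    upfront_total = sum(upfront.values())
    recurring_total = sum(recurring_per_year.values())
    return {
        "upfront_per_gb": upfront_total / usable_gb,
        "recurring_per_gb_per_year": recurring_total / usable_gb,
        "tco_per_gb": (upfront_total + recurring_total * lifetime_years) / usable_gb,
    }

print(cost_per_gb(
    upfront={"hardware": 40000, "network": 3000, "power_install": 1000},
    recurring_per_year={"power": 1200, "space": 2400, "admin": 5000},
    usable_gb=10000,
    lifetime_years=3,
))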

92
(fin)
93
Photo credits
  • flickr.com/photos/ebright/260823954/
  • flickr.com/photos/thomashawk/243477905/
  • flickr.com/photos/tom-carden/116315962/
  • flickr.com/photos/sillydog/287354869/
  • flickr.com/photos/foreversouls/131972916/
  • flickr.com/photos/julianb/324897/
  • flickr.com/photos/primejunta/140957047/
  • flickr.com/photos/whatknot/28973703/
  • flickr.com/photos/dcjohn/85504455/

94
  • You can find these slides online
  • iamcal.com/talks/