Transcript and Presenter's Notes

Title: Beyond the File System


1
Beyond the File System
  • Designing Large Scale File Storage and Serving
  • Cal Henderson

2
Hello!
3
Big file systems?
  • Too vague!
  • What is a file system?
  • What constitutes big?
  • Some requirements would be nice

4
1
  • Scalable
  • Looking at storage and serving infrastructures

5
2
  • Reliable
  • Looking at redundancy, failure rates, on the fly
    changes

6
3
  • Cheap
  • Looking at upfront costs, TCO and lifetimes

7
Four buckets
Storage
Serving
BCP
Cost
8
Storage
9
The storage stack
10
Hardware overview
  • The storage scale

11
Internal storage
  • A disk in a computer
  • SCSI, IDE, SATA
  • 4 disks in 1U is common
  • 8 for half depth boxes

12
DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30: 14 disks in 3U
13
SAN
  • Storage Area Network
  • Dumb disk shelves
  • Clients connect via a fabric
  • Fibre Channel, iSCSI, Infiniband
  • Low level protocols

14
NAS
  • Network Attached Storage
  • Intelligent disk shelf
  • Clients connect via a network
  • NFS, SMB, CIFS
  • High level protocols

15
  • Of course, it's more confusing than that

16
Meet the LUN
  • Logical Unit Number
  • A slice of storage space
  • Originally for addressing a single drive
  • c1t2d3
  • Controller, Target, Disk (Slice)
  • Now means a virtual partition/volume
  • LVM, Logical Volume Management

17
NAS vs SAN
  • With SAN, a single host (initiator) owns a single
    LUN/volume
  • With NAS, multiple hosts own a single LUN/volume
  • NAS head: NAS access to a SAN

18
SAN Advantages
  • Virtualization within a SAN offers some nice
    features
  • Real-time LUN replication
  • Transparent backup
  • SAN booting for host replacement

19
Some Practical Examples
  • There are a lot of vendors
  • Configurations vary
  • Prices vary wildly
  • Let's look at a couple
  • Ones I happen to have experience with
  • Not an endorsement :)

20
NetApp Filers
Heads and shelves, up to 500TB in 260U
FC SAN with 1 or 2 NAS heads
21
Isilon IQ
  • 2U Nodes, 3-96 nodes/cluster, 6-600 TB
  • FC/InfiniBand SAN with NAS head on each node

22
Scaling
  • Vertical vs Horizontal

23
Vertical scaling
  • Get a bigger box
  • Bigger disk(s)
  • More disks
  • Limited by current tech: size of each disk and
    total number in appliance

24
Horizontal scaling
  • Buy more boxes
  • Add more servers/appliances
  • Scales forever
  • sort of

25
Storage scaling approaches
  • Four common models
  • Huge FS
  • Physical nodes
  • Virtual nodes
  • Chunked space

26
Huge FS
  • Create one giant volume with growing space
  • Sun's ZFS
  • Isilon IQ
  • Expandable on-the-fly?
  • Upper limits
  • Always limited somewhere

27
Huge FS
  • Pluses
  • Simple from the application side
  • Logically simple
  • Low administrative overhead
  • Minuses
  • All your eggs in one basket
  • Hard to expand
  • Has an upper limit

28
Physical nodes
  • Application handles distribution to multiple
    physical nodes
  • Disks, Boxes, Appliances, whatever
  • One volume per node
  • Each node acts by itself
  • Expandable on-the-fly: add more nodes
  • Scales forever

29
Physical Nodes
  • Pluses
  • Limitless expansion
  • Easy to expand
  • Unlikely to all fail at once
  • Minuses
  • Many mounts to manage
  • More administration

30
Virtual nodes
  • Application handles distribution to multiple
    virtual volumes, contained on multiple physical
    nodes
  • Multiple volumes per node
  • Flexible
  • Expandable on-the-fly: add more nodes
  • Scales forever

31
Virtual Nodes
  • Pluses
  • Limitless expansion
  • Easy to expand
  • Unlikely to all fail at once
  • Addressing is logical, not physical
  • Flexible volume sizing, consolidation
  • Minuses
  • Many mounts to manage
  • More administration
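To make the contrast with physical nodes concrete, here is a minimal Python sketch (node names, counts, and the mapping scheme are illustrative, not from the talk). In the physical model the application hashes straight to a box; in the virtual model it hashes to one of many volumes, and a small editable table says which box currently holds each volume.

PHYSICAL_NODES = ["store1", "store2", "store3", "store4"]

def physical_node_for(file_id: int) -> str:
    # Physical model: the application addresses a box directly.
    # Adding or replacing a box changes the mapping for existing files.
    return PHYSICAL_NODES[file_id % len(PHYSICAL_NODES)]

# Virtual model: many small volumes, each pinned to a physical node by a
# lookup table that can be edited without touching the files themselves.
NUM_VOLUMES = 1024
VOLUME_TO_NODE = {v: PHYSICAL_NODES[v % len(PHYSICAL_NODES)]
                  for v in range(NUM_VOLUMES)}

def virtual_node_for(file_id: int) -> str:
    volume = file_id % NUM_VOLUMES            # logical address never changes
    return VOLUME_TO_NODE[volume]

# Consolidating or expanding is a table update, not a data re-hash:
VOLUME_TO_NODE[17] = "store5"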

32
Chunked space
  • Storage layer writes parts of files to different
    physical nodes
  • Like RAID striping, at a higher level
  • High performance for large files
  • read multiple parts simultaneously

33
Chunked space
  • Pluses
  • High performance
  • Limitless size
  • Minuses
  • Conceptually complex
  • Can be hard to expand on the fly
  • Can't manually poke it
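A rough Python sketch of the chunked model, assuming a made-up 64 MB chunk size and node list: the storage layer splits a file into fixed-size chunks and spreads them across nodes, so a large file can be read back from several nodes in parallel.

CHUNK_SIZE = 64 * 1024 * 1024   # assumed chunk size, not from the talk
NODES = ["chunk1", "chunk2", "chunk3"]

def place_chunks(file_size: int):
    """Yield (chunk_index, byte_offset, node) placements for one file."""
    chunk_count = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    for index in range(chunk_count):
        yield index, index * CHUNK_SIZE, NODES[index % len(NODES)]

# A 200 MB file lands on three different nodes and can be read in parallel.
for placement in place_chunks(200 * 1024 * 1024):
    print(placement)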

34
Real Life
Case Studies
35
GFS - Google File System
  • Developed by Google
  • Proprietary
  • Everything we know about it is based on talks
    they've given
  • Designed to store huge files for fast access

36
GFS - Google File System
  • Single Master node holds metadata
  • SPF: Shadow master allows warm swap
  • Grid of chunkservers
  • 64-bit filenames
  • 64 MB file chunks

37
GFS - Google File System
[Diagram: the master node in front of a grid of chunkservers holding replicated chunks 1(a), 1(b), 2(a)]
38
GFS - Google File System
  • Client reads metadata from the master, then file
    parts from multiple chunkservers
  • Designed for big files (>100MB)
  • Master server allocates access leases
  • Replication is automatic and self repairing
  • Synchronously for atomicity

39
GFS - Google File System
  • Reading is fast (parallelizable)
  • But requires a lease
  • Master server is required for all reads and writes
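The read path described above, sketched as toy Python against in-memory stand-ins. This follows only what the public talks describe; the class and method names are invented, not Google's. The client asks the master where each 64 MB chunk lives, then reads the chunk bytes straight from a chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

class FakeChunkserver:
    """In-memory stand-in for a chunkserver."""
    def __init__(self):
        self.chunks = {}                        # handle -> bytes
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class FakeMaster:
    """In-memory stand-in for the metadata master."""
    def __init__(self):
        self.metadata = {}                      # (file, chunk index) -> (handle, replicas)
    def locate(self, filename, chunk_index):
        return self.metadata[(filename, chunk_index)]

def read_range(master, filename, offset, length):
    """Read [offset, offset+length) by walking the covering chunks."""
    out, end = b"", offset + length
    while offset < end:
        handle, replicas = master.locate(filename, offset // CHUNK_SIZE)
        within = offset % CHUNK_SIZE
        want = min(end - offset, CHUNK_SIZE - within)
        out += replicas[0].read(handle, within, want)   # any replica serves reads
        offset += want
    return out

server = FakeChunkserver()
server.chunks["h1"] = b"x" * CHUNK_SIZE
master = FakeMaster()
master.metadata[("bigfile", 0)] = ("h1", [server])
assert read_range(master, "bigfile", 10, 100) == b"x" * 100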

40
MogileFS - OMG Files
  • Developed by Danga / SixApart
  • Open source
  • Designed for scalable web app storage

41
MogileFS - OMG Files
  • Single metadata store (MySQL)
  • MySQL Cluster avoids SPF
  • Multiple tracker nodes locate files
  • Multiple storage nodes store files

42
MogileFS - OMG Files
[Diagram: multiple tracker nodes backed by a shared MySQL metadata store]
43
MogileFS - OMG Files
  • Replication of file classes happens
    transparently
  • Storage nodes are not mirrored; replication is
    piecemeal
  • Reading and writing go through trackers, but are
    performed directly upon storage nodes
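The tracker-mediated read, sketched in Python. This is not the real MogileFS client API: the tracker lookup is stubbed out with hardcoded, fictional URLs, but the shape matches the slide: ask a tracker where the key lives, then fetch directly from a storage node, falling back to another replica on failure.

import urllib.request

def get_paths_from_tracker(tracker_host, key):
    # Placeholder: a real client would speak the tracker protocol here and
    # receive one URL per replica. Hardcoded, fictional hosts for illustration.
    return [f"http://storage1.example.com/dev1/{key}.fid",
            f"http://storage2.example.com/dev2/{key}.fid"]

def fetch(tracker_host, key):
    for url in get_paths_from_tracker(tracker_host, key):
        try:
            with urllib.request.urlopen(url) as resp:   # read straight from storage
                return resp.read()
        except OSError:
            continue                                    # try the next replica
    raise IOError(f"no reachable replica for {key}")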

44
Flickr File System
  • Developed by Flickr
  • Proprietary
  • Designed for very large scalable web app storage

45
Flickr File System
  • No metadata store
  • Deal with it yourself
  • Multiple StorageMaster nodes
  • Multiple storage nodes with virtual volumes

46
Flickr File System
SM
SM
SM
47
Flickr File System
  • Metadata stored by app
  • Just a virtual volume number
  • App chooses a path
  • Virtual nodes are mirrored
  • Locally and remotely
  • Reading is done directly from nodes

48
Flickr File System
  • StorageMaster nodes only used for write
    operations
  • Reading and writing can scale separately
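A hypothetical sketch of that scheme in Python (not Flickr's actual code; the hostname pattern and path layout are invented): the app keeps only a virtual volume number per file and derives the path and URL itself, so reads need no metadata service at all.

def photo_path(volume: int, photo_id: int, secret: str) -> str:
    # The application is free to pick any deterministic layout it likes.
    return f"/vol{volume:04d}/{photo_id}_{secret}.jpg"

def photo_url(volume: int, photo_id: int, secret: str) -> str:
    # Reads go straight to whichever node currently hosts the volume.
    host = f"photos{volume % 100}.example.com"
    return f"http://{host}{photo_path(volume, photo_id, secret)}"

print(photo_url(volume=1234, photo_id=98765, secret="ab12cd"))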

49
Serving
50
Serving files
  • Serving files is easy!

[Diagram: a single Apache server reading from one disk]
51
Serving files
  • Scaling is harder

[Diagram: several Apache servers, each with its own disk]
52
Serving files
  • This doesn't scale well
  • Primary storage is expensive
  • And takes a lot of space
  • In many systems, we only access a small number of
    files most of the time

53
Caching
  • Insert caches between the storage and serving
    nodes
  • Cache frequently accessed content to reduce reads
    on the storage nodes
  • Software (Squid, mod_cache)
  • Hardware (Netcache, Cacheflow)

54
Why it works
  • Keep a smaller working set
  • Use faster hardware
  • Lots of RAM
  • SCSI
  • Outer edge of disks (ZCAV)
  • Use more duplicates
  • Cheaper, since they're smaller

55
Two models
  • Layer 4
  • Simple balanced cache
  • Objects in multiple caches
  • Good for few objects requested many times
  • Layer 7
  • URL-balanced cache
  • Objects in a single cache
  • Good for many objects requested a few times
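A tiny Python sketch of the difference (cache names are made up): layer 4 balancing ignores the URL, so any cache may end up holding any object, while layer 7 hashes the URL so each object lives in exactly one cache.

import hashlib
import itertools

CACHES = ["cache1", "cache2", "cache3", "cache4"]
_rotation = itertools.cycle(CACHES)

def pick_cache_l4(url: str) -> str:
    # Layer 4: simple rotation; the URL plays no part in the decision.
    return next(_rotation)

def pick_cache_l7(url: str) -> str:
    # Layer 7: hash the URL so repeat requests always land on the same cache.
    digest = hashlib.md5(url.encode()).hexdigest()
    return CACHES[int(digest, 16) % len(CACHES)]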

56
Replacement policies
  • LRU: Least recently used
  • GDSF: Greedy dual size frequency
  • LFUDA: Least frequently used with dynamic aging
  • All have advantages and disadvantages
  • Performance varies greatly with each
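As a reference point for the first policy, here is a toy LRU cache in Python (GDSF and LFUDA also weigh object size and access frequency, which this deliberately ignores):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)           # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict the least recently used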

57
Cache Churn
  • How long do objects typically stay in cache?
  • If it gets too short, we're doing badly
  • But it depends on your traffic profile
  • Make the cached object store larger

58
Problems
  • Caching has some problems
  • Invalidation is hard
  • Replacement is dumb (even LFUDA)
  • Avoiding caching makes your life (somewhat) easier

59
CDN - Content Delivery Network
  • Akamai, Savvis, Mirror Image Internet, etc
  • Caches operated by other people
  • Already in-place
  • In lots of places
  • GSLB/DNS balancing

60
Edge networks
[Diagram: a single origin server handling all requests]
61
Edge networks
[Diagram: the origin surrounded by many edge caches close to the clients]
62
CDN Models
  • Simple model
  • You push content to them, they serve it
  • Reverse proxy model
  • You publish content on an origin, they proxy and
    cache it

63
CDN Invalidation
  • You don't control the caches
  • Just like those awful ISP ones
  • Once something is cached by a CDN, assume it can
    never change
  • Nothing can be deleted
  • Nothing can be modified

64
Versioning
  • When you start to cache things, you need to care
    about versioning
  • Invalidation & Expiry
  • Naming & Sync

65
Cache Invalidation
  • If you control the caches, invalidation is
    possible
  • But remember ISP and client caches
  • Remove deleted content explicitly
  • Avoid users finding old content
  • Save cache space

66
Cache versioning
  • Simple rule of thumb
  • If an item is modified, change its name (URL)
  • This can be independent of the file system!

67
Virtual versioning
  • Database indicates version 3 of file
  • Web app writes version number into URL
  • Request comes through cache and is cached with
    the versioned URL
  • mod_rewrite converts versioned URL to path

[Flow: database says version 3 -> app emits example.com/foo_3.jpg -> cache stores foo_3.jpg -> mod_rewrite maps foo_3.jpg -> foo.jpg]
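A small Python sketch of that flow (the paths, regex, and domain are illustrative): the app embeds the current version in the URL it emits, and the origin strips the version back off when mapping the request to a file on disk, which the slide does with mod_rewrite.

import re

def versioned_url(name: str, version: int) -> str:
    # What the web app writes into the page: foo at version 3 -> foo_3.jpg
    return f"http://example.com/{name}_{version}.jpg"

def url_to_path(request_path: str) -> str:
    # What the origin does per request: /foo_3.jpg -> /var/photos/foo.jpg
    return re.sub(r"_\d+\.jpg$", ".jpg", f"/var/photos{request_path}")

assert versioned_url("foo", 3) == "http://example.com/foo_3.jpg"
assert url_to_path("/foo_3.jpg") == "/var/photos/foo.jpg"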
68
Authentication
  • Authentication inline layer
  • Apache / perlbal
  • Authentication sideline
  • ICP (CARP/HTCP)
  • Authentication by URL
  • FlickrFS

69
Auth layer
  • Authenticator sits between client and storage
  • Typically built into the cache software

[Diagram: client -> authenticator -> cache -> origin]
70
Auth sideline
[Diagram: the cache sits between client and origin, consulting an authenticator beside it]
  • Authenticator sits beside the cache
  • Lightweight protocol used for authenticator

71
Auth by URL
[Diagram: the web server authenticates the client and hands out URLs; the client fetches through the cache from the origin]
  • Someone else performs authentication and gives
    URLs to client (typically the web app)
  • URLs hold the keys for accessing files
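One common way to put the keys in the URL is a signed, expiring URL; the Python sketch below is a generic example of that idea (a shared secret and HMAC), not the specific scheme FlickrFS uses.

import hashlib
import hmac
import time

SECRET = b"shared-secret"   # known only to the web app and the origin

def signed_url(path: str, ttl: int = 3600) -> str:
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def verify(path: str, expires: int, sig: str) -> bool:
    expected = hmac.new(SECRET, f"{path}{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and time.time() < int(expires)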

72
BCP
73
Business Continuity Planning
  • How can I deal with the unexpected?
  • The core of BCP
  • Redundancy
  • Replication

74
Reality
  • On a long enough timescale, anything that can
    fail, will fail
  • Of course, everything can fail
  • True reliability comes only through redundancy

75
Reality
  • Define your own SLAs
  • How long can you afford to be down?
  • How manual is the recovery process?
  • How far can you roll back?
  • How many nodes/boxes can fail at once?

76
Failure scenarios
  • Disk failure
  • Storage array failure
  • Storage head failure
  • Fabric failure
  • Metadata node failure
  • Power outage
  • Routing outage

77
Reliable by design
  • RAID avoids disk failures, but not head or fabric
    failures
  • Duplicated nodes avoid host and fabric failures,
    but not routing or power failures
  • Dual-colo avoids routing and power failures, but
    may need duplication too

78
Tend to all points in the stack
  • Going dual-colo: great
  • Taking a whole colo offline because of a single
    failed disk: bad
  • We need a combination of these

79
Recovery times
  • BCP is not just about continuing when things fail
  • How can we restore after they come back?
  • Host and colo level syncing
  • replication & queuing
  • Host and colo level rebuilding

80
Reliable Reads & Writes
  • Reliable reads are easy
  • 2 or more copies of files
  • Reliable writes are harder
  • Write 2 copies at once
  • But what do we do when we can't write to one?

81
Dual writes
  • Queue up data to be written
  • Where?
  • Needs itself to be reliable
  • Queue up journal of changes
  • And then read data from the disk whose write
    succeeded
  • Duplicate whole volume after failure
  • Slow!
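A hedged sketch of the dual-write-plus-journal idea in Python (the Node class and journal format are invented): write both copies, and if one side fails, record the gap in a journal so a repair job can later copy from the replica that succeeded.

class Node:
    """Stand-in for a storage node; real writes would go over the network."""
    def __init__(self, name, fail=False):
        self.name, self.fail, self.files = name, fail, {}
    def write(self, key, data):
        if self.fail:
            raise OSError(f"{self.name} unreachable")
        self.files[key] = data

def dual_write(nodes, journal, key, data):
    succeeded = []
    for node in nodes:                         # normally exactly two copies
        try:
            node.write(key, data)
            succeeded.append(node)
        except OSError:
            pass
    if not succeeded:
        raise IOError(f"write failed on every node for {key}")
    for node in nodes:
        if node not in succeeded:
            # Journal the miss; the repair job reads from a good copy later.
            journal.append({"key": key, "source": succeeded[0].name,
                            "target": node.name})

journal = []
dual_write([Node("store1"), Node("store2", fail=True)], journal, "photo123", b"...")
print(journal)   # one repair entry: copy photo123 from store1 to store2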

82
Cost
83
Judging cost
  • Per GB?
  • Per GB upfront and per year
  • Not as simple as you'd hope
  • How about an example?

84
Hardware costs
Single cost = Cost of hardware / Usable GB
85
Power costs
Recurring cost = Cost of power per year / Usable GB
86
Power costs
Single cost = Power installation cost / Usable GB
87
Space costs
Recurring cost = (Cost per U x Us needed (inc. network)) / Usable GB
88
Network costs
Single cost = Cost of network gear / Usable GB
89
Misc costs
Single & recurring costs = (Support contracts + spare disks + bus adaptors + cables) / Usable GB
90
Human costs
Recurring cost = (Admin cost per node x Node count) / Usable GB
91
TCO
  • Total cost of ownership in two parts
  • Upfront
  • Ongoing
  • Architecture plays a huge part in costing
  • Don't get tied to hardware
  • Allow heterogeneity
  • Move with the market
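Pulling the preceding cost slides together, here is a small Python sketch that normalises everything to usable GB and splits it into an upfront and a recurring part; every figure below is a placeholder, not a quoted price.

def cost_per_gb(upfront, recurring_per_year, usable_gb, lifetime_years):
    upfront_total = sum(upfront.values())
    recurring_total = sum(recurring_per_year.values())
    return {
        "upfront_per_gb": upfront_total / usable_gb,
        "recurring_per_gb_per_year": recurring_total / usable_gb,
        "tco_per_gb": (upfront_total + recurring_total * lifetime_years) / usable_gb,
    }

print(cost_per_gb(
    upfront={"hardware": 40000, "network": 3000, "power_install": 1000},
    recurring_per_year={"power": 1200, "space": 2400, "admin": 5000},
    usable_gb=10000,
    lifetime_years=3,
))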

92
(fin)
93
Photo credits
  • flickr.com/photos/ebright/260823954/
  • flickr.com/photos/thomashawk/243477905/
  • flickr.com/photos/tom-carden/116315962/
  • flickr.com/photos/sillydog/287354869/
  • flickr.com/photos/foreversouls/131972916/
  • flickr.com/photos/julianb/324897/
  • flickr.com/photos/primejunta/140957047/
  • flickr.com/photos/whatknot/28973703/
  • flickr.com/photos/dcjohn/85504455/

94
  • You can find these slides online
  • iamcal.com/talks/