Title: Storage: Alternate Futures
1 Storage: Alternate Futures
Yotta Zetta Exa Peta Tera Giga Mega Kilo
- Jim Gray
- Microsoft Research
- http://Research.Microsoft.com/Gray/talks
- IBM Almaden, 1 December 1999
2 Acknowledgments: Thank You!!
- Dave Patterson
- Convinced me that processors are moving to the devices.
- Kim Keeton and Erik Riedel
- Showed that many useful subtasks can be done by disk-processors, and quantified execution interval
- Remzi Dusseau
- Re-validated Amdahl's laws
3 Outline
- The Surprise-Free Future (5 years)
- 500 mips cpus for $10
- 1 Gb RAM chips
- MAD at 50 Gbpsi
- 10 GBps SANs are ubiquitous
- 1 GBps WANs are ubiquitous
- Some consequences
- Absurd (?) consequences.
- Auto-manage storage
- Raid10 replaces Raid5
- Disc-packs
- Disk is the archive media of choice
- A surprising future?
- Disks (and other useful things) become supercomputers.
- Apps run in the disk
4 The Surprise-free Storage Future
- 1 Gb RAM chips
- MAD at 50 Gbpsi
- Drives shrink one quantum
- Standard IO
- 10 GBps SANs are ubiquitous
- 1 Gbps WANs are ubiquitous
- 5 bips cpus for $1K and 500 mips cpus for $10
5 1 Gb RAM Chips
- Moving to 256 Mb chips now
- 1 Gb will be standard in 5 years, 4 Gb will be premium product.
- Note
- 256 Mb = 32 MB: the smallest memory
- 1 Gb = 128 MB: the smallest memory
6 System On A Chip
- Integrate Processing with memory on one chip
- chip is 75% memory now
- 1 MB cache >> 1960 supercomputers
- 256 Mb memory chip is 32 MB!
- IRAM, CRAM, PIM, projects abound
- Integrate Networking with processing on one chip
- system bus is a kind of network
- ATM, FiberChannel, Ethernet,.. Logic on chip.
- Direct IO (no intermediate bus)
- Functionally specialized cards shrink to a chip.
7 500 mips System On A Chip for $10
- 486 now $7; 233 MHz ARM for $10 system on a chip: http://www.cirrus.com/news/products99/news-product14.html
- AMD/Celeron 266: $30
- In 5 years, today's leading edge will be
- System on chip (cpu, cache, mem ctlr, multiple IO)
- Low cost
- Low-power
- Have integrated IO
- High end is 5 BIPS cpus
8 Standard IO in 5 Years
- Probably
- Replace PCI with something better; will still need a mezzanine bus standard
- Multiple serial links directly from processor
- Fast (10 GBps/link) for a few meters
- System Area Networks (SANs) ubiquitous (VIA morphs to SIO?)
9 Ubiquitous 10 GBps SANs in 5 years
- 1 Gbps Ethernet is a reality now.
- Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM, ...
- 10 Gbps x4 WDM deployed now (OC192)
- 3 Tbps WDM working in lab
- In 5 years, expect 10x; progress is astonishing
- Gilder's law: Bandwidth grows 3x/year http://www.forbes.com/asap/97/0407/090.htm
[Figure: bandwidth steps labeled 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps), 1 GBps]
10 Thin Clients mean HUGE servers
- AOL hosting customer pictures
- Hotmail allows 5 MB/user, 50 M users
- Web sites offer electronic vaulting for SOHO.
- IntelliMirror: replicate client state on server
- Terminal server: timesharing returns
- ... Many more.
11 Remember Your Roots?
12 MAD at 50 Gbpsi
- MAD: Magnetic Areal Density
- 3-10 Gbpsi in products
- 28 Gbpsi in lab
- 50 Gbpsi: paramagnetic limit, but people have ideas.
- Capacity rise 10x in 5 years (conservative)
- Bandwidth rise 4x in 5 years (density and rpm)
- Disk 50 GB to 500 GB,
- 60-80 MBps
- $1k/TB
- 15 minute to 3 hour scan time.
13 The Absurd Disk
- 2.5 hr scan time (poor sequential access)
- 1 aps / 5 GB (VERY cold data)
- It's a tape!
[Figure: a single disk labeled 1 TB, 100 MB/s, 200 Kaps]
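The absurd-disk numbers (1 TB, 100 MB/s, 200 Kaps) can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original talk; the function names are hypothetical and decimal units (1 TB = 10^12 bytes) are assumed.

```python
def scan_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Hours needed to read the whole disk sequentially."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600.0


def gb_per_access_per_second(capacity_bytes: float, kaps: float) -> float:
    """Gigabytes of stored data behind each kilobyte access per second."""
    return (capacity_bytes / 1e9) / kaps


TB = 1e12  # decimal units are an assumption
MB = 1e6

absurd_scan = scan_hours(1 * TB, 100 * MB)
absurd_density = gb_per_access_per_second(1 * TB, 200)

print(f"scan time: {absurd_scan:.1f} hours")
print(f"{absurd_density:.0f} GB per access/second")
```

The scan works out to about 2.8 hours, in line with the slide's rounded 2.5 hr, and each access per second has to cover 5 GB of data, which is why the slide calls the data VERY cold.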
14 Disk vs Tape
- Disk
- 47 GB
- 15 MBps
- 5 ms seek time
- 3 ms rotate latency
- $9/GB for drive, $3/GB for ctlrs/cabinet
- 4 TB/rack
- Tape
- 40 GB
- 5 MBps
- 30 sec pick time
- Many minute seek time
- $5/GB for media, $10/GB for drive and library
- 10 TB/rack
Guesstimates: Cern 200 TB, 3480 tapes, 2 col 50 GB, Rack 1 TB, 20 drives
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
15 Standard Storage Metrics
- Capacity
- RAM: MB and $/MB; today at 512 MB and $3/MB
- Disk: GB and $/GB; today at 50 GB and $10/GB
- Tape: TB and $/TB; today at 50 GB and $12k/TB (nearline)
- Access time (latency)
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30 second pick, 30 second position
- Transfer rate
- RAM: 1 GB/s
- Disk: 15 MB/s (arrays can go to 1 GB/s)
- Tape: 5 MB/s (striping is problematic, but works)
16 New Storage Metrics: Kaps, Maps, SCAN?
- Kaps: How many kilobyte objects served per second
- The file server, transaction processing metric
- This is the OLD metric.
- Maps: How many megabyte objects served per second
- The Multi-Media metric
- SCAN: How long to scan all the data
- The data mining and utility metric
- And
- Kaps/$, Maps/$, TBscan/$
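The three metrics can be estimated from a drive's basic specifications. The sketch below is an illustration under stated assumptions, not the method used in the talk: it models each access as seek + rotate + transfer with no queuing, uses decimal kilobytes and megabytes, and takes the disk figures from the Disk vs Tape slide (47 GB, 15 MBps, 5 ms seek, 3 ms rotate). The function name is hypothetical.

```python
def storage_metrics(capacity_bytes, seek_s, rotate_s, bandwidth_bytes_per_s):
    """Estimate Kaps, Maps and SCAN for one device (no queuing modeled)."""
    positioning = seek_s + rotate_s
    # Kaps: 1 KB objects per second; positioning time dominates.
    kaps = 1.0 / (positioning + 1e3 / bandwidth_bytes_per_s)
    # Maps: 1 MB objects per second; transfer time dominates.
    maps = 1.0 / (positioning + 1e6 / bandwidth_bytes_per_s)
    # SCAN: time to read everything sequentially, in minutes.
    scan_minutes = capacity_bytes / bandwidth_bytes_per_s / 60.0
    return {"kaps": kaps, "maps": maps, "scan_minutes": scan_minutes}


disk = storage_metrics(
    capacity_bytes=47e9,
    seek_s=0.005,
    rotate_s=0.003,
    bandwidth_bytes_per_s=15e6,
)
for name, value in disk.items():
    print(f"{name}: {value:.1f}")
```

Under this model the 1999 disk delivers on the order of a hundred Kaps, roughly a dozen Maps, and a scan time just under an hour. Dividing each value by the device price would give the per-dollar variants.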
17 For the Record (good 1999 devices packaged in system: http://www.tpc.org/results/individual_results/Compaq/compaq.5500.99050701.es.pdf)
X 100
Tape is 1 TB with 4 DLT readers at 5 MBps each.
18 For the Record (good 1999 devices packaged in system: http://www.tpc.org/results/individual_results/Compaq/compaq.5500.99050701.es.pdf)
Tape is 1 TB with 4 DLT readers at 5 MBps each.
19 The Access Time Myth
- The Myth: seek or pick time dominates
- The reality: (1) Queuing dominates
- (2) Transfer dominates BLOBs
- (3) Disk seeks often short
- Implication: many cheap servers better than one fast expensive server
- shorter queues
- parallel transfer
- lower cost/access and cost/byte
- This is obvious for disk arrays
- This is even more obvious for tape arrays
[Figure: access-time bars with segments labeled Seek, Rotate, Transfer, and Wait]
20 Storage Ratios Changed
- DRAM/disk media price ratio changed
- 1970-1990: 100:1
- 1990-1995: 10:1
- 1995-1997: 50:1
- today: $0.1 per MB disk, $3 per MB dram, 30:1
- 10x better access time
- 10x more bandwidth
- 4,000x lower media price
21 Data on Disk Can Move to RAM in 8 years
[Figure annotations: 30:1, 6 years]
22 Outline
- The Surprise-Free Future (5 years)
- 500 mips cpus for $10
- 1 Gb RAM chips
- MAD at 50 Gbpsi
- 10 GBps SANs are ubiquitous
- 1 GBps WANs are ubiquitous
- Some consequences
- Absurd (?) consequences.
- Auto-manage storage
- Raid10 replaces Raid5
- Disc-packs
- Disk is the archive media of choice
- A surprising future?
- Disks (and other useful things) become supercomputers.
- Apps run in the disk.
23 The (absurd?) consequences
- 256 way nUMA?
- Huge main memories: now 500 MB - 64 GB memories, then 10 GB - 1 TB memories
- Huge disks: now 5-50 GB 3.5" disks, then 50-500 GB disks
- Petabyte storage farms
- (that you can't back up or restore).
- Disks >> tapes
- Small disks: one platter, one inch, 10 GB
- SAN convergence: 1 GBps point to point is easy
- 1 Gb RAM chips
- MAD at 50 Gbpsi
- Drives shrink one quantum
- 10 GBps SANs are ubiquitous
- 500 mips cpus for $10
- 5 bips cpus at high end
24 The Absurd? Consequences
- Further segregate processing from storage
- Poor locality
- Much useless data movement
- Amdahl's laws: bus: 10 B/ips, io: 1 b/ips
[Figure: Processors (1 Tips) and Disks (100 TB), with links labeled 10 TBps and 100 GBps]
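The figure's bandwidth labels follow from applying Amdahl's ratios to a 1 Tips processor pool. The sketch below assumes that "B" means bytes and "b" means bits in "bus 10 B/ips, io 1 b/ips"; that reading, and the names used, are assumptions rather than something the slide spells out.

```python
BUS_BYTES_PER_INSTRUCTION = 10  # "bus: 10 B/ips", read as bytes
IO_BITS_PER_INSTRUCTION = 1     # "io: 1 b/ips", read as bits


def balanced_bandwidth(instructions_per_s: float):
    """Return (bus, io) bandwidth in bytes/s for a balanced system."""
    bus = instructions_per_s * BUS_BYTES_PER_INSTRUCTION
    io = instructions_per_s * IO_BITS_PER_INSTRUCTION / 8
    return bus, io


bus_bytes_per_s, io_bytes_per_s = balanced_bandwidth(1e12)  # 1 Tips

print(f"bus: {bus_bytes_per_s / 1e12:.0f} TBps")
print(f"io:  {io_bytes_per_s / 1e9:.0f} GBps")
```

This gives 10 TBps of bus bandwidth and 125 GBps of IO, which the slide rounds to 100 GBps. Moving that much data between segregated processors and disks is the "useless data movement" the slide warns about.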
25 Storage Latency: How Far Away is the Data?
- Registers: 1 (My Head, 1 min)
- On Chip Cache: 2 (This Room)
- On Board Cache: 10 (This Hotel, 10 min)
- Memory: 100 (Olympia, 1.5 hr)
- Disk: 10^6 (Pluto, 2 Years)
- Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)
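The analogy works by treating one unit of latency as one minute of human time. A small sketch of that conversion follows; it assumes the figure's "10" with "9" and "6" are the exponents 10^9 and 10^6 (the extraction lost the superscripts), and the function name is hypothetical.

```python
LEVELS = [
    ("Registers", 1),
    ("On Chip Cache", 2),
    ("On Board Cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape / Optical Robot", 10**9),
]


def human_scale(relative_latency: float) -> str:
    """Express a relative latency as human time, with 1 unit = 1 minute."""
    minutes = relative_latency
    if minutes < 60:
        return f"{minutes:g} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    days = hours / 24
    if days < 365:
        return f"{days:.0f} days"
    return f"{days / 365:,.1f} years"


for level, latency in LEVELS:
    print(f"{level:22s} {latency:>13,d}  {human_scale(latency)}")
```

Memory comes out at about 1.7 hours, disk at about 1.9 years, and the tape robot at roughly 1,900 years, matching the slide's rounded labels of 1.5 hr, 2 years, and 2,000 years.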
26 Consequences
- AutoManage Storage
- Sixpacks (for arm-limited apps)
- Raid5 -> Raid10
- Disk-to-disk backup
- Smart disks
27 Auto Manage Storage
- 1980 rule of thumb
- A DataAdmin per 10GB, SysAdmin per mips
- 2000 rule of thumb
- A DataAdmin per 5TB
- SysAdmin per 100 clones (varies with app).
- Problem
- 5 TB is $60k today, $10k in a few years.
- Admin cost >> storage cost???
- Challenge
- Automate ALL storage admin tasks
28 The Absurd Disk
- 2.5 hr scan time (poor sequential access)
- 1 aps / 5 GB (VERY cold data)
- It's a tape!
[Figure: a single disk labeled 1 TB, 100 MB/s, 200 Kaps]
29 Extreme case 1 TB disk: Alternatives
- Use all the heads in parallel
- Scan in 30 minutes
- Still one Kaps/5GB
- Use one platter per arm
- Share power/sheetmetal
- Scan in 30 minutes
- One KAPS per GB
[Figure: one disk labeled 500 MB/s, 1 TB, 200 Kaps; a multi-arm package labeled 500 MB/s, 200GB each, 1,000 Kaps]
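The two alternatives can be compared numerically. This sketch assumes five platters of 200 GB each (1 TB / 200 GB), that every independent arm sustains the same 200 Kaps as the single-arm disk, and that both designs deliver 500 MB/s in aggregate; these readings of the figure, and the names below, are assumptions.

```python
TOTAL_BYTES = 1e12      # 1 TB, decimal units assumed
BANDWIDTH = 500e6       # 500 MB/s aggregate in both designs
KAPS_PER_ARM = 200      # accesses per second one arm can sustain


def describe(arms: int) -> dict:
    """Scan time and access density for a 1 TB package with `arms` arms."""
    total_kaps = arms * KAPS_PER_ARM
    return {
        "scan_minutes": TOTAL_BYTES / BANDWIDTH / 60.0,
        "total_kaps": total_kaps,
        "gb_per_kaps": (TOTAL_BYTES / 1e9) / total_kaps,
    }


parallel_heads = describe(arms=1)    # all heads in parallel, one arm
platter_per_arm = describe(arms=5)   # one 200 GB platter per arm

print(parallel_heads)
print(platter_per_arm)
```

Both designs scan in about 33 minutes, but only the one-platter-per-arm package improves access density, from one Kaps per 5 GB to one Kaps per GB.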
30 Drives shrink (1.8", 1")
- 150 kaps for 500 GB is VERY cold data
- 3 GB/platter today, 30 GB/platter in 5 years.
- Most disks are ½ full
- TPC benchmarks use 9 GB drives (need arms or bandwidth).
- One solution: smaller form factor
- More arms per GB
- More arms per rack
- More arms per Watt
31 Prediction: 6-packs
- One way or another, when disks get huge
- Will be packaged as multiple arms
- Parallel heads give bandwidth
- Independent arms give bandwidth and aps
- Package shares power, package, interfaces
32 Stripes, Mirrors, Parity (RAID 0, 1, 5)
- RAID 0: Stripes
- bandwidth
- RAID 1: Mirrors, Shadows, ...
- Fault tolerance
- Reads faster, writes 2x slower
- RAID 5: Parity
- Fault tolerance
- Reads faster
- Writes 4x or 6x slower.
[Figure: block layouts per disk. RAID 0: 0,3,6,.. | 1,4,7,.. | 2,5,8,..  RAID 1: 0,1,2,.. | 0,1,2,..  RAID 5: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..]
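The parity blocks (P0, P1, P2) in the RAID 5 layout are conventionally the XOR of the data blocks in the same stripe. The sketch below illustrates that standard technique and why a small RAID 5 write costs extra IOs; it is not taken from the talk, and the function names are hypothetical.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing data block of a stripe."""
    return parity(list(surviving_blocks) + [parity_block])


def small_write_parity(old_data, old_parity, new_data):
    """New parity for a one-block update.

    Needs two reads (old data, old parity) and two writes
    (new data, new parity): four logical IOs in total.
    """
    return parity([old_parity, old_data, new_data])


d0, d1 = b"\x01\x02", b"\x10\x20"
p = parity([d0, d1])

recovered = rebuild([d0], p)            # pretend the disk holding d1 failed
new_d1 = b"\xff\x00"
new_p = small_write_parity(d1, p, new_d1)

print(recovered == d1, new_p == parity([d0, new_d1]))
```

A mirrored write, by contrast, simply writes the new block to both copies, which is the two-logical-IO case.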
33 RAID 10 (stripes of mirrors) Wins: wastes space, saves arms
- RAID 5
- Performance
- 225 reads/sec
- 70 writes/sec
- Write
- 4 logical IO,
- 2 seek + 1.7 rotate
- SAVES SPACE
- Performance degrades on failure
- RAID 1
- Performance
- 250 reads/sec
- 100 writes/sec
- Write
- 2 logical IO
- 2 seek + 0.7 rotate
- SAVES ARMS
- Performance improves on failure
34 The Storage Rack Today
- 140 arms
- 4 TB
- 24 racks, 24 storage processors, 61 in rack
- Disks: 2.5 GBps IO
- Controllers: 1.2 GBps IO
- Ports: 500 MBps IO
35 Storage Rack in 5 years?
- 140 arms
- 50 TB
- 24 racks, 24 storage processors, 61 in rack
- Disks: 14 GBps IO
- Controllers: 5 GBps IO
- Ports: 1 GBps IO
- My suggestion: move the processors into the storage racks.
36 It's hard to archive a PetaByte. It takes a LONG time to restore it.
- Store it in two (or more) places online (on disk?).
- Scrub it continuously (look for errors)
- On failure, refresh lost copy from safe copy.
- Can organize the two copies differently (e.g. one by time, one by space)
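The store-twice, scrub, and refresh cycle can be sketched in a few lines. This is only an illustration: the slide does not say how errors are detected, so the per-block SHA-256 checksums recorded at write time, the in-memory dictionaries standing in for the two sites, and all names below are assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def is_good(copy: dict, block_id: str, expected: str) -> bool:
    """A block is good if it is present and matches its recorded checksum."""
    data = copy.get(block_id)
    return data is not None and checksum(data) == expected


def scrub(copy_a: dict, copy_b: dict, checksums: dict) -> list:
    """Check every block in both copies; refresh a bad copy from the good one."""
    repaired = []
    for block_id, expected in checksums.items():
        good_a = is_good(copy_a, block_id, expected)
        good_b = is_good(copy_b, block_id, expected)
        if good_a and not good_b:
            copy_b[block_id] = copy_a[block_id]
            repaired.append((block_id, "b"))
        elif good_b and not good_a:
            copy_a[block_id] = copy_b[block_id]
            repaired.append((block_id, "a"))
        elif not good_a and not good_b:
            raise ValueError(f"block {block_id!r} is lost in both copies")
    return repaired


blocks = {"b0": b"alpha", "b1": b"beta"}
checksums = {block_id: checksum(data) for block_id, data in blocks.items()}
site_a = dict(blocks)
site_b = dict(blocks)
site_b["b1"] = b"bet\x00"  # simulate silent corruption at one site

repairs = scrub(site_a, site_b, checksums)
print(repairs)
```

Running the scrub in a continuous loop keeps the window short in which both copies of a block could go bad, which is the case this sketch cannot repair.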
37 Crazy Disk Ideas
- Disk Farm on a card: surface mount disks
- Disk (magnetic store) on a chip (micro machines in Silicon)
- Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram)
ASIC
The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail. Clayton M. Christensen. ISBN 0875845851
38 The Disk Farm On a Card
- The 500GB disc card
- An array of discs
- Can be used as
- 100 discs
- 1 striped disc
- 50 Fault Tolerant discs
- ....etc
- LOTS of accesses/second
- bandwidth
14"
39 Functionally Specialized Cards
P mips processor
Today: P = 50 mips, M = 2 MB
M MB DRAM
In a few years: P = 200 mips, M = 64 MB
ASIC
ASIC
40 Data Gravity: Processing Moves to Transducers
- Move Processing to data sources
- Move to where the power (and sheet metal) is
- Processor in
- Modem
- Display
- Microphones (speech recognition) and cameras (vision)
- Storage: Data storage and analysis
41 It's Already True of Printers: Peripheral CyberBrick
- You buy a printer
- You get a
- several network interfaces
- A Postscript engine
- cpu,
- memory,
- software,
- a spooler (soon)
- and a print engine.
42 Disks Become Supercomputers
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- 100x in 10 years: 2 TB 3.5" drive
- Shrink to 1" is 200 GB
- Disk replaces tape?
- Disk is super computer!
43 All Device Controllers will be Cray 1s
- TODAY
- Disk controller is 10 mips risc engine with 2 MB DRAM
- NIC is similar power
- SOON
- Will become 100 mips systems with 100 MB DRAM.
- They are nodes in a federation (can run Oracle on NT in disk controller).
- Advantages
- Uniform programming model
- Great tools
- Security
- Economics (cyberbricks)
- Move computation to data (minimize traffic)
Central Processor Memory
Tera Byte Backplane
44 With Tera Byte Interconnect and Super Computer Adapters
- Processing is incidental to
- Networking
- Storage
- UI
- Disk Controller/NIC is
- faster than device
- close to device
- Can borrow device package power
- So use idle capacity for computation.
- Run app in device.
- Both Kim Keeton's (UCB) and Erik Riedel's (CMU) theses investigate this and show the benefits of this approach.
45 Implications
Conventional
- Offload device handling to NIC/HBA
- higher level protocols: I2O, NASD, VIA, IP, TCP
- SMP and Cluster parallelism is important.
Radical
- Move app to NIC/device controller
- higher-higher level protocols: CORBA / COM.
- Cluster parallelism is VERY important.
46 How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: A federation.
- Each node does not completely trust the others.
- Nodes use RPC to talk to each other
- CORBA? COM? RMI?
- One or all of the above.
- Huge leverage in high-level interfaces.
- Same old distributed system story.
[Figure: two nodes, each a stack of Applications over datagrams, streams, RPC, and "?" over SIO, joined by a SAN]
47 Basic Argument for x-Disks
- Future disk controller is a super-computer.
- 1 bips processor
- 128 MB dram
- 100 GB disk plus one arm
- Connects to SAN via high-level protocols
- RPC, HTTP, DCOM, Kerberos, Directory Services, ...
- Commands are RPCs
- management, security, ...
- Services file/web/db/... requests
- Managed by general-purpose OS with good dev environment
- Move apps to disk to save data movement
- need programming environment in controller
48 The Slippery Slope
- If you add function to server
- Then you add more function to server
- Function gravitates to data.
[Figure: a slope running from "Nothing: Sector Server" through "Something: Fixed App Server" to "Everything: App Server"]
49 Why Not a Sector Server? (let's get physical!)
- Good idea, that's what we have today.
- But
- cache added for performance
- Sector remap added for fault tolerance
- error reporting and diagnostics added
- SCSI commands (reserve, ...) are growing
- Sharing problematic (space mgmt, security, ...)
- Slipping down the slope to a 2-D block server
50 Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
- Tried and true design
- HSC - VAX cluster
- EMC
- IBM Sysplex (3980?)
- But look inside
- Has a cache
- Has space management
- Has error reporting management
- Has RAID 0, 1, 2, 3, 4, 5, 10, 50,
- Has locking
- Has remote replication
- Has an OS
- Security is problematic
- Low-level interface moves too many bytes
51 Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
- Tried and true design
- Cedar -> NFS
- file server, cache, space,..
- Open file is many fewer msgs
- Grows to have
- Directories and Naming
- Authentication and access control
- RAID 0, 1, 2, 3, 4, 5, 10, 50, ...
- Locking
- Backup/restore/admin
- Cooperative caching with client
- File Servers are a BIG hit: NetWare
- SNAP! is my favorite today
52 Why Not a File Server? Put a Little on the Disk Server
- Tried and true design
- Auspex, NetApp, ...
- Netware
- Yes, but look at NetWare
- File interface gives you app invocation interface
- Became an app server
- Mail, DB, Web, ...
- Netware had a primitive OS
- Hard to program, so optimized wrong thing
53 Why Not Everything? Allow Everything on Disk Server (thin clients)
- Tried and true design
- Mainframes, Minis, ...
- Web servers,
- Encapsulates data
- Minimizes data moves
- Scalable
- It is where everyone ends up.
- All the arguments against are short-term.
54 The Slippery Slope
- If you add function to server
- Then you add more function to server
- Function gravitates to data.
[Figure: a slope running from "Nothing: Sector Server" through "Something: Fixed App Server" to "Everything: App Server"]
55 Outline
- The Surprise-Free Future (5 years)
- Astonishing hardware progress.
- Some consequences
- Absurd (?) consequences.
- Auto-manage storage
- Raid10 replaces Raid5
- Disc-packs
- Disk is the archive media of choice
- A surprising future?
- Disks (and other useful things) become supercomputers.
- Apps run in the disk