Title: Designing for 20TB Disk Drives and Enterprise Storage
1 Designing for 20TB Disk Drives and Enterprise Storage
- Jim Gray, Microsoft Research
2 Disk Evolution
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- Capacity: 100x in 10 years
- 1 TB 3.5" drive in 2005
- 20 TB? in 2012?!
- System on a chip
- High-speed SAN
- Disk replacing tape
- Disk is a supercomputer!
3 Disks are becoming computers
- Smart drives
- Camera with micro-drive
- Replay / Tivo / Ultimate TV
- Phone with micro-drive
- MP3 players
- Tablet
- Xbox
- Many more
Applications: Web, DBMS, Files; OS
Disk Ctlr + 1 GHz CPU + 1 GB RAM
Comm: Infiniband, Ethernet, radio
4 Intermediate Step: Shared Logic
Snap 1TB 12x80GB NAS
- Brick with 8-12 disk drives
- 200 mips/arm (or more)
- 2 x Gbps Ethernet
- General purpose OS
- $10k/TB to $100k/TB
- Shared:
- Sheet metal
- Power
- Support/Config
- Security
- Network ports
- These bricks could run applications (e.g. SQL or Mail or ...)
NetApp 0.5TB 8x70GB NAS
Maxtor 2TB 12x160GB NAS
IBM TotalStorage 360GB 10x36GB NAS
5 Hardware
- Homogeneous machines lead to quick response through reallocation
- HP desktop machines, 320MB RAM, 3U high, 4 x 100GB IDE drives
- $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
- 3 weeks from ordering to operational
Slide courtesy of Brewster Kahle @ Archive.org
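The per-terabyte ratios on this slide can be checked from the per-machine figures. The sketch below is only an illustration: it assumes one processor per desktop machine (the slide does not say) and decimal units. Under those assumptions it gives 2.5 processors/TB and about 0.8 GB of RAM per TB, which the slide rounds to 1GB RAM/TB.

```python
# Per-machine figures taken from the slide.
DISKS_PER_MACHINE = 4
DISK_GB = 100
RAM_MB = 320
CPUS_PER_MACHINE = 1  # assumption: not stated on the slide


def per_terabyte(disks=DISKS_PER_MACHINE, disk_gb=DISK_GB,
                 ram_mb=RAM_MB, cpus=CPUS_PER_MACHINE):
    """Return CPUs and GB of RAM per raw terabyte of disk (decimal units)."""
    tb_per_machine = disks * disk_gb / 1000.0
    return {
        "cpus_per_tb": cpus / tb_per_machine,
        "ram_gb_per_tb": (ram_mb / 1000.0) / tb_per_machine,
    }


ratios = per_terabyte()
print(ratios)
```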
6 Disk as Tape
- Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
- Using removable hard drives to replace tape's function has been successful
- When a tape is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
- Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle @ Archive.org
7 Disk as Tape: What format?
- Today I send NTFS/SQL disks.
- But that is not a good format for Linux.
- Solution: Ship NFS/CIFS/ODBC servers (not disks)
- Plug disk into LAN.
- DHCP, then file or DB server via standard interface.
- Web Service in long term
8 State is Expensive
- Stateless clones are easy to manage
- App servers are middle tier
- Cost goes to zero with Moore's law.
- One admin per 1,000 clones.
- Good story about scaleout.
- Stateful servers are expensive to manage
- 1TB to 100TB per admin
- Storage cost is going to zero ($2k to $200k).
- Cost of storage is management cost
9 Databases (SQL)
- VLDB survey (Winter Corp).
- 10 TB to 100TB DBs.
- Size doubling yearly
- Riding disk Moore's law
- 10,000 disks at 18GB is 100TB cooked.
- Mostly DSS and data warehouses.
- Some media managers
10 Interesting facts
- No DBMSs beyond 100TB.
- Most bytes are in files.
- The web is file centric
- eMail is file centric.
- Science (and batch) is file centric.
- But...
- SQL performance is better than CIFS/NFS.
- CISC vs RISC
11 BaBar: the biggest DB
- 500 TB
- Uses Objectivity
- SLAC events
- Linux cluster scans DB looking for patterns
12 Hotmail / Yahoo: 300 TB (cooked)
- Clone front ends: 10,000 @ Hotmail.
- Application servers
- 100 @ Hotmail
- Get mail box
- Get/put mail
- Disk bound
- 30,000 disks
- 20 admins
13 AOL (MSN) (1PB?)
- 10 B transactions per day (10% of that)
- Huge storage
- Huge traffic
- Lots of eye candy
- DB used for security/accounting.
- GUESS: AOL is a petabyte
- (40M x 10MB = 400 x 10^12)
14 Google: 1.5PB as of last spring
- 8,000 no-name PCs
- Each 1/3U, 2 x 80 GB disk, 2 CPU, 256MB RAM
- 1.4 PB online.
- 2 TB RAM online
- 8 TeraOps
- Slice-price is $1K so $8M.
- 15 admins (!) (~1/100TB).
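The cluster totals follow from the per-slice configuration. This is a rough check, assuming decimal units and that the slide's "1K" slice price is in US dollars. The raw disk product comes to about 1.3 PB, slightly under the slide's rounded 1.4 PB, and the admin ratio comes to roughly 85 TB per admin.

```python
# Per-slice figures taken from the slide.
NODES = 8_000
DISKS_PER_NODE = 2
DISK_GB = 80
RAM_MB_PER_NODE = 256
PRICE_PER_NODE = 1_000  # assumption: the slide's "1K" means US dollars
ADMINS = 15


def cluster_totals():
    """Aggregate disk, RAM, cost, and TB-per-admin for the whole cluster."""
    disk_tb = NODES * DISKS_PER_NODE * DISK_GB / 1_000
    return {
        "disk_pb": disk_tb / 1_000,
        "ram_tb": NODES * RAM_MB_PER_NODE / 1_000_000,
        "cost": NODES * PRICE_PER_NODE,
        "tb_per_admin": disk_tb / ADMINS,
    }


print(cluster_totals())
```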
15 Astronomy
- I've been trying to apply DB to astronomy
- Today they are at 10TB per data set
- Heading for Petabytes
- Using Objectivity
- Trying SQL (talk to me offline)
16 Scale Out: Buy Computing by the Slice
709,202 tpmC! = 1 Billion transactions/day
- Slice: 8 CPU, 8GB, 100 disks (1.8TB)
- 20 ktpmC per slice, $300k/slice
- Clients and 4 DTC nodes not shown
17 ScaleUp: A Very Big System!
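The "1 Billion transactions/day" figure is the tpmC rate multiplied by the minutes in a day. The sketch below also estimates how many 20 ktpmC slices that rate implies. The slice count is an inference from the slide's numbers, not something the slide states.

```python
import math

MINUTES_PER_DAY = 24 * 60


def transactions_per_day(tpm):
    """Convert a transactions-per-minute rate to transactions per day."""
    return tpm * MINUTES_PER_DAY


def slices_needed(total_tpmc, tpmc_per_slice=20_000):
    """Whole slices required to reach total_tpmc (inferred, not from the slide)."""
    return math.ceil(total_tpmc / tpmc_per_slice)


print(transactions_per_day(709_202))  # 1021250880, i.e. about 1 billion
print(slices_needed(709_202))         # 36
```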
- UNISYS Windows 2000 Data Center Limited Edition
- 32 cpus on
- 32 GB of RAM and
- 1,061 disks (15.5 TB)
- Will be helped by 64bit addressing
24 fiber channel
18 Hardware
8 Compaq DL360 Photon Web Servers
One SQL database per rack
Each rack contains 4.5 TB
261 total drives / 13.7 TB total
Fiber SAN Switches
Meta Data Stored on 101 GB Fast, Small Disks (18 x 18.2 GB)
SQL\Inst1
Imagery Data Stored on 4 339 GB Slow, Big Disks (15 x 73.8 GB)
SQL\Inst2
SQL\Inst3
To Add: 90 72.8 GB Disks in Feb 2001 to create 18 TB SAN
Spare
4 Compaq ProLiant 8500 DB Servers
19 Amdahl's Balance Laws
- Parallelism law: If a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S.
- Balanced system law: A system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps.
- Memory law: α = 1: the MB/MIPS ratio (called alpha (α)) in a balanced system is 1.
- IO law: Programs do one IO per 50,000 instructions.
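As a rough illustration of the four laws, the sketch below computes the speedup bound and the IO bandwidth, memory, and IO rate a "balanced" system would need at a given MIPS rating. The function and parameter names are made up for this example, and the 1,000 MIPS input is an arbitrary choice.

```python
def max_speedup(serial, parallel):
    """Parallelism law: the speedup bound is (S + P) / S."""
    return (serial + parallel) / serial


def balanced_system(mips, mips_per_mbps=8.0, alpha=1.0, instr_per_io=50_000):
    """Resources a balanced system needs at a given MIPS rating.

    mips_per_mbps: balanced system law (about 8 MIPS per MBps of IO)
    alpha:         memory law (MB of RAM per MIPS)
    instr_per_io:  IO law (instructions executed per IO)
    """
    return {
        "io_mbps": mips / mips_per_mbps,
        "memory_mb": alpha * mips,
        "ios_per_sec": mips * 1e6 / instr_per_io,
    }


print(max_speedup(1, 9))      # 10.0: a 10% serial part caps speedup at 10x
print(balanced_system(1000))  # 125 MBps of IO, 1000 MB RAM, 20000 IO/s
```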
20 Amdahl's Laws Valid 35 Years Later?
- Parallelism law is algebra so SURE!
- Balanced system laws?
- Look at TPC results (TPC-C, TPC-H) at http://www.tpc.org/
- Some imagination needed
- What's an instruction (CPI varies from 1-3)?
- RISC, CISC, VLIW, clocks per instruction, ...
- What's an I/O?
21 TPC systems
- Normalize for CPI (clocks per instruction)
- TPC-C has about 7 ins/byte of IO
- TPC-H has 3 ins/byte of IO
- TPC-H needs ½ as many disks, sequential vs random
- Both use 9GB 10 krpm disks (need arms, not bytes)
22 TPC systems: What's alpha (MB/MIPS)?
- Hard to say
- Intel 32 bit addressing (4GB limit). Known CPI.
- IBM, HP, Sun have 64 GB limit. Unknown CPI.
- Look at both, guess CPI for IBM, HP, Sun
- Alpha is between 1 and 6
23 Performance (on current SDSS data)
- Run times on $15k Compaq server (2 CPU, 1 GB, 8 disks)
- Some take 10 minutes
- Some take 1 minute
- Median 22 sec.
- GHz processors are fast!
- (10 mips/IO, 200 ins/byte)
- 2.5 m rec/s/cpu
1,000 IO/cpu sec
64 MB IO/cpu sec
24 How much storage do we need?
Yotta Zetta Exa Peta Tera Giga Mega Kilo
Everything! Recorded
- Soon everything can be recorded and indexed
- Most bytes will never be seen by humans.
- Data summarization, trend detection, and anomaly detection are key technologies
- See Mike Lesk: How much information is there? http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian: How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/
All Books MultiMedia
All LoC books (words)
.Movie
A Photo
A Book
10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
25 Standard Storage Metrics
- Capacity
- RAM: MB and $/MB: today at 512MB and $200/GB
- Disk: GB and $/GB: today at 80GB and $70k/TB
- Tape: TB and $/TB: today at 40GB and $10k/TB (nearline)
- Access time (latency)
- RAM: 100 ns
- Disk: 15 ms
- Tape: 30 second pick, 30 second position
- Transfer rate
- RAM: 1-10 GB/s
- Disk: 10-50 MB/s (arrays can go to 10GB/s)
- Tape: 5-15 MB/s (arrays can go to 1GB/s)
26 New Storage Metrics: Kaps, Maps, SCAN
- Kaps: How many kilobyte objects served per second
- The file server, transaction processing metric
- This is the OLD metric.
- Maps: How many megabyte objects served per second
- The multimedia metric
- SCAN: How long to scan all the data
- The data mining and utility metric
- And
- Kaps/$, Maps/$, TBscan/$
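One simple way to estimate the three metrics for a single drive is to charge each object one access latency plus its transfer time. This model is an assumption made for illustration, not the deck's definition. The inputs are an 80 GB disk with 15 ms access time, and 30 MB/s is picked from a typical 10-50 MB/s transfer range.

```python
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000


def objects_per_second(object_bytes, latency_s, rate_bytes_per_s):
    """Objects served per second: one access plus one transfer per object."""
    return 1.0 / (latency_s + object_bytes / rate_bytes_per_s)


def storage_metrics(capacity_bytes, latency_s, rate_bytes_per_s):
    """Return Kaps, Maps, and full-scan time in seconds for one device."""
    return {
        "kaps": objects_per_second(KB, latency_s, rate_bytes_per_s),
        "maps": objects_per_second(MB, latency_s, rate_bytes_per_s),
        "scan_s": capacity_bytes / rate_bytes_per_s,
    }


# Assumed example drive: 80 GB, 15 ms access, 30 MB/s transfer.
m = storage_metrics(80 * GB, 0.015, 30 * MB)
print(m)
```

Under this model the example drive serves roughly 66 Kaps and 20 Maps, and takes about 44 minutes to scan.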
27 More Kaps and Kaps/$ but...
- Disk accesses got much less expensive: better disks, cheaper disks!
- But disk arms are expensive: the scarce resource
- 1 hour scan vs 5 minutes in 1990
28 Data on Disk Can Move to RAM in 10 years
100:1
10 years
29 The Absurd 10x (4 year) Disk
- 2.5 hr scan time (poor sequential access)
- 1 aps / 5 GB (VERY cold data)
- It's a tape!
1 TB
100 MB/s
200 Kaps
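The scan time and access density follow directly from the three labels on this slide (1 TB, 100 MB/s, 200 Kaps). A quick check in decimal units gives a scan time of about 2.8 hours, close to the slide's rounded 2.5 hr, and one access per second for every 5 GB.

```python
CAPACITY_BYTES = 1e12      # 1 TB
TRANSFER_BYTES_PER_S = 100e6  # 100 MB/s
KAPS = 200                 # kilobyte accesses per second


def scan_hours(capacity_bytes, rate_bytes_per_s):
    """Hours needed to read the whole device sequentially."""
    return capacity_bytes / rate_bytes_per_s / 3600


def gb_per_access_per_second(capacity_bytes, accesses_per_s):
    """Gigabytes of stored data behind each access-per-second of arm capacity."""
    return capacity_bytes / 1e9 / accesses_per_s


print(scan_hours(CAPACITY_BYTES, TRANSFER_BYTES_PER_S))
print(gb_per_access_per_second(CAPACITY_BYTES, KAPS))
```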
30 It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
- At 1GBps it takes 12 days!
- Store it in two (or more) places online (on disk?). A geo-plex
- Scrub it continuously (look for errors)
- On failure,
- use other copy until failure repaired,
- refresh lost copy from safe copy.
- Can organize the two copies differently
(e.g. one by time, one by space)
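The 12-day figure is the archive size divided by the restore bandwidth. The sketch below reads "1GBps" as 10^9 bytes per second, which is an assumption about the units, and gives about 11.6 days.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 1e15  # bytes, decimal
GBPS = 1e9       # assumption: "1GBps" means 10^9 bytes per second


def restore_days(size_bytes, rate_bytes_per_s):
    """Days needed to stream size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


print(restore_days(PETABYTE, GBPS))
```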
31 Auto Manage Storage
- 1980 rule of thumb
- A DataAdmin per 10GB, SysAdmin per mips
- 2000 rule of thumb
- A DataAdmin per 5TB
- SysAdmin per 100 clones (varies with app).
- Problem
- 5TB is $50k today, $5k in a few years.
- Admin cost >> storage cost!!!!
- Challenge
- Automate ALL storage admin tasks
32 How to cool disk data
- Cache data in main memory
- See 5 minute rule later in presentation
- Fewer-larger transfers
- Larger pages (512B -> 8KB -> 256KB)
- Sequential rather than random access
- Random 8KB IO is 1.5 MBps
- Sequential IO is 30 MBps (20:1 ratio is growing)
- RAID1 (mirroring) rather than RAID5 (parity).
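The benefit of larger pages can be sketched with a simple model in which every random IO pays a fixed positioning overhead plus its transfer time. The overhead is back-calculated from the slide's two figures (1.5 MBps random at 8KB, 30 MBps sequential). The model itself, and treating MBps as 10^6 bytes per second, are assumptions made for illustration.

```python
PAGE_8KB = 8 * 1024
RANDOM_8KB_BPS = 1.5e6   # slide: random 8KB IO is 1.5 MBps
SEQUENTIAL_BPS = 30e6    # slide: sequential IO is 30 MBps

# Per-IO positioning overhead implied by the two slide figures (assumed model).
OVERHEAD_S = PAGE_8KB / RANDOM_8KB_BPS - PAGE_8KB / SEQUENTIAL_BPS


def random_throughput(page_bytes):
    """Bytes per second for random IO at a given page size under the model."""
    return page_bytes / (OVERHEAD_S + page_bytes / SEQUENTIAL_BPS)


for size in (512, PAGE_8KB, 256 * 1024):
    print(size, round(random_throughput(size) / 1e6, 2), "MBps")
```

Under this model 256KB pages recover more than half of the sequential rate, while 512-byte pages deliver only about 0.1 MBps.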
33 Data delivery costs $1/GB today
- Rent for big customers: $300/megabit per second per month
- Improved 3x in last 6 years (!).
- That translates to $1/GB at each end.
- You can mail a 160 GB disk for $20.
- That's 16x cheaper
- If overnight it's 4 MBps.
3 x 160 GB ≈ ½ TB
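The arithmetic behind these figures can be reproduced as follows. The sketch assumes the prices are in US dollars, a 30-day month, decimal units, and a 12-hour "overnight" transit; the slide does not state the transit time, but 12 hours roughly matches its 4 MBps.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600     # assumption: 30-day month
LINK_PRICE_PER_MBPS_MONTH = 300.0      # assumption: US dollars
DISK_GB = 160
SHIPPING_PRICE = 20.0                  # assumption: US dollars


def network_cost_per_gb():
    """Cost per GB of a fully used 1 megabit/s link rented for a month."""
    gb_per_month = (1e6 / 8) * SECONDS_PER_MONTH / 1e9
    return LINK_PRICE_PER_MBPS_MONTH / gb_per_month


def mail_cost_per_gb():
    """Cost per GB of mailing one full disk."""
    return SHIPPING_PRICE / DISK_GB


def shipping_mb_per_s(n_disks=1, transit_hours=12.0):
    """Effective bandwidth of disks in transit (transit time is assumed)."""
    return n_disks * DISK_GB * 1e9 / (transit_hours * 3600) / 1e6


network_both_ends = 2 * 1.0  # the slide's rounded cost per GB, paid at each end
print(network_cost_per_gb())                   # about 0.93 per GB at one end
print(network_both_ends / mail_cost_per_gb())  # 16.0
print(shipping_mb_per_s())                     # about 3.7 MB/s
```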