Title: Chapter 9, Disks and Files
1Chapter 9, Disks and Files
- The Storage Hierarchy
- Disks
- Mechanics
- Performance
- RAID
- Disk Space Management
- Buffer Management
- Files of Records
- Format of a Heap File
- Format of a Data Page
- Format of Records
2Learning objectives
- Given disk parameters, compute storage needs and read times
- Given a reminder about what each level means, be able to derive any figures on the RAID performance slide
- Describe the pros and cons of alternative structures for files, pages and records
3A (Very) Simple Hardware Model
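[Figure: a CPU chip (register file, ALU) with a bus interface connects over the system bus to an I/O bridge. The I/O bridge connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]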
4 Storage Options
Level               Capacity            Access Time      Cost
Registers           1k-2k bytes         1 Tc             Way Expensive
Caches              10s-1000s K Bytes   2-20 Tc          10 / MByte
Main Memory         G Bytes             300-1000 Tc      0.03 / MB (eBay)
Hard Disk / Flash   100s G Bytes        10 ms = 30M Tc   0.10 / GB (eBay)
Tape                Infinite            Forever          Way Cheap
5 Memory Hierarchy
Levels from upper (faster) to lower (larger), with capacity, access time, cost, and the staging transfer unit between adjacent levels:
- Registers: 1k-2k bytes, 1 Tc, Way Expensive
- Staging by prog./compiler: instr. operands, 1-8 bytes
- Cache (SDRAM, may be multiple levels!): 10s-1000s K Bytes, 2-20 Tc, 10 / MByte
- Staging by cache cntl: blocks, 8-128 bytes
- Memory (DRAM): G Bytes, 300-1000 Tc, 0.03 / MB (eBay)
- Staging by OS: pages, 4K bytes
- Disk: 100s G Bytes, 10 ms = 30M Tc, 0.10 / GB (eBay)
- Staging by user/operator: files, Gbytes
- Tape: Infinite, Forever, Way Cheap
6Why Does Hierarchy Work?
- Locality
- Programs access a relatively small portion of the address space at any instant of time
- Two Different Types
- Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
7 9.1 The Memory Hierarchy
- Typical storage hierarchy as used by a RDBMS
- Primary storage: Main memory (RAM) for currently used data
- Secondary storage: Disk, Flash Memory for the main database
- http://www.cs.cmu.edu/damon2007/pdf/graefe07fiveminrule.pdf
- What are other reasons besides cost to use disk?
- Tertiary storage: Tapes, DVDs for archiving older versions of the data
- Other factors
- Caches at every level
- Controllers, protocols
- Network connections
8What is FLASH Memory, Anyway?
- Floating gate transistor
- Presence of charge => 0
- Erase electrically or with UV (EPROM)
- Performance
- Reads like DRAM (ns)
- Writes like disk (ms). Write is a complex operation
9Components of a Disk
[Figure labels: spindle, disk head, tracks, sector, platters, arm movement, arm assembly]
- Platters are always spinning (say, 120 rps).
- One head reads/writes at any one time.
- To read a record:
- position arm (seek)
- engage head
- wait for data to spin by
- read (transfer data)
10More terminology
[Figure labels: spindle, disk head, tracks, sector, platters, arm movement, arm assembly]
- Each track is made up of fixed size sectors.
- Page size is a multiple of sector size.
- A platter typically has data on both surfaces.
- All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
11Disks Technology Background
- Seagate 373453, 2003
- 15000 RPM (4X)
- 73.4 GBytes (2500X)
- Tracks/Inch 64000 (80X)
- Bits/Inch 533,000 (60X)
- Four 2.5-inch platters (in 3.5-inch form factor)
- Bandwidth 86 MBytes/sec (140X)
- Latency 5.7 ms (8X)
- Cache 8 MBytes
- CDC Wren I, 1983
- 3600 RPM
- 0.03 GBytes capacity
- Tracks/Inch 800
- Bits/Inch 9550
- Three 5.25-inch platters
- Bandwidth 0.6 MBytes/sec
- Latency 48.3 ms
- Cache none
12Typical Disk Drive Statistics (2008)
- Sector size: 512 bytes
- Seek time: average 4-10 ms; track to track 0.6-1.0 ms
- Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
- Transfer time (sustained data rate): 0.3-0.1 msec per 8K page, or 25-75 MB/second
- Density: 12-18 GB/in^2
13Disk Capacity
- Capacity: maximum number of bits that can be stored.
- Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes
- Capacity is determined by
- Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
- Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
- Areal density (bits/in^2): product of recording and track density.
- Modern disks partition tracks into disjoint subsets called recording zones
- Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
- Each zone has a different number of sectors/track
14Cost of Accessing Data on Disk
- Time to access (read/write) a disk block
- Taccess = Tavg seek + Tavg rotation + Tavg transfer
- seek time (moving arms to position disk head on track)
- rotational delay (waiting for block to rotate under head)
- Half a rotation, on average
- transfer time (actually moving data to/from disk surface)
- Key to lower I/O cost: reduce seek/rotation delays!
- No way to avoid transfer time
- Textbook measures query cost by NUMBER of page I/Os
- Implies all I/Os have the same cost, and that CPU time is free
- This is a common simplification.
- Real DBMSs (in the optimizer) would consider sequential vs. random disk reads
- Because sequential reads are much faster
- and would count CPU time.
15Disk Parameters Practice
- A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB.
- How many cylinders are required to store an 8 Gigabyte file?
- What is the average rotational delay, in milliseconds?
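One way to check an answer to the practice questions is a short calculation. This is a sketch under assumptions that the slide does not state: both surfaces of each platter hold data (so a cylinder has 4 tracks), and KB and Gigabyte are read as binary units. The function names are made up for illustration.

```python
KB = 1024
GB = 1024 ** 3

def cylinders_needed(file_bytes, platters, track_bytes, surfaces_per_platter=2):
    """Whole cylinders needed to hold file_bytes (assumes every surface is used)."""
    cylinder_bytes = platters * surfaces_per_platter * track_bytes
    return -(-file_bytes // cylinder_bytes)  # ceiling division

def avg_rotational_delay_ms(rpm):
    """Average rotational delay is half a rotation, in milliseconds."""
    return 0.5 * 60_000 / rpm

print(cylinders_needed(8 * GB, platters=2, track_bytes=256 * KB))
print(round(avg_rotational_delay_ms(7200), 2))
```

With decimal units (1 GB = 10^9 bytes) the cylinder count comes out slightly lower, so state the unit convention with the answer.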
16Disk Access Time Example
- Given
- Rotational rate: 7,200 RPM
- Average seek time: 9 ms
- Avg # sectors/track: 400
- Derived
- Tavg rotation = 1/2 x (60 secs / 7200 RPM) x 1000 ms/sec = 4 ms (approximately)
- Tavg transfer = (60 / 7200 RPM) x (1/400 secs/track) x 1000 ms/sec = 0.02 ms
- Taccess = 9 ms + 4 ms + 0.02 ms
- Important points
- Access time is dominated by seek time and rotational latency.
- First bit in a sector is the most expensive, the rest are free.
- SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
- Disk is about 40,000 times slower than SRAM, 2,500 times slower than DRAM.
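The derivation above can be sketched as a small function. The name `access_time_ms` and its parameters are illustrative only. It uses the unrounded rotational delay, so the total is slightly above the 9 + 4 + 0.02 ms shown on the slide.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (total, rotation, transfer) in ms for reading one sector."""
    ms_per_rotation = 60_000 / rpm
    rotation = 0.5 * ms_per_rotation                # half a rotation on average
    transfer = ms_per_rotation / sectors_per_track  # one sector passes under the head
    return avg_seek_ms + rotation + transfer, rotation, transfer

total, rotation, transfer = access_time_ms(7200, avg_seek_ms=9, sectors_per_track=400)
print(f"rotation={rotation:.2f} ms transfer={transfer:.3f} ms total={total:.2f} ms")
```

Seek and rotation account for nearly all of the total, which is the point of the "first bit is the most expensive" remark.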
17So, How far away is the data?
From http://research.microsoft.com/gray/papers/AlphaSortSigmod.doc
18Block, page and record sizes
- Block: According to text, smallest unit of I/O.
- Page: often used in place of block.
- Typical record size: commonly hundreds, sometimes thousands of bytes
- Unlike the toy records in textbooks
- Typical page size: 4K, 8K
19Effect of page size on read time
- Suppose rotational delay is 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K.
- This graph shows the time required to read 1 Gig of data for different page sizes.
20Why the difference?
- What accounts for the difference, in times to read one Gigabyte, on the previous graph?
- Assume rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K
- Transfer time
- (2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs, about one minute
- How many reads?
- Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads
- Page size 64K: there are 1/8th that many reads = 16K reads
- Time taken by rotational delays and seeks
- Each read requires a rotational delay and a seek, totalling 10 msec.
- 8K: (128K reads) x (10 msec/read) = 1,311 secs, about 22 minutes
- 64K: 1/8 of that, or 164 secs, about 3 minutes
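The arithmetic on this slide can be reproduced for any page size. This is a minimal sketch that assumes, as the slide does, that every page read pays one seek and one rotational delay; the function name and defaults are illustrative.

```python
def read_time_secs(total_bytes, page_bytes, seek_ms=6.0, rotation_ms=4.0,
                   transfer_ms_per_8k=0.5):
    """Return (positioning_secs, transfer_secs) for reading total_bytes page by page."""
    reads = total_bytes // page_bytes
    positioning = reads * (seek_ms + rotation_ms) / 1000
    transfer = (total_bytes / 8192) * transfer_ms_per_8k / 1000
    return positioning, transfer

one_gig = 2 ** 30
for page_kb in (8, 64):
    positioning, transfer = read_time_secs(one_gig, page_kb * 1024)
    print(f"{page_kb}K pages: positioning {positioning:.0f} s, transfer {transfer:.0f} s")
```

Transfer time is the same for every page size; only the positioning term shrinks as pages grow.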
21Moral of the Story
- As page size increases, read (and write) time reduces to transfer time, a big savings.
- So why not use a huge page size?
- Wastes memory space if you don't need all that is read
- Wastes read time if you don't need all that is read
- What applications could use a large page size?
- Those that sequentially access data
- The problem with a small page size is that pages get scattered across the disk. Turn the page.
22Faster I/O, even with a small page size
- Even if the page size is small, you can achieve fast I/O by storing a file's data as follows:
- Consecutive pages on same track, followed by
- Consecutive tracks on same cylinder, followed by
- Consecutive cylinders adjacent to each other
- First two incur no seek time or rotational delay; seek for third is only one track.
- What is saved with this storage pattern?
- How is this storage pattern obtained?
- Disk defragmenter and its relatives/predecessors
- Also places frequently used files near the spindle
- When data is in this storage pattern, the application can do sequential I/O
- Otherwise it must do random I/O
23More Hardware Issues
9. Disks
- Disk Controllers
- Interface from Disks to bus
- Checksums, remap bad sectors, driver mgt, etc
- Interface Protocols and MB per second xfer rates
- IDE/EIDE/ATA/PATA, SATA: 133
- SCSI: 640
- BUT for a single device, SCSI is inferior
- Faster network technologies such as Fibre Channel
- Storage Area Networks (SANs)
- Disk farm networked to servers
- Servers can be heterogeneous: a primary advantage
- Centralized management
24Dependability
- Module reliability: measure of continuous service accomplishment (or time to failure). 2 metrics:
- Mean Time To Failure (MTTF) measures reliability
- Failures In Time (FIT) = 1/MTTF, the rate of failures
- Traditionally reported as failures per billion hours of operation
- Mean Time To Repair (MTTR) measures service interruption
- Mean Time Between Failures (MTBF) = MTTF + MTTR
- Module availability measures service as alternation between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
- Module availability = MTTF / (MTTF + MTTR)
25Example calculating reliability
- If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
- Example: Calculate FIT and MTTF for
- 10 disks (1M hour MTTF per disk)
- 1 disk controller (0.5M hour MTTF)
- and 1 power supply (0.2M hour MTTF)
26Example calculating reliability
- Calculate FIT and MTTF for
- 10 disks (1M hour MTTF per disk)
- 1 disk controller (0.5M hour MTTF)
- and 1 power supply (0.2M hour MTTF)
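A sketch of the calculation, using only the rule stated on the previous slide (failure rates of modules with exponentially distributed lifetimes add). The function names are illustrative.

```python
def system_mttf_hours(component_mttfs):
    """MTTF of a system that fails when any component fails (rates add)."""
    failure_rate = sum(1.0 / mttf for mttf in component_mttfs)
    return 1.0 / failure_rate

def fit(mttf_hours):
    """Failures per billion hours of operation."""
    return 1e9 / mttf_hours

components = [1_000_000] * 10 + [500_000, 200_000]  # 10 disks, controller, power supply
mttf = system_mttf_hours(components)
print(f"FIT = {fit(mttf):.0f}, MTTF = {mttf:.0f} hours")
```

The sum of rates is 10/1M + 2/1M + 5/1M = 17 per million hours, so FIT is 17,000 and MTTF is about 59,000 hours.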
27 9.2 RAID [587]
- Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.
- Goals: Increase performance and reliability.
- Two main techniques
- Data striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks.
- Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.
28Data Striping
- CPUs go fast, disks don't. How can disks keep up?
- CPUs do work in parallel. Can disks?
- Answer: Partition data across D disks (see next slide).
- If partition unit is a page
- A single page I/O request is no faster
- Multiple I/O requests can run at aggregated bandwidth
- Number of pages in a partition unit is called the depth of the partition.
- Contrary to text, partition units of a bit are almost never used and partition units of a byte are rare.
29Data Striping (RAID Level 0)
30Redundancy
- Striping is seductive, but remember reliability!
- MTTF of a disk is about 6 years
- If we stripe over 24 disks, what is MTTF?
- Solution: redundancy
- Parity corrects single failures
- Others detect where the failure is, and correct multiple failures
- But failure location is provided by controller
- Redundancy may require more than one check bit
- Redundancy makes writes slower: why?
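The MTTF question can be worked with the same assumption used in the reliability example (independent disks with exponentially distributed lifetimes, so failure rates add). This is a sketch, and `striped_array_mttf` is a hypothetical helper name.

```python
def striped_array_mttf(disk_mttf, n_disks):
    """MTTF of a stripe set with no redundancy: any single disk failure loses data."""
    return disk_mttf / n_disks

years = striped_array_mttf(6.0, 24)
print(f"{years} years, or {years * 12} months")
```

Under these assumptions a 24-disk stripe set has a quarter of a year between failures on average, which is why redundancy is needed.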
31RAID Levels
- Standardized by SNIA (www.snia.org)
- Vary in practice
- For each level, decide (assume single user)
- Number of disks required to hold D disks of data.
- Speedup s (compared to 1 disk) for
- S/R (Sequential/Random) R/W (Reads/Writes)
- Random: each I/O is one block
- Sequential: each I/O is one stripe
- Number of disks/blocks that can fail w/o data loss
- Level 0: Block Striped, No redundancy
- Picture is 2 slides back
32JBOD, RAID Level 1
- JBOD: Just a Bunch of Disks
- Level 1: Mirrored (two identical JBODs; no striping)
33RAID Level 0+1: Stripe + Mirror
[Figure: blocks 1, D+1, 2D+1, ... are striped across D disks, and the stripe set is mirrored on Disk D, Disk D+1, Disk D+2, ..., Disk 2D-1.]
34RAID Level 4
- Block-Interleaved Parity (not common)
- One check disk, uses one bit of parity.
- How to tell if there is a failure, or which disk failed?
- Read-modify-write
- Disk D is a bottleneck
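To make the single parity bit concrete, here is a minimal sketch of block parity as a bytewise XOR, and of rebuilding one lost block from the survivors. It assumes the controller has already reported which disk failed; the function names and sample blocks are invented for illustration.

```python
from functools import reduce

def parity(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """XOR of the surviving data blocks and the parity block yields the lost block."""
    return parity(list(surviving_blocks) + [parity_block])

data = [b"disk", b"file", b"page"]   # one block per data disk
check = parity(data)                 # stored on the check disk
lost = reconstruct([data[0], data[2]], check)
print(lost == data[1])
```

Parity alone says that something is wrong in a stripe, not where; the location has to come from elsewhere.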
35RAID Level 5
- Level 5: Block-Interleaved Distributed Parity
[Figure: data blocks and parity blocks (P) laid out across Disk 0, Disk 1, ..., Disk D; the parity block falls on a different disk in each stripe.]
- Level 6: Like 5, but 2 parity bits/disks
- Can survive loss of 2 disks/blocks
36Notation on the next slide
- # Disks
- Number of disks required to hold D disks worth of data using this RAID level
- Read/Write speedup of blocks in a single file
- SR: Sequential read
- RR: Random read
- SW: Sequential write
- RW: Random write
- Failure Tolerance
- How many disks can fail without loss of data
- Internal Data
- s: Blocks transferred in the time it takes to transfer one block of data from one disk.
- These numbers are theoretical!
- YMMV, and may vary significantly!
37RAID Performance
If no two are copies of each other. Note: can't write both mirrors at once. Why?
38Small Writes on Levels 4 and 5
- Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified.
- On small writes this can be very expensive
- This is another justification for Log Based File Systems (see your OS course)
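The read-modify-write cycle can be sketched as follows: the new parity is the old parity XOR the old data XOR the new data, so a one-block write costs two reads and two writes. This is an illustration with invented names, not a description of any particular controller.

```python
def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def updated_parity(old_data, old_parity, new_data):
    """Parity after overwriting one data block, without reading the other data blocks."""
    return xor(xor(old_parity, old_data), new_data)

stripe = [b"aaaa", b"bbbb", b"cccc"]
old_parity = xor(xor(stripe[0], stripe[1]), stripe[2])

new_block = b"BBBB"
new_parity = updated_parity(stripe[1], old_parity, new_block)  # 2 reads + 2 writes

# Same result as recomputing parity over the whole updated stripe
print(new_parity == xor(xor(stripe[0], new_block), stripe[2]))
```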
39Which RAID Level is best?
- If data loss is not a problem
- Level 0
- If storage cost is not a problem
- Level 0+1
- Else
- Level 5
- Software Support
- Linux: 0,1,4,5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
- Windows: 0,1,5 (http://www.techimo.com/articles/index.pl?photo149)
40 9.3, 9.4.1 Covered earlier
41 9.4.2 DBMS vs. OS File System
- OS does disk space & buffer mgmt: why not let OS manage these tasks? [715]
- Differences in OS support: portability issues
- Some limitations, e.g., files can't span disks.
- Buffer management in DBMS requires ability to
- pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),
- adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
- Sometimes MRU is the best replacement policy. For example, for a scan or a loop that does not fit in the buffer pool.
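The MRU claim can be checked with a toy simulation. This sketch counts buffer hits when a loop over 4 pages runs through a 3-frame pool; it ignores pinning and dirty pages, and `count_hits` is a made-up name.

```python
def count_hits(accesses, frames, policy):
    """Count buffer hits for policy 'LRU' or 'MRU' with the given number of frames."""
    pool = []  # ordered from least to most recently used
    hits = 0
    for page in accesses:
        if page in pool:
            hits += 1
            pool.remove(page)
        elif len(pool) == frames:
            # Evict the oldest page for LRU, the newest for MRU
            pool.pop(0 if policy == "LRU" else -1)
        pool.append(page)
    return hits

loop = [0, 1, 2, 3] * 3  # three passes over a 4-page file
print("LRU hits:", count_hits(loop, frames=3, policy="LRU"))
print("MRU hits:", count_hits(loop, frames=3, policy="MRU"))
```

LRU always evicts the page that the loop will need next, so it never gets a hit here; MRU keeps most of the file resident.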
42 9.5 Files of Records
- Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records.
- FILE: A collection of pages, each containing a collection of records. Must support
- insert/delete/modify record
- read a particular record (specified using record id)
- scan all records (possibly with some conditions on the records to be retrieved)
43 9.5.1 Unordered (Heap) Files
- Simplest file structure: contains records in no particular order.
- As file grows and shrinks, disk pages are allocated and de-allocated.
- To support record level operations, we must
- keep track of the pages in a file
- keep track of free space on pages
- keep track of the records on a page
- There are at least two alternatives for keeping track of heap files.
44Heap File Implemented as a List
[Figure: the header page points to two linked lists of data pages, one of full pages and one of pages with free space.]
- The header page id and heap file name must be stored someplace.
- Each page contains 2 pointers plus data.
45Heap File Using a Page Directory
[Figure: a DIRECTORY, starting at the header page, with one entry pointing to each of Data Page 1, Data Page 2, ..., Data Page N.]
- The entry for a page can include the number of free bytes on the page.
- The directory is a collection of pages; linked list implementation is just one alternative.
- Much smaller than linked list of all HF pages!
46Comparing Heap File Implementations
- Assume
- 100 directory entries per page.
- U full pages, E pages with free space
- D directory pages
- Then D = ⌈(U + E) / 100⌉
- Note that D is two orders of magnitude less than U or E
- Cost to find a page with enough free space
- List: E/2; Directory: (D/2) + 1
- Cost to move a page from Full to Free (e.g., when a record is deleted)
- List: 3; Directory: 1
- Can you think of some other operations?
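The cost formulas above can be evaluated for a sample file. The values of U and E below are invented for illustration; the formulas are the ones on the slide.

```python
import math

def directory_pages(full_pages, free_pages, entries_per_page=100):
    """D = ceil((U + E) / entries per directory page)."""
    return math.ceil((full_pages + free_pages) / entries_per_page)

def cost_to_find_free_page(full_pages, free_pages, entries_per_page=100):
    """Average page I/Os to find a page with enough free space, per the slide."""
    d = directory_pages(full_pages, free_pages, entries_per_page)
    return {"list": free_pages / 2, "directory": d / 2 + 1}

U, E = 5000, 1000  # hypothetical heap file
print(directory_pages(U, E))
print(cost_to_find_free_page(U, E))
```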
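47 9.6 Page Formats: Fixed Length Records
[Figure: two page layouts. PACKED: Slot 1, Slot 2, ..., Slot N stored contiguously, followed by free space, with the number of records N at the end of the page. UNPACKED, BITMAP: Slot 1, Slot 2, ..., Slot M, with a bitmap (bits M ... 3 2 1) marking which slots are occupied and the number of slots M at the end of the page.]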
48Packed vs Unpacked Page Formats
- Record ID (RID, TID) = (page, slot), in all page formats
- Note that indexes are filled with RIDs
- Data entries in alternatives 2 and 3 are (key, RID..)
- Packed
- stores more records
- RIDs change when a record is deleted
- This may not be acceptable.
- Unpacked
- RID does not change
- Less data movement when deleting
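The RID difference can be seen in a toy model of the two layouts. This sketch assumes the packed page fills a hole by moving the last record into it (one possible strategy; the slides do not say which is used), uses 0-based slot numbers, and uses class names invented here.

```python
class PackedPage:
    """Records are kept contiguous in slots 0..N-1."""
    def __init__(self, records):
        self.slots = list(records)

    def delete(self, slot):
        last = self.slots.pop()      # take the last record off the page
        if slot < len(self.slots):   # unless it was the one being deleted,
            self.slots[slot] = last  # move it into the hole (its RID changes)

class UnpackedPage:
    """Records stay in place; a bitmap marks which slots are occupied."""
    def __init__(self, records):
        self.slots = list(records)
        self.bitmap = [True] * len(self.slots)

    def delete(self, slot):
        self.bitmap[slot] = False    # no data movement

packed, unpacked = PackedPage("abcd"), UnpackedPage("abcd")
packed.delete(1)
unpacked.delete(1)
print(packed.slots)                     # record 'd' moved from slot 3 to slot 1
print(unpacked.slots, unpacked.bitmap)  # record 'd' is still in slot 3
```

Any index entry holding the old RID of the moved record is now stale, which is why the packed format may not be acceptable.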
49Page Formats Variable Length Records
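[Figure: page i holds variable-length records with Rid = (i,1), (i,2), ..., (i,N). A SLOT DIRECTORY at the end of the page stores one entry per slot (entries N ... 2 1, with example values 20, 16, 24), the number of slots N, and a pointer to the start of free space.]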
50Slotted Page Format
- Intergalactic Standard, for fixed length records also.
- How to deal with free space fragmentation?
- Pack records, lazily
- Note that RIDs don't change
- How are updates handled which expand the size of a record?
- Forwarding flag to new location
- http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
- postgresql-8.3.1\src\include\storage\bufpage.h
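A minimal sketch of a slotted page, to show why lazy packing leaves RIDs unchanged: records move, but each slot number keeps pointing at its record. It is a simplification (the slot directory is held in a Python list rather than at the end of the page, and forwarding is not modelled), and the class is illustrative, not PostgreSQL's layout.

```python
class SlottedPage:
    def __init__(self, size=64):
        self.data = bytearray(size)
        self.free_start = 0   # pointer to start of free space
        self.slots = []       # slot directory: (offset, length); offset -1 means empty

    def insert(self, record):
        if self.free_start + len(record) > len(self.data):
            self.compact()    # pack lazily, only when space runs out
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        for i, (off, _) in enumerate(self.slots):
            if off == -1:     # reuse an empty slot if there is one
                self.slots[i] = (offset, len(record))
                return i
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1

    def delete(self, slot):
        self.slots[slot] = (-1, 0)

    def read(self, slot):
        offset, length = self.slots[slot]
        if offset == -1:
            raise KeyError(slot)
        return bytes(self.data[offset:offset + length])

    def compact(self):
        """Slide live records to the front; slot numbers (and so RIDs) stay the same."""
        live = sorted((off, length, i) for i, (off, length) in enumerate(self.slots)
                      if off != -1)
        pos = 0
        for off, length, i in live:
            self.data[pos:pos + length] = self.data[off:off + length]
            self.slots[i] = (pos, length)
            pos += length
        self.free_start = pos

page = SlottedPage(size=16)
a, b, c = page.insert(b"aaaa"), page.insert(b"bbbbbb"), page.insert(b"cccc")
page.delete(b)
d = page.insert(b"dddddd")   # does not fit until the page is compacted
print(page.read(a), page.read(c), page.read(d), page.slots)
```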
51 9.7 Record Formats: Fixed Length
[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address B; the address of F3 is B + L1 + L2.]
- Information about field types is the same for all records in a file; it is stored in system catalogs.
- Finding the ith field does not require a scan of the record.
52Record Formats Variable Length
- Two alternative formats (# fields is fixed):
- Fields delimited by special symbols: a field count, then F1, F2, F3, F4 separated by a delimiter symbol
- Array of field offsets: an offset array at the start of the record points to F1, F2, F3, F4
- Second offers direct access to ith field, efficient storage of nulls (special "don't know" value); small directory overhead.
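A sketch of the second format, assuming 2-byte little-endian offsets and one extra offset marking the end of the record (both are choices made here, not taken from the slides). A null is stored as a zero-length field, so in this simplified version a null and an empty value look the same.

```python
import struct

def encode(fields):
    """fields: list of bytes values, or None for null. Returns the record bytes."""
    n = len(fields)
    pos = 2 * (n + 1)         # body starts right after the offset array
    offsets, body = [], b""
    for value in fields:
        offsets.append(pos)
        if value is not None:  # a null takes no space in the body
            body += value
            pos += len(value)
    offsets.append(pos)        # end-of-record offset
    return struct.pack(f"<{n + 1}H", *offsets) + body

def field(record, i):
    """Direct access to the ith field (0-based) without scanning earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return record[start:end] if end > start else None

rec = encode([b"Ann", None, b"CS", b"3.9"])
print(field(rec, 0), field(rec, 1), field(rec, 3))
```

The directory overhead here is one offset per field plus one, which is small next to typical record sizes of hundreds of bytes.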