Title: Storage Systems Part I
1. Storage Systems Part I
INF SERV Media Storage and Distribution Systems
2. Overview
- Disks
  - mechanics and properties
- Disk scheduling
  - traditional
  - real-time
  - stream oriented
3. Storage System
- The VoD storage system deals with issues like
  - data retrieval from storage devices
  - data placement and organization
  - QoS guarantees such as ensured continuous delivery
- It must consider the storage sub-system architecture for optimal performance
4. Disks
5. Disks I
- Disks are orders of magnitude slower than main memory, but are cheaper and have more capacity
- Disks are used to have a persistent system and manage huge amounts of information
- Because...
  - ...there is a large speed mismatch compared to main memory (this gap will increase according to Moore's law),
  - ...disk I/O is often the main performance bottleneck
  - ...we need to minimize the number of accesses,
  - ...
- we must look closer at how to manage disks
6. Disks II
- Two resources of importance
  - storage space
  - disk I/O bandwidth
- Several approaches to manage multimedia data on disks
  - specific disk scheduling and large buffers (traditional file structure)
  - optimize data placement for contiguous media (traditional retrieval mechanisms)
  - combinations of the above
7. Mechanics of Disks
Spindle: the axle around which the platters rotate
Tracks: concentric circles on a single platter
Platters: circular plates covered with magnetic material to provide nonvolatile storage of bits
Disk heads: read or alter the magnetism (bits) passing under them. The heads are attached to an arm enabling them to move across the platter surface
Sectors: segments of the track circle separated by non-magnetic gaps. The gaps are often used to identify the beginning of a sector
Cylinders: corresponding tracks on the different platters are said to form a cylinder
8. Disk Specifications
- Disk technology develops fast
- Some existing (Seagate) disks today
Note 1: disk manufacturers usually denote GB as 10^9 bytes, whereas computer quantities often are powers of 2, i.e., GB is 2^30 bytes
Note 2: there is usually a trade-off between speed and capacity
Note 3: there is a difference between internal and formatted transfer rate. The internal rate is measured at the platter only. The formatted rate is what remains after the signals have passed through the electronics (cabling loss, interference, retransmissions, checksums, etc.)
9. Disk Capacity
- The size of the disk is dependent on
  - the number of platters
  - whether the platters use one or both sides
  - number of tracks per surface
  - (average) number of sectors per track
  - number of bytes per sector
- Example (Cheetah X15)
  - 4 platters using both sides: 8 surfaces
  - 18497 tracks per surface
  - 617 sectors per track (average)
  - 512 bytes per sector
  - Total capacity: 8 x 18497 x 617 x 512 ≈ 4.67 x 10^10 bytes ≈ 43.5 GB (counting 1 GB as 2^30 bytes)
  - Formatted capacity: 36.7 GB
Note 1: the tracks on the edge of the platter are longer than the tracks close to the spindle. Today, most disks are zoned, i.e., the outer tracks have more sectors than the inner tracks
Note 2: there is a difference between formatted and total capacity. Some of the capacity is used for storing checksums, spare tracks, gaps, etc.
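The capacity arithmetic can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the inputs are the Cheetah X15 figures from the example.

```python
def disk_capacity_bytes(surfaces, tracks_per_surface, sectors_per_track, bytes_per_sector):
    """Raw (unformatted) capacity as the product of the geometry parameters."""
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector

# Cheetah X15 example: 4 platters, both sides used -> 8 surfaces.
total = disk_capacity_bytes(8, 18497, 617, 512)

print(total)                     # raw capacity in bytes
print(round(total / 10**9, 1))   # in units of 10^9 bytes (manufacturer GB)
print(round(total / 2**30, 1))   # in units of 2^30 bytes
```

The gap between the two printed GB values illustrates why the choice of unit matters when comparing data sheets.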
10. Disk Access Time I
- How do we retrieve data from disk?
  - position head over the cylinder (track) on which the block (consisting of one or more sectors) is located
  - read or write the data block as the sectors move under the head when the platters rotate
- The time between the moment a disk request is issued and the time the block is resident in memory is called disk latency or disk access time
11. Disk Access Time II
[Figure: disk platter, disk arm and disk head]
Disk access time = Seek time + Rotational delay + Transfer time + Other delays
12. Disk Access Time: Seek Time
- Seek time is the time to position the head
  - the heads require a minimum amount of time to start and stop moving
  - some time is used for actually moving the head, roughly proportional to the number of cylinders traveled
Typical average: 10 ms - 40 ms; 7.4 ms (Barracuda 180), 5.7 ms (Cheetah 36), 3.6 ms (Cheetah X15)
13. Disk Access Time: Rotational Delay
- Time for the disk platters to rotate so the first of the required sectors is under the disk head
Average delay is 1/2 revolution
Typical average: 8.33 ms (3600 RPM), 5.56 ms (5400 RPM), 4.17 ms (7200 RPM), 3.00 ms (10000 RPM), 2.00 ms (15000 RPM)
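The typical averages follow directly from the "half a revolution" rule. A minimal sketch (the function name is an illustrative choice):

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay: half of one revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

# Rotation speeds listed on the slide.
for rpm in (3600, 5400, 7200, 10000, 15000):
    print(rpm, "RPM:", round(avg_rotational_delay_ms(rpm), 2), "ms")
```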
14. Disk Access Time: Transfer Time
- Time for data to be read by the disk head, i.e., time it takes the sectors of the requested block to rotate past the head
- Transfer time = amount of data to read / transfer rate, where transfer rate = amount of data per track / time per rotation
- Example 1: if a disk has 250 KB per track and operates at 10000 RPM, we can read from the disk at 40.69 MB/s
- Example 2 (Barracuda 180): 406 KB per track x 7200 RPM ≈ 47.58 MB/s
- Example 3 (Cheetah X15): 316 KB per track x 15000 RPM ≈ 77.15 MB/s
- Access time is dependent on data density and rotation speed
- If we have to change track, time must also be added for moving the head
Note: one might achieve these transfer rates reading continuously on disk, but time must be added for seeks, etc.
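The three examples can be reproduced by multiplying the data per track by the number of revolutions per second. The sketch below assumes 1 MB = 1024 KB, which is the convention that matches the figures above. The function name is illustrative.

```python
def transfer_rate_mb_per_s(kb_per_track, rpm):
    """Sustained media rate when reading whole tracks back to back."""
    revolutions_per_second = rpm / 60
    kb_per_second = kb_per_track * revolutions_per_second
    return kb_per_second / 1024  # assumption: 1 MB = 1024 KB

print(round(transfer_rate_mb_per_s(250, 10000), 2))  # Example 1
print(round(transfer_rate_mb_per_s(406, 7200), 2))   # Barracuda 180
print(round(transfer_rate_mb_per_s(316, 15000), 2))  # Cheetah X15
```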
15. Disk Access Time: Other Delays
- There are several other factors which might introduce additional delays
  - CPU time to issue and process I/O
  - contention for controller
  - contention for bus
  - contention for memory
  - verifying block correctness with checksums (retransmissions)
  - waiting in scheduling queue
  - ...
- Typical values: 0 (maybe except for waiting in the queue)
16. Disk Throughput
- How much data can we retrieve per second?
- Throughput = data size / access time (including all delays)
- Example: for each operation we have
  - average seek
  - average rotational delay
  - transfer time
  - no gaps, etc.
- Cheetah X15: 4 KB blocks: 0.71 MB/s, 64 KB blocks: 11.42 MB/s
- Barracuda 180: 4 KB blocks: 0.35 MB/s, 64 KB blocks: 5.53 MB/s
Note: to increase overall throughput, one should read as much as possible contiguously on disk
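The effect of block size on throughput can be explored with a small model that charges one average seek, one average rotational delay and the transfer time per block. This is a sketch under stated assumptions: 1 MB = 1024 KB, the Cheetah X15 values quoted earlier (3.6 ms seek, 2.0 ms rotational delay, 77.15 MB/s media rate), and a hypothetical function name. The exact figures depend on unit conventions and on which delays are counted, so the output is not expected to match the slide's numbers digit for digit.

```python
def throughput_mb_per_s(block_kb, seek_ms, rotational_ms, media_rate_mb_per_s):
    """Effective throughput when every block costs seek + rotation + transfer."""
    transfer_ms = block_kb / (media_rate_mb_per_s * 1024) * 1000
    access_ms = seek_ms + rotational_ms + transfer_ms
    return (block_kb / 1024) / (access_ms / 1000)

# Cheetah X15 parameters as quoted in the earlier slides.
cheetah_small = throughput_mb_per_s(4, 3.6, 2.0, 77.15)
cheetah_large = throughput_mb_per_s(64, 3.6, 2.0, 77.15)
print(round(cheetah_small, 2), round(cheetah_large, 2))
```

Whatever the exact constants, larger blocks amortize the positioning cost over more data, which is the point of the note above.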
17. Some Complicating Issues
- There are several complicating factors
  - the other delays described earlier, like consumed CPU time, resource contention, etc.
  - zoned disks, i.e., outer tracks are longer and therefore usually have more sectors than inner tracks
  - checksums are also stored with each of the sectors
[Figure: zoned disk with inner and outer tracks]
Note 1: transfer rates are higher on outer tracks
Note 2: gaps between sectors
Note 3: the checksum is read for each track and used to validate the track
Note 4: the checksum is usually calculated using Reed-Solomon interleaved with CRC
Note 5: for older drives the checksum is 16 bytes
Note 6: SCSI disks may be changed by the user to have other sector sizes
18. Writing and Modifying Blocks
- A write operation is analogous to a read operation
  - must add time for block allocation
  - a complication occurs if the write operation has to be verified: we must wait another rotation and then read the block to see if it is the block we wanted to write
  - Total write time ≈ read time + time for one rotation
- Cannot modify a block directly
  - read block into main memory
  - modify the block
  - write new content back to disk
  - (verify the write operation)
  - Total modify time ≈ read time + time to modify + write time
19. Disk Controllers
- To manage the different parts of the disk, we use a disk controller, which is a small processor capable of
  - controlling the actuator moving the head to the desired track
  - selecting which platter and surface to use
  - knowing when the right sector is under the head
  - transferring data between main memory and disk
- New controllers act like small computers themselves
  - both disk and controller now have their own buffer, reducing disk access time
  - data on damaged disk blocks/sectors are just moved to spare room on the disk; the system above (OS) does not know this, i.e., a block may lie elsewhere than the OS thinks
20. Efficient Secondary Storage Usage
- Many programs are assumed to fit in main memory, but one must assume that data is larger than main memory
- Must take into account the use of secondary storage
  - there are large access time gaps, i.e., a disk access will probably dominate the total execution time
  - there may be huge performance improvements if we reduce the number of disk accesses
  - a slow algorithm with few disk accesses will probably outperform a fast algorithm with many disk accesses
- Several ways to optimize ...
  - disk scheduling
  - block size
  - multiple disks
  - prefetching
  - file management / data placement
  - memory caching / replacement algorithms
21. Disk Scheduling
22. Disk Scheduling I
- Seek time is a dominant factor of total disk I/O time
- Let operating system or disk controller choose which request to serve next depending on current position on disk and requested block's position on disk (disk scheduling)
- Note that disk scheduling ≠ CPU scheduling
  - a mechanical device: hard to determine (accurate) access times
  - disk accesses cannot be preempted: a request runs until it finishes
  - disk I/O often the main performance bottleneck
- General goals
  - short response time
  - high overall throughput
  - fairness (equal probability for all blocks to be accessed in the same time)
- Tradeoff: seek and rotational delay vs. maximum response time
23. Disk Scheduling II
- Several traditional algorithms
  - First-Come-First-Serve (FCFS)
  - Shortest Seek Time First (SSTF)
  - SCAN (and variations)
  - LOOK (and variations)
24. First-Come-First-Serve (FCFS)
- FCFS serves the first arriving request first
  - Long seeks
  - Short average response time
incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24
[Figure: head movement over time, serving 12, 14, 2, 7, 21, 8, 24 in arrival order]
Note: the lines only indicate some time, not the exact amount
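A minimal sketch of FCFS on the request list from the slide, together with a helper that sums the head movement. The initial head position (cylinder 12) and both function names are assumptions for illustration.

```python
def head_travel(start, order):
    """Total number of cylinders the head moves when serving `order`."""
    travel, position = 0, start
    for cylinder in order:
        travel += abs(cylinder - position)
        position = cylinder
    return travel

def fcfs(requests):
    """FCFS serves requests exactly in arrival order."""
    return list(requests)

requests = [12, 14, 2, 7, 21, 8, 24]
order = fcfs(requests)
print(order, head_travel(12, order))  # assumed start: head at cylinder 12
```

The travel figure gives a baseline against which seek-optimizing schedulers can be compared.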
25. Shortest Seek Time First (SSTF)
- SSTF serves closest request first
  - short seek times
  - longer maximum seek times, may even lead to starvation
incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24
[Figure: scheduling queue and head movement over time under SSTF]
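SSTF can be sketched as repeatedly picking the pending request closest to the current head position. The starting cylinder (12) and the function name are assumptions, and seek time is approximated by cylinder distance.

```python
def sstf(start, requests):
    """Greedy shortest-seek-time-first order, using distance as seek cost."""
    pending = list(requests)
    position = start
    order = []
    while pending:
        nearest = min(pending, key=lambda cylinder: abs(cylinder - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order

requests = [12, 14, 2, 7, 21, 8, 24]
sstf_order = sstf(12, requests)  # assumed start: head at cylinder 12
print(sstf_order)
```

With a steady stream of new requests near the head, far-away requests would keep losing the `min` selection, which is the starvation risk mentioned above.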
26. SCAN
- SCAN (elevator) moves head edge to edge and serves requests on the way
  - bi-directional
  - compromise between response time and seek time optimizations
incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24
[Figure: scheduling queue and head movement over time under SCAN]
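The SCAN service order for a fixed set of pending requests can be sketched by splitting the requests around the head position. The start cylinder, the initial direction and the function name are assumptions. Requests arriving during the sweep are not modeled.

```python
def scan_order(start, requests, moving_up=True):
    """Service order for one SCAN pass and its return pass."""
    if moving_up:
        first = sorted(c for c in requests if c >= start)
        second = sorted((c for c in requests if c < start), reverse=True)
    else:
        first = sorted((c for c in requests if c <= start), reverse=True)
        second = sorted(c for c in requests if c > start)
    return first + second

requests = [12, 14, 2, 7, 21, 8, 24]
print(scan_order(12, requests))                   # sweeping towards higher cylinders
print(scan_order(12, requests, moving_up=False))  # sweeping towards lower cylinders
```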
27. C-SCAN
- Circular-SCAN moves head from edge to edge
  - serves requests in one direction only (uni-directional)
  - improves response time (fairness)
incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24
[Figure: scheduling queue and head movement over time under C-SCAN]
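For C-SCAN the return sweep serves nothing, so the requests behind the head are served in ascending order after the head has jumped back. A sketch, assuming an upward service direction, a head starting at cylinder 12 and an illustrative function name:

```python
def cscan_order(start, requests):
    """Service order for C-SCAN with upward service direction."""
    ahead = sorted(c for c in requests if c >= start)
    wrapped = sorted(c for c in requests if c < start)  # served after the jump back
    return ahead + wrapped

requests = [12, 14, 2, 7, 21, 8, 24]
cscan = cscan_order(12, requests)
print(cscan)
```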
28. SCAN vs. C-SCAN
- Why is C-SCAN on average better in reality than SCAN when both service the same number of requests in two passes?
  - modern disks must accelerate (speed up and down) when seeking
  - head movement formula: seek time ≈ fixed overhead + seek time constant x n, if n is large (n = number of tracks/cylinders traveled)
[Figure: seek time as a function of cylinders traveled]
29. LOOK and C-LOOK
- LOOK (C-LOOK) is a variation of SCAN (C-SCAN)
  - same schedule as SCAN
  - does not run to the edges
  - stops and returns at outer- and innermost request
  - increased efficiency
- SCAN vs. LOOK example
incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24
scheduling queue: 2, 7, 8, 24, 21, 14, 12
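The efficiency gain of LOOK over SCAN is the head travel saved by not running to the edge. The sketch below compares the two for one upward sweep plus the return sweep, counting travel until the last request is served. The start cylinder (12), the outermost cylinder (25) and the function name are assumptions.

```python
def sweep_travel(start, requests, max_cylinder=None):
    """Head travel for an upward sweep followed by the return sweep.

    With max_cylinder set, the head runs to the edge before turning (SCAN);
    with None it turns at the outermost request (LOOK).
    """
    upper = [c for c in requests if c >= start]
    lower = [c for c in requests if c < start]
    turn = max_cylinder if max_cylinder is not None else max(upper, default=start)
    travel = turn - start
    if lower:
        travel += turn - min(lower)  # return sweep down to the innermost request
    return travel

requests = [12, 14, 2, 7, 21, 8, 24]
look_travel = sweep_travel(12, requests)
scan_travel = sweep_travel(12, requests, max_cylinder=25)
print(look_travel, scan_travel)
```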
30. V-SCAN(R)
- V-SCAN(R) combines SCAN (LOOK) and SSTF
  - define an R-sized unidirectional SCAN (LOOK) window, i.e., C-SCAN (C-LOOK)
  - V-SCAN(0.6) makes a C-SCAN (C-LOOK) window over 60% of the cylinders
  - uses SSTF for requests outside the window
  - V-SCAN(0.0) is equivalent to SSTF
  - V-SCAN(1.0) is equivalent to SCAN (C-LOOK)
  - V-SCAN(0.2) is supposed to be an appropriate configuration
[Figure: cylinder number axis, 1 to 25]
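One common way to express the SSTF/SCAN continuum is to pick the request with the smallest "effective distance", where a request that requires reversing direction is penalized by R times the number of cylinders. The sketch below uses that penalty formulation, which may differ in detail from the window description on the slide. The function name, the start cylinder (12) and the cylinder count (25) are assumptions.

```python
def vscan(start, requests, r, num_cylinders, direction=1):
    """Serve the request with the smallest effective distance.

    Effective distance = distance + r * num_cylinders when the request
    lies against the current direction of head movement.
    """
    pending = list(requests)
    position = start
    order = []
    while pending:
        def effective(cylinder):
            distance = abs(cylinder - position)
            if cylinder == position:
                return distance
            going = 1 if cylinder > position else -1
            return distance + (r * num_cylinders if going != direction else 0)

        chosen = min(pending, key=effective)
        if chosen != position:
            direction = 1 if chosen > position else -1
        pending.remove(chosen)
        order.append(chosen)
        position = chosen
    return order

requests = [12, 14, 2, 7, 21, 8, 24]
print(vscan(12, requests, 0.0, 25))  # no penalty: behaves like SSTF
print(vscan(12, requests, 1.0, 25))  # full penalty: behaves like LOOK
```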
31. Continuous Media Disk Scheduling
- Suitability of classical algorithms
  - minimal disk arm movement (short seek times)
  - no provision of time or deadlines
  - generally not suitable
- Continuous media requirements
  - serve both periodic and aperiodic requests
  - never miss deadline due to aperiodic requests
  - aperiodic requests must not starve
  - support multiple streams
  - balance buffer space and efficiency tradeoff
32. Real-Time Disk Scheduling
- Targeted for real-time applications with deadlines
- Several proposed algorithms
  - earliest deadline first (EDF)
  - SCAN-EDF
  - shortest seek and earliest deadline by ordering/value (SSEDO / SSEDV)
  - priority SCAN (PSCAN)
  - ...
33. Earliest Deadline First (EDF)
- EDF serves the request with nearest deadline first
  - non-preemptive (i.e., a request with a shorter deadline must wait)
  - excessive seeks
  - poor throughput
incoming requests (cylinder,deadline pairs in order of arrival): (12,5), (14,6), (2,4), (7,7), (21,1), (8,2), (24,3)
[Figure: scheduling queue and head movement over time under EDF]
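EDF on the request list from the slide amounts to sorting by deadline. A sketch, reading each pair as (cylinder, deadline) and using an illustrative function name:

```python
def edf(requests):
    """Order (cylinder, deadline) requests by earliest deadline."""
    return sorted(requests, key=lambda request: request[1])

requests = [(12, 5), (14, 6), (2, 4), (7, 7), (21, 1), (8, 2), (24, 3)]
edf_cylinders = [cylinder for cylinder, _ in edf(requests)]
print(edf_cylinders)
```

The resulting cylinder sequence jumps back and forth across the disk, which is where the excessive seeks and poor throughput come from.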
34. SCAN-EDF
- SCAN-EDF combines SCAN and EDF
  - the real-time aspects of EDF
  - seek optimizations of SCAN
  - especially useful if the end of the period of a batch is the deadline
  - increase efficiency by modifying the deadlines
- method
  - serve requests with earlier deadline first (EDF)
  - sort requests with same deadline by track location (SCAN)
incoming requests (cylinder,deadline pairs in order of arrival): (2,3), (14,1), (9,3), (7,2), (21,1), (8,2), (24,2), (16,1)
[Figure: scheduling queue and head movement over time under SCAN-EDF]
Note: similarly, we can combine EDF with C-SCAN, LOOK or C-LOOK
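The method can be sketched as a two-key sort: deadline first, cylinder second. This simplification always sweeps each deadline group in ascending cylinder order, whereas a full SCAN would alternate direction. The function name is illustrative, and each pair is read as (cylinder, deadline).

```python
def scan_edf(requests):
    """Sort by deadline, breaking ties by cylinder (one-directional sweep)."""
    return sorted(requests, key=lambda request: (request[1], request[0]))

requests = [(2, 3), (14, 1), (9, 3), (7, 2), (21, 1), (8, 2), (24, 2), (16, 1)]
scan_edf_cylinders = [cylinder for cylinder, _ in scan_edf(requests)]
print(scan_edf_cylinders)
```

The more requests share a deadline, the longer each sweep becomes, which is why batching deadlines at the end of a period improves efficiency.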
35. Stream Oriented Disk Scheduling
- Targeted for streaming contiguous media data
- Several algorithms proposed
  - group sweep scheduling (GSS)
  - mixed disk scheduling strategy
  - contiguous media file system (CMFS)
  - lottery scheduling
  - stride scheduling
  - batched SCAN (BSCAN)
  - greedy-but-safe EDF (GS_EDF)
  - bubble up
  - ...
  - MARS scheduler
  - Chello
  - adaptive disk scheduler for mixed media workloads (APEX)
Note: multimedia applications may require both RT and NRT data, so it is desirable to have all on the same disk
36. Group Sweep Scheduling (GSS)
- GSS combines Round-Robin (RR) and SCAN
- requests are serviced in rounds (cycles)
- principle
  - divide S active streams into G groups
  - service the G groups in RR order
  - service each stream in a group in C-SCAN order
  - playout can start at the end of the group
- special cases
  - G = S: RR scheduling
  - G = 1: SCAN scheduling
- tradeoff between buffer space and disk arm movement
  - try different values for G and select the one giving the minimum buffer requirement
  - a large G → smaller groups, more arm movements, smaller buffers (reuse)
  - a small G → larger groups, less arm movements, larger buffers
- with high loads and equal playout rates, GSS and SCAN often service streams in the same order
- replacing RR with FIFO and grouping requests by deadline gives SCAN-EDF
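37. Group Sweep Scheduling (GSS)
- GSS example: streams A, B, C and D → g1 = {A, C} and g2 = {B, D}
  - RR group schedule
  - C-SCAN block schedule within a group
[Figure: blocks on disk: A2, A1, A3, B2, B3, B1, C1, C2, C3, D3, D1, D2]
[Figure: resulting schedule]
  - round 1: g1 {A, C}: A1, C1; g2 {B, D}: B1, D1
  - round 2: g1 {C, A}: C2, A2; g2 {B, D}: B2, D2
  - round 3: g1 {A, C}: A3, C3; g2 {B, D}: B3, D3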
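One GSS round can be sketched as a fixed round-robin pass over the groups, with the streams inside each group ordered by the disk position of their next block. The cylinder numbers below are hypothetical, chosen only so that the within-group order changes between rounds as in the example. The function name is also an illustrative choice.

```python
def gss_round(groups, positions):
    """One GSS round: groups in fixed RR order, streams in a group by block position."""
    schedule = []
    for group in groups:
        schedule.extend(sorted(group, key=lambda stream: positions[stream]))
    return schedule

groups = [["A", "C"], ["B", "D"]]

# Hypothetical cylinder of each stream's next block in two consecutive rounds.
round1_positions = {"A": 3, "B": 11, "C": 14, "D": 21}
round2_positions = {"A": 18, "B": 7, "C": 15, "D": 22}

print(gss_round(groups, round1_positions))
print(gss_round(groups, round2_positions))
```

Because a stream can move from first to last within its group between rounds, the buffer per stream must cover the time of roughly one round plus one group, which is the buffer side of the tradeoff when choosing G.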
38. Mixed Disk Scheduling Strategy (MDSS)
- MDSS combines SSTF with buffer overflow and underflow prevention
  - data delivered to several buffers (one per stream)
  - disk bandwidth share allocated according to buffer fill level
  - SSTF is used to schedule the requests
[Figure: share allocator and SSTF scheduler]
39. Continuous Media File System Disk Scheduling
- CMFS provides (proposes) several algorithms
  - determines new schedule on completion of each request
  - orders requests so that no deadline violations occur; delays new streams until it is safe to proceed (admission control)
- all based on slack-time
  - amount of time that can be used for non-real-time requests or
  - work-ahead for continuous media requests
  - based on amount of data in buffers and deadlines of next requests (how long can I delay the request before violating the deadline?)
- useful algorithms
  - greedy: serve one stream as long as possible
  - cyclic: always serve the stream with shortest slack-time
40. MARS Disk Scheduler
- Massively-parallel And Real-time Storage (MARS) scheduler supports mixed media on a single system
  - a two-level scheduling
  - top-level: 1 NRT queue and n (1) RT queue (SCAN, but future GSS, SCAN-EDF, or ...)
  - use deficit RR fair queuing to assign quantums to each queue per round; this divides total bandwidth among queues
  - bottom-level: select requests from queues according to quantums, use SCAN order
  - work-conserving (variable round times, new round starts immediately)
[Figure: NRT and RT queues feeding the deficit round robin fair queuing job selector]
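The top-level bandwidth split can be illustrated with a generic deficit round robin sketch. This is not the MARS implementation: the queue names, quantum sizes, request costs (in KB) and the function name are assumptions, and the bottom-level SCAN ordering is left out.

```python
from collections import deque

def drr_round(queues, quantums, deficits):
    """Run one deficit-round-robin round and return the selected requests."""
    selected = []
    for name, queue in queues.items():
        if not queue:
            deficits[name] = 0  # idle queues do not accumulate credit
            continue
        deficits[name] += quantums[name]
        # Serve requests while the head-of-line request fits in the credit.
        while queue and queue[0][1] <= deficits[name]:
            request, cost = queue.popleft()
            deficits[name] -= cost
            selected.append(request)
        if not queue:
            deficits[name] = 0
    return selected

# Hypothetical workload: (request id, cost in KB).
queues = {
    "RT": deque([("v1", 64), ("v2", 64), ("v3", 64)]),
    "NRT": deque([("q1", 96), ("q2", 32)]),
}
quantums = {"RT": 128, "NRT": 64}
deficits = {"RT": 0, "NRT": 0}

first_round = drr_round(queues, quantums, deficits)
second_round = drr_round(queues, quantums, deficits)
print(first_round, second_round)
```

In the first round the NRT request is too large for its quantum, so the unused credit carries over and lets it through in the second round. That carry-over is what makes the long-term shares follow the quantums.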
41. Chello
- Chello is part of the Symphony FS supporting mixed media
  - two-level scheduling
  - top-level: n (3) service classes (queues)
    - deadline (end-of-round) real-time (EDF)
    - throughput intensive best effort (FCFS)
    - interactive best effort (FCFS)
    - divides total bandwidth among queues according to a static proportional allocation scheme (equal to MARS' job selector)
  - bottom-level: class independent scheduler (FCFS)
    - select requests from queues according to quantums
    - sort requests from each queue in SCAN order when transferred
  - partially work-conserving (extra requests might be added at the end of the class independent scheduler if space, but constant rounds)
[Figure: deadline RT, throughput intensive best-effort and interactive best-effort queues; each queue is sorted in SCAN order when transferred]
42. Adaptive Disk Scheduler for Mixed Media Workloads
- APEX is another mixed media scheduler designed for MM DBSs
  - two-level scheduling similar to Chello and MARS
  - uses token bucket for traffic shaping (bandwidth allocation)
  - the batch builder selects requests in FCFS order from the queues based on number of tokens; each queue must sort according to deadline (or another strategy)
  - work-conserving
    - adds extra requests if possible to a batch
    - starts extra batch between ordinary batches
[Figure: Batch Builder]
43. APEX, Chello and C-LOOK Comparison
- Results from Ketil Lund (2002)
- Configuration
  - Atlas Quantum 10K
  - Avg. seek: 5.0 ms
  - Avg. latency: 3.0 ms
  - transfer rate: 18 - 26 MB/s
  - data placement: random, video and audio multiplexed
  - round time: 1 second
  - block size: 64 KB
- Video playback and user queries
- Six video clients
  - Each playing back a random video
  - Random start time (after 17 secs, all have started)
44. APEX, Chello and C-LOOK Comparison
- Nine different user-query traces, each with the following characteristics
  - Inter-arrival time of queries is exponentially distributed, with a mean of 10 secs
  - Each query requests between two and 1011 pages
  - Inter-arrival time of disk requests in a query is exponentially distributed, with a mean of 9.7 ms
- Start with one trace, and then add traces, in order to increase workload (so queries may overlap)
- Video data disk requests are assigned to a real-time queue
- User-query disk requests are assigned to a best-effort queue
- Bandwidth is shared 50/50 between real-time queue and best-effort queue
- We measure response times (i.e., time from when a request arrives at the disk scheduler until data is placed in the buffer) for user-query disk requests, and check whether deadline violations occur for video data disk requests
45. APEX, Chello and C-LOOK Comparison
[Figure: deadline violations (video)]
46. Disk Scheduling Today
- Most algorithms assume linear head movement overhead, but this is not the case (acceleration)
- Disk buffer caches may use read-ahead prefetching
- The disk parameters exported to the OS may be completely different from the actual disk mechanics
- Modern disks (often) have a built-in SCAN scheduler
- Actual VoD server implementation (???)
  - hierarchical software scheduler
  - several top-level queues, at least
    - RT (EDF)
    - NRT (FCFS)
  - process queues in rounds (RR)
    - dynamic assignment of quantums
    - work-conserving with variable round length (full disk bandwidth utilization vs. buffer requirement)
  - only simple collection of requests according to quantums in lowest level and forwarding to disk, because ...
  - ... fixed SCAN scheduler in hardware (on disk)
[Figure: EDF / FCFS queues above the SCAN scheduler on disk]
47. The End: Summary
48. Summary
- The main bottleneck is disk I/O performance due to disk mechanics: seek time and rotational delays
- Many algorithms try to minimize seek overhead (most existing systems use a SCAN derivative)
- World today more complicated (both different media and unknown disk characteristics)
- Next week, distribution (part II)
- In two weeks, storage systems (part II)
  - data placement
  - multiple disks
  - memory caching
  - ...
49. Some References
- Anderson, D. P., Osawa, Y., Govindan, R.: A File System for Continuous Media, ACM Transactions on Computer Systems, Vol. 10, No. 4, Nov. 1992, pp. 311-337
- Elmasri, R. A., Navathe, S.: Fundamentals of Database Systems, Addison Wesley, 2000
- Garcia-Molina, H., Ullman, J. D., Widom, J.: Database Systems: The Complete Book, Prentice Hall, 2002
- Lund, K.: Adaptive Disk Scheduling for Multimedia Database Systems, PhD thesis, IFI/UniK, UiO (to be finished soon)
- Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: Operating System Support for Multimedia Systems, Computer Communications, Vol. 23, No. 3, February 2000, pp. 267-289
- Seagate Technology, http://www.seagate.com
- Sitaram, D., Dan, A.: Multimedia Servers: Applications, Environments, and Design, Morgan Kaufmann Publishers, 2000