Title: Storage Systems Part I
1Storage Systems Part I
INF SERV Media Storage and Distribution Systems
2Overview
- Block size
- Data placement
- Multiple disks
- Prefetching
- Managing heterogeneous disks
- Memory caching
3Block Size
4Block Size I
- The block size may have large effects on performance
- Example: assume random block placement on disk and sequential file access
  - doubling the block size will halve the number of disk accesses
    - each access takes some more time to transfer the data, but the total transfer time is the same (i.e., more data per request)
    - half of the seek times and rotational delays are omitted
  - e.g., when increasing the block size from 2 KB to 4 KB (no gaps, ...) for a Cheetah X15, typically an average of
    - 3.6 ms is saved in seek time
    - 2 ms is saved in rotational delay
    - 0.026 ms is added per transfer
    - saving a total of 5.6 ms when reading 4 KB (49.8 %)
  - e.g., increasing from 2 KB to 64 KB saves 96.4 % when reading 64 KB
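The arithmetic above can be checked with a short sketch. The seek, rotation, and transfer figures are the averages quoted on the slide for the Cheetah X15, not exact datasheet values:

```python
import math

# Assumed averages from the slide (illustrative, not datasheet values)
AVG_SEEK_MS = 3.6          # average seek time
AVG_ROT_MS = 2.0           # average rotational delay (15,000 rpm)
XFER_MS_PER_2KB = 0.026    # transfer time for one 2 KB block

def read_time_ms(total_kb, block_kb):
    """Time to read total_kb with random block placement: every
    block access pays a seek, a rotational delay, and its transfer."""
    accesses = math.ceil(total_kb / block_kb)
    transfer = (block_kb / 2) * XFER_MS_PER_2KB
    return accesses * (AVG_SEEK_MS + AVG_ROT_MS + transfer)

saving_4kb = read_time_ms(4, 2) - read_time_ms(4, 4)
saving_64kb_pct = 100 * (1 - read_time_ms(64, 64) / read_time_ms(64, 2))
print(round(saving_4kb, 1), round(saving_64kb_pct, 1))  # 5.6 96.4
```

This reproduces both slide numbers: 5.6 ms (49.8 %) saved when reading 4 KB with 4 KB instead of 2 KB blocks, and 96.4 % saved when reading 64 KB with 64 KB blocks.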
5Block Size II
- Thus, increasing the block size can increase performance by reducing seek times and rotational delays
- However, a large block size is not always best
  - blocks spanning several tracks still introduce latencies
  - small data elements may occupy only a fraction of the block
- Which block size to use therefore depends on data size and data reference patterns
- The trend, however, is to use large block sizes as new technology appears with increased performance, at least in high data rate systems
6Data Placement on Disk
7Data Placement on Disk I
- Disk blocks can be assigned to files in many ways, and several schemes are designed for
  - optimized latency
  - increased throughput
  - access pattern dependency
- Multimedia server approaches
  - interactive applications
    - popularity-based placement
    - striping and clustering
  - streaming applications
    - continuous placement
    - striping and clustering
    - replication
  - cross relations between objects
    - no (at least only little) research yet
8Data Placement on Disk II
- Constant angular velocity (CAV) disks
  - equal amount of data in each track (and thus constant transfer time)
  - constant rotation speed
- Zoned CAV disks
  - zones are ranges of tracks
  - typically few zones
  - different amount of data on tracks in different zones, i.e., more data on outer tracks
- One should always place often used or high rate data on outermost tracks!?
  - NO, arm movement is often more important than transfer time
9Data Placement on Disk III
- What is the connection between data popularity and placement?
  - one could gain from placing popular data at the right place (how?)
  - zones might be important for placement (why?)
[figure: transfer rates across the disk surface, zoned vs. not zoned]
10Data Placement on Disk IV
- Continuous placement stores all disk blocks of a file contiguously on disk
  - minimal disk arm movement when reading the whole file (possible advantage)
  - head must not move between read operations (often WRONG: we read other files as well)
  - real advantage: do not have to pre-determine the block size (whatever amount we read, at most track-to-track seeks are performed)
[figure: files A, B, and C each placed contiguously on disk]
11Using Adjacent Sectors, Cylinders and Tracks
- To avoid seek time (and possibly rotational delay), we can store data likely to be accessed together on
  - adjacent sectors (similar to using larger blocks)
  - if the track is full, use another track on the same cylinder (only use another head)
  - if the cylinder is full, use the next cylinder (track-to-track seek)
- Advantage
  - can approach the theoretical transfer rate (no seeks or rotational delays)
- Disadvantage
  - no gain if we have unpredictable disk accesses
12Data Placement on Disk V
- Interleaved placement tries to store blocks from a file with a fixed number of other blocks in-between each block
  - minimal disk arm movement when reading the files A, B, and C
  - fine for predictable workloads reading multiple files
- Non-interleaved (or even random) placement can be used for highly unpredictable workloads
13Data Placement on Disk VI
- Organ-pipe placement considers the usual disk head position
  - place the most popular data where the head most often is
  - the center of the disk is closest to the head on CAV disks; a bit outward for zoned CAV disks (modified organ-pipe)
  - Note: the skew depends on the tradeoff between zoned transfer time and seek time
[figure: access frequency over track position (innermost to outermost) for organ-pipe and modified organ-pipe placement]
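The organ-pipe idea can be sketched as a simple placement routine; the block ids and popularity counts below are hypothetical, and the center position would be skewed outward for a zoned (modified organ-pipe) layout:

```python
def organ_pipe_layout(popularity):
    """Order blocks across tracks so the most popular data sits
    where the head usually is (mid-disk for CAV disks).
    popularity: dict mapping block id -> access frequency.
    Returns a list of block ids indexed by track position."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    n = len(ranked)
    layout = [None] * n
    center = (n - 1) // 2
    for i, block in enumerate(ranked):
        offset = (i + 1) // 2              # 0, 1, 1, 2, 2, ...
        # alternate right/left of the center track
        pos = center + offset if i % 2 else center - offset
        layout[pos] = block
    return layout

print(organ_pipe_layout({"a": 9, "b": 5, "c": 3, "d": 1}))
```

The most popular block "a" lands on the center track, with popularity falling off toward both edges.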
14Fast File System
- FFS is a general file system
  - idea is to keep the inode and associated data blocks close (no long seeks when getting the inode and data)
  - organizes the disk in partitions of cylinder groups, each having
    - several inodes
    - a free block bitmap
    - ...
  - tries to store a file within a cylinder group
    - next block on the same cylinder
    - a block within the cylinder group
    - find a block in another group using a hash function
    - search all cylinder groups for a free block
15Log-Structured File System
- Log-structured placement is based on the assumptions (facts?) that
  - RAM memory is getting larger
  - writes are most expensive
  - reads can often be served from the buffer cache (!!??)
- Organize disk blocks as a circular log
  - periodically, all pending (so far buffered) writes are performed as a batch
  - write on the next free block regardless of content (inode, directory, data, ...)
  - a cleaner reorganizes holes and deleted blocks in the background
  - stores blocks contiguously when writing a single file
  - efficient for small writes; other operations as in a traditional UNIX FS
[figure: circular log laid out on disk]
16Minorca File System I
- Minorca is a multimedia file system (from IFI/UiO)
  - enhanced allocation of disk blocks for continuous storage of media files
  - supports both continuous and non-continuous files in the same system using different placement policies
- Multimedia-Oriented Split Allocation (MOSA): one file system, two sections
  - cylinder group sections (CGSs) for non-continuous files
    - like traditional BSD FFS disk partitions
    - small block sizes (like 4 or 8 KB)
    - traditional FFS operations
  - extent sections for continuous files
    - extents contain one or more (adjacent) CGSs
      - summary information
      - allocation bitmap
      - data block area
    - expected to store one media file
    - large block sizes (e.g., 64 KB)
    - new transparent file operations, create file using O_CREATEXT
[figure: disk layout with cylinder groups followed by a series of extents]
17Minorca File System II
- Count-augmented address indexing in the extent section
  - observation: indirect block reads introduce disk I/O and break access locality (e.g., when following the inode)
  - introduce a new inode structure
    - add a count field to the original direct entries: direct points to a disk block, and count indicates how many other blocks follow the first block (contiguously)
    - if continuous allocation is assured, each direct entry is able to address many more blocks without retrieving an additional indirect block
[figure: inode with attributes, paired direct/count entries 0 through 11, and single/double/triple indirect pointers]
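A minimal sketch of how count-augmented direct entries resolve a logical block number; the entry values are hypothetical:

```python
def lookup_block(direct_entries, logical_block):
    """Resolve a logical block number using count-augmented direct
    entries: each (direct, count) pair addresses the contiguous run
    direct, direct+1, ..., direct+count (count followers)."""
    for direct, count in direct_entries:
        run_length = count + 1            # first block + followers
        if logical_block < run_length:
            return direct + logical_block
        logical_block -= run_length
    raise IndexError("beyond the direct entries; an indirect block is needed")

# two hypothetical runs: physical blocks 100-103 and 500-501
entries = [(100, 3), (500, 1)]
print(lookup_block(entries, 2), lookup_block(entries, 4))  # 102 500
```

With plain direct entries, six logical blocks would consume six entries; here two entries cover them, which is the locality gain the slide describes.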
18Other File Systems Examples
- Continuous allocation
  - Presto
    - similar to Minorca extents for continuous files
    - doesn't support small, discrete files
  - Fellini
    - simple flat file system
    - maintains a free block list with grouping of contiguous blocks
  - Continuous Media File System
- Several systems use multiple disks and stripe data
  - Symphony
  - Tiger Shark
  - Tiger
19Prefetching
20Prefetching
- If we can predict the access pattern, one might speed up performance using prefetching
  - a video playout is often linear: easy to predict the access pattern
  - eases disk scheduling
  - read larger amounts of data per request
  - data is in memory when requested, reducing page faults
- One way of doing prefetching is read-ahead
  - read more than the requested block into memory
  - serve the next read requests from the buffer cache
- Another way of doing prefetching is double (multiple) buffering
  - read data into the first buffer
  - process data in the first buffer and at the same time read data into the second buffer
  - process data in the second buffer and at the same time read data into the first buffer
  - etc.
21Multiple Buffering I
- Example: have a file with block sequence B1, B2, ...; our program processes the data sequentially, i.e., B1, B2, ...
- single buffer solution
  - read B1 into the buffer
  - process data in the buffer
  - read B2 into the buffer
  - process data in the buffer
  - ...
  - if P = time to process a block, R = time to read in 1 block, n = number of blocks:
    single buffer time = n (P + R)
[figure: disk reads into one memory buffer; the process consumes from the same buffer]
22Multiple Buffering II
- double buffer solution
  - read B1 into buffer1
  - process data in buffer1, read B2 into buffer2
  - process data in buffer2, read B3 into buffer1
  - process data in buffer1, read B4 into buffer2
  - ...
  - if P = time to process a block, R = time to read in 1 block, n = number of blocks:
    if P >= R, double buffer time = R + nP
  - if P < R, we can try to add buffers (n-buffering)
[figure: disk reads into two memory buffers; the process consumes one buffer while the other is being filled]
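The two timing formulas above can be written down directly; the parameter values in the example are hypothetical:

```python
def single_buffer_time(n, P, R):
    """Read then process each of the n blocks strictly in turn."""
    return n * (P + R)

def double_buffer_time(n, P, R):
    """Reading block i+1 overlaps processing block i: after the first
    read, each remaining block costs max(P, R), and the last block
    still has to be processed. For P >= R this is exactly R + n*P."""
    return R + (n - 1) * max(P, R) + P

# hypothetical workload: 10 blocks, 2 ms to process, 1 ms to read
print(single_buffer_time(10, P=2, R=1),
      double_buffer_time(10, P=2, R=1))  # 30 21
```

With P >= R the double-buffer time matches the slide's R + nP; with P < R the read path dominates and extra buffers (n-buffering) would be needed to hide it.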
23Multiple Disks
24Multiple Disks
- Disk controllers and busses manage several devices
- One can improve total system performance by replacing one large disk with many small disks accessed in parallel
- Several independent heads can read simultaneously (if the other parts of the system can manage the speed)
[figure: a single disk vs. two disks holding the same data]
- Note: the single disk might be faster, but as seek time and rotational delay are the dominant factors of total disk access time, the two smaller disks might operate faster together, performing seeks in parallel...
25Striping
- Another reason to use multiple disks is when one disk cannot deliver the requested data rate
- In such a scenario, one might use several disks for striping
  - bandwidth per disk: Bdisk
  - required bandwidth: Bdisplay
  - Bdisplay > Bdisk
  - read from n disks in parallel: n Bdisk > Bdisplay
  - clients are serviced in rounds
- Advantages
  - high data rates
  - faster response time compared to one disk
- Drawbacks
  - can't serve multiple clients in parallel
  - positioning time increases (i.e., reduced efficiency)
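The striping condition n Bdisk > Bdisplay gives the minimum stripe width directly; the bandwidth figures in the example are hypothetical:

```python
import math

def stripe_width(b_display, b_disk):
    """Smallest number n of disks with n * b_disk strictly greater
    than b_display (bandwidths in any common unit, e.g., MB/s)."""
    return math.floor(b_display / b_disk) + 1

# e.g., a 100 MB/s stream over 40 MB/s disks needs 3 disks;
# an 80 MB/s stream also needs 3, since 2 * 40 is not strictly greater
print(stripe_width(100, 40), stripe_width(80, 40))  # 3 3
```

The strict inequality matters: a stream exactly matching the aggregate disk bandwidth leaves no headroom for positioning overhead.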
26Interleaving (Compound Striping)
- Full striping is usually not necessary today
  - faster disks
  - better compression algorithms
- Interleaving lets each client be serviced by only a set of the available disks
  - make groups
  - stripe data in a way such that consecutive requests arrive at the next group (here each disk is a group)
27Interleaving (Compound Striping)
- Divide the traditional striping group into sub-groups, e.g., staggered striping
- Advantages
  - multiple clients can still be served in parallel
  - more efficient disk usage
  - potentially shorter response time
- Drawbacks
  - load balancing (all clients may access the same group)
28Mirroring
- With multiple disks, we might come into the situation where all requests are for one of the disks and the rest lie idle
- In such cases, it might make sense to have replicas of the data on several disks; if we have identical disks, it is called mirroring
- Advantages
  - faster response time
  - survive crashes (fault tolerance)
  - load balancing by dividing the requests for the data equally among the mirrored disks
- Drawbacks
  - increased storage requirements and write operations
29Redundant Array of Inexpensive Disks
- The various RAID levels define different disk organizations to achieve higher performance and more reliability
  - RAID 0 - striped disk array without fault tolerance (non-redundant)
  - RAID 1 - mirroring
  - RAID 2 - memory-style error correcting code (Hamming code ECC)
  - RAID 3 - bit-interleaved parity
  - RAID 4 - block-interleaved parity
  - RAID 5 - block-interleaved distributed parity
  - RAID 6 - independent data disks with two independent distributed parity schemes (P+Q redundancy)
  - RAID 7
  - RAID 10
  - RAID 53
  - RAID 0+1
30Redundant Array of Inexpensive Disks
- RAID is intended ...
  - ... for general systems
  - ... to give higher throughput
  - ... to be fault tolerant
- For multimedia systems, some requirements are missing
  - low latency
  - guaranteed response time
  - optimizations for linear access to large objects
  - optimizations for cyclic operations
  - ...
31Replication
- Replication in traditional RAID systems is often used for fault tolerance (and higher performance in the new combined levels)
- Replication in multimedia systems is used for
  - reducing hot spots
  - increased scalability
  - higher performance
  - ...
  - but fault tolerance is only a side effect
- Replication in multimedia scenarios should
  - be based on observed load
  - change dynamically as popularity changes
32Dynamic Segment Replication (DSR)
- DSR tries to balance load by dynamically replicating hot data
  - assumes read-only, VoD-like retrieval
  - predefines a load threshold for when to replicate a segment by examining current and expected load
  - replicates when the threshold is reached, but which segment?
    - not necessarily the segment that receives the additional requests (another segment may have more requests)
    - replicates based on a payoff factor p (replicate the segment x with the highest p)
33Some Challenges Managing Multiple Disks
- How large should a stripe group and a stripe unit be?
- Can one avoid hot sets of disks (load imbalance)?
- Heterogeneous disks?
- What and when to replicate?
34Heterogeneous Disks
35File Placement
- A multimedia file might be stored (striped) on multiple disks, but how should one choose the devices?
  - storage devices are limited by both bandwidth and space
  - we have hot (frequently viewed) and cold (rarely viewed) files
  - we may have several heterogeneous storage devices
  - the objective of a file placement policy is to achieve maximum utilization of both bandwidth and space, and hence efficient usage of all devices by avoiding load imbalance
    - must consider expected load and storage requirements
    - should a file be replicated?
    - expected load may change over time
36Bandwidth-to-Space Ratio (BSR) I
- BSR attempts to mix hot and cold as well as large and small multimedia objects on heterogeneous devices
  - don't optimize placement based on throughput or space only
  - BSR considers both the required storage space and the throughput requirement (which depends on playout rate and popularity) to achieve the best combined device utilization
[figure: media objects mapped to three disks: no deviation, large deviation with wasted space, and large deviation with wasted bandwidth; an object's bandwidth requirement may vary according to popularity]
37Bandwidth-to-Space Ratio (BSR) II
- The BSR policy algorithm
  - input: space and bandwidth requirements
  - phase 1
    - find a device to place the media object according to BSR
    - if no device, or stripe of devices, can give sufficient space or bandwidth, then add replicas
  - phase 2
    - find devices for the needed replicas
  - phase 3
    - allocate the expected load on the replica devices according to the BSR of the devices
  - phase 4
    - if not enough resources are available, see if other media objects can delete replicas according to their current workload
  - all phases may be needed when adding a new media object or increasing the workload; for a decrease, only phase 3 (reallocation) is needed
- Popular, high data rate movies should be on high bandwidth disks
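Phase 1 of the policy can be sketched as follows. This is a simplified reading of "place according to BSR": pick the feasible device whose free bandwidth-to-space ratio best matches the object's own ratio, so that neither resource is left stranded. The device list and its fields are hypothetical:

```python
def best_bsr_device(devices, need_space, need_bw):
    """devices: list of dicts with free 'space' and 'bw' capacities.
    Returns the feasible device whose free bandwidth-to-space ratio
    is closest to the object's ratio, or None (phase 1 fails and
    replicas must be added)."""
    obj_ratio = need_bw / need_space
    best_dev, best_deviation = None, None
    for dev in devices:
        if dev["space"] >= need_space and dev["bw"] >= need_bw:
            deviation = abs(dev["bw"] / dev["space"] - obj_ratio)
            if best_deviation is None or deviation < best_deviation:
                best_dev, best_deviation = dev, deviation
    return best_dev

disks = [{"name": "a", "space": 100, "bw": 10},
         {"name": "b", "space": 100, "bw": 50}]
print(best_bsr_device(disks, need_space=10, need_bw=5)["name"])  # b
```

A hot, high-rate object (high ratio) ends up on the high-bandwidth disk, matching the slide's closing remark.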
38Disk Grouping
- Disk grouping is a technique to stripe (or fragment) data over heterogeneous disks
  - groups heterogeneous physical disks into homogeneous logical disks
  - the amount of data on each disk (the fragments) is determined so that the service time (based on worst-case seeks) is equal for all physical disks in a logical disk
  - blocks for an object are placed on (and read from) logical disks in a round-robin manner; all disks in a group are activated simultaneously
[figure: logical disk 0 holds blocks X0 and X2, split into fragments X0,0/X0,1 and X2,0/X2,1 over its physical disks; logical disk 1 holds X1 and X3 as fragments X1,0/X1,1 and X3,0/X3,1]
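The equal-service-time fragment sizes can be computed in closed form. This sketch assumes each disk is characterized by a worst-case positioning time and a transfer rate (hypothetical values below), and it does not guard against a disk so slow that its fragment would come out negative:

```python
def fragment_sizes(block_kb, disks):
    """Split one logical block over the physical disks of a group so
    every disk finishes at the same time t.
    disks: list of (worst_case_positioning_ms, transfer_kb_per_ms).
    Solves seek_i + frag_i / rate_i = t with sum(frag_i) = block_kb."""
    sum_rate = sum(rate for _, rate in disks)
    t = (block_kb + sum(seek * rate for seek, rate in disks)) / sum_rate
    return [rate * (t - seek) for seek, rate in disks]

# a slow and a fast disk sharing a 300 KB logical block
print(fragment_sizes(300, [(5, 10), (5, 20)]))  # [100.0, 200.0]
```

The faster disk receives the larger fragment, so both disks complete their part of the logical block simultaneously.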
39Staggered Disk Grouping
- Staggered disk grouping is a variant of disk grouping minimizing the memory requirement
  - reading and playing out are done differently
  - not all fragments of a logical block are needed at the same time
  - the first (and largest) fragment is placed on the most powerful disk, etc.
  - read sequentially (no need to buffer later fragments for a long time)
  - start the display when the largest fragment is read
[figure: as for disk grouping, but the fragments X0,0/X0,1, X1,0/X1,1, etc. of each logical block are read and played out in sequence rather than simultaneously]
40Disk Merging
- Disk merging forms logical disks from capacity fragments of the physical disks
  - all logical disks are homogeneous
  - supports an arbitrary mix of heterogeneous disks (grouping needs equal groups)
  - starts by choosing how many logical disks the slowest device shall support (e.g., 1 each for disks 1 and 3) and calculates the corresponding number for the more powerful devices (e.g., 1.5 each for disks 0 and 2 if these disks are 1.5 times better)
  - most powerful approach: most flexible (arbitrary mix of devices) and can be adapted to zoned disks (each zone considered as a disk)
[figure: five logical disks X0 to X4 mapped onto four physical disks; logical disk X2 is split into fragments X2,0 and X2,1 on two different physical disks]
41Memory Caching
42Data Path (Intel Hub Architecture)
[figure: data path through an Intel hub architecture machine: Pentium 4 processor with registers and cache(s); RDRAM memory holding the application, file system, and communication system; disk and network card attached via PCI slots]
43Memory Caching
- How do we manage a cache?
  - how much memory to use?
  - how much data to prefetch?
  - which data item to replace?
  - ...
[figure: application cache sitting above the file system and communication system; going down to the disk or network card is expensive]
44Is Caching Useful in a Multimedia Scenario?
- High rate data may need lots of memory for caching
- Tradeoff: amount of memory, algorithm complexity, gain, ...
- Cache only frequently used data (how? e.g., first (small) parts of a broadcast partitioning scheme, allow only the top ten, ...)
- Note: even the maximum amount of memory (in total) that a Dell server can manage today is limited, and not all of it is used for caching
45Need For Special Multimedia Algorithms ?
- Most existing systems use an LRU-variant, e.g.,
  - keep a sorted list
  - replace the first element in the list
  - insert new data elements at the end
  - if a data element is re-accessed, move it back to the end of the list
- Example: playout of a video with 7 frames through an LRU buffer
  - playing frames 1-7 leaves the buffer ordered from longest to shortest time since access
  - rewinding and restarting the playout at frame 1 evicts the least recently used frame, which is exactly the frame needed next
  - every following access (playout of 2, 3, 4, ...) repeats the pattern
- In this case, LRU replaces the next needed frame, so the answer is in many cases YES
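The pathology is easy to reproduce with a minimal LRU cache; the 7-frame video and 6-frame buffer are illustrative:

```python
from collections import OrderedDict

class LRUBuffer:
    """Minimal LRU frame cache of fixed capacity, counting misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()
        self.misses = 0

    def access(self, frame):
        if frame in self.frames:
            self.frames.move_to_end(frame)       # re-accessed: most recent
        else:
            self.misses += 1
            if len(self.frames) >= self.capacity:
                self.frames.popitem(last=False)  # evict least recent
            self.frames[frame] = True

# 7-frame video, buffer for only 6 frames, played twice (rewind at the end)
buf = LRUBuffer(capacity=6)
for _ in range(2):
    for frame in range(1, 8):
        buf.access(frame)
print(buf.misses)  # 14: every single access misses
```

On cyclic access with a buffer one frame too small, LRU always evicts exactly the frame that will be requested next, so the cache contributes nothing.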
46Classification of Mechanisms
- Block-level caching considers a (possibly unrelated) set of blocks
  - each data element is viewed as an independent item
  - usually used in traditional systems
  - e.g., FIFO, LRU, CLOCK, ...
  - multimedia approaches
    - L/MRP (Least/Most Relevant for Presentation)
    - ...
- Stream-dependent caching considers a stream object as a whole
  - related data elements are treated in the same way
  - research prototypes in multimedia systems
  - e.g.,
    - BASIC
    - DISTANCE
    - Interval Caching (IC)
    - Generalized Interval Caching (GIC)
    - Split and Merge (SAM)
    - SHR
47Least/Most Relevant for Presentation (L/MRP)
Moser et al. 95
- L/MRP is a buffer management mechanism for a single interactive, continuous data stream
  - adaptable to individual multimedia applications
  - supports pre-loading, i.e., prefetching data from disk
  - replaces the least relevant pages with regard to the current playout of the multimedia stream
[figure: COPUs (continuous object presentation units) 10-26 along the playback direction; relevance is highest for the units just ahead of the playout point and falls off with distance, so units far behind (e.g., 10, 11) are replaced first]
48Least/Most Relevant for Presentation (L/MRP)
- L/MRP
  - gives few disk accesses (compared to other schemes)
  - supports interactivity
  - supports prefetching
  - targeted at single streams (users)
  - expensive to execute (calculates relevance values for all COPUs each round)
- Variations
  - Q-L/MRP: extends L/MRP with multiple streams and changes the prefetching mechanism (reduces overhead) [Halvorsen et al. 98]
  - MPEG-L/MRP: gives different relevance values to different MPEG frame types [Boll et al. 00]
49Interval Caching (IC)
- Interval caching (IC) is a caching strategy for streaming servers
  - caches data between requests for the same video stream, based on the playout intervals between requests
  - a following request is thus served from the cache (not disk) filled by the preceding stream
  - sorts intervals by length; the buffer requirement of an interval is its data size
  - to maximize the cache hit ratio (minimize disk accesses), the shortest intervals are cached first
[figure: streams S11/S12, S21, and S31/S32/S33 on three videos; the gaps between consecutive streams on the same video form the cacheable intervals]
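The shortest-intervals-first rule is a simple greedy selection; the interval names and sizes below are hypothetical:

```python
def choose_cached_intervals(intervals, cache_size):
    """intervals: list of (name, data_size), one per pair of
    consecutive streams on the same video. Greedily caches the
    shortest intervals first, as IC prescribes, maximizing the
    number of streams served from cache."""
    cached, used = [], 0
    for name, size in sorted(intervals, key=lambda iv: iv[1]):
        if used + size <= cache_size:
            cached.append(name)
            used += size
    return cached

print(choose_cached_intervals([("I11", 30), ("I21", 10), ("I31", 25)], 40))
```

With 40 units of cache, the two shortest intervals (10 + 25) fit and two followers are served from memory; caching the 30-unit interval instead would have served only one.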
50Generalized Interval Caching (GIC)
- Interval caching (IC) does not work for short clips
  - a frequently accessed short clip will not be cached
- GIC generalizes the IC strategy
  - manages intervals for long video objects as IC
  - short intervals extend the interval definition
    - keep track of a finished stream for a while after its termination
    - define the interval for a short stream as the length between the new stream and the position the old stream would have had in a longer video object
    - the cache requirement is, however, only the real requirement
  - caches the shortest intervals as in IC
[figure: stream S11 on video clip 1, with interval I11 extending past the end of the clip and cache requirement C11 limited to the clip itself]
51Generalized Interval Caching (GIC)
- Open function:
    if possible, form a new interval with the previous stream
    if not, exit  /* don't cache */
    compute interval size and cache requirement
    reorder the interval list  /* smallest first */
    if (not already in a cached interval)
        if (space available)
            cache interval
        else if (larger cached intervals exist and sufficient memory can be released)
            release memory from the larger intervals
            cache the new interval
- Close function:
    if (not following another stream) exit  /* not served from cache */
    delete the interval with the preceding stream
    free memory
    if (the next interval can be cached in the released memory)
        cache the next interval
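The open-path decision above can be sketched as follows. This is a simplification: intervals are just (size, cache_requirement) pairs, stream bookkeeping is omitted, and larger intervals are released one at a time until the new one fits:

```python
class IntervalCache:
    """Sketch of the GIC open/close decisions over a fixed cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.free = capacity
        self.cached = []        # (size, requirement) pairs, smallest first

    def open(self, interval):
        size, requirement = interval
        # release larger cached intervals while the new one doesn't fit
        while (self.free < requirement and self.cached
               and self.cached[-1][0] > size):
            _, freed = self.cached.pop()    # drop the largest interval
            self.free += freed
        if self.free >= requirement:
            self.cached.append(interval)
            self.cached.sort()              # keep smallest first
            self.free -= requirement
            return True
        return False                        # don't cache

    def close(self, interval):
        if interval in self.cached:
            self.cached.remove(interval)
            self.free += interval[1]
```

For example, with capacity 100 and cached intervals of sizes 10, 50, and 20, opening a size-30 interval releases the 50-unit interval and caches the new one, leaving the three shortest intervals cached, as GIC intends.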
52The EndSummary
53Summary
- Much work has been performed to optimize disk performance
- For multimedia streams, ...
  - time-aware scheduling is important
  - use large block sizes or read many contiguous blocks
  - prefetch data from disk to memory to get a hiccup-free playout
  - striping might not be necessary on new disks (at least not on all disks)
  - replication on multiple disks can offload a hot set of disks
  - memory caching can save disk I/Os, but it might not be worthwhile
  - ...
- BUT, new disks are "smart", and we cannot fully control the device
54Some References
- Advanced Computer Network Corporation: RAID.edu, http://www.raid.com/04_00.html, 2002
- Boll, S., Heinlein, C., Klas, W., Wandel, J.: "MPEG-L/MRP: Adaptive Streaming of MPEG Videos for Interactive Internet Applications", Proceedings of the 6th International Workshop on Multimedia Information Systems (MIS'00), Chicago, USA, October 2000, pp. 104-113
- Halvorsen, P., Goebel, V., Plagemann, T.: "Q-L/MRP: A Buffer Management Mechanism for QoS Support in a Multimedia DBMS", Proceedings of the 1998 IEEE International Workshop on Multimedia Database Management Systems (IW-MMDBMS'98), Dayton, Ohio, USA, August 1998, pp. 162-171
- Moser, F., Kraiss, A., Klas, W.: "L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS", Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995
- Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: "Operating System Support for Multimedia Systems", Computer Communications, Vol. 23, No. 3, February 2000, pp. 267-289
- Sitaram, D., Dan, A.: "Multimedia Servers: Applications, Environments, and Design", Morgan Kaufmann Publishers, 2000
- Zimmermann, R., Ghandeharizadeh, S.: "Continuous Display Using Heterogeneous Disk-Subsystems", Proceedings of the 5th ACM International Multimedia Conference, Seattle, WA, November 1997