Title: Storage Systems Part I
1Storage Systems Part I
INF SERV Media Storage and Distribution Systems
2Overview
- Block size
- Data placement
- Multiple disks
- Prefetching
- Managing heterogeneous disks
- Memory caching
3Block Size
4Block Size I
- The block size may have large effects on performance
- Example: assume random block placement on disk and sequential file access
  - doubling the block size will halve the number of disk accesses
    - each access takes some more time to transfer the data, but the total transfer time is the same (i.e., more data per request)
    - half of the seek times and rotational delays are omitted
  - e.g., when increasing the block size from 2 KB to 4 KB (no gaps, ...) for a Cheetah X15, typically an average of
    - 3.6 ms is saved in seek time
    - 2 ms is saved in rotational delay
    - 0.026 ms is added per transfer
    - saving a total of 5.6 ms when reading 4 KB (49.8 %)
  - e.g., increasing from 2 KB to 64 KB saves 96.4 % when reading 64 KB
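The arithmetic above can be checked with a short sketch. The seek, rotation, and transfer figures are the averages quoted on the slide for the Cheetah X15, not exact datasheet values:

```python
import math

# Assumed averages from the slide (illustrative, not datasheet values)
AVG_SEEK_MS = 3.6          # average seek time
AVG_ROT_MS = 2.0           # average rotational delay (15,000 rpm)
XFER_MS_PER_2KB = 0.026    # transfer time for one 2 KB block

def read_time_ms(total_kb, block_kb):
    """Time to read total_kb with random block placement: every
    block access pays a seek, a rotational delay, and its transfer."""
    accesses = math.ceil(total_kb / block_kb)
    transfer = (block_kb / 2) * XFER_MS_PER_2KB
    return accesses * (AVG_SEEK_MS + AVG_ROT_MS + transfer)

saving_4kb = read_time_ms(4, 2) - read_time_ms(4, 4)
saving_64kb_pct = 100 * (1 - read_time_ms(64, 64) / read_time_ms(64, 2))
print(round(saving_4kb, 1), round(saving_64kb_pct, 1))  # 5.6 96.4
```

This reproduces both slide numbers: 5.6 ms (49.8 %) saved when reading 4 KB with 4 KB instead of 2 KB blocks, and 96.4 % saved when reading 64 KB with 64 KB blocks.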
5Block Size II
- Thus, increasing the block size can increase performance by reducing seek times and rotational delays
- However, a large block size is not always best
  - blocks spanning several tracks still introduce latencies
  - small data elements may occupy only a fraction of the block
- Which block size to use therefore depends on data size and data reference patterns
- The trend, however, is to use large block sizes as new technology appears with increased performance, at least in high data rate systems
6Data Placement on Disk
7Data Placement on Disk I
- Disk blocks can be assigned to files in many ways, and several schemes are designed for
  - optimized latency
  - increased throughput
  - access pattern dependency
- Multimedia server approaches
  - interactive applications
    - popularity-based placement
    - striping and clustering
  - streaming applications
    - continuous placement
    - striping and clustering
    - replication
  - cross relations between objects
    - no (at least only little) research yet
8Data Placement on Disk II
- Constant angular velocity (CAV) disks
  - equal amount of data in each track (and thus constant transfer time)
  - constant rotation speed
- Zoned CAV disks
  - zones are ranges of tracks
  - typically few zones
  - different amount of data on tracks in different zones, i.e., more data on outer tracks
- One should always place often used or high rate data on outermost tracks!?
  - NO, arm movement is often more important than transfer time
9Data Placement on Disk III
- What is the connection between data popularity and placement?
  - one could gain from placing popular data at the right place (how?)
  - zones might be important for placement (why?)
[figure: transfer rates across the disk surface, zoned vs. not zoned]
10Data Placement on Disk IV
- Continuous placement stores all disk blocks of a file contiguously on disk
  - minimal disk arm movement when reading the whole file (possible advantage)
  - head must not move between read operations (often WRONG: we read other files as well)
  - real advantage: do not have to pre-determine the block size (whatever amount we read, at most track-to-track seeks are performed)
[figure: files A, B, and C each placed contiguously on disk]
11Using Adjacent Sectors, Cylinders and Tracks
- To avoid seek time (and possibly rotational delay), we can store data likely to be accessed together on
  - adjacent sectors (similar to using larger blocks)
  - if the track is full, use another track on the same cylinder (only use another head)
  - if the cylinder is full, use the next cylinder (track-to-track seek)
- Advantage
  - can approach the theoretical transfer rate (no seeks or rotational delays)
- Disadvantage
  - no gain if we have unpredictable disk accesses
12Data Placement on Disk V
- Interleaved placement tries to store blocks from a file with a fixed number of other blocks in-between each block
  - minimal disk arm movement when reading the files A, B, and C
  - fine for predictable workloads reading multiple files
- Non-interleaved (or even random) placement can be used for highly unpredictable workloads
13Data Placement on Disk VI
- Organ-pipe placement considers the usual disk head position
  - place the most popular data where the head most often is
  - the center of the disk is closest to the head on CAV disks; a bit outward for zoned CAV disks (modified organ-pipe)
  - Note: the skew depends on the tradeoff between zoned transfer time and seek time
[figure: access frequency over track position (innermost to outermost) for organ-pipe and modified organ-pipe placement]
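The organ-pipe idea can be sketched as a simple placement routine; the block ids and popularity counts below are hypothetical, and the center position would be skewed outward for a zoned (modified organ-pipe) layout:

```python
def organ_pipe_layout(popularity):
    """Order blocks across tracks so the most popular data sits
    where the head usually is (mid-disk for CAV disks).
    popularity: dict mapping block id -> access frequency.
    Returns a list of block ids indexed by track position."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    n = len(ranked)
    layout = [None] * n
    center = (n - 1) // 2
    for i, block in enumerate(ranked):
        offset = (i + 1) // 2              # 0, 1, 1, 2, 2, ...
        # alternate right/left of the center track
        pos = center + offset if i % 2 else center - offset
        layout[pos] = block
    return layout

print(organ_pipe_layout({"a": 9, "b": 5, "c": 3, "d": 1}))
```

The most popular block "a" lands on the center track, with popularity falling off toward both edges.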
14Fast File System
- FFS is a general file system
  - idea is to keep the inode and associated data blocks close (no long seeks when getting the inode and data)
  - organizes the disk in partitions of cylinder groups, each having
    - several inodes
    - a free block bitmap
    - ...
  - tries to store a file within a cylinder group
    - next block on the same cylinder
    - a block within the cylinder group
    - find a block in another group using a hash function
    - search all cylinder groups for a free block
15Log-Structured File System
- Log-structured placement is based on the assumptions (facts?) that
  - RAM memory is getting larger
  - writes are most expensive
  - reads can often be served from the buffer cache (!!??)
- Organize disk blocks as a circular log
  - periodically, all pending (so far buffered) writes are performed as a batch
  - write on the next free block regardless of content (inode, directory, data, ...)
  - a cleaner reorganizes holes and deleted blocks in the background
  - stores blocks contiguously when writing a single file
  - efficient for small writes; other operations as in a traditional UNIX FS
[figure: circular log laid out on disk]
16Minorca File System I
- Minorca is a multimedia file system (from IFI/UiO)
  - enhanced allocation of disk blocks for continuous storage of media files
  - supports both continuous and non-continuous files in the same system using different placement policies
- Multimedia-Oriented Split Allocation (MOSA): one file system, two sections
  - cylinder group sections (CGSs) for non-continuous files
    - like traditional BSD FFS disk partitions
    - small block sizes (like 4 or 8 KB)
    - traditional FFS operations
  - extent sections for continuous files
    - extents contain one or more (adjacent) CGSs
      - summary information
      - allocation bitmap
      - data block area
    - expected to store one media file
    - large block sizes (e.g., 64 KB)
    - new transparent file operations, create file using O_CREATEXT
[figure: disk layout with cylinder groups followed by a series of extents]
17Minorca File System II
- Count-augmented address indexing in the extent section
  - observation: indirect block reads introduce disk I/O and break access locality (e.g., when following the inode)
  - introduce a new inode structure
    - add a count field to the original direct entries: direct points to a disk block, and count indicates how many other blocks follow the first block (contiguously)
    - if continuous allocation is assured, each direct entry is able to address many more blocks without retrieving an additional indirect block
[figure: inode with attributes, paired direct/count entries 0 through 11, and single/double/triple indirect pointers]
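A minimal sketch of how count-augmented direct entries resolve a logical block number; the entry values are hypothetical:

```python
def lookup_block(direct_entries, logical_block):
    """Resolve a logical block number using count-augmented direct
    entries: each (direct, count) pair addresses the contiguous run
    direct, direct+1, ..., direct+count (count followers)."""
    for direct, count in direct_entries:
        run_length = count + 1            # first block + followers
        if logical_block < run_length:
            return direct + logical_block
        logical_block -= run_length
    raise IndexError("beyond the direct entries; an indirect block is needed")

# two hypothetical runs: physical blocks 100-103 and 500-501
entries = [(100, 3), (500, 1)]
print(lookup_block(entries, 2), lookup_block(entries, 4))  # 102 500
```

With plain direct entries, six logical blocks would consume six entries; here two entries cover them, which is the locality gain the slide describes.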
18Other File Systems Examples
- Continuous allocation
  - Presto
    - similar to Minorca extents for continuous files
    - doesn't support small, discrete files
  - Fellini
    - simple flat file system
    - maintains a free block list with grouping of contiguous blocks
  - Continuous Media File System
- Several systems use multiple disks and stripe data
  - Symphony
  - Tiger Shark
  - Tiger
19Prefetching
20Prefetching
- If we can predict the access pattern, one might speed up performance using prefetching
  - a video playout is often linear: easy to predict the access pattern
  - eases disk scheduling
  - read larger amounts of data per request
  - data is in memory when requested, reducing page faults
- One way of doing prefetching is read-ahead
  - read more than the requested block into memory
  - serve the next read requests from the buffer cache
- Another way of doing prefetching is double (multiple) buffering
  - read data into the first buffer
  - process data in the first buffer and at the same time read data into the second buffer
  - process data in the second buffer and at the same time read data into the first buffer
  - etc.
21Multiple Buffering I
- Example: have a file with block sequence B1, B2, ...; our program processes the data sequentially, i.e., B1, B2, ...
- single buffer solution
  - read B1 into the buffer
  - process data in the buffer
  - read B2 into the buffer
  - process data in the buffer
  - ...
  - if P = time to process a block, R = time to read in 1 block, n = number of blocks:
    single buffer time = n (P + R)
[figure: disk reads into one memory buffer; the process consumes from the same buffer]
22Multiple Buffering II
- double buffer solution
  - read B1 into buffer1
  - process data in buffer1, read B2 into buffer2
  - process data in buffer2, read B3 into buffer1
  - process data in buffer1, read B4 into buffer2
  - ...
  - if P = time to process a block, R = time to read in 1 block, n = number of blocks:
    if P >= R, double buffer time = R + nP
  - if P < R, we can try to add buffers (n-buffering)
[figure: disk reads into two memory buffers; the process consumes one buffer while the other is being filled]
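The two timing formulas above can be written down directly; the parameter values in the example are hypothetical:

```python
def single_buffer_time(n, P, R):
    """Read then process each of the n blocks strictly in turn."""
    return n * (P + R)

def double_buffer_time(n, P, R):
    """Reading block i+1 overlaps processing block i: after the first
    read, each remaining block costs max(P, R), and the last block
    still has to be processed. For P >= R this is exactly R + n*P."""
    return R + (n - 1) * max(P, R) + P

# hypothetical workload: 10 blocks, 2 ms to process, 1 ms to read
print(single_buffer_time(10, P=2, R=1),
      double_buffer_time(10, P=2, R=1))  # 30 21
```

With P >= R the double-buffer time matches the slide's R + nP; with P < R the read path dominates and extra buffers (n-buffering) would be needed to hide it.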
23Multiple Disks
24Multiple Disks
- Disk controllers and busses manage several devices
- One can improve total system performance by replacing one large disk with many small disks accessed in parallel
- Several independent heads can read simultaneously (if the other parts of the system can manage the speed)
[figure: a single disk vs. two disks holding the same data]
- Note: the single disk might be faster, but as seek time and rotational delay are the dominant factors of total disk access time, the two smaller disks might operate faster together, performing seeks in parallel...
25Striping
- Another reason to use multiple disks is when one disk cannot deliver the requested data rate
- In such a scenario, one might use several disks for striping
  - bandwidth per disk: Bdisk
  - required bandwidth: Bdisplay
  - Bdisplay > Bdisk
  - read from n disks in parallel: n Bdisk > Bdisplay
  - clients are serviced in rounds
- Advantages
  - high data rates
  - faster response time compared to one disk
- Drawbacks
  - can't serve multiple clients in parallel
  - positioning time increases (i.e., reduced efficiency)
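The striping condition n Bdisk > Bdisplay gives the minimum stripe width directly; the bandwidth figures in the example are hypothetical:

```python
import math

def stripe_width(b_display, b_disk):
    """Smallest number n of disks with n * b_disk strictly greater
    than b_display (bandwidths in any common unit, e.g., MB/s)."""
    return math.floor(b_display / b_disk) + 1

# e.g., a 100 MB/s stream over 40 MB/s disks needs 3 disks;
# an 80 MB/s stream also needs 3, since 2 * 40 is not strictly greater
print(stripe_width(100, 40), stripe_width(80, 40))  # 3 3
```

The strict inequality matters: a stream exactly matching the aggregate disk bandwidth leaves no headroom for positioning overhead.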
26Interleaving (Compound Striping)
- Full striping is usually not necessary today
  - faster disks
  - better compression algorithms
- Interleaving lets each client be serviced by only a set of the available disks
  - make groups
  - stripe data in a way such that consecutive requests arrive at the next group (here each disk is a group)
27Interleaving (Compound Striping)
- Divide the traditional striping group into sub-groups, e.g., staggered striping
- Advantages
  - multiple clients can still be served in parallel
  - more efficient disk usage
  - potentially shorter response time
- Drawbacks
  - load balancing (all clients may access the same group)
28Mirroring
- With multiple disks, we might come into the situation where all requests are for one of the disks and the rest lie idle
- In such cases, it might make sense to have replicas of the data on several disks; if we have identical disks, it is called mirroring
- Advantages
  - faster response time
  - survive crashes (fault tolerance)
  - load balancing by dividing the requests for the data equally among the mirrored disks
- Drawbacks
  - increased storage requirements and write operations
29Redundant Array of Inexpensive Disks
- The various RAID levels define different disk organizations to achieve higher performance and more reliability
  - RAID 0 - striped disk array without fault tolerance (non-redundant)
  - RAID 1 - mirroring
  - RAID 2 - memory-style error correcting code (Hamming code ECC)
  - RAID 3 - bit-interleaved parity
  - RAID 4 - block-interleaved parity
  - RAID 5 - block-interleaved distributed parity
  - RAID 6 - independent data disks with two independent distributed parity schemes (P+Q redundancy)
  - RAID 7
  - RAID 10
  - RAID 53
  - RAID 0+1
30Redundant Array of Inexpensive Disks
- RAID is intended ...
  - ... for general systems
  - ... to give higher throughput
  - ... to be fault tolerant
- For multimedia systems, some requirements are missing
  - low latency
  - guaranteed response time
  - optimizations for linear access to large objects
  - optimizations for cyclic operations
  - ...
31Replication
- Replication in traditional RAID systems is often used for fault tolerance (and higher performance in the new combined levels)
- Replication in multimedia systems is used for
  - reducing hot spots
  - increased scalability
  - higher performance
  - ...
  - but fault tolerance is only a side effect
- Replication in multimedia scenarios should
  - be based on observed load
  - change dynamically as popularity changes
32Dynamic Segment Replication (DSR)
- DSR tries to balance load by dynamically replicating hot data
  - assumes read-only, VoD-like retrieval
  - predefines a load threshold for when to replicate a segment by examining current and expected load
  - replicates when the threshold is reached, but which segment?
    - not necessarily the segment that receives the additional requests (another segment may have more requests)
    - replicates based on a payoff factor p (replicate the segment x with the highest p)
33Some Challenges Managing Multiple Disks
- How large should a stripe group and a stripe unit be?
- Can one avoid hot sets of disks (load imbalance)?
- Heterogeneous disks?
- What and when to replicate?
34Heterogeneous Disks
35File Placement
- A multimedia file might be stored (striped) on multiple disks, but how should one choose the devices?
  - storage devices are limited by both bandwidth and space
  - we have hot (frequently viewed) and cold (rarely viewed) files
  - we may have several heterogeneous storage devices
  - the objective of a file placement policy is to achieve maximum utilization of both bandwidth and space, and hence efficient usage of all devices by avoiding load imbalance
    - must consider expected load and storage requirements
    - should a file be replicated?
    - expected load may change over time
36Bandwidth-to-Space Ratio (BSR) I
- BSR attempts to mix hot and cold as well as large and small multimedia objects on heterogeneous devices
  - don't optimize placement based on throughput or space only
  - BSR considers both the required storage space and the throughput requirement (which depends on playout rate and popularity) to achieve the best combined device utilization
[figure: media objects mapped to three disks: no deviation, large deviation with wasted space, and large deviation with wasted bandwidth; an object's bandwidth requirement may vary according to popularity]
37Bandwidth-to-Space Ratio (BSR) II
- The BSR policy algorithm
  - input: space and bandwidth requirements
  - phase 1
    - find a device to place the media object according to BSR
    - if no device, or stripe of devices, can give sufficient space or bandwidth, then add replicas
  - phase 2
    - find devices for the needed replicas
  - phase 3
    - allocate the expected load on the replica devices according to the BSR of the devices
  - phase 4
    - if not enough resources are available, see if other media objects can delete replicas according to their current workload
  - all phases may be needed when adding a new media object or increasing the workload; for a decrease, only phase 3 (reallocation) is needed
- Popular, high data rate movies should be on high bandwidth disks
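Phase 1 of the policy can be sketched as follows. This is a simplified reading of "place according to BSR": pick the feasible device whose free bandwidth-to-space ratio best matches the object's own ratio, so that neither resource is left stranded. The device list and its fields are hypothetical:

```python
def best_bsr_device(devices, need_space, need_bw):
    """devices: list of dicts with free 'space' and 'bw' capacities.
    Returns the feasible device whose free bandwidth-to-space ratio
    is closest to the object's ratio, or None (phase 1 fails and
    replicas must be added)."""
    obj_ratio = need_bw / need_space
    best_dev, best_deviation = None, None
    for dev in devices:
        if dev["space"] >= need_space and dev["bw"] >= need_bw:
            deviation = abs(dev["bw"] / dev["space"] - obj_ratio)
            if best_deviation is None or deviation < best_deviation:
                best_dev, best_deviation = dev, deviation
    return best_dev

disks = [{"name": "a", "space": 100, "bw": 10},
         {"name": "b", "space": 100, "bw": 50}]
print(best_bsr_device(disks, need_space=10, need_bw=5)["name"])  # b
```

A hot, high-rate object (high ratio) ends up on the high-bandwidth disk, matching the slide's closing remark.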
38Disk Grouping
- Disk grouping is a technique to stripe (or fragment) data over heterogeneous disks
  - groups heterogeneous physical disks into homogeneous logical disks
  - the amount of data on each disk (the fragments) is determined so that the service time (based on worst-case seeks) is equal for all physical disks in a logical disk
  - blocks for an object are placed on (and read from) logical disks in a round-robin manner; all disks in a group are activated simultaneously
[figure: logical disk 0 holds blocks X0 and X2, split into fragments X0,0/X0,1 and X2,0/X2,1 over its physical disks; logical disk 1 holds X1 and X3 as fragments X1,0/X1,1 and X3,0/X3,1]
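The equal-service-time fragment sizes can be computed in closed form. This sketch assumes each disk is characterized by a worst-case positioning time and a transfer rate (hypothetical values below), and it does not guard against a disk so slow that its fragment would come out negative:

```python
def fragment_sizes(block_kb, disks):
    """Split one logical block over the physical disks of a group so
    every disk finishes at the same time t.
    disks: list of (worst_case_positioning_ms, transfer_kb_per_ms).
    Solves seek_i + frag_i / rate_i = t with sum(frag_i) = block_kb."""
    sum_rate = sum(rate for _, rate in disks)
    t = (block_kb + sum(seek * rate for seek, rate in disks)) / sum_rate
    return [rate * (t - seek) for seek, rate in disks]

# a slow and a fast disk sharing a 300 KB logical block
print(fragment_sizes(300, [(5, 10), (5, 20)]))  # [100.0, 200.0]
```

The faster disk receives the larger fragment, so both disks complete their part of the logical block simultaneously.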
39Staggered Disk Grouping
- Staggered disk grouping is a variant of disk grouping minimizing the memory requirement
  - reading and playing out are done differently
  - not all fragments of a logical block are needed at the same time
  - the first (and largest) fragment is placed on the most powerful disk, etc.
  - read sequentially (no need to buffer later fragments for a long time)
  - start the display when the largest fragment is read
[figure: as for disk grouping, but the fragments X0,0/X0,1, X1,0/X1,1, etc. of each logical block are read and played out in sequence rather than simultaneously]
40Disk Merging
- Disk merging forms logical disks from capacity fragments of the physical disks
  - all logical disks are homogeneous
  - supports an arbitrary mix of heterogeneous disks (grouping needs equal groups)
  - starts by choosing how many logical disks the slowest device shall support (e.g., 1 each for disks 1 and 3) and calculates the corresponding number for the more powerful devices (e.g., 1.5 each for disks 0 and 2 if these disks are 1.5 times better)
  - most powerful approach: most flexible (arbitrary mix of devices) and can be adapted to zoned disks (each zone considered as a disk)
[figure: five logical disks X0 to X4 mapped onto four physical disks; logical disk X2 is split into fragments X2,0 and X2,1 on two different physical disks]
41Memory Caching
42Data Path (Intel Hub Architecture)
[figure: data path through an Intel hub architecture machine: Pentium 4 processor with registers and cache(s); RDRAM memory holding the application, file system, and communication system; disk and network card attached via PCI slots]
43Memory Caching
- How do we manage a cache?
  - how much memory to use?
  - how much data to prefetch?
  - which data item to replace?
  - ...
[figure: application cache sitting above the file system and communication system; going down to the disk or network card is expensive]
44Is Caching Useful in a Multimedia Scenario?
- High rate data may need lots of memory for caching
- Tradeoff: amount of memory, algorithm complexity, gain, ...
- Cache only frequently used data (how? e.g., first (small) parts of a broadcast partitioning scheme, allow only the top ten, ...)
- Note: even the maximum amount of memory (in total) that a Dell server can manage today is limited, and not all of it is used for caching
45Need For Special Multimedia Algorithms ?
- Most existing systems use an LRU-variant, e.g.,
  - keep a sorted list
  - replace the first element in the list
  - insert new data elements at the end
  - if a data element is re-accessed, move it back to the end of the list
- Example: playout of a video with 7 frames through an LRU buffer
  - playing frames 1-7 leaves the buffer ordered from longest to shortest time since access
  - rewinding and restarting the playout at frame 1 evicts the least recently used frame, which is exactly the frame needed next
  - every following access (playout of 2, 3, 4, ...) repeats the pattern
- In this case, LRU replaces the next needed frame, so the answer is in many cases YES
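The pathology is easy to reproduce with a minimal LRU cache; the 7-frame video and 6-frame buffer are illustrative:

```python
from collections import OrderedDict

class LRUBuffer:
    """Minimal LRU frame cache of fixed capacity, counting misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()
        self.misses = 0

    def access(self, frame):
        if frame in self.frames:
            self.frames.move_to_end(frame)       # re-accessed: most recent
        else:
            self.misses += 1
            if len(self.frames) >= self.capacity:
                self.frames.popitem(last=False)  # evict least recent
            self.frames[frame] = True

# 7-frame video, buffer for only 6 frames, played twice (rewind at the end)
buf = LRUBuffer(capacity=6)
for _ in range(2):
    for frame in range(1, 8):
        buf.access(frame)
print(buf.misses)  # 14: every single access misses
```

On cyclic access with a buffer one frame too small, LRU always evicts exactly the frame that will be requested next, so the cache contributes nothing.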
46Classification of Mechanisms
- Block-level caching considers a (possibly unrelated) set of blocks
  - each data element is viewed as an independent item
  - usually used in traditional systems
  - e.g., FIFO, LRU, CLOCK, ...
  - multimedia approaches
    - L/MRP (Least/Most Relevant for Presentation)
    - ...
- Stream-dependent caching considers a stream object as a whole
  - related data elements are treated in the same way
  - research prototypes in multimedia systems
  - e.g.,
    - BASIC
    - DISTANCE
    - Interval Caching (IC)
    - Generalized Interval Caching (GIC)
    - Split and Merge (SAM)
    - SHR
47Least/Most Relevant for Presentation (L/MRP)
Moser et al. 95
- L/MRP is a buffer management mechanism for a single interactive, continuous data stream
  - adaptable to individual multimedia applications
  - supports pre-loading, i.e., prefetching data from disk
  - replaces the least relevant pages with regard to the current playout of the multimedia stream
[figure: COPUs (continuous object presentation units) 10-26 along the playback direction; relevance is highest for the units just ahead of the playout point and falls off with distance, so units far behind (e.g., 10, 11) are replaced first]
48Least/Most Relevant for Presentation (L/MRP)
- L/MRP
  - gives few disk accesses (compared to other schemes)
  - supports interactivity
  - supports prefetching
  - targeted at single streams (users)
  - expensive to execute (calculates relevance values for all COPUs each round)
- Variations
  - Q-L/MRP: extends L/MRP with multiple streams and changes the prefetching mechanism (reduces overhead) [Halvorsen et al. 98]
  - MPEG-L/MRP: gives different relevance values to different MPEG frame types [Boll et al. 00]
49Interval Caching (IC)
- Interval caching (IC) is a caching strategy for streaming servers
  - caches data between requests for the same video stream, based on the playout intervals between requests
  - a following request is thus served from the cache (not disk) filled by the preceding stream
  - sorts intervals by length; the buffer requirement of an interval is its data size
  - to maximize the cache hit ratio (minimize disk accesses), the shortest intervals are cached first
[figure: streams S11/S12, S21, and S31/S32/S33 on three videos; the gaps between consecutive streams on the same video form the cacheable intervals]
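The shortest-intervals-first rule is a simple greedy selection; the interval names and sizes below are hypothetical:

```python
def choose_cached_intervals(intervals, cache_size):
    """intervals: list of (name, data_size), one per pair of
    consecutive streams on the same video. Greedily caches the
    shortest intervals first, as IC prescribes, maximizing the
    number of streams served from cache."""
    cached, used = [], 0
    for name, size in sorted(intervals, key=lambda iv: iv[1]):
        if used + size <= cache_size:
            cached.append(name)
            used += size
    return cached

print(choose_cached_intervals([("I11", 30), ("I21", 10), ("I31", 25)], 40))
```

With 40 units of cache, the two shortest intervals (10 + 25) fit and two followers are served from memory; caching the 30-unit interval instead would have served only one.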
50Generalized Interval Caching (GIC)
- Interval caching (IC) does not work for short clips
  - a frequently accessed short clip will not be cached
- GIC generalizes the IC strategy
  - manages intervals for long video objects as IC
  - short intervals extend the interval definition
    - keep track of a finished stream for a while after its termination
    - define the interval for a short stream as the length between the new stream and the position the old stream would have had in a longer video object
    - the cache requirement is, however, only the real requirement
  - caches the shortest intervals as in IC
[figure: stream S11 on video clip 1, with interval I11 extending past the end of the clip and cache requirement C11 limited to the clip itself]
51Generalized Interval Caching (GIC)
- Open function:
    if possible, form a new interval with the previous stream
    if not, exit  /* don't cache */
    compute interval size and cache requirement
    reorder the interval list  /* smallest first */
    if (not already in a cached interval)
        if (space available)
            cache interval
        else if (larger cached intervals exist and sufficient memory can be released)
            release memory from the larger intervals
            cache the new interval
- Close function:
    if (not following another stream) exit  /* not served from cache */
    delete the interval with the preceding stream
    free memory
    if (the next interval can be cached in the released memory)
        cache the next interval
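The open-path decision above can be sketched as follows. This is a simplification: intervals are just (size, cache_requirement) pairs, stream bookkeeping is omitted, and larger intervals are released one at a time until the new one fits:

```python
class IntervalCache:
    """Sketch of the GIC open/close decisions over a fixed cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.free = capacity
        self.cached = []        # (size, requirement) pairs, smallest first

    def open(self, interval):
        size, requirement = interval
        # release larger cached intervals while the new one doesn't fit
        while (self.free < requirement and self.cached
               and self.cached[-1][0] > size):
            _, freed = self.cached.pop()    # drop the largest interval
            self.free += freed
        if self.free >= requirement:
            self.cached.append(interval)
            self.cached.sort()              # keep smallest first
            self.free -= requirement
            return True
        return False                        # don't cache

    def close(self, interval):
        if interval in self.cached:
            self.cached.remove(interval)
            self.free += interval[1]
```

For example, with capacity 100 and cached intervals of sizes 10, 50, and 20, opening a size-30 interval releases the 50-unit interval and caches the new one, leaving the three shortest intervals cached, as GIC intends.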
52The EndSummary
53Summary
- Much work has been performed to optimize disk performance
- For multimedia streams, ...
  - time-aware scheduling is important
  - use large block sizes or read many contiguous blocks
  - prefetch data from disk to memory to get a hiccup-free playout
  - striping might not be necessary on new disks (at least not on all disks)
  - replication on multiple disks can offload a hot set of disks
  - memory caching can save disk I/Os, but it might not be worthwhile
  - ...
- BUT, new disks are "smart", and we cannot fully control the device
54Some References
- Advanced Computer Network Corporation: RAID.edu, http://www.raid.com/04_00.html, 2002
- Boll, S., Heinlein, C., Klas, W., Wandel, J.: "MPEG-L/MRP: Adaptive Streaming of MPEG Videos for Interactive Internet Applications", Proceedings of the 6th International Workshop on Multimedia Information Systems (MIS'00), Chicago, USA, October 2000, pp. 104-113
- Halvorsen, P., Goebel, V., Plagemann, T.: "Q-L/MRP: A Buffer Management Mechanism for QoS Support in a Multimedia DBMS", Proceedings of the 1998 IEEE International Workshop on Multimedia Database Management Systems (IW-MMDBMS'98), Dayton, Ohio, USA, August 1998, pp. 162-171
- Moser, F., Kraiss, A., Klas, W.: "L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS", Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995
- Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: "Operating System Support for Multimedia Systems", Computer Communications, Vol. 23, No. 3, February 2000, pp. 267-289
- Sitaram, D., Dan, A.: "Multimedia Servers: Applications, Environments, and Design", Morgan Kaufmann Publishers, 2000
- Zimmermann, R., Ghandeharizadeh, S.: "Continuous Display Using Heterogeneous Disk-Subsystems", Proceedings of the 5th ACM International Multimedia Conference, Seattle, WA, November 1997