1
Storage Systems Part I
INF SERV Media Storage and Distribution Systems
  • 31/10 - 2002
  • Carsten Griwodz & Pål Halvorsen

2
Overview
  • Block size
  • Data placement
  • Multiple disks
  • Prefetching
  • Managing heterogeneous disks
  • Memory caching

3
Block Size
4
Block Size I
  • The block size may have large effects on
    performance
  • Example: assume random block placement on disk
    and sequential file access
  • doubling the block size will halve the number
    of disk accesses
  • each access takes some more time to transfer
    the data, but the total time is the same (i.e.,
    more data per request)
  • half of the seek times are avoided
  • half of the rotational delays are avoided
  • e.g., when increasing the block size from 2 KB
    to 4 KB (no gaps, ...) for a Cheetah X15,
    typically on average
  • 3.6 ms is saved in seek time
  • 2 ms is saved in rotational delays
  • 0.026 ms is added in transfer time
  • saving a total of 5.6 ms when reading 4 KB
    (49.8 %)
  • e.g., increasing from 2 KB to 64 KB saves
    96.4 % when reading 64 KB
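
A short script can reproduce these numbers. This is a minimal sketch
assuming the slide's Cheetah X15 figures (3.6 ms average seek, 2 ms
average rotational delay, 0.026 ms transfer time per 2 KB) and that
every block access pays a full seek and rotational delay:

```python
# Minimal sketch with the slide's Cheetah X15 figures (assumed values)
SEEK_MS = 3.6            # average seek time per access
ROT_MS = 2.0             # average rotational delay per access
XFER_MS_PER_2KB = 0.026  # transfer time per 2 KB of data

def read_time_ms(file_kb, block_kb):
    accesses = file_kb / block_kb    # random placement: every access seeks
    transfer = (file_kb / 2) * XFER_MS_PER_2KB
    return accesses * (SEEK_MS + ROT_MS) + transfer

for file_kb, block_kb in ((4, 4), (64, 64)):
    base = read_time_ms(file_kb, 2)  # baseline: 2 KB blocks
    new = read_time_ms(file_kb, block_kb)
    print(f"{file_kb:2d} KB file, {block_kb:2d} KB blocks: "
          f"{base - new:5.1f} ms saved ({(base - new) / base:.1%})")
```

Running it prints 5.6 ms (49.8 %) saved for the 4 KB case and 96.4 %
saved for the 64 KB case, matching the figures above.
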
5
Block Size II
  • Thus, increasing block size can increase
    performance by reducing seek times and
    rotational delays
  • However, a large block size is not always best
  • blocks spanning several tracks still introduce
    latencies
  • small data elements may occupy only a fraction
    of the block
  • Which block size to use therefore depends on
    data size and data reference patterns
  • The trend, however, is to use large block sizes
    as new technology appears with increased
    performance - at least in high data rate systems

6
Data Placement on Disk
7
Data Placement on Disk I
  • Disk blocks can be assigned to files in many
    ways, and several schemes are designed for
  • optimized latency
  • increased throughput
  • access pattern dependent
  • Multimedia server approaches
  • interactive applications
  • popularity-based placement
  • striping and clustering
  • streaming applications
  • continuous placement
  • striping and clustering
  • replication
  • cross relations between objects
  • no (or at least only little) research yet

8
Data Placement on Disk II
  • Constant angular velocity (CAV) disks
  • equal amount of data in each track (and thus
    constant transfer time)
  • constant rotation speed
  • Zoned CAV disks
  • zones are ranges of tracks
  • typically a few zones
  • different amount of data on tracks in different
    zones, i.e., more data on outer tracks
  • One should always place often used or high rate
    data on the outermost tracks!?
  • NO, arm movement is often more important than
    transfer time!

9
Data Placement on Disk III
  • What is the connection between data popularity
    and placement?
  • one could gain from placing popular data at the
    right place - how?
  • zones might be important for placement - why?

[Figure: a zoned vs. a not zoned disk]
10
Data Placement on Disk IV
  • Continuous placement stores the disk blocks of
    a file contiguously on disk
  • minimal disk arm movement when reading a whole
    file
  • possible advantage
  • head must not move between read operations
    (often WRONG - we read other files as well)
  • real advantage
  • do not have to pre-determine the block size
    (whatever amount we read, at most track-to-track
    seeks are performed)

[Figure: files A, B, and C stored contiguously on disk]
11
Using Adjacent Sectors, Cylinders and Tracks
  • To avoid seek time (and possibly rotational
    delay), we can store data likely to be accessed
    together on
  • adjacent sectors (similar to using larger
    blocks)
  • if the track is full, use another track on the
    same cylinder (only use another head)
  • if the cylinder is full, use next cylinder
    (track-to-track seek)
  • Advantage
  • can approach theoretical transfer rate (no seeks
    or rotational delays)
  • Disadvantage
  • no gain if we have unpredictable disk accesses

12
Data Placement on Disk V
  • Interleaved placement tries to store blocks from
    a file with a fixed number of other blocks
    in-between each block
  • minimal disk arm movement when reading the
    files A, B and C
  • fine for predictable workloads reading multiple
    files
  • Non-interleaved (or even random) placement can be
    used for highly unpredictable workloads

13
Data Placement on Disk VI
  • Organ-pipe placement considers the usual disk
    head position
  • place the most popular data where the head most
    often is
  • the center of the disk is closest to the head
    using CAV disks - a bit outward for zoned CAV
    disks (modified organ-pipe)

[Figure: organ-pipe and modified organ-pipe popularity curves from the
innermost to the outermost track. Note: the skew depends on the
tradeoff between zoned transfer time and seek time]
14
Fast File System
  • FFS is a general file system
  • the idea is to keep an inode and its associated
    blocks close (no long seeks when getting the
    inode and data)
  • organizes the disk partitions into cylinder
    groups
  • each having several inodes
  • and a free block bitmap
  • tries to store a file within a cylinder group
  • next block on the same cylinder
  • otherwise, a block within the cylinder group
  • otherwise, find a block in another group using
    a hash function
  • otherwise, search all cylinder groups for a
    free block
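
A hedged sketch of this fallback order in Python; all file-system
helpers (free_near, free_in_group, hash_to_group, search_all_groups)
are hypothetical names, not real FFS interfaces:

```python
# Hedged sketch of FFS-style allocation: try the most local option
# first, then fall back. All helper methods are hypothetical names.
def allocate_block(fs, inode, prev_block):
    cg = inode.cylinder_group
    block = fs.free_near(prev_block)       # next block on same cylinder
    if block is None:
        block = fs.free_in_group(cg)       # any block in the same group
    if block is None:
        block = fs.free_in_group(fs.hash_to_group(inode))  # hashed group
    if block is None:
        block = fs.search_all_groups()     # exhaustive last resort
    return block
```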

15
Log-Structured File System
  • Log-structured placement is based on the
    assumptions (facts?) that
  • RAM memory is getting larger
  • writes are most expensive
  • reads can often be served from the buffer cache
    (!!??)
  • Organize the disk blocks as a circular log
  • periodically, all pending (so far buffered)
    writes are performed as a batch
  • write to the next free block regardless of
    content (inode, directory, data, ...)
  • a cleaner reorganizes holes and deleted blocks
    in the background
  • stores blocks contiguously when writing a single
    file
  • efficient for small writes, other operations as
    in a traditional UNIX FS

[Figure: the disk organized as a circular log]
16
Minorca File System I
  • Minorca is a multimedia file system (from
    IFI/UiO)
  • enhanced allocation of disk blocks for continuous
    storage of media files
  • supports both continuous and non-continuous files
    in the same system using different placement
    policies
  • Multimedia-Oriented Split Allocation (MOSA):
    one file system, two sections
  • cylinder group sections (CGSs) for non-continuous
    files
  • like traditional BSD FFS disk partitions
  • small block sizes (like 4 or 8 KB)
  • traditional FFS operations
  • extent sections for continuous files
  • extents contain one or more (adjacent) CGSs
  • summary information
  • allocation bitmap
  • data block area
  • expected to store one media file
  • large block sizes (e.g., 64 KB)
  • new transparent file operations - create a file
    using O_CREATEXT

[Figure: MOSA disk layout - cylinder group sections followed by a
series of extent sections]
17
Minorca File System II
  • Count-augmented address indexing in the extent
    section
  • observation: indirect block reads introduce
    disk I/O and break access locality (e.g.,
    between inode and data)
  • introduce a new inode structure
  • add a counter field to the original direct
    entries - direct points to a disk block and
    count indicates how many other blocks follow
    the first block (contiguously)
  • if continuous allocation is assured, each
    direct entry is able to address many more
    blocks without additionally retrieving an
    indirect block (see the sketch below)

[Figure: Minorca inode - attributes; 12 (direct, count) entry pairs
(direct 0/count 0 through direct 11/count 11); single, double, and
triple indirect pointers]
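
To see what the counter buys, here is a minimal sketch of the lookup;
the (start, count) entry list is an assumed representation of the
direct/count fields:

```python
# Hedged sketch: map a file-relative block number to a disk block using
# Minorca-style (direct, count) pairs. Assumed structure: the pair
# (start, count) covers disk blocks start .. start + count (contiguous).
def resolve(entries, logical_block):
    for start, count in entries:
        span = count + 1                  # first block plus count followers
        if logical_block < span:
            return start + logical_block  # hit within this contiguous run
        logical_block -= span
    raise IndexError("beyond direct entries - needs an indirect block read")

# One direct entry covering 16 contiguous blocks replaces 16 plain entries:
print(resolve([(1000, 15), (2048, 0)], 12))  # -> 1012, no indirect I/O
```
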
18
Other File Systems Examples
  • Continuous Allocation
  • Presto
  • similar to Minorca extents for continuous files
  • doesn't support small, discrete files
  • Fellini
  • simple flat file system
  • maintains free block list with grouping
    contiguous blocks
  • Continuous Media File System
  • Several systems use multiple disks and stripe
    data
  • Symphony
  • Tiger Shark
  • Tiger

19
Prefetching
20
Prefetching
  • If we can predict the access pattern, one might
    speed up performance using prefetching
  • a video playout is often linear → easy to
    predict the access pattern
  • eases disk scheduling
  • read larger amounts of data per request
  • data is in memory when requested, reducing page
    faults
  • One way of doing prefetching is read-ahead
  • read more than the requested block into memory
  • serve next read requests from buffer cache
  • Another way of doing prefetching is double
    (multiple) buffering
  • read data into first buffer
  • process data in first buffer and at the same
    time read data into second buffer
  • process data in second buffer and at the same
    time read data into first buffer
  • etc.

21
Multiple Buffering I
  • Example: we have a file with block sequence B1,
    B2, ... and our program processes the data
    sequentially, i.e., B1, B2, ...
  • single buffer solution
  • read B1 → buffer
  • process data in buffer
  • read B2 → buffer
  • process data in buffer
  • ...
  • if P = time to process a block and R = time to
    read in 1 block, then for n blocks:
    single buffer time = n (P + R)

[Figure: a single buffer in memory between the disk and the process]
22
Multiple Buffering II
  • double buffer solution
  • read B1 → buffer1
  • process data in buffer1, read B2 → buffer2
  • process data in buffer2, read B3 → buffer1
  • process data in buffer1, read B4 → buffer2
  • ...
  • if P = time to process a block and R = time to
    read in 1 block, then for n blocks:
    if P ≥ R, double buffer time = R + nP
  • if P < R, we can try to add buffers
    (n-buffering, see the sketch below)

[Figure: two buffers in memory between the disk and the process]
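
A minimal, runnable sketch of double buffering with a bounded queue
between a reader thread and the consumer; read_block and process only
simulate the work with assumed timings R and P:

```python
import queue
import threading
import time

R, P = 0.01, 0.02        # assumed per-block read and process times (P ≥ R)

def read_block(i):
    time.sleep(R)        # simulate a disk read
    return f"B{i}"

def process(buf):
    time.sleep(P)        # simulate processing the block

def run(n, depth=1):
    """depth=1 approximates double buffering; a larger depth, n-buffering."""
    q = queue.Queue(maxsize=depth)
    def reader():        # fills one buffer while the other is processed
        for i in range(n):
            q.put(read_block(i))
        q.put(None)      # sentinel: no more blocks
    threading.Thread(target=reader, daemon=True).start()
    while (buf := q.get()) is not None:
        process(buf)

start = time.perf_counter()
run(20)
print(f"elapsed ≈ {time.perf_counter() - start:.2f} s, "
      f"R + nP = {R + 20 * P:.2f} s (vs n(P + R) = {20 * (P + R):.2f} s)")
```
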
23
Multiple Disks
24
Multiple Disks
  • Disk controllers and busses can manage several
    devices
  • One can improve total system performance by
    replacing one large disk with many small disks
    accessed in parallel
  • Several independent heads can read
    simultaneously (if the other parts of the
    system can manage the speed)

[Figure: one single disk vs. two smaller disks]
Note: the single disk might be faster, but as seek time and rotational
delay are the dominant factors of total disk access time, the two
smaller disks might operate faster together, performing seeks in
parallel...
25
Striping
  • Another reason to use multiple disks is when
    one disk cannot deliver the requested data rate
  • In such a scenario, one might use several disks
    for striping
  • disk bandwidth: Bdisk
  • required bandwidth: Bdisplay
  • Bdisplay > Bdisk
  • read from n disks in parallel: n Bdisk > Bdisplay
  • clients are serviced in rounds
  • Advantages
  • high data rates
  • faster response time compared to one disk
  • Drawbacks
  • can't serve multiple clients in parallel
  • positioning time increases (i.e., reduced
    efficiency)
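
As a back-of-the-envelope check of n Bdisk > Bdisplay (the rates in
the example are assumptions, not from the slides):

```python
import math

# Smallest n satisfying n * b_disk > b_display (strict inequality)
def stripe_width(b_display, b_disk):
    n = math.ceil(b_display / b_disk)
    return n + 1 if n * b_disk == b_display else n

print(stripe_width(b_display=40.0, b_disk=15.0))  # -> 3 disks
```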

26
Interleaving (Compound Striping)
  • Full striping is usually not necessary today
  • faster disks
  • better compression algorithms
  • Interleaving lets each client be serviced by
    only a subset of the available disks
  • make groups
  • stripe data in such a way that consecutive
    requests arrive at the next group (here each
    disk is a group)

27
Interleaving (Compound Striping)
  • Divide traditional striping group into
    sub-groups, e.g., staggered striping
  • Advantages
  • multiple clients can still be served in parallel
  • more efficient disks
  • potentially shorter response time
  • Drawbacks
  • load balancing (what if all clients access the
    same group?)

28
Mirroring
  • With multiple disks, we might come into the
    situation where all requests are for one of the
    disks and the rest lie idle
  • In such cases, it might make sense to have
    replicas of the data on several disks - if we
    have identical disks, it is called mirroring
  • Advantages
  • faster response time
  • survive crashes (fault tolerance)
  • load balancing by dividing the requests for the
    data equally among the mirrored disks
  • Drawbacks
  • increased storage requirements and write
    operations

29
Redundant Array of Inexpensive Disks
  • The various RAID levels define different disk
    organizations to achieve higher performance and
    more reliability
  • RAID 0 - striped disk array without fault
    tolerance (non-redundant)
  • RAID 1 - mirroring
  • RAID 2 - memory-style error correcting code
    (Hamming code ECC)
  • RAID 3 - bit-interleaved parity
  • RAID 4 - block-interleaved parity
  • RAID 5 - block-interleaved distributed parity
  • RAID 6 - independent data disks with two
    independent distributed parity schemes (P+Q
    redundancy)
  • RAID 7
  • RAID 10
  • RAID 53
  • RAID 0+1

30
Redundant Array of Inexpensive Disks
  • RAID is intended ...
  • ... for general systems
  • ... to give higher throughput
  • ... to be fault tolerant
  • For multimedia systems, some requirements are
    missing
  • low latency
  • guaranteed response time
  • optimizations for linear access to large objects
  • optimizations for cyclic operations

31
Replication
  • In traditional RAID systems, replication is
    often used for fault tolerance (and higher
    performance in the newer combined levels)
  • Replication in multimedia systems is used for
  • reducing hot spots
  • increasing scalability
  • higher performance
  • ...but fault tolerance comes as a side effect
  • Replication in multimedia scenarios should
  • be based on observed load
  • be changed dynamically as popularity changes

32
Dynamic Segment Replication (DSR)
  • DSR tries to balance load by dynamically
    replicating hot data
  • assumes read-only, VoD-like retrieval
  • predefines a load threshold for when to
    replicate a segment by examining current and
    expected load
  • replicates when the threshold is reached, but
    which segment?
  • not necessarily the segment that receives the
    additional requests (another segment may have
    more requests)
  • replicates based on a payoff factor p (replicate
    the segment x with the highest p)

33
Some Challenges Managing Multiple Disks
  • How large should a stripe group and stripe unit
    be?
  • Can one avoid hot sets of disks (load
    imbalance)?
  • Heterogeneous disks?
  • What and when to replicate?

34
Heterogeneous Disks
35
File Placement
  • A multimedia file might be stored (striped) on
    multiple disks, but how should one choose which
    devices?
  • storage devices are limited by both bandwidth
    and space
  • we have hot (frequently viewed) and cold
    (rarely viewed) files
  • we may have several heterogeneous storage
    devices
  • the objective of a file placement policy is to
    achieve maximum utilization of both bandwidth
    and space, and hence efficient usage of all
    devices, by avoiding load imbalance
  • must consider expected load and storage
    requirements
  • should a file be replicated?
  • expected load may change over time

36
Bandwidth-to-Space Ratio (BSR) I
  • BSR attempts to mix hot and cold as well as
    large and small multimedia objects on
    heterogeneous devices
  • don't optimize placement based on throughput or
    space only
  • BSR considers both the required storage space
    and the throughput requirement (which depends
    on playout rate and popularity) to achieve the
    best combined device utilization

[Figure: three disks - one whose bandwidth-to-space ratio matches the
media objects (no deviation), one with wasted space, and one with
wasted bandwidth (large deviation); an object's bandwidth requirement
may vary according to popularity]
37
Bandwidth-to-Space Ratio (BSR) II
  • The BSR policy algorithm
  • input: space and bandwidth requirements
  • phase 1
  • find a device to place the media object
    according to BSR (sketched below)
  • if no device, or stripe of devices, can give
    sufficient space or bandwidth, then add replicas
  • phase 2
  • find devices for the needed replicas
  • phase 3
  • allocate the expected load on the replica
    devices according to the BSR of the devices
  • phase 4
  • if not enough resources are available, see if
    other media objects can delete replicas
    according to their current workload
  • all phases may be needed when adding a new
    media object or increasing the workload; for a
    decrease, only phase 3 (reallocation) is needed
  • Popular, high data rate movies should be on
    high bandwidth disks
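
A hedged sketch of the device choice in phase 1; the device and object
fields are assumed names, and the deviation measure is one plausible
reading of placing "according to BSR":

```python
# Hedged sketch of BSR phase 1: among devices with enough free space
# and bandwidth, pick the one whose free bandwidth-to-space ratio,
# after placing the object, stays closest to the device's overall BSR.
# Field names are assumptions, not from the BSR paper.
def best_device(devices, obj):
    fits = [d for d in devices
            if d.free_space >= obj.size and d.free_bw >= obj.bandwidth]
    if not fits:
        return None  # fall through to phase 2: add replicas elsewhere
    def deviation(d):
        space_left = d.free_space - obj.size
        bw_left = d.free_bw - obj.bandwidth
        if space_left == 0:        # device filled exactly:
            return abs(bw_left)    # leftover bandwidth is wasted
        return abs(bw_left / space_left - d.total_bw / d.total_space)
    return min(fits, key=deviation)
```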

38
Disk Grouping
  • Disk grouping is a technique to stripe (or
    fragment) data over heterogeneous disks
  • groups heterogeneous physical disks into
    homogeneous logical disks
  • the amount of data on each disk (the fragments)
    is determined so that the service time (based
    on worst-case seeks) is equal for all physical
    disks in a logical disk (see the sketch below)
  • blocks for an object are placed on (and read
    from) the logical disks in a round-robin manner
    - all disks in a group are activated
    simultaneously

[Figure: two physical disks per logical disk; logical blocks X0, X1,
X2, X3 are split into fragments Xi,0 and Xi,1 and assigned to logical
disks 0 and 1 in a round-robin manner]
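
The equal-service-time condition fixes the fragment sizes. A sketch of
the algebra, under the assumed model that disk i spends a worst-case
overhead (seek plus rotation) and then transfers at a fixed rate:

```python
# Split a logical block of `total` KB over heterogeneous disks so that
# every disk finishes at the same time T. Model (an assumption): disk i
# needs overhead_i ms plus x_i / rate_i ms, and sum(x_i) = total.
# From overhead_i + x_i / rate_i = T:  x_i = rate_i * (T - overhead_i),
# so T = (total + sum(overhead_i * rate_i)) / sum(rate_i).
def fragment_sizes(total, disks):  # disks: [(overhead_ms, rate_kb_per_ms)]
    t = (total + sum(o * r for o, r in disks)) / sum(r for _, r in disks)
    return [r * (t - o) for o, r in disks]  # negative => disk too slow

print(fragment_sizes(1024, [(8.0, 40.0), (10.0, 25.0)]))
```
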
39
Staggered Disk Grouping
  • Staggered disk grouping is a variant of disk
    grouping that minimizes the memory requirement
  • reading and playing out are done differently
  • not all fragments of a logical block are needed
    at the same time
  • the first (and largest) fragment goes on the
    most powerful disk, etc.
  • read sequentially (no need to buffer later
    fragments for a long time)
  • start the display when the largest fragment is
    read

[Figure: as for disk grouping, but the fragments Xi,0 and Xi,1 of a
logical block are read in sequence rather than simultaneously]
40
Disk Merging
  • Disk merging forms logical disks from capacity
    fragments of the physical disks
  • all logical disks are homogeneous
  • supports an arbitrary mix of heterogeneous
    disks (grouping needs equal groups)
  • starts by choosing how many logical disks the
    slowest devices shall support (e.g., 1 for
    disks 1 and 3) and calculates the corresponding
    number for the more powerful devices (e.g., 1.5
    for disks 0 and 2 if these disks are 1.5 times
    better)
  • most powerful scheme - most flexible (arbitrary
    mix of devices) and can be adapted to zoned
    disks (each zone considered as a disk)

[Figure: five logical disks X0..X4 mapped onto four physical disks;
logical disk X2 is split across two physical disks (X2,0 and X2,1)]
41
Memory Caching
42
Data Path (Intel Hub Architecture)
[Figure: data path through an Intel hub architecture PC - a Pentium 4
processor with registers and cache(s), RDRAM memory holding the
application, file system, and communication system, and PCI slots
attaching the disk and the network card]
43
Memory Caching
  • How do we manage a cache?
  • how much memory to use?
  • how much data to prefetch?
  • which data item to replace?

[Figure: the cache sits between the application and the expensive disk
and network I/O paths through the file system and communication system]
44
Is Caching Useful in a Multimedia Scenario?
  • High rate data may need lots of memory for
    caching
  • Tradeoff: amount of memory, algorithm
    complexity, gain, ...
  • Cache only frequently used data - how? (e.g.,
    cache the first (small) parts of a broadcast
    partitioning scheme, allow top-ten items only,
    ...)

Note: the maximum amount of memory (in total) that a Dell server can
manage today - and not all of it is used for caching
45
Need for Special Multimedia Algorithms?
  • Most existing systems use an LRU variant, e.g.,
  • keep a sorted list
  • replace the first in the list
  • insert new data elements at the end
  • if a data element is re-accessed, move it back
    to the end of the list
  • Example: playout of video frames (see the
    figure and sketch below) - in this case, LRU
    replaces the next needed frame, so the answer
    is in many cases YES

[Figure: playout of a 7-frame video through an LRU buffer, ordered from
longest to shortest time since access; after a rewind and restart at
frame 1, each playout evicts exactly the frame that is needed next]
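
A runnable sketch of this pathology, assuming a buffer one frame
smaller than the video so the working set never quite fits:

```python
from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity, self.frames = capacity, OrderedDict()
    def access(self, frame):
        if frame in self.frames:
            self.frames.move_to_end(frame)   # re-accessed: back of list
            return True                      # hit
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)  # replace first in list
        self.frames[frame] = True
        return False                         # miss

# Assumed setup: a 7-frame video, a buffer holding only 6 frames.
buf, misses = LRUBuffer(6), 0
for playout in range(3):                     # play, rewind, replay, ...
    for frame in range(1, 8):
        misses += not buf.access(frame)
print(misses)  # 21 of 21 accesses miss: LRU always evicted the next frame
```
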
46
Classification of Mechanisms
  • Block-level caching considers a (possibly
    unrelated) set of blocks
  • each data element is viewed as an independent
    item
  • usually used in traditional systems
  • e.g., FIFO, LRU, CLOCK, ...
  • multimedia approaches:
  • L/MRP (Least/Most Relevant for Presentation)
  • Stream-dependent caching considers a stream
    object as a whole
  • related data elements are treated in the same
    way
  • research prototypes in multimedia systems
  • e.g.,
  • BASIC
  • DISTANCE
  • Interval Caching (IC)
  • Generalized Interval Caching (GIC)
  • Split and Merge (SAM)
  • SHR

47
Least/Most Relevant for Presentation (L/MRP)
[Moser et al. 95]
  • L/MRP is a buffer management mechanism for a
    single interactive, continuous data stream
  • adaptable to individual multimedia applications
  • supports pre-loading, i.e., prefetching data
    from disk
  • replaces the least relevant pages with regard
    to the current playout of the multimedia stream
    (illustrated below)

COPUs: continuous object presentation units
[Figure: COPUs 10-26 along the playback direction; relevance is highest
at the current playout point and decreases with distance on both sides]
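
The ordering in the figure can be approximated with a simple
distance-based value function. This is only an illustration, not the
exact L/MRP relevance functions; the horizon and weights are
assumptions:

```python
# Illustrative sketch (not the exact L/MRP functions): COPUs about to
# be played ("referenced") get the highest relevance, recently played
# ones ("history") somewhat less, both decaying with distance from the
# current playout point. Horizon and the 0.5 weight are assumptions.
def relevance(copu, current, direction=1, horizon=10):
    dist = (copu - current) * direction
    if 0 <= dist <= horizon:                 # referenced set
        return 1.0 - dist / (horizon + 1)
    if -horizon <= dist < 0:                 # history set
        return 0.5 * (1.0 + dist / (horizon + 1))
    return 0.0                               # replacement candidates

order = sorted(range(10, 27), key=lambda c: -relevance(c, 15))
print(order)  # highest relevance first: [15, 16, 17, ...]
```
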
48
Least/Most Relevant for Presentation (L/MRP)
  • L/MRP
  • gives few disk accesses (compared to other
    schemes)
  • supports interactivity
  • supports prefetching
  • targeted at single streams (users)
  • expensive to execute (calculates relevance
    values for all COPUs each round)
  • Variations
  • Q-L/MRP: extends L/MRP with multiple streams
    and changes the prefetching mechanism (reduces
    overhead) [Halvorsen et al. 98]
  • MPEG-L/MRP: gives different relevance values to
    different MPEG frames [Boll et al. 00]

49
Interval Caching (IC)
  • Interval caching (IC) is a caching strategy
    for streaming servers
  • caches data between requests for the same video
    stream, based on the playout intervals between
    the requests
  • following requests are thus served from the
    cache (not the disk) filled by the preceding
    stream
  • sort the intervals by length; the buffer
    requirement is the data size of the interval
  • to maximize the cache hit ratio (minimize disk
    accesses), the shortest intervals are cached
    first (see the sketch below)

[Figure: streams S11-S33 at different playout positions on three
videos; the intervals between consecutive streams on the same video are
the caching candidates]
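
A hedged sketch of the selection step; the stream representation
(id, video, playout position) and the unit-free budget are assumptions:

```python
# Hedged sketch of interval caching: given the active streams, cache
# the shortest intervals between consecutive streams on the same video
# until the memory budget is spent.
def choose_intervals(streams, budget):
    by_video, intervals = {}, []
    for sid, video, pos in streams:
        by_video.setdefault(video, []).append((pos, sid))
    for video, ss in by_video.items():
        ss.sort()                       # order streams by playout position
        for (p1, s1), (p2, s2) in zip(ss, ss[1:]):
            # follower s1 is served from data cached behind leader s2
            intervals.append((p2 - p1, s2, s1))
    intervals.sort()                    # shortest intervals first
    chosen, used = [], 0
    for length, leader, follower in intervals:
        if used + length <= budget:     # buffer need = interval data size
            chosen.append((leader, follower))
            used += length
    return chosen

streams = [("s11", 1, 120), ("s12", 1, 95), ("s21", 2, 40),
           ("s31", 3, 300), ("s32", 3, 20), ("s33", 3, 10)]
print(choose_intervals(streams, budget=50))
```
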
50
Generalized Interval Caching (GIC)
  • Interval caching (IC) does not work for short
    clips
  • a frequently accessed short clip will not be
    cached
  • GIC generalizes the IC strategy
  • manages intervals for long video objects as IC
  • for short intervals, it extends the interval
    definition
  • keep track of a finished stream for a while
    after its termination
  • define the interval for a short stream as the
    distance between the new stream and the
    position the old stream would have had if the
    video object had been longer
  • the cache requirement is, however, only the
    real requirement
  • cache the shortest intervals as in IC

[Figure: stream S11 on short video clip 1; the interval I11 extends
past the end of the clip, while the cache requirement C11 covers only
the clip itself]
51
Generalized Interval Caching (GIC)
  • Open function:
        form (if possible) a new interval with the
        previous stream
        if (NO) exit                /* don't cache */
        compute interval size and cache requirement
        reorder the interval list   /* smallest first */
        if (not already in a cached interval)
            if (space available)
                cache interval
            else if (larger cached intervals exist and
                     sufficient memory can be released)
                release memory from larger intervals
                cache new interval
  • Close function:
        if (not following another stream) exit
                                    /* not served from cache */
        delete interval with preceding stream
        free memory
        if (next interval can be cached in released
            memory)
            cache next interval

52
The End - Summary
53
Summary
  • Much work has been performed to optimize disk
    performance
  • For multimedia streams, ...
  • time-aware scheduling is important
  • use large block sizes or read many contiguous
    blocks
  • prefetch data from disk to memory to get a
    hiccup-free playout
  • striping might not be necessary on new disks
    (at least not on all disks)
  • replication on multiple disks can offload a hot
    set of disks
  • memory caching can save disk I/Os, but it might
    not be worthwhile
  • ...
  • BUT, new disks are "smart"; we cannot fully
    control the device

54
Some References
  • Advanced Computer & Network Corporation:
    RAID.edu, http://www.raid.com/04_00.html, 2002
  • Boll, S., Heinlein, C., Klas, W., Wandel, J.:
    "MPEG-L/MRP: Adaptive Streaming of MPEG Videos
    for Interactive Internet Applications",
    Proceedings of the 6th International Workshop
    on Multimedia Information Systems (MIS'00),
    Chicago, IL, USA, October 2000, pp. 104-113
  • Halvorsen, P., Goebel, V., Plagemann, T.:
    "Q-L/MRP: A Buffer Management Mechanism for QoS
    Support in a Multimedia DBMS", Proceedings of
    the 1998 IEEE International Workshop on
    Multimedia Database Management Systems
    (IW-MMDBMS'98), Dayton, OH, USA, August 1998,
    pp. 162-171
  • Moser, F., Kraiss, A., Klas, W.: "L/MRP: A
    Buffer Management Strategy for Interactive
    Continuous Data Flows in a Multimedia DBMS",
    Proceedings of the 21st VLDB Conference,
    Zurich, Switzerland, 1995
  • Plagemann, T., Goebel, V., Halvorsen, P.,
    Anshus, O.: "Operating System Support for
    Multimedia Systems", Computer Communications,
    Vol. 23, No. 3, February 2000, pp. 267-289
  • Sitaram, D., Dan, A.: "Multimedia Servers:
    Applications, Environments, and Design", Morgan
    Kaufmann Publishers, 2000
  • Zimmermann, R., Ghandeharizadeh, S.:
    "Continuous Display Using Heterogeneous
    Disk-Subsystems", Proceedings of the 5th ACM
    International Multimedia Conference, Seattle,
    WA, November 1997