Title: Mirrored Storage
Transcript and Presenter's Notes
1
Mirrored Storage
  • Mirrored storage replicates data over two or more
    plexes of the same size
  • Mirrored storage with two mirrors corresponds to
    RAID 1
  • Bandwidth and I/O rate of mirrored storage depend
    on the direction of data flow
  • Performance for read operations is additive:
    mirrored storage that uses n plexes gives n times
    the bandwidth and I/O rate of a single plex for
    read requests
  • Write bandwidth and I/O rate are a bit less than
    those of a single plex
  • If write requests cannot be issued in parallel,
    but happen one after the other, write performance
    will be n times worse than that of a single plex

2
Mirrored Storage
  • Algorithms for servicing a read request (sketched
    below):
  • Round robin
  • Preferred Mirror
  • Least Busy
  • The forte of mirrored storage is increased
    reliability, whereas striped or concatenated
    storage gives decreased reliability
  • In case a disk fails, it can be hot-swapped
    (manually replaced on-line with a new working
    disk). Alternatively, a hot standby disk can be
    deployed
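The three read-scheduling policies above can be illustrated with a
minimal C sketch; the plex and mirror structures and the field names
here are assumptions for illustration, not the API of any particular
volume manager.

    /* Hypothetical structures; not taken from a real volume manager. */
    struct plex {
        int outstanding_ios;            /* I/Os currently in flight    */
    };

    struct mirror {
        struct plex *plexes;            /* the n plexes of the mirror  */
        int n_plexes;
        int rr_next;                    /* round-robin cursor          */
        int preferred;                  /* index of the preferred plex */
    };

    /* Round robin: rotate through the plexes for successive reads. */
    static int pick_round_robin(struct mirror *m)
    {
        int i = m->rr_next;
        m->rr_next = (m->rr_next + 1) % m->n_plexes;
        return i;
    }

    /* Preferred mirror: always read from one designated plex. */
    static int pick_preferred(struct mirror *m)
    {
        return m->preferred;
    }

    /* Least busy: read from the plex with the fewest outstanding I/Os. */
    static int pick_least_busy(struct mirror *m)
    {
        int best = 0, i;
        for (i = 1; i < m->n_plexes; i++)
            if (m->plexes[i].outstanding_ios <
                m->plexes[best].outstanding_ios)
                best = i;
        return best;
    }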

3
RAID Storage
  • In RAID 3, a stripe spans n subdisks; each stripe
    stores data on n-1 subdisks and parity on the
    last
  • RAID 5 differs from RAID 3 in that the parity is
    distributed over different subdisks for different
    stripes, and a stripe can be read or written
    partially

4
(No Transcript)
5
RAID Storage
  • RAID 3 storage capacity equals n-1 subdisks,
    since one subdisk capacity is used up for storing
    parity data
  • Bandwidth and I/O rate of an n-way RAID 3 storage
    is equivalent to (n-1)-way striped storage
  • The minimum unit of I/O for RAID 3 is equal to
    one stripe.
  • If a write request spans one stripe exactly,
    performance is least impacted. The only overhead
    is computing contents of one parity block and
    writing it, thus n I/Os are required instead of
    n-1 I/Os for an equivalent (n-1)-way striped
    storage
  • A small write request must be handled as a
    read-modify-write sequence for the whole stripe;
    that makes it 2n I/Os
  • RAID 3 storage provides protection against one
    disk failure

6
Chained Declustering

              Server0   Server1   Server2   Server3
   Primary    D0        D1        D2        D3
   Backup     D3        D0        D1        D2
   Primary    D4        D5        D6        D7
   Backup     D7        D4        D5        D6
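The placement rule behind the layout above can be sketched in a few
lines of C; the function names and the zero-based numbering are
illustrative assumptions.

    /* Chained declustering over n servers: the primary copy of block i
     * lives on server i mod n, and the backup copy on the next server
     * in the chain. */
    static int primary_server(int block, int n_servers)
    {
        return block % n_servers;
    }

    static int backup_server(int block, int n_servers)
    {
        return (block % n_servers + 1) % n_servers;
    }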
7
Chained Declustering: Server Failure

  [Diagram: the same blocks D0-D7 and their copies spread over
  Server0 through Server3, with one server failed; each block held by
  the failed server remains available from its copy on a neighboring
  server]

Server failed, but all data is still available
8
RAID Storage
  • RAID 5 storage capacity equals n-1 subdisks, since
    one subdisk capacity is used up for storing
    parity data
  • Bandwidth and I/O rate of an n-way RAID 5 storage
    is equivalent to n-way striped storage, because
    the parity blocks are distributed over all disks
  • RAID 5 works the same as RAID 3 when write
    requests span one or more full stripes. However,
    for a small write RAID 5 only requires four disk
    I/Os (the parity update is sketched below):
  • Read 1: old data
  • Read 2: old parity
  • Compute new parity as the XOR sum of old data,
    old parity, and new data
  • Write 3: new data
  • Write 4: new parity
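The parity computation in the four-I/O sequence above amounts to a
byte-wise XOR; a minimal sketch (block size and buffer names are
illustrative assumptions):

    #include <stddef.h>

    /* RAID 5 small write: new parity = old data XOR old parity XOR
     * new data, computed over one block. The parity buffer holds the
     * old parity on entry and the new parity on return. */
    static void raid5_update_parity(const unsigned char *old_data,
                                    const unsigned char *new_data,
                                    unsigned char *parity,
                                    size_t block_size)
    {
        size_t i;
        for (i = 0; i < block_size; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }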

9
RAID Storage
  • As is the case with mirrored storage, RAID
    storage is also vulnerable with respect to host
    computer crashes while write requests are in
    flight to disks
  • A single logical request can result in two to n
    physical write requests, since parity is always
    updated
  • If some writes succeed and some do not, the
    stripe becomes inconsistent

10
Compound Storage
  • Mirrored stripes (RAID 0+1)
  • Two striped storage plexes of equal capacity can
    be mirrored to form a single volume. Each plex
    would be resident on a separate disk array
  • Striped mirrors (RAID 1+0)
  • Multiple plexes, each containing a pair of
    mirrored subdisks, can be aggregated using
    striping to form a single volume
  • Each mirrored plex provides reliability

11
Compound Storage
  • In both cases, storage cost is doubled due to
    two-way mirroring
  • Mirrored-striped storage
  • If a disk fails in mirrored stripe storage, one
    whole plex is declared failed.
  • After the failure is repaired, the whole plex
    must be rebuilt by copying from the good plex
  • Storage is vulnerable to a second disk failure in
    the good plex until the mirror is rebuilt
  • Striped-mirror storage
  • If a disk fails in striped-mirror storage, no
    plex is failed.
  • After the disk is repaired, only data of that one
    disk needs to be rebuilt from the other disk
  • Storage is vulnerable only to a failure of the
    other disk in the same mirror pair
  • Thus, striped mirrors are preferable over
    mirrored stripes

12
(No Transcript)
13
(No Transcript)
14
Dynamic multipathing
  • If the I/O path from host computer to disk
    storage fails due to host bus adapter card
    failure, storage availability is completely lost
  • Redundant I/O channels are added to the hardware
    configuration by putting in extra HBAs that
    connect to independent I/O cables
  • Disk arrays must also support multiple I/O ports
    to plug in multiple cables
  • Once redundant I/O paths are available in the
    hardware, a volume manager can utilize these
    paths to provide protection against I/O channel
    failure

15
(No Transcript)
16
Dynamic Multipathing
  • Active/passive
  • Only one port is active for I/O at a time; the
    other port is passive and does not service I/O
  • In case the I/O path to the active port fails,
    the passive port is activated by a special
    command issued on the I/O path to the passive
    port
  • Active/active
  • Allows I/O requests to be sent to its disks down
    both I/O paths concurrently
  • The volume manager sends I/O requests through one
    active I/O channel, or it may balance I/O traffic
    over multiple active channels

17
(No Transcript)
18
Issues with server failure
  • Out of sync
  • One approach is to distrust all mirrors except one
    and simply copy all data from that mirror to the
    remaining ones
  • It can take hours to rebuild out-of-sync mirrored
    storage, and such storage should not be accessed
    until the mirror rebuild completes
  • Dirty Region Logging (DRL)
  • Logs the addresses that undergo writes
  • It divides the whole volume into a number of
    regions. If an I/O request falls within a region,
    that region is marked dirty, and its identity is
    logged (see the sketch below)
  • Since at most a few hundred I/O requests would be
    in flight when the server crashed, the number of
    blocks that are truly inconsistent is much smaller
    than the total number of blocks on the mirrored
    storage
  • An alternative to DRL is to use a full-fledged
    transaction mechanism to log all intended writes
    to separate stable storage before initiating any
    physical write
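A minimal sketch of dirty region logging, assuming a 1MB region size
and an in-memory bitmap (a real implementation writes the log to
stable storage before issuing the data write):

    /* Illustrative sizes; a real DRL chooses the region size so the
     * whole log stays small. */
    #define REGION_SIZE  (1024 * 1024)      /* 1 MB regions (assumption)   */
    #define N_REGIONS    65536              /* e.g. a 64 GB mirrored volume */

    static unsigned char dirty_bitmap[N_REGIONS / 8];

    /* Mark the region containing byte offset 'off' dirty before the
     * write to the mirrors is issued. */
    static void drl_mark_dirty(unsigned long long off)
    {
        unsigned long long region = off / REGION_SIZE;
        dirty_bitmap[region / 8] |= (unsigned char)(1u << (region % 8));
    }

    /* After a crash, only regions whose bit is set need to be
     * resynchronized by copying them from one mirror to the others. */
    static int drl_is_dirty(unsigned long long region)
    {
        return (dirty_bitmap[region / 8] >> (region % 8)) & 1;
    }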

19
Page descriptors
  • The kernel must keep track of the current status
    of each page frame
  • State information of a page frame is kept in a
    page descriptor of type struct page

20
Page Descriptors
  • struct list_head list
  • struct address_space *mapping
  • Used when the page is inserted into the page cache
  • unsigned long index
  • The position of the data stored in the page
  • struct page *next_hash
  • atomic_t count
  • unsigned long flags
  • PG_locked, PG_referenced, PG_uptodate, ...
  • struct list_head lru
  • wait_queue_head_t wait
  • struct page **pprev_hash
  • struct buffer_head *buffers
  • void *virtual
  • struct zone_struct *zone
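Collected into a single declaration, the fields above form the Linux
2.4 page descriptor; this is a simplified view, and the exact layout
differs between 2.4 releases.

    typedef struct page {
        struct list_head list;          /* free-list / cache lists           */
        struct address_space *mapping;  /* owning address space, if any      */
        unsigned long index;            /* position of the data in the owner */
        struct page *next_hash;         /* page-cache hash chain             */
        atomic_t count;                 /* usage reference counter           */
        unsigned long flags;            /* PG_locked, PG_referenced, ...     */
        struct list_head lru;           /* active/inactive list linkage      */
        wait_queue_head_t wait;         /* processes waiting on the page     */
        struct page **pprev_hash;       /* back pointer in the hash chain    */
        struct buffer_head *buffers;    /* buffer heads mapping the page     */
        void *virtual;                  /* kernel virtual address, if mapped */
        struct zone_struct *zone;       /* memory zone owning the frame      */
    } mem_map_t;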

21
Memory zones
  • ZONE_DMA
  • Contains pages of memory below 16MB
  • Used by the DMA processors for ISA buses
  • ZONE_NORMAL
  • Contains pages of memory at and above 16MB and
    below 896MB
  • ZONE_HIGHMEM
  • Contains pages of memory at and above 896MB

22
Memory zones
  • The ZONE_DMA and ZONE_NORMAL zones include the
    normal pages of memory that can be directly
    accessed by the kernel through the linear mapping
    in the fourth gigabyte of the linear address
    space
  • The ZONE_HIGHMEM includes pages of memory that
    cannot be directly accessed by the kernel through
    the linear mapping
  • Each memory zone has its own descriptor of type
    struct zone_struct
  • (See pp. 220-221)

23
Non-Uniform Memory Access (NUMA)
  • Linux 2.4 supports the Non-Uniform Memory Access
    (NUMA) model, in which the access time for
    different memory locations from a given CPU may
    vary
  • The physical memory of the system is partitioned
    into several nodes
  • The time needed by any given CPU to access pages
    within a single node is the same

24
Non-Uniform Memory Access(NUMA)
  • The physical memory inside each node can be split
    into several zones
  • If NUMA support is not compiled in the kernel,
    Linux makes use of a single node that includes
    all system physical memory

25
Memory initialization
  • paging_init() invokes free_area_init(), which:
  • Computes the total number of page frames in RAM
    and stores the result in totalpages
  • Initializes the active_list and inactive_list
    lists of page descriptors
  • Allocates space for the mem_map array of page
    descriptors
  • Initializes some fields of the node descriptor
    contig_page_data:
  • contig_page_data.node_size = totalpages
  • contig_page_data.node_start_paddr = 0x00000000
  • contig_page_data.node_start_mapnr = 0

26
Memory initialization
  • Initializes some fields of all page descriptors:
  • for (p = mem_map; p < mem_map + totalpages; p++) {
  •     p->count = 0;
  •     SetPageReserved(p);
  •     init_waitqueue_head(&p->wait);
  •     p->list.next = p->list.prev = p;
  • }
  • Initializes some fields of the memory zone
    descriptor pointed to by the zone local variable:
  • zone->name = zone_names[j];
  • zone->size = zone_size[j];
  • zone->lock = SPIN_LOCK_UNLOCKED;
  • zone->zone_pgdat = &contig_page_data;
  • zone->free_pages = 0;
  • zone->need_balance = 0;
  • Initializes the node_zonelists array of the
    contig_page_data node descriptor

27
Memory initialization
  • mem_init()
  • Initializes the value of num_physpages, the total
    number of page frames present in the system
  • For each page descriptor, sets the count field to
    1
  • Resets the PG_reserved flag
  • Sets the PG_highmem flag if the page belongs to
    ZONE_HIGHMEM
  • Calls free_page() to release the page frame

28
Requesting and releasing
  • alloc_pages(gfp_mask, order)
  • Used to request 2^order contiguous page frames.
    It returns the address of the descriptor of the
    first allocated page frame
  • alloc_page(gfp_mask)
  • Used to get a single page frame. It returns the
    address of the descriptor of the allocated page
    frame
  • __get_free_pages(gfp_mask, order)
  • Similar to alloc_pages(), but it returns the
    linear address of the first allocated page

29
Requesting and releasing
  • __GFP_WAIT
  • The kernel is allowed to block the current
    process waiting for free page frames
  • __GFP_HIGH
  • The kernel is allowed to access the pool of free
    page frames left for recovering from very low
    memory conditions
  • __GFP_IO
  • The kernel is allowed to perform I/O transfers on
    low memory pages in order to free page frames

30
Requesting and releasing
  • __GFP_HIGHIO
  • The kernel is allowed to perform I/O transfers on
    high memory pages in order to free page frames
  • __GFP_DMA
  • The requested page frames must be included in the
    ZONE_DMA zone
  • __GFP_HIGHMEM
  • The requested page frames can be included in the
    ZONE_HIGHMEM zone
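A short usage sketch of the interfaces and flags above (Linux 2.4
style; GFP_KERNEL is the usual combination of __GFP_WAIT, __GFP_IO and
related flags for normal kernel allocations, and error handling is
reduced to a NULL check):

    #include <linux/mm.h>

    static void allocation_example(void)
    {
        struct page *page;
        unsigned long addr;

        /* Request 2^3 = 8 contiguous page frames; the caller may block
         * and the kernel may perform I/O to free memory. */
        page = alloc_pages(GFP_KERNEL, 3);
        if (page)
            __free_pages(page, 3);

        /* Same request, but a linear address is returned instead of a
         * page descriptor. */
        addr = __get_free_pages(GFP_KERNEL, 3);
        if (addr)
            free_pages(addr, 3);
    }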

31
Kernel mappings of High-Memory Page Frames
  • Allocations of high-memory page frames must be
    done only through the alloc_pages() function and
    its alloc_page() shortcut (see the sketch below)
  • Once allocated, a high-memory page frame has to
    be mapped into the fourth gigabyte of the linear
    address space
  • Permanent kernel mappings
  • Temporary kernel mappings
  • Noncontiguous memory allocation
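A sketch of using a permanent kernel mapping to touch a high-memory
page frame with the Linux 2.4 interfaces (simplified; kmap() may
sleep, so it is only usable in process context):

    #include <linux/mm.h>
    #include <linux/highmem.h>

    static void touch_high_page(void)
    {
        struct page *page;
        char *vaddr;

        /* Allow the allocator to return a ZONE_HIGHMEM frame. */
        page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
        if (!page)
            return;

        vaddr = kmap(page);     /* map into the fourth gigabyte  */
        vaddr[0] = 0;           /* the frame is now addressable  */
        kunmap(page);           /* release the permanent mapping */

        __free_page(page);
    }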

32
Buddy system algorithm
  • A suitable technique to solve external
    fragmentation is to avoid as much as possible the
    need to split up a large free block
  • Contiguous page frames are needed because DMA
    ignores the paging circuitry and accesses the
    address bus directly
  • Keeping blocks of page frames contiguous also
    reduces translation lookaside buffer (TLB) misses

33
Buddy System Algorithm
  • All free page frames are grouped into 10 lists of
    blocks that contain groups of 1, 2, 4, 8, 16, 32,
    64, 128, 256, and 512 contiguous page frames
  • The physical address of the first page frame of a
    block is a multiple of the group size

34
Buddy System Algorithm
  • Assume there is a request for a group of 128
    contiguous page frames
  • Checks first to see whether a free block in the
    128-page-frame list exists
  • If there is no such block, the algorithm looks
    for the next larger block: a free block in the
    256-page-frame list
  • If such a block exists, the kernel allocates 128
    of the 256 page frames and inserts the remaining
    128 page frames into the list of free
    128-page-frame blocks

35
Buddy System Algorithm
  • If there is no free 256-page block, the kernel
    looks for the next larger block
  • If such a block exists, it allocates 128 of the
    512 page frames, inserts the first 256 of the
    remaining 384 page frames into the list of free
    256-page-frame blocks, and inserts the last 128
    of the remaining 384 page frames into the list of
    free 128-page-frame blocks (this search-and-split
    loop is sketched below)
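A minimal sketch of this search-and-split loop, using hypothetical
free-list helpers and a 4 KB page size rather than the actual Linux
data structures:

    #include <stddef.h>

    #define MAX_ORDER 10
    #define PAGE_SIZE 4096

    struct block { struct block *next; };        /* free block header    */
    static struct block *free_list[MAX_ORDER];   /* blocks of 2^k frames */

    static struct block *pop(int order)
    {
        struct block *b = free_list[order];
        if (b)
            free_list[order] = b->next;
        return b;
    }

    static void push(int order, struct block *b)
    {
        b->next = free_list[order];
        free_list[order] = b;
    }

    /* Allocate a block of 2^order contiguous page frames. */
    static struct block *buddy_alloc(int order)
    {
        int k;
        for (k = order; k < MAX_ORDER; k++) {
            struct block *b = pop(k);
            if (!b)
                continue;                /* try the next larger list */
            /* Split the block down to the requested size, returning
             * the unused upper halves to the smaller free lists. */
            while (k > order) {
                k--;
                push(k, (struct block *)
                        ((char *)b + ((size_t)1 << k) * PAGE_SIZE));
            }
            return b;
        }
        return NULL;                     /* no block large enough */
    }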

36
Buddy System Algorithm
  • Two blocks are considered buddies if:
  • Both blocks have the same size, say b
  • They are located at contiguous physical addresses
  • The physical address of the first page frame of
    the first block is a multiple of 2 * b * 2^12

37
Buddy System Algorithm
  • Main data structures
  • The mem_map array
  • An array of 10 elements of type free_area_t, one
    element for each group size:
  • typedef struct free_area_struct {
  •     struct list_head free_list;
  •     unsigned long *map;
  • } free_area_t;
  • Ten binary arrays (bitmaps), one for each group
    size

38
Buddy System Algorithm
  • The kth element of the free_area array in the
    zone descriptor is associated with a doubly
    linked circular list of blocks of size 2^k; each
    member of such a list is the descriptor of the
    first page frame of a block
  • The map field points to a bitmap. Each bit of the
    bitmap of the kth entry of the free_area array
    describes the status of two buddy blocks of size
    2^k page frames
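A sketch of the bitmap bookkeeping (simplified from the Linux 2.4
logic; the bitmap layout and the helper are illustrative, and the
exact polarity convention of the bit is left aside):

    /* One bit covers a pair of buddy blocks of order k, so the bit
     * index for the block starting at page frame page_idx is
     * page_idx >> (1 + k). The bit is toggled every time either buddy
     * is allocated or freed; at free time its value tells the kernel
     * whether the buddy is also free and the two blocks can be
     * coalesced into one block of order k + 1. */
    static unsigned long *bitmap[10];       /* bitmap[k] for order k */

    static unsigned long toggle_buddy_bit(unsigned long page_idx, int k)
    {
        unsigned long bit  = page_idx >> (1 + k);
        unsigned long *w   = &bitmap[k][bit / (8 * sizeof(unsigned long))];
        unsigned long mask = 1UL << (bit % (8 * sizeof(unsigned long)));

        *w ^= mask;                         /* flip the pair's bit */
        return *w & mask;                   /* new value of the bit */
    }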

39
Buddy System Algorithm
  • Consider a zone including 128MB of RAM, i.e.,
    32768 single pages, 16384 groups of 2 pages each,
    and so on up to 64 groups of 512 pages each
  • The bitmap of free_area[0] consists of 16384 bits,
    one for each pair of the 32768 existing page
    frames; the bitmap of free_area[1] consists of
    8192 bits, one for each pair of blocks of two
    consecutive page frames

40
Memory Area Management
  • Need a scheme to satisfy requests for small memory
    areas of a few tens or hundreds of bytes
  • Need a scheme to avoid internal fragmentation,
    i.e., a mismatch between the size of the memory
    request and the size of the memory area allocated
    to satisfy the request

41
Slab Allocator
  • Views memory areas as objects consisting of both a
    set of data structures and a pair of functions
    called the constructor and the destructor
  • Linux uses the slab allocator to reduce the
    number of calls to the buddy system allocator
  • The slab allocator does not discard the objects
    but instead saves them in memory.
  • When a new object is requested, it can be taken
    from memory without having to be reinitialized

42
Slab allocator
  • The slab allocator groups objects into caches
  • Each cache is a store of objects of the same type
  • The area of main memory that contains a cache is
    divided into slabs; each slab consists of one or
    more contiguous page frames that contain both
    allocated and free objects
  • The slab allocator never releases the page frames
    of an empty slab on its own

43
Slab Allocator
  [Diagram: a cache is made up of slabs, and each slab contains
  objects]
44
Slab Allocator
  • Each cache is described by a table of type
    struct kmem_cache_s
  • Each slab of a cache has its own descriptor of
    type struct slab_s
  • Caches are divided into two types: general and
    specific

45
Slab Allocator
  • The general caches are:
  • The first cache contains the cache descriptors of
    the remaining caches used by the kernel
  • Twenty-six additional caches contain geometrically
    distributed memory areas. The table, called
    cache_sizes, points to the 26 cache descriptors
    associated with memory areas of size 32, 64, 128,
    256, 512, 1024, 2048, 4096, 8192, 16384, 32768,
    65536, and 131072 bytes (two caches per size: one
    for DMA-capable page frames and one for normal
    page frames)
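The general caches back kmalloc() and kfree(): a request is served
from the smallest cache_sizes entry that fits it. An illustrative use
(Linux 2.4 style):

    #include <linux/slab.h>

    static void general_cache_example(void)
    {
        /* A 100-byte request is satisfied from the 128-byte general
         * cache; kfree() returns the object to the same cache. */
        void *p = kmalloc(100, GFP_KERNEL);
        if (p)
            kfree(p);
    }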

46
Slab Allocator
  • The kmem_cache_init() and kmem_cache_sizes_init()
    functions are invoked during system
    initialization to set up the general caches
  • The kmem_cache_destroy() function destroys a
    cache
  • The kmem_cache_shrink() function destroys all
    slabs in a cache by invoking kmem_slab_destroy()
    iteratively.
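An illustrative specific cache built with the Linux 2.4 slab
interfaces (the object type, the cache name, and the lack of a
constructor and destructor are assumptions of this example):

    #include <linux/slab.h>

    struct my_object {
        int a;
        int b;
    };

    static kmem_cache_t *my_cachep;

    static int my_cache_setup(void)
    {
        my_cachep = kmem_cache_create("my_object_cache",
                                      sizeof(struct my_object),
                                      0,           /* offset (alignment) */
                                      0,           /* flags              */
                                      NULL, NULL); /* ctor, dtor         */
        return my_cachep ? 0 : -1;
    }

    static void my_cache_use(void)
    {
        struct my_object *obj;

        obj = kmem_cache_alloc(my_cachep, SLAB_KERNEL);
        if (obj)
            kmem_cache_free(my_cachep, obj);
    }

    static void my_cache_teardown(void)
    {
        kmem_cache_destroy(my_cachep);
    }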

47
Slab Allocator
  [Diagram: cache descriptors, each heading a linked list of slab
  descriptors]