Title: Multiple Device Driver Linux Software RAID
1Multiple Device Driver (Linux Software RAID)
- Ted Baker, Andy Wang
- CIS 4930 / COP 5641
2The md driver
- Provides virtual devices
- Created from one or more independent underlying devices
- The basic mechanism to support RAIDs
- Redundant arrays of inexpensive disks
3Common RAID levels
- RAID0
- Striping
- RAID1
- Mirroring
- RAID4 (>= 3 disks)
- Striped array with a parity device
- RAID5 (>= 3 disks)
- Striped array with distributed parity
- RAID6 (>= 4 disks)
- Striped array with dual redundancy information
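The capacity cost of each level can be sketched in a few lines. This is a minimal illustration, assuming n identical disks; the helper name `raid_usable_disks` is hypothetical and not part of md, and the two-disk minimum for RAID0 and RAID1 is an assumption made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not part of md: how many disks' worth of capacity
 * is usable in an array of n identical disks at the given RAID level.
 * Returns -1 if the level is not covered or n is below the minimum. */
int raid_usable_disks(int level, int n)
{
    assert(n > 0);
    switch (level) {
    case 0:  return n >= 2 ? n     : -1;  /* striping, no redundancy */
    case 1:  return n >= 2 ? 1     : -1;  /* mirroring, one copy's worth */
    case 4:                                /* dedicated parity device */
    case 5:  return n >= 3 ? n - 1 : -1;  /* distributed parity */
    case 6:  return n >= 4 ? n - 2 : -1;  /* dual redundancy information */
    default: return -1;
    }
}
```

For example, `raid_usable_disks(5, 4)` yields 3, while the same four disks as RAID6 yield 2.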
4Common RAID levels
- RAID10
- Striped array of mirrored disks
- RAID01
- Mirroring two RAID0s
- RAID50
- Striped array of RAID5s
- RAID51
- Mirroring two RAID5s
5md pseudo RAID configurations
- Linear (catenates multiple disks into a single one)
- Multipath
- A set of different interfaces to the same device (e.g., multiple disk controllers)
- Faulty
- A layer over a single device into which errors can be injected
6RAID Creation
- > mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hd[ac]1
- Create /dev/md0 as RAID1
- Consisting of /dev/hda1 and /dev/hdc1
7RAID Status
- To check the status for RAIDs
- See /proc/mdstat
- Personalities : [raid1]
- md0 : active raid1 sda5[0] sdb5[1]
- 979840 blocks [2/2] [UU]
- md1 : active raid1 sda6[2] sdb6[1]
- 159661888 blocks [2/1] [_U]
- [===>.................] recovery = 17.9% (28697920/159661888) finish=56.4min speed=38656K/sec
- unused devices: <none>
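The per-device status field can be checked programmatically. The sketch below assumes the bracketed form used by /proc/mdstat, where `U` marks a working member and `_` a missing or failed one; the function name `mdstat_missing_devices` is made up for this example, and it parses a line already read into memory rather than opening /proc/mdstat itself.

```c
#include <assert.h>
#include <string.h>

/* Count missing members in a status field such as "[_U]".
 * Returns -1 if the line holds no bracketed field made only of 'U'/'_'. */
int mdstat_missing_devices(const char *line)
{
    assert(line != NULL);
    for (const char *p = strchr(line, '['); p; p = strchr(p + 1, '[')) {
        const char *q = p + 1;
        int missing = 0;
        while (*q == 'U' || *q == '_') {
            if (*q == '_')
                missing++;
            q++;
        }
        if (q > p + 1 && *q == ']')   /* non-empty and properly closed */
            return missing;
    }
    return -1;
}
```

A result greater than zero means the array is running degraded, as md1 is in the listing above.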
8md Super Block
- Each device in a RAID may have a superblock with various information
- Level
- UUID
- 128-bit identifier that identifies an array
9Some RAID Concepts
- Personality
- RAID level
- Chunk size
- Power of two, >= 4KB
- A RAID assigns chunks to disks in a round-robin fashion
- Stripe
- The i-th chunks of all disks together form a stripe
- Parity
- A chunk constructed by XORing the other chunks in the stripe
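These concepts translate directly into arithmetic. The sketch below assumes plain round-robin placement without parity rotation (RAID0-style) and uses made-up names (`locate_chunk`, `xor_parity`); it is an illustration, not md's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chunk_loc { unsigned disk; uint64_t stripe; };

/* Round-robin placement: logical chunk c lives on disk c % ndisks,
 * in stripe c / ndisks. */
struct chunk_loc locate_chunk(uint64_t sector, unsigned chunk_sectors,
                              unsigned ndisks)
{
    assert(chunk_sectors > 0 && ndisks > 0);
    uint64_t chunk = sector / chunk_sectors;
    struct chunk_loc loc = { (unsigned)(chunk % ndisks), chunk / ndisks };
    return loc;
}

/* Parity chunk = byte-wise XOR of the given chunks. */
void xor_parity(const uint8_t *const *data, size_t nchunks,
                size_t len, uint8_t *parity)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < nchunks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}
```

Because XOR is its own inverse, calling `xor_parity` on the surviving data chunks plus the parity chunk reconstructs a lost chunk.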
10Synchrony
- An update may involve both the data block and the parity block
- Implications
- A RAID may be shut down in an inconsistent state
- Resynchronization may be required at startup, in the background
- Reduced performance
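A toy model shows why the two writes matter. The stripe below has two one-byte data blocks and one parity block; all names are hypothetical, and the `crash_between` flag simply simulates losing power between the data write and the parity write.

```c
#include <assert.h>
#include <stdint.h>

struct toy_stripe { uint8_t d0, d1, parity; };

int stripe_consistent(const struct toy_stripe *s)
{
    return s->parity == (uint8_t)(s->d0 ^ s->d1);
}

/* Read-modify-write update of d0: two separate device writes. */
void update_d0(struct toy_stripe *s, uint8_t new_d0, int crash_between)
{
    assert(s != 0);
    uint8_t new_parity = s->parity ^ s->d0 ^ new_d0;
    s->d0 = new_d0;             /* step 1: data block */
    if (crash_between)
        return;                 /* parity never reaches the disk */
    s->parity = new_parity;     /* step 2: parity block */
}

/* Resynchronization: recompute parity from the data blocks. */
void resync(struct toy_stripe *s)
{
    s->parity = s->d0 ^ s->d1;
}
```

After an interrupted update, `stripe_consistent` returns 0 until `resync` has run, which is the work a real array has to do in the background after an unclean shutdown.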
11Recovery
- If the md driver detects a write error, it immediately disables that device
- Continues operation on the remaining devices
- Starts recreating the content if there is a spare drive
12Recovery
- If the md driver detects a read error
- Overwrites the bad block with data reconstructed from the other devices
- Reads the block again
- If either step fails, treats it as a write error
- Recovery is a background process
- Can be configured via
- /proc/sys/dev/raid/speed_limit_min
- /proc/sys/dev/raid/speed_limit_max
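The read-error path can be sketched with a toy two-way mirror held in memory. Everything here is hypothetical (`toy_disk`, `mirror_read`, the `rewrite_heals` flag that stands in for a drive remapping a bad sector on write); it only illustrates the order of the steps, not md's code.

```c
#include <assert.h>

#define TOY_BLOCKS 4

struct toy_disk {
    int data[TOY_BLOCKS];
    int unreadable[TOY_BLOCKS]; /* 1 = reads of this block fail */
    int rewrite_heals;          /* 1 = overwriting clears the error */
    int faulty;                 /* 1 = disabled, out of the array */
};

int toy_read(const struct toy_disk *d, int blk, int *out)
{
    if (d->faulty || d->unreadable[blk])
        return -1;
    *out = d->data[blk];
    return 0;
}

void toy_write(struct toy_disk *d, int blk, int val)
{
    d->data[blk] = val;
    if (d->rewrite_heals)
        d->unreadable[blk] = 0;
}

/* Read block blk from a two-way mirror, preferring m[0]. */
int mirror_read(struct toy_disk m[2], int blk, int *out)
{
    assert(blk >= 0 && blk < TOY_BLOCKS);
    if (toy_read(&m[0], blk, out) == 0)
        return 0;
    if (toy_read(&m[1], blk, out) != 0)
        return -1;                    /* no good copy left */
    if (!m[0].faulty) {
        int check;
        toy_write(&m[0], blk, *out);  /* overwrite the bad block */
        if (toy_read(&m[0], blk, &check) != 0)
            m[0].faulty = 1;          /* re-read failed: treat as write error */
    }
    return 0;
}
```

The caller still gets its data in both cases; the difference is whether the first disk stays in the array afterwards.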
13Bitmap Write-Intent Logging
- Records which blocks of the array may be out of sync
- Speeds up resynchronization
- Allows a disk to be temporarily removed and reinserted without causing an enormous recovery cost
- Can spin down disks for power savings
14Bitmap Write-Intent Logging
- Can be stored on a separate device
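The idea behind write-intent logging fits in a small sketch. It assumes a toy array of 64 bitmap chunks with one bit each and hypothetical names (`wi_bitmap`, `wi_start_write`, ...); the real md bitmap is more elaborate, so treat this only as an illustration of the principle.

```c
#include <assert.h>
#include <stdint.h>

#define WI_CHUNKS 64   /* toy array: 64 bitmap chunks, one bit each */

struct wi_bitmap { uint64_t bits; };

/* Set the bit BEFORE the write is issued to the member devices. */
void wi_start_write(struct wi_bitmap *b, unsigned chunk)
{
    assert(chunk < WI_CHUNKS);
    b->bits |= (uint64_t)1 << chunk;
}

/* Clear the bit once every device has completed the write. */
void wi_end_write(struct wi_bitmap *b, unsigned chunk)
{
    assert(chunk < WI_CHUNKS);
    b->bits &= ~((uint64_t)1 << chunk);
}

/* After an unclean shutdown, only chunks whose bit is set need a resync. */
unsigned wi_chunks_to_resync(const struct wi_bitmap *b)
{
    unsigned n = 0;
    for (uint64_t v = b->bits; v; v &= v - 1)   /* clear lowest set bit */
        n++;
    return n;
}
```

With two writes started and one finished at the time of a crash, only one of the 64 chunks has to be resynchronized instead of the whole array.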
15Write-Behind
- Certain devices in the array can be flagged as write-mostly
- md will not wait for writes to write-behind devices to complete before returning to the file system
16Restriping (Reshaping)
- Change the number of disks
- Change the RAID levels
- Not robust against failures
17Faulty.c
- static int __init raid_init(void)
- {
-     return register_md_personality(&faulty_personality);
- }
- static void raid_exit(void)
- {
-     unregister_md_personality(&faulty_personality);
- }
- module_init(raid_init);
- module_exit(raid_exit);
18Faulty.c
- static struct mdk_personality faulty_personality = {
-     .name = "faulty",
-     .level = LEVEL_FAULTY,
-     .owner = THIS_MODULE,
-     .make_request = make_request,
-     .run = run,
-     .stop = stop,
-     .status = status,
-     .reconfig = reconfig,
- };
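The registration pattern can be tried outside the kernel. The sketch below is a user-space stand-in with hypothetical names (`toy_personality`, `toy_register`, `toy_find`) and a placeholder level value; it assumes nothing about md beyond the idea of a table of function pointers that the core looks up and calls through.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_personality {
    const char *name;
    int level;
    int (*run)(void *array);
    int (*stop)(void *array);
};

#define TOY_MAX_PERS 8
static struct toy_personality *toy_pers[TOY_MAX_PERS];

int toy_register(struct toy_personality *p)
{
    assert(p != NULL && p->name != NULL);
    for (size_t i = 0; i < TOY_MAX_PERS; i++) {
        if (toy_pers[i] == NULL) {
            toy_pers[i] = p;
            return 0;
        }
    }
    return -1;   /* table full */
}

void toy_unregister(struct toy_personality *p)
{
    for (size_t i = 0; i < TOY_MAX_PERS; i++)
        if (toy_pers[i] == p)
            toy_pers[i] = NULL;
}

/* The core finds a personality by name and calls through its table. */
struct toy_personality *toy_find(const char *name)
{
    for (size_t i = 0; i < TOY_MAX_PERS; i++)
        if (toy_pers[i] && strcmp(toy_pers[i]->name, name) == 0)
            return toy_pers[i];
    return NULL;
}

static int toy_faulty_run(void *array)  { (void)array; return 0; }
static int toy_faulty_stop(void *array) { (void)array; return 0; }

struct toy_personality toy_faulty = {
    .name = "faulty",
    .level = 100,               /* placeholder, not md's LEVEL_FAULTY */
    .run = toy_faulty_run,
    .stop = toy_faulty_stop,
};
```

Designated initializers keep the table readable and let unset operations default to NULL, which is why the kernel structure above is written the same way.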
19Faulty.c
typedef struct faulty_conf {
    int period[Modes];
    atomic_t counters[Modes];
    sector_t faults[MaxFault];
    int modes[MaxFault];
    int nfaults;
    mdk_rdev_t *rdev;
} conf_t;
- static int run(mddev_t *mddev)
- {
-     mdk_rdev_t *rdev;
-     struct list_head *tmp;
-     int i;
-     conf_t *conf = kmalloc(sizeof(*conf), GFP_KERNEL);
-     ... /* zero out conf */
-     ITERATE_RDEV(mddev, rdev, tmp)
-         conf->rdev = rdev;
-     mddev->array_size = mddev->size;
-     mddev->private = conf;
-     reconfig(mddev, mddev->layout, -1);
-     return 0;
- }
20Faulty.c
- static int reconfig(mddev_t *mddev, int layout, int chunk_size)
- {
-     int mode = layout & ModeMask;
-     int count = layout >> ModeShift;
-     conf_t *conf = mddev->private;
-     ... /* error checks */
-     if (mode == /* clear something */) {
-         /* clear various counters */
-     } else if (mode < Modes) {
-         conf->period[mode] = count;
-         if (!count) count++;
-         atomic_set(&conf->counters[mode], count);
-     } else ...
-     return 0;
- }
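The layout word packs a mode and a count into one integer. The sketch below shows the packing and unpacking; the mask and shift values (0x1f and 5) are recalled from faulty.c of the same kernel era and should be treated as assumptions to verify against the source tree, and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumed values; check them against the ModeMask and ModeShift
 * definitions in the faulty.c you are reading. */
#define TOY_MODE_MASK  0x1f
#define TOY_MODE_SHIFT 5

/* Pack a failure mode (low bits) and a period/count (high bits). */
int layout_encode(int mode, int count)
{
    assert(mode >= 0 && mode <= TOY_MODE_MASK && count >= 0);
    return (count << TOY_MODE_SHIFT) | mode;
}

int layout_mode(int layout)  { return layout & TOY_MODE_MASK; }
int layout_count(int layout) { return layout >> TOY_MODE_SHIFT; }
```

Under these assumptions, mode 2 with a period of 10 is encoded as the layout value 322.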
21Faulty.c
- static int stop(mddev_t *mddev)
- {
-     conf_t *conf = (conf_t *)mddev->private;
-     kfree(conf);
-     mddev->private = NULL;
-     return 0;
- }
22Faulty.c
- static int make_request(request_queue_t *q, struct bio *bio)
- {
-     mddev_t *mddev = q->queuedata;
-     conf_t *conf = (conf_t *)mddev->private;
-     int failit = 0;
-
-     if (bio_data_dir(bio) == WRITE) { /* data direction */
-         ... /* misc cases */
-         if (check_sector(conf, bio->bi_sector, bio->bi_sector + (bio->bi_size >> 9), WRITE))
-             failit = 1; /* if a sector failed before, fail again */
-         if (check_mode(conf, WritePersistent)) {
-             /* if the period is reached for a sector, record the sector and fail it */
-             add_sector(conf, bio->bi_sector, WritePersistent);
-             failit = 1;
-         }
-         ...
23Faulty.c
-     } else { /* failure cases for reads */
-         ...
-     }
-     if (failit) {
-         struct bio *b = bio_clone(bio, GFP_NOIO);
-         b->bi_bdev = conf->rdev->bdev;
-         b->bi_private = bio;
-         b->bi_end_io = faulty_fail;
-         generic_make_request(b);
-         return 0;
-     } else {
-         bio->bi_bdev = conf->rdev->bdev;
-         return 1;
-     }
- }
To the queue of this device, initialized in md.c
from the disk device inode
Let the main block layer submit the IO and
resolve the recursion
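The slides do not show check_mode(), so the following is only a guess at the periodic behavior implied by the period and counters fields and by reconfig(): a counter is armed with the period, counts down once per request, and fires when it reaches zero. All names are hypothetical and plain integers stand in for the kernel's atomic_t.

```c
#include <assert.h>

struct toy_fault { int period; int counter; };

/* Mirrors the reconfig() step: remember the period and arm the counter. */
void toy_arm(struct toy_fault *f, int count)
{
    assert(count >= 0);
    f->period = count;
    f->counter = count ? count : 1;   /* a zero count still fires once */
}

/* Returns 1 when the current request should be failed. */
int toy_check(struct toy_fault *f)
{
    if (f->period == 0 && f->counter <= 0)
        return 0;                     /* disarmed */
    if (--f->counter == 0) {
        if (f->period)
            f->counter = f->period;   /* periodic: re-arm */
        return 1;
    }
    return 0;
}
```

With a period of 3, every third request is failed; with a count of 0, exactly one request is failed and the mode then stays quiet.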
24ll_rw_blk.c
- A file system eventually calls generic_make_request()
- void generic_make_request(struct bio *bio)
- {
-     ...
-     do {
-         q = bdev_get_queue(bio->bi_bdev);
-         ... /* check errors */
-         ret = q->make_request_fn(q, bio);
-     } while (ret);
- }
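The loop replaces recursion through stacked devices with iteration: a stacking driver redirects the bio and returns nonzero, and the loop then looks up the next device. The user-space sketch below imitates that contract with hypothetical types (`toy_dev`, `toy_bio`); it is a simplification and not the block layer's code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio;
struct toy_dev {
    /* returns 0 when the bio has been handled, nonzero to resubmit */
    int (*make_request)(struct toy_dev *dev, struct toy_bio *bio);
    struct toy_dev *lower;     /* device below this one, if stacked */
    unsigned long offset;      /* sector offset applied when remapping */
    unsigned long last_sector; /* last sector a real disk was asked for */
};
struct toy_bio { struct toy_dev *dev; unsigned long sector; };

/* Stacking driver: redirect the bio and ask the caller to loop. */
int remap_request(struct toy_dev *dev, struct toy_bio *bio)
{
    bio->dev = dev->lower;
    bio->sector += dev->offset;
    return 1;
}

/* Real disk: "perform" the I/O by recording the sector. */
int disk_request(struct toy_dev *dev, struct toy_bio *bio)
{
    dev->last_sector = bio->sector;
    return 0;
}

/* Same shape as the loop above; returns the number of devices visited. */
int toy_submit(struct toy_bio *bio)
{
    int hops = 0, ret;
    assert(bio != NULL && bio->dev != NULL);
    do {
        struct toy_dev *dev = bio->dev;
        ret = dev->make_request(dev, bio);
        hops++;
    } while (ret);
    return hops;
}
```

However deep the stack of virtual devices is, the call depth stays constant because each layer only rewrites the bio and returns.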
25Faulty.c
- static int faulty_fail(struct bio *bio, unsigned int bytes_done, int error)
- {
-     struct bio *b = bio->bi_private;
-     b->bi_size = bio->bi_size;
-     b->bi_sector = bio->bi_sector;
-     if (bio->bi_size == 0)
-         bio_put(bio);
-     clear_bit(BIO_UPTODATE, &b->bi_flags);
-     return (b->bi_end_io)(b, bytes_done, -EIO);
- }
26Linear.c
- static int __init linear_init(void)
- {
-     return register_md_personality(&linear_personality);
- }
- static void linear_exit(void)
- {
-     unregister_md_personality(&linear_personality);
- }
- module_init(linear_init);
- module_exit(linear_exit);
27Linear.c
- static struct mdk_personality linear_personality = {
-     .name = "linear",
-     .level = LEVEL_LINEAR,
-     .owner = THIS_MODULE,
-     .make_request = linear_make_request,
-     .run = linear_run,
-     .stop = linear_stop,
-     .status = linear_status, /* for proc */
-     .hot_add_disk = linear_add,
- };
28Linear.c
- static int linear_run(mddev_t *mddev)
- {
-     linear_conf_t *conf;
-
-     /* initialize */
-     conf = linear_conf(mddev, mddev->raid_disks);
-     if (!conf) return 1;
-     mddev->private = conf;
-     mddev->array_size = conf->array_size; /* in 1 KB blocks */
-     ...
typedef struct linear_private_data {
    struct linear_private_data *prev;
    dev_info_t **hash_table; /* to track disk boundaries */
    sector_t hash_spacing;
    sector_t array_size;
    int preshift;
    dev_info_t disks[0];
} linear_conf_t;
29Linear.c
-     ...
-     /* determines whether two bios can be merged */
-     /* overrides the default merge_bvec function */
-     blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
-     /* queues are first plugged to build up the queue length, then unplugged to release requests to devices */
-     mddev->queue->unplug_fn = linear_unplug;
-     mddev->queue->issue_flush_fn = linear_issue_flush;
-     /* disable prefetching when the device is congested */
-     mddev->queue->backing_dev_info.congested_fn = linear_congested;
-     mddev->queue->backing_dev_info.congested_data = mddev;
-     return 0;
- }
30Linear.c
- static int linear_stop(mddev_t *mddev)
- {
-     linear_conf_t *conf = mddev_to_conf(mddev);
-     /* the unplug fn references 'conf' */
-     blk_sync_queue(mddev->queue);
-     do {
-         linear_conf_t *t = conf->prev;
-         kfree(conf->hash_table);
-         kfree(conf);
-         conf = t;
-     } while (conf);
-     return 0;
- }
31Linear.c
- static int linear_make_request(request_queue_t *q, struct bio *bio)
- {
-     const int rw = bio_data_dir(bio);
-     mddev_t *mddev = q->queuedata;
-     dev_info_t *tmp_dev;
-     sector_t block;
-
-     ... /* check for errors and update statistics */
-     tmp_dev = which_dev(mddev, bio->bi_sector);
-     block = bio->bi_sector >> 1;
-
-     ... /* more error checks */
-
32Linear.c
-     if (unlikely(bio->bi_sector + (bio->bi_size >> 9) > (tmp_dev->offset + tmp_dev->size) << 1)) {
-         /* This bio crosses a device boundary, so we have to split it. */
-         struct bio_pair *bp;
-         bp = bio_split(bio, bio_split_pool, ((tmp_dev->offset + tmp_dev->size) << 1) - bio->bi_sector);
-         if (linear_make_request(q, &bp->bio1)) /* recursion!? */
-             generic_make_request(&bp->bio1);
-         if (linear_make_request(q, &bp->bio2)) /* recursion! */
-             generic_make_request(&bp->bio2);
-         bio_pair_release(bp); /* remove bio hazard */
-         return 0;
-     }
-
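The split point is plain arithmetic on sector numbers. The sketch below assumes, as the shifts in the code suggest, that device offset and size are kept in 1 KB units (two 512-byte sectors each); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* First sector past a member device, from its offset and size in KB. */
uint64_t dev_end_sector(uint64_t offset_kb, uint64_t size_kb)
{
    return (offset_kb + size_kb) << 1;   /* 1 KB = two 512-byte sectors */
}

/* For a request [start, start + len) in sectors, return how many sectors
 * belong to the first half of the split; the rest goes to the next
 * device. Returns len if the request does not cross the boundary. */
uint64_t first_part_sectors(uint64_t start, uint64_t len, uint64_t dev_end)
{
    assert(start < dev_end);
    if (start + len > dev_end)
        return dev_end - start;   /* crosses the boundary: split here */
    return len;
}
```

A 16-sector request starting at sector 1990 on a device that ends at sector 2000 is split into 10 sectors for this device and 6 for the next.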
33Linear.c
Points to the specific device
-     bio->bi_bdev = tmp_dev->rdev->bdev;
-     bio->bi_sector = bio->bi_sector - (tmp_dev->offset << 1) + tmp_dev->rdev->data_offset;
-     return 1;
- }
Translates the virtual sector number to the
physical sector number for the specific device
Again, let the main block layer submit the IO and
resolve the recursion
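The lookup and the translation can be tried in user space. This sketch assumes member offsets and sizes in 1 KB units and a per-member `data_offset` in sectors, as the code above suggests; a linear scan stands in for which_dev(), which uses a hash table in the real driver, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_member {
    uint64_t offset_kb;    /* where this member starts in the array, in KB */
    uint64_t size_kb;      /* usable size of the member, in KB */
    uint64_t data_offset;  /* sectors reserved at the start of the member */
};

/* Find the member that holds the given array sector, or NULL. */
const struct toy_member *toy_which_dev(const struct toy_member *m, size_t n,
                                       uint64_t sector)
{
    uint64_t block = sector >> 1;   /* sectors to 1 KB blocks */
    for (size_t i = 0; i < n; i++)
        if (block >= m[i].offset_kb && block < m[i].offset_kb + m[i].size_kb)
            return &m[i];
    return NULL;
}

/* Virtual sector on the array to physical sector on the member. */
uint64_t toy_translate(const struct toy_member *d, uint64_t sector)
{
    assert(d != NULL && sector >= (d->offset_kb << 1));
    return sector - (d->offset_kb << 1) + d->data_offset;
}
```

With members of 1000 KB and 500 KB and a data offset of 8 sectors each, array sector 2010 falls on the second member at physical sector 18.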