Title: ASM without HW RAID
1. Implementing ASM Without HW RAID: A User's Experience
Luca Canali, CERN
Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008
2. Outline
- Introduction to ASM
- Disk groups, fail groups, normal redundancy
- Scalability and Performance of the solution
- Possible pitfalls, sharing experiences
- Implementation details, monitoring, and tools to ease ASM deployment
3. Architecture and main concepts
- Why ASM?
- Provides the functionality of a volume manager and a cluster file system
- Raw access to storage for performance
- Why ASM-provided mirroring?
- Allows the use of lower-cost storage arrays
- Allows mirroring across storage arrays (arrays are not single points of failure)
- Array (HW) maintenance can be done in a rolling way
- Stretch clusters
4. ASM and cluster DB architecture
- Oracle architecture of redundant low-cost components
5. Files, extents, and failure groups
[Diagram: files and extent pointers; failgroups and ASM mirroring]
6. ASM disk groups
- Example HW: 4 disk arrays with 8 disks each
- An ASM diskgroup is created using all available disks (see the example below)
- The end result is similar to a file system on RAID 10
- ASM allows mirroring across storage arrays
- Oracle RDBMS processes access the storage directly (raw disk access)
[Diagram: an ASM diskgroup mirroring across Failgroup1 and Failgroup2, with striping inside each failgroup]
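As an illustration of this layout, here is a minimal sketch of creating a normal-redundancy diskgroup that mirrors across two arrays; the diskgroup name and disk paths are hypothetical, modeled on the multipath aliases shown later in this talk:

    -- Hypothetical example: one failgroup per storage array, so the two
    -- mirror copies of each extent land on different arrays
    -- (normal redundancy = 2-way mirroring).
    CREATE DISKGROUP data_dg1 NORMAL REDUNDANCY
      FAILGROUP rstor901 DISK '/dev/mpath/rstor901_*p1'
      FAILGROUP rstor902 DISK '/dev/mpath/rstor902_*p1';

ASM stripes files across all disks of the diskgroup and keeps the mirror copies of each extent in different failgroups, which is what makes the result comparable to RAID 10 across arrays.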
7. Performance and scalability
- ASM with normal redundancy
- Stress tested for CERN's use cases
- Scales and performs well
8. Case study: the largest cluster I have ever installed, RAC5
9. Multipathed Fibre Channel
- 8 FC switches, 4 Gbps (10 Gbps uplink)
10. Many spindles
- 26 storage arrays (16 SATA disks each)
11. Case study: I/O metrics for the RAC5 cluster
- Measured, sequential I/O
- Read: 6 GB/sec
- Read-Write: 33 GB/sec
- Measured, small random I/O
- Read: 40K IOPS (8 KB read ops)
- Note
- 410 SATA disks, 26 HBAs on the storage arrays
- Servers: 14 x 44Gbps HBAs, 112 cores, 224 GB of RAM
12. How the test was run
- A custom SQL-based DB workload (see the sketch below)
- IOPS: randomly probe a large table (several TB) via several parallel query slaves (each reads a single block at a time)
- MBPS: read a large (several TB) table with parallel query
- The test table used for the RAC5 cluster was 5 TB in size, created inside a disk group of 70 TB
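The workload itself is not published in the slides; the following is a minimal sketch of the two access patterns described above, assuming a pre-populated test table TEST_BIG with a numeric primary key ID and a PAYLOAD column (both hypothetical):

    -- MBPS test: full scan of the large table with parallel query,
    -- driving large sequential reads from all RAC nodes.
    SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*)
    FROM   test_big t;

    -- IOPS test: each session repeatedly fetches one row by primary key,
    -- i.e. one random single-block read per probe; run from many
    -- concurrent sessions to reach the aggregate IOPS figure.
    DECLARE
      l_payload test_big.payload%TYPE;
    BEGIN
      FOR i IN 1 .. 100000 LOOP
        BEGIN
          SELECT payload INTO l_payload
          FROM   test_big
          WHERE  id = TRUNC(DBMS_RANDOM.VALUE(1, 1000000000));
        EXCEPTION
          WHEN NO_DATA_FOUND THEN NULL;  -- gaps in the key range are ignored
        END;
      END LOOP;
    END;
    /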
13. Possible pitfalls
- Production stories
- Sharing experiences
- 3 years in production, 550 TB of raw capacity
14. Rebalancing speed
- Rebalancing is performed (and mandatory) after space management operations
- Typically after HW failures (to restore the mirror)
- Goal: balanced space allocation across disks
- Not based on performance or utilization
- ASM instances are in charge of rebalancing
- Scalability of rebalancing operations?
- In 10g serialization wait events can limit scalability
- Even at maximum speed, rebalancing is not always I/O bound (the rebalance power commands are sketched below)
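The rebalance speed is controlled by the rebalance power; a minimal sketch of the relevant commands, with hypothetical diskgroup and disk names:

    -- Drop a failed disk and rebalance at maximum speed
    -- (power ranges from 0 to 11 in 10g/11.1).
    ALTER DISKGROUP data_dg1 DROP DISK data_dg1_0006 REBALANCE POWER 11;

    -- Change the power of an ongoing rebalance.
    ALTER DISKGROUP data_dg1 REBALANCE POWER 8;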
15. Rebalancing, an example
16. VLDB and rebalancing
- Rebalancing operations can move more data than expected
- Example
- 5 TB allocated, 100 disks of 200 GB each
- A disk is replaced (diskgroup rebalance)
- The total I/O workload is 1.6 TB (8x the disk size!)
- How to see this: query v$asm_operation, the EST_WORK column keeps growing during the rebalance (see the query below)
- The issue: excessive repartnering
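A minimal sketch of monitoring a running rebalance from the ASM instance:

    -- Progress of ongoing ASM operations; EST_WORK (estimated allocation
    -- units to move) can keep growing while the rebalance proceeds.
    SELECT group_number, operation, state, power,
           sofar, est_work, est_rate, est_minutes
    FROM   v$asm_operation;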
17. Rebalancing issues wrap-up
- Rebalancing can be slow
- Many hours for very large disk groups
- Associated risk
- A 2nd disk failure while rebalancing
- Worst case: loss of the diskgroup if partner disks fail
18. Fast Mirror Resync
- ASM 10g with normal redundancy does not allow part of the storage to be taken offline
- A transient error in a storage array can cause several hours of rebalancing to drop and re-add the disks
- It is a limiting factor for scheduled maintenance
- 11g has a new feature: fast mirror resync
- A great feature for rolling interventions on HW (see the sketch below)
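A minimal sketch of how fast mirror resync can be used in 11g; the diskgroup and failgroup names are hypothetical, and the feature requires the diskgroup attributes compatible.asm and compatible.rdbms to be at least 11.1:

    -- Allow offline disks to stay offline up to 12 hours before ASM drops them.
    ALTER DISKGROUP data_dg1 SET ATTRIBUTE 'disk_repair_time' = '12h';

    -- Take a whole storage array (one failgroup) offline for maintenance...
    ALTER DISKGROUP data_dg1 OFFLINE DISKS IN FAILGROUP rstor901;

    -- ...and bring it back: only the extents changed in the meantime are resynced.
    ALTER DISKGROUP data_dg1 ONLINE DISKS IN FAILGROUP rstor901;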
19. ASM and filesystem utilities
- Only a few tools can access ASM
- asmcmd, dbms_file_transfer, XDB, FTP
- Limited operations (no copy, rename, etc.)
- They require open DB instances
- File operations are difficult in 10g (see the dbms_file_transfer sketch below)
- 11g asmcmd has the copy command
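As one workaround in 10g, a file can be copied out of ASM with dbms_file_transfer; a minimal sketch with hypothetical directory paths and file names:

    -- Directory objects pointing inside ASM and to a filesystem staging area.
    CREATE DIRECTORY asm_src AS '+DATA_DG1/ORCL/DATAFILE';
    CREATE DIRECTORY fs_dst  AS '/tmp/stage';

    -- Copy one file out of ASM (e.g. a datafile of a read-only or
    -- offline tablespace); file and directory names are illustrative.
    BEGIN
      DBMS_FILE_TRANSFER.COPY_FILE(
        source_directory_object      => 'ASM_SRC',
        source_file_name             => 'users.272.657641887',
        destination_directory_object => 'FS_DST',
        destination_file_name        => 'users01.dbf');
    END;
    /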
20. ASM and corruption
- ASM metadata corruption
- Can be caused by bugs
- One case in production after a disk eviction
- Physical data corruption
- ASM automatically fixes most corruption on the primary extent
- Typically when doing a full backup
- Secondary extent corruption goes undetected until a disk failure or rebalance exposes it
21. Disaster recovery
- Corruption issues were fixed by using a physical standby to move to fresh storage
- For HA, our experience is that disaster recovery is needed
- Standby DB
- On-disk (flash) copy of the DB
22. Implementation details
23. Storage deployment
- Current storage deployment for Physics Databases at CERN
- SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16)
- Linux x86_64, no ASMLib, device mapper instead (naming persistence, HA)
- Over 150 FC storage arrays (production, integration and test) and 2000 LUNs exposed
- Biggest DB over 7 TB (more to come when the LHC starts, estimated growth up to 11 TB/year)
24. Storage deployment
- ASM implementation details
- Storage in JBOD configuration (1 disk -> 1 LUN)
- Each disk is partitioned at the OS level
- 1st partition: 45% of the disk size, the faster part of the disk (short stroke, outer sectors)
- 2nd partition: the rest, the slower part of the disk (full stroke, inner sectors)
25. Storage deployment
- Two diskgroups are created for each cluster (see the example below)
- DATA: data files and online redo logs, on the outer part of the disks
- RECO: flash recovery area destination (archived redo logs and on-disk backups), on the inner part of the disks
- One failgroup per storage array
[Diagram: DATA_DG1 and RECO_DG1 diskgroups spanning Failgroup1 to Failgroup4]
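A minimal sketch of this layout for one cluster, assuming four arrays whose partitions are exposed through device mapper aliases of the form shown later (rstor401_1p1, rstor401_1p2, ...); all names and paths are illustrative:

    -- DATA diskgroup on the first (outer, faster) partitions,
    -- one failgroup per storage array.
    CREATE DISKGROUP data_dg1 NORMAL REDUNDANCY
      FAILGROUP rstor401 DISK '/dev/mpath/rstor401_*p1'
      FAILGROUP rstor402 DISK '/dev/mpath/rstor402_*p1'
      FAILGROUP rstor403 DISK '/dev/mpath/rstor403_*p1'
      FAILGROUP rstor404 DISK '/dev/mpath/rstor404_*p1';

    -- RECO diskgroup on the second (inner, slower) partitions of the same disks.
    CREATE DISKGROUP reco_dg1 NORMAL REDUNDANCY
      FAILGROUP rstor401 DISK '/dev/mpath/rstor401_*p2'
      FAILGROUP rstor402 DISK '/dev/mpath/rstor402_*p2'
      FAILGROUP rstor403 DISK '/dev/mpath/rstor403_*p2'
      FAILGROUP rstor404 DISK '/dev/mpath/rstor404_*p2';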
26. Storage management
- SAN configuration in JBOD mode: many steps, can be time-consuming
- Storage level
- logical disks
- LUNs
- mappings
- FC infrastructure: zoning
- OS: creating the device mapper configuration
- multipath.conf: name persistency, HA
27. Storage management
- Storage manageability
- DBAs set up the initial configuration
- ASM: extra maintenance in case of storage maintenance (disk failure)
- Problems
- How to quickly set up the SAN configuration
- How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk (see the query below)
- Example: SCSI 1013, 2013 -> /dev/sdn, /dev/sdax -> /dev/mpath/rstor901_3 -> ASM disk TEST1_DATADG1_0016
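The ASM end of this mapping can be retrieved from the ASM instance; a minimal sketch:

    -- Map ASM disk names to OS paths and failgroups (run on the ASM instance).
    SELECT g.name AS diskgroup,
           d.name AS asm_disk,
           d.failgroup,
           d.path
    FROM   v$asm_disk d
    JOIN   v$asm_diskgroup g ON g.group_number = d.group_number
    ORDER  BY g.name, d.failgroup, d.name;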
28. Storage management
- Solution
- Configuration DB: a repository of FC switches, port allocations and all SCSI identifiers for all nodes and storages
- Big initial effort
- Easy to maintain
- High ROI
- Custom tools
- Tools to identify
- SCSI (block) devices <-> device mapper device <-> physical storage and FC port
- device mapper device <-> ASM disk
- Automatic generation of the device mapper configuration
29. Storage management
- lssdisks.py (custom-made script)

The following storages are connected:
Host interface 1
  Target ID 100 - WWPN 210000D0230BE0B5 - Storage rstor316, Port 0
  Target ID 101 - WWPN 210000D0231C3F8D - Storage rstor317, Port 0
  Target ID 102 - WWPN 210000D0232BE081 - Storage rstor318, Port 0
  Target ID 103 - WWPN 210000D0233C4000 - Storage rstor319, Port 0
  Target ID 104 - WWPN 210000D0234C3F68 - Storage rstor320, Port 0
Host interface 2
  Target ID 200 - WWPN 220000D0230BE0B5 - Storage rstor316, Port 1
  Target ID 201 - WWPN 220000D0231C3F8D - Storage rstor317, Port 1
  Target ID 202 - WWPN 220000D0232BE081 - Storage rstor318, Port 1
  Target ID 203 - WWPN 220000D0233C4000 - Storage rstor319, Port 1
  Target ID 204 - WWPN 220000D0234C3F68 - Storage rstor320, Port 1

SCSI Id   Block DEV   MPath name   MP status   Storage   Port
-------   ---------   ----------   ---------   -------   ----
0000      /dev/sda    -            -           -         -
30. Storage management
- listdisks.py (custom-made script)

DISK           NAME               GROUP_NAME    FG        H_STATUS  MODE    MOUNT_S  STATE   TOTAL_GB  USED_GB
------------   -----------------  ------------  --------  --------  ------  -------  ------  --------  -------
rstor401_1p1   RAC9_DATADG1_0006  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_1p2   RAC9_RECODG1_0000  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     119.9      1.7
rstor401_2p1   --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     111.8    111.8
rstor401_2p2   --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     120.9    120.9
rstor401_3p1   RAC9_DATADG1_0007  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_3p2   RAC9_RECODG1_0005  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_4p1   RAC9_DATADG1_0002  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_4p2   RAC9_RECODG1_0002  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_5p1   RAC9_DATADG1_0001  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_5p2   RAC9_RECODG1_0006  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_6p1   RAC9_DATADG1_0005  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_6p2   RAC9_RECODG1_0007  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_7p1   RAC9_DATADG1_0000  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_7p2   RAC9_RECODG1_0001  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
31. Storage management
- gen_multipath.py (custom-made script)
- multipath default configuration for PDB
- device mapper alias: naming persistency and multipathing (HA)
- SCSI 1013, 2013 -> /dev/sdn, /dev/sdax -> /dev/mpath/rstor916_1

defaults {
    udev_dir          /dev
    polling_interval  10
    selector          "round-robin 0"
    . . .
}
. . .
multipaths {
    multipath {
        wwid   3600d0230006c26660be0b5080a407e00
        alias  rstor916_CRS
    }
    multipath {
        wwid   3600d0230006c26660be0b5080a407e01
        alias  rstor916_1
    }
    . . .
}
32. Storage monitoring
- ASM-based mirroring means that Oracle DBAs need to be alerted of disk failures and evictions
- Dashboard with a global overview: custom solution (RACMon)
- ASM-level monitoring
- Oracle Enterprise Manager Grid Control
- RACMon: alerts on missing disks and failgroups, plus a dashboard (see the query sketch below)
- Storage-level monitoring
- RACMon: LUN health and storage configuration details on the dashboard
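A minimal sketch of the kind of ASM-level check such monitoring can be based on (the exact checks used by RACMon are not shown in the slides):

    -- Disks belonging to a diskgroup that are not online (candidates excluded).
    SELECT group_number, name, failgroup, path, mode_status, header_status
    FROM   v$asm_disk
    WHERE  group_number <> 0
    AND    mode_status <> 'ONLINE';

    -- Number of disks per failgroup, to spot missing disks after an eviction.
    SELECT group_number, failgroup, COUNT(*) AS disks
    FROM   v$asm_disk
    WHERE  group_number <> 0
    GROUP  BY group_number, failgroup
    ORDER  BY group_number, failgroup;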
33. Storage monitoring
- ASM instance level monitoring
- Storage level monitoring
[Screenshots: dashboard alerts such as "new failing disk on RSTOR614" and "new disk installed on RSTOR903, slot 2"]
34. Conclusions
- Oracle ASM diskgroups with normal redundancy
- Used at CERN instead of HW RAID
- Performance and scalability are very good
- Allows the use of low-cost HW
- Requires more admin effort from the DBAs than high-end storage
- 11g has important improvements
- Custom tools ease administration
35. Q&A
- Thank you
- Links
- http://cern.ch/phydb
- http://www.cern.ch/canali