Title: A Closer Look inside Oracle ASM
1. A Closer Look inside Oracle ASM
- UKOUG Conference 2007
- Luca Canali, CERN IT
2. Outline
- Oracle ASM for DBAs
- Introduction and motivations
- ASM is not a black box
- Investigation of ASM internals
- Focus on practical methods and troubleshooting
- ASM and VLDB
- Metadata, rebalancing and performance
- Lessons learned from CERN's production DB services
3. ASM
- Oracle Automatic Storage Management
- Provides the functionality of a volume manager and filesystem for Oracle (DB) files
- Works with RAC
- Oracle 10g feature aimed at simplifying storage management
  - Together with Oracle Managed Files and the Flash Recovery Area
- An implementation of the S.A.M.E. (stripe and mirror everything) methodology
  - Goal of increasing performance and reducing cost
4. ASM for a Clustered Architecture
- Oracle architecture of redundant low-cost components
5. ASM Disk Groups
- Example HW: 4 disk arrays with 8 disks each
- An ASM diskgroup is created using all available disks
- The end result is similar to a file system on RAID 10
- ASM allows mirroring across storage arrays
- Oracle RDBMS processes directly access the storage
  - RAW disk access
- [Diagram: an ASM diskgroup striped within Failgroup1 and Failgroup2, with mirroring across the two failgroups]
6. Files, Extents, and Failure Groups
- [Diagram: files and extent pointers; failgroups and ASM mirroring]
7. ASM Is Not a Black Box
- ASM is implemented as an Oracle instance
- Familiar operations for the DBA
  - Configured with SQL commands
  - Info in V$ views
  - Logs in udump and bdump
- Some secret details hidden in X$ tables and underscore parameters
8. Selected V$ Views and X$ Tables

View Name        X$ Table           Description
V$ASM_DISKGROUP  X$KFGRP            performs disk discovery and lists diskgroups
V$ASM_DISK       X$KFDSK, X$KFKID   performs disk discovery, lists disks and their usage metrics
V$ASM_FILE       X$KFFIL            lists ASM files, including metadata
V$ASM_ALIAS      X$KFALS            lists ASM aliases, files and directories
V$ASM_TEMPLATE   X$KFTMTA           ASM templates and their properties
V$ASM_CLIENT     X$KFNCL            lists DB instances connected to ASM
V$ASM_OPERATION  X$KFGMG            lists current rebalancing operations
N.A.             X$KFKLIB           available libraries, includes asmlib
N.A.             X$KFDPARTNER       lists disk-to-partner relationships
N.A.             X$KFFXP            extent map table for all ASM files
N.A.             X$KFDAT            allocation table for all ASM disks
9. ASM Parameters
- Notable ASM instance parameters
  - *.asm_diskgroups='TEST1_DATADG1','TEST1_RECODG1'
  - *.asm_diskstring='/dev/mpath/itstor*p*'
  - *.asm_power_limit=5
  - *.shared_pool_size=70M
  - *.db_cache_size=50M
  - *.large_pool_size=50M
  - *.processes=100
10. More ASM Parameters
- Underscore parameters
  - Several undocumented parameters
  - Typically don't need tuning
  - Exceptions: _asm_ausize and _asm_stripesize
    - May need tuning for VLDB in 10g
- New in 11g: diskgroup attributes
  - V$ASM_ATTRIBUTE, most notable:
    - disk_repair_time
    - au_size
  - X$KFENV shows underscore attributes
11. ASM Storage Internals
- ASM disks are divided into Allocation Units (AU)
  - Default size 1 MB (_asm_ausize)
  - Tunable diskgroup attribute in 11g
- ASM files are built as a series of extents
  - Extents are mapped to AUs using a file extent map
- When using normal redundancy, 2 mirrored extents are allocated, each on a different failgroup
- RDBMS read operations access only the primary extent of a mirrored couple (unless there is an IO error)
- In 10g the ASM extent size = AU size
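The mapping above can be sketched in a few lines of Python. This is an illustration, not Oracle code: the extent map mimics X$KFFXP rows, extent 0 reuses the spfile values from Example 1 later in the talk, and extent 1 is invented.

```python
# A sketch of how ASM resolves a file byte offset, assuming 10g
# defaults: 1 MB allocation units and extent size == AU size.
# extent -> [(disk, au), ...]; index 0 = primary copy, index 1 = mirror
# (the LXN_KFFXP column in X$KFFXP).
AU_SIZE = 1024 * 1024  # bytes, the _asm_ausize default

extent_map = {
    0: [(16, 17528), (4, 14838)],  # spfile values from Example 1
    1: [(2, 901), (23, 77)],       # invented for illustration
}

def locate(offset):
    """Return (disk, au, offset_in_au) of the primary copy for a file offset."""
    extent = offset // AU_SIZE        # in 10g one extent == one AU
    disk, au = extent_map[extent][0]  # reads normally hit only the primary extent
    return disk, au, offset % AU_SIZE

print(locate(512 * 1024))  # offset 512 KB falls in extent 0
```

A mirrored write would update both entries of the list; a read retries index 1 only on an IO error, matching the slide.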
12. ASM Metadata Walkthrough
- Three examples follow of how to read data directly from ASM
- Motivations
  - Build confidence in the technology, i.e. get a feeling of how ASM works
  - It may turn out useful one day to troubleshoot a production issue
13. Example 1: Direct File Access 1/2
- Goal: reading ASM files with OS tools, using metadata information from X$ tables
- Example: find the 2 mirrored extents of the RDBMS spfile

sys@+ASM1> select GROUP_KFFXP "Group", DISK_KFFXP "Disk", AU_KFFXP "AU", XNUM_KFFXP "Extent"
           from X$KFFXP
           where number_kffxp=(select file_number from v$asm_alias where name='spfiletest1.ora');

     Group       Disk         AU     Extent
---------- ---------- ---------- ----------
         1         16      17528          0
         1          4      14838          0
14. Example 1: Direct File Access 2/2
- Find the disk path

sys@+ASM1> select disk_number, path from v$asm_disk
           where group_number=1 and disk_number in (16,4);

DISK_NUMBER PATH
----------- ------------------------------------
          4 /dev/mpath/itstor417_1p1
         16 /dev/mpath/itstor419_6p1

- Read data from disk using dd

dd if=/dev/mpath/itstor419_6p1 bs=1024k count=1 skip=17528 | strings
15. X$KFFXP

Column Name      Description
NUMBER_KFFXP     ASM file number. Join with v$asm_file and v$asm_alias
COMPOUND_KFFXP   File identifier. Join with compound_index in v$asm_file
INCARN_KFFXP     File incarnation id. Join with incarnation in v$asm_file
XNUM_KFFXP       ASM file extent number (mirrored extent pairs have the same extent value)
PXN_KFFXP        Progressive file extent number
GROUP_KFFXP      ASM disk group number. Join with v$asm_disk and v$asm_diskgroup
DISK_KFFXP       ASM disk number. Join with v$asm_disk
AU_KFFXP         Relative position of the allocation unit from the beginning of the disk
LXN_KFFXP        0 -> primary extent, 1 -> mirror extent, 2 -> 2nd mirror copy (high redundancy and metadata)
16. Example 2: A Different Way
- A different metadata table to reach the same goal of reading ASM files directly from the OS

sys@+ASM1> select GROUP_KFDAT "Group", NUMBER_KFDAT "Disk", AUNUM_KFDAT "AU"
           from X$KFDAT
           where fnum_kfdat=(select file_number from v$asm_alias where name='spfiletest1.ora');

     Group       Disk         AU
---------- ---------- ----------
         1          4      14838
         1         16      17528
17. X$KFDAT

Column Name (subset)   Description
GROUP_KFDAT            Diskgroup number, join with v$asm_diskgroup
NUMBER_KFDAT           Disk number, join with v$asm_disk
COMPOUND_KFDAT         Disk compound_index, join with v$asm_disk
AUNUM_KFDAT            Disk allocation unit (relative position from the beginning of the disk), join with x$kffxp.au_kffxp
V_KFDAT                Flag: V = this allocation unit is used, F = AU is free
FNUM_KFDAT             File number, join with v$asm_file
XNUM_KFDAT             Progressive file extent number, join with x$kffxp.pxn_kffxp
18. Example 3: Yet Another Way
- Using the internal package dbms_diskgroup

declare
  fileType varchar2(50); fileName varchar2(50);
  fileSz number; blkSz number; hdl number; plkSz number;
  data_buf raw(4096);
begin
  fileName := '+TEST1_DATADG1/TEST1/spfiletest1.ora';
  dbms_diskgroup.getfileattr(fileName, fileType, fileSz, blkSz);
  dbms_diskgroup.open(fileName, 'r', fileType, blkSz, hdl, plkSz, fileSz);
  dbms_diskgroup.read(hdl, 1, blkSz, data_buf);
  dbms_output.put_line(data_buf);
end;
/
19. DBMS_DISKGROUP
- Can be used to read/write ASM files directly
- It's an Oracle internal package
- Does not require an RDBMS instance
- 11g's asmcmd cp command uses dbms_diskgroup

Procedure Name               Parameters
dbms_diskgroup.open          (fileName, openMode, fileType, blkSz, hdl, plkSz, fileSz)
dbms_diskgroup.read          (hdl, offset, blkSz, data_buf)
dbms_diskgroup.createfile    (fileName, fileType, blkSz, fileSz, hdl, plkSz, fileGenName)
dbms_diskgroup.close         (hdl)
dbms_diskgroup.commitfile    (handle)
dbms_diskgroup.resizefile    (handle, fsz)
dbms_diskgroup.remap         (gnum, fnum, virt_extent_num)
dbms_diskgroup.getfileattr   (fileName, fileType, fileSz, blkSz)
20. File Transfer Between OS and ASM
- The supported tools (10g)
  - RMAN
  - DBMS_FILE_TRANSFER
  - FTP (XDB)
  - WebDAV (XDB)
- They all require an RDBMS instance
- In 11g, all the above plus asmcmd
  - cp command
  - Works directly with the ASM instance
21. Strace and ASM 1/3
- Goal: understand strace output when using ASM storage
- Example:
  read64(15, "33\0@\"..., 8192, 473128960) = 8192
- This is a read operation of 8 KB from FD 15 at offset 473128960
- What is the segment name, type, file and block?
22. Strace and ASM 2/3
- From /proc/<pid>/fd I find that FD 15 is /dev/mpath/itstor420_1p1
- This is disk 20 of D.G. 1 (from v$asm_disk)
- From x$kffxp I find the ASM file and extent
  - Note: offset 473128960 = 451 MB + 27 x 8 KB

sys@+ASM1> select number_kffxp, xnum_kffxp from x$kffxp
           where group_kffxp=1 and disk_kffxp=20 and au_kffxp=451;

NUMBER_KFFXP XNUM_KFFXP
------------ ----------
         268         17
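The offset decomposition on this slide can be checked mechanically; a minimal sketch, assuming 1 MB allocation units and an 8 KB database block size as in the example:

```python
# Decompose the strace read offset into (AU on disk, block within the AU),
# assuming 1 MB AUs and 8 KB DB blocks.
AU_SIZE = 1024 * 1024
BLOCK_SIZE = 8192

offset = 473128960                              # from the read64() call
au = offset // AU_SIZE                          # AU number relative to disk start
block_in_au = (offset % AU_SIZE) // BLOCK_SIZE  # blocks into that AU

print(au, block_in_au)  # 451 27 -> matches "451 MB + 27 x 8 KB"
```

The AU number (451) is what the x$kffxp query above filters on via au_kffxp.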
23. Strace and ASM 3/3
- From v$asm_alias I find the file alias for file 268: USERS.268.612033477
- From the v$datafile view I find the RDBMS file number: 9
- From dba_extents I finally find the owner and segment name relative to the original IO operation

sys@TEST1> select owner, segment_name, segment_type from dba_extents
           where file_id=9 and 27+17*1024*1024/8192 between block_id and block_id+blocks;

OWNER SEGMENT_NAME SEGMENT_TYPE
----- ------------ ------------
SCOTT EMP          TABLE
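The block number used in the dba_extents predicate can be cross-checked with a sketch, again assuming 1 MB extents and 8 KB blocks:

```python
# Convert (extent 17, block 27 within the extent) into the datafile
# block number queried in dba_extents, assuming 1 MB extents / 8 KB blocks.
AU_SIZE = 1024 * 1024
BLOCK_SIZE = 8192

blocks_per_extent = AU_SIZE // BLOCK_SIZE  # 128 blocks per 1 MB extent
file_block = 17 * blocks_per_extent + 27   # extent number x 128 + offset in extent

print(file_block)  # 2203
```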
24. Investigation of Fine Striping
- An application: finding the layout of fine-striped files
- Explored using strace of an oracle session executing alter system dump logfile ...
- Result: round-robin distribution over 8 x 1 MB extents
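The observed round-robin layout can be sketched as follows. The 128 KB stripe width is an assumption here (the 10g default of _asm_stripesize); the slide only establishes the 8-extent round robin.

```python
# Sketch of fine-grained striping: fixed-width stripes round-robined
# over a set of 8 extents. Assumes the 10g default 128 KB stripe width.
STRIPE = 128 * 1024
N_EXTENTS = 8

def fine_stripe(offset):
    """Map a file offset to (extent index in the 8-extent set, offset in extent)."""
    stripe_no = offset // STRIPE
    extent_idx = stripe_no % N_EXTENTS  # round robin across the 8 extents
    # each full round deposits one stripe into every extent of the set
    offset_in_extent = (stripe_no // N_EXTENTS) * STRIPE + offset % STRIPE
    return extent_idx, offset_in_extent

print(fine_stripe(0))           # start of the file -> extent 0
print(fine_stripe(9 * STRIPE))  # second round -> extent 1 again
```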
25. Metadata Files
- ASM diskgroups contain hidden files
  - Not listed in V$ASM_FILE (file number < 256)
  - Details are available in X$KFFIL
- In addition, the first 2 AUs of each disk are marked as file 0 in X$KFDAT
- Example (10g)

     GROUP       FILE FILESIZE_AFTER_MIRR RAW_FILE_SIZE
---------- ---------- ------------------- -------------
         1          1             2097152       6291456
         1          2             1048576       3145728
         1          3           264241152     795869184
         1          4             1392640       6291456
         1          5             1048576       3145728
         1          6             1048576       3145728
26. ASM Metadata 1/2
- File 0, AU 0: disk header (disk name, etc.), Allocation Table (AT) and Free Space Table (FST)
- File 0, AU 1: Partner Status Table (PST)
- File 1: File Directory (files and their extent pointers)
- File 2: Disk Directory
- File 3: Active Change Directory (ACD)
  - The ACD is analogous to a redo log, where changes to the metadata are logged
  - Size = 42 MB x number of instances
- Source: Oracle Automatic Storage Management, Oracle Press, Nov 2007, N. Vengurlekar, M. Vallath, R. Long
27. ASM Metadata 2/2
- File 4: Continuing Operation Directory (COD)
  - The COD is analogous to an undo tablespace. It maintains the state of active ASM operations such as disk or datafile drop/add. The COD log record is either committed or rolled back based on the success of the operation.
- File 5: Template Directory
- File 6: Alias Directory
- 11g, File 9: Attribute Directory
- 11g, File 12: Staleness Registry, created when needed to track offline disks
28. ASM Rebalancing
- Rebalancing is performed (and mandatory) after space management operations
- Goal: balanced space allocation across disks
  - Not based on performance or utilization
- ASM spreads every file across all disks in a diskgroup
- ASM instances are in charge of rebalancing
  - Extent pointer changes are communicated to the RDBMS
  - The RDBMS ASMB process keeps an open connection to ASM
  - This can be observed by running strace against ASMB
- In RAC, extra messages are passed between the cluster ASM instances
  - LMD0 of the ASM instances is very active during rebalance
29. ASM Rebalancing and VLDB
- Performance of rebalancing is important for VLDB
- An ASM instance can use parallel slaves
  - RBAL coordinates the rebalancing operations
  - ARBx processes pick up chunks of work. By default they log their activity in udump
- Does it scale?
  - In 10g serialization wait events can limit scalability
  - Even at maximum speed rebalancing is not always I/O bound
30. ASM Rebalancing Performance
- Tracing ASM rebalancing operations
  - 10046 trace of the ARBx processes
  - oradebug setospid
  - oradebug event 10046 trace name context forever, level 12
  - Process log files (in bdump) with orasrp (tkprof will not work)
- Main wait events from my tests with RAC (6 nodes)
  - DFS lock handle
    - Waiting for CI level 5 (cross instance lock)
  - Buffer busy wait
  - unaccounted for
  - enq: AD - allocate AU
  - enq: AD - deallocate AU
  - log write(even)
  - log write(odd)
31. ASM Single Instance Rebalancing
- Single instance rebalance
  - Faster in RAC if you can rebalance with only 1 node up (I have observed 20 to 100% speed improvement)
- Buffer busy wait can be the main event
  - It seems to depend on the number of files in the diskgroup
  - Diskgroups with a small number of (large) files have more contention (ARBx processes operate concurrently on the same file)
  - Only seen in tests with 10g
- 11g has improvements regarding rebalancing contention
32. Rebalancing, an Example
- [Chart] Data: D. Wojcik, CERN IT
33. Rebalancing Workload
- When ASM mirroring is used (e.g. with normal redundancy)
  - Rebalancing operations can move more data than expected
- Example
  - 5 TB (allocated), 100 disks, 200 GB each
  - A disk is replaced (diskgroup rebalance)
  - The total IO workload is 1.6 TB (8x the disk size!)
- How to see this: query v$asm_operation, the column EST_WORK keeps growing during rebalance
- The issue: excessive repartnering
34. ASM Disk Partners
- ASM diskgroup with normal redundancy
  - Two copies of each extent are written to different failgroups
- Two ASM disks are partners when they have at least one extent set in common (they are the 2 sides of a mirror for some data)
- Each ASM disk has a limited number of partners
  - Typically 10 disk partners (X$KFDPARTNER)
- Helps to reduce the risk associated with 2 simultaneous disk failures
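The risk-reduction argument can be illustrated with a toy calculation. The 100-disk and 10-partner figures follow the slides; treating the second failure as uniform and independent is a simplifying assumption.

```python
# After one disk fails, data is lost only if a second failure hits one
# of its partners (the only disks holding mirror copies of its extents).
n_disks = 100
n_partners = 10

# probability that a second, independent disk failure hits a partner
p_loss_partnered = n_partners / (n_disks - 1)

# without partnership limits, mirrors of a disk's extents could sit on
# any other disk, so any second failure would lose some data
p_loss_unrestricted = 1.0

print(round(p_loss_partnered, 3))  # ~0.1 instead of certain loss
```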
35. Free and Usable Space
- When ASM mirroring is used, not all the free space should be occupied
- V$ASM_DISKGROUP.USABLE_FILE_MB
  - Amount of free space that can be safely utilized taking mirroring into account, and yet be able to restore redundancy after a disk failure
  - It's calculated for the worst-case scenario; anyway, it is a best practice not to let it go negative (it can)
- This can be a problem when deploying a small number of large LUNs and/or failgroups
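For normal redundancy the idea behind USABLE_FILE_MB can be sketched as (free space minus the space reserved to re-mirror after a failure) divided by two, the division accounting for mirroring; the input numbers below are invented for illustration.

```python
# Sketch of the USABLE_FILE_MB idea for a normal-redundancy diskgroup:
# usable = (free - space needed to restore redundancy after a failure) / 2.
def usable_file_mb(free_mb, required_mirror_free_mb):
    return (free_mb - required_mirror_free_mb) / 2

# e.g. 1 TB free, but 400 GB must stay free to re-mirror a lost failgroup
print(usable_file_mb(1024 * 1024, 400 * 1024))  # 319488.0 MB usable
```

With few large failgroups, required_mirror_free_mb is large relative to free_mb, so usable space shrinks quickly and can even go negative, as the slide warns.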
36. Fast Mirror Resync
- ASM 10g with normal redundancy does not allow offlining part of the storage
  - A transient error in a storage array can cause several hours of rebalancing to drop and add disks
  - It is a limiting factor for scheduled maintenance
- 11g has a new feature: fast mirror resync
  - Redundant storage can be put offline for maintenance
  - Changes are accumulated in the staleness registry (file 12)
  - Changes are applied when the storage is back online
37. Read Performance, Random I/O
- [Chart] IOPS measured with SQL (synthetic test): 130 IOPS per disk
- Destroking: only the external part of the disks is used
38. Read Performance, Sequential I/O
- [Chart] Limited by HBAs -> 4 x 2 Gb (measured with parallel query)
39. Implementation Details
- Multipathing
  - Linux Device Mapper (2.6 kernel)
- Block devices
  - RHEL4 and 10gR2 allow skipping raw device mapping
- External half of the disk for data disk groups
- JBOD config
  - No HW RAID
  - ASM used to mirror across disk arrays
- HW
  - Storage arrays (Infortrend): FC controller, SATA disks
  - FC (Qlogic): 4 Gb switch and HBAs (2 Gb in older HW)
  - Servers are 2x CPUs, 4 GB RAM, 10.2.0.3 on RHEL4, RAC of 4 to 8 nodes
40. Conclusions
- CERN deploys RAC and ASM on Linux on commodity HW
  - 2.5 years of production, 110 Oracle 10g RAC nodes and 300 TB of raw disk space (Dec 2007)
- ASM metadata
  - Most critical part, especially rebalancing
  - Knowledge of some ASM internals helps troubleshooting
- ASM on VLDB
  - Know and work around pitfalls in 10g
  - 11g has important manageability and performance improvements
41. Q&A
- Links
  - http://cern.ch/phydb
  - http://twiki.cern.ch/twiki/bin/view/PSSGroup/ASM_Internals
  - http://www.cern.ch/canali