Title: Scalable, Fault-Tolerant NAS for Oracle - The Next Generation
1Scalable, Fault-Tolerant NAS for Oracle - The Next Generation
- Kevin Closson
- Chief Software Architect
- Oracle Platform Solutions, Polyserve Inc
2The Un-Show Stopper
- NAS for Oracle is not file serving; let me explain
- Think of GbE NFS I/O paths from the Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection.
- In these terms, NFS over GbE is just a protocol, as is FCP over FibreChannel
- The proof is in the numbers.
- A single dual-socket/dual-core AMD server running Oracle10gR2 can push through 273MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs!
- Compare that to the infrastructure and HW costs of 4Gb FCP (450MB/s, but you need 2 cards for redundancy)
- OLTP over modern NFS with GbE is not a challenging I/O profile.
- However, not all NAS devices are created equal by any means
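The arithmetic behind the 273MB/s figure can be sketched as follows. This is a back-of-the-envelope calculation, assuming the nominal GbE line rate of 1 Gbit/s (125 MB/s in decimal megabytes) per NIC and ignoring Ethernet, IP and NFS overhead; the function name is made up for illustration.

```python
# Nominal line rate of one Gigabit Ethernet link in decimal MB/s.
# Assumption: 1 Gbit/s = 1000 Mbit/s = 125 MB/s, before any protocol overhead.
GBE_MB_PER_SEC = 1000 / 8


def bonded_utilization(measured_mb_s: float, nic_count: int) -> float:
    """Return measured throughput as a fraction of the nominal bonded line rate."""
    return measured_mb_s / (nic_count * GBE_MB_PER_SEC)


nominal = 3 * GBE_MB_PER_SEC              # triple-bonded GbE
utilization = bonded_utilization(273, 3)  # figure quoted on the slide
print(f"nominal {nominal:.0f} MB/s, measured 273 MB/s, utilization {utilization:.0%}")
```

Under these assumptions the quoted throughput is roughly 73% of the nominal 375 MB/s that three bonded links could carry.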
3Agenda
- Oracle on NAS
- NAS Architecture
- Proof of Concept Testing
- Special Characteristics
4Oracle on NAS
5Oracle on NAS
- Connectivity
- Fantasyland Dream Grid would be nearly impossible with FibreChannel switched fabric, for instance
- 128 nodes, 256 HBAs, 2 switches each with 256 ports just for the servers; then you have to work out storage paths
- Simplicity
- NFS is simple. Anyone with a pulse can plug in cat-5 and mount filesystems.
- MUCH MUCH MUCH MUCH MUCH simpler than
- Raw partitions for ASM
- Raw, OCFS2 for CRS
- Oracle Home? Local Ext3 or UFS?
- What a mess
- Supports shared Oracle Home, shared APPL_TOP too
- But not simpler than a Certified Third Party Cluster Filesystem, but that is a different presentation
- Cost
- FC HBAs are always going to be more expensive than NICs
- Ports on enterprise-level FC switches are very expensive
6Oracle on NAS
- NFS Client Improvements
- Direct I/O
- open(,O_DIRECT,) works with Linux NFS clients, Solaris NFS client, likely others
- Oracle Improvements
- init.ora filesystemio_options=directIO
- No async I/O on NFS, but look at the numbers
- Oracle runtime checks mount options
- Caveat: It doesn't always get it right, but at least it tries (OSDS)
- Don't be surprised to see Oracle offer a platform-independent NFS client
- NFS V4 will have more improvements
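A rough idea of what a runtime mount-option check involves is sketched below. This is not Oracle's code; it is a minimal illustration, assuming a Linux client whose mounts are listed in the `/proc/mounts` text format. The `REQUIRED_OPTIONS` set is a hypothetical policy chosen for the example, not an official requirement, and the sample mount table is invented.

```python
# Hypothetical policy for illustration only; check your platform's
# documentation for the options that are actually required.
REQUIRED_OPTIONS = {"hard", "proto=tcp"}


def parse_mounts(text):
    """Yield (mount_point, fs_type, options) from /proc/mounts-style text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        yield mount_point, fs_type, set(options.split(","))


def missing_nfs_options(text, required=REQUIRED_OPTIONS):
    """Map each NFS mount point to the required options it lacks."""
    problems = {}
    for mount_point, fs_type, options in parse_mounts(text):
        if fs_type in ("nfs", "nfs4"):
            missing = set(required) - options
            if missing:
                problems[mount_point] = missing
    return problems


# Invented sample in /proc/mounts format: one good NFS mount, one bad, one local.
SAMPLE = """\
filer:/vol/u01 /u01 nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp 0 0
filer:/vol/u02 /u02 nfs rw,vers=3,soft,proto=udp 0 0
/dev/sda1 / ext3 rw 0 0
"""

print(missing_nfs_options(SAMPLE))
```

On a live Linux client the same function could be fed the contents of `/proc/mounts` instead of the sample text.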
7NAS Architecture
8NAS Architecture
- Single-headed Filers
- Clustered Single-headed Filers
- Asymmetrical Multi-headed NAS
- Symmetrical Multi-headed NAS
9Single Headed Filer Architecture
10NAS Architecture: Single-headed Filer
GigE Network
Filesystems /u01 /u02 /u03
11Oracle Servers Accessing a Single-headed Filer
I/O Bottleneck
A single one of these (an Oracle server) has the same (or more) bus bandwidth as this (the filer)!
Filesystems /u01 /u02 /u03
12Oracle Servers Accessing a Single-headed Filer
Single Point of Failure
Highly Available through failover-HA, DataGuard, RAC, etc.
Filesystems /u01 /u02 /u03
13Clustered Single-headed Filers
14Architecture: Cluster of Single-headed Filers
Paths Active After Failover
Filesystems /u01 /u02
Filesystems /u03
15Oracle Servers Accessing a Cluster of Single-headed Filers
16Architecture: Cluster of Single-headed Filers
What if /u03 I/O saturates this Filer?
17Filer I/O Bottleneck. Resolution: Data Migration
Paths Active After Failover
Filesystems /u01 /u02
Filesystems /u03
Filesystems /u04
Migrate some of the hot data to /u04
18Data Migration Remedies I/O Bottleneck
NEW Single Point of Failure
Paths Active After Failover
Filesystems /u01 /u02
Filesystems /u03
Filesystems /u04
Migrate some of the hot data to /u04
19Summary: Single-headed Filers
- Cluster to mitigate S.P.O.F
- Clustering is a pure afterthought with filers
- Failover Times?
- Long, really really long.
- Transparent?
- Not in many cases.
- Migrate data to mitigate I/O bottlenecks
- What if the data hot spot moves with time? The Dog Chasing His Tail Syndrome
- Poor Modularity
- Expanded by pairs for data availability
- What's all this talk about CNS?
20Asymmetrical Multi-headed NAS Architecture
21Asymmetrical Multi-headed NAS Architecture
Three Active NAS Heads / Three For Failover and Pools of Data
FibreChannel SAN
Note: Some variants of this architecture support M:1 Active:Standby, but that doesn't really change much.
22Asymmetrical NAS Gateway Architecture
- Really not much different than clusters of single-headed filers
- 1 NAS head to 1 filesystem relationship
- Migrate data to mitigate I/O contention
- Failover not transparent
- But
- More Modular
- Not necessary to scale up by pairs
23Symmetric Multi-headed NAS
24HP Enterprise File Services Clustered Gateway
25Symmetric vs Asymmetric
EFS-CG
26Enterprise File Services Clustered Gateway: Component Overview
- Cluster Volume Manager
- RAID 0
- Expand Online
- Fully Distributed, Symmetric Cluster Filesystem
- The embedded filesystem is a fully distributed, symmetric cluster filesystem
- Virtual NFS Services
- Filesystems are presented through Virtual NFS Services
- Modular and Scalable
- Add NAS heads without interruption
- All filesystems can be presented for read/write through any/all NAS heads
27EFS-CG Clustered Volume Manager
- RAID 0
- LUNs are RAID 1, so this implements S.A.M.E. (Stripe And Mirror Everything)
- Expand online
- Add LUNs, grow volume
- Up to 16TB
- Single Volume
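To make the RAID 0 layer concrete, the sketch below maps a byte offset in a striped volume to a LUN and an offset within that LUN using plain round-robin striping. The stripe size, LUN count and function name are assumptions picked for illustration; the slides do not state how the EFS-CG volume manager lays out its stripes.

```python
def locate(logical_offset: int, lun_count: int, stripe_size: int):
    """Map a byte offset in a striped volume to (lun_index, offset_within_lun)."""
    stripe_number = logical_offset // stripe_size   # which stripe unit overall
    lun_index = stripe_number % lun_count           # round-robin across LUNs
    stripe_on_lun = stripe_number // lun_count      # stripe units already on that LUN
    offset_in_stripe = logical_offset % stripe_size
    return lun_index, stripe_on_lun * stripe_size + offset_in_stripe


# Assumed geometry: 4 LUNs (each a RAID 1 mirror on the array), 1 MiB stripe units.
LUNS, STRIPE = 4, 1024 * 1024
for offset in (0, STRIPE, 4 * STRIPE + 512):
    print(offset, locate(offset, LUNS, STRIPE))
```

Because consecutive stripe units land on different mirrored LUNs, a large sequential transfer is spread across all of them, which is the point of striping over mirrors.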
28The EFS-CG Filesystem
- All NAS devices have embedded operating systems and file systems, but the EFS-CG is
- Fully Symmetric
- Distributed Lock Manager
- No Metadata Server or Lock Server
- General Purpose clustered file system
- Standard C Library and POSIX support
- Journaled with Online recovery
- Proprietary format but uses standard Linux file system semantics and system calls including flock() and fcntl() clusterwide
- Expand a single filesystem online up to 16TB, up to 254 filesystems in current release.
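The locking calls named above can be exercised from any program that uses the standard interfaces. The sketch below uses Python's `fcntl` module on Linux; the helper names are made up, and the demo runs against a local temporary directory as a stand-in. Pointing it at a path on a shared filesystem is the assumed use, and clusterwide behavior is the slide's claim, not something this example demonstrates.

```python
import fcntl
import os
import tempfile


def append_with_lock(path: str, data: bytes) -> None:
    """Append data while holding an exclusive whole-file flock()."""
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)       # blocks until the lock is granted
        try:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def try_byte_range_lock(path: str, start: int, length: int) -> bool:
    """Try a non-blocking fcntl()-style write lock on a byte range."""
    with open(path, "r+b") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        except OSError:
            return False                    # another process holds a conflicting lock
        fcntl.lockf(f, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return True


# Local temporary directory used as a stand-in for a shared filesystem path.
with tempfile.TemporaryDirectory() as workdir:
    demo_file = os.path.join(workdir, "demo.dat")
    append_with_lock(demo_file, b"hello")
    print(try_byte_range_lock(demo_file, 0, 5))
```

`flock()` locks the whole file, while `fcntl()` record locks cover a byte range, so the two helpers show the two styles of locking the slide mentions.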
29EFS-CG Filesystem Scalability
30Scalability. Single Filesystem Export Using x86 Xeon-based NAS Heads (Old Numbers)
Chart: throughput in MegaBytes per Second (MB/s) vs. Cluster Size (Nodes) / NAS Heads, with a marker for the approximate single-headed filer limit.
Cluster Size (Nodes)    MB/s
1                       123
2                       246
4                       493
6                       739
8                       986
9                       1,084
10                      1,196
HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.
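The chart's numbers can be turned into a scaling-efficiency figure. The sketch below assumes the seven data labels on the chart pair up in order with the seven cluster sizes on its x-axis (123 MB/s for 1 NAS head up to 1,196 MB/s for 10); that pairing is an inference from the chart, and the function name is made up for illustration.

```python
# Throughput data labels from the chart, paired with the cluster sizes on its x-axis.
# Assumption: labels pair up in order (smallest value with 1 head, largest with 10).
THROUGHPUT_MB_S = {1: 123, 2: 246, 4: 493, 6: 739, 8: 986, 9: 1084, 10: 1196}


def scaling_efficiency(results):
    """Return {heads: throughput / (heads * single-head throughput)}."""
    base = results[1]
    return {heads: mb_s / (heads * base) for heads, mb_s in results.items()}


for heads, eff in scaling_efficiency(THROUGHPUT_MB_S).items():
    print(f"{heads:>2} heads: {THROUGHPUT_MB_S[heads]:>5} MB/s, efficiency {eff:.1%}")
```

Under that pairing, throughput stays above 97% of perfectly linear scaling through 10 NAS heads.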
31Virtual NFS Services
- Specialized Virtual Host IP
- Filesystem groups are exported through VNFS
- VNFS failover and rehosting are 100% transparent to NFS client
- Including active file descriptors, file locks (e.g. fcntl/flock), etc.
32EFS-CG Filesystems and VNFS
33Enterprise File Services Clustered Gateway
Diagram: four NAS Heads presenting filesystems /u01, /u02, /u03 and /u04 through Virtual NFS Services (labels: vnfs1, vnfs1b, vnfs2b, vnfs3b).
34EFS-CG Management Console
35EFS-CG Proof of Concept
36EFS-CG Proof of Concept
- Goals
- Use Oracle10g (10.2.0.1) with a single high performance filesystem for the RAC database and measure
- Durability
- Scalability
- Virtual NFS functionality
37EFS-CG Proof of Concept
- The 4 filesystems presented by the EFS-CG were
- /u01. This filesystem contained all Oracle executables (e.g., ORACLE_HOME)
- /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) and some datafiles and External Tables for ETL testing
- /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup
- /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database
38EFS-CG P.O.C. Parallel Tablespace Creation
- All datafiles created in a single exported filesystem
- Proof of multi-headed, single filesystem write scalability
39EFS-CG P.O.C. Parallel Tablespace Creation
40EFS-CG P.O.C. Full Table Scan Performance
- All datafiles located in a single exported filesystem
- Proof of multi-headed, single filesystem sequential I/O scalability
41EFS-CG P.O.C. Parallel Query Scan Throughput
42EFS-CG P.O.C. OLTP Testing
- OLTP Database based on an Order Entry Schema and workload
- Test areas
- Physical I/O Scalability under Oracle OLTP
- Long Duration Testing
43EFS-CG P.O.C. OLTP Workload Transaction Avg Cost
Oracle Statistics        Average Per Transaction
SGA Logical Reads        33
SQL Executions           5
Physical I/O             6.9
Block Changes            8.5
User Calls               6
GCS/GES Messages Sent    12
Averages with RAC can be deceiving; be aware of CR sends
44EFS-CG P.O.C. OLTP Testing
45EFS-CG P.O.C. OLTP Testing. Physical I/O Operations
46EFS-CG Handles all OLTP I/O Types Sufficiently: no Logging Bottleneck
47Long Duration Stress Test
- Benchmarks do not prove durability
- Benchmarks are sprints
- Typically 30-60 minute measured runs (e.g., TPC-C)
- This long duration stress test was no benchmark by any means
- Ramp OLTP I/O up to roughly 10,000/sec
- Run non-stop until the aggregate I/O breaks through 10 Billion physical transfers
- 10,000 physical I/O transfers per second for every second of nearly 12 days
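The "nearly 12 days" figure follows directly from the two numbers on the slide, as this short calculation shows. It assumes the rate holds at exactly 10,000 physical I/Os per second for the whole run, whereas the slide says "roughly".

```python
TARGET_TRANSFERS = 10_000_000_000   # 10 billion physical I/O transfers
RATE_PER_SEC = 10_000               # assumed constant physical I/Os per second
SECONDS_PER_DAY = 24 * 60 * 60

seconds = TARGET_TRANSFERS / RATE_PER_SEC
days = seconds / SECONDS_PER_DAY
print(f"{seconds:,.0f} seconds = {days:.2f} days")   # 1,000,000 seconds = 11.57 days
```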
48Long Duration Stress Test
49Long Duration Stress Test
50Long Duration Stress Test
52Special Characteristics
53Special Characteristics
- The EFS-CG NAS Heads are Linux Servers
- Tasks can be executed directly within the EFS-CG NAS Heads at FCP speed
- Compression
- ETL, data importing
- Backup
- etc.
54Example of EFS-CG Special Functionality
- A table is exported on one of the RAC nodes
- The export file is then compressed on the EFS-CG NAS head
- CPU from NAS Head, instead of database servers
- The NAS heads are really just protocol engines. I/O DMAs are offloaded to the I/O subsystems. There are plenty of spare cycles.
- Data movement at FCP rate instead of GigE
- Offload the I/O fabric (NFS paths from servers to the EFS-CG)
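The compression step could be as simple as a streaming gzip run on the NAS head itself. The sketch below is one possible way to do it in Python, not the tool used in the presentation; the function name and the commented-out file paths are hypothetical, since the slides do not name the export file.

```python
import gzip
import shutil


def compress_export(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream-compress src_path into a gzip file without loading it into memory."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Hypothetical paths for illustration; run on the NAS head so the data
# moves over the SAN rather than over the NFS paths.
# compress_export("/u03/exports/orders.dmp", "/u03/exports/orders.dmp.gz")
```

Copying in fixed-size chunks keeps memory use flat regardless of how large the export file is.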
55Export a Table to NFS Mount
56Compress it on the NAS Head
57Questions and Answers
58Backup Slide
59EFS-CG Scales Up and Out
EFS-CG NAS Head
EFS-CG NAS Head
SAN