Title: Long-Distance HP OpenVMS Clusters
1 Long-Distance HP OpenVMS Clusters
- Alex Goral, LightSand
- Dennis Majikas, Digital Networks
- Keith Parris, HP
- Session 1530
2 Trends and Driving Forces
- BC (business continuity), DR (disaster recovery) and DT (disaster tolerance) in a post-9/11 world
- Recognition of greater risk to datacenters
- Particularly in major metropolitan areas
- Push toward greater distances between redundant datacenters
- It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
3 Trends and Driving Forces
- "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
- http://www.sec.gov/news/studies/34-47638.htm
- Agencies involved:
- Federal Reserve System
- Department of the Treasury
- Securities and Exchange Commission (SEC)
- Applies to:
- Financial institutions critical to the US economy
4 Draft Interagency White Paper
- Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.
- Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.
5 Draft Interagency White Paper
- Organizations should establish back-up facilities a significant distance away from their primary sites.
- The agencies expect that, as technology and business processes continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.
6 Basic underlying challenges, and technologies to address them
- Data protection through data replication
- Geographic separation for the sake of relative safety
- Careful site selection
- Application coordination
- Long-distance multi-site clustering
- Inter-site link technology choices
- Inter-site link bandwidth
- Inter-site latency due to the speed of light
7 Dennis Majikas
- Site Selection
- Inter-Site Links
8 Multi-Site Clusters
- Consist of multiple sites in different locations, with one or more OpenVMS systems at each site
- Systems at each site are all part of the same OpenVMS cluster, and share resources
- Sites generally need to be connected by bridges (or bridge-routers); pure IP routers don't pass the SCS protocol used within OpenVMS Clusters
- If only IP is available, an L2TP tunnel or LightSand boxes might be used
9 Inter-site Link Options
- Sites linked by:
- DS-3/T3 (E3 in Europe) or ATM circuits from a telecommunications vendor
- Microwave link: DS-3/T3 or Ethernet
- Free-Space Optics link (short distance, low cost)
- Dark fiber where available:
- ATM over SONET, or
- Ethernet over fiber (10 Mb, Fast, Gigabit)
- FDDI
- Fibre Channel
- Fiber links between Memory Channel switches (up to 3 km)
10 Inter-site Link Options
- Sites linked by:
- Wave Division Multiplexing (WDM), in either Coarse (CWDM) or Dense (DWDM) Wave Division Multiplexing flavors
- Can carry any of the types of traffic that can run over a single fiber
- Individual WDM channel(s) from a vendor, rather than entire dark fibers
11 Bandwidth of Inter-Site Link(s)
12 Inter-Site Link Requirements
- Inter-site SCS link minimum standards are in the OpenVMS Cluster Software SPD:
- 10 megabits per second minimum data rate
- Minimize packet latency
- Low SCS packet retransmit rate (one way to check is sketched after this list):
- Less than 0.1% retransmitted. Implies:
- Low packet-loss rate for bridges
- Low bit-error rate for links
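One way to watch a LAN-based inter-site SCS link for trouble is the SCACP utility, which displays PEdriver virtual-circuit information; this is a minimal sketch, and the exact counters shown depend on the OpenVMS version:
$ MCR SCACP                ! Systems Communications Architecture Control Program
SCACP> SHOW VC             ! virtual circuits to the other cluster nodes
SCACP> EXIT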
13 Important Inter-Site Link Decisions
- Service type choices:
- Telco-provided circuit service, own link (e.g. microwave or FSO), or dark fiber?
- Dedicated bandwidth, or shared pipe?
- Single or multiple (redundant) links? If multiple links, then:
- Multiple vendors?
- Diverse paths?
14 Long-Distance Clusters
- OpenVMS officially supports distances of up to 500 miles (805 km) between nodes
- Why the limit?
- Inter-site latency
15 Long-distance Cluster Issues
- Latency due to the speed of light becomes significant at longer distances. Rules of thumb (a worked example follows this slide):
- About 1 ms per 125 miles, one-way, or
- About 1 ms per 62 miles, round trip
- Actual circuit path length can be longer than highway mileage between sites
- Latency primarily affects the performance of:
- Remote lock operations
- Remote I/Os
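Working through the rules of thumb above: at the supported 500-mile limit, one-way latency from the speed of light alone is about 500 / 125 = 4 ms, or roughly 8 ms per round trip, before any delay added by bridges, switches, or controllers.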
16 Lock Request Latencies
17 Inter-site Latency: Actual Customer Measurements
18 Differentiate between latency and bandwidth
- Can't get around the speed of light and its latency effects over long distances
- A higher-bandwidth link doesn't mean lower latency
19 Latency of Inter-Site Link
- Latency affects the performance of:
- Lock operations that cross the inter-site link
- Lock requests
- Directory lookups, deadlock searches
- Write I/Os to remote shadowset members, either:
- Over the SCS link, through the OpenVMS MSCP Server on a node at the opposite site, or
- Direct via Fibre Channel (with an inter-site FC link)
- Both MSCP and the SCSI-3 protocol used over FC take a minimum of two round trips for writes (see the example below)
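Continuing the 500-mile illustration from above: at roughly 8 ms per round trip, a write to a remote shadowset member that needs two round trips adds on the order of 16 ms of latency on top of the local I/O time.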
20 SAN Extension
- Fibre Channel distance over fiber is limited to about 100 kilometers
- A shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
- Various vendors provide SAN Extension boxes to connect Fibre Channel SANs over an inter-site link
- See SAN Design Reference Guide Vol. 4, "SAN extension and bridging":
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
21 Alex Goral
- SAN Extension vs. Application Extension
22 Long-distance OpenVMS Cluster Testing within HP
- Host-based Volume Shadowing over SAN Extension in Colorado Springs
- Craig Showers' lab and Melanie Hubbard's test work at Nashua
23 Long-distance HBVS Testing
- SAN Extension used to extend the SAN using FCIP
- No OpenVMS Cluster involved across the distance (no OpenVMS node at the remote end, just data vaulting to a distant disk controller)
- Distance simulated via introduced packet latency
24 Long-distance HBVS Testing
25 Long-distance OpenVMS Cluster Testing within HP
- Craig Showers' lab and Melanie Hubbard's test work at Nashua
26 Solutions Engineering OEM Lab Project: Oracle 9i RAC DT/HA in a Distributed OpenVMS Environment, Phase II (Shadow Sets, HBMM and Oracle RAC Proof of Concept)
- Craig Showers, Carlton Davis
- August 22, 2005
27 Background
- OpenVMS Ambassadors pushed for a Proof of Concept (POC) combining Oracle RAC with OpenVMS long-distance cluster capabilities
- Phase I POC: LAN failover (failSAFE IP) with Oracle RAC, both locally and over 100 km (2004)
- Ambassadors and Bootcamp attendees provided the Phase II requirements:
- Separate VMS nodes, RAC instances, clients, and disks in a truly distributed 2-node cluster
- DT environment includes on-disk copies using Volume Shadowing
- Consider cost issues for the networking infrastructure
28 Project Business Goals
- Provide proof of concept for RAC over a stretched VMS cluster
- Provide data on VMS and Oracle behavior in stretched configurations
- Raise visibility of BCS platform HA and DT differentiators working in conjunction with Oracle
- Provide a re-usable environment for customers or partners to test their own applications and distance requirements
- Serve as a model POC for other operating systems and database products
- Prepare for data security requirements
- Deliverable: technical sales collateral
29 Project Technical Objective
- Observe and record the behavior of Oracle 9i RAC Server used with OpenVMS shadow sets on a 2-node clustered OpenVMS system, with various combinations of FULL, COPY and MERGE shadow-set status
30 Partner Involvement
- LightSand
- Engineer, Network Switch: Alex Goral
- Digital Networks
- Networking: Dennis Majikas
31 Configuration
- 2-node GS1280 cluster, 8 CPUs, 8 GB memory
- EVA3000 and EVA8000
- RAID1 volumes
- OpenVMS 7.3-2, latest patches (esp. HBMM) required
- Oracle 9.2.0.5
- Swingbench load generator
- TCP/IP Services 5.4, ECO 5
- LightSand switches allow SCS traffic over IP
32(No Transcript)
33 Testing
- Introduce delays in data transmission to emulate long distances between nodes: campus/metro (50-100 km), regional (500 km), and extreme (1000 km)
- Load generation: 300 remote clients, 600k transactions of typical database functions
- Observe behavior of RAC and shadow-set operations with combinations of local and remotely served volumes
- Data collection:
- T4 (plus OTLT and VEVAMON collectors)
- Disk-related DCL commands
- Transaction output from load generation
34 Testing (cont.)
- Test variations:
- RAC:
- Active-Active
- Active-Passive
- Distances between nodes, incl. 0 (single data center)
- Shadow sets
- Network transfer: compressed/non-compressed
- Network bandwidth: OC3, Gigabit, OC12?
35 Results to Date
- Datapoints have been collected for Active-Active RAC configurations run with the following delays:
- 0 ms: local cluster
- 1 ms: 200 km (124 mi)
- 3 ms: 600 km (372 mi)
- 5 ms: 1000 km (621 mi)
- 10 ms: 2000 km (1242 mi)
36 Mitigating the Impact of Distance
- Do local lock requests rather than remote
- Avoid lock directory lookups between sites
- Avoid SCS and MSCP credit waits
- Avoid remote shadowset reads (/SITE and /READ_COST; a DCL sketch follows this slide)
- Minimize round trips between sites
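A minimal sketch of steering shadowset reads to the local member using those settings (the device names, site numbers, and read cost below are hypothetical and would be adjusted to the actual configuration):
$! On a node at site 1: declare where the virtual unit and each member live,
$! then make the remote member expensive to read
$ SET DEVICE DSA1: /SITE=1                  ! shadowset virtual unit, as seen from site 1
$ SET DEVICE $1$DGA101: /SITE=1             ! member at site 1 (local)
$ SET DEVICE $1$DGA201: /SITE=2             ! member at site 2 (remote)
$ SET DEVICE $1$DGA201: /READ_COST=1000     ! bias reads away from the remote member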
37 Mitigating Impact of Inter-Site Latency
- Locking:
- Try to avoid lock requests to a master node at the remote site
- OpenVMS does move mastership of a resource tree to the node with the most activity
- How applications are distributed across the cluster can affect local vs. remote locking
- But this represents a trade-off among performance, availability, and resource utilization
38 Application Scheme 1: Hot Primary/Cold Standby
- All applications normally run at the primary site
- The second site is idle, except for volume shadowing, until the primary site fails; then it takes over processing
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
- Wastes computing capacity at the remote site
39 Application Scheme 2: Hot/Hot but Alternate Workloads
- All applications normally run at one site or the other, but not both; data is shadowed between sites, and the opposite site takes over upon a failure
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
- The second site's computing capacity is actively used
40 Application Scheme 3: Uniform Workload Across Sites
- All applications normally run at both sites simultaneously; the surviving site takes all load upon failure
- Performance may be impacted (some remote locking) if the inter-site distance is large
- Fail-over time will be excellent, and risk low (standby systems are already in use running the same applications, thus constantly being tested)
- Both sites' computing capacity is actively used
41 Mitigating Impact of Inter-Site Latency
- Lock directory lookups:
- Lock directory lookups with the directory node at the remote site can only be avoided by setting LOCKDIRWT to zero on all nodes at the remote site (a sketch follows this slide)
- This is typically only satisfactory for Primary/Backup or remote-shadowing-only clusters
- For cases where applications create new locks and free them instead of converting to/from Null mode:
- Create a program to take out a Null lock on the root resources and simply hold those locks forever
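A minimal sketch of the LOCKDIRWT change, assuming the usual MODPARAMS.DAT and AUTOGEN workflow (LOCKDIRWT is not a dynamic parameter, so a reboot is needed):
$! On each node at the remote site, add this line to SYS$SYSTEM:MODPARAMS.DAT:
$!     LOCKDIRWT = 0
$! then regenerate and set the parameters:
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK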
42 Mitigating Impact of Inter-Site Latency
- SCS credit waits:
- Use SHOW CLUSTER/CONTINUOUS with ADD CONNECTIONS, ADD REM_PROC_NAME and ADD CR_WAITS to check for SCS credit waits (a sketch of this check follows this slide). If counts are present and increasing over time, increase the SCS credits at the remote end as shown on the next slide
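A sketch of that check (the ADD commands are typed at the SHOW CLUSTER command prompt; watch the CR_WAITS column over time):
$ SHOW CLUSTER/CONTINUOUS
Command> ADD CONNECTIONS       ! one line per SCS connection (SYSAP pair)
Command> ADD REM_PROC_NAME     ! name of the remote SYSAP on each connection
Command> ADD CR_WAITS          ! number of times a send had to wait for a credit
Command> EXIT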
43 SCS Flow Control
- How to alleviate SCS credit waits (a SYSGEN sketch follows this slide):
- For waits on the VMS$VAXcluster SYSAP on another OpenVMS node:
- Raise SYSGEN parameter CLUSTER_CREDITS
- Default is 10; maximum is 128
- For waits on the VMS$DISK_CL_DRVR SYSAP to the MSCP$DISK SYSAP on another OpenVMS node:
- Raise SYSGEN parameter MSCP_CREDITS on the serving node
- Default is 8; maximum is 128
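A minimal sketch of raising those parameters with SYSGEN (the values 32 and 16 are purely illustrative; it is also worth adding the same lines to MODPARAMS.DAT so AUTOGEN preserves them):
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT                ! start from the current parameter settings
SYSGEN> SET CLUSTER_CREDITS 32     ! credits for the VMS$VAXcluster SYSAP
SYSGEN> SET MSCP_CREDITS 16        ! credits offered by the MSCP server on this node
SYSGEN> WRITE CURRENT              ! stored for the next boot
SYSGEN> EXIT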
44 Minimizing Round Trips Between Sites
- MSCP-served reads take one round trip; writes take two
- Fibre Channel SCSI-3 protocol tricks exist to do writes in one round trip
- e.g. Cisco's Write Acceleration
45 Lab Cluster Configuration
46 [Diagram: lab cluster configuration. GS1280 and XP1000 nodes at each of two sites; LAN switches carrying SCS and Cisco routers carrying IP between the sites; LightSand gateways and FC switches providing the Fibre Channel path; a Shunra delay box simulating distance; MSA1000 storage at one site and HSG80 storage at the other]
47 [Same lab cluster diagram, highlighting one SCS path between the sites]
48 [Same lab cluster diagram, highlighting the second SCS path between the sites]
49 [Same lab cluster diagram, highlighting the Fibre Channel path between the sites]
50 Interactive Demo: Long-Distance Cluster Considerations
- Demonstration of how to measure lock-request latency
- LOCKTIME.COM tool
- Demonstration of measuring local and remote disk I/O latency
- DISKBLOCK tool
51 Data Replication
- Providing and maintaining redundant copies of data is obviously extremely important in a disaster-tolerant cluster
- Options for data replication between sites:
- Host-Based Volume Shadowing software
- Continuous Access
- Database replication
- Middleware (e.g. Reliable Transaction Router software)
52 Data Replication
- Synchronizing data can consume significant
inter-site bandwidth and time
53 Host-Based Volume Shadowing
- Host software keeps multiple disks identical (a mount example follows this slide)
- All writes go to all shadowset members
- Reads can be directed to any one member
- Different read operations can go to different members at once, helping throughput
- Synchronization (or re-synchronization after a failure) is done with a Copy operation
- Re-synchronization after a node failure is done with a Merge operation
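For reference, a minimal sketch of creating a two-member shadowset from DCL (the volume label and device names are hypothetical; in a multi-site cluster one member would normally sit at each site):
$! Mount shadowset virtual unit DSA1: with one member from each site
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $1$DGA201:) RACDATA
$ SHOW DEVICE DSA1: /FULL          ! shows the members and any copy/merge in progress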
54 Fibre Channel and SCSI in Clusters
- Fibre Channel and SCSI are storage-only interconnects
- Provide access to storage devices and controllers
- Cannot carry the SCS protocol (e.g. Connection Manager and Lock Manager traffic)
- An SCS-capable cluster interconnect is also needed:
- Memory Channel, Computer Interconnect (CI), DSSI, FDDI, Ethernet, or Galaxy Shared Memory Cluster Interconnect (SMCI)
- Fail-over between a direct path and an MSCP-served path is first supported in OpenVMS version 7.3-1
55 Host-Based Volume Shadowing and StorageWorks Continuous Access
- Fibre Channel introduces new capabilities into OpenVMS disaster-tolerant clusters
56 Continuous Access
[Diagram: a node at each site connected by an inter-site SCS link; FC switches at the two sites joined by an inter-site FC link; an EVA at each site holding a controller-based mirrorset]
57 Continuous Access
[Diagram: writes from the nodes go to the EVA controller in charge of the mirrorset, which then writes the data to the EVA at the other site]
58 Continuous Access
[Diagram: I/O from the nodes is directed to the EVA controller in charge of the mirrorset]
59 Continuous Access
[Diagram: nodes must now switch to access data through the controller at the other site]
60 Host-Based Volume Shadowing
[Diagram: a node at each site connected by an SCS-capable interconnect; FC switches and an EVA at each site, with the two EVAs' volumes forming a host-based shadowset]
61 Host-Based Volume Shadowing: Reads from Local Member
[Diagram: each node reads from the shadowset member at its own site]
62 Interactive Demo: Long-Distance Cluster Considerations
- Demonstration of the effect of the Volume Shadowing SITE and READ_COST settings on member selection for read operations
63 Host-Based Volume Shadowing: Writes to All Members
[Diagram: a write from a node is sent to the shadowset members at both sites]
64 Host-Based Volume Shadowing with Inter-Site Fibre Channel Link
[Diagram: nodes connected by an SCS-capable interconnect; FC switches at the two sites joined by an inter-site FC link; an EVA at each site forming a host-based shadowset]
65 Direct vs. MSCP-Served Paths
[Diagram: with both an SCS-capable interconnect and an inter-site FC link, each node can reach the remote shadowset member either directly over Fibre Channel or MSCP-served through a node at the other site]
66 Direct vs. MSCP-Served Paths
[Same diagram as the previous slide, continuing the direct vs. MSCP-served path comparison]
67 Cross-site Shadowed System Disk
- With only an SCS link between sites, it was impractical to have a shadowed system disk and boot nodes from it at multiple sites
- With a Fibre Channel inter-site link, it becomes possible to do this
- But it is probably still not a good idea (single point of failure for the cluster)
68 Host-Based Volume Shadowing with Inter-Site Fibre Channel Link
[Diagram: same configuration as before, with the SCS-capable interconnect, the inter-site FC link, and an EVA at each site forming a host-based shadowset]
69 New Failure Scenarios: SCS link OK but FC link broken
(Direct-to-MSCP-served path failover provides protection)
[Diagram: the host-based shadowset configuration with the inter-site FC link broken while the SCS-capable interconnect remains up]
70 New Failure Scenarios: SCS link broken but FC link OK
(Quorum scheme provides protection)
[Diagram: the host-based shadowset configuration with the SCS-capable interconnect broken while the inter-site FC link remains up]
71 Interactive Demo: MSCP-Served vs. Direct Fibre Channel
- Introduction of an inter-site SAN link is demonstrated; disk access fails over from the MSCP-served path over the LAN link to the direct Fibre Channel path (over LightSand boxes)
72 Dennis Majikas
73 Keith Parris
- Case studies:
- Manhattan Municipal Credit Union
- Another Credit Union
- New York Clearing Houses
74 Questions?
75 Speaker Contact Info
- Keith Parris
- E-mail: Keith.Parris@hp.com or keithparris@yahoo.com
- Web: http://www2.openvms.org/kparris/
76(No Transcript)
77get connected
People. Training. Technology.
78(No Transcript)
79(No Transcript)