Title: Long-Distance HP OpenVMS Clusters
1 Long-Distance HP OpenVMS Clusters
- Alex Goral, LightSand
- Dennis Majikas, Digital Networks
- Keith Parris, HP
- Session 1530
2 Trends and Driving Forces
- BC (business continuity), DR (disaster recovery) and DT (disaster tolerance) in a post-9/11 world
- Recognition of greater risk to datacenters
- Particularly in major metropolitan areas
- Push toward greater distances between redundant datacenters
- It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
3 Trends and Driving Forces
- "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
- http://www.sec.gov/news/studies/34-47638.htm
- Agencies involved:
- Federal Reserve System
- Department of the Treasury
- Securities and Exchange Commission (SEC)
- Applies to:
- Financial institutions critical to the US economy
4 Draft Interagency White Paper
- Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.
- Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.
5 Draft Interagency White Paper
- Organizations should establish back-up facilities a significant distance away from their primary sites.
- The agencies expect that, as technology and business processes continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.
6 Basic underlying challenges, and technologies to address them
- Data protection through data replication
- Geographic separation for the sake of relative safety
- Careful site selection
- Application coordination
- Long-distance multi-site clustering
- Inter-site link technology choices
- Inter-site link bandwidth
- Inter-site latency due to the speed of light
7 Dennis Majikas
- Site Selection
- Inter-Site Links
8 Multi-Site Clusters
- Consist of multiple sites in different locations, with one or more OpenVMS systems at each site
- Systems at each site are all part of the same OpenVMS cluster, and share resources
- Sites generally need to be connected by bridges (or bridge-routers); pure IP routers don't pass the SCS protocol used within OpenVMS Clusters
- If only IP is available, an L2TP tunnel or LightSand boxes might be used
9 Inter-site Link Options
- Sites linked by:
- DS-3/T3 (E3 in Europe) or ATM circuits from a telecommunications vendor
- Microwave link: DS-3/T3 or Ethernet
- Free-Space Optics link (short distance, low cost)
- Dark fiber where available:
- ATM over SONET, or
- Ethernet over fiber (10 Mb, Fast, Gigabit)
- FDDI
- Fibre Channel
- Fiber links between Memory Channel switches (up to 3 km)
10 Inter-site Link Options
- Sites linked by:
- Wave Division Multiplexing (WDM), in either Coarse (CWDM) or Dense (DWDM) Wave Division Multiplexing flavors
- Can carry any of the types of traffic that can run over a single fiber
- Individual WDM channel(s) from a vendor, rather than entire dark fibers
11 Bandwidth of Inter-Site Link(s)
12 Inter-Site Link Requirements
- Inter-site SCS link minimum standards are in the OpenVMS Cluster Software SPD:
- 10 megabits per second minimum data rate
- Minimize packet latency
- Low SCS packet retransmit rate (one way to check is sketched after this list):
- Less than 0.1% retransmitted. Implies:
- Low packet-loss rate for bridges
- Low bit-error rate for links
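One way to watch a LAN-based inter-site SCS link for trouble is the SCACP utility, which displays PEdriver virtual-circuit information; this is a minimal sketch, and the exact counters shown depend on the OpenVMS version:
$ MCR SCACP                ! Systems Communications Architecture Control Program
SCACP> SHOW VC             ! virtual circuits to the other cluster nodes
SCACP> EXIT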
13 Important Inter-Site Link Decisions
- Service type choices:
- Telco-provided circuit service, own link (e.g. microwave or FSO), or dark fiber?
- Dedicated bandwidth, or shared pipe?
- Single or multiple (redundant) links? If multiple links, then:
- Multiple vendors?
- Diverse paths?
14 Long-Distance Clusters
- OpenVMS officially supports distances of up to 500 miles (805 km) between nodes
- Why the limit?
- Inter-site latency
15 Long-distance Cluster Issues
- Latency due to the speed of light becomes significant at longer distances. Rules of thumb (a worked example follows this slide):
- About 1 ms per 125 miles, one-way, or
- About 1 ms per 62 miles, round trip
- Actual circuit path length can be longer than highway mileage between sites
- Latency primarily affects the performance of:
- Remote lock operations
- Remote I/Os
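Working through the rules of thumb above: at the supported 500-mile limit, one-way latency from the speed of light alone is about 500 / 125 = 4 ms, or roughly 8 ms per round trip, before any delay added by bridges, switches, or controllers.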
16 Lock Request Latencies
17 Inter-site Latency: Actual Customer Measurements
18 Differentiate between latency and bandwidth
- Can't get around the speed of light and its latency effects over long distances
- A higher-bandwidth link doesn't mean lower latency
19 Latency of Inter-Site Link
- Latency affects the performance of:
- Lock operations that cross the inter-site link
- Lock requests
- Directory lookups, deadlock searches
- Write I/Os to remote shadowset members, either:
- Over the SCS link, through the OpenVMS MSCP Server on a node at the opposite site, or
- Direct via Fibre Channel (with an inter-site FC link)
- Both MSCP and the SCSI-3 protocol used over FC take a minimum of two round trips for writes (see the example below)
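Continuing the 500-mile illustration from above: at roughly 8 ms per round trip, a write to a remote shadowset member that needs two round trips adds on the order of 16 ms of latency on top of the local I/O time.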
20 SAN Extension
- Fibre Channel distance over fiber is limited to about 100 kilometers
- A shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
- Various vendors provide SAN Extension boxes to connect Fibre Channel SANs over an inter-site link
- See SAN Design Reference Guide Vol. 4, "SAN extension and bridging":
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
21 Alex Goral
- SAN Extension vs. Application Extension
22 Long-distance OpenVMS Cluster Testing within HP
- Host-based Volume Shadowing over SAN Extension in Colorado Springs
- Craig Showers' lab and Melanie Hubbard's test work at Nashua
23 Long-distance HBVS Testing
- SAN Extension used to extend the SAN using FCIP
- No OpenVMS Cluster involved across the distance (no OpenVMS node at the remote end, just data vaulting to a distant disk controller)
- Distance simulated via introduced packet latency
24 Long-distance HBVS Testing
25 Long-distance OpenVMS Cluster Testing within HP
- Craig Showers' lab and Melanie Hubbard's test work at Nashua
26 Solutions Engineering OEM Lab Project: Oracle 9i RAC DT/HA in a Distributed OpenVMS Environment, Phase II (Shadow Sets, HBMM and Oracle RAC Proof of Concept)
- Craig Showers, Carlton Davis
- August 22, 2005
27 Background
- OpenVMS Ambassadors pushed for a Proof of Concept (POC) combining Oracle RAC with OpenVMS long-distance cluster capabilities
- Phase I POC: LAN failover (failSAFE IP) with Oracle RAC, both locally and over 100 km (2004)
- Ambassadors and Bootcamp attendees provided the Phase II requirements:
- Separate VMS nodes, RAC instances, clients, and disks in a truly distributed 2-node cluster
- DT environment includes on-disk copies using Volume Shadowing
- Consider cost issues for the networking infrastructure
28 Project Business Goals
- Provide proof of concept for RAC over a stretched VMS cluster
- Provide data on VMS and Oracle behavior in stretched configurations
- Raise visibility of BCS platform HA and DT differentiators working in conjunction with Oracle
- Provide a re-usable environment for customers or partners to test their own applications and distance requirements
- Serve as a model POC for other operating systems and database products
- Prepare for data security requirements
- Deliverable: technical sales collateral
29 Project Technical Objective
- Observe and record the behavior of Oracle 9i RAC Server used with OpenVMS shadow sets on a 2-node clustered OpenVMS system, with various combinations of FULL, COPY and MERGE shadow-set status
30 Partner Involvement
- LightSand
- Engineer, Network Switch: Alex Goral
- Digital Networks
- Networking: Dennis Majikas
31 Configuration
- 2-node GS1280 cluster, 8 CPUs, 8 GB memory
- EVA3000 and EVA8000
- RAID1 volumes
- OpenVMS 7.3-2, latest patches (esp. HBMM) required
- Oracle 9.2.0.5
- Swingbench load generator
- TCP/IP Services 5.4, ECO 5
- LightSand switches allow SCS traffic over IP
32(No Transcript)
33 Testing
- Introduce delays in data transmission to emulate long distances between nodes: campus/metro (50-100 km), regional (500 km), and extreme (1000 km)
- Load generation: 300 remote clients, 600k transactions of typical database functions
- Observe behavior of RAC and shadow-set operations with combinations of local and remotely served volumes
- Data collection:
- T4 (plus OTLT and VEVAMON collectors)
- Disk-related DCL commands
- Transaction output from load generation
34 Testing (cont.)
- Test variations:
- RAC:
- Active-Active
- Active-Passive
- Distances between nodes, incl. 0 (single data center)
- Shadow sets
- Network transfer: compressed/non-compressed
- Network bandwidth: OC3, Gigabit, OC12?
35 Results to Date
- Datapoints have been collected for Active-Active RAC configurations run with the following delays:
- 0 ms: local cluster
- 1 ms: 200 km (124 mi)
- 3 ms: 600 km (372 mi)
- 5 ms: 1000 km (621 mi)
- 10 ms: 2000 km (1242 mi)
36 Mitigating the Impact of Distance
- Do local lock requests rather than remote
- Avoid lock directory lookups between sites
- Avoid SCS and MSCP credit waits
- Avoid remote shadowset reads (/SITE and /READ_COST; a DCL sketch follows this slide)
- Minimize round trips between sites
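A minimal sketch of steering shadowset reads to the local member using those settings (the device names, site numbers, and read cost below are hypothetical and would be adjusted to the actual configuration):
$! On a node at site 1: declare where the virtual unit and each member live,
$! then make the remote member expensive to read
$ SET DEVICE DSA1: /SITE=1                  ! shadowset virtual unit, as seen from site 1
$ SET DEVICE $1$DGA101: /SITE=1             ! member at site 1 (local)
$ SET DEVICE $1$DGA201: /SITE=2             ! member at site 2 (remote)
$ SET DEVICE $1$DGA201: /READ_COST=1000     ! bias reads away from the remote member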
37 Mitigating Impact of Inter-Site Latency
- Locking:
- Try to avoid lock requests to a master node at the remote site
- OpenVMS does move mastership of a resource tree to the node with the most activity
- How applications are distributed across the cluster can affect local vs. remote locking
- But this represents a trade-off among performance, availability, and resource utilization
38 Application Scheme 1: Hot Primary/Cold Standby
- All applications normally run at the primary site
- The second site is idle, except for volume shadowing, until the primary site fails; then it takes over processing
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
- Wastes computing capacity at the remote site
39 Application Scheme 2: Hot/Hot but Alternate Workloads
- All applications normally run at one site or the other, but not both; data is shadowed between sites, and the opposite site takes over upon a failure
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
- The second site's computing capacity is actively used
40 Application Scheme 3: Uniform Workload Across Sites
- All applications normally run at both sites simultaneously; the surviving site takes all load upon failure
- Performance may be impacted (some remote locking) if the inter-site distance is large
- Fail-over time will be excellent, and risk low (standby systems are already in use running the same applications, thus constantly being tested)
- Both sites' computing capacity is actively used
41 Mitigating Impact of Inter-Site Latency
- Lock directory lookups:
- Lock directory lookups with the directory node at the remote site can only be avoided by setting LOCKDIRWT to zero on all nodes at the remote site (a sketch follows this slide)
- This is typically only satisfactory for Primary/Backup or remote-shadowing-only clusters
- For cases where applications create new locks and free them instead of converting to/from Null mode:
- Create a program to take out a Null lock on the root resources and simply hold those locks forever
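A minimal sketch of the LOCKDIRWT change, assuming the usual MODPARAMS.DAT and AUTOGEN workflow (LOCKDIRWT is not a dynamic parameter, so a reboot is needed):
$! On each node at the remote site, add this line to SYS$SYSTEM:MODPARAMS.DAT:
$!     LOCKDIRWT = 0
$! then regenerate and set the parameters:
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK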
42 Mitigating Impact of Inter-Site Latency
- SCS credit waits:
- Use SHOW CLUSTER/CONTINUOUS with ADD CONNECTIONS, ADD REM_PROC_NAME and ADD CR_WAITS to check for SCS credit waits (a sketch of this check follows this slide). If counts are present and increasing over time, increase the SCS credits at the remote end as shown on the next slide
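A sketch of that check (the ADD commands are typed at the SHOW CLUSTER command prompt; watch the CR_WAITS column over time):
$ SHOW CLUSTER/CONTINUOUS
Command> ADD CONNECTIONS       ! one line per SCS connection (SYSAP pair)
Command> ADD REM_PROC_NAME     ! name of the remote SYSAP on each connection
Command> ADD CR_WAITS          ! number of times a send had to wait for a credit
Command> EXIT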
43 SCS Flow Control
- How to alleviate SCS credit waits (a SYSGEN sketch follows this slide):
- For waits on the VMS$VAXcluster SYSAP on another OpenVMS node:
- Raise SYSGEN parameter CLUSTER_CREDITS
- Default is 10; maximum is 128
- For waits on the VMS$DISK_CL_DRVR SYSAP to the MSCP$DISK SYSAP on another OpenVMS node:
- Raise SYSGEN parameter MSCP_CREDITS on the serving node
- Default is 8; maximum is 128
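A minimal sketch of raising those parameters with SYSGEN (the values 32 and 16 are purely illustrative; it is also worth adding the same lines to MODPARAMS.DAT so AUTOGEN preserves them):
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT                ! start from the current parameter settings
SYSGEN> SET CLUSTER_CREDITS 32     ! credits for the VMS$VAXcluster SYSAP
SYSGEN> SET MSCP_CREDITS 16        ! credits offered by the MSCP server on this node
SYSGEN> WRITE CURRENT              ! stored for the next boot
SYSGEN> EXIT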
44 Minimizing Round Trips Between Sites
- MSCP-served reads take one round trip; writes take two
- Fibre Channel SCSI-3 protocol tricks exist to do writes in one round trip
- e.g. Cisco's Write Acceleration
45 Lab Cluster Configuration
46 [Diagram: lab cluster configuration. GS1280 and XP1000 nodes at each of two sites; LAN switches carrying SCS and Cisco routers carrying IP between the sites; LightSand gateways and FC switches providing the Fibre Channel path; a Shunra delay box simulating distance; MSA1000 storage at one site and HSG80 storage at the other]
47 [Same lab cluster diagram, highlighting one SCS path between the sites]
48 [Same lab cluster diagram, highlighting the second SCS path between the sites]
49 [Same lab cluster diagram, highlighting the Fibre Channel path between the sites]
50 Interactive Demo: Long-Distance Cluster Considerations
- Demonstration of how to measure lock-request latency
- LOCKTIME.COM tool
- Demonstration of measuring local and remote disk I/O latency
- DISKBLOCK tool
51 Data Replication
- Providing and maintaining redundant copies of data is obviously extremely important in a disaster-tolerant cluster
- Options for data replication between sites:
- Host-Based Volume Shadowing software
- Continuous Access
- Database replication
- Middleware (e.g. Reliable Transaction Router software)
52 Data Replication
- Synchronizing data can consume significant
inter-site bandwidth and time
53 Host-Based Volume Shadowing
- Host software keeps multiple disks identical (a mount example follows this slide)
- All writes go to all shadowset members
- Reads can be directed to any one member
- Different read operations can go to different members at once, helping throughput
- Synchronization (or re-synchronization after a failure) is done with a Copy operation
- Re-synchronization after a node failure is done with a Merge operation
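For reference, a minimal sketch of creating a two-member shadowset from DCL (the volume label and device names are hypothetical; in a multi-site cluster one member would normally sit at each site):
$! Mount shadowset virtual unit DSA1: with one member from each site
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $1$DGA201:) RACDATA
$ SHOW DEVICE DSA1: /FULL          ! shows the members and any copy/merge in progress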
54 Fibre Channel and SCSI in Clusters
- Fibre Channel and SCSI are storage-only interconnects
- Provide access to storage devices and controllers
- Cannot carry the SCS protocol (e.g. Connection Manager and Lock Manager traffic)
- An SCS-capable cluster interconnect is also needed:
- Memory Channel, Computer Interconnect (CI), DSSI, FDDI, Ethernet, or Galaxy Shared Memory Cluster Interconnect (SMCI)
- Fail-over between a direct path and an MSCP-served path is first supported in OpenVMS version 7.3-1
55 Host-Based Volume Shadowing and StorageWorks Continuous Access
- Fibre Channel introduces new capabilities into OpenVMS disaster-tolerant clusters
56 Continuous Access
[Diagram: a node at each site connected by an inter-site SCS link; FC switches at the two sites joined by an inter-site FC link; an EVA at each site holding a controller-based mirrorset]
57 Continuous Access
[Diagram: writes from the nodes go to the EVA controller in charge of the mirrorset, which then writes the data to the EVA at the other site]
58 Continuous Access
[Diagram: I/O from the nodes is directed to the EVA controller in charge of the mirrorset]
59 Continuous Access
[Diagram: nodes must now switch to access data through the controller at the other site]
60 Host-Based Volume Shadowing
[Diagram: a node at each site connected by an SCS-capable interconnect; FC switches and an EVA at each site, with the two EVAs' volumes forming a host-based shadowset]
61 Host-Based Volume Shadowing: Reads from Local Member
[Diagram: each node reads from the shadowset member at its own site]
62 Interactive Demo: Long-Distance Cluster Considerations
- Demonstration of the effect of the Volume Shadowing SITE and READ_COST settings on member selection for read operations
63 Host-Based Volume Shadowing: Writes to All Members
[Diagram: a write from a node is sent to the shadowset members at both sites]
64 Host-Based Volume Shadowing with Inter-Site Fibre Channel Link
[Diagram: nodes connected by an SCS-capable interconnect; FC switches at the two sites joined by an inter-site FC link; an EVA at each site forming a host-based shadowset]
65 Direct vs. MSCP-Served Paths
[Diagram: with both an SCS-capable interconnect and an inter-site FC link, each node can reach the remote shadowset member either directly over Fibre Channel or MSCP-served through a node at the other site]
66 Direct vs. MSCP-Served Paths
[Same diagram as the previous slide, continuing the direct vs. MSCP-served path comparison]
67 Cross-site Shadowed System Disk
- With only an SCS link between sites, it was impractical to have a shadowed system disk and boot nodes from it at multiple sites
- With a Fibre Channel inter-site link, it becomes possible to do this
- But it is probably still not a good idea (single point of failure for the cluster)
68 Host-Based Volume Shadowing with Inter-Site Fibre Channel Link
[Diagram: same configuration as before, with the SCS-capable interconnect, the inter-site FC link, and an EVA at each site forming a host-based shadowset]
69 New Failure Scenarios: SCS link OK but FC link broken
(Direct-to-MSCP-served path failover provides protection)
[Diagram: the host-based shadowset configuration with the inter-site FC link broken while the SCS-capable interconnect remains up]
70 New Failure Scenarios: SCS link broken but FC link OK
(Quorum scheme provides protection)
[Diagram: the host-based shadowset configuration with the SCS-capable interconnect broken while the inter-site FC link remains up]
71 Interactive Demo: MSCP-Served vs. Direct Fibre Channel
- Introduction of an inter-site SAN link is demonstrated; disk access fails over from the MSCP-served path over the LAN link to the direct Fibre Channel path (over LightSand boxes)
72 Dennis Majikas
73 Keith Parris
- Case studies:
- Manhattan Municipal Credit Union
- Another Credit Union
- New York Clearing Houses
74 Questions?
75 Speaker Contact Info
- Keith Parris
- E-mail: Keith.Parris@hp.com or keithparris@yahoo.com
- Web: http://www2.openvms.org/kparris/
76(No Transcript)
77get connected
People. Training. Technology.
78(No Transcript)
79(No Transcript)