Title: Keith Parris, Systems/Software Engineer, HP
1. Long-Distance Disaster Tolerance Technology, Challenges, State of the Art, and Directions
- Keith Parris, Systems/Software Engineer, HP
- Session 1520
2. Topics
- Terminology
- Disaster Recovery vs. Disaster Tolerance
- Metrics
- Basic technologies and state of the art
- Historical Context
- Trends
- Challenges
- Promising areas for future directions
3. High Availability (HA)
- Ability for application processing to continue with high probability in the face of common (mostly hardware) failures
- Typical technique: redundancy
4. Fault Tolerance (FT)
- Ability for a computer system to continue operating despite hardware and/or software failures
- Typically requires:
- Special hardware with full redundancy, error-checking, and hot-swap support
- Special software
- Provides the highest availability possible within a single datacenter
5. Disaster Recovery (DR)
- Disaster Recovery is the ability to resume operations after a disaster
- Foundation: off-site data storage of some sort
- Typically:
- There is some delay before operations can continue (many hours, possibly days), and
- Some transaction data may have been lost from IT systems and must be re-entered
6. DR Methods
- Tape Backup
- Expedited hardware replacement
- Vendor Recovery Site
- Data Vaulting
- Hot Site
7. Disaster Tolerance vs. Disaster Recovery
- Disaster Recovery is the ability to resume operations after a disaster.
- Disaster Tolerance is the ability to continue operations uninterrupted despite a disaster.
8. Disaster Tolerance Ideals
- Ideally, Disaster Tolerance allows one to continue operations uninterrupted despite a disaster:
- Without any appreciable delays
- Without any lost transaction data
9. Quantifying Disaster Tolerance and Disaster Recovery Requirements
- Commonly-used metrics
- Recovery Point Objective (RPO)
- Amount of data loss that is acceptable, if any
- Recovery Time Objective (RTO)
- Amount of downtime that is acceptable, if any
10. Recovery Point Objective (RPO)
- Recovery Point Objective is measured in terms of time
- RPO indicates the point in time to which one is able to recover the data after a failure, relative to the time of the failure itself
- RPO effectively quantifies the amount of data loss permissible before the business is adversely affected
[Figure: timeline showing the Recovery Point Objective as the interval from the last backup to the time of the disaster]
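RPO can be made concrete with a small calculation. The sketch below is a hypothetical illustration in Python (the timestamps and the 15-minute objective are assumptions, not figures from this presentation): the potential data loss is simply the interval from the last recoverable copy of the data to the moment of the disaster, which is why a nightly-backup scheme implies an RPO of up to 24 hours.

```python
# Hypothetical illustration of RPO as a data-loss window; all values are assumptions.
from datetime import datetime, timedelta

last_good_copy = datetime(2005, 6, 1, 2, 0)    # last nightly backup finished at 02:00
disaster = datetime(2005, 6, 1, 14, 30)        # site lost at 14:30 the same day

data_loss_window = disaster - last_good_copy
print(f"Up to {data_loss_window} of transactions could be lost")   # 12:30:00

rpo = timedelta(minutes=15)                    # an assumed 15-minute RPO requirement
print("RPO met?", data_loss_window <= rpo)     # False: nightly backup cannot meet it
```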
11. Recovery Time Objective (RTO)
- Recovery Time Objective is also measured in terms of time
- Measures downtime:
- from time of disaster until business can continue
- Downtime costs vary with the nature of the business, and with outage length
[Figure: timeline showing the Recovery Time Objective as the interval from the disaster until business resumes]
12. Disaster Tolerance vs. Disaster Recovery, based on RPO and RTO Metrics
[Figure: chart with Recovery Time Objective (increasing downtime) on one axis and Recovery Point Objective (increasing data loss) on the other; Disaster Tolerance sits at or near zero on both, while Disaster Recovery occupies the region of increasing data loss and downtime]
13. Historical Context
- 1993 World Trade Center bombing raised awareness of DR and prompted some improvements
- Sept. 11, 2001 has had dramatic and far-reaching effects:
- Scramble to find replacement office space
- Many datacenters moved off Manhattan Island, some out of NYC entirely
- Increased distances to DR sites
- Induced regulatory responses (in USA and abroad)
14. Trends and Driving Forces
- BC, DR, and DT in a post-9/11 world
- Recognition of greater risk to datacenters
- Particularly in major metropolitan areas
- Push toward greater distances between redundant datacenters
- It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
15. Trends and Driving Forces
- "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
- http://www.sec.gov/news/studies/34-47638.htm
- Agencies involved:
- Federal Reserve System
- Department of the Treasury
- Securities and Exchange Commission (SEC)
- Applies to:
- Financial institutions critical to the US economy
16. Draft Interagency White Paper
- The early concept release inviting input made mention of a 200-300 mile limit, but only as part of an example when asking for feedback as to whether any minimum distance value should be specified at all:
- "Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200-300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?"
17. Draft Interagency White Paper
- This induced panic in several quarters
- NYC feared additional economic damage from companies moving out
- Some pointed out the technology limitations of some synchronous mirroring products and of Fibre Channel at the time, which typically limited them to a distance of 100 miles or 100 km
- Revised draft contained no specific distance numbers, just cautionary wording
- Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away
18. Draft Interagency White Paper
- "Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives."
- "Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location."
19. Draft Interagency White Paper
- "Organizations should establish back-up facilities a significant distance away from their primary sites."
- "The agencies expect that, as technology and business processes continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites."
20. Ripple effect of Regulatory Activity within the USA
- National Association of Securities Dealers (NASD)
- Rules 3510 and 3520
- New York Stock Exchange (NYSE)
- Rule 446
21. Regulatory Activity Outside the USA
- United Kingdom Financial Services Authority
- Consultation Paper 142: Operational Risk and Systems Control
- Europe:
- Basel II Accord
- Australian Prudential Regulation Authority:
- Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1
- Monetary Authority of Singapore (MAS):
- Guidelines on Risk Management Practices - Business Continuity Management, affecting Significantly Important Institutions (SIIs)
22. Resiliency Maturity Model project
- The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model
- Taking inspiration from the Carnegie Mellon Software Engineering Institute's Capability Maturity Model (CMM) and Networked Systems Survivability Program
- Intent is to develop industry-standard metrics to evaluate an institution's business continuity, disaster recovery, and crisis management capabilities
23. Technologies
- Inter-site data replication
- Clustering for availability
24. Data Replication Technologies
- Hardware
- Mirroring between disk subsystems
- Software
- Host-based mirroring software
- Database replication or log-shipping
- Middleware or transaction processing monitor with
replication functionality (e.g. HP Reliable
Transaction Router)
25. Data Replication in Hardware
- HP StorageWorks Continuous Access (CA)
- EMC Symmetrix Remote Data Facility (SRDF)
26. Data Replication in Software
- Host software disk mirroring or shadowing
- Volume Shadowing Software for OpenVMS
- MirrorDisk/UX for HP-UX
- Veritas VxVM with Volume Replicator extensions
for UNIX and Windows
27. Data Replication in Software
- Database replication or log-shipping
- Replication within the database software
- Remote Database Facility (RDF) on NonStop
- Oracle DataGuard (Oracle Standby Database)
- Database backups plus Log Shipping
28. Data Replication in Software
- TP Monitor/Transaction Router
- e.g. HP Reliable Transaction Router (RTR)
Software on OpenVMS, UNIX, Linux, and Windows
29. Data Replication in Hardware
- Data mirroring schemes:
- Synchronous:
- Slower, but no chance of data loss in conjunction with a site loss
- Asynchronous:
- Faster, and works for longer distances
- but can lose seconds' or minutes' worth of data (more under high loads) in a site disaster
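The practical difference between the two schemes is when the application's write is acknowledged. The Python sketch below is only a conceptual model (the function names are invented for illustration, not any vendor's API): synchronous mirroring acknowledges a write only after both sites hold it, so every write waits out the inter-site round trip; asynchronous mirroring acknowledges after the local write and ships the update later, so whatever is still queued is lost if the site is destroyed.

```python
import queue
import time

INTER_SITE_RTT = 0.030          # assumed 30 ms round trip to the remote site

def write_local(block):
    pass                        # stand-in for the local disk write

def write_remote(block):
    time.sleep(INTER_SITE_RTT)  # remote write plus acknowledgement crosses the link

def synchronous_write(block):
    """Acknowledge only after BOTH copies exist: no data loss on site loss,
    but every write pays the full inter-site round trip."""
    write_local(block)
    write_remote(block)
    return "acknowledged"

pending = queue.Queue()         # updates not yet applied at the remote site

def asynchronous_write(block):
    """Acknowledge after the LOCAL copy exists: fast and distance-tolerant,
    but anything still in 'pending' is lost in a site disaster."""
    write_local(block)
    pending.put(block)          # shipped to the remote site later, in the background
    return "acknowledged"
```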
30. Basic underlying challenges, and technologies to address them
- Data protection through data replication
- Geographic separation for the sake of relative safety
- Careful site selection
- Application coordination
- Long-distance multi-site clustering
- Inter-site link technology choices
- Inter-site link bandwidth and media types available vary widely with location
- Cost can be very high in some cases
- Inter-site latency due to the speed of light
- And its adverse impact on performance
31. Inter-site Link Options
- Sites linked by:
- DS-3/T3 (E3 in Europe) or ATM circuits from a telecommunications vendor
- Microwave link: DS-3/T3 or Ethernet
- Free-Space Optics link (short distance, low cost)
- Dark fiber where available, carrying:
- ATM over SONET, or
- Ethernet over fiber (10 Mb, Fast, Gigabit)
- FDDI
- Fibre Channel
- Fiber links between Memory Channel switches (up to 3 km)
32. Inter-site Link Options
- Sites linked by:
- Wave Division Multiplexing (WDM), in either Coarse (CWDM) or Dense (DWDM) flavors
- Can carry any of the types of traffic that can run over a single fiber
- Individual WDM channel(s) from a vendor, rather than entire dark fibers
33. Bandwidth of Inter-Site Link(s)
34. Long-distance Cluster Issues
- Latency due to the speed of light becomes significant at longer distances. Rules of thumb (see the sketch below):
- About 1 ms per 125 miles, one-way, or
- About 1 ms per 62 miles, round-trip latency
- Actual circuit path length can be longer than highway mileage between sites
- Latency adversely affects performance
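Those rules of thumb follow directly from the speed of light in fiber, roughly 1 ms per 125 miles one way (about 5 microseconds per kilometer). A minimal sketch using only that approximation; real circuits add routing distance and equipment delays on top:

```python
# Speed-of-light component of inter-site latency (rule of thumb from this slide).
MS_PER_MILE_ONE_WAY = 1.0 / 125          # about 1 ms per 125 miles, one-way

def round_trip_ms(miles: float) -> float:
    return 2 * miles * MS_PER_MILE_ONE_WAY

for miles in (100, 500, 1000, 1500):
    print(f"{miles:5d} miles: ~{round_trip_ms(miles):4.1f} ms round trip")
# 100 -> ~1.6 ms, 500 -> ~8 ms, 1000 -> ~16 ms, 1500 -> ~24 ms
```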
35. Round-trip Packet Latencies
36. Inter-site Latency: Actual Customer Measurements
37. Differentiate between latency and bandwidth
- Can't get around the speed of light and its latency effects over long distances
- A higher-bandwidth link doesn't mean lower latency
38. SAN Extension
- Fibre Channel distance over fiber is limited to about 100-200 kilometers
- Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers; some vendors provide more credits for a price (see the rough sizing sketch below)
- Various vendors provide SAN Extension boxes to connect Fibre Channel SANs over an inter-site link like SONET, DS-3, ATM, Gigabit Ethernet, an IP network, etc.
- See SAN Design Reference Guide Vol. 4: SAN extension and bridging
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
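The buffer-to-buffer credit limitation can be reasoned about with a simple pipe-fill model. The sketch below is only a rough approximation under stated assumptions (full-size ~2 KB frames, ~5 microseconds/km propagation in fiber, a credit returned only when the R_RDY arrives back), not any vendor's sizing formula: to keep a long link streaming, the sender needs roughly round-trip time divided by frame transmission time credits outstanding.

```python
# Rough pipe-fill model for buffer-to-buffer credits on a long Fibre Channel link.
# Assumptions: full-size frames (~2148 bytes including headers), ~5 us/km in fiber.
FRAME_BITS = 2148 * 8
PROP_US_PER_KM = 5.0

def credits_needed(distance_km: float, gbps: float) -> float:
    frame_time_us = FRAME_BITS / (gbps * 1000)    # microseconds to serialize one frame
    rtt_us = 2 * distance_km * PROP_US_PER_KM     # frame out, R_RDY back
    return rtt_us / frame_time_us                 # frames that must be in flight

for km in (10, 50, 100):
    print(f"{km:4d} km at 2 Gb/s: ~{credits_needed(km, 2.125):.0f} credits to keep the link full")
# ~12 at 10 km, ~62 at 50 km, ~124 at 100 km -- more than many switch ports offer by
# default, which is why performance falls off with distance unless extra credits are bought.
```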
39. Long-distance Synchronous Host-based Mirroring Software Tests
- OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)
- Synchronous mirroring product
- SAN Extension used to extend the SAN using FCIP boxes
- AdTech box used to simulate distance via introduced packet latency
- No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end, just data vaulting to a distant disk controller)
40. Long-distance HBVS Test Results
41. Mitigating the Impact of Distance
- Do transactions as much as possible in parallel rather than serially
- May have to find and eliminate hidden serialization points in applications and system software
- Minimize the number of round trips between sites needed to complete a transaction
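The payoff from parallelism inside a transaction is easy to quantify. A minimal sketch under assumed numbers (a 30 ms round trip and five inter-site operations per transaction): issued serially the round trips simply add up, while independent operations issued concurrently overlap and the elapsed time collapses toward a single round trip.

```python
# Serial vs. parallel inter-site operations within one transaction (numbers assumed).
RTT_MS = 30                    # assumed inter-site round trip
OPS_PER_TRANSACTION = 5        # assumed remote operations needed per transaction

serial_ms = OPS_PER_TRANSACTION * RTT_MS   # each operation waits for the previous one
parallel_ms = RTT_MS                       # independent operations overlap on the link

print(f"serial:   {serial_ms} ms per transaction")     # 150 ms
print(f"parallel: ~{parallel_ms} ms per transaction")  # ~30 ms, if nothing hidden serializes them
```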
42. Minimizing Round Trips Between Sites
- Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2
- e.g. Cisco's Write Acceleration
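The benefit is simply one fewer inter-site round trip per write: a remote SCSI write normally needs two (command and transfer-ready, then data and status), and acknowledging the transfer-ready locally reduces that to one. A tiny worked example with an assumed 16 ms round trip:

```python
# Effect of reducing a remote write from 2 round trips to 1 (16 ms RTT assumed).
RTT_MS = 16
print("standard remote write (2 round trips):", 2 * RTT_MS, "ms")    # 32 ms
print("accelerated remote write (1 round trip):", 1 * RTT_MS, "ms")  # 16 ms
```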
43. Mitigating Impact of Inter-Site Latency
- How applications are distributed across a multi-site cluster can affect performance when a distributed lock manager is involved (e.g. Oracle RAC, OpenVMS Cluster, or TruCluster)
- But this represents a trade-off among performance, availability, and resource utilization
44. Application Scheme 1: Hot Primary/Cold Standby
- All applications normally run at the primary site
- Second site is idle, except for volume shadowing, until the primary site fails; then it takes over processing
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
- Wastes computing capacity at the remote site
45. Application Scheme 2: Hot/Hot but Alternate Workloads
- All applications normally run at one site or the other, but not both; data is shadowed between sites, and the opposite site takes over upon a failure
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
- Second site's computing capacity is actively used
46. Application Scheme 3: Uniform Workload Across Sites
- All applications normally run at both sites simultaneously; the surviving site takes all load upon failure
- Performance may be impacted (some remote locking) if the inter-site distance is large
- Fail-over time will be excellent, and risk low (standby systems are already in use running the same applications, thus constantly being tested)
- Both sites' computing capacity is actively used
47. Work-arounds being used today
- Multi-hop replication
- Synchronous to nearby site
- Asynchronous to far-away site
48. Promising areas for investigation
- Replicate at higher levels to reduce round trips and replication volumes
- e.g. replicate the transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database
- or replicate database transactions instead of database disk writes
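The difference in replication volume can be estimated with a back-of-the-envelope comparison. In the sketch below, the transaction-message and page sizes come from the slide above, but the number of pages touched and the journal record size are assumptions chosen only for illustration:

```python
# Approximate bytes crossing the inter-site link per transaction (illustrative).
txn_message = 300                   # replicate the transaction itself: a few hundred bytes

pages_touched = 3                   # assumed: a handful of data/index pages updated
page_size = 8 * 1024                # 8 KB database pages (some databases use 64 KB)
journal_write = 2 * 1024            # assumed size of the journal/log record

page_level = pages_touched * page_size + journal_write
print(f"transaction-level replication: {txn_message:6d} bytes")
print(f"page/log-level replication:    {page_level:6d} bytes  (~{page_level / txn_message:.0f}x more)")
# Roughly 26 KB vs. 300 bytes with these assumptions -- nearly two orders of magnitude.
```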
49. Promising areas for investigation
- Parallelism as a potential solution
- Rationale:
- Adding 30 milliseconds to a typical transaction for a human may not be noticeable
- Having to wait for many 30-millisecond transactions in front of yours slows things down
- Applications in the future may have to be built with greater awareness of inter-site latency
- Promising direction:
- Allow many more transactions to be in flight in parallel
- Each will take longer, but overall throughput (transaction rate) might be the same or even higher than now
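The arithmetic behind this direction is essentially Little's Law: sustained throughput equals the number of transactions in flight divided by the time each one takes. A minimal sketch with assumed numbers shows that opening up more concurrency can restore the transaction rate even though each individual transaction is slower:

```python
# Little's Law view of parallelism: throughput = concurrency / response time (numbers assumed).
def throughput_per_sec(in_flight: int, txn_time_ms: float) -> float:
    return in_flight * 1000.0 / txn_time_ms

local_only   = throughput_per_sec(in_flight=10, txn_time_ms=20)        # 500 txn/s
long_haul    = throughput_per_sec(in_flight=10, txn_time_ms=20 + 30)   # inter-site RTT adds 30 ms
more_streams = throughput_per_sec(in_flight=25, txn_time_ms=20 + 30)   # allow more in flight

print(local_only, long_haul, more_streams)   # 500.0, 200.0, 500.0 transactions/second
```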
50. Useful Resources
- Tabb Research report:
- "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"
- available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcclanding_pagejumpidex_R2548_promo/fsipaper_mcc7Clanding_page
51. Useful Resources
- Disaster Recovery Journal
- http://www.drj.com/
- Continuity Insights Magazine
- http://www.continuityinsights.com/
- Contingency Planning & Management Magazine
- http://www.contingencyplanning.com/
- All are high-quality journals. The first two are available free to qualified subscribers
- All hold conferences as well.
52. Keith Parris
- Case studies
- a large Credit Union
- New York clearing firms
53. Questions?
54. Speaker Contact Info
- Keith Parris
- E-mail: Keith.Parris@hp.com or keithparris@yahoo.com
- Web: http://www2.openvms.org/kparris/
56. get connected
People. Training. Technology.