Title: Keith Parris, Systems/Software Engineer, HP
1. Long-Distance Disaster Tolerance Technology, Challenges, State of the Art, and Directions
- Keith Parris, Systems/Software Engineer, HP
- Session 1520
2. Topics
- Terminology
- Disaster Recovery vs. Disaster Tolerance
- Metrics
- Basic technologies and state of the art
- Historical Context
- Trends
- Challenges
- Promising areas for future directions
3. High Availability (HA)
- Ability for application processing to continue with high probability in the face of common (mostly hardware) failures
- Typical technique: redundancy
4. Fault Tolerance (FT)
- Ability for a computer system to continue operating despite hardware and/or software failures
- Typically requires:
- Special hardware with full redundancy, error-checking, and hot-swap support
- Special software
- Provides the highest availability possible within a single datacenter
5. Disaster Recovery (DR)
- Disaster Recovery is the ability to resume operations after a disaster
- Foundation: off-site data storage of some sort
- Typically:
- There is some delay before operations can continue (many hours, possibly days), and
- Some transaction data may have been lost from IT systems and must be re-entered
6. DR Methods
- Tape Backup
- Expedited hardware replacement
- Vendor Recovery Site
- Data Vaulting
- Hot Site
7. Disaster Tolerance vs. Disaster Recovery
- Disaster Recovery is the ability to resume operations after a disaster.
- Disaster Tolerance is the ability to continue operations uninterrupted despite a disaster.
8. Disaster Tolerance Ideals
- Ideally, Disaster Tolerance allows one to continue operations uninterrupted despite a disaster:
- Without any appreciable delays
- Without any lost transaction data
9. Quantifying Disaster Tolerance and Disaster Recovery Requirements
- Commonly-used metrics
- Recovery Point Objective (RPO)
- Amount of data loss that is acceptable, if any
- Recovery Time Objective (RTO)
- Amount of downtime that is acceptable, if any
10. Recovery Point Objective (RPO)
- Recovery Point Objective is measured in terms of time
- RPO indicates the point in time to which one is able to recover the data after a failure, relative to the time of the failure itself
- RPO effectively quantifies the amount of data loss permissible before the business is adversely affected
[Figure: timeline showing the Recovery Point Objective as the interval from the last backup to the time of the disaster]
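RPO can be made concrete with a small calculation. The sketch below is a hypothetical illustration in Python (the timestamps and the 15-minute objective are assumptions, not figures from this presentation): the potential data loss is simply the interval from the last recoverable copy of the data to the moment of the disaster, which is why a nightly-backup scheme implies an RPO of up to 24 hours.

```python
# Hypothetical illustration of RPO as a data-loss window; all values are assumptions.
from datetime import datetime, timedelta

last_good_copy = datetime(2005, 6, 1, 2, 0)    # last nightly backup finished at 02:00
disaster = datetime(2005, 6, 1, 14, 30)        # site lost at 14:30 the same day

data_loss_window = disaster - last_good_copy
print(f"Up to {data_loss_window} of transactions could be lost")   # 12:30:00

rpo = timedelta(minutes=15)                    # an assumed 15-minute RPO requirement
print("RPO met?", data_loss_window <= rpo)     # False: nightly backup cannot meet it
```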
11. Recovery Time Objective (RTO)
- Recovery Time Objective is also measured in terms of time
- Measures downtime:
- from time of disaster until business can continue
- Downtime costs vary with the nature of the business, and with outage length
[Figure: timeline showing the Recovery Time Objective as the interval from the disaster until business resumes]
12. Disaster Tolerance vs. Disaster Recovery, based on RPO and RTO Metrics
[Figure: chart with Recovery Time Objective (increasing downtime) on one axis and Recovery Point Objective (increasing data loss) on the other; Disaster Tolerance sits at or near zero on both, while Disaster Recovery occupies the region of increasing data loss and downtime]
13. Historical Context
- 1993 World Trade Center bombing raised awareness of DR and prompted some improvements
- Sept. 11, 2001 has had dramatic and far-reaching effects:
- Scramble to find replacement office space
- Many datacenters moved off Manhattan Island, some out of NYC entirely
- Increased distances to DR sites
- Induced regulatory responses (in USA and abroad)
14. Trends and Driving Forces
- BC, DR, and DT in a post-9/11 world
- Recognition of greater risk to datacenters
- Particularly in major metropolitan areas
- Push toward greater distances between redundant datacenters
- It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
15. Trends and Driving Forces
- "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
- http://www.sec.gov/news/studies/34-47638.htm
- Agencies involved:
- Federal Reserve System
- Department of the Treasury
- Securities and Exchange Commission (SEC)
- Applies to:
- Financial institutions critical to the US economy
16. Draft Interagency White Paper
- The early concept release inviting input made mention of a 200-300 mile limit, but only as part of an example when asking for feedback as to whether any minimum distance value should be specified at all:
- "Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200-300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?"
17. Draft Interagency White Paper
- This induced panic in several quarters
- NYC feared additional economic damage from companies moving out
- Some pointed out the technology limitations of some synchronous mirroring products and of Fibre Channel at the time, which typically limited them to a distance of 100 miles or 100 km
- Revised draft contained no specific distance numbers, just cautionary wording
- Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away
18. Draft Interagency White Paper
- "Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives."
- "Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location."
19. Draft Interagency White Paper
- "Organizations should establish back-up facilities a significant distance away from their primary sites."
- "The agencies expect that, as technology and business processes continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites."
20. Ripple effect of Regulatory Activity within the USA
- National Association of Securities Dealers (NASD)
- Rules 3510 and 3520
- New York Stock Exchange (NYSE)
- Rule 446
21. Regulatory Activity Outside the USA
- United Kingdom Financial Services Authority
- Consultation Paper 142: Operational Risk and Systems Control
- Europe:
- Basel II Accord
- Australian Prudential Regulation Authority:
- Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1
- Monetary Authority of Singapore (MAS):
- Guidelines on Risk Management Practices - Business Continuity Management, affecting Significantly Important Institutions (SIIs)
22. Resiliency Maturity Model project
- The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model
- Taking inspiration from the Carnegie Mellon Software Engineering Institute's Capability Maturity Model (CMM) and Networked Systems Survivability Program
- Intent is to develop industry-standard metrics to evaluate an institution's business continuity, disaster recovery, and crisis management capabilities
23. Technologies
- Inter-site data replication
- Clustering for availability
24. Data Replication Technologies
- Hardware
- Mirroring between disk subsystems
- Software
- Host-based mirroring software
- Database replication or log-shipping
- Middleware or transaction processing monitor with
replication functionality (e.g. HP Reliable
Transaction Router)
25. Data Replication in Hardware
- HP StorageWorks Continuous Access (CA)
- EMC Symmetrix Remote Data Facility (SRDF)
26. Data Replication in Software
- Host software disk mirroring or shadowing
- Volume Shadowing Software for OpenVMS
- MirrorDisk/UX for HP-UX
- Veritas VxVM with Volume Replicator extensions
for UNIX and Windows
27. Data Replication in Software
- Database replication or log-shipping
- Replication within the database software
- Remote Database Facility (RDF) on NonStop
- Oracle DataGuard (Oracle Standby Database)
- Database backups plus Log Shipping
28. Data Replication in Software
- TP Monitor/Transaction Router
- e.g. HP Reliable Transaction Router (RTR)
Software on OpenVMS, UNIX, Linux, and Windows
29. Data Replication in Hardware
- Data mirroring schemes:
- Synchronous:
- Slower, but no chance of data loss in conjunction with a site loss
- Asynchronous:
- Faster, and works for longer distances
- but can lose seconds' or minutes' worth of data (more under high loads) in a site disaster
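The practical difference between the two schemes is when the application's write is acknowledged. The Python sketch below is only a conceptual model (the function names are invented for illustration, not any vendor's API): synchronous mirroring acknowledges a write only after both sites hold it, so every write waits out the inter-site round trip; asynchronous mirroring acknowledges after the local write and ships the update later, so whatever is still queued is lost if the site is destroyed.

```python
import queue
import time

INTER_SITE_RTT = 0.030          # assumed 30 ms round trip to the remote site

def write_local(block):
    pass                        # stand-in for the local disk write

def write_remote(block):
    time.sleep(INTER_SITE_RTT)  # remote write plus acknowledgement crosses the link

def synchronous_write(block):
    """Acknowledge only after BOTH copies exist: no data loss on site loss,
    but every write pays the full inter-site round trip."""
    write_local(block)
    write_remote(block)
    return "acknowledged"

pending = queue.Queue()         # updates not yet applied at the remote site

def asynchronous_write(block):
    """Acknowledge after the LOCAL copy exists: fast and distance-tolerant,
    but anything still in 'pending' is lost in a site disaster."""
    write_local(block)
    pending.put(block)          # shipped to the remote site later, in the background
    return "acknowledged"
```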
30. Basic underlying challenges, and technologies to address them
- Data protection through data replication
- Geographic separation for the sake of relative safety
- Careful site selection
- Application coordination
- Long-distance multi-site clustering
- Inter-site link technology choices
- Inter-site link bandwidth and media types available vary widely with location
- Cost can be very high in some cases
- Inter-site latency due to the speed of light
- And its adverse impact on performance
31. Inter-site Link Options
- Sites linked by:
- DS-3/T3 (E3 in Europe) or ATM circuits from a telecommunications vendor
- Microwave link: DS-3/T3 or Ethernet
- Free-Space Optics link (short distance, low cost)
- Dark fiber where available, carrying:
- ATM over SONET, or
- Ethernet over fiber (10 Mb, Fast, Gigabit)
- FDDI
- Fibre Channel
- Fiber links between Memory Channel switches (up to 3 km)
32. Inter-site Link Options
- Sites linked by:
- Wave Division Multiplexing (WDM), in either Coarse (CWDM) or Dense (DWDM) flavors
- Can carry any of the types of traffic that can run over a single fiber
- Individual WDM channel(s) from a vendor, rather than entire dark fibers
33. Bandwidth of Inter-Site Link(s)
34. Long-distance Cluster Issues
- Latency due to the speed of light becomes significant at longer distances. Rules of thumb (see the sketch below):
- About 1 ms per 125 miles, one-way, or
- About 1 ms per 62 miles, round-trip latency
- Actual circuit path length can be longer than highway mileage between sites
- Latency adversely affects performance
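Those rules of thumb follow directly from the speed of light in fiber, roughly 1 ms per 125 miles one way (about 5 microseconds per kilometer). A minimal sketch using only that approximation; real circuits add routing distance and equipment delays on top:

```python
# Speed-of-light component of inter-site latency (rule of thumb from this slide).
MS_PER_MILE_ONE_WAY = 1.0 / 125          # about 1 ms per 125 miles, one-way

def round_trip_ms(miles: float) -> float:
    return 2 * miles * MS_PER_MILE_ONE_WAY

for miles in (100, 500, 1000, 1500):
    print(f"{miles:5d} miles: ~{round_trip_ms(miles):4.1f} ms round trip")
# 100 -> ~1.6 ms, 500 -> ~8 ms, 1000 -> ~16 ms, 1500 -> ~24 ms
```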
35. Round-trip Packet Latencies
36. Inter-site Latency: Actual Customer Measurements
37. Differentiate between latency and bandwidth
- Can't get around the speed of light and its latency effects over long distances
- A higher-bandwidth link doesn't mean lower latency
38. SAN Extension
- Fibre Channel distance over fiber is limited to about 100-200 kilometers
- Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers; some vendors provide more credits for a price (see the rough sizing sketch below)
- Various vendors provide SAN Extension boxes to connect Fibre Channel SANs over an inter-site link like SONET, DS-3, ATM, Gigabit Ethernet, an IP network, etc.
- See SAN Design Reference Guide Vol. 4: SAN extension and bridging
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
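The buffer-to-buffer credit limitation can be reasoned about with a simple pipe-fill model. The sketch below is only a rough approximation under stated assumptions (full-size ~2 KB frames, ~5 microseconds/km propagation in fiber, a credit returned only when the R_RDY arrives back), not any vendor's sizing formula: to keep a long link streaming, the sender needs roughly round-trip time divided by frame transmission time credits outstanding.

```python
# Rough pipe-fill model for buffer-to-buffer credits on a long Fibre Channel link.
# Assumptions: full-size frames (~2148 bytes including headers), ~5 us/km in fiber.
FRAME_BITS = 2148 * 8
PROP_US_PER_KM = 5.0

def credits_needed(distance_km: float, gbps: float) -> float:
    frame_time_us = FRAME_BITS / (gbps * 1000)    # microseconds to serialize one frame
    rtt_us = 2 * distance_km * PROP_US_PER_KM     # frame out, R_RDY back
    return rtt_us / frame_time_us                 # frames that must be in flight

for km in (10, 50, 100):
    print(f"{km:4d} km at 2 Gb/s: ~{credits_needed(km, 2.125):.0f} credits to keep the link full")
# ~12 at 10 km, ~62 at 50 km, ~124 at 100 km -- more than many switch ports offer by
# default, which is why performance falls off with distance unless extra credits are bought.
```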
39. Long-distance Synchronous Host-based Mirroring Software Tests
- OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)
- Synchronous mirroring product
- SAN Extension used to extend the SAN using FCIP boxes
- AdTech box used to simulate distance via introduced packet latency
- No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end, just data vaulting to a distant disk controller)
40. Long-distance HBVS Test Results
41. Mitigating the Impact of Distance
- Do transactions as much as possible in parallel rather than serially
- May have to find and eliminate hidden serialization points in applications and system software
- Minimize the number of round trips between sites needed to complete a transaction
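The payoff from parallelism inside a transaction is easy to quantify. A minimal sketch under assumed numbers (a 30 ms round trip and five inter-site operations per transaction): issued serially the round trips simply add up, while independent operations issued concurrently overlap and the elapsed time collapses toward a single round trip.

```python
# Serial vs. parallel inter-site operations within one transaction (numbers assumed).
RTT_MS = 30                    # assumed inter-site round trip
OPS_PER_TRANSACTION = 5        # assumed remote operations needed per transaction

serial_ms = OPS_PER_TRANSACTION * RTT_MS   # each operation waits for the previous one
parallel_ms = RTT_MS                       # independent operations overlap on the link

print(f"serial:   {serial_ms} ms per transaction")     # 150 ms
print(f"parallel: ~{parallel_ms} ms per transaction")  # ~30 ms, if nothing hidden serializes them
```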
42. Minimizing Round Trips Between Sites
- Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2
- e.g. Cisco's Write Acceleration
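The benefit is simply one fewer inter-site round trip per write: a remote SCSI write normally needs two (command and transfer-ready, then data and status), and acknowledging the transfer-ready locally reduces that to one. A tiny worked example with an assumed 16 ms round trip:

```python
# Effect of reducing a remote write from 2 round trips to 1 (16 ms RTT assumed).
RTT_MS = 16
print("standard remote write (2 round trips):", 2 * RTT_MS, "ms")    # 32 ms
print("accelerated remote write (1 round trip):", 1 * RTT_MS, "ms")  # 16 ms
```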
43. Mitigating Impact of Inter-Site Latency
- How applications are distributed across a multi-site cluster can affect performance when a distributed lock manager is involved (e.g. Oracle RAC, OpenVMS Cluster, or TruCluster)
- But this represents a trade-off among performance, availability, and resource utilization
44. Application Scheme 1: Hot Primary/Cold Standby
- All applications normally run at the primary site
- Second site is idle, except for volume shadowing, until the primary site fails; then it takes over processing
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
- Wastes computing capacity at the remote site
45. Application Scheme 2: Hot/Hot but Alternate Workloads
- All applications normally run at one site or the other, but not both; data is shadowed between sites, and the opposite site takes over upon a failure
- Performance will be good (all-local locking)
- Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
- Second site's computing capacity is actively used
46. Application Scheme 3: Uniform Workload Across Sites
- All applications normally run at both sites simultaneously; the surviving site takes all load upon failure
- Performance may be impacted (some remote locking) if the inter-site distance is large
- Fail-over time will be excellent, and risk low (standby systems are already in use running the same applications, thus constantly being tested)
- Both sites' computing capacity is actively used
47. Work-arounds being used today
- Multi-hop replication
- Synchronous to nearby site
- Asynchronous to far-away site
48. Promising areas for investigation
- Replicate at higher levels to reduce round trips and replication volumes
- e.g. replicate the transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database
- or replicate database transactions instead of database disk writes
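The difference in replication volume can be estimated with a back-of-the-envelope comparison. In the sketch below, the transaction-message and page sizes come from the slide above, but the number of pages touched and the journal record size are assumptions chosen only for illustration:

```python
# Approximate bytes crossing the inter-site link per transaction (illustrative).
txn_message = 300                   # replicate the transaction itself: a few hundred bytes

pages_touched = 3                   # assumed: a handful of data/index pages updated
page_size = 8 * 1024                # 8 KB database pages (some databases use 64 KB)
journal_write = 2 * 1024            # assumed size of the journal/log record

page_level = pages_touched * page_size + journal_write
print(f"transaction-level replication: {txn_message:6d} bytes")
print(f"page/log-level replication:    {page_level:6d} bytes  (~{page_level / txn_message:.0f}x more)")
# Roughly 26 KB vs. 300 bytes with these assumptions -- nearly two orders of magnitude.
```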
49. Promising areas for investigation
- Parallelism as a potential solution
- Rationale:
- Adding 30 milliseconds to a typical transaction for a human may not be noticeable
- Having to wait for many 30-millisecond transactions in front of yours slows things down
- Applications in the future may have to be built with greater awareness of inter-site latency
- Promising direction:
- Allow many more transactions to be in flight in parallel
- Each will take longer, but overall throughput (transaction rate) might be the same or even higher than now
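The arithmetic behind this direction is essentially Little's Law: sustained throughput equals the number of transactions in flight divided by the time each one takes. A minimal sketch with assumed numbers shows that opening up more concurrency can restore the transaction rate even though each individual transaction is slower:

```python
# Little's Law view of parallelism: throughput = concurrency / response time (numbers assumed).
def throughput_per_sec(in_flight: int, txn_time_ms: float) -> float:
    return in_flight * 1000.0 / txn_time_ms

local_only   = throughput_per_sec(in_flight=10, txn_time_ms=20)        # 500 txn/s
long_haul    = throughput_per_sec(in_flight=10, txn_time_ms=20 + 30)   # inter-site RTT adds 30 ms
more_streams = throughput_per_sec(in_flight=25, txn_time_ms=20 + 30)   # allow more in flight

print(local_only, long_haul, more_streams)   # 500.0, 200.0, 500.0 transactions/second
```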
50. Useful Resources
- Tabb Research report:
- "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"
- available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcclanding_pagejumpidex_R2548_promo/fsipaper_mcc7Clanding_page
51. Useful Resources
- Disaster Recovery Journal
- http://www.drj.com/
- Continuity Insights Magazine
- http://www.continuityinsights.com/
- Contingency Planning & Management Magazine
- http://www.contingencyplanning.com/
- All are high-quality journals. The first two are available free to qualified subscribers
- All hold conferences as well.
52. Keith Parris
- Case studies
- a large Credit Union
- New York clearing firms
53. Questions?
54. Speaker Contact Info
- Keith Parris
- E-mail: Keith.Parris@hp.com or keithparris@yahoo.com
- Web: http://www2.openvms.org/kparris/
56. get connected
People. Training. Technology.