Designing High Availability Networks, Systems, and Software for the University Environment

About This Presentation

Title:

Designing High Availability Networks, Systems, and Software for the University Environment

Description:

... Rolling disaster protection Others: IP Multipathing Trunking links to servers 802.3ad, SMLT, DMLT or similar Rapid Spanning Tree (IEEE 802.1w) ...

Slides: 33

Provided by: DekeKassa

Transcript and Presenter's Notes

1
Designing High Availability Networks, Systems,
and Software for the University Environment

Deke Kassabian and Shumon Huque
The University of Pennsylvania
January 14, 2004

2
About Penn

The University of Pennsylvania was founded by Ben
Franklin in 1751
Penn is part of the Ivy League
Located in western Philadelphia
Community of more than 30,000 people

3
General Goals

Networked services available as expected by our
users
Minimized time to repair (TTR) for when outages
do occur
Ability to perform maintenance and upgrades
(planned downtime) non-disruptively
Cost effectiveness in meeting these goals

4
Definitions

Availability
High Availability (HA)
Rapid Recovery (RR)
Disaster Recovery (DR)
Basic Systems

5
Definitions

Disaster Recovery (DR) - the process of restoring
a service to full operation after an interruption
in service

6
Definitions

Basic System - a Network, System, or Service
with only the most basic of protections against
outages
Examples
A network recoverable using spare parts
A single computer system with RAID disk
A service recoverable from tape backups

7
Definitions

Availability - the percentage of total time that
a Network, System, or Service is available for use
Related points
Advertised periods of availability
Availability as advertised
Absolute availability

8
Definitions

High Availability (HA) - a Network, System, or
Service with specific design elements intended
to keep availability above a high threshold (e.g.,
99.99%)
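
To make a threshold such as 99.99% concrete, it helps to convert it into a downtime budget. The following is a minimal sketch of that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration, not part of the original slides.

```python
# Convert an availability target (in percent) into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Under these assumptions, a 99.99% target leaves a budget of roughly 52.6 minutes of downtime per year.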

9
Definitions

Rapid Recovery (RR) - a Network, System, or
Service with specific design elements intended
to recover from downtime very quickly (e.g., 15
minutes)

10
Metrics

Economics of high availability (the costs of
non-availability)
Calculating availability
How availability measurements are performed
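
One way such measurements can be performed is periodic polling: record whether the service answered each poll and report the fraction of successful polls. The sketch below assumes poll results have already been collected as booleans; the function name and the 5-minute interval in the usage note are illustrative assumptions, not a description of Penn's tooling.

```python
from typing import Sequence


def measured_availability(poll_results: Sequence[bool]) -> float:
    """Return the percentage of polls in which the service responded.

    Each entry is True if the poll succeeded and False if it failed.
    """
    if not poll_results:
        raise ValueError("need at least one poll result")
    return 100.0 * sum(poll_results) / len(poll_results)
```

For example, polling every 5 minutes yields 288 samples per day; one failed poll out of 288 gives a measured availability of about 99.65% for that day. Note that the polling interval limits resolution: an outage shorter than the interval can be missed entirely.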

11
Economics of high availability

What is the cost of an outage in your
Student Courseware systems and student record
systems
Financial systems
Primary campus web site and Email servers
DNS, DHCP and AuthN systems
Internet connection(s)
Development / Gifts systems
How much should you be willing to spend to
minimize downtime of any or all of these?

12
Calculating availability

Availability can be measured directly through
periodic polling (eg, SNMP, Mon, Nagios)
A formula for predicting availability of a single
component

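$$\text{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{TTR}} \quad \text{or} \quad \text{Availability} = 1 - \frac{\mathrm{TTR}}{\mathrm{MTBF} + \mathrm{TTR}}$$
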
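
The standard single-component estimate, availability = MTBF / (MTBF + TTR), can be sketched in a few lines. The function name and the sample figures in the usage note are assumptions chosen for illustration; both inputs must be in the same time unit.

```python
def predicted_availability(mtbf: float, ttr: float) -> float:
    """Predict availability of one component as a fraction between 0 and 1.

    mtbf: mean time between failures
    ttr:  time to repair, in the same unit as mtbf
    """
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    if ttr < 0:
        raise ValueError("TTR must not be negative")
    return mtbf / (mtbf + ttr)
```

With a hypothetical MTBF of 10,000 hours and a TTR of 4 hours, the prediction is about 0.9996 (99.96%). The equivalent form 1 - TTR / (MTBF + TTR) gives the same value and makes it clear that shortening the time to repair raises availability just as lengthening MTBF does.
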
13
Design Principles

Towards HA
Minimize points of catastrophic failure
Maximize redundancy
Minimize fault zones
Minimize complexity and cost
Applying the above principles to
Networks
Systems
Services
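
The effect of redundancy and of single points of failure can be estimated with the usual series and parallel availability rules. This is a rough sketch that assumes component failures are independent, which shared fault zones (common power, a common conduit) would violate; the function names are illustrative.

```python
from math import prod
from typing import Iterable


def series_availability(components: Iterable[float]) -> float:
    """All components must work: availabilities multiply."""
    return prod(components)


def parallel_availability(replicas: Iterable[float]) -> float:
    """Any one replica suffices: the service fails only if all replicas fail."""
    return 1.0 - prod(1.0 - a for a in replicas)
```

Under the independence assumption, three replicas that are each 99% available combine to about 99.9999%, while two 99% components in series drop to about 98.01%. This is the arithmetic behind maximizing redundancy and minimizing points of catastrophic failure.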

14
Specific examples at Penn

High Availability Services
Rapid Recovery Services

15
High Availability Design

Strategies employed to achieve HA
Server redundancy
Hardware component redundancy
Storage redundancy (RAID)
Network redundancy
Redundant power, A/C, cooling etc
Application protocols that can transparently
failover to alternate servers
Secondary offsite hosting (of some services like
DNS)

16
Rapid Recovery Design

Strategies employed to achieve RR
Standby servers and storage
Some HA design elements
Hardware redundancy, storage redundancy, network
redundancy, power, A/C redundancy etc
Note: services deployed in the RR model typically
don't have an easy way to transparently fail over
to alternate servers (e.g., e-mail, Web)

17
Network Aggregation Point

Abbreviation: NAP
Machine rooms in separate campus locations that
house critical network electronics and servers.
Good environmentals and extensive connectivity to
campus fiber-optic cable plant
Both HA and RR services utilize multiple NAPs

18
Central Infra. Networks

AKA NOC Networks (historical name)
3 highly redundant IP networks that house systems
providing critical infrastructure services
Each network is triply connected to campus
routing core via distinct NAP locations
Network wiring traverses physically diverse fiber
conduit pathways
Use of router redundancy protocols (VRRP)
Layer-2 path redundancy (802.1D) for high
availability

19
HA Server Platforms

Two sets of three replicated servers
3 KDC servers: central authentication
3 NOC servers: everything else
Kerberos runs on separate systems mainly for
security reasons.

20
High Availability KDCs

KDCs (3)
3 distinct machines (kdc1, kdc2, kdc3)
Run only Kerberos AS and TGS
Each located in a different campus machine room
Each connected to a distinct IP network
Via a distinct IP core router
Additionally each network is triply connected to
the campus routing core via 3 NAPs

21
High Availability NOCs

3 NOC systems (a historical name)
Provide DNS, DHCP, NTP, RADIUS plus a few
homegrown services
Same physical and network connectivity as the
KDCs
In addition some servers have a secondary
interface on a different NOC network (for reasons
to be explained later)

27
HA Application Failover

Kerberos
DNS
RADIUS
NTP
DHCP (current spec supports only 2 failover systems)
Non-HA homegrown services: PennNames
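
Application-level failover for protocols like these generally means the client holds a list of equivalent servers and moves to the next one when a server does not answer. The sketch below shows that pattern in generic form; the function name, the `query` callable, and the hostnames in the usage note are hypothetical and do not describe how any particular Kerberos, DNS, or RADIUS client is implemented.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful answer.

    `query` is a caller-supplied function that contacts one server and
    raises OSError (which includes TimeoutError) if it cannot be reached.
    """
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return query(server)
        except OSError as exc:
            # This server is unreachable; remember why and try the next one.
            last_error = exc
    raise RuntimeError("all servers failed") from last_error
```

A caller might pass a list such as `["kdc1.example.edu", "kdc2.example.edu", "kdc3.example.edu"]` (hypothetical names) so that the loss of any one machine, machine room, or network is invisible to the user apart from a short delay.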

28
Rapid Recovery service

Example: e-mail and Web service
A set of servers and storage is replicated at two
sites: primary and standby
Primary site: active servers and storage
Secondary site: standby servers and replicated
storage
Data from the primary site is synchronously
replicated to the secondary site
Two separate Fibre Channel networks interconnect
systems and storage at both sites
Catastrophic failure event: the system can be
manually reconfigured to use the standby servers
and/or secondary storage (30 minutes)
Servers are located on the HA primary
infrastructure network