Title: Going Beyond Recovery to Continuity: Lessons Learned
1. Going Beyond Recovery to Continuity: Lessons Learned
- Dave Swartz
- Vice President and CIO
- The George Washington University
2. Brief Background on GW
- Main campus
  - Washington, DC
  - 100 buildings
  - Blocks from the White House, IMF/World Bank, State Dept.
- 27,000 people
  - 20K students (50% UG and 50% graduate and professional students)
  - 7K faculty and staff
  - Of the 20K there are 8K resident students
- Major medical center: the ER for the leadership of our government
- Two other smaller campuses in region
- 2.5 Gb into Internet and Internet-2
- 15K voice connections and 17K data connections
- Two major data centers 34 miles apart
[Figure labels: GW, White House, IMF/WB, State Dept., Pentagon]
3. Some Drivers for Business Continuity at GW
- Explosions in Manholes in Street
  - Recurring, unexplained accumulations of flammable liquids in the storm drains explode, shutting off power to a few buildings for days.
- Flood hits Academic Center with Data Center
  - A backed-up city sewer system causes a flood in a building not designed for a data center.
- Change Management Issues
  - Our Facilities group is prone to taking significant actions without much notice, including cutting off power or cooling to a building.
- Email Systems Failure
  - We lost the SAN. Basic email was down for 24 hours, and it was 3 days until the archive could be restored.
- Cybersecurity Incidents
  - After a major worm infestation and a hack on a trusted host in 2000, GW created its Information Security Program.
- 9/11
  - "The tragic events of Sept. 11 and their aftermath have resulted in changes in the way all of us conduct our lives," said President Stephen Joel Trachtenberg. "Just as GW strives for academic excellence, we also want to take all appropriate steps to ensure the safety and well-being of our community and the continued operation of the university."
  - GW was close to ground zero that day, and all land-based phones and cell phones were congested for much of the day.
- Sarbanes-Oxley
  - A risk-conscious Board of Trustees has led to a number of initiatives to address BC at GW.
4. Who Owns BC at GW?
- John Petrie, AVP for Public Safety and Emergency Mgmt., holds the AB degree from Villanova University and a master's and doctorate from The Fletcher School of Law and Diplomacy.
- A career Naval officer, he was the head of the Naval Station at Norfolk, the world's largest Naval complex, and also professor and head of research at the War College.
- The AVP position was created after 9/11 and was designed to broaden, coordinate, and execute the University's crisis management, business continuity, emergency preparedness and public safety plans and activities.
- "We need to have people at the local level comfortable with what's expected of them and what they have the authority to do," Petrie says. "If they are confident and comfortable, then the chances of their being able to prepare, respond, or recover are easier."
- John's number one priority is the safety and welfare of people.
- He sits on regional and national emergency management response groups and represents the regional universities in exercises.
- References
  - BC Plan - http://www.gwu.edu/response/contents.cfm
  - Advisories and Alerts - http://www.gwu.edu/gwalert/
John Petrie, AVP for Public Safety and Emergency Mgmt.
John has helped to lead the development and administration of BC plans and testing, and an integrated system of advisories, alerts and real-time communications.
5. Role of IT in Campus BC
- Address the risks of IT failures
- IT has helped to coordinate and fund the development of the main 19 core office departmental plans
  - Many core departments had to be assisted to get their BC plans done, since they felt IT had things under control, so why should they have to plan?
  - They also had difficulty freeing themselves from other priorities and needed their VP to make BC a priority!
- IT has also helped to deliver
  - Campus Alerts (web page, portal, email, 3rd party call service)
  - Back up web site
  - Redundant email system and broadcast server (reflector and Listserv)
  - Alternate routing to different area code for our main incoming and outgoing phone lines
  - Emergency intercom broadcasts over speaker phones
  - A network of Blackberries and support for management
  - Online directories and BC response plans
  - A fully configured and supported command center.
6. The Planning Process
- Identify sources of risks and plan accordingly
- Provide assistance
  - Standard templates and questions to facilitate preparation of plans (available on request)
  - Expert assistance to develop plan
  - Review of plans
- Enlist support
  - Of senior management, the Board and all core offices
- Prioritize efforts
  - Not every department needs a comprehensive plan. At GW we identified 19 core offices that needed detailed plans.
- Make the plan easily available
- Test the plan and the ability to think on your feet regularly
- Keep plans current
  - All plans require periodic review, validation and update.
The online plan for GW is called the Incident Planning, Response, and Recovery Manual; it includes the individual BC plans.
7. The GW IT Recovery Profile
- Rebuild and Replace Disaster Recovery
  - Tape backup and priority shipment of equipment
  - Weeks to recovery
- Hot-Site Disaster Recovery
  - Off site arrangements with a hot-site provider
  - Several days to recovery
- High Availability Operations
  - Redundant data centers, networks and telecom
  - Less than one day and ideally less than a couple of hours to recovery.
[Chart: Hours to Recovery. Rebuild and Replace: 420 (projected); Hot-Site: 84; High-Availability: 12 and < 2]
8. Dealing with Risk: Continuity rather than Recovery
- Common areas of IT risk were addressed with a focus on major risks and points of failure
  - Data Center
  - Telecommunications
  - Network and ISP
  - Data
  - Security
  - Power and Cooling
  - Change and Service Management
  - Classrooms
- Continuity of operations needs to be built into the architecture and culture from the bottom up.
  - If you live with it and use it day to day, then it is less of a big deal when a disaster hits.
- BC at a comprehensive local level is essential to enable IT to deliver the sustainability of data and information services.
9. Data Center Redundancy
- We have created dual data centers
  - separated by 34 miles
  - connected by a DWDM link over a redundant dark fiber ring
- We split Test/Dev from the Prod instances.
- We also deploy VMware and virtualize servers across centers.
- Not all of production is at one site, but split on a 35-65 basis.
- We mirror data between data centers.
- We have staff split between centers.
- We routinely test failover during maintenance and upgrades.
- This design enables continuity of operations without the need to recover from most disasters.
10. Telecommunications Redundancy
- We have several PBX switches (Avaya S8700s) interconnected, load balanced, and spatially distributed.
  - Two are on the main campus and separated. The third is on a remote campus 34 miles away in a different area code.
- We have the ability to re-route incoming and outgoing calls through different campuses and area codes.
- There are redundant emergency 911 and analog lines as a back up to our main trunks.
- Some specific phone numbers are protected and given regional priority for accessibility and sustainability during a major incident.
- We maintain copper connections for voice to permit inline power off of diesel generators to 15,000 phones.
11. Data Redundancy
- All enterprise data is mirrored between data centers, including ERP, data marts, email, one-card, portal, and web systems.
- The main campus file servers are automatically backed up. Legacy departmental systems are slowly transitioning to central support and sustainability, which is a difficult political process.
- Desktops in many core offices have a standard image and automatically store to a central suite of file servers.
- Critical documents are being stored online in an enterprise document management system and archived to tape.
- We regularly test data backups to make sure we can restore from them.
- One of the most critical aspects of continuity is rapid access to the data!
On-site fire-rated vault in addition to off-site storage
12. Information Security
- Protecting the university from security risks that can interrupt operations and cost millions of dollars in lost productivity and liability is an important priority in BC.
- The best approach is defense in depth, layered like an onion.
- One of our newest efforts after securing campus file servers is our desktop initiative.
  - We now use Novell Patchlinks, Cisco Clean Access and IPS to automate updates and to verify conformance to standards and non-infection.
  - As a result, desktop infection problems have declined to a trickle.
- Creating a focused Information Security program, setting standards, and centralizing services are critical to success.
"Rounding Up Rogue Servers," article in July 2005 Chronicle.
13. Power and Cooling
- Power Redundancy
  - Conditioned Commercial Power
  - 450 kW Diesel Generator w/Maintenance Tap
  - Automatic Transfer Switch
  - Uninterruptible Power Supplies (UPS)
  - Multiple power supplies in each computer system
  - 48-hour supply of diesel (going to 96 hrs) with priority shipments from three regional vendors possible
- Redundant Air Conditioning Systems
  - Chilled Water Plant, Two 60 Ton Dry Coolers
  - Glycol Chilled Water Air Handlers
14. Change and Service Management
[Diagram: Change Control via Integration. Labels in order: App. Change Control; Prob Tickets, Service Orders; Remedy; Kintana; Work Requests, C3; Asset Management; S/W License Mgmt; Remedy; TBD; Upside; Aperture]
Adoption of integrated change control is one of the major factors in the improvement and reliability of operations.
15. Classrooms
- What happens if we lose some classroom space? How could we continue to conduct classes?
  - Using R25i (Resource25 3.3) to complement Schedule25, we can identify and reallocate any available university space to classrooms.
  - Using Bb and Elluminate, we can conduct classes virtually from home.
  - We are piloting this approach now for snow days and other unscheduled ad hoc gatherings such as study sessions.
  - We are also suggesting that faculty teach one virtual class every month so they have practice.
- Podcasting (Apreso, iPods)
  - GW is supporting podcasting of its non-credit lecture series to provide access to recorded presentations.
  - Could this be expanded for credit classes? Depends on support from faculty.
16. Selling BC: not the WHAT, but the HOW
- Rational Approach
  - The risk or probability of the event multiplied by the potential loss provides a suggested magnitude for the investment in protecting a university from disaster. Not many use this approach.
- Peer Group Benchmarks
  - A very common and accepted approach is to compare the university against the market basket of peer institutions to see what they are doing.
- Leverage the Crisis
  - The emotional side of living through a crisis tends to ease the flow of funds, so capture the opportunity when it arises.
- Partnering with the Board and Audit Team
  - The Board has the ability to drive improvements. The External and Internal Audit Teams are agents of the Board and should be viewed as partners, not as the threat they are often taken to be.
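The rational approach on this slide can be sketched in a few lines of Python. The scenarios, probabilities, and loss figures below are invented purely for illustration and are not GW data; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # chance the event occurs in a given year (0..1)
    potential_loss: float      # estimated loss in dollars if it occurs


def expected_annual_loss(risk: Risk) -> float:
    """Probability of the event multiplied by the potential loss."""
    return risk.annual_probability * risk.potential_loss


# Hypothetical scenarios; every number here is made up for illustration.
risks = [
    Risk("Data center flood", 0.05, 2_000_000),
    Risk("Multi-day power outage", 0.20, 300_000),
    Risk("Email/SAN failure", 0.10, 500_000),
]

# Rank risks by expected annual loss, largest first.
ranked = sorted(risks, key=expected_annual_loss, reverse=True)
for r in ranked:
    print(f"{r.name}: ${expected_annual_loss(r):,.0f} per year")

# The sum gives a suggested order of magnitude for annual BC investment.
total = sum(expected_annual_loss(r) for r in risks)
print(f"Suggested magnitude of annual BC investment: ${total:,.0f}")
```

Ranking by expected loss also shows which risks deserve attention first, which is useful even when the probability estimates are rough.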
17. Risks of Complexity
Virtualization, distant centers, and split
operations add complexity, which has its own
attendant risks.
Standardization, documentation, and tight change
control help to reduce risks from complexity.
18. Factors Related to Distance
- How far away is far enough for a second center?
  - GW has selected 34 miles
  - USC has designated a bunker just a few miles away
  - Others are saying 70 miles.
- It really depends
  - You need to consider the types of risks in your region.
- The greater the distance
  - The greater the cost or lesser the functionality and immediacy of response.
- You may want to
  - Have a secondary high-availability or hot-site nearby and a tertiary cold-site much farther away.
- You need to consider
  - The impacts on your staff and their ability to make it to the different sites both for routine maintenance as well as during a disaster
  - Some types of clustering do not work at a distance
  - Real-time mirroring is also adversely affected by distance.
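A rough latency estimate helps show why distance hurts real-time mirroring. The sketch below assumes light travels through fiber at about 200,000 km/s (roughly two-thirds of its speed in a vacuum) and that the fiber path equals the straight-line distance. Real routes are longer and equipment adds delay, so the results are lower bounds. The function name and the assumption that each synchronous write waits for one full round trip are illustrative, not taken from the slides.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_SEC = 200_000.0  # assumed propagation speed in optical fiber


def round_trip_ms(distance_miles: float) -> float:
    """Lower-bound round-trip propagation delay over fiber, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_sec = distance_km / FIBER_KM_PER_SEC
    return 2 * one_way_sec * 1000.0


# Distances mentioned on this slide (miles).
for miles in (34, 70):
    rtt = round_trip_ms(miles)
    # If every write must wait for one round trip, a single serialized
    # write stream cannot exceed this many writes per second.
    max_serial_writes = 1000.0 / rtt
    print(f"{miles} miles: at least {rtt:.2f} ms round trip, "
          f"at most {max_serial_writes:,.0f} serialized synchronous writes/s")
```

The delay grows linearly with distance, so doubling the separation roughly halves the ceiling on serialized synchronous writes.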
19. Support those Blackberries
- A critical element of the GW BC program is a network of Blackberries. All senior management at GW have them and use them every day.
- Blackberries are more like a laptop than a phone and require expert assistance
  - They have cell phone and radio capability
  - They can send and receive email and instant text messages
  - They have the ability to surf the web and access calendars, directories and online documents that can be used to support BC
- We have a dedicated expert with backup to provide support to the Blackberries and the command centers.
20. Doesn't it cost a great deal?
- GW had a hot-site
  - Costing several hundred thousand dollars per year.
- Went to a high-availability 2nd site.
  - One-time cost about $1 million
  - The ongoing costs were not more than the previous base budget due to the reallocation of the funds from the hot-site contract.
- Increase in base needed was
  - $136K/yr ($1 million loaned at 6% over 10 years)
- To offset costs we are leasing excess space
  - We are recovering the incremental operating costs of the 2nd site.
- More reliable service without large additional costs
  - A NO-BRAINER!
[Chart: Cost vs. Time to Restoration of Operations, comparing the Expected Cost Curve with the GW Cost Curve]
A myth propagated by hot-site vendors is that the cost of customer-owned high availability is prohibitive.
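The $136K/yr figure is consistent with a standard level-payment loan calculation. The sketch below assumes a fixed 6% annual rate with one payment at the end of each year. The payment schedule and the function name are assumptions, since the slide gives only the principal, rate, and term.

```python
def annual_payment(principal: float, rate: float, years: int) -> float:
    """Level annual payment that repays `principal` at `rate` over `years`."""
    if rate == 0:
        return principal / years
    # Standard annuity formula: P * r / (1 - (1 + r)^-n)
    return principal * rate / (1 - (1 + rate) ** -years)


payment = annual_payment(1_000_000, 0.06, 10)
print(f"Annual payment: ${payment:,.0f}")
print(f"Total repaid over 10 years: ${payment * 10:,.0f}")
```

Rounded to the nearest thousand, the payment comes to about $136K per year, matching the slide.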
21. Partnerships
- National Capital Regional Emergency Response Partnership
  - Emergency Response groups across the region coordinate efforts and share experiences
  - First Responder Access Card (FRAC)
  - Regional exercises
  - Information sharing with key groups
- University Partnerships
  - Cost and resource sharing or exchange programs
  - Georgetown University and GW back one another up
  - MAX (Mid-Atlantic Crossroads gigapop)
- Vendor Partnerships
  - Have helped GW identify best practices and utilize new technology useful to BC.
  - Their support in a disaster can be critical
The FRAC helps to get approved personnel across roadblocks and barriers.
22. Questions?
Dave Swartz