Title: Microsoft SQL Server High Availability and Disaster Recovery
1Microsoft SQL Server High Availabilityand
Disaster Recovery
- Michael Poremba // October 2008
2Database HA DR Experience
- Work with business to determine HA or DR
requirements for applications and data? - Design HA or DR solutions?
- Administer HA or DR process?
- Still learning MS SQL Server HA DR capabilities?
3Scope of this Presentation
- Beyond Scope of Presentation
- Data Availability
- Data recovery
- High availability
- Disaster recovery
- Technology Focus
- MS SQL Server
- Physical servers
- SANs
- In-depth how-to(available elsewhere)
- Partitioned views (federated)
- Advanced DBA techniques
- Custom application logic
- 3rd-party software solutions
- Alternate DBMS engines(e.g. Oracle DB2)
- HA on virtual machines
- Complex scenarios solutions
- Load balancing
4Introduction to Data Availability
- So, you need to make yourproduction database
bulletproof
5Data Availability Continuum
- Degrees of protection for information systems
6Business Case for Availability
- Keep business-critical applications available
- Secondary
- Server maintenance
- Protect against loss of data center
- Secondary
- Application upgrades
- Infrastructure upgrades
7Service Level Agreement (SLA)
- Permitted downtime (planned vs. unplanned?)
- Acceptable data/transaction loss
- Application response times
- Mean time to recovery
- Note Database uptime is not equivalent to
application availability - Failures of other application services
- Network outages
8Protect What?
- Application data stores
- Databases
- Files
- Other data repositories
- Database services
- DBMS availability for applications
- Application services
- Application availability for users and external
systems - Databases are the heart of most information
systemsthey deserve the highest affordable
protection.
9Database Failure Scenarios
- Physical Infrastructure Failures
- Storage subsystem
- Disk
- Controller
- Network
- Server
- Power
- Operator errors
- DBMS interruption
- Drops / deletes
- Application defects
- DBMS defects
- Data corruption
10Service Recovery Strategies
11Data RecoveryTerminology
- Terminology varies for source vs. copy
12Data Recovery
13Database Backups
- Traditional backup types
- Full backup
- Differential backup
- Transaction log backup
- Disk is better than tape
- First backup to disk (separate physical disk
volume) - Detect exceptions encountered during backup
- Verify backup files
- Copy backup files to tape or remote disk
- Data retention policy for backup files
14Database Backup Strategy
- Backup of user databases not sufficient for
recovery - System database
- Master database
- MSDB database
- Model database
- External data stores
15Synch with External Data Stores
- Synchronize recovered database with external data
stores - Identity column seeds
- Full-text indexes(SQL Server 2000)
- LDAP entries
- File system objects
- Other databases
16Backup Retention Policy
- Location of backup files
- Duration of retention
- Protection of sensitive data
- Sarbanes/Oxley (SOX)
- HIPAA
- Internal policies for data management and
protection - Access to backups from offsite data storage
17Data Recovery Process
- Backup file sets
- Full baseline, differential, and transaction logs
- Retrieving backup files
- Offsite storage
- Tape
- Network copy
- Dependency on multiple people to get access to
backup files
- Recovery strategy depends on failure scenario
- Create comprehensive failure matrix
- Devise recovery strategy for each scenario
- Does worst-case recovery scenario fit within SLA
parameters? - Recovery time SLA
- Include future data growth in recovery plan
- Fully test recovery strategiespractice is
essential
18High Availability
19High Availability
- Minimize or avoid service downtime
- Whether planned or unplanned
- When components fail,service interruption is
brief or non-existent - Automatic failover
- Eliminate single points of failure (as
affordable) - Redundant components
- Fault-tolerant servers
20Redundant Components
- Objective Avoid single points of failure (where
affordable) - Approach Use redundant components for database
service - Database server nodes
- Server components
- ECC RAM failure-tolerant HW OS
- DBMS instance
- User databases
- Storage devices
- Storage unit components
- MPIO Interfaces paths switches controllers
- RAID Disks
- Networking
- MPIO Interfaces paths switches
- Data copies
- E.g. Recovering torn page from mirror in SQL
Server 2008
21Transaction Log Shipping
- Warm standby solution
- Duplicate user database
- Copy transaction logs to standby server restore
- Database available for read-only access
- Users must disconnect for logs to be applied
- Two database licenses required if querying
standby - Manual application failover
- Supported on standard hardware
- Possible data loss (unapplied transactions)
22Database Mirroring
- Redundancy at user database level
- Duplicate copy of user database
- Independent storage devices
- Multiple copies of instance databases
- Mirrored over private network channel
- Mirror always redoing transactions from principal
- Negligible impact on transaction throughput
- Multiple mirroring modes
- High-availability commit _at_ log on mirror
automatic failover - High-protection commit _at_ log on mirror manual
failover - High-performance commit when logged on principal
- Very fast automatic failoverseconds
- Requires witness server
- Mirror-aware application client connection
- Provided by client library
- Database connection string must specify both
servers - Mirror may be available for read-only access
(snapshots) - Works with standard hardware
23Mirror Witness
- With mirroring, more than one server is required
to decide on failover - Witness automates failover from primary to mirror
- Watches database availability
- Reports observations back to principal and mirror
- Runs in separate SQL Server instance (Express is
OK) - Prevents split brain scenario
- Very low resource consumption
- Can be witness for multiple databases
- Not a single point of failure
24SQL Server Failover Clustering
- Two clustered nodes
- Active/Passive config
- MS SQL services
- Running on virtual server
- Shared storage device
- User databases
- System databases
- Quorum drive
- Redundant internal components
25Active/Passive Failover Clustering
- Redundancy at database instance level
- All databases fail over together
- Shared copy of system databases
- Single data copy on shared storage device
- No I/O overhead reducing throughput
- Storage unit is single point of failure for
cluster - All database services are clustered
- SQL Agent Analysis Services Full-Text engine,
MS DTC - Automatic failover (up to minutes)
- DBMS accessed over virtual IP
- Database not available from inactive node for DB
client connections - Storage is controlled by one cluster node at a
time - Requires hardware certified by Microsoft for
Microsoft Cluster Service
26HA Comparison
- Scope user DB
- Standard hardware
- One SQL license(unless querying snapshots on
mirror) - Very fast failover (seconds)
- OS flexible (e.g. 32/64)
- Independent storage
- Independent services
- Reporting on mirror
- Geographic separation OK
- Scope DBMS instance
- Certified hardware
- One SQL license(only one node can access
database) - Automatic failover (up to minutes)
- Enterprise OS
- Shared storage
- Clustered services
- Standby not available
- Servers are usually co-located
27Considerations for HA
- HA complements backup and recovery strategy
- Does not replace data recovery plan
- Application service availability is often
determined by a network of interdependent
services - Availability can be difficult to define (e.g.
partial failures) - Failure probability difficult to measure or
compute - Increased system complexity could lead to lower
service availability! - Operator error a leading cause of availability
issues - Increased number/types of system components
- More complex to configure and administer
28Data Recovery Requirements
29Disaster Recovery
30Disaster Recovery
- Minimize downtime of business operations
- Redundant systems and facilities
- SQL Server features
- Transaction log shipping
- Database mirroring
- Failover clustering
- Other technologies
- Storage-based mirroring
31Disaster Recovery Planning
- Data security requirements
- Clarify SLA, data loss allowance
- Evaluate system cost vs. data protection
- Failure analysis
- System redundancy
- Process validation
- Training for personnel
- Prevention practices
- Executing disaster recovery and business
continuity - Practice, practice, practice
32Business Continuity Facility
- System redundancy
- Systems Web servers app servers database, etc.
- Data Databases data files on OS security info,
etc. - Networking Domain, routing, subnet, VIPs, etc.
- Alternate facilities
- Network bandwidth
- Physical or network access by operations staff
- Failover
- Often a deliberate decision, using manual failover
33Data Redundancy
- Synchronous redundancy
- Network bandwidth cost
- Network latency and application performance
- Network reliability
- Asynchronous redundancy
- Risk of data loss
- More cost-effective
- Resilient to network latency issues
- Candidate Technologies
- SQL Server database mirroring
- Failover clustering with SAN-based mirroring
34DR Using Database Mirroring
- Two sites Primary and DR location
- Separate failover clusters at each site
- SQL Server database mirroring between sites
35DR Using SAN-Based Mirroring
- Two sites Primary and DR location
- Four-node failover cluster one virtual IP
address - SAN-based mirroring between sites
- Manual cluster failover
36Complimentary Technologies
- Skip if time is running short.
37SAN-Based Data Mirroring
- Data blocks duplicated at storage level
- Similar to transaction log shipping
- Copy performed in sequence and coordinated with
database checkpoint - Ensures consistency of mirrored data files
- Synchronous or asynchronous mirroring
- Co-located or geographically dispersedboth are
OK - SAN link bandwidth must support database I/O rate
- May require extra feature support from SAN vendor
- Could rely on Failover Clustering for HA
38SQL Server Database Snapshots
- Read-only point-in-time database snapshot
- No data is copiedinstantaneous
- Historical snapshot pages tracked separately from
changing pages - Snapshots can be maintained indefinitely
- Limited only by available storage
- Snapshot copy can be used for reporting
- Read-only, so no locking issues
39SQL Server Replication
- Transactional replication
- High transaction volume
- Low data latency required
- Mixed technologies Integrates with other DBMS
- Merge replication
- Bi-directional data changes
- Typically server-to-client
- Snapshot replication
- Large, infrequent data changes
- Data change latency OK
- Best for smaller data sets
- Subscriber databases available for reporting
- Replicate data subsets
- Some data loss is possible
- Periodically validate replicated data
40App Development and Admin
41Considerations for App Developers
- App services tolerant to database service
interruptions - Application transactions must be handled in
codedata consistency - Exception handling for transaction retry,
connection recovery - Requires coding standards, code reviews, and
testing - Bulk data operations
- Transaction volume impacts rollback time during
failover - Batch jobs must be run on alternate nodes
- Dont bypass transaction logging
- Synchronization with external data sources?
- Be aware of database recovery model
- Mirroring uses FailoverPartner in connection
string - Use TCP/IP as client protocol
42Considerations for Admins
- Use identical server hardware, when possible
- Design network redundancies, when feasible
- Consider network latency for geographic
separation - Always manage through virtual cluster, not
individual cluster nodes - Retest failover/failback after HA maintenance
- Diagnose after failover
- Repair alternate node
- Resynchronize data, as necessary
- Be aware of primary/secondary locations
- Ensure application services are connected and
functioning properly - Keep server node configurations synchronized
- Service pack and patch levels
- Duplicate non-redundant resources
- Jobs logins and permissions OS sys objects
43HA Risks
- System performance degradation
- HA system complexity leads to availability issues
- Some system failures not planned for
- Backup and recovery planning incomplete
- Administrators not fully trained or informed
- User databases not synchronized with other data
sources
44Common Admin Use Cases
- Maintain HA nodes
- Hardware maintenance
- Rolling upgrades and software patches
- Resynchronize the redundant copy
- Re-synch mirror
- Restart log shipping
- Diagnose and repair
- Diagnose cause of failover
- Repair failed node and restore failover
capabilities - Test failover and failback
45Common Admin Actions
- Train and practice administrators to
- Initiate a database mirror
- Manually failover mirror database or cluster node
- Add/remove passive node from mirror or cluster
- Upgrade/patch servers nodes
- Restart or redirect application services
46More Information
47ReferencesBooks
- Microsoft SQL Server 2008 High Availability with
Clustering Database Mirroringby Michael Otey,
2009. - Microsoft SQL Server High Availabilityby Paul
Bertucci, 2004. - Pro SQL Server 2005 High Availabilityby Allan
Hirt, 2007.
- Pro SQL Server 2005 Replication by Sujoy Paul,
2006. - Pro SQL Server 2005 Service Brokerby Klaus
Aschenbrenner, 2007. - The Rational Guide to SQL Server 2005 Service
Brokerby Roger Wolter, 2006.
48ReferencesPresentations
- Microsoft Load Balancing and Clusteringhttp//ce.
sharif.edu/courses/84-85/2/ce317/resources/root/le
cture20slides/14.20Microsoft20Load20Balancing
20and20Clustering.ppt - SQL Server 2005 High Availabilityhttp//www.atlan
tamdf.com/Presentations/AtlantaMDF_111207HA.ppt - High Availability Technologies In SQL Server 2000
And SQL Server 2005 http//202.181.238.2/hk/teche
d2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt - Meeting the Availability Challengehttp//download
.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9
FC4-4F7D46FCEAA6/HighAvailability.ppt - Disaster Recovery Mistakeshttp//www.sqlsig.org/O
ct201120DASSUG20-20Jason20Hall2010-11-0720M
M.ppt - SQL Server 2005 High Availabilityhttp//blogs.msd
n.com/sql2005event/attachment/564303.ashx - Effective Usage of SQL Server 2005 Database
Mirroring http//www.sqlserver-qa.net/SSQA-Effect
ive20Usage20of20SQL20Server20200520Database
20Mirroring_show.ppt
49ReferencesArticles
- Achieve High Availability for SQL
Serverhttp//technet.microsoft.com/en-us/magazine
/cc162477.aspx - Geographically Dispersed Clusters in Windows
Server 2003http//www.microsoft.com/windowsserver
2003/techinfo/overview/clustergeo.mspx - Restoring file and filegroup backups
http//support.microsoft.com/kb/281122/en-us - Restoring specific tables or rows from
backupshttp//support.microsoft.com/kb/321836/en-
us - Maintaining Availability During
Upgradeshttp//msdn.microsoft.com/en-us/library/m
s191449.aspx