Title: Initial Availability Benchmarking of a Database System
1. Initial Availability Benchmarking of a Database System
- Aaron Brown
- abrown@cs.berkeley.edu
- DBLunch Seminar, 1/23/01
 
2. Motivation
- Availability is a key metric for modern apps
 - e-commerce, enterprise apps, online services, ISPs
- Database availability is particularly important
 - databases hold the critical hard state for most enterprise and e-business applications
 - the most important system component to keep available
- We trust databases to be highly dependable. Should we?
 - how do DBMSs react to hardware faults/failures?
 - what is the user-visible impact of such failures?
 
3. Overview of approach
- Use availability benchmarking to evaluate database dependability
 - an empirical technique based on simulated faults
- Study a 3-tier OLTP workload
 - back-end commercial database
 - middleware transaction monitor and business logic
 - front-end web-based form interface
- Focus on storage-system faults/failures
- Measure availability in terms of performance
 - also possible to look at consistency of data
 
4. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
 
5. Availability benchmarking
- A general methodology for defining and measuring availability
 - focused toward research, not marketing
 - empirically demonstrated with software RAID systems [USENIX '00]
- Three components:
 - 1) metrics
 - 2) benchmarking techniques
 - 3) representation of results
 
6. Part 1: Availability metrics
- Traditionally, percentage of time the system is up
 - time-averaged, binary view of system state (up/down)
- This metric is inflexible
 - doesn't capture degraded states: a non-binary spectrum between up and down
 - time-averaging discards important temporal behavior
 - compare two systems with 96.7% traditional availability (see the numeric check after this slide): system A is down for 2 seconds per minute, system B for 1 day per month
- Our solution: measure variation in system quality-of-service metrics over time
 - performance, fault-tolerance, completeness, accuracy
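A quick numeric check of the example above (a minimal Python sketch; the two "systems" are hypothetical and the downtime figures are the ones on the slide):

```python
# Minimal check of the time-averaged availability example above.
# Both systems are hypothetical; the numbers are taken from the slide.

def time_averaged_availability(downtime: float, period: float) -> float:
    """Traditional metric: fraction of the period the system is up."""
    return 1.0 - downtime / period

# System A: down 2 seconds out of every minute.
avail_a = time_averaged_availability(downtime=2.0, period=60.0)
# System B: down 1 day out of every 30-day month.
avail_b = time_averaged_availability(downtime=1.0, period=30.0)

print(f"System A: {avail_a:.1%}")  # -> System A: 96.7%
print(f"System B: {avail_b:.1%}")  # -> System B: 96.7%
# Identical by the traditional metric, yet very different for users --
# hence the move to tracking QoS metrics over time.
```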
7. Part 2: Measurement techniques
- Goal: quantify variation in QoS metrics as system availability is compromised
- Leverage existing performance benchmarks
 - to measure and trace quality-of-service metrics
 - to generate fair workloads
- Use fault injection to compromise the system
 - hardware and software faults
 - maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
 - the availability analogues of performance micro- and macro-benchmarks (a single-fault run is sketched below)
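As a rough sketch of what a single-fault micro-benchmark run could look like in harness code. The callbacks (start_workload, sample_qos, inject_fault) are hypothetical placeholders for a real harness, not any tool's actual API:

```python
import time

# Sketch of a single-fault availability micro-benchmark. The three callbacks
# stand in for a real harness and are assumptions, not an existing tool's API.
def run_single_fault_benchmark(start_workload, sample_qos, inject_fault,
                               fault, warmup_s=600, total_s=3600, sample_s=60):
    """Run the performance workload, inject one fault after a warm-up period,
    and record QoS samples (e.g., transactions/min) over the whole run."""
    workload = start_workload()                # e.g., TPC-C-style load generator
    samples, start, injected = [], time.time(), False
    while time.time() - start < total_s:
        time.sleep(sample_s)
        samples.append(sample_qos(workload))   # throughput, abort rate, ...
        if not injected and time.time() - start >= warmup_s:
            inject_fault(fault)                # e.g., sticky uncorrectable read
            injected = True
    return samples
```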
8. Part 3: Representing results
- Results are most accessible graphically
 - plot change in QoS metrics over time
 - compare to normal behavior
 - 99% confidence intervals calculated from no-fault runs (see the sketch below)
- Graphs can be distilled into numbers
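One simple way the "normal behavior" band could be derived and used, assuming roughly normally distributed QoS samples; this is an illustration, not necessarily the exact procedure used in the study:

```python
from statistics import mean, stdev

# Sketch: derive a 99% "normal behavior" band from no-fault runs and flag
# fault-run samples that fall outside it. Assumes roughly normal samples.
def normal_band(no_fault_samples, z=2.576):      # z value for a 99% interval
    m, s = mean(no_fault_samples), stdev(no_fault_samples)
    return m - z * s, m + z * s

def out_of_band(fault_run_samples, band):
    """Return (index, value) pairs for samples outside the normal band."""
    low, high = band
    return [(i, x) for i, x in enumerate(fault_run_samples)
            if not (low <= x <= high)]
```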
 
9. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
 - metrics
 - workload and fault injection
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
 
10. Availability metrics for databases
- Possible OLTP quality-of-service metrics
 - transaction throughput
 - transaction response time
 - better: % of transactions taking longer than a fixed cutoff
 - rate of transactions aborted due to errors
 - consistency of database
 - fraction of database content available
- Our experiments focused on throughput
 - rates of normal and failed transactions (a sketch of these metrics follows this slide)
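A sketch of how these candidate metrics could be computed from a per-transaction trace. The trace format and the 5-second cutoff are assumptions for illustration, not the benchmark kit's actual output:

```python
# Sketch: candidate OLTP QoS metrics computed from a per-transaction trace.
# Each record is assumed to be (commit_time_s, latency_s, committed_ok).
def qos_metrics(trace, cutoff_s=5.0):
    ok     = [t for t in trace if t[2]]
    failed = [t for t in trace if not t[2]]
    duration_min = max(t[0] for t in trace) / 60.0
    return {
        "throughput_tpm": len(ok) / duration_min,        # good txns per minute
        "abort_rate_tpm": len(failed) / duration_min,    # failed txns per minute
        "pct_over_cutoff": 100.0 * sum(t[1] > cutoff_s for t in ok)
                           / max(len(ok), 1),            # slow-transaction share
    }
```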
 
11. Workload and fault injection
- Performance workload
 - easy: TPC-C
- Fault workload: disk subsystem (see the fault-matrix sketch after this slide)
 - realistic fault set based on the Tertiary Disk study
 - correctable and uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
 - both transient and sticky faults
 - injected via an emulated SCSI disk (0.5 ms overhead)
- Faults injected into one of two partitions
 - database data partition
 - database's write-ahead log partition
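One way to picture the fault workload is as data driving the emulated SCSI disk. The fault names and the exact set below are illustrative, not the precise 14 fault types used in the experiments:

```python
# Sketch: the disk-fault workload as data that drives the emulated SCSI disk.
# Fault names below are illustrative, not the exact set used in the study.
FAULTS = [
    ("media_read_error",  "correctable",   "transient"),
    ("media_read_error",  "uncorrectable", "sticky"),
    ("media_write_error", "correctable",   "transient"),
    ("media_write_error", "uncorrectable", "sticky"),
    ("hardware_error",    None,            "sticky"),
    ("power_failure",     None,            "sticky"),
    ("scsi_hang",         None,            "sticky"),
]
PARTITIONS = ["data", "log"]   # database data vs. write-ahead log partition

# One single-fault experiment per (fault, partition) pair.
experiments = [(fault, part) for fault in FAULTS for part in PARTITIONS]
```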
 
12. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
 
13. Experimental setup
- Database
 - Microsoft SQL Server 2000, default configuration
- Middleware/front-end software
 - Microsoft COM transaction monitor/coordinator
 - IIS 5.0 web server with Microsoft's tpcc.dll HTML terminal interface and business logic
 - Microsoft BenchCraft remote terminal emulator
- TPC-C-like OLTP order-entry workload
 - 10 warehouses, 100 active users, 860 MB database
- Measured metrics
 - throughput of correct NewOrder transactions (txn/min)
 - rate of aborted NewOrder transactions (txn/min)
 
14. Experimental setup (2)
- [Diagram: DB server (Intel P-III/450, 256 MB DRAM, Windows 2000 AS) running SQL Server 2000, with IBM 18 GB 10k RPM DB data/log disks behind an Adaptec 3940 SCSI controller; front end (AMD K6-2/333, 128 MB DRAM, Windows 2000 AS) running MS BenchCraft RTE, IIS, MS tpcc.dll, and MS COM; connected by 100 Mb Ethernet]
- Database installed in one of two configurations
 - data on emulated disk, log on real (IBM) disk
 - data on real (IBM) disk, log on emulated disk
 
15. Results
- All results are from single-fault micro-benchmarks
 - 14 different fault types
 - injected once for each of the data and log partitions
- Four categories of behavior detected (a rough classification sketch follows this slide)
 - 1) normal
 - 2) transient glitch
 - 3) degraded
 - 4) failed
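As a rough illustration only: the study assigned these categories by inspecting the throughput graphs and error logs; the sketch below merely buckets a post-fault throughput trace against the no-fault band from earlier, using made-up thresholds:

```python
# Sketch: bucket a post-fault throughput trace (samples per minute) into the
# four observed categories. Thresholds are illustrative; the study classified
# runs by inspecting graphs and error logs, not by an automatic rule.
def classify_run(post_fault_tpm, normal_band, glitch_samples=2):
    low, _ = normal_band
    below = [tpm < low for tpm in post_fault_tpm]
    if not any(below):
        return "normal"            # 1) fault tolerated, no visible effect
    if all(below):
        # never returns to the normal band for the rest of the run
        return "failed" if max(post_fault_tpm) == 0 else "degraded"
    # dips below the band but recovers before the run ends
    return "transient glitch" if sum(below) <= glitch_samples else "degraded"
```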
 
16. Type 1: normal behavior
- System tolerates the fault
- Demonstrated for all sector-level faults except
 - sticky uncorrectable read, data partition
 - sticky uncorrectable write, log partition
 
17. Type 2: transient glitch
- One transaction is affected and aborts with an error
 - subsequent transactions using the same data would fail
- Demonstrated for one fault only
 - sticky uncorrectable read, data partition
 
18. Type 3: degraded behavior
- DBMS survives the error after running log recovery
- Middleware partially fails, resulting in degraded performance
- Demonstrated for one fault only
 - sticky uncorrectable write, log partition
 
19. Type 4: failure
- Example behaviors (10 distinct variants observed)
 - [Graphs: disk hang during a write to the data disk; simulated log disk power failure]
- DBMS hangs or aborts all transactions
- Middleware behaves erratically, sometimes crashing
- Demonstrated for all fatal disk-level faults
 - SCSI hangs, disk power failures
 
20. Results summary
- DBMS was robust to a wide range of faults
 - tolerated all transient and recoverable errors
 - tolerated some unrecoverable faults, either transparently (e.g., uncorrectable data writes) or by reflecting the fault back via a transaction abort
 - these unrecoverable faults were not tolerated by the SW RAID systems
- Overall, the DBMS is significantly more robust to disk faults than software RAID on the same OS!
21. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
 
22. Results discussion
- The DBMS's extra robustness comes from
 - redundant data representation in the form of the log
 - transactions: a standard mechanism for reporting errors (txn abort) that encapsulates a meaningful unit of work and provides consistent rollback upon failure
 - (compare RAID: blocks don't let you do this)
- But the middleware was not robust, compromising overall system availability
 - crashed or behaved erratically when the DBMS recovered or returned errors
 - the user cannot distinguish DBMS failures from middleware failures
- The system is only as robust as its weakest component!
23. Discussion of methodology
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology
 - defining appropriate metrics
 - measuring non-performance availability metrics
 - understanding layered (multi-tier) systems with only end-to-end instrumentation
24. Discussion of methodology (2)
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology
 - defining appropriate metrics
  - metrics to capture database ACID properties
  - adapting binary metrics such as data consistency
 - measuring non-performance availability metrics
  - existing benchmarks (like TPC-C) may not do this
 - understanding layered (multi-tier) systems with only end-to-end instrumentation
  - teasing apart the availability impact of different layers
25. Future directions
- Direct extensions of this work
 - expand metrics, including tests of ACID properties
 - consider other fault-injection points besides disks
 - investigate clustered database designs
 - study issues in benchmarking layered systems
 
26. Future directions (2)
- Availability/maintainability extensions to TPC
 - proposed by James Hamilton at the ISTORE retreat
 - an optional maintainability test after the regular run
  - sponsor supplies its N best administrators
  - the TPC benchmark run is repeated with realistic fault injection and a set of maintenance tasks to perform
  - measure availability, performance, administrator time, ...
 - requires
  - characterization of typical failure modes and admin tasks
  - a scalable, easy-to-deploy fault-injection harness
- This work is a (small) step toward that goal
 - and hints at the poor state of the art in TPC-C benchmark middleware fault handling
27. Thanks!
- The Microsoft SQL Server group
 - for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
- James Hamilton
- Jamie Redding and Charles Levine
 
28. Backup slides
29. Example results: failing data disk
- [Graphs of the four injected faults:]
 - transient, correctable read fault (system tolerates fault)
 - sticky, uncorrectable read fault (transaction is aborted with error)
 - disk hang between SCSI commands (DBMS hangs, middleware returns errors)
 - disk hang during a data write (DBMS hangs, middleware crashes)
30. Example results: failing log disk
- [Graphs of the four injected faults:]
 - transient, correctable write fault (system tolerates fault)
 - sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
 - simulated disk power failure (DBMS aborts all txns with errors)
 - disk hang between SCSI commands (DBMS hangs, middleware hangs)