Title: Initial Availability Benchmarking of a Database System
1. Initial Availability Benchmarking of a Database System
- Aaron Brown
- abrown_at_cs.berkeley.edu
- 2001 Winter ISTORE Retreat
2. Motivation
- Extend availability benchmarks to new areas
- explore generality and limitations of approach
- gain more understanding of system failure modes
- Why look at database availability?
- databases hold the critical hard state for most enterprise and e-business applications
- the most important system component to keep available
- we trust databases to be highly reliable. Should we?
- how do DBMSs react to hardware faults/failures?
- what is the user-visible impact of such failures?
3. Approach
- Use our availability benchmarking methodology to evaluate database robustness
- focus on storage system failures
- study 3-tier OLTP workload
- back-end commercial database
- middleware: transaction monitor and business logic
- front-end web-based form interface
- measure availability in terms of performance
- also possible to look at consistency of data
4. Refresher: availability benchmarks
- Goal: quantify variation in quality of service as system availability is compromised
- Leverage existing performance benchmark
- to measure and trace quality-of-service metrics
- to generate fair workloads
- Use fault injection to compromise system
- Observe results graphically (a minimal sketch of this loop follows below)
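To make the loop concrete, here is a minimal Python sketch of how such a run could be driven. The hooks `inject_fault` and `sample_qos` are hypothetical stand-ins for the emulated-disk injector and the workload's QoS instrumentation, and the timing constants are illustrative, not values from the study.

```python
# Minimal sketch of the availability-benchmarking loop, assuming hypothetical
# hooks: inject_fault() triggers one fault (e.g. via an emulated SCSI disk)
# and sample_qos() returns the current quality-of-service metric.
import time

def run_availability_benchmark(inject_fault, sample_qos,
                               fault_at_s=600, total_s=1800, sample_s=60):
    """Drive the performance benchmark as load, inject one fault partway
    through, and trace the QoS metric over time for graphical inspection."""
    samples = []                       # (elapsed seconds, QoS value) pairs
    fault_injected = False
    start = time.time()
    while (elapsed := time.time() - start) < total_s:
        if not fault_injected and elapsed >= fault_at_s:
            inject_fault()             # e.g. sticky uncorrectable read, data disk
            fault_injected = True
        samples.append((elapsed, sample_qos()))   # e.g. NewOrder txns/min
        time.sleep(sample_s)
    return samples                     # plot to see degradation and recovery
```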
5. Availability metrics for databases
- Possible OLTP quality of service metrics
- transaction throughput
- transaction response time
- fraction of transactions taking longer than a fixed cutoff
- rate of transactions aborted due to errors
- consistency of database
- fraction of database content available
- Our experiments focused on throughput
- rates of normal and failed transactions (see the sketch below)
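As an illustration, the two throughput metrics can be computed from a per-transaction log as sketched here; the (timestamp, status) record format is an assumption for the example, not the benchmark kit's actual output.

```python
# Sketch: per-minute rates of committed vs. aborted NewOrder transactions.
# The (timestamp, status) record format is assumed for illustration only.
from collections import defaultdict

def per_minute_rates(txn_log):
    """txn_log: iterable of (timestamp_seconds, status) pairs, where status is
    'commit' or 'abort'. Returns {minute: (commits_per_min, aborts_per_min)}."""
    buckets = defaultdict(lambda: [0, 0])
    for ts, status in txn_log:
        minute = int(ts // 60)
        buckets[minute][0 if status == 'commit' else 1] += 1
    return {m: tuple(c) for m, c in sorted(buckets.items())}
```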
6. Fault injection
- Disk subsystem faults only
- realistic fault set based on Tertiary Disk study
- correctable and uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
- both transient and sticky faults
- note: similar fault set to the RAID benchmarks
- injected via an emulated SCSI disk (0.5 ms overhead)
- faults injected in one of two partitions (see the sketch below)
- database data partition
- database's write-ahead log partition
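For concreteness, the injected fault space can be written down as data, as in the sketch below. The names are illustrative only: the real faults were delivered through the emulated SCSI disk, and the study used a hand-picked set of 14 fault types per partition rather than the full cross-product shown here.

```python
# Illustrative description of the fault space; names are not the emulator's API.
from dataclasses import dataclass
from itertools import product

FAULT_TYPES = ["correctable_media_error", "uncorrectable_media_error",
               "hardware_error", "power_failure", "disk_hang"]
PERSISTENCE = ["transient", "sticky"]   # transient clears on retry; sticky persists
PARTITIONS  = ["data", "log"]           # DB data vs. write-ahead log partition

@dataclass(frozen=True)
class FaultSpec:
    fault_type: str
    persistence: str
    partition: str

# Full cross-product; the actual experiments injected a curated subset of
# 14 fault types, each once against the data partition and once against the log.
candidate_faults = [FaultSpec(f, p, part)
                    for f, p, part in product(FAULT_TYPES, PERSISTENCE, PARTITIONS)]
```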
7. Experimental setup
- Database
- Microsoft SQL Server 2000, default configuration
- Middleware/front-end software
- Microsoft COM transaction monitor/coordinator
- IIS 5.0 web server with Microsoft's tpcc.dll HTML terminal interface and business logic
- Microsoft BenchCraft remote terminal emulator
- TPC-C-like OLTP order-entry workload
- 10 warehouses, 100 active users, 860 MB database
- Measured metrics
- throughput of correct NewOrder transactions/min
- rate of aborted NewOrder transactions (txn/min)
8. Experimental setup (2)
[Setup diagram: DB server and front end connected by 100 Mb Ethernet]
- DB server: SQL Server 2000 on Windows 2000 AS; Adaptec 3940 SCSI adapter; IBM 18 GB 10k RPM DB data/log disks
- Front end: MS BenchCraft RTE, IIS, MS tpcc.dll, MS COM on Windows 2000 AS
- Machines: Intel P-III/450 with 256 MB DRAM and AMD K6-2/333 with 128 MB DRAM
- Database installed in one of two configurations
- data on emulated disk, log on real (IBM) disk
- data on real (IBM) disk, log on emulated disk
9. Results
- All results are from single-fault micro-benchmarks
- 14 different fault types
- injected once for each of data and log partitions
- 4 categories of behavior detected (a classification sketch follows this list)
- 1) normal
- 2) transient glitch
- 3) degraded
- 4) failed
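A rough way to tell the four categories apart from the throughput trace alone is sketched below; the thresholds are illustrative guesses, not values taken from the experiments.

```python
# Sketch: classify a post-fault throughput trace into the four behavior
# categories. Thresholds are illustrative, not taken from the study.
def classify_behavior(baseline_tpm, post_fault_tpm,
                      dead_frac=0.05, degraded_frac=0.9):
    """baseline_tpm: pre-fault NewOrder throughput (txns/min);
    post_fault_tpm: per-minute throughput samples after fault injection."""
    ratios = [tpm / baseline_tpm for tpm in post_fault_tpm]
    if ratios[-1] < dead_frac:
        return "failed"            # throughput collapses and does not recover
    if ratios[-1] < degraded_frac:
        return "degraded"          # survives, but settles at reduced throughput
    if min(ratios) < degraded_frac:
        return "transient glitch"  # brief dip, then back to normal
    return "normal"                # fault tolerated with no visible impact
```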
10. Type 1: normal behavior
- System tolerates fault
- Demonstrated for all sector-level faults except
- sticky uncorrectable read, data partition
- sticky uncorrectable write, log partition
11. Type 2: transient glitch
- One transaction is affected, aborts with error
- Subsequent transactions using the same data would fail
- Demonstrated for one fault only
- sticky uncorrectable read, data partition
12. Type 3: degraded behavior
- DBMS survives error after running log recovery
- Middleware partially fails, results in degraded performance
- Demonstrated for one fault only
- sticky uncorrectable write, log partition
13. Type 4: failure
- Example behaviors (10 distinct variants observed)
- Disk hang during write to data disk
- Simulated log disk power failure
- DBMS hangs or aborts all transactions
- Middleware behaves erratically, sometimes crashing
- Demonstrated for all fatal disk-level faults
- SCSI hangs, disk power failures
14. Results summary
- DBMS was robust to a wide range of faults
- tolerated all transient and recoverable errors
- tolerated some unrecoverable faults
- transparently (e.g., uncorrectable data writes)
- or by reflecting fault back via transaction abort
- these were not tolerated by the SW RAID systems
- Overall, DBMS is significantly more robust to disk faults than software RAID systems!
15. Results discussion
- DBMS's extra robustness comes from
- redundant data representation in form of log
- transactions
- standard mechanism for reporting errors (txn abort)
- encapsulate meaningful unit of work, providing consistent rollback upon failure
- But, middleware was not robust, compromising overall system availability
- crashed or behaved erratically when DBMS recovered or returned errors
- user cannot distinguish DBMS and middleware failure
- system is only as robust as its weakest component!
- compare RAID: blocks don't let you do this
16. Discussion of methodology
- General availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology
- defining appropriate metrics
- measuring non-performance availability metrics
- understanding layered (multi-tier) systems with only end-to-end instrumentation
17. Discussion of methodology
DO NOT PROJECT THIS SLIDE!
- General availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology
- defining appropriate metrics
- metrics to capture database ACID properties
- adapting binary metrics such as data consistency
- measuring non-performance availability metrics
- existing benchmarks (like TPC-C) may not do this
- understanding layered (multi-tier) systems with only end-to-end instrumentation
- teasing apart availability impact of different layers
18. Future directions
- Last retreat, James Hamilton proposed availability/maintainability extensions to TPC
- This work is a (small) step toward that goal
- exposed limitations and capabilities of disk fault injection
- revealed importance of middleware, which clearly must be considered as part of the benchmark
- hints at poor state-of-the-art in TPC-C benchmark middleware fault handling
- Next
- expand metrics, including tests of ACID properties
- consider other fault injection points besides disks
- investigate clustered database designs
- study issues in benchmarking layered systems
19. Thanks!
- Microsoft SQL Server group
- for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
- James Hamilton
- Jamie Redding and Charles Levine
20. Backup slides
21. Example results: failing data disk
- Transient, correctable read fault (system tolerates fault)
- Sticky, uncorrectable read fault (transaction is aborted with error)
- Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
- Disk hang during a data write (DBMS hangs, middleware crashes)
22. Example results: failing log disk
- Transient, correctable write fault (system tolerates fault)
- Sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
- Simulated disk power failure (DBMS aborts all txns with errors)
- Disk hang between SCSI commands (DBMS hangs, middleware hangs)