1
Initial Availability Benchmarking of a Database
System
  • Aaron Brown
  • abrown@cs.berkeley.edu
  • 2001 Winter ISTORE Retreat

2
Motivation
  • Extend availability benchmarks to new areas
  • explore generality and limitations of approach
  • gain more understanding of system failure modes
  • Why look at database availability?
  • databases hold the critical hard state for most
    enterprise and e-business applications
  • the most important system component to keep
    available
  • we trust databases to be highly reliable. Should
    we?
  • how do DBMSs react to hardware faults/failures?
  • what is the user-visible impact of such failures?

3
Approach
  • Use our availability benchmarking methodology to
    evaluate database robustness
  • focus on storage system failures
  • study 3-tier OLTP workload
  • back-end: commercial database
  • middleware: transaction monitor and business
    logic
  • front-end: web-based form interface
  • measure availability in terms of performance
  • also possible to look at consistency of data

4
Refresher: availability benchmarks
  • Goal: quantify variation in quality of service as
    system availability is compromised
  • Leverage existing performance benchmark
  • to measure and trace quality-of-service metrics
  • to generate fair workloads
  • Use fault injection to compromise system
  • Observe results graphically (a harness sketch
    follows this list)
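As a rough illustration of this loop, the sketch below starts a workload, injects a single fault after a warm-up period, and traces a quality-of-service metric over time. The names sample_qos, inject_fault, and availability_run are hypothetical stubs, not parts of the actual harness.

```python
# Minimal sketch of an availability-benchmarking run; all helpers here are
# hypothetical stubs standing in for the real workload driver, QoS probe,
# and fault injector.
import random
import time

def sample_qos():
    # Stub: return the current quality-of-service reading (e.g., txns/min).
    return random.gauss(1000.0, 50.0)

def inject_fault(fault):
    # Stub: ask the (emulated) disk to start producing the given fault.
    print(f"injecting fault: {fault}")

def availability_run(fault, warmup_s=30, total_s=120, interval_s=5):
    """Run the workload, inject one fault after warm-up, and trace QoS."""
    trace, injected = [], False
    start = time.time()
    while (elapsed := time.time() - start) < total_s:
        if not injected and elapsed >= warmup_s:
            inject_fault(fault)  # compromise the system exactly once
            injected = True
        trace.append((elapsed, sample_qos()))
        time.sleep(interval_s)
    return trace  # plot QoS vs. time to see the fault's visible impact
```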

5
Availability metrics for databases
  • Possible OLTP quality of service metrics
  • transaction throughput
  • transaction response time
  • fraction of transactions longer than a fixed
    cutoff
  • rate of transactions aborted due to errors
  • consistency of database
  • fraction of database content available
  • Our experiments focused on throughput
  • rates of normal and failed transactions (computed
    as in the sketch below)
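The sketch below shows how such metrics could be computed from a per-transaction trace; the Txn fields and the five-second cutoff are illustrative assumptions, not values taken from the experiments.

```python
# Sketch: OLTP quality-of-service metrics from a per-transaction trace.
from dataclasses import dataclass

@dataclass
class Txn:
    start_s: float    # seconds since the run began
    latency_s: float  # response time in seconds
    aborted: bool     # True if the transaction aborted with an error

def qos_metrics(txns, cutoff_s=5.0):
    """Metrics for one run; assumes a non-empty transaction trace."""
    ok = [t for t in txns if not t.aborted]
    minutes = max(t.start_s for t in txns) / 60.0 or 1.0  # run length
    return {
        "throughput_tpm": len(ok) / minutes,                # normal txns/min
        "abort_rate_tpm": (len(txns) - len(ok)) / minutes,  # failed txns/min
        "slow_fraction": sum(t.latency_s > cutoff_s for t in ok)
                         / max(len(ok), 1),  # share over the latency cutoff
    }
```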

6
Fault injection
  • Disk subsystem faults only
  • realistic fault set based on the Tertiary Disk
    study (see the descriptor sketch after this list)
  • correctable/uncorrectable media errors, hardware
    errors, power failures, disk hangs/timeouts
  • both transient and sticky faults
  • note: similar fault set to the RAID benchmarks
  • injected via an emulated SCSI disk (0.5ms
    overhead)
  • faults injected in one of two partitions
  • database data partition
  • database's write-ahead log partition
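One way to picture this fault set is as a small descriptor handed to the emulated disk. The enums below mirror the dimensions listed above; the names and the interface are assumptions, not the injector's actual API.

```python
# Sketch of a fault descriptor covering the slide's fault dimensions:
# what fails, whether it persists, and which partition it hits.
from dataclasses import dataclass
from enum import Enum

class FaultType(Enum):
    CORRECTABLE_MEDIA = "correctable media error"
    UNCORRECTABLE_MEDIA = "uncorrectable media error"
    HARDWARE_ERROR = "hardware error"
    POWER_FAILURE = "power failure"
    DISK_HANG = "disk hang/timeout"

class Persistence(Enum):
    TRANSIENT = "transient"  # clears after one occurrence
    STICKY = "sticky"        # persists until the disk is reset

class Target(Enum):
    DATA = "database data partition"
    LOG = "write-ahead log partition"

@dataclass(frozen=True)
class Fault:
    kind: FaultType
    persistence: Persistence
    target: Target
    op: str = "read"  # operation to fail: "read" or "write"

# e.g., the fault that later produces a transient glitch (slide 11):
glitch = Fault(FaultType.UNCORRECTABLE_MEDIA, Persistence.STICKY,
               Target.DATA, "read")
```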

7
Experimental setup
  • Database
  • Microsoft SQL Server 2000, default configuration
  • Middleware/front-end software
  • Microsoft COM transaction monitor/coordinator
  • IIS 5.0 web server with Microsoft's tpcc.dll HTML
    terminal interface and business logic
  • Microsoft BenchCraft remote terminal emulator
  • TPC-C-like OLTP order-entry workload
  • 10 warehouses, 100 active users, 860 MB database
  • Measured metrics
  • throughput of correct NewOrder transactions/min
  • rate of aborted NewOrder transactions (txn/min)

8
Experimental setup (2)
  • Configuration (from the slide's system diagram):
  • DB server: Intel P-III/450, 256 MB DRAM, Windows
    2000 AS, running SQL Server 2000, with an Adaptec
    3940 SCSI adapter and an IBM 18 GB 10k RPM disk
    holding the DB data/log partitions
  • front end: AMD K6-2/333, 128 MB DRAM, Windows
    2000 AS, running MS BenchCraft RTE, IIS, MS
    tpcc.dll, and MS COM
  • 100 Mb Ethernet connecting the two machines
  • Database installed in one of two configurations
  • data on emulated disk, log on real (IBM) disk
  • data on real (IBM) disk, log on emulated disk

9
Results
  • All results are from single-fault
    micro-benchmarks
  • 14 different fault types
  • injected once for each of data and log partitions
  • 4 categories of behavior detected (a classifier
    sketch follows this list)
  • 1) normal
  • 2) transient glitch
  • 3) degraded
  • 4) failed
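A rough way to tell these categories apart from throughput alone is sketched below; the dip and settling thresholds are illustrative assumptions, not values used in the study.

```python
# Illustrative classifier for the four behavior categories, based only on
# throughput before and after fault injection.
def classify(pre_tpm, post_trace, dip_frac=0.5, settle_frac=0.9):
    """pre_tpm: steady throughput before injection (txns/min);
    post_trace: throughput samples after injection (at least 3)."""
    settled = sum(post_trace[-3:]) / 3.0  # post-fault steady state
    dipped = min(post_trace) < dip_frac * pre_tpm
    if settled < dip_frac * pre_tpm:
        return "failed"            # throughput collapses and stays down
    if settled < settle_frac * pre_tpm:
        return "degraded"          # system survives but runs slower
    if dipped:
        return "transient glitch"  # brief dip, then full recovery
    return "normal"                # fault tolerated, no visible impact
```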

10
Type 1: normal behavior
  • System tolerates fault
  • Demonstrated for all sector-level faults except
  • sticky uncorrectable read, data partition
  • sticky uncorrectable write, log partition

11
Type 2: transient glitch
  • One transaction is affected and aborts with an
    error
  • Subsequent transactions using the same data would
    fail
  • Demonstrated for one fault only
  • sticky uncorrectable read, data partition

12
Type 3: degraded behavior
  • DBMS survives the error after running log recovery
  • Middleware partially fails, resulting in degraded
    performance
  • Demonstrated for one fault only
  • sticky uncorrectable write, log partition

13
Type 4: failure
  • Example behaviors (10 distinct variants observed)
  • disk hang during a write to the data disk
  • simulated log disk power failure
  • DBMS hangs or aborts all transactions
  • Middleware behaves erratically, sometimes
    crashing
  • Demonstrated for all fatal disk-level faults
  • SCSI hangs, disk power failures

14
Results summary
  • DBMS was robust to a wide range of faults
  • tolerated all transient and recoverable errors
  • tolerated some unrecoverable faults
  • transparently (e.g., uncorrectable data writes)
  • or by reflecting fault back via transaction abort
  • these were not tolerated by the SW RAID systems
  • Overall, DBMS is significantly more robust to
    disk faults than software RAID systems!

15
Results discussion
  • The DBMS's extra robustness comes from:
  • redundant data representation in the form of the
    log
  • transactions (see the sketch at the end of this
    slide)
  • standard mechanism for reporting errors (txn
    abort)
  • encapsulate meaningful unit of work, providing
    consistent rollback upon failure
  • But, middleware was not robust, compromising
    overall system availability
  • crashed or behaved erratically when DBMS
    recovered or returned errors
  • user cannot distinguish DBMS and middleware
    failure
  • system is only as robust as its weakest
    component!

(Presenter's note: compare RAID, whose blocks don't
let you do this.)
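The value of the transaction as a standard error channel can be seen in miniature below: a unit of work either commits or rolls back as a whole, so the client sees one well-defined abort. This is a generic DB-API sketch using the standard-library sqlite3 module, not the SQL Server/COM stack from the experiments.

```python
# Sketch: a transaction encapsulates a meaningful unit of work, and the
# abort gives the client a single, consistent failure signal.
import sqlite3

def new_order(conn, order):
    try:
        with conn:  # commits on success, rolls back the whole txn on error
            conn.execute("INSERT INTO orders(id, item) VALUES (?, ?)", order)
            conn.execute("UPDATE stock SET qty = qty - 1 WHERE item = ?",
                         (order[1],))
        return "committed"
    except sqlite3.Error:
        return "aborted"  # consistent rollback; one meaningful error

conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE orders(id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute("CREATE TABLE stock(item TEXT PRIMARY KEY, qty INTEGER)")
    conn.execute("INSERT INTO stock VALUES ('widget', 10)")
print(new_order(conn, (1, "widget")))  # committed
print(new_order(conn, (1, "widget")))  # aborted: duplicate id rolls back
```

A raw block device offers no such unit: a failed RAID block write has no scope to roll back and no standard channel to report through, which is the contrast the note above draws.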
16
Discussion of methodology
  • General availability benchmarking methodology
    does work on more than just RAID systems
  • Issues in adapting the methodology
  • defining appropriate metrics
  • measuring non-performance availability metrics
  • understanding layered (multi-tier) systems with
    only end-to-end instrumentation

17
Discussion of methodology
DO NOT PROJECT THIS SLIDE!
  • General availability benchmarking methodology
    does work on more than just RAID systems
  • Issues in adapting the methodology
  • defining appropriate metrics
  • metrics to capture database ACID properties
  • adapting binary metrics such as data consistency
    (see the sketch after this list)
  • measuring non-performance availability metrics
  • existing benchmarks (like TPC-C) may not do this
  • understanding layered (multi-tier) systems with
    only end-to-end instrumentation
  • teasing apart availability impact of different
    layers
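As an example of the binary-metric issue, a yes/no data-consistency verdict can be graded by checking content piecewise; check_record below is a hypothetical per-record consistency predicate.

```python
# Sketch: grading a binary metric (database consistent: yes/no) as the
# fraction of content that still passes its consistency check.
def consistency_fraction(records, check_record):
    results = [check_record(r) for r in records]
    return sum(results) / len(results) if results else 1.0
```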

18
Future directions
  • At the last retreat, James Hamilton proposed
    availability/maintainability extensions to TPC
  • This work is a (small) step toward that goal
  • exposed limitations, capabilities of disk fault
    injection
  • revealed importance of middleware, which clearly
    must be considered as part of the benchmark
  • hints at the poor state of the art in TPC-C
    benchmark middleware fault handling
  • Next:
  • expand metrics, including tests of ACID
    properties
  • consider other fault injection points besides
    disks
  • investigate clustered database designs
  • study issues in benchmarking layered systems

19
Thanks!
  • Microsoft SQL Server group
  • for generously providing access to SQL Server
    2000 and the Microsoft TPC-C Benchmark Kit
  • James Hamilton
  • Jamie Redding and Charles Levine

20
Backup slides
21
Example results: failing data disk
  • Transient, correctable read fault (system
    tolerates fault)
  • Sticky, uncorrectable read fault (transaction is
    aborted with error)
  • Disk hang between SCSI commands (DBMS hangs,
    middleware returns errors)
  • Disk hang during a data write (DBMS hangs,
    middleware crashes)
22
Example results: failing log disk
  • Transient, correctable write fault (system
    tolerates fault)
  • Sticky, uncorrectable write fault (DBMS recovers,
    middleware degrades)
  • Simulated disk power failure (DBMS aborts all
    txns with errors)
  • Disk hang between SCSI commands (DBMS hangs,
    middleware hangs)