Latency as a Performability Metric: Experimental Results

1
Latency as a Performability Metric: Experimental
Results
  • Pete Broadwell
  • pbwell@cs.berkeley.edu

2
Outline
  • Motivation and background
  • Performability overview
  • Project summary
  • Test setup
  • PRESS web server
  • Mendosus fault injection system
  • Experimental results and analysis
  • How to represent latency
  • Questions for future research

3
Performability overview
  • Goal of ROC project: develop metrics to evaluate
    new recovery techniques
  • Performability: a class of metrics describing how
    a system performs in the presence of faults
  • First used in the fault-tolerant computing field [1]
  • Now being applied to online services

[1] J. F. Meyer, "Performability Evaluation: Where
It Is and What Lies Ahead," 1994.
4
Example microbenchmark: RAID disk failure
5
Project motivation
  • Rutgers study: performability analysis of a web
    server, using throughput
  • Other studies (esp. from HP Labs Storage group)
    also use response time as a metric
  • Assertion: latency and data quality are better
    than throughput for describing user experience
  • How best to represent latency in performability
    reports?

6
Project overview
  • Goals
    • Replicate the PRESS/Mendosus study with
      response-time measurements
    • Discuss how to incorporate latency into
      performability statistics
  • Contributions
    • Provide a latency-based analysis of a web
      server's performability (currently rare)
    • Further the development of more comprehensive
      dependability benchmarks

7
Experiment components
  • The Mendosus fault injection system
    • From Rutgers (Rich Martin)
    • Goal: low-overhead emulation of a cluster of
      workstations, with injection of likely faults
  • The PRESS web server
    • Cluster-based, uses cooperative caching; designed
      by Carrera et al. (Rutgers)
    • Perf-PRESS: basic version
    • HA-PRESS: adds heartbeats and a master node for
      automated cluster management
  • Client simulators
    • Submit a set number of requests/sec, based on real traces

8
Mendosus design
[Architecture diagram: a global controller (Java) coordinates a user-level
daemon (Java) on each workstation (real or VM); each daemon drives apps plus
a modified NIC driver, SCSI module, and proc module over an emulated LAN,
configured by fault, LAN-emulation, and application config files.]
9
Experimental setup
10
Fault types
Category     Fault                  Possible root cause
Node         Node crash             Operator error, OS bug, hardware component failure, power outage
Node         Node freeze            OS or kernel module bug
Application  App crash              Application bug or resource unavailability
Application  App hang               Application bug or resource contention with other processes
Network      Link down or flaky     Broken, damaged, or misattached cable
Network      Switch down or flaky   Damaged or misconfigured switch, power outage
11
Test case timeline
  • Warm-up time: 30-60 seconds
  • Time to repair: up to 90 seconds
12
Simplifying assumptions
  • Operator repairs any non-transient failure after
    90 seconds
  • Web page size is constant
  • Faults are independent
  • Each client request is independent of all others
    (no sessions!)
  • Request arrival times are determined by a Poisson
    process (not self-similar)
  • Simulated clients abandon connection attempts
    after 2 secs and give up on page loads after 8 secs
    (sketched below)
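A minimal sketch of the simulated-client behavior these assumptions describe, assuming Python; the constants, function names, and outcome labels are illustrative, not the study's actual simulator code:

    import random

    CONNECT_ABANDON_S = 2.0   # abandon connection attempt after 2 s
    LOAD_ABANDON_S = 8.0      # give up on page load after 8 s

    def poisson_arrivals(rate_per_s, duration_s):
        """Yield request arrival times with exponential gaps (a Poisson process)."""
        t = 0.0
        while True:
            t += random.expovariate(rate_per_s)
            if t >= duration_s:
                return
            yield t

    def classify(connect_time_s, response_time_s):
        """Map one request's timings to the outcomes the experiments count."""
        if connect_time_s > CONNECT_ABANDON_S:
            return "aborted connection"
        if response_time_s > LOAD_ABANDON_S:
            return "user timeout"
        return "served"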

13
Sample result: app crash
[Figure: throughput and latency traces for Perf-PRESS vs. HA-PRESS during an app-crash fault]
14
Sample result: node hang
[Figure: throughput and latency traces for Perf-PRESS vs. HA-PRESS during a node-hang fault]
15
Representing latency
  • Total seconds of wait time
    • Not good for comparing cases with different
      workloads
  • Average (mean) wait time per request
    • OK, but requires that the expected (normal)
      response time be given separately
  • Variance of wait time
    • Not very intuitive to describe; also, a read-only
      workload means that all variance is toward longer
      wait times anyway (see the sketch below)
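A small helper, assuming Python, that computes all three candidate representations from a list of per-request wait times (the function and field names are illustrative):

    from statistics import mean, pvariance

    def latency_summary(wait_times_s):
        """Summarize per-request wait times (seconds) in the three ways above."""
        return {
            "total_wait_s": sum(wait_times_s),       # workload-dependent
            "mean_wait_s": mean(wait_times_s),       # needs the normal response time for context
            "variance_s2": pvariance(wait_times_s),  # skewed upward for read-only workloads
        }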

16
Representing latency (2)
  • Consider goodput-based availability: total
    responses served / total requests
  • Idea: latency-based punctuality = ideal total
    latency / actual total latency
  • Like goodput, its maximum value is 1
  • Ideal total latency = average latency for
    non-fault cases × total requests (shouldn't be 0)
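These definitions translate directly into code; a sketch assuming Python, with illustrative names:

    def availability(responses_served, total_requests):
        """Goodput-based availability: total responses served / total requests."""
        return responses_served / total_requests

    def punctuality(latencies_s, baseline_mean_latency_s):
        """Latency-based punctuality: ideal total latency / actual total latency.

        baseline_mean_latency_s is the average per-request latency measured
        in non-fault runs, and must be nonzero.
        """
        ideal_total = baseline_mean_latency_s * len(latencies_s)
        return ideal_total / sum(latencies_s)  # like goodput, at most 1 in practice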

17
Representing latency (3)
  • Aggregate punctuality ignores brief, severe
    spikes in wait time (bad for user experience)
  • Can capture these in a separate statistic (e.g., 1
    of 100k responses took >8 sec)
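One way to report such spikes, assuming Python; the 8-second default matches the user-timeout limit above, but the name and signature are illustrative:

    def tail_fraction(latencies_s, threshold_s=8.0):
        """Fraction of responses slower than a threshold, reported alongside
        punctuality to expose brief, severe latency spikes."""
        slow = sum(1 for t in latencies_s if t > threshold_s)
        return slow / len(latencies_s)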

18
Availability and punctuality
19
Other metrics
  • Data quality, latency, and throughput are
    interrelated
  • Is a 5-second wait for a response worse than
    waiting 1 second to get a "try back later"?
  • To combine DQ, latency, and throughput, one can use
    a demerit system (proposed by Keynote) [1]
  • These can be very arbitrary, so it's important
    that the demerit formula be straightforward and
    publicly available

[1] Zona Research and Keynote Systems, "The Need for
Speed II," 2001.
20
Sample demerit system
  • Rules
    • Each aborted (>2 s) conn: 2 demerits
    • Each conn error: 1 demerit
    • Each user timeout (>8 s): 8 demerits
    • Each sec of total latency above the ideal level:
      (1 demerit / total requests) × scaling factor
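A direct encoding of these rules, assuming Python; the slide leaves the scaling factor open, so its default here is an assumption:

    def demerits(aborted_conns, conn_errors, user_timeouts,
                 total_latency_s, ideal_latency_s, total_requests,
                 scaling_factor=1.0):
        """Score one test run with the sample demerit rules above."""
        score = 2 * aborted_conns + conn_errors + 8 * user_timeouts
        excess_s = max(0.0, total_latency_s - ideal_latency_s)
        score += excess_s * (1.0 / total_requests) * scaling_factor
        return score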

21
Online service optimization
[Diagram: performance metrics (throughput, latency, data quality) in relation
to the environment (workload, faults)]
22
Conclusions
  • Latency-based punctuality and throughput-based
    availability give similar results for a read-only
    web workload
  • The applied workload is very important
    • Reliability metrics do not (and should not)
      reflect maximum performance/workload!
  • Latency did not degrade gracefully in proportion
    to workload
    • At high loads, PRESS oscillates between full
      service and 100% load shedding

23
Further Work
  • Combine test results with predicted component
    failure rates to get long-term performability
    estimates (are these useful?)
  • Further study will benefit from more
    sophisticated client workload simulators
  • Services that generate dynamic content should
    yield more interesting data (e.g., RUBiS)

24
Latency as a Performability Metric: Experimental
Results
  • Pete Broadwell
  • pbwell@cs.berkeley.edu

25
Example long-term model
Discrete-time Markov chain (DTMC) model of a
RAID-5 disk array [1]
  • p_i(t): probability that the system is in state i at time t
  • D: number of data disks
  • w_i(t): reward (disk I/O operations/sec)
  • μ: disk repair rate
  • λ: failure rate of a single disk drive

[1] Hannu H. Kari, Ph.D. thesis, Helsinki
University of Technology, 1997.
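The transcript gives the model's symbols but not its transition structure, so the following sketch, assuming Python with NumPy, fills in a simple three-state RAID-5 chain (all disks up, one disk failed, data loss) with illustrative rates and rewards, purely to show how p_i(t) and the expected reward would be computed:

    import numpy as np

    D = 4         # data disks (plus one parity disk)
    lam = 1e-5    # per-step failure probability of one disk (assumed)
    mu = 1e-2     # per-step repair probability (assumed)
    n = D + 1     # total disks

    # States: 0 = all disks up, 1 = one disk failed, 2 = data loss (absorbing)
    P = np.array([
        [1 - n * lam, n * lam,          0.0],
        [mu,          1 - mu - D * lam, D * lam],
        [0.0,         0.0,              1.0],
    ])
    w = np.array([100.0, 60.0, 0.0])  # w_i: assumed reward, disk I/O ops/sec

    p0 = np.array([1.0, 0.0, 0.0])    # start with all disks working
    t = 100_000
    p_t = p0 @ np.linalg.matrix_power(P, t)  # p_i(t) for each state i
    print("expected reward at time t:", p_t @ w)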