Title: TEQUILA: Toward Evidence-based QUeueing Inference for Latency Analysis
1TEQUILA: Toward Evidence-based QUeueing Inference for Latency Analysis
Charles Sutton, RAD Lab Jan 7, 2009
2Transient Diagnostic Questions
- Performance models represent: What if?
- This talk: They can also answer
- What happened?
- Ex: What fraction of response time is caused by load?
- Ex: What caused a temporary performance degradation 1 hr ago?
- Ex: What are the bottlenecks of the 1% slowest requests?
3Measurement
- Many questions can be answered by measurement.
- Measurements include
- Response time
- Workload
4Types of Measurement
5Talk Road Map
- Interpret queueing network as a probabilistic model.
- Observe subset of arrivals, departures from running system.
- Reconstruct unobserved measurements
- Compute posterior distribution
- Answer diagnostic questions from reconstructed data
6Generative Probabilistic Models
[Figure: generative model with HIDDEN variables and OBSERVATIONS]
- Define probability distribution
- p(Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)
- Compute posterior distribution
- p(Plague | Buboes, No Runny Nose)
7Graphical Models
- Idea: Represent how hidden variables generate the input
p(Burglary) p(EQ)
p(Alarm | Burglary, EQ)
p(JohnCalls | Alarm) p(MaryCalls | Alarm)
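Multiplying the per-node factors listed on this slide gives the joint distribution over all five variables. Written out (the single-letter abbreviations B, E, A, J, M are shorthand introduced here, not the talk's notation):

```latex
p(B, E, A, J, M) \;=\; p(B)\, p(E)\, p(A \mid B, E)\, p(J \mid A)\, p(M \mid A)
```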
8Inference
- Problem: Compute marginal probabilities given evidence
p(Burglary | John, not Mary) ∝ Σ_{EQ, Alarm} p(Burglary, EQ, Alarm, John, not Mary)
This is a posterior distribution.
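For a network this small, the posterior can be computed by brute-force enumeration: sum the joint over the hidden variables for each value of Burglary, then normalize. The sketch below assumes illustrative conditional probability values (the familiar textbook numbers for this alarm example); they are not taken from the talk, and the function names are hypothetical.

```python
from itertools import product

# Illustrative probability tables. These values are assumptions for the
# example, not figures from the talk.
P_B = 0.001                      # p(Burglary = True)
P_E = 0.002                      # p(EQ = True)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # p(Alarm = True | B, EQ)
P_J = {True: 0.90, False: 0.05}  # p(JohnCalls = True | Alarm)
P_M = {True: 0.70, False: 0.01}  # p(MaryCalls = True | Alarm)


def bern(p_true, value):
    """Probability of a boolean value given p(value is True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """Joint probability as the product of the per-node factors."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def posterior_burglary(john, mary):
    """p(Burglary = True | JohnCalls = john, MaryCalls = mary)."""
    unnorm = {}
    for b in (True, False):
        # Sum out the hidden variables EQ and Alarm.
        unnorm[b] = sum(joint(b, e, a, john, mary)
                        for e, a in product((True, False), repeat=2))
    return unnorm[True] / (unnorm[True] + unnorm[False])


if __name__ == "__main__":
    print(posterior_burglary(john=True, mary=False))
```

Enumeration is exponential in the number of hidden variables, so it only serves to make the definition concrete; it is not a practical method for a queueing network with many hidden arrival times.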
9Queueing Networks
- Model each component as a queue
[Figure: a Processor modeled as an M/M/1 queue]
- For each task k
- Arrival time
- Departure time
- Service time
- Waiting time
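The four per-task quantities are tied together deterministically. For a single-processor FIFO queue, a task starts service when it has arrived and the previous task has departed, so departure and waiting times follow from arrivals and service times. A minimal sketch, assuming an M/M/1 queue; the function names and rate parameters are hypothetical:

```python
import random


def simulate_fifo_queue(arrivals, services):
    """Single-server FIFO queue: return (departures, waits) per task.

    arrivals must be sorted in increasing order.
    """
    departures, waits = [], []
    prev_departure = float("-inf")
    for arrival, service in zip(arrivals, services):
        start = max(arrival, prev_departure)  # wait for the previous task
        waits.append(start - arrival)
        prev_departure = start + service
        departures.append(prev_departure)
    return departures, waits


def mm1_trace(n, arrival_rate, service_rate, seed=0):
    """Draw n Poisson arrivals and exponential service times (M/M/1)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)
        arrivals.append(t)
    services = [rng.expovariate(service_rate) for _ in range(n)]
    return arrivals, services


if __name__ == "__main__":
    arrivals, services = mm1_trace(1000, arrival_rate=0.8, service_rate=1.0)
    departures, waits = simulate_fifo_queue(arrivals, services)
    response = sum(d - a for a, d in zip(arrivals, departures))
    print("fraction of response time spent waiting:", sum(waits) / response)
```

The last line computes one of the diagnostic quantities from the start of the talk: the share of response time due to waiting (load) rather than service.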
10Queueing Networks
- Model distributed system as network of queues
- Example two-tier web application
[Figure: network of queues labeled Web Server, Network, and DB]
11Queueing Networks
- A finite state machine describes a task's path through the system.
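One way to picture the finite state machine: each state corresponds to a tier, a transition table says which tier comes next, and each tier has one or more queues a task may be routed to. The sketch below is an assumption-laden illustration; the state names, queue names, transition probabilities, and uniform load balancing are all made up for the example.

```python
import random

# Hypothetical FSM for a two-tier request. Every name and probability
# here is an assumption, not the talk's model.
TRANSITIONS = {
    "start":   [("web", 1.0)],
    "web":     [("network", 1.0)],
    "network": [("db", 1.0)],
    "db":      [("done", 1.0)],
}

# Queues available in each state; a task picks one uniformly at random.
QUEUES = {
    "web": ["web1", "web2", "web3"],
    "network": ["network"],
    "db": ["db1", "db2"],
}


def sample_path(rng):
    """Sample the sequence of queues one task visits."""
    state, path = "start", []
    while True:
        next_states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=probs)[0]
        if state == "done":
            return path
        path.append(rng.choice(QUEUES[state]))


if __name__ == "__main__":
    print(sample_path(random.Random(0)))
```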
12Probabilistic Modeling
- Now we have a probability distribution over arrivals, departures, service times, and waiting times.
13Probabilistic Modeling
- Arrival and departure times can be instrumented.
- Question: Can we reconstruct missing arrivals?
- Answer: Use the posterior distribution
- p(hidden arrivals | arrivals I observed)
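To make the posterior concrete, consider the simplest possible case: one M/M/1 FIFO queue in which a single task's arrival time is hidden and everything around it is observed. With Poisson arrivals, the two interarrival densities on either side of the hidden arrival multiply to a constant, so the only term that depends on it is the task's own service time, departure minus max(arrival, previous departure). The sketch below evaluates that conditional on a grid. It is an illustration under these assumptions, not the inference algorithm from the talk, and all names (`a_prev`, `a_next`, `d_prev`, `d_k`, `mu`) are hypothetical.

```python
import math


def hidden_arrival_posterior(a_prev, a_next, d_prev, d_k, mu, n_grid=1000):
    """Grid approximation of p(a_k | neighbors) in an M/M/1 FIFO queue.

    a_prev, a_next: observed arrivals of the tasks before and after task k
    d_prev, d_k:    observed departures of task k-1 and task k
    mu:             exponential service rate
    Returns (grid, probs) with probs summing to 1.
    """
    lo = a_prev
    hi = min(a_next, d_k)  # task k must arrive before it departs
    if hi <= lo:
        raise ValueError("no feasible arrival time")
    step = (hi - lo) / n_grid
    grid = [lo + (i + 0.5) * step for i in range(n_grid)]
    # Service time implied by each candidate arrival time; its exponential
    # density is the only factor that varies with the hidden arrival.
    weights = [mu * math.exp(-mu * (d_k - max(a, d_prev))) for a in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


if __name__ == "__main__":
    grid, probs = hidden_arrival_posterior(0.0, 10.0, 4.0, 6.0, mu=1.0)
    mean = sum(a * p for a, p in zip(grid, probs))
    print("posterior mean arrival time:", mean)
```

The resulting density is flat while the candidate arrival falls before the previous departure (the task would have waited, so its service time is fixed) and rises afterwards (a later arrival implies a shorter, more probable service time).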
14Progress from Last Retreat
- Last retreat
- All queues single processor
- FIFO
- Service times exponential
- This retreat
- Arbitrarily many processors
- FIFO and random
- Arbitrary service distributions
15Programmer's Perspective
- Programmer supplies
- 1. Structure of queueing network
[Figure: queueing network with Rails 1, Rails 2, Rails 3, and DB queues]
DB: 5 processors, FIFO, log-normal service distribution
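A network structure of this kind could be written down as a small declarative specification. The sketch below is hypothetical: `QueueSpec` and `NetworkSpec` are names invented for illustration, not the tool's API. Only the DB parameters come from the slide; the Rails parameters and the Rails-to-DB routes are placeholder assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class QueueSpec:
    name: str
    processors: int
    discipline: str    # "FIFO" or "random"
    service_dist: str  # name of the service-time distribution family


@dataclass
class NetworkSpec:
    queues: dict = field(default_factory=dict)
    routes: list = field(default_factory=list)  # (from_queue, to_queue) pairs

    def add_queue(self, spec):
        self.queues[spec.name] = spec

    def add_route(self, src, dst):
        for name in (src, dst):
            if name not in self.queues:
                raise KeyError(f"unknown queue: {name}")
        self.routes.append((src, dst))


def example_network():
    net = NetworkSpec()
    # DB parameters as stated on the slide.
    net.add_queue(QueueSpec("DB", processors=5, discipline="FIFO",
                            service_dist="lognormal"))
    for i in (1, 2, 3):
        # Rails parameters and routing are assumptions for illustration.
        net.add_queue(QueueSpec(f"Rails {i}", processors=1,
                                discipline="FIFO", service_dist="exponential"))
        net.add_route(f"Rails {i}", "DB")
    return net


if __name__ == "__main__":
    for spec in example_network().queues.values():
        print(spec)
```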
16Programmer's Perspective
- Programmer supplies
- 2. Arrivals and departures measured in production
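Since only a subset of arrivals and departures is observed, the measurement log can be thought of as a list of per-task, per-queue records in which unobserved timestamps are simply missing. The sketch below shows one hypothetical record layout and a helper that keeps timestamps for a random fraction of tasks; the `Event` type and `subsample` function are assumptions, not the format used in the talk.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    task_id: int
    queue: str
    arrival: Optional[float]    # None when not observed
    departure: Optional[float]  # None when not observed


def subsample(events, fraction, seed=0):
    """Keep timestamps for a random fraction of tasks; blank out the rest."""
    rng = random.Random(seed)
    task_ids = sorted({e.task_id for e in events})
    n_keep = max(1, round(fraction * len(task_ids)))
    kept = set(rng.sample(task_ids, n_keep))
    return [e if e.task_id in kept
            else Event(e.task_id, e.queue, None, None)
            for e in events]


if __name__ == "__main__":
    full = [Event(t, q, float(t), float(t) + 0.5)
            for t in range(20) for q in ("Rails", "DB")]
    observed = subsample(full, fraction=0.1)
    print(sum(e.arrival is not None for e in observed), "of", len(observed))
```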
17Reconstruction Accuracy
[Figure: reconstruction accuracy for Service and Waiting times, under Exponential and Log Normal service distributions]
18Example Cloudstone
- Cloudstone running on EC2
- 5 VMs, Load up to 20 req/s
- Model: Each thread (thin) is a single-processor queue
- Data: For each request, log
- Time spent in Rails
- Time spent in database
19Example Cloudstone
- Cloudstone running on EC2
- 5 VMs, Load up to 20 req/s
- Workload
20 1. Visualization
- Performance bottlenecks over time
21 2. Hidden Resources
- Common performance bug
- Blocking on a resource you shouldn't
- Approach: Model selection
[Figure: MODEL 2 (PERFORMANCE BUG), showing Rails, VM1, VM2, and DB queues]
22 2. Hidden Resources
[Figure: MODEL 2 (PERFORMANCE BUG), showing Rails, DB, and VM queues]
23 2. Hidden Resources
[Figure: MODEL 1 (NORMAL), showing Rails and DB queues, beside MODEL 2 (PERFORMANCE BUG), showing Rails, DB, and VM1 queues]
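Model selection here means scoring the same measurements under both candidate networks and preferring the one that explains them better. The sketch below is a heavily simplified stand-in: it assumes fully observed arrivals and departures, exponential service times, and it collapses Model 2 to a single hidden resource on which all Rails threads serialize. It then compares maximized log-likelihoods of the service times each model implies. The function names and data are hypothetical, and the talk's actual procedure is not specified on the slides.

```python
import math


def implied_services(tasks, shared):
    """Service times implied by a trace under one of two candidate models.

    tasks:  list of (thread, arrival, departure), sorted by arrival
    shared: False -> each thread is its own single-server FIFO queue
            True  -> all threads serialize on one hidden resource
    """
    last_departure = {}
    services = []
    for thread, arrival, departure in tasks:
        key = None if shared else thread
        start = max(arrival, last_departure.get(key, float("-inf")))
        services.append(departure - start)
        last_departure[key] = departure
    return services


def exp_loglik(services):
    """Exponential log-likelihood at the maximum-likelihood rate.

    Returns -inf when a service time is not positive, meaning the model
    cannot have produced the trace.
    """
    if min(services) <= 0:
        return float("-inf")
    n = len(services)
    mean = sum(services) / n
    return -n * (math.log(mean) + 1.0)


if __name__ == "__main__":
    # Made-up trace in which requests on two threads never overlap in service.
    trace = [("r1", 0.0, 1.0), ("r2", 0.2, 2.0),
             ("r1", 1.1, 3.0), ("r2", 2.1, 4.0)]
    print("model 1:", exp_loglik(implied_services(trace, shared=False)))
    print("model 2:", exp_loglik(implied_services(trace, shared=True)))
```

Both models have one fitted parameter, so comparing maximized likelihoods directly is reasonable in this toy setting; with models of different complexity a penalized criterion would be the more cautious choice.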
24Summary
- Model-based diagnosis
- WHY: A model lets you reason about aspects of the system state that you can't measure.
- HOW: Do that reasoning using algorithms from machine learning
- Accurate reconstruction from 10% of possible log data
25What Next?
- Modeling different applications
- Distributed file systems (e.g., Hadoop DFS)
- Network traffic
- SCADS?
- Feedback between queues
- Online, distributed inference
- Converting code → performance model