Title: Testing the Limits of a Transactional Networked Service
TESTING THE LIMITS OF A TRANSACTIONAL NETWORKED SERVICE
BY BOWEI DU
INTRODUCTION
- One of the defining characteristics of a cloud service is scale, and with scale comes the question of performance and cost. How efficient are the software systems that we run? How many computing resources are required to meet our current demands, and how much more will be required in the future?
- At Instart Logic, we have created a system called Lava that enables us to measure and test the scalability limits of our systems. Lava is focused on transactional networked services: systems that serve independent requests sent over a network from a large number of clients. Examples include HTTP frontends, data caches and API endpoints.
- Performance measurement is a deep topic with many facets. Lava seeks to solve a specific slice of the performance measurement problem: how can we quickly find the maximum load a service can handle? While there are many open-source tools for stress testing today, we found most of them too inflexible and slow to use for this purpose. This poses a problem, as we have a large space of experimental parameters to explore during system stress tests.
- Lava decomposes this problem into two pieces:
  - a set of extensible, protocol-specific agents that generate a controllable amount of load on the system under test
  - a control function that uses feedback from metrics generated by the stress test to find system limits.
- While the ideas used in the Lava system are not novel, we feel that the particular combination of features used will be interesting to a broader audience.
BACKGROUND
The most important metrics for our Lava use cases are throughput and latency. Throughput is the rate at which requests can be processed, and latency is the time from the start of a request to the reception of its response.
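As a concrete illustration of how these two metrics are computed, the sketch below (hypothetical code, not part of Lava) measures per-request latency with std::chrono and derives throughput from the number of requests completed over the elapsed run time; send_request_and_wait(), record_latency() and record_throughput() are assumed stand-ins.

#include <chrono>
#include <ratio>

// Hypothetical helpers, assumed to exist elsewhere; they are not part of Lava.
void send_request_and_wait();   // one blocking request/response round trip
void record_latency(double ms);
void record_throughput(double rps);

void measure(int num_requests) {
  using Clock = std::chrono::steady_clock;
  const Clock::time_point run_start = Clock::now();
  for (int i = 0; i < num_requests; ++i) {
    const Clock::time_point start = Clock::now();
    send_request_and_wait();
    // Latency: time from the start of the request to reception of the response.
    record_latency(
        std::chrono::duration<double, std::milli>(Clock::now() - start).count());
  }
  const double elapsed_s =
      std::chrono::duration<double>(Clock::now() - run_start).count();
  // Throughput: requests processed per unit time over the whole run.
  record_throughput(num_requests / elapsed_s);
}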
- Figure 1 is a graph of the typical response-time behavior with respect to increasing request volume. Service response time is stable under an increasing request rate until we reach a saturation point, at which the service can no longer keep up with the request ingress rate. Beyond the saturation point, internal queues overflow and service response times degrade past acceptable thresholds.
- It is important to know what the saturation point is for each of our services. In development, we use the results obtained from Lava to find performance regressions and guide our performance improvement efforts. In production, we use these results for capacity planning, as services need to be provisioned with enough headroom to absorb service failures and request spikes.
- There are many existing performance frameworks for network protocols such as HTTP, the main ones being Tsung, Apache Bench, Siege and JMeter. We encountered the following issues with these frameworks:
- First, many of the available frameworks run a set workload without any feedback mechanism for load control. Our stress runs can be sensitive around the saturation point, and slightly too much load can cause high variance in the output, leading to unstable results.
- The lack of feedback also meant that finding the saturation point required many runs of the stress tools, probing at different load levels. Even with a guided binary search, this proved too slow to be viable for exploring large sets of experimental parameters.
- Finally, while this is not fundamental, we found that the Lava system was simple enough that implementing these mechanisms within our own framework did not incur undue engineering cost.
DESIGN
- The Lava system (Figure 2) consists of two main components:
  - A set of agents running on worker threads that generate application-specific load. For example, in a stress test of an HTTP frontend, each agent's state machine executes a sequence of HTTP request/response interactions. For saturation-point measurement, each agent generates a constant number of requests per second for easy load control.
  - A control function component that receives real-time metrics aggregated from the state machines and adjusts the parameters of the stress run. The control function manages the number of state machines that are active and the state of the Lava system overall.
- Each Lava run consists of three phases: ramp-up, search and measurement. During the ramp-up phase, the Lava control function steadily increases the number of active agents until a metric threshold has been exceeded. The ramp-up phase is not strictly necessary; however, we have found it useful to distinguish for debugging purposes. Lava then transitions to the search phase, in which the number of agents is varied up and down around the saturation point to find the maximum load that still meets the threshold. When the search phase has stabilized, Lava transitions to the measurement phase, in which the number of agents is held constant for a configurable time period. During the measurement phase, all metrics should be stable; if high variance occurs, it is an indication that something is wrong with either the system under test or the test setup itself. Figure 3 shows the agent count and metric graphs for each of the phases.
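A minimal sketch of how such a three-phase loop could be wired to a control function is shown below. Phase, ramp_step and the helper functions are illustrative assumptions rather than Lava's actual internals, and Controller and Metrics refer to the interfaces shown later in this article.

// Hypothetical sketch of the ramp-up / search / measurement loop.
// Controller and Metrics are the interfaces shown later in this article;
// the other names are illustrative stand-ins, not Lava's real API.
Metrics collect_window_metrics();         // sliding-window metrics from all agents
void set_active_agent_count(int agents);  // start/stop agents to match the target
bool measurement_period_elapsed();        // configurable hold time reached?

enum class Phase { RAMP_UP, SEARCH, MEASUREMENT, DONE };

void run_stress_test(Controller* controller, int ramp_step) {
  Phase phase = Phase::RAMP_UP;
  int agents = ramp_step;
  while (phase != Phase::DONE) {
    set_active_agent_count(agents);
    Metrics metrics = collect_window_metrics();
    Controller::Signal signal = controller->update(&metrics);
    switch (phase) {
      case Phase::RAMP_UP:
        // Add agents steadily until the metric threshold is first exceeded.
        if (signal == Controller::DECREASE) phase = Phase::SEARCH;
        else agents += ramp_step;
        break;
      case Phase::SEARCH:
        // Vary the agent count around the saturation point until it stabilizes.
        if (signal == Controller::INCREASE) ++agents;
        else if (signal == Controller::DECREASE) --agents;
        else phase = Phase::MEASUREMENT;
        break;
      case Phase::MEASUREMENT:
        // Hold the load constant for a configurable period and record results.
        if (measurement_period_elapsed()) phase = Phase::DONE;
        break;
      case Phase::DONE:
        break;
    }
  }
}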
Each agent in Lava simulates a constant-rate workload from a client. By increasing or decreasing the number of active agents, Lava can adjust the amount of load placed on the system under test. Each agent has (modulo code transformations to facilitate non-blocking I/O) the following inner loop:
void Agent::run() {
  while (true) {
    Operation* op = create_next_operation();
    op->run();        // execute one request/response interaction
    sleep(1 / rate);  // pace requests to maintain a constant rate
  }
}
Agents can be implemented as extensions in C++ or via the Lua scripting language. In addition to system-limit exploration, we have also implemented agents that replay request traces taken from production.
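As an illustration of what a protocol-specific extension might look like, here is a hypothetical HTTP agent built on the inner loop above; the shape of the Agent base class, the HttpGetOperation type and the constructor parameters are assumptions for this sketch, not Lava's published API.

#include <string>
#include <utility>

// Hypothetical HTTP agent; the base-class shape and HttpGetOperation are
// assumptions for illustration, not Lava's published API.
class HttpGetAgent : public Agent {
 public:
  HttpGetAgent(std::string url, double rate)
      : Agent(rate), url_(std::move(url)) {}

 protected:
  // Invoked once per iteration of Agent::run() to produce the next
  // request/response interaction to execute.
  Operation* create_next_operation() override {
    return new HttpGetOperation(url_);
  }

 private:
  std::string url_;
};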
METRICS AND CONTROL FUNCTIONS
We track an extensible set of metrics from the
active agents and feed them to a control function
that determines how to adjust the load. Metrics
are tracked by each agent and aggregated by the
central control function component.
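As a rough illustration of per-agent tracking with central aggregation, the sketch below keeps raw latency samples in a window that each agent reports and the control component merges before reading a percentile; a real implementation would more likely use histograms, and none of these names come from Lava.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical latency window; each agent records into its own copy and the
// central control component merges them before evaluating percentiles.
struct LatencyWindow {
  std::vector<double> samples_ms;

  void record(double latency_ms) { samples_ms.push_back(latency_ms); }

  // Fold one agent's samples into the aggregated window.
  void merge(const LatencyWindow& other) {
    samples_ms.insert(samples_ms.end(),
                      other.samples_ms.begin(), other.samples_ms.end());
  }

  // Percentile over the current window, e.g. p(0.95) for the 95th percentile.
  double p(double quantile) {
    if (samples_ms.empty()) return 0.0;
    std::sort(samples_ms.begin(), samples_ms.end());
    std::size_t idx =
        static_cast<std::size_t>(quantile * (samples_ms.size() - 1));
    return samples_ms[idx];
  }
};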
class Controller {
 public:
  enum Signal { STABLE, DECREASE, INCREASE };
  virtual Signal update(const Metrics* metrics) = 0;
};
For most applications, we have found that a simple linear controller tracking a moving window of the 95th/99th percentile operation latency suffices:
Controller::Signal LinearController::update(const Metrics* metrics) {
  // Compare the windowed 95th percentile latency against the configured limit.
  double delta = metrics->p95_latency() - limit;
  if (delta > epsilon) return DECREASE;   // too far over the limit: shed load
  if (delta < -epsilon) return INCREASE;  // comfortably under the limit: add load
  return STABLE;
}
More sophisticated control functions with faster convergence are possible, but we have not yet explored them.
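Purely as an illustration of what a faster-converging controller might look like, the sketch below takes proportional steps based on how far the measured latency is from the limit; note that it returns a step size rather than the Signal enum above, so this interface is an assumption and not part of Lava.

// Illustrative proportional-style controller; the step-returning interface is
// an assumption and not part of Lava's Controller API shown above.
class ProportionalController {
 public:
  ProportionalController(double limit, double gain)
      : limit_(limit), gain_(gain) {}

  // Suggested change in agent count: positive when latency is under the limit,
  // negative when over it, and larger in magnitude the farther away it is.
  int update(const Metrics* metrics) const {
    double error = limit_ - metrics->p95_latency();
    return static_cast<int>(gain_ * error);
  }

 private:
  double limit_;
  double gain_;
};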
EXAMPLE
Figure 4 shows a sample result from a Lava run testing an HTTP-based service. In this graph, we set a threshold of 2 milliseconds for the 95th percentile latency with the linear control function. The top graph shows the throughput we are getting from the system. The middle graph shows the sliding-window metrics we are measuring. Note that the metrics can vary due to inherent system variability and randomness. The bottom graph shows the number of agents that are active through the run. We can see Lava transition through the ramp-up, search and measurement phases in the agent graph.
CONCLUSION
Lava is currently being used to stress test all major systems at Instart Logic, replacing all third-party stress-test frameworks. Adoption of Lava has reduced the time taken for a single stress-test experiment by an order of magnitude. For example, our HTTP-based stress tests using Tsung and binary search took around twenty minutes to converge; a similar run using Lava can converge in under five minutes. We are in the process of open-sourcing our Lava software, as we feel the feedback-control-based stress-test framework is widely applicable and useful. To read additional technical content from the Instart Logic engineering team, visit our technology blog.