Title: Performance Debugging for Distributed Systems of Black Boxes


1
Performance Debugging for Distributed Systems of
Black Boxes
  • Marcos K. Aguilera
  • Jeffrey C. Mogul
  • Janet L. Wiener
  • HP Labs
  • Patrick Reynolds, Duke
  • Athicha Muthitacharoen, MIT
  • WISP 2004
  • 11 November 2004

2
Example multi-tier system
3
Motivation
  • Complex distributed systems are built from black
    box components
  • These systems may have performance problems
  • High or erratic latency
  • Locating the causes of these problems is hard
  • We can't always examine or modify system
    components
  • We need tools to infer where bottlenecks are
  • Choose which black boxes to open

4
Contributions of our work
  • Tools to highlight which black boxes have
    problems
  • Require only passive information, such as packet
    traces
  • Infer where most of the time is spent from traces
  • Person can then use more invasive tools to
    examine those boxes
  • Reduce time and cost to debug complex systems
  • Improve quality of delivered systems

5
Example causal path
6
Goals of our tools
  • Find high-impact causal paths through a
    distributed system
  • Causal path: a series of nodes that sent/received
    messages
  • Each message is caused by receipt of previous
    message
  • Some causal paths occur many times
  • High impact:
  • Occurs frequently
  • Contributes significantly to overall latency
  • Without modifications or semantic knowledge
  • Report per-node latencies on causal paths

7
Overview of our approach
  • Obtain traces of messages between components
  • Ethernet packets, middleware messages, etc.
  • Collect traces as non-invasively as possible
  • Analyze traces using algorithms
  • Visualize results and highlight high-impact paths
  • Requires very little information:
  • timestamp, source, destination
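A minimal sketch of what such a trace could look like in code, assuming whitespace-separated "time from to" lines. The `TraceRecord` type and the `parse_trace` helper are hypothetical names chosen for illustration and are not part of the authors' tools.

```python
from typing import List, NamedTuple


class TraceRecord(NamedTuple):
    """One observed message: when it was seen, who sent it, who received it."""
    timestamp: float
    source: str
    destination: str


def parse_trace(text: str) -> List[TraceRecord]:
    """Parse 'time from to' lines, skipping blanks, headers, and malformed rows."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            timestamp = float(parts[0])
        except ValueError:
            continue  # e.g. a header row such as "Time From To"
        records.append(TraceRecord(timestamp, parts[1], parts[2]))
    records.sort(key=lambda r: r.timestamp)
    return records


sample = """Time From To
0.01 A B
0.02 A B
0.04 B D
0.05 C F
"""
trace = parse_trace(sample)
print(len(trace), trace[2])
```

Nothing else about the messages (payload, protocol, request IDs) is needed, which is what keeps the collection passive.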

8
Outline
  • Problem statement and goals
  • Overview of our approach
  • Algorithm
  • Experimental results
  • Related work
  • Conclusions

9
The convolution algorithm: input
Time  From  To
0.01  A     B
0.02  A     B
0.04  B     D
0.05  C     F
...
10
The convolution algorithm: output
[Figure: output graph; numeric labels recovered from the slide: .15, .10, 0]
11
Basic idea
  • Creates a time signal for messages from each
    node
  • Given time signals S1(t) (A→B) and S2(t) (B→X),
    computes the convolution of S2(t) and S1(-t),
    written S1 ⊗ S2
  • (can be computed quickly using fast Fourier
    transforms)
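The idea can be sketched in plain Python under some stated assumptions: timestamps are integers (say, milliseconds), each time signal is a per-bin message count, and a small textbook radix-2 FFT stands in for whatever FFT routine the authors used. The names `time_signal`, `fft`, and `cross_correlate` are illustrative only.

```python
import cmath


def time_signal(timestamps, n_bins, bin_width=1):
    """Return a list holding the number of messages seen in each time bin."""
    signal = [0.0] * n_bins
    for t in timestamps:
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:
            signal[idx] += 1.0
    return signal


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate(s1, s2):
    """corr[lag] = sum over t of s1[t] * s2[t + lag], for lag >= 0.

    This equals conv(s2(t), s1(-t)); zero padding avoids circular wrap-around.
    """
    size = 1
    while size < len(s1) + len(s2):
        size *= 2
    f1 = fft([complex(v) for v in s1] + [0j] * (size - len(s1)))
    f2 = fft([complex(v) for v in s2] + [0j] * (size - len(s2)))
    product = [a.conjugate() * b for a, b in zip(f1, f2)]
    corr = fft(product, inverse=True)
    return [c.real / size for c in corr[:len(s2)]]


# Made-up example: every A->B message triggers a B->X message 3 ms later.
a_to_b = [1, 5, 6, 12]
b_to_x = [t + 3 for t in a_to_b]
s1 = time_signal(a_to_b, n_bins=16)
s2 = time_signal(b_to_x, n_bins=16)
corr = cross_correlate(s1, s2)
peak_lag = max(range(len(corr)), key=lambda lag: corr[lag])
print(peak_lag, round(corr[peak_lag]))
```

In this toy trace the largest value of the correlation sits at the lag equal to the injected delay, and its height is the number of message pairs that line up at that lag.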

[Figure: time signal S1(t) for A→B msgs, plotted against time (1 to 7)]
12
S1(t) (A→B msgs)
S2(t) (B→X msgs)
S1 ⊗ S2 = conv(S2(t), S1(-t))
  • Spikes suggest causality between A→B and B→X
    msgs
  • Time shift of a spike indicates its
    characteristic delay

13
Details: first step
  • Choose starting node A
  • Use trace to add edges from it
  • Time From To
  • 0.01 A B
  • 0.02 A B
  • 0.04 A C
  • 0.05 A B
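As a small illustration of this first step, the sketch below counts the messages leaving the starting node for each destination, using the four trace rows from this slide. The helper name `outgoing_edges` is hypothetical.

```python
from collections import Counter

# (time, from, to) rows copied from the slide.
trace = [
    (0.01, "A", "B"),
    (0.02, "A", "B"),
    (0.04, "A", "C"),
    (0.05, "A", "B"),
]


def outgoing_edges(trace, start):
    """Count messages sent by `start` to each destination node."""
    return Counter(dst for _, src, dst in trace if src == start)


edges = outgoing_edges(trace, "A")
print(dict(edges))
```

Each destination with a nonzero count becomes an edge out of A in the path graph.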

14
Continuing
  • Time From To
  • B D
  • B E
  • B F
  • B G

[Figure: partial path graph with nodes A, B, C; the next hops from B are marked ??]
15
How?
  • Time From To
  • t1 A B
  • t2 A B
  • t3 A B
  • t4 A B
  • Time From To
  • t1+d B D
  • t2+d B D
  • t3+d B D
  • t3+d B E
  • t4+d B D
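One way to see why a fixed delay d stands out: count every positive time difference between an A→B message and a B→D message. Unrelated pairs scatter across many values, while causally linked pairs all pile up at d. The sketch below assumes integer timestamps and a made-up delay of 7; `delay_histogram` is an illustrative helper, equivalent in spirit to correlating two impulse trains, and is not the authors' code.

```python
from collections import Counter


def delay_histogram(cause_times, effect_times, max_delay):
    """Count how often each delay (effect time minus cause time) occurs."""
    hist = Counter()
    for tc in cause_times:
        for te in effect_times:
            delay = te - tc
            if 0 <= delay <= max_delay:
                hist[delay] += 1
    return hist


a_to_b = [10, 25, 47, 80]          # t1..t4 (made-up values)
d = 7                              # made-up processing delay at B
b_to_d = [t + d for t in a_to_b]   # t1+d..t4+d

hist = delay_histogram(a_to_b, b_to_d, max_delay=50)
best_delay, count = hist.most_common(1)[0]
print(best_delay, count)
```

Here the delay 7 is counted once per A→B message, while every other difference occurs only once.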

16
Heuristic to find spikes
threshold 1: n1 stddevs over mean
threshold 2: n2 stddevs over mean
n1 = 2, n2 = 1.5
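The slide does not say how the two thresholds are combined, so the sketch below makes an assumption: a spike is a run of points above the lower threshold (mean + n2 stddevs) whose peak also exceeds the higher threshold (mean + n1 stddevs). The function name `find_spikes` and the sample signal are hypothetical.

```python
from statistics import mean, pstdev


def find_spikes(signal, n1=2.0, n2=1.5):
    """Return (start, end, peak) index triples for each detected spike.

    Assumed rule: a run above mean + n2*stddev counts as a spike only if
    its highest point is also above mean + n1*stddev.
    """
    mu = mean(signal)
    sigma = pstdev(signal)
    high = mu + n1 * sigma
    low = mu + n2 * sigma
    spikes = []
    i = 0
    while i < len(signal):
        if signal[i] > low:
            start = i
            while i < len(signal) and signal[i] > low:
                i += 1
            peak = max(range(start, i), key=lambda k: signal[k])
            if signal[peak] > high:
                spikes.append((start, i - 1, peak))
        else:
            i += 1
    return spikes


# Made-up correlation output: flat background with one spike and a shoulder.
signal = [1.0] * 20
signal[3] = 10.0
signal[4] = 5.0
print(find_spikes(signal))
```

With a perfectly flat signal the standard deviation is zero, nothing exceeds the mean, and no spikes are reported.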
17
Recursing to continue
  • Observations
  • 1. (B→D) are not all msgs from B to D
  • (only those caused by A)
  • 2. Stop recursion when too few messages left
  • or no more spikes found
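The recursion can be sketched as follows, with several simplifying assumptions: timestamps are integers, the "spike" is simply the most common delay, and a path is extended only while at least `min_msgs` messages support it. The function `causal_paths`, its parameters, and the sample trace are hypothetical and much cruder than the convolution-based version described in the talk.

```python
from collections import Counter, defaultdict


def causal_paths(trace, start, min_msgs=3, max_delay=50):
    """Return (path, per-node delays) pairs found by following dominant delays."""
    sends = defaultdict(lambda: defaultdict(list))  # sends[src][dst] -> times
    for t, src, dst in trace:
        sends[src][dst].append(t)
    results = []

    def extend(path, delays, arrivals):
        arrived = set(arrivals)
        extended = False
        for dst, times in sorted(sends[path[-1]].items()):
            hist = Counter(t - a for a in arrivals for t in times
                           if 0 < t - a <= max_delay)
            if not hist:
                continue  # no candidate delay at all
            delay, count = hist.most_common(1)[0]
            if count < min_msgs:
                continue  # too few messages left: stop recursing here
            # Keep only the messages explained by this path (observation 1).
            caused = [t for t in times if t - delay in arrived]
            extend(path + (dst,), delays + (delay,), caused)
            extended = True
        if not extended:
            results.append((path, delays))

    for dst, times in sorted(sends[start].items()):
        if len(times) >= min_msgs:
            extend((start, dst), (), times)
    return results


# Made-up trace: A->B, then B->D 5 ms later, then D->E 3 ms later.
a_times = [10, 31, 57, 90]
trace = [(t, "A", "B") for t in a_times]
trace += [(t + 5, "B", "D") for t in a_times] + [(44, "B", "D")]  # 44 is unrelated
trace += [(t + 8, "D", "E") for t in a_times]
print(causal_paths(trace, "A"))
```

The unrelated B→D message at time 44 is dropped before recursing into D, which is the point of the first observation: only messages attributable to the path so far are carried forward.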

[Figure: path graph with nodes A, B, D, a delay label d, and the next hop marked ??]
18
Outline
  • Problem statement and goals
  • Overview of our approach
  • Algorithm
  • Experimental results
  • Conclusions

19
Results: email service delays
  • Jeff logged all email headers for two months
  • Parsed 80K Received: headers in 12K messages
  • Received: from cceexg11.americas.cpqcorp.net ...
  • by wera.hpl.hp.com ... Fri, 4 Apr 2003 15:35:54
    -0800
  • Yields (timestamp, sender, receiver) trace
    records
  • Used the convolution algorithm to:
  • Reconstruct message paths
  • Find typical delays
  • Note: this is NOT the most direct way to use
    email headers
  • We made the problem harder so as to test our
    algorithm
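A sketch of how one Received: header might be turned into a (timestamp, sender, receiver) record, assuming the common "from HOST ... by HOST ...; DATE" layout. The regular expression, the `parse_received` helper, and the `example.com`/`example.org` host names are hypothetical; only `email.utils.parsedate_to_datetime` is a standard-library call. Real headers vary widely and would need more careful handling.

```python
import re
from email.utils import parsedate_to_datetime

# Assumed layout: "from <sender> ... by <receiver> ...; <RFC 2822 date>"
RECEIVED_RE = re.compile(
    r"from\s+([^\s;]+).*?\bby\s+([^\s;]+).*;\s*(.+)$", re.DOTALL
)


def parse_received(value):
    """Return (datetime, sender, receiver), or None if the layout differs."""
    match = RECEIVED_RE.search(value)
    if match is None:
        return None
    sender, receiver, date_text = match.groups()
    return parsedate_to_datetime(date_text.strip()), sender, receiver


# Simplified, made-up header value using placeholder host names.
header = ("from mail1.example.com by mail2.example.org; "
          "Fri, 4 Apr 2003 15:35:54 -0800")
record = parse_received(header)
when, sender, receiver = record
print(when.isoformat(), sender, receiver)
```

Each hop of a message contributes one such record, so a single email with several Received: headers yields a multi-hop trace.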

20
Email trace output
21
Results: Petstore
  • Sun's demo application for J2EE
  • Stanford's PinPoint project provided us with
    traces
  • One trace has a node that is artificially slowed
    down

22
Future work
  • Automate trace gathering and conversion
  • Sliding-window versions of algorithms
  • Find phased behavior
  • Reduce memory usage of nesting algorithm
  • Improve speed of convolution algorithm
  • Validate usefulness on more complicated systems
  • What are limits of our approach?

23
Conclusion
  • Looking for bottlenecks in black box systems
  • Use signal processing techniques to find causal
    paths in the network and their delays
  • For more information:
  • http://www.hpl.hp.com/research/project5/
  • Contact us if you have multi-hop message traces!