Title: Connecting the Dots: Using Runtime Paths for Macro Analysis
1. Connecting the Dots: Using Runtime Paths for Macro Analysis
- Mike Chen
- mikechen_at_cs.berkeley.edu
- http://pinpoint.stanford.edu
2. Motivation
- Divide and conquer, layering, and replication are fundamental design principles
  - e.g. Internet systems, P2P systems, and sensor networks
- Execution context is dispersed throughout the system
  - => difficult to monitor and debug
- Lots of existing low-level tools help with debugging individual components, but not a collection of them
  - Much of the system is in how the components are put together
- Observation: a widening gap between the systems we are building and the tools we have
3. Current Approach
[Diagram: three-tier system with replicated Apache front ends, a Java Bean middle tier, and replicated databases]
4. Current Approach
- Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
- Scenario:
  - A user reports that request A1 failed
  - You try the same request, A2, but it works fine
  - What to do next?
[Diagram: requests A1 and A2 flowing through the Apache / Java Bean / database tiers]
5. Macro Analysis
- Macro analysis exploits non-local context to improve reliability and performance
  - Performance examples: Scout, ILP, Magpie
- A statistical view is essential for large, complex systems
- Analogy: micro analysis lets you understand an individual honeybee in detail; macro analysis is needed to understand how the bees interact to keep the beehive functioning
6. Observation
- Systems have a single system-wide execution path associated with each request
  - e.g. request/response, one-way messages
- Scout, SEDA, and Ninja use paths to specify how to service requests
- Our philosophy:
  - Use only dynamic, observed behavior
  - Application-independent techniques
7. Our Approach
- Use runtime paths to connect the dots!
  - dynamically capture the interactions and dependencies between components
  - look across many requests to get the overall system behavior
    - more robust to noise
  - components are only partially known (gray boxes)
[Diagram: a request's runtime path traced across the Apache / Java Bean / database tiers]
8. Our Approach
- Applicable to a wide range of systems.
9. Open Challenges in Systems Today
- Deducing system structure
  - manual approach is error-prone
  - static analysis doesn't consider resources
- Detecting application-level failures
  - often don't exhibit lower-level symptoms
- Diagnosing failures
  - failures may manifest far from the actual faults
  - multi-component faults
- Goal: reduce time to detection, recovery, diagnosis, and repair
10. Talk Outline
- Motivation
- Model and architecture
- Applying macro analysis
- Future directions
11. Runtime Paths
- Instrument code to dynamically trace requests through a system at the component level
  - record the call path and runtime properties
  - e.g. components, latency, success/failure, and resources used to service each request
- Use statistical analysis to detect and diagnose problems
  - e.g. data mining, machine learning, etc.
- Runtime analysis tells you how the system is actually being used, not how it may be used
  - Complements existing micro analysis tools
12. Architecture
- Tracer
  - Tags each request with a unique ID and carries it with the request throughout the system
  - Reports observations (component name, resources, performance properties) for each component
- Aggregator and Repository
  - Reconstructs paths and stores them
- Declarative Query Engine
  - Supports statistical queries on paths
  - Data mining and machine learning routines
- Visualization
[Diagram: requests flow through the tracer to the aggregator and path repository; developers/operators use the query engine and visualization on top]
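The aggregator's job of reconstructing paths from per-component tracer reports can be sketched as follows. The observation tuple layout, component names, and field choices are illustrative assumptions, not the actual Pinpoint schema:

```python
from collections import defaultdict

# Each tracer observation: (request_id, seq_num, component, latency_ms, success).
# Reports arrive out of order; the aggregator groups them by request ID and
# sorts by sequence number to recover each request's path.
observations = [
    (1, 2, "JavaBean", 5, True),
    (1, 1, "Apache", 2, True),
    (2, 1, "Apache", 3, True),
    (1, 3, "Database", 8, True),
    (2, 2, "JavaBean", 4, False),
]

def reconstruct_paths(obs):
    """Group observations by request ID and order them by sequence number."""
    by_request = defaultdict(list)
    for req_id, seq, comp, latency, ok in obs:
        by_request[req_id].append((seq, comp, latency, ok))
    return {rid: [rec[1:] for rec in sorted(recs)]
            for rid, recs in by_request.items()}

paths = reconstruct_paths(observations)
print(paths[1])  # [('Apache', 2, True), ('JavaBean', 5, True), ('Database', 8, True)]
```

The repository then stores these ordered paths, and statistical queries run over them.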
13. Request Tracing
- Challenge: maintaining an ID with each request throughout the system
- Tracing is platform-specific but can be application-generic and reusable across applications
- Two classes of techniques
  - Intra-thread tracing
    - Use per-thread context to store the request ID (e.g. ThreadLocal in Java)
    - ID is preserved if the same thread is used to service the request
  - Inter-thread tracing
    - For extensible protocols like HTTP, inject new headers that will be preserved (e.g. REQ_ID xx)
    - Modify RPC to pass the request ID under the covers
    - Piggyback onto messages
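A minimal sketch of intra-thread tracing, using Python's `threading.local` as an analogue of Java's ThreadLocal (the slide's example). Function names are hypothetical; the point is that any component deep in the call chain can read the request ID without it being passed explicitly, as long as one thread services the request end to end:

```python
import threading

# Per-thread storage for the current request ID (analogue of Java ThreadLocal).
_context = threading.local()

def start_request(request_id):
    _context.request_id = request_id

def current_request_id():
    return getattr(_context, "request_id", None)

def component_work(results):
    # A component far down the call chain tags its observation with the ID.
    results.append(current_request_id())

def serve(request_id, results):
    start_request(request_id)
    component_work(results)

results = []
t1 = threading.Thread(target=serve, args=("req-1", results))
t2 = threading.Thread(target=serve, args=("req-2", results))
t1.start(); t1.join()
t2.start(); t2.join()
print(sorted(results))  # ['req-1', 'req-2']
```

If the request hops threads or machines, this breaks, which is exactly why the inter-thread techniques (header injection, RPC modification, piggybacking) are needed.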
14. Talk Outline
- Motivation
- Model and architecture
- Applying macro analysis
  - Inferring system structure
  - Detecting application-level failures
  - Diagnosing failures
- Future directions
15. Inferring System Structure
- Key idea: paths directly capture application structure
[Diagram: the paths of 2 requests overlaid on the component graph]
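The structure-inference idea can be sketched in a few lines: every consecutive pair of components in an observed path becomes a directed edge, and the union of edges over many requests is the (dynamically observed) system structure. The path data here is illustrative:

```python
# Derive a component dependency graph from observed runtime paths.
paths = [
    ["Apache", "JavaBean1", "Database"],
    ["Apache", "JavaBean2", "Database"],
]

def infer_structure(paths):
    """Each consecutive component pair in a path is a directed edge."""
    edges = set()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges.add((src, dst))
    return edges

edges = infer_structure(paths)
print(sorted(edges))
```

Unlike a manually maintained or statically derived diagram, this graph only contains edges that were actually exercised by real requests.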
16. Indirect Coupling of Requests
- Key idea: paths associate requests with internal state
- Trace requests from web server to database
  - Parse client-side SQL queries to get sharing of db tables
  - Straightforward to extend to more fine-grained state (e.g. rows)
[Diagram: bipartite graph linking request types to the database tables they share]
17. Failure Detection and Diagnosis
- Detecting application-level failures
  - Key idea: paths change under failures => detect failures via path changes
- Diagnosing failures
  - Key idea: bad paths touch the root cause(s). Find common features.
18. Future Directions
- Key idea: violations of macro invariants are signs of a buggy implementation or an intrusion
- Message paths in P2P and sensor networks
  - a general mechanism to provide visibility into the collective behavior of multiple nodes
  - micro or static approaches by themselves don't work well in dynamic, distributed settings
  - e.g. algorithms have upper bounds on the number of hops
    - Although a hop-count violation can be detected locally, paths help identify the nodes that route messages incorrectly
  - e.g. detecting nodes that are slow or corrupt messages
19. Conclusion
- Macro analysis fills the need when monitoring and debugging systems where local context alone is insufficient
- The runtime path-based approach dynamically traces request paths and statistically infers macro properties
- A shared analysis framework that is reusable across many systems
  - Simplifies the construction of effective tools for other systems and the integration with recovery techniques like RR
- http://pinpoint.stanford.edu
  - Paper includes a commercial example from Tellme! (thanks to Anthony Accardi and Mark Verber)
20. Backup Slides
22. Current Approach
- Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
[Diagram: gdb attached per component, showing per-instance variable state (X, Y) at each Apache, Java Bean, and database instance]
23. Related Work
- Commercial request tracing systems
  - Announced in 2002, a few months after Pinpoint was developed
  - PerformaSure and AppAssure focus on performance problems
  - IntegriTea captures and replays failure conditions
  - Focus on individual requests rather than overall behavior, and on recreating the failure condition
- Extensive work in event/alarm correlation, mostly in the context of network management (i.e. IP)
  - Don't directly capture relationships between events
  - Rely on human knowledge or use machine learning to suppress alarms
- Distributed debuggers
  - PDT, P2D2, TotalView, PRISM, pdbx
  - Aggregate views from multiple components, but do not capture relationships and interactions between components
  - Comparative debuggers: Wizard, GUARD
- Dependency models
  - Most are statically generated and are likely to be inconsistent
  - Brown et al. take an active, black-box approach, but it is invasive. Candea et al. dynamically trace failure propagation.
24. 1. Detecting Failures using Anomaly Detection
- Key idea: paths change under failures => detect failures via path changes
- Anomalies
  - Unusual paths
  - Changes in distribution
  - Changes in latency/response time
- Examples
  - Error paths are shorter
  - User behavior changes under failures
    - Retries a few times, then gives up
- Implement as long-running queries (i.e. diff)
- Challenges
  - detecting application-level failures
  - comparing sets of paths
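One simple way to realize the path-change idea is to compare the distribution of path shapes in a reference window against a current window and flag shapes whose share moved sharply (this catches both "unusual paths" and "changes in distribution"). The shapes, data, and threshold below are illustrative assumptions, not Pinpoint's actual detector:

```python
from collections import Counter

# A path "shape" is the sequence of components it touches.
# Error paths are shorter, so failures shift mass toward short shapes.
reference = [("A", "B", "C")] * 90 + [("A", "B")] * 10
current   = [("A", "B", "C")] * 50 + [("A", "B")] * 50

def anomalies(reference, current, threshold=0.2):
    """Flag path shapes whose share of traffic changed by more than threshold."""
    ref, cur = Counter(reference), Counter(current)
    flagged = []
    for shape in set(ref) | set(cur):
        ref_share = ref[shape] / len(reference)
        cur_share = cur[shape] / len(current)
        if abs(cur_share - ref_share) > threshold:
            flagged.append(shape)
    return flagged

flagged = anomalies(reference, current)
print(flagged)
```

A production detector would use a proper statistical test (e.g. chi-square) rather than a fixed threshold, but the structure is the same: a long-running diff between two sets of paths.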
25. 2. Root-cause Analysis
- Key idea: all bad paths touch the root cause; find common features
- Challenge: a small set of known bad paths and a large set of maybes
  - Ideally want to correlate and rank all combinations of feature sets
  - E.g. association rules mining
  - May get false alarms because the root cause may not be one of the features
- Automatic generation of dynamic functional and state dependency graphs
  - Helps developers and operators understand inter-component dependency and inter-request dependency
  - Input to recovery algorithms that use dependency graphs
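The "find common features" idea can be sketched with a single feature type, components, by ranking each component by the failure rate of the paths that touch it (a degenerate one-item case of association-rule mining). The data is illustrative, and as the slide notes, this can false-alarm when the true root cause is not among the recorded features:

```python
from collections import Counter

# (components touched by the path, request succeeded?)
paths = [
    (["Apache", "BeanA", "DB"], False),  # failed
    (["Apache", "BeanA", "DB"], False),  # failed
    (["Apache", "BeanB", "DB"], True),
    (["Apache", "BeanB", "DB"], True),
]

def rank_suspects(paths):
    """Rank components by the failure rate of paths containing them."""
    seen, failed = Counter(), Counter()
    for components, ok in paths:
        for c in set(components):
            seen[c] += 1
            if not ok:
                failed[c] += 1
    return sorted(((failed[c] / seen[c], c) for c in seen), reverse=True)

ranking = rank_suspects(paths)
print(ranking[0])  # BeanA tops the list: every path through it failed
```

Ranking combinations of components (not just singletons) catches the multi-component interaction faults mentioned earlier, at higher cost.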
26. 3. Verifying Macro Invariants
- Key idea: violations of high-level invariants are signs of intrusion or bugs
- Example: peer auditing
  - Problem: a small number of faulty or malicious nodes can bring down the system
  - Corruption should be statistically visible in a node's behavior
    - look for nodes that delay or corrupt messages or route messages incorrectly
  - Apply root-cause analysis to locate the misbehaving peers
  - Some distributed auditing is necessary
- Example: P2P implementation verification
  - Problem: are messages delivered as specified by the algorithms?
  - Detect extra hops and loops, and verify that the paths are correct
  - Can implement as a query
    - select length from paths where (length > log2(N))
27. 4. Detecting Single Points of Failure
- Key idea: paths converge on a single point of failure
- Useful for finding out what to replicate to improve availability
- P2P example
  - Many P2P systems rely on overlay networks, which typically are built on top of the IP infrastructure
  - It's common for several overlay links to fail together if they depend on a shared physical IP link that failed
- Implement as a query
  - intersect edge.IP_links from paths
[Diagram: overlay paths through nodes A-G converging on a shared physical IP link]
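The "intersect edge.IP_links from paths" query amounts to intersecting the sets of physical links underlying each failed overlay path; any link shared by all of them is a single-point-of-failure candidate. The link names below are hypothetical:

```python
# Physical IP links underlying each failed overlay path (illustrative).
failed_paths = [
    {"link-1", "link-2", "link-5"},
    {"link-2", "link-3", "link-5"},
    {"link-2", "link-4"},
]

def shared_links(paths):
    """Intersect the link sets of all paths: shared links are SPOF candidates."""
    shared = set(paths[0])
    for links in paths[1:]:
        shared &= links
    return shared

print(shared_links(failed_paths))  # {'link-2'}
```

Replicating across the surviving candidates (here, avoiding dependence on `link-2`) is then the availability fix the slide suggests.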
28. 5. Monitoring of Sensor Networks
- An emerging area with primitive tools
- Key idea: use paths to reconstruct topology and membership
- Example
  - Membership
    - select unique node from paths
  - Network topology
    - for directed information dissemination
- Challenge: limited bandwidth
  - Can record a (random) subset of the nodes for each path, then statistically reconstruct the paths
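With full paths recorded, the membership and topology queries above reduce to set operations; the node names here are illustrative (the bandwidth-limited, subset-sampling variant would feed these same operations with statistically reconstructed paths):

```python
# Recorded dissemination paths in a small sensor network (illustrative).
paths = [
    ["mote-1", "mote-4", "sink"],
    ["mote-2", "mote-4", "sink"],
]

# Membership: "select unique node from paths"
membership = {node for path in paths for node in path}

# Topology: directed edges observed during information dissemination
topology = {(a, b) for path in paths for a, b in zip(path, path[1:])}

print(sorted(membership))
print(sorted(topology))
```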
29. Macro Analysis
- Look across many requests to get the overall system behavior
  - more robust to noise
Macro Analysis   Request 1   Request 2   Request 3   Request 4
Component A          X           X           X
Component B          X
Component C          X           X
30. Properties of Network Systems
- Web services, P2P systems, and sensor networks can have tens of thousands of nodes, each running many application components
- Continuous adaptation provides high availability, but also makes it difficult to reproduce and debug errors
- Constant evolution of software and hardware
31. Motivation
- Difficult to understand and debug network systems
  - e.g. clustered Internet systems, P2P systems, and sensor networks
- Composed of many components
- Systems are becoming larger, more dynamic, and more distributed
- Workload is unpredictable and impractical to simulate
- Unit testing is necessary but insufficient: components break when used together under real workload
- We don't have tools that capture the interactions between components and the overall behavior
  - Existing debugging tools and application-level logs only do micro analysis
32. Macro vs Micro Analysis

             Macro Analysis                                Micro Analysis
Resolution   Component; complements micro analysis tools   Line or variable
Overhead     Low; can use it in actual deployment          High; typically not used in deployment other than application logs
33. What's a dynamic path?
- A dynamic path is the (control flow + runtime properties) of a request
  - Think of it as a stack trace across process/machine boundaries, with runtime properties
- Dynamically constructed by tracing requests through a system
- Runtime properties
  - Resources (e.g. host, version)
  - Performance properties (e.g. latency)
  - Arguments (e.g. URL, args, SQL statement)
  - Success/failure
[Diagram: a request's path through components A-F, with a sample path record: RequestID: 1, Seq Num: 1, Name: A, Host: xx, Latency: 10ms, Success: true, ...]
34. Related Work
- Micro debugging tools
  - RootCause provides extensible logging of method calls and arguments
  - DIDUCE looks for inconsistencies in variable usage
  - These complement macro analysis tools
- Languages for monitoring
  - InfoSpect looks for inconsistencies in system state using a logic language
- Network flow-based monitoring
  - RTFM and Cisco NetFlow classify and record network flows
- Statistical and data mining languages
  - S, DMQL, WebML
35. Visualization Techniques
- Tainted paths: mark all flows that have a certain property (e.g. failed or slow) with a distinct color and overlay them on the graph
- Detecting performance bottlenecks: look for replicated nodes that have different colors
- Detecting anomalies: look for missing edges and unknown paths
36. Pinpoint Framework
[Diagram: (1) requests enter the components, (2) flow through the communications layer (tracing, internal failure detection), and (3) produce logs and detected faults]
37. Experimental Setup
- Demo app: J2EE Pet Store
  - e-commerce site with 30 components
- Load generator
  - replays a trace of browsing
  - approx. TPC-W WIPSo load (50% ordering)
- Fault injection parameters
  - trigger faults based on combinations of used components
  - inject exceptions, infinite loops, null calls
- 55 tests with single-component faults and interaction faults
  - 5-min runs of a single client (J2EE server limitation)
38. Application Observations
- large number of tightly coupled components that are always used together
- # of components used in a dynamic web page request
  - median 14, min 6, max 23
39. Metrics
- Precision: C/P (correctly identified faults / predicted faults)
- Recall: C/A (correctly identified faults / actual faults)
- Accuracy: whether all actual faults are correctly identified (recall = 100%)
  - boolean measure
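A worked example of these metrics, reading C as correctly identified faults, P as predicted faults, and A as actual faults (consistent with the accuracy definition above). The fault sets are made up for illustration:

```python
# Predicted faulty components vs. the actual injected faults (illustrative).
predicted = {"BeanA", "BeanB", "DB"}
actual = {"BeanA", "DB"}

correct = predicted & actual                 # C
precision = len(correct) / len(predicted)    # C/P
recall = len(correct) / len(actual)          # C/A
accuracy = recall == 1.0                     # boolean: all actual faults found

print(precision, recall, accuracy)
```

Here precision is 2/3 (one false alarm, BeanB), recall is 1.0, so the run counts as accurate despite the imperfect precision.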
40. 4 Analysis Techniques
- Pinpoint: clusters of components that statistically correlate with failures
- Detection: components where Java exceptions were detected
  - union across all failed requests
  - similar to what an event monitoring system outputs
- Intersection: intersection of components used in failed requests
- Union: union of all components used in failed requests
41. Results
- Pinpoint has high accuracy with relatively high precision
42. Pinpoint Prototype Limitations
- Assumptions
  - client requests provide good coverage over components and combinations
  - requests are autonomous (don't corrupt state and cause later requests to fail)
- Currently can't detect the following
  - faults that only degrade performance
  - faults due to pathological inputs
- Single-node only
43. Current Status
- Simple graph visualization
44. Proposed Research
- 3 classes of large network systems
  - Clustered Internet systems
    - Tiered architecture, high bandwidth, many replicas
  - Peer-to-peer (P2P) systems, including sensor networks
    - Widely distributed nodes, dynamic membership
  - Sensor networks
    - Limited storage, processing, and bandwidth
45. P2P Systems: Tracing
- Trace messages by piggybacking the current node name onto the messages
- Tracing overhead
  - Assume 32-bit node names and a very conservative log2(N) hops per message
  - Data overhead is ~40 bytes for a 1500-byte message in a 10^6-node system
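A back-of-the-envelope check of that overhead figure, under the assumption (mine, consistent with the slide's numbers) that each hop appends one 4-byte node name, so the piggybacked path averages about half its final length over the route:

```python
import math

name_bytes = 4                               # 32-bit node name
hops = math.ceil(math.log2(10**6))           # conservative log2(N) hops: 20
# Path grows by one name per hop; mean size carried over the whole route:
avg_overhead = name_bytes * (hops + 1) / 2   # ~42 bytes
print(hops, avg_overhead, avg_overhead / 1500)
```

That yields roughly 40 bytes against a 1500-byte message, i.e. under 3% overhead.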
46. P2P Systems: Implementation Verification
- Current debugging technique: lots of printf()s on each node, then manually correlating the paths taken by messages
- How do you know the messages are delivered as specified by the algorithms?
- Use message paths to check routing invariants
  - detect extra hops and loops, and verify that the paths are correct
- Can implement as a query
  - select length from paths where (length > log2(N))
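The hop-count query above can be sketched directly over reconstructed message paths; the node count and paths are hypothetical:

```python
import math

N = 1024  # hypothetical number of nodes; routing bound is log2(N) = 10 hops
paths = {
    "msg-1": ["n1", "n2", "n3"],        # 2 hops: within bound
    "msg-2": ["n1"] + ["x"] * 15,       # 15 hops: violates the invariant
}

def violations(paths, n_nodes):
    """select length from paths where (length > log2(N))"""
    bound = math.log2(n_nodes)
    return [msg for msg, path in paths.items() if len(path) - 1 > bound]

print(violations(paths, N))  # ['msg-2']
```

Flagged messages can then be fed to the root-cause analysis of slide 25 to identify which node routed them incorrectly.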