Title: Compression Techniques to Simplify the Analysis of Large Execution Traces
1Compression Techniques to Simplify the Analysis
of Large Execution Traces
- Abdelwahab Hamou-Lhadj and Dr. Timothy C. Lethbridge
- ahamou, tcl_at_site.uottawa.ca
- University of Ottawa - Canada
- IWPC 2002 - Paris
2Introduction
- Execution traces are important for understanding the behavior, and sometimes the structure, of a software system
- Execution traces tend to be very large and need to be compressed
- In this presentation, we present techniques for compressing traces of procedure calls
- We also show the results of our techniques when applied to two different software systems
3Why Traces of Procedure Calls?
- Many of today's legacy systems were developed using the procedural paradigm
- The flow of procedure calls can be useful for comprehending the execution of a particular software feature
- The level of abstraction of procedure-call traces tends to be neither too low nor too high
- Traces of method invocations become crucial when it comes to understanding the behavior of object-oriented systems
4Traditional Compression Techniques
- There are two types of compression techniques: lossy and lossless compression
- In information theory, most compression algorithms are based on the same principle (David Salomon, 2000)
- Compressing data by removing redundancy
- These techniques produce good results; however:
- The information, once compressed, is no longer readable by humans
- Such algorithms certainly will not help in program comprehension
5Trace Compression Steps
- Preprocess the trace by removing the contiguous redundancies due to loops and recursion
- Represent the trace as a rooted ordered labeled tree
- Detect the non-contiguous redundancies and represent them only once
- This problem is also known as the common subexpression problem and can be solved in linear time
- Analyze the compressed version and estimate the gain
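The second step above (representing the trace as a rooted ordered labeled tree) can be sketched as follows. This is only an illustration: the slides do not show the trace format, so the input is assumed to be a list of hypothetical `(depth, name)` pairs, with depth 0 for the root call, and `Node` and `build_call_tree` are names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                     # procedure name
    children: list = field(default_factory=list)   # callees, in call order


def build_call_tree(trace):
    """Build a rooted ordered labeled tree from (depth, name) pairs.

    Assumed input format (not taken from the slides): the first entry has
    depth 0, and each entry is at most one level deeper than the previous one.
    """
    root = None
    stack = []  # stack[d] is the most recent node seen at depth d
    for depth, name in trace:
        node = Node(name)
        if depth == 0:
            if root is not None:
                raise ValueError("trace has more than one root call")
            root = node
            stack = [node]
        else:
            if depth > len(stack):
                raise ValueError("gap in nesting levels")
            del stack[depth:]                # return to the caller's level
            stack[-1].children.append(node)  # attach as the caller's next child
            stack.append(node)
    return root


tree = build_call_tree([
    (0, "main"), (1, "init"), (2, "read"),
    (1, "draw"), (2, "line"), (2, "line"),
])
print([child.label for child in tree.children])
```

Keeping the children in call order is what makes the tree "ordered"; the labels are the procedure names.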
6Preprocessing Stage
- Redundant calls caused by loops and recursion tend to encumber the trace and should be removed
- The number of occurrences is stored so that the original trace can be reconstructed
- Removing the redundant calls is one form of compression that could make the trace more readable
- If the trace is perceived as a tree, removing contiguous redundancies reduces the depth of the tree and the degree of its nodes
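A minimal sketch of the preprocessing idea, under stated assumptions: it only merges runs of identical adjacent sibling subtrees (the loop-like case) and keeps a repetition count on the surviving node. Collapsing recursive call chains, which is what would reduce the depth, is not covered here. The `Node` class, its `count` field, and the function names are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    count: int = 1  # how many contiguous occurrences this node stands for


def same_subtree(a, b):
    """Compare two already-collapsed subtrees, ignoring the counts on their roots."""
    return (
        a.label == b.label
        and len(a.children) == len(b.children)
        and all(x.count == y.count and same_subtree(x, y)
                for x, y in zip(a.children, b.children))
    )


def collapse_repeats(node):
    """Merge runs of identical adjacent sibling subtrees, bottom-up, in place."""
    merged = []
    for child in node.children:
        collapse_repeats(child)
        if merged and same_subtree(merged[-1], child):
            merged[-1].count += child.count  # remember the repetition
        else:
            merged.append(child)
    node.children = merged
    return node


root = Node("main", [
    Node("draw", [Node("line"), Node("line")]),
    Node("draw", [Node("line"), Node("line")]),
    Node("save"),
])
collapse_repeats(root)
print([(c.label, c.count) for c in root.children])
```

Because the counts are kept, the original sequence of calls could in principle be rebuilt by expanding each node `count` times. The recursive traversal is for brevity only and would need an explicit stack for very deep traces.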
7The Common Subexpression Problem (introduced by P.J. Downey, R. Sethi and R.E. Tarjan)
- Any tree can be represented in a maximally compact form as a directed acyclic graph where common subtrees are factored and shared, being represented only once (Flajolet, Sipala and Steyaert)
- The process of compacting the tree is known as the common subexpression problem, also called subtree factoring
- If we consider trees with a finite number of nodes so that the degrees are bounded by some constant ... The compacted form of a tree can be computed in expected time O(n) using a top-down recursive procedure in conjunction with hashing... (Flajolet, Sipala and Steyaert)
8Example
Input tree: 9 nodes and 8 links
Compressed form: 5 nodes and 6 links
9The Algorithm (introduced by P. Flajolet, P. Sipala and J.M. Steyaert, and improved by G. Valiente)
- The algorithm assigns a positive number, called a certificate, to each node
- Two nodes have the same certificate if, and only if, the trees rooted at them are isomorphic
- The certificate of a node n is obtained by building a sequence L(n), a1, ..., am called the signature of the node, where L(n) is the label of the node and a1, ..., am are the certificates of the children of the node
- The certificates and signatures are stored in a global table
10Example
11The Algorithm Steps (iterative version)
- The algorithm performs a bottom-up traversal of the tree using a queue
- 1. For each node n
- 2. Build a signature for n
- 3. If the signature already exists in the global table then
- 4. Return the corresponding certificate
- Else
- 5. Create a new certificate
- 6. Update the table
- 7. Assign the certificate to the node
- If the degree of the tree is bounded by a constant and a hash table is used to store the certificates, then this algorithm runs in linear time
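The steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes a queue that starts with the leaves and admits a parent once all of its children have certificates, and it uses a Python dict as the global table. `Node` and `assign_certificates` are hypothetical names, and the sample tree is made up so that it has 9 nodes and 8 links.

```python
from collections import deque


class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.certificate = None


def assign_certificates(root):
    """Assign certificates bottom-up; return the table signature -> certificate."""
    table = {}    # global table: signature -> certificate
    parent = {}   # id(node) -> parent node
    pending = {}  # id(node) -> number of children without a certificate yet
    queue = deque()

    # First pass: record parents and put the leaves in the queue.
    todo = [root]
    while todo:
        node = todo.pop()
        pending[id(node)] = len(node.children)
        if not node.children:
            queue.append(node)
        for child in node.children:
            parent[id(child)] = node
            todo.append(child)

    # Second pass: a node is processed only after all of its children.
    while queue:
        node = queue.popleft()
        signature = (node.label, *(c.certificate for c in node.children))
        if signature not in table:
            table[signature] = len(table) + 1  # new positive certificate
        node.certificate = table[signature]
        p = parent.get(id(node))
        if p is not None:
            pending[id(p)] -= 1
            if pending[id(p)] == 0:
                queue.append(p)
    return table


# Hypothetical tree with 9 nodes and 8 links.
root = Node("A", [
    Node("B", [Node("D"), Node("E")]),
    Node("B", [Node("D"), Node("E")]),
    Node("C", [Node("D")]),
])
table = assign_certificates(root)
print(len(table), "nodes in the compact form")
print(sum(len(sig) - 1 for sig in table), "links in the compact form")
```

Each distinct signature corresponds to one node of the compact graph, and each certificate inside a signature corresponds to one link, which is how the two counts are obtained. The expected linear running time relies on dictionary lookups being constant time on average and on the node degree being bounded.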
12Experiment
- We experimented with traces of the following systems
- XFIG (a drawing system under UNIX)
- A real world telecommunication system
- We are interested in the following results
- The initial size of the trace n
- The size of the trace after preprocessing it n1
- The compression ratio r1 such that r1 = n1 / n
- The size of the trace after using the common subexpression algorithm n2
- The compression ratio r2 such that r2 = n2 / n
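As a small worked example of the two ratios, reading them as r1 = n1 / n and r2 = n2 / n: the sizes below are made-up illustrative numbers, not results from the experiment, and `compression_ratio` is a hypothetical helper name.

```python
def compression_ratio(compressed_size, initial_size):
    """Ratio of the compressed size to the initial size (smaller is better)."""
    if initial_size <= 0:
        raise ValueError("initial size must be positive")
    return compressed_size / initial_size


# Made-up sizes for illustration only (not measured data).
n = 1000   # initial size of the trace
n1 = 400   # size after preprocessing
n2 = 100   # size after the common subexpression algorithm

r1 = compression_ratio(n1, n)
r2 = compression_ratio(n2, n)
print(r1, r2)
```

Both ratios are taken relative to the initial size n, so r2 reflects the combined effect of preprocessing and subtree factoring.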
13Results of the Experiment (XFIG System)
14Some Considerations Regarding the
Telecommunication System
- It is a large legacy system
- The traces are generated using an internal mechanism
- The traces tend to be incomplete. This is reflected as an inconsistency in the trace with respect to the nesting levels.
- Our solution to this problem is to complete the trace
- by filling up the gaps with virtual procedure calls
- estimate the error ratio, which is the ratio of the number of missing calls to the size of the original trace
- e = g / (g + n)
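A rough sketch of the gap-filling idea, under several assumptions that are not in the slides: the trace is a list of hypothetical `(depth, name)` pairs, a gap is any place where the nesting level jumps by more than one, each skipped level is filled with one placeholder call, and the error ratio is read as e = g / (g + n), with g the number of inserted virtual calls and n the number of recorded calls.

```python
VIRTUAL = "<virtual>"  # hypothetical placeholder name for a missing call


def complete_trace(trace):
    """Fill nesting-level gaps with virtual calls; return (completed, gaps)."""
    completed = []
    gaps = 0
    prev_depth = -1  # so that the first entry is expected at depth 0
    for depth, name in trace:
        # Every level skipped between the previous call and this one
        # is assumed to correspond to one missing call.
        for missing_depth in range(prev_depth + 1, depth):
            completed.append((missing_depth, VIRTUAL))
            gaps += 1
        completed.append((depth, name))
        prev_depth = depth
    return completed, gaps


def error_ratio(gaps, recorded):
    """Assumed reading of the slide's formula: e = g / (g + n)."""
    return gaps / (gaps + recorded)


recorded = [(0, "main"), (2, "read"), (1, "draw"), (3, "line")]
completed, g = complete_trace(recorded)
print(completed)
print(error_ratio(g, len(recorded)))
```

This only repairs jumps to a deeper level; how the actual tracing mechanism loses calls, and how the authors detect the gaps, is not described in the presentation.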
15Results of the Experiment (Telecom. System)
16Variation of the degrees of the tree according to
depth (3 traces of XFIG)
Before the preprocessing step
After the preprocessing step
17Variation of the degrees of the tree according to
depth (3 traces of the telecom. system)
Before the preprocessing step
After the preprocessing step
18Discussion
- Procedure-call traces can be compressed considerably in a way that preserves the ability of humans to understand them
- Possible improvement
- look for procedures that are not of great interest to software engineers
- remove them before the compression process
- The preprocessing stage could be very useful to
- reduce the trace size
- increase the performance of the common subexpression algorithm
19Conclusions and future directions
- The results shown in this presentation can help build better tools based on execution traces
- We intend to conduct more experiments with this framework to see how helpful it is to software engineers
- Future directions should focus on lossy compression. Types of information eliminated can include
- the number of repetitions, the order of calls, and some lower-level utility procedures
- The non-contiguous redundancies can be used to determine other features of the system
21Results of the Experiment (XFIG System) with procedures and files
22Results of the Experiment (Telecom. System) with procedures and files