Title: Program Analysis and Transformation
1Program Analysis and Transformation
2Program Analysis
- Extracting information, in order to present abstractions of, or answer questions about, a software system
- Static Analysis: Examines the source code
- Dynamic Analysis: Examines the system as it is executing
3What are we looking for?
- Depends on our goals and the system
- In almost any language, we can find out information about variable usage
- In an OO environment, we can find out which classes use other classes, which are the base of an inheritance structure, etc.
- We can also find blocks of code that can never be executed when running the program (dead code)
- Typically, the information extracted is in terms of entities and relationships
4Entities
- Entities are individuals that live in the system, and attributes associated with them.
- Some examples:
- Classes, along with information about their superclass, their scope, and where in the code they exist.
- Methods/functions and what their return type or parameter list is, etc.
- Variables and what their types are, and whether or not they are static, etc.
5Relationships
- Relationships are interactions between the entities in the system.
- Relationships include:
- Classes inheriting from one another.
- Methods in one class calling the methods of another class, and methods within the same class calling one another.
- A method referencing an attribute.
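As a rough illustration of what a fact extractor produces, the sketch below pulls class and method entities plus inherit and call relationships out of a small Python program using the standard `ast` module. The sample classes and the tuple layout of the facts are hypothetical, chosen only for this example; real extractors for C or C++ are far more involved.

```python
import ast

# Hypothetical subject program, used only for illustration.
SOURCE = '''
class Shape:
    def area(self):
        return 0

class Triangle(Shape):
    def area(self):
        return self.half_base() * self.height

    def half_base(self):
        return self.base / 2
'''

def extract_facts(source):
    """Return (entities, relationships) found in Python source text."""
    entities, relationships = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        entities.append(("class", node.name))
        # Inheritance: one fact per simple base-class name.
        for base in node.bases:
            if isinstance(base, ast.Name):
                relationships.append(("inherit", node.name, base.id))
        for item in node.body:
            if not isinstance(item, ast.FunctionDef):
                continue
            method = f"{node.name}.{item.name}"
            entities.append(("method", method))
            # Calls of the form obj.name(...) made inside this method.
            for sub in ast.walk(item):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Attribute):
                    relationships.append(("call", method, sub.func.attr))
    return entities, relationships

entities, relationships = extract_facts(SOURCE)
for fact in relationships:
    print(*fact)
```

The sketch records only the called method's name, not the class it resolves to; resolving that is part of what makes real extraction hard.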
6Information format
- Many different formats in use
- Simple but effective: RSF, e.g. inherit TRIANGLE SHAPE
- TA is an extension of RSF that includes a schema, e.g. $INSTANCE SHAPE Class
- GXL is an XML-based extension of TA; its blow-up factor of 10 or more makes it rather cumbersome
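Because RSF is just whitespace-separated triples, reading it takes very little code. The following is a minimal sketch, assuming one "relation source target" triple per line; the SQUARE and draw/area facts are made up for the example.

```python
from collections import defaultdict

# One triple per line: relation, source entity, target entity.
# Only the first line comes from the slide; the others are hypothetical.
RSF_TEXT = """\
inherit TRIANGLE SHAPE
inherit SQUARE SHAPE
call draw area
"""

def parse_rsf(text):
    """Group whitespace-separated triples by relation name."""
    relations = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        relation, source, target = parts
        relations[relation].add((source, target))
    return relations

facts = parse_rsf(RSF_TEXT)
print(sorted(facts["inherit"]))  # [('SQUARE', 'SHAPE'), ('TRIANGLE', 'SHAPE')]
```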
7Static Analysis
- Involves parsing the source code
- Usually creates an Abstract Syntax Tree
- Borrows heavily from compiler technology but stops before code generation
- Requires a grammar for the programming language
- Can be very difficult to get right
8CppETS
- CppETS is a benchmark for C++ extractors
- It consists of a collection of C++ programs that pose various problems commonly found in parsing and reverse engineering
- Static analysis research tools typically get about 60% of the problems right
9Example program
    #include <iostream>

    class Hello {
    public:
        Hello();
        ~Hello();
    };

    Hello::Hello()
    { std::cout << "Hello, world.\n"; }

    Hello::~Hello()
    { std::cout << "Goodbye, cruel world.\n"; }

    int main() {
        Hello h;
        return 0;
    }
10Example QA
- How many member methods are in the Hello class?
Answer: Two, the constructor (Hello::Hello()) and the destructor (Hello::~Hello()).
- Where are these member methods used?
Answer: The constructor is called implicitly when an instance of the class is created. The destructor is called implicitly when execution leaves the scope of the instance.
11Static analysis in IDEs
- High-level languages lend themselves better to static analysis needs
- EiffelStudio automatically creates BON diagrams of the static structure of Eiffel systems
- Rational Rose does the same with UML and Java
- Unfortunately, most legacy systems are not
written in either of these languages
12Static analysis pipeline
Source code -> Parser -> Abstract Syntax Tree -> Fact extractor -> Fact base; the Clustering algorithm, Visualizer, and Metrics tool work on the fact base
13Dynamic Analysis
- Provides information about the run-time behaviour of software systems, e.g.
- Component interactions
- Event traces
- Concurrent behaviour
- Code coverage
- Memory management
- Can be done with a profiler or a debugger
14Instrumentation
- Augments the subject program with code that transmits events to a monitoring application, or writes relevant information to an output file
- A profiler can be used to examine the output file and extract relevant facts from it
- Instrumentation affects the execution speed and storage space requirements of the system
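A small sketch of the idea, assuming instrumentation at the function level: a decorator adds event-emitting code around each function, and an in-memory list stands in for the monitoring application or output file. The names `instrument`, `EVENTS`, `helper`, and `compute` are hypothetical.

```python
import functools

EVENTS = []  # stands in for the monitoring application or output file

def instrument(func):
    """Wrap func so that every entry and exit is recorded as an event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        EVENTS.append(("enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            EVENTS.append(("exit", func.__name__))
    return wrapper

@instrument
def helper(x):
    return x + 1

@instrument
def compute(x):
    return helper(x) * 2

result = compute(3)
for event in EVENTS:
    print(*event)
```

The extra list appends on every call are exactly the kind of overhead in time and space that instrumentation introduces.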
15Instrumentation process
Source code + Annotation script -> Annotator -> Annotated program -> Compiler -> Instrumented executable
16Dynamic analysis pipeline
Instrumented executable -> CPU -> Dynamic analysis data -> Profiler -> Fact base; the Clustering algorithm, Visualizer, and Metrics tool work on the fact base
17Non-instrumented approach
- One can also use debugger log files to obtain dynamic information
- Disadvantage: Limited amount of information provided
- Advantage: Less intrusive approach, more accurate performance measurements
18Dynamic analysis issues
- Ensuring good code coverage is a key concern
- A comprehensive test suite is required to ensure that all paths in the code will be exercised
- Results may not generalize to future executions
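To make the coverage concern concrete, here is a minimal sketch that records which lines of a function actually run for a given input, using Python's `sys.settrace` hook. The function `classify` and the helper `lines_executed` are hypothetical names for this example.

```python
import sys

def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"

def lines_executed(func, *args):
    """Run func(*args) and return the line offsets (from the def line) executed in func."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            seen.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return seen

only_positive = lines_executed(classify, 5)
both_inputs = only_positive | lines_executed(classify, -1)
print(sorted(only_positive), sorted(both_inputs))
```

A test suite containing only the positive input never reaches the "negative" branch, so any facts derived from that run say nothing about it.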
19Static vs. Dynamic
Static analysis:
- Reasons over all possible behaviours (general results)
- Conservative and sound
- Challenge: Choose good abstractions
Dynamic analysis:
- Observes a small number of behaviours (specific results)
- Precise and fast
- Challenge: Select representative test cases
20SWAGKit
- SWAGKit is used to generate software landscapes from source code
- Based on a pipeline architecture with three phases:
- Extract (cppx)
- Manipulate (prep, linkplus, layoutplus)
- Present (lsedit)
- Currently usable for programs written in C/C++
21The SWAGKit Pipeline
Source Code -> cppx -> prep -> linkplus -> layoutplus -> lsedit -> Landscape
22The SWAGKit Pipeline
Function    Filter      Input       Output
Extract     cppx        source      .ta
Manipulate  prep        .ta         .o.ta
            linkplus    .o.ta       out.ln.ta
            layoutplus  out.ln.ta   out.ls.ta
Present     lsedit      out.ls.ta   picture
23cppx & prep
- C/C++ fact extractor based on gcc (http://swag.uwaterloo.ca/cppx)
- Extracts facts from one source file at a time
- Facts represent program information as a series of triples:
- $INSTANCE x integer: x is an integer
- inherit Student Person: Student inherits from Person
- call foo bar: foo calls bar
- Produces .c.ta files, one per source file
- Use g option for gcc parameters
24cppx & prep
- Prep is a series of scripts written in Grok
- Its function is to clean up the facts from cppx so that they are in a form usable by the rest of the pipeline.
- Produces one .o.ta for each .ta
- Can replace manual use of cppx & prep with gce
- Edit makefile, replace gcc with gce
- Type make
25Grok
- A simple scripting language
- A relational algebraic calculator
- Powerful in manipulating binary relations
- Widely used in architecture transformation
- Online documentation:
http://swag.uwaterloo.ca/j25wu/projects/grokdoc/index.html
http://swag.uwaterloo.ca/nsynytskyy/grokdoc/index.html
26Grok Features
- Set operations
- Union (+), intersection (^), subtraction (-), cross-product (X)
- Binary relation operations
- Union (+), intersection (^), subtraction (-), composition (o, *), projection (.), domain (dom), range (rng), identity (id), inverse (inv), entity (ent), transitive closure (+), and reflexive transitive closure (*)
27Grok Features Cont.
- Programming constructs
- if, else
- for, while
- Arithmetic, comparison, logical operators
- +, -, *, /
- <, <=, ==, >, >=, !=
- !, &&, ||
28Grok Scripts (1)
- Grok
    >> cat := {Garfield, Fluffy}
    >> mouse := {Mickey, Nancy}
    >> cheese := {Roquefort, Swiss}
    >> animals := cat + mouse
    >> food := mouse + cheese
    >> animalsWhichAreFood := animals ^ food
    >> animalsWhichAreNotFood := animals - food
    >> animalsWhichAreFood
    Mickey
    Nancy
    >> animals - food
    Garfield
    Fluffy
    >> # food
    4
    >> mouse <= food
    True
    >>
    >> chase := cat X mouse
    >> chase
    Garfield Mickey
    Garfield Nancy
    Fluffy Mickey
    Fluffy Nancy
    >>
    >> eat := chase + mouse X cheese
    >> eat
    Garfield Mickey
    Garfield Nancy
    Fluffy Mickey
    Fluffy Nancy
    Mickey Roquefort
    Mickey Swiss
    Nancy Roquefort
    Nancy Swiss
29Grok Scripts (2)
    >> Mickey . eat
    Roquefort
    Swiss
    >> eat . Mickey
    Garfield
    Fluffy
    >>
    >> eater := dom eat
    >> food := rng eat
    >> chasedBy := inv chase
    >> topOfFoodChain := dom eat - rng eat
    >> bottomOfFoodChain := rng eat - dom eat
    >> bothEatAndChase := eat ^ chase
    >> eatButNotChase := eat - chase
    >> chaseButNotEat := chase - eat
    >> secondOrderEat := eat o eat
    >> anyOrderEat := eat+
Programming constructs:
    if expression { statements } else { statements }
    while expression { statements }
    for variable in expression { statements }
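For readers without a Grok installation, the same relational calculations can be approximated with Python sets of pairs. This is only a sketch: the helper names (`cross`, `compose`, `dom`, `rng`, `inv`, `transitive_closure`) are chosen here to mirror the Grok operators and are not part of any Grok API.

```python
from itertools import product

def cross(a, b):
    """Cross product of two sets, as a binary relation (Grok: a X b)."""
    return set(product(a, b))

def compose(r, s):
    """Relational composition (Grok: r o s)."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

def dom(r):
    return {x for x, _ in r}

def rng(r):
    return {y for _, y in r}

def inv(r):
    return {(y, x) for x, y in r}

def transitive_closure(r):
    """Smallest transitive relation containing r (Grok: r+)."""
    closure = set(r)
    while True:
        extra = compose(closure, r) - closure
        if not extra:
            return closure
        closure |= extra

cat = {"Garfield", "Fluffy"}
mouse = {"Mickey", "Nancy"}
cheese = {"Roquefort", "Swiss"}

chase = cross(cat, mouse)
eat = chase | cross(mouse, cheese)

top_of_food_chain = dom(eat) - rng(eat)      # eats but is never eaten
bottom_of_food_chain = rng(eat) - dom(eat)   # is eaten but never eats
second_order_eat = compose(eat, eat)
any_order_eat = transitive_closure(eat)

print(sorted(top_of_food_chain), sorted(bottom_of_food_chain))
```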
30A real example
    containFacts := $1
    getdb containFacts
    d := dom contain
    r := rng contain
    e := ent contain
    root := d - r
    leaves := r - d
    rootChildren := root . contain
    toKeep := leaves + rootChildren
    toDelete := e - toKeep
    cc := contain+
    delset toDelete
    delrel contain
    contain := cc
    relToFile contain $2
Input: A containment tree. Output: A flattened version of the containment tree.
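The flattening can be sketched in Python over a set of (parent, child) pairs. This follows one reading of the script, namely that deleting the toDelete entities also removes every pair that mentions them; that reading, the function names, and the sample tree are assumptions made for illustration.

```python
def transitive_closure(rel):
    """Smallest transitive relation containing rel."""
    closure = set(rel)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in rel if b == c} - closure
        if not extra:
            return closure
        closure |= extra

def flatten(contain):
    """Keep only the root's children and the leaves, linked through the closure."""
    d = {parent for parent, _ in contain}
    r = {child for _, child in contain}
    root = d - r
    leaves = r - d
    root_children = {child for parent, child in contain if parent in root}
    to_keep = leaves | root_children
    closure = transitive_closure(contain)
    # Assumed effect of deleting the other entities: drop pairs that mention them.
    return {(p, c) for p, c in closure if p in to_keep and c in to_keep}

# Hypothetical containment tree: sys > {ui > widgets > button.c, core > main.c}
contain = {
    ("sys", "ui"), ("sys", "core"),
    ("ui", "widgets"), ("widgets", "button.c"),
    ("core", "main.c"),
}
flat = flatten(contain)
print(sorted(flat))  # [('core', 'main.c'), ('ui', 'button.c')]
```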
31linkplus
- Function is to link all facts into one large graph
- Combine graphs from .o.ta files
- Resolve inter-compilation unit relationships
- Merge header files together
- Do some cleanup to shrink final graph
- Usage
- linkplus list_of_files_to_link
- Produces out.ln.ta
32layoutplus
- Adds
- Clustering of facts based on contain.rsf (created manually or from a clustering algorithm)
- Layout information so that the graph can be displayed
- Schema information
- Usage
- layoutplus contain_file out.ln.ta
- Produces out.ls.ta
33lsedit
- View the software landscape produced by previous parts of the pipeline
- Can make changes to the landscape and save them
- Usage
- lsedit out.ls.ta
34Program Representation
- Fundamental issue in re-engineering
- Provides means to generate abstractions
- Provides input to a computational model for analyzing and reasoning about programs
- Provides means for translation and normalization of programs
35Key questions
- What are the strengths and weaknesses of various representations of programs?
- What levels of abstraction are useful?
36Abstract Syntax Trees
- A translation of the source text in terms of operands and operators
- Omits superficial details, such as comments and whitespace
- All necessary information to generate further abstractions is maintained
37AST production
- Four necessary elements to produce an AST
- Lexical analyzer (turn input strings into tokens)
- Grammar (turn tokens into a parse tree)
- Domain Model (defines the nodes and arcs allowable in the AST)
- Linker (annotates the AST with global information, e.g. data types, scoping etc.)
38AST example
- Input string: 1 + /* two */ 2
- Parse Tree: (figure not preserved)
- AST (without global info): an Add node whose arg1 is the int 1 and whose arg2 is the int 2
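The same effect can be observed with Python's standard `ast` module. The sketch below parses an analogous expression, using Python's `#` comment syntax in place of `/* ... */`, and shows that the comment does not survive into the tree.

```python
import ast

source = "1 + 2  # two"
tree = ast.parse(source, mode="eval")
root = tree.body  # the expression node under the top-level wrapper

# The tree keeps the operator and its two integer operands only.
print(type(root).__name__, type(root.op).__name__, root.left.value, root.right.value)
print("two" in ast.dump(tree))  # False: the comment is not part of the AST
```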
39Program Transformation
- A program is a structured object with semantics
- Structure allows us to transform a program
- Semantics allow us to compare programs and decide
on the validity of transformations
40Program Transformation
- The act of changing one program into another (from a source language to a target language)
- Used in many areas of software engineering
- Compiler construction
- Software visualization
- Documentation generation
- Automatic software renovation
41Application examples
- Converting to a new language dialect
- Migrating from a procedural language to an object-oriented one, e.g. C to C++
- Adding code comments
- Requirement upgrading, e.g. using 4 digits for years instead of 2 (Y2K)
- Structural improvements, e.g. changing GOTOs to control structures
- Pretty printing
42Simple program transformation
- Modify all arithmetic expressions to reduce the number of parentheses using the formula (a+b)*c = a*c + b*c
- Example: x := (2+5)*3 becomes x := 2*3 + 5*3
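A minimal sketch of this transformation on Python source, using the standard `ast` module: the tree is rewritten wherever a sum is multiplied on the right, then printed back as text. The class name `Distribute` and function `distribute` are hypothetical, Python's `=` stands in for the assignment operator, and `ast.unparse` needs Python 3.9 or later.

```python
import ast
import copy

class Distribute(ast.NodeTransformer):
    """Rewrite (a + b) * c into a * c + b * c, bottom-up, in a single pass."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform sub-expressions first
        is_product = isinstance(node.op, ast.Mult)
        left_is_sum = isinstance(node.left, ast.BinOp) and isinstance(node.left.op, ast.Add)
        if is_product and left_is_sum:
            a, b, c = node.left.left, node.left.right, node.right
            return ast.BinOp(
                left=ast.BinOp(left=a, op=ast.Mult(), right=c),
                op=ast.Add(),
                right=ast.BinOp(left=b, op=ast.Mult(), right=copy.deepcopy(c)),
            )
        return node

def distribute(source):
    """Return the source text with the distributive rewrite applied."""
    tree = Distribute().visit(ast.parse(source))
    return ast.unparse(tree)

rewritten = distribute("x = (2 + 5) * 3")
print(rewritten)
```

Because the rewrite preserves the value of every expression, running the original and the rewritten statement gives the same x; that is the semantic check that makes the transformation valid.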
43Two types of transformations
- Translation
- Source and target language are different
- Semantics remain the same
- Rephrasing
- Source and target language are the same
- Goal is to improve some aspect of the program such as its understandability or performance
- Semantics might change
44Translation
- Program synthesis
- Lowers the level of abstraction, e.g. compilation
- Program migration
- Transform to a different language
- Reverse Engineering
- Raises the level of abstraction, e.g. create architectural descriptions from the source code
- Program Analysis
- Reduces the program to one aspect, e.g. control flow
45Translation taxonomy
46Rephrasing
- Program normalization
- Decreases syntactic complexity (desugaring), e.g. algebraic simplification of expressions
- Program optimization
- Improves performance, e.g. inlining, common-subexpression and dead code elimination
47Rephrasing
- Program refactoring
- Improves the design by restructuring while preserving the functionality
- Program obfuscation
- Deliberately makes the program harder to understand
- Software renovation
- Fixes bugs such as Y2K
48Transformation tools
- There are many transformation tools
- Program-Transformation.org lists 90 of them
- Most are based on term rewriting
- Other solutions use functional programming,
lambda calculus, etc.
49Term rewriting
- The process of simplifying symbolic expressions (terms) by means of a Rewrite System, i.e. a set of Rewrite Rules.
- A Rewrite Rule is of the form lhs -> rhs, where lhs and rhs are term patterns
50Example Rewrite System
- 0 + x -> x
- s(x) + y -> s(x + y)
- (x + y) + z -> x + (y + z)
- Under these rewrite rules, the term
- ((s(s(a)) + s(b)) + c)
- will be rewritten as
- s(s(a + s(b + c)))
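A rewrite system like this one is small enough to run by hand or by program. The sketch below encodes terms as nested tuples and applies the three rules, assuming the operator is + throughout and an innermost-first strategy; the encoding and the function names are choices made for this example.

```python
# Terms: a constant is a string such as "0" or "a";
# s(x) is ("s", x); x + y is ("+", x, y).

def rewrite_once(t):
    """Apply one rule at the root of t, or return None if no rule matches."""
    if isinstance(t, tuple) and t[0] == "+":
        _, left, right = t
        if left == "0":                                  # 0 + x -> x
            return right
        if isinstance(left, tuple) and left[0] == "s":   # s(x) + y -> s(x + y)
            return ("s", ("+", left[1], right))
        if isinstance(left, tuple) and left[0] == "+":   # (x + y) + z -> x + (y + z)
            return ("+", left[1], ("+", left[2], right))
    return None

def normalize(t):
    """Rewrite innermost-first until no rule applies anywhere in the term."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(normalize(arg) for arg in t[1:])
    reduced = rewrite_once(t)
    return t if reduced is None else normalize(reduced)

def show(t):
    """Render a term as text, fully parenthesizing every sum."""
    if isinstance(t, str):
        return t
    if t[0] == "s":
        return f"s({show(t[1])})"
    return f"({show(t[1])} + {show(t[2])})"

term = ("+", ("+", ("s", ("s", "a")), ("s", "b")), "c")
normal_form = normalize(term)
print(show(term), "rewrites to", show(normal_form))
```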
51TXL
- A generalized source-to-source translation system
- Uses a context-free grammar to describe the structures to be transformed
- Rule specification uses a by-example style
- Has been used to process billions of lines of code for Y2K purposes
52TXL programs
- TXL programs consist of two parts:
- Grammar for the input language
- Transformation Rules
- Let's look at some examples
53Calculator.Txl - Grammar
    % Part I. Syntax specification
    define program
        [expression]
    end define

    define expression
        [term]
      | [expression] [addop] [term]
    end define

    define term
        [primary]
      | [term] [mulop] [primary]
    end define

    define primary
        [number]
      | ( [expression] )
    end define

    define addop
        '+
      | '-
    end define

    define mulop
        '*
      | '/
    end define
54Calculator.Txl - Rules
    % Part 2. Transformation rules
    rule main
        replace [expression]
            E [expression]
        construct NewE [expression]
            E [resolveAddition]
              [resolveSubtraction]
              [resolveMultiplication]
              [resolveDivision]
              [resolveParentheses]
        where not
            NewE [= E]
        by
            NewE
    end rule

    rule resolveAddition
        replace [expression]
            N1 [number] + N2 [number]
        by
            N1 [+ N2]
    end rule

    rule resolveSubtraction ...
    rule resolveMultiplication ...
    rule resolveDivision ...

    rule resolveParentheses
        replace [primary]
            ( N [number] )
        by
            N
    end rule
55DotProduct.Txl
    % Form the dot product of two vectors,
    % e.g., (1 2 3).(3 2 1) => 10
    define program
        ( [repeat number] ) . ( [repeat number] )
      | [number]
    end define

    rule main
        replace [program]
            ( V1 [repeat number] ) . ( V2 [repeat number] )
        construct Zero [number]
            0
        by
            Zero [addDotProduct V1 V2]
    end rule

    rule addDotProduct V1 [repeat number] V2 [repeat number]
        deconstruct V1
            First1 [number] Rest1 [repeat number]
        deconstruct V2
            First2 [number] Rest2 [repeat number]
        construct ProductOfFirsts [number]
            First1 [* First2]
        replace [number]
            N [number]
        by
            N [+ ProductOfFirsts] [addDotProduct Rest1 Rest2]
    end rule
56Sort.Txl
    % Sort.Txl - simple numeric bubble sort
    define program
        [repeat number]
    end define

    rule main
        replace [repeat number]
            N1 [number] N2 [number] Rest [repeat number]
        where
            N1 [> N2]
        by
            N2 N1 Rest
    end rule
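The sorting rule works because a rule is reapplied until it no longer matches anywhere. A rough Python analogue of that fixpoint behaviour is sketched below; it imitates the effect of the rule on a list of numbers and is not how TXL itself is implemented. The function names are hypothetical.

```python
def apply_rule_once(numbers):
    """Swap the first adjacent pair N1 N2 with N1 > N2; return None if there is none."""
    for i in range(len(numbers) - 1):
        if numbers[i] > numbers[i + 1]:
            return numbers[:i] + [numbers[i + 1], numbers[i]] + numbers[i + 2:]
    return None

def rewrite_to_fixpoint(numbers):
    """Keep applying the rule until it no longer matches anywhere."""
    current = list(numbers)
    while True:
        rewritten = apply_rule_once(current)
        if rewritten is None:
            return current
        current = rewritten

print(rewrite_to_fixpoint([3, 1, 2]))  # [1, 2, 3]
```

Each application removes one inversion, so the loop terminates, and the result is sorted because a list with no out-of-order adjacent pair is in order.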
57Other TXL constructs
    compounds
        ->
    end compounds

    keys
        var procedure exists inout out
    end keys

    function isAnAssignmentTo X [id]
        match [statement]
            X := Y [expression]
    end function
58www.txl.ca
- Guided Tour
- Many examples
- Reference manual
- Download TXL for many platforms
59Example uses
- HTML Pretty Printing of Source Code
- Language to Language Translation
- Design Recovery from Source
- Improvement of security problems
- Program instrumentation and measurement
- Logical formula simplification and
interpretation.