Title: Getting Started in Program Analysis Research: Outline
1Getting Started in Program Analysis Research
Outline
- Background and useful skills
- Ana
- Using and developing analysis
- Mary Lou
- Identifying and building infrastructure
- Lori
- Evaluating your analysis
- Ana
2Ana Milanova
- I am from Bulgaria
- National High School for Math and Science
- American University in Bulgaria, 1997
- I have a degree in Business Administration
- Rutgers University, PhD in CS, 2003
- Now Assistant Professor at RPI
- Research: program analysis for software tools
- Family
- Husband Tony
- Katarina, 5 and Petar, 2
3Program Analysis Useful Background and Skills
4Program Analysis
- Static program analysis
- Analyzes the source code of the program
- Reasons about run-time behavior properties without running the program
- E.g., the object values that flow to reference variable x are only of classes A and B, but not C.
- Static analyses are conservative: they consider all possible run-time behaviors of the program
5Program Analysis
- Dynamic program analysis
- Analyzes a set of program executions
- Reasons about run-time behavior properties over observed executions
- E.g., the object values that flowed to reference variable x during observed executions were only of classes A and B, but not C.
- Dynamic analyses are incomplete: they consider only behaviors over particular executions
- Goal: combine with static analysis
6Uses of Static Program Analysis
- Compilers: traditional application domain
- Enables optimizing transformations
- Software engineering tools
- Static debugging, verification, security
- Uncover difficult errors and security flaws
- Testing
- Evaluate and improve test suites
- Software understanding
- Calling structure
- Complex dependences
- Change impacts
7Uses of Program Analysis
- Analysis for compiler optimization
- is different from
- Analysis for software tools
- Different requirements, different success
criteria (more later)
8Static Analysis Methodologies
- Data-flow analysis
- Constraint-based program analysis
- Abstract interpretation
- Type and effect systems
- Model checking
9Example Data-flow Analysis
- Flow facts
- Information that we are propagating
- E.g., a set of definitions: (i,1), (i,4), (i,6)
- Transfer functions
- The effect of a statement on the incoming flow facts
- E.g., the statement at 6 kills the incoming definition (i,4), and generates definition (i,6)
[Figure: control-flow graph of the example program; the definitions of i annotated on the edges are listed after each statement]
1. i11 read x,y   (i,1)
2. if x < y       (i,1) (i,1)
3. p(i)
4. ij5            (i,4)
5. p(i)           (i,4)
6. i11            (i,1) (i,6)
7. iii
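The flow facts and transfer functions above can be sketched as a small reaching-definitions analysis in Python. This is only an illustration under assumptions: the edges of the control-flow graph are not recoverable from the slide, so the shape below (2 branches to 3 and 4, and the 4-5-6 path rejoins 3 at 7) is a guess, and the names `SUCC`, `DEFINES` and `reaching_definitions` are made up for this sketch.

```python
from collections import deque

# Assumed CFG shape (a guess, not taken from the slide):
# 2 branches to 3 and 4; the 4-5-6 path and statement 3 rejoin at 7.
SUCC = {1: [2], 2: [3, 4], 3: [7], 4: [5], 5: [6], 6: [7], 7: []}
# Statements that assign to i; their right-hand sides do not matter here.
DEFINES = {1: "i", 4: "i", 6: "i", 7: "i"}


def reaching_definitions(succ, defines):
    """Return (in_facts, out_facts): sets of (variable, statement) pairs."""
    preds = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            preds[t].append(n)

    in_facts = {n: set() for n in succ}
    out_facts = {n: set() for n in succ}
    worklist = deque(sorted(succ))
    while worklist:
        n = worklist.popleft()
        # Merge: a definition reaches n if it leaves any predecessor of n.
        in_facts[n] = set().union(*(out_facts[p] for p in preds[n]))
        var = defines.get(n)
        if var is None:
            new_out = set(in_facts[n])
        else:
            # Transfer function: kill other definitions of var, generate (var, n).
            new_out = {d for d in in_facts[n] if d[0] != var} | {(var, n)}
        if new_out != out_facts[n]:
            out_facts[n] = new_out
            worklist.extend(succ[n])  # successors must be revisited
    return in_facts, out_facts


IN, OUT = reaching_definitions(SUCC, DEFINES)
print(sorted(IN[6]), sorted(OUT[6]))
print(sorted(IN[7]))
```

Under the assumed graph, statement 6 receives (i,4) and replaces it with (i,6), which is the kill/generate behavior described in the bullets.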
10Theory
- Data-flow frameworks
- Control-flow graph (CFG)
- Space of flow facts L
- Space of transfer functions F
- Certain properties of L and F allow a general solution procedure
- Fixed-point iteration
- Termination: the iterative computation terminates
- Safety (correctness, soundness): the solution is conservative
- For most problems the analysis produces noise
11Theory and Practice
- Analysis cost: how much time, memory
- Analysis precision: how much noise
- E.g., for a.m(): a more precise analysis reports a -> {B}, and a less precise analysis reports a -> {A,B,C}
- Typically, there is a tradeoff between cost and precision!
- In practice, we need to analyze very large programs: 100K LOC, even 1M LOC
12Theory and Practice
- Approximations - introduce noise
- make the CFG smaller
- make the set of flow facts smaller
- make the transfer functions converge faster
- Approximations are necessary
- But be careful: different approximations for different analyses
13Standard Approximations
- Flow-sensitive vs. flow-insensitive
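Program: x = true; y = false; x = y
- Flow-sensitive
- after x = true: x = true
- after y = false: x = true, y = false
- after x = y: x = false, y = false
- Flow-insensitive
- x = {true, false}, y = {false}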
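The difference between the two approximations can be reproduced with a short Python sketch. It reads the slide's program as x = true; y = false; x = y (the assignment operators are an assumption, since they are missing from the slide text), and the function names are made up for illustration.

```python
# Straight-line program as (target, source) pairs; a source is either
# a boolean constant or the name of another variable.
PROGRAM = [("x", True), ("y", False), ("x", "y")]


def flow_sensitive(program):
    """Keep a separate solution after each statement, respecting order."""
    state, history = {}, []
    for target, source in program:
        values = state[source] if isinstance(source, str) else {source}
        state = {**state, target: set(values)}  # overwrite the target's old values
        history.append(state)
    return history


def flow_insensitive(program):
    """Ignore statement order: accumulate values until nothing changes."""
    state = {target: set() for target, _ in program}
    changed = True
    while changed:
        changed = False
        for target, source in program:
            values = state.get(source, set()) if isinstance(source, str) else {source}
            if not values <= state[target]:
                state[target] |= values
                changed = True
    return state


sensitive = flow_sensitive(PROGRAM)
insensitive = flow_insensitive(PROGRAM)
print(sensitive[-1])
print(insensitive)
```

The flow-insensitive result keeps both values for x because it cannot tell that the first assignment is overwritten; that extra value is the noise introduced by the approximation.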
14Standard Approximations
- Context-sensitive vs. context-insensitive
A(bool X) { this.f = X; }
a = new A(true); b = new A(false);
- Context-insensitive (merged flow)
- a.f = {true, false}, b.f = {true, false}
- Context-sensitive
- a.f = {true}, b.f = {false}
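A minimal Python sketch of the same contrast, assuming the slide's code is the constructor A(bool X) { this.f = X; } called once with true and once with false. The function names and the tuple encoding of the calls are invented for this illustration.

```python
# The two constructor calls: (variable receiving the object, argument).
CALLS = [("a", True), ("b", False)]


def context_insensitive(calls):
    """Analyze the constructor once: parameter X merges the arguments of all calls."""
    x_values = {arg for _, arg in calls}  # merged flow into X
    return {f"{var}.f": set(x_values) for var, _ in calls}  # this.f = X


def context_sensitive(calls):
    """Analyze the constructor separately for each call site."""
    result = {}
    for var, arg in calls:
        x_values = {arg}               # X as seen from this call only
        result[f"{var}.f"] = x_values  # this.f = X
    return result


merged = context_insensitive(CALLS)
separate = context_sensitive(CALLS)
print(merged)
print(separate)
```

The context-sensitive version pays for its precision by analyzing the constructor once per call site, which is the cost and precision tradeoff mentioned earlier.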
15Useful Background and Skills
- Higher-level undergraduate or graduate courses on
- Programming Languages, Compilers, Algorithms, Logic, Software Engineering, Architecture
- Analytical and programming skills
- Step 1: Design a program analysis algorithm
- Understand your target language (e.g., Java and C, C)
- Step 2: Implement the analysis algorithm
- Understand the language(s) of the infrastructure
- Step 3: Evaluate analysis algorithm
16Useful Resources
- Books (my personal list)
- Compilers: Principles, Techniques, and Tools by Aho, Sethi, Ullman, Ch. 10
- An introduction to data-flow analysis
- Principles of Program Analysis by Nielson, Nielson, Hankin
- An excellent reference for advanced students
- Model Checking by Clarke, Grumberg, Peled
- Course material on the web
- Classes taught by professors
- My class (there are better ones, of course): www.cs.rpi.edu/milanova/csci6961/lectures/
17Using and Developing Program Analysis
- Mary Lou Soffa
- University of Virginia
18About Mary Lou Soffa
- Confused about what I wanted to be
- Ph.D. programs
- Mathematics, Sociology, Philosophy, Environmental Acoustics: disenchanted
- Found what I really loved: computer science
- After 25 years at Pitt, moved to UVA
- Small farm: grow crops, love my tractor
- Passion: increasing the participation of women and minorities in computer science
- Professional achievement: 24 Ph.D. students, ½ are women.
19Program analysis
- How to apply program analysis in your research
- What are questions and what do you have to do
20Solve a problem
- Program behavior: static or dynamic
- Determine information needed
- What parts of program are involved
- Develop appropriate representation
- Develop analysis
- Develop algorithm
21Have a goal program code
- Problem
- Improve performance
- Understand program
- Find errors
- Locate cause of errors
- Need to collect information about the program that helps you infer properties of program
- Static or dynamic code
22Determine information needed
- What questions are you asking
- What do you need to gather to answer questions
- Examples
- Statements needed to compute an expression
- Values that are always constant at a particular program point
- Locations of dead statements
- Branches that are correlated
23Example redundancy
- Remove redundancies with goal of improving performance
- Redundant expressions
- Redundant loads
- Redundant stores
- Dead code
- Static
- Remove redundant expressions from program
representation
24Redundant expressions
- Does the value need to be computed for correct semantics?
- X = A + B
- F = C + E
- C = C + 1
- If (cond) then R = A + B; S = C + E
- Else X = A + B; A = 6
- End if
- G = A + B
25What parts of program involved
- Given information you need, what parts of program are involved
- Examples
- branches and statements that change values in conditional
- all possible execution paths
- Array definitions and uses
- Types
- Loops
26Example Redundant expressions
- Expressions
- Definitions
- Control flow among definitions and expressions
- Program paths
27Program representation
- Program representation that enables collection of information
- Granularity
- Source, intermediate, binary
- Issues: how to get representation from another representation
28Example redundant expressions
- Want to know how expressions flow
- Is the value of an expression the same as when the expression is used again
- Need control flow graph with statements in nodes: intermediate level
- X = A + B
29Available Expressions
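[Figure: control flow graph of the example]
- Entry block: X = A + B; F = C + E; C = C + 1
- Then branch: R = A + B; S = C + E
- Else branch: X = A + B; A = 6
- Join block: G = A + B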
30Formulate analysis over representation
- How to gather information from representation
- How many analyses
- Direction of flow of analysis
- Along all paths or any path
- Local solution
- Global solution
31Example Redundant expressions
- Local: basic block, single entry/exit
- What expressions are generated
- What expressions are killed by a definition
- Global: flow over flow graph
- Forward flow
- Must be true on all paths
32Redundant Expressions
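[Figure: the example control flow graph annotated with available expressions]
- Entry block: X = A + B; F = C + E; C = C + 1
- Available on leaving the entry block and on entering each branch: A + B
- Then branch: R = A + B; S = C + E
- Else branch: X = A + B; A = 6
- Available after the then branch: A + B, C + E
- Join block: G = A + B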
33Develop analyses
- Data flow equations: use data flow framework
- Algorithm
- Preciseness
- Expense
34Data flow equations
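- $\mathrm{Gen}(B)$: all expressions computed in $B$ (and not killed later in $B$)
- $\mathrm{Kill}(B)$: all definitions in $B$ kill the incoming available expressions whose operands they redefine
- $\mathrm{Out}(B) = \mathrm{Gen}(B) \cup (\mathrm{In}(B) \setminus \mathrm{Kill}(B))$
- $\mathrm{In}(B) = \bigcap_{j \in \mathrm{pred}(B)} \mathrm{Out}(j)$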
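The equations can be exercised on the running example with a short Python sketch. Several things here are assumptions rather than content from the slides: the example is read with "+" as the lost operator (X = A + B, F = C + E, C = C + 1, and so on), the block names B1 to B4 and the edges between them are invented, and an expression is encoded simply as a tuple of its operands.

```python
# Each statement is (target, expression); an expression is a tuple of operand
# names, or None when the right-hand side is a constant. Block names are made up.
BLOCKS = {
    "B1": [("X", ("A", "B")), ("F", ("C", "E")), ("C", ("C", "1"))],
    "B2": [("R", ("A", "B")), ("S", ("C", "E"))],   # then branch
    "B3": [("X", ("A", "B")), ("A", None)],         # else branch, A = 6
    "B4": [("G", ("A", "B"))],                      # join
}
PREDS = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
ALL_EXPRS = {e for stmts in BLOCKS.values() for _, e in stmts if e is not None}


def gen_kill(stmts):
    """Local analysis of one basic block."""
    gen, kill = set(), set()
    for target, expr in stmts:
        if expr is not None:
            gen.add(expr)
        killed = {e for e in ALL_EXPRS if target in e}  # target is redefined
        gen -= killed
        kill |= killed
    return gen, kill


def available_expressions(blocks, preds, entry="B1"):
    """Global forward must-analysis: iterate the equations to a fixed point."""
    local = {b: gen_kill(stmts) for b, stmts in blocks.items()}
    out = {b: set(ALL_EXPRS) for b in blocks}  # optimistic start for intersection
    in_ = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b != entry:
                in_[b] = set.intersection(*(out[p] for p in preds[b]))
            gen, kill = local[b]
            new_out = gen | (in_[b] - kill)
            if new_out != out[b]:
                out[b], changed = new_out, True
    return in_, out


IN, OUT = available_expressions(BLOCKS, PREDS)
for block in BLOCKS:
    print(block, sorted(IN[block]), sorted(OUT[block]))
```

Under these assumptions A + B is available on entry to both branches, so R = A + B and the second X = A + B are redundant, while nothing is available on entry to the join block because the else branch redefines A.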
35Dynamic Optimization
- Static optimizations
- Apply before execution
- Dynamic Optimizations
- Apply during execution: redundant expressions
- Binary code
- Program traces
36[Figure: intermediate code partitioned into basic blocks B1 to B6]
1. A 4
2. T1 AB
3. L1: T2 T1/C
4. If T2 < W go to L2
5. M T1 K
6. T3 M 1
7. L2: H I
8. M T3 - H
9. If T3 > 0 go to L3
10. Go to L1
11. L3: halt
37Program Trace
Binary code
- A 4
- T1 AB
- T2 T1/C
- If !(T2 < W) jump out
- H I
- M T3 - H
- If T3 > 0 go to L3
- T2 T1/C
- If !(T2 < W) jump out
- M T1 K
- T3 M 1
- H I
- M T3 - H
- halt
38Dynamic optimization
- Note
- Single entry, multiple exits
- No loops
- Representation: need to bring up a level from binary code
39Applying optimizations
- Not as complicated
- But, cannot tolerate much overhead
- Phases in static
- Developed algorithm that can apply multiple optimizations
- Demand driven
- Limit study of dynamic optimizations
40Conclusion
- Need analysis in many different applications
- Virtual execution environments
- Multicore
- Wireless sensor networks
- Testing
- Testing for wireless sensor networks
- Testing for security
41Identifying and Building Infrastructure
42Lori's Journey
- Science/Math love: Started in chemistry at liberal arts college.
- Field trip and first CS course -> CS major.
- Advisors: strong push for grad school -> U Pitt.
- Took compilers course from Mary Lou -> PhD in compiler optimization.
- Big year: 10/85 married Mark. 1/86 started at Rice. 4/86 PhD.
- Family: The yankees returned north 3 years later!
- University of Delaware: 15 yrs. Visiting, Assistant, Associate, Full
- Family: Lauren (HS senior), Lindsay (16 and driving), Matt (11)
- Support: Mark, Mark, Mark, Mary Lou, Errol, Sandee, CRA-W
- Currently: software tools, testing, compiler optimization
43Identifying and Building Infrastructure for
Analysis Research
- What kinds of infrastructure do you need?
- How to identify and build infrastructure
- Examples
44What kinds of infrastructure do you need?
Analysis Research and Evaluation
People
Analysis Framework Software
Labspace
Hardware
Workloads
45Identifying Analysis Framework Software
- Determine Goals: short term, long term
- Specify Requirements: needed, desired (prioritized)
- Search for Possibilities: peers/experts, technical papers, Internet search
- Try Them Out: install, run tests, read docs, examine code, try small task
- Weigh Choices: meet requirements? ease of use/change? ...
46Example Identifying Analysis Framework Software
- Determine Goals: evaluate new analysis on Java, on its own and in client tool
- Specify Requirements: needed call graph, cfg, chg; realistic environment/apps; easy to extend/build client tools
- Search for Possibilities: common environment is IDE, Java -> Eclipse platform
- Try Them Out: install, explore; write a small plugin; use call graph, chg, cfg for small task
- Weigh Choices: learning curve vs available analyses, realism
47Implementing Your Analysis
- Once you have decided on an infrastructure
- Think Reuse!! Think modularity!!
- Think prototype, but extensible and scalable
- Test, test, test - try to be systematic
- Debug: not easy
48Example Implementing My NL Analysis
- Build small modular components -> reuse
- Analyzing method signatures to extract NL
- Building program representation for NL
- Traversing program rep
- Building program rep for IR
- Design reps to avoid loss of info -> reuse
- Ids and their roles and locations in code
- Verb, Direct object rep -> extensible
49Managing the Evolving Software Infrastructure
- Managing change over time and people
- CVS, Subversion
- Tracking tasks, bugs, deadlines/goals
- Trac, Bugzilla, GForge
- Maintaining documentation
- Javadoc, Doxygen
- Testing, testing, testing
- Unit, system, regression -- test suites
- Sounds like software engineering
50Selecting Appropriate Hardware
- Determine Goals: short term, long term
- Specify Requirements: needed, desired (prioritized)
- Search for Possibilities: peers/experts, system staff
- Weigh Choices: meet requirements? costs within budget? need to ask for money?
51Gathering Good Workloads
- Kind of evaluation desired: case studies, controlled experiment, synthesized benchmarks
- Representative: try to reduce threats to validity of experiments
- varied/similar
- domain
- size
- complexity/form
- known and available to others
52Example Gathering Good Workloads
- Kind of evaluation desired: case studies
- Research questions
- How effective is our FindConcept tool versus other code search tools (lexical search and IR)? (precision and recall)
- How does the human effort compare?
- Representative: try to reduce threats to validity of experiments
- varied/similar
- domain
- size
- complexity/form
- known/available to others
- Sourceforge: very large, many CVS updates (active), varied in domain
53Identifying Strong Students
- Teach a compiler or program analysis course regularly
- Identify students from the course
- Ideal
- Creative, quick to understand analysis
- good problem solver
- hard working
- good coder
- good communicator, good writer
- shows initiative and interest in analysis
- Some training will be required.
- Start small. Create a pipeline.
54Building a Working Lab Space
- Needs
- one workspace/computer/storage per grad student
- room for growth and undergrad researchers
- current technology: minimize old machines, maintenance?
- lab printer
- lab library of research-oriented background books
- Make it somewhere students want to work
- posters/pictures/plants
- open and pleasant: microwave, fridge, coffeepot?
- all needed resources/supplies easily available
- conference room for larger research meetings
55Static Program Analysis Evaluating Your Analysis
56A Typical Program Analysis Research Project
- Step 1: Design your analysis
- Reason about safety
- Reason about complexity in terms of program size
- Step 2: Implement your analysis
- Hard!
- Complex and difficult to test, debug and verify: a real problem
- Step 3: EVALUATE!
57Evaluation of a Compiler Analysis
- Strict requirements for the analysis
- Safety is crucial!
- An unsafe analysis may miss an execution path, and result in a change of the original program's behavior
- Analysis time (and space)
- Constrained by normal compilation time
- Objective success criteria
- Show improvement in execution time
- Show reduction in memory footprint
58Evaluation of a Compiler Analysis
- Established benchmarks
- E.g., the SPEC JVM98
- General evaluation of Java compilers
- E.g., the DaCapo benchmark suite
- Memory intensive Java applications
- Ideally you would say something like this
- "Our analysis increases compilation time by at most 10%, and results in a speed-up of 10-16% on the SPEC JVM98 benchmarks."
59Evaluation of an Analysis for a Software Tool
- Requirements for the analysis: not so strict
- Relaxing safety is OK!
- Analysis time (space) is not so crucial
- Developers would definitely wait if the analysis finds difficult bugs such as data races and memory leaks
- Success criteria: not so objective
- Precision: low noise
- Practicality: practical time/space requirements, works on 100K LOC
- Usability of tool
- Bugs found: absolutely sure
60Evaluation of an Analysis for a Software Tool
- Precision is CRUCIAL: noise is really bad!
- E.g., there are 10 buffer overflow bugs in program P
- Safe analysis A issues 1000 warnings: 10 are real and 990 are false positives
- Unsafe analysis B issues 13 warnings: 8 are real and 5 are false positives
- Analysis B is much more useful than analysis A!
- Absolute precision: done more and more often
- Choose a subset of analyzed programs
- Manually find the real solution
- Compare with analysis solution
- Precision: how much noise is there?
- Recall (if the analysis is unsafe): how much did it miss?
- E.g., for a.m(): the real solution is a -> {B}, a safe analysis solution is a -> {A,B,C}. Precision: 67% noise!
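The numbers on this slide can be checked with a short calculation. The Python sketch below uses the usual definitions (precision is the fraction of reported items that are real, recall is the fraction of real items that are reported); the function name `precision_recall` is made up for illustration.

```python
def precision_recall(reported, real_reported, real_total):
    """Return (precision, recall) from raw counts."""
    precision = real_reported / reported   # how little noise
    recall = real_reported / real_total    # how little was missed
    return precision, recall


# Program P has 10 real buffer overflow bugs.
analysis_a = precision_recall(reported=1000, real_reported=10, real_total=10)
analysis_b = precision_recall(reported=13, real_reported=8, real_total=10)
# Call a.m(): real solution {B}, safe analysis solution {A, B, C}.
call_example = precision_recall(reported=3, real_reported=1, real_total=1)

for name, (p, r) in [("A", analysis_a), ("B", analysis_b), ("a.m()", call_example)]:
    print(f"{name}: precision {p:.1%}, recall {r:.1%}")
```

Analysis A misses nothing but 99% of its warnings are noise, while B misses two bugs and most of what it reports is real, which is why the slide calls B the more useful tool.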
61Evaluation of an Analysis for a Software Tool
- Finding a benchmark set
- Depends on analysis application
- Large programs
- Diverse programs, as many as is feasible
- Publicly available: sourceforge.org
- Look at benchmark suites in published work!
- Ideally, you will have a large set of diverse programs, and will show acceptable absolute precision (low false positive rate) and practical cost
62Comparison with Existing Analysis
- Well-known program analysis problems
- Haven't we solved that problem yet?
- E.g., Points-to analysis
- Design a new analysis A
- Compare with best known analysis B
- Show improvement in one or more of analysis
cost, analysis precision
63What Not to Do
- Propose a new analysis without any evaluation
- E.g., "We describe this new great points-to analysis."
- Design your own metric, different from established metrics
- E.g., "We propose a novel points-to analysis A and points-to analysis A' which improves on A. Therefore, both A and A' are great."
- Use non-standard benchmark
- Report on a subset: the ones for which the analysis works
64Questions
66An Example Devirtualization in Object-oriented
Programs
- Polymorphism and dynamic dispatch
- class A { void m() { ... } }
- class B extends A { void m() { ... } }
- class C extends A { void m() { ... } }
- Virtual call a.m() is dispatched at run-time, based on the class of the receiver: A, B or C
- Powerful: enables modern software engineering
- But costly: 13% of time spent in virtual dispatch
- Analysis: only B objects ever flow to a
- Optimization: virtual call a.m() -> direct call to B.m()
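The optimization decision itself is simple once the analysis result is known. A minimal Python sketch, assuming the analysis hands back the set of classes that may flow to the receiver and that every class in the hierarchy defines m(), as on the slide; the function name `devirtualize` and the string encoding of calls are invented here.

```python
def devirtualize(receiver, method, receiver_classes):
    """Return a direct call if the analysis proves a single receiver class."""
    if len(receiver_classes) == 1:
        (cls,) = receiver_classes
        return f"{cls}.{method}()"       # direct call, no dispatch needed
    return f"{receiver}.{method}()"      # keep the virtual call


precise = devirtualize("a", "m", {"B"})             # analysis: only B flows to a
imprecise = devirtualize("a", "m", {"A", "B", "C"})  # noisier analysis
print(precise, imprecise)
```

This is also why precision matters for this client: a noisier analysis that reports A, B and C for the same receiver leaves the virtual call in place.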
67Uses of Static Program Analysis
- Software engineering tools
- Static debugging, verification, security
- Uncover difficult errors and security flaws
- Testing
- Evaluate and improve test suites
- Software understanding
- Calling structure
- Complex dependences
- Change impacts
- Many (unexplored) areas of application
68Static Debugging
- Analyze the program and look for bugs
- Memory and pointer bugs: memory leaks, null pointer dereferences, double frees, buffer overflows, etc.
- Concurrency bugs: races, deadlocks
- Issue warnings
- Microsoft
- PREfix and PREfast tools in use since 2000
- Many new tools developed
- IBM
- Tools for static debugging of production J2EE
- Tools for security auditing of J2EE
69Software Testing
- Coverage-based testing
- Improve test quality with good coverage
- E.g., cover all possible receiver classes at virtual calls
- Step 1: analyze the tested code
- What are all possible receiver classes at virtual calls?
- a.m(): Analysis: only B objects ever flow to a
- Step 2: insert instrumentation
- Step 3: run tests and report coverage
- What were the receiver classes actually observed while running the tests?
- Compare the observed receiver classes with the possible ones
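The comparison in the last step amounts to set arithmetic over receiver classes. A small Python sketch follows; the call site and the two sets are hypothetical, and `receiver_coverage` is a made-up name.

```python
def receiver_coverage(possible, observed):
    """Compare statically possible receiver classes with those seen in tests."""
    if not possible:
        return 1.0, set()               # nothing to cover at this call site
    covered = possible & observed
    return len(covered) / len(possible), possible - observed


# Hypothetical call site: the analysis allows A, B and C; the tests saw only B.
ratio, missing = receiver_coverage({"A", "B", "C"}, {"B"})
print(f"coverage {ratio:.0%}, never exercised: {sorted(missing)}")
```

A more precise analysis shrinks the set of possible classes, so the reported coverage is not dragged down by receiver classes that can never occur.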
70Software Understanding
[Figure: calling structure with nodes X.n() and B.m()]
- Navigate through calling structure
- Reason about (im)mutability
- Powerful, central to imperative programming
- Many real bugs are due to unintended mutability
- Q1: is a method A.m() side-effect free?
- Q2: can a private field in a class A be mutated by untrusted clients of A (i.e., classes that use A)?
- Reason about other quality attributes
- Find code related to a change, etc.
- Reverse engineering
71Program Representations
- if (x < y) then z = 1 else z = 2
- Control Flow Graph
- Linear
- 3-address statements
- Flow of control
- Syntax Tree
- Tree
- Parse tree of the program
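Both representations can be shown for the slide's statement with a short Python sketch. Python's own `ast` module stands in for the parser of the slide's language, which is an assumption made only for illustration, and the control flow graph is written out by hand; the block names and the temporary `t1` are made up.

```python
import ast

# The slide's statement, written in Python syntax.
SOURCE = "if x < y:\n    z = 1\nelse:\n    z = 2\n"

# Syntax tree: a tree whose root for this statement is an If node.
tree = ast.parse(SOURCE)
if_node = tree.body[0]
print(type(if_node).__name__, type(if_node.test).__name__)
print([type(s).__name__ for s in if_node.body + if_node.orelse])

# Control flow graph: linear 3-address-style statements grouped into basic
# blocks, plus explicit flow-of-control edges between the blocks.
CFG_BLOCKS = {
    "test": ["t1 = x < y", "branch t1"],
    "then": ["z = 1"],
    "else": ["z = 2"],
    "exit": [],
}
CFG_EDGES = {"test": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}

for block, statements in CFG_BLOCKS.items():
    print(block, statements, "->", CFG_EDGES[block])
```

The tree keeps the nesting of the source, while the graph flattens it into straight-line statements and makes the two possible paths explicit, which is the form the data-flow analyses above operate on.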