Title: Inferring Developer Activities by Analyzing Successive Versions of Source Code
1. Inferring Developer Activities by Analyzing Successive Versions of Source Code
- Work done for HPCS and CMSC631
- Jaymie Strecker
- January 2005
2. Goal
- Source code from:
  - version control system (e.g. CVS)
  - instrumented compiler
3. Motivation
- Applications (to study the development process):
  - Analyze source code collected by an instrumented compiler; compare results of the analysis to self-reported activities
  - Analyze source code collected by a version control system (e.g. if no instrumentation is available)
4. Source Code Analysis vs. Instrumented Compilers
- Benefits of source code analysis:
  - Finer granularity (look at individual changes, not whole versions)
  - Guarantees consistency across subjects and experiments
  - Transparent to subjects
  - Can apply to data that has already been collected
But source code analysis should supplement instrumented compilers, not replace them.
5. Outline
- Source Code Changes
- Developer Activities
- An Inference Algorithm
- Algorithm Evaluation
- Conclusions
6. Source Code Changes
"One program change should be concerned with the contiguous set of concrete statements that represent a single abstract instruction."
-- Dunsmore and Gannon, 1978
Approximation: a contiguous set of modified lines.

Version A:
    main() {
      int a, b;
      a = 0;
    }

Version B:
    main() {
      int a, b, c;
      a = 0;
      printf("a = %d\n", a);
    }
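The "contiguous set of modified lines" approximation can be sketched with Python's difflib; this is a hypothetical illustration, not the tool described in the talk:

```python
import difflib

def extract_changes(version_a, version_b):
    """Group a diff of two source versions (lists of lines) into
    contiguous modified-line regions, approximating Dunsmore and
    Gannon's notion of a single program change."""
    matcher = difflib.SequenceMatcher(None, version_a, version_b)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            changes.append({
                "kind": tag,
                "old_lines": version_a[i1:i2],
                "new_lines": version_b[j1:j2],
            })
    return changes

# The slide's example: adding a variable and a print statement
# produces two separate contiguous changes.
a = ["main() {", "  int a, b;", "  a = 0;", "}"]
b = ["main() {", "  int a, b, c;", "  a = 0;",
     '  printf("a = %d\\n", a);', "}"]
for change in extract_changes(a, b):
    print(change["kind"], change["old_lines"], "->", change["new_lines"])
```

Note that a contiguous run of modified lines is only an approximation of one "abstract instruction": two unrelated edits to adjacent lines would be merged into a single change.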
7. Change Model
8. Outline
- Source Code Changes
- Developer Activities
- An Inference Algorithm
- Algorithm Evaluation
- Conclusions
9. Developer Activities
- Developer activities:
  - Formulate: Formulate an algorithmic approach
  - Program: Create or incrementally augment the program and its testing infrastructure
  - Compile: Compile and link the program developed so far
  - Test: Test the program, observing its behavior
  - Debug: Diagnose and fix erroneous behavior
  - Run: Run the program on real input data
  - Optimize: Improve program performance
-- Smith, 2004
10. Low-Level Developer Activities
11. Outline
- Source Code Changes
- Developer Activities
- An Inference Algorithm
- Algorithm Evaluation
- Conclusions
12. Inferring Developer Activities
13. Identifying the Low-Level Activity
- Heuristics to guess the activity for a change:
  - Add Functionality I (Program): First version.
  - Correct Compile-Time Errors (Program): Version A does not compile.
  - Comment/Uncomment Executable Statements (Debug): A statement appears in both versions, but in just one version the statement is inside a comment.
14. Identifying the Low-Level Activity
- Modify Comments (Program): More than half of the changed lines involve text within comments.
- Modify Debugging Code (Debug): The change involves a print statement which does not appear (uncommented) in the final version.
- Add Functionality II (Program): The change is an addition.
15. [Flowchart: each change is tested against the heuristics in order (AddFunc1, CorrectCompile, CommentStmts, ModifyDoc, ModifyDebug, AddFunc2). A "yes" at any test assigns the corresponding activity (Program or Debug); a "no" falls through to the next test; a change that fails every test is Unclassified.]
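The decision cascade in the flowchart could be sketched as an ordered chain of checks, first match wins. The predicate names below (first_version, version_a_compiles, and so on) are hypothetical stand-ins for the analyses the tool actually performs:

```python
# Hypothetical sketch of the heuristic cascade from the flowchart.
# A change is a dict of precomputed facts; heuristics are tried in
# order, and a change matching none of them is left Unclassified.

def classify_change(change):
    heuristics = [
        # (heuristic name, inferred activity, predicate)
        ("AddFunc1", "Program",
         lambda c: c.get("first_version", False)),
        ("CorrectCompile", "Program",
         lambda c: not c.get("version_a_compiles", True)),
        ("CommentStmts", "Debug",
         lambda c: c.get("toggles_comment", False)),
        ("ModifyDoc", "Program",
         lambda c: c.get("comment_line_fraction", 0.0) > 0.5),
        ("ModifyDebug", "Debug",
         lambda c: c.get("touches_transient_print", False)),
        ("AddFunc2", "Program",
         lambda c: c.get("is_pure_addition", False)),
    ]
    for name, activity, matches in heuristics:
        if matches(change):
            return name, activity
    return "Unclassified", None

print(classify_change({"version_a_compiles": False}))
print(classify_change({}))
```

Because the heuristics are ordered, a change that both fails to compile and adds a print statement is attributed to CorrectCompile, not ModifyDebug; the ordering itself encodes a judgment about which evidence is stronger.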
16. Automatic Inference
- Implementation of an inference tool is straightforward:
  - Change definition and model
  - Heuristics for low-level activities
- Tool performs simple static analysis
  - Pattern matching on source code text
- Typically takes a few seconds to analyze one subject's programming assignment
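The "pattern matching on source code text" step might look like the following regular-expression checks. This is a hypothetical sketch of two building blocks (spotting print statements and commented-out statements); the talk does not show the tool's actual patterns:

```python
import re

# Hypothetical patterns over C source lines.
PRINT_STMT = re.compile(r'\bprintf\s*\(')
# A statement (something ending in ';') inside a /* */ or // comment.
COMMENTED_STMT = re.compile(r'(/\*.*;.*\*/|//.*;)')

def has_print(line):
    """Used by the Modify Debugging Code heuristic."""
    return bool(PRINT_STMT.search(line))

def is_commented_statement(line):
    """Used by the Comment/Uncomment Executable Statements heuristic."""
    return bool(COMMENTED_STMT.search(line))

print(has_print('printf("a = %d\\n", a);'))
print(is_commented_statement('/* a = 0; */'))
```

Purely textual patterns like these are cheap but fragile (e.g. a printf inside a string literal, or a multi-line comment, would be misjudged), which is consistent with the tool being described as simple static analysis.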
17. Outline
- Source Code Changes
- Developer Activities
- An Inference Algorithm
- Algorithm Evaluation
- Conclusions
18. Evaluation of Inference Tool
[Diagram: source code data from experiments yields both the subjects' self-reported activities and the inferred developer activities, which are compared via false positives, false negatives, and unclassified changes.]
19. Data Analyzed
Experiment: Allan Snavely's class at UCSD (Fall 2004)
Source code used: Serial C/C++ implementations
Assignments used: 3
Subjects used: 11
Source code collection method: Instrumented compiler

At the beginning of the study, subjects were shown definitions of the developer activity options used by the instrumented compiler.
20. Metrics
- False positives (per heuristic): number of changes the heuristic recognizes incorrectly
- False negatives (per self-reported activity): number of changes with that self-reported activity that are classified incorrectly
- Unclassified changes: number of changes that no heuristic recognizes
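Under these definitions, the metrics can be tallied from paired (heuristic, inferred, self-reported) labels per change, roughly as follows. This is a sketch under the assumption that unclassified changes are counted separately rather than as false negatives:

```python
from collections import Counter

def evaluate(results):
    """results: one (heuristic, inferred_activity, reported_activity)
    tuple per change; heuristic is None when no heuristic matched."""
    false_pos = Counter()   # per heuristic: matched, but wrong activity
    false_neg = Counter()   # per reported activity: misclassified change
    unclassified = 0
    for heuristic, inferred, reported in results:
        if heuristic is None:
            unclassified += 1
        elif inferred != reported:
            false_pos[heuristic] += 1
            false_neg[reported] += 1
    return false_pos, false_neg, unclassified

fp, fn, un = evaluate([
    ("CorrectCompile", "Program", "Program"),  # correct classification
    ("ModifyDebug", "Debug", "Program"),       # false positive / negative
    (None, None, "Test"),                      # no heuristic matched
])
print(fp, fn, un)
```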
21. Distribution varies widely across subjects (as does the distribution of self-reported activities).
22. High false positive rate for many heuristics. Small sample size for heuristics other than CorrectCompile.
23. Almost all changes are classified.
24. No heuristics recognize Experimenting, Testing, or Tuning.
25. Outline
- Source Code Changes
- Developer Activities
- An Inference Algorithm
- Algorithm Evaluation
- Conclusions
26. Conclusions
- Source code changes only sometimes reflect self-reported activities.
- Serial Coding and Parallelizing were usually recognized correctly.
- Source code alone doesn't reveal the developer's intentions (e.g. Experimenting with Environment).
- Some activities may not affect the source code (e.g. Testing).
27. Conclusions
- Source code analysis (SCA) and instrumented compilers (IC) give different types of information.
  - SCA shows that subjects do multiple activities between compiles; IC reports just one activity per compile.
  - IC reports the amount and type of effort spent; SCA shows what was accomplished by this effort.
  - SCA is consistent across subjects; IC may not be (e.g. Debugging vs. Testing).
Good news: Revised SCA could someday supplement IC. Bad news: Difficult to evaluate SCA using IC.
28. Possible Future Work
- Narrow the scope of changes analyzed:
  - Defect-related changes
  - Language- or API-specific change patterns
- Build upon data from instrumented compilers and other sources:
  - Correlations between change characteristics or patterns and self-reported activity
  - Compare each version to a known solution, or use test case output, to understand defects
29. For More Information
- Paper: http://www.cs.umd.edu/strecker/infer_act.pdf
- Slides: http://www.cs.umd.edu/strecker/infer_act.ppt
30. Abstract
In HPCS experiments, instrumented compilers regularly log the state of the source code being developed; outside of experiments, a software project's sequence of source code versions often resides in a CVS repository. Such source code data abounds. Since the changes made from version to version in the source code are the end product of the developer's effort, information about the development process is encoded in those changes. In this study, we attempt to extract one piece of that information: why the developer made each change. Currently, instrumented compilers collect data from developers about the types of activities they perform; in the future, source code analysis may be a viable supplement to instrumented compilers. Unlike activity data collected by instrumented compilers, analysis of source code changes produces fine-grained, repeatable results. In a first attempt at source code change analysis, we present a technique that uses heuristics to recognize certain patterns of source code changes that hint at the developer's intentions. We compare the results of this technique to data collected by an instrumented compiler, and we suggest refinements to make the technique a useful tool for analysis.