Title: Introducing Natural Language Program Analysis
1Introducing Natural Language Program Analysis
- Lori Pollock, K. Vijay-Shanker, David Shepherd,
- Emily Hill, Zachary P. Fry, Kishen Maloor
2NLPA Research Team Leaders
K. Vijay-Shanker The Umpire
Lori Pollock Team Captain
University of Delaware
3Problem
- Modern software is large and complex
Software development tools are needed
object oriented class hierarchy
4Successes in Software Development Tools
- Good with local tasks
- Good with traditional structure
object oriented class hierarchy
5Issues in Software Development Tools
- Scattered tasks are difficult
- Programmers use more than traditional program
structure
object oriented class hierarchy
6Observations in Software Development Tools
public interface Storable...
//Store the fields in a file....
Key Insight: Programmers leave natural language clues that can benefit software development tools
undo action
public void Circle.save()
update drawing
activate tool
save drawing
object oriented system
7Studies on choosing identifiers
So, I could use x, y, z. But, no one will
understand my code.
I don't care about names.
Carla, the compiler writer
Pete, the programmer
- Impact of human cognition on names (Liblit et al. PPIG 06)
  - Metaphors, morphology, scope, part of speech hints
  - Hints for understanding code
- Analysis of function identifiers (Caprile and Tonella WCRE 99)
  - Lexical, syntactic, semantic
  - Use for software tools: metrics, traceability, program understanding
8Our Research Path
- Motivated usefulness of exploiting natural language (NL) clues in tools (MACS 05, LATE 05)
- Developed extraction process and an NL-based program representation (AOSD 06)
- Created and evaluated a concern location tool and an aspect miner with NL-based analysis (ASE 05, AOSD 07, PASTE 07)
9pic
Name: David C Shepherd
Nickname: Leadoff Hitter
Current Position: PhD, May 30, 2007
Future Position: Postdoc, Gail Murphy
Stats:
Year   coffees/day   redmarks/paper draft
2002   0.1           500
2007   2.2           100
10Aspect Mining
How can I fix Paul's atrocious code?
Applying NL Clues for
- Aspect-Oriented Programming
Molly, the Maintainer
Aspect Mining Task
Locate refactoring candidates
11Timna An Aspect Mining Framework ASE 05
- Uses program analysis clues for mining
- Combines clues using machine learning
- Evaluated vs. Fan-in
- Precision (quality) and Recall (completeness)
Tool     P (%)   R (%)
Fan-In   37      2
Timna    62      60
12Integrating NL Clues into Timna
- iTimna (Timna with NL)
- Integrates natural language clues
- Example Opposite verbs (open and close)
Tool     P (%)   R (%)
Fan-In   37      2
Timna    62      60
iTimna   81      73
Natural language information increases the
effectiveness of Timna
Come back Thurs 10:05am
13Concern Location
Applying NL Clues for
Motivation
- 60-90% of software costs are spent on reading and navigating code for maintenance (fixing bugs, adding features, etc.)
  (Erlikh, Leveraging Legacy System Dollars for E-Business)
14Key Challenge Concern Location
- Find, collect, and understand all source code
related to a particular concept
Concerns are often crosscutting
15State of the Art for Concern Location
- Mining Dynamic Information (Wilde ICSM 00)
- Program Structure Navigation (Robillard FSE 05, FEAT, Schaefer ICSM 05)
- Search-Based Approaches
  - RegEx (grep, Aspect Mining Tool 00)
  - LSA-Based (Marcus 04)
  - Word-Frequency Based (GES 06)
Reduced to similar problem
Slow
Fast
Fragile
Sensitive
No Semantics
16Limitations of Search Techniques
- Return large result sets
- Return irrelevant results
- Return hard-to-interpret result sets
17The Find-Concept Approach
1. More effective search
2. Improved search terms
3. Understandable results
[Diagram labels: concept, Concrete query, Find-Concept, Source Code, NL-based Code Rep, Natural Language Information, Recommendations, Result Graph (Methods a to e)]
18Underlying Program Analysis
- Action-Oriented Identifier Graph (AOIG) (AOSD 06)
  - Provides access to NL information
  - Provides interface between NL and traditional
- Word Recommendation Algorithm
  - NL-based
    - Stemmed/Rooted: complete, completing
    - Synonym: finish, complete
  - Combining NL and Traditional
    - Co-location: completeWord()
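The three recommendation sources listed above (stem variants, synonyms, co-located words) can be sketched roughly as follows. This is a minimal illustration, not the Find-Concept implementation: the class name WordRecommender, the one-entry synonym table, and the naive suffix-stripping stem() are all assumptions standing in for a real thesaurus and stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class WordRecommender {
    // Hypothetical synonym table; a real tool would consult a thesaurus.
    private static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("complete", Arrays.asList("finish"));
    }

    // Naive suffix stripping, a stand-in for a real stemmer (assumption).
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s", "e"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Splits a camelCase identifier into lowercase words.
    static List<String> split(String identifier) {
        List<String> words = new ArrayList<>();
        for (String w : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            words.add(w.toLowerCase());
        }
        return words;
    }

    // Recommends synonyms, stem variants, and words co-located with the query
    // word in the same method name.
    static Set<String> recommend(String query, List<String> methodNames) {
        String q = query.toLowerCase();
        Set<String> out = new LinkedHashSet<>(
                SYNONYMS.getOrDefault(q, Collections.<String>emptyList()));
        for (String name : methodNames) {
            List<String> words = split(name);
            boolean related = false;
            for (String w : words) {
                if (stem(w).equals(stem(q))) {
                    related = true;
                }
            }
            if (related) {
                out.addAll(words); // stem variants plus co-located words
            }
        }
        out.remove(q); // never recommend the query word itself
        return out;
    }
}
```

For the query "complete" and the method names completeWord and completingTask (the second name is made up for illustration), the recommendations include "finish" (synonym), "completing" (stem variant), and "word" (co-located in completeWord()).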
19Experimental Evaluation
Find Concept, GES, ELex
- Research Questions
  - Which search tool is most effective at forming and executing a query for concern location?
  - Which search tool requires the least human effort to form an effective query?
- Methodology
  - 18 developers complete nine concern location tasks on medium-sized (>20KLOC) programs
- Measures
  - Precision (quality), Recall (completeness), F-Measure (combination of both P and R)
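As a rough illustration of the three measures, the sketch below computes precision, recall, and F-measure for one search result set. It assumes the balanced F-measure (harmonic mean of P and R), which the slide does not spell out; the class name SearchMetrics and the set-of-method-names representation are illustrative choices.

```java
import java.util.HashSet;
import java.util.Set;

final class SearchMetrics {
    // Precision: fraction of returned methods that are relevant (quality).
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / returned.size();
    }

    // Recall: fraction of relevant methods that were returned (completeness).
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / relevant.size();
    }

    // Balanced F-measure: harmonic mean of precision and recall (assumed form).
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```

If a tool returns four methods of which two are relevant, and three methods are relevant in total, precision is 2/4 and recall is 2/3.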
20Overall Results
Across all tasks
- Effectiveness
- FC > ELex with statistical significance
- FC > GES on 7/9 tasks
- FC is more consistent than GES
- Effort
- FC Elex GES
FC is more consistent and more effective in
experimental study without requiring more effort
21Natural Language Extraction from Source Code
What was Pete thinking when he wrote this code?
- Key Challenges
- Decode name usage
- Develop automatic extraction process
- Create NL-based program representation
Molly, the Maintainer
22Natural Language Which Clues to Use?
- Software Maintenance
- Typically focused on actions
- Objects are well-modularized
Maintenance Requests
23Natural Language Which Clues to Use?
- Software Maintenance
- Typically focused on actions
- Objects are well-modularized
- Focus on actions
- Correspond to verbs
- Verbs need Direct Object (DO)
24Extracting Verb-DO Pairs
Extraction from comments
class Player {
    /** Play a specified file with specified time interval */
    public static boolean play(final File file, final float fPosition, final long length) {
        fCurrent = file;
        try {
            playerImpl = null;
            // make sure to stop non-fading players
            stop(false);
            // Choose the player
            Class cPlayer = file.getTrack().getType().getPlayerImpl();
            ...
Extraction from method signatures
25Extracting Clues from Signatures
- POS Tag Method Name
- Chunk Method Name
- Identify Verb and Direct-Object (DO)
public UserList getUserListFromFile( String path ) throws IOException {
    try {
        File tmpFile = new File( path );
        return parseFile(tmpFile);
    } catch( java.io.IOException e ) {
        throw new IOException( "UserList format issue" + path + " file " + e );
    }
}
POS Tag
get<verb> User<adj> List<noun> From<prep> File<noun>
Chunk
get<verb phrase> User List<noun phrase> From File<prep phrase>
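The tag-chunk-identify steps above can be approximated for the simplest case, where the verb is the first word of the method name. The sketch below is a simplified stand-in, not the authors' extractor: it splits the camelCase name instead of running a POS tagger, and it ends the direct object at the first preposition found in a small, hypothetical word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class VerbDoExtractor {
    // Hypothetical, tiny preposition list used to end the direct-object phrase.
    private static final Set<String> PREPOSITIONS = new HashSet<>(
            Arrays.asList("from", "to", "in", "on", "with", "for", "by", "at", "of"));

    /** Returns {verb, directObject} for a verb-first method name; the DO may be empty. */
    static String[] extract(String methodName) {
        // Split at lowercase-to-uppercase boundaries: getUserList -> get, User, List.
        String[] words = methodName.split("(?<=[a-z0-9])(?=[A-Z])");
        String verb = words[0].toLowerCase();
        StringBuilder directObject = new StringBuilder();
        for (int i = 1; i < words.length; i++) {
            if (PREPOSITIONS.contains(words[i].toLowerCase())) {
                break; // the prepositional phrase is not part of the direct object
            }
            if (directObject.length() > 0) {
                directObject.append(' ');
            }
            directObject.append(words[i]);
        }
        return new String[] {verb, directObject.toString()};
    }
}
```

For getUserListFromFile this yields the verb "get" and the direct object "User List". A name such as mouseDragged() would be mishandled (it would report "mouse" as the verb), which is the kind of case that calls for additional extraction rules.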
26pic
Name: Zak Fry
Nickname: The Rookie
Current Position: Upcoming senior
Future Position: Graduate School
Stats:
Year   diet cokes/day   lab days/week
2006   1                2
2007   6                8
27Developing rules for extraction
- For many methods
  - Identify relevant verb (V) and direct object (DO) in method signature
  - Classify pattern of V and DO locations
  - If new pattern, create new extraction rule
[Diagram: example signatures with the verb and DO positions labeled]
28Our Current Extraction Rules
- 4 general rules with subcategories
URL parseURL()
void mouseDragged()
void Host.onSaved()
void message()
29Example Sub-Categories for Left-Verb General Rule
- Look beyond the method name
  - Parameters, Return type, Declaring class name, Type hierarchy
30Representing Verb-DO Pairs
- Action-Oriented Identifier Graph (AOIG)
[Diagram: verb nodes (verb1, verb2, verb3) and DO nodes (DO1, DO2, DO3) connect to verb-DO pair nodes (verb1, DO1), (verb1, DO2), (verb3, DO2), (verb2, DO3); "use" nodes link the pairs to source code files]
31Representing Verb-DO Pairs
- Action-Oriented Identifier Graph (AOIG)
[Diagram: verb nodes (play, add, remove) and DO nodes (file, playlist, listener) connect to verb-DO pair nodes (play, file), (play, playlist), (remove, playlist), (add, listener); "use" nodes link the pairs to source code files]
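One way to picture the AOIG as a data structure is a pair of indexes (by verb and by direct object) over verb-DO pairs, with each pair keeping the list of places where it is used. The sketch below is an assumption about representation, not the published implementation; the class name Aoig, its method names, and the use locations in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class Aoig {
    // Verb node -> verb-DO pair nodes, DO node -> pair nodes, pair node -> uses.
    private final Map<String, Set<String>> pairsByVerb = new TreeMap<>();
    private final Map<String, Set<String>> pairsByDo = new TreeMap<>();
    private final Map<String, List<String>> usesByPair = new TreeMap<>();

    // Records that the pair (verb, directObject) occurs at the given source location.
    void addUse(String verb, String directObject, String location) {
        String pair = verb + ", " + directObject;
        pairsByVerb.computeIfAbsent(verb, k -> new TreeSet<>()).add(pair);
        pairsByDo.computeIfAbsent(directObject, k -> new TreeSet<>()).add(pair);
        usesByPair.computeIfAbsent(pair, k -> new ArrayList<>()).add(location);
    }

    Set<String> pairsForVerb(String verb) {
        return pairsByVerb.getOrDefault(verb, Collections.<String>emptySet());
    }

    Set<String> pairsForDo(String directObject) {
        return pairsByDo.getOrDefault(directObject, Collections.<String>emptySet());
    }

    List<String> uses(String verb, String directObject) {
        return usesByPair.getOrDefault(verb + ", " + directObject,
                Collections.<String>emptyList());
    }
}
```

Populated with the pairs from the slide, for example addUse("play", "file", "Player.play") and addUse("remove", "playlist", "Playlist.remove") with made-up locations, a lookup by the verb "play" returns the pairs (play, file) and (play, playlist), and a lookup by the direct object "playlist" returns (play, playlist) and (remove, playlist).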
32Evaluation of Extraction Process
- Compare automatic vs ideal (human) extraction
- 300 methods from 6 medium open source programs
- Annotated by 3 Java developers
- Promising Results
  - Precision 57%
  - Recall 64%
- Context of Results
- Did not analyze trivial methods
- On average, at least verb OR direct object
obtained
33pic
Name: Emily Gibson Hill
Nickname: Batter on Deck
Current Position: 2nd year PhD Student
Future Position: PhD Candidate
Stats:
Year   cokes/day   meetings/week
2003   0.2         1
2007   2           5
34Program Exploration
Ongoing work
- Purpose: Expedite software maintenance and program comprehension
- Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort
35Dora the Program Explorer
Query
Dora
Relevant Neighborhood
Dora comes from exploradora, the Spanish word
for a female explorer.
36State of the Art in Exploration
- Structural (dependence, inheritance)
  - Slicing
  - Suade (Robillard 2005)
- Lexical (identifier names, comments)
  - Regular expressions: grep, Eclipse search
  - Information Retrieval: FindConcept, Google Eclipse Search (Poshyvanyk 2006)
37Motivating need for structural and lexical information
Example Scenario
- Program: JBidWatcher, an eBay auction sniping program
- Bug: User-triggered add auction event has no effect
- Task: Locate code related to add auction trigger
- Seed: DoAction() method, from prior knowledge
38Using only structural information
Looking for add auction trigger
- DoAction() has 38 callees, only 2/38 are relevant
- Locates locally relevant items, but many irrelevant
And what if you wanted to explore more than one edge away?
[Diagram: DoAction() with its callees marked as Relevant Methods or Irrelevant Methods]
39Using only lexical information
Looking for add auction trigger
- 50/1812 methods contain matches to addauction regular expression query
- Only 2/50 are relevant
- Locates globally relevant items, but many irrelevant
40Combining Structural & Lexical Information
Looking for add auction trigger
- Structural guides exploration from seed
- Lexical prunes irrelevant edges
[Diagram: Relevant Neighborhood]
41The Dora Approach
Prune irrelevant structural edges from seed
- Determine method relevance to query
- Calculate lexical-based relevance score
- Low-scored methods pruned from neighborhood
- Recursively explore
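The prune-and-recurse idea above can be sketched as a walk over call edges that follows only callees whose relevance score meets a threshold. This is an illustrative sketch, not Dora itself: the class NeighborhoodExplorer, the map-based call graph, and the precomputed score map are assumptions, and whether the threshold comparison is inclusive is a guess.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NeighborhoodExplorer {
    /** Returns the seed plus every method reached through callees scoring at or above the threshold. */
    static Set<String> explore(String seed, Map<String, List<String>> callees,
                               Map<String, Double> relevance, double threshold) {
        Set<String> neighborhood = new LinkedHashSet<>();
        neighborhood.add(seed);
        visit(seed, callees, relevance, threshold, neighborhood);
        return neighborhood;
    }

    private static void visit(String method, Map<String, List<String>> callees,
                              Map<String, Double> relevance, double threshold,
                              Set<String> neighborhood) {
        for (String callee : callees.getOrDefault(method, Collections.<String>emptyList())) {
            double score = relevance.getOrDefault(callee, 0.0);
            // Prune low-scored callees; the add() check also stops cycles.
            if (score >= threshold && neighborhood.add(callee)) {
                visit(callee, callees, relevance, threshold, neighborhood);
            }
        }
    }
}
```

Pruning before recursing keeps irrelevant callees from pulling in their own callees, so the neighborhood stays small even when exploring more than one edge away from the seed.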
42Calculating Relevance Score: Term Frequency
Query: add auction
- Score based on query term frequency of the method
6 query term occurrences
Only 2 occurrences
43Calculating Relevance Score: Location Weights
Query: add auction
- Weigh term frequency based on location
  - Method name more important than body
  - Method body statements normalized by length
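A location-weighted term-frequency score along these lines might look like the sketch below. The weights 0.7 and 0.3 are hypothetical values chosen only so that the method name counts for more than the body, and normalizing the name count by name length is an added assumption; the slides do not give the actual formula.

```java
import java.util.List;
import java.util.Set;

final class RelevanceScore {
    // Hypothetical weights: the method name counts for more than the body.
    static final double NAME_WEIGHT = 0.7;
    static final double BODY_WEIGHT = 0.3;

    // Number of words that match a query term (case-insensitive).
    static int count(List<String> words, Set<String> queryTerms) {
        int n = 0;
        for (String w : words) {
            if (queryTerms.contains(w.toLowerCase())) {
                n++;
            }
        }
        return n;
    }

    // Weighted sum of query term frequency in the name and in the body,
    // each normalized by the number of words at that location.
    static double score(Set<String> queryTerms, List<String> nameWords, List<String> bodyWords) {
        double nameTf = nameWords.isEmpty()
                ? 0.0 : (double) count(nameWords, queryTerms) / nameWords.size();
        double bodyTf = bodyWords.isEmpty()
                ? 0.0 : (double) count(bodyWords, queryTerms) / bodyWords.size();
        return NAME_WEIGHT * nameTf + BODY_WEIGHT * bodyTf;
    }
}
```

With lowercase query terms "add" and "auction", a method whose name contains a query term scores higher than one that only mentions the terms in its body, and every score falls between 0 and 1, so it can be compared against a fixed threshold.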
44Dora explores add auction trigger
- From DoAction() seed
- Correctly identified at 0.5 threshold
- DoAdd() (0.93)
- DoPasteFromClipboard() (0.60)
- With only one false positive
- DoSave() (0.52)
45Summary
- NL technology used
  - Synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG
- Evaluation indicates
  - Natural language information shows promise for improving software development tools
- Key to success
  - Accurate extraction of NL clues
46Our Current and Future Work
- Basic NL-based tools for software
- Abbreviation expander
- Program synonyms
- Determining relative importance of words
- Integrating information retrieval techniques
47Posed Questions for Discussion
- What open problems faced by software tool developers can be mitigated by NLPA?
- Under what circumstances is NLPA not useful?