Title: Refactoring C to Safer C
1Refactoring C to Safer C
One (1) ACME Refactoring Tool
C (OBSCURITAS TERRIBILIS)
SAFER C (LINGUA SALVA)
- Bill McCloskey
- Eric Brewer
2Background
- C is a powerful language with many users and a huge base of legacy software
- But C is error-prone
- Buffer overflows
- Memory/pointer errors
- Concurrency errors/race conditions
- Lack of proper error-handling/cleanup
- Several ways to fix the problem for old code
3Solution 1
- Program analysis/transformation (Lint, Metal, Prefix, CQual, CCured)
- Continue to maintain old programs in C
- Occasionally use the tool to find bugs or generate safety checks
- Downsides
- Programs are still written in C, which is error-prone
- Tools often report many false positives, and the user must sort through them
4Solution 2
- Another possibility (Cyclone)
- Rewrite old programs in a safer language (Cyclone) that still resembles C
- All future changes are made to the new (Cyclone) code, so they are guaranteed safe
- Downsides
- It's difficult to rewrite an entire program, even if the new language is close to C
- As new checks are added to Cyclone (e.g., for data races), old code must be revised again
5Proposal
- The Cyclone approach has many benefits
- All new code is guaranteed safe
- Cyclone can check for several classes of bugs
- Why not use a tool to transform old code into a safer language automatically?
- This kind of transformation is called a refactoring
6Refactoring
- Refactorings are code-improvement transformations
- The output of a refactoring is readable code
- Idea: gradually transform C code into safe, readable code, with programmer intervention allowed at each step
- Use different refactorings to solve different problems: memory safety, concurrency, etc.
7Refactoring Example
Refactorings (stages) and their results
- Start: existing C program P0
- Eliminate buffer overflows: refactored program P1
- Guard against race conditions: refactored program P2
- Ensure clean-up after exceptions: final program P3
The refactored code is readable by the programmer.
8Difficulties
- A refactoring must output readable code
- But existing tools can't do that
- They start by running the C preprocessor, which destroys readability
- Macros are expanded
- Include files are merged
- Conditional code is inlined or eliminated
- For refactoring to work, the preprocessor problem must be solved
9Outline
- Replacing the C preprocessor
- ASTEC: an improved macro language
- Macroscope: a cpp-to-ASTEC translator
- Refactoring C code
- Asfact: a prototype refactoring tool
- Future work
10Cpp Lost in Translation
- Cpp operates at the token level
- Analysis tools have difficulty parsing such macros directly, so they expand them
- But expanding them destroys readability
- Cpp must be replaced with something that operates on entire syntax trees: ASTEC
#define ADD(x) x +
ADD(3) 4
expands to: 3 + 4
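A minimal C sketch of the token-level behaviour described on this slide, assuming the macro body ends in a `+`. The body `x +` is not a complete expression, so it cannot be parsed on its own; it only becomes valid C once cpp places it next to the `4`. The function names `add_three_and_four` and `check_expansion` are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Token-level macro: the body "x +" is an incomplete expression,
   so a parser cannot build a syntax tree for it in isolation. */
#define ADD(x) x +

/* Hypothetical helper: cpp rewrites "ADD(3) 4" to "3 + 4"
   before the compiler proper ever sees it. */
int add_three_and_four(void) {
    return ADD(3) 4;
}

/* Small self-check of the expansion. */
void check_expansion(void) {
    assert(add_three_and_four() == 7);
}
```

A tool that wants to keep `ADD` visible in its output has to reason about the macro and its call site together, which is the problem a syntax-tree-level macro language is meant to avoid.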
11ASTEC Examples
- Constants and expressions
- Inline functions
- Types (possibly parameterized)
- Also modules, decorators, conditional
compilation, include files
@macro CACHE_INDEX(int n) n2 1
@macro WORK(int when) begin() do_work(when) end()
@type LIST_VALUE() int
12ASTEC
- Main goal of ASTEC: enable analysis of macros in isolation, without expanding them
- ASTEC macros are complete units, so they can be parsed without expansion
- They also include type information, so they can be typechecked without expansion
- ASTEC supports the most common kinds of macros, but more may be added as necessary
13Outline
- Replacing the C preprocessor
- ASTEC: an improved macro language
- Macroscope: a cpp-to-ASTEC translator
- Refactoring C code
- Asfact: a prototype refactoring tool
- Future work
14Macroscope
- For ASTEC to be useful, we must be able to translate cpp constructs into ASTEC
- Macroscope is an automatic translation tool
- Example
#define ADD(x,y) x+y
ADD(3, 4)
becomes
@macro ADD(x, y) x+y
ADD(3, 4)
15Macroscope Algorithm
- Expand all macros in the program
- Keep a record of the tokens involved
- Parse the expanded code
- Find the post tokens in the syntax tree and try to synthesize a macro from them
- Extract cpp arguments as ASTEC arguments
pre: ADD(3, 4)
post: 3+4
16Macroscope Example
pre: ADD(3, 4)
post: 3+4
- Steps
- Expand macros
- Parse expanded code
- Identify arguments and do reverse substitution
- Identify macro body (least common ancestor of post tokens)
- Emit macro definition
Syntax tree after reverse substitution: bin op (+) with children x and y
@macro ADD(x, y) x+y
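The "identify macro body" step can be sketched in C as a least-common-ancestor search over a syntax tree with parent pointers. This is only an illustration under stated assumptions: `struct node`, `macro_body`, and `demo_finds_binop` are hypothetical names, not Macroscope's actual data structures or API, and the demo tree stands for the expanded statement `y = 3 + 4`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical syntax-tree node; not Macroscope's real representation. */
struct node {
    const char *label;    /* e.g. "=", "+", "3" */
    struct node *parent;  /* NULL at the root */
};

/* Number of edges from n up to the root. */
static int depth(const struct node *n) {
    int d = 0;
    while (n->parent) { n = n->parent; d++; }
    return d;
}

/* Least common ancestor of two nodes that live in the same tree. */
static struct node *lca2(struct node *a, struct node *b) {
    int da = depth(a), db = depth(b);
    while (da > db) { a = a->parent; da--; }
    while (db > da) { b = b->parent; db--; }
    while (a != b) { a = a->parent; b = b->parent; }
    return a;
}

/* Candidate macro body: the least common ancestor of all post tokens. */
struct node *macro_body(struct node **post, size_t count) {
    assert(post != NULL || count == 0);
    if (count == 0) return NULL;
    struct node *body = post[0];
    for (size_t i = 1; i < count; i++)
        body = lca2(body, post[i]);
    return body;
}

/* Demo: in "y = 3 + 4" the post tokens 3, +, 4 come from ADD(3, 4).
   Returns 1 if the body found is the "+" node, not the assignment. */
int demo_finds_binop(void) {
    struct node assign = { "=", NULL };
    struct node lhs    = { "y", &assign };
    struct node plus   = { "+", &assign };
    struct node three  = { "3", &plus };
    struct node four   = { "4", &plus };
    struct node *post[] = { &three, &plus, &four };
    (void)lhs;  /* part of the tree, but not a post token */
    return macro_body(post, 3) == &plus;
}
```

Taking the least common ancestor keeps the synthesized macro as small as possible: the surrounding assignment stays in the program text, and only the subtree that came from the expansion becomes the macro body.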
17Macroscope Examples
- Using a slightly more advanced algorithm
#define ADD(x) x +
ADD(3) 4
becomes
@macro ADD(x, y) x+y
ADD(3, 4)
#define FIELDS f.g.h
data.FIELDS
becomes
@macro FIELDS(a) a.f.g.h
FIELDS(data)
18Macroscope Results
- Some preliminary results
- All translated programs are semantically equivalent to their cpp counterparts
- Imperfect translations occur when Macroscope synthesizes a macro that is less general than the original
19Outline
- Replacing the C preprocessor
- ASTEC: an improved macro language
- Macroscope: a cpp-to-ASTEC translator
- Refactoring C code
- Asfact: a prototype refactoring tool
- Future work
20Asfact
- Refactors ASTEC code
- Generates readable output
- Understands ASTEC constructs
- Supports standard refactorings/analyses
- Search/rename for variables/functions/fields
- Add arguments to a function
- Also supports programmable refactorings
- Buffer overflows
21Asfact Buffer Overflows
- As a simple test case, we refactored gzip to eliminate a well-known strcpy overflow
- Example
- Although Asfact (currently) can only recognize fixed-size buffers, it still succeeded in most cases for gzip
Before: void main() { char buffer[80]; strcpy(buffer, some_data); }
After: void main() { char buffer[80]; strcpy_safe(buffer, 80, some_data); }
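The slide shows only the call site `strcpy_safe(buffer, 80, some_data)`, not the function itself. One possible definition is sketched below, assuming the second argument is the destination size in bytes and that truncation (rather than aborting) is the desired behaviour on overflow. The return convention is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical definition: copies src into dst without writing past
   dst_size bytes, and always NUL-terminates when dst_size > 0.
   Returns 0 if the whole string fit, -1 if it was truncated. */
int strcpy_safe(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return -1;                      /* no room even for the terminator */
    size_t n = strlen(src);
    if (n >= dst_size) {
        memcpy(dst, src, dst_size - 1); /* keep as much as fits */
        dst[dst_size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, n + 1);            /* includes the terminator */
    return 0;
}
```

Because the buffer in the example has a fixed size, a refactoring tool can fill in the size argument from the declaration alone, which matches the case the slide says Asfact currently recognizes.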
22Incremental Refactoring
- A more complex example requiring user guidance
struct buffer { char *data; unsigned int len; };
void foo(const char *text) { struct buffer *buf; strcat(buf->data, text); }
A refactoring tool can accept information from the user that len is the size of data. CCured would fatten data to include an extra, unnecessary length field.
23Incremental Refactoring
- Giving the user the ability to control the
refactoring can increase its power
struct buffer { char *data; unsigned int len; };
void foo(const char *text) { struct buffer *buf; strcat_safe(buf->data, buf->len, text); }
The resulting code is more efficient and cleaner. It is also more likely to be compatible with old libraries, since fewer data structure changes are necessary.
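A sketch of how the refactored call might behave, assuming `len` holds the total size of the storage that `data` points to and that `strcat_safe` truncates on overflow. The body of `strcat_safe` is hypothetical (the slides give only its call site), and `foo` takes `buf` as a parameter here, unlike on the slide, so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout from the slide: len is the size of the storage behind data. */
struct buffer {
    char *data;
    unsigned int len;
};

/* Hypothetical definition: appends as much of src as fits in dst_size
   bytes and keeps dst NUL-terminated. Returns 0 if everything fit,
   -1 on truncation or if dst has no terminator within dst_size. */
int strcat_safe(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL);
    const char *end = memchr(dst, '\0', dst_size);
    if (end == NULL)
        return -1;                       /* dst is not a valid string */
    size_t used = (size_t)(end - dst);
    size_t room = dst_size - used - 1;   /* bytes free before the terminator */
    size_t n = strlen(src);
    int truncated = n > room;
    if (truncated)
        n = room;
    memcpy(dst + used, src, n);
    dst[used + n] = '\0';
    return truncated ? -1 : 0;
}

/* Refactored foo: the existing len field supplies the bound,
   so the struct layout does not change. */
void foo(struct buffer *buf, const char *text) {
    strcat_safe(buf->data, buf->len, text);
}
```

The struct keeps its original two fields, which is the point of the slide: the user's hint that `len` bounds `data` lets the tool reuse existing state instead of adding a new length field.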
24Future Work
- Right now, Asfact converts ASTEC code into ASTEC code
- Better idea: increase the power of the language, via extensions, and refactor into this new language
- Example: add bounds-checked arrays to C, then refactor old-style arrays to the new form to increase safety
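The slides do not say what a bounds-checked array extension would look like. As a rough illustration only, the idea can be emulated in plain C with a struct that carries the length next to the pointer. Every name below (`struct checked_ints`, `checked_set`, `checked_get`) is hypothetical, and a real language extension would presumably hide these calls behind ordinary indexing syntax.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "checked array": a pointer that travels with its length. */
struct checked_ints {
    int *data;
    size_t len;
};

/* Stores value at index i. Returns 0 on success, -1 if i is out of bounds. */
int checked_set(struct checked_ints a, size_t i, int value) {
    if (i >= a.len)
        return -1;
    a.data[i] = value;
    return 0;
}

/* Reads index i into *out. Returns 0 on success, -1 if i is out of
   bounds, in which case *out is left untouched. */
int checked_get(struct checked_ints a, size_t i, int *out) {
    assert(out != NULL);
    if (i >= a.len)
        return -1;
    *out = a.data[i];
    return 0;
}
```

A refactoring toward such a form would replace each old-style array access with its checked equivalent, so an out-of-range index becomes a reported error instead of a silent memory corruption.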
25Conclusion
- Goal: to make existing C code safer via incremental refactoring
- We are now one step closer to the goal
- For later steps, many refactorings can be synthesized from existing analyses (e.g., CCured)
Pipeline: C, via Macroscope, to ASTEC; ASTEC, via Refactorings, to Safer C
26Extra