Title: Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C
1 Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C
- K. Gondow (Titech, Japan)
- T. Suzuki (Elmic System Inc, Japan)
- H. Kawashima (JAIST, Japan)
2 Overview
- Problems
- Imprecision in C tools.
- High development cost of C tools.
- Our solution
- Binary-level lightweight data integration.
- As a testbed, DWARF2 is used to develop
- dxref and rxref: cross-referencers
- bscg: a call-graph extractor
3 Imprecision in C tools (1/3)
- e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'.
- Users must pick the right one from a candidate list.
- This is because GNU GLOBAL only partially analyzes source code in order to run very fast.
test.c:
int main (void)
{
  int foo;
  foo: goto foo;
}
Candidate list shown after clicking 'foo':
foo 3 test.c int foo;
foo 4 test.c foo: goto foo;
4 Imprecision in C tools (2/3)
- e.g., Murphy's study
- "An Empirical Study of Static Call Graph Extractors", by Murphy, et al., ICSE, 1996.
- It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."
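5 Imprecision in C tools (3/3)
- Quantitative results from mosaic, quoted from Murphy's paper.
[Figure: comparison of the calls extracted by cflow and Field, split into cflow ∩ Field, cflow - Field, and Field - cflow. The numbers are not preserved in this transcript.]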
6 Why imprecision? (1/2)
- Reason 1: many tools only partially parse source code, resulting in incomplete analysis.
- e.g., GNU GLOBAL, cxref, LXR, cscope, cflow, ...
- At first glance, full parsing seems to solve this problem, but...
7 Why imprecision? (2/2)
- Reason 2: C source code is difficult to fully analyze because of
- Compiler-specific extensions.
- e.g., asm for inline assembly code
- Ambiguous behaviors in the C standards.
- undefined, unspecified, implementation-defined.
- e.g., padding in a structure.
8 Compiler-specific extensions
- Essential in C and embedded software.
- e.g., asm is used to obtain a H/W error code.
- e.g., long long is used in C89's <stdio.h>
- Make it hard to analyze source code.
- Different compilers give them different semantics.
void page_fault_handler (uint32_t error)
{
  uint32_t cr2;
  asm volatile ("movl %%cr2,%0" : "=r" (cr2));
  ... /* IA-32 control register 2 */
}
9 Ambiguous behaviors in C (1/2)
- Intentional and essential to keep C compilers fast and simple.
- e.g., padding in a structure is an implementation-defined behavior.
- This makes pointer analysis hard.
- "Pointer analysis for programs with structures and casts", by Suan Hsi Yong, et al., PLDI'99.
10 Ambiguous behaviors in C (2/2)
struct S { char c; int *ip; } *p;
struct T { char c; int i; } t;
t.i = 0x1234;
p = (struct S *)&t;
printf ("%p\n", p->ip);
- Different padding on different platforms.
- To obtain precise dataflow, tools need to know the padding values of the compiler.
- But it is hard...
[Figure: memory layouts of struct S and struct T (c, padding, ip or i) on Solaris8 (32bit) and Solaris8 (64bit). Labels mark whether p->ip depends on t.i or not under each layout.]
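The padding question on this slide can be checked directly for a given compiler. The following is a minimal sketch, assuming a hosted C99 compiler. The helper names ranges_overlap, ip_overlaps_i, and print_layout are illustrative and do not come from the slides. It uses offsetof to test whether the bytes read through p->ip overlap the bytes of t.i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct S { char c; int *ip; };
struct T { char c; int i; };

/* Returns 1 if the byte ranges [a_off, a_off + a_len) and
   [b_off, b_off + b_len) share at least one byte, else 0. */
int ranges_overlap(size_t a_off, size_t a_len, size_t b_off, size_t b_len)
{
    assert(a_len > 0 && b_len > 0);
    return a_off < b_off + b_len && b_off < a_off + a_len;
}

/* Returns 1 if, with the compiler used to build this file, reading
   ((struct S *)&t)->ip would touch the bytes that hold t.i. */
int ip_overlaps_i(void)
{
    return ranges_overlap(offsetof(struct S, ip), sizeof(int *),
                          offsetof(struct T, i), sizeof(int));
}

/* Prints the layout facts that a dataflow tool would need to know. */
void print_layout(FILE *out)
{
    fprintf(out, "offsetof(struct S, ip) = %zu, sizeof(int *) = %zu\n",
            offsetof(struct S, ip), sizeof(int *));
    fprintf(out, "offsetof(struct T, i)  = %zu, sizeof(int)   = %zu\n",
            offsetof(struct T, i), sizeof(int));
    fprintf(out, "p->ip %s t.i\n",
            ip_overlaps_i() ? "overlaps" : "does not overlap");
}
```

Calling print_layout(stdout) reports the values for the current platform. With a common 32-bit layout (ip and i both at offset 4, 4 bytes each) the ranges overlap. With a common 64-bit layout (ip at offset 8, i at offset 4) they do not. These are typical values, not guaranteed ones.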
11 Possible solutions
- Modify compilers (e.g., GCC) to emit their analyzed internal data.
- Seemingly high development cost.
- Many compilers would have to be modified.
- Use the binary information in executables emitted by compilers.
- Relatively easy, although it lacks some information, e.g., statements.
12 Our solution and result
- Our solution
- Uses DWARF2 debugging information as binary information.
- Preliminary experiment
- Good results for our cross-referencers and call-graph extractor.
- Better precision, although
- false negatives increased somewhat.
- quantitative results are not yet obtained.
13 Demonstration
- Using DWARF2, we implemented
- two cross-referencers
- dxref: uses only DWARF2
- Sample output: dxref
- rxref: hybrid of dxref and GNU GLOBAL
- Sample output: dxref
- a static call-graph extractor
- bscg: uses DWARF2 and a disassembler.
- Sample outputs: fact, dxref, bash, bash
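14 DWARF2-XML
[Figure: C code is compiled into a binary (ELF/DWARF2) that contains text, data, symbol info., relocation info., and debug info. The debug information is extracted into the common format DWARF2-XML (data integration), which is used by the dxref and rxref cross-referencers and the bscg call-graph extractor.]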
15 How bscg works
(1) extract call instructions by disassembling text.
(2) convert addresses to symbols using DWARF2
e.g., "1234 call 5678" becomes "main call fact"
(3) trim call graphs according to options
(4) output graph topology in DOT of Graphviz
digraph G { main -> fact; fact -> fact; }
[Figure: the resulting graph with nodes main and fact. Link: usage]
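Steps (2) and (4) can be sketched as follows. This is only an illustration under stated assumptions, not bscg's actual code: the types func_range and call_site, the functions lookup_symbol and emit_dot, and the address ranges in the sample data are all hypothetical. Only the call "1234 call 5678" and the names main and fact come from the slide. The sketch treats the upper bound of a function's range as exclusive, as DW_AT_high_pc is in DWARF2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical function record, as it might be filled from DWARF2
   (DW_AT_name, DW_AT_low_pc, DW_AT_high_pc of a subprogram entry). */
struct func_range {
    const char *name;
    unsigned long low_pc;   /* first address of the function */
    unsigned long high_pc;  /* first address past the function */
};

/* One call instruction found by the disassembler. */
struct call_site {
    unsigned long insn_addr;    /* address of the call instruction */
    unsigned long target_addr;  /* address being called */
};

/* Made-up sample data: the call at 1234 lies in main and targets fact. */
static const struct func_range sample_funcs[] = {
    { "main", 1200, 1300 },
    { "fact", 5678, 5800 },
};
static const struct call_site sample_calls[] = {
    { 1234, 5678 },  /* main -> fact */
    { 5700, 5678 },  /* fact -> fact (recursion) */
    { 1250, 9999 },  /* unknown target: skipped */
};

/* Step (2): map an address to the function whose range contains it.
   Returns NULL if no known function covers the address. */
const char *lookup_symbol(const struct func_range *funcs, size_t n,
                          unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= funcs[i].low_pc && addr < funcs[i].high_pc)
            return funcs[i].name;
    return NULL;
}

/* Step (4): print resolvable calls as DOT edges.
   Returns the number of edges written. */
size_t emit_dot(FILE *out, const struct func_range *funcs, size_t nfuncs,
                const struct call_site *calls, size_t ncalls)
{
    size_t written = 0;
    assert(out != NULL);
    fprintf(out, "digraph G {\n");
    for (size_t i = 0; i < ncalls; i++) {
        const char *caller = lookup_symbol(funcs, nfuncs, calls[i].insn_addr);
        const char *callee = lookup_symbol(funcs, nfuncs, calls[i].target_addr);
        if (caller != NULL && callee != NULL) {
            fprintf(out, "  %s -> %s;\n", caller, callee);
            written++;
        }
    }
    fprintf(out, "}\n");
    return written;
}
```

Calling emit_dot(stdout, sample_funcs, 2, sample_calls, 3) writes the two edges main -> fact and fact -> fact, which matches the DOT text on the slide. A linear scan is enough for a sketch; a real tool would likely sort the ranges and use binary search.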
16 Advantages of bscg
- Advantages of binary-level DI (explained later).
- e.g., high applicability and few false positives.
- Can identify inlined functions.
- Can extract a call from asm ("call fact")
- Can exclude
- library functions, e.g., printf
- system calls, e.g., open, fork
- functions in runtime systems, e.g., _start, _fini
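The exclusion of library functions, system calls, and runtime functions can be pictured as a filter over call-graph edges. The slides do not say how bscg decides what to exclude, so the name list below is only a stand-in that reuses the slide's examples. The edge type and the functions is_excluded and trim_edges are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names taken from the slide's examples. A real tool would need a much
   longer list or a different criterion altogether. */
static const char *const excluded_names[] = {
    "printf", "open", "fork", "_start", "_fini",
};

struct edge {
    const char *caller;
    const char *callee;
};

/* Returns 1 if name is on the exclusion list, else 0. */
int is_excluded(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof excluded_names / sizeof excluded_names[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, excluded_names[i]) == 0)
            return 1;
    return 0;
}

/* Keeps only edges whose caller and callee are both allowed.
   Compacts the array in place and returns the new edge count. */
size_t trim_edges(struct edge *edges, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_excluded(edges[i].caller) && !is_excluded(edges[i].callee))
            edges[kept++] = edges[i];
    return kept;
}
```

Given the edges main -> fact, main -> printf, and _start -> main, trim_edges keeps only main -> fact.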
17 Disadvantages of bscg
- No support for macro calls, signals, function pointers, optimization.
- gprof-callgraph.pl can handle function pointers, since it uses dynamic information.
- Source-level tools (e.g., cflow) don't suffer from the optimization problem.
18 So, is bscg good?
- Yes! (not the best, of course)
- Not easy to compare.
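19 What is binary-level DI?
- Provides common formats by extracting information from binary code.
[Figure: source code (.c) is compiled into binary code (a.out). Source DI analyzes the source code and binary DI analyzes the binary code. Both produce common formats (DWARF2-XML in the binary case) that are used by tools.]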
20 Why binary-level DI?
- Many advantages
- High applicability
- Few false positives.
- More true positives for low-level info.
- Low development cost
- Can improve C tools' precision.
21 What is lightweight DI?
- Allows several common formats.
- To be practical! Perfect integration is hard.
[Figure: heavyweight DI compared with lightweight DI. DWARF2-XML appears as a common format.]
22 Summary
- Imprecision in C tools.
- Our solution
- Binary-level lightweight data integration.
- As a testbed, DWARF2 is used to develop
- dxref and rxref: cross-referencers
- bscg: a call-graph extractor
23 Future work
- Apply our technique to other tools
- e.g., memory profilers, slicers, test coverage tools, ...
- Develop new binary formats suitable for lower CASE tools.
- tool-information carrying code.
- cf. proof-carrying code, model-carrying code, schedule-carrying code.
25 Taxonomy of cross-referencers
- Source-level
- Partial parsing: GNU GLOBAL, LXR, ...
- Full parsing: Sapid, ACML
- Binary-level
- Symbol tables: Visual Studio .NET(?)
- Debug info.: dxref
- Hybrid: rxref
26 What is DWARF2?
- A binary format for debugging information.
- Primary target languages
- C, C++, Fortran, Modula2, Pascal.
- Includes
- types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...
27 DWARF2-XML
- Our common format in XML for DWARF2.
- A testbed of binary-level lightweight DI.
- Makes it easier to process DWARF2.
- cf. libdwarf
- About 15 times larger than DWARF2.
28 DWARF2-XML example
C code: int i; ...
<section name=".debug_info">
  <tag name="DW_TAG_lexical_block" offset="id27">
    <attribute name="DW_AT_low_pc" value="67328"/>
    <attribute name="DW_AT_high_pc" value="67356"/>
    ...
    <tag name="DW_TAG_variable" offset="id27">
      <attribute name="DW_AT_name" value="i"/>
      <attribute name="DW_AT_type" value_ref="id161">
      <attribute name="DW_AT_location">
        <description>DW_OP_fbreg -24</description></></></></>
    ...
  <tag name="DW_TAG_base_type" offset="id161">
    <attribute name="DW_AT_name" value="int"/>
    <attribute name="DW_AT_byte_size" value="4"/>
    <attribute name="DW_AT_encoding" value="5">
      <description>signed</description></></></>
Callouts on the slide:
- address range: DW_AT_low_pc and DW_AT_high_pc
- variable name: DW_AT_name
- ID/IDREF link: value_ref="id161" refers to offset="id161"
- offset to base ptr.: DW_OP_fbreg -24
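Two of the callouts above can be read as small computations. The sketch below is illustrative only: the type lexical_block and the functions pc_in_block and fbreg_address are hypothetical names, and the numbers come from the slide's example. It treats DW_AT_high_pc as the first address past the block and DW_OP_fbreg as a signed offset added to the frame base, as DWARF2 defines them.

```c
#include <assert.h>
#include <stdint.h>

/* Address range of a DW_TAG_lexical_block entry. */
struct lexical_block {
    uint64_t low_pc;   /* DW_AT_low_pc: first address in the block */
    uint64_t high_pc;  /* DW_AT_high_pc: first address past the block */
};

/* Values from the slide's example. */
static const struct lexical_block sample_block = { 67328, 67356 };

/* Returns 1 if pc lies inside the block, else 0. */
int pc_in_block(const struct lexical_block *b, uint64_t pc)
{
    assert(b->low_pc <= b->high_pc);
    return pc >= b->low_pc && pc < b->high_pc;
}

/* Address of a variable located by DW_OP_fbreg: frame base plus a
   signed offset. Unsigned wraparound makes negative offsets subtract. */
uint64_t fbreg_address(uint64_t frame_base, int64_t offset)
{
    return frame_base + (uint64_t)offset;
}
```

For the variable i in the example, a tool would first confirm that the current program counter falls inside the lexical block, then locate i at the frame base minus 24.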
29 DWARF2-XML file sizes
- About 15 times larger than DWARF2.
- Size increase is almost cancelled by gzip.
- Consumes much memory when using DOM.
- e.g., we cannot build a DOM tree for gdb in our environment.
- Tradeoff between memory consumption and low development cost.
program    source  a.out   .debug_  DWARF2-XML  compressed by gzip
x_debug.c  27KB    77KB    50KB     1.1MB       58KB
readelf.c  315KB   575KB   137KB    2.1MB       128KB
bash       1.2MB   2.9MB   705KB    16.3MB      815KB
gdb        12MB    21.5MB  14.4MB   276MB       14MB
gdb's LOC is about 400,000.
30 Execution speed
- bscg is slower than the other tools, but acceptable for practical use.
- 12000 lines in 8.8 sec.
- but too slow in the case of bash-2.03.
- bscg has a scalability problem due to the heavy overhead of the DOM library.
31 Why XML?
- Highly readable, portable, interoperable.
- plain text and self-descriptive.
- Powerful enough to describe complex structures and relations in programs.
- Nested tags and ID/IDREF links.
- DTD for checking XML documents.
- Flexibility to process semi-structured documents.
- Easy to query/display/modify.
- XML parsers, DOM/SAX, XPath.
- An XPath expression is much shorter than tedious tree traversal code.
32 Drawbacks in API integration
e.g., libdwarf
- Insufficient abstraction.
- The many and various data structures and access patterns make it hard to encapsulate them well in a fixed API.
- e.g., libdwarf's API for traversing a wide range of the data tree is poor (only dwarf_siblingof and dwarf_child are provided).
- High cost to implement the API in many languages.
- High cost to learn how to use the API.
33 False/true positives/negatives
- false positives
- tool's incorrect output.
- true positives
- tool's correct output.
- false negatives
- tool's incorrect silence.
- the tool should have produced output, but did not.
- true negatives
- tool's correct silence.
- the tool should not have produced output, and did not.
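These definitions can be applied to call-graph edges by comparing a tool's reported edges with the calls that really occur. The sketch below is a hypothetical illustration: the outcome type, the classify function, and both sample lists are assumptions, with edge names borrowed from the add5/apply example in these slides. It assumes neither list contains duplicates. True negatives are not counted, because the set of all edges that should not be reported is not enumerated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct outcome {
    size_t true_pos;   /* reported and actually occurs */
    size_t false_pos;  /* reported but does not occur */
    size_t false_neg;  /* occurs but was not reported */
};

/* Hypothetical ground truth and tool output for the add5/apply example. */
static const char *const actual_calls[] = { "main->apply", "apply->add5" };
static const char *const reported_calls[] = { "main->apply", "apply->f" };

/* Returns 1 if item is in the set of n strings, else 0. */
static int contains(const char *const *set, size_t n, const char *item)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], item) == 0)
            return 1;
    return 0;
}

/* Counts true positives, false positives, and false negatives. */
struct outcome classify(const char *const *reported, size_t nr,
                        const char *const *actual, size_t na)
{
    struct outcome o = { 0, 0, 0 };
    assert(reported != NULL && actual != NULL);
    for (size_t i = 0; i < nr; i++) {
        if (contains(actual, na, reported[i]))
            o.true_pos++;
        else
            o.false_pos++;
    }
    for (size_t i = 0; i < na; i++)
        if (!contains(reported, nr, actual[i]))
            o.false_neg++;
    return o;
}
```

For the sample lists, main->apply is a true positive, apply->f is a false positive, and the missing apply->add5 is a false negative.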
34 bscg's graph trimming options
35 Why lightweight DI?
- To be practical! Perfect integration is hard.
- Supported by the fact that most technologies have given up on perfect integration/definition.
- e.g., undefined behaviors in C.
- e.g., GNU BFD provides an API that integrates different binary formats.
- useful, but not perfect.
- cannot convert ELF/DWARF2 into Windows PE.
36 Why is function pointer analysis difficult in C?
- Pointer arithmetic and casting.
- e.g., (int (*)())(base + offset)
- Dynamic library
- e.g., handle = dlopen (libname, RTLD_LAZY); func = dlsym (handle, funcname); func ();
- Inline assembly code
- e.g., asm ("call foo")
37 CASE tools development cost
- Generally very high.
- individual parsers and analyzers.
- internal data is less interoperable and portable
- IBM Eclipse
- $40,000,000 (?)
38 E.g., function pointer
- cflow
- apply calls f (false positive)
- gprof-callgraph.pl
- apply calls add5 (true positive)
- Other tools (bscg)
- apply calls ? (false negative)
int add5 (int x) { return x + 5; }
int apply (int (*f)(int), int x) { return f (x); }
int main (void) { return apply (add5, 10); }
39 Our homepage
- http://www.sde.cs.titech.ac.jp/gondow/dwarf2-xml/
- DTD for DWARF2-XML
- Source code of readelf, dxref, rxref, bscg
- Some sample outputs