Title: Machine-Code Analysis and Rewriting for Security Applications
1 Machine-Code Analysis and Rewriting for Security Applications
- Thomas Reps (1,2), Tim Teitelbaum (2), and David Melski (2)
- (1) University of Wisconsin
- (2) GrammaTech, Inc.
2 Objectives
- Review of GT/UW machine-code projects
  - pre-IARPA support (2001-06)
  - years 1 & 2 (10/06-9/08)
  - year 3 option
  - years 4, 5, 6, and 7
  - Background for T.P.; reminder for C.L.
- A plan for the future
  - In the past, we have been both ambitious and creative, but hindered by
    - the overhead of having had 15 contracts, 6 grants, and 2 fellowships
    - the lack of flexibility that comes from dealing with 23 different schedules, sets of milestones, deliverables, etc.
- Discussion
  - Heilmeier questions (rephrased in the past tense)
  - Heilmeier questions (future tense)
3 What have we been trying to do?
- Code-inspection tools for security analysts
- Analysis tools for identifying bugs and security vulnerabilities
- Platform for rewriting executables
- All analyses/operations performed on machine code
  - stripped executables
  - source-code assist
    - when source code is available
    - to validate the techniques that are to be used when source code is not available
4 Why Machine Code?
- Windows
  - Login process keeps a user's password in the heap after a successful login
- To minimize data lifetime
  - clear buffer
  - call free()
- But . . .
  - the compiler might optimize away the buffer-clearing code (useless-code elimination; see the sketch below)

    Source code:        memset(buffer, '\0', len); free(buffer);
    After optimization: free(buffer);
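
The following is a minimal C sketch of the scenario above; the names (handle_login, read_password, authenticate, PW_MAX) are illustrative, not taken from the Windows login code. An optimizing compiler is entitled to delete the memset because the buffer is never read afterwards, which is exactly why the property must be checked on the machine code that ships.

    #include <stdlib.h>
    #include <string.h>

    #define PW_MAX 64                        /* illustrative buffer size */

    /* Hypothetical helpers standing in for the real login logic. */
    extern void read_password(char *buf, size_t len);
    extern int  authenticate(const char *pw);

    static void handle_login(void)
    {
        char *buffer = malloc(PW_MAX);
        if (buffer == NULL)
            return;

        read_password(buffer, PW_MAX);
        authenticate(buffer);

        /* Intended scrub: 'buffer' is never read after this point, so a
         * compiler performing useless-code (dead-store) elimination may
         * remove the memset, leaving the cleartext password in the heap. */
        memset(buffer, '\0', PW_MAX);
        free(buffer);
    }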
5 WYSINWYX: What You See Is Not What You eXecute
- Computers do not execute source-code programs
- They execute machine-code programs that are
generated from source code
- An issue for any verification/analysis method
- theorem proving
- model checking
- dataflow analysis
6 What was the state of the art when the project began?
- Bug and vulnerability detection
  - Performed with a variety of reverse-engineering tools
    - disassemblers
    - debuggers
    - fuzzers
    - manual inspection
  - Who does it?
    - hackers, criminals, nation-states, and security companies
  - What are the limitations of those approaches?
    - labor intensive
    - miss many vulnerabilities (i.e., a high false-negative rate)
- Rewriting of executables
  - Heuristic approaches
  - Unable to guarantee that correctness is preserved, even when original source is available
7 How did you advance the state of the art? What milestones were achieved?
- New techniques to analyze machine code
  - overcome obstacles such as the lack of symbol-table information
  - interpret memory accesses, resolve indirect calls, etc.
  - property checking
- TSL: a platform for creating
  - multiple analysis components
  - for multiple tools
  - analyzing multiple languages
  - with
    - reduced programming effort
    - greater confidence in their correctness
- Ground-truth intermediate representation (IR)
- Accurate rewriting of executables
8 CodeSurfer/x86: Value Sets
9 CodeSurfer/x86: Inferred Types
10 CodeSurfer/x86: Targets of Indirect Function Calls
11 CodeSurfer/x86: Data Dependences (Forward Slice)
12 Analysis of Indirect Calls
- CodeSurfer/x86 analysis framework
  - Detailed modeling of dynamically linked libraries (DLLs)
    - runtime linking
    - aliasing and forwarding
  - Dataflow analysis
- Case study: Nimda virus
  - Use of telltale system routines is obfuscated (see the sketch below)
    - indirect use of LoadLibrary() and GetProcAddress()
    - indirection through memory
  - Ability to resolve indirect calls
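
The fragment below is a minimal C sketch of that obfuscation pattern, written for illustration rather than taken from Nimda: the address of a telltale system routine is obtained at run time with LoadLibrary()/GetProcAddress(), parked in memory, and later invoked through an indirect call, so the routine never appears in the import table or as a direct call target. The routine looked up (CreateFileA) and the variable names are assumptions.

    #include <windows.h>

    typedef HANDLE (WINAPI *CreateFileA_t)(LPCSTR, DWORD, DWORD,
                                           LPSECURITY_ATTRIBUTES,
                                           DWORD, DWORD, HANDLE);

    static CreateFileA_t fn_table[1];    /* function pointer hidden in memory */

    static int resolve(void)
    {
        HMODULE k32 = LoadLibraryA("kernel32.dll");
        if (k32 == NULL)
            return 0;
        fn_table[0] = (CreateFileA_t)GetProcAddress(k32, "CreateFileA");
        return fn_table[0] != NULL;
    }

    static HANDLE open_target(const char *path)
    {
        /* Indirect call through memory: no import of, or direct call to,
         * CreateFileA appears in the binary.  Resolving this call site
         * requires tracking the value that resolve() stored in fn_table. */
        return fn_table[0](path, GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    }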
13 Device-Driver Analysis
- Programming conventions are complicated
- 85% of crashes in Windows are due to driver bugs
- DDA/x86: extension to CodeSurfer/x86
- Balakrishnan & Reps, "Analyzing stripped device-driver executables", TACAS 2008
14 PendedCompletedRequested Rule
A driver's dispatch routine should not return STATUS_PENDING on an I/O Request Packet (IRP) if it has called IoCompleteRequest on the IRP, unless it has also called IoMarkIrpPending.
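
For concreteness, the sketch below shows the rule in C (WDM style); the routine name, the device-specific helpers, and the decision logic are illustrative, not code from any real driver. A dispatch routine either completes the IRP and returns a non-pending status, or marks the IRP pending before queueing it and returning STATUS_PENDING.

    #include <ntddk.h>

    /* Hypothetical helpers standing in for driver-specific logic. */
    BOOLEAN CanCompleteImmediately(PDEVICE_OBJECT DeviceObject, PIRP Irp);
    VOID    QueueIrpForLaterProcessing(PDEVICE_OBJECT DeviceObject, PIRP Irp);

    NTSTATUS MyDispatchWrite(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        if (CanCompleteImmediately(DeviceObject, Irp)) {
            Irp->IoStatus.Status = STATUS_SUCCESS;
            Irp->IoStatus.Information = 0;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
            /* The IRP has been completed, so STATUS_PENDING must not be
             * returned on this path. */
            return STATUS_SUCCESS;
        }

        /* Pending path: mark the IRP pending before queueing it.  Returning
         * STATUS_PENDING after IoCompleteRequest without having called
         * IoMarkIrpPending is the violation the rule describes. */
        IoMarkIrpPending(Irp);
        QueueIrpForLaterProcessing(DeviceObject, Irp);
        return STATUS_PENDING;
    }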
15 Error Traces: SLAM vs. DDA/x86
[Screenshots of a SLAM error trace and the corresponding DDA/x86 error trace]
16 Results for the PendedCompletedRequested Rule
[Results table comparing three configurations: a-locs from the semi-naïve algorithm; with the GMOD-based merge function; with the cross-product automaton]
17 DVT
- Disassembly Validation Tool
- Correlates compilation diagnostics with generated executables to construct a ground-truth IR
  - obtains the real layout of instructions from the compiler/assembler/linker
- Basis for sound rewriting of executables when source code is available
  - eliminates reliance on heuristics for code/data separation
- Validation of correctness for disassemblers
  - provides ground truth for measuring disassembler accuracy
18 Accurate Rewriting of Executables
- Melt, stir, refreeze
  - DVT-assisted construction of an IR for an executable
  - Modification of the IR
  - Re-assembly and linking into a new executable
- Case study: GCC compiler suite
  - Core executable: cc1.exe (5 MB)
  - After melt, stir, refreeze, the new executable behaves identically to the original on the GCC torture tests
19 A Question that Heilmeier Should Have Asked, or: How does this address all of next year's problems? [Hamming 1986]
- ". . . in the early days, I was solving one problem after another after another . . . I was depressed. I could see life being a long sequence of one problem after another after another. After quite a while of thinking I decided, 'No, I should be in the business of mass production of a variable product. I should be concerned with all of next year's problems, not just the one in front of my face.'"
- "You should do your job in such a fashion that others can build on top of it . . ."
- "Instead of attacking isolated problems, I made the resolution that I would never again solve an isolated problem except as characteristic of a class."
20 A Question that Heilmeier Should Have Asked, or: How does this address all of next year's problems? [Hamming 1986]
- ". . . the ability to generalize often means that the solution is simple. Often by stopping and observing, 'This is the problem he wants, but this is characteristic of so and so . . .' I can attack the whole class with a far superior method than the particular one because I was earlier embedded in needless detail."
- "The business of abstraction frequently makes things simple. . . . Altering the problem . . . can make a great deal of difference . . . because you can either do it in such a fashion that people can indeed build on what you've done, or you can do it in such a fashion that the next person has to essentially duplicate again what you've done. . . . It's just as easy to do a broad, general job as one very special case."
21 A Question that Heilmeier Should Have Asked, or: How does this address all of next year's problems? [Hamming 1986]
- Questions
  - What long-term leverage will be obtained via your approach?
  - How is your approach structured so that it could provide leverage for solving the next problem in this area?
- Our answers
  - Our applications are instances of generic, language-independent technology
  - We have built an analyzer generator (TSL) that creates analysis components from a specification of an instruction set's operational semantics. Analyses created for, e.g., x86 are immediately obtained for all other languages for which one writes a TSL specification (e.g., PPC, ARM, MIPS, P-code, . . .)
22 Transformer Specification Language (TSL)
- A platform for creating
  - multiple analysis components
  - for multiple tools
  - analyzing multiple languages
- For instance,
  - CodeSurfer/x86, CodeSurfer/PPC, CodeSurfer/ARM, CodeSurfer/MIPS, ...
  - CodeSonar/x86, CodeSonar/PPC, CodeSonar/ARM, CodeSonar/MIPS, ...
  - Dash/x86, Dash/PPC, Dash/ARM, Dash/MIPS, ...
23 TSL Design Principles
[Diagram: M instruction-set specifications feed the TSL compiler, which generates N analysis components (Analysis1, Analysis2, ..., AnalysisN) used by a client analyzer]
24 TSL Design Principles
[Same diagram; annotation: "Stays the same"]
25 TSL Design Principles
[Same diagram, with an added AnalysisN+1 component; annotation: "Easily add an additional analysis in a language-independent way"]
26 TSL Leverage
[Same diagram; annotations: with TSL, redefine about 40 TSL operations for each analysis; the conventional approach redefines >600 x86 instructions for each analysis]
27 TSL Leverage
[Same diagram: client analyzer, N analysis components, M instruction-set specifications]
28 TSL Leverage
- Greatly reduced effort ⇒ enables building more ambitious tools
- Separation of concerns provides greater confidence in correctness
[Same diagram: client analyzer, N analysis components, TSL compiler, M instruction-set specifications]
29 TSL Leverage vs. P-code Leverage
- Each TSL-based analysis specification involves 166 basetype-operators
  - most have 4 variants (8-, 16-, 32-, 64-bit), so 166/4 ≈ 40 operations
  - plus 2 map-operators (access/update) for each map-type
  - total cost for N analyzers: O(M) + O(40N)
- Each P-code-based analyzer might involve 1 abstract transformer per P-code instruction
  - total cost for N analyzers: O(M) + O(PN)
[Same diagram: client analyzer, N analysis components, TSL compiler, M instruction-set specifications]
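
As a rough illustrative calculation using the figures from the slides above: supporting N = 5 analyses in the conventional per-instruction style for x86 means writing on the order of 5 × 600 = 3,000 abstract transformers, whereas the TSL route costs one instruction-set specification plus roughly 5 × 40 = 200 operator reinterpretations, and those same 200 reinterpretations are reused unchanged for every additional instruction set that gets a TSL specification.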
30 Program-Analysis Components
- Analysis of memory contents (numeric values and addresses)
- Variable-identification analysis
- Affine-relation analysis (ARA)
- Global-modification analysis (GMOD)
- Live-flag analysis
- Reaching-flag analysis
- Available-register analysis
Symbolic-Execution Components
- Formula generation (quantifier-free bit-vector arithmetic)
Execution Components
- Emulate the specified processor by calling a version of interpInstr() that has been compiled into C
31 Case Study
- Instruction-set specifier's work
  - x86 (2700 lines of TSL): 10-20 man-days to write the TSL specification; TSL generates about 40,000 lines of C
  - PowerPC32 (1200 lines of TSL): 4 man-days to write the TSL specification
- Analysis developer's work
  - 166 basetype-operators; this number covers four kinds of operand sizes for each basic operation (Add8, Add16, Add32, Add64), so 166/4 ≈ 40 operations
  - 2 map-operators (access/update) for each map-type
  - E.g., def-use analysis: about 1,000 lines of C (DUA abstract semantics); each basic operation just performs a union of its operands (see the sketch below)
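
To make the "union of its operands" remark concrete, here is a minimal C sketch of how one analysis might reinterpret a few basetype-operators; the type and function names (UseSet, dua_Add32, etc.) are illustrative, not TSL's actual generated interface.

    /* Illustrative def-use abstract domain: a value is represented by the
     * set of abstract locations it depends on (here, a tiny bit-set). */
    typedef struct { unsigned long bits; } UseSet;

    static UseSet useset_union(UseSet a, UseSet b)
    {
        UseSet r = { a.bits | b.bits };
        return r;
    }

    /* Reinterpretations of a few basetype-operators for this domain: every
     * arithmetic/logical operation has the same one-line definition, because
     * the result depends on whatever its operands depend on. */
    static UseSet dua_Add32(UseSet a, UseSet b) { return useset_union(a, b); }
    static UseSet dua_Sub32(UseSet a, UseSet b) { return useset_union(a, b); }
    static UseSet dua_Xor32(UseSet a, UseSet b) { return useset_union(a, b); }

Writing roughly 40 such operator definitions (plus the map access/update operators) yields a def-use analyzer for every instruction set that has a TSL specification, with no per-instruction code.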
32 Leverage Provided by TSL
[Chart comparing hand-written and TSL-based CodeSurfer/SWYXx86; annotations: 10 days to write the x86 spec, 1 man-month to implement all analyses]
33 Leverage Provided by TSL
[Chart comparing hand-written and TSL-based affine-relation analysis over 542 instruction instances]
34 Leverage Provided by TSL
- Affine-relation analysis: hand-written vs. TSL-based transformers, compared on 542 instruction instances
  - equivalent in 324 cases out of 542
  - TSL-generated transformers were more precise in 218 cases (better!)
35 Vulnerability Detection in Executables
- CodeSonar/C
  - Successful commercial tool
  - Finds flaws in C code
  - Highly scalable (10s of MLOC)
- CodeSonar/SWYXx86
  - Uses TSL (retargetable)
  - Scalable techniques of CodeSonar
  - Current prototype
    - Mixed source/executable analysis
    - More than 20 checks
    - Toy examples
36 Instruction-Set Architecture Language (ISAL)
- Language to specify syntactic details of low-level languages
- Connects syntax (machine code/assembly) to semantics (TSL)
- Inspired by SLED [Ramsey 1995] and TSL
- ISAL infrastructure automatically generates
  - Instruction decoder, with translation to TSL abstract syntax trees
  - Pretty-printer
  - Parser for assembly code (TBD)
  - together with a comprehensive test suite (for syntactic issues) that covers the language specified
37 Academic Milestones: Reps's publications on static analysis, 2001-08
- 47 conference papers
- 9 journal papers
- 7 invited papers
- 5 book chapters
- 1 edited book
- 1 magazine article
- 4 Ph.D. students graduated
  - 3 now GrammaTech employees
- 2 best-paper awards
- 2 conference papers invited for special journal submission
- Google Scholar: 1541 citations of the 2001-08 papers
38 What has been the transition strategy?
- Worked with relevant parties in the intelligence community to inform them of, and let them try out, our new capabilities
- Obtaining clearances for classified work (PISA)
- Seeking classified contracts partnered with primes
  - attending classified bidders' meetings
  - pending proposal as a sub to Lockheed-Martin ($1.7M)
  - pending whitepaper with AIS
  - other opportunities under discussion with BBN, LMCO, and Telcordia
39 What has been the transition strategy?
- Several tantalizing opportunities would involve only a small amount of effort
  - Provide DVT information to Ghidra
  - Specify P-code semantics in TSL ⇒ all TSL-based analyses would be available for all P-code applications
40 Cost
- How long has it taken?
  - mid-2001 to present
- How much did it cost?
  - GrammaTech: $5.2M
    - 15 different contracts
  - University of Wisconsin: $1.45M
    - 6 different grants
    - 2 graduate fellowships
  - Combined spending rate: $1M/year
41 Project Timeline (FY01-FY13)
[Timeline chart showing the spans of: automatic discovery of API-level exploits, output-file format discovery, library summarization, device-driver analysis, IR recovery, TSL, DASH, CodeSurfer/SWYX (x86, PPC32), CodeSurfer/x86, SWYX, CodeSonar/x86, CodeSonar/SWYX, ISAL, DVT, and accurate rewriting of executables]
42 Objectives
- Review of GT/UW machine-code projects
  - pre-IARPA support (2001-06)
  - years 1 & 2 (10/06-9/08)
  - year 3 option
  - years 4, 5, 6, and 7
  - Background for T.P.; reminder for C.L.
- A plan for the future
  - In the past, we have been both ambitious and creative, but hindered by
    - the overhead of having had 15 contracts, 6 grants, and 2 fellowships
    - the lack of flexibility that comes from dealing with 23 different schedules, sets of milestones, deliverables, etc.
- Discussion
  - Heilmeier questions (rephrased in the past tense)
  - Heilmeier questions (future tense)
43 What are we trying to do?
- Integrate static, dynamic, and symbolic program-analysis methods, to
  - Automate detection of vulnerabilities in software executables
    - generate a list of likely vulnerabilities
    - provide help understanding/remedying reported vulnerabilities
  - Automate discovery of exploits in software executables
    - provide inputs to exploit the vulnerabilities reported
  - Automate generation of vulnerability-based signatures
    - multiple interposition points, e.g., network communication vs. library calls
  - Provide a fisheye view on the code
    - concentrate analysis effort (magnification) on specific regions of interest
  - Develop similar analysis methods for concurrent programs
  - Develop semantics-based techniques to transform machine code
    - Guarantee that correctness is preserved
    - Apply to signature suppression
- Mature the technology
  - Finish what we started
  - Connect what is easy to connect
  - Mine what we have in creative ways
44 What is the current state of the art?
- Vulnerabilities, exploits, and signatures
  - How does this get done at present?
    - disassemblers
    - debuggers
    - fuzzers
    - manual inspection
  - Who does it?
    - hackers, criminals, nation-states, and security companies
  - What are the limitations of the present approaches?
    - labor intensive
    - miss many vulnerabilities (i.e., a high false-negative rate)
45 What is the current state of the art?
- Fisheye view on the code (concentrate analysis effort, i.e., "magnification", on regions of interest)
  - How does this get done at present?
    - Not done (although demand algorithms have the right flavor)
    - Holy grail
  - What are the limitations of the present approaches?
    - A demand algorithm can generate as many demands as an exhaustive algorithm
      - Reps, T., "Demand interprocedural program analysis using logic databases". In Applications of Logic Databases, R. Ramakrishnan (ed.), Kluwer, 1994.
      - Horwitz, S., Reps, T., and Sagiv, M., "Demand interprocedural dataflow analysis". Foundations of Softw. Eng., 1995.
    - Loss of precision (inherent to static analysis) can cause unnecessary demands to be generated
46 What is the current state of the art?
- Develop analysis methods for concurrent programs
  - How does this get done at present?
    - model checking
      - take the product of the individual transition systems (sketched below)
      - symmetry reduction, partial-order reduction, etc. to simplify
      - check for reachability (safety) or Büchi acceptance (liveness)
    - testing
      - perturb the scheduler
  - Who does it?
    - verification researchers
  - What are the limitations of the present approaches?
    - scalability, due to state-space explosion
47 What is the current state of the art?
- Transformation of machine code
  - How does this get done at present?
    - infrastructures for modifying executables (e.g., ATOM, EEL, DynInst, Etch, Vulcan)
  - Who does it?
    - hackers, criminals, nation-states, and security researchers
  - What are the limitations of the present approaches?
    - labor intensive
    - based on syntax, not semantics ⇒ incorrect transformations may perturb behavior
48 Novelty/prospects for success
- What is new about your approach?
  - Our methods work on machine code
    - addresses the WYSINWYX problem
  - We use (reinterpretations of) a formal semantics of an instruction set
- Why do you think it can be successful at this time?
  - The CodeSurfer/x86 IR-recovery work shows that one can make machine-code problems look very similar to source-code problems
    - allows good ideas from research on source code to be brought to bear on machine-code-analysis problems
  - Leverage obtained from TSL
    - operational semantics
    - reinterpretations of the operational semantics
49 Novelty/prospects for success
- Vulnerabilities, exploits, signature generation
  - New approaches to detecting flaws in source code have been developed that are much more scalable and effective than previous approaches
- Fisheye view
  - New goal-directed algorithms have been developed for source code that combine dynamic analysis with symbolic techniques
- Analysis methods for concurrent programs
  - Context-bounded analysis
    - only check up to k context switches
    - allow an arbitrary amount of work between context switches
    - algorithms from Reps's group during the past two years
50 Novelty/prospects for success
- Semantics-based techniques to transform machine code
  - GrammaTech's key ingredients
    - DVT provides a ground-truth IR (for instruction layout)
    - IR + TSL specification captures the program's semantics
      - makes semantics-preserving rewriting possible
      - use the semantics to ensure that the behavior is not changed
  - A more precise IR allows stronger transforms
    - E.g., code/data mixing: none with a coarse IR, possible with a precise IR
51 Novelty/prospects for success
- Mature the technology
  - Finish what we started
    - technology readiness levels 3, 4, 5 ⇒ 5, 6, 7 (for different technologies)
  - Connect what is easy to connect
    - provide DVT information to Ghidra
    - specify P-code semantics in TSL
  - Mine what we have in creative ways
    - scalable, heuristics-based variable and type recovery
52 If you succeed, what difference will it make?
- It will provide the intelligence community with a decisive advantage in information warfare by allowing them to analyze and transform machine code at much lower cost and in much less time than opponents
- The capabilities have both defensive and offensive applications
53 Initial Milestones
- Vulnerabilities
  - Determine whether using static analysis (CodeSonar/SWYXx86) to detect vulnerabilities in executables is as scalable as approaches that work on source code
  - Compare false-positive and false-negative rates with CodeSonar/C on an appropriate corpus of examples
- Fisheye view on the code
  - Demonstrate a version of the DASH algorithm working on machine code
54 Initial Milestones
- Mature the technology
  - Provide DVT information to Ghidra
  - Demonstrate capability to provide information obtained from TSL-based analyses in the Ghidra environment
    - stack-height analysis (using ARA)
    - high-level expressions for conditions
    - heuristic-based variable recovery (VSA and ASI)
55 Cost
- How long will it take?
  - 5 years
- How much will it cost?
  - University of Wisconsin: $2.8M (Reps + 5 RAs)
  - GrammaTech: $8.4M
- In the past, we have been both ambitious and creative, but handicapped by
  - the overhead of having been funded by 15 contracts, 6 grants, and 2 fellowships
  - the lack of flexibility that comes from dealing with 23 different schedules, sets of milestones, deliverables, etc.
56 What is your transition strategy?
- Classified discussions with relevant parties in the intelligence community to understand problems and opportunities to contribute
  - GrammaTech recently granted TS facility clearance
  - 3 employees with SCI clearances
  - 4th employee undergoing background check for SCI
  - Reps's SCI clearance delayed until his return to the US
  - 5-year ID/IQ classified contract imminent
- Work closely with relevant parties in the intelligence community to
  - demonstrate the capabilities of our innovations
  - integrate with existing tools and approaches
57 Questions/Discussion
59 What difference did it make?
- Malicious-code analysis
  - AWE project at MIT Lincoln Labs
  - Given a worm . . .
    - What are its target-discovery, propagation, and activation mechanisms?
    - What is its payload?
  - Uses CodeSurfer/x86 to recover the IR
    - Resolve indirect jumps and indirect calls
    - Find system calls
    - Find their arguments
    - Follow dependences backwards to find where their values come from
    - . . .
60 What difference did it make?
- Vulnerability reviews
  - Hidden code
    - Easter eggs, backdoors, time bombs, . . .
  - Information leakage
    - Password leakage, high/low interference, . . .
  - Bad paths
    - Authentication circumvention, race conditions, . . .
  - Bug finding
    - TOCTOU race conditions (sketched below)
    - null-pointer dereference
    - buffer overrun
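
As one concrete illustration of the last category, the sketch below shows the classic TOCTOU pattern in C (the function name and path are invented for the example): the program checks a file name with one system call and then opens the same name with another, leaving a window in which the file can be swapped for a symlink to something it should not touch.

    #include <fcntl.h>
    #include <unistd.h>

    /* Classic time-of-check-to-time-of-use (TOCTOU) race; illustrative
     * example only. */
    int write_report(const char *path)
    {
        /* Time of check: verify the *name* is writable by the real user. */
        if (access(path, W_OK) != 0)
            return -1;

        /* Window: between access() and open(), another process can replace
         * the file with a symlink to a protected file. */

        /* Time of use: open by name again; the check above may no longer
         * describe the object actually opened. */
        int fd = open(path, O_WRONLY | O_TRUNC);
        if (fd < 0)
            return -1;
        /* ... write report ... */
        close(fd);
        return 0;
    }

A common fix is to drop the by-name check and validate the opened descriptor instead (e.g., with fstat()).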