1
Machine-Code Analysis and Rewriting for Security
Applications
  • Thomas Reps (1, 2), Tim Teitelbaum (2), and David
    Melski (2)
  • (1) University of Wisconsin
  • (2) GrammaTech, Inc.

2
Objectives
  • Review of GT/UW machine-code projects
  • pre-IARPA support (2001-06)
  • years 1 and 2 (10/06-9/08)
  • year 3 option
  • years 4, 5, 6, and 7
  • Background for T.P.; reminder for C.L.
  • A plan for the future
  • In the past, we have been both ambitious and
    creative, but
  • hindered by
  • the overhead of having had 15 contracts, 6
    grants, and 2 fellowships
  • the lack of flexibility that comes from dealing
    with 23 different schedules, sets of milestones,
    deliverables, etc.
  • Discussion

Heilmeier questions (rephrased in the past tense)
Heilmeier questions (future tense)
3
What have we been trying to do?
  • Code-inspection tools for security analysts
  • Analysis tools for identifying bugs and security
    vulnerabilities
  • Platform for rewriting executables
  • All analyses/operations performed on machine code
  • stripped executables
  • source-code assist
  • when source code is available
  • to validate the techniques that are to be used
    when source code is not available

4
Why Machine Code?
  • Windows
  • Login process keeps a user's password in the heap
    after a successful login
  • To minimize data lifetime
  • clear buffer
  • call free()
  • But . . .
  • the compiler might optimize away the
    buffer-clearing code (useless-code elimination)

As written:         memset(buffer, '\0', len); free(buffer);
After optimization: free(buffer);
5
WYSINWYX: What You See Is Not What You eXecute
  • Computers do not execute source-code programs
  • They execute machine-code programs that are
    generated from source code
  • An issue for any verification/analysis method
  • theorem proving
  • model checking
  • dataflow analysis

6
What was the state of the art when the project
began?
  • Bug and vulnerability detection
  • Performed with a variety of reverse-engineering
    tools
  • disassemblers
  • debuggers
  • fuzzers
  • manual inspection
  • Who does it?
  • hackers, criminals, nation-states, and security
    companies
  • What are the limitations of those approaches?
  • labor intensive
  • miss many vulnerabilities (i.e., a high
    false-negative rate)
  • Rewriting of executables
  • Heuristic approaches
  • Unable to guarantee that correctness is
    preserved, even when original source is available

7
How did you advance the state of the art? What
milestones were achieved?
  • New techniques to analyze machine code
  • overcome obstacles such as the lack of
    symbol-table information
  • interpret memory accesses, resolve indirect
    calls, etc.
  • property checking
  • TSL: a platform for creating
  • multiple analysis components
  • for multiple tools
  • analyzing multiple languages
  • with
  • reduced programming effort
  • greater confidence in their correctness
  • Ground-truth intermediate representation (IR)
  • Accurate rewriting of executables

8
CodeSurfer/x86: Value Sets
9
CodeSurfer/x86: Inferred Types
10
CodeSurfer/x86: Targets of Indirect Function Calls
11
CodeSurfer/x86: Data Dependences
Forward Slice
12
Analysis of Indirect Calls
  • CodeSurfer/x86 analysis framework
  • Detailed modeling of dynamic-link libraries
    (DLLs)
  • runtime linking
  • aliasing and forwarding
  • Dataflow analysis
  • Case study: Nimda virus
  • Use of telltale system routines is obfuscated
  • indirect use of LoadLibrary() and
    GetProcAddress()
  • indirection through memory
  • Ability to resolve indirect calls

13
Device-Driver Analysis
  • Programming conventions are complicated
  • 85% of crashes in Windows are due to driver bugs
  • DDA/x86: an extension to CodeSurfer/x86
  • Balakrishnan and Reps, Analyzing stripped
    device-driver executables, TACAS 2008

14
PendedCompletedRequested Rule
A driver's dispatch routine should not return
STATUS_PENDING on an I/O Request Packet (IRP) if
it has called IoCompleteRequest on the IRP,
unless it has also called IoMarkIrpPending.
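
The rule can be read as a small finite-state property over the events seen along one path through a dispatch routine. The following is a minimal sketch in Python, not the encoding used by SLAM or DDA/x86; the event strings and the helper name `violates_rule` are assumptions made for illustration.

```python
# Hypothetical event names for a single IRP along one path through a
# dispatch routine; a real checker would extract these from the executable.
COMPLETE = "IoCompleteRequest"
MARK_PENDING = "IoMarkIrpPending"
RETURN_PENDING = "return STATUS_PENDING"


def violates_rule(trace):
    """Return True if the path breaks the rule stated above.

    A path is a violation when the routine returns STATUS_PENDING after
    calling IoCompleteRequest without also calling IoMarkIrpPending.
    """
    completed = False
    marked_pending = False
    for event in trace:
        if event == COMPLETE:
            completed = True
        elif event == MARK_PENDING:
            marked_pending = True
        elif event == RETURN_PENDING:
            return completed and not marked_pending
    return False  # the path never returned STATUS_PENDING
```

For example, `violates_rule([COMPLETE, RETURN_PENDING])` reports a violation, while the same path with a call to `IoMarkIrpPending` does not.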
15
SLAM Error Trace
DDA/x86 Error Trace
16
Results for PendedCompletedRequested Rule
  • A-locs from semi-naïve algorithm
  • With GMOD-based merge function
  • With cross-product automaton
17
DVT
  • Disassembly Validation Tool
  • Correlates compilation diagnostics with generated
    executables to construct ground-truth IR
  • obtains the real layout of instructions from
    the compiler/assembler/linker
  • Basis for sound rewriting of executables when
    source code is available
  • eliminates reliance on heuristics for code/data
    separation
  • Validation of correctness for disassemblers
  • provides ground truth for measuring disassembler
    accuracy
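
Given ground truth of this kind, one simple way to score a disassembler is to compare the set of instruction start addresses it reports against the set the toolchain actually emitted. The sketch below is only an illustration: the slides do not say which metric DVT uses, so precision and recall, the function name, and the sample addresses are all assumptions.

```python
def disassembly_accuracy(ground_truth, reported):
    """Compare reported instruction start addresses with ground truth.

    Returns (precision, recall):
      precision = fraction of reported addresses that are real instructions
      recall    = fraction of real instructions that were reported
    """
    truth = set(ground_truth)
    found = set(reported)
    correct = truth & found
    precision = len(correct) / len(found) if found else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical addresses: the disassembler mistakes data at 0x1006 for code
# and misses the real instruction at 0x1005.
truth = [0x1000, 0x1003, 0x1005, 0x100A]
reported = [0x1000, 0x1003, 0x1006, 0x100A]
precision, recall = disassembly_accuracy(truth, reported)
```

Here three of the four reported addresses are correct and three of the four real instructions are found, so both scores are 0.75.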

18
Accurate Rewriting of Executables
  • Melt, stir, refreeze
  • DVT-assisted construction of IR for an executable
  • Modification of the IR
  • Re-assembly and linking into a new executable
  • Case study: GCC compiler suite
  • Core executable: cc1.exe
  • 5 MB
  • After melt, stir, refreeze, the new executable
    behaves identically to original on GCC torture
    tests

19
A Question that Heilmeier Should Have Asked,
or: How does this address all of next year's
problems? (Hamming 1986)
  • ". . . in the early days, I was solving one
    problem after another after another . . . I was
    depressed. I could see life being a long sequence
    of one problem after another after another. After
    quite a while of thinking I decided, 'No, I
    should be in the business of mass production of
    a variable product. I should be concerned with
    all of next year's problems, not just the one in
    front of my face.'"
  • "You should do your job in such a fashion that
    others can build on top of it . . ."
  • "Instead of attacking isolated problems, I made
    the resolution that I would never again solve an
    isolated problem except as characteristic of a
    class."

20
A Question that Heilmeier Should Have Asked,
or: How does this address all of next year's
problems? (Hamming 1986)
  • ". . . the ability to generalize often means
    that the solution is simple. Often by stopping
    and observing, 'This is the problem he wants
    but this is characteristic of so and so . . . I
    can attack the whole class with a far superior
    method than the particular one because I was
    earlier embedded in needless detail.'"
  • "The business of abstraction frequently makes
    things simple. . . . Altering the problem . . .
    can make a great deal of difference . . . because
    you can either do it in such a fashion that
    people can indeed build on what you've done, or
    you can do it in such a fashion that the next
    person has to essentially duplicate again what
    you've done. . . . It's just as easy to do a
    broad, general job as one very special case."

21
A Question that Heilmeier Should Have Asked,
or: How does this address all of next year's
problems? (Hamming 1986)
  • Questions
  • What long-term leverage will be obtained via your
    approach?
  • How is your approach structured so that it could
    provide leverage for solving the next problem in
    this area?
  • Our answers
  • Our applications are instances of generic,
    language-independent technology
  • We have built an analyzer generator (TSL) that
    creates analysis components from a specification
    of an instruction set's operational semantics.
    Analyses created for, e.g., x86 are immediately
    obtained for all other languages for which one
    writes a TSL specification (e.g., PPC, ARM, MIPS,
    P-code, . . .)

22
Transformer Specification Language (TSL)
  • A platform for creating
  • multiple analysis components
  • for multiple tools
  • analyzing multiple languages
  • For instance,
  • CodeSurfer/x86, CodeSurfer/PPC, CodeSurfer/ARM,
    CodeSurfer/MIPS, ...
  • CodeSonar/x86, CodeSonar/PPC, CodeSonar/ARM,
    CodeSonar/MIPS, ...
  • Dash/x86, Dash/PPC, Dash/ARM, Dash/MIPS, ...

23
TSL Design Principles
[Diagram: Client Analyzer; N Analysis Components (Analysis 1, Analysis 2, ..., Analysis N); TSL Compiler; M Instruction-Set Specifications]
24
TSL Design Principles
Stays the same
[Diagram: Client Analyzer; N Analysis Components (Analysis 1, Analysis 2, ..., Analysis N); TSL Compiler; M Instruction-Set Specifications]
25
TSL Design Principles
Easily add an additional analysis in a
language-independent way
[Diagram: Client Analyzer; N Analysis Components (Analysis 1, Analysis 2, ..., Analysis N, Analysis N+1); TSL Compiler; M Instruction-Set Specifications]
26
TSL Leverage
TSL approach: redefine about 40 TSL operations for each analysis
Conventional approach: redefine more than 600 x86
instructions for each analysis
[Diagram: Client Analyzer; N Analysis Components (Analysis 1, Analysis 2, ..., Analysis N, Analysis N+1); TSL Compiler; M Instruction-Set Specifications]
27
TSL Leverage
[Diagram: Client Analyzer; N Analysis Components; M Instruction-Set Specifications]
28
TSL Leverage
[Diagram: Client Analyzer; N Analysis Components; TSL Compiler; M Instruction-Set Specifications]
  • Greatly reduced effort enables
  • building more ambitious tools
  • Separation of concerns provides
  • greater confidence in correctness
29
TSL Leverage vs. P-code Leverage
Each TSL-based analysis specification involves
  • 166 basetype-operators
  • most have 4 variants (8-, 16-, 32-, and 64-bit), so 166/4 ≈ 40
    operations
  • 2 map-operators (access/update) for each map-type
  • Total cost for N analyzers: O(M) + O(40N)
Each P-code-based analyzer might involve
  • 1 abstract transformer per P-code instruction
  • Total cost for N analyzers: O(M) + O(PN)
[Diagram: Client Analyzer; N Analysis Components; TSL Compiler; M Instruction-Set Specifications]
30
Program-Analysis Components
  • Analysis of memory contents (numeric values and
    addresses)
  • Variable-identification analysis
  • Affine-relation analysis (ARA)
  • Global-modification analysis (GMOD)
  • Live-flag analysis
  • Reaching-flag analysis
  • Available-register analysis

Symbolic-Execution Components
  • Formula generation (quantifier-free bit-vector
    arithmetic)
Execution Components
  • Emulate the specified processor by calling a
    version of interpInstr() that has been compiled
    into C
31
Case Study
Instruction-set specifier's work
  • x86 (2,700 lines of TSL)
  • 10-20 man-days to write the TSL specification
  • TSL generates about 40,000 lines of C
  • PowerPC32 (1,200 lines of TSL)
  • 4 man-days to write the TSL specification
Analysis developer's work
  • 166 basetype-operators
  • this number covers four operand sizes for each basic
    operation (Add8, Add16, Add32, Add64), so 166/4 ≈ 40
    operations
  • 2 map-operators (access/update) for each map-type
  • E.g., def-use analysis: about 1,000 lines of C (DUA
    abstract semantics); each basic operation just
    performs a union of its operands
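
The idea of writing an instruction's semantics once and reinterpreting its operators per analysis can be sketched as follows. This is a toy illustration in Python rather than TSL or the generated C; the class names, the three-operand `add` instruction, and the register names are assumptions, and only one basetype operator and the two map operators are shown.

```python
class Concrete32:
    """Concrete interpretation: 32-bit machine arithmetic."""

    def add(self, a, b):
        return (a + b) & 0xFFFFFFFF  # wrap around at 32 bits

    def access(self, state, reg):  # map operator: access
        return state[reg]

    def update(self, state, reg, value):  # map operator: update
        new_state = dict(state)
        new_state[reg] = value
        return new_state


class DefUse:
    """Abstract interpretation: a value is the set of registers it depends on."""

    def add(self, a, b):
        return a | b  # the basic operation is just a union of its operands

    def access(self, state, reg):
        return state[reg]

    def update(self, state, reg, value):
        new_state = dict(state)
        new_state[reg] = value
        return new_state


def interp_add(ops, state, dst, src1, src2):
    """Semantics of a hypothetical instruction dst := src1 + src2.

    Written once against the operator interface; passing a different
    `ops` object yields a different analysis of the same instruction.
    """
    result = ops.add(ops.access(state, src1), ops.access(state, src2))
    return ops.update(state, dst, result)


concrete = interp_add(
    Concrete32(), {"eax": 0xFFFFFFFF, "ebx": 1, "ecx": 0}, "ecx", "eax", "ebx"
)
deps = interp_add(
    DefUse(),
    {r: frozenset({r}) for r in ("eax", "ebx", "ecx")},
    "ecx", "eax", "ebx",
)
```

Under the concrete interpretation `ecx` wraps around to 0; under the def-use interpretation `ecx` maps to the set containing `eax` and `ebx`. Adding a new analysis means supplying another operator class, not revisiting each instruction.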
32
Leverage Provided by TSL
Hand-written
TSL-based
CodeSurfer/SWYXx86
10 days to write the x86 spec
1 man month to implement all analyses
33
Leverage Provided by TSL
Hand-written
TSL-based
Affine Relation Analysis
542 instruction instances
34
Leverage Provided by TSL
Hand-written
TSL-based
Affine Relation Analysis
Equivalent in 324 cases out of 542
TSL-generated transformers were more precise in
218 cases
Better!
35
Vulnerability Detection in Executables
  • CodeSonar/C
  • Successful commercial tool
  • Finds flaws in C code
  • Highly scalable (10s of MLOC)
  • CodeSonar/SWYXx86
  • Uses TSL (retargetable)
  • Scalable techniques of CodeSonar
  • Current prototype
  • Mixed source/executable analysis
  • More than 20 checks
  • Toy examples

36
Instruction-Set Architecture Language (ISAL)
  • Language to specify syntactic details of
    low-level languages
  • Connects syntax (machine-code/assembly) to
    semantics (TSL)
  • Inspired by SLED (Ramsey 1995) and TSL
  • ISAL infrastructure automatically generates
  • Instruction decoder, with translation to TSL
    abstract syntax tree
  • Pretty-printer
  • Parser for assembly code (TBD)
  • together with
  • A comprehensive test suite (for syntactic issues)
    that covers the language specified

37
Academic Milestones: Reps's publications on
static analysis, 2001-08
  • 47 conference papers
  • 9 journal papers
  • 7 invited papers
  • 5 book chapters
  • 1 edited book
  • 1 magazine article
  • 4 Ph.D. students graduated
  • 3 now GrammaTech employees
  • 2 best-paper awards
  • 2 conf. papers invited for special journal
    submission
  • Google Scholar: 1,541 citations of the 2001-08
    papers

38
What has been the transition strategy?
  • Worked with relevant parties in the intelligence
    community to inform them of, and let them try
    out, our new capabilities
  • Obtaining clearances for classified work (PISA)
  • Seeking classified contracts partnered with
    primes
  • attending classified bidders' meetings
  • pending proposal as sub to Lockheed-Martin
    ($1.7M)
  • pending whitepaper with AIS
  • other opportunities under discussion with BBN,
    LMCO, and Telcordia

39
What has been the transition strategy?
  • Several tantalizing opportunities would involve
    only a small amount of effort
  • Provide DVT information to Ghidra
  • Specify P-code semantics in TSL, which would make
    all TSL-based analyses available for all P-code
    applications

40
Cost
  • How long has it taken?
  • mid-2001 to present
  • How much did it cost?
  • GrammaTech: $5.2M
  • 15 different contracts
  • University of Wisconsin: $1.45M
  • 6 different grants
  • 2 graduate fellowships
  • Combined spending rate: $1M/year

41
[Timeline figure: columns FY01-FY06, FY07, FY08, FY09, FY10, FY11, FY12, FY13; rows list the following projects]
  • Automatic discovery of API-level exploits
  • Output-file format discovery
  • Library summarization
  • Device-driver analysis
  • IR recovery
  • TSL
  • DASH
  • CodeSurfer/SWYXx86,PPC32
  • CodeSurfer/x86
  • SWYX
  • CodeSonar/x86
  • CodeSonar/SWYX
  • ISAL
  • DVT
  • Accurate rewriting of executables
42
Objectives
  • Review of GT/UW machine-code projects
  • pre-IARPA support (2001-06)
  • years 1 and 2 (10/06-9/08)
  • year 3 option
  • years 4, 5, 6, and 7
  • Background for T.P.; reminder for C.L.
  • A plan for the future
  • In the past, we have been both ambitious and
    creative, but
  • hindered by
  • the overhead of having had 15 contracts, 6
    grants, and 2 fellowships
  • the lack of flexibility that comes from dealing
    with 23 different schedules, sets of milestones,
    deliverables, etc.
  • Discussion

Heilmeier questions (rephrased in the past tense)
Heilmeier questions (future tense)
43
What are we trying to do?
  • Integrate static, dynamic, and symbolic
    program-analysis methods, to
  • Automate detection of vulnerabilities in software
    executables
  • generate a list of likely vulnerabilities
  • provide help understanding/remedying reported
    vulnerabilities
  • Automate discovery of exploits in software
    executables
  • provide inputs to exploit the vulnerabilities
    reported
  • Automate generation of vulnerability-based
    signatures
  • multiple interposition points (e.g., network
    communication vs. library calls)
  • Provide a fisheye view on the code
  • concentrate analysis effort (magnification) on
    specific regions of interest
  • Develop similar analysis methods for concurrent
    programs
  • Develop semantics-based techniques to transform
    machine code
  • Guarantee that correctness is preserved
  • Apply to signature suppression
  • Mature the technology
  • Finish what we started
  • Connect what is easy to connect
  • Mine what we have in creative ways

44
What is the current state of the art?
  • Vulnerabilities, exploits, and signatures
  • How does this get done at present?
  • disassemblers
  • debuggers
  • fuzzers
  • manual inspection
  • Who does it?
  • hackers, criminals, nation-states, and security
    companies
  • What are the limitations of the present
    approaches?
  • labor intensive
  • miss many vulnerabilities (i.e., a high
    false-negative rate)

45
What is the current state of the art?
  • Fisheye view on the code
  • (concentrate analysis effort, i.e., magnification,
    on regions of interest)
  • How does this get done at present?
  • Not done (although demand algorithms have the
    right flavor)
  • Holy grail
  • What are the limitations of the present
    approaches?
  • A demand algorithm can generate as many demands as
    an exhaustive algorithm
  • Reps, T., Demand interprocedural program analysis
    using logic databases. Applications of Logic
    Databases, R. Ramakrishnan (ed.), Kluwer, 1994.
  • Horwitz, S., Reps, T., and Sagiv, M., Demand
    interprocedural dataflow analysis. Foundations of
    Softw. Eng., 1995.
  • Loss of precision (inherent to static analysis)
    can cause unnecessary demands to be generated

46
What is the current state of the art?
  • Develop analysis methods for concurrent programs
  • How does this get done at present?
  • model checking
  • take the product of the individual transition
    systems
  • symmetry reduction, partial-order reduction, etc.
    to simplify
  • check for reachability (safety) or Büchi
    acceptance (liveness)
  • testing
  • perturb the scheduler
  • Who does it?
  • verification researchers
  • What are the limitations of the present
    approaches?
  • scalability due to state-space explosion

47
What is the current state of the art?
  • Transformation of machine code
  • How does this get done at present?
  • infrastructures for modifying executables
  • (e.g., ATOM, EEL, DynInst, Etch, Vulcan)
  • Who does it?
  • hackers, criminals, nation-states, and security
    researchers
  • What are the limitations of the present
    approaches?
  • labor intensive
  • based on syntax, not semantics, so incorrect
    transformations may perturb behavior

48
Novelty/prospects for success
  • What is new about your approach?
  • Our methods work on machine code
  • addresses the WYSINWYX problem
  • We use (reinterpretations of) a formal semantics
    of an instruction set
  • Why do you think it can be successful at this
    time?
  • CodeSurfer/x86 IR-recovery work shows that one
    can make machine-code problems look very similar
    to source-code problems
  • allows good ideas from research on source code to
    be brought to bear on machine-code-analysis
    problems
  • Leverage obtained from TSL
  • operational semantics
  • reinterpretations of the operational semantics

49
Novelty/prospects for success
  • Vulnerabilities, exploits, signature generation
  • New approaches to detecting flaws in source code
    have been developed that are much more scalable
    and effective than previous approaches
  • Fisheye view
  • New goal-directed algorithms have been developed
    for source code that combine dynamic analysis
    with symbolic techniques
  • Analysis methods for concurrent programs
  • Context-bounded analysis
  • only check up to k context switches
  • allow arbitrary amount of work between context
    switches
  • algorithms from Reps's group during the past two
    years
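
Context-bounded analysis can be illustrated with a small explicit-state explorer. The sketch below is an assumption-laden toy in Python, not one of the algorithms from the group's papers: it enumerates schedules of two hypothetical threads that each perform a non-atomic increment of a shared variable `x`, allowing arbitrarily many steps inside a context but at most `max_switches` context switches.

```python
def reachable_final_states(threads, initial, max_switches):
    """Explore all schedules that use at most max_switches context switches.

    threads: list of step lists; each step maps a state dict to a new dict.
    Returns the final states (as sorted item tuples) of schedules in which
    every thread runs to completion. Moving to another thread after the
    current one finishes also counts as a switch in this toy model.
    """
    finals = set()
    seen = set()
    # Work item: (state, program counters, running thread, switches used).
    stack = [(initial, tuple(0 for _ in threads), None, 0)]
    while stack:
        state, pcs, current, used = stack.pop()
        key = (tuple(sorted(state.items())), pcs, current, used)
        if key in seen:
            continue
        seen.add(key)
        if all(pc == len(steps) for pc, steps in zip(pcs, threads)):
            finals.add(tuple(sorted(state.items())))
            continue
        for tid, steps in enumerate(threads):
            if pcs[tid] == len(steps):
                continue  # this thread has already finished
            cost = 0 if current in (None, tid) else 1
            if used + cost > max_switches:
                continue  # would exceed the context-switch bound
            new_state = steps[pcs[tid]](state)
            new_pcs = pcs[:tid] + (pcs[tid] + 1,) + pcs[tid + 1:]
            stack.append((new_state, new_pcs, tid, used + cost))
    return finals


def read(local):
    return lambda s: {**s, local: s["x"]}  # local := x


def write(local):
    return lambda s: {**s, "x": s[local] + 1}  # x := local + 1


thread_a = [read("a"), write("a")]
thread_b = [read("b"), write("b")]
initial = {"x": 0, "a": 0, "b": 0}


def final_x_values(max_switches):
    finals = reachable_final_states([thread_a, thread_b], initial, max_switches)
    return {dict(items)["x"] for items in finals}
```

With a bound of one switch only the sequential outcomes are reachable, so `final_x_values(1)` is `{2}`; raising the bound to two exposes the lost-update outcome, and `final_x_values(2)` is `{1, 2}`.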

50
Novelty/prospects for success
  • Semantics-based techniques to transform machine
    code
  • GrammaTech's key ingredients
  • DVT provides ground-truth IR (for instruction
    layout)
  • IR + TSL specification captures the program's
    semantics
  • makes semantics-preserving rewriting possible
  • use the semantics to ensure that the behavior is
    not changed
  • A more precise IR allows stronger transforms
  • E.g., code/data mixing
    [Figure panels: None; w/ Coarse IR; w/ Precise IR]

51
Novelty/prospects for success
  • Mature the technology
  • Finish what we started
  • raise technology readiness levels from 3, 4, 5 to
    5, 6, 7 (for different technologies)
  • Connect what is easy to connect
  • provide DVT information to Ghidra
  • specify P-code semantics in TSL
  • Mine what we have in creative ways
  • scalable, heuristics-based, variable and type
    recovery

52
If you succeed, what difference will it make?
  • It will provide the intelligence community with a
    decisive advantage in information warfare by
    allowing them to analyze and transform machine
    code at much lower cost and in less time than
    opponents
  • The capabilities have both defensive and
    offensive applications

53
Initial Milestones
  • Vulnerabilities
  • Determine whether using static analysis
    (CodeSonar/SWYXx86) to detect vulnerabilities
    in executables is as scalable as approaches that
    work on source code
  • Compare false-positive and false-negative rates
    with CodeSonar/C on an appropriate corpus of
    examples
  • Fisheye view on the code
  • Demonstrate a version of the DASH algorithm
    working on machine code

54
Initial Milestones
  • Mature the technology
  • Provide DVT information to Ghidra
  • Demonstrate capability to provide information
    obtained from TSL-based analyses in the Ghidra
    environment
  • stack-height analysis (using ARA)
  • high-level expressions for conditions
  • heuristic-based variable recovery (VSA and ASI)

55
Cost
  • How long will it take?
  • 5 years
  • How much will it cost?
  • University of Wisconsin: $2.8M
  • Reps + 5 RAs
  • GrammaTech: $8.4M
  • In the past, we have been both ambitious and
    creative,
  • but handicapped by
  • the overhead of having been funded by 15
    contracts, 6 grants, and 2 fellowships
  • the lack of flexibility that comes from dealing
    with 23 different schedules, sets of milestones,
    deliverables, etc.

56
What is your transition strategy?
  • Classified discussions with relevant parties in
    the intelligence community to understand problems
    and opportunities to contribute
  • GrammaTech recently granted TS facility clearance
  • 3 employees with SCI clearances
  • 4th employee undergoing background check for SCI
  • Reps's SCI clearance delayed until his return to
    US
  • 5-year ID/IQ classified contract imminent
  • Work closely with relevant parties in the
    intelligence community to
  • demonstrate the capabilities of our innovations
  • integrate with existing tools and approaches

57
Questions/Discussion
59
What difference did it make?
  • Malicious-code analysis
  • AWE project at MIT Lincoln Labs
  • Given a worm . . .
  • What are its target-discovery, propagation, and
    activation mechanisms?
  • What is its payload?
  • Uses CodeSurfer/x86 to recover the IR
  • Resolve indirect jumps and indirect calls
  • Find system calls
  • Find their arguments
  • Follow dependences backwards to find where their
    values come from
  • . . .

60
What difference did it make?
  • Vulnerability Reviews
  • Hidden Code
  • Easter eggs, backdoors, time bombs, . . .
  • Information leakage
  • Password leakage, hi/low interference, . . .
  • Bad Paths
  • Authentication circumvention, race conditions, . . .
  • Bug finding
  • TOCTOU race conditions
  • null-pointer dereference
  • buffer overrun