Title: SRE Basics
1SRE Basics
2In this Section
- We briefly cover the following topics
- Assembly code
- Virtual machine/Java bytecode
- Windows PE file format
3Assembly Code
4High Level Languages
- First, high level languages
- Ancient high level languages
- Basic --- little structure
- FORTRAN --- limited structure
- C --- structured language
- C was designed to deal with complexity
- OO languages take this one step further
- Above languages considered primitive today
5High Level Languages
- Object oriented (OO) languages
- Object groups code and data together
- Considered best way to handle complexity (at least for now)
- Important OO ideas include
- Encapsulation, inheritance, polymorphism
6High Level Languages
- Program must deal with code and data
- Data
- Variables, data structures, files, etc.
- Code
- Reverser must study control flow
- Conditionals, switches, loops, etc.
7High Level Languages
- High level languages --- different users want different things
- Goes back (at least) to C vs FORTRAN
- Today, major tradeoff is between simplicity and flexibility
- Simplicity --- easy to write short program to do exactly what you want (e.g., C)
- Flexibility --- language has it all (e.g., Java)
8High Level Languages
- Some languages compiled into native code
- exe is specific to the hardware
- C, C++, FORTRAN, etc.
- Other languages compiled into bytecode, which is interpreted by a virtual machine
- Java, C#
- Often possible to make compiled version
- For reverser, this distinction is far more important than OO or not
9Intro to Assembly
- At the lowest level, machine binary
- Assembly code lives between binary and high level languages
- When reversing native code, we must deal with assembly code
- Why assembly code?
- Why not reverse binary to, say, C?
10Intro to Assembly
- Reverser would like to deal with high level, but is stuck with low level
- Ideally, want to create mental link from low level to high level
- Easier for code written in C
- Harder for OO code, such as C++
- Why?
11Intro to Assembly
- Perhaps biggest difference at assembly level is dealing with data
- High level languages hide lots and lots of details on data manipulations
- For example, loading and storing
- Also, low level instructions are primitive
- Each instruction does not do very much
12Intro to Assembly
- Consider the following simple C program
- Simple, but far higher level than assembly code
int multiply(int x, int y) {
    int z;
    z = x * y;
    return z;
}
13Intro to Assembly
int multiply(int x, int y) {
    int z;
    z = x * y;
    return z;
}
- In assembly code
1. Store state before entering function
2. Allocate memory for z
3. Load x and y into registers
4. Multiply x by y and store result in register
5. Copy result back to memory for z (optional)
6. Restore state that was stored in step 1
7. Return z
14Intro to Assembly
- Why are things so complicated at low level?
- It's all about efficiency!
- Reading memory and storing are slow
- Typically no single asm instruction reads operands from memory, operates on them, and stores the result back
- But this is common in high level languages
15Intro to Assembly
- Registers --- local processor memory
- So don't have to read and write RAM
- Stack --- scratch paper (in RAM)
- Holds register values, local variables, function parameters and return values
- E.g., storage for z in multiply example
- Heap --- dynamic, variable-sized data
- Data section --- e.g., string constants
- Control flow --- high level "if" or "while" are much more complex at low level
16Registers
- Registers used in most instructions
- Specifics here deal with IA-32
- Intel Architecture, 32-bit
- Used in Wintel machines
- We use IA-32 notation
- AT&T notation also exists
- Eight 32-bit registers (next slide)
- All 8 start with E
- Also several system registers
17Registers
- EAX, EBX, EDX --- generic, used for int, Boolean, ..., memory operations
- ECX --- generic, used as counter
- ESI/EDI --- generic, source/destination pointers when copying memory
- SI = source index, DI = destination index
- EBP --- generic, stack base pointer
- Usually, stack position after return address
- ESP --- stack pointer
- Current stack frame lies between ESP and EBP
18Flags
- EFLAGS --- special register
- Status flags updated by various operations to record outcomes
- System flags too, but we don't care about them
- Flags are basic tool for conditionals
- For example, a TEST followed by a jump instruction
- TEST sets various flags, jump determines action to take, based on those flags
19Instruction Format
- Most instructions consist of
- Opcode --- the instruction
- One or two operands --- parameter(s)
- Operands (parameters) are data
- Operands come in 3 flavors
- Register name --- for example, EAX
- Immediate --- e.g., hard-coded constant
- Memory address --- enclosed in brackets
20Operand Examples
- EAX
- Read from (or write to) EAX register, depending on opcode
- 0x30004040
- Immediate --- number is embedded in code
- Usually a constant in high-level code
- [0x4000349e]
- This is a memory address
- Could be a global variable in high level code
21Basic Instructions
- We cover a few common instructions
- First we give general format
- Later, we give a few simple examples
- There are lots of assembly instructions
- But, most assembly code uses only a few
- About 14 assembly instructions account for more than 90% of all code
22Opcode Counts
- Typical opcode counts, normal code
23Opcode Counts
- Opcode counts, typical virus code
24Instructions
- We consider following operations
- Moving data
- Arithmetic
- Comparisons
- Conditional branches
- Function calls
25Moving Data
- MOV is the most popular opcode
- 2 operands, destination and source
- MOV DestOperand, SourceOperand
- Note the order
- Destination first, source second
26Arithmetic
- Six integer arithmetic operations
- ADD, SUB, MUL, DIV, IMUL, IDIV
- Many variations based on operands
- ADD Op1, Op2 --- add, store result in Op1
- SUB Op1, Op2 --- subtract Op2 from Op1, result --> Op1
- MUL Op --- multiply Op by EAX, result --> EDX:EAX
- DIV Op --- divide EDX:EAX by Op
- quotient --> EAX, remainder --> EDX
- IMUL, IDIV --- like MUL and DIV, but signed
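The EDX:EAX convention used by MUL and DIV can be sketched in Python. This is a minimal model of the unsigned 32-bit forms only; `mul32` and `div32` are hypothetical helper names introduced here for illustration, not part of any library.

```python
MASK32 = 0xFFFFFFFF

def mul32(eax, op):
    """Model unsigned MUL: EAX * op yields a 64-bit product split across EDX:EAX."""
    product = (eax & MASK32) * (op & MASK32)
    return (product >> 32) & MASK32, product & MASK32  # (EDX, EAX)

def div32(edx, eax, op):
    """Model unsigned DIV: EDX:EAX / op yields quotient (EAX) and remainder (EDX)."""
    op &= MASK32
    if op == 0:
        raise ZeroDivisionError("division by zero raises a divide error")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, op)
    if quotient > MASK32:
        raise OverflowError("quotient does not fit in EAX")
    return quotient, remainder  # (EAX, EDX)

# 0x80000000 * 4 overflows 32 bits, so the high half lands in EDX.
edx, eax = mul32(0x80000000, 4)
```

Dividing the same EDX:EAX pair by 4 with `div32(edx, eax, 4)` gives back the original 0x80000000 with a zero remainder.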
27Comparisons
- CMP opcode has 2 operands
- CMP Operand1, Operand2
- Subtracts Operand2 from Operand1
- Result itself is discarded, but outcome is recorded in flag bits
- If result is 0 then ZF flag is set
- Other flags can be used to tell which is greater, depending on signed or unsigned
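How CMP sets the flags, and why signed and unsigned comparisons read different flags, can be sketched in Python. This is a simplified model covering only ZF, CF, SF and OF for 32-bit operands; the function names are hypothetical and chosen for illustration.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(op1, op2):
    """Return the ZF, CF, SF and OF values that CMP op1, op2 would produce."""
    a, b = op1 & MASK32, op2 & MASK32
    result = (a - b) & MASK32          # the subtraction result is not kept
    zf = result == 0                   # operands were equal
    cf = a < b                         # unsigned borrow
    sf = bool(result >> 31)            # sign bit of the result
    # Signed overflow: operands have different signs and the result's
    # sign differs from the sign of the first operand.
    of = bool(((a ^ b) & (a ^ result)) >> 31)
    return {"ZF": zf, "CF": cf, "SF": sf, "OF": of}

def unsigned_below(flags):
    """Condition tested by JB: op1 < op2 as unsigned values."""
    return flags["CF"]

def signed_less(flags):
    """Condition tested by JL: op1 < op2 as signed values."""
    return flags["SF"] != flags["OF"]

# 1 versus 0xFFFFFFFF: below when unsigned, but not less when signed (1 vs -1).
flags = cmp_flags(1, 0xFFFFFFFF)
```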
28Conditional Branches
- Conditional branches use Jcc family of instructions (je, jne, jz, jnz, etc.)
- Format is
- Jcc TargetAddress
- If Jcc true, goto TargetAddress
- Otherwise, what happens?
29Function Calls
- Use CALL and RET
- CALL FunctionAddress
- CALL pushes return address onto the stack, then jumps to FunctionAddress
- RET pops return address
- RET can be told to increment ESP
- Need to reset stack pointer
- Why?
30Examples
cmp ebx,0xf020
jnz 10026509
- What does this do?
- Compares value in EBX with constant
- Jumps to specified address if operands are not same
- Note JNE and JNZ are same instruction
31Examples
mov edi,[ecx+0x5b0]
mov ebx,[ecx+0x5b4]
imul edi,ebx
- What does this do?
- First, add 0x5b0 to ECX register, get value at that memory and put in EDI
- Next, add 0x5b4 to ECX, get value at that memory and put in EBX
- Note that ECX points to some data structure
- Finally, EDI = EDI * EBX
- Note there are different forms of IMUL
32Examples
push eax
push edi
push ebx
push esi
push dword ptr [esp+0x24]
call 0x10026eeb
- What does this do?
- PUSH four register values
- PUSH something related to stack ptr
- Probably, parameter or local variable
- Would need to look at more code to decide
- Note dword ptr is effectively a cast
- CALL a function
33Examples
mov eax, dword ptr [ebp - 0x20]
shl eax, 4
mov ecx, dword ptr [ebp - 0x24]
cmp dword ptr [eax+ecx+4], 0
call 0x10026eeb
- What does this do?
- Maybe data structure in an array
- Last line
- ECX --- gets base pointer
- EAX --- current offset into the array
- Add 4 to get specific member of structure
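The address arithmetic in this example can be sketched in Python. It assumes the reading suggested above: the array elements are 16 bytes each (shifting left by 4 multiplies by 16) and the field of interest sits at offset 4 inside each element. The function name and the sample base address are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def element_field_address(base, index, elem_shift=4, field_offset=4):
    """Address of a field inside element `index` of an array of structs.

    base         -- array base pointer (ECX in the example)
    index        -- element index before scaling (EAX before the SHL)
    elem_shift   -- log2 of the element size; shl eax, 4 means 16-byte elements
    field_offset -- offset of the member inside each element
    """
    offset = (index << elem_shift) & MASK32          # shl eax, 4
    return (base + offset + field_offset) & MASK32   # [eax+ecx+4]

# Element 3 of an array at the hypothetical address 0x1000.
address = element_field_address(0x1000, 3)
```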
34Examples
pushl $14
pushl $helloWorld
pushl $1
movl $4, %eax
pushl %eax
int $0x80
addl $16, %esp
pushl $0
movl $1, %eax
pushl %eax
int $0x80
35Compilation
- Converts high level representation of code to binary
- Front end --- lexical analysis
- Verify syntax, etc.
- Intermediate representation
- Optimization
- Improve structure, eliminate redundancy, etc.
36Compilation
- Back end --- generates the actual code
- Instruction selection
- Register allocation
- Instruction scheduling --- pipelining, parallelism
- Back end process might make disassembly hard to read
- Optimization too
- Each compiler has its own quirks
- Can you automatically determine compiler?
37Virtual Machines & Bytecode
38Virtual Machines
- Some languages instead generate intermediate bytecode
- Bytecode runs in a virtual machine
- Virtual machine is a program that (historically) interprets bytecode
- Translates bytecode for the hardware
- Bytecode analogous to assembly code
39Virtual Machines
- Advantages?
- Hardware independent
- Disadvantages?
- Slow
- Today, usually just-in-time compilers instead of interpreters
- Compile snippets of bytecode into native code as needed
40Reversing Bytecode
- Reversing bytecode is easy
- Unless special precautions are taken
- Even then, easier than native code
- Bytecode usually contains lots of metadata
- Possible to reconstruct highly accurate high level language
- Bytecode can be obfuscated
- In worst case, reverser must learn bytecode
- But bytecode is easier than native code
41Windows PE Files
42Windows PE File Format
- Designed to be standard executable file format for all versions of OS...
- ...on all supported processors
- Only small changes since PE format was introduced
- E.g., support for 64-bit Windows
43Windows PE Files
- Trivia
- Q: What's the difference between exe and dll?
- A: Not much --- one bit differs in PE files
- Q: What is size of smallest possible PE file?
- A: 133 bytes
- PE file on disk is a file
- Once loaded into memory, it's a module
- File is mapped to module
- Address where module begins is HMODULE
- PE file may not all be mapped to module
44Windows PE Files
- WINNT.H is final word on what PE file looks like
- Tools to examine PE files
- Dumpbin (Visual Studio)
- Depends
- PE Browse Professional
- In spite of its name, it's free
- PEDUMP (by author of article)
45PE File Sections
- Each section is chunk of code or data that logically belongs together
- For example, all import tables in one section
- Code is in .text section
- Code is code, but many types of data
- Data examples
- Program data (e.g., .rdata for read-only)
- API import/export tables
- Resources, relocation info, etc.
- Can specify section names in C source
46PE File Sections
- When mapped, module starts on a page boundary
- Linker can be told to merge sections
- E.g., to merge .text and .rdata
- /MERGE:.rdata=.text
- Some sections commonly merged
- Some sections cannot be merged
47Relative Virtual Addresses
- Exe file specifies in-memory addresses
- PE file specifies preferred load location
- But DLL can actually load just about anywhere
- So, PE specifies addresses in a way that is independent of where it loads
- No hardcoded addresses in PE
- Instead, Relative Virtual Addresses (RVAs)
- RVA is an offset relative to where PE is loaded
48Relative Virtual Addresses
- To find actual memory location, add RVA to the actual load address
- For example, suppose
- Exe file is loaded at 0x400000
- And RVA is 0x1000
- Then code (.text) starts at 0x401000
- In Windows terminology, actual address is known as Virtual Address (VA)
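The RVA arithmetic is simple enough to sketch in a few lines of Python. The helper names `rva_to_va` and `va_to_rva` are hypothetical, introduced only to mirror the example above.

```python
def rva_to_va(load_address, rva):
    """Virtual Address = actual load address of the module + RVA."""
    return load_address + rva

def va_to_rva(load_address, va):
    """Inverse mapping: recover the load-independent offset from a VA."""
    if va < load_address:
        raise ValueError("VA lies below the module's load address")
    return va - load_address

# Values from the example: exe loaded at 0x400000, .text at RVA 0x1000.
text_va = rva_to_va(0x400000, 0x1000)
```

The same RVA yields a different VA if the module loads elsewhere, which is why the file stores RVAs rather than final addresses.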
49Data Directory
- There are many data structures within exe
- For efficiency, must be loaded quickly
- E.g., imports, exports, resources, base relocations, etc.
- DataDirectory
- Array of 16 data structures
- #define IMAGE_DIRECTORY_ENTRY_xxx defines array indexes (0 to 15)
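As a rough illustration, the DataDirectory can be modeled in Python as 16 pairs of 32-bit values (RVA, size), indexed by constants that mirror the IMAGE_DIRECTORY_ENTRY_xxx definitions in WINNT.H. The class and function names below are hypothetical, and the input bytes are synthetic rather than taken from a real file.

```python
import struct
from enum import IntEnum

class DirectoryEntry(IntEnum):
    """Array indexes, named after the IMAGE_DIRECTORY_ENTRY_xxx defines."""
    EXPORT = 0
    IMPORT = 1
    RESOURCE = 2
    EXCEPTION = 3
    SECURITY = 4
    BASERELOC = 5
    DEBUG = 6
    ARCHITECTURE = 7
    GLOBALPTR = 8
    TLS = 9
    LOAD_CONFIG = 10
    BOUND_IMPORT = 11
    IAT = 12
    DELAY_IMPORT = 13
    COM_DESCRIPTOR = 14   # index 15 is reserved

def parse_data_directory(raw):
    """Split 16 * 8 bytes into a list of (rva, size) tuples, one per slot."""
    if len(raw) != 16 * 8:
        raise ValueError("expected 16 entries of 8 bytes each")
    return list(struct.iter_unpack("<II", raw))

# Synthetic directory with only the import slot filled in.
raw = bytearray(16 * 8)
struct.pack_into("<II", raw, DirectoryEntry.IMPORT * 8, 0x2000, 0x50)
directory = parse_data_directory(bytes(raw))
import_rva, import_size = directory[DirectoryEntry.IMPORT]
```

A slot whose RVA and size are both zero means that the corresponding structure is absent.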
50Importing Functions
- To use code or data from another DLL, must import it
- When PE file loads, Windows loader locates imported functions/data
- Usually automatic, when program first starts
- Imported DLLs may import others
- For example, any program created with Visual C++ imports KERNEL32.DLL...
- ...and KERNEL32.DLL imports from NTDLL.DLL
51Importing Functions
- Each PE has Import Address Table (IAT)
- IAT contains arrays of function pointers
- One array per imported DLL
- Each imported API has spot in IAT
- The only place where API address stored
- So, all calls to API go thru one function ptr
- E.g., CALL DWORD PTR [0x00405030]
- But, by default it's a little more complex
52PE File Structure
- Next slides describe PE file structure
- Note that all of these data structures defined in WINNT.H
- Usually, 32-bit and 64-bit versions
- For example,
- IMAGE_NT_HEADERS32
- IMAGE_NT_HEADERS64
- Identical except for widened fields for 64-bit
53MS-DOS Header
- Every PE begins with small MS-DOS exe
- Prints message saying Windows required
- MS-DOS Header
- IMAGE_DOS_HEADER
- 2 important values
- e_lfanew --- file offset of PE header
- e_magic --- 0x5A4D, "MZ" in ASCII. Why MZ?
54IMAGE_NT_HEADERS Header
- Primary location for PE specifics
- Location in file given by e_lfanew
- One version for 32-bit exes and another for 64-bit exes
- Only minor differences between them
- Single bit specifies 32-bit or 64-bit
55IMAGE_NT_HEADERS Header
- Has 3 fields
typedef struct _IMAGE_NT_HEADERS {
    DWORD Signature;
    IMAGE_FILE_HEADER FileHeader;
    IMAGE_OPTIONAL_HEADER32 OptionalHeader;
} IMAGE_NT_HEADERS32, *PIMAGE_NT_HEADERS32;
- In valid PE, Signature is 0x00004550
- In ASCII, this is "PE\0\0"
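Locating the PE header from the MS-DOS header can be sketched in Python with the standard `struct` module. The sketch relies on one detail not stated on the slides: e_lfanew is the last field of IMAGE_DOS_HEADER, at file offset 0x3C. The function name is hypothetical and the input is a synthetic buffer, not a real executable.

```python
import struct

DOS_MAGIC = 0x5A4D         # "MZ"
PE_SIGNATURE = 0x00004550  # "PE\0\0"
E_LFANEW_OFFSET = 0x3C     # assumption: position of e_lfanew in IMAGE_DOS_HEADER

def find_pe_header(data):
    """Return the file offset of IMAGE_NT_HEADERS after checking both signatures."""
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != DOS_MAGIC:
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    if signature != PE_SIGNATURE:
        raise ValueError("missing PE signature")
    return e_lfanew

# Synthetic image: MZ at offset 0, e_lfanew pointing at 0x80, PE\0\0 there.
image = bytearray(0x84)
image[0:2] = b"MZ"
struct.pack_into("<I", image, E_LFANEW_OFFSET, 0x80)
image[0x80:0x84] = b"PE\0\0"
pe_offset = find_pe_header(image)
```

Both values are stored little-endian, which is why the bytes "MZ" read back as 0x5A4D and "PE\0\0" as 0x00004550.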
56IMAGE_NT_HEADERS Header
typedef struct _IMAGE_NT_HEADERS {
    DWORD Signature;
    IMAGE_FILE_HEADER FileHeader;
    IMAGE_OPTIONAL_HEADER32 OptionalHeader;
} IMAGE_NT_HEADERS32, *PIMAGE_NT_HEADERS32;
- IMAGE_FILE_HEADER predates PE
- Struct containing basic info about file
- Most important info is size of optional data
that follows (not really optional)
57IMAGE_NT_HEADERS Header
typedef struct _IMAGE_NT_HEADERS {
    DWORD Signature;
    IMAGE_FILE_HEADER FileHeader;
    IMAGE_OPTIONAL_HEADER32 OptionalHeader;
} IMAGE_NT_HEADERS32, *PIMAGE_NT_HEADERS32;
- IMAGE_OPTIONAL_HEADER
- DataDirectory array (at end) is address book of important locations in exe
- Each entry contains RVA and size of data
58PE Sections
- Recall, section is chunk of code or data that logically belongs together
- For example
- All data for exe's import tables are in one section
59Section Table
- Section table contains array of IMAGE_SECTION_HEADER structs
- An IMAGE_SECTION_HEADER has info about associated section
- Location, length, and characteristics
- Number of such headers given by field IMAGE_NT_HEADERS.FileHeader.NumberOfSections
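Walking the section table can be sketched in Python. The sketch assumes the 40-byte IMAGE_SECTION_HEADER layout from WINNT.H (8-byte name, six 32-bit fields, two 16-bit counts, 32-bit characteristics) and keeps only the fields mentioned above. `SectionHeader` and `parse_section_table` are hypothetical names, and the sample header is synthetic.

```python
import struct
from collections import namedtuple

SECTION_FORMAT = "<8sIIIIIIHHI"                   # one IMAGE_SECTION_HEADER
SECTION_SIZE = struct.calcsize(SECTION_FORMAT)    # 40 bytes

SectionHeader = namedtuple(
    "SectionHeader",
    "name virtual_size virtual_address raw_size raw_pointer characteristics",
)

def parse_section_table(data, offset, number_of_sections):
    """Read `number_of_sections` consecutive headers starting at `offset`."""
    sections = []
    for i in range(number_of_sections):
        fields = struct.unpack_from(SECTION_FORMAT, data, offset + i * SECTION_SIZE)
        name = fields[0].rstrip(b"\0").decode("ascii", errors="replace")
        # fields[5..8] are relocation and line-number info, skipped here
        sections.append(
            SectionHeader(name, fields[1], fields[2], fields[3], fields[4], fields[9])
        )
    return sections

# One synthetic .text header: code, executable, readable (0x60000020).
table = struct.pack(SECTION_FORMAT, b".text", 0x1234, 0x1000, 0x1400, 0x400,
                    0, 0, 0, 0, 0x60000020)
sections = parse_section_table(table, 0, 1)
```

In a real file the count would come from IMAGE_NT_HEADERS.FileHeader.NumberOfSections, as noted above.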
60Alignment of Sections
- Visual Studio 6.0
- 4KB sections by default
- Visual Studio .NET
- 4KB by default, except small files use 0x200-byte alignment
- Also, .NET spec requires 8KB in-memory alignment (for IA-64 compatibility)
61PE Sections
- So far, overview of PE file format
- Now, look inside important sections
- and some data structures within sections
- Then we finish with look at PEDUMP
- Recall there are other similar utilities
62Section Names
- .text --- The default code section.
- .data --- The default read/write data section. Global variables typically go here.
- .rdata --- The default read-only data section. String literals and C++/COM vtables are examples of items put into .rdata.
63Section Names
- .idata --- The imports table. It has become common practice (explicitly, or via linker default behavior) to merge .idata into another section, typically .rdata. By default, the linker only merges the .idata section into another section when creating a release mode exe.
- .edata --- The exports table. When creating an executable that exports APIs or data, the linker creates an .EXP file which contains an .edata section that's added into the final executable. Like the .idata section, the .edata section is often found merged into the .text or .rdata sections.
64Section Names
- .rsrc --- The resources. This section is read-only. However, it should not be renamed and should not be merged into other sections.
- .bss --- Uninitialized data. Rarely found in exes created with recent linkers. Instead, the VirtualSize of the exe's .data section is expanded to make room for uninitialized data.
- .crt --- Data added for supporting the C++ runtime (CRT). A good example is the function pointers that are used to call the constructors and destructors of static C++ objects.
65Section Names
- .tls --- Data for supporting thread local storage variables declared with __declspec(thread). This includes the initial value of the data, as well as additional variables needed by the runtime.
- .reloc --- Base relocations in an exe. Base relocations are generally only needed for DLLs and not EXEs. In release mode, the linker doesn't emit base relocations for EXE files. Relocations can be removed when linking with the /FIXED switch.
- .sdata --- "Short" read/write data that can be addressed relative to the global pointer. Used for IA-64 and other architectures that use a global pointer register. Regular-sized global variables on the IA-64 will go in this section.
66Section Names
- .srdata --- "Short" read-only data that can be addressed relative to the global pointer. Used on the IA-64 and other architectures that use a global pointer register.
- .pdata --- The exception table. Contains an array of IMAGE_RUNTIME_FUNCTION_ENTRY structs, which are CPU-specific. Pointed to by the IMAGE_DIRECTORY_ENTRY_EXCEPTION slot in the DataDirectory. Used for architectures with table-based exception handling, such as the IA-64. The only architecture that doesn't use table-based exception handling is the x86.
- .didat --- Delayload import data. Found in exes built in nonrelease mode. In release mode, the delayload data is merged into another section.
67Exports Section
- Exe may export code or data
- Makes it available to other exes
- Refer to an exported thing as a symbol
- At minimum, to export symbol, must specify its address in defined way
- Keyword ORDINAL tells linker to use numbers, not names, for symbols
- After all, names just a convenience for coders
68IMAGE_EXPORT_DIRECTORY
- Points to 3 arrays
- And a table of ASCII strings containing symbol names
- Only required array is Export Address Table (EAT)
- Array of function pointers
- Addresses of exported functions
- Export ordinal is an index into this array
69IMAGE_EXPORT_DIRECTORY
70Example
- exports table
  Name:            KERNEL32.dll
  Characteristics: 00000000
  TimeDateStamp:   3B7DDFD8 -> Fri Aug 17 23:24:08 2001
  Version:         0.00
  Ordinal base:    00000001
  # of functions:  000003A0
  # of Names:      000003A0
  Entry Pt  Ordn  Name
  00012ADA     1  ActivateActCtx
  000082C2     2  AddAtomA
  ...remainder of exports omitted...
71Example
- Suppose we call GetProcAddress on AddAtomA API
- System locates KERNEL32's IMAGE_EXPORT_DIRECTORY
- Gets start address of Export Names Table (ENT)
- It finds there are 0x3A0 entries in ENT
- Does binary search for AddAtomA
- Suppose AddAtomA is 2nd entry
- loader reads 2nd value from export ordinal table
72Example (Continued)
- Call GetProcAddress on AddAtomA API
- AddAtomA has export ordinal 2
- Use this as index into EAT (taking into account base field value)
- Finds AddAtomA has RVA of 0x82C2
- Add 0x82C2 to load address of KERNEL32 to get actual address of AddAtomA
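The lookup just described can be sketched in Python. This is a simplified model rather than the loader's actual code: the three export arrays are plain lists, the ordinal table holds the ordinals as printed in the dump (so the ordinal base is subtracted to index the EAT), and the load address 0x77E60000 is a made-up value. In the on-disk format the ordinal table is generally understood to hold indexes that are already zero-based, a detail this sketch glosses over.

```python
from bisect import bisect_left

def get_proc_address(load_address, names, name_ordinals, eat, ordinal_base, symbol):
    """Resolve `symbol` via name table -> ordinal table -> Export Address Table."""
    i = bisect_left(names, symbol)              # binary search of the sorted ENT
    if i == len(names) or names[i] != symbol:
        raise KeyError(symbol)
    ordinal = name_ordinals[i]                  # parallel entry in ordinal table
    rva = eat[ordinal - ordinal_base]           # account for the base field
    return load_address + rva                   # RVA -> actual address

# First two entries from the KERNEL32 dump shown earlier.
NAMES = ["ActivateActCtx", "AddAtomA"]
ORDINALS = [1, 2]
EAT = [0x00012ADA, 0x000082C2]
KERNEL32_BASE = 0x77E60000                      # hypothetical load address

address = get_proc_address(KERNEL32_BASE, NAMES, ORDINALS, EAT, 1, "AddAtomA")
```

Importing by ordinal would skip the binary search and go straight to the EAT lookup.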
73Export Forwarding
- Can forward export to another DLL
- That is, must find it at forward address
- Example
- KERNEL32 HeapAlloc function forwarded to RtlAllocHeap function exported by NTDLL
- In EXPORTS section of KERNEL32, find
EXPORTS
...
HeapAlloc = NTDLL.RtlAllocHeap
74Imports Section
- Importing is opposite of exporting
- IMAGE_IMPORT_DESCRIPTOR
- Points to 2 essentially identical arrays
- Import Address Table and Import Name Table
- IAT and INT
- Contain ordinal, address, forwarding info
- After binding, IAT rewritten, INT retains original (pre-binding) info
- Binding discussed next
75Imports Section
- Example
- Importing APIs from USER32.DLL
76Binding
- Binding means IAT overwritten with actual addresses
- VAs overwrite RVAs
- Why do this?
- Increased efficiency
- Loader checks whether binding valid
77Delayload Data
- Hybrid between implicit and explicit importing
- Not an OS issue
- A linker issue, at runtime
- There is IAT and INT for the DLL
- Identical to regular IAT and INT
- But read by runtime library code instead of OS
- Benefit? Calls then go directly to API
78Resources Section
- For resources such as
- icons, bitmaps, dialogs, etc.
- Most complicated section to navigate
- Organized like a file system
79Base Relocations
- Executable has many memory addresses
- As mentioned, PE file specifies preferred memory address to load the module
- ImageBase field in IMAGE_OPTIONAL_HEADER
- If DLL loaded elsewhere, all addresses will be incorrect
- Base relocations tell loader all locations that need to be modified
- Note that this is extra work for the loader
- What about EXE, which is not a DLL?
80Base Relocation Example
- Consider the following line of code
00401020 8B 0D 34 D4 40 00 mov ecx,dword ptr [0x0040D434]
- Note that 8B 0D specifies opcode
- Also note the address 0x0040D434
- Suppose preferred load is at 0x00400000
- If it loads at that address, it runs as-is
- Suppose instead it loads at 0x00500000
- Then code above needs to change to
8B 0D 34 D4 50 00 mov ecx,dword ptr [0x0050D434]
81Base Relocation Example
- If not loaded at preferred address, then loader computes delta
- For example on previous slide
- delta = 0x00500000 - 0x00400000
- So, delta is 0x00100000
- Also, there would be base relocation specifying location of the address operand, 0x00401022 (instruction at 0x00401020 plus 2 opcode bytes)
- Loader modifies address located here by delta
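The fix-up from this example can be sketched in Python. The sketch assumes a 32-bit absolute relocation and patches the DWORD that follows the two opcode bytes 8B 0D, so the address actually modified is 0x00401022 (two bytes into the instruction at 0x00401020). The function name and the in-memory buffer are hypothetical.

```python
import struct

def apply_relocation(image, image_base, load_address, reloc_va):
    """Add the load delta to the 32-bit address stored at `reloc_va`.

    `image` is a bytearray holding the module contents, indexed from the
    preferred base, and `reloc_va` is expressed relative to that base.
    """
    delta = load_address - image_base
    offset = reloc_va - image_base
    (old,) = struct.unpack_from("<I", image, offset)
    struct.pack_into("<I", image, offset, (old + delta) & 0xFFFFFFFF)
    return delta

# Rebuild the instruction from the example at RVA 0x1020.
image = bytearray(0x1026)
image[0x1020:0x1026] = bytes.fromhex("8B0D34D44000")

delta = apply_relocation(image, 0x00400000, 0x00500000, 0x00401022)
patched = bytes(image[0x1020:0x1026])
```

After the call, the instruction bytes read 8B 0D 34 D4 50 00, matching the relocated form shown in the example.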
82Debug Directory
- Contains debug info
- Not required to run the program
- But useful for development
- Can be multiple forms of debug info
- Most common is PDB file
83.NET Header
- .NET executables are PE files
- However, code/data is minimal
- Purpose of PE is simply to get .NET-specific info into memory
- Metadata, intermediate language (IL)
- MSCOREE.DLL at start of a .NET process
- This dll takes charge and uses metadata and IL from executable
- So PE has stub to get MSCOREE.DLL going
84TLS Initialization
- Thread Local Storage (TLS)
- .tls section for thread local variables
- New threads initialized using .tls data
- Presence of TLS data indicated by nonzero IMAGE_DIRECTORY_ENTRY_TLS in DataDirectory
- Points to IMAGE_TLS_DIRECTORY struct
- Contains virtual addresses, VAs (not RVAs)
- The actual struct is in .rdata, not in .tls
85Program Exception Data
- x86 architecture uses frame-based exception handling
- A fairly complex way to handle exceptions
- IA-64 and others use table-based approach
- Table containing info about every function that might be affected by exception unwinding
- Table entry includes start and end addresses, how and where exception to be handled
- When exception occurs, search thru table
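The table search can be sketched in Python as a binary search over entries sorted by start address. The entry layout (start, end, handler) is a simplification of what the slides describe, not the real IMAGE_RUNTIME_FUNCTION_ENTRY layout, and all names and addresses below are hypothetical.

```python
from bisect import bisect_right

def find_function_entry(table, address):
    """Return the (start, end, handler) entry covering `address`, or None.

    `table` must be sorted by start address; `end` is exclusive.
    """
    starts = [entry[0] for entry in table]
    i = bisect_right(starts, address) - 1   # last entry starting at or before address
    if i >= 0 and table[i][0] <= address < table[i][1]:
        return table[i]
    return None

# Hypothetical function table: two functions with a gap between them.
FUNCTION_TABLE = [
    (0x1000, 0x1080, "handler_a"),
    (0x1100, 0x11C0, "handler_b"),
]

entry = find_function_entry(FUNCTION_TABLE, 0x1104)
```

An address that falls in the gap between entries, or outside the table entirely, matches no function and yields `None`.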
86PEDUMP
- Tools for analyzing PE files
- Dumpbin (Visual Studio)
- Depends
- PE Browse Professional
- In spite of its name, it's free
- PEDUMP (by author of article)