Title: HUNTING FOR METAMORPHIC ENGINES
1HUNTING FOR METAMORPHIC ENGINES
- Mark Stamp
- Wing Wong
- September 13, 2006
2Outline
- Metamorphic software
- Both good and evil uses
- Metamorphic virus construction kits
- How effective are metamorphic engines?
- How to compare two pieces of code?
- Similarity within and between virus families
- Similarity to non-viral code
- Can we detect metamorphic viruses?
- Commercial virus scanners
- Hidden Markov models (HMMs)
- Similarity index
- Conclusion
3PART I
4What is Metamorphic Software?
- Software is metamorphic provided
- All copies do the same thing
- Internal structure of copies differs
- Today almost all software is cloned
- Good metamorphic software
- Mitigate buffer overflow attacks
- Bad metamorphic software
- Avoid virus/worm signature detection
5Metamorphic Software for Good?
- Suppose program has a buffer overflow
- If we clone the program
- One attack breaks every copy
- Break once, break everywhere (BOBE)
- If instead, we have metamorphic copies
- Each copy still has a buffer overflow
- One attack does not work against every copy
- BOBE-resistant
- Analogous to genetic diversity in biology
- A little metamorphism does a lot of good!
6Metamorphic Software for Evil?
- Cloned virus/worm can be detected
- Common signature on every copy
- Detect once, detect everywhere (DODE?)
- If instead virus/worm is metamorphic
- Each copy has different signature
- Same detection does not work against every copy
- Provides DODE-resistance
- Analogous to genetic diversity in biology
- But, effective use of metamorphism here is tricky!
7Crypto Analogy
- In information security, almost everything that consistently works is either
- Crypto, or
- Has a crypto analogy
- Consider WWII ciphers
- German Enigma
- Broken by Polish and British cryptanalysts
- Design was (mostly) known to cryptanalysts
- Japanese Purple
- Broken by American cryptanalysts
- Design was (mostly) unknown to cryptanalysts
8Crypto Analogy
- Cryptanalysis: break a (known) cipher
- Diagnosis: determine how an unknown cipher works (from ciphertext)
- Which was the greater achievement, breaking Enigma or Purple?
- Cryptanalysis of Enigma was harder
- Diagnosis of Purple was harder
- Can make a reasonable case for either
9Crypto Analogy
- What does this have to do with metamorphic software?
- Suppose we (the good guys) generate metamorphic copies of our software
- Bad guys can attack individual copies
- Can bad guys attack all copies?
- Bad guys can try to diagnose our metamorphic generator
10Crypto Analogy
- How to diagnose metamorphic generator (from exes)?
- Reverse engineer many copies, look at differences, etc., etc.
- Lots of work
- Diagnosis problem is hard
- If good guys can force bad guys to solve a diagnosis problem, the good guys win
- Security by obscurity? Violates (spirit of) Kerckhoffs' Principle?
- Yes, but still may be valuable in the real world
11Crypto Analogy
- What about the case where bad guys write metamorphic code?
- Metamorphic viruses, for example
- Do good guys need to solve diagnosis problem?
- If so, good guys are in trouble
- Not if good guys only need to detect the metamorphic code (not diagnose)
- Not claiming the good guys' job is easy
- Just claiming that there is hope
12Virus Evolution
- Viruses first appeared in the 1980s
- Fred Cohen
- Viruses must avoid signature detection
- Virus can alter its appearance
- Techniques employed
- encryption
- polymorphism
- metamorphism
13Virus Evolution - Encryption
- Virus consists of
- decrypting module (decryptor)
- encrypted virus body
- Different encryption key
- different virus body signature
- Weakness
- decryptor can be detected
14Virus Evolution - Polymorphism
- Try to hide signature of decryptor
- Can use code emulator to decrypt putative virus dynamically
- Decrypted virus body is constant
- Signature detection is possible
15Virus Evolution - Metamorphism
- Change virus body
- Mutation techniques
- permutation of subroutines
- insertion of garbage/jump instructions
- substitution of instructions
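A minimal sketch of two of the mutation techniques listed above (garbage insertion and instruction substitution), applied to a toy stack machine rather than real machine code. The instruction set, the `run` interpreter, and the `morph` function are all hypothetical and exist only for illustration; subroutine permutation is not shown.

```python
import random

def run(program):
    """Execute a toy stack program and return the final stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "inc":
            stack.append(stack.pop() + 1)
        elif op == "nop":
            pass  # garbage instruction: no effect
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack

def morph(program, rng):
    """Return a program with the same behavior but a different opcode sequence."""
    out = []
    for instr in program:
        if rng.random() < 0.5:
            out.append(("nop",))                # garbage insertion
        if instr[0] == "push" and rng.random() < 0.5:
            out.append(("push", instr[1] - 1))  # substitution: "push n" becomes
            out.append(("inc",))                # "push n-1" followed by "inc"
        else:
            out.append(instr)
    return out

original = [("push", 2), ("push", 3), ("add",), ("push", 10), ("add",)]
rng = random.Random(0)
variants = [morph(original, rng) for _ in range(5)]

for v in variants:
    # Every copy does the same thing, but the opcode sequence may differ.
    print(run(v), [instr[0] for instr in v])
```

All variants leave the same value on the stack as the original, while their opcode sequences differ, which is the property the slide calls metamorphism.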
16PART II
17Virus Construction Kits - PS-MPC
- According to Peter Szor
- "PS-MPC [Phalcon/Skism Mass-Produced Code generator] uses a generator that effectively works as a code-morphing engine ... the viruses that PS-MPC generates are not only polymorphic, but their decryption routines and structures change in variants"
18Virus Construction Kits - G2
- From the documentation of G2 (Second Generation virus generator)
- "different viruses may be generated from identical configuration files"
19Virus Construction Kits - NGVCK
- From the documentation for NGVCK (Next Generation Virus Creation Kit)
- "all created viruses are completely different in structure and opcode ... impossible to catch all variants with one or more scanstrings ... nearly 100% variability of the entire code"
- Oh, really?
20PART III
- How Effective Are Metamorphic Engines?
21How We Compare Two Pieces of Code
22Virus Families - Test Data
- Four generators, 45 viruses
- 20 viruses by NGVCK
- 10 viruses by G2
- 10 viruses by VCL32
- 5 viruses by MPCGEN
- 20 normal utility programs from the Cygwin bin
directory
23Similarity within Virus Families - Results
24Similarity within Virus Families - Results
25Similarity within Virus Families - Results
26Similarity within Virus Families - Results
27Similarity within Virus Families - Results
28NGVCK Similarity to Virus Families
- NGVCK versus other viruses
- 0% similar to G2 and MPCGEN viruses
- 0% to 5.5% similar to VCL32 viruses (43 out of 100 comparisons have score > 0)
- 0% to 1.2% similar to normal files (only 8 out of 400 comparisons have score > 0)
29NGVCK Metamorphism/Similarity
- NGVCK
- By far the highest degree of metamorphism of any kit tested
- Virtually no similarity to other viruses or normal programs
- Undetectable???
30PART IV
- Can Metamorphic Viruses Be Detected?
31Commercial Virus Scanners
- Tested three virus scanners
- eTrust version 7.0.405
- avast! antivirus version 4.7
- AVG Anti-Virus version 7.1
- Each scanned 37 files
- 10 NGVCK viruses
- 10 G2 viruses
- 10 VCL32 viruses
- 7 MPCGEN viruses
32Commercial Virus Scanners
- Results
- eTrust and avast! detected 17 (G2 and MPCGEN)
- AVG detected 27 viruses (G2, MPCGEN and VCL32)
- none of NGVCK viruses detected by the scanners
tested
33Hidden Markov Models (HMMs)
- state machines
- transitions between states have fixed probabilities
- each state has a probability distribution for observing a set of observation symbols
- can train an HMM to represent a set of data (in the form of observation sequences)
- states represent features of the input data
- transition and observation probabilities represent statistical properties of the features
34HMM Example - the Occasionally Dishonest Casino
35HMM Example - the Occasionally Dishonest Casino
- 2 states: fair/loaded
- The switch between dice is a Markov process
- Outcomes of a roll have different probabilities in each state
- If we can only see a sequence of rolls, the state sequence is hidden
- want to understand the underlying Markov process from the observations
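The casino can be simulated in a few lines. This is a sketch only: the slides give no numbers, so the loaded-die distribution (`LOADED`) and the probabilities of keeping the same die (`STAY`) are assumed values chosen for illustration.

```python
import random

FAIR = [1 / 6] * 6
LOADED = [0.1] * 5 + [0.5]             # assumed: a six comes up half the time
STAY = {"fair": 0.95, "loaded": 0.90}  # assumed probability of keeping the same die

def simulate(n_rolls, seed=0):
    """Return (hidden state sequence, observed rolls) for the two-dice casino."""
    rng = random.Random(seed)
    state = "fair"
    states, rolls = [], []
    for _ in range(n_rolls):
        states.append(state)
        weights = FAIR if state == "fair" else LOADED
        rolls.append(rng.choices(range(1, 7), weights=weights)[0])
        # Markov step: the next die depends only on the current die.
        if rng.random() > STAY[state]:
            state = "loaded" if state == "fair" else "fair"
    return states, rolls

states, rolls = simulate(300)
print("".join(str(r) for r in rolls[:60]))       # what an observer sees
print("".join(s[0].upper() for s in states[:60]))  # hidden: F = fair, L = loaded
```

An observer gets only the first printed line; the second line is the hidden state sequence that an HMM tries to recover.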
36HMMs - the Three Problems
- Problem 1: find the likelihood of seeing an observation sequence O given a model λ, i.e., P(O | λ)
- Problem 2: find an optimal state sequence that could have generated a sequence O
- Problem 3: find the model parameters given a sequence O, i.e., find transition and observation probabilities that maximize the probability of observing O
- There exist efficient algorithms to solve the three problems
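As an illustration of one of these efficient algorithms, the sketch below uses the standard forward algorithm to compute the likelihood of an observation sequence given a model (the first of the three problems) and checks it against brute-force enumeration of all state sequences. The matrices `A`, `B`, `pi` and the sequence `obs` are small made-up values, not taken from the experiments in these slides.

```python
from itertools import product

def forward(A, B, pi, obs):
    """Likelihood of obs under the model (A, B, pi), in O(N^2 * T) time."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

def brute_force(A, B, pi, obs):
    """Same likelihood by summing over all N^T state sequences (check only)."""
    N = len(pi)
    total = 0.0
    for path in product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

A = [[0.7, 0.3], [0.4, 0.6]]            # state transition probabilities
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]  # observation probabilities per state
pi = [0.6, 0.4]                         # initial state distribution
obs = [0, 1, 0, 2]                      # observation sequence (symbol indices)

print(forward(A, B, pi, obs), brute_force(A, B, pi, obs))
```

For long sequences the raw product underflows, so practical implementations scale `alpha` at each step or work with log probabilities.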
37HMM Application - Determining the Properties of English Text
- Given a large quantity of written English text
- Input a long sequence of observations consisting of 27 symbols (the 26 lower-case letters and the word space)
- Train a model to find the most probable parameters (i.e., solve Problem 3)
- Use trained model to score any unknown sequence of letters (and spaces) to determine whether it corresponds to English text (i.e., solve Problem 1)
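A sketch of the preprocessing step implied above: turning raw text into an observation sequence over 27 symbols. The slides do not say how upper-case letters, punctuation, or digits were handled, so the choices here (lower-casing, collapsing every run of non-letters into one word space) are assumptions, and `to_observations` is a hypothetical helper name.

```python
import string

ALPHABET = string.ascii_lowercase + " "   # 26 letters plus the word space
SYMBOL = {ch: i for i, ch in enumerate(ALPHABET)}

def to_observations(text):
    """Map raw text to a list of integers in 0..26."""
    cleaned = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            cleaned.append(ch)
        elif cleaned and cleaned[-1] != " ":
            cleaned.append(" ")  # assumed: any run of non-letters is one space
    return [SYMBOL[ch] for ch in "".join(cleaned).strip()]

obs = to_observations("Hello, World!")
print(obs)
```

The resulting integer sequence is what would be fed to HMM training (Problem 3) and scoring (Problem 1).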
38HMM Application - Initial and Final Observation Probability Distributions
39HMM Application - Results
- Observation probabilities converged, each letter belongs to one of the two hidden states
- The two states correspond to consonants and vowels
- Note
- no a priori assumption was made
- HMM effectively recovered the statistically significant feature inherent in English
40HMM Application - Results
- Probabilities can be sensibly interpreted for up to n = 12 hidden states
- Trained model could be used to detect English text, even if the text is disguised by, say, a simple substitution cipher or similar transformation
41Virus Detection with HMMs
- Use hidden Markov models (HMMs) to represent statistical properties of a set of metamorphic virus variants
- Train the model on a family of metamorphic viruses
- Use trained model to determine whether a given program is similar to the viruses the HMM represents
42Virus Detection with HMMs
- A trained HMM
- maximizes the probability of observing the training sequence
- assigns high probabilities to sequences similar to the training sequence
- represents the average behavior if trained on multiple sequences
- represents an entire virus family, as opposed to individual viruses
43Virus Detection with HMMs - Data
- Data set
- 200 NGVCK viruses (160 for training, 40 for testing)
- Comparison set
- 40 normal exes from Cygwin
- 25 other non-family viruses (G2, MPCGEN and VCL32)
- 25 HMM models generated and tested
44Virus Detection with HMMs - Methodology
45Virus Detection with HMMs - Results
46Virus Detection with HMMs - Results
- Detect some other viruses for free
47Virus Detection with HMMs
- Summary of experimental results
- All normal programs distinguished
- VCL32 viruses had scores close to NGVCK family viruses
- With proper threshold, 17 HMM models had 100% detection rate and 10 models had 0% false positive rate
- No significant difference in performance between HMMs with 3 or more hidden states
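A sketch of how a score threshold trades detection rate against false positive rate. The scores below are made up, and treating them as per-opcode log likelihoods where a higher value means "more like the family" is an assumption; the slides do not specify the scoring scale. The helper name `rates` is hypothetical.

```python
def rates(family_scores, other_scores, threshold):
    """Return (detection_rate, false_positive_rate) for a score threshold."""
    detected = sum(s >= threshold for s in family_scores)
    false_pos = sum(s >= threshold for s in other_scores)
    return detected / len(family_scores), false_pos / len(other_scores)

# Made-up scores for illustration only.
family = [-2.1, -2.3, -2.0, -2.6]   # held-out family viruses
others = [-4.8, -2.5, -6.2, -9.0]   # normal programs and non-family viruses

for t in (-2.4, -2.7):
    dr, fpr = rates(family, others, t)
    print(f"threshold {t}: detection {dr:.0%}, false positives {fpr:.0%}")
```

With these numbers the stricter threshold misses one family member but flags nothing else, while the looser one catches every family member at the cost of one false positive. This mirrors how a model whose non-family scores sit close to the family scores (as with VCL32 above) may reach 100% detection or 0% false positives, but not both.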
48Virus Detection with HMMs - Trained Models
- Converged probabilities in HMM matrices may give insight into the features of the represented viruses
- We observe
- opcodes grouped into hidden states
- most opcodes in one state only
- What does this mean?
- We are not sure
49HMMs - The Trained Models
50Detection via Similarity Index
- Straightforward similarity index can be used as detector
- To determine whether a program belongs to the NGVCK virus family, compare it to any randomly chosen NGVCK virus
- NGVCK similarity to non-NGVCK code is small
- Can use this fact to detect metamorphic NGVCK variants
51Detection via Similarity Index
52Detection via Similarity Index
- Experiment
- compare 105 programs to one selected NGVCK virus
- Results
- 100% detection, 0% false positives
- Does not depend on specific NGVCK virus selected
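A sketch of threshold-based detection against a single reference virus. The similarity index used in these experiments is not defined in the slides, so this example swaps in Python's `difflib.SequenceMatcher` ratio as a stand-in; the opcode lists and the threshold of 0.5 are invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(ops_a, ops_b):
    """Stand-in similarity in [0, 1] between two opcode sequences."""
    return SequenceMatcher(None, ops_a, ops_b, autojunk=False).ratio()

def looks_like_family(candidate, reference, threshold=0.5):
    """Flag the candidate if it is similar enough to one family member."""
    return similarity(candidate, reference) >= threshold

# Invented opcode sequences (mnemonics only).
reference = ["mov", "add", "push", "call", "pop", "jmp", "xor", "ret"]
variant = ["mov", "nop", "add", "push", "call", "nop", "pop", "jmp", "xor", "ret"]
unrelated = ["lea", "cmp", "jne", "inc", "dec", "sub", "test", "jz"]

print(looks_like_family(variant, reference))    # variant of the reference
print(looks_like_family(unrelated, reference))  # shares no opcodes with it
```

Any real threshold would have to be chosen from measured similarity scores for the family and for non-family code.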
53PART V
54Conclusion
- Metamorphic generators vary a lot
- NGVCK has highest metamorphism (10% similarity on average)
- Other generators far less effective (60% similarity on average)
- Normal files 35% similar, on average
- But, NGVCK viruses can be detected!
- NGVCK viruses too different from other viruses and normal programs
55Conclusion
- NGVCK viruses not detected by commercial scanners we tested
- Hidden Markov model (HMM) detects NGVCK (and other) viruses with high accuracy
- NGVCK viruses also detectable by similarity index
56Conclusion
- All metamorphic viruses tested were detectable because
- High similarity within family and/or
- Too different from normal programs
- Effective use of metamorphism by virus/worm requires
- A high degree of metamorphism and similarity to other programs
- This is not trivial!
57Conclusion
- How practical is our detection method?
- We cheat in several ways
- Use IDA to disassemble
- Viruses not embedded in other code
- Limited testing, small no. of files, etc.
- But results appear to be robust
- If so, we can be sloppy (i.e., more efficient)
and still get good results
58The Bottom Line
- Metamorphism for good
- For example, buffer overflow mitigation
- A little metamorphism does a lot of good
- Metamorphism for evil
- For example, try to evade virus/worm signature detection
- Requires high degree of metamorphism and similarity to normal programs
- Not impossible, but not easy
59The Bottom Bottom Line
- For metamorphic software, perhaps the inherent advantage lies with the good guys rather than the bad guys
- All too often in information security, the advantage lies with the bad guys
60References
- X. Gao, Metamorphic software for buffer overflow mitigation, MS thesis, Dept. of CS, SJSU, 2005
- P. Szor, The Art of Computer Virus Research and Defense, Addison-Wesley, 2005
- M. Stamp, Information Security: Principles and Practice, Wiley InterScience, 2005
- W. Wong, Analysis and detection of metamorphic computer viruses, MS thesis, Dept. of CS, SJSU, 2006
- W. Wong and M. Stamp, Hunting for metamorphic engines, to appear in Journal in Computer Virology
61Appendix
62HMMs - Run Time of Training Process
- 5 to 38 minutes, depending on number of states n.
63HMMs - Run Time of Classifying Process
- 0.008 to 0.4 milliseconds, depending on number of states n and number of opcodes T.
64AVG Anti-Virus Scanning Result