Title: Eliminating Web Software Vulnerabilities with Automated Verification
1Eliminating Web Software Vulnerabilities with
Automated Verification
- Tevfik Bultan
- Verification Lab
- Department of Computer Science
- University of California, Santa Barbara
- bultan@cs.ucsb.edu, http://www.cs.ucsb.edu/vlab
2University of California at Santa Barbara
3Verification Lab (VLab) at UCSB
- Application of automated verification techniques to software
- Automated verification of web applications
- Checking input validation, sanitization, string analysis (PHP)
- Checking navigation correctness (MVC frameworks)
- Checking data models (Ruby on Rails)
- Automated verification of web services
- Modular testing and verification of web services (WSDL)
- Formal modeling and analysis of choreography and orchestration (WS-CDL, WS-BPEL)
- Automated verification of access control policies
- Policy composition (XACML)
- Automated verification of concurrency
- Analyzing concurrency, deadlock detection (Java)
4VLab Publications on String Analysis
- Relational String Verification Using Multi-Track Automata [Yu, Bultan, Ibarra CIAA'10]
- Stranger: An Automata-based String Analysis Tool for PHP [Yu, Alkhalaf, Bultan TACAS'10]
- Generating Vulnerability Signatures for String Manipulating Programs Using Automata-based Forward and Backward Symbolic Analyses [Yu, Alkhalaf, Bultan ASE'09]
- Symbolic String Verification: Combining String Analysis and Size Analysis [Yu, Bultan, Ibarra TACAS'09]
- Symbolic String Verification: An Automata-based Approach [Yu, Bultan, Cova, Ibarra SPIN'08]
5Web software
- Web software is becoming increasingly dominant
- Web applications are used extensively in many areas
- Commerce: online banking, online shopping
- Entertainment: online music, videos
- Interaction: social networks
- We will rely on web applications more in the future
- Health records
- Google Health, Microsoft HealthVault
- Controlling and monitoring of national infrastructures
- Google Powermeter
- Web software is also rapidly replacing desktop applications
- Cloud computing, software-as-service
- Google Docs, Google
6One Major Road Block
- Web applications are not trustworthy!
- Web applications are notorious for security vulnerabilities
- Their global accessibility makes them a target for many malicious users
- As web applications are becoming increasingly dominant
- and as their use in safety critical areas is increasing
- their trustworthiness is becoming a critical issue
7Web applications are not secure
- There are many well-known security vulnerabilities that exist in many web applications. Here are some examples:
- Malicious file execution: a malicious user causes the server to execute malicious code
- SQL injection: a malicious user executes SQL commands on the back-end database by providing specially formatted input
- Cross site scripting (XSS): the attacker causes a malicious script to execute in a user's browser
- These vulnerabilities are typically due to
- errors in user input validation or
- lack of user input validation
8Web Application Vulnerabilities
9Web Application Vulnerabilities
- The top two vulnerabilities of the Open Web Application Security Project (OWASP)'s top ten list in 2007
- Cross Site Scripting (XSS)
- Injection Flaws (such as SQL Injection)
- The top two vulnerabilities of OWASP's top ten list in 2010
- Injection Flaws (such as SQL Injection)
- Cross Site Scripting (XSS)
10Why are web applications error prone?
- Extensive string manipulation
- Web applications use extensive string manipulation
- To construct HTML pages, to construct database queries in SQL, etc.
- The user input comes in string form and must be validated and sanitized before it can be used
- This requires the use of complex string manipulation functions such as string-replace
- String manipulation is error prone
11String Related Vulnerabilities
- String related web application vulnerabilities occur when
- a sensitive function is passed a malicious string input from the user
- This input contains an attack
- It is not properly sanitized before it reaches the sensitive function
- String analysis: Discover these vulnerabilities automatically
12XSS Vulnerability
- A PHP Example
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- The echo statement in line 4 is a sensitive function
- It contains a Cross Site Scripting (XSS) vulnerability
- Example attack input: <script ...
13Is it Vulnerable?
- A simple taint analysis, e.g., [Huang et al. WWW'04], can report this segment vulnerable using taint propagation
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- The argument of echo is tainted ⇒ the script is vulnerable
14How to Fix it?
- To fix the vulnerability we added a sanitization routine at line s
- Taint analysis will assume that $www is untainted after line s and report that the segment is NOT vulnerable
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- s: $www = preg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- $www is tainted at line 2 and treated as untainted after line s
15Is It Really Sanitized?
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- s: $www = preg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 5: ?>
- Example attack input: <script ...
16Sanitization Routines can be Erroneous
- The sanitization statement is not correct!
- preg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- Removes all characters that are not in [A-Za-z0-9 .-@:/]
- .-@ denotes all characters between "." and "@" (including "<" and ">")
- .-@ should have been .\-@
- This example is from a buggy sanitization routine used in MyEasyMarket-4.1 (line 218 in file trans.php)
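The effect of the unescaped hyphen can be reproduced outside PHP. The following is a minimal sketch in Python, whose `re` character classes also read `.-@` as a range; the names `buggy_sanitize` and `fixed_sanitize` are illustrative and do not come from MyEasyMarket.

```python
import re

# Character class as reconstructed from the slide: ".-@" is read as a RANGE
# from "." (0x2E) to "@" (0x40), which includes "<" (0x3C) and ">" (0x3E).
BUGGY = re.compile(r"[^A-Za-z0-9 .-@://]")

# Escaping the hyphen makes ".", "-" and "@" three literal characters.
FIXED = re.compile(r"[^A-Za-z0-9 .\-@://]")


def buggy_sanitize(s: str) -> str:
    """Delete every character outside the (too large) buggy whitelist."""
    return BUGGY.sub("", s)


def fixed_sanitize(s: str) -> str:
    """Delete every character outside the intended whitelist."""
    return FIXED.sub("", s)


if __name__ == "__main__":
    attack = "<script>alert(1)</script>"
    print(buggy_sanitize(attack))   # angle brackets survive
    print(fixed_sanitize(attack))   # angle brackets are removed
```

With the buggy class the angle brackets pass through untouched, so the sanitized value can still open a script tag.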
17String Analysis
- String analysis determines all possible values that a string expression can take during any program execution
- Using string analysis we can identify all possible input values of the sensitive functions
- Then we can check if inputs of sensitive functions can contain attack strings
- How can we characterize attack strings?
- Use regular expressions to specify the attack patterns
- Attack pattern for XSS: Σ*<scriptΣ*
18String Analysis
- If string analysis determines that the intersection of the attack pattern and possible inputs of the sensitive function is empty
- then we can conclude that the program is secure
- If the intersection is not empty, then we can again use string analysis to generate a vulnerability signature
- which characterizes all malicious inputs
- Given Σ*<scriptΣ* as an attack pattern
- The vulnerability signature for $_GET["www"] is
- Σ*<α*sα*cα*rα*iα*pα*tΣ*
- where α ∉ [A-Za-z0-9 .-@/]
19Vulnerabilities can be tricky
- Input <!scrip!t ...> does not match the attack pattern
- but it matches the vulnerability signature
- and it can cause an attack
- 1: <?php
- 2: $www = "<!scrip!t ...>";
- 3: $l_otherinfo = "URL";
- s: $www = preg_replace("[^A-Za-z0-9 .-@://]", "", "<!scrip!t ...>");
- 4: echo "<td>" . $l_otherinfo . ": " . "<script ...>" . "</td>";
- 5: ?>
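The difference between the attack pattern and the vulnerability signature can be checked with ordinary regular expressions. This is a hedged sketch in Python rather than PHP: `ALPHA` stands for the characters the buggy sanitizer deletes, `search` plays the role of the surrounding Σ*, and all names are assumptions made for illustration.

```python
import re

# Characters deleted by the buggy sanitizer: everything outside its whitelist.
ALPHA = r"[^A-Za-z0-9 .-@:/]"

# Attack pattern: any string that contains "<script".
ATTACK = re.compile(r"<script")

# Vulnerability signature: "<" followed by s, c, r, i, p, t, where each letter
# may be preceded by any number of characters that the sanitizer will delete.
SIGNATURE = re.compile("<" + "".join(ALPHA + "*" + ch for ch in "script"))


def sanitize(s: str) -> str:
    """The buggy sanitization step (line s in the slides)."""
    return re.sub(ALPHA, "", s)


if __name__ == "__main__":
    user_input = "<!scrip!t ...>"
    print(bool(ATTACK.search(user_input)))            # input vs. attack pattern
    print(bool(SIGNATURE.search(user_input)))         # input vs. signature
    print(bool(ATTACK.search(sanitize(user_input))))  # sanitized value vs. attack pattern
```

The input only becomes an attack string after the sanitizer has removed the `!` characters, which is why a check against the attack pattern alone at the input misses it.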
20Automata-based String Analysis
- Finite State Automata can be used to characterize sets of string values
- We use automata based string analysis
- Associate each string expression in the program with an automaton
- The automaton accepts an over approximation of all possible values that the string expression can take during program execution
- Using this automata representation we symbolically execute the program, only paying attention to string manipulation operations
21Symbolic Automata Representation
- We use the MONA DFA Package for automata manipulation
- [Klarlund and Møller, 2001]
- Compact Representation
- The transition relation of the DFA is represented as a multi-terminal BDD (MBDD)
- Exploits the MBDD structure in the implementation of DFA operations
- Union, Intersection, and Emptiness Checking
- Projection and Minimization
- Cannot Handle Nondeterminism
- We extended the alphabet with dummy bits to encode nondeterminism
22Symbolic Automata Representation
Explicit DFA representation
Symbolic DFA representation
23String Analysis Stages
- Construct dependency graphs for the PHP programs
- Combine symbolic forward and backward reachability analyses
- Forward analysis
- Assume that the user input can be any string
- Propagate this information on the dependency graph
- When a sensitive function is reached, intersect with attack pattern
- Backward analysis
- If the intersection is not empty, propagate the result backwards to identify which inputs can cause an attack
[Figure: analysis stages Front End, Forward Analysis, Backward Analysis; inputs PHP Program and Attack patterns; output Vulnerability Signatures]
24Dependency Graphs
- Given a PHP program, first construct the dependency graph
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: $www = preg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- 5: echo $l_otherinfo . ": " . $www;
- 6: ?>
[Figure: Dependency Graph. Nodes, each labelled with its source line: $_GET["www"], 2; "[^A-Za-z0-9 .-@://]", 4; "", 4; $www, 2; "URL", 3; ": ", 5; $l_otherinfo, 3; preg_replace, 4; str_concat, 5; $www, 4; str_concat, 5; echo, 5]
25Forward Analysis
- Using the dependency graph we conduct vulnerability analysis
- Automata-based forward symbolic analysis that identifies the possible values of each node
- Each node in the dependency graph is associated with a DFA
- The DFA accepts an over-approximation of the string values that the string expression represented by that node can take at runtime
- The DFAs for the input nodes accept Σ*
- Intersecting the DFA for the sink nodes with the DFA for the attack pattern identifies the vulnerabilities
- Uses post-image computations of string operations
- postConcat(M1, M2)
- returns M, where M = M1.M2
- postReplace(M1, M2, M3) (language-based replacement)
- returns M, where M = replace(M1, M2, M3)
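The final step of the forward analysis, intersecting the sink automaton with the attack pattern and testing for emptiness, can be sketched with explicit DFAs. This is an illustrative Python sketch over a three-symbol toy alphabet, which is an assumption made for brevity; Stranger itself uses MONA's symbolic MBDD-based DFAs, and the names `intersect` and `witness` are hypothetical.

```python
from collections import deque

# Toy alphabet (an assumption): "a" stands for any letter, "<" for the
# dangerous character, "!" for a character that the sanitizer deletes.
ALPHABET = ("a", "<", "!")

# A DFA is (delta, start, accepting); a missing transition means "reject".
ATTACK = ({(0, "a"): 0, (0, "!"): 0, (0, "<"): 1,
           (1, "a"): 1, (1, "!"): 1, (1, "<"): 1}, 0, {1})  # Sigma* < Sigma*
BUGGY_SINK = ({(0, "a"): 0, (0, "<"): 0}, 0, {0})  # sink values if "<" is whitelisted
FIXED_SINK = ({(0, "a"): 0}, 0, {0})               # sink values if "<" is deleted


def intersect(d1, d2):
    """Product construction restricted to the reachable state pairs."""
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    start = (s1, s2)
    delta, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        p, q = state
        if p in f1 and q in f2:
            accepting.add(state)
        for c in ALPHABET:
            if (p, c) in t1 and (q, c) in t2:
                nxt = (t1[(p, c)], t2[(q, c)])
                delta[(state, c)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return delta, start, accepting


def witness(dfa):
    """Return a shortest accepted string, or None if the language is empty."""
    delta, start, accepting = dfa
    seen, queue = {start}, deque([(start, "")])
    while queue:
        state, word = queue.popleft()
        if state in accepting:
            return word
        for c in ALPHABET:
            nxt = delta.get((state, c))
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + c))
    return None


if __name__ == "__main__":
    print(witness(intersect(BUGGY_SINK, ATTACK)))  # a string reaching the sink that matches the attack
    print(witness(intersect(FIXED_SINK, ATTACK)))  # None: the intersection is empty
```

A non-empty intersection means some value reaching the sink matches the attack pattern; an empty one shows that the sink is safe with respect to that pattern.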
26Forward Analysis
[Figure: dependency graph annotated with forward analysis results; attack pattern Σ*<Σ*]
- $_GET["www"], 2 and $www, 2: Forward = Σ*
- "", 4: Forward = ε
- "[^A-Za-z0-9 .-@://]", 4: Forward = [^A-Za-z0-9 .-@/]
- "URL", 3 and $l_otherinfo, 3: Forward = URL
- ": ", 5: Forward = ": "
- preg_replace, 4 and $www, 4: Forward = [A-Za-z0-9 .-@/]*
- str_concat, 5 (inner): Forward = URL:
- str_concat, 5 (outer) and echo, 5: Forward = URL: [A-Za-z0-9 .-@/]*
- L(Σ*<Σ*) ∩ L(URL: [A-Za-z0-9 .-@/]*) = L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*) ≠ Ø
27Intersection Result Automaton
[Figure: intersection result automaton accepting URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*, with transitions labelled U, R, L, Space, [A-Za-z0-9 .-;=-@/], < and [A-Za-z0-9 .-@/]]
28Widening
- String verification problem is undecidable in general
- Two string variables, equality checks and concatenation are enough to get to undecidability
- The forward fixpoint computation is not guaranteed to converge in the presence of loops and recursion
- We compute a sound approximation
- During fixpoint we compute an over approximation of the least fixpoint that corresponds to the reachable states
- We use an automata based widening operation to over-approximate the fixpoint
- Widening operation over-approximates the union operation and accelerates the convergence of the fixpoint computation
29 Backward Analysis
- A vulnerability signature is a characterization of all malicious inputs that can be used to generate attack strings
- We identify vulnerability signatures using an automata-based backward symbolic analysis starting from the sink node
- Uses pre-image computations on string operations
- preConcatPrefix(M, M2)
- returns M1, where M = M1.M2
- preConcatSuffix(M, M1)
- returns M2, where M = M1.M2
- preReplace(M, M1, M2)
- returns M3, where M = replace(M1, M2, M3)
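For the special case where the replace operation deletes a fixed set of characters (as the sanitizer in the running example does), the pre-image has a simple automata construction: drop the transitions on deleted characters and add a self-loop on each deleted character at every state. The Python sketch below illustrates only this special case, with a shortened attack pattern Σ*<sΣ* over a toy alphabet; it is not Stranger's general preReplace, and the names `pre_delete` and `accepts` are hypothetical.

```python
ALPHABET = "<sa!"

# NFA as (states, transitions, start, accepting); transitions are (src, symbol, dst).
# TARGET accepts every string containing "<s" (a shortened attack pattern).
TARGET = (
    {0, 1, 2},
    {(0, c, 0) for c in ALPHABET} | {(0, "<", 1), (1, "s", 2)}
    | {(2, c, 2) for c in ALPHABET},
    0,
    {2},
)


def pre_delete(nfa, deleted):
    """Pre-image of 'delete every character in `deleted`' with respect to L(nfa).

    A word w is accepted by the result iff w with the deleted characters
    removed is accepted by `nfa`.
    """
    states, trans, start, accepting = nfa
    kept = {(p, c, q) for (p, c, q) in trans if c not in deleted}
    loops = {(p, c, p) for p in states for c in deleted}
    return states, kept | loops, start, accepting


def accepts(nfa, word):
    """Standard subset simulation of an NFA without epsilon moves."""
    _states, trans, start, accepting = nfa
    current = {start}
    for ch in word:
        current = {q for (p, a, q) in trans if p in current and a == ch}
    return bool(current & accepting)


if __name__ == "__main__":
    signature = pre_delete(TARGET, {"!"})
    for w in ("<s", "<!s", "a<!!sa", "<as"):
        print(w, accepts(TARGET, w), accepts(signature, w))
```

The string `<!s` is rejected by the target language but accepted by its pre-image, mirroring how `<!scrip!t` matches the vulnerability signature but not the attack pattern.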
30Backward Analysis
[Figure: dependency graph annotated with forward and backward analysis results; the figure also labels nodes 3, 6, 10, 11 and 12]
- Forward results are the same as in the forward analysis
- echo, 5 and str_concat, 5 (outer): Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- $www, 4 and preg_replace, 4: Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- $www, 2 and $_GET["www"], 2: Backward = [^<]*<Σ*
- All other nodes: Backward = Do not care
- Vulnerability Signature = [^<]*<Σ*
31Vulnerability Signature Automaton
[Figure: vulnerability signature automaton accepting [^<]*<Σ*, with transitions labelled [^<], <, Σ and Not ASCII]
32Vulnerability Signatures
- The vulnerability signature is the result of the input node, which includes all possible malicious inputs
- An input that does not match this signature cannot exploit the vulnerability
- After generating the vulnerability signature
- Can we generate a patch based on the vulnerability signature?
- The vulnerability signature automaton for the running example
[Figure: vulnerability signature automaton accepting [^<]*<Σ*]
33Patches from Vulnerability Signatures
- Main idea
- Given a vulnerability signature automaton, find a cut that separates initial and accepting states
- Remove the characters in the cut from the user input to sanitize
- This means that if we just delete < from the user input, then the vulnerability can be removed
[Figure: vulnerability signature automaton accepting [^<]*<Σ*; min-cut is {<}]
34Patches from Vulnerability Signatures
- Ideally, we want to modify the input (as little as possible) so that it does not match the vulnerability signature
- Given a DFA, an alphabet cut is
- a set of characters such that, after removing the edges that are associated with the characters in the set, the modified DFA does not accept any non-empty string
- Finding a minimum alphabet cut of a DFA is an NP-hard problem (one can reduce the vertex cover problem to this problem)
- We use a min-cut algorithm instead
- The set of characters that are associated with the edges of the min cut is an alphabet cut
- but not necessarily the minimum alphabet cut
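The definition of an alphabet cut can be made concrete with a small checker. The Python sketch below tests the definition directly and finds a minimum alphabet cut by brute force over subsets, which is exponential in the alphabet size (in line with the NP-hardness remark); it is not the min-cut heuristic used in the talk, and the automaton and function names are illustrative assumptions.

```python
from itertools import combinations

# Signature automaton for [^<]*<Sigma* over a toy alphabet {"a", "<"}.
EDGES = {(0, "a", 0), (0, "<", 1), (1, "a", 1), (1, "<", 1)}
START, ACCEPTING, ALPHABET = 0, {1}, {"a", "<"}


def accepts_nonempty(edges, start, accepting, removed):
    """True if an accepting state is reachable through at least one edge
    whose label is not in `removed`."""
    seen, frontier = set(), [start]
    while frontier:
        p = frontier.pop()
        for (src, c, dst) in edges:
            if src == p and c not in removed and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return bool(seen & accepting)


def is_alphabet_cut(edges, start, accepting, cut):
    """Check the definition: removing edges labelled in `cut` leaves no
    accepted non-empty string."""
    return not accepts_nonempty(edges, start, accepting, set(cut))


def minimum_alphabet_cut(edges, start, accepting, alphabet):
    """Try subsets in order of increasing size; the full alphabet always works."""
    for k in range(len(alphabet) + 1):
        for cut in combinations(sorted(alphabet), k):
            if is_alphabet_cut(edges, start, accepting, cut):
                return set(cut)


if __name__ == "__main__":
    print(minimum_alphabet_cut(EDGES, START, ACCEPTING, ALPHABET))
```

For the running example the only way to reach the accepting state is through an edge labelled `<`, so deleting `<` from the input is enough.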
35Automatically Generated Patch
- Automatically generated patch will make sure that no string that matches the attack pattern reaches the sensitive function
- <?php
- if (preg_match('/[^<]*<.*/', $_GET["www"]))
- $_GET["www"] = preg_replace("<", "", $_GET["www"]);
- $www = $_GET["www"];
- $l_otherinfo = "URL";
- $www = preg_replace("[^A-Za-z0-9 .-@://]", "", $www);
- echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- ?>
36Experiments
- We evaluated our approach on five vulnerabilities from three open source web applications
- (1) MyEasyMarket-4.1: A shopping cart program
- (2) BloggIT-1.0: A blog engine
- (3) proManager-0.72: A project management system
- We used the following XSS attack pattern
- Σ*<scriptΣ*
37Forward Analysis Results
- The dependency graphs of these benchmarks are reduced based on the sinks
- Unrelated parts are removed
Input                          Results
nodes  edges  sinks  inputs    Time(s)  Mem (kb)  states/bdds
21     20     1      1         0.08     2599      23/219
29     29     1      1         0.53     13633     48/495
25     25     1      2         0.12     1955      125/1200
23     22     1      1         0.12     4022      133/1222
25     25     1      1         0.12     3387      125/1200
38Backward Analysis Results
- We use the backward analysis to generate the vulnerability signatures
- Backward analysis starts from the vulnerable sinks identified during forward analysis
Input                          Results
nodes  edges  sinks  inputs    Time(s)  Mem (kb)  states/bdds
21     20     1      1         0.46     2963      9/199
29     29     1      1         41.03    1859767   811/8389
25     25     1      2         2.35     5673      20/302, 20/302
23     22     1      1         2.33     32035     91/1127
25     25     1      1         5.02     14958     20/302
39Alphabet Cuts
- We generate alphabet cuts from the vulnerability signatures using a min-cut algorithm
- Problem: When there are two user inputs the patch will block everything and delete everything
- Cannot interpret the relations among input variables (e.g. only block when the concatenation of two inputs contains <script)
Input                          Results
nodes  edges  sinks  inputs    Alphabet Cut
21     20     1      1         <
29     29     1      1         s,,
25     25     1      2         Σ , Σ
23     22     1      1         <,,
25     25     1      1         <,,
- In the row with two inputs, the vulnerability signature depends on two inputs
40Relational String Analysis
- Instead of using multiple single-track DFAs use one multi-track DFA
- Each track represents the values of one string variable
- Using multi-track DFAs
- Identifies the relations among string variables
- Generates relational vulnerability signatures for multiple user inputs of a vulnerable application
- Improves the precision of string analysis
- Proves properties that depend on relations among string variables, e.g., $file = $usr.txt
41Multi-track Automata
- Let X (the first track), Y (the second track), be two string variables
- λ is a padding symbol
- A multi-track automaton that encodes X = Y.txt
[Figure: multi-track automaton with transitions labelled (a,a), (b,b), (λ,t), (λ,x), (λ,t)]
42Relational Vulnerability Signature
- Performs forward analysis using multi-track automata to generate relational vulnerability signatures
- Each track represents one user input
- An auxiliary track represents the values of the current node
- Intersects the auxiliary track with the attack pattern upon termination
43Relational Vulnerability Signature
- Consider a simple example having multiple user inputs
- <?php
- 1: $www = $_GET["www"];
- 2: $url = $_GET["url"];
- 3: echo $url . $www;
- ?>
- Let the attack pattern be Σ* < Σ*
44Relational Vulnerability Signature
- A multi-track automaton: ($url, $www, aux)
- Identifies the fact that the concatenation of two inputs contains <
[Figure: three-track automaton with transitions labelled (a,λ,a), (b,λ,b), (<,λ,<), (λ,a,a), (λ,b,b), (λ,<,<)]
45Relational Vulnerability Signature
- Project away the auxiliary variable
- Find the min-cut
- This min-cut identifies the alphabet cuts {<} for the first track ($url) and {<} for the second track ($www)
[Figure: projected two-track automaton with transitions labelled (a,λ), (b,λ), (<,λ), (λ,a), (λ,b), (λ,<); min-cut is {<},{<}]
46Patch for Multiple Inputs
- Patch: If the inputs match the signature, delete its alphabet cut
- <?php
- if (preg_match('/[^<]*<.*/', $_GET["url"] . $_GET["www"]))
- {
- $_GET["url"] = preg_replace("<", "", $_GET["url"]);
- $_GET["www"] = preg_replace("<", "", $_GET["www"]);
- }
- 1: $www = $_GET["www"];
- 2: $url = $_GET["url"];
- 3: echo $url . $www;
- ?>
47Technical Issues
- To conduct relational string analysis, we need to compute intersection of multi-track automata
- Aligned multi-track automata are closed under intersection
- λs are right justified in all tracks, e.g., abλλ instead of aλbλ
- However, there exist unaligned multi-track automata that are not describable by aligned ones
- We proposed an alignment algorithm that constructs aligned automata which over or under approximate unaligned ones
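The aligned encoding can be illustrated on plain strings. The following Python sketch pads every track on the right with the padding symbol λ and checks the right-justification property on a multi-track word; it only illustrates the encoding, not the alignment algorithm for automata, and the function names are assumptions.

```python
LAMBDA = "\u03bb"  # the padding symbol (Greek small letter lambda)


def align(*strings):
    """Encode several strings as one multi-track word, padding shorter
    tracks on the right so that all tracks have the same length."""
    width = max((len(s) for s in strings), default=0)
    padded = [s + LAMBDA * (width - len(s)) for s in strings]
    return list(zip(*padded))


def is_aligned(word):
    """A multi-track word is aligned if, in every track, no ordinary
    symbol appears after a padding symbol."""
    for track in zip(*word):
        text = "".join(track)
        if LAMBDA in text.rstrip(LAMBDA):
            return False
    return True


if __name__ == "__main__":
    print(align("ab", "abcd"))
    print(is_aligned(align("ab", "abcd")))
    # The single-track word a, lambda, b, lambda from the slide is not aligned.
    print(is_aligned([("a",), (LAMBDA,), ("b",), (LAMBDA,)]))
```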
48Other Technical Issues
- Modeling Word Equations
- Intractability of X = cZ
- The number of states of the corresponding aligned multi-track DFA is exponential in the length of c.
- Irregularity of X = YZ
- X = YZ is not describable by an aligned multi-track automaton
- We proposed a conservative analysis
- Constructs multi-track automata that over or under-approximate the word equations
49Composite Analysis
- What I have talked about so far focuses only on string contents
- It does not handle constraints on string lengths
- It cannot handle comparisons among integer variables and string lengths
- We extended our string analysis techniques to analyze systems that have unbounded string and integer variables
- We proposed a composite static analysis approach that combines string analysis and size analysis
50Size Analysis
- Size Analysis: The goal of size analysis is to provide properties about string lengths
- It can be used to discover buffer overflow vulnerabilities
- Integer Analysis: At each program point, statically compute the possible states of the values of all integer variables.
- These infinite states are symbolically over-approximated as linear arithmetic constraints that can be represented as an arithmetic automaton
- Integer analysis can be used to perform size analysis by representing lengths of string variables as integer variables.
51An Example
- Consider the following segment
- 1: <?php
- 2: $www = $_GET["www"];
- 3: $l_otherinfo = "URL";
- 4: $www = preg_replace("[^A-Za-z0-9 ./-@://]", "", $www);
- 5: if (strlen($www) < $limit)
- 6: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
- 7: ?>
- If we perform size analysis solely, after line 4, we do not know the length of $www
- If we perform string analysis solely, at line 5, we cannot check/enforce the branch condition.
52Composite Analysis
- We need a composite analysis that combines string analysis with size analysis.
- Challenge: How to transfer information between string automata and arithmetic automata?
- A string automaton is a single-track DFA that accepts a regular language, whose length forms a semi-linear set
- For example: {4, 6} ∪ {2 + 3k | k ≥ 0}
- The unary encoding of a semi-linear set is uniquely identified by a unary automaton
- The unary automaton can be constructed by replacing the alphabet of a string automaton with a unary alphabet
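Replacing the alphabet with a unary one amounts to ignoring the transition labels and tracking only how many steps have been taken. The Python sketch below computes the accepted lengths up to a bound for a small hand-built automaton whose length set is {4, 6} ∪ {2 + 3k | k ≥ 0}; the automaton and the function name `accepted_lengths` are assumptions chosen to match the example on the slide.

```python
# Transitions (src, symbol, dst) of an automaton that accepts
# "aaaa", "aaaaaa", and "bb" followed by any number of "ccc" blocks.
TRANSITIONS = [
    (0, "a", 1), (1, "a", 2), (2, "a", 3), (3, "a", 4),   # aaaa   -> state 4
    (4, "a", 9), (9, "a", 10),                            # aaaaaa -> state 10
    (0, "b", 5), (5, "b", 6),                             # bb     -> state 6
    (6, "c", 7), (7, "c", 8), (8, "c", 6),                # (ccc)* loop on state 6
]
START, ACCEPTING = 0, {4, 6, 10}


def accepted_lengths(transitions, start, accepting, bound):
    """Lengths n <= bound for which some string of length n is accepted.

    Symbols are ignored, which is the same as relabelling every transition
    with the single letter of a unary alphabet.
    """
    lengths, current = set(), {start}
    for n in range(bound + 1):
        if current & accepting:
            lengths.add(n)
        current = {q for (p, _symbol, q) in transitions if p in current}
    return lengths


if __name__ == "__main__":
    print(sorted(accepted_lengths(TRANSITIONS, START, ACCEPTING, 12)))
```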
53Arithmetic Automata
- An arithmetic automaton is a multi-track DFA, where each track represents the value of one variable over a binary alphabet
- If the language of an arithmetic automaton satisfies a Presburger formula, the value of each variable forms a semi-linear set
- The semi-linear set is accepted by the binary automaton that projects away all other tracks from the arithmetic automaton
54Connecting the Dots
- We developed novel algorithms to convert unary automata to binary automata and vice versa
- Using these conversion algorithms we can conduct a composite analysis that subsumes size analysis and string analysis
[Figure: String Automata, Unary Length Automata, Binary Length Automata, Arithmetic Automata]
55Stranger A String Analysis Tool
Stranger is available at www.cs.ucsb.edu/vlab/stranger
- Uses Pixy [Jovanovic et al., 2006] as a PHP front end
- Uses MONA [Klarlund and Møller, 2001] automata package for automata manipulation
[Figure: Stranger architecture. Inputs: PHP program, Attack patterns. Components: Pixy Front End (Parser, Dependency Analyzer), Symbolic String Analysis (String Analyzer), Automata Based String Manipulation Library (String/Automata Operations), MONA Automata Package. Intermediate data: CFG, Dependency Graphs, Stranger Automata, DFAs. Output: String Analysis Report (Vulnerability Signatures)]
56Case Study
- Schoolmate 1.5.4
- Number of PHP files: 63
- Lines of code: 8181
- Forward Analysis results
Time        Memory  Number of XSS sensitive sinks  Number of XSS Vulnerabilities
22 minutes  281 MB  898                            153
- After manual inspection we found the following
Actual Vulnerabilities  False Positives
105                     48
57Case Study False Positives
- Why false positives?
- Path insensitivity: 39
- Path to vulnerable program point is not feasible
- Un-modeled built in PHP functions: 6
- Unfound user written functions: 3
- PHP programs have more than one execution entry point
- We can remove all these false positives by extending our analysis to a path sensitive analysis and modeling more PHP functions
58Case Study - Sanitization
- We patched all actual vulnerabilities by adding sanitization routines
- We ran Stranger a second time
- Stranger proved that our patches are correct with respect to the attack pattern we are using
59Related Work String Analysis
- String analysis based on context free grammars [Christensen et al., SAS'03] [Minamide, WWW'05]
- Application of string analysis to web applications [Wassermann and Su, PLDI'07, ICSE'08] [Halfond and Orso, ASE'05, ICSE'06]
- Automata based string analysis [Xiang et al., COMPSAC'07] [Shannon et al., MUTATION'07]
- String analysis based on symbolic/concolic execution [Bjorner et al., TACAS'09]
- Bounded string analysis [Kiezun et al., ISSTA'09]
60Related Work
- Size Analysis
- Size analysis [Hughes et al., POPL'96] [Chin et al., ICSE'05] [Yu et al., FSE'07] [Yang et al., CAV'08]
- Composite analysis [Bultan et al., TOSEM'00] [Xu et al., ISSTA'08] [Gulwani et al., POPL'08] [Halbwachs et al., PLDI'08]
- Vulnerability Signature Generation
- Test input/Attack generation [Wassermann et al., ISSTA'08] [Kiezun et al., ICSE'09]
- Vulnerability signature generation [Brumley et al., SP'06] [Brumley et al., CSF'07] [Costa et al., SOSP'07]
61THE END