Title: CS 290C: Formal Models for Web Software Lectures 17: Analyzing Input Validation and Sanitization in Web Applications Instructor: Tevfik Bultan
1 CS 290C: Formal Models for Web Software
Lectures 17: Analyzing Input Validation and Sanitization in Web Applications
Instructor: Tevfik Bultan
2 Vulnerabilities in Web Applications
- There are many well-known security vulnerabilities that exist in many web applications. Here are some examples:
  - Malicious file execution: a malicious user causes the server to execute malicious code
  - SQL injection: a malicious user executes SQL commands on the back-end database by providing specially formatted input
  - Cross-site scripting (XSS): the attacker causes a malicious script to execute in a user's browser
- These vulnerabilities are typically due to
  - errors in user input validation or
  - lack of user input validation
3 String Related Vulnerabilities
- String-related web application vulnerabilities as a percentage of all vulnerabilities (reported by CVE)
- OWASP Top 10 in 2007
  - Cross Site Scripting
  - Injection Flaws
- OWASP Top 10 in 2010
  - Injection Flaws
  - Cross Site Scripting
4 Why Is Input Validation Error-prone?
- Extensive string manipulation
  - Web applications use extensive string manipulation
    - To construct HTML pages, to construct database queries in SQL, etc.
  - The user input comes in string form and must be validated and sanitized before it can be used
    - This requires the use of complex string manipulation functions such as string-replace
  - String manipulation is error prone
5 String Related Vulnerabilities
- String-related web application vulnerabilities occur when
  - a sensitive function is passed a malicious string input from the user
  - this input contains an attack
  - it is not properly sanitized before it reaches the sensitive function
- String analysis: discover these vulnerabilities automatically
6 XSS Vulnerability
- A PHP example:
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    5: ?>
- The echo statement in line 4 is a sensitive function
- It contains a cross-site scripting (XSS) vulnerability
- Example attack input: <script ...
7 Is It Vulnerable?
- A simple taint analysis can report this segment as vulnerable using taint propagation
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    5: ?>
- $www is tainted at line 2
- echo is tainted → the script is vulnerable
8 How to Fix It?
- To fix the vulnerability we added a sanitization routine at line s
- Taint analysis will assume that $www is untainted and report that the segment is NOT vulnerable
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
    4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    5: ?>
- $www is tainted at line 2 and untainted after line s
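The taint reasoning on the last two slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis used in the lecture: the tuple-based statement format, the `analyze` function, and the `SANITIZERS` set are all hypothetical names introduced here.

```python
# Toy taint propagation over a hypothetical straight-line IR.
# Each statement is (label, target, op, operands).
SANITIZERS = {"ereg_replace"}  # assumed (naively) to return untainted data

def analyze(program):
    """Return the labels of echo statements that receive tainted data."""
    tainted = set()
    reports = []
    for label, target, op, operands in program:
        if op == "source":                      # e.g. a read from $_GET
            result_tainted = True
        elif op in SANITIZERS:                  # the analysis trusts the sanitizer
            result_tainted = False
        else:                                   # const, concat, echo: propagate
            result_tainted = any(v in tainted for v in operands)
        if op == "echo":                        # sensitive sink
            if result_tainted:
                reports.append(label)
        elif result_tainted:
            tainted.add(target)
        else:
            tainted.discard(target)
    return reports

UNSANITIZED = [
    ("2", "$www", "source", []),
    ("3", "$l_otherinfo", "const", []),
    ("4", None, "echo", ["$l_otherinfo", "$www"]),
]
# Same program with the sanitization statement "s" inserted before the echo.
SANITIZED = UNSANITIZED[:2] + [("s", "$www", "ereg_replace", ["$www"])] + UNSANITIZED[2:]
```

The sketch flags line 4 in the first program and nothing in the second, because it never looks at what the sanitizer actually does to the string.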
9 Is It Really Sanitized?
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
    4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    5: ?>
- Input: <script ...>
- After the sanitization at line s: <script ...> (unchanged)
10 Sanitization Routines Can Be Erroneous
- The sanitization statement is not correct!
  - ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
  - Removes all characters that are not in [A-Za-z0-9 .-@:/]
  - .-@ denotes all characters between . and @ (including < and >)
  - .-@ should be .\-@
- This example is from a buggy sanitization routine used in MyEasyMarket-4.1 (line 218 in file trans.php)
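The range bug is easy to reproduce. The sketch below uses Python's `re` module as a stand-in for PHP's `ereg_replace` (an assumption made for illustration; bracket-expression ranges behave the same way here). The names `BUGGY`, `FIXED`, and `sanitize` are introduced only for this example.

```python
import re

# ".-@" inside a character class is a range from '.' (0x2E) to '@' (0x40).
BUGGY = re.compile(r"[^A-Za-z0-9 .-@://]")
# Escaping the hyphen makes it a literal '-', so only '.', '-', '@', ':', '/' are kept.
FIXED = re.compile(r"[^A-Za-z0-9 .\-@://]")

def sanitize(pattern, text):
    """Delete every character matched by the (negated) character class."""
    return pattern.sub("", text)

# Characters that the unintended range lets through.
leaked = [chr(c) for c in range(ord("."), ord("@") + 1)]
```

With the buggy class, `sanitize(BUGGY, "<script>")` returns the input unchanged, while the fixed class strips the angle brackets.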
11 String Analysis
- String analysis determines all possible values that a string expression can take during any program execution
- Using string analysis we can identify all possible input values of the sensitive functions
  - Then we can check if inputs of sensitive functions can contain attack strings
- How can we characterize attack strings?
  - Use regular expressions to specify the attack patterns
  - Attack pattern for XSS: Σ*<scriptΣ*
12 Vulnerabilities Can Be Tricky
- Input <!scrip!t ...> does not match the attack pattern
  - but it matches the vulnerability signature and it can cause an attack
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    s: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
    4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    5: ?>
- Input: <!scrip!t ...>
- After the sanitization at line s: <script ...> (each ! is deleted)
13 String Analysis
- If string analysis determines that the intersection of the attack pattern and possible inputs of the sensitive function is empty
  - then we can conclude that the program is secure
- If the intersection is not empty, then we can again use string analysis to generate a vulnerability signature
  - it characterizes all malicious inputs
- Given Σ*<scriptΣ* as an attack pattern
  - The vulnerability signature for $_GET["www"] is
  - Σ*<α*sα*cα*rα*iα*pα*tΣ*
  - where α ∉ [A-Za-z0-9 .-@/]
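The relationship between the attack pattern and the vulnerability signature can be checked on concrete strings. The following is a sketch only: it writes both as Python regular expressions (an assumption; the lecture computes them as automata), and the names `ALPHA`, `ATTACK`, `SIGNATURE`, and `sanitize` are hypothetical.

```python
import re

KEEP = r"A-Za-z0-9 .-@/"          # characters the buggy sanitizer keeps
ALPHA = f"[^{KEEP}]"              # α: any character the sanitizer deletes

# Attack pattern Σ*<scriptΣ*, expressed as an unanchored search.
ATTACK = re.compile(r"<script", re.S)
# Signature Σ*<α*sα*cα*rα*iα*pα*tΣ*, also as an unanchored search.
SIGNATURE = re.compile("<" + "".join(f"{ALPHA}*{ch}" for ch in "script"), re.S)

def sanitize(text):
    """Delete every character outside the kept set."""
    return re.sub(ALPHA, "", text)

def is_malicious(user_input):
    """True if the sanitized input reaches the sink matching the attack pattern."""
    return ATTACK.search(sanitize(user_input)) is not None
```

For example, `<!scrip!t src=x>` does not match `ATTACK` directly, but it matches `SIGNATURE`, and its sanitized form `<script src=x>` matches `ATTACK`.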
14 Automata-based String Analysis
- Finite state automata can be used to characterize sets of string values
- We use automata-based string analysis
  - Associate each string expression in the program with an automaton
  - The automaton accepts an over-approximation of all possible values that the string expression can take during program execution
- Using this automata representation, symbolically execute the program, paying attention only to string manipulation operations
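15 Input Validation Verification Stages
Figure: the verification pipeline.
- Application/Scripts → Parser/Taint Analysis → (Tainted) Dependency Graphs
- (Tainted) Dependency Graphs + Attack Patterns → Vulnerability Analysis → Reachable Attack Strings
- Reachable Attack Strings → Signature Generation → Vulnerability Signature
- Vulnerability Signature → Patch Synthesis → Sanitization Statements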
16 Combining Forward and Backward Analyses
- Convert PHP programs to dependency graphs
- Combine symbolic forward and backward reachability analyses
- Forward analysis
  - Assume that the user input can be any string
  - Propagate this information on the dependency graph
  - When a sensitive function is reached, intersect with the attack pattern
- Backward analysis
  - If the intersection is not empty, propagate the result backwards to identify which inputs can cause an attack
17 Dependency Graphs
- Given a PHP program, first construct the dependency graph
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    4: $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
    5: echo $l_otherinfo . ": " . $www;
    6: ?>
Figure: Dependency Graph. Each node is labeled with an expression and its line number:
- $_GET[www], 2
- [^A-Za-z0-9 .-@://], 4
- "", 4
- $www, 2
- "URL", 3
- ": ", 5
- $l_otherinfo, 3
- preg_replace, 4
- str_concat, 5
- $www, 4
- str_concat, 5
- echo, 5
18 Forward Analysis
- Using the dependency graph we conduct vulnerability analysis
- Automata-based forward symbolic analysis that identifies the possible values of each node
  - Each node in the dependency graph is associated with a DFA
    - The DFA accepts an over-approximation of the string values that the string expression represented by that node can take at runtime
    - The DFAs for the input nodes accept Σ*
  - Intersecting the DFA for the sink nodes with the DFA for the attack pattern identifies the vulnerabilities
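The final intersection step can be illustrated with a product construction over explicit DFAs. This is a minimal sketch, assuming a four-character alphabet and a dictionary encoding of DFAs invented for this example; the lecture's analysis uses a symbolic automata representation instead.

```python
from collections import deque

ALPHABET = "ab<:"  # tiny illustrative alphabet (an assumption)

def make_dfa(start, accepting, delta):
    """delta maps (state, char) -> state; a missing entry means 'reject'."""
    return {"start": start, "accepting": set(accepting), "delta": delta}

def intersect_witness(d1, d2, alphabet=ALPHABET):
    """BFS over the product automaton; return a shortest common string or None."""
    start = (d1["start"], d2["start"])
    queue, seen = deque([(start, "")]), {start}
    while queue:
        (s1, s2), word = queue.popleft()
        if s1 in d1["accepting"] and s2 in d2["accepting"]:
            return word
        for ch in alphabet:
            t1 = d1["delta"].get((s1, ch))
            t2 = d2["delta"].get((s2, ch))
            if t1 is None or t2 is None:
                continue
            if (t1, t2) not in seen:
                seen.add((t1, t2))
                queue.append(((t1, t2), word + ch))
    return None

# Attack pattern Σ*<Σ*: state 1 is reached once a '<' has been read.
attack = make_dfa(0, {1}, {**{(0, c): (1 if c == "<" else 0) for c in ALPHABET},
                           **{(1, c): 1 for c in ALPHABET}})
# Sink language when the sanitizer lets '<' through (every character allowed).
buggy_sink = make_dfa(0, {0}, {(0, c): 0 for c in ALPHABET})
# Sink language when the sanitizer removes '<'.
fixed_sink = make_dfa(0, {0}, {(0, c): 0 for c in ALPHABET if c != "<"})
```

A non-`None` result is a concrete attack string that can reach the sink; `None` means the intersection is empty for these two automata.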
19 Forward Analysis
- Forward analysis uses post-image computations of string operations
  - postConcat(M1, M2)
    - returns M, where M = M1.M2
  - postReplace(M1, M2, M3)
    - returns M, where M = replace(M1, M2, M3)
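To make the post-image operations concrete, here is a sketch over finite sets of strings rather than automata. That substitution is an assumption made for brevity, and `post_replace` is simplified so that the match pattern is a set of single characters and the replacement is one fixed string.

```python
def post_concat(m1, m2):
    """All concatenations w1.w2 with w1 in m1 and w2 in m2."""
    return {w1 + w2 for w1 in m1 for w2 in m2}

def post_replace(m1, match_chars, replacement):
    """Replace every character in match_chars by `replacement` in each word of m1."""
    return {
        "".join(replacement if ch in match_chars else ch for ch in word)
        for word in m1
    }

# Mirrors the running example: delete '!' and then prepend the label.
sanitized = post_replace({"<!scrip!t", "abc"}, {"!"}, "")
at_sink = post_concat({"URL: "}, sanitized)
```

Here `at_sink` contains `URL: <script`, which is the kind of value the forward analysis intersects with the attack pattern.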
20 Forward Analysis
Figure: forward analysis on the dependency graph, with attack pattern Σ*<Σ*. Forward value of each node:
- $_GET[www], 2: Forward = Σ*
- "", 4: Forward = ε
- [^A-Za-z0-9 .-@://], 4: Forward = [^A-Za-z0-9 .-@/]
- $www, 2: Forward = Σ*
- "URL", 3: Forward = URL
- ": ", 5: Forward = ": "
- preg_replace, 4: Forward = [A-Za-z0-9 .-@/]*
- $l_otherinfo, 3: Forward = URL
- str_concat, 5 (first): Forward = "URL: "
- $www, 4: Forward = [A-Za-z0-9 .-@/]*
- str_concat, 5 (second): Forward = URL: [A-Za-z0-9 .-@/]*
- echo, 5: Forward = URL: [A-Za-z0-9 .-@/]*
- At the sink: L(Σ*<Σ*) ∩ L(URL: [A-Za-z0-9 .-@/]*) = L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*) ≠ Ø
21 Result Automaton
Figure: the DFA accepting URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*. Its edges are labeled U, R, L, :, Space, [A-Za-z0-9 .-;=-@/], <, and [A-Za-z0-9 .-@/].
22 Symbolic Automata Representation
- Compact Representation
- Canonical form and
- Shared BDD nodes
- Efficient MBDD Manipulations
- Union, Intersection, and Emptiness Checking
- Projection and Minimization
23 Symbolic Automata Representation
Figure: a symbolic DFA representation shown next to the explicit DFA representation.
24 Widening
- The string verification problem is undecidable
- The forward fixpoint computation is not guaranteed to converge in the presence of loops and recursion
- We want to compute a sound approximation
  - During the fixpoint computation we compute an over-approximation of the least fixpoint that corresponds to the reachable states
- We use an automata-based widening operation to over-approximate the fixpoint
  - The widening operation over-approximates the union operations and accelerates the convergence of the fixpoint computation
25 Widening
- Given a loop such as
    1: <?php
    2: $var = "head";
    3: while (. . .) {
    4:   $var = $var . "tail";
    5: }
    6: echo $var;
    7: ?>
- Our forward analysis with widening would compute that the value of the variable $var in line 6 is (head)(tail)*
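Without widening, the forward iterates for this loop keep growing (head, headtail, headtailtail, ...) and never stabilize. The sketch below does not implement the automata-based widening operator; it only checks, for a bounded number of unrollings, that every concrete value lies in the widened result. The names `INVARIANT` and `loop_values` are introduced for illustration.

```python
import re

# The language the widened analysis reports for $var at line 6.
INVARIANT = re.compile(r"(head)(tail)*")

def loop_values(max_iterations):
    """Concrete values of $var after 0..max_iterations loop iterations."""
    var = "head"
    values = [var]
    for _ in range(max_iterations):
        var = var + "tail"
        values.append(var)
    return values

# Every concrete value is covered by the over-approximation.
covered = all(INVARIANT.fullmatch(v) for v in loop_values(5))
```

The check succeeding for any bound is what soundness of the over-approximation means for this loop.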
26 Backward Analysis
- A vulnerability signature is a characterization of all malicious inputs that can be used to generate attack strings
- We identify vulnerability signatures using an automata-based backward symbolic analysis starting from the sink node
- Pre-image computations on string operations
  - preConcatPrefix(M, M2)
    - returns M1, where M = M1.M2
  - preConcatSuffix(M, M1)
    - returns M2, where M = M1.M2
  - preReplace(M, M2, M3)
    - returns M1, where M = replace(M1, M2, M3)
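The pre-image operations can be illustrated by brute-force enumeration over a bounded universe of strings. This is only a sketch under simplifying assumptions: languages are finite sets, candidates are limited to short strings over a small alphabet, and `pre_replace` handles only deletion of single characters. All function names are hypothetical.

```python
from itertools import product

def words(alphabet, max_len):
    """All strings over `alphabet` of length at most max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def pre_concat_prefix(m, m2, universe):
    """Prefixes w1 such that w1.w2 is in m for some w2 in m2."""
    return {w1 for w1 in universe if any(w1 + w2 in m for w2 in m2)}

def pre_concat_suffix(m, m1, universe):
    """Suffixes w2 such that w1.w2 is in m for some w1 in m1."""
    return {w2 for w2 in universe if any(w1 + w2 in m for w1 in m1)}

def pre_replace(m, deleted_chars, universe):
    """Inputs whose image under 'delete deleted_chars' lies in m."""
    return {
        w for w in universe
        if "".join(c for c in w if c not in deleted_chars) in m
    }

UNIVERSE = set(words("a<!", 2))
```

For instance, the strings that become `<` once every `!` is deleted are `<`, `!<`, and `<!`, which is how the backward analysis recovers inputs such as `<!scrip!t`.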
27 Backward Analysis
Figure: backward analysis on the dependency graph. Forward and backward value of each node (the figure also numbers several nodes: node 3, node 6, node 10, node 11, node 12):
- $_GET[www], 2: Forward = Σ*, Backward = [^<]*<Σ*
- [^A-Za-z0-9 .-@://], 4: Forward = [^A-Za-z0-9 .-@/], Backward = Do not care
- "", 4: Forward = ε, Backward = Do not care
- $www, 2: Forward = Σ*, Backward = [^<]*<Σ*
- "URL", 3: Forward = URL, Backward = Do not care
- ": ", 5: Forward = ": ", Backward = Do not care
- preg_replace, 4: Forward = [A-Za-z0-9 .-@/]*, Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- $l_otherinfo, 3: Forward = URL, Backward = Do not care
- str_concat, 5 (first): Forward = "URL: ", Backward = Do not care
- $www, 4: Forward = [A-Za-z0-9 .-@/]*, Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- str_concat, 5 (second): Forward = URL: [A-Za-z0-9 .-@/]*, Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- echo, 5: Forward = URL: [A-Za-z0-9 .-@/]*, Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
- Vulnerability Signature = [^<]*<Σ*
28 Vulnerability Signature Automaton
Figure: the vulnerability signature automaton, accepting [^<]*<Σ*. Edge labels in the figure: Σ, <, [^<], Non-ASCII.
29 Vulnerability Signatures
- The vulnerability signature is the result of the input node, which includes all possible malicious inputs
- An input that does not match this signature cannot exploit the vulnerability
- After generating the vulnerability signature
  - Can we generate a patch based on the vulnerability signature?
- The vulnerability signature automaton for the running example
Figure: the automaton accepting [^<]*<Σ* (edge labels: [^<], <, Σ).
30 Patches from Vulnerability Signatures
- Main idea:
  - Given a vulnerability signature automaton, find a cut that separates initial and accepting states
  - Remove the characters in the cut from the user input to sanitize
- This means that if we just delete < from the user input, then the vulnerability can be removed
Figure: the automaton accepting [^<]*<Σ* (edge labels: [^<], <, Σ); the min-cut is {<}.
31 Patches from Vulnerability Signatures
- Ideally, we want to modify the input (as little as possible) so that it does not match the vulnerability signature
- Given a DFA, an alphabet cut is
  - a set of characters such that, after removing the edges that are associated with the characters in the set, the modified DFA does not accept any non-empty string
- Finding a minimal alphabet cut of a DFA is an NP-hard problem (one can reduce the vertex cover problem to this problem)
  - We can use a min-cut algorithm instead
  - The set of characters that are associated with the edges of the min cut is an alphabet cut
    - but not necessarily the minimum alphabet cut
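The definition of an alphabet cut can be made concrete with an exhaustive search. Note that this sketch is deliberately not the min-cut heuristic mentioned above: it tries subsets of the alphabet in increasing size, which is exponential in general (consistent with the NP-hardness remark) and only practical for tiny examples. The edge-list encoding and function names are assumptions.

```python
from itertools import combinations

def accepts_nonempty(edges, start, accepting, removed):
    """True if some non-empty string is still accepted once all edges
    labeled with a character in `removed` are deleted."""
    frontier = [dst for (src, ch, dst) in edges
                if src == start and ch not in removed]
    seen = set(frontier)
    while frontier:
        state = frontier.pop()
        if state in accepting:
            return True
        for src, ch, dst in edges:
            if src == state and ch not in removed and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return False

def minimum_alphabet_cut(edges, start, accepting, alphabet):
    """Smallest character set whose removal leaves no accepted non-empty string."""
    for size in range(len(alphabet) + 1):
        for cut in combinations(sorted(alphabet), size):
            if not accepts_nonempty(edges, start, accepting, set(cut)):
                return set(cut)

# Signature automaton for [^<]*<Σ* over the illustrative alphabet {a, b, <}.
SIGNATURE_EDGES = [(0, "a", 0), (0, "b", 0), (0, "<", 1),
                   (1, "a", 1), (1, "b", 1), (1, "<", 1)]
```

On the running example's signature automaton the search returns the single character `<`, matching the cut shown on the previous slide.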
32 Automatically Generated Patch
- The automatically generated patch will make sure that no string that matches the attack pattern reaches the sensitive function
    <?php
    // Generated patch: if the input matches the signature [^<]*<.*, delete the alphabet cut {<}
    if (preg_match('/[^<]*<.*/', $_GET["www"]))
        $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
    $www = $_GET["www"];
    $l_otherinfo = "URL";
    $www = ereg_replace("[^A-Za-z0-9 .-@://]", "", $www);
    echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    ?>
33 Experiments
- Application of this approach to five vulnerable input sanitization routines from three open source web applications
  - (1) MyEasyMarket-4.1: a shopping cart program
  - (2) BloggIT-1.0: a blog engine
  - (3) proManager-0.72: a project management system
- We used the following XSS attack pattern
  - Σ*<scriptΣ*
34 Forward Analysis Results
- The dependency graphs of these benchmarks are simplified based on the sinks
  - Unrelated parts are removed using slicing

    Input                            Results
    nodes  edges  sinks  inputs      Time (s)  Mem (kb)  states/bdds
    21     20     1      1           0.08      2599      23/219
    29     29     1      1           0.53      13633     48/495
    25     25     1      2           0.12      1955      125/1200
    23     22     1      1           0.12      4022      133/1222
    25     25     1      1           0.12      3387      125/1200
35 Backward Analysis Results
- We use the backward analysis to generate the vulnerability signatures
  - Backward analysis starts from the vulnerable sinks identified during forward analysis

    Input                            Results
    nodes  edges  sinks  inputs      Time (s)  Mem (kb)  states/bdds
    21     20     1      1           0.46      2963      9/199
    29     29     1      1           41.03     1859767   811/8389
    25     25     1      2           2.35      5673      20/302, 20/302
    23     22     1      1           2.33      32035     91/1127
    25     25     1      1           5.02      14958     20/302
36 Alphabet Cuts
- We generate alphabet cuts from the vulnerability signatures using a min-cut algorithm
- Problem: when there are two user inputs, the patch will block everything and delete everything
  - It overlooks the relations among input variables (e.g., the concatenation of two inputs contains <SCRIPT)

    Input                            Results
    nodes  edges  sinks  inputs      Alphabet cut
    21     20     1      1           <
    29     29     1      1           S, ,
    25     25     1      2           Σ, Σ   (vulnerability signature depends on two inputs)
    23     22     1      1           <, ,
    25     25     1      1           <, ,
37 Relational String Analysis
- Instead of using multiple single-track DFAs, use one multi-track DFA
  - Each track represents the values of one string variable
- Using multi-track DFAs
  - Identifies the relations among string variables
  - Generates relational vulnerability signatures for multiple user inputs of a vulnerable application
  - Improves the precision of the path-sensitive analysis
  - Proves properties that depend on relations among string variables, e.g., $file = $usr.txt
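38 Multi-track Automata
- Let X (the first track) and Y (the second track) be two string variables
- λ is a padding symbol
- A multi-track automaton that encodes X = Y.txt
Figure: the automaton loops on (a,a), (b,b), ... and then reads the suffix on the first track against padding on the second; the surviving edge labels are (t,λ), (x,λ), (t,λ).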
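A two-track automaton for X = Y.txt can be simulated directly on a pair of strings. The sketch below is an illustration under assumptions: the second track is padded on the right with λ, state 0 is the copying loop, and states 1 to 4 count how many suffix characters have been read. The function name `accepts` and the state numbering are invented for this example.

```python
PAD = "λ"          # padding symbol, assumed not to occur in the inputs
SUFFIX = ".txt"

def accepts(x, y):
    """Run the aligned two-track DFA for X = Y.txt on the pair (x, y)."""
    if len(y) > len(x):
        return False                       # only the Y track is ever padded here
    padded_y = y + PAD * (len(x) - len(y))
    state = 0                              # 0: copying; 1..4: suffix chars matched
    for cx, cy in zip(x, padded_y):
        if state == 0 and cy != PAD:
            if cx != cy:                   # edges (a,a), (b,b), ...
                return False
        elif cy == PAD and state < len(SUFFIX) and cx == SUFFIX[state]:
            state += 1                     # edges (.,λ), (t,λ), (x,λ), (t,λ)
        else:
            return False
    return state == len(SUFFIX)
```

For example, `accepts("ab.txt", "ab")` holds, while `accepts("ab.txt", "ba")` does not, because the copying loop requires both tracks to agree.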
39 Relational Vulnerability Signature
- We perform forward analysis using multi-track automata to generate relational vulnerability signatures
- Each track represents one user input
  - An auxiliary track represents the values of the current node
  - We intersect the auxiliary track with the attack pattern upon termination
40 Relational Vulnerability Signature
- Consider a simple example having multiple user inputs
    <?php
    1: $www = $_GET["www"];
    2: $url = $_GET["url"];
    3: echo $url . $www;
    ?>
- Let the attack pattern be Σ*<Σ*
41 Relational Vulnerability Signature
- A multi-track automaton over (url, www, aux)
- Identifies the fact that the concatenation of two inputs contains <
Figure: the three-track automaton. Edges that read the url track are labeled (a,λ,a), (b,λ,b), ... and (<,λ,<); edges that read the www track are labeled (λ,a,a), (λ,b,b), ... and (λ,<,<).
42 Relational Vulnerability Signature
- Project away the auxiliary variable
- Find the min-cut
- This min-cut identifies the alphabet cuts {<} for the first track (url) and {<} for the second track (www)
Figure: the two-track automaton after projection. Edges that read the url track are labeled (a,λ), (b,λ), ... and (<,λ); edges that read the www track are labeled (λ,a), (λ,b), ... and (λ,<). The min-cut is {<}, {<}.
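43 Patch for Multiple Inputs
- Patch: if the inputs match the signature, delete its alphabet cut
    <?php
    // Generated patch: test the concatenation of both inputs against the signature
    if (preg_match('/[^<]*<.*/', $_GET["url"] . $_GET["www"])) {
        $_GET["url"] = preg_replace('/</', "", $_GET["url"]);
        $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
    }
    1: $www = $_GET["www"];
    2: $url = $_GET["url"];
    3: echo $url . $www;
    ?>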
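The behavior of the two-input patch can be mirrored in Python to check it against the attack pattern. This is a sketch for experimentation only: Python's `re` stands in for PHP's `preg_match` and `preg_replace`, and `patch_inputs` and `sink_output` are hypothetical helper names.

```python
import re

# Signature over the concatenation url.www, and the attack pattern Σ*<Σ*.
SIGNATURE = re.compile(r"[^<]*<.*", re.S)
ATTACK = re.compile(r"<")

def patch_inputs(url, www):
    """If url.www matches the signature, delete the alphabet cut {<} from both."""
    if SIGNATURE.match(url + www):
        url = url.replace("<", "")
        www = www.replace("<", "")
    return url, www

def sink_output(url, www):
    """What the echo at line 3 receives after the patch has run."""
    url, www = patch_inputs(url, www)
    return url + www
```

Inputs that do not match the signature pass through untouched, so benign requests are not modified by the patch.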
44 Technical Issues
- To conduct relational string analysis, we need to compute the intersection of multi-track automata
  - Aligned multi-track automata are closed under intersection
    - λs are right justified in all tracks, e.g., abλλ instead of aλbλ
  - However, there exist unaligned multi-track automata that are not describable by aligned ones
  - We propose an alignment algorithm that constructs aligned automata which over- or under-approximate unaligned ones
45 Other Technical Issues
- Modeling word equations
  - Intractability of X = cZ
    - The number of states of the corresponding aligned multi-track DFA is exponential in the length of c
  - Irregularity of X = YZ
    - X = YZ is not describable by an aligned multi-track automaton
- Use a conservative analysis
  - Construct multi-track automata that over- or under-approximate the word equations
46 Composite Analysis
- What I have talked about so far focuses only on string contents
  - It does not handle constraints on string lengths
  - It cannot handle comparisons among integer variables and string lengths
- String analysis techniques can be extended to analyze systems that have unbounded string and integer variables
- Need to use a composite static analysis approach that combines string analysis and size analysis
47 Size Analysis
- Size analysis: the goal of size analysis is to provide properties about string lengths
  - It can be used to discover buffer overflow vulnerabilities
- Integer analysis: at each program point, statically compute the possible states of the values of all integer variables
  - These infinite states are symbolically over-approximated as linear arithmetic constraints that can be represented as an arithmetic automaton
- Integer analysis can be used to perform size analysis by representing lengths of string variables as integer variables
48 An Example
- Consider the following segment
    1: <?php
    2: $www = $_GET["www"];
    3: $l_otherinfo = "URL";
    4: $www = ereg_replace("[^A-Za-z0-9 ./-@://]", "", $www);
    5: if (strlen($www) < $limit)
    6:   echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
    7: ?>
- If we perform size analysis solely, after line 4, we do not know the length of $www
- If we perform string analysis solely, at line 5, we cannot check/enforce the branch condition
49 Composite Analysis
- We need a composite analysis that combines string analysis with size analysis
- Challenge: how to transfer information between string automata and arithmetic automata?
- A string automaton is a single-track DFA that accepts a regular language; the lengths of the strings in that language form a semi-linear set
  - For example: {4, 6} ∪ {2 + 3k | k ≥ 0}
- The unary encoding of a semi-linear set is uniquely identified by a unary automaton
- The unary automaton can be constructed by replacing the alphabet of a string automaton with a unary alphabet
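Replacing the alphabet with a single symbol amounts to ignoring edge labels and tracking only which states are reachable after n steps. The sketch below computes the accepted lengths up to a bound for an example DFA; the language ab(cde)*, the dictionary encoding, and the function name `accepted_lengths` are assumptions chosen for illustration.

```python
def accepted_lengths(delta, start, accepting, bound):
    """Lengths n <= bound for which the automaton accepts some string of length n.

    delta maps (state, char) -> state. Ignoring the characters turns the DFA
    into a unary (possibly nondeterministic) automaton, simulated here with
    sets of states.
    """
    current = {start}
    lengths = []
    for n in range(bound + 1):
        if current & accepting:
            lengths.append(n)
        current = {dst for (src, _ch), dst in delta.items() if src in current}
    return lengths

# DFA for ab(cde)*: lengths form the semi-linear set {2 + 3k | k >= 0}.
DELTA = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "d"): 4, (4, "e"): 2}
LENGTHS = accepted_lengths(DELTA, 0, {2}, 12)
```

For this automaton the accepted lengths up to 12 are 2, 5, 8, and 11, which is the arithmetic progression the unary automaton encodes.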
50 Arithmetic Automata
- An arithmetic automaton is a multi-track DFA, where each track represents the value of one variable over a binary alphabet
- If the language of an arithmetic automaton satisfies a Presburger formula, the value of each variable forms a semi-linear set
- The semi-linear set is accepted by the binary automaton that projects away all other tracks from the arithmetic automaton
51 Connecting the Dots
- There are algorithms to convert unary automata to binary automata and vice versa
- Using these conversion algorithms we can conduct a composite analysis that subsumes size analysis and string analysis
Figure: String Automata → Unary Length Automata → Binary Length Automata → Arithmetic Automata
52 Case Study
- Schoolmate 1.5.4
  - Number of PHP files: 63
  - Lines of code: 8181
- Forward analysis results

    Time        Memory  XSS sensitive sinks  XSS vulnerabilities  Actual vulnerabilities  False positives
    22 minutes  281 MB  898                  153                  105                     48
53 Case Study: False Positives
- Why false positives?
  - Path insensitivity: 39
    - Path to vulnerable program point is not feasible
  - Un-modeled built-in PHP functions: 6
  - Unfound user-written functions: 3
    - PHP programs have more than one execution entry point
- We can remove all these false positives by extending the analysis to a path-sensitive analysis and modeling more PHP functions
54 Case Study: Sanitization
- After patching all actual vulnerabilities by adding automated sanitization routines we can run string analysis again
- When string analysis is used on the automatically generated patches, it shows that the patches are correct with respect to the attack pattern
55 String Analysis
- String analysis based on context free grammars: [Christensen et al., SAS'03; Minamide, WWW'05]
- String analysis based on symbolic/concolic execution: [Bjorner et al., TACAS'09; Saxena et al., S&P'10]
- Bounded string analysis: [Kiezun et al., ISSTA'09]
- Automata-based string analysis: [Xiang et al., COMPSAC'07; Shannon et al., MUTATION'07; Balzarotti et al., S&P'08; Yu et al., SPIN'08, CIAA'10; Hooimeijer et al., Usenix'11]
- Application of string analysis to web applications: [Wassermann and Su, PLDI'07, ICSE'08; Halfond and Orso, ASE'05, ICSE'06]
- String analysis for JavaScript: [Saxena et al., S&P'10; Alkhalaf et al., ICSE'12, ISSTA'12]
56 String Analysis
- Size Analysis
  - Size analysis: [Hughes et al., POPL'96; Chin et al., ICSE'05; Yu et al., FSE'07; Yang et al., CAV'08]
  - Composite analysis: [Bultan et al., TOSEM'00; Xu et al., ISSTA'08; Gulwani et al., POPL'08; Halbwachs et al., PLDI'08; Yu et al., TACAS'09]
- Vulnerability Signature Generation
  - Test input/attack generation: [Wassermann et al., ISSTA'08; Kiezun et al., ICSE'09]
  - Vulnerability signature generation: [Brumley et al., S&P'06; Brumley et al., CSF'07; Costa et al., SOSP'07]
  - Vulnerability signature generation and patch generation: [Yu et al., ASE'09, ICSE'11]