Title: CS 656 Paper Presentation
1 CS 656 Paper Presentation
- Inferring Sensitive Information from Anonymized Network Traces
Presented by Pardeepinder Singh and Shashank Gupta
2 Why anonymize network data?
- Networking research requires trace and log data to be publicly available for verification and comparison of results.
- Publicly available network data is vulnerable to attacks which compromise the privacy of end users and the security of networks.
- Anonymization is required to share network data.
3 Goals of network trace anonymization?
- The goals of anonymization are to prevent:
  - Reconstruction of user audit trails
  - Mapping of supported network services
  - Leakage of the security practices of the network
- Provide a balance between data utility and data privacy.
4 How to anonymize network data?
- Destruction: outright removal of the field from the dataset.
- Fixed transformation: a single pseudonym value for all values of the field.
- Variable transformation: different pseudonym values based on the context of the field.
- Typed transformation: a single pseudonym value for each distinct value of the original field.
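The four transformation classes above can be illustrated with a minimal sketch. All field values and pseudonym formats here are hypothetical, chosen only to show the difference between the classes; they are not from the paper:

```python
# Toy illustration of the four anonymization transformation classes.
# All values and pseudonym formats below are made up for illustration.

def destruction(value):
    """Outright removal: the field is dropped from the dataset."""
    return None

def fixed_transformation(value):
    """A single pseudonym for ALL values of the field."""
    return "0.0.0.0"

def typed_transformation(value, mapping={}):
    """One pseudonym per distinct original value (consistent mapping,
    so equal inputs always yield equal pseudonyms)."""
    if value not in mapping:
        mapping[value] = "host-%d" % len(mapping)
    return mapping[value]

def variable_transformation(value, context):
    """The pseudonym depends on the context in which the field appears,
    so the same value may map to different pseudonyms (illustrative)."""
    return "host-%d" % (hash((value, context)) % 1000)

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]
print([typed_transformation(ip) for ip in ips])  # same IP -> same pseudonym
print([fixed_transformation(ip) for ip in ips])  # every IP -> one pseudonym
```

Note that only the typed transformation preserves the distinctness of values, which is exactly the property (pseudonym consistency) the attacks in this paper exploit.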
5 Anonymized Network Data Requirements
- Pseudonym consistency requirement: the addresses provided within the trace data must be consistently anonymized.
- Header requirement: header information remains intact.
- Transport protocol requirement: records corresponding to transport-layer traffic must be present.
- Port number assumption: the ability to map port numbers to services.
6 Summary of Anonymization Techniques
Source: S. E. Coull et al., "Playing Devil's Advocate: Inferring Sensitive Information from Anonymized Network Traces," Proceedings of the 2007 Network and Distributed System Security Symposium, San Diego, California, February 2007.
7 Primitives
- Heavy hitters
- Dominant state analysis
- Subnet clustering
- Each connection c ∈ C is described by a feature vector ⟨c1, c2, ..., ck⟩. In the paper k = 4 and the features are c1 = source IP address, c2 = destination IP address, c3 = source port number, and c4 = destination port number.
8 Heavy Hitters
Normalized entropy: zero indicates a highly peaked distribution for that attribute; one indicates a uniform distribution.
9 Finding Heavy-Hitters
- Iteratively remove very frequent values from the distribution.
- Recompute the normalized entropy until the distribution becomes sufficiently uniform, bringing the normalized entropy above a given threshold tH.
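The loop above can be sketched in a few lines. This is a minimal reading of the algorithm, not the authors' implementation; the threshold value 0.9 is purely illustrative:

```python
import math
from collections import Counter

def normalized_entropy(counts):
    """H(X) / log2(n): 0 = highly peaked distribution, 1 = uniform."""
    total = sum(counts.values())
    n = len(counts)
    if n <= 1:
        return 1.0  # a single remaining value is trivially "uniform"
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(n)

def heavy_hitters(values, t_h=0.9):
    """Iteratively peel off the most frequent value until the remaining
    distribution is sufficiently uniform (normalized entropy above t_h)."""
    counts = Counter(values)
    hitters = []
    while len(counts) > 1 and normalized_entropy(counts) < t_h:
        top, _ = counts.most_common(1)[0]
        hitters.append(top)
        del counts[top]
    return hitters

# One address dominates the trace; the rest are roughly uniform.
traffic = ["10.0.0.1"] * 500 + ["10.0.0.%d" % i for i in range(2, 52)] * 10
print(heavy_hitters(traffic))  # -> ['10.0.0.1']
```

Removing the single dominant address is enough to push the normalized entropy of the remainder to 1.0, so the loop stops after one iteration.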
10 Finding Heavy-Hitters
Source: S. E. Coull et al., NDSS 2007 (see slide 6).
11 Dominant State Analysis
- Used to determine the characteristic behaviors of a given host.
- Applied to the heavy hitters to create a behavioral profile for each.
- Steps:
  - For each heavy-hitter address x, begin with a simple behavioral profile ⟨src address = x⟩.
  - c1 is the source IP address of connection c. Denote the set of connections with c1 = x as Cx.
12 Dominant State Analysis
- For each heavy-hitter address x, begin with a simple behavioral profile ⟨src address = x⟩.
- c1 is the source IP address of connection c. Denote the set of connections with c1 = x as Cx.
- Reorder the remaining features in increasing order of normalized entropy.
- For each feature ci (i = 2..4) in the set of connections Cx, look for a value of ci whose conditional probability given the current profile exceeds the threshold t. Append the value to the profile vector.
13 Dominant State Analysis
- Extend the profiles iteratively until no value meeting the threshold can be found, or all features have been examined.
- The output of dominant state analysis for each heavy-hitter IP address is a set of feature vectors describing its behavioral profiles.
- Repeat the same process using destination IP addresses after finishing with source IP addresses.
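The steps on the last three slides can be condensed into a sketch for a single heavy-hitter source address. This is a simplification: for brevity it keeps a fixed feature order instead of re-sorting by normalized entropy, extends only one profile rather than a set of them, and the threshold t = 0.5 is illustrative:

```python
from collections import Counter

FEATURES = ["src_ip", "dst_ip", "src_port", "dst_port"]

def dominant_profile(connections, x, t=0.5):
    """Simplified dominant state analysis for one heavy-hitter src IP x:
    greedily extend the profile with any feature value whose conditional
    probability given the current profile exceeds the threshold t."""
    cx = [c for c in connections if c["src_ip"] == x]  # the set Cx
    profile = {"src_ip": x}
    for f in [f for f in FEATURES if f != "src_ip"]:
        counts = Counter(c[f] for c in cx)
        value, n = counts.most_common(1)[0]
        if n / len(cx) > t:          # dominant value for this feature
            profile[f] = value
            cx = [c for c in cx if c[f] == value]  # condition on it
    return profile

# A host contacting 20 different servers, always on destination port 80:
conns = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.%d" % i,
     "src_port": 40000 + i, "dst_port": 80}
    for i in range(20)
]
print(dominant_profile(conns, "10.0.0.1"))
# -> {'src_ip': '10.0.0.1', 'dst_port': 80}  (looks like a web client)
```

Destination IPs and source ports are nearly uniform, so only the dominant destination port survives into the profile, which is exactly the kind of behavioral fingerprint the attack matches against public information.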
14 Dominant State Analysis
Source: S. E. Coull et al., NDSS 2007 (see slide 6).
15 Subnet Clustering: How to determine subnets?
Naïve solution: look for groupings of contiguous addresses separated by large gaps. The authors instead use the k-means clustering algorithm on the set of IP addresses present within the network data to find the subnets.
16 k-means algorithm
The k-means algorithm clusters n objects, based on their attributes, into k partitions, k < n. The objective it tries to achieve is to minimize the total intra-cluster variance, or the squared error function

  J = Σ_{i=1}^{k} Σ_{xj ∈ Si} ||xj − µi||²

where there are k clusters Si, i = 1, 2, ..., k, and µi is the centroid or mean point of all the points xj ∈ Si.
17 k-means algorithm
Source: http://en.wikipedia.org/wiki/K-means_algorithm
18 Subnet Clustering using k-means
- Treat IP addresses as 4-dimensional vectors.
- Each element of the vector corresponds to one octet of the IP address.
- Determine cluster membership using a modified Euclidean distance based on the bitwise exclusive-OR.
- The dimensions corresponding to the octets are exponentially weighted. The weighting ensures that the hierarchical nature of subnetting is preserved.
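A distance of this flavor might be sketched as follows. The exact weighting scheme is an assumption on our part; the point is only that a difference in a high-order octet must outweigh any possible difference in the octets below it:

```python
def subnet_distance(ip_a, ip_b):
    """Distance between two IPs treated as 4-dimensional octet vectors:
    bitwise XOR per octet, exponentially weighted so that high-order
    octets dominate (the 256**(3-i) weights are illustrative)."""
    a = [int(o) for o in ip_a.split(".")]
    b = [int(o) for o in ip_b.split(".")]
    # Octet i gets weight 256**(3 - i): even a 1-bit difference in the
    # first octet exceeds the largest possible difference further down.
    return sum((x ^ y) * 256 ** (3 - i) for i, (x, y) in enumerate(zip(a, b)))

# Hosts in the same /24 are far closer than hosts in different /16s:
print(subnet_distance("10.0.1.5", "10.0.1.200"))  # small (last octet only)
print(subnet_distance("10.0.1.5", "10.9.1.5"))    # large (second octet)
```

Under this weighting, clustering naturally groups addresses by shared prefixes, which is what makes the recovered clusters correspond to subnets.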
19 Subnet Clustering using k-means
- k-means clustering requires the number of clusters to be specified a priori. We don't have this information!
- Divide each octet dimension into m blocks.
- Place initial centroids at the boundaries of these partitions.
- Iteratively recompute the centroids and the corresponding cluster membership until the membership of the clusters reaches a steady state.
- Iterative refinement of the clusters ensures that the addresses within a given cluster all reside in the same subnet, without requiring exactly identical centroid placement.
20 Active Attacks
Active probing attacks: insert unique values into the anonymized network data, observe the transformations, and deduce the mapping.
21 Passive Attacks
- The pseudonym consistency requirement allows tracking of behavior over a period of time.
- Using the current generation of network data gathering tools, large amounts of data passing through the Internet can be collected over a period of time.
- The authors launch two passive attacks on the anonymized network data to perform passive reconnaissance of the network:
  - Recovering network topology
  - Inferring host behavior
22 Recovering Network Topology
- Determine the locations within the network where each of the traces was captured (observation points).
- Identify the routers at each observation point.
- Infer connectivity among the routers.
23 Determining Observation Points
- Identify the network subnets which are present in the local area network where the trace was recorded.
- Two traces that share at least one subnet are considered to come from the same observation point.
- Find subnets by performing subnet clustering on the IP addresses.
- (Two cases: ARP traffic present, or not present.)
24 Identifying Routing Devices
Check for hardware addresses that appear to have multiple IP addresses associated with them over some period of time. (The period should be less than a typical DHCP lease.) Infer the routes taken by the observed traffic by applying subnet clustering to the source and destination addresses of the TCP and UDP traffic that transits the routers.
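The router heuristic above is essentially a count of distinct IPs per hardware address. A minimal sketch, where the threshold `min_ips` and the handling of the time window are our own illustrative assumptions:

```python
from collections import defaultdict

def candidate_routers(packets, min_ips=3):
    """Flag hardware (MAC) addresses associated with many distinct source
    IP addresses within the observation window: one MAC fronting many IPs
    is likely a router forwarding other hosts' traffic. The min_ips
    threshold is illustrative; the window should be kept shorter than a
    typical DHCP lease so address churn is not mistaken for routing."""
    ips_per_mac = defaultdict(set)
    for pkt in packets:
        ips_per_mac[pkt["src_mac"]].add(pkt["src_ip"])
    return {mac for mac, ips in ips_per_mac.items() if len(ips) >= min_ips}

packets = (
    [{"src_mac": "aa:aa", "src_ip": "10.0.%d.1" % i} for i in range(5)]  # router
    + [{"src_mac": "bb:bb", "src_ip": "10.0.0.9"}] * 5                   # end host
)
print(candidate_routers(packets))  # -> {'aa:aa'}
```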
25 Reconstruct the Network Topology
- Represent each router as a vertex.
- Add edges when the hardware address for a router is associated with some other discovered routing device.
- Superimpose the routes, using hardware addresses to characterize the actual routes taken by the network traffic.
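The first two steps above amount to building an undirected graph over discovered routing devices. A minimal sketch (the device names are hypothetical):

```python
from collections import defaultdict

def build_topology(router_links):
    """Represent each router (identified by its hardware address) as a
    vertex, and add an undirected edge whenever one router's hardware
    address is seen associated with another discovered routing device."""
    graph = defaultdict(set)
    for a, b in router_links:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical associations between discovered routing devices:
links = [("router-A", "gateway"), ("router-B", "gateway")]
print(build_topology(links))
```

Superimposing the observed routes onto this graph then yields the recovered topology of the anonymized network.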
26 Inferring Host Behavior
- Obtain the heavy hitters using Algorithm 1.
- Apply dominant state analysis to the heavy hitters to create behavioral profiles.
- Gather public information about the publishing organization (using DNS or web search queries).
- Match the public information to the behavioral profiles created using dominant state analysis.
27 Example
- This slide illustrates the application of the aforementioned techniques on an anonymized trace from the Johns Hopkins University (JHU) network.
- The topology inference technique indeed finds a single observation point with subnets of 128.220.231.0/24 and 128.220.116.0/26.
- The inference technique also finds one router and one gateway that are directly connected to one another, with the departmental network behind the router.
- To deanonymize the web servers within the departmental network, we must first create an estimate of the behavioral profile for the heavy hitters.
- Example behavioral profiles for three heavy hitters, found via Algorithm 1, in the department's network.
28 Matching Public Information to Profiles
- Apply queries to the behavioral profiles to deanonymize the heavy hitters.
- The authors use Alexa.com to find the popularity of web servers.

Anonymized Address   Score   Real Address (Hostname)
128.220.231.207      0.93    128.220.247.60 (simnet1)
128.220.231.50       0.93    128.220.247.203 (skdnssec)
128.220.231.168      0.87    128.220.247.5 (spar)
128.220.231.121      0.85    -
29 Evaluation
- The techniques were evaluated on network traces from three distinct networks.
- Both NetFlow and packet trace data were used for validation.
- The anonymization approach suggested by Pang et al. was used to anonymize the network data.
30
- The anonymized trace data include the link, network, and transport layer headers.
- Payload data is deleted and the fields are anonymized.
- Pang et al. also remove the packets generated by routers and certain security devices to prevent their discovery.
- Pang et al. also provide metadata to ensure sound measurement practices.
31 NetFlow
- NetFlow data was obtained from CERT, containing logs of two distinct /24 networks taken over four hours on a single day.
- References to routers and network security devices were explicitly removed.
- All the information about the connections (C), the heavy-hitter hosts, the topology information, and their behavioral profiles was dumped into a relational database for fast and easy querying.
32 Infer Network Topology - LBNL
- The techniques described before found 29 distinct observation points with a total of 31 associated enterprise subnets.
- The subnets found through the subnet clustering technique agreed with those provided in the metadata,
- with the exception of one subnet whose size was over-estimated, thereby giving 96% accuracy.
33 Deanonymization
- Deanonymization of the HTTP server for the Bro IDS Project and other servers within its subnet.
- Why choose these?
  - Public information about these servers is readily available.
34 Steps in Deanonymization
- Query DNS records for the addresses of the Bro web server (www.bro-ids.org) and related hosts.
- The results showed that www.bro-ids.org resides in the same subnet as the ee.lbl.gov domain, which includes SMTP, HTTP, and FTP services on ee.lbl.gov and HTTP services on ita.ee.lbl.gov.
- Knowledge of these services was gleaned only from Google and Alexa searches.
35
- By inferring subnet size from the addresses provided by DNS records, the set of possible subnets within the trace data is reduced to ten /22 subnets.
- If we consider that the target subnet contains at least 3 heavy-hitter web servers, there are six subnets which match this criterion.
- Noting that the ee.lbl.gov server provides SMTP, HTTP, and FTP services, only two servers are left.
36
- So combining the subnet size, the unique services offered by ee.lbl.gov, and the specific mix of HTTP servers gives us the actual server in the anonymized trace.
- By inspecting the Google results for ee.lbl.gov, it was suspected that one server might be froggy.lbl.gov, because it features a CGI web application which generates short bursts of HTTP connections as a function of CGI usage.
37 Sample Deanonymization Results
- These are only suppositions, as there is no ground truth available.
38 Results: Overall Deanonymization
Source: S. E. Coull et al., NDSS 2007 (see slide 6).
39 Success/Failure
- According to the authors, these results indicate that behavioral profiling is a plausible method for deanonymizing a variety of network traces.
- The success rate ranged from 66% to 100% for the significant SMTP servers, and from 28.6% to 50% for the significant HTTP servers in the subnets they examined.
40
- In addition to deanonymizing select hosts, the type of traffic present at the various observation points can be characterized.
- The following table provides deeper insight into the presence of important servers.
41 Mitigation Strategies
- Creation of network topology maps can be prevented by:
  - Publishing only anonymized NetFlow logs.
  - Removing link layer headers from packet traces.
  - Excluding ARP traffic to make discovery of link layer topology information difficult.
- Remap port numbers to hinder the ability to directly infer the services offered and to make creation of behavioral profiles more difficult.
- Remove hosts with unique behavior to make behavioral profiling tougher.
42 Mitigation Strategies
- Hide the true identity of the publishing organization to make the information gathering needed to mount an attack tougher. (Security by obscurity: not a good idea.)
- Non-technical ways:
  - Legislation.
  - Data remains on secure servers, with access only for researchers.
  - No third-party access.
  - Inefficient, and violations are difficult to detect; but it may (regrettably) provide better privacy than that offered by present anonymization methodologies.
- A balance needs to be maintained between the two conflicting requirements: the utility and the privacy of released network data.
43 Conclusion
- This paper shows that even anonymized data can be compromised, but the authors are not able to give any solid example to prove it.
- Moreover, in order to deanonymize the data, they have put a lot of constraints on the anonymized data, which might not be feasible to that extent in the real world.
- It is still not clear to what extent the privacy of clients is threatened.
44 Future Work
- As part of future work, the authors intend to explore a formal framework for examining this question, and for expressing the privacy properties of anonymization techniques in general.
45 Questions?
46 Thank You