1
CS 656 Paper Presentation
  • Inferring Sensitive Information from Anonymized Network Traces

Presented by Pardeepinder Singh and Shashank Gupta
2
Why anonymize network data?
  • Networking research requires trace and log data
    to be publicly available for verification and
    comparison of results.
  • Publicly available network data is vulnerable
    to attacks that compromise the privacy of end users
    and the security of networks.
  • Anonymization is required to share network data.

3
Goals of network trace anonymization?
  • The goals of anonymization are to prevent:
  • User audit trails
  • Mappings of supported network services
  • Leakage of the security practices of the network.
  • And to provide a balance between data utility and
    data privacy.

4
How to anonymize network data?
  • Destruction: Outright removal from the
    dataset.
  • Fixed Transformation: A single pseudonym value for
    all values of the field.
  • Variable Transformation: Different pseudonym
    values based on the context of the field.
  • Typed Transformation: A single pseudonym value for
    each distinct value of the original field.
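
A minimal Python sketch of these four transformation
styles applied to a single field (the pseudonym choices
and the hashing scheme are illustrative assumptions, not
the paper's):

  import hashlib

  _typed_table = {}

  def destroy(record, field):
      # Destruction: remove the field outright from the published record.
      record.pop(field, None)
      return record

  def fixed_transform(value, pseudonym="0.0.0.0"):
      # Fixed transformation: one pseudonym for all values of the field.
      return pseudonym

  def typed_transform(value):
      # Typed transformation: one consistent pseudonym per distinct value.
      if value not in _typed_table:
          _typed_table[value] = "host-%d" % len(_typed_table)
      return _typed_table[value]

  def variable_transform(value, context):
      # Variable transformation: the pseudonym also depends on the context
      # of the field (here, hypothetically, the trace it appears in).
      digest = hashlib.sha256(("%s|%s" % (context, value)).encode()).hexdigest()
      return "host-" + digest[:8]

  print(typed_transform("128.220.231.50"))    # host-0
  print(typed_transform("128.220.231.51"))    # host-1
  print(typed_transform("128.220.231.50"))    # host-0 again (consistent)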

5
Anonymized Network Data Requirements
  • Pseudonym Consistency Requirement: The addresses
    provided within the trace data must be consistently
    anonymized.
  • Header Requirement: Header information remains
    intact.
  • Transport Protocol Requirement: Records
    corresponding to transport layer traffic must be
    present.
  • Port Number Assumption: The ability to map port
    numbers to services.

6
Summary of Anonymization Techniques
Source: S. E. Coull et al., "Playing Devil's
Advocate: Inferring Sensitive Information from
Anonymized Network Traces," Proceedings of the
2007 Network and Distributed System Security
Symposium, San Diego, California, February 2007.
7
Primitives
  • Heavy Hitters
  • Dominant State Analysis
  • Subnet Clustering
  • Each connection c ∈ C is described by a feature
    vector <c1, c2, ..., ck>. In the paper k = 4 and the
    features are c1 = source IP address, c2 =
    destination IP address, c3 = source port number, and
    c4 = destination port number.

8
Heavy Hitters
Normalized Entropy: A value of zero indicates a highly
peaked distribution for that attribute; a value of one
indicates a uniform distribution.
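
The entropy formula on this slide did not survive
extraction; assuming the usual definition, the normalized
entropy of an attribute X with N distinct observed values
is

  \bar{H}(X) = \frac{-\sum_{i=1}^{N} p(x_i)\,\log p(x_i)}{\log N}

so that \bar{H}(X) lies between 0 and 1.
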
9
Finding Heavy-Hitters
  • Iteratively remove very frequent values from the
    distribution.
  • Recompute the normalized entropy until the
    distribution becomes sufficiently uniform,
    bringing the normalized entropy above a given
    threshold tH.
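
A rough Python sketch of this iterative procedure (the
connection layout and the default threshold value are
assumptions; the threshold name t_H follows the slide):

  import math
  from collections import Counter

  def normalized_entropy(counts):
      # Shannon entropy of the empirical distribution divided by its maximum
      # (log of the number of distinct values), so the result lies in [0, 1].
      total = sum(counts.values())
      if len(counts) <= 1:
          return 0.0
      h = -sum((c / total) * math.log(c / total) for c in counts.values())
      return h / math.log(len(counts))

  def find_heavy_hitters(values, t_H=0.9):
      # Iteratively remove the most frequent value until the remaining
      # distribution is sufficiently uniform (normalized entropy above t_H).
      counts = Counter(values)
      heavy_hitters = []
      while counts and normalized_entropy(counts) < t_H:
          value, _ = counts.most_common(1)[0]
          heavy_hitters.append(value)
          del counts[value]
      return heavy_hitters

  # E.g., heavy-hitter source addresses from 4-tuple connection records:
  # find_heavy_hitters([c[0] for c in connections])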

10
Finding Heavy-Hitters
Source: S. E. Coull et al., "Playing Devil's
Advocate: Inferring Sensitive Information from
Anonymized Network Traces," Proceedings of the
2007 Network and Distributed System Security
Symposium, San Diego, California, February 2007.
11
Dominant State Analysis
  • Used for determining the characteristic behaviors
    of a given host.
  • Applied to the heavy hitters to create a behavioral
    profile for each.
  • Steps:
  • For each heavy-hitter address x, begin with a
    simple behavioral profile: src address = x.
  • c1 is the source IP address of connection c.
    Denote the set of connections with c1 = x as Cx.

12
Dominant State Analysis
  • For each heavy-hitter address x, begin with a
    simple behavioral profile: src address = x.
  • c1 is the source IP address of connection c.
    Denote the set of connections with c1 = x as Cx.
  • Reorder the remaining features in increasing order
    of normalized entropy.
  • For each remaining feature ci (i = 2, ..., 4) in the
    set of connections Cx, look for a value of ci whose
    conditional probability given the current profile
    exceeds our threshold t. Append the value to the
    profile vector.

13
Dominant State Analysis
  • Extend the profiles iteratively until no value
    meeting our threshold can be found, or all
    features have been examined.
  • The output of DSA for each heavy-hitter IP address
    is a set of feature vectors describing its
    behavioral profiles.
  • Repeat the same process using destination IP
    addresses, after finishing for source IP
    addresses.
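
A rough Python sketch of the steps above for source
addresses (the 4-tuple connection layout and the default
threshold value are assumptions; the threshold name t
follows the slides):

  import math
  from collections import Counter

  def normalized_entropy(counts):
      # Same helper as in the heavy-hitter sketch above.
      total = sum(counts.values())
      if len(counts) <= 1:
          return 0.0
      h = -sum((c / total) * math.log(c / total) for c in counts.values())
      return h / math.log(len(counts))

  def dominant_state_analysis(connections, x, t=0.8):
      # connections: 4-tuples (src_ip, dst_ip, src_port, dst_port).
      # x: a heavy-hitter source address. Returns a set of behavioral
      # profiles, each a dict {feature index: dominant value}, starting
      # from the simple profile c1 = x.
      C_x = [c for c in connections if c[0] == x]
      if not C_x:
          return []

      # Reorder the remaining features in increasing order of entropy.
      remaining = sorted(
          (1, 2, 3),
          key=lambda i: normalized_entropy(Counter(c[i] for c in C_x)),
      )

      finished = []
      active = [({0: x}, C_x)]       # (partial profile, matching connections)
      for i in remaining:
          next_active = []
          for profile, conns in active:
              counts = Counter(c[i] for c in conns)
              dominant = [v for v, n in counts.items() if n / len(conns) >= t]
              if not dominant:
                  finished.append(profile)   # no value meets the threshold
                  continue
              for v in dominant:
                  extended = dict(profile)
                  extended[i] = v
                  next_active.append(
                      (extended, [c for c in conns if c[i] == v]))
          active = next_active
      return finished + [p for p, _ in active]

The same routine can then be repeated for destination
addresses by filtering on c[1] instead of c[0].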

14
Dominant State Analysis
Source: S. E. Coull et al., "Playing Devil's
Advocate: Inferring Sensitive Information from
Anonymized Network Traces," Proceedings of the
2007 Network and Distributed System Security
Symposium, San Diego, California, February 2007.
15
Subnet Clustering: How to determine subnets?
Naïve solution: Look for groupings of contiguous
addresses separated by large gaps. The authors instead
use the k-means clustering algorithm on the set of IP
addresses present within the network data to find
the subnets.
16
k-means algorithm
The k-means algorithm clusters n objects, based on
their attributes, into k partitions, k < n. It tries to
minimize the total intra-cluster variance, i.e., the
squared error function, where there are k
clusters Si, i = 1, 2, ..., k, and µi is the
centroid, or mean point, of all the points xj ∈
Si.
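
The squared error function referred to above (its image
did not survive extraction; this is the standard form):

  J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2
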
17
k-means algorithm

Source: http://en.wikipedia.org/wiki/K-means_algorithm
18
Subnet Clustering using k-means
  • Treat IP addresses as 4-dimensional vectors.
  • Each element of the vector corresponds to one
    octet of the IP address.
  • Determine cluster membership using a modified
    Euclidean distance based on bitwise exclusive-OR.
  • The dimensions corresponding to the octets are
    exponentially weighted. The weighting ensures that
    the hierarchical nature of subnetting is preserved.
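
A minimal Python sketch of such a distance; the exact
weighting base used here is an assumption, not
necessarily the paper's:

  def ip_to_octets(ip):
      # Represent an IP address as a 4-dimensional vector of its octets.
      return [int(o) for o in ip.split(".")]

  def weighted_xor_distance(ip_a, ip_b, base=256.0):
      # Modified Euclidean distance: the per-octet difference is the bitwise
      # XOR, and higher-order octets are weighted exponentially more heavily
      # so that the hierarchy of subnetting dominates the clustering.
      a, b = ip_to_octets(ip_a), ip_to_octets(ip_b)
      return sum(((base ** (3 - i)) * (a[i] ^ b[i])) ** 2
                 for i in range(4)) ** 0.5

  print(weighted_xor_distance("128.220.231.50", "128.220.231.51"))  # same /24
  print(weighted_xor_distance("128.220.231.50", "128.221.10.50"))   # far apart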

19
Subnet Clustering using k-means
  • k-means clustering requires the number of clusters
    to be specified a priori, but we do not have this
    information.
  • Divide each octet dimension into m blocks.
  • Place the initial centroids at the boundaries of
    these partitions.
  • Iteratively recompute the centroids and the
    corresponding cluster membership, until the
    membership of the clusters reaches a steady state.
  • Iterative refinement of the clusters ensures that
    the addresses within a given cluster all reside in
    the same subnet without requiring exact initial
    centroid placement.
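
A Python sketch of the membership-refinement loop (the
data layout is assumed; the placeholder metric below
stands in for the weighted XOR distance over octet
vectors sketched earlier):

  def squared_distance(a, b):
      # Placeholder metric over octet vectors; swap in the exponentially
      # weighted XOR distance from the previous sketch for the paper's style.
      return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

  def kmeans_subnets(ips, initial_centroids, distance=squared_distance,
                     max_iter=100):
      # Iteratively reassign addresses to their nearest centroid and
      # recompute each centroid as the per-octet mean of its members,
      # stopping once the cluster membership reaches a steady state.
      points = [[int(o) for o in ip.split(".")] for ip in ips]
      centroids = [list(c) for c in initial_centroids]
      membership = None
      for _ in range(max_iter):
          new_membership = [
              min(range(len(centroids)),
                  key=lambda j: distance(x, centroids[j]))
              for x in points
          ]
          if new_membership == membership:
              break                      # steady state reached
          membership = new_membership
          for j in range(len(centroids)):
              members = [points[i] for i, m in enumerate(membership) if m == j]
              if members:
                  centroids[j] = [sum(col) / len(members)
                                  for col in zip(*members)]
      return membership, centroids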

20
Active Attacks
Active probing attacks: Insert unique values into the
network data, observe their transformations in the
anonymized output, and deduce the original values.
21
Passive Attacks
  • The Pseudonym Consistency Requirement allows tracking
    of behavior over a period of time.
  • Using the current generation of network data
    gathering tools, large amounts of data passing
    through the Internet can be collected over a
    period of time.
  • The authors launch two passive attacks on the
    anonymized network data to perform passive
    reconnaissance of the network:
  • Recovering Network Topology
  • Inferring Host Behavior

22
Recovering Network Topology
  • Determine the locations within the network where each
    of the traces was captured (observation points).
  • Identify the routers at each observation point.
  • Infer connectivity among routers.

23
Determining Observation Points
  • Identify the network subnets which are present in
    the local area network where the trace was
    recorded.
  • Two traces that share at least one subnet are
    considered to be coming from the same observation
    point.
  • Find subnets by performing subnet clustering on
    IP addresses.
  • (Two cases: ARP traffic present, and ARP traffic not
    present.)
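
A small Python sketch of this grouping step (the input
format, trace names, and subnets are hypothetical):

  def group_observation_points(trace_subnets):
      # trace_subnets: dict mapping trace name -> set of inferred subnets.
      # Traces sharing at least one subnet are merged (transitively) into
      # the same observation point, using a simple union-find.
      parent = {t: t for t in trace_subnets}

      def find(t):
          while parent[t] != t:
              parent[t] = parent[parent[t]]
              t = parent[t]
          return t

      traces = list(trace_subnets)
      for i, a in enumerate(traces):
          for b in traces[i + 1:]:
              if trace_subnets[a] & trace_subnets[b]:
                  parent[find(a)] = find(b)

      groups = {}
      for t in traces:
          groups.setdefault(find(t), []).append(t)
      return list(groups.values())

  # Hypothetical example: traces A and B share 10.1.2.0/24, C stands alone.
  print(group_observation_points({
      "trace-A": {"10.1.2.0/24", "10.1.3.0/24"},
      "trace-B": {"10.1.2.0/24"},
      "trace-C": {"10.9.0.0/24"},
  }))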

24
Identifying Routing Devices
Check for hardware addresses that appear to have
multiple IP addresses associated with them over
some period of time. (The period of time should be
less than a typical DHCP lease.) Infer the routes
taken by the observed traffic by applying subnet
clustering to the source and destination
addresses of the TCP and UDP traffic that transits
the routers.
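
A minimal Python sketch of the hardware-address check
(the packet tuple layout and the window length are
assumptions):

  from collections import defaultdict

  def find_router_candidates(packets, window_seconds=3600):
      # packets: iterable of (timestamp, hardware_address, ip_address)
      # tuples. A hardware address seen with more than one distinct IP
      # address inside a window shorter than a typical DHCP lease is a
      # candidate routing device.
      seen = defaultdict(list)                 # hw address -> [(time, ip)]
      for ts, hw, ip in packets:
          seen[hw].append((ts, ip))

      routers = set()
      for hw, events in seen.items():
          events.sort()
          for i, (ts, _) in enumerate(events):
              ips = {ip for t, ip in events[i:] if t - ts <= window_seconds}
              if len(ips) > 1:
                  routers.add(hw)
                  break
      return routers
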
25
Reconstruct the network topology
  • Represent each router as a vertex.
  • Add an edge when the hardware address for a router is
    associated with some other discovered routing
    device.
  • Superimpose routes using hardware addresses to
    characterize the actual routes taken by the
    network traffic.
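
A tiny Python sketch of the resulting graph structure
(the link list format is an assumption):

  def build_topology(router_links):
      # router_links: (hw_address_a, hw_address_b) pairs observed when one
      # discovered routing device's hardware address is associated with
      # another. Returns an undirected adjacency map of the topology.
      topology = {}
      for a, b in router_links:
          topology.setdefault(a, set()).add(b)
          topology.setdefault(b, set()).add(a)
      return topology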

26
Inferring Host Behavior
  • Obtain heavy hitters using Algorithm 1.
  • Apply dominant state analysis on heavy-hitters
    to create behavioral profiles.
  • Gather public information about the publishing
    organization (using DNS or web search
    queries).
  • Match public information to behavioral profiles
    created using dominant state analysis.

27
Example
  • This illustrates the application of the
    aforementioned techniques on an anonymized trace
    from the Johns Hopkins University (JHU) network.
  • The topology inference technique indeed finds a
    single observation point with the subnets
    128.220.231.0/24 and 128.220.116.0/26.
  • The inference technique also finds one router and
    one gateway that are directly connected to one
    another, with the departmental network behind the
    router.
  • To deanonymize the web servers within the
    departmental network, we must first create an
    estimate of the behavioral profile for the heavy
    hitters.
  • Example behavioral profiles for three
    heavy-hitters, found via Algorithm 1, in the
    department's network.

28
Matching Public Information to Profiles
  • Apply the queries to the behavioral profiles to
    deanonymize the heavy hitters.
  • The authors use Alexa.com to find the popularity of
    web servers.

Anonymized Address    Score    Real Address (Hostname)
128.220.231.207       0.93     128.220.247.60 (simnet1)
128.220.231.50        0.93     128.220.247.203 (skdnssec)
128.220.231.168       0.87     128.220.247.5 (spar)
128.220.231.121       0.85     -
29
Evaluation
  • The techniques were evaluated on network traces from
    three distinct networks.
  • Both NetFlow and packet trace data were used for
    validation.
  • The anonymization approach suggested by Pang et
    al. was used to anonymize network data.

30
  • The anonymized trace data includes the link, network,
    and transport layer headers.
  • Payload data is removed and the remaining fields are
    anonymized.
  • Pang et al. also remove the packets generated by
    routers and certain security devices to prevent
    their discovery.
  • Pang et al. also provide metadata to ensure
    sound measurement practices.

31
NetFlow
  • NetFlow data obtained from CERT contained
    logs of two distinct /24 networks taken over four
    hours on a single day.
  • References to routers and network security devices
    were explicitly removed.
  • All the information about the connections (C), the
    heavy-hitter hosts, the topology information, and
    their behavioral profiles was loaded into a
    relational database for fast and easy querying.

32
Infer Network Topology - LBNL
  • The techniques described before found 29 distinct
    observation points with a total of 31 associated
    enterprise subnets.
  • The subnets found through the subnet clustering
    technique agreed with those provided in the metadata,
    with the exception of one subnet whose size was
    over-estimated, giving 96% accuracy.

33
Deanonymization
  • Deanonymization of the HTTP server for the Bro IDS
    project and other servers within its subnet.
  • Why choose these?
  • Public information about these servers is readily
    available.

34
Steps in Deanonymization
  • Query DNS records for the addresses of the Bro
    web server (www.bro-ids.org) and related hosts.
  • The results showed that www.bro-ids.org resides in
    the same subnet as the ee.lbl.gov domain, which
    includes SMTP, HTTP, and FTP services on
    ee.lbl.gov and HTTP services on ita.ee.lbl.gov.
  • Knowledge of these services was gleaned only from
    Google and Alexa searches.

35
  • By inferring the subnet size from the addresses
    provided by DNS records, the set of possible subnets
    within the trace data is reduced to ten /22 subnets.
  • If we consider that the target subnet contains at
    least 3 heavy-hitter web servers, there are
    six subnets which match this criterion.
  • Noting that the ee.lbl.gov server provides SMTP,
    HTTP, and FTP services, only two servers are left.

36
  • Combining the subnet size, the unique services
    offered by ee.lbl.gov, and the specific mix of HTTP
    servers gives us the actual server in the anonymized
    trace.
  • By inspecting Google results for ee.lbl.gov,
    it was suspected that it might be froggy.lbl.gov,
    because it features a CGI web application which
    generates short bursts of HTTP connections as a
    function of CGI usage.

37
Sample Deanonymization results
  • These are only suppositions, as there is no ground
    truth available.

38
Results: Overall Deanonymization
Source: S. E. Coull et al., "Playing Devil's
Advocate: Inferring Sensitive Information from
Anonymized Network Traces," Proceedings of the
2007 Network and Distributed System Security
Symposium, San Diego, California, February 2007.
39
Success/Failure
  • According to the authors, these results indicate that
    behavioral profiling is a plausible method for
    deanonymizing a variety of network traces.
  • Success rates ranged from 66% to 100% for
    significant SMTP servers,
  • and from 28.6% to 50% for the significant HTTP
    servers in the subnets they examined.

40
  • In addition to deanonymizing select hosts, the
    type of traffic present at the various observation
    points can be characterized.
  • The following table provides deeper insight into
    the presence of important servers.

41
Mitigation Strategies
  • The creation of network topology maps can be
    prevented by:
  • Publishing only anonymized NetFlow logs.
  • Removing link layer headers from packet traces.
  • Excluding ARP traffic to make discovery of link
    layer topology information difficult.
  • Remap port numbers to hinder the ability to directly
    infer the services offered and to make creation of
    behavioral profiles more difficult.
  • Remove hosts with unique behavior to make
    behavioral profiling tougher.

42
Mitigation Strategies
  • Hide the true identity of the publishing
    organization to make the information gathering needed
    to mount an attack tougher. (Security by obscurity is
    not a good idea.)
  • Non-technical approaches:
  • Legislation.
  • Data remains on secure servers, with access granted
    only to researchers.
  • No third-party access.
  • These are inefficient and violations are difficult to
    detect, but they may (regrettably) provide better
    privacy than that offered by present
    anonymization methodologies.
  • A balance needs to be maintained between the two
    conflicting requirements: the utility and the privacy
    of released network data.

43
Conclusion
  • This paper shows that even anonymized data
    can be compromised, but the authors are not able to
    give a solid example to prove it.
  • Moreover, they place a lot of constraints on the
    anonymized data, which might not be feasible to that
    extent in the real world.
  • It is still not clear to what extent the privacy
    of clients is threatened.

44
Future Work
  • As part of future work, they intend to explore a
    formal framework for examining this question,
  • and for expressing the privacy properties of
    anonymization techniques in general.

45
Questions?
46
Thank You