Title: Overview of Privacy Preserving Techniques
- This is a high-level summary of the state-of-the-art privacy-preserving techniques and research areas
- Focus on problems and the basic ideas
- Office hours are changed to Wed 2-5pm
Outline
- Privacy problem in computing
- Major techniques
- Data perturbation
- Data anonymization
- Cryptographic methods
- Privacy in different areas
- Data mining
- Data publishing
- Database access/information retrieval
- Social network
- Mobile computing
Privacy problem
- Individual privacy
- Customer data
- Public data: census data, voting records
- Health records
- Locations
- Online activities
- etc
- Organization privacy
- Owning collections of personal data
- Business secrets
- Legal issues prevent data sharing
- etc
Privacy vs. Security
- Security
- Assumption: the two parties trust each other, but the communication network is not trusted.
[Figure: Alice encrypts the data and sends it over the communication channel; Bob decrypts it.]
Bob knows the original data that Alice owns.
- Privacy
- Parties do not trust each other: curious parties (including malicious insiders) may look at sensitive contents
- Parties follow protocols honestly (semi-honest assumption)
- (1) Transformation-based methods
[Figure: Alice sends transformed data over the communication channel to Bob, who might be a curious party and works on the transformed data only.]
Bob does not know the original data.
- (2) Cryptographic methods
[Figure: Party 1, Party 2, ..., Party n each hold their own data and run some protocol using cryptographic primitives. Figure labels: statistical info / intermediate result; info from other parties.]
Privacy-sensitive scenarios
[Figure: several users and their private info]
Issues with data transformation
- Techniques performing the transformation
- Transformation should preserve important information
- How much information is lost
- How to recover the information from the transformed data
- Methods reconstructing the original data from the transformed data
- Various attacks
- The cost
- Transforming data
- Recovering the important information
Transformation techniques
- Data Perturbation
- Additive perturbation
- Multiplicative perturbation
- Randomized responses
- Data Anonymization
- k-anonymization
Additive Data Perturbation
- Definition
- Y = X + e
- X is the original data column, e is some zero-mean random noise, and Y is the perturbed data
- History
- Census data
- Statistical databases [14]
[Figure: released data]
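The definition Y = X + e can be sketched in a few lines. This is only an illustration: the Gaussian noise, the `ages` column, and the noise level are assumptions chosen for the example, not something the slides specify.

```python
import random
import statistics

def additive_perturb(column, noise_sd, rng):
    """Return Y = X + e, where e is zero-mean Gaussian noise (an assumed choice)."""
    return [x + rng.gauss(0.0, noise_sd) for x in column]

rng = random.Random(0)  # fixed seed, only so the example is repeatable
ages = [rng.uniform(20, 60) for _ in range(10_000)]  # hypothetical original column X
perturbed = additive_perturb(ages, noise_sd=10.0, rng=rng)

# Individual values move noticeably, but the column mean barely changes,
# because the zero-mean noise averages out over many records.
mean_x = statistics.fmean(ages)
mean_y = statistics.fmean(perturbed)
```

Each released value differs from its original, while aggregate statistics such as the mean stay close.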
Additive Data Perturbation
- In data mining
- Perturb only selected data columns
- Some data mining algorithms care only about the distribution of the data column, rather than exact record values
- The distribution can be reconstructed from the perturbed data if the noise distribution is known [10, 11]
- Need to develop new DM algorithms (disadvantage)
[Figure: original distribution, perturbed distribution, and reconstructed distribution]
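One way to see how a distribution can be recovered from perturbed values is an iterative Bayesian update over histogram bins. The sketch below is a simplified illustration, not necessarily the exact procedure of the cited papers; the bin layout, the Gaussian noise, and the bimodal test column are all assumptions.

```python
import math
import random

def gauss_pdf(v, sd):
    return math.exp(-0.5 * (v / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def reconstruct(perturbed, midpoints, noise_sd, iterations=30):
    """Estimate P(X in bin) from Y = X + e when the noise density is known."""
    k = len(midpoints)
    p = [1.0 / k] * k  # start from a uniform guess
    # like[i][a]: density of observing y_i if X sat at the midpoint of bin a
    like = [[gauss_pdf(y - m, noise_sd) for m in midpoints] for y in perturbed]
    for _ in range(iterations):
        new_p = [0.0] * k
        for row in like:
            denom = sum(w * q for w, q in zip(row, p))
            if denom == 0.0:
                continue  # observation too far from every bin; ignore it
            for a in range(k):
                new_p[a] += row[a] * p[a] / denom
        total = sum(new_p)
        p = [v / total for v in new_p]
    return p

rng = random.Random(1)
# Hypothetical bimodal column: half the values near 2, half near 8.
original = ([rng.gauss(2, 0.5) for _ in range(1000)]
            + [rng.gauss(8, 0.5) for _ in range(1000)])
perturbed = [x + rng.gauss(0, 1.0) for x in original]
midpoints = [0.25 + 0.5 * j for j in range(20)]  # 20 bins covering [0, 10]
estimate = reconstruct(perturbed, midpoints, noise_sd=1.0)
```

The estimate is a probability per bin; only the noise density and the perturbed values are used, never the original records.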
Attacks on additive data perturbation
- Noise e can be filtered out
- Random matrix theory
- Spectral analysis
- Papers [13, 15, 16] discuss data (not distribution) reconstruction techniques
- When the perturbation is effective in preserving privacy
Additive perturbation to categorical data
- Transactional data
- User A clicked url1, url3, url8
- User B bought items x,y,z
- Categorical data perturbation
- Add/remove fake items to the itemset
- While preserving some global distribution
- Widely used in privacy-preserving association rule mining
- Related work [12, 17, 111]
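A minimal sketch of the add/remove idea follows, assuming a simple scheme in which each real item is kept with probability `keep_p` and each absent item is inserted with probability `add_q`. These parameters, the catalog, and the click data are hypothetical; the cited work uses its own randomization operators.

```python
import random

def perturb_itemset(items, catalog, keep_p, add_q, rng):
    """Drop each real item with prob. 1 - keep_p; add each absent item with prob. add_q."""
    out = set()
    for item in catalog:
        if item in items:
            if rng.random() < keep_p:
                out.add(item)
        elif rng.random() < add_q:
            out.add(item)
    return out

def estimate_support(perturbed_sets, item, keep_p, add_q):
    """Invert observed = s*keep_p + (1 - s)*add_q to recover the true support s."""
    observed = sum(item in s for s in perturbed_sets) / len(perturbed_sets)
    return (observed - add_q) / (keep_p - add_q)

rng = random.Random(2)
catalog = ["url1", "url2", "url3", "url8"]
# Hypothetical click data: about 30% of users clicked url1 and url3, the rest url2.
transactions = [{"url1", "url3"} if rng.random() < 0.3 else {"url2"}
                for _ in range(20_000)]
true_support = sum("url1" in t for t in transactions) / len(transactions)
released = [perturb_itemset(t, catalog, 0.8, 0.1, rng) for t in transactions]
est_support = estimate_support(released, "url1", 0.8, 0.1)
```

No single released itemset can be trusted, yet the global support of an item is still recoverable from the perturbed collection.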
Multiplicative Data Perturbation
- Definition
- X is the original data (multiple columns), Y is the perturbed data
- Random projection perturbation: Y = PX
- P is a random projection matrix
- Rotation perturbation: Y = RX
- R is a random rotation matrix, so distances are preserved
- Geometric perturbation: Y = RX + T + D
- T is a translation matrix
- D is a random noise matrix
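The distance-preserving property of rotation perturbation can be checked with a small two-dimensional sketch. The records and the random angle are made up for illustration; real data would have more columns and a higher-dimensional rotation matrix.

```python
import math
import random

def rotate(points, theta):
    """Apply Y = RX to 2-D records, with R the rotation by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

rng = random.Random(3)
theta = rng.uniform(0, 2 * math.pi)  # the secret perturbation parameter
records = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]  # hypothetical two-column records
released = rotate(records, theta)

# Pairwise Euclidean distances are the same before and after the rotation.
d_before = math.dist(records[0], records[1])
d_after = math.dist(released[0], released[1])
```

Because distances are unchanged, a distance-based algorithm run on `released` behaves as it would on `records`.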
Multiplicative Data Perturbation
- Unique benefits
- No need to release the perturbation parameters
- E.g., P, R, T, D
- More robust to spectral analysis
- Preserves distances (exactly or approximately)
- Can use many existing DM algorithms directly on the perturbed data
- No need to develop special DM algorithms
Attacks on Multiplicative Data Perturbation
- Independent Component Analysis
- For any Y = AX
- If Y is known, the X columns are independent, and no more than one X column has a normal distribution
- A and X can be estimated
- Requires additional info to be an effective attack
- Attackers knowing a few input/output pairs
- (x1, y1), (x2, y2), ..., (xk, yk)
- Using these pairs to estimate the perturbation parameters
- Related work [22, 23]
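The known input/output attack is easiest to see for a plain two-dimensional rotation, where a single known pair is enough to recover the angle. This is a toy sketch under that assumption; `secret_theta` and the records are hypothetical, and the attacks in the related work handle more general perturbations.

```python
import math

def rotate(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

secret_theta = 1.234  # perturbation parameter, unknown to the attacker
records = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]  # hypothetical original records
released = [rotate(p, secret_theta) for p in records]

# The attacker knows one original record and its released image.
known_x, known_y = records[0], released[0]
est_theta = (math.atan2(known_y[1], known_y[0])
             - math.atan2(known_x[1], known_x[0]))

# Undoing the estimated rotation recovers every released record.
recovered = [rotate(p, -est_theta) for p in released]
```

In higher dimensions more pairs are needed, but the idea is the same: the pairs pin down the perturbation parameters, which are then inverted.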
Randomized Response
- Definition
- Problem: need to know the yes/no answers to a sensitive survey question
- Each user perturbs the answer in some way
- The real probability of a "yes" answer can still be calculated
- Applications
- Related work [31, 32, 33]
- No attack has been studied yet.
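A common textbook form of randomized response has each user report the truth with probability p and the opposite answer otherwise. The sketch below assumes that form; `p_truth = 0.75` and the 20% true rate are invented for illustration.

```python
import random

def randomized_answer(truth, p_truth, rng):
    """Report the true answer with probability p_truth, otherwise the opposite."""
    return truth if rng.random() < p_truth else not truth

def estimate_yes_rate(reports, p_truth):
    """Invert observed = pi*p + (1 - pi)*(1 - p) to recover the true yes rate pi."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(4)
true_answers = [rng.random() < 0.2 for _ in range(50_000)]  # hypothetical: 20% "yes"
reports = [randomized_answer(a, 0.75, rng) for a in true_answers]
estimated = estimate_yes_rate(reports, 0.75)
```

No individual report reveals the respondent's true answer with certainty, but the population rate is recovered from the aggregate. The estimator requires p_truth different from 0.5.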
Data Anonymization
- Publishing microdata for research
- The problem
- Normally, the explicit user IDs (SSN, names) are removed
- Virtual identifier or quasi-identifier: multiple attributes used together to infer individuals
[Figure: voting record and medical record]
Together, the MA governor's medical info is identified
- k-anonymity
- At least k records have the same virtual identifier
- Challenges
- Techniques to efficiently anonymize the tables
- Risk of privacy breach
- Information loss
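The k-anonymity condition can be expressed as a short check. The table, the column names, and the choice of `zip` and `age` as the quasi-identifier are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical released table; "zip" and "age" together form the quasi-identifier.
table = [
    {"zip": "4532*", "age": "30-39", "disease": "flu"},
    {"zip": "4532*", "age": "30-39", "disease": "asthma"},
    {"zip": "4540*", "age": "40-49", "disease": "flu"},
    {"zip": "4540*", "age": "40-49", "disease": "diabetes"},
]

two_anonymous = is_k_anonymous(table, ["zip", "age"], k=2)
three_anonymous = is_k_anonymous(table, ["zip", "age"], k=3)
```

Here every quasi-identifier group has two rows, so the table is 2-anonymous but not 3-anonymous.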
Anonymization implementation
- Generalization [37, 40]
- Suppression [37, 43]
- Multidimensional clustering [47, 48, 49]
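Generalization and suppression can be illustrated on a single record. The helper names, the ten-year age ranges, and the three-digit ZIP prefix are assumptions made for this sketch, not rules taken from the cited papers.

```python
def generalize_age(age, width=10):
    """Generalization: replace an exact age by a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep_digits=3):
    """Generalization: keep the leading digits of a ZIP code and mask the rest."""
    return zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)

def suppress(_value):
    """Suppression: hide the value entirely."""
    return "*"

raw = {"name": "Alice", "zip": "45324", "age": 34, "disease": "flu"}  # hypothetical row
released = {
    "name": suppress(raw["name"]),
    "zip": generalize_zip(raw["zip"]),
    "age": generalize_age(raw["age"]),
    "disease": raw["disease"],
}
```

Coarser ranges make more records share the same quasi-identifier, at the price of more information loss.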
Risk of privacy breach
- l-diversity [39]
- t-closeness [53]
- m-invariance [52]
Privacy is not protected
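A k-anonymous table can still leak when every record in a group shares the same sensitive value. The check below sketches the simplest (distinct-values) reading of l-diversity; the table and column names are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Hypothetical 2-anonymous table. The first group is homogeneous: anyone known
# to fall in it is revealed to have "flu" even though k-anonymity holds.
table = [
    {"zip": "4532*", "age": "30-39", "disease": "flu"},
    {"zip": "4532*", "age": "30-39", "disease": "flu"},
    {"zip": "4540*", "age": "40-49", "disease": "flu"},
    {"zip": "4540*", "age": "40-49", "disease": "diabetes"},
]

diverse = is_l_diverse(table, ["zip", "age"], "disease", l=2)
```

The result is `False`, which flags the homogeneous group that k-anonymity alone does not catch.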
Risk of privacy breach
- Attackers' prior background knowledge
- Difficult to quantify
- Bayesian analysis is the major tool
- Papers [69, 70, 71, 72, 73, 74]
Cryptographic approaches
- Using the following cryptographic primitives
- Secure multiparty computation (SMC)
- Yao's millionaires' problem
- Alice wants to know whether she has more money than Bob
- Alice and Bob cannot learn the exact amount of each other's money. Alice learns only the result
- Oblivious transfer
- Bob holds n items. Alice wants to know the i-th item.
- Bob cannot learn i (Alice's privacy)
- Alice learns nothing except the i-th item
- Homomorphic encryption
- Allows computation on encrypted data
- E.g., E(X)E(Y) = E(XY)
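As a concrete illustration of computing on encrypted data, unpadded textbook RSA is multiplicatively homomorphic: multiplying two ciphertexts gives a ciphertext of the product. The sketch uses tiny, insecure parameters chosen only for readability; it is not a usable encryption scheme, and other schemes offer the analogous property for addition instead.

```python
# Toy textbook RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                          # public modulus
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

x, y = 7, 6
# Multiply the ciphertexts without ever decrypting x or y.
product_cipher = (encrypt(x) * encrypt(y)) % n
result = decrypt(product_cipher)   # equals (x * y) mod n
```

The party doing the multiplication sees only ciphertexts; only the key holder can read the product.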
- Characteristics
- Pro: preserves total privacy
- Con: expensive, limited number of parties
- Applications: distributed datasets (the corporate model)
- All kinds of data mining algorithms
- Statistical analysis (matrix, vector computation)
- Often discussed in two-party scenarios.
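A simple multiparty example of statistical analysis over distributed data is a secure sum built on additive secret sharing. The sketch assumes semi-honest, non-colluding parties; the modulus and the private values are made up, and a real protocol would use a cryptographically secure random source and actual message passing.

```python
import random

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible sum

def make_shares(secret, n_parties, rng):
    """Split secret into n random shares that add up to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

rng = random.Random(5)           # the secrets module would be used in practice
private_values = [120, 45, 300]  # hypothetical: one private value per party
n = len(private_values)

# Party i sends its j-th share to party j; a single share looks uniformly random.
all_shares = [make_shares(v, n, rng) for v in private_values]
# Each party adds the shares it received and publishes only that partial sum.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
total = sum(partial_sums) % MODULUS
```

The parties learn the total (465 here) without any party seeing another party's individual value.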
Privacy-preserving data mining
- Privacy-preserving data classification
- Decision tree, naïve Bayes classifier
- They work on individual column distributions
- Additive perturbation can be applied
- Distance-based classifiers
- Kernel methods, SVM, linear methods
- Multiplicative perturbation can be applied
- Cryptographic protocols
- Data clustering
- Using similarity measure (distance)
- Group data items
- Privacy-Preserving Methods
- Multiplicative perturbation can be used
- Cryptographic protocols
- Association rule mining
- Transactional datasets
- Find relationships of the form a, b -> c
- Support: the probability that a, b, and c appear together in the whole dataset
- Confidence: given that a and b appear, the probability that c also appears
- Methods
- Protecting the original transactional data
- Categorical data perturbation
- Protecting sensitive rules
- rule hiding
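Support and confidence for a rule a, b -> c can be computed directly from their definitions. The five transactions below are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    whole = set(antecedent) | set(consequent)
    return support(transactions, whole) / support(transactions, antecedent)

transactions = [  # hypothetical transactional dataset
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b"},
]

rule_support = support(transactions, {"a", "b", "c"})         # 2 of 5 transactions
rule_confidence = confidence(transactions, {"a", "b"}, {"c"})  # 2 of the 3 with a and b
```

Rule hiding aims to push such values for a sensitive rule below the mining thresholds.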
- Stream mining
- Limited memory, unlimited streaming data
- Your algorithm can look at each record only once
- Analysis has to be done incrementally
- Statistical properties evolve over time
- Applications
- Monitoring the correlation between streams
- Monitoring change of clustering structures
- Adaptive classifiers
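The one-pass, limited-memory constraint can be illustrated with Welford's incremental algorithm for the mean and variance, which looks at each record once and stores only three numbers. It is a generic example of incremental analysis, not a method named in the slides, and the input values are made up.

```python
class RunningStats:
    """Mean and sample variance of a stream, updated one record at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stands in for a stream
    stats.update(value)
```

Memory use stays constant no matter how long the stream runs.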
- Privacy-preserving stream mining
- Private info in data streams
- Additive perturbation [159]
- Sensitive rules in output
- Hiding rules [160]
- Private search over data streams [155, 156]
Privacy-preserving data access
- Goal: allow a user to query a database while hiding
- The query she submitted
- The identity of the records in the result
- Motivation: patent databases, stock quotes, web access, and many more
Basic Modeling
- Server holds an n-bit string x
- n should be thought of as very large
- User wishes
- to retrieve x_i
- and to keep i private
Different Scenarios [DuAtallah2000]
Private information matching (PIM)
- Alice does not want Bob to know her query or the query result.
- Bob's database can be private or public
- Private: Alice should learn only the required content (probing by queries)
- Public: no restriction (PIMPD)
- Related work
- [132, 136, 143, 144, 145, 148]
Secure Storage Outsourcing
- Bob hosts Alice's encrypted database
- Alice needs to query the database
- Query privacy
- Result privacy
- Other clients use Alice's outsourced database (SSCO)
- Alice charges the client if she knows the client queried her database
- Possible collusion between clients and Bob: how to prevent it?
- Related work
- [138, 142, 147]
Naïve private protocol
[Figure: SERVER holds x = x1, x2, ..., xn and sends all of it to USER, who reads off xi]
Server sends entire database x to User.
Communication cost: n
Bad news: it has been proved that, with a single server, the minimum communication cost is n for an information-theoretically private protocol
The state-of-the-art
- Information-theoretic approaches
- Better protocols are available when the data are replicated in >= 2 servers
- Cryptographic protocols with cryptographic primitives
- E.g., oblivious transfer protocol
- Can be expensive
- Other protocols
- Combined with perturbation techniques
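The replicated-server idea can be sketched with a classic two-server XOR scheme, assuming the two servers hold identical copies and do not collude. This simple variant still sends a query of about n bits to each server; it illustrates only the privacy argument, not the communication savings of more refined protocols. The database bits are hypothetical.

```python
import random

def server_answer(database_bits, subset):
    """Each server returns the XOR of the bits at the requested positions."""
    answer = 0
    for j in subset:
        answer ^= database_bits[j]
    return answer

def retrieve(database_bits, i, rng):
    """Recover bit i; each server alone sees only a uniformly random subset."""
    n = len(database_bits)
    s1 = {j for j in range(n) if rng.random() < 0.5}  # random subset for server 1
    s2 = s1 ^ {i}                                     # same subset with i toggled
    a1 = server_answer(database_bits, s1)  # computed by server 1
    a2 = server_answer(database_bits, s2)  # computed by server 2 on its replica
    return a1 ^ a2                         # all positions cancel except i

rng = random.Random(6)  # a cryptographically secure source would be used in practice
x = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical n-bit database
recovered = [retrieve(x, i, rng) for i in range(len(x))]
```

Since the two subsets differ only at position i, XORing the two answers leaves exactly the bit x_i.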
Privacy in Social Network Data
- Publishing social network structure
- Attacks can be applied to reveal the mapping [163, 167]
- Characteristics of subgraph
- Adversarial background knowledge
Anonymization is the major method
Privacy in Mobile computing
- Location-based services
- Location-aware emergency response,
- Location-based advertisement,
- Location-based entertainment, etc.
- Location privacy threats
- Ads: spam
- Visits to clinics and doctors' offices: medical info
- Visits to entertainment districts: lifestyle
- Visits to political events: unpopular political views
- Physical harm: domestic abuse
- Preserving location privacy
- User-defined or system-supplied privacy policies [BambaLiu2008, BeresfordStajano2003]
- Extending k-anonymity techniques to location cloaking [GedikLiu2008, GruteserGrunwald2002]
- Pseudonymity of user identities: frequently changing internal ID [BeresfordStajano2003]
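Location cloaking with k-anonymity can be sketched as growing a grid cell around the user until it contains at least k users, then reporting the cell instead of the exact position. This is a simplified grid-based illustration, not the algorithm of any cited paper; the function name, the cell sizes, and the coordinates are hypothetical.

```python
def cloak(user, others, k, start_size=1.0, max_size=1024.0):
    """Return a square cell (x_min, y_min, size) holding the user and at least
    k - 1 other users, or None if no cell up to max_size qualifies."""
    size = start_size
    while size <= max_size:
        # Snap the user's position to the grid cell of the current size.
        x_min = (user[0] // size) * size
        y_min = (user[1] // size) * size
        inside = [p for p in others
                  if x_min <= p[0] < x_min + size and y_min <= p[1] < y_min + size]
        if len(inside) + 1 >= k:
            return (x_min, y_min, size)
        size *= 2  # not enough users yet: coarsen the cell
    return None

user = (3.2, 4.7)                                  # hypothetical exact location
others = [(3.9, 4.1), (2.5, 4.5), (7.0, 7.0)]      # hypothetical nearby users
region = cloak(user, others, k=3)
```

The service sees only `region`, which covers at least three users, so the requester cannot be singled out by location alone.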