Title: Methods of Secure Computation and Data Integration
1. Methods of Secure Computation and Data Integration
- Jerome Reiter, Duke University
- Alan Karr, NISS
- Xiaodong Lin, University of Cincinnati
- Ashish Sanil, Bristol Myers Squibb
2. General setting
- Multiple agencies seek to improve analyses by pooling their data.
- Do not want to reveal individual data values unknown to other agencies.
- Want accurate results from pooling procedures.
3. Pooling situations
- Horizontally Partitioned: Agencies have different records but same variables.
- Purely Vertically Partitioned: Agencies have same records but different variables.
- Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.
4. Horizontal partitioning
Karr, Lin, Sanil, Reiter (JCGS, 2005)
- Secure data integration
  -- shares data but protects sources.
  -- allows any analysis to be done.
- Secure summation
  -- shares sums without sharing data.
  -- allows regressions, association rules, classifications, clustering.
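As a sketch of why shared sums are enough for a regression: with horizontally partitioned data, the least-squares fit depends on the records only through totals such as the sums of x, y, x², and xy, and each total is a sum of per-agency pieces. The example below assumes a single predictor and uses made-up records for three hypothetical agencies; the plain addition in `totals` stands in for the secure summation step, which is not implemented here.

```python
# Hypothetical records (x, y): each agency holds different records, same variables.
agency_data = {
    "A": [(1.0, 3.1), (2.0, 4.9)],
    "B": [(3.0, 7.2), (4.0, 8.8), (5.0, 11.1)],
    "C": [(6.0, 13.0)],
}

def local_sums(records):
    """Sufficient statistics for simple linear regression on one agency's records."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return (n, sx, sy, sxx, sxy)

def pooled_ols(summed):
    """Least-squares intercept and slope computed from pooled totals only."""
    n, sx, sy, sxx, sxy = summed
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Stand-in for secure summation: each total is just the sum of the local pieces.
per_agency = [local_sums(records) for records in agency_data.values()]
totals = tuple(sum(parts) for parts in zip(*per_agency))

intercept, slope = pooled_ols(totals)
print(intercept, slope)
```

No agency's individual records enter `pooled_ols`; only the five totals do, which is what makes the summation approach sufficient for this kind of model.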
5. Secure summation
- Obtain $\sum_j x_j$ without sharing individual values $x_j$.
- Agency A passes $(x_A + R)$, where $R$ is a random number known only to Agency A, to the 2nd agency.
- Agency B adds its x to this value and passes the sum to Agency C.
- Process continues until all agencies have added their x.
- Agency A subtracts R from the sum.
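The ring-passing steps above can be sketched in a few lines. This is a minimal single-process simulation, not a networked protocol. Two details are assumptions added for the illustration and do not come from the slides: the values are non-negative integers, and the running total is kept modulo a large number (`modulus`) so that the masked value looks uniformly random to each intermediate agency.

```python
import random

def secure_sum(values, modulus=2**61 - 1, rng=None):
    """Simulate ring-pass secure summation.

    values[0] belongs to the initiating agency (Agency A); the remaining
    entries belong to the agencies that receive the running total in turn.
    Assumes non-negative integers whose true sum is below `modulus`.
    """
    rng = rng or random.SystemRandom()
    r = rng.randrange(modulus)            # secret mask R, known only to Agency A
    running = (values[0] + r) % modulus   # Agency A passes x_A + R
    for x in values[1:]:
        running = (running + x) % modulus # each agency adds its own x and passes it on
    return (running - r) % modulus        # Agency A subtracts R to recover the sum

total = secure_sum([12, 7, 30])
print(total)  # 49
```

Each intermediate agency sees only a masked running total. The sketch assumes the agencies follow the steps and do not collude; two neighbours comparing the values they send and receive could recover the value held by the agency between them.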
6. Purely vertical partitioning
- Secure dot/matrix product
  -- shares dot/matrix products without sharing data.
  -- allows regressions, association rules, classification, clustering.
  -- assumes semi-honest agencies.
- Synthetic data approaches
  -- share synthetic copies of data across agencies.
  -- allows any analysis when distributions used to generate data are accurate.
  -- generates public use data file.
7. Secure dot/matrix products
Karr, Lin, Reiter, Sanil (NISS tech. report)
- Compute $X^T Y$ without revealing individual values.
- Agency A passes $Z = (Z_1, \dots, Z_g)$, where $Z_i^T X_j = 0$ for all i,j, to Agency B.
- Agency B sends $W = (I - Z Z^T) Y$ to Agency A.
- Agency A computes $X^T W = X^T Y$.
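One way an exchange of this kind can work is sketched below for the simplest case, a dot product of two vectors. The construction is an assumption made for illustration, since the formulas are not legible in this copy of the slides: Agency A sends masking vectors that are orthogonal to its own x, and Agency B removes from y its components along those vectors before sending the result back. The function names and the choice of `g` masks are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_masks(x, g, rng):
    """Agency A: build g orthonormal vectors, each orthogonal to x (Gram-Schmidt).

    Requires x to be non-zero and g < len(x).
    """
    norm_x = math.sqrt(dot(x, x))
    unit_x = [a / norm_x for a in x]
    masks = []
    while len(masks) < g:
        v = [rng.gauss(0.0, 1.0) for _ in x]
        for u in [unit_x] + masks:        # project out x and the earlier masks
            c = dot(v, u)
            v = [a - c * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:                   # skip numerically degenerate draws
            masks.append([a / norm for a in v])
    return masks

def mask_y(y, masks):
    """Agency B: w = y minus its components along each mask, i.e. (I - Z Z^T) y."""
    w = list(y)
    for z in masks:
        c = dot(z, y)
        w = [a - c * b for a, b in zip(w, z)]
    return w

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0, 5.0]             # held by Agency A, never sent
y = [2.0, 0.0, 1.0, 3.0, 1.0]             # held by Agency B, never sent
masks = orthonormal_masks(x, g=2, rng=rng)  # A -> B
w = mask_y(y, masks)                        # B -> A
recovered = dot(x, w)                       # equals x . y up to rounding error
print(recovered)
```

Because every mask is orthogonal to x, the part of y that Agency B removes contributes nothing to the dot product, so Agency A recovers x·y from w without seeing y itself. How much each side can still infer depends on `g` and is not analysed in this sketch.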
8. Purely vertical partitioning
- Secure dot/matrix product
  -- shares dot/matrix products without sharing data.
  -- allows regressions, association rules, classification, clustering.
  -- assumes semi-honest agencies.
- Synthetic data approaches
  -- share synthetic copies of data across agencies.
  -- allows any analysis when distributions used to generate data are accurate.
  -- generates public use data file.
9. Synthetic data approach
Kohnen (PhD thesis, 2005)
- Assume X is not sensitive.
- Agency A passes the real X to Agency B.
- Agency B simulates multiple copies of Y from $f(Y \mid X)$, estimated using the dataset from Agency A, and passes the copies to Agency A.
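Agency B's step can be illustrated with a deliberately simple model. The sketch below assumes a single predictor and a normal linear regression for f(Y | X), fitted by least squares, with the fitted parameters plugged in directly. These modelling choices and the data are hypothetical; a fuller treatment would typically also propagate uncertainty in the fitted parameters when drawing the synthetic values.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x, plus the residual standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sigma = (sum(e * e for e in resid) / (n - 2)) ** 0.5
    return b0, b1, sigma

def synthesize_y(xs, ys, m, rng):
    """Agency B: draw m synthetic copies of Y from the fitted f(Y | X)."""
    b0, b1, sigma = fit_line(xs, ys)
    return [[rng.gauss(b0 + b1 * x, sigma) for x in xs] for _ in range(m)]

# Hypothetical data: xs received from Agency A, ys held privately by Agency B.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0]

copies = synthesize_y(xs, ys, m=5, rng=random.Random(1))
print(len(copies), len(copies[0]))  # 5 6
```

Only `copies` would be passed back to Agency A; the genuine `ys` stay with Agency B.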
10. Synthetic data approach
Kohnen (PhD thesis, 2005)
- Agency A uses partially synthetic data methods (Reiter, Surv. Meth., 2003) for inferences based on $Y \mid X$.
- Agency A can release fully synthetic data to the public.
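For inference from the m synthetic copies, the partially synthetic data methods cited above combine the per-copy results: the point estimate is the average of the m estimates, and its variance is estimated by the average within-copy variance plus the between-copy variance divided by m. The sketch below shows that combination for a scalar quantity; the function name and the input numbers are made up for illustration.

```python
from statistics import mean, variance

def combine_partially_synthetic(estimates, variances):
    """Combine per-copy point estimates and their estimated variances.

    estimates[i] and variances[i] come from analysing the i-th synthetic copy.
    Returns the combined point estimate and its estimated variance.
    """
    m = len(estimates)
    q_bar = mean(estimates)       # combined point estimate
    b = variance(estimates)       # between-copy variance (divisor m - 1)
    u_bar = mean(variances)       # average within-copy variance
    t = u_bar + b / m             # variance estimate for q_bar
    return q_bar, t

# Hypothetical regression-slope estimates from m = 3 synthetic copies.
q_bar, t = combine_partially_synthetic([1.9, 2.1, 2.0], [0.04, 0.05, 0.045])
print(q_bar, t)
```

The same function can be applied coefficient by coefficient to a fitted regression of Y on X.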
11. Synthetic data approaches
Kohnen (PhD thesis, 2005)
- Agency A simulates disguiser X values that look like the genuine values of X, ideally from a distribution close to $f(X \mid Y)$. Agency A passes the real X and the disguisers to Agency B.
- Agency B simulates multiple copies of Y for each X dataset from $f(Y \mid X)$, estimated using the datasets from Agency A, and passes the copies to Agency A.
12. Synthetic data approaches
Kohnen (PhD thesis, 2005)
- Agency A discards disguisers and uses partially synthetic data methods (Reiter, Surv. Meth., 2003) to obtain inferences using the real X.
- Agency A can release fully synthetic data to the public.
13. Partially overlapping, vertical partitioning
- Secure EM algorithm
  -- uses secure dot products.
  -- continuous data: estimate covariance matrix for multivariate normal data.
  -- categorical data: estimate parameters of log-linear models.
14. Limitations of methods
Defining a research agenda
- Secure computation methods
  - How to specify models without viewing data?
  - What if sophisticated models are needed?
  - How to do posterior simulation?
- Synthetic data methods
  - How to generate good disguisers?
- All methods
  - How to incorporate matching errors, differences in data quality and definitions?
  - How to account for disclosure risks from models that fit too well?