New Measures of Data Utility - PowerPoint PPT Presentation

Provided by: mija
1
New Measures of Data Utility
  • Mi-Ja Woo
  • National Institute of Statistical Sciences

2
Question: How to evaluate the characteristics
of SDL methods?
  • Previously, data utility measures were studied in
    the context of moments and linear regression
    models.
      - Differences in inferences obtained from
        the original and masked data.
      - The regression model and KL distance rely on
        the multivariate normality assumption.
  • Questions
      - Is the assumption satisfied in realistic
        situations?
      - What if the assumption is violated?
  • Example

3
Example: Two-dimensional original data and two
masked data sets produced by the synthetic and resampling methods.
4
  • Different distributions, but the same moments and
    estimates of regression coefficients.
  • New measures are needed.

5
1. CDF utility measure
  • Extension of the univariate case.
  • Kolmogorov statistic (MD)
  • Cramér-von Mises statistic (MCM)
  • Both statistics compare the empirical distributions
    of the original and masked data. Large MD and MCM
    indicate that the two data sets are distributed differently.
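
As a rough sketch of the two CDF statistics, the Python below assumes the usual forms: MD is the largest absolute difference between the two empirical distribution functions, and MCM is the mean squared difference. Evaluating both at every record of the pooled data, and the function names `ecdf` and `cdf_utility`, are assumptions made for illustration and are not taken from the slides.

```python
def ecdf(data, point):
    """Fraction of records in `data` that are <= `point` in every coordinate."""
    hits = sum(all(v <= p for v, p in zip(row, point)) for row in data)
    return hits / len(data)


def cdf_utility(original, masked):
    """Return (MD, MCM), evaluated at every record of the pooled data.

    Assumed definitions (not confirmed by the slides):
      MD  = max  |S_orig(x) - S_masked(x)|
      MCM = mean (S_orig(x) - S_masked(x))^2
    """
    pooled = list(original) + list(masked)
    diffs = [ecdf(original, pt) - ecdf(masked, pt) for pt in pooled]
    md = max(abs(d) for d in diffs)
    mcm = sum(d * d for d in diffs) / len(diffs)
    return md, mcm


# Toy two-dimensional data: a masked copy identical to the original,
# and a masked copy shifted far away from it.
original = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
shifted = [(x + 10.0, y + 10.0) for x, y in original]

print(cdf_utility(original, original))  # both statistics are zero
print(cdf_utility(original, shifted))   # both statistics are large
```

This brute-force version costs on the order of N² comparisons, so for data of the size used in the simulation (n = 10,000) a sorted or vectorized evaluation would be preferable.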

6
2. Cluster Data Utility
  • A loose definition of clustering could be the
    process of organizing objects into groups whose
    members are similar in some way.
  • A cluster is therefore a collection of objects
    which are similar to one another and are
    dissimilar to the objects belonging to other
    clusters.
  • A data set is said to be randomly assigned when the
    proportion of observations from the original data in
    each cluster is constant (1/2 when the two groups
    have equal numbers of observations).
  • The measure is computed from the total number of records,
    the number of records from the original data,
    and the weight assigned to the i-th cluster.
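
The cluster measure can be sketched as follows, assuming it is the weighted average, over clusters, of the squared gap between each cluster's share of original records and the constant expected under random assignment. The exact formula is not legible here, so this form, the equal default weights, and the function name `cluster_utility` are assumptions. The clustering step itself is left out: `labels` is taken to come from any clustering of the pooled original and masked records.

```python
from collections import Counter


def cluster_utility(labels, is_original, c=None, weights=None):
    """Weighted mean squared deviation of per-cluster original shares from c.

    labels      : cluster label of each pooled record
    is_original : True if the record comes from the original data
    c           : expected share under random assignment
                  (defaults to the overall share of original records)
    weights     : dict mapping cluster label -> weight
                  (defaults to 1 for every cluster; an assumption)
    """
    n_total = Counter(labels)
    n_orig = Counter(lab for lab, orig in zip(labels, is_original) if orig)
    if c is None:
        c = sum(is_original) / len(labels)
    clusters = sorted(n_total)
    if weights is None:
        weights = {j: 1.0 for j in clusters}
    total = sum(
        weights[j] * (n_orig[j] / n_total[j] - c) ** 2 for j in clusters
    )
    return total / len(clusters)


labels = [0, 0, 1, 1]

# Each cluster holds one original and one masked record: well mixed.
print(cluster_utility(labels, [True, False, True, False]))

# Original and masked records fall into separate clusters: poorly mixed.
print(cluster_utility(labels, [True, True, False, False]))
```

Values near zero indicate that original and masked records are mixed evenly across clusters, which is the random-assignment condition described in the bullets above.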

7
3. Propensity Score Data Utility
  • A propensity score is generally defined as the
    conditional probability of assignment to a
    particular treatment given a vector of observed
    covariates (Rosenbaum and Rubin 1983).
  • A data set is said to be randomly assigned when
    the propensity score is constant for every
    observation (1/2 when the two groups have equal
    numbers of observations).
  • In the propensity score method, a propensity
    score is estimated for each observation from its
    covariates, and utility is measured by how far the
    estimated scores deviate from that constant.
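
One common way to write such a measure, consistent with the random-assignment idea above, is the mean squared deviation of the estimated scores from the constant. The symbols below are notation chosen here for illustration and are not taken from the slide: N is the number of records in the combined data, \hat{p}_i is the estimated propensity score of record i, and c is the share of masked records (1/2 for equal group sizes).

```latex
U_p = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}_i - c \right)^2
```

Under this form, U_p is zero when every estimated score equals c, and it grows as the original and masked records become easier to tell apart.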

8
Estimation of propensity scores
  • Combine the original and masked data sets, and create
    an indicator variable Rj with the value 0 for
    observations from the original data and 1 otherwise.
      1) Logistic regression model.
      2) Tree model.
      3) Modified logistic regression model: classify
         all data points into g groups, and fit a
         logistic model for each group. It combines the
         logistic model with clustering, and so borrows
         the strengths of both.
  • Cluster utility is one form of propensity score
    utility.
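
The first estimation route can be sketched in plain Python as below: pool the two data sets, set the indicator to 0 for original and 1 for masked records, fit a logistic regression, and summarize how far the fitted scores sit from 1/2. The gradient-ascent fitter, its step size and iteration count, the linear-only model, and all function names are assumptions made for this illustration; the presentation does not say how its models were fitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def fit_logistic(rows, labels, steps=2000, lr=0.1):
    """Fit a linear logistic model by plain gradient ascent.

    Returns [intercept, coef_1, ..., coef_d].
    """
    beta = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, r in zip(rows, labels):
            err = r - sigmoid(linear_predictor(beta, x))
            grad[0] += err
            for k, v in enumerate(x):
                grad[k + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(original, masked):
    """Pool the data, build the indicator R (0 = original, 1 = masked), fit, score."""
    rows = list(original) + list(masked)
    indicator = [0] * len(original) + [1] * len(masked)
    beta = fit_logistic(rows, indicator)
    return [sigmoid(linear_predictor(beta, x)) for x in rows]


def propensity_utility(scores, c=0.5):
    """Mean squared deviation of the scores from c (assumed summary)."""
    return sum((p - c) ** 2 for p in scores) / len(scores)


original = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
same = list(original)                                # indistinguishable copy
shifted = [(x + 3.0, y + 3.0) for x, y in original]  # easy to tell apart

print(propensity_utility(propensity_scores(original, same)))
print(propensity_utility(propensity_scores(original, shifted)))
```

With an identical masked copy every fitted score stays at 1/2 and the summary is zero; with the shifted copy the scores move toward 0 and 1 and the summary approaches its maximum of 0.25.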

9
4. Simulation
  • Eight different types of two-dimensional data
    with n = 10,000:
      1) Symmetric / non-symmetric
      2) High / low correlation
      3) Negative / positive correlation
  • Masking strategies considered: synthetic,
    microaggregation, microaggregation followed by
    noise, rank swapping, and resampling.
  • Computational details
      1) Cluster utility: g = 500 (5%) and
         g = 1,000 (10%).
      2) Propensity score utility with logistic model

10
  • 3) Propensity score utility with tree model:
    the tree sizes considered are set by the complexity
    parameter, cp = 0.001 and 0.0001. That is, any
    split that does not decrease the overall lack of
    fit by a factor of cp is not attempted.
  • 4) Propensity score utility with modified
    logistic model: the number of groups is g = 100
    (1%), and linear and quadratic logistic
    functions are used to fit the logistic regression
    models.

11
Results: Symmetric high negative case.
12
Symmetric low negative case.
13
Non-symmetric high negative case.
14
Non-symmetric low negative case.
15
Summary
  • CDF utility
      1) Does not involve parameters.
      2) It is favorable to the rank swapping SDL
         method.
  • Cluster utility
      1) Does not measure the differences between the
         structures of the original and masked data
         within a cluster (within-cluster variation).
      2) Generally, it is consistent with the overall
         results.
      3) For non-symmetric cases, a large number of
         clusters tends to produce worse utility for
         data masked by the microaggregation method,
         since there are three overlaps in the
         microaggregated data.

16
  • Propensity score with logistic model
      1) The choice of degree is crucial.
      2) It is hard to deal with high-dimensional
         data.
  • Propensity score with tree model
      1) A small tree cannot distinguish the utility
         of Rank from that of Resample.
      2) A large tree leads to bad utility for the
         microaggregation method. In some cases, a
         large tree cannot partition the space for the
         Rank method.
      3) It is favorable to the Rank SDL method.
  • Propensity score with modified logistic model
      1) It has both the advantages and the
         disadvantages of the logistic model and of
         clustering, since it combines the cluster and
         propensity score utilities.
      2) It appears consistent with the overall
         results for all data structures.

17
END