Title: Slide 1
1 STATISTICAL CONFIDENTIALITY IN LONGITUDINAL LINKED DATA: OBJECTIVES AND ATTRIBUTES
Mario Trottini
University of Alicante (Spain)
mario.trottini_at_ua.es
Joint UNECE/Eurostat Work Session on Statistical
Confidentiality, Geneva 9-11 November 2005
2 Problem Definition
Longitudinal Linked Microdata
Microdata that contain observations from two or
more related sampling frames, with measurements
for multiple time periods for all units of
observation (Abowd and Woodcock 2004)
- How to create the data set?
- How to disseminate the data?
4 Data Dissemination: Why Is It Difficult?
A data dissemination procedure should:
1. Allow legitimate users to perform
statistical analyses as if they were using the
original data
2. Control the risk of misuse of the data by
potential intruders
3. Be operational
Two issues
(i) Objectives are too ambiguous
(ii) Objectives are conflicting
5 Data Dissemination as a Decision Problem
Step (1) Identify the alternatives
Step (2) Structure the objectives
Step (3) Define suitable attributes
Step (4) Assess the trade-off
between the fundamental objectives
7 Outline
- Identify the alternatives: review of existing
  data dissemination procedures
- Structuring the objectives
  - Theory
  - Current practice
- Selecting attributes
  - Theory
  - Current practice
8 Identifying the Alternatives
Let $\mathcal{M} = \{M_k,\ k \in E\}$ denote the class of
alternative data dissemination procedures
Two rationales
- Data users and data users' needs are very diverse
  (Mackie and Bradburn 2000)
- Combining different methods can produce greater data utility
  for any level of disclosure risk (Abowd and Lane 2003)
9 Identifying the Alternatives
Let $\mathcal{M} = \{M_k,\ k \in E\}$ denote the class of
alternative data dissemination procedures
MORE REALISTIC APPROACH: $M_k$ should be a
combination of 1-5
10 Structuring the Objectives: Theory
Information Organization
Overall Objective: The best data dissemination
- Maximize safety
- Minimize cost
- Maximize usefulness
Too broad and ambiguous to be of operational use
STRATEGY: Divide an objective into lower-level
objectives that clarify the
interpretation of the broader objective
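As a minimal sketch of this strategy, an objectives hierarchy can be held as a small tree in which each broad objective is divided into lower-level objectives. The `Objective` class and its methods are hypothetical names introduced only for illustration; the overall objective and its three sub-objectives are the ones named on the slide, and any deeper levels are left to the analyst.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Objective:
    """A node in an objectives hierarchy (hypothetical helper class)."""
    name: str
    children: List["Objective"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the lowest-level objectives, i.e. the ones that need attributes."""
        if not self.children:
            return [self.name]
        result: List[str] = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def render(self, depth: int = 0) -> str:
        """Return an indented text view of the hierarchy."""
        lines = ["  " * depth + self.name]
        for child in self.children:
            lines.append(child.render(depth + 1))
        return "\n".join(lines)


# Only the nodes named on the slide are included here.
hierarchy = Objective(
    "The best data dissemination",
    [
        Objective("Maximize safety"),
        Objective("Minimize cost"),
        Objective("Maximize usefulness"),
    ],
)

print(hierarchy.render())
```

Listing the leaves makes it easy to check that every lowest-level objective has an attribute attached before any trade-off is assessed.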
11 An Illustration
Usefulness
The data dissemination procedure should allow
legitimate data users to perform the statistical
analyses of interest as if they were using the
data set originally collected.
Sources of ambiguity
12 The Hierarchy
[Figure: objectives hierarchy rooted at "Maximize Usefulness"]
13 Structuring the Objectives: Current Practice
- Implicit hierarchy is often incomplete
- However, only a few of the objectives are
  taken into account in applications
Transparency, accessibility, feasibility are
often not considered
14 An Illustration
DATA MASKING
ORIGINAL MICRODATA: $D_{ORIG}$
1) Apply some transformation, $T$, to the data: $D_{REL} = T(D_{ORIG})$
2) Release to the user $D_{MASKED} = (D_{REL}, I(T))$
Usefulness assessment: $\Delta = F(D_{ORIG}) - F(D_{MASKED})$
IGNORING TRANSPARENCY!
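The two masking steps and the usefulness assessment can be sketched as follows. This is only an illustration under stated assumptions: the transformation T is taken to be additive Gaussian noise, the feature F is the sample mean, I(T) is read as the information released about T (here, the noise standard deviation), and the data values are made up. None of these choices come from the slide.

```python
import random
import statistics


def mask(d_orig, noise_sd, seed=0):
    """Step 1: apply a transformation T (assumed here: additive Gaussian noise)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_sd) for x in d_orig]


def usefulness_gap(d_orig, released_values, feature=statistics.mean):
    """Delta = F(original data) - F(released data) for a chosen feature F."""
    return feature(d_orig) - feature(released_values)


d_orig = [21.0, 35.5, 48.2, 50.1, 62.7]  # made-up values for illustration
noise_sd = 2.0
d_rel = mask(d_orig, noise_sd)

# Step 2: release the masked data together with information about T.
d_masked = {"data": d_rel, "info_T": {"kind": "gaussian noise", "sd": noise_sd}}

# The assessment below looks only at the released values and never uses
# info_T, which is the sense in which it ignores transparency.
delta = usefulness_gap(d_orig, d_masked["data"])
print(f"Delta for the mean: {delta:.3f}")
```

An assessment that accounted for transparency would let the user exploit `info_T` (for example, to correct variance estimates for the added noise) before comparing results with those from the original data.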
15 General Guidelines for Structuring the Objectives
- Definitions of safety, usefulness and cost
are problem-dependent.
- However, providing a clear definition of them in
any specific data dissemination problem is crucial for the
quality of the final decision.
- The use of hierarchies could be very beneficial
in terms of
1. clarifying the interpretation of the relevant
objectives
2. checking that no relevant aspects of the
problem have been ignored
3. facilitating communication
16 Selecting Attributes: Theory
- Natural attributes
- Constructed Attributes
- Proxy attributes
17 Selecting Attributes: Theory
- Natural attributes
- Constructed Attributes
- Proxy attributes
Example
Objective: Minimize Cost
(Natural) attribute: Cost in Euros
18 The Hierarchy
[Figure: objectives hierarchy rooted at "Maximize Usefulness"]
19 Selecting Attributes: Theory
- Natural attributes
- Constructed attributes: a "subjective scale" constructed out of
several aspects typically associated with the
objective of interest.
- Proxy attributes
20
| Attribute level | Description of attribute level |
|---|---|
| 1 Support | No groups are opposed to the facility and at least one group has organized support for the facility. |
| 0 Neutrality | All groups are indifferent or uninterested. |
| -1 Controversy | One or more groups have organized opposition, although no groups have action-oriented opposition. Other groups may either be neutral or support the facility. |
| -2 Action-oriented opposition | Exactly one group has action-oriented opposition. The other groups have organized support, indifference, or organized opposition. |
| -3 Strong action-oriented opposition | Two or more groups have action-oriented opposition. |

Table 1. Constructed attribute for public
attitudes. (Keeney and Gregory 2005)
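To show how a constructed attribute like the one in Table 1 turns qualitative descriptions into a scale, the level definitions can be written as a small classification rule. This is a sketch only: the `Attitude` enum and the function name are hypothetical, and it assumes each group holds exactly one of the four stances mentioned in the table.

```python
from enum import Enum


class Attitude(Enum):
    """Stance of one group (assumed exhaustive for this sketch)."""
    ORGANIZED_SUPPORT = "organized support"
    INDIFFERENT = "indifferent or uninterested"
    ORGANIZED_OPPOSITION = "organized opposition"
    ACTION_OPPOSITION = "action-oriented opposition"


def public_attitude_level(groups):
    """Map the stances of all groups to the attribute level of Table 1."""
    action = sum(1 for g in groups if g is Attitude.ACTION_OPPOSITION)
    if action >= 2:
        return -3  # strong action-oriented opposition
    if action == 1:
        return -2  # action-oriented opposition
    if any(g is Attitude.ORGANIZED_OPPOSITION for g in groups):
        return -1  # controversy
    if any(g is Attitude.ORGANIZED_SUPPORT for g in groups):
        return 1   # support
    return 0       # neutrality


print(public_attitude_level([Attitude.ORGANIZED_SUPPORT, Attitude.INDIFFERENT]))
```

Because every level has a verbal description, a statement such as "the level is -1" can be read back directly as "there is organized but not action-oriented opposition".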
21 Selecting Attributes: Theory
- Natural attributes
- Constructed attributes: a "subjective scale" constructed out of
several aspects typically associated with the
objective of interest.
Defining feature: interpretability
- Proxy attributes
22 Selecting Attributes: Theory
- Natural attributes
- Constructed Attributes
- Proxy attributes: reflect the degree to which an associated
objective is met but do not directly measure
the objective.
23 Proxy Attributes for Usefulness in SDC
GENERAL FORMULATION
$D_{ORIG}$ = ORIGINAL DATA
$D_{REL}$ = DISSEMINATED DATA
$F(\text{Data})$ = some feature of Data
PROXY = DISCREPANCY( $F(D_{ORIG})$, $F(D_{REL})$ )
INTUITION: Low distortion of the data implies
nearly correct inferences for nearly all
statistical analyses
24 Proxy Attributes for Usefulness in SDC
PROXY = DISCREPANCY( $F(D_{ORIG})$, $F(D_{REL})$ )

| F | DISCREPANCY |
|---|---|
| Summary statistics | Absolute (relative) difference, percentage variation, mean variation, etc. |
| Density estimation | Hellinger distance, Kullback-Leibler divergence, other distances |
| Model-based inferences (estimation, prediction, model selection) | Difference in parameter estimates, interval overlaps, discrepancy in model ranking, etc. |

Proxy as discrepancy between summary statistics: Domingo-Ferrer and Torra (2001), Yancey W.E. et al. (2002),
Oganyan, A. (2003), Grup Crises (2004)
Proxy as discrepancy between distributions: Agrawal
and Aggarwal (2001), Gomatam et al. (2004),
Karr et al. (2005)
Inference-based proxy: Gomatam et al. (2004),
A.F. Karr et al. (2005)
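As an illustration of a distribution-based proxy, the sketch below computes the Hellinger distance and the Kullback-Leibler divergence between histogram estimates of the original and released data. The helper names, the bin edges, and the data values are assumptions made for the example; the cited papers may define and estimate these quantities differently.

```python
import math
from collections import Counter


def histogram(values, edges):
    """Relative frequencies over bins [edges[i], edges[i+1]).

    Values outside the bins are ignored; at least one value must fall inside.
    """
    counts = Counter()
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts.values())
    return [counts[i] / total for i in range(len(edges) - 1)]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); infinite if q is 0 where p > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total


d_orig = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # made-up original values
d_rel = [1, 2, 3, 3, 3, 4, 4, 5, 5]   # made-up released values
edges = [1, 2, 3, 4, 5, 6]

p = histogram(d_orig, edges)
q = histogram(d_rel, edges)
print(f"Hellinger: {hellinger(p, q):.3f}  KL: {kl_divergence(p, q):.3f}")
```

Both numbers summarize distortion of the distribution as a whole; neither says by itself how a particular analysis, such as a regression fit, would be affected.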
25 Selecting Attributes: Theory
- Natural attributes
- Constructed Attributes
- Proxy attributes: require some understanding
of the relationship between
the objective of interest and
the associated objective
measured by the proxy.
26 An Illustration
Goal: Assessing the trade-off between
maximizing usefulness and maximizing safety for a
given level c of cost
- Attribute for usefulness (information loss): Hellinger distance (IL)
- Attribute for safety (disclosure risk): % of
records correctly re-identified (DR)
What does $\Delta(IL) = 0.1$ mean in terms of fitting a
regression model?
Data dissemination 1, $D_1$: $IL(D_1) = 0.4$, $DR(D_1) = 1$
Data dissemination 2, $D_2$: $IL(D_2) = 0.5$, $DR(D_2) = 0.5$
$(DR(D_1) - 0.5,\ IL(D_1) + 0.1,\ c)$ ???? $(DR(D_1),\ IL(D_1),\ c)$
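The comparison between D1 and D2 can be worked through numerically. The sketch below uses the IL and DR values from the slide and checks whether either option dominates the other; the `Outcome` type and the `dominates` helper are hypothetical names, and both attributes are treated as "lower is better".

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    """Consequences of one dissemination procedure (lower is better for both)."""
    il: float  # information loss (Hellinger distance)
    dr: float  # disclosure risk (records re-identified, in the slide's units)


def dominates(a: Outcome, b: Outcome) -> bool:
    """True if a is at least as good as b on both attributes and better on one."""
    return a.il <= b.il and a.dr <= b.dr and (a.il < b.il or a.dr < b.dr)


d1 = Outcome(il=0.4, dr=1.0)
d2 = Outcome(il=0.5, dr=0.5)

# Moving from D1 to D2 trades some usefulness for some safety.
delta_il = d2.il - d1.il
delta_dr = d2.dr - d1.dr
print(f"change in IL: {delta_il:+.1f}, change in DR: {delta_dr:+.1f}")

# Neither option dominates, so the choice needs a value judgement about
# how much information loss is worth a given reduction in risk.
print(dominates(d1, d2), dominates(d2, d1))
```

Since neither option dominates, the ranking depends on what a 0.1 increase in Hellinger distance means for the analyses users care about, which a proxy attribute does not state directly.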
27 Attribute Selection: Theory and Current Practice
THEORY
Prescriptive order in attribute selection
1. Natural attributes
2. Constructed attributes
3. Proxy attributes
28 Conclusions
"There is a tendency in all problem solving to
move quickly away from the ill-defined to the
well-defined, from constraint-free thinking to
constrained thinking. There is a need to feel,
and perhaps even to measure, progress toward
reaching a 'solution' to a decision problem."
(Keeney, 1992, page 9)
- In this talk it is argued that too little effort
has been made toward a comprehensive definition of
the data dissemination problem in terms of
  - alternatives
  - objectives
  - attributes
29 Conclusions (Cont.)
- Hierarchies and constructed attributes could
be useful tools to address these problems.
- Although the discussion has not focused on the
dissemination of longitudinal linked data as much as desired, I
think it is particularly relevant for this type
of data given
  - the complexity of the modeling
  - the multiple decision makers involved
  - the different perspectives on
    disclosure and utility that must be accommodated in the
    final decision.
30 Acknowledgements
Preparation of this paper was supported by the
U.S. National Science Foundation under Grant
EIA-0131884 to the National Institute of
Statistical Sciences. The contents of the paper
reflect the author's personal opinion. The
National Science Foundation is not responsible
for any views or results presented.
31 THANK YOU!