Title: Statistical Disclosure Limitation Beyond the Margins
1Statistical Disclosure Limitation Beyond the Margins
- Making Inferences from Arbitrary Sets of Conditionals and Marginals for Contingency Tables
- Aleksandra B. Slavkovic
- Carnegie Mellon University
- September 16, 2003
2Our Goal
- Determine safe releases in terms of arbitrary sets of marginals and conditionals
- Risk measure is the ability to identify small cell counts
- Investigate conditions under which sets of marginals and conditionals give
- unique specifications
- upper/lower bounds on cell entries
- distributions over the cell entries
- Determine/compute the bounds and distributions
3Outline
- Two-way tables
- Uniqueness
- LP/IP bounds on two-way tables
- Higher-way tables
- Extensions of
- Uniqueness
- LP/IP two-way bounds
- Bounds using graphical models
- Algebraic geometry
- Markov basis
- Bounds
- Distributions
- Model based characterization
4Uniqueness: Complete specification of the joint
- Uniqueness Theorem for k-way tables
- Two-way table (Gelman & Speed 1993, Arnold et al. 1999)
- f(x), f(y|x)
- f(x|y), f(y|x)
- Arnold et al. 1999
- Sometimes f(y), f(y|x)
- Define the missing marginal
- Vardi & Lee (1993) algorithm
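As an illustration of why f(x|y) together with f(y|x) pins down the joint, the sketch below uses the identity P(X=i|Y=j) / P(Y=j|X=i) = f(x_i) / f(y_j), which holds when all conditional values are strictly positive. The function name and the 2x2 example values (the table with probabilities 0.3, 0.2, 0.1, 0.4 used later in the deck) are choices made for this sketch, not part of the original slides.

```python
from fractions import Fraction as F

def joint_from_conditionals(x_given_y, y_given_x):
    """x_given_y[i][j] = P(X=i | Y=j); y_given_x[i][j] = P(Y=j | X=i).
    Assumes every conditional value is strictly positive."""
    n_rows = len(y_given_x)
    # The ratio equals f(x_i) / f(y_j); fix j = 0 and normalise over i to get f(x).
    ratios = [x_given_y[i][0] / y_given_x[i][0] for i in range(n_rows)]
    total = sum(ratios)
    f_x = [r / total for r in ratios]
    # Joint: p_ij = f(x_i) * P(Y=j | X=i)
    return [[f_x[i] * q for q in y_given_x[i]] for i in range(n_rows)]

x_given_y = [[F(3, 4), F(1, 3)], [F(1, 4), F(2, 3)]]
y_given_x = [[F(3, 5), F(2, 5)], [F(1, 5), F(4, 5)]]
joint = joint_from_conditionals(x_given_y, y_given_x)
print(joint)
```

Exact fractions are used so that the recovered joint can be compared with the original table without rounding issues.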
5Uniqueness: Complete specification of the joint
- Prop: A unique solution exists for an I x J table with I ≥ J, given f(x|y) and f(x)
- Unique solution for I x 2
- 2 x 2 table, release f(x|y), f(x)
Joint table with the X margin:
       Y=1   Y=2   f(x)
X=1    p11   p12   p1+
X=2    p21   p22   p2+
Conditional table X|Y:
p11 p12
p21 p22
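For the I x 2 case, the missing marginal f(y) can be solved for directly, because f(x_i) = P(X=i|Y=1) t + P(X=i|Y=2) (1 - t) with t = f(y_1). The sketch below is a minimal illustration under that assumption; the function name and the example values are hypothetical, chosen to match the 2x2 table with probabilities (0.3, 0.2, 0.1, 0.4).

```python
from fractions import Fraction as F

def joint_from_x_given_y_and_x(x_given_y, f_x):
    """I x 2 case: x_given_y[i] = (P(X=i|Y=1), P(X=i|Y=2)), f_x[i] = P(X=i).
    Solves f(x_i) = c_i1 * t + c_i2 * (1 - t) for t = f(y_1)."""
    for c, fx in zip(x_given_y, f_x):
        if c[0] != c[1]:
            t = (fx - c[1]) / (c[0] - c[1])
            break
    else:
        raise ValueError("columns of f(x|y) are identical; f(y) is not identified")
    f_y = [t, 1 - t]
    return [[c[j] * f_y[j] for j in range(2)] for c in x_given_y]

x_given_y = [[F(3, 4), F(1, 3)], [F(1, 4), F(2, 3)]]
f_x = [F(1, 2), F(1, 2)]
joint = joint_from_x_given_y_and_x(x_given_y, f_x)
print(joint)
```

The error branch covers the degenerate case in which both columns of the conditional table coincide, where the marginal f(x) carries no information about f(y).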
6I x J Tables Summary
Queries           Assumption   Unique   Assumption   Bounds
f(x|y), f(y)                   ✓
f(x|y), f(y|x)                 ✓
f(x), f(y)        X ⊥ Y        ✓        X not ⊥ Y    $\max\{0,\ x_{i+} + x_{+j} - n\} \le x_{ij} \le \min\{x_{i+},\ x_{+j}\}$
f(x|y), f(x)      I ≥ J        ✓        I < J
f(x)                                                 $0 \le x_{ij} \le x_{i+}$
f(x|y)
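The bounds in the f(x), f(y) row are the classical Fréchet bounds. A minimal sketch of how they could be computed from the two one-way margins follows; the function name is hypothetical, and the example margins are those of the 2x2 table (15, 10, 5, 20) used elsewhere in the deck.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds on every cell x_ij given both one-way margins."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("margins must have the same grand total")
    return [[(max(0, r + c - n), min(r, c)) for c in col_totals]
            for r in row_totals]

bounds = frechet_bounds([25, 25], [20, 30])
for row in bounds:
    print(row)
```

Each true cell count of the example table lies inside its interval, for instance 15 within the bounds for the first cell.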
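7LP Bounds given f(x|y)
Released conditional table X|Y, with entries $p^{X|Y}_{ij} = P(X = i \mid Y = j)$:
p11 p12
p21 p22
- 2x2 table
- Max $p_{11}$
- Subject to $p_{11} + p_{12} + p_{21} + p_{22} = 1$
- $(p^{X|Y}_{11} - 1)\,p_{11} + p^{X|Y}_{11}\,p_{21} = 0$,
- $(p^{X|Y}_{12} - 1)\,p_{12} + p^{X|Y}_{12}\,p_{22} = 0$,
- $p_{ij} \ge 0$, $i = 1,2$, $j = 1,2$
- $0 \le p_{11} \le p^{X|Y}_{11}$
- Conditionals maintain the odds ratio, which makes this problem different from marginals
- $\alpha = \dfrac{p_{11}\,p_{22}}{p_{12}\,p_{21}} = \dfrac{p^{X|Y}_{11}\,p^{X|Y}_{22}}{p^{X|Y}_{12}\,p^{X|Y}_{21}}$
LP: Linear Programming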
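The LP above can be checked without a solver. When only f(x|y) is released, every feasible joint has the form p_ij = P(X=i|Y=j) m_j, with m = f(y) ranging over the probability simplex, so a linear objective is optimised at a vertex of that simplex. The sketch below uses this parametrisation instead of a general LP routine; the function names and example values are assumptions made for illustration.

```python
from fractions import Fraction as F

def lp_bounds(x_given_y):
    """Bounds on each joint p_ij when only x_given_y[i][j] = P(X=i | Y=j) is known.
    Vertices of the simplex put all of f(y) on a single column."""
    n_cols = len(x_given_y[0])
    vertices = [[F(int(j == v)) for j in range(n_cols)] for v in range(n_cols)]
    return [[(min(c * m[j] for m in vertices), max(c * m[j] for m in vertices))
             for j, c in enumerate(row)] for row in x_given_y]

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

x_given_y = [[F(3, 4), F(1, 3)], [F(1, 4), F(2, 3)]]
joint = [[F(3, 10), F(1, 5)], [F(1, 10), F(2, 5)]]
bounds = lp_bounds(x_given_y)
print(bounds[0][0], odds_ratio(joint), odds_ratio(x_given_y))
```

The last line also illustrates the odds-ratio remark: the joint table and its conditional table share the same cross-product ratio.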
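8LP Bounds given f(x|y)
- 2x2 table of counts (15, 10, 5, 20), i.e. probabilities (0.3, 0.2, 0.1, 0.4), N = 50
- Release f(y|x) = (0.6, 0.4, 0.2, 0.8)
- Problems
- α = 6
- None of the conditional values are zero
- ⇒ Cells in the original table CANNOT be zero.
- ⇒ These are NOT the sharp bounds
LP bounds on the cell probabilities of the (X, Y) table:
[0, 0.6]  [0, 0.4]
[0, 0.2]  [0, 0.8]
Corresponding bounds on the cell counts for N = 50:
[0, 30]  [0, 20]
[0, 10]  [0, 40]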
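9IP Bounds given f(x|y)
- 2x2 table of counts (15, 10, 5, 20), i.e. probabilities (0.3, 0.2, 0.1, 0.4)
- Assume N = 50 is known
- Branch-and-Bound method
- Max $x_{11}$
- Subject to $x_{11} + x_{12} + x_{21} + x_{22} = 50$
- $(p^{Y|X}_{11} - 1)\,x_{11} + p^{Y|X}_{11}\,x_{12} = 0$,
- $(p^{Y|X}_{21} - 1)\,x_{21} + p^{Y|X}_{21}\,x_{22} = 0$,
- $x_{ij} \ge 1$, $i = 1,2$, $j = 1,2$
- Here $p^{Y|X}_{ij} = P(Y = j \mid X = i)$ is the released conditional, f(y|x) = (0.6, 0.4, 0.2, 0.8)
IP bounds on the cell counts of the (X, Y) table:
[3, 27]  [2, 18]
[1, 9]   [4, 36]
IP: Integer Programming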
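For a table this small, the integer bounds can be reproduced by brute-force enumeration rather than branch-and-bound. The sketch below assumes the released conditional is f(y|x) = (3/5, 2/5, 1/5, 4/5) and N = 50, as in the example; the function name and the `min_count` parameter are introduced here for illustration only.

```python
from fractions import Fraction as F

def tables_given_y_given_x(y_given_x, n, min_count=1):
    """All 2x2 integer tables with total n whose row conditionals equal
    y_given_x[i][j] = P(Y=j | X=i) and whose cells are >= min_count."""
    tables = []
    for r1 in range(0, n + 1):            # candidate total for row X=1
        totals = (r1, n - r1)
        cells = [y_given_x[i][j] * totals[i] for i in range(2) for j in range(2)]
        if all(c.denominator == 1 and c >= min_count for c in cells):
            tables.append(tuple(int(c) for c in cells))
    return tables

y_given_x = [[F(3, 5), F(2, 5)], [F(1, 5), F(4, 5)]]
tables = tables_given_y_given_x(y_given_x, 50)
bounds = [(min(t[k] for t in tables), max(t[k] for t in tables)) for k in range(4)]
print(len(tables), bounds)
```

Enumeration scales poorly beyond small tables, which is where an integer-programming solver would be the more practical choice.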
10LP Bounds given f(x|y), f(x), if I < J
- Example 2x3 table, bounds on p11
- Generalizes to 2 x J tables
p11 p12 p13
p21 p22 p23
p11 p12 p13
p21 p22 p23
11 Bounds on multi-way tables using DAGs
- G: X1 → X3 → X2
- f(x1, x2, x3) = f(x2|x3) f(x3|x1) f(x1)
- Gu = Gm: X1 - X3 - X2
- Prop: When G satisfies the Wermuth condition, the bounds imposed by a set of conditionals and marginals reduce to the bounds imposed by a set of marginals associated with Gu.
- $\max\{0,\ p_{13} + p_{23} - p_{3}\} \le p_{123} \le \min\{p_{13},\ p_{23}\}$
DAG: Directed acyclic graph
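The marginal bounds in the proposition can be evaluated cell by cell from the {X1, X3} and {X2, X3} margins. The sketch below works with counts rather than probabilities (the inequality has the same form), and the 2x2x2 table it uses is a made-up example, not data from the talk.

```python
from itertools import product

def dag_bounds(table):
    """table[(i, j, k)] = count. Returns (lower, upper) for every cell from the
    {X1,X3} margin n13, the {X2,X3} margin n23, and the {X3} margin n3."""
    n13, n23, n3 = {}, {}, {}
    for (i, j, k), c in table.items():
        n13[i, k] = n13.get((i, k), 0) + c
        n23[j, k] = n23.get((j, k), 0) + c
        n3[k] = n3.get(k, 0) + c
    return {(i, j, k): (max(0, n13[i, k] + n23[j, k] - n3[k]),
                        min(n13[i, k], n23[j, k]))
            for (i, j, k) in table}

# Hypothetical counts, listed in the order produced by product(range(2), repeat=3).
cells = list(product(range(2), repeat=3))
table = dict(zip(cells, [5, 3, 2, 4, 1, 6, 7, 2]))
bounds = dag_bounds(table)
print(bounds[(0, 0, 0)], bounds[(1, 1, 0)])
```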
12Bounds on three-way tables using DAGs
- When Wermuth fails, Gu ≠ Gm
- G: X1 → X3 ← X2
- f(x1, x2, x3) = f(x3|x2,x1) f(x2) f(x1)
- DAG implies X1 ⊥ X2
- Special case of the Gelman & Speed Uniqueness Theorem
- f(x3|x2,x1) f(x2,x1)
- f(x3,x2|x1) f(x1)
- f(x3,x1|x2) f(x2)
- What if Xi, Xj, Xk are dependent?
- 2x2x2 table: f(xi,xj|xk), f(xi,xj) gives a unique specification
13Bounds on Multi-way Tables
- DAG, Wermuth: marginal bounds
- DAG, no Wermuth: GS, unique
- No DAG: GS, unique
- Otherwise (chain graphs?): bounds
14Algebraic Geometry
- Methods from computational commutative algebra to explore the space of all possible tables given the constraints
- Polynomial rings and ideals give a way of representing tables of counts
- Bounds and distributions given margins
- Gröbner (Markov) bases to enumerate or sample via MCMC
- Diaconis & Sturmfels (1998), Dobra & Fienberg (2001, 2002)
- Algebraic Geometry of Bayesian Nets (Garcia et al. 2003)
- Algebraic equivalent of local and global Markov properties and Factorization theorem
- Probability distribution satisfying conditional independence statements is the zero set of polynomials
15Markov basis
- Space of tables: zero set of polynomials, polyhedra
- $P = \{x \in \mathbb{R}^n : Ax = b \text{ and } x \ge 0\}$
16Markov basis
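- A set of generators of a toric ideal
- Ring = Q[x11, x12, x21, x22] = Q[x], Lex
- $I_A = \langle\, x^{z^+} - x^{z^-} : z^+, z^- \in \mathbb{Z}^n_{\ge 0},\ A(z^+ - z^-) = 0 \,\rangle$
- $z = z^+ - z^- \in \ker_{\mathbb{Z}}(A)$, where A is a matrix with integer coefficients
- Example
- 2x2 table of counts (15, 10, 5, 20), i.e. probabilities (0.3, 0.2, 0.1, 0.4)
- Fixed f(y|x) = (0.6, 0.4, 0.2, 0.8) = (3/5, 2/5, 1/5, 4/5)
- Constraint matrix A (columns x11, x12, x21, x22):
1 1 1 1
0.4 -0.6 0 0
0 0 0.8 -0.2
- A with integer coefficients:
1 1 1 1
4 -6 0 0
0 0 8 -2
- Basis element: $x_{11}^{3}\,x_{12}^{2} - x_{21}^{1}\,x_{22}^{4}$
- Corresponding move on the (X, Y) table:
+3  +2
-1  -4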
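A quick check of this example: the move (+3, +2, -1, -4) should lie in the integer kernel of A, and repeated application from the observed table (15, 10, 5, 20) should enumerate every non-negative table with the same conditionals and total. The sketch below assumes the integer matrix A shown above with columns ordered x11, x12, x21, x22; the helper names are hypothetical.

```python
A = [[1, 1, 1, 1],
     [4, -6, 0, 0],
     [0, 0, 8, -2]]
move = (3, 2, -1, -4)
start = (15, 10, 5, 20)

def apply_A(v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def fiber(start, move):
    """Tables reachable from start by adding or subtracting the move
    while every cell stays non-negative."""
    found = {start}
    for sign in (1, -1):
        t = start
        while True:
            t = tuple(x + sign * m for x, m in zip(t, move))
            if min(t) < 0:
                break
            found.add(t)
    return sorted(found)

tables = fiber(start, move)
print(apply_A(move), len(tables))
```

Allowing zero cells gives two more tables than the nine found under the IP constraint that every cell is at least 1.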
17Markov basis
- The moves must maintain α and N
- Unlike IP, they do not require knowing the sample size N
- They depend on the value of the conditional distribution
- Rounding to a different decimal place gives different moves
- Must express each rational as a fraction and use the numerator for the coefficient of the matrix A
- Practical issue with rounding in Matlab
- Margins may be revealed as the denominators, and often give the unique solution!
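The point about denominators can be illustrated with Python's `fractions` module: a conditional value released with enough decimal places can often be turned back into an exact fraction, and its denominator is then a candidate margin. The released values below correspond to the conditionals 2/7, 5/7, 1/11, 10/11; the cap of 50 on the denominator is an assumption standing in for a known upper limit on the margin.

```python
from fractions import Fraction

def recover(decimal_text, max_denominator):
    """Closest fraction to the released decimal with a bounded denominator."""
    return Fraction(decimal_text).limit_denominator(max_denominator)

released = ["0.285714", "0.714286", "0.090909", "0.909091"]
recovered = [recover(d, 50) for d in released]
print(recovered)

# Rounding to one decimal place leads to different fractions, hence different moves.
coarse = [recover(d, 50) for d in ["0.3", "0.7", "0.1", "0.9"]]
print(coarse)
```

With six decimals the denominators 7 and 11 reappear; with one decimal they are replaced by 10.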
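18Markov moves for fixed f(y|x)
- 2x2 table of counts (2, 5, 1, 10), i.e. probabilities (1/9, 5/18, 1/18, 5/9)
- Fixed f(y|x), α = 4
- No moves are possible in this case unless we consider an approximation
Conditional table Y|X:
2/7   5/7
1/11  10/11
Exact move on the (X, Y) table:
+22  +55
-7   -70
Approximate move on the (X, Y) table:
+3  +7
-1  -9
- Do we have a unique solution?
- Do we accept approximation?
- Decimal conditionals: 0.285714, 0.714286, 0.09090, 0.90909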
19Can we do MCMC now?
- We can enumerate
- Do we have an irreducible Markov chain?
- Find the Gröbner basis
- Prove theoretically
- What is the family of distributions that has marginals AND conditionals as MSS?
- What's the stationary distribution?
- $\Pr(N = n \mid C) = \Pr(N = n \mid N \in T)$
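To make the MCMC question concrete, here is a minimal random-walk sketch over the tables consistent with a fixed f(y|x) and N, using the single move (+3, +2, -1, -4) from the earlier 2x2 example. It assumes a uniform target distribution, so a symmetric proposal is accepted whenever it stays non-negative; the actual stationary distribution of interest is exactly what the slide leaves open. All names are hypothetical.

```python
import random

def random_walk(start, move, steps, seed=0):
    """Propose +move or -move with equal probability; reject proposals that
    leave the set of non-negative tables (uniform target assumed)."""
    rng = random.Random(seed)
    state, visited = start, [start]
    for _ in range(steps):
        sign = rng.choice((1, -1))
        proposal = tuple(x + sign * m for x, m in zip(state, move))
        if min(proposal) >= 0:
            state = proposal
        visited.append(state)
    return visited

chain = random_walk((15, 10, 5, 20), (3, 2, -1, -4), steps=1000)
print(len(set(chain)))
```

In this one-move example the chain is irreducible by construction; for larger tables that property depends on having a full Markov basis.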
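20Distributions over the space of tables
- Builds on Garcia et al. (2003)
- DAG: X1 → X3 → X2
- Log-linear model: $\log(m_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(ik)} + u_{23(jk)}$
- Minimal generators for M = {X1 ⊥ X2 | X3}
- Gröbner basis: $\{p_{121}\,p_{211} - p_{111}\,p_{221},\ p_{122}\,p_{212} - p_{112}\,p_{222}\}$
Cell probabilities, one 2x2 slice per level of X3:
p111 p121
p211 p221

p112 p122
p212 p222
Corresponding moves on the counts:
x111: -1   x121: +1
x211: +1   x221: -1

x112: -1   x122: +1
x212: +1   x222: -1
- $\alpha_1 = 1 = \dfrac{p_{111}\,p_{221}}{p_{121}\,p_{211}}$
- $\alpha_2 = 1 = \dfrac{p_{112}\,p_{222}}{p_{122}\,p_{212}}$
- Preserves two-way margins, which are MSS
- Bounds and distributions via MCMC based on the Markov (Gröbner) basis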
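The claim that these moves preserve the two-way margins can be checked directly. The sketch below applies the slice-k move (x11k - 1, x12k + 1, x21k + 1, x22k - 1) to a hypothetical 2x2x2 table of counts and compares the {X1, X3} and {X2, X3} margins before and after; the table values and helper names are made up for illustration.

```python
def margins(table):
    """Return the {X1,X3} and {X2,X3} margins of table[(i, j, k)]."""
    m13, m23 = {}, {}
    for (i, j, k), c in table.items():
        m13[i, k] = m13.get((i, k), 0) + c
        m23[j, k] = m23.get((j, k), 0) + c
    return m13, m23

def apply_move(table, k, sign=1):
    """Add the slice-k move: x11k - 1, x12k + 1, x21k + 1, x22k - 1 (times sign)."""
    moved = dict(table)
    for (i, j), delta in {(1, 1): -1, (1, 2): 1, (2, 1): 1, (2, 2): -1}.items():
        moved[i, j, k] += sign * delta
    return moved

# Hypothetical counts indexed by (i, j, k) with levels 1 and 2.
table = {(1, 1, 1): 5, (1, 2, 1): 2, (2, 1, 1): 1, (2, 2, 1): 7,
         (1, 1, 2): 3, (1, 2, 2): 4, (2, 1, 2): 6, (2, 2, 2): 2}
moved = apply_move(table, k=1)
print(margins(table) == margins(moved))
```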
21Expected Contributions
- Disclosure Limitation (DL)
- Extension of marginal query space by conditionals
- Enhancement of data usability
- Statistics
- Integration of diverse results and methods from
- Disclosure limitation
- Conditional specification of the joint distribution
- Graphical models
- Algebraic geometry
- New results on bounds and distributions on contingency tables
- New theoretical links between DL, Statistical Theory and Computational Algebraic Geometry
22Framework
Diagram labels: Computational Algebra, Graphical Models, Gelman & Speed, IP, Uniqueness, Bounds, Distributions
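24LP/IP Bounds given f(x|y)
- Conditionals maintain the odds ratio, which makes this problem different from marginals
- $\alpha = \dfrac{p_{11}\,p_{22}}{p_{12}\,p_{21}} = \dfrac{p^{X|Y}_{11}\,p^{X|Y}_{22}}{p^{X|Y}_{12}\,p^{X|Y}_{21}}$, where $p^{X|Y}_{ij}$ is the released conditional value
- N unknown
- LP bounds: $0 \le p_{ij} \le p^{X|Y}_{ij}$
- Not sharp
- N known
- IP gives sharp bounds
- May not be computationally feasible for k-way tables
LP: Linear Programming; IP: Integer Programming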
25Acknowledgments
- NISS Digital Government Project
- Department of Statistics, Carnegie Mellon University
- Stephen E. Fienberg
- Kimberly F. Sellers
- Larry Wasserman
- Teddy Seidenfeld
- Heinz School, Carnegie Mellon University
- Stephen F. Roehrig
26Motivation: Statistical Disclosure Limitation
- Sustain proper statistical inference
- Strike a balance between data utility and disclosure risk
- Risk-Utility (R-U) confidentiality maps (Duncan et al. (2001))
- Bayesian framework (Trottini (2001), Trottini & Fienberg (2002))
- Risk measure is the ability to identify small cell counts
Wealth (W)         Weak    Weak       Strong  Strong
Location (L)       Center  Outskirts  Center  Outskirts
Gender (G) Male    8       6          2       9
Gender (G) Female  0       3          5       1
Survey of self-employed shop-owners. Source: Willenborg & de Waal, adapted example
27Motivation: Current Disclosure Methods
- NISS Digital Government Project
- Release of margins
- Maintains existing statistical correlations
- Determine safe releases via bounds and distributions
- Linear/Integer programming
- Roehrig et al. (1999), Dobra (2001)
- Decomposable and graphical log-linear models
- Dobra & Fienberg (2000, 2002)
- Shuttle Algorithm
- Dobra (2002)
- Gröbner (Markov) bases to enumerate or sample
- Diaconis & Sturmfels (1998), Dobra & Fienberg (2000, 2002), Dobra et al. (2003)
- Release of regressions (Jerry Reiter)
28Why conditionals?
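Wealth (W)         Weak    Weak  Strong  Strong
Location (L)       Center  Out   Center  Out
Gender (G) Male    8       6     2       9
Gender (G) Female  0       3     5       1
Survey of self-employed shop-owners. Source: Willenborg & de Waal, adapted example
Cell bounds, written as (upper, lower), three pairs per cell in the cell order above:
Male    (24,0) (20,0) (8,5) | (12,0) (10,0) (9,6) | (6,0) (5,0) (5,2) | (18,0) (15,0) (9,6)
Female  (0,0) (0,0) (3,0) | (24,0) (6,0) (3,0) | (34,0) (9,1) (5,2) | (8,0) (2,0) (4,1)
Cell bounds, written as (upper, lower), given f(w,l), f(w,g), f(l,g):
Male    (8,5) (9,6) (5,2) (9,6)
Female  (3,0) (3,0) (5,2) (4,1)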
- Assess causal distribution: $P(W = w \mid L = l) = \sum_{g} P(w \mid l, g)\,P(g)$
- Assess treatment effect: $P(W = 1 \mid L = 1, G = 1) - P(W = 1 \mid L = 0, G = 0)$
- Release f(w,l,g) vs. f(w|l,g), f(g)
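The sum over g of P(w|l,g) P(g) needs only the conditional f(w|l,g) and the marginal f(g), which is the motivation for releasing those instead of the full table. As a worked illustration, the sketch below evaluates that sum on the shop-owner counts from the table above; the function name and the string labels for the levels are choices made for this sketch.

```python
from fractions import Fraction as F

# counts[(gender, wealth, location)] from the shop-owner table
counts = {
    ("Male", "Weak", "Center"): 8, ("Male", "Weak", "Out"): 6,
    ("Male", "Strong", "Center"): 2, ("Male", "Strong", "Out"): 9,
    ("Female", "Weak", "Center"): 0, ("Female", "Weak", "Out"): 3,
    ("Female", "Strong", "Center"): 5, ("Female", "Strong", "Out"): 1,
}

def adjusted(w, l):
    """Sum over g of P(w | l, g) * P(g), computed from the counts."""
    n = sum(counts.values())
    total = F(0)
    for g in ("Male", "Female"):
        n_g = sum(c for (gg, _, _), c in counts.items() if gg == g)
        n_lg = sum(c for (gg, _, ll), c in counts.items() if gg == g and ll == l)
        total += F(counts[g, w, l], n_lg) * F(n_g, n)
    return total

print(adjusted("Strong", "Center"), adjusted("Weak", "Center"))
```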
29Proposed Work: primary focus
- Uniqueness
- Evaluation of Vardi & Lee (1993) algorithm
- Algebraic algorithm via Gröbner basis and a zero ideal
- Compare consistency and efficiency
- Boundary cases
- Structure and computation of bounds
- Linear programming
- Computational algebra
- DAGs
- Structure and computation of distributions
- MCMC and computational algebra to enumerate or sample
30Proposed Work: secondary focus
- Bayesian Framework
- Explore extensions to Dobra et al. (2003) work on posterior distributions defined over the space of all tables given conditional and marginal constraints
- Causality
- Compare our bounds to Balke & Pearl's causality bounds
- Explore applicability of our bounds to their problem
- Assess the risk associated with the release of conditionals and marginals in large-scale tabular databases
31LP Bounds given f(x|y), f(x), if I < J
- Example 3x4 table
- A, B, C are combinations of ratios of differences of cell values
- Proposed work
- Generalization to I x J table, where I < J
- Generalization to k-way tables
- Explore closed form solutions
- common structure like in the case of margins
32Motivation: Unique identification
- Publicly available data
- American Fact Finder website (Source: U.S. Census Bureau, Block data)
- Uniqueness
- Sweeney (2000): date of birth, gender, 5-digit ZIP
- Likely unique identification of 87% of the U.S. population
RACE                                         All ages (Number)   18 years and over (Number)
Total population                             83                  70
White                                        70                  63
Black or African American                    1                   1
American Indian and Alaska Native            0                   0
Asian                                        9                   6
Native Hawaiian and Other Pacific Islander   0                   0
Two or more races                            3                   0
33Ideal membership problem
- Identifying other sets of equivalent constraints
- Given a Gröbner basis G for an ideal I,
- f belongs to I iff the remainder r on division of f by G is zero.
34Solving system of equations
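- Finding all tables for a given set of conditionals and marginals
- Asking for the points in the affine variety
- Locus of tables: zero set of polynomials
- Example: 2x2 table, f(x), f(y|x)
- Constraints
- $p_{11} + p_{12} + p_{21} + p_{22} - 1 = 0$
- $p_{11} + p_{12} - 0.5 = 0$
- $-0.4\,p_{11} + 0.6\,p_{12} = 0$
- $-0.8\,p_{21} + 0.2\,p_{22} = 0$
- Gröbner basis
- $-\tfrac{4}{5}\,p_{21} + \tfrac{1}{5}\,p_{22} = 0$
- $p_{11} + p_{12} + p_{21} + p_{22} - 1 = 0$
- $\tfrac{3}{2}\,p_{11} + \tfrac{1}{4}\,p_{12} - \tfrac{1}{2} = 0$
- $-\tfrac{5}{6}\,p_{12} + \tfrac{1}{6} = 0$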
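Because every constraint in this example is linear, the point of the variety can also be found by exact Gaussian elimination. The sketch below assumes cells ordered (p11, p12, p21, p22) with the first index for X, the marginal f(x) = (0.5, 0.5), and the conditional f(y|x) = (0.6, 0.4, 0.2, 0.8); the solver is a plain Gauss-Jordan routine written for this illustration, not the Gröbner basis computation itself.

```python
from fractions import Fraction as F

def solve_exact(A, b):
    """Gauss-Jordan elimination over exact fractions for a square, full-rank system."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

A = [[1, 1, 1, 1],                    # probabilities sum to 1
     [1, 1, 0, 0],                    # f(x=1) = 1/2
     [F(-2, 5), F(3, 5), 0, 0],       # P(Y=1 | X=1) = 0.6
     [0, 0, F(-4, 5), F(1, 5)]]       # P(Y=1 | X=2) = 0.2
b = [1, F(1, 2), 0, 0]
solution = solve_exact(A, b)
print(solution)
```

The single solution corresponds to a zero-dimensional variety, that is, a unique table.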
35Effect of sample size N
- Important for proper statistical inference
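- Tighter bounds and distributions
- Mapping between probabilities and counts
- $p^{X|Y}_{ij} = x_{ij}/x_{+j} = r/q$, $r, q \in \mathbb{Z}$, $0 < q \le N$
- $0 \le p_{12} \le 0.286$
- (r, q) = (2, 7), (4, 14), but ONLY 1 table!
Joint table, counts (probabilities):
       Y=1        Y=2
X=1    3 (0.214)  2 (0.143)
X=2    4 (0.286)  5 (0.357)
Conditional table X|Y:
       Y=1    Y=2
X=1    0.429  0.286
X=2    0.571  0.714
Sum    1      1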
36Effect of sample size N
- Effect of release of N on bounds and distributions?
- Important for proper statistical inference
- Tighter bounds
- Mapping between probabilities and counts
- $p^{X|Y}_{ij} = x_{ij}/x_{+j} = r/q$, $r, q \in \mathbb{Z}$, $0 < q \le N$
- $0 \le p_{11} \le 0.42857$
- (r, q) = (3, 7), (6, 14), but ONLY 1 table!
Joint table, counts (probabilities):
       Y=1        Y=2
X=1    3 (0.214)  2 (0.143)
X=2    4 (0.286)  5 (0.357)
Conditional table X|Y:
       Y=1    Y=2
X=1    0.429  0.286
X=2    0.571  0.714
Sum    1      1
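The "ONLY 1 table" remark can be reproduced by enumeration: with f(x|y) = (3/7, 2/7; 4/7, 5/7), each column total must be a multiple of 7, so N = 14 leaves a single table with positive cells. The sketch below also counts the tables for two larger, hypothetical sample sizes to show how knowledge of N changes the picture; the function name is an assumption of this sketch.

```python
from fractions import Fraction as F

def tables_given_x_given_y(x_given_y, n):
    """2x2 integer tables with total n, positive cells, and column conditionals
    x_given_y[i][j] = P(X=i | Y=j). Cells are listed as (x11, x12, x21, x22)."""
    tables = []
    for c1 in range(1, n):                 # candidate total for column Y=1
        totals = (c1, n - c1)
        cells = [x_given_y[i][j] * totals[j] for i in range(2) for j in range(2)]
        if all(c.denominator == 1 and c >= 1 for c in cells):
            tables.append(tuple(int(c) for c in cells))
    return tables

x_given_y = [[F(3, 7), F(2, 7)], [F(4, 7), F(5, 7)]]
only_table = tables_given_x_given_y(x_given_y, 14)
counts_by_n = {n: len(tables_given_x_given_y(x_given_y, n)) for n in (14, 21, 28)}
print(only_table, counts_by_n)
```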
37Solving system of equations
- Finding all tables to a given set of constraints
- Asking for the points in the affine variety
- Loci of tables: zero sets of polynomials
- Marginals and conditionals as polynomials for an ideal
- Example: one conditional, 2x2 table
- $I = \langle\, p_{11} + p_{12} + p_{21} + p_{22} - 1,\ -0.4\,p_{11} + 0.6\,p_{12},\ -0.8\,p_{21} + 0.2\,p_{22} \,\rangle$
- Gröbner basis: $\{-0.8\,p_{21} + 0.2\,p_{22},\ p_{11} + p_{12} + p_{21} + p_{22} - 1,\ -2.5\,p_{12} - 1.25\,p_{22} + 1\}$
- Hilbert series: $1 / (1 - p_{11})^2$
- Hilbert function: $H(t) = t + 1$, $t \ge 0$
- Dimension 1, surface degree 1
38Solving system of equations: uniqueness
- Example: CoCoA, given two conditionals
- Ring = Q[p11, p12, p21, p22], Lex
- J = Ideal(p11 + p12 + p21 + p22 - 1, -0.4 p11 + 0.6 p12, -0.8 p21 + 0.2 p22, -0.25 p11 + 0.75 p21, -1/3 p22 + 2/3 p12)
- GBasis(J) = {2/3 p12 - 1/3 p22, -4/5 p21 + 1/5 p22, p11 + p12 + p21 + p22 - 1, -5/2 p22 + 1}
- Back substitution: (0.3, 0.2, 0.1, 0.4)
- Poincare(MyRing/J): Hilbert series = 1
- Hilbert(MyRing/J): Hilbert function H(0) = 1, H(t) = 0 for t ≥ 1
- Dimension 0, a point in the space
39Geometric interpretation
- Future work: generalization to k-way tables
- Linear constraints give us manifolds in higher dimensions
- Dimension from the # of linearly independent constraints
- (a) fixed margin: # of levels - 1
- (b) fixed conditional: # of cells - # of levels of the conditioning variable
- Dimension = (IJ - 1) - (a) - (b)
- Example: 2x2, f(x), f(y|x) ⇒ dim = 0, unique point
- Conditional tables, use algebraic geometry
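The dimension count can be written as a small helper. The sketch below applies the formula (IJ - 1) - (a) - (b) exactly as stated, and so it assumes the listed constraints are linearly independent; it does not detect redundancy, for example when two conditionals are released together. The function and parameter names are hypothetical.

```python
def dimension(n_rows, n_cols, margin_levels=(), conditioning_levels=()):
    """(IJ - 1) minus the constraints from fixed margins and fixed conditionals.
    margin_levels: number of levels of each fixed margin.
    conditioning_levels: number of levels of the conditioning variable of each
    fixed conditional. Assumes all constraints are linearly independent."""
    n_cells = n_rows * n_cols
    a = sum(levels - 1 for levels in margin_levels)
    b = sum(n_cells - levels for levels in conditioning_levels)
    return (n_cells - 1) - a - b

print(dimension(2, 2, margin_levels=(2,), conditioning_levels=(2,)))  # f(x), f(y|x)
print(dimension(2, 2, conditioning_levels=(2,)))                      # one conditional only
```

The second call matches the one-conditional example in the deck, where the locus of tables has dimension 1.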
40Proposed Work
- Uniqueness
- Evaluation of Vardi & Lee (1993) algorithm
- Algebraic algorithm via Gröbner basis and a zero ideal
- Compare consistency and efficiency
- Boundary cases
- Linear programming
- Bounds: generalization to any I x J and multi-way tables
- Is there a common structure like in the case of margins?
- Closed form solutions
41Proposed Work
- DAG framework
- Existence of bounds that differ from marginal bounds
- Find a way of computing them
- DAG factorization results parallel to decomposable models
- Chain graphs
- MCMC
- Bounds and distributions
- Markov basis via Gröbner basis
- Saturation algorithm (Bigatti et al. 1999)
42Solving system of equations: uniqueness
- Example: CoCoA, given two conditionals
- Ring = Q[p11, p12, p21, p22]
- Ideal
- p11 + p12 + p21 + p22 - 1, -0.4 p11 + 0.6 p12, -0.8 p21 + 0.2 p22,
- -0.25 p11 + 0.75 p21, -1/3 p22 + 2/3 p12
- Gröbner Basis
- 2/3 p12 - 1/3 p22, -4/5 p21 + 1/5 p22, p11 + p12 + p21 + p22 - 1, -5/2 p22 + 1
- Back substitution: (0.3, 0.2, 0.1, 0.4)
- Dimension 0, a point in the space
43Geometric interpretation
- Surface of constant α = 6
- Locus of conditional tables P(Y|X=x)
- Locus of conditional tables P(X|Y=y)