Statistical Disclosure Limitation Beyond the Margins

1
Statistical Disclosure Limitation Beyond the Margins
  • Making Inferences from Arbitrary Sets of
    Conditionals and Marginals for Contingency Tables
  • Aleksandra B. Slavkovic
  • Carnegie Mellon University
  • September 16, 2003

2
Our Goal
  • Determine safe releases in terms of arbitrary
    sets of marginals and conditionals
  • The risk measure is the ability to identify small
    cell counts
  • Investigate conditions under which sets of
    marginals and conditionals give:
  • unique specifications
  • upper/lower bounds on cell entries
  • distributions over the cell entries
  • Determine/compute the bounds and distributions

3
Outline
  • Two-way tables
  • Uniqueness
  • LP/IP bounds on two-way tables
  • Higher-way tables
  • Extensions of
  • Uniqueness
  • LP/IP two-way bounds
  • Bounds using graphical models
  • Algebraic geometry
  • Markov basis
  • Bounds
  • Distributions
  • Model based characterization

4
Uniqueness: Complete specification of the joint
  • Uniqueness Theorem for k-way tables
  • Two-way table (Gelman & Speed 1993, Arnold et al.
    1999)
  • f(x), f(y|x)
  • f(x|y), f(y|x)
  • Arnold et al. 1999
  • Sometimes f(y), f(y|x)
  • Define the missing marginal
  • Vardi & Lee (1993) algorithm
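
As a small illustration of why two compatible conditionals can pin down the joint, the sketch below uses the identity f(x|y0) / f(y0|x) = f(x) / f(y0) for a fixed level y0, so the ratios are proportional to f(x). It assumes strictly positive conditionals and uses the 2x2 example table [0.3, 0.2, 0.1, 0.4] that appears later in the deck; the function name is hypothetical and this is not the Vardi & Lee algorithm.

```python
from fractions import Fraction as F

# Conditionals of the 2x2 example table [0.3, 0.2, 0.1, 0.4], indexed [x][y].
f_y_given_x = [[F(3, 5), F(2, 5)], [F(1, 5), F(4, 5)]]
f_x_given_y = [[F(3, 4), F(1, 3)], [F(1, 4), F(2, 3)]]


def joint_from_conditionals(f_yx, f_xy, y0=0):
    """Recover f(x, y) from f(y|x) and f(x|y); assumes all entries are positive."""
    # f(x|y0) / f(y0|x) = f(x) / f(y0), so these ratios are proportional to f(x).
    ratios = [f_xy[x][y0] / f_yx[x][y0] for x in range(len(f_yx))]
    total = sum(ratios)
    f_x = [r / total for r in ratios]
    # Joint = marginal of X times the conditional of Y given X.
    return [[f_x[x] * p for p in f_yx[x]] for x in range(len(f_yx))]


joint = joint_from_conditionals(f_y_given_x, f_x_given_y)
print(joint)
```

Exact fractions are used so that the recovered cells can be compared with the original table without rounding.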

5
Uniqueness: Complete specification of the joint
  • Prop.: A unique solution exists for an I x J table,
    I ≥ J, given f(x|y) and f(x)
  • Unique solution for I x 2
  • 2 x 2 table, release f(x|y), f(x)

Joint table with the row margin f(x):
        Y1    Y2
  X1    p11   p12   p1+
  X2    p21   p22   p2+

Conditional table X|Y:
  p11   p12
  p21   p22
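
A minimal sketch of the 2x2 case: f(x) = Σ_j f(x | y = j) f(y = j) is a linear system in the unknown marginal f(y), and once f(y) is known the joint follows. The numbers come from the example table [0.3, 0.2, 0.1, 0.4] used elsewhere in the deck; the helper name is hypothetical, and the approach assumes the conditional matrix is non-singular.

```python
from fractions import Fraction as F

f_x_given_y = [[F(3, 4), F(1, 3)], [F(1, 4), F(2, 3)]]  # indexed [x][y]
f_x = [F(1, 2), F(1, 2)]


def solve_2x2(a, b):
    """Solve a * v = b for a 2x2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("conditionals do not identify f(y)")
    v0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    v1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [v0, v1]


f_y = solve_2x2(f_x_given_y, f_x)
joint = [[f_x_given_y[x][y] * f_y[y] for y in range(2)] for x in range(2)]
print(f_y, joint)
```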
6
I x J Tables: Summary
Queries           Assum.   Unique   Assum.      Bounds
f(x|y), f(y)               ✓
f(x|y), f(y|x)             ✓
f(x), f(y)        X ⊥ Y    ✓        X not ⊥ Y   max{0, x_{i+} + x_{+j} − n} ≤ x_{ij} ≤ min{x_{i+}, x_{+j}}
f(x|y), f(x)      I ≥ J    ✓        I < J
f(x)                                            0 ≤ x_{ij} ≤ x_{i+}
f(x|y)
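
The marginal bounds in the f(x), f(y) row can be computed directly. The sketch below applies max{0, x_{i+} + x_{+j} − n} ≤ x_{ij} ≤ min{x_{i+}, x_{+j}} to the margins of the example table [15, 10, 5, 20] used later in the deck; the function name is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds for every cell of a two-way table of counts."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same sum")
    return [[(max(0, r + c - n), min(r, c)) for c in col_totals]
            for r in row_totals]


# Margins of the table [15, 10, 5, 20]: rows 25, 25 and columns 20, 30.
bounds = frechet_bounds([25, 25], [20, 30])
print(bounds)
```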





7
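LP Bounds given f(x|y)
X|Y (released conditionals, p_{i|j} = P(X = i | Y = j)):
  p_{1|1}   p_{1|2}
  p_{2|1}   p_{2|2}
  • 2x2 table
  • Max p11
  • Subject to p11 + p12 + p21 + p22 = 1
  • (p_{1|1} − 1) p11 + p_{1|1} p21 = 0,
  • (p_{1|2} − 1) p12 + p_{1|2} p22 = 0,
  • pij ≥ 0, i = 1,2, j = 1,2
  • 0 ≤ p11 ≤ p_{1|1}
  • Conditionals maintain the odds ratio, which makes this
    problem different from marginals
  • α = p11 p22 / (p12 p21) = p_{1|1} p_{2|2} / (p_{1|2} p_{2|1})

LP = Linear Programming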
8
LP Bounds given f(x|y)
  • 2x2 table [15, 10, 5, 20] = [0.3, 0.2, 0.1, 0.4], N = 50
  • Release f(y|x) = [0.6, 0.4, 0.2, 0.8]
  • Problems
  • α = 6
  • None of the conditional values are zero
  • ⇒ Cells in the original table CANNOT be zero.
  • ⇒ These are NOT the sharp bounds

LP bounds on the cell probabilities of X,Y:
  [0, 0.6]   [0, 0.4]
  [0, 0.2]   [0, 0.8]
LP bounds on the cell counts of X,Y (N = 50):
  [0, 30]   [0, 20]
  [0, 10]   [0, 40]
LP = Linear Programming
9
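IP Bounds given f(x|y)
  • 2x2 table [15, 10, 5, 20] = [0.3, 0.2, 0.1, 0.4]
  • Assume N = 50 is known
  • Branch-and-Bound method
  • Max x11
  • Subject to x11 + x12 + x21 + x22 = 50
  • (p_{1|1} − 1) x11 + p_{1|1} x12 = 0,
  • (p_{1|2} − 1) x21 + p_{1|2} x22 = 0,
  • xij ≥ 1, i = 1,2, j = 1,2
  • Here p_{1|i} = P(Y = 1 | X = i) is the released
    conditional value

IP bounds on the cell counts of X,Y:
  [3, 27]   [2, 18]
  [1, 9]    [4, 36]
IP = Integer Programming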
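
For a table this small, the integer bounds can be checked by brute force rather than branch-and-bound. The sketch below enumerates all positive integer 2x2 tables with N = 50 whose row conditionals equal the released f(y|x) = [3/5, 2/5, 1/5, 4/5]; the function name is hypothetical and the enumeration is an illustration, not the method used in the talk.

```python
from fractions import Fraction as F


def tables_given_row_conditionals(cond, n):
    """All tables [x11, x12, x21, x22] with cells >= 1, total n, and
    row conditionals equal to cond (indexed [x][y])."""
    tables = []
    for r1 in range(1, n):          # total of row 1; row 2 gets the rest
        r2 = n - r1
        cells = [cond[0][0] * r1, cond[0][1] * r1,
                 cond[1][0] * r2, cond[1][1] * r2]
        if all(c.denominator == 1 and c >= 1 for c in cells):
            tables.append([int(c) for c in cells])
    return tables


cond = [[F(3, 5), F(2, 5)], [F(1, 5), F(4, 5)]]
tables = tables_given_row_conditionals(cond, 50)
bounds = [(min(t[k] for t in tables), max(t[k] for t in tables))
          for k in range(4)]
print(len(tables), bounds)
```

The resulting ranges agree with the IP bounds listed on the slide.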
10
LP Bounds given f(x|y), f(x), if I < J
  • Example: 2x3 table, bounds on p11
  • Generalizes to 2 x J tables

p11 p12 p13
p21 p22 p23
p11 p12 p13
p21 p22 p23
LP = Linear Programming
11
Bounds on multi-way tables using DAGs
  • G: X1 → X3 → X2
  • f(x1, x2, x3) = f(x2 | x3) f(x3 | x1) f(x1)
  • Gu = Gm: X1 - X3 - X2
  • Prop.: When G satisfies the Wermuth condition, the
    bounds imposed by a set of conditionals and
    marginals reduce to the bounds imposed by the set
    of marginals associated with Gu.
  • max{0, p13 + p23 − p3} ≤ p123 ≤ min{p13, p23}

DAG = Directed acyclic graph
12
Bounds on three-way tables using DAGs
  • When Wermuth fails, Gu ≠ Gm
  • G: X1 → X3 ← X2
  • f(x1, x2, x3) = f(x3 | x2, x1) f(x2) f(x1)
  • DAG implies X1 ⊥ X2
  • Special case of Gelman & Speed Uniqueness Theorem
  • f(x3 | x2, x1) f(x2, x1)
  • f(x3, x2 | x1) f(x1)
  • f(x3, x1 | x2) f(x2)
  • What if Xi, Xj, Xk are dependent?
  • 2x2x2 table: f(xi, xj | xk), f(xi, xj) gives a
    unique specification

13
Bounds on Multi-way Tables
  • DAG
  • Wermuth holds: marginal bounds
  • Wermuth fails: independence (⊥), GS (Gelman & Speed): unique
  • No DAG
  • GS: unique
  • (Chain graphs?)
  • Bounds

14
Algebraic Geometry
  • Methods from computational commutative algebra to
    explore the space of all possible tables given
    the constraints
  • Polynomial rings & ideals give a way of
    representing tables of counts
  • Bounds & distributions given margins
  • Gröbner (Markov) bases to enumerate or sample via
    MCMC
  • Diaconis & Sturmfels (1998), Dobra & Fienberg
    (2001, 2002)
  • Algebraic Geometry of Bayesian Nets (Garcia et
    al. 2003)
  • Algebraic equivalent of local & global Markov
    properties and Factorization theorem
  • Probability distribution satisfying conditional
    independence statements is the zero set of
    polynomials

15
Markov basis
  • Space of tables = zero set of polynomials =
    polyhedra
  • P = {x ∈ R^n : Ax = b and x ≥ 0}

16
Markov basis
  • A set of generators of a toric ideal
  • Ring = Q[x11, x12, x21, x22] = Q[x], Lex
  • I_A = ⟨ x^(z+) − x^(z−) : z+, z− ∈ Z^n, z+, z− ≥ 0, Az = 0 ⟩
  • z = z+ − z− ∈ kernel_Z(A), A is a matrix with integer
    coefficients
  • Example
  • 2x2 table [15, 10, 5, 20] = [0.3, 0.2, 0.1, 0.4]
  • Fixed f(y|x) = [0.6, 0.4, 0.2, 0.8] = [3/5, 2/5,
    1/5, 4/5]
  • A (decimal form, then integer form):
      1     1     1     1          1    1    1    1
      0.4  -0.6   0     0          4   -6    0    0
      0     0     0.8  -0.2        0    0    8   -2
  • x11^3 x12^2 − x21^1 x22^4
  • Corresponding move on X,Y:
       3    2
      -1   -4
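
A quick check of this example: the move (3, 2, −1, −4) should lie in the integer kernel of A, and adding it to the table should leave N, the row conditionals, and the odds ratio unchanged. The helper names below are hypothetical.

```python
from fractions import Fraction as F

A = [[1, 1, 1, 1],
     [4, -6, 0, 0],
     [0, 0, 8, -2]]
MOVE = [3, 2, -1, -4]
TABLE = [15, 10, 5, 20]  # x11, x12, x21, x22


def mat_vec(a, z):
    """Matrix-vector product with plain lists."""
    return [sum(c * v for c, v in zip(row, z)) for row in a]


def row_conditionals(t):
    """f(y|x) for a 2x2 table stored as [x11, x12, x21, x22]."""
    r1, r2 = t[0] + t[1], t[2] + t[3]
    return [F(t[0], r1), F(t[1], r1), F(t[2], r2), F(t[3], r2)]


def odds_ratio(t):
    return F(t[0] * t[3], t[1] * t[2])


moved = [x + z for x, z in zip(TABLE, MOVE)]
print(mat_vec(A, MOVE))          # all zeros if MOVE is in the kernel of A
print(moved, sum(moved))
print(row_conditionals(moved), odds_ratio(moved))
```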
17
Markov basis
  • The moves must maintain α and N
  • Unlike IP, they do not require knowing the sample size N
  • They depend on the value of the conditional
    distribution
  • Rounding to a different number of decimal places
    gives different moves
  • Must express each rational as a fraction and use the
    numerators as the coefficients of the matrix A
  • Practical issue with rounding in Matlab
  • Margins may be revealed as the denominators, and
    often give the unique solution!

18
Markov moves for fixed f(y|x)
  • 2x2 table [2, 5, 1, 10]
  • Fixed f(y|x), α = 4
  • No moves are possible in this case unless we consider
    an approximation

Y|X
2/7 5/7
1/11 10/11
Move on X,Y from the exact fractions:
22 55
-7 -70
Move on X,Y from the conditionals rounded to one decimal place:
3 7
-1 -9
  • Do we have a unique solution?
  • Do we accept approximation?

0.285714, 0.714286, 0.09090, 0.90909
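
The dependence of the move on rounding can be reproduced with exact fractions. In the sketch below each row of f(y|x) is scaled to an integer pattern, and the two patterns are combined with opposite signs so that the total count is unchanged. The function names are hypothetical, and this is an illustration for the 2x2 case only.

```python
from fractions import Fraction as F
from math import gcd


def integer_pattern(row):
    """Scale a row of fractions to integers; return the integers and the scale."""
    d = 1
    for p in row:
        d = d * p.denominator // gcd(d, p.denominator)  # least common multiple
    return [int(p * d) for p in row], d


def move_from_conditionals(row1, row2):
    """Integer move that keeps both row conditionals and the total count fixed."""
    pat1, d1 = integer_pattern(row1)
    pat2, d2 = integer_pattern(row2)
    g = gcd(d1, d2)
    # Row 1 gains (d2 / g) * d1 counts and row 2 loses the same amount.
    return [v * (d2 // g) for v in pat1] + [-v * (d1 // g) for v in pat2]


exact = move_from_conditionals([F(2, 7), F(5, 7)], [F(1, 11), F(10, 11)])
rounded = move_from_conditionals([F(3, 10), F(7, 10)], [F(1, 10), F(9, 10)])
print(exact, rounded)
```

With the exact fractions the move changes each row total by 77, which a table of 18 observations cannot absorb; the rounded conditionals give a much smaller move.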
19
Can we do MCMC now?
  • We can enumerate
  • Do we have irreducible Markov chain?
  • Find the Gröbner basis
  • Prove theoretically
  • What is the family of distributions that has
    marginals AND conditionals as MSS (minimal
    sufficient statistics)?
  • What's the stationary distribution?
  • Pr(N = n | C) = Pr(N = n | N ∈ T)

20
Distributions over the space of tables
  • Builds on Garcia et al. (2003)
  • DAG: X1 → X3 → X2
  • Log-linear model: log(m_ijk) = u + u1(i) + u2(j) +
    u3(k) + u13(ik) + u23(jk)
  • Minimal generators for M: X1 ⊥ X2 | X3
  • Gröbner basis: {p121 p211 − p111 p221, p122 p212 −
    p112 p222}

Cell probabilities, one 2x2 slice per level of X3:
p111 p121        p112 p122
p211 p221        p212 p222
Moves on the cell counts:
x111: −1  x121: +1        x112: −1  x122: +1
x211: +1  x221: −1        x212: +1  x222: −1
  • α1 = 1 = p111 p221 / (p121 p211)
  • α2 = 1 = p112 p222 / (p122 p212)
  • Preserves the two-way margins, which are the MSS
  • Bounds and distributions via MCMC based on the
    Markov (Gröbner) basis
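
A minimal sketch of how the two moves can drive a walk over tables: each step picks a move and a sign, and the proposal is kept only if all cells stay non-negative. The starting counts are hypothetical, and this is a plain random walk without a Metropolis-Hastings acceptance step, so it illustrates margin preservation rather than sampling from a specific target distribution.

```python
import random

# Moves from the slide as {(i, j, k): change}; k is the level of X3.
MOVES = [
    {(1, 1, 1): -1, (1, 2, 1): +1, (2, 1, 1): +1, (2, 2, 1): -1},
    {(1, 1, 2): -1, (1, 2, 2): +1, (2, 1, 2): +1, (2, 2, 2): -1},
]


def margin(table, axes):
    """Sum the table down to the variables at the given 0-based positions."""
    out = {}
    for cell, count in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0) + count
    return out


def random_walk(table, steps, rng):
    """Apply random signed moves, rejecting any that create a negative cell."""
    current = dict(table)
    for _ in range(steps):
        move = rng.choice(MOVES)
        sign = rng.choice((1, -1))
        proposal = dict(current)
        for cell, delta in move.items():
            proposal[cell] += sign * delta
        if min(proposal.values()) >= 0:
            current = proposal
    return current


# Hypothetical 2x2x2 table of counts, cells indexed (i, j, k).
start = {(1, 1, 1): 4, (1, 2, 1): 2, (2, 1, 1): 3, (2, 2, 1): 5,
         (1, 1, 2): 1, (1, 2, 2): 6, (2, 1, 2): 2, (2, 2, 2): 3}
end = random_walk(start, 200, random.Random(0))
print(margin(end, (0, 2)) == margin(start, (0, 2)),
      margin(end, (1, 2)) == margin(start, (1, 2)))
```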

21
Expected Contributions
  • Disclosure Limitation (DL)
  • Extension of marginal query space by conditionals
  • Enhancement of data usability
  • Statistics
  • Integration of diverse results & methods from:
  • Disclosure limitation
  • Conditional specification of the joint
    distribution
  • Graphical models
  • Algebraic geometry
  • New results on bounds and distributions on
    contingency tables
  • New theoretical links between DL, Statistical
    Theory and Computational Algebraic Geometry

22
Framework
Computational Algebra
Graphical Models
Gelman & Speed
IP
Uniqueness
Bounds
Distributions
24
LP/IP Bounds given f(x|y)
  • Conditionals maintain the odds ratio, which
    makes this problem different from
    marginals
  • α = p11 p22 / (p12 p21) = p_{1|1} p_{2|2} / (p_{1|2} p_{2|1}),
    where p_{i|j} is the released conditional value
  • N unknown
  • LP bounds: 0 ≤ pij ≤ p_{i|j}
  • Not sharp
  • N known
  • IP gives sharp bounds
  • May not be computationally feasible for k-way
    tables

LP = Linear Programming, IP = Integer Programming
25
Acknowledgments
  • NISS Digital Government Project
  • Department of Statistics, Carnegie Mellon
    University
  • Stephen E. Fienberg
  • Kimberly F. Sellers
  • Larry Wasserman
  • Teddy Seidenfeld
  • Heinz School, Carnegie Mellon University
  • Stephen F. Roehrig

26
Motivation: Statistical Disclosure Limitation
  • Sustain proper statistical inference
  • Strike a balance between data utility and
    disclosure risk
  • Risk-Utility (R-U) confidentiality maps (Duncan et
    al. (2001))
  • Bayesian framework (Trottini (2001), Trottini &
    Fienberg (2002))
  • The risk measure is the ability to identify small
    cell counts

Wealth (W)            Weak     Weak        Strong   Strong
Location (L)          Center   Outskirts   Center   Outskirts
Gender (G)  Male      8        6           2        9
Gender (G)  Female    0        3           5        1
Survey of self-employed shop-owners. Source:
Willenborg & de Waal, adapted example
27
Motivation: Current Disclosure Methods
  • NISS Digital Government Project
  • Release of margins
  • Maintains existing statistical correlations
  • Determine safe releases via bounds and
    distributions
  • Linear/Integer programming
  • Roehrig et al. (1999), Dobra (2001)
  • Decomposable and graphical log-linear models
  • Dobra & Fienberg (2000, 2002)
  • Shuttle Algorithm
  • Dobra (2002)
  • Gröbner (Markov) bases to enumerate or sample
  • Diaconis & Sturmfels (1998), Dobra &
    Fienberg (2000, 2002), Dobra et al. (2003)
  • Release of regressions (Jerry Reiter)
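
To make "safe releases via bounds" concrete, the sketch below computes sharp cell bounds for the shop-owner table when all three two-way margins are released. It relies on the fact that, for a 2x2x2 table, every integer move preserving all two-way margins is a multiple of a single move that adds t·(−1)^(w+l+g) to each cell, so enumerating t is enough. The 0/1 coding of the levels and the function name are assumptions made for the illustration.

```python
# Cells indexed (w, l, g): w 0 = weak, 1 = strong; l 0 = center, 1 = outskirts;
# g 0 = male, 1 = female. Counts are from the shop-owner example table.
TABLE = {(0, 0, 0): 8, (0, 1, 0): 6, (1, 0, 0): 2, (1, 1, 0): 9,
         (0, 0, 1): 0, (0, 1, 1): 3, (1, 0, 1): 5, (1, 1, 1): 1}


def tables_with_same_two_way_margins(table):
    """Enumerate non-negative tables sharing all two-way margins with table."""
    n = sum(table.values())
    found = []
    for t in range(-n, n + 1):
        # Adding t * (-1)^(w+l+g) leaves every two-way margin unchanged.
        cand = {c: v + t * (-1) ** sum(c) for c, v in table.items()}
        if min(cand.values()) >= 0:
            found.append(cand)
    return found


feasible = tables_with_same_two_way_margins(TABLE)
bounds = {c: (min(t[c] for t in feasible), max(t[c] for t in feasible))
          for c in TABLE}
print(len(feasible), bounds)
```

Only four tables are feasible here, so several cells are bounded within a range of three counts.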

28
Why conditionals?
Wealth (W)            Weak     Weak    Strong   Strong
Location (L)          Center   Out     Center   Out
Gender (G)  Male      8        6       2        9
Gender (G)  Female    0        3       5        1

Cell bounds as (upper, lower), three pairs per cell:
            Weak/Center        Weak/Out           Strong/Center      Strong/Out
Male        24,0  20,0  8,5    12,0  10,0  9,6    6,0   5,0  5,2     18,0  15,0  9,6
Female      0,0   0,0   3,0    24,0  6,0   3,0    34,0  9,1  5,2     8,0   2,0   4,1

Cell bounds as (upper, lower) given f(w,l), f(w,g), f(l,g):
Male        8,5   9,6   5,2   9,6
Female      3,0   3,0   5,2   4,1
Survey of self-employed shop-owners
Source: Willenborg & de Waal, adapted example
  • Assess the causal distribution: P(W = w | L = l) =
    Σ_g P(w | l, g) P(g)
  • Assess the treatment effect: P(W = 1 | L = 1,
    G = 1) − P(W = 1 | L = 0, G = 0)
  • Release f(w,l,g) vs. f(w | l, g), f(g)
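
The quantity Σ_g P(w | l, g) P(g) needs only the conditional f(w | l, g) and the marginal f(g), which is the motivation for releasing them. The sketch below evaluates it from the shop-owner counts; the string labels and function name are hypothetical.

```python
from fractions import Fraction as F

# Counts from the shop-owner table, keyed by (wealth, location, gender).
COUNTS = {
    ("weak", "center", "male"): 8, ("weak", "out", "male"): 6,
    ("strong", "center", "male"): 2, ("strong", "out", "male"): 9,
    ("weak", "center", "female"): 0, ("weak", "out", "female"): 3,
    ("strong", "center", "female"): 5, ("strong", "out", "female"): 1,
}


def adjusted_probability(w, l):
    """Compute sum over g of P(w | l, g) * P(g) from the table of counts."""
    n = sum(COUNTS.values())
    total = F(0)
    for g in ("male", "female"):
        n_g = sum(v for (_, _, gg), v in COUNTS.items() if gg == g)
        n_lg = sum(v for (_, ll, gg), v in COUNTS.items() if ll == l and gg == g)
        if n_lg == 0:
            raise ValueError("P(w | l, g) is undefined for an empty stratum")
        total += F(COUNTS[(w, l, g)], n_lg) * F(n_g, n)
    return total


print(adjusted_probability("weak", "center"))
```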

29
Proposed Work: primary focus
  • Uniqueness
  • Evaluation of Vardi & Lee (1993) algorithm
  • Algebraic algorithm via Gröbner basis and a zero
    ideal
  • Compare consistency & efficiency
  • Boundary cases
  • Structure & computation of bounds
  • Linear programming
  • Computational algebra
  • DAGs
  • Structure & computation of distributions
  • MCMC & computational algebra to enumerate or
    sample

30
Proposed Work: secondary focus
  • Bayesian Framework
  • Explore extensions to Dobra et al. (2003) work on
    posterior distributions defined over the space of
    all tables given conditional and marginal
    constraints
  • Causality
  • Compare our bounds to Balke & Pearl's causality
    bounds
  • Explore applicability of our bounds to their
    problem
  • Assess the risk associated with the release of
    conditionals and marginals in large-scale tabular
    databases

31
LP Bounds given f(x|y), f(x), if I < J
  • Example: 3x4 table
  • A, B, C are combinations of ratios of differences
    of cell values
  • Proposed work
  • Generalization to I x J table, where I < J
  • Generalization to k-way tables
  • Explore closed form solutions
  • Common structure like in the case of margins

32
Motivation: Unique identification
  • Publicly available data
  • American FactFinder website (Source: U.S. Census
    Bureau, Block data)
  • Uniqueness
  • Sweeney (2000): Date of birth, gender, 5-digit ZIP
  • Likely unique identification of 87% of the U.S.
    population

RACE                                          All ages   18 years and over
                                              Number     Number
Total population                              83         70
White                                         70         63
Black or African American                     1          1
American Indian and Alaska Native             0          0
Asian                                         9          6
Native Hawaiian and Other Pacific Islander    0          0
Two or more races                             3          0
33
Ideal membership problem
  • Identifying other sets of equivalent constraints
  • Given Gröbner basis G for an ideal I
  • f belongs to I iff remainder r on division of f
    by G is zero.

34
Solving system of equations
  • Finding all tables consistent with a given set of
    conditionals & marginals
  • Asking for the points in the affine variety
  • Locus of tables = zero set of polynomials
  • Example: 2x2 table, f(x), f(y|x)
  • Constraints
  • p11 + p12 + p21 + p22 − 1 = 0
  • p11 + p12 − 0.5 = 0
  • −0.4 p11 + 0.6 p21 = 0
  • −0.8 p12 + 0.2 p22 = 0
  • Gröbner basis
  • −4/5 p12 + 1/5 p22 = 0
  • p11 + p12 + p21 + p22 − 1 = 0
  • 3/2 p21 1/4 p21 - ½ 0
  • −5/6 p21 + 1/6 = 0

35
Effect of sample size N
  • Important for proper statistical inference
  • Tighter bounds distributions
  • Mapping between probabilities and counts
  • p_{i|j} = x_ij / x_{+j} = r/q, r, q ∈ Z, 0 < q ≤ N
  • 0 ≤ p12 ≤ 0.286
  • (r, q) = (2, 7), (4, 14), but ONLY 1 table!

Counts (cell probabilities):
        Y1           Y2
  X1    3 (0.214)    2 (0.143)
  X2    4 (0.286)    5 (0.357)

Conditionals X|Y:
        Y1       Y2
  X1    0.429    0.286
  X2    0.571    0.714
  Sum   1        1
36
Effect of sample size N
  • Effect of release of N on bounds and
    distributions?
  • Important for proper statistical inference
  • Tighter bounds
  • Mapping between probabilities and counts
  • p_{i|j} = x_ij / x_{+j} = r/q, r, q ∈ Z, 0 < q ≤ N
  • 0 ≤ p11 ≤ 0.42857
  • (r, q) = (3, 7), (6, 14), but ONLY 1 table!

Counts (cell probabilities):
        Y1           Y2
  X1    3 (0.214)    2 (0.143)
  X2    4 (0.286)    5 (0.357)

Conditionals X|Y:
        Y1       Y2
  X1    0.429    0.286
  X2    0.571    0.714
  Sum   1        1
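
The "ONLY 1 table" remark can be checked by enumeration: with the conditionals X|Y fixed at 3/7, 4/7 and 2/7, 5/7, each column total must be a multiple of 7, so N = 14 leaves a single table with positive cells. The sketch below assumes positive cells and uses a hypothetical function name; the N = 28 call is an added comparison, not a figure from the slides.

```python
from fractions import Fraction as F


def tables_given_column_conditionals(cond, n):
    """All 2x2 tables with cells >= 1, total n, and P(X = x | Y = y) = cond[x][y]."""
    tables = []
    for c1 in range(1, n):          # total of column 1; column 2 gets the rest
        c2 = n - c1
        cells = [[cond[0][0] * c1, cond[0][1] * c2],
                 [cond[1][0] * c1, cond[1][1] * c2]]
        if all(v.denominator == 1 and v >= 1 for row in cells for v in row):
            tables.append([[int(v) for v in row] for row in cells])
    return tables


cond = [[F(3, 7), F(2, 7)], [F(4, 7), F(5, 7)]]
print(tables_given_column_conditionals(cond, 14))
print(len(tables_given_column_conditionals(cond, 28)))
```

Releasing N together with the conditionals therefore discloses this table exactly, while a larger N leaves several candidates.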
37
Solving system of equations
  • Finding all tables consistent with a given set of
    constraints
  • Asking for the points in the affine variety
  • Loci of tables = zero sets of polynomials
  • Marginals and conditionals as polynomials for an
    ideal
  • Example: one conditional, 2x2 table
  • I = ⟨p11 + p12 + p21 + p22 − 1, −0.4 p11 + 0.6 p12,
    −0.8 p21 + 0.2 p22⟩
  • Gröbner basis:
  • −0.8 p21 + 0.2 p22, p11 + p12 + p21 + p22 − 1,
    −2.5 p12 − 1.25 p22 + 1
  • Hilbert series: 1 / (1 − p11)^2
  • Hilbert function: H(t) = t + 1, t ≥ 0
  • Dimension 1, surface degree 1

38
Solving system of equations: uniqueness
  • Example: CoCoA, given two conditionals
  • Ring = Q[p11, p12, p21, p22], Lex
  • J = Ideal(p11 + p12 + p21 + p22 − 1, −0.4 p11 + 0.6 p12,
    −0.8 p21 + 0.2 p22, −0.25 p11 + 0.75 p21,
    −1/3 p22 + 2/3 p12)
  • GBasis(J) = [2/3 p12 − 1/3 p22, −4/5 p21 + 1/5 p22,
    p11 + p12 + p21 + p22 − 1, −5/2 p22 + 1]
  • Back substitution: (0.3, 0.2, 0.1, 0.4)
  • Poincare(MyRing/J): Hilbert series (1)
  • Hilbert(MyRing/J): Hilbert function H(0) = 1,
    H(t) = 0 for t ≥ 1
  • Dimension 0, point in the space
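
Because all of these constraints are linear in the cell probabilities, the dimension of the solution set can also be checked with plain linear algebra: it equals the number of cells minus the rank of the constraint matrix, provided the system is consistent. The sketch below does this with exact fractions instead of a Gröbner basis. The conditional constraints are written with row/column indices chosen so that (0.3, 0.2, 0.1, 0.4) satisfies each of them, and the function names are hypothetical.

```python
from fractions import Fraction as F


def rank(rows):
    """Rank of a matrix of exact numbers via Gauss-Jordan elimination."""
    m = [[F(v) for v in row] for row in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                factor = m[r][col] / m[rk][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def solution_dimension(coeffs, rhs):
    """Dimension of {p : coeffs * p = rhs}, or None if the system is inconsistent."""
    augmented = [row + [b] for row, b in zip(coeffs, rhs)]
    if rank(coeffs) != rank(augmented):
        return None
    return len(coeffs[0]) - rank(coeffs)


# Unknowns ordered (p11, p12, p21, p22).
total = [1, 1, 1, 1]
y_given_x = [[F(-2, 5), F(3, 5), 0, 0], [0, 0, F(-4, 5), F(1, 5)]]
x_given_y = [[F(-1, 4), 0, F(3, 4), 0], [0, F(2, 3), 0, F(-1, 3)]]

one_conditional = solution_dimension([total] + y_given_x, [1, 0, 0])
two_conditionals = solution_dimension([total] + y_given_x + x_given_y,
                                      [1, 0, 0, 0, 0])
print(one_conditional, two_conditionals)
```

A dimension of 1 corresponds to the one-conditional example, and a dimension of 0 to the unique point obtained from two conditionals.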

39
Geometric interpretation
  • Future work: generalization to k-way tables
  • Linear constraints give us manifolds in higher
    dimensions
  • Dimension depends on the # of linearly independent
    constraints:
  • (a) fixed margin: # of levels − 1
  • (b) fixed conditional: # of cells − # of levels
    of conditioning variable
  • Dimension = (IJ − 1) − (a) − (b)
  • Example: 2x2, f(x), f(y|x) ⇒ dim = 0, unique point
  • Conditional tables, use algebraic geometry

40
Proposed Work
  • Uniqueness
  • Evaluation of Vardi & Lee (1993) algorithm
  • Algebraic algorithm via Gröbner basis and a zero
    ideal
  • Compare consistency & efficiency
  • Boundary cases
  • Linear programming
  • Bounds generalization to any I x J and multi-way
    tables
  • Is there a common structure like in the case of
    margins?
  • Closed form solutions

41
Proposed Work
  • DAG framework
  • Existence of bounds that differ from marginal
    bounds
  • Find a way of computing them
  • DAG factorization results parallel to
    decomposable models
  • Chain graphs
  • MCMC
  • Bounds and distributions
  • Markov basis via Gröbner basis
  • Saturation algorithm (Bigatti et al. 1999)

42
Solving system of equations: uniqueness
  • Example: CoCoA, given two conditionals
  • Ring = Q[p11, p12, p21, p22]
  • Ideal
  • p11 + p12 + p21 + p22 − 1, −0.4 p11 + 0.6 p12,
    −0.8 p21 + 0.2 p22,
  • −0.25 p11 + 0.75 p21, −1/3 p22 + 2/3 p12
  • Gröbner Basis
  • 2/3 p12 − 1/3 p22, −4/5 p21 + 1/5 p22, p11 + p12 +
    p21 + p22 − 1, −5/2 p22 + 1
  • Back substitution: (0.3, 0.2, 0.1, 0.4)
  • Dimension 0, point in the space

43
Geometric interpretation
  • Surface of constant α = 6
  • Locus of conditional tables P(Y | X = x)
  • Locus of conditional tables P(X | Y = y)