Title: Considerations in Grouping Multivariate Data
2Considerations in Grouping Multivariate Data
Myron J. Katzoff, Jay J. Kim, Joe Fred Gonzalez, Jr., and Lawrence H. Cox
CDC, National Center for Health Statistics, Office of Research and Methodology, USA
April 25, 2006
3What is grouped data?
- Data which have been categorized (or binned) in
accordance with some grid-like structure.
4Reasons for Grouping Data
- To call attention to certain features of the underlying distributions of variables,
- To effectively summarize data for portability to various applications or analyses, or
- To preserve the confidentiality of the sample members who have provided data.
5Purpose of paper
- Examine some of the consequences of binning with nonuniform rectangular grids.
- Consider some problems in multivariate inference and examine special considerations for results obtained from complex surveys.
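As a rough illustration of binning on a nonuniform rectangular grid, the sketch below assigns bivariate points to cells defined by unequally spaced edges and counts the points per cell. The function names (`bin_index`, `grid_counts`), the edge values, and the convention of half-open intervals with the top edge included in the last cell are assumptions made for this example; they are not taken from the paper.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, edges):
    """Index i of the interval [edges[i], edges[i+1]) that contains value.

    A value equal to the last edge is placed in the last interval.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} lies outside the grid")
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


def grid_counts(points, x_edges, y_edges):
    """Count (x, y) points per rectangular cell of a possibly nonuniform grid."""
    counts = Counter()
    for x, y in points:
        counts[(bin_index(x, x_edges), bin_index(y, y_edges))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical nonuniform edges: narrow cells near zero, wide cells further out.
    x_edges = [0, 1, 5, 10]
    y_edges = [0, 2, 3]
    points = [(0.5, 0.5), (0.9, 2.5), (4, 1), (10, 3), (1, 2)]
    for cell, count in sorted(grid_counts(points, x_edges, y_edges).items()):
        print(cell, count)
```

Only the cell counts are kept here; any other per-cell summary (means, medians) could be accumulated in the same loop.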
6Relationship to other work
- Our work is connected with research in several other areas:
  - multivariate analysis with rounded, microaggregated or coarsened data;
  - regression analysis with grouped data (we will outline a toy problem for survey data that we are investigating);
  - investigation of the ecological inference problem;
  - investigation of the modifiable areal unit problem (MAUP); and
  - investigation of the (general) change of support problem (COSP).
7Relationship to other work (cont.)
- The ecological inference problem (2002) is often of concern in the analysis of public health data.
- Ecological inference is the process of deducing individual behavior from aggregate data. The problem occurs when analyses based on grouped data yield conclusions that are markedly different from those obtained from individual data.
- The problem is difficult to detect, and one rarely gets the chance to confirm whether or not it exists in a given setting.
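A small constructed example may make the ecological inference problem concrete. In the sketch below, the data values are invented purely for illustration: within every group the relationship between x and y is negative, yet a regression on the group means gives a positive slope. The helper names `ols_slope` and `group_means` are assumptions of this example.

```python
def ols_slope(points):
    """Ordinary least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    return sxy / sxx


def group_means(groups):
    """Replace each group of (x, y) pairs by its pair of means."""
    return [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups
    ]


# Invented individual-level data in three groups.
GROUPS = [
    [(1, 3), (2, 2), (3, 1)],
    [(4, 6), (5, 5), (6, 4)],
    [(7, 9), (8, 8), (9, 7)],
]

if __name__ == "__main__":
    pooled = [p for g in GROUPS for p in g]
    print("slope from group means:      ", ols_slope(group_means(GROUPS)))
    print("slope from pooled individuals:", ols_slope(pooled))
    print("slopes within each group:     ", [ols_slope(g) for g in GROUPS])
```

An analyst who sees only the group means would infer a positive individual-level relationship, which is the opposite of what holds inside each group.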
8Relationship to other work (cont.)
- The ecological inference problem is analogous to creating estimates for small areas using synthetic (or indirect) estimation (1996) by applying national estimates within socio-demographic groups at smaller areas.
9Relationship to other work (cont.)
- The underlying rationale for synthetic estimation is that the distribution of a health characteristic is highly correlated with the demographic composition of the population (1996).
10Relationship to other work (cont.)
- Therefore, it is assumed that differences in the
prevalence of the characteristics between two
areas are due primarily to differences in
demographic composition (e.g., age, race, sex,
etc.).
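The synthetic estimation idea described above can be sketched as a weighted average: national rates within demographic groups are applied to the local population counts of those groups. The function name `synthetic_estimate`, the group labels, and all numbers below are hypothetical and serve only to show the arithmetic.

```python
def synthetic_estimate(national_rates, local_counts):
    """Synthetic prevalence estimate for one small area.

    national_rates: {group: national prevalence in that group}
    local_counts:   {group: population of that group in the area}
    """
    total = sum(local_counts.values())
    if total <= 0:
        raise ValueError("local population must be positive")
    return sum(national_rates[g] * n for g, n in local_counts.items()) / total


if __name__ == "__main__":
    # Hypothetical national prevalence by age group.
    rates = {"young": 0.10, "old": 0.30}
    # Two hypothetical areas that differ only in demographic composition.
    area_a = {"young": 300, "old": 100}
    area_b = {"young": 100, "old": 300}
    print("area A:", synthetic_estimate(rates, area_a))
    print("area B:", synthetic_estimate(rates, area_b))
```

By construction, the two areas receive different estimates only because their demographic mix differs, which is exactly the assumption stated on the slide.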
11MAUP (modifiable areal unit problem)
- The smoothing effect that results from averaging in spatial analysis creates the scaling problem in the MAUP. The aggregation of areal units smooths and alters the spatial autocorrelation of units, causing a zoning effect.
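A one-dimensional sketch, using invented values for eight adjacent areal units, may help separate the two effects. Averaging over larger zones smooths the data (the scale effect), and zones of similar size with shifted boundaries produce different summaries (the zoning effect). The function `zone_means` and the unit values are assumptions of this example.

```python
from statistics import pvariance


def zone_means(values, zone_sizes):
    """Average consecutive units into contiguous zones of the given sizes."""
    if sum(zone_sizes) != len(values):
        raise ValueError("zone sizes must cover every unit exactly once")
    means, start = [], 0
    for size in zone_sizes:
        zone = values[start:start + size]
        means.append(sum(zone) / size)
        start += size
    return means


# Invented values for eight adjacent units along a line.
UNITS = [1, 9, 2, 8, 3, 7, 4, 6]

if __name__ == "__main__":
    zonings = {
        "units as observed": [1] * 8,
        "pairs": [2, 2, 2, 2],
        "pairs, boundaries shifted by one unit": [1, 2, 2, 2, 1],
        "quadruples": [4, 4],
    }
    for label, sizes in zonings.items():
        means = zone_means(UNITS, sizes)
        print(f"{label}: means={means}, variance={pvariance(means):.2f}")
```

With these values, pairing adjacent units removes all variation between zones, while shifting the zone boundaries by a single unit restores some of it.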
12Ecological inference problem vs. MAUP vs. COSP
- The ecological inference problem and the MAUP are specific realizations of the change of support problem (COSP).
- Many other terms have also been introduced to describe particular COSPs and solutions to particular COSPs, including the scaling problem, inference between incompatible zonal systems, block kriging, pycnophylactic geographic interpolation, the polygonal overlay problem, areal interpolation, inference with spatially misaligned data, contour aggregation, and multiscale and multiresolution modeling.
13Toy Problem
- Generate N three-tuples (X, Y, Z) from a trivariate normal distribution.
- Create two strata by classifying the three-tuples on Z arbitrarily into two strata of sizes N1 and N2, determined after the selection in the first step.
- We consider two sampling situations: (1) sample from the strata at the same rate and (2) sample from the strata at different rates.
14Toy Problem (cont.)
- Group the pairs (X, Y) on values of X according to a variety of schemes, for later comparison, and regress Y on X. (Might want to consider weighted regression for the case of different sampling rates; see Fuller and the SUDAAN documentation.)
- Compare results. Main questions: How should different sampling rates be taken into account in the analysis? Is there evidence that differential sampling affects the conclusions about grouping? How? What about tests of significance with regard to estimates of coefficients?
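The toy problem can be sketched in a few lines of standard-library Python. Everything specific here is an assumption made for illustration and not the authors' setup: the true line (intercept 1, slope 2), the way Z is made to depend on X, the stratum cutoff, the sampling rates, the grouping edges, and the use of inverse sampling rates as regression weights. The helper names `wls`, `simulate`, `grouped_fit`, and `ungrouped_fit` are likewise hypothetical.

```python
import random
from bisect import bisect_right


def wls(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b


def simulate(n=4000, rates=(0.5, 0.2), z_cut=0.0, seed=0):
    """Generate (X, Y, Z), stratify on Z, and sample each stratum at its rate.

    Returns a list of (x, y, weight) with weight = 1 / sampling rate.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        # Linear combinations of independent normals are jointly normal.
        x = rng.gauss(0, 1)
        y = 1 + 2 * x + rng.gauss(0, 1)  # assumed true line
        z = 0.5 * x + rng.gauss(0, 1)    # assumed dependence of Z on X
        rate = rates[0] if z <= z_cut else rates[1]
        if rng.random() < rate:
            sample.append((x, y, 1 / rate))
    return sample


def ungrouped_fit(sample):
    """Weighted regression of Y on X using the individual observations."""
    x, y, w = zip(*sample)
    return wls(x, y, w)


def grouped_fit(sample, x_edges=(-1.0, 0.0, 1.0)):
    """Group on X, replace each group by its weighted means, then fit by WLS."""
    cells = {}
    for x, y, w in sample:
        g = bisect_right(x_edges, x)
        sw, swx, swy = cells.get(g, (0.0, 0.0, 0.0))
        cells[g] = (sw + w, swx + w * x, swy + w * y)
    gw = [c[0] for c in cells.values()]
    gx = [c[1] / c[0] for c in cells.values()]
    gy = [c[2] / c[0] for c in cells.values()]
    return wls(gx, gy, gw)


if __name__ == "__main__":
    for label, rates in [("equal rates", (0.3, 0.3)), ("unequal rates", (0.5, 0.2))]:
        sample = simulate(rates=rates)
        print(label, "ungrouped:", ungrouped_fit(sample), "grouped:", grouped_fit(sample))
```

Changing `x_edges` gives the alternative grouping schemes the slide asks to compare. This sketch says nothing about significance tests, which would need design-based variance estimates.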
15Effects of Grouping Data on First and Second Distribution Moments (2004)
- Additional Reasons for Grouping Data
- Concise but useful summarization of data
- Protection of respondent confidentiality
- Limitation of disclosure risk
16Mean and Variance of Grouped and Ungrouped Data
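- Suppose x is uniformly distributed, where $x = 0, 1, 2, \ldots, n$. Suppose $n + 1$ is a multiple of $k$, where $k$ is the number of intervals.
- I. Mean and variance of x when it is not grouped.
- $E(x) = \frac{1}{n+1}\sum_{x=0}^{n} x = \frac{n}{2}$ and $E(x^2) = \frac{1}{n+1}\sum_{x=0}^{n} x^2 = \frac{n(2n+1)}{6}$.
- Thus,
- $V(x) = E(x^2) - [E(x)]^2 = \frac{n(2n+1)}{6} - \frac{n^2}{4} = \frac{n(n+2)}{12}$.
- If x starts from c, then $V(x + c) = V(x)$, since c is a constant.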
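For a discrete uniform variable on 0, 1, ..., n, the ungrouped mean is n/2 and the variance is n(n+2)/12. The sketch below checks these values by exact enumeration and compares them with a grouped version of the data. Representing each of the k equal-width intervals by its midpoint is an assumption of this example, as are the function names `mean_var` and `group_to_midpoints`.

```python
from fractions import Fraction


def mean_var(values):
    """Exact mean and population variance of equally likely values."""
    n = len(values)
    mean = sum(map(Fraction, values)) / n
    var = sum((Fraction(v) - mean) ** 2 for v in values) / n
    return mean, var


def group_to_midpoints(n, k):
    """Replace each x in 0..n by the midpoint of its interval (k equal intervals)."""
    if (n + 1) % k:
        raise ValueError("n + 1 must be a multiple of k")
    width = (n + 1) // k
    # Interval j holds j*width, ..., j*width + width - 1.
    return [Fraction(2 * (x // width) * width + width - 1, 2) for x in range(n + 1)]


if __name__ == "__main__":
    n, k = 11, 3
    print("ungrouped:", mean_var(list(range(n + 1))))
    print("grouped:  ", mean_var(group_to_midpoints(n, k)))
```

Under the midpoint assumption the mean is preserved, while the variance drops by the within-interval variance, which for intervals of width m equals (m^2 - 1)/12.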
17Questions to be answered
- How many intervals should be used?
- Where should interval endpoints be placed?
- What quantities should be produced for each interval?
- What would be a useful measure of disclosure risk?
18 Consider a random variable $X$ with a finite range $R(X)$ and let
$\{A_1, \ldots, A_k\}$ be a decomposition of $R(X)$ into a finite
collection of $k$ mutually exclusive and exhaustive sets. Let $I_i$ denote the
indicator r.v. for the set $A_i$, that is, $I_i = 1$ if $X \in A_i$ and
$I_i = 0$ otherwise. Define $Z$ where $E[Z] = E[X]$;
it is known that $\mathrm{Var}[X] = E[\mathrm{Var}(X \mid Z)] + \mathrm{Var}[E(X \mid Z)]$.
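The variance decomposition can be checked numerically. The sketch below assumes, for illustration only, that Z takes the conditional mean of X within the set containing X; that choice satisfies E[Z] = E[X], and any Z that identifies the set gives the same two components. The values of X are treated as equally likely, and the function names and example sets are hypothetical.

```python
from fractions import Fraction


def mean(values):
    """Exact mean of equally likely values."""
    return sum(map(Fraction, values)) / len(values)


def var(values):
    """Exact population variance of equally likely values."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / len(values)


def variance_components(groups):
    """Return (E[Var(X|Z)], Var[E(X|Z)]) for a partition of the values of X."""
    n = sum(len(g) for g in groups)
    overall = mean([v for g in groups for v in g])
    within = sum(Fraction(len(g), n) * var(g) for g in groups)
    between = sum(Fraction(len(g), n) * (mean(g) - overall) ** 2 for g in groups)
    return within, between


if __name__ == "__main__":
    # Hypothetical partition of the range of X into k = 3 sets.
    groups = [[0, 1], [2, 3], [10, 20]]
    within, between = variance_components(groups)
    total = var([v for g in groups for v in g])
    print("within:", within, "between:", between, "total:", total)
```

The between-set component is the variance that survives grouping; the within-set component is what is lost when each value is replaced by a group-level summary.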
19Some Directions for Future Work
- Measures of disclosure risk
- One possibility might be
- A multivariate extension to address effects on
other second order moments.
20Directions for Future Work (cont.)
- Considerations of the likely uses of datasets, for example, regression analysis where grouping might be done on any one or more variables.
- Recall regression analysis done with group means.
- Adaptation of methods for robust estimation, for example, determining the groups of values to be eliminated when trimming/winsorizing.
21References
- 1.) Gonzalez, Joe Fred, Jr., Placek, Paul J., and Scott, Chester, Chapter 2: Synthetic Estimation in Follow-back Surveys at the National Center for Health Statistics, in Indirect Estimators in U.S. Federal Programs, Wesley L. Schaible (Editor), Lecture Notes in Statistics, Springer-Verlag, 1996.
- 2.) Gotway and Young (2002), Combining Incompatible Spatial Data, JASA, pp. 632-648.
- 3.) Kim, Jay J., Katzoff, Myron, Gonzalez, Joe Fred, Jr., and Cox, Lawrence H., Effects of Grouping on First and Second Distribution Moments, Survey Research Methods Proceedings, Joint Statistical Meetings, Toronto, Canada, 2004.