Considerations in Grouping Multivariate Data presentation

2
Considerations in Grouping Multivariate Data
Myron J. Katzoff, Jay J. Kim, Joe Fred Gonzalez, Jr., and Lawrence H. Cox
CDC, National Center for Health Statistics, Office of Research and Methodology, USA
April 25, 2006
3
What is grouped data?

Data which have been categorized (or binned) in
accordance with some grid-like structure.

4
Reasons for Grouping Data

To call attention to certain features of the underlying distributions of variables,
To effectively summarize data for portability to various applications or analyses, or
To preserve the confidentiality of the sample members who have provided data.

5
Purpose of paper

Examine some of the consequences of binning with nonuniform rectangular grids.
Consider some problems in multivariate inference and examine special considerations for results obtained from complex surveys.

6
Relationship to other work

Our work is connected with research in several other areas:
multivariate analysis with rounded, microaggregated or coarsened data;
regression analysis with grouped data (we will outline a toy problem for survey data that we are investigating);
investigation of the ecological inference problem;
investigation of the modifiable areal unit problem (MAUP); and
investigation of the (general) change of support problem (COSP).

7
Relationship to other work (cont.)

The ecological inference problem (Gotway and Young, 2002) is often of
concern in the analysis of public health data.
Ecological inference is the process of deducing individual behavior
from aggregate data; the problem occurs when analyses based on grouped
data yield conclusions which are markedly different from those obtained
from individual data. The problem is difficult to detect, and one rarely
gets the chance to confirm whether or not it exists in a given setting.

8
Relationship to other work (cont.)

The ecological inference problem is analogous to creating estimates for
small areas using synthetic (or indirect) estimation (Gonzalez, Placek,
and Scott, 1996), that is, by applying national estimates within
socio-demographic groups to smaller areas.

9
Relationship to other work (cont.)

The underlying rationale for synthetic estimation is that the
distribution of a health characteristic is highly correlated with the
demographic composition of the population (Gonzalez, Placek, and Scott,
1996).

10
Relationship to other work (cont.)

Therefore, it is assumed that differences in the prevalence of the
characteristic between two areas are due primarily to differences in
demographic composition (e.g., age, race, and sex).
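
The synthetic estimation idea described above can be illustrated with a small calculation: national rates within demographic groups are weighted by each area's own demographic mix. This is only a minimal sketch. The function name `synthetic_estimate`, the age groups, the rates, and the population counts are all hypothetical and chosen for illustration; none of them come from the presentation.

```python
def synthetic_estimate(national_rates, area_counts):
    """Weight national group-level rates by the area's own demographic mix."""
    total = sum(area_counts.values())
    return sum(national_rates[g] * n for g, n in area_counts.items()) / total

# Hypothetical national prevalence by age group (illustrative numbers only).
national_rates = {"under_40": 0.10, "40_to_64": 0.20, "65_plus": 0.40}

# Hypothetical population counts for two small areas.
younger_area = {"under_40": 600, "40_to_64": 300, "65_plus": 100}
older_area = {"under_40": 200, "40_to_64": 300, "65_plus": 500}

est_younger = synthetic_estimate(national_rates, younger_area)
est_older = synthetic_estimate(national_rates, older_area)
print(est_younger, est_older)
```

Under these assumed inputs the two areas differ only in demographic composition, so any difference between the two estimates is due entirely to that composition, which is exactly the assumption stated on the slide.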

11
MAUP (modifiable areal unit problem)

The smoothing effect that results from averaging in spatial analysis
creates the scaling problem in the MAUP. The aggregation of areal units
smooths and alters the spatial autocorrelation of units, causing a
zoning effect.

12
Ecological inference problem vs. MAUP vs. COSP

The ecological inference problem and the MAUP are
specific realizations of the change of support
problem (COSP).
Many other terms have also been introduced to
describe particular COSPs and solutions to
particular COSPs, including the scaling problem,
inference between incompatible zonal systems,
block kriging, pycnophylactic geographic
interpolation, the polygonal overlay problem,
areal interpolation, inference with spatially
misaligned data, contour aggregation, and
multiscale and multiresolution modeling.

13
Toy Problem

(1) Generate N three-tuples (X, Y, Z) from a trivariate normal distribution.
(2) Create two strata by classifying the three-tuples on Z, arbitrarily, into two strata of sizes N1 and N2, determined after the selection in (1).
(3) Consider two sampling situations: (a) sample from the strata at the same rate and (b) sample from the strata at different rates.

14
Toy Problem (cont.)

Group the pairs (X, Y) on values of X according to a variety of schemes
for later comparison, and regress Y on X. (Might want to consider
weighted regression for the case of different sampling rates; see Fuller
and the SUDAAN documentation.)
Compare results. Main questions: How should different sampling rates be
taken into account in the analysis? Is there evidence that differential
sampling affects the conclusions about grouping? How? What about tests
of significance with regard to estimates of coefficients?
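
The toy problem can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the covariance matrix, the cut point on Z, the sampling rates, the bin edges, and the helper names `draw_sample`, `wls_line` and `grouped_fit` are all hypothetical. Inverse sampling rates are used as weights, and no design-based variance estimation (as SUDAAN would provide) is attempted.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Generate N three-tuples (X, Y, Z) from a trivariate normal distribution.
N = 20_000
cov = np.array([[1.0, 0.6, 0.5],
                [0.6, 1.0, 0.4],
                [0.5, 0.4, 1.0]])  # assumed covariance, illustrative only
X, Y, Z = rng.multivariate_normal(np.zeros(3), cov, size=N).T

# Two strata from an arbitrary cut on Z (assumed cut point 0.5).
stratum = (Z > 0.5).astype(int)

def draw_sample(rates):
    """Bernoulli-sample each stratum at its own rate; return indices and weights."""
    rates = np.asarray(rates)
    keep = rng.random(N) < rates[stratum]
    idx = np.flatnonzero(keep)
    weights = 1.0 / rates[stratum[idx]]  # inverse sampling rate
    return idx, weights

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns array (a, b)."""
    A = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef

def grouped_fit(x, y, w, edges):
    """Bin on x, then regress weighted bin means of y on weighted bin means of x."""
    b = np.digitize(x, edges)
    labels = np.unique(b)  # only non-empty bins
    wsum = np.array([w[b == g].sum() for g in labels])
    xbar = np.array([np.average(x[b == g], weights=w[b == g]) for g in labels])
    ybar = np.array([np.average(y[b == g], weights=w[b == g]) for g in labels])
    return wls_line(xbar, ybar, wsum)

edges = np.array([-1.5, -0.5, 0.0, 0.25, 1.0])  # a nonuniform grid on X

idx, w = draw_sample([0.05, 0.25])  # strata sampled at different rates
fit_individual_weighted = wls_line(X[idx], Y[idx], w)
fit_individual_unweighted = wls_line(X[idx], Y[idx], np.ones(idx.size))
fit_grouped_weighted = grouped_fit(X[idx], Y[idx], w, edges)
print(fit_individual_weighted, fit_individual_unweighted, fit_grouped_weighted)
```

With the assumed covariance the population slope of Y on X is 0.6, so the printed slopes give a first impression of how weighting and grouping interact. Rerunning with equal rates, for example `draw_sample([0.1, 0.1])`, covers the other sampling situation, and swapping in different `edges` compares grouping schemes.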

15
Effects of Grouping Data on First and Second
Distribution Moments (Kim, Katzoff, Gonzalez, and Cox, 2004)

Additional Reasons for Grouping Data
Concise but useful summarization of data
Protection of respondent confidentiality
Limitation of disclosure risk

16
Mean and Variance of Grouped and Ungrouped Data

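Suppose x is uniformly distributed, where $x = 0, 1, 2, \ldots, n$. Suppose $n + 1$ is a multiple
of $k$, where $k$ is the number of intervals.
I. Mean and variance of x when it is not grouped.
$E(x) = \frac{n}{2}$ and $E(x^2) = \frac{n(2n+1)}{6}$.
Thus,
$\mathrm{Var}(x) = E(x^2) - [E(x)]^2 = \frac{n(n+2)}{12}$.
If x starts from c, then $\mathrm{Var}(x) = \frac{n(n+2)}{12}$ as before, since c is a constant.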
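
For x uniform on 0, 1, ..., n the standard results are E(x) = n/2 and Var(x) = n(n+2)/12. The sketch below checks these numerically and compares them with a grouped version of the same data. It assumes, as an illustration only, that each of the k equal-width intervals is represented by the mean of the values it contains; the slide does not say which representative value is used. The choices n = 11 and k = 3 are arbitrary.

```python
import numpy as np

n, k = 11, 3                 # n + 1 = 12 values split into k = 3 intervals
m = (n + 1) // k             # number of values per interval
x = np.arange(n + 1)         # x = 0, 1, ..., n, each with probability 1/(n+1)

# Ungrouped moments (population formulas, so the default ddof = 0 is used).
mean_x, var_x = x.mean(), x.var()

# Grouped version: every value is replaced by the mean of its interval.
x_grouped = x.reshape(k, m).mean(axis=1).repeat(m)
mean_g, var_g = x_grouped.mean(), x_grouped.var()

print(mean_x, var_x)                      # compare with n/2 and n(n+2)/12
print(mean_g, var_g)                      # mean is preserved, variance shrinks
print(var_x - var_g, (m**2 - 1) / 12)     # loss equals the within-interval variance
```

Under this assumption grouping leaves the mean unchanged and reduces the variance by (m^2 - 1)/12, the variance of a uniform variable on m consecutive integers, where m = (n + 1)/k is the interval width.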

17
Questions to be answered

How many intervals should be used?
Where should interval endpoints be placed?
What quantities should be produced for each
interval?
What would be a useful measure of disclosure
risk?

18
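Consider a random variable $X$ with a finite range $R(X)$ and let
$\{A_1, \ldots, A_k\}$ be a decomposition of $R(X)$ into a finite
collection of $k$ mutually exclusive and exhaustive sets. Let $Z_i$
denote the indicator r.v. for the set $A_i$, that is, $Z_i = 1$ if
$X \in A_i$ and $Z_i = 0$ otherwise. Define
$Z = \sum_{i=1}^{k} E(X \mid X \in A_i)\, Z_i$, where $EZ = EX$,
and it is known that
$\mathrm{Var}\,X = E\,\mathrm{Var}(X \mid Z) + \mathrm{Var}\,E(X \mid Z)$.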
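
The variance decomposition on this slide, Var X = E Var(X | Z) + Var E(X | Z), can be checked numerically for a nonuniform grouping. The following is a minimal sketch: the distribution of X, the assignment of values to sets, and the variable names are all hypothetical and serve only to illustrate the identity.

```python
import numpy as np

# Hypothetical finite distribution for X (values and probabilities are illustrative).
values = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
probs = np.array([0.10, 0.20, 0.25, 0.15, 0.15, 0.10, 0.05])

# A nonuniform decomposition of the range into k = 3 sets, coded by set label.
labels = np.array([0, 0, 1, 1, 1, 2, 2])

mean_x = np.sum(probs * values)
var_x = np.sum(probs * (values - mean_x) ** 2)

within, between = 0.0, 0.0
for g in np.unique(labels):
    in_g = labels == g
    p_g = probs[in_g].sum()                                # P(X in set g)
    mean_g = np.sum(probs[in_g] * values[in_g]) / p_g       # E(X | set g)
    var_g = np.sum(probs[in_g] * (values[in_g] - mean_g) ** 2) / p_g
    within += p_g * var_g                                  # contributes to E Var(X | Z)
    between += p_g * (mean_g - mean_x) ** 2                # contributes to Var E(X | Z)

print(var_x, within + between, between / var_x)
```

The last printed value is the share of the variance that survives when only the group means are released, which is one way to compare alternative placements of interval endpoints.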
19
Some Directions for Future Work

Measures of disclosure risk
One possibility might be
A multivariate extension to address effects on
other second order moments.

20
Directions for Future Work (cont.)

Considerations of the likely uses of datasets, for example, regression
analysis where grouping might be done on any one or more variables.
Recall regression analysis done with group means.
Adaptation of methods for robust estimation, for example, determining
the groups of values to be eliminated when trimming/winsorizing.

21
References

1.) Gonzalez, Joe Fred, Jr., Placek, Paul J., and Scott, Chester, Chapter 2: Synthetic Estimation in Follow-back Surveys at the National Center for Health Statistics, in Indirect Estimators in U.S. Federal Programs, Wesley L. Schaible (Editor), Lecture Notes in Statistics, Springer-Verlag, 1996.
2.) Gotway and Young, Combining Incompatible Spatial Data, JASA, pp. 632-648, 2002.
3.) Kim, Jay J., Katzoff, Myron, Gonzalez, Joe Fred, Jr., and Cox, Lawrence H., Effects of Grouping on First and Second Distribution Moments, Survey Research Methods Proceedings, Joint Statistical Meetings, Toronto, Canada, 2004.
