Considerations in Grouping Multivariate Data

2
Considerations in Grouping Multivariate Data
Myron J. Katzoff, Jay J. Kim, Joe Fred Gonzalez, Jr.,
and Lawrence H. Cox
CDC, National Center for Health Statistics, Office of
Research and Methodology, USA
April 25, 2006
3
What is grouped data?
  • Data which have been categorized (or binned) in
    accordance with some grid-like structure.

4
Reasons for Grouping Data
  • To call attention to certain features of the
    underlying distributions of variables,
  • To effectively summarize data for portability to
    various applications or analyses, or
  • To preserve the confidentiality of the sample
    members who have provided data.

5
Purpose of paper
  • Examine some of the consequences of binning with
    nonuniform rectangular grids
  • Consider some problems in multivariate inference
    and examine special considerations for results
    obtained from complex surveys.
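
As a minimal sketch of what binning on a nonuniform rectangular grid can look like, the snippet below cross-tabulates a simulated bivariate sample. The data, the seed, and the bin edges are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical bivariate sample; seed and size are illustrative only.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# Nonuniform edges: narrow cells near the center, wide cells in the tails.
x_edges = np.array([-np.inf, -1.0, -0.25, 0.25, 1.0, np.inf])
y_edges = np.array([-np.inf, -0.5, 0.5, np.inf])

# np.digitize with the interior edges gives a cell index 0..(number of cells - 1).
x_cell = np.digitize(x, x_edges[1:-1])
y_cell = np.digitize(y, y_edges[1:-1])

# Cross-tabulate the cell indices into a table of counts (the grouped data).
counts = np.zeros((len(x_edges) - 1, len(y_edges) - 1), dtype=int)
np.add.at(counts, (x_cell, y_cell), 1)
print(counts)
```

The grid is rectangular because the cells are products of one-dimensional intervals, and nonuniform because those intervals have unequal widths.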

6
Relationship to other work
  • Our work is connected with research in several
    other areas:
  • multivariate analysis with rounded,
    microaggregated or coarsened data;
  • regression analysis with grouped data (we will
    outline a toy problem for survey data that we
    are investigating);
  • investigation of the ecological inference
    problem;
  • investigation of the modifiable areal unit
    problem (MAUP); and
  • investigation of the (general) change of support
    problem (COSP).

7
Relationship to other work (cont.)
  • The ecological inference problem (Gotway and
    Young, 2002) is often of concern in the analysis
    of public health data.
  • Ecological inference is the process of deducing
    individual behavior from aggregate data. The
    problem occurs when analyses based on grouped
    data yield conclusions which are markedly
    different from those obtained from individual
    data.
  • This is a difficult problem to ascertain, and one
    rarely gets the chance to confirm whether or not
    the problem exists in a given setting.
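
The gap between grouped and individual analyses can be illustrated with a small simulation. This is a sketch under made-up assumptions (50 groups, a shared group effect, large individual-level noise), not an analysis from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n_groups, n_per_group = 50, 200

# Hypothetical model: x and y share a group-level effect, but the
# individual-level noise is much larger than that effect.
group_effect = rng.normal(size=(n_groups, 1))
x = group_effect + rng.normal(scale=3.0, size=(n_groups, n_per_group))
y = group_effect + rng.normal(scale=3.0, size=(n_groups, n_per_group))

# Correlation computed from individual records.
r_individual = np.corrcoef(x.ravel(), y.ravel())[0, 1]

# Correlation computed from group means only (the aggregate data).
r_grouped = np.corrcoef(x.mean(axis=1), y.mean(axis=1))[0, 1]

print(f"individual-level correlation: {r_individual:.2f}")
print(f"group-mean correlation:       {r_grouped:.2f}")
```

Averaging removes most of the individual-level noise, so the group means appear far more strongly related than the individuals are. An analyst who sees only the aggregates cannot tell how much of the association holds at the individual level.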

8
Relationship to other work (cont.)
  • The ecological inference problem is analogous to
    creating estimates for small areas using
    synthetic (or indirect) estimation (Gonzalez et
    al., 1996), that is, by applying national
    estimates within socio-demographic groups at
    smaller areas.

9
Relationship to other work (cont.)
  • The underlying rationale for synthetic estimation
    is that the distribution of a health
    characteristic is highly correlated with the
    demographic composition of the population
    (Gonzalez et al., 1996).

10
Relationship to other work (cont.)
  • Therefore, it is assumed that differences in the
    prevalence of the characteristic between two
    areas are due primarily to differences in
    demographic composition (e.g., age, race, sex).
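
The arithmetic behind a synthetic estimate can be sketched as follows: national rates within demographic groups are applied to each small area's population composition. All rates, group labels, and population counts here are invented for illustration.

```python
# Hypothetical national prevalence rates by demographic group.
national_rate = {"under 45": 0.05, "45 to 64": 0.12, "65 and over": 0.25}

# Hypothetical population counts by group for two small areas.
area_population = {
    "Area A": {"under 45": 6000, "45 to 64": 3000, "65 and over": 1000},
    "Area B": {"under 45": 3000, "45 to 64": 3000, "65 and over": 4000},
}


def synthetic_estimate(population, rates):
    """Prevalence implied by applying group rates to an area's composition."""
    total = sum(population.values())
    expected_cases = sum(population[group] * rates[group] for group in population)
    return expected_cases / total


estimates = {
    area: synthetic_estimate(population, national_rate)
    for area, population in area_population.items()
}
for area, value in estimates.items():
    print(f"{area}: synthetic prevalence estimate = {value:.3f}")
```

The two areas receive different estimates only because their demographic compositions differ, which is exactly the assumption stated above.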

11
MAUP (modifiable areal unit problem)
  • The smoothing effect that results from averaging
    in spatial analysis creates the scaling problem
    in the MAUP. The aggregation of areal units
    smooths and alters the spatial autocorrelation
    of units, causing a zoning effect.

12
Ecological inference problem vs. MAUP vs. COSP
  • The ecological inference problem and the MAUP are
    specific realizations of the change of support
    problem (COSP).
  • Many other terms have also been introduced to
    describe particular COSPs and solutions to
    particular COSPs including the scaling problem,
    inference between incompatible zonal systems,
    block kriging, pycnophylactic geographic
    interpolation, the polygonal overlay problem,
    areal interpolation, inference with spatially
    misaligned data, contour aggregation, and
    multiscale and multiresolution modeling.

13
Toy Problem
  • Generate N three-tuples (X, Y, Z) from a
    trivariate normal distribution.
  • Create two strata by classifying the three-tuples
    on Z arbitrarily into two strata of sizes N1 and
    N2, determined after the selection in step 1.
  • We consider two sampling situations: (1) sample
    from the strata at the same rate and (2) sample
    from the strata at different rates.

14
Toy Problem (cont.)
  • Group the pairs (X, Y) on values of X according
    to a variety of schemes, for later comparison,
    and regress Y on X. (We might want to consider
    weighted regression for the case of different
    sampling rates; see Fuller and the SUDAAN
    documentation.)
  • Compare results. Main questions: How should
    different sampling rates be taken into account in
    the analysis? Is there evidence that differential
    sampling affects the conclusions about grouping?
    How? What about tests of significance with regard
    to estimates of coefficients?
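
The toy problem can be sketched in a few lines. Everything specific below is an assumption made for illustration and not taken from the paper: the covariance matrix, the cut on Z, the sampling rates, the bin edges, and the choice to represent each interval by its (weighted) mean of X.

```python
import numpy as np

rng = np.random.default_rng(2006)  # arbitrary seed, for reproducibility only

# Step 1: N three-tuples (X, Y, Z) from a trivariate normal (assumed covariance).
N = 20_000
cov = np.array([[1.0, 0.6, 0.5],
                [0.6, 1.0, 0.4],
                [0.5, 0.4, 1.0]])
X, Y, Z = rng.multivariate_normal(np.zeros(3), cov, size=N).T

# Step 2: two strata defined by an arbitrary cut on Z.
stratum = (Z > 0.5).astype(int)

# Step 3: sampling situation (2), different rates in the two strata.
rates = np.array([0.05, 0.50])
selected = rng.random(N) < rates[stratum]
x, y = X[selected], Y[selected]
w = 1.0 / rates[stratum[selected]]  # inverse-probability sampling weights


def group_on_x(x, w, edges):
    """Replace each x by the weighted mean of x within its interval."""
    cell = np.digitize(x, edges)
    w_sum = np.bincount(cell, weights=w, minlength=len(edges) + 1)
    wx_sum = np.bincount(cell, weights=w * x, minlength=len(edges) + 1)
    return wx_sum[cell] / w_sum[cell]


def slope(x, y, w):
    """Slope from weighted least squares of y on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    root_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef[1]


edges = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # one arbitrary nonuniform scheme
ones = np.ones_like(w)

results = {
    "ungrouped, unweighted": slope(x, y, ones),
    "ungrouped, weighted": slope(x, y, w),
    "grouped, unweighted": slope(group_on_x(x, ones, edges), y, ones),
    "grouped, weighted": slope(group_on_x(x, w, edges), y, w),
}
for label, value in results.items():
    print(f"{label:>22}: slope = {value:.3f}")
```

With this covariance matrix the population slope of Y on X is 0.6, so the four printed slopes give a rough sense of how grouping and weighting each move the estimate. The sketch does not address the significance-testing question, which would need design-based variance estimates.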

15
Effects of Grouping Data on First and Second
Distribution Moments (Kim et al., 2004)
  • Additional reasons for grouping data:
  • Concise but useful summarization of data
  • Protection of respondent confidentiality
  • Limitation of disclosure risk

16
Mean and Variance of Grouped and Ungrouped Data
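  • Suppose x is uniformly distributed, where x = 0,
    1, 2, ..., n. Suppose n + 1 is a multiple of k,
    where k is the number of intervals.
  • I. Mean and variance of x when it is not grouped.
  • $E(x) = \frac{1}{n+1}\sum_{x=0}^{n} x = \frac{n}{2}$
    and $E(x^2) = \frac{1}{n+1}\sum_{x=0}^{n} x^2 = \frac{n(2n+1)}{6}$.
  • Thus,
  • $V(x) = E(x^2) - [E(x)]^2 = \frac{n(2n+1)}{6} - \frac{n^2}{4} = \frac{n(n+2)}{12}$.
  • If x starts from c, then $E(x) = c + \frac{n}{2}$
    and $V(x)$ is unchanged, since c is a constant.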
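
A quick exact check of the ungrouped moments, together with what happens when each value is replaced by the midpoint of its interval, can be sketched as follows. The values n = 11 and k = 3 and the midpoint representation are assumptions made for illustration.

```python
from fractions import Fraction


def mean_var(values):
    """Exact mean and (population) variance of equally likely values."""
    count = len(values)
    mean = sum(values, Fraction(0)) / count
    var = sum((v - mean) ** 2 for v in values) / count
    return mean, var


n, k = 11, 3                      # n + 1 = 12 is a multiple of k = 3
width = (n + 1) // k              # number of values per interval

x = [Fraction(i) for i in range(n + 1)]
# Replace each value by the midpoint (which is also the mean) of its interval.
grouped = [Fraction(width * (i // width)) + Fraction(width - 1, 2)
           for i in range(n + 1)]

mean_x, var_x = mean_var(x)
mean_g, var_g = mean_var(grouped)

print("ungrouped mean, variance:", mean_x, var_x)
print("grouped mean, variance:  ", mean_g, var_g)
print("variance lost to grouping:", var_x - var_g)
```

For equal-width intervals the mean is preserved, and the variance drops by the within-interval variance, (width^2 - 1)/12.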

17
Questions to be answered
  • How many intervals should be used?
  • Where should interval endpoints be placed?
  • What quantities should be produced for each
    interval?
  • What would be a useful measure of disclosure
    risk?

18
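Consider a random variable $X$ with a finite range
$R(X)$ and let $\{A_1, \ldots, A_k\}$ be a
decomposition of $R(X)$ into a finite collection of
$k$ mutually exclusive and exhaustive sets. Let
$I_j$ denote the indicator r.v. for the set $A_j$,
that is, $I_j = 1$ if $X \in A_j$ and $I_j = 0$
otherwise. Define $Z$ from these indicators, where
$E(Z) = E(X)$, and it is known that
$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Z)] + \mathrm{Var}[E(X \mid Z)]$.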
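
The variance decomposition can be checked numerically on a small example. This sketch assumes that Z is the conditional mean of X within the set containing X, which is one natural choice satisfying E(Z) = E(X). The distribution and the partition are made up.

```python
from fractions import Fraction

# Hypothetical finite distribution of X.
values = [0, 1, 2, 3, 4, 5]
probs = [Fraction(w, 12) for w in (1, 2, 3, 3, 2, 1)]

# Hypothetical nonuniform decomposition of the range into k = 3 sets.
sets = [{0}, {1, 2}, {3, 4, 5}]


def expectation(vals, ps):
    return sum(v * p for v, p in zip(vals, ps))


e_x = expectation(values, probs)
var_x = expectation([(v - e_x) ** 2 for v in values], probs)

within = Fraction(0)          # accumulates E[Var(X | Z)]
z_values, z_probs = [], []    # distribution of Z (assumed: conditional means)
for s in sets:
    members = [(v, p) for v, p in zip(values, probs) if v in s]
    p_set = sum(p for _, p in members)
    cond_mean = sum(v * p for v, p in members) / p_set
    cond_var = sum((v - cond_mean) ** 2 * p for v, p in members) / p_set
    within += p_set * cond_var
    z_values.append(cond_mean)
    z_probs.append(p_set)

e_z = expectation(z_values, z_probs)
between = expectation([(z - e_z) ** 2 for z in z_values], z_probs)  # Var[E(X | Z)]

print("Var(X)            =", var_x)
print("E[Var(X|Z)]       =", within)
print("Var[E(X|Z)]       =", between)
print("sum of components =", within + between)
```

The within-set term is the part of the variance that a data user loses when only Z is released, which is why the choice of sets matters for second moments.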
19
Some Directions for Future Work
  • Measures of disclosure risk
  • One possibility might be
  • A multivariate extension to address effects on
    other second order moments.

20
Directions for Future Work (cont.)
  • Considerations of the likely uses of datasets,
    for example, regression analysis where grouping
    might be done on any one or more variables.
  • Recall regression analysis done with group means.
  • Adaptation of methods for robust estimation, for
    example, determining the groups of values to be
    eliminated when trimming/winsorizing.

21
References
  • 1.) Gonzalez, Joe Fred, Jr., Placek, Paul J., and
    Scott, Chester, Chapter 2: Synthetic Estimation
    in Follow-back Surveys at the National Center for
    Health Statistics, Lecture Notes in Statistics,
    Indirect Estimators in U.S. Federal Programs,
    Springer-Verlag, Wesley L. Schaible (Editor),
    1996.
  • 2.) Gotway and Young (2002), Combining
    Incompatible Spatial Data, JASA, pp. 632-648.
  • 3.) Kim, Jay J., Katzoff, Myron, Gonzalez, Joe
    Fred, Jr., and Cox, Lawrence H., Effects of
    Grouping on First and Second Distribution
    Moments, Survey Research Methods Proceedings,
    Joint Statistical Meetings, Toronto, Canada,
    2004.
