1. RANDOM IMPUTATION USING BALANCED SAMPLING
- Jean-Claude Deville, ENSAI/CREST, Laboratoire de Statistique d'Enquête, Campus de Ker-Lann, 35170 BRUZ - deville_at_ensai.fr
2. OVERVIEW OF THE PRESENTATION
- 1. Random imputation: pros and cons
- 2. Binary ((0-1)-type) variable and balanced sampling using the cube method
- 3. Qualitative (vectorialized) variable
- 4. Numerical variable and balancing using a submartingale property
3. NON-RESPONSE: General ideas
Unit nonresponse, item nonresponse. Weighting or imputation? Imputation by prediction, stochastic imputation, imputation using a parametric model (or not)...
The p-dimensional parameter θ is estimated using some estimating equations having the general form

  Σ_{k∈r} d_k u_k(θ) = 0

or the normal form

  Σ_{k∈r} d_k h_k (y_k − g_k(θ)) = 0.
4. Imputation using a parametric prediction
g_k(θ) is imputed using a good value of the parameter. Of course, the g_k make use of explanatory auxiliary variables x_k known on the whole sample s, and are deduced from a plausible economico-social model. Generally, and we limit ourselves here to this case, we make use of a system of p estimating equations having the form

  Σ_{k∈r} d_k u_k(θ) = 0,

where the u_k are functions taking their values in R^p and x_k some information available in s. Here, we are interested essentially in equations of the normal type

  Σ_{k∈r} d_k h_k (y_k − g_k(θ)) = 0,

where the h_k are instruments which can be chosen arbitrarily.
Example 1: only one parameter, denoted R, and g_k = R x_k. One estimating equation, of the ratio type. If the response model is SRS, we find the ratio estimator.
Example 2: y_k = x_k'β. The model is a regression and the estimating equations are the normal equations.
Example 3: y_k is a (0-1) variable; its predictor is a number between 0 and 1. In many models, this number is identified with the probability that y_k = 1. For instance, in a logistic regression using the x_k as explanatory variables, one can use the normal equations with f(·) = exp(·)/(1 + exp(·)).
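As a small illustration of Examples 1 and 2 (the data, weights, and the true slope 2.0 are invented for the sketch), the estimating equations can be solved directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented respondent data: design weights d_k, auxiliary x_k known on s,
# and y_k observed on the respondents r (true slope 2.0 by construction).
n = 200
d = np.ones(n)
x = rng.uniform(1.0, 3.0, n)
y = 2.0 * x + rng.normal(0.0, 0.3, n)

# Example 1: g_k = R x_k. The estimating equation
# sum_r d_k (y_k - R x_k) = 0 yields the ratio estimator.
R = np.sum(d * y) / np.sum(d * x)

# Example 2: y_k = x_k' beta. The normal equations
# sum_r d_k x_k (y_k - x_k' beta) = 0 are solved directly.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y))
print(R, beta)  # both recover a slope close to 2.0
```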
5. STOCHASTIC IMPUTATION: TECHNIQUES FOR REDUCING THE IMPUTATION VARIANCE
The missing y_k are seen as random variables. For each k, the parametric law of y_k has been estimated from the sample r of respondents. Imputation consists of adding a random residual to the prediction, possibly using some other estimated parameters. The goal of this operation is to obtain a consistent estimation of the distribution function of y, and of any estimation of the total of a non-linear transformation of y. If the real world follows the model, it can be shown that this goal is nearly fulfilled. We shall consider the case of binary (or 0-1) variables, discrete (or qualitative) variables, and continuous (or numerical, or real) variables. The case of ordered discrete variables (small integers, such as household size) reduces, according to concrete considerations, to one of the preceding cases. In any case, stochastic imputation consists of adding a centred random variable to a predictor (for a qualitative variable we will use the vector of indicators of the categories, as usual). Formally,

  y*_k = ŷ_k + ε_k,  with E*(ε_k) = 0.
6. The estimated law of the ε depends deterministically on r. However, the imputed values depend on a new random mechanism, independent of the randomness of sampling (including the response mechanism). Under this mechanism (conditional on r), the expectation of y*_k is ŷ_k (which depends only on r), and a supplementary "parasitic" variance appears for the estimator, even for quantities which are estimated without bias by the predictors. Denoting E* and V* the expectation and variance under the imputation law, we get

  V(Ŷ*) = V( E*(Ŷ* | r) ) + E( V*(Ŷ* | r) )
          (predictor variance) (parasitic variance)

If imputations are independent, the last term (seen as sequential) behaves like a random walk and its variance has order of magnitude Card(o), where o = s \ r denotes the set of nonrespondents to be imputed. We aim to reduce this variance, which is far from negligible in practice, to a bounded quantity, independent of the size of o, by introducing negative covariances between the imputed values.
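To see the order of magnitude of this parasitic term, a small simulation (the residual law and sizes are invented) of the sum of independent centred imputation residuals:

```python
import numpy as np

rng = np.random.default_rng(1)

# With independent imputations, the sum of the centred residuals behaves
# like a random walk: its variance grows like Card(o).
def parasitic_variance(m, sigma=1.0, reps=2000):
    e = rng.normal(0.0, sigma, size=(reps, m))  # reps independent imputation runs
    return e.sum(axis=1).var()

v100, v400 = parasitic_variance(100), parasitic_variance(400)
print(round(v100), round(v400))  # grows roughly linearly in Card(o)
```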
7. 2. Binary ((0-1)-type) variable and balanced sampling (cube-method type)
Imputation consists of assigning the values 0 or 1 to the units k of o according to the estimated probabilities Q_k. Performing independent imputations amounts to a Poisson sampling of the units imputed to 1, which is very inefficient. We should prefer a fixed-size sampling (we suppose that the sum of the Q_k has been rounded to an integer). We could also introduce supplementary constraints on the sampling. If the instruments h_k are available in o too, it will be natural to introduce the constraints

  Σ_{k∈o} h_k y*_k = Σ_{k∈o} h_k Q_k,

which reduce the sampling to a balanced sampling in the line of the cube method (Deville and Tillé, Biometrika, 2004). With those constraints, the Q_k estimated from the imputed data are the same as those estimated from the respondents. Another advantage: a substantial reduction of the imputation variance. Following Deville and Tillé, JSPI (2005), the imputation variance in estimating the total of some φ_k is the variance of the residuals of the d_k φ_k on the h_k (which can be chosen arbitrarily, a fact which opens many perspectives).
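A minimal fixed-size sketch: systematic sampling on the Q_k is used here as a simple stand-in for the cube method, valid only when the single balancing constraint is the fixed number of imputed ones (the data are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

def systematic_sample(p):
    """Fixed-size sampling with inclusion probabilities p (sum assumed
    rounded to an integer). A stand-in for the cube method when the only
    balancing constraint is the fixed number of imputed ones."""
    p = np.asarray(p, dtype=float)
    cum = np.concatenate([[0.0], np.cumsum(p)])
    start = rng.uniform(0.0, 1.0)
    # unit k is selected when start + (an integer) falls in [cum[k], cum[k+1])
    picks = start + np.arange(int(round(cum[-1])))
    idx = np.searchsorted(cum, picks, side="right") - 1
    y_star = np.zeros(len(p), dtype=int)
    y_star[idx] = 1
    return y_star

Q = np.full(20, 0.35)          # estimated probabilities Q_k; sum = 7
y_star = systematic_sample(Q)  # imputed 0/1 values for the units of o
print(y_star.sum())            # exactly 7 ones: a fixed-size imputation
```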
8. 3. Balanced imputation of a qualitative variable
It is nearly the same. One supposes that the Q_ki, the probabilities for unit k to take category i of the qualitative variable, have been estimated consistently (for instance using a log-linear adjustment Q_ki ∝ exp(x_k'β_i + c_k) on the respondents). In fact, what we have to do is to sample cells in a table o × I in which we put the value 1, such that the sum in each row is 1. The column sums can be controlled such that the totals n_i are equal to Σ_{k∈o} Q_ki (those quantities can be rounded to integers by raking). If Card(I) is not too big, this sampling can be performed using a multidimensional Poisson sampling as described in Deville (2005), or the cube method, or, simply, the following heuristic based on the raking ratio:
- Step k: impute category i to unit k using the probabilities Q_ki.
- Update the margins: n_i → n_i − 1; the other totals do not change.
- Adjust the probabilities Q_ki by raking ratio on the new totals, for the rows starting from k+1.
- Step k+1.
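A rough sketch of this heuristic (the helper names, the Dirichlet-generated Q_ki, and the targets are invented; the column targets are assumed already rounded to integers):

```python
import numpy as np

rng = np.random.default_rng(3)

def rake(Q, col_targets, iters=50):
    """Raking ratio (iterative proportional fitting): adjust Q so that
    column sums match col_targets while each row still sums to 1."""
    Q = Q.copy()
    for _ in range(iters):
        Q *= col_targets / np.maximum(Q.sum(axis=0), 1e-12)  # fit columns
        Q /= Q.sum(axis=1, keepdims=True)                    # restore rows
    return Q

def impute_qualitative(Q, n_targets):
    """Sequential heuristic: impute a category for each unit in turn,
    decrement its margin, and re-rake the remaining rows."""
    Q = Q.astype(float).copy()
    n = n_targets.astype(float).copy()
    m, I = Q.shape
    out = np.empty(m, dtype=int)
    for k in range(m):
        i = rng.choice(I, p=Q[k] / Q[k].sum())  # draw with probas Q_ki
        out[k] = i
        n[i] -= 1.0                             # n_i -> n_i - 1
        if k + 1 < m:
            Q[k + 1:] = rake(Q[k + 1:], n)      # adjust rows k+1, ...
    return out

Q0 = rng.dirichlet(np.ones(3), size=12)   # invented estimated Q_ki
targets = np.array([4, 4, 4])             # rounded column totals n_i
imp = impute_qualitative(Q0, targets)
print(np.bincount(imp, minlength=3))      # the margins are hit exactly: [4 4 4]
```

An exhausted category gets probability exactly zero in all remaining rows, so the column totals are matched exactly whatever the raking accuracy.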
9. But we can do better! As before, we can put more constraints on the imputed values, for instance something like

  Σ_{k∈o} h_k y*_k' = Σ_{k∈o} h_k Q_k',

denoting y*_k and Q_k the row vectors of the category indicators y*_ki and of the Q_ki (the preceding equality is in fact an equality of matrices). Once more, the cube method can be used.
10. 4. Imputation of a real variable using a positive submartingale (a: introduction)
We impute y*_k = ŷ_k + σ_k e_k. The random terms are taken in the form σ_k e_k, where the e_k follow the same probability law, postulated by the imputation model, and σ_k is a variable known on s. This can take many patterns, from the banal normal assumption to an estimated nonparametric law. Personally, I like the following technique. We collect the empirical prediction errors ê_k, k in r, from the estimation procedure. If it is possible, we estimate a variance function using a moment-type method, by solving

  Σ_{k∈r} d_k h_k (ê_k² − σ²(x_k)) = 0,

with some instruments h having the same dimension as the parameter of σ² (the case where x has dimension 1 is standard, as is the case where the constant, and free of charge, variable 1 is used). Then we construct a pool of residuals

  ẽ_k = ê_k / σ̂_k,  k ∈ r,

which is a sample of the postulated law for the deviations. Then, if they are good-looking, one adjusts a standard law; else (what I prefer), we consider them a non-parametric estimation of the law. The naive method of imputation consists of drawing independently some e*_k for k in o and imputing y*_k = ŷ_k + σ_k e*_k.
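The residual pool and the naive independent draws can be sketched as follows (the predictions, the variance scale 0.5·x_k, and all sizes are invented assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical respondent data: predictions yhat_k and a variance scale
# sigma_k = 0.5 * x_k are assumed already estimated on r.
n_r, n_o = 300, 100
x_r = rng.uniform(1.0, 3.0, n_r)
sigma_r = 0.5 * x_r
y_r = 2.0 * x_r + sigma_r * rng.normal(size=n_r)   # observed on r
yhat_r = 2.0 * x_r

# Pool of standardized residuals: a sample of the postulated law of e.
pool = (y_r - yhat_r) / sigma_r

# Naive imputation: independent draws from the pool for the units of o.
x_o = rng.uniform(1.0, 3.0, n_o)
y_imp = 2.0 * x_o + 0.5 * x_o * rng.choice(pool, size=n_o, replace=True)
print(round(pool.mean(), 2))  # the pool is (nearly) centred
```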
11. If we are working sequentially, the current variance of the sum of the imputed residuals is equal to Σ_{k≤i} σ_k² and increases linearly. It is the same for sums like

  S_i = Σ_{k≤i} h_k σ_k e*_k,

using an arbitrary vector of instruments h_k: the variance matrix increases like i. We will show that it is possible to control the S_i so that they stay of order O_p(1) instead of O_p(i^{1/2}), and to keep a bounded variance whatever the number of units in o. In particular, if one of the coordinates of h_k is d_k σ_k, the estimation of the total of y remains nearly the same. More generally, if the predictors have been adjusted using normal-type equations, reusing the same instruments ensures that the estimating equations for the parameter will have the same solution with the imputed data as with the respondent data (of course, if the instruments are also available in s). Before showing how to do it, we will give another, deeper reason to do it.
12. 4-b: Why do it?
We want to estimate Σ_k φ(y_k) by Σ_{k∈r} φ(y_k) + Σ_{k∈o} φ(y*_k). For a regular function φ and small random terms, we get approximately

  φ(y*_k) ≈ φ(ŷ_k) + φ'(ŷ_k) σ_k e*_k.

Taking expectations, we see that the imputation bias essentially disappears. The imputation variance is also reduced: if we have an exact regression, the variance is equal to zero. Otherwise, let Σ_e be the variance-covariance matrix of the e*. It is symmetric positive with dimension Card(o), and its null space contains the column vectors of the components of the h. The imputation variance is (with φ the vector of the φ'(ŷ_k) σ_k) φ'Σ_e φ. If the e*_k are Gaussian we can compute this value. If the e*_k are independent, the variance matrix is Σ = diag(σ_k²). When there are linear constraints (strictly or approximately respected), we get the variance matrix of a Gaussian vector conditioned by the linear constraints, and its value is Σ(Id − H(H'Σ⁻¹H)⁻¹H'Σ⁻¹), with H the matrix having the h_k as rows. The variance of the estimator is then the variance of the residuals of the regression of the φ'(ŷ_k) σ_k on the h_k, with weights σ_k⁻².
Remark: the distribution function φ_t(y) = 1(y < t) is not smooth enough? It can be shown that nearly all of the preceding remains valid for the distribution function.
13. 4-c: And now, how to do it?
- The algorithm is very simple: look at the scalar product S_i · h_{i+1}.
- If S_i · h_{i+1} > 0, we draw e_{i+1} in the negative part of the distribution of e;
- in the positive part in the opposite case.
- As a consequence, the norm of S_i decreases in expectation as soon as it becomes greater than some threshold, and its variance remains bounded.
Here is a proof in the simplified case where the e have a symmetric law and the same variance, and where the h_k all have the same norm. We have

  ||S_i||² = ||S_{i-1}||² + 2 (S_{i-1} · h_i) e_i + ||h_i||² e_i²,

and therefore, taking the conditional expectation given S_{i-1}, we get

  E( ||S_i||² | S_{i-1} ) = ||S_{i-1}||² − 2 ||S_{i-1}|| E( |u · h_i| |e_i| ) + ||h_i||² E(e_i²),

where u is the unit vector in the direction of S_{i-1}.
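The sign rule can be simulated; a minimal sketch, assuming a symmetric residual pool (standard normal draws here) and instruments normalized to the same norm, as in the simplified proof:

```python
import numpy as np

rng = np.random.default_rng(5)

# Symmetric residual pool: standard normal draws here (an assumption;
# any symmetric estimated law would do).
pool = rng.normal(size=5000)

def draw_signed(sign):
    """Draw a residual e from the negative (sign=-1) or positive
    (sign=+1) part of the pooled distribution of e."""
    half = pool[pool * sign > 0]
    return rng.choice(half)

m, p = 2000, 2
h = rng.normal(size=(m, p))
h /= np.linalg.norm(h, axis=1, keepdims=True)  # same norm, as in the proof

S = np.zeros(p)
norms = np.empty(m)
for i in range(m):
    # if S . h_{i+1} > 0, draw e in the negative part, else in the positive part
    e = draw_signed(-1.0 if S @ h[i] > 0 else 1.0)
    S = S + h[i] * e
    norms[i] = np.linalg.norm(S)

print(round(norms.max(), 1))  # stays O(1); a random walk would reach ~sqrt(m)
```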
14. Let a be the infimum in u of the expectation of the last term. We see that E(||S_i||² | S_{i-1}) − ||S_{i-1}||² < 0 whenever ||S_{i-1}|| exceeds some radius. As long as S stays in this security ball, we do not worry! As soon as it goes out, the sequence of random variables ||S_i||² is a positive submartingale which tends almost surely back to the ball. In particular, the variance of the S_i is uniformly bounded by the variance of the law of the first exit out of the security ball. If it happens that the e follow spherical normal laws, the radius of the security ball is easy to compute: its value depends only on p, the dimensionality of the h, and σ, the standard deviation of the e.
15. Exponential law for e [simulation figure]
16. The same, on a logarithmic scale [figure]
17. A shorter run (for a better view), still with the exponential law [figure]
18. Here, it was a rectangular (uniform) law [figure]
19. THE END
- THANK YOU FOR YOUR PATIENCE
- and goodbye
- (as G. used to say)