Title: Stata's mishandling of missing data
1 Stata's mishandling of missing data
- Missing data in logical and relational expressions: a problem and two solutions
5 Stata's conventions
a. Relations (such as >) treat missing data as positive infinity, so relations are never undefined: they are simply true or false.
b. Logical operators (and if) treat missing values as true, so logical expressions are never undefined: they are simply true or false.
(Strictly, (b) is an isolable subset of a more general rule, c: logical operators treat all non-zero values as true. But rule (c), when detached from (a) and (b), may be eccentric but is not pernicious.)
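The conventions above can be mimicked in a few lines of Python. This is only a model of rules (a) to (c), not Stata code: representing missing as positive infinity, and the helper names gt and is_true, are assumptions made for illustration.

```python
import math

# Model assumption: Stata's missing value "." is represented by +infinity.
MISSING = math.inf

def gt(a, b):
    """Rule (a): a relation is always simply true or false."""
    return a > b

def is_true(x):
    """Rules (b) and (c): any non-zero value, missing included, counts as true."""
    return x != 0

age = MISSING         # an unknown age
unemployed = MISSING  # an unknown indicator variable

print(gt(age, 50))          # True: the unknown age passes "age > 50"
print(gt(50, age))          # False
print(is_true(unemployed))  # True: the unknown indicator counts as true
```

Under these rules an observation with unknown age is selected by a condition such as age>50, which is the behaviour the following slides criticise.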
8 First criticism: commands should do what they seem to do.
Response? Users should understand the conventions; it is then simple to test for missing data as appropriate. No big deal.
Responding, second criticism: the proffered prophylactic strategy does not scale well. It is messy and error-prone.
9 1. Normal truth table
20 1. Normal truth table in Stata
22 1. Normal relation
25 1. Normal relation in Stata
32 2. Test for missing data?
if (a>b)
if (a>b) & !mi(a,b)
if (4>3 | .>2)
if (4>3 | .>2) & !mi(4,3,.,2)
F (false, although 4>3 is known to be true)
36 2. Test for missing data?
if (a>b)
if (a>b) & !mi(a,b)
if (a>b | c>d)
if (a>b | c>d) & !mi(a,b,c,d)
if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
Correct, but messy.
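The difference between the blanket test and the clause-by-clause test can be checked with a small Python model. It assumes that the compound condition on these slides is a disjunction (a>b | c>d), and it represents missing as positive infinity to imitate Stata's convention; mi, naive and per_clause are illustrative names.

```python
import math

MISSING = math.inf  # model assumption: Stata's "." sorts above every number

def mi(*values):
    """Mimic Stata's mi(): true if any argument is missing."""
    return any(v == MISSING for v in values)

def naive(a, b, c, d):
    # if (a>b | c>d) & !mi(a,b,c,d)
    return (a > b or c > d) and not mi(a, b, c, d)

def per_clause(a, b, c, d):
    # if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
    return (a > b and not mi(a, b)) or (c > d and not mi(c, d))

print(naive(4, 3, MISSING, 2))       # False: the blanket test rejects the case
print(per_clause(4, 3, MISSING, 2))  # True: 4>3 settles the disjunction
```

Each clause needs its own missing-data guard, so the length of the condition grows with every relation added.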
37 2. Generating new variables
Even messier.
40 2. Generating new variables
Consider .generate v = p&q
We want this to be:
- true when p&q is true
- false when p&q is false
- indeterminate when p&q is indeterminate
41 2. Generating new variables
Consider .generate v = p&q
Stata suggests two commands:
.generate v = 0 if !(p&q)
.replace v = 1 if p&q & !mi(p,q)
42 2. Generating new variables
Consider .generate v = p&q
Stata suggests two commands; alternatively (when p and q are indicator variables):
.generate v = 0 if p==0 | q==0
.replace v = 1 if p==1 & q==1
43 2. Generating new variables
Consider .generate v = p&q
Stata can manage with one command:
.generate v = p&q if !(p&q) | !mi(p,q)
44 2. Generating new variables
Consider .generate v = p&q
Stata can manage with one command; alternatively:
.generate v = cond(p,cond(q,1,0,.),0,cond(q,.,0))
Here cond(p,T,F,.) is the four-argument form, giving separate results for true, false and missing p; the three-argument form cond(p,T,F) returns its first result when p is true or missing.
Hard to read, but systematic.
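To see that the nested cond expression yields the wanted three-valued result, it can be emulated in Python. The cond function below is a stand-in written for this illustration (None plays the part of Stata's "."); it follows the behaviour the slide relies on, where the fourth argument handles a missing condition and the three-argument form treats missing as true.

```python
MISSING = None      # model assumption: None stands in for Stata's "."
_UNSET = object()   # marks an omitted fourth argument

def cond(x, a, b, c=_UNSET):
    """a if x is true, b if x is false, c if x is missing.
    With only three arguments, a missing x returns a."""
    if x is MISSING:
        return a if c is _UNSET else c
    return a if x != 0 else b

def valid_and(p, q):
    # cond(p,cond(q,1,0,.),0,cond(q,.,0))
    return cond(p, cond(q, 1, 0, MISSING), 0, cond(q, MISSING, 0))

# Print the full three-valued table for "p and q".
for p in (1, 0, MISSING):
    for q in (1, 0, MISSING):
        print(p, q, valid_and(p, q))
```

The result is 1 only when both operands are known to be true, 0 whenever either is known to be false, and missing otherwise.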
56 2. Generating new variables
Consider .generate v = p&q
Users should be able to write that expression, not tangle with the complexities of the recent slides. And in real life, p and q are themselves likely to be expressions (logical or relational), so Stata's current missing-data tests become even hairier.
60 Two solutions?
- Use my program validly, to validly specify recodes and conditionals
- Persuade Stata to validly specify recodes and conditionals
62 Validly
validly is a conventional Stata program but, having an adverbial name, appears to be a modifier of other commands.
validly generate has the functionality of generate but, in contrast to generate, gives the correct result when missing data are encountered within relational or logical expressions. Likewise validly replace. Likewise validly assert.
validly Stata_conditional_command executes the specified conditional_command but, in contrast to Stata's execution of the unwrapped command, gives the correct result when missing data are encountered within relational or logical expressions in the condition.
63 Validly - syntax
validly {generate|gen|replace} {newvar|varname} = exp [if] [in] [, options]
As generate or replace, but using valid functional forms for the expression(s). validly generate requires newvar; replace requires the varname of an existing variable.
validly assert exp [if] [in] [, options]
As assert, but using valid functional forms for the expression(s).
For other non-assignment conditional commands which use if, validly can act as a wrapper:
validly any_conditional_command [,, validly_options]
or (the same syntax, expressed differently)
validly command [parameters] [if] [in] [weight] [, command_options] [,, validly_options]
validly replaces the conditional expression by a valid functional form, and executes the wrapped command (validly's options appear after double commas, to differentiate them from the command's options).
65 Validly - strategy
validly takes the relevant expression(s), parses the relational and logical operators into RPN (reverse Polish notation) form, and from that builds, by iterative insertion into a macro, complex cond expression(s), as in our earlier example, which can be executed.
For: it works. Nested conds were the only replicable strategy I could devise to handle missing data, given Stata's conventions.
Against: the rebarbative results are computationally expensive.
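The strategy just described can be sketched in Python. This is not validly's source code: it is a minimal illustration that handles only the logical operators !, & and | over named operands, using a shunting-yard pass to reach RPN and then a stack to assemble nested cond text. The function names and the cond templates for "!" and "|" are assumptions; the template for "&" is the one shown on the earlier slide.

```python
import re

# Operator precedence for this sketch: "!" binds tightest, then "&", then "|".
PRECEDENCE = {"!": 3, "&": 2, "|": 1}

def to_rpn(expr):
    """Convert an infix logical expression to RPN (shunting-yard)."""
    tokens = re.findall(r"[A-Za-z_]\w*|[!&|()]", expr)
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # "!" is a prefix operator, so it never pops earlier operators.
            while (stack and stack[-1] in PRECEDENCE and tok != "!"
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        else:
            output.append(tok)  # an operand name
    while stack:
        output.append(stack.pop())
    return output

def rpn_to_cond(rpn):
    """Build a nested cond() expression, as text, from RPN tokens."""
    stack = []
    for tok in rpn:
        if tok == "!":
            x = stack.pop()
            stack.append(f"cond({x},0,1,.)")
        elif tok == "&":
            y, x = stack.pop(), stack.pop()
            stack.append(f"cond({x},cond({y},1,0,.),0,cond({y},.,0))")
        elif tok == "|":
            y, x = stack.pop(), stack.pop()
            stack.append(f"cond({x},1,cond({y},1,0,.),cond({y},1,.,.))")
        else:
            stack.append(tok)
    return stack[0]

print(rpn_to_cond(to_rpn("p & q")))
# cond(p,cond(q,1,0,.),0,cond(q,.,0))
```

Because each operator wraps its operands in further cond calls, the generated text grows quickly with the size of the expression, which is consistent with the cost noted on the slide.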
66 Validly - examples
71 Proposal
- Stata's relational operators should behave as do Stata's algebraic operators with regard to missing data
- Stata's logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, missing should not count as true.)
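A rough Python model of the proposal may help fix ideas. It assumes that "the expected rules" are those of ordinary three-valued logic, with None standing in for missing; the function names are illustrative and nothing here describes an actual Stata feature.

```python
MISSING = None  # stands in for Stata's "."

def gt(a, b):
    """A relation that propagates missing, as Stata's arithmetic already does."""
    if a is MISSING or b is MISSING:
        return MISSING
    return a > b

def and3(p, q):
    """Three-valued "and": false as soon as either side is known false."""
    if p is False or q is False:
        return False
    if p is MISSING or q is MISSING:
        return MISSING
    return True

def or3(p, q):
    """Three-valued "or": true as soon as either side is known true."""
    if p is True or q is True:
        return True
    if p is MISSING or q is MISSING:
        return MISSING
    return False

def selected(condition):
    """An "if" qualifier selects only cases known to be true."""
    return condition is True

print(or3(gt(4, 3), gt(MISSING, 2)))   # True
print(and3(gt(4, 3), gt(MISSING, 2)))  # None (indeterminate)
print(selected(gt(MISSING, 50)))       # False
```

With these rules the expression can be written as it is meant, with no explicit missing-data guards.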
72 Arguments against logic
- It is complex/confusing
- Generates notable inconsistencies
- Requires several rules
73 Arguments against logic
1 - Complex/confusing?
"All these statements can be made to work, but they are complicated and yield some surprising results (such as the drop/keep inconsistency shown above). We feel that most users, including ourselves, would find this more confusing than the system currently in place."
Gould, W. (2003) Logical expressions and missing values. www.stata.com/support/faqs/data/values.html
76 Arguments against logic
1 - Complex/confusing? The choice, remember, is between (on the current coding) having to write something like
.generate v = p|q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))
or (on the proposed coding) being able to write
.generate v = p|q
It is not entirely self-evident that the shorter is the more confusing.
77 Arguments against logic
2 - Inconsistencies? Changing to a three-valued
logic might make some comparisons more what one
might expect but will introduce inconsistencies
elsewhere.
78 Arguments against logic
2 - Inconsistencies? The only example adduced (trailed by Gould as a notable inconsistency) is that, under the proposed rules, a command such as
keep if age>65
is no longer the same as
drop if age<=65
In the current system, missing values are treated as positive infinity. Once this fact is absorbed, drop and keep statements work as one would expect.
But if a sample has three groups (those known to be over 65, those 65 or younger, and those for whom we lack age information), it is surely self-evident that dropping one group should not be the same as keeping one other.
Note: keep if age>65 would only work as one would expect if one should expect that those in the sample lacking age information properly belong in the group of the retired.
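The three-group point can be made concrete with a toy sample in Python. The ages are invented for illustration, None stands in for a missing age, and the "proposed" functions assume that an indeterminate comparison never satisfies an if condition.

```python
MISSING = None  # stands in for Stata's "."
ages = [70, 40, MISSING, 66, 65]  # hypothetical sample

# Current convention: a missing age behaves as positive infinity.
def gt_current(a, b):
    return True if a is MISSING else a > b

def le_current(a, b):
    return False if a is MISSING else a <= b

# Proposed convention: a comparison with missing is indeterminate.
def gt_proposed(a, b):
    return MISSING if a is MISSING else a > b

def le_proposed(a, b):
    return MISSING if a is MISSING else a <= b

def keep_if(rows, test):
    """Keep only rows for which the condition is known to be true."""
    return [r for r in rows if test(r) is True]

def drop_if(rows, test):
    """Drop only rows for which the condition is known to be true."""
    return [r for r in rows if test(r) is not True]

print(keep_if(ages, lambda a: gt_current(a, 65)))   # [70, None, 66]
print(drop_if(ages, lambda a: le_current(a, 65)))   # [70, None, 66]
print(keep_if(ages, lambda a: gt_proposed(a, 65)))  # [70, 66]
print(drop_if(ages, lambda a: le_proposed(a, 65)))  # [70, None, 66]
```

Under the current convention the two commands agree only because the case with unknown age is silently counted among the over-65s; under the proposed rules they differ by exactly that case.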
81 Arguments against logic
3 - Several rules? Under the proposal you would have to remember several rules for how missing values were handled in different situations instead of just one rule.
My proposal is that we adopt one rule: missing values are treated as missing.
In the current system, missing values are sometimes missing (as in algebra), sometimes invisible (as in max), sometimes infinity (sometimes even, when contrasting .a and .b, distinct infinities), and sometimes true.
84 Proposal reiterated
- Stata's relational operators should behave as do Stata's algebraic operators with regard to missing data
- Stata's logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, missing should not count as true.)
85 End of Polemic
86 How many of these do what they seem to do?
i if age>50
ii if unemployed
iii if a==2 b==2
iv if a!=2 b!=2
v if !(a==2 b==2) & !mi(a,b)
vi if age>50 & !mi(age)
vii if log(assets)>2 & !mi(assets)
viii if a==2 b==c
ix if (a!=2 b!=2) & !mi(a,b)
x if assets/(inc - expend) > 100 & !mi(assets,inc,expend)
xi .gen v = a==2 b==2
xii .gen v = (a==2 b==2) & !mi(a,b)
89 e.g.
To handle
.generate v = (a>b) & (c>d)
we need something along the lines of
.generate p = a>b if !mi(a,b)
.generate q = c>d if !mi(c,d)
.generate v = 0 if !(p&q)
.replace v = 1 if (p&q) & !mi(p,q)
91 Footnote on max
max(x1,x2,...,xn)
Description: returns the maximum value of x1, x2, ..., xn. Unless all arguments are missing, missing values are ignored.
max(2,10,.,7) = 10
max(.,.,.) = .
92 Footnote on max
Suppose you wished to find, within a marriage, the higher income (say IncF and IncM for female and male). You might expect
.generate Highest = max(IncM,IncF)
would do the trick? But for women whose spouses (perhaps bashful tycoons or shamefaced paupers) refused to answer, we get the income of the woman as the purportedly known higher individual income.
The analyst should regard the outcome for such observations as strictly unknown; else you could have true high-spending households whose highest income might be very low (these bashful tycoons), distorting any subsequent analyses.
If the values of some variables in a set are unknown, it is misleading to report the maximum of the known as the known maximum.
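The contrast can be sketched in Python, with None standing in for a missing income. The first function imitates the documented behaviour of max quoted in the footnote; the second is a hypothetical alternative that treats the maximum as unknown whenever any argument is unknown. The income figure is invented.

```python
MISSING = None  # stands in for Stata's "."

def max_ignoring_missing(*xs):
    """Imitates the documented max(): missing values are ignored
    unless every argument is missing."""
    known = [x for x in xs if x is not MISSING]
    return max(known) if known else MISSING

def max_propagating_missing(*xs):
    """Hypothetical alternative: unknown if any argument is unknown."""
    return MISSING if any(x is MISSING for x in xs) else max(xs)

inc_f, inc_m = 18000, MISSING  # hypothetical: the husband declined to answer

print(max_ignoring_missing(2, 10, MISSING, 7))  # 10
print(max_ignoring_missing(inc_m, inc_f))       # 18000, reported as if known
print(max_propagating_missing(inc_m, inc_f))    # None: strictly unknown
```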
97 Transition?
One consequent loss of functionality, the loss of the ability to test for specific missing-data codes (as in v==.a), could readily be handled by the introduction of a function mv(v), which would take one variable as its argument and return a value in the range 1-27 corresponding to the extended missing-data codes, and zero otherwise.
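A possible shape for such a function, sketched in Python. mv does not exist in Stata; the encoding of missing codes as the strings ".", ".a", ..., ".z" and the particular numbering (. as 1, .a as 2, up to .z as 27) are assumptions made only for this illustration, since the slide fixes just the range.

```python
import string

# Hypothetical encoding: the 27 missing codes as strings ".", ".a", ..., ".z".
MISSING_CODES = ["."] + ["." + letter for letter in string.ascii_lowercase]

def mv(v):
    """Return 1-27 for a missing code and 0 for any other value.
    The ordering (. = 1, .a = 2, ..., .z = 27) is an assumption."""
    if isinstance(v, str) and v in MISSING_CODES:
        return MISSING_CODES.index(v) + 1
    return 0

print(mv(".a"))  # 2
print(mv(".z"))  # 27
print(mv(42))    # 0
```

Under this numbering a test such as v==.a would be written mv(v)==2, leaving the relational operators free to treat every missing code as simply missing.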
98 Transition?
The loss of the ability to test for specific missing-data codes, as in v==.a, could alternatively be handled as validly does: by scanning for explicit missing values and parsing them separately.