Stata's mishandling of missing data

Slides: 101
Provided by: MacD153
1
Stata's mishandling of missing data
  • Missing data in logical and relational
    expressions: a problem and two solutions

5
Stata's conventions
a. Relations (such as >) treat missing data as
positive infinity, so relations are never
undefined; they are simply true or false.
b. Logical operators (and if) treat missing
values as true, so logical expressions
are never undefined; they are simply true or
false.
(Strictly, (b) is an isolable subset of a more
general rule c: logical operators treat all
non-zero values as true. But rule (c), when
detached from (a) and (b), may be eccentric but
is not pernicious.)
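A minimal sketch of conventions (a) to (c), written in Python rather than Stata so that it can be run anywhere. Modelling missing as positive infinity is an assumption made here purely for illustration, and the names MISSING, stata_gt and stata_true are hypothetical.

```python
import math

# Assumption for illustration: model Stata's missing value "." as +infinity,
# which is how convention (a) says relations treat it.
MISSING = math.inf

def stata_gt(a, b):
    """Convention (a): a > b is always true or false, never undefined."""
    return a > b

def stata_true(x):
    """Conventions (b)/(c): every non-zero value, missing included, is true."""
    return x != 0

ages = [70, 40, MISSING]
print([stata_gt(age, 65) for age in ages])  # the missing age passes the test
print(stata_true(MISSING))                  # missing counts as true
```

Under these rules an observation with an unknown age is selected by a test for age above 65, which is the behaviour the rest of the talk objects to.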
8
First criticism: Commands should do what they
seem to do.
Response? Users should understand the
conventions; it is then simple to test for
missing data as appropriate. No big deal.
Responding, second criticism: The proffered
prophylactic strategy does not scale well.
Messy and error-prone.
9
1. normal Truth table
20
1. normal Truth table in Stata
22
1. normal relation
25
1. normal relation in Stata
27
First criticism: Commands should do what they
seem to do.
Response? Users should understand the
conventions; it is then simple to test for
missing data as appropriate. No big deal.
Responding, second criticism: The proffered
prophylactic strategy does not scale well.
Messy and error-prone.
32
2. Test for missing data?
if (a>b)
if (a>b) & !mi(a,b)
if (4>3 | .>2)
if (4>3 | .>2) & !mi(4,3,.,2)
F
36
2. Test for missing data?
if (a>b)
if (a>b) & !mi(a,b)
if (a>b | c>d)
if (a>b | c>d) & !mi(a,b,c,d)
if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
but messy
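The difference between the blanket test and the clause-by-clause test can be checked with a small Python model. This is a sketch under stated assumptions: None stands for a missing value, the lost operator between the two relations is read as "|", and mi, stata_gt, blanket_test and per_clause_test are hypothetical stand-ins, not Stata code.

```python
def mi(*values):
    """Python stand-in for Stata's mi(): true if any argument is missing (None)."""
    return any(v is None for v in values)

def stata_gt(a, b):
    """a > b with missing treated as positive infinity, as in Stata."""
    inf = float("inf")
    return (inf if a is None else a) > (inf if b is None else b)

def blanket_test(a, b, c, d):
    # models: if (a>b | c>d) & !mi(a,b,c,d)
    return (stata_gt(a, b) or stata_gt(c, d)) and not mi(a, b, c, d)

def per_clause_test(a, b, c, d):
    # models: if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
    return (stata_gt(a, b) and not mi(a, b)) or (stata_gt(c, d) and not mi(c, d))

print(blanket_test(4, 3, None, 2))     # False, although 4>3 is known to be true
print(per_clause_test(4, 3, None, 2))  # True
```

The blanket test rejects the observation because one operand is missing, even though the known clause already settles the disjunction; only the longer form gets it right.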
37
2. Generating new variables
even messier
40
2. Generating new variables
Consider
. generate v = p & q
We want this to be
  true when p & q is true
  false when p & q is false
  indeterminate when p & q is indeterminate
41
2. Generating new variables
Consider
. generate v = p & q
Stata suggests two commands:
. generate v = 0 if !(p & q)
. replace v = 1 if p & q & !mi(p,q)
42
2. Generating new variables
Consider
. generate v = p & q
Stata suggests two commands; alternatively:
. generate v = 0 if p==0 | q==0
. replace v = 1 if p==1 & q==1
(when p and q are indicator variables)
43
2. Generating new variables
Consider
. generate v = p & q
Stata can manage with one command:
. generate v = p & q if !(p & q) | !mi(p,q)
44
2. Generating new variables
Consider
. generate v = p & q
Stata can manage with one command; alternatively:
. generate v = cond(p,cond(q,1,0,.),0,cond(q,.,0))
cond(p,T,F,.) cond(p,T or ., F)
54
2. Generating new variables
Consider
. generate v = p & q
Stata can manage with one command:
. generate v = cond(p,cond(q,1,0,.),0,cond(q,.,0))
Hard to read, but systematic
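To see that the nested cond() expression yields the wanted three-valued "and", it can be mirrored in Python. This is only a sketch: None stands for missing, and the cond function below is a hypothetical re-implementation of the behaviour of Stata's cond() as used on the slide (four arguments give a separate result for a missing condition; three arguments treat missing as true).

```python
_NO_ARG = object()  # marks "no fourth argument supplied"

def cond(x, a, b, c=_NO_ARG):
    """Model of Stata's cond(x,a,b[,c]); None plays the role of missing."""
    if x is None:
        # Three-argument form: missing is non-zero, so it counts as true.
        return a if c is _NO_ARG else c
    return a if x != 0 else b

def valid_and(p, q):
    # Mirrors: cond(p,cond(q,1,0,.),0,cond(q,.,0))
    return cond(p, cond(q, 1, 0, None), 0, cond(q, None, 0))

for p in (1, 0, None):
    for q in (1, 0, None):
        print(p, q, valid_and(p, q))
```

The result is 0 whenever either operand is known to be false, 1 only when both are known to be true, and missing otherwise.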
56
2. Generating new variables
Consider
. generate v = p & q
Users should be able to write that expression,
not tangle with the complexities of the recent
slides. And in real life, p and q are themselves
likely to be expressions (logical or relational),
so Stata's current missing-data tests become even
hairier.
60
Two solutions?
  • Use my program validly, to validly specify
    recodes and conditionals
  • Persuade Stata to validly specify recodes and
    conditionals

62
Validly
validly is a conventional Stata program but,
having an adverbial name, appears to be a
modifier of other commands.
validly generate has the functionality of
generate but, in contrast to generate, gives the
correct result when missing data are encountered
within relational or logical expressions.
Likewise validly replace.
Likewise validly assert.
validly Stata_conditional_command executes the
specified conditional_command but, in contrast to
Stata's execution of the unwrapped command, gives
the correct result when missing data are
encountered within relational or logical
expressions in the condition.
63
Validly - syntax
validly {generate|gen|replace} {newvar|varname} = exp [if] [in] [, options]
As generate or replace, but using valid functional
forms for the expression(s). validly generate
requires newvar; replace requires the varname of
an existing variable.
validly assert exp [if] [in] [, options]
As assert, but using valid functional forms for
the expression(s).
For other non-assignment conditional commands which
use if, validly can act as a wrapper:
validly any_conditional_command [,, validly_options]
or (the same syntax, expressed differently)
validly command parameters [if] [in] [weight] [, command_options] [,, validly_options]
validly replaces the conditional expression by a
valid functional form, and executes the wrapped
command (validly's options appear after double
commas, to differentiate them from the command's
options).
65
Validly - strategy
validly takes the relevant expression(s), parses
the relational and logical operators into RPN
form, and from that builds, by iterative
insertion into a macro, complex cond()
expression(s), as in our earlier example, which
can be executed.
For: it works; nested cond()s were the only
replicable strategy I could devise to handle
missing data, given Stata's conventions.
Against: the rebarbative results are
computationally expensive.
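The general idea of folding an RPN expression into nested cond() calls can be sketched in Python. This is not validly's actual code, which is a Stata program building a macro; it is an illustration under assumptions. The "&" template is the cond() expression shown earlier in the slides, while the "|" and "!" templates and all function names are hypothetical, and operands are taken to be plain variable names.

```python
# Three-valued operators written as Stata cond() calls.
# Only the AND template comes from the slides; OR and NOT are assumed analogues.
def and_cond(p, q):
    return f"cond({p},cond({q},1,0,.),0,cond({q},.,0))"

def or_cond(p, q):
    return f"cond({p},1,cond({q},1,0,.),cond({q},1,.,.))"

def not_cond(p):
    return f"cond({p},0,1,.)"

def rpn_to_cond(tokens):
    """Fold an RPN token list into one nested cond() expression string."""
    stack = []
    for tok in tokens:
        if tok == "!":
            stack.append(not_cond(stack.pop()))
        elif tok in ("&", "|"):
            right = stack.pop()
            left = stack.pop()
            build = and_cond if tok == "&" else or_cond
            stack.append(build(left, right))
        else:  # an operand, e.g. a variable name
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]

print(rpn_to_cond(["p", "q", "&"]))
print(rpn_to_cond(["p", "q", "r", "|", "&"]))   # p & (q | r)
```

Because each template repeats its right-hand operand, the generated string grows quickly when right-hand operands are themselves compound, which is consistent with the cost noted on the slide.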
66
Validly - examples
70
Two solutions?
  • Use my program validly, to validly specify
    recodes and conditionals
  • Persuade Stata to validly specify recodes and
    conditionals

71
Proposal
  • Stata's relational operators should behave as
    do Stata's algebraic operators with regard to
    missing data
  • Stata's logical operators should follow the
    expected rules when encountering missing data.
    (Further, when evaluating the truth of an
    expression, missing should not count as true.)

72
Arguments against logic
  • It is complex/confusing
  • Generates notable inconsistencies
  • Requires several rules

73
Arguments against logic
1 - Complex/confusing?
"All these statements can be made to work, but
they are complicated and yield some surprising
results (such as the drop/keep inconsistency
shown above). We feel that most users, including
ourselves, would find this more confusing than
the system currently in place."
Gould, W. (2003) Logical expressions and missing
values. www.stata.com/support/faqs/data/values.html
76
Arguments against logic
1 - Complex/confusing?
The choice, remember, is between (on the current
coding) having to write something like
. generate v = p | q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))
or (on the proposed coding) being able to write
. generate v = p | q
It is not entirely self-evident that the shorter
is more confusing.
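The guarded command can be compared with a three-valued "or" over all nine combinations of true, false and missing. The Python below is a sketch under assumptions: None stands for missing, the lost operator is read as "|", and truthy, guarded_or and kleene_or are hypothetical names introduced for the check.

```python
from itertools import product

def truthy(x):
    """Stata's rule: any non-zero value, missing included, counts as true."""
    return x is None or x != 0

def known(x):
    return x is not None

def guarded_or(p, q):
    # models: generate v = p | q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))
    guard = ((known(p) and known(q))
             or (truthy(p) and known(p))
             or (truthy(q) and known(q)))
    return int(truthy(p) or truthy(q)) if guard else None

def kleene_or(p, q):
    """Three-valued 'or': true if either is true, false if both are false."""
    if p == 1 or q == 1:
        return 1
    if p == 0 and q == 0:
        return 0
    return None

for p, q in product((1, 0, None), repeat=2):
    print(p, q, guarded_or(p, q), kleene_or(p, q))
```

The two columns agree on every row, so the long guard is doing no more than spelling out what the short expression would mean under the proposal.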
77
Arguments against logic
2 - Inconsistencies? Changing to a three-valued
logic might make some comparisons more what one
might expect but will introduce inconsistencies
elsewhere.
78
Arguments against logic
2 - Inconsistencies?
The only example adduced (trailed by Gould as a
notable inconsistency) is that, under the
proposed rules, a command such as
keep if age>65
is no longer the same as
drop if age<=65
In the current system, missing values are
treated as positive infinity. Once this fact is
absorbed, drop and keep statements work as one
would expect.
79
Arguments against logic (2)
2 - Inconsistencies?
The only example adduced (trailed by Gould as a
notable inconsistency) is that, under the
proposed rules, a command such as
keep if age>65
is no longer the same as
drop if age<=65
But if a sample has three groups (those known to
be over 65, those 65 or younger, and those for
whom we lack age information), it is surely
self-evident that dropping one group should not
be the same as keeping one other?
80
Arguments against logic
2 - Inconsistencies?
The only example adduced (trailed by Gould as a
notable inconsistency) is that, under the
proposed rules, a command such as
keep if age>65
is no longer the same as
drop if age<=65
Note: keep if age>65 would only work as one would
expect if one should expect that those in the
sample lacking age information properly belong in
the group of the retired.
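The three-group point can be made concrete with a toy sample. The Python below is an illustrative model, not Stata: None stands for a missing age, the sample values are invented for the example, and as_stata is a hypothetical helper that applies the missing-as-infinity convention.

```python
ages = [70, 40, None, 66, 65]  # invented sample; None is a missing age

def as_stata(x):
    """Stata's convention: a missing value compares as positive infinity."""
    return float("inf") if x is None else x

# Current conventions: both commands leave the same observations.
keep_stata = [a for a in ages if as_stata(a) > 65]           # keep if age>65
drop_stata = [a for a in ages if not (as_stata(a) <= 65)]    # drop if age<=65

# Three-valued reading: an unknown age satisfies neither condition.
keep_known = [a for a in ages if a is not None and a > 65]
drop_known = [a for a in ages if not (a is not None and a <= 65)]

print(keep_stata, drop_stata)   # identical, and both retain the missing age
print(keep_known, drop_known)   # differ only in the observation with no age
```

Under the three-valued reading the two commands differ by exactly the group whose age is unknown, which is the group the analyst has to decide about explicitly.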
82
Arguments against logic
3 - Several rules?
"under the proposal you would have to remember
several rules for how missing values were handled
in different situations instead of just one rule"
My proposal is that we adopt one rule: missing
values are treated as missing.
83
Arguments against logic
3 - Several rules?
"under the proposal you would have to remember
several rules for how missing values were handled
in different situations instead of just one rule"
In the current system, missing values are
sometimes missing (as in algebra), sometimes
invisible (as in max), sometimes infinity
(sometimes even, when contrasting .a and .b,
distinct infinities), and sometimes true.
84
Proposal reiterated
  • Stata's relational operators should behave as
    do Stata's algebraic operators with regard to
    missing data
  • Stata's logical operators should follow the
    expected rules when encountering missing data.
    (Further, when evaluating the truth of an
    expression, missing should not count as true.)

85
End of Polemic
86
How many of these do what they seem to do?
i    if age>50
ii   if unemployed
iii  if a==2 b==2
iv   if a!=2 b!=2
v    if !(a==2 b==2) & !mi(a,b)
vi   if age>50 & !mi(age)
vii  if log(assets)>2 & !mi(assets)
viii if a==2 b==c
ix   if (a!=2 b!=2) & !mi(a,b)
x    if assets/(inc - expend) > 100 & !mi(assets,inc,expend)
xi   . gen v = a==2 b==2
xii  . gen v = (a==2 b==2) & !mi(a,b)
89
e.g.
To handle
. generate v = (a>b) & (c>d)
we need something along the lines of
. generate p = a>b if !mi(a,b)
. generate q = c>d if !mi(c,d)
. generate v = 0 if !(p & q)
. replace v = 1 if (p & q) & !mi(p,q)
91
Footnote on max

max(x1,x2,...,xn)
Description: returns the maximum value of x1, x2,
..., xn. Unless all arguments are missing, missing
values are ignored.
max(2,10,.,7) = 10
max(.,.,.) = .
93
Footnote on max
Suppose you wished, within a marriage, the higher
income (say IncF and IncM for female and male);
you might expect
. generate Highest = max(IncM,IncF)
would do the trick?
But for women whose spouses (perhaps bashful
tycoons or shamefaced paupers) refused to answer,
we get the income of the woman as the purportedly
known higher individual income. The analyst
should regard the outcome for such observations
as strictly unknown; else you could have true
high-spending households whose highest income
might be very low (these bashful tycoons),
distorting any subsequent analyses.
94
Footnote on max
Suppose you wished, within a marriage, the higher
income (say IncF and IncM for female and male);
you might expect
. generate Highest = max(IncM,IncF)
would do the trick?
But for women whose spouses (perhaps bashful
tycoons or shamefaced paupers) refused to answer,
we get the income of the woman as the purportedly
known higher individual income. If the values of
some variables in a set are unknown, it is
misleading to report the maximum of the known as
the known maximum.
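The contrast between ignoring missing values and propagating them can be sketched in Python. None stands for missing; stata_max models the documented behaviour quoted above, strict_max is a hypothetical alternative in the spirit of the talk, and the income figure is invented for the example.

```python
def stata_max(*values):
    """Model of max() as documented: missing values are ignored
    unless every argument is missing."""
    known = [v for v in values if v is not None]
    return max(known) if known else None

def strict_max(*values):
    """Hypothetical alternative: the maximum is unknown if any value is unknown."""
    if any(v is None for v in values):
        return None
    return max(values)

print(stata_max(2, 10, None, 7))   # 10, as in the documentation example
inc_f, inc_m = 18000, None         # invented figures; the spouse refused to answer
print(stata_max(inc_m, inc_f))     # reports her income as the household's highest
print(strict_max(inc_m, inc_f))    # None: the highest income is not known
```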
97
Transition?
One consequent loss of functionality, the loss
of the ability to test for specific missing data
codes, as in v==.a, could readily be handled
by the introduction of a function mv(v) which
would take one variable as its argument, and
return a value in the range 1-27 corresponding to
the extended missing data codes, and zero
otherwise.
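A rough Python sketch of what such a function might return. The slide fixes only the range 1-27 and the zero for non-missing values; the particular ordering used here ("." as 1, ".a" as 2, through ".z" as 27) is an assumption, as is the representation of missing codes as strings.

```python
import string

# Assumed ordering, not specified on the slide: "." -> 1, ".a" -> 2, ..., ".z" -> 27.
MISSING_CODES = ["."] + ["." + letter for letter in string.ascii_lowercase]

def mv(value):
    """Return 1-27 for a missing-data code, and 0 for anything else."""
    if value in MISSING_CODES:
        return MISSING_CODES.index(value) + 1
    return 0

print(mv(".a"))   # position of the extended code .a under the assumed ordering
print(mv(3.5))    # 0: an ordinary value
```

Under that assumed ordering, a test such as v==.a would be written as mv(v)==2.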
98
Transition?
One consequent loss of functionality: the loss
of the ability to test for specific missing data
codes, as in v==.a. Or, as validly does, one
could scan for explicit missing codes and parse
them separately.