Upload: silvester-davis

Post on 16-Dec-2015

Stata’s mishandling of missing data

Missing data in logical and relational expressions:

a problem and two solutions

Stata’s conventions:

a. Relations (such as ‘>’) treat missing data as ‘positive infinity’; so relations are never undefined, they are simply true or false.

b. Logical operators (and ‘…if’) treat missing values as ‘true’; so logical expressions are never undefined, they are simply true or false.

(Strictly, (b) is a separable special case of a more general rule:

c. Logical operators treat all non-zero values as ‘true’.

Rule (c), when detached from (a) and (b), may be eccentric, but it is not pernicious.)
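The three conventions can be checked directly in a do-file. This is only a minimal sketch: the particular numbers are arbitrary illustrations and do not come from the slides.

```stata
* (a) relations: a missing value ranks above every number
display (. > 1000000)    // 1
display (1000000 > .)    // 0
display (. == .)         // 1

* (b) logical operators: a missing value counts as true
display (. & 1)          // 1
display !(.)             // 0

* (c) the general rule: any non-zero value counts as true
display (-3 & 0.5)       // 1
display !2               // 0
```

None of these expressions ever returns missing, which is the point of conventions (a) and (b).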

First criticism:

Commands should do what they seem to do.

Response? Users should understand the conventions; it is then simple to test for missing data as appropriate. No big deal.

Responding, second criticism:

The proffered prophylactic strategy does not scale well. Messy and error-prone.

1. Normal truth table (‘.’ marks a missing, i.e. indeterminate, value)

p    q    !p   p&q  p|q
1    1    0    1    1
1    0    0    0    1
1    .    0    .    1
0    1    1    0    1
0    0    1    0    0
0    .    1    0    .
.    1    .    .    1
.    0    .    0    .
.    .    .    .    .

1. Normal truth table – in Stata (* marks a result that departs from the normal table)

p    q    !p   p&q  p|q
1    1    0    1    1
1    0    0    0    1
1    .    0    1*   1
0    1    1    0    1
0    0    1    0    0
0    .    1    0    1*
.    1    0*   1*   1
.    0    0*   0    1*
.    .    0*   1*   1*

1. Normal relation (‘.’ marks a missing value)

p    q    p+q  p>q
1    1    2    0
1    0    1    1
1    .    .    .
0    1    1    0
0    0    0    0
0    .    .    .
.    1    .    .
.    0    .    .
.    .    .    .

1. Normal relation – in Stata (* marks a result that departs from the normal table)

p    q    p+q  p>q
1    1    2    0
1    0    1    1
1    .    .    0*
0    1    1    0
0    0    0    0
0    .    .    0*
.    1    .    1*
.    0    .    1*
.    .    .    0*
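The ‘in Stata’ columns of the truth and relation tables can be reproduced by building the nine combinations of p and q and letting Stata evaluate each expression. A minimal sketch follows; the variable names (not_p, p_and_q and so on) are illustrative choices, not part of the original slides.

```stata
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

generate not_p    = !p
generate p_and_q  = p & q
generate p_or_q   = p | q
generate p_plus_q = p + q
generate p_gt_q   = p > q

list, separator(3)

* spot checks on cells where a missing value is involved
assert not_p == 0      if mi(p)
assert p_and_q == 1    if p == 1 & mi(q)
assert p_or_q == 1     if mi(p) | mi(q)
assert mi(p_plus_q)    if mi(p, q)
assert p_gt_q == 1     if mi(p) & !mi(q)
assert p_gt_q == 0     if mi(q)
```

Only the arithmetic column (p+q) propagates missing values; the logical and relational columns always come back as 0 or 1.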

2. Test for missing data?

… if (a>b)

… if (a>b) & !mi(a,b)

… if (a>b|c>d)

… if (a>b|c>d) & !mi(a,b,c,d)

With a=4, b=3, c missing and d=2, that blanket test gives the wrong answer:

… if (4>3|.>2) & !mi(4,3,.,2)    F

(false, although 4>3 is true and so the disjunction is true whatever c may be). Each relation needs its own test:

… if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))

but messy
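The difference between the three forms of the condition can be seen by counting the observations each one selects. This is a small sketch on made-up data: the four rows below are illustrative, with the first row matching the 4, 3, ., 2 example.

```stata
clear
input a b c d
4 3 . 2
1 3 . 2
4 3 5 2
1 . 3 2
end

* unguarded: every row is selected, including row 2, which is indeterminate
count if (a>b | c>d)
assert r(N) == 4

* blanket guard: rows 1 and 4 are lost although one disjunct is known to be true
count if (a>b | c>d) & !mi(a,b,c,d)
assert r(N) == 1

* one guard per relation: rows 1, 3 and 4, the rows where the condition is known to be true
count if ((a>b) & !mi(a,b)) | ((c>d) & !mi(c,d))
assert r(N) == 3
```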

2. Generating new variables

even messier

Consider:
. generate v = p&q

We want v to be:
true when p&q is true
false when p&q is false
indeterminate when p&q is indeterminate

Stata suggests two commands:
. generate v = 0 if !(p&q)
. replace v = 1 if p&q & !mi(p,q)

or alternatively (when p and q are indicator variables):
. generate v = 0 if p==0 | q==0
. replace v = 1 if p==1 & q==1

Stata can manage with one command:
. generate v = p&q if !(p&q)|!mi(p,q)

or alternatively:
. generate v = cond(p,cond(q,1,0,.),0,cond(q,.,0))

(The four-argument form cond(p,T,F,M) returns T when p is true, F when p is false and M when p is missing; the three-argument form cond(p,T,F) returns T when p is true or missing, and F otherwise.)

Hard to ‘read’, but systematic.

Users should be able to write the original expression, not tangle with the complexities of the recent slides.

[And in real life, ‘p’ and ‘q’ are themselves likely to be expressions (logical or relational), so Stata’s current missing-data tests become even hairier.]
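The formulations of v = p&q discussed on these slides can be compared side by side on the nine combinations of p and q. The following is a sketch only; the variable names v1, v2, v3 and naive are illustrative.

```stata
clear
input p q
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* two-command version
generate v1 = 0 if !(p&q)
replace  v1 = 1 if p&q & !mi(p,q)

* one command, guarded by an if qualifier
generate v2 = p&q if !(p&q) | !mi(p,q)

* one command, nested cond()
generate v3 = cond(p, cond(q,1,0,.), 0, cond(q,.,0))

* the expression one would like to be able to write
generate naive = p&q

* the three guarded versions agree (missing == missing is true in Stata)
assert v1 == v2 & v2 == v3

list, separator(3)
```

In the listing, the guarded versions are missing in the three rows where p&q is indeterminate, whereas naive is 1 there.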

Two solutions?

Use my program validly, to validly specify recodes and conditionals

Persuade Stata to validly specify recodes and conditionals

Validly

validly is a conventional Stata program but, having an adverbial name, it appears to be a modifier of other commands.

validly generate has the functionality of generate, but, in contrast to generate, gives the correct result when missing data are encountered within relational or logical expressions. Likewise validly replace. Likewise validly assert.

validly Stata_conditional_command executes the specified conditional_command but, in contrast to Stata’s execution of the ‘unwrapped’ command, gives the correct result when missing data are encountered within relational or logical expressions in the condition.

Validly - syntax

validly [generate|gen|replace] newvar | varname = exp [if] [in] [,options]

As generate or replace, but using valid functional forms for the expression(s). validly generate requires newvar; validly replace requires the varname of an existing variable.

validly assert exp [if] [in] [,options]

As assert, but using valid functional forms for the expression(s).

For other non-assignment conditional commands which use if, validly can act as a ‘wrapper’:

validly any_conditional_command [,,validly_options]

or (the same syntax, expressed differently)

validly command parameters if [in] [weight] [,command_options] [,,validly_options]

validly replaces the conditional expression by a valid functional form, and executes the ‘wrapped’ command (validly’s options appear after double commas, to differentiate them from the command’s options).

Validly - strategy

validly takes the relevant expression(s), parses the relational and logical operators into RPN (reverse Polish notation) form, and from that builds, by iterative insertion into a macro, complex cond expression(s) [as in our earlier example] which can be executed.

For: it works; nested ‘conds’ were the only replicable strategy I could devise to handle missing data, given Stata’s conventions.

Against: the rebarbative results are computationally expensive.
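The expression that validly actually generates is not shown in the slides. The following hand-written sketch only illustrates the general idea of assembling a nested cond() expression by inserting sub-expressions into a macro, here for (a>b) | (c>d). The macro names P, Q and OR, the cond() templates and the data are all assumptions made for this illustration.

```stata
clear
input a b c d
4 3 . 2
1 3 . 2
4 3 5 2
1 . 3 2
1 3 1 2
end

* three-valued relations: missing if either operand is missing
local P "cond(mi(a,b), ., a>b)"
local Q "cond(mi(c,d), ., c>d)"

* three-valued OR, built by inserting P and Q into a cond() template:
*   P true -> 1;  P false -> Q;  P missing -> 1 if Q is true, else missing
local OR "cond(`P', 1, cond(`Q', 1, 0, .), cond(`Q', 1, ., .))"

display "`OR'"
generate v = `OR'
list

assert v == 1 in 1
assert mi(v)  in 2
assert v == 1 in 3
assert v == 1 in 4
assert v == 0 in 5
```

The displayed expression is already long for a single operator, which gives some feel for the computational cost noted under ‘Against’.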

Validly - examples

Proposal

Stata’s relational operators should behave as do Stata’s algebraic operators with regard to missing data.

Stata’s logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, ‘missing’ should not count as ‘true’).
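To make the first part of the proposal concrete, the sketch below contrasts an algebraic operator, which already propagates missing values, with the current behaviour of a relational operator, and then emulates the proposed behaviour with an if qualifier. The data and the variable names (total, gt_now, gt_prop) are illustrative assumptions.

```stata
clear
input a b
3 2
2 3
. 2
3 .
end

generate total   = a + b                   // algebraic: missing in, missing out
generate gt_now  = a > b                   // current: missing treated as +infinity
generate gt_prop = a > b if !mi(a, b)      // emulation of the proposed behaviour

list

assert mi(total)    if mi(a, b)
assert gt_now == 1  in 3
assert gt_now == 0  in 4
assert mi(gt_prop)  if mi(a, b)
```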

Arguments against ‘logic’

1. It is complex/confusing

2. Generates notable inconsistencies

3. Requires ‘several rules’

Arguments against ‘logic’

1 - Complex/confusing?

‘All these statements can be made to work, but they are complicated and yield some surprising results (such as the drop/keep inconsistency shown [above]). We feel that most users — including ourselves — would find this more confusing than the system currently in place.’

Gould, W (2003) “Logical expressions and missing values”

www.stata.com/support/faqs/data/values.html

The choice, remember, is between (on the current coding) having to write something like:

. generate v = p|q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))

or (on the proposed coding) being able to write:

. generate v = p|q

It is not self-evident that the shorter is ‘more confusing’.
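A quick check that the longer command does what is wanted: the sketch below enters the nine combinations of p and q together with the p|q column of the normal truth table (in a variable called expected, a name chosen for this illustration) and compares both versions against it.

```stata
clear
input p q expected
1 1 1
1 0 1
1 . 1
0 1 1
0 0 0
0 . .
. 1 1
. 0 .
. . .
end

* the guarded command required under the current coding
generate v = p|q if !mi(p,q) | (p & !mi(p)) | (q & !mi(q))
assert v == expected

* the short command, as Stata currently evaluates it
generate naive = p|q
count if naive != expected
assert r(N) == 3
```

The three disagreements are the rows in which p|q is indeterminate; the short command reports 1 for each of them.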

Arguments against ‘logic’

2 - Inconsistencies?

‘Changing to a three-valued logic might make some comparisons more what one might expect but will introduce inconsistencies elsewhere’.

The only example adduced (trailed by Gould as a ‘notable inconsistency’) is that, under the proposed rules, a command such as

keep if age>65

is no longer the same as

drop if age<=65

‘In the current system, … missing values are … treated as positive infinity. Once this fact is absorbed … drop and keep statements work as one would expect.’

But if a sample has three groups (those known to be over 65, those 65 or younger, and those for whom we lack age information), it is surely self-evident that dropping one group should not be the same as keeping one other.

Note: keep if age>65 would only work as one would expect if one should expect that those in the sample lacking age information properly belong in the group of the retired.
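The current equivalence of the two commands, and the price paid for it, can be seen by counting which observations each would retain. A minimal sketch on five made-up ages, one of them missing:

```stata
clear
input age
70
80
40
65
.
end

* observations that keep if age>65 would retain
count if age > 65
assert r(N) == 3

* observations that drop if age<=65 would leave behind: the same three
count if !(age <= 65)
assert r(N) == 3

* observations actually known to be over 65
count if age > 65 & !mi(age)
assert r(N) == 2
```

The two commands agree only because both retain the observation whose age is unknown.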

Arguments against ‘logic’

3 - Several rules?

under the proposal ‘you would have to remember several rules for how missing values were handled in different situations instead of just one rule’

My proposal is that we adopt one rule: ‘missing values are treated as missing’.

In the current system, missing values are sometimes missing (as in algebra), sometimes invisible (as in max), sometimes infinity (sometimes even, when contrasting .a and .b, distinct infinities), and sometimes ‘true’.

Proposal reiterated

Stata’s relational operators should behave as do Stata’s algebraic operators with regard to missing data

Stata’s logical operators should follow the expected rules when encountering missing data. (Further, when evaluating the truth of an expression, ‘missing’ should not count as ‘true’).

End of Polemic

How many of these do what they seem to do?

i …if age>50
ii …if unemployed
iii …if a==2 & b==2
iv …if a!=2 & b!=2
v …if !(a==2 & b==2) & !mi(a,b)
vi …if age>50 & !mi(age)
vii …if log(assets)>2 & !mi(assets)
viii …if a==2 & b==c
ix …if (a!=2 | b!=2) & !mi(a,b)
x …if assets/(inc - expend) > 100 & !mi(assets,inc,expend)
xi .gen v = a==2 | b==2
xii .gen v = (a==2 | b==2) & !mi( a, b)

E.g., to handle

. generate v = (a>b) & (c>d)

we need something along the lines of:

. generate p = a>b if !mi(a,b)
. generate q = c>d if !mi(c,d)
. generate v = 0 if !(p&q)
. replace v = 1 if (p&q) & !mi(p,q)

Footnote on ‘max’

max(x1,x2,...,xn)
Description: returns the maximum value of x1, x2, ..., xn. Unless all arguments are missing, missing values are ignored.

max(2,10,.,7) = 10
max(.,.,.) = .

Suppose you wished to find, within a marriage, the higher income (say IncF and IncM for female and male); you might expect that

. generate Highest = max(IncM,IncF)

would do the trick?

But for women whose spouses (perhaps bashful tycoons or shamefaced paupers) refused to answer, we get the income of the woman as the purportedly known higher individual income. The analyst should regard the outcome for such observations as strictly unknown; otherwise you could have truly high-spending households whose ‘highest income’ might be very low (these bashful tycoons), distorting any subsequent analyses.

If the values of some variables in a set are unknown, it is misleading to report the maximum of the known as the known maximum.
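The behaviour of max() and one way of guarding against it can be sketched as follows. The income figures and the variable name HighestKnown are invented for illustration; only IncM, IncF and Highest come from the slides.

```stata
display max(2, 10, ., 7)     // 10
display max(., ., .)         // .

clear
input IncM IncF
50000 30000
.     20000
40000 .
end

* max() silently ignores the missing income
generate Highest = max(IncM, IncF)

* guarded version: unknown unless both incomes are known
generate HighestKnown = max(IncM, IncF) if !mi(IncM, IncF)

list

assert Highest == 20000   in 2
assert Highest == 40000   in 3
assert mi(HighestKnown)   in 2/3
```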

Transition?

One consequent loss of functionality would be the loss of the ability to test for specific missing data codes, as in ‘v==.a’.

This could readily be handled by the introduction of a function mv(v), which would take one variable as its argument and return a value in the range 1‑27 corresponding to the extended missing data codes, and zero otherwise.

Or, as validly does, Stata could scan for explicit ‘missing’ and parse it separately.
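No function mv() exists in Stata; it is only proposed here. What it might return can be emulated with existing commands, as in the sketch below. The slides give only the range 1‑27, so the particular mapping used (1 for ., 2 for .a, up to 27 for .z) is an assumption, as is the variable name mv_v.

```stata
clear
input v
5
.
.a
.z
end

* 0 for non-missing values, 1 for the system missing value
generate byte mv_v = 0
replace mv_v = 1 if v == .

* 2..27 for the extended missing values .a to .z
local i = 1
foreach c in `c(alpha)' {
    local ++i
    replace mv_v = `i' if v == .`c'
}

list

assert mv_v == 0  in 1
assert mv_v == 1  in 2
assert mv_v == 2  in 3
assert mv_v == 27 in 4
```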