
Homework Assignment 10: Use and Abuse of Conditioning
36-402, Advanced Data Analysis, Spring 2011
SOLUTIONS

1. (a) Answer: Observe that occupation blocks all back-door paths between smoking and cancer, and that occupation is not a descendant of smoking. Let X denote smoking, Y denote cancer, and S denote occupation. Now, as described in lecture 23, section 3.1, we can estimate the causal effect of smoking on cancer using

Pr(Y | do(X = x)) = ∑_s Pr(Y | X = x, S = s) Pr(S = s)

where we would need to substitute consistent estimates of the conditional probability Pr(Y | X = x, S = s) and the marginal probability Pr(S = s). It is sometimes useful to short-cut having to estimate the marginal probability by using the law of large numbers. Notice that, with respect to the sum over s, Pr(Y | X = x, S = s) is just some function of s. If we have an IID sample of realizations of S, say s_1, s_2, ..., s_n, then the law of large numbers says that

(1/n) ∑_{i=1}^n f(s_i) → ∑_s f(s) Pr(S = s)

Therefore, with a large sample,

Pr(Y | do(X = x)) ≈ (1/n) ∑_{i=1}^n Pr(Y | X = x, S = s_i)

and this will still be (approximately) true when we use a consistent estimate of the conditional probability, rather than its true value. In addition to using occupation to block back-door paths, we could also use asbestos, or occupation and asbestos together. The advantage of occupation is that it is an ordinal variable with a small number of levels, so there should be less work involved in estimating Pr(Y | X = x, S = s) than if we had used a continuous control variable like asbestos.

(b) Answer: Let X refer to smoking, M refer to tar, and Y refer to cancer. Then M blocks all directed paths from X to Y; the only back-door path from X to M is blocked by the collider at damage; and X blocks the only back-door path from M to Y. Now, as described in lecture 23, section 3.2, we can estimate the causal effect of smoking on cancer using

Pr(Y | do(X = x)) = ∑_m Pr(M = m | X = x) ∑_{x′} Pr(Y | X = x′, M = m) Pr(X = x′)

Since the front-door variable tar is continuous, the sum over m is really an integral, as is the sum over x′,

Pr(Y | do(X = x)) = ∫ dm p(m | X = x) ∫ dx′ Pr(Y | X = x′, M = m) p(x′)

Once again (see the end of part a), we can simplify the expression by recognizing the inner integral as the expectation of Pr(Y |X = x′, M = m) as a function of x′, so the full equation reduces to

Pr(Y | do(X = x)) ≈ ∫ dm p(m | X = x) (1/n) ∑_{i=1}^n Pr(Y | X = x′_i, M = m)

(c) Answer: As described in lecture 23, section 3.3, a variable I is an instrument for identifying the effect of X on Y if we can find a set S of controls such that (i) I is dependent on X | S and (ii) every unblocked path from I to Y has an arrow pointing into X. Taking I to be occupation and S to be asbestos, we find that both conditions are met. In particular, smoking given asbestos gives information about occupation, so I is dependent on X | S. Also, the path from occupation to cancer through dental, teeth, and smoking does not have an arrow into smoking, but that path is blocked by the collider at teeth, so there is no problem. However, as detailed in the course notes, using the instrumental variable to estimate the causal effect of X on Y is not straightforward unless we know that the dependence of X on I is linear, as is that of Y on X. Thus we do not carry this analysis further.
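If one were nonetheless willing to assume those linear relationships, the usual two-stage least squares estimate could be computed. The sketch below is only an illustration of that technique, not part of the required answer: the column names occupation and asbestos are assumed to match the smoke data frame, cancer is assumed to be coded 0/1, and fitting a linear second stage to a binary outcome is itself an extra simplification.

# Two-stage least squares, valid only under the (unchecked) linearity assumptions.
# Stage 1: project smoking onto the instrument (occupation) and the control (asbestos).
stage1 = lm(smoking ~ ordered(occupation) + asbestos, data = smoke)
smoke$smoking.hat = fitted(stage1)
# Stage 2: regress cancer (assumed coded 0/1) on the projected smoking,
# again controlling for asbestos; the coefficient on smoking.hat is the IV estimate.
stage2 = lm(cancer ~ smoking.hat + asbestos, data = smoke)
coef(stage2)["smoking.hat"]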

(d) Answer: We will use the expression derived in part (a) to estimate Pr(Y | do(X = x)). Since cancer is a binary variable, we will use a model which is designed for that sort of response, and since we have no good reason to use a (generalized) linear model, we will use npcdens to estimate the conditional distribution, telling it that cancer is a categorical variable:

library(np)
pY.given.X.S = npcdens(factor(cancer) ~ smoking + ordered(occupation),
                       data = smoke)
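The pages that carry out the remaining averaging step are not included in this preview, but the formula from part (a) only requires averaging the fitted conditional probabilities over the observed occupations. A minimal sketch of that step, substituting a logistic regression for the conditional probability purely to keep the example short (the npcdens fit above could be used in the same role), and assuming cancer is coded 0/1 and that we want the effect of setting smoking to 1.5:

# Illustrative stand-in for the conditional probability Pr(cancer | smoking, occupation)
glm.cancer = glm(cancer ~ smoking + ordered(occupation), data = smoke, family = binomial)
# Evaluate Pr(cancer = 1 | smoking = 1.5, occupation = s_i) at every observed
# occupation s_i, then average, exactly as in the part (a) formula
eval.grid = data.frame(smoking = 1.5, occupation = smoke$occupation)
p.cancer.do.smoking = mean(predict(glm.cancer, newdata = eval.grid, type = "response"))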

Let’s try the grid. An initial look at the conditional density plot (not shown) suggests that with smoking = 1.5, the probability of tar values outside the range (0.10, 0.25) is next to zero, so we can limit the grid to that interval.

tar.eval.points = seq(from = 0.10, to = 0.25, length.out = 2000)
tar.grid.size = tar.eval.points[2] - tar.eval.points[1]
tar.grid = expand.grid(smoking = 1.5, tar = tar.eval.points)
fhat = predict(dens.M.cond.X, newdata = tar.grid)
cancer.given.do.tar = sapply(tar.eval.points, pY.given.do.M, pY.given.X.M)
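The objects dens.M.cond.X, pY.given.X.M, and pY.given.do.M used above are defined on pages not shown in this preview. A plausible reconstruction, offered only as a sketch (with a logistic regression standing in for the cancer model, and cancer again assumed coded 0/1), is:

# Conditional density of tar given smoking, as used in the outer integral
dens.M.cond.X = npcdens(tar ~ smoking, data = smoke)
# Stand-in model for Pr(cancer | smoking, tar)
pY.given.X.M = glm(cancer ~ smoking + tar, data = smoke, family = binomial)
# Inner average of the front-door formula: for a fixed tar level m, average
# Pr(cancer = 1 | smoking = x_i, tar = m) over the observed smoking values x_i
pY.given.do.M = function(m, pY.given.X.M) {
  eval.grid = data.frame(smoking = smoke$smoking, tar = m)
  mean(predict(pY.given.X.M, newdata = eval.grid, type = "response"))
}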

Plotting cancer.given.do.tar vs. tar.eval.points (not shown) yields a nice logistic curve, with the risk of cancer increasing with the level of tar, which reassures us that things have not gone horribly wrong. Now we get the integral by summing over the grid of tar levels:

sum(cancer.given.do.tar * fhat * tar.grid.size)

[1] 0.

This estimates the probability of cancer at about 13%.

However, we could also take a parametric approach to estimating the conditional distribution of tar given smoking. The scatter plot and residual analysis in Figure 1 suggest that the conditional density follows a linear-Gaussian model:

Observe that tar is well described by a linear function of smoking:

lm.tar.smoke = lm(tar ~ smoking, data = smoke)
par(mfrow = c(2,1))
plot(smoke$smoking, smoke$tar,
     main = "Tar looks like it could be a linear function of smoking with gaussian noise ...")
plot(fitted(lm.tar.smoke), lm.tar.smoke$residuals,
     main = "... and indeed this appears to be not far from the truth")
abline(h = 0)
normality = shapiro.test(lm.tar.smoke$residuals)
mtext(paste("Shapiro.test for normality gives p.value =",
            as.character(round(normality$p.value, 3))))

Here we fit the appropriate Gaussian and redo the final step from our previous analysis:

new.data = data.frame(smoking = 1.5)
tar.mu.1.5 = predict(lm.tar.smoke, newdata = new.data)
tar.var.1.5 = mean(lm.tar.smoke$residuals^2)

Compute the conditional density on the same grid as before:

tar.dens = sapply(tar.eval.points, dnorm, mean = tar.mu.1.5, sd = sqrt(tar.var.1.5))

Perform the outer sum (over values of M = tar):

sum(cancer.given.do.tar * tar.dens * tar.grid.size)

This time our estimate of the probability of cancer is 0.101.
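To see where the modest difference between the two answers comes from, one could overlay the two estimated conditional densities of tar at smoking = 1.5; a quick sketch, reusing the objects defined above:

# Compare the nonparametric and linear-Gaussian estimates of p(tar | smoking = 1.5)
plot(tar.eval.points, fhat, type = "l", xlab = "tar",
     ylab = "estimated conditional density at smoking = 1.5")
lines(tar.eval.points, tar.dens, lty = 2)
legend("topright", lty = 1:2, legend = c("npcdens estimate", "linear-Gaussian estimate"))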

[Figure 1: two-panel diagnostic plot for lm.tar.smoke. Top panel: smoke$tar against smoke$smoking, titled "Tar looks like it could be a linear function of smoking with gaussian noise ...". Bottom panel: lm.tar.smoke$residuals against fitted(lm.tar.smoke), titled "... and indeed this appears to be not far from the truth", with a horizontal reference line at zero and the Shapiro test p-value for normality annotated in the margin.]

  # (tail of the get.alpha.dev function, whose opening lines are on a page not shown here)
  error = abs(Xcoef - alpha.beta[1]) / sd.error
  return(error)
}

We can use the optim function to search for parameters that minimize this discrepancy:

solutions = replicate(10, optim(par = c(1,1,1), get.alpha.dev)$par)
solutions = as.data.frame(t(solutions))
colnames(solutions) = c("alpha", "beta1", "beta2")

   alpha beta1 beta2
1   1.32  1.57    0.
2   1.00  1.05    1.
3   1.50  1.27    0.
4   1.19  1.31    0.
5   1.42  1.54    0.
6   1.37  1.17    0.
7   1.20  1.40    0.
8   1.09  1.16    0.
9   1.26  1.59    0.
10  1.06  1.02    0.

The output of the optim function is quite unstable, as if many solutions exist. Closer examination of the optim output shows that the algorithm is not converging properly. Nevertheless, the result of the fourth optimization (α̂ = 1.19, β̂_1 = 1.31, β̂_2 = 0.03) seems to give something close to the desired equality:

errors = replicate(100, get.alpha.dev(as.numeric(solutions[4,])))
mean(errors)
[1] 0.

We will now argue that the coefficient of X in the larger model is not a consistent estimate of α unless β_2 = 0 or Cov[X, Z] = 0. As explained in Lectures 1 and 2, the population-level linear regression coefficient for predicting Y from X is

Cov[X, Y] / Var[X]

That is, this is the coefficient which minimizes the expected mean-squared error. The OLS estimate of the coefficient will converge to this as the sample size grows. Here Var[X] = 1 and Cov[X, Y] = α Var[X] = α, so linearly regressing Y on X alone gives a consistent estimate of α.

When predicting Y from X and Z, the vector of population-level regression coefficients is

[ Var[X]     Cov[X, Z] ]⁻¹ [ Cov[X, Y] ]
[ Cov[X, Z]  Var[Z]    ]   [ Cov[Z, Y] ]

and, again, the OLS estimates will converge on these coefficients. Since the matrix being inverted is only 2 × 2, we can actually do the inversion by hand:

(1 / (Var[X] Var[Z] − Cov[X, Z]^2)) ·
[ Var[Z]      −Cov[X, Z] ] [ Cov[X, Y] ]
[ −Cov[X, Z]   Var[X]    ] [ Cov[Z, Y] ]

We are interested only in the coefficient for X, which is

(Var[Z] Cov[X, Y] − Cov[X, Z] Cov[Z, Y]) / (Var[X] Var[Z] − Cov[X, Z]^2)

If Cov[X, Z] = 0, this reduces to Cov[X, Y] / Var[X] = α. Otherwise, it might happen to equal α, if all of the terms, including α, are adjusted precisely, but in general it will not.

We can say a bit more by writing out the various variances and covariances in terms of the parameters. It just so happens that we can easily obtain exact values for these and other covariances directly from our model equations:

Var[X] = 1
Var[Y] = Var[αX + ε] = α^2 + σ^2
Cov[X, Y] = Cov[X, αX + ε] = α
Var[Z] = β_1^2 Var[X] + β_2^2 Var[Y] + 2β_1β_2 Cov[X, Y] + σ^2
       = β_1^2 + β_2^2(α^2 + σ^2) + 2β_1β_2α + σ^2
Cov[X, Z] = Cov[X, β_1X + β_2Y + η] = Cov[X, β_1X + β_2αX + β_2ε + η] = β_1 + β_2α
Cov[Y, Z] = Cov[αX + ε, β_1X + β_2Y + η] = Cov[αX + ε, β_1X + β_2αX + β_2ε + η]
          = αβ_1 + α^2β_2 + β_2σ^2

Substituting, we get

   (β_1^2 + β_2^2(α^2 + σ^2) + 2β_1β_2α + σ^2)α − (β_1 + β_2α)(α(β_1 + αβ_2) + β_2σ^2)
   -------------------------------------------------------------------------------------
              β_1^2 + β_2^2(α^2 + σ^2) + 2β_1β_2α + σ^2 − (β_1 + β_2α)^2

     α(β_1^2 + β_2^2α^2 + β_2^2σ^2 + 2β_1β_2α + σ^2 − (β_1 + αβ_2)^2 − β_2^2σ^2) − β_1β_2σ^2
 =   -----------------------------------------------------------------------------------------
                                        σ^2(1 + β_2^2)

 =   (ασ^2 − β_1β_2σ^2) / (σ^2(1 + β_2^2))

 =   (α − β_1β_2) / (1 + β_2^2)

This will be α only when β_2 = 0, or when β_1 = −β_2α (in which case Cov[X, Z] = 0). Setting β_2 = 0 corresponds to removing the arrow from Y to Z.
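As a sanity check on this algebra, a small simulation with arbitrary illustrative parameter values (not the assignment's defaults), giving the noise terms for Y and Z the same variance σ^2 as in the calculation above, shows the two regression coefficients converging to α and to (α − β_1β_2)/(1 + β_2^2) respectively:

# Simulate X ~ N(0,1), Y = alpha*X + eps, Z = beta1*X + beta2*Y + eta, and compare
# the X coefficients from the two regressions with the limits derived above.
set.seed(1)
n = 1e5
alpha = 1.2; beta1 = 0.7; beta2 = 0.5; sigma = 1   # arbitrary illustrative values
x = rnorm(n)                                       # Var[X] = 1
y = alpha * x + rnorm(n, sd = sigma)
z = beta1 * x + beta2 * y + rnorm(n, sd = sigma)
coef(lm(y ~ x))["x"]                               # should be close to alpha
coef(lm(y ~ x + z))["x"]                           # close to (alpha - beta1*beta2)/(1 + beta2^2)
(alpha - beta1 * beta2) / (1 + beta2^2)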

1. Answer: Notice that in this model, Y = γZ + α_2U + ε, while Z = βX + η, so Y = γβX + γη + α_2U + ε. If we manipulated X, therefore, we’d find that its effect on Y is summarized by γβ, which is precisely the product of the coefficient summarizing the effect of X on Z (namely β) and that summarizing the effect of Z on Y (namely γ), just as the front-door criterion leads us to hope. On the other hand, simply regressing Y on X will be confounded:

Cov[X, Y] / Var[X] = Cov[X, γβX + γη + α_2U + ε] / Var[X]
                   = (γβ Var[X] + α_2 Cov[X, U]) / Var[X]
                   = γβ + α_1α_2 Var[U] / (α_1^2 Var[U] + σ^2)
                   = γβ + α_1α_2 / (α_1^2 + σ^2)

(using the fact that X = α_1U plus independent noise with variance σ^2, so that Cov[X, U] = α_1 Var[U] and Var[X] = α_1^2 Var[U] + σ^2, and that Var[U] = 1 for the last line).

For the default values of the parameters in the drawUXZY() function, the bias amounts to +0.5. Computationally, we compare the coefficient of X in a naive regression of Y on X with the product of the coefficient of X in the regression of Z on X and the coefficient of Z in the regression of Y on X and Z.

Make a function that returns estimates of the coefficient of X for the regressions of interest:

estimate.coefficients4 = function() {
  uxzy = drawUXZY()
  y.on.x   = lm(y ~ x, data = uxzy)$coefficients["x"][[1]]
  y.on.x.z = lm(y ~ x + z, data = uxzy)$coefficients["z"][[1]]
  z.on.x   = lm(z ~ x, data = uxzy)$coefficients["x"][[1]]
  return(c(y.on.x, y.on.x.z * z.on.x))
}

replicate.estimates4 = t(replicate(100, estimate.coefficients4()))

Biased estimate by direct regression:

mean(replicate.estimates4[,1])

[1] 6.

Unbiased estimate by front-door adjustment:

mean(replicate.estimates4[,2])

[1] 5.