






\[
\Pr(Y \mid do(X = x)) = \sum_s \Pr(Y \mid X = x, S = s)\,\Pr(S = s)
\]
where we would need to substitute consistent estimates of the conditional probability Pr(Y | X = x, S = s) and the marginal probability Pr(S = s). It is sometimes useful to short-cut having to estimate the marginal probability by using the law of large numbers. Notice that, with respect to the sum over s, Pr(Y | X = x, S = s) is just some function of s. If we have an IID sample of realizations of S, say s1, s2, ..., sn, then the law of large numbers says that
\[
\frac{1}{n} \sum_{i=1}^{n} f(s_i) \;\to\; \sum_s f(s)\,\Pr(S = s)
\]
Therefore, with a large sample,
\[
\Pr(Y \mid do(X = x)) \approx \frac{1}{n} \sum_{i=1}^{n} \Pr(Y \mid X = x, S = s_i)
\]
and this will still be (approximately) true when we use a consistent estimate of the conditional probability, rather than its true value. In addition to using occupation to block back-door paths, we could also use asbestos, or occupation and asbestos together. The advantage of occupation is that it is an ordinal variable with a small number of levels, so there should be less work involved in estimating Pr(Y | X = x, S = s) than if we had used a continuous control variable like asbestos.
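To make the shortcut concrete, here is a minimal R sketch; it reuses the pY.given.X.S model fit with npcdens in part (d) below, and the evaluation point x = 1.5 and the 0/1 coding of cancer are illustrative assumptions on our part:

# LLN shortcut: average the estimated Pr(Y = 1 | X = x, S = s_i) over the
# observed occupations s_i instead of estimating Pr(S = s) itself.
x = 1.5
eval.data = data.frame(smoking = x,
                       occupation = smoke$occupation,  # the observed s_i
                       cancer = 1)  # predict() for npcdens needs the response column
mean(predict(pY.given.X.S, newdata = eval.data))  # ~= Pr(Y = 1 | do(X = x))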
(b) Answer: Let X refer to smoking, M refer to tar, and Y refer to cancer. Then M blocks all directed paths from X to Y; the only back-door path from X to M is blocked by the collider at damage; and X blocks the only back-door path from M to Y. Now, as described in lecture 23 section 3.2, we can estimate the causal effect of smoking on cancer using
\[
\Pr(Y \mid do(X = x)) = \sum_m \Pr(M = m \mid X = x) \sum_{x'} \Pr(Y \mid X = x', M = m)\,\Pr(X = x')
\]
Since the front-door variable tar is continuous, the sum over m is really an integral, as is the sum over x′,
\[
\Pr(Y \mid do(X = x)) = \int dm\; p(m \mid X = x) \int dx'\; \Pr(Y \mid X = x', M = m)\,p(x')
\]
Once again (see the end of part a), we can simplify the expression by recognizing the inner integral as the expectation of Pr(Y |X = x′, M = m) as a function of x′, so the full equation reduces to
\[
\Pr(Y \mid do(X = x)) \approx \int dm\; p(m \mid X = x)\, \frac{1}{n} \sum_{i=1}^{n} \Pr(Y \mid X = x'_i, M = m)
\]
(c) Answer: As described in lecture 23 section 3.3, a variable I is an instrument for identifying the effect of X on Y if we can find a set S of controls such that (i) I is dependent on X given S, and (ii) every unblocked path from I to Y has an arrow pointing into X. Taking I to be occupation and S to be asbestos, we find that both conditions are met. In particular, smoking given asbestos gives information about occupation, so I is dependent on X given S. Also, the path from occupation to cancer through dental, teeth, and smoking does not have an arrow into smoking, but that path is blocked by the collider at teeth, so there is no problem. However, as detailed in the course notes, using the instrumental variable to estimate the causal effect of X on Y is not straightforward unless we know that the dependence of X on I is linear, as is that of Y on X. Thus we do not carry this analysis further.
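For reference, if those linearity assumptions did hold, the standard two-stage least squares recipe would apply. The following is only a sketch under that assumption: the column names mirror part (d), and fitting a linear second stage to the binary cancer variable is a further simplification.

stage1 = lm(smoking ~ ordered(occupation) + asbestos, data = smoke)  # exposure on instrument + controls
smoke$smoking.hat = fitted(stage1)  # instrumented exposure
stage2 = lm(cancer ~ smoking.hat + asbestos, data = smoke)  # outcome on fitted exposure + controls
coef(stage2)["smoking.hat"]  # IV estimate of the effect of smoking on cancer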
(d) Answer: We will use the expression derived in part (a) to estimate Pr(Y | do(X = x)). Since cancer is a binary variable, we will use a model which is designed for that sort of response, and since we have no good reason to use a (generalized) linear model, we will use npcdens to estimate the conditional distribution, telling it that cancer is a categorical variable:

library(np)
pY.given.X.S = npcdens(factor(cancer) ~ smoking + ordered(occupation),
                       data = smoke)
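The excerpt omits the intermediate code defining dens.M.cond.X, pY.given.X.M, and pY.given.do.M, which the next chunk relies on; the definitions below are our reconstruction of what they plausibly look like, not the original code (cancer is again assumed to be coded 0/1):

dens.M.cond.X = npcdens(tar ~ smoking, data = smoke)  # p(m | x), for the front-door formula
pY.given.X.M = npcdens(factor(cancer) ~ smoking + tar, data = smoke)  # Pr(Y | X, M)
# Pr(Y = 1 | do(M = m)), estimated by the sample average over the observed x_i:
# (1/n) * sum_i Pr(Y = 1 | X = x_i, M = m)
pY.given.do.M = function(m, model) {
  eval.data = data.frame(smoking = smoke$smoking, tar = m, cancer = 1)
  mean(predict(model, newdata = eval.data))
}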
Let’s try the grid. An initial look at the conditional density plot (not shown) suggests that with smoking = 1.5, the probability of tar values outside the range (0.10, 0.25) is next to zero, so we can limit the grid to that interval.
# Evaluation grid for the tar integral, restricted to where p(m | x = 1.5) is non-negligible
tar.eval.points = seq(from = 0.10, to = 0.25, length.out = 2000)
tar.grid.size = tar.eval.points[2] - tar.eval.points[1]
tar.grid = expand.grid(smoking = 1.5, tar = tar.eval.points)
# fhat: estimated conditional density p(m | X = 1.5) on the grid
fhat = predict(dens.M.cond.X, newdata = tar.grid)
# Pr(Y = 1 | do(M = m)) at each grid point m
cancer.given.do.tar = sapply(tar.eval.points, pY.given.do.M, pY.given.X.M)
Plotting cancer.given.do.tar vs. tar.eval.points (not shown) yields a nice logistic curve, with the risk of cancer increasing with the level of tar, which reassures us that things have not gone horribly wrong. Now we get the integral by summing over the grid of tar levels:
sum(cancer.given.do.tar * fhat * tar.grid.size)
This estimates the probability of cancer at about 13%.
However, we could also take a parametric approach to estimating the conditional distribution of tar given smoking. The scatter plot and residual analysis in Figure 1 suggest that the conditional density follows a linear-Gaussian model:
lm.tar.smoke = lm(tar ~ smoking, data = smoke)
par(mfrow = c(2, 1))
plot(smoke$smoking, smoke$tar,
     main = "Tar looks like it could be a linear function of smoking with gaussian noise ...")
plot(fitted(lm.tar.smoke), lm.tar.smoke$residuals,
     main = "... and indeed this appears to be not far from the truth")
abline(h = 0)
normality = shapiro.test(lm.tar.smoke$residuals)
mtext(paste("Shapiro.test for normality gives p.value =",
            as.character(round(normality$p.value, 3))))
Here we fit the appropriate Gaussian and redo the final step from our previous analysis:
# Plug-in estimates of the conditional mean and variance of tar at smoking = 1.5
new.data = data.frame(smoking = 1.5)
tar.mu.1.5 = predict(lm.tar.smoke, newdata = new.data)
tar.var.1.5 = mean(lm.tar.smoke$residuals^2)
tar.dens = sapply(tar.eval.points, dnorm, mean = tar.mu.1.5, sd = sqrt(tar.var.1.5))
sum(cancer.given.do.tar * tar.dens * tar.grid.size)
This time our estimate of the probability of cancer is 0.101.
[Figure 1: two-panel diagnostic plot (not reproduced). Top panel: smoke$tar against smoke$smoking, titled "Tar looks like it could be a linear function of smoking with gaussian noise ...". Bottom panel: lm.tar.smoke$residuals against fitted(lm.tar.smoke), titled "... and indeed this appears to be not far from the truth", with the Shapiro test p-value annotated.]
  # end of get.alpha.dev: return the discrepancy between the fitted coefficient
  # on X and alpha, in units of its standard error
  error = abs(Xcoef - alpha.beta[1]) / sd.error
  return(error)
}
We can use the optim function to search for parameters that minimize this discrepancy:
solutions = replicate(10, optim(par = c(1, 1, 1), get.alpha.dev)$par)
solutions = as.data.frame(t(solutions))
colnames(solutions) = c("alpha", "beta1", "beta2")

   alpha beta1 beta2
1   1.32  1.57    0.
2   1.00  1.05    1.
3   1.50  1.27    0.
4   1.19  1.31    0.
5   1.42  1.54    0.
6   1.37  1.17    0.
7   1.20  1.40    0.
8   1.09  1.16    0.
9   1.26  1.59    0.
10  1.06  1.02    0.
The output of the optim function is quite unstable, as if many solutions exist. Closer examination of the optim output shows that the algorithm is not converging properly. Nevertheless, the result of the fourth optimization ($\hat{\alpha} = 1.19$, $\hat{\beta}_1 = 1.31$, $\hat{\beta}_2 = 0.03$) seems to give something close to the desired equality:
errors = replicate(100, get.alpha.dev(as.numeric(solutions[4,])))
mean(errors)
[1] 0.
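Incidentally, the convergence failure noted above can be confirmed directly from optim's convergence flag, which is 0 only on success:

fit = optim(par = c(1, 1, 1), get.alpha.dev)
fit$convergence  # nonzero (e.g., 1 = iteration limit reached) means no convergence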
We will now argue that the coefficient of X in the larger model is not a consistent estimate of α unless β2 = 0 or Cov[X, Z] = 0. As explained in Lectures 1 and 2, the population-level linear regression coefficient for predicting Y from X is

\[
\frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}
\]
That is, this is the coefficient which minimizes the expected mean-squared error. The OLS estimate of the coefficient will converge to this as the sample size grows. Here Var [X] = 1 and Cov [X, Y ] = αVar [X] = α, so linearly regressing Y on X alone gives a consistent estimate of α.
When predicting Y from X and Z, the vector of population-level regression coefficients is

\[
\begin{bmatrix} \mathrm{Var}[X] & \mathrm{Cov}[X,Z] \\ \mathrm{Cov}[X,Z] & \mathrm{Var}[Z] \end{bmatrix}^{-1}
\begin{bmatrix} \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[Z,Y] \end{bmatrix}
\]

and, again, the OLS estimates will converge on these coefficients. Since the matrix being inverted is only 2 × 2, we can actually do the inversion by hand:

\[
\frac{1}{\mathrm{Var}[X]\,\mathrm{Var}[Z] - \mathrm{Cov}[X,Z]^2}
\begin{bmatrix} \mathrm{Var}[Z] & -\mathrm{Cov}[X,Z] \\ -\mathrm{Cov}[X,Z] & \mathrm{Var}[X] \end{bmatrix}
\begin{bmatrix} \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[Z,Y] \end{bmatrix}
\]
We are interested only in the coefficient for X, which is
\[
\frac{\mathrm{Var}[Z]\,\mathrm{Cov}[X,Y] - \mathrm{Cov}[X,Z]\,\mathrm{Cov}[Z,Y]}{\mathrm{Var}[X]\,\mathrm{Var}[Z] - \mathrm{Cov}[X,Z]^2}
\]
If Cov [X, Z] = 0, this reduces to Cov [X, Y ] /Var [X] = α. Otherwise, it might happen to equal α, if all of the terms, including α, are adjusted precisely, but in general it will not.
We can say a bit more by writing out the various variances and covariances in terms of the parameters. It just so happens that we can easily obtain exact values for these and other covariances directly from our model equations:
\[
\begin{aligned}
\mathrm{Var}[X] &= 1 \\
\mathrm{Var}[Y] &= \mathrm{Var}[\alpha X + \epsilon] = \alpha^2 + \sigma^2 \\
\mathrm{Cov}[X,Y] &= \mathrm{Cov}[X, \alpha X + \epsilon] = \alpha \\
\mathrm{Var}[Z] &= \beta_1^2\,\mathrm{Var}[X] + \beta_2^2\,\mathrm{Var}[Y] + 2\beta_1\beta_2\,\mathrm{Cov}[X,Y] + \sigma^2 = \beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2 \\
\mathrm{Cov}[X,Z] &= \mathrm{Cov}[X, \beta_1 X + \beta_2 Y + \eta] = \mathrm{Cov}[X, \beta_1 X + \beta_2\alpha X + \beta_2\epsilon + \eta] = \beta_1 + \beta_2\alpha \\
\mathrm{Cov}[Y,Z] &= \mathrm{Cov}[\alpha X + \epsilon, \beta_1 X + \beta_2 Y + \eta] = \mathrm{Cov}[\alpha X + \epsilon, \beta_1 X + \beta_2\alpha X + \beta_2\epsilon + \eta] = \alpha\beta_1 + \alpha^2\beta_2 + \beta_2\sigma^2
\end{aligned}
\]
Substituting, we get
\[
\begin{aligned}
&\frac{\bigl(\beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2\bigr)\alpha - (\beta_1 + \beta_2\alpha)\bigl(\alpha(\beta_1 + \alpha\beta_2) + \beta_2\sigma^2\bigr)}{\beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2 - (\beta_1 + \beta_2\alpha)^2} \\
&= \frac{\alpha\bigl(\beta_1^2 + \beta_2^2\alpha^2 + \beta_2^2\sigma^2 + 2\beta_1\beta_2\alpha + \sigma^2 - (\beta_1 + \alpha\beta_2)^2 - \beta_2^2\sigma^2\bigr) - \beta_1\beta_2\sigma^2}{\sigma^2(1 + \beta_2^2)} \\
&= \frac{\alpha\sigma^2 - \beta_1\beta_2\sigma^2}{\sigma^2(1 + \beta_2^2)} \\
&= \frac{\alpha - \beta_1\beta_2}{1 + \beta_2^2}
\end{aligned}
\]
This will be α only when β2 = 0, or when β1 = −β2α (in which case Cov[X, Z] = 0). Setting β2 = 0 corresponds to removing the arrow from Y to Z.
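A quick simulation (our own sketch, with arbitrary parameter values) can confirm the limiting value (α − β1β2)/(1 + β2^2):

# Simulate X ~ N(0,1), Y = alpha*X + eps, Z = beta1*X + beta2*Y + eta, and
# compare the fitted coefficient of X (regressing Y on X and Z) to the formula.
set.seed(1)
n = 1e6; alpha = 1; beta1 = 2; beta2 = 3; sigma = 1
x = rnorm(n)
y = alpha * x + rnorm(n, sd = sigma)
z = beta1 * x + beta2 * y + rnorm(n, sd = sigma)
coef(lm(y ~ x + z))["x"]                 # simulated coefficient of X
(alpha - beta1 * beta2) / (1 + beta2^2)  # theoretical limit: -0.5 here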
The population coefficient from naively regressing Y on X is

\[
\begin{aligned}
\frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]}
&= \frac{\mathrm{Cov}[X,\, \gamma\beta X + \gamma\eta + \alpha_2 U + \epsilon]}{\mathrm{Var}[X]} \\
&= \frac{\gamma\beta\,\mathrm{Var}[X] + \alpha_2\,\mathrm{Cov}[X, U]}{\mathrm{Var}[X]} \\
&= \gamma\beta + \alpha_2\,\frac{\mathrm{Cov}[\alpha_1 U + \epsilon,\, U]}{\mathrm{Var}[X]} \\
&= \gamma\beta + \alpha_1\alpha_2\,\frac{\mathrm{Var}[U]}{\alpha_1^2\,\mathrm{Var}[U] + \sigma^2} \\
&= \gamma\beta + \frac{\alpha_1\alpha_2}{\alpha_1^2 + \sigma^2}
\end{aligned}
\]

where the last step uses Var[U] = 1.
For the default values of the parameters in the drawUXZY() function, the bias amounts to +0.5. Computationally, we compare the coefficient of X in a naive regression of Y on X against the product of two coefficients: that of X in a regression of Z on X, and that of Z in a regression of Y on both X and Z.
estimate.coefficients4 = function() {
  uxzy = drawUXZY()
  y.on.x   = lm(y ~ x, data = uxzy)$coefficients["x"][[1]]      # naive coefficient of X
  y.on.x.z = lm(y ~ x + z, data = uxzy)$coefficients["z"][[1]]  # coefficient of Z given X
  z.on.x   = lm(z ~ x, data = uxzy)$coefficients["x"][[1]]      # coefficient of X for Z
  return(c(y.on.x, y.on.x.z * z.on.x))
}
replicate.estimates4 = t(replicate(100, estimate.coefficients4()))
mean(replicate.estimates4[,1])
mean(replicate.estimates4[,2])