






\[
\Pr(Y \mid do(X = x)) = \sum_s \Pr(Y \mid X = x, S = s)\,\Pr(S = s)
\]
where we would need to substitute consistent estimates of the conditional probability Pr(Y | X = x, S = s) and the marginal probability Pr(S = s). It is sometimes useful to short-cut having to estimate the marginal probability by using the law of large numbers. Notice that, with respect to the sum over s, Pr(Y | X = x, S = s) is just some function of s. If we have an IID sample of realizations of S, say s1, s2, ..., sn, then the law of large numbers says that
\[
\frac{1}{n} \sum_{i=1}^{n} f(s_i) \;\to\; \sum_s f(s)\,\Pr(S = s)
\]
Therefore, with a large sample,
\[
\Pr(Y \mid do(X = x)) \approx \frac{1}{n} \sum_{i=1}^{n} \Pr(Y \mid X = x, S = s_i)
\]
and this will still be (approximately) true when we use a consistent estimate of the conditional probability, rather than its true value. In addition to using occupation to block back-door paths, we could also use asbestos, or occupation and asbestos together. The advantage of occupation is that it is an ordinal variable with a small number of levels, so there should be less work involved in estimating Pr(Y | X = x, S = s) than if we had used a continuous control variable like asbestos.
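To make the shortcut concrete, here is a minimal R sketch; it reuses the pY.given.X.S model fit with npcdens in part (d) below, and the evaluation point x = 1.5 and the 0/1 coding of cancer are illustrative assumptions on our part:

# LLN shortcut: average the estimated Pr(Y = 1 | X = x, S = s_i) over the
# observed occupations s_i instead of estimating Pr(S = s) itself.
x = 1.5
eval.data = data.frame(smoking = x,
                       occupation = smoke$occupation,  # the observed s_i
                       cancer = 1)  # predict() for npcdens needs the response column
mean(predict(pY.given.X.S, newdata = eval.data))  # ~= Pr(Y = 1 | do(X = x))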
(b) Answer: Let X refer to smoking, M refer to tar, and Y refer to cancer. Then M blocks all directed paths from X to Y; the only back-door path from X to M is blocked by the collider at damage; and X blocks the only back-door path from M to Y. Now, as described in lecture 23 section 3.2, we can estimate the causal effect of smoking on cancer using
\[
\Pr(Y \mid do(X = x)) = \sum_m \Pr(M = m \mid X = x) \sum_{x'} \Pr(Y \mid X = x', M = m)\,\Pr(X = x')
\]
Since the front-door variable tar is continuous, the sum over m is really an integral, as is the sum over x′,
\[
\Pr(Y \mid do(X = x)) = \int dm\; p(m \mid X = x) \int dx'\; \Pr(Y \mid X = x', M = m)\,p(x')
\]
Once again (see the end of part a), we can simplify the expression by recognizing the inner integral as the expectation of Pr(Y |X = x′, M = m) as a function of x′, so the full equation reduces to
\[
\Pr(Y \mid do(X = x)) \approx \int dm\; p(m \mid X = x)\, \frac{1}{n} \sum_{i=1}^{n} \Pr(Y \mid X = x'_i, M = m)
\]
(c) Answer: As described in lecture 23 section 3.3, a variable I is an instrument for identifying the effect of X on Y if we can find a set S of controls such that (i) I is dependent on X given S, and (ii) every unblocked path from I to Y has an arrow pointing into X. Taking I to be occupation and S to be asbestos, we find that both conditions are met. In particular, smoking given asbestos gives information about occupation, so I is dependent on X given S. Also, the path from occupation to cancer through dental, teeth, and smoking does not have an arrow into smoking, but that path is blocked by the collider at teeth, so there is no problem. However, as detailed in the course notes, using the instrumental variable to estimate the causal effect of X on Y is not straightforward unless we know that the dependence of X on I is linear, as is that of Y on X. Thus we do not carry this analysis further.
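For reference, if those linearity assumptions did hold, the standard two-stage least squares recipe would apply. The following is only a sketch under that assumption: the column names mirror part (d), and fitting a linear second stage to the binary cancer variable is a further simplification.

stage1 = lm(smoking ~ ordered(occupation) + asbestos, data = smoke)  # exposure on instrument + controls
smoke$smoking.hat = fitted(stage1)  # instrumented exposure
stage2 = lm(cancer ~ smoking.hat + asbestos, data = smoke)  # outcome on fitted exposure + controls
coef(stage2)["smoking.hat"]  # IV estimate of the effect of smoking on cancer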
(d) Answer: We will use the expression derived in part (a) to estimate Pr(Y | do(X = x)). Since cancer is a binary variable, we will use a model which is designed for that sort of response, and since we have no good reason to use a (generalized) linear model, we will use npcdens to estimate the conditional distribution, telling it that cancer is a categorical variable:

library(np)
pY.given.X.S = npcdens(factor(cancer) ~ smoking + ordered(occupation),
                       data = smoke)
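The excerpt omits the intermediate code defining dens.M.cond.X, pY.given.X.M, and pY.given.do.M, which the next chunk relies on; the definitions below are our reconstruction of what they plausibly look like, not the original code (cancer is again assumed to be coded 0/1):

dens.M.cond.X = npcdens(tar ~ smoking, data = smoke)  # p(m | x), for the front-door formula
pY.given.X.M = npcdens(factor(cancer) ~ smoking + tar, data = smoke)  # Pr(Y | X, M)
# Pr(Y = 1 | do(M = m)), estimated by the sample average over the observed x_i:
# (1/n) * sum_i Pr(Y = 1 | X = x_i, M = m)
pY.given.do.M = function(m, model) {
  eval.data = data.frame(smoking = smoke$smoking, tar = m, cancer = 1)
  mean(predict(model, newdata = eval.data))
}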
Let’s try the grid. An initial look at the conditional density plot (not shown) suggests that with smoking = 1.5, the probability of tar values outside the range (0.10, 0.25) is next to zero, so we can limit the grid to that interval.
# Evaluation grid for the tar integral, restricted to where p(m | x = 1.5) is non-negligible
tar.eval.points = seq(from = 0.10, to = 0.25, length.out = 2000)
tar.grid.size = tar.eval.points[2] - tar.eval.points[1]
tar.grid = expand.grid(smoking = 1.5, tar = tar.eval.points)
# fhat: estimated conditional density p(m | X = 1.5) on the grid
fhat = predict(dens.M.cond.X, newdata = tar.grid)
# Pr(Y = 1 | do(M = m)) at each grid point m
cancer.given.do.tar = sapply(tar.eval.points, pY.given.do.M, pY.given.X.M)
Plotting cancer.given.do.tar vs. tar.eval.points (not shown) yields a nice logistic curve, with the risk of cancer increasing with the level of tar, which reassures us that things have not gone horribly wrong. Now we get the integral by summing over the grid of tar levels:
sum(cancer.given.do.tar * fhat * tar.grid.size)
This estimates the probability of cancer at about 13%.
However, we could also take a parametric approach to estimating the conditional distribution of tar given smoking. The scatter plot and residual analysis in Figure 1 suggest that the conditional density follows a linear-Gaussian model:
lm.tar.smoke = lm(tar ~ smoking, data = smoke)
par(mfrow = c(2, 1))
plot(smoke$smoking, smoke$tar,
     main = "Tar looks like it could be a linear function of smoking with gaussian noise ...")
plot(fitted(lm.tar.smoke), lm.tar.smoke$residuals,
     main = "... and indeed this appears to be not far from the truth")
abline(h = 0)
normality = shapiro.test(lm.tar.smoke$residuals)
mtext(paste("Shapiro.test for normality gives p.value =",
            as.character(round(normality$p.value, 3))))
Here we fit the appropriate Gaussian and redo the final step from our previous analysis:
# Plug-in estimates of the conditional mean and variance of tar at smoking = 1.5
new.data = data.frame(smoking = 1.5)
tar.mu.1.5 = predict(lm.tar.smoke, newdata = new.data)
tar.var.1.5 = mean(lm.tar.smoke$residuals^2)
tar.dens = sapply(tar.eval.points, dnorm, mean = tar.mu.1.5, sd = sqrt(tar.var.1.5))
sum(cancer.given.do.tar * tar.dens * tar.grid.size)
This time our estimate of the probability of cancer is 0.101.
[Figure 1: two-panel diagnostic plot (not reproduced). Top panel: smoke$tar against smoke$smoking, titled "Tar looks like it could be a linear function of smoking with gaussian noise ...". Bottom panel: lm.tar.smoke$residuals against fitted(lm.tar.smoke), titled "... and indeed this appears to be not far from the truth", with the Shapiro test p-value annotated.]
  # end of get.alpha.dev: return the discrepancy between the fitted coefficient
  # on X and alpha, in units of its standard error
  error = abs(Xcoef - alpha.beta[1]) / sd.error
  return(error)
}
We can use the optim function to search for parameters that minimize this discrepancy:
solutions = replicate(10, optim(par = c(1, 1, 1), get.alpha.dev)$par)
solutions = as.data.frame(t(solutions))
colnames(solutions) = c("alpha", "beta1", "beta2")

   alpha beta1 beta2
1   1.32  1.57    0.
2   1.00  1.05    1.
3   1.50  1.27    0.
4   1.19  1.31    0.
5   1.42  1.54    0.
6   1.37  1.17    0.
7   1.20  1.40    0.
8   1.09  1.16    0.
9   1.26  1.59    0.
10  1.06  1.02    0.
The output of the optim function is quite unstable, as if many solutions exist. Closer examination of the optim output shows that the algorithm is not converging properly. Nevertheless, the result of the fourth optimization ($\hat{\alpha} = 1.19$, $\hat{\beta}_1 = 1.31$, $\hat{\beta}_2 = 0.03$) seems to give something close to the desired equality:
errors = replicate(100, get.alpha.dev(as.numeric(solutions[4,])))
mean(errors)
[1] 0.
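Incidentally, the convergence failure noted above can be confirmed directly from optim's convergence flag, which is 0 only on success:

fit = optim(par = c(1, 1, 1), get.alpha.dev)
fit$convergence  # nonzero (e.g., 1 = iteration limit reached) means no convergence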
We will now argue that the coefficient of X in the larger model is not a consistent estimate of α unless β2 = 0 or Cov[X, Z] = 0. As explained in Lectures 1 and 2, the population-level linear regression coefficient for predicting Y from X is

\[
\frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}
\]
That is, this is the coefficient which minimizes the expected mean-squared error. The OLS estimate of the coefficient will converge to this as the sample size grows. Here Var [X] = 1 and Cov [X, Y ] = αVar [X] = α, so linearly regressing Y on X alone gives a consistent estimate of α.
When predicting Y from X and Z, the vector of population-level regression coefficients is

\[
\begin{bmatrix} \mathrm{Var}[X] & \mathrm{Cov}[X,Z] \\ \mathrm{Cov}[X,Z] & \mathrm{Var}[Z] \end{bmatrix}^{-1}
\begin{bmatrix} \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[Z,Y] \end{bmatrix}
\]

and, again, the OLS estimates will converge on these coefficients. Since the matrix being inverted is only 2 × 2, we can actually do the inversion by hand:

\[
\frac{1}{\mathrm{Var}[X]\,\mathrm{Var}[Z] - \mathrm{Cov}[X,Z]^2}
\begin{bmatrix} \mathrm{Var}[Z] & -\mathrm{Cov}[X,Z] \\ -\mathrm{Cov}[X,Z] & \mathrm{Var}[X] \end{bmatrix}
\begin{bmatrix} \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[Z,Y] \end{bmatrix}
\]
We are interested only in the coefficient for X, which is
\[
\frac{\mathrm{Var}[Z]\,\mathrm{Cov}[X,Y] - \mathrm{Cov}[X,Z]\,\mathrm{Cov}[Z,Y]}{\mathrm{Var}[X]\,\mathrm{Var}[Z] - \mathrm{Cov}[X,Z]^2}
\]
If Cov [X, Z] = 0, this reduces to Cov [X, Y ] /Var [X] = α. Otherwise, it might happen to equal α, if all of the terms, including α, are adjusted precisely, but in general it will not.
We can say a bit more by writing out the various variances and covariances in terms of the parameters. It just so happens that we can easily obtain exact values for these and other covariances directly from our model equations:
\[
\begin{aligned}
\mathrm{Var}[X] &= 1 \\
\mathrm{Var}[Y] &= \mathrm{Var}[\alpha X + \epsilon] = \alpha^2 + \sigma^2 \\
\mathrm{Cov}[X,Y] &= \mathrm{Cov}[X, \alpha X + \epsilon] = \alpha \\
\mathrm{Var}[Z] &= \beta_1^2\,\mathrm{Var}[X] + \beta_2^2\,\mathrm{Var}[Y] + 2\beta_1\beta_2\,\mathrm{Cov}[X,Y] + \sigma^2 = \beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2 \\
\mathrm{Cov}[X,Z] &= \mathrm{Cov}[X, \beta_1 X + \beta_2 Y + \eta] = \mathrm{Cov}[X, \beta_1 X + \beta_2\alpha X + \beta_2\epsilon + \eta] = \beta_1 + \beta_2\alpha \\
\mathrm{Cov}[Y,Z] &= \mathrm{Cov}[\alpha X + \epsilon, \beta_1 X + \beta_2 Y + \eta] = \mathrm{Cov}[\alpha X + \epsilon, \beta_1 X + \beta_2\alpha X + \beta_2\epsilon + \eta] = \alpha\beta_1 + \alpha^2\beta_2 + \beta_2\sigma^2
\end{aligned}
\]
Substituting, we get
\[
\begin{aligned}
&\frac{\bigl(\beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2\bigr)\alpha - (\beta_1 + \beta_2\alpha)\bigl(\alpha(\beta_1 + \alpha\beta_2) + \beta_2\sigma^2\bigr)}{\beta_1^2 + \beta_2^2(\alpha^2 + \sigma^2) + 2\beta_1\beta_2\alpha + \sigma^2 - (\beta_1 + \beta_2\alpha)^2} \\
&= \frac{\alpha\bigl(\beta_1^2 + \beta_2^2\alpha^2 + \beta_2^2\sigma^2 + 2\beta_1\beta_2\alpha + \sigma^2 - (\beta_1 + \alpha\beta_2)^2 - \beta_2^2\sigma^2\bigr) - \beta_1\beta_2\sigma^2}{\sigma^2(1 + \beta_2^2)} \\
&= \frac{\alpha\sigma^2 - \beta_1\beta_2\sigma^2}{\sigma^2(1 + \beta_2^2)} \\
&= \frac{\alpha - \beta_1\beta_2}{1 + \beta_2^2}
\end{aligned}
\]
This will be α only when β2 = 0, or when β1 = −β2α (in which case Cov[X, Z] = 0). Setting β2 = 0 corresponds to removing the arrow from Y to Z.
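A quick simulation (our own sketch, with arbitrary parameter values) can confirm the limiting value (α − β1β2)/(1 + β2^2):

# Simulate X ~ N(0,1), Y = alpha*X + eps, Z = beta1*X + beta2*Y + eta, and
# compare the fitted coefficient of X (regressing Y on X and Z) to the formula.
set.seed(1)
n = 1e6; alpha = 1; beta1 = 2; beta2 = 3; sigma = 1
x = rnorm(n)
y = alpha * x + rnorm(n, sd = sigma)
z = beta1 * x + beta2 * y + rnorm(n, sd = sigma)
coef(lm(y ~ x + z))["x"]                 # simulated coefficient of X
(alpha - beta1 * beta2) / (1 + beta2^2)  # theoretical limit: -0.5 here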
The population coefficient from naively regressing Y on X is

\[
\begin{aligned}
\frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]}
&= \frac{\mathrm{Cov}[X,\, \gamma\beta X + \gamma\eta + \alpha_2 U + \epsilon]}{\mathrm{Var}[X]} \\
&= \frac{\gamma\beta\,\mathrm{Var}[X] + \alpha_2\,\mathrm{Cov}[X, U]}{\mathrm{Var}[X]} \\
&= \gamma\beta + \alpha_2\,\frac{\mathrm{Cov}[\alpha_1 U + \epsilon,\, U]}{\mathrm{Var}[X]} \\
&= \gamma\beta + \alpha_1\alpha_2\,\frac{\mathrm{Var}[U]}{\alpha_1^2\,\mathrm{Var}[U] + \sigma^2} \\
&= \gamma\beta + \frac{\alpha_1\alpha_2}{\alpha_1^2 + \sigma^2}
\end{aligned}
\]

where the last step uses Var[U] = 1.
For the default values of the parameters in the drawUXZY() function, the bias amounts to +0.5. Computationally, we compare the coefficient of X in a naive regression of Y on X against the product of two coefficients: that of X in a regression of Z on X, and that of Z in a regression of Y on both X and Z.
estimate.coefficients4 = function() {
  uxzy = drawUXZY()
  y.on.x   = lm(y ~ x, data = uxzy)$coefficients["x"][[1]]      # naive coefficient of X
  y.on.x.z = lm(y ~ x + z, data = uxzy)$coefficients["z"][[1]]  # coefficient of Z given X
  z.on.x   = lm(z ~ x, data = uxzy)$coefficients["x"][[1]]      # coefficient of X for Z
  return(c(y.on.x, y.on.x.z * z.on.x))
}
replicate.estimates4 = t(replicate(100, estimate.coefficients4()))
mean(replicate.estimates4[,1])
mean(replicate.estimates4[,2])