

The Chain Rule in Multivariable Calculus
Lecture notes by Nikhil Srivastava

These notes give an in-depth explanation of the chain rule in multivariable calculus, using linear approximations and matrices to derive the chain rule for functions from R^2 to R and from R^2 to R^2, and relating the gradient of a function to its Jacobian matrix.
The chain rule is a simple consequence of the fact that differentiation produces the linear approximation to a function at a point, and that the derivative is the coefficient appearing in this linear approximation. Let's see this for the single variable case first. It is especially transparent using o() notation, where once again f(x) = o(g(x)) means that

lim_{x→0} f(x)/g(x) = 0.

For example, x^2 = o(x), since x^2/x = x → 0 as x → 0.
Suppose we are interested in computing the derivative of (f ◦ g)(x) = f(g(x)) at x, where f and g are both differentiable functions from R to R. Since g is differentiable, we have (by the definition of differentiation as a limit):

g(x + ∆x) = g(x) + g′(x)∆x + o(∆x)

for a number g′(x) which we call the derivative of g at x. In words, this says that g is well-approximated by its linear approximation in a neighborhood of x. Similarly, we have

f(y + ∆y) = f(y) + f′(y)∆y + o(∆y).
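As a quick numerical sanity check of the linear approximation, here is a minimal sketch; the choice g(x) = sin(x), the point x = 1, and the step sizes are illustrative assumptions, not from the notes:

import math

# Sanity check: the remainder r(dx) = g(x + dx) - g(x) - g'(x)*dx
# should be o(dx), i.e. r(dx)/dx -> 0 as dx -> 0.
g, dg = math.sin, math.cos   # illustrative choice: g = sin, so g' = cos
x = 1.0

for dx in (1e-1, 1e-2, 1e-3, 1e-4):
    r = g(x + dx) - g(x) - dg(x) * dx
    print(f"dx = {dx:.0e}   r/dx = {r / dx:.2e}")
# r/dx shrinks roughly linearly in dx, confirming r = o(dx).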
Letting y = g(x) and ∆y = g′(x)∆x + o(∆x), we now find that
(f ◦ g)(x + ∆x) = f(g(x + ∆x))
= f(g(x) + g′(x)∆x + o(∆x))
= f(g(x)) + f′(g(x))(g′(x)∆x + o(∆x)) + o(∆y)
= f(g(x)) + f′(g(x))·g′(x)∆x + o(∆x) + o(∆y),

since f′(g(x))·o(∆x) = o(∆x).
Thus, we have
lim_{∆x→0} ((f ◦ g)(x + ∆x) − (f ◦ g)(x))/∆x = lim_{∆x→0} (f′(g(x))·g′(x)∆x + o(∆x) + o(∆y))/∆x = f′(g(x))·g′(x)

(note that o(∆y) = o(∆x), since ∆y = g′(x)∆x + o(∆x)),
establishing the chain rule.
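A similar sketch checks this conclusion numerically: the difference quotient of f ◦ g approaches f′(g(x))·g′(x). The functions f = exp and g = sin are again my own illustrative choices:

import math

# Compare the difference quotient of (f o g) against f'(g(x)) * g'(x).
f, df = math.exp, math.exp   # f = exp, so f' = exp
g, dg = math.sin, math.cos   # g = sin, so g' = cos
x = 1.0

chain = df(g(x)) * dg(x)     # the chain-rule value
for dx in (1e-2, 1e-4, 1e-6):
    quotient = (f(g(x + dx)) - f(g(x))) / dx
    print(f"dx = {dx:.0e}   quotient = {quotient:.8f}   f'(g(x))g'(x) = {chain:.8f}")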
A very similar thing happens in the multivariable case. Suppose f : R^2 → R and g : R^2 → R^2 are differentiable. To parallel the notation used in class, let z = f(x, y) and (x, y) = g(s, t). Since both functions are differentiable, they must have linear approximations:
f(x + ∆x, y + ∆y) = f((x, y) + (∆x, ∆y)) ≈ f(x, y) + Lf(∆x, ∆y),    (∗)

g(s + ∆s, t + ∆t) = g((s, t) + (∆s, ∆t)) ≈ g(s, t) + Lg(∆s, ∆t),    (∗∗)
where Lf : R^2 → R and Lg : R^2 → R^2 are linear functions, and I have used ≈ to indicate equality up to o(∆x) terms.¹ But we know that all linear functions are implemented by matrices, so there must be a 1 × 2 matrix Df such that
Lf(∆x, ∆y) = Df [∆x; ∆y]

(writing [∆x; ∆y] for the column vector with entries ∆x and ∆y).
In fact, we know exactly what this matrix is (by comparing coefficients):
Df = [∂f/∂x  ∂f/∂y]

(this is just the gradient ∇f of f, written as a 1 × 2 row vector),
so that we have the explicit formula
Lf(∆x, ∆y) = [∂f/∂x  ∂f/∂y] [∆x; ∆y] = (∂f/∂x)∆x + (∂f/∂y)∆y,
which is the same as what is given by the total differential df. Repeating this process for Lg, we get that for the 2 × 2 matrix
Dg = [∂x/∂s  ∂x/∂t; ∂y/∂s  ∂y/∂t]

(the Jacobian matrix of g),
the linear approximation of g at (s, t) is given by
Lg(∆s, ∆t) = Dg [∆s; ∆t] = [∂x/∂s  ∂x/∂t; ∂y/∂s  ∂y/∂t] [∆s; ∆t] = [(∂x/∂s)∆s + (∂x/∂t)∆t; (∂y/∂s)∆s + (∂y/∂t)∆t].
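To see these formulas in action, here is a sketch that assembles Dg by finite differences and compares Lg(∆s, ∆t) with the actual change in g; the map g(s, t) = (st, s + t^2) and all the numbers are made-up examples:

# Approximate the Jacobian Dg = [dx/ds dx/dt; dy/ds dy/dt] by finite
# differences, for the made-up map g(s, t) = (s*t, s + t**2).
def g(s, t):
    return (s * t, s + t ** 2)

def jacobian(g, s, t, h=1e-6):
    x0, y0 = g(s, t)
    xs, ys = g(s + h, t)      # step in s
    xt, yt = g(s, t + h)      # step in t
    return [[(xs - x0) / h, (xt - x0) / h],
            [(ys - y0) / h, (yt - y0) / h]]

s, t, ds, dt = 1.0, 2.0, 1e-3, -2e-3
Dg = jacobian(g, s, t)

# Lg(ds, dt) = Dg (ds, dt)^T, computed entry by entry.
Lg = (Dg[0][0] * ds + Dg[0][1] * dt,
      Dg[1][0] * ds + Dg[1][1] * dt)

x0, y0 = g(s, t)
x1, y1 = g(s + ds, t + dt)
print("actual change in g:", (x1 - x0, y1 - y0))
print("Lg prediction     :", Lg)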
Now for the punch line: just as in the univariate case, we write:
(f ◦ g)((s, t) + (∆s, ∆t)) = f(g((s, t) + (∆s, ∆t)))
≈ f(g(s, t) + Dg [∆s; ∆t])    by (∗∗)
≈ f(g(s, t)) + Df Dg [∆s; ∆t]    by (∗), treating Dg [∆s; ∆t] as [∆x; ∆y].

Thus the linear approximation of f ◦ g at (s, t) has matrix Df Dg; in other words, D(f ◦ g) = Df Dg, which is the multivariable chain rule. Multiplying out the product recovers the familiar formulas ∂z/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) and ∂z/∂t = (∂f/∂x)(∂x/∂t) + (∂f/∂y)(∂y/∂t).
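Finally, a sketch verifying the punch line numerically: the finite-difference partials of f ◦ g match the matrix product Df Dg. The functions f(x, y) = xy and g(s, t) = (st, s + t^2) are again made-up illustrations:

# Check D(f o g) = Df * Dg numerically, with made-up functions
# f(x, y) = x*y and g(s, t) = (s*t, s + t**2).
def f(x, y):
    return x * y

def g(s, t):
    return (s * t, s + t ** 2)

def comp(s, t):
    return f(*g(s, t))

h = 1e-6
s, t = 1.0, 2.0
x, y = g(s, t)

# Df = [df/dx, df/dy] at the point (x, y) = g(s, t).
Df = [(f(x + h, y) - f(x, y)) / h,
      (f(x, y + h) - f(x, y)) / h]

# Dg = [[dx/ds, dx/dt], [dy/ds, dy/dt]] at (s, t).
x0, y0 = g(s, t)
xs, ys = g(s + h, t)
xt, yt = g(s, t + h)
Dg = [[(xs - x0) / h, (xt - x0) / h],
      [(ys - y0) / h, (yt - y0) / h]]

# The 1x2 matrix product Df * Dg = [dz/ds, dz/dt].
DfDg = [Df[0] * Dg[0][0] + Df[1] * Dg[1][0],
        Df[0] * Dg[0][1] + Df[1] * Dg[1][1]]

# Direct finite differences of the composition, for comparison.
direct = [(comp(s + h, t) - comp(s, t)) / h,
          (comp(s, t + h) - comp(s, t)) / h]

print("Df*Dg :", DfDg)
print("direct:", direct)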
¹ There is a subtlety about uniform convergence vs. pointwise convergence here, but for the purposes of this course you can ignore it, and what is written here is good enough.