









Cheat sheet on Machine Learning: Supervised/Unsupervised Learning, Deep Learning, Machine Learning Tips and Tricks
- September 15, Afshine Amidi and Shervine Amidi
Given a set of data points $\{x^{(1)}, ..., x^{(m)}\}$ associated to a set of outcomes $\{y^{(1)}, ..., y^{(m)}\}$, we want to build a classifier that learns how to predict $y$ from $x$.
r Type of prediction – The different types of predictive models are summed up in the table below:
| | Regression | Classifier |
|---|---|---|
| Outcome | Continuous | Class |
| Examples | Linear regression | Logistic regression, SVM, Naive Bayes |
r Type of model – The different models are summed up in the table below:
| | Discriminative model | Generative model |
|---|---|---|
| Goal | Directly estimate $P(y \mid x)$ | Estimate $P(x \mid y)$ to deduce $P(y \mid x)$ |
| What's learned | Decision boundary | Probability distributions of the data |
| Examples | Regressions, SVMs | GDA, Naive Bayes |
r Hypothesis – The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x^{(i)}$, the model prediction output is $h_\theta(x^{(i)})$.
r Loss function – A loss function is a function $L : (z,y) \in \mathbb{R} \times Y \longmapsto L(z,y) \in \mathbb{R}$ that takes as inputs the predicted value $z$ corresponding to the real data value $y$ and outputs how different they are. The common loss functions are summed up in the table below:
| Loss | Formula | Used in |
|---|---|---|
| Least squared | $\frac{1}{2}(y - z)^2$ | Linear regression |
| Logistic | $\log(1 + \exp(-yz))$ | Logistic regression |
| Hinge | $\max(0, 1 - yz)$ | SVM |
| Cross-entropy | $-\left[y \log(z) + (1 - y) \log(1 - z)\right]$ | Neural Network |
r Cost function – The cost function $J$ is commonly used to assess the performance of a model, and is defined with the loss function $L$ as follows:
$$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$$
r Gradient descent – By noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:
θ ←− θ − α∇J(θ)
Remark: Stochastic gradient descent (SGD) updates the parameters after each training example, whereas batch gradient descent updates them using a batch of training examples.
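As a rough illustration of the update rule $\theta \leftarrow \theta - \alpha\nabla J(\theta)$, the sketch below runs batch gradient descent on a one-feature linear model with the least squared loss. The function name `gradient_descent`, the toy data, the learning rate and the number of steps are all assumptions chosen for the example, not part of the cheat sheet.

```python
def gradient_descent(xs, ys, alpha=0.05, n_steps=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x
    with the least squared loss L(z, y) = 0.5 * (y - z) ** 2."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_steps):
        # Gradient of J(theta) = sum_i 0.5 * (y_i - h_theta(x_i)) ** 2
        grad0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys))
        grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys))
        # Update rule: theta <- theta - alpha * grad J(theta)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated exactly by y = 1 + 2x (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the parameters should approach the generating values (intercept 1, slope 2); a learning rate that is too large would make the iterates diverge instead.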
r Likelihood – The likelihood of a model $L(\theta)$ given parameters $\theta$ is used to find the optimal parameters $\theta$ by maximizing the likelihood. In practice, we use the log-likelihood $\ell(\theta) = \log(L(\theta))$, which is easier to optimize. We have:
$$\theta^{opt} = \underset{\theta}{\arg\max}\ L(\theta)$$
r Newton's algorithm – Newton's algorithm is a numerical method that finds $\theta$ such that $\ell'(\theta) = 0$. Its update rule is as follows:
$$\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:
$$\theta \leftarrow \theta - \left(\nabla_\theta^2 \ell(\theta)\right)^{-1} \nabla_\theta \ell(\theta)$$
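To make the one-dimensional Newton update concrete, here is a small sketch that maximizes a Bernoulli log-likelihood $\ell(\theta) = k\log\theta + (n-k)\log(1-\theta)$. The choice of likelihood, the counts `k = 7`, `n = 10`, the starting point and the iteration count are assumptions made for illustration.

```python
def newton_maximize(d1, d2, theta, n_iter=20):
    """Apply theta <- theta - l'(theta) / l''(theta) a fixed number of times."""
    for _ in range(n_iter):
        theta = theta - d1(theta) / d2(theta)
    return theta


# Hypothetical data: k successes out of n Bernoulli trials.
k, n = 7, 10


def d1(theta):
    # First derivative of k*log(theta) + (n-k)*log(1-theta)
    return k / theta - (n - k) / (1 - theta)


def d2(theta):
    # Second derivative of the same log-likelihood
    return -k / theta ** 2 - (n - k) / (1 - theta) ** 2


theta_hat = newton_maximize(d1, d2, theta=0.3)
```

The iterate should settle at the sample proportion $k/n$, which is the known maximum-likelihood estimate for this model.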
We assume here that $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$.
r Normal equations – By noting $X$ the design matrix, the value of $\theta$ that minimizes the cost function has the closed-form solution:
$$\theta = (X^T X)^{-1} X^T y$$
r Kernel – Given a feature mapping $\phi$, we define the kernel $K$ as:
$$K(x,z) = \phi(x)^T \phi(z)$$
In practice, the kernel $K$ defined by $K(x,z) = \exp\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)$ is called the Gaussian kernel and is commonly used.
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don’t need to know the explicit mapping φ , which is often very complicated. Instead, only the values K(x,z) are needed.
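A minimal sketch of the Gaussian kernel follows. It evaluates $K(x,z)$ directly from the two input vectors without ever forming a feature mapping $\phi$; the function name and the default `sigma` are assumptions for the example.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) for equal-length sequences."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
k_far = gaussian_kernel([0.0, 0.0], [1.0, 1.0])    # squared distance 2
```

Identical points give the maximum value 1, and the value decays toward 0 as the points move apart, with `sigma` controlling how fast.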
r Lagrangian – We define the Lagrangian $\mathcal{L}(w,b)$ as follows:
$$\mathcal{L}(w,b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$
Remark: the coefficients $\beta_i$ are called the Lagrange multipliers.
A generative model first tries to learn how the data is generated by estimating P (x|y), which we can then use to estimate P (y|x) by using Bayes’ rule.
r Setting – The Gaussian Discriminant Analysis assumes that y and x|y = 0 and x|y = 1 are such that:
y ∼ Bernoulli(φ)
x|y = 0 ∼ N (μ 0 ,Σ) and x|y = 1 ∼ N (μ 1 ,Σ)
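r Estimation – The following table sums up the estimates that we find when maximizing the likelihood:
| $\hat{\phi}$ | $\hat{\mu}_j$ $(j = 0, 1)$ | $\hat{\Sigma}$ |
|---|---|---|
| $\frac{1}{m}\sum_{i=1}^{m} 1_{\{y^{(i)}=1\}}$ | $\frac{\sum_{i=1}^{m} 1_{\{y^{(i)}=j\}} x^{(i)}}{\sum_{i=1}^{m} 1_{\{y^{(i)}=j\}}}$ | $\frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$ |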
r Assumption – The Naive Bayes model supposes that the features of each data point are all independent given the class $y$:
$$P(x \mid y) = P(x_1, x_2, ... \mid y) = P(x_1 \mid y) P(x_2 \mid y) ... = \prod_{i=1}^{n} P(x_i \mid y)$$
r Solutions – Maximizing the log-likelihood gives the following solutions, with $k \in \{0, 1\}$, $l \in [[1,L]]$:
$$P(y = k) = \frac{1}{m} \times \#\{j \mid y^{(j)} = k\} \quad\textrm{and}\quad P(x_i = l \mid y = k) = \frac{\#\{j \mid y^{(j)} = k \textrm{ and } x_i^{(j)} = l\}}{\#\{j \mid y^{(j)} = k\}}$$
Remark: Naive Bayes is widely used for text classification and spam detection.
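The counting estimates for Naive Bayes can be sketched as below: the class prior is the fraction of examples with label $k$, and each conditional probability is a ratio of counts. The function name `fit_naive_bayes` and the tiny dataset are hypothetical, and no smoothing is applied, so unseen feature values get probability 0.

```python
from collections import Counter


def fit_naive_bayes(X, y):
    """Return (prior, likelihood) estimated by counting.

    prior[k]               = #{j : y_j = k} / m
    likelihood(i, l, k)    = #{j : y_j = k and x_i^(j) = l} / #{j : y_j = k}
    """
    m = len(y)
    class_counts = Counter(y)
    prior = {k: count / m for k, count in class_counts.items()}

    feature_counts = Counter()
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1

    def likelihood(i, value, k):
        return feature_counts[(i, value, k)] / class_counts[k]

    return prior, likelihood


# Hypothetical binary features and labels.
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = [1, 1, 0, 0]
prior, likelihood = fit_naive_bayes(X, y)
```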
These methods can be used for both regression and classification problems.
r CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.
r Random forest – It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is hard to interpret, but its generally good performance makes it a popular algorithm.
Remark: random forests are a type of ensemble method.
r Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are adaptive boosting and gradient boosting.
r k -nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.
Remark: The higher the parameter k , the higher the bias, and the lower the parameter k , the higher the variance.
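A small sketch of k-NN classification by majority vote is given below. The Euclidean distance, the function name `knn_predict` and the toy points are assumptions for the example; other distances or a mean over neighbors (for regression) would follow the same pattern.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` as the majority label among
    its k nearest training points (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical training set with two well separated groups.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
pred_near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
pred_far = knn_predict(train_X, train_y, (5.5, 5.5))
```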
r Union bound – Let $A_1, ..., A_k$ be $k$ events. We have:
$$P(A_1 \cup ... \cup A_k) \le P(A_1) + ... + P(A_k)$$
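r Hoeffding inequality – Let $Z_1, ..., Z_m$ be $m$ iid variables drawn from a Bernoulli distribution of parameter $\phi$. Let $\hat{\phi}$ be their sample mean and $\gamma > 0$ fixed. We have:
$$P(|\phi - \hat{\phi}| > \gamma) \le 2\exp(-2\gamma^2 m)$$
Remark: this inequality is also known as the Chernoff bound.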
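r Training error – For a given classifier $h$, we define the training error $\hat{\epsilon}(h)$, also known as the empirical risk or empirical error, to be as follows:
$$\hat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{\{h(x^{(i)}) \neq y^{(i)}\}}$$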
r Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
r Shattering – Given a set $S = \{x^{(1)}, ..., x^{(d)}\}$, and a set of classifiers $H$, we say that $H$ shatters $S$ if for any set of labels $\{y^{(1)}, ..., y^{(d)}\}$, we have:
$$\exists h \in H, \quad \forall i \in [[1,d]], \quad h(x^{(i)}) = y^{(i)}$$
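r Upper bound theorem – Let $H$ be a finite hypothesis class such that $|H| = k$ and let $\delta$ and the sample size $m$ be fixed. Then, with probability of at least $1 - \delta$, we have:
$$\epsilon(\hat{h}) \le \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$$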
r VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.
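r Theorem (Vapnik) – Let $H$ be given, with $\textrm{VC}(H) = d$ and $m$ the number of training examples. With probability at least $1 - \delta$, we have:
$$\epsilon(\hat{h}) \le \left(\min_{h \in H} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$$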
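r Calinski-Harabaz index – By noting $k$ the number of clusters, $B_k$ and $W_k$ the between- and within-cluster dispersion matrices are respectively defined as
$$B_k = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T, \quad W_k = \sum_{i=1}^{m} (x^{(i)} - \mu_{c^{(i)}})(x^{(i)} - \mu_{c^{(i)}})^T$$
where $n_j$ and $\mu_j$ are the size and centroid of cluster $j$, $\mu$ is the overall mean, and $c^{(i)}$ is the cluster assigned to $x^{(i)}$. The Calinski-Harabaz index $s(k)$ indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. By noting $N$ the number of data points, it is defined as follows:
$$s(k) = \frac{\textrm{Tr}(B_k)}{\textrm{Tr}(W_k)} \times \frac{N - k}{k - 1}$$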
It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.
r Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n \times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z \in \mathbb{R}^n \setminus \{0\}$, called eigenvector, such that we have:
$$Az = \lambda z$$
r Spectral theorem – Let $A \in \mathbb{R}^{n \times n}$. If $A$ is symmetric, then $A$ is diagonalizable by a real orthogonal matrix $U \in \mathbb{R}^{n \times n}$. By noting $\Lambda = \textrm{diag}(\lambda_1, ..., \lambda_n)$, we have:
$$\exists \Lambda \textrm{ diagonal}, \quad A = U \Lambda U^T$$
Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix $A$.
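r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on $k$ dimensions by maximizing the variance of the data as follows:
- Normalize the data to have a mean of 0 and a standard deviation of 1:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j} \quad\textrm{where}\quad \mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} \quad\textrm{and}\quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2$$
- Compute $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T} \in \mathbb{R}^{n \times n}$, which is symmetric with real eigenvalues.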
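The sketch below walks through PCA for a single component ($k = 1$): it normalizes the data, builds the matrix $\Sigma$, finds its principal eigenvector and projects the data onto it. Using power iteration to get the eigenvector is an assumption made to keep the example dependency-free; a library eigendecomposition would normally be used. The function name and toy data are hypothetical, and the code assumes no feature is constant (non-zero standard deviation).

```python
import math


def pca_first_component(data, n_iter=200):
    """Return (u, scores): the principal eigenvector of Sigma and the
    projection of each normalized data point onto it."""
    m, n = len(data), len(data[0])
    # Normalize each feature to mean 0 and standard deviation 1.
    mu = [sum(row[j] for row in data) / m for j in range(n)]
    sigma = [
        math.sqrt(sum((row[j] - mu[j]) ** 2 for row in data) / m)
        for j in range(n)
    ]
    z = [[(row[j] - mu[j]) / sigma[j] for j in range(n)] for row in data]
    # Sigma = (1/m) * sum_i x^(i) x^(i)^T on the normalized data.
    cov = [[sum(r[a] * r[b] for r in z) / m for b in range(n)] for a in range(n)]
    # Power iteration (assumed method) for the principal eigenvector.
    u = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        v = [sum(cov[a][b] * u[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    # Project the normalized data on the direction u.
    scores = [sum(r[j] * u[j] for j in range(n)) for r in z]
    return u, scores


# Hypothetical 2-D data lying exactly on a line.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
u, scores = pca_first_component(data)
```

Because the two features are perfectly correlated here, the principal direction weights them equally and the projected scores carry all of the variance of the normalized data.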
It is a technique meant to find the underlying generating sources.
r Assumptions – We assume that our data $x$ has been generated by the $n$-dimensional source vector $s = (s_1, ..., s_n)$, where $s_i$ are independent random variables, via a mixing and non-singular matrix $A$ as follows:
$$x = As$$
The goal is to find the unmixing matrix $W = A^{-1}$ by an update rule.
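r Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix $W$ by following the steps below:
- Write the probability of $x = As = W^{-1}s$ as:
$$p(x) = \prod_{i=1}^{n} p_s(w_i^T x) \cdot |W|$$
- Write the log likelihood given the training data $\{x^{(i)}, i \in [[1,m]]\}$, noting $g$ the sigmoid function, as:
$$l(W) = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} \log\left(g'(w_j^T x^{(i)})\right) + \log|W|\right)$$
Therefore, the stochastic gradient ascent learning rule is such that for each training example $x^{(i)}$, we update $W$ as follows:
$$W \longleftarrow W + \alpha\left(\begin{pmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{pmatrix} x^{(i)T} + (W^T)^{-1}\right)$$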
Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.
r Architecture – The vocabulary around neural networks architectures is described in the figure below:
By noting $i$ the $i^{th}$ layer of the network and $j$ the $j^{th}$ hidden unit of the layer, we have:
$$z_j^{[i]} = w_j^{[i]T} x + b_j^{[i]}$$
where we note $w$, $b$, $z$ the weight, bias and output respectively.
r Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:
| Sigmoid | Tanh | ReLU | Leaky ReLU |
|---|---|---|---|
| $g(z) = \frac{1}{1 + e^{-z}}$ | $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $g(z) = \max(0, z)$ | $g(z) = \max(\epsilon z, z)$ with $\epsilon \ll 1$ |
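The four activation functions can be written directly from their definitions, as in the sketch below. The leaky ReLU slope `eps=0.01` is an assumed default for a small constant much less than 1; the source does not fix a value.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z):
    return max(0.0, z)


def leaky_relu(z, eps=0.01):
    # eps is a small positive slope applied to negative inputs.
    return max(eps * z, z)


values = {name: f(-2.0) for name, f in
          [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("leaky_relu", leaky_relu)]}
```

For a negative input such as -2, sigmoid and tanh saturate smoothly, ReLU outputs exactly 0, and leaky ReLU keeps a small negative value.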
r Cross-entropy loss – In the context of neural networks, the cross-entropy loss $L(z,y)$ is commonly used and is defined as follows:
$$L(z,y) = -\left[y \log(z) + (1 - y) \log(1 - z)\right]$$
r Learning rate – The learning rate, often noted η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.
r Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight $w$ is computed using the chain rule and is of the following form:
$$\frac{\partial L(z,y)}{\partial w} = \frac{\partial L(z,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$$
As a result, the weight is updated as follows:
$$w \longleftarrow w - \eta \frac{\partial L(z,y)}{\partial w}$$
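To see the chain rule factors one by one, the sketch below uses a single neuron with pre-activation $z = wx + b$, sigmoid activation $a = g(z)$ and the cross-entropy loss on $a$. This one-neuron setup, the variable names and the numeric values are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Cross-entropy loss of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def grad_w(w, b, x, y):
    """dL/dw as the product dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = -(y / a) + (1 - y) / (1 - a)
    da_dz = a * (1 - a)          # derivative of the sigmoid
    dz_dw = x                    # z = w * x + b
    return dL_da * da_dz * dz_dw


# Hypothetical values for one update step with learning rate eta.
w, b, x, y, eta = 0.5, -0.2, 1.5, 1.0, 0.1
w_new = w - eta * grad_w(w, b, x, y)
```

A finite-difference check of `grad_w` against `loss` is a cheap way to confirm that each factor of the chain rule was derived correctly.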
r Updating weights – In a neural network, weights are updated as follows:
r Dropout – Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability $p$ or kept with probability $1 - p$.
r Convolutional layer requirement – By noting $W$ the input volume size, $F$ the size of the convolutional layer neurons, $P$ the amount of zero padding and $S$ the stride, the number of neurons $N$ that fit in a given volume is such that:
$$N = \frac{W - F + 2P}{S} + 1$$
r Batch normalization – It is a step with hyperparameters $\gamma, \beta$ that normalizes the batch $\{x_i\}$. By noting $\mu_B, \sigma_B^2$ the mean and variance of the batch that we want to correct, it is done as follows:
$$x_i \longleftarrow \gamma \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.
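A minimal sketch of the batch normalization step on a batch of scalars is shown below. The function name, the default values of `gamma`, `beta` and the small constant `eps` added for numerical stability are assumptions for the example.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean and unit variance,
    then rescale by gamma and shift by beta."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


# Hypothetical batch; after the step the mean is beta and the
# standard deviation is approximately gamma.
out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
```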
r Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:
| Input gate | Forget gate | Output gate | Gate |
|---|---|---|---|
| Write to cell or not? | Erase a cell or not? | Reveal a cell or not? | How much writing? |
r LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding ’forget’ gates.
Given a set of data points $\{x^{(1)}, ..., x^{(m)}\}$, where each $x^{(i)}$ has $n$ features, associated to a set of outcomes $\{y^{(1)}, ..., y^{(m)}\}$, we want to assess a given classifier that learns how to predict $y$ from $x$.
In a context of a binary classification, here are the main metrics that are important to track to assess the performance of the model.
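r Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:
| | Predicted class + | Predicted class − |
|---|---|---|
| Actual class + | TP (true positives) | FN (false negatives) |
| Actual class − | FP (false positives) | TN (true negatives) |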
r Main metrics – The following metrics are commonly used to assess the performance of classification models:
| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall performance of model |
| Precision | $\frac{TP}{TP + FP}$ | How accurate the positive predictions are |
| Recall, Sensitivity | $\frac{TP}{TP + FN}$ | Coverage of actual positive sample |
| Specificity | $\frac{TN}{TN + FP}$ | Coverage of actual negative sample |
| F1 score | $\frac{2TP}{2TP + FP + FN}$ | Hybrid metric useful for unbalanced classes |
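The classification metrics can be computed from the four confusion matrix counts, as in this sketch. It assumes the standard definitions (for example, precision as TP / (TP + FP)); the function name and the example counts are hypothetical, and the code does not guard against zero denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the main binary classification metrics from
    true/false positive and negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # also called sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, accuracy is 0.7 while recall is only about 0.67, which illustrates why a single number rarely gives the full picture.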
r ROC – The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:
| Metric | Formula | Equivalent |
|---|---|---|
| True Positive Rate (TPR) | $\frac{TP}{TP + FN}$ | Recall, sensitivity |
| False Positive Rate (FPR) | $\frac{FP}{TN + FP}$ | 1-specificity |
r AUC – The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:
r Basic metrics – Given a regression model $f$, the following metrics are commonly used to assess the performance of the model:
| Total sum of squares | Explained sum of squares | Residual sum of squares |
|---|---|---|
| $SS_{tot} = \sum_{i=1}^{m} (y_i - \overline{y})^2$ | $SS_{reg} = \sum_{i=1}^{m} (f(x_i) - \overline{y})^2$ | $SS_{res} = \sum_{i=1}^{m} (y_i - f(x_i))^2$ |
r Coefficient of determination – The coefficient of determination, often noted $R^2$ or $r^2$, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
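As a quick worked example of the coefficient of determination, the sketch below computes the total and residual sums of squares and combines them. The function name and the sample values are hypothetical.

```python
def r_squared(ys, preds):
    """R^2 = 1 - SS_res / SS_tot for observed values ys and predictions preds."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1 - ss_res / ss_tot


ys = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(ys, preds)  # SS_tot = 5, SS_res = 0.10
```

A model that always predicts the mean of the outcomes scores 0, and a perfect model scores 1.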
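r Main metrics – The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables $n$ that they take into consideration:
| Mallow's Cp | AIC | BIC | Adjusted $R^2$ |
|---|---|---|---|
| $\frac{SS_{res} + 2(n+1)\hat{\sigma}^2}{m}$ | $2\left[(n + 2) - \log(L)\right]$ | $\log(m)(n + 2) - 2\log(L)$ | $1 - \frac{(1 - R^2)(m - 1)}{m - n - 1}$ |
where $L$ is the likelihood and $\hat{\sigma}^2$ is an estimate of the variance associated with each response.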
r Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have: the training set, the validation set and the testing set.
Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:
r Cross-validation – Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The main types are $k$-fold and leave-$p$-out.
The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k − 1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
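The k-fold procedure can be sketched as follows: split the indices into $k$ folds, train on $k - 1$ of them, validate on the remaining one, and average the $k$ errors. The helper names, the contiguous (unshuffled) split and the `fit`/`error` callbacks are assumptions for illustration.

```python
def k_fold_indices(m, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(m)."""
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(m) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def cross_validation_error(m, k, fit, error):
    """Average validation error over the k folds.

    fit(train_idx) returns a model; error(model, val_idx) returns a number.
    """
    errors = [error(fit(train_idx), val_idx)
              for train_idx, val_idx in k_fold_indices(m, k)]
    return sum(errors) / k


# Hypothetical use: a "model" that predicts the mean of the training outcomes.
ys = [2.0] * 10
fit = lambda idx: sum(ys[i] for i in idx) / len(idx)
error = lambda model, idx: sum((ys[i] - model) ** 2 for i in idx) / len(idx)
cv_error = cross_validation_error(len(ys), 3, fit, error)
```

In practice the indices are usually shuffled before splitting so that the folds are not tied to the original ordering of the data.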
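r Regularization – The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:
| | LASSO | Ridge | Elastic Net |
|---|---|---|---|
| Penalty | $... + \lambda\lVert\theta\rVert_1$ | $... + \lambda\lVert\theta\rVert_2^2$ | $... + \lambda\left[(1 - \alpha)\lVert\theta\rVert_1 + \alpha\lVert\theta\rVert_2^2\right]$ |
| Parameters | $\lambda \in \mathbb{R}$ | $\lambda \in \mathbb{R}$ | $\lambda \in \mathbb{R}, \alpha \in [0,1]$ |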
r Model selection – Train each candidate model on the training set, evaluate it on the development set, pick the model that performs best on the development set, and then retrain that model on the whole training set.
r Bias – The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points.
r Variance – The variance of a model is the variability of the model prediction for given data points.
r Bias/variance tradeoff – The simpler the model, the higher the bias, and the more complex the model, the higher the variance.
[Figure: underfitting, just right and overfitting, illustrated for regression]
r Extended form of Bayes' rule – Let $\{A_i, i \in [[1,n]]\}$ be a partition of the sample space. We have:
$$P(A_k \mid B) = \frac{P(B \mid A_k) P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i) P(A_i)}$$
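A short numerical sketch of the extended Bayes' rule: given priors $P(A_i)$ over a partition and likelihoods $P(B \mid A_i)$, it returns every posterior $P(A_k \mid B)$. The function name and the numbers (a rare condition with an imperfect test) are hypothetical.

```python
def posterior(priors, likelihoods):
    """Return [P(A_k | B)] from priors [P(A_i)] and likelihoods [P(B | A_i)]."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # denominator: sum_i P(B | A_i) P(A_i)
    return [j / evidence for j in joint]


# Hypothetical partition: A_1 = "has condition" (1%), A_2 = "does not" (99%).
# B = "test is positive", with P(B | A_1) = 0.95 and P(B | A_2) = 0.05.
post = posterior(priors=[0.01, 0.99], likelihoods=[0.95, 0.05])
```

Even with a fairly accurate test, the posterior probability of the rare event stays well below one half here, because the prior is so small.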
r Independence – Two events A and B are independent if and only if we have:
P (A ∩ B) = P (A)P (B)
r Random variable – A random variable, often noted $X$, is a function that maps every element in a sample space to the real line.
r Cumulative distribution function (CDF) – The cumulative distribution function $F$, which is monotonically non-decreasing and is such that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$, is defined as:
$$F(x) = P(X \le x)$$
Remark: we have $P(a < X \le b) = F(b) - F(a)$.
r Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.
r Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases.
| Case | CDF $F$ | PDF $f$ | Properties of PDF |
|---|---|---|---|
| (D) | $F(x) = \sum_{x_i \le x} P(X = x_i)$ | $f(x_j) = P(X = x_j)$ | $0 \le f(x_j) \le 1$ and $\sum_j f(x_j) = 1$ |
| (C) | $F(x) = \int_{-\infty}^{x} f(y)\,dy$ | $f(x) = \frac{dF}{dx}$ | $f(x) \ge 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$ |
r Variance – The variance of a random variable, often noted Var(X) or σ^2 , is a measure of the spread of its distribution function. It is determined as follows:
Var(X) = E[(X − E[X])^2 ] = E[X^2 ] − E[X]^2
r Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:
$$\sigma = \sqrt{\textrm{Var}(X)}$$
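r Expectation and Moments of the Distribution – Here are the expressions of the expected value $E[X]$, generalized expected value $E[g(X)]$, $k^{th}$ moment $E[X^k]$ and characteristic function $\psi(\omega)$ for the discrete and continuous cases:
| Case | $E[X]$ | $E[g(X)]$ | $E[X^k]$ | $\psi(\omega)$ |
|---|---|---|---|---|
| (D) | $\sum_{i=1}^{n} x_i f(x_i)$ | $\sum_{i=1}^{n} g(x_i) f(x_i)$ | $\sum_{i=1}^{n} x_i^k f(x_i)$ | $\sum_{i=1}^{n} f(x_i) e^{i\omega x_i}$ |
| (C) | $\int_{-\infty}^{+\infty} x f(x)\,dx$ | $\int_{-\infty}^{+\infty} g(x) f(x)\,dx$ | $\int_{-\infty}^{+\infty} x^k f(x)\,dx$ | $\int_{-\infty}^{+\infty} f(x) e^{i\omega x}\,dx$ |
Remark: we have $e^{i\omega x} = \cos(\omega x) + i\sin(\omega x)$.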
r Revisiting the $k^{th}$ moment – The $k^{th}$ moment can also be computed with the characteristic function as follows:
$$E[X^k] = \frac{1}{i^k}\left[\frac{\partial^k \psi}{\partial \omega^k}\right]_{\omega = 0}$$
r Transformation of random variables – Let the variables $X$ and $Y$ be linked by some function. By noting $f_X$ and $f_Y$ the distribution function of $X$ and $Y$ respectively, we have:
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$$
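r Leibniz integral rule – Let $g$ be a function of $x$ and potentially $c$, and $a$, $b$ boundaries that may depend on $c$. We have:
$$\frac{\partial}{\partial c}\left(\int_a^b g(x)\,dx\right) = \frac{\partial b}{\partial c} \cdot g(b) - \frac{\partial a}{\partial c} \cdot g(a) + \int_a^b \frac{\partial g}{\partial c}(x)\,dx$$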
r Chebyshev's inequality – Let $X$ be a random variable with expected value $\mu$ and standard deviation $\sigma$. For $k, \sigma > 0$, we have the following inequality:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
r Conditional density – The conditional density of $X$ with respect to $Y$, often noted $f_{X|Y}$, is defined as follows:
$$f_{X|Y}(x) = \frac{f_{XY}(x,y)}{f_Y(y)}$$
r Independence – Two random variables X and Y are said to be independent if we have:
f XY (x,y) = f X (x)f Y (y)
r Marginal density and cumulative distribution – From the joint density probability function $f_{XY}$, we have:
| Case | Marginal density | Cumulative function |
|---|---|---|
| (D) | $f_X(x_i) = \sum_j f_{XY}(x_i, y_j)$ | $F_{XY}(x,y) = \sum_{x_i \le x}\sum_{y_j \le y} f_{XY}(x_i, y_j)$ |
| (C) | $f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x,y)\,dy$ | $F_{XY}(x,y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{XY}(x',y')\,dx'dy'$ |
r Distribution of a sum of independent random variables – Let $Y = X_1 + ... + X_n$ with $X_1, ..., X_n$ independent. We have:
$$\psi_Y(\omega) = \prod_{k=1}^{n} \psi_{X_k}(\omega)$$
r Covariance – We define the covariance of two random variables $X$ and $Y$, that we note $\sigma_{XY}^2$ or more commonly $\textrm{Cov}(X,Y)$, as follows:
$$\textrm{Cov}(X,Y) \triangleq \sigma_{XY}^2 = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y$$
r Correlation – By noting $\sigma_X$, $\sigma_Y$ the standard deviations of $X$ and $Y$, we define the correlation between the random variables $X$ and $Y$, noted $\rho_{XY}$, as follows:
$$\rho_{XY} = \frac{\sigma_{XY}^2}{\sigma_X \sigma_Y}$$
Remarks: For any $X$, $Y$, we have $\rho_{XY} \in [-1, 1]$. If $X$ and $Y$ are independent, then $\rho_{XY} = 0$.
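r Main distributions – Here are the main distributions to have in mind:
| Type | Distribution | PDF | $\psi(\omega)$ | $E[X]$ | $\textrm{Var}(X)$ |
|---|---|---|---|---|---|
| (D) | Binomial, $X \sim \mathcal{B}(n, p)$ | $P(X = x) = \binom{n}{x} p^x q^{n-x}$, $x \in [[0,n]]$ | $(pe^{i\omega} + q)^n$ | $np$ | $npq$ |
| (D) | Poisson, $X \sim \textrm{Po}(\mu)$ | $P(X = x) = \frac{\mu^x}{x!} e^{-\mu}$, $x \in \mathbb{N}$ | $e^{\mu(e^{i\omega} - 1)}$ | $\mu$ | $\mu$ |
| (C) | Uniform, $X \sim \mathcal{U}(a, b)$ | $f(x) = \frac{1}{b - a}$, $x \in [a,b]$ | $\frac{e^{i\omega b} - e^{i\omega a}}{(b - a)i\omega}$ | $\frac{a + b}{2}$ | $\frac{(b - a)^2}{12}$ |
| (C) | Gaussian, $X \sim \mathcal{N}(\mu, \sigma)$ | $f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$, $x \in \mathbb{R}$ | $e^{i\omega\mu - \frac{1}{2}\omega^2\sigma^2}$ | $\mu$ | $\sigma^2$ |
| (C) | Exponential, $X \sim \textrm{Exp}(\lambda)$ | $f(x) = \lambda e^{-\lambda x}$, $x \in \mathbb{R}_+$ | $\frac{1}{1 - \frac{i\omega}{\lambda}}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |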
r Random sample – A random sample is a collection of n random variables X 1 , ..., X n that are independent and identically distributed with X.
r Estimator – An estimator θˆ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.
r Bias – The bias of an estimator $\hat{\theta}$ is defined as being the difference between the expected value of the distribution of $\hat{\theta}$ and the true value, i.e.:
$$\textrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
Remark: an estimator is said to be unbiased when we have $E[\hat{\theta}] = \theta$.
r Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean $\mu$ and the true variance $\sigma^2$ of a distribution, are noted $\overline{X}$ and $s^2$ respectively, and are such that:
$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\textrm{and}\quad s^2 = \hat{\sigma}^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \overline{X})^2$$
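The two estimators translate directly into code, as in this sketch. Note the division by $n - 1$ rather than $n$ in the sample variance. The function names and the sample are hypothetical.

```python
def sample_mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with the n - 1 denominator."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
xbar = sample_mean(xs)       # 40 / 8
s2 = sample_variance(xs)     # 32 / 7
```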
r Central Limit Theorem – Let us have a random sample $X_1, ..., X_n$ following a given distribution with mean $\mu$ and variance $\sigma^2$, then we have:
$$\overline{X} \underset{n \to +\infty}{\sim} \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
r Vector – We note $x \in \mathbb{R}^n$ a vector with $n$ entries, where $x_i \in \mathbb{R}$ is the $i^{th}$ entry:
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n$$
r Matrix – We note $A \in \mathbb{R}^{m \times n}$ a matrix with $m$ rows and $n$ columns, where $A_{i,j} \in \mathbb{R}$ is the entry located in the $i^{th}$ row and $j^{th}$ column:
$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$
Remark: the vector $x$ defined above can be viewed as an $n \times 1$ matrix and is more particularly called a column-vector.
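r Identity matrix – The identity matrix $I \in \mathbb{R}^{n \times n}$ is a square matrix with ones in its diagonal and zero everywhere else:
$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}$$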
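| Norm | Notation | Definition | Use case |
|---|---|---|---|
| Manhattan, $L^1$ | $\lVert x \rVert_1$ | $\sum_{i=1}^{n} \lvert x_i \rvert$ | LASSO regularization |
| Euclidean, $L^2$ | $\lVert x \rVert_2$ | $\sqrt{\sum_{i=1}^{n} x_i^2}$ | Ridge regularization |
| $p$-norm, $L^p$ | $\lVert x \rVert_p$ | $\left(\sum_{i=1}^{n} \lvert x_i \rvert^p\right)^{\frac{1}{p}}$ | Hölder inequality |
| Infinity, $L^\infty$ | $\lVert x \rVert_\infty$ | $\max_i \lvert x_i \rvert$ | Uniform convergence |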
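The norms in the table can be computed with two short functions, sketched below: a generic $p$-norm (which covers the Manhattan and Euclidean cases for $p = 1$ and $p = 2$) and the infinity norm. The function names and the example vector are assumptions; absolute values are taken so that negative entries are handled.

```python
def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p); p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)


def inf_norm(x):
    """max_i |x_i|"""
    return max(abs(xi) for xi in x)


x = [3.0, -4.0]
l1, l2, linf = p_norm(x, 1), p_norm(x, 2), inf_norm(x)
```

For this vector the three values are ordered as $L^1 \ge L^2 \ge L^\infty$, which holds for any vector.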
r Linear dependence – A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.
Remark: if no vector can be written this way, then the vectors are said to be linearly independent.
r Matrix rank – The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.
r Positive semi-definite matrix – A matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite (PSD) and is noted $A \succeq 0$ if we have:
$$A = A^T \quad\textrm{and}\quad \forall x \in \mathbb{R}^n, \quad x^T A x \ge 0$$
Remark: similarly, a matrix $A$ is said to be positive definite, and is noted $A \succ 0$, if it is a PSD matrix which satisfies $x^T A x > 0$ for every non-zero vector $x$.
r Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n \times n}$, $\lambda$ is said to be an eigenvalue of $A$ if there exists a vector $z \in \mathbb{R}^n \setminus \{0\}$, called eigenvector, such that we have:
$$Az = \lambda z$$
r Spectral theorem – Let $A \in \mathbb{R}^{n \times n}$. If $A$ is symmetric, then $A$ is diagonalizable by a real orthogonal matrix $U \in \mathbb{R}^{n \times n}$. By noting $\Lambda = \textrm{diag}(\lambda_1, ..., \lambda_n)$, we have:
$$\exists \Lambda \textrm{ diagonal}, \quad A = U \Lambda U^T$$
r Singular-value decomposition – For a given matrix $A$ of dimensions $m \times n$, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of $U$ $m \times m$ unitary, $\Sigma$ $m \times n$ diagonal and $V$ $n \times n$ unitary matrices, such that:
$$A = U \Sigma V^T$$
r Gradient – Let $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ be a function and $A \in \mathbb{R}^{m \times n}$ be a matrix. The gradient of $f$ with respect to $A$ is an $m \times n$ matrix, noted $\nabla_A f(A)$, such that:
$$\left(\nabla_A f(A)\right)_{i,j} = \frac{\partial f(A)}{\partial A_{i,j}}$$
Remark: the gradient of $f$ is only defined when $f$ is a function that returns a scalar.
r Hessian – Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function and $x \in \mathbb{R}^n$ be a vector. The hessian of $f$ with respect to $x$ is an $n \times n$ symmetric matrix, noted $\nabla_x^2 f(x)$, such that:
$$\left(\nabla_x^2 f(x)\right)_{i,j} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$$
Remark: the hessian of $f$ is only defined when $f$ is a function that returns a scalar.
r Gradient operations – For matrices $A$, $B$, $C$, the following gradient properties are worth having in mind:
$$\nabla_A \textrm{tr}(AB) = B^T \qquad \nabla_{A^T} f(A) = \left(\nabla_A f(A)\right)^T$$
$$\nabla_A \textrm{tr}(ABA^TC) = CAB + C^TAB^T \qquad \nabla_A |A| = |A|(A^{-1})^T$$