CS 229 Machine Learning https://stanford.edu/~shervine
Super VIP Cheatsheet: Machine Learning
Afshine Amidi and Shervine Amidi
September 15, 2018
Contents

1 Supervised Learning
  1.1 Introduction to Supervised Learning
  1.2 Notations and general concepts
  1.3 Linear models
    1.3.1 Linear regression
    1.3.2 Classification and logistic regression
    1.3.3 Generalized Linear Models
  1.4 Support Vector Machines
  1.5 Generative Learning
    1.5.1 Gaussian Discriminant Analysis
    1.5.2 Naive Bayes
  1.6 Tree-based and ensemble methods
  1.7 Other non-parametric approaches
  1.8 Learning Theory
2 Unsupervised Learning
  2.1 Introduction to Unsupervised Learning
  2.2 Clustering
    2.2.1 Expectation-Maximization
    2.2.2 k-means clustering
    2.2.3 Hierarchical clustering
    2.2.4 Clustering assessment metrics
  2.3 Dimension reduction
    2.3.1 Principal component analysis
    2.3.2 Independent component analysis
3 Deep Learning
  3.1 Neural Networks
  3.2 Convolutional Neural Networks
  3.3 Recurrent Neural Networks
  3.4 Reinforcement Learning and Control
4 Machine Learning Tips and Tricks
  4.1 Metrics
    4.1.1 Classification
    4.1.2 Regression
  4.2 Model selection
  4.3 Diagnostics
5 Refreshers
  5.1 Probabilities and Statistics
    5.1.1 Introduction to Probability and Combinatorics
    5.1.2 Conditional Probability
    5.1.3 Random Variables
    5.1.4 Jointly Distributed Random Variables
    5.1.5 Parameter estimation
  5.2 Linear Algebra and Calculus
    5.2.1 General notations
    5.2.2 Matrix operations
    5.2.3 Matrix properties
    5.2.4 Matrix calculus

1 Supervised Learning

1.1 Introduction to Supervised Learning

Given a set of data points {x^(1), ..., x^(m)} associated with a set of outcomes {y^(1), ..., y^(m)}, we want to build a classifier that learns how to predict y from x.

❒ Type of prediction – The different types of predictive models are summed up in the table below:

              Regression           Classifier
  Outcome     Continuous           Class
  Examples    Linear regression    Logistic regression, SVM, Naive Bayes

❒ Type of model – The different models are summed up in the table below:

                   Discriminative model         Generative model
  Goal             Directly estimate P(y|x)     Estimate P(x|y) to deduce P(y|x)
  What's learned   Decision boundary            Probability distributions of the data
  Examples         Regressions, SVMs            GDA, Naive Bayes

1.2 Notations and general concepts

❒ Hypothesis – The hypothesis is noted h_θ and is the model that we choose. For a given input data x^(i), the model prediction output is h_θ(x^(i)).

❒ Loss function – A loss function is a function L : (z,y) ∈ ℝ × Y ↦ L(z,y) ∈ ℝ that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

  Least squared        Logistic              Hinge            Cross-entropy
  (1/2)(y − z)²        log(1 + exp(−yz))     max(0, 1 − yz)   −[y log(z) + (1 − y) log(1 − z)]
  Linear regression    Logistic regression   SVM              Neural Network

❒ Cost function – The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

  J(θ) = Σ_{i=1}^{m} L(h_θ(x^(i)), y^(i))

❒ Gradient descent – By noting α ∈ ℝ the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

  θ ← θ − α∇J(θ)

Remark: stochastic gradient descent (SGD) updates the parameters based on each training example, while batch gradient descent updates them based on a batch of training examples.
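To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for a least-squares cost; the function name, learning rate and synthetic data are illustrative assumptions, not part of the original cheatsheet:

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, n_iters=500):
        """Minimize J(theta) = (1/2m) ||X theta - y||^2 by batch gradient descent."""
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(n_iters):
            grad = X.T @ (X @ theta - y) / m   # gradient of J at theta
            theta -= alpha * grad              # update rule: theta <- theta - alpha * grad(J)
        return theta

    # Illustrative usage on synthetic data
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(100), rng.normal(size=(100, 1))]   # design matrix with intercept
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)
    print(gradient_descent(X, y))   # should approach [2, -3]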

❒ Likelihood – The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. In practice, we use the log-likelihood ℓ(θ) = log(L(θ)) which is easier to optimize. We have:

  θ_opt = argmax_θ L(θ)

❒ Newton's algorithm – Newton's algorithm is a numerical method that finds θ such that ℓ′(θ) = 0. Its update rule is as follows:

  θ ← θ − ℓ′(θ)/ℓ″(θ)

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

  θ ← θ − (∇²_θ ℓ(θ))^(−1) ∇_θ ℓ(θ)

1.3 Linear models

1.3.1 Linear regression

We assume here that y|x; θ ∼ N(μ, σ²).

❒ Normal equations – By noting X the design matrix, the value of θ that minimizes the cost function is given by the closed-form solution:

  θ = (XᵀX)⁻¹ Xᵀ y
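As a quick illustration (not from the cheatsheet), the normal equations can be evaluated directly with NumPy; the linear system XᵀXθ = Xᵀy is solved instead of forming an explicit inverse, for numerical stability:

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # design matrix with intercept
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=50)

    # Normal equations: theta = (X^T X)^(-1) X^T y
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta)   # close to [1, 2, -1]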

❒ Kernel – Given a feature mapping φ, we define the kernel K as follows:

  K(x,z) = φ(x)ᵀ φ(z)

In practice, the kernel K defined by K(x,z) = exp(−||x − z||² / (2σ²)) is called the Gaussian kernel and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping φ, which is often very complicated. Instead, only the values K(x,z) are needed.
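For illustration (not part of the original), here is a minimal Python sketch of the Gaussian kernel defined above; the function name and test vectors are arbitrary:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); no explicit feature map needed."""
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    print(gaussian_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))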

❒ Lagrangian – We define the Lagrangian L(w,b) as follows:

  L(w,b) = f(w) + Σ_{i=1}^{l} β_i h_i(w)

Remark: the coefficients β_i are called the Lagrange multipliers.

1.5 Generative Learning

A generative model first tries to learn how the data is generated by estimating P (x|y), which we can then use to estimate P (y|x) by using Bayes’ rule.

1.5.1 Gaussian Discriminant Analysis

❒ Setting – Gaussian Discriminant Analysis assumes that y, x|y = 0 and x|y = 1 are such that:

  y ∼ Bernoulli(φ)

  x|y = 0 ∼ N(μ₀, Σ)   and   x|y = 1 ∼ N(μ₁, Σ)

❒ Estimation – The following table sums up the estimates that we find when maximizing the likelihood:

  φ̂ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}

  μ̂_j = ( Σ_{i=1}^{m} 1{y^(i) = j} x^(i) ) / ( Σ_{i=1}^{m} 1{y^(i) = j} )   (j = 0,1)

  Σ̂ = (1/m) Σ_{i=1}^{m} (x^(i) − μ_{y^(i)})(x^(i) − μ_{y^(i)})ᵀ

1.5.2 Naive Bayes

❒ Assumption – The Naive Bayes model supposes that the features of each data point are all independent:

  P(x|y) = P(x₁, x₂, ...|y) = P(x₁|y)P(x₂|y)... = Π_{i=1}^{n} P(x_i|y)

❒ Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0,1}, l ∈ [[1,L]]:

  P(y = k) = (1/m) × #{j | y^(j) = k}

  P(x_i = l | y = k) = #{j | y^(j) = k and x_i^(j) = l} / #{j | y^(j) = k}

Remark: Naive Bayes is widely used for text classification and spam detection.
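As a hedged illustration of these counting formulas (not part of the original), a tiny discrete Naive Bayes fit by raw counts; the function and toy data are assumptions for the example:

    import numpy as np

    def fit_naive_bayes(X, y):
        """Estimate P(y=k) and P(x_i=l | y=k) by counting, per the formulas above.
        X: (m, n) integer feature matrix, y: (m,) labels in {0, 1}."""
        priors, likelihoods = {}, {}
        m = len(y)
        for k in (0, 1):
            mask = (y == k)
            priors[k] = mask.sum() / m                       # P(y = k)
            for i in range(X.shape[1]):
                for l in np.unique(X[:, i]):                 # P(x_i = l | y = k)
                    likelihoods[(i, l, k)] = (X[mask, i] == l).sum() / mask.sum()
        return priors, likelihoods

    X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
    y = np.array([1, 1, 0, 0])
    priors, likelihoods = fit_naive_bayes(X, y)
    print(priors[1], likelihoods[(0, 1, 1)])   # P(y=1)=0.5, P(x_0=1|y=1)=1.0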

1.6 Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

❒ CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

❒ Random forest – It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

❒ Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:

  Adaptive boosting                         Gradient boosting
  - High weights are put on errors to       - Weak learners are trained on
    improve at the next boosting step         remaining errors
  - Known as Adaboost

1.7 Other non-parametric approaches

❒ k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.
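A minimal sketch of k-NN classification with Euclidean distances (illustrative, not from the cheatsheet; names and toy data are assumptions):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote among its k nearest training points."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]                       # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                      # majority label

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))   # -> 1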

1.8 Learning Theory

❒ Union bound – Let A₁, ..., A_k be k events. We have:

  P(A₁ ∪ ... ∪ A_k) ≤ P(A₁) + ... + P(A_k)

❒ Hoeffding inequality – Let Z₁, ..., Z_m be m iid variables drawn from a Bernoulli distribution of parameter φ. Let φ̂ be their sample mean and γ > 0 fixed. We have:

  P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)

Remark: this inequality is also known as the Chernoff bound.

❒ Training error – For a given classifier h, we define the training error ε̂(h), also known as the empirical risk or empirical error, to be as follows:

  ε̂(h) = (1/m) Σ_{i=1}^{m} 1{h(x^(i)) ≠ y^(i)}

❒ Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:

  • the training and testing sets follow the same distribution
  • the training examples are drawn independently

❒ Shattering – Given a set S = {x^(1), ..., x^(d)} and a set of classifiers H, we say that H shatters S if for any set of labels {y^(1), ..., y^(d)}, we have:

  ∃h ∈ H, ∀i ∈ [[1,d]], h(x^(i)) = y^(i)

❒ Upper bound theorem – Let H be a finite hypothesis class such that |H| = k and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:

  ε(ĥ) ≤ ( min_{h∈H} ε(h) ) + 2 √( (1/(2m)) log(2k/δ) )

❒ VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

Remark: the VC dimension of H = {set of linear classifiers in 2 dimensions} is 3.

❒ Theorem (Vapnik) – Let H be given, with VC(H) = d and m the number of training examples. With probability at least 1 − δ, we have:

  ε(ĥ) ≤ ( min_{h∈H} ε(h) ) + O( √( (d/m) log(m/d) + (1/m) log(1/δ) ) )

❒ Calinski-Harabaz index – By noting k the number of clusters, and B_k and W_k the between- and within-clustering dispersion matrices respectively defined as

  B_k = Σ_{j=1}^{k} n_{c^(j)} (μ_{c^(j)} − μ)(μ_{c^(j)} − μ)ᵀ ,   W_k = Σ_{i=1}^{m} (x^(i) − μ_{c^(i)})(x^(i) − μ_{c^(i)})ᵀ

the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:

  s(k) = ( Tr(B_k) / Tr(W_k) ) × ( (N − k) / (k − 1) )

2.3 Dimension reduction

2.3.1 Principal component analysis

It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ ℝⁿˣⁿ, λ is said to be an eigenvalue of A if there exists a vector z ∈ ℝⁿ\{0}, called eigenvector, such that we have:

  Az = λz

❒ Spectral theorem – Let A ∈ ℝⁿˣⁿ. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ ℝⁿˣⁿ. By noting Λ = diag(λ₁, ..., λ_n), we have:

  ∃Λ diagonal, A = UΛUᵀ

Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.

❒ Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:

  • Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:

      x_j^(i) ← (x_j^(i) − μ_j) / σ_j   where   μ_j = (1/m) Σ_{i=1}^{m} x_j^(i)   and   σ_j² = (1/m) Σ_{i=1}^{m} (x_j^(i) − μ_j)²

  • Step 2: Compute Σ = (1/m) Σ_{i=1}^{m} x^(i) x^(i)ᵀ ∈ ℝⁿˣⁿ, which is symmetric with real eigenvalues.

  • Step 3: Compute u₁, ..., u_k ∈ ℝⁿ, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

  • Step 4: Project the data on span_ℝ(u₁, ..., u_k). This procedure maximizes the variance among all k-dimensional spaces.
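A minimal NumPy sketch of these four steps (illustrative, not from the cheatsheet; the function name and synthetic data are assumptions):

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the k principal directions, per Steps 1-4."""
        # Step 1: normalize each feature to mean 0 and standard deviation 1
        X = (X - X.mean(axis=0)) / X.std(axis=0)
        # Step 2: empirical covariance (symmetric, real eigenvalues)
        Sigma = X.T @ X / X.shape[0]
        # Step 3: eigenvectors of the k largest eigenvalues (eigh sorts ascending)
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
        # Step 4: project the data on span(u_1, ..., u_k)
        return X @ U

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 5))
    print(pca(X, 2).shape)   # (200, 2)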

2.3.2 Independent component analysis

It is a technique meant to find the underlying generating sources.

❒ Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s₁, ..., s_n), where the s_i are independent random variables, via a mixing and non-singular matrix A as follows:

  x = As

The goal is to find the unmixing matrix W = A⁻¹ by an update rule.

❒ Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:

  • Write the probability of x = As = W⁻¹s as:

      p(x) = Π_{i=1}^{n} p_s(w_iᵀ x) · |W|

  • Write the log likelihood given our training data {x^(i), i ∈ [[1,m]]}, and by noting g the sigmoid function, as:

      l(W) = Σ_{i=1}^{m} ( Σ_{j=1}^{n} log(g′(w_jᵀ x^(i))) + log |W| )

Therefore, the stochastic gradient ascent learning rule is such that for each training example x^(i), we update W as follows:

  W ← W + α ( [1 − 2g(w₁ᵀ x^(i)), 1 − 2g(w₂ᵀ x^(i)), ..., 1 − 2g(w_nᵀ x^(i))]ᵀ x^(i)ᵀ + (Wᵀ)⁻¹ )

3 Deep Learning

3.1 Neural Networks

Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.

❒ Architecture – The vocabulary around neural network architectures includes the input layer, the hidden layers and the output layer. By noting i the i-th layer of the network and j the j-th hidden unit of the layer, we have:

  z_j^[i] = w_j^[i]ᵀ x + b_j^[i]

where we note w, b, z the weight, bias and output respectively.

❒ Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:

  Sigmoid:     g(z) = 1 / (1 + e⁻ᶻ)
  Tanh:        g(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
  ReLU:        g(z) = max(0, z)
  Leaky ReLU:  g(z) = max(εz, z), with ε ≪ 1

❒ Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:

  L(z,y) = −[ y log(z) + (1 − y) log(1 − z) ]

❒ Learning rate – The learning rate, often noted η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

❒ Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:

  ∂L(z,y)/∂w = ∂L(z,y)/∂a × ∂a/∂z × ∂z/∂w

As a result, the weight is updated as follows:

  w ← w − η ∂L(z,y)/∂w

❒ Updating weights – In a neural network, weights are updated as follows (see the sketch after this list):

  • Step 1: Take a batch of training data.
  • Step 2: Perform forward propagation to obtain the corresponding loss.
  • Step 3: Backpropagate the loss to get the gradients.
  • Step 4: Use the gradients to update the weights of the network.
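Below is a hedged NumPy sketch of these four steps for a tiny one-hidden-layer network with sigmoid output and cross-entropy loss; the architecture, sizes and learning rate are illustrative assumptions, not part of the cheatsheet:

    import numpy as np

    rng = np.random.default_rng(3)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    # Tiny network: 2 inputs -> 4 hidden units (tanh) -> 1 sigmoid output
    W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
    X = rng.normal(size=(64, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)   # XOR-like toy labels
    eta = 0.5

    for _ in range(2000):
        # Steps 1-2: take a batch (here, the full data) and forward propagate
        h = np.tanh(X @ W1.T + b1)              # hidden activations, shape (64, 4)
        a = sigmoid(h @ W2.T + b2).ravel()      # predictions, shape (64,)
        # Step 3: backpropagate the loss to get the gradients
        dz2 = (a - y)[:, None] / len(y)         # dL/dz for sigmoid + cross-entropy
        dW2, db2 = dz2.T @ h, dz2.sum(0)
        dh = dz2 @ W2 * (1 - h ** 2)            # chain rule through tanh
        dW1, db1 = dh.T @ X, dh.sum(0)
        # Step 4: use the gradients to update the weights
        W1 -= eta * dW1; b1 -= eta * db1
        W2 -= eta * dW2; b2 -= eta * db2

    print(((a > 0.5) == y).mean())   # training accuracy after the loop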

❒ Dropout – Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1 − p.

3.2 Convolutional Neural Networks

❒ Convolutional layer requirement – By noting W the input volume size, F the size of the convolutional layer neurons, S the stride and P the amount of zero padding, the number of neurons N that fit in a given volume is such that:

  N = (W − F + 2P)/S + 1
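A one-line check of this formula (illustrative; the helper name and sizes are assumptions):

    def conv_output_size(W, F, P, S):
        """Number of neurons that fit along one dimension: N = (W - F + 2P)/S + 1."""
        return (W - F + 2 * P) // S + 1

    print(conv_output_size(W=32, F=5, P=2, S=1))   # 32: this padding preserves size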

❒ Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i}. By noting μ_B, σ_B² the mean and variance of the batch that we want to correct, it is done as follows:

  x_i ← γ (x_i − μ_B) / √(σ_B² + ε) + β

It is usually done after a fully connected/convolutional layer and before a non-linearity layer, and aims at allowing higher learning rates and reducing the strong dependence on initialization.
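A hedged NumPy sketch of this normalization for a single batch (training-time statistics only; the running averages used at inference are omitted, and all names are illustrative):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize a batch feature-wise, then scale by gamma and shift by beta."""
        mu = x.mean(axis=0)        # mu_B, per-feature batch mean
        var = x.var(axis=0)        # sigma_B^2, per-feature batch variance
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    x = np.random.default_rng(4).normal(loc=5.0, scale=3.0, size=(32, 8))
    out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
    print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 and ~1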

3.3 Recurrent Neural Networks

❒ Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:

  Input gate              Forget gate            Output gate             Gate
  Write to cell or not?   Erase a cell or not?   Reveal a cell or not?   How much writing?

❒ LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.

4 Machine Learning Tips and Tricks

4.1 Metrics

Given a set of data points {x^(1), ..., x^(m)}, where each x^(i) has n features, associated with a set of outcomes {y^(1), ..., y^(m)}, we want to assess a given classifier that learns how to predict y from x.

4.1.1 Classification

In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.

❒ Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:

                              Predicted +                        Predicted -
  Actual +    True Positives (TP)                  False Negatives (FN), Type II error
  Actual -    False Positives (FP), Type I error   True Negatives (TN)

❒ Main metrics – The following metrics are commonly used to assess the performance of classification models:

  Metric                  Formula                          Interpretation
  Accuracy                (TP + TN)/(TP + TN + FP + FN)    Overall performance of model
  Precision               TP/(TP + FP)                     How accurate the positive predictions are
  Recall (Sensitivity)    TP/(TP + FN)                     Coverage of actual positive sample
  Specificity             TN/(TN + FP)                     Coverage of actual negative sample
  F1 score                2TP/(2TP + FP + FN)              Hybrid metric useful for unbalanced classes

❒ ROC – The receiver operating curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

  Metric                      Formula         Equivalent
  True Positive Rate (TPR)    TP/(TP + FN)    Recall, sensitivity
  False Positive Rate (FPR)   FP/(TN + FP)    1 − specificity

❒ AUC – The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC curve.
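A small sketch computing the classification metrics above from raw confusion-matrix counts (an illustrative helper, not from the cheatsheet):

    def classification_metrics(tp, fn, fp, tn):
        """Compute the main binary classification metrics from confusion-matrix counts."""
        return {
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "precision":   tp / (tp + fp),
            "recall":      tp / (tp + fn),        # sensitivity, TPR
            "specificity": tn / (tn + fp),
            "f1":          2 * tp / (2 * tp + fp + fn),
        }

    print(classification_metrics(tp=40, fn=10, fp=5, tn=45))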

4.1.2 Regression

❒ Basic metrics – Given a regression model f, the following metrics are commonly used to assess the performance of the model:

  Total sum of squares               Explained sum of squares              Residual sum of squares
  SS_tot = Σ_{i=1}^{m} (y_i − ȳ)²    SS_reg = Σ_{i=1}^{m} (f(x_i) − ȳ)²    SS_res = Σ_{i=1}^{m} (y_i − f(x_i))²

❒ Coefficient of determination – The coefficient of determination, often noted R² or r², provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:

  R² = 1 − SS_res / SS_tot
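A quick illustration of this definition in NumPy (the helper and toy values are assumptions for the example):

    import numpy as np

    def r_squared(y, y_pred):
        """R^2 = 1 - SS_res / SS_tot, per the definitions above."""
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot

    y = np.array([3.0, 5.0, 7.0, 9.0])
    print(r_squared(y, np.array([2.8, 5.1, 7.2, 8.9])))   # close to 1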

❒ Main metrics – The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:

  Mallow's Cp:   (SS_res + 2(n + 1)σ̂²) / m
  AIC:           2[(n + 2) − log(L)]
  BIC:           log(m)(n + 2) − 2 log(L)
  Adjusted R²:   1 − (1 − R²)(m − 1) / (m − n − 1)

where L is the likelihood and σ̂² is an estimate of the variance associated with each response.

4.2 Model selection

❒ Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have as follows:

  Training set                    Validation set                              Testing set
  - Model is trained              - Model is assessed                         - Model gives predictions
  - Usually 80% of the dataset    - Usually 20% of the dataset                - Unseen data
                                  - Also called hold-out or development set

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

❒ Cross-validation – Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

  k-fold                                     Leave-p-out
  - Training on k − 1 folds and              - Training on n − p observations and
    assessment on the remaining one            assessment on the p remaining ones
  - Generally k = 5 or 10                    - Case p = 1 is called leave-one-out

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k − 1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
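A minimal sketch of k-fold cross-validation in plain NumPy; the "model" here is a trivial mean predictor, purely for illustration, and all names are assumptions:

    import numpy as np

    def kfold_cv_error(X, y, k=5, seed=0):
        """Average validation error over k folds, per the description above."""
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            val = folds[i]                                   # held-out fold
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            y_hat = y[train].mean()                          # toy 'model': predict the mean
            errors.append(np.mean((y[val] - y_hat) ** 2))    # MSE on the held-out fold
        return np.mean(errors)                               # cross-validation error

    rng = np.random.default_rng(5)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    print(kfold_cv_error(X, y, k=5))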

❒ Regularization – The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:

  LASSO                            Ridge                           Elastic Net
  - Shrinks coefficients to 0      - Makes coefficients smaller    - Tradeoff between variable
  - Good for variable selection                                      selection and small coefficients
  ... + λ||θ||₁                    ... + λ||θ||₂²                  ... + λ[(1 − α)||θ||₁ + α||θ||₂²]
  λ ∈ ℝ                            λ ∈ ℝ                           λ ∈ ℝ, α ∈ [0,1]

❒ Model selection – Train the candidate models on the training set, evaluate them on the development set, pick the model that performs best on the development set, and retrain that model on the whole training set.

4.3 Diagnostics

❒ Bias – The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.

❒ Variance – The variance of a model is the variability of the model prediction for given data points.

❒ Bias/variance tradeoff – The simpler the model, the higher the bias, and the more complex the model, the higher the variance.

              Underfitting                  Just right                   Overfitting
  Symptoms    - High training error         - Training error slightly    - Low training error
              - Training error close to       lower than test error      - Training error much
                test error                                                 lower than test error
              - High bias                                                - High variance

❒ Extended form of Bayes' rule – Let {A_i, i ∈ [[1,n]]} be a partition of the sample space. We have:

  P(A_k|B) = P(B|A_k)P(A_k) / Σ_{i=1}^{n} P(B|A_i)P(A_i)

❒ Independence – Two events A and B are independent if and only if we have:

P (A ∩ B) = P (A)P (B)

5.1.3 Random Variables

❒ Random variable – A random variable, often noted X, is a function that maps every element in a sample space to the real line.

❒ Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and is such that lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, is defined as:

  F(x) = P(X ≤ x)

Remark: we have P(a < X ≤ b) = F(b) − F(a).

❒ Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

❒ Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases.

  Case   CDF F                            PDF f                 Properties of PDF
  (D)    F(x) = Σ_{x_i ≤ x} P(X = x_i)    f(x_j) = P(X = x_j)   0 ≤ f(x_j) ≤ 1 and Σ_j f(x_j) = 1
  (C)    F(x) = ∫_{−∞}^{x} f(y)dy         f(x) = dF/dx          f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x)dx = 1

❒ Variance – The variance of a random variable, often noted Var(X) or σ², is a measure of the spread of its distribution function. It is determined as follows:

  Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

❒ Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

  σ = √Var(X)

❒ Expectation and Moments of the Distribution – Here are the expressions of the expected value E[X], generalized expected value E[g(X)], k-th moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:

  Case   E[X]                     E[g(X)]                     E[X^k]                     ψ(ω)
  (D)    Σ_{i=1}^{n} x_i f(x_i)   Σ_{i=1}^{n} g(x_i)f(x_i)    Σ_{i=1}^{n} x_i^k f(x_i)   Σ_{i=1}^{n} f(x_i)e^{iωx_i}
  (C)    ∫_{−∞}^{+∞} x f(x)dx     ∫_{−∞}^{+∞} g(x)f(x)dx      ∫_{−∞}^{+∞} x^k f(x)dx     ∫_{−∞}^{+∞} f(x)e^{iωx}dx

Remark: we have e^{iωx} = cos(ωx) + i sin(ωx).

❒ Revisiting the k-th moment – The k-th moment can also be computed with the characteristic function as follows:

  E[X^k] = (1/i^k) [ ∂^k ψ / ∂ω^k ]_{ω=0}

❒ Transformation of random variables – Let the variables X and Y be linked by some function. By noting f_X and f_Y the distribution functions of X and Y respectively, we have:

  f_Y(y) = f_X(x) |dx/dy|

❒ Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:

  ∂/∂c ( ∫_{a}^{b} g(x)dx ) = (∂b/∂c) · g(b) − (∂a/∂c) · g(a) + ∫_{a}^{b} (∂g/∂c)(x)dx

❒ Chebyshev's inequality – Let X be a random variable with expected value μ and standard deviation σ. For k, σ > 0, we have the following inequality:

  P(|X − μ| ≥ kσ) ≤ 1/k²

5.1.4 Jointly Distributed Random Variables

❒ Conditional density – The conditional density of X with respect to Y, often noted f_{X|Y}, is defined as follows:

  f_{X|Y}(x) = f_{XY}(x,y) / f_Y(y)

❒ Independence – Two random variables X and Y are said to be independent if we have:

f XY (x,y) = f X (x)f Y (y)

❒ Marginal density and cumulative distribution – From the joint density probability function f_{XY}, we have:

  Case   Marginal density                      Cumulative function
  (D)    f_X(x_i) = Σ_j f_{XY}(x_i, y_j)       F_{XY}(x,y) = Σ_{x_i ≤ x} Σ_{y_j ≤ y} f_{XY}(x_i, y_j)
  (C)    f_X(x) = ∫_{−∞}^{+∞} f_{XY}(x,y)dy    F_{XY}(x,y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{XY}(x′,y′)dy′dx′

❒ Distribution of a sum of independent random variables – Let Y = X₁ + ... + X_n with X₁, ..., X_n independent. We have:

  ψ_Y(ω) = Π_{k=1}^{n} ψ_{X_k}(ω)

❒ Covariance – We define the covariance of two random variables X and Y, that we note σ²_{XY} or more commonly Cov(X,Y), as follows:

  Cov(X,Y) ≜ σ²_{XY} = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y

❒ Correlation – By noting σ_X, σ_Y the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρ_{XY}, as follows:

  ρ_{XY} = σ²_{XY} / (σ_X σ_Y)

Remarks: for any X, Y, we have ρ_{XY} ∈ [−1, 1]. If X and Y are independent, then ρ_{XY} = 0.

❒ Main distributions – Here are the main distributions to have in mind:

  Type   Distribution                          PDF                                          ψ(ω)                              E[X]       Var(X)
  (D)    Binomial: X ∼ B(n,p), x ∈ [[0,n]]     P(X = x) = (n choose x) p^x q^{n−x}          (pe^{iω} + q)^n                   np         npq
  (D)    Poisson: X ∼ Po(μ), x ∈ ℕ             P(X = x) = (μ^x / x!) e^{−μ}                 e^{μ(e^{iω} − 1)}                 μ          μ
  (C)    Uniform: X ∼ U(a,b), x ∈ [a,b]        f(x) = 1/(b − a)                             (e^{iωb} − e^{iωa}) / ((b − a)iω)   (a + b)/2  (b − a)²/12
  (C)    Gaussian: X ∼ N(μ,σ), x ∈ ℝ           f(x) = (1/(√(2π)σ)) e^{−(1/2)((x−μ)/σ)²}     e^{iωμ − (1/2)ω²σ²}               μ          σ²
  (C)    Exponential: X ∼ Exp(λ), x ∈ ℝ₊       f(x) = λe^{−λx}                              1/(1 − iω/λ)                      1/λ        1/λ²

5.1.5 Parameter estimation

❒ Random sample – A random sample is a collection of n random variables X₁, ..., X_n that are independent and identically distributed with X.

❒ Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

❒ Bias – The bias of an estimator θ̂ is defined as being the difference between the expected value of the distribution of θ̂ and the true value, i.e.:

  Bias(θ̂) = E[θ̂] − θ

Remark: an estimator is said to be unbiased when we have E[θ̂] = θ.

❒ Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean μ and the true variance σ² of a distribution, are noted X̄ and s² respectively, and are such that:

  X̄ = (1/n) Σ_{i=1}^{n} X_i   and   s² = σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²

❒ Central Limit Theorem – Let us have a random sample X₁, ..., X_n following a given distribution with mean μ and variance σ². Then we have:

  X̄ ∼ N(μ, σ/√n)   as n → +∞
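A quick NumPy simulation of this statement (illustrative; the distribution and sizes are arbitrary choices): sample means of a skewed distribution concentrate around μ with spread close to σ/√n.

    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 100, 10_000
    # Exponential(lambda=1): mu = 1, sigma = 1, so sample means ~ N(1, 1/sqrt(n))
    means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
    print(means.mean(), means.std())   # close to 1 and 1/sqrt(100) = 0.1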

5.2 Linear Algebra and Calculus

5.2.1 General notations

❒ Vector – We note x ∈ ℝⁿ a vector with n entries, where x_i ∈ ℝ is the i-th entry:

  x = (x₁, x₂, ..., x_n)ᵀ ∈ ℝⁿ

❒ Matrix – We note A ∈ ℝᵐˣⁿ a matrix with m rows and n columns, where A_{i,j} ∈ ℝ is the entry located in the i-th row and j-th column:

  A = ( A_{1,1} ··· A_{1,n} ; ... ; A_{m,1} ··· A_{m,n} ) ∈ ℝᵐˣⁿ

Remark: the vector x defined above can be viewed as an n × 1 matrix and is more particularly called a column vector.

❒ Identity matrix – The identity matrix I ∈ ℝⁿˣⁿ is a square matrix with ones on its diagonal and zeros everywhere else:

  I = diag(1, ..., 1)

  Norm            Notation   Definition                     Use case
  Manhattan, L¹   ||x||₁     Σ_{i=1}^{n} |x_i|              LASSO regularization
  Euclidean, L²   ||x||₂     √( Σ_{i=1}^{n} x_i² )          Ridge regularization
  p-norm, Lᵖ      ||x||_p    ( Σ_{i=1}^{n} x_i^p )^{1/p}    Hölder inequality
  Infinity, L∞    ||x||∞     max_i |x_i|                    Uniform convergence

❒ Linear dependence – A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.

Remark: if no vector can be written this way, then the vectors are said to be linearly independent.

❒ Matrix rank – The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.

❒ Positive semi-definite matrix – A matrix A ∈ ℝⁿˣⁿ is positive semi-definite (PSD), noted A ⪰ 0, if we have:

  A = Aᵀ   and   ∀x ∈ ℝⁿ, xᵀAx ≥ 0

Remark: similarly, a matrix A is said to be positive definite, noted A ≻ 0, if it is a PSD matrix which satisfies xᵀAx > 0 for all non-zero vectors x.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ ℝⁿˣⁿ, λ is said to be an eigenvalue of A if there exists a vector z ∈ ℝⁿ\{0}, called eigenvector, such that we have:

Az = λz

❒ Spectral theorem – Let A ∈ ℝⁿˣⁿ. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ ℝⁿˣⁿ. By noting Λ = diag(λ₁, ..., λ_n), we have:

  ∃Λ diagonal, A = UΛUᵀ

❒ Singular-value decomposition – For a given matrix A of dimensions m × n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U (m × m, unitary), Σ (m × n, diagonal) and V (n × n, unitary) matrices, such that:

  A = UΣVᵀ

5.2.4 Matrix calculus

❒ Gradient – Let f : ℝᵐˣⁿ → ℝ be a function and A ∈ ℝᵐˣⁿ be a matrix. The gradient of f with respect to A is an m × n matrix, noted ∇_A f(A), such that:

  (∇_A f(A))_{i,j} = ∂f(A) / ∂A_{i,j}

Remark: the gradient of f is only defined when f is a function that returns a scalar.

❒ Hessian – Let f : ℝⁿ → ℝ be a function and x ∈ ℝⁿ be a vector. The hessian of f with respect to x is an n × n symmetric matrix, noted ∇²_x f(x), such that:

  (∇²_x f(x))_{i,j} = ∂²f(x) / (∂x_i ∂x_j)

Remark: the hessian of f is only defined when f is a function that returns a scalar.

❒ Gradient operations – For matrices A, B, C, the following gradient properties are worth having in mind:

  ∇_A tr(AB) = Bᵀ
  ∇_{Aᵀ} f(A) = (∇_A f(A))ᵀ
  ∇_A tr(ABAᵀC) = CAB + CᵀABᵀ
  ∇_A |A| = |A|(A⁻¹)ᵀ
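As a closing illustration (not from the cheatsheet), the first identity can be checked numerically with finite differences; the matrices and tolerance are arbitrary test choices:

    import numpy as np

    rng = np.random.default_rng(7)
    A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    eps = 1e-6

    # Finite-difference gradient of f(A) = tr(AB) with respect to A
    grad = np.zeros_like(A)
    for i in range(3):
        for j in range(3):
            E = np.zeros_like(A); E[i, j] = eps
            grad[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps

    print(np.allclose(grad, B.T, atol=1e-4))   # True: grad_A tr(AB) = B^T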