

Basic Probability Cheat Sheet: Probability and Expectation, Asymptotic Theory, Miscellaneous
Typology: Cheat Sheet
Bayes rule:

p(θ|X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ = p(X|θ)p(θ) / p(X)
If X represents data and θ is an unknown quantity of interest, Bayes' rule can be interpreted as making inferences about θ based on the data X (Bayesian inference), in the form of the posterior distribution p(θ|X).
Remark. In the machine learning course, you will encounter the words 'learning' and 'inference'. From a Bayesian point of view, there is no difference between the two (because everything is expressed by posteriors). But machine learning people tend to use 'learning' to mean tuning the parameters of a model using data, and 'inference' to mean computing some quantity with the model (sometimes this includes evaluating a posterior distribution). This distinction is not universal, but it is good to know in order to avoid confusion.
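As a quick numerical illustration (not part of the original sheet), here is a minimal sketch of Bayes' rule with a hypothetical discrete prior over a coin's bias θ; all names and numbers are made up for the example:

```python
import math

# Hypothetical example: infer a coin's bias theta from observed flips.
# Prior: theta is either 0.5 (fair) or 0.8 (biased), equally likely.
priors = {0.5: 0.5, 0.8: 0.5}
data = [1, 1, 0, 1, 1]  # 1 = heads, 0 = tails

def likelihood(theta, flips):
    # p(X|theta) for independent Bernoulli flips
    return math.prod(theta if x == 1 else 1 - theta for x in flips)

# Bayes rule: posterior is proportional to likelihood * prior,
# normalized by the evidence p(X) = sum over theta of p(X|theta)p(theta).
unnorm = {t: likelihood(t, data) * p for t, p in priors.items()}
evidence = sum(unnorm.values())
posterior = {t: v / evidence for t, v in unnorm.items()}
```

With four heads in five flips, the posterior shifts mass toward the biased hypothesis θ = 0.8.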
Law of total variance: Var[X] = EY[VarX|Y[X|Y]] + VarY[EX|Y[X|Y]]
Theorem (The Law of Large Numbers). Let X1, X2, ... be independent identically distributed (i.i.d.) real random variables. Let Sn = (1/n) ∑_{i=1}^{n} Xi and μ = EX1. If E|X1| < ∞, then Sn → μ as n → ∞.¹

¹ To be precise, we need to define convergence of random variables.
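The law of large numbers is easy to check by simulation. A minimal sketch (not from the original sheet), using Uniform(0, 1) draws, for which μ = 0.5:

```python
import random

random.seed(0)

# Sample mean Sn of n i.i.d. Uniform(0, 1) draws; the true mean is mu = 0.5.
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# |Sn - mu| should shrink as n grows.
errors = [abs(sample_mean(n) - 0.5) for n in (10, 1_000, 100_000)]
```

For n = 100,000 the error is typically well below 0.01, while for n = 10 it can easily exceed 0.1.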
Theorem (The Central Limit Theorem). Let X1, X2, ... be as above. Let σ² = Var[X1]. Under the (stronger) assumption EX1² < ∞, the probability distribution of √n (Sn − μ)/σ converges to the standard normal distribution N(0, 1).²
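The central limit theorem can also be checked by simulation. A sketch (not from the original sheet) using Uniform(0, 1) draws, for which μ = 0.5 and σ² = 1/12: across many trials, √n (Sn − μ)/σ should have mean ≈ 0 and variance ≈ 1.

```python
import math
import random
import statistics

random.seed(1)

n, trials = 500, 1_000
mu = 0.5
sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

# One standardized error sqrt(n) * (Sn - mu) / sigma per trial.
z = [math.sqrt(n) * (sum(random.random() for _ in range(n)) / n - mu) / sigma
     for _ in range(trials)]

z_mean = statistics.mean(z)      # should be close to 0
z_var = statistics.pvariance(z)  # should be close to 1
```

A histogram of z would look approximately like the standard normal bell curve.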
If X ∼ N(μ, Σ), then AX ∼ N(Aμ, AΣAᵀ), where A is a matrix of appropriate shape.
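This affine-transformation property can be checked empirically. A sketch (not from the original sheet) in 2-D with μ = 0, Σ = I, and a hypothetical A = [[1, 1], [0, 2]], so that AΣAᵀ = AAᵀ = [[2, 2], [2, 4]]:

```python
import random

random.seed(2)

# X ~ N(0, I) in 2-D; y = A x with A = [[1, 1], [0, 2]].
# Then AX ~ N(0, A A^T), where A A^T = [[2, 2], [2, 4]].
n = 100_000
ys = []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    ys.append((x1 + x2, 2 * x2))  # y = A x

# Empirical covariance (the mean is 0, so E[y_i y_j] suffices).
cov = [[sum(y[i] * y[j] for y in ys) / n for j in range(2)] for i in range(2)]
```

The empirical covariance matrix should be close to [[2, 2], [2, 4]].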
If N(x; μ, σ²) = (2πσ²)^(−1/2) exp(−(x − μ)²/(2σ²)), then

N(x; μ1, σ1²) N(x; μ2, σ2²) = N(μ1; μ2, σ1² + σ2²) · N(x; (μ1/σ1² + μ2/σ2²) / (1/σ1² + 1/σ2²), σ1²σ2² / (σ1² + σ2²))
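This product-of-Gaussians identity can be verified numerically at a point. A minimal sketch (the parameter values are arbitrary, chosen for the example):

```python
import math

def normal_pdf(x, mu, var):
    # N(x; mu, var) = (2*pi*var)^(-1/2) * exp(-(x - mu)^2 / (2*var))
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, v1 = 1.0, 2.0
mu2, v2 = -0.5, 0.5
x = 0.3

# Left-hand side: product of the two densities at x.
lhs = normal_pdf(x, mu1, v1) * normal_pdf(x, mu2, v2)

# Right-hand side: N(mu1; mu2, v1 + v2) times a Gaussian in x with
# precision-weighted mean and combined variance.
mu_c = (mu1 / v1 + mu2 / v2) / (1 / v1 + 1 / v2)
v_c = v1 * v2 / (v1 + v2)
rhs = normal_pdf(mu1, mu2, v1 + v2) * normal_pdf(x, mu_c, v_c)
```

Both sides agree to floating-point precision, for any choice of x and parameters.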
² Assume σ = 1 for simplicity. In contrast with the law of large numbers, the central limit theorem says that if you multiply the error of the mean estimate, Sn − μ, by √n, the distribution of the amplified error √n (Sn − μ) is approximately the Gaussian N(0, 1) for sufficiently large n. Without the amplification, the error converges to a point (zero) as its variance tends to 0, which agrees with the law of large numbers.