


Useful cheat sheet with formulas and main concepts for the Data Mining midterm exam
Model                    Data Type   Task Type
Linear Regression        Vector      Prediction
Logistic Regression      Vector      Classification
Decision Tree            Vector      Classification
SVM                      Vector      Classification
NN                       Vector      Classification
KNN                      Vector      Classification
K-means                  Vector      Clustering
Hierarchical Clustering  Vector      Clustering
DBSCAN                   Vector      Clustering
Mixture Models           Vector      Clustering
Models
Minkowski distance d(x, y) = (∑_{i=1}^d |x_i − y_i|^h)^{1/h}; h = 1 Manhattan (l1), h = 2 Euclidean (l2), h → ∞ supremum (l∞); the triangle inequality applies (d(i, j) ≤ d(i, k) + d(k, j)).
Basic Concepts
∂(x^T Ax)/∂x = (A + A^T)x
Gaussian density: f(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}
var((∑_i f_i(x))/t) = var(f_i(x))/t for independent f_i
∑_i a_i b_i = ‖a‖‖b‖ cos(a, b)
Formula
Entropy: H(Y) = −∑_{i=1}^m p_i log(p_i); Conditional Entropy: H(Y|X) = ∑_x p(x) H(Y|X = x); Cross Entropy: H(q, p) = −∑_k q_k log(p_k)
Tools
y = x^T β, with bias term x_{i0} = 1; X: n × (p + 1) matrix, y: n × 1 vector, β: (p + 1) × 1 vector; y is continuous.
OLS (Ordinary Least Squares): J(β) = (1/2n)(Xβ − y)^T(Xβ − y) = (1/2n)(β^T X^T Xβ − y^T Xβ − β^T X^T y + y^T y). Closed-form solution: set ∂J/∂β = 0, giving β̂ = (X^T X)^{−1} X^T y.
Gradient descent: β^{(t+1)} := β^{(t)} − ηΔ. Batch GD (converges): Δ = ∂J/∂β = ∑_i x_i(x_i^T β − y_i)/n. Stochastic GD (n updates per pass): Δ = −(y_i − x_i^T β^{(t)}) x_i.
LR with probabilistic interpretation (MLE, Maximum Likelihood Estimation): L(β) = ∏_i p(y_i|x_i, β) = ∏_i N(x_i^T β, σ²) = ∏_i (1/√(2πσ²)) exp{−(y_i − x_i^T β)²/(2σ²)}.
To keep X^T X invertible: add λ∑_{j=1}^p β_j² to ∑_i (y_i − x_i^T β)² (Ridge Regression, i.e. linear regression with an l2 penalty).
Non-linear correlation: create new terms, e.g. x².
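As a quick illustration of the closed-form solution β̂ = (X^T X)^{−1} X^T y and the ridge variant, here is a minimal NumPy sketch; the function names and toy data are illustrative, not part of the course material.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
    X is assumed to already include the bias column x_0 = 1."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: adding lam * I to X^T X keeps the system invertible."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# toy usage: y ~ 2 + 3x with a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])          # prepend the bias column
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)
print(ols_fit(X, y))                           # roughly [2, 3]
```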
Linear Regression
Generalized linear model (GLM). P(Y = 1|X, β) = σ(X^T β) = e^{X^T β}/(1 + e^{X^T β}), P(Y = 0|X, β) = 1 − σ(X^T β) = 1/(1 + e^{X^T β}), Y|X, β ∼ Bernoulli(σ(X^T β)).
MLE: L = ∏_i p_i^{y_i}(1 − p_i)^{1−y_i}, where p_i = P(Y = 1|x_i, β). Equivalent to maximizing the log-likelihood ℓ(β) = ∑_i (y_i x_i^T β − log(1 + e^{x_i^T β})).
Gradient ascent: β_new = β_old + η ∂ℓ(β)/∂β. Newton–Raphson update: β_new = β_old − (∂²ℓ(β)/∂β∂β^T)^{−1} ∂ℓ(β)/∂β.
Cross-entropy loss (p for prediction, q for ground truth; (q_0, q_1)|_{y=0} = (1, 0), (q_0, q_1)|_{y=1} = (0, 1), (p_0, p_1) = (P(Y = 0), P(Y = 1))): H(p, q) = −y x^T β + log(1 + e^{x^T β}).
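A minimal sketch of gradient ascent on the log-likelihood above, whose gradient is X^T(y − σ(Xβ)); the averaging, step size, and toy data are illustrative choices, and X is assumed to carry a bias column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, eta=0.5, n_iter=2000):
    """Gradient ascent on l(beta) = sum_i (y_i x_i^T beta - log(1 + exp(x_i^T beta))),
    using the averaged gradient X^T (y - sigma(X beta)) / n for a stable step size."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta += eta * X.T @ (y - sigmoid(X @ beta)) / len(y)
    return beta

# toy usage: labels drawn from a logistic model with beta = [-1, 2]
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
X = np.hstack([np.ones((200, 1)), x])
y = (rng.random(200) < sigmoid(X @ np.array([-1.0, 2.0]))).astype(float)
print(logreg_fit(X, y))   # roughly recovers [-1, 2]
```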
Logistic Regression
A framework to approach maximum likelihood. p(x_i, z_i = C_j) = w_j f_j(x_i), p(x_i) = ∑_j w_j f_j(x_i), p(D) = ∏_i p(x_i) = ∏_i ∑_j w_j f_j(x_i), log p(D) = ∑_i log(∑_j w_j f_j(x_i)).
E (expectation) step assigns objects to clusters: w_ij^{t+1} = p(z_i = j|θ_j^t, x_i) ∝ p(x_i|z_i = j, θ_j^t) p(z_i = j) = f_j(x_i) w_j.
M (maximization) step finds the new clustering parameters w.r.t. the conditional distribution p(z_i = j|θ_j^t, x_i): θ^{t+1} = argmax_θ ∑_i ∑_j w_ij^{t+1} log L(x_i, z_i = j|θ).
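A minimal sketch of these E/M updates for a 1-d Gaussian mixture (the concrete 1-d update formulas also appear in the Clustering box below); the initialization scheme and toy data are illustrative.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    """EM for a 1-d Gaussian mixture: the E-step computes soft assignments
    w_ij proportional to w_j * f_j(x_i); the M-step re-estimates mu, var, w."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # init means from data points
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, shape (n, k)
        r = w * gauss(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means, variances and mixing weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

# toy usage: two well-separated 1-d Gaussians
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 700)])
print(em_gmm_1d(x))   # weights near 0.3 / 0.7, means near -3 / 3
```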
EM Algorithm
m = number of classes |y| in D, v = number of values |A| of attribute A.
Expected information needed to classify a tuple in D: Info(D) = −∑_{i=1}^m p_i log2(p_i). Info after splitting on A: Info_A(D) = ∑_{j=1}^v (|D_j|/|D|) × Info(D_j). Info Gain (ID3): Gain(A) = Info(D) − Info_A(D). Info gain is biased towards multi-valued attributes.
SplitInfo_A(D) = −∑_{j=1}^v (|D_j|/|D|) × log2(|D_j|/|D|). Gain Ratio (C4.5): GainRatio(A) = Gain(A)/SplitInfo_A(D). The gain ratio is biased towards unbalanced splits.
Gini(D) = 1 − ∑_{j=1}^m p_j² measures impurity. Gini_A(D) = ∑_{j=1}^v (|D_j|/|D|) Gini(D_j). Gini (CART): ΔGini(A) = Gini(D) − Gini_A(D). The Gini index is also biased towards multi-valued attributes.
STOP: all samples in the same class; last attribute; no samples left (majority voting). Avoid overfitting: pre-/post-pruning, random forest.
Classification → Prediction: majority vote → e.g. average at the leaf node, turning the tree into a regression tree; Var(D_j) = ∑_{y∈D_j}(y − ȳ)²/|D_j|, look for the lowest weighted average variance Var_A(D) = ∑_{j=1}^v (|D_j|/|D|) × Var(D_j).
A different view: each leaf = a box in the feature plane.
A random forest is a set of trees (ensemble, bagging): good at classification, handles large and missing data, not as good at prediction, lacks interpretability.
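A small sketch of the entropy, Gini, and information-gain formulas for a single categorical split; the function names and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(labels, attr_values):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j), splitting on attribute A."""
    n = len(labels)
    info_a = 0.0
    for v in set(attr_values):
        subset = [l for l, a in zip(labels, attr_values) if a == v]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

# toy usage: the attribute perfectly predicts the label, so the gain is 1 bit
labels = ['yes', 'yes', 'no', 'no']
attr   = ['a', 'a', 'b', 'b']
print(info_gain(labels, attr))   # 1.0
```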
Decision Tree
y = sign(w · x + b), separating hyperplane y = 0. SVM searches for the Maximum Marginal Hyperplane. To maximize the margin ρ = 2/‖w‖, use Lagrange multipliers α: L(w, b, α) = ½ w^T w − ∑_i α_i (y_i(w^T x_i + b) − 1); ∂L/∂w = w − ∑_i α_i y_i x_i = 0, ∂L/∂b = −∑_i α_i y_i = 0.
Solution: w = ∑ α_i y_i x_i, b = y_k − w^T x_k; f(x) = w^T x + b = ∑ α_i y_i x_i^T x + b, default threshold 0.
Linear vs. non-linear SVM: kernels. Non-linear decision boundary: f(x) = w^T Φ(x) + b = ∑ α_i y_i K(x_i, x) + b.
Scalability: CF-tree, hierarchical micro-clusters, selective declustering (decluster the clusters that could be support clusters; a support cluster has its centroid on a support vector).
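A sketch of the kernel decision function f(x) = ∑ α_i y_i K(x_i, x) + b, assuming the multipliers α_i, the support vectors, and b have already been obtained from the dual; the RBF kernel, its gamma, and the hand-picked toy values are illustrative, not a trained model.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) (an illustrative kernel choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, classified by sign(f(x))."""
    f = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, support_y, support_x)) + b
    return np.sign(f)

# toy usage with hand-picked (not trained) multipliers, labels in {-1, +1}
sv = np.array([[0.0, 0.0], [2.0, 2.0]])
print(svm_decision(np.array([0.2, 0.1]), sv, [-1.0, 1.0], [1.0, 1.0], b=0.0))  # -1.0
```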
x_i → (× w_i) → ∑ (+ b) → f → o
Input vector x, weight vector w, bias b; the weighted sum goes through the activation function f to produce the output o.
Perceptron (Single Unit)
Stochastic GD + chain rule. Special case: sigmoid + square loss, 2 layers. Let i, j, k denote the input, hidden, and output layers, O the output, T the true value.
Err_k = O_k(1 − O_k)(T_k − O_k), Err_j = O_j(1 − O_j) ∑_k Err_k w_jk; updates: w_ij = w_ij + η Err_j O_i, w_jk = w_jk + η Err_k O_j, θ_j = θ_j + η Err_j, θ_k = θ_k + η Err_k.
∂J/∂w_ij = (∂J/∂O_k)(∂O_k/∂O_j)(∂O_j/∂w_ij) = −∑_k [(T_k − O_k)][O_k(1 − O_k) w_jk][O_j(1 − O_j) O_i].
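A minimal sketch of one backpropagation update for the sigmoid + square-loss case above (one hidden layer); the layer sizes, initial weights, and single-sample usage are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, w_ij, w_jk, theta_j, theta_k, eta=0.5):
    """One in-place update for a 2-layer sigmoid net with square loss, following
    Err_k = O_k(1-O_k)(T_k-O_k) and Err_j = O_j(1-O_j) * sum_k Err_k w_jk."""
    o_j = sigmoid(x @ w_ij + theta_j)         # hidden outputs O_j
    o_k = sigmoid(o_j @ w_jk + theta_k)       # network outputs O_k
    err_k = o_k * (1 - o_k) * (t - o_k)       # output-layer error
    err_j = o_j * (1 - o_j) * (w_jk @ err_k)  # hidden-layer error
    w_jk += eta * np.outer(o_j, err_k)        # w_jk += eta * Err_k * O_j
    w_ij += eta * np.outer(x, err_j)          # w_ij += eta * Err_j * O_i
    theta_k += eta * err_k
    theta_j += eta * err_j
    return o_k

# toy usage: 2 inputs, 3 hidden units, 1 output, one update on a single sample
rng = np.random.default_rng(0)
w_ij, w_jk = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))
theta_j, theta_k = np.zeros(3), np.zeros(1)
print(backprop_step(np.array([1.0, 0.0]), np.array([1.0]), w_ij, w_jk, theta_j, theta_k))
```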
Backpropagation (BP)
Number of layers: n_layers = n_hidden + n_output (the single output layer counts as 1; the input layer is not counted). Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.
Neural Network (NN)
Lazy learning (instead of eager), instance-based. Consider the k nearest neighbors; majority voting or averaging (could be distance-weighted). Curse of dimensionality: influence of noise; get rid of irrelevant features and select a proper k.
Proximity refers to similarity or dissimilarity. It always applies to binary values: if nominal, do simple matching or use a series of binary variables to represent a non-binary one; if ordinal, use the rank and normalize z_if = (r_if − 1)/(M_f − 1).
For binary variables, dissimilarity can be measured by (|(0,1)| + |(1,0)|)/all for symmetric variables, or (|(0,1)| + |(1,0)|)/(all − |(0,0)|) for asymmetric ones; the Jaccard coefficient (similarity) is |(1,1)|/(all − |(0,0)|).
Mixed-type attributes: weighted combination. Another method: cosine similarity cos(d_1, d_2).
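A minimal majority-vote kNN sketch with Euclidean distance; the function name and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy usage: two small groups of points
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y_train = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(X_train, y_train, np.array([0.2, 0.3])))   # 'a'
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # 'b'
```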
k - Nearest Neighbors (kNN)
Holdout method; cross-validation (k-fold); leave-one-out (LOO).
Confusion matrix: true / false positive / negative.
Accuracy = (TP + TN) / All; Error rate = (FP + FN) / All; Sensitivity = TP / P (P = TP + FN); Specificity = TN / N (N = FP + TN); Precision = TP / P' (P' = TP + FP); Recall = TP / P = Sensitivity.
F1 / F-score = 2 × Precision × Recall / (Precision + Recall); F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall) (weights Recall : Precision = β : 1).
ROC curve: TP rate (y-axis) vs. FP rate (x-axis); evaluate by the area under the curve. TPR = TP / P, FPR = FP / N.
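The metrics above, computed directly from confusion-matrix counts; a small illustrative sketch with made-up counts.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Evaluation metrics from the four confusion-matrix counts, as defined above."""
    p, n = tp + fn, fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                      # = sensitivity = TPR
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,           # 1 - FPR
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "f_beta": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
    }

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```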
Evaluation: Classification
K-means: J = ∑_{j=1}^k ∑_i w_ij ‖x_i − c_j‖². Assign w_ij = 1 for each x_i to its closest c_j; set each center to the new centroid of its points; stop when nothing changes. O(tkn). Works for continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data. K-medoids: representative objects, e.g. PAM.
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until everything ends up in 1 cluster; top-down DIANA (Divisive Analysis). O(n²). Cluster distance: single link for the min element-wise distance; complete link for the max; average for the average over element pairs; centroid; medoid (center object).
DBSCAN: set Eps and MinPts. Neighborhood N(q) = {p ∈ D | dist(p, q) ≤ Eps}. Core point: |N(q)| ≥ MinPts. p is directly density-reachable from q if q is a core point and p ∈ N(q); density-reachable if q → p_2 → · · · → p; density-connected if o → · · · → p and o → · · · → q. A cluster is a maximal set of density-connected points; remaining points are noise. DFS; O(n log n) with a spatial index, else O(n²).
Mixture model: soft clustering (w_ij ∈ [0, 1] rather than w_ij ∈ {0, 1}); joint probability of object i and cluster C_j: p(x_i, z_i = C_j) = w_j f_j(x_i); fit with the EM algorithm.
Gaussian Mixture Model (GMM): ⊃ k-means. Generative model: for each object, pick a cluster Z, then sample a value from X|Z ∼ N(μ_Z, σ_Z²). Overall likelihood L(D|θ) = ∏_i ∑_j w_j p(x_i|μ_j, σ_j²).
E-step: w_ij^{t+1} = w_j^t p(x_i|μ_j^t, (σ_j²)^t) / ∑_k w_k^t p(x_i|μ_k^t, (σ_k²)^t).
M-step (1-d case): μ_j^{t+1} = ∑_i w_ij^{t+1} x_i / ∑_i w_ij^{t+1}, (σ_j²)^{t+1} = ∑_i w_ij^{t+1}(x_i − μ_j^{t+1})² / ∑_i w_ij^{t+1}, w_j^{t+1} = ∑_i w_ij^{t+1}/n.
Why EM works: the E-step finds a tight lower bound L of ℓ at θ_old; the M-step finds θ_new to maximize that lower bound; ℓ(θ_new) ≥ L(θ_new) ≥ L(θ_old) = ℓ(θ_old).
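A minimal k-means (Lloyd-style assign/update loop, as described at the top of this box) sketch in NumPy; the initialization from random data points and the toy blobs are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign each point to its closest centroid, then move each centroid to the
    mean of its assigned points; stop when the centroids no longer change."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every point
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy usage: two well-separated blobs
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [5, 5]
```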
Clustering
Extrinsic (supervised) vs. intrinsic (unsupervised).
purity(C, Ω) = (1/N) ∑_k max_j |c_k ∩ ω_j| (C is the clustering output, Ω the ground truth).
Normalized Mutual Information: NMI(C, Ω) = I(C, Ω)/√(H(C) H(Ω)), with
I(C, Ω) = ∑_k ∑_j P(c_k ∩ ω_j) log [P(c_k ∩ ω_j)/(P(c_k) P(ω_j))] = ∑_k ∑_j (|c_k ∩ ω_j|/N) log [N|c_k ∩ ω_j|/(|c_k||ω_j|)],
H(Ω) = −∑_j P(ω_j) log P(ω_j) = −∑_j (|ω_j|/N) log(|ω_j|/N).
Precision and recall: pairs in the same / different class / cluster.
Selecting k: plot square loss vs. k (larger k gives smaller cost) and find the knee point; BIC penalizes model size; cross-validation.
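A small sketch computing purity and NMI from cluster assignments and ground-truth labels, following the definitions above; the toy labels are illustrative.

```python
import numpy as np
from collections import Counter

def purity(clusters, truth):
    """purity = (1/N) * sum_k max_j |c_k ∩ w_j|."""
    n = len(truth)
    total = 0
    for c in set(clusters):
        members = [t for cl, t in zip(clusters, truth) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def nmi(clusters, truth):
    """NMI(C, Omega) = I(C, Omega) / sqrt(H(C) * H(Omega))."""
    n = len(truth)
    def entropy(labels):
        return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())
    i = 0.0
    for c in set(clusters):
        for w in set(truth):
            joint = sum(1 for cl, t in zip(clusters, truth) if cl == c and t == w) / n
            if joint > 0:
                i += joint * np.log(joint / ((clusters.count(c) / n) * (truth.count(w) / n)))
    return i / np.sqrt(entropy(clusters) * entropy(truth))

clusters = [0, 0, 0, 1, 1, 1]
truth    = ['x', 'x', 'y', 'y', 'y', 'y']
print(purity(clusters, truth), nmi(clusters, truth))
```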
Evaluation: Clustering
Mining by exploring the vertical data format, similar to an inverted index: keep a tid-list t(A) that stores the transaction ids in which an itemset appears. t(X) = t(Y) means P(XY) is high; t(X) ⊂ t(Y) means P(Y|X) is high. A diffset is used to accelerate mining (it keeps track of differences between tid-lists).
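A tiny sketch of the core tid-list operation: the support of an itemset is the size of the intersection of its items' tid-lists. The toy vertical database is illustrative.

```python
def eclat_support(tidlists, itemset):
    """Support of an itemset = size of the intersection of its items' tid-lists."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids), tids

# toy vertical database: item -> set of transaction ids
tidlists = {"A": {1, 2, 3, 5}, "B": {2, 3, 5}, "C": {1, 4}}
print(eclat_support(tidlists, ("A", "B")))   # (3, {2, 3, 5})
```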
Eclat
confidence(A ⇒ B) = P(B|A) = P(A ∪ B)/P(A); rules are generated from a frequent pattern l and all of its non-empty subsets.
Lift(A, B) = P(A ∪ B)/(P(A)P(B)): = 1 independent, > 1 positively correlated, < 1 negatively correlated.
χ² = ∑ (Observed − Expected)²/Expected; look up the p-value = P(χ² > observed value) in a table; if the p-value is small enough, the null hypothesis is rejected, so A and B are dependent.
all_confidence = min{P(A|B), P(B|A)}; max_confidence = max{P(A|B), P(B|A)}; Kulczynski = ½(P(A|B) + P(B|A)); Cosine: cos(A, B) = P(A ∪ B)/√(P(A)P(B)).
Lift and χ² are affected by null-transactions, i.e. the "not A and not B" transactions. Imbalance Ratio: IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)), where sup refers to support.
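A small sketch computing the measures above from raw support counts; the function name and the toy numbers are illustrative.

```python
import math

def rule_measures(sup_a, sup_b, sup_ab, n):
    """Interestingness measures from the supports (counts) of A, B, and A∪B out of n."""
    p_a, p_b, p_ab = sup_a / n, sup_b / n, sup_ab / n
    p_a_given_b, p_b_given_a = sup_ab / sup_b, sup_ab / sup_a
    return {
        "confidence(A=>B)": p_b_given_a,
        "lift": p_ab / (p_a * p_b),                       # affected by null-transactions
        "all_confidence": min(p_a_given_b, p_b_given_a),
        "max_confidence": max(p_a_given_b, p_b_given_a),
        "kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": p_ab / math.sqrt(p_a * p_b),
        "imbalance_ratio": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }

# toy usage: 1000 transactions, A in 300, B in 250, both in 200
print(rule_measures(sup_a=300, sup_b=250, sup_ab=200, n=1000))
```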
Association Rules
GSP: an element/event is a non-empty unordered set of items; a sequence is an ordered list of events; the length is the number of instances of items included. Always written like 〈a(bc)de(f gh)〉. A is B's subsequence means: every element of A is a subset of a corresponding element of B, and those elements of B appear in the same order as in A. Start from the same L_1; the major difference is the join: s_1 and s_2 can be joined only if s_1 with its 1st item dropped and s_2 with its last item dropped are the same; the joined result is s_1[0], s_mid, s_2[−1]. Note that all items within an element are "sorted" by the f-list.
SPADE (vertical format): DB : {〈SID, EID, Items〉} ⇒ Item (or subsequence) : {〈SID, EID〉}; then grow the subsequences one item at a time, Apriori-style, by joining two of those {〈SID, EID〉} tables (e.g. a, b ⇒ ab, ba ⇒ aba, bab). Similar limitations to GSP: costly candidate generation, multiple scans due to BFS, and trouble with long patterns.
"_": a placeholder used when the last item of the prefix comes from the first element of the suffix.
Prefix-based projection (α′): the projection of α w.r.t. prefix β is the maximum subsequence of α with prefix β; e.g. α = 〈a(abc)(ac)d(cf)〉, β = 〈ad〉, then α′ = 〈ad(cf)〉.
Start from L_1, project the database into |L_1| projected databases accordingly, and mine each subset recursively via the corresponding projected databases (e.g. a-proj ⇒ ab-proj).
Note that a and (_a) are counted differently: with s_1 the last element of the prefix, (_a) counts only when a appears at the front of the suffix, i.e. as (s_1 a …).
No candidate generation is needed; the major cost is projection. The projected DB keeps shrinking and can be improved by pseudo-projection (pointers to the division point between prefix and suffix save time and space; this works well unless the DB is too big for main memory, since disk access is slow).
Prefix Span
Time series Y = {Y_t : t ∈ T}, with time index T. An observation of a time series with length N can be represented as Y = {y_1, y_2, ..., y_N}.
L_p distance: d(C, Q) = (∑ |c_i − q_i|^p)^{1/p}. The L_p norm cannot deal with offset and scaling (solution: normalization c'_i = (c_i − μ(C))/σ(C)).
Warp the time axis? Works even with different lengths. X = {x_1, ..., x_N}, Y = {y_1, ..., y_M}: find an alignment such that the overall cost is minimized. Local distance (cost) between x_n and y_m: c(x_n, y_m); we can form an N × M matrix of costs between all pairs.
Goal: find an (N, M)-warping path p = (p_1, p_2, ..., p_L) with p_l = (n_l, m_l), subject to: (1) boundary, p_1 = (1, 1), p_L = (N, M); (2) monotonicity, n_l and m_l non-decreasing in l; (3) step size, p_{l+1} − p_l ∈ {(0, 1), (1, 0), (1, 1)}.
Solve by DP: D(n, m) = min{D(n − 1, m), D(n, m − 1), D(n − 1, m − 1)} + c(x_n, y_m), where D(n, m) denotes the DTW distance between X(1, ..., n) and Y(1, ..., m). D(N, M) = DTW(X, Y), D(n, 1) = ∑_{k=1}^n c(x_k, y_1), D(1, m) = ∑_{k=1}^m c(x_1, y_k). O(NM) time complexity.
Trace back to find p* from D: given p_l = (n, m), p_{l−1} is (1, m − 1) if n = 1, (n − 1, 1) if m = 1, and otherwise argmin{D(n − 1, m − 1), D(n − 1, m), D(n, m − 1)}.
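A minimal sketch of the DP recurrence (distance only, no trace-back); the absolute-difference cost and the toy series are illustrative.

```python
import numpy as np

def dtw(x, y, cost=lambda a, b: abs(a - b)):
    """DTW distance via D(n, m) = min(D(n-1, m), D(n, m-1), D(n-1, m-1)) + c(x_n, y_m)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # inf border enforces the boundary condition
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(x[i - 1], y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# toy usage: the second series is a stretched copy of the first
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 2, 3, 3, 2, 1, 0]
print(dtw(a, b))   # 0.0 -- warping absorbs the stretching
```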
Dynamic Time Warping (DTW)
Sometimes series data need to be transformed into the Fourier domain for evaluation. X_f = (1/√n) ∑_{t=0}^{n−1} x_t exp(−j2πft/n), f = 0, 1, ..., n − 1. Parseval's theorem: ∑_{t=0}^{n−1} |x_t|² = ∑_{f=0}^{n−1} |X_f|², so Euclidean distances in the time and frequency domains are the same. Keeping only the first few coefficients brings no false dismissals.
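A small NumPy check of Parseval's theorem and of the lower-bounding property of truncated coefficients, using the orthonormal (1/√n) FFT; the toy signals are illustrative.

```python
import numpy as np

# Parseval's theorem with the orthonormal DFT: energy is preserved
rng = np.random.default_rng(5)
x = rng.normal(size=64)
X = np.fft.fft(x, norm="ortho")        # X_f = (1/sqrt(n)) * sum_t x_t e^{-j 2 pi f t / n}
print(np.allclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2)))   # True

# keeping only the first k coefficients lower-bounds the true distance,
# so filtering with the truncated distance cannot cause false dismissals
y = x + rng.normal(scale=0.1, size=64)
Y = np.fft.fft(y, norm="ortho")
k = 8
print(np.linalg.norm((X - Y)[:k]) <= np.linalg.norm(x - y))          # True
```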
Naive Time and Frequency Domain
jth lag of Y_t: Y_{t−j}; first difference: ΔY_t = Y_t − Y_{t−1}; jth autocorrelation ρ_j = corr(Y_t, Y_{t−j}) = cov(Y_t, Y_{t−j}) / √(var(Y_t) var(Y_{t−j})), with cov(Y_t, Y_{t−j}) = (1/(T − j − 1)) ∑_{t=j+1}^T (Y_t − Ȳ_{j+1,T})(Y_{t−j} − Ȳ_{1,T−j}). AR(1) check: Y_t = β_0 + β_1 Y_{t−1} + u_t; β_1 = 0 means the lag is useless.
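A small sketch of the lag-j autocorrelation, computed as the sample Pearson correlation of the overlapping segments (which matches the definition above up to the normalizing constant); the simulated AR(1) series is illustrative.

```python
import numpy as np

def autocorr(y, j):
    """rho_j = corr(Y_t, Y_{t-j}), estimated on the overlapping parts of the series."""
    return np.corrcoef(y[j:], y[:-j])[0, 1]

# toy usage: simulate an AR(1) process Y_t = 0.8 * Y_{t-1} + u_t
rng = np.random.default_rng(6)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()
print(round(autocorr(y, 1), 2))   # should be close to 0.8
```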
Prediction
Bayes' theorem: P(h|X) = P(X|h)P(h)/P(X). X: data sample (evidence); h: the hypothesis class(X) = Y; P(X) is fixed; P(h) = π is the prior probability; P(X|h) is the likelihood ∏_n β_{yn}^{x_n}; P(X) = ∑_h P(X|h)P(h); P(h|X) is the posterior probability. Maximum a posteriori: h_MAP = argmax_h P(X|h)P(h).
y* = argmax_y ∏_n β_{yn}^{x_n} × π_y = argmax_y ∑_n x_n log β_{yn} + log π_y.
Estimation: n indexes words, j classes, D is the set of documents, d a document. β_{jn} = (count of word n in class j) / (count of all words in class j) (smoothing: add 1 to the numerator and N to the denominator), π_j = (number of d in class j)/|D|.
For a test document t, p(y = c|x_t) ∝ p(y = c) × ∏_n (β_{cn})^{x_{tn}}, where x_{tn} is the number of appearances of word n in x_t.
A generative model (not discriminative, like logistic regression): generative models model P(X, Y), discriminative models P(Y|X).
Multinoulli: multiple classes, one try-out (z ∼ Multinoulli(π)); Multinomial: multiple classes, multiple try-outs (x_d ∼ Multinomial(β_d)).
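A minimal sketch of multinomial naive Bayes with add-1 smoothing, following the β and π estimates above; the function names and toy corpus are illustrative.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate pi_j = P(class j) and beta_jn = P(word n | class j) with add-1 smoothing."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for words in docs for w in words}
    beta = {c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab} for c in classes}
    return prior, beta, vocab

def predict_nb(words, prior, beta, vocab):
    """argmax_y  log pi_y + sum_n x_n log beta_yn (words outside the vocabulary are ignored)."""
    scores = {c: math.log(prior[c]) +
                 sum(math.log(beta[c][w]) for w in words if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)

docs = [["win", "money", "now"], ["meeting", "at", "noon"], ["win", "prize"], ["lunch", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
prior, beta, vocab = train_nb(docs, labels)
print(predict_nb(["win", "money"], prior, beta, vocab))   # spam
```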
Naive Bayes for Text
Corpus: a collection of documents; word w, document d, topic z; word count in a document c(w, d); per-topic word distribution β_{zw} = p(w|z); per-document (soft) topic distribution θ_{dz} = p(z|d).
Maximize log L = ∑_{d,w} c(w, d) log ∑_z θ_{dz} β_{zw} s.t. ∑_z θ_{dz} = 1 and ∑_w β_{zw} = 1, optimized by EM until convergence.
Generally, E: p(z|w, d) ∝ p(w|z, d) p(z|d) = β_{zw} θ_{dz}; M: β_{zw} ∝ ∑_d p(z|w, d) c(w, d), θ_{dz} ∝ ∑_w p(z|w, d) c(w, d).
Concretely, E: p(z|w, d) = β_{zw} θ_{dz} / ∑_{z′} β_{z′w} θ_{dz′}; M: β_{zw} = ∑_d p(z|w, d) c(w, d) / ∑_{w′,d} p(z|w′, d) c(w′, d), θ_{dz} = ∑_w p(z|w, d) c(w, d) / N_d, where N_d is the number of words in document d.
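A minimal sketch of the pLSA EM updates on a documents × words count matrix; the random initialization and the toy corpus are illustrative.

```python
import numpy as np

def plsa(counts, k, n_iter=100, seed=0):
    """EM for pLSA on a (docs x words) count matrix:
    E: p(z|w,d) ∝ beta_zw * theta_dz;  M: re-estimate beta and theta from expected counts."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    theta = rng.random((n_docs, k)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((k, n_words)); beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | w, d), shape (docs, words, topics)
        p_z = theta[:, None, :] * beta.T[None, :, :]
        p_z /= p_z.sum(axis=2, keepdims=True)
        # M-step: expected counts c(w,d) * p(z|w,d) -> new beta_zw and theta_dz
        expected = counts[:, :, None] * p_z
        beta = expected.sum(axis=0).T                  # (topics, words)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = expected.sum(axis=1)                   # (docs, topics)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta

# toy usage: two obvious "topics" in a 4-document, 4-word corpus
counts = np.array([[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]], float)
theta, beta = plsa(counts, k=2)
print(np.round(theta, 2))   # docs 0-1 and 2-3 should load on different topics
```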
pLSA