


Useful cheat sheet with formulas and main concepts for the Data Mining midterm exam
Model                    Data Type   Task Type
Linear Regression        Vector      Prediction
Logistic Regression      Vector      Classification
Decision Tree            Vector      Classification
SVM                      Vector      Classification
NN                       Vector      Classification
KNN                      Vector      Classification
K-means                  Vector      Clustering
Hierarchical Clustering  Vector      Clustering
DBSCAN                   Vector      Clustering
Mixture Models           Vector      Clustering
Models
Minkowski distance d(x, y) = (∑_{i=1}^d |x_i − y_i|^h)^{1/h}; h = 1 Manhattan (l1), h = 2 Euclidean (l2), h → ∞ supremum (l∞); the triangle inequality applies (d(i, j) ≤ d(i, k) + d(k, j)).
Basic Concepts
∂(x^T Ax)/∂x = (A + A^T)x
Gaussian density: f(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}
var((∑_i f_i(x))/t) = var(f_i(x))/t for independent f_i
∑_i a_i b_i = ‖a‖‖b‖ cos(a, b)
Formula
Entropy: H(Y) = −∑_{i=1}^m p_i log(p_i); Conditional Entropy: H(Y|X) = ∑_x p(x) H(Y|X = x); Cross Entropy: H(q, p) = −∑_k q_k log(p_k)
Tools
y = x^T β, with bias term x_{i0} = 1; X: n × (p + 1) matrix, y: n × 1 vector, β: (p + 1) × 1 vector; y is continuous.
OLS (Ordinary Least Squares): J(β) = (1/2n)(Xβ − y)^T(Xβ − y) = (1/2n)(β^T X^T Xβ − y^T Xβ − β^T X^T y + y^T y). Closed-form solution: set ∂J/∂β = 0, giving β̂ = (X^T X)^{−1} X^T y.
Gradient descent: β^{(t+1)} := β^{(t)} − ηΔ. Batch GD (converges): Δ = ∂J/∂β = ∑_i x_i(x_i^T β − y_i)/n. Stochastic GD (n updates per pass): Δ = −(y_i − x_i^T β^{(t)}) x_i.
LR with probabilistic interpretation (MLE, Maximum Likelihood Estimation): L(β) = ∏_i p(y_i|x_i, β) = ∏_i N(x_i^T β, σ²) = ∏_i (1/√(2πσ²)) exp{−(y_i − x_i^T β)²/(2σ²)}.
To keep X^T X invertible: add λ∑_{j=1}^p β_j² to ∑_i (y_i − x_i^T β)² (Ridge Regression, i.e. linear regression with an l2 penalty).
Non-linear correlation: create new terms, e.g. x².
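As a quick illustration of the closed-form solution β̂ = (X^T X)^{−1} X^T y and the ridge variant, here is a minimal NumPy sketch; the function names and toy data are illustrative, not part of the course material.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
    X is assumed to already include the bias column x_0 = 1."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: adding lam * I to X^T X keeps the system invertible."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# toy usage: y ~ 2 + 3x with a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])          # prepend the bias column
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)
print(ols_fit(X, y))                           # roughly [2, 3]
```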
Linear Regression
Generalized linear model (GLM). P(Y = 1|X, β) = σ(X^T β) = e^{X^T β}/(1 + e^{X^T β}), P(Y = 0|X, β) = 1 − σ(X^T β) = 1/(1 + e^{X^T β}), Y|X, β ∼ Bernoulli(σ(X^T β)).
MLE: L = ∏_i p_i^{y_i}(1 − p_i)^{1−y_i}, where p_i = P(Y = 1|x_i, β). Equivalent to maximizing the log-likelihood ℓ(β) = ∑_i (y_i x_i^T β − log(1 + e^{x_i^T β})).
Gradient ascent: β_new = β_old + η ∂ℓ(β)/∂β. Newton–Raphson update: β_new = β_old − (∂²ℓ(β)/∂β∂β^T)^{−1} ∂ℓ(β)/∂β.
Cross-entropy loss (p for prediction, q for ground truth; (q_0, q_1)|_{y=0} = (1, 0), (q_0, q_1)|_{y=1} = (0, 1), (p_0, p_1) = (P(Y = 0), P(Y = 1))): H(p, q) = −y x^T β + log(1 + e^{x^T β}).
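A minimal sketch of gradient ascent on the log-likelihood above, whose gradient is X^T(y − σ(Xβ)); the averaging, step size, and toy data are illustrative choices, and X is assumed to carry a bias column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, eta=0.5, n_iter=2000):
    """Gradient ascent on l(beta) = sum_i (y_i x_i^T beta - log(1 + exp(x_i^T beta))),
    using the averaged gradient X^T (y - sigma(X beta)) / n for a stable step size."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta += eta * X.T @ (y - sigmoid(X @ beta)) / len(y)
    return beta

# toy usage: labels drawn from a logistic model with beta = [-1, 2]
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
X = np.hstack([np.ones((200, 1)), x])
y = (rng.random(200) < sigmoid(X @ np.array([-1.0, 2.0]))).astype(float)
print(logreg_fit(X, y))   # roughly recovers [-1, 2]
```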
Logistic Regression
A framework to approach maximum likelihood. p(x_i, z_i = C_j) = w_j f_j(x_i), p(x_i) = ∑_j w_j f_j(x_i), p(D) = ∏_i p(x_i) = ∏_i ∑_j w_j f_j(x_i), log p(D) = ∑_i log(∑_j w_j f_j(x_i)).
E (expectation) step assigns objects to clusters: w_ij^{t+1} = p(z_i = j|θ_j^t, x_i) ∝ p(x_i|z_i = j, θ_j^t) p(z_i = j) = f_j(x_i) w_j.
M (maximization) step finds the new clustering parameters w.r.t. the conditional distribution p(z_i = j|θ_j^t, x_i): θ^{t+1} = argmax_θ ∑_i ∑_j w_ij^{t+1} log L(x_i, z_i = j|θ).
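A minimal sketch of these E/M updates for a 1-d Gaussian mixture (the concrete 1-d update formulas also appear in the Clustering box below); the initialization scheme and toy data are illustrative.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    """EM for a 1-d Gaussian mixture: the E-step computes soft assignments
    w_ij proportional to w_j * f_j(x_i); the M-step re-estimates mu, var, w."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # init means from data points
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, shape (n, k)
        r = w * gauss(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means, variances and mixing weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

# toy usage: two well-separated 1-d Gaussians
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 700)])
print(em_gmm_1d(x))   # weights near 0.3 / 0.7, means near -3 / 3
```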
EM Algorithm
m = number of classes |y| in D, v = number of values |A| of attribute A.
Expected information needed to classify a tuple in D: Info(D) = −∑_{i=1}^m p_i log2(p_i). Info after splitting on A: Info_A(D) = ∑_{j=1}^v (|D_j|/|D|) × Info(D_j). Info Gain (ID3): Gain(A) = Info(D) − Info_A(D). Info gain is biased towards multi-valued attributes.
SplitInfo_A(D) = −∑_{j=1}^v (|D_j|/|D|) × log2(|D_j|/|D|). Gain Ratio (C4.5): GainRatio(A) = Gain(A)/SplitInfo_A(D). The gain ratio is biased towards unbalanced splits.
Gini(D) = 1 − ∑_{j=1}^m p_j² measures impurity. Gini_A(D) = ∑_{j=1}^v (|D_j|/|D|) Gini(D_j). Gini (CART): ΔGini(A) = Gini(D) − Gini_A(D). The Gini index is also biased towards multi-valued attributes.
STOP: all samples in the same class; last attribute; no samples left (majority voting). Avoid overfitting: pre-/post-pruning, random forest.
Classification → Prediction: majority vote → e.g. average at the leaf node, turning the tree into a regression tree; Var(D_j) = ∑_{y∈D_j}(y − ȳ)²/|D_j|, look for the lowest weighted average variance Var_A(D) = ∑_{j=1}^v (|D_j|/|D|) × Var(D_j).
A different view: each leaf = a box in the feature plane.
A random forest is a set of trees (ensemble, bagging): good at classification, handles large and missing data, not as good at prediction, lacks interpretability.
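A small sketch of the entropy, Gini, and information-gain formulas for a single categorical split; the function names and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(labels, attr_values):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j), splitting on attribute A."""
    n = len(labels)
    info_a = 0.0
    for v in set(attr_values):
        subset = [l for l, a in zip(labels, attr_values) if a == v]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

# toy usage: the attribute perfectly predicts the label, so the gain is 1 bit
labels = ['yes', 'yes', 'no', 'no']
attr   = ['a', 'a', 'b', 'b']
print(info_gain(labels, attr))   # 1.0
```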
Decision Tree
y = sign(w · x + b), separating hyperplane y = 0. SVM searches for the Maximum Marginal Hyperplane. To maximize the margin ρ = 2/‖w‖, use Lagrange multipliers α: L(w, b, α) = ½ w^T w − ∑_i α_i (y_i(w^T x_i + b) − 1); ∂L/∂w = w − ∑_i α_i y_i x_i = 0, ∂L/∂b = −∑_i α_i y_i = 0.
Solution: w = ∑ α_i y_i x_i, b = y_k − w^T x_k; f(x) = w^T x + b = ∑ α_i y_i x_i^T x + b, default threshold 0.
Linear vs. non-linear SVM: kernels. Non-linear decision boundary: f(x) = w^T Φ(x) + b = ∑ α_i y_i K(x_i, x) + b.
Scalability: CF-tree, hierarchical micro-clusters, selective declustering (decluster the clusters that could be support clusters; a support cluster has its centroid on a support vector).
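A sketch of the kernel decision function f(x) = ∑ α_i y_i K(x_i, x) + b, assuming the multipliers α_i, the support vectors, and b have already been obtained from the dual; the RBF kernel, its gamma, and the hand-picked toy values are illustrative, not a trained model.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) (an illustrative kernel choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, classified by sign(f(x))."""
    f = sum(a_i * y_i * kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, support_y, support_x)) + b
    return np.sign(f)

# toy usage with hand-picked (not trained) multipliers, labels in {-1, +1}
sv = np.array([[0.0, 0.0], [2.0, 2.0]])
print(svm_decision(np.array([0.2, 0.1]), sv, [-1.0, 1.0], [1.0, 1.0], b=0.0))  # -1.0
```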
x_i → (× w_i) → ∑ (+ b) → f → o
Input vector x, weight vector w, bias b; the weighted sum goes through the activation function f to produce the output o.
Perceptron (Single Unit)
Stochastic GD + chain rule. Special case: sigmoid + square loss, 2 layers. Let i, j, k denote the input, hidden, and output layers, O the output, T the true value.
Err_k = O_k(1 − O_k)(T_k − O_k), Err_j = O_j(1 − O_j) ∑_k Err_k w_jk; updates: w_ij = w_ij + η Err_j O_i, w_jk = w_jk + η Err_k O_j, θ_j = θ_j + η Err_j, θ_k = θ_k + η Err_k.
∂J/∂w_ij = (∂J/∂O_k)(∂O_k/∂O_j)(∂O_j/∂w_ij) = −∑_k [(T_k − O_k)][O_k(1 − O_k) w_jk][O_j(1 − O_j) O_i].
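A minimal sketch of one backpropagation update for the sigmoid + square-loss case above (one hidden layer); the layer sizes, initial weights, and single-sample usage are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, w_ij, w_jk, theta_j, theta_k, eta=0.5):
    """One in-place update for a 2-layer sigmoid net with square loss, following
    Err_k = O_k(1-O_k)(T_k-O_k) and Err_j = O_j(1-O_j) * sum_k Err_k w_jk."""
    o_j = sigmoid(x @ w_ij + theta_j)         # hidden outputs O_j
    o_k = sigmoid(o_j @ w_jk + theta_k)       # network outputs O_k
    err_k = o_k * (1 - o_k) * (t - o_k)       # output-layer error
    err_j = o_j * (1 - o_j) * (w_jk @ err_k)  # hidden-layer error
    w_jk += eta * np.outer(o_j, err_k)        # w_jk += eta * Err_k * O_j
    w_ij += eta * np.outer(x, err_j)          # w_ij += eta * Err_j * O_i
    theta_k += eta * err_k
    theta_j += eta * err_j
    return o_k

# toy usage: 2 inputs, 3 hidden units, 1 output, one update on a single sample
rng = np.random.default_rng(0)
w_ij, w_jk = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))
theta_j, theta_k = np.zeros(3), np.zeros(1)
print(backprop_step(np.array([1.0, 0.0]), np.array([1.0]), w_ij, w_jk, theta_j, theta_k))
```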
Backpropagation (BP)
Number of layers: n_layers = n_hidden + n_output (the single output layer counts as 1; the input layer is not counted). Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.
Neural Network (NN)
Lazy learning (instead of eager), instance-based. Consider the k nearest neighbors; majority voting or averaging (could be distance-weighted). Curse of dimensionality: influence of noise; get rid of irrelevant features and select a proper k.
Proximity refers to similarity or dissimilarity. It always applies to binary values: if nominal, do simple matching or use a series of binary variables to represent a non-binary one; if ordinal, use the rank and normalize z_if = (r_if − 1)/(M_f − 1).
For binary variables, dissimilarity can be measured by (|(0,1)| + |(1,0)|)/all for symmetric variables, or (|(0,1)| + |(1,0)|)/(all − |(0,0)|) for asymmetric ones; the Jaccard coefficient (similarity) is |(1,1)|/(all − |(0,0)|).
Mixed-type attributes: weighted combination. Another method: cosine similarity cos(d_1, d_2).
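A minimal majority-vote kNN sketch with Euclidean distance; the function name and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy usage: two small groups of points
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y_train = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(X_train, y_train, np.array([0.2, 0.3])))   # 'a'
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # 'b'
```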
k - Nearest Neighbors (kNN)
Holdout method; cross-validation (k-fold); leave-one-out (LOO).
Confusion matrix: true / false positive / negative.
Accuracy = (TP + TN) / All; Error rate = (FP + FN) / All; Sensitivity = TP / P (P = TP + FN); Specificity = TN / N (N = FP + TN); Precision = TP / P' (P' = TP + FP); Recall = TP / P = Sensitivity.
F1 / F-score = 2 × Precision × Recall / (Precision + Recall); F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall) (weights Recall : Precision = β : 1).
ROC curve: TP rate (y-axis) vs. FP rate (x-axis); evaluate by the area under the curve. TPR = TP / P, FPR = FP / N.
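The metrics above, computed directly from confusion-matrix counts; a small illustrative sketch with made-up counts.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Evaluation metrics from the four confusion-matrix counts, as defined above."""
    p, n = tp + fn, fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                      # = sensitivity = TPR
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,           # 1 - FPR
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "f_beta": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
    }

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```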
Evaluation: Classification
K-means: J = ∑_{j=1}^k ∑_i w_ij ‖x_i − c_j‖². Assign w_ij = 1 for each x_i to its closest c_j; set each center to the new centroid of its points; stop when nothing changes. O(tkn). Works for continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data. K-medoids: representative objects, e.g. PAM.
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until everything ends up in 1 cluster; top-down DIANA (Divisive Analysis). O(n²). Cluster distance: single link for the min element-wise distance; complete link for the max; average for the average over element pairs; centroid; medoid (center object).
DBSCAN: set Eps and MinPts. Neighborhood N(q) = {p ∈ D | dist(p, q) ≤ Eps}. Core point: |N(q)| ≥ MinPts. p is directly density-reachable from q if q is a core point and p ∈ N(q); density-reachable if q → p_2 → · · · → p; density-connected if o → · · · → p and o → · · · → q. A cluster is a maximal set of density-connected points; remaining points are noise. DFS; O(n log n) with a spatial index, else O(n²).
Mixture model: soft clustering (w_ij ∈ [0, 1] rather than w_ij ∈ {0, 1}); joint probability of object i and cluster C_j: p(x_i, z_i = C_j) = w_j f_j(x_i); fit with the EM algorithm.
Gaussian Mixture Model (GMM): ⊃ k-means. Generative model: for each object, pick a cluster Z, then sample a value from X|Z ∼ N(μ_Z, σ_Z²). Overall likelihood L(D|θ) = ∏_i ∑_j w_j p(x_i|μ_j, σ_j²).
E-step: w_ij^{t+1} = w_j^t p(x_i|μ_j^t, (σ_j²)^t) / ∑_k w_k^t p(x_i|μ_k^t, (σ_k²)^t).
M-step (1-d case): μ_j^{t+1} = ∑_i w_ij^{t+1} x_i / ∑_i w_ij^{t+1}, (σ_j²)^{t+1} = ∑_i w_ij^{t+1}(x_i − μ_j^{t+1})² / ∑_i w_ij^{t+1}, w_j^{t+1} = ∑_i w_ij^{t+1}/n.
Why EM works: the E-step finds a tight lower bound L of ℓ at θ_old; the M-step finds θ_new to maximize that lower bound; ℓ(θ_new) ≥ L(θ_new) ≥ L(θ_old) = ℓ(θ_old).
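A minimal k-means (Lloyd-style assign/update loop, as described at the top of this box) sketch in NumPy; the initialization from random data points and the toy blobs are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign each point to its closest centroid, then move each centroid to the
    mean of its assigned points; stop when the centroids no longer change."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every point
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy usage: two well-separated blobs
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [5, 5]
```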
Clustering
Extrinsic (supervised) vs. intrinsic (unsupervised).
purity(C, Ω) = (1/N) ∑_k max_j |c_k ∩ ω_j| (C is the clustering output, Ω the ground truth).
Normalized Mutual Information: NMI(C, Ω) = I(C, Ω)/√(H(C) H(Ω)), with
I(C, Ω) = ∑_k ∑_j P(c_k ∩ ω_j) log [P(c_k ∩ ω_j)/(P(c_k) P(ω_j))] = ∑_k ∑_j (|c_k ∩ ω_j|/N) log [N|c_k ∩ ω_j|/(|c_k||ω_j|)],
H(Ω) = −∑_j P(ω_j) log P(ω_j) = −∑_j (|ω_j|/N) log(|ω_j|/N).
Precision and recall: pairs in the same / different class / cluster.
Selecting k: plot square loss vs. k (larger k gives smaller cost) and find the knee point; BIC penalizes model size; cross-validation.
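A small sketch computing purity and NMI from cluster assignments and ground-truth labels, following the definitions above; the toy labels are illustrative.

```python
import numpy as np
from collections import Counter

def purity(clusters, truth):
    """purity = (1/N) * sum_k max_j |c_k ∩ w_j|."""
    n = len(truth)
    total = 0
    for c in set(clusters):
        members = [t for cl, t in zip(clusters, truth) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def nmi(clusters, truth):
    """NMI(C, Omega) = I(C, Omega) / sqrt(H(C) * H(Omega))."""
    n = len(truth)
    def entropy(labels):
        return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())
    i = 0.0
    for c in set(clusters):
        for w in set(truth):
            joint = sum(1 for cl, t in zip(clusters, truth) if cl == c and t == w) / n
            if joint > 0:
                i += joint * np.log(joint / ((clusters.count(c) / n) * (truth.count(w) / n)))
    return i / np.sqrt(entropy(clusters) * entropy(truth))

clusters = [0, 0, 0, 1, 1, 1]
truth    = ['x', 'x', 'y', 'y', 'y', 'y']
print(purity(clusters, truth), nmi(clusters, truth))
```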
Evaluation: Clustering
Mining by exploring the vertical data format, similar to an inverted index: keep a tid-list t(A) that stores the transaction ids in which an itemset appears. t(X) = t(Y) means P(XY) is high; t(X) ⊂ t(Y) means P(Y|X) is high. A diffset is used to accelerate mining (it keeps track of differences between tid-lists).
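A tiny sketch of the core tid-list operation: the support of an itemset is the size of the intersection of its items' tid-lists. The toy vertical database is illustrative.

```python
def eclat_support(tidlists, itemset):
    """Support of an itemset = size of the intersection of its items' tid-lists."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids), tids

# toy vertical database: item -> set of transaction ids
tidlists = {"A": {1, 2, 3, 5}, "B": {2, 3, 5}, "C": {1, 4}}
print(eclat_support(tidlists, ("A", "B")))   # (3, {2, 3, 5})
```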
Eclat
confidence(A ⇒ B) = P(B|A) = P(A ∪ B)/P(A); rules are generated from a frequent pattern l and all of its non-empty subsets.
Lift(A, B) = P(A ∪ B)/(P(A)P(B)): = 1 independent, > 1 positively correlated, < 1 negatively correlated.
χ² = ∑ (Observed − Expected)²/Expected; look up the p-value = P(χ² > observed value) in a table; if the p-value is small enough, the null hypothesis is rejected, so A and B are dependent.
all_confidence = min{P(A|B), P(B|A)}; max_confidence = max{P(A|B), P(B|A)}; Kulczynski = ½(P(A|B) + P(B|A)); Cosine: cos(A, B) = P(A ∪ B)/√(P(A)P(B)).
Lift and χ² are affected by null-transactions, i.e. the "not A and not B" transactions. Imbalance Ratio: IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)), where sup refers to support.
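A small sketch computing the measures above from raw support counts; the function name and the toy numbers are illustrative.

```python
import math

def rule_measures(sup_a, sup_b, sup_ab, n):
    """Interestingness measures from the supports (counts) of A, B, and A∪B out of n."""
    p_a, p_b, p_ab = sup_a / n, sup_b / n, sup_ab / n
    p_a_given_b, p_b_given_a = sup_ab / sup_b, sup_ab / sup_a
    return {
        "confidence(A=>B)": p_b_given_a,
        "lift": p_ab / (p_a * p_b),                       # affected by null-transactions
        "all_confidence": min(p_a_given_b, p_b_given_a),
        "max_confidence": max(p_a_given_b, p_b_given_a),
        "kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": p_ab / math.sqrt(p_a * p_b),
        "imbalance_ratio": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }

# toy usage: 1000 transactions, A in 300, B in 250, both in 200
print(rule_measures(sup_a=300, sup_b=250, sup_ab=200, n=1000))
```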
Association Rules
GSP: an element/event is a non-empty unordered set of items; a sequence is an ordered list of events; the length is the number of instances of items included. Always written like 〈a(bc)de(f gh)〉. A is B's subsequence means: every element of A is a subset of a corresponding element of B, and those elements of B appear in the same order as in A. Start from the same L_1; the major difference is the join: s_1 and s_2 can be joined only if s_1 with its 1st item dropped and s_2 with its last item dropped are the same; the joined result is s_1[0], s_mid, s_2[−1]. Note that all items within an element are "sorted" by the f-list.
SPADE (vertical format): DB : {〈SID, EID, Items〉} ⇒ Item (or subsequence) : {〈SID, EID〉}; then grow the subsequences one item at a time, Apriori-style, by joining two of those {〈SID, EID〉} tables (e.g. a, b ⇒ ab, ba ⇒ aba, bab). Similar limitations to GSP: costly candidate generation, multiple scans due to BFS, and trouble with long patterns.
"_": a placeholder used when the last item of the prefix comes from the first element of the suffix.
Prefix-based projection (α′): the projection of α w.r.t. prefix β is the maximum subsequence of α with prefix β; e.g. α = 〈a(abc)(ac)d(cf)〉, β = 〈ad〉, then α′ = 〈ad(cf)〉.
Start from L_1, project the database into |L_1| projected databases accordingly, and mine each subset recursively via the corresponding projected databases (e.g. a-proj ⇒ ab-proj).
Note that a and (_a) are counted differently: with s_1 the last element of the prefix, (_a) counts only when a appears at the front of the suffix, i.e. as (s_1 a …).
No candidate generation is needed; the major cost is projection. The projected DB keeps shrinking and can be improved by pseudo-projection (pointers to the division point between prefix and suffix save time and space; this works well unless the DB is too big for main memory, since disk access is slow).
Prefix Span
Time series Y = {Y_t : t ∈ T}, with time index T. An observation of a time series with length N can be represented as Y = {y_1, y_2, ..., y_N}.
L_p distance: d(C, Q) = (∑ |c_i − q_i|^p)^{1/p}. The L_p norm cannot deal with offset and scaling (solution: normalization c'_i = (c_i − μ(C))/σ(C)).
Warp the time axis? Works even with different lengths. X = {x_1, ..., x_N}, Y = {y_1, ..., y_M}: find an alignment such that the overall cost is minimized. Local distance (cost) between x_n and y_m: c(x_n, y_m); we can form an N × M matrix of costs between all pairs.
Goal: find an (N, M)-warping path p = (p_1, p_2, ..., p_L) with p_l = (n_l, m_l), subject to: (1) boundary, p_1 = (1, 1), p_L = (N, M); (2) monotonicity, n_l and m_l non-decreasing in l; (3) step size, p_{l+1} − p_l ∈ {(0, 1), (1, 0), (1, 1)}.
Solve by DP: D(n, m) = min{D(n − 1, m), D(n, m − 1), D(n − 1, m − 1)} + c(x_n, y_m), where D(n, m) denotes the DTW distance between X(1, ..., n) and Y(1, ..., m). D(N, M) = DTW(X, Y), D(n, 1) = ∑_{k=1}^n c(x_k, y_1), D(1, m) = ∑_{k=1}^m c(x_1, y_k). O(NM) time complexity.
Trace back to find p* from D: given p_l = (n, m), p_{l−1} is (1, m − 1) if n = 1, (n − 1, 1) if m = 1, and otherwise argmin{D(n − 1, m − 1), D(n − 1, m), D(n, m − 1)}.
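A minimal sketch of the DP recurrence (distance only, no trace-back); the absolute-difference cost and the toy series are illustrative.

```python
import numpy as np

def dtw(x, y, cost=lambda a, b: abs(a - b)):
    """DTW distance via D(n, m) = min(D(n-1, m), D(n, m-1), D(n-1, m-1)) + c(x_n, y_m)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # inf border enforces the boundary condition
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(x[i - 1], y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# toy usage: the second series is a stretched copy of the first
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 2, 3, 3, 2, 1, 0]
print(dtw(a, b))   # 0.0 -- warping absorbs the stretching
```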
Dynamic Time Warping (DTW)
Sometimes series data need to be transformed into the Fourier domain for evaluation. X_f = (1/√n) ∑_{t=0}^{n−1} x_t exp(−j2πft/n), f = 0, 1, ..., n − 1. Parseval's theorem: ∑_{t=0}^{n−1} |x_t|² = ∑_{f=0}^{n−1} |X_f|², so Euclidean distances in the time and frequency domains are the same. Keeping only the first few coefficients brings no false dismissals.
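A small NumPy check of Parseval's theorem and of the lower-bounding property of truncated coefficients, using the orthonormal (1/√n) FFT; the toy signals are illustrative.

```python
import numpy as np

# Parseval's theorem with the orthonormal DFT: energy is preserved
rng = np.random.default_rng(5)
x = rng.normal(size=64)
X = np.fft.fft(x, norm="ortho")        # X_f = (1/sqrt(n)) * sum_t x_t e^{-j 2 pi f t / n}
print(np.allclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2)))   # True

# keeping only the first k coefficients lower-bounds the true distance,
# so filtering with the truncated distance cannot cause false dismissals
y = x + rng.normal(scale=0.1, size=64)
Y = np.fft.fft(y, norm="ortho")
k = 8
print(np.linalg.norm((X - Y)[:k]) <= np.linalg.norm(x - y))          # True
```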
Naive Time and Frequency Domain
jth lag of Y_t: Y_{t−j}; first difference: ΔY_t = Y_t − Y_{t−1}; jth autocorrelation ρ_j = corr(Y_t, Y_{t−j}) = cov(Y_t, Y_{t−j}) / √(var(Y_t) var(Y_{t−j})), with cov(Y_t, Y_{t−j}) = (1/(T − j − 1)) ∑_{t=j+1}^T (Y_t − Ȳ_{j+1,T})(Y_{t−j} − Ȳ_{1,T−j}). AR(1) check: Y_t = β_0 + β_1 Y_{t−1} + u_t; β_1 = 0 means the lag is useless.
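A small sketch of the lag-j autocorrelation, computed as the sample Pearson correlation of the overlapping segments (which matches the definition above up to the normalizing constant); the simulated AR(1) series is illustrative.

```python
import numpy as np

def autocorr(y, j):
    """rho_j = corr(Y_t, Y_{t-j}), estimated on the overlapping parts of the series."""
    return np.corrcoef(y[j:], y[:-j])[0, 1]

# toy usage: simulate an AR(1) process Y_t = 0.8 * Y_{t-1} + u_t
rng = np.random.default_rng(6)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()
print(round(autocorr(y, 1), 2))   # should be close to 0.8
```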
Prediction
Bayes' theorem: P(h|X) = P(X|h)P(h)/P(X). X: data sample (evidence); h: the hypothesis class(X) = Y; P(X) is fixed; P(h) = π is the prior probability; P(X|h) is the likelihood ∏_n β_{yn}^{x_n}; P(X) = ∑_h P(X|h)P(h); P(h|X) is the posterior probability. Maximum a posteriori: h_MAP = argmax_h P(X|h)P(h).
y* = argmax_y ∏_n β_{yn}^{x_n} × π_y = argmax_y ∑_n x_n log β_{yn} + log π_y.
Estimation: n indexes words, j classes, D is the set of documents, d a document. β_{jn} = (count of word n in class j) / (count of all words in class j) (smoothing: add 1 to the numerator and N to the denominator), π_j = (number of d in class j)/|D|.
For a test document t, p(y = c|x_t) ∝ p(y = c) × ∏_n (β_{cn})^{x_{tn}}, where x_{tn} is the number of appearances of word n in x_t.
A generative model (not discriminative, like logistic regression): generative models model P(X, Y), discriminative models P(Y|X).
Multinoulli: multiple classes, one try-out (z ∼ Multinoulli(π)); Multinomial: multiple classes, multiple try-outs (x_d ∼ Multinomial(β_d)).
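A minimal sketch of multinomial naive Bayes with add-1 smoothing, following the β and π estimates above; the function names and toy corpus are illustrative.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate pi_j = P(class j) and beta_jn = P(word n | class j) with add-1 smoothing."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for words in docs for w in words}
    beta = {c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab} for c in classes}
    return prior, beta, vocab

def predict_nb(words, prior, beta, vocab):
    """argmax_y  log pi_y + sum_n x_n log beta_yn (words outside the vocabulary are ignored)."""
    scores = {c: math.log(prior[c]) +
                 sum(math.log(beta[c][w]) for w in words if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)

docs = [["win", "money", "now"], ["meeting", "at", "noon"], ["win", "prize"], ["lunch", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
prior, beta, vocab = train_nb(docs, labels)
print(predict_nb(["win", "money"], prior, beta, vocab))   # spam
```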
Naive Bayes for Text
Corpus: a collection of documents; word w, document d, topic z; word count in a document c(w, d); per-topic word distribution β_{zw} = p(w|z); per-document (soft) topic distribution θ_{dz} = p(z|d).
Maximize log L = ∑_{d,w} c(w, d) log ∑_z θ_{dz} β_{zw} s.t. ∑_z θ_{dz} = 1 and ∑_w β_{zw} = 1, optimized by EM until convergence.
Generally, E: p(z|w, d) ∝ p(w|z, d) p(z|d) = β_{zw} θ_{dz}; M: β_{zw} ∝ ∑_d p(z|w, d) c(w, d), θ_{dz} ∝ ∑_w p(z|w, d) c(w, d).
Concretely, E: p(z|w, d) = β_{zw} θ_{dz} / ∑_{z′} β_{z′w} θ_{dz′}; M: β_{zw} = ∑_d p(z|w, d) c(w, d) / ∑_{w′,d} p(z|w′, d) c(w′, d), θ_{dz} = ∑_w p(z|w, d) c(w, d) / N_d, where N_d is the number of words in document d.
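A minimal sketch of the pLSA EM updates on a documents × words count matrix; the random initialization and the toy corpus are illustrative.

```python
import numpy as np

def plsa(counts, k, n_iter=100, seed=0):
    """EM for pLSA on a (docs x words) count matrix:
    E: p(z|w,d) ∝ beta_zw * theta_dz;  M: re-estimate beta and theta from expected counts."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    theta = rng.random((n_docs, k)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((k, n_words)); beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | w, d), shape (docs, words, topics)
        p_z = theta[:, None, :] * beta.T[None, :, :]
        p_z /= p_z.sum(axis=2, keepdims=True)
        # M-step: expected counts c(w,d) * p(z|w,d) -> new beta_zw and theta_dz
        expected = counts[:, :, None] * p_z
        beta = expected.sum(axis=0).T                  # (topics, words)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = expected.sum(axis=1)                   # (docs, topics)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta

# toy usage: two obvious "topics" in a 4-document, 4-word corpus
counts = np.array([[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]], float)
theta, beta = plsa(counts, k=2)
print(np.round(theta, 2))   # docs 0-1 and 2-3 should load on different topics
```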
pLSA