Data Mining (CS 145) Midterm Cheat Sheet by Patricia Xiao
Model                     Data Type   Task Type
Linear Regression         Vector      Prediction
Logistic Regression       Vector      Classification
Decision Tree             Vector      Classification
SVM                       Vector      Classification
NN                        Vector      Classification
KNN                       Vector      Classification
K-means                   Vector      Clustering
Hierarchical Clustering   Vector      Clustering
DBSCAN                    Vector      Clustering
Mixture Models            Vector      Clustering
Models
• Dispersion: Quartiles & Interquartile Range (Q1 at 25%, Q3 at 75%, IQR = Q3 − Q1; outliers lie more than 1.5 × IQR away from Q1/Q3). 5-number summary: min, Q1, median, Q3, max.
• Bias: $E(\hat{f}(x)) - f(x)$; Variance: $Var(\hat{f}(x)) = E[(\hat{f}(x) - E(\hat{f}(x)))^2]$; $E[(\hat{f}(x) - f(x))^2] = \text{bias}^2 + \text{variance} + \text{noise}$; $E(\epsilon) = 0$, $Var(\epsilon) = \sigma^2$; bias → underfitting; variance → overfitting.
• Model Evaluation and Selection: k-fold cross-validation; AIC ($2k - 2\ln\hat{L}$) and BIC ($k\ln(n) - 2\ln\hat{L}$), with $k$ parameters and $n$ objects; stepwise feature selection (forward: add features; backward: start from the full model).
• Generalized Linear Model (GLM): exponential family, $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$; linear decision boundary.
• Bagging: Bootstrap Aggregating (multiple bootstrap datasets → multiple classifiers → combine the classifiers).
• Kernel: $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
• Chain rule: $\partial J/\partial x = (\partial J/\partial y)(\partial y/\partial x)$
• Minkowski distance ($l_h$): $d(x, y) = \left(\sum_{i}^{d} |x_i - y_i|^h\right)^{1/h}$; $l_1$: Manhattan, $l_2$: Euclidean, $l_\infty$: supremum; the triangle inequality holds: $d(i, j) \le d(i, k) + d(k, j)$.
• Confusion Matrix: True/False for correctness of the prediction, Positive/Negative for the predicted result.
• Multi-class classification: All-vs-All (AVA) generally works better than One-vs-All (OVA).
Basic Concepts
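
A minimal NumPy sketch of the Minkowski distance family listed above (the helper name `minkowski` and the sample vectors are ours, for illustration only):

```python
import numpy as np

def minkowski(x, y, h):
    """l_h distance: (sum_i |x_i - y_i|^h)^(1/h); h=1 Manhattan, h=2 Euclidean."""
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))      # Manhattan: 5.0
print(minkowski(x, y, 2))      # Euclidean: ~3.61
print(np.max(np.abs(x - y)))   # supremum (l_inf): 3.0
```
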
1. $\sigma^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$
2. $\|\alpha\|^2 = \alpha^T\alpha$ where $\alpha$ is a vector.
3. $(AB)^T = B^T A^T$, $(AB)^{-1} = B^{-1}A^{-1}$
4. $\frac{\partial(Ax)}{\partial x} = A$, $\frac{\partial(AX)}{\partial X} = A^T$
5. $\frac{\partial(x^T A x)}{\partial x} = x^T(A + A^T)$, $\frac{\partial(X^T A^T A X)}{\partial X} = 2A^T A X$
6. $X \sim N(\mu, \sigma^2) \Rightarrow f(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
7. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
8. $\log(a^b) = b\log a$, $\log(ab) = \log(a) + \log(b)$
9. Classifiers $f_i(x)$: $Var\left(\frac{\sum_i f_i(x)}{t}\right) = Var(f_i(x))/t$
10. $a \cdot b = \sum a_i b_i = \|a\|\|b\|\cos(a, b)$
11. For a normal vector $n$ and any vector $x$ in the plane: $n \cdot x = 0$
12. Covariance: $\sigma(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)]$
Formula
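
The identities above can be sanity-checked numerically; a small sketch (sample size and test point chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=100_000)

# Identity 1: Var(X) = E(X^2) - E(X)^2
print(np.var(X), np.mean(X**2) - np.mean(X)**2)   # both ~9

# Identity 7: sigma'(x) = sigma(x)(1 - sigma(x)), checked by finite differences
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
x, eps = 0.7, 1e-6
print((sigma(x + eps) - sigma(x - eps)) / (2 * eps), sigma(x) * (1 - sigma(x)))
```
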
1. mean − mode = 3 × (mean − median); mode (peak) → median → mean (toward the tail).
2. Z-score (normalization): $Z = \frac{x - \mu}{\sigma}$ (robust version: use the mean absolute deviation $s_f$, $z_{if} = \frac{x_{if} - \text{mean}_f}{s_f}$); nominal attributes: dummy variable(s); ordinal: $(r - 1)/(M - 1)$, where $r$ and $M$ start from 1.
3. Logistic / Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
4. Entropy: $H(Y) = -\sum_{i=1}^m p_i\log(p_i)$; Conditional entropy: $H(Y|X) = \sum_x p(x) H(Y|X = x)$
5. Cross-entropy loss: $H(q, p) = -\sum_k q_k\log(p_k)$
6. The Lagrange multiplier $\alpha$ is used to solve the quadratic programming problem (e.g., in SVM).
7. Soft margin (allow violations at a cost): minimize $\Phi(w) = \frac{1}{2}w^T w \Rightarrow \Phi(w) = \frac{1}{2}w^T w + C\sum\zeta_i$, with constraints $y(w^T x_i + b) \ge 1 \Rightarrow y(w^T x_i + b) \ge 1 - \zeta_i$ ($\zeta_i \ge 0$); the form of the SVM solution is not affected.
8. ROC (Receiver Operating Characteristic): TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve.
9. Dendrogram: the hierarchical clustering tree; cut it at a chosen level to obtain clusters.
Tools
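
A small sketch of the entropy and cross-entropy formulas above (function names and example distributions are ours):

```python
import numpy as np

def entropy(p):
    """H(Y) = -sum p_i log2(p_i), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k log(p_k); q is the ground truth, p the prediction."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * np.log(p))

print(entropy([0.5, 0.5]))                  # 1.0 bit
print(entropy([0.9, 0.1]))                  # ~0.469 bits
print(cross_entropy([0, 1], [0.2, 0.8]))    # -log(0.8) ~ 0.223
```
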
$y = x^T\beta$ with bias term $x_{i0} = 1$; $X$: $n\times(p+1)$ matrix, $y$: $n\times 1$ vector, $\beta$: $(p+1)\times 1$ vector. Continuous $y$ (OLS, Ordinary Least Squares).
$J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y) = \frac{1}{2n}(\beta^T X^T X\beta - y^T X\beta - \beta^T X^T y + y^T y)$
Closed-form solution: set $\frac{\partial J}{\partial\beta} = 0$, giving $\hat\beta = (X^T X)^{-1}X^T y$
Gradient Descent: $\beta^{(t+1)} := \beta^{(t)} - \eta\Delta$
Batch GD (converges): $\Delta = \frac{\partial J}{\partial\beta} = \sum_i x_i(x_i^T\beta - y_i)/n$
Stochastic GD (repeated $n$ times): $\Delta = -(y_i - x_i^T\beta^{(t)})x_i$
LR with probabilistic interpretation (using MLE, Maximum Likelihood Estimation): $L(\beta) = \prod_i p(y_i|x_i,\beta) = \prod_i N(x_i^T\beta,\sigma^2) = \prod_i\frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\}$
Non-invertible $X^T X$: add $\lambda\sum_{j=1}^p\beta_j^2$ to $\sum_i(y_i - x_i^T\beta)^2$ (Ridge Regression, i.e., linear regression with an $l_2$ penalty).
Non-linear correlation: create new terms, e.g., $x^2$.
Linear Regression
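
A minimal sketch of the closed-form OLS solution and batch gradient descent above, on synthetic data (the true coefficients, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # bias column x_{i0} = 1
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: beta <- beta - eta * sum_i x_i (x_i^T beta - y_i) / n
beta = np.zeros(p + 1)
eta = 0.1
for _ in range(2000):
    grad = X.T @ (X @ beta - y) / n
    beta -= eta * grad

print(beta_closed)   # ~ [1, 2, -3, 0.5]
print(beta)          # converges to essentially the same solution
```
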
Generalized linear model (GLM).
$P(Y = 1|X,\beta) = \sigma(X^T\beta) = \frac{e^{X^T\beta}}{1 + e^{X^T\beta}}$
$P(Y = 0|X,\beta) = 1 - \sigma(X^T\beta) = \frac{1}{1 + e^{X^T\beta}}$
$Y|X,\beta \sim \text{Bernoulli}(\sigma(X^T\beta))$
MLE: $L = \prod_i p_i^{y_i}(1 - p_i)^{1 - y_i}$, where $p_i = P(Y = 1|x_i,\beta)$
Equivalent to maximizing the log-likelihood $L = \sum_i\left(y_i x_i^T\beta - \log(1 + e^{x_i^T\beta})\right)$
Gradient ascent: $\beta^{new} = \beta^{old} + \eta\frac{\partial L(\beta)}{\partial\beta}$
Newton-Raphson update: $\beta^{new} = \beta^{old} - \left(\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial L(\beta)}{\partial\beta}$
Cross-entropy loss ($p$ for the prediction, $q$ for the ground truth; $(q_0, q_1)|_{y=0} = (1, 0)$, $(q_0, q_1)|_{y=1} = (0, 1)$, $(p_0, p_1) = (P(Y=0), P(Y=1))$): $H(q, p) = -y x^T\beta + \log(1 + e^{x^T\beta})$
Logistic Regression
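
A minimal sketch of fitting logistic regression by gradient ascent on the log-likelihood above, using $\frac{\partial L}{\partial\beta} = X^T(y - \sigma(X\beta))$ (synthetic data; the step size and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = np.array([-0.5, 2.0, -1.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=n) < prob).astype(float)

beta = np.zeros(p + 1)
eta = 0.5
for _ in range(5000):
    # average gradient of the log-likelihood: X^T (y - sigma(X beta)) / n
    grad = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta))) / n
    beta += eta * grad           # ascent: step uphill

print(beta)   # roughly recovers beta_true
```
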
A framework to approach maximum likelihood (with latent cluster assignments).
$p(x_i, z_i = C_j) = w_j f_j(x_i)$, $p(x_i) = \sum_j w_j f_j(x_i)$
$p(D) = \prod_i p(x_i) = \prod_i\sum_j w_j f_j(x_i)$
$\log p(D) = \sum_i\log\left(\sum_j w_j f_j(x_i)\right)$
The E (expectation) step assigns objects to clusters:
$w_{ij}^{t+1} = p(z_i = j|\theta_j^t, x_i) \propto p(x_i|z_i = j,\theta_j^t)\,p(z_i = j) = f_j(x_i)\,w_j$
The M (maximization) step finds the new clustering parameters w.r.t. the conditional distribution $p(z_i = j|\theta_j^t, x_i)$:
$\theta^{t+1} = \arg\max_\theta\sum_i\sum_j w_{ij}^{t+1}\log L(x_i, z_i = j|\theta)$
EM Algorithm
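
To make the abstract E/M updates concrete, here is a sketch of EM for a two-component 1-d Gaussian mixture, i.e. $f_j = N(\mu_j, \sigma_j^2)$ (component count, initial values, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1-d toy data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
n, k = len(x), 2

w = np.full(k, 1.0 / k)        # mixture weights w_j
mu = np.array([-1.0, 1.0])     # component means
sigma2 = np.ones(k)            # component variances

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

for _ in range(100):
    # E-step: w_ij proportional to f_j(x_i) * w_j
    resp = w * normal_pdf(x[:, None], mu, sigma2)        # shape (n, k)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate theta from the weighted log-likelihood
    nk = resp.sum(axis=0)
    w = nk / n
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, sigma2)   # ~ [0.3, 0.7], [-2, 3], [1, 1]
```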

Let $m$ be the number of classes $|y|$ in $D$ and $v$ the number of values of attribute $A$.
Expected information needed to classify a tuple in $D$: $Info(D) = -\sum_{i=1}^m p_i\log_2(p_i)$
Information after splitting on $A$: $Info_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|}\times Info(D_j)$
Information Gain (ID3): $Gain(A) = Info(D) - Info_A(D)$. Information gain is biased towards multivalued attributes.
$SplitInfo_A(D) = -\sum_{j=1}^v\frac{|D_j|}{|D|}\times\log_2\left(\frac{|D_j|}{|D|}\right)$
Gain Ratio (C4.5): $GainRatio(A) = Gain(A)/SplitInfo_A(D)$. The gain ratio is biased towards unbalanced splits.
$Gini(D) = 1 - \sum_{j=1}^m p_j^2$ (impurity); $Gini_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|}Gini(D_j)$
Gini (CART): $\Delta Gini(A) = Gini(D) - Gini_A(D)$. The Gini index is also biased towards multivalued attributes.
STOP: all samples in the same class; no attributes left; no samples left (use majority voting).
Avoid overfitting: pre-/post-pruning, random forest.
Classification → prediction: majority vote → e.g., average at the leaf node (regression tree). Use $Var(D_j) = \sum_{y\in D_j}(y - \bar{y})^2/|D_j|$ and look for the lowest weighted average variance $Var_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|}\times Var(D_j)$.
A different view: each leaf = a box in the feature space.
A random forest is a set of trees (ensemble, bagging); good at classification, handles large and missing data, but weaker at prediction and lacks interpretability.

Decision Tree
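
A sketch of the three split measures above (information gain, gain ratio, and the Gini reduction) computed from class-label partitions; the helper names and the toy split are ours:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_measures(D, partitions):
    """D: list of class labels; partitions: label lists D_j after splitting on A."""
    weights = [len(Dj) / len(D) for Dj in partitions]
    info_A = sum(w * entropy(Dj) for w, Dj in zip(weights, partitions))
    gain = entropy(D) - info_A                                    # ID3
    split_info = -sum(w * np.log2(w) for w in weights if w > 0)
    gain_ratio = gain / split_info if split_info > 0 else 0.0     # C4.5
    gini_A = sum(w * gini(Dj) for w, Dj in zip(weights, partitions))
    delta_gini = gini(D) - gini_A                                  # CART
    return gain, gain_ratio, delta_gini

# e.g. a 3-way split of 9 "yes" / 5 "no" tuples
D = ["yes"] * 9 + ["no"] * 5
partitions = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(split_measures(D, partitions))
```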

$y = \text{sign}(W\cdot X + b)$; the separating hyperplane is $y = 0$.
SVM searches for the Maximum Marginal Hyperplane.
To maximize the margin $\rho = \frac{2}{\|w\|}$, use Lagrange multipliers $\alpha$: $L(w, b,\alpha) = \frac{1}{2}w^T w - \sum_{i=1}^N\alpha_i\left(y_i(w^T x_i + b) - 1\right)$
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^N\alpha_i y_i x_i = 0$, $\frac{\partial L}{\partial b} = -\sum_{i=1}^N\alpha_i y_i = 0$
Solution: $w = \sum\alpha_i y_i x_i$, $b = y_k - w^T x_k$ (for a support vector $x_k$)
$f(x) = w^T x + b = \sum\alpha_i y_i x_i^T x + b$, default threshold 0.
Linear vs. non-linear SVM: kernels. Non-linear decision boundary: $f(x) = w^T\Phi(x) + b = \sum\alpha_i y_i K(x_i, x) + b$
Scalability: CF-Tree, hierarchical micro-clusters, selective declustering (decluster the clusters that could be support clusters; a support cluster has its centroid on a support vector).

SVM
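
The kernel form of the decision function, $f(x) = \sum\alpha_i y_i K(x_i, x) + b$, can be illustrated with scikit-learn (not part of the original sheet; the data, $C$, and $\gamma$ values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only
K = rbf_kernel(clf.support_vectors_, X[:5], gamma=0.5)
f_manual = clf.dual_coef_ @ K + clf.intercept_        # dual_coef_ stores alpha_i * y_i
print(f_manual.ravel())
print(clf.decision_function(X[:5]))                   # matches the manual sum
```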

The single unit computes $o = f\left(\sum_i w_i x_i + b\right)$: input vector $x$, weight vector $w$, bias $b$; the weighted sum passes through the activation function $f$ to produce the output $o$.

Perceptron (Single Unit)
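
A one-line sketch of the single unit above, $o = f(w\cdot x + b)$, with a sigmoid activation (the numbers are made up):

```python
import numpy as np

def unit(x, w, b, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Single unit: weighted sum of the inputs plus bias, passed through activation f."""
    return f(np.dot(w, x) + b)

x = np.array([0.2, -1.0, 0.5])
w = np.array([0.4, 0.1, -0.3])
b = 0.05
print(unit(x, w, b))   # sigmoid(0.2*0.4 - 1.0*0.1 + 0.5*(-0.3) + 0.05) ~ 0.47
```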

Stochastic GD + the chain rule. Special case: sigmoid activation + square loss, 2 layers.
Assume $i, j, k$ denote the input, hidden, and output layers; $O$ is an output, $T$ a true value.
$Err_k = O_k(1 - O_k)(T_k - O_k)$, $Err_j = O_j(1 - O_j)\sum_k Err_k w_{jk}$
$w_{ij} = w_{ij} + \eta\,Err_j O_i$ and $w_{jk} = w_{jk} + \eta\,Err_k O_j$; $\theta_j = \theta_j + \eta\,Err_j$ and $\theta_k = \theta_k + \eta\,Err_k$
$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial O_k}\frac{\partial O_k}{\partial O_j}\frac{\partial O_j}{\partial w_{ij}} = -\sum_k\left[(T_k - O_k)\right]\left[O_k(1 - O_k)w_{jk}\right]\left[O_j(1 - O_j)O_i\right]$

Backpropagation (BP)
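
A sketch of one backpropagation update for the 2-layer sigmoid + square-loss case above, following the $Err_k$ / $Err_j$ formulas (layer sizes, the target, and $\eta$ are arbitrary):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input layer i
t = np.array([1.0, 0.0])        # true values T_k
W_ij = rng.normal(size=(3, 4))  # input -> hidden weights
W_jk = rng.normal(size=(4, 2))  # hidden -> output weights
theta_j = np.zeros(4)           # hidden biases
theta_k = np.zeros(2)           # output biases
eta = 0.5

# forward pass
O_j = sigmoid(x @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)
print(0.5 * np.sum((t - O_k) ** 2))                  # loss before the update

# backward pass (sigmoid + square loss)
Err_k = O_k * (1 - O_k) * (t - O_k)
Err_j = O_j * (1 - O_j) * (W_jk @ Err_k)

# updates: w <- w + eta * Err * O, theta <- theta + eta * Err
W_jk += eta * np.outer(O_j, Err_k)
W_ij += eta * np.outer(x, Err_j)
theta_k += eta * Err_k
theta_j += eta * Err_j

O_k_new = sigmoid(sigmoid(x @ W_ij + theta_j) @ W_jk + theta_k)
print(0.5 * np.sum((t - O_k_new) ** 2))              # loss after the update
```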

$n_{\text{layers}} = n_{\text{hidden}} + n_{\text{output}}(= 1)$. Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.

Neural Network (NN)

Lazy learning (instead of eager), instance-based.
Consider the $k$ nearest neighbors; use majority voting or the average (optionally distance-weighted).
Curse of dimensionality: the influence of noise grows; remove irrelevant features and select a proper $k$.
Proximity refers to similarity or dissimilarity. It applies directly to binary values; for nominal attributes, use simple matching or represent a non-binary attribute with a series of binary ones; for ordinal attributes, rank and normalize: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$.
For binary variables, proximity can be measured by $\frac{|(0,1)| + |(1,0)|}{\text{all}}$ for symmetric variables, by $\frac{|(0,1)| + |(1,0)|}{\text{all} - |(0,0)|}$ for asymmetric variables, or by the Jaccard coefficient (a similarity), $\frac{|(1,1)|}{\text{all} - |(0,0)|}$, for asymmetric variables.
Mixed-type attributes: weighted combination. Another method: cosine similarity $\cos(d_1, d_2)$.

k - Nearest Neighbors (kNN)
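
A minimal sketch of kNN prediction with optional distance weighting, as described above (the helper name and toy data are ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify x by (optionally distance-weighted) majority vote among its k nearest neighbors."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    if not weighted:
        return Counter(y_train[idx]).most_common(1)[0][0]
    votes = {}
    for i in idx:                       # weight each vote by inverse distance
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-12)
    return max(votes, key=votes.get)

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))                  # "a"
print(knn_predict(X_train, y_train, np.array([5.2, 5.1]), weighted=True))   # "b"
```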

Holdout method; cross-validation (k-fold); LOO (leave-one-out).
Confusion Matrix: True/False Positive/Negative.
Accuracy = (TP + TN) / All
Error Rate = (FP + FN) / All
Sensitivity = TP / P (P = TP + FN)
Specificity = TN / N (N = FP + TN)
Precision = TP / P' (P' = TP + FP)
Recall = TP / P = Sensitivity
F1 / F-score = $\frac{2\times Precision\times Recall}{Precision + Recall}$
$F_\beta = \frac{(1 + \beta^2)\times Precision\times Recall}{\beta^2\times Precision + Recall}$ (weights Recall : Precision = $\beta$ : 1)
ROC curve: TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve. TPR = TP / P, FPR = FP / N.

Evaluation: Classification
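
A small sketch computing the metrics above directly from confusion-matrix counts (the example counts are made up):

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """All the confusion-matrix metrics listed above, from raw counts."""
    all_, p, n = tp + fp + fn + tn, tp + fn, fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                      # = sensitivity = TPR
    return {
        "accuracy": (tp + tn) / all_,
        "error_rate": (fp + fn) / all_,
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "f_beta": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
        "fpr": fp / n,
    }

print(classification_metrics(tp=90, fp=60, fn=10, tn=9840))
```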

K-means: $J = \sum_{j=1}^k\sum_i w_{ij}\|x_i - c_j\|^2$. Assign $w_{ij} = 1$ for each $x_i$'s closest $c_j$; recompute each center as the new centroid; stop when nothing changes. $O(tkn)$. For continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data.
K-medoids: use representative objects, e.g., PAM.
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until everything ends up in one cluster; top-down DIANA (Divisive Analysis). $O(n^2)$.
Cluster distance: single link = minimum element-wise distance; complete link = maximum; average = average over element pairs; centroid; medoid (center object).
DBSCAN: set Eps ($\epsilon$) and MinPts. Neighborhood: $N_\epsilon(q) = \{p\in D \mid dist(p, q)\le\epsilon\}$. A core point has $|N_\epsilon(q)|\ge MinPts$. $p$ is directly density-reachable from $q$ if $q$ is a core point and $p\in N_\epsilon(q)$; density-reachable if $q\to p_2\to\cdots\to p$; density-connected if $o\to\cdots\to p$ and $o\to\cdots\to q$. A cluster is a maximal set of density-connected points; the remaining points are noise. DFS-style expansion; $O(n\log n)$ with a spatial index, else $O(n^2)$.
Mixture Model: soft clustering ($w_{ij}\in[0,1]$ rather than $w_{ij}\in\{0,1\}$); joint probability of object $i$ and cluster $C_j$: $p(x_i, z_i = C_j) = w_j f_j(x_i)$; fitted with the EM algorithm.
Gaussian Mixture Model (GMM): ⊃ k-means. Generative model: for each object, pick a cluster $Z$, then sample a value from $X|Z\sim N(\mu_Z,\sigma_Z^2)$. Overall likelihood: $L(D|\theta) = \prod_i\sum_j w_j\,p(x_i|\mu_j,\sigma_j^2)$.
E-step: $w_{ij}^{t+1} = \frac{w_j^t\,p(x_i|\mu_j^t, (\sigma_j^2)^t)}{\sum_k w_k^t\,p(x_i|\mu_k^t, (\sigma_k^2)^t)}$
M-step (1-d case): $\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1}x_i}{\sum_i w_{ij}^{t+1}}$, $(\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1}(x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^{t+1}}$, $w_j^{t+1} = \frac{\sum_i w_{ij}^{t+1}}{n}$
Why does EM work? The E-step finds a tight lower bound $L$ of $\ell$ at $\theta^{old}$; the M-step finds $\theta^{new}$ that maximizes the lower bound, so $\ell(\theta^{new})\ge L(\theta^{new})\ge L(\theta^{old}) = \ell(\theta^{old})$.

Clustering
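
A minimal sketch of the plain k-means loop described above: assign points to the closest centroid, recompute centroids, stop when nothing changes (it assumes no cluster goes empty; data and k are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: closest-centroid assignment, then centroid recomputation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # roughly (0, 0) and (4, 4)
```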

Extrinsic (supervised) vs. intrinsic (unsupervised) evaluation.
$purity(C,\Omega) = \frac{1}{N}\sum_k\max_j|c_k\cap\omega_j|$ ($C$: clustering output, $\Omega$: ground truth)
Normalized Mutual Information: $NMI(C,\Omega) = \frac{I(C,\Omega)}{\sqrt{H(C)H(\Omega)}}$
$I(C,\Omega) = \sum_k\sum_j P(c_k\cap\omega_j)\log\frac{P(c_k\cap\omega_j)}{P(c_k)P(\omega_j)} = \sum_k\sum_j\frac{|c_k\cap\omega_j|}{N}\log\frac{N|c_k\cap\omega_j|}{|c_k||\omega_j|}$
$H(\Omega) = -\sum_j P(\omega_j)\log P(\omega_j) = -\sum_j\frac{|\omega_j|}{N}\log\frac{|\omega_j|}{N}$
Precision and recall: count pairs in the same/different class/cluster.
Selecting $k$: plot squared loss vs. $k$ (larger $k$ gives smaller cost) and look for the knee point; penalize with BIC; cross-validation.

Evaluation: Clustering
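
A sketch of purity and NMI computed from cluster and ground-truth labels, following the formulas above (a brute-force implementation for small label lists; function names are ours):

```python
import numpy as np
from collections import Counter

def purity(clusters, truth):
    """clusters, truth: label sequences of equal length N."""
    N = len(truth)
    total = 0
    for c in set(clusters):
        members = [truth[i] for i in range(N) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]     # max_j |c_k ∩ w_j|
    return total / N

def nmi(clusters, truth):
    N = len(truth)
    I = 0.0
    for c in set(clusters):
        for w in set(truth):
            n_cw = sum(1 for i in range(N) if clusters[i] == c and truth[i] == w)
            n_c = sum(1 for i in range(N) if clusters[i] == c)
            n_w = sum(1 for i in range(N) if truth[i] == w)
            if n_cw:
                I += n_cw / N * np.log(N * n_cw / (n_c * n_w))
    H = lambda labels: -sum(v / N * np.log(v / N) for v in Counter(labels).values())
    return I / np.sqrt(H(clusters) * H(truth))

clusters = [0, 0, 0, 1, 1, 1, 2, 2]
truth    = ["x", "x", "y", "y", "y", "y", "z", "z"]
print(purity(clusters, truth), nmi(clusters, truth))
```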

Mining by exploring the vertical data format, similar to an inverted index. Keep a tid-list $t(A)$ storing the list of transaction ids in which an itemset appears. $t(X) = t(Y)$ means $P(XY)$ is high; $t(X)\subset t(Y)$ means $P(Y|X)$ is high. A diffset is used to accelerate mining (keep track of differences of tids).

Eclat
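
A small sketch of the vertical (tid-list) representation above: support comes from intersecting tid-lists, and a diffset records which tids the extension loses (the toy transactions are ours):

```python
# Vertical representation: itemset -> set of transaction ids.
transactions = {1: {"a", "b", "c"}, 2: {"a", "c"}, 3: {"a", "d"}, 4: {"b", "c", "e"}}

def tidlist(item):
    return {tid for tid, items in transactions.items() if item in items}

t_a, t_c = tidlist("a"), tidlist("c")
t_ac = t_a & t_c          # t({a, c}) = t(a) ∩ t(c); support = |t_ac|
diffset = t_a - t_ac      # diffset of {a, c} w.r.t. prefix {a}
print(t_a, t_c, t_ac, len(t_ac), diffset)
```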

$confidence(A\Rightarrow B) = P(B|A) = \frac{P(AB)}{P(A)}$; rules are generated from a frequent pattern $l$ and all of its non-empty subsets.
$Lift(A, B) = \frac{P(AB)}{P(A)P(B)}$: $= 1$ independent, $> 1$ positively correlated, $< 1$ negatively correlated.
$\chi^2 = \sum\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$; a table gives the $p$-value $= P(\chi^2 > *)$; if the $p$-value is small enough, the null hypothesis is rejected, so $A$ and $B$ are dependent.
$all\_confidence = \min\{P(A|B), P(B|A)\}$
$max\_confidence = \max\{P(A|B), P(B|A)\}$
$Kulczynski = \frac{1}{2}(P(A|B) + P(B|A))$
Cosine: $\cos(A, B) = \sqrt{P(A|B)\times P(B|A)}$
Lift and $\chi^2$ are affected by null-transactions, i.e., the "not A and not B" transactions.
Imbalance Ratio (IR): $IR(A, B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(AB)}$, where $sup$ denotes support.

Association Rules
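
A sketch computing the interestingness measures above from raw supports (the example supports are made up):

```python
def measures(sup_a, sup_b, sup_ab, n_transactions):
    p_a, p_b, p_ab = sup_a / n_transactions, sup_b / n_transactions, sup_ab / n_transactions
    p_a_given_b, p_b_given_a = p_ab / p_b, p_ab / p_a
    return {
        "confidence(A=>B)": p_b_given_a,
        "lift": p_ab / (p_a * p_b),
        "all_confidence": min(p_a_given_b, p_b_given_a),
        "max_confidence": max(p_a_given_b, p_b_given_a),
        "kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": (p_a_given_b * p_b_given_a) ** 0.5,
        "imbalance_ratio": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }

# e.g. A and B each appear in 1000 of 10000 transactions, together in 900
print(measures(sup_a=1000, sup_b=1000, sup_ab=900, n_transactions=10_000))
```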

An element/event is a non-empty unordered set of items; a sequence is an ordered list of events; the length is the number of instances of items included. Always written like $\langle a(bc)de(fgh)\rangle$.
$A$ is $B$'s subsequence means: every element of $A$ is a subset of a corresponding element of $B$, and those elements of $B$ appear in the same order as in $A$.
Start from the same $L_1$; the major difference is the join: $s_1$ and $s_2$ can be joined only if $s_1$ with its first item dropped and $s_2$ with its last item dropped are the same, and the join is $s_1[0]$, the shared middle, $s_2[-1]$. Note that all items within an element are "sorted" by the f-list.

GSP
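
A sketch of the subsequence test above, with sequences represented as lists of item sets and greedy matching (the second example sequence is the $\langle a(bc)de(fgh)\rangle$ from the text):

```python
def is_subsequence(a, b):
    """a, b: sequences as lists of sets (elements/events). True if a is a subsequence of b."""
    i = 0
    for element in b:
        if i < len(a) and a[i] <= element:   # a's element must be a subset of b's element
            i += 1
    return i == len(a)

b = [{"a"}, {"b", "c"}, {"d"}, {"e"}, {"f", "g", "h"}]    # <a(bc)de(fgh)>
print(is_subsequence([{"b"}, {"f", "h"}], b))              # True
print(is_subsequence([{"b", "d"}], b))                     # False: b and d are in different events
```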

$DB: \{\langle SID, EID, Items\rangle\} \Rightarrow Item(SubSeq): \{\langle SID, EID\rangle\}$; then join by growing the subsequences one item at a time, Apriori-style (joining two of those $\{\langle SID, EID\rangle\}$ tables for items/sub-sequences, e.g., $a, b\Rightarrow ab, ba\Rightarrow aba, bab$). Similar limitations to GSP: costly candidate generation and multiple scans due to BFS, and trouble with long patterns.

SPADE

"_" (blank) is used when the last item of the prefix comes from the first element of the suffix.
Prefix-based projection ($\alpha'$): the projection of $\alpha$ w.r.t. prefix $\beta$ is the maximum subsequence of $\alpha$ with prefix $\beta$; e.g., $\alpha = \langle a(abc)(ac)d(cf)\rangle$, $\beta = \langle ad\rangle$, then $\alpha' = \langle ad(cf)\rangle$.
Start from $L_1$; project the database into $|L_1|$ projected databases accordingly, and mine each subset recursively via its corresponding projected database (e.g., a-proj ⇒ ab-proj).
Note that $a$ and $\_a$ are counted differently: with prefix last element $s_1$, count $\_a$ only when $a$ appears at the front of the suffix, i.e., as $(s_1 a\ast)$.
No candidate generation is needed; the major cost is projection. The projected DB keeps shrinking, and this can be improved by pseudo-projection (using pointers to the division point between prefix and suffix to save time and space; this works well unless the DB is too big for main memory, since disk access is slow).

Prefix Span
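
A simplified sketch of prefix-based projection, restricted to single-item elements so the "_" placeholder case does not arise (the toy database and helper name are ours):

```python
def project(db, prefix_item):
    """Project a sequence DB (sequences of single items) onto one prefix item:
    keep the suffix after the first occurrence of prefix_item in each sequence."""
    projected = []
    for seq in db:
        if prefix_item in seq:
            suffix = seq[seq.index(prefix_item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected

db = [list("abcd"), list("acd"), list("bad"), list("abd")]
a_proj = project(db, "a")          # a-projected database
ab_proj = project(a_proj, "b")     # grow the prefix: ab-projected database
print(a_proj)    # [['b','c','d'], ['c','d'], ['d'], ['b','d']]
print(ab_proj)   # [['c','d'], ['d']]
```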

A time series $Y = \{Y_t : t\in T\}$ has time index $T$. An observation of a time series with length $N$ can be represented as $Y = \{y_1, y_2,\ldots, y_N\}$.
Euclidean-style distance: $d(C, Q) = \left(\sum|c_i - q_i|^p\right)^{1/p}$ ($l_p$). The $L_p$ norm cannot deal with offset and scaling (solution: normalization, $c_i' = \frac{c_i - \mu(C)}{\sigma(C)}$).
Warp the time axis? This works even with different lengths. $X = \{x_1,\ldots,x_N\}$, $Y = \{y_1,\ldots,y_M\}$: find an alignment between them such that the overall cost is minimized. The local distance (cost) between $x_n$ and $y_m$ is $c(x_n, y_m)$; we can form an $N\times M$ matrix of costs between all pairs.
Goal: find an $(N, M)$-warping path $p = (p_1, p_2,\ldots,p_L)$ with $p_l = (n_l, m_l)$, subject to: (1) boundary, $p_1 = (1, 1)$, $p_L = (N, M)$; (2) monotonicity, $n_l$ and $m_l$ non-decreasing in $l$; (3) step size, $p_{l+1} - p_l\in\{(0,1), (1,0), (1,1)\}$.
Solve by DP: $D(n, m) = \min\{D(n-1, m), D(n, m-1), D(n-1, m-1)\} + c(x_n, y_m)$, where $D(n, m)$ denotes the DTW distance between $X(1,\ldots,n)$ and $Y(1,\ldots,m)$. $D(N, M) = DTW(X, Y)$, $D(n, 1) = \sum_{k=1}^n c(x_k, y_1)$, $D(1, m) = \sum_{k=1}^m c(x_1, y_k)$. $O(NM)$ time complexity.
Trace back to find $p^*$ from $D$: given $p_l = (n, m)$, $p_{l-1}$ is $(1, m-1)$ if $n = 1$, $(n-1, 1)$ if $m = 1$, and otherwise $\arg\min\{D(n-1, m-1), D(n-1, m), D(n, m-1)\}$.

Dynamic Time Warping (DTW)
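
A sketch of the DTW recurrence above solved by dynamic programming (boundary conditions handled by padding with $\infty$; the example series are ours):

```python
import numpy as np

def dtw(x, y, cost=lambda a, b: abs(a - b)):
    """DTW distance via D(n, m) = min of the three predecessors + c(x_n, y_m)."""
    N, M = len(x), len(y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1]) + cost(x[n - 1], y[m - 1])
    return D[N, M]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]      # same shape, stretched at the start
print(dtw(x, y))                         # 0.0: warping absorbs the stretch
print(np.abs(np.array(x + [0.0]) - np.array(y)).sum())   # plain L1 after padding: 4.0
```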

Sometimes series data need to be transformed into the Fourier domain for evaluation.
$X_f = \frac{1}{\sqrt{n}}\sum_{t=0}^{n-1}x_t\exp(-j2\pi ft/n)$, $f = 0, 1,\ldots, n-1$
Parseval's Theorem: $\sum_{t=0}^{n-1}|x_t|^2 = \sum_{f=0}^{n-1}|X_f|^2$; Euclidean distances in the time and frequency domains are the same. Keeping only the first few coefficients introduces no false dismissals.

Naive Time and Frequency Domain
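
A quick NumPy check of Parseval's theorem and of distance preservation under the orthonormal DFT, matching the $1/\sqrt{n}$ normalization above (the sample series are random):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)

# DFT with the 1/sqrt(n) normalization used above (numpy's "ortho" norm)
X = np.fft.fft(x, norm="ortho")

# Parseval: energy is preserved
print(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2))    # equal

y = rng.normal(size=64)
Y = np.fft.fft(y, norm="ortho")
print(np.linalg.norm(x - y), np.linalg.norm(X - Y))      # Euclidean distances match too
```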

$j$-th lag of $Y_t$: $Y_{t-j}$; first difference: $\Delta Y_t = Y_t - Y_{t-1}$; $j$-th autocorrelation $\rho_j$: $corr(Y_t, Y_{t-j}) = \frac{cov(Y_t, Y_{t-j})}{\sqrt{var(Y_t)var(Y_{t-j})}}$, with $cov(Y_t, Y_{t-j}) = \frac{1}{T-j-1}\sum_{t=j+1}^T(Y_t - \bar{Y}_{j+1,T})(Y_{t-j} - \bar{Y}_{1,T-j})$. AR(1) check: $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$; $\beta_1 = 0$ means the lag is useless.

Prediction
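
A sketch of the sample autocorrelation on a simulated AR(1) series; note it uses plain sample moments rather than the exact normalization above (coefficient and series length are arbitrary):

```python
import numpy as np

def autocorr(y, j):
    """Sample j-th autocorrelation of a series y (a simple sketch of rho_j)."""
    y = np.asarray(y, dtype=float)
    y_t, y_lag = y[j:], y[:-j]
    num = np.mean((y_t - y_t.mean()) * (y_lag - y_lag.mean()))
    return num / np.sqrt(y_t.var() * y_lag.var())

rng = np.random.default_rng(0)
# AR(1): Y_t = 0.8 * Y_{t-1} + u_t
y = np.zeros(2000)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()
print(autocorr(y, 1), autocorr(y, 2))   # roughly 0.8 and 0.64
```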

Bayes' Theorem: $P(h|X) = \frac{P(X|h)P(h)}{P(X)}$. $X$: the data sample (evidence); $h$: the hypothesis that $class(X) = Y$; $P(X)$ is fixed; $P(h) = \pi$ is the prior probability; $P(X|h)$ is the likelihood $\prod_n\beta_{yn}^{x_n}$; $P(X) = \sum_h P(X|h)P(h)$; $P(h|X)$ is the posterior probability.
Maximum a posteriori: $h_{MAP} = \arg\max_h P(X|h)P(h)$.
$y^* = \arg\max_y\prod_n\beta_{yn}^{x_n}\times\pi_y = \arg\max_y\sum_n x_n\log\beta_{yn} + \log\pi_y$
Optimization: $n$ indexes words, $j$ classes, $d$ documents in the collection $D$; $\beta_{jn} = \frac{\text{count of word } n \text{ in class } j}{\text{count of all words in class } j}$ (smoothing: $\frac{\cdots + 1}{\cdots + N}$), $\pi_j = \frac{\text{number of documents } d \text{ in class } j}{|D|}$
For a test document $t$: $p(y = c|x_t)\propto p(y = c)\times\prod_n(\beta_{cn})^{x_{tn}}$, where $x_{tn}$ is word $n$'s count in $x_t$.
A generative model (not discriminative, like logistic regression). A generative model models $P(X, Y)$; a discriminative model models $P(Y|X)$.
Multinoulli distribution: multiple options, one tryout ($z\sim\text{Multinoulli}(\pi)$); Multinomial: multiple options, multiple tryouts ($x_d\sim\text{Multinomial}(\beta_d)$).

Naive Bayes for Text
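
A sketch of multinomial naive Bayes with add-1 smoothing on a toy term-count matrix, following the $\beta_{jn}$, $\pi_j$, and $\arg\max$ formulas above (the data and class count are made up):

```python
import numpy as np

# toy term-count matrix: rows = documents, columns = words; y = class labels
X = np.array([[2, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 1, 1, 2]])
y = np.array([0, 0, 1, 1])
n_classes, n_words = 2, X.shape[1]

pi = np.array([np.mean(y == c) for c in range(n_classes)])        # priors pi_j
beta = np.zeros((n_classes, n_words))
for c in range(n_classes):
    counts = X[y == c].sum(axis=0)
    beta[c] = (counts + 1) / (counts.sum() + n_words)             # add-1 smoothing

def predict(x):
    # argmax_y  sum_n x_n log beta_{yn} + log pi_y
    scores = x @ np.log(beta).T + np.log(pi)
    return int(np.argmax(scores))

print(predict(np.array([1, 2, 0, 0])))   # class 0
print(predict(np.array([0, 0, 1, 2])))   # class 1
```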

Corpus: a collection of documents; word $w$, document $d$, topic $z$; word count in a document $c(w, d)$; word distribution per topic $\beta_{zw} = p(w|z)$; topic (soft) distribution per document $\theta_{dz} = p(z|d)$.
Maximize $\log L = \sum_{d,w}c(w,d)\log\sum_z\theta_{dz}\beta_{zw}$ s.t. $\sum_z\theta_{dz} = 1$, $\sum_w\beta_{zw} = 1$, optimized by EM until convergence.
Generally, E: $p(z|w,d)\propto p(w|z,d)p(z|d) = \beta_{zw}\theta_{dz}$; M: $\beta_{zw}\propto\sum_d p(z|w,d)c(w,d)$, $\theta_{dz}\propto\sum_w p(z|w,d)c(w,d)$.
E.g., E: $p(z|w,d) = \frac{\beta_{zw}\theta_{dz}}{\sum_{z'}\beta_{z'w}\theta_{dz'}}$; M: $\beta_{zw} = \frac{\sum_d p(z|w,d)c(w,d)}{\sum_{w',d}p(z|w',d)c(w',d)}$, $\theta_{dz} = \frac{\sum_w p(z|w,d)c(w,d)}{N_d}$, where $N_d$ is the number of words in document $d$.

pLSA
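
A sketch of the pLSA EM iteration above on random counts (topic count, initialization, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 6, 8, 2
C = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)    # word counts c(w, d)

# random init of beta_{zw} = p(w|z) and theta_{dz} = p(z|d)
beta = rng.random((n_topics, n_words));  beta /= beta.sum(axis=1, keepdims=True)
theta = rng.random((n_docs, n_topics));  theta /= theta.sum(axis=1, keepdims=True)

for _ in range(200):
    # E-step: p(z|w,d) proportional to beta_{zw} * theta_{dz}; shape (docs, words, topics)
    post = theta[:, None, :] * beta.T[None, :, :]
    post /= post.sum(axis=2, keepdims=True)
    # M-step: re-estimate beta and theta from expected counts c(w,d) * p(z|w,d)
    expected = C[:, :, None] * post
    beta = expected.sum(axis=0).T                   # (topics, words)
    beta /= beta.sum(axis=1, keepdims=True)
    theta = expected.sum(axis=1)                    # (docs, topics); dividing by N_d
    theta /= theta.sum(axis=1, keepdims=True)

log_lik = np.sum(C * np.log(theta @ beta))          # sum_{d,w} c(w,d) log sum_z theta_dz beta_zw
print(log_lik)
```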