Model-Based Clustering Methods: Expectation-Maximization and Conceptual Clustering

This document gives an overview of model-based clustering methods, focusing on Expectation-Maximization (EM) and conceptual clustering. EM is an iterative refinement algorithm used to find parameter estimates for the probability distributions in a mixture density model. Conceptual clustering is a form of unsupervised learning that produces a classification scheme for a set of unlabeled objects and finds a characteristic description for each concept. The document also discusses COBWEB, a popular and simple method of incremental conceptual clustering.

DELHI TECHNOLOGICAL UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DATA WAREHOUSING AND DATA MINING
“MODEL BASED CLUSTERING METHODS”
Presented By:
Surajkant Suman
2K17/CO/352
Batch: A5


Model-Based Clustering Methods

- Attempt to optimize the fit between the data and some mathematical model
- Assumption: the data are generated by a mixture of underlying probability distributions
- Techniques:
  - Expectation-Maximization
  - Conceptual clustering
  - Neural networks approach

Expectation-Maximization (EM)

- An iterative refinement algorithm used to find parameter estimates
- An extension of k-means:
  - assigns an object to a cluster according to a weight representing its probability of membership
  - starts from an initial estimate of the parameters
  - iteratively reassigns scores
- Initial guess for the parameters: randomly select k objects to represent the cluster means or centers
- Iteratively refine the parameters / clusters:
  - Expectation step: assign each object x_i to cluster C_k with probability
    P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i),
    where p(x_i | C_k) follows a normal distribution around the mean m_k of C_k
  - Maximization step: use these probability estimates to re-estimate the model parameters, e.g.
    m_k = (1/n) Σ_{i=1..k} x_i P(x_i ∈ C_k) / Σ_j P(x_i ∈ C_j)
- Simple and easy to implement
- Complexity depends on the number of features, objects, and iterations
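A minimal sketch of this loop for a one-dimensional Gaussian mixture, using NumPy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guess: randomly select k objects as the cluster means (as in k-means)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Expectation step: P(x_i in C_k) proportional to p(C_k) * p(x_i | C_k)
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)))
        resp = dens / dens.sum(axis=1, keepdims=True)  # membership weights
        # Maximization step: re-estimate model parameters from the weights
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / n
    return means, variances, weights

# Usage: recover two overlapping components
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x, k=2))
```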

COBWEB Clustering Method

(Figure: an example classification tree)

COBWEB

- Builds a classification tree
- Each node represents a concept and its probabilistic distribution (a summary of the objects under that node)
- Concept description: the conditional probabilities P(A_i = v_ij | C_k)
- Sibling nodes at a given level form a partition
- Category utility: the increase in the expected number of attribute values that can be correctly guessed given a partition (sketched below)
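A small sketch of category utility over nominal data; the formula in the docstring is the standard definition, an assumption here, since the slides give only the verbal description:

```python
from collections import Counter

def category_utility(partition, data):
    """Category utility of a partition (list of clusters, each a list of row
    indices) over nominal data (list of attribute-value tuples).
    CU = (1/k) * sum_k P(C_k) * [sum_ij P(A_i=v_ij|C_k)^2 - sum_ij P(A_i=v_ij)^2]
    """
    n_rows, n_attrs, k = len(data), len(data[0]), len(partition)

    def sq_sum(rows):
        # sum over attributes i and values v of P(A_i = v)^2 within `rows`
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(data[r][i] for r in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    baseline = sq_sum(range(n_rows))  # expected correct guesses without the partition
    cu = sum(len(c) / n_rows * (sq_sum(c) - baseline) for c in partition)
    return cu / k

# Usage: a well-separated 2-cluster partition scores positive
data = [("red", "s"), ("red", "s"), ("blue", "l"), ("blue", "l")]
print(category_utility([[0, 1], [2, 3]], data))
```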

COBWEB (continued)

- COBWEB is sensitive to the order of the records
- Additional operations: merging and splitting
  - the two best hosts are considered for merging
  - the best host is considered for splitting
- Limitations:
  - the assumption that the attributes are independent of each other is often too strong, since correlations may exist
  - not suitable for clustering large databases
- CLASSIT: an extension of COBWEB for the incremental clustering of continuous data

Neural Network Approach

- Represents each cluster as an exemplar, acting as a "prototype" of the cluster
- New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure
- Self-Organizing Map (SOM), sketched after this list:
  - competitive learning
  - involves a hierarchical architecture of several units (neurons)
  - neurons compete in a "winner-takes-all" fashion for the object currently being presented
  - the organization of the units forms a feature map
  - applied, e.g., to web document clustering
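A compact SOM training sketch in NumPy (the grid size, learning-rate schedule, and names are illustrative assumptions):

```python
import numpy as np

def train_som(data, grid=(5, 5), n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal Self-Organizing Map: each neuron holds an exemplar weight
    vector; the winner and its grid neighbours move toward each input."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))  # one exemplar per neuron
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-3
        x = data[rng.integers(len(data))]
        # winner-takes-all: find the best-matching unit (BMU)
        d = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # pull the BMU and its neighbours on the map toward the input
        grid_d = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-grid_d ** 2 / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights  # the learned feature map

# Usage: organize 3-D points onto a 5x5 feature map
som = train_som(np.random.rand(200, 3))
```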

Clustering High-Dimensional Data

- As dimensionality increases:
  - the number of irrelevant dimensions may produce noise and mask the real clusters
  - the data become sparse
  - distance measures become meaningless
- Feature transformation methods (see the sketch after this list):
  - PCA and SVD summarize the data by creating linear combinations of the attributes
  - but they do not remove any attributes, and the transformed attributes are complex to interpret
- Feature selection methods:
  - find the most relevant subset of attributes with respect to the class labels
  - e.g., entropy analysis
- Subspace clustering: searches for groups of clusters within different subspaces of the same data set
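A minimal PCA-by-SVD sketch of such a feature transformation (names are illustrative):

```python
import numpy as np

def pca_transform(X, n_components):
    """Feature transformation via PCA: the derived attributes are linear
    combinations of the original ones; nothing is removed, only recombined."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # project onto top components

# Usage: compress 10 attributes into 2 derived ones
Z = pca_transform(np.random.rand(100, 10), n_components=2)
print(Z.shape)  # (100, 2)
```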

CLIQUE (CLustering In QUEst)

- A dimension-growth subspace clustering method
- Starts at 1-D and grows upward to higher dimensions
- Partitions each dimension into a grid and determines whether a cell is dense
- CLIQUE:
  - determines sparse and crowded units
  - dense unit: the fraction of data points in the unit exceeds a threshold
  - cluster: a maximal set of connected dense units
- Finds the subspace of the highest dimensionality
- Insensitive to the order of inputs
- Performance depends on the grid size and the density threshold, which are difficult to determine across all dimensions
- Several lower-dimensional subspaces will have to be processed; an adaptive strategy can be used (a first-pass sketch follows)
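A sketch of the first CLIQUE pass on a single dimension, i.e., grid the axis and keep the dense cells (the cell count and threshold are illustrative parameters):

```python
import numpy as np

def dense_units_1d(values, n_cells=10, threshold=0.15):
    """Partition one dimension into a grid and return the indices of the
    cells whose fraction of data points exceeds the density threshold."""
    lo, hi = values.min(), values.max()
    cells = np.minimum(((values - lo) / (hi - lo) * n_cells).astype(int),
                       n_cells - 1)              # grid cell index per point
    counts = np.bincount(cells, minlength=n_cells)
    return np.flatnonzero(counts / len(values) > threshold)

# Usage: two dense regions on one axis -> two groups of dense cells;
# CLIQUE then joins dense 1-D units to grow higher-dimensional ones
v = np.concatenate([np.random.normal(2, 0.2, 300), np.random.normal(8, 0.2, 300)])
print(dense_units_1d(v))
```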

PROCLUS (PROjected CLUStering)

- A dimension-reduction subspace clustering technique
- Finds an initial approximation of the clusters in the high-dimensional space
- Avoids generating a large number of overlapping clusters of lower dimensionality
- Finds the best set of medoids by a hill-climbing process (similar to CLARANS)
- Uses the Manhattan segmental distance measure (sketched below)
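The Manhattan segmental distance averages the L1 distance over only the dimensions relevant to a cluster; a direct translation:

```python
def manhattan_segmental(x, y, dims):
    """Manhattan segmental distance between points x and y, restricted to
    the cluster-relevant dimensions `dims` and averaged over them."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Usage: distance over dimensions 0 and 2 only
print(manhattan_segmental([1.0, 5.0, 3.0], [2.0, 9.0, 1.0], dims=[0, 2]))  # 1.5
```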

Frequent-Pattern-Based Clustering

- Frequent patterns may also form clusters
- Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined
- Two common techniques:
  - frequent-term-based text clustering
  - clustering by pattern similarity

Frequent-Term-Based Text Clustering

- Text documents are clustered based on the frequent terms they contain
- Documents are represented by their terms, so the dimensionality is very high
- Frequent-term-based analysis:
  - a well-selected subset of the set of all frequent term sets must be discovered
  - F_i is a frequent term set and cov(F_i) the set of documents covered by it
  - requirement: ∪_{i=1..k} cov(F_i) = D (the whole document set), while the overlap between cov(F_i) and cov(F_j) is minimized
  - the description of each cluster is its frequent term set (a greedy sketch follows)
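A greedy sketch of this cover-selection idea (the set-covering objective is from the slides; the greedy strategy and names are assumptions):

```python
def frequent_term_clusters(docs, frequent_sets):
    """Pick frequent term sets whose covers together span all documents
    while keeping the overlap between chosen covers small."""
    def cov(terms):
        return {i for i, d in enumerate(docs) if terms <= d}

    remaining = set(range(len(docs)))
    chosen = []
    while remaining and frequent_sets:
        # choose the term set covering the most still-uncovered documents
        best = max(frequent_sets, key=lambda f: len(cov(f) & remaining))
        chosen.append((best, cov(best) & remaining))
        remaining -= cov(best)
        frequent_sets = [f for f in frequent_sets if f != best]
    return chosen  # each cluster is described by its frequent term set

# Usage: documents represented as sets of terms
docs = [{"data", "mining"}, {"data", "cluster"}, {"web", "cluster"}]
print(frequent_term_clusters(docs, [{"data"}, {"cluster"}, {"web"}]))
```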