An overview of model-based clustering methods, specifically focusing on Expectation-Maximization (EM) and Conceptual Clustering. EM is an iterative refinement algorithm used to find parameter estimates for probability distributions in a mixture density model. Conceptual Clustering, on the other hand, is a form of unsupervised learning that produces a classification scheme for a set of unlabeled objects and finds characteristic descriptions for each concept. The document also discusses the COBWEB clustering method, a popular and simple method of incremental conceptual learning.
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DATA WAREHOUSING AND DATA MINING
"MODEL-BASED CLUSTERING METHODS"
Presented By: Surajkant Suman, 2K17/CO/ Batch: A
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model.
Assumption: the data are generated by a mixture of underlying probability distributions.
Techniques:
- Expectation-Maximization
- Conceptual Clustering
- Neural Network Approach
Expectation-Maximization (EM)
An iterative refinement algorithm that assigns an object to a cluster according to a weight representing its probability of membership.

Expectation step: assign each object x_i to cluster C_k with the probability

  P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i)

where p(x_i | C_k) is the density of component C_k (e.g. a Gaussian N(m_k, E_k)) evaluated at x_i.

Maximization step: re-estimate the model parameters using the current membership weights, e.g. the cluster means

  m_k = Σ_i x_i P(x_i ∈ C_k) / Σ_i P(x_i ∈ C_k)
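The two steps above can be sketched in a few lines. This is an illustrative pure-Python EM loop for a two-component one-dimensional Gaussian mixture (the function name, initialization, and iteration count are choices made here, not part of the slides):

```python
import math

def em_gmm_1d(data, iters=50):
    """Illustrative EM for a two-component 1-D Gaussian mixture."""
    # Initialise the two component means at the data extremes.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]

    def pdf(x, m, v):
        # Gaussian density N(m, v) evaluated at x.
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: membership weights P(x_i in C_k) for each object.
        resp = []
        for x in data:
            w = [pi[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixture weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, pi
```

On data drawn from two well-separated groups, the estimated means converge to the two group centres.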
COBWEB Clustering Method
A popular and simple method of incremental conceptual clustering. Its knowledge representation is a classification tree, in which each node refers to a concept.
(Figure: an example COBWEB classification tree.)
COBWEB guides tree construction with the category utility (CU) of a partition into clusters C_1, …, C_n:

  CU = (1/n) Σ_k P(C_k) [ Σ_i Σ_j P(A_i = v_ij | C_k)² − Σ_i Σ_j P(A_i = v_ij)² ]

Category utility measures the increase in the expected number of attribute values (A_i = v_ij) that can be correctly guessed given a partition, over the expected number of correct guesses without that knowledge.
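Category utility is easy to compute directly from its definition. A minimal sketch for nominal data (the function name and input representation, a list of clusters of attribute tuples, are choices made here):

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition; each cluster is a list of
    attribute-value tuples, one tuple per object."""
    all_objs = [o for cluster in partition for o in cluster]
    n = len(all_objs)
    n_attrs = len(all_objs[0])

    def expected_correct(objs):
        # Sum over attributes i and values v of P(A_i = v)^2.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(o[i] for o in objs)
            total += sum((c / len(objs)) ** 2 for c in counts.values())
        return total

    base = expected_correct(all_objs)  # guessing without the partition
    cu = 0.0
    for cluster in partition:
        p_c = len(cluster) / n
        cu += p_c * (expected_correct(cluster) - base)
    return cu / len(partition)  # average over the number of clusters
```

A partition that perfectly separates attribute values scores higher than one that mixes them, which is exactly the signal COBWEB uses when deciding where to place a new object.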
COBWEB (continued)
COBWEB is sensitive to the order of the records. Additional operations:
- Merging: the two best hosts are considered for merging.
- Splitting: the best host is considered for splitting.
Limitations:
- The assumption that the attributes are independent of each other is often too strong, because correlations may exist.
- Not suitable for clustering large database data.
CLASSIT: an extension of COBWEB for incremental clustering of continuous-valued data.
Neural Network Approach
Represents each cluster as an exemplar, acting as a "prototype" of the cluster. New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure.

Self-Organizing Map (SOM)
- Competitive learning: involves a hierarchical architecture of several units (neurons).
- Neurons compete in a "winner-takes-all" fashion for the object currently being presented.
- The organization of the units forms a feature map.
- Application: web document clustering.
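The winner-takes-all idea can be sketched as follows. This is a deliberately minimal 1-D SOM, not a full implementation: the unit count, learning-rate schedule, and neighbourhood factors are illustrative choices, not values from the slides:

```python
import random

def train_som(data, n_units=3, iters=400, lr=0.3):
    """Minimal 1-D self-organizing map sketch: competitive,
    winner-takes-all learning with a shrinking neighbourhood."""
    random.seed(0)
    # Initialise each unit's weight vector to a random data point.
    units = [list(random.choice(data)) for _ in range(n_units)]
    for t in range(iters):
        x = random.choice(data)
        # Competition: the unit closest to x wins.
        dists = [sum((u[d] - x[d]) ** 2 for d in range(len(x)))
                 for u in units]
        w = dists.index(min(dists))
        rate = lr * (1 - t / iters)  # decaying learning rate
        for j, u in enumerate(units):
            if j == w:
                h = 1.0                           # winner moves fully
            elif abs(j - w) == 1 and t < iters // 2:
                h = 0.5                           # early on, neighbours follow
            else:
                h = 0.0
            for d in range(len(x)):
                u[d] += rate * h * (x[d] - u[d])  # move toward the object
    return units
```

After training on data with two well-separated groups, at least one unit's weight vector settles near each group, so the units act as the cluster exemplars described above.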
Clustering High-Dimensional Data
Challenges:
- The many irrelevant dimensions may produce noise and mask the real clusters.
- The data become sparse, so distance measures become meaningless.

Feature transformation (PCA, SVD): summarizes the data by creating linear combinations of the attributes, but does not remove any attributes, and the transformed attributes are complex to interpret.

Feature selection: finds the most relevant subset of attributes with respect to the class labels, e.g. by entropy analysis.

Subspace clustering: searches for groups of clusters within different subspaces of the same data set.
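To make the PCA idea concrete, here is a small sketch that extracts the first principal component of 2-D data by hand from the 2×2 covariance matrix (real code would use a linear-algebra library; the function name is an illustrative choice):

```python
import math

def pca_2d(points):
    """First principal component of 2-D points via the covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix entries [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of the 2x2 symmetric matrix.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # Corresponding eigenvector (handle the axis-aligned case).
    if abs(sxy) > 1e-12:
        v = (lam - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)
```

The returned direction is a linear combination of the original attributes; as the slide notes, both attributes still contribute, which is why transformed axes can be hard to interpret.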
CLIQUE: CLustering In QUest
A grid- and density-based subspace clustering method:
- Partitions each dimension into intervals and determines the sparse and crowded units.
- Dense unit: a unit in which the fraction of data points exceeds a threshold.
- Cluster: a maximal set of connected dense units.
A suitable density threshold is difficult to determine across all dimensions.
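The dense-unit step can be sketched as a single grid pass. This shows only the counting idea for one 2-D subspace; the full CLIQUE algorithm finds dense units bottom-up across subspaces (Apriori-style) and then connects them into clusters. Grid size, data range, and threshold below are illustrative parameters:

```python
from collections import Counter

def dense_units(points, grid=5, lo=0.0, hi=10.0, threshold=0.2):
    """Find dense 2-D grid units: a unit is dense when it holds
    more than `threshold` of all points (CLIQUE-style first pass)."""
    width = (hi - lo) / grid
    counts = Counter()
    for x, y in points:
        # Map each point to its grid cell, clamping the upper edge.
        cell = (min(int((x - lo) / width), grid - 1),
                min(int((y - lo) / width), grid - 1))
        counts[cell] += 1
    n = len(points)
    return {cell for cell, c in counts.items() if c / n > threshold}
```

Cells holding most of the data survive the threshold; isolated points do not, which is what lets CLIQUE separate crowded units from sparse ones.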
PROCLUS: PROjected CLUStering – a top-down projected clustering method, in which each cluster is found together with the subset of dimensions in which it is best defined.
Frequent-Pattern-Based Clustering
- Frequent-term-based text clustering
- Clustering by pattern similarity
Frequent-Term-Based Text Clustering
A well-selected subset of the set of all frequent term sets must be discovered. Let F_i be a frequent term set and cov(F_i) the set of documents covered by F_i. The selected term sets F_1, …, F_k must satisfy

  ∪_{i=1}^{k} cov(F_i) = D,

i.e. they must cover the whole document set D, while the overlap between cov(F_i) and cov(F_j) is minimized. The clusters are described by their frequent term sets.
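One simple way to realize "cover D with minimal overlap" is a greedy selection, sketched below. This is an illustrative simplification of frequent-term-based clustering algorithms, not the exact method from the slides; the input representation (a dict from term set to its document cover) is a choice made here:

```python
def greedy_term_clusters(freq_termsets, docs):
    """Greedily pick frequent term sets until their covers union to D.
    `freq_termsets` maps a term set (frozenset of terms) to its cover
    (set of document ids); `docs` is the whole document set D."""
    uncovered = set(docs)
    chosen = []
    while uncovered:
        # Pick the term set covering the most still-uncovered documents,
        # which keeps the overlap between chosen covers small.
        best = max(freq_termsets,
                   key=lambda f: len(freq_termsets[f] & uncovered))
        gain = freq_termsets[best] & uncovered
        if not gain:  # remaining documents are covered by no term set
            break
        chosen.append((best, freq_termsets[best]))
        uncovered -= gain
    return chosen
```

Each chosen term set becomes one cluster, and, as the slide notes, the term set itself serves as the cluster's description.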