Model-Based Clustering Methods: Expectation-Maximization and Conceptual Clustering

This document gives an overview of model-based clustering methods, focusing on Expectation-Maximization (EM) and conceptual clustering. EM is an iterative refinement algorithm used to find parameter estimates for the probability distributions in a mixture density model. Conceptual clustering is a form of unsupervised learning that produces a classification scheme for a set of unlabeled objects and finds a characteristic description for each concept. The document also discusses COBWEB, a popular and simple method of incremental conceptual clustering.

DELHI TECHNOLOGICAL UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DATA WAREHOUSING AND DATA MINING
“MODEL BASED CLUSTERING METHODS”
Presented By:
Surajkant Suman
2K17/CO/352
Batch: A5


Model-Based Clustering Methods

- Attempt to optimize the fit between the data and some mathematical model
- Assumption: the data are generated by a mixture of underlying probability distributions
- Techniques:
  - Expectation-Maximization
  - Conceptual clustering
  - Neural networks approach

Expectation-Maximization (EM)

- An iterative refinement algorithm used to find parameter estimates
- An extension of k-means:
  - assigns an object to a cluster according to a weight representing its probability of membership
  - starts from an initial estimate of the parameters
  - iteratively reassigns scores
- Initial guess for the parameters: randomly select k objects to represent the cluster means or centers
- Iteratively refine the parameters / clusters:
  - Expectation step: assign each object x_i to cluster C_k with probability
    P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i),
    where p(x_i | C_k) follows a normal distribution around the mean m_k of C_k
  - Maximization step: use these probability estimates to re-estimate the model parameters, e.g.
    m_k = (1/n) Σ_{i=1..k} x_i P(x_i ∈ C_k) / Σ_j P(x_i ∈ C_j)
- Simple and easy to implement
- Complexity depends on the number of features, objects, and iterations
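A minimal sketch of this loop for a one-dimensional Gaussian mixture, using NumPy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guess: randomly select k objects as the cluster means (as in k-means)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Expectation step: P(x_i in C_k) proportional to p(C_k) * p(x_i | C_k)
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)))
        resp = dens / dens.sum(axis=1, keepdims=True)  # membership weights
        # Maximization step: re-estimate model parameters from the weights
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / n
    return means, variances, weights

# Usage: recover two overlapping components
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x, k=2))
```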

COBWEB Clustering Method

(Figure: an example classification tree)

COBWEB

- Builds a classification tree
- Each node represents a concept and its probabilistic distribution (a summary of the objects under that node)
- Concept description: the conditional probabilities P(A_i = v_ij | C_k)
- Sibling nodes at a given level form a partition
- Category utility: the increase in the expected number of attribute values that can be correctly guessed given a partition (sketched below)
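A small sketch of category utility over nominal data; the formula in the docstring is the standard definition, an assumption here, since the slides give only the verbal description:

```python
from collections import Counter

def category_utility(partition, data):
    """Category utility of a partition (list of clusters, each a list of row
    indices) over nominal data (list of attribute-value tuples).
    CU = (1/k) * sum_k P(C_k) * [sum_ij P(A_i=v_ij|C_k)^2 - sum_ij P(A_i=v_ij)^2]
    """
    n_rows, n_attrs, k = len(data), len(data[0]), len(partition)

    def sq_sum(rows):
        # sum over attributes i and values v of P(A_i = v)^2 within `rows`
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(data[r][i] for r in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    baseline = sq_sum(range(n_rows))  # expected correct guesses without the partition
    cu = sum(len(c) / n_rows * (sq_sum(c) - baseline) for c in partition)
    return cu / k

# Usage: a well-separated 2-cluster partition scores positive
data = [("red", "s"), ("red", "s"), ("blue", "l"), ("blue", "l")]
print(category_utility([[0, 1], [2, 3]], data))
```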

COBWEB (continued)

- COBWEB is sensitive to the order of the records
- Additional operations: merging and splitting
  - the two best hosts are considered for merging
  - the best host is considered for splitting
- Limitations:
  - the assumption that the attributes are independent of each other is often too strong, since correlations may exist
  - not suitable for clustering large databases
- CLASSIT: an extension of COBWEB for the incremental clustering of continuous data

Neural Network Approach

- Represents each cluster as an exemplar, acting as a "prototype" of the cluster
- New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure
- Self-Organizing Map (SOM), sketched after this list:
  - competitive learning
  - involves a hierarchical architecture of several units (neurons)
  - neurons compete in a "winner-takes-all" fashion for the object currently being presented
  - the organization of the units forms a feature map
  - applied, e.g., to web document clustering
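A compact SOM training sketch in NumPy (the grid size, learning-rate schedule, and names are illustrative assumptions):

```python
import numpy as np

def train_som(data, grid=(5, 5), n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal Self-Organizing Map: each neuron holds an exemplar weight
    vector; the winner and its grid neighbours move toward each input."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))  # one exemplar per neuron
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-3
        x = data[rng.integers(len(data))]
        # winner-takes-all: find the best-matching unit (BMU)
        d = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # pull the BMU and its neighbours on the map toward the input
        grid_d = np.linalg.norm(coords - np.array(bmu), axis=-1)
        h = np.exp(-grid_d ** 2 / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights  # the learned feature map

# Usage: organize 3-D points onto a 5x5 feature map
som = train_som(np.random.rand(200, 3))
```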

Clustering High-Dimensional Data

- As dimensionality increases:
  - the number of irrelevant dimensions may produce noise and mask the real clusters
  - the data become sparse
  - distance measures become meaningless
- Feature transformation methods (see the sketch after this list):
  - PCA and SVD summarize the data by creating linear combinations of the attributes
  - but they do not remove any attributes, and the transformed attributes are complex to interpret
- Feature selection methods:
  - find the most relevant subset of attributes with respect to the class labels
  - e.g., entropy analysis
- Subspace clustering: searches for groups of clusters within different subspaces of the same data set
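A minimal PCA-by-SVD sketch of such a feature transformation (names are illustrative):

```python
import numpy as np

def pca_transform(X, n_components):
    """Feature transformation via PCA: the derived attributes are linear
    combinations of the original ones; nothing is removed, only recombined."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # project onto top components

# Usage: compress 10 attributes into 2 derived ones
Z = pca_transform(np.random.rand(100, 10), n_components=2)
print(Z.shape)  # (100, 2)
```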

CLIQUE (CLustering In QUEst)

- A dimension-growth subspace clustering method
- Starts at 1-D and grows upward to higher dimensions
- Partitions each dimension into a grid and determines whether a cell is dense
- CLIQUE:
  - determines sparse and crowded units
  - dense unit: the fraction of data points in the unit exceeds a threshold
  - cluster: a maximal set of connected dense units
- Finds the subspace of the highest dimensionality
- Insensitive to the order of inputs
- Performance depends on the grid size and the density threshold, which are difficult to determine across all dimensions
- Several lower-dimensional subspaces will have to be processed; an adaptive strategy can be used (a first-pass sketch follows)
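A sketch of the first CLIQUE pass on a single dimension, i.e., grid the axis and keep the dense cells (the cell count and threshold are illustrative parameters):

```python
import numpy as np

def dense_units_1d(values, n_cells=10, threshold=0.15):
    """Partition one dimension into a grid and return the indices of the
    cells whose fraction of data points exceeds the density threshold."""
    lo, hi = values.min(), values.max()
    cells = np.minimum(((values - lo) / (hi - lo) * n_cells).astype(int),
                       n_cells - 1)              # grid cell index per point
    counts = np.bincount(cells, minlength=n_cells)
    return np.flatnonzero(counts / len(values) > threshold)

# Usage: two dense regions on one axis -> two groups of dense cells;
# CLIQUE then joins dense 1-D units to grow higher-dimensional ones
v = np.concatenate([np.random.normal(2, 0.2, 300), np.random.normal(8, 0.2, 300)])
print(dense_units_1d(v))
```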

PROCLUS (PROjected CLUStering)

- A dimension-reduction subspace clustering technique
- Finds an initial approximation of the clusters in the high-dimensional space
- Avoids generating a large number of overlapping clusters of lower dimensionality
- Finds the best set of medoids by a hill-climbing process (similar to CLARANS)
- Uses the Manhattan segmental distance measure (sketched below)
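The Manhattan segmental distance averages the L1 distance over only the dimensions relevant to a cluster; a direct translation:

```python
def manhattan_segmental(x, y, dims):
    """Manhattan segmental distance between points x and y, restricted to
    the cluster-relevant dimensions `dims` and averaged over them."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Usage: distance over dimensions 0 and 2 only
print(manhattan_segmental([1.0, 5.0, 3.0], [2.0, 9.0, 1.0], dims=[0, 2]))  # 1.5
```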

Frequent-Pattern-Based Clustering

- Frequent patterns may also form clusters
- Instead of growing clusters dimension by dimension, sets of frequent itemsets are determined
- Two common techniques:
  - frequent-term-based text clustering
  - clustering by pattern similarity

Frequent-Term-Based Text Clustering

- Text documents are clustered based on the frequent terms they contain
- Documents are represented by their terms, so the dimensionality is very high
- Frequent-term-based analysis:
  - a well-selected subset of the set of all frequent term sets must be discovered
  - F_i is a frequent term set and cov(F_i) the set of documents covered by it
  - requirement: ∪_{i=1..k} cov(F_i) = D (the whole document set), while the overlap between cov(F_i) and cov(F_j) is minimized
  - the description of each cluster is its frequent term set (a greedy sketch follows)
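A greedy sketch of this cover-selection idea (the set-covering objective is from the slides; the greedy strategy and names are assumptions):

```python
def frequent_term_clusters(docs, frequent_sets):
    """Pick frequent term sets whose covers together span all documents
    while keeping the overlap between chosen covers small."""
    def cov(terms):
        return {i for i, d in enumerate(docs) if terms <= d}

    remaining = set(range(len(docs)))
    chosen = []
    while remaining and frequent_sets:
        # choose the term set covering the most still-uncovered documents
        best = max(frequent_sets, key=lambda f: len(cov(f) & remaining))
        chosen.append((best, cov(best) & remaining))
        remaining -= cov(best)
        frequent_sets = [f for f in frequent_sets if f != best]
    return chosen  # each cluster is described by its frequent term set

# Usage: documents represented as sets of terms
docs = [{"data", "mining"}, {"data", "cluster"}, {"web", "cluster"}]
print(frequent_term_clusters(docs, [{"data"}, {"cluster"}, {"web"}]))
```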