November 27, 2014 Data Mining: Concepts and 1
Chapter 6. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Clustering High-Dimensional Data
- Clustering high-dimensional data
  - Many applications: text documents, DNA micro-array data
  - Major challenges:
    - Many irrelevant dimensions may mask clusters
    - Distance measures become meaningless because points become nearly equidistant
    - Clusters may exist only in some subspaces
- Methods
  - Feature transformation: only effective if most dimensions are relevant
    - PCA & SVD are useful only when features are highly correlated/redundant
  - Feature selection: wrapper or filter approaches
    - Useful for finding a subspace where the data have nice clusters
  - Subspace clustering: find clusters in all the possible subspaces
    - CLIQUE, ProClus, and frequent pattern-based clustering
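The equi-distance challenge listed above can be illustrated numerically. The following is a minimal sketch, assuming uniformly random points and Euclidean distance; the function name `relative_contrast` and the sample sizes are illustrative choices, not part of the slides.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max distance - min distance) / min distance from one random
    query point to n_points uniformly random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# As the dimensionality grows, the nearest and farthest neighbours end up
# at almost the same distance, so the contrast shrinks toward zero.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(1000, n_dims), 3))
```

A small contrast means that "nearest" and "farthest" are barely distinguishable, which is why full-space distance measures lose their usefulness for clustering.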
Why Subspace Clustering? (adapted from Parsons et al. SIGKDD Explorations 2004)
- Clusters may exist only in some subspaces
- Subspace clustering: find clusters in all the subspaces
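To make the first point concrete, here is a small sketch with synthetic data in which a group of points is tight only in dimensions 0 and 1 and uniformly spread in the rest. The data generator, the `compactness` ratio, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def make_data(n_cluster=100, n_noise=100, n_dims=20, seed=0):
    """Uniform noise everywhere, except that the first n_cluster points
    are tightly grouped in dimensions 0 and 1 only."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_cluster + n_noise, n_dims))
    data[:n_cluster, :2] = 0.5 + 0.02 * rng.standard_normal((n_cluster, 2))
    labels = np.array([True] * n_cluster + [False] * n_noise)
    return data, labels


def mean_pairwise_distance(x):
    """Average Euclidean distance over all ordered pairs of distinct rows."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(x)
    return dists.sum() / (n * (n - 1))


def compactness(data, labels, dims):
    """Mean distance inside the labelled group divided by the mean distance
    over all points, measured in the subspace `dims` (smaller = tighter)."""
    sub = data[:, dims]
    return mean_pairwise_distance(sub[labels]) / mean_pairwise_distance(sub)


data, labels = make_data()
print("subspace {0, 1}:", round(compactness(data, labels, [0, 1]), 3))
print("full space     :", round(compactness(data, labels, list(range(20))), 3))
```

In the two-dimensional subspace the group is far more compact than the data as a whole, while in the full space the ratio is close to 1, so the irrelevant dimensions mask the cluster.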
CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter (the density threshold)
  - A cluster is a maximal set of connected dense units within a subspace
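The grid and density definitions above can be sketched in a few lines. This is a simplified illustration for a single, given subspace, not the full CLIQUE algorithm (it omits the bottom-up search over subspaces); the function names, the min/max scaling of each dimension, and the face-sharing notion of "connected" are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def dense_units(data, dims, n_intervals, density_threshold):
    """Return the grid units (tuples of interval indices) in subspace `dims`
    whose fraction of all points exceeds `density_threshold`."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.floor((data - lo) / span * n_intervals).astype(int)
    cells = np.minimum(cells, n_intervals - 1)  # maximum goes in last interval
    counts = Counter(map(tuple, cells[:, dims].tolist()))
    return {unit for unit, c in counts.items() if c / len(data) > density_threshold}


def connected_clusters(units):
    """Group dense units into maximal sets of connected units; two units are
    treated as connected if they differ by one step along exactly one axis."""
    remaining = set(units)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, stack = {start}, [start]
        while stack:
            unit = stack.pop()
            for axis in range(len(unit)):
                for step in (-1, 1):
                    nb = unit[:axis] + (unit[axis] + step,) + unit[axis + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        cluster.add(nb)
                        stack.append(nb)
        clusters.append(cluster)
    return clusters


# Synthetic example: one tight blob plus uniform background noise.
rng = np.random.default_rng(0)
blob = 0.3 + 0.03 * rng.standard_normal((200, 2))
noise = rng.random((100, 2))
points = np.vstack([blob, noise])
units = dense_units(points, dims=[0, 1], n_intervals=10, density_threshold=0.05)
for cluster in connected_clusters(units):
    print(sorted(cluster))
```

Both the number of intervals and the density threshold are user inputs, and the result depends on them.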
[Figure: CLIQUE example with axes Salary (10,000), Vacation (week), and age]
Strength and Weakness of CLIQUE
- Strength
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in the input and does not presume some canonical data distribution
  - Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness
  - The accuracy of the clustering result may be degraded as the price of the method's simplicity
Clustering by Pattern Similarity (p-Clustering)
- Right: The micro-array "raw" data shows 3 genes and their values in a multi-dimensional space
  - Difficult to find their patterns
- Bottom: Some subsets of dimensions form nice shift and scaling patterns
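The two kinds of pattern can be checked directly. The following is a minimal sketch, assuming a shift pattern means the two profiles differ by a constant and a scaling pattern means they differ by a constant positive factor; the helper names and the example values are hypothetical.

```python
import numpy as np


def is_shift_pattern(a, b, tol=1e-9):
    """True if b - a is (numerically) the same constant in every dimension."""
    diff = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    return bool(diff.max() - diff.min() <= tol)


def is_scaling_pattern(a, b, tol=1e-9):
    """True if b / a is constant; assumes strictly positive values, so that
    a scaling pattern becomes a shift pattern after taking logarithms."""
    return is_shift_pattern(np.log(a), np.log(b), tol)


gene1 = np.array([10.0, 50.0, 20.0, 80.0])
gene2 = gene1 + 30.0   # shifted copy of gene1
gene3 = gene1 * 2.5    # scaled copy of gene1

print(is_shift_pattern(gene1, gene2), is_scaling_pattern(gene1, gene3))
# The Euclidean distance between gene1 and gene2 is large even though
# the two profiles have exactly the same shape.
print(np.linalg.norm(gene1 - gene2))
```

The logarithm trick is one common way to treat scaling patterns with the same machinery as shift patterns.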
Why p-Clustering?
- Microarray data analysis may need
  - Clustering on thousands of dimensions (attributes)
  - Discovery of both shift and scaling patterns
- Clustering with a Euclidean distance measure? Cannot find shift patterns
- Clustering on derived attribute Aij = ai - aj? Introduces N(N-1) dimensions
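- Bi-cluster using the transformed mean-squared residue score of a submatrix (I, J)
  $$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}\right)^2$$
  - where
    $$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
  - A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
- Problems with bi-cluster
  - No downward closure property
  - Due to averaging, it may contain outliers but still be within the δ-threshold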
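The δ-cluster test and the averaging problem can be tried out numerically. This is a minimal sketch, assuming the standard mean squared residue, that is, the mean over the submatrix of (d_ij - d_iJ - d_Ij + d_IJ)^2 with row mean d_iJ, column mean d_Ij, and overall mean d_IJ; the function names and the example matrix are illustrative.

```python
import numpy as np


def mean_squared_residue(d, rows, cols):
    """Mean squared residue H(I, J) of the submatrix of d selected by the
    row indices `rows` (I) and column indices `cols` (J)."""
    sub = np.asarray(d, dtype=float)[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)  # d_iJ
    col_mean = sub.mean(axis=0, keepdims=True)  # d_Ij
    all_mean = sub.mean()                       # d_IJ
    residue = sub - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())


def is_delta_cluster(d, rows, cols, delta):
    """True if the selected submatrix satisfies H(I, J) <= delta."""
    return mean_squared_residue(d, rows, cols) <= delta


# A perfect shift pattern: every row is the same profile plus a row offset.
profile = np.arange(10.0) ** 2
offsets = 3.0 * np.arange(10.0)
perfect = np.add.outer(offsets, profile)
everything = list(range(10))
print(mean_squared_residue(perfect, everything, everything))

# One entry is pushed 5 units off the pattern. Because the score averages
# over all 100 entries, the submatrix still passes a threshold of 0.5.
noisy = perfect.copy()
noisy[0, 0] += 5.0
print(mean_squared_residue(noisy, everything, everything))
print(is_delta_cluster(noisy, everything, everything, delta=0.5))
```

The second score illustrates the outlier problem: a single strongly deviating entry is diluted by the averaging and the submatrix is still accepted.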