Cluster Analysis: Find Best Clusters in unlabelled datasets


Clustering groups similar objects into the same cluster and dissimilar objects into different clusters, so that objects within a cluster are highly similar to each other but dissimilar to objects in other clusters. Similarity and dissimilarity are computed from the attribute values that describe the objects, typically with a distance measure.
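
As a rough illustration of how a distance measure turns attribute values into a dissimilarity score, here is a minimal sketch of two common choices, Euclidean and Manhattan distance. The objects `p`, `q` and `r` are made-up examples, not data from this article.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical objects, each described by two numeric attributes.
p, q, r = (1.0, 2.0), (4.0, 6.0), (1.5, 2.5)

print(euclidean(p, q))                    # 5.0
print(manhattan(p, q))                    # 7.0
print(euclidean(p, r) < euclidean(p, q))  # True: r is more similar to p than q is
```

A smaller distance means higher similarity, so `p` and `r` are better candidates for the same cluster than `p` and `q`.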

The purpose of cluster analysis is to find the best clusters in an unlabelled dataset. The working assumption is that if two objects have similar characteristics, they belong to the same group.

Requirements for Cluster Analysis

  • Scalability: Real-world datasets are often very large, so a clustering algorithm should work well on large datasets.
Source: Scalability and Fast Transaction Times Are Crucial for Mass Adoption of Crypto
  • Ability to deal with different types of attributes: A real-life dataset contains different types of data, such as nominal, ordinal and binary attributes, as well as more complex types such as graphs, images and documents. A clustering algorithm should be able to group these types of data.
Different types of attributes (Source: Machine Learning For Beginners - Towards Data Science, by Divyansh Dwivedi)
  • Discovery of clusters with arbitrary shape: Many clustering algorithms are designed to find spherical clusters, but clusters can have any shape and may be defined by density rather than by distance to a center. An algorithm should therefore be able to find clusters of different shapes, including density-based ones.
Different types of clusters (Source: DBSCAN part 2, Machine learning tv)
  • Ability to deal with noisy data: Datasets may contain wrong, noisy or unknown values, or outliers. If the algorithm cannot detect them, it may form the wrong clusters or clusters of poor quality, so it should be able to handle such data.
  • Capability of clustering high-dimensional data: A dataset can contain many dimensions or attributes. Most clustering algorithms handle low-dimensional data well, such as datasets with only two or three dimensions, but finding clusters of data objects in a high-dimensional space is challenging.
High-dimensional data (Source: 3d scatterplot, Demonstration of a basic scatterplot in 3D.)
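
To make the noisy-data requirement above more concrete, the following sketch compares how a single outlier affects two possible cluster representatives: the mean and the medoid (the actual data value closest to all the others). The readings are hypothetical, and the one-dimensional setting is chosen only to keep the arithmetic easy to follow.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def medoid(values):
    """The data value with the smallest total distance to all other values."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

clean = [10, 11, 12, 13, 14]
noisy = clean + [100]  # hypothetical erroneous reading added to the same group

print(mean(clean), medoid(clean))  # 12.0 12
print(mean(noisy), medoid(noisy))  # the mean jumps to about 26.67, the medoid stays at 12
```

The mean is dragged far away from the bulk of the data by one bad value, which is one way an undetected outlier can degrade cluster quality.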

Clustering methods can be compared using the following aspects:

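  • Partitioning criteria: whether all clusters sit at a single level or are organized into a hierarchy
  • Separation of clusters: whether clusters are mutually exclusive or an object may belong to more than one cluster
  • Similarity measure: whether similarity is based on distance or on density and connectivity
  • Clustering space: whether clusters are sought in the full attribute space or in subspaces of it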

Basic Clustering Methods

The fundamental clustering methods can be classified into four categories.

Partitioning Methods

Partitioning methods are distance-based. They find only mutually exclusive clusters of spherical shape, use a mean or a medoid to represent each cluster center, and are effective for small datasets.

Partitioning method (Source: Clustering and Classification methods for Biologists)
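
As one possible illustration of a partitioning method, here is a minimal sketch of k-means (Lloyd's algorithm), which uses the mean as the cluster center. The article does not name a specific algorithm, so the choice of k-means, the function name `kmeans`, the sample points and the initial centers are all assumptions made for this example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's k-means: assign points to the nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins exactly one cluster (mutually exclusive).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for members, old in zip(clusters, centers):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its previous center
        if new_centers == centers:       # no center moved: converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two compact groups, and hand-picked starting centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (8, 9), (9, 8)]
```

The final centers land at roughly (1.33, 1.33) and (8.33, 8.33). Because every point is assigned to its single nearest center, the clusters are mutually exclusive and tend toward a spherical shape.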

Hierarchical Methods

Hierarchical methods create a hierarchical decomposition (i.e., multiple levels) of a dataset. Once a merge or split has been made it cannot be undone, so erroneous merges or splits cannot be corrected.

Hierarchical clustering method (Source: EDUCBA, Hierarchical Clustering Analysis)
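
The bottom-up (agglomerative) variant of hierarchical clustering can be sketched as follows. This is a simplified example that assumes single linkage, meaning the distance between two clusters is the distance between their closest members. The function name `single_linkage` and the sample points are hypothetical.

```python
import math

def single_linkage(points):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []                       # one entry per level of the hierarchy
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # A merge is final: later steps build on it and never undo it.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for left, right, dist in single_linkage(points):
    print(left, "+", right, "at distance", round(dist, 2))
```

Each printed line is one level of the hierarchy. Since the loop only ever merges, a poor early merge stays in every later level, which is the limitation described above.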

Density Based Methods

Density-based methods can find arbitrarily shaped clusters. Clusters are dense regions of objects in space, separated by low-density regions. These methods can also detect outliers.

Density based clusters (Source: 1 Concepts of density-based clustering)
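
A compact sketch in the style of DBSCAN, a well-known density-based algorithm, shows how dense regions become clusters and isolated points become outliers. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region), the sample data, and the brute-force neighbor search are simplifying assumptions for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it lies in a low-density region."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force search: every point within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE           # not dense enough to start a cluster
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:    # j is itself dense: keep expanding from it
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Hypothetical data: an elongated chain, a compact group, and one isolated point.
points = [(x, 0) for x in range(6)] + [(10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 0, 0, 1, 1, 1, -1]
```

The chain is recovered as a single cluster even though it is not spherical, and the isolated point at (50, 50) is labelled as noise.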

Grid Based Methods

Grid-based methods quantize the object space into a finite number of cells that form a grid structure. Processing is fast because it typically depends on the number of cells rather than on the number of data objects.
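
The quantization step can be sketched as below: each point is mapped to the cell that contains it, and cells holding enough points are treated as dense. The helper names `grid_cells` and `dense_cells`, the `cell_size` and `min_count` parameters, and the sample points are assumptions for this example and do not correspond to any particular grid-based algorithm.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize 2-D points into square cells and count the points in each cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

def dense_cells(points, cell_size, min_count):
    """Return the set of cells that hold at least min_count points."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}

# Hypothetical data: three points in one cell, two in another, one on its own.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (5.1, 5.2), (5.7, 5.5), (9.9, 0.1)]
print(grid_cells(points, cell_size=1.0)[(0, 0)])           # 3
print(dense_cells(points, cell_size=1.0, min_count=2))     # the cells (0, 0) and (5, 5)
```

After a single pass over the data to fill the grid, later steps work on the cells rather than on the individual points, which is where the speed advantage comes from.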


From this analysis, we can conclude that an algorithm is a good candidate for grouping a dataset in the best way when it can find mutually exclusive clusters, find arbitrarily shaped clusters, find clusters in a high-dimensional data space, or detect noise and outliers. The algorithm should also keep processing and computation time low.