Clustering is the process of grouping similar objects into the same cluster and dissimilar objects into different clusters, so that objects within a cluster are similar to each other (i.e., have high similarity) but dissimilar to objects in other clusters. Similarity and dissimilarity are assessed from the attribute values that describe the objects, typically using distance measures.
The purpose of cluster analysis is to find the best grouping in a dataset whose structure is not known in advance. If two objects have similar characteristics, they are likely to belong to the same group.
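As a small illustration of the distance measures mentioned above, the sketch below computes two common ones, Euclidean and Manhattan distance, for objects described by numeric attributes. The sample objects `p`, `q`, and `r` are hypothetical, and other measures may suit nominal or mixed attributes better.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute attribute differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical objects, each described by two numeric attributes
p, q, r = (1.0, 2.0), (4.0, 6.0), (1.5, 2.5)

d_pq = euclidean(p, q)   # 5.0
d_pr = euclidean(p, r)   # about 0.707
m_pq = manhattan(p, q)   # 7.0

# A smaller distance means higher similarity: p is closer to r than to q
print(d_pq, d_pr, m_pq)
```

Here `p` and `r` would be candidates for the same cluster, while `q` lies comparatively far from both.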
Requirements for Cluster Analysis
- Scalability: Real-world datasets are often very large, so a clustering algorithm should work well on large datasets.
- Ability to deal with different types of attributes: A real-life dataset can contain different types of data, such as nominal, ordinal, and binary attributes, as well as more complex types such as graphs, images, and documents. A clustering algorithm should be able to group these types of data.
- Discovery of clusters with arbitrary shape: Many clustering algorithms are designed to find spherical clusters, but clusters can have any shape or may be defined by differences in density. An algorithm should therefore be able to find clusters of different shapes or clusters based on density.
- Ability to deal with noisy data: A dataset may contain erroneous, noisy, or unknown values, or outliers. If the algorithm cannot detect such data, it may produce incorrect or poor-quality clusters. An algorithm should therefore be able to handle these situations.
- Capability of clustering high-dimensional data: A dataset can contain many dimensions or attributes. Most clustering algorithms are good at handling low-dimensional data, such as datasets with only two or three dimensions. Finding clusters of data objects in a high-dimensional space is challenging.
Clustering methods can be compared using the following aspects:
- Partitioning criteria
- Separation of clusters
- Similarity measure
- Clustering space
Basic Clustering Methods
The fundamental clustering methods can be classified into four categories: partitioning, hierarchical, density-based, and grid-based methods.
Partitioning Methods
Partitioning methods are distance-based. They find only mutually exclusive clusters of spherical shape, use a mean or medoid to represent the cluster center, and are effective for small datasets.
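To make the mean-as-cluster-center idea concrete, here is a minimal sketch of k-means, one well-known distance-based method of this kind. It is an illustration under stated assumptions, not a production implementation: the function name `kmeans`, the sample points, and the choice to seed the centers with the first `k` points (so that the run is deterministic) are all assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=100):
    """Minimal k-means sketch. Returns (centroids, labels)."""
    # Assumption: use the first k points as initial centers for determinism.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins exactly one cluster (the nearest center)
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # converged: centers no longer move
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two compact, well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because every point is assigned to its single nearest center, the resulting clusters are mutually exclusive, and because membership depends only on distance to a center, the clusters tend toward spherical shapes.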
Hierarchical Methods
Hierarchical methods create a hierarchical decomposition (i.e., multiple levels) of a dataset. Once a merge or split has been made it cannot be undone, so these methods cannot correct erroneous merges or splits.
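The following sketch shows one way a hierarchical decomposition can be built bottom-up (agglomerative clustering), here using single linkage as the merge criterion. The function name `single_linkage`, the linkage choice, and the one-dimensional sample data are assumptions for illustration only. Note that each merge is recorded and never revisited, which is why an erroneous merge cannot be corrected later.

```python
import math

def single_linkage(points, num_clusters):
    """Agglomerative sketch: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # merge history, one entry per level of the hierarchy
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # the merge is permanent
        del clusters[b]
    return clusters, merges

# Hypothetical one-dimensional data, stored as 1-tuples
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = single_linkage(points, num_clusters=2)
print(clusters)  # [[0, 1, 2, 3], [4]]
```

The `merges` list records the levels of the hierarchy in the order they were formed; stopping at a different `num_clusters` corresponds to cutting the hierarchy at a different level.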
Density-Based Methods
Density-based methods can find arbitrarily shaped clusters. Clusters are dense regions of objects in space that are separated by low-density regions. These methods can also detect outliers.
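As an illustration of how dense regions and outliers can be identified, below is a minimal sketch in the style of DBSCAN, a widely used density-based algorithm. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense point), the sample data, and the brute-force neighbor search are assumptions chosen to keep the example short.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE (-1) for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force search; includes the point itself
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # dense point: keep growing the cluster
        cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Clusters grow by following chains of dense points, so their shape is not constrained, and points in low-density regions are left labeled as noise.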
Grid-Based Methods
Grid-based methods quantize the object space into a finite number of cells that form a grid structure. They offer fast processing time.
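The quantization step can be sketched as follows: each object is mapped to the cell that contains it, and later work operates on per-cell counts rather than on individual objects, which is one reason these methods can be fast. The function names `grid_cells` and `dense_cells`, the uniform `cell_size`, and the `min_count` threshold are assumptions made for this example, not part of any specific grid-based algorithm.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Map each point to the integer coordinates of the grid cell containing it."""
    return [tuple(int(x // cell_size) for x in p) for p in points]

def dense_cells(points, cell_size, min_count):
    """Return the set of cells holding at least min_count points."""
    counts = Counter(grid_cells(points, cell_size))
    return {cell for cell, n in counts.items() if n >= min_count}

# Hypothetical two-dimensional data
points = [(0.2, 0.3), (0.8, 0.9), (0.5, 0.1),
          (3.1, 3.7), (3.9, 3.2),
          (7.5, 0.4)]
cells = grid_cells(points, cell_size=1.0)
dense = dense_cells(points, cell_size=1.0, min_count=2)
print(sorted(dense))  # [(0, 0), (3, 3)]
```

A full grid-based method would typically go on to connect adjacent dense cells into clusters; that step is omitted here.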
From this analysis, we can conclude that an algorithm that can find mutually exclusive clusters, find arbitrarily shaped clusters, find clusters in a high-dimensional data space, or detect noise and outliers is a good candidate for grouping a dataset well. The algorithm should also require little processing and computation time.