# K-means Clustering: A Basic Overview

K-means clustering is a partitioning method: it partitions the n objects (data points) into k clusters.

## Introduction

The simplest and most fundamental form of cluster analysis is the partitioning method, which organizes the objects (i.e. data points) into several mutually exclusive groups, or clusters. The K-means algorithm uses this partitioning approach to cluster unlabeled data: it partitions the n objects into k clusters. It is an unsupervised machine learning algorithm.

Let us walk through an example on a small dataset. We will group the unlabeled data points into clusters, taking k = 3, i.e. grouping the points into three clusters.

The K-means algorithm uses a centroid-based partitioning technique, where each centroid is simply the mean of the objects (data points) assigned to its cluster.
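For instance, a cluster's centroid is just the coordinate-wise mean of its points. A minimal NumPy sketch, using three made-up points:

```python
import numpy as np

# Three hypothetical points belonging to one cluster
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# The centroid is the mean along each coordinate axis
centroid = points.mean(axis=0)
print(centroid)  # [3. 4.]
```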

An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. K-means uses the sum of squared errors (SSE) between every object and the centroid of its cluster.
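As a sketch, the SSE objective can be computed with NumPy as follows (the `points`, `labels`, and `centroids` arrays here are illustrative, not the dataset used later):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    diffs = points - centroids[labels]   # vector from each point's centroid to the point
    return float(np.sum(diffs ** 2))     # sum of squared errors over all points

# Tiny illustrative example with two clusters
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [10.0, 0.0]])
print(sse(points, labels, centroids))  # 2.0
```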

## How does the k-means algorithm work?

Step 1: First, randomly select k of the objects as initial centroids; each one initially represents a cluster mean, or center. Here we randomly select 3 centroids, represented by red, blue and green cross symbols.

Step 2: For each of the remaining objects, compute the Euclidean distance between the data point and each centroid, and assign the point to its closest center. Each object is thus assigned to the cluster to which it is most similar.

Here the points nearest to the red centroid are assigned to the red cluster, and so on.

Step 3: For each cluster, compute the new mean by averaging all the points assigned to it. All the data points are then reassigned using the updated means.

Step 4: Repeat Steps 2 and 3 until none of the cluster assignments change, i.e. until the assignment is stable.

So, we get the final clusters.
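The four steps above can be sketched as one compact NumPy function. This is a minimal illustration on a toy array, separate from the step-by-step pandas walkthrough in the Code section:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated groups of points
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(pts, k=2)
print(labels)
```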

## How to choose the value of k (i.e. the number of clusters)?

We run the algorithm with different values of k and plot each value against its sum of squared errors (SSE). Using the elbow method, we choose the value of k at the elbow point, where the SSE stops decreasing sharply.

Here k = 3 is the optimal value, since it is where the elbow occurs.
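As a sketch, the elbow curve can be produced by running k-means for several values of k and recording the SSE each time. This uses a simple self-contained NumPy implementation on the article's dataset; in practice scikit-learn's `KMeans` with its `inertia_` attribute is the usual shortcut:

```python
import numpy as np

def kmeans_sse(points, k, n_iter=100, seed=0):
    """Run a basic k-means and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute means, keeping the old centroid if a cluster went empty
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return float(np.sum((points - centroids[labels]) ** 2))

points = np.array([[39, 12], [36, 20], [30, 28], [52, 18], [54, 29], [46, 33],
                   [55, 24], [59, 45], [63, 45], [70, 52], [66, 51], [63, 52],
                   [58, 55], [23, 53], [14, 55], [8, 61], [19, 64], [7, 69],
                   [24, 72]], dtype=float)

ks = range(1, 8)
sse_values = [kmeans_sse(points, k) for k in ks]
print(list(zip(ks, [round(s) for s in sse_values])))
# Plotting k against SSE shows the "elbow":
# import matplotlib.pyplot as plt; plt.plot(list(ks), sse_values, 'o-'); plt.show()
```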

## Code

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create the dataset
df = pd.DataFrame({
    'x': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24],
    'y': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
})

np.random.seed(200)

# Number of clusters
k = 3

# Randomly select the initial centroids
centroids = {
    i + 1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}

# Plot the data points and the initial centroids
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
```

Here we create the dataset as a pandas DataFrame, randomly choose the centroids, and plot both in a figure.

```python
# assignment() takes the dataset and the centroids as parameters,
# computes the distance between each data point and every centroid,
# and assigns each point to its closest centroid.
def assignment(df, centroids):
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0]) ** 2
                + (df['y'] - centroids[i][1]) ** 2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

# Call the function; df.head() would display the first 5 rows
df = assignment(df, centroids)

# After the assignment, plot the points again, coloured by cluster
plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
```

To calculate the Euclidean distance between the data points and the centroids, we create a function called `assignment`, which also assigns each point to its closest centroid.

```python
# To update (recalculate) the centroids we create another function, update()
import copy

old_centroids = copy.deepcopy(centroids)

def update(centroids):
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return centroids

centroids = update(centroids)

# Plot the updated centroids and the data points again, with arrows
# showing how each centroid moved
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=1, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3,
             fc=colmap[i], ec=colmap[i])
plt.show()
```

Here, we calculate the mean of each newly formed cluster and update the centroids accordingly.

```python
# Repeat the assignment step with the updated centroids
df = assignment(df, centroids)

# Plot the results
plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
```
```python
# Keep updating until the cluster assignments no longer change
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break

plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
```

When there are no more changes in the cluster assignments, the loop stops and we have the final clusters.

## Conclusion

Although the K-means algorithm is simple, easy to understand, and computationally cheap, it has several limitations: it cannot find arbitrarily shaped clusters, it does not work efficiently on very large datasets, and it is vulnerable to outliers and noise.