Introduction

The simplest and most fundamental form of cluster analysis is the partitioning method, which organizes the objects (i.e. data points) into several mutually exclusive groups or clusters. The K-means clustering algorithm uses this partitioning approach to cluster unlabeled data: it groups the n objects, or data points, into k clusters. It is an unsupervised machine learning algorithm.

Let me explain this with an example on a small dataset.

An example of unlabeled data points

We are going to group the unlabeled data above into different clusters. Let's take k=3, i.e. we group these data points into three clusters.

The K-means algorithm uses a centroid-based partitioning technique, where the centroid is simply the mean of the objects (or data points) in a cluster.

An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. This function is the sum of squared errors (SSE) between all objects and their cluster centroids.
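Written out in the standard form (with C_i denoting the i-th cluster and c_i its centroid), this objective is:

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2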

How does the k-means algorithm work?

Step 1: First, it randomly selects k of the objects as centroids, each of which initially represents a cluster mean or center.

Randomly select 3 centroids, represented by red, blue and green cross symbols

Step 2: Now, each of the remaining objects is assigned to its closest center, based on the Euclidean distance between the data point and the centroids. So each object is assigned to the cluster to which it is most similar.

Assign the data points to the closest centroids

Here the points nearest to the red centroid are assigned to the red cluster, and so on.

Step 3: Now, for each cluster, the algorithm computes a new mean by taking the average of all the points assigned to that cluster. All the data points are then reassigned using the updated means.

Plot the new cluster centers
Reassign the data points using the new cluster centers (or centroids)

Step 4: Repeat Steps 2 and 3 until none of the cluster assignments change, i.e. until the assignments are stable.

The final mutually exclusive clusters

So, we get the final clusters.
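Before stepping through the full walkthrough in the Code section, here is the whole procedure condensed into a minimal NumPy sketch. It assumes the data comes as an (n, 2) array X and that no cluster ever ends up empty; the function name kmeans and its parameters are illustrative, not part of the original code.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    #Step 1: pick k of the objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        #Step 2: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        #Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        #Step 4: stop once the centroids (and hence the assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Calling kmeans(df[['x', 'y']].to_numpy(), 3) on the dataset built below should produce a grouping like the one pictured above.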

How to choose the value of k (i.e. the number of clusters)?

We run the algorithm for different values of k and plot each value against its Sum of Squared Errors (SSE). Using the elbow method, we choose the value of k at the elbow point, beyond which increasing k yields only a small decrease in SSE.

Elbow point example (Source: How-can-we-choose-a-good-K-for-K-means-clustering - Quora)

Here k=3 is the optimal value of k, the one that clusters these data points best.
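As a rough sketch of how such a curve can be produced (assuming scikit-learn is available; df stands for the two-column DataFrame built in the Code section below):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sse = []
for k in range(1, 10):
    #inertia_ is scikit-learn's name for the SSE of the fitted model
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df[['x', 'y']])
    sse.append(km.inertia_)

plt.plot(range(1, 10), sse, marker='x')
plt.xlabel('k')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.show()

The elbow is then read off the resulting plot.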

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#creating dataset
df = pd.DataFrame({
    'x': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24],
    'y': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
})

np.random.seed(200)

#number of clusters
k = 3

#randomly select centroids
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}

#plot in figure
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()

Here we create the dataset as a pandas DataFrame, randomly choose the centroids, and plot both in a figure.

#Create a function that takes the dataset and the centroids as parameters
#and calculates the distance between each data point and every centroid
def assignment(df, centroids):
    for i in centroids.keys():
        #Formula of Euclidean Distance
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0]) ** 2
                + (df['y'] - centroids[i][1]) ** 2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    #extract the centroid id from the distance column name
    df['closest'] = df['closest'].map(lambda x: int(x.replace('distance_from_', '')))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df
#call the function
df = assignment(df, centroids)
#head() displays the first 5 rows
print(df.head())

#after assignment, plot the colored data points together with the centroids
plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()

To calculate the Euclidean distance between the data points and the centroids, we create a function called 'assignment', which also assigns each point to its closest centroid.

#To update or recalculate the centroids we create another function 'update'
import copy

old_centroids = copy.deepcopy(centroids)

def update(centroids):
    #move each centroid to the mean of the points currently assigned to it
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return centroids

centroids = update(centroids)

#Again plot the updated centroids and the data points    
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=1, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
#draw an arrow from each old centroid towards its new position
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()

Here, we compute the mean of each newly formed cluster and move the centroids to those means; the arrows indicate how far each centroid moved.

#reassign the data points to the updated centroids
df = assignment(df, centroids)

# Plot results

plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
#Repeat assignment and update until the cluster assignments stop changing
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break

plt.scatter(df['x'], df['y'], color=df['color'], edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], marker='x', s=80)
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()

If there are no changes in the cluster assignments after an update, the loop stops and we have the final clusters.

Conclusion

Although the K-means algorithm is simple, easy to understand, and computationally cheap, it has several limitations: it cannot find arbitrarily shaped clusters, it does not work efficiently on very large datasets, and it is vulnerable to outliers and noise.