K-Means Cluster Analysis

Introduction

Cluster analysis is a data reduction technique. Other data reduction methods, such as Principal Component Analysis, reduce the data column-wise, i.e., they reduce the number of variables, whereas cluster analysis reduces the number of observations. Cluster analysis shares some similarities with Discriminant Analysis, since both classify observations, but there is a key difference: Discriminant Analysis allocates an object to a known population based on prior information, while cluster analysis identifies homogeneous groups without any prior assumptions about group membership or structure.

Cluster analysis is used to identify groups or subsets of closely associated individuals based on similarities in recorded observations. Objects that share similar characteristics are grouped, while dissimilar ones are placed in different groups.

K-Means Clustering

The K-means clustering algorithm (a non-hierarchical method) partitions a dataset into a desired number of groups. The goal is to divide the dataset into k clusters, each represented by a centroid that minimizes the total distance between itself and the individual points assigned to that cluster.

Why Use K-Means Clustering?

This method is suitable for large datasets and is often preferred over hierarchical methods because it allows subjects to move between clusters as the centroids are refined. The process begins by specifying the desired number of clusters and then iteratively refines the centroids and cluster assignments until the grouping stabilizes.

K-Means Clustering Algorithm

  1. Decide the number of clusters, k.
  2. Choose k initial centroids (for example, k randomly selected data points).
  3. For each data point, calculate its distance to every centroid and assign the point to the nearest one.
  4. Update each centroid to the mean of the points assigned to it.
  5. Repeat steps 3 and 4 until the centroids no longer change.
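The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the article's own implementation; it assumes the data is a NumPy array of shape (n_points, n_features) and that no cluster becomes empty during iteration:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: use k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (this sketch assumes no cluster goes empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because the initial centroids are random, K-means converges to a local optimum; running it several times with different seeds and keeping the best result is common practice.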

Choosing the Optimal Number of Clusters

The Elbow Method is used to determine the best number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters and selecting the "elbow" point, where adding more clusters does not significantly reduce WSS.

Steps:

  1. Run the clustering algorithm for different values of k (e.g., 1 to 10).
  2. Calculate the WSS for each value of k.
  3. Plot WSS versus k and find the elbow point.
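As a sketch of this procedure, the following Python functions compute the WSS for each candidate k using a basic K-means pass with a few random restarts. The function names and the restart count are illustrative, and the sketch again assumes no cluster goes empty:

```python
import numpy as np

def wss_for_k(points, k, rng, n_init=5, max_iter=100):
    """Best (lowest) within-cluster sum of squares over several random restarts."""
    best = np.inf
    for _ in range(n_init):
        # random initial centroids drawn from the data points
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        # WSS: squared distance from each point to its assigned centroid
        wss = ((points - centroids[labels]) ** 2).sum()
        best = min(best, wss)
    return best

def wss_curve(points, k_values, seed=0):
    """WSS for each candidate k; plot this against k to find the elbow."""
    rng = np.random.default_rng(seed)
    return [wss_for_k(points, k, rng) for k in k_values]
```

Plotting the returned list against the candidate k values (e.g., with matplotlib) gives the elbow plot described above.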

Data Preparation

For K-means clustering, the data should be arranged with rows representing observations and columns representing variables. Any missing values must be removed or estimated before the analysis.
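For illustration, one simple way to remove observations with missing values (listwise deletion) in NumPy; the data values here are hypothetical, with np.nan marking a missing entry:

```python
import numpy as np

# Hypothetical data matrix: rows = observations, columns = variables,
# with np.nan marking a missing value
data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, np.nan, 1.4, 0.2],  # observation with a missing value
    [4.7, 3.2, 1.3, 0.2],
])

# Listwise deletion: keep only rows with no missing values
clean = data[~np.isnan(data).any(axis=1)]
```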

Testing the Module with Iris Dataset

The algorithm was tested using the Iris dataset, which contains four features (sepal length, sepal width, petal length, and petal width) for 150 samples drawn from three species of Iris. These features were used to group the samples into clusters. The measurements are listed below:

5.10	3.50	1.40	0.20
4.90	3.00	1.40	0.20
4.70	3.20	1.30	0.20
4.60	3.10	1.50	0.20
5.00	3.60	1.40	0.20
5.40	3.90	1.70	0.40
4.60	3.40	1.40	0.30
5.00	3.40	1.50	0.20
4.40	2.90	1.40	0.20
4.90	3.10	1.50	0.10
5.40	3.70	1.50	0.20
4.80	3.40	1.60	0.20
4.80	3.00	1.40	0.10
4.30	3.00	1.10	0.10
5.80	4.00	1.20	0.20
5.70	4.40	1.50	0.40
5.40	3.90	1.30	0.40
5.10	3.50	1.40	0.30
5.70	3.80	1.70	0.30
5.10	3.80	1.50	0.30
5.40	3.40	1.70	0.20
5.10	3.70	1.50	0.40
4.60	3.60	1.00	0.20
5.10	3.30	1.70	0.50
4.80	3.40	1.90	0.20
5.00	3.00	1.60	0.20
5.00	3.40	1.60	0.40
5.20	3.50	1.50	0.20
5.20	3.40	1.40	0.20
4.70	3.20	1.60	0.20
4.80	3.10	1.60	0.20
5.40	3.40	1.50	0.40
5.20	4.10	1.50	0.10
5.50	4.20	1.40	0.20
4.90	3.10	1.50	0.10
5.00	3.20	1.20	0.20
5.50	3.50	1.30	0.20
4.90	3.10	1.50	0.10
4.40	3.00	1.30	0.20
5.10	3.40	1.50	0.20
5.00	3.50	1.30	0.30
4.50	2.30	1.30	0.30
4.40	3.20	1.30	0.20
5.00	3.50	1.60	0.60
5.10	3.80	1.90	0.40
4.80	3.00	1.40	0.30
5.10	3.80	1.60	0.20
4.60	3.20	1.40	0.20
5.30	3.70	1.50	0.20
5.00	3.30	1.40	0.20
7.00	3.20	4.70	1.40
6.40	3.20	4.50	1.50
6.90	3.10	4.90	1.50
5.50	2.30	4.00	1.30
6.50	2.80	4.60	1.50
5.70	2.80	4.50	1.30
6.30	3.30	4.70	1.60
4.90	2.40	3.30	1.00
6.60	2.90	4.60	1.30
5.20	2.70	3.90	1.40
5.00	2.00	3.50	1.00
5.90	3.00	4.20	1.50
6.00	2.20	4.00	1.00
6.10	2.90	4.70	1.40
5.60	2.90	3.60	1.30
6.70	3.10	4.40	1.40
5.60	3.00	4.50	1.50
5.80	2.70	4.10	1.00
6.20	2.20	4.50	1.50
5.60	2.50	3.90	1.10
5.90	3.20	4.80	1.80
6.10	2.80	4.00	1.30
6.30	2.50	4.90	1.50
6.10	2.80	4.70	1.20
6.40	2.90	4.30	1.30
6.60	3.00	4.40	1.40
6.80	2.80	4.80	1.40
6.70	3.00	5.00	1.70
6.00	2.90	4.50	1.50
5.70	2.60	3.50	1.00
5.50	2.40	3.80	1.10
5.50	2.40	3.70	1.00
5.80	2.70	3.90	1.20
6.00	2.70	5.10	1.60
5.40	3.00	4.50	1.50
6.00	3.40	4.50	1.60
6.70	3.10	4.70	1.50
6.30	2.30	4.40	1.30
5.60	3.00	4.10	1.30
5.50	2.50	4.00	1.30
5.50	2.60	4.40	1.20
6.10	3.00	4.60	1.40
5.80	2.60	4.00	1.20
5.00	2.30	3.30	1.00
5.60	2.70	4.20	1.30
5.70	3.00	4.20	1.20
5.70	2.90	4.20	1.30
6.20	2.90	4.30	1.30
5.10	2.50	3.00	1.10
5.70	2.80	4.10	1.30
6.30	3.30	6.00	2.50
5.80	2.70	5.10	1.90
7.10	3.00	5.90	2.10
6.30	2.90	5.60	1.80
6.50	3.00	5.80	2.20
7.60	3.00	6.60	2.10
4.90	2.50	4.50	1.70
7.30	2.90	6.30	1.80
6.70	2.50	5.80	1.80
7.20	3.60	6.10	2.50
6.50	3.20	5.10	2.00
6.40	2.70	5.30	1.90
6.80	3.00	5.50	2.10
5.70	2.50	5.00	2.00
5.80	2.80	5.10	2.40
6.40	3.20	5.30	2.30
6.50	3.00	5.50	1.80
7.70	3.80	6.70	2.20
7.70	2.60	6.90	2.30
6.00	2.20	5.00	1.50
6.90	3.20	5.70	2.30
5.60	2.80	4.90	2.00
7.70	2.80	6.70	2.00
6.30	2.70	4.90	1.80
6.70	3.30	5.70	2.10
7.20	3.20	6.00	1.80
6.20	2.80	4.80	1.80
6.10	3.00	4.90	1.80
6.40	2.80	5.60	2.10
7.20	3.00	5.80	1.60
7.40	2.80	6.10	1.90
7.90	3.80	6.40	2.00
6.40	2.80	5.60	2.20
6.30	2.80	5.10	1.50
6.10	2.60	5.60	1.40
7.70	3.00	6.10	2.30
6.30	3.40	5.60	2.40
6.40	3.10	5.50	1.80
6.00	3.00	4.80	1.80
6.90	3.10	5.40	2.10
6.70	3.10	5.60	2.40
6.90	3.10	5.10	2.30
5.80	2.70	5.10	1.90
6.80	3.20	5.90	2.30
6.70	3.30	5.70	2.50
6.70	3.00	5.20	2.30
6.30	2.50	5.00	1.90
6.50	3.00	5.20	2.00
6.20	3.40	5.40	2.30
5.90	3.00	5.10	1.80
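As a sketch, tab-separated measurements like those above can be loaded into the row-by-column layout described under Data Preparation; only three rows are inlined here for brevity:

```python
import io
import numpy as np

# Three of the tab-separated rows shown above
raw = "5.10\t3.50\t1.40\t0.20\n4.90\t3.00\t1.40\t0.20\n7.00\t3.20\t4.70\t1.40\n"

# np.loadtxt parses whitespace-delimited numeric text into an
# observations-by-variables matrix, the layout K-means expects
X = np.loadtxt(io.StringIO(raw))
```

The resulting array (here 3 observations by 4 variables) can be passed directly to a K-means routine with k = 3 to recover clusters corresponding to the three species.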