Cluster analysis is a data reduction technique. Other data reduction techniques, such as Principal Component Analysis, reduce the data column-wise, i.e., they reduce the number of variables, whereas cluster analysis reduces the number of observations by grouping them. It resembles Discriminant Analysis in that both classify observations, but there are key differences: Discriminant Analysis allocates an object to one of several known populations based on prior information, while cluster analysis identifies homogeneous groups without assumptions about group membership or structure.
Cluster analysis is used to identify groups or subsets of closely associated individuals based on similarities in recorded observations. Objects that share similar characteristics are grouped, while dissimilar ones are placed in different groups.
The K-means clustering algorithm (a non-hierarchical method) partitions a dataset into a specified number of groups. The goal is to divide the dataset into k clusters, each represented by a centroid (the mean of its points), so that the total squared distance between the points and the centroid of their cluster is as small as possible.
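The goal described above is commonly written as minimizing the total within-cluster sum of squares. The notation here is an assumption introduced for illustration: C_j is the set of points in cluster j, mu_j is its centroid, and the distance is taken to be Euclidean.

```latex
\[
\mathrm{WSS} \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
\]
```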
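This method is suitable for large datasets and is often preferred over hierarchical methods because it allows observations to move between clusters as the centroids are updated. The process begins by specifying the desired number of clusters and choosing initial centroids; it then alternates between assigning each observation to its nearest centroid and recomputing the centroids until the assignments stop changing. The result is a local optimum that can depend on the initial centroids.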
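The iterative refinement can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used for the analysis: the function names, the random choice of initial centroids from the data, the `seed` and `max_iter` parameters, and the use of squared Euclidean distance are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(data, k, max_iter=100, seed=0):
    """Assign each row of `data` to one of k clusters.

    Returns (labels, centroids). Initial centroids are k distinct
    rows drawn at random (an assumption; other schemes exist).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:
            break  # no observation changed cluster, so stop
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids
```

Calling `kmeans(rows, k=3)` on a list of four-value rows returns one cluster label per row. Because the outcome can depend on the starting centroids, running it with several seeds and keeping the grouping with the smallest within-cluster sum of squares is a common precaution.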
The Elbow Method is used to choose the number of clusters: the total within-cluster sum of squares (WSS) is plotted against the number of clusters, and the "elbow" point is selected, beyond which adding more clusters no longer reduces WSS substantially.
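To make the WSS curve concrete, the sketch below computes WSS for hand-made partitions of six one-dimensional toy points. The data, the partitions, and the names `wss` and `wss_by_k` are hypothetical; in practice each partition would come from a K-means run with that k.

```python
def wss(clusters):
    """Total within-cluster sum of squares.

    `clusters` is a list of clusters, each a non-empty list of points.
    """
    total = 0.0
    for points in clusters:
        centroid = [sum(col) / len(points) for col in zip(*points)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in points
        )
    return total


# Hypothetical partitions of the same six points for k = 1, 2, 3.
partitions = {
    1: [[[1.0], [2.0], [3.0], [11.0], [12.0], [13.0]]],
    2: [[[1.0], [2.0], [3.0]], [[11.0], [12.0], [13.0]]],
    3: [[[1.0], [2.0]], [[3.0]], [[11.0], [12.0], [13.0]]],
}

wss_by_k = {k: wss(clusters) for k, clusters in partitions.items()}
print(wss_by_k)  # {1: 154.0, 2: 4.0, 3: 2.5}
```

Going from one cluster to two cuts WSS from 154.0 to 4.0, while a third cluster only lowers it to 2.5, so the elbow of this toy curve is at k = 2.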
For K-means clustering, data should be arranged with rows representing observations and columns representing variables. Any missing values must be removed or estimated before analysis.
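Both ways of handling missing values can be sketched in plain Python. The helper names are hypothetical, missing entries are assumed to be marked as `None` or NaN, and replacing them with the column mean is only one possible way to estimate them.

```python
import math


def is_missing(value):
    """True for None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows):
    """Remove every observation (row) that has a missing value."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


def impute_column_means(rows):
    """Replace each missing value with the mean of its column.

    Assumes every column has at least one observed value.
    """
    means = []
    for col in zip(*rows):
        present = [v for v in col if not is_missing(v)]
        means.append(sum(present) / len(present))
    return [
        [means[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Rows are observations, columns are variables.
raw = [[1.0, 2.0], [None, 4.0], [3.0, float("nan")]]
print(drop_incomplete(raw))      # [[1.0, 2.0]]
print(impute_column_means(raw))  # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

Dropping rows is the simpler choice when few observations are affected; imputing keeps the sample size but adds estimated values that influence the centroids.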
The algorithm was tested on the Iris dataset, which contains four features (sepal length, sepal width, petal length, and petal width) for 150 samples from three species of Iris. These features were used to group the samples into clusters. The measurements are listed below, four values per sample in the order sepal length, sepal width, petal length, petal width:
5.10 3.50 1.40 0.20
4.90 3.00 1.40 0.20
4.70 3.20 1.30 0.20
4.60 3.10 1.50 0.20
5.00 3.60 1.40 0.20
5.40 3.90 1.70 0.40
4.60 3.40 1.40 0.30
5.00 3.40 1.50 0.20
4.40 2.90 1.40 0.20
4.90 3.10 1.50 0.10
5.40 3.70 1.50 0.20
4.80 3.40 1.60 0.20
4.80 3.00 1.40 0.10
4.30 3.00 1.10 0.10
5.80 4.00 1.20 0.20
5.70 4.40 1.50 0.40
5.40 3.90 1.30 0.40
5.10 3.50 1.40 0.30
5.70 3.80 1.70 0.30
5.10 3.80 1.50 0.30
5.40 3.40 1.70 0.20
5.10 3.70 1.50 0.40
4.60 3.60 1.00 0.20
5.10 3.30 1.70 0.50
4.80 3.40 1.90 0.20
5.00 3.00 1.60 0.20
5.00 3.40 1.60 0.40
5.20 3.50 1.50 0.20
5.20 3.40 1.40 0.20
4.70 3.20 1.60 0.20
4.80 3.10 1.60 0.20
5.40 3.40 1.50 0.40
5.20 4.10 1.50 0.10
5.50 4.20 1.40 0.20
4.90 3.10 1.50 0.10
5.00 3.20 1.20 0.20
5.50 3.50 1.30 0.20
4.90 3.10 1.50 0.10
4.40 3.00 1.30 0.20
5.10 3.40 1.50 0.20
5.00 3.50 1.30 0.30
4.50 2.30 1.30 0.30
4.40 3.20 1.30 0.20
5.00 3.50 1.60 0.60
5.10 3.80 1.90 0.40
4.80 3.00 1.40 0.30
5.10 3.80 1.60 0.20
4.60 3.20 1.40 0.20
5.30 3.70 1.50 0.20
5.00 3.30 1.40 0.20
7.00 3.20 4.70 1.40
6.40 3.20 4.50 1.50
6.90 3.10 4.90 1.50
5.50 2.30 4.00 1.30
6.50 2.80 4.60 1.50
5.70 2.80 4.50 1.30
6.30 3.30 4.70 1.60
4.90 2.40 3.30 1.00
6.60 2.90 4.60 1.30
5.20 2.70 3.90 1.40
5.00 2.00 3.50 1.00
5.90 3.00 4.20 1.50
6.00 2.20 4.00 1.00
6.10 2.90 4.70 1.40
5.60 2.90 3.60 1.30
6.70 3.10 4.40 1.40
5.60 3.00 4.50 1.50
5.80 2.70 4.10 1.00
6.20 2.20 4.50 1.50
5.60 2.50 3.90 1.10
5.90 3.20 4.80 1.80
6.10 2.80 4.00 1.30
6.30 2.50 4.90 1.50
6.10 2.80 4.70 1.20
6.40 2.90 4.30 1.30
6.60 3.00 4.40 1.40
6.80 2.80 4.80 1.40
6.70 3.00 5.00 1.70
6.00 2.90 4.50 1.50
5.70 2.60 3.50 1.00
5.50 2.40 3.80 1.10
5.50 2.40 3.70 1.00
5.80 2.70 3.90 1.20
6.00 2.70 5.10 1.60
5.40 3.00 4.50 1.50
6.00 3.40 4.50 1.60
6.70 3.10 4.70 1.50
6.30 2.30 4.40 1.30
5.60 3.00 4.10 1.30
5.50 2.50 4.00 1.30
5.50 2.60 4.40 1.20
6.10 3.00 4.60 1.40
5.80 2.60 4.00 1.20
5.00 2.30 3.30 1.00
5.60 2.70 4.20 1.30
5.70 3.00 4.20 1.20
5.70 2.90 4.20 1.30
6.20 2.90 4.30 1.30
5.10 2.50 3.00 1.10
5.70 2.80 4.10 1.30
6.30 3.30 6.00 2.50
5.80 2.70 5.10 1.90
7.10 3.00 5.90 2.10
6.30 2.90 5.60 1.80
6.50 3.00 5.80 2.20
7.60 3.00 6.60 2.10
4.90 2.50 4.50 1.70
7.30 2.90 6.30 1.80
6.70 2.50 5.80 1.80
7.20 3.60 6.10 2.50
6.50 3.20 5.10 2.00
6.40 2.70 5.30 1.90
6.80 3.00 5.50 2.10
5.70 2.50 5.00 2.00
5.80 2.80 5.10 2.40
6.40 3.20 5.30 2.30
6.50 3.00 5.50 1.80
7.70 3.80 6.70 2.20
7.70 2.60 6.90 2.30
6.00 2.20 5.00 1.50
6.90 3.20 5.70 2.30
5.60 2.80 4.90 2.00
7.70 2.80 6.70 2.00
6.30 2.70 4.90 1.80
6.70 3.30 5.70 2.10
7.20 3.20 6.00 1.80
6.20 2.80 4.80 1.80
6.10 3.00 4.90 1.80
6.40 2.80 5.60 2.10
7.20 3.00 5.80 1.60
7.40 2.80 6.10 1.90
7.90 3.80 6.40 2.00
6.40 2.80 5.60 2.20
6.30 2.80 5.10 1.50
6.10 2.60 5.60 1.40
7.70 3.00 6.10 2.30
6.30 3.40 5.60 2.40
6.40 3.10 5.50 1.80
6.00 3.00 4.80 1.80
6.90 3.10 5.40 2.10
6.70 3.10 5.60 2.40
6.90 3.10 5.10 2.30
5.80 2.70 5.10 1.90
6.80 3.20 5.90 2.30
6.70 3.30 5.70 2.50
6.70 3.00 5.20 2.30
6.30 2.50 5.00 1.90
6.50 3.00 5.20 2.00
6.20 3.40 5.40 2.30
5.90 3.00 5.10 1.80
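Before clustering, the listing has to be turned into rows of four values. The sketch below splits on whitespace and groups the numbers four at a time, so it works whether the values arrive as one long run or one sample per line. The function name and the `n_features` parameter are hypothetical, and only the first two samples are used as input here.

```python
def parse_measurements(text, n_features=4):
    """Turn whitespace-separated numbers into rows of `n_features` values."""
    values = [float(token) for token in text.split()]
    if len(values) % n_features != 0:
        raise ValueError("number of values is not a multiple of n_features")
    return [
        values[i:i + n_features] for i in range(0, len(values), n_features)
    ]


# First two samples of the listing above.
sample_text = "5.10 3.50 1.40 0.20 4.90 3.00 1.40 0.20"
rows = parse_measurements(sample_text)
print(rows)  # [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
```

Each resulting row is one observation with the columns in the order sepal length, sepal width, petal length, petal width, which is the arrangement K-means expects.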