Finding Optimal Number of Clusters Using Heuristic Clustering Algorithms

Hanin Haqi; Tareef Kamil  Mustafa

doi:10.30526/38.2.4047

Authors

Hanin Haqi Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq https://orcid.org/0009-0003-5920-5035
Tareef Kamil Mustafa Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq https://orcid.org/0000-0001-6291-5976

DOI:

https://doi.org/10.30526/38.2.4047

Keywords:

Clustering, K-means, Heuristic, Genetic Algorithm

Abstract

The problem of estimating the number of clusters k is considered one of the major challenges for partition clustering. The k-means algorithm is a division-based clustering method where only objects are entered into a set of K, and the algorithm decides the number group initially, but there is no specific way to estimate the best number for the cluster. The outcome of the clustering process can be better after being performed in more than one attempt. Therefore, the optimal count of iterations can be identified through a steady method in advance that can determine the time and number of rounds. So, the problem is finding the optimal number (k) in an easy and fast way. The aim of this paper is to use a genetic algorithm (GA) with a new objective function to determine the best number of clusters of different types of datasets. The fitness function optimizes both the heterogeneity among clusters and homogeneity within each cluster. Utilize the gap statistic equation, the optimal number of clusters is determined in four standard datasets.

Author Biographies

Hanin Haqi, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq
Tareef Kamil Mustafa, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

References

1. Zhu, E.; Zhang, Y.; Wen, P.; Liu, F. Fast and Stable Clustering Analysis Based on Grid-Mapping K-Means Algorithm and New Clustering Validity Index. Neurocomputing 2019, 363, 149–170. https://doi.org/10.1016/j.neucom.2019.07.048

2. Abdullah, D. Determining a Cluster Centroid of Kmeans Clustering Using Genetic Algorithm. Int. J. Comput. Sci. Softw. Eng. 2015, 4, 160–164. https://doi.org/10.1155/2020/7636857

3. SUBRAMANIAN, M. Binning-Based Silhouette Approach to Find the Optimal Cluster Using K-Means. https://doi.org /10.1109/ACCESS.2022.3215568

4. Sultan Alalawi, S.J.; Mohd Shaharanee, I.N.; Mohd Jamil, J. Clustering Student Performance Data Using K-Means Algorithms. J. Comput. Innov. Anal. 2023, 2, 41–55.

https://doi.org/10.32890/jcia2023.2.1.3

5. Yang, J.; Lee, J.-Y.; Choi, M.; Joo, Y. A New Approach to Determine the Optimal Number of Clusters Based on the Gap Statistic. In Proceedings of the International Conference on Machine Learning for Networking; 2019; pp. 227–239. https://doi.org/ 10.1007/978-3-030-45778-5_15

6. Mor, M.; Gupta, P.; Sharma, P. A Genetic Algorithm Approach for Clustering. Int J Eng Comput Sci 2014, 3, 6442–6447. https://doi.org/10.1007/s40995-020-00890-8

7. El-Shorbagy, M.A.; Ayoub, A.Y.; Mousa, A.A.; El-Desoky, I.M. An Enhanced Genetic Algorithm with New Mutation for Cluster Analysis. Comput. Stat. 2019, 34, 1355–1392.

https://doi.org/10.1007/s00180-019-00871-5

8. Hruschka, E.R.; Ebecken, N.F.F. A Genetic Algorithm for Cluster Analysis. Intell. data Anal. 2003, 7, 15–25. https://doi.org/10.1109/TSMCC.2008.2007252

9. Liu, Y.; Ye, M.; Peng, J.; Wu, H. Finding the Optimal Number of Clusters Using Genetic Algorithms. In Proceedings of the 2008 IEEE Conference on Cybernetics and Intelligent Systems; 2008; pp. 1325–1330. https://doi.org/10.1109/ICCIS.2008.4670864

10. Rahman, M.A.; Islam, M.Z. A Hybrid Clustering Technique Combining a Novel Genetic Algorithm with K-Means. Knowledge-Based Syst. 2014, 71, 345–365.

https://doi.org/10.1016/j.knosys.2014.08.011

11. Roy, D.K.; Sharma, L.K. Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets. Int. J. Artif. Intell. & Appl. 2010, 1, 23–28. https://doi.org/ 10.5121/ijaia.2010.1203

12. Han, S.; Xiao, L. An Improved Adaptive Genetic Algorithm. In Proceedings of the SHS web of conferences; 2022; Vol. 140, p. 1044. https://doi.org/10.1051/shsconf/202214001044

13. Fahim, A. K and Starting Means for K-Means Algorithm. J. Comput. Sci. 2021, 55, 101445.

https://doi.org/ 10.1016/j.jocs.2021.101445

14. Nazari, Z.; Kang, D.; Asharif, M.R.; Sung, Y.; Ogawa, S. A New Hierarchical Clustering Algorithm. In Proceedings of the 2015 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS); 2015; pp. 148–152. https://doi.org/ 10.1109/UPCON.2018.8596795

15. Fluck, E. Tangles and Hierarchical Clustering. SIAM J. Discret. Math. 2024, 38, 75–92.

https://doi.org/10.1137/22M1484936

16. Moulavi, D.; Jaskowiak, P.A.; Campello, R.J.G.B.; Zimek, A.; Sander, J. Density-Based Clustering Validation. In Proceedings of the Proceedings of the 2014 SIAM international conference on data mining; 2014; pp. 839–847. https://doi.org/10.1137/1.9781611973440.96

17. Behadil, S.F.; Mhalhal, N.K. Mobility Prediction Based on Deep Learning Approach Using GPS Phone Data. Ibn AL-Haitham J. Pure Appl. Sci. 2024, 37, 423–438.

https://doi.org/10.30526/37.4.3916

18. Ren, Y.; Wang, N.; Li, M.; Xu, Z. Deep Density-Based Image Clustering. Knowledge-Based Syst. 2020, 197, 105841. https://doi.org/10.48550/arXiv.1812.04287

19. Wang, F.; Liao, F.; Li, Y.; Wang, H. A New Prediction Strategy for Dynamic Multi-Objective Optimization Using Gaussian Mixture Model. Inf. Sci. (Ny). 2021, 580, 331–351. https://doi.org/10.1016/j.ins.2021.08.065

20. Mahmoudi, M.R.; Baleanu, D.; Mansor, Z.; Tuan, B.A.; Pho, K.-H. Fuzzy Clustering Method to Compare the Spread Rate of Covid-19 in the High Risks Countries. Chaos, Solitons & Fractals 2020, 140, 110230. https://doi.org/10.1016/j.chaos.2020.110230

21. Amma NG, N.; Amma NG, B. Pclusba: A Novel Partition Clustering-Based Biometric Authentication Mechanism. IETE J. Res. 2024, 70, 467–472. https://doi.org/10.1504/IJBM.2017.084131

22. Kapil, S.; Chawla, M.; Ansari, M.D. On K-Means Data Clustering Algorithm with Genetic Algorithm. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC); 2016; pp. 202–206. https://doi.org/10.1109/PDGC.2016.7913145

23. Haldurai, L.; Madhubala, T.; Rajalakshmi, R. A Study on Genetic Algorithm and Its Applications. Int. J. Comput. Sci. Eng 2016, 4, 139–143.

24. Zeebaree, D.Q.; Haron, H.; Abdulazeez, A.M.; Zeebaree, S.R. Combination of K-Means Clustering with Genetic Algorithm: A Review. Int. J. Appl. Eng. Res. 2017, 12, 14238–14245.

25. Maulana, I.; Roestam, R. Optimizing KNN Algorithm Using Elbow Method for Predicting Voter Participation Using Fixed Voter List Data (DPT). J. Sos. Teknol. 2024, 4, 441–451. https://doi.org/10.59188/jurnalsostech.v4i7.1308

26. Vardakas, G.; Papakostas, I.; Likas, A. Deep Clustering Using the Soft Silhouette Score: Towards Compact and Well-Separated Clusters. arXiv Prepr. arXiv2402.00608 2024.

27. Layton, R.; Watters, P.; Dazeley, R. Evaluating Authorship Distance Methods Using the Positive Silhouette Coefficient. Nat. Lang. Eng. 2013, 19, 517–535. https://doi.org/ 10.1017/S1351324912000241

28. Sagala, N.T.M.; Gunawan, A.A.S. Discovering the Optimal Number of Crime Cluster Using Elbow, Silhouette, Gap Statistics, and NbClust Methods. ComTech Comput. Math. Eng. Appl. 2022, 13, 1–10. https://doi.org/ 10.21512/comtech.v13i1.7270

29. Xiao, J.; Lu, J.; Li, X. Davies Bouldin Index Based Hierarchical Initialization K-Means. Intell. Data Anal. 2017, 21, 1327–1338. https://doi.org/10.3233/IDA-163129

30. Gupta, T.; Panda, S.P. A Comparison of K-Means Clustering Algorithm and Clara Clustering Algorithm on Iris Dataset. Int. J. Eng. & Technol. 2018, 7, 4766–4768. https://doi.org/10.14419/ijet.v7i4.21472

31. Seewald, A.K. Digits-a Dataset for Handwritten Digit Recognition. Austrian Res. Inst. Artif. Intell. Tech. report, Vienna 2005, 7.

32. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A Dataset for Breast Cancer Histopathological Image Classification. Ieee Trans. Biomed. Eng. 2015, 63, 1455–1462.

https://doi.org/10.1109/TBME.2015.2496264