Finding Optimal Number of Clusters Using Heuristic Clustering Algorithms
DOI:
https://doi.org/10.30526/eaayea31Keywords:
Clustering, K-means, Heuristic, Genetic AlgorithmAbstract
The problem of estimating the number of clusters k is considered one of the major challenges for partition clustering. The k-means algorithm is a division-based clustering method where only objects are entered into a set of K, and the algorithm decides the number group initially, but there is no specific way to estimate the best number for the cluster. The outcome of the clustering process can be better after being performed in more than one attempt. Therefore, the optimal count of iterations can be identified through a steady method in advance that can determine the time and number of rounds. So, the problem is finding the optimal number (k) in an easy and fast way. The aim of this paper is to use a genetic algorithm (GA) with a new objective function to determine the best number of clusters of different types of datasets. The fitness function optimizes both the heterogeneity among clusters and homogeneity within each cluster. Utilize the gap statistic equation, the optimal number of clusters is determined in four standard datasets.
References
1. Zhu, E.; Zhang, Y.; Wen, P.; Liu, F. Fast and Stable Clustering Analysis Based on Grid-Mapping K-Means Algorithm and New Clustering Validity Index. Neurocomputing 2019, 363, 149–170. https://doi.org/10.1016/j.neucom.2019.07.048
2. Abdullah, D. Determining a Cluster Centroid of Kmeans Clustering Using Genetic Algorithm. Int. J. Comput. Sci. Softw. Eng. 2015, 4, 160–164. https://doi.org/10.1155/2020/7636857
3. SUBRAMANIAN, M. Binning-Based Silhouette Approach to Find the Optimal Cluster Using K-Means. https://doi.org /10.1109/ACCESS.2022.3215568
4. Sultan Alalawi, S.J.; Mohd Shaharanee, I.N.; Mohd Jamil, J. Clustering Student Performance Data Using K-Means Algorithms. J. Comput. Innov. Anal. 2023, 2, 41–55.
https://doi.org/10.32890/jcia2023.2.1.3
5. Yang, J.; Lee, J.-Y.; Choi, M.; Joo, Y. A New Approach to Determine the Optimal Number of Clusters Based on the Gap Statistic. In Proceedings of the International Conference on Machine Learning for Networking; 2019; pp. 227–239. https://doi.org/ 10.1007/978-3-030-45778-5_15
6. Mor, M.; Gupta, P.; Sharma, P. A Genetic Algorithm Approach for Clustering. Int J Eng Comput Sci 2014, 3, 6442–6447. https://doi.org/10.1007/s40995-020-00890-8
7. El-Shorbagy, M.A.; Ayoub, A.Y.; Mousa, A.A.; El-Desoky, I.M. An Enhanced Genetic Algorithm with New Mutation for Cluster Analysis. Comput. Stat. 2019, 34, 1355–1392.
https://doi.org/10.1007/s00180-019-00871-5
8. Hruschka, E.R.; Ebecken, N.F.F. A Genetic Algorithm for Cluster Analysis. Intell. data Anal. 2003, 7, 15–25. https://doi.org/10.1109/TSMCC.2008.2007252
9. Liu, Y.; Ye, M.; Peng, J.; Wu, H. Finding the Optimal Number of Clusters Using Genetic Algorithms. In Proceedings of the 2008 IEEE Conference on Cybernetics and Intelligent Systems; 2008; pp. 1325–1330. https://doi.org/10.1109/ICCIS.2008.4670864
10. Rahman, M.A.; Islam, M.Z. A Hybrid Clustering Technique Combining a Novel Genetic Algorithm with K-Means. Knowledge-Based Syst. 2014, 71, 345–365.
https://doi.org/10.1016/j.knosys.2014.08.011
11. Roy, D.K.; Sharma, L.K. Genetic K-Means Clustering Algorithm for Mixed Numeric and Categorical Data Sets. Int. J. Artif. Intell. & Appl. 2010, 1, 23–28. https://doi.org/ 10.5121/ijaia.2010.1203
12. Han, S.; Xiao, L. An Improved Adaptive Genetic Algorithm. In Proceedings of the SHS web of conferences; 2022; Vol. 140, p. 1044. https://doi.org/10.1051/shsconf/202214001044
13. Fahim, A. K and Starting Means for K-Means Algorithm. J. Comput. Sci. 2021, 55, 101445.
https://doi.org/ 10.1016/j.jocs.2021.101445
14. Nazari, Z.; Kang, D.; Asharif, M.R.; Sung, Y.; Ogawa, S. A New Hierarchical Clustering Algorithm. In Proceedings of the 2015 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS); 2015; pp. 148–152. https://doi.org/ 10.1109/UPCON.2018.8596795
15. Fluck, E. Tangles and Hierarchical Clustering. SIAM J. Discret. Math. 2024, 38, 75–92.
https://doi.org/10.1137/22M1484936
16. Moulavi, D.; Jaskowiak, P.A.; Campello, R.J.G.B.; Zimek, A.; Sander, J. Density-Based Clustering Validation. In Proceedings of the Proceedings of the 2014 SIAM international conference on data mining; 2014; pp. 839–847. https://doi.org/10.1137/1.9781611973440.96
17. Behadil, S.F.; Mhalhal, N.K. Mobility Prediction Based on Deep Learning Approach Using GPS Phone Data. Ibn AL-Haitham J. Pure Appl. Sci. 2024, 37, 423–438.
https://doi.org/10.30526/37.4.3916
18. Ren, Y.; Wang, N.; Li, M.; Xu, Z. Deep Density-Based Image Clustering. Knowledge-Based Syst. 2020, 197, 105841. https://doi.org/10.48550/arXiv.1812.04287
19. Wang, F.; Liao, F.; Li, Y.; Wang, H. A New Prediction Strategy for Dynamic Multi-Objective Optimization Using Gaussian Mixture Model. Inf. Sci. (Ny). 2021, 580, 331–351. https://doi.org/10.1016/j.ins.2021.08.065
20. Mahmoudi, M.R.; Baleanu, D.; Mansor, Z.; Tuan, B.A.; Pho, K.-H. Fuzzy Clustering Method to Compare the Spread Rate of Covid-19 in the High Risks Countries. Chaos, Solitons & Fractals 2020, 140, 110230. https://doi.org/10.1016/j.chaos.2020.110230
21. Amma NG, N.; Amma NG, B. Pclusba: A Novel Partition Clustering-Based Biometric Authentication Mechanism. IETE J. Res. 2024, 70, 467–472. https://doi.org/10.1504/IJBM.2017.084131
22. Kapil, S.; Chawla, M.; Ansari, M.D. On K-Means Data Clustering Algorithm with Genetic Algorithm. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC); 2016; pp. 202–206. https://doi.org/10.1109/PDGC.2016.7913145
23. Haldurai, L.; Madhubala, T.; Rajalakshmi, R. A Study on Genetic Algorithm and Its Applications. Int. J. Comput. Sci. Eng 2016, 4, 139–143.
24. Zeebaree, D.Q.; Haron, H.; Abdulazeez, A.M.; Zeebaree, S.R. Combination of K-Means Clustering with Genetic Algorithm: A Review. Int. J. Appl. Eng. Res. 2017, 12, 14238–14245.
25. Maulana, I.; Roestam, R. Optimizing KNN Algorithm Using Elbow Method for Predicting Voter Participation Using Fixed Voter List Data (DPT). J. Sos. Teknol. 2024, 4, 441–451. https://doi.org/10.59188/jurnalsostech.v4i7.1308
26. Vardakas, G.; Papakostas, I.; Likas, A. Deep Clustering Using the Soft Silhouette Score: Towards Compact and Well-Separated Clusters. arXiv Prepr. arXiv2402.00608 2024.
27. Layton, R.; Watters, P.; Dazeley, R. Evaluating Authorship Distance Methods Using the Positive Silhouette Coefficient. Nat. Lang. Eng. 2013, 19, 517–535. https://doi.org/ 10.1017/S1351324912000241
28. Sagala, N.T.M.; Gunawan, A.A.S. Discovering the Optimal Number of Crime Cluster Using Elbow, Silhouette, Gap Statistics, and NbClust Methods. ComTech Comput. Math. Eng. Appl. 2022, 13, 1–10. https://doi.org/ 10.21512/comtech.v13i1.7270
29. Xiao, J.; Lu, J.; Li, X. Davies Bouldin Index Based Hierarchical Initialization K-Means. Intell. Data Anal. 2017, 21, 1327–1338. https://doi.org/10.3233/IDA-163129
30. Gupta, T.; Panda, S.P. A Comparison of K-Means Clustering Algorithm and Clara Clustering Algorithm on Iris Dataset. Int. J. Eng. & Technol. 2018, 7, 4766–4768. https://doi.org/10.14419/ijet.v7i4.21472
31. Seewald, A.K. Digits-a Dataset for Handwritten Digit Recognition. Austrian Res. Inst. Artif. Intell. Tech. report, Vienna 2005, 7.
32. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A Dataset for Breast Cancer Histopathological Image Classification. Ieee Trans. Biomed. Eng. 2015, 63, 1455–1462.
Downloads
Published
Issue
Section
License
Copyright (c) 2025 Ibn AL-Haitham Journal For Pure and Applied Sciences

This work is licensed under a Creative Commons Attribution 4.0 International License.
licenseTerms