1DIMES, Engineering Department of Informatics Modelling Electronics and Systems Science University of Calabria, Italy
2CNR - National Research Council of Italy - Institute for High Performance Computing and Networking (ICAR), Italy
*Corresponding author:Libero Nigro, DIMES–Engineering Department of Informatics Modelling Electronics and Systems Science University of Calabria, 87036 Rende, Italy
Submission: August 08, 2023; Published: August 25, 2023
ISSN 2578-0247Volume3 Issue2
This paper proposes an algorithm, named HWK-Sets, based on K-Means, suited for clustering data which are variable-sized sets of elementary items. An example of such data occurs in the analysis of medical diagnosis, where the goal is to detect human subjects who share common diseases so as to predict future illnesses from previous medical history possibly. Clustering sets is difficult because data objects do not have numerical attributes and therefore it is not possible to use the classical Euclidean distance upon which K-Means is normally based. An adaptation of the Jaccard distance between sets is used, which exploits application-sensitive information. More in particular, the Hartigan and Wong variation of K-Means is adopted, which can favor the fast attainment of a careful solution. The HWK-Sets algorithm can flexibly use various stochastic seeding techniques. Since the difficulty of calculating a mean among the sets of a cluster, the concept of a medoid is employed as a cluster representative (centroid), which always remains a data object of the application. The paper describes the HWK-Sets clustering algorithm and outlines its current implementation in Java based on parallel streams. After that, the efficiency and accuracy of the proposed algorithm are demonstrated by applying it to 15 benchmark datasets.
Keywords: Clustering sets; Hartigan and Wong K-Means; Jaccard distance; Medoids; Seeding methods; Java parallel streams
Abbreviations: SDH: Sum of Distances to Histogram; ARI: Adjusted Rand Index; SDM: Sum of Distances to Medoids; CI: Centroids Index; SI: Silhouette Index