Crimson Publishers

Open Access Biostatistics & Bioinformatics

Adapting Hartigan & Wong K-Means for the Efficient Clustering of Sets

  • Libero Nigro1* and Franco Cicirelli2

    1DIMES, Engineering Department of Informatics, Modelling, Electronics and Systems Science, University of Calabria, Italy

    2CNR - National Research Council of Italy - Institute for High Performance Computing and Networking (ICAR), Italy

    *Corresponding author: Libero Nigro, DIMES, Engineering Department of Informatics, Modelling, Electronics and Systems Science, University of Calabria, 87036 Rende, Italy

Submission: August 08, 2023; Published: August 25, 2023

DOI: 10.31031/OABB.2023.03.000564

ISSN 2578-0247
Volume 3, Issue 2

Abstract

This paper proposes an algorithm, named HWK-Sets, based on K-Means and suited to clustering data that consist of variable-sized sets of elementary items. An example of such data occurs in the analysis of medical diagnoses, where the goal is to detect human subjects who share common diseases, possibly so as to predict future illnesses from previous medical history. Clustering sets is difficult because the data objects do not have numerical attributes, and therefore it is not possible to use the classical Euclidean distance upon which K-Means is normally based. An adaptation of the Jaccard distance between sets is used instead, which exploits application-sensitive information. In particular, the Hartigan and Wong variation of K-Means is adopted, which can favor the fast attainment of an accurate solution. The HWK-Sets algorithm can flexibly use various stochastic seeding techniques. Because of the difficulty of calculating a mean among the sets of a cluster, the concept of a medoid is employed as the cluster representative (centroid), which always remains a data object of the application. The paper describes the HWK-Sets clustering algorithm and outlines its current implementation in Java, which is based on parallel streams. The efficiency and accuracy of the proposed algorithm are then demonstrated by applying it to 15 benchmark datasets.

Keywords: Clustering sets; Hartigan and Wong K-Means; Jaccard distance; Medoids; Seeding methods; Java parallel streams

Abbreviations: SDH: Sum of Distances to Histogram; ARI: Adjusted Rand Index; SDM: Sum of Distances to Medoids; CI: Centroids Index; SI: Silhouette Index
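
To make the two central notions of the abstract concrete, the following is a minimal Java sketch of a distance between sets and of a medoid chosen as cluster representative. It rests on assumptions: it uses the plain Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, not the application-sensitive adaptation used by HWK-Sets, and it selects the medoid by exhaustive search. The class name SetClusteringSketch and the methods jaccardDistance, sumOfDistances and medoid are hypothetical names chosen for illustration; they are not taken from the paper's implementation.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical, not from the paper.
final class SetClusteringSketch {

    // Plain Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
    // Assumption: the paper uses an adapted variant, which is not reproduced here.
    static <T> double jaccardDistance(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention assumed here: two empty sets are identical
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / unionSize;
    }

    // Sum of the distances from a candidate set to every member of the cluster,
    // evaluated with a parallel stream.
    static <T> double sumOfDistances(Set<T> candidate, List<Set<T>> cluster) {
        return cluster.parallelStream()
                .mapToDouble(member -> jaccardDistance(candidate, member))
                .sum();
    }

    // Medoid: the cluster member with the smallest sum of distances to all members.
    // Exhaustive search, quadratic in the cluster size.
    static <T> Set<T> medoid(List<Set<T>> cluster) {
        return cluster.stream()
                .min(Comparator.comparingDouble((Set<T> c) -> sumOfDistances(c, cluster)))
                .orElseThrow(() -> new IllegalArgumentException("empty cluster"));
    }

    public static void main(String[] args) {
        // Made-up example data: each set lists the diagnoses of one subject.
        List<Set<String>> cluster = List.of(
                Set.of("flu", "asthma"),
                Set.of("flu", "asthma", "diabetes"),
                Set.of("flu", "diabetes"));
        System.out.println("Medoid: " + medoid(cluster));
    }
}
```

In this made-up example the three-item set is the medoid, since its distance to each of the other two sets is 1/3, whereas those two sets lie 2/3 apart from each other. Unlike a mean, the medoid is always one of the input sets, which is the property the abstract relies on.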
