A study of K-Means-based algorithms for constrained clustering

Covões, Thiago F.; Hruschka, Eduardo R.; Ghosh, Joydeep

doi:10.3233/IDA-130590

Article type: Research Article

Authors: Covões, Thiago F.^{a; b} | Hruschka, Eduardo R.^{a; b; *} | Ghosh, Joydeep^b

Affiliations: [a] University of São Paulo, São Carlos, Brazil | [b] University of Texas, Austin, TX, USA

Correspondence: [*] Corresponding author: Eduardo R. Hruschka, Computer Science Department, University of São Paulo, São Carlos (USP), Av. Trabalhador São-carlense, 400, Centro, Caixa Postal 668, CEP 13.560-970, São Carlos, SP, Brazil. Tel.: +55 (16) 3373 8182, Fax: +55 (16) 3373 9751; E-mails: [email protected]; [email protected].

Abstract: The problem of clustering with constraints has received considerable attention in the last decade. Several algorithms have been proposed, but only a few studies have compared their performance, and only partially. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant LCVQE, and Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. To provide some reassurance that the obtained results are not due to chance, outcomes of statistical significance tests are presented. In terms of accuracy, LCVQE was shown to be competitive with CVQE while violating fewer constraints. On most of the datasets, both CVQE and LCVQE were more accurate than MPCK-Means, which is capable of learning distance metrics. In this regard, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE performed better than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper; for example, deduced constraints usually do not help in finding better data partitions.
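As an illustration of the third evaluation criterion, the number of violated constraints for a hard partition could be counted roughly as in the sketch below. The function name `count_violations`, the pair-of-indices representation of must-link and cannot-link constraints, and the toy data are assumptions made for this example; they are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# A constraint is assumed to be a pair of object indices (hypothetical representation).
Pair = Tuple[int, int]


def count_violations(
    labels: Sequence[int],
    must_link: List[Pair],
    cannot_link: List[Pair],
) -> int:
    """Count pairwise constraints that a hard partition fails to satisfy."""
    # A must-link constraint is violated when the two objects get different labels.
    ml_violated = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link constraint is violated when the two objects share a label.
    cl_violated = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return ml_violated + cl_violated


# Toy partition of five objects into three clusters (made-up data).
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (1, 2)]
cannot_link = [(2, 3), (0, 4)]
violations = count_violations(labels, must_link, cannot_link)
print(violations)
```

In this toy case the must-link pair (1, 2) and the cannot-link pair (2, 3) are violated. Because the compared algorithms treat constraints as soft, a nonzero count is expected and is itself one of the reported quantities.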

Keywords: Constrained clustering, K-means, semi-supervised clustering

Journal: Intelligent Data Analysis, vol. 17, no. 3, pp. 485-505, 2013

Published: 16 May 2013
