Unsupervised active learning techniques for labeling training sets: An experimental evaluation on sequential data

Souza, Vinicius M.A.; Rossi, Rafael G.; Batista, Gustavo E.A.P.A.; Rezende, Solange O.

doi:10.3233/IDA-163075

Article type: Research Article

Authors: Souza, Vinicius M.A.^{a; *} | Rossi, Rafael G.^{a; b} | Batista, Gustavo E.A.P.A.^a | Rezende, Solange O.^a

Affiliations: [a] Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP), São Paulo, Brazil | [b] Federal University of Mato Grosso do Sul (UFMS), Mato Grosso do Sul, Brazil

Correspondence: [*] Corresponding author: Vinicius M.A. Souza, Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP), Av. Trabalhador São-carlense, 400, Centro, CEP: 13560-970, São Carlos, São Paulo, Brazil. Tel.: +55 16 3373 9700, Fax: +55 16 3373 8888; E-mail: [email protected].

Abstract: Many real-world applications, such as those involving sensors, make it possible to collect large amounts of inexpensive unlabeled sequential data. However, supervised machine learning methods assume that a considerable amount of labeled data is available to build an accurate classification model, and their use is frequently hindered by the high cost of gathering labels for such data. To overcome this bottleneck, active learning methods selectively label the most informative examples instead of requesting all true labels. Although active learning has been widely used in many problems, most methods assume the presence of labeled data or some prior knowledge about the problem, such as the number of classes. In contrast, in this paper we are interested in the realistic scenario in which active learning is performed from scratch on a fully unlabeled dataset, in the absence of any classifier or prior knowledge about the data. In general, methods that consider fully unlabeled data use random sampling to select the examples to label. The goal of this work is to present a broad experimental evaluation of different unsupervised active learning methods for selecting examples from fully unlabeled sequential data. We evaluate instance selection methods based on clustering algorithms and on graph centrality measures, and we assess the performance of supervised and semi-supervised learning algorithms in the subsequent classification task. Based on our evaluation on a benchmark of sequential data and on a case study of insect species classification, we recommend sampling based on hierarchical clustering or k-Means. These methods perform statistically significantly better than the popular random sampling. In addition, they are simple algorithms that are readily available in many software packages.
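To make the idea of clustering-based selection concrete, the following is a minimal sketch of one way k-Means could be used to choose which unlabeled instances to send to an annotator: cluster the data into as many groups as the labeling budget allows, then pick the instance closest to each centroid. It is an illustration only, not the authors' implementation. The function names (`sq_dist`, `assign`, `kmeans_select`), the `budget` parameter, the use of squared Euclidean distance on fixed-length numeric sequences, and the random initialization are all assumptions made for this example; the abstract does not specify these details.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(data, centroids):
    """Group instance indices by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for i, x in enumerate(data):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(x, centroids[c]))
        clusters[nearest].append(i)
    return clusters


def kmeans_select(data, budget, iters=50, seed=0):
    """Return sorted indices of up to `budget` instances to label.

    Hypothetical helper: runs plain k-Means with k = budget and keeps,
    for each non-empty cluster, the instance nearest to its centroid.
    """
    rng = random.Random(seed)
    # Assumption: centroids start at randomly chosen distinct instances.
    centroids = [list(data[i]) for i in rng.sample(range(len(data)), budget)]
    dim = len(data[0])
    for _ in range(iters):
        clusters = assign(data, centroids)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append([sum(data[i][d] for i in members) / len(members)
                                for d in range(dim)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    clusters = assign(data, centroids)
    selected = [min(members, key=lambda i: sq_dist(data[i], centroids[c]))
                for c, members in enumerate(clusters) if members]
    return sorted(selected)


# Toy data: two well-separated groups of short sequences.
series = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
to_label = kmeans_select(series, budget=2)
print(to_label)
```

With a budget of two on this toy data, the sketch returns one representative from each group, whereas random sampling could draw both queries from the same group. In practice a library implementation of k-Means or hierarchical clustering, and a distance measure suited to sequential data, would likely replace the hand-written loop.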

Keywords: Unsupervised active learning, training set labeling, clustering, centrality measures, sequential data

Journal: Intelligent Data Analysis, vol. 21, no. 5, pp. 1061-1095, 2017

Published: 10 October 2017
