On mining incomplete medical datasets: Ordering imputation and classification

Chen, Chih-Wen; Lin, Wei-Chao; Ke, Shih-Wen; Tsai, Chih-Fong; Hu, Ya-Han

doi:10.3233/THC-151018


Article type: Research Article

Authors: Chen, Chih-Wen^a | Lin, Wei-Chao^b | Ke, Shih-Wen^c | Tsai, Chih-Fong^{d; *} | Hu, Ya-Han^e

Affiliations: [a] Department of Pharmacy, Kaohsiung Municipal Chinese Medical Hospital, Taiwan | [b] Department of Computer Science and Information Engineering, Hwa Hsia University of Technology, Taiwan | [c] Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan | [d] Department of Information Management, National Central University, Taiwan | [e] Department of Information Management, National Chung Cheng University, Taiwan

Correspondence: [*] Corresponding author: Chih-Fong Tsai, Department of Information Management, National Central University, Taiwan. Tel.: +886 3 422 7151; Fax: +886 3 4254604; E-mail:[email protected]

Abstract: BACKGROUND: Medical datasets commonly contain samples with missing values, which makes data mining over such incomplete datasets difficult. The usual remedy is missing value imputation, which estimates each missing value by reasoning from the observed data. The effectiveness of imputation therefore depends heavily on the observed (complete) data in the incomplete dataset. OBJECTIVE: This paper examines how performing instance selection to filter noisy data (outliers) out of a given complete dataset affects the final imputation result. Specifically, four different processes for combining instance selection and missing value imputation are proposed and compared in terms of data classification. METHODS: Experiments are conducted on 11 medical datasets containing categorical, numerical, and mixed attribute types. Missing values are introduced into all attributes of each dataset at missing rates of 10%, 20%, 30%, 40%, and 50%. DROP3 is used for instance selection and k-nearest neighbor imputation for missing value imputation. A support vector machine (SVM) classifier is used to assess the final classification accuracy of the four processes. RESULTS: The experimental results show that the second process, which performs instance selection first and imputation second, allows the SVM classifiers to outperform those of the other processes. CONCLUSIONS: For incomplete medical datasets, missing value imputation is necessary. This paper demonstrates that instance selection can be used to filter out noisy data or outliers before the imputation process. In other words, the observed data used for imputation may contain noisy information that degrades both the quality of the imputation result and the classification performance.
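
The ordering described in the abstract (instance selection first, imputation second, then SVM classification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: DROP3 is not available in scikit-learn, so a simplified edited-nearest-neighbour noise filter is swapped in for it; the synthetic dataset, the 50/50 split into complete and incomplete rows, the helper names `enn_mask`, `select_then_impute`, and `run_demo`, and all parameter values are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def enn_mask(X, y, k=3):
    """Simplified edited-nearest-neighbour filter (a stand-in for DROP3).

    Returns a boolean mask that keeps a sample only when a strict majority
    of its k nearest neighbours share its label.
    """
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without X excludes each point from its own neighbours.
    neigh_idx = nn.kneighbors(return_distance=False)
    agree = (y[neigh_idx] == y[:, None]).sum(axis=1)
    return agree * 2 > k


def select_then_impute(X_complete, y_complete, X_incomplete, y_incomplete,
                       k_select=3, k_impute=5):
    """Filter the complete rows first, then impute the incomplete rows from them."""
    keep = enn_mask(X_complete, y_complete, k=k_select)
    X_ref, y_ref = X_complete[keep], y_complete[keep]
    # The imputer only sees the filtered complete rows as donors.
    imputer = KNNImputer(n_neighbors=k_impute).fit(X_ref)
    X_filled = imputer.transform(X_incomplete)
    return np.vstack([X_ref, X_filled]), np.concatenate([y_ref, y_incomplete])


def run_demo(missing_rate=0.2, seed=0):
    """Run the sketch on synthetic data (an assumption, not a medical dataset)."""
    rng = np.random.default_rng(seed)
    X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                               flip_y=0.05, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # Assumed setup: half of the training rows stay complete, the rest get
    # values masked completely at random at the given rate.
    n_complete = len(X_train) // 2
    X_complete, y_complete = X_train[:n_complete], y_train[:n_complete]
    X_incomplete = X_train[n_complete:].copy()
    y_incomplete = y_train[n_complete:]
    X_incomplete[rng.random(X_incomplete.shape) < missing_rate] = np.nan

    X_final, y_final = select_then_impute(
        X_complete, y_complete, X_incomplete, y_incomplete)

    # SVM trained on the filtered and imputed data, scored on held-out complete data.
    clf = make_pipeline(StandardScaler(), SVC()).fit(X_final, y_final)
    return clf.score(X_test, y_test), X_final, y_final


if __name__ == "__main__":
    accuracy, X_final, _ = run_demo()
    print(f"training rows: {len(X_final)}, test accuracy: {accuracy:.3f}")
```

Fitting the imputer only on the filtered complete rows is what makes the order matter: a mislabeled or outlying row removed by the filter can no longer act as a donor for the k-nearest neighbor estimates. Reordering the two calls inside `select_then_impute` would give one of the alternative orderings to compare against.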

Keywords: Instance selection, missing value imputation, incomplete data, medical data mining

DOI: 10.3233/THC-151018

Journal: Technology and Health Care, vol. 23, no. 5, pp. 619-625, 2015

Received 20 February 2015

Accepted 5 June 2015

Published: 2015
