Imputing missings in official statistics for general tasks – our vote for distributional accuracy

Thurow, Maria; Dumpert, Florian; Ramosaj, Burim; Pauly, Markus

doi:10.3233/SJI-210798

Article type: Research Article

Authors: Thurow, Maria^{a; *} | Dumpert, Florian^b | Ramosaj, Burim^a | Pauly, Markus^a

Affiliations: [a] Department of Statistics, TU Dortmund University, Dortmund, Germany | [b] Federal Statistical Office of Germany (DESTATIS), Wiesbaden, Germany

Correspondence: [*] Corresponding author: Maria Thurow, Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany. E-mail: [email protected].

Abstract: In statistical survey analysis, (partial) non-response is an integral part of data acquisition. Treating missing values during data preparation and data analysis is therefore a non-trivial task. Focusing on the German Structure of Earnings data from the Federal Statistical Office of Germany (DESTATIS), we investigate various imputation methods with regard to their imputation accuracy and its impact on parameter estimates in the analysis phase after imputation. Since imputation accuracy measures are not uniquely determined in theory and practice, we study different measures for assessing imputation accuracy: beyond the most common measures, the normalized root mean squared error (NRMSE) and the proportion of false classification (PFC), we put a special focus on (distribution) distance measures. The aim is to deliver guidelines for correctly assessing distributional accuracy after imputation and its potential effect on parameter estimates such as the mean gross income. Our empirical findings indicate a discrepancy between the NRMSE and PFC on the one hand and distance measures on the other. While the latter measure distributional similarity, NRMSE and PFC focus on data reproducibility. We find that a low NRMSE or PFC is in general not accompanied by lower distributional discrepancies. However, distribution-based measures correspond with more accurate parameter estimates such as the mean gross income under the (multiple) imputation scheme.
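
The contrast between reproducibility measures and distance measures described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the variance normalization in `nrmse`, the function names, and the toy data are assumptions made purely for illustration, and the Kolmogorov-Smirnov statistic stands in for the broader class of distance measures the abstract mentions.

```python
import math


def nrmse(true_vals, imputed_vals):
    """Root mean squared error, normalized by the variance of the true values.

    Assumption: normalization by the (population) variance of the true
    values; other normalizations exist.
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of categorical entries that were imputed incorrectly."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical distribution functions."""

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = set(sample_a) | set(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical toy data: the true values behind ten missing entries.
true_vals = [float(v) for v in range(1, 11)]

# Imputation A: replace every missing entry by the mean of the true values.
mean_imputed = [sum(true_vals) / len(true_vals)] * len(true_vals)

# Imputation B: the true values in reversed order, i.e. the right
# distribution assigned to the wrong units.
shuffled_imputed = list(reversed(true_vals))

print("mean imputation:    ", nrmse(true_vals, mean_imputed),
      ks_statistic(true_vals, mean_imputed))
print("permuted imputation:", nrmse(true_vals, shuffled_imputed),
      ks_statistic(true_vals, shuffled_imputed))
```

On this toy data, mean imputation has the lower NRMSE (1 versus 2) but the larger Kolmogorov-Smirnov distance (0.5 versus 0), which mirrors the discrepancy between the two kinds of measures in a deliberately extreme form.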

Keywords: Missing values, multiple imputation, distributional similarities, Kolmogorov-Smirnov test, random forest, MICE

Journal: Statistical Journal of the IAOS, vol. 37, no. 4, pp. 1379-1390, 2021

Published: 26 November 2021
