Learning algorithm selection for comprehensible regression analysis using datasetoids

Loterman, Gert; Mues, Christophe

doi:10.3233/IDA-150756

Article type: Research Article

Authors: Loterman, Gert^{a; *} | Mues, Christophe^b

Affiliations: [a] Business Informatics and Operations Management, Ghent University, Ghent, Belgium | [b] School of Management, University of Southampton, Southampton, UK

Correspondence: [*] Corresponding author: Gert Loterman, Business Informatics and Operations Management, Ghent University, Tweekerkenstraat 2, 9000 Ghent, Belgium.

Abstract: Data mining tools often include a workbench of algorithms for modelling a given dataset but offer little guidance on selecting the most accurate algorithm for that dataset. The best algorithm is not known in advance, and no single model format is superior for all datasets. Evaluating a number of candidate algorithms on large datasets to determine the most accurate model is, however, a computational burden. A more time-efficient alternative is to select the algorithm based on the nature of the dataset. This meta-learning study explores to what degree dataset characteristics can help identify which regression/estimation algorithm will best fit a given dataset. We focus on comprehensible 'white-box' techniques (i.e. linear, spline, tree, linear tree or spline tree), as these are of particular interest in many real-life estimation settings. A large-scale experiment with more than a thousand so-called datasetoids representing various real-life dependencies is conducted to discover possible relations. It is found that algorithm-based characteristics such as sampling landmarks are major drivers for successfully selecting the most accurate algorithm. Further, data-based characteristics such as the length, dimensionality and composition of the independent variables, or the asymmetry and dispersion of the dependent variable, appear to contribute little once landmarks are included in the meta-model.
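
As a rough illustration of the sampling-landmark idea named in the abstract, the sketch below trains each candidate learner on a small random sample of a dataset and records its held-out error as a cheap, algorithm-based characteristic. Everything in it is an assumption made for illustration: the two toy learners (`fit_linear`, `fit_stump`), the sample size, the train/test split, and the synthetic data are hypothetical and are not taken from the study, which considers linear, spline, tree, linear tree and spline tree models.

```python
import random
import statistics


def fit_linear(xs, ys):
    """Ordinary least squares with one input variable; returns a predictor."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_stump(xs, ys):
    """Depth-one regression tree: split at the median x, predict each side's mean."""
    split = statistics.median(xs)
    overall = statistics.fmean(ys)
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    left_mean = statistics.fmean(left) if left else overall
    right_mean = statistics.fmean(right) if right else overall
    return lambda x: left_mean if x <= split else right_mean


def rmse(model, xs, ys):
    """Root mean squared error of a predictor on the given points."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def sampling_landmarks(xs, ys, learners, sample_size=60, seed=0):
    """Fit each learner on part of a small random sample; score it on the rest."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(xs)), min(sample_size, len(xs)))
    cut = (2 * len(idx)) // 3  # two thirds for training, one third for testing
    train, test = idx[:cut], idx[cut:]
    landmarks = {}
    for name, fit in learners.items():
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        landmarks[name] = rmse(model, [xs[i] for i in test], [ys[i] for i in test])
    return landmarks


# Hypothetical candidate set; the study's learners are richer than these.
LEARNERS = {"linear": fit_linear, "stump": fit_stump}

# Synthetic, noise-free linear data for illustration only (not from the study).
xs = [i / 10 for i in range(1000)]
linear_ys = [2.0 * x + 1.0 for x in xs]

linear_marks = sampling_landmarks(xs, linear_ys, LEARNERS)
best = min(linear_marks, key=linear_marks.get)
print(linear_marks, best)
```

In a meta-learning setup, landmark values like these would be computed per dataset and passed, possibly alongside data-based characteristics, as inputs to a meta-model that predicts which algorithm is most accurate. Here the lowest landmark is simply taken as the recommendation, which is a simplification.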

Keywords: Data mining, regression, comprehensibility, meta learning, datasetoids

Journal: Intelligent Data Analysis, vol. 19, no. 5, pp. 1019-1034, 2015

Published: 2015
