Importance of data preprocessing in time series prediction using SARIMA: A case study
Article type: Research Article
Authors: Adineh, Amir Hossein [a] | Narimani, Zahra [a],[*] | Satapathy, Suresh Chandra [b]
Affiliations: [a] Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran | [b] School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, India
Correspondence: [*] Corresponding author: Zahra Narimani, Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences, No. 45137-66731, Zanjan, Iran. Tel.: +98 24 3315 3374; E-mail: [email protected].
Abstract: Over the last decades, time series analysis has been of particular practical importance. Domains such as financial data analysis, biological data analysis, and speech recognition inherently deal with time-dependent signals. Monitoring the past behavior of a signal is key to precisely predicting the behavior of a system in the near future. In scenarios such as financial data prediction, the predominant signal has a periodic behavior (starting from the beginning of the month, week, etc.), and a general trend and seasonal behavior can also be assumed. The Autoregressive Integrated Moving Average (ARIMA) model and its seasonal extension, SARIMA, have been widely used in forecasting time series data and are capable of dealing with seasonal behavior and trend in the data. Although the behavior of the data may be autoregressive, and trends and seasonality can be detected and handled by SARIMA, the data is not always exactly compatible with SARIMA (or, more generally, ARIMA) assumptions. In addition, SARIMA does not assume the existence of missing data, while real-world data can always contain missing values for reasons such as holidays on which no data is recorded. Working hours that differ between weekdays may also cause irregular patterns compared to what SARIMA assumes. In this paper, we investigate the effectiveness of applying SARIMA to such real-world data and demonstrate preprocessing methods that make the data more suitable for modeling with SARIMA. The data in this research is derived from the transactions of a mutual fund investment company. It contains missing values (single points and intervals) as well as irregularities caused by the number of working hours differing between weekdays, which makes the data inconsistent and leads to poor results without preprocessing. In addition, the number of data points was not adequate at the time of analysis to fit a SARIMA model. Preprocessing steps, such as filling missing values and techniques for making the data consistent, are proposed to deal with these problems. Results show that the prediction performance of SARIMA on this set of real-world data is significantly improved by applying the preprocessing steps introduced to deal with the mentioned circumstances. The proposed preprocessing steps can be used in other real-world time series data analysis.
Keywords: Time series data prediction, SARIMA, data preprocessing
DOI: 10.3233/KES-200065
Journal: International Journal of Knowledge-based and Intelligent Engineering Systems, vol. 24, no. 4, pp. 331-342, 2020
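Illustrative sketch (not taken from the article): the abstract names two kinds of preprocessing, filling missing values and making days with different working hours comparable, but does not specify how either is done. The Python below is one possible reading under stated assumptions. The function names `fill_missing_seasonal` and `normalize_by_hours`, the five-day period, the sample values, and the working-hours schedule are all hypothetical, and the fill rule (mean of observed values at the same position in the week) is an assumed stand-in rather than the authors' method.

```python
from statistics import mean
from typing import List, Optional, Sequence


def fill_missing_seasonal(values: Sequence[Optional[float]], period: int) -> List[float]:
    """Replace each None with the mean of the observed values that share
    the same position in the seasonal cycle (e.g. the same weekday)."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            # All observations at the same phase of the cycle as index i.
            same_phase = [x for x in values[i % period::period] if x is not None]
            if not same_phase:
                raise ValueError(f"no observed value at seasonal position {i % period}")
            filled[i] = mean(same_phase)
    return filled


def normalize_by_hours(values: Sequence[float], hours: Sequence[float]) -> List[float]:
    """Convert per-day totals into per-hour rates so that short and long
    working days are on the same scale."""
    return [v / h for v, h in zip(values, hours)]


# Hypothetical data: three five-day working weeks, with one isolated gap
# and one two-day gap (for example, a holiday interval).
series = [
    10.0, 12.0, None, 14.0, 6.0,
    11.0, None, None, 15.0, 7.0,
    12.0, 14.0, 13.0, 16.0, 8.0,
]
# Assumed schedule: the last working day of each week is a half day.
hours = [8, 8, 8, 8, 4] * 3

filled = fill_missing_seasonal(series, period=5)
rates = normalize_by_hours(filled, hours)
print(filled)
print(rates)
```

A series cleaned this way has no gaps and a uniform scale across days, so it could then be passed to a SARIMA implementation (for instance the `SARIMAX` class in statsmodels) with the seasonal period set to the length of the working week.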