Article type: Research Article
Authors: Barcaroli, Giulio [a],* | Scannapieco, Monica [b]
Affiliations: [a] Via Monte Delle Gioie 29, Roma 00199, Italy | [b] Italian National Institute of Statistics, Roma 00184, Italy
Correspondence: [*] Corresponding author: Giulio Barcaroli, Via Monte Delle Gioie 29, Roma 00199, Italy. E-mail: [email protected].
Abstract: Since 2013, the Italian National Institute of Statistics (Istat) has been investigating the potential of Big Data sources for Official Statistics. Among such sources, Internet data originating from website content has been considered one of the most important for producing information about enterprises. In 2018, Istat started producing experimental statistics on the activities that enterprises carry out through their websites (web ordering, job vacancy advertisement, links to social media, etc.). These statistics are a subset of those currently produced by the “Survey on ICT usage and e-Commerce in Enterprises” and are computed from the content of enterprise websites, acquired with web scraping tools and processed with text mining techniques. A machine learning approach is adopted: models are estimated on the subset of enterprises for which both the survey and the web sources are available, with survey data serving as the training set. The fitted model is then applied to the content scraped from successfully reached websites to predict the target values. The experimental statistics are obtained using two different estimators: (i) a fully model-based estimator; (ii) an estimator that combines model-based and survey-based estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases not significantly different (i.e. the model and combined estimates lie within the confidence intervals of the survey estimates). Simulations have demonstrated that the Mean Square Errors of these new estimates are competitive with those of estimates produced in the traditional way.
Keywords: Big Data, Internet data, web scraping, text mining, machine learning, experimental statistics
DOI: 10.3233/SJI-190553
Journal: Statistical Journal of the IAOS, vol. 35, no. 4, pp. 643-656, 2019
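The workflow summarized in the abstract (train on enterprises with both survey answers and scraped text, predict for reached websites, then form a model-based and a combined estimate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy texts, the choice of a naive Bayes classifier, the unweighted proportions, the function names, and in particular the form of the combined estimator, which the abstract does not specify.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical, very simple tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def fit_naive_bayes(texts, labels):
    # Multinomial naive Bayes with Laplace smoothing (illustrative choice only;
    # the abstract does not name the learner). Both classes must be present.
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(tokenize(text))
    return {
        "counts": counts,
        "n_docs": Counter(labels),
        "vocab": set(counts[0]) | set(counts[1]),
    }

def predict(model, text):
    # Return 1 if the positive class has the higher log-posterior, else 0.
    total_docs = sum(model["n_docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in (0, 1):
        total_tokens = sum(model["counts"][label].values())
        score = math.log(model["n_docs"][label] / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # tokens unseen in training are ignored
                score += math.log(
                    (model["counts"][label][tok] + 1) / (total_tokens + vocab_size)
                )
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

def model_based_estimate(predictions):
    # Share of reached websites predicted to offer the activity (unweighted).
    return sum(predictions) / len(predictions)

def combined_estimate(predictions, survey_rate_unreached, n_unreached):
    # Assumed form: model predictions for reached websites, plus a
    # survey-based rate for enterprises whose websites were not reached.
    total = len(predictions) + n_unreached
    return (sum(predictions) + survey_rate_unreached * n_unreached) / total

# Toy training set: scraped text paired with survey answers (1 = web ordering).
train_texts = [
    "add to cart checkout order online",
    "shopping cart pay online order",
    "company history contact us",
    "about our team contact",
]
train_labels = [1, 1, 0, 0]

# Toy scraped content from websites that were successfully reached.
scraped_texts = [
    "order online with our cart",
    "contact our team",
    "checkout and pay",
    "company history",
]

model = fit_naive_bayes(train_texts, train_labels)
predictions = [predict(model, text) for text in scraped_texts]
model_rate = model_based_estimate(predictions)
combined_rate = combined_estimate(
    predictions, survey_rate_unreached=0.25, n_unreached=4
)
print(predictions, model_rate, combined_rate)
```

A production version would apply survey weights and compute estimates separately for each domain; both are left out here to keep the sketch short.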