Editorial
1.Present and future of the European Master in Official Statistics (EMOS)
This Journal has already extensively discussed the emerging needs and new developments of training in official statistics in the special issue of September 2021. That issue of the SJIAOS dealt with a variety of topics: the foundations and needs for training in official statistics; the changing skill mix and competencies required, including the increased importance of computational and technological skills; the methodology for evaluating training demands; innovative modalities for training delivery; and an overview of current training initiatives at national, international, and sectoral levels. The special issue also included two articles that described the design and objectives of the European Master in Official Statistics (EMOS), a shared initiative of Eurostat, the partners in the European Statistical System and the European System of Central Banks, to connect the producers of official statistics and the European universities with the aim of training future official statisticians.
Formal initial training and lifelong training are two different approaches to education and skill development that cater to distinct stages and needs in a person’s life. Lifelong training is typically of short duration and emphasises continuous learning and skill development throughout a person’s professional career, adapting to changing needs and interests. Formal initial training, by contrast, provides a foundational education, skill set, and competencies at the beginning of one’s professional journey; it has a fixed duration and a structured curriculum, leads to a degree or certification, and is typically provided by recognised educational institutions. In this regard, the EMOS programme currently brings together 33 master’s courses offered by universities in 17 European countries. A key requirement for master’s programmes applying for the award of the EMOS label is close cooperation between the study programmes and the national statistical institutes or other organisations producing official statistics, which translates into the compulsory organisation of internships and master’s theses in these institutions as well as the hosting of guest lectures by official statisticians in the EMOS curricula.
The crucial importance of this programme to meet the recruitment needs of national and international statistical institutions cannot be stressed enough. In general, without close and active cooperation with practitioners, it can be difficult for researchers and teaching staff at universities to be aware of the challenges faced in the production and dissemination of official statistics and, consequently, to ensure that these are adequately covered in academic curricula. Typically, undergraduate (i.e., bachelor’s level) and graduate (master’s level) courses in statistics provide a comprehensive understanding of the mathematical foundations of statistics, probability theory, advanced statistical methods, and data analysis, but do not give the required space to the applicability and actual adoption of these methods and techniques in the real world. The EMOS programme tackles this issue by bridging the gap between statistics producers and the academic community. In addition to a wide-ranging knowledge of advanced statistical methods, the EMOS curriculum offers specialised know-how on the collection, analysis, and interpretation of data within organisations producing official statistics. The main objective of EMOS is to enhance the abilities of students to understand and become familiar with traditional production processes of official data, but also with innovative ones, which make more frequent use of model-based estimation methods; with the integration of different data sources, including unstructured and big data; with the institutional set-up and coordination mechanisms of the global statistical system; and with the dissemination of user-friendly, policy-relevant, and impartial statistical information. This range of skills represents the ideal foundation for the development of professionals able to interpret the fast-changing official data production system of the 21st century.
In January 2023, Eurostat commissioned a study on the future development of EMOS. The purpose of this study is to thoroughly assess the achievements and challenges of EMOS to date and explore the options to increase its scope and impact as a pan-European, cross-border, high-quality study programme in official statistics, considering the current developments in the new data ecosystem as well as in European higher education policy.
On October 26–27, 2023, Eurostat, the Prague University of Economics and Business, and the Czech Statistical Office co-organised the 9
The first day was dedicated to sharing information and experiences that laid the foundation for the co-creation workshop taking place on the second day. Participants heard from a Eurostat representative on recent innovation activities in Eurostat and the European Statistical System, followed by a presentation by a representative of the Directorate General for Education, Youth, Sport, and Culture of the European Commission on ongoing initiatives aiming to intensify transnational cooperation in higher education in Europe.
Next, EMOS students and graduates were joined by representatives of EMOS-labelled master’s programmes, statistics producers, and private companies in two panel discussions, sharing their experiences and perspectives regarding traineeships, employability, and activities fostering collaboration and cohesion in the EMOS network.
The exchange of experiences highlighted the current strengths, challenges, and opportunities in the implementation of EMOS. A key advantage of EMOS identified in the discussions is the close collaboration between academia and statistical producers, including activities such as mandatory internships, guest lectures, and a master thesis competition. These components provide students with the chance to engage with real data on meaningful projects, comprehend the nuances of work in official statistics, and start building their professional networks. Expanding existing partnerships and establishing new ones emerged as opportunities. This could involve forging stronger ties not only with European national statistical institutes but also with entities such as national central banks, international organisations, research institutes, and private companies. Strengthened collaboration holds the potential to broaden the employment prospects for EMOS graduates and ensure that the EMOS curriculum encompasses the skills necessary for graduates to pursue careers in relevant organisations, which, in turn, gain access to highly skilled talent.
The European character of EMOS was discussed both as an opportunity and a challenge. The benefits of EMOS-labelled master’s programmes, student exchanges, and cross-border traineeships were emphasised by all participants. However, challenges were acknowledged in organising these activities, attributed to legal, administrative, and organisational disparities across national, regional, and local educational contexts.
The need to adapt the learning outcomes of the EMOS curriculum to the new data ecosystem and the new roles of statistical offices and statisticians became evident during the discussions. Several EMOS-labelled master’s programmes have recently revamped their curricula, incorporating data science and computing skills to enhance the appeal to potential students and ensure strong employability prospects. Systematically integrating these and other topics into the EMOS curriculum by, among other things, redefining the learning outcomes could create new opportunities and add value for key EMOS stakeholders. For example, including topics such as data governance, FAIR data principles, and metadata catalogue building would equip students with the necessary skills for contributing to the development of European data spaces; fostering closer collaboration with international organisations would enhance the employability of EMOS graduates and potentially position EMOS in a role in international capacity development; introducing subjects such as public communication, stakeholder and project management, ethics, and legislation into the EMOS curriculum would allow students to specialise in specific areas crucial for official statistics producers in the evolving data ecosystem.
These and many other options for the future development of EMOS were explored during the second day of the co-creation workshop. The organisers leveraged the diverse experiences and collective knowledge of the participants, who worked in small groups to brainstorm concrete solutions for the real challenges faced by the organisations participating in the implementation of EMOS.
The overall conclusion is clear: in today’s rapidly evolving data landscape, the case for EMOS has grown more compelling. Official statistics producers must have access to a skilled and adaptable workforce of statisticians. For academia, EMOS is an opportunity to engage with practitioners, offer a stimulating study programme with plenty of opportunities to gain work experience and build a network, and ensure their statistical curricula remain relevant and up-to-date. At the same time, new roles for official statistics producers and trained statisticians have emerged in the new data ecosystem, and there is a need to adapt the objectives and implementation modalities of EMOS to leverage its strengths and ensure that its key stakeholders reap the full benefits of the opportunities it presents.
The EMOS Board and the European Statistical System Committee recognise the importance of re-evaluating the EMOS programme to align it with the evolving landscape of official statistics and higher education, addressing the requirements of a new generation of learners. We are eagerly anticipating the results of the study on the future of EMOS and the subsequent formulation of a development strategy for the programme by the European Statistical System Committee in 2024.
2.The content of this issue
2.1Data science skills
This issue of the SJIAOS starts with a paper on the role of data science in shaping the future of the profession of official statisticians, which holds a clear link with the content of the previous section. The article is “Data Science Skills for the Next Generation of Statisticians” by Laura Antonucci (Campania University), Antonio Balzanella (Campania University), Elvira Bruno (SDG Group), Corrado Crocetta (University of Bari), Simone Di Zio (University of Chieti-Pescara), Lara Fontanella (University of Chieti-Pescara), Maurizio Sanarico (SDG Group), Bruno Scarpa (University of Padova), Rosanna Verde (Campania University), and Giorgio Vittadini (University of Milano-Bicocca). After reviewing alternative definitions of data science, the article turns to an assessment of the skills required by the labour market for data scientists and the specific characteristics of this profession, combining elements of mathematics, statistics, computer science, and knowledge of an application domain. According to the authors, therefore, we should not refer to an individual data scientist but to different data scientists according to their specialised experience in various fields. Finally, the specific role of a data scientist along the phases of a data science project (data collection, data analysis, and communication of results) is described, outlining how the specific expertise of more “traditional” statisticians can significantly contribute to the improvement of data science projects by ensuring the statistical soundness of the entire data processing chain.
2.2IAOS Young Statistician Prize 2023 (YSP 2023)
The second section of the Journal publishes the winning papers of the Young Statistician Prize (YSP) of the International Association for Official Statistics (IAOS) for the year 2023. The YSP has been running since 2011, attracting submissions from across the globe – both from developing and developed nations – of young statisticians who are less than 35 years old as of February of that year. With this prize, the IAOS seeks to actively encourage the membership of young statisticians, their involvement in the Association and in the implementation of its activities. Today’s young statisticians are tomorrow’s leaders, and they inspire us, “older” statisticians, to think about the current and new frontiers in official statistics, promoting dynamic and innovative professional cooperation. The IAOS encourages members and supervisors to support young statisticians entering the Prize. Our colleague Gemma Van Halderen has been the coordinator of this competition since 2014 and has drafted this section of the editorial.
In 2023, for the first time, two papers were jointly awarded first place. This was in recognition of their high quality, as assessed against four criteria: 1) Scientific and/or strategic merit; 2) Originality; 3) Applicability of the ideas in the practice of statistical organisations; and 4) Quality of the exposition.
The joint first-place papers use machine learning and small-area estimation methodologies for important applications in official statistics.
Ms. Joanne Yoon, a young statistician from Statistics Canada, uses machine learning methods to classify respondent comments from the 2021 Canadian Census of Population. Machine learning methods are becoming more and more commonplace, and their use offers great potential to improve efficiency, reduce costs, and improve timeliness. Their use in classification processes has applications across many areas of official statistics.
Mr. Nelson Chua and Mr. Benjamin Long, young statisticians from the Australian Bureau of Statistics, add depth and value to the field of small area estimation by building a methodology for time-to-event data such as first job after graduation, home ownership, or having a child. They argue that the approach can also be applied to longitudinal data and cross-sectional sample collections; it may also be particularly useful for sensitive data items that can only be obtained via surveys, such as the onset of certain health conditions.
In second place, estimation methodologies are in the spotlight with the use of big data integrated with survey data to estimate medians. Mr. Ryan Covey, a young statistician also from the Australian Bureau of Statistics, outlines his methodology and how it can be applied to highly skewed data such as personal income and age.
Dr. Alba Cervantes Loreto, a young statistician at Statistics New Zealand, models the self-identification of Māori businesses in Aotearoa, New Zealand, to improve the identification of Indigenous businesses and produce better estimates of the Indigenous economy. Her paper looks at relationships between personal and business demographics and concludes that Māori ownership of a business is a weak predictor of self-identification as a Māori business.
The Association especially welcomes submissions from developing nations and was pleased to award a special commendation to three young statisticians from the Census and Statistics Department of Hong Kong, China. Mr. Benjamin Chan, Mr. Ian Ng, and Ms. Natalie Chung’s paper explores the use of deep learning techniques to detect anomalies in trade declarations prepared by customs officials. It demonstrates the potential of using deep learning approaches in quality assurance, enhancing the accuracy of error detection, and overcoming some limitations of traditional rule-based approaches.
2.3The Impact of COVID-19 on Official Statistics
The third section of the Journal is dedicated to “The Impact of COVID-19 on Official Statistics”, a topic featured in every issue of the Journal since 2020. The article contained in this issue is “Spatial Modelling of the Effect of Socio-Economic Indicators on the Incidence Rate of COVID-19 in Nigeria” by Nureni O. Adeboye, Kehinde A. Bashiru, Taiwo A. Ojurongbe, Habeeb A. Afolabi, Timothy A. Ogunleye (all from Osun State University, Osogbo, Nigeria), Olawale V. Abimbola (Creative Advanced Technologies, Dubai, UAE), and Osuolale P. Popoola (The Ibarapa Polytechnic, Eruwa, Nigeria). The paper analyses the effects of key socio-economic indicators on the incidence rate of COVID-19 in the different states of Nigeria, where the pandemic has spread rapidly since February 2020. The research used spatial modelling techniques to examine differentials in the social vulnerability to COVID-19 across Nigeria’s affected states. In particular, the goodness of fit of the Ordinary Least Squares (OLS) model was compared with that of the Spatial Lag Model (SLM), the Spatial Error Model (SEM), and the Geographically Weighted Regression (GWR) model to take into account the “spatial dependence” of the phenomenon under study. Based on the findings, the GWR model outperformed the other models and was able to produce consistent predictions and capture the spatial variability of the incidence rate of COVID-19 in Nigerian states.
2.4Hidden and Hard-to-Measure Population Groups
The fourth section of the Journal contains two articles tackling the issue of hidden and hard-to-measure population groups. The development of unbiased and efficient methods for measuring hidden and hard-to-measure population groups is extremely important to enable NSOs to produce reliable disaggregated data for many SDG indicators and achieve the objective of leaving no one behind, which is one of the pillars of the 2030 Agenda. The first article in this section is “Hard-to-reach Groups in Administrative Sources: Main Challenges and Future Work” by Donatella Zindato (Istat) and Maciej Truszczynski (formerly Statistics Denmark). The paper discusses alternative definitions of hard-to-reach groups and the ways of capturing them in administrative sources in relation to the traditional hard-to-count groups in censuses and surveys. One interpretation selects groups that are difficult to reach with traditional survey methods and then tries to capture them in registers, as administrative data might offer the potential to improve frame coverage for some target populations. The other interpretation refers to the incompleteness of registers or linked administrative databases, which makes some groups hard to reach and hence hard to describe with data, owing to time lags in the reporting of some events or to coverage problems of the source itself. The paper summarises the experience of selected national statistical offices in accessing hard-to-reach groups and describes problems and challenges in monitoring their characteristics and evolution. It also proposes further possible work to improve access to hard-to-reach groups using administrative data.
The second article is “Unbiased Estimation Strategies for Respondent-Driven Sampling” by Demetrio Falorsi (Sapienza University of Rome), Giorgio Alleva (Sapienza University of Rome), and Francesca Petrarca (University of Roma Tre). This paper proposes a strategy to improve the measurement of hidden or hard-to-count population groups focused on respondent-driven sampling (RDS), which is a valuable survey methodology to estimate the size and characteristics of hidden or hard-to-measure population groups. From a data collection point of view, the RDS methodology makes it possible to gather information on these populations by exploiting the relationships between their components. However, the RDS suffers from the lack of an estimation methodology that is sufficiently robust to accommodate the varying conditions under which it can be applied. In this paper, the authors address the estimation problem of the RDS methodology and, by approaching it as a particular indirect sampling technique, propose three unbiased estimation methods as viable solutions.
2.5Register-based Population Statistics
The first article in this section is “To count or to estimate: a note on compiling population estimates from administrative data” by John Dunne, Francesca Kay, and Timothy Linehan (all from the Central Statistics Office, Ireland). The paper discusses the possibility of using administrative data for compiling population estimates in Ireland. Since the country does not have a Central Population Register, the preliminary step is to build a Statistical Population Dataset (SPD) from administrative data, with ideally just one record for each person in the population containing the relevant attributes. The ideal SPD would then allow the compilation of population statistics by simply counting over records. In practice, however, the compilation of the SPD is prone to four types of errors: overcoverage, undercoverage, domain misclassification, and linkage error. To date, Ireland has investigated two different approaches to the compilation of population estimates from administrative data: the simple count method, by building an SPD that minimises the overall number of individual record errors; and the estimation method, by building an SPD that aims to eliminate all error types, except for undercoverage, and then adjusts counts for undercoverage using dual system estimation methods to obtain population estimates. This paper explores the advantages and disadvantages of both methods before considering how they could be integrated to eliminate the disadvantages.
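To make the estimation method more concrete, the following is a minimal sketch of the classical two-list (Lincoln-Petersen) dual system estimator. It is only an illustration: the function name and all counts are hypothetical, and the estimator actually used in the paper may differ (for example, by stratifying or by relaxing the independence assumption between the two lists).

```python
def dual_system_estimate(n_list_a: int, n_list_b: int, n_matched: int) -> float:
    """Lincoln-Petersen estimate of the population size from two lists.

    n_list_a  -- persons recorded on the first list (e.g. the SPD)
    n_list_b  -- persons recorded on an independent second list
    n_matched -- persons found on both lists after record linkage
    """
    if n_matched <= 0:
        raise ValueError("at least one matched record is needed")
    return n_list_a * n_list_b / n_matched


# Hypothetical counts, chosen only to illustrate the arithmetic.
spd_count = 9_000      # records in the SPD after removing other error types
second_count = 1_000   # persons on the independent second list
matched = 900          # persons linked across both lists

estimate = dual_system_estimate(spd_count, second_count, matched)
undercoverage_adjustment = estimate - spd_count

print(estimate)                  # estimated population size
print(undercoverage_adjustment)  # persons estimated to be missing from the SPD
```

The sketch also shows why linkage error matters for this approach: missed links lower the matched count and therefore inflate the population estimate.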
The second article of this section is “Register-based Census in Thailand: A Case Study in Chachoengsao Province” by Nuttirudee Charoenruk, Narongrid Asavaroungpipop (both from Chulalongkorn University, Thailand), Pannee Pattanapradit (National Electronics and Computer Technology Center, Thailand), Kittiya Ku-kiattikun, and Chainarong Amornbunchornvej (both from the National Statistical Office, Thailand). The authors tested the feasibility of using the register-based census in Thailand. In this paper, they describe the methodology used for data preparation and integration, as well as the quality of the results obtained by comparing the data of the register-based census with those of the traditional census conducted in the Chachoengsao province in 2020. The main result of their analysis is that using one recent and complete database is sufficient for conducting the register-based census, as this avoids the problem of overcoverage and biased sex distribution that derives from linking data from multiple registers. The authors conclude their paper with a series of recommendations for ensuring that a register-based census can be conducted in Thailand.
2.6Innovative Statistical Methods
This section of the Journal comprises five papers on innovative statistical methods that complement the articles on new methods and techniques awarded the Young Statistician Prize 2023. The first paper in this section, “Understanding financial distress by using Markov random fields on linked administrative data” by Floris Fonville, Peter G.M. van der Heijden, Arno P.J.M. Siebes, and Daniel L. Oberski (all from Utrecht University, The Netherlands), aims to provide a theoretical framework to explain the relationship between household financial distress and social problems using graphical models. The main challenges in graph estimation from data networks are addressed with the eLasso method, a computationally efficient model for estimating network structures. The approach combines logistic regression with model selection based on a goodness-of-fit measure to identify relevant relationships between variables that define connections in a network. In the resulting graph, financial distress occupies a central position that connects to both youth and adult socio-related problems.
The second article is “Spatial and demographic distributions of personal insolvency: an opportunity for official statistics” by Jonas Klingwort (Statistics Netherlands), Sven Alexander Brocker (University of Duisburg-Essen, Germany), and Christian Borgs (Statistics North Rhine-Westphalia, Germany). The main goal of this paper is to demonstrate the possibility of producing detailed and more timely official statistics on spatial and demographic distributions of personal insolvency using an existing and untapped large administrative database. The statistics are obtained by combining web scraping and text-mining techniques. The main findings of this study show that personal insolvency is concentrated in certain age groups, i.e., individuals in their early thirties, and in certain regions. The authors elaborate on the limitations of the proposed approach and advocate for improving the availability of data on insolvencies in the European context to allow for cross-country comparisons.
The third article is “Web Scraping for Price Statistics in the Philippines” by Manuel Leonard F. Albis, Sabrina O. Romasoc, Shushimita G. Pelayo, Bea Andrea C. Gavira, and Jazzen Paul J. Asombrado (all from the Philippine Statistical Research and Training Institute). This paper initially provides a survey of the experiences of various national statistical agencies in using web scraping data to produce official statistics on the Consumer Price Index (CPI). As digital and online platforms are increasingly utilised for commercial transactions, web scraping offers a way to increase the frequency of price data collection while reducing its cost compared to price surveys. The authors experiment with the use of web scraping data to estimate the CPI for food and alcoholic beverages in the National Capital Region of the Philippines, which is then compared with the official CPI estimate of the Philippine Statistics Authority. Since web-scraped prices originate only from supermarkets, the resulting indices tend to be higher than the official CPI, which collects data from both supermarkets and wet markets. The authors suggest that the methodology used in this paper can be further enhanced by exploring additional price sources on the web and that the PSA can consider establishing a standard procedure and index for the web scraping of prices in the Philippines.
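As an illustration of how scraped prices can be turned into an index, the sketch below computes a Jevons elementary index (the geometric mean of price relatives) over products observed in both periods. This is an assumption made for illustration only: the editorial does not state which index formula the authors use, and the product names and prices are hypothetical.

```python
import math


def jevons_index(base_prices: dict, current_prices: dict) -> float:
    """Jevons elementary price index (base period = 100).

    Only products observed in both periods contribute to the index.
    """
    common = base_prices.keys() & current_prices.keys()
    if not common:
        raise ValueError("no product is observed in both periods")
    log_relatives = [
        math.log(current_prices[item] / base_prices[item]) for item in common
    ]
    return 100 * math.exp(sum(log_relatives) / len(log_relatives))


# Hypothetical scraped prices for two periods.
base = {"rice_1kg": 50.0, "eggs_dozen": 100.0}
current = {"rice_1kg": 100.0, "eggs_dozen": 100.0, "milk_1l": 90.0}

index = jevons_index(base, current)
print(round(index, 2))
```

Restricting the calculation to matched products is one simple way of handling items that appear on or disappear from a website between scraping rounds.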
In “Machine learning estimation of the resident population”, Violeta Calian, Margherita Zuppardo, and Omar Hardarson (all from Statistics Iceland) address the issue of estimating the resident population, i.e., correcting for overcounts in administrative register data, as a binary classification problem that can be solved using machine learning algorithms. The selection and optimisation of the chosen algorithm, random forest (RF) in this case, is illustrated for predicting the resident status of individuals in Iceland from Census and survey data, describing in detail its performance, including the uncertainty associated with the results. The limitations of the exercise, a small and noisy sample of survey data used as training data to fit the RF model, are also highlighted, together with the plans to investigate an alternative solution to the same problem based on a much larger training data set, i.e., administrative registers’ data over multiple years.
The paper “Machine Learning and Data Augmentation in the Proxy Means Test for Poverty Targeting” by Wayne Wobcke and Siti Mariyah (both from the University of New South Wales, Australia) uses data science methods to address the problem of poverty targeting in three districts of Indonesia. In particular, the authors compare a few statistical and machine learning methods with a new approach that uses area-level features and data augmentation at the subdistrict level for estimating the 2020 per capita household expenditure at the district level. They demonstrate that this novel approach, which uses machine learning to combine a variety of data sources, significantly reduces the inclusion/exclusion errors in the probability of identifying the poorest 40% of the population.
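For readers unfamiliar with the terminology, the sketch below shows one common way of computing inclusion and exclusion errors when the poorest 40% of households are targeted on the basis of predicted expenditure. The definitions used here (exclusion error: share of truly poor households that are not targeted; inclusion error: share of targeted households that are not truly poor) are assumed, the function name is hypothetical, and the data are invented; the paper may define or evaluate these errors differently.

```python
def targeting_errors(actual, predicted, share=0.4):
    """Return (inclusion_error, exclusion_error) for a fixed targeting quota.

    actual    -- observed per capita expenditure of each household
    predicted -- model-predicted expenditure, in the same household order
    share     -- fraction of households to target (poorest first)
    """
    n = len(actual)
    k = int(round(share * n))
    if k == 0:
        raise ValueError("the quota selects no households")

    def poorest(values):
        # Indices of the k households with the lowest expenditure.
        order = sorted(range(n), key=lambda i: values[i])
        return set(order[:k])

    truly_poor = poorest(actual)
    targeted = poorest(predicted)
    inclusion_error = len(targeted - truly_poor) / k
    exclusion_error = len(truly_poor - targeted) / k
    return inclusion_error, exclusion_error


# Invented expenditures for ten households.
actual = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
predicted = [12, 60, 28, 35, 41, 58, 75, 79, 95, 99]

inclusion, exclusion = targeting_errors(actual, predicted)
print(inclusion, exclusion)
```

Because the same quota is applied to the actual and the predicted rankings, the two error rates coincide in this set-up: every poor household wrongly left out is replaced by a non-poor household wrongly included.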
2.7Open-Source Software Tools
The first paper in this section is “An R package for automatically generating candidate correspondence tables between classifications” by Martin Karlberg (Eurostat), Vasilis Chasiotis (Athens University of Economics and Business), Photis Stavropoulos (Quantos S.A. Statistics and Information Systems), Christine Laaboudi (Eurostat), Mátyás Mészáros (Eurostat), and Despoina-Avgerini Nasiopoulou (Eurostat). The authors present the newly developed ‘correspondence tables’ R package, available on CRAN, which automates much of the ‘mechanical’ work required for developing a correspondence table. When statistics for the same topic are compiled using different classifications (e.g., national and international classifications; or a revised version of a pre-existing classification), they need to be transformed by means of a correspondence table in order to become comparable. However, correspondence tables between the two classifications involved often do not exist, and developing them requires considerable time and effort from classification experts. The main advantages of using the package for creating a candidate correspondence table are the reduction of manual work of a clerical nature, the elimination of the risk of errors stemming from manual operations, and the tidier and cleaner look of the candidate correspondence table. Moreover, the paper presents lessons learned along the way, including unforeseen quality issues with input data, and outlines areas for future improvement.
The second article in this section is “RJDemetra, a Promising Tool for the Seasonal Adjustment of Official Statistics” by Giancarlo Lutero and Andrea d’Orazio (both from ISTAT). The authors present a new software tool developed in R that complements the JDemetra+ suite. The most important Seasonal Adjustment (SA) methods, X-13Arima-Seats and Tramo-Seats, are currently included in JDemetra+, a universal open-source environment that is available on several platforms and operating systems as a result of the adoption of the Java programming language for the source code and the XML metalanguage for input specifications. This paper focuses on the advantages of RJDemetra by illustrating its functionalities with several examples and the associated R scripts. In addition, it proposes an alternative procedure that enhances the consistency checks in the input system and the interactive update of the time series in the SA revision procedure step in order to improve and streamline the SA estimation process and, at the same time, ensure greater security and efficiency. Finally, the interaction between two different environments, such as SAS-IML and R, is displayed through a new SAS-R procedure available for estimating the SA series of the Quarterly Accounts.
3.SJIAOS discussion platform
With the release of this issue of the Journal (December 2023), the 18
The readers are invited to react to the statements above but are also free to give their overall opinion on this issue. The discussion will be open around mid-December on the SJIAOS discussion platform (www.officialstatistics.com).
4.Call for papers
For the upcoming issues of the SJIAOS, we are inviting authors to send manuscripts on “Understanding and Assessing the Value of Official Statistics”.
With declining budgets, increasing demands, and a proliferation of alternative players in the arena of statistics, producers of official statistics are under ever more pressure to stake their claim on public funds by proving and even quantifying the value of their products. But recent work under the Conference of European Statisticians suggests that in order to prove that something has value, organisations need to properly understand what value means. Value means different things to different people, necessitating decisions about which needs, and whose needs, we are trying to fulfil, how, and why. Any indicators we use to quantify value must be clearly grounded in the concepts they are supposed to measure.
This shift in perspective calls for an entirely novel approach to understanding the value of official statistics – one that calls for critical self-assessment and wide-ranging consultation, instead of starting out with the assumption that the value of official statistics is a given.
The key questions that authors may want to address are: What is the impact of official statistics on decision-making? How can we measure the use of official statistics in policy processes and investment decisions? What is the development impact of these informed policies and decisions? Conversely, what would be the cost of the lack of essential statistical information? How can the importance of official statistics be communicated to policymakers? How can a business case for justifying an investment in a major statistical operation be built?
Submit your articles to https://www.editorialmanager.com/sji/default2.aspx.
Pietro Gennari
Editor-in-Chief
November 2023
Statistical Journal of the IAOS