Affiliations: Data Science and Innovation Division, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, ON, K1A 0T6, Canada | Tel.: +1 343 542 5625; E-mail: [email protected]
Correspondence: [*] Corresponding author: Data Science and Innovation Division, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, ON, K1A 0T6, Canada. Tel.: +1 343 542 5625; E-mail: [email protected].
Note: [1] This paper was a joint first-place prize-winning submission to the 2023 IAOS Young Statistician Prize. The author is grateful for the support from Statistics Canada, especially the Data Science and Innovation Division and the Census Subject Matter Secretariat.
Abstract: To improve the analysis of respondent comments from the Canadian Census of Population, data scientists at Statistics Canada compared and evaluated traditional machine learning, deep learning and transformer-based techniques. Cross-lingual Language Model-Robustly Optimized Bidirectional Encoder Representations from Transformers (XLM-R), a cross-lingual language model, yielded the best result when fine-tuned on census respondent comments: an overall F1 score of 89.91%, despite language and class imbalances. Following the evaluation, the fine-tuned model was successfully implemented to objectively categorize comments from the 2021 Census of Population with high accuracy. As a result, feedback from respondents was directed to the appropriate subject matter analysts for post-collection analysis.
Keywords: Deep learning, multilingual text classification, machine learning, census respondent comments, natural language processing, young statistician prize 2023