
Predicting the quality and evaluating the use of administrative data for the 2021 Canadian Census of Population

Abstract

This paper presents the statistical contingency plan for the 2021 Canadian Census of Population, developed in response to the COVID-19 pandemic. Under this plan, administrative data were to be used to impute non-responding households in areas with a low response rate and where the administrative data were of sufficient quality. We describe the modeling approach for predicting the quality of data available for administrative households, including important extensions to existing approaches. We also provide a framework for evaluating direct imputation using administrative data, relative to traditional donor imputation, in the absence of a simulation study. We conclude by discussing the evaluation using preliminary data and the subsequent implementation for the 2021 Canadian Census of Population.

1. Introduction

Even before the World Health Organisation declared a global pandemic in March 2020, natural disasters had disrupted or limited Census field operations. In Canada, flooding in 2011 and wildfires in 2016 required Statistics Canada to prepare local contingency plans that used administrative data to compensate for non-response. These events launched a long-term research agenda towards the use of administrative data in a combined census approach.

In 2020, the increased use of administrative data in census collection and research towards a combined census were also under development in other countries [1, 2, 3]. However, the onset of the pandemic in March 2020, and the public health measures that came with it, sharply accelerated research into the potential use of administrative data for the 2021 Canadian Census of Population.

Statistics Canada developed a statistical contingency plan to mitigate a low response rate in the event that the pandemic affected collection. The plan was to use administrative data to impute non-responding households in areas with a low response rate and where the administrative data were of sufficient quality. The impact of the pandemic on the response rate was unknown and, therefore, the use of administrative data was reserved for the processing stages following the traditional collection process. To identify administrative households with good quality data, we adapted the modeling approach used by other statistical agencies, namely the US Census Bureau [3] and Statistics New Zealand [4].

Model development was based on data from the 2016 Census. However, the response rate for the 2016 Canadian Census of Population was a record high for the country (98%) and unlikely to reflect the response mechanisms observed during a pandemic. As part of the contingency plan, we therefore developed a timely but reliable framework to evaluate direct imputation using administrative data, relative to traditional donor imputation, under a variety of response mechanisms. Moreover, this framework allowed us to evaluate the identification of households with good quality data using preliminary data from the 2021 Census during the collection period, and to adjust parameter specifications accordingly.

The remainder of this paper proceeds as follows. In Section 2, we describe the modeling approach for predicting the quality of data available for administrative households. Thereafter, in Section 3, we discuss the model development using 2016 Census data. In Section 4, we present the evaluation using preliminary data and subsequent implementation for the 2021 Census. Conclusions are provided in Section 5.

2. Modeling approach for predicting the quality of administrative households

Census data are essential for a country, as all layers of society use them. In particular, they are often the only source of information for small sub-populations. Producing high quality census data is the objective of any National Statistics Organisation. An integral part of the research on how to incorporate administrative data into a traditional enumeration census is therefore the evaluation of the quality of the administrative data themselves. We use a modeling approach to rank the quality of the available administrative data at the household level. Broadly, this approach is termed the household model and consists of three components: the person-place model, the household composition model and a distance metric.

The basis of the household model is a database of administrative persons, created for the sole purpose of the Census research, composed of multiple sources acquired by Statistics Canada from other government departments. This database includes a variable predicting if the administrative person is in-scope for the Census, the person’s age and sex at birth, all of which are determined using probabilistic models. As well, auxiliary data are available from a variety of administrative data sources such as tax files, immigration files and vital statistics files. Some but not all of these data sources include detailed address information. From these, a list of unique person-address pairs is created. Note that all possible addresses are included in this list and, therefore, a person may have more than one administrative address. Conversely, a person may have no administrative address.

2.1 Person-place model

The first component of the household model, the person-place model, predicts the probability that an administrative person is observed at the correct dwelling. The population of eligible persons consists of the set of persons deemed to be in-scope for the Census with at least one administrative address in the list of person-address pairs. Let

$$
y_{ih}^{PP} =
\begin{cases}
1 & \text{if person } i \text{ is found in administrative records and the 2016 Census at dwelling } h \\
0 & \text{otherwise}
\end{cases}
$$

We model the probability that person $i$ is correctly placed at address $h$, $p_{ih} = P(y_{ih}^{PP} = 1)$, using logistic regression. For each person-address pair, we obtain a person-level estimated probability of coherence, $\hat{p}_{ih}$. If person $i$ has administrative records at more than one dwelling, we assign that person the address with the highest predicted probability, $\max_h \hat{p}_{ih}$. Next, we form administrative households, defined as all persons assigned to a given dwelling. For each dwelling $h$, we define the dwelling-level estimated probability of coherence as

$$
\hat{p}_h^{PP} = \min\left(\hat{p}_{1h}, \ldots, \hat{p}_{n_h h}\right)
$$

where $n_h$ is the size of the administrative household at dwelling $h$. This provides a conservative estimate of the probability that every member of the administrative household is correctly placed at that dwelling.
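The assignment and aggregation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the function name `form_admin_households`, the identifiers and the probabilities are hypothetical, and the person-level probabilities are assumed to have already been estimated by the logistic regression.

```python
from collections import defaultdict


def form_admin_households(pair_probs):
    """Build administrative households from person-address probabilities.

    pair_probs maps (person_id, dwelling_id) -> estimated probability that
    the person is correctly placed at that dwelling (hypothetical inputs).
    """
    # Step 1: keep, for each person, the address with the highest probability.
    best = {}
    for (person, dwelling), p in pair_probs.items():
        if person not in best or p > best[person][1]:
            best[person] = (dwelling, p)

    # Step 2: group persons by their assigned dwelling.
    members_by_dwelling = defaultdict(list)
    for person, (dwelling, p) in best.items():
        members_by_dwelling[dwelling].append((person, p))

    # Step 3: the dwelling-level probability is the minimum over its members.
    return {
        dwelling: {
            "members": sorted(person for person, _ in members),
            "p_pp": min(p for _, p in members),
        }
        for dwelling, members in members_by_dwelling.items()
    }


# Illustrative values only: person A has two candidate addresses.
pair_probs = {
    ("A", "d1"): 0.90,
    ("A", "d2"): 0.40,
    ("B", "d1"): 0.75,
    ("C", "d2"): 0.60,
}
households = form_admin_households(pair_probs)
# Person A is placed at d1, so d1 holds A and B and d2 holds only C.
```

In this example, dwelling d1 receives the minimum of 0.90 and 0.75, so a single weakly placed member lowers the score of the whole household.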

2.2 Household composition model

The household composition model is used to predict the probability that an administrative household matches the household observed in the Census of Population. The household composition model applies to all dwellings with at least one administrative person. The outcome of interest, $Y_h^{HC}$, is categorical and has four levels, called coherence levels. The coherence levels characterize dwellings in terms of the degree to which the administrative household matches the census household at the person level. These levels cover three dimensions of similarity: correct placement of administrative person(s), number of persons and household composition. The household composition indicates the presence of children less than 18 years old and/or the presence of adults 18 years or older. The four coherence levels for the household composition model are detailed in Table 1.

Table 1

Coherence levels for the household composition model

Coherence level | Description
1 | Perfect match – administrative household exactly matches census household.
2 | Partial match (type 1) – At least one administrative person matches the census household, the administrative household count is greater than or equal to the census count and the composition matches.
3 | Partial match (type 2) – At least one administrative person matches the census household, the administrative household count is less than the census count and/or the composition does not match.
4 | Non-match – No administrative person is matched to the census household.
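As an illustration of how the coherence levels in Table 1 could be assigned to a linked dwelling, the sketch below encodes the four rules directly. It rests on several assumptions that are not stated in the text: persons are represented by linked identifiers, an exact match means the two sets of identifiers are identical, and composition is computed from the ages recorded in each source. The function names are hypothetical.

```python
def composition(ages):
    """Return (has_children, has_adults) using the 18-year threshold."""
    ages = list(ages)
    return (any(a < 18 for a in ages), any(a >= 18 for a in ages))


def coherence_level(admin, census):
    """Assign a coherence level (1 to 4) to one dwelling.

    admin and census map person_id -> age for the administrative and the
    census household (hypothetical representation of linked records).
    """
    matched = set(admin) & set(census)
    if not matched:
        return 4  # non-match: no administrative person found in the census
    if set(admin) == set(census):
        return 1  # perfect match: same persons in both households
    same_composition = composition(admin.values()) == composition(census.values())
    if len(admin) >= len(census) and same_composition:
        return 2  # partial match (type 1)
    return 3  # partial match (type 2)


# Illustrative census household: two adults and one child.
census = {"A": 40, "B": 38, "C": 10}
levels = [
    coherence_level({"A": 40, "B": 38, "C": 10}, census),          # 1
    coherence_level({"A": 40, "B": 38, "C": 10, "D": 7}, census),  # 2
    coherence_level({"A": 40, "B": 38}, census),                   # 3
    coherence_level({"X": 30}, census),                            # 4
]
```

The third case falls to level 3 both because the administrative count is lower than the census count and because the child is missing from the administrative household.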

We model the probability that dwelling h belongs to each coherence level using multinomial logistic regression. In particular, the non-match coherence level is used as the baseline category and we specify three independent binary logistic regression models:

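$$
\left\{
\begin{aligned}
\log \frac{P(Y_h^{HC} = 1)}{P(Y_h^{HC} = 4)} &= \beta_1 \boldsymbol{X_h} \\
\log \frac{P(Y_h^{HC} = 2)}{P(Y_h^{HC} = 4)} &= \beta_2 \boldsymbol{X_h} \\
\log \frac{P(Y_h^{HC} = 3)}{P(Y_h^{HC} = 4)} &= \beta_3 \boldsymbol{X_h}
\end{aligned}
\right.
$$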

This yields three sets of estimated regression coefficients. The primary estimate of interest is the probability of a perfect match, which we calculate as:

$$
\hat{p}_h^{HC} = \frac{e^{\hat{\beta}_1 \boldsymbol{X_h}}}{1 + \sum_{k=1}^{3} e^{\hat{\beta}_k \boldsymbol{X_h}}}
$$
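To make the calculation concrete, the following sketch turns three sets of coefficients into the four coherence-level probabilities, with level 4 as the baseline category. The coefficient values and the covariate vector are hypothetical and chosen only for illustration; in practice they would come from the fitted multinomial model.

```python
import math


def linear_predictor(beta, x):
    """Dot product of one coefficient vector with the covariate vector."""
    return sum(b * v for b, v in zip(beta, x))


def coherence_probabilities(betas, x):
    """Return [P(level 1), P(level 2), P(level 3), P(level 4)].

    betas holds the three coefficient vectors for levels 1 to 3; level 4 is
    the baseline, whose linear predictor is fixed at zero.
    """
    exp_etas = [math.exp(linear_predictor(beta, x)) for beta in betas]
    denom = 1.0 + sum(exp_etas)
    return [e / denom for e in exp_etas] + [1.0 / denom]


# Hypothetical coefficients (intercept, one covariate) and covariates.
betas = [[0.5, 1.2], [0.1, 0.4], [-0.3, 0.2]]
x_h = [1.0, 0.0]

probs = coherence_probabilities(betas, x_h)
p_hc = probs[0]  # estimated probability of a perfect match
```

Because the baseline term contributes the 1 in the denominator, the four probabilities always sum to one.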

Note that this specification of the household composition model differs from that proposed by [3] to identify households with good quality administrative data. That earlier approach defined a household composition match based on the number of adults and children and did not consider the person-level links.

2.3 Distance metric

Ideally, we want to accurately identify dwellings where high quality administrative data is available for every household member. This corresponds to a perfect match under the household composition model. However, a limitation of the household composition model is that the proportion of true perfect matches is overestimated. In order to address this limitation, we use a distance metric which incorporates both the estimated probability of a perfect match from the household composition model and the dwelling-level estimated probability of coherence from the person-place model into one measure of quality for dwelling-level administrative data.

We use an extension of the Euclidean distance-based metric initially proposed by [5], with a penalty for administrative households of size 1. This penalty was introduced because preliminary analyses indicated that single-person households were overrepresented among the dwellings predicted to be of high quality. The distance metric for dwelling $h$ is defined as:

$$
d_h = \sqrt{\left(1 - \hat{p}_h^{PP}\right)^2 + \left(1 - \left(\hat{p}_h^{HC}\right)^{e_h}\right)^2}
$$

where $\hat{p}_h^{PP}$ is the minimum estimated probability from the person-place model for all persons placed at dwelling $h$, $\hat{p}_h^{HC}$ is the estimated probability that dwelling $h$
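The sketch below shows how a distance of this form could be computed and used to rank dwellings, with smaller distances indicating higher predicted quality. Several elements are assumptions made for illustration only: the exponent is set to 1 for households of two or more persons and to a hypothetical value of 2 for households of size 1, the square root follows the usual Euclidean form (it does not affect the ranking), and the probabilities are invented.

```python
import math


def quality_distance(p_pp, p_hc, e_h=1.0):
    """Distance from the ideal point (1, 1); smaller means higher quality.

    p_pp: dwelling-level probability from the person-place model.
    p_hc: perfect-match probability from the household composition model.
    e_h:  exponent applied to p_hc (assumed; values above 1 act as a penalty).
    """
    return math.sqrt((1.0 - p_pp) ** 2 + (1.0 - p_hc ** e_h) ** 2)


# Hypothetical penalty exponent for administrative households of size 1.
PENALTY_EXPONENT = 2.0

dwellings = [
    {"id": "d1", "p_pp": 0.90, "p_hc": 0.80, "size": 3},
    {"id": "d2", "p_pp": 0.90, "p_hc": 0.80, "size": 1},
    {"id": "d3", "p_pp": 0.60, "p_hc": 0.95, "size": 2},
]
for d in dwellings:
    e_h = PENALTY_EXPONENT if d["size"] == 1 else 1.0
    d["distance"] = quality_distance(d["p_pp"], d["p_hc"], e_h)

# Rank dwellings from best (closest to the ideal point) to worst.
ranked = sorted(dwellings, key=lambda d: d["distance"])
```

Dwellings d1 and d2 have identical probabilities, but the exponent pushes the single-person household d2 further from the ideal point, so it ranks below d1.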

Acknowledgments

The author thanks Karelyn Davis, Arthur Goussanou and Thomas Yoon for their many contributions to the household model project. She also thanks Michelle Simard for her support of this work and for her constructive comments and suggestions.

References

[1] Blackwell L, Charlesworth A, Rogers NJ. Linkage of census and administrative data to quality assure the 2011 census for England and Wales. Journal of Official Statistics. 2015; 31(3): 453–73.

[2] Bycroft C. Census transformation in New Zealand: Using administrative data without a population register. Statistical Journal of the IAOS. 2015; 31(3): 401–11.

[3] Morris DS, Keller A, Clark B. An approach for using administrative records to reduce contacts in the 2020 Decennial Census. Statistical Journal of the IAOS. 2016; 32(2): 177–88.

[4] Bycroft C, Matheson-Dunning N. Use of administrative records for non-response in the New Zealand 2018 Census. Statistical Journal of the IAOS. 2020; 36(1): 107–16.

[5] Keller A, Mule VT, Morris DS, Konicki S. A distance metric for modeling the quality of administrative records for use in the 2020 US Census. Journal of Official Statistics. 2018; 34(3): 599–624.

[6] Morris DS. A modeling approach for administrative record enumeration in the decennial census. Public Opinion Quarterly. 2017; 81(S1): 357–84.

[7] Statistics Canada. Guide to the Census of Population, 2021 [Internet]. Ottawa (CA): Statistics Canada; 2022 [updated 2022 Feb 8; cited 2022 Sept 25]. Available from: https://www12.statcan.gc.ca/census-recensement/2021/ref/98-304/2021001/app-ann1-7-eng.cfm