
Quality measures for multisource statistics

Abstract

The ESSnet on Quality of Multisource Statistics is part of the ESS.VIP Admin Project. The main objectives of the latter project are (i) to improve the use of administrative data sources and (ii) to support the quality assurance of the output produced using administrative sources. The ultimate aim of the ESSnet is to produce quality guidelines for National Statistical Institutes (NSIs) that are specific enough to be used in statistical production at those NSIs. The guidelines aim to cover the diversity of situations in which NSIs work as well as restrictions on data availability. The guidelines will list a variety of potential measures, indicate for each measure its applicability and the situations in which it is or is not preferred, and provide an ample set of examples of specific cases and decision-making processes. Work Package 3 (WP 3) of the ESSnet focuses on developing and testing quantitative measures for the quality of output based on multiple data sources and on methods to compute such measures. In particular, WP 3 focuses on non-sampling errors. Well-known examples of such quality measures are the bias and variance of the estimated output. Methods for computing these and other quality measures often depend on the specific data sources. Therefore, we have identified several basic data configurations for the use of administrative data sources in combination with other sources, for which we propose, revise and test quantitative measures for the accuracy and coherence of the output. In this article we discuss the identified basic data configurations and the approach taken in WP 3, and give some examples of quality measures and methods to compute those measures. We also point out some topics for future work.

1. Introduction: A short fairy tale

Once upon a time, long ago, in a country far, far away, there lived some good and honest people. These good people were struggling to produce single-source statistics, and did that to the best of their abilities. Still, they were always hoping for more data. They said to each other: “See, if only we had more data, we could have produced more and better statistics.” Their biggest wish in life was to produce multisource statistics. One day a good fairy or wicked witch – the story is not too specific about this – showed up and told them that she saw how hard the good people had to work in order to produce their single-source statistics, and that she had decided to fulfill their biggest wish and give them all the data they ever wanted and more than that. The good people were very excited and said: “Thank you, dear fairy.” The people immediately started to work on their multisource statistics and were very happy at first. Soon they realized, however, that now they had more problems and had to work harder than ever before. Suddenly they could see measurement errors in their data, and linkage errors, and coverage errors, and new estimation problems, and so on and so on. They were even starting to think that perhaps it was a wicked witch after all who had given them all these data. Nevertheless, they worked hard, even harder and still even harder, and finally managed to produce multisource statistics. Proudly, they went to their king to show him their results. The king, a good and wise man, was impressed by the work they had done. However, after he had expressed his admiration he then asked the question: “And what can you tell me about the quality of your statistics?” That is where this fairy tale ends for now, and the ESSnet on Quality of Multisource Statistics starts.

The ESSnet on Quality of Multisource Statistics (also referred to as Komuso) is part of the ESS.VIP Admin Project. The main objectives of the ESS.VIP Admin Project are (i) to improve the use of administrative data sources and (ii) to support the quality assurance of the output produced using administrative sources.

Partners in Komuso are Statistics Denmark (overall project leader of the ESSnet), Statistics Norway, ISTAT (the Italian national statistical institute), Statistics Lithuania, Statistics Austria, the Hungarian Central Statistical Office, the Central Statistical Office of Ireland, and Statistics Netherlands.

The first Specific Grant Agreement (SGA) of the ESSnet lasted from January 2016 until April 2017. At the time of writing this article we are finalizing SGA 2. This second SGA started in May 2017 and lasts until mid-October 2018. A potential SGA 3 might start at the end of 2018. In the first and second SGAs of Komuso, Work Package 3 (WP 3) focused on measuring the quality of statistical output based on multiple data sources. Measuring the quality of statistical output differs fundamentally from measuring the quality of input data, since one ideally wants to take into account all processing and estimation steps that were taken to achieve the output. Statistics Netherlands was project leader of this work package in both SGA 1 and SGA 2.

The problem encountered in WP 3 is not so much in defining the quality measures that one would like to use. For instance, with respect to the quality dimension “accuracy” most National Statistical Institutes (NSIs) would like to use bias, variances and/or mean squared errors of their estimates as quality measures. The main problem is rather how these quality measures should be computed for a given set of input data sets and a certain procedure for combining these input data sets. In other words, the main problem is describing a recipe for calculating quality measures for a given multisource situation.

This problem is much more complicated for multisource statistics than for single source statistics. We give two reasons for this. The first reason is that the procedure for combining the various input data sets has to be accounted for in the quality measure(s). Such procedures may involve many different processing steps and can be very complicated.

The second reason is that, due to the abundance of data, errors become much more visible. For example, when one has two data sets with (supposedly) the same variable, the values of this variable in the different data sets may differ due to measurement errors. A correction procedure for measurement error is then highly desired and should be accounted for in the quality measures for output based on these data sets. Another example is when one has two data sets that are supposed to cover the same target population. In such a case one will often notice that they actually do not. That is, coverage problems can become visible simply by comparing units in different data sets. Also, linkage problems often occur when trying to link units in different data sets. Again, correcting for coverage and linkage errors is highly desired and should be accounted for in the quality measures for output based on the involved data sets. In contrast, in single-source statistics, one often focusses on sampling errors only, since due to the lack of data other kinds of errors are hard or even impossible to detect and correct.

In this article we discuss WP 3 of Komuso and some of the results obtained. Section 2 describes the approach taken in WP 3. Section 3 gives some examples of quality measures and methods to compute them. All examples are based on work (partly) done by Statistics Netherlands. Section 4 concludes the article with a brief discussion.

2. Approach taken in WP 3

In WP 3 we have subdivided the work into three consecutive steps:

  • 1. We carry out a literature review or suitability test. In a literature review we study and describe existing quality measures and recipes to compute them. In a suitability test we go a step further and also test quality measures and recipes to compute them, either already known ones or newly proposed ones. In such a suitability test we examine practical and theoretical aspects of a quality measure and the accompanying calculation recipe.

  • 2. We produce so-called Quality Measures and Computation Methods (QMCMs). Such a QMCM is a short description of a quality measure and the accompanying calculation recipe as well as a description of the situation(s) in which the quality measure and accompanying recipe can be applied.

  • 3. We provide hands-on examples for some of the QMCMs.

In SGA 1 of Komuso, the focus in WP 3 was on carrying out literature reviews and suitability tests for the quality dimension “accuracy” (principle 12 in the European Statistics Code of Practice, see [1]). In SGA 2 the focus is on producing QMCMs and a selected number of hands-on examples for the literature reviews and suitability tests from SGA 1, and on carrying out suitability tests for the quality dimension “coherence” (principle 14 in the European Statistics Code of Practice, see [1]). In a potential SGA 3 we hope to produce more examples and produce some QMCMs for “coherence” and other quality dimensions, such as “reliability”.

WP 3 is strongly related to WP 1 of Komuso in SGA 2 (and probably also in a potential SGA 3). In WP 1, quality guidelines for multisource statistics are produced. The QMCMs and hands-on examples thereof produced by WP 3 will form an annex to these quality guidelines for multisource statistics.

Many different situations can arise when multiple data sets are used to produce statistical output, depending on the nature of the data sets and on the kind of output produced. In order to structure the work within WP 3 we use a breakdown into a number of Basic Data Configurations (BDCs) that are most commonly encountered in practice at NSIs. The aim of the BDCs is to provide a useful focus and direction for the work to be carried out. In Komuso we have identified 6 BDCs:

  • BDC 1: Multiple non-overlapping cross-sectional microdata sets that together provide a complete data set without any under-coverage problems;

  • BDC 2: Same as BDC 1, but with overlapping variables and units between different data sets;

  • BDC 3: Same as BDC 2, but now with under-coverage of the target population;

  • BDC 4: Microdata and aggregated data that need to be reconciled with each other;

  • BDC 5: Only aggregated data that need to be reconciled;

  • BDC 6: Longitudinal data sets that need to be reconciled over time.

BDC 1 can be subdivided into two cases: the split-variable case where the data sets contain different variables for the same units and the split-population case where the data sets contain the same variables for different units. For more information on BDCs and methods to produce multi-source statistics we refer to [2].

3. Examples of QMCMs

In total 23 QMCMs are planned to be produced for WP 3 in SGA 2. The vast majority of the QMCMs relate to BDC 2 (“overlapping variables and units between different data sets”), which appears to be the most common and most important situation with respect to multisource statistics at NSIs. For some other BDCs, such as BDCs 3 and 6, we will produce only a few QMCMs; for instance, in the case of BDC 3 we will produce only one QMCM. The reasons for producing few QMCMs for BDC 3 and BDC 6 are either that the situation for multisource statistics is similar to the situation for single-source statistics (BDC 3) or, conversely, that NSIs have only very limited experience with the estimation of output quality in a multisource context (BDC 6).

Since a complete description of the work done in WP 3 is impossible given the limited length of this article, we limit ourselves to giving some examples of QMCMs.

3.1 Mean squared error of level estimates affected by classification errors – BDC 1 “multiple non-overlapping cross-sectional microdata sets”

In this example we assume that we have several non-overlapping data sets that together cover the entire target population and that the only source of errors is classification errors. In this section we assume that the data concern businesses, which are classified by industry code (main economic activity). The unobserved true industry code of unit $i$ is denoted by $s_i$; the observed industry code, which is prone to errors, is denoted by $\hat{s}_i$. The set of possible industry codes is denoted by $\mathcal{S}$.

Let $\theta=f(y_1,\ldots,y_N,s_1,\ldots,s_N)$ denote a target parameter, where $y_i$ stands for the value of a target variable for unit $i$. Based on the observed data, this parameter is estimated by $\hat{\theta}=f(y_1,\ldots,y_N,\hat{s}_1,\ldots,\hat{s}_N)$. We are interested in estimating the mean squared error of $\hat{\theta}$ as affected by classification errors. We assume that the values of the target variable, $y_1,\ldots,y_N$, are error-free.

The authors of [3] assume that a business $i$ with true industry code $g$ is classified as falling into class $h$ with probability $p_{ghi}$ due to classification error. They propose the following method for computing the mean squared error of $\hat{\theta}$. The first step is to estimate the probabilities $p_{ghi}$. This requires an independent collection of data on the classification variable, for which both observed and cleaned versions are needed. Options to obtain such data are:

  • draw a specific audit sample that is cleaned from errors;

  • use information from the editing step in the regular process of producing statistics where units are verified on classification errors;

  • the classification variable may also be present in a central business register or in a central population register. A regular quality assessment procedure of the register may then be used.

Together the estimated probabilities $\hat{p}_{ghi}$ form a transition matrix $\hat{\mathbf{P}}_i=(\hat{p}_{ghi})$. The transition matrix, per unit, can be modelled as a function of background variables; see [3] for an example.

The second step is to estimate the bias and variance of $\hat{\theta}$ by drawing bootstrap samples from $\hat{\mathbf{P}}_i$. For each unit we draw a new value for the industry code, given the original observed industry code $\hat{s}_i$, according to $\hat{\mathbf{P}}_i$. Based on the results of this draw, denoted by $\hat{s}_i^{(r)}$, we compute $\hat{\theta}^{(r)}=f(y_1,\ldots,y_N,\hat{s}_1^{(r)},\ldots,\hat{s}_N^{(r)})$. We repeat this procedure $R$ times, for $r=1,\ldots,R$, and use the set of outcomes $\hat{\theta}^{(r)}$ to compute estimates of the bias and variance of $\hat{\theta}$:

$$\hat{B}_R(\hat{\theta})=m_R(\hat{\theta})-\hat{\theta},$$
$$\widehat{\mathit{Var}}_R(\hat{\theta})=\frac{1}{R-1}\sum_{r=1}^{R}\left\{\hat{\theta}^{(r)}-m_R(\hat{\theta})\right\}^2,$$

with $m_R(\hat{\theta})=\frac{1}{R}\sum_{r=1}^{R}\hat{\theta}^{(r)}$.
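
To make the bootstrap recipe concrete, the following minimal Python sketch redraws industry codes from an assumed transition matrix and computes the bias and variance estimates defined above. It is not taken from [3]: the data, the single shared transition matrix and the target parameter (a stratum total) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical toy data: N units, 3 possible industry codes (0, 1, 2).
N, n_codes = 1000, 3
y = rng.gamma(shape=2.0, scale=50.0, size=N)     # error-free target variable
s_obs = rng.integers(0, n_codes, size=N)         # observed (error-prone) codes

# Assumed transition matrix, identical for all units:
# P_hat[g, h] = estimated probability that code g is recorded as code h.
P_hat = np.array([[0.90, 0.07, 0.03],
                  [0.05, 0.92, 0.03],
                  [0.04, 0.06, 0.90]])

def theta(codes):
    """Target parameter: total of y over the units classified in code 0."""
    return y[codes == 0].sum()

theta_hat = theta(s_obs)

# Bootstrap: redraw a code for every unit, given its observed code,
# and recompute the target parameter for each replicate.
R = 2000
theta_rep = np.empty(R)
for r in range(R):
    s_r = np.empty(N, dtype=int)
    for g in range(n_codes):
        idx = np.flatnonzero(s_obs == g)
        s_r[idx] = rng.choice(n_codes, size=idx.size, p=P_hat[g])
    theta_rep[r] = theta(s_r)

bias_hat = theta_rep.mean() - theta_hat          # estimated bias B_R
var_hat = theta_rep.var(ddof=1)                  # estimated variance Var_R
print(f"theta_hat = {theta_hat:.1f}, bias = {bias_hat:.1f}, variance = {var_hat:.1f}")
```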

When the target parameter $\theta$ concerns a vector of stratum totals (level estimates), one can also estimate the bias and the variance-covariance matrix through analytical formulae, corresponding to the limit $R\to\infty$ (cf. [3]):

$$\hat{B}^{*}(\hat{\boldsymbol{y}})=\sum_{i=1}^{N}(\hat{\mathbf{P}}_i^{t}-\mathbf{I})\hat{\boldsymbol{a}}_i y_i, \qquad (1)$$
$$\widehat{\mathit{Var}}^{*}(\hat{\boldsymbol{y}})=\sum_{i=1}^{N}\left\{\mathrm{diag}(\hat{\mathbf{P}}_i^{t}\hat{\boldsymbol{a}}_i y_i^{2})-\hat{\mathbf{P}}_i^{t}\,\mathrm{diag}(\hat{\boldsymbol{a}}_i y_i^{2})\,\hat{\mathbf{P}}_i\right\}$$

where the superscript $t$ indicates transposition, the superscript ‘*’ indicates the analytical expression, $\mathbf{I}$ is an identity matrix, $\hat{\boldsymbol{y}}=\sum_{i=1}^{N}\hat{\boldsymbol{a}}_i y_i$ stands for the vector of estimated stratum totals, and $\hat{\boldsymbol{a}}_i=(\hat{a}_{1i},\ldots,\hat{a}_{|\mathcal{S}|i})^{t}$ is a vector of dummy variables that describes in which stratum unit $i$ is observed ($\hat{a}_{hi}=1$ if $\hat{s}_i=h$ and $\hat{a}_{hi}=0$ otherwise). In particular, it follows that the bias of the estimated total in stratum $h$, $\hat{Y}_h=\sum_{i=1}^{N}\hat{a}_{hi}y_i$, can be estimated by

$$\hat{B}^{*}(\hat{Y}_h)=\sum_{i=1}^{N}\Big\{(\hat{p}_{hhi}-1)\hat{a}_{hi}y_i+\sum_{g\neq h}\hat{p}_{ghi}\hat{a}_{gi}y_i\Big\}$$

and its variance can be estimated by

$$\widehat{\mathit{Var}}^{*}(\hat{Y}_h)=\sum_{i=1}^{N}\sum_{g}\hat{p}_{ghi}(1-\hat{p}_{ghi})\hat{a}_{gi}y_i^{2}$$
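
The analytical expressions translate directly into code. The sketch below is a hypothetical illustration that assumes one transition matrix for all units and accumulates the estimated bias vector and variance-covariance matrix of the stratum totals according to Eq. (1) and the variance expression above.

```python
import numpy as np

def analytic_bias_var(P_hat, s_obs, y):
    """Bias vector and variance-covariance matrix of the estimated stratum
    totals, following Eq. (1) and the variance expression above.
    P_hat[g, h] : estimated probability that true code g is observed as h
                  (assumed identical for all units in this sketch).
    s_obs[i]    : observed stratum of unit i;  y[i] : its error-free value."""
    H = P_hat.shape[0]
    bias = np.zeros(H)
    var = np.zeros((H, H))
    eye = np.eye(H)
    for s_i, y_i in zip(s_obs, y):
        a_i = np.zeros(H)
        a_i[s_i] = 1.0                               # dummy vector a_i
        bias += (P_hat.T - eye) @ a_i * y_i
        var += np.diag(P_hat.T @ (a_i * y_i**2)) \
               - P_hat.T @ np.diag(a_i * y_i**2) @ P_hat
    return bias, var

# Tiny hypothetical example with two strata:
P_hat = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
s_obs = np.array([0, 0, 1, 1, 0])
y = np.array([10.0, 20.0, 15.0, 5.0, 30.0])
bias, var = analytic_bias_var(P_hat, s_obs, y)
print("estimated bias:", bias)
print("estimated variance-covariance matrix:\n", var)
```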

[4] developed a method to estimate the quarterly turnover of Dutch consumers at European webshops. To that end, they use tax returns filed in the Netherlands, which include sales of goods and services by European webshops to Dutch customers. From the tax data, quarterly turnover figures can be derived for companies that exceed a certain turnover threshold; smaller companies are not considered. The challenge of the methodology is to identify which of the records in those tax data concern European webshops. These records are identified using a binary classifier, based on machine learning, which separates the companies into those that are European webshops and those that are not. The predictions of the classifier are not error-free: it makes classification errors. In this example we explain how the uncertainty of the estimated quarterly turnover due to those classification errors is estimated.

[4] constructed a training data set of 180 companies and a test set of 79 companies by manually classifying them. Within the training data set, 76 webshops were identified and within the test set of 79 companies 13 webshops were identified.

Using the training set, the parameters of the machine-learning algorithms were estimated. These trained algorithms were used to predict the scores $\hat{s}_i$ for the units in the test set, which were compared with the true scores $s_i$. For the final algorithm, the result is shown in the transition matrix below, with $s_i$ indexing the rows and $\hat{s}_i$ the columns. The top row concerns having a webshop, the bottom row having no webshop.

$$\hat{\mathbf{P}}=\begin{pmatrix}8/13 & 5/13\\ 4/66 & 62/66\end{pmatrix}$$

[4] assumed that this transition matrix is the same for all units i in the population.

Next, the machine-learning model was used to predict $\hat{s}_i$ for the units not in the training set. Denote by $\boldsymbol{a}_i$ the 2-vector $[\mathit{Ind}(s_i=1),\mathit{Ind}(s_i=0)]^{t}$, where $\mathit{Ind}$ stands for an indicator variable. The authors of [4] were interested in the aggregate turnover $\boldsymbol{y}=\sum_{i\in C}\boldsymbol{a}_i y_i$, where $C$ stands for the population of companies. The turnover in the first class, i.e. with $\mathit{Ind}(s_i=1)$, is the target parameter. For the units in the training set, denoted by $C_M$, the values of the vector $\boldsymbol{a}_i$ are determined manually and considered to be error-free. For the remaining units, $\boldsymbol{a}_i$ is predicted by the machine-learning algorithm. The total aggregated turnover is estimated by

$$\hat{\boldsymbol{y}}=\boldsymbol{y}_M+\hat{\boldsymbol{y}}_A-\hat{B}^{*}(\hat{\boldsymbol{y}}_A)=\boldsymbol{y}_M+(2\mathbf{I}-\hat{\mathbf{P}}^{t})\hat{\boldsymbol{y}}_A=\sum_{i\in C_M}\boldsymbol{a}_i y_i+(2\mathbf{I}-\hat{\mathbf{P}}^{t})\sum_{i\in C\setminus C_M}\hat{\boldsymbol{a}}_i y_i,$$

where $\boldsymbol{y}_M$ stands for the turnover of the units in the training set, which are checked manually, $\boldsymbol{y}_A$ stands for the true turnover of the units not in the training set but classified by the algorithm, $\hat{\boldsymbol{y}}_A$ stands for the estimate of $\boldsymbol{y}_A$, and $\hat{B}^{*}(\hat{\boldsymbol{y}}_A)$ is the estimated bias of $\hat{\boldsymbol{y}}_A$ [cf. expression (1)]. The variance of the final turnover estimate $\hat{\boldsymbol{y}}$ is estimated in [4] by $(2\mathbf{I}-\hat{\mathbf{P}}^{t})\widehat{\mathit{Var}}^{*}(\hat{\boldsymbol{y}}_A)(2\mathbf{I}-\hat{\mathbf{P}}^{t})^{t}$, with $\widehat{\mathit{Var}}^{*}(\hat{\boldsymbol{y}}_A)$ given by the variance formula below Eq. (1). This variance estimate ignores the additional uncertainty due to estimating $\mathbf{P}$. Note that $\boldsymbol{y}_M$ does not contribute to the variance.
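
As an illustration of how these expressions are evaluated, the following sketch applies them to the estimated transition matrix above. The unit-level data are simulated stand-ins, since the actual microdata used in [4] are not available here.

```python
import numpy as np

# Estimated transition matrix from the test set:
# rows = true class (webshop, no webshop), columns = predicted class.
P_hat = np.array([[8/13, 5/13],
                  [4/66, 62/66]])
I2 = np.eye(2)

# Simulated stand-in for the algorithm-classified units: predicted class
# (0 = webshop, 1 = no webshop) and quarterly turnover in millions of euros.
rng = np.random.default_rng(seed=2)
n = 5000
pred = rng.integers(0, 2, size=n)
turnover = rng.gamma(shape=1.5, scale=0.2, size=n)

# Turnover of the manually checked (training) units, taken as error-free.
y_M = np.array([400.0, 2500.0])                  # [webshop, no webshop]

# Aggregate the algorithm-classified units into y_hat_A and accumulate the
# analytical variance of that vector (same transition matrix for all units).
y_hat_A = np.zeros(2)
var_A = np.zeros((2, 2))
for c, t in zip(pred, turnover):
    a = np.zeros(2)
    a[c] = 1.0
    y_hat_A += a * t
    var_A += np.diag(P_hat.T @ (a * t**2)) - P_hat.T @ np.diag(a * t**2) @ P_hat

# Bias-corrected totals and their variance; y_M does not add to the variance.
correction = 2 * I2 - P_hat.T
y_hat = y_M + correction @ y_hat_A
var_y_hat = correction @ var_A @ correction.T
print("estimated totals:", np.round(y_hat, 1))
print("standard errors :", np.round(np.sqrt(np.diag(var_y_hat)), 1))
```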

The final results for companies estimated to be European retailers with a webshop can be found in Table 1 (taken from [4]). The manually checked turnover of European webshops, $y_{M1}$, accounts for about half of the total estimated turnover $\hat{y}_1$. The estimated bias of $\hat{y}_{A1}$ varied considerably over the years, with the largest value in 2014. The estimated standard deviation of $\hat{y}_1$ suggests that the margin around the final estimate is still rather large: if we assume a normal distribution, the 95% confidence interval around $\hat{y}_1$ in 2014 would be 837 ± 190, which corresponds to a relative margin of 22%.

Table 1

Final results in millions of euros

Year    $y_{M1}$    $\hat{y}_{A1}$    $\hat{y}_1$    $\hat{B}^{*}(\hat{y}_{A1})$    $\widehat{Sd}(\hat{y}_1)$
2014    405         495               837            63                             97
2015    565         586               1,132          21                             101
2016    725         667               1,372          19                             110

[5, 6] propose an extension of the approach in this section in order to quantify the effect of classification errors on the accuracy of growth rates in business statistics, rather than the effect on the accuracy of level estimates.

3.2 Variance of estimates based on reconciled microdata containing measurement errors – BDC 2 “overlapping variables and units”

The situation in this example is that a categorical target value is measured for each individual unit (with measurement error) in several data sources. We assume that a Latent Class (LC) model is used to estimate the true values of this target variable. The quality of estimates based on the reconciled microdata is then measured by the estimated variance of these estimates.

Let $\boldsymbol{Y}=(Y_1,Y_2,\ldots,Y_s)^{t}$ denote a vector of observed categorical variables that measure the same conceptual variable of interest (for instance, in $s$ different data sources). The true value with respect to the variable of interest is represented by a latent class variable $X$. We assume for convenience that all variables $Y_j$ and $X$ have the same set of categories, say $\{1,\ldots,L\}$. The marginal probability $\Pr(\boldsymbol{Y}=\boldsymbol{y})$ of observing a particular vector of values $\boldsymbol{y}=(y_1,y_2,\ldots,y_s)^{t}$ can be expressed as the sum of the joint probabilities

$$\Pr(X=x,\boldsymbol{Y}=\boldsymbol{y})=\Pr(X=x)\Pr(\boldsymbol{Y}=\boldsymbol{y}\mid X=x)$$

over all possible latent classes:

$$\Pr(\boldsymbol{Y}=\boldsymbol{y})=\sum_{x=1}^{L}\Pr(X=x)\Pr(\boldsymbol{Y}=\boldsymbol{y}\mid X=x) \qquad (2)$$

A common assumption in LC analysis is that the classification errors in different observed variables are conditionally independent, given the true value (local independence), i.e.

$$\Pr(\boldsymbol{Y}=\boldsymbol{y}\mid X=x)=\Pr(Y_1=y_1\mid X=x)\Pr(Y_2=y_2\mid X=x)\cdots\Pr(Y_s=y_s\mid X=x).$$

In combination with Eq. (2), this leads to the basic LC model

$$\Pr(\boldsymbol{Y}=\boldsymbol{y})=\sum_{x=1}^{L}\Pr(X=x)\prod_{j=1}^{s}\Pr(Y_j=y_j\mid X=x)$$

Estimating the LC model amounts to estimating the probabilities in this expression. Probabilities of the form $\Pr(Y_j=y_j\mid X=x)$ provide information about classification errors in the observed data. For example, units that in reality belong to the first category of the target variable ($X=1$) are misclassified on observed variable $Y_j$ with probability $\Pr(Y_j\neq 1\mid X=1)=1-\Pr(Y_j=1\mid X=1)$. This probability can be seen as a measure of the quality of the $j$-th observed variable.

The model can also be used to estimate, for each unit in the data, the probability of belonging to a particular latent class, given its vector of observed values. Using Bayes’ rule, it follows that:

$$\Pr(X=x\mid\boldsymbol{Y}=\boldsymbol{y})=\frac{\Pr(X=x)\prod_{j=1}^{s}\Pr(Y_j=y_j\mid X=x)}{\sum_{x'=1}^{L}\Pr(X=x')\prod_{j=1}^{s}\Pr(Y_j=y_j\mid X=x')} \qquad (3)$$
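
A minimal sketch of how Eq. (3) is evaluated is shown below; the prior and conditional probabilities are hypothetical stand-ins for an estimated LC model with two latent classes and three binary indicators.

```python
import numpy as np

# Hypothetical estimated LC model: L = 2 latent classes, s = 3 indicators,
# each indicator taking values 0 or 1.
prior = np.array([0.3, 0.7])                       # Pr(X = x)
# cond[j, x, y] = Pr(Y_j = y | X = x)
cond = np.array([[[0.95, 0.05], [0.10, 0.90]],
                 [[0.90, 0.10], [0.05, 0.95]],
                 [[0.85, 0.15], [0.20, 0.80]]])

def posterior(y_vec):
    """Posterior membership probabilities Pr(X = x | Y = y), cf. Eq. (3)."""
    lik = np.prod([cond[j, :, y_j] for j, y_j in enumerate(y_vec)], axis=0)
    joint = prior * lik
    return joint / joint.sum()

print(posterior([1, 1, 0]))    # e.g. for the observed pattern (1, 1, 0)
```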

Edit restrictions, for instance the edit restriction that someone who receives rent benefit cannot be a home owner, can be imposed by setting certain conditional probabilities equal to zero; for instance:

$$\Pr(X=\text{owner}\mid Y=\text{rent benefit})=0$$

The so-called MILC method (see [7]) takes measurement errors into account by combining Multiple Imputation (MI) and LC analysis. The method starts with linking all data sets on the unit level, and then proceeds with 5 steps.

  • 1. Select m bootstrap samples from the original combined data set.

  • 2. Create an LC model for every bootstrap sample.

  • 3. Multiply impute the latent “true” variable $X$ for each bootstrap sample. That is, create $m$ empty variables $(W_1,\ldots,W_m)$ and impute them by drawing one of the categories using the estimated posterior membership probabilities, cf. Eq. (3), from the corresponding LC model.

  • 4. Obtain estimates of interest from each data set with imputed variables.

  • 5. Pool the estimates using Rubin’s pooling rules for multiple imputation (see [8]). An essential aspect of these pooling rules is that an estimated variance of the pooled estimate is obtained. This estimated variance is a quality measure for the reconciled data.
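
Of these steps, only step 5 involves a calculation beyond standard LC estimation; a minimal sketch of Rubin's pooling rules, with hypothetical inputs rather than actual MILC output, is shown below.

```python
import numpy as np

def rubin_pool(estimates, within_variances):
    """Pool m point estimates and their within-imputation variances
    (one pair per imputed data set) using Rubin's rules [8]."""
    q = np.asarray(estimates, dtype=float)
    w = np.asarray(within_variances, dtype=float)
    m = q.size
    q_bar = q.mean()                         # pooled point estimate
    w_bar = w.mean()                         # average within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    total_var = w_bar + (1.0 + 1.0 / m) * b  # total variance of q_bar
    return q_bar, total_var

# Hypothetical example: a proportion estimated from m = 5 imputed data sets.
estimates = [0.612, 0.605, 0.619, 0.608, 0.615]
variances = [0.00040, 0.00042, 0.00045, 0.00041, 0.00043]
q_bar, total_var = rubin_pool(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, estimated variance = {total_var:.5f}")
```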

[7] applied the MILC method on a combined data set to measure home ownership. This combined data set consisted of data from the LISS (Longitudinal Internet Studies for the Social sciences) panel from 2013 and a register from Statistics Netherlands from 2013. From this combined data set, they used two variables indicating whether a person is a home-owner or rents a home as indicators for the imputed “true” latent variable home-owner/renter (or other). The combined data set also contained a variable measuring whether someone receives rent benefit from the government. A person can only receive rent benefit if this person rents a house. Moreover, a variable indicating whether a person is married or not was included in the latent class model as a covariate. The three data sets used to combine the data are:

  • BAG: A register containing data on addresses and buildings originating from municipalities from 2013. From the BAG, [7] used a variable indicating whether a person “owns”/ “rents (or other)” the house he or she lives in.

  • LISS background study: A web survey on general background variables from January 2013. [7] used the variable marital status. They also used a variable indicating whether someone is a “(co-)owner” and “(sub-)tenant or other”.

  • LISS housing study: A web survey on housing from June 2013. From this survey [7] used the variable rent benefit, indicating whether someone “receives rent benefit”, “does not receive rent benefit”, or “prefers not to say”.

These data sets were linked at the unit level, with matching done on person identification numbers. Not every individual is observed in every data set, so some missing values are introduced when the different data sets are linked at the unit level. Full Information Maximum Likelihood (see, e.g. [9]) was used to handle the missing values when estimating the LC model. The MILC method was applied to impute the latent variable home owner/renter (or other) using two indicator variables and two covariates.

As already explained, the MILC method can be used to assess the quality of the input sources. In Table 2 classification probabilities of the models, estimated by means of the MILC method, are given. The higher these probabilities, the higher the quality of the input data.

Table 2

Classification probabilities for LISS and BAG

        Pr(observed = rent | true = rent)    Pr(observed = own | true = own)
LISS    0.9344                               0.9992
BAG     0.9496                               0.9525

To give an example of how to measure the quality of – even quite complicated – aspects of the combined data set, [7] used a logit model to predict home ownership from marital status. By using Rubin’s pooling rules on the imputations produced by the MILC method, they obtained an estimated intercept of 2.7712 and an estimated regression coefficient of -1.3817. This means that the estimated odds of owning a home when not married are $e^{-1.3817}=0.25$ times the odds when married. The 95% confidence interval of the estimated intercept is [2.5036; 3.0389], and the 95% confidence interval of the estimated regression coefficient is [-1.6493; -1.1140]. These 95% confidence intervals provide quality measures for this aspect of the combined data set. The smaller these confidence intervals, the more accurate the estimates based on the combined data set.

In a similar way, by using Rubin’s pooling rules on the imputations produced by the MILC method, variances and confidence intervals for other estimands can be estimated.

3.3 Validity and measurement bias of observed numerical variables – BDC 2 “overlapping variables and units”

In this example we again have several indicators (with measurement error) for target variables that we use to estimate the true values of these target variables. In particular, the true distribution of one or more numerical target variables, which are measured (with measurement error) for individual units in several linked data sets, is estimated, as well as the relation between each target variable and its associated observed variables. From this, it can be assessed to what extent each observed variable is a valid indicator of its target variable, and to what extent measurement bias occurs. The quality measure and related calculation method can be seen as equivalents for numerical data of the quality measure and calculation method for categorical data based on latent class analysis given in the previous example.

The validity coefficient of an observed variable is defined as the absolute value of its correlation with the associated target variable. In the context of the model used here, this coefficient captures the effect of random measurement errors in the observed data. The validity coefficient lies between 0 and 1, with values closer to 1 indicating better measurement quality (absence of random measurement error).

Measurement bias indicates to what extent values of an observed variable are systematically larger or smaller than the true values of the associated target variable. Under the assumption that the relation between the target and observed variable is linear, the measurement bias is summarized in terms of intercept bias and slope bias. Intercept bias indicates a constant shift that occurs for all values; for instance, observed values are on average €1,000 too large. Slope bias indicates a shift that is proportional to the true value; for instance, observed values are inflated by on average 5%. Ideally, the intercept and slope bias are both zero. In the case of administrative data, measurement bias can occur in particular due to conceptual differences between the variable that is used for administrative purposes and the target variable that is needed for statistical production (see, e.g. [10]).

Suppose that one has a linked micro data set with observed variables $y_1,\ldots,y_p$ from different sources. The underlying “true” target variables are not observed directly and are denoted by latent variables $\eta_1,\ldots,\eta_m$. For simplicity, it is assumed that each observed variable $y_k$ is an indicator of exactly one target variable $\eta_{j(k)}$. By contrast, it is assumed that the same target variable is measured by multiple (at least two) observed variables (so $p>m$).

A linear structural equation model (SEM) for these data consists of two sets of regression equations. Firstly, there are measurement equations that relate the observed variables to the latent variables:

$$y_k=\tau_k+\lambda_k\eta_{j(k)}+\epsilon_k,\qquad(k=1,\ldots,p) \qquad (4)$$

Here, $\tau_k$ denotes a measurement intercept, $\lambda_k$ denotes a slope parameter (also known as a factor loading in this context), and $\epsilon_k$ denotes a zero-mean random measurement error that affects $y_k$. It is often assumed that the random errors $\epsilon_k$ and $\epsilon_l$ are uncorrelated for $k\neq l$.

Second, the SEM may contain structural equations that relate different latent variables to each other:

$$\eta_j=\alpha_j+\sum_{g=1,\,g\neq j}^{m}\beta_{jg}\eta_g+\zeta_j,\qquad(j=1,\ldots,m)$$

Here, $\alpha_j$ denotes a structural intercept and $\zeta_j$ denotes a zero-mean disturbance term. The coefficient $\beta_{jg}$ represents a direct effect of variable $\eta_g$ on variable $\eta_j$ (with $g\neq j$). In practice, some of these coefficients are usually set equal to zero when the model is specified, based on substantive considerations.

Once the SEM has been estimated, the validity and measurement bias of the observed variables can be assessed from the model parameters. The validity coefficient $\mathit{VC}$ of $y_k$ as an indicator for the target variable $\eta_{j(k)}$ is defined as the absolute value of the correlation between $y_k$ and $\eta_{j(k)}$. It can be shown that this correlation is equal to the standardized version of $\lambda_k$, say $\lambda_k^{(s)}$, so that:

$$\mathit{VC}(y_k)=|\lambda_k^{(s)}|=|\lambda_k|\,\frac{\mathrm{sd}(\eta_{j(k)})}{\mathrm{sd}(y_k)}=\sqrt{1-\frac{\mathrm{var}(\epsilon_k)}{\mathrm{var}(y_k)}}$$

In addition, the parameters $\tau_k$ and $\lambda_k$ provide information about measurement bias in $y_k$ with respect to $\eta_{j(k)}$. If no bias occurs, then $\tau_k=0$ and $\lambda_k=1$; cf. Eq. (4). Intercept bias is indicated by a deviation of $\tau_k$ from 0; slope bias is indicated by a deviation of $\lambda_k$ from 1.
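
Given (estimated) parameters of measurement equation (4), the validity coefficient and the two bias indicators are straightforward to compute. The sketch below uses hypothetical parameter values, not estimates from [14].

```python
import numpy as np

def measurement_quality(tau_k, lambda_k, var_eta, var_eps):
    """Validity coefficient and bias indicators for one observed variable
    y_k = tau_k + lambda_k * eta_j(k) + eps_k, cf. Eq. (4)."""
    var_y = lambda_k**2 * var_eta + var_eps          # model-implied var(y_k)
    vc = abs(lambda_k) * np.sqrt(var_eta / var_y)    # = sqrt(1 - var_eps/var_y)
    return {"validity": vc,
            "intercept_bias": tau_k,                 # deviation from 0
            "slope_bias": lambda_k - 1.0}            # deviation from 1

# Hypothetical estimates for an administrative turnover variable:
print(measurement_quality(tau_k=150.0, lambda_k=1.05,
                          var_eta=4.0e6, var_eps=2.5e5))
```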

For any SEM, it has to be checked whether all model parameters can be identified from the available data. Here, a distinction occurs between applications where only the validity is estimated and applications where the intercept and slope bias are also estimated. In the first case, the model can be identified with $m\geq 2$ correlated latent variables and at least two indicators for each latent variable, or with $m=1$ latent variable that has at least three indicators (see [11]). Identification of the model may also be improved by including covariates that are considered to be measured (essentially) without error.

To assess the “true” measurement bias, an additional assumption is needed to fix the “true” scale of each latent variable. (Note that this scale is not relevant for the validity coefficient, since that is defined as a correlation.) Following [12, 13], [14] suggests identifying the model in this case by collecting additional “gold standard” data for a small random subsample of the original data set (an audit sample).

Figure 1. Example of a two-group SEM identified by an audit sample.

Figure 1 shows an example of an SEM that is identified by means of an audit sample. The model contains three latent variables, each of which is measured by two ordinary, error-prone observed variables. For the units that are selected in the audit sample, additional observed variables $y_7$, $y_8$ and $y_9$ are obtained that are supposed to measure the latent variables without error. Thus, the measurement equations for these observed variables are simply: $y_7=\eta_1$, $y_8=\eta_2$ and $y_9=\eta_3$. The model is divided into two groups: group 1 represents the audit sample and group 2 the remaining units without additional “gold standard” variables. In group 1, the model is identified by means of the error-free audit data. In group 2, the model is identified by restricting all model parameters in this group to be equal to the corresponding parameters in group 1. This restriction is meaningful, because the audit sample has been selected by random subsampling from the original data set.

In practice, the “gold standard” audit data may be obtained by letting subject-matter experts re-edit a random subset of the original observations. Results on simulated data in [15] suggest that only a small audit sample (say, 50 units) is needed.

An alternative way to identify the model, without the need to collect additional audit data, is to assume that one of the observed variables for each latent variable does not contain systematic measurement errors (while still allowing for random errors). Then the model may be identified by restricting τk= 0 and λk= 1 for these variables. In some cases, this assumption may be reasonable, e.g., for survey variables. However, the assumption cannot be tested with the data at hand (the same holds for the assumption that audit data are error-free).

To estimate the parameters of an SEM, a standard approach is to apply maximum likelihood estimation under the assumption that the data consist of independent, identically distributed observations from a multivariate normal distribution (see [11]). A so-called Robust Maximum Likelihood or Pseudo Maximum Likelihood estimation procedure is available which can handle non-normality of the data as well as complex sampling designs for finite populations (see [16]). Details on how an audit sample can be formally incorporated into this estimation procedure can be found in [14]. Standard fit measures are available to evaluate whether a fitted SEM gives an adequate description of the observed data, and to compare the fit of different SEMs (see [11]).

The validity coefficients $\mathit{VC}$ and the parameters $\tau_k$ and $\lambda_k$ provide information about the quality of the input data $y_k$. They can also be used as input to a further procedure to assess the accuracy of output based on these data. As a very simple example, suppose that we are interested in the population mean of the true variable $\eta$: $\theta=(1/N)\sum_{i=1}^{N}\eta_i$. Suppose that we have two available estimators:

  • The direct estimator based on a simple random sample without replacement of $n$ units, $S$, where the target variable is measured by $y_1$: $\hat{\theta}_1=(1/n)\sum_{i\in S}y_{1i}$.

  • An estimator based on a register that covers the entire population, where the target variable is measured by $y_2$: $\hat{\theta}_2=(1/N)\sum_{i=1}^{N}y_{2i}$.

The measurement model for the observed variables $y_1$ and $y_2$ is given by Eq. (4). Under this model, with the true values $\eta_i$ treated as fixed, the following expressions can be derived for the mean squared errors of the two estimators:

$$\mathit{MSE}(\hat{\theta}_1)=\{\tau_1+(\lambda_1-1)\theta\}^2+\frac{1}{n}\left\{1-\frac{n}{N}\mathit{VC}^2(y_1)\right\}\sigma_{y_1}^{2};$$
$$\mathit{MSE}(\hat{\theta}_2)=\{\tau_2+(\lambda_2-1)\theta\}^2+\frac{1}{N}\left\{1-\mathit{VC}^2(y_2)\right\}\sigma_{y_2}^{2}$$

Here, $\sigma_{y_1}^{2}$ and $\sigma_{y_2}^{2}$ denote the expected population variances of the observed variables under the model. The first terms in the above expressions for the mean squared errors are the squared biases; the second terms are the variances.
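
Once the measurement parameters have been estimated, the two mean squared errors can be compared numerically. The sketch below uses hypothetical values in which the survey variable is unbiased but noisy and the register variable has a small slope bias.

```python
def mse_survey(tau, lam, vc, sigma2_y, theta, n, N):
    """MSE of the direct estimator based on y1 (simple random sample of size n)."""
    bias = tau + (lam - 1.0) * theta
    variance = (1.0 / n) * (1.0 - (n / N) * vc**2) * sigma2_y
    return bias**2 + variance

def mse_register(tau, lam, vc, sigma2_y, theta, N):
    """MSE of the estimator based on y2 (register covering the population)."""
    bias = tau + (lam - 1.0) * theta
    variance = (1.0 / N) * (1.0 - vc**2) * sigma2_y
    return bias**2 + variance

# Hypothetical parameter values:
theta, N, n = 50.0, 1_000_000, 5_000
print("survey  :", mse_survey(tau=0.0, lam=1.00, vc=0.90, sigma2_y=400.0,
                              theta=theta, n=n, N=N))
print("register:", mse_register(tau=0.5, lam=1.02, vc=0.99, sigma2_y=380.0,
                                theta=theta, N=N))
```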

Thus, in this example the validity coefficients and intercept and slope bias can be used directly to quantify and compare the accuracy of different estimators for the same target parameter. In other situations, it may be too complicated to derive an analytical expression for the mean squared error of a proposed estimator. In that case, the results of the structural equation model could still be used as input for a resampling method (such as the bootstrap) to simulate the effect of measurement errors on the output accuracy.

3.4 Variance-covariance matrix for a vector reconciled by means of macro-integration – BDCs 4 and 5 “microdata and aggregated data”

Many statistical figures, for instance in the context of national economic and social accounting systems, are connected by known constraints. We refer to the equations expressing such constraints as accounting equations. Since the initial input estimates are usually based on a variety of sources, they generally do not automatically satisfy the set of accounting equations, due to errors in the estimates. An adjustment or reconciliation step is then required, by which the input estimates are modified to conform to the constraints. Statistical figures sometimes also have to satisfy inequality constraints besides accounting constraints, and input data may need to be adjusted to satisfy these inequality constraints as well. Macro-integration is a technique used for reconciling statistical figures so that they satisfy the known constraints.

Consider, for example, an economic model with five macro-economic quantities: national income $y$, consumption $c$, investments $i$, export $x$, and import $m$, which are stacked in a vector $\boldsymbol{\beta}$. Suppose that there is one known national accounts identity $y=c+i+x-m$ and one known inequality $x\geq m$. Given is an initial vector of estimates $\hat{\boldsymbol{\beta}}_0=(\hat{y}_0,\hat{c}_0,\hat{i}_0,\hat{x}_0,\hat{m}_0)$, where $\hat{y}_0=600$, $\hat{c}_0=445$, $\hat{i}_0=100$, $\hat{x}_0=515$ and $\hat{m}_0=475$. These input data do not satisfy the accounting equation, since $600\neq 445+100+515-475$. By means of macro-integration the initial estimates can be reconciled so that the constraints are satisfied.

Macro-integration can be used to reconcile aggregated data as in the above example. It can also be used to reconcile microdata with aggregated data by first transforming the microdata into estimates at an appropriate aggregation level, for instance by means of weighting these microdata. This approach is, for instance, used for the Dutch Population Census (see, e.g. [17, 18]).

[19] considers the following general macro-integration problem

$$\min_{\boldsymbol{\beta}}\ \tfrac{1}{2}(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}_0)^{t}V^{-1}(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}_0) \qquad (5)$$
$$\text{subject to } T\boldsymbol{\beta}=\boldsymbol{g} \text{ and } R\boldsymbol{\beta}\geq\boldsymbol{h}$$

where $\hat{\boldsymbol{\beta}}_0$ is an unbiased initial vector of estimates for the (numerical) quantities of interest obtained from several data sources, with a known (or estimated) variance-covariance matrix $V$. $T$ and $R$ are matrices and $\boldsymbol{g}$ and $\boldsymbol{h}$ are vectors defining the constraints that have to be satisfied by the quantities of interest. Problem (5) is a so-called Quadratic Programming (QP) problem.

Our aim is to calculate the variance of the vector of estimates 𝜷 after reconciliation. The calculation formula for this variance depends on the type of restrictions that have to be obeyed. Four different cases are distinguished. These cases are discussed below.

3.4.1 Only equality restrictions

If there are only equality restrictions in Eq. (5), i.e. $T\boldsymbol{\beta}=\boldsymbol{g}$, the solution to Eq. (5) is given by

$$\hat{\boldsymbol{\beta}}_{QP}=\hat{\boldsymbol{\beta}}_0+VT^{t}(TVT^{t})^{-1}(\boldsymbol{g}-T\hat{\boldsymbol{\beta}}_0)$$

and

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP})=(I_k-VT^{t}(TVT^{t})^{-1}T)V$$

where $I_k$ is the $k\times k$ identity matrix.
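
For this equality-only case, the adjustment and its covariance matrix can be computed in a few lines; the helper below is a minimal numpy sketch of the two formulas above (the function name is ours, not from [19]).

```python
import numpy as np

def reconcile_equality(beta0, V, T, g):
    """Adjust initial estimates beta0 (covariance V) so that T beta = g holds,
    and return the adjusted estimates with their covariance matrix."""
    beta0 = np.asarray(beta0, dtype=float)
    V = np.asarray(V, dtype=float)
    T = np.atleast_2d(np.asarray(T, dtype=float))
    g = np.atleast_1d(np.asarray(g, dtype=float))
    K = V @ T.T @ np.linalg.inv(T @ V @ T.T)
    beta_qp = beta0 + K @ (g - T @ beta0)
    var_qp = (np.eye(beta0.size) - K @ T) @ V
    return beta_qp, var_qp
```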

3.4.2 Only one inequality restriction and no equality restrictions

When there is only one inequality restriction, the vector $\boldsymbol{h}$ reduces to a scalar $h$ and the solution to Eq. (5) is given by

$$\hat{\boldsymbol{\beta}}_{QP}=\begin{cases}\hat{\boldsymbol{\beta}}_0 & \text{if } R\hat{\boldsymbol{\beta}}_0\geq h,\\ \hat{\boldsymbol{\beta}}_0+K(h-R\hat{\boldsymbol{\beta}}_0) & \text{if } R\hat{\boldsymbol{\beta}}_0<h\end{cases}$$

$\hat{\boldsymbol{\beta}}_{QP}$ can be rewritten as

$$\hat{\boldsymbol{\beta}}_{QP}=\boldsymbol{\beta}+K(s_h^{+}-R\boldsymbol{\beta})+\boldsymbol{u} \qquad (6)$$

where $s_h^{+}=\max(R\hat{\boldsymbol{\beta}}_0,h)$, $\boldsymbol{u}=\hat{\boldsymbol{\beta}}_0-E(\hat{\boldsymbol{\beta}}_0\mid s)=(I_k-KR)(\hat{\boldsymbol{\beta}}_0-\boldsymbol{\beta})$, and $K=VR^{t}(RVR^{t})^{-1}$. The variable $s_h^{+}$ follows a censored normal distribution $\mathit{CN}(R\boldsymbol{\beta},RVR^{t},h,\infty)$ and is independent of $\boldsymbol{u}$. Hence,

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP})=K\,\mathrm{Var}(s_h^{+})K^{t}+\mathrm{Var}(\boldsymbol{u})=(I_k-(1-d_2^{2})KR)V \qquad (7)$$

The parameter $d_2^{2}$ follows from the variance formula for a censored normal distribution (see [19, 20]).

3.4.3 Multiple inequality restrictions and no equality restrictions

When there are $r\geq 2$ inequality restrictions and no equality restrictions, [19] proposes to use Eq. (6) as an approximation for $\hat{\boldsymbol{\beta}}_{QP}$ in order to evaluate the variance-covariance matrix. Furthermore, [19] assumes that the correlation coefficients between the elements of $\boldsymbol{s}_{\boldsymbol{h}}^{+}$ do not differ too much, say by less than 0.05, from those between the elements of $R\hat{\boldsymbol{\beta}}_0$. $\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP})$ can then be approximated by

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP})\approx\mathrm{Var}(K\boldsymbol{s}_{\boldsymbol{h}}^{+})+(I_k-KR)V\approx KD_2RVR^{t}D_2K^{t}+(I_k-KR)V$$

where $D_2=\mathrm{diag}(d_{21},\ldots,d_{2r})$ and the parameters $d_{21},\ldots,d_{2r}$ again follow from the variance formula of a censored normal distribution.

3.4.4 Multiple equality and inequality restrictions

When we have a set of $t$ equality restrictions $T\boldsymbol{\beta}=\boldsymbol{g}$ and a set of $r$ inequality restrictions $R\boldsymbol{\beta}\geq\boldsymbol{h}$, two steps have to be carried out in order to estimate the variance-covariance matrix of the final reconciled vector $\hat{\boldsymbol{\beta}}_{QP2}$. In the first step, we calculate

$$K_1=VT^{t}(TVT^{t})^{-1}$$
$$\hat{\boldsymbol{\beta}}_{QP1}=\hat{\boldsymbol{\beta}}_0+K_1(\boldsymbol{g}-T\hat{\boldsymbol{\beta}}_0) \qquad (8)$$
$$V_1\equiv\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP1})=(I_k-K_1T)V$$

In the second step we find the final solution $\hat{\boldsymbol{\beta}}_{QP2}$ by solving

$$\min_{\boldsymbol{\beta}}\ \tfrac{1}{2}(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}_{QP1})^{t}V_1^{-1}(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}_{QP1})$$
$$\text{subject to } R\boldsymbol{\beta}\geq\boldsymbol{h}$$

The variance-covariance matrix $\mathrm{Var}(\hat{\boldsymbol{\beta}}_{QP2})$ can now be estimated in the same way as for the case where we have only inequality restrictions, with $\hat{\boldsymbol{\beta}}_0$ and $V$ in Eq. (5) replaced by $\hat{\boldsymbol{\beta}}_{QP1}$ and $V_1$.

We return to our example with the five macro-economic quantities $y$, $c$, $i$, $x$ and $m$. We assume that the standard errors of the initial estimates are given by $se(\hat{y}_0)=30.0$, $se(\hat{c}_0)=22.0$, $se(\hat{i}_0)=7.1$, $se(\hat{x}_0)=28.3$ and $se(\hat{m}_0)=28.3$. We also assume that the covariances are zero. Applying Eq. (8) to this example, we first obtain the following QP1 solution: $\hat{y}_{QP1}=595.55$, $\hat{c}_{QP1}=447.39$, $\hat{i}_{QP1}=100.25$, $\hat{x}_{QP1}=518.96$ and $\hat{m}_{QP1}=471.04$.

This QP1 solution obeys the equality $y=c+i+x-m$. According to the variance formula in Eq. (8), the standard errors of $\hat{\boldsymbol{\beta}}_{QP1}$ are given by: $se(\hat{y}_{QP1})=24.74$, $se(\hat{c}_{QP1})=20.02$, $se(\hat{i}_{QP1})=7.01$, $se(\hat{x}_{QP1})=24.00$ and $se(\hat{m}_{QP1})=24.00$.
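
A minimal numpy sketch of this first (equality) step is given below. It reproduces the QP1 point estimates quoted above; the standard errors printed by the last line follow mechanically from the covariance formula in Eq. (8) applied to these rounded inputs and may therefore deviate somewhat from the quoted values.

```python
import numpy as np

# Initial estimates (y, c, i, x, m), their variances (covariances are zero),
# and the accounting identity y - c - i - x + m = 0 written as T beta = g.
beta0 = np.array([600.0, 445.0, 100.0, 515.0, 475.0])
V = np.diag(np.array([30.0, 22.0, 7.1, 28.3, 28.3]) ** 2)
T = np.array([[1.0, -1.0, -1.0, -1.0, 1.0]])
g = np.array([0.0])

# First step (equality restrictions only), cf. Eq. (8):
K1 = V @ T.T @ np.linalg.inv(T @ V @ T.T)
beta_qp1 = beta0 + K1 @ (g - T @ beta0)
V1 = (np.eye(5) - K1 @ T) @ V

print(np.round(beta_qp1, 2))                # [595.55 447.39 100.25 518.96 471.04]
print(np.round(np.sqrt(np.diag(V1)), 2))    # standard errors of the QP1 solution
```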

This solution satisfies the inequality $x\geq m$. So, the solution $\hat{\boldsymbol{\beta}}_{QP2}$ of the second step mentioned below Eq. (8) equals $\hat{\boldsymbol{\beta}}_{QP1}$. We still need to take the effect of the inequality on the variances and standard errors into account, though. In order to do so, the variances and the standard errors have to be adjusted according to Eq. (7) with $R=(0,0,0,1,-1)$ and $V=V_1$, where $V_1$ is defined in Eq. (8). This yields the standard errors of the estimates: $se(\hat{y}_{QP2})=24.15$, $se(\hat{c}_{QP2})=19.81$, $se(\hat{i}_{QP2})=7.00$, $se(\hat{x}_{QP2})=23.61$ and $se(\hat{m}_{QP2})=23.61$.

As could be expected, these standard errors are smaller than the standard errors of 𝜷^QP1 mentioned above, because the inequality restricts the region of the allowed values. Since there is only one inequality restriction in this example, these standard errors are exact under the assumptions of [19].

3.5 Scalar measures of uncertainty in (economic) accounts – BDC 5 “aggregated data”

We again consider accounting systems connected by known constraints. The uncertainty in such an accounting system is, in principle, given by a variance-covariance matrix of all variables involved in the accounting system. However, [21] considers such a system of accounting equations as a single entity and aims to define uncertainty measures that capture the adjustment effect as well as the relative contribution of the various input estimates to the final estimated account in a single scalar. [21] discussed two approaches: the covariance approach and the deviation approach. Below we sketch the covariance approach.

Consider the additive account $A=[Y_1+\cdots+Y_i+\cdots+Y_p=Z]$. Let $\Sigma_{\tilde{\boldsymbol{X}}}$ be the variance-covariance matrix of the adjusted estimates $\tilde{\boldsymbol{X}}=(\tilde{Y}_1,\ldots,\tilde{Y}_p,\tilde{Z})$. This matrix can be partitioned as follows: $$\Sigma_{\tilde{\boldsymbol{X}}}=\begin{pmatrix}\Sigma_{\tilde{\boldsymbol{Y}}} & \Sigma_{\tilde{\boldsymbol{Y}}\tilde{Z}}\\ \Sigma_{\tilde{Z}\tilde{\boldsymbol{Y}}} & \Sigma_{\tilde{Z}}\end{pmatrix}.$$

One scalar measure proposed in [21], denoted here by $\gamma(A)$, is defined as the sum of the variances of the components of $\tilde{\boldsymbol{X}}$, that is

$$\gamma(A)=\mathrm{Trace}(\Sigma_{\tilde{\boldsymbol{X}}})=\mathrm{Trace}(\Sigma_{\tilde{\boldsymbol{Y}}})+\boldsymbol{1}^{t}\Sigma_{\tilde{\boldsymbol{Y}}}\boldsymbol{1}=\sum_{i=1}^{p}\mathrm{Var}(\tilde{Y}_i)+\mathrm{Var}(\tilde{Z}) \qquad (9)$$

To illustrate the covariance approach, suppose that we have normally distributed input estimates $\hat{\boldsymbol{Y}}=(\hat{Y}_1,\ldots,\hat{Y}_p)\sim N_p(\boldsymbol{\mu},\Sigma_{\boldsymbol{Y}})$, where $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_p)$ and $\Sigma_{\boldsymbol{Y}}$ is a diagonal matrix with diagonal values $\sigma_k^{2}=\mathrm{Var}(Y_k)=\sigma^{2}\mu_k$ for $k=1,\ldots,p$.

Suppose we have an additive accounting equation of the form

$$Y_1+\cdots+Y_p=Z$$

where $\tilde{Z}=z$ is treated as fixed. The original values $\hat{Y}_1,\ldots,\hat{Y}_p$ do not necessarily satisfy this accounting equation. Consider a common benchmarking method, which yields the adjusted estimates

$$\tilde{Y}_k=\hat{Y}_k+\Big(z-\sum_{j=1}^{p}\hat{Y}_j\Big)\upsilon_k$$

where the $\upsilon_k$ are adjustment weights that sum to 1 over $k=1,\ldots,p$. By this benchmarking method the total difference is simply apportioned to the component estimators. Suppose one would like to compare two choices, $\upsilon_k=1/p$ and $\upsilon_k=\mu_k/\sum_{j=1}^{p}\mu_j$, which yield, respectively, the adjusted estimates

$$\tilde{Y}_{1k}=\hat{Y}_k+\frac{1}{p}\Big(z-\sum_{j=1}^{p}\hat{Y}_j\Big) \qquad (10)$$

and

$$\tilde{Y}_{2k}=\hat{Y}_k+\frac{\mu_k}{\sum_{j=1}^{p}\mu_j}\Big(z-\sum_{j=1}^{p}\hat{Y}_j\Big) \qquad (11)$$

We are interested in which of the two choices, Eq. (10) or Eq. (11), yields statistical data of higher quality. Denote the reconciled estimated account based on Eq. (10) by $A_1$ and the one based on Eq. (11) by $A_2$.

It can be shown (see [21]) that the random variables $\tilde{Y}_k-M_k$ $(k=1,\ldots,p)$, where $M_k=E(\tilde{Y}_k)=\mu_k+(z-\sum_{j=1}^{p}\mu_j)\upsilon_k$, are negatively correlated with each other and have normal distributions $N(0,\tilde{\sigma}_k^{2})$ with

$$\tilde{\sigma}_k^{2}=\upsilon_k^{2}\sum_{j=1}^{p}\mathrm{Var}(Y_j)+\mathrm{Var}(Y_k)(1-2\upsilon_k)=\upsilon_k^{2}\sigma^{2}\sum_{j=1}^{p}\mu_j+\sigma^{2}\mu_k-2\upsilon_k\sigma^{2}\mu_k$$

Since $\mathrm{Var}(\tilde{Z})=0$ because $\tilde{Z}=z$ is fixed, the measure $\gamma(A)$ (see Eq. (9)) is given by

$$\gamma(A)=\sum_{k=1}^{p}\mathrm{Var}(\tilde{Y}_k)=\sum_{k=1}^{p}\tilde{\sigma}_k^{2}$$

We examine the difference between $\gamma(A_1)$ and $\gamma(A_2)$. This difference is given by

$$\gamma(A_1)-\gamma(A_2)=\sum_{k=1}^{p}\sigma^{2}\left(\frac{1}{p^{2}}\sum_{j=1}^{p}\mu_j-\frac{2}{p}\mu_k-\Big(\frac{\mu_k}{\sum_{j=1}^{p}\mu_j}\Big)^{2}\sum_{j=1}^{p}\mu_j+\frac{2\mu_k}{\sum_{j=1}^{p}\mu_j}\mu_k\right)$$
$$=\sum_{k=1}^{p}\sigma^{2}\left(\frac{1}{p^{2}}\sum_{j=1}^{p}\mu_j-\frac{2}{p}\mu_k+\frac{\mu_k^{2}}{\sum_{j=1}^{p}\mu_j}\right)=\sum_{k=1}^{p}\sigma^{2}\left(\frac{\sqrt{\sum_{j=1}^{p}\mu_j}}{p}-\frac{\mu_k}{\sqrt{\sum_{j=1}^{p}\mu_j}}\right)^{2}\geq 0$$

This implies that $\gamma(A_1)\geq\gamma(A_2)$, with strict inequality unless all $\mu_k$ are equal. So, we conclude that adjustment method Eq. (10) always leads to more uncertainty than method Eq. (11) according to quality measure $\gamma$ – unless all $\mu_k$ are equal, in which case both adjustment methods are equivalent – and that reconciliation method Eq. (11) should therefore be preferred.
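
The comparison of the two benchmarking choices can also be checked numerically. The following sketch evaluates $\gamma(A_1)$ and $\gamma(A_2)$ for an arbitrary hypothetical account, using the expression for $\tilde{\sigma}_k^{2}$ above.

```python
import numpy as np

# Hypothetical account: p components with expected values mu and
# variances sigma2 * mu, as in the normal model above.
mu = np.array([100.0, 250.0, 50.0, 600.0])
sigma2 = 4.0
p = mu.size

def gamma_measure(upsilon):
    """gamma(A) = sum_k Var(Y_k tilde) for given adjustment weights upsilon."""
    var_y = sigma2 * mu                                   # Var(Y_k)
    var_tilde = upsilon**2 * var_y.sum() + var_y * (1.0 - 2.0 * upsilon)
    return var_tilde.sum()

gamma_a1 = gamma_measure(np.full(p, 1.0 / p))     # equal weights, Eq. (10)
gamma_a2 = gamma_measure(mu / mu.sum())           # proportional weights, Eq. (11)
print(gamma_a1, gamma_a2, gamma_a1 >= gamma_a2)   # the inequality always holds
```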

3.6 Other BDCs

With respect to BDC 3 (“under-coverage”) we have studied extensions of the well-known Petersen capture-recapture estimator (also referred to as the Petersen-Lincoln estimator). This estimator can be used to estimate the size of a population. In the simplest case two random samples $A$ and $B$ from the same target population are linked. By $n_{11}$ we denote the number of units that are observed in both samples $A$ and $B$, by $n_{10}$ the number of units that are observed in sample $A$ only, by $n_{01}$ the number of units that are observed in sample $B$ only, and by $n_{00}$ the number of population units that are not observed in either of the samples. The unknown population size equals $n_{11}+n_{10}+n_{01}+n_{00}$. This quantity is unknown because $n_{00}$ is not observed. The Petersen estimator for $n_{00}$ is given by

$$\hat{n}_{00}=\frac{n_{10}\,n_{01}}{n_{11}}$$

Its variance (see, e.g., [22]), and hence the variance of the estimator for the population size $n_{11}+n_{10}+n_{01}+\hat{n}_{00}$, is given by

$$\mathrm{Var}(\hat{n}_{00})=\frac{(n_{11}+n_{10}+1)(n_{11}+n_{01}+1)\,n_{10}\,n_{01}}{(n_{11}+1)^{2}(n_{11}+2)}$$
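
A minimal sketch of the Petersen point estimate and the variance expression above, with hypothetical counts for two linked sources, is given below.

```python
def petersen(n11, n10, n01):
    """Petersen estimate of the unobserved count n00, the implied population
    size, and the estimated variance following the expression above."""
    n00_hat = n10 * n01 / n11
    var_hat = ((n11 + n10 + 1) * (n11 + n01 + 1) * n10 * n01
               / ((n11 + 1) ** 2 * (n11 + 2)))
    pop_hat = n11 + n10 + n01 + n00_hat
    return n00_hat, pop_hat, var_hat

# Hypothetical counts from two linked registers:
n00_hat, pop_hat, var_hat = petersen(n11=800, n10=150, n01=120)
print(n00_hat, pop_hat, var_hat, var_hat ** 0.5)
```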

This variance estimate provides a quality measure for the population size estimation. In the literature review we carried out for WP 3, we studied how the Petersen estimator can be applied when the samples A and B are not obtained by survey sampling, but are instead based on administrative data (see, e.g. [23, 24]).

With respect to BDC 6 (“longitudinal data”), we have, for instance, carried out a literature review of [25]. [25] presents an approach to estimate (and possibly correct for) the amount of classification error in longitudinal microdata on categorical variables by estimating a so-called Hidden Markov Model for the underlying “true” distribution. A Hidden Markov Model is a special type of latent class model suitable for longitudinal data. Besides assumptions on the latent class model at each time point, assumptions on the longitudinal structure of the data are also needed. In the approach proposed in [25], the true distribution of a categorical target variable, which is measured (with measurement error) for individual units in several linked data sets at several time points, is estimated at each time point. The observed variables are seen as error-prone indicators of a latent true variable, with some assumptions about the distribution of measurement errors. From this, the misclassification rate in each observed variable at each time point can be estimated, as can the accuracy of observed higher-order properties, such as observed transition rates between categories over time. The approach proposed in [25] can be seen as a longitudinal variation on the approach of [7], which was described in Section 3.2. We refer to [25] for more information about their approach.

4. Discussion

We hope that we have succeeded in giving a flavor of the work that has been and is being done in WP 3 of the ESSnet on Quality of Multisource Statistics and the results that have been achieved with respect to quality measures for output based on multiple data sources.

Future work could focus on improving the developed quality measures, making them easier to apply in practical situations, and extending the range of situations in which they can be applied. Future work could also focus on developing quality measures and related computation methods that have not yet been considered in WP 3 of the ESSnet on Quality of Multisource Statistics.

Another important topic for future work is the further development of a systematic framework for situations, methods and quality measures that can arise in a multisource context. For instance, for single-source statistics where the focus is generally on sampling error, survey sampling theory offers the basics (and usually a lot more than just the basics) for computing survey estimates (by means of weighting) and variance estimates thereof. We feel that NSIs should aim for similar generally applicable theories that can be used for measurement errors, coverage errors, linkage errors etc. in a multisource context. In our opinion, NSIs should also aim for a general framework enabling one to combine all these separate aspects in an overall quality measure for output based on multiple data sets.

5. Back to the fairy tale

We do not know how the fairy tale from the beginning of our article will end, but we hope that it will end along the following lines.

The good and wise king looked at the work that had been done in the ESSnet, and told his people that he was once again impressed by the work they had done. He said: “I know this is not the final answer. Only the future may provide us the final answer. However, what you have done in this project is really useful and an important step forward. It will help not only us but also others in their efforts to determine the quality of multisource statistics.” The king and his people lived happily ever after, constantly improving the quality measurement of their statistics.

Acknowledgments

This work has partly been carried out as part of the ESSnet on Quality of Multisource Statistics (Framework Partnership Agreement Number 07112.2015.003 -2015.226), and has been funded by the European Commission. An earlier version of this paper has been presented at the Q2018 conference, Krakow, Poland, June 2018.

References

[1] Eurostat. European Statistics Code of Practice; 2018. Available at: https://ec.europa.eu/eurostat/web/products-catalogues/-/KS-02-18-142.

[2] De Waal T, Van Delden A, Scholtus S. Multisource statistics: basic situations and methods. Discussion paper, Statistics Netherlands; 2017. Available at: www.cbs.nl.

[3] Van Delden A, Scholtus S, Burger J. Accuracy of mixed-source statistics as affected by classification errors. Journal of Official Statistics 2016; 32(3): 1-25.

[4] Meertens QA, Diks CGH, Van den Herik HJ, Takes FW. Data-driven supply-side approach for measuring cross-border internet purchases; 2018. Available at: arXiv:1805.06930v1 [stat.AP], accessed 17 May 2018.

[5] Van Delden A, Scholtus S, Burger J. Exploring the effect of time-related classification errors on the accuracy of growth rates in business statistics. Paper presented at the ICES V conference, 21-24 June 2016, Geneva.

[6] Scholtus S, Van Delden A, Burger J. Analytical expressions for the accuracy of growth rates as affected by classification errors; 2017. Deliverable of SGA 1 of the ESSnet on Quality of Multisource Statistics.

[7] Boeschoten L, Oberski D, De Waal T. Estimating classification error under edit restrictions in combined survey-register data using multiple imputation latent class modelling (MILC). Journal of Official Statistics 2017; 33(4): 921-962.

[8] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons; 1987.

[9] Arbuckle JL. Full information estimation in the presence of incomplete data. In: Marcoulides GA, Schumacker RE, eds. Advanced structural equation modeling. Mahwah, NJ: Lawrence Erlbaum Associates; 1996. p. 243-277.

[10] Van Delden A, Banning R, De Boer A, Pannekoek J. Analysing correspondence between administrative and survey data. Statistical Journal of the IAOS 2016; 32(4): 569-584.

[11] Bollen KA. Structural equations with latent variables. New York: John Wiley & Sons; 1989.

[12] Bielby WT. Arbitrary metrics in multiple-indicator models of latent variables. Sociological Methods and Research 1986; 15(1-2): 3-23.

[13] Sobel ME, Arminger G. Platonic and operational true scores in covariance structure analysis. Sociological Methods and Research 1986; 15: 44-58.

[14] Scholtus S, Bakker BFM, Van Delden A. Modelling measurement error to estimate bias in administrative and survey variables. Discussion Paper 2015-17; 2015. Available at: www.cbs.nl.

[15] Scholtus S. Explicit and implicit calibration of covariance and mean structures. Discussion Paper 2014-09; 2014. Available at: www.cbs.nl.

[16] Muthén BO, Satorra A. Complex sample data in structural equation modeling. Sociological Methodology 1995; 25: 267-316.

[17] Houbiers M. Towards a social statistical database and unified estimates at Statistics Netherlands. Journal of Official Statistics 2004; 20(1): 55-75.

[18] Knottnerus P, Van Duin C. Variances in repeated weighting with an application to the Dutch labour force survey. Journal of Official Statistics 2006; 22(3): 565-584.

[19] Knottnerus P. On new variance approximations for linear models with inequality constraints. Statistica Neerlandica 2016; 70(1): 26-46.

[20] Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol. 1. New York: Wiley; 1994.

[21] Mushkudiani N, Pannekoek J, Zhang LC. Uncertainty measures for economic accounts. Deliverable of the ESSnet on Quality of Multisource Statistics; 2017.

[22] Chao A, Pan HY, Chiang SC. The Petersen-Lincoln estimator and its extension to estimate the size of a shared population. Biometrical Journal 2008; 50(6): 957-970.

[23] Gerritse S, Van der Heijden PGM, Bakker BFM. Sensitivity of population size estimation for violating parameter assumptions in log-linear models. Journal of Official Statistics 2015; 31(3): 357-379.

[24] Van der Heijden PGM, Smith PA, Cruyff M, Bakker B. An overview of population size estimation where linking registers results in incomplete covariates, with an application to mode of transport of serious road casualties. Journal of Official Statistics 2018; 34(1): 239-263.

[25] Pavlopoulos D, Vermunt JK. Measuring temporary employment: do survey or register data tell the truth? Survey Methodology 2015; 41(1): 197-214.