You are viewing a javascript disabled version of the site. Please enable Javascript for this site to function properly.
Go to headerGo to navigationGo to searchGo to contentsGo to footer
In content section. Select this link to jump to navigation

Unbiased estimation strategies for respondent driven sampling

Abstract

In this paper, we focus on respondent-driven sampling (RDS), which is a valuable survey methodology to estimate the size and the characteristics of hidden or hard-to-measure population groups. The RDS methodology makes it possible to gather information on these populations by exploiting the relationships between their components. However, RDS suffers from the lack of an estimation methodology that is sufficiently robust to accommodate the varying conditions under which it is applied. In this paper, we address the estimation problem of the RDS methodology and, by approaching it as a particular indirect sampling technique, we propose three unbiased estimation methods as possible solutions.

1.Introduction

In this paper, we focus on respondent-driven sampling (RDS), which is a valuable survey methodology for both national and international organisations to estimate the size and characteristics of hidden (e.g., homeless people, undocumented immigrants) or hard-to-measure population groups (e.g., minorities, indigenous people).

The principle of “leaving no one behind” is at the heart of the 2030 Agenda and a key requirement for many Sustainable Development Goals (SDG) indicators is to be available for the most vulnerable and marginalised population groups. Nevertheless, halfway through the implementation of the 2030 Agenda, most SDG indicators are still not available at the needed level of disaggregation to monitor the socioeconomic conditions of hidden and hard-to-count population groups. As a result, it is neither possible to produce reliable structural data on the needed disaggregation dimensions nor to monitor the developments of emerging phenomena that need to be approached with targeted evidence-based policy interventions.

The RDS methodology makes it possible to gather information on these populations by exploiting the relationships between their components. Moreover, the effectiveness of the RDS can be further increased by employing an integrated approach in which the RDS is used in conjunction with other information sources, such as administrative or geographical data.

The RDS is a network-based sampling technique [1, 2] that was developed first by Eckathorn [3]. RDS has been the favourite survey method for sampling populations that are difficult to reach due to the potential of a viable sampling technique with reasonable inferential approaches. As a result, since its establishment, it has been employed in countless investigations of these populations across many countries [4]. RDS starts with a small sample of individuals (“seeds”) with which the researchers are familiar. Each participant is then given a small number of coupons with unique identifiers to distribute to their contacts in the target population, enrolling them in the study and growing the sample size until the sample includes the desired number of respondents. The RDS process stops either when, in the selection process, we encounter only units already identified in the previous steps or at a predetermined data-collection step (e.g., the fifth step). Picture 1.1 below illustrates an example of a network sampling process articulated into three steps. The blue lines are the links observed in the sample. Up to and including step 2, participants g, b, c, a, 3, and 4 are kept in the sample. Participants d, 1, 2, 5, and 6 are not observed.

Figure 1.

Example of network sampling process*. *The thin grey lines represent links existing in the population but not observed in the sample.

Example of network sampling process*. *The thin grey lines represent links existing in the population but not observed in the sample.

We may view RDS as a specific extension of the extensively used group of convenience sampling methods known as “snowball sampling networks” [5] which are frequently employed as a last option when a traditional sample frame is not available [6]. Compared to those of more traditional snowball sampling, RDS offers two key benefits. First, respondents receive few coupons. This enables statistical inference to be more appropriately defined and makes it more plausible to approximate the final sample as a probability sample. Second, asking respondents to pass coupons to their contacts in a potentially stigmatised community reduces potential confidentiality issues. Due to this innovation, RDS is a compelling method for gathering data from marginalized and difficult-to-reach populations.

However, the RDS methodology suffers from the lack of an estimation methodology that is sufficiently robust to accommodate the varying conditions under which it is applied. Although it is quite robust for estimating mean and proportion values [7], the accuracy of the total estimates depends on several features including the nature of the network connecting the individuals in the population and the lack of a rigorous sampling approach to select the sampling units.

In this paper, we address the estimation problem and by approaching the RDS methodology as a particular indirect sampling technique [8], we propose three unbiased estimation methods as possible solutions. In particular, the first method assumes a random sampling of the initial individuals. In contrast, the second method, which considers purposive sample selection, creates a nonbiased estimation if the initial sample of respondents falls into all the clusters of networks that characterise the population of interest. Finally, leveraging the generalised capture-recapture estimation approach [9], we propose an estimator that accounts for the noncoverage of two independent indirect samplings.

The paper is organised as follows. In Section 2, we summarise the traditional methodology of the RDS methodology, illustrating the data collection technique and the Volz and Heckathorn estimator [10], which has been very successful in practical applications due to its lack of computational complexity. Section 3 introduces the basic symbology, and we show how the RDS can be seen as a particular specification of indirect sampling in which each survey wave represents the indirect basis for the subsequent RDS phase. In Section 4, we expand the sampling aspects in the RDS. Section 5 introduces the three estimators. Section 6 concludes the work, and we begin to outline a strategy to overcome information gaps for SDG indicators for hard-to-reach populations, focusing on indigenous peoples.

2.Data collection and estimation in the classical RDS approach

RDS is frequently carried out by using techniques suggested by Salganik et al. [11] and outlined in protocols such as those proposed by White et al. [4] and Johnston [12].

A preliminary sample of typically 2 to 10 seeds is chosen. Aiming to represent all the key socioeconomic subpopulations that researchers anticipate may exist in the target population, seeds are selected to be as varied as possible. The rationale of this derives from the fact that each subpopulation may represent a separate network (or a cluster) of target individuals. If we select a seed in a given subpopulation, we can explore the network of related individuals. In contrast, if we do not select any individual in a subpopulation, the specific cluster of individuals cannot be observed in the RDS process. Therefore, picking up in the initial sample all possible distinct networks increases the possibility of constructing unbiased inferences on the target population.

The enumerators should include community opinion leaders in the initial seeds. Hence, their acceptance and support of the survey method may likely inspire widespread involvement from other target population members. This buy-in is crucial in target populations that are unlawful or stigmatised, especially if the population has any prior exposure to risky research practices. Following an interview, the seeds are given some coupons, each with a unique identification number, to spread to other population members. This number was used to reconcile the practical need to prevent the early termination of the sample trees with the inferential aim of limiting the branching of the sample. Members of the population who receive coupons visit a study centre where they are directly interviewed or given an interview appointment. Three coupons are likewise supplied to subsequent replies; this process continues until the sample size is approximately reached and the coupons are tapered or discontinued. Participants are paid for their time spent taking the survey. Additionally, for each successful recruiting, participants receive rewards. In the survey, the number of target population contacts of each respondent must be measured. This is typically done by asking questions that narrow the recruit’s references to the precise definition of the target population. Interviewers must also verify membership in the target population. Researchers must also assure participants do not participate in the survey more than once. Study staff are familiar enough with the target population in many settings to notice repeat participation attempts. In other cases, repetition is prevented by collecting nonidentifiable but unique information about participants, as in Johnston [13].

The RDS methodology can be applied alternatively to the entire population of individuals or by considering only the subpopulation at risk of belonging to the target population. For instance, if the target population coincides with forced labour people, we may observe people working in sugarcane.

Figure 2.

Example of target population divided in disjoint clusters.

Example of target population divided in disjoint clusters.

Contrary to what was previously believed, Eckathorn [3] used a Markov modelling of the peer recruitment process to demonstrate that bias from the convenience sample of beginning participants from which the sample started gradually diminished as the sample increased wave by wave. By using the model, they demonstrated how the sample approached an equilibrium independent of the beginning location or independent of the convenience sample of seeds from which it started as it expanded wave by wave. The conclusion was that this sampling technique may become reliable if there were enough waves, meaning that any seed selection can eventually yield the same equilibrium sample composition. However, the researchers did not show how an unbiased estimate can be derived.

Eckathorn[14] introduced the first RDS population estimator based on the essential idea that in RDS, relationships tend to be reciprocal. This implies that if person A knows person B, then B knows A.

The estimator bases its validity on the principle of network balance between population subgroups. Up to a constant factor, the volume of network connections to and from each group can be approximated. For each pair of groups, this results in a set of balance equations that may be used to solve for the relative size of each group. Volz and Eckathorn [10] proposed a slightly biased estimator. In the following we call this estimator the VH estimator. The VH estimator is based on the following hypotheses [15]: 1. The network size is large compared to that of the realised sampling, including the initial seeds and the respondents recruited by the RDS process. 2. Homophily is weak enough, where homophily is the tendency for nodes to preferentially form network contacts with others like themselves. 3. Reciprocity of contacts. 4. With-replacement sampling. 5. Enough sample waves. 6. Accurate measurement. 7. The recruitment in the subsequent waves of the RDS process is random.

The first three hypotheses relate to the nature of the contact network, while Hypotheses 4 to 7 relate to sampling. Hypothesis 4 is the most critical and may introduce a certain level of bias, as the sampling process adopted assumes a link between the persons recruited.

Focusing on the first three assumptions, let us consider the example below in Fig. 2, where we assume that people of the target population belong to three disjoint clusters. If the starting sample includes only persons in one group, the traditional RDS can estimate only the total number of persons in that cluster suffering from a substantial undercoverage problem. Since people of the target population are often grouped geographically, observing each cluster’s units in the starting sample is appropriate.

The VH estimator has had great application success in the practice of real investigation due to its great computational ease. Subsequently, other estimators have been proposed in the literature (see among others [16, 17]) each of which overcomes some of the limitations associated with the assumptions made with the VH estimator. These estimators have higher computational complexity, and introduce some modelling assumptions on the cluster variables or the nature of the contact network.

To illustrate the classical VH estimator, let us consider the case where the total of a characteristic

(1)
Y=kUyk

in the population of interest U (of N units) is to be estimated, where yk is the value of y for unit kU.

Let S be the sample of different units at the end of the RDS process. Let αk be the number of times unit k is observed in S, where

m=kSak.

The probability of selection of unit k is supposed to be proportional to their contacts

pk=Lk/NL¯,

where

Lk=jUλj,k,L¯=LN,andL=kULk

where λj,k is the link (0,1) variable between individuals j and k. Let

p^k=Lk/NL¯^

be the Hansen Hurwitz (HH) [18] estimate of pk, where

L¯^=(SaLmp)/(Sa/mp)

is the HH estimate ratio of the average number of contacts in the population. The numerator on the right-hand side of the previous equality is given by:

SaLmp=SaLmLL
=1mLSa=L.

Figure 3.

Example of construction of the totals YS0,R.

Example of construction of the totals YS0,R.

The denominator can be expressed as

Samp=SaLmL

L¯^ and p^k are, therefore, given by:

L¯^=LSaLmL=mSa1L,p^k=Lk/NmSa1L.

The VH estimator of Y is

(2)
Y^𝑉𝐻=kSakykmp^k=kSykw𝑉𝐻,k

where

(3)
w𝑉𝐻,k=akmp^k.

is the sample weight assigned to unit k.

3.Totals of interests and a formalisation of the RDS as a particular case of indirect sampling

The RDS can be formalised as an indirect sampling scheme. In this type of sampling, there is an initial UA population of NA units from which the research starts, and a UB population of NB units that constitute the study’s target population.

In our case, UBU means that it coincides with the target population U, which implies NB=N.

The specific unit j of the initial population UA consists of the unit j itself and all its contacts. Let j and k be the labels identifying the population units in UA and UB, respectively. Let λj,k be the link variable between units jUA and kUB, where λj,k=1 if j is directly linked to k. We have λj,k=λk,j and λj,j=1. Let

(4)
y¯jA=-kUBλj,kLkByk

be the value of the characteristic of interest for unit jUA, where

LkB=jUAλj,k.

In the initial population UA, each unit in contact with unit j contributes to the y value of that unit in a weighted manner, where the reciprocal of the total number of contacts gives the weight.

The two populations UA and UB have the same numbers of units: NA=NB=N. We have

NA=jUAkUBλj,kLkB
(5)
=kUBjUAλj,kLkB
=kUB1=NB=N.

Moreover, the target parameter may also be expressed as the sum of the y¯jA values over the population UA:

(6)
Y=jUAy¯jA.

Indeed, it is

(7)
Y=jUAy¯jA=jUAkUBλj,kLkByk=kUBjUAλj,kLkByk=kUByk.

In addition to the total Y, another total that plays a crucial role in the RDS methodology is the aggregate, YS0,R. That is, the total of the variable y where starting from the sample S0 (which constitutes step 0 of the process), additional units are considered in subsequent steps through all their contacts. The total considers this aggregate after the step of this process. We can consider this as a search process on a graph.

Figure 3 illustrates this process starting from sample S0, which includes only unit j.

To clarify how this total can be constructed, let us compute it step by step. To distinguish among the search processes of the different steps, let jr (with jrUA and r=0,1,,R-1) be the subscript of the population unit involved in step r of this search process on a graph.

At step 0, the total YS0,0 is simply the sum of the values yj for the units included in the sample S0:

(8)
YS0,0=j0S0yj.

In step 1, we observe the links of units in S0. We compute YS0,1 by adding to the total YS0,0 the values of y of the new units individuated in the search process starting from the units in S0. We can formalise this process as

(9)
YS0,1=j0S0kUBykλj0,kLkS0,

where LkS0=j0S0λj0,k.

To clarify the notation, we note that the j0 unit is also one of the k units. We also note that LkS0 is the total number of links to unit k (identified in step 1 of the process) that can be computed starting from the units in S0.

We can reverse the order of the sums and formulate as

(10)
YS0,1=kUBykj0S0λj0,kLkS0

where LkS0=j0S0λj0,k.

We note that Eq. (10) avoids the multiple counting of a unit linked to different elements of the initial sample, as illustrated in Fig. 4, where unit b is connected to both units j and a of S0. Indeed, it is

j0S0λj0,bLbS0=λa,bLbS0+λb,bLbS0+λj,bLbS0
=13+13+13=1.

Figure 4.

Example of Steps 0 and 1.

Example of Steps 0 and 1.

At step 2, we have

(11)
YS0,2=j0S0j1UAkUBykλj0,j1Lj1S0λj1,kLkB.

We note that the last summation is over UB, i.e., the target population. The middle summation is on UA, i.e., on the initial population. The first summation is limited to the initial sample from which the research starts. This kind of organisation of the order of summations also appears in the following formulas. The last summation is always on UB, and the first is on S0. In contrast, the intermediate summations are always on the starting population UA.

Reversing the order of the summations, we have

(12)
YS0,2=kUBykj0S0j1UAλj0,j1Lj1S0λj1,kLkB.

Let us note that in this case, the unit kUB is counted only once, avoiding the multiple counting of a unit linked to different elements of the initial sample and its links. This is illustrated in Appendix 1.

Continuing recursively the above process, at step R we have

(13)
YS0,R=j0S0j1UAjR-1UAkUBykλj0,j1Lj1S0λj1,j2Lj2B××λjR-2,jR-1LjR-1BλjR-1,kLkB=kUBykj0S0j1UAjR-1UAλj0,j1Lj1S0λj1,j2Lj2B××λjR-2,jR-1LjR-1BλjR-1,kLkB.

Even in this case, we avoid the multiple counting of a unit linked to different elements collected in the R steps of the RDS process.

Based on the above expressions, we note the following.

  • Each step of network sampling can be formalised as an indirect sampling mechanism.

  • In a given step of the RDS process, the participants from which the search starts constitute the source list UA, and their links are the target population UB.

  • In the subsequent step, the people found in the target population UB become the people of the initial population UA, from which a new search starts.

  • This switch in the role of the sample participants, from the target population UB to the initial population UA of the next step, occurs at each wave of the RDS search chain.

Remark 1. A path in graph theory is a finite or infinite sequence of edges that joins a sequence of vertices. We note that if R is greater than the maximum of the minimum paths between any pair of nodes in each cluster of the units of S0, then the following implications follow.

  • YS0,R represents the total of the y variable related to all the groups to which the elements of S0 belong.

  • if S0 does not include all clusters characterising the population of interest, then YS0,R<Y.

We obtain these important outcomes illustrated in remarks 2 and 3 below by interpreting the result of Remark 1 in an alternative way.

Remark 2.YS0,R=Y if

  • R is greater than the maximum of the minimum paths between any pair of nodes in each cluster of the units of S0.

  • The S0 sample must include people from all clusters with people from the target population.

Remark 3. We consider the case of two interconnected units in which they know each other. However, these units may belong to two separate clusters if the RDS search rules call for stopping the search if a connection is identified with a person living in a geographic location distinct from that of the original contact. In this case, to ensure the equality YS0,R=Y, in addition to the two conditions in Remark 2, there is the additional condition that the S0 sample must include people from all geographic locations with people in the target population.

4.Sampling the RDS research chain

Unlike in the previous section, in the RDS process, not all links of a unit kept in the process are observed, but only a randomly selected sample of them is observed. The RDS process starts at step r=0, with sample S0 (which may be randomly selected or not), and in subsequent steps r=1,2,,R, we form samples S1S2SrSR, each incorporating the sample from the previous step. To illustrate the formation of the generic sample Sr+1, we introduce additional symbology below. Let us consider the jrSr unit and denote by

LjrSr=kUBλjr,kδk(Sr)=kSrλjr,k

the total number of contacts of the unit that have been selected in sample Sr, where δk(A)=1 if unit k belongs to set A and δk(A)=0 otherwise. For the same unit jr, let

LjrCr=LjrB-LjrSr

be the number of contacts that have not been selected in sample Sr, and can be selected in sample Sr+1. For each unit jr included in the Sr sample, we select, independently of the other units included in Sr,mjr+1 units for the Sr+1 sample where mjr=𝑀𝑖𝑛(m,LjrCr), being m is a fixed number (e.g., m=2 or 3) that remains unchanged in the different steps of the RDS. Of these mjr+1 units, one is the jr unit itself, and the other mjr units are selected with a simple random sampling without replacement (SRSWOR) out from the LjrCr units. The conditional probability that unit jr+1 is selected in sample Sr+1, given jrSr

τjr+1jrSr={1 if (λjr,jr+1=1)and[δjr+1(Sr)=1]mjrLjrCr if (λjr,jr+1=1) and [δjr+1(Sr)=0]0 if λjr,jr+1=0.

Figure 5.

Example of the formation of sample S1 from sample S0 with m= 2.

Example of the formation of sample S1 from sample S0 with m= 2.

Remark 4 on the feasibility of the selection process. The illustrated selection process avoids the dead-end loops typical of graph sampling. However, to make it feasible, it is essential to know the LjrCr quantity, obtained as difference of two quantities, LjrB and LjrSr. The value of LjrB can be requested directly from respondent jr. Remembering that the relationships explored in RDS have the character of reciprocity, LjB corresponds to the total number of people who unit jr knows and can point to in turn. Operationally, the LjrSr quantity can be obtained in alternative ways. Suppose nonidentifiable but unique information about contacts of units included in the Sr sample [13] is available in the data-collection APP used by the interviewer. In that case, a specific software application can be launched that identifies units not included in Sr and proceeds to select mjr units to be included in sample Sr+1 randomly. Alternatively, the same software application can be run by the study centre (see Section 2 above) that supports the survey operations, and the results can be reported and provided in real-time to the interviewer who makes the Sr+1 sample selection. Depending on the specific conditions of the survey, other feasible operational mechanisms can be defined.

Remark 5 on the research chain for a subpopulation. We consider the case of constructing the RDS sampling search chain only on the units of a subpopulation, for example, only on the people of a particular class-age. We denote by xj a dichotomous variable that takes value 1 if unit j belongs to the subpopulation and takes value 0 otherwise. In this case, the link variables are defined as

λ(x)j,k=λj,kxjxk.

The values LkB are modified accordingly.

5.Estimators

Next, we present three estimators. The first assumes a random selection of the S0 sample, and the second adopts the traditional RDS methodology while considering a non-probabilistic S0 sample. The third estimator is developed under a capture-recapture approach [12] while allowing for the smoothing of the coverage problems that may affect both of the first two estimators.

5.1S0 selected with a random sampling

Let us suppose a random sample S0 of fixed size n0 is selected from UA without replacement and with inclusion probabilities πj0, where

πj0>0forj0=1,,NAand
(14)
j0UAπj0=n0.

To facilitate the understanding of the calculation method, we construct the estimator step by step. In each step, we obtain a sampling unbiased estimate of the total Y. However, as the steps of the RDS process progress, the estimate is based on a more significant number of observations.

At step 0, we have the classical HT estimator:

(15)
Y^0=j0S0yj01πj0=j0S0yj0wj0,

where wj0=1/πj0 is the sampling weight.

In Step 1, we have

(16)
Y^1=j0S0kS1ykλj0,kLkB1πj01τk|j0S0.

As illustrated in Appendix 2, denoting E(), the sampling expectation operator, we have E(Y^1)=Y, meaning that Y^1 is a sampling unbiased estimate of Y. Reversing the order of sums, we can express Y^1 in the classical weighted form as

(17)
Y^1=kS1ykwk1

where

wk1=j0S0λj0,kLkB1πj01τk|j0S0.

Estimator Y^1 can also be formulated referring to Eq. (6) as:

(18)
Y^1=j0S01πj0y¯^j0A

where y¯^j0A=kS1ykλj0,kLkB1τkj0S0 is the unbiased estimate of y¯j0A

In Step 2, taking Eqs (16), (17), and (18) of step 1 as a reference, the unbiased estimator Y^2 can be expressed according to these three alternative ways

Y^2=j0S0j1S1kS2ykλj0,j1Lj1Bλj1,kLkB1πj01τj1j0S01τkj1S1,
Y^2=kS2ykwk2,
Y^2=j0S0j1S11πj0λj0,j1Lj1B1τj1j0S0y¯^j1A,

where

wk2=j0S0j1S1λj0,j1Lj1Bλj1,kLkB1πj01τj1j0S01τkj1S1,

and y¯^j1A=kS2ykλj1,kLkB1τkj1S1 is the unbiased estimate of y¯j1A. We have E(Y^2)=Y.

Recursively using the previous procedure, at the ultimate step R of the RDS process, we have:

Y^R=j0S0j1S1jR-1SR-1kSRyk×λj0,j1Lj1B××λjR-1,kLkB1πj01τj1j0S0××1τkjR-1SR-1,
Y^R=kSRykwkR,
Y^R=j0S0j1S1jR-1SR-1λj0,j1Lj1B××λjR-1,kLkB1πj01τj1j0S0××1τjR-1jR-2SR-2y¯^jR-1A,

where

wkR=j0S0j1S1jR-1SR-1λj0,j1Lj1B××λjR-1,kLkB1πj01τj1j0S0××1τkjR-1SR-1

and

y¯^jR-1A=kSR-1ykλjR-1,kLkB1τkjR-1SR-1.

In Appendix 2, we see E(Y^R)=Y

Remark 6 on estimating the sampling variance. We can approximate the RDS design with multistage sampling with replacement, where each step may be viewed as a specific sampling stage, and the replacement refers to a single unit. In that way, we may derive an estimate of the sampling variance [19][(Formula 11.35)];

v(Y^R)=1n0(n0-1)j0S0(1zj0Y^j0-Y^R)

where

Y^j0=j1S1jR-1SR-1kSRykλj0,j1Lj1B××λjR-1,kLkB1πj01τj1j0S0××1τkjR-1SR-1,

and zj0=πj0/n0.

Remark 7 on the estimator for a subpopulation. As illustrated in Remark 5, the link variables are defined as λ(x)j,k=λj,kxjxk, and the variables LkB are modified accordingly. Moreover, the target variable yk is modified as y(x)k=ykxk.

Remark 8 on type of estimator. Considering the above expressions, we can see how the Y^R estimator can be seen as a particular case of the generalised weight share method (GWSM) estimator.

Remark 9 on the starting sampling. The sampling design should maximise the number of observed individuals of the target population in the sample S0 by adopting proper choices in the first and ultimate stages (or phases) of the sampling process. First-stage selection tends to oversample the areas where the researchers have some a priori information of a high concentration of the target population. Final-stage sampling tends to oversample the target people by modelling the inclusion probabilities on variables predictive of the phenomenon available in the sampling frames adopted to select the final-stage units.

5.2S0 selected with a nonrandom sampling

If the sample S0 is selected nonrandomly, we can obtain a nonbiased estimate only of the total YS0,R. We illustrate this case by referring only to step R and the weighted form of the estimator. An unbiased (see Appendix 2) estimator of the total YS0,R is

Y^S0,R=kSRykwkR,S0

where

wkR,S0=j0S0j1S1jR-1SR-1λj0,j1Lj1S1××λjR-1,kLkB1τj1j0S0××1τkjR-1SR-1.

We note that the formulation of Y^S0,R is similar to that of the Y^R estimator, except that in Y^S0,R, the first weighting factor (1/πj0) of Y^R is equal to 1 , and Lj1S0 replaces the Lj1B factor.

Remark 10 on type of estimator. We can straightforwardly see how the Y^S0,R estimator can be seen as a particular case of the GWSM estimator.

Remark 11 on the starting sampling. To ensure that estimator Y^S0,R is an unbiased estimate of the total Y, i.e., that condition E(Y^S0,R)=Y is met, the initial sample S0 should respect the three conditions illustrated in remarks 2 and 3.

5.3Generalised capture-recapture estimator

Even if the S0 sample is randomly selected, the first estimator Y^R may be biased. Indeed, undercoverage may occur if respondents do not trust the interviewers and tend to hide their status.

Likewise, if the S0 sample is nonrandomly chosen, even the second estimator Y^S0,R can undercover the total Y if the following conditions are not met: (i) R is greater than the maximum of the minimum paths between any pair of nodes in each cluster of the units of S0, and (ii) S0 does not include all clusters characterising the population of interest.

The generalised capture-recapture (CReG) estimator allows us to overcome the abovementioned undercoverage by leveraging a capture-recapture perspective. A comprehensive treatment of this estimator and how it mitigates undercoverage deserves much more space in this article than can be devoted. The interested reader can undoubtedly look to the extensive work reported in [9].

Let us consider two independent surveys based on the RDS methodology, and we suppose they are articulated in R steps. The first starts from an initial random sampling, while the second starts from a nonrandom sample. Furthermore, we suppose that the two sample selections are independent. The CReG estimator of Y may be expressed as

Y^CReG =Y^RY^S0,RY^𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡,R

where

Y^𝑖𝑛𝑡𝑒𝑟𝑠𝑒𝑐𝑡,R=kSR, intersect wkRwkR,S0yk

where SR,intersect  is the intersection sample between the two samples that are generated from random and nonrandom sampling after R steps.

A useful approach is applying the estimator CReG on the two nonrandom starting samples but with a different mechanism of undercoverage of the two respondent groups.

6.Conclusions

The disaggregation of data for SDG indicators on hidden or hard-to-count population groups presents several critical issues that are difficult to overcome to produce reliable official statistics at the national and international levels. In this context, it is impossible to estimate the characteristics of these groups through models as in other situations.

Considering, for example, indigenous populations, data availability varies widely from country to country. Few countries provide up-to-date and high-quality data. Many other countries have only data that are scattered over time. Or the data they provide is not supported by a sufficiently robust methodology, both for precisely identifying the subpopulation group of interest and for the sampling technique adopted.

Given the current context of producing official statistics on the subject, it is unrealistic that this lack of data on indigenous peoples is going to improve soon.

Therefore, it becomes necessary to define and apply an implementation program that can improve this situation relatively quickly.

This implementation program should be based on the following pillars:

  • 1. The first pillar is to develop a valuable data collection strategy for sample surveys that is capable of measuring the number of people belonging to the indigenous population in a given area. This strategy needs to cover various aspects, including the formulation of the questionnaire for identifying persons belonging to the indigenous population and the characteristics of special sampling techniques that can maximise the efficiency of surveys aimed at obtaining data on these hidden or hard-to-measure population groups. Specifically, the data collection strategy consists of technical manuals, open software modules, ad hoc training materials, etc. In short, anything that enables or helps conduct surveys or specific survey modules to estimate the size of indigenous populations. Regarding the questionnaire, the data collection strategy should develop a set of standard questions on the key characteristics of the specific population of interest and not adopt generic questions on whether specific individuals belong to an indigenous population group.

  • 2. The second pillar is to adapt the data collection strategy to ongoing survey programs. For example, the indigenous module may be applied to a large-scale national survey conducted by a national statistics office. Another example can be to promote the application of the indigenous sampling module to international surveys, such as the World Bank’s LSMS survey. Regarding the sampling aspects, it is helpful to consider an overall sampling strategy to maximise the number of observed individuals of the target population in the sample by combining the first and final sampling stages. First-stage methods should tend to oversample the areas where the researchers have some a priori information on the geographical concentration of the target population. Final-stage sampling should oversample the target population by adopting strategies, such as respondent-driven sampling (RDS), that leverage the existing hidden relationships among the individuals of the target populations.

  • 3. The third pillar is adopting estimation methods that allow unbiased estimates of phenomena of interest in target populations. In this paper, we have considered the RDS, widely used for observing hidden or rare populations but which lacks an estimation methodology that is sufficiently robust to accommodate the varying conditions under which it is applied. We have proposed three unbiased estimators. The first assumes a random selection of the starting sample, and the second considers a nonprobabilistic starting sample. The third estimator is developed under a capture-recapture approach and allows for smoothing of the coverage problems that may affect the estimators. We have studied their sampling expectation and have indicated the conditions that may be fulfilled to guarantee their unbiasedness concerning the target totals.

References

[1] 

Gile K, Beaudry IS. Handcock MS, Miles Q. Methods for Inference from Respondent-Driven Sampling Data. Annu Rev Stat Appl. (2018) ; 5: (4): 1-429.

[2] 

Heckathorn DD, Cameron J. Network Sampling: From Snowball and Multiplicity to Respondent-Driven Sampling. Annual Review of Sociology. August (2017) .

[3] 

Heckathorn DD. Respondent driven sampling: a new approach to the study of hidden samples. Soc Probl. (1997) ; 44: (2): 174-99.

[4] 

White RG, Hakim AJ, Salganik MJ, Spiller Spiller MV, Johnston LG, Kerr L, Kendall C, Drake A, Wilson D, Orroth K, Egger M, Wolfgang H. Strengthening the Reporting of Observational Studies in Epidemiology for respondent-driven sampling studies: “STROBE-RDS” statement. Journal Clinical Epidemiology. (2015) Dec; 68: (12): 1463-71. doi: 10.1016/j.jclinepi.2015.04.002. Epub 2015 May 1.

[5] 

Goodman LA. Snowball sampling. Ann Math Stat. (1961) ; 32: : 148-70.

[6] 

Handcock MS, Gile KJ. Comment: on the concept of snowball sampling. Sociol Methodol. (2011) ; 41: (1): 367-71.

[7] 

Heckathorn DD, Jeffri. Finding the Beat. Using respondent-driven sampling to study jazz musicians. Poetics. (2001) ; 28: : 307-329. Elsevier.

[8] 

Lavallée P. Indirect Sampling. Springer. New York, (2007) .

[9] 

Lavallée P, Rivest LP. Capture – Recapture Sampling and Indirect Sampling. Journal of Official Statistics. (2012) ; 28: (1): 1-27.

[10] 

Volz E, Heckathorn DD. Probability based estimation theory for respondent driven sampling. J Official Statistics. (2008) ; 24: : 79.

[11] 

Salganik M.J., Douglas D. Heckathorn. Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling. Sociological Methodology. (2004) ; 34: : 193-239.

[12] 

Johnston LG. Conducting respondent driven sampling studies in diverse settings: a manual for planning RDS studies. Cent Dis Control Prev, Atlanta, GA, (2007) .

[13] 

Johnston LG. Introduction to HIV/AIDS and sexually transmitted infection surveillance. Module 4. Introduction to respondent-driven sampling. World Health Organ., Geneva. http//www.lisagjohnston. com/respondent-driven-sampling/respondent-driven-sampling, (2013) .

[14] 

Heckathorn DD. Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl. (2002) ; 49: : 11-34.

[15] 

Heckathorn DD. Assumptions of RDS: analytic versus functional assumptions. Presented at CDC Consult. Anal. Data Collect. Respond.-Driven Sampl., Atlanta, GA, (2008) .

[16] 

Gile K, Handcock MS. Network model-assisted inference from respondent-driven sampling data. J R Stat Soc A. (2015) ; 178: (3): 619-39.

[17] 

Verdery AM, Merli MG, Moody J, Smith J, Fisher JC. Respondent-driven sampling estimators under real and theoretical recruitment x conditions of female sex workers in China. Epidemiology. (2015) a; 26: : 661.

[18] 

Hansen MH, Hurwitz WN. On the theory of sampling from finite populations. Ann. Math. Stat.. (1943) ; 14: (4): 333-62.

[19] 

Cochan WG.. Sampling Techniques. J Wiley New York, (1943) .

Appendices

Appendix 1: Equation (7)

We have

YS0,2=kUBykj0S0j1UAλj0,j1Lj1S0λj1,kLkB=kUBykj0S0λj0,j1Lj1S0j1UAλj1,kLkB

being

j1UAλj1,kLkB={1ifλj1,k=10,otherwise,
j0S0λj0,j1Lj1B={1ifλj0,j1=10,otherwise.

Appendix 2: Unbiasedness of Y^1

We have

E(Y^1)=j0UA1πj0E[δj0(S0)]kUBykλj0,kLkB(1τkj0S0)E[δk(S1)j0=jUAπj0πj0kUBykλj0,kLkB(τkj0S0τkj0S0)j0UAkUBykλj0,kLkB=kUByk,

since

j0UAλj0,kLkB=1.

Unbiasedness of Y^S0,R

E(Y^R)=j0UAj1UA××jR-1UAkUB(λj0,j1Lj1B××λjR-1,kLkB)×(E[δj0(S0)]πj0E[δj1(S1)j0S0]τj1j0S0××E[δk(SR)jR-1SR-1]τkjR-1SR-1)yk=j0UAj1UAjR-1UAkUB(λj0,j1Lj1B××λjR-2,jR-1LkBλjR-1,kLkB)yk=j0UAλj0,j1Lj1Bj1UAλj1,j2Lj2B××(kUBjR-1UAλjR-1,kLkByk)=j0UAλj0,j1Lj1Bj1UAλj1,j2Lj2B××jR-2UAλjR-3,R-2jR-2Y=Y.

Unbiasedness of Y^S0,R

E(Y^S0,R)=j0S0j1UA××jR-1UAkUB(λj0,j1Lj0××λjR-1,kLkB)××(E[δj1(S1)j0S0]τj1j0S0××E[δk(SR)jR-1SR-1]τkjR-1SR-1)yk=j0S0j1UA××jR-1UAkUB(λj0,j1Lj1××λjR-1,kLkB)yk=YS0,R