Integration of remote sensing data into national statistical office sampling designs for agriculture
The integration of remote sensing data in agricultural statistics is a research topic with a long history. The research focus is on using statistical models to link ground and remote sensing data such that the resulting estimators are design-consistent.
A design-consistent estimator assisted by linear models is well established in the literature. However, it requires enough geographic information about the boundaries of agricultural parcels to develop a simple sample of areas. Many countries use complex samples based on non-georeferenced list frames of households or farms and reduce to point data the georeferenced information required for linking ground and remote sensing data.
Data on crop acreage observed at a point are necessarily categorical because a point is dimensionless. Little work has been done on the integration of categorical ground data within complex list samples using remote sensing data. Our focus was on using multinomial logit models for this integration. Special attention was paid to evaluate the cost efficiency of remote sensing data.
Recently, the European Space Agency (ESA) launched the Sentinels for Agricultural Statistics(Sen4Stat) project. The aim was to facilitate the use of remote sensing information at the National Statistical Offices (NSOs) supporting agricultural statistics. Other international programs to improve agricultural statistics were begun around the same time. An additional Sen4Stat target is synergy with other initiatives, including the Living Standards Measurement Study – Integrated Survey on Agriculture (LSMS-ISA), which was launched in 2008 by the World Bank, and the Global Strategy to Improve Agricultural and Rural Statistics (GSARS), which was endorsed in 2010 by the United Nations Statistical Commission and implemented by the Food and Agriculture Organization (FAO) of the United Nations. An FAO output is the Agricultural Integrated Survey (AGRIS) method , together with the 50 2030 Initiative, which builds on work of the LSMS-ISA and AGRIS programs . Four of five partner countries in Sen4Stat were also partners in other initiatives, i.e., Ecuador and Senegal of AGRIS, Malawi of LSMS-ISA, and Tanzania of GSARS. Accordingly, through existing collaborations and initiatives, the FAO coordinated the provision of ground data from respective agencies in Ecuador, Senegal and Tanzania, while ground data collection from Malawi was coordinated by the World Bank.
Remote sensing (RS) furnishes a very useful auxiliary data source to improve the cost efficiency of design-based inference, and the integration of ground and remote sensing data is a research topic with a long history [3, 4, 5, 6, 7]. Statistical models are used for this integration, in such a way that the estimator assisted by these models is design-consistent .
The focus has been on linear models, assuming that the sample design is simple. In countries where geographic information of agricultural parcel boundaries is available, it is possible to use an individual or cluster of parcels as the sampling unit and to design a simple sample of areas (polygons) using area frames. A linear model is suitable for integrating ground and remote sensing data on crop acreage or yield because these are continuous variables. However, many countries use complex samples based on non-georeferenced list frames (of households or farms) and reduce to point data the georeferenced information required for linking ground and remote sensing data. Crop data observed at a point are necessarily categorical because a point is dimensionless, and multinomial logit is an appropriate statistical model to integrate remote sensing and categorical ground data [9, 10].
Little work has been done on the integration of categorical ground data in complex list samples of households or farms with remote sensing data. In this paper, we focus on using multinomial logit models for this integration. We give details on statistical models to be used for crop acreage, yield and production estimation, as well as on the sample design used by five Sen4Stat partner countries that are representative of the large types of sample designs proposed in the literature. These are list, area, and point sampling .
Considering the literature on the applications of remote sensing to agricultural statistics, as well as expectations of the NSO partner countries for these applications, we elaborated four use cases: 1. Cost efficiency; 2. Granularity; 3. Multi-seasonal estimation; 4. Optimizing the sample design. A prototype for integrating ground and remote sensing data in each use case was developed for a large set of sampling designs, including those currently used by the partner countries. We include results of the application of these prototypes to data provided by the NSOs, for cost efficiency and granularity. Finally, we outline prototypes for multi-seasonal estimation and optimizing the sample design.
The approach to elaborate official statistics is well established in the literature. It is design-based in the sense that the inference is based on the sampling distribution generated by the probabilistic sampling scheme designed to select the sample [12, 13]. This sampling scheme associates with each population unit, , a known and positive inclusion probability , which play a key role in design-based inference. Usually, the sample size is sufficiently large to attain design consistency (the design-based distribution of the estimator is tightly concentrated around the true population value) at national and regional levels, and design-consistent estimators are used. Estimate accuracy is also based on the estimator’s design-based distribution.
Our focus was on improving the cost-efficiency of the currently used sampling designs for elaborating official statistics on crop acreage, using RS as auxiliary data. The approach to using auxiliary data in design-based inference is well stablished in the literature , i.e., the auxiliary extra-sample information is integrated in the sampling design using statistical models, without loss of design-consistency. In minor administrative areas (municipalities) the sample size is small or nil, so that design consistency is meaningless. Thus, a model-based approach is used to get estimates, whose accuracy strongly depends on the auxiliary information reliability.
Model suitability for ground and remote sensing data integration depends on the nature of ground data. For continuous ground data a linear regression model is used, but for categorical and counts ground data, a generalized linear model is more suitable [9, 10]. Both types of models allow remote sensing data of any character, continuous or discrete. The efficiency of RS as auxiliary data for crop acreage has already been extensively demonstrated in the scientific literature, but only for continuous ground data and with the assumption that the sample design is simple. To date, little work has been done on the integration of categorical ground data in complex list samples of households or farms with remote sensing data. In this paper, we focus on using multinomial logit models for this integration.
The design of the sampling scheme for use in sample selection is key in design-based inference because it associates with each population unit a known and positive inclusion probability , which is used to define the estimators’ statistical distribution. This distribution is used to evaluate the estimators’ characteristics, including design consistency and accuracy.
A sampling frame is required to select a probabilistic sample and we considered the two main types of sampling frames used in practice, area frame and list frame. The former is based on maps and/or satellite images and the latter on population censuses. Sampling units are either polygons or points in area frames and households or farms in list frames. Usually, the former is georeferenced, but not the latter .
The area frame has important advantages over list frames, including versatility, reduction of non-sampling errors such as coverage errors, and longevity. However, it has some inconvenience (sensitivity to outliers and sub representation of rare or minor activities) . To overcome these inconveniences, a dual frame composed of an area and list frame is recommended . In Section 4, we give details on the dual frame used in Senegal to integrate ground and RS data.
Using area frames, it is possible to define sampling units (polygons) of the same size and simple samples with equal inclusion probabilities. In contrast, using list frames, the sample selection scheme entails two or more stages, the sampling units are usually of distinct size, and the probabilistic scheme assigns unequal (size-proportional) inclusion probabilities. Therefore, the sample is not simple but complex.
The integration of ground and RS data will be illustrated for the set of sample designs most often used in practice for collecting ground data . With area frames, systematic sampling with one or more random starts and stratified random sampling are most often used as the sample schemes when geographic information on parcel boundaries is available. If the parcel boundaries are unknown, the sampling unit is usually a point, and we will illustrate how to integrate a sample of points selected from a list frame with RS data.
Agricultural data are frequently collected using national household surveys, for which the sample is selected from a list frame (a population census) in two or more stages, with unequal inclusion probabilities. The sampling unit is a household and geographic information at parcel level is rarely available. For RS data integration, georeferenced information is required at a minimum. To have at least one georeferenced point by agricultural parcel within the household sample, an additional sampling stage is used to select the parcel centroid.
Three of the five Sen4Stat partner countries use area samples, namely, Spain, Ecuador, and Tanzania. In Spain, the area frame is the national topographic map. The sampling unit is a square segment of side 700 m (49 hectares). The sample design is non-stratified systematic with three random starts. The sample of segments for crop acreage estimation is selected in a single stage with inclusion probability , where and are the number of sampling units in the sample and population, respectively.
In Ecuador, the sampling frame is a land-use map, stratified into four strata according to the percentage of cultivated area. The sampling unit is a segment with geometric boundaries, whose size changes with strata: 9 hectares (ha) in stratum 1a, 36 ha in stratum 1b, 144 ha in stratum 2, and 576 ha in stratum 3. The inclusion probability is equal within strata, , where and are the number of sampling units in the sample and in the population of stratum , respectively.
In Tanzania the sampling frame used to be a list frame based on an agricultural census. Recently, in the framework of the GSARS program, the Government of Tanzania in collaboration with the FAO, United States Department of Agriculture (USDA) and African Development Bank, designed an area sample . The sampling frame is a map of the country stratified by land use and the sampling unit is a point. Point sampling has a long history [17, 18]. In agricultural statistics, it is used for selecting a sample of farms when a sampling frame is not available; only an accurate map is required . The sample of farms is used for data on a large set of items, generally by direct interview of farmers.
Some of the aforementioned quantities, including crop acreage and yield and natural resources, can be observed directly on the ground, bypassing the need to contact farmers for data collection. France is an example of countries using point sampling for crop acreage estimation, using data collected directly on the ground . Furthermore, the USA is an example of countries using point sampling for a national inventory of natural resources [20, 21].
The sampling design is basically the same in the two aforementioned countries, i.e., a two-stage sampling in which the primary sampling unit (PSU) is a square segment and the secondary sampling unit (SSU) is a point. In the USA National Resources Inventory survey, the segments are stratified, and the side of the typical segment is 800 meters (1/2 mile), i.e., 64 hectares, with some smaller (200 meters) and some larger (1600 meters). The primary sampling rates are generally 2% to 6% of the land area. There are about 300,000 PSUs and 844,000 points, and the second-stage sample size is between 1 and 3 points per PSU.
In the French TER-UTI, the sampling design is non-stratified, one-stage, and systematic. The national territory is divided into square blocks with side 12000 meters. Each block is divided into four square sections of side 6000 meters. In each section, a systematic sample of 36 points is selected, aligned in both row and column directions, separated by a distance of 300 meters, forming a 6 6 point grid.
This sampling scheme can be seen as two-stage, so the inclusion probability is of the form , where and are the first and second stage inclusion probabilities, respectively.
Two of the five Sen4Stat partner countries use list sampling, Malawi and Senegal. In Malawi, the ground data were collected in the framework of the LSMS-ISA program. The IHS5 sampling frame is based on the 2018 Malawi Population and Housing Census. It is stratified into rural and urban strata and each of the 27 districts were considered a separate sub-stratum, part of the main rural stratum. A two-stage sampling scheme was used to select the sample from each district. The PSU is the Enumeration Area (EA) and the first-stage sample is systematic without replacement, with probabilities proportional to the number of households in the PSUs. The SSU is a household and a systematic sample of 16 households was used to choose the second-stage sample, using equal probabilities. To integrate with RS data, a georeferenced point is selected in each parcel of the sample .
In Senegal, the sampling frame is based on the 2013 population census . The sampling unit is a household and the sample was designed within the AGRIS program framework. This is a two-stage sampling scheme in which the PSU is the EA, and the first-stage sample is selected with replacement and probabilities proportional to the number of households in the PSUs. The SSU is a farm (households with agricultural activities) and the second-stage sample is selected without replacement and with equal probabilities among households with agricultural activities.
To integrate the data in the list sample of farms with RS data, a point is selected in each parcel within the two-stage sample. The inclusion probability of this point is , where and are the first and second stage inclusion probabilities, respectively. This is so because if the SSU is included in the sample, then the set of parcels are also included, ; the georeferenced point is not selected at random but is the parcel centroid .
Statistical models are used to integrate the auxiliary extra-sample information into the sampling design without loss of design-consistency . We used linear models if field data were continuous and multinomial models if those data were categorical [9, 24].
We consider a population of units, say agricultural parcels. If geographic information on parcel boundaries is available, it is possible to generate the required remote sensing (RS) data at unit level. We then consider a linear model relating ground data in the unit , with the values of a set of RS variables associated with this same unit, . The parameter vector is unknown but can be estimated. represents unobservable zero-mean random perturbations that, conditionally on , account for the variability of about its expected value .
The survey variable total in the population is and the mean is . If contains 1 (say 1), then . Here, , and designates the vector of population regression coefficients , where and . Since is known for every population unit , then and are known. However, is unknown and so is .
To estimate , we use the data collected in the sample of size units, selected from the population with inclusion probabilities . The design-based estimator is design-consistent for and as a result, the synthetic or projective estimator is design-consistent for . This same estimator can be written in the form , called a generalized regression estimator, where and . The projective estimator can be expressed as weighted sum with weight , where .
The error of as an estimator of is and the sampling variance is , where and . A designconsistent estimator of the sampling variance is , where is the design-based variance and .
The asymptotic distribution of is and can be used for constructing confidence intervals for .
A design-consistent estimator for the mean is . Its sampling variance is and can be estimated using . Its asymptotic distribution is the same as that of .
A domain is a part of the population, for instance, a region or province within a country. Let be the domain size, where if unit belongs to the domain, and 0 otherwise. The survey variable total in the domain is and the mean is . To estimate , we use the sample of size selected from the population with inclusion probabilities and the estimator . Here, is the number of units in the sample belonging to the study domain and is an estimator of . If is unknown, then we use the estimator . The sampling variance is . A design-based estimator of is .
The sample was designed to achieve the required estimate accuracy at national level. However, reliable estimates in small administrative areas such as a municipality are also required without increasing sample size . In a minor administrative area, will always be smaller than at national level and, as a result, so will be the estimators’ accuracy. A small area is a part of the population of which, owing to the small sample size, the design-based estimator is not sufficiently accurate for most uses and the design consistency requirement is meaningless.
For estimates at municipality level, we used a model-based estimator to “borrow strength” from related small areas to obtain accurate estimates for a given small area. Let be the available dataset, where represents ground data in unit of the small area , is the vector of RS data, is sample size in the small area , and is the number of small areas, so that .
We consider the unit-level linear mixed model , where are zero-mean independent random variables of variances . We assume the same regression parameters and same variance components , through small areas.
The model-based estimator of the total survey variable in a small area is , where is the generalized least-square estimator of , with , , and , where . Here, and , with , are unbiased estimators of the variance components, where is the sum of squared residuals in the model fitted by ordinary least squares, taking as fixed the small-area effect . is the sum of squared residuals in the model fitted by ordinary least squares, with 0. and are the population and sample means of the vector , respectively.
An unbiased estimator of the total mean-square error estimator is :
Here, . Because , may be simplified to .
2.2.2Multinomial logit models
If geographic information on parcel boundaries is unavailable, then it is not possible to generate RS data at parcel level. In this case, we use a point as the unit of observation. Because a point is dimensionless, observed crop data are necessarily categorical. For integrating categorical ground and RS data, multinomial logit is a suitable model [9, 10, 24].
We consider the survey vector , where 1 if crop covers pixel and 0 otherwise. is the total number of crops so that the constraint holds. We focus on high-resolution satellite images and the population size is the number of pixels in the study area , where is the area of the piece of land represented by a pixel.
The population total of the survey vector is , where is the number of pixels covered by crop . It is assumed that a pixel is covered by only one crop (or that a class of mixed pixels is included), so that the area covered by is and the population area is . We want to estimate the total or mean , where is the proportion of pixels covered by .
We assume that the survey vector follows a multinomial distribution , where and is the probability that unit is of category , i.e., the probability of 1 and 0; , with the constraint 1. We estimate using an estimator of based on a sample of pixels selected with inclusion probability , where is the vector of RS data.
To link with RS data, we consider a multinomial logit model , where is an unknown parameter vector. To estimate the probability that the cover of pixel is crop , it is sufficient to acquire estimates of for . The probability of category is and can be estimated using the estimates for .
The design-based parameter estimator is design-consistent for the maximum likelihood estimator of based on the population data , considered a simple random sample of the multinomial model. This can be found iteratively using , where and , with if and 0 otherwise A design-consistent estimator of is , where , for , , , and is the estimator of the total number of pixels covered by crop for ; for category it is . The sampling covariance matrix of is given approximately by , where is a function of , the estimators of the total number of sampling units of category for (because , and of the estimator of the total residuals of category . We focus on the sampling variance for , which are the elements along the diagonal of . Let be a row vector whose components are the estimators on which depends. The sampling variance is given approximately by , where is the design-based covariance matrix of . The sampling covariance matrix is estimated replacing by in .
The survey variable total in the domain is , where is the number of pixels in covered by crop . To estimate , we use the sample of size selected from the population with inclusion probabilities and the estimator . Here, is the number of units in the sample belonging to the study domain, and . The sampling variance of is given approximately by . Let and . The sampling variance is given approximately by , where . The sampling covariance matrix is estimated replacing by in .
Works in the literature [26, 27, 28] for small area estimation based on multinomial mixed models follow a penalized quasi-likelihood approach. As pointed out by McCulloch and Searle , these methods are not completely satisfactory in practice. Those authors recommended instead a linearization of the non-linear multinomial models and used linear mixed models. Additional research is required to achieve a completely satisfactory solution to this problem
Cost efficiency is the usual criterion for comparing a set of sampling strategies developed for estimating the same characteristic in the same population. Let be the cost and the sampling error of the current sampling strategy using only ground data, and let and be the cost and sampling error, respectively, of the strategy integrating ground and RS data. The cost efficiency is for the former and for the latter.
Although the Sentinel images and software required for the integration of ground and RS data are provided by ESA for free, there are costs for the NSOs such as the time of experts and commercial cloud services required for storage and computations. Because estimation of these other costs is a difficult task, we assume in the following that , so the comparison criterion reduces to the efficiency.
The relative efficiency of RS data with respect to ground data is . If , then, using RS data, the current sampling error may be reduced to without increasing the survey cost. Here, is the effect of RS data on the sampling efficiency. In other words, using RS data, the current sample size can be reduced without loss of accuracy in a quantity equal to . Thus, using RS data, the ground sample size can be .
We evaluate the effect of RS data on the cost efficiency using either linear or multinomial models. Spain and Ecuador are partner countries using area samples selected in only one stage, with equal inclusion probabilities, and we consider linear models for RS data integration. In Spain, the sampling frame is not stratified and the inclusion probability is . In Ecuador, the sampling frame is stratified and the inclusion probability is for every sampling unit of the same stratum , with .
The sampling variance using only ground data is in Spain, and in Ecuador.
Using RS data, the sampling variance is in Spain, where , and in Ecuador, where .
Using RS data, the current sampling error can be reduced to a quantity . In Spain, and . In Ecuador, , where , , and is the cost per sampling unit in stratum .
Senegal, Malawi, and Tanzania use point sampling with unequal inclusion probabilities , , and we consider multinomial models for RS data integration. The sampling variance of the estimator using only ground data is approximately . Using ground and RS data, the sampling error is . The relative efficiency of RS data for estimating the total number of pixels covered by crop is .
RS cost-efficiency relies on the RS contribution to reduce the sampling variance of the residuals with respect to the sampling variance of the ground data . The sampling variance reduction depends on the reliability of the RS data to estimate the expected values, . Consequently, the accuracy and timeliness of the datasets required to estimate are key to achieve a significant gain of cost-efficiency.
As mentioned in the Introduction, “multi-seasonal estimation” and “optimizing the sample design” are applications of RS data useful for agricultural statistics. We developed prototypes for these two applications but we cannot test them here because the required data are unavailable. Thus, we are limited to outlining the developed prototypes.
We consider the annual agricultural cycle divided into, say, four seasons. We use the subscript to refer to season of year . We want to estimate the total of the survey variable in of and its annual aggregate .
22.214.171.124.1. The sample
We examine a two-phase sampling strategy. The first-phase sample is the NSO sample already in place and used for agricultural statistics. We use the subscript 1 to refer to this sample, for instance, denotes its size. We select a secondphase sample specific to each season from among the sampling units within . We choose a splitpanel or supplementary panel sampling design consisting of a panel component and specific component .
For the first season, the panel component is a sample of sampling units selected from , and the specific sample for that season is a sample of sampling units selected from among those included in but not included in . In the second season, the panel component is the same as in season 1, , and the specific sample is a number of sampling units chosen from among those included in but not included in either or . For the third season, the panel component is the same as in seasons 1 and 2 ( and the specific sample is a number of sampling units selected from among those included in but not included in , , or . In the fourth season, the panel component is the same as in seasons 1, 2 and 3, and the specific sample is a number of sampling units chosen from among those included in but not included in , , , or .
126.96.36.199.2. Composite estimators
A composite estimator is a function of single estimators, each defined separately for panel and specific samples. We consider the linear model , referring to season and both panel and specific samples. Model parameters are estimated using the weigthed estimator and data from the second-phase sample (with for the panel sample and for the specific sample), with . The seasonal survey variable total is estimated separately, using data from the panel and specific samples and projective estimator . A design-consistent estimator of the sampling error variance is , where .
Let be the vector of survey variable totals in the four seasons of year , and let be the vector of total estimators based on the panel and specific samples. We consider the model , where is the vector of sampling errors and is an matrix indicator of the panel or specific sample.
We assume that panel and specific samples in the same season are independent and, as a result, in the error covariance matrix the covariances between and are nil. A composite estimator of is and its covariance matrix is . The composite estimator of the annual total is and its variance is .
To estimate , we use the empirical correlation and an autoregressive model , where and are modeled using . Here, designates the zeromean independent random variable 0 and the variance of is . is an AR(1) stationary process, where is a random zeromean perturbation term whose variance is .
As a result, we have the linear mixed model , whose temporal random component is autocorrelated and 0. In this last model, represents the observed data in sampling unit in season and the right term in the equation is interpreted as follows: is the true value of the survey variable and the sum of the population mean and . The latter is a specific component associated with sampling unit in season and represents deviation from Finally, is the measurement error of the aforementioned true value.
The autocovariance function of the process is , where and . The variance of is and the autocorrelation function is . The autocovariance function of the process is and the variance is . The autocorrelation function for the process is , where and is an indicator variable whose value is 1 if the argument is true and zero otherwise.
The ratio is a measure of the weight of the measurement error relative to the total error. The parameters and can be estimated through fitting by nonlinear minimum least squares the theoretical correlation model to the empirical correlations in .
The sample design is optimized to find the design variable values that minimize the sampling variance subject to cost (budget) constraints. The sampling variance depends on the spatial correlation structure of the survey variable and we follow a superpopulation approach to identify this structure. We limit ourselves to area samples and assume that the population values of the survey variable constitute a sample generated according to a second-order stationary random process with the following characteristics : the mean is , variance is , and the covariance between two elementary observations at the points of coordinates and is positive, decreasing when the distance between these points increases.
To assess the correlation structure , theoretical variogram and correlogram models have been proposed in the literature. Two frequently used correlogram models are the exponential, , and the spherical if and 0 if . Here is the number of sampling units between and in the row direction, the number of sampling units between and in the column direction, and . The model parameters are the range rate and ratio . Here is the nugget effect, i.e., the variation at or near the origin (independent of distance), is the partial sill (a function of distance between sampling points) and is the sill, i.e. the maximum variation far from the origin.
It was demonstrated by Ambrosio et al.  how this approach should be followed to design systematic samples using a land-use map. Here, we propose instead to use a crop-type map  to estimate the parameters of the correlogram model. The estimated correlogram can be used to evaluate the anticipated variance  as a function of design variables for the set of sampling strategies used in practice [35, 36].
188.8.131.52.1. Anticipated variance
We illustrate the proposed approach considering segments of size and a simple random sample of segments of size . The variance of the sampling error is , where is the population variance. The anticipated variance is the model-based expected value of this design-based sampling variance , where and . Here, is the average correlation between pairs of observations over the pairs in the population. is the average correlation between pairs of observations over the pairs in a segment.
184.108.40.206.2. The optimization problem
The design variables are the segment and sample sizes, and we find optimal values of these variables by solving the following optimization problem: , subject to . Here, is the cost of adding a segment to the sample, excluding travel cost but including positioning cost (travel to the first segment visited from the interviewer home base and then back to that base from the last segment visited during the data-collection trip), and is the observing cost, including the cost of locating the segment
The solution to this problem is the optimum segment size and optimum sample size of segments . In addition to the budget, this optimum solution is conditioned to the correlogram model parameters .
We used field data collected by the NSO of Spain to illustrate our approach to the integration of continuous ground data in the sampling design using linear models, and field data collected by the NSO of Senegal to illustrate our approach to the integration of categorical ground data into the sampling design using multinomial models.
The two RS products traditionally used as auxiliary data for crop acreage and yield estimation are pixel classification by crop type (for the former) and a set of vegetation indices (for the latter). For training pixel classification models, ground georeferenced data of crop type is required. In many countries, data from the agricultural sector are collected using national household or farm surveys, and no geographic information at parcel level is available.
|Data||Acreage||95% Confidence interval (hectares)||Coeff. of||Relative efficiency|
|(hectare)||Limits||Amplitude||variation (%)||of RS data|
|Ground RS||228550.1||Lw: 219699.8||17700.5||1.98||5.2|
Quotient between root square of the sampling variance and estimate.
|Data||Yield||95% Confidence interval (kg/hectare)||Coeff. of||Relative efficiency|
|(kg/hectare)||Limits||Amplitude||variation (%)||of RS data|
|Ground RS||4155.6||Lw: 4033.2||244.9||1.50||0.95|
|Data||Yield||95% Confidence interval (kg/hectare)||Coeff. of||Relative efficiency|
|(kg/hectare)||Limits||Amplitude||variation (%)||of RS data|
|Ground RS||2327.8||Lw: 2215.1||225.4||2.47||1.22|
Azzari et al.  compared several ways of generating the data needed to train pixel classification models when no such parcel-level information is available. Focusing on integrating ground survey data of maize in Malawi and Ethiopia from national household surveys using the Sentinel-2 satellite, the authors evaluated the accuracy of pixel classification producing georeferenced data at parcel level, ranging from the full parcel boundary to only one point (centroid).
The authors concluded that collecting full-parcel boundary data or GPS coordinates of the polygon defined by complete parcel corner points yields the best-quality information for model training; however, the use of mid-sized sample (3000–4000 parcels) plot centroids could perform similarly to full plot boundaries. The authors did not consider the statistical model required to achieve the integration of ground data with Sentinel-2 to improve the accuracy of crop acreage estimates, obtaining design-based consistency.
In Sen4Stat, the focus was on developing an open-source system that permits any user to generate from Sentinel 1 SCL Sentinel-2 L1C and/or L2A images the RS data required to improve agricultural statistics. In addition to vegetation indices, using cloudmask and cloud-free mosaics for optical sensors, and for transforming data from SAR sensors, the system allows the choice of several techniques of supervised classification (including random forest) for pixel classification and crop-type map generation.
Raw data in spectral bands can be directly integrated with ground data using multinomial models. Indeed, as shown by Hogland et al. , using spectral-band digital numbers in conjunction with multinomial models is a pixel classification method.
4.Results and discussion
4.1.1Crop acreage estimates at national level
We consider the area covered by an image of size 100 100 km in Castilla y León (Spain), with X coordinates (in meters): (300000,400000) and Y coordinates (4600000,4700000). The sampling unit is a square segment of side 700 meters and sample size in the area is 419 segments.
Ground data are observed at parcel level. However, we aggregated data on crop acreage at segment level for computations, so that represents ground data on acreage of the study crop in segment . Further, the RS data are , where is the number of pixels classified as belonging to the study crop in segment . Model parameter estimates are 0.046 for the independent term and 0.903 for the angular coefficient of . Results for barley data observed in 2018 are in Table 1.
These results show that using RS data, estimator accuracy improved considerably; the amplitude of the confidence interval decreased and the estimation error was reduced by half. The relative efficiency was high; using RS data, the current sample size could be reduced to less than one fifth without loss of accuracy. This is so because RS data are reliable for crop acreage. These results are design-based, which implies that the results change with the sample design. Thus, the reduced sample should be chosen using the same design as the NSO is currently using.
4.1.2Crop yield estimates
Ground data on yield are observed at parcel level, with representing these data for the study crop in parcel of segment . The RS data are denoted , a row vector with 1 in the first position and a set of vegetation indices in the remaining positions. The latter include the normalized difference vegetation index (NDVI) and leaf area index (LAI) (sum of LAI simulated by a Savitsky-Golay interpolation, fitting all LAI observations in the growing season) during (i) sprouting, (ii) flowering, and (iii) ripeness, plus (iv) maximum value of the LAI S-G interpolation and yield simulated by the simple algorithm for yield (SAFY) crop-growth model. Results for barley data observed in 2018 are in Table 2a for NDVI only and in Table 2b for LAI and yield simulated by SAFY.
The relationship between crop yield and the vegetation indices is statistically significant. However, their correlation is not as strong as for crop acreage and as a result, using RS data, yield estimator accuracy improves little. Using RS data, the estimation error is of the same order of magnitude as using only ground data. The relative efficiency is nearly 1; using RS data, the current sample size could be reduced little without loss of accuracy. This is because RS data reliability for crop yield is not as great as for crop acreage. Additional research is required to make the production estimates more reliable through an improved estimate for yield from RS; physically based models are more reliable and can be integrated with RS data.
4.1.3Crop production estimates
We estimate crop production as the product of the crop acreage and yield estimates. Results are in Table 3.
|Data||Production||95% confidence interval (tons: 1000 kg)||Coeff. of|
|(tons) (1000 kg)||Limits||Amplitude||variation (%)|
|Ground RS||965733||Lw: 915194||101078||2.67|
Thanks to improvement in the crop acreage estimator using RS data, the production estimator accuracy increased considerably; the estimation error decreased by half, even if the RS data failed to improve the yield estimate.
4.1.4Crop acreage estimates at province level
To illustrate provincial estimation using areal sampling, we consider barley acreage estimates at the provincial level within the 100 100 km area of Castilla y León (Spain). Results are in Table 4.
|Province||Using only ground data||Using ground & RS data||Relative efficiency|
|Acreage (has.)||Error (CV%)||Acreage (has.)||Error (CV%)||of RS data|
The estimate accuracy at provincial level is, as expected, less than at the national level, but the RS relative efficiency is of the same order of magnitude. In provinces where the sampling error using only ground data is very high (León), the RS contribution is qualitative in the sense that it allows the estimates to be labeled as official statistics by reducing the coefficient of variation below the standard limits (20%) considered acceptable in official statistics.
4.1.5Crop acreage estimates at the municipality level based on linear mixed models
To illustrate small-area estimation using areal sampling, we consider barley acreage estimates at municipality level in that part of Zamora province within the 100 100 km area of Castilla y León (Spain). Results are in Table 5.
|Belver de los Montes||1||212.96||29.1|
|Pinilla de Toro||4||963.30||10.0|
|Quintanilla del Monte||1||466.65||20.3|
|Villamayor de Campos||1||1056.23||11.1|
|Villanueva del Campo||1||784.03||13.2|
|Villar de Fallaves||1||844.16||11.0|
Considering that the sample size at municipality level is small or null, the accuracy of municipality estimates is good, thanks to the RS contribution. In most municipalities, the estimates could be labeled as official statistics because the coefficient of variation is smaller than the standard limit (20%), even when the sample size is only one sampling unit
4.2.1Crop acreage estimates in Nioro Department
To demonstrate crop acreage estimation using point sampling and multinomial models, we consider the Nioro Department of Senegal, in the Kaolack Region. The ground data were observed in a sample of 345 points pixels selected using a list frame. The source of RS data is a crop-type map. In fact, we used a dual frame, since we complemented the NSO list frame with an area frame based on satellite images, in which agricultural areas were distinguished from non-agricultural areas and were stratified into crop types.
In this case, the NSO list frame alone is not sufficient to integrate field and RS data This is because the number of pixels in the agricultural areas is required for expanding the sample estimates to the entire population. This expansion factor is provided by the area frame.
We focus on the two main crops (millet and groundnut) observed in the field sample. The remaining crops (mainly maize) are included in a third category called other, whose probability estimate is one minus the estimate’s probability of millet and groundnut. The RS data are coded according to the Earth Observation (EO) crop type into which pixels are classified: for any pixel in the EO class maize, for any pixel in the EO class millet, for any pixel in the EO class groundnut, and for any pixel in the EO class other crops. Model parameter estimates are in Table 6.
|Crop||EO crop type map|
|Crop type||Hectare||Standard||Coefficient of||Limits of 95% confidence interval|
|Standard errors of proportion estimators||Relative efficiency|
|Crop type||Using only ground data||Using ground & RS data||of RS data|
|Ground data||Number of pixels in the EO class (%)|
|Crop||Number of points/pixels in the sample||Maize||Millet||Groundnut||Other|
|Maize||48||9 (18.8)||26 (54.2)||12 (25.0)||1 (2.0)|
|Millet||134||10 (7.4)||97 (72.4)||23 (17.2)||4 (3.0)|
|Groundnut||163||5 (3.1)||14 (8.6)||143 (87.7)||1 (0.6)|
% Percentage of pixels correctly classified.
Crop acreage estimates of millet and groundnut based on ground and RS data are in Table 7.
We evaluate the RS data efficiency for the acreage estimation of millet and groundnut. Results are in Table 8.
The effect of integrating RS data in the ground sample data was a reduction in the sampling error of millet and groundnut and, as a result, in the confidence interval of these two crops, without loss of design-based consistency. In other words, using RS data, the cost of estimating millet and groundnut acreage could be reduced to less than a third of the current cost, without loss of accuracy. These results are design-based, so it is understood that the reduced sample size should be selected using the currently sampling design used by the NSO.
4.2.2Estimation directly based on pixel classification
There is consensus in the official statistics community on using methods providing design-consistent estimates of the population characteristics under study and measures of estimator uncertainty, such as sampling error, coefficient of variation, or confidence intervals. The method proposed in this paper for integrating RS data in the NSO sampling design agrees with this consensus.
|EO class||Millet||Groundnut||Maize & Other_crops|
|EO class||Estimates of the number of pixels covered by crop in each EO crop type class,|
|Crop type||Number of pixels||Millet||Groundnut||Maize & Other|
|Estimates of the total number of pixels by crop||19184501||8921492||7881468||2381541|
|Acreage (has.)||Error (CV%)||Acreage (has.)||Error (CV%)|
Although the proposed approach allows for any form of RS data, including raw reflectance data in the form of pixel digital numbers, we focus on the use of croptype maps as auxiliary information. In this context, a natural question that arises is: Why not directly use the number of pixels in each EO croptype class to estimate crop acreage instead of using it as auxiliary data in a statistical model?
We follow the suggestion of a referee to clarify this question, comparing the results of our design-based approach (shown in Table 7) with estimates directly based on the croptype map. The comparison is necessarily limited to the crop acreage estimates because the usual algorithms used for croptype map generation do not provide measures of uncertainty comparable to those of Table 7.
Using a random forest classifier, four EO crop types were considered in the croptype map of Nioro: maize, millet, groundnut, and others. The remaining pixels were classified in a non-agricultural landuse class. The number of pixels in each EO croptype class is in Table 11.
As seen in Table 9, there is uncertainty in the pixels classification and any estimate based on it are subject to this uncertainty. Most pixels of millet in the sample (72.4%) are correctly classified in the EO class millet, but a non-negligible percentage of them were confounded with groundnut (17.2%), maize (7.4%), and others (3.0%). For groundnut, the percentage of pixels that were correctly classified is higher (87.7%) than for millet, but the confusion with millet (8.6%) and maize (3.1%) is non-negligible.
In the proposed approach, the multinomial logit model is used for estimating the probability that the crop covering a pixel classified in the EO class is actually crop . The model parameter estimates are in Table 6 and the probability estimates are in Table 10.
To estimate the number of pixels in EO class that are actually covered by crop , we multiplied the total number of pixels in EO class by the estimator of the aforementioned probability : . The estimator of the total number of pixels covered by crop in Nioro, is the sum of the estimators in each EO class. This estimator is design-consistent and the estimates based on it are in Table 11.
Both the number of pixels in EO crop type millet (9470572, i.e., 94705.72 hectares, as a pixel represents 100 m) and in EO crop type groundnut (8076448, i.e., 80764.48 hectares) are larger than the design-based number estimates: 8921492 (89214.92 hectares) for millet and 7881468 (78814.68 hectares) for groundnut. For millet, the difference was 5490.8 hectares (6.2%) and for groundnut 1949.8 hectares (2.5%). For the former, the difference is greater than the design-based standard error (3661.10 hectares) and the coefficient of variation (4.11%). For the latter the difference is smaller than the design-based standard error (2923.94 hectares) and the coefficient of variation (3.71%). These differences are not statistically significant, since the estimates directly based on the EO croptype classes are within the design-based confidence limits, namely, [81978.88, 96330.4] hectares for millet and [73089.15, 84550.98] hectares for groundnut.
However, the comparison must focus not on the results, which are always uncertain, but on the methods. The methods make the major difference; whereas the design-based estimators are in the mainstream of official statistics, the estimators based directly on croptype maps are not. The proposed approach provides designconsistent estimates together with the usual measures of uncertainty (sampling error, coefficient of variation, and confidence intervals). The two main objections to estimators based directly on the croptype map are that (i) they are not design-consistent and (ii) the usual uncertainty measures are not available.
4.2.3Crop acreage estimates at district level in Nioro Department
To illustrate crop acreage estimation at district level using point sampling and multinomial models, we considered the three districts (arrondissements) of Nioro, Medina Sabakh, Paoskoto, and Wack Ngouna. The estimates are in Table 12.
The estimate accuracy at district (arrondissement) level is, as expected, lower than at the national level. However, the sampling error is within the standard limit (coefficient of variation 20%) for labeling as official statistics.
The differences between the number of pixels in the EO crop type millet and the design-based estimate of the number of pixels covered by millet are smaller than the sampling error in Medina Sabakh (0.7%) and Paoskoto (4.1), and larger than the design-based estimate in Wack Ngouna (12.4%). The differences between the number of pixels in the EO crop type groundnut and the design-based estimate of the number of pixels covered by groundnut are smaller than sampling error in the three districts: Medina Sabakh (3.4%), Paoskoto (2.5%), and Wack Ngouna (1.2%).
A result well established in the literature is that RS data improve the cost-efficiency of design-consistent crop acreage estimators based on linear models and simple area samples of georeferenced polygons (segments). The strong RS reliability for crop acreage estimation facilitates a notable improvement of crop production estimates, even if RS data are not sufficiently reliable for yields.
However, many countries use complex list samples of non-georeferenced households or farms for agricultural statistics. We have demonstrated in this paper that with the use of multinomial models, a point georeferenced by parcel in these complex samples is sufficient to enhance the cost efficiency of design-consistent, national-level crop acreage estimators using RS data.
The open-source Sen4Stat system is being designed in such a way that any NSO can use it to improve the cost efficiency of the procedure currently used to obtain crop acreage and production estimates, at both national and minor administrative area levels (province and municipality). The only inputs required by the system are the observed data in the sample currently designed by the NSO for field data collection. The system allows for both a segment simple sample, based on area frames, and a point complex sample, based on list frames of households/farms. In both cases, the system generates from Sentinel images a crop-type map and integrates it with ground data, using linear models in the segment case and multinomial models in the point case.
We have extended the method developed in this paper to two additional RS applications, multi-season estimates and optimization of the sample design. Additional work is required to obtain the data required to test these two additional RS applications.
This research was carried out in the framework of a consortium integrating the Université Catholique de Louvain (UCLouvain) as contractor and the Universidad Politécnica de Madrid (UPM), CS ROMANIA S.A. (C-S RO) and Systèmes d’Information à Référence Spatial (SIRS/CLS) as sub-contractors. The national statistical offices of Ecuador, Malawi, Senegal, Spain, and Tanzania participated in the work as partners and provided the required ground data. SIRS/CLS elaborated the ground databases, C-S RO developed the system for image classification, and UPM developed statistical prototypes for the integration of ground and RS data. UCLouvain coordinated the project and contributed mainly to the specification of products and services, system design RS database elaboration for integration with the ground database, and the full-scale demonstration plan.
We thank the referees and the subject editor for their very helpful comments and suggestions.
This work was supported by the European Space Agency in the Sen4Stat project framework [Contract No. 4000127181/19/I-NS]. This agency contributed to the study design, together with a steering committee made up of members of international organizations such as the World Bank and FAO, with the aim of achieving synergies between this project and those launched by these organizations to improve agricultural statistics.
Fonteneau F, Delincé J. Surveying Farms in the 21st Century. In: FAO editor. Handbook on the Agricultural Integrated Survey (AGRIS); Global Strategy to Improve Agricultural and Rural Statistics. Rome: FAO; (2018) ; pp. 1-6.
Zezza A, Gourlay S, Molini V. Closing the data gap in agriculture through sustainable investment in the data value chain: Realizing the vision of the 50x2030 Initiative. Stat J IAOS. (2022) ; 38: : 57-62. doi: 10.3233/SJI-220933.
Hanuschak GA, Allen RD, Wigton WH. Integration of Landsat data into the crop estimation program of USDA’s Statistical Reporting Service (1972-1982). Invited paper at: 8th International Symposium on Machine Processing of Remote Sensed Data. 1982 July 7–9; West Lafayette, Indiana: Purdue University. Available from: https//www.lars.purdue.edu/home/references/sym_1982/sym_1982.html.
Allen JD. A look at the Remote Sensing Applications Program of the National Agricultural Statistics Service. J Off Stat. (1990) ; 6: (4): 393-409.
Ambrosio L, Alonso R, Villa A. Estimación de superficies cultivadas por muestreo de áreas y teledetección. Precisión relativa. Estad Esp. (1993) ; 35: (132): 91-103.
Delincé J. Cost-Effectiveness of Remote Sensing for Agricultural Statistics in Developing and Emerging Economies. In: FAO editor. GSARS Technical Report GO09-2015. Rome: FAO; 2015. doi: 10.13140/RG.2.2.25985.45927.
Gallego FJ. Remote sensing and land cover area estimation. Int J Remote Sens. (2004) ; 25: (15): 3019-47. doi: 10.1080/01431160310001619607.
Firth D, Bennett KE. Robust models on probability sampling. J. R. Stat. Soc. Series B Stat. Methodol. (1998) ; 60: :3-21. doi: 10.1111/1467-9868.00105.
Agresti A. Categorical data analysis. New York: Wiley; (2002) ; 710 p.
Hogland J, Billor N, Anderson N. Comparison of standard maximum likelihood classification and polytomous logistic regression used in remote sensing. Eur J Remote Sens. (2013) ; 46: (1): 623-40. doi: 10.5721/EuJRS20134637.
FAO. Handbook on Master Sampling Frames for Agricultural Statistics. Frame Development, Sample Design and Estimation. Improving Agricultural and Rural Statistics. Global Strategy. Rome: FAO; (2015) Dec. 170 p.
Särndal CE, Swensson B, Wretman, J. Model Assisted Survey Sampling. New York: Springer; (1992) ; 695 p.
Fuller WA. Sampling statistics. New York: Wiley; (2009) ; 472 p. doi: 10.1002/9780470523551.
Cotter J, Nealon J. Area frame design for agricultural surveys. Research and Applications Division. National Agricultural Statistics Service. United States Department of Agriculture. Washington, D.C., (1987) .
FAO. Multiple frame agricultural surveys. Volume 1: Current surveys based on area and list sampling methods. Volume 2: Agricultural survey programmes based on area frame or dual frame (area and list) sample designs. Statistical development series 7 and 10. Rome: FAO; 1996, 1998.
Ministry of Agriculture, Livestock and Fisheries. 2014/15 Annual Agricultural Sample Survey Report. Zanzibar (TZ); (2016) ; 72 p. Available from: http//hdl.handle.net/20.500.12018/2905.
Yates F. Sampling Methods for Census and Surveys. London: Griffin. (1949) ; 318 p.
Jinguji I. Dot sampling method for area estimation. In: Crop Monitoring for Improved Food Security. Srivastava M.K. editor, Bangkok (TH): FAO and ADB; (2015) RAP PUBLICATION 2014/28: pp. 27-48.
Gay C, Porchier JC. Land cover and land use classification using TER-UTI. In: Holland TE and van den Broecke MPR, editors. Proceedings of Agricultural Statistics 2000, an International Conference on Agricultural Statistics 1998 Mar 18-20: pp. 93-201. Available from: https//www.isi-web.org/isi.cbs.nl/iamamember/Books/agric2000/page-193.pdf.
Nusser SM, Goebel JJ. The National Resources Inventory: a long-term multi-resource monitoring programme. Environ Ecol Stat. (1997) ; 4: (3): 181-204. doi: 10.1023/A1018574412308.
Nusser SM, Goebel JJ, Fuller WA. Design and estimation for investigating the dynamics of natural resources. Ecol Appl. (1998) ; 8: (2): 234-45. doi: 10.1890/1051-0761(1998)008[0234:DAEFIT]2.0.CO;2.
Azzari G, Jain S, Jeffries G, Kilic T, Murray S. Understanding the requirements for surveys to support satellite-based crop type mapping: evidence from Sub-Saharan Africa. Remote Sens. (2021) ; 13: (23), 4749. doi: 10.3390/rs13234749.
Direction de l’Analyse de la Prévision et des Statistiques Agricoles. Ministère de l’Agriculture et de l’Equipement Rural. Senegal – Enquete Agricole Annuelle. Methodologie et plan de sondage de l’enquete agricole. Senegal. (2017) . http//anads.ansd.sn/index.php/catalog/218/download/1846.
MDcCulloch CE, Searle SR. Generalized, linear, and mixed models. New York: Wiley; (2001) ; 325 p.
Ambrosio L, Iglesias L. Land cover estimation in small areas using ground surveys and remote sensing. Remote Sens. Environ. (2000) ; 74: : 240-8. doi: 10.1016/S0034-4257(00)00114-0.
Saei A, Chambers R. Small area estimation under linear and generalized linear mixed models with time and area effects. Methodology Working Paper M03/15. Southampton (UK): Southampton Statistical Sciences Research Institute, (2003) : 31 p. http//eprints.soton.ac.uk/id/eprint/8165.
Molina I, Saei A, Lombardia MJ. Small area estimates of labour force participation under a multinomial logit mixed model. J. R. Stat. Soc. Ser. A Stat. Soc. (2007) ; 170: (4): 975-1000. doi: 10.1111/j.1467-985X.2007.00493.x.
López-Vizcaíno E, Lombardía MJ, Morales D. Small area estimation of labour force indicators under a multinomial model with correlated time and area effects. J. R. Stat. Soc. Ser. A Stat. Soc. (2015) ; 178: (3): 535-65. doi: 10.1111/rssa.12085.
Fuller WA. Environmental surveys over time. J Agric Biol Environ Stat. (1999) ; 4: (4): 331-45. doi: 10.2307/1400493.
Fuller WA, Breidt FJ. Estimation for supplemented panels. The Indian Journal of Statistics, Serie B. (1999) ; 61: (1): 58-70. https//www.jstor.org/stable/25053068.
Cochran WG. Relative accuracy of systematic and stratified random samples for a certain class of populations. Annals of Mathematical Statistics. (1946) ; 17: (2): 164-77.
Ambrosio L, Iglesias L, Marin C. Systematic sample design for the estimation of spatial means. Environmetrics. (2003) ; 14: (1): 45-61. doi: 10.1002/env.564.
Defourny P, Bontemps S, Bellemans N, Cara C, Dedieu G, Guzzonato E, el al. Near real-time agriculture monitoring at national scale at parcel resolution: performance assessment of the Sen2-Agri automated system in various cropping systems around the world. Remote Sens. Environ. (2019) ; 221: : 551-68. doi: 10.1016/j.rse.2018.11.007.
Fuller WA, Isaki CT. Survey design under superpolutation models. In: Current Topics in Survey Sampling. Krewski D, Rao JNK and Platek R, editors, New York: Academic Press; (1981) ; pp. 199-226.
Ambrosio L, Iglesias L. Identifying the most appropriate sampling frame for specific landscape types. Technical Report Series. GO-01-2014. FAO. https//www.fao.org/3/ca6436en/ca6436en.pdf.
Ambrosio L, Iglesias L, Marín C. A model-assisted approach to identify a cost-efficient spatial sampling strategy. Paper presented at: Eighth International Conference on Agricultural Statistics (ICAS-VIII); (2019) Nov 18-21; New Delhi (IN).