Integration of remote sensing data into national statistical office sampling designs for agriculture
Abstract
The integration of remote sensing data in agricultural statistics is a research topic with a long history. The research focus is on using statistical models to link ground and remote sensing data such that the resulting estimators are design-consistent.
A design-consistent estimator assisted by linear models is well established in the literature. However, it requires enough geographic information about the boundaries of agricultural parcels to develop a simple sample of areas. Many countries use complex samples based on non-georeferenced list frames of households or farms and reduce to point data the georeferenced information required for linking ground and remote sensing data.
Data on crop acreage observed at a point are necessarily categorical because a point is dimensionless. Little work has been done on the integration of categorical ground data within complex list samples using remote sensing data. Our focus was on using multinomial logit models for this integration. Special attention was paid to evaluate the cost efficiency of remote sensing data.
1.Introduction
Recently, the European Space Agency (ESA) launched the Sentinels for Agricultural Statistics(Sen4Stat) project. The aim was to facilitate the use of remote sensing information at the National Statistical Offices (NSOs) supporting agricultural statistics. Other international programs to improve agricultural statistics were begun around the same time. An additional Sen4Stat target is synergy with other initiatives, including the Living Standards Measurement Study – Integrated Survey on Agriculture (LSMS-ISA), which was launched in 2008 by the World Bank, and the Global Strategy to Improve Agricultural and Rural Statistics (GSARS), which was endorsed in 2010 by the United Nations Statistical Commission and implemented by the Food and Agriculture Organization (FAO) of the United Nations. An FAO output is the Agricultural Integrated Survey (AGRIS) method [1], together with the 50
Remote sensing (RS) furnishes a very useful auxiliary data source to improve the cost efficiency of design-based inference, and the integration of ground and remote sensing data is a research topic with a long history [3, 4, 5, 6, 7]. Statistical models are used for this integration, in such a way that the estimator assisted by these models is design-consistent [8].
The focus has been on linear models, assuming that the sample design is simple. In countries where geographic information of agricultural parcel boundaries is available, it is possible to use an individual or cluster of parcels as the sampling unit and to design a simple sample of areas (polygons) using area frames. A linear model is suitable for integrating ground and remote sensing data on crop acreage or yield because these are continuous variables. However, many countries use complex samples based on non-georeferenced list frames (of households or farms) and reduce to point data the georeferenced information required for linking ground and remote sensing data. Crop data observed at a point are necessarily categorical because a point is dimensionless, and multinomial logit is an appropriate statistical model to integrate remote sensing and categorical ground data [9, 10].
Little work has been done on the integration of categorical ground data in complex list samples of households or farms with remote sensing data. In this paper, we focus on using multinomial logit models for this integration. We give details on statistical models to be used for crop acreage, yield and production estimation, as well as on the sample design used by five Sen4Stat partner countries that are representative of the large types of sample designs proposed in the literature. These are list, area, and point sampling [11].
Considering the literature on the applications of remote sensing to agricultural statistics, as well as expectations of the NSO partner countries for these applications, we elaborated four use cases: 1. Cost efficiency; 2. Granularity; 3. Multi-seasonal estimation; 4. Optimizing the sample design. A prototype for integrating ground and remote sensing data in each use case was developed for a large set of sampling designs, including those currently used by the partner countries. We include results of the application of these prototypes to data provided by the NSOs, for cost efficiency and granularity. Finally, we outline prototypes for multi-seasonal estimation and optimizing the sample design.
2.Methods
The approach to elaborate official statistics is well established in the literature. It is design-based in the sense that the inference is based on the sampling distribution generated by the probabilistic sampling scheme designed to select the sample [12, 13]. This sampling scheme associates with each population unit,
Our focus was on improving the cost-efficiency of the currently used sampling designs for elaborating official statistics on crop acreage, using RS as auxiliary data. The approach to using auxiliary data in design-based inference is well stablished in the literature [8], i.e., the auxiliary extra-sample information is integrated in the sampling design using statistical models, without loss of design-consistency. In minor administrative areas (municipalities) the sample size is small or nil, so that design consistency is meaningless. Thus, a model-based approach is used to get estimates, whose accuracy strongly depends on the auxiliary information reliability.
Model suitability for ground and remote sensing data integration depends on the nature of ground data. For continuous ground data a linear regression model is used, but for categorical and counts ground data, a generalized linear model is more suitable [9, 10]. Both types of models allow remote sensing data of any character, continuous or discrete. The efficiency of RS as auxiliary data for crop acreage has already been extensively demonstrated in the scientific literature, but only for continuous ground data and with the assumption that the sample design is simple. To date, little work has been done on the integration of categorical ground data in complex list samples of households or farms with remote sensing data. In this paper, we focus on using multinomial logit models for this integration.
2.1Sample designs
The design of the sampling scheme for use in sample selection is key in design-based inference because it associates with each population unit
A sampling frame is required to select a probabilistic sample and we considered the two main types of sampling frames used in practice, area frame and list frame. The former is based on maps and/or satellite images and the latter on population censuses. Sampling units are either polygons or points in area frames and households or farms in list frames. Usually, the former is georeferenced, but not the latter [11].
The area frame has important advantages over list frames, including versatility, reduction of non-sampling errors such as coverage errors, and longevity. However, it has some inconvenience (sensitivity to outliers and sub representation of rare or minor activities) [14]. To overcome these inconveniences, a dual frame composed of an area and list frame is recommended [15]. In Section 4, we give details on the dual frame used in Senegal to integrate ground and RS data.
Using area frames, it is possible to define sampling units (polygons) of the same size and simple samples with equal inclusion probabilities. In contrast, using list frames, the sample selection scheme entails two or more stages, the sampling units are usually of distinct size, and the probabilistic scheme assigns unequal (size-proportional) inclusion probabilities. Therefore, the sample is not simple but complex.
The integration of ground and RS data will be illustrated for the set of sample designs most often used in practice for collecting ground data [11]. With area frames, systematic sampling with one or more random starts and stratified random sampling are most often used as the sample schemes when geographic information on parcel boundaries is available. If the parcel boundaries are unknown, the sampling unit is usually a point, and we will illustrate how to integrate a sample of points selected from a list frame with RS data.
Agricultural data are frequently collected using national household surveys, for which the sample is selected from a list frame (a population census) in two or more stages, with unequal inclusion probabilities. The sampling unit is a household and geographic information at parcel level is rarely available. For RS data integration, georeferenced information is required at a minimum. To have at least one georeferenced point by agricultural parcel within the household sample, an additional sampling stage is used to select the parcel centroid.
2.1.1Area sampling
Three of the five Sen4Stat partner countries use area samples, namely, Spain, Ecuador, and Tanzania. In Spain, the area frame is the national topographic map. The sampling unit is a square segment of side 700 m (49 hectares). The sample design is non-stratified systematic with three random starts. The sample of segments for crop acreage estimation is selected in a single stage with inclusion probability
In Ecuador, the sampling frame is a land-use map, stratified into four strata according to the percentage of cultivated area. The sampling unit is a segment with geometric boundaries, whose size changes with strata: 9 hectares (ha) in stratum 1a, 36 ha in stratum 1b, 144 ha in stratum 2, and 576 ha in stratum 3. The inclusion probability is equal within strata,
In Tanzania the sampling frame used to be a list frame based on an agricultural census. Recently, in the framework of the GSARS program, the Government of Tanzania in collaboration with the FAO, United States Department of Agriculture (USDA) and African Development Bank, designed an area sample [16]. The sampling frame is a map of the country stratified by land use and the sampling unit is a point. Point sampling has a long history [17, 18]. In agricultural statistics, it is used for selecting a sample of farms when a sampling frame is not available; only an accurate map is required [11]. The sample of farms is used for data on a large set of items, generally by direct interview of farmers.
Some of the aforementioned quantities, including crop acreage and yield and natural resources, can be observed directly on the ground, bypassing the need to contact farmers for data collection. France is an example of countries using point sampling for crop acreage estimation, using data collected directly on the ground [19]. Furthermore, the USA is an example of countries using point sampling for a national inventory of natural resources [20, 21].
The sampling design is basically the same in the two aforementioned countries, i.e., a two-stage sampling in which the primary sampling unit (PSU) is a square segment and the secondary sampling unit (SSU) is a point. In the USA National Resources Inventory survey, the segments are stratified, and the side of the typical segment is 800 meters (1/2 mile), i.e., 64 hectares, with some smaller (200 meters) and some larger (1600 meters). The primary sampling rates are generally 2% to 6% of the land area. There are about 300,000 PSUs and 844,000 points, and the second-stage sample size is between 1 and 3 points per PSU.
In the French TER-UTI, the sampling design is non-stratified, one-stage, and systematic. The national territory is divided into square blocks with side 12000 meters. Each block is divided into four square sections of side 6000 meters. In each section, a systematic sample of 36 points is selected, aligned in both row and column directions, separated by a distance of 300 meters, forming a 6
This sampling scheme can be seen as two-stage, so the inclusion probability is of the form
2.1.2List sampling
Two of the five Sen4Stat partner countries use list sampling, Malawi and Senegal. In Malawi, the ground data were collected in the framework of the LSMS-ISA program. The IHS5 sampling frame is based on the 2018 Malawi Population and Housing Census. It is stratified into rural and urban strata and each of the 27 districts were considered a separate sub-stratum, part of the main rural stratum. A two-stage sampling scheme was used to select the sample from each district. The PSU is the Enumeration Area (EA) and the first-stage sample is systematic without replacement, with probabilities proportional to the number of households in the PSUs. The SSU is a household and a systematic sample of 16 households was used to choose the second-stage sample, using equal probabilities. To integrate with RS data, a georeferenced point is selected in each parcel of the sample [22].
In Senegal, the sampling frame is based on the 2013 population census [23]. The sampling unit is a household and the sample was designed within the AGRIS program framework. This is a two-stage sampling scheme in which the PSU is the EA, and the first-stage sample is selected with replacement and probabilities proportional to the number of households in the PSUs. The SSU is a farm (households with agricultural activities) and the second-stage sample is selected without replacement and with equal probabilities among households with agricultural activities.
To integrate the data in the list sample of farms with RS data, a point is selected in each parcel within the two-stage sample. The inclusion probability of this point is
2.2Statistical models
Statistical models are used to integrate the auxiliary extra-sample information into the sampling design without loss of design-consistency [8]. We used linear models if field data were continuous and multinomial models if those data were categorical [9, 24].
2.2.1Linear models
We consider a population of
The survey variable total in the population is
To estimate
The error of
The asymptotic distribution of
A design-consistent estimator for the mean is
A domain is a part of the population, for instance, a region
The sample was designed to achieve the required estimate accuracy at national level. However, reliable estimates in small administrative areas such as a municipality are also required without increasing sample size
For estimates at municipality level, we used a model-based estimator to “borrow strength” from related small areas to obtain accurate estimates for a given small area. Let
We consider the unit-level linear mixed model
The model-based estimator of the total survey variable in a small area
An unbiased estimator of the total mean-square error estimator
where
Here,
2.2.2Multinomial logit models
If geographic information on parcel boundaries is unavailable, then it is not possible to generate RS data at parcel level. In this case, we use a point as the unit of observation. Because a point is dimensionless, observed crop data are necessarily categorical. For integrating categorical ground and RS data, multinomial logit is a suitable model [9, 10, 24].
We consider the survey vector
The population total of the survey vector is
We assume that the survey vector
To link
The design-based parameter estimator
The survey variable total in the domain
Works in the literature [26, 27, 28] for small area estimation based on multinomial mixed models follow a penalized quasi-likelihood approach. As pointed out by McCulloch and Searle [24], these methods are not completely satisfactory in practice. Those authors recommended instead a linearization of the non-linear multinomial models and used linear mixed models. Additional research is required to achieve a completely satisfactory solution to this problem
2.2.3Cost efficiency
Cost efficiency is the usual criterion for comparing a set of sampling strategies developed for estimating the same characteristic in the same population. Let
Although the Sentinel images and software required for the integration of ground and RS data are provided by ESA for free, there are costs for the NSOs such as the time of experts and commercial cloud services required for storage and computations. Because estimation of these other costs is a difficult task, we assume in the following that
The relative efficiency of RS data with respect to ground data is
We evaluate the effect of RS data on the cost efficiency using either linear or multinomial models. Spain and Ecuador are partner countries using area samples selected in only one stage, with equal inclusion probabilities, and we consider linear models for RS data integration. In Spain, the sampling frame is not stratified and the inclusion probability is
The sampling variance using only ground data is
Using RS data, the sampling variance is
Using RS data, the current sampling error
Senegal, Malawi, and Tanzania use point sampling with unequal inclusion probabilities
RS cost-efficiency relies on the RS contribution to reduce the sampling variance of the residuals
2.2.4Other applications
As mentioned in the Introduction, “multi-seasonal estimation” and “optimizing the sample design” are applications of RS data useful for agricultural statistics. We developed prototypes for these two applications but we cannot test them here because the required data are unavailable. Thus, we are limited to outlining the developed prototypes.
We consider the annual agricultural cycle divided into, say, four seasons. We use the subscript
2.2.4.1.1. The sample
We examine a two-phase sampling strategy. The first-phase sample is the NSO sample already in place and used for agricultural statistics. We use the subscript 1 to refer to this sample, for instance,
For the first season, the panel component is a sample of
2.2.4.1.2. Composite estimators
A composite estimator is a function of single estimators, each defined separately for panel and specific samples. We consider the linear model
Let
We assume that panel and specific samples in the same season are independent and, as a result, in the error covariance matrix
To estimate
As a result, we have the linear mixed model
The autocovariance function of the process
The ratio
The sample design is optimized to find the design variable values that minimize the sampling variance subject to cost (budget) constraints. The sampling variance depends on the spatial correlation structure of the survey variable and we follow a superpopulation approach to identify this structure. We limit ourselves to area samples and assume that the population values of the survey variable
To assess the correlation structure
It was demonstrated by Ambrosio et al. [32] how this approach should be followed to design systematic samples using a land-use map. Here, we propose instead to use a crop-type map [33] to estimate the parameters of the correlogram model. The estimated correlogram can be used to evaluate the anticipated variance [34] as a function of design variables for the set of sampling strategies used in practice [35, 36].
2.2.4.2.1. Anticipated variance
We illustrate the proposed approach considering segments of size
2.2.4.2.2. The optimization problem
The design variables are the segment and sample sizes, and we find optimal values of these variables by solving the following optimization problem:
The solution to this problem is the optimum segment size
3.Data
We used field data collected by the NSO of Spain to illustrate our approach to the integration of continuous ground data in the sampling design using linear models, and field data collected by the NSO of Senegal to illustrate our approach to the integration of categorical ground data into the sampling design using multinomial models.
The two RS products traditionally used as auxiliary data for crop acreage and yield estimation are pixel classification by crop type (for the former) and a set of vegetation indices (for the latter). For training pixel classification models, ground georeferenced data of crop type is required. In many countries, data from the agricultural sector are collected using national household or farm surveys, and no geographic information at parcel level is available.
Table 1
Uncertainty | |||||
---|---|---|---|---|---|
Data | Acreage | 95% Confidence interval (hectares) | Coeff. of | Relative efficiency | |
(hectare) | Limits | Amplitude | variation (%) | of RS data | |
Ground | 236165.4 | Lw: 215951.7 | 40427.2 | 4.37 | – |
Up: 256379.0 | |||||
Ground | 228550.1 | Lw: 219699.8 | 17700.5 | 1.98 | 5.2 |
Up: 237400.3 |
Table 2a
Uncertainty | |||||
---|---|---|---|---|---|
Data | Yield | 95% Confidence interval (kg/hectare) | Coeff. of | Relative efficiency | |
(kg/hectare) | Limits | Amplitude | variation (%) | of RS data | |
Ground | 4213.6 | Lw: 4093.9 | 239.2 | 1.45 | – |
Up: 4333.2 | |||||
Ground | 4155.6 | Lw: 4033.2 | 244.9 | 1.50 | 0.95 |
Up: 4278.1 |
Table 2b
Uncertainty | |||||
---|---|---|---|---|---|
Data | Yield | 95% Confidence interval (kg/hectare) | Coeff. of | Relative efficiency | |
(kg/hectare) | Limits | Amplitude | variation (%) | of RS data | |
Ground | 2352.0 | Lw: 2227.4 | 249.1 | 2.70 | – |
Up: 2476.6 | |||||
Ground | 2327.8 | Lw: 2215.1 | 225.4 | 2.47 | 1.22 |
Up: 2440.6 |
Azzari et al. [22] compared several ways of generating the data needed to train pixel classification models when no such parcel-level information is available. Focusing on integrating ground survey data of maize in Malawi and Ethiopia from national household surveys using the Sentinel-2 satellite, the authors evaluated the accuracy of pixel classification producing georeferenced data at parcel level, ranging from the full parcel boundary to only one point (centroid).
The authors concluded that collecting full-parcel boundary data or GPS coordinates of the polygon defined by complete parcel corner points yields the best-quality information for model training; however, the use of mid-sized sample (3000–4000 parcels) plot centroids could perform similarly to full plot boundaries. The authors did not consider the statistical model required to achieve the integration of ground data with Sentinel-2 to improve the accuracy of crop acreage estimates, obtaining design-based consistency.
In Sen4Stat, the focus was on developing an open-source system that permits any user to generate from Sentinel 1 SCL Sentinel-2 L1C and/or L2A images the RS data required to improve agricultural statistics. In addition to vegetation indices, using cloudmask and cloud-free mosaics for optical sensors, and for transforming data from SAR sensors, the system allows the choice of several techniques of supervised classification (including random forest) for pixel classification and crop-type map generation.
Raw data in spectral bands can be directly integrated with ground data using multinomial models. Indeed, as shown by Hogland et al. [10], using spectral-band digital numbers in conjunction with multinomial models is a pixel classification method.
4.Results and discussion
4.1Spain
4.1.1Crop acreage estimates at national level
We consider the area covered by an image of size 100
Ground data are observed at parcel level. However, we aggregated data on crop acreage at segment level for computations, so that
These results show that using RS data, estimator accuracy improved considerably; the amplitude of the confidence interval decreased and the estimation error was reduced by half. The relative efficiency was high; using RS data, the current sample size could be reduced to less than one fifth without loss of accuracy. This is so because RS data are reliable for crop acreage. These results are design-based, which implies that the results change with the sample design. Thus, the reduced sample should be chosen using the same design as the NSO is currently using.
4.1.2Crop yield estimates
Ground data on yield are observed at parcel level, with
The relationship between crop yield and the vegetation indices is statistically significant. However, their correlation is not as strong as for crop acreage and as a result, using RS data, yield estimator accuracy improves little. Using RS data, the estimation error is of the same order of magnitude as using only ground data. The relative efficiency is nearly 1; using RS data, the current sample size could be reduced little without loss of accuracy. This is because RS data reliability for crop yield is not as great as for crop acreage. Additional research is required to make the production estimates more reliable through an improved estimate for yield from RS; physically based models are more reliable and can be integrated with RS data.
4.1.3Crop production estimates
We estimate crop production as the product of the crop acreage and yield estimates. Results are in Table 3.
Table 3
Uncertainty | ||||
---|---|---|---|---|
Data | Production | 95% confidence interval (tons: 1000 kg) | Coeff. of | |
(tons) (1000 kg) | Limits | Amplitude | variation (%) | |
Ground | 987841 | Lw: 903640 | 168402 | 4.35 |
Up: 1072042 | ||||
Ground | 965733 | Lw: 915194 | 101078 | 2.67 |
Up: 1016272 |
Thanks to improvement in the crop acreage estimator using RS data, the production estimator accuracy increased considerably; the estimation error decreased by half, even if the RS data failed to improve the yield estimate.
4.1.4Crop acreage estimates at province level
To illustrate provincial estimation using areal sampling, we consider barley acreage estimates at the provincial level within the 100
Table 4
Province | Using only ground data | Using ground & RS data | Relative efficiency | ||
---|---|---|---|---|---|
Acreage (has.) | Error (CV%) | Acreage (has.) | Error (CV%) | of RS data | |
León | 6853.2 | 24.15 | 6834.5 | 16.20 | 2.3 |
Palencia | 88602.0 | 7.36 | 90535.3 | 3.33 | 4.7 |
Valladolid | 128209.5 | 5.57 | 119707.4 | 2.66 | 5.1 |
Zamora | 12324.2 | 17.71 | 10948.2 | 8.16 | 6.4 |
Total area | 235989.1 | 4.37 | 228028.5 | 1.98 | 5.2 |
The estimate accuracy at provincial level is, as expected, less than at the national level, but the RS relative efficiency is of the same order of magnitude. In provinces where the sampling error using only ground data is very high (León), the RS contribution is qualitative in the sense that it allows the estimates to be labeled as official statistics by reducing the coefficient of variation below the standard limits (20%) considered acceptable in official statistics.
4.1.5Crop acreage estimates at the municipality level based on linear mixed models
To illustrate small-area estimation using areal sampling, we consider barley acreage estimates at municipality level in that part of Zamora province within the 100
Table 5
Municipality | Sample | Acreage | |
---|---|---|---|
size | Hectares | Coeff. of | |
(segments) | variation (%) | ||
Belver de los Montes | 1 | 212.96 | 29.1 |
Castroverde | 3 | 2914.22 | 8.0 |
Pinilla de Toro | 4 | 963.30 | 10.0 |
Quintanilla del Monte | 1 | 466.65 | 20.3 |
Toro | 1 | 615.91 | 14.0 |
Vezdemarbán | 3 | 1358.22 | 12.6 |
Villalpando | 2 | 560.05 | 39.1 |
Villamayor de Campos | 1 | 1056.23 | 11.1 |
Villanueva del Campo | 1 | 784.03 | 13.2 |
Villar de Fallaves | 1 | 844.16 | 11.0 |
Villardondiego | 1 | 516.40 | 11.5 |
Villavendimio | 1 | 656.07 | 10.4 |
Total Zamora | 20 | 10948.20 | 8.2 |
Considering that the sample size at municipality level is small or null, the accuracy of municipality estimates is good, thanks to the RS contribution. In most municipalities, the estimates could be labeled as official statistics because the coefficient of variation is smaller than the standard limit (20%), even when the sample size is only one sampling unit
4.2Senegal
4.2.1Crop acreage estimates in Nioro Department
To demonstrate crop acreage estimation using point sampling and multinomial models, we consider the Nioro Department of Senegal, in the Kaolack Region. The ground data were observed in a sample of 345 points pixels selected using a list frame. The source of RS data is a crop-type map. In fact, we used a dual frame, since we complemented the NSO list frame with an area frame based on satellite images, in which agricultural areas were distinguished from non-agricultural areas and were stratified into crop types.
In this case, the NSO list frame alone is not sufficient to integrate field and RS data This is because the number of pixels in the agricultural areas is required for expanding the sample estimates to the entire population. This expansion factor is provided by the area frame.
We focus on the two main crops (millet and groundnut) observed in the field sample. The remaining crops (mainly maize) are included in a third category
Table 6
Crop | EO crop type map | |||
---|---|---|---|---|
Maize | Millet | Groundnut | Other crops | |
Millet | 1.47958334 | 0.88931346 | 8.76504093 | |
Groundnut | 2.89873948 | 1.21898514 |
Table 7
Uncertainty | ||||||
Crop type | Hectare | Standard | Coefficient of | Limits of 95% confidence interval | ||
error | variation (%) | Lower | Upper | Amplitude | ||
Millet | 89215 | 3661.1 | 4.11 | 81978.8 | 96330.4 | 14351.5 |
Groundnut | 78815 | 2923.9 | 3.71 | 73089.1 | 84550.9 | 11461.8 |
Table 8
Standard errors of proportion estimators | Relative efficiency | ||
Crop type | Using only ground data | Using ground & RS data | of RS data |
Millet | 3.37 | 1.90 | 3.13 |
Groundnut | 3.34 | 1.52 | 4.80 |
Table 9
Ground data | Number of pixels in the EO class (%) | ||||
---|---|---|---|---|---|
Crop | Number of points/pixels in the sample | Maize | Millet | Groundnut | Other |
Maize | 48 | 9 (18.8) | 26 (54.2) | 12 (25.0) | 1 (2.0) |
Millet | 134 | 10 (7.4) | 97 (72.4) | 23 (17.2) | 4 (3.0) |
Groundnut | 163 | 5 (3.1) | 14 (8.6) | 143 (87.7) | 1 (0.6) |
Total | 345 | 24 | 137 | 178 | 6 |
Crop acreage estimates of millet and groundnut based on ground and RS data are in Table 7.
We evaluate the RS data efficiency
The effect of integrating RS data in the ground sample data was a reduction in the sampling error of millet and groundnut and, as a result, in the confidence interval of these two crops, without loss of design-based consistency. In other words, using RS data, the cost of estimating millet and groundnut acreage could be reduced to less than a third of the current cost, without loss of accuracy. These results are design-based, so it is understood that the reduced sample size should be selected using the currently sampling design used by the NSO.
4.2.2Estimation directly based on pixel classification
There is consensus in the official statistics community on using methods providing design-consistent estimates of the population characteristics under study and measures of estimator uncertainty, such as sampling error, coefficient of variation, or confidence intervals. The method proposed in this paper for integrating RS data in the NSO sampling design agrees with this consensus.
Table 10
EO class | Millet | Groundnut | Maize & Other_crops |
---|---|---|---|
Maize | 0.395 | 0.206 | 0.399 |
Millet | 0.739 | 0.093 | 0.168 |
Groundnut | 0.113 | 0.841 | 0.046 |
Other | 0.997 | 0.001 | 0.002 |
Table 11
EO class | Estimates of the number of pixels covered by crop | |||
---|---|---|---|---|
Crop type | Number of pixels | Millet | Groundnut | Maize & Other |
|
|
|
| |
Maize | 1032628 | 407587 | 213088 | 411953 |
Millet | 9470572 | 7000329 | 876040 | 1594203 |
Groundnut | 8076448 | 910538 | 6791735 | 374175 |
Other | 604853 | 603038 | 605 | 1210 |
Estimates of the total number of pixels by crop | 19184501 | 8921492 | 7881468 | 2381541 |
Table 12
Arrondissement | Millet | Groundnut | ||
---|---|---|---|---|
Acreage (has.) | Error (CV%) | Acreage (has.) | Error (CV%) | |
Medina Sabakh | 20067.2 | 8.6 | 19765.3 | 7.3 |
Paoskoto | 38316.0 | 5.3 | 35018.2 | 4.0 |
Wack Ngouna | 30831.7 | 11.9 | 24030.7 | 10.7 |
Total Nioro | 89215,0 | 4.1 | 78815,0 | 3.7 |
Although the proposed approach allows for any form of RS data, including raw reflectance data in the form of pixel digital numbers, we focus on the use of croptype maps as auxiliary information. In this context, a natural question that arises is: Why not directly use the number of pixels in each EO croptype class to estimate crop acreage instead of using it as auxiliary data in a statistical model?
We follow the suggestion of a referee to clarify this question, comparing the results of our design-based approach (shown in Table 7) with estimates directly based on the croptype map. The comparison is necessarily limited to the crop acreage estimates because the usual algorithms used for croptype map generation do not provide measures of uncertainty comparable to those of Table 7.
Using a random forest classifier, four EO crop types were considered in the croptype map of Nioro: maize, millet, groundnut, and others. The remaining pixels were classified in a non-agricultural landuse class. The number of pixels in each EO croptype class is in Table 11.
As seen in Table 9, there is uncertainty in the pixels classification and any estimate based on it are subject to this uncertainty. Most pixels of millet in the sample (72.4%) are correctly classified in the EO class millet, but a non-negligible percentage of them were confounded with groundnut (17.2%), maize (7.4%), and others (3.0%). For groundnut, the percentage of pixels that were correctly classified is higher (87.7%) than for millet, but the confusion with millet (8.6%) and maize (3.1%) is non-negligible.
In the proposed approach, the multinomial logit model is used for estimating the probability
To estimate the number of pixels
Both the number of pixels in EO crop type millet (9470572, i.e., 94705.72 hectares, as a pixel represents 100 m
However, the comparison must focus not on the results, which are always uncertain, but on the methods. The methods make the major difference; whereas the design-based estimators are in the mainstream of official statistics, the estimators based directly on croptype maps are not. The proposed approach provides designconsistent estimates together with the usual measures of uncertainty (sampling error, coefficient of variation, and confidence intervals). The two main objections to estimators based directly on the croptype map are that (i) they are not design-consistent and (ii) the usual uncertainty measures are not available.
4.2.3Crop acreage estimates at district level in Nioro Department
To illustrate crop acreage estimation at district level using point sampling and multinomial models, we considered the three districts (arrondissements) of Nioro, Medina Sabakh, Paoskoto, and Wack Ngouna. The estimates are in Table 12.
The estimate accuracy at district (arrondissement) level is, as expected, lower than at the national level. However, the sampling error is within the standard limit (coefficient of variation
The differences between the number of pixels in the EO crop type millet and the design-based estimate of the number of pixels covered by millet are smaller than the sampling error in Medina Sabakh (0.7%) and Paoskoto (4.1), and larger than the design-based estimate in Wack Ngouna (12.4%). The differences between the number of pixels in the EO crop type groundnut and the design-based estimate of the number of pixels covered by groundnut are smaller than sampling error in the three districts: Medina Sabakh (3.4%), Paoskoto (2.5%), and Wack Ngouna (1.2%).
5.Concluding remarks
A result well established in the literature is that RS data improve the cost-efficiency of design-consistent crop acreage estimators based on linear models and simple area samples of georeferenced polygons (segments). The strong RS reliability for crop acreage estimation facilitates a notable improvement of crop production estimates, even if RS data are not sufficiently reliable for yields.
However, many countries use complex list samples of non-georeferenced households or farms for agricultural statistics. We have demonstrated in this paper that with the use of multinomial models, a point georeferenced by parcel in these complex samples is sufficient to enhance the cost efficiency of design-consistent, national-level crop acreage estimators using RS data.
The open-source Sen4Stat system is being designed in such a way that any NSO can use it to improve the cost efficiency of the procedure currently used to obtain crop acreage and production estimates, at both national and minor administrative area levels (province and municipality). The only inputs required by the system are the observed data in the sample currently designed by the NSO for field data collection. The system allows for both a segment simple sample, based on area frames, and a point complex sample, based on list frames of households/farms. In both cases, the system generates from Sentinel images a crop-type map and integrates it with ground data, using linear models in the segment case and multinomial models in the point case.
We have extended the method developed in this paper to two additional RS applications, multi-season estimates and optimization of the sample design. Additional work is required to obtain the data required to test these two additional RS applications.
Acknowledgments
This research was carried out in the framework of a consortium integrating the Université Catholique de Louvain (UCLouvain) as contractor and the Universidad Politécnica de Madrid (UPM), CS ROMANIA S.A. (C-S RO) and Systèmes d’Information à Référence Spatial (SIRS/CLS) as sub-contractors. The national statistical offices of Ecuador, Malawi, Senegal, Spain, and Tanzania participated in the work as partners and provided the required ground data. SIRS/CLS elaborated the ground databases, C-S RO developed the system for image classification, and UPM developed statistical prototypes for the integration of ground and RS data. UCLouvain coordinated the project and contributed mainly to the specification of products and services, system design RS database elaboration for integration with the ground database, and the full-scale demonstration plan.
We thank the referees and the subject editor for their very helpful comments and suggestions.
Funding
This work was supported by the European Space Agency in the Sen4Stat project framework [Contract No. 4000127181/19/I-NS]. This agency contributed to the study design, together with a steering committee made up of members of international organizations such as the World Bank and FAO, with the aim of achieving synergies between this project and those launched by these organizations to improve agricultural statistics.
References
[1] | Fonteneau F, Delincé J. Surveying Farms in the 21st Century. In: FAO editor. Handbook on the Agricultural Integrated Survey (AGRIS); Global Strategy to Improve Agricultural and Rural Statistics. Rome: FAO; (2018) ; pp. 1-6. |
[2] | Zezza A, Gourlay S, Molini V. Closing the data gap in agriculture through sustainable investment in the data value chain: Realizing the vision of the 50x2030 Initiative. Stat J IAOS. (2022) ; 38: : 57-62. doi: 10.3233/SJI-220933. |
[3] | Hanuschak GA, Allen RD, Wigton WH. Integration of Landsat data into the crop estimation program of USDA’s Statistical Reporting Service (1972-1982). Invited paper at: 8th International Symposium on Machine Processing of Remote Sensed Data. 1982 July 7–9; West Lafayette, Indiana: Purdue University. Available from: https//www.lars.purdue.edu/home/references/sym_1982/sym_1982.html. |
[4] | Allen JD. A look at the Remote Sensing Applications Program of the National Agricultural Statistics Service. J Off Stat. (1990) ; 6: (4): 393-409. |
[5] | Ambrosio L, Alonso R, Villa A. Estimación de superficies cultivadas por muestreo de áreas y teledetección. Precisión relativa. Estad Esp. (1993) ; 35: (132): 91-103. |
[6] | Delincé J. Cost-Effectiveness of Remote Sensing for Agricultural Statistics in Developing and Emerging Economies. In: FAO editor. GSARS Technical Report GO09-2015. Rome: FAO; 2015. doi: 10.13140/RG.2.2.25985.45927. |
[7] | Gallego FJ. Remote sensing and land cover area estimation. Int J Remote Sens. (2004) ; 25: (15): 3019-47. doi: 10.1080/01431160310001619607. |
[8] | Firth D, Bennett KE. Robust models on probability sampling. J. R. Stat. Soc. Series B Stat. Methodol. (1998) ; 60: :3-21. doi: 10.1111/1467-9868.00105. |
[9] | Agresti A. Categorical data analysis. New York: Wiley; (2002) ; 710 p. |
[10] | Hogland J, Billor N, Anderson N. Comparison of standard maximum likelihood classification and polytomous logistic regression used in remote sensing. Eur J Remote Sens. (2013) ; 46: (1): 623-40. doi: 10.5721/EuJRS20134637. |
[11] | FAO. Handbook on Master Sampling Frames for Agricultural Statistics. Frame Development, Sample Design and Estimation. Improving Agricultural and Rural Statistics. Global Strategy. Rome: FAO; (2015) Dec. 170 p. |
[12] | Särndal CE, Swensson B, Wretman, J. Model Assisted Survey Sampling. New York: Springer; (1992) ; 695 p. |
[13] | Fuller WA. Sampling statistics. New York: Wiley; (2009) ; 472 p. doi: 10.1002/9780470523551. |
[14] | Cotter J, Nealon J. Area frame design for agricultural surveys. Research and Applications Division. National Agricultural Statistics Service. United States Department of Agriculture. Washington, D.C., (1987) . |
[15] | FAO. Multiple frame agricultural surveys. Volume 1: Current surveys based on area and list sampling methods. Volume 2: Agricultural survey programmes based on area frame or dual frame (area and list) sample designs. Statistical development series 7 and 10. Rome: FAO; 1996, 1998. |
[16] | Ministry of Agriculture, Livestock and Fisheries. 2014/15 Annual Agricultural Sample Survey Report. Zanzibar (TZ); (2016) ; 72 p. Available from: http//hdl.handle.net/20.500.12018/2905. |
[17] | Yates F. Sampling Methods for Census and Surveys. London: Griffin. (1949) ; 318 p. |
[18] | Jinguji I. Dot sampling method for area estimation. In: Crop Monitoring for Improved Food Security. Srivastava M.K. editor, Bangkok (TH): FAO and ADB; (2015) RAP PUBLICATION 2014/28: pp. 27-48. |
[19] | Gay C, Porchier JC. Land cover and land use classification using TER-UTI. In: Holland TE and van den Broecke MPR, editors. Proceedings of Agricultural Statistics 2000, an International Conference on Agricultural Statistics 1998 Mar 18-20: pp. 93-201. Available from: https//www.isi-web.org/isi.cbs.nl/iamamember/Books/agric2000/page-193.pdf. |
[20] | Nusser SM, Goebel JJ. The National Resources Inventory: a long-term multi-resource monitoring programme. Environ Ecol Stat. (1997) ; 4: (3): 181-204. doi: 10.1023/A1018574412308. |
[21] | Nusser SM, Goebel JJ, Fuller WA. Design and estimation for investigating the dynamics of natural resources. Ecol Appl. (1998) ; 8: (2): 234-45. doi: 10.1890/1051-0761(1998)008[0234:DAEFIT]2.0.CO;2. |
[22] | Azzari G, Jain S, Jeffries G, Kilic T, Murray S. Understanding the requirements for surveys to support satellite-based crop type mapping: evidence from Sub-Saharan Africa. Remote Sens. (2021) ; 13: (23), 4749. doi: 10.3390/rs13234749. |
[23] | Direction de l’Analyse de la Prévision et des Statistiques Agricoles. Ministère de l’Agriculture et de l’Equipement Rural. Senegal – Enquete Agricole Annuelle. Methodologie et plan de sondage de l’enquete agricole. Senegal. (2017) . http//anads.ansd.sn/index.php/catalog/218/download/1846. |
[24] | MDcCulloch CE, Searle SR. Generalized, linear, and mixed models. New York: Wiley; (2001) ; 325 p. |
[25] | Ambrosio L, Iglesias L. Land cover estimation in small areas using ground surveys and remote sensing. Remote Sens. Environ. (2000) ; 74: : 240-8. doi: 10.1016/S0034-4257(00)00114-0. |
[26] | Saei A, Chambers R. Small area estimation under linear and generalized linear mixed models with time and area effects. Methodology Working Paper M03/15. Southampton (UK): Southampton Statistical Sciences Research Institute, (2003) : 31 p. http//eprints.soton.ac.uk/id/eprint/8165. |
[27] | Molina I, Saei A, Lombardia MJ. Small area estimates of labour force participation under a multinomial logit mixed model. J. R. Stat. Soc. Ser. A Stat. Soc. (2007) ; 170: (4): 975-1000. doi: 10.1111/j.1467-985X.2007.00493.x. |
[28] | López-Vizcaíno E, Lombardía MJ, Morales D. Small area estimation of labour force indicators under a multinomial model with correlated time and area effects. J. R. Stat. Soc. Ser. A Stat. Soc. (2015) ; 178: (3): 535-65. doi: 10.1111/rssa.12085. |
[29] | Fuller WA. Environmental surveys over time. J Agric Biol Environ Stat. (1999) ; 4: (4): 331-45. doi: 10.2307/1400493. |
[30] | Fuller WA, Breidt FJ. Estimation for supplemented panels. The Indian Journal of Statistics, Serie B. (1999) ; 61: (1): 58-70. https//www.jstor.org/stable/25053068. |
[31] | Cochran WG. Relative accuracy of systematic and stratified random samples for a certain class of populations. Annals of Mathematical Statistics. (1946) ; 17: (2): 164-77. |
[32] | Ambrosio L, Iglesias L, Marin C. Systematic sample design for the estimation of spatial means. Environmetrics. (2003) ; 14: (1): 45-61. doi: 10.1002/env.564. |
[33] | Defourny P, Bontemps S, Bellemans N, Cara C, Dedieu G, Guzzonato E, el al. Near real-time agriculture monitoring at national scale at parcel resolution: performance assessment of the Sen2-Agri automated system in various cropping systems around the world. Remote Sens. Environ. (2019) ; 221: : 551-68. doi: 10.1016/j.rse.2018.11.007. |
[34] | Fuller WA, Isaki CT. Survey design under superpolutation models. In: Current Topics in Survey Sampling. Krewski D, Rao JNK and Platek R, editors, New York: Academic Press; (1981) ; pp. 199-226. |
[35] | Ambrosio L, Iglesias L. Identifying the most appropriate sampling frame for specific landscape types. Technical Report Series. GO-01-2014. FAO. https//www.fao.org/3/ca6436en/ca6436en.pdf. |
[36] | Ambrosio L, Iglesias L, Marín C. A model-assisted approach to identify a cost-efficient spatial sampling strategy. Paper presented at: Eighth International Conference on Agricultural Statistics (ICAS-VIII); (2019) Nov 18-21; New Delhi (IN). |