
# Weighting, estimation and sampling error evaluation

The activities concerning the production of target estimates and the evaluation of sampling errors refer to sub-processes 5.6 “*Calculate weights*” and 5.7 “*Calculate aggregates*” of the GSBPM.

## Production of the estimates of interest

Each estimation method is based on the principle that the subset of the units of the population included in the sample must also represent the complementary subset consisting of the remaining units of the population. This principle is generally achieved by assigning to each unit included in the sample a weight that can be seen as the number of population elements represented by that unit.

The sample surveys carried out by Istat are large-scale surveys that have the purpose of providing a large number of estimates of population parameters such as counts, totals, proportions, averages, etc.

The estimation of the parameters of the population can be made using two different approaches:

- **Methods based on the direct approach**, which use the values of the variable of interest observed on the sample units belonging to the domain of interest. They are the standard methods used by Istat and by all the main National Institutes of Statistics to produce survey estimates.
- **Methods based on the indirect approach**, which make use of the values of the variable of interest observed on sample units belonging to a wider domain containing the domain of interest and/or to other survey occasions. They are usually used for particular estimation problems, such as the production of estimates for domains in which the sample size is too small for direct methods.

## Direct methods

In general, for the estimation of a total the following two operations should be performed:

- computation of the weight to be assigned to each unit included in the sample;
- calculation of the estimates as weighted sums of the values of the target variables, using the weights determined in the first operation.
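The two operations above can be sketched in a few lines. This is a minimal illustration, not Istat's production code: the function name `weighted_total`, the inclusion probabilities and the observed values are all hypothetical, and the weights are taken to be the plain reciprocals of the inclusion probabilities.

```python
def weighted_total(values, weights):
    """Estimate a population total as the weighted sum of the sample values."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical sample of three units (all figures are illustrative).
inclusion_probs = [0.1, 0.2, 0.5]

# Operation 1: one weight per sample unit (here the reciprocal of the
# inclusion probability, i.e. the number of population units represented).
weights = [1.0 / p for p in inclusion_probs]

# Operation 2: the estimate is the weighted sum of the observed values.
y = [4.0, 6.0, 10.0]
total_hat = weighted_total(y, weights)  # 10*4 + 5*6 + 2*10
```

Because the same set of weights is reused for every target variable, totals of different variables computed this way are automatically consistent with one another.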

The weight given to each unit is obtained according to a procedure divided into several steps:

- the *starting weight* of each sample unit, named *direct weight*, is calculated according to the sampling design, as the reciprocal of the inclusion probability;
- the starting weight is adjusted in order to account for non-response, obtaining the *base weight*;
- correction factors of the base weight are computed to take into account equality constraints between some known parameters of the population and the corresponding sample estimates;
- the *final weight* is obtained as the product of the base weight and the correction factors.
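The chain from direct weight to final weight can be sketched as follows. This is a simplified illustration under stated assumptions: the non-response adjustment is modelled as a division by an assumed response rate of the class each unit belongs to (a common simplification, not necessarily the procedure used by Istat), and the correction factors are supplied as given numbers rather than computed. The function name and all figures are hypothetical.

```python
def final_weights(inclusion_probs, response_rates, correction_factors):
    """Return (direct, base, final) weights for the responding sample units.

    inclusion_probs[k]    -- inclusion probability of unit k under the design
    response_rates[k]     -- assumed response rate of the class of unit k
    correction_factors[k] -- correction factor of unit k (taken as given here)
    """
    # Step 1: direct weight = reciprocal of the inclusion probability.
    direct = [1.0 / p for p in inclusion_probs]
    # Step 2: base weight = direct weight adjusted for non-response.
    base = [d / r for d, r in zip(direct, response_rates)]
    # Steps 3-4: final weight = base weight times the correction factor.
    final = [b * g for b, g in zip(base, correction_factors)]
    return direct, base, final


# Hypothetical two-unit example (all figures are illustrative).
direct, base, final = final_weights(
    inclusion_probs=[0.1, 0.25],
    response_rates=[0.8, 0.5],
    correction_factors=[1.05, 0.9],
)
```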

The class of estimators corresponding to the operations described above is known as *calibration estimators*, since both the adjustment to correct for non-response and the weight correction to achieve consistency with known population parameters are obtained by solving a constrained minimization problem. In detail, the aim is to minimize the distance between the weights before and after the calibration phase, subject to the equality constraints on the known population parameters.
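In a commonly used formulation, the constrained minimization can be written as below. The notation is an assumption introduced here for illustration and does not come from the text: $s$ is the sample, $d_k$ the weight of unit $k$ before calibration, $w_k$ the calibrated weight, $\mathbf{x}_k$ the vector of auxiliary variables observed on unit $k$, $\mathbf{t}_x$ the vector of known population totals, and $G$ a distance function.

$$
\min_{\{w_k\}_{k \in s}} \; \sum_{k \in s} G(w_k, d_k)
\qquad \text{subject to} \qquad
\sum_{k \in s} w_k \, \mathbf{x}_k = \mathbf{t}_x
$$

If the chi-square distance $G(w_k, d_k) = (w_k - d_k)^2 / (2 d_k)$ is chosen, which is only one of several possible distances, the solution has the closed form

$$
w_k = d_k \left( 1 + \mathbf{x}_k^{\top} \boldsymbol{\lambda} \right),
\qquad
\boldsymbol{\lambda} = \Bigl( \sum_{k \in s} d_k \, \mathbf{x}_k \mathbf{x}_k^{\top} \Bigr)^{-1}
\Bigl( \mathbf{t}_x - \sum_{k \in s} d_k \, \mathbf{x}_k \Bigr),
$$

so that the factor $1 + \mathbf{x}_k^{\top} \boldsymbol{\lambda}$ plays the role of the correction factor applied to the weight before calibration.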

The main problem in the choice of the estimation method is to find an estimator satisfying:

- *efficiency* criteria in terms of sampling variance and of bias due to *total and partial non-response* and to frame under-coverage;
- external and internal *coherence*. The *external consistency* of the estimates arises whenever known totals are available from external sources. Estimates of the totals produced by the survey should generally match, or not deviate too much from, the known values of these totals. The *internal consistency* of the estimates is achieved when all the estimates of the same aggregate coincide with each other. This result can be obtained using a unique system of weights.

Calibration estimators meet the above criteria since:

- they generally yield more efficient estimates than those obtained using the direct weights alone; the higher the correlation between the auxiliary variables and the target variables, the greater the efficiency;
- they are approximately design unbiased;
- they produce estimates of totals that coincide with the known values of these totals;
- they mitigate the non-response bias effect;
- they reduce the bias due to the under-coverage of the frame from which the sample is selected.
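The property that calibrated estimates reproduce the known totals can be checked numerically. The sketch below assumes linear calibration (chi-square distance), for which the calibrated weight is `d_k * (1 + x_k' lambda)`. That choice, the helper names and the data are assumptions made for illustration; the text does not state which distance function Istat uses.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: collinear auxiliary variables")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_calibration(base_weights, aux, known_totals):
    """Calibrate base weights so that weighted aux totals match known totals."""
    p = len(known_totals)
    # T = sum_k d_k x_k x_k'
    t_mat = [
        [sum(d * x[i] * x[j] for d, x in zip(base_weights, aux)) for j in range(p)]
        for i in range(p)
    ]
    # Gap between the known totals and their estimates under the base weights.
    gap = [
        known_totals[i] - sum(d * x[i] for d, x in zip(base_weights, aux))
        for i in range(p)
    ]
    lam = solve_linear_system(t_mat, gap)
    # Final weight = base weight times the correction factor (1 + x_k' lambda).
    return [
        d * (1.0 + sum(l_i * x_i for l_i, x_i in zip(lam, x)))
        for d, x in zip(base_weights, aux)
    ]


# Hypothetical data: each unit has auxiliary vector (1, size).
base_weights = [10.0, 10.0, 10.0, 10.0]
aux = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
known_totals = [50.0, 130.0]  # known population count and known size total

calibrated = linear_calibration(base_weights, aux, known_totals)
count_hat = sum(calibrated)
size_hat = sum(w * x[1] for w, x in zip(calibrated, aux))
```

With these figures the base weights estimate a count of 40 and a size total of 100. After calibration both estimates coincide with the known values 50 and 130, up to floating-point rounding. Linear calibration can produce negative weights in unfavourable cases, which is one reason bounded distance functions are often preferred in practice.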

Calibration estimators are used for calculating the final weights for most social and business surveys carried out by Istat.

## Indirect methods

Indirect estimation methods are used by Istat to respond to the growing need of local governments for accurate information on small geographical areas or, more generally, on small domains, called small areas. Sample surveys conducted by Istat are, however, designed to provide reliable information on the main aggregates of interest for the planned domains defined at the design stage, and may not be able to support the production of estimates at a finer level of detail.

The solution adopted in the past by Istat to obtain estimates at unplanned domain level was to increase the sample size without changing the sampling strategy, i.e. without modifying either the sampling design or the estimator. However, over-sampling, besides raising collection costs and complicating the organization of the survey, increases non-sampling errors because very large samples are difficult to handle. In addition, over-sampling is only a partial solution to the problem of small area estimation: since the sample size cannot be increased beyond a certain threshold, reliable estimates can be provided only for a subset of the small areas of interest.

For these reasons, Istat makes use of estimation methods based on:

- the use of auxiliary information, related to the phenomena under study, known at small areas level;
- the adoption (implicit or explicit) of statistical models linking the variable of interest observed in the small area with the values of the same variable related to a larger area containing the small area of interest and/or related to other survey occasions.

An important problem related to these methods is that they are based on models and, therefore, the properties of the results depend on the validity of the model assumptions. Since models are never expected to match reality perfectly, these estimators introduce a bias that cannot be measured, which may raise serious concerns about their use.
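As one classical illustration of the indirect approach, the sketch below implements a synthetic ratio estimator. The text does not say which small area estimators Istat uses, so this choice is an assumption, and the function name and data are hypothetical. The implicit model is that the ratio between the target variable and the auxiliary variable in the small area equals the ratio in the wider domain; when that assumption fails, the estimate is biased, which is exactly the concern raised above.

```python
def synthetic_total(area_aux_total, y_sample, x_sample, weights):
    """Synthetic ratio estimate of a small-area total.

    The ratio Y/X is estimated on the sample of the wider domain and then
    applied to the known auxiliary total of the small area.
    """
    y_hat = sum(w * y for w, y in zip(weights, y_sample))
    x_hat = sum(w * x for w, x in zip(weights, x_sample))
    if x_hat == 0:
        raise ValueError("estimated auxiliary total of the wider domain is zero")
    return area_aux_total * y_hat / x_hat


# Hypothetical sample from the wider domain (all figures are illustrative).
weights = [10.0, 10.0, 20.0]
y_sample = [2.0, 4.0, 3.0]   # target variable
x_sample = [4.0, 8.0, 6.0]   # auxiliary variable

# Known auxiliary total of the small area, e.g. from a register.
small_area_estimate = synthetic_total(50.0, y_sample, x_sample, weights)
```

Note that the estimate uses no observation of the target variable from the small area itself, which is why it remains stable when the area sample is tiny and why its quality depends entirely on the model.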

## Assessment of sampling errors

For the evaluation of the sampling errors of the estimates, Istat usually uses approximate variance estimation methods. In fact, for most estimation procedures an analytical expression of the estimator variance is not available, since:

- Istat surveys are carried out using complex sampling designs, generally based on multiple selection stages, on the stratification of the units, and on selection without replacement with selection probabilities that vary among units;
- estimates are determined through the use of calibration estimators which are non-linear functions of the sample information.

The variance estimation methods generally used at Istat are based on the linearization method of Woodruff (1971), which provides estimates of the sampling variance when the estimators used are non-linear functions of the sample data.
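The idea of linearization can be illustrated on the simplest non-linear estimator, a ratio of two estimated totals. The sketch below assumes simple random sampling without replacement from a population of `N` units, which is far simpler than the designs described above; the function names and data are hypothetical. The ratio is replaced by its first-order Taylor expansion, and its variance is approximated by the variance of the estimated total of the linearized variable `z_k = (y_k - R x_k) / X`.

```python
from statistics import variance


def srs_total_variance(z, N):
    """Variance estimate of the expansion total of z under SRS without replacement."""
    n = len(z)
    return N * N * (1.0 - n / N) * variance(z) / n


def ratio_linearized_variance(y, x, N):
    """Return the estimated ratio Y/X and its linearized variance estimate."""
    n = len(y)
    w = N / n                      # equal design weights under SRS
    y_hat = w * sum(y)
    x_hat = w * sum(x)
    r_hat = y_hat / x_hat
    # Linearized variable: first-order Taylor term of the ratio for unit k.
    z = [(yk - r_hat * xk) / x_hat for yk, xk in zip(y, x)]
    return r_hat, srs_total_variance(z, N)


# Hypothetical sample of 3 units from a population of 30.
r_hat, var_hat = ratio_linearized_variance([1.0, 2.0, 4.0], [1.0, 2.0, 3.0], N=30)
```

Under a complex design only `srs_total_variance` would change: the linearized variable is computed in the same way and then passed to the variance formula appropriate for the design.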

These variance estimation methods are implemented by Istat in the generalized software tools GENESEES and ReGenesees, which feature a user-friendly interface and are currently used to estimate the sampling errors of the estimates produced by Istat surveys.

In addition, these software tools make it possible to compute important analysis statistics that help to evaluate the adopted sampling design. In particular, it is possible to evaluate:

- the overall efficiency of the sampling design;
- the impact on the efficiency of the estimates due to stratification, the number and type of selection stages and weighting effect.
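Two measures commonly used for this kind of evaluation are the design effect and Kish's approximation of the weighting effect. The text does not specify which statistics the software reports, so the sketch below is an assumption made for illustration, with hypothetical function names and figures.

```python
def design_effect(var_design, var_srs):
    """Ratio of the variance under the actual design to the variance under
    simple random sampling with the same sample size."""
    if var_srs <= 0:
        raise ValueError("the SRS variance must be positive")
    return var_design / var_srs


def kish_weighting_effect(weights):
    """Kish's approximation of the variance inflation due to unequal weights:
    n * sum(w^2) / (sum(w))^2, equal to 1 when all weights are equal."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical figures.
deff = design_effect(var_design=6.0, var_srs=4.0)
equal = kish_weighting_effect([5.0, 5.0, 5.0])
unequal = kish_weighting_effect([1.0, 3.0])
```

A design effect above 1 indicates that the adopted design is less efficient than simple random sampling of the same size for that estimate, and a value below 1 indicates a gain, typically from stratification.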

It is important to point out that a generalized representation of the sampling errors is also provided. It is a summary obtained using regression models that relate the estimates to the corresponding sampling errors. These models provide important summary information on sampling errors and are disseminated together with the tables reporting the estimate values.
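A regression model of this kind can be sketched as follows. The functional form used here, `log(rel_error^2) = a + b * log(estimate)`, is a common choice for such models but is an assumption: the text does not state which model is fitted. The function names and the data are hypothetical.

```python
import math


def fit_error_model(estimates, rel_errors):
    """Fit log(rel_error^2) = a + b * log(estimate) by ordinary least squares."""
    xs = [math.log(e) for e in estimates]
    ys = [math.log(r * r) for r in rel_errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("at least two distinct estimate values are required")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_rel_error(a, b, estimate):
    """Relative sampling error predicted by the fitted model for an estimate."""
    return math.sqrt(math.exp(a + b * math.log(estimate)))


# Hypothetical estimates of totals with their relative sampling errors.
estimates = [1000.0, 4000.0, 16000.0]
rel_errors = [0.10, 0.05, 0.025]

a, b = fit_error_model(estimates, rel_errors)
approx_error = predict_rel_error(a, b, 2000.0)
```

Once the coefficients are published, a user can approximate the sampling error of any estimate in the tables, including estimates whose error was not computed directly.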