# Detection and treatment of measurement errors and imputation of partial non-responses

Partial non-responses (PNR) and measurement errors are specific types of non-sampling errors that are generally treated in the editing and imputation phase.

In the present context, a measurement error is defined as a discrepancy between the “true value” and the observed value of a variable in a statistical unit. The discrepancy can originate in any phase of the measurement process (data collection, coding, storing, *etc*.). Since PNR and measurement errors may seriously compromise the accuracy of the target estimates, they should be prevented by means of suitable strategies. However, even when efforts are made to limit the impact of non-sampling errors, a proportion of the collected data is typically affected by partial non-responses and measurement errors. Thus, the use of editing and imputation methods is necessary.

In the context of the GSBPM, the two sub-processes 5.3 “Review and validate” and 5.4 “Edit and impute” concern input data validation and editing and imputation, respectively. However, in real contexts these two sub-processes are often hardly distinguishable.

## Detection of measurement errors

Methods for error detection can be classified according to the different error typologies. A first important distinction is between **systematic errors** and **random errors**.

An error is called **systematic** if it depends on structural problems in the statistical production process, such as defects in the questionnaire design or in the storing system. Systematic errors usually cause deviations from the true values “in the same direction” for one or more variables of interest. They are generally treated by means of deterministic rules based on knowledge of the error mechanism. A typical systematic error for quantitative variables is the unit of measure error.
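As an illustration of a deterministic rule for a unit of measure error, the sketch below flags values that appear to be roughly a fixed factor (e.g., 1,000) times the typical magnitude of the variable, suggesting the respondent used the wrong unit. The function name, the tolerance, and the toy data are assumptions for illustration, not part of any standard tool.

```python
import statistics

def detect_unit_measure_errors(values, factor=1000, tol=0.5):
    """Flag values that appear to be reported in the wrong unit of
    measure, i.e. roughly `factor` times the typical magnitude.
    Returns the indices of suspect observations."""
    med = statistics.median(values)
    suspects = []
    for i, v in enumerate(values):
        # A value close to `factor` times the median suggests, e.g.,
        # euros reported instead of thousands of euros.
        if med > 0 and abs(v / (med * factor) - 1) < tol:
            suspects.append(i)
    return suspects

turnover = [120, 95, 110, 105000, 130, 88]  # one record in the wrong unit
print(detect_unit_measure_errors(turnover))  # → [3]
```

Once flagged, such values would typically be corrected deterministically by dividing by the factor, since the error mechanism is known.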

An error is called **random** if it depends on stochastic elements that are not identifiable. In contrast with the case of systematic errors, a deterministic approach is generally not appropriate for random errors.

An important typology of errors involves errors causing “**out of range**” values, that is, values that do not belong to a known set of acceptable values. A similar typology involves errors determining **inconsistencies** between different observed variables. These errors are detected by applying **coherence rules (edits)** to the data. Consistency errors that are believed to be non-influential are generally detected via automatic methods based on some “general” principles. In this context, a popular approach is the **minimum change principle**: erroneous values are detected in such a way that, for each record, the minimum number of items is changed in order for the record to pass all the edits. The **Fellegi-Holt** methodology implements the minimum change principle. It was originally developed for categorical variables and later extended to numerical variables.
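The minimum change principle can be illustrated with a small brute-force sketch (real Fellegi-Holt implementations use edit generation and set-covering, not enumeration). Edits are expressed as predicates on a record; the localisation step finds the smallest set of fields whose values can be replaced so that all edits pass. All names and the toy edits are hypothetical.

```python
from itertools import combinations, product

def minimum_change_fields(record, edits, domains):
    """Brute-force error localisation under the minimum change
    principle: return the smallest set of fields whose values can be
    replaced (within their domains) so that every edit passes."""
    fields = list(record)
    for k in range(len(fields) + 1):  # try smaller change sets first
        for subset in combinations(fields, k):
            # try every combination of candidate values for the subset
            for values in product(*(domains[f] for f in subset)):
                trial = dict(record, **dict(zip(subset, values)))
                if all(edit(trial) for edit in edits):
                    return set(subset)
    return None

# Toy consistency edits: a married person must be at least 16
edits = [
    lambda r: not (r["age"] < 16 and r["marital"] == "married"),
    lambda r: r["age"] >= 0,
]
domains = {"age": range(0, 100), "marital": ["single", "married"]}
record = {"age": 12, "marital": "married"}
print(minimum_change_fields(record, edits, domains))  # → {'age'}
```

Changing either `age` or `marital` alone would satisfy the edits; the function returns a single-field solution because no zero-change solution exists.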

Another important class of errors includes errors which determine **outliers** in data. Outliers are observations whose behaviour deviates from that of most observations. Usually, techniques for outlier identification assume, at least implicitly, a model for the “typical” data and try to identify deviations from that model. Methods for outlier detection are often used to identify **influential errors**. However, the concept of influential error is to be distinguished from that of outlier: the latter is related to a model assumed for the data, while the former depends on the population parameter to be estimated. Specifically, outliers may not be caused by influential errors, and influential errors may not produce outliers.
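A common example of such a technique is a robust z-score based on the median and the median absolute deviation (MAD), whose implicit model is that typical values cluster around the median. This is one standard approach among many, sketched here with stdlib only; the threshold of 3.5 is a conventional choice, not a requirement.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag outliers with a robust z-score based on the median and the
    median absolute deviation (MAD), which are far less sensitive to
    the outliers themselves than the mean and standard deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a
    # standard-deviation-based z-score under normality
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

data = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]
print(mad_outliers(data))  # → [4]
```

Whether the flagged observation is also an influential error depends on the estimate being produced, which is exactly the distinction drawn above.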

## Correction of errors and imputation of partial non-responses

Once errors have been detected, incorrect values have to be replaced (imputed) by correct or nearly correct values. Imputation is also used for partial non-responses, and it is commonly applied for a number of technical and practical reasons. First, released data must be complete and coherent at the micro level. Furthermore, imputation allows users to apply standard methods and software to the final dataset.

Many imputation methods are available in statistical packages, both proprietary and free. A frequent classification distinguishes **parametric methods** (e.g., **regression**-type methods), relying on explicit assumptions about the probability distribution generating the data, from **non-parametric methods**, generally based on weaker, implicit assumptions (e.g., **hot-deck**, **nearest neighbour donor**).
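As a minimal sketch of the nearest neighbour donor approach mentioned above, the function below copies the missing value from the complete record closest on a set of matching variables (Euclidean distance). The function name, the distance choice, and the toy business data are illustrative assumptions.

```python
def nn_donor_impute(records, target, match_vars):
    """Nearest-neighbour donor imputation: for each record missing
    `target`, copy the value from the complete record (the donor)
    closest on the matching variables."""
    donors = [r for r in records if r[target] is not None]
    for r in records:
        if r[target] is None:
            # squared Euclidean distance on the matching variables
            donor = min(donors, key=lambda d: sum(
                (d[v] - r[v]) ** 2 for v in match_vars))
            r[target] = donor[target]
    return records

records = [
    {"employees": 10, "turnover": 500},
    {"employees": 50, "turnover": 2600},
    {"employees": 12, "turnover": None},  # partial non-response
]
nn_donor_impute(records, "turnover", ["employees"])
print(records[2]["turnover"])  # → 500
```

Because the imputed value is an actually observed one, donor methods preserve plausibility without assuming an explicit model for the data distribution.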

Imputation methods are sometimes called **deterministic** when different applications to the same dataset provide the same outputs, and **stochastic** when the outcomes are characterized by a certain level of variability.
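The deterministic/stochastic distinction can be shown with two simple imputation rules, sketched below under illustrative names: mean imputation always yields the same completed dataset, whereas a random hot-deck draws the replacement from the observed values and may differ across runs.

```python
import random
import statistics

def mean_impute(values):
    """Deterministic: missing values are replaced by the mean of the
    observed values, so repeated runs give identical results."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def random_hotdeck_impute(values, rng=random):
    """Stochastic: missing values are replaced by a randomly drawn
    observed value, so repeated runs may differ."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

data = [4.0, None, 6.0, 5.0]
print(mean_impute(data))            # always [4.0, 5.0, 6.0, 5.0]
print(random_hotdeck_impute(data))  # second entry is one of 4.0, 6.0, 5.0
```

The stochastic variant injects variability that the deterministic one suppresses, which matters for the choice discussed next.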

The choice of the imputation method depends on the analyses that have to be performed on the data. For instance, if the quantity of interest is a linear quantity such as a mean or a total, a deterministic method may be appropriate, while, if distributional characteristics are also of interest (such as those involving the second moment of the data distribution), a stochastic imputation method is generally preferable.