Data integration

Record linkage

Record linkage is an important process for the integration of data coming from different sources. The purpose of record linkage is to identify the same real world entity that can be differently represented in data sources, even if unique identifiers are not available or are affected by errors.

In official statistics, record linkage is needed for several applications: for instance,

  • to enrich the information stored in different data sets;
  • to create, update and de-duplicate a frame;
  • to improve the data quality of a source;
  • to estimate the size of a population by the capture-recapture method;
  • to check the confidentiality of public-use microdata.

The complexity of the whole linking process stems from several aspects: the lack of unique identifiers requires sophisticated statistical procedures, the huge amount of data to process calls for complex IT solutions, and constraints related to a specific application may require solving difficult linear programming problems.
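As a rough illustration of linking records when no reliable identifier is available, the sketch below compares names with a string-similarity score and only considers candidates with the same birth year. The field names, the sample records and the 0.85 threshold are assumptions made for this example; real applications typically rely on probabilistic decision models rather than a fixed cut-off.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity score in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(source_a, source_b, threshold=0.85):
    """Pair each record of source_a with its best match in source_b, if good enough."""
    links = []
    for rec_a in source_a:
        best, best_score = None, 0.0
        for rec_b in source_b:
            # Hypothetical blocking rule: only compare records with equal birth year.
            if rec_a["birth_year"] != rec_b["birth_year"]:
                continue
            score = similarity(rec_a["name"], rec_b["name"])
            if score > best_score:
                best, best_score = rec_b, score
        if best is not None and best_score >= threshold:
            links.append((rec_a["id"], best["id"]))
    return links

# Invented records: the same person is spelled differently in the two sources.
source_a = [{"id": "A1", "name": "Maria Rossi", "birth_year": 1980},
            {"id": "A2", "name": "Luca Bianchi", "birth_year": 1975}]
source_b = [{"id": "B1", "name": "Maria Rosi", "birth_year": 1980},
            {"id": "B2", "name": "Paolo Verdi", "birth_year": 1975}]
links = link_records(source_a, source_b)
```

Here only the pair A1 and B1 is linked, because the two spellings of the name are close enough; A2 has no sufficiently similar candidate.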

Statistical matching

The goal of statistical matching (sometimes called data fusion) is the integration of two or more data sources referring to the same population, with the aim of exploring the relationships between variables that are not jointly observed in the same data source. The sources to be integrated are composed of different, non-overlapping units, as usually happens when data from several sample surveys are integrated. In the typical statistical matching situation there are two data sources A and B: variables X and Y are available in A, variables X and Z are observed in B, and the objective is to study the relationship between Y and Z by exploiting the common information in X. The objective of statistical matching can be macro or micro. In the first case the interest is in one or more parameters that summarize the relationship between Y and Z (correlation coefficient, regression coefficient, contingency table, etc.); in the second case the result of integration is a synthetic data set in which all the variables of interest, X, Y and Z, are present.

The objectives of matching can be achieved by means of a parametric or nonparametric approach, or a mixture of them (mixed methods).

The parametric approach requires the specification of a model and the estimation of the related parameters. In the absence of auxiliary information, the conditional independence of Y and Z given the common variables X is generally assumed. This assumption is rather strong and, unfortunately, it cannot be tested in the typical matching situation.

Nonparametric methods are usually applied when the objective is micro. In this case hot-deck imputation methods are frequently used. They aim at imputing the missing variables in the data set chosen as recipient (for instance A) by using the values observed in the data set chosen as donor (B). The donor unit for a given unit in A is the most similar observation in B in terms of the values of the common variables X. The mixed approach is composed of two steps: 1) a parametric model is assumed, its parameters are estimated, and imputation is carried out; 2) a hot-deck imputation procedure is applied, which uses the values imputed in step 1 to choose the donor observation.
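The nearest-donor rule of the nonparametric approach can be sketched as follows. The record layout, the variable values and the use of squared Euclidean distance are assumptions made for the example; in practice the common variables would usually be standardised or compared with a more suitable distance.

```python
def nearest_donor(recipient_x, donors):
    """Return the donor whose common variables X are closest (squared Euclidean distance)."""
    def distance(donor):
        return sum((a - b) ** 2 for a, b in zip(recipient_x, donor["x"]))
    return min(donors, key=distance)

def distance_hot_deck(recipients, donors):
    """Impute Z in each recipient record with the Z value of its nearest donor."""
    fused = []
    for rec in recipients:
        donor = nearest_donor(rec["x"], donors)
        fused.append({"x": rec["x"], "y": rec["y"], "z": donor["z"]})
    return fused

# A observes (X, Y); B observes (X, Z). X = (age, household size) is hypothetical.
file_a = [{"x": (30, 2), "y": 21000}, {"x": (55, 4), "y": 34000}]
file_b = [{"x": (29, 2), "z": 900}, {"x": (58, 4), "z": 1500}, {"x": (41, 3), "z": 1200}]
synthetic = distance_hot_deck(file_a, file_b)
```

The result is a synthetic file in which every unit of A carries X, Y and a Z value donated by the most similar unit of B.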

It is worth noting that an alternative approach can be used, based on quantifying the uncertainty inherent in the estimation of a particular parameter. This approach requires neither the conditional independence assumption nor auxiliary information on the non-identifiable parameters, i.e., those describing the relationship between Y and Z. The study of uncertainty does not lead to a point estimate but to a set of plausible estimates. The set of parameter estimates referring to Y and Z is composed of those consistent with the estimates obtainable from the data at hand, i.e., the parameters concerning the pairs (Y, X) and (Z, X).
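A simple case shows what such a set of plausible estimates looks like. Assume, purely for illustration, that X, Y and Z are single numeric variables and that the parameter of interest is the correlation between Y and Z. The correlations of (X, Y) and (X, Z) can be estimated from A and B, and the requirement that the three correlations form a valid (positive semidefinite) correlation matrix restricts the correlation of (Y, Z) to an interval. The function name and the input values below are hypothetical.

```python
import math

def yz_correlation_bounds(rho_xy, rho_xz):
    """Range of corr(Y, Z) compatible with the given corr(X, Y) and corr(X, Z)."""
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width

# Hypothetical estimates obtained from the two files.
lower, upper = yz_correlation_bounds(0.8, 0.6)
```

With these inputs the interval runs approximately from 0 to 0.96. Its midpoint, the product of the two estimable correlations, is the value implied by conditional independence under a normal model; the width of the interval shows how much the data alone leave undetermined.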

Applying statistical matching to data from complex surveys poses additional problems. In such circumstances it is necessary to take into account the sampling design as well as the methods used to deal with non-sampling errors (coverage and non-response).

Coding of textual answers

Coding is generally performed when the survey questionnaire contains textual variables that refer to official classifications, which allow for national and/or international data comparability. Examples of such variables are Economic Activity (NACE), Occupation, Education, and Places (of birth, of residence, etc.).

Coding means assigning a unique code to a textual answer according to a classification scheme. The level of detail of the assigned code depends on the survey aims and/or the dissemination needs. Coding can be performed manually or through automated systems. Manual coding can be performed only at the end of the data collection phase, whereas automated systems can be run during or after data collection: in the first case this is called assisted coding (on-line coding), in the second case automated coding (batch coding).

With reference to GSBPM, coding belongs to sub-process 5.2 “Classify and code” of Phase 5 “Process”, which includes the activities needed to make data ready for analysis (Phase 6 “Analyse”). In the case of assisted coding, some of the activities of sub-process 5.2 can start before Phase 4 “Collect” ends, improving the timeliness of data delivery.

Coding is, in general, a very demanding activity of the survey process. Moreover, if it is performed manually it is also difficult to standardise, because coding results strictly depend on the coders. Although coders are well trained in the criteria and principles of each official classification, coding is influenced by the cognitive process of each coder, which might lead to different (subjective) interpretations and, therefore, to different codes for the same textual answer.

The use of specialised coding software can produce a considerable saving of time and resources and will also guarantee a higher standardisation level of the coding process, increasing the expected quality of the coding results.

As already mentioned, computer-assisted coding can be divided into “automatic coding” and “assisted coding”. They differ in terms of aims and coding process:

  • Automatic coding: the coding software analyses and codes, on the basis of a reference dictionary, a data file containing all the textual answers collected during the collection phase (batch coding). The aim is to look for and to assign a single code to each textual answer according to quality thresholds;
  • Assisted coding: the coding software is an interactive tool that helps the coder/respondent to code the textual answer. The aim is to offer the user a wider set of possible matching codes from which to choose the correct one.

The key point of any coding system, automated or assisted, is the construction of the information base that serves as the reference dictionary, containing the codes and texts of the official classification and enriched with textual answers collected (and correctly coded) in Istat surveys. In order to be processed by software, the reference dictionary has to undergo a number of standardisation operations aimed at producing analytic, synthetic and unambiguous descriptions. In general, the richer the dictionary, the higher the coding rate.

Generally speaking, coding systems vary according to the algorithm used to match the textual answers with the dictionary descriptions. They can be classified as follows:

  • dictionary algorithms: they look for exact matches on the basis of key words (or groups of key words);
  • weighting algorithms: they look for partial or exact matches on the basis of similarity functions among texts that assign weights to each word according to its informative content;
  • sub-string algorithms: they look for partial or exact matches by processing portions of text (bigrams or trigrams).
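A sub-string algorithm of the last kind can be sketched with character trigrams. The dictionary entries, the codes, the Dice similarity measure and the 0.6 threshold are all assumptions made for this example and do not describe any specific coding software.

```python
def trigrams(text):
    """Return the set of character trigrams of a lower-cased, padded string."""
    padded = "  " + text.lower().strip() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def dice_similarity(a, b):
    """Dice coefficient between the trigram sets of two strings (0 = disjoint, 1 = identical)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def code_answer(answer, dictionary, threshold=0.6):
    """Return the code of the most similar dictionary description, or None below the threshold."""
    best_code, best_score = None, 0.0
    for description, code in dictionary.items():
        score = dice_similarity(answer, description)
        if score > best_score:
            best_code, best_score = code, score
    return best_code if best_score >= threshold else None

# Hypothetical dictionary: descriptions and codes are invented for the example.
dictionary = {"primary school teacher": "T01", "bus driver": "D02", "nurse": "N03"}
misspelled = code_answer("primary scool teacher", dictionary)
unknown = code_answer("astronaut", dictionary)
```

The misspelled answer still receives the teacher code because most of its trigrams are shared with the dictionary description, while an answer with no similar description is left uncoded.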

Moreover, as far as assisted coding is concerned, there are three possible methods to consult (navigate) the reference dictionary:

  • tree search: it navigates the hierarchical structure of the classification, from the highest branch to the lowest one (leaf), which represents the most detailed code (highest number of digits) that can be assigned to a textual answer;
  • alphabetic search: it navigates the entire dictionary looking for the definition that is equal or most similar to the textual answer to be coded;
  • mixed mode search: it performs an alphabetic search inside the selected classification branch.

The data collection technique strongly influences the choice of the search method. An important distinction is between interviewer-administered and self-administered modes. For the latter, where respondents are not trained in classifications and coding as interviewers are, it is extremely important to provide a coding system that is user friendly and guarantees high quality results.

The quality of the coding activity is strongly influenced by how up to date both the dictionary content and the matching rules are (training phase). It is advisable to perform the training phase periodically, in general after coding the textual answers collected by a survey. To this end, after a coding application, it is important to:

  • verify the quality of the coded cases;
  • use the uncoded cases to update the coding application (dictionary and checking rules);
  • highlight any gaps in the classification used.

Indicators for assisted and automated coding can be used to evaluate the quality and performance of the coding phase:

Automated coding indicators:

  • efficacy/coding rate: ratio of “number of coded texts” to “total number of texts to be coded”;
  • accuracy: ratio of “number of correctly coded texts” to “number of coded texts”;
  • efficiency: unit coding time (time needed to code a single text).

Assisted coding indicators:

  • average time to assign a single code;
  • coherence between each collected textual description and the assigned code.
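The automated coding indicators listed above can be computed as in the following sketch. It assumes that a verified ("true") code is available for each text, for instance from a manually checked sample; the function name, the data layout and the figures are hypothetical.

```python
def coding_indicators(results, total_seconds):
    """Compute coding rate, accuracy and unit coding time for a batch run.

    `results` is a list of (assigned_code, true_code) pairs; assigned_code is None
    when the system could not code the text.
    """
    total = len(results)
    coded = [(a, t) for a, t in results if a is not None]
    correct = sum(1 for a, t in coded if a == t)
    return {
        "coding_rate": len(coded) / total,
        "accuracy": correct / len(coded) if coded else 0.0,
        "seconds_per_text": total_seconds / total,
    }

# Four invented texts: three coded (two correctly), one left uncoded.
batch = [("T01", "T01"), ("D02", "D02"), ("N03", "T01"), (None, "D02")]
indicators = coding_indicators(batch, total_seconds=2.0)
```

Note that accuracy is computed over the coded texts only, as in the definition above, so a system can trade coding rate for accuracy by raising its quality thresholds.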

 

Detection and treatment of measurement errors and imputation of partial non-responses

Partial non-responses (PNRs) and measurement errors are specific types of non-sampling errors which are generally treated in the editing and imputation phase. In the present context, a measurement error is defined as a discrepancy between the “true value” and the observed value of a variable in a statistical unit. The discrepancy may originate in any phase of the measurement process (data collection, coding, storing, etc.). Since PNRs and measurement errors may seriously compromise the accuracy of the target estimates, they should be prevented by means of suitable strategies. However, even when efforts are made to limit the impact of non-sampling errors, a proportion of the collected data is typically affected by partial non-responses and measurement errors. Thus, the use of editing and imputation methods is necessary.

In the context of GSBPM, the two sub-processes 5.3 “Review and validate” and 5.4 “Edit and impute” concern input data validation and editing and imputation, respectively. In practice, however, these two sub-processes are often hardly distinguishable.

Detection of measurement errors

Methods for error detection can be classified according to the type of error. A first important distinction is between systematic errors and random errors.

An error is called systematic if it depends on structural problems in the statistical production process, such as defects in the questionnaire design or in the storing system. Systematic errors usually cause deviations from the true values “in the same direction” for one or more variables of interest. They are generally treated by means of deterministic rules based on knowledge of the error mechanism. A typical systematic error for quantitative variables is the unit of measurement error (for example, an amount reported in units instead of thousands).
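A deterministic rule for the unit of measurement error can be sketched as follows. It assumes that a reference value is available for each unit (for example the value reported in a previous period) and that the suspected error is a factor of 1000; the function name, the factor and the tolerance are choices made for this example only.

```python
def fix_unit_error(reported, reference, factor=1000, tolerance=0.5):
    """Divide `reported` by `factor` when its ratio to `reference` is close to `factor`.

    The ratio is considered close when it lies within factor * (1 +/- tolerance).
    Returns the (possibly corrected) value and a flag telling whether it was changed.
    """
    if reference <= 0:
        return reported, False
    ratio = reported / reference
    if factor * (1 - tolerance) <= ratio <= factor * (1 + tolerance):
        return reported / factor, True
    return reported, False

# Turnover expected in thousands of euros; the first value was reported in euros.
corrected, changed = fix_unit_error(1250000, 1200)
unchanged, flag = fix_unit_error(1300, 1200)
```

Because the error mechanism is known, the correction is applied in the same way to every record that matches the rule, which is what distinguishes this treatment from the one used for random errors.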

An error is called random if it depends on stochastic elements that are not identifiable. In contrast with the case of systematic errors, a deterministic approach is generally not appropriate for random errors.

An important type of error causes “out of range” values, that is, values that do not belong to a known set of acceptable values. A similar type causes inconsistencies between different observed variables. These errors are detected by applying coherence rules (edits) to the data. Consistency errors that are believed to be “not influential” are generally detected via automatic methods based on some “general” principles. In this context, a popular approach is the minimum change principle. According to this principle, erroneous values are identified in such a way that, for each record, the minimum number of items has to be changed for the record to pass all the edits. The Fellegi-Holt methodology implements the minimum change principle. It was originally developed for categorical variables and later extended to numerical variables.
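The minimum change principle can be illustrated with a brute-force search over small categorical domains. This is only a sketch of the principle: it enumerates candidate sets of fields directly and is not the Fellegi-Holt algorithm itself. The variables, their domains and the two edits are invented for the example.

```python
from itertools import combinations, product

def passes_all(record, edits):
    """A record passes when no edit (a predicate flagging an inconsistency) fires."""
    return not any(edit(record) for edit in edits)

def localize_errors(record, domains, edits):
    """Return a smallest set of fields that can be changed so the record passes all edits."""
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # Try every combination of admissible values for the chosen fields.
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes_all(candidate, edits):
                    return set(subset)
    return None

# Hypothetical domains and edits: each edit describes a combination that is not allowed.
domains = {"age_class": ["0-14", "15-64", "65+"],
           "marital_status": ["single", "married"],
           "occupation": ["none", "employee", "retired"]}
edits = [lambda r: r["age_class"] == "0-14" and r["marital_status"] == "married",
         lambda r: r["age_class"] == "0-14" and r["occupation"] != "none"]
record = {"age_class": "0-14", "marital_status": "married", "occupation": "employee"}
to_change = localize_errors(record, domains, edits)
```

The record fails both edits. Changing the age class alone makes it consistent, whereas changing marital status or occupation alone would leave one edit violated, so the age class is identified as the item to be imputed.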

Another important class of errors includes those that produce outliers in the data. Outliers are observations whose behavior deviates from that typical of most observations. Usually, techniques for outlier identification assume, at least implicitly, a model for the “typical” data and try to identify deviations from the model. Methods for outlier detection are often used to identify influential errors. However, the concept of influential error is to be distinguished from that of outlier: the latter is related to a model assumed for the data, while the former depends on the population parameter to be estimated. Specifically, outliers may not be caused by influential errors, and influential errors may not cause outliers.
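One simple way of flagging deviations from the “typical” data is a robust score based on the median and the median absolute deviation (MAD). The sketch below is one possible technique among many, not a method prescribed by the text; the 3.5 cut-off is a common rule of thumb and the data are invented.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score, based on the median and the MAD, exceeds `cutoff`."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so that it is comparable to a standard deviation under normality.
    return [v for v in values if 0.6745 * abs(v - centre) / mad > cutoff]

# Hypothetical turnover values: the last one is far from the bulk of the data.
turnover = [102, 98, 110, 95, 105, 99, 1040]
flagged = mad_outliers(turnover)
```

Whether a flagged value is also an influential error depends on the estimate being produced, so in practice the flag would be combined with a measure of the unit's impact on the target parameter.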

Correction of errors and imputation of partial non-responses

Once errors have been detected, incorrect values have to be replaced (imputed) by correct or nearly correct values. Imputation is also used for partial non-responses. Imputation is commonly used for a number of technical and practical reasons. First, released data must be complete and coherent at the micro level. Furthermore, imputation allows users to apply standard methods and software to the final dataset.

Many imputation methods are available in statistical packages, both proprietary and free. Imputation methods are frequently classified into parametric methods (e.g., regression-type methods), which rely on explicit assumptions about the probability distribution generating the data, and non-parametric methods (e.g., hot-deck, nearest neighbor donor), which are generally based on weaker, implicit assumptions.

Imputation methods are sometimes called deterministic when repeated applications to the same dataset give the same output, and stochastic when the outcomes are characterized by a certain level of variability.

The choice of the imputation method depends on the analyses that have to be performed on the data. For instance, if the quantity of interest is a linear quantity such as a mean or a total, a deterministic method may be appropriate, while if distributional characteristics (such as those involving the second moment of the data distribution) are also of interest, a stochastic imputation method is generally preferable.
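The difference can be seen on a toy example that compares mean imputation (deterministic) with a random hot-deck (stochastic). Both functions and the data are illustrative assumptions, and missing values are represented by None.

```python
import random
from statistics import mean, pvariance

def mean_imputation(values):
    """Deterministic: replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def random_hot_deck(values, rng):
    """Stochastic: replace each missing value with a randomly drawn observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Invented data: half of the values are missing.
income = [10, 20, 30, 40, None, None, None, None]
deterministic = mean_imputation(income)
stochastic = random_hot_deck(income, random.Random(0))

spread_observed = pvariance([v for v in income if v is not None])
spread_deterministic = pvariance(deterministic)
```

Mean imputation preserves the mean of the observed values (25) but halves the variance in this example (from 125 to 62.5), because all imputed values sit at the centre of the distribution. The random hot-deck draws from the observed values, so it tends to preserve their spread, at the price of results that change from one run to the next unless the random seed is fixed.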

Weighting, estimation and sampling error evaluation

The activities concerning the production of target estimates and the evaluation of sampling errors refer to sub-processes 5.6 “Calculate weights” and 5.7 “Calculate aggregates” of GSBPM.

Production of the estimates of interest

Each estimation method is based on the principle that the subset of the units of the population included in the sample must also represent the complementary subset consisting of the remaining units of the population. This principle is generally achieved by assigning to each unit included in the sample a weight that can be seen as the number of population elements represented by that unit.

The sample surveys carried out by Istat are large-scale surveys that have the purpose of providing a large number of estimates of population parameters such as counts, totals, proportions, averages, etc.

The estimation of the parameters of the population can be made using two different approaches:

  • Methods based on the direct approach, which use the values of the variable of interest observed on the sample units belonging to the domain of interest. They are the standard methods used by Istat and by all the main National Institutes of Statistics to produce survey estimates.
  • Methods based on the indirect approach, which make use of the values of the variable of interest observed on sample units belonging to a wider domain containing the domain of interest and/or to other survey occasions. They are usually used for particular estimation problems, such as producing estimates for domains in which the sample size is too small for direct methods.

Direct methods

In general, to estimate a total the following two operations should be performed:

  1. computation of the weight to be assigned to each unit included in the sample;
  2. calculation of the estimates as weighted sums of the values of the target variables, using the weights determined in step 1.

The weight given to each unit is obtained through a procedure divided into several steps:

  1. the starting weight of each sample unit, named direct weight, is calculated according to the sampling design, as the reciprocal of the inclusion probability;
  2. the starting weight is adjusted in order to account for non-response, obtaining the base weight;
  3. correction factors for the base weight are computed to take into account equality constraints between some known parameters of the population and the corresponding sample estimates;
  4. the final weight is obtained as the product of the base weight and the correction factors.
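The four steps can be sketched on a toy sample. For simplicity the example assumes a single non-response adjustment class and uses post-stratification on known population counts as the correction in step 3, which is a special case of the more general procedure; the field names, the counts and the data are hypothetical.

```python
def final_weights(sample, population_totals):
    """Compute direct, base and final weights for the responding units.

    Each unit has an inclusion probability, a response flag and a post-stratum label.
    """
    # Step 1: direct weight = reciprocal of the inclusion probability.
    for unit in sample:
        unit["direct"] = 1.0 / unit["inclusion_prob"]

    # Step 2: non-response adjustment (here: one adjustment class for the whole sample).
    all_weight = sum(u["direct"] for u in sample)
    respondents = [u for u in sample if u["responded"]]
    resp_weight = sum(u["direct"] for u in respondents)
    for unit in respondents:
        unit["base"] = unit["direct"] * all_weight / resp_weight

    # Step 3: estimated stratum counts, needed for the correction factors.
    estimated = {}
    for unit in respondents:
        estimated[unit["stratum"]] = estimated.get(unit["stratum"], 0.0) + unit["base"]

    # Step 4: final weight = base weight times the correction factor of its stratum.
    for unit in respondents:
        factor = population_totals[unit["stratum"]] / estimated[unit["stratum"]]
        unit["final"] = unit["base"] * factor
    return respondents

sample = [
    {"inclusion_prob": 0.01, "responded": True, "stratum": "M", "y": 10},
    {"inclusion_prob": 0.01, "responded": True, "stratum": "F", "y": 20},
    {"inclusion_prob": 0.01, "responded": True, "stratum": "F", "y": 30},
    {"inclusion_prob": 0.01, "responded": False, "stratum": "M", "y": None},
]
weighted = final_weights(sample, {"M": 190, "F": 210})

# Estimate of the total of y as a weighted sum over respondents.
total_y = sum(u["final"] * u["y"] for u in weighted)
```

After step 4 the weighted counts reproduce the known stratum totals exactly, and the same set of final weights is used for every target variable.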

The class of estimators corresponding to the operations described above is known as calibration estimators, since both the non-response adjustment and the weight correction to achieve consistency with known population parameters are obtained by solving a constrained minimization problem. Specifically, the distance between the weights before and after the calibration phase is minimized, subject to the constraints.
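For a single auxiliary variable and a chi-square type distance, the constrained minimization has a closed-form solution, shown in the sketch below. The choice of distance function, the function name and the data are assumptions made for illustration; other distance functions are commonly used, for instance to keep the calibrated weights within given bounds.

```python
def linear_calibration(base_weights, x, known_total):
    """Calibrate weights to a known total of x, minimising sum((w - d)^2 / d)."""
    estimated_total = sum(d * xi for d, xi in zip(base_weights, x))
    denom = sum(d * xi * xi for d, xi in zip(base_weights, x))
    lam = (known_total - estimated_total) / denom
    # Closed-form solution of the constrained minimisation: w_i = d_i * (1 + lam * x_i).
    return [d * (1 + lam * xi) for d, xi in zip(base_weights, x)]

# Hypothetical sample of three units; the known population total of x is 660.
base = [100.0, 100.0, 100.0]
x = [1.0, 2.0, 3.0]
calibrated = linear_calibration(base, x, known_total=660.0)
calibrated_total = sum(w * xi for w, xi in zip(calibrated, x))
```

The base weights estimate a total of 600 for x; after calibration the weighted total equals the known value of 660, with each weight moved as little as the chosen distance allows.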

The main problem in choosing the estimation method is to find an estimator satisfying:

  • efficiency criteria in terms of sampling variance and of bias due to total and partial non-responses and frame under-coverage;
  • external and internal consistency. External consistency of the estimates matters whenever known totals are available from external sources: estimates of totals produced by the survey should generally match, or not deviate too much from, the known values of these totals. Internal consistency is achieved when all the estimates of the same aggregate coincide with each other. This result can be obtained using a unique system of weights.

Calibration estimators meet the above criteria since:

  • they yield, generally, more efficient estimates than those obtained using direct estimators; the higher the correlation between the auxiliary variables and the target variables the greater the efficiency;
  • they are approximately design unbiased;
  • they produce estimates of totals that coincide with the known values of these totals;
  • they mitigate the non-response bias effect;
  • they reduce the bias due to the under-coverage of the frame from which the sample is selected.

Calibration estimators are used for calculating the final weights for most social and business surveys carried out by Istat.

Indirect methods

Indirect estimation methods are used by Istat to respond to the growing need of local governments for accurate information on small geographical areas or, more generally, small domains, called small areas. Sample surveys conducted by Istat are, however, designed to provide reliable information on the main aggregates of interest for planned domains defined at the design stage, and may not be able to support the production of estimates at a finer level of detail.

The solution adopted in the past by Istat to obtain estimates at the unplanned domain level was to increase the sample size without changing the sampling strategy, i.e. without modifying either the sampling design or the estimator. However, over-sampling, besides raising collection costs and complicating organizational issues, increases non-sampling errors because very large samples are difficult to handle. In addition, over-sampling is only a partial solution to the problem of small area estimation: since the sample size cannot be increased beyond a certain threshold, reliable estimates can be provided only for a subset of the small areas of interest.

For these reasons, Istat makes use of estimation methods based on:

  • the use of auxiliary information, related to the phenomena under study, known at the small area level;
  • the adoption (implicit or explicit) of statistical models linking the variable of interest observed in the small area with the values of the same variable related to a larger area containing the small area of interest and/or related to other survey occasions.
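Two textbook estimators illustrate how these ingredients are combined: a synthetic estimator, which applies the mean observed in a larger area to the known population of the small area, and a composite estimator, which averages the direct and synthetic estimates. They are shown only as a sketch of the idea and are not claimed to be the methods used by Istat; all figures and the weight phi are hypothetical.

```python
def synthetic_estimate(area_population, broad_area_mean):
    """Synthetic estimator: apply the mean of the larger area to the small-area population."""
    return area_population * broad_area_mean

def composite_estimate(direct, synthetic, phi):
    """Weighted average of the direct and synthetic estimates, with 0 <= phi <= 1."""
    return phi * direct + (1 - phi) * synthetic

# Hypothetical small area: 2000 residents (auxiliary information), regional rate of 10%.
synthetic = synthetic_estimate(2000, 0.10)
# The direct estimate from the few sampled units is unstable, so it gets a small weight.
composite = composite_estimate(direct=260.0, synthetic=synthetic, phi=0.25)
```

The synthetic component is stable because it borrows information from the larger area, but it is only as good as the implicit assumption that the small area behaves like the larger one.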

An important problem with these methods is that they are based on models and, therefore, the properties of the results depend on the validity of the model assumptions. Since models are never expected to match reality perfectly, these estimators introduce a bias that cannot be measured, which may raise serious concerns about their use.

Assessment of sampling errors

For the evaluation of the sampling errors of the estimates, Istat usually uses approximate variance estimation methods. In fact, for most estimation procedures an analytical expression of the estimator variance is not available, since:

  • Istat surveys are carried out using complex sampling designs, generally based on multiple selection stages, on the stratification of the units, and on selection without replacement with selection probabilities varying among units;
  • estimates are determined through the use of calibration estimators which are non-linear functions of the sample information.

The methods generally used in Istat to estimate the sampling variance are based on the linearization method of Woodruff (1971), which provides estimates of the sampling variance when the estimators used are non-linear functions of the sample data.
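The idea of linearization can be shown on the simplest non-linear estimator, the ratio of two estimated totals. The sketch assumes simple random sampling without replacement, which is far simpler than the designs described above, and the function name and data are hypothetical. The ratio is replaced by its first-order Taylor expansion, and the variance of the resulting linear quantity is estimated with the standard formula for a total.

```python
from statistics import variance

def ratio_variance_linearized(y, x, population_size):
    """Linearization variance of the ratio of two estimated totals.

    Assumes simple random sampling without replacement of len(y) units
    from a population of `population_size` units.
    """
    n = len(y)
    weight = population_size / n
    y_hat = weight * sum(y)
    x_hat = weight * sum(x)
    ratio = y_hat / x_hat
    # Linearized variable: first-order Taylor expansion of the ratio around the totals.
    z = [(yi - ratio * xi) / x_hat for yi, xi in zip(y, x)]
    # Variance of the estimated total of z under simple random sampling.
    f = n / population_size
    var = population_size ** 2 * (1 - f) * variance(z) / n
    return ratio, var

# Hypothetical sample of 4 units from a population of 40.
ratio, var = ratio_variance_linearized([2, 5, 6, 7], [1, 2, 3, 4], population_size=40)
```

The same scheme carries over to other non-linear estimators: only the expression of the linearized variable changes, while the variance formula is the one appropriate for totals under the actual sampling design.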

These variance estimation methods are implemented by Istat in the generalized software GENESEES and ReGenesees, which feature a user-friendly interface and are currently used to estimate the sampling errors of the estimates produced by Istat surveys.

In addition, these software tools make it possible to compute important statistics that help evaluate the adopted sampling design. In particular it is possible to evaluate:

  • the overall efficiency of the sampling design;
  • the impact on the efficiency of the estimates of stratification, of the number and type of selection stages, and of the weighting effect.

It is important to point out that a generalized representation of sampling errors is also provided. It is a summary obtained by means of regression models that relate the estimates to the corresponding sampling errors. These models provide important summary information on sampling errors and are disseminated together with the tables reporting the estimates.
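A regression model of this kind can be sketched as follows. The text does not state the functional form, so the log-linear relation between the squared relative error and the estimate used here is an assumption, as are the function names and the data, which are generated from a known model only to show that the fit recovers it.

```python
import math

def fit_error_model(estimates, relative_errors):
    """Fit log(cv^2) = a + b * log(estimate) by ordinary least squares."""
    xs = [math.log(e) for e in estimates]
    ys = [math.log(cv ** 2) for cv in relative_errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predicted_relative_error(estimate, a, b):
    """Relative sampling error implied by the fitted model for a given estimate."""
    return math.sqrt(math.exp(a + b * math.log(estimate)))

# Synthetic example: relative errors generated with a = 2 and b = -1.
estimates = [1000.0, 10000.0, 100000.0]
relative_errors = [math.sqrt(math.exp(2.0 - math.log(e))) for e in estimates]
a, b = fit_error_model(estimates, relative_errors)
cv_50000 = predicted_relative_error(50000.0, a, b)
```

Once the two coefficients are published, a user can obtain an approximate sampling error for any estimate in the tables without access to the microdata.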

Last edit: 19 March 2018