
Methods and software of the statistical process

Apply disclosure control

The primary function of a public statistical system is to produce official statistics for its own country. Legislative Decree no. 322 of 6 September 1989, which established the National Statistical System (Sistan), states: “The official statistical information is provided to the country and to international organizations through the National Statistical System” (Article 1, paragraph 2), and “the data processed in the framework of statistical surveys included in the national statistical programme are the heritage of the community and are distributed for purposes of study and research to those who require them under the rules of the present decree, subject to the prohibitions contained in Article 9” concerning statistical confidentiality (Article 10, paragraph 1). Legislative Decree no. 322/1989 also states that “the data collected in the context of statistical surveys included in the national statistical programme cannot be communicated or disseminated to any external public or private entity, nor to any office of public administration, except in aggregate form, so that no reference to identifiable individuals can be derived.” In any case, the data cannot be used to re-identify the parties concerned.

General principles concerning the protection of data confidentiality are established by Regulation (EU) 2016/679 of the European Parliament and of the Council. In the framework of European legislation, the Code of Ethics and Good Conduct attached to the Code on the protection of personal data (Legislative Decree no. 196 of 2003, amended by Legislative Decree no. 101 of 2018, containing provisions for the adaptation of national legislation to EU Regulation no. 679/2016) has been replaced by the Ethical Standards for processing for statistical or scientific research purposes carried out within the National Statistical System (Gazzetta Ufficiale no. 11 of 14 January 2019). These Standards define the concept of identifiability of a statistical unit in terms of the possibility, through the use of reasonable means, of establishing a significantly likely relationship between a combination of values of a certain set of variables and the identity of a respondent. They also specify the means that can reasonably be used to identify respondents, such as economic resources, time, linkage with nominative archives or other sources, etc.

Translating the concepts set out by the law into operational rules requires, from a statistical point of view, a preliminary identification of the statistical units subject to identification risk and thus a precise definition of what constitutes a breach of confidentiality. The subsequent quantification of the probability of disclosure makes it possible to define the most suitable techniques to ensure data protection. Defining as intruder a person who has an interest in breaching the confidentiality of the released data, an intrusion occurs when the intruder is able to match, with a certain degree of certainty, the released information to a respondent. The released statistical information never includes the variables that uniquely identify a person (such as tax code, name or company name, address, etc.). The confidentiality problem arises for the variables that allow the intruder to narrow down a subset of units in order to pursue his or her purposes. To limit the risk of disclosure, National Statistical Institutes may modify the data (for example by using perturbative techniques) or reduce their level of detail (e.g. municipality replaced by district or region). The application of protection techniques, both for the dissemination of tables and for the communication of elementary data, leads to a reduction of or a change in the information content of the data.

The breach of confidentiality in the disseminated tables

Tabular data are mostly used by National Statistical Institutes for the dissemination of aggregate data. The concept of a breach of confidentiality does not depend on the type of product used for dissemination. The definition of “confidential” information also includes the data defined by Articles 9 and 10 of Regulation (EU) no. 679/2016 of the European Parliament and of the Council, while public variables are not considered confidential (public variables refer to information present in public registers, lists, records, documents or sources accessible to anyone). Before releasing a table, an initial assessment of the information to be published is needed: if it is not confidential, then no protection is applied. Data protection for tabular data involves several steps. The first stage defines which tables are to be processed and what their characteristics are. The second defines the risk rule, i.e. the criterion used to determine whether or not a cell is at risk of a breach of confidentiality. The final phase concerns the implementation of procedures for the protection of confidentiality.

Magnitude tables

The risk rules for magnitude tables are those based on the size of the cells (threshold or frequency rule) and those based on concentration measures (dominance rule and ratio rule). The threshold rule, widely used by Istat, states that a cell is sensitive if the number of units it contains is less than a value n (the threshold) fixed a priori. In order to apply this rule to magnitude tables, the corresponding frequency tables are needed. There is no unique criterion for identifying the threshold value, which depends on the assumed intrusion scenario and the data processed.
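As a minimal sketch, the threshold rule reduces to a frequency check on each cell; the value n = 3 below is purely illustrative, since, as noted, the actual threshold depends on the intrusion scenario and the data:

```python
def threshold_rule(cell_frequency: int, n: int = 3) -> bool:
    """Flag a cell as sensitive when it contains fewer than n units.

    n = 3 is an illustrative threshold, not an Istat parameter; in
    practice it is fixed a priori for each table.
    """
    return cell_frequency < n

# A cell aggregating the values of only 2 units is sensitive:
print(threshold_rule(2))   # True
print(threshold_rule(10))  # False
```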

According to the dominance rule [(n, k)-dominance], a cell is at risk if its n largest contributors hold a proportion of its total value higher than a threshold k% fixed a priori. The level of protection that should be applied to the table depends on the two values n and k. Based on the statistical units involved and the desired level of security, the parameters can be set by identifying a maximum allowable concentration.
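The rule can be sketched as follows; the parameters n = 2 and k = 85 are assumed for illustration only:

```python
def dominance_rule(contributions: list[float], n: int = 2, k: float = 85.0) -> bool:
    """(n, k)-dominance: a cell is at risk when its n largest contributors
    account for more than k% of the cell total.

    n = 2 and k = 85.0 are illustrative values; in practice they are
    fixed a priori from the maximum allowable concentration.
    """
    total = sum(contributions)
    if total <= 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k

# Two enterprises holding 95 out of a cell total of 100 put the cell at risk:
print(dominance_rule([90, 5, 3, 2]))      # True  (top 2 hold 95%)
print(dominance_rule([30, 30, 20, 20]))   # False (top 2 hold 60%)
```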

The ratio rule (p% rule) relies on the accuracy with which the value of the largest contributor can be estimated, assuming that the second-largest contributor attempts the breach. The cell is considered at risk if the relative estimation error is less than a threshold p fixed a priori.

In the case of tables with possible data of opposite sign, the rules based on concentration measures lose their meaning. However, they can still be applied by using the absolute values of the contributions.

A breach of confidentiality is harder to achieve when negative magnitudes are possible. The general recommendation is therefore to assign less stringent parameter values to the risk functions than those adopted when all magnitudes are non-negative.

In the case of sample tables, the evaluation of the risk of disclosure has to take the sampling design into account. Each cell value is an estimate obtained by extending the sample value to the reference population, because the true population value is not available. Cells whose survey weights are greater than one carry a weak risk of violation, and in this context a breach of confidentiality appears unlikely. However, a careful assessment of the disclosure risk is still necessary: in some cases the most representative (dominant) units are included in the sample with certainty.

Frequency tables

Frequency tables are mainly used to represent social phenomena and census data. For this type of table, the criterion adopted to determine whether or not a cell is at risk is based on the size of the cells; the risk rules based on concentration measures cannot be applied.

Statistical protection of tables

Once the cells at risk have been selected, the table must be modified to anonymise its information content. Many data-protection techniques exist, ranging from the merging of original cells, to methods based on modifications of the original data, to the introduction of missing values (suppressions). Perturbation techniques modify the data so that the original values cannot be traced back. In this case the structure of the table remains unchanged, but the additivity between internal and marginal values is not always guaranteed. The methods most used by Istat are the amendment of classification variables, in order to induce the merging of the original cells, and the introduction of missing values.

The technique based on the introduction of missing values consists in suppressing the cells at risk. The distribution of the treated cells (primary suppressions) has to ensure that the table is adequately protected. When this is not the case, additional missing values (secondary suppressions) applied to non-risk cells are needed. Several algorithms have been proposed in the Statistical Disclosure Control literature to select secondary suppressions. Currently Istat adopts the HiTaS algorithm, available in generalized software such as Tau-ARGUS.
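HiTaS itself solves a sequence of optimisation problems over hierarchical tables; as a much simpler illustration of why secondary suppressions are needed at all (not Istat's actual procedure), the greedy sketch below suppresses the at-risk cells of a two-way table and then keeps hiding an extra cell in any row or column left with exactly one suppression, since a single suppressed cell could be recovered from the marginal totals:

```python
def suppress(table: list[list[float]], risky: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """Primary suppression of risky cells plus naive complementary
    suppressions: a row or column with exactly one hidden cell would let
    the marginal total reveal it, so the smallest other cell on that line
    is hidden too. This greedy sketch ignores the information-loss
    minimisation performed by real algorithms such as HiTaS.
    """
    hidden = set(risky)
    changed = True
    while changed:
        changed = False
        rows = [[(i, j) for j in range(len(table[0]))] for i in range(len(table))]
        cols = [[(i, j) for i in range(len(table))] for j in range(len(table[0]))]
        for line in rows + cols:
            suppressed = [c for c in line if c in hidden]
            if len(suppressed) == 1:
                # add the smallest still-visible cell on the same line
                extra = min((c for c in line if c not in hidden),
                            key=lambda c: table[c[0]][c[1]])
                hidden.add(extra)
                changed = True
    return hidden

table = [[10, 20, 30],
         [40,  5, 60],
         [70, 80, 90]]
print(sorted(suppress(table, {(1, 1)})))
# the single risky cell forces three complementary suppressions
```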

The breach of confidentiality in the release of elementary data

Elementary data can be defined as the end product of a statistical survey after the phases of design, implementation, monitoring and correction. In the dissemination phase, elementary data form an archive of records, each containing all the validated information (usually a subset of the information recorded) related to a statistical unit. As in the case of aggregate data disseminated through tables, these variables can be classified as variables usable for re-identification (hereinafter key variables) and confidential variables.

The methods of protection fall into three main categories:

  • global recoding of variables reduces the detail of some variables (e.g. age in five-year classes rather than annual classes);
  • local suppression of information erases the features that make some records at risk of disclosure;
  • data perturbation covers various methods that weaken the association between key and confidential variables.
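The first two methods can be sketched as follows; the variable names, the five-year class width and the record layout are assumptions for illustration, not Istat specifications:

```python
def recode_age(age: int, width: int = 5) -> str:
    """Global recoding: replace an exact age with its width-year class."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def local_suppression(record: dict, keys_at_risk: list[str]) -> dict:
    """Local suppression: blank out only the key variables that make this
    particular record identifiable, leaving other records untouched."""
    return {k: (None if k in keys_at_risk else v) for k, v in record.items()}

print(recode_age(37))  # "35-39"
print(local_suppression({"age": 37, "municipality": "X", "income": 1000},
                        ["municipality"]))
# {"age": 37, "municipality": None, "income": 1000}
```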

Among the initiatives for the “protected” release of elementary data, the Microdata Files for Research (MFR), the public use files (mIcro.STAT) and the Laboratory for the Analysis of ELEmentary data (ADELE) are worth mentioning. MFR files for social or business statistical surveys are specifically designed for scientific research needs. The release of these files is subject to the fulfilment of certain requirements relating both to the organization requesting access to the protected microdata and to the characteristics of its research project. mIcro.STAT files are public use files, obtained from the corresponding MFR file, properly treated in terms of confidentiality, and downloadable directly from the Istat website.

The ADELE Laboratory, active since 1999, is a so-called Research Data Centre (RDC), i.e. a “safe” place that researchers can access to carry out their own statistical analyses on elementary data produced by the National Statistical Institute in compliance with confidentiality principles. The ADELE Laboratory gives external “expert” users the ability to analyse the elementary data of Istat surveys, shifting the verification of confidentiality to the output of the analyses rather than their input (as is done for files for research purposes and public use files). The protection of confidentiality for the statistical analyses carried out at the ADELE Laboratory is ensured in several ways:

  • legally: the user signs a form committing to respect specific rules of conduct;
  • physically, through control of the working environment: the Laboratory is located at the Istat headquarters, and input, output and access to the external network are inhibited for users;
  • statistically: all results are checked before release.