This sub-process is where the data are transformed into statistical outputs. It includes the production of additional measurements such as indices, trends or seasonally adjusted series, as well as the recording of quality characteristics.
Computation and evaluation of composite indices
A composite index is a mathematical combination (or aggregation as it is termed) of a set of indicators that represent the different dimensions of a phenomenon to be measured.
Constructing a composite index is a complex task. Its phases involve several alternatives and possibilities that affect the quality and reliability of the results. The main problems in this approach concern the choice of theoretical framework, the availability of the data, the selection of the most representative indicators and their treatment in order to compare and aggregate them.
In particular, we can summarize the procedure in the following main steps:
- Defining the phenomenon to be measured. The definition of the concept should give a clear sense of what is being measured by the composite index. It should refer to a theoretical framework, linking various sub-groups and underlying indicators. The measurement model must also be defined, in order to specify the relationship between the phenomenon to be measured (concept) and its measures (individual indicators). If causality runs from the concept to the indicators we have a reflective model, in which indicators are interchangeable and correlations between indicators are explained by the model; if causality runs from the indicators to the concept we have a formative model, in which indicators are not interchangeable and correlations between indicators are not explained by the model.
- Selecting a group of individual indicators. The selection is generally based on theory, empirical analysis, pragmatism or intuitive appeal. Ideally, indicators should be selected according to their relevance, analytical soundness, timeliness, accessibility and so on. The selection step is the result of a trade-off between possible redundancies caused by overlapping information and the risk of losing information. However, the selection process also depends on the measurement model used: in a reflective model, all the individual indicators must be intercorrelated; whereas in a formative model they can show negative or zero correlations.
- Normalizing the individual indicators. This step aims to make the indicators comparable. Normalization is required before any data aggregation, as the indicators in a data set often have different measurement units. It is therefore necessary to bring the indicators to the same standard by transforming them into pure, dimensionless numbers. Another motivation for normalization is that some indicators may be positively correlated with the phenomenon to be measured (positive polarity), whereas others may be negatively correlated with it (negative polarity). The indicators are normalized so that an increase in a normalized indicator corresponds to an increase in the composite index. There are various methods of normalization, such as re-scaling (or Min-Max), standardization (or z-scores) and ‘distance’ from a reference (or index numbers).
- Aggregating the normalized indicators. This step combines all the components to form one or more composite indices (mathematical functions). It requires the definition of the importance of each individual indicator (weighting system) and the identification of the technique (compensatory or non-compensatory) for summarizing the individual indicator values into a single number. Different aggregation methods can be used, such as additive methods (compensatory approach) or multiplicative methods and unbalance-adjusted functions (non-compensatory or partially compensatory approach).
- Validating the composite index. The validation step aims to assess the robustness of the composite index, in terms of its capacity to produce correct and stable measures, and its discriminant capacity (Influence Analysis and Robustness Analysis).
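The normalization and aggregation steps above can be illustrated with a minimal Python sketch. It assumes Min-Max re-scaling, polarity handled by inverting the re-scaled values, and two aggregation functions (a weighted arithmetic mean as the compensatory case and a weighted geometric mean as a partially compensatory one). The indicator names, data and equal weights are hypothetical and are not taken from any actual index.

```python
from math import prod


def min_max(values, positive_polarity=True):
    """Re-scale values to [0, 1]; invert them when polarity is negative."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("indicator is constant; Min-Max is undefined")
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if positive_polarity else [1.0 - s for s in scaled]


def arithmetic_mean(row, weights):
    """Additive (compensatory) aggregation of one unit's indicators."""
    return sum(w * x for w, x in zip(weights, row)) / sum(weights)


def geometric_mean(row, weights):
    """Multiplicative (partially compensatory) aggregation; values must be >= 0."""
    total = sum(weights)
    return prod(x ** (w / total) for w, x in zip(weights, row))


# Hypothetical data: three units observed on two indicators.
income = [10.0, 20.0, 30.0]        # positive polarity
unemployment = [12.0, 4.0, 8.0]    # negative polarity

norm_income = min_max(income)                                  # [0.0, 0.5, 1.0]
norm_unemp = min_max(unemployment, positive_polarity=False)    # [0.0, 1.0, 0.5]

weights = [1.0, 1.0]  # equal weights, assumed for illustration
units = list(zip(norm_income, norm_unemp))
additive = [arithmetic_mean(u, weights) for u in units]        # [0.0, 0.75, 0.75]
multiplicative = [geometric_mean(u, weights) for u in units]
```

The difference between the two approaches shows up on unbalanced profiles: a unit with normalized values (0.9, 0.1) and one with (0.5, 0.5) get the same arithmetic mean, but the geometric mean penalizes the unbalanced unit.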
Seasonal adjustment of time series
Seasonality can be defined as the systematic intra-year movement caused by various factors, e.g. weather changes, calendar, vacation or holidays and usually consists of periodic, repetitive and generally regular and predictable patterns in the level of a time series. Seasonality can be influenced also by production and consumption decisions made by economic agents taking into account several factors like endowments, their own expectations, as well as preferences and the production techniques available in the economy.
By contrast, a cyclic pattern shows rises and falls of no fixed period, and its fluctuations usually last at least two years.
The overlap of these two kinds of fluctuation (seasonal and cyclic) in a time series can make short-term (monthly or quarterly) variations hard to interpret, above all when the seasonal component accounts for a large share of the observed data. For this reason, in order to measure cyclical changes, short-term variations are computed from seasonally adjusted series. Seasonal adjustment, in turn, is the process of removing seasonal and calendar effects from a time series. This process is performed by means of analytical techniques that break down the series into components with different dynamic features. These components are unobserved and have to be identified from the observed data on the basis of ex-ante assumptions about their expected behavior. Broadly speaking, seasonal adjustment includes the removal of both within-a-year seasonal movements and the influence of calendar effects (such as the different number of working days, or Easter and moving holidays).
Note that calendar effects are not constant across countries or economic sectors, so time series that include them are not comparable with each other. For this reason, calendar effects are generally removed together with the seasonal component in the seasonally adjusted series, which makes it possible to better capture the yearly variation (computed with respect to the same period of the previous year) as well as the mean yearly variation. Moreover, time series corrected only for calendar effects can be produced alongside the seasonally adjusted series.
Once the repeated impact of these effects has been removed, seasonally adjusted data highlight the underlying long-term trend and short-run innovations in the series.
Seasonal adjustment approaches
All the seasonal adjustment methods are based on the assumption that each time series, Yt (with a time index t = 1,2,…T), can be decomposed into three different unobserved components:
- A trend-cycle (CTt) component representing the long-run movement of the series (like those associated with business cycles). It generally depends on structural conditions like institutional situations, technological and demographic trends or patterns of civil and social organization.
- A seasonal component (St) representing the intra-year (monthly, quarterly) fluctuations.
- An irregular component (It) representing the short-term fluctuations that are not systematic and are, to a certain extent, unpredictable, e.g. uncharacteristic weather patterns.
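As a sketch of how the three components combine, the decomposition schemes named later in this section (additive, multiplicative, log-additive) are usually written as follows. The subscripted symbols correspond to Yt, CTt, St and It above; the formulas are the standard textbook forms and are given here as an assumption about the intended schemes.

```latex
\begin{align}
Y_t &= CT_t + S_t + I_t && \text{(additive)} \\
Y_t &= CT_t \times S_t \times I_t && \text{(multiplicative)} \\
\log Y_t &= \log CT_t + \log S_t + \log I_t && \text{(log-additive)}
\end{align}
```

Under the additive scheme the seasonally adjusted series is $Y_t - S_t$; under the multiplicative scheme it is $Y_t / S_t$.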
Although the series may be decomposed in different ways, two main approaches, consistent with the European guidelines (Eurostat 2015), are generally considered:
- Arima Model Based (AMB) approach, developed, among others, by Burman (1980), Box, Hillmer and Tiao (1978) and Hillmer and Tiao (1982). It is based on the assumption that there exists a parametric statistical model (ARIMA) representing the probabilistic structure of the stochastic process connected to the observed time series. The time series is assumed to be a finite part of a particular realization of a stochastic process. Consequently, the linear filters used in this approach depend on the features of the time series considered. This kind of approach is adopted in the TRAMO-SEATS (Time series regression with ARIMA noise, missing observations and outliers and Signal Extraction in ARIMA time series – TS) procedure developed by Gómez and Maravall (1996).
- Filter Based Approach (FLB), a non-parametric or semiparametric approach which, unlike the AMB approach, does not require a statistical model representing the series to be hypothesized. Instead, it is based on the iterative application to the series of several linear filters built on centered moving averages. These procedures are referred to as ad hoc because the filters are chosen according to empirical rules, without taking into account the probabilistic structure of the stochastic process generating the series. The classical methods of the X-11 (X11) family belong to this approach: from the first X11 and X11-Arima (X-11A) to the more recent X-12-ARIMA (X-12A) (Findley et al. 1998) and X-13-ARIMA-SEATS (X-13AS) (Findley, 2005), which include several improvements over the previous versions. The most remarkable of these is the use of reg-Arima models to pre-treat the data and to improve the forecasting performance of the series, which in turn translates into an improvement of the symmetric moving average filters employed and, generally, into a higher stability of the estimated seasonal factors.
In both cases the data are pre-treated in order to select the decomposition scheme to be applied to the time series (additive, multiplicative, log-additive, etc.). Moreover, some deterministic effects like outliers or calendar effects are removed. This pre-treated series is the input of the following step, whose output is the seasonally adjusted series (SA). Once the seasonally adjusted series is obtained, there is a last step in which some elements identified in the pre-treatment phase and related to the trend-cycle component (like level shifts) or to the irregular component (like additive outliers or temporary changes) are included back, while the calendar effects and the seasonal outliers are left out of the final series.
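To give a feel for what a filter based on centered moving averages does, here is a minimal Python sketch of a classical additive decomposition for monthly data. It is a textbook simplification, not the X-11, X-13-ARIMA-SEATS or TRAMO-SEATS procedure: there is no pre-treatment, no outlier or calendar correction and no iteration of filters. The function names and the synthetic series are hypothetical, and the series is assumed to start in the first month of the seasonal cycle.

```python
PERIOD = 12  # monthly data


def centered_ma_2x12(y):
    """2x12 centered moving average; None where the window is incomplete."""
    n = len(y)
    trend = [None] * n
    for t in range(6, n - 6):
        window = y[t - 6 : t + 7]  # 13 observations centered on t
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / 12.0
    return trend


def seasonal_adjust_additive(y):
    """Return (seasonal factors by month, seasonally adjusted series)."""
    trend = centered_ma_2x12(y)
    # Collect detrended values (seasonal + irregular) month by month.
    buckets = [[] for _ in range(PERIOD)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            buckets[t % PERIOD].append(obs - tr)
    raw = [sum(b) / len(b) for b in buckets]
    # Center the factors so that they sum to zero over the year.
    mean_raw = sum(raw) / PERIOD
    seasonal = [s - mean_raw for s in raw]
    adjusted = [obs - seasonal[t % PERIOD] for t, obs in enumerate(y)]
    return seasonal, adjusted


# Synthetic series: linear trend plus a fixed seasonal pattern summing to zero.
pattern = [3, -1, -2, 0, 1, 2, -3, 1, 2, -1, -2, 0]
series = [100 + 0.5 * t + pattern[t % PERIOD] for t in range(36)]

seasonal, adjusted = seasonal_adjust_additive(series)
```

On this noise-free series the estimated seasonal factors reproduce `pattern` and the adjusted series reproduces the linear trend, up to floating-point error. At least two full years of data are needed so that every month has a detrended value.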
The primary function of a public statistical system is to produce official statistics for its own country. In fact, the Legislative Decree no. 322 of September 6, 1989, constituting the National Statistical System (Sistan), states: “The official statistical information is provided to the country and to international organizations through the National Statistical System” (Article 1, paragraph 2), and “the data processed in the framework of statistical surveys included in the national statistics program are the heritage of the community and are distributed for purposes of study and research to those who require them under the rules of the present decree, subject to the prohibitions contained in art. 9” concerning statistical confidentiality (art. 10, paragraph 1). The Legislative Decree no. 322/1989 also states that “the data collected in the context of statistical surveys included in the national Statistical Programme cannot be communicated or disseminated to any external, public or private entity, nor to any office of public administration except in aggregate form so as to be unable to derive any reference to identifiable individuals.” In any case, the data cannot be used in order to re-identify the parties concerned.
Further principles concerning the protection of data confidentiality are established by the Code of Professional Ethics and Good Conduct for the Treatment of personal data for statistical purposes and scientific research purposes performed within the framework of the National Statistical System (Legislative Decree no. 196, June 30, 2003). In particular, the Code defines the identifiability of a statistical unit as the possibility, through the use of reasonable means, of establishing a significantly likely relationship between the combination of categories of the variables relating to the statistical unit and its identifying data. Moreover, the Code specifies the means that may be used to identify the person concerned, for example economic resources, time, and the possibility of cross-checking with name records or other sources.
Translating the concepts laid down by the law into operational rules from a statistical point of view requires a preliminary identification of the statistical units subject to identification risk, and thus a precise definition of what constitutes a breach of confidentiality. The subsequent quantification of the probability of breaching confidentiality then determines the most suitable techniques to ensure data protection.
The definition of a breach of confidentiality adopted by the National Statistical Institutes is based on the concept of identifiability of a unit of the population observed (the respondent). If an intruder is defined as a person who has an interest in breaching the confidentiality of the released data, the intrusion occurs when the intruder is able to match, with a certain degree of certainty, the information released to the respondent. The release of statistical information containing confidential data never includes the so-called direct identifiers (i.e. the variables that uniquely identify the person, such as tax code, name or company name, address, etc.). The problem arises for so-called indirect identifiers (or key variables). These are variables that do not identify the person directly but allow the population to which the person belongs to be narrowed down, and which the intruder will use for his/her own purposes. Indirect identification could result, for example, from the combined use of territorial variables, economic activity and size class. The mechanism by which identification happens may be immediate (e.g. direct recognition) or rely on more or less complex algorithms for combining information (record linkage, statistical matching, etc.).
To limit the risk of identification, the National Statistical Institutes may modify the data (for example by using perturbation techniques), or act on the indirect identifiers by removing them in whole or in part, or by reducing their detail (e.g. by deciding not to release a detail such as municipality and releasing in its place the variable district or region). The application of the protection techniques, both for the dissemination of tables and for the communication of elementary data, leads to a reduction or a change in the information content of the data released (loss of information).
The breach of confidentiality in the disseminated tables
Tables are the tool most used by the national statistical institutes for the dissemination of aggregate data, that is, data grouped together in cells defined by the intersection of the classification variables. The concept of a breach of confidentiality does not depend upon the type of product used for the dissemination. Consistently with the previous section, in the case of aggregate data too a breach occurs when it is possible to draw information that allows the identification of the individual. The definition of “confidential” information also includes sensitive data and judicial data (as defined in the Legislative Decree no. 196, June 30, 2003, art. 4), while public variables are not considered confidential (a public variable being the character or combination of characters, qualitative or quantitative, subject to a statistical survey that refers to information present in public registers, lists, records, documents or sources accessible to anyone, as defined in the Code of Conduct). When a table is to be released, an initial assessment concerns the information content of the data to be published: if it is not confidential then there is no need to apply statistical protection procedures to the data; otherwise it is necessary to apply the rules for the protection of privacy. The risk of a breach of confidentiality in a table is evaluated for each cell individually: when the value inside one of the cells can be traced (with a certain degree of certainty) to the subject to which the data relate (sensitive cell), the table does not respect the rules on the protection of confidentiality.
The process of aggregated data protection comprises several phases. The first phase defines the area in which one is working: which tables are to be processed and what their characteristics are. The second defines the risk rule, that is, the criterion used to determine whether or not a cell is at risk of a breach of confidentiality. The final phase concerns the implementation of procedures for the protection of confidentiality. These depend on the type of tables that one intends to release and on any restrictions on publication, but also on the type of reserved variables, on the complexity underlying each processing step and on data availability.
Although some of the principles described below, with particular reference to the threshold rule, are also used for frequency tables, the rules listed mainly refer to the magnitude tables. In the case of frequency tables, cells at risk are identified as a result of an evaluation done on a case by case basis and not by resorting to general rules as is the case of the magnitude tables.
Magnitude tables and risk rules
The risk rules used for magnitude tables are those based on the size of the cells (threshold or frequency rule) and those based on concentration measures (dominance rule and ratio rule). According to the threshold rule, which is widely used by Istat, a cell is sensitive if the number of units it contains is less than a value n (the threshold) fixed a priori. Applying this rule to magnitude tables requires the corresponding frequency table. The protection depends upon the value of n that is applied to the table: the higher the threshold value, the higher the level of protection applied. There is no single criterion for identifying the threshold value, which will depend on the assumed intrusion scenario and on the data processed. The minimum value of the threshold is three (as provided by the Code of Conduct).
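The threshold rule can be sketched in a few lines of Python. The function name, the cell labels and the counts are hypothetical; the only elements taken from the text are the comparison against n and the minimum threshold of three.

```python
def sensitive_by_threshold(cell_counts, n=3):
    """Return the cells whose number of contributing units is below n.

    cell_counts maps a cell label to the number of units in that cell,
    i.e. it is the frequency table associated with the magnitude table.
    """
    if n < 3:
        raise ValueError("the threshold must be at least three")
    return {cell for cell, count in cell_counts.items() if count < n}


# Hypothetical frequency table: (region, economic activity) -> number of units.
counts = {
    ("North", "Retail"): 12,
    ("North", "Mining"): 2,
    ("South", "Retail"): 3,
}

at_risk = sensitive_by_threshold(counts)        # only ("North", "Mining")
at_risk_strict = sensitive_by_threshold(counts, n=5)
```

Raising n from 3 to 5 also flags the cell with three units, which illustrates how a higher threshold gives more protection at the cost of more suppressed information.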
According to the dominance rule [(n, k)-dominance], a cell is at risk if the first n contributors hold a proportion of its total value which is higher than the threshold k% fixed a priori. The level of protection applied to the table depends on the two values n and k. There is no single criterion for fixing the two parameters. Based on the statistical units involved and the desired level of security, one can define the parameters by identifying a maximum allowable concentration.
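A minimal Python sketch of the (n, k)-dominance check follows. The function name and the example contributions are hypothetical, and the sketch assumes non-negative contributions.

```python
def is_dominated(contributions, n, k):
    """True if the n largest contributors hold more than k% of the cell total.

    Assumes all contributions are non-negative.
    """
    values = sorted(contributions, reverse=True)
    total = sum(values)
    if total == 0:
        return False
    top_n = sum(values[:n])
    # Compare 100 * top_n with k * total to avoid dividing.
    return 100 * top_n > k * total


# Hypothetical cell with four contributors and total value 100.
cell = [60, 25, 10, 5]

risk_1_50 = is_dominated(cell, n=1, k=50)   # 60% > 50%
risk_2_90 = is_dominated(cell, n=2, k=90)   # 85% is not above 90%
```

The same cell can be safe or at risk depending on the pair (n, k), which is why the parameters have to be fixed in advance according to the maximum concentration considered acceptable.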
The ratio rule (p-rule) relies on the accuracy with which the value of the first contributor can be estimated, assuming that the second contributor attempts the breach. The cell is considered at risk if the relative error is less than a p threshold fixed a priori.
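The ratio rule can be sketched as follows, under the usual assumption (not spelled out in the text) that the second-largest contributor knows the cell total and its own value, and therefore estimates the largest contribution as the total minus its own. The estimation error is then the sum of all remaining contributions. The function name and example values are hypothetical, and contributions are assumed non-negative.

```python
def is_sensitive_p_rule(contributions, p):
    """True if the second contributor can estimate the first within p%.

    Assumes non-negative contributions and an intruder who knows the
    published cell total and its own contribution.
    """
    values = sorted(contributions, reverse=True)
    if not values or values[0] == 0:
        return False
    largest = values[0]
    second = values[1] if len(values) > 1 else 0
    # Error made when estimating the largest value as (total - second).
    remainder = sum(values) - largest - second
    return 100 * remainder < p * largest


# Hypothetical cells, both with total value 100.
concentrated = [80, 15, 3, 2]   # estimate of 80 is 85: error 5, i.e. 6.25%
balanced = [40, 30, 20, 10]     # estimate of 40 is 70: error 30, i.e. 75%

risk_concentrated = is_sensitive_p_rule(concentrated, p=10)
risk_balanced = is_sensitive_p_rule(balanced, p=10)
```

With p = 10 the concentrated cell is at risk and the balanced one is not. A cell with a single contributor is always flagged, since the published total reveals its value exactly.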
In case of tables with possible data of opposite sign, the risk rules based on concentration measurements become meaningless. However, their application is possible by using the absolute values of the contributors.
Carrying out a breach of confidentiality in a context with possible negative data is much more complex. The general recommendation is therefore to assign less stringent parameter values to the risk functions than those used for solely positive data.
In the case of sample tables, which are obtained by collecting data on a subset of the reference population, the evaluation of the risk of a breach of confidentiality has to take into account the sampling design used. The value listed in each cell is an estimate obtained by extending a partial value (observed in the sample) to the reference population. The units observed are unknown, and the true population value is not known either. The risk of violation is limited for cells containing data with a survey weight greater than one. In this context a breach of confidentiality appears unlikely. However, especially for tables of economic data, a careful assessment of the risk of a breach of confidentiality is necessary even in the case of sample tables. In fact, in some cases the most representative (dominant) units are included in the sample with certainty. Furthermore, in the case of stratified samples, some cells are sampled at 100%, so that the observed value coincides (apart from missing responses) with the population value.
Except for special cases where the sample design and the number of sampled units allow considering a safe table in terms of confidentiality, rules of confidentiality must be applied to the sample tables too.
The criterion used by Istat is to apply the risk rules to the estimated cell values obtained using the survey weights. This requires that the sampled units are “similar” to those present in the population.
Frequency tables and risk rule
Frequency tables are used primarily to represent social phenomena and census data. The only criterion to determine whether or not a cell is at risk in this type of table is based on the size of the cells; risk rules based on measures of concentration cannot be applied. There are no universal rules for determining whether a frequency table is at risk of a breach of confidentiality. In fact, a cell with a low frequency (for example equal to 1) does not always indicate a cell at risk and, vice versa, a cell that contains a high number of units cannot always be considered safe in terms of statistical confidentiality.
As a general rule, frequency tables that present one of the cases listed below are considered at risk of a confidentiality breach:
- a marginal with fewer than three contributors;
- all units belong to a single category (group disclosure), or the sole contributor of a cell (self-recognition) acquires confidential information on all other units (all concentrated in another cell).
Statistical protection of tables
After the cells at risk are identified, the table must be modified appropriately so as to ensure the anonymity of the information it contains. There are many data protection techniques, ranging from the merging of adjacent categories, to methods based on modification of the original data, to the introduction of missing values (suppressions). The methods used by Istat are the modification of the categories of the classification variables and the introduction of missing values.
One method of protecting tables that is not based on changing the values in the cells is to define a different combination of categories. Once the risk rule has been identified, the method consists in determining the categories in such a way that the distribution of the characters and/or units in the cells does not present any sensitive cell.
By changing the categories accordingly it is possible, for example, to obtain a table which has a minimum size (for example greater than or equal to three) in each cell, or a table with a preset maximum concentration of the character in each cell.
Changing the categories of the classifying variables can be considered a practical solution only when the nature of the classifying variables allows it, and if the tables to be released need not satisfy strict rules dictated by regulations that constrain the detail of the classifying variables.
The technique that involves the insertion of missing values requires that the value of the cells at risk be deleted (obscured). The suppression of cells at risk is also called primary suppression. Introducing missing values into the sensitive cells does not complete the process of protecting the table. In the first place, it is necessary to check that the deleted cells cannot be calculated from the released data, for example by difference from the marginal values. Suppressions must be distributed among the cells of the table so that the table is properly protected according to the set criteria. When this is not the case it is necessary to introduce additional missing values among the cells not at risk: the secondary suppressions. The literature proposes several algorithms for determining the secondary suppressions. Currently the one most widely used by Istat is the HiTas algorithm, available in some generalized software such as Tau-ARGUS.
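The need for secondary suppressions can be illustrated with a small Python sketch that checks whether suppressed cells of a two-way table can be recomputed by difference from the published marginal totals. This is only a simple consistency check under the assumption that all row and column totals are published; it is not the HiTas algorithm, it does not choose which secondary cells to suppress, and it does not bound the interval within which a suppressed value must lie. The function name and cell coordinates are hypothetical.

```python
def exposed_by_difference(suppressed):
    """Return the suppressed cells that can be recomputed by subtraction.

    suppressed is a set of (row, column) positions of suppressed interior
    cells. A cell is exposed when it is the only suppression left in its
    row or in its column; once recomputed it counts as known, which may
    expose further cells.
    """
    hidden = set(suppressed)
    exposed = set()
    while True:
        newly = {
            cell
            for cell in hidden
            if sum(1 for other in hidden if other[0] == cell[0]) == 1
            or sum(1 for other in hidden if other[1] == cell[1]) == 1
        }
        if not newly:
            return exposed
        exposed |= newly
        hidden -= newly


# Primary suppression only: the cell is recovered from its row total.
primary_only = exposed_by_difference({(0, 0)})

# Three suppressions still unravel one after the other.
three_cells = exposed_by_difference({(0, 0), (0, 1), (1, 0)})

# A 2x2 block of suppressions resists recovery by simple difference.
block = exposed_by_difference({(0, 0), (0, 1), (1, 0), (1, 1)})
```

In the last pattern every affected row and column contains two suppressed cells, so no single value can be obtained by subtracting published cells from a marginal total.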
Tables are defined as linked when they contain data on the same response variable and share at least one classifying variable. The most frequent case of linked tables is represented by tables with common cells, with particular reference to the marginal values. The connection between the statistical data may also form part of a wider context. Sometimes, indeed, different surveys publish the same aggregates.
The application of the confidentiality rules to linked tables implies that common information (cells) is assigned the same publication status.
To optimize the process of protection it would be appropriate, where possible, to simultaneously operate the protection of all the linked tables.
The breach of confidentiality in the release of elementary data
Elementary data can be defined as the end product of a statistical survey after the phases of design, implementation, monitoring and correction. In the dissemination phase, the elementary data form an archive of records, each containing all the validated information (usually a subset of that recorded) relating to an individual statistical unit. As in the case of aggregate data disseminated through tables, the variables can be classified as key variables (indirect identifiers) or as reserved variables.
Compared with the release of tables, there is a substantial change both in the set of key variables, which in general will be more numerous, and in the content of a possible violation, as the reserved variables in the elementary data are shown all together. On the other hand, the release of micro-data only covers sample surveys, and file access is much more controlled (for research purposes only and subject to the signing of a form/contract). However, there is no doubt that the release of elementary data is a more sensitive issue than the dissemination of tables. For this reason, specific models for measuring the risk of identification, often based on probability models, have been developed. The methods for protecting elementary data fall into three categories:
- recoding of variables (global recoding): reducing the detail with which some variables are released (e.g. age in five-year classes rather than annual classes);
- suppression of information (local suppression): eliminating features that make some records more easily identifiable;
- perturbation of the published data: various methods, with the same purpose as those used for tables.
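The first two categories can be sketched in Python on a toy micro-data file. The sketch assumes a very simple risk criterion, namely that a record is at risk when its combination of key variables occurs fewer than `min_count` times in the file; this is a simplification and not one of the probability models mentioned above. The variable names, records and the value of `min_count` are hypothetical.

```python
from collections import Counter


def recode_age(age, width=5):
    """Global recoding: replace an exact age with its five-year class."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def global_recode(records):
    """Apply the age recoding to every record."""
    return [{**r, "age": recode_age(r["age"])} for r in records]


def key_counts(records, keys):
    """Count how many records share each combination of key variables."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def local_suppress(records, keys, target, min_count=2):
    """Local suppression: blank `target` in records with a rare key combination."""
    counts = key_counts(records, keys)
    return [
        {**r, target: None} if counts[tuple(r[k] for k in keys)] < min_count else dict(r)
        for r in records
    ]


# Hypothetical micro-data file.
records = [
    {"age": 31, "region": "North", "job": "teacher"},
    {"age": 33, "region": "North", "job": "teacher"},
    {"age": 34, "region": "North", "job": "notary"},
    {"age": 52, "region": "South", "job": "farmer"},
    {"age": 54, "region": "South", "job": "farmer"},
]
keys = ["age", "region", "job"]

unique_before = sum(1 for c in key_counts(records, keys).values() if c == 1)
recoded = global_recode(records)
unique_after = sum(1 for c in key_counts(recoded, keys).values() if c == 1)
released = local_suppress(recoded, keys, target="job")
```

With exact ages every record is unique on the key variables; after recoding only the notary remains unique, and local suppression removes the occupation from that single record instead of coarsening it for the whole file.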
The initiatives for the “protected” release of elementary data include the so-called Microdata File for Research (MFR), the public use files (mIcro.STAT) and the Laboratory for the Analysis of the ELEmentary data (ADELE). MFR files are produced for statistical surveys on individuals, families and businesses, and are made specifically for the needs of scientific research. The release of these files is subject to the fulfilment of certain requirements relating both to the organization and to the characteristics of the research project for which the file is required. MIcro.STAT files are public use files, obtained from the respective MFR file, properly treated in terms of protection of privacy and downloadable directly from the Istat website.
The ADELE Laboratory, active since 1999, is a so-called Research Data Centre (RDC), a “safe” place which researchers and academics can access to carry out their own statistical analyses of elementary data produced by the National Statistical Institute, in compliance with the confidentiality rules. The main objective of the ADELE Laboratory is to give external “expert” users the ability to analyze the elementary data of the Istat surveys, by shifting the verification of confidentiality protection to the output of the statistical analysis rather than the input (as is the case for files for research purposes and public use files). The protection of confidentiality for processing carried out at the ADELE Laboratory is ensured in several ways:
- legally: the user signs a form in which he/she undertakes to comply with specific rules of conduct;
- physically: through the control of the working environment. The Laboratory is located at the head office of Istat, with staff in charge of the control room; input and output operations and access to external networks are disabled for users;
- statistically: by controlling the user analysis results prior to the release.