For the Analyse phase, the most important information about the following sub-processes is presented:
This sub-process is where the data are transformed into statistical outputs. It includes the production of additional measurements such as indices, trends or seasonally adjusted series, as well as the recording of quality characteristics.
Computation and evaluation of composite indices
A composite index is a mathematical combination (or aggregation as it is termed) of a set of indicators that represent the different dimensions of a phenomenon to be measured.
Constructing a composite index is a complex task. Its phases involve several alternatives and possibilities that affect the quality and reliability of the results. The main problems in this approach concern the choice of the theoretical framework, the availability of the data, the selection of the most representative indicators, and their treatment in order to compare and aggregate them.
In particular, we can summarize the procedure in the following main steps:
- Defining the phenomenon to be measured. The definition of the concept should give a clear sense of what is being measured by the composite index. It should refer to a theoretical framework, linking various sub-groups and underlying indicators. Also the model of measurement must be defined, in order to specify the relationship between the phenomenon to be measured (concept) and its measures (individual indicators). If causality is from the concept to the indicators we have a reflective model – indicators are interchangeable and correlations between indicators are explained by the model; if causality is from the indicators to the concept we have a formative model – indicators are not interchangeable and correlations between indicators are not explained by the model.
- Selecting a group of individual indicators. The selection is generally based on theory, empirical analysis, pragmatism or intuitive appeal. Ideally, indicators should be selected according to their relevance, analytical soundness, timeliness, accessibility and so on. The selection step is the result of a trade-off between possible redundancies caused by overlapping information and the risk of losing information. However, the selection process also depends on the measurement model used: in a reflective model, all the individual indicators must be intercorrelated; whereas in a formative model they can show negative or zero correlations.
- Normalizing the individual indicators. This step aims to make the indicators comparable. Normalization is required before any data aggregation because the indicators in a data set often have different measurement units. It is therefore necessary to bring the indicators to the same standard by transforming them into pure, dimensionless numbers. Another motivation for normalization is that some indicators may be positively correlated with the phenomenon to be measured (positive polarity), whereas others may be negatively correlated with it (negative polarity). The indicators are normalized so that an increase in a normalized indicator corresponds to an increase in the composite index. There are various methods of normalization, such as re-scaling (or Min-Max), standardization (or z-scores) and ‘distance’ from a reference (or index numbers).
- Aggregating the normalized indicators. This step combines all the components to form one or more composite indices (mathematical functions). It requires the definition of the importance of each individual indicator (weighting system) and the identification of the technique (compensatory or non-compensatory) for summarizing the individual indicator values into a single number. Different aggregation methods can be used, such as additive methods (compensatory approach) or multiplicative methods and unbalance-adjusted functions (non-compensatory or partially compensatory approach).
- Validating the composite index. The validation step aims to assess the robustness of the composite index, in terms of its capacity to produce correct and stable measures, and its discriminant capacity (Influence Analysis and Robustness Analysis).
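The normalization and aggregation steps above can be illustrated with a minimal sketch. It assumes Min-Max re-scaling, equal weights, and two hypothetical indicators (`employment` with positive polarity, `pollution` with negative polarity); the data, names and weights are invented for illustration and are not taken from any Istat procedure.

```python
from math import prod


def min_max(values, positive=True):
    """Re-scale one indicator to [0, 1]; flip it when polarity is negative."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant indicator: Min-Max is undefined")
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if positive else [1.0 - s for s in scaled]


def arithmetic_mean(row, weights):
    """Additive (compensatory) aggregation of one unit's normalized values."""
    return sum(w * x for w, x in zip(weights, row)) / sum(weights)


def geometric_mean(row, weights):
    """Multiplicative (partially compensatory) aggregation; needs values >= 0."""
    total = sum(weights)
    return prod(x ** (w / total) for w, x in zip(weights, row))


# Hypothetical raw data for four units (A, B, C, D) and two indicators.
employment = [0.0, 100.0, 62.5, 50.0]   # positive polarity
pollution = [8.0, 6.0, 3.0, 0.0]        # negative polarity

# One tuple of normalized values per unit.
normalized = list(zip(min_max(employment), min_max(pollution, positive=False)))
weights = [0.5, 0.5]                    # equal weights, an assumption

additive = [arithmetic_mean(row, weights) for row in normalized]
multiplicative = [geometric_mean(row, weights) for row in normalized]
```

With these invented data, units B and C receive the same additive score, because B's high value on one indicator fully compensates for its low value on the other. The geometric mean ranks B below C, which is the sense in which multiplicative methods are only partially compensatory.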
Seasonal adjustment of time series
Seasonality can be defined as the systematic intra-year movement caused by various factors (e.g. weather changes, the calendar, vacations or holidays). It usually consists of periodic, repetitive and generally regular and predictable patterns in the level of a time series. Seasonality can also be influenced by production and consumption decisions made by economic agents, who take into account factors such as endowments, their own expectations, preferences and the production techniques available in the economy.
By contrast, a cyclical pattern shows rises and falls of no fixed period, and the length of its fluctuations is usually at least 2 years.
The overlap of these two kinds of fluctuations (seasonal and cyclical) in a time series can make short-term (monthly or quarterly) variations hard to interpret, especially when the seasonal component accounts for a large share of the observed data. For this reason, in order to measure cyclical changes, short-term variations are computed from seasonally adjusted series. Seasonal adjustment, in turn, is the process of removing seasonal and calendar effects from a time series. This process is performed by means of analytical techniques that break down the series into components with different dynamic features. These components are unobserved and have to be identified from the observed data on the basis of ex-ante assumptions about their expected behavior. Broadly speaking, seasonal adjustment includes the removal of both within-a-year seasonal movements and the influence of calendar effects (such as the different number of working days, or Easter and moving holidays).
Note that calendar effects are not constant across countries or economic sectors, so time series that include them are not comparable with each other. For this reason, calendar effects are generally removed together with the seasonal component in the seasonally adjusted series, which makes it easier to capture the yearly variation (computed with respect to the same period of the previous year) as well as the mean yearly variation. Moreover, time series corrected only for calendar effects can be produced alongside the seasonally adjusted series.
Once the repeated impact of these effects has been removed, seasonally adjusted data highlight the underlying long-term trend and the short-run innovations in the series.
Seasonal adjustment approaches
All seasonal adjustment methods are based on the assumption that each time series Yt (with time index t = 1, 2, …, T) can be decomposed into three unobserved components:
- A trend-cycle component (CTt) representing the long-run movement of the series (such as that associated with business cycles). It generally depends on structural conditions like institutional situations, technological and demographic trends or patterns of civil and social organization.
- A seasonal component (St) representing the intra-year (monthly, quarterly) fluctuations.
- An irregular component (It) representing the short-term fluctuations that are not systematic and are, to a certain extent, unpredictable, e.g. uncharacteristic weather patterns.
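As an illustration of the three components, the sketch below applies a textbook additive decomposition, Yt = CTt + St + It, using a centered moving average for the trend-cycle. It is a deliberately simplified example under stated assumptions (additive scheme, stable seasonal pattern, no calendar effects or outliers), not the procedure of any official method; the function names and the test series are hypothetical.

```python
def centered_ma(series, period):
    """Centered moving average (2 x period when period is even).

    Returns None at the ends of the series, where the window is incomplete.
    """
    half = period // 2
    out = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Half weights on the two end points keep the filter centered.
            out[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
        else:
            out[t] = sum(window) / period
    return out


def additive_decomposition(series, period):
    """Split a series into trend-cycle, seasonal and irregular components."""
    trend = centered_ma(series, period)
    sums, counts = [0.0] * period, [0] * period
    for t, (y, ct) in enumerate(zip(series, trend)):
        if ct is not None:
            sums[t % period] += y - ct          # detrended value
            counts[t % period] += 1
    if 0 in counts:
        raise ValueError("series too short to estimate every seasonal position")
    raw = [s / c for s, c in zip(sums, counts)]
    mean_raw = sum(raw) / period
    pattern = [r - mean_raw for r in raw]       # seasonal effects sum to zero
    seasonal = [pattern[t % period] for t in range(len(series))]
    adjusted = [y - s for y, s in zip(series, seasonal)]
    irregular = [None if ct is None else y - ct - s
                 for y, ct, s in zip(series, trend, seasonal)]
    return trend, seasonal, irregular, adjusted


# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
true_pattern = [3.0, -1.0, -4.0, 2.0]
series = [10.0 + t + true_pattern[t % 4] for t in range(12)]
trend, seasonal, irregular, adjusted = additive_decomposition(series, period=4)
```

On this artificial series the seasonally adjusted values reduce to the underlying linear trend, because the seasonal pattern is perfectly stable and there is no irregular component. Real series need the pre-treatment and model-based or iterative filtering described in the text.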
Although the series may be decomposed in different ways, two main approaches consistent with the European guidelines (Eurostat 2015) are generally considered:
- ARIMA Model Based (AMB) approach, developed, among others, by Burman (1980), Box, Hillmer and Tiao (1978) and Hillmer and Tiao (1982). It is based on the assumption that there exists a parametric statistical model (ARIMA) representing the probabilistic structure of the stochastic process underlying the observed time series. The time series is assumed to be a finite part of a particular realization of a stochastic process. Consequently, the linear filters used in this approach depend on the features of the time series considered. This approach is adopted in the TRAMO-SEATS procedure (Time series Regression with ARIMA noise, Missing observations and Outliers, and Signal Extraction in ARIMA Time Series; TS) developed by Gómez and Maravall (1996).
- Filter Based (FLB) approach, a non-parametric or semi-parametric approach which, unlike the AMB approach, does not require a statistical model to be hypothesized for the series. Instead, it is based on the iterative application to the series of several linear filters built on centered moving averages. These procedures are referred to as ad hoc because the filters are chosen according to empirical rules, without taking into account the probabilistic structure of the stochastic process generating the series. The classical methods of the X-11 (X11) family belong to this approach: from the first X11 and X11-ARIMA (X-11A) to the more recent X-12-ARIMA (X-12A) (Findley et al. 1998) and X-13-ARIMA-SEATS (X-13AS) (Findley, 2005), which include several improvements over the previous versions. The most notable is the use of reg-ARIMA models to pre-treat the data and to improve the forecasts of the series, which in turn improves the symmetric moving average filters employed and generally yields more stable estimated seasonal factors.
In both cases the data are pre-treated in order to select the decomposition scheme to be applied to the time series (additive, multiplicative, log-additive, etc.). Moreover, some deterministic effects such as outliers or calendar effects are removed. The pre-treated series is the input to the next step, whose output is the seasonally adjusted (SA) series. Once the seasonally adjusted series is obtained, a final step adds back some elements identified in the pre-treatment phase that relate to the trend-cycle component (such as level shifts) or to the irregular component (such as additive outliers or temporary changes), while the calendar effects and the seasonal outliers are left out of the final series.
The primary function of a public statistical system is to produce official statistics for its own country. In fact, Legislative Decree no. 322 of September 6, 1989, which established the National Statistical System (Sistan), states: “The official statistical information is provided to the country and to international organizations through the National Statistical System” (Article 1, paragraph 2), and “the data processed in the framework of statistical surveys included in the national statistics program are the heritage of the community and are distributed for purposes of study and research to those who require them under the rules of the present decree, subject to the prohibitions contained in art. 9” concerning statistical confidentiality (art. 10, paragraph 1). Legislative Decree no. 322/1989 also states that “the data collected in the context of statistical surveys included in the national Statistical Programme cannot be communicated or disseminated to any external, public or private entity, nor to any office of public administration except in aggregate form so as to be unable to derive any reference to identifiable individuals.” In any case, the data cannot be used in order to re-identify the parties concerned.
General principles concerning the protection of data confidentiality are established by Regulation (EU) 2016/679 of the European Parliament and of the Council. In the framework of European legislation, the Code of Ethics and Good Conduct attached to the Code on the protection of personal data (Legislative Decree no. 196 of 2003, amended by Legislative Decree no. 101 of 2018 containing provisions for the adaptation of national legislation to EU Regulation no. 679/2016) has been replaced by the Ethical Standards for processing related to statistical or scientific research purposes carried out within the National Statistical System (Gazzetta Ufficiale no. 11 of 14 January 2019). These Standards define the concept of identifiability of a statistical unit in terms of the possibility, through the use of reasonable means, of establishing a significantly likely relationship between a combination of values referring to a certain set of variables and the identity of a respondent. They also specify the means that can reasonably be used to identify respondents, such as economic resources, time, linkage with nominative archives or other sources, etc.
The translation of the concepts set out by the law into operational rules requires, from a statistical point of view, a preliminary identification of the statistical units subject to identification risk and thus a precise definition of what constitutes a breach of confidentiality. The subsequent quantification of the probability of disclosure makes it possible to define the most suitable techniques to ensure data protection. If an intruder is defined as a person who has an interest in breaching the confidentiality of the released data, an intrusion occurs when the intruder is able to match, with a certain degree of certainty, the released information to a respondent. The release of statistical information never involves the variables that uniquely identify a person (such as tax code, name or company name, address, etc.). The confidentiality problem arises for the variables that allow the intruder to narrow down a subset of units in pursuit of his or her aims. To limit the risk of disclosure, National Statistical Institutes may modify the data (for example by using perturbative techniques) or reduce the level of detail (e.g. municipality replaced by district or region). The application of protection techniques, both for the dissemination of tables and for the communication of elementary data, leads to a reduction or a change in the information content of the data.
The breach of confidentiality in the disseminated tables
Tabular data are the product most used by National Statistical Institutes for the dissemination of aggregate data. The concept of a breach of confidentiality does not depend on the type of product used for dissemination. The definition of “confidential” information also includes the data defined in Articles 9 and 10 of Regulation (EU) 2016/679 of the European Parliament and of the Council, while public variables are not considered confidential (public variables refer to information present in public registers, lists, records, documents or sources accessible to anyone). Before a table is released, an initial assessment of the information to be published is needed: if it is not confidential, no protection is applied. Data protection for tabular data involves several steps. The first stage defines which tables are to be processed and what their characteristics are. The second defines the risk rule, that is, the criterion for determining whether or not a cell is at risk of a breach of confidentiality. The final phase concerns the implementation of procedures for the protection of confidentiality.
The risk rules for magnitude tables are based either on the size of the cells (threshold or frequency rule) or on concentration measures (dominance rule and ratio rule). The threshold rule is widely used by Istat and states that a cell is sensitive if the number of units it contains is less than a value n (threshold) fixed a priori. In order to apply this rule to magnitude tables, the corresponding frequency tables are needed. There is no single criterion for identifying the threshold value, which depends on the assumed intrusion scenario and on the data processed.
According to the dominance rule [(n, k)-dominance], a cell is at risk if the n largest contributors hold a proportion of its total value higher than a threshold k% fixed a priori. The level of protection applied to the table depends on the two values n and k. Based on the statistical units involved and the desired level of security, the parameters can be defined by identifying a maximum allowable concentration.
The ratio rule (p-rule) relies on the accuracy with which the value of the largest contributor can be estimated, assuming that the second-largest contributor attempts the breach. The cell is considered at risk if the relative error of this estimate is less than a threshold p fixed a priori.
For tables that may contain data of opposite sign, the rules based on concentration measures become meaningless as they stand. They can still be applied, however, by using the absolute values of the contributions.
A breach of confidentiality is less likely to occur when negative magnitudes are possible. The general recommendation is to set the parameters of the risk functions to less stringent values than those adopted when all magnitudes are non-negative.
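The three risk rules for magnitude tables described above can be sketched as follows. The parameter values (a threshold of 3 units, n = 1 and k = 75 for dominance, p = 10 for the ratio rule) are hypothetical defaults chosen for illustration, not values prescribed by Istat. Absolute values are used so that the concentration rules remain applicable to contributions of opposite sign, and treating a cell with a single contributor as at risk under the ratio rule is an assumption of this sketch.

```python
def threshold_rule(n_units, threshold=3):
    """Cell is sensitive if it contains fewer units than the threshold."""
    return n_units < threshold


def dominance_rule(contributions, n=1, k=75.0):
    """(n, k)-dominance: the n largest contributors hold more than k% of the total."""
    values = sorted((abs(v) for v in contributions), reverse=True)
    total = sum(values)
    if total == 0:
        return False
    return sum(values[:n]) > k / 100.0 * total


def p_rule(contributions, p=10.0):
    """Ratio rule: the second-largest contributor can estimate the largest
    one with a relative error below p%.

    The estimate is (total - second largest); its error equals the sum of
    all remaining contributions.
    """
    values = sorted((abs(v) for v in contributions), reverse=True)
    if len(values) < 2:
        return len(values) == 1        # assumption: a lone contributor is exposed
    remainder = sum(values) - values[0] - values[1]
    return remainder < p / 100.0 * values[0]


# Hypothetical cell with four contributors.
cell = [80, 10, 5, 5]
at_risk = {
    "threshold": threshold_rule(len(cell)),
    "dominance": dominance_rule(cell),
    "p_rule": p_rule(cell),
}
```

In this invented cell the largest contributor holds 80% of the total, so the cell fails the (1, 75) dominance rule, while the ratio rule with p = 10 does not flag it because the remaining contributions (10) exceed 10% of the largest value (8). The example shows that the rules can disagree on the same cell.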
In the case of sample tables, the evaluation of the disclosure risk has to take the sampling plan into account. Each cell value is an estimate obtained by extrapolating the sample value to the reference population, because the true population value is not available. Cells whose units have survey weights greater than one carry a low risk of violation, and in this context a breach of confidentiality appears unlikely. However, a careful assessment of the disclosure risk is still necessary, because in some cases the most representative (dominant) units are included in the sample with certainty.
Frequency tables are mainly used to represent social phenomena and census data. The criterion adopted to determine whether or not a cell is at risk for this type of table is based on the size of the cells, since the risk rules based on measures of concentration cannot be applied.
Statistical protection of tables
After the cells at risk have been selected, the table must be modified so that its information content is anonymous. There are many data protection techniques, ranging from the merging of original cells, to methods based on modifications of the original data, to the introduction of missing values (suppressions). Perturbation techniques modify the data so that the original values cannot be traced back. In this case the structure of the table remains unchanged, but additivity between internal and marginal values is not always guaranteed. The methods most used by Istat are the modification of classification variables in order to merge the original cells, and the introduction of missing values.
The technique based on the introduction of missing values consists in suppressing the cells at risk (primary suppression). The distribution of the suppressed cells has to ensure that the table is adequately protected. When this is not the case, additional missing values (secondary suppressions) must be applied to cells that are not at risk. Several algorithms for selecting secondary suppressions have been proposed in the literature on Statistical Disclosure Control. Istat currently adopts the HiTas algorithm, available in generalized software such as Tau-ARGUS.
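The need for secondary suppression can be seen in a one-row sketch: a single suppressed cell is recovered exactly by subtracting the published cells from the published total. The heuristic used here to pick the secondary suppression (the smallest remaining cell) is a naive assumption made for illustration; it is not the HiTas algorithm, and real tables must be protected along every dimension at once.

```python
def publish(row, suppressed):
    """Replace suppressed cells with None (a missing value)."""
    return [None if i in suppressed else v for i, v in enumerate(row)]


def recover_single(published, total):
    """Return the value of the only missing cell, or None if not recoverable."""
    missing = [i for i, v in enumerate(published) if v is None]
    if len(missing) != 1:
        return None
    return total - sum(v for v in published if v is not None)


def add_secondary(row, primary):
    """Naive rule: if one cell is suppressed, also suppress the smallest other cell."""
    suppressed = set(primary)
    if len(suppressed) == 1:
        candidates = [i for i in range(len(row)) if i not in suppressed]
        suppressed.add(min(candidates, key=lambda i: row[i]))
    return suppressed


# Hypothetical table row with its published marginal total.
row = [2, 15, 40, 43]
total = sum(row)

primary = {0}                                   # cell 0 is at risk
leaked = recover_single(publish(row, primary), total)

protected = publish(row, add_secondary(row, primary))
still_leaked = recover_single(protected, total)
```

With only the primary suppression, `leaked` equals the confidential value of the first cell. After the secondary suppression the exact value can no longer be derived, although an intruder still learns the sum of the two suppressed cells, which is why production algorithms also check the width of the resulting interval.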
The breach of confidentiality in the release of elementary data
Elementary data can be defined as the end product of a statistical survey after the phases of design, implementation, monitoring and correction. In the dissemination phase, the elementary data form an archive of records, each containing all the validated information (usually a subset of that recorded) related to a statistical unit. As in the case of aggregate data disseminated through tables, these variables can be classified as variables usable for re-identification (hereinafter key variables) and confidential variables.
The methods of protection fall into three main categories:
- global recoding of variables reduces the detail of some variables (e.g. age in five-year classes rather than single years);
- local suppression of information erases features that put some records at risk of disclosure;
- data perturbation involves different methods to weaken the association between key and confidential variables.
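The first two categories above can be sketched on a toy microdata file. The record layout (`age`, `region`, `income`), the five-year class width and the rule that a combination of key variables occurring fewer than two times is at risk are all assumptions made for this illustration, not Istat criteria.

```python
from collections import Counter


def recode_age(age, width=5):
    """Global recoding: map an exact age to a class such as '30-34'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def global_recode(records, width=5):
    """Apply the age recoding to every record (copies, not in place)."""
    return [{**r, "age": recode_age(r["age"], width)} for r in records]


def local_suppress(records, keys, target, min_count=2):
    """Local suppression: blank `target` in records whose combination of
    key variables occurs fewer than `min_count` times."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [
        {**r, target: None} if counts[tuple(r[k] for k in keys)] < min_count else r
        for r in records
    ]


# Hypothetical microdata: age and region are key variables, income is confidential.
records = [
    {"age": 31, "region": "North", "income": 20},
    {"age": 34, "region": "North", "income": 25},
    {"age": 33, "region": "South", "income": 30},
]

recoded = global_recode(records)
released = local_suppress(recoded, keys=("age", "region"), target="region")
```

After recoding, all three records share the same age class, so the first two become indistinguishable on the key variables. The third record is still unique because of its region, and local suppression removes that single value rather than coarsening the variable for every record.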
Among the initiatives for the “protected” release of elementary data, Microdata Files for Research (MFR), public use files (mIcro.STAT) and the Laboratory for the Analysis of the ELEmentary data (ADELE) should be mentioned. MFR files for social or business statistical surveys are specifically designed for scientific research needs. The release of these files is subject to the fulfilment of certain requirements relating both to the organization that requests access to the protected microdata and to the characteristics of its research project. mIcro.STAT files are public use files, obtained from the respective MFR files, properly treated in terms of confidentiality and downloadable directly from the Istat website.
The ADELE Laboratory, active since 1999, is a Research Data Centre (RDC), that is, a “safe” place that researchers can access to carry out their own statistical analyses on elementary data produced by the National Statistical Institute, in compliance with the confidentiality principles. The ADELE Laboratory gives external “expert” users the ability to analyze the elementary data of Istat surveys by shifting the verification of confidentiality to the output of the analyses rather than their input (as in the case of files for research purposes and public use files). The protection of confidentiality for the statistical analyses carried out at the ADELE Laboratory is ensured in several ways:
- legally: the user signs a form committing to respect specific rules of conduct;
- physically, through the control of the working environment: the Laboratory is located at the Istat headquarters, and users cannot bring in input, take out output, or access the external network;
- statistically: all results are checked before release.