RELAIS (REcord Linkage At IStat) is a toolkit providing a set of techniques for dealing with record linkage projects.
The purpose of record linkage is to identify records that refer to the same real-world entity, which may be represented differently across data sources, even when unique identifiers are unavailable or affected by errors. In statistics, record linkage is needed for several applications, including: enriching the information stored in different data-sets; de-duplicating data-sets; improving the data quality of a source; estimating the size of a population by the capture-recapture method; and checking the confidentiality of public-use micro data. Record linkage can be seen as a complex process consisting of several phases that involve different knowledge areas; moreover, several different techniques can be adopted for each phase. We believe that the choice of the most appropriate technique depends not only on the practitioner’s skill but, above all, on the specific application.
Moreover, in some applications there is no evidence for preferring a given method over the others, or that different choices at some linkage stage would lead to the same results. This is why it is reasonable to select the most appropriate technique for each phase dynamically and to combine the selected techniques into a record linkage work-flow for a given application. RELAIS is a toolkit built on these ideas.
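To make the basic problem concrete, the following minimal sketch links two small sources that share no unique identifier and contain a spelling error. It is an illustration only and not RELAIS code: the sample records, the `similarity` and `link` functions, the 0.85 threshold, and the use of Python's `difflib` as a string comparator are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Two hypothetical sources describing people, with no shared identifier.
source_a = [
    {"name": "Maria Rossi", "birth_year": 1980},
    {"name": "Giovanni Bianchi", "birth_year": 1975},
]
source_b = [
    {"name": "Maria Rosi", "birth_year": 1980},  # surname contains a typo
    {"name": "Luca Verdi", "birth_year": 1990},
]


def similarity(a: str, b: str) -> float:
    """String comparator: similarity ratio in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(records_a, records_b, threshold=0.85):
    """Return index pairs whose birth years agree and whose names are similar."""
    links = []
    for i, ra in enumerate(records_a):
        for j, rb in enumerate(records_b):
            if ra["birth_year"] != rb["birth_year"]:
                continue  # disagreeing birth year: not a candidate pair
            if similarity(ra["name"], rb["name"]) >= threshold:
                links.append((i, j))
    return links


links = link(source_a, source_b)
print(links)
```

With an exact comparison of names, the two "Maria" records would not be linked; an approximate comparator tolerates the typo, which is the kind of choice that depends on the application and the data.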
The principal features of RELAIS are:
- It is designed and developed to allow the combination of different techniques for each of the record linkage phases, so that the resulting work-flow is built on the basis of application-specific and data-specific requirements.
- It has been developed as an open source project, so several solutions already available for record linkage in the scientific community can be easily re-used. It is released under the EUPL license (European Union Public License).
- It has been implemented by using two languages based on different paradigms: Java, an object-oriented language, and R, a functional language. This choice depends on our belief that a record linkage process is composed of techniques for manipulating data, for which Java is more appropriate, and of calculation-oriented techniques for which R is a preferable choice. The choice of Java and R is also in line with the open source philosophy of the RELAIS project.
- It has been implemented using a relational database architecture; in particular, it is based on a MySQL environment, which is also in line with the open source philosophy of the RELAIS project.
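The idea of combining one technique per phase into a work-flow can be sketched as follows. This is a hypothetical illustration in Python, not the RELAIS implementation (which uses Java and R): the three phases shown, the function names, and the sample records are assumptions chosen only to show that different combinations can yield different results.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


# Phase 1 alternatives: build the set of candidate pairs.
def cross_product(a: List[Record], b: List[Record]) -> List[Pair]:
    return [(ra, rb) for ra in a for rb in b]


def block_on_city(a: List[Record], b: List[Record]) -> List[Pair]:
    return [(ra, rb) for ra in a for rb in b if ra["city"] == rb["city"]]


# Phase 2 alternatives: score a candidate pair.
def exact_name(ra: Record, rb: Record) -> float:
    return 1.0 if ra["name"] == rb["name"] else 0.0


def case_insensitive_name(ra: Record, rb: Record) -> float:
    return 1.0 if ra["name"].lower() == rb["name"].lower() else 0.0


# Phase 3: turn a score into a match / non-match decision.
def threshold_decision(score: float, cutoff: float = 0.5) -> bool:
    return score >= cutoff


def run_workflow(
    a: List[Record],
    b: List[Record],
    reduce_phase: Callable[[List[Record], List[Record]], List[Pair]],
    compare_phase: Callable[[Record, Record], float],
    decide_phase: Callable[[float], bool],
) -> List[Pair]:
    """Chain the three selected techniques and return the matched pairs."""
    return [
        (ra, rb)
        for ra, rb in reduce_phase(a, b)
        if decide_phase(compare_phase(ra, rb))
    ]


a = [{"name": "Anna Neri", "city": "Roma"},
     {"name": "Paolo Conti", "city": "Milano"}]
b = [{"name": "anna neri", "city": "Roma"},
     {"name": "Paolo Conti", "city": "Torino"}]

strict = run_workflow(a, b, cross_product, exact_name, threshold_decision)
blocked = run_workflow(a, b, block_on_city, case_insensitive_name,
                       threshold_decision)
```

Here `strict` matches only the two "Paolo Conti" records, while `blocked` matches only the two "Anna Neri" records, so the choice made at each phase changes the outcome on the same data.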
The RELAIS project aims to make record linkage techniques easily accessible to non-expert users. To this end, the system has a GUI (Graphical User Interface) that, on the one hand, allows record linkage work-flows to be built with good flexibility and, on the other hand, checks the execution order of the provided techniques wherever precedence rules must be respected.
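A check of execution order against precedence rules could look like the sketch below. The step names and the rules in `PREREQUISITES` are hypothetical and are not taken from the RELAIS documentation; the sketch only shows the general idea of rejecting a work-flow whose steps are out of order.

```python
# Hypothetical precedence rules: each step lists the steps that must run first.
PREREQUISITES = {
    "read_data": set(),
    "blocking": {"read_data"},
    "comparison": {"blocking"},
    "decision": {"comparison"},
}


def check_order(steps):
    """Return a list of violations found in the proposed execution order."""
    done = set()
    problems = []
    for step in steps:
        if step not in PREREQUISITES:
            problems.append(f"unknown step: {step}")
            continue
        # Any prerequisite not yet executed is reported as a violation.
        for missing in sorted(PREREQUISITES[step] - done):
            problems.append(f"{step} requires {missing}")
        done.add(step)
    return problems


valid = check_order(["read_data", "blocking", "comparison", "decision"])
invalid = check_order(["read_data", "comparison", "blocking", "decision"])
print(valid)
print(invalid)
```

An empty list means the order respects every rule; otherwise each entry names the step that was scheduled too early and the step it depends on.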
GSBPM code: 5.1 Integrate data
Programming language: R, Java
Language of the GUI: EN
Keywords: data integration, probabilistic record linkage, string comparators, blocking/sorting/indexing, deduplication, open source software
Contact: Luca Valentino
REQUIREMENTS
Java SE Development Kit (version ≥ 13)
R (version ≥ 3.4.0)
R packages: ROI, ROI.plugin.clp, slam, RODBC
MySQL Server (version ≥ 5.0)
MySQL Connector/ODBC (version ≥ 5.0)
Copyright 2015 Istat
Licensed under the European Union Public Licence (EUPL), version 1.1 or subsequent. You may not use this work except in compliance with the Licence. You may obtain a copy of the Licence at: http://ec.europa.eu/idabc/eupl.html. Unless required by applicable law or agreed to in writing, software distributed under the Licence is distributed on an “AS IS” basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the Licence for the specific language governing permissions and limitations under the Licence.
Istat assumes no responsibility for results arising from any use of the tool that is inconsistent with the methodological guidance contained in the available documentation.
TECHNICAL AND METHODOLOGICAL DOCUMENTATION
Cibella N., G.L. Fernandez, M. Guigò, F. Hernandez, M. Scannapieco, L. Tosco, T. Tuoto. 2009. Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences. In Proceedings of New Techniques and Technologies for Statistics (NTTS) Conference, Eurostat, Brussels, 18-20 February 2009.
Eurostat. 2009. Theory and practice of developing a record linkage software. In “Insights on Data Integration Methodologies. ESSnet-ISAD workshop, Vienna, 29-30 May 2008”. Methodologies and working papers, Eurostat.
Cibella N., M. Fortini, M. Scannapieco, L. Tosco, T. Tuoto. 2007. RELAIS: Don’t Get Lost in a Record Linkage Project. In Proceedings of the FCSM 2007 Conference, Federal Committee on Statistical Methodology, Arlington, 5-7 November 2007.
Fortini M., P.D. Falorsi, C. Vaccari, N. Cibella, T. Tuoto, M. Scannapieco, L. Tosco. 2006. Towards an Open Source Toolkit for Building Record Linkage Workflows. In Proceedings of the International Workshop on Information Quality in Information Systems (IQIS), Chicago, 30 June 2006.