SeleMix (Selective editing via Mixture models)


SeleMix is an R package for the treatment of quantitative data. It aims to identify the set of units affected by errors that could substantially influence the estimates of interest (selective editing).

The underlying methodology is based on a particular class of latent class models known in the literature as contamination models. Specifically, the “true” data (that is, data not affected by errors), possibly on a logarithmic scale, are assumed to be independent realizations of a multivariate Gaussian distribution whose mean vector can in turn be expressed as a linear combination of a set of error-free covariates. The “intermittent” nature of the error mechanism is captured by Bernoulli variables that indicate whether an error occurs on each unit. The error is assumed to be additive and Gaussian, with zero mean and a variance-covariance matrix proportional to the variance-covariance matrix of the error-free data.

Explicitly modeling both the distribution of the error-free data and the error mechanism makes it possible to derive the distribution of the true data conditional on the observed data. From this distribution, estimates of the true values, and therefore of the errors, are obtained. For each unit, a score is computed from the difference (possibly weighted by the sampling weight) between the predicted value and the observed value, and the units are then sorted in descending order of score. Assuming that the parameter of interest is a population mean or total, the units to be reviewed interactively are selected by considering the estimated error that would remain in the data after interactive editing. The number of units selected by this criterion depends on a user-specified threshold related to the required accuracy of the estimate of interest.
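The reasoning above can be illustrated with a minimal base R sketch. It does not use SeleMix itself: it assumes a univariate normal model without covariates and treats the model parameters as known, whereas the package estimates them. All names and numeric values (`mu`, `sigma2`, `w`, `lambda`, `t_sel`, and so on) are illustrative assumptions, and the selection rule shown is one simple reading of the criterion described above, not necessarily the exact rule implemented in the package.

```r
set.seed(1)

# Assumed (illustrative) parameters of a univariate contamination model
n      <- 1000   # number of units
mu     <- 10     # mean of the true data
sigma2 <- 4      # variance of the true data
w      <- 0.05   # probability that a unit is affected by an error
lambda <- 9      # error variance is lambda * sigma2

# Simulate true data, Bernoulli error indicators and observed data
y_true <- rnorm(n, mean = mu, sd = sqrt(sigma2))
is_err <- rbinom(n, size = 1, prob = w)
y_obs  <- y_true + is_err * rnorm(n, mean = 0, sd = sqrt(lambda * sigma2))

# Posterior probability of error given the observed value: the observed
# data follow a two-component normal mixture with a common mean and
# variances sigma2 (no error) and (1 + lambda) * sigma2 (error)
d_clean <- (1 - w) * dnorm(y_obs, mean = mu, sd = sqrt(sigma2))
d_err   <- w * dnorm(y_obs, mean = mu, sd = sqrt((1 + lambda) * sigma2))
tau     <- d_err / (d_clean + d_err)

# Expected true value given the observed value: equal to y_obs if there
# is no error, shrunk towards mu by a factor 1 / (1 + lambda) otherwise
y_if_err <- mu + (y_obs - mu) / (1 + lambda)
y_pred   <- (1 - tau) * y_obs + tau * y_if_err

# Score: weighted absolute difference between prediction and observation,
# relative to a reference estimate of the total
wgt   <- rep(1, n)                  # sampling weights (all equal to 1 here)
total <- sum(wgt * y_pred)
score <- wgt * abs(y_pred - y_obs) / abs(total)
ord   <- order(score, decreasing = TRUE)

# resid[j] is the estimated relative error left after editing the
# first j - 1 units in score order
resid <- sum(score) - c(0, cumsum(score[ord]))

# Select the smallest number of top-ranked units that brings the
# residual error below the accuracy threshold
t_sel    <- 0.01
k        <- which(resid < t_sel)[1] - 1
selected <- ord[seq_len(k)]
```

Here `selected` holds the indices of the units that would be sent to interactive review. Lowering `t_sel` asks for a more accurate estimate and therefore selects more units.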

In the following, the main functions of the package SeleMix are described:

ml.est: computes maximum likelihood estimates of the parameters of a contamination model via the ECM algorithm and provides the expected values of the “true” data for all units used in the estimation. For each unit it also returns the posterior probability that an error occurred and an outlier / non-outlier flag, assigned on the basis of a user-specified threshold on the error probability. It requires the type of model assumed for the true data (normal or lognormal) and some technical parameters for the ECM algorithm.
pred.y: given a set of contamination model parameters and a set of observed data, computes the expected values of the corresponding true data. Missing values are allowed in the response variables but not in the covariates.
sel.edit: performs selective editing. Given a set of observed data and the corresponding predictions of the true data, it selects the units that require interactive editing. Its inputs include the desired accuracy threshold and, if available, the sampling weights of the units. It returns the score of each unit and the corresponding rank.
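A minimal sketch of how the three functions might be chained is given below. The data are simulated, and the object names (`y`, `fit`, `sel`, `pred`) are hypothetical. The argument and component names used in the calls (`model`, `ypred`, `t.sel`, `B`, `sigma`, `lambda`, `w`) are assumptions based on the package reference manual and may differ between versions, so they should be checked against `?ml.est`, `?pred.y` and `?sel.edit` before use. The SeleMix calls are skipped if the package is not installed.

```r
set.seed(42)

# Hypothetical positive data: lognormal "true" values, with about 5% of
# the units affected by an additive Gaussian error on the log scale
n      <- 500
is_err <- rbinom(n, size = 1, prob = 0.05)
log_y  <- rnorm(n, mean = 5, sd = 0.5) + is_err * rnorm(n, mean = 0, sd = 1.5)
y      <- exp(log_y)

if (requireNamespace("SeleMix", quietly = TRUE)) {
  # 1. Estimate the contamination model (lognormal model for the true data)
  fit <- SeleMix::ml.est(y = y, model = "LN")

  # 2. Select the units for interactive editing, given an accuracy threshold
  sel <- SeleMix::sel.edit(y = y, ypred = fit$ypred, t.sel = 0.01)

  # 3. Predict true values from a given set of model parameters
  #    (here the parameters estimated in step 1, applied to the same data)
  pred <- SeleMix::pred.y(y = y, B = fit$B, sigma = fit$sigma,
                          lambda = fit$lambda, w = fit$w, model = "LN")
}
```

In practice step 3 is most useful when the parameters were estimated on one data set and predictions are needed for another, for example a later survey wave.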

Because the functions of the package can also be applied to incomplete data, it can serve as a tool for robust imputation of multivariate Gaussian data.

Status: validated

Author: Istat

Licence: EUPL-1.1

GSBPM code:

5.3 Review and validate
5.4 Edit and impute

Programming language: R

Keywords: latent class models, selective editing, influential error

Contact:

name: Tiziana Pichiorri
email: pichiorr@istat.it

COPYRIGHT

Licensed under the European Union Public Licence (EUPL), version 1.1 or subsequent. You may not use this work except in compliance with the Licence. You may obtain a copy of the Licence at: http://ec.europa.eu/idabc/eupl.html. Unless required by applicable law or agreed to in writing, software distributed under the Licence is distributed on an “AS IS” basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the Licence for the specific language governing permissions and limitations under the Licence.

DISCLAIMER

Istat assumes no responsibility for the results arising from use of the instrument that is inconsistent with the methodological guidance contained in the documentation available.

DOWNLOAD
Release date: 12/12/2013

SELEMIX Version 0.9.1 – Windows binaries

SELEMIX Version 0.9.1 – Package source

INSTALLATION
Install the downloaded package from within R as follows:
> install.packages(path_to_file, repos = NULL)
where path_to_file is a character string giving the path to the .zip or .tar.gz file you downloaded.

TECHNICAL AND METHODOLOGICAL DOCUMENTATION

Reference manual – SeleMix v. 0.9.1

Vignettes – SeleMix v. 0.9.1

OTHER DOCUMENTATION

Barcaroli, G., and D. Zardetto. 2012. “Use of R in Business Surveys at the Italian National Institute of Statistics: Experiences and Perspectives”. In Proceedings of the 4th International Conference on Establishment Surveys (ICES IV). American Statistical Association, Montréal, 11-14 June 2012.
