Monday, October 29, 2012

Handling missing values in prediction process

The treatment of missing values during the learning process has received a lot of attention from researchers. We have published a tutorial about it in the context of logistic regression induction. By contrast, the handling of missing values during the classification process, i.e. when we apply the classifier to an unlabeled instance, is less studied. Yet the problem is important. The model is designed to work only when the instance to label is fully described. If some values are not available, we cannot apply the model directly. We need a strategy to overcome this difficulty.

In this tutorial, we work in the supervised learning context. The classifier is a logistic regression model and all the descriptors are continuous. We evaluate, on various datasets from the UCI repository, the behavior of two imputation methods: the univariate approach and the multivariate approach. The constraint is that the imputation models must rely only on information from the learning sample, which we assume contains no missing values.

We note that, in our experiments, the missing values on the instances to classify occur "completely at random" (MCAR), i.e. each descriptor has the same probability of being missing.
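As a rough illustration, here is a minimal R sketch of the two strategies (not the code distributed in md_logistic_reg_deployment.zip): the classifier is fitted with glm on a complete training sample, and a missing descriptor of the instance to classify is filled either with its mean in the training sample (univariate) or with a multiple linear regression fitted with lm on the other descriptors (multivariate). The data frame and variable names (train, y, X1, X2, X3) are hypothetical.

## univariate vs. multivariate imputation at prediction time (sketch)
## "train" is a complete learning sample; y is the binary class attribute

## learn the classifier on the complete training sample
model <- glm(y ~ X1 + X2 + X3, data = train, family = binomial)

## an unlabeled instance with a missing descriptor (X2 is NA)
newcase <- data.frame(X1 = 1.5, X2 = NA, X3 = 0.7)

## (1) univariate imputation: replace X2 with its mean in the training sample
case.uni <- newcase
case.uni$X2 <- mean(train$X2)

## (2) multivariate imputation: predict X2 from the other descriptors
##     with a multiple linear regression fitted on the training sample
reg <- lm(X2 ~ X1 + X3, data = train)
case.multi <- newcase
case.multi$X2 <- predict(reg, newdata = newcase)

## apply the classifier on the completed instance
predict(model, newdata = case.uni, type = "response")
predict(model, newdata = case.multi, type = "response")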

Keywords: missing values, missing features, classification model, logistic regression, multiple linear regression, R software, glm, lm, NA
Components: Binary Logistic Regression
Tutorial: en_Tanagra_Missing_Values_Deployment.pdf
Dataset and programs (R language): md_logistic_reg_deployment.zip
References:
Howell, D.C., "Treatment of Missing Data".
Saar-Tsechansky, M., Provost, F. (2007), "Handling Missing Values when Applying Classification Models", Journal of Machine Learning Research, 8, pp. 1625-1657.

Sunday, October 14, 2012

Handling Missing Values in Logistic Regression

The handling of missing data is a difficult problem, not because of its management, which is simple (we just record the missing value with a specific code), but because of the consequences of its treatment on the characteristics of the models learned from the treated data.

We have already analyzed this problem in a previous paper, where we studied the impact of the various missing value treatment techniques on a decision tree learning algorithm (C4.5). In this paper, we repeat the analysis by examining their influence on the results of logistic regression. We consider the following configuration: (1) the missing values are MCAR: we wrote a program which randomly removes some values from the learning sample; (2) we apply logistic regression on the pre-treated training data, i.e. on a dataset to which a missing value processing technique has been applied; (3) we evaluate the different treatment techniques by observing the accuracy rate of the classifier on a separate test sample which contains no missing values.
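The value-removal program used in the experiments is provided in md_experiments.zip; as an idea of the mechanism, a MCAR deletion might look like the following R sketch, where each descriptor value of a hypothetical data frame train (class attribute in the last column) is removed with the same probability.

## MCAR deletion sketch: each descriptor cell is removed with probability p.missing
set.seed(1)
p.missing <- 0.10   # hypothetical missing rate

train.md <- train
for (j in 1:(ncol(train.md) - 1)) {           # do not touch the class column
  drop <- runif(nrow(train.md)) < p.missing   # same probability for every cell
  train.md[drop, j] <- NA
}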

First, we conduct the experiments with R. We compare the listwise deletion approach with univariate imputation (the mean for quantitative variables, the mode for categorical ones). We will see that the latter is a very viable approach in the MCAR situation. Then, we study the tools available in Orange, Knime and RapidMiner. We will observe that, despite their sophistication, they are not better than univariate imputation in our context.
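The scripts actually used are in md_experiments.zip; the following R sketch only outlines the two treatments compared, assuming hypothetical data frames train.md (with missing descriptor values) and test (complete), and a binary factor class attribute y. Listwise deletion is obtained through glm's na.action, and the univariate imputation replaces each NA by the mean or the mode computed on the learning sample.

## (1) listwise deletion: glm drops incomplete rows (na.action = na.omit)
m.lw <- glm(y ~ ., data = train.md, family = binomial, na.action = na.omit)

## (2) univariate imputation: mean for numeric columns, mode for factors
impute <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  else x[is.na(x)] <- names(which.max(table(x)))
  x
}
train.imp <- as.data.frame(lapply(train.md, impute))
m.imp <- glm(y ~ ., data = train.imp, family = binomial)

## accuracy on the complete test sample (y assumed to be a binary factor)
acc <- function(m) {
  pred <- ifelse(predict(m, newdata = test, type = "response") > 0.5,
                 levels(test$y)[2], levels(test$y)[1])
  mean(pred == test$y)
}
c(listwise = acc(m.lw), imputation = acc(m.imp))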

Keywords: missing value, missing data, logistic regression, listwise deletion, casewise deletion, univariate imputation, R software, glm
Tutorial: en_Tanagra_Missing_Values_Imputation.pdf
Dataset and programs: md_experiments.zip
References:
Howell, D.C., "Treatment of Missing Data".
Allison, P.D. (2001), "Missing Data". Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage.
Little, R.J.A., Rubin, D.B. (2002), "Statistical Analysis with Missing Data", 2nd Edition, New York: John Wiley.