Saturday, December 20, 2008

K-Means - Classification of a new instance

The deployment is an important step of the Data Mining framework. In the case of a clustering, after the construction of clusters with a learning algorithm, we want to determine to which particular cluster (group) a new unlabelled instance belongs.

In this tutorial, we use the K-Means algorithm. We assign each new instance to the group which is closest using the distance to the center of groups. The method is fair because the technique used to assign a group in the deployment phase is consistent with the learning algorithm. It is not true if we use another learning algorithm e.g. HAC (hierarchical agglomerative clustering) with de single linkage aggregation rule. The distance to the center of groups is inadequate in this context. Thus, the classification strategy must be consistent with the learning strategy.

All the descriptors are discrete in our dataset. The K-Means algorithm does not handle directly this kind of data. We must transform them. We use a multiple correspondence analysis algorithm.

In this tutorial, we compare the results of Tanagra 1.4.28 and R 2.7.2.

Keywords: data clustering, k-means, multiple correspondence analysis, factorial analysis, clusters interpretation, data exportation
Components: MULTIPLE CORRESPONDENCE ANALYSIS, K-MEANS, GROUP CHARACTERIZATION, CONTINGENCY CHI-SQUARE, EXPORT DATASET
Tutorial: en_Tanagra_KMeans_Deploiement.pdf
Dataset: banque_classif_deploiement.zip
References:
Wikipedia (en), « K-Means algorithm ».
F. Husson, S. Lê, J. Josse, J. Mazet, « FactoMineR – A package dedicated to Factor Analysis and Data Mining with R ».