Monday, May 14, 2012

Tanagra - Version 1.4.44

LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Update of the LIBSVM library for support vector machine algorithms (version 3.12, April 2012) [C-SVC, Epsilon-SVR, nu-SVR]. The calculations are faster. The attributes can now be normalized or not; previously, they were always normalized automatically.
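The optional normalization mentioned above can be sketched as follows. This is not Tanagra's code, only an illustration of the kind of min-max scaling commonly applied to attributes before SVM training:

```python
# Minimal sketch (not Tanagra's implementation): min-max scaling of one
# attribute to [0, 1], the sort of normalization that was previously
# applied automatically and is now optional.
def min_max_normalize(column):
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant attribute: map to 0.0
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

print(min_max_normalize([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
```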

LIBCVM (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html; version 2.2). Incorporation of the LIBCVM library. Two methods are available: CVM (Core Vector Machine) and BVM (Ball Vector Machine). The descriptors can be normalized or not.

TR-IRLS (http://autonlab.org/autonweb/10538). Update of the TR-IRLS library for logistic regression on large datasets (large number of predictive attributes) [last available version: 2006/05/08]. The deviance is now automatically reported. The regression coefficients are displayed with more precision (more decimal places). The user can tune the learning algorithm, especially the stopping rules.
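For reference, the deviance reported for a logistic regression is simply minus twice the log-likelihood of the fitted probabilities. A minimal sketch (illustrative only, not TR-IRLS code):

```python
import math

# Illustrative computation of the deviance of a logistic regression:
# deviance = -2 * log-likelihood of the predicted probabilities.
def deviance(y_true, p_hat, eps=1e-12):
    """y_true contains 0/1 labels, p_hat the predicted P(y = 1)."""
    ll = sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
             for y, p in zip(y_true, p_hat))
    return -2.0 * ll

d = deviance([1, 0, 1], [0.9, 0.2, 0.8])
print(round(d, 4))   # 1.1033
```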

SPARSE DATA FILE. Tanagra can now handle the sparse data file format (see the SVMlight or libsvm file formats). The data can be used in a supervised learning or regression process. A description of this kind of file is available online (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html).
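In this sparse format, each row stores only the non-zero attributes as index:value pairs, preceded by the label. A small hypothetical parser (the function name and example line are invented for illustration):

```python
# Hypothetical sketch: parsing one line of the libsvm/SVMlight sparse
# format, where only non-zero features are stored as index:value pairs.
def parse_sparse_line(line, n_features):
    """Return (label, dense_vector) from one sparse-format line."""
    parts = line.strip().split()
    label = float(parts[0])
    x = [0.0] * n_features            # absent indices are implicit zeros
    for item in parts[1:]:
        idx, val = item.split(":")
        x[int(idx) - 1] = float(val)  # libsvm feature indices are 1-based
    return label, x

label, x = parse_sparse_line("+1 3:0.5 7:1.2", n_features=8)
# label == 1.0, x == [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 1.2, 0.0]
```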

INSTANCE SELECTION. A new component [SELECT FIRST EXAMPLES] selects the first m individuals among the n available in a branch of the diagram. This option is useful when the data file is the concatenation of the learning and test samples.
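The idea behind SELECT FIRST EXAMPLES can be sketched in a few lines (a hypothetical illustration, not the component's code): when a file concatenates the learning sample followed by the test sample, keeping the first m rows recovers the learning set.

```python
# Hypothetical sketch of the SELECT FIRST EXAMPLES idea: split a file
# built as [learning sample ; test sample] back into its two parts.
def split_first_examples(rows, m):
    """Return (learning_sample, test_sample) from the concatenated rows."""
    return rows[:m], rows[m:]

rows = [f"instance_{i}" for i in range(10)]   # n = 10 concatenated rows
train, test = split_first_examples(rows, m=7)
# len(train) == 7, len(test) == 3
```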

Download page: setup

Thursday, May 3, 2012

Using PDI-CE for model deployment (PMML)

Model deployment is a crucial task of the data mining process. In supervised learning, it can consist of applying the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). Their common feature is that the same tool is used for both model construction and model deployment.

In this tutorial, we describe a process where the model is built and deployed with different tools. This is only possible if (1) the model is described in a standard format, and (2) the tool used for deployment can handle both the database with unlabeled instances and the model. Here, we use the PMML standard to share the model, and PDI-CE (Pentaho Data Integration Community Edition) to apply it to the unseen cases.

We create a decision tree with various tools such as SIPINA, KNIME or RapidMiner; we export the model in the PMML format; then we use PDI-CE to apply the model to a data file containing unlabeled instances. We will see that using the PMML standard dramatically enhances the power of both the data mining tool and the ETL tool.
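To make the mechanism concrete, here is a minimal sketch of what a PMML scoring engine does. The TreeModel fragment below is hand-written and heavily simplified (real PMML files carry namespaces, a DataDictionary, etc.), and the field name `chest_pain` is invented for illustration, not taken from the actual exported model:

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML TreeModel fragment (simplified; not an
# actual export from SIPINA/KNIME/RapidMiner) used to illustrate how a
# scoring engine such as PDI-CE can apply a shared model.
PMML = """
<TreeModel>
  <Node score="absence">
    <True/>
    <Node score="presence">
      <SimplePredicate field="chest_pain" operator="equal" value="asympt"/>
    </Node>
  </Node>
</TreeModel>
"""

def score(node, record):
    """Descend into matching child nodes; the deepest match gives the score."""
    result = node.get("score")
    for child in node.findall("Node"):
        pred = child.find("SimplePredicate")
        if pred is not None and record.get(pred.get("field")) == pred.get("value"):
            result = score(child, record)
    return result

root = ET.fromstring(PMML).find("Node")
print(score(root, {"chest_pain": "asympt"}))   # presence
print(score(root, {"chest_pain": "atyp"}))     # absence
```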

In addition, we describe other deployment solutions in this tutorial. We will see that KNIME has its own PMML reader: it can apply a model to unlabeled datasets whatever tool was used to build the model, as long as the PMML standard is respected. In this sense, KNIME can be substituted for PDI-CE. Another possible solution is Weka, which is included in the Pentaho Community Edition suite and can export the model in a proprietary format that PDI-CE can handle.

Keywords: model deployment, predictive model, pmml, decision tree, rapidminer 5.0.10, weka 3.7.2, knime 2.1.1, sipina 3.4
Tutorial: en_Tanagra_PDI_Model_Deployment.pdf
Dataset: heart-pmml.zip
References:
Data Mining Group, "PMML standard"
Pentaho, "Pentaho Kettle Project"
Pentaho, "Using the Weka Scoring Plugin"