Monday, November 10, 2008

Cost-sensitive Decision Tree

Error rate evaluation is a key step of the induction process. A usual approach is to partition the dataset into a learning set, which is used to induce the classification model, and a test set, which is used to evaluate its performance.

The first subject of this tutorial is to show how to partition the dataset with SIPINA. We then build the tree on the first part of the dataset, classify the examples of the second part, and compare the predicted values with the true values. This gives an honest estimate of the error rate.
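Outside SIPINA, the same holdout procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `holdout_split` and `error_rate`, the 30% test fraction and the fixed seed are hypothetical choices, not settings taken from the tutorial.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (learning_set, test_set).

    test_fraction and seed are illustrative defaults, not SIPINA settings.
    """
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(rows)          # copy, so the caller's data is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(true_labels, predicted_labels):
    """Fraction of examples whose predicted label differs from the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute an error rate on an empty set")
    errors = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return errors / len(true_labels)


# Toy usage: split ten row identifiers, then score some made-up predictions.
learning_set, test_set = holdout_split(range(10))
toy_error = error_rate(["spam", "ham", "ham", "spam"],
                       ["spam", "spam", "ham", "spam"])
```

The estimate is honest only because the model never sees the test rows during induction; the error rate is computed on the test set alone.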

The second subject is to show how to take misclassification costs into account during both the learning process and the evaluation process. We use a slightly modified version of C4.5 (Quinlan, 1993).
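To make the idea concrete, here is a minimal Python sketch of one common way to use a cost matrix: predict the label with the lowest expected cost, then evaluate with the average misclassification cost and the F-measure. It is not necessarily how the modified C4.5 in SIPINA works; the function names and the cost values (a legitimate message flagged as spam costs 5, a missed spam costs 1) are assumptions chosen for illustration.

```python
def min_cost_label(class_probs, cost):
    """Return the label with the lowest expected misclassification cost.

    class_probs: {true_label: probability}, e.g. class frequencies in a leaf.
    cost: {(true_label, predicted_label): cost}; missing pairs cost 0.
    """
    def expected_cost(predicted):
        return sum(p * cost.get((true, predicted), 0.0)
                   for true, p in class_probs.items())
    # sorted() makes tie-breaking deterministic
    return min(sorted(class_probs), key=expected_cost)


def average_cost(true_labels, predicted_labels, cost):
    """Mean misclassification cost over a set of classified examples."""
    total = sum(cost.get((t, p), 0.0)
                for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


def f_measure(true_labels, predicted_labels, positive):
    """F1 score for the class `positive`: 2TP / (2TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical costs: flagging a legitimate message is 5 times worse
# than letting a spam through.
COST = {("ham", "spam"): 5.0, ("spam", "ham"): 1.0}

# A leaf with 70% spam: the majority vote says "spam", but the expected
# cost of predicting "spam" (0.3 * 5) exceeds that of "ham" (0.7 * 1).
leaf_label = min_cost_label({"spam": 0.7, "ham": 0.3}, COST)

truth = ["spam", "ham", "ham", "spam"]
preds = ["spam", "spam", "ham", "ham"]
mean_cost = average_cost(truth, preds, COST)
f1_spam = f_measure(truth, preds, positive="spam")
```

With asymmetric costs, the cheapest prediction can differ from the majority class, which is why the plain error rate is no longer the right criterion for comparing models.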

Keywords: decision trees, C4.5, classifier evaluation, cost-sensitive learning, F-Measure, spam detection
Tutorial: en_sipina_cost_sensitive.pdf
Dataset: spam.xls