Friday, August 11, 2017

Gradient boosting with R and Python

This tutorial follows the course material devoted to the “Gradient Boosting” to which we are referring constantly in this document. It also comes in addition to the supports and tutorials for Bagging, Random Forest and Boosting approaches (see References).

The thread will be basic: after importing the data which are split into two data files (learning and testing) in advance, we build predictive models and evaluate them. The test error rate criterion is used to compare performance of various classifiers.

The question of parameters, particularly sensitive in the context of the gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, if we guess about the paths to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and set the right values are difficult, especially because they (the various parameters) can interact with each other. Here, more than for other machine learning methods, the trial and error strategy takes a lot of importance.

We use R and Python with their appropriate packages.

Keywords: gradient boosting, R software, decision tree, adabag package, rpart, xgboost, gbm, mboost, Python, scikit-learn package, gridsearchcv, boosting, random forest
Tutorial: Gradient boosting
Programs and datasets: gradient_boosting.zip
References:
Tanagra tutorial, "Gradient boosting - Slides", June 2016.
Tanagra tutorial, "Bagging, Random Forest, Boosting - Slides", December 2015.
Tanagra tutorial, "Random Forest and Boosting with R and Python", December 2015.