Wednesday, November 24, 2010

Multithreading for decision tree induction

Nowadays, much of modern personal computers (PC) have multicore processors. The computer operates as if it had multiple processors. Software and data mining algorithms must be modified in order to benefit of this new feature.

Currently, few free tools exploit this opportunity because it is impossible to define a generic approach that would be valid regardless of the learning method used. We must modify each existing learning algorithm. For a given technique, decomposing an algorithm into elementary tasks that can execute in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement.

In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited in our context for various reasons. (1) It is easy to program with the modern programming languages. (2) Threads can share information; they can also modify common objects. Efficient synchronization tools enable to avoid data corruption. (3) We can launch multiple threads on a mono-core and mono-processor system. It is not really advantageous, but at least the system does not crash. (4) On a multiprocessor or multi-core system, the threads will actually run at the same time, with each processor or core running a particular thread. But, because of the necessity of synchronization between threads, the computation time is not divided by the number of cores in this case.

First, we briefly present the modification of the decision tree learning algorithm in order to benefit of the multithreading technology. Then, we show how to implement the approach with SIPINA (version 3.5 and later). We show also that the multithreaded decision tree learners are available in various tools such as Knime 2.2.2 or RapidMiner 5.0.011. Last, we study the behavior of the multithreaded algorithms according to the dataset characteristics.

Keywords: multithreading, thread, threads, decision tree, chaid, sipina 3.5, knime 2.2.2, rapidminer 5.0.011
Tutorial: en_sipina_multithreading.pdf
Dataset: covtype.arff.zip
References :
Wikipedia, "Decision tree learning"
Wikipedia, "Thread (Computer science)"
Aldinucci, Ruggieri, Torquati, " Porting Decision Tree Algorithms to Multicore using FastFlow ", Pkdd-2010.