Implementation of all decision tree algorithms with a single framework

Different decision tree algorithms are suited to different use cases, and several of them can be applied to the same problem: all we have to do is swap the algorithm and compare the results. In this article, we cover an approach through which we can quickly run all the major decision tree algorithms with the same framework and easily compare their performance. We will be using ChefBoost, a lightweight decision tree framework that lets us implement these algorithms in a few lines of code. The main points covered in this article are listed below.

Contents

  1. Brief on Decision Trees
  2. Different Decision Tree Algorithms
  3. About the ChefBoost framework
  4. Implementation of all decision tree algorithms with ChefBoost

Let’s start by understanding decision trees in a nutshell.

Brief on Decision Trees

Many algorithms can be used for regression and classification analysis, and the decision tree is one of them. As the name suggests, the algorithm builds a tree of conditions and walks down it to reach a decision. Such trees are particularly useful when the data has non-linear relationships between the features and the target. The image below is a representation of how a decision tree works.

In the image above, a student has to decide whether or not to go to school based on two conditions. If the student thinks he does not have covid, the next question is whether going to school carries a risk of catching it, and that condition also has to be checked before the decision is made. Each block in the tree is a node, and splitting these nodes is what allows a decision to be reached. This is a very simple example, and one can easily tell what decision the student should make under each condition.
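The same decision process can be written as nested conditions. Below is a minimal Python sketch of such a tree; the condition names are invented purely for illustration and are not part of any library.

def should_go_to_school(has_covid_symptoms, covid_risk_at_school):
    # Root node: the first condition the student checks
    if has_covid_symptoms:
        return "Stay home"
    # Internal node: the second condition that must be confirmed
    if covid_risk_at_school:
        return "Stay home"
    # Leaf node: both checks passed
    return "Go to school"

print(should_go_to_school(False, True))   # Stay home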

The amount of data greatly affects the decisions made by the algorithm. To work with real datasets, there are several concepts we need to understand about decision trees, such as information gain, entropy and the number of splits. One of our earlier articles covers most of this information on decision trees. Decision trees can also be seen as the building block of forest-based algorithms such as random forests. Finally, the decision tree is not a single algorithm; it comes in several versions.
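To make terms like entropy and information gain concrete, here is a small self-contained Python sketch (independent of any decision tree library) that computes them for a toy binary split; the labels are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy reduction achieved by splitting the parent node into two children
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
left = ["yes", "yes", "yes", "yes"]    # e.g. days without covid symptoms
right = ["no", "no", "no", "no"]       # e.g. days with covid symptoms
print(information_gain(parent, left, right))   # 1.0, a perfect split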


Different Decision Tree Algorithms

All decision tree algorithms are based on splitting the data on the features that yield the highest information gain (or a similar purity measure), and each newer version improves on the previous one in some way. There are five popular versions of decision tree algorithms:

  • ID3: This can be considered the baseline version of the decision tree; its name is an acronym for Iterative Dichotomiser, where 'dichotomise' means splitting a condition into two opposite decisions. The whole procedure can be seen as iteratively dividing the conditions into decisions to build the tree. At each split, the algorithm calculates measures such as entropy and information gain, and the most informative split is chosen, so the dominant decision appears as the final result.
  • C4.5: One thing we noted about the ID3 algorithm is that it mainly finds the dominant decision for categorical conditions, but it has no support for continuous data. C4.5 comes into the picture to deal with this: using this version we can also handle continuous features and missing values.
  • CART: As data science practitioners, we have all heard of the Gini index in decision trees and random forests. This index was first introduced with the CART version of the decision tree algorithm, which uses it in place of information gain to measure the quality of a split. The Gini index is obtained by subtracting the sum of the squared class probabilities from one (see the short sketch after this list). This version is also able to handle both types of problems (classification and regression).
  • CHAID: This can be considered one of the oldest decision tree algorithms, and it finds the features with the highest statistical significance. CHAID stands for Chi-square Automatic Interaction Detection, which is easy to remember because the chi-square test is used to determine the statistical significance of features. This version is mainly used for classification problems.
  • Regression trees: This version is primarily designed to perform regression. The versions above are good at classification, but none of them was specially designed for regression. The approach of this tree is to treat a continuous feature as a categorical one by splitting it at a threshold, and the split is placed where the feature yields the largest improvement, analogous to the highest information gain.
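To illustrate the Gini index mentioned in the CART point above, here is a short standalone Python sketch (again independent of any library); the example labels are made up.

from collections import Counter

def gini_index(labels):
    # One minus the sum of squared class probabilities
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini_index(["yes", "yes", "yes", "yes"]))   # 0.0, a pure node
print(gini_index(["yes", "yes", "no", "no"]))     # 0.5, a maximally mixed binary node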

Here we have discussed the different versions of decision trees. It is also worth noting that advanced techniques such as Gradient Boosting, Random Forest and AdaBoost build on decision trees to achieve better results. Let's now look at how we can implement decision trees.

About the ChefBoost framework

While researching this article, we found that ChefBoost is a Python package that provides functions for implementing all the decision tree versions above as well as advanced techniques. One remarkable thing about the package is that we can build any of these versions using just a few lines of code.

This package accepts data as a pandas dataframe, which makes the process easier for anyone who already uses pandas for data preprocessing. The functions are designed so that by just passing the pandas dataframe and the decision tree type, we can build a decision tree model. The naming conventions for the trees are also simple and match the names used in the section above. For example, to implement ID3, we just pass 'ID3' in the configuration.
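As a tiny illustration of this configuration pattern (the installation and full usage follow in the next sections), the algorithm is chosen through a plain dictionary; the other names follow the same convention as the section above.

config = {'algorithm': 'ID3'}   # or 'C4.5', 'CART', 'CHAID', 'Regression'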

The implementation that we are going to perform in this article will show how this package is used. We can install the package using the following line of code.

!pip install chefboost

Implementation of all decision tree algorithms with ChefBoost

There are several ways to implement a decision tree; here, we will see how to do it using this package.

Let's see how we can build decision trees. We will start by importing the golf dataset, which can be found here.

import pandas as pd

# Load the golf dataset directly from the ChefBoost repository
data = pd.read_csv('https://raw.githubusercontent.com/serengil/chefboost/master/tests/dataset/golf.txt')
data.head()

Output:

Here we can see that there are 4 feature conditions from which the algorithm can tell whether to play golf or not. Let's build the model.

from chefboost import Chefboost as chef

config = {'algorithm': 'C4.5'}
model = chef.fit(data, config = config, target_label="Decision")

Output:

Here we can see the details we obtained by modeling the C4.5 version of the decision tree. Let’s make a prediction.

prediction = chef.predict(model, param = data.iloc[0])
prediction

Output:

In the code above, we set the algorithm to the C4.5 version of the decision tree. We can simply change this according to our needs. Let's try another version to verify this.

config = {'algorithm': 'ID3'}
model = chef.fit(data, config = config, target_label="Decision")

Output:

Here in the output we can see that the ID3 version is built this time. I have implemented all the decision tree algorithms mentioned above in this link, which can be browsed for more ideas.
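As a minimal sketch of how the remaining versions could be run and compared with the same framework (assuming the data and the chefboost import from above; the 'Regression' algorithm expects a numeric target, so it is left out, and the chef.predict call follows the same pattern as the single prediction earlier), we can simply loop over the algorithm names.

for algorithm in ["ID3", "C4.5", "CART", "CHAID"]:
    config = {"algorithm": algorithm}
    # ChefBoost may modify the dataframe during training, so pass a copy
    model = chef.fit(data.copy(), config = config, target_label = "Decision")
    # Training accuracy, using the same predict call pattern as above
    correct = sum(
        chef.predict(model, param = data.iloc[i]) == data.iloc[i]["Decision"]
        for i in range(len(data))
    )
    print(algorithm, "training accuracy:", correct / len(data))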

Last words

In this article, we have discussed the decision tree, a supervised machine learning algorithm that can be used for regression and classification problems and that comes in several versions. Along with this, we looked at how to implement all versions of decision trees using a single framework named ChefBoost.
