In this tutorial, I am going to implement algorithms for decision tree classification. I am going to train a simple decision tree and two decision tree ensembles (RandomForest and XGBoost), and these models will be compared with 10-fold cross-validation. I am using the Titanic data set from Kaggle, which will be preprocessed and visualized before it is used for training.
Decision tree algorithms were among the first solutions used to aid decision support systems (expert systems). A decision tree is constructed as a number of if-then rules that build a hierarchical tree, shaped more like a pyramid. A decision tree is created with recursive binary splitting, from the root node down to the final predictions in the leaf nodes. We want to have the most important features at the top of the tree, as this makes it faster to reach a satisfactory result.
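To make the splitting criterion concrete, here is a minimal sketch (my own illustration, not part of the tutorial code) of how a CART-style tree scores a candidate binary split with Gini impurity, the criterion used later in this tutorial:

# A minimal sketch of how a CART-style tree scores a candidate split
# (illustration only, not part of the tutorial code)
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, threshold):
    # Weighted Gini impurity of the two child nodes after a binary split,
    # lower is better
    left = labels[feature < threshold]
    right = labels[feature >= threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Tiny example: Sex (0 = female, 1 = male) and survival labels
sex = np.array([0, 0, 0, 1, 1, 1, 1, 1])
survived = np.array([1, 1, 0, 0, 0, 0, 1, 0])
print('{0:.3f}'.format(split_score(sex, survived, 0.5)))

The tree builder evaluates a score like this for every candidate feature and threshold, picks the split with the lowest weighted impurity, and then recurses on each child node.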
Decision trees are easy to understand and explain, and they can be used both for binary classification problems and for multiclass problems. Decision trees can be biased if the data set is not balanced, and they can be unstable, as small variations in the input data might generate completely different trees.
Decision tree ensemble methods combine multiple decision trees to improve prediction performance. Decision tree ensemble methods can implement bagging or boosting. Bagging means that multiple trees are created on subsets of the input data, and the result of such a model is the combined (averaged or voted) prediction over all trees. Boosting is a technique where trees are created sequentially, and each new tree tries to minimize the loss/error of the previous trees. Random Forest is an example of an ensemble method that uses bagging, and XGBoost is an example of an ensemble method that uses boosting.
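As a quick illustration of the difference (a hedged sketch using scikit-learn's generic ensemble wrappers, not the models trained later in this tutorial):

# A minimal sketch contrasting bagging and boosting (illustration only)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Generate a small synthetic classification problem
X, y = make_classification(n_samples=500, random_state=0)

# Bagging: independent trees on bootstrap samples, their predictions are
# combined by voting (BaggingClassifier wraps a decision tree by default)
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: trees are built sequentially, each focusing on the errors of
# the previous ones (AdaBoostClassifier wraps decision stumps by default)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print('Bagging: {0:.2f}'.format(cross_val_score(bagging, X, y, cv=5).mean()))
print('Boosting: {0:.2f}'.format(cross_val_score(boosting, X, y, cv=5).mean()))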
Data set and libraries
I am going to use the Titanic dataset (download it) from kaggle.com; you need to register to be able to download the data set. The data set consists of a training set and a test set, where the test set is used if you want to make a submission. The data set includes data about passengers on Titanic and a boolean target value that indicates if the passenger survived or not. I am using the following libraries: pandas, joblib, numpy, matplotlib, csv, xgboost, graphviz and scikit-learn.
Data preparation
You can open the train.csv file with Excel or OpenOffice Calc, or investigate it on Kaggle. Some columns in the data set include a lot of unique values, like PassengerId, Name, Age, Ticket, Fare and Cabin. Columns with a lot of unique values might be removed or reconstructed. I decided to remove PassengerId, Name and Ticket, and Cabin is reconstructed to indicate if the passenger has a cabin or not. You might be able to improve the accuracy by reconstructing Age and Fare. Some of the columns include null (NaN) values, and string values need to be converted to numbers. The following method in a module called common (common.py) is used to prepare the data set.
# Preprocess data
def preprocess_data(ds):

    # Get passenger ids (should not be part of the data set)
    ids = ds['PassengerId']

    # Set cabin to a boolean value (0: no cabin, 1: has a cabin)
    cabins = ds['Cabin'].copy()
    for i in range(len(cabins)):
        # Missing values (NaN) are of type float
        if type(cabins.loc[i]) == float:
            cabins.loc[i] = 0
        else:
            cabins.loc[i] = 1

    # Update the cabin column in the data set
    ds['Cabin'] = cabins.astype(int)

    # Remove null (NaN) values from the data set
    median_fare = ds['Fare'].median()
    mean_age = ds['Age'].mean()
    ds['Fare'] = ds['Fare'].fillna(median_fare)
    ds['Age'] = ds['Age'].fillna(mean_age)
    ds['Embarked'] = ds['Embarked'].fillna('S')

    # Map string values to numbers (to be able to train and test models)
    ds['Sex'] = ds['Sex'].map({'female': 0, 'male': 1})
    ds['Embarked'] = ds['Embarked'].map({'Q': 0, 'C': 1, 'S': 2})

    # Drop columns
    ds = ds.drop(columns=['PassengerId', 'Name', 'Ticket'])

    # Return ids and data set
    return ids, ds
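As a quick sanity check (a hedged sketch, assuming the same file location that is used later in this tutorial), you can verify that the preprocessing removed all null values and mapped all strings to numbers:

# Sanity check for preprocess_data (illustration; the file path is an assumption)
import pandas
import annytab.decision_trees.common as common

ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\train.csv')
ids, ds = common.preprocess_data(ds)
print(ds.isnull().sum()) # All counts should be 0
print(ds.dtypes)         # All columns should now be numeric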
Visualize data set
The following module is used to visualize the data set. The output from the visualization process is shown below the code.
# Import libraries
import pandas
import joblib
import math
import numpy as np
import matplotlib.pyplot as plt
import annytab.decision_trees.common as common

# Visualize data set
def visualize_dataset(ds):

    # Print first 10 rows in data set
    print('--- First 10 rows ---\n')
    #pandas.set_option('display.max_columns', 12)
    print(ds[0:10])

    # Print the shape
    print('\n--- Shape of data set ---\n')
    print(ds.shape)

    # Print class distribution
    print('\n--- Class distribution ---\n')
    print(ds.groupby('Survived').size())

    # Group data set
    survivors = ds[ds.Survived == True]
    non_survivors = ds[ds.Survived == False]

    # Create a figure
    figure = plt.figure(figsize = (12, 8))
    figure.suptitle('Survivors and Non-survivors on Titanic', fontsize=16)

    # Create a default grid
    plt.rc('axes', facecolor='#ececec', edgecolor='none', axisbelow=True, grid=True)
    plt.rc('grid', color='w', linestyle='solid')

    # Add spacing between subplots
    plt.subplots_adjust(top=0.9, bottom=0.1, hspace=0.3, wspace=0.4)

    # Plot by Pclass (1)
    plt.subplot(2, 4, 1) # 2 rows and 4 columns
    survivors_data = survivors.groupby('Pclass').size().values
    non_survivors_data = non_survivors.groupby('Pclass').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1,2], [1, 2, 3])
    plt.ylabel('Count')
    plt.title('Pclass')
    plt.legend(loc='upper left')

    # Plot by Gender (2)
    plt.subplot(2, 4, 2) # 2 rows and 4 columns
    survivors_data = survivors.groupby('Sex').size().values
    non_survivors_data = non_survivors.groupby('Sex').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1], ['Female', 'Male'])
    plt.ylabel('Count')
    plt.title('Gender')
    plt.legend(loc='upper left')

    # Plot by Age (3)
    plt.subplot(2, 4, 3) # 2 rows and 4 columns
    survivors_data = survivors.groupby(['AgeGroup']).size().values
    non_survivors_data = non_survivors.groupby(['AgeGroup']).size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1,2,3,4,5,6,7], ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79'], rotation=40, horizontalalignment='right')
    plt.ylabel('Count')
    plt.title('Age')
    plt.legend(loc='upper left')

    # Plot by SibSp (4)
    plt.subplot(2, 4, 4) # 2 rows and 4 columns
    survivors_data = np.append(survivors.groupby('SibSp').size().values, np.array([0,0])) # Make sure that arrays have the same length
    non_survivors_data = non_survivors.groupby('SibSp').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.ylabel('Count')
    plt.title('Number of siblings/spouses')
    plt.legend(loc='upper left')

    # Plot by Parch (5)
    plt.subplot(2, 4, 5) # 2 rows and 4 columns
    survivors_data = np.append(survivors.groupby('Parch').size().values, np.array([0,0])) # Make sure that arrays have the same length
    non_survivors_data = non_survivors.groupby('Parch').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.ylabel('Count')
    plt.title('Number of parents/children')
    plt.legend(loc='upper left')

    # Plot by Fare (6)
    plt.subplot(2, 4, 6) # 2 rows and 4 columns
    survivors_data = survivors.groupby(['FareGroup']).size().values
    non_survivors_data = non_survivors.groupby(['FareGroup']).size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1,2,3,4,5], ['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'], rotation=40, horizontalalignment='right')
    plt.ylabel('Count')
    plt.title('Fare')
    plt.legend(loc='upper left')

    # Plot by Cabin (7)
    plt.subplot(2, 4, 7) # 2 rows and 4 columns
    survivors_data = survivors.groupby('Cabin').size().values
    non_survivors_data = non_survivors.groupby('Cabin').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1], ['No', 'Yes'])
    plt.ylabel('Count')
    plt.title('Cabin')
    plt.legend(loc='upper left')

    # Plot by Embarked (8)
    plt.subplot(2, 4, 8) # 2 rows and 4 columns
    survivors_data = survivors.groupby('Embarked').size().values
    non_survivors_data = non_survivors.groupby('Embarked').size().values
    plt.bar(range(len(survivors_data)), survivors_data, label='Survivors', alpha=0.5, color='g')
    plt.bar(range(len(non_survivors_data)), non_survivors_data, bottom=survivors_data, label='Non-Survivors', alpha=0.5, color='r')
    plt.xticks([0,1,2], ['Q', 'C', 'S'])
    plt.ylabel('Count')
    plt.title('Embarked')
    plt.legend(loc='upper left')

    # Show or save the figure
    #plt.show()
    plt.savefig('C:\\DATA\\Python-data\\titanic\\plots\\bar-charts.png')

# The main entry point for this module
def main():

    # Load data set (includes header values)
    ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\train.csv')

    # Preprocess data
    ids, ds = common.preprocess_data(ds)

    # Create age groups
    ds['AgeGroup'] = pandas.cut(ds.Age, range(0, 81, 10), right=False, labels=['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79'])

    # Create fare groups
    ds['FareGroup'] = pandas.cut(ds.Fare, range(0, 601, 100), right=False, labels=['0-99', '100-199', '200-299', '300-399', '400-499', '500-599'])

    # Visualize data set
    visualize_dataset(ds)

# Tell python to run the main method
if __name__ == "__main__": main()
--- First 10 rows ---
Survived Pclass Sex Age ... Cabin Embarked AgeGroup FareGroup
0 0 3 1 22.000000 ... 0 2 20-29 0-99
1 1 1 0 38.000000 ... 1 1 30-39 0-99
2 1 3 0 26.000000 ... 0 2 20-29 0-99
3 1 1 0 35.000000 ... 1 2 30-39 0-99
4 0 3 1 35.000000 ... 0 2 30-39 0-99
5 0 3 1 29.699118 ... 0 0 20-29 0-99
6 0 1 1 54.000000 ... 1 2 50-59 0-99
7 0 3 1 2.000000 ... 0 2 0-9 0-99
8 1 3 0 27.000000 ... 0 2 20-29 0-99
9 1 2 0 14.000000 ... 0 1 10-19 0-99
[10 rows x 11 columns]
--- Shape of data set ---
(891, 11)
--- Class distribution ---
Survived
0 549
1 342
dtype: int64
Baseline performance
The data set is not perfectly balanced: there are 549 non-survivors and 342 survivors. A possible measure to get better results is to create a better balance in the data set. A model that always predicts non-survivor would be correct in 61.62 % (549/891) of the cases, so our models must perform better than this baseline.
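You can verify this baseline with scikit-learn's DummyClassifier, shown here as a minimal sketch that assumes the preprocessing from earlier in this tutorial:

# Baseline check with a majority-class classifier (illustration only)
import pandas
import sklearn.dummy
import sklearn.model_selection
import annytab.decision_trees.common as common

ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\train.csv')
ids, ds = common.preprocess_data(ds)
X = ds.values[:,1:9] # Data
Y = ds.values[:,0] # Survived

# A model that always predicts the most frequent class (non-survivor)
baseline = sklearn.dummy.DummyClassifier(strategy='most_frequent')
scores = sklearn.model_selection.cross_val_score(baseline, X, Y, cv=10)
print('Baseline accuracy: {0:.2f}'.format(scores.mean() * 100.0)) # About 61.62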
Python module
The following module is used for training, evaluation and submission. I am using three tree models, each of which has a lot of hyperparameters that can be adjusted. All of the project files are stored in annytab/decision_trees, and the namespace for our common module is therefore annytab.decision_trees.
# Import libraries
import pandas
import joblib
import csv
import numpy as np
import sklearn.model_selection
import sklearn.tree
import sklearn.ensemble
import sklearn.metrics
import xgboost
import graphviz
import matplotlib.pyplot as plt
import annytab.decision_trees.common as common

# Train and evaluate
def train_and_evaluate():

    # Load train data set (includes header values)
    ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\train.csv')

    # Preprocess data
    ids, ds = common.preprocess_data(ds)

    # Slice data set in values and target (2D-array)
    X = ds.values[:,1:9] # Data
    Y = ds.values[:,0] # Survived

    # Create models
    models = []
    models.append(('DecisionTree', sklearn.tree.DecisionTreeClassifier(
        criterion='gini', splitter='best', max_depth=None, min_samples_split=5,
        min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
        random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
        class_weight=None)))
    models.append(('RandomForest', sklearn.ensemble.RandomForestClassifier(
        n_estimators=100, criterion='gini', max_depth=None, min_samples_split=5,
        min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',
        max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True,
        oob_score=False, n_jobs=None, random_state=None, verbose=0,
        warm_start=False, class_weight=None)))
    models.append(('XGBoost', xgboost.XGBClassifier(
        booster='gbtree', max_depth=6, min_child_weight=1, learning_rate=0.1,
        n_estimators=500, verbosity=0, objective='binary:logistic', gamma=0,
        max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
        reg_alpha=0, reg_lambda=0, scale_pos_weight=1, random_state=0)))

    # Loop models
    for name, model in models:

        # Train the model on the whole data set
        model.fit(X, Y)

        # Save the model (make sure that the folder exists)
        joblib.dump(model, 'C:\\DATA\\Python-data\\titanic\\models\\' + name + '.jbl')

        # Evaluate on training data
        print('\n--- ' + name + ' ---')
        print('\nTraining data')
        predictions = model.predict(X)
        accuracy = sklearn.metrics.accuracy_score(Y, predictions)
        print('Accuracy: {0:.2f}'.format(accuracy * 100.0))
        print('Classification Report:')
        print(sklearn.metrics.classification_report(Y, predictions))
        print('Confusion Matrix:')
        print(sklearn.metrics.confusion_matrix(Y, predictions))

        # Evaluate with 10-fold CV
        print('\n10-fold CV')
        predictions = sklearn.model_selection.cross_val_predict(model, X, Y, cv=10)
        accuracy = sklearn.metrics.accuracy_score(Y, predictions)
        print('Accuracy: {0:.2f}'.format(accuracy * 100.0))
        print('Classification Report:')
        print(sklearn.metrics.classification_report(Y, predictions))
        print('Confusion Matrix:')
        print(sklearn.metrics.confusion_matrix(Y, predictions))

# Predict and submit
def predict_and_submit():

    # Load test data set (includes header values)
    ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\test.csv')

    # Preprocess data
    ids, ds = common.preprocess_data(ds)

    # Slice data set in values (2D-array), the test set does not have target values
    X = ds.values[:,0:8] # Data

    # Load the best model
    model = joblib.load('C:\\DATA\\Python-data\\titanic\\models\\RandomForest.jbl')

    # Make predictions
    predictions = model.predict(X)

    # Save predictions to a csv file
    file = open('C:\\DATA\\Python-data\\titanic\\submission.csv', 'w', newline='')
    writer = csv.writer(file, delimiter=',')
    writer.writerow(('PassengerId', 'Survived'))
    for i in range(len(predictions)):
        writer.writerow((ids[i], predictions[i].astype(int)))
    file.close()

    # Print success
    print('Successfully created submission.csv!')

# Plot models
def plot_models():

    # Load models
    decision_tree_model = joblib.load('C:\\DATA\\Python-data\\titanic\\models\\DecisionTree.jbl')
    random_forest_model = joblib.load('C:\\DATA\\Python-data\\titanic\\models\\RandomForest.jbl')
    xgboost_model = joblib.load('C:\\DATA\\Python-data\\titanic\\models\\XGBoost.jbl')

    # Names
    feature_names = ['Pclass', 'Gender', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']
    class_names = ['Died', 'Survived']

    # Save the decision tree model to an image
    source = graphviz.Source(sklearn.tree.export_graphviz(decision_tree_model, out_file=None, feature_names=feature_names, class_names=class_names, filled=True))
    source.render('C:\\DATA\\Python-data\\titanic\\plots\\decision-tree', format='png', view=False)

    # Save one tree from the random forest model to an image
    source = graphviz.Source(sklearn.tree.export_graphviz(random_forest_model.estimators_[8], out_file=None, filled=True))
    source.render('C:\\DATA\\Python-data\\titanic\\plots\\random-forest', format='png', view=False)

    # Save the first tree in the xgboost model to an image
    xgboost_model.get_booster().feature_names = feature_names
    xgboost.plot_tree(xgboost_model, num_trees=0)
    figure = plt.gcf()
    figure.set_size_inches(100, 50)
    plt.savefig('C:\\DATA\\Python-data\\titanic\\plots\\xgboost.png')

# The main entry point for this module
def main():

    # Train and evaluate
    #train_and_evaluate()

    # Predict and submit
    #predict_and_submit()

    # Plot models
    plot_models()

# Tell python to run the main method
if __name__ == "__main__": main()
Training and evaluation
A for loop is used to train and evaluate the models, and each model is saved to a file. Each model is evaluated both on the training data and with 10-fold cross-validation. The output from the training and evaluation process is shown below.
--- DecisionTree ---
Training data
Accuracy: 94.84
Classification Report:
precision recall f1-score support
0.0 0.94 0.98 0.96 549
1.0 0.96 0.90 0.93 342
accuracy 0.95 891
macro avg 0.95 0.94 0.94 891
weighted avg 0.95 0.95 0.95 891
Confusion Matrix:
[[536 13]
[ 33 309]]
10-fold CV
Accuracy: 78.90
Classification Report:
precision recall f1-score support
0.0 0.82 0.85 0.83 549
1.0 0.74 0.70 0.72 342
accuracy 0.79 891
macro avg 0.78 0.77 0.77 891
weighted avg 0.79 0.79 0.79 891
Confusion Matrix:
[[464 85]
[103 239]]
--- RandomForest ---
Training data
Accuracy: 94.84
Classification Report:
precision recall f1-score support
0.0 0.94 0.98 0.96 549
1.0 0.97 0.90 0.93 342
accuracy 0.95 891
macro avg 0.95 0.94 0.94 891
weighted avg 0.95 0.95 0.95 891
Confusion Matrix:
[[538 11]
[ 35 307]]
10-fold CV
Accuracy: 82.27
Classification Report:
precision recall f1-score support
0.0 0.84 0.88 0.86 549
1.0 0.79 0.73 0.76 342
accuracy 0.82 891
macro avg 0.82 0.81 0.81 891
weighted avg 0.82 0.82 0.82 891
Confusion Matrix:
[[483 66]
[ 92 250]]
--- XGBoost ---
Training data
Accuracy: 98.20
Classification Report:
precision recall f1-score support
0.0 0.98 0.99 0.99 549
1.0 0.99 0.97 0.98 342
accuracy 0.98 891
macro avg 0.98 0.98 0.98 891
weighted avg 0.98 0.98 0.98 891
Confusion Matrix:
[[544 5]
[ 11 331]]
10-fold CV
Accuracy: 81.37
Classification Report:
precision recall f1-score support
0.0 0.84 0.87 0.85 549
1.0 0.77 0.73 0.75 342
accuracy 0.81 891
macro avg 0.80 0.80 0.80 891
weighted avg 0.81 0.81 0.81 891
Confusion Matrix:
[[475 74]
[ 92 250]]
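All three models score much higher on the training data than under 10-fold cross-validation, which indicates overfitting, and the hyperparameters above were set by hand. One way to search for better values is a grid search; here is a minimal sketch for the RandomForest model (the parameter grid is my own assumption, not tuned values from this tutorial):

# A minimal hyperparameter search sketch (the grid is an assumption)
import pandas
import sklearn.ensemble
import sklearn.model_selection
import annytab.decision_trees.common as common

ds = pandas.read_csv('C:\\DATA\\Python-data\\titanic\\train.csv')
ids, ds = common.preprocess_data(ds)
X = ds.values[:,1:9] # Data
Y = ds.values[:,0] # Survived

# Evaluate every parameter combination with 10-fold CV
grid = sklearn.model_selection.GridSearchCV(
    sklearn.ensemble.RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]},
    cv=10, scoring='accuracy')
grid.fit(X, Y)
print(grid.best_params_)
print('Best CV accuracy: {0:.2f}'.format(grid.best_score_ * 100.0))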
Submission
I created a submission file by using the RandomForest model (the model with the best cross-validation accuracy, and the one loaded in predict_and_submit above) and uploaded the file to Kaggle. My accuracy score was 0.73250, not much better than the baseline performance.
Plot trees
You will need to unpack or install Graphviz in order to plot models in Python. You also need to add a path to the bin folder (C:\Program Files\Graphviz\bin) in your environment variables. I load all the models and save the plots as PNG files; you can also save them as PDF files or in other formats.
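If you prefer not to edit the system environment variables, one alternative (a small sketch, assuming the default installation folder) is to append the Graphviz bin folder to PATH at runtime, before any rendering calls:

# Append Graphviz to PATH at runtime (assumes the default install folder)
import os
os.environ['PATH'] += os.pathsep + 'C:\\Program Files\\Graphviz\\bin'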