3 Popular Ways for Hyperparameter Tuning with Random Forest

Hyperparameter tuning is the process of choosing the configuration settings of a machine learning model that give the best performance on a particular task. In this post, we will cover the Grid Search, Randomized Search, and Bayesian Search techniques for hyperparameter tuning with Random Forest. All code presented here is available on my GitHub.

3 Popular Ways for Hyperparameter Tuning with Random Forest – image by Abi Plata

What is Hyperparameter Tuning?

Most machine learning models consist of 2 different sets of parameters:

Model parameters: these are learned by the model when we fit on the training data
Hyperparameters: these define model architecture attributes, and cannot be learned by fitting on the training data

A simple example would be to consider how Decision Trees are built. Deciding how splits are done, at a specific node, is learned during training and so constitutes an example of model parameters. How many splits should be allowed in total for the tree cannot be learned during training, and so is a type of hyperparameter.
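To make the distinction concrete, here is a minimal sketch using a single Decision Tree. The toy dataset and the names X_demo, y_demo, and tree are made up for illustration, and max_depth is used as a stand-in for "how much the tree is allowed to grow":

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small toy dataset, purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# hyperparameter: set by us before training starts
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
print("max_depth (hyperparameter):", tree.get_params()["max_depth"])

# model parameters: found by the algorithm during fit
tree.fit(X_demo, y_demo)
root_feature = tree.tree_.feature[0]      # index of the feature used at the root split
root_threshold = tree.tree_.threshold[0]  # threshold learned for that split
print("root split (learned):", root_feature, root_threshold)
print("depth actually reached:", tree.get_depth())
```

The split feature and threshold only exist after calling fit, while max_depth had to be supplied up front.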

3 Methods for Hyperparameter Tuning

We’ll explore 3 different approaches for hyperparameter tuning in the example below. These include:

Grid Search : cycle through every configuration in a predetermined set of hyperparameter values
Randomized Search : randomly select configurations from a set of hyperparameter distributions
Bayesian Optimisation : use the scores of configurations already evaluated to build a probabilistic model of performance, and select the next configuration to try based on that model
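As a rough illustration of the first two approaches, scikit-learn exposes ParameterGrid and ParameterSampler, which generate the configurations without training anything. The small search spaces below are invented for this sketch and are not the ones used later in the post:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# grid search: every combination of the listed values (2 x 2 = 4 configurations)
grid_space = {"max_depth": [5, 10], "max_features": ["sqrt", "log2"]}
grid_configs = list(ParameterGrid(grid_space))

# randomized search: a fixed number of draws from distributions (or lists)
random_space = {"max_depth": randint(low=1, high=21), "max_features": ["sqrt", "log2"]}
random_configs = list(ParameterSampler(random_space, n_iter=4, random_state=42))

for config in grid_configs:
    print("grid:  ", config)
for config in random_configs:
    print("random:", config)
```

Bayesian optimisation is not shown here because its candidates depend on the scores of earlier configurations, so they cannot be listed ahead of time.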

Hyperparameters for a Random Forest Classifier

There are numerous different hyperparameters available for a Random Forest classifier. A complete listing of these parameters for the scikit-learn implementation can be found here. For the purpose of our work in this post, I’ll only consider tuning the following:

criterion : function used to measure quality of splits
n_estimators : number of trees to include in the ensemble
max_depth : maximum depth of each tree, i.e. the longest allowed path from the root to a leaf
min_samples_split : minimum number of samples in a node for a split to occur
min_samples_leaf : minimum number of samples that must remain in each leaf node after a split
max_features : rule used to determine the number of features to consider when looking for the best split
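If you want to check what these six hyperparameters default to in your installed version of scikit-learn, a quick sketch is to read them from get_params. The list name tuned is just a label chosen for this example, and some defaults have changed between scikit-learn releases:

```python
from sklearn.ensemble import RandomForestClassifier

# the six hyperparameters tuned in this post
tuned = [
    "criterion",
    "n_estimators",
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "max_features",
]

# get_params returns every constructor argument with its current value
defaults = RandomForestClassifier().get_params()
for name in tuned:
    print(f"{name}: {defaults[name]!r}")
```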

Hyperparameter Tuning with Random Forest in Python

We can start by importing all the packages needed throughout the example:

# imports
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score
)
from skopt import BayesSearchCV
from skopt.space import Integer
from scipy.stats import poisson, randint
import numpy as np
from typing import Callable

Now let’s create our dataset, by making use of the make_classification function from scikit-learn. These data will consist of 5000 samples and 100 predictive features. Only half of these features will be informative, however. There will be a slight class imbalance of 60%-40% for the 2 class labels.

We can also perform a train-test split, keeping 20% of the data for testing:

# load in and prepare data
X, y = make_classification(n_samples=5000, 
                           n_features=100, 
                           n_informative=50,
                           n_classes=2, 
                           weights=[0.6,0.4],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
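As a sanity check, we can confirm the shapes and the class balance of the two splits. This sketch repeats the data creation so that it runs on its own; the names train_share and test_share are introduced here only for illustration. The proportions will be close to, but not exactly, 60%-40%, since make_classification flips a small fraction of labels by default and the split is random:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000,
                           n_features=100,
                           n_informative=50,
                           n_classes=2,
                           weights=[0.6, 0.4],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)

# fraction of samples belonging to each class, per split
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print("train class share:", train_share.round(3))
print("test class share: ", test_share.round(3))
```

If the two splits ever differ noticeably in class balance, passing stratify=y to train_test_split is one option to keep them aligned.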

Let’s now define a helper function to use when evaluating performance on the test set:

# helper function
def print_results(clf: RandomForestClassifier, X_test: np.ndarray, y_test: np.ndarray) -> None:
    """Print test-set metrics for a fitted classifier."""
    y_pred = clf.predict(X_test)
    # probability of the positive class (column 1), needed for ROC AUC
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(f"accuracy score: {accuracy_score(y_test, y_pred):.2f}")
    print(f"precision score: {precision_score(y_test, y_pred):.2f}")
    print(f"recall score: {recall_score(y_test, y_pred):.2f}")
    print(f"f1 score: {f1_score(y_test, y_pred):.2f}")
    print(f"ROC AUC score: {roc_auc_score(y_test, y_prob)}")

Baseline

To be able to measure the effects of our tuning, let’s first measure how well Random Forest does on the test set with all default hyperparameter values:

%%time

# fit a model with default parameters
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# compute performance on test set
print_results(clf, X_test, y_test)

accuracy score: 0.88
precision score: 0.96
recall score: 0.72
f1 score: 0.83
ROC AUC score: 0.9547358510304426
CPU times: user 1.79 s, sys: 7.59 ms, total: 1.8 s
Wall time: 1.79 s

Grid Search

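Grid Search is a brute-force approach to hyperparameter tuning: every combination of the values listed for each hyperparameter is evaluated. Each parameter configuration will be validated using 5-fold Cross-Validation. Afterwards, the configuration with the best mean cross-validated score (accuracy, the default for classifiers) will be selected, refit on the full training set, and tested against our held-out test set.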
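Before running the full search, it may help to see the moving parts on a much smaller problem. The sketch below uses a made-up toy dataset and a tiny grid (X_small, y_small, small_grid, and demo are names invented for this example), and it also shows the scoring argument, which lets you select on a metric other than the default accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# tiny toy problem so the search finishes quickly
X_small, y_small = make_classification(n_samples=300, n_features=10, random_state=0)
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

# scoring="roc_auc" ranks configurations by cross-validated ROC AUC
demo = GridSearchCV(RandomForestClassifier(random_state=0),
                    small_grid, cv=5, scoring="roc_auc")
demo.fit(X_small, y_small)

print("configurations tried:", len(demo.cv_results_["params"]))
print("best configuration:  ", demo.best_params_)
print(f"best mean CV score:   {demo.best_score_:.3f}")
```

The same attributes (best_params_, best_score_, cv_results_) are available on the search objects used in the rest of this post.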

%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':[50, 100, 500],
    'max_depth':[1, 5, 10, 15],
    'min_samples_split':[2, 4, 6, 8],
    'min_samples_leaf':[1, 2, 3, 4],
    'max_features':["sqrt", "log2"]
}

# create an instance of the grid search object
g = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_jobs=-1)

# conduct grid search over the parameter space
g.fit(X_train, y_train)

# show best parameter configuration found for classifier
cls_params = g.best_params_
cls_params

# compute performance on test set
print_results(g.best_estimator_, X_test, y_test)

accuracy score: 0.88
precision score: 0.97
recall score: 0.73
f1 score: 0.83
ROC AUC score: 0.9658667224605083
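To see why this approach is expensive, we can count how many models the grid above requires. The sketch copies the grid from this section; n_configs and n_fits are names introduced here for illustration:

```python
from math import prod

parameters = {
    'criterion': ["gini", "entropy", "log_loss"],
    'n_estimators': [50, 100, 500],
    'max_depth': [1, 5, 10, 15],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ["sqrt", "log2"]
}

# number of configurations = product of the number of values per hyperparameter
n_configs = prod(len(values) for values in parameters.values())

# each configuration is fitted once per cross-validation fold
n_fits = n_configs * 5

print(n_configs)  # 3 * 3 * 4 * 4 * 4 * 2 = 1152
print(n_fits)     # 1152 * 5 = 5760
```

Adding one more value to any of these lists multiplies the total, which is why grid sizes grow quickly.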

Randomized Search

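Rather than enumerating a fixed grid, Randomized Search draws a fixed number of configurations at random (10 in the example below, set through n_iter). Non-categorical hyperparameters are sampled from a probability distribution, while categorical ones are sampled uniformly from a list. Each parameter configuration will be validated using 5-fold Cross-Validation, like before. Afterwards, the best model will be selected, and tested against our held-out test set.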

%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':poisson(mu=500),
    'max_depth':poisson(mu=10),
    'min_samples_split':randint(low=2, high=5),
    'min_samples_leaf':randint(low=1, high=5),
    'max_features':["sqrt", "log2"]
}

# create an instance of the randomized search object
r = RandomizedSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)

# conduct randomized search over the parameter space
r.fit(X_train,y_train)

# show best parameter configuration found for classifier
cls_params2 = r.best_params_
cls_params2

# compute performance on test set
print_results(r.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.73
f1 score: 0.84
ROC AUC score: 0.9621090072183283
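It is worth checking what the distributions above can actually produce. A short sketch, with depth_draws, split_draws, and p_zero as names made up for this example:

```python
from scipy.stats import poisson, randint

# draw many samples from the same distributions used in the search space
depth_draws = poisson(mu=10).rvs(size=1000, random_state=42)
split_draws = randint(low=2, high=5).rvs(size=1000, random_state=42)

# a Poisson variable is centred on mu but can take any non-negative integer
print("max_depth draws range:", depth_draws.min(), "to", depth_draws.max())

# scipy's randint excludes the upper bound, so 5 is never drawn
print("min_samples_split values:", sorted(set(split_draws.tolist())))

# probability that poisson(mu=10) returns 0, which is not a valid max_depth
p_zero = poisson(mu=10).pmf(0)
print(f"P(max_depth = 0) = {p_zero:.6f}")
```

Two things to keep in mind: the upper bound of randint is exclusive, and a Poisson distribution can in principle return 0, although with mu=10 this is very unlikely.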

Bayesian Search

The final method we’ll try is Bayesian optimisation. Instead of sampling configurations independently, it fits a probabilistic surrogate model to the scores observed so far, and uses that model to decide which configuration to evaluate next. Like before, the search space for non-categorical hyperparameters is defined by ranges; in scikit-optimize, the prior argument controls whether values in a range are sampled on a uniform or log-uniform scale. Care will be needed when selecting these ranges and priors. Each parameter configuration will be validated using 5-fold Cross-Validation. Afterwards, the best model will be selected, and tested against our held-out test set.
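To give some intuition for what happens inside a Bayesian search, here is a generic, simplified sketch of the loop for a single hyperparameter. It is not how BayesSearchCV is implemented internally. It assumes a Gaussian process surrogate with an expected improvement rule, and toy_score is a hypothetical stand-in for a cross-validated score:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_score(depth: int) -> float:
    """Hypothetical stand-in for a cross-validated score, peaking at depth 12."""
    return 0.9 - 0.002 * (depth - 12) ** 2

candidates = np.arange(1, 21)  # search space: max_depth from 1 to 20
rng = np.random.default_rng(42)

# start with a few random configurations
tried = [int(d) for d in rng.choice(candidates, size=3, replace=False)]
scores = [toy_score(d) for d in tried]

for _ in range(7):
    # fit the surrogate to every (configuration, score) pair seen so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6)
    gp.fit(np.array(tried, dtype=float).reshape(-1, 1), np.array(scores))

    # predicted score and uncertainty for every candidate
    mu, sigma = gp.predict(candidates.astype(float).reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)

    # expected improvement over the best score found so far
    best = max(scores)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[np.isin(candidates, tried)] = -np.inf  # never re-evaluate a configuration

    # evaluate the most promising candidate and record the result
    next_depth = int(candidates[np.argmax(ei)])
    tried.append(next_depth)
    scores.append(toy_score(next_depth))

best_depth = tried[int(np.argmax(scores))]
print("configurations tried:", tried)
print("best depth found:", best_depth)
```

The key difference from Randomized Search is that each new configuration is chosen using everything learned from the previous ones.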

%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':Integer(50,1000,prior='uniform'),
    'max_depth':Integer(1,20,prior='uniform'),
    'min_samples_split':Integer(2,5,prior='log-uniform'),
    'min_samples_leaf':Integer(1,5,prior='log-uniform'),
    'max_features':["sqrt", "log2"]
}

# create an instance of the bayesian search object
b = BayesSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)

# conduct bayesian search over the parameter space
b.fit(X_train,y_train)

# show best parameter configuration found for classifier
cls_params3 = b.best_params_
cls_params3

# compute performance on test set
print_results(b.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.74
f1 score: 0.84
ROC AUC score: 0.9657453708546919

Final Remarks

We have tested out 6 key hyperparameters for Random Forest, using 3 popular techniques for parameter tuning. All results listed below are dependent on the dataset used in this notebook:

Bayesian Search takes a bit longer to run than Randomized Search, but the Bayesian approach yields slightly better results for the same number of iterations (a test ROC AUC of about 0.966 versus 0.962).
Both Randomized Search and Bayesian Search benefit from being able to handle distributions for their non-categorical hyperparameters.
Grid Search is by far the slowest method, and suffers from needing fixed parameter values defined in an array (as opposed to a distribution). It yields results that are somewhat better than the baseline (a test ROC AUC of about 0.966 versus 0.955).

I hope you enjoyed this article, and gained some value from it. If you would like to take a closer look at the code presented here, please take a look at my GitHub. If you have any questions or suggestions, please feel free to add a comment below. Your input is greatly appreciated.

Hi I'm Michael Attard, a Data Scientist with a background in Astrophysics. I enjoy helping others on their journey to learn more about machine learning, and how it can be applied in industry.
