Hyperparameter tuning is the process of searching for the hyperparameter values that give a machine learning model its best performance on a particular task. In this post, we will cover the Grid Search, Randomized Search, and Bayesian Search techniques for Hyperparameter Tuning with Random Forest. All code presented here is available on my GitHub.
3 Popular Ways for Hyperparameter Tuning with Random Forest – image by Abi Plata
What is Hyperparameter Tuning?
Most machine learning models consist of 2 different sets of parameters:
- Model parameters: these are learned by the model when we fit on the training data
- Hyperparameters: these define model architecture attributes, and cannot be learned by fitting on the training data
A simple example would be to consider how Decision Trees are built. Deciding how splits are done, at a specific node, is learned during training and so constitutes an example of model parameters. How many splits should be allowed in total for the tree cannot be learned during training, and so is a type of hyperparameter.
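The distinction can be made concrete with a small sketch. It assumes scikit-learn's DecisionTreeClassifier and a tiny synthetic dataset; max_depth is used here as a stand-in for "how much the tree is allowed to grow", and the specific values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# hyperparameter: chosen by us before training starts
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# model parameters: learned from the data during fit
print(tree.get_depth())          # never exceeds the max_depth we set
print(tree.tree_.feature[:3])    # feature index used at the first few nodes
print(tree.tree_.threshold[:3])  # split thresholds learned at those nodes
```

Changing max_depth and refitting produces a different set of learned splits, which is exactly why the hyperparameter has to be tuned from the outside.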
3 Methods for Hyperparameter Tuning
We’ll explore 3 different approaches for hyperparameter tuning in the example below. These include:
- Grid Search : cycle through every configuration in a predetermined set of hyperparameter values
- Randomized Search : randomly select configurations from a set of hyperparameter distributions
- Bayesian Optimisation : use the scores of previously evaluated configurations to build a probabilistic model of performance, and pick the next configuration where that model looks most promising
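As a rough illustration of how the first two approaches differ, here is a minimal sketch in plain Python. The toy search space and the budget of 4 draws are assumptions made up for this example; no model is trained, the point is only how the candidate configurations are generated.

```python
import itertools
import random

# toy search space (values chosen for illustration only)
space = {
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [1, 2, 4],
}

# grid search: every combination, in a fixed order
grid_configs = [
    dict(zip(space, combo)) for combo in itertools.product(*space.values())
]

# randomized search: a fixed budget of independent random draws
rng = random.Random(42)
random_configs = [
    {name: rng.choice(values) for name, values in space.items()}
    for _ in range(4)
]

print(len(grid_configs))    # 9 = 3 * 3
print(len(random_configs))  # 4, the chosen budget
```

The grid grows multiplicatively with every hyperparameter added, while the random budget stays whatever we set it to. Bayesian optimisation also works with a fixed budget, but chooses each new configuration based on the scores seen so far.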
Hyperparameters for a Random Forest Classifier
There are numerous different hyperparameters available for a Random Forest classifier. A complete listing of these parameters for the scikit-learn implementation can be found here. For the purpose of our work in this post, I’ll only consider tuning the following:
- criterion : function used to measure quality of splits
- n_estimators : number of trees to include in the ensemble
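- max_depth : maximum depth of each tree, i.e. the largest number of splits along any path from the root to a leaf
- min_samples_split : minimum number of samples a node must contain for a split to be attempted
- min_samples_leaf : minimum number of samples that each leaf node must contain; a split is only considered if both children meet this minimum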
- max_features : function used to determine number of features to consider when doing a split
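To see what we are starting from, the default values of these six hyperparameters can be read straight off an untrained estimator with get_params(). This is a small sketch; the printed defaults depend on the installed scikit-learn version, so none are hard-coded here.

```python
from sklearn.ensemble import RandomForestClassifier

# the six hyperparameters tuned in this post
tuned = [
    "criterion",
    "n_estimators",
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "max_features",
]

# get_params() returns every constructor argument with its current value
defaults = RandomForestClassifier().get_params()
for name in tuned:
    print(f"{name}: {defaults[name]!r}")
```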
Hyperparameter Tuning with Random Forest in Python
We can start by importing all the packages needed throughout the example:
# imports
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score
)
from skopt import BayesSearchCV
from skopt.space import Integer
from scipy.stats import poisson, randint
import numpy as np
from typing import Callable

Now let’s create our dataset, by making use of the make_classification function from scikit-learn. The data will consist of 5000 samples and 100 features, only 50 of which are informative. There will be a slight class imbalance of 60%-40% between the 2 class labels.
We can also perform a train-test split, keeping 20% of the data for testing:
# load in and prepare data
X, y = make_classification(n_samples=5000,
                           n_features=100,
                           n_informative=50,
                           n_classes=2,
                           weights=[0.6,0.4],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let’s now define a helper function to use when evaluating performance on the test set:
# helper function
def print_results(clf: RandomForestClassifier, X_test: np.ndarray, y_test: np.ndarray) -> None:
    # hard class predictions, used for accuracy, precision, recall and f1
    y_pred = clf.predict(X_test)
    # class probabilities; column 1 is the positive class, used for ROC AUC
    y_prob = clf.predict_proba(X_test)
    print(f'accuracy score: {accuracy_score(y_test, y_pred):.2f}')
    print(f"precision score: {precision_score(y_test, y_pred):.2f}")
    print(f"recall score: {recall_score(y_test, y_pred):.2f}")
    print(f"f1 score: {f1_score(y_test, y_pred):.2f}")
    print(f"ROC AUC score: {roc_auc_score(y_test, y_prob[:,1])}")

Baseline
To be able to measure the effects of our tuning, let’s first measure how well Random Forest does on the test set with all default hyperparameter values:
%%time
# fit a model with default parameters
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# compute performance on test set
print_results(clf, X_test, y_test)

accuracy score: 0.88
precision score: 0.96
recall score: 0.72
f1 score: 0.83
ROC AUC score: 0.9547358510304426
CPU times: user 1.79 s, sys: 7.59 ms, total: 1.8 s
Wall time: 1.79 s
Grid Search
Grid Search is a brute-force approach to hyperparameter tuning: every combination of the listed values is evaluated. Each parameter configuration will be validated using 5-fold Cross-Validation. Afterwards, the best model will be selected, and tested against our held-out test set.
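To get a feel for the cost of this approach, the number of configurations can be counted before anything is fitted. The sketch below copies the grid used in this section and multiplies out the list lengths; the only assumption is the 5 folds stated above.

```python
import math

# same grid as used for the Grid Search in this section
grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "n_estimators": [50, 100, 500],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 3, 4],
    "max_features": ["sqrt", "log2"],
}

# every combination is tried, so the list lengths multiply
n_configs = math.prod(len(values) for values in grid.values())
# each configuration is fitted once per cross-validation fold
n_fits = n_configs * 5

print(n_configs)  # 1152
print(n_fits)     # 5760
```

Adding a single extra value to any one of these lists adds hundreds of fits, which is why the grid has to be kept coarse.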
%%time
# setup parameter space
parameters = {
'criterion':["gini", "entropy", "log_loss"],
'n_estimators':[50, 100, 500],
'max_depth':[1, 5, 10, 15],
'min_samples_split':[2, 4, 6, 8],
'min_samples_leaf':[1, 2, 3, 4],
'max_features':["sqrt", "log2"]
}
# create an instance of the grid search object
g = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_jobs=-1)
# conduct grid search over the parameter space
g.fit(X_train, y_train)
# show best parameter configuration found for classifier
cls_params = g.best_params_
cls_params

CPU times: user 17 s, sys: 1.96 s, total: 18.9 s
Wall time: 32min 36s
{'criterion': 'entropy',
 'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 500}

# compute performance on test set
print_results(g.best_estimator_, X_test, y_test)

accuracy score: 0.88
precision score: 0.97
recall score: 0.73
f1 score: 0.83
ROC AUC score: 0.9658667224605083
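Beyond best_params_, a fitted search object also exposes the cross-validation score of every configuration it tried. The following is a minimal sketch on a deliberately tiny dataset and grid (both made up for illustration so it runs in seconds). Note that the searches in this post do not pass a scoring argument, so configurations are ranked by the classifier's default score, which is accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic dataset and grid (illustrative values only)
X_small, y_small = make_classification(n_samples=300, n_features=10, random_state=0)
small_grid = {"max_depth": [2, 5], "n_estimators": [10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(X_small, y_small)

# one entry per configuration, scored as the mean over the 5 folds
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 3))

# the winning configuration and its mean cross-validation score
print(search.best_params_, round(search.best_score_, 3))
```

Looking at the full table is a useful sanity check: if the best score is only marginally ahead of the rest, the differences on the test set are likely to be small as well.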
Randomized Search
Rather than enumerating a grid, Randomized Search draws a fixed number of configurations (n_iter=10 here) at random: non-categorical hyperparameters are sampled from probability distributions, while categorical ones are sampled uniformly from a list. Each parameter configuration will be validated using 5-fold Cross-Validation, like before. Afterwards, the best model will be selected, and tested against our held-out test set.
%%time
# setup parameter space
parameters = {
'criterion':["gini", "entropy", "log_loss"],
'n_estimators':poisson(mu=500),
'max_depth':poisson(mu=10),
'min_samples_split':randint(low=2, high=5),
'min_samples_leaf':randint(low=1, high=5),
'max_features':["sqrt", "log2"]
}
# create an instance of the randomized search object
r = RandomizedSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)
# conduct randomized search over the parameter space
r.fit(X_train,y_train)
# show best parameter configuration found for classifier
cls_params2 = r.best_params_
cls_params2

CPU times: user 7.89 s, sys: 57.9 ms, total: 7.95 s
Wall time: 52.1 s
{'criterion': 'gini',
 'max_depth': 14,
 'max_features': 'sqrt',
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 487}

# compute performance on test set
print_results(r.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.73
f1 score: 0.84
ROC AUC score: 0.9621090072183283
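It is worth checking what the distributions used above can actually produce before handing them to the search. The sketch below draws samples from the same randint and poisson objects; the sample size of 1000 is an arbitrary choice for illustration.

```python
from scipy.stats import poisson, randint

# same distributions as used for min_samples_split and max_depth above
split_dist = randint(low=2, high=5)
depth_dist = poisson(mu=10)

split_samples = split_dist.rvs(size=1000, random_state=42)
depth_samples = depth_dist.rvs(size=1000, random_state=42)

# randint excludes its upper bound, so only 2, 3 and 4 can appear
print(sorted(set(split_samples.tolist())))

# a Poisson distribution has no upper bound and can in principle return 0
print(depth_samples.min(), depth_samples.max())
```

Two things follow. First, randint(low=2, high=5) never returns 5. Second, a Poisson draw of 0 is possible, and scikit-learn would reject it as a max_depth; with mu=10 the chance is tiny, but a bounded distribution such as randint is a safer choice if that matters.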
Bayesian Search
The final method we’ll try is Bayesian optimisation, using BayesSearchCV from scikit-optimize. Instead of drawing each configuration independently, it fits a surrogate model to the cross-validation scores observed so far and uses that model to choose the next configuration to try. Like before, the search space for non-categorical hyperparameters is defined by ranges, and the prior argument ('uniform' or 'log-uniform') controls how values are sampled within each range. Care will be needed when selecting these priors. Each parameter configuration will be validated using 5-fold Cross-Validation. Afterwards, the best model will be selected, and tested against our held-out test set.
%%time
# setup parameter space
parameters = {
'criterion':["gini", "entropy", "log_loss"],
'n_estimators':Integer(50,1000,prior='uniform'),
'max_depth':Integer(1,20,prior='uniform'),
'min_samples_split':Integer(2,5,prior='log-uniform'),
'min_samples_leaf':Integer(1,5,prior='log-uniform'),
'max_features':["sqrt", "log2"]
}
# create an instance of the bayesian search object
b = BayesSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)
# conduct Bayesian search over the parameter space
b.fit(X_train,y_train)
# show best parameter configuration found for classifier
cls_params3 = b.best_params_
cls_params3

CPU times: user 9.86 s, sys: 60.4 ms, total: 9.92 s
Wall time: 58.9 s
OrderedDict([('criterion', 'entropy'),
             ('max_depth', 18),
             ('max_features', 'sqrt'),
             ('min_samples_leaf', 2),
             ('min_samples_split', 2),
             ('n_estimators', 481)])

# compute performance on test set
print_results(b.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.74
f1 score: 0.84
ROC AUC score: 0.9657453708546919
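To give some intuition for what a Bayesian search does between iterations, here is a toy sketch of the general idea. It is not how BayesSearchCV is implemented internally. It assumes a single hyperparameter (a max_depth between 1 and 20), a made-up toy_score function standing in for a cross-validated score, a Gaussian process from scikit-learn as the surrogate model, and an upper-confidence-bound rule for picking the next candidate.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def toy_score(depth: float) -> float:
    # hypothetical stand-in for a cross-validated score, peaking at depth 12
    return 1.0 - (depth - 12.0) ** 2 / 50.0


# candidate values for one hyperparameter, e.g. max_depth from 1 to 20
candidates = np.arange(1, 21, dtype=float).reshape(-1, 1)

# start with a few randomly chosen evaluations (indices into candidates)
rng = np.random.default_rng(42)
tried = [int(i) for i in rng.choice(len(candidates), size=3, replace=False)]

for _ in range(7):
    X_obs = candidates[tried]
    y_obs = np.array([toy_score(x[0]) for x in X_obs])

    # surrogate model: fitted to the scores observed so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)

    # acquisition: favour a high predicted score or high uncertainty
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + 1.96 * std
    ucb[tried] = -np.inf  # never re-evaluate a candidate
    tried.append(int(np.argmax(ucb)))

best_idx = max(tried, key=lambda i: toy_score(candidates[i, 0]))
print(len(tried), candidates[best_idx, 0])
```

The key difference from Randomized Search is inside the loop: each new candidate depends on everything evaluated before it, which is also why this kind of search is harder to parallelise across iterations.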
Final Remarks
We have tested out 6 key hyperparameters for Random Forest, using 3 popular techniques for hyperparameter tuning. All results listed below are dependent on the dataset used in this notebook:
- Bayesian Search takes a bit longer to run than Randomized Search (58.9 s vs 52.1 s wall time), and for the same 10 iterations yields a slightly higher test ROC AUC (0.966 vs 0.962) at the same accuracy of 0.89.
- Both Randomized Search and Bayesian Search benefit from being able to handle distributions for their non-categorical hyperparameters.
- Grid Search is by far the slowest method (over 32 minutes of wall time), and suffers from needing fixed parameter values defined in an array (as opposed to a distribution). Its test ROC AUC (0.966) is somewhat better than the baseline (0.955), although its accuracy is unchanged at 0.88.
I hope you enjoyed this article, and gained some value from it. If you would like to take a closer look at the code presented here, please take a look at my GitHub. If you have any questions or suggestions, please feel free to add a comment below. Your input is greatly appreciated.
Hi I'm Michael Attard, a Data Scientist with a background in Astrophysics. I enjoy helping others on their journey to learn more about machine learning, and how it can be applied in industry.
