Motivation to Implement the Bootstrap Method

In this post, we will implement the Bootstrap Method in Python from scratch. A key question we can ask, regarding any modelling effort, is the uncertainty in our estimates. This applies both to the learned model parameters, as well as the resulting predictions obtained from a trained model.  These uncertainty estimates could, for example, play an important role in choosing an optimal model for a given problem we want to solve. The bootstrap method provides a means by which these uncertainties can be quantified. 

This article will be concerned with the non-parametric bootstrap method. In this case, no assumptions are made regarding the form of the underlying distribution from which our data was obtained. 

The application of this technique is often used to calculate standard errors, confidence intervals, and to perform hypothesis testing on sample statistics. Prediction error can also be estimated. The bootstrap was first developed in 1979 by Bradley Efron1.

In the following sections, I’ll review the assumptions behind the bootstrap method, and then the derivation of the bootstrap method. If you’re more interested in practical examples, just jump ahead to the next section where we will implement the bootstrap method in Python from scratch. Final remarks are covered in the final portion of this article.

Bootstrap Method Assumptions and Considerations

The following assumptions will be required to develop and implement the bootstrap method:

  • The data sample being used accurately represents the true population of the data
  • The variance of the true population is finite

Derivation of the Bootstrap Algorithm

We wish to use a Monte Carlo method by which uncertainties can be determined though the use of numerical methods. We will treat the available data as one of many possible samples that could have been obtained from the true underlying distribution.

Imagine a scenario where multiple samples are collected from a data source. For any given sample, a statistic computed on those data will vary between the different samples. This variation follows what is termed a sampling distribution. 

The bootstrap method seeks to estimate this sampling distribution by continually resampling from the available data, with replacement. By “with replacement”, we mean that each data point in the available sample can be resampled multiple times. Typically this resampling is done until the selected data is of equal size to the original sample. This process can be repeated multiple times to yield a set of resampled, bootstrapped samples.

The bootstrap method can therefore be outlined in the following 4 points:

  1. Ensure each data point in the original sample has equal probability of being selected
  2. Select a data point from the original sample for inclusion in the current bootstrap sample. This selection is done with replacement
  3. Repeat point 2. until the current bootstrap sample is the same size as the original sample
  4. Repeat points 2. & 3. until the required number of bootstrap samples are obtained

We can visualise this process below:

Implement the Bootstrap Method in Python

In this example, the original sample consists of 5 observations: the rabbit, cow, pig, turkey, and monkey. We run through the method outlined above to built three bootstrap samples. Note that each bootstrap sample consists of 5 observations, the same as the original sample. Also notice that in the boostrap samples some duplicates occur, and the same observation can be present in different bootstrap samples.  This is a consequence of sampling with replacement.

In practice, anywhere from hundreds to thousands of bootstrap samples are normally generated during a statistical analysis.

Implement the Bootstrap Method in Python

We will start the worked example here by importing the required packages:

## imports ##
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error

Implement Bootstrap Functionality

Now I will define two functions that will help with our analysis:

## function to generate bootstrap datasets ##
def make_bootstraps(data,n_bootstraps=100):
    #initialize output dictionary & unique value count
    dc   = {}
    unip = 0
    #get sample size
    b_size = data.shape[0]
    #get list of row indexes
    idx = [i for i in range(b_size)]
    #loop through the required number of bootstraps
    for b in range(n_bootstraps):
        #obtain boostrap samples with replacement
        sidx   = np.random.choice(idx,replace=True,size=b_size)
        b_samp = data[sidx,:]
        #compute number of unique values contained in the bootstrap sample
        unip  += len(set(sidx))
        #obtain out-of-bag samples for the current b
        oidx   = list(set(idx) - set(sidx))
        o_samp = np.array([])
        if oidx:
            o_samp = data[oidx,:]
        #store results
        dc['boot_'+str(b)] = {'boot':b_samp,'test':o_samp}
    #state the mean number of unique values in the bootstraps
    print('Mean number of unique values in each bootstrap: {:.2f}'.format(unip/n_bootstraps))
    #return the bootstrap results

The make_boostraps function takes in a numpy array data and generates the requested number of bootstrap samples set by the n_bootstraps argument. By default, this is set to 100. This function also calculates, and prints out, the mean number of unique data points contained in the bootstrap samples.

## function determining the dependence on training size for our model ##
def error_by_trainsize(train,test,train_size,iters=10):
    #initialize model
    lr    = LinearRegression()
    maes  = np.array([])
    mses  = np.array([])
    #get list of row indexes
    idx = [i for i in range(train.shape[0])]
    #loop over each training size
    for ts in train_size:
        #initialize error metrics
        mae = 0
        mse = 0
        #for each training size, I will repeat the calculation iters times
        for j in range(iters):
            #obtain training samples without replacement
            sidx     = np.random.choice(idx,replace=False,size=ts)
            trainset = np.copy(train[sidx,:])
            #fit a linear regression model to the training set
            #generate predictions & calculate error metrics
            yp  = lr.predict(test[:,0].reshape(-1,1))
            mae += mean_absolute_error(test[:,1],yp)
            mse += mean_squared_error(test[:,1],yp) 
        #store the mean error metrics over all repetitions
        mae = np.array([mae/iters])
        mse = np.array([mse/iters])
        maes = np.concatenate((maes,mae))
        mses = np.concatenate((mses,mse))
    #return error metrics

The function error_by_trainsize computes the Mean Absolute Error (MAE) and Mean Squared Error (MSE) on predictions from a linear regression model for a provided test data set. The MAE and MSE are given by:

MAE = \frac{1}{N}\sum_i^N|y_{i,pred}-y_{i,true}|    (1)

MSE = \frac{1}{N}\sum_i^N(y_{i,pred}-y_{i,true})^2    (2)

where y_{pred} are the predicted values, y_{true} are the true values, and N is the total number of samples considered in the calculation. Note that the Root Mean Squared Error (RMSE) is simply the square root of the MSE. The lowest possible value is 0, indicating no deviation between the predicted and true values.  The calculation of these error metrics is repeated over a number of different training sizes set by the list train_size.  Since the selection of the training set is random, I repeat the calculations iters times and compute the mean for each training set size. With this function, we can analyse the effect of training size on model performance.

Data Exploration & Bias Check

We can now proceed to generate data to work with. I will make use of the make_regression function provided by scikit-learn:

## build a synthetic regression dataset ##
x,y,coef = make_regression(n_samples=5000,n_features=1,n_targets=1,coef=True,noise=25.0,bias=1.0,random_state=42)

Note that these data consist of 5000 samples with 1 dependent and 1 independent variables. The noise injected onto the dependent variable follows a normal distribution with a standard deviation of 25.0. The intercept is set to 1.0.

Let’s now plot these data:

## plot regression data ##
plt.title('Synethic Regression Data')
Implement the Bootstrap Method in Python

It is evident that we have highly dispersed data, which nonetheless have a positive slope.  

We now need to consider the effects of bias and variance in our analysis. The aim of machine learning is to produce models that are able to generalise well beyond the training data used during fitting. Prediction error should ideally be due to noise in the data (i.e. the irreducible error).

Bias results from having a model that lacks the expressive ability to reproduce general trends seen in the data. This usually results from having insufficient training data, or from the choice of model.

On the other hand, Variance results from having a model that is too expressive, and as such is prone to becoming highly optimised on the training data. This results in a model that is too sensitive to noise present in the data, preventing it from generalising well.  

For the current analysis, the variance problem is handled by the fact that we will be combining the results over many bootstrapped samples. This will have the effect of cancelling out the effects of variance over the aggregate. However, bias could be present due to the effect of the amount of unique data points available for training. If each bootstrap sample has an insufficient amount of unique data points to adequately train the model, then bias will be present in our results.

To check if bias will be a potential problem, let’s see how training set size effects model prediction accuracy.  Let’s proceed to run the analysis:

## extract out a small test set to measure performance ##
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.1, random_state=42)
## define a set of training set sizes to choose from ##
train_size = [10,100,500,1000,1500,2000,2500,3000,3500,4000]
## package the data ##
train = np.concatenate((xtrain,ytrain.reshape(-1,1)),axis=1)
test  = np.concatenate((xtest,ytest.reshape(-1,1)),axis=1)
## for each training set size, fit a model and calculate the error ##
maes,mses = error_by_trainsize(train,test,train_size)

This code block performs a train-test split on the data, with 10% of the data allocated for testing. A set of 10 different training sizes are defined, and the data is packaged for use inside our function error_by_trainsize. We can plot the results to get:

## plot the error curves ##
plt.title('Error Metric Dependence on Training Size')
plt.xlabel('Number Unique Observations in Train Set')
plt.ylabel('Error Metric Units')
Implement the Bootstrap Method in Python

This plot shows that linear regression reaches an optimal level of performance at a training set size of ~500 unique samples. As such, we should aim to have at least this number in each of our bootstrap samples.

Let’s now create 100 bootstrap samples from the complete dataset:

## package the data ##
data = np.concatenate((x,y.reshape(-1,1)),axis=1)
## generate bootstrap samples ##
dcBoot = make_bootstraps(data)

Mean number of unique values in each bootstrap: 3160.36

Note that the mean number of unique data points in each bootstrap sample is above 3000, which is well above the ~500 lower limit determine earlier. As such, we can conclude that bias shouldn’t be a major issue for us in this analysis.

Bootstrap Analysis of Linear Regression

We now can iterate through each bootstrap sample, and fit a linear regression model to each sample. We can then collect the determined model parameters, as well as calculate the prediction errors using (1) and (2). The prediction errors are computed on the unique data points not selected in the current bootstrap sample, otherwise termed the out-of-bag samples.

## iterate through each bootstrap sample, and compute the desired statistics ##
#initialize storage variables & model
coefs = np.array([])
intrs = np.array([])
maes  = np.array([])
mses  = np.array([])
lr    = LinearRegression()
#loop through each bootstrap sample
for b in dcBoot:
    #fit a linear regression model to the current sample[b]['boot'][:,0].reshape(-1, 1),dcBoot[b]['boot'][:,1].reshape(-1, 1))
    #store model parameters
    coefs = np.concatenate((coefs,lr.coef_.flatten()))
    intrs = np.concatenate((intrs,lr.intercept_.flatten()))
    #compute the predictions on the out-of-bag test set & compute metrics
    if dcBoot[b]['test'].size:
        yp  = lr.predict(dcBoot[b]['test'][:,0].reshape(-1, 1))
        mae = mean_absolute_error(dcBoot[b]['test'][:,1],yp)
        mse = mean_squared_error(dcBoot[b]['test'][:,1],yp)   
        #store the error metrics
        maes = np.concatenate((maes,mae.flatten()))
        mses = np.concatenate((mses,mse.flatten()))

Let’s plot a histogram of the linear regression coefficient, along with some statistical information. Note: re-running this code can produce slightly different results, due to the random selection process built into the bootstrap method.

## plot histogram of regression coefficients ##
sns.displot(coefs, kde=True)
plt.title('Linear Regression Coefficient Data', fontsize=18)
plt.xlabel('Coefficients', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
Implement the Bootstrap Method in Python
## Compute Statistics on Model Coefficient ##
print('Expected value: ',np.mean(coefs))
print('Standard Error: ',np.std(coefs))
print('99% Confidence Interval for regression coefficient: {',np.percentile(coefs,0.5),'-',np.percentile(coefs,99.5),'}')

Expected value:  16.746588821657973

Standard Error:  0.4108450651178147

99% Confidence Interval for regression coefficient: { 15.896069584719381 – 17.690885196641283 }

The true coefficient was obtained directly when calling make_regression:

## what is the true regression coefficient? ##
print('True regression coefficient: ',str(coef))

True regression coefficient:  16.82365791084919

The true value falls within the 99% confidence interval determined from our bootstrap samples, as well as within +/- 1 standard error about the expected value.  

We can now do a similar analysis for the model intercept:

## plot histogram of regression intercepts ##
sns.displot(intrs, kde=True)
plt.title('Linear Regression Intercepts Data', fontsize=18)
plt.xlabel('Intercepts', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
Implement the Bootstrap Method in Python
## Compute Statistics on Model Intercept ##
print('Expected value: ',np.mean(intrs))
print('Standard Error: ',np.std(intrs))
print('99% Confidence Interval for regression intercept: {',np.percentile(intrs,0.5),'-',np.percentile(intrs,99.5),'}')

Expected value:  0.4964812253308403

Standard Error:  0.38344421924723365

99% Confidence Interval for regression intercept: { -0.38487890139735226 – 1.4300311903826721 }

The true intercept was set to 1. We can see that the true value falls within 99% confidence interval, however the expected value is only half the true value. The range [expected value – standard error, expected value + standard error] does not cover the true intercept value. We can conclude that the intercept value we get with these data, from a linear regression fit, will tend to be underestimated with respect to the true value. 

We can also produce histogram plots of the prediction MAE and RMSE, along with their respective 99% confidence intervals:

## plot histogram of mae on predictions ##
sns.displot(maes, kde=True)
plt.title('Linear Regression Prediction MAE Results', fontsize=18)
plt.xlabel('MAE', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
Implement the Bootstrap Method in Python
## 99% CI of MAE results ##
print('99% Confidence Interval for Prediction MAE results: {',np.percentile(maes,0.5),'-',np.percentile(maes,99.5),'}')

99% Confidence Interval for Prediction MAE results: { 19.730436495431668 – 20.989068027313436 }

## plot histogram of rmse on predictions ##
sns.displot(np.sqrt(mses), kde=True)
plt.title('Linear Regression Prediction RMSE Results', fontsize=18)
plt.xlabel('RMSE', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
Implement the Bootstrap Method in Python
## 99% CI of RMSE results ##
print('99% Confidence Interval for Prediction RMSE results: {',np.percentile(np.sqrt(mses),0.5),'-',np.percentile(np.sqrt(mses),99.5),'}')

99% Confidence Interval for Prediction RMSE results: { 24.715987604610945 – 26.264339484690247 }

The histograms above, along with the associated 99% confidence intervals, gives us a sense for the expected value of the prediction error, along with the possible range in prediction error. Note that the injected noise level of 25.0 falls nicely within the 99% confidence interval for the RMSE.  If we look at the bias-variance breakdown of the MSE, we can see that this is expected:

MSE = Bias^2 + Variance + \sigma^2    (3)

In our case, we know that our data originates from a linear relationship between 1 independent variable and 1 dependent variable, with some injected noise (\sigma). As we have a sufficient number of samples in our dataset (and also in each bootstrap sample), it is fair to conclude that the effect of bias and variance should be minimal in our analysis. Therefore MSE \approx \sigma^2 or RMSE \approx \sigma. 

Finally, we can produce a plot of the data with the resulting linear regression fit, including estimates for the uncertainties in the fit (99% confidence intervals) and prediction error (RMSE):

## plot the model against the dataset ##
xp = np.linspace(min(x),max(x),100)
plt.title('Linear Regression Fit to Generated Data', fontsize=18)
plt.xlabel('X', fontsize=14)
plt.ylabel('Y', fontsize=14)
plt.legend(['Model','Model+2RMSE','Model-2RMSE','99% CI','Data'],loc='lower right')
Implement the Bootstrap Method in Python

Note that the RMSE used in this plot is the expected value. Since the RMSE approximates the gaussian noise, I multiply by a factor of 2 to have upper/lower bounds that account for ~95% of the spread in the data.

Final Remarks

In this article you’ve learned:

  • The motivation behind the bootstrap method and when it was developed
  • The methodology of the algorithm
  • How to implement the bootstrap method in Python, to address questions of uncertainty in your machine learning applications
  • How to analyse the results from a bootstrap experiment

I hope you enjoyed this article and gained some value from it!  If you would like to take a closer look at the code presented here please take a look at my GitHub.


  1. Efron, Bradley. 1979. Bootstrap Methods: Another Look at the Jacknife, The Annals of Statistics, Vol. 7, No. 1, 1-26

0 0 votes
Article Rating
Notify of

Newest Most Voted
Inline Feedbacks
View all comments
Ivanka Vasenska
Ivanka Vasenska
9 months ago

I would like to implement bootstrapping for classification in order to use it NN.
Could you give me some advice for the code regarding the bootstrapping?
I will leave my email in order to have a way to send you my data so we can discuss it further!


Would love your thoughts, please comment.x