## Motivation to Build a Logistic Regression Classifier

For this post, we will build a **logistic regression** classifier in Python. We will start from first principles, and work straight through to code implementation.

The problem of predicting a categorical variable is generally termed as **classification**. Classification is an extensively studied and widely applicable branch of machine learning: tasks such as determining whether a given email is spam, to whether an image contains a dog or a cat, etc. all fall under this category. In this article we’ll cover one of the simplest classification algorithms, logistic regression.

The mathematical foundations for logistic regression were initially developed by Pierre François Verhulst and Adolphe Quetelet, during the 1830s to 1840s, in their work on population growth. The work of Joseph Berkson in the 1940s to 1950s saw the formal development of logistic regression for the case of a dependent variable with just 2 classes. The 1960s saw the development of the multinomial logistic regression model, where the dependent variable has more than 2 classes, with the work of David Cox & Henri Thiel.

In the following sections, I’ll review the *mathematical assumptions*, and then the derivation of the *logistic regression model using maximum likelihood estimation (MLE)*. If you’re more interested in practical examples, just jump ahead to the next section where we will build a *logistic regression* classifier in Python. *Final remarks* are on the last page.

## Assumptions and Considerations to Build a Logistic Regression Classifier

Here I’ll list the key assumptions behind the the derivation that we’ll work though in the next section:

- the dependent variable is categorical in nature
- the independent variables are independent of, and uncorrelated with, each other
- there is a linear relationship between the independent variables and the log-odds of the dependent variable
- a sufficient sample size, such that each class in the dependent variable is well represented

## Derivation of Logistic Regression

For the sake of simplicity we will limit the derivation here to the case of a two-class problem, where the dependent variable (y) can take on only two values {0,1}. Note that because of this characteristic of y, it is not possible to use linear regression to model these data, since the output for this model is not bounded to {0,1}!

To build a model of this dataset, we will adopt a probabilistic approach. The probability for either class will be assumed to take on the form of the **sigmoid** or **logistic** function:

P(y_1|\bold{x}) = \frac{\exp(\beta_0 + \bold{\hat{\beta}}^T\bold{x})}{1 + \exp(\beta_0 + \bold{\hat{\beta}}^T\bold{x})} **(1)**

P(y_0|\bold{x}) = \frac{1}{1 + \exp(\beta_0 + \bold{\hat{\beta}}^T\bold{x})} **(2)**

where \bold{x} = \begin{bmatrix} x_1 \\ x_2 \\ . \\ . \\ . \\ x_M \end{bmatrix} is a vector of independent variables, \bold{\hat{\beta}} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ . \\ . \\ . \\ \beta_M \end{bmatrix} is a vector of model parameters, \beta_0 is the model bias parameter, y_0 represents the case where y = 0, and y_1 represents the case where y = 1. To better understand these equations, we can plot both of them:

The model is constructed such that smaller values for \beta_0 + \hat{\beta}^T\bold{x} will favour the y = 0 case, while increasingly larger values will predict the y = 1 case. The goal for fitting the logistic regression model is to find the set of model parameters {\beta_0,\hat{\beta}} such that the output probability from equation (1) is close to 1.0 for cases where y = 1, and the output probability from equation (2) is close to 1.0 for cases where y = 0.

The plots of equations (1) and (2), along with paying close attention to the form of these equations, makes a few things apparent:

- P(y_1|\bold{x}) + P(y_0|\bold{x}) = 1
- As \bold{x} \rightarrow \infty then P(y_1|\bold{x}) \rightarrow 1 and P(y_0|\bold{x}) \rightarrow 0
- As \bold{x} \rightarrow -\infty then P(y_1|\bold{x}) \rightarrow 0 and P(y_0|\bold{x}) \rightarrow 1

Equations (1) and (2) have the properties of cumulative probability distribution functions. Output probabilities from these equations can be mapped to y by adhering to:

- P(y_1|\bold{x}) < 0.5 \implies y = 0 and P(y_1|\bold{x}) > 0.5 \implies y = 1
- P(y_0|\bold{x}) < 0.5 \implies y = 1 and P(y_0|\bold{x}) > 0.5 \implies y = 0

The decision boundary for this classifier is therefore at P(y_1|\bold{x}) = 0.5 and P(y_0|\bold{x}) = 0.5. To get a better intuition on this boundary, we can express (1) and (2) in terms of a odds ratio at the boundary:

\frac{P(y_1|\bold{x})}{P(y_0|\bold{x})} = \frac{0.5}{0.5} = 1

Taking the natural logarithm of both sides, and noting the forms of (1) and (2), we can write:

\ln(\frac{P(y_1|\bold{x})}{P(y_0|\bold{x})}) = \beta_0 + \hat{\beta}^T\bold{x} = 0 **(3)**

where \ln() is the natural logarithm. It is apparent from (3) that the decision boundary is linear for this model, along with the log-odds in general. We can also note that the log-odds are changed by \hat{\beta}^T for every unit change in \bold{x}.

We’ll now focus on actually fitting the model. Notice that (1) and (2) are exponential functions, and as such an analytic least squares estimation isn’t possible. Instead, we will fit the model parameters {\beta_0,\hat{\beta}} by **maximum likelihood estimation** (MLE), which is an iterative method. First let’s take note of the following:

- The exponent in the conditional probabilities (1) and (2) will be written as: \beta_0 + \hat{\beta}^T\bold{x} = \beta^T\bold{x}, where \beta includes {\beta_0,\hat{\beta}}, and \bold{x} on the right hand side includes a 1 as the first element
- To emphasise the dependence on the model parameters, we’ll set P(y_1|\bold{x}) = P_1(\bold{x};\beta) and P(y_0|\bold{x}) = P_0(\bold{x};\beta)
- Specific to the two-case problem, we can set P(y_1|\bold{x}) = P(\bold{x};\beta) and P(y_0|\bold{x}) = 1 – P(\bold{x};\beta)

The likelihood function is the probability of the data {\bold{x},y} given the model parameters (\beta). This is given by:

L(\bold{\beta}) = \prod_{i=1}^{N}P_{c_i}(\bold{x}_i;\beta)

where the index i iterates over all N data samples, and c_i can either be 0 or 1. The likelihood function is the product of N probabilities, and as such can be numerically unstable. For this reason, it’s more common to work with log-likelihood’s:

\ell(\beta) = \ln(L(\beta)) = \sum_{i=1}^{N}\ln(P_{c_i}(\bold{x}_i;\beta)) **(4)**

If we now make note of points 2 & 3 above, we can rewrite the terms within the summation as:

\ln(P_{c_i}(\bold{x}_i;\beta)) = y_i\ln(P(\bold{x}_i;\beta)) + (1-y_i)\ln(1-P(\bold{x}_i;\beta))

Therefore plugging the above result, equation (1), and equation (2) into (4) we get:

\ell(\beta) = \sum_{i=1}^N y_i\ln(P(\bold{x}_i;\beta)) + (1-y_i)\ln(1-P(\bold{x}_i;\beta))

= \sum_{i=1}^N y_i[\ln(\exp(\beta^T\bold{x}_i))-\ln(1 + \exp(\beta^T\bold{x}_i))] + (1-y_i)[-\ln(1+\exp(\beta^T\bold{x}_i))]

= \sum_{i=1}^N y_i\beta^T\bold{x}_i – \ln(1+\exp(\beta^T\bold{x}_i)) **(5)**

To maximise the log-likelihood we will calculate the first derivative of (5) with respect to \beta, and set it to zero:

\frac{\partial{\ell(\beta)}}{\partial\beta} = \sum_{i=1}^N y_i\bold{x}_i – \frac{1}{1+\exp(\beta^T\bold{x}_i)}\bold{x}_i\exp(\beta^T\bold{x}_i)

= \sum_{i=1}^N \bold{x}_i(y_i-P(\bold{x}_i;\beta)) = 0 **(6)**

Notice that \bold{x}_i is a column vector, and has such equation (6) represents M equations, which are non-linear in \beta. To solve (6), we’ll use the **Newton-Raphson gradient ascent algorithm**, which has the general form:

x_1 = x_0 – \frac{f(x_0)}{f^{‘}(x_0)}

Applying this algorithm to equation (6) requires the second derivative of the log-likelihood:

\frac{\partial^2\ell(\beta)}{\partial\beta\partial\beta^T} = \sum_{i=1}^N -\bold{x}_i(\frac{\bold{x}_i^T\exp(\beta^T\bold{x}_i)}{1 + \exp(\beta^T\bold{x}_i)} – \frac{\exp(\beta^T\bold{x}_i)}{(1 + \exp(\beta^T\bold{x}_i))^2}\bold{x}_i^T\exp(\beta^T\bold{x}_i) )

= – \sum_{i=1}^N \bold{x}_i\bold{x}_i^T(P(\bold{x}_i;\beta) – P^2(\bold{x}_i;\beta))

= – \sum_{i=1}^N \bold{x}_i\bold{x}_i^TP(\bold{x}_i;\beta)(1 – P(\bold{x}_i;\beta)) **(7)**

Note here that the second derivative is negative, validating the concave nature of the likelihood function and ensuring the existence of a global maximum.

The Newton-Raphson algorithm for our purposes will be:

\beta^{‘} = \beta – (\frac{\partial^2\ell(\beta)}{\partial\beta\partial\beta^T})^{-1}\frac{\partial\ell(\beta)}{\partial\beta} **(8)**

where \beta^{‘} is the updated value for the model parameters at each iteration. To facilitate the calculation of \beta^{‘} at each iteration step, let’s first rewrite equations (6) and (7) in matrix notation, and then plug the results into (8). Let’s start with the dependent and independent variables, along with the conditional probabilities:

\bold{y} = \begin{bmatrix} y_1 \\ y_2 \\ . \\ . \\ . \\ y_N \end{bmatrix}, \bold{X} = \begin{bmatrix} 1 & x_{1,1} & … & x_{1,M} \\ 1 & x_{2,1} & … & x_{2,M} \\ … & … & … & … \\ … & … & … & … \\ 1 & x_{N,1} & … & x_{N,M} \end{bmatrix}, \bold{P} = \begin{bmatrix} P(\bold{x}_1;\beta) \\ P(\bold{x}_2;\beta) \\ . \\ . \\ . \\ P(\bold{x}_N;\beta) \end{bmatrix} **(9)**

In addition we’ll set d_{i,i} = P(\bold{x}_i;\beta)(1 – P(\bold{x}_i;\beta)), and build a diagonal matrix such that:

\bold{D} = \begin{bmatrix} d_{1,1} & 0 & … & 0 \\ 0 & d_{2,2} & … & 0 \\ … & … & … & … \\ … & … & … & … \\ 0 & 0 & … & d_{N,N} \end{bmatrix} **(10)**

Therefore, using (9) and (10) we can express equations (6) and (7) as:

\frac{\partial{\ell(\beta)}}{\partial\beta} = \bold{X}^T(\bold{y}-\bold{P})

\frac{\partial^2\ell(\beta)}{\partial\beta\partial\beta^T} = -\bold{X}^T\bold{D}\bold{X}

and now write (8) in matrix notation:

\beta^{‘} = \beta + (\bold{X}^T\bold{D}\bold{X})^{-1}\bold{X}^T(\bold{y}-\bold{P}) **(11)**

Equation (11) represents the update rule required to find the optimal set of model parameters, where all probabilities on the right hand side are calculated based on \beta. In practice, this update rule is executed until a certain tolerance threshold is reached, or a maximum number of iterations is completed.

## Build and Test a Logistic Regression Classifier in Python

What we’ll work through below is the implementation of the model developed in the * previous section*. The model will be used to predict benign/malignant tumours in the scikit-learn breast cancer dataset.

### Breast Cancer Dataset Exploration

Let’s start by importing the necessary libraries:

```
## imports ##
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,roc_curve
```

and now proceed to load the dataset:

```
## load data ##
data = load_breast_cancer()
X = data.data
y = data.target
```

The matrix of independent variables \bold{X} consists of 30 features, while the label vector \bold{y} has elements that are either 1 for benign, and 0 for malignant. Let’s now proceed to explore these data. I’ll generate a box plot to get a sense of the distributions and scales in the data, and a correlation heat map to determine what, if any, features are highly correlated:

Note that the heat map was generated with Pearson correlation values, which measures linear correlations between variables. We can conclude from these plots that some features:

- have very different scales
- have skewed distributions
- are highly correlated

Highly correlated features should be removed from the dataset prior to modelling. Let’s take advantage of the build-in functions included in Pandas dataframes to indicate what features share a correlation coefficient > 0.8:

```
#package the independent variables into a dataframe
dfX = pd.DataFrame(X,columns=data.feature_names)
#compute the pearson correlation coefficients
dfCorr = dfX.corr()
#reformat correlation matrix
dfCorr = dfCorr.stack().reset_index()
#remove diagnonal elements
dfCorr = dfCorr[dfCorr['level_0']!=dfCorr['level_1']]
#sort and remove duplicate elements
dfCorr = dfCorr.sort_values(by=[0],ascending=False).reset_index(drop=True).iloc[::2,:]
#select values with high correlations
dfCorr = dfCorr[dfCorr[0]>0.8]
```

We can now print out the results:

```
## view highly correlated feature pairs ##
gb = dfCorr.groupby(by=['level_0'])
for name,group in gb:
print(name," ",group['level_1'].values)
```

area error [‘mean area’]

compactness error [‘concavity error’]

fractal dimension error [‘compactness error’]

mean area [‘mean perimeter’ ‘mean concave points’]

mean compactness [‘mean concavity’]

mean concave points [‘worst perimeter’ ‘mean compactness’ ‘worst radius’ ‘worst area’]

mean concavity [‘mean concave points’ ‘worst concavity’]

mean perimeter [‘mean radius’ ‘worst radius’ ‘mean concave points’]

mean radius [‘mean area’ ‘worst perimeter’ ‘worst area’ ‘mean concave points’]

mean texture [‘worst texture’]

perimeter error [‘radius error’ ‘area error’]

radius error [‘area error’]

worst area [‘mean area’ ‘mean perimeter’ ‘area error’]

worst compactness [‘worst concavity’ ‘mean compactness’ ‘worst concave points’]

worst concave points [‘mean concave points’ ‘mean concavity’ ‘mean compactness’]

worst concavity [‘worst concave points’ ‘mean compactness’]

worst fractal dimension [‘worst compactness’]

worst perimeter [‘worst radius’ ‘worst area’ ‘mean perimeter’ ‘mean area’

‘worst concave points’]

worst radius [‘worst area’ ‘mean radius’ ‘mean area’]

worst smoothness [‘mean smoothness’]

Based on these results we can see that the following 18 features are highly correlated:

- mean area
- concavity error
- compactness error
- mean perimeter
- mean concave points
- mean concavity
- worst perimeter
- mean compactness
- worst radius
- worst area
- worst concavity
- mean radius
- worst texture
- radius error
- worst concave points
- worst compactness
- mean smoothness

Let’s proceed to remove these columns using the Pandas drop function, and then produce a box plot on the remaining features:

```
## come up with a truncated feature set ##
dfXtrun = dfX.drop(['mean area',
'concavity error',
'compactness error',
'mean perimeter',
'mean concave points',
'mean concavity',
'worst perimeter',
'mean compactness',
'worst radius',
'worst area',
'worst concavity',
'mean radius',
'worst texture',
'radius error',
'area error',
'worst concave points',
'worst compactness',
'mean smoothness'],
axis=1)
```

The remaining features still have variations in terms of scale and distribution. To facilitate the operation of the learning algorithm, we should scale these features such that each has a mean of zero and standard deviation of 1. This ensures that the gradients used during the optimisation are influenced by each feature equally. We’ll use the scikit-learn StandardScaler to accomplish this:

```
## scale input features ##
scaler = StandardScaler()
Xtrun = scaler.fit_transform(dfXtrun)
```

Finally, we’ll do a train-test split on the data. The training set will be used to learn the model parameters \beta, and we’ll test the effectiveness of the trained model on the unseen test set. The training set will comprise 70% of the data, and the remainder will be allocated to the test set. The scikit-learn train_test_split function will be used:

```
## do train/test split ##
X_train, X_test, y_train, y_test = train_test_split(Xtrun, y, test_size=0.3, random_state=42)
```

### Build a Custom Logistic Regression Classifier

Now let’s encapsulate the algorithm, derived in the previous section, into a Python class:

```
## build a class to encompass the logistic regression model with the Newton-Raphson algorithm ##
class LogR:
#initialiser
def __init__(self):
#private parameters array
self.__B = np.array([])
#train function
def train(self,Xin,Yin,max_iter=100,tol=1e-4):
#add column of 1's to independent variables
X = np.ones((Xin.shape[0],Xin.shape[1]+1))
X[:,1:] = Xin
#ensure correct dimensions for the input labels
Y = Yin.reshape(-1,1)
#prepare parameters array
self.__B = np.zeros((X.shape[1],1))
#loop through the Newton-Raphson algorithm
for i in range(max_iter):
#compute conditional probabilities
P = 1/(1 + np.exp(-np.matmul(X,self.__B)))
#built diagonal matrix
D = np.diag(np.multiply(P,(1-P)).flatten())
#parameter update rule
Minv = np.linalg.inv(np.matmul(np.matmul(X.T,D),X))
dB = np.matmul(np.matmul(Minv,X.T),np.subtract(Y,P))
self.__B += dB
#check if we've reached the tolerance
if tol > np.linalg.norm(dB):
break
#return
return
#predict function
def predict(self,Xin,return_prob=False):
#add column of 1's to independent variables
X = np.ones((Xin.shape[0],Xin.shape[1]+1))
X[:,1:] = Xin
#calculate predictions
Yp = 1/(1 + np.exp(-np.matmul(X,self.__B)))
#return predictions
if not return_prob:
Yp = Yp.round(decimals=0)
return(Yp)
#return model parameters
def return_B(self):
return(self.__B)
```

The class functions are as follows:

**__init__(self)**: this is the initialiser function which executes automatically when a class instance is declared. The model parameters are initialised here.**train(self, Xin,Yin,max_iter=100,tol=1e-4)**: this function calculates the model parameters according to equation (11), and stores the results internally. The training data is composed of input independent variables**Xin**, and associated labels,**Yin**.**max_iter**is the maximum number of iterations to carry out, and**tol**is the tolerance on which to halt the learning procedure (before max_iter is reached).**predict(self, Xin,return_prob=False)**: this function predicts response values based upon the input**Xin**values, and the internally stored model parameters. If**return_prob**is set to True, the probabilities for y = 1 are returned instead of the actual label values.**return_B(self)**: return the model parameters

We can now declare a class instance, train it, and predict the labels on the test set along with their associated probabilities:

```
## fit our logistic regression model & generate predictions ##
#declare the model
lr = LogR()
#fit the model on the training set
lr.train(X_train,y_train)
#generate predictions & associated probabilities
ypred = lr.predict(X_test)
yprob = lr.predict(X_test,return_prob=True)
```

In order to determine how well our model is doing, we will calculate the * accuracy and precision* of the results. The accuracy metric measures the percentage of correctly predicted labels. The precision measures the ability of the model to repeatedly predict a positive label correctly, and is defined as:

\frac{TP}{TP+FP}

where TP and FP are true positive and false positive, respectively. Both range from [0,1], where 0 and 1 are the worst and best possible results, respectively. We can use the functions imported from scikit-learn to measure these metrics:

```
## evaluate model performance ##
print("accuracy: %.2f" % accuracy_score(y_test,ypred))
print("precision: %.2f" % precision_score(y_test,ypred))
```

accuracy: 0.96

precision: 0.97

These results indicate our model is functioning well with this data set. Finally, we can visualise the results by creating a * Receiver Operating Characteristic *(ROC) plot. This plot shows the False Positive Rate versus the True Positive Rate, for all possible decision rules (in our code of the LogLR class, the decision rule is set at 0.5). We can use the

*roc_curve*function from scikit-learn to produce this plot:

```
## create a ROC plot ##
#calculate false & true positive rates
fpr,tpr,_ = roc_curve(y_test, yprob)
#construct plot
plt.plot(fpr,tpr)
plt.title('Receiver Operating Characteristic')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".8"), plt.plot([1, 1] , c=".8")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
```

The diagonal orange line indicates the result for predictions made at random; the further our model results are from this line, and towards the upper left, the better the classifier correctly identifies true and false positives. This plot confirms what our earlier accuracy and precision calculations showed: the model functions as a reasonably good classifier for the breast cancer dataset used in this example.

### Scikit-Learn Logistic Regression Model

Instead of building our own logistic regression model from scratch, it’s also possible to use the one available through *scikit-learn*:

```
## imports ##
from sklearn.linear_model import LogisticRegression
```

We can now train and predict using this model and the same data set as used previously:

```
## fit our data and generate predictions ##
#declare a LinearRegression object
lr = LogisticRegression()
#fit the model
lr.fit(X_train,y_train)
#make predictions & associated probabilities
ypred = lr.predict(X_test)
yprob = lr.predict_proba(X_test)
## evaluate model performance ##
print("accuracy: %.2f" % accuracy_score(y_test,ypred))
print("precision: %.2f" % precision_score(y_test,ypred))
```

accuracy: 0.96

precision: 0.97

* *Finally we can also reproduce the ROC plot for these data:

```
## create a ROC plot ##
#calculate false & true positive rates
fpr,tpr,_ = roc_curve(y_test, yprob[:,1])
#construct plot
plt.plot(fpr,tpr)
plt.title('Receiver Operating Characteristic')
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".8"), plt.plot([1, 1] , c=".8")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
```

We can see that the accuracy and precision metrics, along with the ROC plot, are identical to the results obtained with our custom built model.

## Final Remarks

In this article you’ve learned:

- The motivation behind logistic regression and when it was developed
- How logistic regression with maximum likelihood estimation is derived
- How to build a logistic regression classifier in Python
- How to use the logistic regression module provided by the popular scikit-learn library

I hope you enjoyed this article and gained some value from it! If you would like to take a closer look at the code presented here please take a look at my * GitHub*.

Congratulations on building your website. Keep up the good work!

Thank you!