Yes, Decision Trees can handle missing values naturally. The CART algorithm is straightforward to extend to support them; however, care needs to be taken with how the algorithm is implemented in code. In this post, I will implement classification and regression Decision Trees capable of dealing with missing values in the data.

How Can Decision Trees Handle Missing Values

In an earlier article, I covered the CART algorithm in depth, and how to implement it in Python. For that discussion, it was implicitly assumed that the data used for training, and for generating predictions, contain no missing values. Real-life data, obtained in a production setting, will typically not meet this assumption. One way to handle missing values is to add a preprocessing step to our machine learning pipeline that treats them (for example, through imputation or removal). In fact, many machine learning algorithms require such a step in order to prevent software failures during operations.

The CART algorithm makes it possible to handle missing values within the Decision Tree itself, so we can skip that additional preprocessing step. Some modifications to the algorithm are required, however. Below I will outline one possible approach for growing a Decision Tree, and producing predictions from a trained model, on data with missing values. These steps are identical to those covered in my article on Decision Trees, with the required modifications shown in bold:

Train a Decision Tree

    1. Decide how samples with missing values will be passed to the child nodes. This decision can be treated as a hyperparameter (see the sketch after this list).
    2. Select a metric upon which the decision rules will be based.
    3. Start at the root node with all the training data.
    4. Choose the feature, and threshold value, that minimise our metric when used to split the training data. Check that the chosen threshold value is a defined number (not a NaN or None). Record the feature – threshold value pair.
    5. Partition the data according to the chosen feature and threshold, then repeat steps 4 and 5 for each subset of the data. If a sample being routed has a missing value for the chosen feature, move it according to the decision made in step 1 above.
    6. Stop partitioning when: i) all samples in the given node have the same non-missing label value y, or ii) a specified model limit is reached (e.g. maximum tree depth). In the terminal partitions, compute the leaf node values while omitting any missing values in the label, and record these.
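As a minimal sketch of the routing rule from steps 1 and 5 (the goes_right helper below is purely illustrative, not part of the classes implemented later): a sample whose feature value is present follows the usual threshold comparison, while a missing value is always sent to one fixed side, chosen by a boolean hyperparameter.

import numpy as np

def goes_right(x: float, threshold: float, nans_go_right: bool = True) -> bool:
    """Route a single feature value at a decision node; a missing value is
    always sent to the side selected by the nans_go_right hyperparameter."""
    if np.isnan(x):
        return nans_go_right
    return x > threshold

print(goes_right(2.5, 1.0))                          # True: ordinary comparison
print(goes_right(np.nan, 1.0))                       # True: missing value sent right
print(goes_right(np.nan, 1.0, nans_go_right=False))  # False: missing value sent left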

Predictions from a Trained Decision Tree

    1. Pass the input predictor data through the tree, starting at the root node.
    2. Move to either the left or right child node. The choice of moving right or left depends on how these data satisfy the decision rules (which are based on the feature – threshold pairs recorded in step 4 of the training procedure). If the relevant input value is missing, this move is based on our decision from step 1 of the training procedure.
    3. Navigate through the tree, in the manner described by point 2, until a leaf node is reached. The stored leaf node value can then be provided as the prediction.

Modify Python Decision Tree Classes

Let’s now modify the Python code, implemented in my previous article on Decision Trees, to support missing values. Note the complete code is available on my GitHub. In that previous article, 4 classes were defined to do the job:

  1. Node : this class controls the behaviour of a single node in the tree.
  2. DecisionTree : an abstract base class that encapsulates all the common functionality between classification and regression Decision Trees.
  3. DecisionTreeClassifier : a child class of DecisionTree, that encapsulates all functionality unique to classification.
  4. DecisionTreeRegressor : another child class of DecisionTree, that encapsulates all functionality unique to regression.

The Node class will be unchanged, and so we can move along to the implementation of DecisionTree. A new boolean hyperparameter will be introduced: nans_go_right. This will be used to decide how input samples with missing values will be moved within the tree:

from abc import ABC

import numpy as np


class DecisionTree(ABC):
    """
    Base class to encompass the CART algorithm
    """
    
    def __init__(self, max_depth: int=None, min_samples_split: int=2, nans_go_right=True) -> None:
        """
        Initializer
        
        Inputs:
            max_depth         -> maximum depth the tree can grow
            min_samples_split -> minimum number of samples required to split a node
            nans_go_right     -> boolean to determine where NaN values in predictors are allocated
        """
        self.tree              = None
        self.max_depth         = max_depth
        self.min_samples_split = min_samples_split
        self.nans_go_right     = nans_go_right

Private functions will also be added to control how samples flow through the tree. These make use of nans_go_right:

    def __split_right(self, feature_col: np.array, split_point: float) -> np.array:
        """
        Private function to determine which elements in the feature column go to the child right node
        
        Inputs:
            feature_col -> feature column to analyse
            split_point -> split point for the feature column
        Outputs:
            numpy array of boolean values
        """
        return (feature_col > split_point) | (np.isnan(feature_col) & self.nans_go_right)
 
    def __split_left(self, feature_col: np.array, split_point: float) -> np.array:
        """
        Private function to determine which elements in the feature column go to the child left node
        
        Inputs:
            feature_col -> feature column to analyse
            split_point -> split point for the feature column
        Outputs:
            numpy array of boolean values
        """
        return (feature_col <= split_point) | (np.isnan(feature_col) & (not self.nans_go_right))
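To see what these masks produce, here is a small standalone check mirroring the same expressions on a toy feature column (purely illustrative; the methods above operate on the data held at a node). Note that every sample, missing or not, lands in exactly one child:

import numpy as np

feature_col   = np.array([0.2, 1.7, np.nan, 3.4])
split_point   = 1.0
nans_go_right = True

# same expressions as __split_right / __split_left above
right_mask = (feature_col > split_point) | (np.isnan(feature_col) & nans_go_right)
left_mask  = (feature_col <= split_point) | (np.isnan(feature_col) & (not nans_go_right))

print(right_mask)  # [False  True  True  True]  -> the NaN sample goes right
print(left_mask)   # [ True False False False]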

The private recursive function __grow, which is used to grow the Decision Tree during training, also requires some modifications. These mainly center around ensuring NaNs don't leak into our calculations. We will also make use of the functions defined above, __split_right and __split_left, to determine how the data are split at every non-leaf node:

    def __grow(self, node: Node, D: np.array, level: int) -> None:
        """
        Private recursive function to grow the tree during training
        
        Inputs:
            node  -> input tree node
            D     -> sample of data at node 
            level -> depth level in the tree for node
        """
        # are we in a leaf node?
        depth = (self.max_depth is None) or (self.max_depth >= (level+1))
        msamp = (self.min_samples_split <= D.shape[0])
        cls   = D[:,-1]
        n_cls = np.unique(cls[~np.isnan(cls)]).shape[0] != 1
        
        # not a leaf node
        if depth and msamp and n_cls:
        
            # initialize the function parameters
            ip_node = None
            feature = None
            split   = None
            left_D  = None
            right_D = None
            # iterate through the possible feature/split combinations
            for f in range(D.shape[1]-1):
                f_values = D[:,f]
                for s in np.unique(f_values[~np.isnan(f_values)]):
                    # for the current (f,s) combination, split the dataset
                    D_l = D[self.__split_left(D[:,f],s)]
                    D_r = D[self.__split_right(D[:,f],s)]
                    # ensure we have non-empty arrays
                    if D_l.size and D_r.size:
                        # calculate the impurity
                        ip  = (D_l.shape[0]/D.shape[0])*self._impurity(D_l) + (D_r.shape[0]/D.shape[0])*self._impurity(D_r)
                        # now update the impurity and choice of (f,s)
                        if (ip_node is None) or (ip < ip_node):
                            ip_node = ip
                            feature = f
                            split   = s
                            left_D  = D_l
                            right_D = D_r
            # guard against the (rare) case where no valid split was found,
            # e.g. when the remaining rows are identical or all-NaN in every feature
            if feature is None:
                node.leaf_value = self._leaf_value(D)
                return
            # set the current node's parameters
            node.set_params(split,feature)
            # declare child nodes
            left_node  = Node()
            right_node = Node()
            node.set_children(left_node,right_node)
            # investigate child nodes
            self.__grow(node.get_left_node(),left_D,level+1)
            self.__grow(node.get_right_node(),right_D,level+1)
                        
        # is a leaf node
        else:
            
            # set the node value & return
            node.leaf_value = self._leaf_value(D)
            return
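One detail in the stopping criterion above deserves a note: np.unique keeps NaN as a distinct entry, so without the ~np.isnan filter a node holding a single real class plus a missing label would never be treated as pure. A quick illustration:

import numpy as np

cls = np.array([0.0, 0.0, np.nan])

print(np.unique(cls))                   # [ 0. nan] -> looks like 2 "classes"
print(np.unique(cls[~np.isnan(cls)]))   # [0.]      -> 1 class, the node is pure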

The private recursive function __traverse will also be adapted to make use of __split_left:

    def __traverse(self, node: Node, Xrow: np.array) -> int | float:
        """
        Private recursive function to traverse the (trained) tree
        
        Inputs:
            node -> current node in the tree
            Xrow -> data sample being considered
        Output:
            leaf value corresponding to Xrow
        """        
        # check if we're in a leaf node?
        if node.leaf_value is None:
            # get parameters at the node
            (s,f) = node.get_params()
            # decide to go left or right?
            if (self.__split_left(Xrow[f],s)):
                return(self.__traverse(node.get_left_node(),Xrow))
            else:
                # note NaNs in Xrow end up here when nans_go_right is True
                return(self.__traverse(node.get_right_node(),Xrow))
        else:
            # return the leaf value
            return(node.leaf_value)

I included some minor changes to the train and predict functions, which convert any None values in the inputs to NumPy NaNs:

    def train(self, Xin: np.array, Yin: np.array) -> None:
        """
        Train the CART model
        
        Inputs:
            Xin -> input set of predictor features
            Yin -> input set of labels
        """        
        # prepare the input data
        D            = np.concatenate((Xin,Yin.reshape(-1,1)),axis=1)
        D[D == None] = np.nan
        D            = D.astype('float64')
        # set the root node of the tree
        self.tree = Node()
        # build the tree
        self.__grow(self.tree,D,1)
        
    def predict(self, Xin: np.array) -> np.array:
        """
        Make predictions from the trained CART model
        
        Input:
            Xin -> input set of predictor features
        Output:
            array of prediction values
        """
        # prepare the input data
        Xin[Xin == None] = np.nan
        Xin              = Xin.astype('float64')
        # iterate through the rows of Xin
        p = []
        for r in range(Xin.shape[0]):
            p.append(self.__traverse(self.tree,Xin[r,:]))
        # return predictions
        return(np.array(p).flatten())
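One note on the D[D == None] = np.nan line: it only has an effect when the input arrives as an object-dtype array containing Python None placeholders; on an all-float array the comparison selects nothing and any NaNs simply pass through. A small illustration under that assumption (X_raw is a hypothetical raw record, not part of the classes):

import numpy as np

# object-dtype input holding a Python None placeholder
X_raw = np.array([[1.0, None], [2.0, 3.0]], dtype=object)

X_raw[X_raw == None] = np.nan        # swap None for np.nan
X_clean = X_raw.astype('float64')    # the cast to float now succeeds

print(X_clean)
# [[ 1. nan]
#  [ 2.  3.]]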

Let’s now proceed to DecisionTreeClassifier. The changes to this class consist of:

  1. adding the new hyperparameter, nans_go_right
  2. ensuring NaN values do not leak into impurity or leaf node value calculations

from scipy import stats


class DecisionTreeClassifier(DecisionTree):
    """
    Decision Tree Classifier
    """
    
    def __init__(self, max_depth: int=None, min_samples_split: int=2, nans_go_right=True, loss: str='gini') -> None:
        """
        Initializer
        
        Inputs:
            max_depth         -> maximum depth the tree can grow
            min_samples_split -> minimum number of samples required to split a node
            nans_go_right     -> boolean to determine where NaN values in predictors are allocated
            loss              -> loss function to use during training
        """
        DecisionTree.__init__(self,max_depth,min_samples_split,nans_go_right)
        self.loss = loss   
    
    def __gini(self, D: np.array) -> float:
        """
        Private function to define the gini impurity
        
        Input:
            D -> data to compute the gini impurity over
        Output:
            Gini impurity for D
        """        
        # initialize the output
        G = 0
        # iterate through the unique classes
        cls = D[:,-1]
        for c in np.unique(cls[~np.isnan(cls)]):
            # compute p for the current c
            p = D[D[:,-1]==c].shape[0]/D.shape[0]
            # compute term for the current c
            G += p*(1-p)
        # return gini impurity
        return(G)
    
    def __entropy(self, D: np.array) -> float:
        """
        Private function to define the shannon entropy
        
        Input:
            D -> data to compute the shannon entropy over
        Output:
            Shannon entropy for D
        """        
        # initialize the output
        H = 0
        # iterate through the unique classes
        cls = D[:,-1]
        for c in np.unique(cls[~np.isnan(cls)]):
            # compute p for the current c
            p = D[D[:,-1]==c].shape[0]/D.shape[0]
            # compute term for the current c
            H -= p*np.log2(p)
        # return entropy
        return(H)
    
    def _impurity(self, D: np.array) -> float:
        """
        Protected function to define the impurity
        
        Input:
            D -> data to compute the impurity metric over
        Output:
            Impurity metric for D        
        """        
        # use the selected loss function to calculate the node impurity
        ip = None
        if self.loss == 'gini':
            ip = self.__gini(D)
        elif self.loss == 'entropy':
            ip = self.__entropy(D)
        # return results
        return(ip)
    
    def _leaf_value(self, D: np.array) -> int:
        """
        Protected function to compute the value at a leaf node
        
        Input:
            D -> data to compute the leaf value
        Output:
            Mode of D         
        """        
        return(stats.mode(D[:,-1],keepdims=False,nan_policy='omit')[0])
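The leaf value relies on SciPy's mode with nan_policy='omit', so missing labels are dropped before the most common class is taken (the keepdims argument assumes a reasonably recent SciPy, roughly 1.9+). A quick check:

import numpy as np
from scipy import stats

labels = np.array([0.0, 1.0, 1.0, np.nan])

# the NaN label is ignored before computing the mode
print(stats.mode(labels, keepdims=False, nan_policy='omit')[0])   # 1.0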

Similar changes are also required for the DecisionTreeRegressor class:

class DecisionTreeRegressor(DecisionTree):
    """
    Decision Tree Regressor
    """
    
    def __init__(self, max_depth: int=None, min_samples_split: int=2, nans_go_right=True, loss: str='mse') -> None:
        """
        Initializer
        
        Inputs:
            max_depth         -> maximum depth the tree can grow
            min_samples_split -> minimum number of samples required to split a node
            nans_go_right     -> boolean to determine where NaN values in predictors are allocated
            loss              -> loss function to use during training
        """        
        DecisionTree.__init__(self,max_depth,min_samples_split,nans_go_right)
        self.loss = loss   
    
    def __mse(self, D: np.array) -> float:
        """
        Private function to define the mean squared error
        
        Input:
            D -> data to compute the MSE over
        Output:
            Mean squared error over D
        """
        # compute the mean target for the node
        y_m = np.mean(D[:,-1],where=~np.isnan(D[:,-1]))
        # compute the mean squared error wrt the mean
        E = np.sum((D[:,-1] - y_m)**2,where=~np.isnan(D[:,-1]))/D.shape[0]
        # return mse
        return(E)
    
    def __mae(self, D: np.array) -> float:
        """
        Private function to define the mean absolute error
        
        Input:
            D -> data to compute the MAE over
        Output:
            Mean absolute error over D
        """
        # compute the mean target for the node
        y_m = np.mean(D[:,-1],where=~np.isnan(D[:,-1]))
        # compute the mean absolute error wrt the mean
        E = np.sum(np.abs(D[:,-1] - y_m),where=~np.isnan(D[:,-1]))/D.shape[0]
        # return mae
        return(E)
    
    def _impurity(self, D: np.array) -> float:
        """
        Protected function to define the impurity
        
        Input:
            D -> data to compute the impurity metric over
        Output:
            Impurity metric for D        
        """            
        # use the selected loss function to calculate the node impurity
        ip = None
        if self.loss == 'mse':
            ip = self.__mse(D)
        elif self.loss == 'mae':
            ip = self.__mae(D)
        # return results
        return(ip)
    
    def _leaf_value(self, D: np.array) -> float:
        """
        Protected function to compute the value at a leaf node
        
        Input:
            D -> data to compute the leaf value
        Output:
            Mean of D           
        """
        return(np.mean(D[:,-1],where=~np.isnan(D[:,-1])))
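The regression leaf value and error terms use np.mean and np.sum with a where mask to skip missing labels; np.nanmean is an equivalent spelling for the mean, as the quick comparison below shows.

import numpy as np

y = np.array([1.0, 3.0, np.nan])

print(np.mean(y, where=~np.isnan(y)))   # 2.0
print(np.nanmean(y))                    # 2.0 (equivalent here)

One subtlety worth noting: __mse and __mae divide the summed error by the full row count D.shape[0], so rows with a missing label still contribute to the denominator; normalising by the non-missing count instead would be an equally defensible design choice.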

Classification Example with Missing Values

We can now test out the modified Decision Tree implementations covered in the previous section. Let’s start by creating a synthetic dataset with the make_classification function from scikit-learn:

from sklearn.datasets import make_classification

# create a classification dataset
X,y = make_classification(n_samples=1000, n_features=8, n_informative=5, n_classes=2, random_state=42)

These data consist of 1000 samples, each with 8 predictive features, of which only 5 are informative. The label contains 2 classes, and since I haven't specified class weights, the dataset should be roughly balanced.
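As a quick sanity check on that balance claim (using the y array generated above, before any missing values are added):

import numpy as np

# count samples per class; with no class weights this should come out close to 500/500
print(np.bincount(y))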

To visualise what we have generated, let’s produce a box plot of the predictive features:

import matplotlib.pyplot as plt

# box plot of features
plt.boxplot(X)
plt.xlabel('features')
plt.ylabel('values')
plt.title('Distributions in Features')
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8], ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
plt.show()
[Figure: box plots of the eight predictor feature distributions]

These distributions look reasonable, and show some variation among the different features. 

Now we need to introduce some missing values. To do this, let’s define two functions for setting missing values in both the predictor and label values:

from random import choices
import numpy as np

def add_missing_values_to_X(X: np.array, n_nan: int=0) -> np.array:
    """
    Function to add missing values to predictor values X
    
    Inputs:
        X      -> set of predictor values
        n_nan  -> number of np.nan values to introduce
    Outputs:
        X with np.nan values introduced
    """
    # setup indices lists
    idx_x = [x_i for x_i in range(X.shape[0])]
    idx_y = [y_i for y_i in range(X.shape[1])]
    # add np.nan values
    idx_nan_x = choices(idx_x,k=n_nan)
    idx_nan_y = choices(idx_y,k=n_nan)
    X[idx_nan_x,idx_nan_y] = np.nan
    # return
    return(X)

def add_missing_values_to_y(y: np.array, n_nan: int=0) -> np.array:
    """
    Function to add missing values to label values y
    
    Inputs:
        y      -> set of label values
        n_nan  -> number of np.nan values to introduce
    Outputs:
        y with np.nan values introduced
    """
    # set type to float64
    y = y.astype('float64')
    # setup indices list
    idx = [x_i for x_i in range(y.shape[0])]
    # add np.nan values
    idx_nan = choices(idx,k=n_nan)
    y[idx_nan] = np.nan
    # return
    return(y)

Now we can proceed to introduce missing values, and then do a train-test split:

from sklearn.model_selection import train_test_split

# add missing values to predictors
X = add_missing_values_to_X(X, n_nan=20)
# add missing values to labels
y = add_missing_values_to_y(y, n_nan=2)

# do train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In total, 20 positions in the predictor features were set to NaN (random.choices samples positions with replacement, so a few may coincide). Additionally, 2 missing values were placed in the label column. Twenty percent of the data are then held out as a test set.
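If you want to confirm how many NaNs actually landed in the arrays, a quick count:

import numpy as np

# count the NaNs actually present in the predictors and labels
print(np.isnan(X).sum())   # at most 20
print(np.isnan(y).sum())   # at most 2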

Let’s see how our modified Decision Tree classifier performs on these data:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# declare the classifier and train the model
clf = DecisionTreeClassifier(max_depth=5, loss='gini')
clf.train(X_train,y_train)

# generate predictions
yp = clf.predict(X_test)

# evaluate model performance
idx    = ~np.isnan(y_test)
y_test = y_test[idx]
yp     = yp[idx]
print("accuracy: %.2f" % accuracy_score(y_test,yp))
print("precision: %.2f" % precision_score(y_test,yp))
print("recall: %.2f" % recall_score(y_test,yp))
accuracy: 0.86
precision: 0.85
recall: 0.89

Our implementation works, and yields good results. If you try to train the scikit-learn Decision Tree classifier on these data, you will see it raises a ValueError.
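If you want to verify this yourself, you can wrap the scikit-learn fit in a try/except, as in the sketch below. Treat the exact behaviour as version-dependent: recent scikit-learn releases add native missing-value support for predictors in their tree models, but NaNs in the labels (as in y_train here) are still rejected, so an exception is expected.

from sklearn import tree

# importing via the module avoids shadowing our own DecisionTreeClassifier
try:
    sk_clf = tree.DecisionTreeClassifier(max_depth=5)
    sk_clf.fit(X_train, y_train)
except ValueError as err:
    print(f"scikit-learn rejected the data: {err}")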

Regression Example with Missing Values

In the same manner as was done for the classification example, we can now try out the Decision Tree regressor implementation. Let’s start by creating a synthetic dataset with the make_regression function from scikit-learn:

from sklearn.datasets import make_regression

# create a regression dataset
X,y = make_regression(n_samples=1000, n_features=8, n_informative=5, n_targets=1, noise=0.3, random_state=42)

Like before, these data consist of 1000 samples, with a total of 8 predictive features, of which only 5 are informative. We have a single target, with Gaussian noise added at a standard deviation of 0.3.

To visualise what we have generated, let’s produce a box plot of the predictive features, as well as a histogram of the target:

# box plot of features
plt.boxplot(X)
plt.xlabel('features')
plt.ylabel('values')
plt.title('Distributions in Features')
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8], ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
plt.show()
[Figure: box plots of the eight predictor feature distributions]
# histogram of label
plt.hist(y,bins=50)
plt.xlabel('y label')
plt.ylabel('frequency')
plt.title('Distribution in y Label')
plt.show()
[Figure: histogram of the y label distribution]

These data look reasonable, with significantly less variation in the predictor features when compared with the classification case. 

Now we can proceed to introduce missing values, using the same two functions add_missing_values_to_X and add_missing_values_to_y defined previously. Following that, let’s do a train-test split:

# add missing values to predictors
X = add_missing_values_to_X(X, n_nan=5)
# add missing values to targets
y = add_missing_values_to_y(y, n_nan=2)

# do train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In total, 5 positions in the predictor features were set to NaN, along with 2 more in the target column. Twenty percent of the data are then held out as a test set.

Let’s see how our modified Decision Tree regressor performs on these data:

from sklearn.metrics import mean_squared_error, mean_absolute_error

# declare the regressor and train the model
rgr = DecisionTreeRegressor(max_depth=5, loss='mae', nans_go_right=False)
rgr.train(X_train,y_train)

# make predictions
yp = rgr.predict(X_test)

# evaluate model performance
idx    = ~np.isnan(y_test)
y_test = y_test[idx]
yp     = yp[idx]
print("rmse: %.2f" % np.sqrt(mean_squared_error(y_test,yp)))
print("mae: %.2f" % mean_absolute_error(y_test,yp))
rmse: 68.86
mae: 53.35

Our implementation works, and yields reasonable results. Note that in this case, I set the hyperparameter nans_go_right to False. As with the classifier, trying to train the scikit-learn Decision Tree regressor on these data raises a ValueError.

Final Remarks

In this article you have learned:

  • That Decision Trees can handle missing values in the data.
  • One possible approach to how CART Decision Trees can be modified to handle missing values.
  • An actual Python implementation of Decision Trees that can support missing values.

Please feel free to visit my GitHub page, where the complete Python code presented here is available in a single Jupyter notebook. Download the notebook and play around with the implementation. If you have any questions or comments, please leave them below!
