I often hear that supervised machine learning models are "black box" methods which offer little information on what drives predictions. In this post I discuss three different techniques — permutation importance, partial dependence, and LIME — which provide insight into the behavior of machine learning models, produce interpretable outputs, and can be used with a variety of supervised learning models.1 The code for this post can be found here.2

Data

To demonstrate how these techniques can be utilized, I analyze soccer player data from the FIFA video game series, which I obtained from Kaggle. This data provides extremely detailed information on thousands of soccer players. As this is a pedagogical exercise, I utilize a small subset of the available information, specifying a player’s weekly wage as the outcome to be predicted and employing a player’s transfer value, position, age, and ability as predictors. From the figure below, we can see that the distribution of players’ wages is quite skewed, so I work with the log transformed version of this data. I generate predictions using a random forests model, a widely used machine learning model.3
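The analysis in this post is done in R (see footnote 2), but the modeling setup can be sketched in a few lines of Python with scikit-learn. The data below is synthetic — the column definitions and coefficients are invented stand-ins for the FIFA predictors, not the actual Kaggle data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the four predictors: transfer value,
# position (integer-encoded), age, and overall ability.
X = np.column_stack([
    rng.gamma(2.0, 5.0, n),     # transfer value (arbitrary units)
    rng.integers(0, 4, n),      # position, encoded as 0-3
    rng.integers(17, 38, n),    # age
    rng.integers(50, 95, n),    # ability rating
])
# Wages are right-skewed, so the model targets log(wage).
wage = np.exp(0.05 * X[:, 3] + 0.02 * X[:, 0] + rng.normal(0.0, 0.3, n))
y = np.log(wage)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
```

The log transformation here mirrors the transformation applied to the actual wage data.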

Distribution of soccer players' wages

Permutation Importance

Permutation importance is a technique for identifying the variables with the greatest predictive power. It is particularly useful in scenarios where we have a large set of predictors and want to understand which of them contribute most to the predictive accuracy of a model. Permutation importance was popularized by Leo Breiman in his classic paper on random forests. To calculate the permutation importance of a predictor, we take the following steps:

  1. Randomly permute the values of the predictor.

  2. Replace the original predictor with this permuted version.

  3. Generate predictions with this new dataset.

  4. Calculate the difference in a chosen predictive accuracy metric before and after permuting.4

The logic underlying this procedure is simple: if a predictor is important, then randomly permuting its values should noticeably degrade the predictive accuracy of the model.
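The four steps above can be implemented by hand in a few lines. Here is a minimal Python sketch on synthetic data (the post's own implementation uses R's mlr package), with mean squared error as the accuracy metric from step 4:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 300)  # only column 0 carries signal

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def permutation_importance(model, X, y, rng):
    # Importance of a predictor = increase in MSE after permuting its values.
    baseline = mean_squared_error(y, model.predict(X))
    importances = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])                 # steps 1-2
        permuted_mse = mean_squared_error(y, model.predict(X_perm))  # step 3
        importances.append(permuted_mse - baseline)                  # step 4
    return np.array(importances)

importances = permutation_importance(model, X, y, rng)
# Column 0, the only informative predictor, should dominate.
```

In practice the permutation in step 1 is often repeated several times and the resulting importances averaged, since a single shuffle can be noisy.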

Permutation importance for players' wages

I plot the permutation importance of each predictor above. From this plot, we can see that a player's ability and transfer value matter substantially more to the predictive performance of the model than a player's age and position. This suggests that, among the predictors included in this model, ability and transfer value are the primary drivers of the salary received by a player.

Partial Dependence

Once the important predictors in a model have been identified, we may want to understand more about the respective relationships between these variables and the outcome of interest. Partial dependence is a technique which enables us to understand the relationship between the outcome and a predictor across different values of the predictor while accounting for the influence of other predictors. The calculation of the partial dependence between a predictor and the outcome involves the following steps:

  1. Generate a list of the unique values of the predictor.

  2. For each of the unique values in the list: assign this value to all observations for the predictor of interest, generate predictions with this new dataset, and calculate the mean/quantiles of the generated predictions.

  3. Plot the calculated means/quantiles against the corresponding unique values of the predictor.

To make this clearer, imagine our dataset has two observations and we are interested in the partial dependence between players’ ability and wages.

Name               Value  Ability  Position  Age
Cristiano Ronaldo  95     94       Attacker  32
Lionel Messi       105    93       Attacker  30

Since there are two unique values of player ability (93 and 94), we would generate predictions using the following two datasets:

Name               Value  Ability  Position  Age
Cristiano Ronaldo  95     94       Attacker  32
Lionel Messi       105    94       Attacker  30

Name               Value  Ability  Position  Age
Cristiano Ronaldo  95     93       Attacker  32
Lionel Messi       105    93       Attacker  30
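Putting the steps together, the calculation might look like the following Python sketch (again with synthetic stand-ins for ability and age rather than the actual data, and averaging predictions rather than computing quantiles):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(50, 95, n),   # "ability" stand-in
    rng.uniform(17, 38, n),   # "age" stand-in
])
y = 0.1 * X[:, 0] + rng.normal(0.0, 0.2, n)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence(model, X, col):
    values = np.unique(X[:, col])          # step 1: unique values
    means = []
    for v in values:                       # step 2: assign each value to
        X_mod = X.copy()                   # every observation, predict,
        X_mod[:, col] = v                  # and average the predictions
        means.append(model.predict(X_mod).mean())
    return values, np.array(means)         # step 3: plot means vs. values

vals, pd_means = partial_dependence(model, X, col=0)
```

For a continuous predictor with many unique values, implementations often evaluate a grid of values instead of every unique value to keep the computation cheap.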

I present the partial dependence plot for player ability using the actual data below. From this plot we can see that there appears to be a non-linear relationship between player ability and wages. Interestingly, it seems that players receive substantially higher wages once they reach an ability level of around 70.

Partial dependence plot for player ability

From the plot below, we can see that there also appears to be a non-linear relationship between a player's transfer value and their wage. Specifically, the partial dependence plot suggests that players with a value of approximately £15m or more receive significantly higher salaries than less valuable players.

Partial dependence plot for transfer value

LIME (Local Interpretable Model-Agnostic Explanations)

LIME is a technique which helps us understand the specific prediction for a given observation. This contrasts with the techniques discussed above, which are typically calculated using an entire dataset or a sizeable subset of the data.5 LIME involves the following steps:

  1. Select the observation we would like to understand more about.

  2. Generate a sample of perturbed observations using information about the distributions of predictors from the training data and get predictions for these observations using the estimated model.

  3. Fit a weighted sparse linear model to this perturbed dataset where perturbed observations that are more similar to the selected observation receive larger weights.6
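A stripped-down version of these three steps might look like the following Python sketch. The Gaussian proximity kernel, the kernel width, and the Lasso penalty are illustrative assumptions on my part — the post itself uses the R lime package, whose defaults differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(0.0, 0.1, 300)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

def lime_explain(model, X_train, x0, n_samples=1000, kernel_width=1.0):
    rng = np.random.default_rng(1)
    scale = X_train.std(axis=0)
    # Step 2: perturb around x0 using the per-column spread of the training data.
    Z = x0 + rng.normal(0.0, scale, size=(n_samples, x0.size))
    preds = model.predict(Z)
    # Step 3a: weight perturbed points by proximity to x0 (Gaussian kernel).
    distance = np.linalg.norm((Z - x0) / scale, axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Step 3b: fit a sparse (L1-penalized) linear model to the weighted sample.
    explainer = Lasso(alpha=0.01)
    explainer.fit(Z, preds, sample_weight=weights)
    return explainer.coef_

coefficients = lime_explain(model, X_train, X_train[0])  # step 1: pick an observation
```

The returned coefficients are the "weights" discussed below: a local linear approximation of the model around the selected observation.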

LIME explanations for players

I present the output of this procedure for six players above. From this plot, we can see how each of the predictors in the model influences the prediction for a given player. Weights in this context represent the estimated coefficients from the linear model, with LIME breaking predictors into discrete bins to ease interpretation. For all players in this sample, the salary of a player seems to be primarily influenced by their ability and transfer value, which supports the results of the earlier permutation importance analysis.

No longer a black box?

The techniques presented in this post can hopefully be utilized in a wide variety of contexts to shed light on the behavior of machine learning models and provide users with a better understanding of what influences predictions.

  1. My thinking on this topic has been strongly influenced by conversations with my former Penn State colleague Zach Jones. Zach wrote his dissertation on techniques for interpreting statistical learning models and I encourage you to check out his work.

  2. For this exercise I used the R packages mlr and lime, which have nice implementations of these methods. To my knowledge Python's excellent scikit-learn library does not have analogous flexible functions.

  3. Other methods that could have been used here include linear regression, boosting, and neural networks. The techniques discussed in this piece can be utilized in conjunction with all of these models. 

  4. For classification tasks we often use classification accuracy as our metric, while mean squared error is a popular choice for regression tasks. 

  5. In Breiman's paper, permutation importance is calculated using out-of-bag data. In contrast, other implementations of permutation importance and partial dependence use the training data.

  6. Examples of sparse models are stepwise regression and LASSO regression.