What Should My Train/Test Split Be?

A key step in building a machine learning model is training it on one part of the dataset and testing it on another. Without training, the model parameters do not converge on useful values, and the model is useless. Without testing the trained model, there is no way to know how good it actually is, which makes it equally useless.

When it comes to testing and training, there are several best practices to follow in order to ensure an accurate evaluation of the model performance. The process involves splitting the dataset into a training set (to train the model) and a test set (to evaluate the model performance). How the dataset is split determines how far you can trust the test set evaluation metric.

Perhaps the most important best practice is that you should never include the test set in the training set. This may seem obvious, but doing so will make the model appear more accurate than it actually is.

The second best practice is to make sure that the test set is an accurate statistical representation of the dataset as a whole. There are several ways to do so, which we will discuss later in the article. If the test set and training set do not have similar distributions in the space defined by the features, the measured accuracy will be systematically off (typically too low), by an amount that reflects the difference between the two distributions.

The third best practice, which is related to the second, is to ensure that the test set is large enough to yield statistically meaningful results. If the test set is too small, the evaluation metric will not have enough points to converge to a value that reflects the true model performance. In other words, the test set will not have enough points to properly reflect the distribution of the dataset.
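To make the point about test set size concrete, here is a minimal sketch. It does not use the housing data: the array `squared_errors` is synthetic and simply stands in for the per-sample squared errors of some already-trained model, and the helper `mse_spread` is a hypothetical name. The sketch draws many random test sets of a given size and measures how much the resulting MSE estimate varies from draw to draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic per-sample squared errors standing in for a real model's errors.
squared_errors = rng.normal(loc=0.0, scale=1.0, size=20_000) ** 2


def mse_spread(test_size, n_repeats=500):
    """Standard deviation of the MSE estimate across repeated random test sets."""
    estimates = [
        rng.choice(squared_errors, size=test_size, replace=False).mean()
        for _ in range(n_repeats)
    ]
    return float(np.std(estimates))


spread_small = mse_spread(20)
spread_large = mse_spread(2_000)

print(f"Spread of MSE with 20 test points:    {spread_small:.3f}")
print(f"Spread of MSE with 2,000 test points: {spread_large:.3f}")
```

The spread should be noticeably smaller for the larger test set, which is why a tiny test set can give a misleading picture of model performance.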

Fulfilling these conditions starts with correctly splitting the dataset between test and training. In this article, I will show you how to create a model. Then, I’ll demonstrate how changing the test and training split affects the model evaluation metric as well as introduce other metrics that you should also consider. 

Getting Started

All of the code in this article can be found in my GitLab repository. To follow along, you need to have a version of Python installed. I use the Anaconda Navigator, but the code can be used with any Python distribution of 3.7 or greater. To install Anaconda, go here.  

To begin, we import all of the packages that we’ll use:

<pre><code class="Python">import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time as clock

from sklearn.datasets import fetch_california_housing
import sklearn.datasets as DS

from sklearn.model_selection import train_test_split
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error</code></pre>

All set? Let’s go.

Loading Some Data & Building a Model

The first thing we need is some data. The scikit-learn package has pre-cleaned datasets that we can load. I use the California housing dataset:

<pre><code class="Python">data = DS.fetch_california_housing()
df = pd.DataFrame(
    np.column_stack([data.data, data.target]),
    columns=np.r_[data.feature_names, ['MEDV']],
)  # append .sample(frac=0.1) here to work with a 10% subsample

df</code></pre>

You should see an output like this:


The dataset has the following features (corresponding to the columns above):

  • MedInc: median income in block
  • HouseAge: median house age in block
  • AveRooms: average number of rooms
  • AveBedrms: average number of bedrooms
  • Population: block population
  • AveOccup: average house occupancy
  • Latitude: house block latitude
  • Longitude: house block longitude

The predefined target variable that we want to predict is the final column, MEDV (median house value for California districts). We split the data into a feature dataset and a target variable. Then, to get a baseline performance, we split it into a training set and a test set with a random 70%/30% split. Randomly selecting observations during the split is a good way to ensure similar test and training set distributions.

<pre><code class="Python">X = df[data.feature_names]
y = df[['MEDV']]
RANDOM_SEED = 123456789
X_train,X_test,y_train,y_test=train_test_split(X, y, train_size = 0.7, random_state=RANDOM_SEED)</code></pre>
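One way to check that a random split really produces similar distributions is to compare each feature in the two sets directly. The sketch below is an illustration under assumptions: it uses a small synthetic DataFrame with hypothetical column names (`feature_a`, `feature_b`) rather than the housing data, and it uses the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp`, which is not part of the workflow in this article. A statistic close to 0 means the two samples look alike.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumption: synthetic data with made-up column names, used only for illustration.
demo = pd.DataFrame({
    "feature_a": rng.normal(size=5_000),
    "feature_b": rng.exponential(size=5_000),
})

demo_train, demo_test = train_test_split(demo, train_size=0.7, random_state=0)

# KS statistic per feature: the largest gap between the two empirical distributions.
ks_stats = {
    col: float(ks_2samp(demo_train[col], demo_test[col]).statistic)
    for col in demo.columns
}

for col, stat in ks_stats.items():
    print(f"{col}: KS statistic = {stat:.3f}")
```

The same loop could be run over `X_train` and `X_test` from the split above.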

Then, we use scikit-learn’s Pipeline class to build the model. The first step is a standard scaler, which rescales each feature to zero mean and unit variance (the features all have different units, and therefore different scales). The second step is kernel ridge regression, which is our model algorithm:


<pre><code class="Python">model = Pipeline(
  [
      ('scaling', StandardScaler()),
      ('krr', KernelRidge(kernel='rbf'))
  ]
)
model.fit(X_train,y_train)

predict_train = model.predict(X_train)
predict_test = model.predict(X_test)
print('Test Score: ',model.score(X_test,y_test))
print('Train Score: ',model.score(X_train,y_train))
print('MSE: ',mean_squared_error(y_test,predict_test))</code></pre>
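To see what the scaling step in the pipeline does, here is a small sketch on toy numbers (the arrays `X_demo_train` and `X_demo_test` are made up for illustration). It shows that `StandardScaler` learns the mean and standard deviation from the data it is fitted on, and then applies those same values to any new data. Inside a `Pipeline`, this fitting happens on the training set only, so no information from the test set leaks into the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumption: toy arrays with two features on very different scales.
X_demo_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_demo_test = np.array([[2.0, 400.0]])

# Fit on the training data only, then transform the test data.
scaler = StandardScaler().fit(X_demo_train)
scaled_test = scaler.transform(X_demo_test)

# The same operation written out by hand, using the training mean and std.
manual_test = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)

print("StandardScaler:", scaled_test)
print("By hand:       ", manual_test)
```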

We use two metrics to evaluate our model. The first is the coefficient of determination, or R² value (which I call Score). It measures the fraction of the variance in the target variable that the model explains from the features. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and the value can even be negative on held-out data when the model does worse than that. We can use this metric to compare the test set and the training set: if the score is similar on both, that’s a good indication that we are sampling our test set correctly and that the two distributions are similar. The second metric that we use is the mean-square error between our predicted values and the true values in the test set. This indicates how well the model actually performs when predicting the target variable.
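Both metrics are easy to compute by hand, which can help build intuition for what they measure. The sketch below uses a few made-up target values (not the housing data) and checks the hand-written formulas against scikit-learn’s `r2_score` and `mean_squared_error`. It also includes a deliberately bad prediction to show that R² can drop below zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Assumption: made-up values chosen only to keep the arithmetic easy to follow.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# MSE = average squared difference between truth and prediction
mse_manual = np.mean((y_true - y_pred) ** 2)

print(f"R^2: by hand {r2_manual:.3f}, scikit-learn {r2_score(y_true, y_pred):.3f}")
print(f"MSE: by hand {mse_manual:.3f}, scikit-learn {mean_squared_error(y_true, y_pred):.3f}")

# A prediction that is worse than always guessing the mean gives a negative R^2.
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(f"R^2 for a reversed prediction: {r2_bad:.1f}")
```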


Another, perhaps more intuitive way to compare the two distributions is to actually plot them. We can do so by running:

<pre><code class="Python">fig,axs = plt.subplots(figsize = (12,8))
axs.scatter(y_train,predict_train , marker = '.' , s = 1)
axs.scatter(y_test,predict_test , marker = '.' , s = 1)
axs.set_xlim(0,5) ; axs.set_ylim(0,5)
axs.set_xlabel('True Target') ; axs.set_ylabel('Predicted Target')
axs.set_aspect('equal')</code></pre>

We plot the true target variable values against the predicted values for both the test and training sets. The output should look something like this:


Clearly, the distributions are right on top of each other. This is a great sign! Now that we have an idea of the values that we should be aiming for, we can vary the train/test split and see how they change. 

Determining the Optimal Train/Test Size

As we mentioned, we can use a few different metrics to evaluate the optimal train/test size:

  • the difference in the score of the model between test and training set
  • the mean square error between the predicted values and the true values
  • how long it takes to train the model

The last metric (the training time) is less important when the dataset is small. However, it becomes significant if your dataset is large and you need additional sampling but have limited computational resources. To compare all three metrics, we vary the training set proportion from 0.01 to 0.99 in 20 steps, which creates 20 different train/test splits. The timer is started just before the pipeline is built and stopped as soon as the model is trained. The time taken to train each model, the difference in model scores, and the mean square error of the test sets are saved to a dataframe called metrics.

<pre><code class="Python">values = []

for n in np.linspace(0.01,0.99,20):
    # n is the fraction of the data used for training
    X_train,X_test,y_train,y_test=train_test_split(X, y, train_size = n , random_state=RANDOM_SEED)

    start_time = clock.time()
    model = Pipeline(
        [
            ('scaling', StandardScaler()),
            ('krr', KernelRidge(kernel='rbf'))
        ]
    )
    model.fit(X_train,y_train)
    time = clock.time() - start_time # time to train model
    print('Trained Run ','{0:3.2f}'.format(n),' in ','{0:3.1f}'.format(time), ' seconds.')

    predict_train = model.predict(X_train)
    predict_test = model.predict(X_test)

    # difference between the test and training R² scores
    score = model.score(X_test,y_test) - model.score(X_train,y_train)
    mse_test = mean_squared_error(y_test,predict_test)

    values.append([n,score,time,mse_test])


names = ['train_size','score','traintime','mse_test']
metrics = pd.DataFrame(values, columns = names)</code></pre>

<pre><code class="Python">metrics.head()</code></pre>

The output should look like this:


To compare the 20 models that we trained, we can plot the difference in score and run time as a function of train size:

<pre><code class="Python">fig, ax1 = plt.subplots(figsize = (8,6))

color = 'tab:red'
ax1.set_xlabel('Train size')
ax1.set_ylabel('Score difference (test - train)', color=color)
ax1.plot(metrics.train_size[:],metrics.score[:], marker = '.', color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()# instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Run Time [s]', color=color)# we already handled the x-label with ax1
ax2.plot(metrics.train_size[:],metrics.traintime[:], marker = '.', color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()# otherwise the right y-label is slightly clipped
plt.show()</code></pre>


And to compare the mean-square error for each model:

<pre><code class="Python">fig, ax1 = plt.subplots(figsize = (8,6))

color = 'tab:red'
ax1.set_xlabel('Train size')
ax1.set_ylabel('MSE', color=color)
ax1.plot(metrics.train_size , metrics.mse_test , marker = '.', color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()# instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Run Time [s]', color=color)# we already handled the x-label with ax1
ax2.plot(metrics.train_size[:],metrics.traintime[:], marker = '.', color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()# otherwise the right y-label is slightly clipped
plt.show()</code></pre>


Note that the run time will vary depending on the machine that you use. With the two plots, it’s easy to see that the difference in score is smallest at a training size of around 60% and that the mean square error is lowest at around 80%. If we go any higher than 80%, the run time blows up, because the cost of fitting a kernel ridge regression grows roughly with the cube of the number of training samples. Additionally, the test set will not be large enough, and it will not reflect the true distribution of the data.
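Instead of reading the values off the plots, you can also pull them out of the dataframe. The sketch below uses a small stand-in table called `demo_metrics` with the same column names as metrics. Its numbers are placeholders made up for illustration and are not the results from this article.

```python
import pandas as pd

# Placeholder numbers for illustration only; they are NOT the results from this article.
demo_metrics = pd.DataFrame({
    "train_size": [0.2, 0.4, 0.6, 0.8],
    "score": [-0.09, -0.05, -0.02, -0.03],    # test score minus train score
    "traintime": [0.5, 2.0, 6.0, 14.0],       # seconds
    "mse_test": [0.40, 0.35, 0.33, 0.32],
})

# Row where the test and train scores are closest to each other.
smallest_gap = demo_metrics.loc[demo_metrics["score"].abs().idxmin()]

# Row with the lowest mean square error on the test set.
lowest_mse = demo_metrics.loc[demo_metrics["mse_test"].idxmin()]

print("Smallest score difference at train size:", smallest_gap["train_size"])
print("Lowest test MSE at train size:", lowest_mse["train_size"])
```

Replacing `demo_metrics` with the metrics dataframe built above would apply the same lookup to the real runs.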

Summary

In this article, we reviewed the best practices for splitting your test and training set, and we demonstrated a methodology to determine the best train/test split for any dataset using just a few metrics. In the dataset that we used, we determined that a train/test split between 60/40 percent and 80/20 percent gave the best balance between the mean-square error, the training time, and the difference in scores between the test and training set. To get the code that we used in this article, you can go to my GitLab repository.
