August 23, 2019

The role of validation type in model accuracy estimation

Introduction

pSeven provides a variety of tools for Predictive Modeling. The modeling tools are powered by the Generic Tool for Approximation, which is used in the ApproxBuilder and Approximation model blocks, and also in Model Builder and Model Validator tools in pSeven Analyze. While the ApproxBuilder block is intended to automate model training, Analyze tools are designed for interactive usage and do not require creating workflows.

Understanding the basics of validation types is crucial when assessing approximation model quality, for example, with the dedicated Model validator tool. This tool estimates model quality and compares models in Analyze mode. It allows testing models against reference data or using cross-validation, and finding the most accurate model by analyzing error plots and statistics.

The Model validator can show two kinds of plots:

Scatter plot directly compares reference sample outputs with model predictions (Fig. 1).

Fig. 1. Scatter plot in the Model validator

Quantile plot (default) is useful to analyze error distribution (Fig. 2).

Fig. 2. Quantile plot in the Model validator

On the quantile plot, each point shows the fraction of sample points for which the error is lower than the value on the horizontal axis. A steeper curve is better: it means that the error is low for a larger fraction of points, possibly with a few outliers that form a long “tail” at the top.
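The curve described above can be reproduced outside pSeven as an empirical distribution of absolute errors. The following is a minimal sketch, assuming NumPy is available; the function name error_quantile_curve and the sample values are hypothetical, and the sketch counts errors that do not exceed each value, which may differ in detail from what the Model validator draws.

```python
import numpy as np

def error_quantile_curve(reference, predicted):
    """Return sorted absolute errors and, for each one, the fraction of
    sample points whose error does not exceed it."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = np.sort(np.abs(reference - predicted))
    fractions = np.arange(1, errors.size + 1) / errors.size
    return errors, fractions

# Hypothetical reference outputs and model predictions.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.8, 3.0, 4.5]

errors, fractions = error_quantile_curve(reference, predicted)
for e, f in zip(errors, fractions):
    print(f"error <= {e:.2f}: {f:.0%} of points")
```

Plotting fractions against errors gives the quantile curve: the faster it rises towards 100%, the smaller the errors for most of the sample.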

Validation type specifies the method used to estimate model accuracy. Quality metrics can be calculated using the training sample, internal validation, or test data.

The Sample selector changes the source of reference data used for model validation:

If “training” is selected, reference data is the model’s training sample.
If “test” is selected, reference data comes from the test sample.
If “internal validation” is selected, both reference and prediction data are read from the model’s internal validation results.

Validation type is the subject of discussion in this Tech Tip.

Validation on training sample

A training sample is the sample used for model training. If the training sample is also used as reference data, the approximation errors are naturally expected to be small. This fact is often misinterpreted. In many cases, low errors on the training sample are a result of overfitting: the model fails to fit an “unseen” dataset, or predicts its outputs with lower accuracy. In other words, the model has learned patterns specific to the training data that are irrelevant to other data.

Thus, computing model accuracy on the training set is very cheap, but the accuracy obtained this way may be greatly overestimated.
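The effect is easy to see with a deliberately overfitted model. The sketch below is an illustration only and does not use any pSeven technique: it assumes NumPy, synthetic noisy data, and a hypothetical nearest-neighbor predictor that simply memorizes the training sample.

```python
import numpy as np

def nearest_neighbor_predict(x_train, y_train, x_query):
    """Predict each query output as the output of the closest training input."""
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def rms_error(reference, predicted):
    """Root-mean-square error between reference outputs and predictions."""
    return float(np.sqrt(np.mean((reference - predicted) ** 2)))

rng = np.random.default_rng(0)

# Synthetic noisy samples of the same function (assumed for illustration).
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.025, 0.975, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, x_test.size)

train_rms = rms_error(y_train, nearest_neighbor_predict(x_train, y_train, x_train))
test_rms = rms_error(y_test, nearest_neighbor_predict(x_train, y_train, x_test))

print(f"RMS on training sample: {train_rms:.4f}")
print(f"RMS on unseen points:   {test_rms:.4f}")
```

The memorizing model reproduces every training point exactly, so its training error is zero, yet its error on points it has not seen is clearly larger.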

Validation on test sample

A test sample is a dataset whose points do not participate in model building. The test sample is independent of the training dataset but follows the same probability distribution as the training sample (typically, we are talking about a uniform design of experiments). With this validation method, the tool calculates predictions for the given test inputs and compares them with the given test outputs by computing a standard set of error metrics. If the test sample is used as reference data, the model typically shows higher errors: it is fit to the training sample points, and hence its predictions for test points are less accurate. If the model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place.

Thus, validation on the test sample is more reliable and informative. However, keep in mind that both the test sample and the training sample should properly represent the domain of interest in the design space.

Independent test data is often unavailable. In this case, it’s possible to split the existing training sample into train/test subsets.

In Analyze, the “Split data…” feature is available in the Data series pane menu. It allows you to set the splitting and technique parameters. The configuration dialog is shown in Fig. 3.

Fig. 3. Splitting the sample into train/test subsets

This option allows splitting the sample into a training set and an independent test set by percentage. The higher the percentage assigned to the training sample, the better for model accuracy, but validation of the trained model becomes less informative. Conversely, a higher percentage for the test set gives more accurate estimates of model quality, but a model built on a small training sample can give inaccurate predictions.

By default, the split proportion is 80-20% (80% training set – 20% test set). It’s also possible to select a splitting method: CART (variance-based), DUPLEX (distance-based) or Random. Unlike the Random technique, CART and DUPLEX apply special approaches to preserve the structure of the sample.
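For readers who prepare data outside Analyze, the Random technique with the default 80-20% proportion can be approximated as follows. This is a minimal sketch assuming NumPy; the function random_split and its parameters are hypothetical, and the CART and DUPLEX techniques are not reproduced here.

```python
import numpy as np

def random_split(inputs, outputs, train_fraction=0.8, seed=0):
    """Randomly split a sample into training and test subsets."""
    inputs = np.asarray(inputs)
    outputs = np.asarray(outputs)
    order = np.random.default_rng(seed).permutation(len(inputs))
    n_train = int(round(train_fraction * len(inputs)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (inputs[train_idx], outputs[train_idx]), (inputs[test_idx], outputs[test_idx])

# Hypothetical sample: 50 points with one input and one output.
x = np.arange(50, dtype=float).reshape(-1, 1)
y = x[:, 0] ** 2

(train_x, train_y), (test_x, test_y) = random_split(x, y)
print(len(train_x), len(test_x))  # 40 10
```

Every point ends up in exactly one of the two subsets, so the test subset stays independent of the data used for training.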

Internal validation

The internal validation (IV) procedure provides an estimate of the expected overall accuracy of the approximation algorithm. This estimate is obtained by cross-validating the algorithm on the training data. Cross-validation is a well-established way to statistically assess model accuracy using only the given training set. However, it should be stressed that it does not directly estimate the predictive power of the final model. The purpose of cross-validation is rather to assess the efficiency of the approximation algorithm on various subsets of the available data, assuming that the conclusions can be extended to observations from the whole design space and to a final model constructed on all the data.

The algorithm splits the whole training set into subsets, uses one subset as the test set and the remainder as the training set, repeats this for each subset, and averages the resulting errors. If there is no test data, internal validation gives a more reliable result than validation on the training sample.
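The procedure above can be sketched in a few lines. This is a generic cross-validation loop, assuming NumPy, and not the internal validation code of pSeven; the names cross_validation_rms, fit_line and predict_line are hypothetical, and a straight-line least-squares fit stands in for the approximation technique.

```python
import numpy as np

def cross_validation_rms(x, y, fit, predict, subset_count=5, seed=0):
    """Average the hold-out RMS error over subset_count train/test rounds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, subset_count)
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Train on all subsets except the i-th one, then test on the i-th.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(x[train], y[train])
        residuals = y[held_out] - predict(model, x[held_out])
        fold_errors.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(fold_errors))

def fit_line(x, y):
    """Stand-in model: least-squares straight line."""
    return np.polyfit(x, y, deg=1)

def predict_line(coeffs, x):
    return np.polyval(coeffs, x)

x = np.linspace(0.0, 1.0, 30)
linear_cv = cross_validation_rms(x, 3.0 * x + 1.0, fit_line, predict_line)
quadratic_cv = cross_validation_rms(x, x ** 2, fit_line, predict_line)
print(f"linear data: {linear_cv:.2e}, quadratic data: {quadratic_cv:.2e}")
```

The straight line cross-validates almost perfectly on linear data and noticeably worse on quadratic data, which is the kind of judgment about the algorithm, rather than about one particular model, that internal validation is meant to give.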

By default, cross-validation is also used internally by the Smart training procedure (see “Validation type in SmartSelection mode” section below). There is also a way to set a test sample to estimate the quality of intermediate models.

Comparison

This section compares the three validation types. Fig. 4 shows scatter plots for the same output. The closer the points lie to the diagonal of a scatter plot, the smaller the errors between the reference sample outputs and the model predictions.

Fig. 4. Scatter plots with different validation types

Fig. 5 shows quantile plots for validation on the test and training samples. As mentioned above, a steeper curve is better on quantile plots.

Fig. 5. Quantile plots

The table below (Table 1) contains prediction error metrics. Detailed information about these metrics can be found in the pSeven documentation. The error metrics obtained by validation on the training sample are smaller (and R² is higher) than those obtained by validation on the test sample.

Table 1. Error metrics for different validation types (absolute errors)

Validation type	R²	RMS	Maximum	Q99	Q95	Median	Mean
On training sample	0.9985	0.5765	2.3251	2.3251	1.2151	0.2891	0.4125
On internal validation	0.9980	0.6709	1.2170	1.2170	1.2170	0.5496	0.5810
On test sample	0.9895	1.3447	6.7806	6.7806	2.8609	0.5722	0.8448
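To make the columns of Table 1 concrete, the sketch below computes the same set of metrics from reference outputs and predictions. It assumes NumPy and common textbook definitions; the exact definitions used by pSeven (for example, how the Q99 and Q95 quantiles are computed) are given in its documentation and may differ. The function name error_metrics and the sample values are hypothetical.

```python
import numpy as np

def error_metrics(reference, predicted):
    """Compute R² and absolute-error statistics (textbook definitions)."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(reference - predicted)
    ss_res = np.sum((reference - predicted) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMS": float(np.sqrt(np.mean(abs_err ** 2))),
        "Maximum": float(abs_err.max()),
        # Assumption: NumPy's default (linearly interpolated) quantiles.
        "Q99": float(np.quantile(abs_err, 0.99)),
        "Q95": float(np.quantile(abs_err, 0.95)),
        "Median": float(np.median(abs_err)),
        "Mean": float(abs_err.mean()),
    }

# Hypothetical reference outputs and predictions with one outlier.
metrics = error_metrics([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 7.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Calling the function once per reference sample (training, internal validation, test) yields one row of such a table per validation type.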

Validation type in SmartSelection mode

Smart training is a procedure that automatically chooses an approximation technique and tunes values of its options to obtain the most accurate model for a given problem. It’s designed to enable non-experts to build accurate surrogate models easily.

One of the steps required to build the model is preparing a sample for calculating the quality metrics. If a test sample is not given, the training sample is used with the cross-validation procedure to calculate quality metrics. All internal validation options have default values. However, you can change them by clicking the “Plus” button and selecting the “Internal validation…” item (Fig. 6).

Fig. 6. Dialog of Model builder and expanded pane of hints for SmartSelection mode

In the dialog that appears (Fig. 7), it can be helpful to change:

SubsetCount – the number of subsets into which the original training set is divided. The default (0) means that the number of subsets is set automatically to min(10, |S|), where |S| is the sample size.
TrainingCount – the number of training/validation sessions. The default (0) means that the number of sessions is set automatically based on the sample size.

Fig. 7. Dialog with internal validation settings in SmartSelection mode
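The default rule for SubsetCount stated above can be written out as a one-line check. The helper below is a hypothetical illustration in Python, not part of the pSeven API; it covers only the SubsetCount rule, because the automatic rule for TrainingCount is not spelled out here.

```python
def resolve_subset_count(subset_count, sample_size):
    """Return the effective number of cross-validation subsets.

    A value of 0 means "automatic": min(10, sample size).
    """
    if subset_count < 0:
        raise ValueError("subset_count must be non-negative")
    if subset_count == 0:
        return min(10, sample_size)
    return subset_count

print(resolve_subset_count(0, 200))  # 10
print(resolve_subset_count(0, 6))    # 6
print(resolve_subset_count(4, 200))  # 4
```

For samples with fewer than 10 points, the automatic setting therefore places each point in its own subset.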

Cross-validation is a time-consuming procedure, so using a test sample considerably reduces the training time. That is why SmartSelection uses the test sample by default if one is available.

However, the validation type can be changed in Smart training. Click the “Validation type…” item located above the “Internal validation…” item (Fig. 6). In the dialog that appears (Fig. 8), you can set the validation type manually: internal validation, an independent test sample (if available), or the split option, which obtains a test sample from the given training dataset. The “Auto” technique uses the test sample if it is available and cross-validation otherwise, as stated above.

Fig. 8. Dialog with validation types in Smart training
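The selection logic described in this section can be summarized as a small decision function. This is a hypothetical sketch in Python: the function name and the option strings are assumptions chosen for illustration and do not correspond to actual pSeven option names.

```python
def resolve_validation_type(requested, has_test_sample):
    """Map the requested setting to the validation that is actually used."""
    allowed = {"auto", "internal validation", "test sample", "split"}
    if requested not in allowed:
        raise ValueError(f"unknown validation type: {requested!r}")
    if requested == "auto":
        # Auto prefers the cheaper test-sample validation when it is possible.
        return "test sample" if has_test_sample else "internal validation"
    if requested == "test sample" and not has_test_sample:
        raise ValueError("a test sample was requested but none is available")
    return requested

print(resolve_validation_type("auto", has_test_sample=True))   # test sample
print(resolve_validation_type("auto", has_test_sample=False))  # internal validation
print(resolve_validation_type("split", has_test_sample=False)) # split
```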

Summary

It’s recommended to use a test sample to validate the model when possible: test sample validation shows the model’s ability to predict outputs for new input values that were not available in training. Training sample validation tends to overestimate model accuracy. Low errors on the training sample (steeper error quantile curves) can actually be a sign of overfitting, especially if the same model shows significantly higher errors on a test sample. If holdout test data is not available, it’s recommended to switch to internal validation: its data is obtained from the cross-validation tests that run when building the model.