January 9, 2020
Automated Determination of the Validity Domain for Approximation Models
It is important to keep in mind that making predictions with an approximation model is reasonable only within the domain of the training dataset. Extrapolation, which goes beyond this domain, is strongly undesirable and may lead to serious errors. However, it is always possible to forget about the model's bounding box and produce unreliable results.
The solution is to limit the input design space to the training sample's bounds and output NaN values for input points that lie outside this domain. To this end, boundaries of different types may be applied to the training sample to define the function domain. The 6.16 release adds a dedicated option to set up this behavior: it is called InputDomainType.
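The idea can be illustrated outside the product with a few lines of NumPy. This is only a minimal sketch of the concept, not the implementation behind InputDomainType: the function name `predict_with_box_domain` and the `predict` callable are hypothetical, and `predict` is assumed to map an (n, d) array of inputs to n outputs.

```python
import numpy as np

def predict_with_box_domain(predict, x_train, x_new):
    """Evaluate `predict` on x_new, returning NaN outside the training bounding box.

    `predict` is a hypothetical callable: (n, d) array -> (n,) array.
    """
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.atleast_2d(np.asarray(x_new, dtype=float))

    # Bounding box of the training sample, one interval per input.
    lower = x_train.min(axis=0)
    upper = x_train.max(axis=0)

    # A point is inside only if every coordinate is within its interval.
    inside = np.all((x_new >= lower) & (x_new <= upper), axis=1)

    # Start from NaN everywhere and fill in predictions for valid points only.
    y = np.full(x_new.shape[0], np.nan)
    if inside.any():
        y[inside] = predict(x_new[inside])
    return y

# Toy usage: the "model" simply sums its inputs.
x_train = [[0.0, 0.0], [1.0, 2.0]]
y = predict_with_box_domain(lambda x: x.sum(axis=1), x_train,
                            [[0.5, 1.0], [1.5, 1.0]])
print(y)  # first point is inside the box, second is outside: [1.5 nan]
```

Returning NaN rather than raising an error keeps batch predictions aligned with their inputs, so invalid points are easy to filter afterwards.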
The InputDomainType option can be combined with NaN prediction, which handles NaN values in the training set. The feature is supported both by the ApproxBuilder block and by the predictive modeling tools in Analyze. In this tech tip, we describe how to use this option.
In the SmartSelection mode, you can set the domain using the Advanced option… hint (Fig. 1).
Fig. 1. Feature location
The input domain type can be:
- Unbound (default) – does not limit the input domain; extrapolated output values are generated for any input point.
- Box – crops the domain to the training sample’s bounding box and predicts NaN values outside this hypercube.
- Auto – crops the domain to the intersection of the training sample’s bounding box and the region bounded by an ellipsoid that envelops the training sample, in order to handle samples with arbitrarily scattered points.
If an input point is outside the defined area (Box or Auto), the model returns a NaN value. Let’s consider the following example. Suppose that training points can be generated only in constrained areas, which results in the design of experiments shown in Fig. 2 (this is a contrived example). The white region in Fig. 2 lies outside the validity domain.
Fig. 2. Training area
If we train a model in the Box mode and make predictions, we will obtain the following result (Fig. 3).
Fig. 3. Predictions in Box mode
It can be observed that the input domain is the box spanned by the training points, and predictions are available in the whole rectangle, including the areas that contain no training points.
The Auto mode reduces the prediction area further by applying an ellipsoid in addition to the box (Fig. 4). Outside the colored region, the model produces NaNs.
Fig. 4. Predictions in the Auto mode
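The intersection of a bounding box and an enveloping ellipsoid can be sketched as follows. The way the product constructs its ellipsoid is not described here, so this example makes an assumption: the ellipsoid is centered at the sample mean, shaped by the sample covariance, and scaled so that the farthest training point lies on its surface. The names `fit_auto_domain` and `in_auto_domain` are hypothetical.

```python
import numpy as np

def fit_auto_domain(x_train):
    """Fit a box-and-ellipsoid domain to the training inputs (illustrative only)."""
    x = np.asarray(x_train, dtype=float)
    lower, upper = x.min(axis=0), x.max(axis=0)

    # Assumption: a covariance-shaped ellipsoid centered at the sample mean.
    center = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    precision = np.linalg.pinv(cov)  # pseudo-inverse tolerates degenerate samples

    # Scale the ellipsoid so that it just envelops every training point.
    d = x - center
    radius2 = np.einsum("ij,jk,ik->i", d, precision, d).max()
    return lower, upper, center, precision, radius2

def in_auto_domain(domain, x_new):
    """Return a boolean mask: True where a point is inside both box and ellipsoid."""
    lower, upper, center, precision, radius2 = domain
    x = np.atleast_2d(np.asarray(x_new, dtype=float))
    in_box = np.all((x >= lower) & (x <= upper), axis=1)
    d = x - center
    dist2 = np.einsum("ij,jk,ik->i", d, precision, d)
    # Small tolerance so training points on the surface count as inside.
    in_ellipsoid = dist2 <= radius2 * (1.0 + 1e-9)
    return in_box & in_ellipsoid

# Toy usage: training points scattered along a diagonal band.
t = np.linspace(0.0, 1.0, 21)
x_train = np.vstack([np.column_stack([t, t + 0.05]),
                     np.column_stack([t, t - 0.05])])
domain = fit_auto_domain(x_train)
mask = in_auto_domain(domain, [[0.5, 0.5], [0.0, 1.0], [2.0, 2.0]])
```

In this toy sample the point (0, 1) lies inside the bounding box but far from the diagonal band, so the ellipsoid rejects it; a box-only check would accept it. Predictions for points where the mask is False would then be replaced with NaN.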
We can use the InputDomainType feature with NaN predictions simultaneously. Suppose that the training sample contains some NaN points (Fig. 5).
Fig. 5. Training sample with NaN points
The prediction results for both modes are shown in Fig. 6.
Fig. 6. Predictions result in the Box (top) and Auto (bottom) modes with NaN points in the training dataset
In both cases, NaN values are processed in the usual way: they are detected, and the regions around them are reconstructed and used for NaN prediction.
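How the regions around NaN points are reconstructed is not detailed in this tech tip. As a rough illustration of the general idea only, the sketch below uses a simple stand-in, a nearest-neighbor rule: a new point is treated as belonging to a NaN region when its closest training point has a NaN output. The function name `predict_with_nan_regions` and the `predict` callable are hypothetical.

```python
import numpy as np

def predict_with_nan_regions(predict, x_train, y_train, x_new):
    """Return NaN where the nearest training point has a NaN output.

    A nearest-neighbor stand-in for NaN-region detection; the actual
    method used by the product is not described in the source.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_2d(np.asarray(x_new, dtype=float))

    # Squared distance from every new point to every training point.
    diff = x_new[:, None, :] - x_train[None, :, :]
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)

    # A point falls in a NaN region if its nearest neighbor's output is NaN.
    nan_region = np.isnan(y_train[nearest])

    y = np.full(x_new.shape[0], np.nan)
    if (~nan_region).any():
        y[~nan_region] = predict(x_new[~nan_region])
    return y

# Toy usage: one training point out of four has a NaN output.
x_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, np.nan, 3.0]
y = predict_with_nan_regions(lambda x: x[:, 0], x_train, y_train,
                             [[0.2], [1.9], [2.9]])
```

Such a NaN-region mask can be combined with a Box or Auto domain mask using a logical AND, so a prediction is returned only where both checks pass.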
You can find the analytical form of the constraints created by the InputDomainType option on the Constraints tab in the model details (Fig. 7). They can also be copied to the clipboard if necessary.
Fig. 7. Analytical form of constraints in Box (top) and Auto (bottom) modes including constraints around NaN values
In this tech tip, we considered the InputDomainType feature, which limits the input design space to avoid incorrect function predictions outside the training domain. It is especially useful when the training set is not rectangular. The feature can be used together with NaN prediction.
By Yulia Bogdanova, Application Engineer, DATADVANCE