October 13, 2016
“Build. Validate. Explore.” - Part 1: SmartSelection in Predictive Modeling Toolkit
What is SmartSelection?
SmartSelection is an intelligent model training technology in pSeven that automatically selects an approximation technique and its options in order to obtain the most accurate approximation model.
SmartSelection lets the user focus on solving a particular task, without delving into the details of approximation techniques and methods, by automating the trial-and-error process of approximation model construction: the complexity of the underlying machine learning algorithms is hidden behind a user-friendly interface.
There is a large set of approximation techniques under the hood: from simple linear regression, quadratic polynomials and splines to Gaussian processes, the original High Dimensional Approximation (HDA) method based on neural networks, and gradient boosted regression trees. Every technique has its own tunable parameters, and each has its own strengths and weaknesses; no single technique is best for all possible data sets (i.e., there is No Free Lunch).
The purpose of SmartSelection is to automate the selection of the model with the best predictive performance by exploring different techniques and optimizing their parameters to minimize the approximation error on cross-validation or on a holdout test set.
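To make the idea concrete, here is a minimal sketch of what such automated selection amounts to. This is not pSeven's actual API; it uses scikit-learn stand-ins for a few of the techniques mentioned above and picks the one with the best cross-validation score:

```python
# Minimal sketch (not pSeven's API) of automated technique selection:
# try several approximation techniques and keep the one with the best
# cross-validation score.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))          # 4 inputs, 200 points
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.01 * rng.normal(size=200)

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "gp": GaussianProcessRegressor(),
    "gbrt": GradientBoostingRegressor(random_state=0),
}

# Score each technique by 5-fold cross-validation R^2 and keep the best.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A real implementation additionally tunes each technique's hyperparameters inside the loop; this sketch only shows the outer technique-selection step.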
Let's consider the simple “Static mixer optimization” example from the pSeven package. When an engineer wants to study a certain process, a design of experiments (DOE) is used to sample data, which then serves as the training sample for an approximation model.
In this example the DOE contains 200 points:
- 4 inputs: ‘Flow temperature’, ‘Pressure drop’, ‘1st flow velocity’, ‘2nd flow velocity’
- 2 outputs: ‘Nozzle angle’, ‘Nozzle diameter’
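A DOE of this shape could be sampled, for instance, with a Latin hypercube. The sketch below uses SciPy rather than pSeven's own DOE tools, and the variable bounds are purely illustrative, not taken from the example project:

```python
# Hypothetical sketch of sampling a 200-point, 4-input DOE with a Latin
# hypercube (bounds are illustrative, not from the mixer example).
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=4, seed=0)      # 4 input variables
unit = sampler.random(n=200)                   # 200 points in [0, 1]^4

# Scale each column to assumed physical bounds:
# temperature [K], pressure drop [Pa], two flow velocities [m/s].
lower = [280.0, 1e4, 0.5, 0.5]
upper = [360.0, 1e5, 5.0, 5.0]
doe = qmc.scale(unit, lower, upper)
print(doe.shape)  # (200, 4)
```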
Input Data Properties
A training sample is the minimum requirement, but additional data, if provided, may improve the quality of the approximation:
- Hold-out test sample for validation (cross-validation is used by default)
- Weights for input points of training sample
- Output noise variance of training sample
- Marks for categorical variables
- Data filters for training and test samples e.g. to remove outliers
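Two of these optional inputs, point weights and a holdout test sample, can be sketched with generic scikit-learn calls (again, this is a stand-in for illustration, not pSeven's interface):

```python
# Sketch of optional input data: per-point weights for the training sample
# (e.g. to down-weight suspected outliers rather than filter them out) and
# a separate holdout test sample for validation. Not pSeven code.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.uniform(size=(200, 4))
y_train = X_train @ [1.0, 2.0, 0.0, -1.0] + 0.05 * rng.normal(size=200)
X_test = rng.uniform(size=(50, 4))               # holdout test sample
y_test = X_test @ [1.0, 2.0, 0.0, -1.0]

# Down-weight points far from the bulk of the data instead of removing them.
weights = np.ones(200)
weights[np.abs(y_train - y_train.mean()) > 2 * y_train.std()] = 0.1

model = LinearRegression().fit(X_train, y_train, sample_weight=weights)
print(round(model.score(X_test, y_test), 3))     # R^2 on the holdout sample
```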
Three groups of high-level hints can be used to express domain knowledge, model requirements and time/quality constraints.
1. Domain knowledge about the data and the underlying process
Any additional prior knowledge narrows the search space of possible configurations, which reduces training time and may also influence the predictive performance of the final model.
- process is nonlinear or discontinuous
- input data is noisy
- there is a dependency or correlation between outputs
2. Requirements for the model properties
- require the model to be a smooth (differentiable) approximation function
- require the ability to evaluate the uncertainty of the predictions
- predict NaN values in regions close to points for which the training sample contained NaN outputs (an invalid design point is marked as NaN)
- require the model to fit the training sample exactly
3. Time constraints and quality management: balance the time/quality tradeoff
- for this example, define acceptable quality with the metric R2 = 0.99 on cross-validation
- limit the time for the selection process: e.g. set up a nightly experiment
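The interplay of these two hints can be sketched as a selection loop that stops as soon as a candidate reaches the target cross-validation R2, or when the time budget is exhausted. The techniques, threshold and budget below are illustrative, not pSeven internals:

```python
# Illustrative sketch of the time/quality tradeoff: stop the selection loop
# when a candidate reaches the target cross-validation R^2, or when the
# wall-clock budget runs out. Not pSeven code.
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 4))
y = X @ [1.0, -2.0, 0.5, 0.0]

TARGET_R2 = 0.99            # acceptable quality, as in the example above
BUDGET_S = 8 * 3600         # "nightly experiment": 8-hour wall-clock budget
deadline = time.monotonic() + BUDGET_S

best_name, best_r2 = None, -np.inf
for name, model in [("linear", LinearRegression()),
                    ("gbrt", GradientBoostingRegressor(random_state=0))]:
    if time.monotonic() > deadline:
        break                                  # out of time: keep best so far
    r2 = cross_val_score(model, X, y, cv=5).mean()
    if r2 > best_r2:
        best_name, best_r2 = name, r2
    if best_r2 >= TARGET_R2:
        break                                  # quality target reached early
print(best_name, round(best_r2, 3))
```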
The user interface shows the declared hints as tags.
The SmartSelection algorithm then starts the selection process with the given knowledge, requirements and time/quality tradeoff.
The quality of the approximation can be measured in three different ways:
- Using Internal Validation
- By splitting the given training sample into train/test subsets
- Using an additional holdout test sample
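These three validation options can be sketched with generic scikit-learn tools (as before, a stand-in for illustration, not pSeven's API): k-fold cross-validation on the training sample alone, a single train/test split of that sample, and a separate holdout test sample.

```python
# Three ways to estimate model quality (generic scikit-learn stand-in):
# (1) k-fold cross-validation, (2) a train/test split of the training
# sample, (3) a separate holdout test sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 4))
y = X @ [2.0, 1.0, -1.0, 0.5] + 0.1 * rng.normal(size=200)
X_holdout = rng.uniform(size=(50, 4))
y_holdout = X_holdout @ [2.0, 1.0, -1.0, 0.5]

model = LinearRegression()

r2_cv = cross_val_score(model, X, y, cv=5).mean()                  # (1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)    # (2)
r2_split = model.fit(X_tr, y_tr).score(X_te, y_te)
r2_holdout = model.fit(X, y).score(X_holdout, y_holdout)           # (3)
print(round(r2_cv, 3), round(r2_split, 3), round(r2_holdout, 3))
```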
In the case of vector (multidimensional) output, an optimal model is constructed for each output variable.
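Per-output selection simply repeats the technique search for each output column, so that, say, a linear output and a strongly nonlinear one can end up with different techniques. A hedged sketch with scikit-learn stand-ins:

```python
# Sketch of per-output model selection for a 2-output (vector) response:
# pick the best technique independently for each output column.
# Generic scikit-learn stand-in, not pSeven's API.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 4))
Y = np.column_stack([X @ [1.0, 2.0, 0.0, -1.0],       # output 1: linear
                     np.sin(3 * X[:, 0]) * X[:, 1]])  # output 2: nonlinear

models = {}
for j in range(Y.shape[1]):
    candidates = [LinearRegression(), GradientBoostingRegressor(random_state=0)]
    # Keep whichever technique cross-validates best on this output alone.
    models[j] = max(candidates,
                    key=lambda m: cross_val_score(m, X, Y[:, j], cv=5).mean())
    models[j].fit(X, Y[:, j])
print(type(models[0]).__name__, type(models[1]).__name__)
```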
Manual Mode vs. SmartSelection
For advanced users who want to get closer to the core approximation techniques, with all their knobs and switches, a Manual mode is available. Compared with manual tuning, however, SmartSelection consistently gives similar or, in most cases, better approximation results.
After the approximation model is built, it can be validated on new data and compared with other models using Model Validator, additionally smoothed, evaluated, and exported in different formats (C, Octave, FMI, etc.).
In “Build. Validate. Explore.” - Part 2 we will describe Model Validator, an interactive analysis tool that lets you estimate a model's quality (i.e. its predictive performance) and compare different models. It allows you to test models against reference data and find the most accurate model using error plots and statistics.
In “Build. Validate. Explore.” - Part 3 we’ll see how to “look inside the model” and explore its behaviour with an interactive visual tool called Model Explorer. Stay tuned!
By Dennis Shilko, Senior Software Engineer, DATADVANCE