ApproxBuilder¶
- Group: Modeling
Introduction¶
The ApproxBuilder block trains an approximation model in the SmartSelection or manual mode, using two input samples that contain the values of variables and responses. After training, the block outputs the binary model, the block finish status, and a human-readable model summary. It can also save the model to disk and export it to a variety of formats compatible with third-party programs and tools.
Configuration¶
Usually, configuring an ApproxBuilder block requires the following steps:
- Configure smart or manual training. For details, see section Training Modes.
- Provide the training data and additional information on variables and responses. For details, see section Training Data.
- Obtain the results. The block outputs the approximation model to the model port. It also outputs a human-readable model summary to the info port after successfully training a model.
Optionally, you can:
- Start training with an initial model (see Initial Model).
- Configure the block to save the model to disk (see Saving the Model).
- Set up model export to other formats (see Model Export).
- Output model information and validation data as an HTML report (see Model Report).
Training Modes¶
The ApproxBuilder block supports two training modes: SmartSelection (default) and manual. In the SmartSelection mode, pSeven Desktop automatically selects and tunes the approximation technique in order to obtain the most accurate approximation model. Manual training mode is intended for expert users who want to use specific training options and control the model quality manually.
If needed, you can configure both modes in the same block and then switch between them. The block saves all settings you specify, but only settings for the selected mode apply when it runs. For example, if you select a specific training technique in the manual mode, and then switch to the SmartSelection mode and run the workflow, the technique setting is ignored.
Tip
You can manually specify certain options even in the SmartSelection mode, using the Advanced options… hint. For example, you can use it to specify the approximation techniques allowed in SmartSelection. Option settings specified by this hint apply to SmartSelection only, manual mode ignores them.
To switch between SmartSelection and manual training mode from Run, you can add a workflow parameter that selects the mode: see the Parameter checkbox near the mode switch in the configuration dialog. In particular, this parameter can be useful when you configure a workflow for other users, as they will be able to switch modes without opening the block’s configuration dialog. This scenario assumes that your users do not edit the ApproxBuilder block configuration, so you should also add required hints in the SmartSelection mode and set required options in the manual mode (the block saves all these settings). For a more flexible configuration, you can also add required manual mode options to workflow parameters — then your users will be able to apply custom options from Run after they select the manual training mode.
SmartSelection¶
SmartSelection is a method that automatically selects an approximation technique and its options to obtain the most accurate model for a given problem. You can provide more details about the training data, specify requirements to the model, or training features to control the time-quality trade-off with the help of hints. These hints are divided into three groups accordingly: data features, model requirements, and training features. You can also add custom settings using the Advanced options… hint.
To add a hint, click anywhere inside the Hints pane or use the button to open the list of hints.

For hints that require additional settings, a dialog appears when you select the hint from a list. When you finish adding a hint, it shows up on the Hints pane and becomes disabled (grayed out) in the list.
Data features hints provide additional information about the training data:
- Linear dependency — the dependency specified by the training sample is supposed to be linear.
- Quadratic dependency — the dependency specified by the training sample is supposed to be quadratic.
- Discontinuous dependency — the dependency specified by the training sample is supposed to be discontinuous.
- Dependent outputs… — specifies the type of dependency between different outputs.
The following options are available for this hint:
  - All dependent: different components of the output are treated as possibly dependent.
  - Partial linear: before training, pSeven Desktop will search for linear dependencies between outputs in the training data. If such dependencies are found, pSeven Desktop will train a model which keeps these dependencies.
- Tensor structure — input points in the training sample are placed in a grid-like pattern. This is usually the case when inputs are generated by some factorial technique for design of experiments.
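To illustrate what a linear dependency between outputs means in the Dependent outputs… hint, here is a rough sketch in plain Python (not the pSeven Desktop API; `linear_fit` is a hypothetical helper, and the actual detection logic may differ):

```python
import statistics

def linear_fit(x, y):
    """Least-squares fit y ~ a*x + b (pure Python, for illustration)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    b = my - a * mx
    return a, b

# Two outputs from a training sample, where f2 = 2 * f1 + 1:
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [3.0, 5.0, 7.0, 9.0]

a, b = linear_fit(f1, f2)
# The fit is exact here, so the residual is zero: a linear dependency
# between outputs that a trained model could keep.
residual = max(abs((a * xi + b) - yi) for xi, yi in zip(f1, f2))
```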
Model requirements hints add specific requirements for the trained model:
- Acceptable quality… — the metric and acceptable level of prediction error used to validate the model. With this hint, model training is stopped once the acceptable value of the metric is reached.
- Smoothing — requires the model to support additional smoothing (see Model Smoothing).
- Accuracy evaluation — requires the model to support accuracy evaluation.
- Exact fit — requires the model to fit training data points exactly.
- Gradient — requires the model to support evaluation of output gradients. Gradient support depends on the technique used to train the final model, and many techniques support gradients by default even without the Gradient hint. Adding the Gradient hint prohibits the techniques that cannot support gradients — in particular, it disables the GBRT technique.
- Do not store training sample — training sample should not be stored inside the model (by default, the training sample is stored inside the model). Enabling the hint reduces the size of the model stored on disk, in particular, when tensor techniques are used. You can also use this hint, for example, if you want to transfer the model but not its training data.
- Enable NaN prediction — the built model should predict NaN output values in areas near those points of the training sample that contain NaN output values.
- Do not store internal validation data — the final model should not contain cross-validation data samples (by default, SmartSelection runs cross-validation for the final model and saves model outputs obtained in all cross-validation sessions, so you can review this data later). Adding this hint can reduce the model size. Note that a model trained with this hint can still contain internal validation statistics, if cross-validation is selected as the method to estimate quality of intermediate models (see the Validation type… hint).
Training features hints are used to tune the training process:
Validation type… — specifies the method to estimate quality of intermediate models which SmartSelection creates during training:
- Auto (default): automatically selects one of the following methods, based on data properties and other settings. Prefers validation on a test sample when it is available; otherwise it can automatically split the sample into the train and test subsets, or use internal validation as the least preferred method. This is the default SmartSelection behavior, which is also used if you do not add the Validation type… hint. Note that when SmartSelection automatically switches to internal validation, the model will contain internal validation statistics even if you add the Do not store internal validation data hint (this hint removes only the cross-validation data samples).
- Internal validation: uses cross-validation. With this setting, you can also use the Advanced options… hint to specify the number of data subsets and training sessions in cross-validation.
- Test set: validates models on the test sample data. Test data is required for this method.
- Split training sample to train/test subsets: automatically splits the sample into two subsets, one of which is used to train models, and the other to validate them. You can change size of the training subset using the Training subset ratio slider. This method is similar to using the command to create the train and test samples (see Split Data), and then performing validation on the test set.
Randomized training — enable randomization in certain internal training algorithms. Randomized training can produce models that are slightly different.
Fixed random seed… — use a fixed seed in those training algorithms that support randomized training. This hint makes the behavior of randomized algorithms fully deterministic (controlled by the seed value).
Training time limit… — sets an estimated time that the model builder is allowed to spend training the model. Note that limiting the training time can reduce model quality.
Try output transformations — enables training versions of the model with log transformation applied to the training sample output data. While searching for the optimal training settings, pSeven Desktop will try models with and without the data transformation. Log transformation might improve model accuracy in cases where the training output values are exponentially distributed.
By default this hint applies to all outputs, which may noticeably increase the training time. To enable it only for some outputs, use the output transformation settings in the Data settings pane. In the SmartSelection mode, those settings behave as follows:
- none: prohibits transformation for the corresponding output. In all trained model versions, data transformation does not apply to that output.
- lnp1: requires transformation for the corresponding output. In all trained model versions, log transformation is applied to that output.
- unselected (empty default) or auto: tries log transformation for the corresponding output. In different model versions, different transformation settings are applied to that output.
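As a minimal sketch of what a log-plus-one output transformation and its inverse could look like (plain Python for illustration only; this is not the actual pSeven Desktop implementation):

```python
import math

def lnp1(values):
    """Apply the log transformation y' = ln(1 + y) to output data."""
    return [math.log1p(v) for v in values]

def lnp1_inverse(values):
    """Map transformed values back: y = exp(y') - 1."""
    return [math.expm1(v) for v in values]

outputs = [0.0, 9.0, 99.0]            # exponentially distributed outputs
transformed = lnp1(outputs)           # a model would be trained on these
restored = lnp1_inverse(transformed)  # predictions are mapped back
```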
You can also use the Advanced options… hint to specify some training options manually. Selecting this hint brings up the Advanced options dialog with option settings.

Available options are:
- MaxParallel: sets the maximum number of parallel threads to use for training. The value can be any positive integer. Note that it is not recommended to set it higher than the number of physical CPU cores, otherwise you may experience performance degradation.
- EnabledTechniques: specifies the approximation techniques that you allow SmartSelection to use.
- SubmodelTraining: selects whether to train submodels in parallel or sequentially. If set to FastParallel, enables the non-deterministic fast parallel training in GBRT.
- IVSubsetCount: specifies the number of cross-validation subsets. See GTApprox/IVSubsetCount for details.
- IVSubsetSize: specifies the size of a cross-validation data subset. Provides a convenient way to set up leave-n-out cross-validation (sets n). See GTApprox/IVSubsetSize for details.
- IVTrainingCount: limits the number of training sessions in cross-validation. See GTApprox/IVTrainingCount for details.
- InputDomainType: specifies the input domain for the model (sets the GTApprox/InputDomainType option):
  - Unbound (default) is an unlimited domain.
  - Manual is a box-bound domain specified manually using the sample_lower_bounds and sample_upper_bounds ports (see the example in Training Data).
  - Box is a box-bound domain, which is an intersection of the training sample’s bounding box and the bounds specified manually (if any).
  - Auto limits the domain to the intersection of the box-bound domain (also with respect to the bounds specified manually) and the region bound by an ellipsoid that envelops the sample.
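To see how the IVSubsetCount, IVSubsetSize, and IVTrainingCount options relate, consider this sketch (plain Python; `iv_plan` is a hypothetical helper, and the exact splitting logic in pSeven Desktop may differ):

```python
import math

def iv_plan(sample_size, subset_size=None, subset_count=None):
    """Relate cross-validation subset options to each other (illustration only).

    Setting a subset size n corresponds to leave-n-out validation:
    the training sample is divided into ceil(N / n) subsets, and each
    validation session holds out one subset as test data.
    """
    if subset_size is not None:
        subset_count = math.ceil(sample_size / subset_size)
    elif subset_count is not None:
        subset_size = math.ceil(sample_size / subset_count)
    return subset_count, subset_size

# 100 training points with leave-10-out gives 10 subsets, so a limit on
# the number of training sessions caps them at 10 in this example.
count, size = iv_plan(100, subset_size=10)
```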
You can select more than one hint. If you add hints that are in conflict, the block highlights them in red and prohibits applying such configuration.

In this example, the Advanced options… hint selects two specific training techniques (GBRT and RSM) that do not support exact fit — the requirement added by the Exact fit hint.
A detailed description of SmartSelection features is available in the Smart Training section of the GTApprox guide.
Manual Training¶
To configure the block for manual training, switch the mode selector to Manual and set option values as required. See section Options for the complete options reference.
A guide to manual configuration is available in the Manual Training section of the GTApprox guide.
Training Data¶
ApproxBuilder trains an approximation model using the data received to the x_sample (variables) and f_sample (responses) input ports. See the Sample-Based Approximation tutorial, sections Loading the Sample and Preparing the Sample for details on reading the sample data from a file and splitting the sample to be further sent to the ApproxBuilder block.
You can also use the following input ports to provide details on model variables and responses:
- sample_names: adds names to model inputs and outputs.
- sample_descriptions: adds descriptions.
- sample_quantities: specifies physical quantities of variables and responses.
- sample_units: specifies measurement units.
- sample_lower_bounds and sample_upper_bounds: specify lower and upper bounds for model inputs and outputs. Input bounds can be used to limit the model’s input domain — see GTApprox/InputDomainType (this option is also available in the SmartSelection training mode, see the Advanced options… hint). Output bounds add thresholds for model outputs: if the model calculates some output value that is out of bounds, it is automatically replaced with the closest threshold value — so the model guarantees that the output value is always within bounds.
- x_categorical, f_categorical: specify the indices of categorical (discrete, qualitative) variables and outputs in the training sample. These are parameters that are not measured on a continuous scale but instead describe attributes or characteristics — such as material type, operational mode, or system state. Since ApproxBuilder requires numerical training data, categorical inputs and outputs have to be assigned numerical labels in the training sample. You must specify their indices; otherwise those inputs and outputs are treated as continuous variables.
Values sent to these ports, except for x_categorical and f_categorical, must be vectors of length equal to the total number of columns in the matrices received to the x_sample and f_sample ports. The order of vector elements follows the order of columns in training samples, with variables coming first. All names in the vector received to sample_names must be unique, and certain characters are prohibited in names (see the details below).
The values sent to the x_categorical and f_categorical ports must be vectors listing zero-based indices of the categorical variables and outputs, respectively. The first variable in the sample is at index 0, the second one is at index 1, and so on.
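For a hypothetical sample with one categorical variable, the zero-based indices could be derived as follows (illustrative Python, not the pSeven Desktop API):

```python
# Columns of the x_sample matrix for a hypothetical training sample:
#   index 0: temperature (continuous)
#   index 1: material type, encoded with numerical labels 0, 1, 2 (categorical)
#   index 2: pressure (continuous)
columns = ["temperature", "material", "pressure"]
categorical = {"material"}

# Zero-based indices to send to the x_categorical port:
x_categorical = [i for i, name in enumerate(columns) if name in categorical]
```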
For example, suppose you are training a model of a beam under load, which has 3 inputs (beam cross-section area, length, and applied load) and 1 output (bending stress). You can describe this model’s inputs and outputs in the following way:
- sample_names: ("S", "L", "F", "B") — note that all names here must be unique.
- sample_descriptions: ("Beam cross-section area.", "Beam length.", "The load applied to the beam.", "Bending stress.").
- sample_quantities: ("area", "length", "mass", "stress").
- sample_units: ("sq.m", "m", "kg", "MPa").
Negative input values have no physical meaning in this model, so you add a lower bound of 0.0 for each. Bending stress also cannot be negative, so you specify 0.0 as the lower bound of the model’s output too:
- sample_lower_bounds: (0.0, 0.0, 0.0, 0.0).
Upper bounds may be left unset in this example, or you can explicitly set them to Infinity (for lower bounds, -Infinity also specifies that there is no bound):
- sample_upper_bounds: (Infinity, Infinity, Infinity, Infinity).
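The output-bound behavior described above (replacing an out-of-bounds output with the closest threshold value) can be sketched as a simple clamp (illustrative Python, not the actual pSeven Desktop code):

```python
def clamp(value, lower, upper):
    """Replace an out-of-bounds output value with the closest threshold."""
    return max(lower, min(upper, value))

lower_bounds = (0.0, 0.0, 0.0, 0.0)   # as in the beam example above

# A raw prediction of -5.2 for bending stress would be replaced with
# the lower bound 0.0; an unset upper bound acts as +Infinity.
stress = clamp(-5.2, lower_bounds[3], float("inf"))
```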
The details on inputs and outputs are saved to the trained model and are shown when you view the model in Analyze or load it into the Approximation model block. They are also kept when you export the model code, for example, as a Functional Mock-up Unit for Co-Simulation.
Names in the vector received to the sample_names port must satisfy the following rules:
- Name must not be empty.
- All names must be unique. The same name for an input and an output is also prohibited.
- The only whitespace character allowed in names is the ASCII space, so \t, \n, \r, and various Unicode whitespace characters are prohibited.
- Name cannot contain leading or trailing spaces, and cannot contain two or more consecutive spaces.
- Name cannot contain leading or trailing dots, and cannot contain two or more consecutive dots, since dots in pSeven Desktop are used as name separators.
- Parts of the name separated by dots must not begin or end with a space, so the name cannot contain '. ' or ' .'.
- Name cannot contain control characters and Unicode separators.
- Name cannot contain characters from this set: :"/\|?*.
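The naming rules above can be summarized as a validation sketch (illustrative Python; `is_valid_name` is a hypothetical helper, not part of pSeven Desktop):

```python
import unicodedata

FORBIDDEN_CHARS = set(':"/\\|?*')

def is_valid_name(name):
    """Check one name against the rules listed above (illustration only)."""
    if not name:
        return False
    if name.startswith(" ") or name.endswith(" ") or "  " in name:
        return False
    if name.startswith(".") or name.endswith(".") or ".." in name:
        return False
    if ". " in name or " ." in name:
        return False
    for ch in name:
        if ch in FORBIDDEN_CHARS:
            return False
        category = unicodedata.category(ch)
        # Cc catches \t, \n, \r and other control characters;
        # Z* catches Unicode separators (the ASCII space is exempt).
        if category == "Cc" or (category.startswith("Z") and ch != " "):
            return False
    return True

# Uniqueness is checked across the whole vector, inputs and outputs together:
names = ("S", "L", "F", "B")
all_unique = len(set(names)) == len(names)
```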
Additional training data and settings can be sent to the following optional input ports:
- output_noise_variance — noise variance for training outputs. See Data with Errorbars.
- weights — point weights (relative importance measure) in the training sample. See Sample Weighting.
- output_transformation — the type of transformation to apply to the training sample outputs before training the model. See GTApprox/OutputTransformation.
If you use the SmartSelection mode, you can optionally provide a test sample to the x_test and f_test inputs. This sample is then used to calculate quality metrics when SmartSelection estimates model quality. The test sample has the same structure as the training sample.
Initial Model¶
If you are using the MoA, GBRT, or TBL technique in the manual mode, ApproxBuilder can train a model incrementally. You can select an initial model in the block’s configuration, and the block uses input data to improve that model. Updating with the GBRT or TBL technique is possible only if the initial model was trained with the same technique (GBRT or TBL, respectively). Updating with the MoA technique works with any initial model.
Note
The SmartSelection training mode does not support initial models. If you specify an initial model in the SmartSelection mode, training can start, but the initial model is ignored. pSeven Desktop shows a warning in the Issues pane in this case.
The block can load the initial model from a file on disk. Alternatively, you can send the path to the model file to the initial_model_path port, or send the model file itself to the initial_model port.
Note that the number of inputs and outputs in the initial model should conform to the number of components in the variable and response parts of the training sample. For example, if the initial model has 3 inputs and 1 output, the x_sample input should receive a matrix with 3 columns, and f_sample — a matrix with 1 column.
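A quick shape check along these lines (plain Python, with a hypothetical 3-input, 1-output initial model) could look like:

```python
# A hypothetical initial model with 3 inputs and 1 output:
initial_model_inputs, initial_model_outputs = 3, 1

# The training sample must match that dimensionality:
x_sample = [[1.0, 2.0, 3.0],   # 2 points, 3 variables
            [4.0, 5.0, 6.0]]
f_sample = [[10.0],            # 2 points, 1 response
            [20.0]]

inputs_match = len(x_sample[0]) == initial_model_inputs
outputs_match = len(f_sample[0]) == initial_model_outputs
```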
Saving the Model¶
You can save the approximation model to disk in the binary GTApprox model format specific to pSeven Desktop (.gtapprox). The steps are as follows:
- Open the ApproxBuilder block configuration.
- Click the button in the Output model pane to bring up the Configure file dialog.

- In the File origin pane select the Project origin.
- In the File path field, input the name for the model, for example ./model.gtapprox.
- Leave other settings default and click to close the dialog.
As a result, the saved model (model.gtapprox) appears in your project directory after the workflow finishes. Note that a path which begins with ./ is a relative path, where the dot . represents the project directory.
Model Export¶
To set up model export, use the following ports:
- export_format — specifies the model export format.
- export_path — sets the path for saving the exported model file on disk.
- export_model — outputs a temporary file with the exported model.
The export_model port is intended to send the model to another block in the workflow — for example, a block that evaluates the model. You can reconfigure it to output a file that is saved to disk, but then the block always exports to the same file, overwriting it if you export multiple models.
A more flexible way to specify the export file location is using the export_path port. The path is a StringScalar value that you can either assign to this port or send from another block. To assign the value, click in the value field to bring up the Select file dialog where you select the file location. In this case, pSeven Desktop generates a correct path automatically. If you set the path manually or send it from another block in the workflow, note the following:
- It is recommended to use the forward slash / as the path separator both in Windows and Linux versions of pSeven Desktop.
- The path can be absolute or relative. Relative paths are interpreted as relative to the project directory. You can add a leading dot . to explicitly note that the path is relative, for example ./MyModels/SomeExportedModel (the dot represents the project directory).
- You can omit the file extension: the block automatically adds a correct extension for the selected format.
- If the specified file already exists, it will be replaced.
- If you specify a path leading to a directory that does not exist, this directory will be automatically created by the block when it runs.
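The path rules above can be sketched as follows (illustrative Python; `resolve_export_path` is a hypothetical helper, and the actual pSeven Desktop logic may differ):

```python
from pathlib import PurePosixPath

def resolve_export_path(path, project_dir, extension):
    """Resolve a relative path against the project directory and add a
    missing file extension, as described above (illustration only)."""
    p = PurePosixPath(path)          # forward slashes work on both platforms
    if not p.is_absolute():
        p = PurePosixPath(project_dir) / p   # relative to the project dir
    if not p.suffix:
        p = p.with_suffix(extension)         # e.g. ".fmu" for FMU formats
    return str(p)

resolved = resolve_export_path("./MyModels/SomeExportedModel",
                               "/home/user/project", ".fmu")
```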
Available export formats are:
- Executable: a command-line executable for the platform on which pSeven Desktop currently runs (.exe for Windows, .bin for Linux). Note that it is not possible to export an executable file for another platform — for example, you cannot export a Windows executable under Linux.
- Excel document with a linked DLL: an Excel document with macros (.xlsm), which evaluates the model stored in a complementary DLL. In addition to the Excel document and two model DLLs (for the 32-bit and 64-bit Excel editions), this format also provides a file containing the code of a VBA wrapper (.bas) for the model DLL, and C source (.c) of the DLL. Export to this format is supported only in the Windows version of pSeven Desktop. For convenience, the DLL names are based on the name of the Excel document. However, DLL names (hence, the Excel document name) are also used in the VBA macros code. Due to this, the document name must contain only characters which can be represented in the system locale’s encoding (see Language for non-Unicode programs in Windows’ language settings). For compatibility across different local versions of Windows, it is recommended to use English characters only.
- FMU for Co-Simulation 1.0: an FMI 1.0 model (Functional Mock-up Unit, .fmu) in the Co-Simulation format, with source and binary.
- FMU for Model Exchange 1.0: an FMI 1.0 model (.fmu) in the Model Exchange format, with source and binary.
- FMU for Co-Simulation and Model Exchange 2.0: an FMI 2.0 model (.fmu) in the combined Co-Simulation and Model Exchange format, with source and binary.
- FMU for Co-Simulation 1.0 (source only): an FMI 1.0 model in the Co-Simulation format, with source only.
- FMU for Model Exchange 1.0 (source only): an FMI 1.0 model in the Model Exchange format, with source only.
- FMU for Co-Simulation and Model Exchange 2.0 (source only): an FMI 2.0 model (.fmu) in the combined Co-Simulation and Model Exchange format, with source only.
- C# source (experimental): source code (.cs) to compile the model with a C# compiler.
- C# library (experimental): a compiled .NET DLL (.dll). Note that using this export format requires a C# compiler installed (pSeven Desktop does not include a C# compiler).
  - In Windows: requires .NET Framework or another package which provides the C# compiler (csc.exe). pSeven Desktop finds the compiler automatically; if there are several versions installed, the latest is used. If you want to select a specific version, you can set the CSHARP_COMPILER_ROOT environment variable. Its value should be the full path to the directory which contains csc.exe.
  - In Linux: requires the dotnet command line tool which is a part of .NET Core SDK. The following environment variables are also required: CSHARP_COMPILER_ROOT must contain the path to the compiler executable (csc.dll), and CSHARP_LIBRARIES_ROOT must contain the full path to the directory where the System.dll and System.Private.CoreLib.dll libraries are located. Finally, the dotnet executable should be added to PATH.
- C source for standalone program: C source code with the main() function, which you can compile to a complete command-line program.
- C header for library: the header for a model compiled to a shared library (DLL or .so).
- C source for library: C header and model implementation, which you can compile to a shared library (DLL or .so).
- C source for MEX: source code for a MATLAB MEX file.
- Octave script: model code compatible with MATLAB.
Note
If you select the Excel export format, the export_model port outputs a ZIP archive (.zip) containing all exported files (Excel document, DLL files, VBA wrapper code, and model C source).
Note
pSeven Desktop approximation models have no time dependencies — model outputs depend on inputs only. When exporting a model as an FMU for Model Exchange (FMI 1.0) or an FMU for Co-Simulation and Model Exchange (FMI 2.0), pSeven Desktop adds a dummy local variable (time variable) to the model to follow the FMI standard. The exported model does not use that variable to evaluate outputs. For the same reason, any exported FMU for Co-Simulation (FMI 1.0) or FMU for Co-Simulation and Model Exchange (FMI 2.0) does not contain any numerical solver code: once the model receives a new input, its outputs change instantly; there is no state transition. Time derivatives are always 0 for exported pSeven Desktop models.
Note
An FMI model in any of the supported formats can be exported as an FMU with a binary, or as a source-only FMU.
A binary FMU is ready to use, but it is platform-dependent. For example, if you are running pSeven Desktop under Linux, you will not be able to export an FMU with a Windows binary. However, a binary FMU also contains source code, so you can recompile it for any platform after export.
A source-only FMU does not contain any binaries, so you will always have to compile it in order to obtain a working FMU.
Model Report¶
To set up the model report, use the following ports:
- report — outputs the HTML report containing model information and validation data.
- report_path — sets the path for saving the report file on disk.
The block can output model information and validation data as an HTML report to the report port. Note that monitoring for this port is disabled by default. You can view this report in Analyze or export it as a file to disk.
To view the report in Analyze:
- Create a new report and open the Report database pane.
- Drag the record with the report from the Project database pane to the Data series pane in the report database. This will create a string data series containing the HTML code.
- Select this new data series and click the
button on the report toolbar.
To export the report as a file to disk, send the export file path to the report_path port. The path can be relative to the project directory — for example, ./results/My report.html saves the file to the results subdirectory in the project. An absolute path can be used to save the report outside the project. In either case, if the specified file already exists, it will be overwritten.
Options¶
- General options
- Error handling behavior — the action to perform if the block encounters an error.
- Basic options
- Common options
- GTApprox/Accelerator — five-position switch to control the trade-off between speed and accuracy (updated in 6.23).
- GTApprox/AccuracyEvaluation — require accuracy evaluation.
- GTApprox/ExactFitRequired — require the model to fit sample data exactly (updated in 6.15).
- GTApprox/InternalValidation — enable or disable internal validation.
- GTApprox/LinearityRequired — require the model to be linear.
- GTApprox/LogLevel — minimum log level.
- Common options
- Advanced options
- Common options
- GTApprox/DependentOutputs — specify the type of dependency between output components (added in 6.3, updated in 6.15).
- GTApprox/Deterministic — controls the behavior of randomized initialization algorithms in certain techniques (added in 5.0).
- GTApprox/Heteroscedastic — treat input sample as a sample containing heteroscedastic noise (added in 1.9.0).
- GTApprox/InputDomainType — specifies the input domain for the model (added in 6.16, updated in 6.17).
- GTApprox/InputNanMode — specifies how to handle non-numeric values in the input part of the training sample (added in 6.8, updated in 6.19).
- GTApprox/InputsTolerance — specifies tolerance up to which each input variable will be rounded (added in 6.3).
- GTApprox/MaxAxisRotations — use rotation transformations in the input space to iteratively improve model quality (added in 6.12).
- GTApprox/MaxExpectedMemory — maximum expected amount of memory allowed for model training (added in 6.4).
- GTApprox/MaxParallel — maximum number of parallel threads (added in 5.0 Release Candidate 1, updated in 6.17).
- GTApprox/OutputNanMode — specifies how to handle non-numeric values in the output part of the training sample (added in 6.8).
- GTApprox/OutputTransformation — prior to training, apply transformation to the training sample output data (added in 6.13 Service Pack 1, updated in 6.48).
- GTApprox/PartialDependentOutputs/RRMSThreshold — if training a model with linear dependency between outputs, specifies the RRMS error threshold for the internal model of that dependency (added in 6.29).
- GTApprox/Seed — fixed seed used in the deterministic training mode (added in 5.0).
- GTApprox/StoreTrainingSample — save a copy of training data with the model (added in 6.6).
- GTApprox/SubmodelTraining — select whether to train submodels in parallel or sequentially (added in 6.14, updated in 2024.02).
- GTApprox/Technique — specify the approximation algorithm to use (added in 1.9.2, updated in 6.8).
- GTApprox/TrainingAccuracySubsetSize — limit the number of points selected from the training set to calculate model accuracy (added in 1.9.0).
- Gradient Boosted Regression Trees (GBRT)
- GTApprox/GBRTColsampleRatio — column subsample ratio (added in 5.1).
- GTApprox/GBRTMaxDepth — maximum regression tree depth (added in 5.1).
- GTApprox/GBRTMinChildWeight — minimum total weight of points in a regression tree leaf (added in 5.1).
- GTApprox/GBRTMinLossReduction — minimum significant reduction of loss function (added in 5.1).
- GTApprox/GBRTNumberOfTrees — the number of regression trees to include in a model (added in 5.1).
- GTApprox/GBRTShrinkage — shrinkage step, or learning rate (added in 5.1).
- GTApprox/GBRTSubsampleRatio — row subsample ratio (added in 5.1).
- Gaussian Processes (GP)
- GTApprox/GPInteractionCardinality — allowed orders of additive covariance function (added in 1.10.3).
- GTApprox/GPLearningMode — give priority to either model accuracy or robustness (added in 1.9.6, updated in 6.17).
- GTApprox/GPMeanValue — the mean value of the model output.
- GTApprox/GPPower — the value of p in the p-norm used to measure the distance between input vectors (updated in 6.47).
- GTApprox/GPTrendType — select trend type (added in 3.2).
- GTApprox/GPType — select the covariance function (kernel) type (updated in 6.17).
- High Dimensional Approximation (HDA)
- GTApprox/HDAFDGauss — include Gaussian functions into functional dictionary used in construction of approximations (updated in 6.14).
- GTApprox/HDAFDLinear — include linear functions into functional dictionary used in construction of approximations (updated in 6.14).
- GTApprox/HDAFDSigmoid — include sigmoid functions into functional dictionary used in construction of approximations (updated in 6.14).
- GTApprox/HDAHessianReduction — maximum proportion of data used in evaluating the Hessian (added in 1.6.1).
- GTApprox/HDAMultiMax — maximum number of basic approximators constructed during one approximation phase (updated in 6.14).
- GTApprox/HDAMultiMin — minimum number of basic approximators constructed during one approximation phase (updated in 6.14).
- GTApprox/HDAPhaseCount — maximum number of approximation phases (updated in 6.14).
- GTApprox/HDAPMax — maximum allowed approximator complexity (updated in 6.14).
- GTApprox/HDAPMin — minimum allowed approximator complexity (updated in 6.14).
- Internal Validation (IV)
- GTApprox/IVDeterministic — controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation (added in 5.0).
- GTApprox/IVSavePredictions — save model values calculated during internal validation (added in 2.0 Release Candidate 2).
- GTApprox/IVSeed — fixed seed used in the deterministic cross validation mode (added in 5.0).
- GTApprox/IVSubsetCount — the number of subsets into which the training sample is divided for cross validation (updated in 6.19).
- GTApprox/IVSubsetSize — the size of a training sample subset used as test data in a cross validation session (added in 6.19).
- GTApprox/IVTrainingCount — an upper limit for the number of training sessions in cross validation (updated in 6.19).
- Mixture of Approximators (MoA)
- GTApprox/MoACovarianceType — type of covariance matrix to use in Gaussian Mixture Model (added in 1.10.0, updated in 6.11).
- GTApprox/MoANumberOfClusters — the number of clusters (added in 1.10.0).
- GTApprox/MoAPointsAssignment — select the technique for assigning points to clusters (added in 1.10.0).
- GTApprox/MoAPointsAssignmentConfidence — confidence for points assignment technique based on Mahalanobis distance (added in 1.10.0).
- GTApprox/MoATechnique — approximation technique for local models (added in 1.10.0, updated in 6.3).
- GTApprox/MoATypeOfWeights — type of weights to use when creating the final model (added in 1.10.0).
- GTApprox/MoAWeightsConfidence — the value to control smoothness of weights based on sigmoid function and Mahalanobis distance (added in 1.10.0).
- Response Surface Model (RSM)
- GTApprox/RSMElasticNet/L1_ratio — specifies the ratio between L1 and L2 regularization in the ElasticNet type regularization (added in 6.1).
- GTApprox/RSMFeatureSelection — specifies the regularization and term selection procedures (added in 6.1, updated in 6.17).
- GTApprox/RSMMapping — selects the type of sample data pre-processing.
- GTApprox/RSMStepwiseFit/inmodel — selects the starting model for stepwise-fit regression (updated in 6.14).
- GTApprox/RSMStepwiseFit/penter — specifies p-value of inclusion for stepwise-fit regression.
- GTApprox/RSMStepwiseFit/premove — specifies p-value of exclusion for stepwise-fit regression.
- GTApprox/RSMType — specifies the type of response surface model (updated in 6.17).
- Sparse Gaussian Processes (SGP)
- GTApprox/SGPNumberOfBasePoints — the number of base points used to approximate the full covariance matrix of the points from the training sample (updated in 6.14).
- Splines with Tension (SPLT)
- GTApprox/SPLTContinuity — required approximation smoothness (updated in 6.17).
- Tensor Approximation (TA)
- GTApprox/EnableTensorFeature — enable automatic selection of TA and iTA techniques (added in 1.9.2).
- GTApprox/TALinearBSPLExtrapolation — use linear extrapolation for BSPL factors (added in 1.9.4).
- GTApprox/TALinearBSPLExtrapolationRange — set linear BSPL extrapolation range (added in 1.9.4).
- GTApprox/TAModelReductionRatio — sets the ratio of model complexity reduction (added in 6.2).
- GTApprox/TensorFactors — describes tensor factors to use in the Tensor Approximation technique.
- Common options
-
Error handling behavior
The action to perform if the block encounters an error.
Value: stop workflow, output defaults and signal, or output signal only
Default: stop workflow (hard stop the block)
When set to stop workflow, a block error causes its hard stop, which leads to the workflow shutdown (see Block Stop for details). In this case, the block does not output any data.
If set to output defaults and signal, the block suppresses the error and does not interrupt the workflow. In this case, the done port outputs False (the failure signal), and other output ports issue the values assigned to them in the block settings.
The output signal only behavior means that the block outputs only the False value to done; nothing is output to other ports.
-
GTApprox/Accelerator
Five-position switch to control the trade-off between training speed and model quality.
Value: integer in range from 1 (prefer quality, lower speed) to 5 (prefer speed, lower quality)
Default: 1 (prefer quality)
Changed in version 5.1: GTApprox/Accelerator affects the GTApprox/GBRTMaxDepth and GTApprox/GBRTNumberOfTrees options in manual training.
Changed in version 6.23: GTApprox/Accelerator affects RSM parameters estimation in SmartSelection.
This option changes several internal parameters of approximation techniques, which allow the trade-off between training speed and model quality. When you use SmartSelection or set up manual training with the GP, GBRT, or HDA technique, GTApprox/Accelerator also implicitly changes values of some dependent options. You can override these changes by manually setting dependent options: if you set both GTApprox/Accelerator and some dependent option, GTApprox will use your value of this dependent option, not the value automatically set by GTApprox/Accelerator.
In SmartSelection mode, setting GTApprox/Accelerator to 4 disables the stepwise regression and ElasticNet algorithms when tuning parameters of the RSM technique; setting it to 5 additionally disables the multiple ridge algorithm in RSM to speed up model training (see section Parameters Estimation in Response Surface Model for details). You can override these changes by setting the GTApprox/RSMFeatureSelection option manually.
The dependent GP option in manual training is GTApprox/GPLearningMode: it is set to Accurate if GTApprox/Accelerator is 1 or 2.
Dependent GBRT options in manual training are GTApprox/GBRTMaxDepth and GTApprox/GBRTNumberOfTrees. GTApprox/Accelerator sets them as follows:
GTApprox/Accelerator        1    2    3    4    5
GTApprox/GBRTMaxDepth       10   10   10   6    6
GTApprox/GBRTNumberOfTrees  500  400  300  200  100
HDA technique settings affected by GTApprox/Accelerator in manual training depend on the input sample size. There are two cases:
- Common sample size (the sample contains less than 10 000 points).
- Big sample size (the sample contains 10 000 points or more).
In the case of a commonly sized sample, dependent options are GTApprox/HDAFDGauss, GTApprox/HDAMultiMax, GTApprox/HDAMultiMin and GTApprox/HDAPhaseCount. GTApprox/Accelerator sets them as follows:
GTApprox/Accelerator    1    2    3    4    5
GTApprox/HDAFDGauss     1    1    0    0    0
GTApprox/HDAMultiMax    10   6    4    4    2
GTApprox/HDAMultiMin    5    4    2    2    1
GTApprox/HDAPhaseCount  10   7    5    1    1
In the case of a big sized sample, dependent options are GTApprox/HDAFDGauss, GTApprox/HDAHessianReduction, GTApprox/HDAMultiMax, GTApprox/HDAMultiMin, GTApprox/HDAPhaseCount, GTApprox/HDAPMax, and GTApprox/HDAPMin. GTApprox/Accelerator sets them as follows:
GTApprox/Accelerator          1     2     3     4     5
GTApprox/HDAFDGauss           0     0     0     0     0
GTApprox/HDAHessianReduction  0.3   0.3   0     0     0
GTApprox/HDAMultiMax          3     2     2     2     1
GTApprox/HDAMultiMin          1     1     1     1     1
GTApprox/HDAPhaseCount        5     5     3     1     1
GTApprox/HDAPMax              150   150   150   150   150
GTApprox/HDAPMin              150   150   150   150   150
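As an illustration of the override rule described above, the following minimal sketch (not the pSeven API; the dictionary and function names are hypothetical) merges the accelerator-implied GBRT defaults from the table with user-set options, letting user values win:

```python
# Hypothetical illustration: GBRT defaults implied by GTApprox/Accelerator
# in manual training, taken from the table above. A dependent option set
# manually always overrides the value implied by the accelerator level.

GBRT_DEFAULTS = {
    1: {"GTApprox/GBRTMaxDepth": 10, "GTApprox/GBRTNumberOfTrees": 500},
    2: {"GTApprox/GBRTMaxDepth": 10, "GTApprox/GBRTNumberOfTrees": 400},
    3: {"GTApprox/GBRTMaxDepth": 10, "GTApprox/GBRTNumberOfTrees": 300},
    4: {"GTApprox/GBRTMaxDepth": 6, "GTApprox/GBRTNumberOfTrees": 200},
    5: {"GTApprox/GBRTMaxDepth": 6, "GTApprox/GBRTNumberOfTrees": 100},
}

def effective_gbrt_options(accelerator, user_options):
    """Merge accelerator-implied defaults with user-set options;
    user-set values take priority, mirroring the override rule."""
    merged = dict(GBRT_DEFAULTS[accelerator])
    merged.update(user_options)
    return merged

# Accelerator 4 implies depth 6 and 200 trees, but a manual tree count wins:
print(effective_gbrt_options(4, {"GTApprox/GBRTNumberOfTrees": 350}))
```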
-
GTApprox/AccuracyEvaluation
Require accuracy evaluation.
Value: True or False
Default: False
If this option is True, then, in addition to the approximation, the constructed model will contain a function providing an estimate of the approximation error as a function on the design space. See the Evaluation of accuracy in given point chapter for details.
-
GTApprox/DependentOutputs
Specify the type of dependency between output components.
Value: Boolean, PartialLinear, or Auto
Default: Auto
New in version 6.3.
Changed in version 6.15: added the linear dependencies mode (PartialLinear).
Selects which approximation mode to use when training a model with multidimensional output (see section Output Dependency Modes for details).
- True: treat different components of the output as possibly dependent, do not use componentwise approximation.
- False: assume that output components are independent and use componentwise approximation.
- PartialLinear: before training, search for linear dependencies between outputs in the training data. If such dependencies are found, train a model that keeps the dependencies. In this case, the submodels of independent outputs are trained in the componentwise mode.
- Auto (default): use componentwise approximation.
Note
The TBL technique ignores this option since it is a simple table function.
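The kind of check implied by the PartialLinear mode can be sketched as follows. This is a minimal illustration, not the pSeven implementation: it tests whether one output column is an (approximately) exact linear function of another via a least-squares fit; the tolerance is hypothetical (the real detection is governed by GTApprox/PartialDependentOutputs/RRMSThreshold):

```python
# Hypothetical illustration of detecting a linear dependency y2 = a*y1 + b
# between two output columns of a training sample.

def is_linear_dependent(y1, y2, tol=1e-9):
    """Fit y2 ~ a*y1 + b by least squares and check the residual."""
    n = len(y1)
    mean1, mean2 = sum(y1) / n, sum(y2) / n
    var1 = sum((v - mean1) ** 2 for v in y1)
    if var1 == 0:  # y1 is constant: dependency holds only if y2 is too
        return all(abs(v - mean2) <= tol for v in y2)
    a = sum((u - mean1) * (v - mean2) for u, v in zip(y1, y2)) / var1
    b = mean2 - a * mean1
    return all(abs(a * u + b - v) <= tol for u, v in zip(y1, y2))

assert is_linear_dependent([1, 2, 3], [5, 7, 9]) is True   # y2 = 2*y1 + 3
assert is_linear_dependent([1, 2, 3], [1, 2, 10]) is False
```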
-
GTApprox/Deterministic
Controls the behavior of randomized initialization algorithms used in certain techniques.
Value: disable, block run, or workflow run
Default: workflow run
New in version 5.0.
Several model training techniques in GTApprox feature randomized initialization of their internal parameters. These techniques include:
- GBRT, which can select random subsamples of the full training set when creating regression trees (see section Stochastic Boosting).
- HDA and HDAGP, which use randomized initialization of approximator parameters.
- MoA, if the approximation technique for its local models is set to HDA, HDAGP or SGP using GTApprox/MoATechnique, or the same selection is done automatically.
- SGP, which uses randomized selection of base points when approximating the full covariance matrix of the points from the training sample (Nystrom method).
- TA, if for some of its factors the HDA technique is specified manually or is selected automatically (see GTApprox/TensorFactors).
The determinacy of randomized techniques can be controlled in the following way:
- If GTApprox/Deterministic is set to workflow run, a fixed seed is always used by the block in all randomized algorithms. The seed is set by GTApprox/Seed. This makes the block and workflow behavior fully reproducible — for example, two models trained in this mode with the same data, the same GTApprox/Seed, and other settings will be exactly the same, even if they are trained in different workflow runs.
- If GTApprox/Deterministic is set to block run, a new seed is selected every time the workflow starts, and the selected seed is used by the block during that workflow run only. For example, two models trained with the same data and settings will be the same if they are trained in the same workflow run, but may be different if they are trained in different workflow runs.
- If GTApprox/Deterministic is set to disable, a new seed is selected every time the block starts, even in the same workflow run.
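The three determinacy modes can be sketched as a seed-selection rule. This is a minimal illustration, not the pSeven implementation; the seed-derivation formula for the block run mode is an assumption made only so the sketch is concrete:

```python
# Hypothetical illustration of the three GTApprox/Deterministic modes.
import random

def pick_seed(mode, fixed_seed, workflow_run_id, block_run_id):
    """Return the seed a randomized training algorithm would use."""
    if mode == "workflow run":
        return fixed_seed  # GTApprox/Seed: same across all workflow runs
    if mode == "block run":
        # Seed derived from the workflow run only, so it is fixed within
        # one workflow run regardless of how many times the block starts.
        return (workflow_run_id * 2654435761) % 2**31
    # "disable": a fresh seed on every block start, even in one workflow run.
    return random.randrange(2**31)

# Same workflow run -> same seed in "block run" mode, whatever the block run:
assert pick_seed("block run", 42, 7, 1) == pick_seed("block run", 42, 7, 2)
```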
For randomized techniques, repeated non-deterministic training runs may be used to try to obtain a more accurate approximation, because the results will be slightly different. In contrast, deterministic techniques always produce exactly the same model given the same training data and settings, and are not affected by GTApprox/Deterministic and GTApprox/Seed. Deterministic techniques include:
-
GTApprox/EnableTensorFeature
Enable automatic selection of the TA and iTA techniques.
Value: True or False
Default: True
New in version 1.9.2: allows the automatic selection of the iTA technique. Previously affected only the TA technique selection.
If True, makes the TA and iTA techniques available for automatic selection. If False, neither TA nor iTA will ever be selected automatically based on the decision tree. Has no effect if an approximation technique is selected manually using the GTApprox/Technique option.
Note
This option does not enable the automatic selection of the TGP technique.
-
GTApprox/ExactFitRequired
Require the model to fit sample data exactly.
Value: True or False
Default: False
If this option is True, the model fits the points of the training sample exactly — that is, model responses in the points that were included in the training sample are equal to the response values in the training sample.
If GTApprox/ExactFitRequired is False, no fitting condition is imposed, and the approximation can be either fitting or non-fitting depending on the training data. Typical example: if GTApprox finds that the sample is noisy, it does not create an exact-fitting model, to avoid overtraining.
Note that the exact fit mode is not supported by some approximation techniques. In particular, it is incompatible with the robust version of GP-based techniques (see GTApprox/GPLearningMode). For details on other techniques, see their descriptions in the Techniques section.
Changed in version 4.2: added the exact fit mode support to the TA technique (see TA).
Changed in version 6.15: the HDAGP technique, which does not support the exact fit mode, now raises an error if GTApprox/ExactFitRequired is on. Previously HDAGP silently ignored this option.
Changed in version 6.15: it is no longer possible to train a model with GTApprox/ExactFitRequired on and GTApprox/GPLearningMode set to Robust. This combination is now explicitly prohibited and raises an error.
For more information on the effects of this option, see section Exact Fit.
-
GTApprox/GBRTColsampleRatio
Column subsample ratio.
Works only for GBRT technique.
Value: floating point number in range (0, 1]
Default: 1.0
New in version 5.1.
The GBRT technique uses random subsamples of the full training set when training weak estimators (regression trees). GTApprox/GBRTColsampleRatio specifies the fraction of columns (input features) to be included in a subsample: for example, setting it to 0.5 will randomly select half of the input features to form a subsample.
For more details, see section Stochastic Boosting.
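Column subsampling can be sketched as picking a random subset of feature indices per tree. This is an illustration only, not the pSeven implementation; in particular, the rounding of the ratio to a column count is an assumption:

```python
# Hypothetical illustration of column subsampling for one regression tree:
# with ratio 0.5 and 8 input features, a random half of the columns is used.
import math
import random

def subsample_columns(n_features, ratio, rng):
    """Pick a random subset of feature indices for one tree.
    Rounding up (and keeping at least one column) is an assumption here."""
    k = max(1, math.ceil(ratio * n_features))
    return sorted(rng.sample(range(n_features), k))

cols = subsample_columns(8, 0.5, random.Random(0))
assert len(cols) == 4 and all(0 <= c < 8 for c in cols)
```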
-
GTApprox/GBRTMaxDepth
Maximum regression tree depth.
Works only for GBRT technique.
Value: non-negative integer
Default: 0
New in version 5.1.
Sets the maximum depth allowed for each regression tree (GBRT weak estimator). Greater depth results in a more complex final model.
Default (0) means that the tree depth will be set by GTApprox/Accelerator as follows:
GTApprox/Accelerator   1    2    3    4    5
GTApprox/GBRTMaxDepth  10   10   10   6    6
For example, if both options are default (GTApprox/GBRTMaxDepth is 0 and GTApprox/Accelerator is 1), the actual depth setting is 10.
For more details, see section Model Complexity.
-
GTApprox/GBRTMinChildWeight
Minimum total weight of points in a regression tree leaf.
Works only for GBRT technique.
Value: non-negative floating point number
Default: 1
New in version 5.1.
The GBRT technique stops growing a branch of a regression tree if the total weight of points assigned to a leaf becomes less than GTApprox/GBRTMinChildWeight. If the sample is not weighted, this is the same as limiting the number of points in a leaf. Zero minimum weight means that no such limit is imposed.
For more details, see section Leaf Weighting.
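The stopping rule can be sketched as a predicate on the point weights in a leaf. This is a hypothetical illustration, not the pSeven implementation:

```python
# Hypothetical illustration of the leaf-weight stopping rule: a branch
# stops growing when the total weight of points in a leaf falls below
# the threshold (GTApprox/GBRTMinChildWeight).

def can_split(leaf_weights, min_child_weight):
    """True if a leaf with these point weights meets the weight threshold.
    A zero threshold means no limit is imposed."""
    return sum(leaf_weights) >= min_child_weight

# Unweighted sample: each point has weight 1, so the rule effectively
# limits the number of points per leaf.
assert can_split([1, 1, 1], 3) is True
assert can_split([1, 1], 3) is False
```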
-
GTApprox/GBRTMinLossReduction
Minimum significant reduction of loss function.
Works only for GBRT technique.
Value: non-negative floating point number
Default: 0
New in version 5.1.
The GBRT technique stops growing a branch of a regression tree if the reduction of loss function (model’s mean square error over the training set) becomes less than GTApprox/GBRTMinLossReduction.
For more details, see section Model Complexity.
-
GTApprox/GBRTNumberOfTrees
The number of regression trees in the model.
Works only for GBRT technique.
Value: non-negative integer
Default: 0
New in version 5.1.
Sets the number of weak estimators (regression trees) in a GBRT model, the same as the number of gradient boosting stages. Greater number results in a more complex final model.
Changed in version 5.2: 0 is allowed and means auto setting.
Default (0) means that the number of trees will be set by GTApprox/Accelerator as follows:
GTApprox/Accelerator        1    2    3    4    5
GTApprox/GBRTNumberOfTrees  500  400  300  200  100
For example, if both options are default (GTApprox/GBRTNumberOfTrees is 0 and GTApprox/Accelerator is 1), the actual number of trees is 500.
For more details, see section Model Complexity.
Note that in incremental training the default (0) number of trees is not affected by GTApprox/Accelerator but depends on the number of trees in the initial model and training sample sizes — see Incremental Training for details.
-
GTApprox/GBRTShrinkage
Shrinkage step, or learning rate.
Works only for GBRT technique.
Value: floating point number in range (0, 1]
Default: 0.3
New in version 5.1.
GBRT scales each weak estimator by a factor of GTApprox/GBRTShrinkage, resulting in a kind of regularization with smaller step values.
For more details, see section Shrinkage.
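The shrinkage update can be sketched on a toy example. This is a minimal illustration of the general gradient boosting step, not the pSeven implementation: each weak estimator's contribution is scaled by the shrinkage factor before being added, so smaller values take smaller, more regularized steps:

```python
# Hypothetical illustration of the shrinkage (learning rate) update in
# gradient boosting, on a degenerate one-point "sample".

def boost_constant(y, shrinkage, n_stages):
    """Toy boosting: each weak estimator is just the current residual,
    scaled by the shrinkage factor before being added to the prediction."""
    prediction = 0.0
    for _ in range(n_stages):
        residual = y - prediction       # the weak estimator fits the residual
        prediction += shrinkage * residual
    return prediction

# With shrinkage 0.3 the prediction approaches the target only gradually,
# needing many stages to get close:
assert abs(boost_constant(10.0, 0.3, 50) - 10.0) < 1e-6
```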
-
GTApprox/GBRTSubsampleRatio
Row subsample ratio.
Works only for GBRT technique.
Value: floating point number in range (0, 1]
Default: 1.0
New in version 5.1.
The GBRT technique uses random subsamples of the full training set when training weak estimators (regression trees). GTApprox/GBRTSubsampleRatio specifies the fraction of rows (sample points) to be included in a subsample: for example, setting it to 0.5 will randomly select half of the points to form a subsample.
For more details, see section Stochastic Boosting.
-
GTApprox/GPInteractionCardinality
Allowed orders of additive covariance function.
Works for GP, SGP and HDAGP techniques.
Value: IntVector containing unique unsigned integers, each in range [1, dim(X)]
Default: empty vector (equivalent to (1, n), n=dim(X))
New in version 1.10.3.
This option takes effect only when using the additive covariance function (GTApprox/GPType is set to Additive); otherwise it is ignored. In particular, the TGP technique always ignores this option since its covariance function is always Wlp.
.The additive covariance function is a sum of products of one-dimensional covariance functions, where each additive component (a summand) depends on a subset of initial input variables. GTApprox/GPInteractionCardinality defines the degree of interaction between input variables by specifying allowed subset sizes, which are in fact the allowed values of covariance function order.
All vector values should be unique, and none of them can be greater than the number of input components, excluding constant inputs (the effective dimension of the input part of the training sample).
Consider an n-dimensional X sample with m variable and n−m constant components (sample matrix columns). Valid GTApprox/GPInteractionCardinality settings then would be:
- (1, n): simplified syntax, implicitly converts to (1, m).
- (1, 2, ... m-1, m, m+1, ... k), where m < k ≤ n: treated as a consecutive list of interactions up to cardinality k, implicitly converts to (1, 2, ... m-1, m). Note that in this case all values from 1 to m have to be included in the vector, otherwise it is considered invalid.
- (i1, i2, ... ik), where ij ≤ m: valid list of interaction cardinalities, no conversion needed.
-
GTApprox/GPLearningMode
Give priority to either model accuracy or robustness.
Value: Accurate, Robust, or Auto
Default: Auto
New in version 1.9.6.
Changed in version 6.15: added the Auto value, which is now default (was Accurate).
Changed in version 6.17: the Auto behavior now depends on GTApprox/Accelerator.
This option affects the Gaussian processes-based techniques: GP, TGP, and TA with GP factors. These techniques can use different versions of the training algorithm. The accurate version aims to minimize model errors, but is prone to unwanted effects related to overtraining, which decrease model quality. The robust version prevents overtraining at the cost of a possible decrease in model accuracy; this version is also incompatible with the exact fit mode (see GTApprox/ExactFitRequired).
Using the robust version is recommended. The Auto setting defaults to the robust version and selects the accurate version only when:
- GTApprox/ExactFitRequired is enabled, or
- GTApprox/Accelerator is 1 or 2.
-
GTApprox/GPMeanValue
Specifies the model output mean values.
Works for GP, SGP, HDAGP and TGP techniques.
Value: RealVector
Default: empty vector (automatic estimate)
Model output mean values are essential for constructing a GP approximation. These values may be defined by the user or estimated from the given sample (the bigger and more representative the sample, the better the estimate of the model output mean values). Misspecified model output mean values decrease approximation accuracy: the larger the error in the output mean values, the worse the final approximation model. If left default (empty vector), the model output mean values are estimated from the given sample.
The option value is a RealVector, which should either be empty or contain a number of elements equal to the output dataset dimension.
-
GTApprox/GPPower
For the Gaussian processes based techniques (GP, HDAGP, SGP, TGP), sets the value of p in the p-norm used to measure the distance between input vectors.
Value: floating point number in range [1, 2], or Auto
Default: Auto
Changed in version 6.47: added the Auto value, which is now default; this does not change the default behavior from the previous versions.
The main component of the Gaussian processes based regression is the covariance function measuring the similarity between two input points. That function factors in the p-norm of the difference in coordinates of the input points in the considered point pair, where p is the GTApprox/GPPower value. For example:
- p=1 corresponds to the Laplacian covariance function, better suited for modeling of non-smooth functions.
- p=2 corresponds to the usual Gaussian covariance function, better suited for modeling of smooth functions.
The Auto (default) value is intended to explicitly “unlock” this option for automatic tuning in SmartSelection training. In the manual training configuration mode, Auto defaults to using the Gaussian covariance function with p=2.
For the GP, SGP, and HDAGP techniques, this option takes effect only if GTApprox/GPType is Wlp or Additive. The TGP technique is always affected by GTApprox/GPPower, since it always uses the common covariance function (denoted Wlp) and disregards the GTApprox/GPType setting.
-
GTApprox/GPTrendType
Specifies the trend type.
Works for GP, SGP, HDAGP and TGP techniques.
Value: None, Linear, Quadratic, or Auto
Default: Auto
New in version 3.2.
This option makes it possible to account for specific (linear or quadratic) behavior of the modeled dependency by selecting which type of trend to use.
- None — no trend.
- Linear — linear trend.
- Quadratic — polynomial trend with constant, linear and pure quadratic terms (no interaction terms, no feature selection).
- Auto — automatic selection, defaults to no trend.
-
GTApprox/GPType
Select the kernel function type for the Gaussian processes-based techniques (GP, SGP, and HDAGP, excluding TGP).
Value: Additive, Mahalanobis, Wlp, Periodic, or Auto
Default: Auto
Changed in version 1.10.3: added the additive kernel function.
Changed in version 6.16: added the periodic kernel function.
Changed in version 6.17: added the Auto setting, which is now default.
Selects the kernel function used in Gaussian processes. Available kernels:
- Additive: summarized coordinate-wise products of 1-dimensional Gaussian covariance functions. With this setting, GTApprox/GPInteractionCardinality may be used to set the degree of interaction between input variables.
- Mahalanobis: squared exponential covariance function with Mahalanobis distance.
- Wlp: common exponential Gaussian covariance function with weighted Lp distance.
- Periodic: periodic covariance function. Using this kernel potentially allows you to create an approximation model with periodic extrapolation.
- Auto: primarily intended for compatibility with SmartSelection, where it explicitly “unlocks” the option for smart tuning. When set manually, Auto defaults to Wlp.
If set to Additive when the input part of the training sample is 1-dimensional (that is, there is only 1 variable), the additive covariance function is implicitly replaced with the common covariance function (denoted Wlp), and the GTApprox/GPInteractionCardinality option value is ignored.
Note
The TGP technique ignores this option and always uses the common covariance function (denoted Wlp).
-
GTApprox/Heteroscedastic
Treat input sample as a sample containing heteroscedastic noise.
Value: True, False, or Auto
Default: Auto
New in version 1.9.0.
If this option is True, the block assumes that heteroscedastic noise variance is present in the input sample. The default value (Auto) currently means that the option is False.
This option has certain limitations:
- It is valid for the GP and HDAGP techniques only. For other techniques the value is ignored (treated as always False).
- Heteroscedasticity is incompatible with covariance functions other than Wlp: if GTApprox/Heteroscedastic is True and GTApprox/GPType is not Wlp, the block raises an error.
- If noise variance is given, the GTApprox/Heteroscedastic option is ignored and the non-variational GP (or HDAGP) technique is used.
See section Heteroscedastic data for more details.
-
GTApprox/HDAFDGauss
Include Gaussian functions into functional dictionary used in construction of approximations.
Works for HDA and HDAGP techniques.
Value: No, Ordinary, or Auto
Default: Auto
Changed in version 6.14: added the Auto value, which is now default (was Ordinary).
To construct an approximation, a linear expansion in functions from a special functional dictionary is used. This option controls whether Gaussian functions should be included in the functional dictionary used in construction of the approximation.
In general, using Gaussian functions as building blocks for the approximation can lead to a significant increase in accuracy, especially when the approximated function is bell-shaped. However, it may also significantly increase training time.
-
GTApprox/HDAFDLinear
Include linear functions into functional dictionary used in construction of approximations.
Works for HDA and HDAGP techniques.
Value: No, Ordinary, or Auto
Default: Auto
Changed in version 6.14: added the Auto value, which is now default (was Ordinary).
To construct an approximation, a linear expansion in functions from a special functional dictionary is used. This option controls whether linear functions should be included in the functional dictionary used in construction of the approximation.
In general, using linear functions as building blocks for the approximation can lead to an increase in accuracy, especially when the approximated function has a significant linear component. However, it may also increase training time.
-
GTApprox/HDAFDSigmoid
Include sigmoid functions into functional dictionary used in construction of approximations.
Works for HDA and HDAGP techniques.
Value: No, Ordinary, or Auto
Default: Auto
Changed in version 6.14: added the Auto value, which is now default (was Ordinary).
To construct an approximation, a linear expansion in functions from a special functional dictionary is used. This option controls whether sigmoid-like functions should be included in the functional dictionary used in construction of the approximation.
In general, using sigmoid-like functions as building blocks for the approximation can lead to an increase in accuracy, especially when the approximated function has square-like or discontinuity regions. However, it may also lead to a significant increase in training time.
-
GTApprox/HDAHessianReduction
Maximum proportion of data used in evaluating Hessian matrix.
Works for HDA and HDAGP techniques.
Value: floating point number in range [0, 1]
Default: 0.0
New in version 1.6.1.
This option limits the maximum share of data points used for Hessian estimation (used in the high-precision algorithm). If the value is 0, the whole set of points is used in Hessian estimation; otherwise, if the value is in range (0, 1], only a part of the set (a fraction no larger than GTApprox/HDAHessianReduction) is used. Reduction applies only to samples bigger than 1250 input points (if the number of points is smaller than 1250, this option is ignored and the Hessian is estimated using the whole training sample).
Note
In some cases, the high-precision algorithm can be disabled automatically, regardless of the GTApprox/HDAHessianReduction value. This happens if:
- (dim(X)+1)⋅p ≥ 4000, where dim(X) is the dimension of the input vector X and p is the total number of basis functions, or
- dim(X)≥25, where dim(X) is the dimension of the input vector X, or
- there are no sufficient computational resources to use the high precision algorithm.
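The two size-based disable conditions can be expressed directly in code. This is a hypothetical illustration only, covering just the documented size conditions (the resource-availability condition is omitted):

```python
# Hypothetical illustration of the documented conditions under which the
# high-precision (Hessian-based) algorithm is disabled automatically.

def high_precision_disabled(dim_x, num_basis_functions):
    """True if either documented size condition disables the algorithm:
    (dim(X)+1) * p >= 4000, or dim(X) >= 25."""
    return (dim_x + 1) * num_basis_functions >= 4000 or dim_x >= 25

assert high_precision_disabled(25, 1) is True    # dim(X) >= 25
assert high_precision_disabled(9, 400) is True   # 10 * 400 >= 4000
assert high_precision_disabled(9, 399) is False  # 10 * 399 = 3990 < 4000
```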
-
GTApprox/HDAMultiMax
Maximum number of basic approximators constructed during one approximation phase.
Works for HDA and HDAGP techniques.
Value: integer in range [GTApprox/HDAMultiMin, 1000], or 0
Default: 0 (auto selection)
Changed in version 6.14: added 0 as a valid value for automatic selection, which is now default (was 10).
This option specifies the maximum number of basic approximators constructed during one approximation phase. The option value is a positive integer that must be greater than or equal to the value of the GTApprox/HDAMultiMin option. This option sets an upper limit on the number of basic approximators, but does not require this limit to be reached (the approximation algorithm stops constructing basic approximators as soon as construction of a subsequent basic approximator does not increase accuracy). In general, the bigger the value of GTApprox/HDAMultiMax, the more accurate the constructed approximator. However, increasing the value may significantly increase training time and/or cause overtraining in some cases.
-
GTApprox/HDAMultiMin
Minimum number of basic approximators constructed during one approximation phase.
Works for HDA and HDAGP techniques.
Value: integer in range [1, GTApprox/HDAMultiMax], or 0
Default: 0 (auto selection)
Changed in version 6.14: added 0 as a valid value for automatic selection, which is now default (was 5).
This option specifies the minimum number of basic approximators constructed during one approximation phase. The option value is a positive integer that must be less than or equal to the value of the GTApprox/HDAMultiMax option. In general, the bigger the value of GTApprox/HDAMultiMin, the more accurate the constructed approximator. However, increasing the value may significantly increase training time and/or cause overtraining in some cases.
-
GTApprox/HDAPhaseCount
Maximum number of approximation phases.
Works for HDA and HDAGP techniques.
Value: integer in range [1, 50], or 0
Default: 0 (auto)
Changed in version 6.14: added 0 as a valid value for automatic selection, which is now default (was 10).
This option specifies the maximum possible number of approximation phases. It sets an upper limit on that number only, and does not require the limit to be reached (the approximation algorithm stops performing new phases as soon as a subsequent approximation phase does not increase accuracy). In general, the more approximation phases, the more accurate the approximator. However, increasing the maximum number of approximation phases may significantly increase training time and/or cause overtraining in some cases.
-
GTApprox/HDAPMax
Maximum allowed approximator complexity.
Works for HDA and HDAGP techniques.
Value: integer in range [GTApprox/HDAPMin, 5000], or 0
Default: 0 (auto)
Changed in version 6.14: added 0 as a valid value for automatic selection, which is now default (was 150).
This option specifies the maximum allowed complexity of the approximator. Its value must be greater than or equal to the value of the GTApprox/HDAPMin option. The approximation algorithm selects the approximator with optimal complexity pOpt from the range [GTApprox/HDAPMin, GTApprox/HDAPMax]. Optimality here means that, depending on the complexity of the approximated function's behavior and the size of the available training sample, the constructed approximator with complexity pOpt fits this function best among approximators with complexity in the range [GTApprox/HDAPMin, GTApprox/HDAPMax]. Thus the GTApprox/HDAPMax value should be big enough to allow selecting the approximator complexity most appropriate for the considered problem. Note, however, that increasing the GTApprox/HDAPMax value may significantly increase training time and/or cause overtraining in some cases.
-
GTApprox/HDAPMin
Minimum allowed approximator complexity.
Works for HDA and HDAGP techniques.
Value: integer in range [0, GTApprox/HDAPMax] Default: 0 (auto) Changed in version 6.14: 0 is now a special value which enables automatic selection.
This option specifies the minimum allowed complexity of the approximator. Its value must be less than or equal to the value of the GTApprox/HDAPMax option. The approximation algorithm selects the approximator with optimal complexity pOpt from the range [GTApprox/HDAPMin, GTApprox/HDAPMax]. Optimality here means that, given the complexity of the approximated function's behavior and the size of the available training sample, the approximator with complexity pOpt fits this function better than any other approximator with complexity in the range [GTApprox/HDAPMin, GTApprox/HDAPMax]. Thus the GTApprox/HDAPMin value should not be too big, so that the most appropriate complexity remains selectable for the considered problem. Note that increasing the GTApprox/HDAPMin value may lead to a significant increase in training time and/or overtraining in some cases.
-
GTApprox/InputDomainType
Specifies the input domain for the model.
Value: Unbound, Manual, Box, or Auto Default: Unbound New in version 6.16.
Changed in version 6.17: added the Manual setting.
By default, a GTApprox model has an unlimited input domain — that is, model functions are defined everywhere, and model outputs are always numeric values. Such a model fits the training sample data but tends toward linear extrapolation outside the input space region covered by the training sample.
This option limits the input domain by adding input constraints to the model. The model then returns NaN outputs when inputs do not satisfy the constraints (the input point is outside the domain).
The input domain type can be:
- Unbound (default) — unlimited input domain, the same as in all pSeven Desktop versions prior to 6.16.
- Manual — a box-bound domain specified manually using the sample_lower_bounds and sample_upper_bounds ports (see the example in Training Data).
- Box — a box-bound domain that is an intersection of:
  - the training sample's bounding box, determined automatically by the block, and
  - the bounds specified by the sample_lower_bounds and sample_upper_bounds ports.
- Auto — a domain that is an intersection of:
  - the training sample's bounding box, determined automatically by the block,
  - the region bound by an ellipsoid that envelops the training sample, and
  - the bounds specified by the sample_lower_bounds and sample_upper_bounds ports.
If you use a limited input domain, the Auto type is recommended because it "cuts empty corners" from the sample's bounding box, so the input constraints represent the training data better.
Note
Models trained with the Piecewise Linear Approximation (PLA) technique have a limited input domain by default. These default constraints are more strict than the sample’s bounding box and the enveloping ellipsoid, so GTApprox/InputDomainType has no effect for PLA.
Note
Another option that adds input constraints to the model is GTApprox/OutputNanMode (when set to predict).
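The NaN-outside-domain behavior of a box-bound domain can be sketched in a few lines of Python. Note that `evaluate_with_box_domain` and the toy model are hypothetical illustrations, not part of the pSeven API:

```python
import math

def evaluate_with_box_domain(model_f, x, lower, upper):
    """Evaluate a model only inside a box-bound input domain.

    Returns NaN when any input violates its bounds, mimicking how a
    GTApprox model with a limited input domain behaves.
    """
    if any(xi < lo or xi > hi for xi, lo, hi in zip(x, lower, upper)):
        return float("nan")
    return model_f(x)

# Toy model y = x0 + x1 with the domain [0, 1] x [0, 1]:
toy = lambda x: x[0] + x[1]
print(evaluate_with_box_domain(toy, [0.5, 0.5], [0, 0], [1, 1]))  # 1.0
print(evaluate_with_box_domain(toy, [1.5, 0.5], [0, 0], [1, 1]))  # nan
```

The Auto domain additionally intersects the box with an enveloping ellipsoid, which is omitted here for brevity.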
-
GTApprox/InputNanMode
Specifies how to handle non-numeric values in the input part of the training sample.
Value: raise, ignore Default: raise New in version 6.8.
Changed in version 6.19: for the GBRT technique only, ignore means to accept points where some (but not all) inputs are NaN, and these points are actually used in training.
With the exception of the GBRT technique, GTApprox cannot obtain any information from non-numeric (NaN or infinity) values of variables. This option controls its behavior when encountering such values:
- Default (raise) raises an error.
- For the GBRT technique, ignore excludes data points containing infinity input values from the sample, and excludes points where all inputs are NaN. Points where only some inputs are NaN are kept and actually used in training.
- For all other techniques, ignore excludes all points with non-numeric values before training.
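The two flavors of ignore described above can be sketched as follows (`filter_inputs` is a hypothetical illustration of the documented filtering rules, not the actual GTApprox implementation):

```python
import math

def filter_inputs(points, gbrt=False):
    """Sketch of the 'ignore' behavior for non-numeric inputs.

    - GBRT: drop points with any infinite input or with ALL inputs NaN;
      keep points where only some inputs are NaN.
    - Other techniques: drop any point containing a non-numeric input.
    """
    kept = []
    for x in points:
        has_inf = any(math.isinf(v) for v in x)
        nan_count = sum(math.isnan(v) for v in x)
        if gbrt:
            if has_inf or nan_count == len(x):
                continue
        elif has_inf or nan_count > 0:
            continue
        kept.append(x)
    return kept

sample = [[1.0, 2.0],
          [float("nan"), 2.0],
          [float("nan"), float("nan")],
          [float("inf"), 1.0]]
print(len(filter_inputs(sample, gbrt=True)))   # 2 (keeps the partial-NaN point)
print(len(filter_inputs(sample, gbrt=False)))  # 1
```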
-
GTApprox/InputsTolerance
Specifies the tolerance to which each input variable is rounded.
Value: RealVector of length dim(X) Default: empty vector New in version 6.3.
If default, the option does nothing. Otherwise, each input variable in the training sample is rounded to the specified tolerance. Note that this may cause some points to merge.
See section Sample Cleanup for details.
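A rough sketch of this rounding (`round_inputs` is a hypothetical helper; the actual cleanup logic is internal to GTApprox):

```python
def round_inputs(x_sample, tolerance):
    """Round each input variable to the given per-variable tolerance.

    A tolerance of 0 leaves the variable unchanged. Rounding may make
    previously distinct points identical, so they merge during cleanup.
    """
    rounded = []
    for point in x_sample:
        rounded.append([
            v if tol == 0 else round(v / tol) * tol
            for v, tol in zip(point, tolerance)
        ])
    return rounded

pts = [[0.123, 5.0], [0.124, 5.0]]
print(round_inputs(pts, [0.01, 0]))  # both points round to [0.12, 5.0] and merge
```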
-
GTApprox/InternalValidation
Enable or disable internal validation.
Value: True or False Default: False
If this option is True then, in addition to the approximation, the constructed model contains a table of cross validation errors of different types, which may serve as a measure of approximation accuracy.
See the Model Validation chapter for details.
-
GTApprox/IVDeterministic
Controls the behavior of the pseudorandom algorithm selecting data subsets in cross validation.
Works only if GTApprox/InternalValidation is True.
Value: True or False Default: True New in version 5.0.
Cross validation involves partitioning the training sample into a number of subsets (defined by GTApprox/IVSubsetCount) and randomized combination of these subsets for each training (validation) session. Since the algorithm that combines subsets is pseudorandom, its behavior can be controlled in the following way:
- If GTApprox/IVDeterministic is True (deterministic cross validation mode, default), a fixed seed is used in the combination algorithm. The seed is set by GTApprox/IVSeed. This makes cross validation reproducible — a different combination is selected for each session, but if you repeat a cross validation run, for each session it will select the same combination as the first run.
- Alternatively, if GTApprox/IVDeterministic is False (non-deterministic cross validation mode), a new seed is generated internally for every run, so cross validation results may slightly differ. In this case, GTApprox/IVSeed is ignored. The generated seed that was actually used in cross validation can be found in the model info, so results can still be reproduced exactly by switching to the deterministic mode and setting GTApprox/IVSeed to this value.
The final model is never affected by GTApprox/IVDeterministic because it is always trained using the full sample.
-
GTApprox/IVSavePredictions
Save model values calculated during internal validation.
Works only if GTApprox/InternalValidation is True.
Value: True, False, or Auto Default: Auto New in version 2.0 Release Candidate 2.
If this option is True, internal validation information, in addition to error values, also contains raw validation data: model values calculated during internal validation, as well as validation inputs and outputs.
-
GTApprox/IVSeed
Fixed seed used in the deterministic cross validation mode.
Works only if GTApprox/InternalValidation is True.
Value: positive integer Default: 15313 New in version 5.0.
Fixed seed for the pseudorandom algorithm that selects the combination of data subsets for each cross validation session. GTApprox/IVSeed has an effect only if GTApprox/IVDeterministic is on — see its description for more details.
-
GTApprox/IVSubsetCount
The number of cross validation subsets.
Value: 0 (auto) or an integer in range [2,|S|], where |S| is the training sample size Default: 0 (auto) Changed in version 6.19: GTApprox/IVSubsetCount is no longer required to be less than GTApprox/IVTrainingCount, since the latter now sets an upper limit for the number of cross validation sessions instead of the exact number of sessions.
The number of subsets into which the training sample is divided for cross validation. The subsets are of approximately equal size.
GTApprox/IVSubsetCount cannot be set together with GTApprox/IVSubsetSize. Default (0) means that the number of subsets is determined by the sample size and GTApprox/IVSubsetSize. If both options are default, the number and size of subsets are selected automatically based on the sample size.
-
GTApprox/IVSubsetSize
The size of a cross validation subset.
Value: 0 (auto) or an integer in range [1, 2/3·|S|], where |S| is the training sample size Default: 0 (auto) New in version 6.19.
The size of a sample subset used as test data in a cross validation session. This option may be more convenient than GTApprox/IVSubsetCount when the training sample size is not known or is a parameter. In such cases, GTApprox can automatically determine the required number of subsets, given their size. If the sample cannot be evenly divided into subsets of the given size, the sizes of some subsets are adjusted to fit. The maximum valid option value is 2/3 of the sample size; however, in this case the actual subset size is adjusted to 1/2 of the sample size.
Practically, this option configures leave-n-out cross validation, where n is the option value. Since the number of subsets — hence the number of cross validation sessions — can get too high for small n, it is recommended to limit the number of sessions with GTApprox/IVTrainingCount. Otherwise model training may take a long time, because each session trains a dedicated internal validation model.
GTApprox/IVSubsetSize cannot be set together with GTApprox/IVSubsetCount. Default (0) means that the subset size is determined by the sample size and GTApprox/IVSubsetCount. If both options are default, the number and size of subsets are selected automatically based on the sample size.
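As a rough illustration, the number of leave-n-out subsets implied by a given subset size can be estimated with ceiling division. This is a simplification; the exact subset-size adjustment rules are internal to GTApprox:

```python
def subset_count_for_size(sample_size, subset_size):
    """Estimate the number of cross validation subsets for a given
    subset size: ceil(sample_size / subset_size). When the sample does
    not divide evenly, some subset sizes are adjusted to fit.
    """
    return -(-sample_size // subset_size)  # ceiling division

print(subset_count_for_size(100, 10))  # 10 -> up to 10 validation sessions
print(subset_count_for_size(100, 30))  # 4  (subset sizes adjusted to fit)
```

This also shows why small n (small subsets) can make training slow: the subset count, and thus the number of sessions to limit via GTApprox/IVTrainingCount, grows as n shrinks.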
-
GTApprox/IVTrainingCount
The maximum allowed number of training sessions in cross validation.
Value: positive integer or 0 (auto) Default: 0 (auto) Changed in version 6.19: now sets an upper limit instead of the exact number of sessions, and is no longer required to be less than GTApprox/IVSubsetCount.
Each GTApprox cross validation session includes the following steps:
- Select one of the cross validation subsets to be the test data.
- Prepare the complement of the selected subset, which is the training sample excluding the test data.
- Train an internal validation model, using this complement as the training sample — so the test data is excluded from training.
- Calculate error metrics for the validation model, using the previously selected test data subset.
Internal validation repeats such sessions with different test subsets, until the number of sessions reaches GTApprox/IVTrainingCount, or there are no more subsets to test (each subset may be tested only once).
The number and sizes of cross validation subsets are determined by GTApprox/IVSubsetCount and GTApprox/IVSubsetSize, and are selected by GTApprox if both these options are default. If GTApprox/IVTrainingCount is also default, GTApprox sets an appropriate limit for the number of sessions, based on the training sample size.
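The session loop described above can be sketched as follows. Here `train` and `error` stand in for the internal model-training and error-metric routines, and the strided partitioning is a simplification:

```python
import statistics

def cross_validate(sample, subset_count, max_sessions, train, error):
    """Sketch of the cross validation loop: each session holds out one
    subset as test data, trains on the complement, and scores the model.
    The number of sessions is capped (GTApprox/IVTrainingCount analogue).
    """
    subsets = [sample[i::subset_count] for i in range(subset_count)]
    errors = []
    for i, test in enumerate(subsets):
        if i >= max_sessions:  # upper limit on the number of sessions
            break
        # Complement of the selected subset: test data excluded from training.
        complement = [p for j, s in enumerate(subsets) if j != i for p in s]
        model = train(complement)
        errors.append(error(model, test))
    return errors

# Toy usage: the "model" is just the mean of training outputs, error is MAE.
data = [(x, 2.0 * x) for x in range(10)]
train = lambda pts: statistics.mean(y for _, y in pts)
err = lambda m, pts: statistics.mean(abs(y - m) for _, y in pts)
print(len(cross_validate(data, subset_count=5, max_sessions=3,
                         train=train, error=err)))  # 3
```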
-
GTApprox/LinearityRequired
Require the model to be linear.
Value: True or False Default: False
If this option is True, then the approximation is constructed as a linear function that fits the training data optimally. If the option is False, then no condition related to linearity is imposed on the approximation: it can be either linear or non-linear, depending on which one fits the training data best.
Note
The TGP technique does not support linear models: if GTApprox/Technique is TGP, GTApprox/LinearityRequired should be False.
-
GTApprox/LogLevel
Set minimum log level.
Value: Debug, Info, Warn, Error, Fatal Default: Info
If this option is set, only messages with a log level greater than or equal to the threshold are written to the log.
-
GTApprox/MaxExpectedMemory
Maximum expected amount of memory (in GB) allowed for model training.
Value: positive integer or 0 (no limit) Default: 0 (no limit) New in version 6.4.
This option currently works for the GBRT technique only.
GTApprox/MaxExpectedMemory is intended to avoid the case when a long training process fails due to memory overflow, spending much time and giving no results. If GTApprox/MaxExpectedMemory is not default, GTApprox tries to estimate the expected memory usage at each stage of the training algorithm, and if the estimate exceeds the option value, the training is suspended: the process stops and returns a “partially trained” model that then can be trained incrementally (see Incremental Training).
With GTApprox/MaxExpectedMemory set, it is also possible that the training sample is so big that it can never be processed within the allowed amount of memory; in this case, training does not start.
If GTApprox/MaxExpectedMemory is default (0, no limit) or training technique is not GBRT, then GTApprox does not try to prevent memory overflow.
-
GTApprox/MaxAxisRotations
Use rotation transformations in the input space to iteratively improve model quality.
Value: 0 (no rotations), -1 (auto), or a positive integer (the maximum number of rotations) Default: 0 (no rotations) New in version 6.12.
This option enables a special training mode that can improve quality of models trained using Gaussian processes-based techniques (HDA, GP, HDAGP, and SGP) in some cases where the training sample is non-uniformly distributed. After training an initial model, it evaluates model gradients in the training points and uses the principal component analysis algorithm to create a model input projection matrix. Then it applies the input transformation and trains a new model that improves the initial one. The process is repeated until an internal quality criterion is satisfied or the maximum number of iterations is reached. The final model is a weighted combination of all models trained in the process.
Option values are:
- 0 (default): iterative training is disabled.
- -1 (auto): selects the number of iterations automatically with respect to the approximation technique, training dataset size, GTApprox/Accelerator value, and the GTApprox/ExactFitRequired setting.
- Any other value sets the maximum allowed number of iterations explicitly. The process may finish before this maximum is reached, if the internal quality criterion is satisfied.
This option works with the HDA, GP, HDAGP, and SGP techniques only. Note that enabling it can significantly increase training time, since a new model is trained internally on each iteration.
-
GTApprox/MaxParallel
Sets the maximum number of parallel threads to use when training a model.
Value: integer in range [1,512], or 0 (auto) Default: 0 (auto) New in version 5.0 Release Candidate 1.
ApproxBuilder can use parallel calculations to speed up model training. This option sets the maximum number of threads the block is allowed to create.
Changed in version 6.0: auto (0) sets the number of threads to 1 for small training samples.
Changed in version 6.12: auto (0) tries to detect hyper-threading CPUs in order to use only physical cores.
Changed in version 6.15: added the upper limit for the option value, previously was any positive integer.
Changed in version 6.17: changed the upper limit to 512 (was 100000).
Default (auto) behavior depends on the value of the OMP_NUM_THREADS environment variable.
If OMP_NUM_THREADS is set to a valid value, this value is the maximum number of threads by default. Note that OMP_NUM_THREADS must be set before you launch pSeven Desktop.
If OMP_NUM_THREADS is unset, set to 0, or set to an invalid value, the default maximum number of threads is equal to the number of cores detected by pSeven Desktop. However, there are two exceptions:
- Parallelization becomes inefficient in the case of a small training sample. For small training samples, only 1 thread is used by default.
- On hyper-threading CPUs, using all logical cores has been found to negatively affect training performance. If a hyper-threading CPU is detected, the default maximum number of threads is set to half the number of cores (to use only physical cores).
The behavior described above is only for the default (0) option value. If you set this option to a non-default value, it will be the maximum number of threads, regardless of the sample size and your CPU.
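The documented auto behavior can be summarized in a sketch. `default_max_threads` is a hypothetical illustration; the actual small-sample and hyper-threading heuristics are internal to pSeven Desktop:

```python
import os

def default_max_threads(detected_cores, sample_is_small, hyper_threading):
    """Sketch of the documented auto (0) behavior of GTApprox/MaxParallel."""
    env = os.environ.get("OMP_NUM_THREADS", "")
    try:
        n = int(env)
    except ValueError:
        n = 0
    if n > 0:                   # a valid OMP_NUM_THREADS wins
        return n
    if sample_is_small:         # parallelization inefficient for small samples
        return 1
    if hyper_threading:         # use physical cores only
        return detected_cores // 2
    return detected_cores

os.environ["OMP_NUM_THREADS"] = ""
print(default_max_threads(8, sample_is_small=False, hyper_threading=True))  # 4
```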
-
GTApprox/MoACovarianceType
Type of covariance matrix to use when creating the Gaussian Mixture Model for the Mixture of Approximators technique.
Value: Full, Tied, Diag, Spherical, BIC, or Auto Default: Auto New in version 1.10.0.
Changed in version 6.11: added the Auto value, which is now default (previously the default was BIC).
- Full — all covariance matrices are positive semidefinite and symmetric.
- Tied — all covariance matrices are positive semidefinite, symmetric, and equal.
- Diag — all covariance matrices are diagonal.
- Spherical — diagonal matrices with equal elements on the diagonal.
- BIC — the type of covariance matrix and the effective number of clusters are selected according to the Bayesian Information Criterion.
- Auto — the optimal covariance type for each possible number of clusters is chosen according to the clustering quality.
This option allows the user to control the accuracy and training time of the MoA technique. For example, if it is known that the design space consists of regions of regularity having similar structure, it may be reasonable to use the Tied matrix type for Gaussian Mixture Models. Full has the slowest training time, while Diag and Spherical have the fastest. In BIC mode, Gaussian Mixture Models are constructed for all types of covariance matrices and numbers of clusters, and the best one in the sense of the Bayesian Information Criterion (BIC) is chosen. In Auto mode, optimal covariance types are selected for each possible number of clusters according to the clustering quality, based on an assessment of cluster tightness and separation.
-
GTApprox/MoANumberOfClusters
Sets the number of design space clusters.
Works only for the Mixture of Approximators technique.
Value: IntVector Default: empty vector (auto) New in version 1.10.0.
New in version 1.11.0: empty vector is also a valid value that selects the number of clusters automatically.
If set, the effective number of clusters is selected from the vector values according to Bayesian Information Criterion (BIC). To set a fixed number of clusters, you may specify a vector containing a single positive integer.
Default (empty vector) selects the number of clusters automatically, based on the training sample size and input dimension.
-
GTApprox/MoAPointsAssignment
Select the technique for assigning points to clusters.
Works only for the Mixture of Approximators technique, see Design Space Decomposition.
Value: Probability or Mahalanobis Default: Probability New in version 1.10.0.
- Probability corresponds to point assignment based on posterior probability.
- Mahalanobis corresponds to point assignment based on Mahalanobis distance.
For the Mahalanobis distance based technique, the confidence value α may be changed using the GTApprox/MoAPointsAssignmentConfidence option.
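For reference, the squared Mahalanobis distance used by this assignment technique is d² = (x − μ)ᵀ Σ⁻¹ (x − μ), where μ and Σ are the cluster mean and covariance. A minimal sketch for the 2D case (illustrative only, not the MoA implementation):

```python
def mahalanobis_sq_2d(x, mean, inv_cov):
    """Squared Mahalanobis distance for a 2D point, with the 2x2 inverse
    covariance matrix written out explicitly. Points within a
    confidence-dependent threshold are assigned to the cluster.
    """
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return (dx[0] * (inv_cov[0][0] * dx[0] + inv_cov[0][1] * dx[1])
            + dx[1] * (inv_cov[1][0] * dx[0] + inv_cov[1][1] * dx[1]))

# With the identity covariance the distance reduces to squared Euclidean:
print(mahalanobis_sq_2d([1.0, 1.0], [0.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0]]))  # 2.0
```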
-
GTApprox/MoAPointsAssignmentConfidence
This option sets the confidence value for the points assignment technique based on Mahalanobis distance.
Works only for the Mixture of Approximators technique, see Design Space Decomposition.
Value: floating point number in range (0,1) Default: 0.97 New in version 1.10.0.
This option allows you to control the size of clusters: the greater the value, the greater the cluster size.
-
GTApprox/MoATechnique
This option specifies the approximation technique for local models created by the Mixture of Approximators (MoA) technique.
Value: SPLT, HDA, GP, HDAGP, SGP, TA, iTA, TGP, RSM, GBRT, PLA, or Auto Default: Auto
New in version 1.10.0.
Changed in version 5.1: added GBRT to the list of available techniques.
Changed in version 6.3: added PLA to the list of available techniques.
This option allows you to control the local approximation technique. It sets the same technique for all local models.
Note that MoA performs sample clustering and the resulting subsamples may lose certain properties of the input sample. For example, it is possible that the input sample has tensor structure, but the subsamples do not, so the TA and iTA techniques become inapplicable to local models.
-
GTApprox/MoATypeOfWeights
This option sets the type of weighting used for “gluing” local approximations to obtain the final model.
Works only for Mixture of Approximators, see Calculating Model Output.
Value: Probability or Sigmoid Default: Probability New in version 1.10.0.
- Probability corresponds to weights based on posterior probability.
- Sigmoid corresponds to weights based on a sigmoid function.
Sigmoid weighting can be fine-tuned with GTApprox/MoAWeightsConfidence.
-
GTApprox/MoAWeightsConfidence
This option sets the confidence for sigmoid-based weights.
Works only for Mixture of Approximators, see Calculating Model Output.
Value: floating point number in range (0,1); must be greater than GTApprox/MoAPointsAssignmentConfidence Default: 0.99 New in version 1.10.0.
This option controls the smoothness of weights: the greater the value, the smoother the weights, providing a smoother approximation.
-
GTApprox/OutputNanMode
Specifies how to handle non-numeric values in the output part of the training sample.
Value: raise, ignore, or predict Default: raise New in version 6.8.
By convention, NaN output values signify undefined function behavior. This option controls whether the model should try to predict undefined behavior. If set to predict, NaN values in training sample outputs are accepted, and the model will return NaN values in regions close to those points for which the training sample contained NaN output values. Default (raise) means that NaN output values are not accepted and the block raises an error and cancels training if they are found; ignore means that such points are excluded from the sample, and training continues.
Note
When set to predict, this option adds specific input constraints to the model. These constraints are combined with the constraints added by GTApprox/InputDomainType.
-
GTApprox/OutputTransformation
Apply transformation to the training sample output data prior to training the model.
Value: a StringScalar or StringVector specifying the transformation Default: empty string (no transformations) New in version 6.13 Service Pack 1.
Changed in version 6.47: by default, ApproxBuilder runs statistical tests and applies transformation depending on the test results (changed the default to auto); this change is incompatible with 6.46 and earlier versions as it changes the default behavior (was none — do not apply transformation).
Changed in version 6.48: reverted the incompatible change from 6.47: by default, ApproxBuilder uses the output transformation settings stored in the initial model, or does not apply transformation if no initial model is given; added the corresponding value — an empty string, which is now default; none explicitly prohibits the transformation; this behavior is compatible with 6.47 as well as with 6.46 and earlier.
The block can apply log transformation to the output data from the training sample prior to training the model. This is intended to improve the accuracy of models trained with data samples where the values of some outputs are exponentially distributed. For such outputs, log transformation can reduce the distribution skew, resulting in a more accurate approximation. The model is then trained on the transformed data and automatically applies the reverse transformation when evaluated.
The transformation that the block can use for a given output depends on whether you have set the output thresholds for that output:
- If the output thresholds are not set, uses log transformation of the form y* = sgn(y)·ln(|y| + 1), where sgn is the sign function.
- If only one of the thresholds is set (lower y_min or upper y_max), uses log transformation of the form y* = ln(max(y − y_min, ε)) or y* = ln(max(y_max − y, ε)), respectively.
- If both thresholds are set, uses the logit transformation: y* = ln(max(y − y_min, ε) / max(y_max − y, ε)).
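A minimal sketch of these transformations in Python (the epsilon guard value ε used internally is not documented, so a placeholder is assumed):

```python
import math

EPS = 1e-12  # placeholder; the actual internal epsilon is not documented

def transform_output(y, y_min=None, y_max=None):
    """The three output transformations, depending on which thresholds
    are set for the output: log (no threshold / one threshold) or logit
    (both thresholds)."""
    if y_min is None and y_max is None:
        return math.copysign(math.log(abs(y) + 1.0), y)   # sgn(y)*ln(|y|+1)
    if y_min is not None and y_max is None:
        return math.log(max(y - y_min, EPS))
    if y_min is None and y_max is not None:
        return math.log(max(y_max - y, EPS))
    # Both thresholds set: logit transformation.
    return math.log(max(y - y_min, EPS) / max(y_max - y, EPS))

print(transform_output(0.0))                                   # 0.0
print(round(transform_output(5.0, y_min=0.0, y_max=10.0), 6))  # 0.0 (logit midpoint)
```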
Available settings are (note that all strings in settings are case-sensitive):
- auto — allow automatically applying transformation to an output.
  - If you do not use an initial model in training, the block runs a statistical test for the given output prior to training the model. If the test shows that the distribution of output values is statistically similar to an exponential distribution, the block applies transformation to the tested output. In that case, the applied transformation depends on the output threshold settings as described above.
  - If you use an initial model, auto adjusts to the settings previously used when training that model:
    - If the initial model has been trained with transformation explicitly required (lnp1) or prohibited (none) for the given output, the block keeps and uses that setting.
    - Otherwise, runs the statistical test mentioned above to determine whether transformation should be applied to the given output. Similarly, the applied transformation (if any) depends on the output threshold settings.
- lnp1 — require applying transformation to an output. The applied transformation depends on the output threshold settings as described above. This setting persists if you later update the model (use it as an initial one in another training session), unless you override the initial lnp1 with a none setting.
- none — prohibit applying transformation to an output. This setting persists if you later update the model (use it as an initial one in another training session), unless you override the initial none with a lnp1 setting.
- An empty string (default) — do not change the transformation settings. If you do not use an initial model, this setting does not apply the transformation, but does not persist later (when updating the model), as any other setting overrides it. If you use an initial model, the block keeps and uses the setting found in the initial model.
If the option value is a StringScalar, the same setting is applied to all outputs. For example, setting this option to auto means that the block will analyze all sample outputs and automatically apply the log transformation to those outputs that are statistically similar to an exponential distribution.
To apply different settings per output, use the StringVector form of this option. The length of this vector must be equal to the number of columns in the matrix received to the f_sample port (the output part of the training sample). Each element of the vector specifies the type of transformation for the respective output column. For example, if there are 3 training outputs, then the ("auto", "lnp1", "none") setting means that the block will analyze the first output in order to decide whether to apply the transformation to it, the log transformation is persistently applied to the second output, and for the third output the transformation is persistently prohibited.
Note
The GBRT and HDAGP techniques do not support changing the output transformation settings when updating the model: if you train a model with the GBRT or HDAGP technique and use an initial model, the new model must be trained with the same output transformation settings as the initial one. In such cases, since you cannot change any of the output transformation settings, it is recommended to keep the GTApprox/OutputTransformation option default (empty) when updating your model.
-
GTApprox/PartialDependentOutputs/RRMSThreshold
Specifies the RRMS error threshold for the internal model of linear dependency between outputs.
Value: floating point number in range [1e-15, 1] Default: 1e-5 New in version 6.29.
When search for linear dependencies between outputs in the training data is enabled (GTApprox/DependentOutputs is set to PartialLinear), GTApprox attempts to fit output data to a linear model (see Output Dependency Modes). This option sets the maximum allowed error of that model; if GTApprox cannot reach the error threshold, it is assumed that there is no linear dependency between outputs.
-
GTApprox/RSMElasticNet/L1_ratio
Specifies the ratio between L1 and L2 regularization.
Works only for the Response Surface Model technique.
Value: RealVector containing values in range [0, 1] Default: empty vector New in version 6.1.
Each vector element sets the trade-off between L1 and L2 regularization: 1 means L1 regularization only, while 0 means L2 regularization only. The best value among those given is chosen via a cross-validation procedure. If none is given (default), RSM with pure L1 regularization is constructed.
-
GTApprox/RSMFeatureSelection
Specifies the regularization and term selection procedures.
Value: LS, RidgeLS, MultipleRidgeLS, ElasticNet, StepwiseFit, or Auto Default: Auto
Changed in version 6.17: added the Auto setting, which is now default.
Sets the technique to use for regularization and term selection:
- LS — ordinary least squares (no regularization, no term selection).
- RidgeLS — least squares with Tikhonov regularization (no term selection).
- MultipleRidgeLS — multiple ridge regression that also filters non-important terms.
- ElasticNet — linear combination of L1 and L2 regularizations.
- StepwiseFit — ordinary least squares regression with stepwise inclusion and exclusion for term selection.
- Auto — primarily intended for compatibility with SmartSelection, where it explicitly "unlocks" the option for smart tuning. When set manually, Auto defaults to RidgeLS.
-
GTApprox/RSMMapping
Selects the type of sample data pre-processing in RSM.
Value: None, MapStd, or MapMinMax Default: MapStd
In RSM, specific pre-processing is usually applied to the sample data to improve accuracy. You can use this option to select the pre-processing method or disable it.
- MapStd (default) — standardizes each variable (independently) by converting its sample values to the z-score, estimated as z = (x − x̄)/S, where x̄ is the sample mean and S is the sample standard deviation for this variable.
- MapMinMax — maps the values of each variable to the [−1, 1] range, independently.
- None — disables pre-processing.
This option applies to the RSM technique only.
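Both mappings are standard and can be sketched as follows (illustrative only; GTApprox applies them internally):

```python
import statistics

def map_std(values):
    """MapStd: z-score standardization z = (x - mean) / std, per variable."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return [(v - m) / s for v in values]

def map_min_max(values):
    """MapMinMax: linear mapping of a variable onto [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

col = [1.0, 2.0, 3.0, 4.0]
print(map_min_max(col))   # endpoints map to -1.0 and 1.0
print(sum(map_std(col)))  # standardized values sum to ~0 (centered)
```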
-
GTApprox/RSMStepwiseFit/inmodel
Selects the starting model for stepwise-fit regression.
Works only for Response Surface Model technique, see GTApprox/RSMFeatureSelection.
Value: IncludeAll, ExcludeAll, or Auto Default: Auto
Changed in version 6.14: added the Auto setting, which is now default.
This option specifies the terms initially included in the model when stepwise-fit regression is used (GTApprox/RSMFeatureSelection is set to StepwiseFit).
- IncludeAll starts with a full model (all terms included).
- ExcludeAll assumes none of the terms are included at the starting step.
- Auto selects the type of the initial model automatically according to the number of terms. If the number of terms is low enough, regression starts with a full model (similar to IncludeAll); otherwise, if the number of terms is high, no terms are included in the initial model (similar to ExcludeAll).
Depending on the terms included in the initial model and the order in which terms are moved in and out, the method may build different models from the same set of potential terms.
-
GTApprox/RSMStepwiseFit/penter
Specifies p-value of inclusion for stepwise-fit regression.
Works only for Response Surface Model technique.
Value: floating point number in range (0, GTApprox/RSMStepwiseFit/premove] Default: 0.05
Option value is the maximum p-value of the F-test for a term to be added into the model. Generally, the higher the value, the more terms are included into the final model.
-
GTApprox/RSMStepwiseFit/premove
Specifies p-value of exclusion for stepwise-fit regression.
Works only for Response Surface Model technique.
Value: floating point number in range [GTApprox/RSMStepwiseFit/penter, 1) Default: 0.10
Option value is the minimum p-value of the F-test for a term to be removed from the model. Generally, the higher the value, the more terms are included into the final model.
-
GTApprox/RSMType
Specifies the type of a response surface model.
Value: Linear, Interaction, Quadratic, PureQuadratic, or Auto
Default: Auto (auto selection)
Changed in version 6.8: default is Linear (was PureQuadratic).
Changed in version 6.17: added the Auto setting, which is now the default.
This option restricts the type of terms that may be included into the regression model.
- Linear — only constant and linear terms may be included.
- Interaction — constant, linear, and interaction terms may be included.
- Quadratic — constant, linear, interaction, and quadratic terms may be included.
- PureQuadratic — only constant, linear, and quadratic terms may be included (interaction terms are excluded).
- Auto is primarily intended for compatibility with SmartSelection, where it explicitly “unlocks” the option for smart tuning. When set manually, Auto defaults to Linear.
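The model types differ in how many candidate terms they admit for d inputs. The counts below follow the usual response-surface conventions (constant, linear, pairwise interaction, and squared terms); treat this as an illustration, not a statement about GTApprox internals.

```python
def rsm_term_count(d, rsm_type="Linear"):
    """Number of candidate regression terms for d inputs, by GTApprox/RSMType.

    Standard polynomial term counts; illustration only.
    """
    pairs = d * (d - 1) // 2  # interaction terms x_i * x_j with i < j
    counts = {
        "Linear": 1 + d,                 # constant + linear
        "Interaction": 1 + d + pairs,    # + pairwise interactions
        "PureQuadratic": 1 + d + d,      # + squared terms, no interactions
        "Quadratic": 1 + d + pairs + d,  # full quadratic model
    }
    return counts[rsm_type]
```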
-
GTApprox/Seed
Fixed seed used in the deterministic training mode.
Value: positive integer
Default: 15313
New in version 5.0.
In the deterministic training mode, GTApprox/Seed sets the seed for randomized initialization algorithms in certain techniques. See GTApprox/Deterministic for more details.
-
GTApprox/SGPNumberOfBasePoints
The number of base points used to approximate the full covariance matrix of the points from the training sample.
Works only for Sparse Gaussian Process technique.
Value: integer in range [1, 4000]
Default: 1000
Changed in version 6.14: upper limit is now 4000 (was 2^{31}-2).
Base points (a subset of regressors) are selected randomly from the training sample and used for a reduced rank approximation of the full covariance matrix of the training points. The reduced rank approximation is computed using the Nyström method for the selected subset of regressors. Note that if the value of this option is greater than the dataset size, the GP technique is used instead of SGP.
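The reduced-rank idea can be sketched in a few lines of NumPy. This shows only the Nyström construction itself: the kernel, the base-point selection, and the regularization used by GTApprox's SGP implementation are internal and may differ.

```python
import numpy as np


def nystrom_approx(X, m, length_scale=1.0, seed=0):
    """Reduced rank (Nystrom) approximation of an RBF covariance matrix.

    Picks m random base points from the sample X and approximates the
    full n x n kernel matrix as K_nm @ inv(K_mm) @ K_nm.T.
    Illustration only, not the GTApprox SGP algorithm.
    """
    rng = np.random.default_rng(seed)
    base = rng.choice(len(X), size=m, replace=False)

    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * length_scale ** 2))

    K_nm = rbf(X, X[base])
    K_mm = rbf(X[base], X[base]) + 1e-10 * np.eye(m)  # jitter for stability
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)
```

When m equals the sample size, the construction reproduces the full covariance matrix (up to the jitter), which matches the note above: with enough base points, SGP degenerates into plain GP.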
-
GTApprox/SPLTContinuity
Smoothness requirement for SPLT approximation.
Value: C1, C2, or Auto
Default: Auto
Changed in version 6.17: added the Auto setting, which is now the default.
Sets the smoothness requirement for the SPLT technique (see 1D Splines with tension).
- C1 requires a continuous first derivative.
- C2 requires a continuous second derivative.
- Auto is primarily intended for compatibility with SmartSelection, where it explicitly “unlocks” the option for smart tuning. When set manually, Auto defaults to C2.
This option is ignored by all techniques other than SPLT.
-
GTApprox/StoreTrainingSample
Save a copy of training data with the model.
Value: True, False, or Auto
Default: Auto
New in version 6.6.
If True, the trained model will store a copy of the training sample. If False, the training sample data will not be available from the model. The Auto setting currently defaults to False.
Note that in the case of GBRT incremental training (see Incremental Training), setting GTApprox/StoreTrainingSample saves only the last (most recent) training sample on each training iteration.
-
GTApprox/SubmodelTraining
Select whether to train submodels in parallel or sequentially.
Value: Sequential, Parallel, FastParallel, or Auto
Default: Auto
New in version 6.14.
Changed in version 2024.02: added the FastParallel mode, supported by the GBRT technique only.
This option can be used to force or disable parallel training of submodels.
- Sequential: different submodels are never trained simultaneously. Parallel threads will be used only if the selected approximation technique supports parallelization internally (on the algorithm level).
- Parallel: parallel threads will be used to train multiple submodels simultaneously. If some submodel is trained by a technique that supports parallelization internally, it can use several threads if available.
- FastParallel (for GBRT only): a faster parallel training implementation, non-deterministic. The general behavior is the same as Parallel. Might improve performance when training large GBRT models with a high number of outputs. This mode does not guarantee reproducibility: two GBRT models trained in this mode — with the same data, settings, and under the same conditions — may nevertheless be different even if GTApprox/Deterministic is on.
- Auto (default): selects either Sequential or Parallel, depending on the other approximation settings and the training sample properties. Never selects FastParallel, thus disabling it by default in SmartSelection (use the Advanced option… hint to enable).
See Submodels and Parallel Training for details.
-
GTApprox/TALinearBSPLExtrapolation
Use linear extrapolation for BSPL factors.
Works for Tensor Products of Approximations and Incomplete Tensor Products of Approximations techniques.
Value: True, False, or Auto
Default: Auto
New in version 1.9.4.
This option allows you to switch the extrapolation type for BSPL factors to linear. By default, BSPL factors extrapolate to constant.
- True: use linear extrapolation in the range specified by GTApprox/TALinearBSPLExtrapolationRange, falling back to constant extrapolation outside this range.
- False: do not use linear extrapolation (always use constant extrapolation).
- Auto: defaults to False.
This option affects only the Tensor Products of Approximations (including Incomplete Tensor Products of Approximations) models that contain BSPL factors. It does not affect non-BSPL factors at all, and if a Tensor Products of Approximations model is built using only non-BSPL factors, this option is ignored.
-
GTApprox/TALinearBSPLExtrapolationRange
Sets linear BSPL extrapolation range.
Works for Tensor Products of Approximations and Incomplete Tensor Products of Approximations techniques.
Value: floating point number greater than 0
Default: 1.0
New in version 1.9.4.
Sets the range in which the BSPL factors extrapolation will be linear (see GTApprox/TALinearBSPLExtrapolation) relatively to the variable range of this factor in the training sample. This setting “expands” the sample range: let x_{min} and x_{max} be the minimum and maximum value of a variable found in the sample (BSPL factors are always 1-dimensional), then the extrapolation range is (x_{max} - x_{min}) \cdot (1 + 2r), where r is the GTApprox/TALinearBSPLExtrapolationRange option value (the range is expanded by (x_{max} - x_{min}) \cdot r on each bound).
This option affects only the Tensor Products of Approximations (including Incomplete Tensor Products of Approximations) models that contain BSPL factors, and only if GTApprox/TALinearBSPLExtrapolation is set to True. It does not affect non-BSPL factors at all, and if a Tensor Products of Approximations model is built using only non-BSPL factors, this option is ignored.
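The range arithmetic above is easy to check numerically. A minimal sketch (the function name is ours) computing the bounds of the linear-extrapolation region from the training range and the option value r:

```python
def linear_extrapolation_bounds(x_min, x_max, r=1.0):
    """Bounds of the linear-extrapolation range for a BSPL factor.

    The training range [x_min, x_max] is expanded by (x_max - x_min) * r
    on each side, giving a total width of (x_max - x_min) * (1 + 2 * r),
    where r is GTApprox/TALinearBSPLExtrapolationRange.
    """
    span = x_max - x_min
    return x_min - r * span, x_max + r * span
```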
-
GTApprox/TAModelReductionRatio
Sets the ratio of model complexity reduction.
Value: floating point number in [1, \infty), or 0 (auto)
Default: 0 (auto)
New in version 6.2.
Sets the complexity (the number of basis functions) for TA and iTA models as a ratio to the default complexity (see Model Complexity Reduction). For example, 2 sets the number of basis functions to half the default. The reduction affects only BSPL factors; all other factors ignore this option.
This option slightly increases model size but reduces memory consumption in model evaluation and the size of the model exported to C or Octave. Model accuracy decreases in most cases.
Note that there is a lower limit for model complexity, so the actual reduction ratio may be less than the GTApprox/TAModelReductionRatio value you set.
The exact fit requirement may be impossible to satisfy if GTApprox/TAModelReductionRatio has any meaningful non-default value (greater than 1). Generally, this option is not compatible with exact fit.
-
GTApprox/Technique
Specify the approximation algorithm to use.
Value: RSM, SPLT, HDA, GP, SGP, HDAGP, TA, iTA, TGP, MoA, GBRT, PLA, TBL, or Auto
Default: Auto
New in version 1.9.2: added the incomplete Tensor Approximation technique.
New in version 1.10.0: added the Mixture of Approximators technique.
New in version 3.0 Release Candidate 1: added the Tensor Gaussian Processes technique.
New in version 5.1: added the Gradient Boosted Regression Trees technique.
New in version 6.3: added the Piecewise Linear Approximation technique.
New in version 6.8: added the Table Function technique.
Changed in version 6.8: removed the deprecated Linear Regression (LR) technique. This technique is no longer supported; instead, use RSM with GTApprox/RSMType set to Linear.
This option allows you to explicitly specify the algorithm to be used in approximation. Its default value is Auto, meaning that the tool will automatically determine and use the best algorithm (except TGP and GBRT, which are never selected automatically, and TA and iTA, which are by default excluded from automatic selection — see GTApprox/EnableTensorFeature). Manual settings are:
- RSM — Response Surface Model
- SPLT — 1D Splines with tension
- HDA — High Dimensional Approximation
- GP — Gaussian Processes
- SGP — Sparse Gaussian Process
- HDAGP — High Dimensional Approximation combined with Gaussian Processes
- TA — Tensor Products of Approximations
- iTA — Incomplete Tensor Products of Approximations (added in 1.9.2)
- TGP — Tensored Gaussian Processes (added in 3.0 Release Candidate 1)
- MoA — Mixture of Approximators (added in 1.10.0)
- GBRT — Gradient Boosted Regression Trees (added in 5.1)
- PLA — Piecewise Linear Approximation (added in 6.3)
- TBL — Table Function (added in 6.8)
Sample size requirements taking effect when the approximation technique is selected manually are described in section Sample Size Requirements.
Note
Smart training of the GBRT technique can be time-consuming even for small training samples. Details on smart training can be found in section Smart Training.
-
GTApprox/TensorFactors
Describes tensor factors to use in the Tensor Approximation technique.
Value: list of factor definitions
Default: empty list (automatic factorization)
This option allows you to specify your own factorization of the input when the TA technique is used. It can also be used with TGP, but in that case it does not allow you to change factor techniques, except for specifying discrete variables. iTA and other techniques ignore this option completely.
Note
The incomplete tensor approximation (iTA) technique ignores factorization specified by GTApprox/TensorFactors because it always uses 1-dimensional BSPL factors. The tensor Gaussian processes (TGP) technique applies factorization, but in this case the option value cannot include technique labels (see below). The only valid label for TGP is DV.
Option value is a list of user-defined tensor factors, each factor being a subset of input dataset components selected by the user. A factor is defined by a list of component indices and optionally includes a label, specifying the approximation technique to use, as the last element of the list. Indices are zero-based; lists are comma-separated and enclosed in square brackets.
For example, [[0, 2], [1, "BSPL"]] specifies factorization of a 3-dimensional input dataset into two factors. The first factor includes the first and third components, and the approximation technique for this factor will be selected automatically (no technique specified by the user). The second factor includes the second component, and splines (the BSPL label) will be used in the approximation of this factor.
Technique label must be the last element of the list defining a factor. Valid labels are:
- Auto - automatic selection (same as no label).
- BSPL - use 1-dimensional cubic smoothing splines.
- GP - use Gaussian processes.
- SGP - use Sparse Gaussian Process (added in 6.2).
- HDA - use high dimensional approximation.
- LR - linear approximation (linear regression).
- LR0 - constant approximation (zero order linear regression).
- DV - discrete variable. The only valid label for the tensor Gaussian processes (TGP) technique.
Note
The splines technique (BSPL) is available only for 1-dimensional factors.
Note
For factors using sparse Gaussian processes (SGP), the number of base points is specified by GTApprox/SGPNumberOfBasePoints. Note that this number is the same for all SGP factors. If a factor’s cardinality is less than the number of base points, a warning is generated and the Gaussian processes (GP) technique is used for this factor instead.
Factorization has to be full (it has to include all components). If some component is not included in any of the factors, an exception is raised.
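The "full factorization" requirement can be sketched as a small validity check (the helper name is ours, not part of the GTApprox API), using the example value from above:

```python
def check_factorization(factors, dim):
    """Check that user factors cover every input component exactly once.

    Each factor is a list of zero-based component indices, optionally
    followed by a technique label string (e.g. "BSPL") as the last element.
    """
    seen = []
    for factor in factors:
        # keep only the index entries, skipping the optional label
        seen.extend(c for c in factor if not isinstance(c, str))
    if sorted(seen) != list(range(dim)):
        raise ValueError(
            "factorization must include every component exactly once")
    return True
```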
-
GTApprox/TrainingAccuracySubsetSize
Limits the number of points selected from the training set to calculate model accuracy on the training set.
Value: integer in range [1, 2^{32}-1], or 0 (no limit)
Default: 100 000
New in version 1.9.0.
After a model has been built by GTApprox, it is evaluated on the input values from the training set to test model accuracy, that is, to calculate model errors (the deviation of model output values from the original output values). The result is an integral characteristic named “Training Set Accuracy”, which is found in model info. For very large samples this test is time-consuming and may significantly increase the build time. If the number of points in the training set exceeds the GTApprox/TrainingAccuracySubsetSize option value, some of the points are dropped to make the test faster, and the training set accuracy statistic is based only on the model errors calculated over the limited subset (whose size equals the GTApprox/TrainingAccuracySubsetSize option value). The number of points actually used in the test is also found in model info.
If the sample size is less than GTApprox/TrainingAccuracySubsetSize value, this option in fact has no effect. In this case the number of points used in model accuracy test is equal to the number of points used to build the model (which may still be different from the number of points in the training set — for example, if the training set contains duplicate values).
When this option takes effect, it always produces a warning in the model build log stating that only a limited subset of points selected from the training set is used to calculate model accuracy.
To cancel the limit, set this option to 0. With this setting, the model will always be evaluated on the same set of points that was used to build the model.
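The selection behavior described above can be sketched as follows. The random-subset strategy and the function name are our assumptions for illustration; the documentation only states that "some of the points will be dropped", not how they are chosen.

```python
import random


def accuracy_subset(points, limit=100_000, seed=0):
    """Points used for the training-set accuracy test.

    If the sample exceeds the limit, a subset of `limit` points is used;
    0 cancels the limit. Sketch of the documented behavior only; the
    actual GTApprox selection procedure is internal.
    """
    points = list(points)
    if limit == 0 or len(points) <= limit:
        return points
    rng = random.Random(seed)  # assumption: random selection, fixed seed
    return rng.sample(points, limit)
```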