January 16, 2020

# Data Fusion vs Mixture of Approximators to Handle Training Datasets of Low and High Fidelity

## Introduction

The **Data Fusion (DF)** approximation technique merges several datasets from sources of different fidelity to obtain the most accurate model possible with a limited number of high-fidelity points, which are usually more expensive to generate. Data of different accuracy can come from calculations and experiments, from computational models based on methods of different accuracy, or from FEA simulations run on mesh grids of different fineness.

pSeven has a dedicated tool for training such models, GTDF (see the pSeven documentation). This tech tip covers two aspects of the **DF** methodology:

**A**. Starting from the pSeven 6.16 release, native **DF** models (.gtdf) are fully supported in the pSeven modeling toolkit. Like standard approximation models, they can be imported to a report, studied in Model Validator and Model Explorer, and exported to the neutral format.

**B**. Recent updates to the **Mixture of Approximators (MoA)** technique make it usable as another tool for merging data of various fidelity.

This tech tip demonstrates both approaches (**DF** and **MoA**) to using data of various accuracy for model training.

## Problem Statement

For demonstration purposes, we use the Borehole function^{1} as the data source in this tip. The 8-dimensional Borehole function models water flow through a borehole drilled from the ground surface through two aquifers. The water flow rate is determined by the properties of the borehole and the aquifers. The flow rate through the borehole, Q, is computed using the following analytical expression:

\(Q=f(x)=\frac{2\pi T_u (H_u-H_l)}{\ln(r/r_w)\left[1+\frac{2LT_u}{\ln(r/r_w)\, r_w^2 K_w}+\frac{T_u}{T_l}\right]}\),

where \(x=\{r_w,r,T_u,H_u,T_l,H_l,L,K_w \}\) is the vector of input variables defined below (Fig. 1).

*Fig. 1. Illustration of the water flow through the borehole, adapted from Harper and Gupta (1983)^{2}*

The input variables and their usual input ranges are:

- radius of borehole (m) - rw ∈ [0.05, 0.15],
- radius of influence (m) - r ∈ [100, 50000],
- transmissivity of upper aquifer (m^{2}/yr) - Tu ∈ [63070, 115600],
- potentiometric head of upper aquifer (m) - Hu ∈ [990, 1110],
- transmissivity of lower aquifer (m^{2}/yr) - Tl ∈ [63.1, 116],
- potentiometric head of lower aquifer (m) - Hl ∈ [700, 820],
- length of borehole (m) - L ∈ [1120, 1680],
- hydraulic conductivity of borehole (m/yr) - Kw ∈ [9855, 12045].

The response of this test model is the water flow rate, m^{3}/yr.
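The expression and ranges above translate directly into code. The following is a minimal sketch in plain Python; the names `BOUNDS`, `borehole` and `midpoint` are chosen here for illustration and are not part of pSeven.

```python
import math

# Input ranges from the list above: (lower, upper)
BOUNDS = {
    "rw": (0.05, 0.15),        # radius of borehole, m
    "r": (100.0, 50000.0),     # radius of influence, m
    "Tu": (63070.0, 115600.0), # transmissivity of upper aquifer, m^2/yr
    "Hu": (990.0, 1110.0),     # potentiometric head of upper aquifer, m
    "Tl": (63.1, 116.0),       # transmissivity of lower aquifer, m^2/yr
    "Hl": (700.0, 820.0),      # potentiometric head of lower aquifer, m
    "L": (1120.0, 1680.0),     # length of borehole, m
    "Kw": (9855.0, 12045.0),   # hydraulic conductivity of borehole, m/yr
}


def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Water flow rate Q through the borehole, m^3/yr."""
    log_ratio = math.log(r / rw)
    numerator = 2.0 * math.pi * Tu * (Hu - Hl)
    bracket = 1.0 + 2.0 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl
    return numerator / (log_ratio * bracket)


# Evaluate the function at the midpoint of every input range
midpoint = {name: 0.5 * (lo + hi) for name, (lo, hi) in BOUNDS.items()}
q_mid = borehole(**midpoint)
print(f"Q at the midpoint of the ranges: {q_mid:.2f} m^3/yr")
```

Since Q is proportional to the head difference \(H_u-H_l\), raising `Hu` with all other inputs fixed always increases the flow rate, which is a convenient sanity check for an implementation.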

Our sample contains 20 high-fidelity points generated using the original Borehole function and 160 low-fidelity points generated using a Borehole function modified for this purpose:

\(Q^*=f^*(x)=\frac{5T_u (H_u-H_l)}{\ln(r/r_w)\left[1.5+\frac{2LT_u}{\ln(r/r_w)\, r_w^2 K_w}+\frac{T_u}{T_l}\right]}\).
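To see how the low-fidelity function relates to the original one, both can be written as one parametrized function. This is a sketch in plain Python; the parameter names `scale` and `offset` are introduced here only for illustration.

```python
import math


def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw, scale=2.0 * math.pi, offset=1.0):
    """Borehole flow rate; the default scale and offset give the original Q."""
    log_ratio = math.log(r / rw)
    bracket = offset + 2.0 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl
    return scale * Tu * (Hu - Hl) / (log_ratio * bracket)


def borehole_lf(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Low-fidelity Q*: 2*pi is replaced by 5, and 1 by 1.5 in the bracket."""
    return borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw, scale=5.0, offset=1.5)


# Midpoint of the input ranges listed earlier
point = dict(rw=0.1, r=25050.0, Tu=89335.0, Hu=1050.0,
             Tl=89.55, Hl=760.0, L=1400.0, Kw=10950.0)
q_hf = borehole(**point)
q_lf = borehole_lf(**point)
ratio = q_lf / q_hf
print(f"Q = {q_hf:.2f}, Q* = {q_lf:.2f}, Q*/Q = {ratio:.4f}")
```

Over the stated input ranges the bracket is dominated by the term \(\frac{2LT_u}{\ln(r/r_w)\, r_w^2 K_w}\), which is far larger than 0.5, so \(Q^*\) stays close to \(\frac{5}{2\pi}\approx 0.8\) of \(Q\). The low-fidelity data is therefore systematically biased rather than noisy.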

In addition, 1000 high-fidelity points were generated as a test sample for validation. All datasets were obtained by Latin hypercube sampling. We will consider the following steps:

- Train a model using samples containing high- and low-fidelity data with the help of the **Data Fusion** toolkit (DF model) and export it to the native .gtdf format.
- Import this DF model to Analyze along with the high- and low-fidelity data and the test sample.
- Train low- and high-fidelity models (LF model and HF model, on the corresponding datasets) with the standard approximation techniques.
- Use the **MoA** technique: apply the high-fidelity data to improve the LF model.
- Compare the accuracy of the LF, HF, DF and MoA models using the test sample.
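The three input samples described above (20 high-fidelity, 160 low-fidelity and 1000 test points) were obtained by Latin hypercube sampling. The tool and settings actually used are not stated, so the following is only a minimal sketch of the idea in plain Python; the function name `latin_hypercube` and the seed are assumptions made for illustration.

```python
import random

# Input ranges in the order rw, r, Tu, Hu, Tl, Hl, L, Kw
BOUNDS = [(0.05, 0.15), (100.0, 50000.0), (63070.0, 115600.0), (990.0, 1110.0),
          (63.1, 116.0), (700.0, 820.0), (1120.0, 1680.0), (9855.0, 12045.0)]


def latin_hypercube(n_points, bounds, rng):
    """Return n_points rows; each input has exactly one value per stratum."""
    columns = []
    for lo, hi in bounds:
        # Split [lo, hi] into n_points equal strata and visit them in random order
        strata = list(range(n_points))
        rng.shuffle(strata)
        # Draw one uniformly random value inside each stratum
        columns.append([lo + (hi - lo) * (k + rng.random()) / n_points
                        for k in strata])
    return [list(row) for row in zip(*columns)]


rng = random.Random(2020)  # arbitrary seed, for reproducibility only
x_hf = latin_hypercube(20, BOUNDS, rng)
x_lf = latin_hypercube(160, BOUNDS, rng)
x_test = latin_hypercube(1000, BOUNDS, rng)
print(len(x_hf), len(x_lf), len(x_test))
```

Evaluating the original function on `x_hf` and `x_test`, and the modified function on `x_lf`, would then give the responses for the three datasets.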

## Preparing the models

Two projects in the pSeven Examples Collection (examples 2.11 and 2.12) demonstrate how to use the **Data Fusion** capabilities and train an approximation model that combines two training samples of different fidelity. In our test case, we built a Data Fusion model and exported it to an external file in the native pSeven format (.gtdf). To explore the DF model, we can import it to an Analyze report by clicking Import from file… on the Model tab, or the same button on the menu tab of the Analyze panel (Fig. 2).

*Fig. 2. Import tool in Analyze mode*

Low- and high-fidelity datasets are imported to the same report. Two models are trained (see this tutorial video for model training basics) using each sample separately: LF and HF models (Fig. 3).

*Fig. 3. List of approximation models*

The DF model can be exported in any format as usual.

## Comparison

Let’s compare the models using quantile and scatter plots on the test sample (Fig. 4).

*Fig. 4. Comparison of model accuracy*

The DF model is clearly more accurate than even the high-fidelity (HF) model. This can be explained by the small size of the HF training sample (20 points). The quantile plot for the low-fidelity model shows a high error for a large fraction of points, which is why its predictions do not lie on the diagonal of the scatter plot.
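The quantities behind such a comparison are simple to compute outside the GUI. The sketch below assumes that a quantile plot reports, for a given fraction of test points, the absolute error that this fraction does not exceed; the function names and the four-point sample are hypothetical and serve only to show the arithmetic.

```python
import math


def error_quantiles(y_true, y_pred, fractions=(0.5, 0.9, 0.95, 1.0)):
    """Absolute error not exceeded by the given fractions of test points."""
    errors = sorted(abs(p - t) for t, p in zip(y_true, y_pred))
    n = len(errors)
    # For a fraction q, take the smallest error that covers ceil(q * n) points
    return {q: errors[max(0, math.ceil(q * n) - 1)] for q in fractions}


def rmse(y_true, y_pred):
    """Root-mean-square error over the test sample."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical reference values and model predictions (not data from this tip)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 18.0, 30.0, 44.0]

quantiles = error_quantiles(y_true, y_pred)
print(quantiles, rmse(y_true, y_pred))
```

A model whose predictions lie on the diagonal of the scatter plot has errors close to zero at every fraction, while a systematically biased model shows a large error already at small fractions.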

As a next step, we use the **MoA** technique to update the low-fidelity model using the high-fidelity dataset. **MoA** improves model accuracy by reducing the model error on the new training sample: it updates the initial model so that it fits the new data. You can learn more about the **MoA** technique in this tech tip.

To do this, we select the LF model and the high-fidelity sample, then click the **Update…** button. The Update option applies the **MoA** technique as another layer on top of the initial model. The model update setup dialog is very similar to Model builder, except that the selected initial model is used as a starting approximation (Fig. 5).

*Fig. 5. Improvement of LF model with high fidelity data using MoA technique*
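The general idea of adding a layer on top of an initial model can be illustrated with a much simpler technique than MoA. The sketch below is not the MoA algorithm and does not use pSeven: it fits a linear scale-and-shift correction to an initial model by least squares on 20 high-fidelity points. As a further assumption, the low-fidelity function itself stands in for a trained LF model, and the points are drawn by plain random sampling for brevity.

```python
import math
import random

# Input ranges in the order rw, r, Tu, Hu, Tl, Hl, L, Kw
BOUNDS = [(0.05, 0.15), (100.0, 50000.0), (63070.0, 115600.0), (990.0, 1110.0),
          (63.1, 116.0), (700.0, 820.0), (1120.0, 1680.0), (9855.0, 12045.0)]


def borehole(x, scale=2.0 * math.pi, offset=1.0):
    """Borehole flow rate; the defaults give the high-fidelity function."""
    rw, r, Tu, Hu, Tl, Hl, L, Kw = x
    log_ratio = math.log(r / rw)
    bracket = offset + 2.0 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl
    return scale * Tu * (Hu - Hl) / (log_ratio * bracket)


def initial_model(x):
    """Stand-in for a trained LF model (assumption): the LF function itself."""
    return borehole(x, scale=5.0, offset=1.5)


def rmse(pred, ref):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, ref)) / len(ref))


rng = random.Random(0)  # arbitrary seed
x_hf = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(20)]
y_hf = [borehole(x) for x in x_hf]
y_init = [initial_model(x) for x in x_hf]

# Least-squares fit of y_hf ~ a * initial_model(x) + b (simple linear regression)
mean_init = sum(y_init) / len(y_init)
mean_hf = sum(y_hf) / len(y_hf)
a = (sum((u - mean_init) * (v - mean_hf) for u, v in zip(y_init, y_hf))
     / sum((u - mean_init) ** 2 for u in y_init))
b = mean_hf - a * mean_init


def updated_model(x):
    """Initial model with the fitted correction layer applied on top."""
    return a * initial_model(x) + b


rmse_before = rmse(y_init, y_hf)
rmse_after = rmse([updated_model(x) for x in x_hf], y_hf)
print(f"a = {a:.4f}, b = {b:.4f}, RMSE {rmse_before:.3f} -> {rmse_after:.3f}")
```

Because the uncorrected model corresponds to a = 1, b = 0, the least-squares fit cannot increase the error on the points it was fitted to. A real LF surrogate trained on 160 points also carries its own approximation error, which a correction this simple would not remove.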

After the training is complete, we compare all models: DF, LF, HF and MoA. Quantile and scatter plots built on the test sample, as well as prediction errors, are shown in Fig. 6.

*Fig. 6. Comparison of model accuracy*

The figure above leads to the following conclusion: the **MoA** technique can work with data of different fidelity and improve the initial model, but in general it cannot fully replace the **Data Fusion** tool without a loss of model accuracy.

## Conclusions

This tech tip demonstrates how the **Data Fusion (DF)** tool and the **Mixture of Approximators (MoA)** technique handle training datasets of low and high fidelity. Combining the low-fidelity and high-fidelity samples in the **Data Fusion** tool produced the most accurate approximation model in this comparison. The **MoA** technique improves the initial low-fidelity model using the high-fidelity sample and can be used to merge data of various fidelity. However, it should be regarded as a simplified way to merge the datasets, and in the general case it cannot replace **Data Fusion**. The **Data Fusion** technique is therefore better suited to training approximation models that combine two training samples of different fidelity.

## References

1. uqworld.org
2. W. V. Harper and S. K. Gupta, “Sensitivity/Uncertainty Analysis of a Borehole Scenario Comparing Latin Hypercube Sampling and Deterministic Sensitivity Approaches”, Office of Nuclear Waste Isolation, Battelle Memorial Institute, Columbus, Ohio, BMI/ONWI-516, 1983.