December 5, 2016
Surrogate modeling to predict thermodynamic parameters
Accurate calculation of the solvation parameters of organic molecules is a long-standing challenge in computational chemistry and is important in many aspects of research in the pharmaceutical and agrochemical industries. For example, many of the pharmacokinetic properties of potential drug molecules are defined by their solvation and acid-base behavior, which can be estimated from their hydration free energies.
Existing computational methods for predicting molecular solvation parameters fall roughly into two classes: 'bottom-up' and 'top-down' methods. Methods in the first class use a molecular-scale physical-chemical model to describe the solvation process at some level of approximation. Methods in the second class are based on statistical analysis of quantitative structure–property relationships (QSPR) and make no a priori assumptions about the physical-chemical phenomena behind the established relationships. Both strategies have their own advantages and disadvantages.
Molecular modeling methods offer useful insights into the mechanisms of molecular solvation and are, on average, more accurate than QSPR methods. However, they are also much more computationally expensive, and the high cost of modeling large, complex molecular systems limits their use in large-scale computational screening of molecular databases. QSPR methods, on the other hand, offer computationally inexpensive ways to predict molecular solvation parameters.
With modern methods of statistical analysis, QSPR can in principle be used to establish complex non-linear relationships in large, complex molecular (bio)systems. However, there is no proper physical-chemical solvation model behind these methods, so their results are often difficult to interpret. QSPR-derived models are also often sensitive to the composition of the training and test sets: the parameters of a QSPR model that describes the properties of molecules from one chemical class well are frequently not transferable to another chemical class.
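The transferability problem described above can be illustrated with a minimal sketch. It fits a one-descriptor linear model on one synthetic "chemical class" and then evaluates it on a second class that follows a different trend. Everything here is an assumption made for illustration: the data are randomly generated, and the function names, slopes, intercepts and noise level are hypothetical and are not taken from the study.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted line on the given data."""
    a, b = model
    squared = [(a * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


def make_class(slope, intercept, n, rng):
    """Synthetic 'chemical class': linear trend in one descriptor plus noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is reproducible
class_a = make_class(slope=-0.5, intercept=1.0, n=50, rng=rng)
class_b = make_class(slope=-1.5, intercept=4.0, n=50, rng=rng)

model_a = fit_line(*class_a)            # trained on class A only
rmse_same = rmse(model_a, *class_a)     # error on the training class
rmse_other = rmse(model_a, *class_b)    # error on the unseen class

print(f"RMSE on class A (training class): {rmse_same:.2f}")
print(f"RMSE on class B (unseen class):   {rmse_other:.2f}")
```

Because the two synthetic classes follow different trends, the error on class B is much larger than on class A. Evaluating a model on a held-out chemical class, rather than only on a random split, is one simple way to detect this effect.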
The main purposes of this study were to generalize a previously proposed approach based on linear regression by using the smart surrogate modeling techniques of pSeven Core, to develop its statistical analysis part, and to expand its area of application. To achieve these goals, we thoroughly investigated how different methods of statistical analysis affect the quality of predictions.
We focus on predicting two important thermodynamic parameters: the hydration free energy and the logarithm of the octanol-water partition coefficient (at normal pH). These parameters are of fundamental interest in several areas of solution chemistry, pharmacology and environmental sciences.
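The two target quantities are thermodynamically related: the partition coefficient follows from the free energy of transferring a solute from water to octanol, which is the difference between its solvation free energies in the two solvents. The sketch below shows this standard relation. It assumes both free energies are given in kJ/mol, at the same temperature and with the same standard-state convention. The function name and the example values are hypothetical and are not results from the study.

```python
import math

R = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def log_p_from_free_energies(dg_hyd, dg_oct, temperature=298.15):
    """Return log10 of the octanol-water partition coefficient.

    dg_hyd: hydration free energy of the solute, kJ/mol
    dg_oct: solvation free energy of the solute in octanol, kJ/mol
    """
    # Transfer free energy (water -> octanol): dG_tr = dG_oct - dG_hyd = -RT ln P
    dg_transfer = dg_oct - dg_hyd
    return -dg_transfer / (R * temperature * math.log(10))


# Illustrative numbers only: a solute solvated 10 kJ/mol more
# favorably in octanol than in water.
example = log_p_from_free_energies(dg_hyd=-20.0, dg_oct=-30.0)
print(f"log P = {example:.2f}")
```

A solute that is better solvated in octanol than in water has a negative transfer free energy and therefore a positive log P. Equal solvation free energies give log P = 0.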
Using pSeven Core, we obtained a surrogate model that provides better prediction quality than common approaches. The model preserves its predictive power for a wide range of molecules. The time needed to obtain one accurate prediction of a thermodynamic parameter was reduced from one month (the time required to carry out an experiment) to several hours.