[insert Table 4 here]

Step 3: Model identification

The third step is to identify models for the average sensorial rating of the perfume smell (\(q_{s}\)) and for the four target properties. These models are elaborated below.

ANN-based surrogate model for sensorial rating

A surrogate model is developed to predict \(q_{s}\). Perfume sensorial data are generated by matching the general consumers’ preferences reflected on various perfume review websites; here, these data are used to represent consumer satisfaction. A total of 761 data samples are available at https://github.com/zx2012flying/Perfume-Case-Study. The samples involve only the 48 ingredient candidates in Table 4. For each sample, the input data comprise the selected ingredients and their volume fractions, and the output is the overall sensorial rating. For consistency, the ratings are scaled to [0, 100], with 100 denoting the best smell; the minimum and maximum ratings in the dataset are 50.2 and 89.7, respectively.

Based on these data, several surrogate models, including linear regression, an artificial neural network (ANN), and support vector regression, are built using the Surrogate Modeling Toolbox, Pyrenn, and Scikit-learn packages in Python 3.6. The hyperparameters are tuned manually, and the model accuracy is evaluated through 10-fold cross-validation. A three-layer ANN (i.e., one input layer, one hidden layer, and one output layer) is found to offer the highest accuracy. Figure S1 shows the schematic structure of this ANN. The tansig and purelin transfer functions are applied in the hidden and output layers, respectively, and the number of neurons in the hidden layer is tuned to 8. Figure 4 presents the histogram of the errors between the true and predicted values (\(q_{s}^{\text{true}}-q_{s}^{\text{pre}}\)); 90% of the deviations are less than 10. The mean absolute error (MAE) and mean absolute percentage error (MAPE) are 4.8 and 6.9%, respectively. This ANN model thus provides an accurate prediction of \(q_{s}\), which is explicitly expressed as
\(q_{s}=\sum_{l=1}^{8} wo_{l}\cdot f_{h}(ah_{l})+bo\) (21)
\(f_{h}(ah_{l})=1-\frac{2}{1+e^{2\,ah_{l}}},\quad l=1,\ldots,8\) (22)
\(ah_{l}=\sum_{i=1}^{48} wh_{l,i}\cdot V_{i}+bh_{l},\quad l=1,\ldots,8\) (23)
where \(wo_{l}\) and \(bo\) are the weights and bias in the output layer, respectively; \(f_{h}\) is the tansig function in the hidden layer; \(ah_{l}\) is the intermediate variable of hidden neuron \(l\); and \(wh_{l,i}\) and \(bh_{l}\) are the weights and biases in the hidden layer, respectively. These model parameters are provided in the GitHub repository mentioned above.
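For illustration, a minimal NumPy sketch of Eqs. (21)–(23) is given below. The array names `wo`, `bo`, `wh`, and `bh` are assumptions for this sketch; the corresponding values are taken from the parameter files in the GitHub repository, and the volume-fraction vector `V` is assumed to follow the ingredient order of Table 4.

```python
import numpy as np

def predict_qs(V, wh, bh, wo, bo):
    """Evaluate the ANN surrogate of Eqs. (21)-(23).

    V  : (48,)   volume fractions of the ingredient candidates (Table 4 order)
    wh : (8, 48) hidden-layer weights,  bh : (8,) hidden-layer biases
    wo : (8,)    output-layer weights,  bo : scalar output-layer bias
    """
    ah = wh @ V + bh                              # Eq. (23): hidden pre-activations
    fh = 1.0 - 2.0 / (1.0 + np.exp(2.0 * ah))     # Eq. (22): tansig (i.e., tanh)
    return float(wo @ fh + bo)                    # Eq. (21): purelin output, q_s
```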
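The model-fitting and 10-fold cross-validation workflow described above can likewise be outlined in code. The sketch below is only illustrative: it uses scikit-learn’s `MLPRegressor` (whose `tanh` hidden activation and linear output mirror the tansig/purelin structure) in place of the Pyrenn implementation, and the arrays `X` and `y` are synthetic placeholders standing in for the 761 samples in the GitHub repository.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Placeholder data standing in for the real dataset:
# X holds 48 ingredient volume fractions per sample, y the ratings in [0, 100].
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(48), size=761)   # replace with the real samples
y = 50.0 + 40.0 * rng.random(761)          # replace with the real ratings

# Three-layer ANN: 48 inputs, 8 tanh (tansig-equivalent) hidden neurons,
# and a linear (purelin-equivalent) output.
ann = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0)

# 10-fold cross-validation and error statistics.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(ann, X, y, cv=cv)

mae = np.mean(np.abs(y - y_pred))                  # mean absolute error
mape = 100.0 * np.mean(np.abs((y - y_pred) / y))   # mean absolute percentage error, %
print(f"MAE = {mae:.1f}, MAPE = {mape:.1f}%")
```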