Recommendation system
Once the model predicts that a specimen points to low yields due to nutrient imbalance, what should we do? By computing the imbalance index, we spotted the closest balanced point. Moving from an imbalanced point to this closest balanced point (or to the mean vector of the k nearest balanced points) is a translation described by a vector of deltas, one per isometric log-ratio coordinate. In compositional data jargon, such a vector is known as a perturbation vector. The isometric log-ratio perturbation vector can be back-transformed to the concentration scale. The perturbation operator is applied to compositions as follows:
\(\left[ y_1, y_2, \ldots, y_n \right] = \left[ x_1, x_2, \ldots, x_n \right] \oplus \left[ p_1, p_2, \ldots, p_n \right] = \mathcal{C} \left( \left[ x_1 \times p_1, x_2 \times p_2, \ldots, x_n \times p_n \right] \right)\)
where, with \(n\) components, the composition \(x\) is perturbed by the composition \(p\), returning the translated composition \(y\); \(\mathcal{C}\) denotes the closure operation, which rescales the resulting parts to a constant sum.
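In code, the perturbation operator boils down to a component-wise product followed by a closure. Here is a minimal R sketch; the three-part composition and the function names are purely illustrative (the compositions R package provides the same operation natively on its acomp class).

```r
# Closure: rescale a vector of positive parts to a constant sum (here 1)
closure <- function(z) z / sum(z)

# Perturbation: component-wise product followed by closure
perturb <- function(x, p) closure(x * p)

# Purely illustrative three-part composition and perturbation vector
x <- c(0.2, 0.3, 0.5)
p <- c(1.5, 1.0, 0.8)
perturb(x, p)  # returns 0.3 0.3 0.4
```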
A convenient way to compute the perturbation vector is to take the difference between the coordinates of the reference point and those of the imbalanced one, then back-transform the resulting balances to compositions according to the bifurcating tree. The inverse isometric log-ratio transformation could be computed in a spreadsheet, but I suggest coding it in your favorite language (most scientists use either R or Python; both the compositions R package and the scikit-bio Python package provide an inverse isometric log-ratio function).
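For instance, here is a minimal R sketch with the compositions package, using arbitrary three-part compositions and the package's default orthonormal basis rather than a tree-based (balance) one; the back-transformed perturbation does not depend on which orthonormal basis is chosen.

```r
library(compositions)

# Arbitrary illustrative compositions (not the blueberry data)
imbalanced <- acomp(c(0.10, 0.25, 0.65))
reference  <- acomp(c(0.15, 0.25, 0.60))

# Difference between the ilr coordinates of the reference point and
# those of the imbalanced one (default orthonormal basis of the package)
delta_ilr <- as.numeric(ilr(reference)) - as.numeric(ilr(imbalanced))

# Back-transform the ilr perturbation vector to the concentration scale
perturbation <- ilrInv(delta_ilr)
perturbation
```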
For example, if the observed ionome is \(\left[ 0.04, 0.02, 0.015, 0.005, 0.008, 0.912 \right]\) and I identify the target ionome as \(\left[ 0.05, 0.02, 0.015, 0.01, 0.008, 0.897 \right]\), the first and fourth components are increased at the expense of the last one. The perturbation from the observation to the target is \(\left[ 0.1728, 0.1382, 0.1382, 0.2765, 0.1382, 0.1360 \right]\). This perturbation vector contains all the information needed to move from the imbalanced specimen to the target, but it is difficult to interpret.
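These numbers are easy to check: the perturbation is simply the closure of the component-wise target-to-observation ratios.

```r
observed <- c(0.04, 0.02, 0.015, 0.005, 0.008, 0.912)
target   <- c(0.05, 0.02, 0.015, 0.01,  0.008, 0.897)

# Perturbation from the observation to the target: closure of the
# component-wise ratios target / observed
ratios <- target / observed
round(ratios / sum(ratios), 4)
#> 0.1728 0.1382 0.1382 0.2765 0.1382 0.1360
```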
To overcome this difficulty, we proposed in \cite{Parent2013} to interpret the differences directly on a balance dendrogram. Still, a high degree of abstraction is needed to visualize a composition as a multidimensional mobile of parts in equilibrium. Some authors, like \cite{de2018}, proposed to diagnose ionomes using the centered log-ratio form. A centered log-ratio is the log of the ratio between a part and the geometric mean of the whole composition: interpreting it is itself a cognitive challenge. We could imagine several ways to represent a perturbation vector in a more insightful manner. The best option I have found to date is to ratio the observed components to the targeted ones.
To recall the example above, ratioing the observation to the target makes things much clearer, i.e. \(\left[0.8, 1, 1, 0.5, 1, 1.02\right]\): the first component of the imbalanced specimen falls short of the target by one fifth, the fourth component is half of what it should be, all at the expense of the last one. Bingo.
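In R, this diagnostic ratio is a one-liner on the same two vectors.

```r
observed <- c(0.04, 0.02, 0.015, 0.005, 0.008, 0.912)
target   <- c(0.05, 0.02, 0.015, 0.01,  0.008, 0.897)

# Observed-to-target ratios: values below 1 flag components in shortage
round(observed / target, 2)
#> 0.80 1.00 1.00 0.50 1.00 1.02
```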
Case study with blueberry leaf data
While this paper addresses the whys and how-tos of ionomic diagnosis in agriculture, I will keep the methodology short. The Québec blueberry (Vaccinium angustifolium) data set comprises the concentrations of nitrogen, phosphorus, potassium, calcium, magnesium, boron, copper, manganese, zinc, iron, molybdenum and aluminium in the diagnostic leaves, together with the associated yields.
In this paper, I will use the R statistical language \cite{nokey_31eda} to perform the necessary tasks. The GitLab repository offered as supplementary material provides all the necessary R code.
I preprocessed the composition by computing the filling value as the difference between the total and the sum of the quantified components, then computed the isometric log-ratios. I used a k-nearest neighbors algorithm based on Euclidean distances: because Euclidean distances between isometric log-ratio coordinates do not depend on the choice of orthonormal basis, the balance scheme does not affect the results.
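As a rough sketch of this preprocessing step, assuming a data frame of leaf concentrations expressed as mass fractions; the object and column names below are placeholders, not those used in the supplementary repository.

```r
library(compositions)

# Placeholder names: `leaves` is a data frame of leaf concentrations
# (mass fractions) and `nutrients` the names of the quantified elements
nutrients <- c("N", "P", "K", "Ca", "Mg", "B", "Cu", "Mn", "Zn", "Fe", "Mo", "Al")

# Filling value: difference between the total (1, since concentrations
# are assumed to be mass fractions) and the sum of the quantified components
leaves$Fv <- 1 - rowSums(leaves[, nutrients])

# Isometric log-ratio coordinates, returned as a plain matrix
ilr_coords <- unclass(ilr(acomp(leaves[, c(nutrients, "Fv")])))
```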
After randomly splitting the data into a training (70%) and a testing (30%) set and optimizing the KNN model using the custom tuning algorithms of the caret package \cite{Kuhn_2020}, I obtained a model accuracy (proportion of correctly classified samples) of 77% on the testing set.
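A hedged sketch of this modelling step, assuming the placeholder objects ilr_coords (the ilr coordinates) and yield_class (a factor of high/low yield labels); the tuning settings are illustrative, not those used in the repository.

```r
library(caret)

set.seed(42)
in_train <- createDataPartition(yield_class, p = 0.7, list = FALSE)

# Tune k by 10-fold cross-validation on the training set
knn_fit <- train(
  x = ilr_coords[in_train, ],
  y = yield_class[in_train],
  method = "knn",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10)
)

# Accuracy on the held-out 30%
pred <- predict(knn_fit, newdata = ilr_coords[-in_train, ])
confusionMatrix(pred, yield_class[-in_train])
```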
I created a fictive but plausible imbalanced specimen (table \ref{192579}): its predicted yield category was low yielder (< 5000 kg/ha) with a probability of 73% (the proportion of low yielders among the k nearest points). I decided to search for the 10 closest high-yield observations in the training set, from which I extracted the median: this strategy prevents reaching a point too close to the edge of the hypervolume (or, as I like to call it, the hyper-blob) of high yielders. This target, found at an Aitchison distance (imbalance index, a value that should be contextualized for a given model) of 0.58, is also shown in table \ref{192579} with the associated observation / target ratio.
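One possible reading of this recommendation step, sketched with placeholder objects: new_ilr is the ilr-transformed fictive specimen (a numeric vector) and high_ilr the matrix of ilr coordinates of the high-yield training observations.

```r
library(compositions)

k <- 10

# Euclidean distances in ilr space are Aitchison distances on the simplex
d <- sqrt(colSums((t(high_ilr) - new_ilr)^2))
nearest <- order(d)[seq_len(k)]

# Target: coordinate-wise median of the 10 nearest high yielders,
# back-transformed to the concentration scale
target_ilr  <- apply(high_ilr[nearest, , drop = FALSE], 2, median)
target_comp <- as.numeric(ilrInv(target_ilr))

# Imbalance index: Aitchison distance between the specimen and its target
imbalance_index <- sqrt(sum((new_ilr - target_ilr)^2))

# Observation / target ratios, as reported in the recommendation table
ratio <- as.numeric(ilrInv(new_ilr)) / target_comp
```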