Recommendation system

Once we know that a sample is misbalanced, what should we do? By computing the misbalance index, we identified the closest balanced point. Going from a misbalanced point to this closest balanced point (or to the mean vector of the k nearest balanced points) is a translation, applied by computing a vector of deltas, one for each ilr coordinate. In compositional data jargon, such a vector is known as a perturbation vector, and it can be back-transformed to the concentration scale. The perturbation operator is applied as follows:
\(\left[ y_1, y_2, \ldots, y_n \right] = \left[ x_1, x_2, \ldots, x_n \right] \oplus \left[ p_1, p_2, \ldots, p_n \right] = \mathcal{C} \left( \left[ x_1 \times p_1, x_2 \times p_2, \ldots, x_n \times p_n \right] \right)\)
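As a minimal sketch of the operator above, assuming that \(\mathcal{C}\) is the closure to a unit sum, perturbation takes a few lines of plain Python. The function names `closure` and `perturb` and the three-part vectors are hypothetical and only serve the illustration.

```python
def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def perturb(composition, perturbation):
    """Multiply part by part, then close the result."""
    return closure([x * p for x, p in zip(composition, perturbation)])


# Hypothetical three-part composition and perturbation vector
x = [0.2, 0.3, 0.5]
p = [2.0, 1.0, 1.0]
y = perturb(x, p)

# Closing the perturbation vector first does not change the result,
# because only the ratios between its parts matter.
y_closed = perturb(x, closure(p))
```

Perturbing by a vector of equal parts, such as `[1.0, 1.0, 1.0]`, leaves the composition unchanged: it is the neutral element of the operation.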
A convenient way to compute the perturbation vector is to subtract the ilr coordinates of the misbalanced observation from those of the reference point, then back-transform the resulting balance differences to a composition according to the bifurcating tree. The inverse ilr could be computed in a spreadsheet, but I suggest coding it in your favorite language (most scientists use either R or Python; both the compositions R package and the scikit-bio Python package have an inverse ilr function).
For example, if the observed ionome is \(\left[ 0.04, 0.02, 0.015, 0.005, 0.008, 0.912 \right]\) and I identify the target ionome as \(\left[ 0.05, 0.02, 0.015, 0.01, 0.008, 0.897 \right]\), the first and fourth components are increased at the expense of the last one. The perturbation from the observation to the target is \(\left[ 0.1728, 0.1382, 0.1382, 0.2765, 0.1382, 0.1360 \right]\), which is difficult to interpret.
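The ilr route to this perturbation can be sketched in plain Python. The balance scheme below is an assumption: `pivot_basis` is a hypothetical helper that contrasts each part with all the parts to its right, which is one arbitrary bifurcating tree among many and not necessarily the one used elsewhere in this paper. In practice, the inverse ilr functions of the compositions or scikit-bio packages would replace these helpers.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratios: logs minus their mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def pivot_basis(n_parts):
    """Orthonormal basis (one row per balance): part i against the parts to its right."""
    basis = []
    for i in range(1, n_parts):
        r = n_parts - i  # number of parts to the right of part i
        row = [0.0] * (i - 1) + [math.sqrt(r / (r + 1))] + [-1 / math.sqrt(r * (r + 1))] * r
        basis.append(row)
    return basis


def ilr(parts, basis):
    """Project the clr vector onto each balance."""
    centered = clr(parts)
    return [sum(c * b for c, b in zip(centered, row)) for row in basis]


def ilr_inv(coords, basis):
    """Back-transform ilr coordinates to a closed composition."""
    n_parts = len(basis[0])
    centered = [sum(z * row[j] for z, row in zip(coords, basis)) for j in range(n_parts)]
    return closure([math.exp(value) for value in centered])


observed = [0.04, 0.02, 0.015, 0.005, 0.008, 0.912]
target = [0.05, 0.02, 0.015, 0.01, 0.008, 0.897]

basis = pivot_basis(len(observed))
# Deltas on each ilr coordinate: reference minus observation
deltas = [t - o for t, o in zip(ilr(target, basis), ilr(observed, basis))]
perturbation = ilr_inv(deltas, basis)

# Shortcut on the concentration scale: close the part-wise ratios
direct = closure([t / o for t, o in zip(target, observed)])
```

Both `perturbation` and `direct` should agree with the vector reported above to four decimals, whatever orthonormal balance scheme is chosen.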
To overcome this difficulty, we proposed in \cite{Parent2013} to interpret the differences in a balance dendrogram. Still, a high degree of abstraction was needed to visualize a composition as a multidimensional mobile of parts in equilibrium. Some authors, like \cite{de2018}, have proposed to render recommendations in the centered log-ratio (clr) form. A clr is a log ratio between a part and the geometric mean of the whole composition: interpreting a clr is itself a cognitive challenge. We could imagine several ways to represent a perturbation vector in a more insightful manner. The best option I have found to date is to take the ratio of the observed components to the targeted ones.
To return to the example above, taking the ratio of the observation to the target makes things much clearer, i.e. \(\left[0.8, 1, 1, 0.5, 1, 1.02\right]\): the first component is short by one fifth and the fourth component is half of what it should be, and correcting both comes at the expense of the last one. Bingo.
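The ratios need no log-ratio machinery, as this short Python sketch with the two ionomes of the example shows (the variable names are mine):

```python
observed = [0.04, 0.02, 0.015, 0.005, 0.008, 0.912]
target = [0.05, 0.02, 0.015, 0.01, 0.008, 0.897]

# Part-wise observation / target ratio: below 1 means shortage, above 1 means excess
ratio = [o / t for o, t in zip(observed, target)]
rounded = [round(r, 2) for r in ratio]
print(rounded)  # [0.8, 1.0, 1.0, 0.5, 1.0, 1.02]
```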

Case study with blueberry leaf data

Because the purpose of this paper is to show the whys and how-tos of ionomic diagnosis in agriculture, I will keep the methodology short. The blueberry (Vaccinium angustifolium) data set consists of concentrations of nitrogen, phosphorus, potassium, calcium, magnesium, boron, copper, manganese, zinc, iron, molybdenum and aluminium in leaves, related to yield.
Coding makes computations quicker and easier to reproduce than workflows that rely on spreadsheets. In this paper, I will use the R statistical language to perform the necessary tasks. The GitLab repository associated with this paper provides all the necessary R code.
I preprocessed the composition by computing the filling value as the difference between the total sum and the sum of the measured components, then computed the isometric log-ratios. Since I used a k-nearest neighbors algorithm based on Euclidean distance, the balance scheme should not affect the results: two sets of ilr coordinates built from different schemes differ only by an orthogonal rotation of the axes, which leaves distances unchanged.
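These two preprocessing claims can be checked with a small Python sketch. The mass fractions are hypothetical, not taken from the blueberry data, and `pivot_basis` is a hypothetical helper that builds one arbitrary balance scheme. Reordering the parts before applying it yields a second scheme. The distance between the two samples should be the same under both, and equal to the Euclidean distance between their clr vectors (the Aitchison distance).

```python
import math


def with_filling(measured, total=1.0):
    """Append the filling value: total minus the sum of the measured parts."""
    return list(measured) + [total - sum(measured)]


def clr(parts):
    """Centered log-ratios: logs minus their mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def pivot_basis(n_parts):
    """Orthonormal basis (one row per balance): part i against the parts to its right."""
    basis = []
    for i in range(1, n_parts):
        r = n_parts - i
        row = [0.0] * (i - 1) + [math.sqrt(r / (r + 1))] + [-1 / math.sqrt(r * (r + 1))] * r
        basis.append(row)
    return basis


def ilr(parts, basis):
    """Project the clr vector onto each balance."""
    centered = clr(parts)
    return [sum(c * b for c, b in zip(centered, row)) for row in basis]


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical mass fractions of three measured elements, plus the filling value
a = with_filling([0.018, 0.0012, 0.005])
b = with_filling([0.022, 0.0010, 0.004])

basis = pivot_basis(len(a))
d_first = euclidean(ilr(a, basis), ilr(b, basis))

# Second balance scheme: same construction after reordering the parts
order = [3, 1, 0, 2]
a_reordered = [a[i] for i in order]
b_reordered = [b[i] for i in order]
d_second = euclidean(ilr(a_reordered, basis), ilr(b_reordered, basis))

# Aitchison distance computed without any balance scheme
d_clr = euclidean(clr(a), clr(b))
```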
After randomly splitting the data into a training set (70%) and a testing set (30%) and optimizing the KNN model, I obtained a model accuracy (proportion of correctly classified samples) of 77% on the testing set.
I created a fictitious but plausible misbalanced observation (table \ref{192579}): it was predicted to be a low yielder (< 5000 kg/ha) with a probability of 73%. I searched for the 10 closest observations in the training set and extracted their median: this strategy prevents reaching a point too close to the edge of the hyper volume of high yielders (which I call the hyper-blob). This target, found at an Aitchison distance (the misbalance index, a value that should be contextualized for a given model) of 0.58, is also shown in table \ref{192579} with the associated observation / target ratio.
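The target search can be sketched as follows, under two assumptions of mine: the neighbors are sought among the high yielders of the training set, and the median is taken coordinate by coordinate on the ilr scale. The two-dimensional coordinates are invented for illustration, and `nearest_median_target` is a hypothetical helper, not a function from the repository.

```python
import math
from statistics import median


def nearest_median_target(observation, references, k):
    """Coordinate-wise median of the k references closest to the observation."""
    ranked = sorted(references, key=lambda ref: math.dist(observation, ref))
    nearest = ranked[:k]
    return [median(column) for column in zip(*nearest)]


# Hypothetical ilr coordinates of high yielders in the training set
high_yielders = [
    [0.2, 0.1],
    [0.6, 0.3],
    [0.4, 0.9],
    [1.5, 1.2],
    [5.0, 5.0],
]
observation = [-1.0, -1.0]  # hypothetical misbalanced observation

target = nearest_median_target(observation, high_yielders, k=3)

# Euclidean distance on ilr coordinates, i.e. the Aitchison distance
misbalance_index = math.dist(observation, target)
```

With `k=1` the target is the closest high yielder itself, which may sit on the edge of the hyper volume; a larger `k` pulls the target toward its interior.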