Table 1: Sensitivity and specificity of the supervised models evaluated
with the leave-one-out training method, using as target variable(s)
(a) one-dimensional (T1-relaxation), (b) one-dimensional
(T2-relaxation), (c) two-dimensional (T1-relaxation, T2-relaxation), and
(d) three-dimensional (T1-relaxation, T2-relaxation, A-ratio) data.
Abbreviations: A-ratio, T1-relaxation/T2-relaxation; AUC, area under the
curve; CA, classification accuracy; F1 score, the harmonic mean of
precision and recall; Precision, the fraction of selected items that are
relevant; Recall, the fraction of relevant items that are selected.
Training with cross validation (k=5) was also evaluated for comparison
(Supp. Fig. 4).
Results
Each edible oil (i.e., peanut, olive, sunflower, corn) was assigned to
its respective label (A, B, C, D) following the blinded NMR
measurements. As depicted in the one-dimensional maps, each oil has a
characteristic T1 relaxation and T2 relaxation reading (Figs. 2a-b). The
mean T1 relaxation times were (191.3, 199.3, 228.4, 247.8) ms and the
mean T2 relaxation times were (127.9, 136.8, 162, 163) ms for
(A, B, C, D), respectively.
The spread of the readings was, however, substantially large, making
objects (A and B) and objects (C and D) inseparable in the
T1 relaxation dimension (P > 0.05) (Fig. 2a). Likewise, in the
T2 relaxation dimension, objects (C and D) were inseparable (Fig. 2b).
This undesirable spread causes cluster overlap (similar to spectral
overlap) and hence makes classification difficult, if not impossible.
One straightforward solution is to increase the SNR (e.g., by increasing
the number of scans) and/or to increase the number of samplings, which
unfortunately comes at the expense of acquisition time. In addition, the
relaxation time of liquid samples is inherently long. On the other hand,
the Clustering NMR method proposed in this work leverages the combined
characteristic of the (T1, T2) relaxation times of the oils. Each oil
forms a visibly unique and specific cluster (a 'molecular fingerprint')
in the (pseudo) two-dimensional map (Fig. 2c), with the minor exception
of corn oil, which partially overlapped with sunflower oil, possibly due
to adulteration or factory processing. Upon further investigation, we
found that this artifact can be removed with higher SNR.
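As a rough illustration of why combining the two relaxation times can help, the sketch below compares the separation between cluster means along each single axis with the Euclidean separation in the (T1, T2) plane. It uses only the mean values reported above; the function name `separation` is a hypothetical helper introduced here, and the spread of the readings (which ultimately decides separability) is not reproduced.

```python
import math

# Mean relaxation times (ms) as reported in the text for labels A-D.
T1_MEAN = {"A": 191.3, "B": 199.3, "C": 228.4, "D": 247.8}
T2_MEAN = {"A": 127.9, "B": 136.8, "C": 162.0, "D": 163.0}


def separation(a, b):
    """Return (|dT1|, |dT2|, 2D Euclidean distance) between two cluster means.

    Hypothetical helper for illustration; it ignores the spread of readings.
    """
    d1 = abs(T1_MEAN[a] - T1_MEAN[b])
    d2 = abs(T2_MEAN[a] - T2_MEAN[b])
    return d1, d2, math.hypot(d1, d2)


# Compare the two pairs that the text reports as hard to separate in 1D.
for pair in (("A", "B"), ("C", "D")):
    d1, d2, d12 = separation(*pair)
    print(f"{pair[0]}-{pair[1]}: dT1={d1:.1f} ms, dT2={d2:.1f} ms, "
          f"d(T1,T2)={d12:.1f} ms")
```

The combined distance is never smaller than either single-axis separation, which is consistent with the two-dimensional map being at least as discriminating as either one-dimensional map. Whether two clusters are actually separable still depends on the spread of the individual readings.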
Interestingly, unsupervised clustering analysis (e.g., hierarchical
clustering (HC), tree-based classification, and k-means) can be
performed conveniently using user-friendly, open-source third-party
software (e.g., R or Orange 3.1.2). With a front-end statistical
programming language, the clustering analysis, once compiled, can be
re-executed on subsequent occasions. The HC analysis successfully
separated the (peanut and olive) cluster from the (sunflower and corn)
cluster, and subsequently split the oils within each cluster (Fig. 3).
The HC was constructed based on the Euclidean distance (in
T1 relaxation and T2 relaxation), and its quantitative linkages (e.g.,
inter/intra-cluster similarity) are shown in a heat map. The HC method
also confirmed the oil variants (A, A', B, B', C, C', C'', D) based on
their respective manufacturers.
Similarly, the chemometric approach[31] based on fat compositions
(Supp. Fig. 2) and the tree-based classification technique based on
T1-relaxation and T2-relaxation cutoff criteria (Supp. Fig. 3) appear to
be in good qualitative agreement with the HC classification using the
Euclidean distance of the T1 relaxation and T2 relaxation times obtained
experimentally with NMR. It is worth noting, however, that the figures
(i.e., fat compositions) given by the manufacturers are for reference
(and not for scientific) purposes. The clustering analysis models,
despite using different clustering criteria (e.g., Euclidean distance,
fat compositions, relaxation cutoff), were in agreement with our
observation (Clustering NMR, Fig. 2c). This demonstrates the robustness
of the Clustering NMR method, which can be validated using unsupervised
techniques.
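The hierarchical clustering step can be sketched in a few lines. The example below is a minimal illustration, not the analysis performed in this work (which used third-party software on the individual readings): it applies agglomerative clustering with Euclidean distance to the four reported cluster means only. The linkage rule is not stated in the text, so average linkage is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Mean (T1, T2) relaxation times in ms reported in the text for labels A-D.
MEANS = {
    "A": (191.3, 127.9),
    "B": (199.3, 136.8),
    "C": (228.4, 162.0),
    "D": (247.8, 163.0),
}


def average_linkage(c1, c2, points):
    """Mean Euclidean distance over all cross-cluster pairs of members.

    Average linkage is an ASSUMPTION; the source does not name the linkage.
    """
    dists = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in points]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        history.append((c1, c2, average_linkage(c1, c2, points)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return history


for left, right, height in agglomerate(MEANS):
    print(f"merge {left} + {right} at distance {height:.1f} ms")
```

On these four means the merge order pairs A with B and C with D before joining the two pairs, which matches the grouping described above. Because only the means are used, the sketch cannot reproduce the heat map or the variant-level structure of Fig. 3.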
To evaluate the classification accuracy quantitatively, various
supervised learning models (i.e., kNN, random forest, neural network,
naïve Bayes, and logistic regression) were used to train, validate and
predict the datasets. The area under the curve (AUC) of the receiver
operating characteristic (ROC) was on average (0.820, 0.876, 0.915,
0.933) for the one-dimensional (T1-relaxation), one-dimensional
(T2-relaxation), two-dimensional (T1-relaxation, T2-relaxation), and
three-dimensional (T1-relaxation, T2-relaxation, A-ratio) datasets,
respectively, using the leave-one-out training method (Fig. 4). The
A-ratio is the ratio of T1-relaxation to T2-relaxation.
Similar conclusions were reached using the cross validation method
(e.g., k=5) (details in Supp. Table 1). This confirmed that the
sensitivity and specificity of the proposed Clustering NMR method
improved substantially at higher (pseudo-)dimensionality (e.g., 2D or
multidimensional) over low dimensionality (e.g., n=1). With the (minor)
exception of logistic regression, all the supervised models performed
reasonably well (AUC>0.80) (Table 1). Furthermore, all the machine
learning tasks ran simultaneously, and the computational time taken was
typically less than 1 minute (in this work).
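The leave-one-out procedure and the construction of the one-, two- and three-dimensional feature sets can be illustrated as follows. This is a minimal sketch under stated assumptions, not the pipeline used in this work: the readings are SYNTHETIC, drawn around the reported means with a hypothetical spread (`spread_ms`), the classifier is a plain 1-nearest-neighbour rule, and it reports classification accuracy rather than AUC. All function names are introduced here for illustration.

```python
import math
import random

# Mean (T1, T2) in ms reported in the text; everything else here is assumed.
MEANS = {"A": (191.3, 127.9), "B": (199.3, 136.8),
         "C": (228.4, 162.0), "D": (247.8, 163.0)}


def synthetic_readings(n_per_class=20, spread_ms=8.0, seed=0):
    """Draw SYNTHETIC (T1, T2) readings around the reported means.

    spread_ms is a hypothetical standard deviation, not a measured value.
    """
    rng = random.Random(seed)
    data = []
    for label, (t1, t2) in MEANS.items():
        for _ in range(n_per_class):
            data.append((rng.gauss(t1, spread_ms),
                         rng.gauss(t2, spread_ms), label))
    return data


def features(t1, t2, kind):
    """Build the 1D, 2D or 3D feature vector; A-ratio = T1 / T2."""
    return {"T1": (t1,), "T2": (t2,), "T1,T2": (t1, t2),
            "T1,T2,A": (t1, t2, t1 / t2)}[kind]


def loo_accuracy(samples):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    samples is a list of (feature_vector, label) pairs; each sample is
    held out in turn and predicted from all remaining samples.
    """
    correct = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        _, predicted = min(rest, key=lambda s: math.dist(s[0], x))
        correct += predicted == y
    return correct / len(samples)


readings = synthetic_readings()
for kind in ("T1", "T2", "T1,T2", "T1,T2,A"):
    samples = [(features(t1, t2, kind), label) for t1, t2, label in readings]
    print(f"{kind:8s} leave-one-out accuracy = {loo_accuracy(samples):.2f}")
```

The features are left unscaled, so the A-ratio (of order 1) contributes little to the Euclidean distance next to T1 and T2 (of order 100 ms); a practical pipeline would likely standardise each feature first. The numbers printed depend entirely on the assumed spread and should not be compared with Table 1.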
Discussion
The proposed Clustering NMR method works on the rationale that the
accumulated characteristics of each dimension form a specific and unique
signature ('molecular fingerprint'). This concept is borrowed from data
mining[32]. Fortunately, the characteristic of the (T1, T2) relaxation
times in relaxometry is rather specific and prominent and, as the
results suggest, a dimensionality of n=2 to 3 is needed to attain a high
AUC (Fig. 4)[33]. With the recent advances in machine learning, it is
also becoming computationally cheaper (e.g., shorter analysis time) to
process a big dataset. The computational time reported in this analysis
(less than one minute) is much shorter than that of conventional two- or
multidimensional NMR (>hours), without resorting to the use of
Ultrafast NMR.
Two- or multidimensional relaxometry experiments (e.g., T1-T2
correlation spectroscopy) may provide much more information (e.g., cross
peaks) but are far more time consuming than the Clustering NMR method.
One way to speed up acquisition is to employ gradient fields (e.g.,
Ultrafast NMR[30], continuous spatial encoding[34]), which requires
modification of the radio-frequency probe. Machine learning in the form
of dimensionality reduction (e.g., principal component analysis (PCA),
partial least squares (PLS)) has also been used to reduce the
dimensionality in multidimensional spectroscopy (e.g., NMR
metabolomics[19,35,36]). Deep learning-assisted NMR spectroscopy[18], in
which signal reconstruction was demonstrated, has also been reported
recently. We summarized and compared the Clustering NMR method with the
state-of-the-art methodologies in a SWOT-like analysis (Table 2).
In conclusion, the proposed methodology, termed Clustering NMR, is
extremely powerful for rapid and accurate classification of objects
using low-field NMR. This methodology is highly disruptive to low-field
NMR applications, in particular the recently reported NMR-based PoCT
medical diagnostics. These include immuno-magnetic labelled detection
(e.g., tumour cells[14,20], tuberculosis[37] and magneto-DNA detection
of bacteria[38]) and label-free detection of various pathological states
(e.g., blood oxygenation[15]/oxidation level[10] and malaria
screening[21,22,39]). Interestingly, with the recent advances in machine
learning techniques, computation has become so efficient that large
datasets can be processed in almost 'real-time mode', which opens up the
opportunity to combine real-time NMR (or MRI) with machine learning.
Table 2: State-of-the-art NMR works (with/without machine learning
assistance) in comparison to the current work (Clustering NMR).