Statistical considerations, targets and background variability for exposure assessment. To establish the baseline variability of the DNA adductome for a target tissue/species (and, possibly, a population), replicate samples of single specimens are needed for each assessment site. Based on our experience with amphipods, about 15--20 individuals per site are sufficient when only females in the reproductive stage are considered \cite{Gorokhova2020}. However, species with larger inter-individual variability in unimpacted sites may require a larger sample size.
Before the data are evaluated in relation to contaminants and other health parameters, quality control (QC) checks are conducted, using, e.g., Principal Component Analysis (PCA) as a checkpoint to screen for outlier data points and to obtain a global perspective of the data \cite{Anwardeen2023}. All individual adducts are evaluated for variability, and those showing less than 1\% variation across the samples are omitted from further analyses. Adducts that are present in fewer than 5\% of the samples are also omitted.
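The two filtering rules and the PCA screen can be sketched as follows. This is a minimal illustration and not the authors' pipeline. It assumes that "variation" is measured as the coefficient of variation (CV) and that an adduct counts as "present" when its relative abundance is above zero. The function names `filter_adducts` and `pca_scores` are hypothetical.

```python
import numpy as np


def filter_adducts(X, min_cv=0.01, min_prevalence=0.05):
    """Return a boolean mask of the adduct columns to keep.

    X is a samples x adducts matrix of relative abundances, where
    0 means "not detected" (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    prevalence = (X > 0).mean(axis=0)      # share of samples with the adduct
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    # Assumption: variation is expressed as CV = sd / mean.
    # Columns with a zero mean get CV = 0 and are therefore dropped.
    cv = np.divide(sd, mean, out=np.zeros_like(sd), where=mean > 0)
    return (cv >= min_cv) & (prevalence >= min_prevalence)


def pca_scores(X, n_components=2):
    """PCA scores of autoscaled data, computed via SVD.

    Assumes constant columns were already removed, e.g. with
    filter_adducts, so that no column has zero standard deviation.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0, ddof=1)
    U, S, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

Samples lying far from the bulk of the data in the score plot of the first components are candidates for closer inspection; the cut-off used to call a point an outlier is left to the analyst.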
The data are evaluated using two-tier diagnostics with standard multivariate approaches for omics data \cite{Brereton_2021}.
- First, the primary adductome data output is used to (1) define the variability of the data originating from unimpacted sites and/or healthy specimens (i.e., the target space in the multivariate ordination; Figure 4); (2) assess the predicted class membership or score to evaluate whether the test sample aligns with the background samples; and (3) estimate the proportion of the test samples that do not align with the background samples. For this purpose, Partial Least Squares-Discriminant Analysis (PLS-DA), a supervised method that combines aspects of PCA and discriminant analysis, is recommended, in particular its extension, orthogonal PLS-DA (OPLS-DA; Figure 4B). For the environmental status assessment, we suggest applying the following principle: if more than 50\% of the test samples from a site/area are classified as not belonging to the reference group, the site is considered to deviate significantly from the unimpacted state in terms of DNA adduct composition and relative abundance.
- Second, the test samples that do not align with the reference adductome for the species/population in question are subjected to an analysis of the individual variability of the influential adducts; the latter are identified by the PLS-DA model and ROC (Receiver Operating Characteristic) analysis (Figure 5). Once the PLS/OPLS-DA model is built, the VIP (variable influence on projection) measure can be obtained for the adducts based on their association with the identified predictive components. Each of these adducts may permit the identification of exposures to certain hazardous chemicals in the environment with a unique diagnostic value. For each individual adduct, the same principle as for many other biomarkers \cite{Hylland2017} can be used by defining the background assessment criteria (BAC) as the 90th percentile of the relative abundance of this adduct in areas regarded as less polluted reference areas. Bootstrapping (100 000 runs) can be used to derive the mean, median and 90th percentile values. Significant deviations of the influential adducts from the corresponding BAC values should be reported to facilitate the interpretation of the overall DNA adductome response.
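The site-level 50\% rule and the bootstrapped BAC described in the two tiers above can be sketched as follows. This is a minimal illustration under stated assumptions and not a definitive implementation. The class membership of each test sample is taken as given (e.g., from a fitted PLS-DA/OPLS-DA model), and the reported mean, median and 90th percentile are assumed to be averages of the corresponding statistics over the bootstrap resamples. The function names `site_deviates` and `bootstrap_bac` are hypothetical.

```python
import numpy as np


def site_deviates(is_reference_like, threshold=0.5):
    """Apply the site-level rule to per-sample class assignments.

    is_reference_like: booleans, True if the sample was classified as
    belonging to the reference group.
    Returns (deviates, fraction classified outside the reference group).
    """
    flags = np.asarray(is_reference_like, dtype=bool)
    frac_outside = 1.0 - flags.mean()
    return bool(frac_outside > threshold), float(frac_outside)


def bootstrap_bac(reference, n_boot=100_000, q=90, seed=0, chunk=10_000):
    """Bootstrap the mean, median and q-th percentile of one adduct's
    relative abundance in the reference areas.

    Assumption: each reported value is the average of that statistic
    over all bootstrap resamples.
    """
    ref = np.asarray(reference, dtype=float)
    rng = np.random.default_rng(seed)
    stats = np.empty((n_boot, 3))
    # Resample in chunks to keep memory use modest.
    for start in range(0, n_boot, chunk):
        m = min(chunk, n_boot - start)
        idx = rng.integers(0, ref.size, size=(m, ref.size))
        resampled = ref[idx]
        stats[start:start + m, 0] = resampled.mean(axis=1)
        stats[start:start + m, 1] = np.median(resampled, axis=1)
        stats[start:start + m, 2] = np.percentile(resampled, q, axis=1)
    mean, median, bac = stats.mean(axis=0)
    return {"mean": float(mean), "median": float(median), "bac": float(bac)}
```

A test sample's abundance of an influential adduct can then be compared with the returned `bac` value, and exceedances reported alongside the site-level outcome.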