2 Materials and methods
2.1 SARS-CoV-2 variants
A total of 14,472,180 SARS-CoV-2 RNA sequences were downloaded from the
GISAID database [18] on January 7th 2023 and
annotated with Pangolin [19]
(program’s version 4.2, Pangolin’s data version 1.17). Among 28,420
variants associated with the City of Berlin (Federal Republic of
Germany), 3,755 ones satisfying the following conditions were selected:
viral RNA does not contain unrecognized nucleotides; the difference
between viral RNA length and the “Wuhan” reference variant (GISAID:
EPI ISL 402125) is less than 5%. All selected RNA sequences were
collected between the 14th January 2021 and
12th December 2022.
All sequences were divided into two groups, “Ancestral variants” and
“Omicrons”, according to their GISAID descriptions. “Ancestral
variants” included variants of “Alpha”, “Beta”, “Delta”, and
others (Table 1 of Supplementary materials) and collected before the
30th January 2022, while all “Omicrons” species were
sequenced after 1st December 2021 (Figure 1).
2.2 Highly expressed miRNAs
miRNA-seq read count tables were downloaded from the GDC portal at
https:
//portal.gdc.cancer.gov/projects/
for colon (COAD) and lung (LUAD) tissues. Following the corresponding
clinical description, the dataset samples were referred to as “Normal”
(“Solid Tissue Normal”) and “Tumor” (“Primary Tumor”) groups. Only
“Normal” samples were used to assess miRNA expression in the analysis:
8 entries in COAD dataset and 46 in the LUAD project.
Highly expressed miRNAs were selected independently for each tissue by
the following procedure: read counts associated with a particular miRNA
by the samples were summed up; then, miRNAs were sorted by the total
number of reads; finally, miRNAs accounting for 95% of all reads were
selected; As a final expression of the miRNAs, corresponding CPM
expressions were averaged by the samples.
2.3 Binding model
To model the binding interaction between miRNA and viral mRNAs, the
regions of viral RNA that were a reverse complement to the seed region
of miRNA
(nucleotides from 2 to 7 on the 5’-end of mature miRNA) were mapped to
SARSCoV-2 genomes. Following the common classification [20], such
interactions are called “6mer”; they lead to translational
inactivation of target mRNAs [21]. In addition to the reverse
complementarity of the seed region, miRNA target prediction tools may
rely on additional features. For example, TargetScan [22] utilizes
context++ score based on site type, 30-supplementary
pairing, local AU content and others, MirTarget [23] considers GC
content of target site, and UTR length, while RNA22 [24] predicts
target sites by placing “Target Islands” within 3’-UTRs, 5’-UTRs or
CDS.
Commonly, the overlap of the targets by different software is weak at
best [25,
26]. While RNA22 predicts more individual miRNA-mRNA interactions than
TargetScan, it misses the majority of classical seed region binding
sites in 3’-UTRs. Moreover, the prediction results heavily depend on the
tool-specific internal thresholds, with sensitivity and specificity
expected to vary depending on particular thresholds. To overcome these
limitations, a bottom-up approach based on the mapping of SARS-CoV-2
regions with reverse complement “6mers” matching the seed regions
within tissue-specific and relevant host miRNAs was utilized.
All SARS-CoV-2 RNA sequences were aligned to the reference “Wuhan”
variant using MAFFT tool [27]. After the alignment, we transformed
viral RNAs coordinates on the “Wuhan” variant and annotated
protein-coding regions using
“Wuhan” annotation
(https://www.ncbi.nlm.nih.gov/nuccore/1798174254).
2.4 Statistical analysis
In all comparisons, the significances of the findings were evaluated in
the nonparametric Mann–Whitney U tests [28]. In addition, the
standard Student’s t-test procedure [29] was used to test the
hypothesis that the Spearman correlation [30] was zero.
The Monte-Carlo sampling procedure was used to show that the decrease in
the occurrence of miRNA binding motifs in the “Omicron” sequence group
was specific to the highly expressed miRNAs. More precisely, the
expression profile of the highly expressed miRNAs was fixed and the
miRNA sequences were randomly replaced with other miRNAs expressed in a
tissue. After that, the weighted number of binding motifs in
“Omicrons” and “Ancestral” groups were calculated: the number of
binding motifs within a particular miRNA was multiplied by the
proportion of its read counts, summed up and normalized by the total
lengths of the compared viral genomes. Then, for “Ancestral variants”
and “Omicrons” groups, the computed numbers were compared using the
MannWhitney U test. The described procedure was repeated 1,000 times.
Thus, the distribution of statistics characterizing the change in the
number of binding motifs between “Ancestral variants” and “Omicrons”
was obtained. Finally, using this distribution, the probability that the
decrease in the number of binding motifs for highly expressed miRNAs was
greater than for randomly selected ones was calculated.
2.5 Other technical details
The statistical analysis (hypothesis testing, correlation computation)
was performed with SciPy library [31]. All vector computations were
implemented using NumPy library [32] and parallelized using Parallel
tool [33]. Miscellaneous computations with tables were produced with
Pandas [34]. Finally, figures were made using Matplotlib [35]
and Seaborn [36] libraries. The source code of the analysis can be
found at
https://github.com/zhiyanov/covid-miRNA-evolution.