2.6 Similarity in gene expression among samples
To assess the extent and direction of variation among samples based on their gene expression, we calculated the correlation of gene expression levels among samples and the Euclidean distances among samples in DESeq2 (version 1.22.2; Love et al. 2014), following the program directions. These measures are especially useful for assessing the similarity of biological replicates (e.g., samples belonging to the same group) (Koch et al. 2018) and therefore for detecting anomalies among the samples. The sample correlation matrix was calculated as the pairwise Pearson correlation of the normalized matrix, after the variance stabilizing transformation (vst) was applied to the 2000 most variable genes in the HTSeq count data. The vst takes into account the sample variability of low counts. The Pearson correlation is calculated for each pair of samples and ranges from -1 to 1: a value of 0 indicates no linear correlation between the expression profiles of the two samples, a value of 1 indicates a perfect positive linear relationship, and a value of -1 indicates a perfect negative linear relationship. The Euclidean distance between samples was calculated with the equation dist = sqrt(1 - cor^2), where cor is the correlation coefficient of two samples. The smaller the distance, the stronger the correlation between the samples (in absolute value). These distances were then used to build the heatmaps of sample distance for each normalized matrix, which allows the data to be shrunk towards the genes’ average expression across all samples. Gene heatmaps were instead based on raw counts normalized with the vst, after which the mean expression in each sample was centered at 0. Finally, differences in gene expression among the studied groups (see below) were visualized in a PCA plot of the gene count matrix after the variance stabilizing transformation (vst) had been applied to normalize the raw counts.
PCA plots are useful for assessing the effect of covariates and batch effects (non-biological variation due to experimental artifacts; Reese et al. 2013).
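The calculations described in this section can be illustrated with a minimal NumPy sketch. It is not the DESeq2 implementation and does not reproduce the vst itself: it assumes a matrix that has already been variance-stabilized (genes in rows, samples in columns) and uses randomly generated toy data in its place. The function names and the toy matrix dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def top_variable_genes(expr, n_top=2000):
    """Keep the n_top genes (rows) with the highest variance across samples."""
    variances = expr.var(axis=1, ddof=1)
    keep = np.argsort(variances)[::-1][:n_top]
    return expr[keep]


def sample_correlation(expr):
    """Pairwise Pearson correlation between samples (columns)."""
    return np.corrcoef(expr, rowvar=False)


def correlation_distance(cor):
    """dist = sqrt(1 - cor^2); clipping guards against rounding error."""
    return np.sqrt(np.clip(1.0 - cor ** 2, 0.0, None))


def pca_scores(expr, n_components=2):
    """Project samples onto the leading principal components via SVD."""
    by_sample = expr.T                              # samples x genes
    centered = by_sample - by_sample.mean(axis=0)   # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)             # variance fraction per PC
    return scores, explained[:n_components]


# Toy stand-in for a vst-transformed matrix (assumption): 500 genes x 6 samples.
rng = np.random.default_rng(0)
vst_like = rng.normal(loc=8.0, scale=1.0, size=(500, 6))
vst_like[:50, 3:] += 2.0   # shift 50 genes in the last three samples

top = top_variable_genes(vst_like, n_top=200)
cor = sample_correlation(top)         # 6 x 6 sample correlation matrix
dist = correlation_distance(cor)      # 6 x 6 sample distance matrix
scores, explained = pca_scores(top)   # coordinates for a two-dimensional PCA plot
```

The `dist` matrix would feed a sample-distance heatmap and `scores` a PCA scatter plot. Because the distance depends on cor^2, strongly negative and strongly positive correlations both yield small distances.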