Genetic landscape of SARS-CoV-2

Xueming Zheng; Wen Zhang

doi:10.22541/au.160455603.30764763/v1

Abstract

The novel coronavirus SARS-CoV-2 has spread across the world at breathtaking speed, causing a global epidemic. Understanding the evolutionary origin and molecular characteristics of this virus is of great concern to the research community. As more and more isolates are sequenced, it becomes possible to estimate the genomic variation and evolution of SARS-CoV-2. In this study, 17,229 complete genomes of SARS-CoV-2 were analyzed to characterize the genomic diversity. Using the Doc2vec algorithm, we obtained genome embeddings of SARS-CoV-2 isolates as well as of related virus species. The results showed that distances estimated from genome embeddings differ from those derived from sequence alignment. Additionally, a frequent mutation (C to T/U) at position -25 upstream of the ORF1ab start codon was identified. At the protein level, mutations appeared to be unevenly distributed among the proteins: ORF1ab, S, ORF3a, ORF8 and N tolerated mutations more readily, while the other proteins were highly conserved among the isolates.