Abstract
The novel coronavirus SARS-CoV-2 has caused a human epidemic all over
the world at breathtaking speed. Understanding the evolutionary origin
and molecular characteristics of this virus is of great concern to the
research community. As more and more isolates are sequenced, it becomes
possible to estimate the genomic variation and evolution of
SARS-CoV-2. In this study, 17,229 complete genomes of
SARS-CoV-2 were analyzed to characterize the genomic diversity. Using
the Doc2vec algorithm, we obtained genome embeddings of SARS-CoV-2
isolates as well as of related virus species. The results showed that
the distance estimated from genome embeddings differs from that
estimated from sequence alignment. Additionally, a frequently occurring
mutation (C to T/U) at position -25, upstream of the ORF1ab start
codon, was identified. At the protein
level, mutations appeared to be unevenly distributed among the
proteins. The ORF1ab, S, ORF3a, ORF8 and N proteins tolerated mutations
more readily, while the other proteins were highly conserved among the
isolates.
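
To illustrate the embedding-based comparison mentioned above, the following is a minimal sketch and not the pipeline used in this study. It assumes that each genome is split into overlapping k-mer "words" before being passed to a Doc2vec implementation, and that distances between the resulting embedding vectors are measured by cosine distance. The function names, the k-mer length, and the choice of cosine distance are illustrative assumptions; the abstract does not specify them.

```python
import math
from typing import List, Sequence


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mer 'words'.

    Assumption: k-mers serve as the tokens of a genome 'document';
    the value of k is hypothetical and not taken from the study.
    """
    # Normalise case and treat RNA 'U' as DNA 'T'.
    seq = sequence.upper().replace("U", "T")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

The token lists produced by `kmer_tokens` could then be given to a Doc2vec implementation (for example, gensim's `Doc2Vec` with one `TaggedDocument` per genome), and `cosine_distance` applied to the inferred vectors. Because such a distance is computed on learned vectors rather than on aligned positions, it need not agree with an alignment-based distance.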