Published on 22.01.2021 in Vol 2, No 1 (2021): Jan-Dec

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/25995.
Isolating SARS-CoV-2 Strains From Countries in the Same Meridian: Genome Evolutionary Analysis

Original Paper

1Systemomics Center, College of Pharmacy, Genomics Research Center, State-Province Key Laboratories of Biomedicine-Pharmaceutics of China, Harbin Medical University, Harbin, China

2HMU-UCCSM Centre for Infection and Genomics, Harbin Medical University, Harbin, China

3Somov Institute of Epidemiology and Microbiology, Vladivostok, Russian Federation

4Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, AB, Canada

Corresponding Author:

Emilio Mastriani, PhD

HMU-UCCSM Centre for Infection and Genomics

Harbin Medical University

No 157, Baojian Road

Harbin, 150081

China

Phone: 86 13664502721 ext 64502721

Email: emiliomastriani@icloud.com


Background: COVID-19, caused by the novel SARS-CoV-2, is considered the most threatening respiratory infection in the world, with over 40 million people infected and over 0.934 million related deaths reported worldwide. It is speculated that epidemiological and clinical features of COVID-19 may differ across countries or continents. Genomic comparison of 48,635 SARS-CoV-2 genomes has shown that the average number of mutations per sample was 7.23, and most SARS-CoV-2 strains belong to one of 3 clades characterized by geographic and genomic specificity: Europe, Asia, and North America.

Objective: The aim of this study was to compare the genomes of SARS-CoV-2 strains isolated from Italy, Sweden, and Congo, that is, 3 different countries in the same meridian (longitude) but with different climate conditions, and from Brazil (as an outgroup country), to analyze similarities or differences in patterns of possible evolutionary pressure signatures in their genomes.

Methods: We obtained data from the Global Initiative on Sharing All Influenza Data repository by sampling all genomes available at the time of retrieval (May 2020). Using HyPhy, we performed recombination analysis with the genetic algorithm recombination detection method, followed by trimming, removal of the stop codons, and phylogenetic tree and mixed effects model of evolution analyses. We also performed secondary structure prediction analysis for both sequences (mutated and wild-type) and “disorder” and “transmembrane” analyses of the protein. We analyzed both protein structures with an ab initio approach to predict their ontologies and 3D structures.

Results: Evolutionary analysis revealed that codon 9628 is under episodic selective pressure for all SARS-CoV-2 strains isolated from the 4 countries, suggesting it is a key site for virus evolution. Codon 9628 encodes the P0DTD3 (Y14_SARS2) uncharacterized protein 14. Further investigation showed that the codon mutation was responsible for helical modification in the secondary structure. The codon was positioned in the more ordered region of the gene (41-59) and near to the area acting as the transmembrane (54-67), suggesting its involvement in the attachment phase of the virus. The predicted protein structures of both wild-type and mutated P0DTD3 confirmed the importance of the codon to define the protein structure. Moreover, ontological analysis of the protein emphasized that the mutation enhances the binding probability.

Conclusions: Our results suggest that RNA secondary structure may be affected and, consequently, the protein product changes from G (glycine) to T (threonine) at position 50 of the protein. This position is located close to the predicted transmembrane region. Mutation analysis revealed that the change from G (glycine) to T (threonine) may confer a new function to the protein, namely binding activity, which in turn may be responsible for attaching the virus to human eukaryotic cells. These findings can help design in vitro experiments and may facilitate vaccine design and successful antiviral strategies.

JMIR Bioinformatics Biotechnol 2021;2(1):e25995

doi:10.2196/25995




The ongoing COVID-19 pandemic caused by the novel SARS-CoV-2 is the most threatening respiratory infection worldwide and has affected almost every country in the world. As of December 30, 2020, over 81 million people were infected with COVID-19, and more than 1.7 million deaths were reported. Many health institutions are attempting to produce effective vaccines against this virus infection, and several are now in the final stages of development before their application to human populations [1,2].

The SARS-CoV-2 genome shares approximately 82% sequence identity with SARS-CoV and MERS-CoV (Middle East respiratory syndrome coronavirus) and more than 90% sequence identity for essential enzymes and structural proteins. This high level of sequence identity suggests a common pathogenesis mechanism and, thus, therapeutic targeting. SARS-CoV-2 contains 4 structural proteins, including spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins [3]. The structure and the genome of SARS-CoV-2 are being extensively studied, but the results seem to be controversial. For example, a recent study found that the 2 integral membrane proteins (ie, envelope and membrane proteins) tend to evolve slowly by accumulating nucleotide mutations on their corresponding genes, but genes encoding nucleocapsid, viral replicase and spike proteins, which are regarded as important targets for the development of vaccines and antiviral drugs, tend to evolve faster [4]. However, other studies have shown that potential drug targets of SARS-CoV-2 are highly conserved [3].

The genome of SARS-CoV-2 consists of a single-stranded positive-sense RNA. A newly sequenced genome of SARS-CoV-2 was submitted to the NCBI genome database (NC_045512.2). The genome comprises 13-15 open reading frames (ORFs), 12 of which are functional, spanning ~30,000 nucleotides. It has a GC content of 38% and 11 protein-coding genes, which together express 12 proteins [3].
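GC content is the share of G and C among the called bases of a sequence. The following is a minimal Python sketch of that calculation; the function name `gc_content` and the toy sequence are illustrative and not part of the study's pipeline, and skipping ambiguous bases such as N is an assumption.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T, U)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


# Toy sequence (hypothetical): 4 G/C bases out of 10 called bases; the 2 N's are ignored.
toy_sequence = "ATGGCGTANNAT"
toy_gc = gc_content(toy_sequence)
```

Applied to a full genome read from a FASTA file, the same function would give the genome-wide value to compare against the reported figure.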

The genomic characterization of 95 SARS-CoV-2 genomes revealed the 2 most common mutations that might affect the severity and spread of SARS-CoV-2 [5]. Another study highlighted the crucial genomic features that are unique to SARS-CoV-2 and 2 other deadly coronaviruses, SARS-CoV and MERS-CoV. These unique features correlate with the high fatality rate due to infection with these coronaviruses as well as their ability to switch hosts from animals to humans [6]. As a result, it can be speculated that the epidemiological and clinical features of these viruses may differ across countries or continents.

Genomic comparison of 48,635 SARS-CoV-2 genomes has shown that the average number of mutations per sample was 7.23, and most SARS-CoV-2 strains belong to one of the following 3 clades characterized by geographic and genomic specificity: clade G (Europe), clade L (Asia), and G-derived clade (North America) [7]. These results suggest custom-designed antiviral strategies based on the molecular specificities of SARS-CoV-2 in patients from different geographies [7]. Previous studies have also differentiated the 3 variants according to the geographic location (East Asia, Europe, and America) [8]. A more recent genome-wide analysis revealed that the frequency of amino acid mutations was higher in the genome sequences of SARS-CoV-2 strains from Europe (43.07%), followed by strains from Asia (38.09%) and North America (29.64%). However, case fatality rates remained higher in the European temperate countries, such as Italy, Spain, Netherlands, France, England, and Belgium [9].

The aim of this study was to compare the set of SARS-CoV-2 genomes of viral strains isolated from representative countries in the same meridian (longitude), namely, Italy, Sweden, and Congo, which have different climate conditions, to reveal similarities or differences in the patterns of possible evolutionary pressure signatures in their genomes.


Sequence Data

We obtained data from the Global Initiative on Sharing All Influenza Data (GISAID) repository and sampled all genomes available therein as of May 5, 2020, comprising the files congo-gisaid_hcov-19_2020_05_05_09.fasta with 75 entries, italy-gisaid_hcov-19_2020_05_05_10.fasta with 69 entries, and sweden-gisaid_hcov-19_2020_05_05_10.fasta with 104 entries, as well as the outgroup file brazil_gisaid_hcov-19_2020_05_15_04.fasta with 92 entries. The reference genome with accession number NC_045512.2 was downloaded from the GenBank repository.
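The entry counts above can be checked against the downloaded files by counting FASTA header lines. This is a minimal sketch using only the Python standard library; the helper names `count_fasta_records` and `check_counts` are hypothetical, and the file names and counts are copied from the text.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Entry counts as reported in the text, keyed by the file names given there.
EXPECTED = {
    "congo-gisaid_hcov-19_2020_05_05_09.fasta": 75,
    "italy-gisaid_hcov-19_2020_05_05_10.fasta": 69,
    "sweden-gisaid_hcov-19_2020_05_05_10.fasta": 104,
    "brazil_gisaid_hcov-19_2020_05_15_04.fasta": 92,
}


def check_counts(directory, expected=EXPECTED):
    """Return {file name: (observed, expected)} for files that are missing or whose count differs."""
    mismatches = {}
    for name, n_expected in expected.items():
        path = Path(directory) / name
        observed = count_fasta_records(path) if path.exists() else None
        if observed != n_expected:
            mismatches[name] = (observed, n_expected)
    return mismatches
```

Calling `check_counts("data")` on a directory holding the four files (the directory name is an assumption) would return an empty dictionary when every file matches its reported count.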

Evolution Model Analysis

We used the SARS-CoV-2 Wuhan-Hu-1 genome (RefSeq Acc. No. NC_045512.2) as the reference sequence and the VIRULIGN version 1.0.1 application [10] to perform multiple sequence alignment, with AliView version 1.26 application for visualizing the results of the analyses [11]. HyPhy 2.5.8 (MP) was used to perform recombination analysis by the genetic algorithm recombination detection method and conduct trimming, stop codon removal, and phylogenetic tree and mixed effects model of evolution (MEME) analyses [12]. The MEME web site was used to read JSON output files and generate MEME images and tables.

RNA Secondary Structure Prediction

We used the RNAfold web server (part of the ViennaRNA Web Services) to predict the secondary structures of both the wild-type and mutated sequences [13], and the Forna package [14] to build the graph diagrams.

Protein Analysis

Protein disorder analysis was conducted using the MFDp2 [15], NetSurfP-2.0 [16], and SPOT-Disorder2 [17] applications. Transmembrane analysis of the protein was performed using the TMHMM server v.2.0, the MemBrain webserver [18], and ProtScale [19] and TMpred [20] on the Expasy website [21], with the ProtScale and TMpred scores normalized for comparison.

3D Protein Structure Prediction and Ontologies

Both protein structures were predicted with an ab initio approach by using the Robetta webserver [22], whereas the DeeProtein capsule on Code Ocean [23] was used to predict the ontologies of the predicted proteins. 3D images of the protein structures and their ontologies were produced using PyMOL 2.4.0 [24].


Codon 9628 Evolved Under Episodic Positive Selection

Mixed evolutionary analysis based on the MEME algorithm was conducted on the SARS-CoV-2 data from Italy, Sweden, and Congo (countries from the same geographic meridian) and Brazil (included as an outgroup). The investigation revealed codon 9628 was under episodic positive selective pressure across the countries, as depicted in Table 1.

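Table 1. Mixed effects model of evolution (MEME) analysis results showing data obtained from the evolutionary analysis of SARS-CoV-2 from Brazil, Congo, Italy, and Sweden. The top 3 sites for every country are shown, sorted by P value.

| Country (ID) | Site | Partition | α | β− | p− | β+ | p+ | LRT | P value | Branches under selection | Total branch length | MEME LogL | Fixed effects likelihood LogL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Brazil (BR) | 9628^a | 1 | 0 | 0 | 0.96 | 10,000 | 0.04 | 16.37 | <.001 | 2 | 0.65 | -27.28 | -20.62 |
| Brazil (BR) | 9928 | 1 | 0 | 0 | 0.82 | 10,000 | 0.18 | 11.12 | <.001 | 4 | 2.71 | -31.03 | -28.53 |
| Brazil (BR) | 81 | 1 | 0 | 0 | 0.04 | 1032.18 | 0.96 | 6.95 | .01 | 5 | 1.49 | -40.77 | -40.77 |
| Congo (CG) | 9628^a | 1 | 0 | 0 | 0.97 | 10,000 | 0.03 | 10.89 | <.001 | 1 | 0.25 | -18.18 | -13.54 |
| Congo (CG) | 2884 | 1 | 0 | 0 | 0.45 | 1273.45 | 0.55 | 3.51 | .08 | 5 | 0.60 | -42.49 | -42.37 |
| Congo (CG) | 6541 | 1 | 0 | 0 | 0.97 | 10,000 | 0.03 | 2.73 | .12 | 1 | 0.27 | -12.94 | -11.92 |
| Italy (IT) | 15 | 1 | 0 | 0 | 0.96 | 10,000 | 0.04 | 10.21 | <.001 | 1 | 0.73 | -15.90 | -12.57 |
| Italy (IT) | 9628^a | 1 | 0 | 0 | 0.97 | 10,000 | 0.03 | 11.24 | <.001 | 1 | 0.45 | -17.66 | -12.95 |
| Italy (IT) | 4 | 1 | 0 | 0 | 0.89 | 10,000 | 0.11 | 7.25 | .01 | 0 | 1.83 | -13.11 | -10.43 |
| Sweden (SE) | 9628^a | 1 | 0 | 0 | 0.96 | 9613.52 | 0.04 | 16.03 | <.001 | 2 | 0.51 | -27.43 | -21.10 |
| Sweden (SE) | 4409 | 1 | 0 | 0 | 0.97 | 4356.70 | 0.03 | 7.68 | .01 | 1 | 0.16 | -15.63 | -12.33 |
| Sweden (SE) | 4732 | 1 | 0 | 0 | 0.95 | 10,000 | 0.05 | 3.85 | .07 | 2 | 0.74 | -19.66 | -18.78 |

^a Indicates site 9628.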

In this context, we use the term “site” as a synonym of codon, respecting the HyPhy terminology. The asymptotic P value was <.001 for episodic diversification at site 9628. Figure 1 shows the distribution of the P value across the sites for all 4 countries.
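To illustrate how an asymptotic P value can be derived from a MEME likelihood ratio test (LRT) statistic, the sketch below evaluates the tail of a mixture of chi-square distributions. The mixture weights (0.33 for a point mass at 0, 0.30 for 1 degree of freedom, and 0.37 for 2 degrees of freedom) are an assumption based on the MEME method description [12] and should be checked against the HyPhy version in use; the function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Survival function P(X > x) of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x):
    """Survival function P(X > x) of a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def meme_p_value(lrt, w1=0.30, w2=0.37):
    """Asymptotic P value for a MEME LRT statistic under the ASSUMED mixture
    0.33*chi2(0 df) + 0.30*chi2(1 df) + 0.37*chi2(2 df).
    The point mass at 0 contributes nothing to the tail for lrt > 0."""
    if lrt <= 0:
        return 1.0
    return w1 * chi2_sf_1df(lrt) + w2 * chi2_sf_2df(lrt)


# LRT values copied from Table 1 (Brazil, sites 9628 and 81).
p_site_9628 = meme_p_value(16.37)
p_site_81 = meme_p_value(6.95)
```

Under these assumed weights, an LRT of 16.37 falls well below .001 and an LRT of 6.95 lands near .01, which agrees with the P values listed in Table 1.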

A closer inspection of the multiple sequence alignments for the 4 countries revealed that the episodic positive selective pressure on site 9628 corresponds to a consistent mutation of the codon from GGG to ACG, as shown in Figure 2.
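As a quick check of what this codon change means at the protein level, the sketch below translates both codons with the standard genetic code and counts the nucleotide positions that differ. It uses only the Python standard library; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one DNA or RNA codon into its one-letter amino acid code."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def nucleotide_differences(codon_a, codon_b):
    """Count the positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a.upper(), codon_b.upper()))


wild_type, mutated = "GGG", "ACG"
wild_type_aa = translate_codon(wild_type)  # glycine (G)
mutated_aa = translate_codon(mutated)      # threonine (T)
n_changes = nucleotide_differences(wild_type, mutated)  # first two positions differ
```

The codon change is therefore nonsynonymous: GGG encodes glycine and ACG encodes threonine, with 2 of the 3 nucleotides altered.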

Figure 1. Mixed effects model of evolution site plot. Distribution of the P value over the sites in Brazil, Congo, Italy, and Sweden. The purple circle indicates site 9628 that was found to be under episodic selective pressure.
Figure 2. Part of the multiple sequence alignment from the Italian data showing the site 9628 under episodic selective pressure. The nucleotides mutate from GGG to ACG.

RNA Secondary Structure Prediction Changes

Secondary structures were predicted for the sequence before and after the GGG to ACG mutation (Figure 3), and the 2 predictions differ markedly. The comparison highlighted structural modifications at the top-right ring of the RNA conformation, as depicted in Figure 4, suggesting that the GGG to ACG mutation was responsible for a significant modification of the RNA secondary structure.
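One simple way to quantify how much two predicted structures differ is the base-pair distance, that is, the number of base pairs present in one structure but not in the other. The sketch below computes it from dot-bracket strings such as those returned by RNA folding tools. The two structures shown are toy examples, not the predictions reported here, and this metric is offered as an illustration rather than as the comparison method used in this study.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Number of base pairs found in exactly one of two equal-length structures."""
    if len(structure_a) != len(structure_b):
        raise ValueError("structures must have the same length")
    return len(base_pairs(structure_a) ^ base_pairs(structure_b))


# Toy structures (hypothetical): the second loses the innermost base pair.
toy_wild_type = "((((....))))"
toy_mutated = "(((......)))"
toy_distance = base_pair_distance(toy_wild_type, toy_mutated)
```

A distance of 0 means identical pairing; larger values indicate that more of the fold has been rearranged.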

Figure 3. Nucleotide mutation over aligned sequences, illustrating the sequence considered to predict secondary structures in both mutated and wild-type proteins. Site position is indicated in blue, from the start codon (9578) to the open reading frame (9632).
Figure 4. Secondary structure prediction. The 2 RNA diagrams exhibit structural modifications affected by the GGG to ACG mutation.

Protein Analysis

Disorder analysis of the protein showed that positions 41 to 59 form the most ordered (stable) region, with the glycine (G) at position 50 lying inside it. We obtained these results by using 3 different software tools and averaging their disorder probabilities, as shown in Figure 5 and reported in Table 2. Further analysis to locate the transmembrane region in the protein revealed that positions 54-67 were associated with this function. This analysis, conducted by using 4 distinct web applications and averaging their results, places the glycine (G) near enough to the transmembrane region to suggest its involvement. Table 3 reports the probability of each amino acid belonging to the transmembrane region. The transmembrane topology of the sequence (Figure 6) places the amino acid G at position 50 in the middle of the transmembrane region, and the distribution of the probabilities (Figure 7) corroborates this hypothesis.

Figure 5. Disorder region analysis. The region 41-59 was found to have the lowest probability to be disordered. The orange lines delimit this region, and the blue dotted line outlines the position of G on the different curves.
Table 2. Protein disorder analysis results showing the probability of disorder for each position of the protein. The probabilities have been calculated using MFDp2, Netsurf, and SPOTD software.

| Position | Amino acid | MFDp2 disorder probability | NetsurfP2 disorder probability | SPOTD disorder probability | Average value^a |
| --- | --- | --- | --- | --- | --- |
| 1 | M | 0.132 | 0.627823114 | 0.5607 | 0.440174371 |
| 2 | L | 0.134 | 0.347978383 | 0.5358 | 0.339259461 |
| 3 | Q | 0.135 | 0.270706475 | 0.4945 | 0.300068825 |
| … | … | … | … | … | … |
| 39 | T | 0.03 | 0.010842944 | 0.1936 | 0.078147648 |
| 40 | V | 0.029 | 0.007660664 | 0.189 | 0.075220221 |
| 41 | Q | 0.027 | 0.004478907 | 0.172 | 0.067826302 |
| 42 | E | 0.025 | 0.00340931 | 0.1848 | 0.07106977 |
| 43 | I | 0.025 | 0.003887762 | 0.1968 | 0.075229254 |
| 44 | Q | 0.024 | 0.003997837 | 0.1927 | 0.073565946 |
| 45 | L | 0.023 | 0.00361518 | 0.2129 | 0.079838393 |
| 46 | Q | 0.023 | 0.004551574 | 0.2123 | 0.079950525 |
| 47 | A | 0.023 | 0.004939525 | 0.2011 | 0.076346508 |
| 48 | A | 0.022 | 0.005752307 | 0.2133 | 0.080350769 |
| 49 | V | 0.022 | 0.002826149 | 0.2524 | 0.092408716 |
| 50^b | G | 0.022 | 0.005828088 | 0.2013 | 0.076376029 |
| 51 | E | 0.022 | 0.001046103 | 0.24 | 0.087682034 |
| 52 | L | 0.023 | 0.000922468 | 0.2694 | 0.097774156 |
| 53 | L | 0.023 | 0.001263275 | 0.2588 | 0.094354425 |
| 54 | L | 0.023 | 0.001187441 | 0.2539 | 0.092695814 |
| 55 | L | 0.023 | 0.000650476 | 0.2483 | 0.090650159 |
| 56 | E | 0.023 | 0.000615434 | 0.2328 | 0.085471811 |
| 57 | W | 0.023 | 0.001080571 | 0.2302 | 0.08476019 |
| 58 | L | 0.023 | 0.000941573 | 0.2154 | 0.079780524 |
| 59 | A | 0.023 | 0.001573079 | 0.208 | 0.07752436 |
| 60 | M | 0.024 | 0.000997698 | 0.2853 | 0.103432566 |
| 61 | A | 0.024 | 0.00227783 | 0.3026 | 0.109625943 |
| 62 | V | 0.025 | 0.003362786 | 0.3503 | 0.126220929 |

^a Average values of the disorder probability for each position.

^b Amino acid G placed at position 50, inside the stable region.
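The averaging step behind Table 2 can be sketched as follows: each position's consensus value is the arithmetic mean of the three predictor outputs, and a sliding window can then be used to locate the stretch with the lowest average disorder. The per-predictor values are copied from Table 2 for positions 49 to 51; the sliding-window search and all function names are illustrative assumptions, since the text does not state how the 41-59 region was delimited.

```python
def consensus(scores):
    """Arithmetic mean of the per-predictor disorder probabilities for one position."""
    return sum(scores) / len(scores)


def most_ordered_window(averages, width):
    """Return the 0-based start index of the window of `width` consecutive
    positions with the lowest mean disorder probability (first one on ties)."""
    if width <= 0 or width > len(averages):
        raise ValueError("width must be between 1 and the number of positions")
    best_start, best_mean = 0, float("inf")
    for start in range(len(averages) - width + 1):
        mean = sum(averages[start:start + width]) / width
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start


# (MFDp2, NetsurfP2, SPOTD) values from Table 2 for positions 49, 50, and 51.
table2_rows = {
    49: (0.022, 0.002826149, 0.2524),
    50: (0.022, 0.005828088, 0.2013),
    51: (0.022, 0.001046103, 0.24),
}
averages_by_position = {pos: consensus(vals) for pos, vals in table2_rows.items()}
```

For position 50, the mean of the three predictors reproduces the tabulated average of about 0.0764.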

Table 3. Transmembrane prediction results obtained using TMHMM, MemBrainTHM, ProtScale, and TMpred applications. Results from ProtScale and TMpred have been normalized for comparison with other probabilities.

| Position | Amino acid | TMHMM probability | MemBrain THM propensity | ProtScale normalized score | TMpred normalized score | Transmembrane probability, average value^a |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | M | 0 | 0.000191 | N/A^b | 0.661425764 | 0.220538921 |
| 2 | L | 0 | 0.002851 | N/A^b | 0.661425764 | 0.221425588 |
| 3 | Q | 0 | 0.046538 | N/A^b | 0.661425764 | 0.235987921 |
| … | … | … | … | … | … | … |
| 49 | V | 0.2594 | 0.987914 | 0.646 | 0.603358942 | 0.624168236 |
| 50^c | G | 0.27719 | 0.987914 | 0.646 | 0.629801679 | 0.63522642 |
| 51 | E | 0.28083 | 0.991702 | 0.736 | 0.660532428 | 0.667266107 |
| 52 | L | 0.32735 | 0.993857 | 0.67 | 0.594246918 | 0.646363479 |
| 53 | L | 0.56651 | 0.993857 | 0.637 | 0.778452743 | 0.743954936 |
| 54 | L | 0.63937 | 0.994522 | 0.632 | 0.73360729 | 0.749874822 |
| 55 | L | 0.64032 | 0.990459 | 0.659 | 0.818831517 | 0.777152629 |
| 56 | E | 0.64052 | 0.96027 | 0.726 | 0.835626228 | 0.790604057 |
| 57 | W | 0.64826 | 0.946819 | 0.701 | 0.822583527 | 0.779665632 |
| 58 | L | 0.6493 | 0.947424 | 0.706 | 0.895122387 | 0.799461597 |
| 59 | A | 0.64928 | 0.947424 | 0.683 | 0.905663748 | 0.796341937 |
| 60 | M | 0.64927 | 0.970735 | 0.683 | 0.947293193 | 0.812574548 |
| 61 | A | 0.64924 | 0.970735 | 0.773 | 0.955511881 | 0.83712172 |
| 62 | V | 0.64903 | 0.937507 | 0.831 | 1 | 0.85438425 |
| 63 | M | 0.64893 | 0.892506 | 0.831 | 0.960871896 | 0.833326974 |
| 64 | L | 0.6482 | 0.846403 | 0.84 | 0.942826514 | 0.819357379 |
| 65 | L | 0.64758 | 0.781733 | 0.847 | 0.924066464 | 0.800094866 |
| 66 | L | 0.63557 | 0.670387 | 0.856 | 0.661425764 | 0.705845691 |
| 67 | L | 0.61835 | 0.539353 | 0.851 | 0.661425764 | 0.667532191 |
| 68 | C | 0.5428 | 0.455615 | 0.819 | 0.661425764 | 0.619710191 |
| 69 | C | 0.51009 | 0.430385 | 0.728 | 0.661425764 | 0.582475191 |
| 70 | C | 0.44702 | 0.380525 | N/A^b | 0.661425764 | 0.496323588 |

^a Average values of the probability for each position.

^b The window size used for the profile computation is 9, so the score is not applicable for positions 1-4 and 70-73.

^c Amino acid G placed at position 50, inside the stable region.
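The averages in Table 3 are consistent with taking the mean over whichever predictors returned a score, so positions lacking a ProtScale value are averaged over 3 tools rather than 4. The sketch below shows that calculation together with a min-max rescaling to the 0-1 range. The rescaling is only one common choice and is an assumption here, because the text does not state which normalization was applied to the ProtScale and TMpred scores; the function names are hypothetical.

```python
def mean_of_available(scores):
    """Mean over the predictors that returned a value (None marks N/A)."""
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no predictor returned a score")
    return sum(available) / len(available)


def min_max_normalize(values):
    """Rescale raw scores linearly so that the minimum maps to 0 and the maximum to 1
    (an ASSUMED normalization, shown for illustration)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant scores")
    return [(v - lo) / (hi - lo) for v in values]


# (TMHMM, MemBrain THM, ProtScale, TMpred) values from Table 3.
position_1 = (0, 0.000191, None, 0.661425764)      # ProtScale not applicable
position_62 = (0.64903, 0.937507, 0.831, 1)

avg_position_1 = mean_of_available(position_1)
avg_position_62 = mean_of_available(position_62)
```

Both results match the tabulated averages (about 0.2205 for position 1 and 0.8544 for position 62).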

Figure 6. Topology diagram using the MemBrain v3. The illustration depicts the transmembrane topology of the sequence and highlights that the amino acid at position 50 (G) is positioned into the middle of the transmembrane region. Red: transmembrane helix (TMH); blue: loop.
Figure 7. Transmembrane prediction. The region 54-67 was found to be the region with the highest probability to code for the transmembrane, and the G amino acid is near enough to suppose its involvement. The orange lines delimit this region, and the blue dotted line outlines the position of G on the different curves.

3D Protein Analysis

To characterize the deduced protein P0DTD3.1, we predicted the 3D structures for both the wild-type and mutated protein sequences using an ab initio approach. According to the preliminary clue from the secondary structure prediction, the mutated protein presents a slightly different structure when the amino acid residue changed from G to T. Figures 8 and 9 illustrate both the predicted models showing that the mutation would affect the tertiary structure of the protein. The comparison of residues 45-55 between MUT31136 and MOD30336 showed that this portion of the protein with the mutation stretches out with repercussions to the preceding helix. This result suggests that the mutation of the single amino acid from G to T, with consecutive stretching cycles on the 3D structure of the protein, tends to make the protein assume new functions.

Figure 8. Prediction of the 3D structure for the mutated protein of SARS-CoV-2. The model MUT31136 represents the predicted 3D model of the protein subject to mutation. (A) Amino acid sequence colored by the spectrum range, with the mutated amino acid indicated in black color at position 50 (T). (B) The protein has been oriented to facilitate the comparison and residue 50 is represented with red dots. (C) Details of the residues 45-55 and their rotation (D) around the Y-axis and (E) around the X-axis with a step of 90°.
Figure 9. Prediction of the 3D structure of the unchanged protein. The model MOD30506 represents the predicted 3D model of the wild-type protein. (A) Amino acid sequence colored by the spectrum range, with the investigated amino acid indicated in black color at position 50 (G). (B) The protein has been oriented to facilitate the comparison and the residue 50 is represented with the red dots. (C) Details of the residues 45-55 and their rotation (D) around the Y-axis and (E) around the X-axis with a step of 90°.

Prediction of Protein-Related Ontologies

The analysis of protein ontologies indicates different functions between the wild-type and mutated proteins, owing to their changed structures. As shown in Table 4, the wild-type variant of the protein is linked with a high probability (.978≤P≤1) to both catalytic and transferase activities. The mutated variant of the protein presents a remarkable change in its functionality trend: even if usually the scores below 0.5 are interpreted as negative predictions, in an evolutionary context, the decrease in probability of the transferase activity (from 0.98 to 0.375) to favor the binding function (from 0.004 to 0.132) is not regarded as negligible. The contextual inversion of tendency of transferase to binding activity suggests that the episodic evolutionary mutation aims to improve the binding ability of the protein.

Table 4. Classification report showing the predicted functions of both (mutated and wild-type) protein sequences and related scores. Only positive scores are reported.

| Gene ontology term | Function | Score, wild-type protein sequence | Score, mutated protein sequence |
| --- | --- | --- | --- |
| GO:0003674 | Molecular function | 1 | 1 |
| GO:0003824 | Catalytic function | 1 | 0.998 |
| GO:0016740^a | Transferase activity | 0.978 | 0.375 |
| GO:0016829 | Lyase activity | 0.017 | ^b |
| GO:0022891 | Transmembrane | 0.07 | ^b |
| GO:0005488^a | Binding activity | 0.004 | 0.132 |
| GO:0022892 | Transmembrane transport activity | 0.001 | 0.001 |

^a Ontological functions subjected to inverted tendency.

^b Unpredicted function.
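The shift in predicted function can be summarized by subtracting the wild-type score from the mutated score for each term in Table 4. The sketch below does this for the terms scored in both sequences; the scores are copied from Table 4, `None` marks an unpredicted function, and the function names are illustrative.

```python
# (wild-type score, mutated score) per gene ontology term, from Table 4.
SCORES = {
    "GO:0003674 molecular function": (1.0, 1.0),
    "GO:0003824 catalytic function": (1.0, 0.998),
    "GO:0016740 transferase activity": (0.978, 0.375),
    "GO:0016829 lyase activity": (0.017, None),
    "GO:0022891 transmembrane": (0.07, None),
    "GO:0005488 binding activity": (0.004, 0.132),
    "GO:0022892 transmembrane transport activity": (0.001, 0.001),
}


def score_changes(scores):
    """Return mutated minus wild-type score for terms predicted in both sequences."""
    return {
        term: round(mutated - wild_type, 3)
        for term, (wild_type, mutated) in scores.items()
        if wild_type is not None and mutated is not None
    }


def largest_shifts(changes):
    """Return the terms with the largest decrease and the largest increase."""
    return min(changes, key=changes.get), max(changes, key=changes.get)


changes = score_changes(SCORES)
biggest_drop, biggest_gain = largest_shifts(changes)
```

With the tabulated scores, transferase activity shows the largest decrease (about 0.60) and binding activity the largest increase (about 0.13).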


Principal Findings

SARS-CoV-2, the virus that causes COVID-19, has many peculiar characteristics, such as rapidly accumulating mutations, compared to other coronaviruses [25]. Specifically, the prevalence of single nucleotide transitions as the major mutational type of SARS-CoV-2 across the world has been shown previously [7]. In this study, we conducted evolutionary analyses on the mutations to determine whether SARS-CoV-2 genomes from different countries in the same meridian might have specific variation patterns. We found that codon 9628 was under episodic selective pressure in all 4 countries, that is, the 3 countries in the same meridian and the outgroup country. This would affect RNA secondary structure and, consequently, the protein product, with G (glycine) changing to T (threonine) at protein position 50. This position is located close to the predicted transmembrane region. Mutation analysis revealed that the change from G (glycine) to T (threonine) may confer a new function to the protein, that is, binding activity, which in turn may be responsible for attaching the virus to human eukaryotic cells. These bioinformatics findings may help in better designing in vitro (wet lab) and in vivo (animal model) experiments to determine protein variants associated with the virulence of the virus. Therefore, these findings may eventually facilitate vaccine design and successful antiviral strategies. For example, the results of this study suggest the need for site-directed mutagenesis and animal experiments to validate the anticipated effects.

Mercatelli and Giorgi [7] demonstrated that clade G, prevalent in Europe, carries a D614G mutation in the spike protein, which is responsible for the initial interaction of the virus with the host human cell. Other studies have also shown different mutation locations among strains isolated from different continents. Mutations at positions 2891, 3036, 14408, 23403, and 28881 are predominantly observed in European strains, whereas those located at positions 17746, 17857, and 18060 are exclusively present in North American strains of SARS-CoV-2 [26]. These findings suggest that the virus is evolving and that European, North American, and Asian strains of the virus might coexist, with each characterized by different mutation patterns.

Furthermore, a comparison of viral genomes of SARS-CoV-2 strains from 13 countries identified differences in the protein-coding sequences. For example, an Indian strain showed a mutation in the spike glycoprotein at R408I and in the replicase polyprotein at I671T, P2144S, and A2798V, whereas the spike protein of Spain and South Korean strains carried an F797C and a S221W mutation, respectively [27]. Moreover, recently conducted integrative analyses of SARS-CoV-2 genomes of strains from different geographical locations reveal unique features that are potentially consequential to host-virus interaction and pathogenesis [28]. However, the most recent study of genomic diversity and hotspot mutations in 30,983 SARS-CoV-2 genomes indicates that unlike the influenza virus or HIV, SARS-CoV-2 has a low mutation rate, which makes the development of an effective global vaccine very likely [29]. The study determined several hotspot mutations across the whole SARS-CoV-2 genome. In all, 14 nonsynonymous hotspot mutations (whose prevalence of mutations is >10%) have been identified at different locations along the viral genome: 8 in ORF1ab polyprotein (in nsp2, nsp3, transmembrane domain, RdRp, helicase, exonuclease, and endoribonuclease), 3 in nucleocapsid protein, and 1 in each of the 3 proteins spike, ORF3a, and ORF8. Moreover, 36 nonsynonymous mutations were identified in the receptor-binding domain of the spike protein with a low prevalence (<1%) across all genomes [29].

Conclusions

All these findings highlight the importance of studying the relationship between the geographical locations of SARS-CoV-2 isolates and the mutations in their genomes, a relationship that can also be examined by phylogenetic tree analyses that resolve lineages and clusters by geographic location. In conclusion, this genome evolutionary analysis revealed that codon 9628 is under episodic selective pressure in SARS-CoV-2 strains isolated from the 3 countries in the same geographical meridian (Italy, Sweden, and Congo) as well as from the outgroup country (Brazil).

Acknowledgments

This work was supported by grants from the National Natural Science Foundation of China (NSFC81671980, 81871623, 82020108022, to Shu-Lin Liu). The funding bodies played no role in the design of the study; the collection, analysis, or interpretation of data; or the writing of the manuscript.

Conflicts of Interest

None declared.

  1. Bar-Zeev N, Inglesby T. COVID-19 vaccines: early success and remaining challenges. The Lancet 2020 Sep;396(10255):868-869.
  2. Logunov DY, Dolzhikova IV, Zubkova OV, Tukhvatulin AI, Shcheblyakov DV, Dzharullaeva AS, et al. Safety and immunogenicity of an rAd26 and rAd5 vector-based heterologous prime-boost COVID-19 vaccine in two formulations: two open, non-randomised phase 1/2 studies from Russia. The Lancet 2020 Sep;396(10255):887-897.
  3. Naqvi AAT, Fatima K, Mohammad T, Fatima U, Singh IK, Singh A, et al. Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach. Biochim Biophys Acta Mol Basis Dis 2020 Oct 01;1866(10):165878.
  4. Dilucca M, Forcelloni S, Georgakilas AG, Giansanti A, Pavlopoulou A. Codon usage and phenotypic divergences of SARS-CoV-2 genes. Viruses 2020 Apr 30;12(5).
  5. Khailany RA, Safdar M, Ozaslan M. Genomic characterization of a novel SARS-CoV-2. Gene Rep 2020 Jun;19:100682.
  6. Gussow AB, Auslander N, Faure G, Wolf YI, Zhang F, Koonin EV. Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses. Proc Natl Acad Sci U S A 2020 Jun 30;117(26):15193-15199.
  7. Mercatelli D, Giorgi FM. Geographic and genomic distribution of SARS-CoV-2 mutations. Front Microbiol 2020;11:1800.
  8. Forster P, Forster L, Renfrew C, Forster M. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci U S A 2020 Apr 28;117(17):9241-9243.
  9. Islam MR, Hoque MN, Rahman MS, Alam ASMRU, Akther M, Puspo JA, et al. Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity. Sci Rep 2020 Aug 19;10(1):14004.
  10. Libin PJK, Deforche K, Abecasis AB, Theys K. VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics 2019 May 15;35(10):1763-1765.
  11. Larsson A. AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 2014 Nov 15;30(22):3276-3278.
  12. Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Kosakovsky Pond SL. Detecting individual sites subject to episodic diversifying selection. PLoS Genet 2012;8(7):e1002764.
  13. Lorenz R, Bernhart SH, Höner Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA Package 2.0. Algorithms Mol Biol 2011 Nov 24;6:26.
  14. Kerpedjiev P, Hammer S, Hofacker IL. Forna (force-directed RNA): Simple and effective online RNA secondary structure diagrams. Bioinformatics 2015 Oct 15;31(20):3377-3379.
  15. Mizianty MJ, Uversky V, Kurgan L. Prediction of intrinsic disorder in proteins using MFDp2. Methods Mol Biol 2014;1137:147-162.
  16. Klausen MS, Jespersen MC, Nielsen H, Jensen KK, Jurtz VI, Sønderby CK, et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins 2019 Jun 09;87(6):520-527.
  17. Hanson J, Paliwal KK, Litfin T, Zhou Y. SPOT-Disorder2: improved protein intrinsic disorder prediction by ensembled deep learning. Genomics Proteomics Bioinformatics 2019 Dec;17(6):645-656.
  18. Yin X, Yang J, Xiao F, Yang Y, Shen H. MemBrain: an easy-to-use online webserver for transmembrane protein structure prediction. Nanomicro Lett 2018;10(1):2.
  19. Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, et al. Protein identification and analysis tools in the ExPASy server. In: Link AJ, editor. 2-D Proteome Analysis Protocols. Methods in Molecular Biology vol. 112. Totowa, NJ: Humana Press; 1999:531-552.
  20. Hofmann K, Stoffel W. TMbase-a database of membrane spanning protein segments. Biol Chem Hoppe-Seyler 1993;374:166.
  21. Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A. ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 2003 Jul 01;31(13):3784-3788.
  22. Kim DE, Chivian D, Baker D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 2004 Jul 01;32(Web Server issue):W526-W531.
  23. Upmeier zu Belzen J, Bürgel T, Holderbach S, Bubeck F, Adam L, Gandor C, et al. Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins. Nat Mach Intell 2019 May 13;1(5):225-235.
  24. Rigsby RE, Parker AB. Using the PyMOL application to reinforce visual understanding of protein structure. Biochem Mol Biol Educ 2016 Sep 10;44(5):433-437.
  25. Zhao Z, Li H, Wu X, Zhong Y, Zhang K, Zhang YP, et al. Moderate mutation rate in the SARS coronavirus genome and its implications. BMC Evol Biol 2004 Jun 28;4:21.
  26. Pachetti M, Marini B, Benedetti F, Giudici F, Mauro E, Storici P, et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant. J Transl Med 2020 Apr 22;18(1):179.
  27. Khan MI, Khan ZA, Baig MH, Ahmad I, Farouk A, Song YG, et al. Comparative genome analysis of novel coronavirus (SARS-CoV-2) from different geographical locations and the effect of mutations on major target proteins: An in silico insight. PLoS One 2020;15(9):e0238344.
  28. Sardar R, Satish D, Birla S, Gupta D. Integrative analyses of SARS-CoV-2 genomes from different geographical locations reveal unique features potentially consequential to host-virus interaction, pathogenesis and clues for novel therapies. Heliyon 2020 Sep;6(9):e04658.
  29. Alouane T, Laamarti M, Essabbar A, Hakmi M, Bouricha EM, Chemao-Elfihri MW, et al. Genomic diversity and hotspot mutations in 30,983 SARS-CoV-2 genomes: moving toward a universal vaccine for the. Pathogens 2020 Oct 10;9(10).


Abbreviations

GISAID: Global Initiative on Sharing All Influenza Data
MEME: mixed effects model of evolution
MERS-CoV: Middle East respiratory syndrome coronavirus
ORF: open reading frame


Edited by G Eysenbach; submitted 23.11.20; peer-reviewed by F Pappalardo, S Motta; comments to author 14.12.20; revised version received 30.12.20; accepted 13.01.21; published 22.01.21

Copyright

©Emilio Mastriani, Alexey V Rakov, Shu-Lin Liu. Originally published in JMIR Research Protocols (http://www.researchprotocols.org), 22.01.2021.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://bioinform.jmir.org, as well as this copyright and license information must be included.