Some common deleterious mutations are shared in SARS-CoV-2 genomes from deceased COVID-19 patients across continents

Demographic summary of the retrieved genomes of the SARS-CoV-2

To investigate the spectrum of nucleotide (NT) and amino acid (AA) mutations and their effects in different variants of SARS-CoV-2 sequenced from deceased COVID-19 patients, we retrieved 243,270 whole genome sequences (WGS) with high read coverage (> 29,000 bp) from the Global Initiative on Sharing All Influenza Data (GISAID) up to February 2023. After thorough filtering of these genomes, 5,724 complete genomes that belonged to deceased COVID-19 patients from different demographics were selected for further analysis. These WGS data (n = 5,724) comprised SARS-CoV-2 genome sequences from 123 countries across five continents (Asia, Africa, Europe, North America and South America). The geographical distribution of the SARS-CoV-2 WGS from deceased COVID-19 patients reveals that 33.4% of the genomes were sequenced from North America, followed by 28.8% from Europe, 19.9% from South America, 17.7% from Africa, and 0.7% from Asia (Fig. 1A).

Figure 1

Retrieved SARS-CoV-2 whole genome sequences (WGS) obtained from deceased COVID-19 patients worldwide from the Global Initiative on Sharing All Influenza Data (GISAID) database. (A) SARS-CoV-2 genomes submitted to GISAID from five continents between January 2020 and February 2023. (B) Distribution of sequences across several lineages during the time span. (C) SARS-CoV-2 genomes sequenced by different countries from deceased patients. (D) Gender-wise (male and female) and (E) age-wise distribution of the selected sequences. (F) Distribution of different clades in five continents. (G) Cumulative number of strains retrieved during the time frame from five continents.

We found B.1.1.529 (Omicron) to be the most prevalent lineage (13.90%), while B.1.617.2 (Delta) and B.1.1.7 (Alpha) were also highly prevalent, comprising 10.80% and 9.80% of the sequences, respectively (Fig. 1B). In addition, B.1.1.28 (Brazilian variant), P.1 (Gamma), and B.1.1.519 (Mexican variant) accounted for 7.10%, 5.80% and 5.00% of the SARS-CoV-2 genomes, respectively (Fig. 1B). Comparing these data by country of origin, we found that Brazil contributed the largest share (16.5%) of SARS-CoV-2 WGS data sequenced from deceased COVID-19 patients, followed by Mexico (14.5%), the USA (14.4%), Bulgaria (13.1%), and India (5.1%). The smallest share of WGS (0.5%) was sequenced from deceased COVID-19 patients in Panama (Fig. 1C). According to the metadata, 61.4% of deceased COVID-19 patients were male while 38.6% were female (Fig. 1D), with an average age of approximately 70 years (Fig. 1E), which was not explicitly stated in GISAID. Further analysis demonstrated that male patients with an average age of 55 years had a higher COVID-19 risk than female patients.

The “G” clade of SARS-CoV-2 was predominant among the deceased COVID-19 patients of the Asian, African and North American regions, while most of the death cases in Europe were registered with the “GRY” clade. In contrast, most death cases in South America were registered with the “GR” clade (Fig. 1F). Looking at the cumulative death cases registered throughout the study period (January 2020 to February 2023), we found that most of the WGS data (approximately 1000 sequences) from deceased COVID-19 patients were sequenced from North American residents. Although countries from every region submitted WGS, the African region was found to generate the least data from deceased COVID-19 patients (fewer than 200 sequences) (Fig. 1G). Relevant demographic and medical data are described in Data S1.

Phylogenetic diversity of the SARS-CoV-2 genomes of the deceased COVID-19 patients

To determine the phylogenetic characteristics of the SARS-CoV-2 genomes of the deceased COVID-19 patients, we built a maximum likelihood (ML) tree based on aligned full-length sequences using the Nextclade Web 2.14.1 web-based tool (https://clades.nextstrain.org/) (Fig. 2, Data S2). The WGS data assembled from deceased COVID-19 patients around the world formed 21 Nextstrain clades, including four VOCs: 20I (Alpha, V1), 20H (Beta, V2), 20J (Gamma, V3), and 21A, 21I, and 21J (Delta, V4). In addition, the study genomes belonged to VOIs such as 21C (Epsilon), 21G (Lambda), and 21H (Mu), and to other assigned and unassigned clades including 21B (Kappa), 21F (Iota), 20E (EU1), 19A, 19B, 20A, 20B, 20C, 20D, and 20G (Fig. 2A). The viral clade distribution of the study genomes represented some of the SARS-CoV-2 genetic clades that were circulating worldwide between January 2020 and February 2023. We further explored the diversity of SARS-CoV-2 genomes by comparing the distance matrix of the SARS-CoV-2 strains of the deceased patients to the Wuhan-Hu-1 (NC_045512) reference strain. Nextstrain classification revealed that clade 20I (Alpha, V1) was prevalent in 20% of the study genomes, whereas clades 20H (Beta, V2) and 20J (Gamma, V3) were found in 2.42% and 6.33% of the genomes, respectively (Fig. 2B). Moreover, clades 21A, 21I, and 21J of the Delta variant (B.1.617.2) accounted for 1.48% of the study genomes, while clade 21F of the Iota variant was represented by 1.45% of the genomes. However, only four SARS-CoV-2 genomes in our investigation belonged to the Kappa variant (B.1.617.1), which was prevalent in India and possessed three significant alterations at L452R, E484Q, and P681R. In addition, 25, 3, and 16 sequences belonged to Epsilon (21C), Lambda (21G), and Mu (21H), respectively, which are labeled as “VOI.” The most prevalent clades in our analysis were 20B (30%) followed by 20A (22.0%).
In this analysis, the highest mutational frequency was observed in the spike protein region, followed by the ORF1a fragment, whereas the fewest changes were observed in the ORF1b region. The highest peak was observed in the N portion (Fig. 2B).

Figure 2

Phylogenetic analyses of the 5,724 SARS-CoV-2 genomes sequenced from deceased COVID-19 patients worldwide. (A) A detailed phylogenetic tree presenting all the significant clades associated with deceased COVID-19 patients. (B) Value of the entropy change (distribution of mutational frequency across the SARS-CoV-2 genome) based on the mutation count for each position. The maximum-likelihood tree was generated using the Nextclade Web 2.14.1 web-based tool (https://clades.nextstrain.org/results; accessed on October 10, 2023).

Frequency and distribution of nucleotide mutations in SARS-CoV-2 genomes of the deceased COVID-19 patients

To determine the frequency and distribution of nucleotide (NT) mutations, we further analyzed 5,724 SARS-CoV-2 genomes from deceased COVID-19 patients of diverse demographics. We detected an average of 12.9 NT mutations per genome. The overall NT mutation frequencies in the SARS-CoV-2 genomes sequenced from deceased COVID-19 patients are shown in Fig. 3. Our comprehensive mutational analysis identified 35,799 NT mutations across the entire dataset of 5,724 SARS-CoV-2 genomes, of which 11,402 (the highest number) were found in the S gene alone. Conversely, the E gene possessed the lowest number (n = 70) of NT mutations (Fig. 3A). The numbers of NT mutations in the ORF1a, N, ORF1b, ORF8, ORF3a, ORF9b, ORF7a, M, ORF6, ORF7b, and E segments were 7964, 6008, 5064, 2431, 1735, 410, 325, 177, 138, 75 and 70, respectively (Fig. 3A). In the spike (S) protein, the highest number of NT mutations was identified at the D614G position, followed by N501Y, P681H, T716I, and A570D (Fig. 3B). The four most frequent NT mutations in the ORF1a gene were T1001I, A1708D, I2230T, and T265I, and the three most frequent in the ORF1b gene were P314L, E1264D, and P218L. In addition, the E, M, ORF3a, ORF6, ORF7b, ORF8 and ORF9b genes possessed the highest number of NT mutations at the P71L, I82T, Q57H, I33T, T40I, Y73C and Q77E positions, respectively (Fig. 3B).
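The per-gene counts quoted above can be tallied to check that they add up to the reported total and to express each gene's contribution as a share. The following is a minimal sketch using only the numbers listed in the text; the variable names are illustrative and are not taken from the authors' pipeline.

```python
# Per-gene NT mutation counts as listed in the text (Fig. 3A).
counts = {
    "S": 11402, "ORF1a": 7964, "N": 6008, "ORF1b": 5064,
    "ORF8": 2431, "ORF3a": 1735, "ORF9b": 410, "ORF7a": 325,
    "M": 177, "ORF6": 138, "ORF7b": 75, "E": 70,
}

# Total across all genes, to compare with the reported dataset-wide total.
total = sum(counts.values())

# Percentage share of each gene in the total.
shares = {gene: 100 * n / total for gene, n in counts.items()}

# Genes with the most and the fewest NT mutations.
most = max(counts, key=counts.get)
least = min(counts, key=counts.get)

for gene, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gene:6s} {n:6d} {shares[gene]:6.2f}%")
print("total", total)
```

With these inputs the per-gene counts sum to 35,799, and the S gene accounts for about 31.85% of all NT mutations.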

Figure 3

The frequency of nucleotide (NT) mutations found throughout the SARS-CoV-2 genomes of the deceased COVID-19 patients. (A) The number of NT conversions in specific genes or segments of the SARS-CoV-2 genome. (B) The most frequent NT mutations in particular regions of the SARS-CoV-2 genome. In both panels, gene regions are colored by frequency.

Another notable finding of this study is the prediction of NT alterations in the SARS-CoV-2 genomes of various nations. We compared the NT mutational spectra in the top 12 countries (Table 1), where the highest numbers of COVID-19 associated deaths were reported. The NT mutation at the D614G position in the S gene was prominent in all 12 nations with the highest incidence of COVID-19 deaths (Table 1). Similarly, the maximum numbers of NT mutations were identified at the T1001I, G204R, E1264D, P314L, E92K, Q57H, Q77E, S5L, L29F, S41P, and T40I positions of the SARS-CoV-2 genomes. The SARS-CoV-2 genomes sequenced from deceased COVID-19 patients in the USA showed maximum NT mutations at I2230T in ORF1a, D3L in N, P218L in ORF1b, Y73C in ORF8, Q57H in ORF3a, P10S in ORF9b, A43S in ORF7a, T175M in M, E13D in ORF6, T40I in ORF7b, and P71L in the E gene. In contrast, genomes sequenced from India showed the highest NT mutation frequency at D614G in the S gene, followed by T1001I in ORF1a, G204R in N, K1383R in ORF1b, R52I in ORF8, P42L in ORF3a, Q77E in ORF9b, N38T in ORF7a, L87F in M, I14T in ORF6, S5L in ORF7b and T9I in the E gene (Table 1). These findings imply that while some SARS-CoV-2 NT mutations were responsible for its evolution, a few may benefit viral adaptation in a specific demographic distribution. Variations in NT mutation patterns in SARS-CoV-2 genomes may be attributable to population age distribution, gender, host immunity, and socioeconomic level.

Table 1 The nucleotide (NT) mutations with the highest frequency predicted at various loci of SARS-CoV-2 genomes extracted from deceased COVID-19 patients of different countries.

Point-specific amino-acid mutations in SARS-CoV-2 genomes of the deceased COVID-19 patients

To identify deleterious mutations in the SARS-CoV-2 genomes, we analyzed point-specific amino acid (AA) mutations in the genomes of this virus obtained from deceased COVID-19 patients using the SIFT, PolyPhen-2, SNAP2, PROVEAN, PredictSNP, and MAPP web-based tools. Deleterious mutations were critically analyzed and cross-checked using these tools. A threshold value of −2.5 was determined to ensure highly balanced accuracy in defining a deleterious mutation. Therefore, mutations having a value smaller than −2.5 were identified as deleterious [33]. Among the AA mutations identified, the numbers of deleterious and non-deleterious mutations were 951 and 3199, respectively. The highest number of deleterious AA mutations was found in ORF1b (n = 338), followed by ORF1a (n = 236) and ORF3a (n = 122). In addition, 49, 45, 42, 40 and 30 deleterious AA mutations were predicted in the N, ORF8, ORF7a, S and ORF9b segments, respectively. In this study, the open reading frames (ORFs) of the SARS-CoV-2 genome possessed a higher percentage of deleterious AA mutations than other segments. For example, ORF3a, ORF6, ORF7a, ORF8 and ORF9b harbored > 50.0% deleterious AA mutations. However, the rest of the segments of the SARS-CoV-2 genomes harbored fewer mutations (< 30) (Table S1). The overall AA mutations detected in the spike protein of the SARS-CoV-2 genomes of the deceased patients are shown in Fig. 4A. In this study, the highest frequency of AA mutations (31.85%) was recorded in the S gene, which is responsible for viral pathogenicity. The S gene of the study genomes underwent AA mutations at 32 sites (Fig. 4A). Fourteen of these AA mutation sites (V3L, L5Y, L10S, S13L, T19R, P26L, D40I, S60A, P82AT, V120I, Y204R, S205I, L223I, and Y265I) were predicted in the NTD (N-terminal domain) fragment, while ten of them (Q314, G339D, S371F, S373P, F377Y, D405N, K417N, L452R, T478K, and E484Q) were found in the RBD region. The remaining eight sites (A570D, D614G, P681R, N764K, D796Y, N856K, R1000L, and E1188L) were positioned in diverse areas of the S protein. The fusion peptide area was well conserved, as no AA mutation hotspots were discovered there (Fig. 4A).
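The cutoff rule described above reduces to a single comparison per mutation. The sketch below assumes the −2.5 threshold applies to the PROVEAN score (−2.5 is PROVEAN's default cutoff); the function name and the neutral example entry are hypothetical, while the two ORF8 scores are those reported in this paper. It also derives the deleterious fraction from the two counts given in the text.

```python
PROVEAN_CUTOFF = -2.5  # threshold stated in the text


def is_deleterious(score: float, cutoff: float = PROVEAN_CUTOFF) -> bool:
    """Return True when the score is smaller than the cutoff."""
    return score < cutoff


# ORF8 scores as reported in this paper; the last entry is a
# hypothetical neutral mutation added only for contrast.
scores = {
    "ORF8:W45S": -13.22,
    "ORF8:W45L": -12.278,
    "hypothetical:neutral": -0.8,
}
calls = {name: is_deleterious(s) for name, s in scores.items()}

# Fraction of deleterious AA mutations from the counts given in the text.
deleterious, non_deleterious = 951, 3199
fraction = deleterious / (deleterious + non_deleterious)

print(calls)
print(f"deleterious fraction: {100 * fraction:.1f}%")
```

Under these counts, roughly 22.9% of the identified AA mutations fall below the cutoff.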

Figure 4

Genomic deletion analysis in SARS-CoV-2 whole genome sequences of the deceased COVID-19 patients. (A) Mapping of amino acid (AA) mutations in the spike (S) glycoprotein of SARS-CoV-2 genome. (B) The AA mutations in the subdomains S1 and S2 (SD1, SD2), N-terminal domain (NTD), and receptor binding domain (RBD) are highlighted.

Apart from the significant AA mutation changes in the spike protein, there were notable changes in the mutational spectra of other proteins as well. Across the different AA mutational spectra, a large number of repeats was observed in the ORF1b (T67I, 2571 times) and N (G204R, 1484 times; R203K, 1559 times) segments. Seven more AA mutations occurred in > 500 sequences: I2230T (543 times), A1708D (547 times) and T1001I (551 times) in ORF1a; L83F (631 times) in ORF3a; and F3L (530 times), D34G (532 times) and F120L (532 times) in the ORF8 fragment of the SARS-CoV-2 genome (Fig. 4B).

Over time, more and more deleterious AA mutations are being detected among the genome sequences of SARS-CoV-2, especially those sequenced from deceased COVID-19 patients. The most frequent AA mutations of different proteins in each continent are listed to better understand how the mutational tendency of SARS-CoV-2 depends on region (Table 2). The AA mutations occurring in more than two continents are highlighted. Interestingly, the D614G mutation in the spike protein and the S26L mutation in the ORF3a protein were found in the SARS-CoV-2 genomes of all continents. Another noteworthy AA mutation, A1918V, occurred in both the Asian and North American regions, whereas a slightly different mutation, A1818L, was found in the African region. However, the ORF8 and ORF9b fragments showed no shared mutations across regions (Table 2).

Table 2 Most frequent amino acid (AA) mutations predicted at various loci of SARS-CoV-2 genomes obtained from deceased COVID-19 patients of the five continents.

Effects of mutations on protein functions

Finally, we considered the deleterious signature mutations to evaluate changes in the biological functions of proteins using the PROVEAN, PolyPhen-2, and PredictSNP tools (Fig. S1). We found the most negative PROVEAN score, −13.22, for the W45S and W45R deleterious mutations, and a smaller magnitude of −12.278 for the W45L mutation of the ORF8 gene (Fig. 5A). Interestingly, these three detrimental mutations were identified in the same ORF8 region of SARS-CoV-2 genomes using the other tools. Using these three tools, G18V, W45S, I33T, P30L, and Q418H were identified as the frequent mutations that define each clade, as they are all deleterious and unstable. Using PredictSNP, we predicted the highest number of detrimental mutations (n = 1875) at Q57H in the ORF3a gene (Fig. 5B). Using PolyPhen-2, we detected the highest numbers of deleterious mutations at D160Y (n = 1559) in the M gene and at G204R (n = 1448) and D3L (n = 540) in the N gene. All these deleterious mutations had a PolyPhen-2 score of 1, whereas the sensitivity and specificity were 0 and 1, respectively. These findings indicate that differences in mutations in distinct regions will likely impact protein function. Top mutations ranked by PredictSNP score are visualized in Fig. 5C. The ORF8 mutations W45L and W45S received the most negative values according to the PredictSNP prediction model, with both scoring less than −12. No other mutation of this segment or of other proteins scored as negatively across the mutational spectra (Fig. 5C).

Figure 5

Scores of different mutations throughout the SARS-CoV-2 genomes sequenced from deceased COVID-19 patients. (A) Top hundred mutations predicted by the PROVEAN tool. (B) Total frequency of the top mutations predicted by the PredictSNP tool. (C) Prediction of deleterious mutations by PredictSNP.
