A Bioinformatics Tool for Predicting Future COVID-19 Waves Based on a Retrospective Analysis of the Second Wave in India: Model Development Study

Background: Since the start of the COVID-19 pandemic, health policymakers globally have been attempting to predict an impending wave of COVID-19. India experienced a devastating second wave of COVID-19 that peaked by the end of the first week of May 2021. We retrospectively analyzed the viral genomic sequences and epidemiological data reflecting the emergence and spread of the second wave of COVID-19 in India to construct a prediction model.
Objective: We aimed to develop a bioinformatics tool that can predict an impending COVID-19 wave.
Methods: We analyzed the time series distribution of genomic sequence data for SARS-CoV-2 and correlated it with epidemiological data for new cases and deaths for the corresponding period of the second wave. In addition, we analyzed the phylodynamics of circulating SARS-CoV-2 variants in the Indian population during the study period.
Results: Our prediction analysis showed that the first signs of the arrival of the second wave could be seen by the end of January 2021, about 2 months before its peak in May 2021. By the end of March 2021, it was distinct. B.1.617 lineage variants powered the wave, most notably B.1.617.2 (Delta variant).
Conclusions: Based on the observations of this study, we propose that genomic surveillance of SARS-CoV-2 variants, complemented with epidemiological data, can be a promising tool to predict impending COVID-19 waves.


Introduction
A SARS-CoV-2-driven outbreak of COVID-19 that began in 2019 soon turned into a pandemic, and to date, the disease has killed about 6.5 million people [1]. Since the pandemic's start, much policy discussion has centered on whether an impending COVID-19 wave can be predicted [2]. Unfortunately, successful prediction of COVID-19 waves has not yet been achieved. A prediction tool that can warn of an upcoming COVID-19 wave well in advance and with reasonable accuracy could minimize the enormous loss of life and other collateral damage.
Multiple waves at a global scale driven by SARS-CoV-2 variants, primarily Alpha, Delta [3], and, most recently, Omicron [4], have followed since the first wave. The successive SARS-CoV-2 variants showed increased transmissibility and virulence compared with the wild-type strain [3]; however, the latest Omicron variant has shown higher transmissibility and immune escape but lesser lethality compared with the Delta variant [4]. The Delta variant-driven wave was characterized by high speed of rising cases, increased oxygen demand, vaccine breakthrough [5], a highly increased proportion of severe cases, and high mortality [6].
More comprehensive coverage of COVID vaccines in the global population is helping to create an immunity barrier against the rise of a new wave. However, an increase in the immune escape potential of emerging variants causes a grave concern for vaccine breakthroughs and reinfections [3,4,7]. With the waning of immunity derived from vaccines and previous infections [8], the risk of the emergence of a more lethal variant capable of creating a global wave remains high and therefore demands continued surveillance [9].
The Delta variant-driven wave showed a rapid peak and fall to the baseline, making it ideal for prediction studies. The Delta strain was first reported from India [10]. Of note, India witnessed a devastating second COVID wave that began toward the end of February 2021 [11]. The unexpected arrival of the second COVID wave, accompanied by an exponential increase in infections, brought the country's epidemic response system and health infrastructure to a standstill [11], and resulted in massive suffering and loss of life [12].
Unfortunately, none of the prediction models proposed so far could accurately anticipate a COVID-19 wave. The ability to predict an established wave from epidemiological data alone seems severely limited [12,29].
The analysis of SARS-CoV-2 genomic sequences has emerged as an efficient surveillance tool for understanding the emergence of new variants and their spread. Fortunately, millions of SARS-CoV-2 genomic sequences from regions worldwide are being made publicly available as a collaborative effort to contain the pandemic [30]. The easy availability of high-quality viral sequences with patient metadata has opened a new avenue for potential predictions of the COVID-19 pandemic [31]. However, viral genomic sequences alone may not be sufficient for efficient predictions, and their current uses for this purpose are constrained.
In this study, we propose an integrated approach using viral genome surveillance and epidemiological data for the prediction of an impending COVID-19 wave. We retrospectively analyzed viral genomic sequences and epidemiological data reflecting the emergence and spread of the second wave of COVID-19 in India to construct such a model.

Study Design, Participants, and Data Sources
We analyzed the time series (weekly and monthly) distributions of SARS-CoV-2 variants coupled with epidemiological data from December 1, 2020, to July 26, 2021 (34 weeks) for new cases and deaths from COVID-19 in India. Further, a phylodynamic analysis for individual variants was performed.
We downloaded SARS-CoV-2 genomic sequence data and epidemiological data from the EpiCoV database of the Global Initiative on Sharing All Influenza Data (GISAID) [32] and the Worldometer database [33], respectively. A total of 40,359 genomic sequences of SARS-CoV-2 were analyzed. The sequence for each SARS-CoV-2 variant was retrieved using an automated search function that entered lineage and sublineage information into the EpiCoV database. The total numbers of sequences per week and month for the variants and their relative proportions were calculated (in percentage). The data were tabulated, and each variant's weekly and monthly distributions were compared to COVID-19 epidemiological data (new cases and deaths) and statistically analyzed. The genomic sequences of SARS-CoV-2 variants in each state and union territory were also examined to check deviations from overall patterns in data.
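The weekly tabulation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `week_of` and `weekly_proportions`, the input layout (pairs of collection date and lineage), and the toy records are assumptions made for illustration, and weeks are assumed to be consecutive 7-day blocks counted from December 1, 2020.

```python
from collections import Counter, defaultdict
from datetime import date

STUDY_START = date(2020, 12, 1)  # first day of the study period


def week_of(collected, start=STUDY_START):
    """Return the 1-based study week that a collection date falls in.

    Assumption: weeks are consecutive 7-day blocks starting on `start`.
    """
    return (collected - start).days // 7 + 1


def weekly_proportions(records):
    """Map week -> {lineage: percentage of that week's sequences}.

    `records` is an iterable of (collection_date, lineage) pairs.
    """
    counts = defaultdict(Counter)
    for collected, lineage in records:
        counts[week_of(collected)][lineage] += 1

    result = {}
    for week, tally in sorted(counts.items()):
        total = sum(tally.values())
        result[week] = {lin: 100.0 * n / total for lin, n in tally.items()}
    return result


# Toy records, invented for illustration only (not study data).
records = [
    (date(2020, 12, 1), "B.1"),
    (date(2020, 12, 3), "B.1"),
    (date(2020, 12, 5), "B.1.617.2"),
    (date(2020, 12, 7), "B.1.1.7"),
    (date(2020, 12, 8), "B.1.617.2"),
]
props = weekly_proportions(records)
```

Monthly proportions would follow the same pattern with the grouping key changed from the study week to the calendar month.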

Phylodynamics of SARS-CoV-2 Variants
A phylodynamic analysis of the variants circulating in the Indian population during the study period was performed on GISAID sequences using the bioinformatics tool available at EpiCoV.

Statistical Analysis
XLSTAT (Addinsoft) was used to perform all statistical analyses. Descriptive statistics were calculated for each variable. Levene and Anderson tests were used to determine the homogeneity or normality of the data. In addition, a correlation matrix was constructed, and a linear regression analysis was performed between contrasting variables (R values = −1 to +1). Finally, the statistical significance level for each comparison was set at P<.05.
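For readers unfamiliar with the quantities involved, the following sketch computes a Pearson correlation coefficient and a least-squares regression line in plain Python. It does not reproduce the XLSTAT workflow used in the study; the function names and the toy series are assumptions for illustration, and no P value is computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy weekly series, invented for illustration only (not study data).
weekly_cases = [1.0, 2.0, 3.0, 4.0, 5.0]
weekly_deaths = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson_r(weekly_cases, weekly_deaths)
slope, intercept = linear_fit(weekly_cases, weekly_deaths)
```

A value of R close to +1, as in this perfectly linear toy series, indicates that the two series rise and fall together; a value close to −1 indicates that one falls as the other rises.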

Ethical Considerations
Approval from the institutional ethics committee was not required as the data used in this study were retrieved from publicly available databases.

Results
Our retrospective analysis of the epidemiological data reflected that the second COVID-19 wave started rising by the end of February 2021 and peaked by the end of the first week of May 2021. Based on the distinct epidemiological trends observed (Multimedia Appendix 1), we divided the study period (December 1, 2020, to July 26, 2021; 34 weeks) into prepeak (weeks 1-23) and postpeak (weeks 24-34) periods. The weekly average of new cases and deaths showed a strong correlation in the study period (R=0.98, P<.001), signifying the high statistical validity of the data for further comparisons. Further, we analyzed the distribution of SARS-CoV-2 variants circulating in the Indian population in correlation with new cases and deaths before and after the peak. For description, based on epidemiological trends, the prepeak period was further divided into the following 3 time series intervals: "very early" (weeks 1-8), "early" (weeks 9-16), and "near peak" (weeks 17-23). New cases and deaths showed a downward trend in the "very early" period and maintained a plateau in the "early" period (except toward the end when cases and deaths started increasing, indicating the start of the second wave). In the "near peak" period, a steep rise in new cases and deaths was observed (Figure 1).
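The interval scheme above can be expressed as a small lookup, sketched here under the assumption that study weeks are consecutive 7-day blocks counted from December 1, 2020 (the text does not state how week boundaries were drawn). The names `study_week` and `period_label` are hypothetical.

```python
from datetime import date

STUDY_START = date(2020, 12, 1)

# (label, first_week, last_week), taken from the intervals given in the text.
PERIODS = [
    ("very early", 1, 8),
    ("early", 9, 16),
    ("near peak", 17, 23),
    ("postpeak", 24, 34),
]


def study_week(day, start=STUDY_START):
    """Return the 1-based study week containing `day` (assumed 7-day blocks)."""
    return (day - start).days // 7 + 1


def period_label(week):
    """Return the interval label for a study week, or None if out of range."""
    for label, first, last in PERIODS:
        if first <= week <= last:
            return label
    return None


peak_week = study_week(date(2021, 5, 7))  # end of the first week of May 2021
peak_period = period_label(peak_week)
```

Under this week-counting assumption, the end of the first week of May 2021 falls in week 23, which agrees with the prepeak boundary used in the text.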
The rise and fall of circulating SARS-CoV-2 variants were studied against the observed epidemiological data trends in the respective time series intervals. Observing the composite data trends of epidemiological and SARS-CoV-2 genomic data provides a glimpse of the formation of the second COVID-19 wave, with clear indications of which SARS-CoV-2 strains may have driven it (Figures 1 and 2). The phylodynamic analysis of the circulating variants in the study period strongly corroborated the trends present in the graph data, showing an exclusive increase in the cluster density of B.1.617.2 compared with other variants in the "near peak" period (Figure 3).
To determine whether the rise in the B.1.617.2 variant was localized to specific geographical regions, which may have influenced the collective data trends, we compared the monthly distribution of genomic sequences of SARS-CoV-2 variants for the states and union territories of India individually. A similar increase in the detection of the B.1.617.2 variant was observed in most states and union territories (Multimedia Appendix 2), except Kerala, where different patterns were visible (Figure S15 in Multimedia Appendix 2). In Kerala, the rise of the B.1.617.2 variant was slower in comparison with the rest of the country (55.5% vs 72% of total cases by the end of April 2021), which was further confirmed in the state-wise serosurvey data from the period of the second wave (44.4% vs 67.7% of the national average) [34]. Notably, a sharp rise in B.1.617.2 cases was observed in Kerala in a later period.

Comparison With Prior Work
Current prediction models in the COVID-19 pandemic are dominated by purely epidemiological analyses, from which hardly anyone could accurately predict an impending COVID-19 wave [23-27]. The importance of studying viral genomic sequences for the epidemiological surveillance of new SARS-CoV-2 variants is well recognized [31,35-40]. However, its application in developing a predictive model to forecast upcoming virus waves has received little appreciation in the existing literature [41]. Interestingly, strong conceptual validation for the applicability of an integrated approach to predict an impending COVID-19 wave using viral genomic surveillance and epidemiological data came from a recent study by de Hoffer et al [42]. These authors studied the temporal dynamics of emerging SARS-CoV-2 variants using a machine learning algorithm-based analysis of the spike protein sequences of viral samples from England, Scotland, and Wales reported in the GISAID database. Further, they correlated the relative percentage of each variant with the weekly and monthly epidemiological data of active cases from the studied geographical regions. They showed a strong relationship between the genesis of a new emerging variant and the onset of a new wave, with an exponential increase in the number of infections [42].
Moreover, our findings regarding the second wave of COVID-19 in India are corroborated by a previous study by Dhar et al [10]. The authors analyzed viral genomic sequences retrospectively and observed a similar pattern in the rise of the B.1.617 lineage, mainly the B.1.617.2 variant, in Delhi before the second wave [10]. A B.1.617.2-driven second wave was also reflected in the analysis of viral genomic sequences performed by Adiga and Nayak in 2021 [43]. We recently used our prediction model prospectively during the initial rise of cases caused by the Omicron strain in South Africa, which indicated an upcoming wave with very high transmissibility but limited lethality [4]. These predictions were later accurately reflected in the studies reporting the Omicron-mediated fourth wave of COVID-19 in South Africa [44,45].
The potential predictability of the second wave of COVID-19 in India in the retrospective data analysis suggests that genomic surveillance of SARS-CoV-2 variants, enriched with epidemiological data, could be a potential tool to predict upcoming COVID-19 waves. Still, the prediction accuracy is largely dependent on population-based viral genomic sequencing and consistency in data upload from all geographic regions, as well as accurate reporting of epidemiological data. The sole increase in the proportion of an emerging SARS-CoV-2 variant, coupled with an associated rise in new cases, might inform the arrival of a new wave of COVID-19. However, consideration of other epidemiological factors, such as previous exposure to related virus strains and the immunization status of the population, will be necessary to determine the magnitude of an impending wave [46]. Notably, the first wave of COVID-19 in India was limited in scope, as evidenced by the serosurvey data [47,48], and only a small part of the population was vaccinated as of early 2021 [49]. With the emergence of a new variant, both these factors may have created an ideal environment for a massive second wave to emerge. In addition, preventive measures, such as blocking or limiting gatherings and using face masks, can also influence the prospects and magnitude of a new wave [29].
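The combined signal discussed above (a rising share of an emerging variant together with rising new cases) could be turned into a simple flagging rule, sketched below. This is not the tool developed in the study: the rule itself, the function names, and the `window` and `min_share` thresholds are hypothetical choices made only to illustrate the idea, and the series are invented.

```python
def rising(values, window):
    """True if the last `window` steps of `values` are strictly increasing."""
    tail = values[-(window + 1):]
    return len(tail) == window + 1 and all(a < b for a, b in zip(tail, tail[1:]))


def first_warning_week(variant_share, new_cases, window=3, min_share=5.0):
    """Return the first 1-based week at which both signals are rising.

    variant_share: weekly percentage of sequences assigned to one variant.
    new_cases: weekly new-case counts for the same weeks.
    window, min_share: hypothetical thresholds, not values from the study.
    Returns None if no week satisfies the rule.
    """
    for week in range(1, len(variant_share) + 1):
        share = variant_share[:week]
        cases = new_cases[:week]
        if share[-1] >= min_share and rising(share, window) and rising(cases, window):
            return week
    return None


# Toy series, invented for illustration only (not study data).
share = [0.0, 0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
cases = [100, 90, 80, 80, 85, 95, 120, 200]

warning = first_warning_week(share, cases)
```

A rule of this kind only indicates that a wave may be arriving; as noted above, estimating its magnitude would also require information on prior exposure, vaccination coverage, and preventive measures.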

Limitations
There were some limitations in our study that may have influenced the interpretation of the results. First, the samples used in our analyses might not be representative of the population. In many geographical regions, the sample size was grossly disproportionate. Therefore, the genomic sequence data presented in this study might not reflect the exact epidemiological extent of the distribution of the variants in the reported geographical regions but only show their relative proportions in the samples for which genomic sequences were uploaded to the GISAID database. We have assumed that similar proportions exist between variants in the actual population. Second, inconsistent reports and uploads of genomic sequences made it challenging to study a daily trend in the spread of variants. Finally, the scarcity of genomic sequences and inconsistency in uploading to the databases used for some states/union territories made determining variant dominance difficult.

Conclusions
Based on the observations of this study, we propose that genomic surveillance of SARS-CoV-2 variants, complemented with epidemiological data, can be a promising tool to predict upcoming COVID-19 waves.

Figure 1. Weekly distribution of SARS-CoV-2 variants in genomic sequence data from India and the correlation with daily new COVID-19 cases and deaths from December 1, 2020, to July 26, 2021. The data were analyzed for the period before the peak of the second wave (23rd week) and after that. SARS-CoV-2 genomic sequence data were obtained from the EpiCoV database of the Global Initiative on Sharing All Influenza Data, and epidemiological data were obtained from the Worldometer database.

Figure 2. Origin and spread of B.1.617 lineage SARS-CoV-2 variants in the Indian population. Data were analyzed from December 1, 2020, to July 26, 2021. SARS-CoV-2 genomic sequence data were obtained from the EpiCoV database of the Global Initiative on Sharing All Influenza Data, and epidemiological data were obtained from the Worldometer database.

Figure 3. Phylodynamics of SARS-CoV-2 variants in the Indian population from December 1, 2020, to July 26, 2021. SARS-CoV-2 genomic sequence data were obtained from the EpiCoV database of the Global Initiative on Sharing All Influenza Data, and epidemiological data were obtained from the Worldometer database. VOC: variant of concern.