JMIR Bioinformatics and Biotechnology

https://bioinform.jmir.org/issue/feed JMIR Bioinformatics and Biotechnology 2023-01-10T09:30:04-05:00 JMIR Publications editor@jmir.org Open Journal Systems

Unless stated otherwise, all articles are open-access distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work ("first published in the Journal of Medical Internet Research...") is properly cited with original URL and bibliographic citation information. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.

Methods, web-based platforms, open data and open software tools for big data analytics, machine learning-based predictive models using genomic and imaging data, and information retrieval in biology and medicine. JMIR Bioinformatics and Biotechnology is the official journal of the MidSouth Computational Biology and Bioinformatics Society.

https://bioinform.jmir.org/2026/1/e90572 Readability of AI-Generated Patient Information on Glucagon-Like Peptide-1 Receptor Agonists 2026-05-05T16:00:09-04:00 Tyler Williams Ines Bilic-Curcic Jonathan Hurley Harisankeerth Mummareddy Maja Cigrovski Berkovic Silvija Canecki Varzic Marina Gradiser

Artificial intelligence (AI)–generated content on glucagon-like peptide-1 receptor agonists (GLP-1RAs) was informationally detailed, but its readability remains suboptimal for many patients. Incorporating literacy-sensitive design principles into AI health communication is essential to ensure equitable access to digital medical information.

2026-05-05T16:00:09-04:00 https://bioinform.jmir.org/2026/1/e75678 Random Survival Forest Versus Elastic-Net Regularized Cox Regression for Survival Prediction in Acute Myeloid Leukemia at Distinct Treatment Time Points: Model Performance Comparison Study 2026-04-29T16:00:22-04:00 Oisín Brady Sean Johnson Peter Giles Caroline Alvares Joanna Zabkiewicz Carolina Fuentes

Background: Risk group stratification based on the prediction of survival of patients with acute myeloid leukemia (AML) is complex. Despite common risk group categorization guidelines, the overall prognosis remains poor. Machine learning techniques have been shown to provide more accurate risk group stratification than conventional approaches using trial data. However, many time-to-event (TTE) models do not use training sets constrained to specific time windows, instead using aggregations of trial data. Objective: This study aimed to evaluate the performance of (1) random survival forest (RSF) and (2) Cox proportional hazard regression with elastic net regularization (CoxNet) for survival prediction of patients with AML within a censoring window, with models trained on available data recorded at discrete time points during the United Kingdom National Cancer Research Institute Acute Myeloid Leukaemia 17 randomized controlled trial (AML17). Methods: For each stage in the AML17 trial, separate models were trained for each exhaustive k-choice combination of available AML17 data subsets. Data combinations for each model were further constrained according to the respective trial stage to avoid data leakage. Preliminary Pearson correlation analysis was used to remove features that correlated directly with the TTE prediction target (time-to-death/5-y censoring point). Repeated k-fold stratified cross-validation was used on each dataset ablation to find candidate models. Permutation importance and elastic net regularization were used to monitor stability across validation folds and reduce the feature set of the highest performing stage RSF and Cox proportional hazard regression models, respectively. Finally, selected ablated models were re-evaluated using the nested, k-fold, stratified sampling cross-validation method with bootstrapping.
Results: The concordance indices of the best models were as follows: data constrained up to the end of induction (RSF=0.68, CoxNet=0.67), stage 1 (RSF=0.69, CoxNet=0.68), stage 2 (RSF=0.68, CoxNet=0.66), and stage 3 (RSF=0.69, CoxNet=0.63) of the trial. Conclusions: This study details the high prediction accuracy of time-to-survival-event predictions when the training sets of CoxNet and RSF models are sequentially constrained to data measured up to the end of the respective AML17 trial stages. The performance of these sequential TTE models is intended to justify their use as part of a wider digital twin system simulating multiple TTE outcomes for patients with AML.
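The concordance index reported above can be illustrated with a minimal sketch. It assumes Harrell's pairwise definition (the abstract does not say which variant the authors used), and the function name and toy data are hypothetical, not taken from the study.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risks: Sequence[float],
) -> float:
    """Harrell's C: share of comparable pairs ordered correctly by predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event received the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): follow-up time, event observed?, predicted risk score.
times = [2.0, 4.0, 5.0, 7.0]
events = [True, True, False, True]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
print(f"C-index: {c_index:.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported range of 0.63 to 0.69.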

2026-04-29T16:00:22-04:00 https://bioinform.jmir.org/2026/1/e85659 Temporal Reproducibility of a Genetic Algorithm–Derived Health Risk Score: Standardized Out-of-Fold Validation Framework (2021-2023) 2026-04-21T16:15:11-04:00 Yoichiro Aoki Hiroki Takeda Kinichi Yokota Ryoko Yoshida

Background: Genetic algorithm (GA)–based scoring has been proposed as a data-driven approach for health risk stratification. However, performance estimates may be inflated when preprocessing, optimization, and evaluation are not strictly separated within a prespecified validation framework. Demonstrating temporal reproducibility under a standardized out-of-fold (OOF) evaluation framework with transparent uncertainty quantification is therefore essential for ensuring translational reliability in preventive health screening. Objective: This study aimed to evaluate the temporal reproducibility of a GA-derived composite health risk score across three consecutive annual cohorts (2021‐2023) under a standardized OOF validation pipeline and to assess robustness to policy-driven structural HbA missingness through a prespecified ON/OFF sensitivity analysis. Methods: Annual health examination datasets from 2021 (n=3744), 2022 (n=5153), and 2023 (n=5352) were analyzed using an identical preprocessing and modeling pipeline. Thirteen clinical indicators and eight lifestyle questionnaire variables were included as predictors. The outcome was based on an A–D grading framework and binarized using an OR rule across domains (grade ≥B in any domain). Continuous variables were median-imputed and standardized within each training fold to prevent information leakage. GA optimization was performed using fixed random seeds, and fitness estimation employed stratified K-fold cross-validation. Predicted probabilities were obtained by fitting logistic regression models to GA-derived composite scores within the OOF framework. Discrimination and overall predictive performance were quantified using the area under the receiver operating characteristic curve (AUC) and the Brier score calculated from OOF predicted probabilities. Uncertainty was estimated using 2,000-replicate percentile bootstrap resampling.
A prespecified sensitivity analysis excluded HbA while maintaining an identical evaluation framework. Results: OOF AUC values were stable across cohorts (2021: 0.810; 2022: 0.814; 2023: 0.812), with overlapping 95% percentile bootstrap confidence intervals. Brier scores ranged from 0.172 to 0.176. Exclusion of HbA resulted in small changes in discrimination (median ΔAUC was ≤0.007), consistent with the prespecified ON/OFF sensitivity analysis. Conclusions: Under a harmonized OOF validation framework, the GA-derived composite risk score showed stable temporal discrimination and consistent overall predictive performance across three consecutive annual cohorts. These findings underscore the methodological importance of prespecified, standardized evaluation procedures and transparent uncertainty quantification when assessing reproducibility of risk stratification models in routine health screening data.
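The evaluation metrics named above (AUC, Brier score, and a 2,000-replicate percentile bootstrap) can be sketched in plain Python. This is an illustration only: the out-of-fold probabilities are a hypothetical toy vector, the function names are invented for this sketch, and skipping single-class resamples is an assumption rather than the authors' documented procedure.

```python
import random
from typing import Callable, List, Sequence, Tuple


def auc(y: Sequence[int], p: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def percentile_bootstrap(
    y: Sequence[int],
    p: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval from resampling (outcome, probability) pairs with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption: redraw resamples that lack one class
        stats.append(metric(ys, [p[i] for i in idx]))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    return stats[lo_idx], stats[n_boot - 1 - lo_idx]


# Hypothetical out-of-fold predictions for eight subjects.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
p_oof = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6]

auc_value = auc(y_true, p_oof)
brier_value = brier(y_true, p_oof)
auc_lo, auc_hi = percentile_bootstrap(y_true, p_oof, auc)
print(f"AUC={auc_value:.3f} (95% CI {auc_lo:.3f}-{auc_hi:.3f}), Brier={brier_value:.3f}")
```

Computing both metrics from out-of-fold probabilities only, as the abstract describes, keeps every prediction independent of the fold that produced its model.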

2026-04-21T16:15:11-04:00 https://bioinform.jmir.org/2026/1/e85212 The AudioGene Translational Dashboard for Diagnosing Autosomal Dominant Nonsyndromic Hearing Loss: Phenotypic Data Visualization and Analysis Study 2026-04-14T12:30:10-04:00 Benjamin DeSollar Nathan Schaefer Daniel Walls Amanda M Odell Kevin T A Booth Hela Azaiez Michael Schnieders Richard J H Smith Terry Braun Thomas Casavant

Background: Autosomal dominant nonsyndromic hearing loss (ADNSHL) is highly heterogeneous, with more than 64 genes implicated in its etiology. This complexity limits the diagnostic power of clinical examinations and audiometry alone, while existing computational approaches have achieved only moderate accuracy and often lack interpretability. As precision medicine increasingly emphasizes genotype-phenotype correlations, there is a recognized need for diagnostic tools that provide clinicians with transparent, interpretable outputs. Objective: This study aimed to develop and evaluate the AudioGene Translational Dashboard, an interpretable clinical informatics tool that integrates machine learning models and interactive visualizations to enhance genotype-phenotype correlations and support diagnostic decision-making in ADNSHL. Methods: We developed the AudioGene Translational Dashboard, integrating 2 machine learning models (AudioGene version 4 and AudioGene version 9.1) with 6 interactive visualization tools. AudioGene version 4 uses a multi-instance support vector machine classifier for patients with multiple audiograms, while AudioGene version 9.1 combines adaptive boosting, k-nearest neighbors, random forest models, and logistic regression for patients with a single audiogram. Visualizations include audiometric profile plots, audioprofile surfaces, clustering analyses, and data distribution charts designed to facilitate clinical interpretation. Results: The AudioGene Translational Dashboard was developed to address the “70/30” phenomenon, indicating a 74% likelihood that the causative gene is among the top 3 predicted genes, thereby providing clinicians with a clear confidence indicator (“green flag”) or a caution alert (“red flag”) during diagnosis. While this level of performance is well suited for hypothesis generation, the remaining uncertainty underscores the need for interpretive context in clinical decision-making. 
Visualization tools enhanced clinicians’ ability to interpret and correlate phenotypic data with predicted genetic outcomes, improving diagnostic confidence and interpretability. Conclusions: The AudioGene Translational Dashboard advances clinical informatics in genetic diagnosis of ADNSHL by integrating explainable artificial intelligence with interactive visualizations, enhancing clinical interpretability and diagnostic accuracy. This approach facilitates informed clinical decision-making, highlights the translational potential of genotype-phenotype computational models, and supports precision medicine in hearing loss diagnostics. Future enhancements will target improving class balance and incorporating additional user-customizable features to further optimize clinical applicability.

2026-04-14T12:30:10-04:00 https://bioinform.jmir.org/2026/1/e93272 A Strategic Partnership to Advance AI Applications in Genomics and Bioinformatics for Health Innovation 2026-03-27T18:15:10-04:00 Aik Choon Tan Ece Dilber Gamsiz Uzun

Background: In late 2025, JMIR Bioinformatics and Biotechnology (JBB) announced a new strategic partnership with the MidSouth Computational Biology and Bioinformatics Society (MCBIOS), under which JBB will serve as the official journal of MCBIOS. This collaboration represents a shared commitment to advancing computational biology, bioinformatics, and biotechnology through open science, interdisciplinary collaboration, and real-world impact. At a time when biological discovery and health innovation are increasingly driven by large-scale data, artificial intelligence (AI), and computational methodologies, this partnership reflects a shared vision: to bridge bioinformatics, data science, AI, and innovation to transform health. Together, JBB and MCBIOS aim to provide a scholarly home for research that not only advances methods and technologies but also demonstrates meaningful applications across biomedical, clinical, and population health contexts.

2026-03-27T18:15:10-04:00 https://bioinform.jmir.org/2026/1/e81219 Prevalence and Associated Risk Factors of Bovine Fasciolosis in Bahir Dar, Ethiopia: Cross-Sectional Study 2026-03-17T16:15:09-04:00 Tesfaye Mesfin Theobesta Solomon Abraham Belete Temesgen

Background: Cattle are among the most important livestock resources in Ethiopia, contributing significantly to the agricultural economy and rural livelihoods. They provide meat, milk, hides, and draft power for crop production, and they serve as a major source of income for farmers. Despite their vital role, cattle productivity is often constrained by various diseases, particularly parasitic diseases. One of the most significant of these is bovine fasciolosis, a condition caused by ingestion of metacercariae of liver flukes belonging to the genus Fasciola. Objective: This study aimed to assess the prevalence and associated risk factors of bovine fasciolosis in Bahir Dar, Ethiopia. Methods: A cross-sectional study was conducted from November 2021 to April 2022. A total of 384 cattle were randomly selected from different locations within the study area. Animals of all age groups and both sexes were included. Fecal samples were collected directly from the rectum of each animal using clean, labeled containers. The samples were examined using standard coprological techniques, specifically the sedimentation method, to detect liver fluke eggs. All findings were recorded, and the data were analyzed using descriptive statistical methods. Results: The overall prevalence of fasciolosis was 49.21% (n=189). By origin, Sebatamit had the highest prevalence at 61.84% (n=47), followed by Kebele 11 at 59.37% (n=57), Tikurit at 50% (n=59), and Latammba at 27.65% (n=26). Statistical analysis revealed significant differences in prevalence among areas. Cattle in poor body condition had the highest prevalence (n=80, 64%), followed by cattle in medium condition (n=85, 50%) and fat cattle (n=24, 26.96%). This variation was statistically significant. Age-group analysis revealed comparable prevalence rates, with young cattle at 50.38% (n=65), adults at 47.33% (n=71), and elderly cattle at 50.47% (n=53), with no significant differences found.
There were no significant sex-related variations in prevalence, with males exhibiting a prevalence of 49.73% (n=93) and females 48.73% (n=96). Local cattle had a slightly higher prevalence (n=111, 51.62%) than crossbreeds (n=78, 46.15%), although the difference was not statistically significant (P=.29). Conclusions: These findings underscore the need for targeted, location-specific control strategies and highlight the importance of improved nutritional and health management practices to reduce the burden of fasciolosis in cattle populations.
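The breed comparison can be worked through as a 2x2 test. The sketch below rests on two assumptions that are not stated in the abstract: the group sizes (215 local, 169 crossbred) are back-calculated from the reported counts and percentages, and the test is taken to be a Pearson chi-square without continuity correction.

```python
import math

# Reported positives: local 111 (51.62%), crossbred 78 (46.15%).
# ASSUMPTION: denominators back-calculated from those figures; 215 + 169 = 384.
local_pos, local_n = 111, 215
cross_pos, cross_n = 78, 169

# Observed 2x2 table: rows are breeds, columns are positive / negative.
observed = [
    [local_pos, local_n - local_pos],
    [cross_pos, cross_n - cross_pos],
]
total = local_n + cross_n
row_totals = [sum(r) for r in observed]
col_totals = [observed[0][k] + observed[1][k] for k in range(2)]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i in range(2):
    for k in range(2):
        expected = row_totals[i] * col_totals[k] / total
        chi2 += (observed[i][k] - expected) ** 2 / expected

# For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2={chi2:.3f}, P={p_value:.2f}")
```

Under these assumptions the P value comes out close to the reported .29, which suggests, but does not confirm, that a test of this kind was used.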

2026-03-17T16:15:09-04:00 https://bioinform.jmir.org/2026/1/e70553 Unpacking Genomic Biomarkers for Programmed Cell Death Receptor-1 Immunotherapy Success in Non–Small Cell Lung Cancer Using Deep Neural Networks: Quantitative Study 2026-01-13T16:00:11-05:00 Rayan Mubarak Fahim Islam Anik Jean T Rodriguez Nazmus Sakib Mohammad A Rahman

Background: Non-small cell lung cancer (NSCLC) is one of the leading causes of cancer-related mortality worldwide. PD-1 immunotherapy has shown promising results in the treatment of NSCLC; however, not all patients respond effectively to this treatment. Identifying predictive biomarkers for PD-1 therapy response is critical to improving patient outcomes and optimizing treatment strategies. Traditional methods of biomarker discovery often fall short in terms of accuracy and comprehensiveness. Recent advancements in deep learning provide a powerful approach to analyze complex genomic data and identify novel biomarkers that may predict therapeutic responses. Objective: This study aims to leverage machine learning techniques, particularly deep neural networks (DNN), to identify genomic biomarkers for predicting responses to PD-1 immunotherapy in NSCLC patients. By applying the DeepImmunoGene model to RNA-seq data, the study compares the performance of DNN, SVM, and XGBoost in predicting patient responses. It focuses on identifying key biomarkers through feature selection and deep learning that can enhance patient stratification and improve the accuracy of PD-1 immunotherapy predictions, contributing to more personalized treatment strategies. Methods: Differentially expressed genes (DEGs) were identified in RNA-seq data from 355 NSCLC patients using the LIMMA package in R, followed by preprocessing with log2 transformation. Machine learning models, including Support Vector Machines (SVM), XGBoost, and Deep Neural Networks (DNN), were employed to analyze gene expression data, with hyperparameters optimized using GridSearchCV. The DNN model's predictive performance was evaluated with permutation importance to identify genes critical for immunotherapy response. The models were trained on 284 patients, with 71 used for testing. Evaluation metrics like accuracy, AUC, precision, recall, specificity, and F1 score were used to assess performance. 
Statistical significance was tested using the Kruskal-Wallis test. Results: Initially, we identified 1,093 differentially expressed genes from RNA-seq data of 355 patients. We then trained models using SVM, XGBoost, and DNN to predict immunotherapy response. The DNN model outperformed both SVM and XGBoost with an accuracy of 82%, AUC of 90%, and recall of 0.85, significantly improving predictive performance by capturing non-linear relationships in gene expression data. To identify key biomarkers, we performed a permutation importance analysis, narrowing down the gene set to 98 genes. DeepImmunoGene, trained on these 98 genes, showed superior results, with an accuracy of 85% and an AUC of 90%. The top 36 upregulated genes in responders and 62 upregulated genes in non-responders were identified, which could serve as potential biomarkers for predicting response to PD-1 inhibitors. These findings suggest that the DeepImmunoGene model, with its ability to capture complex gene interactions, can reliably predict immunotherapy outcomes and provide insights into the molecular mechanisms of response, paving the way for more personalized treatment strategies. Conclusions: The DeepImmunoGene predictive model has successfully identified 36 upregulated genes that may serve as potential genomic biomarkers for predicting NSCLC patient responses to PD-1 immunotherapy. Notably, the ten most significant genes—GSTT2B, HMGA2, AC135050.2, ANKRD33B, MMP13, PLA2G2D, RASGEF1A, BIRC7, DCAF4L2, and CHMP7—offer valuable insights into the underlying mechanisms of treatment responses. These biomarkers not only help predict which patients are most likely to respond to PD-1 immunotherapy but also shed light on the molecular factors that explain non-response.
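The evaluation metrics listed in the Methods all follow from a binary confusion matrix. The sketch below shows the standard definitions; the counts are a hypothetical toy example and are not the study's confusion matrix, which the abstract does not report.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only (responder = positive class).
metrics = classification_metrics(tp=8, fp=3, tn=7, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# The reported split of 284 training and 71 test patients is an 80/20 split of 355.
test_fraction = 71 / (284 + 71)
print(f"test fraction: {test_fraction:.2f}")
```

Reporting specificity next to recall matters here because accuracy alone cannot show whether errors fall mainly on responders or on non-responders.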

2026-01-13T16:00:11-05:00 https://bioinform.jmir.org/2026/1/e80539 Systematic Mining of Bioactive Compounds for Wound Healing From Cayratia Japonica Exosome-Like Nanovesicles: A Workflow Combining LC-MS and DeepSeek Models 2026-01-08T16:00:12-05:00 Qiang Fu Wei Ji Yu-Ping Fan Jian Yao Ming-Xia Song Qiao-Jing Yan

Background: Plant-derived exosome-like nanovesicles (P-ELNs) effectively deliver bioactive compounds due to their high biocompatibility and low immunogenicity. While LC-MS profiles compounds in complex samples, its analysis of large datasets remains limited by traditional methods. Recent advances in large language models (LLMs) and domain-specific systems now enhance Chinese biomedical data processing and cross-modal pharmaceutical research. Objective: To create a multimodal framework of liquid chromatography-mass spectrometry (LC-MS) combined with DeepSeek models for data mining of compounds with wound-healing properties from exosome-like nanovesicles derived from Cayratia japonica (CJ-ELNs). Methods: LC-MS identified compounds enriched in CJ (N=3) and CJ-ELNs (N=3), then compounds specifically enriched in CJ-ELNs were filtered via a four-step filtering workflow. The CJ-ELNs-specific compounds were processed by DeepSeek models for screening naturally active compounds with targeted functions of antioxidation, anti-inflammation, anti-cellular damage, anti-apoptosis, wound healing and tissue regeneration, and cell proliferation. Results: A multimodal framework of LC-MS combined with the DeepSeek-DF model was created. With the assistance of artificial intelligence (AI), a total of 46 naturally active compounds derived from CJ-ELNs with targeted functions were identified. Conclusions: A self-designed multimodal framework of LC-MS combined with DeepSeek models rapidly and accurately identifies naturally active compounds from CJ-ELNs. This AI-powered system innovatively integrates the traditional analytical technique with modern large language models, thus greatly favoring data mining of active ingredients in traditional Chinese medicine (TCM) herbs.

2026-01-08T16:00:12-05:00 https://bioinform.jmir.org/2026/1/e70708 Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction From Electronic Health Records: Technical Implementation Study 2026-01-06T16:30:03-05:00 Marvin N Carlisle William A Pace Andrew W Liu Robert Krumm Janet E Cowan Peter R Carroll Matthew R Cooperberg Anobel Y Odisho

Background: Manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research but is time-consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction, yet integrating them into research workflows remains challenging and poorly described. Objective: To develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database. Methods: We implemented a generative artificial intelligence (genAI) pipeline (UODBLLM) utilizing a flexible language model interface that supports various LLM implementations, including Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud services and local open-source models. We used Extensible Markup Language (XML)-structured prompts and integrated the pipeline through an open database connectivity interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's performance on completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions, using sample MRI reports as test cases. System reliability was tested across multiple batches to assess scalability and consistency. Results: Piloted against MRI reports, UODBLLM processed 1,800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. Token utilization averaged 2,692 tokens per report, with an input-to-output ratio of approximately 13:2, resulting in a processing cost of $0.009 per report. UODBLLM had consistent performance across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, PSA values, PI-RADS scores, clinical staging, and anatomical assessments.
All extracted data was automatically validated against predefined schemas and stored in standardized JSON format. Conclusions: We demonstrated successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining protected health information security. This approach could significantly accelerate research timelines and expand feasible clinical studies, particularly for large-scale database projects.
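The throughput and cost figures above can be cross-checked with simple arithmetic. In the sketch below the inputs are the reported values; the total cost and the per-report token split are derived quantities that the abstract does not state, and the split assumes the approximate 13:2 ratio holds exactly.

```python
# Reported values from the Results.
reports = 1800
seconds_per_report = 8.90
tokens_per_report = 2692
cost_per_report_usd = 0.009
input_parts, output_parts = 13, 2  # approximate input-to-output token ratio

# Total wall-clock time: 1,800 reports at 8.90 s each, converted to hours.
total_hours = reports * seconds_per_report / 3600

# Derived (not reported): total cost and total tokens for the pilot.
total_cost_usd = reports * cost_per_report_usd
total_tokens = reports * tokens_per_report

# Derived (not reported): per-report token split, assuming exactly 13:2.
input_tokens = tokens_per_report * input_parts / (input_parts + output_parts)
output_tokens = tokens_per_report * output_parts / (input_parts + output_parts)

print(f"total time: {total_hours:.2f} h")
print(f"total cost: ${total_cost_usd:.2f}")
print(f"tokens per report: about {input_tokens:.0f} in, {output_tokens:.0f} out")
```

The computed total time matches the reported 4.45 hours, so the per-report average and the overall runtime are mutually consistent.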

2026-01-06T16:30:03-05:00 https://bioinform.jmir.org/2025/1/e89673 Correction: Structural and Functional Impacts of SARS-CoV-2 Spike Protein Mutations: Insights From Predictive Modeling and Analytics 2025-12-29T17:00:11-05:00 Edem K Netsey Samuel M Naandam Joseph Asante Jnr Kuukua E Abraham Aayire C Yadem Gabriel Owusu Jeffrey G Shaffer Sudesh K Srivastav Seydou Doumbia Ellis Owusu-Dabo Chris E Morkle Desmond Yemeh Stephen Manortey Ernest Yankson Mamadou Sangare Samuel Kakraba
