This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Bioinformatics and Biotechnology, is properly cited. The complete bibliographic information, a link to the original publication on https://bioinform.jmir.org/, as well as this copyright and license information must be included.

Current postpartum hemorrhage (PPH) risk stratification is based on traditional statistical models or expert opinion. Machine learning could optimize PPH prediction by allowing for more complex modeling.

We sought to improve PPH prediction and compare machine learning and traditional statistical methods.

We developed models using the Consortium for Safe Labor data set (2002-2008) from 12 US hospitals. The primary outcome was a transfusion of blood products or PPH (estimated blood loss of ≥1000 mL). The secondary outcome was a transfusion of any blood product. Fifty antepartum and intrapartum characteristics and hospital characteristics were included. Logistic regression, support vector machines, multilayer perceptron, random forest, and gradient boosting (GB) were used to generate prediction models. The area under the receiver operating characteristic curve (ROC-AUC) and area under the precision/recall curve (PR-AUC) were used to compare performance.

Among 228,438 births, 5760 (3.1%) women had a postpartum hemorrhage, 5170 (2.8%) had a transfusion, and 10,344 (5.6%) met the criteria for the transfusion-PPH composite. Models predicting the transfusion-PPH composite using antepartum and intrapartum features had the best positive predictive values, with the GB machine learning model performing best overall (ROC-AUC=0.833, 95% CI 0.828-0.838; PR-AUC=0.210, 95% CI 0.201-0.220). The most predictive features in the GB model predicting the transfusion-PPH composite were the mode of delivery, oxytocin incremental dose for labor (mU/minute), intrapartum tocolytic use, presence of anesthesia nurse, and hospital type.

Machine learning offers higher discriminability than logistic regression in predicting PPH. The Consortium for Safe Labor data set may not be optimal for analyzing risk due to strong subgroup effects, which decrease accuracy and limit generalizability.

Maternal morbidity and mortality have been regarded as a reflection of health care quality nationwide. Among lower-income countries, postpartum hemorrhage (PPH) is typically the most common cause of maternal mortality and remains among the top causes in higher-income countries. In the United States, hemorrhage accounted for 11.0% of deaths between 2011 and 2016 [

Machine learning offers an advantage to current risk assessment methods through its ability to create a robust model based on larger numbers of predictors, with nonlinear relationships and interactions between variables included in analyses [

Data for this analysis were extracted from the Consortium for Safe Labor (CSL) data set created by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). It includes antepartum, intrapartum, and postpartum medical histories of 228,438 women from 12 hospitals in the United States (

Flowchart of inclusion of women with transfusion or postpartum hemorrhage (or both).

Machine learning methods are known to generate errors in the presence of missing values [^{2}). Imputing estimated blood loss (EBL) as the median value (350 mL) meant that missing values were assumed to be <1000 mL.
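The consequence of median imputation for the outcome definition can be illustrated with a minimal sketch. The EBL values below are hypothetical and were chosen so that their median is 350 mL; the helper `impute_median` is an illustrative name and not the authors' code.

```python
from statistics import median

PPH_THRESHOLD_ML = 1000  # PPH cutoff used in the study (blood loss of >=1000 mL)


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values], fill


# Hypothetical EBL values in mL; None marks a missing record.
ebl = [300, None, 1200, 350, None, 400, 250]
ebl_imputed, fill_value = impute_median(ebl)

# Because the fill value is below the cutoff, an imputed record is never
# counted as a hemorrhage on the basis of blood loss.
pph_flags = [v >= PPH_THRESHOLD_ML for v in ebl_imputed]
print(fill_value, sum(pph_flags))
```

Under this scheme, a patient with missing EBL can only enter the composite outcome through a recorded transfusion.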

We used the Cramér V index of nominal association for variable selection [
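For readers unfamiliar with the statistic, Cramér V for two categorical variables is the square root of chi-square divided by n(k − 1), where n is the number of observations and k is the smaller of the numbers of row and column categories. The following is a minimal sketch that assumes the plain, uncorrected chi-square statistic; the function name `cramers_v` and the toy data are hypothetical, and the selection threshold the authors applied is not stated here.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér V for two equal-length sequences of category labels.

    Uses the uncorrected chi-square statistic (an assumption of this sketch).
    """
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant variable carries no association
    return sqrt(chi2 / (n * (k - 1)))


# Toy example: a predictor that perfectly separates the outcome, and one
# that is unrelated to it.
outcome = [1, 1, 0, 0]
perfect = cramers_v(["cesarean", "cesarean", "vaginal", "vaginal"], outcome)
unrelated = cramers_v(["a", "b", "a", "b"], outcome)
print(perfect, unrelated)
```

Values near 0 indicate little association with the outcome and values near 1 indicate strong association, so the index can be used to rank candidate variables.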

Separate models were constructed to predict 2 target outcomes. The primary outcome was a composite including all patients who received a transfusion of any blood product or had a PPH defined by documented blood loss of ≥1000 mL during or after delivery. Our secondary outcome was all patients who received transfusion of any blood product. Both blood loss of ≥1000 mL and blood transfusion are clinically significant metrics in obstetric care. Transfusion alone represents patients who are at risk for high maternal morbidity and mortality and is a clinically important metric to evaluate in isolation; hence, it was evaluated independently in a model as a secondary outcome.

For each of the 4 combinations of predictors and outcomes (predictors: antepartum alone vs antepartum plus intrapartum; outcomes: transfusion or blood loss of ≥1000 mL vs transfusion alone), the data were split so that 70% of the observations were used for training and 30% were used for testing, with both sets having the same outcome rate. We applied several methods, including LR, support vector machines (SVMs), multilayer perceptron (MLP), random forest (RF), and gradient boosting (GB), as well as deep learning algorithms including TensorFlow imbalanced (TFIM) and learned embedding (Emb). Hyperparameters were tuned for each algorithm using a customized grid search technique. Model performance for each combination of outcome and algorithm was measured using the Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (ROC-AUC), area under the precision/recall curve (PR-AUC), and modified F-score skewed toward recall (F2). The F2 score was chosen to minimize false negatives and thus maximize the identification of patients at high risk for bleeding and transfusion. Existing LR models and risk classification schemes perform poorly, and the majority of patients with hemorrhage or transfusion are misclassified as low risk. Misclassification of a “high risk” patient as “low risk” may have important clinical implications. Additionally, for patients identified as high risk, interventions can be implemented to minimize risk and enhance patient safety (eg, type and cross, multiple intravenous access sites, provider awareness, and medications). Given these parameters, the models were then evaluated to identify those with the highest positive predictive value (PPV). A model with the highest PPV is clinically useful because it identifies a high-risk patient population without the abovementioned interventions adding to the clinical burden on the hospital system or the patient.
Algorithms were processed and results were analyzed using Python (version 3.6; Python Software Foundation), Pandas (version 1.2; The Pandas Development Team), scikit-learn (version 0.24; scikit-learn Developers), and TensorFlow (version 2.2; Google).
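A 70/30 split that preserves the outcome rate in both sets is usually called a stratified split. The following standard-library sketch shows the idea; the function name `stratified_split`, the seed, and the toy labels are assumptions for illustration and do not reproduce the authors' pipeline.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with near-equal outcome rates in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Shuffle and split each outcome class separately so that both sets
    # keep the same share of positives.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 56 positives per 1000, close to the composite outcome rate.
labels = [1] * 56 + [0] * 944
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 3), round(test_rate, 3))
```

In scikit-learn, `train_test_split` with its `stratify` argument provides comparable behavior; whether the authors used that function is not stated.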

The primary study objective was to identify the strongest set of antepartum and intrapartum predictors of hemorrhage or transfusion and the strongest modeling technique. Secondary objectives included determining the level of agreement between metrics for model evaluation and the extent to which any technique produced results that are clinically useful. Given the heterogeneity of this data set derived from multiple institutions, a site-specific sensitivity analysis was performed.

This analysis was exempt from review by the George Washington University’s institutional review board (NCR202746).

Of 228,438 births included in the CSL cohort, we included 185,413 patients (

After building the models in an iterative process, their performance in predicting both the primary and secondary outcomes was compared using a variety of metrics. The metrics ROC-AUC, PR-AUC, MCC, and F2, as well as sensitivity and specificity at a probability cut point of 50% are shown in

Performance of machine learning and statistical models based on antepartum and intrapartum maternal variables at predicting transfusion or postpartum hemorrhage (or both). Primary outcome: blood transfusion or blood loss of ≥1 L.

| Algorithm | True positives^{a}, n | False negatives^{a}, n | False positives^{a}, n | True negatives^{a}, n | Positive predictive value | Sensitivity | Specificity | ROC-AUC^{b} | PR-AUC^{c} | MCC^{d} | F2^{e} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GB^{f} | 50 | 6 | 318 | 626 | 0.135 | 0.889 | 0.663 | 0.833 | 0.210 | 0.260 | 0.419 |
| RF^{g} | 50 | 6 | 339 | 605 | 0.138 | 0.857 | 0.641 | 0.830 | 0.204 | 0.261 | 0.409 |
| Emb^{h} | 46 | 10 | 296 | 649 | 0.134 | 0.821 | 0.687 | 0.813 | 0.181 | 0.246 | 0.406 |
| MLP^{i} | 49 | 7 | 335 | 609 | 0.127 | 0.875 | 0.645 | 0.808 | 0.149 | 0.245 | 0.402 |
| TFIM^{j} | 48 | 8 | 323 | 619 | 0.129 | 0.861 | 0.655 | 0.822 | 0.194 | 0.245 | 0.403 |
| SVM^{k} | 49 | 6 | 349 | 595 | 0.124 | 0.886 | 0.630 | 0.804 | 0.159 | 0.242 | 0.397 |
| LR^{l} | 46 | 10 | 314 | 631 | 0.129 | 0.830 | 0.668 | 0.813 | 0.177 | 0.238 | 0.393 |

^{a}Values are normalized per 1000, so they are easier to compare across different models; the actual N value is 55,624.

^{b}ROC-AUC: area under the receiver operating characteristic curve.

^{c}PR-AUC: area under the precision-recall curve.

^{d}MCC: Matthews correlation coefficient.

^{e}F2: modified F-score skewed toward recall.

^{f}GB: gradient boosting.

^{g}RF: random forest.

^{h}Emb: learned embedding.

^{i}MLP: multilayer perceptron.

^{j}TFIM: TensorFlow imbalanced.

^{k}SVM: support vector machine.

^{l}LR: logistic regression.
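As a check on how the columns of the table relate to one another, the summary metrics can be recomputed from the four confusion-matrix cells. The sketch below uses the per-1000 counts of the GB row (50, 6, 318, and 626), read as true positives, false negatives, false positives, and true negatives, which is the reading consistent with the reported sensitivity and specificity. The function name `confusion_metrics` is hypothetical.

```python
from math import sqrt


def confusion_metrics(tp, fn, fp, tn, beta=2.0):
    """Summary metrics from the four cells of a binary confusion matrix."""
    ppv = tp / (tp + fp)              # precision
    sensitivity = tp / (tp + fn)      # recall
    specificity = tn / (tn + fp)
    b2 = beta ** 2
    # F-beta weights recall beta times as heavily as precision (beta=2 gives F2).
    f_beta = (1 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_beta": f_beta,
        "mcc": mcc,
    }


gb = confusion_metrics(tp=50, fn=6, fp=318, tn=626)
for name, value in gb.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values land close to the tabulated ones; small differences are expected because the counts were rounded after normalization to 1000.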

Performance of machine learning and statistical models based on antepartum and intrapartum maternal variables in predicting transfusion or postpartum hemorrhage (or both). Secondary outcome: blood transfusion.

| Algorithm | True positives^{a}, n | False negatives^{a}, n | False positives^{a}, n | True negatives^{a}, n | Positive predictive value | Sensitivity | Specificity | ROC-AUC^{b} | PR-AUC^{c} | MCC^{d} | F2^{e} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GB^{f} | 24 | 4 | 235 | 737 | 0.093 | 0.866 | 0.758 | 0.860 | 0.111 | 0.234 | 0.325 |
| RF^{g} | 25 | 3 | 251 | 721 | 0.090 | 0.887 | 0.742 | 0.862 | 0.107 | 0.232 | 0.319 |
| Emb^{h} | 22 | 6 | 223 | 750 | 0.090 | 0.789 | 0.771 | 0.837 | 0.096 | 0.215 | 0.309 |
| MLP^{i} | 24 | 4 | 237 | 735 | 0.091 | 0.849 | 0.756 | 0.845 | 0.095 | 0.227 | 0.318 |
| TFIM^{j} | 24 | 4 | 240 | 732 | 0.091 | 0.859 | 0.753 | 0.855 | 0.111 | 0.229 | 0.319 |
| SVM^{k} | 24 | 4 | 244 | 728 | 0.091 | 0.871 | 0.749 | 0.852 | 0.116 | 0.230 | 0.320 |
| LR^{l} | 24 | 3 | 250 | 722 | 0.089 | 0.876 | 0.743 | 0.853 | 0.111 | 0.228 | 0.317 |

^{a}Values are normalized per 1000, so they are easier to compare across different models; the actual N value is 55,624.

^{b}ROC-AUC: area under the receiver operating characteristic curve.

^{c}PR-AUC: area under the precision-recall curve.

^{d}MCC: Matthews correlation coefficient.

^{e}F2: modified F-score skewed toward recall.

^{f}GB: gradient boosting.

^{g}RF: random forest.

^{h}Emb: learned embedding.

^{i}MLP: multilayer perceptron.

^{j}TFIM: TensorFlow imbalanced.

^{k}SVM: support vector machine.

^{l}LR: logistic regression.

For both the primary and secondary outcomes, models developed using antepartum and intrapartum maternal variables (see

Receiver operating characteristic and precision/recall curves for different models using intrapartum maternal variables predicting transfusion or postpartum hemorrhage.

The remainder of our results focus on the model with the highest PPV: the intrapartum model (containing both antepartum and intrapartum variables) evaluating our primary outcome of a composite of blood loss of more than 1000 mL or transfusion. Both RF and GB had significantly higher PPVs for predicting the composite transfusion or PPH when compared with LR (PR-AUC=0.18, 95% CI 0.17-0.19; ROC-AUC=0.81, 95% CI 0.808-0.818).

Calibration curves for models using intrapartum maternal variables to predict transfusion or postpartum hemorrhage (or both). Emb: learned embedding; GB: gradient boosting; LR: logistic regression; MLP: multilayer perceptron; RF: random forest; SVC: support vector machine; TFIM: TensorFlow imbalanced.

Top 25 predictors based on each model using intrapartum maternal factors predicting transfusion or postpartum hemorrhage (or both). GB: gradient boosting; LR: logistic regression; MLP: multilayer perceptron; RF: random forest; SVC: support vector machine.

In this study, LR and machine learning techniques were analyzed and compared to develop prediction models for PPH and transfusions. We found that the machine learning techniques, particularly GB, performed best to predict PPH when PPH was defined as blood transfusion or blood loss of greater than 1 L. However, all prediction models had difficulties with calibration when predicting the rare outcome of transfusion alone.
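Calibration compares predicted probabilities with observed event rates. A minimal sketch of the binning behind a calibration curve follows; the function name `calibration_bins` and the toy predictions are hypothetical, and equal-width bins are an assumption of this sketch.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Mean predicted probability and observed event rate per equal-width bin."""
    cells = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, events, count]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        cells[b][0] += p
        cells[b][1] += y
        cells[b][2] += 1
    # Empty bins are dropped; each tuple is (mean predicted, observed rate, count).
    return [
        (prob_sum / count, events / count, count)
        for prob_sum, events, count in cells
        if count > 0
    ]


# Toy example of an overconfident model: predictions of 0.6 for patients
# of whom only 1 in 4 had the outcome.
y_true = [0, 0, 1, 0, 0, 0]
y_prob = [0.1, 0.1, 0.6, 0.6, 0.6, 0.6]
bins = calibration_bins(y_true, y_prob, n_bins=2)
for mean_pred, observed_rate, count in bins:
    print(f"predicted {mean_pred:.2f}, observed {observed_rate:.2f}, n={count}")
```

A well-calibrated model has the mean predicted probability close to the observed rate in every bin; a mean prediction above the observed rate means the model overstates risk. scikit-learn offers `sklearn.calibration.calibration_curve` for the same purpose.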

Risk assessment for PPH has been shown in a pre-post study to reduce rates of blood transfusion and PPH [

A previously published risk assessment for PPH using the CSL data set demonstrated exceptional model performance, but model performance was drastically lower in an external validation cohort [

For all the intrapartum methods that we tested for predicting transfusion or hemorrhage, the ROC-AUC values were greater than 0.80, which is often cited as a threshold indicating adequate discrimination. However, this conclusion is misleading because when the incidence of the outcome is low (here, it was ~3% for transfusion or hemorrhage alone), the PPV, also known as “precision,” is likely to be quite low. The precision of our best-performing model was ~13%, meaning that of those predicted to be positive for the outcome, 13% were truly positive and 87% were negative. This may be satisfactory for clinical uses where preventive interventions have very low cost (in terms of both financial cost and added risk to the patient) but would not be acceptable when the intervention carries higher risk or is more expensive. In this situation, the PR-AUC provides a more realistic measure of model quality. Precision/recall plots show PPV (precision) as a function of sensitivity (recall); thus, they account for the share of true positives among positive predictions. In contrast, the ROC-AUC emphasizes specificity, which is likely to be very high when true positives are rare [
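The dependence of precision on outcome incidence follows directly from Bayes' rule and can be shown with a short calculation. The sketch below holds sensitivity and specificity fixed at the values reported for the GB model on the composite outcome (0.889 and 0.663) and varies the prevalence; the function name `ppv_from_rates` and the prevalence values other than 5.6% are illustrative assumptions.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same classifier, different outcome rates (0.056 is the composite rate in the study).
for prevalence in (0.028, 0.056, 0.20, 0.50):
    ppv = ppv_from_rates(0.889, 0.663, prevalence)
    print(f"prevalence {prevalence:.3f} -> PPV {ppv:.3f}")
```

At the 5.6% composite rate, the calculation gives a PPV of about 0.135, in line with the reported precision; the same sensitivity and specificity would yield a much higher PPV if the outcome were common.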

The strengths of this study include the use of a large, national multicenter data set to develop a data-driven model that predicts PPH from antepartum and intrapartum factors using cutting-edge machine learning techniques. Furthermore, we considered both commonly used end points such as estimated blood loss greater than 1 L and clinically relevant end points such as transfusion; this led us to conclude that the less frequent occurrence of transfusion, together with variation in transfusion practice, made it more challenging to develop a reliable model for transfusion only.

Limitations of the study include the low reported precision of algorithms. Sensitivity is prioritized for prediction because, clinically, missing a PPH has graver consequences than a false positive. Therefore, the algorithms are trained to be biased toward predicting positives, resulting in lower false negative rates at the cost of higher false positive rates and decreased precision. As a result, as shown in the calibration plots, the models systematically overstate hemorrhage risk. In this study, the outcomes of interest were either a composite of transfusion or blood loss of ≥1 L or transfusion only. Our PPH definition was based on the American College of Obstetricians and Gynecologists’ reVITALize program’s definition of PPH as blood loss of ≥1 L or loss of blood with clinical signs of hypovolemia within 24 hours of delivery. This definition deviates from older traditional definitions that defined PPH as ≥500 mL for vaginal delivery and 1000 mL for cesarean delivery [

In conclusion, machine learning and data-driven statistical modeling may offer more objective and discriminative prediction of PPH based on individual antepartum and intrapartum patient features, compared to expert opinion, and may improve upon traditional regression models. This can increase the opportunity for precision medicine and improved clinical care to reduce the burden of PPH as a leading cause of maternal morbidity and mortality.

All antepartum and intrapartum variables were included for analysis for feature selection.

Overall Patient Characteristics.

Performance of machine learning and statistical models. The model included antepartum maternal features predicting transfusion and/or postpartum hemorrhage. Pre/Trans Loss. Footnotes: ^{a}Alg: algorithm; ^{b}NTP: normalized true positive; ^{c}NFN: normalized false negative; ^{d}NFP: normalized false positive; ^{e}NTN: normalized true negative; ^{f}ROC_AUC: area under the receiver operating characteristic curve (0.5 was considered no better than chance, greater than 0.5 to less than 0.7 poor, 0.7 to less than 0.8 acceptable, 0.8 to less than 0.9 excellent, and 0.9 or greater outstanding); ^{g}PR_AUC: area under the precision/recall curve; ^{h}MCC: Matthews correlation coefficient; ^{i}F2: modified F-score skewed toward recall; ^{j}gradient boosting; ^{k}random forest; ^{l}learned embedding; ^{m}multilayer perceptron; ^{n}TensorFlow imbalanced; ^{o}support vector machine; ^{p}logistic regression.

Performance of machine learning and statistical models. The model included antepartum maternal features predicting transfusion of any blood products only. Pre/ Trans_yes. Footnotes: ^{a}Alg: algorithm; ^{b}NTP: normalized true positive; ^{c}NFN: normalized false negative; ^{d}NFP: normalized false positive; ^{e}NTN: normalized true negative; ^{f}ROC_AUC: area under the receiver operating characteristic curve (0.5 was considered no better than chance, greater than 0.5 to less than 0.7 poor, 0.7 to less than 0.8 acceptable, 0.8 to less than 0.9 excellent, and 0.9 or greater outstanding); ^{g}PR_AUC: area under the precision/recall curve; ^{h}MCC: Matthews correlation coefficient; ^{i}F2: modified F-score skewed toward recall; ^{j}gradient boosting; ^{k}random forest; ^{l}learned embedding; ^{m}multilayer perceptron; ^{n}TensorFlow imbalanced; ^{o}support vector machine; ^{p}logistic regression.

CSL: Consortium for Safe Labor

EBL: estimated blood loss

Emb: learned embedding

GB: gradient boosting

LR: logistic regression

MCC: Matthews correlation coefficient

MLP: multilayer perceptron

NICHD: Eunice Kennedy Shriver National Institute of Child Health and Human Development

PPH: postpartum hemorrhage

PPV: positive predictive value

PR-AUC: precision/recall area under the curve

RF: random forest

ROC-AUC: receiver operating characteristic area under the curve

SVM: support vector machine

TFIM: TensorFlow imbalanced

HKA’s effort was supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health (award K23HL141640), and JJF’s effort was supported by the National Center for Advancing Translational Sciences (award TL1TR002555). LR’s and MB’s effort was supported by the Intramural Research Program at the National Institutes of Health, National Library of Medicine.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to acknowledge Dr Christian Macedonia and Dr Chad Grotegut for their insights on the initial model design and Dr Mina Felfeli for helping to submit the manuscript. HKA and JJF were supported by grants (K23HL141640 and TL1TR002555, respectively).

RA has stock ownership in AbbVie, Bristol Myers Squibb, and Pfizer. This is not related to this study.