The Silent Emergency - Predicting Preterm Birth
Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.
Erdős Institute Data Science Boot Camp Fall 2023
- View our 5-minute recorded presentation
- Download our Executive Summary
- Visit our GitHub repo
Team Members:
Project Description
“The world is facing a silent emergency ... of preterm births.” - UNICEF [1]
Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births [2]. This rate is notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%) [3]. These rates have recently declined for White women but remained unchanged for other groups [4]. Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions [5]. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth [6]. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations [7].
Project Stakeholders
Pregnant individuals, prospective parents, medical professionals involved with maternal care and births, hospital systems, insurance companies
Approach
We constructed two models to predict preterm birth: one using demographic features and one using health and lifestyle features. Our data source was the controlled tier of de-identified medical data from the National Institutes of Health's All of Us Research Program. The Demographics model was trained on a dataset of 13,690 births between 2009 and 2022. Most demographic information for each individual was available only as summary statistics based on their ZIP code. Race and ethnicity were available at the individual level. The Lifestyle model was trained on a dataset of 8,771 births between 2011 and 2022. Features included drinking, smoking, drug use, body mass index, diabetes, and mental health.
Model Details
Baseline model: Our baseline model was a weighted coin flip that reflected the ratio of preterm births in our data.
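A weighted coin flip of this kind can be sketched in a few lines. The snippet below is a minimal illustration using only the standard library, not the project's actual code; the function name `weighted_coin_baseline` and the example labels (10% preterm) are assumptions made for the example.

```python
import random

def weighted_coin_baseline(y_train, n_predictions, seed=0):
    """Predict 1 (preterm) with probability equal to the training-set preterm rate."""
    preterm_rate = sum(y_train) / len(y_train)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [1 if rng.random() < preterm_rate else 0 for _ in range(n_predictions)]

# Hypothetical labels for illustration: 1 = preterm, 0 = term.
y_train = [1] * 100 + [0] * 900
predictions = weighted_coin_baseline(y_train, n_predictions=10_000)
predicted_rate = sum(predictions) / len(predictions)
```

Because the predictions ignore every feature, any trained model should beat this baseline on recall and precision for the preterm class. scikit-learn's `DummyClassifier(strategy="stratified")` provides a comparable baseline.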
Demographics model: The Demographics model was trained on a dataset of 13,690 births between 2009 and 2022. Most demographic information for each individual was available only as summary statistics based on their ZIP code. Race and ethnicity were available at the individual level. We used a Support Vector Classifier with class weights to prioritize prediction of preterm birth. Hyperparameters were tuned with scikit-learn's GridSearchCV.
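The general pattern of a class-weighted Support Vector Classifier tuned by grid search might look like the sketch below. It runs on randomly generated stand-in data, since the All of Us records cannot be shared. The feature matrix, the parameter grid, the `"balanced"` class weights, the scaling step, and the choice of recall as the tuning metric are all assumptions for illustration, not the project's reported settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 4 numeric features, minority positive class.
n = 400
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.2).astype(int)

# class_weight="balanced" up-weights the rarer class (here, the preterm stand-in).
pipeline = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Hypothetical grid; step names come from make_pipeline's lowercase class names.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

# Scoring on recall favors catching positive cases over overall accuracy.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X, y)

best_params = search.best_params_
predictions = search.predict(X)
```

Placing the scaler inside the pipeline means it is refit on each cross-validation training fold, so no information from held-out folds leaks into the tuning.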
Lifestyle model: The Lifestyle model was trained on a dataset of 8,771 births between 2011 and 2022. Features explored included drinking, smoking, drug use, body mass index, diabetes, and mental health. We used logistic regression with class weights to prioritize prediction of preterm birth.
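To illustrate why class weights matter for a rare outcome, the sketch below compares an unweighted and a class-weighted logistic regression on synthetic data. The data-generating process, the six stand-in features, and the use of `class_weight="balanced"` are assumptions for the example; the project's actual weights and features are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 6 features, roughly 1 in 8 positives.
n = 2000
X = rng.normal(size=(n, 6))
logits = -2.5 + 1.0 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same model, with and without up-weighting the minority (positive) class.
unweighted = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

recall_unweighted = recall_score(y_test, unweighted.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
```

On imbalanced data like this, the weighted model typically recovers more of the positive cases at the cost of more false positives, which is the trade-off implied by prioritizing prediction of preterm birth.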
For both models, we used the AI Fairness 360 package to ensure that predictions performed equally well across protected classes (race and ethnicity).
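One simple way to see what "performing equally well across groups" means is to compare the true positive rate (recall) per group. The sketch below does this in plain Python rather than with AI Fairness 360, so it does not reflect that package's API; the helper `recall_by_group`, the group labels "A" and "B", and the toy predictions are all hypothetical.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Return {group: true positive rate}, computed over actual positives in each group."""
    true_pos = defaultdict(int)
    actual_pos = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            actual_pos[group] += 1
            if pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / actual_pos[g] for g in actual_pos}

# Toy example (hypothetical): 1 = preterm, groups stand in for protected classes.
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

rates = recall_by_group(y_true, y_pred, groups)
gap = max(rates.values()) - min(rates.values())  # 0 would mean equal recall
```

Here group A has recall 2/3 and group B has recall 1, so the gap is 1/3. A gap near zero suggests the model finds preterm cases at similar rates in each group; this difference in true positive rates is often called the equal opportunity difference.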
Data Access
The data used to train these models was accessed from the All of Us Research Hub. You must register for the Researcher Workbench to access the data. We specifically used individual-level data from the Controlled Access Tier, which requires the completion of additional training. To facilitate testing without access to this protected health data, we have also provided a synthetic data frame.