Machine learning to predict COVID-19 outcomes to facilitate
decision making.
Sonu Subudhi, M.B.,B.S., PhD 1, Ashish Verma, M.B.,B.S. 2*, Ankit B. Patel, MD, PhD 2
1 Gastroenterology Unit, Department of Medicine,
Massachusetts General Hospital, Harvard Medical School, Boston,
Massachusetts
2 Renal Division, Department of Medicine, Brigham and
Women’s Hospital, Harvard Medical School, Boston, Massachusetts
*Corresponding author:
Ashish Verma MBBS
Renal Medicine,
Department of Medicine,
75 Francis Street
Boston, MA, 02115
Email:
averma8@bwh.harvard.edu
Conflict of Interest: None
Financial disclosure: None
Keywords: COVID-19, SARS-CoV-2, Machine learning, Artificial
Intelligence
An increasing number of COVID-19 cases worldwide has overwhelmed the
healthcare system. Physicians are struggling to allocate resources and
to focus their attention on high-risk patients, partly because early
identification of high-risk individuals is difficult. This difficulty
arises because COVID-19 is a novel disease whose pathogenesis is only
partially understood. Machine learning algorithms, however, can relate
a large number of parameters to one another within a short period of
time to identify predictors of disease outcome. Implementing such an
algorithm to identify high-risk individuals during the early stages of
infection would help clinicians in decision making. Here, we propose
recommendations for integrating a machine learning model with
electronic health records so that a real-time risk score can be
developed for COVID-19.
The current surge in COVID-19 patients has placed unprecedented stress
on health care infrastructure. Early identification of high-risk
patients allows healthcare workers to allocate their efforts and
resources during the early clinical course to maximize their impact on
patient health. Early critical care management in certain clinical
settings has been shown to reduce mortality [1].
However, identifying patients at high risk of progressive and severe
disease remains a challenge. Previous methods, such as scoring systems
based on clinical signs, perform poorly when novel diseases emerge.
Clinical characteristics such as chest CT findings and lymphopenia are
helpful for diagnosis, but these predictors fail to show up at the
early stages of COVID-19. Other characteristics such as age, gender,
and viral load have been associated with COVID-19 severity but have not
yet been proven to predict disease severity accurately [2]. Here we lay
out recommendations for implementing a machine learning algorithm that
would facilitate clinical decision making during outbreaks like
COVID-19.
Rationale for machine learning: In the case of the COVID-19
outbreak, there have been more than 800,000 cases in the United States
and more than 2.6 million cases worldwide as of April 22, 2020. Given
the number of cases, a manual approach to reviewing cases to identify
patterns that indicate poor prognosis is not feasible. The large number
of cases has particularly stressed intensive care unit (ICU) settings,
with an increasing need for ICU beds. With this increase in ICU
occupancy, there is an immense need for ventilators and continuous
renal replacement machines, given the high rates of pulmonary and renal
failure. A prediction model that can identify patients more likely to
deteriorate and require ICU care would allow physicians to allocate
manpower and resources in an expeditious and informed manner.
Prediction models can also home in on specific complications,
distinguishing the subset of patients who will develop respiratory
failure and require ventilators from patients who will develop renal
failure and require renal replacement therapy, as well as from patients
who are at risk of requiring both life-supporting treatments.
Integrating a prediction model with the electronic health record can
give physicians immediate information about the expected patient course
and predicted response to treatments.
Outcome of interest and applicability: Machine learning models
can be trained to learn and detect patterns in a large number of
records in a fraction of the time a manual review would take.
Supervised machine learning is a type of machine learning in which the
model is trained using patient traits as input and disease outcome as
output. Early clinical, radiological, and laboratory data could be
considered as input, while disease severity, measured by a variety of
metrics, could be the output used to train a predictive model for
COVID-19. When the model is given input data from electronic health
records, certain characteristics or lab values that have yet to be
associated with disease severity could turn out to be strong predictors
in specific situations. This would give clinicians information they
would otherwise not have had time to investigate in a novel disease
such as COVID-19.
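The input/output framing above can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration: the cohort is simulated rather than drawn from patient records, the three standardized laboratory features (labelled LDH, lymphocytes, and hs-CRP) and their relationship to the outcome are invented, and logistic regression fitted by gradient descent is just one simple choice of supervised model.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_synthetic_cohort(n, seed=0):
    """Simulated (features, outcome) pairs; NOT real patient data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Hypothetical standardized lab values: LDH, lymphocytes, hs-CRP
        ldh, lymph, crp = (rng.gauss(0.0, 1.0) for _ in range(3))
        # Assumed relationship, used only to generate labels for the demo
        p_severe = sigmoid(1.5 * ldh - 1.0 * lymph + 1.0 * crp - 0.5)
        rows.append(([ldh, lymph, crp], 1 if rng.random() < p_severe else 0))
    return rows

def train_logistic(rows, epochs=200, lr=0.5):
    """Fit weights and intercept by full-batch gradient descent on log-loss."""
    n_features = len(rows[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def risk_score(model, x):
    """Predicted probability of the severe outcome for one feature vector."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

cohort = make_synthetic_cohort(500)
model = train_logistic(cohort)
high = risk_score(model, [2.0, -1.0, 1.5])   # raised LDH and hs-CRP, low lymphocytes
low = risk_score(model, [-1.0, 1.0, -1.0])   # the opposite pattern
print(f"risk score, high-risk pattern: {high:.2f}")
print(f"risk score, low-risk pattern:  {low:.2f}")
```

The learned weights play the role of the "predictors" discussed above: features that carry no signal end up with weights near zero, while informative ones receive large positive or negative weights.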
Lessons from the past: Several examples of machine learning
predicting clinical outcomes already exist. Using a longitudinal
dataset of electronic health records (EHR) from more than 700,000
patients, a machine learning model was able to predict future acute
kidney injury [3]. A similar machine learning model based on data from
a Portuguese and an American hospital was able to predict the risk of
ICU admission [4]. A study from Denmark used a machine learning model
to predict 90-day mortality for intensive care unit patients [5]. One
key finding of this model was that patient features can interact and
compensate for one another, pulling the patient towards survival at one
timepoint and towards mortality at another. Static prognostic scoring
systems usually fail to adapt to such patient dynamics. These examples
underscore the capability of machine learning.
Clinical implementation: Building a machine learning model for
COVID-19 would require early-stage clinical, radiological, and
laboratory data from a large cohort (Figure 1). The training dataset
must also include information about the patient outcomes to be
predicted, which forms the primary basis of machine learning training.
While developing the model, a fraction of patients should be held out
of the training process to serve as a testing cohort and help validate
the accuracy of the model. Once the accuracy of the model is
significantly better than the no-information rate (the accuracy
achieved by always predicting the most common outcome), the model could
be deployed to new patients. The model would primarily produce a risk
score from new input data for a patient, which would then help
clinicians guide treatment based on the risk of particular outcomes and
plan for future treatment needs.
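The validation step described above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed pipeline: the cohort is simulated, the single "marker" and its link to the outcome are invented, the model is a deliberately simple cutoff on that marker, and a one-sided exact binomial test is used as one possible way to ask whether held-out accuracy exceeds the no-information rate.

```python
import math
import random

def simulate_cohort(n, seed=0):
    """Synthetic (marker, outcome) pairs; the link between them is assumed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        marker = rng.gauss(0.0, 1.0)
        p_severe = 1.0 / (1.0 + math.exp(-(4.0 * marker - 1.0)))
        rows.append((marker, 1 if rng.random() < p_severe else 0))
    return rows

def holdout_split(rows, test_fraction=0.25, seed=1):
    """Shuffle a copy of the rows and set aside a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(cut, rows):
    """Fraction of rows where 'marker >= cut' matches the recorded outcome."""
    return sum((x >= cut) == bool(y) for x, y in rows) / len(rows)

def fit_threshold(train):
    """Choose the marker cutoff with the highest training accuracy."""
    candidates = sorted({x for x, _ in train}) + [math.inf]
    return max(candidates, key=lambda cut: accuracy(cut, train))

def no_information_rate(rows):
    """Accuracy of always predicting the most common outcome."""
    positives = sum(y for _, y in rows)
    return max(positives, len(rows) - positives) / len(rows)

def p_value_vs_nir(n_correct, n, nir):
    """One-sided exact binomial test: P(X >= n_correct), X ~ Binomial(n, nir)."""
    return sum(math.comb(n, k) * nir**k * (1.0 - nir)**(n - k)
               for k in range(n_correct, n + 1))

train, test = holdout_split(simulate_cohort(1000))
cut = fit_threshold(train)              # the model only ever sees the training cohort
test_accuracy = accuracy(cut, test)     # performance is judged on held-out patients
nir = no_information_rate(test)
n_correct = round(test_accuracy * len(test))
p_value = p_value_vs_nir(n_correct, len(test), nir)
print(f"held-out accuracy:   {test_accuracy:.3f}")
print(f"no-information rate: {nir:.3f}")
print(f"one-sided p-value:   {p_value:.3g}")
```

Comparing against the no-information rate matters most when outcomes are imbalanced: if only a small fraction of patients deteriorate, a model that never predicts deterioration can still report a high raw accuracy.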
Machine learning approaches have already been applied to COVID-19
patient data in China [6,7]. The aim was to predict the severity of
disease based on initial presentation data. One of the models was able
to accurately predict disease outcome in 90% of cases [7]. In this
model, the most important features used for prediction were lactic
dehydrogenase (LDH), lymphocytes, and high-sensitivity C-reactive
protein (hs-CRP). Similar implementations of the machine learning
approach in larger cohorts from other countries could provide more
specific models that capture local factors as predictors of disease.
Advantages and challenges: Machine learning models benefit from
larger sample sizes, which in most cases improve the accuracy of
models [8]. For evolving outbreaks such as COVID-19, as the number of
patients increases and more data become available for training, the
model would likely evolve and become more accurate in predicting
disease severity from initial presentation data. Current scoring
systems lack this sort of evolving scoring criteria, which makes them
less accurate, particularly in novel disease entities where limited
data exist.
An added advantage of deploying such a model would be improved patient
care, by helping clinicians obtain the data most relevant to
understanding the risk of disease progression. This process can be
performed in real time when an integrated electronic health record
alerts the clinician to key demographic, clinical, or laboratory data
that would be helpful in predicting patient outcomes.
A key challenge is providing high-quality data for training the
predictive model. Inconsistent or noisy data could hamper the
performance of such a model. It is also important to guard against the
model overfitting the data, which can be mitigated by increasing the
number of patients used in training the model. As the COVID-19 outbreak
expands and more training data become available, the accuracy of the
model should improve. It is important for clinicians to remember that
machine learning provides only a prediction. Blind reliance on
predictive models leads to automation bias, which should be monitored
for whenever a predictive model is implemented.
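The effect of cohort size on overfitting can be illustrated with a small simulation. It is a sketch under assumptions: the data are synthetic, the single marker and its link to the outcome are invented, and the gap between training accuracy and accuracy on a separate test cohort is used here as a simple indicator of overfitting.

```python
import math
import random

def simulate(n, rng):
    """Synthetic (marker, outcome) pairs; the marker-outcome link is assumed."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p_severe = 1.0 / (1.0 + math.exp(-(4.0 * x - 1.0)))
        rows.append((x, 1 if rng.random() < p_severe else 0))
    return rows

def accuracy(cut, rows):
    """Fraction of rows where 'marker >= cut' matches the recorded outcome."""
    return sum((x >= cut) == bool(y) for x, y in rows) / len(rows)

def fit_threshold(train):
    """Choose the marker cutoff with the highest training accuracy."""
    candidates = sorted({x for x, _ in train}) + [math.inf]
    return max(candidates, key=lambda cut: accuracy(cut, train))

def mean_gap(n_train, test, rng, repeats):
    """Average (training accuracy - test accuracy) over repeated cohorts."""
    gaps = []
    for _ in range(repeats):
        train = simulate(n_train, rng)
        cut = fit_threshold(train)
        gaps.append(accuracy(cut, train) - accuracy(cut, test))
    return sum(gaps) / repeats

rng = random.Random(0)
test_cohort = simulate(2000, rng)       # large held-out cohort, never used for fitting
gap_small = mean_gap(10, test_cohort, rng, repeats=200)
gap_large = mean_gap(500, test_cohort, rng, repeats=5)
print(f"mean train-test accuracy gap, 10 patients:  {gap_small:.3f}")
print(f"mean train-test accuracy gap, 500 patients: {gap_large:.3f}")
```

A model fitted to a handful of patients tends to look better on its own training data than on new patients; with a larger training cohort the two accuracies move closer together, which is the behaviour the paragraph above relies on.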
Conclusion: The overall goal of this approach would be to provide
early predictions of patient outcomes in COVID-19. Moreover, such a
system, once in place, could serve as a model for future outbreaks as
well.