LIMITATIONS
This is a single-center, retrospective study. Although our approach is
especially effective when both the dataset and the number of examples in
the under-represented class (AF in our case) are small, it has
limitations. When working with a larger imbalanced
dataset, the under-sampling step involved in creating a balanced
training set eliminates a sizeable portion of the over-represented
class, while the over-sampling step applied via the SMOTE process to the
under-represented class generates a large number of synthesized samples
that were not in the original dataset. Both steps can discard useful
information and alter the distribution of characteristic feature values
in both the minority and the majority classes. In the current study, we
mitigated this issue by experimenting systematically to determine
effective rates of under- and over-sampling.
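The trade-off described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it uses a simplified SMOTE-style interpolation written with the Python standard library only, and the function names (`undersample`, `smote_like`, `rebalance`), the sampling rates, and the toy data are all hypothetical.

```python
import math
import random


def undersample(majority, keep_fraction, rng):
    """Randomly keep a fraction of the majority-class rows (the rest are discarded)."""
    n_keep = max(1, round(len(majority) * keep_fraction))
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows, each interpolated between a minority row
    and one of its k nearest minority-class neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def rebalance(majority, minority, keep_fraction, oversample_ratio, k=5, seed=0):
    """Under-sample the majority class, over-sample the minority class,
    and report how much original data was dropped or synthesized."""
    rng = random.Random(seed)
    kept = undersample(majority, keep_fraction, rng)
    n_new = round(len(minority) * oversample_ratio)
    synthetic = smote_like(minority, n_new, k, rng)
    summary = {
        "majority_discarded": len(majority) - len(kept),
        "minority_synthetic": len(synthetic),
        "synthetic_share_of_minority": len(synthetic) / (len(minority) + len(synthetic)),
    }
    return kept, minority + synthetic, summary


# Toy two-feature data (hypothetical, not study data): 1000 majority, 100 minority rows.
data_rng = random.Random(42)
toy_majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]
toy_minority = [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(100)]

_, _, toy_summary = rebalance(toy_majority, toy_minority,
                              keep_fraction=0.3, oversample_ratio=2.0)
print(toy_summary)
```

With these illustrative rates, 700 of the 1000 majority rows are discarded and 200 of the resulting 300 minority rows are synthetic, which is the kind of information loss and distribution shift that grows with the size and imbalance of the dataset.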
We grouped all AF cases (paroxysmal, persistent, permanent), as well as
prevalent and incident AF, into one set because of the low number of
events, and only included ECHO/CMR imaging features obtained at the
patients’ first clinic visit in the model. Left atrial volume,[46] LA
strain,[58] LA fibrosis,[56] EKG/blood biomarkers,[66]
genotype,[67-69] and sleep apnea were not included in our model
because these data were not available for a large proportion of our cohort.
Lastly, we were unable to assess the generalizability of our approach by
applying our developed model to additional HCM patients beyond the
cross-validation study, because data from other HCM cohorts reported in
other studies were unavailable. We expect to address this issue in a
future prospective study.