LIMITATIONS
This is a single-center, retrospective study. Although our approach is especially effective when the dataset is small and the under-represented class (AF in our case) has few examples, it has limitations. When applied to a larger imbalanced dataset, the under-sampling step used to create a balanced training set discards a sizeable portion of the over-represented class, while the over-sampling step, applied to the under-represented class via SMOTE, generates a large number of synthetic samples that were not in the original dataset. Both steps can lead to a loss of useful information and alter the distribution of feature values in both the minority and the majority classes. In the current study, we mitigated this issue through thorough experimentation to determine effective rates of under- and over-sampling.
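To illustrate how the two resampling steps change the composition of a training set, the following minimal Python sketch applies random under-sampling and a simplified SMOTE-style interpolation to toy data. It is not the implementation used in this study: the function names `undersample` and `smote_like`, the toy Gaussian features, the sampling rates (keeping 30% of the majority class, adding 40 synthetic minority rows), and k = 5 neighbours are all assumptions chosen for the example.

```python
import random


def undersample(rows, keep_fraction, rng):
    """Randomly keep a fraction of the majority-class rows (the rest are discarded)."""
    k = max(1, round(len(rows) * keep_fraction))
    return rng.sample(rows, k)


def smote_like(rows, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # Rank the other minority rows by squared Euclidean distance to the base row.
        others = sorted(
            (r for r in rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


rng = random.Random(0)

# Hypothetical toy data: 200 majority rows and 20 minority rows, 2 features each.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]

kept_majority = undersample(majority, keep_fraction=0.3, rng=rng)
new_minority = smote_like(minority, n_new=40, k=5, rng=rng)

# Share of majority rows thrown away, and share of the final minority class
# that consists of synthetic rather than observed rows.
discarded_share = 1 - len(kept_majority) / len(majority)
synthetic_share = len(new_minority) / (len(minority) + len(new_minority))
print(f"majority discarded: {discarded_share:.0%}, minority synthetic: {synthetic_share:.0%}")
```

In this toy setting, 70% of the majority rows are discarded and two thirds of the resulting minority class are synthetic. Both proportions grow with the degree of imbalance and with the chosen sampling rates, which is why those rates need to be tuned.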
We grouped all AF cases (paroxysmal, persistent, and permanent), as well as prevalent and incident AF, into one set because of the low number of events, and the model included only ECHO/CMR imaging features obtained at the patients’ first clinic visit. Left atrial volume,[46] LA strain,[58] LA fibrosis,[56] EKG/blood biomarkers,[66] genotype,[67-69] and sleep apnea were not included in our model because these data are not available for a large proportion of our cohort. Lastly, we were unable to assess the generalizability of our approach by applying the developed model to additional HCM patients beyond the cross-validation study, because data from the HCM cohorts reported in other studies were unavailable. We expect to address this last issue in a future prospective study.