The Epic EMR system accompanies laboratory values with flags indicating results above or below the reference range. This is how Chi2notype is able to tally the fraction of patients with metastasis who have each type of abnormal lab value and compare it to the corresponding fraction among kidney cancer patients as a whole. However, Chi2notype cannot detect more subtle numerical trends that might be predictive of metastasis. Certain laboratory tests are commonly administered to all patients, and are thus sufficiently represented in the data to possibly permit detection of quantitative differences within the normal range of values. These tests are: . In addition, we included the following vital signs: Body Mass Index, Pulse, Respiration Rate, Temperature (F), and Systolic and Diastolic Blood Pressure. Finally, the following demographic variables were included: race, sex, and Hispanic ethnicity.
Our outcome variable was any diagnosis of secondary tumors with one of the following ICD10 codes: "C77 Secondary and unspecified malignant neoplasm of lymph nodes," "C78 Secondary malignant neoplasm of respiratory and digestive organs," and "C79 Secondary malignant neoplasm of other and unspecified sites."
Data Extraction
[The data was pulled as a CSV file from i2b2 using DataFinisher \citep{Bokov_2016}. In the output data, each of the XX variables got a separate column, with some of them having additional columns to store metadata. For each patient, each visit date is represented by one row. Visits from the same patient are adjacent to each other and sorted in chronological order. A non-identifying study-assigned ID called PATIENT_NUM is available for grouping data by patient. The time variable is represented as age at visit in days, a precise and convenient quantity that can, if necessary, easily be rounded to coarser intervals such as months or years. We used two new features of DataFinisher that have been developed since the original 2016 publication: random sub-sampling and data dictionary export. The random sub-sampling feature made it easy to select a random set of 500 patients from the original XXX so that we could carry out the model development work presented here without exposing the rest of the dataset to the biasing effect of within-sample inference \citep{Berk_2009}. The data dictionary feature permitted automatic renaming of variables into a human-readable form.]
The R statistical language \cite{team2017rb} and the dplyr package for R \citep{wickham2015dplyr} were used to group the data by PATIENT_NUM and create temporal variables for each patient. The row containing the first kidney cancer diagnosis (ICD10 code C64) was set as the start of the follow-up period, and the patient's age at that visit was subtracted from the ages at all of that patient's other visits to create the start time variable for each follow-up interval. The start time variable was then shifted by one row within each patient to create the end time variable for each interval. A censoring indicator was created with a default value of 0 (meaning that no event of interest was observed). For the row preceding the first row in which any ICD10 code for secondary tumors (C77-C79) was entered, the censoring variable was set to 1, meaning first diagnosis of metastasis. The preceding visit was used to prevent the predictive model from being dominated by EMR data that gets recorded after the diagnosis is already known to the physician. Patients who had a metastasis diagnosis recorded prior to their first primary kidney cancer diagnosis, or whose first kidney cancer diagnosis was accompanied by metadata indicating a previous history of it, were eliminated from the dataset because their first diagnoses occurred outside UT Medicine and would distort a model whose purpose is to predict risk of progression to metastasis from the time of initial diagnosis. For the remaining patients, the rows from initial kidney cancer diagnosis to initial metastasis diagnosis (if any) were used in the analysis. For kidney cancer patients who never progressed to metastasis, all visits subsequent to initial diagnosis were used in the analysis.
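The interval construction described above can be sketched as follows. This is an illustration in plain Python rather than the R/dplyr code actually used, and it rests on several assumptions that are not from the original analysis: each visit is a hypothetical (patient_num, age_in_days, icd10_codes) tuple, the function name build_intervals is invented for this example, a metastasis code on the same visit as the first C64 code is treated like a prior metastasis, and the exclusion based on "previous history" metadata is omitted.

```python
from itertools import groupby
from operator import itemgetter

METASTASIS_PREFIXES = ("C77", "C78", "C79")


def has_code(codes, prefixes):
    """True if any ICD10 code on the visit starts with one of the prefixes."""
    return any(code.startswith(prefixes) for code in codes)


def build_intervals(visits):
    """Convert visits into (patient, start, stop, event) follow-up intervals.

    visits: iterable of (patient_num, age_in_days, list_of_icd10_codes).
    Times are days since the first kidney cancer (C64) diagnosis.
    """
    intervals = []
    ordered = sorted(visits, key=itemgetter(0, 1))
    for patient, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        first_dx = next(
            (i for i, r in enumerate(rows) if has_code(r[2], "C64")), None)
        first_met = next(
            (i for i, r in enumerate(rows)
             if has_code(r[2], METASTASIS_PREFIXES)), None)
        if first_dx is None:
            continue  # no kidney cancer diagnosis on record
        if first_met is not None and first_met <= first_dx:
            continue  # assumption: metastasis at or before first C64 is excluded
        # Keep rows from first diagnosis through first metastasis (if any).
        last = first_met if first_met is not None else len(rows) - 1
        follow = rows[first_dx:last + 1]
        t0 = follow[0][1]
        for i in range(len(follow) - 1):
            # The row preceding the metastasis row carries the event.
            event = int(first_met is not None and i == len(follow) - 2)
            intervals.append(
                (patient, follow[i][1] - t0, follow[i + 1][1] - t0, event))
    return intervals


# Made-up visits, for illustration only.
example = [
    (1, 1000, ["C64.9"]), (1, 1030, []), (1, 1100, ["C78.0"]), (1, 1200, []),
    (2, 500, ["C79.5"]), (2, 600, ["C64.1"]),
    (3, 200, []), (3, 250, ["C64.1"]), (3, 300, []),
]
for row in build_intervals(example):
    print(row)
```

In this sketch the last visit of each patient contributes only an end time, so a patient with a single visit after diagnosis yields one interval and a patient with none yields no rows.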
Missing Data
A major challenge for the secondary use of EMR data in research is the missing observations problem \citep{Greenland_1995,19946393}. Patient visits happen at irregular intervals, laboratory test results can be reported on different days than clinic visits, and even vital signs are not necessarily read at each visit. Moreover, the data is not missing at random. If we were to analyze only the complete cases, not only would our sample size shrink drastically, but the visits that remained (if any) would be highly unusual for being accompanied by every laboratory test and vital sign. We considered two approaches to imputing missing laboratory values. The first, to which we will refer here as discretization, takes advantage of the fact that in the Epic EMR, laboratory results that are below or above their reference ranges are accompanied by value-flags. We turned every laboratory result into an ordinal variable with a baseline value of 'None', which includes days with no laboratory results as well as days on which the results were in the normal range. The deviations from this baseline were coded as 'High' and 'Low'. The second approach was last observation carried forward (LOCF). We tried both approaches for laboratory values, but only LOCF for vital signs, since value-flags were not available for the latter. More sophisticated methods for interpolating time-series data exist, of course, but most of them suffer from look-ahead bias, i.e. they use information from both before and after the missing data, which is not valid for time-to-event models \citep{therneau2017using}.
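The two imputation approaches can be illustrated with a minimal Python sketch (the analysis itself was done in R). The function names and the flag codes "H" and "L" are assumptions made for this example, not the actual Epic flag values, and both functions are meant to be applied to one patient's chronologically ordered values at a time.

```python
def locf(values):
    """Last observation carried forward.

    Each missing value (None) is replaced by the most recent earlier
    non-missing value. Values before the first observation stay None,
    because only past information is allowed.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def discretize(flags):
    """Map value-flags to the ordinal levels 'Low', 'None', 'High'.

    Days with no result and days with an in-range result both fall into
    the baseline level 'None'. The codes "H" and "L" are hypothetical.
    """
    levels = {"H": "High", "L": "Low"}
    return [levels.get(flag, "None") for flag in flags]


# Made-up values for one patient, in visit order.
print(locf([None, 1.2, None, None, 0.9, None]))
print(discretize(["H", None, "", "L"]))
```

Neither function looks at later visits, which is the property that keeps both approaches free of look-ahead bias.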
For diagnoses, there is no distinction between missing and negative data-- in an EMR system, a diagnosis is actually an indicator of clinical activity related to a disease rather than the disease itself. For this reason, even though all the patients had kidney cancer, the ICD10 code C64 was not necessarily present for all visits during which they had it-- some visits could have been for reasons unrelated to their cancer. Likewise for other diagnosis codes. Medications and diagnoses were coded as 'Yes' if they were present during a visit, and 'No' otherwise, and no missing data interpolation was needed.
Post-Processing
In addition to dplyr, the following R packages were used by our analysis script to organize the data and prepare it for analysis: zoo, readr, stringr, and magrittr. Along with the built-in functionality of RStudio, the following R packages were used to produce plots and tables of our results: ggplot2, ggfortify, grid, and stargazer. Analysis was done primarily using the built-in survival package, with some added functionality from the MASS and Hmisc packages.
Model Specification
Survival analysis makes more efficient use of this data than logistic regression because the latter throws away information by ignoring the time elapsed until the event of interest (metastasis) is observed. The Cox proportional hazards model \citep{cox_regression_1972} is widely used for survival data and can support predictor variables that are discrete, continuous, or both. The Cox model can be extended to permit multiple follow-ups per subject and thus time-varying predictors \citep{survival-book,therneau2017using}. One approach is to treat the data as right-censored and use robust methods to adjust the variance estimates on linear predictors. Another is to treat the data as interval-censored without any further adjustment. We tried and compared both approaches on the laboratory values, diagnoses, and medications.
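For reference, the Cox model with time-varying predictors can be written in its usual textbook form. This is a sketch that assumes no tied event times; the symbols $h_0$, $\beta$, $x_i(t)$, $\delta_j$, and $R(t)$ are notation introduced here for illustration and do not come from the analysis script.

\begin{equation}
h_i(t) = h_0(t)\,\exp\{\beta^\top x_i(t)\}
\end{equation}

\begin{equation}
L(\beta) = \prod_{j:\,\delta_j = 1} \frac{\exp\{\beta^\top x_j(t_j)\}}{\sum_{k \in R(t_j)} \exp\{\beta^\top x_k(t_j)\}},
\qquad
R(t_j) = \{k : \mathrm{start}_k < t_j \le \mathrm{stop}_k\}
\end{equation}

Here $h_0(t)$ is the unspecified baseline hazard, $j$ indexes the follow-up rows whose event indicator $\delta_j$ equals 1, $t_j$ is the end time of row $j$, and $R(t_j)$ is the set of rows at risk at that time.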
For the model specifications and data imputation approaches described above, a separate univariate Cox proportional hazard model was fit for each candidate predictor.
Results
We used the Wilcoxon signed-rank test \cite{bauer1972constructing} to compare the concordances and goodness-of-fit (likelihood ratio) statistics for the same models using right censoring and clustering on PATIENT_NUM versus interval censoring and no clustering. There were no significant differences between the two approaches in either the concordances (V statistic = 452, p > 0.5) or the goodness-of-fit (V statistic = 490, p > 0.2). Because the interval-censored approach without clustering is computationally faster, we will use that one going forward.
For the laboratory values only, we used the above approach to compare models using discretized variables versus the LOCF approach for addressing missing values. The goodness-of-fit was not significantly different (V statistic = 91, p > 0.2), but the concordances were significantly improved by the LOCF approach (V statistic = 40, p < 0.004).
The concordances were not especially high, though serum creatinine (LOINC:2160-0, Creat SerPl-mCnc), erythrocyte distribution width (LOINC:788-0, RDW RBC Auto-Rto), serum calcium (LOINC:17861-6, Calcium SerPl-mCnc), and systolic blood pressure came close to 0.6. These could be usable in a predictive model with a larger sample size. The predictive accuracies may also be improved by including several of these variables in a single multivariable model that has interaction terms and includes demographic covariates (age at diagnosis, race, Hispanic status, and ethnicity). This is the next step planned for this ongoing project.
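As an illustration of what the V statistic measures, the following Python sketch computes it for paired values: pairs with zero difference are dropped, the absolute differences are ranked (ties receive the average rank), and V is the sum of the ranks belonging to positive differences. The function name and the example concordances are made up for this sketch; the reported statistics came from R, and no p-value is computed here.

```python
def wilcoxon_v(x, y):
    """Sum of the ranks of |x - y| over pairs where x - y is positive."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return sum(rank for rank, d in zip(ranks, diffs) if d > 0)


# Hypothetical concordances for three predictors under the two approaches.
right_censored = [0.60, 0.55, 0.58]
interval_censored = [0.57, 0.56, 0.53]
print(wilcoxon_v(right_censored, interval_censored))
```

A V close to half of n(n+1)/2, where n is the number of non-zero differences, indicates that neither approach is consistently better than the other.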
Figure 1. Screenshots of i2b2
Table 1. Patient Population
| Name | All Patients | % of All Patients | # of Kidney Cancer Patients | % of Kidney Cancer Patients | Odds Ratio |
|---|---|---|---|---|---|
| Hispanic or Latino | 110028 | 28.7% | 666 | 46.1% | 2.13 |
| Spanish | 31340 | 8.2% | 171 | 11.8% | 1.51 |
| Self-pay [3,769,507 facts; 281,439 patients] | 294656 | 76.8% | 1382 | 95.6% | 6.63 |
| Medicare [1,198,657 facts; 77,689 patients] | 81225 | 21.2% | 679 | 47.0% | 3.30 |
| Carelink [518,676 facts; 54,031 patients] | 55094 | 14.4% | 411 | 28.4% | 2.37 |
| Deceased | 5905 | 1.5% | 132 | 9.1% | 6.43 |
| Living | 372320 | 97.0% | 1310 | 90.7% | 0.30 |
| Male | 174414 | 45.4% | 871 | 60.3% | 1.82 |
| Female | 209134 | 54.5% | 574 | 39.7% | 0.55 |
| White or Caucasian | 247075 | 64.4% | 1122 | 77.6% | 1.92 |
| Unknown/Other | 91150 | 23.8% | 224 | 15.5% | 0.59 |
| Not Recorded | 23509 | 6.1% | 28 | 1.9% | 0.30 |
| Black or African-American | 16324 | 4.3% | 61 | 4.2% | 0.99 |
| Asian | 5016 | 1.3% | 12 | 0.8% | 0.63 |
| I choose not to provide this information | 1658 | 0.4% | 6 | 0.4% | 0.96 |
| Unknown | 535 | 0.1% | 4 | 0.3% | 1.99 |
| American Indian or Alaska Native | 528 | 0.1% | 3 | 0.2% | 1.51 |
| More Than One Race | 462 | 0.1% | 2 | 0.1% | 1.15 |
| Native Hawaiian and Other Pacific Islander | 260 | 0.1% | 0 | 0.0% | 0.00 |
| Other | 149 | 0.0% | 3 | 0.2% | 5.36 |
| Total | 383752 | | 1445 | | |
A demographic summary of the patient-set we analyzed.
Table 2. Chi2notype: Kidney Cancer vs Kidney Cancer With Metastasis
The top over-represented data elements in patients with metastatic kidney cancer relative to kidney cancer patients overall. [JUSTIN, I DO NOT HAVE THE FINAL VERSION OF THE TABLES, BUT PLEASE PUT INTO THE POSTER AT LEAST THE MEDICATIONS AND THE (HIGH/LOW) LABS FROM THE METASTASIS VS ALL KIDNEY CANCER CHINOTYPE OUTPUT]
Table 3. Comparison of Right Censored and Interval Censored Models for the Various Predictors
The labs, diagnoses, and medications are in separate sections of this table. For all the labs shown here the discretized approach (Low/Neither/High) was used. The Concordance column indicates how well the predictions of the respective models matched the actual data. The Standard Error column is the standard error for the concordance. The LRT column is the likelihood ratio omnibus test for the goodness-of-fit for each model.
Table 4. Side-by-Side Comparison of Discretized Lab Results vs. Last Observation Carry Forward
The labs and vitals are in separate sections of this table. For all the models shown here the interval-censored approach was used. The Concordance column indicates how well the predictions of the respective models matched the actual data. The Standard Error column is the standard error for the concordance. The LRT column is the likelihood ratio omnibus test for the goodness-of-fit for each model. For the vital signs only the LOCF approach to addressing missing values was used because there were no value-flags available for high or low measurements.
Figure 2. Survival Curves for All Labs (Last Observation Carry Forward)
For all plots the x-axes represent the time in days from initial kidney cancer diagnosis (ICD10: C64), the y-axes represent the fraction of the original population remaining metastasis-free, and the tick-marks represent censoring times-- i.e. the last recorded visits of patients who were never diagnosed with metastatic tumors. The shaded regions are 95% confidence intervals. All survival curves are on the same set of patients and follow-up times, but differ from each other by which predictor variable was used to group the patients. A perfect predictor would have separated all the tick-marks into one curve which would have looked like a horizontal line, and the other line would have looked like a step-function with no tick-marks on it.
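To make the y-axis quantity concrete, here is a minimal Kaplan-Meier sketch in Python for a single group of right-censored subjects. It is a simplification introduced for illustration: the figures were produced in R, grouped by predictor, and the function name and example data below are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, fraction event-free)] at each distinct event time.

    times:  follow-up time per subject (days since initial diagnosis)
    events: 1 if metastasis was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surviving = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            # The curve steps down only when an event occurs.
            surviving *= 1 - n_events / at_risk
            curve.append((t, surviving))
        # Censored subjects leave the risk set without a step (a tick-mark).
        at_risk -= n_leaving
    return curve


# Made-up follow-up times for five subjects.
for point in kaplan_meier([5, 8, 8, 12, 20], [1, 0, 1, 0, 1]):
    print(point)
```

Censored subjects shrink the risk set without lowering the curve, which is why a group consisting only of censored subjects would appear as a horizontal line.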
Conclusion
Here we present the results of a formal data characterization and model selection effort on a developmental dataset prior to external validation and hypothesis testing on a larger hold-out dataset. We have found that using an interval-censored representation of the data and dispensing with mixed-effects (i.e. the clustering term) does not significantly diminish concordance or goodness-of-fit. Therefore, going forward we will use interval-censoring because it is faster to compute. For laboratory values, we could choose either to discretize them, such that the effect estimates in the Cox model reflect positive and negative deviations from the baseline state (that of having either normal results or none), or to replace missing values with the most recent non-missing value for that patient (LOCF). Based on the comparison of 22 univariate Cox models, each using a different lab as its predictor, we found a significantly improved concordance for the LOCF models. For vital signs there was no such choice, since value-flags for abnormal results are not available and we had to use the LOCF approach on them. Diagnoses and medications can be treated as free of missing values as long as we interpret them to mean clinical activity rather than a direct indication of the underlying health condition.
LOCF is the simplest way to fill in missing values that relies only on information collected prior to each missing value. We do not expect multiple imputation to work because of how rare it is for some labs to ever co-occur. However, there may be moving average approaches that are superior, and a future direction will be to compare those to LOCF. Another future direction will be to fit a multivariable model including demographic variables and all the variables fit in the univariate models reported here, with the final choice of predictors and interactions made using bidirectional stepwise regression. The candidate variables will be prioritized based on the concordances we observed here, which will improve the speed of computation.
[ADD WHATEVER OTHER CONCLUSIONS FROM THE STUDY SO FAR THAT YOU FEEL ARE RELEVANT]
Acknowledgments
Done, already in poster