Processing of feature data
We used the following data tables for feature extraction before the
cutoff date: ECOG, enhanced biomarkers, demographics, diagnosis code,
visit code, telemedicine code, medication administration code,
insurance, lab results, medication order, vitals, and practice.
Feature data fall largely into two categories. The first is static data, which do not change over the observation window, such as age, gender, and race. The second is dynamic data, such as labs, medications, visits, vitals, and diagnoses, which are collected before the cutoff date. From the dynamic data, we extracted diverse
meta-features. We first selected the 100 most frequent concept IDs in each of the Flatiron data tables above and binarized the last eight records of each (when not originally a continuous value), yielding 800 features per table, with 1 indicating that the concept ID appeared at that data point and 0 otherwise. Additionally, if the concept ID
represents a real-valued feature, the mean value and the standard
deviation of each selected concept ID before the cutoff time are
included. Using this mean and standard deviation, we generate normalized values for the initial 800 features of each table, and we also record the time difference between each record and the previous one.
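As a concrete illustration, the per-table extraction above can be sketched as follows. This is a minimal sketch, not the actual pipeline code: the column names ('patient_id', 'concept_id', 'date', 'value') and the assumption that non-continuous concepts carry a NaN value are ours, not the Flatiron schema.

```python
import numpy as np
import pandas as pd

def extract_table_features(table, n_concepts=100, n_points=8):
    """Sketch of the dynamic meta-features for one Flatiron-style table:
    binarized last records, normalized real values, and record-time gaps.
    Assumed columns: 'patient_id', 'concept_id', 'date', 'value'."""
    top = table["concept_id"].value_counts().head(n_concepts).index.tolist()
    # Per-concept mean/std over all records before the cutoff,
    # used to normalize real-valued features.
    stats = table.groupby("concept_id")["value"].agg(["mean", "std"])
    out = {}
    for pid, grp in table.sort_values("date").groupby("patient_id"):
        grp = grp[grp["concept_id"].isin(top)].tail(n_points)
        binarized = np.zeros((n_concepts, n_points))   # 100 x 8 = 800 features
        normalized = np.zeros((n_concepts, n_points))
        deltas = np.zeros(n_points)                    # days since previous record
        prev = None
        for j, row in enumerate(grp.itertuples()):
            slot = n_points - len(grp) + j             # right-align recent records
            i = top.index(row.concept_id)
            binarized[i, slot] = 1.0
            if not np.isnan(row.value):                # real-valued concept
                mu, sd = stats.loc[row.concept_id]
                normalized[i, slot] = (row.value - mu) / (sd + 1e-8)
            if prev is not None:
                deltas[slot] = (row.date - prev).days
            prev = row.date
        out[pid] = np.concatenate([binarized.ravel(), normalized.ravel(), deltas])
    return out
```

With the paper's settings (100 concept IDs, eight record slots), each patient would get 800 binarized features, 800 normalized values, and 8 time deltas per table.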
Lastly, for each original feature we include a binary indicator of whether it comes from a missing or an existing record (eight values per Flatiron data table). This matrix is flattened into a single feature vector, concatenated with the static features, and input into LightGBM.
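The final assembly step might be sketched as below; the input shapes and names are illustrative assumptions, though the concatenate-then-fit pattern follows the description above.

```python
import numpy as np

def assemble_feature_vector(dynamic_blocks, static_features):
    """Sketch of the final assembly: per-table dynamic feature matrices
    plus their eight binary missing-record indicators are flattened and
    concatenated with the static features into one vector."""
    parts = []
    for features, missing_flags in dynamic_blocks:      # one pair per Flatiron table
        parts.append(np.ravel(features))                # binarized/normalized records
        parts.append(np.asarray(missing_flags, float))  # 8 missingness flags per table
    parts.append(np.asarray(static_features, float))    # age, gender, race, ...
    return np.concatenate(parts)
```

The resulting per-patient vectors would then be stacked into a matrix X and passed to a gradient-boosted model, e.g. `lightgbm.LGBMClassifier().fit(X, y)`.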