Asthma AI modeling and feature importance analysis
Microbiome next-generation sequencing (NGS) data were converted to
relative composition at the genus level to standardize the input data. A
total of 2,035 features were detected from all samples at the genus
level. Two generalized linear models (GLMs) were developed using the
t-test and linear discriminant analysis effect size (LEfSe) methods for
feature selection. Two algorithms were used for AI modeling. The first
was a gradient boosting machine (GBM). To apply the GBM algorithm
to the tabulated microbiome composition data, machine learning modules
in the Python scikit-learn package were used. The second algorithm
was an artificial neural network (ANN), which used a regularized
five-layer neural network and was implemented with the TensorFlow
ecosystem in Python [15]. The ensemble algorithm averaged the
values predicted by the individual models. The
data were split into training and testing sets 10 times to validate
the performance of the models. For each split, the models were trained
on the training set and then validated on the test set. Feature
importance was assessed using the
testing model to identify the relatively important taxa. Permutation
importance was computed with the scikit-learn package to trace
feature importance in the AI model [16].
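
The workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the data are synthetic, the helper names (to_relative_abundance, evaluate_ensemble) are hypothetical, the test fraction and hyperparameters are arbitrary, and scikit-learn's MLPClassifier with three hidden layers and L2 regularization stands in for the TensorFlow network. The sketch converts counts to relative composition, repeats a stratified split 10 times, averages the GBM and neural-network probabilities, and computes permutation importance on each test set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier


def to_relative_abundance(counts):
    """Convert a samples x genera count matrix to per-sample proportions."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty samples
    return counts / totals


def evaluate_ensemble(X, y, n_splits=10, test_size=0.3, seed=0):
    """Repeat a stratified split, average GBM and neural-network
    probabilities, and return test AUCs and mean permutation importance."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    aucs, importances = [], []
    for train_idx, test_idx in splitter.split(X, y):
        gbm = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        # Assumption: a small scikit-learn MLP (input, 3 hidden, output layers)
        # with L2 penalty `alpha` replaces the TensorFlow model of the text.
        ann = MLPClassifier(hidden_layer_sizes=(16, 16, 16), alpha=1e-3,
                            max_iter=500, random_state=seed)
        gbm.fit(X[train_idx], y[train_idx])
        ann.fit(X[train_idx], y[train_idx])

        # Ensemble prediction: mean of the two models' class-1 probabilities.
        proba = (gbm.predict_proba(X[test_idx])[:, 1]
                 + ann.predict_proba(X[test_idx])[:, 1]) / 2
        aucs.append(roc_auc_score(y[test_idx], proba))

        # Permutation importance of the GBM, evaluated on the held-out set.
        perm = permutation_importance(gbm, X[test_idx], y[test_idx],
                                      scoring="roc_auc", n_repeats=5,
                                      random_state=seed)
        importances.append(perm.importances_mean)
    return np.array(aucs), np.mean(importances, axis=0)


# Synthetic example: 80 samples, 12 hypothetical genera.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(80, 12))
y = np.repeat([0, 1], 40)
counts[y == 1, 0] += 40  # hypothetical genus enriched in the case group
X = to_relative_abundance(counts)
aucs, importance = evaluate_ensemble(X, y)
print("mean test AUC:", aucs.mean())
print("genus index ranked most important:", int(np.argmax(importance)))
```

Permutation importance is computed here on the held-out samples rather than the training samples, so a genus scores highly only if shuffling it degrades predictions on data the model has not seen.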