Asthma AI modeling and feature importance analysis
Microbiome next-generation sequencing (NGS) data were converted to relative composition at the genus level to standardize the input data. A total of 2,035 features were detected across all samples at the genus level. Two generalized linear models (GLMs) were developed using the t-test and linear discriminant analysis effect size (LEfSe) methods for feature selection. Two algorithms were used for AI modeling. The first was a gradient boosting machine (GBM), which was applied to the tabulated microbiome composition data using the machine learning modules in the Python scikit-learn package. The second was an artificial neural network (ANN), which used regularized 5-layer neural networks and was implemented with the TensorFlow ecosystem in Python [15]. The ensemble algorithm was based on the average of the values obtained from each prediction model. To validate the performance of the models, the data were split into training and test sets 10 times before training. After each split, the models were trained on the training set and then validated on the test set. Feature importance was assessed in the tested models to identify the relatively important taxa. Permutation importance, as implemented in the scikit-learn package, was used to trace feature importance in the AI model [16].
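For illustration only, the workflow described above (relative composition, repeated train/test splitting, GBM training, permutation importance, and ensemble averaging) can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the study's actual code: the synthetic count matrix, the 30% test fraction, stratified splitting, the GBM hyperparameters, the use of AUC as the performance metric, and the helper names `to_relative_abundance`, `repeated_split_evaluation`, and `ensemble_average` are all hypothetical. The ANN is omitted; `ensemble_average` only shows how predicted probabilities from two models could be averaged.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def to_relative_abundance(counts):
    """Convert a samples x genera count matrix to per-sample relative composition."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty samples
    return counts / totals


def repeated_split_evaluation(X, y, n_splits=10, test_size=0.3, seed=0):
    """Split, train a GBM, and score it on the test set, repeated n_splits times.

    test_size, stratification, and hyperparameters are assumptions.
    """
    aucs, importances = [], []
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + i
        )
        model = GradientBoostingClassifier(n_estimators=50, random_state=seed + i)
        model.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
        # Permutation importance: drop in test-set score when one feature is shuffled.
        result = permutation_importance(
            model, X_te, y_te, n_repeats=5, random_state=seed + i
        )
        importances.append(result.importances_mean)
    return np.array(aucs), np.mean(importances, axis=0)


def ensemble_average(*probabilities):
    """Average predicted probabilities from several models (e.g. GBM and ANN)."""
    return np.mean(np.vstack(probabilities), axis=0)


# Synthetic stand-in data: 120 samples, 8 genera, genus 0 enriched in cases.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20, size=(120, 8))
y = np.repeat([0, 1], 60)
counts[y == 1, 0] += 40
X = to_relative_abundance(counts)

aucs, mean_importance = repeated_split_evaluation(X, y)
ranked_genera = np.argsort(mean_importance)[::-1]  # most important genus first
```

Computing permutation importance on the held-out test set, rather than on the training set, is a design choice in this sketch; it measures how much each genus contributes to generalization rather than to fitting.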