The human gut microbiome is increasingly used as a non-invasive biomarker for disease screening and diagnosis, and as a target for disease treatment and intervention. This article reviews three papers recently published in Gut Microbes that apply machine learning to gut microbiome research. The first paper[1] shows, through cross-cohort validation of machine learning classifiers, how well the gut microbiome can diagnose multiple diseases. The second paper[2] found that microbial genes outperform species and single nucleotide variants (SNVs) in diagnosing Crohn’s disease, and that combining them with artificial intelligence achieves efficient diagnosis across data from multiple cohorts. The third paper[3] identified distinct microbial composition characteristics that precede the clinical diagnosis of liver cancer and can be used for early diagnosis. Together, these studies indicate that machine learning has broad application prospects in studying the relationship between microorganisms and disease.
1. The Gut Microbiome as an Independent Diagnostic Tool for 20 Diseases: Machine Learning Classifier and Cross-Cohort Validation[1]
For disease diagnosis, machine learning (ML) classifiers are often trained on gut microbiome features, alone or in combination with clinically relevant features, to distinguish patients from controls. These models are typically validated on held-out samples from the same cohort (i.e., intra-cohort validation), with predictive performance measured by metrics such as the area under the receiver operating characteristic curve (AUC). However, the cross-cohort reproducibility of gut microbiome variations has not yet been systematically evaluated across all available datasets, and the factors that affect reproducibility (i.e., its determinants) remain to be explored. There is therefore an urgent need to test the cross-cohort reproducibility of the gut microbiome as a diagnostic tool, as well as the cross-cohort consistency of disease biomarkers. To this end, the study systematically assessed the cross-cohort performance of gut microbiome-based machine learning classifiers in distinguishing 20 diseases.
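Because AUC is the performance measure used throughout the three papers, a minimal sketch of what it computes may help. The function below uses the rank interpretation of AUC: the probability that a randomly chosen patient receives a higher classifier score than a randomly chosen control, with ties counted as one half. The function name and the four sample scores are hypothetical and are not taken from any of the papers.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for patient, 0 for control.
    scores: classifier scores, higher meaning "more likely a patient".
    Ties between a case and a control count as half a correct pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Hypothetical held-out samples (illustrative values only).
held_out_labels = [0, 0, 1, 1]
held_out_scores = [0.10, 0.40, 0.35, 0.80]
print(auc_score(held_out_labels, held_out_scores))  # prints 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. In intra-cohort validation the held-out samples come from the same cohort as the training data, whereas in cross-cohort validation they come from a different one.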
1. Analysis Workflow Overview
As shown in Figure 1a, 134 diseases and 361 human gut microbiome samples (the control group included only healthy phenotypes) were collected from public databases. After screening, 69 eligible projects covering 20 diseases remained. Projects with the same disease and data type were then modeled and externally cross-validated using methods that depended on the number of cohorts (n) available for the disease. First, all diseases with n≥2 were analyzed through intra-cohort modeling (i.e., building a classifier from a single cohort). Second, only diseases with n≥3 underwent leave-one-dataset-out (LODO) analysis, one of the combined-cohort modeling methods. Third, only diseases with n≥5 were analyzed with cohort-cumulative models (CCM) and sample-cumulative models (SCM), which are also combined-cohort modeling methods. As shown in Figure 1b, these diseases span five major disease categories, with each disease having two or more cohorts. Figure 1c presents the density plot of the number of samples in each cohort. The median sample sizes for cases and controls were 48 and 47, respectively.
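The workflow above can be sketched as two small functions, assuming that n denotes the number of cohorts available for a disease and that LODO holds out one whole cohort at a time for testing. The function names and cohort labels are hypothetical, and no classifier is trained here; the sketch only shows which analyses apply and how the train/test cohorts are split.

```python
def eligible_analyses(n_cohorts):
    """Return the analyses a disease qualifies for, given its cohort count."""
    analyses = []
    if n_cohorts >= 2:
        analyses.append("intra-cohort")
    if n_cohorts >= 3:
        analyses.append("LODO")
    if n_cohorts >= 5:
        # Cohort-cumulative and sample-cumulative models
        analyses.extend(["CCM", "SCM"])
    return analyses


def lodo_splits(cohort_names):
    """Yield (training cohorts, held-out cohort), one split per cohort."""
    for held_out in cohort_names:
        training = [name for name in cohort_names if name != held_out]
        yield training, held_out


print(eligible_analyses(4))  # prints ['intra-cohort', 'LODO']
for training, held_out in lodo_splits(["cohort_A", "cohort_B", "cohort_C"]):
    print("train on", training, "-> test on", held_out)
```

Holding out an entire cohort, rather than a random subset of samples, means the test data differ from the training data in population, sequencing batch, and protocol, which is what makes the estimate a cross-cohort one.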
(1) With single-cohort classifiers, intra-cohort validation achieved high predictive accuracy (~0.77 AUC), but cross-cohort validation accuracy was low for all diseases except gut diseases (~0.73 AUC).
(2) To improve validation performance for non-gut diseases, combined-cohort classifiers trained on multiple cohorts were built, and the sample size required to achieve a validation accuracy above 0.7 was estimated.
(3) For gut diseases, models built on metagenomic data performed better than those built on 16S amplicon data.
(4) A similarity index was further used to quantify cross-cohort biomarker consistency, and similar trends were observed.
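To make the idea of biomarker consistency in point (4) concrete, here is a minimal sketch using the Jaccard index as a stand-in. The text above does not say which similarity index the paper used, so this is an assumption for illustration only, and the marker names are placeholders rather than taxa reported in the paper.

```python
def jaccard_similarity(markers_a, markers_b):
    """Share of markers common to both cohorts, out of all markers in either."""
    set_a, set_b = set(markers_a), set(markers_b)
    if not set_a and not set_b:
        return 0.0  # no markers in either cohort: treat as no overlap
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder marker lists from two hypothetical cohorts of the same disease.
cohort_1_markers = ["taxon_1", "taxon_2", "taxon_3"]
cohort_2_markers = ["taxon_2", "taxon_3", "taxon_4"]
print(jaccard_similarity(cohort_1_markers, cohort_2_markers))  # prints 0.5
```

A value near 1 means two cohorts point to nearly the same markers, while a value near 0 means the markers found in one cohort are rarely recovered in the other.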
In summary, the results indicate that the gut microbiome can serve as an independent diagnostic tool for gut diseases, and they reveal strategies for improving cross-cohort performance. The study conducted a comprehensive meta-analysis of 20 diseases using 83 case-control cohorts totaling 9708 samples. It applied state-of-the-art tools for intra-cohort and combined-cohort modeling and predictive validation, assessed the factors affecting predictive accuracy, and recommended strategies for improving cross-cohort validation performance, in support of using gut microbiota-derived classifiers as disease prediction tools.
Figure 1 Experimental design, dataset information, and intra-cohort validation results of Paper 1
2. Microbial Genes Outperform Species and SNVs as Diagnostic Markers for Crohn’s Disease on AI-Powered Multi-Cohort Fecal Metagenomes[2]
Crohn’s disease (CD) is one of the two major forms of inflammatory bowel disease (IBD). Currently, the diagnosis of CD relies mainly on a combined assessment of endoscopic, imaging, and pathological findings. However, the diagnostic capacity of endoscopy is often limited by patient compliance, bowel preparation quality, and other uncontrollable factors. Serum and fecal biomarkers, such as C-reactive protein and fecal calprotectin, have been used as indicators of IBD inflammatory activity, but their accuracy and specificity are not satisfactory. Recently, microbial markers have emerged as potential diagnostic markers for IBD. Notably, species abundance may not accurately represent microbial function, and the naming of many gut microbial species is still being revised. For this reason, the diagnostic value of microbial genes and their polymorphisms has become a research hotspot (Figure 2b). Comprehensive studies of the multi-dimensional characteristics of CD at the species, gene, and SNV levels are still lacking, although they are needed in clinical practice.
In this study, 1418 whole metagenome sequencing (WMS) samples from 8 cohorts were collected (Figure 2a), including 870 CD patients and 548 healthy controls. The analysis was performed at three levels (species, genes, and microbial SNVs; Figure 2b), and diagnostic models for CD were built and systematically evaluated. The overall workflow was as follows. First, microbial changes were identified and the differential multi-dimensional features of the gut microbiome were characterized. Next, diagnostic models were constructed, and the best model was selected based on performance in internal and external validation. Finally, the disease specificity of the models was assessed and the models were interpreted to determine gut microbial markers, which were then validated by qRT-PCR. In total, 227 species, 1047 microbial genes, and 21877 microbial SNVs differed significantly between CD patients and controls. The average AUCs of the species, gene, and SNV models were 0.97, 0.95, and 0.77, respectively (Figure 3). Notably, the gene model demonstrated superior diagnostic capability, with average AUCs of 0.89 and 0.91 in internal and external validation, respectively. In summary, these results reveal the multi-dimensional changes in the microbial communities of CD patients and provide universal and reliable biomarkers for CD diagnosis.
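The model selection step described above, choosing among feature levels by internal and external validation performance, can be sketched as follows. The ranking rule (order feature levels by the weaker of their two average AUCs) is an assumption made for illustration, not the criterion stated in the paper, and the level names and AUC values are placeholders rather than reported results.

```python
from statistics import mean


def rank_feature_levels(aucs):
    """Rank feature levels by the weaker of their average internal/external AUC.

    aucs: {level_name: {"internal": [auc, ...], "external": [auc, ...]}}
    Returns level names, best first.
    """
    averages = {
        level: (mean(parts["internal"]), mean(parts["external"]))
        for level, parts in aucs.items()
    }
    # A model is only as dependable as its weaker validation setting.
    return sorted(averages, key=lambda level: min(averages[level]), reverse=True)


# Placeholder AUCs for two hypothetical feature levels (not from the paper).
placeholder_aucs = {
    "level_A": {"internal": [0.90, 0.80], "external": [0.60, 0.70]},
    "level_B": {"internal": [0.85, 0.85], "external": [0.80, 0.90]},
}
print(rank_feature_levels(placeholder_aucs))  # prints ['level_B', 'level_A']
```

Ranking on the weaker setting favors a model that holds up on unseen cohorts over one that only scores well on data resembling its training set.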
Figure 2 Experimental design of Paper 2
Figure 3 Performance of diagnostic models built with multi-dimensional features
3. Distinct Microbial Composition Characteristics Prior to the Clinical Diagnosis of Hepatocellular Carcinoma Can Be Used for Early Diagnosis[3]
Hepatocellular carcinoma (HCC) is one of the deadliest malignant tumors, with incidence and mortality rising globally and more than 700,000 deaths annually. Because accurate predictive biomarkers are lacking, a significant proportion of patients are already at an advanced stage when diagnosed and have missed the optimal window for surgical resection. Currently, HCC is diagnosed through serum alpha-fetoprotein (AFP) levels and imaging examinations. However, although AFP is the only standard predictive factor for HCC, its levels can also be elevated in other conditions, such as active hepatitis, germ cell tumors, secondary liver cancer, and pregnancy, which limits its specificity. There is therefore an urgent need for new, highly accurate biomarkers to improve life expectancy. HCC has a complex etiology and is often accompanied by significant changes in the microbiome. The oral, gut, and tumor microbiomes are considered important regulatory factors in the occurrence and development of gastrointestinal malignancies, yet few studies have examined the presence of resident microbes in different body regions and the associations among them. This study aims to reveal the characteristics of the “oral-gut-tumor microbiome” and its diagnostic performance in HCC.
1. The study includes two cohorts, as shown in Figure 4:
(1) The retrospective discovery cohort includes 364 HBV-HCC patients and 160 controls, from whom oral and fecal samples were collected;
(2) The prospective validation cohort includes 91 HCC patients and 124 controls, as well as 48 HBV patients and 39 patients with hepatitis B virus cirrhosis.
2. Machine Learning Results, as shown in Figure 5:
(1) A random forest analysis built on the retrospective cohort and validated in the prospective cohort identified 10 oral genera and 9 gut genera that can distinguish HCC from controls, with AUC values of 0.7971 and 0.8084, respectively.
(2) When the influential taxa were combined, the AUC of the classifier increased to 0.9405.
(3) When serum alpha-fetoprotein (AFP) levels were added, the AUC improved further to 0.9811.
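Combining oral genera, gut genera, and AFP amounts to giving the classifier one joined feature table per sample. The sketch below shows only that joining step under stated assumptions: the function name, block names, sample IDs, and values are all hypothetical, and the random forest itself is omitted.

```python
def combine_features(blocks):
    """Join several feature blocks into one row per sample.

    blocks: {block_name: {sample_id: {feature_name: value}}}
    Only samples present in every block are kept, and feature names are
    prefixed with their block name so they cannot collide.
    """
    shared_ids = None
    for table in blocks.values():
        ids = set(table)
        shared_ids = ids if shared_ids is None else shared_ids & ids
    combined = {}
    for sample_id in sorted(shared_ids or []):
        row = {}
        for block_name, table in blocks.items():
            for feature, value in table[sample_id].items():
                row[f"{block_name}:{feature}"] = value
        combined[sample_id] = row
    return combined


# Placeholder measurements for two hypothetical samples.
example_blocks = {
    "oral": {"s1": {"genus_x": 0.2}, "s2": {"genus_x": 0.1}},
    "gut": {"s1": {"genus_y": 0.3}},
    "serum": {"s1": {"AFP": 12.0}, "s2": {"AFP": 3.0}},
}
print(combine_features(example_blocks))
```

Sample s2 is dropped here because it has no gut measurement. Restricting to samples with every data type is one simple design choice; imputing the missing block would be another.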
(1) Specifically, microbial biomarkers represented by streptococci exhibit a continuously increasing trend during the disease transition.
(2) Furthermore, several dominant microbial communities were confirmed in liver tumor and non-tumor tissues through fluorescence in situ hybridization (FISH) and 5R 16S rRNA gene sequencing.
Overall, this study of the oral-gut-tumor microbiome provides a reliable method for the early detection of HCC. The study prospectively evaluated 91 newly diagnosed HCC patients and 124 controls. Oral and fecal samples, together with matched liver tumor and normal tissues, were collected for microbial composition analysis by 16S rRNA gene amplicon sequencing, IHC, and FISH. Based on a set of 19 genera (10 oral and 9 gut), a distinct microbial signature with high predictive accuracy was identified. More importantly, the species enriched in tumor tissues were studied in relation to the oral and gut microbiomes, revealing their potential role in the pathogenesis of HCC.
Figure 4 Experimental design of Paper 3
Figure 5 Classification model established from retrospective data, validated in the prospective cohort to distinguish HCC from controls
In recent years, the use of gut microbiota in disease prediction has attracted growing attention and shows broad prospects for early diagnosis and treatment. To reflect the latest research in this field, the editor has compiled these recently published papers, which present current applications of gut microbiota as potential disease biomarkers in terms of both breadth and depth. We hope this overview is helpful.
[1] Li M, Liu J, Zhu J, Wang H, Sun C, Gao NL, Zhao XM, Chen WH. Performance of Gut Microbiome as an Independent Diagnostic Tool for 20 Diseases: Cross-Cohort Validation of Machine-Learning Classifiers. Gut Microbes. 2023 Jan-Dec;15(1):2205386. doi: 10.1080/19490976.2023.2205386. PMID: 37140125; PMCID: PMC10161951.
[2] Gao S, Gao X, Zhu R, Wu D, Feng Z, Jiao N, Sun R, Gao W, He Q, Liu Z, Zhu L. Microbial genes outperform species and SNVs as diagnostic markers for Crohn’s disease on multicohort fecal metagenomes empowered by artificial intelligence. Gut Microbes. 2023 Jan-Dec;15(1):2221428. doi: 10.1080/19490976.2023.2221428. PMID: 37278203; PMCID: PMC10246480.
[3] Yang J, He Q, Lu F, Chen K, Ni Z, Wang H, Zhou C, Zhang Y, Chen B, Bo Z, Li J, Yu H, Wang Y, Chen G. A distinct microbiota signature precedes the clinical diagnosis of hepatocellular carcinoma. Gut Microbes. 2023 Jan-Dec;15(1):2201159. doi: 10.1080/19490976.2023.2201159. PMID: 37089022; PMCID: PMC10128432.