Title:
Explainable machine learning for early diagnosis of esophageal cancer: A feature-enriched Light Gradient Boosting Machine framework with Shapley Additive Explanations and Local Interpretable Model-Agnostic Explanations interpretations.
Authors:
Ridwan AM; Department of Computer Science and Engineering, Southeast University, Bangladesh., Mohi Uddin KM; Department of Computer Science and Engineering, Southeast University, Bangladesh.
Source:
The Journal of international medical research [J Int Med Res] 2026 Jan; Vol. 54 (1), pp. 3000605251411752. Date of Electronic Publication: 2026 Jan 23.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Sage Publications Country of Publication: England NLM ID: 0346411 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1473-2300 (Electronic) Linking ISSN: 03000605 NLM ISO Abbreviation: J Int Med Res Subsets: MEDLINE
Imprint Name(s):
Publication: Nov. 2012- : London : Sage Publications
Original Publication: Northampton, Eng., Cambridge Medical Publications ltd.
Contributed Indexing:
Keywords: Esophageal cancer; LightGBM; Local Interpretable Model-Agnostic Explanations; Shapley Additive Explanations; feature selection; machine learning
Entry Date(s):
Date Created: 20260123 Date Completed: 20260123 Latest Revision: 20260123
Update Code:
20260123
DOI:
10.1177/03000605251411752
PMID:
41575322
Database:
MEDLINE

Further Information

Objective: Esophageal cancer is among the most rapidly spreading malignancies worldwide. Early detection of esophageal cancer is critical for disease prevention and for improving overall population health. Most studies have used statistical methodologies to assess the esophageal cancer risk, and only a few studies have used prediction models.

Methods: The esophageal cancer dataset, comprising 3985 patient records with 85 demographic, pathological, and follow-up features, was obtained from Kaggle. A comprehensive data-engineering pipeline was implemented, including the removal of null and low-variance features, elimination of identifier variables to prevent data leakage, mode-based imputation, label encoding, and data standardization. Feature relevance was assessed using Mutual Information, and the top 31 clinically meaningful features were retained for model development. Six machine learning classifiers were trained and evaluated: Support Vector Machine, Gaussian Naïve Bayes, k-nearest neighbors, AdaBoost, Multilayer Perceptron, and LightGBM (Gradient Boosting Machine). A stratified 10-fold cross-validation was applied to maintain class balance, and GridSearchCV was used for hyperparameter optimization. Model interpretability was assessed using Shapley Additive Explanations (SHAP) for global and local feature attribution and Local Interpretable Model-Agnostic Explanations (LIME) for instance-level explanations. Furthermore, the top features identified by SHAP and LIME were used to retrain the LightGBM model to evaluate performance under reduced dimensionality.

Results: Among all evaluated classifiers, LightGBM exhibited the highest and most stable performance, achieving an accuracy of 99.87% prior to hyperparameter tuning and 99.74% following stratified cross-validated tuning, with near-perfect precision, recall, F1-score, and area under the curve values. Explainability analyses indicated that clinically relevant variables, including tumor staging, smoking-related factors, and follow-up indicators, played a significant role in model predictions. The SHAP-selected top-20 feature model maintained high predictive performance (99.76%), demonstrating that the classifier remained robust despite dimensionality reduction.

Conclusions: The proposed LightGBM-based model demonstrates exceptional predictive accuracy and strong interpretability, suggesting its potential utility for the early detection of esophageal cancer using machine learning approaches.
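
Illustrative note: the abstract names mode-based imputation, Mutual Information feature ranking, and stratified k-fold cross-validation but gives no implementation details. The following is a minimal sketch of those three steps using only the Python standard library. It is not the authors' code. The function names (`mode_impute`, `mutual_information`, `top_k_features`, `stratified_folds`) and the ten-row toy table are hypothetical and invented purely for illustration, and the sketch assumes discrete (label-encoded) features.

```python
import math
from collections import Counter, defaultdict


def mode_impute(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_k_features(table, labels, k):
    """Rank feature columns by mutual information with the label; keep k."""
    scores = {name: mutual_information(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def stratified_folds(labels, n_folds):
    """Assign row indices to folds so each fold keeps similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal each class's rows round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy data (invented, not from the study): 5 positive and 5 negative records.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
table = {
    "stage": mode_impute([3, 3, None, 3, 3, 1, 1, 2, 2, 1]),
    "smoker": mode_impute([1, 1, 1, 0, 1, 0, 0, 1, 0, 0]),
    "site": mode_impute([0, 1, 0, 1, 0, 0, 1, 0, 1, 0]),
}

selected = top_k_features(table, labels, k=2)
folds = stratified_folds(labels, n_folds=5)
print(selected)
print(folds)
```

In this toy table, "stage" determines the label completely and "site" is statistically independent of it, so the ranking keeps "stage" and "smoker". In practice, library routines such as scikit-learn's `mutual_info_classif` and `StratifiedKFold` would typically be used instead; the abstract mentions `GridSearchCV`, which suggests (but does not confirm) a scikit-learn workflow.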