基于机器学习算法的菌阴肺结核与细菌性肺炎鉴别诊断模型构建和验证

Development and validation of a machine learning-based model for discriminating between culture-negative pulmonary tuberculosis and bacterial pneumonia

  • 摘要:
    目的 通过整合多维度临床数据,构建基于机器学习算法的菌阴肺结核与细菌性肺炎鉴别诊断模型,为临床提供客观、可靠的辅助诊断工具。
    方法 选择2023年12月-2024年12月于天津市海河医院就诊的400例患者(菌阴肺结核280例,细菌性肺炎120例)。系统收集人口学特征、临床症状、影像学表现及实验室指标。通过单因素分析筛选潜在预测变量,采用多因素logistic回归确定独立预测因子。基于筛选结果构建四种机器学习模型(logistic回归、随机森林、XGBoost和支持向量机),采用5折交叉验证优化超参数,按7∶3比例划分训练集和验证集。通过准确率、灵敏度、特异度和曲线下面积(AUC)评估判别性能,并采用校准曲线和预测概率分布分析验证模型的校准度和稳定性。
    结果 单因素分析和多因素回归确定了六个独立预测因子。其中,结核感染T细胞斑点试验(T-SPOT.TB)阳性(OR=86.974)、上叶受累(OR=48.462)、空洞形成(OR=7.271)、体质量减轻(OR=7.389)是菌阴肺结核的危险因素;而PCT水平升高(OR=0.007)及咳脓痰(OR=0.056)则是细菌性肺炎的预测因素。基于这些因子构建的XGBoost模型在验证集上表现最优,AUC达到1.000,准确率为99.22%,敏感度为99.01%,特异度为99.32%,并且显示出良好的校准度和泛化能力。
    结论 本研究构建的基于T-SPOT.TB、上叶受累、PCT水平升高、空洞形成、体质量减轻及咳脓痰的机器学习模型(尤其是XGBoost模型),在菌阴肺结核与细菌性肺炎的鉴别诊断中展现了极强的判别性能和稳健的泛化能力,可为临床提供可靠的辅助诊断工具。

     

    Abstract:
    OBJECTIVE  To develop a machine learning-based model that integrates multi-dimensional clinical data to discriminate between culture-negative pulmonary tuberculosis and bacterial pneumonia, thereby providing an objective and reliable auxiliary diagnostic tool for clinical practice.
    METHODS  A total of 400 patients (280 with culture-negative pulmonary tuberculosis and 120 with bacterial pneumonia) treated at Tianjin Haihe Hospital from December 2023 to December 2024 were enrolled. Demographic characteristics, clinical symptoms, imaging findings and laboratory indicators were systematically collected. Candidate predictors were screened by univariate analysis, and independent predictors were identified by multivariate logistic regression. Four machine learning models (logistic regression, random forest, XGBoost and support vector machine) were then built on the selected predictors. Hyperparameters were optimized by 5-fold cross-validation, and the dataset was split into a training set and a validation set at a ratio of 7:3. Discriminative performance was evaluated by accuracy, sensitivity, specificity and area under the curve (AUC). Calibration and stability were assessed with calibration curves and the distribution of predicted probabilities.
    RESULTS  Univariate analysis followed by multivariate logistic regression identified six independent predictors. A positive T-cell spot test for tuberculosis infection (T-SPOT.TB) (OR=86.974), upper lobe involvement (OR=48.462), cavity formation (OR=7.271) and weight loss (OR=7.389) were risk factors for culture-negative pulmonary tuberculosis, whereas an elevated procalcitonin (PCT) level (OR=0.007) and purulent sputum (OR=0.056) were predictive of bacterial pneumonia. The XGBoost model built on these predictors performed best on the validation set, with an AUC of 1.000, an accuracy of 99.22%, a sensitivity of 99.01% and a specificity of 99.32%. It also showed good calibration and generalization ability.
    CONCLUSIONS  The machine learning models developed in this study, in particular the XGBoost model, are based on T-SPOT.TB, upper lobe involvement, elevated PCT level, cavity formation, weight loss and purulent sputum. They showed excellent discriminative performance and robust generalization ability in differentiating culture-negative pulmonary tuberculosis from bacterial pneumonia, and may serve as a reliable auxiliary diagnostic tool for clinicians.
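
The evaluation steps named in the Methods (a 7:3 split, then accuracy, sensitivity, specificity and AUC) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code. The function names, the 0.5 decision threshold, the random seed, the stratification by class and the label coding (1 = culture-negative pulmonary tuberculosis, 0 = bacterial pneumonia) are all assumptions made for illustration, and the toy probabilities are invented, not study data.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/validation sets, class by class (assumed 7:3)."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train += idx[:cut]
        valid += idx[cut:]
    return sorted(train), sorted(valid)


def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity and specificity at a fixed probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }


def auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Cohort sizes from the abstract: 280 tuberculosis (1), 120 pneumonia (0).
labels = [1] * 280 + [0] * 120
train_idx, valid_idx = stratified_split(labels)

# Invented toy predictions, used only to exercise the metric functions.
toy_true = [1, 1, 1, 0, 0]
toy_prob = [0.9, 0.8, 0.4, 0.3, 0.6]
toy_metrics = confusion_metrics(toy_true, toy_prob)
toy_auc = auc(toy_true, toy_prob)
```

With class-wise splitting, the 400-patient cohort yields 280 training and 120 validation cases. Because AUC depends only on the ranking of predicted probabilities while accuracy, sensitivity and specificity depend on the chosen threshold, the two kinds of metric can differ for the same model.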

     
