Title: Comparing traditional machine learning algorithms with a transformer-based model (TabPFN) for the prediction of health outcomes
This project compares TabPFN, a pre-trained transformer for tabular data, to traditional machine learning models (LightGBM, glmnet) in predicting tuberculosis (TB) status from host analytes.
We evaluate performance using:
- ROC curves and AUC
- Accuracy, Sensitivity, Specificity, Balanced Accuracy
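These metrics can be computed with scikit-learn; a minimal sketch on illustrative labels and predicted probabilities (the values below are made up, not project data):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, roc_auc_score)

# Hypothetical true labels and model probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.7, 0.3, 0.1, 0.9, 0.6])
y_pred = (y_prob >= 0.5).astype(int)           # threshold at 0.5

auc = roc_auc_score(y_true, y_prob)            # → 0.875
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # recall for the TB-positive class
specificity = tn / (tn + fp)                   # recall for the TB-negative class
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # (sensitivity + specificity) / 2 → 0.75
```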
The comparison is performed for:
- All 22 analytes
- Top 3 analytes selected using information gain
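A sketch of information-gain-based selection, using scikit-learn's `mutual_info_classif` on synthetic stand-in data (the real analyte matrix is not public, so shapes and signal structure here are assumptions):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 patients x 22 analyte columns, binary TB status
X = rng.normal(size=(200, 22))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

mi = mutual_info_classif(X, y, random_state=0)  # information-gain score per analyte
top3 = np.argsort(mi)[::-1][:3]                 # indices of the 3 most informative analytes
X_top3 = X[:, top3]                             # reduced feature matrix
```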
Dataset details:
- Concentrations of TB biomarkers measured using the Luminex assay
- Clinical dataset (patient-level data)
- Binary classification: patient is TB positive or TB negative
- ML algorithms: Elastic Net Logistic Regression (glmnet), LightGBM, TabPFN (transformer-based)
- Purpose: Compare performance of traditional ML vs transformer-based models
- Notes: Clinical dataset not shared publicly; code can run on synthetic or similar datasets
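Since the clinical dataset is not shared, a similar-shaped synthetic dataset can be fabricated to exercise the code; a minimal sketch (column names, sample size, and the lognormal concentration model are all placeholders, not the real panel):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
# Stand-in for the Luminex panel: 22 analyte concentration columns
analytes = [f"analyte_{i + 1:02d}" for i in range(22)]
X = pd.DataFrame(rng.lognormal(mean=2.0, sigma=0.8, size=(n, 22)), columns=analytes)

# Binary TB status loosely driven by two analytes, purely to test the pipeline
score = np.log(X["analyte_01"]) + 0.5 * np.log(X["analyte_02"])
y = (score + rng.normal(scale=0.5, size=n) > score.mean()).astype(int)
df = X.assign(tb_status=y)
```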
Objective: Compare TabPFN with traditional ML models in predicting TB status from host analytes, using ROC curves, AUC, and balanced accuracy as the primary performance metrics.
Workflow:
- Load the TB dataset and analyte information
- Split the data into training and test sets
- Preprocess the data:
  - Scale features
  - Apply SMOTE to balance classes
- Select features for analysis:
  - All 22 analytes
  - Top 3 analytes (selected by information gain)
- Train traditional ML models on the training data:
  - LightGBM
  - glmnet
  - rpart (optional)
- Tune hyperparameters using nested cross-validation
- Prepare TabPFN inputs from the training/test sets
- Train the TabPFN classifier on the same training data
- Generate predictions and class probabilities for all models
- Evaluate performance:
  - ROC curves and AUC
  - Accuracy, Sensitivity, Specificity, Balanced Accuracy
- Compare TabPFN with the traditional ML models:
  - Plot ROC curves together
  - Summarize performance metrics
- Save processed datasets, model objects, and prediction outputs
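The workflow above can be sketched end to end in Python on synthetic data. This is a minimal stand-in, not the project's implementation: scikit-learn's elastic-net logistic regression substitutes for glmnet, imports of imbalanced-learn's SMOTE and the `tabpfn` package are guarded in case they are not installed, and LightGBM plus nested cross-validation are omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 22))                      # stand-in for the 22 analytes
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Fit scaling on the training split only, to avoid leakage into the test set
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# SMOTE oversampling via imbalanced-learn, if available; otherwise use the raw split
try:
    from imblearn.over_sampling import SMOTE
    X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_tr_s, y_tr)
except ImportError:
    X_bal, y_bal = X_tr_s, y_tr

# Elastic-net logistic regression as a glmnet-style baseline
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X_bal, y_bal)
enet_auc = roc_auc_score(y_te, enet.predict_proba(X_te_s)[:, 1])

# TabPFN, if installed: a pre-trained transformer that needs only fit/predict_proba
try:
    from tabpfn import TabPFNClassifier
    tab = TabPFNClassifier()
    tab.fit(X_tr, y_tr)
    tab_auc = roc_auc_score(y_te, tab.predict_proba(X_te)[:, 1])
    print(f"TabPFN AUC: {tab_auc:.3f}")
except ImportError:
    pass

print(f"glmnet-style AUC: {enet_auc:.3f}")
```

Note that TabPFN is fit on the unscaled, unresampled training data: the model handles tabular preprocessing internally, which is part of its appeal in this comparison.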