A research project exploring automated generation of chest X-ray diagnostic reports using various Vision-Language Models (VLMs) and CNN backbones.
.
├── notebooks/ # Jupyter notebooks
│ ├── finetuning/ # Fine-tuning experiments
│ ├── inference/ # Inference scripts
│ └── data_generation/ # Dataset generation
├── models/ # Trained CNN models
│ ├── ResNet50 - Chest XRay_
│ ├── EfficientNet - Chest XRay_
│ └── ... (other CNN models)
├── data/ # Datasets
│ ├── raw/ # Raw data files
│ └── processed/ # Processed datasets
├── src/ # Source code
│ └── augmentation.py # Data augmentation utilities
├── docs/ # Documentation
│ └── paper_draft/ # Research paper
└── README.md
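As a quick sanity check after cloning, the layout above can be verified with a few lines of standard-library Python. This is only a sketch: the helper name `missing_dirs` and the `EXPECTED_DIRS` list are illustrative and not part of the repository, and the truncated model folder names are deliberately left out.

```python
from pathlib import Path

# Directories taken from the tree above (hypothetical helper, not in the repo).
EXPECTED_DIRS = [
    "notebooks/finetuning",
    "notebooks/inference",
    "notebooks/data_generation",
    "models",
    "data/raw",
    "data/processed",
    "src",
    "docs/paper_draft",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project layout looks complete.")
```

Run it from the repository root; an empty result means every listed directory is present.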
CNN backbones:
- ResNet50 - 50-layer residual network
- EfficientNet - Compound-scaled convolutional network
- VGG16 - Classic 16-layer network
- InceptionV3 - Inception module architecture
- MobileNet - Lightweight mobile-optimized network
Vision-language models:
- Llama 3.2 (11B) - Meta's multimodal Llama model (the 11B size accepts image input)
- Qwen3-VL-8B-Instruct - Alibaba's vision-language model
- Ministral-3-3B-Instruct - Mistral AI's compact VLM
Requirements:
- Python 3.8+
- CUDA-capable GPU (recommended)
- Google Colab (for notebook execution)
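Before opening the notebooks, it can help to confirm that the environment meets these requirements. The snippet below is a minimal sketch using only the standard library. It assumes that a usable CUDA GPU comes with the `nvidia-smi` tool on `PATH`, which is typical but not guaranteed, and the function name `check_environment` is illustrative.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the requirements list


def check_environment(version_info=sys.version_info, which=shutil.which):
    """Report whether the basic requirements appear to be satisfied."""
    return {
        "python_ok": tuple(version_info[:2]) >= MIN_PYTHON,
        # Assumption: nvidia-smi is on PATH whenever a CUDA GPU is usable.
        "nvidia_smi_found": which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

On a Colab runtime without a GPU attached, `nvidia_smi_found` is expected to be `False`; the notebooks may still run, only more slowly.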
- Fine-tuning a model: Open the notebooks in notebooks/finetuning/ and run the cells sequentially
- Running inference: Use the notebooks in notebooks/inference/ with your trained models
- Fine-tuning dataset: 1300+ chest X-ray images with corresponding reports
- Augmented dataset for improved generalization
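To illustrate what augmentation can look like for grayscale X-ray images, here is a small pure-Python sketch. It is not the code in src/augmentation.py, whose contents are not described here. The function names, the 0 to 255 pixel range, and the choice of transforms (mild brightness jitter and a small horizontal shift) are all assumptions made for the example.

```python
import random


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping to the assumed 0..255 range."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]


def shift_horizontal(image, offset, fill=0):
    """Shift each row by offset pixels (positive = right), padding with fill."""
    width = len(image[0])
    offset = max(-width, min(width, offset))  # keep the shift within the image
    shifted = []
    for row in image:
        if offset >= 0:
            shifted.append([fill] * offset + row[: width - offset])
        else:
            shifted.append(row[-offset:] + [fill] * (-offset))
    return shifted


def augment(image, rng, max_delta=20, max_shift=2):
    """Apply a random brightness change followed by a random small shift."""
    delta = rng.randint(-max_delta, max_delta)
    offset = rng.randint(-max_shift, max_shift)
    return shift_horizontal(adjust_brightness(image, delta), offset)


if __name__ == "__main__":
    sample = [[10, 20, 30], [40, 50, 60]]  # toy 2x3 "image"
    print(augment(sample, random.Random(0)))
```

Images are represented as nested lists to keep the example dependency-free; a real pipeline would more likely operate on arrays or tensors. Horizontal flips are left out on purpose, since mirroring a chest X-ray swaps left and right anatomy.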
This project compares different backbone-VLM combinations to determine optimal architectures for medical report generation. Detailed results are available in the docs folder.
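The comparison space can be enumerated as a simple grid of backbone and VLM pairs. The sketch below assumes every backbone is paired with every VLM, which may not match the experiments actually run; the function name `experiment_grid` is illustrative.

```python
from itertools import product

# Names taken from the model lists in this README.
CNN_BACKBONES = ["ResNet50", "EfficientNet", "VGG16", "InceptionV3", "MobileNet"]
VLMS = ["Llama 3.2 (11B)", "Qwen3-VL-8B-Instruct", "Ministral-3-3B-Instruct"]


def experiment_grid(backbones=CNN_BACKBONES, vlms=VLMS):
    """Return one configuration per backbone-VLM pair (assumed full cross product)."""
    return [{"backbone": b, "vlm": v} for b, v in product(backbones, vlms)]


if __name__ == "__main__":
    for config in experiment_grid():
        print(f"{config['backbone']} + {config['vlm']}")
```

With five backbones and three VLMs, the full grid contains 15 combinations.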
For academic/research purposes.