Code base used to obtain the results reported in this submission.
The code base follows this directory structure:
├── liposarcome
│ ├── dl <- Deep learning python classes and scripts
│ ├── configs <- Hydra configuration files for the deep learning training
│ ├── datasets <- Classes for datasets
│ ├── models <- Classes for models
│ ├── preprocessors <- Pipeline for preprocessing features
│ ├── trainers <- Implement fit and test
│ ├── utils <- Misc utils (data splits, normalizations, etc.)
│ ├── train.py <- Script executed for an ML training run. Implements the cross-validation (CV)
│ ├── cli <- CLI python scripts
│ ├── load_data <- Read the raw data and send parsed data
│ │ ├── bergo_parsing <- Formatting and checks for data from IB
│ │ ├── clb_parsing <- Formatting and checks for data from CLB
│ │ ├── cohort_dataframe_generator.py <- Functions issuing a df per center
│ │ └── debug_parsing <- Functions to generate random clinical data (debug)
│ └── paths
│   └── paths.py <- File to insert paths to use real data
├── .gitignore <- List of files/folders ignored by git
├── setup.py <- Package installation and dependencies
└── README.md
Install the dependencies
cd liposarcoma
# Create your conda environment
conda create -n liposarcome python=3.8
conda activate liposarcome
# install the repo
uv pip install -e . -i https://pypi.org/simple
Train a model with a chosen experiment configuration from
liposarcome/dl/configs/experiment/
for example:
liposarcome-dl experiment=debug_l_vs_s_clinical_sklearn
Any parameter can be overridden from the command line as follows, where experiment_name is one of the files in that folder (without the .yaml extension):
liposarcome-dl experiment=experiment_name trainer.max_epochs=20 dataset.batch_size=64
This project employs four feature extractors. To train such extractors, please see the following self-supervised learning methods:
- iBOT ViT Pancan: A Vision Transformer (ViT)-based feature extractor leveraging the iBOT framework (Zhou et al., 2021), trained on all available H&E-stained datasets from TCGA (Pan-cancer). Designed to capture broad, cross-cancer tissue representations.
- iBOT ViT COAD: A domain-specific variant of the iBOT ViT model, trained exclusively on the TCGA COAD (colon adenocarcinoma) dataset for enhanced representation of colon tissue.
- MoCo COAD: A feature extractor based on the Momentum Contrast (MoCo) framework (He et al., 2020), trained on TCGA COAD to learn colon-specific features via contrastive learning.
- MoCo Cond: A condition-aware MoCo COAD model (Zhou et al., 2022) that produces condition-sensitive tissue embeddings. In this approach, the feature extractor is trained with the Momentum Contrast (MoCo) framework, with a constraint applied during conditional sampling. This training aims to reduce batch effects by forcing the extractor to discriminate between different images of the same slide (e.g. two images of the same specimen with different stainings).
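The conditional sampling idea behind MoCo Cond can be illustrated with a small batch sampler. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `slide_conditioned_batches` and its arguments are hypothetical, and the sketch assumes that the condition is the slide identifier and that every batch is drawn from a single slide. How negatives are actually drawn in the original training (for instance through the MoCo queue) is not described here.

```python
import random
from collections import defaultdict


def slide_conditioned_batches(slide_ids, batch_size, seed=0):
    """Return batches of tile indices in which every tile shares one slide.

    Hypothetical helper, not taken from this code base: `slide_ids[i]` is the
    slide identifier of tile i. When a batch only contains tiles of one slide,
    slide-level cues (e.g. staining) cannot be used to tell its tiles apart.
    """
    rng = random.Random(seed)

    # Group tile indices by the slide they were cut from.
    tiles_by_slide = defaultdict(list)
    for tile_index, slide_id in enumerate(slide_ids):
        tiles_by_slide[slide_id].append(tile_index)

    # Split each slide's tiles into full batches; incomplete batches are dropped.
    batches = []
    for tile_indices in tiles_by_slide.values():
        rng.shuffle(tile_indices)
        for start in range(0, len(tile_indices) - batch_size + 1, batch_size):
            batches.append(tile_indices[start:start + batch_size])

    # Shuffle the batch order so that slides are interleaved across steps.
    rng.shuffle(batches)
    return batches
```

Each returned batch could then be fed to a contrastive training step, so that the tiles being compared within a batch always share the same slide.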
The code base contains four unit tests that can be run by calling pytest from the command line.
Of note, the test called test_deepmil.py contains a very basic instantiation of the
multimodal DeepMIL network, which is the main model investigated in this work. This test
serves as an example for anyone who wants to reuse only the neural network.
To use this code base with your own data, the data files need to be structured in a tree
identical to what was done for the mock data in tests/load_data/cohort_dataframe_generator/assets.
Based on this structure, paths need to be inserted in the corresponding constants in liposarcome/paths/paths.py.
Additionally, the CohortDataFrameGenerator class defined in liposarcome/load_data/cohort_dataframe_generator.py
will need to be updated by:
- removing the NotImplementedError raised for a cohort name different from "debug" (it is raised in two places),
- defining parsing functions for the clinical data to deliver a correctly formatted dataframe. Those functions are stored in liposarcome/load_data, where formatting examples for the IB and CLB centers are provided.
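The two changes above amount to routing a new cohort name to its own parsing function instead of raising an error. The sketch below illustrates that pattern only. It does not reproduce the actual CohortDataFrameGenerator class: all names (`PARSERS`, `parse_my_center`, `build_cohort_records`, the `ID` and `Age` columns) are hypothetical, and plain dictionaries stand in for dataframes to keep the example dependency-free.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def parse_debug(raw_rows: List[Record]) -> List[Record]:
    """Stand-in parser for mock data: returns a copy of the rows unchanged."""
    return list(raw_rows)


def parse_my_center(raw_rows: List[Record]) -> List[Record]:
    """Hypothetical parser for a new center: renames columns and casts types."""
    return [
        {"patient_id": str(row["ID"]), "age": int(row["Age"])}
        for row in raw_rows
    ]


# Hypothetical registry mapping a cohort name to its parsing function.
PARSERS: Dict[str, Callable[[List[Record]], List[Record]]] = {
    "debug": parse_debug,
    "my_center": parse_my_center,
}


def build_cohort_records(cohort_name: str, raw_rows: List[Record]) -> List[Record]:
    """Parse raw clinical rows for a cohort; unknown cohort names still fail loudly."""
    try:
        parser = PARSERS[cohort_name]
    except KeyError:
        raise NotImplementedError(f"No parser registered for cohort {cohort_name!r}")
    return parser(raw_rows)
```

With this layout, supporting an additional center means writing one parsing function and registering it under the cohort name.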
This code does not include federated learning scripts. For a federated deployment of the proposed deep learning pipeline, this work relied on the open-source Substra framework, with FedAvg as the federated strategy.
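For readers unfamiliar with FedAvg, the aggregation step is a weighted average of the parameters trained at each center, with weights proportional to the number of local samples. The sketch below shows that step only. It is not the Substra API and not the code used in this work: the function name `fedavg` and the representation of parameters as plain lists of floats are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def fedavg(
    client_weights: Sequence[Dict[str, List[float]]],
    client_sizes: Sequence[int],
) -> Dict[str, List[float]]:
    """Average per-client parameters, weighting each client by its sample count."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("one sample count is required per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("the total number of samples must be positive")

    averaged: Dict[str, List[float]] = {}
    # All clients are assumed to share the same parameter names and shapes.
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged
```

In a federated round, each center would train locally, send its parameters and sample count to the aggregator, and receive the averaged parameters back as the starting point of the next round.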