This repo contains code for training NER models on a variety of well-known biomedical named entity recognition datasets. The models are fine-tuned token classification models available through the Hugging Face Hub.

The code below loads a model and applies it to the provided text. It uses an aggregation strategy to post-process the inside-outside-beginning (IOB) tagging format into entity spans.
```python
from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification",
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv",
                        aggregation_strategy="simple")

# Apply it to some text
ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
```

| Model | Entity Types | Entity Count |
|---|---|---|
| medmentions_st21pv | A variety of broad biomedical concept categories | 14 |
| medmentions_st21pv_finegrain | A large number of specific biomedical categories | 91 |
| ncbi_disease | Diseases | 2 |
| nlmchem | Chemicals | 1 |
| bc5cdr | Chemicals and diseases | 2 |
| tmvar3 | Mutations (plus genes, species, etc) | 10 |
| gnormplus | Genes and gene families | 3 |
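The `aggregation_strategy="simple"` option merges consecutive token-level B-/I- predictions into whole entity spans. The sketch below illustrates the idea of that merging step on plain IOB tags; it is not the transformers implementation, and the tokens and labels (`Gene`, `Mutation`, `Disease`) are illustrative:

```python
def merge_iob(tokens, tags):
    """Merge (token, IOB tag) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_type != tag[2:]):
            # A new entity begins; flush any open one first
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # An "O" tag closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["EGFR", "T790M", "mutations", "affect", "NSCLC", "patients"]
tags = ["B-Gene", "B-Mutation", "O", "O", "B-Disease", "O"]
print(merge_iob(tokens, tags))
# → [('EGFR', 'Gene'), ('T790M', 'Mutation'), ('NSCLC', 'Disease')]
```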
The models can be built with a moderate GPU. The commands below outline what's needed to get the datasets, preprocess them, and fine-tune the models.

Building the models requires several libraries, including transformers, which are listed in the requirements.txt file. These can be installed through pip with:
```bash
pip install -r requirements.txt
```

The various datasets/corpora used to train the models can be downloaded using the fetch_corpora.sh script:

```bash
bash fetch_corpora.sh
```

The sections below provide the commands to preprocess the data and tune each model. More details, including model performance and selected hyperparameters, are available on each model's page.
We use the 2017AA full release of UMLS to map entity concept identifiers to semantic types; we do not use the semantic types in MedMentions directly. You will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.
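MRSTY.RRF is a pipe-delimited file with columns CUI|TUI|STN|STY|ATUI|CVF, so the mapping from concept identifier (CUI) to semantic type identifiers (TUIs) can be read line by line. A minimal sketch of that lookup (for illustration only; it is not the code in prepare_medmentions.py):

```python
from collections import defaultdict

def load_cui_to_types(path):
    """Build a CUI -> set-of-TUIs mapping from a UMLS MRSTY.RRF file."""
    cui_to_tuis = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            # Column 0 is the CUI, column 1 the TUI; a CUI can have several TUIs
            cui_to_tuis[fields[0]].add(fields[1])
    return cui_to_tuis
```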
```bash
# Preprocess the data
python prepare_medmentions.py --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv.md
```

This version uses the fine-grained semantic types (instead of the semantic groups). As above, you will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.
```bash
# Preprocess the data
python prepare_medmentions.py --finegrain --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv_finegrain --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv_finegrain.md
```

To build the ncbi_disease model:

```bash
# Preprocess the data
python prepare_ncbi_disease.py --ncbidisease_dir corpora_sources/NCBI-disease --out_train datasets/ncbi_disease_train.bioc.xml.gz --out_val datasets/ncbi_disease_val.bioc.xml.gz --out_test datasets/ncbi_disease_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/ncbi_disease_train.bioc.xml.gz --val_corpus datasets/ncbi_disease_val.bioc.xml.gz --test_corpus datasets/ncbi_disease_test.bioc.xml.gz --n_trials 100 --model_name bioner_ncbi_disease --model_card_template model_card_template.md --dataset_info dataset_info/ncbi_disease.md
```

To build the nlmchem model:

```bash
# Preprocess the data
python prepare_nlmchem.py --nlmchem_dir corpora_sources/NLM-Chem --out_train datasets/nlmchem_train.bioc.xml.gz --out_val datasets/nlmchem_val.bioc.xml.gz --out_test datasets/nlmchem_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/nlmchem_train.bioc.xml.gz --val_corpus datasets/nlmchem_val.bioc.xml.gz --test_corpus datasets/nlmchem_test.bioc.xml.gz --n_trials 100 --model_name bioner_nlmchem --model_card_template model_card_template.md --dataset_info dataset_info/nlmchem.md
```

To build the bc5cdr model:

```bash
# Preprocess the data
python prepare_bc5cdr.py --bc5cdr_dir corpora_sources/CDR_Data/CDR.Corpus.v010516 --out_train datasets/bc5cdr_train.bioc.xml.gz --out_val datasets/bc5cdr_val.bioc.xml.gz --out_test datasets/bc5cdr_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/bc5cdr_train.bioc.xml.gz --val_corpus datasets/bc5cdr_val.bioc.xml.gz --test_corpus datasets/bc5cdr_test.bioc.xml.gz --n_trials 100 --model_name bioner_bc5cdr --model_card_template model_card_template.md --dataset_info dataset_info/bc5cdr.md
```

To build the tmvar3 model:

```bash
# Preprocess the data
python prepare_tmvar.py --tmvar_corpus corpora_sources/tmVar3Corpus.txt --out_train datasets/tmvar3_train.bioc.xml.gz --out_val datasets/tmvar3_val.bioc.xml.gz --out_test datasets/tmvar3_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/tmvar3_train.bioc.xml.gz --val_corpus datasets/tmvar3_val.bioc.xml.gz --test_corpus datasets/tmvar3_test.bioc.xml.gz --n_trials 100 --model_name bioner_tmvar3 --model_card_template model_card_template.md --dataset_info dataset_info/tmvar3.md
```

To build the gnormplus model:

```bash
# Preprocess the data
python prepare_gnormplus.py --gnormplus_dir corpora_sources/GNormPlusCorpus --out_train datasets/gnormplus_train.bioc.xml.gz --out_val datasets/gnormplus_val.bioc.xml.gz --out_test datasets/gnormplus_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/gnormplus_train.bioc.xml.gz --val_corpus datasets/gnormplus_val.bioc.xml.gz --test_corpus datasets/gnormplus_test.bioc.xml.gz --n_trials 100 --model_name bioner_gnormplus --model_card_template model_card_template.md --dataset_info dataset_info/gnormplus.md
```