
# Glasgow-AI4BioMed/bioner


A selection of biomedical NER models

This repo contains code for training NER models on a variety of well-known biomedical named entity recognition datasets. The models are fine-tuned token-classification models available on the Hugging Face Hub.

## 🚀 Example Usage

The code below loads a model and applies it to the provided text. It uses an aggregation strategy to post-process the inside-outside-beginning (IOB) tagging format into entity spans.

```python
from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification",
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv",
                        aggregation_strategy="simple")

# Apply it to some text
ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
```
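Under the hood, `aggregation_strategy="simple"` merges the model's per-token B-/I- labels into entity spans. A minimal sketch of that IOB merging logic (the tokens and tags below are illustrative, not actual model output):

```python
# Minimal sketch of merging IOB (inside-outside-beginning) tags into spans.
# The tokens and tags below are illustrative, not real model output.
def merge_bio(tokens, tags):
    """Group consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    spans = []
    current = None  # (entity_type, [tokens]) for the span being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)
        else:  # an "O" tag (or inconsistent I- tag) closes any open span
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = None
    if current:
        spans.append((current[0], " ".join(current[1])))
    return spans

tokens = ["EGFR", "T790M", "mutations", "affect", "NSCLC", "patients"]
tags   = ["B-Gene", "B-Mutation", "O", "O", "B-Disease", "O"]
print(merge_bio(tokens, tags))
# [('Gene', 'EGFR'), ('Mutation', 'T790M'), ('Disease', 'NSCLC')]
```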

## ⭐ Available Models

| Model | Entity Types | Entity Count |
|-------|--------------|--------------|
| medmentions_st21pv | A variety of broad biomedical concept categories | 14 |
| medmentions_st21pv_finegrain | A large number of specific biomedical categories | 91 |
| ncbi_disease | Diseases | 2 |
| nlmchem | Chemicals | 1 |
| bc5cdr | Chemicals and diseases | 2 |
| tmvar3 | Mutations (plus genes, species, etc.) | 10 |
| gnormplus | Genes and gene families | 3 |

## 🛠️ Building the Models

The models can be built with a moderate GPU. The commands below outline what's needed to download the datasets, preprocess them, and fine-tune the models.

### Prerequisites

Building the models requires several libraries, including transformers, which are listed in the requirements.txt file. They can be installed through pip with:

```shell
pip install -r requirements.txt
```

### Getting the data

The various datasets/corpora used to train the models can be downloaded using the fetch_corpora.sh script:

```shell
bash fetch_corpora.sh
```

### Preprocessing and training

The sections below provide the commands to preprocess each dataset and tune the corresponding model. More details, including model performance and selected hyperparameters, are available on each model's page.
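The preprocessing scripts write gzipped BioC XML files. If you want to inspect one, the annotations can be read with the standard library alone. This is a rough sketch, not code from this repo: it follows the general BioC schema, and the `"type"` infon key is an assumption that may differ from what the preprocessing scripts actually emit.

```python
import gzip
import xml.etree.ElementTree as ET

def read_bioc_entities(path):
    """Yield (entity_type, text) pairs from a gzipped BioC XML corpus.

    Assumes the entity label is stored in an infon with key "type";
    the actual key may differ depending on the preprocessing script.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        tree = ET.parse(f)
    # BioC nests annotations inside collection -> document -> passage
    for annotation in tree.getroot().iter("annotation"):
        infons = {i.get("key"): i.text for i in annotation.findall("infon")}
        yield infons.get("type"), annotation.findtext("text")
```

A dedicated BioC library would also work; this sketch just avoids extra dependencies for quick inspection.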

#### MedMentions ST21pv

We use the 2017AA full release of UMLS to map entity concept identifiers to semantic types, rather than using the semantic types in MedMentions directly. You will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.
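For reference, MRSTY.RRF is pipe-delimited with the concept identifier (CUI) in the first column and the semantic type identifier (TUI) in the second, and SemGroups.txt maps each TUI to a broader semantic group. A rough sketch of the CUI-to-group lookup this mapping implies (column layout per the UMLS documentation; the actual prepare_medmentions.py logic may differ):

```python
def load_cui_to_groups(mrsty_lines, semgroup_lines):
    """Map UMLS concept IDs (CUIs) to semantic group names.

    mrsty_lines: MRSTY.RRF rows, CUI|TUI|STN|STY|ATUI|CVF|
    semgroup_lines: SemGroups.txt rows, Abbrev|Group Name|TUI|Type Name
    """
    tui_to_group = {}
    for line in semgroup_lines:
        abbrev, group, tui, _type_name = line.rstrip("\n").split("|")
        tui_to_group[tui] = group
    cui_to_groups = {}
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        cui, tui = fields[0], fields[1]
        # A concept can carry several semantic types, hence a set of groups
        cui_to_groups.setdefault(cui, set()).add(tui_to_group.get(tui))
    return cui_to_groups

groups = load_cui_to_groups(
    ["C0006826|T191|B2.2.1.2.1.2|Neoplastic Process|AT17552534|256|"],
    ["DISO|Disorders|T191|Neoplastic Process"],
)
print(groups)  # {'C0006826': {'Disorders'}}
```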

```shell
# Preprocess the data
python prepare_medmentions.py --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv.md
```

#### MedMentions ST21pv (finegrain)

This version uses the fine-grained semantic types instead of the semantic groups. As above, you will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.

```shell
# Preprocess the data
python prepare_medmentions.py --finegrain --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv_finegrain --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv_finegrain.md
```

#### NCBI Disease

```shell
# Preprocess the data
python prepare_ncbi_disease.py --ncbidisease_dir corpora_sources/NCBI-disease --out_train datasets/ncbi_disease_train.bioc.xml.gz --out_val datasets/ncbi_disease_val.bioc.xml.gz --out_test datasets/ncbi_disease_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/ncbi_disease_train.bioc.xml.gz --val_corpus datasets/ncbi_disease_val.bioc.xml.gz --test_corpus datasets/ncbi_disease_test.bioc.xml.gz --n_trials 100 --model_name bioner_ncbi_disease --model_card_template model_card_template.md --dataset_info dataset_info/ncbi_disease.md
```

#### NLM-Chem

```shell
# Preprocess the data
python prepare_nlmchem.py --nlmchem_dir corpora_sources/NLM-Chem --out_train datasets/nlmchem_train.bioc.xml.gz --out_val datasets/nlmchem_val.bioc.xml.gz --out_test datasets/nlmchem_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/nlmchem_train.bioc.xml.gz --val_corpus datasets/nlmchem_val.bioc.xml.gz --test_corpus datasets/nlmchem_test.bioc.xml.gz --n_trials 100 --model_name bioner_nlmchem --model_card_template model_card_template.md --dataset_info dataset_info/nlmchem.md
```

#### BC5CDR

```shell
# Preprocess the data
python prepare_bc5cdr.py --bc5cdr_dir corpora_sources/CDR_Data/CDR.Corpus.v010516 --out_train datasets/bc5cdr_train.bioc.xml.gz --out_val datasets/bc5cdr_val.bioc.xml.gz --out_test datasets/bc5cdr_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/bc5cdr_train.bioc.xml.gz --val_corpus datasets/bc5cdr_val.bioc.xml.gz --test_corpus datasets/bc5cdr_test.bioc.xml.gz --n_trials 100 --model_name bioner_bc5cdr --model_card_template model_card_template.md --dataset_info dataset_info/bc5cdr.md
```

#### tmVar3

```shell
# Preprocess the data
python prepare_tmvar.py --tmvar_corpus corpora_sources/tmVar3Corpus.txt --out_train datasets/tmvar3_train.bioc.xml.gz --out_val datasets/tmvar3_val.bioc.xml.gz --out_test datasets/tmvar3_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/tmvar3_train.bioc.xml.gz --val_corpus datasets/tmvar3_val.bioc.xml.gz --test_corpus datasets/tmvar3_test.bioc.xml.gz --n_trials 100 --model_name bioner_tmvar3 --model_card_template model_card_template.md --dataset_info dataset_info/tmvar3.md
```

#### GNormPlus

```shell
# Preprocess the data
python prepare_gnormplus.py --gnormplus_dir corpora_sources/GNormPlusCorpus --out_train datasets/gnormplus_train.bioc.xml.gz --out_val datasets/gnormplus_val.bioc.xml.gz --out_test datasets/gnormplus_test.bioc.xml.gz
```

```shell
# Tune the model and save it
python tune_ner.py --train_corpus datasets/gnormplus_train.bioc.xml.gz --val_corpus datasets/gnormplus_val.bioc.xml.gz --test_corpus datasets/gnormplus_test.bioc.xml.gz --n_trials 100 --model_name bioner_gnormplus --model_card_template model_card_template.md --dataset_info dataset_info/gnormplus.md
```
