This repo contains code for training NER models on a variety of well-known biomedical named entity recognition datasets. The models are fine-tuned token classification models available through the Hugging Face Hub.

The code below loads a model and applies it to the provided text. It uses an aggregation strategy to post-process the inside-outside-beginning (IOB) tagging format into entity spans.
```python
from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification",
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv",
                        aggregation_strategy="simple")

# Apply it to some text
ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
```

| Model | Entity Types | Entity Count |
|---|---|---|
| medmentions_st21pv | A variety of broad biomedical concept categories | 14 |
| medmentions_st21pv_finegrain | A large number of specific biomedical categories | 91 |
| ncbi_disease | Diseases | 2 |
| nlmchem | Chemicals | 1 |
| bc5cdr | Chemicals and diseases | 2 |
| tmvar3 | Mutations (plus genes, species, etc) | 10 |
| gnormplus | Genes and gene families | 3 |
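The `aggregation_strategy="simple"` option merges consecutive token-level B-/I- predictions into whole entity spans. The sketch below illustrates the idea of that merging step on plain IOB tags; it is not the transformers implementation, and the tokens and labels (`Gene`, `Mutation`, `Disease`) are illustrative:

```python
def merge_iob(tokens, tags):
    """Merge (token, IOB tag) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_type != tag[2:]):
            # A new entity begins; flush any open one first
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # An "O" tag closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["EGFR", "T790M", "mutations", "affect", "NSCLC", "patients"]
tags = ["B-Gene", "B-Mutation", "O", "O", "B-Disease", "O"]
print(merge_iob(tokens, tags))
# → [('EGFR', 'Gene'), ('T790M', 'Mutation'), ('NSCLC', 'Disease')]
```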
The models can be built with a moderate GPU. The commands below outline what's needed to get the datasets, preprocess them, and fine-tune the models.

Building the models requires several libraries, including transformers, which are listed in the requirements.txt file. These can be installed through pip with:
```bash
pip install -r requirements.txt
```

The various datasets/corpora used to train the models can be downloaded using the fetch_corpora.sh script:

```bash
bash fetch_corpora.sh
```

The sections below provide the commands to preprocess the data and tune each model. More details, including model performance and selected hyperparameters, are available on each model's page.
We use the 2017AA full release of UMLS to map entity concept identifiers to semantic types; we do not use the semantic types in MedMentions directly. You will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.
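MRSTY.RRF is a pipe-delimited file with columns CUI|TUI|STN|STY|ATUI|CVF, so the mapping from concept identifier (CUI) to semantic type identifiers (TUIs) can be read line by line. A minimal sketch of that lookup (for illustration only; it is not the code in prepare_medmentions.py):

```python
from collections import defaultdict

def load_cui_to_types(path):
    """Build a CUI -> set-of-TUIs mapping from a UMLS MRSTY.RRF file."""
    cui_to_tuis = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            # Column 0 is the CUI, column 1 the TUI; a CUI can have several TUIs
            cui_to_tuis[fields[0]].add(fields[1])
    return cui_to_tuis
```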
```bash
# Preprocess the data
python prepare_medmentions.py --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv.md
```

This version uses the fine-grained semantic types (instead of the semantic groups). As above, you will need to change the path in the command below to point to your local copy of the MRSTY.RRF file.
```bash
# Preprocess the data
python prepare_medmentions.py --finegrain --umls_mrsty ~/umls/2017AA-full/META/MRSTY.RRF --medmentions_dir corpora_sources/medmentions/st21pv --semantic_groups corpora_sources/medmentions/SemGroups.txt --out_train datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --out_val datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --out_test datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/medmentions_st21pv_finegrain_train.bioc.xml.gz --val_corpus datasets/medmentions_st21pv_finegrain_val.bioc.xml.gz --test_corpus datasets/medmentions_st21pv_finegrain_test.bioc.xml.gz --n_trials 100 --model_name bioner_medmentions_st21pv_finegrain --model_card_template model_card_template.md --dataset_info dataset_info/medmentions_st21pv_finegrain.md
```

To build the ncbi_disease model:

```bash
# Preprocess the data
python prepare_ncbi_disease.py --ncbidisease_dir corpora_sources/NCBI-disease --out_train datasets/ncbi_disease_train.bioc.xml.gz --out_val datasets/ncbi_disease_val.bioc.xml.gz --out_test datasets/ncbi_disease_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/ncbi_disease_train.bioc.xml.gz --val_corpus datasets/ncbi_disease_val.bioc.xml.gz --test_corpus datasets/ncbi_disease_test.bioc.xml.gz --n_trials 100 --model_name bioner_ncbi_disease --model_card_template model_card_template.md --dataset_info dataset_info/ncbi_disease.md
```

To build the nlmchem model:

```bash
# Preprocess the data
python prepare_nlmchem.py --nlmchem_dir corpora_sources/NLM-Chem --out_train datasets/nlmchem_train.bioc.xml.gz --out_val datasets/nlmchem_val.bioc.xml.gz --out_test datasets/nlmchem_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/nlmchem_train.bioc.xml.gz --val_corpus datasets/nlmchem_val.bioc.xml.gz --test_corpus datasets/nlmchem_test.bioc.xml.gz --n_trials 100 --model_name bioner_nlmchem --model_card_template model_card_template.md --dataset_info dataset_info/nlmchem.md
```

To build the bc5cdr model:

```bash
# Preprocess the data
python prepare_bc5cdr.py --bc5cdr_dir corpora_sources/CDR_Data/CDR.Corpus.v010516 --out_train datasets/bc5cdr_train.bioc.xml.gz --out_val datasets/bc5cdr_val.bioc.xml.gz --out_test datasets/bc5cdr_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/bc5cdr_train.bioc.xml.gz --val_corpus datasets/bc5cdr_val.bioc.xml.gz --test_corpus datasets/bc5cdr_test.bioc.xml.gz --n_trials 100 --model_name bioner_bc5cdr --model_card_template model_card_template.md --dataset_info dataset_info/bc5cdr.md
```

To build the tmvar3 model:

```bash
# Preprocess the data
python prepare_tmvar.py --tmvar_corpus corpora_sources/tmVar3Corpus.txt --out_train datasets/tmvar3_train.bioc.xml.gz --out_val datasets/tmvar3_val.bioc.xml.gz --out_test datasets/tmvar3_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/tmvar3_train.bioc.xml.gz --val_corpus datasets/tmvar3_val.bioc.xml.gz --test_corpus datasets/tmvar3_test.bioc.xml.gz --n_trials 100 --model_name bioner_tmvar3 --model_card_template model_card_template.md --dataset_info dataset_info/tmvar3.md
```

To build the gnormplus model:

```bash
# Preprocess the data
python prepare_gnormplus.py --gnormplus_dir corpora_sources/GNormPlusCorpus --out_train datasets/gnormplus_train.bioc.xml.gz --out_val datasets/gnormplus_val.bioc.xml.gz --out_test datasets/gnormplus_test.bioc.xml.gz

# Tune the model and save it
python tune_ner.py --train_corpus datasets/gnormplus_train.bioc.xml.gz --val_corpus datasets/gnormplus_val.bioc.xml.gz --test_corpus datasets/gnormplus_test.bioc.xml.gz --n_trials 100 --model_name bioner_gnormplus --model_card_template model_card_template.md --dataset_info dataset_info/gnormplus.md
```