# NER & RE Evaluation Benchmark

A comprehensive benchmarking toolkit for comparing Named Entity Recognition (NER) and Relation Extraction (RE) models on scientific text. This project evaluates both open-source and LLM-based extraction methods with standardized metrics.

## 🎯 Overview

This benchmark compares multiple extraction approaches:

- **GLiNER1** - Original GLiNER model
- **GLiNER2** - Fast, open-source zero-shot model
- **LLMEntityRelationExtractor** - From the neo4j-graphrag package (GPT-4o)
- **LangExtract** - Structured LLM extraction (NER only)

All models are evaluated on scientific and entertainment domain datasets with comprehensive entity and relation types.

## πŸ“Š Results Summary

### Named Entity Recognition - Science Dataset (500+ samples)

| Model | Precision | Recall | F1 | Avg Time/Sample |
|---|---|---|---|---|
| LangExtract | 0.714 | 0.619 | 0.663 | 1.39s |
| GLiNER1 | 0.666 | 0.617 | 0.641 | 0.06s |
| LLMEntityRelationExtractor | 0.583 | 0.542 | 0.562 | 2.79s |
| GLiNER2 | 0.475 | 0.562 | 0.515 | 0.06s |

### Named Entity Recognition - Comics Dataset (66 samples)

| Model | Precision | Recall | F1 | Avg Time/Sample |
|---|---|---|---|---|
| LLMEntityRelationExtractor | 0.575 | 0.798 | 0.668 | 2.98s |
| GLiNER1 | 0.519 | 0.723 | 0.604 | 0.05s |
| LangExtract | 0.502 | 0.734 | 0.596 | 1.44s |
| GLiNER2 | 0.476 | 0.750 | 0.583 | 0.05s |

### Relation Extraction (RE)

| Model | Precision | Recall | F1 | Avg Time/Sample |
|---|---|---|---|---|
| LLMEntityRelationExtractor | 0.547 | 0.537 | 0.542 | 2.56s |
| GLiNER2 | 0.290 | 0.427 | 0.363 | 0.14s |
| GLiNER1 | 0.122 | 0.041 | 0.062 | 0.16s |

**Key Findings:**

- LangExtract leads NER on the science dataset (F1: 0.663), with GLiNER1 close behind (F1: 0.641) at roughly 23x the speed
- LLMEntityRelationExtractor is best on the comics dataset (F1: 0.668) and on relation extraction (F1: 0.542)
- GLiNER1 vs. GLiNER2: GLiNER1 is significantly better for NER (+0.13 F1) but worse for RE
- Speed: the GLiNER models are 20-50x faster than the LLM-based approaches

## πŸš€ Quick Start

### Prerequisites

- Python 3.13
- uv package manager
- OpenAI API key (for Neo4j GraphRAG and LangExtract)

### Installation

```bash
# Clone the repository
git clone https://github.com/AmirLayegh/ner-eval.git
cd ner-eval

# Install dependencies
uv sync

# Set up environment variables
cp .env.example .env
# Add your OPENAI_API_KEY to .env
```

### Running Benchmarks

```bash
# Named Entity Recognition - Science Dataset (500+ samples)
uv run python experiments/run_ner_benchmark.py

# Named Entity Recognition - Comics Dataset (66 samples)
uv run python experiments/run_comics_ner_benchmark.py

# Relation Extraction
uv run python experiments/run_re_benchmark.py

# Visualize all results
uv run python visualize_results.py
```

Results are saved to the `result/` directory with detailed metrics and timing.

### Test GLiNER Interactively

```bash
uv run python test_gliner_open.py
```

Runs a quick test of GLiNER extraction on sample text.

πŸ“ Project Structure

ner-eval/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ extractors/          # Model implementations
β”‚   β”‚   β”œβ”€β”€ base.py          # Abstract base classes
β”‚   β”‚   β”œβ”€β”€ gliner1.py       # GLiNER1 extractor
β”‚   β”‚   β”œβ”€β”€ gliner2.py       # GLiNER2 extractor
β”‚   β”‚   β”œβ”€β”€ neo4j_graphrag.py # Neo4j GraphRAG extractor
β”‚   β”‚   └── langextract.py   # LangExtract extractor
β”‚   β”œβ”€β”€ evaluation/          # Evaluation logic
β”‚   β”‚   └── metrics.py       # Precision, recall, F1 calculation
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── loader.py        # Benchmark data loading
β”‚   β”œβ”€β”€ runner.py            # Benchmark execution
β”‚   └── models.py            # Pydantic data models
β”œβ”€β”€ datasets/                # Benchmark datasets
β”‚   β”œβ”€β”€ ner/                 # NER benchmarks
β”‚   β”‚   β”œβ”€β”€ science_ner_benchmark.json (500+ samples, 17 types)
β”‚   β”‚   └── comics_ner_benchmark.json (66 samples, 10 types)
β”‚   └── re/                  # RE benchmarks
β”‚       └── scientist_re_benchmark.json
β”œβ”€β”€ experiments/             # Standalone benchmark scripts
└── result/                  # Benchmark results

## πŸ”§ Supported Models

### Relation Extraction (RE)

- GLiNER1 (`knowledgator/gliner-relex-large-v0.5`)
- GLiNER2 (`fastino/gliner2-base-v1`)
- LLMEntityRelationExtractor from neo4j-graphrag (GPT-4o)

### Named Entity Recognition (NER)

- GLiNER1 (`urchade/gliner_medium-v2.1`)
- GLiNER2 (`fastino/gliner2-base-v1`)
- LLMEntityRelationExtractor from neo4j-graphrag (GPT-4o)
- LangExtract (GPT-4o)

πŸ“ Adding Your Own Models

Extend the base classes to add new extractors:

from src.extractors.base import BaseEntityExtractor
from src.models import Entity

class MyCustomExtractor(BaseEntityExtractor):
    @property
    def model_id(self) -> str:
        return "my-custom-model"
    
    def extract(self, text: str, entity_types: list[str]) -> list[Entity]:
        # Your extraction logic here
        return entities
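As a concrete (if toy) illustration, the sketch below fills `extract` with a naive gazetteer lookup. It is self-contained on purpose: the `Entity` dataclass and its fields here are stand-ins for the real Pydantic models in `src/models.py`, whose field names may differ, and `DictionaryExtractor` is a hypothetical baseline, not part of the repository.

```python
from dataclasses import dataclass


# Stand-in for src.models.Entity (the real class is a Pydantic model;
# its field names may differ).
@dataclass
class Entity:
    text: str
    type: str


class DictionaryExtractor:
    """Toy extractor: tags exact string matches from a fixed gazetteer."""

    GAZETTEER = {
        "Marie Curie": "scientist",
        "Sorbonne": "university",
        "polonium": "chemicalcompound",
    }

    @property
    def model_id(self) -> str:
        return "dictionary-baseline"

    def extract(self, text: str, entity_types: list[str]) -> list[Entity]:
        # Only emit entities whose type was actually requested.
        return [
            Entity(text=name, type=etype)
            for name, etype in self.GAZETTEER.items()
            if etype in entity_types and name in text
        ]


extractor = DictionaryExtractor()
found = extractor.extract(
    "Marie Curie discovered polonium at the Sorbonne.",
    entity_types=["scientist", "chemicalcompound"],
)
# "Sorbonne" is skipped because "university" was not requested.
```

Once a class satisfies the `model_id`/`extract` interface, the benchmark runner can treat it like any other model.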

## πŸ“Š Datasets

### NER Datasets

**Science NER Benchmark**

- Samples: 500+ texts, 3000+ entities
- Entity Types: 17 types including `scientist`, `astronomicalobject`, `chemicalcompound`, `protein`, `university`, etc.
- Domain: Scientific/academic text
- Source: Scientific articles and Wikipedia

**Comics NER Benchmark**

- Samples: 66 texts, 188 entities
- Entity Types: 10 types including `ComicsCharacter`, `Person`, `Organisation`, `Location`, `Film`, etc.
- Domain: Comic books and entertainment
- Source: Comic character database

### RE Dataset

**Scientist RE Benchmark**

- Samples: 98 texts, 218 triples
- Relation Types: 20 types including `almaMater`, `award`, `birthPlace`, `knownFor`, `professionalField`, etc.
- Domain: Biographical scientist information
- Source: Filtered from the `ont_18_scientist_train` dataset

## πŸ§ͺ Evaluation Metrics

All models are evaluated using:

- **Precision**: correctness of predictions
- **Recall**: coverage of the ground truth
- **F1 Score**: harmonic mean of precision and recall
- **Inference Time**: average time per sample

Normalization: case-insensitive, whitespace-stripped comparison.
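A minimal sketch of these metrics under that normalization (the project's actual logic lives in `src/evaluation/metrics.py`; treating predictions and gold spans as multisets is an assumption of this sketch, not a documented detail):

```python
from collections import Counter


def normalize(span: str) -> str:
    # Case-insensitive, whitespace-stripped comparison
    return span.strip().lower()


def prf1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred = Counter(normalize(s) for s in predicted)
    true = Counter(normalize(s) for s in gold)
    tp = sum((pred & true).values())  # multiset intersection = true positives
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(true.values()) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "Marie Curie " matches "marie curie" after normalization.
p, r, f = prf1(
    ["Marie Curie ", "radium", "Paris"],
    ["marie curie", "polonium", "radium", "Paris"],
)
# p = 1.0 (all 3 predictions correct), r = 0.75 (3 of 4 gold spans found)
```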

πŸ› οΈ Configuration

Environment Variables

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Required only for LLMEntityRelationExtractor and LangExtract models (LLM-based). GLiNER1 and GLiNER2 work without any API keys.
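If you want a dependency-free sanity check that the key is available before launching the LLM-based benchmarks, a sketch like the following works. This helper is illustrative only: the project itself may load `.env` differently (for example via `python-dotenv`), and `load_dotenv_minimal` is a name invented here.

```python
import os
from pathlib import Path


def load_dotenv_minimal(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Blank lines, comments, and lines without '=' are skipped.
    """
    env: dict[str, str] = {}
    p = Path(path)
    if not p.exists():
        return env
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def has_openai_key(env: dict[str, str]) -> bool:
    # Accept the key from either the parsed .env file or the process env.
    return bool(env.get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY"))
```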

## 🀝 Contributing

Contributions welcome! Areas for improvement:

- Additional models (spaCy, Flair, etc.)
- More diverse datasets
- Additional evaluation metrics (entity-level, span-level)
- Performance optimizations

## πŸ“„ License

MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments
