Merged
124 commits
05b61dc
added prev agents json keys
tekrajchhetri Feb 24, 2026
70ab4d4
add dict root level keys check
tekrajchhetri Feb 25, 2026
da35117
json-repair added
tekrajchhetri Feb 25, 2026
e198e5d
trustcall added
tekrajchhetri Feb 25, 2026
3332e19
json repair tool using trustcall + json_repair library
tekrajchhetri Feb 25, 2026
0ede87b
json repair tool added as default for all agents
tekrajchhetri Feb 25, 2026
97ee75a
normalization of resources to prevent error
tekrajchhetri Feb 26, 2026
4c51e98
prevent no output case
tekrajchhetri Feb 26, 2026
139d4f3
Create puja_test_2024withouthil.json
tekrajchhetri Mar 9, 2026
9489a39
updated task detection call
tekrajchhetri Mar 16, 2026
0b4fd25
local concept mapping tool
tekrajchhetri Mar 16, 2026
0f05e44
asyncio_mode added
tekrajchhetri Mar 18, 2026
6f5aa73
updated to include local concept mapping information
tekrajchhetri Mar 18, 2026
34833fe
fix local model issue -- returning empty NER
tekrajchhetri Mar 18, 2026
ee492a3
fallback mechanism added for pdfs where grobid fails as grobid is use…
tekrajchhetri Mar 18, 2026
4e57d78
updated json structure to align with agents
tekrajchhetri Mar 18, 2026
f5594b0
local concept mapping tool information added
tekrajchhetri Mar 18, 2026
f9f5583
entities not saved issues fixed + improved concept mapping + added st…
tekrajchhetri Mar 18, 2026
1e0ba65
updated ConceptMappingInput types
tekrajchhetri Mar 18, 2026
a2e0fdd
added concept mapping local information
tekrajchhetri Mar 18, 2026
3ae3307
added link to concept mapping tool
tekrajchhetri Mar 18, 2026
c002494
Merge branch 'improvement' into fix_json_op_issue
tekrajchhetri Mar 18, 2026
23ee024
Fix StructSenseFlow.__init__ parameter name: input_source -> source
puja-trivedi Mar 18, 2026
7183fcc
Merge pull request #97 from sensein/test_fix_json_op_issue_20260312
tekrajchhetri Mar 19, 2026
b717c90
sync poetry.lock with pyproject.toml
tekrajchhetri Mar 20, 2026
189cc39
added workflow_dispatch:
tekrajchhetri Mar 20, 2026
6724a0c
added "poetry lock --no-update"
tekrajchhetri Mar 20, 2026
a9772e2
silent drop fixed due to token compression for large results.
tekrajchhetri Mar 21, 2026
59f5c22
preloading added to handle failure
tekrajchhetri Mar 21, 2026
f03647f
downstream parallelization + preload
tekrajchhetri Mar 21, 2026
42c31dc
dangling code error fixed
tekrajchhetri Mar 21, 2026
0788215
unused params removed
tekrajchhetri Mar 21, 2026
ea12d81
speed up alignment + cost reduction
tekrajchhetri Mar 21, 2026
6d99939
performance + cost optimization using model token limit + concept ali…
tekrajchhetri Mar 21, 2026
a8b1e72
Downstream chunking is AUTOMATIC and independent of --enable_chunking
tekrajchhetri Mar 21, 2026
3de9a65
resource extraction -alignment optimized.
tekrajchhetri Mar 21, 2026
589387d
skip stage features added
tekrajchhetri Mar 21, 2026
41787b8
updated readme to include latest updates
tekrajchhetri Mar 21, 2026
cfb1c7f
Added max_itr constraint
tekrajchhetri Mar 21, 2026
eef5e6e
Max Iterations info added
tekrajchhetri Mar 21, 2026
0b54397
api call logger added
tekrajchhetri Mar 23, 2026
42ad941
added tracking + more options like max_itr to control agent execution
tekrajchhetri Mar 23, 2026
373d81e
removed --no-update
tekrajchhetri Mar 23, 2026
20abeab
added more options to control agent
tekrajchhetri Mar 23, 2026
b1b4f88
added async_execution parameters
tekrajchhetri Mar 23, 2026
1f44496
llm-as-judge-api than crew
tekrajchhetri Mar 23, 2026
94bb60f
changed default value some reset to crew ai default ones
tekrajchhetri Mar 23, 2026
72ccd15
human feedback agent -- direct api call than crew internal API call
tekrajchhetri Mar 23, 2026
afe1747
retry mechanism added.
tekrajchhetri Mar 23, 2026
ce7b77b
chunks increased
tekrajchhetri Mar 23, 2026
d415106
source text, context added for revision
tekrajchhetri Mar 23, 2026
26660ad
postprocessing conceptmap checking run batch for local
tekrajchhetri Mar 23, 2026
66c6124
fixed LLM appended trailing content after valid JSON
tekrajchhetri Mar 23, 2026
827b199
logging info added
tekrajchhetri Mar 23, 2026
33abb13
robust postprocessing for generic + resources tasks
tekrajchhetri Mar 23, 2026
b76352a
provenance added
tekrajchhetri Mar 24, 2026
e600394
added placeholder for easy human feedback input
tekrajchhetri Mar 24, 2026
7345f23
updated readme to include all details of the changes made.
tekrajchhetri Mar 24, 2026
d53168a
readme for evaluation ner directory
tekrajchhetri Mar 24, 2026
7576b5e
Command added
tekrajchhetri Mar 24, 2026
1905a7f
config file for gemini and gpt
tekrajchhetri Mar 24, 2026
1c42ccf
publication
tekrajchhetri Mar 24, 2026
9bb6d73
results from gemini config file for both with and without hil
tekrajchhetri Mar 24, 2026
cf8299c
removed old evaluations as they're no longer relevant
tekrajchhetri Mar 24, 2026
d9940f7
Delete evaluation/ner/old/evaluation/Latent-circuit/results directory
tekrajchhetri Mar 24, 2026
67e5e47
Delete evaluation/ner/puja_test_2024withouthil.json
tekrajchhetri Mar 24, 2026
983ffff
Delete evaluation/combined_all_token_cost_data/old directory
tekrajchhetri Mar 24, 2026
1851b68
Delete evaluation/pdf2reproschema/old directory
tekrajchhetri Mar 24, 2026
c351d4a
config file for pdf2reproschema task gemini
tekrajchhetri Mar 24, 2026
31f8eb6
evaluation results (hil+nhil) using gemini for pdf2reproschema task
tekrajchhetri Mar 24, 2026
eeccbb2
directory structure added
tekrajchhetri Mar 24, 2026
2208b5d
removed old evaluation, no longer relevant
tekrajchhetri Mar 24, 2026
73758b3
resource extraction results for vitpose paper - gemini hil + without hil
tekrajchhetri Mar 24, 2026
73359f6
Create ner-config-gemini.yaml
tekrajchhetri Mar 24, 2026
3f51dae
results for ncbi test data ner task
tekrajchhetri Mar 24, 2026
8ec9528
ncbi hil result gemini
tekrajchhetri Mar 24, 2026
74d246b
ner config for gpt model
tekrajchhetri Mar 24, 2026
796a790
gpt nhil result
tekrajchhetri Mar 24, 2026
a955188
gpt hil result
tekrajchhetri Mar 24, 2026
6cdb918
qwen nhil ncbi result
tekrajchhetri Mar 25, 2026
daf130b
deepseek result
tekrajchhetri Mar 25, 2026
5c115f6
s800 benchmark data + gemini nhil result
tekrajchhetri Mar 25, 2026
e20b110
s800 gpt-mini result nhil
tekrajchhetri Mar 25, 2026
2f09d5d
ncbi qwen hil
tekrajchhetri Mar 26, 2026
5e6136c
qwen config file
tekrajchhetri Mar 26, 2026
5b54f0f
jnlpba dataset for testing
tekrajchhetri Mar 26, 2026
8caf3fd
re-organized
tekrajchhetri Mar 26, 2026
14a0b77
organized ncbi dataset into folder
tekrajchhetri Mar 26, 2026
b0d573f
bc5cdr dataset
tekrajchhetri Mar 26, 2026
44b70b0
Discovery of optimal cell type classification paper result gemini
tekrajchhetri Mar 26, 2026
722e103
result gpt hil + config files
tekrajchhetri Mar 26, 2026
35d96d2
Evaluation directory info added
tekrajchhetri Mar 26, 2026
53a520e
qwen s800
tekrajchhetri Mar 26, 2026
dc33c92
only including s800 + ncbi info
tekrajchhetri Mar 26, 2026
6a25a81
results nhil qwen + config
tekrajchhetri Mar 26, 2026
ad01b8a
qwen results
tekrajchhetri Mar 26, 2026
350c6ba
gpt results + config file (qwen+gpt)
tekrajchhetri Mar 26, 2026
d0a7e30
latent circuit paper gpt + gemini results + config file
tekrajchhetri Mar 26, 2026
cbc6251
ner result qwen nhil
tekrajchhetri Mar 26, 2026
2019b90
gpt+qwen results nhil
tekrajchhetri Mar 26, 2026
6bc9b68
scan list of resources
tekrajchhetri Mar 26, 2026
7f6278a
scan list of resources
tekrajchhetri Mar 26, 2026
010189c
Delete evaluation/benchmark/bc5cdr directory
tekrajchhetri Mar 26, 2026
6f51658
Delete evaluation/benchmark/jnlpba directory
tekrajchhetri Mar 26, 2026
4e1d0d7
deeplabcut paper resource extraction result
tekrajchhetri Mar 26, 2026
19d9767
Merge branch 'evaluation_result_paper' of github.com:sensein/structse…
tekrajchhetri Mar 26, 2026
2c34a3c
merge updates from cost optimization
tekrajchhetri Mar 26, 2026
a830289
pdf2reproschema results
tekrajchhetri Mar 26, 2026
e2fcc54
add reproducible evaluation results
yibeichan Mar 28, 2026
0b1bb04
Merge pull request #104 from sensein/evaluation_result_paper
djarecka Apr 6, 2026
c31c07a
Merge pull request #98 from sensein/fix_poetry_issue_githubworkflow
djarecka Apr 6, 2026
2803da3
Update test parameterization for task_type
tekrajchhetri Apr 9, 2026
6d8f8f6
Merge pull request #110 from sensein/tekrajchhetri-fix-test
tekrajchhetri Apr 9, 2026
cda0349
Update codespell skip list in pre-commit config
tekrajchhetri Apr 9, 2026
e389477
Refactor test for task type inference with descriptions
tekrajchhetri Apr 9, 2026
bd8f371
Merge pull request #100 from sensein/performance_cost_optimization
tekrajchhetri Apr 9, 2026
78f6df3
Merge pull request #111 from sensein/tekrajchhetri-fix-test
tekrajchhetri Apr 9, 2026
3e4ef2b
Fix comments and expected output formatting in config.yaml
tekrajchhetri Apr 9, 2026
b1b1d23
creating conftest.py for shared functions; updating ner_test to use p…
djarecka Apr 9, 2026
24074e5
adding init file
djarecka Apr 9, 2026
6e1a43a
Update src/tests/conftest.py
djarecka Apr 9, 2026
fe413f1
add load_env to the test
djarecka Apr 9, 2026
d9aee7b
import logging
djarecka Apr 9, 2026
3eba584
Merge pull request #112 from djarecka/tests_updates
djarecka Apr 13, 2026
4 changes: 4 additions & 0 deletions .github/workflows/test.yaml
Original file line number Diff line number Diff line change
@@ -5,6 +5,7 @@ on:
branches:
- main
pull_request:
workflow_dispatch:

jobs:
unit:
@@ -27,6 +28,7 @@ jobs:
python -m pip install poetry==2.3.2
- name: Install dependencies with Poetry
run: |
poetry lock
poetry install --with dev
shell: bash
- name: Run unit tests
@@ -82,6 +84,7 @@ jobs:
python -m pip install poetry==2.3.2
- name: Install dependencies with Poetry
run: |
poetry lock
poetry install --with dev
shell: bash
- name: Run OpenRouter integration tests
@@ -122,6 +125,7 @@
- name: Install dependencies with Poetry
run: |
poetry env use ${{ matrix.python-version }}
poetry lock
poetry install --with dev
poetry env info
shell: bash
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -51,7 +51,7 @@ repos:
hooks:
- id: codespell
args:
- --skip=poetry.lock,docs_style/pdoc-theme/syntax-highlighting.css,*.cha,*.ipynb,example/sample_metadata.csv
- --skip=poetry.lock,docs_style/pdoc-theme/syntax-highlighting.css,*.cha,*.ipynb,example/sample_metadata.csv,evaluation/benchmark,evaluation/pdf2reproschema/old,evaluation/ner/old,evaluation/pdf2reproschema/old/evaluation
- --ignore-words-list=SIE,sie

- repo: https://github.com/hija/clean-dotenv
661 changes: 593 additions & 68 deletions README.md

Large diffs are not rendered by default.

13 changes: 5 additions & 8 deletions config_template/config.yaml
@@ -80,24 +80,21 @@ task_config:
"aligned_terms":{
#data
}
# do not chage this, remove it later
# do not change this, remove it later
agent_id: alignment_agent

judge_task:
description: >
{aligned_structured_information}


expected_output: >
output format: json
Example output:
expected_output: >
output format: json
Example output:
"judged_terms":{
#data
}

expected_output: >
output format: json Example output: add example

# do not change this, remove it later
agent_id: judge_agent

@@ -133,4 +130,4 @@
provider: ollama
config:
api_base: http://localhost:11434
model: nomic-embed-text:v1.5
model: nomic-embed-text:v1.5
3,856 changes: 0 additions & 3,856 deletions evaluation/benchmark/JNLPBA_gene_protein_test_entities_mapping.jsonl

This file was deleted.

1 change: 0 additions & 1 deletion evaluation/benchmark/JNLPBA_gene_protein_test_text.txt

This file was deleted.

111 changes: 111 additions & 0 deletions evaluation/benchmark/analysis/RESULTS.md
@@ -0,0 +1,111 @@
# Benchmark NER Evaluation Results

## Overview

This evaluation measures StructSense's entity extraction **recall** against two standard biomedical NER benchmark datasets:

- **NCBI Disease**: 960 entity mentions (403 unique) across 940 sentences
- **S800 Species**: 767 entity mentions (370 unique) across 1630 sentences

The ground truth defines the **minimum** set of entities to extract. StructSense extracts additional entities beyond ground truth — this is a feature, not an error, and is reported as "extra entities."

## Evaluation Approach

### Entity Matching (3-tier cascade)

For each ground truth entity mention, the script attempts matching in priority order:

1. **Exact text match**: Normalize both GT and result text (lowercase, collapse spaced hyphens `" - "` to `"-"`, collapse whitespace), then exact string comparison. This handles the BIO tokenization artifact where GT has `"ataxia - telangiectasia"` and results have `"ataxia-telangiectasia"`.

2. **Span overlap**: Compare character offsets. Match if overlap >= 50% of GT span length. Catches positional matches where text differs slightly.

3. **Substring/containment**: Normalized GT text is a substring of a result entity text, or vice versa. For example, GT `"T-cell leukaemia"` matches result `"sporadic T-cell prolymphocytic leukaemia"`.
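
The cascade described above can be sketched in a few lines of stdlib Python. This is an illustrative reimplementation, not the code in `benchmark_eval.py`; the function names and the `text`/`start`/`end` dict fields are assumptions for the example.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, collapse spaced hyphens, and collapse whitespace."""
    text = text.lower()
    # "ataxia - telangiectasia" -> "ataxia-telangiectasia"
    text = re.sub(r"\s*-\s*", "-", text)
    return re.sub(r"\s+", " ", text).strip()

def match_entity(gt, results):
    """Return the matching tier for one ground-truth mention, or None.

    gt and each result are dicts with 'text', 'start', 'end' keys
    (illustrative field names).
    """
    gt_norm = normalize(gt["text"])
    # Tier 1: exact match on normalized text
    for r in results:
        if normalize(r["text"]) == gt_norm:
            return "exact"
    # Tier 2: character-span overlap of at least 50% of the GT span
    gt_len = gt["end"] - gt["start"]
    for r in results:
        overlap = min(gt["end"], r["end"]) - max(gt["start"], r["start"])
        if gt_len > 0 and overlap >= 0.5 * gt_len:
            return "overlap"
    # Tier 3: substring containment in either direction
    for r in results:
        r_norm = normalize(r["text"])
        if gt_norm in r_norm or r_norm in gt_norm:
            return "substring"
    return None
```

Because the tiers are tried in priority order, a mention that matches exactly is never double-counted by the looser tiers.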

### Filtering

- Entities attributed only to `en_core_web_sm` (spaCy's general-purpose NER model) are removed before evaluation
- No junk/stopword filtering is applied (unlike paper-based NER eval) since even generic-looking entities may be valid benchmark matches

### Metrics

| Metric | Description |
|---|---|
| Recall (strict) | Fraction of GT mentions matched via exact text |
| Recall (relaxed) | Fraction of GT mentions matched via any tier |
| Extra entities | Result entities beyond ground truth (tool advantage) |
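
Given one match-tier outcome per ground-truth mention, both recall figures reduce to simple counting. A minimal sketch; the split between the two looser tiers below is illustrative, while the strict (822) and unmatched (41) totals come from the Gemini hil NCBI row:

```python
from collections import Counter

def compute_recall(tiers):
    """tiers: one entry per GT mention --
    'exact', 'overlap', 'substring', or None (unmatched)."""
    counts = Counter(tiers)
    total = len(tiers)
    strict = counts["exact"] / total           # exact-text matches only
    relaxed = (total - counts[None]) / total   # matched by any tier
    return strict, relaxed

# 960 NCBI mentions: 822 exact, 97 via looser tiers, 41 unmatched
tiers = ["exact"] * 822 + ["overlap"] * 60 + ["substring"] * 37 + [None] * 41
strict, relaxed = compute_recall(tiers)
print(f"strict {strict:.1%}, relaxed {relaxed:.1%}")  # strict 85.6%, relaxed 95.7%
```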

## Results

### NCBI Disease

| Model | Variant | Strict Recall | Relaxed Recall | Extra Entities |
|-------|---------|--------------|----------------|----------------|
| Gemini 3.1 Flash Lite | hil | 85.6% (822/960) | **95.7%** (919/960) | 1,046 |
| Gemini 3.1 Flash Lite | nhil | 83.1% (798/960) | **96.1%** (923/960) | 1,049 |
| GPT-4o mini | hil | 77.9% (748/960) | 93.4% (897/960) | 1,341 |
| GPT-4o mini | nhil | 80.9% (777/960) | 94.5% (907/960) | 1,192 |
| Qwen | hil | 83.5% (802/960) | **97.5%** (936/960) | 1,783 |
| Qwen | nhil | 84.1% (807/960) | **97.0%** (931/960) | 1,504 |

### S800 Species

| Model | Variant | Strict Recall | Relaxed Recall | Extra Entities |
|-------|---------|--------------|----------------|----------------|
| Gemini | nhil | **85.8%** (658/767) | **96.6%** (741/767) | 2,719 |
| GPT-4o mini | nhil | 62.5% (479/767) | 81.0% (621/767) | 3,223 |
| Qwen | nhil | 70.0% (537/767) | 90.6% (695/767) | 3,622 |

### Key Findings

1. **High recall across models**: Relaxed recall exceeds 90% for most configurations, reaching 97.5% (Qwen hil, NCBI). This means StructSense finds nearly all ground truth entities.

2. **Gemini performs best on strict matching**: Gemini achieves the highest strict recall on both datasets (85.6% NCBI, 85.8% S800), suggesting its extracted text most closely matches ground truth formatting.

3. **GPT struggles with S800 species**: GPT-4o mini drops to 62.5% strict / 81.0% relaxed on S800, likely due to abbreviated species names (e.g., `"F. graminearum"`, `"C. albicans"`) and strain identifiers (e.g., `"M2 (T)"`, `"6C (T)"`).

4. **Significant extra entities extracted**: All models extract 1,000-3,600+ entities beyond ground truth, including genes, proteins, chemicals, biological processes — demonstrating StructSense's advantage over the minimum benchmark annotation.

5. **HIL vs NHIL**: Human-in-the-loop generally provides marginal improvement on relaxed recall (e.g., Qwen: 97.5% vs 97.0%) but can slightly decrease strict recall (Qwen: 83.5% vs 84.1%) or relaxed recall (Gemini: 95.7% vs 96.1%), suggesting HIL may refine entity text in ways that diverge from ground truth formatting.

6. **Common missed entities**: Entities with unusual formatting are most commonly missed:
- Abbreviated forms: `"A-T"`, `"B-NHL"`, `"T-PLL"`, `"WT"`
- Compound entities: `"sporadic breast, brain, prostate and kidney cancer"`
- Strain identifiers: `"DSM 18155 (T)"`, `"M2 (T)"`
- Abbreviated species: `"F. graminearum"`, `"C. albicans"`

## Extra Entity Label Distribution (top labels across models)

The extra entities (beyond ground truth) break down into these categories, showing StructSense's broader extraction capability:

**NCBI**: GENE, PROTEIN, GENE_MUTATION, GENETIC_VARIANT, BIOLOGICAL_PROCESS, CHEMICAL, CELL_TYPE, METHOD

**S800**: GENE, BIOLOGICAL_PROCESS, PROTEIN, DISEASE, CHEMICAL, ORGANISM, ORG, METHOD

## Reproduction

```bash
# Evaluate all datasets and models
python evaluation/benchmark/analysis/benchmark_eval.py

# Evaluate single dataset
python evaluation/benchmark/analysis/benchmark_eval.py --dataset ncbi

# Explicit GT + result pair
python evaluation/benchmark/analysis/benchmark_eval.py \
--gt evaluation/benchmark/ncbi/NCBI_disease_test_entities_mapping.jsonl \
--result evaluation/benchmark/ncbi/results-qwen/NCBI_disease_test_text_qwen_nhil.json

# Save JSON report
python evaluation/benchmark/analysis/benchmark_eval.py -o report.json

# Verbose (show all missed entities)
python evaluation/benchmark/analysis/benchmark_eval.py --verbose
```

## Output Files

- `benchmark_eval.py` — Evaluation script (Python stdlib only, no external dependencies)
- `results_all.json` — Combined evaluation results for all datasets and models
- `results_ncbi.json` — NCBI dataset results
- `results_s800.json` — S800 dataset results
- `RESULTS.md` — This document