Commit b824a32

committed: initial code
1 parent c6e5cd2 commit b824a32

16 files changed

Lines changed: 1279 additions & 0 deletions
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
name: Python Package

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        pip install pytest
    - name: Test with pytest
      run: |
        pytest

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1 +1,4 @@
downloads/*
debug/*
*.pyc
dist/*

LICENSE

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
Copyright (c) 2025 Quan Wang

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# mdeval

A Python implementation of the NIST `md-eval.pl` script for evaluating rich transcription and speaker diarization accuracy. This tool mimics the core functionality and scoring logic of the standard Perl script used in NIST evaluations (e.g., RT-0x), focusing on Diarization Error Rate (DER).

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Command Line Interface](#command-line-interface)
  - [Python API](#python-api)
- [Input Formats](#input-formats)
  - [RTTM (Rich Transcription Time Marked)](#rttm-rich-transcription-time-marked)
  - [UEM (Un-partitioned Evaluation Map)](#uem-un-partitioned-evaluation-map)
- [Core Algorithms](#core-algorithms)
  - [Scoring Logic](#scoring-logic)
  - [Optimal Speaker Mapping](#optimal-speaker-mapping)
  - [Collars](#collars)
  - [Overlap Exclusion](#overlap-exclusion)
- [Testing](#testing)
- [Citation](#citation)

## Overview

`mdeval` calculates the Diarization Error Rate (DER) by comparing a system hypothesis (SYS) against a ground truth reference (REF). It supports:

- **Missed Speech**: Speech present in REF but not in SYS.
- **False Alarm**: Speech present in SYS but not in REF.
- **Speaker Error**: Speech assigned to the wrong speaker (after optimal mapping).
- **Collars**: Optional no-score zones around reference segment boundaries.
- **Overlap handling**: Option to exclude regions where multiple reference speakers talk simultaneously.

The goal is to provide a pure-Python, minimal-dependency alternative to the legacy Perl script for modern pipelines.

## Installation

You can install the package via pip:

```bash
pip install mdeval
```
## Usage

### Command Line Interface

The package provides a CLI entry point `mdeval`.

```bash
python3 -m mdeval.cli -r <ref_rttm> -s <sys_rttm> [options]
```

**Arguments:**

- `-r, --ref`: Path to the reference RTTM file (required).
- `-s, --sys`: Path to the system/hypothesis RTTM file (required).
- `-u, --uem`: Path to the UEM file defining evaluation regions (optional; if omitted, the valid region is inferred from the reference RTTM).
- `-c, --collar`: Collar size in seconds (float, default: 0.0). A "no-score" zone of +/- `collar` seconds is applied around every reference segment boundary.
- `-1, --single-speaker`: Limit scoring to single-speaker regions only (ignore overlaps in REF). This is equivalent to "overlap exclusion".

**Example:**

```bash
python3 -m mdeval.cli -r ref.rttm -s hyp.rttm -c 0.25
```

### Python API

You can use the scoring logic programmatically:

```python
from mdeval.io import load_rttm, load_uem
from mdeval.scoring import score_speaker_diarization
from mdeval.utils import Segment

# Load data
ref_data = load_rttm('ref.rttm')
sys_data = load_rttm('sys.rttm')

# Define evaluation map (or infer it)
# uem_eval = [Segment(0.0, 100.0)]
# Or load:
# uem_data = load_uem('test.uem')
# uem_eval = uem_data['file1']['1']

# Parse specific file/channel data
ref_spkrs = {}  # ... extract from ref_data['file1']['1']['SPEAKER']
sys_spkrs = {}  # ... extract from sys_data['file1']['1']['SPEAKER']

# Score
stats, mapping = score_speaker_diarization(
    'file1', '1',
    ref_spkrs, sys_spkrs,
    uem_eval,
    collar=0.25,
    ignore_overlap=False
)

# DER = (missed + false alarm + speaker error) / scored speaker time
der = (stats['MISSED_SPEAKER'] + stats['FALARM_SPEAKER'] +
       stats['SPEAKER_ERROR']) / stats['SCORED_SPEAKER']
print(f"DER: {der:.4f}")
```
## Input Formats

### RTTM (Rich Transcription Time Marked)

Format used for both reference and system inputs: a space-delimited text file. Lines starting with `;` or `#` are ignored.

**Required Columns (indices 0-8):**

1. **TYPE**: Segment type (must be `SPEAKER` to be scored).
2. **FILE**: File name / recording ID.
3. **CHNL**: Channel ID (e.g., `1`).
4. **TBEG**: Start time in seconds (float).
5. **TDUR**: Duration in seconds (float).
6. **ORTHO**: Orthography field (ignored/placeholder, e.g., `<NA>`).
7. **STYPE**: Subtype (ignored/placeholder, e.g., `<NA>`).
8. **NAME**: Speaker name/ID.
9. **CONF**: Confidence score (ignored/placeholder, e.g., `<NA>`).

Any trailing fields beyond index 8 (such as the final `<NA>` in the example below) are ignored.

**Example:**
```
SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA> <NA>
SPEAKER file1 1 5.00 3.00 <NA> <NA> spk2 <NA> <NA>
```
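The column layout above can be tokenized in a few lines of Python. This is a minimal sketch; `parse_rttm_line` and its dict keys are this example's own, not the package's internal API:

```python
def parse_rttm_line(line):
    """Parse one RTTM line into a dict (illustrative sketch only)."""
    if line.startswith((';', '#')):
        return None  # comment line, ignored
    fields = line.split()
    tbeg = float(fields[3])  # TBEG (index 3)
    tdur = float(fields[4])  # TDUR (index 4)
    return {
        'TYPE': fields[0],
        'FILE': fields[1],
        'CHNL': fields[2],
        'TBEG': tbeg,
        'TDUR': tdur,
        'SPKR': fields[7],    # NAME (index 7)
        'TEND': tbeg + tdur,  # derived end time
    }

seg = parse_rttm_line('SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA> <NA>')
# seg['SPKR'] == 'spk1', seg['TEND'] == 5.0
```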
### UEM (Un-partitioned Evaluation Map)

Defines the time regions that should be evaluated; regions outside the UEM are ignored. Space-delimited text file.

**Required Columns:**

1. **FILE**: File name.
2. **CHNL**: Channel ID.
3. **TBEG**: Start time of valid region.
4. **TEND**: End time of valid region.

**Example:**
```
file1 1 0.00 100.00
file1 1 120.00 300.00
```
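The format can be read into a nested dict in a few lines. This is a sketch; `load_uem_simple` is this example's own name, not necessarily the behavior of the package's `load_uem`:

```python
from collections import defaultdict

def load_uem_simple(lines):
    """Minimal UEM reader: returns {file: {chnl: [(tbeg, tend), ...]}}."""
    uem = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line or line.startswith((';', '#')):
            continue  # skip blanks and comments
        f, c, tbeg, tend = line.split()
        uem[f][c].append((float(tbeg), float(tend)))
    return uem

uem = load_uem_simple(['file1 1 0.00 100.00', 'file1 1 120.00 300.00'])
total = sum(e - b for b, e in uem['file1']['1'])  # 280.0 secs of evaluated time
```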
## Core Algorithms

### Scoring Logic

The scoring is segment-based (time-weighted).

1. **Metric**: Diarization Error Rate (DER).
   $$ DER = \frac{\text{Missed Speaker Time} + \text{False Alarm Speaker Time} + \text{Speaker Error Time}}{\text{Total Scored Speaker Time}} $$
2. **Segmentation**: The timeline is split into contiguous segments where the set of reference and system speakers remains constant.
3. **Intersection**: For each segment, the number of reference speakers ($N_{ref}$) and system speakers ($N_{sys}$) is compared.
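The DER formula amounts to a one-line computation once the four time totals are known (function and variable names here are illustrative):

```python
def der(missed, falarm, spkr_err, scored_speaker_time):
    """DER as defined above: total error time over total scored speaker time."""
    return (missed + falarm + spkr_err) / scored_speaker_time

# e.g., 2s missed + 1s false alarm + 3s confusion over 60s of scored speaker time
rate = der(2.0, 1.0, 3.0, 60.0)  # 0.1, i.e. 10% DER
```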
### Optimal Speaker Mapping

Since system speaker labels (e.g., "sys01") do not match reference labels (e.g., "spk01"), a global 1-to-1 mapping is computed to minimize error.

- We compute an overlap matrix between every reference speaker and every system speaker over the entire valid UEM duration.
- The **Hungarian algorithm** (implemented purely in Python; no `scipy` dependency required) is used to find the optimal assignment that maximizes total overlap time.
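The objective can be illustrated with a brute-force search over permutations (fine for toy sizes; the package itself uses the Hungarian algorithm for efficiency). `best_mapping` and the toy overlap matrix are this sketch's own, and it assumes no more reference speakers than system speakers:

```python
from itertools import permutations

def best_mapping(overlap):
    """Find the 1-to-1 ref->sys mapping maximizing total overlap time.

    Brute force over permutations, to illustrate the objective the
    Hungarian algorithm optimizes. Assumes len(refs) <= len(syss).
    """
    refs = sorted(overlap)
    syss = sorted({s for row in overlap.values() for s in row})
    best, best_score = {}, -1.0
    for perm in permutations(syss, len(refs)):
        score = sum(overlap[r].get(s, 0.0) for r, s in zip(refs, perm))
        if score > best_score:
            best, best_score = dict(zip(refs, perm)), score
    return best, best_score

# Toy overlap matrix: seconds of shared speech between each ref/sys pair.
overlap = {'spk1': {'sys01': 9.0, 'sys02': 1.0},
           'spk2': {'sys01': 2.0, 'sys02': 7.0}}
mapping, score = best_mapping(overlap)  # {'spk1': 'sys01', 'spk2': 'sys02'}, 16.0
```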
### Collars

When `collar > 0`, a "no-score" zone is applied:

- For every segment boundary in the **reference** RTTM, a region of $t \pm \text{collar}$ is removed from the UEM.
- This accounts for uncertainty in human annotation of segment boundaries.
- **Note**: The Python implementation follows the logic of `md-eval.pl`'s `add_collars_to_uem` subroutine, using a counter-based approach to subtract the union of all collar regions from the scoring UEM.
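A simplified interval-subtraction sketch, assuming UEM regions are plain `(start, end)` tuples; the actual implementation uses the counter-based sweep from `md-eval.pl`, which handles overlapping collars the same way:

```python
def apply_collars(uem, boundaries, collar):
    """Remove [t - collar, t + collar] around each reference boundary
    from the UEM (a list of (start, end) tuples). Illustrative sketch."""
    no_score = sorted((t - collar, t + collar) for t in boundaries)
    result = []
    for start, end in uem:
        for lo, hi in no_score:
            if hi <= start or lo >= end:
                continue  # collar does not touch this region
            if lo > start:
                result.append((start, lo))  # keep the part before the collar
            start = max(start, hi)          # resume scoring after the collar
            if start >= end:
                break
        if start < end:
            result.append((start, end))
    return result

# One reference boundary at t=5.0 with a 0.25s collar:
regions = apply_collars([(0.0, 10.0)], [5.0], 0.25)  # [(0.0, 4.75), (5.25, 10.0)]
```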
### Overlap Exclusion

If enabled (via `-1` / `--single-speaker`), regions where **two or more** reference speakers are speaking simultaneously are removed from the UEM.

- This allows evaluation of systems that only output single-speaker segments.
- **Note**: In the Perl script, overlap exclusion is applied *before* collars, but both operations simply subtract time from the valid UEM.
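The regions to exclude can be found with a simple event sweep. This is a sketch, not the package's actual code, and it assumes reference segments are given as `(tbeg, tend)` pairs:

```python
def overlap_regions(ref_segments):
    """Return (start, end) regions where two or more reference speakers
    are active, using an event sweep over segment boundaries."""
    events = []
    for tbeg, tend in ref_segments:
        events.append((tbeg, +1))  # a speaker starts
        events.append((tend, -1))  # a speaker stops
    events.sort()  # ends sort before starts at the same instant
    regions, count, start = [], 0, None
    for t, delta in events:
        count += delta
        if count >= 2 and start is None:
            start = t               # overlap begins
        elif count < 2 and start is not None:
            if t > start:
                regions.append((start, t))  # overlap ends
            start = None
    return regions

# spk1 talks 0-5s, spk2 talks 3-8s -> the 3-5s overlap would be excluded
overlaps = overlap_regions([(0.0, 5.0), (3.0, 8.0)])  # [(3.0, 5.0)]
```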
## Testing

The package includes unit tests using Python's `unittest` framework.

Run tests via:

```bash
python3 -m unittest discover tests
```
## Citation

We developed this package as part of the following work:

```
@inproceedings{wang2018speaker,
  title={{Speaker Diarization with LSTM}},
  author={Wang, Quan and Downey, Carlton and Wan, Li and Mansfield, Philip Andrew and Moreno, Ignacio Lopez},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5239--5243},
  year={2018},
  organization={IEEE}
}

@inproceedings{xia2022turn,
  title={{Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection}},
  author={Wei Xia and Han Lu and Quan Wang and Anshuman Tripathi and Yiling Huang and Ignacio Lopez Moreno and Hasim Sak},
  booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={8077--8081},
  year={2022},
  organization={IEEE}
}

@article{wang2022highly,
  title={Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering},
  author={Quan Wang and Yiling Huang and Han Lu and Guanlong Zhao and Ignacio Lopez Moreno},
  journal={arXiv:2210.13690},
  year={2022}
}
```

mdeval/__init__.py

Whitespace-only changes.

mdeval/cli.py

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
import argparse
import sys
import os
from typing import List
from .io import load_rttm, load_uem
from .scoring import score_speaker_diarization
from .utils import Segment

def main():
    parser = argparse.ArgumentParser(description='Python implementation of NIST md-eval.pl')
    parser.add_argument('-r', '--ref', required=True, help='Reference RTTM file')
    parser.add_argument('-s', '--sys', required=True, help='System RTTM file')
    parser.add_argument('-u', '--uem', help='UEM file (Evaluation Partition)')
    parser.add_argument('-c', '--collar', type=float, default=0.0, help='No-score collar around reference boundaries (seconds)')
    parser.add_argument('-1', '--single-speaker', action='store_true', dest='single_speaker', help='Limit scoring to single-speaker regions')
    # Add other flags as needed

    args = parser.parse_args()

    # Load Data
    ref_data = load_rttm(args.ref)
    sys_data = load_rttm(args.sys)

    uem_data = None
    if args.uem:
        uem_data = load_uem(args.uem)

    # Process each file found in REF
    files = sorted(ref_data.keys())

    # Accumulate global scores
    total_stats = {
        'EVAL_TIME': 0.0,
        'EVAL_SPEECH': 0.0,
        'SCORED_TIME': 0.0,
        'SCORED_SPEECH': 0.0,
        'MISSED_SPEECH': 0.0,
        'FALARM_SPEECH': 0.0,
        'SCORED_SPEAKER': 0.0,
        'MISSED_SPEAKER': 0.0,
        'FALARM_SPEAKER': 0.0,
        'SPEAKER_ERROR': 0.0,
        'SCORED_WORDS': 0,  # Placeholder
        'EVAL_WORDS': 0
    }

    # TODO: Output header matching md-eval.pl

    for file in files:
        if file not in sys_data:
            print(f"Warning: File {file} found in REF but not in SYS. Skipping.", file=sys.stderr)
            continue

        chnls = sorted(ref_data[file].keys())
        for chnl in chnls:
            if chnl not in sys_data[file]:
                print(f"Warning: Channel {chnl} for file {file} found in REF but not in SYS. Skipping.", file=sys.stderr)
                continue

            # Determine UEM
            if uem_data and file in uem_data and chnl in uem_data[file]:
                uem_eval = uem_data[file][chnl]
            else:
                # Infer UEM from REF RTTM (min TBEG, max TEND)
                min_t = 1e30
                max_t = 0
                found_seg = False
                for seg in ref_data[file][chnl]['SPEAKER']:
                    min_t = min(min_t, seg['TBEG'])
                    max_t = max(max_t, seg['TEND'])
                    found_seg = True
                if not found_seg:
                    # Try other types? For now just SPEAKER
                    pass
                if max_t > min_t:
                    uem_eval = [Segment(min_t, max_t)]
                else:
                    uem_eval = []

            # Determine REF/SYS inputs
            # Group by speaker
            # Expected format for scoring: {spkr: [{TBEG, TDUR, TEND, ...}]}
            curr_ref = {}
            for seg in ref_data[file][chnl]['SPEAKER']:
                s = seg['SPKR']
                if s not in curr_ref: curr_ref[s] = []
                curr_ref[s].append(seg)

            curr_sys = {}
            if 'SPEAKER' in sys_data[file][chnl]:
                for seg in sys_data[file][chnl]['SPEAKER']:
                    s = seg['SPKR']
                    if s not in curr_sys: curr_sys[s] = []
                    curr_sys[s].append(seg)

            file_stats, _ = score_speaker_diarization(file, chnl, curr_ref, curr_sys, uem_eval, args.collar, args.single_speaker)

            # Add to totals
            for k in total_stats:
                if k in file_stats:
                    total_stats[k] += file_stats[k]

    # Print simplified output
    print_scores("ALL", total_stats)

def print_scores(condition, scores):
    print(f"\n*** Performance analysis for Speaker Diarization for {condition} ***\n")

    eval_time = scores['EVAL_TIME']
    eval_speech = scores['EVAL_SPEECH']
    scored_time = scores['SCORED_TIME']
    scored_speech = scores['SCORED_SPEECH']

    print(f" EVAL TIME = {eval_time:10.2f} secs")
    print(f" EVAL SPEECH = {eval_speech:10.2f} secs ({100*eval_speech/eval_time if eval_time else 0:5.1f} percent of evaluated time)")
    print(f" SCORED TIME = {scored_time:10.2f} secs ({100*scored_time/eval_time if eval_time else 0:5.1f} percent of evaluated time)")
    print(f"SCORED SPEECH = {scored_speech:10.2f} secs ({100*scored_speech/scored_time if scored_time else 0:5.1f} percent of scored time)")
    print(f" EVAL WORDS = {scores['EVAL_WORDS']:7d}")
    print(f" SCORED WORDS = {scores['SCORED_WORDS']:7d} (100.0 percent of evaluated words)")
    print("---------------------------------------------")
    print(f"MISSED SPEECH = {scores['MISSED_SPEECH']:10.2f} secs ({100*scores['MISSED_SPEECH']/scored_time if scored_time else 0:5.1f} percent of scored time)")
    print(f"FALARM SPEECH = {scores['FALARM_SPEECH']:10.2f} secs ({100*scores['FALARM_SPEECH']/scored_time if scored_time else 0:5.1f} percent of scored time)")
    print(" MISSED WORDS = 0 (100.0 percent of scored words)")
    print("---------------------------------------------")
    print(f"SCORED SPEAKER TIME = {scores['SCORED_SPEAKER']:10.2f} secs ({100*scores['SCORED_SPEAKER']/scored_speech if scored_speech else 0:5.1f} percent of scored speech)")
    print(f"MISSED SPEAKER TIME = {scores['MISSED_SPEAKER']:10.2f} secs ({100*scores['MISSED_SPEAKER']/scores['SCORED_SPEAKER'] if scores['SCORED_SPEAKER'] else 0:5.1f} percent of scored speaker time)")
    print(f"FALARM SPEAKER TIME = {scores['FALARM_SPEAKER']:10.2f} secs ({100*scores['FALARM_SPEAKER']/scores['SCORED_SPEAKER'] if scores['SCORED_SPEAKER'] else 0:5.1f} percent of scored speaker time)")
    print(f" SPEAKER ERROR TIME = {scores['SPEAKER_ERROR']:10.2f} secs ({100*scores['SPEAKER_ERROR']/scores['SCORED_SPEAKER'] if scores['SCORED_SPEAKER'] else 0:5.1f} percent of scored speaker time)")
    print("SPEAKER ERROR WORDS = 0 (100.0 percent of scored speaker words)")
    print("---------------------------------------------")
    der = (scores['MISSED_SPEAKER'] + scores['FALARM_SPEAKER'] + scores['SPEAKER_ERROR']) / scores['SCORED_SPEAKER'] if scores['SCORED_SPEAKER'] else 0
    print(f" OVERALL SPEAKER DIARIZATION ERROR = {100*der:5.2f} percent of scored speaker time `({condition})")
    print("---------------------------------------------")

if __name__ == '__main__':
    main()
