# mdeval

A Python implementation of the NIST `md-eval.pl` script for evaluating rich transcription and speaker diarization accuracy. This tool mimics the core functionality and scoring logic of the standard Perl script used in NIST evaluations (e.g., RT-0x), focusing on Diarization Error Rate (DER).

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Command Line Interface](#command-line-interface)
  - [Python API](#python-api)
- [Input Formats](#input-formats)
  - [RTTM (Rich Transcription Time Marked)](#rttm-rich-transcription-time-marked)
  - [UEM (Un-partitioned Evaluation Map)](#uem-un-partitioned-evaluation-map)
- [Core Algorithms](#core-algorithms)
  - [Scoring Logic](#scoring-logic)
  - [Optimal Speaker Mapping](#optimal-speaker-mapping)
  - [Collars](#collars)
  - [Overlap Exclusion](#overlap-exclusion)
- [Testing](#testing)
- [Citation](#citation)

## Overview

`mdeval` calculates the Diarization Error Rate (DER) by comparing a system hypothesis (SYS) against a ground truth reference (REF). It supports:

- **Missed Speech**: Speech present in REF but not in SYS.
- **False Alarm**: Speech present in SYS but not in REF.
- **Speaker Error**: Speech assigned to the wrong speaker (after optimal mapping).
- **Collars**: Optional no-score zones around reference segment boundaries.
- **Overlap handling**: Option to exclude regions where multiple reference speakers talk simultaneously.

The goal is to provide a pure Python, dependency-free (or minimal-dependency) alternative to the legacy Perl script for modern pipelines.

## Installation

You can install the package via pip:

```bash
pip install mdeval
```

## Usage

### Command Line Interface

The package provides a CLI entry point `mdeval`.

```bash
python3 -m mdeval.cli -r <ref_rttm> -s <sys_rttm> [options]
```

**Arguments:**

- `-r, --ref`: Path to the Reference RTTM file (Required).
- `-s, --sys`: Path to the System/Hypothesis RTTM file (Required).
- `-u, --uem`: Path to the UEM file defining evaluation regions (Optional. If omitted, the valid region is inferred from the Reference RTTM).
- `-c, --collar`: Collar size in seconds (Float, default: 0.0). A "no-score" zone of +/- `collar` seconds is applied around every reference segment boundary.
- `-1, --single-speaker`: Limit scoring to single-speaker regions only (ignore overlaps in REF). This is equivalent to "Overlap Exclusion".

**Example:**

```bash
python3 -m mdeval.cli -r ref.rttm -s hyp.rttm -c 0.25
```

### Python API

You can use the scoring logic programmatically:

```python
from mdeval.io import load_rttm, load_uem
from mdeval.scoring import score_speaker_diarization
from mdeval.utils import Segment

# Load Data
ref_data = load_rttm('ref.rttm')
sys_data = load_rttm('sys.rttm')

# Define Evaluation Map (or infer it)
# uem_eval = [Segment(0.0, 100.0)]
# Or load:
# uem_data = load_uem('test.uem')
# uem_eval = uem_data['file1']['1']

# Parse specific file/channel data
ref_spkrs = {}  # ... extract from ref_data['file1']['1']['SPEAKER']
sys_spkrs = {}  # ... extract from sys_data['file1']['1']['SPEAKER']

# Score
stats, mapping = score_speaker_diarization(
    'file1', '1',
    ref_spkrs, sys_spkrs,
    uem_eval,
    collar=0.25,
    ignore_overlap=False
)

# Numerator of the DER: total error time. Divide by the scored
# speaker time reported by the scorer to obtain the rate.
error_time = stats['MISSED_SPEAKER'] + stats['FALARM_SPEAKER'] + stats['SPEAKER_ERROR']
print(f"Total error time: {error_time:.2f} s")
```

## Input Formats

### RTTM (Rich Transcription Time Marked)

Format used for both Reference and System inputs.
Space-delimited text file. Lines starting with `;` or `#` are ignored.

**Required Columns (indices 0-8):**

1. **TYPE**: Segment type (must be `SPEAKER` to be scored).
2. **FILE**: File name / Recording ID.
3. **CHNL**: Channel ID (e.g., `1`).
4. **TBEG**: Start time in seconds (float).
5. **TDUR**: Duration in seconds (float).
6. **ORTHO**: Orthography field (ignored/placeholder, e.g., `<NA>`).
7. **STYPE**: Subtype (ignored/placeholder, e.g., `<NA>`).
8. **NAME**: Speaker Name/ID.
9. **CONF**: Confidence score (ignored/placeholder, e.g., `<NA>`).

**Example:**
```
SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA>
SPEAKER file1 1 5.00 3.00 <NA> <NA> spk2 <NA>
```
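
As an illustration of the column layout above, a minimal sketch of parsing `SPEAKER` records (this is not the package's `load_rttm`, whose output structure may differ):

```python
def parse_rttm_lines(lines):
    """Return (file, chnl, tbeg, tdur, speaker) tuples for SPEAKER records."""
    segments = []
    for line in lines:
        line = line.strip()
        # Lines starting with ';' or '#' are comments; blank lines are skipped.
        if not line or line.startswith((';', '#')):
            continue
        fields = line.split()
        if fields[0] != 'SPEAKER':
            continue  # only SPEAKER records are scored
        file_id, chnl = fields[1], fields[2]
        tbeg, tdur = float(fields[3]), float(fields[4])
        speaker = fields[7]  # NAME column
        segments.append((file_id, chnl, tbeg, tdur, speaker))
    return segments

segs = parse_rttm_lines([
    "SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA>",
    "; a comment",
])
print(segs)  # [('file1', '1', 0.0, 5.0, 'spk1')]
```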

### UEM (Un-partitioned Evaluation Map)

Defines the time regions that should be evaluated. Regions outside the UEM are ignored.
Space-delimited text file.

**Required Columns:**

1. **FILE**: File name.
2. **CHNL**: Channel ID.
3. **TBEG**: Start time of valid region.
4. **TEND**: End time of valid region.

**Example:**
```
file1 1 0.00 100.00
file1 1 120.00 300.00
```
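
A corresponding sketch for reading a UEM into per-(file, channel) regions (illustrative only; the package's `load_uem` may return a different structure):

```python
from collections import defaultdict

def parse_uem_lines(lines):
    """Map (file, chnl) -> list of (tbeg, tend) evaluation regions."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith((';', '#')):
            continue
        file_id, chnl, tbeg, tend = line.split()
        regions[(file_id, chnl)].append((float(tbeg), float(tend)))
    return dict(regions)

uem = parse_uem_lines(["file1 1 0.00 100.00", "file1 1 120.00 300.00"])
print(uem)  # {('file1', '1'): [(0.0, 100.0), (120.0, 300.0)]}
```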

## Core Algorithms

### Scoring Logic

The scoring is segment-based (time-weighted).

1. **Metric**: Diarization Error Rate (DER).
   $$ DER = \frac{\text{Missed Speaker Time} + \text{False Alarm Speaker Time} + \text{Speaker Error Time}}{\text{Total Scored Speaker Time}} $$
2. **Segmentation**: The timeline is split into contiguous segments where the set of reference and system speakers remains constant.
3. **Intersection**: For each segment, the number of reference speakers ($N_{ref}$) and system speakers ($N_{sys}$) is compared to apportion missed, false alarm, and speaker error time.
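
Plugging hypothetical component times into the DER formula:

```python
# Assumed component times in seconds, purely for illustration.
missed, falarm, spk_err = 2.0, 1.5, 3.5
scored_speaker_time = 100.0  # total reference speaker time in the scored region

der = (missed + falarm + spk_err) / scored_speaker_time
print(f"DER = {der:.1%}")  # DER = 7.0%
```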

### Optimal Speaker Mapping

Since System speaker labels (e.g., "sys01") do not match Reference labels (e.g., "spk01"), a global 1-to-1 mapping is computed to minimize error.

- We compute an overlap matrix between every reference speaker and every system speaker over the entire valid UEM duration.
- The **Hungarian Algorithm** (implemented purely in Python, no `scipy` dependency required) is used to find the optimal assignment that maximizes total overlap time.
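
For intuition, a brute-force equivalent of that assignment problem (hypothetical overlap values; the package itself uses the Hungarian algorithm, which scales far better than enumerating permutations):

```python
from itertools import permutations

def best_mapping(ref_spkrs, sys_spkrs, overlap):
    """Exhaustively find the 1-to-1 mapping maximizing total overlap time.

    overlap[r][s] = seconds ref speaker r and sys speaker s are both active.
    Assumes len(sys_spkrs) >= len(ref_spkrs).
    """
    best, best_total = {}, -1.0
    for perm in permutations(sys_spkrs, len(ref_spkrs)):
        total = sum(overlap[r][s] for r, s in zip(ref_spkrs, perm))
        if total > best_total:
            best_total, best = total, dict(zip(ref_spkrs, perm))
    return best, best_total

# Hypothetical overlap matrix for two reference and two system speakers.
overlap = {'spk1': {'sys01': 40.0, 'sys02': 5.0},
           'spk2': {'sys01': 2.0, 'sys02': 30.0}}
mapping, total = best_mapping(['spk1', 'spk2'], ['sys01', 'sys02'], overlap)
print(mapping, total)  # {'spk1': 'sys01', 'spk2': 'sys02'} 70.0
```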

### Collars

When `collar > 0`, a "no-score" zone is applied.

- For every segment boundary in the **Reference** RTTM, a region of $t \pm collar$ is removed from the UEM.
- This accounts for uncertainty in the human-annotated segment boundaries.
- **Note**: The Python implementation follows the logic of `md-eval.pl`'s `add_collars_to_uem` subroutine, using a counter-based approach to subtract the union of all collar regions from the scoring UEM.
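
A simplified sketch of the effect, using plain interval subtraction rather than the counter-based approach of `md-eval.pl` (UEM regions assumed to be `(start, end)` tuples):

```python
def subtract_interval(regions, lo, hi):
    """Remove the interval [lo, hi] from each (start, end) region."""
    out = []
    for s, e in regions:
        if hi <= s or lo >= e:   # no overlap with this region
            out.append((s, e))
            continue
        if s < lo:               # keep the remainder left of the collar
            out.append((s, lo))
        if hi < e:               # keep the remainder right of the collar
            out.append((hi, e))
    return out

def apply_collars(uem, boundaries, collar):
    for t in boundaries:
        uem = subtract_interval(uem, t - collar, t + collar)
    return uem

# One reference segment spanning 0-5 s, with a 0.25 s collar at each boundary:
print(apply_collars([(0.0, 100.0)], [0.0, 5.0], 0.25))
# [(0.25, 4.75), (5.25, 100.0)]
```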

### Overlap Exclusion

If enabled (via `-1` / `--single-speaker`), regions where **two or more** Reference speakers are speaking simultaneously are removed from the UEM.

- This allows evaluation of systems that only output single-speaker segments.
- **Note**: In the Perl script, overlap exclusion is applied *before* collars, but both operations simply subtract time from the valid UEM.
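
The overlapped regions to subtract can be found by counting boundary events, as in this illustrative sketch:

```python
def overlap_regions(segments):
    """Return regions where >= 2 reference turns are simultaneously active.

    segments: list of (start, end) tuples, one per reference speaker turn.
    """
    events = []
    for s, e in segments:
        events.append((s, +1))  # a speaker becomes active
        events.append((e, -1))  # a speaker goes silent
    events.sort()
    regions, active, start = [], 0, None
    for t, delta in events:
        active += delta
        if active >= 2 and start is None:
            start = t               # overlap region begins
        elif active < 2 and start is not None:
            regions.append((start, t))  # overlap region ends
            start = None
    return regions

# spk1 talks 0-5 s, spk2 talks 4-8 s -> overlap from 4 s to 5 s
print(overlap_regions([(0.0, 5.0), (4.0, 8.0)]))  # [(4.0, 5.0)]
```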

## Testing

The package includes unit tests using Python's `unittest` framework.

Run tests via:

```bash
python3 -m unittest discover tests
```

## Citation

We developed this package as part of the following work:

```
@inproceedings{wang2018speaker,
  title={{Speaker Diarization with LSTM}},
  author={Wang, Quan and Downey, Carlton and Wan, Li and Mansfield, Philip Andrew and Moreno, Ignacio Lopez},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5239--5243},
  year={2018},
  organization={IEEE}
}

@inproceedings{xia2022turn,
  title={{Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection}},
  author={Wei Xia and Han Lu and Quan Wang and Anshuman Tripathi and Yiling Huang and Ignacio Lopez Moreno and Hasim Sak},
  booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={8077--8081},
  year={2022},
  organization={IEEE}
}

@article{wang2022highly,
  title={Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering},
  author={Quan Wang and Yiling Huang and Han Lu and Guanlong Zhao and Ignacio Lopez Moreno},
  journal={arXiv:2210.13690},
  year={2022}
}
```