Releases: fiberseq/FiberHMM
v2.10.5: fix silent extract segfault on unusual MM/ML layouts
Bug fix
fiberhmm-extract --all produced silently empty output on certain BAMs from pangenome-graph-surjection pipelines (e.g. vg surject onto linear reference). Root cause was a C-level segfault inside pysam.AlignedSegment.modified_bases on reads with unusual MM/ML layouts — invisible because the SIGSEGV kills the worker before Python's exception handler can run, and the main process reports it blandly as "A process in the process pool was terminated abruptly" with zero rows emitted per region.
Localized via faulthandler: segfault at _extract_m6a line 505 during read.modified_bases. pysam has had this class of bug intermittently for years; we already worked around it in v2.9.0 for the apply/call path by using our own numpy MM/ML parser (fiberhmm.core.bam_reader.parse_mm_ml_per_mod_type, validated against pysam on 500+ real reads).
Fix
Swap read.modified_bases for the safe parser in all three extract sites:
_extract_m6a_extract_m5c_extract_deampriority-1 (MM/MLudU code)
No API change; output format unchanged. Single-char mod codes work identically. The rare ChEBI 55797 numeric code is no longer auto-detected in extract (explicitly unsupported — users needing it can rely on the R/Y or MD fallback paths, or run samtools calmd + fiberhmm-daf-encode).
Side fix: worker exception handler now flushes stderr explicitly so Python-level errors surface during multiprocessing crashes instead of getting buffer-swallowed.
Verified on real surjected fiber-seq CRAMs
Full pipeline: CRAM → BAM → fiberhmm-call → ft fire → fiberhmm-extract --all -c 8 --block-scores
Both of two test samples now produce full output across all 5 tracks (footprint / msp / tf / m6a / m5c), each around 1.8M rows. One of the two samples was silently emitting zero rows on v2.10.4 with identical config; v2.10.5 fixes it.
Tests
292 pass. Three deam priority-1 tests rewritten to construct real MM/ML tag strings instead of faking modified_bases dicts.
Install
pip install --upgrade fiberhmm
v2.10.4: auto-skip --deam on fiber-seq BAMs
Bug fix
fiberhmm-extract --all (default mode) now auto-skips --deam when the input BAM looks like fiber-seq.
What was happening
On vg-surjected fiber-seq CRAMs (pangenome-graph aligned, projected onto linear reference):
--allmode ran every track including--deam— even on fiber-seq data that has no deaminations (Hia5 is a methyltransferase).--deampath-3 (MD walker) fell through on every read because paths 1 (MM/MLucode) and 2 (R/Y) had nothing to match.- Surjected BAMs frequently have MDs that disagree with CIGAR on ~97% of reads (documented upstream issue in
vg surject). - pysam's
get_aligned_pairs(with_seq=True)on bad MD raises AssertionError AND corrupts malloc state before propagating. - A later allocation in the worker segfaults. Python's
finallyblock doesn't run on segfault, so buffered m6a/m5c BED writes are lost. - Result:
--deamis empty, andm6a/m5cfiles are also empty or missing entirely.
Fix
Auto-skip --deam on BAMs where:
- MM/ML carries m6A (
A+a/T-a) or 5mC (C+m) subtypes, AND - no dU code (
+u/+55797), AND - no R/Y IUPAC in the query sequence
When triggered, prints:
Auto-skipping --deam: BAM looks like fiber-seq (MM/ML has m6A/5mC subtypes,
no dU code, no R/Y). Hia5 is a methyltransferase, not a deaminase --
no deamination calls to extract. Pass --deam explicitly to override.
Explicit --deam is still honored — the auto-skip only fires in --all / default mode.
Verified
Tested on a 5 Mb chr1 slice of a real surjected fiber-seq CRAM: 29,679 reads → 22.8M m6a + 4.3M m5c + 1.9M footprint + 1.9M msp positions, zero crashes, all bigBeds produced. Pre-fix this would have segfaulted mid-region and lost m6a/m5c writes.
Tests
294 pass. 8 new cover the fiber-seq signature detection across PacBio modern MM/ML, SAM 1.7 suffix variants (A+a., C+m?), DAF IUPAC, modkit dU, ChEBI 55797, and mixed cases.
Install
pip install --upgrade fiberhmm
v2.10.3: README promotes one-step DAF pipeline (docs-only)
Docs-only release
Three README updates so the documentation matches the v2.10.0+ reality:
-
Streaming Pipelines section — replaced the stale DAF example (
minimap2 | daf-encode | apply) with the one-pass form:minimap2 --MD -a ref.fa reads.fq | samtools view -b | \ fiberhmm-call --mode daf --enzyme dddb -i - -o - | \ ft fire - final.bam
-
fiberhmm-daf-encodeCLI reference — added an "Optional in v2.10.0+" banner. The tool still exists and still works; most users just don't need it anymore. -
Tag reference — clarified that
st:Zis written only byfiberhmm-daf-encode(the two-pass path), not byfiberhmm-call --mode daf(the one-pass path derives strand internally from MD).
No code changes. No behavior changes. 292 tests still pass.
Install
pip install --upgrade fiberhmm
v2.10.2: surface stale-MD warning (extract + call)
What's new
Both fiberhmm-extract and fiberhmm-call --mode daf now print a clear note when they detect MD tags that disagree with their CIGAR (common on consensus BAMs where MD is stale after CIGAR was recomputed). Example:
NOTE: 11/20 of the sniffed reads have MD tags that
disagree with their CIGAR (typical of consensus BAMs where
MD was inherited stale after CIGAR was recomputed).
These reads will be SKIPPED for --deam only; other tracks
extract normally. The reads themselves are not corrupt --
the CIGAR is still correct, only the MD annotation is stale.
To recover --deam calls for those reads, regenerate MD with:
samtools calmd -b aligned.bam ref.fa > fixed.bam
v2.10.1 silently skipped those reads. That left users unsure whether their data was corrupt and whether anything had been dropped. v2.10.2 makes it explicit, up-front, and actionable.
Warning only fires when --deam actually depends on MD (no R/Y in sequence, no MM/ML dU code, no --reference on the call side) AND at least one sniffed read has the mismatch. Free — reuses the v2.10.1 validator. No code path changes beyond printing the note.
Install
pip install --upgrade fiberhmm
v2.10.1: fix malloc crash on BAMs with malformed MD tags
Bug fix
fiberhmm-extract (and fiberhmm-call --mode daf on v2.10.0) crashed on BAMs where the MD tag's encoded reference length disagreed with the CIGAR's reference-consuming op total. Classic consensus-BAM bug: MD inherits stale state after a re-alignment updates CIGAR.
Symptom on v2.9.7:
malloc(): invalid size (unsorted)
Why
pysam.get_aligned_pairs(with_seq=True) raises AssertionError on MD/CIGAR mismatch — but before raising, pysam has already corrupted its internal malloc state. So v2.9.6 crashed immediately with the assertion; v2.9.7 swallowed the assertion but the heap was already poisoned, surfacing as a malloc(): invalid size on some unrelated allocation later in the worker.
Fix
Pre-validate the MD tag against the CIGAR in pure Python before asking pysam to parse it. If they disagree, skip the get_aligned_pairs(with_seq=True) call entirely. Never triggers pysam's AssertionError path, never corrupts malloc.
Patched both call sites:
fiberhmm/cli/extract_tags.py—_extract_deampath-3fiberhmm/daf/encoder.py—get_daf_positions(v2.10.0's--mode dafMD fallback)
Workaround for affected BAMs (optional)
If you want --deam calls to populate for reads with broken MD, regenerate clean MD tags:
samtools calmd -b aligned.bam ref.fa > fixed.bam
Without this, v2.10.1 silently skips the broken reads for --deam extraction (every other track still works for those reads). With v2.10.0's --mode daf, you can also pass --reference ref.fa as a fallback.
Tests
292 passing. 3 new tests pin the MD parser, the validator, and the invariant that malformed MD bypasses the pysam call completely.
Install
pip install --upgrade fiberhmm
v2.10.0: fiberhmm-call --mode daf auto-detects MD tags (no encode step)
Highlights
fiberhmm-call --mode daf now works directly on raw DAF BAMs. The two-pass fiberhmm-daf-encode | fiberhmm-call pipeline collapses into one.
What changed
Per-read auto-detection in --mode daf. Priority:
- R/Y IUPAC codes in the stored sequence (from
fiberhmm-daf-encode) — existing fast path, unchanged. - MD tag on the aligned read — parsed on the fly, skips the encode step entirely.
--reference REF.fa— new CLI flag, fallback when the BAM has neither R/Y nor MD.
First non-empty source wins per read.
Correctness
Output is byte-identical to the two-pass pipeline. Verified on a real 3,929-read consensus DAF BAM (Christy LaFlamme's dataset):
| source | same HMM call bytes as two-pass? |
|---|---|
| ns (nucleosomes) | ✅ 0 / 3,929 diffs |
| nl (nuc lengths) | ✅ 0 / 3,929 diffs |
| as (MSPs) | ✅ 0 / 3,929 diffs |
| al (MSP lengths) | ✅ 0 / 3,929 diffs |
| nq / aq (quality) | ✅ 0 / 3,929 diffs |
| MA / AQ (spec tags) | ✅ 0 / 3,929 diffs |
And zero regression on R/Y-encoded BAMs: v2.9.6 vs v2.10.0 on the same pre-encoded BAM = 0 diffs.
New CLI
fiberhmm-call --mode daf --enzyme dddb --region-parallel \
-i raw_aligned.bam -o recalled.bam # MD-tag path
fiberhmm-call --mode daf --enzyme dddb --region-parallel \
-i raw_aligned.bam -o recalled.bam --reference ref.fa # FASTA fallback
On startup, --mode daf does a fast-fail sniff: it checks the first 10 mapped reads for R/Y, MD, or --reference. If none are available, it errors out in under a second with a clear message instead of silently skipping every read.
Backward compatibility
fiberhmm-daf-encodestays available and unchanged. Still the right tool if you want R/Y IUPAC stamped into the stored sequence for R/Y-aware downstream tools.- R/Y-encoded BAMs continue through the existing fast path (zero cost for the new MD branch — it only fires when R/Y is absent).
- No new dependencies.
Install
pip install --upgrade fiberhmm
Tests
289 passing. 4 new unit tests covering the MD-fallback branch, payload plumbing, fast-fail sniff, and byte-identity gate.
v2.9.7: fix fiberhmm-extract crash on malformed MD tags
Bug fix
fiberhmm-extract now swallows pysam AssertionError when a read carries a malformed MD tag, skipping that read's deamination extraction instead of aborting the entire region worker.
Symptom (user-reported field crash):
Traceback (most recent call last):
File ".../fiberhmm/cli/extract_tags.py", line 214, in _extract_region_worker
n_features['deam'] += _extract_deam(...)
File ".../fiberhmm/cli/extract_tags.py", line 688, in _extract_deam
pairs = read.get_aligned_pairs(with_seq=True)
...
AssertionError: Invalid MD tag: MD length 37729 mismatch with CIGAR length 43799 and 9742 insertions
Root cause: the path-3 MD walker in _extract_deam (added in v2.9.6 for DAF-seq BAMs that only carry MD mismatches, no MM/ML and no R/Y) caught ValueError/AttributeError but not AssertionError, which is what pysam throws when MD/CIGAR lengths disagree internally.
Fix: broaden the exception handler to Exception. Affected reads now skip the deam track silently; all other tracks (footprint, msp, tf, m6a, m5c) still extract for those reads.
Regression test pinned: tests/test_extract_block_scores.py::test_deam_priority3_skips_read_on_pysam_assertion_error.
Install
pip install --upgrade fiberhmm
Affects only v2.9.6 (which introduced path-3). Earlier versions never touched MD tags, so they weren't exposed to this bug.
v2.9.6: tag-presence diagnostic + MD-mismatch deam fallback
Changes
Tag diagnostic in fiberhmm-extract
Before the heavy pass, fiberhmm-extract now sniffs the first 20 mapped reads and prints a summary of which tag sources were detected plus which tracks it predicts will populate:
Tag scan (first 20 mapped reads): MA/AQ=20/20, MM/ML=20/20 [A+a], R/Y-in-seq=20/20
will populate: footprint[ns/nl], msp[as/al], tf[MA/AQ tf+ annotations], m6a[MM/ML (A+a)], deam[R/Y in sequence]
will be empty: m5c
Or on a raw DAF BAM with nothing but MD:
Tag scan (first 20 mapped reads): MD-only=20/20
will populate: deam[MD-only (path-3 fallback)]
will be empty: footprint, msp, tf, m6a, m5c
--deam path-3 MD fallback (completes FiberBrowser parity)
Third priority source for deamination calls. When MM/ML-native (u / 55797) and IUPAC R/Y paths both come up empty, _extract_deam walks get_aligned_pairs(with_seq=True) and picks out deamination-direction mismatches: ref-C→read-T (flavor 1, Y/CT-dea) and ref-G→read-A (flavor 0, R/GA-dea). Matches FiberBrowser's BAM parser priority exactly.
Cross-check: on a 3,929-read raw consensus BAM (MD-only) vs the FiberHMM-processed version of the same reads (R/Y-encoded), the two extraction paths produced 99.997% agreement — 990,170 vs 990,144 deamination calls, 25 reads differing by ≤2 calls each (edge cases at indels).
Priority ordering (unchanged in flavor codes)
- MM/ML
uor ChEBI 55797 - IUPAC R/Y in the query sequence
- MD-tag ref mismatch (new)
First non-empty source wins per read. blockMod bytes: 0 = R/GA-dea, 1 = Y/CT-dea.
Tests
3 new tests cover the MD walker, MD-absent skip, and priority ordering. Full suite: 288 passing.
Install
pip install --upgrade fiberhmm
v2.9.5: --deam matches FiberBrowser's 2-path priority
Change
fiberhmm-extract --deam now reads both canonical deamination encodings, matching the priority order FiberBrowser's BAM parser uses:
- MM/ML-native — mod code
u(single-letter dU per SAM spec) or ChEBI 55797 (numeric code for dU). Filtered by--prob-thresholdsame as--m6a/--m5c. - IUPAC R/Y in the query sequence — fallback, written by
fiberhmm-daf-encode.
First non-empty source wins per read so events are never double-counted.
blockMod flavor convention (unchanged)
| code | meaning |
|---|---|
0 |
R / GA-dea (canonical base G → U) |
1 |
Y / CT-dea (canonical base C → U) |
Same semantics whether the source was MM/ML or R/Y — downstream tools (FiberBrowser, UCSC) don't need to care which path the call came from.
Why this matters
DAF BAMs can arrive encoded in any of three conventions depending on upstream tooling. FiberHMM's own pipeline (fiberhmm-daf-encode) produces path-2 (IUPAC R/Y), but foreign BAMs from modkit-style callers carry path-1 (MM/ML u). Previously --deam only handled path 2; now a modkit-processed DAF BAM extracts cleanly too.
Path 3 (ref-mismatch via MD tag) stays browser-side — FiberBrowser already walks aligned-pairs-with-seq, and the right FiberHMM answer for raw BAMs is to run fiberhmm-daf-encode first to normalize to R/Y (which is what the HMM wants anyway).
Tests
5 new tests cover priority ordering, C→1 and G→0 flavor mapping, ChEBI 55797 numeric code, --prob-threshold filter, and R/Y fallback when MM/ML lacks dU entries. Full suite passes (285).
Install
pip install --upgrade fiberhmm
v2.9.4: rename --ry to --deam + sample name in bigBed autoSQL
Changes
1. Rename --ry → --deam
Naming consistency with --m5c/--m6a. --ry was only live for a few hours in v2.9.3 so this is functionally a rename rather than a breaking change. Updated across:
- CLI flag:
fiberhmm-extract --deam - Output filename:
*_deam.bb - autoSQL table name:
fiberhmm_deam
2. Sample name embedded in bigBed autoSQL
New --sample-name flag on fiberhmm-extract (default = BAM basename stem, e.g. Dl_recalled) prepends Sample: <name>. to the autoSQL description of every output bigBed. Lets FiberBrowser / UCSC match a bigBed to its source by reading file-level metadata rather than parsing filenames.
Example — bigBedInfo -as Dl_recalled_deam.bb now shows:
table fiberhmm_deam
"Sample: Dl_recalled. FiberHMM DAF-seq deamination calls (R/Y IUPAC codes written into
the query sequence by fiberhmm-daf-encode). ..."
The Sample: <name>. prefix is a simple regex parse for any downstream tool.
Install
pip install --upgrade fiberhmm
Tests
280 tests pass (4 new for sample_name plumbing).