Skip to content

Releases: fiberseq/FiberHMM

v2.10.5: fix silent extract segfault on unusual MM/ML layouts

24 Apr 06:43

Choose a tag to compare

Bug fix

fiberhmm-extract --all produced silently empty output on certain BAMs from pangenome-graph-surjection pipelines (e.g. vg surject onto linear reference). Root cause was a C-level segfault inside pysam.AlignedSegment.modified_bases on reads with unusual MM/ML layouts — invisible because the SIGSEGV kills the worker before Python's exception handler can run, and the main process reports it blandly as "A process in the process pool was terminated abruptly" with zero rows emitted per region.

Localized via faulthandler: segfault at _extract_m6a line 505 during read.modified_bases. pysam has had this class of bug intermittently for years; we already worked around it in v2.9.0 for the apply/call path by using our own numpy MM/ML parser (fiberhmm.core.bam_reader.parse_mm_ml_per_mod_type, validated against pysam on 500+ real reads).

Fix

Swap read.modified_bases for the safe parser in all three extract sites:

  • _extract_m6a
  • _extract_m5c
  • _extract_deam priority-1 (MM/ML u dU code)

No API change; output format unchanged. Single-char mod codes work identically. The rare ChEBI 55797 numeric code is no longer auto-detected in extract (explicitly unsupported — users needing it can rely on the R/Y or MD fallback paths, or run samtools calmd + fiberhmm-daf-encode).

Side fix: worker exception handler now flushes stderr explicitly so Python-level errors surface during multiprocessing crashes instead of getting buffer-swallowed.

Verified on real surjected fiber-seq CRAMs

Full pipeline: CRAM → BAM → fiberhmm-call → ft fire → fiberhmm-extract --all -c 8 --block-scores

Both of two test samples now produce full output across all 5 tracks (footprint / msp / tf / m6a / m5c), each around 1.8M rows. One of the two samples was silently emitting zero rows on v2.10.4 with identical config; v2.10.5 fixes it.

Tests

292 pass. Three deam priority-1 tests rewritten to construct real MM/ML tag strings instead of faking modified_bases dicts.

Install

pip install --upgrade fiberhmm

v2.10.4: auto-skip --deam on fiber-seq BAMs

24 Apr 03:44

Choose a tag to compare

Bug fix

fiberhmm-extract --all (default mode) now auto-skips --deam when the input BAM looks like fiber-seq.

What was happening

On vg-surjected fiber-seq CRAMs (pangenome-graph aligned, projected onto linear reference):

  1. --all mode ran every track including --deam — even on fiber-seq data that has no deaminations (Hia5 is a methyltransferase).
  2. --deam path-3 (MD walker) fell through on every read because paths 1 (MM/ML u code) and 2 (R/Y) had nothing to match.
  3. Surjected BAMs frequently have MDs that disagree with CIGAR on ~97% of reads (documented upstream issue in vg surject).
  4. pysam's get_aligned_pairs(with_seq=True) on bad MD raises AssertionError AND corrupts malloc state before propagating.
  5. A later allocation in the worker segfaults. Python's finally block doesn't run on segfault, so buffered m6a/m5c BED writes are lost.
  6. Result: --deam is empty, and m6a/m5c files are also empty or missing entirely.

Fix

Auto-skip --deam on BAMs where:

  • MM/ML carries m6A (A+a / T-a) or 5mC (C+m) subtypes, AND
  • no dU code (+u / +55797), AND
  • no R/Y IUPAC in the query sequence

When triggered, prints:

Auto-skipping --deam: BAM looks like fiber-seq (MM/ML has m6A/5mC subtypes,
no dU code, no R/Y). Hia5 is a methyltransferase, not a deaminase --
no deamination calls to extract. Pass --deam explicitly to override.

Explicit --deam is still honored — the auto-skip only fires in --all / default mode.

Verified

Tested on a 5 Mb chr1 slice of a real surjected fiber-seq CRAM: 29,679 reads → 22.8M m6a + 4.3M m5c + 1.9M footprint + 1.9M msp positions, zero crashes, all bigBeds produced. Pre-fix this would have segfaulted mid-region and lost m6a/m5c writes.

Tests

294 pass. 8 new cover the fiber-seq signature detection across PacBio modern MM/ML, SAM 1.7 suffix variants (A+a., C+m?), DAF IUPAC, modkit dU, ChEBI 55797, and mixed cases.

Install

pip install --upgrade fiberhmm

v2.10.3: README promotes one-step DAF pipeline (docs-only)

23 Apr 17:01

Choose a tag to compare

Docs-only release

Three README updates so the documentation matches the v2.10.0+ reality:

  1. Streaming Pipelines section — replaced the stale DAF example (minimap2 | daf-encode | apply) with the one-pass form:

    minimap2 --MD -a ref.fa reads.fq | samtools view -b | \
        fiberhmm-call --mode daf --enzyme dddb -i - -o - | \
        ft fire - final.bam
  2. fiberhmm-daf-encode CLI reference — added an "Optional in v2.10.0+" banner. The tool still exists and still works; most users just don't need it anymore.

  3. Tag reference — clarified that st:Z is written only by fiberhmm-daf-encode (the two-pass path), not by fiberhmm-call --mode daf (the one-pass path derives strand internally from MD).

No code changes. No behavior changes. 292 tests still pass.

Install

pip install --upgrade fiberhmm

v2.10.2: surface stale-MD warning (extract + call)

23 Apr 15:43

Choose a tag to compare

What's new

Both fiberhmm-extract and fiberhmm-call --mode daf now print a clear note when they detect MD tags that disagree with their CIGAR (common on consensus BAMs where MD is stale after CIGAR was recomputed). Example:

NOTE: 11/20 of the sniffed reads have MD tags that
disagree with their CIGAR (typical of consensus BAMs where
MD was inherited stale after CIGAR was recomputed).
These reads will be SKIPPED for --deam only; other tracks
extract normally. The reads themselves are not corrupt --
the CIGAR is still correct, only the MD annotation is stale.
To recover --deam calls for those reads, regenerate MD with:
  samtools calmd -b aligned.bam ref.fa > fixed.bam

v2.10.1 silently skipped those reads. That left users unsure whether their data was corrupt and whether anything had been dropped. v2.10.2 makes it explicit, up-front, and actionable.

Warning only fires when --deam actually depends on MD (no R/Y in sequence, no MM/ML dU code, no --reference on the call side) AND at least one sniffed read has the mismatch. Free — reuses the v2.10.1 validator. No code path changes beyond printing the note.

Install

pip install --upgrade fiberhmm

v2.10.1: fix malloc crash on BAMs with malformed MD tags

23 Apr 15:39

Choose a tag to compare

Bug fix

fiberhmm-extract (and fiberhmm-call --mode daf on v2.10.0) crashed on BAMs where the MD tag's encoded reference length disagreed with the CIGAR's reference-consuming op total. Classic consensus-BAM bug: MD inherits stale state after a re-alignment updates CIGAR.

Symptom on v2.9.7:

malloc(): invalid size (unsorted)

Why

pysam.get_aligned_pairs(with_seq=True) raises AssertionError on MD/CIGAR mismatch — but before raising, pysam has already corrupted its internal malloc state. So v2.9.6 crashed immediately with the assertion; v2.9.7 swallowed the assertion but the heap was already poisoned, surfacing as a malloc(): invalid size on some unrelated allocation later in the worker.

Fix

Pre-validate the MD tag against the CIGAR in pure Python before asking pysam to parse it. If they disagree, skip the get_aligned_pairs(with_seq=True) call entirely. Never triggers pysam's AssertionError path, never corrupts malloc.

Patched both call sites:

  • fiberhmm/cli/extract_tags.py_extract_deam path-3
  • fiberhmm/daf/encoder.pyget_daf_positions (v2.10.0's --mode daf MD fallback)

Workaround for affected BAMs (optional)

If you want --deam calls to populate for reads with broken MD, regenerate clean MD tags:

samtools calmd -b aligned.bam ref.fa > fixed.bam

Without this, v2.10.1 silently skips the broken reads for --deam extraction (every other track still works for those reads). With v2.10.0's --mode daf, you can also pass --reference ref.fa as a fallback.

Tests

292 passing. 3 new tests pin the MD parser, the validator, and the invariant that malformed MD bypasses the pysam call completely.

Install

pip install --upgrade fiberhmm

v2.10.0: fiberhmm-call --mode daf auto-detects MD tags (no encode step)

22 Apr 18:29

Choose a tag to compare

Highlights

fiberhmm-call --mode daf now works directly on raw DAF BAMs. The two-pass fiberhmm-daf-encode | fiberhmm-call pipeline collapses into one.

What changed

Per-read auto-detection in --mode daf. Priority:

  1. R/Y IUPAC codes in the stored sequence (from fiberhmm-daf-encode) — existing fast path, unchanged.
  2. MD tag on the aligned read — parsed on the fly, skips the encode step entirely.
  3. --reference REF.fa — new CLI flag, fallback when the BAM has neither R/Y nor MD.

First non-empty source wins per read.

Correctness

Output is byte-identical to the two-pass pipeline. Verified on a real 3,929-read consensus DAF BAM (Christy LaFlamme's dataset):

source same HMM call bytes as two-pass?
ns (nucleosomes) ✅ 0 / 3,929 diffs
nl (nuc lengths) ✅ 0 / 3,929 diffs
as (MSPs) ✅ 0 / 3,929 diffs
al (MSP lengths) ✅ 0 / 3,929 diffs
nq / aq (quality) ✅ 0 / 3,929 diffs
MA / AQ (spec tags) ✅ 0 / 3,929 diffs

And zero regression on R/Y-encoded BAMs: v2.9.6 vs v2.10.0 on the same pre-encoded BAM = 0 diffs.

New CLI

fiberhmm-call --mode daf --enzyme dddb --region-parallel \
    -i raw_aligned.bam -o recalled.bam                           # MD-tag path

fiberhmm-call --mode daf --enzyme dddb --region-parallel \
    -i raw_aligned.bam -o recalled.bam --reference ref.fa        # FASTA fallback

On startup, --mode daf does a fast-fail sniff: it checks the first 10 mapped reads for R/Y, MD, or --reference. If none are available, it errors out in under a second with a clear message instead of silently skipping every read.

Backward compatibility

  • fiberhmm-daf-encode stays available and unchanged. Still the right tool if you want R/Y IUPAC stamped into the stored sequence for R/Y-aware downstream tools.
  • R/Y-encoded BAMs continue through the existing fast path (zero cost for the new MD branch — it only fires when R/Y is absent).
  • No new dependencies.

Install

pip install --upgrade fiberhmm

Tests

289 passing. 4 new unit tests covering the MD-fallback branch, payload plumbing, fast-fail sniff, and byte-identity gate.

v2.9.7: fix fiberhmm-extract crash on malformed MD tags

22 Apr 16:50

Choose a tag to compare

Bug fix

fiberhmm-extract now swallows pysam AssertionError when a read carries a malformed MD tag, skipping that read's deamination extraction instead of aborting the entire region worker.

Symptom (user-reported field crash):

Traceback (most recent call last):
  File ".../fiberhmm/cli/extract_tags.py", line 214, in _extract_region_worker
    n_features['deam'] += _extract_deam(...)
  File ".../fiberhmm/cli/extract_tags.py", line 688, in _extract_deam
    pairs = read.get_aligned_pairs(with_seq=True)
  ...
AssertionError: Invalid MD tag: MD length 37729 mismatch with CIGAR length 43799 and 9742 insertions

Root cause: the path-3 MD walker in _extract_deam (added in v2.9.6 for DAF-seq BAMs that only carry MD mismatches, no MM/ML and no R/Y) caught ValueError/AttributeError but not AssertionError, which is what pysam throws when MD/CIGAR lengths disagree internally.

Fix: broaden the exception handler to Exception. Affected reads now skip the deam track silently; all other tracks (footprint, msp, tf, m6a, m5c) still extract for those reads.

Regression test pinned: tests/test_extract_block_scores.py::test_deam_priority3_skips_read_on_pysam_assertion_error.

Install

pip install --upgrade fiberhmm

Affects only v2.9.6 (which introduced path-3). Earlier versions never touched MD tags, so they weren't exposed to this bug.

v2.9.6: tag-presence diagnostic + MD-mismatch deam fallback

21 Apr 16:44

Choose a tag to compare

Changes

Tag diagnostic in fiberhmm-extract

Before the heavy pass, fiberhmm-extract now sniffs the first 20 mapped reads and prints a summary of which tag sources were detected plus which tracks it predicts will populate:

Tag scan (first 20 mapped reads): MA/AQ=20/20, MM/ML=20/20 [A+a], R/Y-in-seq=20/20
  will populate: footprint[ns/nl], msp[as/al], tf[MA/AQ tf+ annotations], m6a[MM/ML (A+a)], deam[R/Y in sequence]
  will be empty: m5c

Or on a raw DAF BAM with nothing but MD:

Tag scan (first 20 mapped reads): MD-only=20/20
  will populate: deam[MD-only (path-3 fallback)]
  will be empty: footprint, msp, tf, m6a, m5c

--deam path-3 MD fallback (completes FiberBrowser parity)

Third priority source for deamination calls. When MM/ML-native (u / 55797) and IUPAC R/Y paths both come up empty, _extract_deam walks get_aligned_pairs(with_seq=True) and picks out deamination-direction mismatches: ref-C→read-T (flavor 1, Y/CT-dea) and ref-G→read-A (flavor 0, R/GA-dea). Matches FiberBrowser's BAM parser priority exactly.

Cross-check: on a 3,929-read raw consensus BAM (MD-only) vs the FiberHMM-processed version of the same reads (R/Y-encoded), the two extraction paths produced 99.997% agreement — 990,170 vs 990,144 deamination calls, 25 reads differing by ≤2 calls each (edge cases at indels).

Priority ordering (unchanged in flavor codes)

  1. MM/ML u or ChEBI 55797
  2. IUPAC R/Y in the query sequence
  3. MD-tag ref mismatch (new)

First non-empty source wins per read. blockMod bytes: 0 = R/GA-dea, 1 = Y/CT-dea.

Tests

3 new tests cover the MD walker, MD-absent skip, and priority ordering. Full suite: 288 passing.

Install

pip install --upgrade fiberhmm

v2.9.5: --deam matches FiberBrowser's 2-path priority

21 Apr 16:31

Choose a tag to compare

Change

fiberhmm-extract --deam now reads both canonical deamination encodings, matching the priority order FiberBrowser's BAM parser uses:

  1. MM/ML-native — mod code u (single-letter dU per SAM spec) or ChEBI 55797 (numeric code for dU). Filtered by --prob-threshold same as --m6a/--m5c.
  2. IUPAC R/Y in the query sequence — fallback, written by fiberhmm-daf-encode.

First non-empty source wins per read so events are never double-counted.

blockMod flavor convention (unchanged)

code meaning
0 R / GA-dea (canonical base G → U)
1 Y / CT-dea (canonical base C → U)

Same semantics whether the source was MM/ML or R/Y — downstream tools (FiberBrowser, UCSC) don't need to care which path the call came from.

Why this matters

DAF BAMs can arrive encoded in any of three conventions depending on upstream tooling. FiberHMM's own pipeline (fiberhmm-daf-encode) produces path-2 (IUPAC R/Y), but foreign BAMs from modkit-style callers carry path-1 (MM/ML u). Previously --deam only handled path 2; now a modkit-processed DAF BAM extracts cleanly too.

Path 3 (ref-mismatch via MD tag) stays browser-side — FiberBrowser already walks aligned-pairs-with-seq, and the right FiberHMM answer for raw BAMs is to run fiberhmm-daf-encode first to normalize to R/Y (which is what the HMM wants anyway).

Tests

5 new tests cover priority ordering, C→1 and G→0 flavor mapping, ChEBI 55797 numeric code, --prob-threshold filter, and R/Y fallback when MM/ML lacks dU entries. Full suite passes (285).

Install

pip install --upgrade fiberhmm

v2.9.4: rename --ry to --deam + sample name in bigBed autoSQL

21 Apr 15:52

Choose a tag to compare

Changes

1. Rename --ry--deam

Naming consistency with --m5c/--m6a. --ry was only live for a few hours in v2.9.3 so this is functionally a rename rather than a breaking change. Updated across:

  • CLI flag: fiberhmm-extract --deam
  • Output filename: *_deam.bb
  • autoSQL table name: fiberhmm_deam

2. Sample name embedded in bigBed autoSQL

New --sample-name flag on fiberhmm-extract (default = BAM basename stem, e.g. Dl_recalled) prepends Sample: <name>. to the autoSQL description of every output bigBed. Lets FiberBrowser / UCSC match a bigBed to its source by reading file-level metadata rather than parsing filenames.

Example — bigBedInfo -as Dl_recalled_deam.bb now shows:

table fiberhmm_deam
"Sample: Dl_recalled. FiberHMM DAF-seq deamination calls (R/Y IUPAC codes written into
the query sequence by fiberhmm-daf-encode). ..."

The Sample: <name>. prefix is a simple regex parse for any downstream tool.

Install

pip install --upgrade fiberhmm

Tests

280 tests pass (4 new for sample_name plumbing).