Releases: fiberseq/FiberHMM

v2.4.0

01 Apr 14:46

What's New

fiberhmm-daf-encode — new CLI tool

Preprocessor for plain aligned DAF-seq BAMs (e.g. from minimap2). Identifies C→T / G→A deamination mismatches via the MD tag, encodes them as IUPAC Y/R in the query sequence, and adds st:Z strand tags — making the BAM ready for fiberhmm-apply --mode daf.

Usage:

# Basic
fiberhmm-daf-encode -i aligned.bam -o encoded.bam

# Streaming pipeline
fiberhmm-daf-encode -i aligned.bam -o - | \
    fiberhmm-apply --mode daf --streaming -i - -o output/

# Full pipeline from alignment
minimap2 --MD -a ref.fa reads.fq | samtools view -b | \
    fiberhmm-daf-encode -i - -o - | \
    fiberhmm-apply --mode daf --streaming -i - -o output/

Features:

  • Automatic per-read strand detection (or force with --strand CT/GA)
  • Reference FASTA fallback when MD tags are missing (--reference)
  • Streaming stdin/stdout support for piping
  • Preserves all existing tags (MM/ML for dual-labeling, etc.)
  • Sort + index on file output
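
As a rough illustration of the encoding step, the Python sketch below marks C→T mismatches as Y (or G→A as R) in a read, given the reference bases it aligns to. It assumes a gap-free alignment so that read and reference can be compared base by base; the real tool recovers mismatches from the MD tag or a reference FASTA. The function names and the majority-vote strand rule are hypothetical and not taken from fiberhmm-daf-encode.

```python
from typing import Optional, Tuple


def detect_strand(query: str, ref: str) -> str:
    """Guess the deaminated strand by majority vote (assumed rule)."""
    ct = sum(1 for q, r in zip(query, ref) if r == "C" and q == "T")
    ga = sum(1 for q, r in zip(query, ref) if r == "G" and q == "A")
    return "CT" if ct >= ga else "GA"


def encode_deamination(
    query: str, ref: str, strand: Optional[str] = None
) -> Tuple[str, str]:
    """Return (encoded_sequence, strand) with C->T as Y or G->A as R."""
    if len(query) != len(ref):
        raise ValueError("sketch assumes a gap-free alignment of equal length")
    strand = strand or detect_strand(query, ref)
    # Pick which reference base / read base pair counts as deamination.
    if strand == "CT":
        ref_base, read_base, code = "C", "T", "Y"
    else:
        ref_base, read_base, code = "G", "A", "R"
    encoded = "".join(
        code if (r == ref_base and q == read_base) else q
        for q, r in zip(query, ref)
    )
    return encoded, strand
```

Passing strand="CT" or strand="GA" skips the vote, loosely mirroring the --strand override.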

Full Changelog: v2.3.1...v2.4.0

v2.3.1

30 Mar 15:34
35010ab

IUPAC R/Y DAF-seq Support

DAF-seq BAMs that encode deamination events as IUPAC ambiguity codes (R/Y) in the sequence instead of MM/ML tags are now auto-detected and processed under --mode daf.

  • Y in sequence marks deaminated C (+ strand, st:Z:CT)
  • R in sequence marks deaminated G (- strand, st:Z:GA)
  • No new CLI flags needed — auto-detection handles both encodings transparently
  • Both paths produce identical encoder output (verified on 3,929 real reads)
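
To make the R/Y convention concrete, here is a small Python sketch that reads deamination positions and the implied strand tag straight from an IUPAC-encoded sequence. The function name is hypothetical, and the tie-breaking rule for reads containing both codes is an assumption rather than FiberHMM's documented behavior.

```python
from typing import List, Optional, Tuple


def deamination_calls(seq: str) -> Tuple[Optional[str], List[int]]:
    """Return (strand, positions) for an IUPAC-encoded read.

    Y marks deaminated C (st:Z:CT); R marks deaminated G (st:Z:GA).
    """
    y_pos = [i for i, base in enumerate(seq) if base == "Y"]
    r_pos = [i for i, base in enumerate(seq) if base == "R"]
    if not y_pos and not r_pos:
        # No IUPAC codes: the read presumably carries MM/ML tags instead.
        return None, []
    # Assumed rule: the more frequent code decides the strand.
    if len(y_pos) >= len(r_pos):
        return "CT", y_pos
    return "GA", r_pos
```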

v2.3.0

28 Mar 05:51

v2.3.0: Streaming Pipeline & Production Integration

New Features

  • Streaming producer-consumer pipeline — sliding-window architecture keeps multiple compute chunks in flight, enabling overlap of I/O and HMM inference. Works with unaligned/unindexed BAMs and stdin.
  • Stdout output (-o -) — pipe BAM directly to downstream tools (e.g., fiberhmm-apply -o - | ft fire)
  • pysam multithreaded I/O (--io-threads) — htslib decompression/compression threading for all processing modes
  • Auto-detection — stdin or missing BAM index automatically triggers streaming mode
  • Unaligned read processing (--process-unmapped) — process reads without alignment coordinates
  • Headerless BAM support (check_sq=False) — handle BAMs without sequence dictionaries
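
The sliding-window idea can be sketched in Python as follows: reads are grouped into chunks, a bounded number of chunks are processed concurrently, and results are emitted in input order. This is a minimal sketch using threads from the standard library. The names stream_process and max_in_flight are hypothetical, and the actual FiberHMM pipeline may be organized differently.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def stream_process(reads, work, chunk_size=500, max_in_flight=4, workers=4):
    """Yield work(read) in input order with a bounded number of pending chunks."""

    def run_chunk(chunk):
        return [work(read) for read in chunk]

    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunked(reads, chunk_size):
            pending.append(pool.submit(run_chunk, chunk))
            # Window is full: wait for the oldest chunk before reading more.
            if len(pending) >= max_in_flight:
                yield from pending.popleft().result()
        # Drain whatever is still in flight.
        while pending:
            yield from pending.popleft().result()
```

Because only max_in_flight chunks are held at once, memory use stays bounded regardless of input size, and the input never needs to be indexed or seekable.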

CLI Flags

  • --streaming — explicitly use streaming pipeline mode
  • --io-threads N — htslib decompression threads (default: 4)
  • --chunk-size N — reads per compute chunk (default: 500)
  • --process-unmapped — process unmapped reads with sequences

Bug Fixes

  • Fix region-parallel worker never writing footprint tags (indentation bug in _process_region_to_bam)
  • Skip sort/index for unaligned output

Test Suite

  • 16 streaming correctness tests (order preservation, tag validity, determinism, edge cases)
  • 4 cross-mode equivalence tests (streaming == region-parallel == legacy, r=1.0)
  • Benchmark suite: throughput, scaling, I/O vs compute, memory, pysam threads
  • Synthetic BAM generation fixtures with valid MM/ML tags

Performance

  • ~350 reads/s on 200GB unaligned BAM over NAS (network storage)
  • ~5,700 reads/s on local SSD with 4 cores
  • Memory bounded at ~30MB regardless of input size
  • Full fiberhmm | ft fire pipeline validated on 2.8M reads

v2.2.0

20 Mar 20:58

FiberHMM v2.2.0

Bug Fixes

  • Fix BAM tag types: as/ns/al/nl tags now correctly use B:I (unsigned 32-bit int arrays) per the Fiber-seq BAM format spec. Previously, pysam inferred B:C or B:S from value magnitudes, breaking compatibility with pyft and other fibertools ecosystem tools.
  • Remove unreliable BAM mode auto-detection: Mode is now resolved as command line > model metadata > pacbio-fiber default. The auto-detect heuristic misidentified BAMs with both 5mC and m6A tags as DAF-seq, causing confusing warnings.
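
The tag-type problem can be illustrated with a short Python sketch. narrowest_unsigned_subtype is a hypothetical stand-in for magnitude-based inference (it is not pysam's code) and shows why the array subtype could differ from read to read. as_uint32_array pins the element type with the standard-library array module. That a typed array passed to pysam's set_tag is written as B:I is assumed here, not demonstrated.

```python
from array import array
from typing import Sequence


def narrowest_unsigned_subtype(values: Sequence[int]) -> str:
    """BAM B-array subtype chosen purely from value magnitude (illustrative)."""
    top = max(values, default=0)
    if top < 2**8:
        return "C"  # uint8
    if top < 2**16:
        return "S"  # uint16
    return "I"  # uint32


def as_uint32_array(values: Sequence[int]) -> array:
    """Fix the element type up front so it never depends on the values."""
    # Typecode 'I' is an unsigned int, 4 bytes on common platforms.
    return array("I", values)
```

With magnitude-based inference, a read whose footprints are all shorter than 256 bp would get B:C while its neighbor gets B:S; building the array with an explicit typecode removes that variation.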

Improvements

  • MSPs now match fibertools convention: Only nucleosome-sized footprints (>= 85bp by default) act as MSP boundaries. Small footprints are absorbed into surrounding MSPs, consistent with how fibertools defines MSPs.
  • Skip reason reporting: Both region-parallel and standard processing paths now track and print a summary of why reads were skipped (low MAPQ, too short, no modifications, no footprints, etc.)
  • Improved CLI help text: --min-mapq and --min-read-length descriptions now explain filtering behavior and how to override defaults.
  • Read filtering documentation: README now includes a table of all skip reasons, default thresholds, and override flags.

New Flags

  • --no-msps — Suppress as/al/aq MSP tag output. Useful for Fiber-seq workflows where MSPs are computed separately by fibertools.
  • --nuc-min-size (default: 85) — Minimum footprint size to count as nucleosome-sized for MSP boundary detection.
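
The MSP convention described in this release can be sketched in Python: only footprints of at least nuc_min_size act as boundaries, and each MSP is the gap between two consecutive boundary footprints. The function name and the (start, length) representation are hypothetical, and the sketch reports only gaps between nucleosome-sized footprints, which is an assumption about how read ends are treated.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, length) in read coordinates


def call_msps(footprints: List[Interval], nuc_min_size: int = 85) -> List[Interval]:
    """Return MSPs as gaps between consecutive nucleosome-sized footprints.

    `footprints` must be sorted by start and non-overlapping.
    """
    # Small footprints are dropped here, so they end up inside an MSP.
    nucs = [(s, n) for s, n in footprints if n >= nuc_min_size]
    msps = []
    for (start_a, len_a), (start_b, _) in zip(nucs, nucs[1:]):
        gap_start = start_a + len_a
        gap_len = start_b - gap_start
        if gap_len > 0:
            msps.append((gap_start, gap_len))
    return msps
```

For example, with footprints at (100, 147), (300, 30) and (400, 150), the 30 bp footprint is absorbed and a single MSP (247, 153) is reported.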

FiberHMM v2.1.0

06 Mar 15:46
be05da2

What's New

Posteriors Export (fiberhmm-posteriors)

New standalone CLI for exporting per-position HMM posterior probabilities (P(footprint) per position per read).
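
For readers unfamiliar with posterior decoding, the following Python sketch computes per-position P(footprint) with a scaled forward-backward pass on a two-state HMM. It is a generic textbook version, not FiberHMM's implementation: the state order (0 = accessible, 1 = footprint), the binary observations (1 = modified base) and all parameter values in the usage note are assumptions.

```python
from typing import List, Sequence


def footprint_posteriors(
    obs: Sequence[int],
    start: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
) -> List[float]:
    """Return P(state == 1 | all observations) for every position."""
    n, k = len(obs), len(start)
    if n == 0:
        return []
    alpha, scale = [], []
    for t in range(n):
        if t == 0:
            row = [start[s] * emit[s][obs[0]] for s in range(k)]
        else:
            row = [
                emit[s][obs[t]] * sum(alpha[t - 1][r] * trans[r][s] for r in range(k))
                for s in range(k)
            ]
        c = sum(row)  # per-position scaling avoids underflow on long reads
        alpha.append([v / c for v in row])
        scale.append(c)
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in range(k))
            / scale[t + 1]
            for s in range(k)
        ]
    posteriors = []
    for t in range(n):
        gamma = [alpha[t][s] * beta[t][s] for s in range(k)]
        posteriors.append(gamma[1] / sum(gamma))
    return posteriors
```

With illustrative parameters such as start=[0.5, 0.5], trans=[[0.95, 0.05], [0.05, 0.95]] and emit=[[0.6, 0.4], [0.98, 0.02]], a run of modified bases pulls the posterior toward the accessible state, and a long unmodified run pushes it toward the footprint state.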

Two output formats:

  • Gzipped TSV — no extra dependencies, base64-encoded uint8 posteriors
  • HDF5 — streaming batched writes, requires pip install h5py

Format is auto-detected from file extension (.tsv.gz → TSV, .h5/.hdf5 → HDF5), or set explicitly with --format.
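
A minimal Python sketch of the two ideas above: packing posteriors as base64-encoded uint8 values, and choosing the output format from the file extension. The function names are hypothetical, and the scaling (probability times 255, rounded) is an assumption; check the FiberHMM documentation for the actual quantization before decoding real files.

```python
import base64
from typing import List, Sequence


def encode_posteriors(probs: Sequence[float]) -> str:
    """Quantize probabilities to uint8 (assumed scale: p * 255) and base64-encode."""
    quantized = bytes(min(255, max(0, round(p * 255))) for p in probs)
    return base64.b64encode(quantized).decode("ascii")


def decode_posteriors(text: str) -> List[float]:
    """Invert encode_posteriors, up to quantization error."""
    return [b / 255 for b in base64.b64decode(text)]


def detect_format(path: str) -> str:
    """Map a file extension to an output format name."""
    lower = path.lower()
    if lower.endswith(".tsv.gz"):
        return "tsv"
    if lower.endswith((".h5", ".hdf5")):
        return "hdf5"
    raise ValueError(f"cannot infer format from {path!r}; set it explicitly")
```

One byte per position keeps the TSV compact, at the cost of a resolution of about 0.004 per probability under the assumed scaling.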

# TSV (no extra deps)
fiberhmm-posteriors -i tagged.bam -m model.json -o posteriors.tsv.gz -c 4

# HDF5 (requires h5py)
fiberhmm-posteriors -i tagged.bam -m model.json -o posteriors.h5 -c 4

Other Changes

  • New optional dependency group: pip install fiberhmm[posteriors] (installs h5py)
  • pip install fiberhmm[all] now includes h5py
  • Removed auxiliary DAF-seq preprocessing examples
  • --output-posteriors flag on fiberhmm-apply now auto-activates when posteriors package is present

Install / Upgrade

pip install --upgrade fiberhmm

FiberHMM v2.0.0

22 Feb 21:05
34a2d4b

Complete rewrite of FiberHMM as a proper Python package.

What's New

  • Python package: pip install fiberhmm with CLI entry points (fiberhmm-apply, fiberhmm-train, fiberhmm-probs, fiberhmm-extract, fiberhmm-utils)
  • Native HMM: No hmmlearn dependency; optional Numba JIT for ~10x speedup
  • Region-parallel processing: Scales linearly with cores (-c 8)
  • Fibertools-compatible output: Tagged BAM with ns/nl/as/al tags
  • Pre-trained models: Hia5 (PacBio + Nanopore), DddA (PacBio), DddB (Nanopore)
  • Consolidated utilities: fiberhmm-utils with convert, inspect, transfer, adjust subcommands
  • JSON model format: Portable, human-readable; legacy pickle/NPZ still supported for loading

Pre-trained Models

Model               Enzyme              Platform  Mode
hia5_pacbio.json    Hia5 (m6A)          PacBio    pacbio-fiber
hia5_nanopore.json  Hia5 (m6A)          Nanopore  nanopore-fiber
ddda_pacbio.json    DddA (deamination)  PacBio    daf
dddb_nanopore.json  DddB (deamination)  Nanopore  daf

Quick Start

pip install fiberhmm

# Call footprints with a pre-trained model
fiberhmm-apply -i experiment.bam -m models/hia5_pacbio.json -o output/ -c 8

See the README for full documentation.

v1.4

19 Feb 15:05
67dcb70

Updated the main scripts to handle unusual chromosome names in the genome assembly. Characters that are not allowed in the h5 file are now replaced with underscores, and downstream steps encode chromosome names the same way so that they match. Chromosome-name parsing was also simplified to follow existing tools: the leading '>' is removed and the name is read up to the first whitespace. Began implementing options to exclude specific chromosomes or to use only selected ones.
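
The name handling described in this release could look roughly like the Python sketch below. Both function names are hypothetical, and the set of characters treated as safe for h5 keys is an assumption; the v1.4 scripts may use a different rule.

```python
import re


def parse_chrom_name(header_line: str) -> str:
    """Drop the leading '>' and keep the text up to the first whitespace."""
    fields = header_line.lstrip(">").split()
    return fields[0] if fields else ""


def sanitize_for_h5(name: str) -> str:
    """Replace every character outside an assumed safe set with '_'."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", name)
```

Applying the same sanitize_for_h5 call wherever chromosome names are looked up keeps the h5 keys and the downstream names in agreement.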

Version 1.3.2

12 Nov 17:39
04815f7

Set parameter "starting_it" to -1 by default, as otherwise the first chunk of the bed file would be skipped.

Version 1.3.1

12 Nov 04:21
7127a6a

Added a parameter -d to apply_model_multiprocess.py. This script can hang unexpectedly, especially when using many CPU cores. With -d, you can point to an existing temporary directory (in your original outdir) that holds the footprint BED chunks from a previous, failed run. The script then skips quickly past the chunks of the m6A BED file that already have footprint calls and resumes where it left off.

FiberHMM v1.3

11 Nov 18:39
4e53d7d

Added a new parameter, -e, to the train and apply model scripts. It sets the minimum level of methylation a read must have to be used in training or to be kept after model application. This is helpful when a subset of reads has very low methylation due to experimental issues.

Adjusted the apply and train model scripts to use the reference-based position of m6A instead of the read-based position. This resolves issues where poorly aligned reads had methylations and footprints outside the expected range.
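
To show what moving from read-based to reference-based positions involves, here is a Python sketch that walks a CIGAR string to map read offsets onto reference coordinates. The function names and the (op, length) CIGAR representation are hypothetical, forward-strand orientation is assumed, and the v1.3 scripts may obtain these positions by other means.

```python
from typing import Dict, List, Tuple


def query_to_ref(cigar: List[Tuple[str, int]], ref_start: int) -> Dict[int, int]:
    """Map read offsets to reference positions for aligned bases only."""
    mapping: Dict[int, int] = {}
    q, r = 0, ref_start
    for op, length in cigar:
        if op in "M=X":  # consumes both read and reference
            for i in range(length):
                mapping[q + i] = r + i
            q += length
            r += length
        elif op in "IS":  # insertion or soft clip: read only
            q += length
        elif op in "DN":  # deletion or skipped region: reference only
            r += length
        # H and P consume neither read nor reference
    return mapping


def m6a_ref_positions(
    m6a_read_pos: List[int], cigar: List[Tuple[str, int]], ref_start: int
) -> List[int]:
    """Convert read-based m6A offsets, dropping calls in unaligned bases."""
    mapping = query_to_ref(cigar, ref_start)
    return [mapping[p] for p in m6a_read_pos if p in mapping]
```

Calls that fall in soft clips or insertions have no reference position and are dropped, which is one way poorly aligned stretches stop contributing out-of-range methylations.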