Skip to content

fix: guard against division by zero in helper_fdr_peaks_by_fire_elements#48

Merged
mrvollger merged 1 commit intofiberseq:mainfrom
tbenavi1:fix/helper-fdr-peaks-division-by-zero
Apr 1, 2026
Merged

fix: guard against division by zero in helper_fdr_peaks_by_fire_elements#48
mrvollger merged 1 commit intofiberseq:mainfrom
tbenavi1:fix/helper-fdr-peaks-division-by-zero

Conversation

@tbenavi1
Copy link
Copy Markdown
Contributor

Problem

The helper_fdr_peaks_by_fire_elements rule in workflow/rules/fire-peaks.smk crashes with a bioawk: division by zero error when a pileup row has coverage == 0 and passes the upstream filters (is_local_max=="true", FDR<=0.05, fire_coverage>1). Because Snakemake runs in bash strict mode, this kills the entire pipeline.

The error is chromosome- and sample-dependent — it only manifests when such a row exists on a given chromosome.

Root cause

Line 201 evaluates $fire_coverage/$coverage without first checking that $coverage > 0:

| bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})'

Fix

Add a $coverage > 0 guard immediately before the division, matching the pattern already used in wide_fire_peaks (lines 298–299):

| bioawk -tc hdr '(NR==1)||($coverage>0)' \
| bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})' \

Rows with zero coverage cannot have accessible fibers, so filtering them is semantically correct and does not change results.

Diff

@@ -198,6 +198,7 @@ rule helper_fdr_peaks_by_fire_elements:
                 | bioawk -tc hdr '(NR==1)||($is_local_max=="true")' \
                 | csvtk filter -tT -C '$' -f "FDR<={params.max_peak_fdr}" \
                 | csvtk filter -tT -C '$' -f "fire_coverage>1" \
+                | bioawk -tc hdr '(NR==1)||($coverage>0)' \
                 | bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})' \
                 | bedtools intersect -wa -wb -sorted -a - \
                     -b <(tabix {input.fire} {wildcards.chrom} \

@mrvollger
Copy link
Copy Markdown
Member

mrvollger commented Mar 26, 2026

Thanks for this fix! I am concerned that this isn't already covered by the fire_coverage being greater than 1 part. fire_coverage should always be less than or equal to coverage.

Do you have any small dataset I can play with to see what that is happening in ft pileup at all?

@tbenavi1
Copy link
Copy Markdown
Contributor Author

Is there an email address where I can send you some files? And also, I am not sure whether I can send you the bam files (I will double check). But if not, are there any intermediate files from the pipeline you would like? I do have the inputs that go into the failing job.

@mrvollger
Copy link
Copy Markdown
Member

Yes please!

mrvollger@genetics.utah.edu

Even one locus of the bam would be enough. If you can identify a region where fire coverage is greater than coverage and send me just those overlapping reads I should be able to figure out what's going wrong.

@tbenavi1 tbenavi1 force-pushed the fix/helper-fdr-peaks-division-by-zero branch from f439083 to fc33686 Compare March 27, 2026 17:35
@tbenavi1
Copy link
Copy Markdown
Contributor Author

Ok, I looked a little bit more into the issue. I now believe it is a bug in bioawk which in rare circumstances will write a phantom empty record due to a buffer-boundary edge case.

The helper_fdr_peaks_by_fire_elements rule in workflow/rules/fire-peaks.smk crashes with a bioawk: division by zero error on certain chromosomes and samples. In our case this occurred at input record 14032 on chr8 for sample GMHC_APD, despite all data having valid nonzero coverage values.

Like you mentioned, the upstream fire_coverage>1 filter already guarantees coverage >= 1 for all data rows. Instead, the error is caused by a buffer-boundary edge case in bioawk's -c hdr line reader.

This only manifests when the input file size modulo 16384 places the final newline exactly at a buffer boundary, explaining why the error is data- and chromosome-dependent.

We confirmed this by showing that awk 'END{print NR}' reports 14031 records while bioawk -tc hdr sees 14032, with record 14032 having NF=0.

Fix

Guard against the phantom record by checking NF>0 before evaluating any division. This is applied to both affected rules:

helper_fdr_peaks_by_fire_elements (line 201):

# Before:
| bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})'

# After:
| bioawk -tc hdr '(NR==1)||(NF>0 && $fire_coverage/$coverage>={params.min_per_acc_peak})'

wide_fire_peaks (lines 298–299) — same vulnerability, consolidated into a single bioawk call:

# Before:
| bioawk -tc hdr 'NR==1 || $coverage>0'
| bioawk -tc hdr 'NR==1 || ($fire_coverage/$coverage>={params.min_frac_acc})'

# After:
| bioawk -tc hdr 'NR==1 || (NF>0 && $coverage>0 && $fire_coverage/$coverage>={params.min_frac_acc})'

The NF>0 check costs nothing on real data and prevents the phantom record from reaching the division. Consolidating two bioawk calls into one also eliminates a second potential source of the same bug (each bioawk in a pipe chain can independently produce a phantom record).

Upstream bioawk bug

The root cause is a bug in bioawk (https://github.com/lh3/bioawk). In addon.c line 339, bio_getrec() checks if (c >= 0) but should check if (c > 0), since ks_getuntil() can return 0 (empty string length) on a spurious EOF read when the file size aligns with the 16384-byte kseq read buffer.

Diff

@@ -198,7 +198,7 @@ rule helper_fdr_peaks_by_fire_elements:
                 | bioawk -tc hdr '(NR==1)||($is_local_max=="true")' \
                 | csvtk filter -tT -C '$' -f "FDR<={params.max_peak_fdr}" \
                 | csvtk filter -tT -C '$' -f "fire_coverage>1" \
-                | bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})' \
+                | bioawk -tc hdr '(NR==1)||(NF>0 && $fire_coverage/$coverage>={params.min_per_acc_peak})' \
                 | bedtools intersect -wa -wb -sorted -a - \
                     -b <(tabix {input.fire} {wildcards.chrom} \
                             | cut -f 1-3 \
@@ -295,8 +295,7 @@ rule wide_fire_peaks:
         ( \
             bgzip -cd {input.bed}; \
             bioawk -tc hdr 'NR==1 || $FDR<={params.max_peak_fdr}' {input.track} \
-                | bioawk -tc hdr 'NR==1 || $coverage>0' \
-                | bioawk -tc hdr 'NR==1 || ($fire_coverage/$coverage>={params.min_frac_acc})' \
+                | bioawk -tc hdr 'NR==1 || (NF>0 && $coverage>0 && $fire_coverage/$coverage>={params.min_frac_acc})' \
         ) \
             | cut -f 1-3 \
             | bedtools sort \

@mrvollger
Copy link
Copy Markdown
Member

Thanks! And the PR looks good. I will merge this before the next patch release.

@mrvollger mrvollger merged commit 835868d into fiberseq:main Apr 1, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants