fix: guard against division by zero in helper_fdr_peaks_by_fire_elements #48
Conversation
Thanks for this fix! I am concerned that this isn't already covered by the fire_coverage being greater than 1 part. fire_coverage should always be less than or equal to coverage. Do you have any small dataset I can play with to see what is happening in ft pileup at all?
Is there an email address where I can send you some files? Also, I am not sure whether I can send you the bam files (I will double check). If not, are there any intermediate files from the pipeline you would like? I do have the inputs that go into the failing job.
Yes please! Even one locus of the bam would be enough. If you can identify a region where fire coverage is greater than coverage and send me just those overlapping reads I should be able to figure out what's going wrong. |
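As a sketch of what sending "just those overlapping reads" could look like, the commands below carve one locus out of a BAM with samtools; the region and filenames here are hypothetical placeholders, not taken from the thread:

```shell
# Hypothetical locus and filenames -- substitute the region where
# fire_coverage > coverage shows up in the pileup.
samtools view -b sample.bam chr1:1000000-1010000 > locus.bam
samtools index locus.bam
```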
force-pushed from f439083 to fc33686
Ok, I looked a little bit more into the issue. I now believe it is a bug in bioawk which in rare circumstances will write a phantom empty record due to a buffer-boundary edge case. Like you mentioned, the upstream filters should already rule out zero-coverage rows, so the offending record is not really in the data. This only manifests when the input file size modulo 16384 places the final newline exactly at a buffer boundary, which explains why the error is data- and chromosome-dependent.

Fix

Guard against the phantom record by checking `NF>0` before dividing:
In `helper_fdr_peaks_by_fire_elements`:

```shell
# Before:
| bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})'
# After:
| bioawk -tc hdr '(NR==1)||(NF>0 && $fire_coverage/$coverage>={params.min_per_acc_peak})'
```

In `wide_fire_peaks`:

```shell
# Before:
| bioawk -tc hdr 'NR==1 || $coverage>0'
| bioawk -tc hdr 'NR==1 || ($fire_coverage/$coverage>={params.min_frac_acc})'
# After:
| bioawk -tc hdr 'NR==1 || (NF>0 && $coverage>0 && $fire_coverage/$coverage>={params.min_frac_acc})'
```

The upstream bioawk bug

The root cause is a bug in bioawk (https://github.com/lh3/bioawk).

Diff

```diff
@@ -198,7 +198,7 @@ rule helper_fdr_peaks_by_fire_elements:
         | bioawk -tc hdr '(NR==1)||($is_local_max=="true")' \
         | csvtk filter -tT -C '$' -f "FDR<={params.max_peak_fdr}" \
         | csvtk filter -tT -C '$' -f "fire_coverage>1" \
-        | bioawk -tc hdr '(NR==1)||($fire_coverage/$coverage>={params.min_per_acc_peak})' \
+        | bioawk -tc hdr '(NR==1)||(NF>0 && $fire_coverage/$coverage>={params.min_per_acc_peak})' \
         | bedtools intersect -wa -wb -sorted -a - \
             -b <(tabix {input.fire} {wildcards.chrom} \
                 | cut -f 1-3 \
@@ -295,8 +295,7 @@ rule wide_fire_peaks:
         ( \
             bgzip -cd {input.bed}; \
             bioawk -tc hdr 'NR==1 || $FDR<={params.max_peak_fdr}' {input.track} \
-            | bioawk -tc hdr 'NR==1 || $coverage>0' \
-            | bioawk -tc hdr 'NR==1 || ($fire_coverage/$coverage>={params.min_frac_acc})' \
+            | bioawk -tc hdr 'NR==1 || (NF>0 && $coverage>0 && $fire_coverage/$coverage>={params.min_frac_acc})' \
         ) \
         | cut -f 1-3 \
         | bedtools sort \
```
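To make the suspected failure mode concrete, here is a minimal, hypothetical bash sketch of the bug class (it is NOT bioawk's actual C code): a buffered reader that flushes a phantom empty record when the input's final newline lands exactly on a buffer boundary. The function name and the tiny 8-byte buffer are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the suspected bug class -- not bioawk's real code.
buggy_records() {
  local bufsize=$1 carry='' chunk rec pending=0
  while IFS= read -r -N "$bufsize" chunk || [[ -n $chunk ]]; do
    # BUG: a full-size chunk ending in '\n' wrongly marks an "open" record.
    if [[ ${#chunk} -eq $bufsize && $chunk == *$'\n' ]]; then
      pending=1
    else
      pending=0
    fi
    carry+=$chunk
    while [[ $carry == *$'\n'* ]]; do       # emit each complete record
      rec=${carry%%$'\n'*}
      printf '[%s]\n' "$rec"
      carry=${carry#*$'\n'}
    done
  done
  # BUG: also flushes when carry is empty but the last chunk hit the boundary.
  if [[ -n $carry || $pending -eq 1 ]]; then
    printf '[%s]\n' "$carry"
  fi
}

printf 'abc\nde\n'  | buggy_records 8   # 7 bytes: records [abc] [de]
printf 'abc\ndef\n' | buggy_records 8   # 8 bytes: records [abc] [def] [] (phantom)
```

An `NF>0` guard downstream is exactly the defense the patch adds: the phantom `''` record is dropped instead of being fed into a division over fields that do not exist.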
Thanks! And the PR looks good. I will merge this before the next patch release.
Problem
The `helper_fdr_peaks_by_fire_elements` rule in `workflow/rules/fire-peaks.smk` crashes with a `bioawk: division by zero` error when a pileup row has `coverage == 0` and passes the upstream filters (`is_local_max=="true"`, `FDR<=0.05`, `fire_coverage>1`). Because Snakemake runs in bash strict mode, this kills the entire pipeline. The error is chromosome- and sample-dependent: it only manifests when such a row exists on a given chromosome.
Root cause
Line 201 evaluates `$fire_coverage/$coverage` without first checking that `$coverage > 0`.

Fix
Add a `$coverage > 0` guard immediately before the division, matching the pattern already used in `wide_fire_peaks` (lines 298–299). Rows with zero coverage cannot have accessible fibers, so filtering them out is semantically correct and does not change results.
Diff