feat: split output tool into two#39

Merged
ameynert merged 11 commits into main from am_duckdb_tool
Apr 29, 2026
Conversation

@ameynert ameynert commented Apr 23, 2026

Summary by CodeRabbit

  • New Features

    • Added new CLI commands for creating DivRef DuckDB indexes and generating FASTA files from them.
  • Refactor

    • Restructured the pipeline to build a unified database from all contigs before generating per-chromosome FASTA files.
    • Consolidated population allele frequency configuration thresholds into a single parameter.
  • Tests

    • Added test coverage for haplotype coordinate calculations across various variant types.

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 36 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 406220f4-fa8c-42e3-b2c4-fe070f2f7ca9

📥 Commits

Reviewing files that changed from the base of the PR and between 659ee6e and 8abb405.

📒 Files selected for processing (4)
  • divref/divref/haplotype.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/tests/test_haplotype.py
  • workflows/config/config_schema.yml
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_duckdb_tool

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 3 minutes and 36 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 23, 2026 21:36 — with GitHub Actions Inactive
@ameynert ameynert changed the title from "Am duckdb tool" to "feat: split output tool into two" Apr 23, 2026
Base automatically changed from am_split_tool to main April 28, 2026 21:48
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 28, 2026 21:51 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 28, 2026 23:15 — with GitHub Actions Inactive
@ameynert ameynert marked this pull request as ready for review April 29, 2026 15:38
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 29, 2026 15:38 — with GitHub Actions Inactive

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 168-187: The current computation of start/end (using min_variant
and max_variant) produces 1-based inclusive coordinates; to make them true
0-based half-open adjust both endpoints by subtracting 1 from locus positions:
set start = min_variant.locus.position - window_size - 1 and set end =
max_variant.locus.position - 1 + hl.len(max_variant.alleles[0]) + window_size
(and update the docstring to state 0-based half-open). Locate the expressions
that build hl.struct(start=..., end=...) referencing sorted_variants,
min_variant, max_variant, window_size, and max_variant.alleles[0] and apply
these arithmetic changes so downstream code persists correct 0-based half-open
spans (or alternatively, if you prefer the 1-based convention, rename/document
the fields explicitly rather than changing math).

In `@divref/divref/tools/create_divref_fasta.py`:
- Around line 22-28: The current loop only iterates over contigs present in df
and omits creating files for configured contigs with zero rows; change the
function to accept the expected contig list (e.g., contigs or
configured_contigs) and iterate over that list instead of
sorted(df["contig"].unique().to_list()), then for each contig build df_chrom =
df.filter(df["contig"] == chrom) and always create out_path =
Path(f"{output_base}.{chrom}.fasta"); if df_chrom has rows write the sequences
as before, otherwise write an empty FASTA (or a comment/header line) so the file
exists; update logging (logger.info) to show when an empty FASTA is emitted.

In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 173-176: The argmax computation is using hl.max on a scalar
(hl.max(x.AF)) which is invalid; change the mapping to produce an array of
scalar AFs and feed that to hl.argmax by replacing
hl.argmax(va.gnomad_freqs.map(lambda x: hl.max(x.AF))) with
hl.argmax(va.gnomad_freqs.map(lambda x: x.AF)), so that argmax_pop, used in
va.select for max_pop and max_empirical_AF (va.gnomad_freqs[argmax_pop].AF),
correctly indexes the population with the highest AF.

In `@workflows/config/config_schema.yml`:
- Around line 60-63: The description for the "Minimum variant allele frequency"
config entry is truncated ("Also applied to")—complete the sentence to
explicitly name the other command(s) or consumer(s) that use this value (e.g.,
append "Also applied to `divref <other-command>`") so generated docs are
unambiguous; update the description string in the config_schema.yml entry that
currently references `divref extract-gnomad-afs` and `divref compute-haplotypes`
to include the missing consumer(s) wrapped in backticks.
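The coordinate-convention change suggested for haplotype.py can be illustrated outside of Hail. Below is a hedged plain-Python sketch (the real code builds Hail expressions over min_variant/max_variant loci; the function name and signature here are illustrative only) showing how the review's arithmetic converts the 1-based inclusive span to a 0-based half-open one: the start shifts down by one, while the exclusive end equals the old inclusive end, so the span length is unchanged.

```python
def to_zero_based_half_open(min_pos: int, max_pos: int,
                            ref_allele_len: int, window_size: int) -> tuple[int, int]:
    """Hypothetical sketch of the reviewed arithmetic.

    1-based inclusive convention (the original code):
        start = min_pos - window_size
        end   = max_pos + ref_allele_len - 1 + window_size
    0-based half-open (the suggested fix): subtract 1 from the start;
    the exclusive end is numerically identical to the inclusive end.
    """
    start = min_pos - window_size - 1
    end = max_pos - 1 + ref_allele_len + window_size
    return start, end
```

For example, a variant cluster spanning positions 101-150 with a 3 bp reference allele on the last variant and a 10 bp window yields the half-open span (90, 162), covering the same 72 bases as the 1-based inclusive span [91, 162].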
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df69de02-d1f4-4b53-8920-a176e33606c7

📥 Commits

Reviewing files that changed from the base of the PR and between f5043ea and 659ee6e.

⛔ Files ignored due to path filters (2)
  • divref/uv.lock is excluded by !**/*.lock
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • divref/divref/haplotype.py
  • divref/divref/main.py
  • divref/divref/tools/create_divref_fasta.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/divref/tools/create_fasta_and_index.py
  • divref/pyproject.toml
  • divref/tests/test_haplotype.py
  • pixi.toml
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk
💤 Files with no reviewable changes (2)
  • pixi.toml
  • divref/divref/tools/create_fasta_and_index.py

Comment thread divref/divref/tools/create_divref_fasta.py Outdated
Comment on lines +22 to +28
for chrom in sorted(df["contig"].unique().to_list()):
    logger.info("Creating FASTA for chromosome %s", chrom)
    df_chrom = df.filter(df["contig"] == chrom)
    out_path = Path(f"{output_base}.{chrom}.fasta")
    with open(out_path, "w") as fasta_out:
        for sequence_id, sequence in df_chrom.select("sequence_id", "sequence").iter_rows():
            fasta_out.write(f">{sequence_id}\n{sequence}\n")

⚠️ Potential issue | 🟠 Major

Don’t skip configured contigs that end up with zero sequences.

This loop only writes FASTAs for contigs present in df["contig"], but workflows/generate_divref.smk Lines 271-274 declare one output per configured chromosome. If a chromosome is filtered down to zero rows, no file gets created and the workflow fails on missing outputs. Pass the expected contig list into this tool and emit empty FASTAs for absent contigs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/create_divref_fasta.py` around lines 22 - 28, The current
loop only iterates over contigs present in df and omits creating files for
configured contigs with zero rows; change the function to accept the expected
contig list (e.g., contigs or configured_contigs) and iterate over that list
instead of sorted(df["contig"].unique().to_list()), then for each contig build
df_chrom = df.filter(df["contig"] == chrom) and always create out_path =
Path(f"{output_base}.{chrom}.fasta"); if df_chrom has rows write the sequences
as before, otherwise write an empty FASTA (or a comment/header line) so the file
exists; update logging (logger.info) to show when an empty FASTA is emitted.

Comment thread divref/divref/tools/create_duckdb_index.py Outdated
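The argmax fix flagged for create_duckdb_index.py boils down to mapping each population record to its scalar AF before taking the argmax, rather than wrapping a scalar in another max(). A plain-Python analogue (the actual code uses Hail's hl.argmax over an expression array; the sample data here is made up):

```python
def argmax_af(gnomad_freqs: list[dict]) -> int:
    # Equivalent of va.gnomad_freqs.map(lambda x: x.AF): one scalar per population.
    afs = [x["AF"] for x in gnomad_freqs]
    # Equivalent of hl.argmax: index of the population with the highest AF.
    return max(range(len(afs)), key=afs.__getitem__)

freqs = [{"pop": "afr", "AF": 0.01}, {"pop": "nfe", "AF": 0.12}, {"pop": "eas", "AF": 0.05}]
i = argmax_af(freqs)
# freqs[i] is then the record used for max_pop and max_empirical_AF
```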
Comment thread workflows/config/config_schema.yml
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 29, 2026 16:34 — with GitHub Actions Inactive
@ameynert ameynert merged commit e25ee1d into main Apr 29, 2026
4 checks passed
@ameynert ameynert deleted the am_duckdb_tool branch April 29, 2026 17:00