feat: split output tool into two#39

Merged
ameynert merged 11 commits into main from am_duckdb_tool
Apr 29, 2026
Conversation

@ameynert ameynert commented Apr 23, 2026

Summary by CodeRabbit

  • New Features

    • Added new CLI commands for creating DivRef DuckDB indexes and generating FASTA files from them.
  • Refactor

    • Restructured the pipeline to build a unified database from all contigs before generating per-chromosome FASTA files.
    • Consolidated population allele frequency configuration thresholds into a single parameter.
  • Tests

    • Added test coverage for haplotype coordinate calculations across various variant types.

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 36 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 406220f4-fa8c-42e3-b2c4-fe070f2f7ca9

📥 Commits

Reviewing files that changed from the base of the PR and between 659ee6e and 8abb405.

📒 Files selected for processing (4)
  • divref/divref/haplotype.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/tests/test_haplotype.py
  • workflows/config/config_schema.yml
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_duckdb_tool

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 3 minutes and 36 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 23, 2026 21:36 — with GitHub Actions Inactive
@ameynert ameynert changed the title from "Am duckdb tool" to "feat: split output tool into two" Apr 23, 2026
Base automatically changed from am_split_tool to main April 28, 2026 21:48
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 28, 2026 21:51 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 28, 2026 23:15 — with GitHub Actions Inactive
@ameynert ameynert marked this pull request as ready for review April 29, 2026 15:38
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 29, 2026 15:38 — with GitHub Actions Inactive

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 168-187: The current computation of start/end (using min_variant
and max_variant) produces 1-based inclusive coordinates; to make them true
0-based half-open adjust both endpoints by subtracting 1 from locus positions:
set start = min_variant.locus.position - window_size - 1 and set end =
max_variant.locus.position - 1 + hl.len(max_variant.alleles[0]) + window_size
(and update the docstring to state 0-based half-open). Locate the expressions
that build hl.struct(start=..., end=...) referencing sorted_variants,
min_variant, max_variant, window_size, and max_variant.alleles[0] and apply
these arithmetic changes so downstream code persists correct 0-based half-open
spans (or alternatively, if you prefer the 1-based convention, rename/document
the fields explicitly rather than changing math).

In `@divref/divref/tools/create_divref_fasta.py`:
- Around line 22-28: The current loop only iterates over contigs present in df
and omits creating files for configured contigs with zero rows; change the
function to accept the expected contig list (e.g., contigs or
configured_contigs) and iterate over that list instead of
sorted(df["contig"].unique().to_list()), then for each contig build df_chrom =
df.filter(df["contig"] == chrom) and always create out_path =
Path(f"{output_base}.{chrom}.fasta"); if df_chrom has rows write the sequences
as before, otherwise write an empty FASTA (or a comment/header line) so the file
exists; update logging (logger.info) to show when an empty FASTA is emitted.

In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 173-176: The argmax computation is using hl.max on a scalar
(hl.max(x.AF)) which is invalid; change the mapping to produce an array of
scalar AFs and feed that to hl.argmax by replacing
hl.argmax(va.gnomad_freqs.map(lambda x: hl.max(x.AF))) with
hl.argmax(va.gnomad_freqs.map(lambda x: x.AF)), so that argmax_pop, used in
va.select for max_pop and max_empirical_AF (va.gnomad_freqs[argmax_pop].AF),
correctly indexes the population with the highest AF.

In `@workflows/config/config_schema.yml`:
- Around line 60-63: The description for the "Minimum variant allele frequency"
config entry is truncated ("Also applied to")—complete the sentence to
explicitly name the other command(s) or consumer(s) that use this value (e.g.,
append "Also applied to `divref <other-command>`") so generated docs are
unambiguous; update the description string in the config_schema.yml entry that
currently references `divref extract-gnomad-afs` and `divref compute-haplotypes`
to include the missing consumer(s) wrapped in backticks.
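The coordinate-convention change suggested for haplotype.py can be illustrated outside of Hail. Below is a hedged plain-Python sketch (the real code builds Hail expressions over min_variant/max_variant loci; the function name and signature here are illustrative only) showing how the review's arithmetic converts the 1-based inclusive span to a 0-based half-open one: the start shifts down by one, while the exclusive end equals the old inclusive end, so the span length is unchanged.

```python
def to_zero_based_half_open(min_pos: int, max_pos: int,
                            ref_allele_len: int, window_size: int) -> tuple[int, int]:
    """Hypothetical sketch of the reviewed arithmetic.

    1-based inclusive convention (the original code):
        start = min_pos - window_size
        end   = max_pos + ref_allele_len - 1 + window_size
    0-based half-open (the suggested fix): subtract 1 from the start;
    the exclusive end is numerically identical to the inclusive end.
    """
    start = min_pos - window_size - 1
    end = max_pos - 1 + ref_allele_len + window_size
    return start, end
```

For example, a variant cluster spanning positions 101-150 with a 3 bp reference allele on the last variant and a 10 bp window yields the half-open span (90, 162), covering the same 72 bases as the 1-based inclusive span [91, 162].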
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df69de02-d1f4-4b53-8920-a176e33606c7

📥 Commits

Reviewing files that changed from the base of the PR and between f5043ea and 659ee6e.

⛔ Files ignored due to path filters (2)
  • divref/uv.lock is excluded by !**/*.lock
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • divref/divref/haplotype.py
  • divref/divref/main.py
  • divref/divref/tools/create_divref_fasta.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/divref/tools/create_fasta_and_index.py
  • divref/pyproject.toml
  • divref/tests/test_haplotype.py
  • pixi.toml
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk
💤 Files with no reviewable changes (2)
  • pixi.toml
  • divref/divref/tools/create_fasta_and_index.py

Comment thread divref/divref/tools/create_divref_fasta.py Outdated
Comment on lines +22 to +28
for chrom in sorted(df["contig"].unique().to_list()):
    logger.info("Creating FASTA for chromosome %s", chrom)
    df_chrom = df.filter(df["contig"] == chrom)
    out_path = Path(f"{output_base}.{chrom}.fasta")
    with open(out_path, "w") as fasta_out:
        for sequence_id, sequence in df_chrom.select("sequence_id", "sequence").iter_rows():
            fasta_out.write(f">{sequence_id}\n{sequence}\n")

⚠️ Potential issue | 🟠 Major

Don’t skip configured contigs that end up with zero sequences.

This loop only writes FASTAs for contigs present in df["contig"], but workflows/generate_divref.smk Lines 271-274 declare one output per configured chromosome. If a chromosome is filtered down to zero rows, no file gets created and the workflow fails on missing outputs. Pass the expected contig list into this tool and emit empty FASTAs for absent contigs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/create_divref_fasta.py` around lines 22 - 28, The current
loop only iterates over contigs present in df and omits creating files for
configured contigs with zero rows; change the function to accept the expected
contig list (e.g., contigs or configured_contigs) and iterate over that list
instead of sorted(df["contig"].unique().to_list()), then for each contig build
df_chrom = df.filter(df["contig"] == chrom) and always create out_path =
Path(f"{output_base}.{chrom}.fasta"); if df_chrom has rows write the sequences
as before, otherwise write an empty FASTA (or a comment/header line) so the file
exists; update logging (logger.info) to show when an empty FASTA is emitted.

Comment thread divref/divref/tools/create_duckdb_index.py Outdated
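The argmax fix flagged for create_duckdb_index.py boils down to mapping each population record to its scalar AF before taking the argmax, rather than wrapping a scalar in another max(). A plain-Python analogue (the actual code uses Hail's hl.argmax over an expression array; the sample data here is made up):

```python
def argmax_af(gnomad_freqs: list[dict]) -> int:
    # Equivalent of va.gnomad_freqs.map(lambda x: x.AF): one scalar per population.
    afs = [x["AF"] for x in gnomad_freqs]
    # Equivalent of hl.argmax: index of the population with the highest AF.
    return max(range(len(afs)), key=afs.__getitem__)

freqs = [{"pop": "afr", "AF": 0.01}, {"pop": "nfe", "AF": 0.12}, {"pop": "eas", "AF": 0.05}]
i = argmax_af(freqs)
# freqs[i] is then the record used for max_pop and max_empirical_AF
```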
Comment thread workflows/config/config_schema.yml
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 29, 2026 16:34 — with GitHub Actions Inactive
@ameynert ameynert merged commit e25ee1d into main Apr 29, 2026
4 checks passed
@ameynert ameynert deleted the am_duckdb_tool branch April 29, 2026 17:00