📝 Walkthrough

The changes refactor DuckDB index creation to stream contig-by-contig data instead of building a single combined table, add configurable batch-size parameters, and thread the new parameters through the workflow.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Workflow as Snakemake<br/>Workflow
    participant Hail as Hail<br/>(per-contig)
    participant TSV as TSV<br/>(intermediate)
    participant Polars as Polars<br/>(chunks)
    participant DuckDB as DuckDB<br/>(target)
    Workflow->>Hail: For each contig:<br/>build_contig_sequences_table()
    Hail->>TSV: export_sequences_table_to_tsv()
    TSV->>Polars: iter_dataframe_chunks(polars_chunk_size)
    loop Each chunk batch
        Polars->>DuckDB: CREATE TABLE AS SELECT (1st)<br/>or INSERT INTO (rest)
    end
    DuckDB->>DuckDB: CREATE INDEX on<br/>sequence_id
    DuckDB->>DuckDB: Write metadata<br/>(window_size, pops_legend,<br/>VERSION)
```
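The chunked load pattern in the diagram — CREATE TABLE AS SELECT for the first batch, INSERT INTO for the rest, index built last — can be sketched as below. This is only an illustration of the control flow: it uses `sqlite3` from the standard library so the snippet stays dependency-free, with a temp staging table playing the role that a registered Polars DataFrame plays for DuckDB in the real code; the table and column names are hypothetical.

```python
import sqlite3
from collections.abc import Iterator, Sequence

Row = tuple[int, str]

def iter_chunks(rows: Sequence[Row], chunk_size: int) -> Iterator[Sequence[Row]]:
    # Stand-in for iter_dataframe_chunks(): yield fixed-size batches.
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def load_streamed(conn: sqlite3.Connection, rows: Sequence[Row], chunk_size: int) -> None:
    created_table = False
    for chunk in iter_chunks(rows, chunk_size):
        # Stage the chunk; in the real code a Polars DataFrame is visible
        # to DuckDB directly, here a temp table plays that role.
        conn.execute("CREATE TEMP TABLE staging (sequence_id INTEGER, seq TEXT)")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", chunk)
        if not created_table:
            # First batch creates the target table with a matching schema.
            conn.execute("CREATE TABLE sequences AS SELECT * FROM staging")
            created_table = True
        else:
            # Later batches append.
            conn.execute("INSERT INTO sequences SELECT * FROM staging")
        conn.execute("DROP TABLE staging")
    # The lookup index is built once, after all chunks are loaded.
    conn.execute("CREATE INDEX idx_sequence_id ON sequences(sequence_id)")
```

The point of the branch is that no combined intermediate table ever has to exist outside the target database: each batch is staged, appended, and discarded.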
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
divref/divref/tools/create_duckdb_index.py (1)
298-302: Use a generalized collection type for `pops_legend` parameters

`pops_legend` is consumed read-only; `Sequence[str]` is a better parameter contract than `list[str]`.

Proposed refactor

```diff
-from collections.abc import Iterator
+from collections.abc import Iterator, Sequence
@@
 def export_sequences_table_to_tsv(
     ht: hl.Table,
     out_file: Path,
-    pops_legend: list[str],
+    pops_legend: Sequence[str],
 ) -> None:
@@
 def iter_dataframe_chunks(
     *,
     tsv: Path,
-    pops_legend: list[str],
+    pops_legend: Sequence[str],
     chunk_size: int,
 ) -> Iterator[polars.DataFrame]:
```

As per coding guidelines, "For Python function parameters, accept the most general type practical (e.g., `Iterable` over `List`)".

Also applies to: 338-340
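The practical effect of the looser annotation can be seen in a tiny sketch (the function name here is hypothetical): a `Sequence[str]` parameter accepts lists, tuples, and other read-only sequences alike, whereas `list[str]` would reject everything but a list under a type checker.

```python
from collections.abc import Sequence

def format_legend(pops_legend: Sequence[str]) -> str:
    # Read-only use: iteration and joining, no mutation.
    return ",".join(pops_legend)

# Both calls are valid under Sequence[str]; only the first
# would type-check if the parameter were annotated list[str].
as_list = format_legend(["AFR", "EUR", "EAS"])
as_tuple = format_legend(("AFR", "EUR", "EAS"))
```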
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/create_duckdb_index.py` around lines 298-302: Change the parameter type annotation for pops_legend from list[str] to a more general Sequence[str] (from collections.abc) in export_sequences_table_to_tsv and any other functions that accept a read-only pops_legend parameter; update the type hint imports accordingly, leave runtime behavior unchanged (do not mutate pops_legend), and adjust docstrings/signatures where the same parameter appears so the API accepts any sequence of strings rather than specifically a list.

workflows/generate_divref.smk (1)
250-262: Please run Snakemake lint validation for this workflow change

Run `pixi run lint --check` to validate the Snakemake file. This linting check is required as per project guidelines for all workflow files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workflows/generate_divref.smk` around lines 250 - 262, Run the Snakemake linter against the modified workflow and fix any reported issues: execute "pixi run lint --check" for the workflows/generate_divref.smk change (the shell block invoking "divref create-duckdb-index" and surrounding rule) and update the rule/shell formatting to satisfy the linter (e.g., quoting, indentation, line breaks, or parameter usage) until the lint command passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 128-149: The per-contig TSV files created by
export_sequences_table_to_tsv (contig_tsv from
per_contig_tsvs[table_pair.contig]) are never removed when
retain_per_contig_tsvs is False; update the processing loop in
create_duckdb_index.py (the block using iter_dataframe_chunks, contig_rows,
sequence_id_offset and created_table) to delete contig_tsv after it has been
fully consumed (ideally in a finally/cleanup step per contig) when
retain_per_contig_tsvs is False, using Path.unlink() or equivalent and
handling/ignoring file-not-found errors so temporary TSVs are cleaned up without
breaking the build.
- Around line 150-155: After populating the sequences table but before executing
conn.execute("CREATE INDEX idx_sequence_id ON sequences(sequence_id)"), add an
explicit check that the sequences table exists and contains at least one row
(e.g., run a SELECT COUNT(*) FROM sequences using the same DuckDB connection or
use the row counter variable used during the loop); if the count is zero or the
table is absent, raise/exit with a clear error message like "No sequences
produced across contigs — aborting index creation" so the downstream CREATE
INDEX call is never executed on an empty/nonexistent table.
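A guard matching the second comment can be sketched like this. It uses `sqlite3` from the standard library purely so the snippet is self-contained; the real code would run the same `COUNT(*)` query on its DuckDB connection, and the catalog query shown for table existence is SQLite's (`sqlite_master`), where DuckDB exposes `information_schema.tables` for the same purpose.

```python
import sqlite3

def create_sequence_index(conn: sqlite3.Connection) -> None:
    # Check the table exists before querying it.
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='sequences'"
    ).fetchone()
    # Then check it actually holds rows.
    n_rows = (
        conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0] if exists else 0
    )
    if n_rows == 0:
        raise RuntimeError(
            "No sequences produced across contigs - aborting index creation"
        )
    conn.execute("CREATE INDEX idx_sequence_id ON sequences(sequence_id)")
```

Failing loudly here is cheaper than letting `CREATE INDEX` error on a missing table, and it leaves a clear message pointing at the upstream per-contig export rather than at the indexing step.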
---
Nitpick comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 298-302: Change the parameter type annotation for pops_legend from
list[str] to a more general Sequence[str] (from collections.abc) in
export_sequences_table_to_tsv and any other functions that accept a read-only
pops_legend parameter; update the type hint imports accordingly, leave runtime
behavior unchanged (do not mutate pops_legend), and adjust docstrings/signatures
where the same parameter appears so the API accepts any sequence of strings
rather than specifically a list.
In `@workflows/generate_divref.smk`:
- Around line 250-262: Run the Snakemake linter against the modified workflow
and fix any reported issues: execute "pixi run lint --check" for the
workflows/generate_divref.smk change (the shell block invoking "divref
create-duckdb-index" and surrounding rule) and update the rule/shell formatting
to satisfy the linter (e.g., quoting, indentation, line breaks, or parameter
usage) until the lint command passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 444e7f60-777e-440b-90b3-709eacae2e01
📒 Files selected for processing (4)
- .gitignore
- divref/divref/tools/create_duckdb_index.py
- workflows/config/config_schema.yml
- workflows/generate_divref.smk
Summary by CodeRabbit
New Features
Improvements
Chores