
feat: stream chunks for database input #42

Merged
ameynert merged 8 commits into main from am_streaming
Apr 29, 2026

Conversation

@ameynert (Collaborator) commented Apr 29, 2026

Summary by CodeRabbit

  • New Features

    • Added configurable batch-size parameter for DuckDB index creation (default: 100,000 rows).
    • Added force flag to overwrite existing index files.
    • Added option to retain intermediate data files.
  • Improvements

    • Optimized memory usage by processing sequence data in streaming batches instead of loading entirely into memory.
  • Chores

    • Updated git ignore rules to exclude scratch directory.

@coderabbitai (Bot, Contributor) commented Apr 29, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 55 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization, which allows additional reviews beyond the hourly cap. Account admins can enable it under Billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e61d411-a7c0-4d88-96a9-b887a6f0a4e3

📥 Commits

Reviewing files that changed from the base of the PR and between 1295d1f and 5ae862c.

📒 Files selected for processing (1)
  • divref/divref/tools/create_duckdb_index.py
📝 Walkthrough

Walkthrough

The changes refactor DuckDB index creation to stream contig-by-contig data instead of building a single combined table, add a configurable batch-size parameter, and thread the new polars_chunk_size configuration through the workflow schema and Snakemake rules to enable tunable, memory-efficient processing.

Changes

  • Git Ignore (.gitignore): Adds the scratch/ directory to the version-control excludes.
  • Core Refactoring (divref/divref/tools/create_duckdb_index.py): Restructures index creation to stream per-contig data via Hail tables exported to TSVs, then read into DuckDB in Polars batches. Adds polars_chunk_size, retain_per_contig_tsvs, and force parameters; replaces build_sequences_table with build_contig_sequences_table, removes export_sequences_table_to_dataframe and write_duckdb_index, and adds iter_dataframe_chunks for batch iteration.
  • Configuration Support (workflows/config/config_schema.yml, workflows/generate_divref.smk): Introduces the polars_chunk_size config schema property (min 10,000, default 100,000) and threads it as a CLI flag to the create-duckdb-index command in the workflow rule.

Sequence Diagram(s)

sequenceDiagram
    participant Workflow as Snakemake<br/>Workflow
    participant Hail as Hail<br/>(per-contig)
    participant TSV as TSV<br/>(intermediate)
    participant Polars as Polars<br/>(chunks)
    participant DuckDB as DuckDB<br/>(target)

    Workflow->>Hail: For each contig:<br/>build_contig_sequences_table()
    Hail->>TSV: export_sequences_table_to_tsv()
    TSV->>Polars: iter_dataframe_chunks(polars_chunk_size)
    loop Each chunk batch
        Polars->>DuckDB: CREATE TABLE AS SELECT (1st)<br/>or INSERT INTO (rest)
    end
    DuckDB->>DuckDB: CREATE INDEX on<br/>sequence_id
    DuckDB->>DuckDB: Write metadata<br/>(window_size, pops_legend,<br/>VERSION)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • feat: split output tool into two #39: Previously split create_duckdb_index into separate tools; this PR further refactors it to use streaming per-contig TSVs with chunked Polars batch inserts.

Poem

🐰 Data streams like carrots down the burrow deep,

Per-contig chunks in DuckDB's store to keep,

No giant tables clogging memory's way,

Just batch-by-batch, efficient play! 🌾✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title accurately summarizes the main architectural change: refactoring data loading to stream chunks into DuckDB instead of loading all data at once.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check: Passed (check skipped because no linked issues were found for this pull request).
  • Out of Scope Changes Check: Passed (check skipped because no linked issues were found for this pull request).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Review rate limit: 0/1 reviews remaining, refill in 39 minutes and 55 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
divref/divref/tools/create_duckdb_index.py (1)

298-302: Use a generalized collection type for pops_legend parameters

pops_legend is consumed read-only; Sequence[str] is a better parameter contract than list[str].

Proposed refactor
-from collections.abc import Iterator
+from collections.abc import Iterator, Sequence
@@
 def export_sequences_table_to_tsv(
     ht: hl.Table,
     out_file: Path,
-    pops_legend: list[str],
+    pops_legend: Sequence[str],
 ) -> None:
@@
 def iter_dataframe_chunks(
     *,
     tsv: Path,
-    pops_legend: list[str],
+    pops_legend: Sequence[str],
     chunk_size: int,
 ) -> Iterator[polars.DataFrame]:

As per coding guidelines, "For Python function parameters, accept the most general type practical (e.g., Iterable over List)".

Also applies to: 338-340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/create_duckdb_index.py` around lines 298 - 302, Change
the parameter type annotation for pops_legend from list[str] to a more general
Sequence[str] (from collections.abc) in export_sequences_table_to_tsv and any
other functions that accept a read-only pops_legend parameter; update the type
hint imports accordingly, leave runtime behavior unchanged (do not mutate
pops_legend), and adjust docstrings/signatures where the same parameter appears
so the API accepts any sequence of strings rather than specifically a list.
workflows/generate_divref.smk (1)

250-262: Please run Snakemake lint validation for this workflow change

Run pixi run lint --check to validate the Snakemake file. This linting check is required as per project guidelines for all workflow files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflows/generate_divref.smk` around lines 250 - 262, Run the Snakemake
linter against the modified workflow and fix any reported issues: execute "pixi
run lint --check" for the workflows/generate_divref.smk change (the shell block
invoking "divref create-duckdb-index" and surrounding rule) and update the
rule/shell formatting to satisfy the linter (e.g., quoting, indentation, line
breaks, or parameter usage) until the lint command passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 128-149: The per-contig TSV files created by
export_sequences_table_to_tsv (contig_tsv from
per_contig_tsvs[table_pair.contig]) are never removed when
retain_per_contig_tsvs is False; update the processing loop in
create_duckdb_index.py (the block using iter_dataframe_chunks, contig_rows,
sequence_id_offset and created_table) to delete contig_tsv after it has been
fully consumed (ideally in a finally/cleanup step per contig) when
retain_per_contig_tsvs is False, using Path.unlink() or equivalent and
handling/ignoring file-not-found errors so temporary TSVs are cleaned up without
breaking the build.
- Around line 150-155: After populating the sequences table but before executing
conn.execute("CREATE INDEX idx_sequence_id ON sequences(sequence_id)"), add an
explicit check that the sequences table exists and contains at least one row
(e.g., run a SELECT COUNT(*) FROM sequences using the same DuckDB connection or
use the row counter variable used during the loop); if the count is zero or the
table is absent, raise/exit with a clear error message like "No sequences
produced across contigs — aborting index creation" so the downstream CREATE
INDEX call is never executed on an empty/nonexistent table.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 444e7f60-777e-440b-90b3-709eacae2e01

📥 Commits

Reviewing files that changed from the base of the PR and between e25ee1d and 1295d1f.

📒 Files selected for processing (4)
  • .gitignore
  • divref/divref/tools/create_duckdb_index.py
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk

Comment thread divref/divref/tools/create_duckdb_index.py
Comment thread divref/divref/tools/create_duckdb_index.py
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting on April 29, 2026 at 17:21 with GitHub Actions (now inactive)
@ameynert ameynert merged commit b0076a6 into main Apr 29, 2026
4 checks passed
@ameynert ameynert deleted the am_streaming branch April 29, 2026 17:59
