
feat: support HTTP DuckDB queries in WASM notebooks #9480

Open
peter-gy wants to merge 22 commits into marimo-team:main from peter-gy:ptr/duckdb-wasm-patch

Contributor

@peter-gy peter-gy commented May 8, 2026

Motivated by marimo-team/quarto-marimo#74, marimo-team/jupyter-book-marimo#1, and #9413.

DuckDB remote file queries fail in Pyodide because DuckDB-WASM can't use httpfs. Therefore, URL-based SQL like FROM 'https://...' and read_csv/read_parquet/read_json('https://...') are unusable in WASM notebooks today.

This PR adds a DuckDB WASM fallback layer for mo.sql, SQL cells, raw duckdb.sql/query/execute/query_df, connection SQL methods, and direct duckdb.read_csv/read_parquet/read_json calls.

It translates queries such as

SELECT * FROM read_csv('https://example.com/cars.csv')
SELECT * FROM 'https://example.com/cars.csv'
-- or duckdb.read_csv('https://example.com/cars.csv') via Python API

into

SELECT * FROM __marimo_wasm_duckdb_remote_0

where __marimo_wasm_duckdb_remote_0 is bound to a fetched pandas DataFrame, which DuckDB can query through Python replacement scans.
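The rewrite described above can be sketched roughly as follows. This is a simplified stand-in and not the PR's code: the PR analyzes the sqlglot AST, whereas this sketch uses regular expressions, which cannot reliably tell SQL strings, comments, and identifiers apart. The function name `rewrite_remote_sources` is hypothetical; only the generated table-name prefix comes from the description.

```python
from __future__ import annotations

import re

# Assumption: only single-argument reader calls with a literal URL are handled.
_READER_CALL = re.compile(
    r"\b(?:read_csv|read_parquet|read_json)\(\s*'(https?://[^']+)'\s*\)",
    re.IGNORECASE,
)
# Assumption: a bare URL literal counts as a scan only right after FROM or JOIN.
_BARE_SCAN = re.compile(r"\b(FROM|JOIN)\s+'(https?://[^']+)'", re.IGNORECASE)


def rewrite_remote_sources(sql: str) -> tuple[str, dict[str, str]]:
    """Return the rewritten SQL and a mapping of generated table name -> URL."""
    tables: dict[str, str] = {}

    def bind(url: str) -> str:
        # Generate one placeholder table name per remote reference.
        name = f"__marimo_wasm_duckdb_remote_{len(tables)}"
        tables[name] = url
        return name

    # Reader calls first, so the URL inside them is not matched a second time.
    sql = _READER_CALL.sub(lambda m: bind(m.group(1)), sql)
    sql = _BARE_SCAN.sub(lambda m: f"{m.group(1)} {bind(m.group(2))}", sql)
    return sql, tables


rewritten, tables = rewrite_remote_sources(
    "SELECT * FROM read_csv('https://example.com/cars.csv')"
)
```

Anything the patterns do not recognize (for example a reader call with extra options) is left untouched in this sketch, mirroring the idea that unsupported cases fall through to DuckDB's normal path.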

Underneath, the fallback layer:

  • uses sqlglot to analyze SQL and extract supported static remote file references from the AST
  • fetches remote files through Python/urllib via marimo's shared WASM fetch util
  • decodes fetched DuckDB file bytes into pandas DataFrames
  • hands those DataFrames back to DuckDB under generated table names
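The fetch, decode, and bind steps listed above might look roughly like this. It is a minimal sketch under stated assumptions: `fetch_url_bytes` borrows its name from the architecture diagram in this thread, but its signature here is guessed; `decode_csv` and `load_remote_tables` are hypothetical; and rows are decoded with the standard-library `csv` module rather than into pandas DataFrames, so the example runs without third-party packages.

```python
from __future__ import annotations

import csv
import io
import urllib.request
from typing import Callable


def fetch_url_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL synchronously with urllib (signature is an assumption)."""
    request = urllib.request.Request(url, headers={"User-Agent": "sketch"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def decode_csv(data: bytes, delimiter: str = ",") -> list[dict[str, str]]:
    """Decode CSV bytes into rows; the PR builds a pandas DataFrame instead."""
    text = io.StringIO(data.decode("utf-8"), newline="")
    return list(csv.DictReader(text, delimiter=delimiter))


def load_remote_tables(
    urls: list[str],
    fetch: Callable[[str], bytes] = fetch_url_bytes,
) -> dict[str, list[dict[str, str]]]:
    """Fetch each URL and bind the decoded rows to a generated table name."""
    return {
        f"__marimo_wasm_duckdb_remote_{i}": decode_csv(fetch(url))
        for i, url in enumerate(urls)
    }
```

Passing the fetcher as a parameter keeps the decode and bind logic testable without network access, for example `load_remote_tables(urls, fetch=lambda url: b"name,mpg\nbeetle,30\n")`.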

Unsupported or dynamic cases are left to DuckDB's normal path. The patch is a no-op outside Pyodide, and in Pyodide the DuckDB SQL fallback requires sqlglot for AST analysis.

Tested with unit coverage for the rewrite/fetch/read paths and manually in the WASM playground against hosted CSV, parquet, JSON, and GeoJSON datasets.

WASM Playground Demo

The demo shows the Pyodide build of marimo querying remote files with DuckDB in cases such as:

  • CSV via mo.sql: direct URL scan
  • Parquet via duckdb.sql: read_parquet(...)
  • Direct duckdb.read_csv Python API with patched options (custom delimiter)
  • JSON / GeoJSON path
  • Connection API via duckdb.connect
wasm-demo.mp4

Copilot AI review requested due to automatic review settings May 8, 2026 13:58

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| marimo-docs | Ready | Preview, Comment | May 12, 2026 3:47pm |


@peter-gy peter-gy changed the title from "feat: support DuckDB URL scans in WASM notebooks" to "feat: support DuckDB queries via HTTP in WASM notebooks" May 8, 2026
@peter-gy peter-gy changed the title from "feat: support DuckDB queries via HTTP in WASM notebooks" to "feat: support HTTP DuckDB queries in WASM notebooks" May 8, 2026
@peter-gy peter-gy added the enhancement (New feature or request) label May 8, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds a Pyodide/WASM-only DuckDB compatibility layer that rewrites supported remote URL scans into replacement scans backed by fetched pandas DataFrames, enabling mo.sql, SQL cells, and common DuckDB APIs to query https://... sources in WASM notebooks.

Changes:

  • Implement DuckDB WASM patching: SQL AST rewrite (sqlglot) + remote fetch + bytes→DataFrame decoding + replacement scan execution.
  • Add shared WASM URL fetch helper and integrate it into existing Polars WASM fallbacks.
  • Add extensive unit/integration coverage and update WASM workers to preload DuckDB-related deps when DuckDB is used.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/_runtime/test_patches.py | Adds unit test ensuring shared WASM fetch helper forwards Request/urlopen kwargs correctly. |
| tests/_runtime/test_duckdb_wasm.py | New test suite covering DuckDB SQL rewrite parity, direct-reader patching, and mo.sql/kernel integration in Pyodide. |
| marimo/_sql/utils.py | Hooks wrapped_sql / execute_duckdb_sql into the WASM DuckDB SQL rewrite path so mo.sql can transparently handle remote URLs. |
| marimo/_runtime/_wasm/_polars.py | Switches Polars fallback URL fetching to shared WASM fetch utility. |
| marimo/_runtime/_wasm/_patches.py | Extends patch framework with replace() for wrapper-only (no “call original first”) patching. |
| marimo/_runtime/_wasm/_fetch.py | New shared synchronous urllib-based fetch utility for Pyodide fallbacks. |
| marimo/_runtime/_wasm/_duckdb/__init__.py | Core DuckDB WASM patch implementation (direct readers + SQL APIs + eval-based replacement scan execution). |
| marimo/_runtime/_wasm/_duckdb/sources.py | sqlglot AST helpers to detect supported remote sources and extract literal args/options. |
| marimo/_runtime/_wasm/_duckdb/io.py | URL/option validation, reader selection, fetching, and multi-file concat semantics for remote sources. |
| marimo/_runtime/_wasm/_duckdb/dataframe.py | Bytes→DataFrame decoding via temp files + implementations for text/blob-like readers. |
| marimo/_output/formatters/formatters.py | Registers DuckDB formatter factory so importing DuckDB triggers WASM patch installation. |
| marimo/_output/formatters/df_formatters.py | Adds DuckDBFormatter that installs the DuckDB WASM patch on DuckDB import. |
| frontend/src/core/wasm/worker/worker.ts | Expands WASM dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/wasm/worker/bootstrap.ts | Expands notebook dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/islands/worker/worker.tsx | Expands islands worker dependency preloading heuristic to include DuckDB usage. |
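The "wrapper-only" patching mentioned for `_patches.py` can be pictured with the sketch below. The actual signature of marimo's `replace()` does not appear on this page, so everything here is a hypothetical illustration of the general idea: the wrapper is installed in place of the original and decides for itself whether to call it.

```python
import types


def replace(target, name, make_wrapper):
    """Swap target.<name> for a wrapper; return a callable that undoes it."""
    original = getattr(target, name)
    setattr(target, name, make_wrapper(original))

    def undo():
        setattr(target, name, original)

    return undo


def wasm_read(original):
    # The wrapper runs instead of the original and only delegates to it
    # for paths it does not handle itself.
    def wrapper(path):
        if path.startswith(("http://", "https://")):
            return f"fetched:{path}"
        return original(path)

    return wrapper


# Stand-in module object for the demonstration (not the real duckdb module).
fake_module = types.SimpleNamespace(read_csv=lambda path: f"native:{path}")
undo = replace(fake_module, "read_csv", wasm_read)
remote_result = fake_module.read_csv("https://example.com/cars.csv")
local_result = fake_module.read_csv("cars.csv")
undo()
restored_result = fake_module.read_csv("https://example.com/cars.csv")
```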

Comment thread marimo/_runtime/_wasm/_duckdb/sources.py Outdated
Comment thread marimo/_runtime/_wasm/_duckdb/io.py
Comment thread marimo/_runtime/_wasm/_duckdb/__init__.py
@peter-gy peter-gy marked this pull request as draft May 8, 2026 14:13
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="marimo/_runtime/_wasm/_duckdb/io.py">

<violation number="1" location="marimo/_runtime/_wasm/_duckdb/io.py:198">
P2: `read_json_objects` is routed to `read_json_objects_auto`, which changes JSON-object reader semantics instead of preserving the requested function behavior.</violation>
</file>
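One way to avoid the routing problem flagged above is an exact-name lookup, so each requested function keeps its own reader. The sketch below is hypothetical: the registry, the placeholder readers, and `select_reader` are illustrative names, not code from `io.py`.

```python
from __future__ import annotations

from typing import Callable, Optional

Reader = Callable[[bytes], str]


def _placeholder(kind: str) -> Reader:
    # Stand-in for a real decoder; it only records which reader was chosen.
    return lambda data: f"{kind}:{len(data)} bytes"


# Each function name maps to its own reader, with no prefix or alias matching.
READERS: dict[str, Reader] = {
    "read_json_objects": _placeholder("json_objects"),
    "read_json_objects_auto": _placeholder("json_objects_auto"),
}


def select_reader(function_name: str) -> Optional[Reader]:
    """Return the reader registered for exactly this name, else None."""
    return READERS.get(function_name.lower())
```

Returning `None` for unknown names lets the caller leave the query to DuckDB's normal path instead of guessing a close match.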
Architecture diagram
sequenceDiagram
    participant User as User Code
    participant moSQL as mo.sql / SQL Cell
    participant DuckDB as DuckDB Module
    participant WasmPatch as WASM DuckDB Layer
    participant SQLGlot as sqlglot Parser
    participant Fetcher as Fetch Utility
    participant DataFrame as DataFrame Builder
    participant Pandas as Pandas DF
    participant MemTable as DuckDB Temp Table

    Note over User,MemTable: WASM DuckDB Remote File Query Flow

    User->>moSQL: SQL with remote URL (e.g., read_csv('https://...'))
    moSQL->>WasmPatch: try_run_duckdb_sql_with_wasm_patch()

    alt Non-WASM environment
        WasmPatch-->>moSQL: Return None (no-op)
        moSQL->>DuckDB: Normal SQL execution
        DuckDB-->>User: Query Result
    else WASM environment (Pyodide)
        WasmPatch->>SQLGlot: patch_duckdb_query_for_wasm()
        SQLGlot->>SQLGlot: Parse SQL AST
        alt SQL has remote URL references
            SQLGlot-->>WasmPatch: Extract URLs and table functions
            WasmPatch->>Fetcher: fetch_url_bytes(url)
            Fetcher->>Fetcher: urllib.request (via pyodide_http)
            Fetcher-->>WasmPatch: Raw bytes
            WasmPatch->>DataFrame: Read bytes to DataFrame
            DataFrame->>DataFrame: Determine format (CSV/Parquet/JSON)
            DataFrame->>Pandas: Create DataFrame from bytes
            Pandas-->>DataFrame: Pandas DataFrame
            DataFrame-->>WasmPatch: DataFrame with remote data
            WasmPatch->>WasmPatch: Generate replacement table name
            Note over WasmPatch: e.g., __marimo_wasm_duckdb_remote_0
            WasmPatch-->>moSQL: WasmDuckDBQueryPatch(query, tables)
            moSQL->>DuckDB: Register temp table with DataFrame
            moSQL->>DuckDB: Execute rewritten SQL (without URLs)
        else No remote URLs
            SQLGlot-->>WasmPatch: No remote sources found
            WasmPatch-->>moSQL: Return None
            moSQL->>DuckDB: Normal SQL execution
        end
        DuckDB->>MemTable: Replacement scan on temp DataFrame
        MemTable-->>DuckDB: Query via pandas
        DuckDB-->>User: Query Result
    end

    Note over User,MemTable: Direct DuckDB Reader API (patch_duckdb_for_wasm)

    User->>DuckDB: duckdb.read_csv('https://...')
    alt WASM + patched
        DuckDB->>WasmPatch: Patched wrapper intercepts
        WasmPatch->>Fetcher: fetch_url_bytes(url)
        Fetcher-->>WasmPatch: Raw bytes
        WasmPatch->>DataFrame: Build DataFrame from bytes
        WasmPatch-->>DuckDB: Return DuckDB relation from DataFrame
        DuckDB-->>User: DataFrame/Relation
    else Not WASM or not patched
        DuckDB->>DuckDB: Normal httpfs path (fails in WASM)
    end

    Note over User,MemTable: Key Boundaries

    alt WASM fetch uses pyodide_http
        Note over Fetcher: urllib → JS fetch bridge
    end

    alt sqlglot parsing fails or dynamic expressions
        Note over SQLGlot: Return None → fallback to DuckDB native
    end

    opt Error during fetch or decode
        WasmPatch-->>moSQL: Propagate exception
        moSQL-->>User: Error with original DuckDB message
    end
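
The branching in the diagram above can be condensed into a short control-flow sketch. The names `run_sql`, `run_native`, and `try_patch` are hypothetical, and the Pyodide check uses `sys.platform == "emscripten"`, which may differ from the check marimo actually performs.

```python
from __future__ import annotations

import sys
from typing import Any, Callable, Optional

Patch = Optional[tuple[str, dict[str, Any]]]


def is_pyodide() -> bool:
    # Pyodide reports "emscripten" as its platform.
    return sys.platform == "emscripten"


def run_sql(
    sql: str,
    run_native: Callable[[str, dict[str, Any]], Any],
    try_patch: Callable[[str], Patch],
    in_wasm: Optional[bool] = None,
) -> Any:
    """Run SQL, using the rewritten query only in WASM when a patch applies."""
    in_wasm = is_pyodide() if in_wasm is None else in_wasm
    if in_wasm:
        patched = try_patch(sql)  # None means no supported remote sources
        if patched is not None:
            rewritten, tables = patched
            return run_native(rewritten, tables)
    # Outside Pyodide, or when nothing was rewritten: the normal DuckDB path.
    return run_native(sql, {})
```

Exceptions raised while fetching or decoding are not caught here, which matches the diagram's note that such errors propagate to the caller.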


Comment thread marimo/_runtime/_wasm/_duckdb/io.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

frontend/src/core/wasm/worker/worker.ts:149

  • The dependency-loading heuristic now triggers on any occurrence of the substring "duckdb" in the notebook source. This can cause loadPackagesFromImports to pull in pandas/duckdb/sqlglot even when the user isn’t actually importing/using DuckDB (e.g., comments/strings/variable names), increasing startup time and bandwidth in WASM. Consider tightening detection (e.g., regex for ^\s*import\s+duckdb\b / ^\s*from\s+duckdb\b / \bduckdb\.) or relying on import discovery rather than raw substring matching.
    if (code.includes("mo.sql") || code.includes("duckdb")) {
      // Add pandas and duckdb to the code
      code = `import pandas\n${code}`;
      code = `import duckdb\n${code}`;
      code = `import sqlglot\n${code}`;

frontend/src/core/wasm/worker/bootstrap.ts:171

  • The code.includes("duckdb") heuristic is very broad and can cause WASM bootstrap to pre-load heavy deps (pandas/duckdb/sqlglot) on incidental mentions of “duckdb” (comments/strings), increasing load time. Consider switching to a more precise pattern (import statement detection / duckdb. usage) to avoid unnecessary package loads.
    if (code.includes("mo.sql") || code.includes("duckdb")) {
      // We need pandas and duckdb for mo.sql
      code = `import pandas\n${code}`;
      code = `import duckdb\n${code}`;
      code = `import sqlglot\n${code}`;

frontend/src/core/islands/worker/worker.tsx:93

  • Using code.includes("duckdb") to decide whether to pre-load pandas/duckdb/sqlglot is likely to over-trigger (e.g., “duckdb” in a comment/string), adding unnecessary package load time in WASM. Consider using a stricter detection strategy (import statement regex / duckdb. token) instead of a raw substring search.
    if (code.includes("mo.sql") || code.includes("duckdb")) {
      // Add pandas and duckdb to the code
      code = `import pandas\n${code}`;
      code = `import duckdb\n${code}`;
      code = `import sqlglot\n${code}`;
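
The stricter detection suggested in the three comments above can be prototyped as below. The workers themselves are TypeScript; this Python sketch only exercises the regular expressions quoted in the first comment, and `needs_duckdb_preload` is a hypothetical name.

```python
import re

# Patterns taken from the review comment: import statements or `duckdb.` usage.
_USES_DUCKDB = re.compile(
    r"^\s*import\s+duckdb\b|^\s*from\s+duckdb\b|\bduckdb\.",
    re.MULTILINE,
)


def needs_duckdb_preload(code: str) -> bool:
    """Decide whether pandas/duckdb/sqlglot should be preloaded."""
    # Keeps the existing `mo.sql` trigger and tightens the `duckdb` one.
    return "mo.sql" in code or _USES_DUCKDB.search(code) is not None
```

This still over-triggers when `duckdb.` appears inside a comment or string (for example a URL), so it narrows the heuristic rather than replacing real import discovery.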

Comment thread marimo/_runtime/_wasm/_duckdb/__init__.py
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comment thread marimo/_sql/utils.py
Comment thread marimo/_sql/utils.py
Comment thread marimo/_runtime/_wasm/_duckdb/__init__.py Outdated
Comment thread frontend/src/core/wasm/utils.ts
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comment thread frontend/src/core/wasm/utils.ts
Comment thread frontend/src/core/wasm/worker/bootstrap.ts
Comment thread frontend/src/core/wasm/worker/worker.ts
Comment thread frontend/src/core/islands/worker/worker.tsx
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

@peter-gy peter-gy requested review from dmadisetti and mscolnick May 12, 2026 16:00
@peter-gy peter-gy marked this pull request as ready for review May 12, 2026 16:00

Labels

enhancement New feature or request


2 participants