feat: support HTTP DuckDB queries in WASM notebooks #9480
Pull request overview
Adds a Pyodide/WASM-only DuckDB compatibility layer that rewrites supported remote URL scans into replacement scans backed by fetched pandas DataFrames, enabling mo.sql, SQL cells, and common DuckDB APIs to query https://... sources in WASM notebooks.
Changes:
- Implement DuckDB WASM patching: SQL AST rewrite (sqlglot) + remote fetch + bytes→DataFrame decoding + replacement scan execution.
- Add shared WASM URL fetch helper and integrate it into existing Polars WASM fallbacks.
- Add extensive unit/integration coverage and update WASM workers to preload DuckDB-related deps when DuckDB is used.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/_runtime/test_patches.py | Adds unit test ensuring shared WASM fetch helper forwards Request/urlopen kwargs correctly. |
| tests/_runtime/test_duckdb_wasm.py | New test suite covering DuckDB SQL rewrite parity, direct-reader patching, and mo.sql/kernel integration in Pyodide. |
| marimo/_sql/utils.py | Hooks wrapped_sql / execute_duckdb_sql into the WASM DuckDB SQL rewrite path so mo.sql can transparently handle remote URLs. |
| marimo/_runtime/_wasm/_polars.py | Switches Polars fallback URL fetching to shared WASM fetch utility. |
| marimo/_runtime/_wasm/_patches.py | Extends patch framework with replace() for wrapper-only (no “call original first”) patching. |
| marimo/_runtime/_wasm/_fetch.py | New shared synchronous urllib-based fetch utility for Pyodide fallbacks. |
| marimo/_runtime/_wasm/_duckdb/__init__.py | Core DuckDB WASM patch implementation (direct readers + SQL APIs + eval-based replacement scan execution). |
| marimo/_runtime/_wasm/_duckdb/sources.py | sqlglot AST helpers to detect supported remote sources and extract literal args/options. |
| marimo/_runtime/_wasm/_duckdb/io.py | URL/option validation, reader selection, fetching, and multi-file concat semantics for remote sources. |
| marimo/_runtime/_wasm/_duckdb/dataframe.py | Bytes→DataFrame decoding via temp files + implementations for text/blob-like readers. |
| marimo/_output/formatters/formatters.py | Registers DuckDB formatter factory so importing DuckDB triggers WASM patch installation. |
| marimo/_output/formatters/df_formatters.py | Adds DuckDBFormatter that installs the DuckDB WASM patch on DuckDB import. |
| frontend/src/core/wasm/worker/worker.ts | Expands WASM dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/wasm/worker/bootstrap.ts | Expands notebook dependency preloading heuristic to include DuckDB usage. |
| frontend/src/core/islands/worker/worker.tsx | Expands islands worker dependency preloading heuristic to include DuckDB usage. |
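The `_fetch.py` row above describes a synchronous urllib-based helper that forwards `Request` and `urlopen` keyword arguments. A minimal sketch of that shape follows. The name `fetch_url_bytes` appears in the architecture diagram, but the signature and the `request_kwargs` / `urlopen_kwargs` parameter names here are assumptions, not the PR's actual API.

```python
from __future__ import annotations

import urllib.request
from typing import Any


def fetch_url_bytes(
    url: str,
    *,
    request_kwargs: dict[str, Any] | None = None,  # hypothetical parameter name
    urlopen_kwargs: dict[str, Any] | None = None,  # hypothetical parameter name
) -> bytes:
    """Fetch a URL synchronously and return the raw response body."""
    # Extra Request options (e.g. headers) are forwarded unchanged.
    request = urllib.request.Request(url, **(request_kwargs or {}))
    # Extra urlopen options (e.g. timeout) are forwarded unchanged.
    with urllib.request.urlopen(request, **(urlopen_kwargs or {})) as response:
        return response.read()
```

In Pyodide, calls like this would presumably be routed through the `pyodide_http` bridge that the diagram mentions; outside Pyodide the function uses plain urllib.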
1 issue found across 15 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="marimo/_runtime/_wasm/_duckdb/io.py">
<violation number="1" location="marimo/_runtime/_wasm/_duckdb/io.py:198">
P2: `read_json_objects` is routed to `read_json_objects_auto`, which changes JSON-object reader semantics instead of preserving the requested function behavior.</violation>
</file>
Architecture diagram
sequenceDiagram
participant User as User Code
participant moSQL as mo.sql / SQL Cell
participant DuckDB as DuckDB Module
participant WasmPatch as WASM DuckDB Layer
participant SQLGlot as sqlglot Parser
participant Fetcher as Fetch Utility
participant DataFrame as DataFrame Builder
participant Pandas as Pandas DF
participant MemTable as DuckDB Temp Table
Note over User,MemTable: WASM DuckDB Remote File Query Flow
User->>moSQL: SQL with remote URL (e.g., read_csv('https://...'))
moSQL->>WasmPatch: try_run_duckdb_sql_with_wasm_patch()
alt Non-WASM environment
WasmPatch-->>moSQL: Return None (no-op)
moSQL->>DuckDB: Normal SQL execution
DuckDB-->>User: Query Result
else WASM environment (Pyodide)
WasmPatch->>SQLGlot: patch_duckdb_query_for_wasm()
SQLGlot->>SQLGlot: Parse SQL AST
alt SQL has remote URL references
SQLGlot-->>WasmPatch: Extract URLs and table functions
WasmPatch->>Fetcher: fetch_url_bytes(url)
Fetcher->>Fetcher: urllib.request (via pyodide_http)
Fetcher-->>WasmPatch: Raw bytes
WasmPatch->>DataFrame: Read bytes to DataFrame
DataFrame->>DataFrame: Determine format (CSV/Parquet/JSON)
DataFrame->>Pandas: Create DataFrame from bytes
Pandas-->>DataFrame: Pandas DataFrame
DataFrame-->>WasmPatch: DataFrame with remote data
WasmPatch->>WasmPatch: Generate replacement table name
Note over WasmPatch: e.g., __marimo_wasm_duckdb_remote_0
WasmPatch-->>moSQL: WasmDuckDBQueryPatch(query, tables)
moSQL->>DuckDB: Register temp table with DataFrame
moSQL->>DuckDB: Execute rewritten SQL (without URLs)
else No remote URLs
SQLGlot-->>WasmPatch: No remote sources found
WasmPatch-->>moSQL: Return None
moSQL->>DuckDB: Normal SQL execution
end
DuckDB->>MemTable: Replacement scan on temp DataFrame
MemTable-->>DuckDB: Query via pandas
DuckDB-->>User: Query Result
end
Note over User,MemTable: Direct DuckDB Reader API (patch_duckdb_for_wasm)
User->>DuckDB: duckdb.read_csv('https://...')
alt WASM + patched
DuckDB->>WasmPatch: Patched wrapper intercepts
WasmPatch->>Fetcher: fetch_url_bytes(url)
Fetcher-->>WasmPatch: Raw bytes
WasmPatch->>DataFrame: Build DataFrame from bytes
WasmPatch-->>DuckDB: Return DuckDB relation from DataFrame
DuckDB-->>User: DataFrame/Relation
else Not WASM or not patched
DuckDB->>DuckDB: Normal httpfs path (fails in WASM)
end
Note over User,MemTable: Key Boundaries
alt WASM fetch uses pyodide_http
Note over Fetcher: urllib → JS fetch bridge
end
alt sqlglot parsing fails or dynamic expressions
Note over SQLGlot: Return None → fallback to DuckDB native
end
opt Error during fetch or decode
WasmPatch-->>moSQL: Propagate exception
moSQL-->>User: Error with original DuckDB message
end
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
frontend/src/core/wasm/worker/worker.ts:149
- The dependency-loading heuristic now triggers on any occurrence of the substring `"duckdb"` in the notebook source. This can cause `loadPackagesFromImports` to pull in `pandas`/`duckdb`/`sqlglot` even when the user isn't actually importing or using DuckDB (e.g., in comments, strings, or variable names), increasing startup time and bandwidth in WASM. Consider tightening detection (e.g., a regex for `^\s*import\s+duckdb\b`, `^\s*from\s+duckdb\b`, or `\bduckdb\.`) or relying on import discovery rather than raw substring matching.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// Add pandas and duckdb to the code
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
frontend/src/core/wasm/worker/bootstrap.ts:171
- The `code.includes("duckdb")` heuristic is very broad and can cause WASM bootstrap to preload heavy deps (`pandas`/`duckdb`/`sqlglot`) on incidental mentions of "duckdb" (comments, strings), increasing load time. Consider switching to a more precise pattern (import statement detection or `duckdb.` usage) to avoid unnecessary package loads.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// We need pandas and duckdb for mo.sql
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
frontend/src/core/islands/worker/worker.tsx:93
- Using `code.includes("duckdb")` to decide whether to preload `pandas`/`duckdb`/`sqlglot` is likely to over-trigger (e.g., "duckdb" in a comment or string), adding unnecessary package load time in WASM. Consider using a stricter detection strategy (an import statement regex or a `duckdb.` token) instead of a raw substring search.
if (code.includes("mo.sql") || code.includes("duckdb")) {
// Add pandas and duckdb to the code
code = `import pandas\n${code}`;
code = `import duckdb\n${code}`;
code = `import sqlglot\n${code}`;
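The stricter detection suggested in these three comments could look roughly like the sketch below. It is written in Python purely for illustration (the workers themselves are TypeScript), it reuses the regex patterns quoted in the first comment, and the function name `needs_duckdb_preload` is hypothetical.

```python
import re

# Patterns from the review suggestion. MULTILINE makes ^ match at each line start.
_DUCKDB_USAGE = re.compile(
    r"^\s*import\s+duckdb\b|^\s*from\s+duckdb\b|\bduckdb\.",
    re.MULTILINE,
)


def needs_duckdb_preload(code: str) -> bool:
    """Return True if the notebook source appears to use mo.sql or DuckDB."""
    # The existing mo.sql substring check is kept; only the duckdb check is narrowed.
    return "mo.sql" in code or _DUCKDB_USAGE.search(code) is not None
```

This is still a heuristic: `\bduckdb\.` also matches inside comments and string literals, so it reduces false positives without eliminating them.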
Motivated by marimo-team/quarto-marimo#74, marimo-team/jupyter-book-marimo#1, and #9413.
DuckDB remote file queries fail in Pyodide because DuckDB-WASM can't use httpfs. Therefore, URL-based SQL like `FROM 'https://...'` and `read_csv/read_parquet/read_json('https://...')` is unusable in WASM notebooks today.

This PR adds a DuckDB WASM fallback layer for `mo.sql`, SQL cells, raw `duckdb.sql/query/execute/query_df`, connection SQL methods, and direct `duckdb.read_csv/read_parquet/read_json` calls.

It translates queries that scan a remote URL into queries against a generated table name such as `__marimo_wasm_duckdb_remote_0`, which is bound to a fetched pandas DataFrame that DuckDB can query through Python replacement scans.

Underneath, the fallback layer:

Unsupported or dynamic cases are left to DuckDB's normal path. The patch is a no-op outside Pyodide, and in Pyodide the DuckDB SQL fallback requires `sqlglot` for AST analysis.

Tested with unit coverage for the rewrite/fetch/read paths and manually in the WASM playground against hosted CSV, parquet, JSON, and GeoJSON datasets.
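To make the rewrite step concrete, here is a deliberately simplified sketch. The PR uses sqlglot AST analysis; this version substitutes a naive regular expression only to show the shape of the transformation. The names `rewrite_remote_scans` and `RewrittenQuery` are hypothetical, and only the table-name prefix comes from the description above.

```python
from __future__ import annotations

import re
from typing import NamedTuple

# Prefix mirrors the generated table name quoted in the PR description.
TABLE_PREFIX = "__marimo_wasm_duckdb_remote_"

# Naive matcher for a quoted http(s) URL directly after FROM or JOIN.
# It ignores comments, nested quotes, and table functions such as
# read_csv(...), all of which need a real SQL parser.
_REMOTE_SCAN = re.compile(r"\b(FROM|JOIN)\s+'(https?://[^']+)'", re.IGNORECASE)


class RewrittenQuery(NamedTuple):
    query: str  # SQL with URLs replaced by table names
    tables: dict[str, str]  # replacement table name -> URL to fetch


def rewrite_remote_scans(sql: str) -> RewrittenQuery | None:
    """Swap remote URL scans for generated table names, or return None."""
    tables: dict[str, str] = {}

    def _swap(match: re.Match[str]) -> str:
        name = f"{TABLE_PREFIX}{len(tables)}"
        tables[name] = match.group(2)
        return f"{match.group(1)} {name}"

    rewritten = _REMOTE_SCAN.sub(_swap, sql)
    # None signals "nothing to patch", so the caller runs the original SQL.
    return RewrittenQuery(rewritten, tables) if tables else None
```

A caller would fetch each URL in `tables`, decode it into a DataFrame, and expose it under the generated name before executing `query`. Returning `None` when nothing matches corresponds to the "left to DuckDB's normal path" behavior described above.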
WASM Playground Demo
Demonstrates that the Pyodide build of marimo supports querying remote files with DuckDB across cases like:
- `mo.sql`: direct URL scan
- `duckdb.sql`: `read_parquet(...)`
- `duckdb.read_csv` Python API with patched options (custom delimiter)
- `duckdb.connect`

wasm-demo.mp4