test: LDBC SNB benchmark suite for embedded mode by genezhang · Pull Request #249 · genezhang/clickgraph

genezhang · 2026-03-28T00:13:07Z

Summary

Self-contained LDBC SNB benchmark for embedded mode using sf0.003 Parquet data (1.4 MB, already in-repo).

Files

File	Purpose
`ldbc_snb_embedded.yaml`	Schema with `table_function:file()` Parquet sources for 14 nodes + 31 edges
`embedded_benchmark.py`	Full runner: 41 queries, timing, `--sql-only`/`--filter`/`--verbose` options
`embedded_benchmark_test.py`	Pytest suite: short-1..7 with sql_only mode (no chdb needed)

Usage

# SQL translation only (no chdb needed)
cd benchmarks/ldbc_snb
LD_LIBRARY_PATH=../../target/release PYTHONPATH=../../clickgraph-py \
    python3 embedded_benchmark.py --sql-only

# Full execution (requires chdb)
LD_LIBRARY_PATH=../../target/release PYTHONPATH=../../clickgraph-py \
    python3 embedded_benchmark.py

# Pytest (sql_only tests run without chdb)
CLICKGRAPH_CHDB_TESTS=1 pytest embedded_benchmark_test.py -v

No existing files modified.

🤖 Generated with Claude Code

Self-contained benchmark loading sf0.003 Parquet data into chdb and running all LDBC queries (short, complex, BI). Files: - ldbc_snb_embedded.yaml: Schema with table_function:file() sources for all 14 node types and 31 edge types, including FK-edge subquery views and polymorphic Message unions - embedded_benchmark.py: Full runner for 41 queries with timing, --sql-only mode, --filter, --verbose options - embedded_benchmark_test.py: Pytest suite for short-1..7 with sql_only parametrized tests (no chdb needed) and chdb e2e tests (gated by CLICKGRAPH_CHDB_TESTS=1) No existing files modified. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

genezhang

Code Review: LDBC SNB Benchmark Suite for Embedded Mode

Impressive and thorough. 676-line schema mapping the full LDBC SNB graph (14 node types, 31 edge types) to Parquet files via table_function:file() and table_function:(SELECT ... UNION ALL ...) for polymorphic Message views. The benchmark runner and pytest suite are well-structured.

What's good

Schema is comprehensive: Covers all LDBC entity types including polymorphic Message (Post + Comment UNION), FK-edge patterns (Person→City via LocationCityId), filtered label variants (City/Country/Continent from Place), and multi-from/to edge types (IS_LOCATED_IN for Person/Post/Comment/Organisation)
__DATA_DIR__ placeholder pattern: Clean separation between schema structure and data location — same approach as the tutorial examples
--sql-only mode: Enables CI testing without chdb — the pytest TestSqlOnly class validates all 7 short queries translate to SQL
Graceful expected failures: EXPECTED_FAILURES dict with reasons (bi-16 CALL subquery, etc.) — queries are skipped, not falsely reported as passing
xfail in pytest: Message-related queries (short-2, 4, 5, 6, 7) correctly use pytest.xfail since they depend on Message view support which may have issues
Benchmark runner is standalone: argparse with --filter, --verbose, --timeout — works as both a development tool and CI check
Timing integration: Uses result.compiling_time and result.execution_time from the new Kuzu parity API

Observations

1. Duplicated parameter substitution logic

substitute_params() in embedded_benchmark.py and _load_and_prepare() in embedded_benchmark_test.py have identical parameter substitution code (comment stripping + $param replacement). Consider extracting to a shared helper module or having the test import from the benchmark runner.

2. pytestmark overwrite

pytestmark = pytest.mark.skipif(...)
pytestmark = [pytestmark, pytest.mark.skipif(...)]

The second line references the first pytestmark correctly (it becomes a list), but this pattern is fragile — it depends on the first assignment being evaluated before the second. A clearer approach:

pytestmark = [
    pytest.mark.skipif(not CLICKGRAPH_AVAILABLE, reason="..."),
    pytest.mark.skipif(os.environ.get(...) not in (...), reason="..."),
]

3. TestSqlOnly class also gated behind CLICKGRAPH_CHDB_TESTS

The module-level pytestmark skips all tests when CLICKGRAPH_CHDB_TESTS is not set, including TestSqlOnly which explicitly doesn't need chdb. This means the sql_only tests won't run in CI unless the env var is set. Consider moving the chdb gate to TestInteractiveShort only, or overriding with pytestmark = [] on TestSqlOnly.

4. --timeout flag declared but not used

args.timeout is parsed but never applied — there's no signal.alarm() or Connection.set_query_timeout() call. The signal import at line 24 suggests this was planned. Either wire it up or remove the flag to avoid confusion.

5. Schema: duplicate table names for different from_node/to_node

Several IS_LOCATED_IN edges share the same table: Organisation_isLocatedIn_Place with different from_node/to_node pairs (Organisation→Place, Company→Country, University→City). This works because ClickGraph resolves by the composite key TYPE::From::To, but worth a comment in the schema noting this is intentional.

Verdict

LGTM with minor nits. The schema mapping is the star — it makes the full LDBC SNB benchmark runnable on Parquet files without any ClickHouse server. #3 (sql_only gating) is the most actionable issue since it affects CI usability.

1. Deduplicated param substitution: test imports substitute_params and load_query from embedded_benchmark.py instead of reimplementing 2. Cleaned up pytestmark: single skipif for clickgraph availability; chdb gate moved to TestInteractiveShort class only 3. TestSqlOnly now runs without CLICKGRAPH_CHDB_TESTS — enables CI testing of Cypher→SQL translation 4. Removed unused --timeout flag and signal import 5. Added comment about intentional duplicate table names for IS_LOCATED_IN edges (Organisation/Company/University variants) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

genezhang commented Mar 28, 2026

View reviewed changes

genezhang merged commit 96f5132 into main Mar 28, 2026
4 checks passed

genezhang deleted the test/ldbc-embedded-benchmark branch March 28, 2026 01:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: LDBC SNB benchmark suite for embedded mode#249

test: LDBC SNB benchmark suite for embedded mode#249
genezhang merged 2 commits intomainfrom
test/ldbc-embedded-benchmark

genezhang commented Mar 28, 2026

Uh oh!

genezhang left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

genezhang commented Mar 28, 2026

Summary

Files

Usage

Uh oh!

genezhang left a comment

Choose a reason for hiding this comment

Code Review: LDBC SNB Benchmark Suite for Embedded Mode

What's good

Observations

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant