43 commits
d1229a2
Update .gitignore for evaluation framework artifacts
Krishnachaitanyakc Feb 8, 2026
e36faa2
Add column-name-based alignment to result comparator
Krishnachaitanyakc Feb 8, 2026
5fc3c4b
Fix critical prompt bugs and enhance schema context
Krishnachaitanyakc Feb 8, 2026
e5ba8c5
Increase LLM max tokens from 1024 to 2048
Krishnachaitanyakc Feb 8, 2026
d0e5aed
Add self-correction loop for SQL generation errors
Krishnachaitanyakc Feb 8, 2026
9d971bc
Clean up benchmark queries for accurate evaluation
Krishnachaitanyakc Feb 8, 2026
28a4ac4
Add chain-of-thought reasoning module for SQL generation
Krishnachaitanyakc Feb 8, 2026
80d498a
Add re-evaluation script and integrate self-correction into Phase 2
Krishnachaitanyakc Feb 8, 2026
0bb7092
Add core evaluation framework for text-to-SQL experiments
Krishnachaitanyakc Feb 8, 2026
2e59116
Add benchmark schemas, examples, configs, and analysis tools
Krishnachaitanyakc Feb 8, 2026
717e9d5
Add subset column matching to result comparator
Krishnachaitanyakc Feb 8, 2026
9044653
Add column reorder tolerance and row-superset matching to comparator
Krishnachaitanyakc Feb 8, 2026
2e13d2c
Add LIMIT, JOIN, and window function guidance to prompts
Krishnachaitanyakc Feb 8, 2026
fa6c43b
Add execution-guided result refinement to self-corrector
Krishnachaitanyakc Feb 8, 2026
e9004b4
Add self-consistency voting module for SQL generation
Krishnachaitanyakc Feb 8, 2026
ee9fb87
Integrate refinement and self-consistency into Phase 2 pipeline
Krishnachaitanyakc Feb 8, 2026
5656286
Add fuzzy column name matching for result comparison
Krishnachaitanyakc Feb 8, 2026
3b9c71a
Disable execution-guided refinement (net negative impact)
Krishnachaitanyakc Feb 8, 2026
cfd55b3
Fix LIMIT guidance and improve re-evaluation script
Krishnachaitanyakc Feb 8, 2026
c058cff
Use SEMANTIC strategy consistently for large result sets
Krishnachaitanyakc Feb 8, 2026
46756ba
Fix gold SQL in aggregation benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
33341ba
Fix gold SQL in clickhouse_specific benchmark query to match question…
Krishnachaitanyakc Feb 8, 2026
903a5f3
Fix gold SQL in complex_joins benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
b8ac777
Fix gold SQL in time_series benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
4891738
Remove unjustified LIMIT clauses from few-shot examples
Krishnachaitanyakc Feb 8, 2026
586997a
Add percentage/rounding prompt guidance and relax numeric tolerance
Krishnachaitanyakc Feb 8, 2026
0bc6e7b
Remove unjustified LIMIT clauses from gold SQL benchmark queries
Krishnachaitanyakc Feb 8, 2026
11d334e
Add fuzzy column name matching to column reorder logic
Krishnachaitanyakc Feb 8, 2026
27a941c
Add ClickHouse function guidance and table relationship hints
Krishnachaitanyakc Feb 8, 2026
fd6d631
Fix WF-007 and WF-016 non-determinism with tiebreaker ORDER BY
Krishnachaitanyakc Feb 9, 2026
93270b8
Strengthen prompt guidance for column selection, JOINs, and window fu…
Krishnachaitanyakc Feb 9, 2026
5be1222
Add conservative execution-guided refinement v2
Krishnachaitanyakc Feb 9, 2026
600020b
Add few-shot examples for weak categories
Krishnachaitanyakc Feb 9, 2026
390a986
Add single-config runner and V6 improvement status
Krishnachaitanyakc Feb 9, 2026
fd1662d
Add percentage normalization and scalar matching to result comparator
Krishnachaitanyakc Feb 9, 2026
878964b
Integrate Chain-of-Thought generation into evaluation pipeline
Krishnachaitanyakc Feb 9, 2026
bd4d922
Update improvement status with V7 findings
Krishnachaitanyakc Feb 9, 2026
816e02b
Add prompt ablation, DAIL-SQL strategy, and model/dataset CLI flags
Krishnachaitanyakc Feb 9, 2026
ae4a8d1
Add ClickBench and SSB benchmarks (56 queries, 6 tables)
Krishnachaitanyakc Feb 9, 2026
483d096
Add repeated trials runner and CI support in publication outputs
Krishnachaitanyakc Feb 9, 2026
4172863
Add experiment orchestration and data loading scripts
Krishnachaitanyakc Feb 9, 2026
7a5d7dc
Add experiment results for ablation, cross-model, DAIL-SQL, and cross…
Krishnachaitanyakc Feb 9, 2026
03e390c
Rewrite paper with ablation-driven narrative and real experiment results
Krishnachaitanyakc Feb 9, 2026
23 changes: 22 additions & 1 deletion .gitignore
@@ -67,4 +67,25 @@ release/
.yarn-integrity
package-lock.json
yarn.lock
pnpm-lock.yaml

# Python
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/

# ClickHouse data
.clickhouse-data/

# Evaluation results (generated, large)
evaluation/results/phase1/
evaluation/results/phase2/
evaluation/results/phase2_original/
evaluation/results/figures/
evaluation/results/*.json

# Claude
.claude/
Binary file added DataPup - Research/AI_Collaboration_Prompt.docx
Binary file not shown.
145 changes: 145 additions & 0 deletions DataPup - Research/AI_Collaboration_Prompt.txt
@@ -0,0 +1,145 @@
AI COLLABORATION PROMPT
Schema-Aware Prompt Engineering Research Project
HOW TO USE THIS PROMPT
Copy the entire content below and paste it into a new conversation with Claude, GPT-4, or another capable AI assistant. This prompt provides context about the research project and specific tasks the AI can help with.

MASTER PROMPT - COPY BELOW THIS LINE
---BEGIN PROMPT---
You are assisting with a research project on Schema-Aware Prompt Engineering for Text-to-SQL in Analytical Databases. The goal is to publish a peer-reviewed paper at a top database conference (VLDB, CIDR, or similar venue).
PROJECT CONTEXT
Research Question: What is the optimal way to present database schema information in LLM prompts to maximize SQL generation accuracy for analytical (OLAP) databases like ClickHouse?
Key Research Dimensions:
• Schema Representation Format: CREATE TABLE vs Markdown vs JSON vs Natural Language
• Schema Scope: Full schema vs Relevant subset vs Progressive expansion
• Metadata Enrichment: Column descriptions, sample values, statistics, constraints
• Example Selection: Zero-shot vs Static few-shot vs Dynamic few-shot
Target Database: ClickHouse (columnar OLAP database)
ClickHouse has unique characteristics: specialized aggregate functions (argMax, groupArray), time-series functions (toStartOfMonth), array types, and specific SQL dialect variations.
TASKS YOU CAN HELP WITH
TASK 1: Generate Benchmark Queries
Create natural language queries and their corresponding ClickHouse SQL for the benchmark dataset. Each query should include:
• Natural language question (how a user would ask)
• Gold standard SQL query (correct ClickHouse syntax)
• Difficulty category (Simple, Aggregation, Window, Time-Series, JOIN, ClickHouse-Specific)
• Key challenge being tested
Example prompt: "Generate 10 Time-Series category queries for an e-commerce analytics database with tables: orders, customers, products. Include queries that require toStartOfMonth(), dateDiff(), and period-over-period comparisons."
TASK 2: Design Schema Representations
Help create the different schema representation formats for testing. Given a database schema, produce:
• CREATE TABLE format with ClickHouse-specific syntax
• Markdown table format with descriptions
• JSON schema format
• Natural language description
TASK 3: Write Experiment Code
Help write Python code for the evaluation framework:
• Prompt construction functions for each schema format (see the sketch below)
• LLM API calling wrappers (OpenAI, Anthropic)
• SQL execution and validation against ClickHouse
• Results logging and metrics calculation
• Statistical analysis scripts
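
A minimal sketch of what this harness could look like, assuming the anthropic and clickhouse-connect packages; the function names, model string, and host are illustrative assumptions, not project code:

import anthropic
import clickhouse_connect

def build_prompt(schema_text: str, question: str) -> str:
    # Combine schema context and the user's question into one prompt.
    return (
        "You are a ClickHouse SQL expert. Given this schema:\n\n"
        f"{schema_text}\n\n"
        f"Write a ClickHouse SQL query answering: {question}\n"
        "Return only the SQL."
    )

def generate_sql(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()

def executes_ok(sql: str) -> bool:
    # Execution Accuracy check: does the generated query run at all?
    ch = clickhouse_connect.get_client(host="localhost")
    try:
        ch.query(sql)
        return True
    except Exception:
        return False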
TASK 4: Analyze Results
After experiments are run, help with:
• Statistical significance testing
• Generating tables and visualizations for the paper
• Identifying patterns and insights
• Writing up findings for the paper
TASK 5: Paper Writing
Help draft and refine paper sections:
• Related work survey and positioning
• Methodology descriptions
• Results narrative and discussion
• Threats to validity section

SAMPLE DATABASE SCHEMA
Use this e-commerce schema for generating benchmark queries:
-- Orders table
CREATE TABLE orders (
order_id UInt64,
customer_id UInt64,
product_id UInt64,
quantity UInt32,
unit_price Decimal(10, 2),
total_amount Decimal(10, 2),
status Enum8('pending' = 1, 'processing' = 2, 'shipped' = 3, 'delivered' = 4, 'cancelled' = 5),
created_at DateTime64(3),
updated_at DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (created_at, order_id);

-- Customers table
CREATE TABLE customers (
customer_id UInt64,
email String,
name String,
country LowCardinality(String),
created_at DateTime64(3),
lifetime_value Decimal(12, 2)
) ENGINE = MergeTree()
ORDER BY customer_id;

-- Products table
CREATE TABLE products (
product_id UInt64,
name String,
category LowCardinality(String),
subcategory String,
price Decimal(10, 2),
inventory_count UInt32,
tags Array(String)
) ENGINE = MergeTree()
ORDER BY product_id;

-- Page events table (for analytics)
CREATE TABLE page_events (
event_id UUID,
session_id String,
customer_id Nullable(UInt64),
event_type LowCardinality(String),
page_url String,
referrer String,
device_type LowCardinality(String),
country LowCardinality(String),
timestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (timestamp, event_id);
CLICKHOUSE-SPECIFIC FUNCTIONS TO COVER
Ensure benchmark queries exercise these ClickHouse-specific features:
Function/Feature     Use Case
argMax(col, val)     Get the column value at the max of another column
groupArray()         Aggregate values into an array
toStartOfMonth()     Truncate a datetime to the start of its month
quantile(0.95)()     Calculate percentiles
arrayJoin()          Expand an array into rows
WITH clause          CTEs for complex queries
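
To make two of these concrete, a hedged sketch of how they run against the sample orders table via clickhouse-connect (the local client setup is an assumption):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # assumed local server
# argMax: the status of each customer's most recent order
latest = client.query(
    "SELECT customer_id, argMax(status, created_at) AS latest_status "
    "FROM orders GROUP BY customer_id"
)
# groupArray: every product_id a customer has ordered, collected into an array
baskets = client.query(
    "SELECT customer_id, groupArray(product_id) AS products "
    "FROM orders GROUP BY customer_id"
)
print(latest.result_rows[:5])
print(baskets.result_rows[:5])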

OUTPUT FORMAT
When generating benchmark queries, use this JSON format:
{
  "id": "TS-001",
  "category": "Time-Series",
  "difficulty": "medium",
  "natural_language": "What was the total revenue for each month in 2024?",
  "sql": "SELECT toStartOfMonth(created_at) AS month, sum(total_amount) AS revenue FROM orders WHERE toYear(created_at) = 2024 GROUP BY month ORDER BY month",
  "challenge": "Date truncation function, year extraction",
  "tables_used": ["orders"],
  "clickhouse_features": ["toStartOfMonth", "toYear"]
}
---END PROMPT---

QUICK REFERENCE: EXAMPLE PROMPTS
For generating queries:
"Generate 15 Aggregation category queries that test GROUP BY with multiple columns, HAVING clauses, and ClickHouse aggregate functions like argMax and groupArray."
For schema representation:
"Convert the orders table schema into all four representation formats: CREATE TABLE, Markdown with descriptions, JSON schema, and natural language paragraph."
For experiment code:
"Write a Python function that takes a ClickHouse schema and returns it in Markdown format with column descriptions extracted from comments."
For analysis:
"Given this results CSV, perform statistical significance testing (McNemar's test) comparing CREATE TABLE vs Markdown format accuracy and generate a LaTeX table for the paper."
For paper writing:
"Write a Related Work section covering DAIL-SQL, DIN-SQL, Spider benchmark, and BIRD benchmark, positioning our contribution as the first systematic study for OLAP databases."
Binary file not shown.
137 changes: 137 additions & 0 deletions DataPup - Research/Abstract_and_Framework_Overview.txt
@@ -0,0 +1,137 @@
RESEARCH PROJECT
Abstract & Framework Document
Schema-Aware Prompt Engineering
for Text-to-SQL in
Analytical Databases
A Systematic Evaluation Study
Sahith Vibudhi, Krishna Chaitanya Balusu
Independent Researchers
San Francisco, California
TARGET VENUES
CIDR 2027 • VLDB 2026 Industrial/Workshop • SIGMOD 2027

ABSTRACT
Large Language Models (LLMs) have emerged as a promising approach for Text-to-SQL tasks, enabling natural language interfaces to databases. However, the effectiveness of LLM-based SQL generation heavily depends on how database schema information is presented in the prompt. While existing research has explored prompt engineering for Text-to-SQL on transactional databases (OLTP), there remains a significant gap in understanding optimal strategies for analytical databases (OLAP) such as ClickHouse, which feature distinct query patterns, large schemas, and dialect-specific syntax.
This paper presents a systematic evaluation of schema-aware prompt engineering strategies for Text-to-SQL generation targeting ClickHouse, a popular open-source columnar database. We investigate four key dimensions: (1) schema representation formats, (2) schema scope strategies, (3) metadata enrichment, and (4) example selection methods.
Through experiments on a novel ClickHouse-specific benchmark comprising 150 natural language queries across six complexity categories, we evaluate multiple LLMs and provide actionable guidelines for building AI-assisted database clients. We release our benchmark and evaluation framework as open-source artifacts.

Keywords: Text-to-SQL, Large Language Models, Prompt Engineering, Schema Linking, ClickHouse, OLAP, Database Interfaces, Benchmark

RESEARCH QUESTIONS
This study addresses the following research questions:
RQ1: Which schema representation format (CREATE TABLE, Markdown, JSON, Natural Language) yields the highest SQL generation accuracy for ClickHouse queries?
RQ2: How does schema scope strategy (full vs. relevant subset vs. progressive) affect performance on databases with large schemas (100+ columns)?
RQ3: What types of metadata enrichment (column descriptions, sample values, statistics) most improve SQL generation accuracy?
RQ4: How do example selection methods (zero-shot, static few-shot, dynamic few-shot) compare across different query complexity levels?

EXPERIMENTAL FRAMEWORK
Independent Variables
1. Schema Representation Format
Format               Description
CREATE TABLE         Standard SQL DDL with ClickHouse engine syntax
Markdown             Tabular format with columns for name, type, description
JSON Schema          Structured JSON with explicit field semantics
Natural Language     Prose descriptions of tables and relationships

2. Schema Scope Strategy
• Full Schema: Include all tables and columns
• Relevant Subset: Pre-filter to likely-needed tables based on query keywords (sketched below)
• Progressive: Start minimal, expand if query fails
• User-Guided: User specifies relevant tables
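
A naive keyword pre-filter for the Relevant Subset strategy might look like this; the overlap scoring rule is purely illustrative:

def relevant_tables(question: str, schemas: dict[str, list[str]]) -> list[str]:
    # schemas maps table name -> list of its column names.
    words = set(question.lower().split())
    def score(table: str, columns: list[str]) -> int:
        # Count question words that overlap the table or column names.
        names = {table.lower()} | {c.lower() for c in columns}
        return sum(1 for n in names if any(w in n or n in w for w in words))
    ranked = sorted(schemas, key=lambda t: score(t, schemas[t]), reverse=True)
    return [t for t in ranked if score(t, schemas[t]) > 0] or ranked[:1]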
3. Metadata Enrichment
• Column descriptions (human-written semantics)
• Sample values (e.g., status: ['pending', 'completed']) (see the sketch below)
• Statistics (row counts, cardinality)
• Constraints (primary keys, foreign keys)
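
Most of these signals can be pulled straight from ClickHouse itself; a sketch against the sample schema (the client setup is an assumption):

import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
# Sample values for a low-cardinality column
samples = ch.query("SELECT DISTINCT status FROM orders LIMIT 5").result_rows
# Cheap statistics: approximate cardinality and row count
uniq_customers, row_count = ch.query(
    "SELECT uniq(customer_id), count() FROM orders"
).result_rows[0]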
4. Example Selection Method
• Zero-shot: No examples provided
• Static few-shot: Same 3-5 examples for all queries
• Dynamic few-shot: Examples selected by query similarity (sketched below)
• Schema-matched: Examples using same tables as query
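
Dynamic few-shot selection can be as simple as nearest neighbors in TF-IDF space; a sketch assuming scikit-learn and a pool of (question, gold SQL) pairs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_examples(query: str, pool: list[tuple[str, str]], k: int = 3):
    # pool holds (natural_language, gold_sql) pairs.
    questions = [q for q, _ in pool]
    vec = TfidfVectorizer().fit(questions + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(questions))[0]
    best = sims.argsort()[::-1][:k]  # indices of the k most similar questions
    return [pool[i] for i in best]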

Dependent Variables (Metrics)
Metric                     Description
Execution Accuracy (EX)    % of queries that execute without syntax errors
Result Correctness (RC)    % producing correct output (exact or semantic match)
Schema Linking (SL)        Correct identification of tables and columns
Token Efficiency (TE)      Prompt tokens required per query
Latency (L)                End-to-end time from query to result
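
Given per-query outcomes, EX and RC reduce to simple ratios; a sketch with assumed record fields:

def summarize(results: list[dict]) -> dict:
    # Each result record is assumed to carry 'executed' and 'correct' booleans.
    n = len(results)
    return {
        "EX": sum(r["executed"] for r in results) / n,  # Execution Accuracy
        "RC": sum(r["correct"] for r in results) / n,   # Result Correctness
    }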

Benchmark Dataset
Category              Count   Challenge Focus
Simple SELECT         25      Basic filtering, column selection
Aggregation           30      GROUP BY, HAVING, aggregate functions
Window Functions      25      Running totals, rankings, partitions
Time-Series           30      Date functions, period comparisons
Complex JOINs         20      Multi-table reasoning, subqueries
ClickHouse-Specific   20      argMax, arrays, dialect syntax
TOTAL                 150


PROJECT TIMELINE
Phase                        Deliverables
Weeks 1-2: Dataset Creation  150 NL-SQL pairs across 6 categories, validated against ClickHouse
Week 3: Infrastructure       Experiment harness, LLM API wrappers, result logging
Week 4: Experiments          Run all prompt strategy combinations across models
Week 5: Analysis             Statistical analysis, tables, figures, insights
Weeks 6-7: Writing           Complete paper draft, internal review, polish
Week 8: Submission           Final formatting, submission to target venue

EXPECTED CONTRIBUTIONS
• First systematic study of prompt engineering for OLAP Text-to-SQL
• Novel ClickHouse-specific benchmark (150 queries, 6 categories)
• Empirical comparison of schema representation strategies
• Actionable guidelines for AI-assisted database client developers
• Open-source benchmark and evaluation framework
Contact
Sahith Vibudhi
Email: v.sahithkumar@gmail.com
GitHub: github.com/sahithvibudhi
LinkedIn: linkedin.com/in/v-sahith
Krishna Chaitanya Balusu
Email: krishnabkc15@gmail.com
GitHub: github.com/Krishnachaitanyakc
LinkedIn: linkedin.com/in/kcbalusu/