43 commits
d1229a2
Update .gitignore for evaluation framework artifacts
Krishnachaitanyakc Feb 8, 2026
e36faa2
Add column-name-based alignment to result comparator
Krishnachaitanyakc Feb 8, 2026
5fc3c4b
Fix critical prompt bugs and enhance schema context
Krishnachaitanyakc Feb 8, 2026
e5ba8c5
Increase LLM max tokens from 1024 to 2048
Krishnachaitanyakc Feb 8, 2026
d0e5aed
Add self-correction loop for SQL generation errors
Krishnachaitanyakc Feb 8, 2026
9d971bc
Clean up benchmark queries for accurate evaluation
Krishnachaitanyakc Feb 8, 2026
28a4ac4
Add chain-of-thought reasoning module for SQL generation
Krishnachaitanyakc Feb 8, 2026
80d498a
Add re-evaluation script and integrate self-correction into Phase 2
Krishnachaitanyakc Feb 8, 2026
0bb7092
Add core evaluation framework for text-to-SQL experiments
Krishnachaitanyakc Feb 8, 2026
2e59116
Add benchmark schemas, examples, configs, and analysis tools
Krishnachaitanyakc Feb 8, 2026
717e9d5
Add subset column matching to result comparator
Krishnachaitanyakc Feb 8, 2026
9044653
Add column reorder tolerance and row-superset matching to comparator
Krishnachaitanyakc Feb 8, 2026
2e13d2c
Add LIMIT, JOIN, and window function guidance to prompts
Krishnachaitanyakc Feb 8, 2026
fa6c43b
Add execution-guided result refinement to self-corrector
Krishnachaitanyakc Feb 8, 2026
e9004b4
Add self-consistency voting module for SQL generation
Krishnachaitanyakc Feb 8, 2026
ee9fb87
Integrate refinement and self-consistency into Phase 2 pipeline
Krishnachaitanyakc Feb 8, 2026
5656286
Add fuzzy column name matching for result comparison
Krishnachaitanyakc Feb 8, 2026
3b9c71a
Disable execution-guided refinement (net negative impact)
Krishnachaitanyakc Feb 8, 2026
cfd55b3
Fix LIMIT guidance and improve re-evaluation script
Krishnachaitanyakc Feb 8, 2026
c058cff
Use SEMANTIC strategy consistently for large result sets
Krishnachaitanyakc Feb 8, 2026
46756ba
Fix gold SQL in aggregation benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
33341ba
Fix gold SQL in clickhouse_specific benchmark query to match question…
Krishnachaitanyakc Feb 8, 2026
903a5f3
Fix gold SQL in complex_joins benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
b8ac777
Fix gold SQL in time_series benchmark queries to match question scope
Krishnachaitanyakc Feb 8, 2026
4891738
Remove unjustified LIMIT clauses from few-shot examples
Krishnachaitanyakc Feb 8, 2026
586997a
Add percentage/rounding prompt guidance and relax numeric tolerance
Krishnachaitanyakc Feb 8, 2026
0bc6e7b
Remove unjustified LIMIT clauses from gold SQL benchmark queries
Krishnachaitanyakc Feb 8, 2026
11d334e
Add fuzzy column name matching to column reorder logic
Krishnachaitanyakc Feb 8, 2026
27a941c
Add ClickHouse function guidance and table relationship hints
Krishnachaitanyakc Feb 8, 2026
fd6d631
Fix WF-007 and WF-016 non-determinism with tiebreaker ORDER BY
Krishnachaitanyakc Feb 9, 2026
93270b8
Strengthen prompt guidance for column selection, JOINs, and window fu…
Krishnachaitanyakc Feb 9, 2026
5be1222
Add conservative execution-guided refinement v2
Krishnachaitanyakc Feb 9, 2026
600020b
Add few-shot examples for weak categories
Krishnachaitanyakc Feb 9, 2026
390a986
Add single-config runner and V6 improvement status
Krishnachaitanyakc Feb 9, 2026
fd1662d
Add percentage normalization and scalar matching to result comparator
Krishnachaitanyakc Feb 9, 2026
878964b
Integrate Chain-of-Thought generation into evaluation pipeline
Krishnachaitanyakc Feb 9, 2026
bd4d922
Update improvement status with V7 findings
Krishnachaitanyakc Feb 9, 2026
816e02b
Add prompt ablation, DAIL-SQL strategy, and model/dataset CLI flags
Krishnachaitanyakc Feb 9, 2026
ae4a8d1
Add ClickBench and SSB benchmarks (56 queries, 6 tables)
Krishnachaitanyakc Feb 9, 2026
483d096
Add repeated trials runner and CI support in publication outputs
Krishnachaitanyakc Feb 9, 2026
4172863
Add experiment orchestration and data loading scripts
Krishnachaitanyakc Feb 9, 2026
7a5d7dc
Add experiment results for ablation, cross-model, DAIL-SQL, and cross…
Krishnachaitanyakc Feb 9, 2026
03e390c
Rewrite paper with ablation-driven narrative and real experiment results
Krishnachaitanyakc Feb 9, 2026
23 changes: 22 additions & 1 deletion .gitignore
@@ -67,4 +67,25 @@ release/
.yarn-integrity
package-lock.json
yarn.lock
pnpm-lock.yaml

# Python
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/

# ClickHouse data
.clickhouse-data/

# Evaluation results (generated, large)
evaluation/results/phase1/
evaluation/results/phase2/
evaluation/results/phase2_original/
evaluation/results/figures/
evaluation/results/*.json

# Claude
.claude/
Binary file added DataPup - Research/AI_Collaboration_Prompt.docx
Binary file not shown.
145 changes: 145 additions & 0 deletions DataPup - Research/AI_Collaboration_Prompt.txt
@@ -0,0 +1,145 @@
AI COLLABORATION PROMPT
Schema-Aware Prompt Engineering Research Project
HOW TO USE THIS PROMPT
Copy the entire content below and paste it into a new conversation with Claude, GPT-4, or another capable AI assistant. This prompt provides context about the research project and specific tasks the AI can help with.

MASTER PROMPT - COPY BELOW THIS LINE
---BEGIN PROMPT---
You are assisting with a research project on Schema-Aware Prompt Engineering for Text-to-SQL in Analytical Databases. The goal is to publish a peer-reviewed paper at a top database conference (VLDB, CIDR, or similar venue).
PROJECT CONTEXT
Research Question: What is the optimal way to present database schema information in LLM prompts to maximize SQL generation accuracy for analytical (OLAP) databases like ClickHouse?
Key Research Dimensions:
• Schema Representation Format: CREATE TABLE vs Markdown vs JSON vs Natural Language
• Schema Scope: Full schema vs Relevant subset vs Progressive expansion
• Metadata Enrichment: Column descriptions, sample values, statistics, constraints
• Example Selection: Zero-shot vs Static few-shot vs Dynamic few-shot
Target Database: ClickHouse (columnar OLAP database)
ClickHouse has unique characteristics: specialized aggregate functions (argMax, groupArray), time-series functions (toStartOfMonth), array types, and specific SQL dialect variations.
TASKS YOU CAN HELP WITH
TASK 1: Generate Benchmark Queries
Create natural language queries and their corresponding ClickHouse SQL for the benchmark dataset. Each query should include:
• Natural language question (how a user would ask)
• Gold standard SQL query (correct ClickHouse syntax)
• Difficulty category (Simple, Aggregation, Window, Time-Series, JOIN, ClickHouse-Specific)
• Key challenge being tested
Example prompt: "Generate 10 Time-Series category queries for an e-commerce analytics database with tables: orders, customers, products. Include queries that require toStartOfMonth(), dateDiff(), and period-over-period comparisons."
TASK 2: Design Schema Representations
Help create the different schema representation formats for testing. Given a database schema, produce:
• CREATE TABLE format with ClickHouse-specific syntax
• Markdown table format with descriptions
• JSON schema format
• Natural language description
TASK 3: Write Experiment Code
Help write Python code for the evaluation framework:
• Prompt construction functions for each schema format (see the sketch below)
• LLM API calling wrappers (OpenAI, Anthropic)
• SQL execution and validation against ClickHouse
• Results logging and metrics calculation
• Statistical analysis scripts
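
A minimal sketch of what this harness could look like, assuming the anthropic and clickhouse-connect packages; the function names, model string, and host are illustrative assumptions, not project code:

import anthropic
import clickhouse_connect

def build_prompt(schema_text: str, question: str) -> str:
    # Combine schema context and the user's question into one prompt.
    return (
        "You are a ClickHouse SQL expert. Given this schema:\n\n"
        f"{schema_text}\n\n"
        f"Write a ClickHouse SQL query answering: {question}\n"
        "Return only the SQL."
    )

def generate_sql(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip()

def executes_ok(sql: str) -> bool:
    # Execution Accuracy check: does the generated query run at all?
    ch = clickhouse_connect.get_client(host="localhost")
    try:
        ch.query(sql)
        return True
    except Exception:
        return False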
TASK 4: Analyze Results
After experiments are run, help with:
• Statistical significance testing
• Generating tables and visualizations for the paper
• Identifying patterns and insights
• Writing up findings for the paper
TASK 5: Paper Writing
Help draft and refine paper sections:
• Related work survey and positioning
• Methodology descriptions
• Results narrative and discussion
• Threats to validity section

SAMPLE DATABASE SCHEMA
Use this e-commerce schema for generating benchmark queries:
-- Orders table
CREATE TABLE orders (
order_id UInt64,
customer_id UInt64,
product_id UInt64,
quantity UInt32,
unit_price Decimal(10, 2),
total_amount Decimal(10, 2),
status Enum8('pending' = 1, 'processing' = 2, 'shipped' = 3, 'delivered' = 4, 'cancelled' = 5),
created_at DateTime64(3),
updated_at DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (created_at, order_id);

-- Customers table
CREATE TABLE customers (
customer_id UInt64,
email String,
name String,
country LowCardinality(String),
created_at DateTime64(3),
lifetime_value Decimal(12, 2)
) ENGINE = MergeTree()
ORDER BY customer_id;

-- Products table
CREATE TABLE products (
product_id UInt64,
name String,
category LowCardinality(String),
subcategory String,
price Decimal(10, 2),
inventory_count UInt32,
tags Array(String)
) ENGINE = MergeTree()
ORDER BY product_id;

-- Page events table (for analytics)
CREATE TABLE page_events (
event_id UUID,
session_id String,
customer_id Nullable(UInt64),
event_type LowCardinality(String),
page_url String,
referrer String,
device_type LowCardinality(String),
country LowCardinality(String),
timestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (timestamp, event_id);
CLICKHOUSE-SPECIFIC FUNCTIONS TO COVER
Ensure benchmark queries exercise these ClickHouse-specific features:
Function/Feature     Use Case
argMax(col, val)     Get the column value at the max of another column
groupArray()         Aggregate values into an array
toStartOfMonth()     Truncate a datetime to the start of its month
quantile(0.95)()     Calculate percentiles
arrayJoin()          Expand an array into rows
WITH clause          CTEs for complex queries
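
To make two of these concrete, a hedged sketch of how they run against the sample orders table via clickhouse-connect (the local client setup is an assumption):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # assumed local server
# argMax: the status of each customer's most recent order
latest = client.query(
    "SELECT customer_id, argMax(status, created_at) AS latest_status "
    "FROM orders GROUP BY customer_id"
)
# groupArray: every product_id a customer has ordered, collected into an array
baskets = client.query(
    "SELECT customer_id, groupArray(product_id) AS products "
    "FROM orders GROUP BY customer_id"
)
print(latest.result_rows[:5])
print(baskets.result_rows[:5])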

OUTPUT FORMAT
When generating benchmark queries, use this JSON format:
{
  "id": "TS-001",
  "category": "Time-Series",
  "difficulty": "medium",
  "natural_language": "What was the total revenue for each month in 2024?",
  "sql": "SELECT toStartOfMonth(created_at) AS month, sum(total_amount) AS revenue FROM orders WHERE toYear(created_at) = 2024 GROUP BY month ORDER BY month",
  "challenge": "Date truncation function, year extraction",
  "tables_used": ["orders"],
  "clickhouse_features": ["toStartOfMonth", "toYear"]
}
---END PROMPT---

QUICK REFERENCE: EXAMPLE PROMPTS
For generating queries:
"Generate 15 Aggregation category queries that test GROUP BY with multiple columns, HAVING clauses, and ClickHouse aggregate functions like argMax and groupArray."
For schema representation:
"Convert the orders table schema into all four representation formats: CREATE TABLE, Markdown with descriptions, JSON schema, and natural language paragraph."
For experiment code:
"Write a Python function that takes a ClickHouse schema and returns it in Markdown format with column descriptions extracted from comments."
For analysis:
"Given this results CSV, perform statistical significance testing (McNemar's test) comparing CREATE TABLE vs Markdown format accuracy and generate a LaTeX table for the paper."
For paper writing:
"Write a Related Work section covering DAIL-SQL, DIN-SQL, Spider benchmark, and BIRD benchmark, positioning our contribution as the first systematic study for OLAP databases."
Binary file not shown.
137 changes: 137 additions & 0 deletions DataPup - Research/Abstract_and_Framework_Overview.txt
@@ -0,0 +1,137 @@
RESEARCH PROJECT
Abstract & Framework Document
Schema-Aware Prompt Engineering
for Text-to-SQL in
Analytical Databases
A Systematic Evaluation Study
Sahith Vibudhi, Krishna Chaitanya Balusu
Independent Researchers
San Francisco, California
TARGET VENUES
CIDR 2027 • VLDB 2026 Industrial/Workshop • SIGMOD 2027

ABSTRACT
Large Language Models (LLMs) have emerged as a promising approach for Text-to-SQL tasks, enabling natural language interfaces to databases. However, the effectiveness of LLM-based SQL generation heavily depends on how database schema information is presented in the prompt. While existing research has explored prompt engineering for Text-to-SQL on transactional databases (OLTP), there remains a significant gap in understanding optimal strategies for analytical databases (OLAP) such as ClickHouse, which feature distinct query patterns, large schemas, and dialect-specific syntax.
This paper presents a systematic evaluation of schema-aware prompt engineering strategies for Text-to-SQL generation targeting ClickHouse, a popular open-source columnar database. We investigate four key dimensions: (1) schema representation formats, (2) schema scope strategies, (3) metadata enrichment, and (4) example selection methods.
Through experiments on a novel ClickHouse-specific benchmark comprising 150 natural language queries across six complexity categories, we evaluate multiple LLMs and provide actionable guidelines for building AI-assisted database clients. We release our benchmark and evaluation framework as open-source artifacts.

Keywords: Text-to-SQL, Large Language Models, Prompt Engineering, Schema Linking, ClickHouse, OLAP, Database Interfaces, Benchmark

RESEARCH QUESTIONS
This study addresses the following research questions:
RQ1: Which schema representation format (CREATE TABLE, Markdown, JSON, Natural Language) yields the highest SQL generation accuracy for ClickHouse queries?
RQ2: How does schema scope strategy (full vs. relevant subset vs. progressive) affect performance on databases with large schemas (100+ columns)?
RQ3: What types of metadata enrichment (column descriptions, sample values, statistics) most improve SQL generation accuracy?
RQ4: How do example selection methods (zero-shot, static few-shot, dynamic few-shot) compare across different query complexity levels?

EXPERIMENTAL FRAMEWORK
Independent Variables
1. Schema Representation Format
Format               Description
CREATE TABLE         Standard SQL DDL with ClickHouse engine syntax
Markdown             Tabular format with columns for name, type, description
JSON Schema          Structured JSON with explicit field semantics
Natural Language     Prose descriptions of tables and relationships

2. Schema Scope Strategy
• Full Schema: Include all tables and columns
• Relevant Subset: Pre-filter to likely-needed tables based on query keywords (sketched below)
• Progressive: Start minimal, expand if query fails
• User-Guided: User specifies relevant tables
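
A naive keyword pre-filter for the Relevant Subset strategy might look like this; the overlap scoring rule is purely illustrative:

def relevant_tables(question: str, schemas: dict[str, list[str]]) -> list[str]:
    # schemas maps table name -> list of its column names.
    words = set(question.lower().split())
    def score(table: str, columns: list[str]) -> int:
        # Count question words that overlap the table or column names.
        names = {table.lower()} | {c.lower() for c in columns}
        return sum(1 for n in names if any(w in n or n in w for w in words))
    ranked = sorted(schemas, key=lambda t: score(t, schemas[t]), reverse=True)
    return [t for t in ranked if score(t, schemas[t]) > 0] or ranked[:1]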
3. Metadata Enrichment
• Column descriptions (human-written semantics)
• Sample values (e.g., status: ['pending', 'completed']) (see the sketch below)
• Statistics (row counts, cardinality)
• Constraints (primary keys, foreign keys)
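
Most of these signals can be pulled straight from ClickHouse itself; a sketch against the sample schema (the client setup is an assumption):

import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
# Sample values for a low-cardinality column
samples = ch.query("SELECT DISTINCT status FROM orders LIMIT 5").result_rows
# Cheap statistics: approximate cardinality and row count
uniq_customers, row_count = ch.query(
    "SELECT uniq(customer_id), count() FROM orders"
).result_rows[0]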
4. Example Selection Method
• Zero-shot: No examples provided
• Static few-shot: Same 3-5 examples for all queries
• Dynamic few-shot: Examples selected by query similarity (sketched below)
• Schema-matched: Examples using same tables as query
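
Dynamic few-shot selection can be as simple as nearest neighbors in TF-IDF space; a sketch assuming scikit-learn and a pool of (question, gold SQL) pairs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_examples(query: str, pool: list[tuple[str, str]], k: int = 3):
    # pool holds (natural_language, gold_sql) pairs.
    questions = [q for q, _ in pool]
    vec = TfidfVectorizer().fit(questions + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(questions))[0]
    best = sims.argsort()[::-1][:k]  # indices of the k most similar questions
    return [pool[i] for i in best]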

Dependent Variables (Metrics)
Metric                     Description
Execution Accuracy (EX)    % of queries that execute without syntax errors
Result Correctness (RC)    % producing correct output (exact or semantic match)
Schema Linking (SL)        Correct identification of tables and columns
Token Efficiency (TE)      Prompt tokens required per query
Latency (L)                End-to-end time from query to result
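
Given per-query outcomes, EX and RC reduce to simple ratios; a sketch with assumed record fields:

def summarize(results: list[dict]) -> dict:
    # Each result record is assumed to carry 'executed' and 'correct' booleans.
    n = len(results)
    return {
        "EX": sum(r["executed"] for r in results) / n,  # Execution Accuracy
        "RC": sum(r["correct"] for r in results) / n,   # Result Correctness
    }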

Benchmark Dataset
Category              Count   Challenge Focus
Simple SELECT         25      Basic filtering, column selection
Aggregation           30      GROUP BY, HAVING, aggregate functions
Window Functions      25      Running totals, rankings, partitions
Time-Series           30      Date functions, period comparisons
Complex JOINs         20      Multi-table reasoning, subqueries
ClickHouse-Specific   20      argMax, arrays, dialect syntax
TOTAL                 150


PROJECT TIMELINE
Phase                        Deliverables
Weeks 1-2: Dataset Creation  150 NL-SQL pairs across 6 categories, validated against ClickHouse
Week 3: Infrastructure       Experiment harness, LLM API wrappers, result logging
Week 4: Experiments          Run all prompt strategy combinations across models
Week 5: Analysis             Statistical analysis, tables, figures, insights
Weeks 6-7: Writing           Complete paper draft, internal review, polish
Week 8: Submission           Final formatting, submission to target venue

EXPECTED CONTRIBUTIONS
• First systematic study of prompt engineering for OLAP Text-to-SQL
• Novel ClickHouse-specific benchmark (150 queries, 6 categories)
• Empirical comparison of schema representation strategies
• Actionable guidelines for AI-assisted database client developers
• Open-source benchmark and evaluation framework
Contact
Sahith Vibudhi
Email: v.sahithkumar@gmail.com
GitHub: github.com/sahithvibudhi
LinkedIn: linkedin.com/in/v-sahith
Krishna Chaitanya Balusu
Email: krishnabkc15@gmail.com
GitHub: github.com/Krishnachaitanyakc
LinkedIn: linkedin.com/in/kcbalusu/