A comprehensive Python library for evaluating LLMs' SQL query generation capabilities against baseline datasets.
To run this project, ensure the following dependencies are installed:
- `pandas` - Data manipulation and analysis
- `sqlalchemy` - Database connections and ORM
- `anthropic` - Anthropic Claude API
- `azure-core` - Azure SDK core library
- `tabulate` - Report formatting
- `python-dotenv` - Environment variable management
- `pyodbc` - ODBC driver for SQL Server
- `openai` - OpenAI API client
- `duckdb` - DuckDB database engine
- Python Version: Ensure Python 3.10 or higher is installed.
- ODBC Driver: Install ODBC Driver 18 for SQL Server (recommended).
- Sample Database: The TPCH database is used as the example dataset; you can also build your own test set.
The application supports multiple database sources:
- SQL Azure Database or on-premises SQL Server (`sql-server`)
- DuckDB file (`duckdb`)
To use DuckDB with TPCH sample data:

- Download the TPCH database files from the DuckDB documentation: https://duckdb.org/docs/stable/core_extensions/tpch.html

- Run the setup_duckdb.py script to create the baseline results:

  ```shell
  # Run the TPCH queries to populate the baseline files
  python scripts/setup_duckdb.py
  ```

- Adjust the relevant parameters in the `.env` file to point to the DuckDB database file:
  ```
  # Set environment variable
  DUCKDB_PATH="./docs/tpch-sf10.db"
  ```

To run queries in Snowflake, you need to activate a warehouse. Depending on your workload, choose an appropriate warehouse size.
Example:

```sql
CREATE WAREHOUSE MY_WH
WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 120 -- 2 minutes
AUTO_RESUME = TRUE;
```
The TPCH database is included as a sample database in Snowflake and is accessible for querying without any additional setup:

- Database name = SNOWFLAKE_SAMPLE_DATA
- Schema name = TPCH_SF10 [or TPCH_SF100, TPCH_SF1000]
- Clone the Repository:

  ```shell
  git clone https://github.com/erincon01/llm_eval.git
  cd llm_eval
  ```

- Install Python Dependencies: Use Poetry to install dependencies:

  ```shell
  poetry lock
  poetry install
  ```

  Check the Poetry environment for debugging in VS Code:

  ```shell
  poetry env list
  poetry env info --path
  ```
- Install System Dependencies: Download and install ODBC Driver 18 for SQL Server if `sql-server` is used as the data source [both for SQL Azure and SQL Server].

- Set Up Environment Variables: Create a `.env` file following the `.env.sample` template.
```
├── core/      # Core business logic components
├── data/      # Data management classes
├── services/  # Service layer for external integrations
├── utils/     # Utility functions and helpers
└── docs/      # Configuration and results
```

Primary orchestrator class that coordinates all evaluation components.
Constructor Parameters:
- `questions_file_name` - Path to YAML file with evaluation questions
- `db_schema_file_name` - Path to YAML file with database schema
- `semantic_rules_file_name` - Path to Markdown file with semantic rules
- `system_message_file_name` - Path to Markdown file with system message template
- `models_file_name` - Path to YAML file with model configurations
Key Methods:
- `evaluate_models(temperature, results_to_path, file_name_prefix, log_results, log_summary)` - Run full evaluation across all models
- `execute_queries(sql_query_column, summary_file_name, results_to_path, ...)` - Generate baseline datasets
- `load_baseline_datasets(baseline_path)` - Load reference datasets for comparison
- `process_questions_with_model(model, model_config, temperature, max_tokens)` - Process questions with a specific model
Attributes:
- `all_questions` - Loaded evaluation questions
- `baseline_datasets` - Reference datasets for comparison
- `models`, `models_configs` - Available models and their configurations
- `db_schema` - Database schema information
- `semantic_rules` - Business rules content
- `system_message` - LLM system prompt template
Handles database connections and SQL execution.
Methods:
- `get_dynamic_sql(source, sql_query, as_data_frame)` - Execute SQL queries on the specified database
- `execute_sql_query(sql_query)` - Execute SQL and return results with metadata
- `get_connection(source)` - Get database connection engine
- `decode_source(source)` - Normalize database source names
Supported Sources: sql-server, duckdb, snowflake
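As an illustration only (the library's actual `decode_source` implementation may differ), a source-name normalizer in this spirit could map user-supplied aliases onto the three canonical names; the alias list here is a hypothetical example:

```python
# Hypothetical alias table; the real decode_source may accept other spellings.
_ALIASES = {
    "sql-server": "sql-server",
    "sqlserver": "sql-server",
    "azure-sql": "sql-server",
    "duckdb": "duckdb",
    "snowflake": "snowflake",
}

def decode_source(source: str) -> str:
    """Normalize a database source name to one of the supported values."""
    key = source.strip().lower()
    if key not in _ALIASES:
        raise ValueError(f"Unsupported source: {source!r}")
    return _ALIASES[key]

print(decode_source("SQLServer"))  # -> sql-server
```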
Manages LLM interactions for SQL generation.
Methods:
- `generate_sql_query(platform, model, question_number, user_prompt, ...)` - Generate SQL from natural language
Supported Platforms: azure_openai, anthropic
Executes baseline SQL queries and generates reference datasets.
Methods:
- `execute_queries(questions, sql_query_column, summary_file_name, results_to_path, ...)` - Run baseline queries and export results
Orchestrates evaluation of multiple LLM models.
Methods:
- `evaluate_models(models, models_configs, all_questions, baseline_datasets, ...)` - Evaluate multiple models against the baseline
Processes individual questions with LLM models and compares results.
Methods:
- `process_questions_with_model(questions, baseline_datasets, model, ...)` - Process questions with a specific model and compare results
Manages question datasets and YAML serialization.
Methods:
- `load_questions()` - Load questions from YAML file
- `add_question(question)` - Add a new question to the dataset
- `get_all_questions()` - Retrieve all questions
- `save_questions(yaml_file)` - Save questions to YAML
- `find_question(keyword)` - Search questions by keyword
Attributes:
- `yaml_file` - Path to YAML file
- `questions` - List of question dictionaries
Utility for loading questions from files.
Methods:
- `load_questions_from_file(questions_file_name)` - Static method to load questions
Handles model configuration loading from YAML.
Methods:
- `load_models_from_yaml()` - Load enabled models and configurations
- `get_models_configs()` - Return loaded model configurations
- `get_models()` - Return flat list of enabled models
- `get_model_config_by_id(model_id)` - Get configuration by provider ID
Attributes:
- `yaml_path` - Path to models configuration YAML
- `models_configs` - List of provider configurations
- `models` - Flat list of enabled models
Database schema management from YAML configuration.
Methods:
- `load_schema()` - Load database schema from YAML file
- `get_table_names()` - Return list of all table names
- `get_table_script(table_name)` - Return script for a specific table
- `get_all_tables_scripts()` - Return all table scripts
- `add_table(table_name, script)` - Add a new table to the schema
- `save_schema()` - Save schema back to YAML file
Attributes:
- `yaml_file` - Path to schema YAML file
- `tables` - List of table definitions
Data comparison and baseline loading utilities.
Methods:
- `compare_dataframes(baseline_df, llm_df, question_number)` - Compare baseline vs. LLM results
- `load_baseline_datasets(baseline_path)` - Load baseline CSV files from a directory
Returns: Equality percentages for rows, columns, and coverage metrics
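A minimal sketch of such a comparison (not the library's actual `compare_dataframes`, which may apply additional normalization and also report column and coverage metrics) could compute the share of baseline rows that reappear in the LLM result:

```python
import pandas as pd

def row_equality_pct(baseline_df: pd.DataFrame, llm_df: pd.DataFrame) -> float:
    """Percentage of baseline rows also present in the LLM result.

    Illustrative sketch only: compares on shared columns, ignoring row order.
    """
    common_cols = [c for c in baseline_df.columns if c in llm_df.columns]
    if not common_cols or baseline_df.empty:
        return 0.0
    # Left-merge with an indicator column to mark which baseline rows matched.
    merged = baseline_df[common_cols].merge(
        llm_df[common_cols].drop_duplicates(), how="left", indicator=True
    )
    matched = (merged["_merge"] == "both").sum()
    return 100.0 * matched / len(baseline_df)

baseline = pd.DataFrame({"id": [1, 2, 3], "total": [10.0, 20.0, 30.0]})
llm = pd.DataFrame({"id": [1, 2, 4], "total": [10.0, 20.0, 40.0]})
print(round(row_equality_pct(baseline, llm), 1))  # -> 66.7
```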
DataFrame normalization and alignment utilities.
Methods:
- `normalize_numeric_columns(df1, df2)` - Normalize numeric precision across DataFrames
- `align_columns_by_first_row(df1, df2)` - Align column order based on first-row values
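As a hedged illustration of what numeric normalization can look like (the library's `normalize_numeric_columns` may work differently), rounding numeric columns in both frames to a common precision keeps small float noise from breaking equality checks:

```python
import pandas as pd

def round_numeric_columns(df1: pd.DataFrame, df2: pd.DataFrame, decimals: int = 4):
    """Round numeric columns in both frames to the same precision.

    Illustrative sketch only; `decimals` is an assumed default.
    """
    out1, out2 = df1.copy(), df2.copy()
    for df in (out1, out2):
        num_cols = df.select_dtypes(include="number").columns
        df[num_cols] = df[num_cols].round(decimals)
    return out1, out2

a = pd.DataFrame({"total": [10.00004]})
b = pd.DataFrame({"total": [10.00001]})
a2, b2 = round_numeric_columns(a, b)
print(a2.equals(b2))  # -> True
```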
File operations utilities.
Methods:
- `load_file(filename)` - Load file content as a string
- `remove_baseline_datasets(results_to_path)` - Remove baseline CSV files
Low-level LLM API interactions.
Methods:
- `get_chat_completion_from_platform(platform, model, system_message, user_prompt, ...)` - Get chat completion from various platforms
SQL query processing utilities.
Methods:
- `remove_quotations(sql_query)` - Extract SQL from markdown code blocks
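A sketch of fence stripping in this spirit (the actual `remove_quotations` may handle more edge cases): LLMs often wrap generated SQL in a markdown code block, which must be removed before execution.

```python
import re

def strip_sql_fences(text: str) -> str:
    """Extract SQL from a ```sql ... ``` markdown block, or return the
    text unchanged if no fence is found. Illustrative sketch only."""
    match = re.search(r"```(?:sql)?\s*(.*?)```", text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else text.strip()

raw = "```sql\nSELECT 1;\n```"
print(strip_sql_fences(raw))  # -> SELECT 1;
```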
Performance reporting and analysis.
Methods:
- `performance_report(results_path, file_name_prefix)` - Generate comprehensive performance reports
- `_generate_model_performance_report(all_data)` - Model-specific performance metrics
- `_generate_query_performance_report(all_data)` - Query-specific performance metrics
- `_generate_ranking_reports(all_data)` - Ranking reports by different metrics
- `_generate_combined_ranking(agg)` - Combined ranking by quality, time, and price
Complete evaluation workflow example:

```python
from llms_evaluator import LLMsEvaluator
from utils.reporting_utils import performance_report

# Initialize evaluator
evaluator = LLMsEvaluator(
    questions_file_name="./docs/01-questions-sql-server.yaml",
    db_schema_file_name="./docs/02-database_schema.yaml",
    semantic_rules_file_name="./docs/03-semantic_rules-sql-server.md",
    system_message_file_name="./docs/04-system_message-sql-server.md",
    models_file_name="./docs/05-models.yaml",
    data_source="sql-server"  # or "duckdb"
)

# Step 1: Generate baseline datasets
evaluator.execute_queries(
    sql_query_column="sql_query",
    summary_file_name="questions_baseline_summary.csv",
    results_to_path="./docs/results/baseline_dataset-sql-server",
    persist_results=True,
    drop_results_if_exists=True
)

# Step 2: Load baseline for comparison
evaluator.load_baseline_datasets("./docs/results/baseline_dataset-sql-server")

# Step 3: Run evaluation iterations
number_of_iterations = 3  # adjust to the number of runs you need
for i in range(number_of_iterations):
    i_str = str(i + 1).zfill(2)  # 1-based, zero-padded iteration label
    evaluator.evaluate_models(
        temperature=0.9,
        results_to_path="./docs/results",
        file_name_prefix=f"results_llm_{i_str}",
        log_results=True,
        log_summary=True,
        iteration=i_str,
    )

# Step 4: Generate performance reports
performance_report(
    results_path="./docs/results",
    file_name_prefix="questions_summary_results_llm"
)
```

questions:
- question_number: 1
user_question: |
Which customers from the 'BUILDING' market segment placed more than 10 orders in 1996? Order by total order value descending.
sql_query: |
SELECT ...
c.c_custkey AS customer_id,
c.c_name AS customer_name,
COUNT(o.o_orderkey) AS num_orders,
SUM(o.o_totalprice) AS total_amount
FROM customer c
JOIN orders o ON c.c_custkey = o.o_custkey
WHERE c.c_mktsegment = 'BUILDING'
AND YEAR(o.o_orderdate) = 1996
GROUP BY c.c_custkey, c.c_name
HAVING COUNT(o.o_orderkey) > 10
ORDER BY total_amount DESC;
tables_used:
- "customer"
- "orders"
- question_number: 2
user_question: |
xxxx
sql_query: |
xxxx
tables_used:
- xxxx
- yyyy
- question_number: n

tables:
- name: "region"
script: |
create table region (
r_regionkey integer not null,
r_name char(25) not null,
r_comment varchar(152),
primary key (r_regionkey)
);
- name: xxxx
script: |
xxxx
- name: yyyyy
script: |
xxxxxxx

# Semantic Rules for SQL Queries
## Semantic Rules
- Ensure all SQL queries are syntactically correct and executable.
- Use appropriate JOINs to connect tables based on foreign key relationships.
- Filter results using WHERE clauses to match user questions.

# System Message Template for LLMs
You are an expert SQL query generator. Given a user question, generate a valid SQL query that
retrieves the requested data from the specified database schema. Ensure the query is optimized for performance and adheres to the semantic rules provided.

models_configs:
- id: "azure_openai"
enabled: true | false
models:
- name: "gpt-4o"
enabled: true | false
cost_input_tokens_EUR_1K: 0.005
      cost_output_tokens_EUR_1K: 0.015

See the .env.sample file for a complete list of environment variables.
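Assuming the models YAML above is parsed into Python dictionaries (for example with PyYAML), collecting the flat list of enabled models, in the spirit of `get_models()`, might look like this. Field names follow the sample above; the second model entry is a hypothetical example:

```python
# Structure mirrors the models YAML sample above.
models_configs = [
    {
        "id": "azure_openai",
        "enabled": True,
        "models": [
            {"name": "gpt-4o", "enabled": True,
             "cost_input_tokens_EUR_1K": 0.005,
             "cost_output_tokens_EUR_1K": 0.015},
            {"name": "gpt-4o-mini", "enabled": False,  # hypothetical entry
             "cost_input_tokens_EUR_1K": 0.001,
             "cost_output_tokens_EUR_1K": 0.002},
        ],
    },
]

def enabled_models(configs):
    """Flat list of enabled model names from enabled providers (sketch)."""
    return [
        m["name"]
        for provider in configs
        if provider.get("enabled")
        for m in provider.get("models", [])
        if m.get("enabled")
    ]

print(enabled_models(models_configs))  # -> ['gpt-4o']
```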
- Initialize: Create an `LLMsEvaluator` instance with configuration files
- Generate Baseline: Execute ground-truth SQL queries to create reference datasets
- Load Baseline: Load reference datasets for comparison
- Evaluate Models: Run evaluation iterations across configured LLM models
- Analyze Results: Generate performance reports and rankings
- Consolidate: Merge results across experiments (optional)
- Accuracy Metrics: Row/column equality percentages, coverage ratios
- Performance Metrics: SQL execution time, LLM response time
- Cost Metrics: Token usage and EUR costs per model
- Quality Metrics: Data completeness and correctness ratios
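As an illustrative, hypothetical way to fold these metrics into a single ordering, in the spirit of `_generate_combined_ranking`, each metric can be ranked per model and the rank positions averaged. The column names and example numbers below are assumptions, not the library's actual formula:

```python
import pandas as pd

# Hypothetical per-model aggregates; column names are assumed for illustration.
agg = pd.DataFrame({
    "model": ["gpt-4o", "claude"],
    "row_equality_pct": [92.0, 88.0],  # accuracy (higher is better)
    "llm_time_s": [3.2, 2.1],          # response time (lower is better)
    "cost_eur": [0.020, 0.030],        # cost (lower is better)
})

# Rank each metric (1 = best), then average the ranks into a combined score.
agg["rank_quality"] = agg["row_equality_pct"].rank(ascending=False)
agg["rank_time"] = agg["llm_time_s"].rank(ascending=True)
agg["rank_price"] = agg["cost_eur"].rank(ascending=True)
agg["combined_rank"] = agg[["rank_quality", "rank_time", "rank_price"]].mean(axis=1)

print(agg.sort_values("combined_rank")[["model", "combined_rank"]])
```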
See the files: requirements.txt and pyproject.toml
- Baseline: Tab-separated CSV files per question (question_01.csv, etc.)
- Results: YAML files with detailed evaluation results per model
- Summary: CSV files with aggregated metrics across all evaluations
- Reports: Formatted performance and ranking reports
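Since the baseline files are tab-separated, loading one with pandas requires an explicit separator. A minimal sketch (in practice you would pass a real path such as a `question_01.csv` under the results folder; an in-memory sample is used here so the snippet is self-contained):

```python
import io

import pandas as pd

# Stand-in for the contents of a baseline file like question_01.csv.
sample = "customer_id\ttotal_amount\n1\t10.5\n2\t20.0\n"

# sep="\t" is required because baseline files are tab-separated.
df = pd.read_csv(io.StringIO(sample), sep="\t")
print(df.shape)  # -> (2, 2)
```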