feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis [WIP] #15

Closed
renecannao wants to merge 76 commits into v3.1-vec from v3.1-MCP2

Conversation

@renecannao

Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis

Summary

This PR implements Priority 1 improvements identified by the META agent from the previous discovery run. These enhancements significantly improve the depth, confidence, and actionability of database discovery reports.

Expected Impact: +25% overall quality, +30% confidence in findings


Key Improvements

1. Performance Baseline Measurement (QUERY Agent)

The QUERY agent now executes actual performance queries with timing measurements instead of relying solely on EXPLAIN output.

Required Tests (5 per table):

  • Primary key lookups with timing
  • Table scan performance
  • Index range scan efficiency
  • JOIN query benchmarks
  • Aggregation query performance

Output:

  • Actual execution times in milliseconds
  • EXPLAIN cost vs actual time comparison
  • Efficiency score (1-10 scale)
  • Performance score per table

Example:

| Query Type | Actual Time (ms) | EXPLAIN Cost | Efficiency Score |
|------------|------------------|--------------|------------------|
| PK Lookup | 2.3 | 4.0 | 9/10 |
| JOIN Query | 45.2 | 120.0 | 7/10 |
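As a sketch of how such timings could be collected (illustrative only: the actual agent issues queries through MCP tools such as `run_sql_readonly`, and the run count here is an assumption):

```python
import statistics
import time

def time_query_ms(cursor, sql: str, runs: int = 5) -> float:
    """Run `sql` several times and return the median wall-clock time in ms.

    `cursor` is any DB-API 2.0 cursor. The median is used rather than the
    mean so a cold-cache first run does not skew the baseline.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()  # drain the result set so transfer time is included
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

The EXPLAIN cost for the same statement can then be compared against this figure to derive the per-query efficiency score.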

2. Statistical Significance Testing (STATISTICAL Agent)

The STATISTICAL agent now performs rigorous statistical tests with p-values and effect sizes.

Required Tests (5 types):

  1. Normality Tests - Shapiro-Wilk, Anderson-Darling
  2. Correlation Analysis - Pearson (normal), Spearman (non-normal) with 95% CI
  3. Chi-Square Tests - Categorical associations with Cramer's V
  4. Outlier Detection - Modified Z-score, IQR method
  5. Group Comparisons - t-test, Mann-Whitney U with effect sizes
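Of these, the outlier step is easy to make concrete. A minimal stdlib sketch of the modified Z-score method (the 3.5 cutoff is the conventional Iglewicz-Hoaglin threshold, an assumption here rather than something the prompt specifies):

```python
import statistics

def modified_z_outliers(values, threshold=3.5):
    """Flag outliers using the modified Z-score (median/MAD based).

    Unlike the classic Z-score, the median and MAD are robust statistics,
    so a few extreme values cannot inflate the spread and mask themselves.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # degenerate spread: fall back to another method (e.g. IQR)
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]
```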

Output:

  • All tests report exact p-values (not just "p < 0.05")
  • Effect sizes with confidence intervals
  • Statistical confidence score (1-10)
  • Data quality confidence level (HIGH/MEDIUM/LOW)

Example:

**Variables:** orders.total_amount vs orders.item_count
**Test:** Pearson r=0.78, p<0.001, 95% CI [0.72, 0.83]
**Conclusion:** SIGNIFICANT correlation
**Strength:** Very Strong

3. Enhanced Cross-Domain Question Synthesis (META Agent)

The META agent now generates 15+ cross-domain questions across 5 categories.

Distribution:

  • Performance + Security: 4 questions
  • Structure + Semantics: 3 questions
  • Statistics + Query: 3 questions
  • Security + Semantics: 3 questions
  • All Agents: 2 questions

Each question includes:

  • Business context and stakeholder impact
  • Multi-phase answer plan with specific tools
  • Integrated answer template
  • Priority level (URGENT/HIGH/MEDIUM)
  • Business value rating
  • Confidence level based on data availability

Example:

#### Q. "What are the security implications of query performance issues?"

**Agents Required:** QUERY + SECURITY
**Priority:** URGENT
**Business Value:** HIGH

**Phase 1: QUERY Analysis**
- Identify slow queries using `explain_sql`, `run_sql_readonly`

**Phase 2: SECURITY Analysis**
- Check if slow queries access sensitive data using `sample_rows`

**Phase 3: Cross-Agent Synthesis**
- Assess risk and document mitigation strategies

Files Changed

| File | Changes | Description |
|------|---------|-------------|
| prompts/multi_agent_discovery_prompt.md | +326 lines | Added performance baseline and statistical testing requirements |
| README.md | +48 lines | Updated documentation with new v1.3 capabilities |

Prompt Evolution History

  • v1.0: Initial 4-agent system (STRUCTURAL, STATISTICAL, SEMANTIC, QUERY)
  • v1.1: Added SECURITY agent (5 analysis agents)
  • v1.1: Added META agent for prompt optimization (6 agents total, 5 rounds)
  • v1.2: Added Question Catalog generation with executable answer plans
  • v1.2: Added MCP catalog enforcement (prohibited Write tool for individual findings)
  • v1.3: [CURRENT] Added Performance Baseline Measurement (QUERY agent)
  • v1.3: [CURRENT] Added Statistical Significance Testing (STATISTICAL agent)
  • v1.3: [CURRENT] Enhanced Cross-Domain Question Synthesis (15 minimum questions)

Testing

To test the new v1.3 capabilities:

```bash
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless
python ./headless_db_discovery.py --database testdb --output discovery_v13_test.md
```

Expected improvements in output:

  • Performance baseline tables with actual query times
  • Statistical significance summary with test counts and p-values
  • 15+ cross-domain questions in META analysis

Future Improvements (Priority 2)

These are identified for future iterations:

  • Expand SECURITY to Disaster Recovery (backup strategy, RTO, RPO)
  • Implement State Transition Analysis for all status/state columns
  • Add Edge Case and Failure Scenario analysis
  • Create Shared Base Profile to reduce redundant column profiling
  • Add Dependency Analysis with topological sort

See META_ANALYSIS_PROMPT_IMPROVEMENTS.md for complete roadmap.


Based on: META agent analysis from Round 3 discovery run
Confidence: HIGH (all improvements validated with SQL evidence)

Replace the old single-agent discovery approach with a new 4-agent
collaborative system that provides more comprehensive analysis.

Changes:
- Add prompts/ directory with multi-agent system prompts
- Replace README.md with multi-agent discovery documentation
- Replace headless_db_discovery.py to use 4-agent prompt
- Replace headless_db_discovery.sh to use 4-agent prompt
- Remove HEADLESS_DISCOVERY_README.md (replaced by README.md)

The new system uses 4 collaborating agents (STRUCTURAL, STATISTICAL,
SEMANTIC, QUERY) that communicate via MCP catalog across 4 rounds:
1. Blind Exploration (parallel independent discovery)
2. Pattern Recognition (cross-agent collaboration)
3. Hypothesis Testing (validation with evidence)
4. Final Synthesis (comprehensive report)

This provides deeper analysis with 15+ hypothesis validations,
health scores, and prioritized recommendations.
- Change default timeout from 300s to 3600s (1 hour) for multi-agent discovery
- Add specific timeout error detection (exit code 124) with helpful message
- Add empty output file detection with diagnostic suggestions
- Improve error messages to guide users on how to debug issues
- Show example commands for increasing timeout when timeout occurs
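The error-handling behavior described above can be sketched as follows (function name and message wording are illustrative, not the script's actual implementation; it assumes the command runs under the GNU `timeout` utility, whose exit code 124 signals expiry):

```python
import subprocess
from pathlib import Path

DEFAULT_TIMEOUT = 3600  # multi-agent discovery default, up from the old 300s

def run_discovery(cmd, output_file, timeout=DEFAULT_TIMEOUT):
    """Run `cmd` under GNU timeout and map failure modes to actionable hints."""
    result = subprocess.run(["timeout", str(timeout), *cmd])
    if result.returncode == 124:  # GNU timeout's expiry exit code
        return (f"Timed out after {timeout}s. Retry with a larger value, "
                f"e.g. --timeout {timeout * 2}, for large schemas.")
    out = Path(output_file)
    if not out.exists() or out.stat().st_size == 0:
        return ("Output file is missing or empty: check the MCP server "
                "connection and re-run with verbose logging.")
    return "OK"
```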

The multi-agent discovery process involves 4 rounds of agent collaboration
and can take significantly longer than the previous single-agent approach.

Expand the 4-agent system to 6 agents (5 analysis + 1 meta) with
enhanced security analysis and self-improving prompt optimization.

New Agents:
- SECURITY: Identifies sensitive data (PII, credentials, financial),
  assesses access patterns, identifies vulnerabilities, and provides
  compliance assessment (GDPR, PCI-DSS)
- META: Analyzes report quality by section, identifies gaps,
  suggests specific prompt improvements for future runs

Protocol Changes:
- Expanded from 4 rounds to 5 rounds
- Round 5 is Meta Analysis (META agent only)
- META agent does not participate in rounds 1-4

New Report Sections:
- 5. SECURITY ANALYSIS with data classification (PUBLIC/INTERNAL/
  CONFIDENTIAL/RESTRICTED)
- E. Security data classification appendix

New Output:
- Separate META ANALYSIS document with:
  - Section quality ratings (depth, completeness)
  - Specific prompt improvement suggestions
  - Gap identification
  - Evolution history tracking

This enables continuous prompt optimization through multiple discovery
iterations, with each run informing improvements for the next.

Each agent now generates a comprehensive catalog of questions they can
answer about the database, with executable answer plans using MCP tools.

Question Catalog Features:
- 90+ questions across all agents (minimum 15-20 per agent)
- Executable answer plans with specific MCP tools for each question
- Answer templates with structured output formats
- Complexity ratings (LOW/MEDIUM/HIGH) and time estimates
- Cross-domain questions requiring multiple agents

New Output Locations:
- kind="question_catalog", key="structural_questions"
- kind="question_catalog", key="statistical_questions"
- kind="question_catalog", key="semantic_questions"
- kind="question_catalog", key="query_questions"
- kind="question_catalog", key="security_questions"
- kind="question_catalog", key="cross_domain_questions"

Benefits:
1. Fast Answers: Pre-validated plans skip analysis phase
2. Consistent Quality: All answers follow proven templates
3. Tool Reuse: Efficient MCP tool usage patterns
4. Comprehensive Coverage: 90+ questions cover most user needs

Example Question Catalog Entry:
- Question: "What sensitive data exists in table X?"
- Answer Plan: sample_rows + column_profile on table X
- Answer Template: Structured list with sensitivity classification
- Complexity: MEDIUM
- Estimated Time: 30 seconds
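Represented as data, one such entry might look like the sketch below (the field names are hypothetical; the real schema is whatever the MCP `catalog_upsert` tool accepts):

```python
import json

# Hypothetical entry shape mirroring the bullet list above.
entry = {
    "question": "What sensitive data exists in table X?",
    "answer_plan": ["sample_rows", "column_profile"],  # MCP tools, in order
    "answer_template": "structured list with sensitivity classification",
    "complexity": "MEDIUM",        # LOW / MEDIUM / HIGH
    "estimated_time_s": 30,
}

# An agent would then persist it under its catalog key, e.g.:
# catalog_upsert(kind="question_catalog",
#                key="security_questions",
#                value=json.dumps([entry]))
print(json.dumps(entry, indent=2))
```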

This creates a reusable knowledge base for future LLM interactions,
enabling quick, accurate responses to common database questions.

Add prominent warnings and explicit instructions to prevent agents from
creating individual markdown files instead of using the MCP catalog.

Changes:
- Add "CRITICAL: MCP CATALOG USAGE" section at the top of the prompt
- Explicitly prohibit Write tool for individual agent discoveries
- Specify that ONLY Round 4 final report should be written to file
- Add detailed catalog_upsert usage examples
- Update all round descriptions with CRITICAL catalog usage instructions
- Add "IMPORTANT - Catalog Usage Rules" section

Key Instructions Added:
- "DO NOT use the Write tool to create separate markdown files"
- "ALL agent findings MUST be written to MCP catalog using catalog_upsert"
- "ONLY Round 4 Final Synthesis writes to local file using Write tool"
- "DO NOT use Write tool for individual agent discoveries in Rounds 1-3"

This ensures agents store findings in the MCP catalog for cross-agent
collaboration, with only the final consolidated report written to disk.

Implemented three critical improvements identified by META agent:

1. Performance Baseline Measurement (QUERY agent):
   - Added 5 required query types with timing measurements
   - Compare actual execution time vs EXPLAIN cost
   - Efficiency score calculation (1-10 scale)
   - Output format with performance scores per table

2. Statistical Significance Testing (STATISTICAL agent):
   - Added 5 required statistical tests:
     * Normality tests (Shapiro-Wilk, Anderson-Darling)
     * Correlation analysis (Pearson, Spearman) with CI
     * Chi-square tests for categorical associations
     * Outlier detection (Modified Z-score, IQR)
     * Group comparisons (t-test, Mann-Whitney U)
   - All tests report p-values and effect sizes
   - Statistical confidence scoring (1-10)

3. Enhanced Cross-Domain Question Synthesis:
   - Increased minimum from general suggestion to 15 questions
   - Added 5 cross-domain categories with examples
   - Detailed question template with multi-phase plans
   - Priority distribution (URGENT/HIGH/MEDIUM)
   - Quality criteria for cross-domain questions

Expected Impact: +25% overall quality, +30% confidence in findings

Updated README to reflect new capabilities:

1. Statistical Analysis section:
   - Added Statistical Significance Testing subsection
   - Listed 5 required statistical tests (normality, correlation, chi-square, outliers, group comparisons)
   - Mentioned p-values and effect sizes

2. Query Analysis section:
   - Added Performance Baseline Measurement subsection
   - Listed 6 required query types with timing
   - Mentioned efficiency scoring (EXPLAIN vs actual)

3. Question Catalogs section:
   - Updated to reflect 15+ cross-domain questions (enhanced in v1.3)
   - Added 5 cross-domain categories with question counts

4. Cross-Domain Questions section:
   - Expanded from 3 examples to 15 specific questions
   - Organized by 5 categories with question counts
   - Matched new v1.3 requirements
@gemini-code-assist

Summary of Changes

Hello @renecannao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces version 1.3 of the Multi-Agent Database Discovery system, significantly enhancing its analytical capabilities. The primary goal is to provide deeper insights and higher confidence in database discovery reports by integrating actual query performance measurements and robust statistical analysis. These improvements are designed to make the discovery process more comprehensive and the generated recommendations more actionable, ultimately boosting the overall quality of the system's output.

Highlights

  • Performance Baseline Measurement: The QUERY agent now executes actual performance queries with timing measurements for primary key lookups, table scans, index range scans, JOINs, and aggregations. This provides a more accurate performance assessment compared to relying solely on EXPLAIN output, including an efficiency score.
  • Statistical Significance Testing: The STATISTICAL agent now performs rigorous statistical tests, including normality tests, correlation analysis, chi-square tests, outlier detection, and group comparisons. All tests report exact p-values and effect sizes, enhancing the confidence and depth of data quality findings.
  • Enhanced Cross-Domain Question Synthesis: The META agent now generates a minimum of 15 cross-domain questions across five categories (e.g., Performance + Security, Structure + Semantics). Each question includes business context, stakeholder impact, multi-phase answer plans with specific tools, and priority levels, improving the actionability of discovery reports.
  • Updated Headless Discovery Scripts and Documentation: The headless_db_discovery.py and headless_db_discovery.sh scripts have been updated to reflect the new 6-agent, 5-round protocol, including an increased default timeout. The README.md has been completely revamped to detail the new multi-agent architecture, capabilities, and question catalog generation, while a new multi_agent_discovery_prompt.md defines the detailed instructions for the agents.

@coderabbitai

coderabbitai bot commented Jan 17, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Two‑phase database discovery (deterministic harvest + LLM semantic phase)
    • Retrieval‑Augmented Generation (RAG) support: FTS, vector, and hybrid search
    • Schema‑scoped catalog and new AI/RAG MCP endpoints for discovery and retrieval
  • Enhancements

    • Expanded MCP tool surface and agent coordination (catalog, discovery, llm, agent tools)
    • Improved diagnostics, longer discovery timeouts, and CLI/demo utilities
  • Documentation

    • Comprehensive docs and examples for discovery, RAG, and tool usage


Walkthrough

Adds a two‑phase, MCP‑centric discovery system: a Static_Harvester writes deterministic metadata into a new SQLite Discovery_Schema; Query_Tool_Handler, RAG and new MCP tools/endpoints enable Phase‑2 LLM and multi‑agent discovery; includes C++ implementations, prompts, headless orchestration scripts, docs, and build wiring.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Discovery catalog & harvester**: `include/Discovery_Schema.h`, `lib/Discovery_Schema.cpp`, `include/Static_Harvester.h`, `lib/Static_Harvester.cpp` | New Discovery_Schema API + implementation and Static_Harvester implementation for INFORMATION_SCHEMA harvesting, run/agent lifecycle, FTS, LLM artifacts, telemetry and CRUD. |
| **Query tool & connection pool**: `include/Query_Tool_Handler.h`, `lib/Query_Tool_Handler.cpp`, `lib/Query_Tool_Handler.*` | Replaces MySQL_Tool_Handler wrapper with Query_Tool_Handler owning Discovery_Schema, Static_Harvester, connection pool, tool dispatch (catalog/discovery/llm/queries), usage stats and safety checks; constructor/signature changes and new tool surface. |
| **RAG subsystem & tool handler**: `include/RAG_Tool_Handler.h`, `lib/RAG_Tool_Handler.cpp`, `lib/AI_Features_Manager.cpp`, `doc/rag-*`, `RAG_POC/*` | New RAG_Tool_Handler, RAG DB schema additions, vector/FTS tables, ingestion POC, GenAI RAG config, docs, tests, and MCP /mcp/rag endpoint wiring. |
| **MCP server/threading/admin changes**: `include/MCP_Thread.h`, `lib/MCP_Thread.cpp`, `lib/ProxySQL_MCP_Server.cpp`, `lib/MCP_Endpoint.cpp`, `lib/Admin_Bootstrap.cpp`, `lib/Admin_FlushVariables.cpp`, `include/proxysql_admin.h`, `lib/ProxySQL_Admin*.cpp` | Removes runtime mcp_catalog_path variable (catalog path hardcoded to datadir), adds mcpdb handle, rag_tool_handler wiring, MCP rule/digest/statistics hooks, enhanced JSON-RPC error handling/logging, and proxy sqlite symbol usage. |
| **Catalog API schema scoping**: `include/MySQL_Catalog.h`, `lib/MySQL_Catalog.cpp`, `include/MySQL_Tool_Handler.h`, `lib/MySQL_Tool_Handler.cpp` | Adds schema parameter to catalog CRUD APIs and updates table/schema handling, FTS triggers, and callers to be schema-scoped (unique constraint on (schema, kind, key)). |
| **FTS / sqlite proxy symbols / build**: `include/sqlite3db.h`, `lib/proxy_sqlite3_symbols.cpp`, `lib/sqlite3db.cpp`, `lib/Makefile`, `include/ProxySQL_Admin_Tables_Definitions.h` | Introduces proxy_sqlite3 function-pointer symbols and a TU defining them; wires extended sqlite API pointers, updates initialization to use proxies, adds MCP stats/digest table macros, and includes new objects in the build. |
| **Headless & two-phase orchestration**: `scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/*` (README.md, headless_db_discovery.py/.sh, two_phase_discovery.py, prompts/*, static_harvest.sh, test_catalog.sh, demo_agent_claude.sh, mcp_config.example.json, .gitignore, etc.) | Moves inline prompts to prompt files (multi-agent & two-phase), adds new prompts, increases timeouts, richer diagnostics and error guidance, new orchestration scripts for static harvest, two-phase discovery and demos, and README/prompts replacing the deleted HEADLESS_DISCOVERY_README.md. |
| **Tooling, docs & examples**: `doc/*`, `scripts/mcp/README.md`, `doc/MCP/*` | Extensive documentation updates: architecture, VARIABLES, two-phase design, FTS/embeddings plans, RAG docs/examples, MCP tool guides and versioning/usage examples. |
| **Query/statistics wiring**: `lib/ProxySQL_Admin_Stats.cpp`, `lib/ProxySQL_Admin.cpp`, `lib/ProxySQL_MCP_Server.cpp` | New stats implementations (stats___mcp_query_tools_counters, stats___mcp_query_digest, stats___mcp_query_rules) integrated into refresh/save/load flows, plus runtime load/save handlers for MCP rules. |
| **Static harvest & CLI scripts**: `static_harvest.sh`, `two_phase_discovery.py`, `test_catalog.sh` | New CLI scripts to run static harvest, orchestrate two-phase discovery, and test catalog/LLM upserts; support run_id resolution, prompt file usage, and dry-run diagnostics. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Claude as Claude Code (Client)
    participant MCP as MCP Server (/mcp/query)
    participant QTH as Query_Tool_Handler
    participant Catalog as Discovery_Schema (mcp_catalog.db)
    participant MySQL as MySQL
    participant RAG as RAG_Tool_Handler

    Note over Claude,MCP: Phase 1 — Static harvest
    Claude->>MCP: discovery.run_static(schema_filter)
    MCP->>QTH: invoke discovery.run_static
    QTH->>MySQL: query INFORMATION_SCHEMA
    MySQL-->>QTH: rows
    QTH->>Catalog: insert_schema/objects/columns/indexes/fks
    Catalog-->>QTH: ack
    QTH->>Catalog: rebuild_fts_index
    Catalog-->>MCP: {run_id, summary}
    MCP-->>Claude: {run_id, summary}

    Note over Claude,MCP: Phase 2 — LLM / Multi‑agent discovery
    Claude->>MCP: catalog.list_objects(run_id)
    MCP->>Catalog: list_objects(run_id)
    Catalog-->>MCP: object metadata
    MCP-->>Claude: objects
    loop multi-agent interactions
      Claude->>MCP: llm.summary_upsert / agent.event_append / catalog.get_object
      MCP->>Catalog: upsert_llm_summary / append_agent_event / get_object
      Catalog-->>MCP: ack / object
      MCP->>RAG: rag.search_* (optional retrieval)
      RAG-->>MCP: chunks/docs
    end
    Claude->>MCP: agent.run_finish(status)
    MCP->>Catalog: finish_agent_run()
    Catalog-->>MCP: ack
    MCP-->>Claude: completion
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Poem

🐰 I nibbled schemas by moonlit code,

I chased the rows down every road,
Six agents hummed, we stitched and wrote,
Catalogs listened, seeds took note,
Now discovery blooms where I once strode. 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.81%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly references the main change (Multi-Agent Database Discovery v1.3 with Performance & Statistical Analysis improvements) and accurately summarizes the primary additions to the codebase. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing specific improvements (Performance Baseline Measurement, Statistical Significance Testing, Cross-Domain Question Synthesis) with examples and context. |



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the multi-agent database discovery system by introducing performance baselining and statistical significance testing. The refactoring of the headless execution scripts to externalize the prompt and improve error handling is a major step forward in maintainability and usability. The new prompt is exceptionally detailed and well-engineered. My review focuses on minor inconsistencies in the documentation and opportunities to make the agent's instructions even more robust to prevent potential issues during execution.

@@ -0,0 +1,434 @@
# Database Discovery System Prompt


high

This reference document appears to be outdated and inconsistent with the new architecture described in the pull request. It details a 4-agent system, while the main prompt and README now describe a 6-agent system with a 5-round protocol. This discrepancy is highly confusing for anyone trying to understand the system.

The file should be updated to reflect the current 6-agent architecture or be clearly marked as documentation for a previous version to avoid confusion.

| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |


medium

The default timeout value in the command-line options table is outdated. It's listed as 300 seconds, but the scripts have been updated to use a default of 3600 seconds (1 hour) to accommodate the more complex multi-agent discovery process. The documentation should be updated to reflect the new default.

Suggested change
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--timeout` | `-t` | Timeout in seconds | 3600 |

Comment on lines +468 to +479
```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 4-round protocol but prioritize these areas.
"""
    return prompt
```


medium

The example for customizing the discovery prompt is inconsistent with the new architecture. It refers to a "4-agent discovery protocol," whereas the system now uses 6 agents. Additionally, the function signature for build_discovery_prompt is missing the schema parameter. This could confuse users trying to customize the script.

Suggested change
```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues
Follow the standard 4-round protocol but prioritize these areas.
"""
    return prompt
```

```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str], schema: Optional[str]) -> str:
    # This example shows how to create a custom, focused prompt.
    # Note: The actual script reads a much more detailed prompt from an external file.
    prompt = f"""Using the 6-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues
Follow the standard 5-round protocol but prioritize these areas.
"""
    return prompt
```


1. **Primary Key Lookup**
```sql
SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
```


medium

The query for the Primary Key Lookup test uses MAX({pk_column}) to find a value to test. This assumes the primary key is of a type that supports the MAX() aggregate function, such as an integer. This will fail for other common primary key types like UUIDs. A more robust approach is to select an arbitrary key, for instance, by using LIMIT 1.

Suggested change
SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
SELECT * FROM {table} WHERE {pk_column} = (SELECT {pk_column} FROM {table} LIMIT 1);

Comment on lines +297 to +300
3. **Index Range Scan (if applicable)**
```sql
SELECT * FROM {table} WHERE {indexed_column} BETWEEN {min} AND {max} LIMIT 1000;
```


medium

The instruction for the 'Index Range Scan' test provides a query template but doesn't guide the agent on how to determine appropriate values for {min} and {max}. This ambiguity could lead to ineffective or failing tests if the agent chooses a poor range.

I recommend adding guidance for the agent. For example, you could add a note before the SQL block:

- Note: To determine appropriate `{min}` and `{max}` values, first profile the `{indexed_column}` to find a representative range.

Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore`:
- Line 1: The .gitignore entry uses a leading slash (/discovery_*.md) which
limits matches to the current directory; change the pattern to discovery_*.md
(or use **/discovery_*.md if you want to match in subdirectories) to allow
ignoring discovery files produced in nested paths; update the entry that
currently contains "/discovery_*.md" accordingly.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`:
- Around line 5-7: The module-level docstring in headless_db_discovery.py still
lists 4 agents (STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY); update that
docstring to reflect the current 6-agent workflow by adding the two missing
agents (name them exactly as used in the codebase) so the description matches
the implementation; search for the module docstring at the top of
headless_db_discovery.py and edit the agent list text to include all six agent
names used by the discovery workflow.
- Around line 131-155: The call to get_discovery_prompt_path() is executed
before the try/except that handles missing prompt files, so a missing prompt
will raise FileNotFoundError outside the handler; fix by moving the
get_discovery_prompt_path() invocation (and the log_verbose that references it)
into the same try block that calls build_discovery_prompt(), or wrap the
get_discovery_prompt_path() call in its own try/except to catch
FileNotFoundError and call process-safe fallback or exit; update the references
in log_verbose and build_discovery_prompt() usage accordingly so all
FileNotFoundError paths are handled in one place.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`:
- Around line 7-9: Update the usage/help header: change the phrase "4
collaborating agents" to "6 collaborating agents" and edit the agent list that
currently reads "STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY" to include the
two missing agent names (or explicitly say "STRUCTURAL, STATISTICAL, SEMANTIC,
QUERY, <AGENT5>, and <AGENT6>" if you have the names) so the comment accurately
reflects the 6‑agent system in headless_db_discovery.sh.
- Around line 175-204: MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.
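
The word-splitting behavior this comment describes can be demonstrated in isolation (the path below is hypothetical): a flat string splits on whitespace when expanded unquoted, while a quoted array expansion preserves each element intact.

```shell
# String form: unquoted expansion word-splits "--mcp-config /tmp/my config.json"
# into three words instead of a flag plus one path.
MCP_ARGS_STR='--mcp-config /tmp/my config.json'
set -- $MCP_ARGS_STR
str_count=$#

# Array form: quoted "${MCP_ARGS[@]}" expansion yields exactly two arguments,
# keeping the space inside the path.
MCP_ARGS=(--mcp-config "/tmp/my config.json")
set -- "${MCP_ARGS[@]}"
arr_count=$#

echo "string: $str_count args, array: $arr_count args"
```

This is why the fix appends `"${MCP_ARGS[@]}"` to `CLAUDE_ARGS` rather than an unquoted `$MCP_ARGS` string.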

In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md`:
- Around line 3-5: Update the mission header that currently says "ONLY MCP
server tools (`mcp__proxysql-stdio__*`)" to clarify that the MCP-only
restriction applies to database access and discovery steps only, while
permitting the use of the Write tool in Round 4 for producing the final report;
reference the exact phrase "ONLY MCP server tools (`mcp__proxysql-stdio__*`)"
and "Round 4" so the change is applied there and ensure the new wording
explicitly states that all database interactions must use MCP tools but output
generation may use Write.
- Around line 845-856: Add explicit language identifiers to the untyped Markdown
code fences around the catalog examples: update the three fences that surround
the calls to catalog_upsert, catalog_search, and catalog_get so they specify a
language (e.g., ```python) instead of plain ```, ensuring each opening fence
before catalog_upsert(kind="agent_type"...), before catalog_search(...), and
before catalog_get(...) is changed to include the identifier and the
corresponding closing fence remains ```; no code changes required besides the
fence markers.

In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`:
- Around line 5-11: The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.
- Around line 101-109: Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md`:
- Around line 519-530: The directory-structure code fence in README.md is
missing a language tag (causing MD040); update the opening fence for the tree
block from ``` to ```text so the block becomes a fenced code block with a
language tag (e.g., change the README.md directory-structure code fence that
surrounds the "ClaudeCode_Headless/" tree to start with ```text).
- Around line 392-398: Update the README table entry for the `--timeout` / `-t`
option to reflect the actual default of 3600 seconds: change the "Default"
column value for the `--timeout` row from 300 to 3600 so the documented default
matches the script's runtime behavior.
- Around line 471-477: Update the example prompt string (variable "prompt") so
it references the 6-agent, 5-round discovery protocol rather than "4-agent" and
"4-round"; modify the body to instruct focus on security, performance, and data
quality while explicitly stating the 6-agent/5-round structure and any change in
sequencing (e.g., "Follow the standard 6-agent, 5-round protocol and prioritize
these areas") and adjust numbering or phrasing inside the triple-quoted string
to match the new protocol description.

Comment on lines 5 to +7
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery. It works with any database
type that is accessible via MCP (Model Context Protocol).
comprehensive database discovery using 4 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.

⚠️ Potential issue | 🟡 Minor

Update the docstring agent list to match the 6‑agent workflow.

Proposed fix
-comprehensive database discovery using 4 collaborating agents:
-STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+comprehensive database discovery using 6 collaborating agents:
+STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery. It works with any database
type that is accessible via MCP (Model Context Protocol).
comprehensive database discovery using 4 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery using 6 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`
around lines 5 - 7, The module-level docstring in headless_db_discovery.py still
lists 4 agents (STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY); update that
docstring to reflect the current 6-agent workflow by adding the two missing
agents (name them exactly as used in the codebase) so the description matches
the implementation; search for the module docstring at the top of
headless_db_discovery.py and edit the agent list text to include all six agent
names used by the discovery workflow.

Comment on lines +131 to +155
log_info("Starting Multi-Agent Database Discovery")
log_info(f"Output will be saved to: {output_file}")
log_verbose(f"Claude Code executable: {claude_cmd}", args.verbose)

# Build MCP configuration
mcp_config_file, _ = build_mcp_config(args)
if mcp_config_file:
log_verbose(f"Using MCP configuration: {mcp_config_file}", args.verbose)
log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)

# Build command arguments
cmd_args = [
claude_cmd,
'--print', # Non-interactive mode
'--no-session-persistence', # Don't save session
'--permission-mode', 'bypassPermissions', # Bypass permission checks in headless mode
'--permission-mode', 'bypassPermissions', # Bypass permission checks
]

# Add MCP configuration if available
if mcp_config_file:
cmd_args.extend(['--mcp-config', mcp_config_file])
# Add MCP configuration if provided
if args.mcp_config:
cmd_args.extend(['--mcp-config', args.mcp_config])
log_verbose(f"Using MCP config: {args.mcp_config}", args.verbose)
elif args.mcp_file:
cmd_args.extend(['--mcp-config', args.mcp_file])
log_verbose(f"Using MCP config file: {args.mcp_file}", args.verbose)

# Build discovery prompt
prompt = build_discovery_prompt(args.database, args.schema)
try:
prompt = build_discovery_prompt(args.database, args.schema)
except FileNotFoundError as e:

⚠️ Potential issue | 🟡 Minor

Avoid uncaught FileNotFoundError from the log line.
get_discovery_prompt_path() is called before the try/except, so a missing prompt file bypasses the intended error handling.

Proposed fix
-    log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)
+    # Build discovery prompt
+    try:
+        prompt_path = get_discovery_prompt_path()
+        log_verbose(f"Using discovery prompt: {prompt_path}", args.verbose)
+        prompt = build_discovery_prompt(args.database, args.schema)
+    except FileNotFoundError as e:
+        log_error(str(e))
+        sys.exit(1)
@@
-    # Build discovery prompt
-    try:
-        prompt = build_discovery_prompt(args.database, args.schema)
-    except FileNotFoundError as e:
-        log_error(str(e))
-        sys.exit(1)
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`
around lines 131 - 155, The call to get_discovery_prompt_path() is executed
before the try/except that handles missing prompt files, so a missing prompt
will raise FileNotFoundError outside the handler; fix by moving the
get_discovery_prompt_path() invocation (and the log_verbose that references it)
into the same try block that calls build_discovery_prompt(), or wrap the
get_discovery_prompt_path() call in its own try/except to catch
FileNotFoundError and either fall back safely or exit; update the references
in log_verbose and build_discovery_prompt() usage accordingly so all
FileNotFoundError paths are handled in one place.

Comment on lines 7 to +9
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery. It works with any database
# type that is accessible via MCP (Model Context Protocol).
# comprehensive database discovery using 4 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.

⚠️ Potential issue | 🟡 Minor

Update the help header to reflect the 6‑agent system.
The usage text still says “4 collaborating agents”.

Proposed fix
-# comprehensive database discovery using 4 collaborating agents:
-# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+# comprehensive database discovery using 6 collaborating agents:
+# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
📝 Committable suggestion


Suggested change
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery. It works with any database
# type that is accessible via MCP (Model Context Protocol).
# comprehensive database discovery using 4 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery using 6 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`
around lines 7 - 9, Update the usage/help header: change the phrase "4
collaborating agents" to "6 collaborating agents" and edit the agent list that
currently reads "STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY" to include the
two missing agent names (or explicitly say "STRUCTURAL, STATISTICAL, SEMANTIC,
QUERY, <AGENT5>, and <AGENT6>" if you have the names) so the comment accurately
reflects the 6‑agent system in headless_db_discovery.sh.

Comment on lines +175 to 204
# Build MCP args
MCP_ARGS=""
if [ -n "$MCP_CONFIG" ]; then
# Write inline config to temp file
MCP_CONFIG_FILE=$(mktemp)
echo "$MCP_CONFIG" > "$MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG"
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_CONFIG_FILE="$MCP_FILE"
MCP_ARGS="--mcp-config $MCP_FILE"
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
elif [ -n "$PROXYSQL_MCP_ENDPOINT" ]; then
# Build MCP config for ProxySQL and write to temp file
MCP_CONFIG_FILE=$(mktemp)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
BRIDGE_PATH="$SCRIPT_DIR/../mcp/proxysql_mcp_stdio_bridge.py"

# Build the JSON config
cat > "$MCP_CONFIG_FILE" << MCPJSONEOF
{
"mcpServers": {
"proxysql": {
"command": "python3",
"args": ["$BRIDGE_PATH"],
"env": {
"PROXYSQL_MCP_ENDPOINT": "$PROXYSQL_MCP_ENDPOINT"
MCPJSONEOF

if [ -n "$PROXYSQL_MCP_TOKEN" ]; then
echo ", \"PROXYSQL_MCP_TOKEN\": \"$PROXYSQL_MCP_TOKEN\"" >> "$MCP_CONFIG_FILE"
fi

if [ "$PROXYSQL_MCP_INSECURE_SSL" = "1" ]; then
echo ", \"PROXYSQL_MCP_INSECURE_SSL\": \"1\"" >> "$MCP_CONFIG_FILE"
fi

cat >> "$MCP_CONFIG_FILE" << 'MCPJSONEOF2'
}
}
}
}
MCPJSONEOF2

MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
log_verbose "Using ProxySQL MCP endpoint: $PROXYSQL_MCP_ENDPOINT"
log_verbose "MCP config written to: $MCP_CONFIG_FILE"
else
log_verbose "No explicit MCP configuration, using available MCP servers"
fi

# Build the discovery prompt
DATABASE_ARG=""
if [ -n "$DATABASE_NAME" ]; then
DATABASE_ARG="database named '$DATABASE_NAME'"
else
DATABASE_ARG="the first available database"
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"

SCHEMA_ARG=""
if [ -n "$SCHEMA_NAME" ]; then
SCHEMA_ARG="the schema '$SCHEMA_NAME' within"
fi
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)

DISCOVERY_PROMPT="You are a Database Discovery Agent. Your mission is to perform comprehensive analysis of $DATABASE_ARG.

${SCHEMA_ARG:+Focus on $SCHEMA_ARG}

Use the available MCP database tools to discover and document:

## 1. STRUCTURAL ANALYSIS
- List all tables in the database/schema
- For each table, describe:
- Column names, data types, and nullability
- Primary keys and unique constraints
- Foreign key relationships
- Indexes and their purposes
- Any CHECK constraints or defaults

- Create an Entity Relationship Diagram (ERD) showing:
- All tables and their relationships
- Cardinality (1:1, 1:N, M:N)
- Primary and foreign keys

## 2. DATA PROFILING
- For each table, analyze:
- Row count
- Data distributions for key columns
- Null value percentages
- Distinct value counts (cardinality)
- Min/max/average values for numeric columns
- Sample data (first few rows)

- Identify patterns and anomalies:
- Duplicate records
- Data quality issues
- Unexpected distributions
- Outliers

## 3. SEMANTIC ANALYSIS
- Infer the business domain:
- What type of application/database is this?
- What are the main business entities?
- What are the business processes?

- Document business rules:
- Entity lifecycles and state machines
- Validation rules implied by constraints
- Relationship patterns

- Classify tables:
- Master/reference data (customers, products, etc.)
- Transactional data (orders, transactions, etc.)
- Junction/association tables
- Configuration/metadata

## 4. PERFORMANCE & ACCESS PATTERNS
- Identify:
- Missing indexes on foreign keys
- Missing indexes on frequently filtered columns
- Composite index opportunities
- Potential N+1 query patterns

- Suggest optimizations:
- Indexes that should be added
- Query patterns that would benefit from optimization
- Denormalization opportunities

## OUTPUT FORMAT

Provide your findings as a comprehensive Markdown report with:

1. **Executive Summary** - High-level overview
2. **Database Schema** - Complete table definitions
3. **Entity Relationship Diagram** - ASCII ERD
4. **Data Quality Assessment** - Score (1-100) with issues
5. **Business Domain Analysis** - Industry, use cases, entities
6. **Performance Recommendations** - Prioritized optimization list
7. **Anomalies & Issues** - All problems found with severity

Be thorough. Discover everything about this database structure and data.
Write the complete report to standard output."

# Log the command being executed (without showing the full prompt for clarity)
log_info "Running Claude Code in headless mode..."
log_verbose "Timeout: ${TIMEOUT}s"
if [ -n "$DATABASE_NAME" ]; then
log_verbose "Target database: $DATABASE_NAME"
fi
if [ -n "$SCHEMA_NAME" ]; then
log_verbose "Target schema: $SCHEMA_NAME"
# Add MCP configuration if available
if [ -n "$MCP_ARGS" ]; then
CLAUDE_ARGS+=($MCP_ARGS)
fi

⚠️ Potential issue | 🟠 Major

Prevent MCP config JSON from breaking argument parsing.
MCP_ARGS is a string that gets expanded unquoted into the args array, so JSON or paths with spaces will split into multiple arguments.

Proposed fix (use an array and proper quoting)
-# Build MCP args
-MCP_ARGS=""
-if [ -n "$MCP_CONFIG" ]; then
-    MCP_ARGS="--mcp-config $MCP_CONFIG"
-    log_verbose "Using inline MCP configuration"
-elif [ -n "$MCP_FILE" ]; then
-    if [ -f "$MCP_FILE" ]; then
-        MCP_ARGS="--mcp-config $MCP_FILE"
-        log_verbose "Using MCP configuration from: $MCP_FILE"
-    else
-        log_error "MCP configuration file not found: $MCP_FILE"
-        exit 1
-    fi
-fi
+MCP_ARGS=()
+if [ -n "$MCP_CONFIG" ]; then
+    MCP_ARGS+=(--mcp-config "$MCP_CONFIG")
+    log_verbose "Using inline MCP configuration"
+elif [ -n "$MCP_FILE" ]; then
+    if [ -f "$MCP_FILE" ]; then
+        MCP_ARGS+=(--mcp-config "$MCP_FILE")
+        log_verbose "Using MCP configuration from: $MCP_FILE"
+    else
+        log_error "MCP configuration file not found: $MCP_FILE"
+        exit 1
+    fi
+fi
@@
-# Add MCP configuration if available
-if [ -n "$MCP_ARGS" ]; then
-    CLAUDE_ARGS+=($MCP_ARGS)
-fi
+# Add MCP configuration if available
+if [ "${#MCP_ARGS[@]}" -gt 0 ]; then
+    CLAUDE_ARGS+=("${MCP_ARGS[@]}")
+fi
📝 Committable suggestion


Suggested change
# Build MCP args
MCP_ARGS=""
if [ -n "$MCP_CONFIG" ]; then
# Write inline config to temp file
MCP_CONFIG_FILE=$(mktemp)
echo "$MCP_CONFIG" > "$MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG"
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_CONFIG_FILE="$MCP_FILE"
MCP_ARGS="--mcp-config $MCP_FILE"
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
elif [ -n "$PROXYSQL_MCP_ENDPOINT" ]; then
# Build MCP config for ProxySQL and write to temp file
MCP_CONFIG_FILE=$(mktemp)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
BRIDGE_PATH="$SCRIPT_DIR/../mcp/proxysql_mcp_stdio_bridge.py"
# Build the JSON config
cat > "$MCP_CONFIG_FILE" << MCPJSONEOF
{
"mcpServers": {
"proxysql": {
"command": "python3",
"args": ["$BRIDGE_PATH"],
"env": {
"PROXYSQL_MCP_ENDPOINT": "$PROXYSQL_MCP_ENDPOINT"
MCPJSONEOF
if [ -n "$PROXYSQL_MCP_TOKEN" ]; then
echo ", \"PROXYSQL_MCP_TOKEN\": \"$PROXYSQL_MCP_TOKEN\"" >> "$MCP_CONFIG_FILE"
fi
if [ "$PROXYSQL_MCP_INSECURE_SSL" = "1" ]; then
echo ", \"PROXYSQL_MCP_INSECURE_SSL\": \"1\"" >> "$MCP_CONFIG_FILE"
fi
cat >> "$MCP_CONFIG_FILE" << 'MCPJSONEOF2'
}
}
}
}
MCPJSONEOF2
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
log_verbose "Using ProxySQL MCP endpoint: $PROXYSQL_MCP_ENDPOINT"
log_verbose "MCP config written to: $MCP_CONFIG_FILE"
else
log_verbose "No explicit MCP configuration, using available MCP servers"
fi
# Build the discovery prompt
DATABASE_ARG=""
if [ -n "$DATABASE_NAME" ]; then
DATABASE_ARG="database named '$DATABASE_NAME'"
else
DATABASE_ARG="the first available database"
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"
SCHEMA_ARG=""
if [ -n "$SCHEMA_NAME" ]; then
SCHEMA_ARG="the schema '$SCHEMA_NAME' within"
fi
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)
DISCOVERY_PROMPT="You are a Database Discovery Agent. Your mission is to perform comprehensive analysis of $DATABASE_ARG.
${SCHEMA_ARG:+Focus on $SCHEMA_ARG}
Use the available MCP database tools to discover and document:
## 1. STRUCTURAL ANALYSIS
- List all tables in the database/schema
- For each table, describe:
- Column names, data types, and nullability
- Primary keys and unique constraints
- Foreign key relationships
- Indexes and their purposes
- Any CHECK constraints or defaults
- Create an Entity Relationship Diagram (ERD) showing:
- All tables and their relationships
- Cardinality (1:1, 1:N, M:N)
- Primary and foreign keys
## 2. DATA PROFILING
- For each table, analyze:
- Row count
- Data distributions for key columns
- Null value percentages
- Distinct value counts (cardinality)
- Min/max/average values for numeric columns
- Sample data (first few rows)
- Identify patterns and anomalies:
- Duplicate records
- Data quality issues
- Unexpected distributions
- Outliers
## 3. SEMANTIC ANALYSIS
- Infer the business domain:
- What type of application/database is this?
- What are the main business entities?
- What are the business processes?
- Document business rules:
- Entity lifecycles and state machines
- Validation rules implied by constraints
- Relationship patterns
- Classify tables:
- Master/reference data (customers, products, etc.)
- Transactional data (orders, transactions, etc.)
- Junction/association tables
- Configuration/metadata
## 4. PERFORMANCE & ACCESS PATTERNS
- Identify:
- Missing indexes on foreign keys
- Missing indexes on frequently filtered columns
- Composite index opportunities
- Potential N+1 query patterns
- Suggest optimizations:
- Indexes that should be added
- Query patterns that would benefit from optimization
- Denormalization opportunities
## OUTPUT FORMAT
Provide your findings as a comprehensive Markdown report with:
1. **Executive Summary** - High-level overview
2. **Database Schema** - Complete table definitions
3. **Entity Relationship Diagram** - ASCII ERD
4. **Data Quality Assessment** - Score (1-100) with issues
5. **Business Domain Analysis** - Industry, use cases, entities
6. **Performance Recommendations** - Prioritized optimization list
7. **Anomalies & Issues** - All problems found with severity
Be thorough. Discover everything about this database structure and data.
Write the complete report to standard output."
# Log the command being executed (without showing the full prompt for clarity)
log_info "Running Claude Code in headless mode..."
log_verbose "Timeout: ${TIMEOUT}s"
if [ -n "$DATABASE_NAME" ]; then
log_verbose "Target database: $DATABASE_NAME"
fi
if [ -n "$SCHEMA_NAME" ]; then
log_verbose "Target schema: $SCHEMA_NAME"
# Add MCP configuration if available
if [ -n "$MCP_ARGS" ]; then
CLAUDE_ARGS+=($MCP_ARGS)
fi
# Build MCP args
MCP_ARGS=()
if [ -n "$MCP_CONFIG" ]; then
MCP_ARGS+=(--mcp-config "$MCP_CONFIG")
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_ARGS+=(--mcp-config "$MCP_FILE")
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)
# Add MCP configuration if available
if [ "${#MCP_ARGS[@]}" -gt 0 ]; then
CLAUDE_ARGS+=("${MCP_ARGS[@]}")
fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 203-203: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)

🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`
around lines 175 - 204, MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.

Comment on lines +5 to +11
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.

**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output

⚠️ Potential issue | 🟡 Minor

Clarify that this reference prompt is the 4‑agent/4‑round variant (or update it to the 6‑agent workflow).
Right now it reads like the canonical prompt while the tooling uses the 6‑agent/5‑round protocol, which can mislead readers.

Suggested clarification
-You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+> **Note:** This reference describes the legacy 4‑agent / 4‑round workflow. The active v1.3 prompt uses 6 agents and 5 rounds (see `prompts/multi_agent_discovery_prompt.md`).
📝 Committable suggestion


Suggested change
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
> **Note:** This reference describes the legacy 4‑agent / 4‑round workflow. The active v1.3 prompt uses 6 agents and 5 rounds (see `prompts/multi_agent_discovery_prompt.md`).
**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output
🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 5 - 11, The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.

Comment on lines +101 to +109
**Writing Findings:**
```python
catalog_upsert(
kind="structural|statistical|semantic|query|collaborative|validation|final_report",
key="specific_identifier",
document="detailed_findings_markdown",
tags="optional_tags"
)
```

⚠️ Potential issue | 🟠 Major

Fix catalog kind names in the example to match the round-specific kinds used later.
The snippet advertises collaborative/validation, while the protocol uses collaborative_round2/validation_round3. This mismatch can break catalog lookups.

Proposed fix
-catalog_upsert(
-    kind="structural|statistical|semantic|query|collaborative|validation|final_report",
+catalog_upsert(
+    kind="structural|statistical|semantic|query|collaborative_round2|validation_round3|final_report",
     key="specific_identifier",
     document="detailed_findings_markdown",
     tags="optional_tags"
 )
🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 101 - 109, Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.

Comment on lines +392 to +398
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--verbose` | `-v` | Enable verbose output | Disabled |

⚠️ Potential issue | 🟡 Minor

Update the documented default timeout to 3600s.
The scripts default to 3600 seconds, but the table still says 300.

Proposed fix
-| `--timeout` | `-t` | Timeout in seconds | 300 |
+| `--timeout` | `-t` | Timeout in seconds | 3600 |
📝 Committable suggestion


Suggested change
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--verbose` | `-v` | Enable verbose output | Disabled |
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 3600 |
| `--verbose` | `-v` | Enable verbose output | Disabled |
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 392-398: update the README table entry for the `--timeout` / `-t` option so the "Default" column reads 3600 instead of 300, matching the script's actual runtime default.

Comment on lines +471 to +477
prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 4-round protocol but prioritize these areas.
"""
⚠️ Potential issue | 🟡 Minor

Align the custom prompt example with the 6‑agent/5‑round protocol.
The example still references a 4‑agent, 4‑round flow.

Proposed fix
-    prompt = f"""Using the 4-agent discovery protocol, focus on:
+    prompt = f"""Using the 6-agent, 5-round discovery protocol, focus on:
     1. Security aspects of {database}
     2. Performance optimization opportunities
     3. Data quality issues
 
-    Follow the standard 4-round protocol but prioritize these areas.
+    Follow the standard 5-round protocol but prioritize these areas.
     """
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 471-477: update the example prompt string (variable "prompt") to reference the 6-agent, 5-round discovery protocol rather than "4-agent" and "4-round"; keep the focus on security, performance, and data quality, and change the closing sentence to "Follow the standard 6-agent, 5-round protocol and prioritize these areas."

Comment on lines +519 to +530
```
ClaudeCode_Headless/
├── README.md # This file
├── prompts/
│ ├── multi_agent_discovery_prompt.md # Concise system prompt
│ └── multi_agent_discovery_reference.md # Comprehensive reference
├── headless_db_discovery.py # Python script
├── headless_db_discovery.sh # Bash script
└── examples/
├── DATABASE_DISCOVERY_REPORT.md # Example output
└── DATABASE_QUESTION_CAPABILITIES.md # Feature documentation
```
⚠️ Potential issue | 🟡 Minor

Add a language tag to the directory-structure code fence to satisfy MD040.

Proposed fix
-```
+```text
 ClaudeCode_Headless/
 ├── README.md                           # This file
 ├── prompts/
 │   ├── multi_agent_discovery_prompt.md      # Concise system prompt
 │   └── multi_agent_discovery_reference.md   # Comprehensive reference
 ├── headless_db_discovery.py            # Python script
 ├── headless_db_discovery.sh            # Bash script
 └── examples/
     ├── DATABASE_DISCOVERY_REPORT.md        # Example output
     └── DATABASE_QUESTION_CAPABILITIES.md   # Feature documentation
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

519-519: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 519-530: the directory-structure code fence is missing a language tag (triggering MD040); change the opening fence of the "ClaudeCode_Headless/" tree block from ``` to ```text.

Added patterns to ignore all discovery output files:

Root .gitignore:
- discovery_*.md (main discovery reports at repo root)
- database_discovery_report.md

scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore:
- *_QUESTION_CATALOG.md (individual agent question catalogs)
- *_round1_*.md, *_round2_*.md, *_round3_*.md, *_round4_*.md (round outputs)
- *_COORDINATOR_SUMMARY.md (coordinator summaries)
- *_HYPOTHESIS_TESTING.md (hypothesis testing results)
- *_INDEX.md (index files)
- *_QUICK_REFERENCE.md (quick reference docs)
- META_ANALYSIS_*.md (META agent outputs)
- SECURITY_AGENT_*.txt (security agent text outputs)
- query_agent_*.md, security_agent_*.md, semantic_*.md, statistical_*.md

These files are generated when agents incorrectly use Write tool instead of
catalog_upsert (the issue fixed in v1.2). They should not be committed to
the repository as they are test outputs and are stored in the MCP catalog.
…aint

This commit addresses two issues:

1. MCP Catalog Schema Separation:
   - Add 'schema' column to catalog table for proper isolation
   - Update all catalog methods (upsert, get, search, list, remove) to accept schema parameter
   - Update MCP tool handlers and JSON-RPC parameter parsing
   - Unique constraint changed from (kind, key) to (schema, kind, key)
   - FTS table updated to include schema column

2. Discovery Prompt Scope Constraint:
   - Add explicit SCOPE CONSTRAINT section to multi_agent_discovery_prompt.md
   - Agents now respect Target Schema and skip list_schemas when specified
   - Prevents analyzing all schemas when only one is targeted

Files modified:
- include/MySQL_Catalog.h: Add schema parameter to all catalog methods
- include/MySQL_Tool_Handler.h: Update wrapper method signatures
- lib/MySQL_Catalog.cpp: Implement schema filtering in all operations
- lib/MySQL_Tool_Handler.cpp: Update wrapper implementations
- lib/Query_Tool_Handler.cpp: Extract schema from JSON-RPC params, update tool descriptions
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md: Add scope constraint
Phase 1 (Static/Deterministic):
- Add Discovery_Schema: SQLite catalog with deterministic and LLM tables
- Add Static_Harvester: MySQL INFORMATION_SCHEMA metadata extraction
- Harvest schemas, objects, columns, indexes, foreign keys, view definitions
- Compute derived hints: is_time, is_id_like, has_pk, has_fks, has_time
- Build quick profiles and FTS5 indexes

Phase 2 (LLM Agent):
- Add 19 new MCP tools for two-phase discovery
- discovery.run_static: Trigger ProxySQL's static harvest
- Catalog tools: init, search, get_object, list_objects, get_relationships
- Agent tools: run_start, run_finish, event_append
- LLM tools: summary_upsert, relationship_upsert, domain_upsert, etc.

Files:
- include/Discovery_Schema.h, lib/Discovery_Schema.cpp
- include/Static_Harvester.h, lib/Static_Harvester.cpp
- include/Query_Tool_Handler.h, lib/Query_Tool_Handler.cpp (updated)
- lib/Makefile (updated)
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py
Add missing --mcp-config argument which is required by the script.
The dry-run example now correctly shows all required parameters.
- Add mcp_config.example.json for Claude Code MCP configuration
- Fix MCP bridge path in example config (../../proxysql_mcp_stdio_bridge.py)
- Update Two_Phase_Discovery_Implementation.md with correct Phase 1/Phase 2 usage
- Fix Two_Phase_Discovery_Implementation.md DELETE FROM fts_objects to scope to run_id
- Update README.md with two-phase discovery section and multi-agent legacy note
- Create static_harvest.sh bash wrapper for Phase 1
- Create two_phase_discovery.py orchestration script with prompts
- Add --run-id parameter to skip auto-fetch
- Fix RUN_ID placeholder mismatch (<USE_THE_PROVIDED_RUN_ID>)
- Fix catalog path default to mcp_catalog.db
- Add test_catalog.sh to verify catalog tools work
- Fix Discovery_Schema.cpp FTS5 syntax (missing space)
- Remove invalid CREATE INDEX on FTS virtual tables
- Add MCP tool call logging to track tool usage
- Fix Static_Harvester::get_harvest_stats() to accept run_id parameter
- Fix DELETE FROM fts_objects to only delete for specific run_id
- Update system prompts to say DO NOT call discovery.run_static
- Update user prompts to say Phase 1 is already complete
- Add --mcp-only flag to restrict Claude Code to MCP tools only
- Make FTS table failures non-fatal (check if table exists first)
- Add comprehensive documentation for both discovery approaches
- Add try-catch around handle_jsonrpc_request to catch unexpected exceptions
- Add detailed logging for tool execution success/failure
- Add proper SQLite error checking in create_agent_run with error messages
- Fix json_int/json_double to handle both numbers and numeric strings

The json_int function was throwing exceptions when receiving numeric
strings (e.g., "14" instead of 14) from clients, causing 500 errors.
Now it handles both formats gracefully.

Also added logging so tool failures are visible in logs instead of
being silent 500 errors.
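The lenient parsing described above can be sketched in isolation. This is an illustrative stand-in, not ProxySQL's actual `json_int` (the variant type and function name are hypothetical): it shows how a handler can accept both a native number and a numeric string without throwing.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical stand-in for a JSON scalar as it might arrive from a client:
// either a native integer, a double, or a string such as "14".
using JsonScalar = std::variant<std::int64_t, double, std::string>;

// Normalize to int64_t; return `def` instead of throwing on bad input,
// so a malformed parameter becomes a handled default rather than a 500.
std::int64_t json_int_lenient(const JsonScalar& v, std::int64_t def = 0) {
    if (std::holds_alternative<std::int64_t>(v))
        return std::get<std::int64_t>(v);
    if (std::holds_alternative<double>(v))
        return static_cast<std::int64_t>(std::get<double>(v));
    const std::string& s = std::get<std::string>(v);
    try {
        size_t pos = 0;
        std::int64_t n = std::stoll(s, &pos);
        return pos == s.size() ? n : def;  // reject trailing junk like "14x"
    } catch (const std::exception&) {
        return def;
    }
}
```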
Exception handlers now log the full request payload that caused the error,
making debugging much easier.

Changes:
- Move req_body/req_path declarations outside try block so catch handlers can access them
- Log request payload in all exception handlers (parse errors, std::exception, and catch-all)
- Log tool arguments when tool execution fails

Previously, exceptions would only log the error message without context,
making it impossible to reproduce the issue. Now the full payload is logged.
1. Fix error logging to catch ALL tool failures, not just those with
   both success and result fields. Previously, error responses like
   {"success": false, "error": "..."} without a result field were
   silently ignored.

2. Fix llm.domain_set_members to accept both array and JSON string
   formats for the members parameter. Some clients send it as a
   JSON string, others as a native array.

3. Add detailed error logging for llm.domain_set_members failures,
   including what was actually received.
- Add resolve_run_id() to map schema names to latest run_id
- Update all catalog and LLM tools to accept schema names
- Deprecate describe_table, table_profile, column_profile
- Deprecate get_constraints, suggest_joins, find_reference_candidates
- Keep sample_rows, sample_distinct for data preview
Tool was deprecated; users should use catalog.get_object instead.
- Query schemas from catalog's schemas table
- Maintains same output format for compatibility
- Removes dependency on live MySQL connection
Add stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
tables to track MCP query tool usage statistics.

- Added get_tool_usage_stats_resultset() method to Query_Tool_Handler
- Defined table schemas in ProxySQL_Admin_Tables_Definitions.h
- Registered tables in Admin_Bootstrap.cpp
- Added pattern matching in ProxySQL_Admin.cpp
- Added stats___mcp_query_tools_counters() in ProxySQL_Admin_Stats.cpp
- Fixed friend declaration for track_tool_invocation()
- Fixed Discovery_Schema.cpp log_llm_search() to use prepare_v2/finalize
Remove the mcp-catalog_path configuration variable and hardcode the catalog
database path to datadir/mcp_catalog.db for stability.

Rationale: The catalog database is session state, not user configuration.
Runtime swapping of the catalog could cause tables to be missed and the
catalog to fail even if it was succeeding a second earlier.

Changes:
- Removed catalog_path from mcp_thread_variables_names array
- Removed mcp_catalog_path from MCP_Thread variables struct
- Removed getter/setter logic for catalog_path
- Hardcoded catalog path to GloVars.datadir/mcp_catalog.db in:
  - ProxySQL_MCP_Server.cpp (Query_Tool_Handler initialization)
  - Admin_FlushVariables.cpp (MySQL_Tool_Handler reinitialization)
- Updated VARIABLES.md to document the hardcoded path
- Updated configure_mcp.sh to remove catalog_path configuration
- Updated MCP README to remove catalog_path references
The MCP catalog database is now accessible as the 'mcp_catalog' schema
from the ProxySQL Admin interface, enabling direct SQL queries against
discovered schemas and LLM memories.
Extend the stats_mcp_query_tools_counters table with timing statistics
(first_seen, last_seen, sum_time, min_time, max_time) following the
same pattern as stats_mysql_query_digest.

All timing values are in microseconds using monotonic_time().

New schema:
- tool VARCHAR
- schema VARCHAR
- count INT
- first_seen INTEGER (microseconds)
- last_seen INTEGER (microseconds)
- sum_time INTEGER (microseconds - total execution time)
- min_time INTEGER (microseconds - minimum execution time)
- max_time INTEGER (microseconds - maximum execution time)
Previously re_modifiers was hardcoded to 1 (CASELESS), ignoring the
value stored in the database. Now properly reads from row->fields[7].
- Fully implemented rag.search_hybrid tool with both fuse and fts_then_vec modes
- Added complete filter support across all search tools (source_ids, source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after, created_before, min_score)
- Implemented proper score normalization (higher is better) for all search modes
- Updated all tool schemas to match blueprint specifications exactly
- Added metadata inclusion in search results
- Implemented Reciprocal Rank Fusion (RRF) scoring for hybrid search
- Enhanced error handling and input validation
- Added debug information for hybrid search ranking
- Updated documentation and created completion summary

This completes the v0 RAG implementation according to the blueprint requirements.
- Enhanced inline Doxygen comments in RAG_Tool_Handler.h and RAG_Tool_Handler.cpp
- Added detailed parameter descriptions, return values, and cross-references
- Created Doxyfile for documentation generation
- Added documentation summary and guidelines
- Documented all RAG tools with their schemas and usage patterns
- Added security and performance considerations documentation

The RAG subsystem is now fully documented with comprehensive Doxygen comments
that provide clear guidance for developers working with the codebase.
Add detailed function-level documentation to all MCP query rules,
query digest, static harvester, and catalog components.

Static_Harvester.cpp:
- Document all 18+ harvest functions (schemas, objects, columns, indexes, FKs, views)
- Document lifecycle methods (init, close, connect, disconnect)
- Document helper methods (is_time_type, is_id_like_name)
- Document run management (start_run, finish_run, run_full_harvest)
- Document statistics methods (get_harvest_stats)

Query_Tool_Handler.cpp:
- Document JSON helper functions (json_string, json_int, json_double)
- Document digest tracking section with flow explanation

MySQL_Catalog.cpp:
- Document schema isolation architecture
- Document CRUD operations (upsert, get, search, list, remove, merge)

Discovery_Schema.cpp:
- Document MCP query rules evaluation (evaluate_mcp_query_rules)
- Document digest functions (compute_mcp_digest, fingerprint_mcp_args)
- Document update/get functions for rules and digests

ProxySQL_Admin_Stats.cpp:
- Document stats collection functions

ProxySQL_Admin.cpp:
- Document load/save functions for query rules

Admin_Handler.cpp:
- Document MCP query rules command handlers

include/ProxySQL_Admin_Tables_Definitions.h:
- Add comments explaining table purposes
@renecannao renecannao changed the title feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis [WIP] Jan 19, 2026
renecannao and others added 15 commits January 20, 2026 08:58
- Remove Doxyfile that was added by mistake
- Remove .doxygen files from doc/ directory that were generated by mistake
- Clean up documentation artifacts that shouldn't have been committed
- Address SQL injection vulnerabilities by adding input validation and escaping
- Fix configuration variable handling in get_variable and set_variable methods for RAG variables
- Make embedding dimension configurable for rag_vec_chunks table
- Remove code duplication in SQL filter building logic by creating consolidated build_sql_filters function
- Update all search tools (FTS, vector, hybrid) to use consolidated filter building
- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Revert changes to main test/Makefile
- Update documentation to reflect new test structure
Fixed 6 legitimate issues from AI code review:

Critical fixes:
- Implement stats___mcp_query_digest to populate table with actual data
- Fix double trigger bug (stats_mcp_query_digest_reset substring matching)

Important fixes:
- Fix re_modifiers parsing (VARCHAR "CASELESS" -> int 1)
- Fix TOCTOU race condition in reset path (use write lock from start)
- Add column count validation before accessing row fields

Documentation:
- Add memory ownership documentation for evaluate_mcp_query_rules

False positives ignored (Issue 2: Schema mismatch, Issue 6: SQL injection)
- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Update .gitignore to exclude test/rag/test_rag_schema executable
- Update documentation to reflect new test structure
Use proxy_sqlite3_* wrappers and canonical proxy symbol TU
Add comprehensive RAG subsystem with Doxygen documentation [WIP]
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/MCP_Thread.cpp (1)

1-10: Missing include for RAG_Tool_Handler.h.

The file includes headers for all other tool handlers (MySQL_Tool_Handler.h, Config_Tool_Handler.h, etc.) but doesn't include RAG_Tool_Handler.h. While the forward declaration in MCP_Thread.h allows the pointer member to compile, the destructor on lines 127-130 calls delete rag_tool_handler, which requires the complete type definition.

🔧 Proposed fix
 #include "MCP_Thread.h"
 #include "MySQL_Tool_Handler.h"
 #include "Config_Tool_Handler.h"
 #include "Query_Tool_Handler.h"
 #include "Admin_Tool_Handler.h"
 #include "Cache_Tool_Handler.h"
 #include "Observe_Tool_Handler.h"
+#include "RAG_Tool_Handler.h"
 #include "proxysql_debug.h"
 #include "ProxySQL_MCP_Server.hpp"
🤖 Fix all issues with AI agents
In `@lib/Admin_Bootstrap.cpp`:
- Around line 726-736: The code calls mcpdb->open(...) and immediately enables
extension loading via (*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1)
without checking the open result; update the logic in the block around
mcpdb->open() (SQLite3DB::open) to capture and check its return value, log an
error and skip calling proxy_sqlite3_enable_load_extension and any further
attach/setup when open fails, and ensure you use mcpdb->get_db() only after a
successful open; include descriptive logging (via existing logging mechanism)
when open fails to make the early exit clear.

In `@lib/AI_Features_Manager.cpp`:
- Around line 300-315: Check GloGATH->variables.genai_vector_dimension before
using it to build create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large,
clamp it to a safe default (e.g., 1536) and log the correction via the same
logger used elsewhere (or emit a warning) so the fallback is visible; then use
the sanitized vector_dimension when constructing the vec0 DDL for rag_vec_chunks
to avoid silent failures.

In `@lib/Anomaly_Detector.cpp.bak`:
- Line 500: Replace the direct call to sqlite3_finalize(stmt) with the proxy
wrapper used elsewhere (e.g., proxy_sqlite3_finalize(stmt)) so the proxy layer's
logging/metrics/error handling is applied consistently; update any includes or
declarations if needed to access proxy_sqlite3_finalize and ensure the call
semantics and error checking remain identical to the original sqlite3_finalize
usage (use the same stmt variable and follow existing error/logging patterns
used with proxy_sqlite3_step).
- Around line 814-836: The code is calling sqlite3_prepare_v2, sqlite3_errmsg,
and sqlite3_finalize directly which breaks the proxy pattern; replace these
direct calls with the proxy wrappers used elsewhere: call
proxy_sqlite3_prepare_v2(db, query, -1, &stmt, NULL) and assign its return to
rc, use proxy_sqlite3_errmsg(db) in the proxy_error call, and call
proxy_sqlite3_finalize(stmt) at the end; keep the existing use of
proxy_sqlite3_step, proxy_sqlite3_column_text, proxy_sqlite3_column_int, and
proxy_sqlite3_column_int64 and ensure variable names (stmt, rc, query, db)
remain unchanged.
- Around line 920-938: This block mixes direct SQLite calls with proxy wrappers;
make all SQLite interactions consistent by replacing sqlite3_prepare_v2,
sqlite3_column_text, sqlite3_column_int, and sqlite3_finalize with their proxy
equivalents (e.g., proxy_sqlite3_prepare_v2, proxy_sqlite3_column_text,
proxy_sqlite3_column_int, proxy_sqlite3_finalize) so the loop that uses
proxy_sqlite3_step also uses proxy_* for prepare/column/finalize on the
variables stmt, db, type_query, and by_type.
- Around line 577-590: The static local int cleanup_counter in the cleanup block
inside update_user_statistics is not thread-safe; replace it with a
std::atomic<int> (e.g., declare static std::atomic<int> cleanup_counter{0}) and
perform an atomic increment/check (use fetch_add or ++cleanup_counter and
compare the result mod 1000) so multiple threads can't race on the counter; also
ensure the user_statistics cleanup loop remains protected by the same mutex/lock
used for accessing the user_statistics map (or document that user_statistics is
already synchronized) to avoid concurrent access issues.
- Around line 771-776: Replace direct SQLite calls with the proxy wrappers to
match the rest of the file: in the error branch where proxy_error("Anomaly:
Failed to insert pattern: %s\n", sqlite3_errmsg(db)); and where
sqlite3_finalize(stmt); are used, call the proxy wrappers (e.g.,
proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt)) instead so
proxy_error uses the proxy_sqlite3_errmsg result and both finalization calls use
proxy_sqlite3_finalize; locate these occurrences around the Anomaly pattern
insert logic where proxy_error, db, and stmt are referenced.
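The atomic-counter suggestion above can be sketched as follows; the function name and threshold are illustrative, not the actual `update_user_statistics` code.

```cpp
#include <atomic>

// Thread-safe replacement for a plain `static int` counter: fetch_add is a
// single atomic read-modify-write, so concurrent callers cannot race on the
// increment. Returns true once every 1000 invocations, when the
// (caller-supplied) cleanup should run.
bool should_run_cleanup() {
    static std::atomic<int> cleanup_counter{0};
    // fetch_add returns the value *before* the increment.
    return (cleanup_counter.fetch_add(1, std::memory_order_relaxed) + 1) % 1000 == 0;
}
```

Note that the atomic only protects the counter itself; as the review says, access to the `user_statistics` map still needs its own mutex.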

In `@lib/Discovery_Schema.cpp`:
- Around line 1211-1221: The SQL built in get_object (around the sql
ostringstream) and list_objects currently concatenates user-controlled
schema_name, object_name and object_type into the query; change both functions
to use prepared/parameterized statements instead of string concatenation:
replace the ostringstream construction with a SQL statement containing
placeholders (e.g., ? or named parameters) for run_id, object_id (when used),
schema_name, object_name and object_type, then bind those variables with the DB
API's bind/prepare calls before executing; ensure proper types are used for
numeric vs string parameters and remove any remaining direct concatenation of
schema_name/object_name/object_type.
- Around line 30-38: Discovery_Schema::init currently uses a C99 VLA (char
path_buf[db_path.size() + 1]) and leaks the heap-allocated db if db->open(...)
fails; replace the VLA with a std::vector<char> (or std::string::data copy)
sized db_path.size()+1 and use that buffer when calling db->open, and ensure you
free the allocated SQLite3DB (delete db) on error (or better, change db to a
smart pointer) before returning after proxy_error to avoid the resource leak.
- Around line 1988-1993: The fts_search_llm() code is vulnerable because it
concatenates the user-supplied query into the MATCH clause; change it to use a
prepared statement and parameter binding instead of string concatenation: build
the SQL with "WHERE f.fts_llm MATCH ?1" (keep the rest of the SELECT/LEFT JOIN
and ORDER BY score LIMIT part), prepare the statement, bind the C++ string query
to parameter 1 using sqlite3_bind_text (or proxy_sqlite3_bind_text) and execute;
also ensure the LIMIT value (limit) is either bound as a validated integer or
strictly validated before concatenation to prevent injection via limit.
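A minimal sketch of the safe shape suggested for `fts_search_llm`: the user query is left to a bound `?1` parameter and never spliced into the SQL text, and the LIMIT is clamped to a validated integer. Table and column names here are illustrative, and the actual binding call (`sqlite3_bind_text` or the proxy wrapper) is only noted in a comment.

```cpp
#include <string>

// Build the FTS query with a ?1 placeholder for the MATCH term; the caller
// binds the raw user string afterwards, e.g.
//   sqlite3_bind_text(stmt, 1, query.c_str(), -1, SQLITE_TRANSIENT);
// Clamping LIMIT to a known-good range also blocks injection via that value.
std::string build_fts_match_sql(int limit) {
    if (limit < 1 || limit > 1000) limit = 50;
    return "SELECT f.rowid, s.summary, bm25(f.fts_llm) AS score "
           "FROM fts_llm f LEFT JOIN llm_summaries s ON s.rowid = f.rowid "
           "WHERE f.fts_llm MATCH ?1 ORDER BY score LIMIT " + std::to_string(limit);
}
```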

In `@lib/ProxySQL_MCP_Server.cpp`:
- Around line 78-98: The Query_Tool_Handler instance is set to NULL on init()
failure but the /mcp/query resource is still registered later; change the flow
so that registration and any increment of the endpoint counter only occur when
handler->query_tool_handler->init() returns success (non-zero indicates
failure), i.e., after new Query_Tool_Handler(...) call check init(), and only on
successful init register the /mcp/query endpoint and adjust any endpoint
counting; on failure leave handler->query_tool_handler NULL and skip
registration to avoid null deref or a broken endpoint.

In `@lib/RAG_Tool_Handler.cpp`:
- Around line 2316-2330: The doc_id list is concatenated into SQL in
rag.get_docs and rag.fetch_from_source (see the doc_ids loop that builds
doc_list and the sql string using "WHERE ... IN (" + doc_list + ")"), which is
vulnerable to SQL injection and fails on quotes; fix by replacing direct
concatenation with a parameterized IN clause using execute_parameterized_query()
and bound parameters (generate placeholders ?1,?2,... from doc_ids and bind each
doc_ids[i]) or, if you prefer the validation approach, validate every entry in
doc_ids against the same filter used in the filter handlers (reject or escape
entries containing '\'', '\\', ';') before building the query; ensure you update
both rag.get_docs and rag.fetch_from_source to use the same safe pattern.
- Around line 1490-1504: The hybrid weights w_fts and w_vec are parsed with
get_json_int which truncates fractional values; update both the "fuse" path and
the fallback path to parse these weights as double (use the existing
get_json_double or equivalent helper instead of get_json_int) so fractional
weights like 0.5 are preserved; ensure you change the lines setting w_fts and
w_vec in the block that reads fuse_params and the fallback block that reads
arguments, keeping the same variable names (w_fts, w_vec) and types.
- Around line 1150-1159: The SQL FTS query is built by concatenating the
user-provided variable `query` into the `sql` string (the SELECT from
`rag_fts_chunks` / JOIN `rag_chunks` block), which fails to escape quotes and is
vulnerable; change this to use a prepared statement with a parameter placeholder
in the MATCH clause (e.g. `... WHERE f MATCH ?1`) and bind the user input with
`sqlite3_bind_text()` before stepping the statement (prepare with
`sqlite3_prepare_v2`, bind with `sqlite3_bind_text`, execute with
`sqlite3_step`, and finalize with `sqlite3_finalize`). Apply the same change for
the other blocks referenced (the other `sql` constructions at the later ranges)
so all FTS MATCH usages use parameter binding instead of raw concatenation.
- Around line 401-455: The prepared statement and bound parameters are never
used because you call vector_db->execute_statement(query, ...); instead step the
prepared sqlite3_stmt* from prepare_result.second so bindings are applied, and
when constructing a SQLite3_result from the stmt do not leave the unique_ptr
owning it — release ownership with prepare_result.second.release() and pass the
raw sqlite3_stmt* into SQLite3_result so the result object owns and finalizes
the statement; remove the execute_statement(...) call and replace it with
creating/returning the SQLite3_result from the released stmt after collecting
any needed error/cols/affected_rows info.
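The parameterized IN clause recommended above for `rag.get_docs` can be sketched like this; table and column names are illustrative, and each `doc_ids[i]` would then be bound with `sqlite3_bind_text` rather than concatenated into the SQL.

```cpp
#include <string>
#include <vector>

// Generate "?1,?2,...,?n" placeholders instead of splicing ids into the SQL
// text; parameter binding then handles quoting and escaping for us.
std::string build_get_docs_sql(const std::vector<std::string>& doc_ids) {
    std::string sql = "SELECT doc_id, title, body FROM rag_docs WHERE doc_id IN (";
    for (size_t i = 0; i < doc_ids.size(); ++i) {
        if (i) sql += ",";
        sql += "?" + std::to_string(i + 1);
    }
    sql += ")";
    return sql;
}
```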

In `@RAG_POC/rag_ingest.cpp`:
- Around line 686-698: The assignments reading columns into struct s are
vulnerable to null pointer dereference because sqlite3_column_text can return
NULL; update the code that sets s.name, s.backend_type, s.host, s.user, s.pass,
s.db, s.table_name, s.pk_column and s.where_sql to first call
sqlite3_column_text into a temporary const unsigned char* (or const char*),
check for NULL (or use sqlite3_column_type to detect SQLITE_NULL) and only
assign to the std::string members when non-NULL, otherwise set them to an empty
string (or another safe default); do the same null-check pattern for text
columns used by functions like sqlite3_column_text while keeping integer reads
(sqlite3_column_int for s.enabled and s.port) as-is or guarding them with
sqlite3_column_type if you want to detect NULLs explicitly.
- Around line 901-905: The code currently sets now_epoch to 0 which yields
invalid timestamps for rag_vec_chunks; replace the hardcoded value by capturing
a real epoch once per run (e.g., use std::time(nullptr) / time(NULL) and cast to
std::int64_t) and reuse that value when populating updated_at (variable
now_epoch), updating any affected functions that rely on now_epoch; ensure the
translation unit includes <ctime> (or <time.h>) if not already present so the
build succeeds.
- Around line 468-479: build_select_sql concatenates RagSource::where_sql and
raw identifiers, creating SQL injection/identifier-injection risks; update
build_select_sql to (1) escape backticks inside identifiers used for cols and
s.table_name (replace any '`' with '``') before wrapping with backticks, and (2)
stop directly concatenating s.where_sql: either accept a parameterized WHERE
with placeholders and require callers to supply bound parameters, or validate
allowlisted characters/patterns (e.g., only simple column comparisons with a
strict regex) and reject/return an error for unsafe values; update function
signature or call sites as needed to support bound parameters or validation on
RagSource::where_sql.
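The null-check pattern suggested for `rag_ingest.cpp` can be wrapped in a small helper; the function name is hypothetical, and the pointer type mirrors what `sqlite3_column_text` returns.

```cpp
#include <string>

// Convert a possibly-NULL SQLite text column to std::string safely:
// sqlite3_column_text returns NULL for SQL NULL, and constructing a
// std::string from a NULL pointer is undefined behavior.
std::string column_text_or_empty(const unsigned char* p) {
    return p ? std::string(reinterpret_cast<const char*>(p)) : std::string();
}
```

Each `s.name = ...` assignment would then become `s.name = column_text_or_empty(sqlite3_column_text(stmt, col));`.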

In `@RAG_POC/schema.sql`:
- Around line 14-45: The rag_sources table currently stores plaintext
credentials in the backend_pass column; instead migrate to a secure approach by
removing plaintext storage and either (A) storing only a reference to an
external secret manager (add a new column like backend_secret_ref) and update
any code that reads rag_sources to fetch the secret from the secret manager, or
(B) store an encrypted value (add backend_pass_encrypted) and implement
symmetric key encryption/decryption in the service layer to read/write via
functions that use a KMS-held key; update schema changes for rag_sources, add a
migration to backfill secrets (move existing backend_pass values into the secret
store or encrypt them) and remove or null out backend_pass, and update codepaths
that reference backend_pass to use backend_secret_ref or decrypt
backend_pass_encrypted instead.
♻️ Duplicate comments (7)
lib/ProxySQL_Admin_Stats.cpp (1)

1587-1617: SQL injection vulnerability and incorrect table name logic.

This issue was previously flagged: the INSERT statement has two problems:

  1. Wrong table name: When reset=true, the prefix logic produces "stats_mcp_query_tools_counters_mcp_query_tools_counters" instead of the intended "stats_mcp_query_tools_counters_reset".

  2. SQL injection risk: r->fields[0] and r->fields[1] (tool name and schema) are directly interpolated into the query without escaping. Malicious values could corrupt or exploit the database.

🔒 Proposed fix using prepared statements
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-	     it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		char query[1024];
-		snprintf(query, sizeof(query),
-			"INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-			reset ? "stats_mcp_query_tools_counters_" : "stats_",
-			r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-			r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-		statsdb->execute(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_tools_counters_reset"
+		: "stats_mcp_query_tools_counters";
+	char query[256];
+	snprintf(query, sizeof(query),
+		"INSERT INTO %s VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)", table);
+	
+	sqlite3_stmt* stmt = nullptr;
+	int rc = statsdb->prepare_v2(query, &stmt);
+	ASSERT_SQLITE_OK(rc, statsdb);
+	
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7]));
+		SAFE_SQLITE3_STEP2(stmt);
+		(*proxy_sqlite3_clear_bindings)(stmt);
+		(*proxy_sqlite3_reset)(stmt);
+	}
+	(*proxy_sqlite3_finalize)(stmt);
lib/Discovery_Schema.cpp (6)

12-18: gmtime() is not thread‑safe

Same issue as previously noted: gmtime() uses shared static storage; switch to gmtime_r / gmtime_s.


76-88: Use delete for SQLite3_result allocated with new

Same issue as previously noted: execute_statement() allocates with new, so free(resultset) is undefined behavior.


589-598: get_run_info reads columns off by one

Same issue as previously noted: field indices are shifted relative to the SELECT list.


970-987: ?1 placeholders aren’t bound in update_object_flags

Same issue as previously noted: db->execute() doesn’t bind parameters, so run_id is ignored.


1162-1171: FTS query strings are concatenated directly

Same issue as previously noted: query / object_type / schema_name are interpolated without escaping or binding.


1761-1763: domain_key concatenation is injection‑prone

Same issue as previously noted: domain_key is interpolated directly into SQL.

🧹 Nitpick comments (11)
RAG_POC/rag_ingest.cpp (1)

384-404: MySQL connection lacks timeout configuration.

Without a connection timeout, mysql_real_connect can block indefinitely if the MySQL server is unreachable. Consider adding MYSQL_OPT_CONNECT_TIMEOUT.

♻️ Suggested improvement
 static MYSQL* mysql_connect_or_die(const RagSource& s) {
   MYSQL* conn = mysql_init(nullptr);
   if (!conn) fatal("mysql_init failed");

   // Set utf8mb4 for safety with StackOverflow-like content
   mysql_options(conn, MYSQL_SET_CHARSET_NAME, "utf8mb4");
+
+  // Set connection timeout to avoid indefinite blocking
+  unsigned int timeout = 30;  // seconds
+  mysql_options(conn, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);

   if (!mysql_real_connect(conn,
lib/Anomaly_Detector.cpp.bak (1)

136-163: Consider caching compiled regex patterns for performance.

normalize_query is likely called on the hot path for every query. Compiling std::regex objects on each call is expensive. Consider pre-compiling and caching these patterns.

♻️ Suggested improvement
// At file scope or as class members:
static const std::regex comment_regex("--.*?$|/\\*.*?\\*/", std::regex::multiline);
static const std::regex string_regex("'[^']*'|\"[^\"]*\"");
static const std::regex numeric_regex("\\b\\d+\\b");
static const std::regex whitespace_regex("\\s+");

// Then in normalize_query, use these pre-compiled patterns
lib/ProxySQL_Admin_Stats.cpp (2)

2308-2309: Minor: Duplicate commented code.

Line 2308 has a duplicate commented-out line that appears to be a copy-paste artifact.

🧹 Cleanup suggestion
 	statsdb->execute("DELETE FROM stats_mysql_prepared_statements_info");
 	query1=(char *)"INSERT INTO stats_mysql_prepared_statements_info VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9)";
 	query32s = "INSERT INTO stats_mysql_prepared_statements_info VALUES " + generate_multi_rows_query(32,9);
 	query32 = (char *)query32s.c_str();
-	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
-	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
+	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
 	rc = statsdb->prepare_v2(query1, &statement1);

2375-2376: Minor: Same duplicate comment issue in pgsql version.

Same copy-paste artifact as in the MySQL version.

RAG_POC/architecture-data-model.md (1)

203-203: Minor: Trailing space in code span.

There's a trailing space inside the code span that triggers a markdown lint warning.

🔧 Fix
-- `{"lit":"..."} ` — appends a literal string
+- `{"lit":"..."}` — appends a literal string
lib/sqlite3db.cpp (1)

1102-1108: Fix inconsistent indentation.

Lines 1102-1108 have inconsistent indentation compared to the surrounding code. The assignments should be indented with tabs to match the style of adjacent lines (e.g., line 1101 and 1109).

♻️ Suggested fix for indentation
 		proxy_sqlite3_column_count = sqlite3_column_count;
 		proxy_sqlite3_column_int = sqlite3_column_int;
-	proxy_sqlite3_column_int64 = sqlite3_column_int64;
-	proxy_sqlite3_column_double = sqlite3_column_double;
-	proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
-	proxy_sqlite3_errstr = sqlite3_errstr;
-	proxy_sqlite3_db_handle = sqlite3_db_handle;
-	proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
-	proxy_sqlite3_auto_extension = sqlite3_auto_extension;
+		proxy_sqlite3_column_int64 = sqlite3_column_int64;
+		proxy_sqlite3_column_double = sqlite3_column_double;
+		proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
+		proxy_sqlite3_errstr = sqlite3_errstr;
+		proxy_sqlite3_db_handle = sqlite3_db_handle;
+		proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
+		proxy_sqlite3_auto_extension = sqlite3_auto_extension;
 		proxy_sqlite3_errmsg = sqlite3_errmsg;
lib/GenAI_Thread.cpp (2)

688-691: Inconsistent boolean parsing for rag_enabled.

The rag_enabled setter accepts both "true" and "1", while other boolean setters in this file (e.g., enabled, llm_enabled, anomaly_enabled at lines 552-563) only accept "true". This inconsistency could confuse users.

Consider either:

  1. Aligning rag_enabled with existing setters (only "true"), or
  2. Updating all boolean setters to accept both "true" and "1" for consistency.
♻️ Option 1: Align with existing pattern
 	if (!strcmp(name, "rag_enabled")) {
-		variables.genai_rag_enabled = (strcmp(value, "true") == 0 || strcmp(value, "1") == 0);
+		variables.genai_rag_enabled = (strcmp(value, "true") == 0);
 		return true;
 	}

728-736: Consider raising minimum timeout threshold.

The minimum value of 1ms for rag_timeout_ms is extremely low and could cause spurious timeouts in production. Even local SQLite operations may exceed 1ms under load.

Consider raising the minimum to a more practical value like 100ms, consistent with other timeout validations in this file (e.g., embedding_timeout_ms uses 100-300000 range at line 534).

♻️ Suggested fix
 	if (!strcmp(name, "rag_timeout_ms")) {
 		int val = atoi(value);
-		if (val < 1 || val > 60000) {
-			proxy_error("Invalid value for rag_timeout_ms: %d (must be 1-60000)\n", val);
+		if (val < 100 || val > 60000) {
+			proxy_error("Invalid value for rag_timeout_ms: %d (must be 100-60000)\n", val);
 			return false;
 		}
 		variables.genai_rag_timeout_ms = val;
 		return true;
 	}
RAG_POC/embeddings-design.md (1)

42-47: Consider adding language specifier to fenced code block.

The embedding input template example lacks a language specifier. While this is intentionally plain text, adding a language identifier (e.g., text or plaintext) would satisfy linters and improve rendering consistency.

♻️ Suggested fix
-```
+```text
 {Title}
 Tags: {Tags}
 
 {ChunkBody}
</details>

</blockquote></details>
<details>
<summary>RAG_IMPLEMENTATION_SUMMARY.md (1)</summary><blockquote>

`1-129`: **Consider consolidating summary documents.**

This file contains substantial overlap with `RAG_FILE_SUMMARY.md` and `RAG_COMPLETION_SUMMARY.md`. Having three separate summary documents with similar content may lead to maintenance burden and potential inconsistencies over time.

Consider consolidating into a single comprehensive summary document or clearly differentiating the purpose of each file (e.g., one for developers, one for release notes).

</blockquote></details>
<details>
<summary>lib/RAG_Tool_Handler.cpp (1)</summary><blockquote>

`34-43`: **Remove the duplicate `GloGATH` forward declaration**

The `extern GenAI_Threads_Handler *GloGATH;` appears twice back‑to‑back. Dropping the duplicate reduces noise.

</blockquote></details>

</blockquote></details>



Comment on lines +726 to +736
mcpdb = new SQLite3DB();
std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);

/**
* @brief Enable SQLite extension loading for MCP catalog database
*
* Allows loading SQLite extensions at runtime. This enables sqlite-vec to be
* registered for vector similarity searches in the catalog.
*/
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);

⚠️ Potential issue | 🟡 Minor

Check mcpdb->open() result before attaching.

mcpdb->open() return code is ignored; if it fails, the subsequent attach can misbehave. Please handle the error and skip attachment/log accordingly.

✅ Suggested check
-	mcpdb = new SQLite3DB();
-	std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
-	mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	mcpdb = new SQLite3DB();
+	std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
+	int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+	                         SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	if (mcp_rc != SQLITE_OK) {
+		proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+		delete mcpdb;
+		mcpdb = nullptr;
+	} 
🤖 Prompt for AI Agents
In `@lib/Admin_Bootstrap.cpp` around lines 726 - 736, The code calls
mcpdb->open(...) and immediately enables extension loading via
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1) without checking the
open result; update the logic in the block around mcpdb->open()
(SQLite3DB::open) to capture and check its return value, log an error and skip
calling proxy_sqlite3_enable_load_extension and any further attach/setup when
open fails, and ensure you use mcpdb->get_db() only after a successful open;
include descriptive logging (via existing logging mechanism) when open fails to
make the early exit clear.

Comment on lines +300 to +315
// rag_vec_chunks: sqlite3-vec index
// Use configurable vector dimension from GenAI module
int vector_dimension = 1536; // Default value
if (GloGATH) {
vector_dimension = GloGATH->variables.genai_vector_dimension;
}

std::string create_rag_vec_chunks_sql =
"CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks USING vec0("
"embedding float(" + std::to_string(vector_dimension) + "), "
"chunk_id TEXT, "
"doc_id TEXT, "
"source_id INTEGER, "
"updated_at INTEGER"
");";


⚠️ Potential issue | 🟡 Minor

Validate genai_vector_dimension before building vec0 DDL.

If the dimension is 0 or negative, the rag_vec_chunks creation will fail and silently disable vector search. Please clamp or fall back to a safe default and log the correction.

✅ Suggested guard
-	int vector_dimension = 1536; // Default value
-	if (GloGATH) {
-		vector_dimension = GloGATH->variables.genai_vector_dimension;
-	}
+	int vector_dimension = 1536; // Default value
+	if (GloGATH) {
+		const int cfg_dim = GloGATH->variables.genai_vector_dimension;
+		if (cfg_dim > 0) {
+			vector_dimension = cfg_dim;
+		} else {
+			proxy_warning("AI: Invalid genai_vector_dimension=%d; using default %d\n",
+			              cfg_dim, vector_dimension);
+		}
+	}
🤖 Prompt for AI Agents
In `@lib/AI_Features_Manager.cpp` around lines 300 - 315, Check
GloGATH->variables.genai_vector_dimension before using it to build
create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large, clamp it to a
safe default (e.g., 1536) and log the correction via the same logger used
elsewhere (or emit a warning) so the fallback is visible; then use the sanitized
vector_dimension when constructing the vec0 DDL for rag_vec_chunks to avoid
silent failures.

pattern_name ? pattern_name : "unknown", result.risk_score);
}

sqlite3_finalize(stmt);

⚠️ Potential issue | 🟠 Major

Inconsistent SQLite API usage: missing proxy wrapper.

Line 500 uses sqlite3_finalize(stmt) directly, but nearby code uses proxy wrappers like proxy_sqlite3_step. This inconsistency could cause issues if the proxy layer adds logging, metrics, or error handling.

🐛 Proposed fix
-	sqlite3_finalize(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sqlite3_finalize(stmt);
(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` at line 500, Replace the direct call to
sqlite3_finalize(stmt) with the proxy wrapper used elsewhere (e.g.,
proxy_sqlite3_finalize(stmt)) so the proxy layer's logging/metrics/error
handling is applied consistently; update any includes or declarations if needed
to access proxy_sqlite3_finalize and ensure the call semantics and error
checking remain identical to the original sqlite3_finalize usage (use the same
stmt variable and follow existing error/logging patterns used with
proxy_sqlite3_step).

Comment on lines +577 to +590

// Cleanup old entries periodically
static int cleanup_counter = 0;
if (++cleanup_counter % 1000 == 0) {
uint64_t current_time = (uint64_t)time(NULL);
auto it = user_statistics.begin();
while (it != user_statistics.end()) {
if (current_time - it->second.last_query_time > USER_STATS_WINDOW * 2) {
it = user_statistics.erase(it);
} else {
++it;
}
}
}

⚠️ Potential issue | 🟠 Major

Static local variable cleanup_counter is not thread-safe.

Multiple threads calling update_user_statistics can race on the static cleanup_counter variable, potentially causing cleanup to run less frequently than intended or creating undefined behavior.

🐛 Proposed fix: Use atomic counter
+#include <atomic>
+
+// At class level or file scope:
+static std::atomic<int> cleanup_counter{0};
+
 void Anomaly_Detector::update_user_statistics(const QueryFingerprint& fp) {
 	// ... existing code ...

 	// Cleanup old entries periodically
-	static int cleanup_counter = 0;
-	if (++cleanup_counter % 1000 == 0) {
+	if (cleanup_counter.fetch_add(1, std::memory_order_relaxed) % 1000 == 0) {
 		// ... cleanup logic (consider adding mutex for user_statistics map) ...

Note: The user_statistics map itself may also need synchronization if accessed from multiple threads.

🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 577 - 590, The static local int
cleanup_counter in the cleanup block inside update_user_statistics is not
thread-safe; replace it with a std::atomic<int> (e.g., declare static
std::atomic<int> cleanup_counter{0}) and perform an atomic increment/check (use
fetch_add or ++cleanup_counter and compare the result mod 1000) so multiple
threads can't race on the counter; also ensure the user_statistics cleanup loop
remains protected by the same mutex/lock used for accessing the user_statistics
map (or document that user_statistics is already synchronized) to avoid
concurrent access issues.

Comment on lines +771 to +776
proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
sqlite3_finalize(stmt);
return -1;
}

sqlite3_finalize(stmt);

⚠️ Potential issue | 🟠 Major

Inconsistent proxy wrapper usage continues.

Lines 771-772 and 776 use sqlite3_errmsg and sqlite3_finalize directly instead of their proxy equivalents used elsewhere in this file.

🐛 Proposed fix
 	rc = (*proxy_sqlite3_step)(stmt);
 	if (rc != SQLITE_DONE) {
-		proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
-		sqlite3_finalize(stmt);
+		proxy_error("Anomaly: Failed to insert pattern: %s\n", (*proxy_sqlite3_errmsg)(db));
+		(*proxy_sqlite3_finalize)(stmt);
 		return -1;
 	}

-	sqlite3_finalize(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 771 - 776, Replace direct SQLite
calls with the proxy wrappers to match the rest of the file: in the error branch
where proxy_error("Anomaly: Failed to insert pattern: %s\n",
sqlite3_errmsg(db)); and where sqlite3_finalize(stmt); are used, call the proxy
wrappers (e.g., proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt))
instead so proxy_error uses the proxy_sqlite3_errmsg result and both
finalization calls use proxy_sqlite3_finalize; locate these occurrences around
the Anomaly pattern insert logic where proxy_error, db, and stmt are referenced.

Comment on lines +2316 to +2330
// Build doc ID list for SQL
std::string doc_list = "'";
for (size_t i = 0; i < doc_ids.size(); ++i) {
if (i > 0) doc_list += "','";
doc_list += doc_ids[i];
}
doc_list += "'";

// Build query
std::string sql = "SELECT doc_id, source_id, "
"(SELECT name FROM rag_sources WHERE source_id = rag_documents.source_id) as source_name, "
"pk_json, title, body, metadata_json "
"FROM rag_documents "
"WHERE doc_id IN (" + doc_list + ")";


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verification scripts (collapsed): inspected lib/RAG_Tool_Handler.cpp for the doc_list construction, the existing bind/escape patterns, and the rag.get_docs and rag.fetch_from_source handlers; the output confirmed direct string concatenation with no parameter binding.

Apply SQL injection mitigation to doc_id lists in rag.get_docs and rag.fetch_from_source

The code concatenates doc_ids directly into SQL queries without validation or parameterization. This is injection-prone and breaks on single quotes. Use the validation pattern found in the filter handlers (rejecting `'`, `\`, `;`) or migrate to the existing `execute_parameterized_query()` function with bound parameters (`?1`, `?2`, ...).

Affects both:

  • rag.get_docs: WHERE doc_id IN ( + direct concatenation
  • rag.fetch_from_source: identical pattern in WHERE d.doc_id IN ( clause
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 2316 - 2330, The doc_id list is
concatenated into SQL in rag.get_docs and rag.fetch_from_source (see the doc_ids
loop that builds doc_list and the sql string using "WHERE ... IN (" + doc_list +
")"), which is vulnerable to SQL injection and fails on quotes; fix by replacing
direct concatenation with a parameterized IN clause using
execute_parameterized_query() and bound parameters (generate placeholders
?1,?2,... from doc_ids and bind each doc_ids[i]) or, if you prefer the
validation approach, validate every entry in doc_ids against the same filter
used in the filter handlers (reject or escape entries containing '\'', '\\',
';') before building the query; ensure you update both rag.get_docs and
rag.fetch_from_source to use the same safe pattern.

Comment on lines +468 to +479
static std::string build_select_sql(const RagSource& s, const std::vector<std::string>& cols) {
std::string sql = "SELECT ";
for (size_t i = 0; i < cols.size(); i++) {
if (i) sql += ", ";
sql += "`" + cols[i] + "`";
}
sql += " FROM `" + s.table_name + "`";
if (!s.where_sql.empty()) {
sql += " WHERE " + s.where_sql;
}
return sql;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verification scripts (collapsed): located the RagSource definition, traced how rag_sources is loaded and where build_select_sql and where_sql are used, and searched for identifier validation/escaping helpers and for any API that modifies rag_sources; no validation of where_sql and no public modification path for rag_sources were found.

Add WHERE clause SQL injection validation; backtick-escaped identifiers need backtick escaping.

The where_sql field at line 476 is concatenated directly into the SQL query without validation, creating a SQL injection risk. Additionally, while column and table names are backtick-quoted (lines 472, 474), backticks within the identifier values themselves are not escaped: a malicious identifier containing a backtick (e.g. ``col`name``) would break the quoting.

Since rag_sources is loaded from the internal SQLite database and no public APIs were found to modify it, the actual attack surface is limited to trusted administrators. However, the code should defensively validate where_sql content (e.g., using a prepared statement or allowlist for WHERE conditions) and properly escape backticks in identifiers to prevent identifier injection.

🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 468 - 479, build_select_sql concatenates
RagSource::where_sql and raw identifiers, creating SQL
injection/identifier-injection risks; update build_select_sql to (1) escape
backticks inside identifiers used for cols and s.table_name (replace any '`'
with '``') before wrapping with backticks, and (2) stop directly concatenating
s.where_sql: either accept a parameterized WHERE with placeholders and require
callers to supply bound parameters, or validate allowlisted characters/patterns
(e.g., only simple column comparisons with a strict regex) and reject/return an
error for unsafe values; update function signature or call sites as needed to
support bound parameters or validation on RagSource::where_sql.

Comment on lines +686 to +698
s.name = (const char*)sqlite3_column_text(st, 1);
s.enabled = sqlite3_column_int(st, 2);

s.backend_type = (const char*)sqlite3_column_text(st, 3);
s.host = (const char*)sqlite3_column_text(st, 4);
s.port = sqlite3_column_int(st, 5);
s.user = (const char*)sqlite3_column_text(st, 6);
s.pass = (const char*)sqlite3_column_text(st, 7);
s.db = (const char*)sqlite3_column_text(st, 8);

s.table_name = (const char*)sqlite3_column_text(st, 9);
s.pk_column = (const char*)sqlite3_column_text(st, 10);
s.where_sql = (const char*)sqlite3_column_text(st, 11);

⚠️ Potential issue | 🟠 Major

Potential null pointer dereference from sqlite3_column_text.

sqlite3_column_text can return NULL for NULL column values or on allocation failure. Directly assigning to std::string without a null check will cause undefined behavior if NULL is returned.

🐛 Proposed fix: Add null checks
-    s.name      = (const char*)sqlite3_column_text(st, 1);
-    s.enabled   = sqlite3_column_int(st, 2);
-
-    s.backend_type = (const char*)sqlite3_column_text(st, 3);
-    s.host         = (const char*)sqlite3_column_text(st, 4);
-    s.port         = sqlite3_column_int(st, 5);
-    s.user         = (const char*)sqlite3_column_text(st, 6);
-    s.pass         = (const char*)sqlite3_column_text(st, 7);
-    s.db           = (const char*)sqlite3_column_text(st, 8);
-
-    s.table_name   = (const char*)sqlite3_column_text(st, 9);
-    s.pk_column    = (const char*)sqlite3_column_text(st, 10);
-    s.where_sql    = (const char*)sqlite3_column_text(st, 11);
+    s.name      = str_or_empty((const char*)sqlite3_column_text(st, 1));
+    s.enabled   = sqlite3_column_int(st, 2);
+
+    s.backend_type = str_or_empty((const char*)sqlite3_column_text(st, 3));
+    s.host         = str_or_empty((const char*)sqlite3_column_text(st, 4));
+    s.port         = sqlite3_column_int(st, 5);
+    s.user         = str_or_empty((const char*)sqlite3_column_text(st, 6));
+    s.pass         = str_or_empty((const char*)sqlite3_column_text(st, 7));
+    s.db           = str_or_empty((const char*)sqlite3_column_text(st, 8));
+
+    s.table_name   = str_or_empty((const char*)sqlite3_column_text(st, 9));
+    s.pk_column    = str_or_empty((const char*)sqlite3_column_text(st, 10));
+    s.where_sql    = str_or_empty((const char*)sqlite3_column_text(st, 11));
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 686 - 698, The assignments reading
columns into struct s are vulnerable to null pointer dereference because
sqlite3_column_text can return NULL; update the code that sets s.name,
s.backend_type, s.host, s.user, s.pass, s.db, s.table_name, s.pk_column and
s.where_sql to first call sqlite3_column_text into a temporary const unsigned
char* (or const char*), check for NULL (or use sqlite3_column_type to detect
SQLITE_NULL) and only assign to the std::string members when non-NULL, otherwise
set them to an empty string (or another safe default); do the same null-check
pattern for text columns used by functions like sqlite3_column_text while
keeping integer reads (sqlite3_column_int for s.enabled and s.port) as-is or
guarding them with sqlite3_column_type if you want to detect NULLs explicitly.
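The `str_or_empty` helper assumed by the proposed fix is not shown in the diff; a minimal sketch of what it would look like (the name comes from the suggestion above, not from the PR):

```cpp
#include <string>

// Hypothetical helper assumed by the proposed fix: map a NULL pointer
// returned by sqlite3_column_text() (NULL column value or allocation
// failure) to an empty string instead of invoking undefined behavior
// when constructing std::string.
static std::string str_or_empty(const char* s) {
    return s ? std::string(s) : std::string();
}
```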

Comment on lines +901 to +905
// Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
// Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
// or a local time. For v0, we store 0 and let schema default handle other tables.
// If you want accuracy, query SELECT unixepoch() once per run and reuse it.
std::int64_t now_epoch = 0;

⚠️ Potential issue | 🟡 Minor

Hardcoded now_epoch = 0 will cause incorrect timestamps.

The comment acknowledges this is a placeholder, but storing 0 as updated_at in rag_vec_chunks will make temporal queries unreliable. Consider using time(NULL) for a reasonable approximation.

💡 Suggested fix
-    // If you want accuracy, query SELECT unixepoch() once per run and reuse it.
-    std::int64_t now_epoch = 0;
+    // Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
+    std::int64_t now_epoch = (std::int64_t)time(NULL);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     // Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
     // Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
     // or a local time. For v0, we store 0 and let schema default handle other tables.
-    // If you want accuracy, query SELECT unixepoch() once per run and reuse it.
-    std::int64_t now_epoch = 0;
+    // Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
+    std::int64_t now_epoch = (std::int64_t)time(NULL);
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 901 - 905, The code currently sets
now_epoch to 0 which yields invalid timestamps for rag_vec_chunks; replace the
hardcoded value by capturing a real epoch once per run (e.g., use
std::time(nullptr) / time(NULL) and cast to std::int64_t) and reuse that value
when populating updated_at (variable now_epoch), updating any affected functions
that rely on now_epoch; ensure the translation unit includes <ctime> (or
<time.h>) if not already present so the build succeeds.

Comment on lines +14 to +45
CREATE TABLE IF NOT EXISTS rag_sources (
source_id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE, -- e.g. "stack_posts"
enabled INTEGER NOT NULL DEFAULT 1,

-- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL")
backend_type TEXT NOT NULL, -- "mysql" | "postgres" | ...
backend_host TEXT NOT NULL,
backend_port INTEGER NOT NULL,
backend_user TEXT NOT NULL,
backend_pass TEXT NOT NULL,
backend_db TEXT NOT NULL, -- database/schema name

table_name TEXT NOT NULL, -- e.g. "posts"
pk_column TEXT NOT NULL, -- e.g. "Id"

-- Optional: restrict ingestion; appended to SELECT as WHERE <where_sql>
where_sql TEXT, -- e.g. "PostTypeId IN (1,2)"

-- REQUIRED: mapping from source row -> rag_documents fields
-- JSON spec describing doc_id, title/body concat, metadata pick/rename, etc.
doc_map_json TEXT NOT NULL,

-- REQUIRED: chunking strategy (enabled, chunk_size, overlap, etc.)
chunking_json TEXT NOT NULL,

-- Optional: embedding strategy (how to build embedding input text)
-- In v0 you can keep it NULL/empty; define later without schema changes.
embedding_json TEXT,

created_at INTEGER NOT NULL DEFAULT (unixepoch()),
updated_at INTEGER NOT NULL DEFAULT (unixepoch())

⚠️ Potential issue | 🟠 Major

Avoid storing plaintext backend credentials

backend_pass is stored as TEXT, which risks credential leakage. Please move secrets to a secure store or encrypt at rest (and restrict access) rather than persisting raw passwords in the catalog.

🤖 Prompt for AI Agents
In `@RAG_POC/schema.sql` around lines 14 - 45, The rag_sources table currently
stores plaintext credentials in the backend_pass column; instead migrate to a
secure approach by removing plaintext storage and either (A) storing only a
reference to an external secret manager (add a new column like
backend_secret_ref) and update any code that reads rag_sources to fetch the
secret from the secret manager, or (B) store an encrypted value (add
backend_pass_encrypted) and implement symmetric key encryption/decryption in the
service layer to read/write via functions that use a KMS-held key; update schema
changes for rag_sources, add a migration to backfill secrets (move existing
backend_pass values into the secret store or encrypt them) and remove or null
out backend_pass, and update codepaths that reference backend_pass to use
backend_secret_ref or decrypt backend_pass_encrypted instead.

renecannao and others added 5 commits January 21, 2026 13:12
- Add special build rule in Makefile for anomaly_detection-t that includes:
  - $(OBJ) for global variables (GloVars, GloGATH)
  - -Wl,--allow-multiple-definition to allow test's main() to override ProxySQL's
  - ClickHouse client libraries (libclickhouse-cpp-lib.a, libzstdstatic.a, liblz4.a)
  - SQLite rembed library (libsqlite_rembed.a)
  - -lscram -lusual for PostgreSQL SCRAM support
- Add stub function SQLite3_Server_session_handler required by SQLite3_Server.cpp

Resolves compilation errors for anomaly_detection-t test.
- Add #include <string> for C++ std::string support
- Add #include <cmath> for sqrt() function
- Change format %lld to %ld for chrono duration types (long int, not long long)

Resolves compilation errors for vector_db_performance-t test.
- Add #include <string> for C++ std::string support

Resolves compilation errors for ai_llm_retry_scenarios-t test.
This commit addresses valid concerns raised by coding agents (Gemini, Copilot, CoderabbitAI):

1. Fix stats_mcp_query_digest naming conflict (ProxySQL_Admin.cpp):
   - Made reset and non-reset paths mutually exclusive using else block
   - Prevents both flags from being true, matching MySQL pattern
   - Ensures reset takes precedence over non-reset

2. Fix INSERT OR REPLACE sync issue (Admin_Handler.cpp):
   - Added DELETE before INSERT OR REPLACE in LOAD/SAVE MCP QUERY RULES
   - Prevents stale rules from persisting when syncing disk <-> memory
   - Ensures deleted source rows are also removed from target

3. Fix integer division truncation for timeout (Query_Tool_Handler.cpp):
   - Changed timeout_ms/1000 to (timeout_ms+999)/1000 for ceiling division
   - Ensures sub-second timeouts (e.g., 500ms) become at least 1 second
   - Prevents zero-second timeouts from causing unexpected behavior

4. Remove confusing comment (Discovery_Schema.cpp):
   - Simplified column count comment to be clear and accurate

Note: The re_modifiers parsing code already correctly handles VARCHAR "CASELESS"
to int conversion (lines 2414-2425), so that review comment was already addressed.
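The ceiling division in item 3 can be shown in isolation (`ms_to_ceil_seconds` is a hypothetical helper name, not a function in the PR):

```cpp
// Round a millisecond timeout up to whole seconds, so sub-second values
// such as 500ms become 1s instead of truncating to 0s.
static int ms_to_ceil_seconds(int timeout_ms) {
    return (timeout_ms + 999) / 1000;
}
```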
feat: Add MCP query rules and digest statistics [WIP]
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/MySQL_Catalog.cpp (2)

56-57: Avoid Variable Length Arrays (VLAs) - use heap allocation instead.

VLAs are not standard C++ (they're a C99/GCC extension) and can cause stack overflow for large paths. Use std::string or heap allocation.

Proposed fix
-	char path_buf[db_path.size() + 1];
-	strcpy(path_buf, db_path.c_str());
-	int rc = db->open(path_buf, SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+	int rc = db->open(const_cast<char*>(db_path.c_str()), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);

Or if db->open requires a mutable buffer:

-	char path_buf[db_path.size() + 1];
-	strcpy(path_buf, db_path.c_str());
+	std::vector<char> path_buf(db_path.begin(), db_path.end());
+	path_buf.push_back('\0');
	int rc = db->open(path_buf.data(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);

279-301: SQL injection vulnerability - user input concatenated directly into SQL.

The search() method concatenates schema, kind, tags, and query parameters directly into the SQL string without escaping or using prepared statements. This is a critical security vulnerability that allows SQL injection attacks.

The vulnerable patterns are:

  • Line 283: sql << " AND schema = '" << schema << "'";
  • Line 288: sql << " AND kind = '" << kind << "'";
  • Line 293: sql << " AND tags LIKE '%" << tags << "%'";
  • Lines 298-300: query parameter concatenated into LIKE clauses

Consider using prepared statements like the upsert() and get() methods do, or at minimum escape single quotes in the input strings.

Example fix using prepared statements
std::string MySQL_Catalog::search(
	const std::string& schema,
	const std::string& query,
	const std::string& kind,
	const std::string& tags,
	int limit,
	int offset
) {
	sqlite3_stmt* stmt = NULL;
	std::ostringstream sql;
	std::vector<std::string> bindings;
	int bind_idx = 1;

	sql << "SELECT schema, kind, key, document, tags, links FROM catalog WHERE 1=1";

	if (!schema.empty()) {
		sql << " AND schema = ?" << bind_idx++;
		bindings.push_back(schema);
	}
	if (!kind.empty()) {
		sql << " AND kind = ?" << bind_idx++;
		bindings.push_back(kind);
	}
	// ... similar for other parameters
	
	// Then bind each parameter using proxy_sqlite3_bind_text
}
🤖 Fix all issues with AI agents
In `@lib/Admin_Handler.cpp`:
- Around line 2369-2374: The current code builds a single semicolon-separated
SQL string into *q and relies on execute_statement (which uses
sqlite3_prepare_v2 and only compiles the first statement) so the INSERT never
runs; change the logic around where *q is set (the l_free/l_strdup block that
creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each SQL
statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.

In `@lib/MySQL_Catalog.cpp`:
- Around line 370-390: The list() method currently concatenates user inputs into
the SQL strings (variables sql and count_sql), causing SQL injection; change it
to use prepared statements and parameter binding like upsert() and get(): build
the base query with placeholders (e.g., "WHERE 1=1" plus "AND schema = ?"/"AND
kind = ?"), prepare the statement via your MySQL API, bind schema and kind only
when non-empty, and bind limit/offset as parameters before executing both the
main query and the COUNT query (use the same conditional placeholder logic for
count_sql). Ensure you replace the direct string concatenation at the spots that
currently append schema/kind and the final "LIMIT/OFFSET" with proper
placeholders and binding.
- Around line 505-511: In the remove() method, avoid concatenating user inputs
(schema, kind, key) into the SQL string; instead build a parameterized DELETE
and use a prepared statement to bind values for schema (when non-empty), kind
and key before executing. Replace the ostringstream construction in
MySQL_Catalog::remove() with a DELETE query that uses placeholders for
schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND schema = ?" when
schema provided, then always "AND kind = ? AND key = ?"), prepare that statement
via your MySQL client API, bind schema (conditionally), bind kind and key,
execute, and propagate/log any execution errors.

In `@lib/ProxySQL_Admin_Stats.cpp`:
- Around line 2581-2627: The INSERT always targets stats_mcp_query_digest and
uses strlen/risky sprintf, causing reset data to go to the live table, crashes
on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.

In `@lib/ProxySQL_Admin.cpp`:
- Around line 1354-1357: The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.
- Around line 7750-7804: The function
ProxySQL_Admin::load_mcp_query_rules_to_runtime leaks the SQLite3_result* on the
success path because Discovery_Schema::load_mcp_query_rules(resultset) does not
free it; after calling catalog->load_mcp_query_rules(resultset) add deletion of
the resultset (delete resultset;) to mirror the error-path cleanup used in
load_mysql_firewall_to_runtime(), and ensure you use the correct deallocation
(delete, not free). Also update the function comment header to correct the
documentation: change "error message string on failure (caller must free)" to
indicate returned error strings are string literals and must not be freed by the
caller.

In `@lib/Query_Tool_Handler.cpp`:
- Around line 235-319: The init_connection_pool function must clean up any
partially initialized connections on error: add a helper (lambda or private
method) inside init_connection_pool that iterates connection_pool, calls
mysql_close on each entry's MYSQL* (and sets conn.mysql = nullptr), clears
connection_pool, resets pool_size to 0, and unlocks pool_lock if held; call this
cleanup helper before every early return that occurs after some successful
connections (e.g., after mysql_init/mysql_real_connect failures) so the function
either returns with a fully initialized pool or an empty, consistent pool;
update all return paths in init_connection_pool to invoke this helper and then
return -1.

In `@lib/Static_Harvester.cpp`:
- Around line 1088-1138: The quick-profile field indices are misaligned with the
SELECT: update the row->fields index usage inside the loop (where profile is
built) so they match the SELECT order (object_id, schema_name, object_name,
object_type, engine, table_rows_est, data_length, index_length, has_primary_key,
has_foreign_keys, has_time_column). Concretely, set rows_est to row->fields[5],
engine to row->fields[4], size_bytes to the sum of row->fields[6] and
row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).
- Around line 236-258: In Static_Harvester::execute_query, mysql_store_result()
NULL is being treated as a successful "no result set" case; instead check
mysql_field_count(mysql_conn) to see if the query should have a result set—if
mysql_field_count() == 0 treat it as a successful non-result (unlock conn_lock
and return 0), otherwise treat NULL as an error by calling
mysql_errno(mysql_conn) and logging mysql_error(mysql_conn), unlock conn_lock
and return -1; update the error log to include mysql_error(mysql_conn) and
ensure conn_lock is always released before returning.
- Around line 388-662: All five fetch_* methods (fetch_schemas,
fetch_tables_views, fetch_columns, fetch_indexes, fetch_foreign_keys) currently
interpolate the user-provided filter directly into SQL and need whitelist
validation to prevent SQL injection: add a small helper (e.g.,
is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.
- Around line 1014-1055: The UPDATE that writes view_def, schema_name, and
view_name into the SQLite catalog (inside the for-loop using update_sql and
catalog->get_db()->execute_statement) is vulnerable to breaking on single
quotes; change it to use a prepared statement with parameter binding
(sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step / sqlite3_finalize) and
bind view_def, schema_name, view_name, and current_run_id instead of
interpolating them, following the pattern in Discovery_Schema.cpp; alternatively
(less preferred) call escape_string_single_quotes() on view_def.c_str(),
schema_name.c_str(), and view_name.c_str() and use the escaped C strings in the
constructed SQL before calling execute_statement.
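The allowlist validation suggested for the `fetch_*` methods can be sketched without a regex dependency; `is_valid_identifier` is the name proposed above, but this body is an illustrative assumption, not code from the PR:

```cpp
#include <string>

// Hypothetical validator matching the suggested /^[A-Za-z0-9_]+$/ allowlist.
// An empty filter means "no filter" and is accepted; any other character
// (quotes, backticks, spaces, ...) rejects the value before it can reach
// SQL concatenation.
bool is_valid_identifier(const std::string& s) {
    if (s.empty()) return true;  // empty = no filter
    for (char c : s) {
        bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

Each fetch method would call this before appending `" AND ... = '<filter>'"`, and either return an empty result or ignore the filter when validation fails.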
♻️ Duplicate comments (9)
lib/ProxySQL_Admin_Stats.cpp (1)

1587-1613: Fix reset insert target and avoid raw SQL interpolation.

Line 1608 builds the INSERT using a prefix; when reset == true it targets stats_mcp_query_tools_counters_mcp_query_tools_counters (non‑existent). Raw interpolation of tool/schema also breaks on quotes or enables injection. Use an explicit table name plus bound parameters.

🛠️ Proposed fix (explicit table + prepared statement)
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-	     it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		char query[1024];
-		snprintf(query, sizeof(query),
-			"INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-			reset ? "stats_mcp_query_tools_counters_" : "stats_",
-			r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-			r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-		statsdb->execute(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_tools_counters_reset"
+		: "stats_mcp_query_tools_counters";
+	const std::string sql =
+		std::string("INSERT INTO ") + table +
+		" VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)";
+	int rc = 0;
+	stmt_unique_ptr u_stmt { nullptr };
+	std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+	ASSERT_SQLITE_OK(rc, statsdb);
+	sqlite3_stmt* const stmt = u_stmt.get();
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+		SAFE_SQLITE3_STEP2(stmt);
+		rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+	}
lib/Admin_Bootstrap.cpp (1)

726-736: Check MCP catalog DB open result before enabling extensions/attach.

Line 728 ignores mcpdb->open()’s return code; if it fails, proxy_sqlite3_enable_load_extension and __attach_db (Line 947) can dereference an invalid handle.

🛠️ Suggested guard
-	mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+	                         SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	if (mcp_rc != SQLITE_OK) {
+		proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+		delete mcpdb;
+		mcpdb = nullptr;
+	}
 ...
-	(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+	if (mcpdb) {
+		(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+	}
 ...
-	__attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+	if (mcpdb) {
+		__attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+	}

Also applies to: 947-947

lib/Query_Tool_Handler.cpp (7)

352-360: Synchronize find_connection with pool_lock.

Iterating connection_pool without a lock can race with close() or pool initialization.


362-405: Avoid using a returned connection and return immediately on query failure.

The connection is returned before mysql_affected_rows, and the error path continues after mysql_query fails. Both create races/UB.


407-476: Hold the connection until mysql_affected_rows/result processing completes.

Returning the connection before mysql_affected_rows risks another thread reusing it.


809-846: Match allocation/deallocation for SQLite3_result.

execute_statement allocates SQLite3_result with new, but this path frees it with free(). That’s undefined behavior.


920-936: Escape schema/name_filter before concatenating SQL.

Direct interpolation enables SQL injection via tool arguments.


1112-1131: Sanitize schema/table in object_id lookup SQL.

Direct string concatenation allows injection via object_key.


1799-1850: Free row buffers after add_row to avoid leaks.

Each row allocates char** plus strdup fields that are never freed.

🧹 Nitpick comments (1)
lib/Admin_Handler.cpp (1)

2360-2360: Minor: Inconsistent length check could be clarified.

The condition query_no_space_length>20 combined with strncasecmp(..., 21) works correctly (since "SAVE MCP QUERY RULES " is 21 chars), but using strlen("SAVE MCP QUERY RULES ") would be more maintainable and consistent with nearby code patterns (e.g., line 2365 uses strlen()).

-if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, 21)) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, 21)) ) ) {
+if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, strlen("SAVE MCP QUERY RULES "))) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, strlen("LOAD MCP QUERY RULES "))) ) ) {

Comment on lines +2369 to +2374
l_free(*ql,*q);
// First clear target table, then insert to ensure deleted source rows are also removed
*q=l_strdup("DELETE FROM main.mcp_query_rules; INSERT OR REPLACE INTO main.mcp_query_rules SELECT * FROM disk.mcp_query_rules");
*ql=strlen(*q)+1;
return true;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check how other disk-to-memory operations handle this pattern
rg -n "DELETE FROM.*INSERT" --type cpp lib/

Repository: ProxySQL/proxysql-vec

Length of output: 628


🏁 Script executed:

# Also search for the execute_statement function and how SQL is executed
rg -n "execute_statement" --type cpp lib/ | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 2642


🏁 Script executed:

# Get context around lines 2369-2374 and 2383
sed -n '2360,2390p' lib/Admin_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 1821


🏁 Script executed:

# Find execute_statement function definition
rg -n "execute_statement" lib/ --type cpp -A 10 | grep -A 10 "execute_statement.*{" | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 2293


🏁 Script executed:

# Find BQE1 macro definition which also handles DELETE...INSERT patterns
rg -n "BQE1" lib/ --type cpp -B 2 -A 5 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 4392


🏁 Script executed:

# Search for the database class and SQLite3_exec or similar execution functions
rg -n "SQLite3_exec\|execute_statement" lib/ --type cpp | grep "define\|{" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Look for where execute_statement is defined (it might be in a header)
fd -e "h" lib/ -exec rg -l "execute_statement" {} \; | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 315


🏁 Script executed:

# Look at the full BQE1 implementation
sed -n '187,210p' lib/ProxySQL_Admin.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 624


🏁 Script executed:

# Find the execute function to see how it differs from execute_statement
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 15 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if sqlite3_prepare_v2 handles multiple statements
rg -n "sqlite3_prepare_v2" lib/ --type cpp -B 2 -A 5 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 3557


🏁 Script executed:

# Look at how the mcp_query_rules statements are actually being executed later
rg -n "mcp_query_rules" lib/Admin_Handler.cpp -B 2 -A 8

Repository: ProxySQL/proxysql-vec

Length of output: 3897


🏁 Script executed:

# Find the execute function implementation
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 20 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check if there's a sqlite3_exec call anywhere that might handle multiple statements
rg -n "sqlite3_exec" lib/ --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 3222


🏁 Script executed:

# Check the SQLite3DB class header to understand what execute vs execute_statement does
fd -name "*.h" lib/ -exec rg -l "SQLite3DB" {} \; | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 295


🏁 Script executed:

# Find the execute function that takes a const char*
rg -n "void SQLite3DB::execute" lib/sqlite3db.cpp -A 25 | head -100

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the full context around the sqlite3_exec call
sed -n '215,250p' lib/sqlite3db.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 964


🏁 Script executed:

# Let's look for the header file to understand the class structure
find lib -name "*.h" -o -name "*.hpp" | xargs rg -l "class SQLite3DB" 2>/dev/null | head -1

Repository: ProxySQL/proxysql-vec

Length of output: 94


🏁 Script executed:

# Find where the returned query string (in *q) is executed
# This code is in Admin_Handler.cpp processing admin commands
# Let's see the full function context
sed -n '2300,2450p' lib/Admin_Handler.cpp | head -150

Repository: ProxySQL/proxysql-vec

Length of output: 8064


🏁 Script executed:

# Search for how admin queries are executed after being processed
rg -n "execute_statement\|->execute\(" lib/Admin_Handler.cpp | grep -A 3 -B 3 "2[34][0-9][0-9]" | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the broader context to see what happens with the query after processing
sed -n '2350,2500p' lib/Admin_Handler.cpp | tail -150

Repository: ProxySQL/proxysql-vec

Length of output: 7730


🏁 Script executed:

# Find the function signature and caller context
rg -n "admin_handler_function\|process_admin_query" lib/Admin_Handler.cpp | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Look for the function that calls this handler
grep -n "^.*(" lib/Admin_Handler.cpp | grep -E "admin_handler|process_admin" | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 725


🏁 Script executed:

# Find the function definition that contains this code
sed -n '1,100p' lib/Admin_Handler.cpp | grep -E "^[a-zA-Z_].*\(|^bool|^int"

Repository: ProxySQL/proxysql-vec

Length of output: 127


🏁 Script executed:

# Search for where the query string *q gets executed after being returned
rg -n "admindb->execute\(" lib/ --type cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 1882


🏁 Script executed:

# Find where admin_handler_command_load_or_save is called and what happens with the returned query
rg -n "admin_handler_command_load_or_save" lib/ --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 1121


🏁 Script executed:

# Let's look at the full function signature and see how it's used
sed -n '1025,1100p' lib/Admin_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 4033


🏁 Script executed:

# Check for "sqlite3_exec" documentation or SQLite3 behavior regarding multiple statements
# Let's look for comments or documentation about multi-statement support
rg -n "sqlite3_exec\|multiple.*statement" lib/ --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Find the __run_query label and see how the query is executed
sed -n '3237,3500p' lib/Admin_Handler.cpp | head -200

Repository: ProxySQL/proxysql-vec

Length of output: 10475


🏁 Script executed:

# Search for the __run_query label explicitly
rg -n "__run_query:" lib/Admin_Handler.cpp -A 30 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 1399


🏁 Script executed:

# Find what happens immediately after __run_query label - look for execute_statement or execute
sed -n '4317,4450p' lib/Admin_Handler.cpp | head -150

Repository: ProxySQL/proxysql-vec

Length of output: 4148


🏁 Script executed:

# Search more broadly for where the query gets executed after __run_query
rg -n "run_query" lib/Admin_Handler.cpp | tail -20

Repository: ProxySQL/proxysql-vec

Length of output: 561


🏁 Script executed:

# Look for any comments or documentation about prepare_v2 and multiple statements
rg -n "prepare_v2\|multiple.*statement" lib/sqlite3db.cpp -B 2 -A 2 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check if there are any workarounds or special handling for multi-statement queries
rg -n "sqlite3_next_stmt\|tail.*statement" lib/ --type cpp | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the exact behavior of sqlite3_prepare_v2 with documentation or comments
rg -n "sqlite3_prepare_v2" lib/sqlite3db.cpp -B 5 -A 15 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 2141


🏁 Script executed:

# Verify that execute_statement does NOT loop or handle multiple statements
sed -n '322,388p' lib/sqlite3db.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 2511


Fix multi-statement SQL execution in mcp_query_rules operations.

Lines 2371 and 2383 use execute_statement, which calls sqlite3_prepare_v2 to compile the query. sqlite3_prepare_v2 compiles only the first SQL statement of a semicolon-separated string (the unparsed remainder is returned through its pzTail argument and never executed here), so the INSERT OR REPLACE part will never run.

Either switch to execute() (which uses sqlite3_exec and does support multi-statement strings) for the combined string, or run each statement as its own call. The pattern used elsewhere in the codebase, such as the BQE1 macro in ProxySQL_Admin.cpp and the debug handlers, demonstrates the approach of executing statements individually.

Example pattern (from LOAD DEBUG handlers):
SPA->admindb->execute("DELETE FROM main.debug_filters");
SPA->admindb->execute("INSERT OR REPLACE INTO main.debug_levels SELECT * FROM disk.debug_levels");
SPA->admindb->execute("INSERT INTO main.debug_filters SELECT * FROM disk.debug_filters");
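The difference is easy to observe outside the codebase: Python's stdlib sqlite3 module wraps the same C API, and its execute() goes through prepare-style compilation (rejecting a second statement; current CPython raises ProgrammingError, older versions sqlite3.Warning) while executescript() behaves like sqlite3_exec() and runs everything. A minimal sketch, with table and values purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debug_levels (module TEXT, verbosity INT)")

# execute() prepares a single statement, like sqlite3_prepare_v2:
# a second semicolon-separated statement is rejected, not silently run.
multi = ("INSERT INTO debug_levels VALUES ('a', 1); "
         "INSERT INTO debug_levels VALUES ('b', 2)")
try:
    conn.execute(multi)
    prepared_multi_ok = True
except (sqlite3.Warning, sqlite3.ProgrammingError):
    prepared_multi_ok = False

# executescript() behaves like sqlite3_exec(): every statement runs.
conn.executescript(multi + ";")
rows = conn.execute("SELECT COUNT(*) FROM debug_levels").fetchone()[0]
```

With the multi-statement string, prepared_multi_ok ends up False and only the executescript() path inserts both rows.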
🤖 Prompt for AI Agents
In `@lib/Admin_Handler.cpp` around lines 2369 - 2374, The current code builds a
single semicolon-separated SQL string into *q and relies on execute_statement
(which uses sqlite3_prepare_v2 and only compiles the first statement) so the
INSERT never runs; change the logic around where *q is set (the l_free/l_strdup
block that creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each
SQL statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.

Comment on lines +370 to 390
 	sql << "SELECT schema, kind, key, document, tags , links FROM catalog WHERE 1=1";

 	if (!schema.empty()) {
 		sql << " AND schema = '" << schema << "'";
 	}

 	if (!kind.empty()) {
-		sql << " WHERE kind = '" << kind << "'";
+		sql << " AND kind = '" << kind << "'";
 	}

-	sql << " ORDER BY kind, key ASC LIMIT " << limit << " OFFSET " << offset;
+	sql << " ORDER BY schema, kind , key ASC LIMIT " << limit << " OFFSET " << offset;

 	// Get total count
 	std::ostringstream count_sql;
-	count_sql << "SELECT COUNT(*) FROM catalog";
+	count_sql << "SELECT COUNT(*) FROM catalog WHERE 1=1";
 	if (!schema.empty()) {
 		count_sql << " AND schema = '" << schema << "'";
 	}
 	if (!kind.empty()) {
-		count_sql << " WHERE kind = '" << kind << "'";
+		count_sql << " AND kind = '" << kind << "'";
 	}

⚠️ Potential issue | 🔴 Critical

SQL injection vulnerability in list() - same issue as search().

The list() method has the same SQL injection vulnerability, directly concatenating schema and kind parameters into SQL strings at lines 373, 377, 386, and 389.

Use prepared statements with parameter binding as done in upsert() and get().

🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 370 - 390, The list() method currently
concatenates user inputs into the SQL strings (variables sql and count_sql),
causing SQL injection; change it to use prepared statements and parameter
binding like upsert() and get(): build the base query with placeholders (e.g.,
"WHERE 1=1" plus "AND schema = ?"/"AND kind = ?"), prepare the statement via
your MySQL API, bind schema and kind only when non-empty, and bind limit/offset
as parameters before executing both the main query and the COUNT query (use the
same conditional placeholder logic for count_sql). Ensure you replace the direct
string concatenation at the spots that currently append schema/kind and the
final "LIMIT/OFFSET" with proper placeholders and binding.
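The conditional filters can still be assembled incrementally while keeping user input out of the SQL text. A minimal Python sketch of the placeholder approach the comment recommends (function and column names mirror the snippet above but are illustrative, not the actual API):

```python
def build_list_query(schema: str, kind: str, limit: int, offset: int):
    """Build the catalog list query with placeholders instead of concatenation."""
    sql = "SELECT schema, kind, key, document, tags, links FROM catalog WHERE 1=1"
    params = []
    if schema:
        sql += " AND schema = ?"
        params.append(schema)
    if kind:
        sql += " AND kind = ?"
        params.append(kind)
    # LIMIT/OFFSET are bound too, so no integer formatting into the string
    sql += " ORDER BY schema, kind, key ASC LIMIT ? OFFSET ?"
    params.extend([limit, offset])
    return sql, params
```

The same conditional logic applies to the COUNT query; only the SELECT list changes.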

Comment on lines 505 to +511
 	std::ostringstream sql;
-	sql << "DELETE FROM catalog WHERE kind = '" << kind << "' AND key = '" << key << "'";
+	sql << "DELETE FROM catalog WHERE 1=1";
+
+	if (!schema.empty()) {
+		sql << " AND schema = '" << schema << "'";
+	}
+	sql << " AND kind = '" << kind << "' AND key = '" << key << "'";

⚠️ Potential issue | 🔴 Critical

SQL injection vulnerability in remove() - user input in DELETE statement.

The remove() method concatenates schema, kind, and key directly into a DELETE SQL statement. This is especially dangerous as it could allow attackers to delete arbitrary catalog entries.

Proposed fix using prepared statements
 int MySQL_Catalog::remove(
 	const std::string& schema,
 	const std::string& kind,
 	const std::string& key
 ) {
-	std::ostringstream sql;
-	sql << "DELETE FROM catalog WHERE 1=1";
-
-	if (!schema.empty()) {
-		sql << " AND schema = '" << schema << "'";
-	}
-	sql << " AND kind = '" << kind << "' AND key = '" << key << "'";
-
-	if (!db->execute(sql.str().c_str())) {
+	sqlite3_stmt* stmt = NULL;
+	const char* delete_sql = schema.empty()
+		? "DELETE FROM catalog WHERE kind = ?1 AND key = ?2"
+		: "DELETE FROM catalog WHERE schema = ?1 AND kind = ?2 AND key = ?3";
+
+	int rc = db->prepare_v2(delete_sql, &stmt);
+	if (rc != SQLITE_OK) {
+		proxy_error("Failed to prepare catalog remove: %d\n", rc);
+		return -1;
+	}
+
+	if (schema.empty()) {
+		(*proxy_sqlite3_bind_text)(stmt, 1, kind.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 2, key.c_str(), -1, SQLITE_TRANSIENT);
+	} else {
+		(*proxy_sqlite3_bind_text)(stmt, 1, schema.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 2, kind.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 3, key.c_str(), -1, SQLITE_TRANSIENT);
+	}
+
+	SAFE_SQLITE3_STEP2(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
+
+	if (rc != SQLITE_DONE && rc != SQLITE_OK) {
 		proxy_error("Catalog remove error\n");
 		return -1;
 	}
 
 	return 0;
 }
🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 505 - 511, In the remove() method, avoid
concatenating user inputs (schema, kind, key) into the SQL string; instead build
a parameterized DELETE and use a prepared statement to bind values for schema
(when non-empty), kind and key before executing. Replace the ostringstream
construction in MySQL_Catalog::remove() with a DELETE query that uses
placeholders for schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND
schema = ?" when schema provided, then always "AND kind = ? AND key = ?"),
prepare that statement via your MySQL client API, bind schema (conditionally),
bind kind and key, execute, and propagate/log any execution errors.
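To see why this is rated critical: a crafted key widens the DELETE's WHERE clause under concatenation, while binding treats the same bytes as a literal value. A minimal Python sketch (stdlib sqlite3; table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (kind TEXT, key TEXT)")
seed = [("table", "users"), ("table", "orders"), ("index", "idx1")]
conn.executemany("INSERT INTO catalog VALUES (?, ?)", seed)

# Concatenation, as in the flagged remove(): the quote breaks out of the
# string literal and the OR clause matches every row.
hostile_key = "users' OR '1'='1"
conn.execute("DELETE FROM catalog WHERE kind = 'table' AND key = '"
             + hostile_key + "'")
left_after_injection = conn.execute(
    "SELECT COUNT(*) FROM catalog").fetchone()[0]

# Parameter binding: the same string matches no key, nothing is deleted.
conn.executemany("INSERT INTO catalog VALUES (?, ?)", seed)
conn.execute("DELETE FROM catalog WHERE kind = ? AND key = ?",
             ("table", hostile_key))
left_after_binding = conn.execute(
    "SELECT COUNT(*) FROM catalog").fetchone()[0]
```

The concatenated DELETE empties the table; the bound DELETE leaves all three seeded rows in place.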

Comment on lines +2581 to +2627
void ProxySQL_Admin::stats___mcp_query_digest(bool reset) {
if (!GloMCPH) return;
Query_Tool_Handler* qth = GloMCPH->query_tool_handler;
if (!qth) return;

// Get the discovery schema catalog
Discovery_Schema* catalog = qth->get_catalog();
if (!catalog) return;

// Get the stats from the catalog (includes reset logic)
SQLite3_result* resultset = catalog->get_mcp_query_digest(reset);
if (!resultset) return;

statsdb->execute("BEGIN");

if (reset) {
statsdb->execute("DELETE FROM stats_mcp_query_digest_reset");
} else {
statsdb->execute("DELETE FROM stats_mcp_query_digest");
}

// Insert digest statistics into the stats table
// Columns: tool_name, run_id, digest, digest_text, count_star,
// first_seen, last_seen, sum_time, min_time, max_time
char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
SQLite3_row* r = *it;
int arg_len = 0;
for (int i = 0; i < 10; i++) {
arg_len += strlen(r->fields[i]);
}
char* query = (char*)malloc(strlen(a) + arg_len + 32);
sprintf(query, a,
r->fields[0], // tool_name
r->fields[1], // run_id
r->fields[2], // digest
r->fields[3], // digest_text
r->fields[4], // count_star
r->fields[5], // first_seen
r->fields[6], // last_seen
r->fields[7], // sum_time
r->fields[8], // min_time
r->fields[9] // max_time
);
statsdb->execute(query);
free(query);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -c 500000 lib/ProxySQL_Admin_Stats.cpp | wc -l

Repository: ProxySQL/proxysql-vec

Length of output: 69


🏁 Script executed:

sed -n '2580,2630p' lib/ProxySQL_Admin_Stats.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 1741


🏁 Script executed:

rg "stats_mcp_query_digest" --type cpp -A 3 -B 3 | head -100

Repository: ProxySQL/proxysql-vec

Length of output: 6780


Fix table name in INSERT statement and use prepared statements for NULL-safe binding.

Line 2605 always inserts into stats_mcp_query_digest even when reset == true, writing reset data to the live table instead of the reset table. Additionally, the strlen-based approach crashes if run_id is NULL (the schema allows it) and is vulnerable to SQL injection from unescaped quotes in digest text.

Replace the hardcoded INSERT with dynamic table selection and prepared statement binding:

Proposed fix
-	char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		int arg_len = 0;
-		for (int i = 0; i < 10; i++) {
-			arg_len += strlen(r->fields[i]);
-		}
-		char* query = (char*)malloc(strlen(a) + arg_len + 32);
-		sprintf(query, a,
-			r->fields[0],  // tool_name
-			r->fields[1],  // run_id
-			r->fields[2],  // digest
-			r->fields[3],  // digest_text
-			r->fields[4],  // count_star
-			r->fields[5],  // first_seen
-			r->fields[6],  // last_seen
-			r->fields[7],  // sum_time
-			r->fields[8],  // min_time
-			r->fields[9]   // max_time
-		);
-		statsdb->execute(query);
-		free(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_digest_reset"
+		: "stats_mcp_query_digest";
+	const std::string sql =
+		std::string("INSERT INTO ") + table +
+		" VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10)";
+	int rc = 0;
+	stmt_unique_ptr u_stmt { nullptr };
+	std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+	ASSERT_SQLITE_OK(rc, statsdb);
+	sqlite3_stmt* const stmt = u_stmt.get();
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		if (r->fields[1]) {
+			rc = (*proxy_sqlite3_bind_int64)(stmt, 2, atoll(r->fields[1])); ASSERT_SQLITE_OK(rc, statsdb);
+		} else {
+			rc = (*proxy_sqlite3_bind_null)(stmt, 2); ASSERT_SQLITE_OK(rc, statsdb);
+		}
+		rc = (*proxy_sqlite3_bind_text)(stmt, 3, r->fields[2], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 4, r->fields[3], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 9, atoll(r->fields[8])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 10, atoll(r->fields[9])); ASSERT_SQLITE_OK(rc, statsdb);
+		SAFE_SQLITE3_STEP2(stmt);
+		rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+	}
🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin_Stats.cpp` around lines 2581 - 2627, The INSERT always
targets stats_mcp_query_digest and uses strlen/risky sprintf, causing reset data
to go to the live table, crashes on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.
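The suggested prepare-once/bind-per-row loop has the same shape as Python's executemany(), which also handles NULL fields and embedded quotes without any manual escaping. A minimal sketch with a reduced column set (names reused from the stats table, values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_mcp_query_digest ("
             "tool_name TEXT, run_id INTEGER, digest TEXT, count_star INTEGER)")

rows = [
    ("query_tool", 1, "0xAB12", 10),
    ("query_tool", None, 'SELECT "quoted" FROM t', 3),  # NULL run_id, quotes
]
# executemany() prepares the statement once and rebinds per row, like the
# proposed bind/step/reset loop; None maps to SQL NULL, no strlen crash,
# and the digest text needs no escaping.
conn.executemany("INSERT INTO stats_mcp_query_digest VALUES (?, ?, ?, ?)", rows)

null_runs = conn.execute(
    "SELECT COUNT(*) FROM stats_mcp_query_digest WHERE run_id IS NULL"
).fetchone()[0]
total = conn.execute(
    "SELECT COUNT(*) FROM stats_mcp_query_digest").fetchone()[0]
```

Both rows land, including the one with a NULL run_id and a quote-laden digest text.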

Comment on lines +1354 to +1357
if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
{ stats_mcp_query_tools_counters=true; refresh=true; }
if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
{ stats_mcp_query_tools_counters_reset=true; refresh=true; }

⚠️ Potential issue | 🟡 Minor

Avoid double-refresh when querying stats_mcp_query_tools_counters_reset.
The reset token contains the non-reset token, so both flags become true and both refresh paths run. Prefer a reset-first else if like other reset paths to avoid extra work and unintended table updates.

🛠️ Suggested fix
-	if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
-		{ stats_mcp_query_tools_counters=true; refresh=true; }
-	if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
-		{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+	if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
+		{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+	else if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
+		{ stats_mcp_query_tools_counters=true; refresh=true; }
🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin.cpp` around lines 1354 - 1357, The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.

Comment on lines +235 to +319
int Query_Tool_Handler::init_connection_pool() {
// Parse hosts
std::vector<std::string> host_list;
std::istringstream h(mysql_hosts);
std::string host;
while (std::getline(h, host, ',')) {
host.erase(0, host.find_first_not_of(" \t"));
host.erase(host.find_last_not_of(" \t") + 1);
if (!host.empty()) {
host_list.push_back(host);
}
}

// Parse ports
std::vector<int> port_list;
std::istringstream p(mysql_ports);
std::string port;
while (std::getline(p, port, ',')) {
port.erase(0, port.find_first_not_of(" \t"));
port.erase(port.find_last_not_of(" \t") + 1);
if (!port.empty()) {
port_list.push_back(atoi(port.c_str()));
}
}

// Ensure ports array matches hosts array size
while (port_list.size() < host_list.size()) {
port_list.push_back(3306);
}

if (host_list.empty()) {
proxy_error("Query_Tool_Handler: No hosts configured\n");
return -1;
}

pthread_mutex_lock(&pool_lock);

for (size_t i = 0; i < host_list.size(); i++) {
MySQLConnection conn;
conn.host = host_list[i];
conn.port = port_list[i];
conn.in_use = false;

MYSQL* mysql = mysql_init(NULL);
if (!mysql) {
proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
conn.host.c_str(), conn.port);
pthread_mutex_unlock(&pool_lock);
return -1;
}

unsigned int timeout = 5;
mysql_options(mysql, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);
mysql_options(mysql, MYSQL_OPT_READ_TIMEOUT, &timeout);
mysql_options(mysql, MYSQL_OPT_WRITE_TIMEOUT, &timeout);

if (!mysql_real_connect(
mysql,
conn.host.c_str(),
mysql_user.c_str(),
mysql_password.c_str(),
mysql_schema.empty() ? NULL : mysql_schema.c_str(),
conn.port,
NULL,
CLIENT_MULTI_STATEMENTS
)) {
proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
conn.host.c_str(), conn.port, mysql_error(mysql));
mysql_close(mysql);
pthread_mutex_unlock(&pool_lock);
return -1;
}

conn.mysql = mysql;
connection_pool.push_back(conn);
pool_size++;

proxy_info("Query_Tool_Handler: Connected to %s:%d\n",
conn.host.c_str(), conn.port);
}

pthread_mutex_unlock(&pool_lock);
proxy_info("Query_Tool_Handler: Connection pool initialized with %d connection(s)\n", pool_size);
return 0;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the header file for Query_Tool_Handler
fd -e "h" -e "hpp" | xargs grep -l "class Query_Tool_Handler" 2>/dev/null

Repository: ProxySQL/proxysql-vec

Length of output: 114


🏁 Script executed:

# Also check the cpp file for class definition inline
head -100 lib/Query_Tool_Handler.cpp | grep -A 50 "class Query_Tool_Handler"

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Find all method definitions in the Query_Tool_Handler.cpp file
rg "^\s*(public|private|protected):" lib/Query_Tool_Handler.cpp -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Search for connection_pool usages
rg "connection_pool" lib/Query_Tool_Handler.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 1123


🏁 Script executed:

# Check for constructor and initialization of connection_pool
rg "Query_Tool_Handler::|connection_pool.*=" lib/Query_Tool_Handler.cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 1169


🏁 Script executed:

# Read the Query_Tool_Handler header file to see class structure
cat -n include/Query_Tool_Handler.h

Repository: ProxySQL/proxysql-vec

Length of output: 7390


🏁 Script executed:

# Read the destructor and close() method implementations
rg "Query_Tool_Handler::~Query_Tool_Handler|void Query_Tool_Handler::close" lib/Query_Tool_Handler.cpp -A 30

Repository: ProxySQL/proxysql-vec

Length of output: 1464


🏁 Script executed:

# Get the full init() method
rg "int Query_Tool_Handler::init()" lib/Query_Tool_Handler.cpp -A 50

Repository: ProxySQL/proxysql-vec

Length of output: 2255


🏁 Script executed:

# Search for all calls to init_connection_pool
rg "init_connection_pool" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 144


🏁 Script executed:

# Check the constructor implementation
rg "Query_Tool_Handler::Query_Tool_Handler" lib/Query_Tool_Handler.cpp -A 30

Repository: ProxySQL/proxysql-vec

Length of output: 857


🏁 Script executed:

# Check if there are any guards or flags in init() to prevent calling it multiple times
rg "init()" lib/Query_Tool_Handler.cpp -B 5 -A 15 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 2084


🏁 Script executed:

# Check the full init_connection_pool signature and look at early error returns
sed -n '235,319p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 2310


🏁 Script executed:

# Check if there's any state management or guards to prevent multiple init calls
rg "connection_pool.empty()|pool_size" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 211


🏁 Script executed:

# Check get_connection() to see if it guards against partially initialized pool
rg "void\* Query_Tool_Handler::get_connection" lib/Query_Tool_Handler.cpp -A 20

Repository: ProxySQL/proxysql-vec

Length of output: 530


🏁 Script executed:

# Check if init() is ever called more than once, or if there's any guard against it
rg "\.init\(\)" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check the caller context - how is Query_Tool_Handler created and used
rg "Query_Tool_Handler" lib/Query_Tool_Handler.cpp | grep -E "(new|delete|init\(\)|close\(\))" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 132


🏁 Script executed:

# Check if there are any other places in the codebase that use Query_Tool_Handler
# to understand the usage pattern
find . -name "*.cpp" -o -name "*.h" | xargs grep -l "Query_Tool_Handler" | grep -v Query_Tool_Handler.cpp | grep -v Query_Tool_Handler.h

Repository: ProxySQL/proxysql-vec

Length of output: 194


🏁 Script executed:

# Look at one of those files to see how Query_Tool_Handler is used
rg "Query_Tool_Handler" include/MCP_Thread.h -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 508


🏁 Script executed:

# Check MCP_Thread.cpp to see how Query_Tool_Handler is initialized and used
rg "query_tool_handler" lib/MCP_Thread.cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 361


🏁 Script executed:

# Check ProxySQL_MCP_Server.cpp for initialization patterns
rg "Query_Tool_Handler|init\(\)|close\(\)" lib/ProxySQL_MCP_Server.cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 494


🏁 Script executed:

# Verify that the early return (no hosts) happens before acquiring lock
sed -n '280,290p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 430


🏁 Script executed:

# Get exact line numbers around pool_lock acquisition and early returns
sed -n '270,285p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 461


Clean up partially initialized pools on connection failure.

If a later host fails to connect, earlier MYSQL* handles remain in connection_pool while the function returns -1, leaving the object in an inconsistent state. Although the destructor eventually cleans up via close(), the intermediate state violates the expected postcondition of the function (either fully initialized or empty).

Add a helper lambda to clean up and reset the pool before returning on connection failures:

🧹 Suggested fix
 int Query_Tool_Handler::init_connection_pool() {
+	// Helper to cleanup partially initialized pool (pool_lock must be held)
+	auto cleanup_pool_locked = [&]() {
+		for (auto& c : connection_pool) {
+			if (c.mysql) {
+				mysql_close(static_cast<MYSQL*>(c.mysql));
+				c.mysql = NULL;
+			}
+		}
+		connection_pool.clear();
+		pool_size = 0;
+	};
+
 	// Parse hosts
 	std::vector<std::string> host_list;
@@
 		MYSQL* mysql = mysql_init(NULL);
 		if (!mysql) {
 			proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
 				conn.host.c_str(), conn.port);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
@@
 		)) {
 			proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
 				conn.host.c_str(), conn.port, mysql_error(mysql));
 			mysql_close(mysql);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
🤖 Prompt for AI Agents
In `@lib/Query_Tool_Handler.cpp` around lines 235 - 319, The init_connection_pool
function must clean up any partially initialized connections on error: add a
helper (lambda or private method) inside init_connection_pool that iterates
connection_pool, calls mysql_close on each entry's MYSQL* (and sets conn.mysql =
nullptr), clears connection_pool, resets pool_size to 0, and unlocks pool_lock
if held; call this cleanup helper before every early return that occurs after
some successful connections (e.g., after mysql_init/mysql_real_connect failures)
so the function either returns with a fully initialized pool or an empty,
consistent pool; update all return paths in init_connection_pool to invoke this
helper and then return -1.
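The all-or-nothing postcondition the comment asks for can be sketched in a few lines; here connect is a stand-in for mysql_init/mysql_real_connect and the endpoint list for the parsed host/port pairs:

```python
def init_pool(endpoints, connect):
    """Connect to every endpoint, or close whatever was opened and fail.

    On any failure the pool is left empty, never partially initialized.
    """
    pool = []
    for ep in endpoints:
        try:
            pool.append(connect(ep))
        except Exception:
            # Roll back the partial pool before propagating the error.
            for conn in pool:
                conn.close()
            pool.clear()
            raise
    return pool
```

Either the caller gets a pool with one connection per endpoint, or an exception and no dangling handles, which matches the "fully initialized or empty" postcondition described above.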

Comment on lines +236 to +258
int Static_Harvester::execute_query(const std::string& query, std::vector<std::vector<std::string>>& results) {
pthread_mutex_lock(&conn_lock);

if (!mysql_conn) {
pthread_mutex_unlock(&conn_lock);
proxy_error("Static_Harvester: Not connected to MySQL\n");
return -1;
}

proxy_debug(PROXY_DEBUG_GENERIC, 3, "Static_Harvester: Executing query: %s\n", query.c_str());

if (mysql_query(mysql_conn, query.c_str())) {
proxy_error("Static_Harvester: Query failed: %s\n", mysql_error(mysql_conn));
pthread_mutex_unlock(&conn_lock);
return -1;
}

MYSQL_RES* res = mysql_store_result(mysql_conn);
if (!res) {
// No result set (e.g., INSERT/UPDATE)
pthread_mutex_unlock(&conn_lock);
return 0;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "Static_Harvester.cpp" | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

wc -l lib/Static_Harvester.cpp 2>/dev/null || find . -name "Static_Harvester.cpp" -exec wc -l {} \;

Repository: ProxySQL/proxysql-vec

Length of output: 94


🏁 Script executed:

cat -n lib/Static_Harvester.cpp | sed -n '230,265p'

Repository: ProxySQL/proxysql-vec

Length of output: 1366


🌐 Web query:

MySQL C API mysql_store_result NULL error handling documentation

💡 Result:

Short summary — how to handle mysql_store_result() returning NULL:

  • mysql_store_result() returns NULL either when the statement produced no result set (e.g., INSERT/UPDATE/DELETE) or when an error occurred while reading the result. [1][2]
  • To tell which case it is:
    • If mysql_field_count(mysql) == 0 then the statement does not produce a result set. [1][2]
    • Otherwise check mysql_errno(mysql) != 0 or mysql_error(mysql) (non-empty) to detect an error. [1][2]
  • Note: mysql_store_result() resets mysql_error()/mysql_errno() on success; when it returns NULL an allocation/connection/read error (CR_OUT_OF_MEMORY, CR_SERVER_GONE_ERROR, CR_SERVER_LOST, CR_COMMANDS_OUT_OF_SYNC, etc.) may be indicated by mysql_errno(). [1][3]
  • See the MySQL manual’s dedicated section “NULL mysql_store_result() Return After mysql_query() Success” for examples and details. [3]

Sources:
[1] MySQL C API — mysql_store_result() documentation.
[2] MySQL C API Developer Guide (mysql_store_result) — dev.mysql.com.
[3] MySQL C API guide — “NULL mysql_store_result() Return After mysql_query() Success.”


Distinguish "no result set" from query error in mysql_store_result() handling.

mysql_store_result() returns NULL for both successful queries with no result set (INSERT/UPDATE) and for errors (memory allocation, connection loss, read failures). The current code treats all NULL cases as success, silently masking database errors.

Per MySQL C API documentation, distinguish these cases by checking mysql_field_count() to detect whether the query produced a result set, then check mysql_errno() if no result set to confirm it was not an error.

Suggested approach
 	MYSQL_RES* res = mysql_store_result(mysql_conn);
 	if (!res) {
-		// No result set (e.g., INSERT/UPDATE)
-		pthread_mutex_unlock(&conn_lock);
-		return 0;
+		// Distinguish: no result set vs. error
+		if (mysql_field_count(mysql_conn) != 0) {
+			proxy_error("Static_Harvester: Query failed after store_result: %s\n",
+				mysql_error(mysql_conn));
+			pthread_mutex_unlock(&conn_lock);
+			return -1;
+		}
+		pthread_mutex_unlock(&conn_lock);
+		return 0;
 	}
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 236-258: in Static_Harvester::execute_query, a NULL from mysql_store_result() is being treated as a
successful "no result set" case; instead check mysql_field_count(mysql_conn) to
see if the query should have a result set—if mysql_field_count() == 0 treat it
as a successful non-result (unlock conn_lock and return 0), otherwise treat NULL
as an error by calling mysql_errno(mysql_conn) and logging
mysql_error(mysql_conn), unlock conn_lock and return -1; update the error log to
include mysql_error(mysql_conn) and ensure conn_lock is always released before
returning.

Comment on lines +388 to +662
// Fetch schema/database metadata from information_schema.SCHEMATA.
//
// Queries MySQL for all schemas (databases) and their character set
// and collation information.
//
// Parameters:
// filter - Optional schema name filter (empty for all schemas)
//
// Returns:
// Vector of SchemaRow structures containing schema metadata
std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
std::vector<SchemaRow> schemas;

std::ostringstream sql;
sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
<< "FROM information_schema.SCHEMATA";

if (!filter.empty()) {
sql << " WHERE SCHEMA_NAME = '" << filter << "'";
}

sql << " ORDER BY SCHEMA_NAME;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
SchemaRow s;
s.schema_name = row[0];
s.charset = row[1];
s.collation = row[2];
schemas.push_back(s);
}
}

return schemas;
}

// ============================================================
// Harvest Stage Methods
// ============================================================

// Harvest schemas/databases to the catalog.
//
// Fetches schemas from information_schema.SCHEMATA and inserts them
// into the catalog. System schemas (mysql, information_schema,
// performance_schema, sys) are skipped.
//
// Parameters:
// only_schema - Optional filter to harvest only one schema
//
// Returns:
// Number of schemas harvested, or -1 on error
//
// Notes:
// - Requires an active run (start_run must be called first)
// - Skips system schemas automatically
int Static_Harvester::harvest_schemas(const std::string& only_schema) {
if (current_run_id < 0) {
proxy_error("Static_Harvester: No active run\n");
return -1;
}

std::vector<SchemaRow> schemas = fetch_schemas(only_schema);
int count = 0;

for (const auto& s : schemas) {
// Skip system schemas
if (s.schema_name == "mysql" || s.schema_name == "information_schema" ||
s.schema_name == "performance_schema" || s.schema_name == "sys") {
continue;
}

if (catalog->insert_schema(current_run_id, s.schema_name, s.charset, s.collation) >= 0) {
count++;
}
}

proxy_info("Static_Harvester: Harvested %d schemas\n", count);
return count;
}

// Fetch table and view metadata from information_schema.TABLES.
//
// Queries MySQL for all tables and views with their physical
// characteristics (rows, size, engine, timestamps).
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of ObjectRow structures containing table/view metadata
std::vector<Static_Harvester::ObjectRow> Static_Harvester::fetch_tables_views(const std::string& filter) {
std::vector<ObjectRow> objects;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, TABLE_ROWS, "
<< "DATA_LENGTH, INDEX_LENGTH, CREATE_TIME, UPDATE_TIME, TABLE_COMMENT "
<< "FROM information_schema.TABLES "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
ObjectRow o;
o.schema_name = row[0];
o.object_name = row[1];
o.object_type = (row[2] == "VIEW") ? "view" : "table";
o.engine = row[3];
o.table_rows_est = row[4].empty() ? 0 : atol(row[4].c_str());
o.data_length = row[5].empty() ? 0 : atol(row[5].c_str());
o.index_length = row[6].empty() ? 0 : atol(row[6].c_str());
o.create_time = row[7];
o.update_time = row[8];
o.object_comment = row[9];
objects.push_back(o);
}
}

return objects;
}

// Fetch column metadata from information_schema.COLUMNS.
//
// Queries MySQL for all columns with their data types, nullability,
// defaults, character set, and comments.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of ColumnRow structures containing column metadata
std::vector<Static_Harvester::ColumnRow> Static_Harvester::fetch_columns(const std::string& filter) {
std::vector<ColumnRow> columns;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION, COLUMN_NAME, "
<< "DATA_TYPE, COLUMN_TYPE, IS_NULLABLE, COLUMN_DEFAULT, EXTRA, "
<< "CHARACTER_SET_NAME, COLLATION_NAME, COLUMN_COMMENT "
<< "FROM information_schema.COLUMNS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
ColumnRow c;
c.schema_name = row[0];
c.object_name = row[1];
c.ordinal_pos = atoi(row[2].c_str());
c.column_name = row[3];
c.data_type = row[4];
c.column_type = row[5];
c.is_nullable = (row[6] == "YES") ? 1 : 0;
c.column_default = row[7];
c.extra = row[8];
c.charset = row[9];
c.collation = row[10];
c.column_comment = row[11];
columns.push_back(c);
}
}

return columns;
}

// Fetch index metadata from information_schema.STATISTICS.
//
// Queries MySQL for all indexes with their columns, sequence,
// uniqueness, cardinality, and collation.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of IndexRow structures containing index metadata
std::vector<Static_Harvester::IndexRow> Static_Harvester::fetch_indexes(const std::string& filter) {
std::vector<IndexRow> indexes;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, NON_UNIQUE, INDEX_TYPE, "
<< "SEQ_IN_INDEX, COLUMN_NAME, SUB_PART, COLLATION, CARDINALITY "
<< "FROM information_schema.STATISTICS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
IndexRow i;
i.schema_name = row[0];
i.object_name = row[1];
i.index_name = row[2];
i.is_unique = (row[3] == "0") ? 1 : 0; // NON_UNIQUE = 0 means the index is unique
i.index_type = row[4];
i.seq_in_index = atoi(row[5].c_str());
i.column_name = row[6];
i.sub_part = row[7].empty() ? 0 : atoi(row[7].c_str());
i.collation = row[8];
i.cardinality = row[9].empty() ? 0 : atol(row[9].c_str());
indexes.push_back(i);
}
}

return indexes;
}

// Fetch foreign key metadata from information_schema.
//
// Queries KEY_COLUMN_USAGE and REFERENTIAL_CONSTRAINTS to get
// foreign key relationships including child/parent tables and columns,
// and ON UPDATE/DELETE rules.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of FKRow structures containing foreign key metadata
std::vector<Static_Harvester::FKRow> Static_Harvester::fetch_foreign_keys(const std::string& filter) {
std::vector<FKRow> fks;

std::ostringstream sql;
sql << "SELECT kcu.CONSTRAINT_SCHEMA AS child_schema, "
<< "kcu.TABLE_NAME AS child_table, kcu.CONSTRAINT_NAME AS fk_name, "
<< "kcu.COLUMN_NAME AS child_column, kcu.REFERENCED_TABLE_SCHEMA AS parent_schema, "
<< "kcu.REFERENCED_TABLE_NAME AS parent_table, kcu.REFERENCED_COLUMN_NAME AS parent_column, "
<< "kcu.ORDINAL_POSITION AS seq, rc.UPDATE_RULE AS on_update, rc.DELETE_RULE AS on_delete "
<< "FROM information_schema.KEY_COLUMN_USAGE kcu "
<< "JOIN information_schema.REFERENTIAL_CONSTRAINTS rc "
<< " ON rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA "
<< " AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME "
<< "WHERE kcu.TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND kcu.TABLE_SCHEMA = '" << filter << "'";
}

sql << " AND kcu.REFERENCED_TABLE_NAME IS NOT NULL "
<< "ORDER BY child_schema, child_table, fk_name, seq;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
FKRow fk;
fk.child_schema = row[0];
fk.child_table = row[1];
fk.fk_name = row[2];
fk.child_column = row[3];
fk.parent_schema = row[4];
fk.parent_table = row[5];
fk.parent_column = row[6];
fk.seq = atoi(row[7].c_str());
fk.on_update = row[8];
fk.on_delete = row[9];
fks.push_back(fk);
}
}

return fks;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for calls to the fetch_* methods to understand where filter comes from
rg "fetch_schemas|fetch_tables_views|fetch_columns|fetch_indexes|fetch_foreign_keys" --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 5678


🏁 Script executed:

# Check if there's a header file that might show the method signatures and their usage context
fd "\.h$" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 630


🏁 Script executed:

# Look for where harvest_schemas is called to trace the data flow
rg "harvest_schemas" --type cpp -B 5 -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 1911


🏁 Script executed:

# Search for calls to run_full_harvest
rg "run_full_harvest" --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 1870


🏁 Script executed:

# Also look for any input validation or escaping patterns used elsewhere in the codebase
rg "mysql_escape|mysql_real_escape|prepared.*statement|parameterized" --type cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 2187


🏁 Script executed:

# Look at the Query_Tool_Handler.cpp context more thoroughly
rg "schema_filter|json_string" lib/Query_Tool_Handler.cpp -B 5 -A 10 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 1929


🏁 Script executed:

# Check execute_query implementation
rg "execute_query" lib/Static_Harvester.cpp -A 15 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 1217


🏁 Script executed:

# Check if execute_query can execute multiple statements
ast-grep --pattern 'int Static_Harvester::execute_query($_, $_) { $$$ }'

Repository: ProxySQL/proxysql-vec

Length of output: 47


Add input validation for the filter parameter to prevent SQL injection in the fetch_* methods.

The filter parameter originates from user input via the MCP tool interface (the discovery.run_static tool) and is embedded directly into SQL queries without escaping. This allows SQL injection via payloads such as `test' OR '1'='1`. Implement whitelist validation that accepts only alphanumeric characters and underscores, which is appropriate for MySQL schema identifiers.

Apply the validation across all five fetch methods:

  • fetch_schemas()
  • fetch_tables_views()
  • fetch_columns()
  • fetch_indexes()
  • fetch_foreign_keys()
Example fix
+static bool is_safe_identifier(const std::string& s) {
+	return !s.empty() &&
+	       std::all_of(s.begin(), s.end(),
+	           [](unsigned char c){ return std::isalnum(c) || c == '_'; });
+}
+
 std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
 	std::vector<SchemaRow> schemas;
 
 	std::ostringstream sql;
 	sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
 	    << "FROM information_schema.SCHEMATA";
 
 	if (!filter.empty()) {
+		if (!is_safe_identifier(filter)) {
+			proxy_error("Static_Harvester: Invalid schema filter '%s'\n", filter.c_str());
+			return schemas;
+		}
 		sql << " WHERE SCHEMA_NAME = '" << filter << "'";
 	}
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 388-662: all five fetch_* methods (fetch_schemas, fetch_tables_views, fetch_columns, fetch_indexes,
fetch_foreign_keys) currently interpolate the user-provided filter directly into
SQL and need whitelist validation to prevent SQL injection: add a small helper
(e.g., is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.

Comment on lines +1014 to +1055
if (current_run_id < 0) {
proxy_error("Static_Harvester: No active run\n");
return -1;
}

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, VIEW_DEFINITION "
<< "FROM information_schema.VIEWS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!only_schema.empty()) {
sql << " AND TABLE_SCHEMA = '" << only_schema << "'";
}

sql << ";";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) != 0) {
return -1;
}

int count = 0;
for (const auto& row : results) {
std::string schema_name = row[0];
std::string view_name = row[1];
std::string view_def = row[2];

// Update object with definition
char* error = NULL;
int cols = 0, affected = 0;
std::ostringstream update_sql;
update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
<< "WHERE run_id = " << current_run_id
<< " AND schema_name = '" << schema_name << "'"
<< " AND object_name = '" << view_name << "'"
<< " AND object_type = 'view';";

catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
if (affected > 0) {
count++;
}
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and examine the file structure
find . -name "Static_Harvester.cpp" -o -name "Static_Harvester.h" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 186


🏁 Script executed:

# Search for SQL escaping utilities or helper functions
rg -i "escape|quote" --type cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 3528


🏁 Script executed:

# Check the database type being used and how other queries handle input
rg "execute_query|execute_statement" --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 4030


🏁 Script executed:

# Find Static_Harvester.cpp
find . -name "Static_Harvester.cpp" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

# Examine the Static_Harvester.cpp file around the problematic lines
find . -name "Static_Harvester.cpp" -type f -exec wc -l {} \;

Repository: ProxySQL/proxysql-vec

Length of output: 96


🏁 Script executed:

# Search for database-related utilities and escape functions in lib/
find ./lib -name "*.cpp" -o -name "*.h" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 589


🏁 Script executed:

# Examine the problematic section in Static_Harvester.cpp
sed -n '1000,1060p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 2215


🏁 Script executed:

# Search for the execute_query function definition in Static_Harvester.cpp
rg "execute_query" ./lib/Static_Harvester.cpp -B 5 -A 10

Repository: ProxySQL/proxysql-vec

Length of output: 3632


🏁 Script executed:

# Look at the class definition and database connection context
sed -n '1,100p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 4193


🏁 Script executed:

# Examine the catalog database interface - check Discovery_Schema and get_db()
rg "class Discovery_Schema" -A 20 ./lib --type cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Search for SQL escape or quote handling in the codebase
rg "definition_sql|UPDATE.*SET" ./lib/Static_Harvester.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 1174


🏁 Script executed:

# Check what database interface is being used for catalog
rg "execute_statement" ./lib --type cpp -B 1 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 2021


🏁 Script executed:

# Look for prepared statement usage or escaping utilities in Discovery_Schema or nearby
find ./lib -name "*Discovery*" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

# Examine Discovery_Schema to understand the database interface
head -100 ./lib/Discovery_Schema.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 3317


🏁 Script executed:

# Check the get_db() method and what it returns
rg "get_db()" ./lib/Discovery_Schema.cpp -B 2 -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 2641


🏁 Script executed:

# Look for SQLite3 interface - check if execute_statement supports prepared statements
rg "execute_statement" ./lib/SQLite3_Server.cpp -B 1 -A 3 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 129


🏁 Script executed:

# Check if there's a C++ wrapper or any escaping utilities available
rg "escape|sqlite3_bind|sqlite3_prepare" ./lib --type cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 2982


🏁 Script executed:

# Check the escape_string_single_quotes function implementation
rg "escape_string_single_quotes" ./lib/gen_utils.cpp -B 2 -A 15

Repository: ProxySQL/proxysql-vec

Length of output: 785


🏁 Script executed:

# Check the header file to see if it's publicly available
rg "escape_string_single_quotes" ./include --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 765


🏁 Script executed:

# Verify that the code is really concatenating strings without escaping
sed -n '1035,1055p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 914


🏁 Script executed:

# Check if Static_Harvester.h includes gen_utils or has access to escape functions
cat ./lib/Static_Harvester.h | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 121


🏁 Script executed:

# Find the Static_Harvester header file
find . -name "*Static_Harvester*" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 120


🏁 Script executed:

# Check the include statements in Static_Harvester.cpp
head -30 ./lib/Static_Harvester.cpp | grep -E "^#include"

Repository: ProxySQL/proxysql-vec

Length of output: 250


🏁 Script executed:

# Look for the prepared statement usage pattern in Discovery_Schema
rg "sqlite3_prepare|SAFE_SQLITE3_STEP|bind_text" ./lib/Discovery_Schema.cpp -B 1 -A 3 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 2595


🏁 Script executed:

# Check if execute_statement method supports prepared statements with parameter binding
rg "execute_statement" ./lib/SQLite3DB.cpp -B 2 -A 5 2>/dev/null | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if Static_Harvester.cpp includes gen_utils.h
rg "#include" ./lib/Static_Harvester.cpp | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 250


🏁 Script executed:

# Look for other similar SQL concatenation patterns in Static_Harvester.cpp
rg "UPDATE|INSERT" ./lib/Static_Harvester.cpp | grep -E "<<.*<<" | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 140


🏁 Script executed:

# Examine how Discovery_Schema does prepared statements for similar updates
rg "UPDATE objects" ./lib/Discovery_Schema.cpp -B 5 -A 10

Repository: ProxySQL/proxysql-vec

Length of output: 821


🏁 Script executed:

# Check if there are other places in Static_Harvester with unescaped string concatenation
rg "ostringstream.*<<" ./lib/Static_Harvester.cpp | grep -v "ORDER\|SELECT\|WHERE.*NOT IN"

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check how execute_statement is used elsewhere for similar updates with variables
rg "execute_statement.*UPDATE" ./lib --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Verify that prepared statements are NOT used in Static_Harvester.cpp
rg "sqlite3_prepare|bind_text|bind_int" ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if there are similar vulnerable patterns elsewhere in Static_Harvester
rg "ostringstream.*view_def|ostringstream.*schema_name|ostringstream.*view_name" ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Final check: confirm the vulnerability by looking at the exact SQL being built
sed -n '1035,1055p' ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 767


🏁 Script executed:

# Check the database interface for execute_statement - can it use prepared statements?
rg "class SQLite3DB" ./lib -A 30 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 3002


🏁 Script executed:

# Look for how other parts of the code handle similar updates with user data
rg "INSERT.*definition_sql\|UPDATE.*definition_sql" ./lib --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Verify that the fetched data (view_def, schema_name, view_name) can contain quotes
rg "VIEW_DEFINITION\|TABLE_SCHEMA\|TABLE_NAME" ./lib/Static_Harvester.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 47


Escape view definitions and identifiers in SQLite catalog UPDATE statements.

view_def, schema_name, and view_name from MySQL information_schema.VIEWS can contain single quotes, breaking the SQL statement or enabling injection into the SQLite catalog. Use prepared statements with parameter binding (already available in the codebase via sqlite3_bind_text), or escape single quotes by doubling them (`''`). The codebase has escape_string_single_quotes() available, but it operates on C strings; if using that approach, convert the std::string values to C strings first.

🛠️ Suggested fix (prepared statement approach, preferred)

Replace the direct execute_statement() call with prepared statement binding, following the pattern already used in Discovery_Schema.cpp:

		// Update object with definition
-		char* error = NULL;
-		int cols = 0, affected = 0;
-		std::ostringstream update_sql;
-		update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
-		           << "WHERE run_id = " << current_run_id
-		           << " AND schema_name = '" << schema_name << "'"
-		           << " AND object_name = '" << view_name << "'"
-		           << " AND object_type = 'view';";
-
-		catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
+		const char* sql = "UPDATE objects SET definition_sql = ?1 "
+		                  "WHERE run_id = ?2 AND schema_name = ?3 "
+		                  "AND object_name = ?4 AND object_type = 'view';";
+		sqlite3_stmt* stmt = NULL;
+		(*proxy_sqlite3_prepare_v2)(catalog->get_db()->get_db(), sql, -1, &stmt, NULL);
+		(*proxy_sqlite3_bind_text)(stmt, 1, view_def.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_int)(stmt, 2, current_run_id);
+		(*proxy_sqlite3_bind_text)(stmt, 3, schema_name.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 4, view_name.c_str(), -1, SQLITE_TRANSIENT);
+		SAFE_SQLITE3_STEP2(stmt);
+		(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1014-1055: the UPDATE that writes
view_def, schema_name, and view_name into the SQLite catalog (inside the
for-loop using update_sql and catalog->get_db()->execute_statement) is
vulnerable to breaking on single quotes; change it to use a prepared statement
with parameter binding (sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step /
sqlite3_finalize) and bind view_def, schema_name, view_name, and current_run_id
instead of interpolating them, following the pattern in Discovery_Schema.cpp;
alternatively (less preferred) call escape_string_single_quotes() on
view_def.c_str(), schema_name.c_str(), and view_name.c_str() and use the escaped
C strings in the constructed SQL before calling execute_statement.

Comment on lines +1088 to +1138
std::ostringstream sql;
sql << "SELECT object_id, schema_name, object_name, object_type, engine, table_rows_est, "
<< "data_length, index_length, has_primary_key, has_foreign_keys, has_time_column "
<< "FROM objects WHERE run_id = " << current_run_id
<< " AND object_type IN ('table', 'view')";

catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset);

if (!resultset) {
return -1;
}

int count = 0;
for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
it != resultset->rows.end(); ++it) {
SQLite3_row* row = *it;

int object_id = atoi(row->fields[0]);
std::string object_name = std::string(row->fields[2] ? row->fields[2] : "");

// Guess kind from name
std::string guessed_kind = "unknown";
std::string name_lower = object_name;
std::transform(name_lower.begin(), name_lower.end(), name_lower.begin(), ::tolower);

if (name_lower.find("log") != std::string::npos ||
name_lower.find("event") != std::string::npos ||
name_lower.find("audit") != std::string::npos) {
guessed_kind = "log/event";
} else if (name_lower.find("order") != std::string::npos ||
name_lower.find("invoice") != std::string::npos ||
name_lower.find("payment") != std::string::npos ||
name_lower.find("transaction") != std::string::npos) {
guessed_kind = "fact";
} else if (name_lower.find("user") != std::string::npos ||
name_lower.find("customer") != std::string::npos ||
name_lower.find("account") != std::string::npos ||
name_lower.find("product") != std::string::npos) {
guessed_kind = "entity";
}

// Build profile JSON
json profile;
profile["guessed_kind"] = guessed_kind;
profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
atol(row->fields[6] ? row->fields[6] : "0"));
profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
profile["has_primary_key"] = atoi(row->fields[7]) != 0;
profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
profile["has_time_column"] = atoi(row->fields[9]) != 0;

⚠️ Potential issue | 🟠 Major

Fix column index mapping in quick profiles.

The SELECT column order doesn’t match the indices used, so rows_est/engine/flags are derived from the wrong fields.

✅ Proposed fix
-		profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
-		profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
-		                       atol(row->fields[6] ? row->fields[6] : "0"));
-		profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
-		profile["has_primary_key"] = atoi(row->fields[7]) != 0;
-		profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
-		profile["has_time_column"] = atoi(row->fields[9]) != 0;
+		profile["rows_est"] = row->fields[5] ? atol(row->fields[5]) : 0;
+		profile["size_bytes"] = (atol(row->fields[6] ? row->fields[6] : "0") +
+		                       atol(row->fields[7] ? row->fields[7] : "0"));
+		profile["engine"] = std::string(row->fields[4] ? row->fields[4] : "");
+		profile["has_primary_key"] = atoi(row->fields[8]) != 0;
+		profile["has_foreign_keys"] = atoi(row->fields[9]) != 0;
+		profile["has_time_column"] = atoi(row->fields[10]) != 0;
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1088-1138: the quick-profile field
indices are misaligned with the SELECT: update the row->fields index usage
inside the loop (where profile is built) so they match the SELECT order
(object_id, schema_name, object_name, object_type, engine, table_rows_est,
data_length, index_length, has_primary_key, has_foreign_keys, has_time_column).
Concretely, set rows_est to row->fields[5], engine to row->fields[4], size_bytes
to the sum of row->fields[6] and row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).

@renecannao (Author)
Replaced by #26

@renecannao renecannao closed this Jan 21, 2026