feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis [WIP] #15

Closed
renecannao wants to merge 76 commits into v3.1-vec from v3.1-MCP2

Conversation

@renecannao

Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis

Summary

This PR implements Priority 1 improvements identified by the META agent from the previous discovery run. These enhancements significantly improve the depth, confidence, and actionability of database discovery reports.

Expected Impact: +25% overall quality, +30% confidence in findings


Key Improvements

1. Performance Baseline Measurement (QUERY Agent)

The QUERY agent now executes actual performance queries with timing measurements instead of relying solely on EXPLAIN output.

Required Tests (5 per table):

  • Primary key lookups with timing
  • Table scan performance
  • Index range scan efficiency
  • JOIN query benchmarks
  • Aggregation query performance

Output:

  • Actual execution times in milliseconds
  • EXPLAIN cost vs actual time comparison
  • Efficiency score (1-10 scale)
  • Performance score per table

Example:

| Query Type | Actual Time (ms) | EXPLAIN Cost | Efficiency Score |
|------------|------------------|--------------|------------------|
| PK Lookup | 2.3 | 4.0 | 9/10 |
| JOIN Query | 45.2 | 120.0 | 7/10 |
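As a sketch of how such timings could be collected (illustrative only: the actual agent issues queries through MCP tools such as `run_sql_readonly`, and the run count here is an assumption):

```python
import statistics
import time

def time_query_ms(cursor, sql: str, runs: int = 5) -> float:
    """Run `sql` several times and return the median wall-clock time in ms.

    `cursor` is any DB-API 2.0 cursor. The median is used rather than the
    mean so a cold-cache first run does not skew the baseline.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        cursor.execute(sql)
        cursor.fetchall()  # drain the result set so transfer time is included
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

The EXPLAIN cost for the same statement can then be compared against this figure to derive the per-query efficiency score.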

2. Statistical Significance Testing (STATISTICAL Agent)

The STATISTICAL agent now performs rigorous statistical tests with p-values and effect sizes.

Required Tests (5 types):

  1. Normality Tests - Shapiro-Wilk, Anderson-Darling
  2. Correlation Analysis - Pearson (normal), Spearman (non-normal) with 95% CI
  3. Chi-Square Tests - Categorical associations with Cramer's V
  4. Outlier Detection - Modified Z-score, IQR method
  5. Group Comparisons - t-test, Mann-Whitney U with effect sizes
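Of these, the outlier step is easy to make concrete. A minimal stdlib sketch of the modified Z-score method (the 3.5 cutoff is the conventional Iglewicz-Hoaglin threshold, an assumption here rather than something the prompt specifies):

```python
import statistics

def modified_z_outliers(values, threshold=3.5):
    """Flag outliers using the modified Z-score (median/MAD based).

    Unlike the classic Z-score, the median and MAD are robust statistics,
    so a few extreme values cannot inflate the spread and mask themselves.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # degenerate spread: fall back to another method (e.g. IQR)
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]
```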

Output:

  • All tests report exact p-values (not just "p < 0.05")
  • Effect sizes with confidence intervals
  • Statistical confidence score (1-10)
  • Data quality confidence level (HIGH/MEDIUM/LOW)

Example:

**Variables:** orders.total_amount vs orders.item_count
**Test:** Pearson r=0.78, p<0.001, 95% CI [0.72, 0.83]
**Conclusion:** SIGNIFICANT correlation
**Strength:** Very Strong

3. Enhanced Cross-Domain Question Synthesis (META Agent)

The META agent now generates 15+ cross-domain questions across 5 categories.

Distribution:

  • Performance + Security: 4 questions
  • Structure + Semantics: 3 questions
  • Statistics + Query: 3 questions
  • Security + Semantics: 3 questions
  • All Agents: 2 questions

Each question includes:

  • Business context and stakeholder impact
  • Multi-phase answer plan with specific tools
  • Integrated answer template
  • Priority level (URGENT/HIGH/MEDIUM)
  • Business value rating
  • Confidence level based on data availability

Example:

#### Q. "What are the security implications of query performance issues?"

**Agents Required:** QUERY + SECURITY
**Priority:** URGENT
**Business Value:** HIGH

**Phase 1: QUERY Analysis**
- Identify slow queries using `explain_sql`, `run_sql_readonly`

**Phase 2: SECURITY Analysis**
- Check if slow queries access sensitive data using `sample_rows`

**Phase 3: Cross-Agent Synthesis**
- Assess risk and document mitigation strategies

Files Changed

| File | Changes | Description |
|------|---------|-------------|
| prompts/multi_agent_discovery_prompt.md | +326 lines | Added performance baseline and statistical testing requirements |
| README.md | +48 lines | Updated documentation with new v1.3 capabilities |

Prompt Evolution History

  • v1.0: Initial 4-agent system (STRUCTURAL, STATISTICAL, SEMANTIC, QUERY)
  • v1.1: Added SECURITY agent (5 analysis agents)
  • v1.1: Added META agent for prompt optimization (6 agents total, 5 rounds)
  • v1.2: Added Question Catalog generation with executable answer plans
  • v1.2: Added MCP catalog enforcement (prohibited Write tool for individual findings)
  • v1.3: [CURRENT] Added Performance Baseline Measurement (QUERY agent)
  • v1.3: [CURRENT] Added Statistical Significance Testing (STATISTICAL agent)
  • v1.3: [CURRENT] Enhanced Cross-Domain Question Synthesis (15 minimum questions)

Testing

To test the new v1.3 capabilities:

```bash
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless
python ./headless_db_discovery.py --database testdb --output discovery_v13_test.md
```

Expected improvements in output:

  • Performance baseline tables with actual query times
  • Statistical significance summary with test counts and p-values
  • 15+ cross-domain questions in META analysis

Future Improvements (Priority 2)

These are identified for future iterations:

  • Expand SECURITY to Disaster Recovery (backup strategy, RTO, RPO)
  • Implement State Transition Analysis for all status/state columns
  • Add Edge Case and Failure Scenario analysis
  • Create Shared Base Profile to reduce redundant column profiling
  • Add Dependency Analysis with topological sort

See META_ANALYSIS_PROMPT_IMPROVEMENTS.md for complete roadmap.


Based on: META agent analysis from Round 3 discovery run
Confidence: HIGH (all improvements validated with SQL evidence)

Replace the old single-agent discovery approach with a new 4-agent
collaborative system that provides more comprehensive analysis.

Changes:
- Add prompts/ directory with multi-agent system prompts
- Replace README.md with multi-agent discovery documentation
- Replace headless_db_discovery.py to use 4-agent prompt
- Replace headless_db_discovery.sh to use 4-agent prompt
- Remove HEADLESS_DISCOVERY_README.md (replaced by README.md)

The new system uses 4 collaborating agents (STRUCTURAL, STATISTICAL,
SEMANTIC, QUERY) that communicate via MCP catalog across 4 rounds:
1. Blind Exploration (parallel independent discovery)
2. Pattern Recognition (cross-agent collaboration)
3. Hypothesis Testing (validation with evidence)
4. Final Synthesis (comprehensive report)

This provides deeper analysis with 15+ hypothesis validations,
health scores, and prioritized recommendations.
- Change default timeout from 300s to 3600s (1 hour) for multi-agent discovery
- Add specific timeout error detection (exit code 124) with helpful message
- Add empty output file detection with diagnostic suggestions
- Improve error messages to guide users on how to debug issues
- Show example commands for increasing timeout when timeout occurs
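The error-handling behavior described above can be sketched as follows (function name and message wording are illustrative, not the script's actual implementation; it assumes the command runs under the GNU `timeout` utility, whose exit code 124 signals expiry):

```python
import subprocess
from pathlib import Path

DEFAULT_TIMEOUT = 3600  # multi-agent discovery default, up from the old 300s

def run_discovery(cmd, output_file, timeout=DEFAULT_TIMEOUT):
    """Run `cmd` under GNU timeout and map failure modes to actionable hints."""
    result = subprocess.run(["timeout", str(timeout), *cmd])
    if result.returncode == 124:  # GNU timeout's expiry exit code
        return (f"Timed out after {timeout}s. Retry with a larger value, "
                f"e.g. --timeout {timeout * 2}, for large schemas.")
    out = Path(output_file)
    if not out.exists() or out.stat().st_size == 0:
        return ("Output file is missing or empty: check the MCP server "
                "connection and re-run with verbose logging.")
    return "OK"
```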

The multi-agent discovery process involves 4 rounds of agent collaboration
and can take significantly longer than the previous single-agent approach.

Expand the 4-agent system to 6 agents (5 analysis + 1 meta) with
enhanced security analysis and self-improving prompt optimization.

New Agents:
- SECURITY: Identifies sensitive data (PII, credentials, financial),
  assesses access patterns, identifies vulnerabilities, and provides
  compliance assessment (GDPR, PCI-DSS)
- META: Analyzes report quality by section, identifies gaps,
  suggests specific prompt improvements for future runs

Protocol Changes:
- Expanded from 4 rounds to 5 rounds
- Round 5 is Meta Analysis (META agent only)
- META agent does not participate in rounds 1-4

New Report Sections:
- 5. SECURITY ANALYSIS with data classification (PUBLIC/INTERNAL/
  CONFIDENTIAL/RESTRICTED)
- E. Security data classification appendix

New Output:
- Separate META ANALYSIS document with:
  - Section quality ratings (depth, completeness)
  - Specific prompt improvement suggestions
  - Gap identification
  - Evolution history tracking

This enables continuous prompt optimization through multiple discovery
iterations, with each run informing improvements for the next.

Each agent now generates a comprehensive catalog of questions they can
answer about the database, with executable answer plans using MCP tools.

Question Catalog Features:
- 90+ questions across all agents (minimum 15-20 per agent)
- Executable answer plans with specific MCP tools for each question
- Answer templates with structured output formats
- Complexity ratings (LOW/MEDIUM/HIGH) and time estimates
- Cross-domain questions requiring multiple agents

New Output Locations:
- kind="question_catalog", key="structural_questions"
- kind="question_catalog", key="statistical_questions"
- kind="question_catalog", key="semantic_questions"
- kind="question_catalog", key="query_questions"
- kind="question_catalog", key="security_questions"
- kind="question_catalog", key="cross_domain_questions"

Benefits:
1. Fast Answers: Pre-validated plans skip analysis phase
2. Consistent Quality: All answers follow proven templates
3. Tool Reuse: Efficient MCP tool usage patterns
4. Comprehensive Coverage: 90+ questions cover most user needs

Example Question Catalog Entry:
- Question: "What sensitive data exists in table X?"
- Answer Plan: sample_rows + column_profile on table X
- Answer Template: Structured list with sensitivity classification
- Complexity: MEDIUM
- Estimated Time: 30 seconds
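Represented as data, one such entry might look like the sketch below (the field names are hypothetical; the real schema is whatever the MCP `catalog_upsert` tool accepts):

```python
import json

# Hypothetical entry shape mirroring the bullet list above.
entry = {
    "question": "What sensitive data exists in table X?",
    "answer_plan": ["sample_rows", "column_profile"],  # MCP tools, in order
    "answer_template": "structured list with sensitivity classification",
    "complexity": "MEDIUM",        # LOW / MEDIUM / HIGH
    "estimated_time_s": 30,
}

# An agent would then persist it under its catalog key, e.g.:
# catalog_upsert(kind="question_catalog",
#                key="security_questions",
#                value=json.dumps([entry]))
print(json.dumps(entry, indent=2))
```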

This creates a reusable knowledge base for future LLM interactions,
enabling quick, accurate responses to common database questions.

Add prominent warnings and explicit instructions to prevent agents from
creating individual markdown files instead of using the MCP catalog.

Changes:
- Add "CRITICAL: MCP CATALOG USAGE" section at the top of the prompt
- Explicitly prohibit Write tool for individual agent discoveries
- Specify that ONLY Round 4 final report should be written to file
- Add detailed catalog_upsert usage examples
- Update all round descriptions with CRITICAL catalog usage instructions
- Add "IMPORTANT - Catalog Usage Rules" section

Key Instructions Added:
- "DO NOT use the Write tool to create separate markdown files"
- "ALL agent findings MUST be written to MCP catalog using catalog_upsert"
- "ONLY Round 4 Final Synthesis writes to local file using Write tool"
- "DO NOT use Write tool for individual agent discoveries in Rounds 1-3"

This ensures agents store findings in the MCP catalog for cross-agent
collaboration, with only the final consolidated report written to disk.

Implemented three critical improvements identified by META agent:

1. Performance Baseline Measurement (QUERY agent):
   - Added 5 required query types with timing measurements
   - Compare actual execution time vs EXPLAIN cost
   - Efficiency score calculation (1-10 scale)
   - Output format with performance scores per table

2. Statistical Significance Testing (STATISTICAL agent):
   - Added 5 required statistical tests:
     * Normality tests (Shapiro-Wilk, Anderson-Darling)
     * Correlation analysis (Pearson, Spearman) with CI
     * Chi-square tests for categorical associations
     * Outlier detection (Modified Z-score, IQR)
     * Group comparisons (t-test, Mann-Whitney U)
   - All tests report p-values and effect sizes
   - Statistical confidence scoring (1-10)

3. Enhanced Cross-Domain Question Synthesis:
   - Increased minimum from general suggestion to 15 questions
   - Added 5 cross-domain categories with examples
   - Detailed question template with multi-phase plans
   - Priority distribution (URGENT/HIGH/MEDIUM)
   - Quality criteria for cross-domain questions

Expected Impact: +25% overall quality, +30% confidence in findings

Updated README to reflect new capabilities:

1. Statistical Analysis section:
   - Added Statistical Significance Testing subsection
   - Listed 5 required statistical tests (normality, correlation, chi-square, outliers, group comparisons)
   - Mentioned p-values and effect sizes

2. Query Analysis section:
   - Added Performance Baseline Measurement subsection
   - Listed 6 required query types with timing
   - Mentioned efficiency scoring (EXPLAIN vs actual)

3. Question Catalogs section:
   - Updated to reflect 15+ cross-domain questions (enhanced in v1.3)
   - Added 5 cross-domain categories with question counts

4. Cross-Domain Questions section:
   - Expanded from 3 examples to 15 specific questions
   - Organized by 5 categories with question counts
   - Matched new v1.3 requirements
@gemini-code-assist

Summary of Changes

Hello @renecannao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces version 1.3 of the Multi-Agent Database Discovery system, significantly enhancing its analytical capabilities. The primary goal is to provide deeper insights and higher confidence in database discovery reports by integrating actual query performance measurements and robust statistical analysis. These improvements are designed to make the discovery process more comprehensive and the generated recommendations more actionable, ultimately boosting the overall quality of the system's output.

Highlights

  • Performance Baseline Measurement: The QUERY agent now executes actual performance queries with timing measurements for primary key lookups, table scans, index range scans, JOINs, and aggregations. This provides a more accurate performance assessment compared to relying solely on EXPLAIN output, including an efficiency score.
  • Statistical Significance Testing: The STATISTICAL agent now performs rigorous statistical tests, including normality tests, correlation analysis, chi-square tests, outlier detection, and group comparisons. All tests report exact p-values and effect sizes, enhancing the confidence and depth of data quality findings.
  • Enhanced Cross-Domain Question Synthesis: The META agent now generates a minimum of 15 cross-domain questions across five categories (e.g., Performance + Security, Structure + Semantics). Each question includes business context, stakeholder impact, multi-phase answer plans with specific tools, and priority levels, improving the actionability of discovery reports.
  • Updated Headless Discovery Scripts and Documentation: The headless_db_discovery.py and headless_db_discovery.sh scripts have been updated to reflect the new 6-agent, 5-round protocol, including an increased default timeout. The README.md has been completely revamped to detail the new multi-agent architecture, capabilities, and question catalog generation, while a new multi_agent_discovery_prompt.md defines the detailed instructions for the agents.

@coderabbitai

coderabbitai bot commented Jan 17, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Two‑phase database discovery (deterministic harvest + LLM semantic phase)
    • Retrieval‑Augmented Generation (RAG) support: FTS, vector, and hybrid search
    • Schema‑scoped catalog and new AI/RAG MCP endpoints for discovery and retrieval
  • Enhancements

    • Expanded MCP tool surface and agent coordination (catalog, discovery, llm, agent tools)
    • Improved diagnostics, longer discovery timeouts, and CLI/demo utilities
  • Documentation

    • Comprehensive docs and examples for discovery, RAG, and tool usage


Walkthrough

Adds a two‑phase, MCP‑centric discovery system: a Static_Harvester writes deterministic metadata into a new SQLite Discovery_Schema; Query_Tool_Handler, RAG and new MCP tools/endpoints enable Phase‑2 LLM and multi‑agent discovery; includes C++ implementations, prompts, headless orchestration scripts, docs, and build wiring.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Discovery catalog & harvester**: `include/Discovery_Schema.h`, `lib/Discovery_Schema.cpp`, `include/Static_Harvester.h`, `lib/Static_Harvester.cpp` | New Discovery_Schema API + implementation and Static_Harvester implementation for INFORMATION_SCHEMA harvesting, run/agent lifecycle, FTS, LLM artifacts, telemetry and CRUD. |
| **Query tool & connection pool**: `include/Query_Tool_Handler.h`, `lib/Query_Tool_Handler.cpp`, `lib/Query_Tool_Handler.*` | Replaces MySQL_Tool_Handler wrapper with Query_Tool_Handler owning Discovery_Schema, Static_Harvester, connection pool, tool dispatch (catalog/discovery/llm/queries), usage stats and safety checks; constructor/signature changes and new tool surface. |
| **RAG subsystem & tool handler**: `include/RAG_Tool_Handler.h`, `lib/RAG_Tool_Handler.cpp`, `lib/AI_Features_Manager.cpp`, `doc/rag-*`, `RAG_POC/*` | New RAG_Tool_Handler, RAG DB schema additions, vector/FTS tables, ingestion POC, GenAI RAG config, docs, tests, and MCP /mcp/rag endpoint wiring. |
| **MCP server/threading/admin changes**: `include/MCP_Thread.h`, `lib/MCP_Thread.cpp`, `lib/ProxySQL_MCP_Server.cpp`, `lib/MCP_Endpoint.cpp`, `lib/Admin_Bootstrap.cpp`, `lib/Admin_FlushVariables.cpp`, `include/proxysql_admin.h`, `lib/ProxySQL_Admin*.cpp` | Removes runtime mcp_catalog_path variable (catalog path hardcoded to datadir), adds mcpdb handle, rag_tool_handler wiring, MCP rule/digest/statistics hooks, enhanced JSON-RPC error handling/logging, and proxy sqlite symbol usage. |
| **Catalog API schema scoping**: `include/MySQL_Catalog.h`, `lib/MySQL_Catalog.cpp`, `include/MySQL_Tool_Handler.h`, `lib/MySQL_Tool_Handler.cpp` | Adds schema parameter to catalog CRUD APIs and updates table/schema handling, FTS triggers, and callers to be schema-scoped (unique constraint on (schema, kind, key)). |
| **FTS / sqlite proxy symbols / build**: `include/sqlite3db.h`, `lib/proxy_sqlite3_symbols.cpp`, `lib/sqlite3db.cpp`, `lib/Makefile`, `include/ProxySQL_Admin_Tables_Definitions.h` | Introduces proxy_sqlite3 function-pointer symbols and a TU defining them; wires extended sqlite API pointers, updates initialization to use proxies, adds MCP stats/digest table macros, and includes new objects in the build. |
| **Headless & two-phase orchestration**: `scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/*` (README.md, headless_db_discovery.py/.sh, two_phase_discovery.py, prompts/*, static_harvest.sh, test_catalog.sh, demo_agent_claude.sh, mcp_config.example.json, .gitignore, etc.) | Moves inline prompts to prompt files (multi-agent & two-phase), adds new prompts, increases timeouts, richer diagnostics and error guidance, new orchestration scripts for static harvest, two-phase discovery and demos, and README/prompts replacing the deleted HEADLESS_DISCOVERY_README.md. |
| **Tooling, docs & examples**: `doc/*`, `scripts/mcp/README.md`, `doc/MCP/*` | Extensive documentation updates: architecture, VARIABLES, two-phase design, FTS/embeddings plans, RAG docs/examples, MCP tool guides and versioning/usage examples. |
| **Query/statistics wiring**: `lib/ProxySQL_Admin_Stats.cpp`, `lib/ProxySQL_Admin.cpp`, `lib/ProxySQL_MCP_Server.cpp` | New stats implementations (stats___mcp_query_tools_counters, stats___mcp_query_digest, stats___mcp_query_rules) integrated into refresh/save/load flows, plus runtime load/save handlers for MCP rules. |
| **Static harvest & CLI scripts**: `static_harvest.sh`, `two_phase_discovery.py`, `test_catalog.sh` | New CLI scripts to run static harvest, orchestrate two-phase discovery, and test catalog/LLM upserts; support run_id resolution, prompt file usage, and dry-run diagnostics. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Claude as Claude Code (Client)
    participant MCP as MCP Server (/mcp/query)
    participant QTH as Query_Tool_Handler
    participant Catalog as Discovery_Schema (mcp_catalog.db)
    participant MySQL as MySQL
    participant RAG as RAG_Tool_Handler

    Note over Claude,MCP: Phase 1 — Static harvest
    Claude->>MCP: discovery.run_static(schema_filter)
    MCP->>QTH: invoke discovery.run_static
    QTH->>MySQL: query INFORMATION_SCHEMA
    MySQL-->>QTH: rows
    QTH->>Catalog: insert_schema/objects/columns/indexes/fks
    Catalog-->>QTH: ack
    QTH->>Catalog: rebuild_fts_index
    Catalog-->>MCP: {run_id, summary}
    MCP-->>Claude: {run_id, summary}

    Note over Claude,MCP: Phase 2 — LLM / Multi‑agent discovery
    Claude->>MCP: catalog.list_objects(run_id)
    MCP->>Catalog: list_objects(run_id)
    Catalog-->>MCP: object metadata
    MCP-->>Claude: objects
    loop multi-agent interactions
      Claude->>MCP: llm.summary_upsert / agent.event_append / catalog.get_object
      MCP->>Catalog: upsert_llm_summary / append_agent_event / get_object
      Catalog-->>MCP: ack / object
      MCP->>RAG: rag.search_* (optional retrieval)
      RAG-->>MCP: chunks/docs
    end
    Claude->>MCP: agent.run_finish(status)
    MCP->>Catalog: finish_agent_run()
    Catalog-->>MCP: ack
    MCP-->>Claude: completion
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Poem

🐰 I nibbled schemas by moonlit code,

I chased the rows down every road,
Six agents hummed, we stitched and wrote,
Catalogs listened, seeds took note,
Now discovery blooms where I once strode. 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.81%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly references the main change (Multi-Agent Database Discovery v1.3 with Performance & Statistical Analysis improvements) and accurately summarizes the primary additions to the codebase. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing specific improvements (Performance Baseline Measurement, Statistical Significance Testing, Cross-Domain Question Synthesis) with examples and context. |



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the multi-agent database discovery system by introducing performance baselining and statistical significance testing. The refactoring of the headless execution scripts to externalize the prompt and improve error handling is a major step forward in maintainability and usability. The new prompt is exceptionally detailed and well-engineered. My review focuses on minor inconsistencies in the documentation and opportunities to make the agent's instructions even more robust to prevent potential issues during execution.

@@ -0,0 +1,434 @@
# Database Discovery System Prompt


high

This reference document appears to be outdated and inconsistent with the new architecture described in the pull request. It details a 4-agent system, while the main prompt and README now describe a 6-agent system with a 5-round protocol. This discrepancy is highly confusing for anyone trying to understand the system.

The file should be updated to reflect the current 6-agent architecture or be clearly marked as documentation for a previous version to avoid confusion.

| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |


medium

The default timeout value in the command-line options table is outdated. It's listed as 300 seconds, but the scripts have been updated to use a default of 3600 seconds (1 hour) to accommodate the more complex multi-agent discovery process. The documentation should be updated to reflect the new default.

Suggested change
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--timeout` | `-t` | Timeout in seconds | 3600 |

Comment on lines +468 to +479
```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 4-round protocol but prioritize these areas.
"""
    return prompt
```


medium

The example for customizing the discovery prompt is inconsistent with the new architecture. It refers to a "4-agent discovery protocol," whereas the system now uses 6 agents. Additionally, the function signature for build_discovery_prompt is missing the schema parameter. This could confuse users trying to customize the script.

Suggested change
```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues
Follow the standard 4-round protocol but prioritize these areas.
"""
    return prompt
```

```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str], schema: Optional[str]) -> str:
    # This example shows how to create a custom, focused prompt.
    # Note: The actual script reads a much more detailed prompt from an external file.
    prompt = f"""Using the 6-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues
Follow the standard 5-round protocol but prioritize these areas.
"""
    return prompt
```


1. **Primary Key Lookup**
```sql
SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
```


medium

The query for the Primary Key Lookup test uses MAX({pk_column}) to find a value to test. This assumes the primary key is of a type that supports the MAX() aggregate function, such as an integer. This will fail for other common primary key types like UUIDs. A more robust approach is to select an arbitrary key, for instance, by using LIMIT 1.

Suggested change
SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
SELECT * FROM {table} WHERE {pk_column} = (SELECT {pk_column} FROM {table} LIMIT 1);

Comment on lines +297 to +300
3. **Index Range Scan (if applicable)**
```sql
SELECT * FROM {table} WHERE {indexed_column} BETWEEN {min} AND {max} LIMIT 1000;
```


medium

The instruction for the 'Index Range Scan' test provides a query template but doesn't guide the agent on how to determine appropriate values for {min} and {max}. This ambiguity could lead to ineffective or failing tests if the agent chooses a poor range.

I recommend adding guidance for the agent. For example, you could add a note before the SQL block:

- Note: To determine appropriate `{min}` and `{max}` values, first profile the `{indexed_column}` to find a representative range.

Copy link

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore`:
- Line 1: The .gitignore entry uses a leading slash (/discovery_*.md) which
limits matches to the current directory; change the pattern to discovery_*.md
(or use **/discovery_*.md if you want to match in subdirectories) to allow
ignoring discovery files produced in nested paths; update the entry that
currently contains "/discovery_*.md" accordingly.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`:
- Around line 5-7: The module-level docstring in headless_db_discovery.py still
lists 4 agents (STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY); update that
docstring to reflect the current 6-agent workflow by adding the two missing
agents (name them exactly as used in the codebase) so the description matches
the implementation; search for the module docstring at the top of
headless_db_discovery.py and edit the agent list text to include all six agent
names used by the discovery workflow.
- Around line 131-155: The call to get_discovery_prompt_path() is executed
before the try/except that handles missing prompt files, so a missing prompt
will raise FileNotFoundError outside the handler; fix by moving the
get_discovery_prompt_path() invocation (and the log_verbose that references it)
into the same try block that calls build_discovery_prompt(), or wrap the
get_discovery_prompt_path() call in its own try/except to catch
FileNotFoundError and call process-safe fallback or exit; update the references
in log_verbose and build_discovery_prompt() usage accordingly so all
FileNotFoundError paths are handled in one place.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`:
- Around line 7-9: Update the usage/help header: change the phrase "4
collaborating agents" to "6 collaborating agents" and edit the agent list that
currently reads "STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY" to include the
two missing agent names (or explicitly say "STRUCTURAL, STATISTICAL, SEMANTIC,
QUERY, <AGENT5>, and <AGENT6>" if you have the names) so the comment accurately
reflects the 6‑agent system in headless_db_discovery.sh.
- Around line 175-204: MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.
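
The word-splitting behavior this comment describes can be demonstrated in isolation (the path below is hypothetical): a flat string splits on whitespace when expanded unquoted, while a quoted array expansion preserves each element intact.

```shell
# String form: unquoted expansion word-splits "--mcp-config /tmp/my config.json"
# into three words instead of a flag plus one path.
MCP_ARGS_STR='--mcp-config /tmp/my config.json'
set -- $MCP_ARGS_STR
str_count=$#

# Array form: quoted "${MCP_ARGS[@]}" expansion yields exactly two arguments,
# keeping the space inside the path.
MCP_ARGS=(--mcp-config "/tmp/my config.json")
set -- "${MCP_ARGS[@]}"
arr_count=$#

echo "string: $str_count args, array: $arr_count args"
```

This is why the fix appends `"${MCP_ARGS[@]}"` to `CLAUDE_ARGS` rather than an unquoted `$MCP_ARGS` string.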

In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md`:
- Around line 3-5: Update the mission header that currently says "ONLY MCP
server tools (`mcp__proxysql-stdio__*`)" to clarify that the MCP-only
restriction applies to database access and discovery steps only, while
permitting the use of the Write tool in Round 4 for producing the final report;
reference the exact phrase "ONLY MCP server tools (`mcp__proxysql-stdio__*`)"
and "Round 4" so the change is applied there and ensure the new wording
explicitly states that all database interactions must use MCP tools but output
generation may use Write.
- Around line 845-856: Add explicit language identifiers to the untyped Markdown
code fences around the catalog examples: update the three fences that surround
the calls to catalog_upsert, catalog_search, and catalog_get so they specify a
language (e.g., ```python) instead of plain ```, ensuring each opening fence
before catalog_upsert(kind="agent_type"...), before catalog_search(...), and
before catalog_get(...) is changed to include the identifier and the
corresponding closing fence remains ```; no code changes required besides the
fence markers.

In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`:
- Around line 5-11: The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.
- Around line 101-109: Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.

In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md`:
- Around line 519-530: The directory-structure code fence in README.md is
missing a language tag (causing MD040); update the opening fence for the tree
block from ``` to ```text so the block becomes a fenced code block with a
language tag (e.g., change the README.md directory-structure code fence that
surrounds the "ClaudeCode_Headless/" tree to start with ```text).
- Around line 392-398: Update the README table entry for the `--timeout` / `-t`
option to reflect the actual default of 3600 seconds: change the "Default"
column value for the `--timeout` row from 300 to 3600 so the documented default
matches the script's runtime behavior.
- Around line 471-477: Update the example prompt string (variable "prompt") so
it references the 6-agent, 5-round discovery protocol rather than "4-agent" and
"4-round"; modify the body to instruct focus on security, performance, and data
quality while explicitly stating the 6-agent/5-round structure and any change in
sequencing (e.g., "Follow the standard 6-agent, 5-round protocol and prioritize
these areas") and adjust numbering or phrasing inside the triple-quoted string
to match the new protocol description.

Comment on lines 5 to +7
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery. It works with any database
type that is accessible via MCP (Model Context Protocol).
comprehensive database discovery using 4 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.

⚠️ Potential issue | 🟡 Minor

Update the docstring agent list to match the 6‑agent workflow.

Proposed fix
-comprehensive database discovery using 4 collaborating agents:
-STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+comprehensive database discovery using 6 collaborating agents:
+STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery. It works with any database
type that is accessible via MCP (Model Context Protocol).
comprehensive database discovery using 4 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery using 6 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`
around lines 5 - 7, The module-level docstring in headless_db_discovery.py still
lists 4 agents (STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY); update that
docstring to reflect the current 6-agent workflow by adding the two missing
agents (name them exactly as used in the codebase) so the description matches
the implementation; search for the module docstring at the top of
headless_db_discovery.py and edit the agent list text to include all six agent
names used by the discovery workflow.

Comment on lines +131 to +155
log_info("Starting Multi-Agent Database Discovery")
log_info(f"Output will be saved to: {output_file}")
log_verbose(f"Claude Code executable: {claude_cmd}", args.verbose)

# Build MCP configuration
mcp_config_file, _ = build_mcp_config(args)
if mcp_config_file:
log_verbose(f"Using MCP configuration: {mcp_config_file}", args.verbose)
log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)

# Build command arguments
cmd_args = [
claude_cmd,
'--print', # Non-interactive mode
'--no-session-persistence', # Don't save session
'--permission-mode', 'bypassPermissions', # Bypass permission checks in headless mode
'--permission-mode', 'bypassPermissions', # Bypass permission checks
]

# Add MCP configuration if available
if mcp_config_file:
cmd_args.extend(['--mcp-config', mcp_config_file])
# Add MCP configuration if provided
if args.mcp_config:
cmd_args.extend(['--mcp-config', args.mcp_config])
log_verbose(f"Using MCP config: {args.mcp_config}", args.verbose)
elif args.mcp_file:
cmd_args.extend(['--mcp-config', args.mcp_file])
log_verbose(f"Using MCP config file: {args.mcp_file}", args.verbose)

# Build discovery prompt
prompt = build_discovery_prompt(args.database, args.schema)
try:
prompt = build_discovery_prompt(args.database, args.schema)
except FileNotFoundError as e:

⚠️ Potential issue | 🟡 Minor

Avoid uncaught FileNotFoundError from the log line.
get_discovery_prompt_path() is called before the try/except, so a missing prompt file bypasses the intended error handling.

Proposed fix
-    log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)
+    # Build discovery prompt
+    try:
+        prompt_path = get_discovery_prompt_path()
+        log_verbose(f"Using discovery prompt: {prompt_path}", args.verbose)
+        prompt = build_discovery_prompt(args.database, args.schema)
+    except FileNotFoundError as e:
+        log_error(str(e))
+        sys.exit(1)
@@
-    # Build discovery prompt
-    try:
-        prompt = build_discovery_prompt(args.database, args.schema)
-    except FileNotFoundError as e:
-        log_error(str(e))
-        sys.exit(1)
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`
around lines 131 - 155, The call to get_discovery_prompt_path() is executed
before the try/except that handles missing prompt files, so a missing prompt
will raise FileNotFoundError outside the handler; fix by moving the
get_discovery_prompt_path() invocation (and the log_verbose that references it)
into the same try block that calls build_discovery_prompt(), or wrap the
get_discovery_prompt_path() call in its own try/except to catch
FileNotFoundError and either fall back safely or exit; update the references
in log_verbose and build_discovery_prompt() usage accordingly so all
FileNotFoundError paths are handled in one place.

Comment on lines 7 to +9
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery. It works with any database
# type that is accessible via MCP (Model Context Protocol).
# comprehensive database discovery using 4 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.

⚠️ Potential issue | 🟡 Minor

Update the help header to reflect the 6‑agent system.
The usage text still says “4 collaborating agents”.

Proposed fix
-# comprehensive database discovery using 4 collaborating agents:
-# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+# comprehensive database discovery using 6 collaborating agents:
+# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
📝 Committable suggestion


Suggested change
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery. It works with any database
# type that is accessible via MCP (Model Context Protocol).
# comprehensive database discovery using 4 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery using 6 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`
around lines 7 - 9, Update the usage/help header: change the phrase "4
collaborating agents" to "6 collaborating agents" and edit the agent list that
currently reads "STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY" to include the
two missing agent names (or explicitly say "STRUCTURAL, STATISTICAL, SEMANTIC,
QUERY, <AGENT5>, and <AGENT6>" if you have the names) so the comment accurately
reflects the 6‑agent system in headless_db_discovery.sh.

Comment on lines +175 to 204
# Build MCP args
MCP_ARGS=""
if [ -n "$MCP_CONFIG" ]; then
# Write inline config to temp file
MCP_CONFIG_FILE=$(mktemp)
echo "$MCP_CONFIG" > "$MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG"
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_CONFIG_FILE="$MCP_FILE"
MCP_ARGS="--mcp-config $MCP_FILE"
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
elif [ -n "$PROXYSQL_MCP_ENDPOINT" ]; then
# Build MCP config for ProxySQL and write to temp file
MCP_CONFIG_FILE=$(mktemp)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
BRIDGE_PATH="$SCRIPT_DIR/../mcp/proxysql_mcp_stdio_bridge.py"

# Build the JSON config
cat > "$MCP_CONFIG_FILE" << MCPJSONEOF
{
"mcpServers": {
"proxysql": {
"command": "python3",
"args": ["$BRIDGE_PATH"],
"env": {
"PROXYSQL_MCP_ENDPOINT": "$PROXYSQL_MCP_ENDPOINT"
MCPJSONEOF

if [ -n "$PROXYSQL_MCP_TOKEN" ]; then
echo ", \"PROXYSQL_MCP_TOKEN\": \"$PROXYSQL_MCP_TOKEN\"" >> "$MCP_CONFIG_FILE"
fi

if [ "$PROXYSQL_MCP_INSECURE_SSL" = "1" ]; then
echo ", \"PROXYSQL_MCP_INSECURE_SSL\": \"1\"" >> "$MCP_CONFIG_FILE"
fi

cat >> "$MCP_CONFIG_FILE" << 'MCPJSONEOF2'
}
}
}
}
MCPJSONEOF2

MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
log_verbose "Using ProxySQL MCP endpoint: $PROXYSQL_MCP_ENDPOINT"
log_verbose "MCP config written to: $MCP_CONFIG_FILE"
else
log_verbose "No explicit MCP configuration, using available MCP servers"
fi

# Build the discovery prompt
DATABASE_ARG=""
if [ -n "$DATABASE_NAME" ]; then
DATABASE_ARG="database named '$DATABASE_NAME'"
else
DATABASE_ARG="the first available database"
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"

SCHEMA_ARG=""
if [ -n "$SCHEMA_NAME" ]; then
SCHEMA_ARG="the schema '$SCHEMA_NAME' within"
fi
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)

DISCOVERY_PROMPT="You are a Database Discovery Agent. Your mission is to perform comprehensive analysis of $DATABASE_ARG.

${SCHEMA_ARG:+Focus on $SCHEMA_ARG}

Use the available MCP database tools to discover and document:

## 1. STRUCTURAL ANALYSIS
- List all tables in the database/schema
- For each table, describe:
- Column names, data types, and nullability
- Primary keys and unique constraints
- Foreign key relationships
- Indexes and their purposes
- Any CHECK constraints or defaults

- Create an Entity Relationship Diagram (ERD) showing:
- All tables and their relationships
- Cardinality (1:1, 1:N, M:N)
- Primary and foreign keys

## 2. DATA PROFILING
- For each table, analyze:
- Row count
- Data distributions for key columns
- Null value percentages
- Distinct value counts (cardinality)
- Min/max/average values for numeric columns
- Sample data (first few rows)

- Identify patterns and anomalies:
- Duplicate records
- Data quality issues
- Unexpected distributions
- Outliers

## 3. SEMANTIC ANALYSIS
- Infer the business domain:
- What type of application/database is this?
- What are the main business entities?
- What are the business processes?

- Document business rules:
- Entity lifecycles and state machines
- Validation rules implied by constraints
- Relationship patterns

- Classify tables:
- Master/reference data (customers, products, etc.)
- Transactional data (orders, transactions, etc.)
- Junction/association tables
- Configuration/metadata

## 4. PERFORMANCE & ACCESS PATTERNS
- Identify:
- Missing indexes on foreign keys
- Missing indexes on frequently filtered columns
- Composite index opportunities
- Potential N+1 query patterns

- Suggest optimizations:
- Indexes that should be added
- Query patterns that would benefit from optimization
- Denormalization opportunities

## OUTPUT FORMAT

Provide your findings as a comprehensive Markdown report with:

1. **Executive Summary** - High-level overview
2. **Database Schema** - Complete table definitions
3. **Entity Relationship Diagram** - ASCII ERD
4. **Data Quality Assessment** - Score (1-100) with issues
5. **Business Domain Analysis** - Industry, use cases, entities
6. **Performance Recommendations** - Prioritized optimization list
7. **Anomalies & Issues** - All problems found with severity

Be thorough. Discover everything about this database structure and data.
Write the complete report to standard output."

# Log the command being executed (without showing the full prompt for clarity)
log_info "Running Claude Code in headless mode..."
log_verbose "Timeout: ${TIMEOUT}s"
if [ -n "$DATABASE_NAME" ]; then
log_verbose "Target database: $DATABASE_NAME"
fi
if [ -n "$SCHEMA_NAME" ]; then
log_verbose "Target schema: $SCHEMA_NAME"
# Add MCP configuration if available
if [ -n "$MCP_ARGS" ]; then
CLAUDE_ARGS+=($MCP_ARGS)
fi

⚠️ Potential issue | 🟠 Major

Prevent MCP config JSON from breaking argument parsing.
MCP_ARGS is a string that gets expanded unquoted into the args array, so JSON or paths with spaces will split into multiple arguments.

Proposed fix (use an array and proper quoting)
-# Build MCP args
-MCP_ARGS=""
-if [ -n "$MCP_CONFIG" ]; then
-    MCP_ARGS="--mcp-config $MCP_CONFIG"
-    log_verbose "Using inline MCP configuration"
-elif [ -n "$MCP_FILE" ]; then
-    if [ -f "$MCP_FILE" ]; then
-        MCP_ARGS="--mcp-config $MCP_FILE"
-        log_verbose "Using MCP configuration from: $MCP_FILE"
-    else
-        log_error "MCP configuration file not found: $MCP_FILE"
-        exit 1
-    fi
-fi
+MCP_ARGS=()
+if [ -n "$MCP_CONFIG" ]; then
+    MCP_ARGS+=(--mcp-config "$MCP_CONFIG")
+    log_verbose "Using inline MCP configuration"
+elif [ -n "$MCP_FILE" ]; then
+    if [ -f "$MCP_FILE" ]; then
+        MCP_ARGS+=(--mcp-config "$MCP_FILE")
+        log_verbose "Using MCP configuration from: $MCP_FILE"
+    else
+        log_error "MCP configuration file not found: $MCP_FILE"
+        exit 1
+    fi
+fi
@@
-# Add MCP configuration if available
-if [ -n "$MCP_ARGS" ]; then
-    CLAUDE_ARGS+=($MCP_ARGS)
-fi
+# Add MCP configuration if available
+if [ "${#MCP_ARGS[@]}" -gt 0 ]; then
+    CLAUDE_ARGS+=("${MCP_ARGS[@]}")
+fi
📝 Committable suggestion


Suggested change
# Build MCP args
MCP_ARGS=""
if [ -n "$MCP_CONFIG" ]; then
# Write inline config to temp file
MCP_CONFIG_FILE=$(mktemp)
echo "$MCP_CONFIG" > "$MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
MCP_ARGS="--mcp-config $MCP_CONFIG"
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_CONFIG_FILE="$MCP_FILE"
MCP_ARGS="--mcp-config $MCP_FILE"
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
elif [ -n "$PROXYSQL_MCP_ENDPOINT" ]; then
# Build MCP config for ProxySQL and write to temp file
MCP_CONFIG_FILE=$(mktemp)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
BRIDGE_PATH="$SCRIPT_DIR/../mcp/proxysql_mcp_stdio_bridge.py"
# Build the JSON config
cat > "$MCP_CONFIG_FILE" << MCPJSONEOF
{
"mcpServers": {
"proxysql": {
"command": "python3",
"args": ["$BRIDGE_PATH"],
"env": {
"PROXYSQL_MCP_ENDPOINT": "$PROXYSQL_MCP_ENDPOINT"
MCPJSONEOF
if [ -n "$PROXYSQL_MCP_TOKEN" ]; then
echo ", \"PROXYSQL_MCP_TOKEN\": \"$PROXYSQL_MCP_TOKEN\"" >> "$MCP_CONFIG_FILE"
fi
if [ "$PROXYSQL_MCP_INSECURE_SSL" = "1" ]; then
echo ", \"PROXYSQL_MCP_INSECURE_SSL\": \"1\"" >> "$MCP_CONFIG_FILE"
fi
cat >> "$MCP_CONFIG_FILE" << 'MCPJSONEOF2'
}
}
}
}
MCPJSONEOF2
MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
log_verbose "Using ProxySQL MCP endpoint: $PROXYSQL_MCP_ENDPOINT"
log_verbose "MCP config written to: $MCP_CONFIG_FILE"
else
log_verbose "No explicit MCP configuration, using available MCP servers"
fi
# Build the discovery prompt
DATABASE_ARG=""
if [ -n "$DATABASE_NAME" ]; then
DATABASE_ARG="database named '$DATABASE_NAME'"
else
DATABASE_ARG="the first available database"
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"
SCHEMA_ARG=""
if [ -n "$SCHEMA_NAME" ]; then
SCHEMA_ARG="the schema '$SCHEMA_NAME' within"
fi
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)
DISCOVERY_PROMPT="You are a Database Discovery Agent. Your mission is to perform comprehensive analysis of $DATABASE_ARG.
${SCHEMA_ARG:+Focus on $SCHEMA_ARG}
Use the available MCP database tools to discover and document:
## 1. STRUCTURAL ANALYSIS
- List all tables in the database/schema
- For each table, describe:
- Column names, data types, and nullability
- Primary keys and unique constraints
- Foreign key relationships
- Indexes and their purposes
- Any CHECK constraints or defaults
- Create an Entity Relationship Diagram (ERD) showing:
- All tables and their relationships
- Cardinality (1:1, 1:N, M:N)
- Primary and foreign keys
## 2. DATA PROFILING
- For each table, analyze:
- Row count
- Data distributions for key columns
- Null value percentages
- Distinct value counts (cardinality)
- Min/max/average values for numeric columns
- Sample data (first few rows)
- Identify patterns and anomalies:
- Duplicate records
- Data quality issues
- Unexpected distributions
- Outliers
## 3. SEMANTIC ANALYSIS
- Infer the business domain:
- What type of application/database is this?
- What are the main business entities?
- What are the business processes?
- Document business rules:
- Entity lifecycles and state machines
- Validation rules implied by constraints
- Relationship patterns
- Classify tables:
- Master/reference data (customers, products, etc.)
- Transactional data (orders, transactions, etc.)
- Junction/association tables
- Configuration/metadata
## 4. PERFORMANCE & ACCESS PATTERNS
- Identify:
- Missing indexes on foreign keys
- Missing indexes on frequently filtered columns
- Composite index opportunities
- Potential N+1 query patterns
- Suggest optimizations:
- Indexes that should be added
- Query patterns that would benefit from optimization
- Denormalization opportunities
## OUTPUT FORMAT
Provide your findings as a comprehensive Markdown report with:
1. **Executive Summary** - High-level overview
2. **Database Schema** - Complete table definitions
3. **Entity Relationship Diagram** - ASCII ERD
4. **Data Quality Assessment** - Score (1-100) with issues
5. **Business Domain Analysis** - Industry, use cases, entities
6. **Performance Recommendations** - Prioritized optimization list
7. **Anomalies & Issues** - All problems found with severity
Be thorough. Discover everything about this database structure and data.
Write the complete report to standard output."
# Log the command being executed (without showing the full prompt for clarity)
log_info "Running Claude Code in headless mode..."
log_verbose "Timeout: ${TIMEOUT}s"
if [ -n "$DATABASE_NAME" ]; then
log_verbose "Target database: $DATABASE_NAME"
fi
if [ -n "$SCHEMA_NAME" ]; then
log_verbose "Target schema: $SCHEMA_NAME"
# Add MCP configuration if available
if [ -n "$MCP_ARGS" ]; then
CLAUDE_ARGS+=($MCP_ARGS)
fi
# Build MCP args
MCP_ARGS=()
if [ -n "$MCP_CONFIG" ]; then
MCP_ARGS+=(--mcp-config "$MCP_CONFIG")
log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
if [ -f "$MCP_FILE" ]; then
MCP_ARGS+=(--mcp-config "$MCP_FILE")
log_verbose "Using MCP configuration from: $MCP_FILE"
else
log_error "MCP configuration file not found: $MCP_FILE"
exit 1
fi
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"
# Build Claude command
CLAUDE_ARGS=(
--print
--no-session-persistence
--permission-mode bypassPermissions
)
# Add MCP configuration if available
if [ "${#MCP_ARGS[@]}" -gt 0 ]; then
CLAUDE_ARGS+=("${MCP_ARGS[@]}")
fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 203-203: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)

🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`
around lines 175 - 204, MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.

Comment on lines +5 to +11
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.

**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output

⚠️ Potential issue | 🟡 Minor

Clarify that this reference prompt is the 4‑agent/4‑round variant (or update it to the 6‑agent workflow).
Right now it reads like the canonical prompt while the tooling uses the 6‑agent/5‑round protocol, which can mislead readers.

Suggested clarification
-You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+> **Note:** This reference describes the legacy 4‑agent / 4‑round workflow. The active v1.3 prompt uses 6 agents and 5 rounds (see `prompts/multi_agent_discovery_prompt.md`).
📝 Committable suggestion


Suggested change
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
> **Note:** This reference describes the legacy 4‑agent / 4‑round workflow. The active v1.3 prompt uses 6 agents and 5 rounds (see `prompts/multi_agent_discovery_prompt.md`).
**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output
🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 5 - 11, The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.

Comment on lines +101 to +109
**Writing Findings:**
```python
catalog_upsert(
kind="structural|statistical|semantic|query|collaborative|validation|final_report",
key="specific_identifier",
document="detailed_findings_markdown",
tags="optional_tags"
)
```

⚠️ Potential issue | 🟠 Major

Fix catalog kind names in the example to match the round-specific kinds used later.
The snippet advertises collaborative/validation, while the protocol uses collaborative_round2/validation_round3. This mismatch can break catalog lookups.

Proposed fix
-catalog_upsert(
-    kind="structural|statistical|semantic|query|collaborative|validation|final_report",
+catalog_upsert(
+    kind="structural|statistical|semantic|query|collaborative_round2|validation_round3|final_report",
     key="specific_identifier",
     document="detailed_findings_markdown",
     tags="optional_tags"
 )
🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 101 - 109, Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.

Comment on lines +392 to +398
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--verbose` | `-v` | Enable verbose output | Disabled |

⚠️ Potential issue | 🟡 Minor

Update the documented default timeout to 3600s.
The scripts default to 3600 seconds, but the table still says 300.

Proposed fix
-| `--timeout` | `-t` | Timeout in seconds | 300 |
+| `--timeout` | `-t` | Timeout in seconds | 3600 |
📝 Committable suggestion


Suggested change
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--verbose` | `-v` | Enable verbose output | Disabled |
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 3600 |
| `--verbose` | `-v` | Enable verbose output | Disabled |
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 392-398: update the README table entry for the `--timeout` / `-t` option so the "Default" column reads 3600 instead of 300, matching the script's actual runtime default.

Comment on lines +471 to +477
prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 4-round protocol but prioritize these areas.
"""
⚠️ Potential issue | 🟡 Minor

Align the custom prompt example with the 6‑agent/5‑round protocol.
The example still references a 4‑agent, 4‑round flow.

Proposed fix
-    prompt = f"""Using the 4-agent discovery protocol, focus on:
+    prompt = f"""Using the 6-agent, 5-round discovery protocol, focus on:
     1. Security aspects of {database}
     2. Performance optimization opportunities
     3. Data quality issues
 
-    Follow the standard 4-round protocol but prioritize these areas.
+    Follow the standard 5-round protocol but prioritize these areas.
     """
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 471-477: update the example prompt string (variable "prompt") to reference the 6-agent, 5-round discovery protocol rather than "4-agent" and "4-round"; keep the focus on security, performance, and data quality, and change the closing sentence to "Follow the standard 6-agent, 5-round protocol and prioritize these areas."

Comment on lines +519 to +530
```
ClaudeCode_Headless/
├── README.md # This file
├── prompts/
│ ├── multi_agent_discovery_prompt.md # Concise system prompt
│ └── multi_agent_discovery_reference.md # Comprehensive reference
├── headless_db_discovery.py # Python script
├── headless_db_discovery.sh # Bash script
└── examples/
├── DATABASE_DISCOVERY_REPORT.md # Example output
└── DATABASE_QUESTION_CAPABILITIES.md # Feature documentation
```
⚠️ Potential issue | 🟡 Minor

Add a language tag to the directory-structure code fence to satisfy MD040.

Proposed fix
-```
+```text
 ClaudeCode_Headless/
 ├── README.md                           # This file
 ├── prompts/
 │   ├── multi_agent_discovery_prompt.md      # Concise system prompt
 │   └── multi_agent_discovery_reference.md   # Comprehensive reference
 ├── headless_db_discovery.py            # Python script
 ├── headless_db_discovery.sh            # Bash script
 └── examples/
     ├── DATABASE_DISCOVERY_REPORT.md        # Example output
     └── DATABASE_QUESTION_CAPABILITIES.md   # Feature documentation
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

519-519: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 519-530: the directory-structure code fence is missing a language tag (triggering MD040); change the opening fence of the "ClaudeCode_Headless/" tree block from ``` to ```text.

Added patterns to ignore all discovery output files:

Root .gitignore:
- discovery_*.md (main discovery reports at repo root)
- database_discovery_report.md

scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore:
- *_QUESTION_CATALOG.md (individual agent question catalogs)
- *_round1_*.md, *_round2_*.md, *_round3_*.md, *_round4_*.md (round outputs)
- *_COORDINATOR_SUMMARY.md (coordinator summaries)
- *_HYPOTHESIS_TESTING.md (hypothesis testing results)
- *_INDEX.md (index files)
- *_QUICK_REFERENCE.md (quick reference docs)
- META_ANALYSIS_*.md (META agent outputs)
- SECURITY_AGENT_*.txt (security agent text outputs)
- query_agent_*.md, security_agent_*.md, semantic_*.md, statistical_*.md

These files are generated when agents incorrectly use Write tool instead of
catalog_upsert (the issue fixed in v1.2). They should not be committed to
the repository as they are test outputs and are stored in the MCP catalog.
…aint

This commit addresses two issues:

1. MCP Catalog Schema Separation:
   - Add 'schema' column to catalog table for proper isolation
   - Update all catalog methods (upsert, get, search, list, remove) to accept schema parameter
   - Update MCP tool handlers and JSON-RPC parameter parsing
   - Unique constraint changed from (kind, key) to (schema, kind, key)
   - FTS table updated to include schema column

2. Discovery Prompt Scope Constraint:
   - Add explicit SCOPE CONSTRAINT section to multi_agent_discovery_prompt.md
   - Agents now respect Target Schema and skip list_schemas when specified
   - Prevents analyzing all schemas when only one is targeted

Files modified:
- include/MySQL_Catalog.h: Add schema parameter to all catalog methods
- include/MySQL_Tool_Handler.h: Update wrapper method signatures
- lib/MySQL_Catalog.cpp: Implement schema filtering in all operations
- lib/MySQL_Tool_Handler.cpp: Update wrapper implementations
- lib/Query_Tool_Handler.cpp: Extract schema from JSON-RPC params, update tool descriptions
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md: Add scope constraint
Phase 1 (Static/Deterministic):
- Add Discovery_Schema: SQLite catalog with deterministic and LLM tables
- Add Static_Harvester: MySQL INFORMATION_SCHEMA metadata extraction
- Harvest schemas, objects, columns, indexes, foreign keys, view definitions
- Compute derived hints: is_time, is_id_like, has_pk, has_fks, has_time
- Build quick profiles and FTS5 indexes

Phase 2 (LLM Agent):
- Add 19 new MCP tools for two-phase discovery
- discovery.run_static: Trigger ProxySQL's static harvest
- Catalog tools: init, search, get_object, list_objects, get_relationships
- Agent tools: run_start, run_finish, event_append
- LLM tools: summary_upsert, relationship_upsert, domain_upsert, etc.

Files:
- include/Discovery_Schema.h, lib/Discovery_Schema.cpp
- include/Static_Harvester.h, lib/Static_Harvester.cpp
- include/Query_Tool_Handler.h, lib/Query_Tool_Handler.cpp (updated)
- lib/Makefile (updated)
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/
- scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py
Add missing --mcp-config argument which is required by the script.
The dry-run example now correctly shows all required parameters.
- Add mcp_config.example.json for Claude Code MCP configuration
- Fix MCP bridge path in example config (../../proxysql_mcp_stdio_bridge.py)
- Update Two_Phase_Discovery_Implementation.md with correct Phase 1/Phase 2 usage
- Fix Two_Phase_Discovery_Implementation.md DELETE FROM fts_objects to scope to run_id
- Update README.md with two-phase discovery section and multi-agent legacy note
- Create static_harvest.sh bash wrapper for Phase 1
- Create two_phase_discovery.py orchestration script with prompts
- Add --run-id parameter to skip auto-fetch
- Fix RUN_ID placeholder mismatch (<USE_THE_PROVIDED_RUN_ID>)
- Fix catalog path default to mcp_catalog.db
- Add test_catalog.sh to verify catalog tools work
- Fix Discovery_Schema.cpp FTS5 syntax (missing space)
- Remove invalid CREATE INDEX on FTS virtual tables
- Add MCP tool call logging to track tool usage
- Fix Static_Harvester::get_harvest_stats() to accept run_id parameter
- Fix DELETE FROM fts_objects to only delete for specific run_id
- Update system prompts to say DO NOT call discovery.run_static
- Update user prompts to say Phase 1 is already complete
- Add --mcp-only flag to restrict Claude Code to MCP tools only
- Make FTS table failures non-fatal (check if table exists first)
- Add comprehensive documentation for both discovery approaches
- Add try-catch around handle_jsonrpc_request to catch unexpected exceptions
- Add detailed logging for tool execution success/failure
- Add proper SQLite error checking in create_agent_run with error messages
- Fix json_int/json_double to handle both numbers and numeric strings

The json_int function was throwing exceptions when receiving numeric
strings (e.g., "14" instead of 14) from clients, causing 500 errors.
Now it handles both formats gracefully.

Also added logging so tool failures are visible in logs instead of
being silent 500 errors.
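The lenient parsing described above can be sketched in isolation. This is an illustrative stand-in, not ProxySQL's actual `json_int` (the variant type and function name are hypothetical): it shows how a handler can accept both a native number and a numeric string without throwing.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical stand-in for a JSON scalar as it might arrive from a client:
// either a native integer, a double, or a string such as "14".
using JsonScalar = std::variant<std::int64_t, double, std::string>;

// Normalize to int64_t; return `def` instead of throwing on bad input,
// so a malformed parameter becomes a handled default rather than a 500.
std::int64_t json_int_lenient(const JsonScalar& v, std::int64_t def = 0) {
    if (std::holds_alternative<std::int64_t>(v))
        return std::get<std::int64_t>(v);
    if (std::holds_alternative<double>(v))
        return static_cast<std::int64_t>(std::get<double>(v));
    const std::string& s = std::get<std::string>(v);
    try {
        size_t pos = 0;
        std::int64_t n = std::stoll(s, &pos);
        return pos == s.size() ? n : def;  // reject trailing junk like "14x"
    } catch (const std::exception&) {
        return def;
    }
}
```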
Exception handlers now log the full request payload that caused the error,
making debugging much easier.

Changes:
- Move req_body/req_path declarations outside try block so catch handlers can access them
- Log request payload in all exception handlers (parse errors, std::exception, and catch-all)
- Log tool arguments when tool execution fails

Previously, exceptions would only log the error message without context,
making it impossible to reproduce the issue. Now the full payload is logged.
1. Fix error logging to catch ALL tool failures, not just those with
   both success and result fields. Previously, error responses like
   {"success": false, "error": "..."} without a result field were
   silently ignored.

2. Fix llm.domain_set_members to accept both array and JSON string
   formats for the members parameter. Some clients send it as a
   JSON string, others as a native array.

3. Add detailed error logging for llm.domain_set_members failures,
   including what was actually received.
- Add resolve_run_id() to map schema names to latest run_id
- Update all catalog and LLM tools to accept schema names
- Deprecate describe_table, table_profile, column_profile
- Deprecate get_constraints, suggest_joins, find_reference_candidates
- Keep sample_rows, sample_distinct for data preview
Tool was deprecated; users should use catalog.get_object instead.
- Query schemas from catalog's schemas table
- Maintains same output format for compatibility
- Removes dependency on live MySQL connection
Add stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
tables to track MCP query tool usage statistics.

- Added get_tool_usage_stats_resultset() method to Query_Tool_Handler
- Defined table schemas in ProxySQL_Admin_Tables_Definitions.h
- Registered tables in Admin_Bootstrap.cpp
- Added pattern matching in ProxySQL_Admin.cpp
- Added stats___mcp_query_tools_counters() in ProxySQL_Admin_Stats.cpp
- Fixed friend declaration for track_tool_invocation()
- Fixed Discovery_Schema.cpp log_llm_search() to use prepare_v2/finalize
Remove the mcp-catalog_path configuration variable and hardcode the catalog
database path to datadir/mcp_catalog.db for stability.

Rationale: The catalog database is session state, not user configuration.
Runtime swapping of the catalog could cause tables to be missed and the
catalog to fail even if it was succeeding a second earlier.

Changes:
- Removed catalog_path from mcp_thread_variables_names array
- Removed mcp_catalog_path from MCP_Thread variables struct
- Removed getter/setter logic for catalog_path
- Hardcoded catalog path to GloVars.datadir/mcp_catalog.db in:
  - ProxySQL_MCP_Server.cpp (Query_Tool_Handler initialization)
  - Admin_FlushVariables.cpp (MySQL_Tool_Handler reinitialization)
- Updated VARIABLES.md to document the hardcoded path
- Updated configure_mcp.sh to remove catalog_path configuration
- Updated MCP README to remove catalog_path references
The MCP catalog database is now accessible as the 'mcp_catalog' schema
from the ProxySQL Admin interface, enabling direct SQL queries against
discovered schemas and LLM memories.
Extend the stats_mcp_query_tools_counters table with timing statistics
(first_seen, last_seen, sum_time, min_time, max_time) following the
same pattern as stats_mysql_query_digest.

All timing values are in microseconds using monotonic_time().

New schema:
- tool VARCHAR
- schema VARCHAR
- count INT
- first_seen INTEGER (microseconds)
- last_seen INTEGER (microseconds)
- sum_time INTEGER (microseconds - total execution time)
- min_time INTEGER (microseconds - minimum execution time)
- max_time INTEGER (microseconds - maximum execution time)
Previously re_modifiers was hardcoded to 1 (CASELESS), ignoring the
value stored in the database. Now properly reads from row->fields[7].
- Fully implemented rag.search_hybrid tool with both fuse and fts_then_vec modes
- Added complete filter support across all search tools (source_ids, source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after, created_before, min_score)
- Implemented proper score normalization (higher is better) for all search modes
- Updated all tool schemas to match blueprint specifications exactly
- Added metadata inclusion in search results
- Implemented Reciprocal Rank Fusion (RRF) scoring for hybrid search
- Enhanced error handling and input validation
- Added debug information for hybrid search ranking
- Updated documentation and created completion summary

This completes the v0 RAG implementation according to the blueprint requirements.
- Enhanced inline Doxygen comments in RAG_Tool_Handler.h and RAG_Tool_Handler.cpp
- Added detailed parameter descriptions, return values, and cross-references
- Created Doxyfile for documentation generation
- Added documentation summary and guidelines
- Documented all RAG tools with their schemas and usage patterns
- Added security and performance considerations documentation

The RAG subsystem is now fully documented with comprehensive Doxygen comments
that provide clear guidance for developers working with the codebase.
Add detailed function-level documentation to all MCP query rules,
query digest, static harvester, and catalog components.

Static_Harvester.cpp:
- Document all 18+ harvest functions (schemas, objects, columns, indexes, FKs, views)
- Document lifecycle methods (init, close, connect, disconnect)
- Document helper methods (is_time_type, is_id_like_name)
- Document run management (start_run, finish_run, run_full_harvest)
- Document statistics methods (get_harvest_stats)

Query_Tool_Handler.cpp:
- Document JSON helper functions (json_string, json_int, json_double)
- Document digest tracking section with flow explanation

MySQL_Catalog.cpp:
- Document schema isolation architecture
- Document CRUD operations (upsert, get, search, list, remove, merge)

Discovery_Schema.cpp:
- Document MCP query rules evaluation (evaluate_mcp_query_rules)
- Document digest functions (compute_mcp_digest, fingerprint_mcp_args)
- Document update/get functions for rules and digests

ProxySQL_Admin_Stats.cpp:
- Document stats collection functions

ProxySQL_Admin.cpp:
- Document load/save functions for query rules

Admin_Handler.cpp:
- Document MCP query rules command handlers

include/ProxySQL_Admin_Tables_Definitions.h:
- Add comments explaining table purposes
@renecannao renecannao changed the title feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis [WIP] Jan 19, 2026
renecannao and others added 15 commits January 20, 2026 08:58
- Remove Doxyfile that was added by mistake
- Remove .doxygen files from doc/ directory that were generated by mistake
- Clean up documentation artifacts that shouldn't have been committed
- Address SQL injection vulnerabilities by adding input validation and escaping
- Fix configuration variable handling in get_variable and set_variable methods for RAG variables
- Make embedding dimension configurable for rag_vec_chunks table
- Remove code duplication in SQL filter building logic by creating consolidated build_sql_filters function
- Update all search tools (FTS, vector, hybrid) to use consolidated filter building
- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Revert changes to main test/Makefile
- Update documentation to reflect new test structure
Fixed 6 legitimate issues from AI code review:

Critical fixes:
- Implement stats___mcp_query_digest to populate table with actual data
- Fix double trigger bug (stats_mcp_query_digest_reset substring matching)

Important fixes:
- Fix re_modifiers parsing (VARCHAR "CASELESS" -> int 1)
- Fix TOCTOU race condition in reset path (use write lock from start)
- Add column count validation before accessing row fields

Documentation:
- Add memory ownership documentation for evaluate_mcp_query_rules

False positives ignored (Issue 2: Schema mismatch, Issue 6: SQL injection)
- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Update .gitignore to exclude test/rag/test_rag_schema executable
- Update documentation to reflect new test structure
Use proxy_sqlite3_* wrappers and canonical proxy symbol TU
Add comprehensive RAG subsystem with Doxygen documentation [WIP]
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/MCP_Thread.cpp (1)

1-10: Missing include for RAG_Tool_Handler.h.

The file includes headers for all other tool handlers (MySQL_Tool_Handler.h, Config_Tool_Handler.h, etc.) but doesn't include RAG_Tool_Handler.h. While the forward declaration in MCP_Thread.h allows the pointer member to compile, the destructor on lines 127-130 calls delete rag_tool_handler, which requires the complete type definition.

🔧 Proposed fix
 #include "MCP_Thread.h"
 #include "MySQL_Tool_Handler.h"
 #include "Config_Tool_Handler.h"
 #include "Query_Tool_Handler.h"
 #include "Admin_Tool_Handler.h"
 #include "Cache_Tool_Handler.h"
 #include "Observe_Tool_Handler.h"
+#include "RAG_Tool_Handler.h"
 #include "proxysql_debug.h"
 #include "ProxySQL_MCP_Server.hpp"
🤖 Fix all issues with AI agents
In `@lib/Admin_Bootstrap.cpp`:
- Around line 726-736: The code calls mcpdb->open(...) and immediately enables
extension loading via (*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1)
without checking the open result; update the logic in the block around
mcpdb->open() (SQLite3DB::open) to capture and check its return value, log an
error and skip calling proxy_sqlite3_enable_load_extension and any further
attach/setup when open fails, and ensure you use mcpdb->get_db() only after a
successful open; include descriptive logging (via existing logging mechanism)
when open fails to make the early exit clear.

In `@lib/AI_Features_Manager.cpp`:
- Around line 300-315: Check GloGATH->variables.genai_vector_dimension before
using it to build create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large,
clamp it to a safe default (e.g., 1536) and log the correction via the same
logger used elsewhere (or emit a warning) so the fallback is visible; then use
the sanitized vector_dimension when constructing the vec0 DDL for rag_vec_chunks
to avoid silent failures.

In `@lib/Anomaly_Detector.cpp.bak`:
- Line 500: Replace the direct call to sqlite3_finalize(stmt) with the proxy
wrapper used elsewhere (e.g., proxy_sqlite3_finalize(stmt)) so the proxy layer's
logging/metrics/error handling is applied consistently; update any includes or
declarations if needed to access proxy_sqlite3_finalize and ensure the call
semantics and error checking remain identical to the original sqlite3_finalize
usage (use the same stmt variable and follow existing error/logging patterns
used with proxy_sqlite3_step).
- Around line 814-836: The code is calling sqlite3_prepare_v2, sqlite3_errmsg,
and sqlite3_finalize directly which breaks the proxy pattern; replace these
direct calls with the proxy wrappers used elsewhere: call
proxy_sqlite3_prepare_v2(db, query, -1, &stmt, NULL) and assign its return to
rc, use proxy_sqlite3_errmsg(db) in the proxy_error call, and call
proxy_sqlite3_finalize(stmt) at the end; keep the existing use of
proxy_sqlite3_step, proxy_sqlite3_column_text, proxy_sqlite3_column_int, and
proxy_sqlite3_column_int64 and ensure variable names (stmt, rc, query, db)
remain unchanged.
- Around line 920-938: This block mixes direct SQLite calls with proxy wrappers;
make all SQLite interactions consistent by replacing sqlite3_prepare_v2,
sqlite3_column_text, sqlite3_column_int, and sqlite3_finalize with their proxy
equivalents (e.g., proxy_sqlite3_prepare_v2, proxy_sqlite3_column_text,
proxy_sqlite3_column_int, proxy_sqlite3_finalize) so the loop that uses
proxy_sqlite3_step also uses proxy_* for prepare/column/finalize on the
variables stmt, db, type_query, and by_type.
- Around line 577-590: The static local int cleanup_counter in the cleanup block
inside update_user_statistics is not thread-safe; replace it with a
std::atomic<int> (e.g., declare static std::atomic<int> cleanup_counter{0}) and
perform an atomic increment/check (use fetch_add or ++cleanup_counter and
compare the result mod 1000) so multiple threads can't race on the counter; also
ensure the user_statistics cleanup loop remains protected by the same mutex/lock
used for accessing the user_statistics map (or document that user_statistics is
already synchronized) to avoid concurrent access issues.
- Around line 771-776: Replace direct SQLite calls with the proxy wrappers to
match the rest of the file: in the error branch where proxy_error("Anomaly:
Failed to insert pattern: %s\n", sqlite3_errmsg(db)); and where
sqlite3_finalize(stmt); are used, call the proxy wrappers (e.g.,
proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt)) instead so
proxy_error uses the proxy_sqlite3_errmsg result and both finalization calls use
proxy_sqlite3_finalize; locate these occurrences around the Anomaly pattern
insert logic where proxy_error, db, and stmt are referenced.
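The atomic-counter suggestion above can be sketched as follows; the function name and threshold are illustrative, not the actual `update_user_statistics` code.

```cpp
#include <atomic>

// Thread-safe replacement for a plain `static int` counter: fetch_add is a
// single atomic read-modify-write, so concurrent callers cannot race on the
// increment. Returns true once every 1000 invocations, when the
// (caller-supplied) cleanup should run.
bool should_run_cleanup() {
    static std::atomic<int> cleanup_counter{0};
    // fetch_add returns the value *before* the increment.
    return (cleanup_counter.fetch_add(1, std::memory_order_relaxed) + 1) % 1000 == 0;
}
```

Note that the atomic only protects the counter itself; as the review says, access to the `user_statistics` map still needs its own mutex.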

In `@lib/Discovery_Schema.cpp`:
- Around line 1211-1221: The SQL built in get_object (around the sql
ostringstream) and list_objects currently concatenates user-controlled
schema_name, object_name and object_type into the query; change both functions
to use prepared/parameterized statements instead of string concatenation:
replace the ostringstream construction with a SQL statement containing
placeholders (e.g., ? or named parameters) for run_id, object_id (when used),
schema_name, object_name and object_type, then bind those variables with the DB
API's bind/prepare calls before executing; ensure proper types are used for
numeric vs string parameters and remove any remaining direct concatenation of
schema_name/object_name/object_type.
- Around line 30-38: Discovery_Schema::init currently uses a C99 VLA (char
path_buf[db_path.size() + 1]) and leaks the heap-allocated db if db->open(...)
fails; replace the VLA with a std::vector<char> (or std::string::data copy)
sized db_path.size()+1 and use that buffer when calling db->open, and ensure you
free the allocated SQLite3DB (delete db) on error (or better, change db to a
smart pointer) before returning after proxy_error to avoid the resource leak.
- Around line 1988-1993: The fts_search_llm() code is vulnerable because it
concatenates the user-supplied query into the MATCH clause; change it to use a
prepared statement and parameter binding instead of string concatenation: build
the SQL with "WHERE f.fts_llm MATCH ?1" (keep the rest of the SELECT/LEFT JOIN
and ORDER BY score LIMIT part), prepare the statement, bind the C++ string query
to parameter 1 using sqlite3_bind_text (or proxy_sqlite3_bind_text) and execute;
also ensure the LIMIT value (limit) is either bound as a validated integer or
strictly validated before concatenation to prevent injection via limit.
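A minimal sketch of the safe shape suggested for `fts_search_llm`: the user query is left to a bound `?1` parameter and never spliced into the SQL text, and the LIMIT is clamped to a validated integer. Table and column names here are illustrative, and the actual binding call (`sqlite3_bind_text` or the proxy wrapper) is only noted in a comment.

```cpp
#include <string>

// Build the FTS query with a ?1 placeholder for the MATCH term; the caller
// binds the raw user string afterwards, e.g.
//   sqlite3_bind_text(stmt, 1, query.c_str(), -1, SQLITE_TRANSIENT);
// Clamping LIMIT to a known-good range also blocks injection via that value.
std::string build_fts_match_sql(int limit) {
    if (limit < 1 || limit > 1000) limit = 50;
    return "SELECT f.rowid, s.summary, bm25(f.fts_llm) AS score "
           "FROM fts_llm f LEFT JOIN llm_summaries s ON s.rowid = f.rowid "
           "WHERE f.fts_llm MATCH ?1 ORDER BY score LIMIT " + std::to_string(limit);
}
```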

In `@lib/ProxySQL_MCP_Server.cpp`:
- Around line 78-98: The Query_Tool_Handler instance is set to NULL on init()
failure but the /mcp/query resource is still registered later; change the flow
so that registration and any increment of the endpoint counter only occur when
handler->query_tool_handler->init() returns success (non-zero indicates
failure), i.e., after new Query_Tool_Handler(...) call check init(), and only on
successful init register the /mcp/query endpoint and adjust any endpoint
counting; on failure leave handler->query_tool_handler NULL and skip
registration to avoid null deref or a broken endpoint.

In `@lib/RAG_Tool_Handler.cpp`:
- Around line 2316-2330: The doc_id list is concatenated into SQL in
rag.get_docs and rag.fetch_from_source (see the doc_ids loop that builds
doc_list and the sql string using "WHERE ... IN (" + doc_list + ")"), which is
vulnerable to SQL injection and fails on quotes; fix by replacing direct
concatenation with a parameterized IN clause using execute_parameterized_query()
and bound parameters (generate placeholders ?1,?2,... from doc_ids and bind each
doc_ids[i]) or, if you prefer the validation approach, validate every entry in
doc_ids against the same filter used in the filter handlers (reject or escape
entries containing '\'', '\\', ';') before building the query; ensure you update
both rag.get_docs and rag.fetch_from_source to use the same safe pattern.
- Around line 1490-1504: The hybrid weights w_fts and w_vec are parsed with
get_json_int which truncates fractional values; update both the "fuse" path and
the fallback path to parse these weights as double (use the existing
get_json_double or equivalent helper instead of get_json_int) so fractional
weights like 0.5 are preserved; ensure you change the lines setting w_fts and
w_vec in the block that reads fuse_params and the fallback block that reads
arguments, keeping the same variable names (w_fts, w_vec) and types.
- Around line 1150-1159: The SQL FTS query is built by concatenating the
user-provided variable `query` into the `sql` string (the SELECT from
`rag_fts_chunks` / JOIN `rag_chunks` block), which fails to escape quotes and is
vulnerable; change this to use a prepared statement with a parameter placeholder
in the MATCH clause (e.g. `... WHERE f MATCH ?1`) and bind the user input with
`sqlite3_bind_text()` before stepping the statement (prepare with
`sqlite3_prepare_v2`, bind with `sqlite3_bind_text`, execute with
`sqlite3_step`, and finalize with `sqlite3_finalize`). Apply the same change for
the other blocks referenced (the other `sql` constructions at the later ranges)
so all FTS MATCH usages use parameter binding instead of raw concatenation.
- Around line 401-455: The prepared statement and bound parameters are never
used because you call vector_db->execute_statement(query, ...); instead step the
prepared sqlite3_stmt* from prepare_result.second so bindings are applied, and
when constructing a SQLite3_result from the stmt do not leave the unique_ptr
owning it — release ownership with prepare_result.second.release() and pass the
raw sqlite3_stmt* into SQLite3_result so the result object owns and finalizes
the statement; remove the execute_statement(...) call and replace it with
creating/returning the SQLite3_result from the released stmt after collecting
any needed error/cols/affected_rows info.
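The parameterized IN clause recommended above for `rag.get_docs` can be sketched like this; table and column names are illustrative, and each `doc_ids[i]` would then be bound with `sqlite3_bind_text` rather than concatenated into the SQL.

```cpp
#include <string>
#include <vector>

// Generate "?1,?2,...,?n" placeholders instead of splicing ids into the SQL
// text; parameter binding then handles quoting and escaping for us.
std::string build_get_docs_sql(const std::vector<std::string>& doc_ids) {
    std::string sql = "SELECT doc_id, title, body FROM rag_docs WHERE doc_id IN (";
    for (size_t i = 0; i < doc_ids.size(); ++i) {
        if (i) sql += ",";
        sql += "?" + std::to_string(i + 1);
    }
    sql += ")";
    return sql;
}
```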

In `@RAG_POC/rag_ingest.cpp`:
- Around line 686-698: The assignments reading columns into struct s are
vulnerable to null pointer dereference because sqlite3_column_text can return
NULL; update the code that sets s.name, s.backend_type, s.host, s.user, s.pass,
s.db, s.table_name, s.pk_column and s.where_sql to first call
sqlite3_column_text into a temporary const unsigned char* (or const char*),
check for NULL (or use sqlite3_column_type to detect SQLITE_NULL) and only
assign to the std::string members when non-NULL, otherwise set them to an empty
string (or another safe default); do the same null-check pattern for text
columns used by functions like sqlite3_column_text while keeping integer reads
(sqlite3_column_int for s.enabled and s.port) as-is or guarding them with
sqlite3_column_type if you want to detect NULLs explicitly.
- Around line 901-905: The code currently sets now_epoch to 0 which yields
invalid timestamps for rag_vec_chunks; replace the hardcoded value by capturing
a real epoch once per run (e.g., use std::time(nullptr) / time(NULL) and cast to
std::int64_t) and reuse that value when populating updated_at (variable
now_epoch), updating any affected functions that rely on now_epoch; ensure the
translation unit includes <ctime> (or <time.h>) if not already present so the
build succeeds.
- Around line 468-479: build_select_sql concatenates RagSource::where_sql and
raw identifiers, creating SQL injection/identifier-injection risks; update
build_select_sql to (1) escape backticks inside identifiers used for cols and
s.table_name (replace any '`' with '``') before wrapping with backticks, and (2)
stop directly concatenating s.where_sql: either accept a parameterized WHERE
with placeholders and require callers to supply bound parameters, or validate
allowlisted characters/patterns (e.g., only simple column comparisons with a
strict regex) and reject/return an error for unsafe values; update function
signature or call sites as needed to support bound parameters or validation on
RagSource::where_sql.
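The null-check pattern suggested for `rag_ingest.cpp` can be wrapped in a small helper; the function name is hypothetical, and the pointer type mirrors what `sqlite3_column_text` returns.

```cpp
#include <string>

// Convert a possibly-NULL SQLite text column to std::string safely:
// sqlite3_column_text returns NULL for SQL NULL, and constructing a
// std::string from a NULL pointer is undefined behavior.
std::string column_text_or_empty(const unsigned char* p) {
    return p ? std::string(reinterpret_cast<const char*>(p)) : std::string();
}
```

Each `s.name = ...` assignment would then become `s.name = column_text_or_empty(sqlite3_column_text(stmt, col));`.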

In `@RAG_POC/schema.sql`:
- Around line 14-45: The rag_sources table currently stores plaintext
credentials in the backend_pass column; instead migrate to a secure approach by
removing plaintext storage and either (A) storing only a reference to an
external secret manager (add a new column like backend_secret_ref) and update
any code that reads rag_sources to fetch the secret from the secret manager, or
(B) store an encrypted value (add backend_pass_encrypted) and implement
symmetric key encryption/decryption in the service layer to read/write via
functions that use a KMS-held key; update schema changes for rag_sources, add a
migration to backfill secrets (move existing backend_pass values into the secret
store or encrypt them) and remove or null out backend_pass, and update codepaths
that reference backend_pass to use backend_secret_ref or decrypt
backend_pass_encrypted instead.
♻️ Duplicate comments (7)
lib/ProxySQL_Admin_Stats.cpp (1)

1587-1617: SQL injection vulnerability and incorrect table name logic.

This issue was previously flagged: the INSERT statement has two problems:

  1. Wrong table name: When reset=true, the prefix logic produces "stats_mcp_query_tools_counters_mcp_query_tools_counters" instead of the intended "stats_mcp_query_tools_counters_reset".

  2. SQL injection risk: r->fields[0] and r->fields[1] (tool name and schema) are directly interpolated into the query without escaping. Malicious values could corrupt or exploit the database.

🔒 Proposed fix using prepared statements
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-	     it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		char query[1024];
-		snprintf(query, sizeof(query),
-			"INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-			reset ? "stats_mcp_query_tools_counters_" : "stats_",
-			r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-			r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-		statsdb->execute(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_tools_counters_reset"
+		: "stats_mcp_query_tools_counters";
+	char query[256];
+	snprintf(query, sizeof(query),
+		"INSERT INTO %s VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)", table);
+	
+	sqlite3_stmt* stmt = nullptr;
+	int rc = statsdb->prepare_v2(query, &stmt);
+	ASSERT_SQLITE_OK(rc, statsdb);
+	
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7]));
+		SAFE_SQLITE3_STEP2(stmt);
+		(*proxy_sqlite3_clear_bindings)(stmt);
+		(*proxy_sqlite3_reset)(stmt);
+	}
+	(*proxy_sqlite3_finalize)(stmt);
lib/Discovery_Schema.cpp (6)

12-18: gmtime() is not thread‑safe

Same issue as previously noted: gmtime() uses shared static storage; switch to gmtime_r / gmtime_s.


76-88: Use delete for SQLite3_result allocated with new

Same issue as previously noted: execute_statement() allocates with new, so free(resultset) is undefined behavior.


589-598: get_run_info reads columns off by one

Same issue as previously noted: field indices are shifted relative to the SELECT list.


970-987: ?1 placeholders aren’t bound in update_object_flags

Same issue as previously noted: db->execute() doesn’t bind parameters, so run_id is ignored.


1162-1171: FTS query strings are concatenated directly

Same issue as previously noted: query / object_type / schema_name are interpolated without escaping or binding.


1761-1763: domain_key concatenation is injection‑prone

Same issue as previously noted: domain_key is interpolated directly into SQL.

🧹 Nitpick comments (11)
RAG_POC/rag_ingest.cpp (1)

384-404: MySQL connection lacks timeout configuration.

Without a connection timeout, mysql_real_connect can block indefinitely if the MySQL server is unreachable. Consider adding MYSQL_OPT_CONNECT_TIMEOUT.

♻️ Suggested improvement
 static MYSQL* mysql_connect_or_die(const RagSource& s) {
   MYSQL* conn = mysql_init(nullptr);
   if (!conn) fatal("mysql_init failed");

   // Set utf8mb4 for safety with StackOverflow-like content
   mysql_options(conn, MYSQL_SET_CHARSET_NAME, "utf8mb4");
+
+  // Set connection timeout to avoid indefinite blocking
+  unsigned int timeout = 30;  // seconds
+  mysql_options(conn, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);

   if (!mysql_real_connect(conn,
lib/Anomaly_Detector.cpp.bak (1)

136-163: Consider caching compiled regex patterns for performance.

normalize_query is likely called on the hot path for every query. Compiling std::regex objects on each call is expensive. Consider pre-compiling and caching these patterns.

♻️ Suggested improvement
// At file scope or as class members:
static const std::regex comment_regex("--.*?$|/\\*.*?\\*/", std::regex::multiline);
static const std::regex string_regex("'[^']*'|\"[^\"]*\"");
static const std::regex numeric_regex("\\b\\d+\\b");
static const std::regex whitespace_regex("\\s+");

// Then in normalize_query, use these pre-compiled patterns
lib/ProxySQL_Admin_Stats.cpp (2)

2308-2309: Minor: Duplicate commented code.

Line 2308 has a duplicate commented-out line that appears to be a copy-paste artifact.

🧹 Cleanup suggestion
 	statsdb->execute("DELETE FROM stats_mysql_prepared_statements_info");
 	query1=(char *)"INSERT INTO stats_mysql_prepared_statements_info VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9)";
 	query32s = "INSERT INTO stats_mysql_prepared_statements_info VALUES " + generate_multi_rows_query(32,9);
 	query32 = (char *)query32s.c_str();
-	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
-	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
+	//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
 	rc = statsdb->prepare_v2(query1, &statement1);

2375-2376: Minor: Same duplicate comment issue in pgsql version.

Same copy-paste artifact as in the MySQL version.

RAG_POC/architecture-data-model.md (1)

203-203: Minor: Trailing space in code span.

There's a trailing space inside the code span that triggers a markdown lint warning.

🔧 Fix
-- `{"lit":"..."} ` — appends a literal string
+- `{"lit":"..."}` — appends a literal string
lib/sqlite3db.cpp (1)

1102-1108: Fix inconsistent indentation.

Lines 1102-1108 have inconsistent indentation compared to the surrounding code. The assignments should be indented with tabs to match the style of adjacent lines (e.g., line 1101 and 1109).

♻️ Suggested fix for indentation
 		proxy_sqlite3_column_count = sqlite3_column_count;
 		proxy_sqlite3_column_int = sqlite3_column_int;
-	proxy_sqlite3_column_int64 = sqlite3_column_int64;
-	proxy_sqlite3_column_double = sqlite3_column_double;
-	proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
-	proxy_sqlite3_errstr = sqlite3_errstr;
-	proxy_sqlite3_db_handle = sqlite3_db_handle;
-	proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
-	proxy_sqlite3_auto_extension = sqlite3_auto_extension;
+		proxy_sqlite3_column_int64 = sqlite3_column_int64;
+		proxy_sqlite3_column_double = sqlite3_column_double;
+		proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
+		proxy_sqlite3_errstr = sqlite3_errstr;
+		proxy_sqlite3_db_handle = sqlite3_db_handle;
+		proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
+		proxy_sqlite3_auto_extension = sqlite3_auto_extension;
 		proxy_sqlite3_errmsg = sqlite3_errmsg;
lib/GenAI_Thread.cpp (2)

688-691: Inconsistent boolean parsing for rag_enabled.

The rag_enabled setter accepts both "true" and "1", while other boolean setters in this file (e.g., enabled, llm_enabled, anomaly_enabled at lines 552-563) only accept "true". This inconsistency could confuse users.

Consider either:

  1. Aligning rag_enabled with existing setters (only "true"), or
  2. Updating all boolean setters to accept both "true" and "1" for consistency.
♻️ Option 1: Align with existing pattern
 	if (!strcmp(name, "rag_enabled")) {
-		variables.genai_rag_enabled = (strcmp(value, "true") == 0 || strcmp(value, "1") == 0);
+		variables.genai_rag_enabled = (strcmp(value, "true") == 0);
 		return true;
 	}

728-736: Consider raising minimum timeout threshold.

The minimum value of 1ms for rag_timeout_ms is extremely low and could cause spurious timeouts in production. Even local SQLite operations may exceed 1ms under load.

Consider raising the minimum to a more practical value like 100ms, consistent with other timeout validations in this file (e.g., embedding_timeout_ms uses 100-300000 range at line 534).

♻️ Suggested fix
 	if (!strcmp(name, "rag_timeout_ms")) {
 		int val = atoi(value);
-		if (val < 1 || val > 60000) {
-			proxy_error("Invalid value for rag_timeout_ms: %d (must be 1-60000)\n", val);
+		if (val < 100 || val > 60000) {
+			proxy_error("Invalid value for rag_timeout_ms: %d (must be 100-60000)\n", val);
 			return false;
 		}
 		variables.genai_rag_timeout_ms = val;
 		return true;
 	}
RAG_POC/embeddings-design.md (1)

42-47: Consider adding language specifier to fenced code block.

The embedding input template example lacks a language specifier. While this is intentionally plain text, adding a language identifier (e.g., text or plaintext) would satisfy linters and improve rendering consistency.

♻️ Suggested fix
-```
+```text
 {Title}
 Tags: {Tags}
 
 {ChunkBody}
</details>

</blockquote></details>
<details>
<summary>RAG_IMPLEMENTATION_SUMMARY.md (1)</summary><blockquote>

`1-129`: **Consider consolidating summary documents.**

This file contains substantial overlap with `RAG_FILE_SUMMARY.md` and `RAG_COMPLETION_SUMMARY.md`. Having three separate summary documents with similar content may lead to maintenance burden and potential inconsistencies over time.

Consider consolidating into a single comprehensive summary document or clearly differentiating the purpose of each file (e.g., one for developers, one for release notes).

</blockquote></details>
<details>
<summary>lib/RAG_Tool_Handler.cpp (1)</summary><blockquote>

`34-43`: **Remove the duplicate `GloGATH` forward declaration**

The `extern GenAI_Threads_Handler *GloGATH;` appears twice back‑to‑back. Dropping the duplicate reduces noise.

</blockquote></details>

</blockquote></details>



Comment on lines +726 to +736
mcpdb = new SQLite3DB();
std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);

/**
* @brief Enable SQLite extension loading for MCP catalog database
*
* Allows loading SQLite extensions at runtime. This enables sqlite-vec to be
* registered for vector similarity searches in the catalog.
*/
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);

⚠️ Potential issue | 🟡 Minor

Check mcpdb->open() result before attaching.

mcpdb->open() return code is ignored; if it fails, the subsequent attach can misbehave. Please handle the error and skip attachment/log accordingly.

✅ Suggested check
-	mcpdb = new SQLite3DB();
-	std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
-	mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	mcpdb = new SQLite3DB();
+	std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
+	int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+	                         SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	if (mcp_rc != SQLITE_OK) {
+		proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+		delete mcpdb;
+		mcpdb = nullptr;
+	} 
🤖 Prompt for AI Agents
In `@lib/Admin_Bootstrap.cpp` around lines 726 - 736, The code calls
mcpdb->open(...) and immediately enables extension loading via
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1) without checking the
open result; update the logic in the block around mcpdb->open()
(SQLite3DB::open) to capture and check its return value, log an error and skip
calling proxy_sqlite3_enable_load_extension and any further attach/setup when
open fails, and ensure you use mcpdb->get_db() only after a successful open;
include descriptive logging (via existing logging mechanism) when open fails to
make the early exit clear.

Comment on lines +300 to +315
// rag_vec_chunks: sqlite3-vec index
// Use configurable vector dimension from GenAI module
int vector_dimension = 1536; // Default value
if (GloGATH) {
vector_dimension = GloGATH->variables.genai_vector_dimension;
}

std::string create_rag_vec_chunks_sql =
"CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks USING vec0("
"embedding float(" + std::to_string(vector_dimension) + "), "
"chunk_id TEXT, "
"doc_id TEXT, "
"source_id INTEGER, "
"updated_at INTEGER"
");";


⚠️ Potential issue | 🟡 Minor

Validate genai_vector_dimension before building vec0 DDL.

If the dimension is 0 or negative, the rag_vec_chunks creation will fail and silently disable vector search. Please clamp or fall back to a safe default and log the correction.

✅ Suggested guard
-	int vector_dimension = 1536; // Default value
-	if (GloGATH) {
-		vector_dimension = GloGATH->variables.genai_vector_dimension;
-	}
+	int vector_dimension = 1536; // Default value
+	if (GloGATH) {
+		const int cfg_dim = GloGATH->variables.genai_vector_dimension;
+		if (cfg_dim > 0) {
+			vector_dimension = cfg_dim;
+		} else {
+			proxy_warning("AI: Invalid genai_vector_dimension=%d; using default %d\n",
+			              cfg_dim, vector_dimension);
+		}
+	}
🤖 Prompt for AI Agents
In `@lib/AI_Features_Manager.cpp` around lines 300 - 315, Check
GloGATH->variables.genai_vector_dimension before using it to build
create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large, clamp it to a
safe default (e.g., 1536) and log the correction via the same logger used
elsewhere (or emit a warning) so the fallback is visible; then use the sanitized
vector_dimension when constructing the vec0 DDL for rag_vec_chunks to avoid
silent failures.

pattern_name ? pattern_name : "unknown", result.risk_score);
}

sqlite3_finalize(stmt);

⚠️ Potential issue | 🟠 Major

Inconsistent SQLite API usage: missing proxy wrapper.

Line 500 uses sqlite3_finalize(stmt) directly, but nearby code uses proxy wrappers like proxy_sqlite3_step. This inconsistency could cause issues if the proxy layer adds logging, metrics, or error handling.

🐛 Proposed fix
-	sqlite3_finalize(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sqlite3_finalize(stmt);
(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` at line 500, Replace the direct call to
sqlite3_finalize(stmt) with the proxy wrapper used elsewhere (e.g.,
proxy_sqlite3_finalize(stmt)) so the proxy layer's logging/metrics/error
handling is applied consistently; update any includes or declarations if needed
to access proxy_sqlite3_finalize and ensure the call semantics and error
checking remain identical to the original sqlite3_finalize usage (use the same
stmt variable and follow existing error/logging patterns used with
proxy_sqlite3_step).

Comment on lines +577 to +590

// Cleanup old entries periodically
static int cleanup_counter = 0;
if (++cleanup_counter % 1000 == 0) {
uint64_t current_time = (uint64_t)time(NULL);
auto it = user_statistics.begin();
while (it != user_statistics.end()) {
if (current_time - it->second.last_query_time > USER_STATS_WINDOW * 2) {
it = user_statistics.erase(it);
} else {
++it;
}
}
}

⚠️ Potential issue | 🟠 Major

Static local variable cleanup_counter is not thread-safe.

Multiple threads calling update_user_statistics can race on the static cleanup_counter variable, potentially causing cleanup to run less frequently than intended or creating undefined behavior.

🐛 Proposed fix: Use atomic counter
+#include <atomic>
+
+// At class level or file scope:
+static std::atomic<int> cleanup_counter{0};
+
 void Anomaly_Detector::update_user_statistics(const QueryFingerprint& fp) {
 	// ... existing code ...

 	// Cleanup old entries periodically
-	static int cleanup_counter = 0;
-	if (++cleanup_counter % 1000 == 0) {
+	if (cleanup_counter.fetch_add(1, std::memory_order_relaxed) % 1000 == 0) {
 		// ... cleanup logic (consider adding mutex for user_statistics map) ...

Note: The user_statistics map itself may also need synchronization if accessed from multiple threads.

🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 577 - 590, The static local int
cleanup_counter in the cleanup block inside update_user_statistics is not
thread-safe; replace it with a std::atomic<int> (e.g., declare static
std::atomic<int> cleanup_counter{0}) and perform an atomic increment/check (use
fetch_add or ++cleanup_counter and compare the result mod 1000) so multiple
threads can't race on the counter; also ensure the user_statistics cleanup loop
remains protected by the same mutex/lock used for accessing the user_statistics
map (or document that user_statistics is already synchronized) to avoid
concurrent access issues.

Comment on lines +771 to +776
proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
sqlite3_finalize(stmt);
return -1;
}

sqlite3_finalize(stmt);

⚠️ Potential issue | 🟠 Major

Inconsistent proxy wrapper usage continues.

Lines 771-772 and 776 use sqlite3_errmsg and sqlite3_finalize directly instead of their proxy equivalents used elsewhere in this file.

🐛 Proposed fix
 	rc = (*proxy_sqlite3_step)(stmt);
 	if (rc != SQLITE_DONE) {
-		proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
-		sqlite3_finalize(stmt);
+		proxy_error("Anomaly: Failed to insert pattern: %s\n", (*proxy_sqlite3_errmsg)(db));
+		(*proxy_sqlite3_finalize)(stmt);
 		return -1;
 	}

-	sqlite3_finalize(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 771 - 776, Replace direct SQLite
calls with the proxy wrappers to match the rest of the file: in the error branch
where proxy_error("Anomaly: Failed to insert pattern: %s\n",
sqlite3_errmsg(db)); and where sqlite3_finalize(stmt); are used, call the proxy
wrappers (e.g., proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt))
instead so proxy_error uses the proxy_sqlite3_errmsg result and both
finalization calls use proxy_sqlite3_finalize; locate these occurrences around
the Anomaly pattern insert logic where proxy_error, db, and stmt are referenced.

Comment on lines +2316 to +2330
// Build doc ID list for SQL
std::string doc_list = "'";
for (size_t i = 0; i < doc_ids.size(); ++i) {
if (i > 0) doc_list += "','";
doc_list += doc_ids[i];
}
doc_list += "'";

// Build query
std::string sql = "SELECT doc_id, source_id, "
"(SELECT name FROM rag_sources WHERE source_id = rag_documents.source_id) as source_name, "
"pk_json, title, body, metadata_json "
"FROM rag_documents "
"WHERE doc_id IN (" + doc_list + ")";


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verification scripts (collapsed): inspected lib/RAG_Tool_Handler.cpp for the doc_list construction, the existing bind/escape patterns, and the rag.get_docs and rag.fetch_from_source handlers; the output confirmed direct string concatenation with no parameter binding.

Apply SQL injection mitigation to doc_id lists in rag.get_docs and rag.fetch_from_source

The code concatenates doc_ids directly into SQL queries without validation or parameterization. This is injection-prone and breaks on single quotes. Use the validation pattern found in the filter handlers (rejecting `'`, `\`, `;`) or migrate to the existing `execute_parameterized_query()` function with bound parameters (`?1`, `?2`, ...).

Affects both:

  • rag.get_docs: WHERE doc_id IN ( + direct concatenation
  • rag.fetch_from_source: identical pattern in WHERE d.doc_id IN ( clause
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 2316 - 2330, The doc_id list is
concatenated into SQL in rag.get_docs and rag.fetch_from_source (see the doc_ids
loop that builds doc_list and the sql string using "WHERE ... IN (" + doc_list +
")"), which is vulnerable to SQL injection and fails on quotes; fix by replacing
direct concatenation with a parameterized IN clause using
execute_parameterized_query() and bound parameters (generate placeholders
?1,?2,... from doc_ids and bind each doc_ids[i]) or, if you prefer the
validation approach, validate every entry in doc_ids against the same filter
used in the filter handlers (reject or escape entries containing '\'', '\\',
';') before building the query; ensure you update both rag.get_docs and
rag.fetch_from_source to use the same safe pattern.

Comment on lines +468 to +479
static std::string build_select_sql(const RagSource& s, const std::vector<std::string>& cols) {
std::string sql = "SELECT ";
for (size_t i = 0; i < cols.size(); i++) {
if (i) sql += ", ";
sql += "`" + cols[i] + "`";
}
sql += " FROM `" + s.table_name + "`";
if (!s.where_sql.empty()) {
sql += " WHERE " + s.where_sql;
}
return sql;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verification scripts (collapsed): located the RagSource definition, traced how rag_sources is loaded and where build_select_sql and where_sql are used, and searched for identifier validation/escaping helpers and for any API that modifies rag_sources; no validation of where_sql and no public modification path for rag_sources were found.

Add WHERE clause SQL injection validation; backtick-escaped identifiers need backtick escaping.

The where_sql field at line 476 is concatenated directly into the SQL query without validation, creating a SQL injection risk. Additionally, while column and table names are backtick-quoted (lines 472, 474), backticks within the identifier values themselves are not escaped: a malicious identifier containing a backtick (e.g. ``col`name``) would break the quoting.

Since rag_sources is loaded from the internal SQLite database and no public APIs were found to modify it, the actual attack surface is limited to trusted administrators. However, the code should defensively validate where_sql content (e.g., using a prepared statement or allowlist for WHERE conditions) and properly escape backticks in identifiers to prevent identifier injection.

🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 468 - 479, build_select_sql concatenates
RagSource::where_sql and raw identifiers, creating SQL
injection/identifier-injection risks; update build_select_sql to (1) escape
backticks inside identifiers used for cols and s.table_name (replace any '`'
with '``') before wrapping with backticks, and (2) stop directly concatenating
s.where_sql: either accept a parameterized WHERE with placeholders and require
callers to supply bound parameters, or validate allowlisted characters/patterns
(e.g., only simple column comparisons with a strict regex) and reject/return an
error for unsafe values; update function signature or call sites as needed to
support bound parameters or validation on RagSource::where_sql.

Comment on lines +686 to +698
s.name = (const char*)sqlite3_column_text(st, 1);
s.enabled = sqlite3_column_int(st, 2);

s.backend_type = (const char*)sqlite3_column_text(st, 3);
s.host = (const char*)sqlite3_column_text(st, 4);
s.port = sqlite3_column_int(st, 5);
s.user = (const char*)sqlite3_column_text(st, 6);
s.pass = (const char*)sqlite3_column_text(st, 7);
s.db = (const char*)sqlite3_column_text(st, 8);

s.table_name = (const char*)sqlite3_column_text(st, 9);
s.pk_column = (const char*)sqlite3_column_text(st, 10);
s.where_sql = (const char*)sqlite3_column_text(st, 11);

⚠️ Potential issue | 🟠 Major

Potential null pointer dereference from sqlite3_column_text.

sqlite3_column_text can return NULL for NULL column values or on allocation failure. Directly assigning to std::string without a null check will cause undefined behavior if NULL is returned.

🐛 Proposed fix: Add null checks
-    s.name      = (const char*)sqlite3_column_text(st, 1);
-    s.enabled   = sqlite3_column_int(st, 2);
-
-    s.backend_type = (const char*)sqlite3_column_text(st, 3);
-    s.host         = (const char*)sqlite3_column_text(st, 4);
-    s.port         = sqlite3_column_int(st, 5);
-    s.user         = (const char*)sqlite3_column_text(st, 6);
-    s.pass         = (const char*)sqlite3_column_text(st, 7);
-    s.db           = (const char*)sqlite3_column_text(st, 8);
-
-    s.table_name   = (const char*)sqlite3_column_text(st, 9);
-    s.pk_column    = (const char*)sqlite3_column_text(st, 10);
-    s.where_sql    = (const char*)sqlite3_column_text(st, 11);
+    s.name      = str_or_empty((const char*)sqlite3_column_text(st, 1));
+    s.enabled   = sqlite3_column_int(st, 2);
+
+    s.backend_type = str_or_empty((const char*)sqlite3_column_text(st, 3));
+    s.host         = str_or_empty((const char*)sqlite3_column_text(st, 4));
+    s.port         = sqlite3_column_int(st, 5);
+    s.user         = str_or_empty((const char*)sqlite3_column_text(st, 6));
+    s.pass         = str_or_empty((const char*)sqlite3_column_text(st, 7));
+    s.db           = str_or_empty((const char*)sqlite3_column_text(st, 8));
+
+    s.table_name   = str_or_empty((const char*)sqlite3_column_text(st, 9));
+    s.pk_column    = str_or_empty((const char*)sqlite3_column_text(st, 10));
+    s.where_sql    = str_or_empty((const char*)sqlite3_column_text(st, 11));
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 686 - 698, The assignments reading
columns into struct s are vulnerable to null pointer dereference because
sqlite3_column_text can return NULL; update the code that sets s.name,
s.backend_type, s.host, s.user, s.pass, s.db, s.table_name, s.pk_column and
s.where_sql to first call sqlite3_column_text into a temporary const unsigned
char* (or const char*), check for NULL (or use sqlite3_column_type to detect
SQLITE_NULL) and only assign to the std::string members when non-NULL, otherwise
set them to an empty string (or another safe default); do the same null-check
pattern for text columns used by functions like sqlite3_column_text while
keeping integer reads (sqlite3_column_int for s.enabled and s.port) as-is or
guarding them with sqlite3_column_type if you want to detect NULLs explicitly.
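The `str_or_empty` helper assumed by the proposed fix is not shown in the diff; a minimal sketch of what it would look like (the name comes from the suggestion above, not from the PR):

```cpp
#include <string>

// Hypothetical helper assumed by the proposed fix: map a NULL pointer
// returned by sqlite3_column_text() (NULL column value or allocation
// failure) to an empty string instead of invoking undefined behavior
// when constructing std::string.
static std::string str_or_empty(const char* s) {
    return s ? std::string(s) : std::string();
}
```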

Comment on lines +901 to +905
// Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
// Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
// or a local time. For v0, we store 0 and let schema default handle other tables.
// If you want accuracy, query SELECT unixepoch() once per run and reuse it.
std::int64_t now_epoch = 0;

⚠️ Potential issue | 🟡 Minor

Hardcoded now_epoch = 0 will cause incorrect timestamps.

The comment acknowledges this is a placeholder, but storing 0 as updated_at in rag_vec_chunks will make temporal queries unreliable. Consider using time(NULL) for a reasonable approximation.

💡 Suggested fix
-    // If you want accuracy, query SELECT unixepoch() once per run and reuse it.
-    std::int64_t now_epoch = 0;
+    // Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
+    std::int64_t now_epoch = (std::int64_t)time(NULL);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     // Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
     // Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
     // or a local time. For v0, we store 0 and let schema default handle other tables.
-    // If you want accuracy, query SELECT unixepoch() once per run and reuse it.
-    std::int64_t now_epoch = 0;
+    // Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
+    std::int64_t now_epoch = (std::int64_t)time(NULL);
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 901 - 905, The code currently sets
now_epoch to 0 which yields invalid timestamps for rag_vec_chunks; replace the
hardcoded value by capturing a real epoch once per run (e.g., use
std::time(nullptr) / time(NULL) and cast to std::int64_t) and reuse that value
when populating updated_at (variable now_epoch), updating any affected functions
that rely on now_epoch; ensure the translation unit includes <ctime> (or
<time.h>) if not already present so the build succeeds.

Comment on lines +14 to +45
CREATE TABLE IF NOT EXISTS rag_sources (
source_id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE, -- e.g. "stack_posts"
enabled INTEGER NOT NULL DEFAULT 1,

-- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL")
backend_type TEXT NOT NULL, -- "mysql" | "postgres" | ...
backend_host TEXT NOT NULL,
backend_port INTEGER NOT NULL,
backend_user TEXT NOT NULL,
backend_pass TEXT NOT NULL,
backend_db TEXT NOT NULL, -- database/schema name

table_name TEXT NOT NULL, -- e.g. "posts"
pk_column TEXT NOT NULL, -- e.g. "Id"

-- Optional: restrict ingestion; appended to SELECT as WHERE <where_sql>
where_sql TEXT, -- e.g. "PostTypeId IN (1,2)"

-- REQUIRED: mapping from source row -> rag_documents fields
-- JSON spec describing doc_id, title/body concat, metadata pick/rename, etc.
doc_map_json TEXT NOT NULL,

-- REQUIRED: chunking strategy (enabled, chunk_size, overlap, etc.)
chunking_json TEXT NOT NULL,

-- Optional: embedding strategy (how to build embedding input text)
-- In v0 you can keep it NULL/empty; define later without schema changes.
embedding_json TEXT,

created_at INTEGER NOT NULL DEFAULT (unixepoch()),
updated_at INTEGER NOT NULL DEFAULT (unixepoch())

⚠️ Potential issue | 🟠 Major

Avoid storing plaintext backend credentials

backend_pass is stored as TEXT, which risks credential leakage. Please move secrets to a secure store or encrypt at rest (and restrict access) rather than persisting raw passwords in the catalog.

🤖 Prompt for AI Agents
In `@RAG_POC/schema.sql` around lines 14 - 45, The rag_sources table currently
stores plaintext credentials in the backend_pass column; instead migrate to a
secure approach by removing plaintext storage and either (A) storing only a
reference to an external secret manager (add a new column like
backend_secret_ref) and update any code that reads rag_sources to fetch the
secret from the secret manager, or (B) store an encrypted value (add
backend_pass_encrypted) and implement symmetric key encryption/decryption in the
service layer to read/write via functions that use a KMS-held key; update schema
changes for rag_sources, add a migration to backfill secrets (move existing
backend_pass values into the secret store or encrypt them) and remove or null
out backend_pass, and update codepaths that reference backend_pass to use
backend_secret_ref or decrypt backend_pass_encrypted instead.

renecannao and others added 5 commits January 21, 2026 13:12
- Add special build rule in Makefile for anomaly_detection-t that includes:
  - $(OBJ) for global variables (GloVars, GloGATH)
  - -Wl,--allow-multiple-definition to allow test's main() to override ProxySQL's
  - ClickHouse client libraries (libclickhouse-cpp-lib.a, libzstdstatic.a, liblz4.a)
  - SQLite rembed library (libsqlite_rembed.a)
  - -lscram -lusual for PostgreSQL SCRAM support
- Add stub function SQLite3_Server_session_handler required by SQLite3_Server.cpp

Resolves compilation errors for anomaly_detection-t test.
- Add #include <string> for C++ std::string support
- Add #include <cmath> for sqrt() function
- Change format %lld to %ld for chrono duration types (long int, not long long)

Resolves compilation errors for vector_db_performance-t test.
- Add #include <string> for C++ std::string support

Resolves compilation errors for ai_llm_retry_scenarios-t test.
This commit addresses valid concerns raised by coding agents (Gemini, Copilot, CoderabbitAI):

1. Fix stats_mcp_query_digest naming conflict (ProxySQL_Admin.cpp):
   - Made reset and non-reset paths mutually exclusive using else block
   - Prevents both flags from being true, matching MySQL pattern
   - Ensures reset takes precedence over non-reset

2. Fix INSERT OR REPLACE sync issue (Admin_Handler.cpp):
   - Added DELETE before INSERT OR REPLACE in LOAD/SAVE MCP QUERY RULES
   - Prevents stale rules from persisting when syncing disk <-> memory
   - Ensures deleted source rows are also removed from target

3. Fix integer division truncation for timeout (Query_Tool_Handler.cpp):
   - Changed timeout_ms/1000 to (timeout_ms+999)/1000 for ceiling division
   - Ensures sub-second timeouts (e.g., 500ms) become at least 1 second
   - Prevents zero-second timeouts from causing unexpected behavior

4. Remove confusing comment (Discovery_Schema.cpp):
   - Simplified column count comment to be clear and accurate

Note: The re_modifiers parsing code already correctly handles VARCHAR "CASELESS"
to int conversion (lines 2414-2425), so that review comment was already addressed.
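The ceiling division in item 3 can be shown in isolation (`ms_to_ceil_seconds` is a hypothetical helper name, not a function in the PR):

```cpp
// Round a millisecond timeout up to whole seconds, so sub-second values
// such as 500ms become 1s instead of truncating to 0s.
static int ms_to_ceil_seconds(int timeout_ms) {
    return (timeout_ms + 999) / 1000;
}
```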
feat: Add MCP query rules and digest statistics [WIP]
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/MySQL_Catalog.cpp (2)

56-57: Avoid Variable Length Arrays (VLAs) - use heap allocation instead.

VLAs are not standard C++ (they're a C99/GCC extension) and can cause stack overflow for large paths. Use std::string or heap allocation.

Proposed fix
-	char path_buf[db_path.size() + 1];
-	strcpy(path_buf, db_path.c_str());
-	int rc = db->open(path_buf, SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+	int rc = db->open(const_cast<char*>(db_path.c_str()), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);

Or if db->open requires a mutable buffer:

-	char path_buf[db_path.size() + 1];
-	strcpy(path_buf, db_path.c_str());
+	std::vector<char> path_buf(db_path.begin(), db_path.end());
+	path_buf.push_back('\0');
	int rc = db->open(path_buf.data(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);

279-301: SQL injection vulnerability - user input concatenated directly into SQL.

The search() method concatenates schema, kind, tags, and query parameters directly into the SQL string without escaping or using prepared statements. This is a critical security vulnerability that allows SQL injection attacks.

The vulnerable patterns are:

  • Line 283: sql << " AND schema = '" << schema << "'";
  • Line 288: sql << " AND kind = '" << kind << "'";
  • Line 293: sql << " AND tags LIKE '%" << tags << "%'";
  • Lines 298-300: query parameter concatenated into LIKE clauses

Consider using prepared statements like the upsert() and get() methods do, or at minimum escape single quotes in the input strings.

Example fix using prepared statements
std::string MySQL_Catalog::search(
	const std::string& schema,
	const std::string& query,
	const std::string& kind,
	const std::string& tags,
	int limit,
	int offset
) {
	sqlite3_stmt* stmt = NULL;
	std::ostringstream sql;
	std::vector<std::string> bindings;
	int bind_idx = 1;

	sql << "SELECT schema, kind, key, document, tags, links FROM catalog WHERE 1=1";

	if (!schema.empty()) {
		sql << " AND schema = ?" << bind_idx++;
		bindings.push_back(schema);
	}
	if (!kind.empty()) {
		sql << " AND kind = ?" << bind_idx++;
		bindings.push_back(kind);
	}
	// ... similar for other parameters
	
	// Then bind each parameter using proxy_sqlite3_bind_text
}
🤖 Fix all issues with AI agents
In `@lib/Admin_Handler.cpp`:
- Around line 2369-2374: The current code builds a single semicolon-separated
SQL string into *q and relies on execute_statement (which uses
sqlite3_prepare_v2 and only compiles the first statement) so the INSERT never
runs; change the logic around where *q is set (the l_free/l_strdup block that
creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each SQL
statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.

In `@lib/MySQL_Catalog.cpp`:
- Around line 370-390: The list() method currently concatenates user inputs into
the SQL strings (variables sql and count_sql), causing SQL injection; change it
to use prepared statements and parameter binding like upsert() and get(): build
the base query with placeholders (e.g., "WHERE 1=1" plus "AND schema = ?"/"AND
kind = ?"), prepare the statement via your MySQL API, bind schema and kind only
when non-empty, and bind limit/offset as parameters before executing both the
main query and the COUNT query (use the same conditional placeholder logic for
count_sql). Ensure you replace the direct string concatenation at the spots that
currently append schema/kind and the final "LIMIT/OFFSET" with proper
placeholders and binding.
- Around line 505-511: In the remove() method, avoid concatenating user inputs
(schema, kind, key) into the SQL string; instead build a parameterized DELETE
and use a prepared statement to bind values for schema (when non-empty), kind
and key before executing. Replace the ostringstream construction in
MySQL_Catalog::remove() with a DELETE query that uses placeholders for
schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND schema = ?" when
schema provided, then always "AND kind = ? AND key = ?"), prepare that statement
via your MySQL client API, bind schema (conditionally), bind kind and key,
execute, and propagate/log any execution errors.

In `@lib/ProxySQL_Admin_Stats.cpp`:
- Around line 2581-2627: The INSERT always targets stats_mcp_query_digest and
uses strlen/risky sprintf, causing reset data to go to the live table, crashes
on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.

In `@lib/ProxySQL_Admin.cpp`:
- Around line 1354-1357: The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.
- Around line 7750-7804: The function
ProxySQL_Admin::load_mcp_query_rules_to_runtime leaks the SQLite3_result* on the
success path because Discovery_Schema::load_mcp_query_rules(resultset) does not
free it; after calling catalog->load_mcp_query_rules(resultset) add deletion of
the resultset (delete resultset;) to mirror the error-path cleanup used in
load_mysql_firewall_to_runtime(), and ensure you use the correct deallocation
(delete, not free). Also update the function comment header to correct the
documentation: change "error message string on failure (caller must free)" to
indicate returned error strings are string literals and must not be freed by the
caller.

In `@lib/Query_Tool_Handler.cpp`:
- Around line 235-319: The init_connection_pool function must clean up any
partially initialized connections on error: add a helper (lambda or private
method) inside init_connection_pool that iterates connection_pool, calls
mysql_close on each entry's MYSQL* (and sets conn.mysql = nullptr), clears
connection_pool, resets pool_size to 0, and unlocks pool_lock if held; call this
cleanup helper before every early return that occurs after some successful
connections (e.g., after mysql_init/mysql_real_connect failures) so the function
either returns with a fully initialized pool or an empty, consistent pool;
update all return paths in init_connection_pool to invoke this helper and then
return -1.

In `@lib/Static_Harvester.cpp`:
- Around line 1088-1138: The quick-profile field indices are misaligned with the
SELECT: update the row->fields index usage inside the loop (where profile is
built) so they match the SELECT order (object_id, schema_name, object_name,
object_type, engine, table_rows_est, data_length, index_length, has_primary_key,
has_foreign_keys, has_time_column). Concretely, set rows_est to row->fields[5],
engine to row->fields[4], size_bytes to the sum of row->fields[6] and
row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).
- Around line 236-258: In Static_Harvester::execute_query, mysql_store_result()
NULL is being treated as a successful "no result set" case; instead check
mysql_field_count(mysql_conn) to see if the query should have a result set—if
mysql_field_count() == 0 treat it as a successful non-result (unlock conn_lock
and return 0), otherwise treat NULL as an error by calling
mysql_errno(mysql_conn) and logging mysql_error(mysql_conn), unlock conn_lock
and return -1; update the error log to include mysql_error(mysql_conn) and
ensure conn_lock is always released before returning.
- Around line 388-662: All five fetch_* methods (fetch_schemas,
fetch_tables_views, fetch_columns, fetch_indexes, fetch_foreign_keys) currently
interpolate the user-provided filter directly into SQL and need whitelist
validation to prevent SQL injection: add a small helper (e.g.,
is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.
- Around line 1014-1055: The UPDATE that writes view_def, schema_name, and
view_name into the SQLite catalog (inside the for-loop using update_sql and
catalog->get_db()->execute_statement) is vulnerable to breaking on single
quotes; change it to use a prepared statement with parameter binding
(sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step / sqlite3_finalize) and
bind view_def, schema_name, view_name, and current_run_id instead of
interpolating them, following the pattern in Discovery_Schema.cpp; alternatively
(less preferred) call escape_string_single_quotes() on view_def.c_str(),
schema_name.c_str(), and view_name.c_str() and use the escaped C strings in the
constructed SQL before calling execute_statement.
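The allowlist validation suggested for the `fetch_*` methods can be sketched without a regex dependency; `is_valid_identifier` is the name proposed above, but this body is an illustrative assumption, not code from the PR:

```cpp
#include <string>

// Hypothetical validator matching the suggested /^[A-Za-z0-9_]+$/ allowlist.
// An empty filter means "no filter" and is accepted; any other character
// (quotes, backticks, spaces, ...) rejects the value before it can reach
// SQL concatenation.
bool is_valid_identifier(const std::string& s) {
    if (s.empty()) return true;  // empty = no filter
    for (char c : s) {
        bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

Each fetch method would call this before appending `" AND ... = '<filter>'"`, and either return an empty result or ignore the filter when validation fails.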
♻️ Duplicate comments (9)
lib/ProxySQL_Admin_Stats.cpp (1)

1587-1613: Fix reset insert target and avoid raw SQL interpolation.

Line 1608 builds the INSERT using a prefix; when reset == true it targets stats_mcp_query_tools_counters_mcp_query_tools_counters (non‑existent). Raw interpolation of tool/schema also breaks on quotes or enables injection. Use an explicit table name plus bound parameters.

🛠️ Proposed fix (explicit table + prepared statement)
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-	     it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		char query[1024];
-		snprintf(query, sizeof(query),
-			"INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-			reset ? "stats_mcp_query_tools_counters_" : "stats_",
-			r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-			r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-		statsdb->execute(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_tools_counters_reset"
+		: "stats_mcp_query_tools_counters";
+	const std::string sql =
+		std::string("INSERT INTO ") + table +
+		" VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)";
+	int rc = 0;
+	stmt_unique_ptr u_stmt { nullptr };
+	std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+	ASSERT_SQLITE_OK(rc, statsdb);
+	sqlite3_stmt* const stmt = u_stmt.get();
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+		SAFE_SQLITE3_STEP2(stmt);
+		rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+	}
lib/Admin_Bootstrap.cpp (1)

726-736: Check MCP catalog DB open result before enabling extensions/attach.

Line 728 ignores mcpdb->open()’s return code; if it fails, proxy_sqlite3_enable_load_extension and __attach_db (Line 947) can dereference an invalid handle.

🛠️ Suggested guard
-	mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+	                         SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+	if (mcp_rc != SQLITE_OK) {
+		proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+		delete mcpdb;
+		mcpdb = nullptr;
+	}
 ...
-	(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+	if (mcpdb) {
+		(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+	}
 ...
-	__attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+	if (mcpdb) {
+		__attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+	}

Also applies to: 947-947

lib/Query_Tool_Handler.cpp (7)

352-360: Synchronize find_connection with pool_lock.

Iterating connection_pool without a lock can race with close() or pool initialization.


362-405: Avoid using a returned connection and return immediately on query failure.

The connection is returned before mysql_affected_rows, and the error path continues after mysql_query fails. Both create races/UB.


407-476: Hold the connection until mysql_affected_rows/result processing completes.

Returning the connection before mysql_affected_rows risks another thread reusing it.


809-846: Match allocation/deallocation for SQLite3_result.

execute_statement allocates SQLite3_result with new, but this path frees it with free(). That’s undefined behavior.


920-936: Escape schema/name_filter before concatenating SQL.

Direct interpolation enables SQL injection via tool arguments.


1112-1131: Sanitize schema/table in object_id lookup SQL.

Direct string concatenation allows injection via object_key.


1799-1850: Free row buffers after add_row to avoid leaks.

Each row allocates char** plus strdup fields that are never freed.

🧹 Nitpick comments (1)
lib/Admin_Handler.cpp (1)

2360-2360: Minor: Inconsistent length check could be clarified.

The condition query_no_space_length>20 combined with strncasecmp(..., 21) works correctly (since "SAVE MCP QUERY RULES " is 21 chars), but using strlen("SAVE MCP QUERY RULES ") would be more maintainable and consistent with nearby code patterns (e.g., line 2365 uses strlen()).

-if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, 21)) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, 21)) ) ) {
+if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, strlen("SAVE MCP QUERY RULES "))) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, strlen("LOAD MCP QUERY RULES "))) ) ) {

Comment on lines +2369 to +2374
l_free(*ql,*q);
// First clear target table, then insert to ensure deleted source rows are also removed
*q=l_strdup("DELETE FROM main.mcp_query_rules; INSERT OR REPLACE INTO main.mcp_query_rules SELECT * FROM disk.mcp_query_rules");
*ql=strlen(*q)+1;
return true;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check how other disk-to-memory operations handle this pattern
rg -n "DELETE FROM.*INSERT" --type cpp lib/

Repository: ProxySQL/proxysql-vec

Length of output: 628


🏁 Script executed:

# Also search for the execute_statement function and how SQL is executed
rg -n "execute_statement" --type cpp lib/ | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 2642


🏁 Script executed:

# Get context around lines 2369-2374 and 2383
sed -n '2360,2390p' lib/Admin_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 1821


🏁 Script executed:

# Find execute_statement function definition
rg -n "execute_statement" lib/ --type cpp -A 10 | grep -A 10 "execute_statement.*{" | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 2293


🏁 Script executed:

# Find BQE1 macro definition which also handles DELETE...INSERT patterns
rg -n "BQE1" lib/ --type cpp -B 2 -A 5 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 4392


🏁 Script executed:

# Search for the database class and SQLite3_exec or similar execution functions
rg -n "SQLite3_exec\|execute_statement" lib/ --type cpp | grep "define\|{" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Look for where execute_statement is defined (it might be in a header)
fd -e "h" lib/ -exec rg -l "execute_statement" {} \; | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 315


🏁 Script executed:

# Look at the full BQE1 implementation
sed -n '187,210p' lib/ProxySQL_Admin.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 624


🏁 Script executed:

# Find the execute function to see how it differs from execute_statement
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 15 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if sqlite3_prepare_v2 handles multiple statements
rg -n "sqlite3_prepare_v2" lib/ --type cpp -B 2 -A 5 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 3557


🏁 Script executed:

# Look at how the mcp_query_rules statements are actually being executed later
rg -n "mcp_query_rules" lib/Admin_Handler.cpp -B 2 -A 8

Repository: ProxySQL/proxysql-vec

Length of output: 3897


🏁 Script executed:

# Find the execute function implementation
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 20 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check if there's a sqlite3_exec call anywhere that might handle multiple statements
rg -n "sqlite3_exec" lib/ --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 3222


🏁 Script executed:

# Check the SQLite3DB class header to understand what execute vs execute_statement does
fd -name "*.h" lib/ -exec rg -l "SQLite3DB" {} \; | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 295


🏁 Script executed:

# Find the execute function that takes a const char*
rg -n "void SQLite3DB::execute" lib/sqlite3db.cpp -A 25 | head -100

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the full context around the sqlite3_exec call
sed -n '215,250p' lib/sqlite3db.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 964


🏁 Script executed:

# Let's look for the header file to understand the class structure
find lib -name "*.h" -o -name "*.hpp" | xargs rg -l "class SQLite3DB" 2>/dev/null | head -1

Repository: ProxySQL/proxysql-vec

Length of output: 94


🏁 Script executed:

# Find where the returned query string (in *q) is executed
# This code is in Admin_Handler.cpp processing admin commands
# Let's see the full function context
sed -n '2300,2450p' lib/Admin_Handler.cpp | head -150

Repository: ProxySQL/proxysql-vec

Length of output: 8064


🏁 Script executed:

# Search for how admin queries are executed after being processed
rg -n "execute_statement\|->execute\(" lib/Admin_Handler.cpp | grep -A 3 -B 3 "2[34][0-9][0-9]" | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the broader context to see what happens with the query after processing
sed -n '2350,2500p' lib/Admin_Handler.cpp | tail -150

Repository: ProxySQL/proxysql-vec

Length of output: 7730


🏁 Script executed:

# Find the function signature and caller context
rg -n "admin_handler_function\|process_admin_query" lib/Admin_Handler.cpp | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Look for the function that calls this handler
grep -n "^.*(" lib/Admin_Handler.cpp | grep -E "admin_handler|process_admin" | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 725


🏁 Script executed:

# Find the function definition that contains this code
sed -n '1,100p' lib/Admin_Handler.cpp | grep -E "^[a-zA-Z_].*\(|^bool|^int"

Repository: ProxySQL/proxysql-vec

Length of output: 127


🏁 Script executed:

# Search for where the query string *q gets executed after being returned
rg -n "admindb->execute\(" lib/ --type cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 1882


🏁 Script executed:

# Find where admin_handler_command_load_or_save is called and what happens with the returned query
rg -n "admin_handler_command_load_or_save" lib/ --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 1121


🏁 Script executed:

# Let's look at the full function signature and see how it's used
sed -n '1025,1100p' lib/Admin_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 4033


🏁 Script executed:

# Check for "sqlite3_exec" documentation or SQLite3 behavior regarding multiple statements
# Let's look for comments or documentation about multi-statement support
rg -n "sqlite3_exec\|multiple.*statement" lib/ --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Find the __run_query label and see how the query is executed
sed -n '3237,3500p' lib/Admin_Handler.cpp | head -200

Repository: ProxySQL/proxysql-vec

Length of output: 10475


🏁 Script executed:

# Search for the __run_query label explicitly
rg -n "__run_query:" lib/Admin_Handler.cpp -A 30 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 1399


🏁 Script executed:

# Find what happens immediately after __run_query label - look for execute_statement or execute
sed -n '4317,4450p' lib/Admin_Handler.cpp | head -150

Repository: ProxySQL/proxysql-vec

Length of output: 4148


🏁 Script executed:

# Search more broadly for where the query gets executed after __run_query
rg -n "run_query" lib/Admin_Handler.cpp | tail -20

Repository: ProxySQL/proxysql-vec

Length of output: 561


🏁 Script executed:

# Look for any comments or documentation about prepare_v2 and multiple statements
rg -n "prepare_v2\|multiple.*statement" lib/sqlite3db.cpp -B 2 -A 2 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check if there are any workarounds or special handling for multi-statement queries
rg -n "sqlite3_next_stmt\|tail.*statement" lib/ --type cpp | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check the exact behavior of sqlite3_prepare_v2 with documentation or comments
rg -n "sqlite3_prepare_v2" lib/sqlite3db.cpp -B 5 -A 15 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 2141


🏁 Script executed:

# Verify that execute_statement does NOT loop or handle multiple statements
sed -n '322,388p' lib/sqlite3db.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 2511


Fix multi-statement SQL execution in mcp_query_rules operations.

Lines 2371 and 2383 use execute_statement, which calls sqlite3_prepare_v2 to compile the query. sqlite3_prepare_v2 compiles only the first SQL statement of a semicolon-separated string (the unparsed remainder is returned through its pzTail argument and never executed here), so the INSERT OR REPLACE part will never run.

Either switch to execute() (which uses sqlite3_exec and does support multi-statement strings) for the combined string, or run each statement as its own call. The pattern used elsewhere in the codebase, such as the BQE1 macro in ProxySQL_Admin.cpp and the debug handlers, demonstrates the approach of executing statements individually.

Example pattern (from LOAD DEBUG handlers):
SPA->admindb->execute("DELETE FROM main.debug_filters");
SPA->admindb->execute("INSERT OR REPLACE INTO main.debug_levels SELECT * FROM disk.debug_levels");
SPA->admindb->execute("INSERT INTO main.debug_filters SELECT * FROM disk.debug_filters");
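The difference is easy to observe outside the codebase: Python's stdlib sqlite3 module wraps the same C API, and its execute() goes through prepare-style compilation (rejecting a second statement; current CPython raises ProgrammingError, older versions sqlite3.Warning) while executescript() behaves like sqlite3_exec() and runs everything. A minimal sketch, with table and values purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debug_levels (module TEXT, verbosity INT)")

# execute() prepares a single statement, like sqlite3_prepare_v2:
# a second semicolon-separated statement is rejected, not silently run.
multi = ("INSERT INTO debug_levels VALUES ('a', 1); "
         "INSERT INTO debug_levels VALUES ('b', 2)")
try:
    conn.execute(multi)
    prepared_multi_ok = True
except (sqlite3.Warning, sqlite3.ProgrammingError):
    prepared_multi_ok = False

# executescript() behaves like sqlite3_exec(): every statement runs.
conn.executescript(multi + ";")
rows = conn.execute("SELECT COUNT(*) FROM debug_levels").fetchone()[0]
```

With the multi-statement string, prepared_multi_ok ends up False and only the executescript() path inserts both rows.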
🤖 Prompt for AI Agents
In `@lib/Admin_Handler.cpp` around lines 2369 - 2374, The current code builds a
single semicolon-separated SQL string into *q and relies on execute_statement
(which uses sqlite3_prepare_v2 and only compiles the first statement) so the
INSERT never runs; change the logic around where *q is set (the l_free/l_strdup
block that creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each
SQL statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.

Comment on lines +370 to 390
 	sql << "SELECT schema, kind, key, document, tags , links FROM catalog WHERE 1=1";

 	if (!schema.empty()) {
 		sql << " AND schema = '" << schema << "'";
 	}

 	if (!kind.empty()) {
-		sql << " WHERE kind = '" << kind << "'";
+		sql << " AND kind = '" << kind << "'";
 	}

-	sql << " ORDER BY kind, key ASC LIMIT " << limit << " OFFSET " << offset;
+	sql << " ORDER BY schema, kind , key ASC LIMIT " << limit << " OFFSET " << offset;

 	// Get total count
 	std::ostringstream count_sql;
-	count_sql << "SELECT COUNT(*) FROM catalog";
+	count_sql << "SELECT COUNT(*) FROM catalog WHERE 1=1";
 	if (!schema.empty()) {
 		count_sql << " AND schema = '" << schema << "'";
 	}
 	if (!kind.empty()) {
-		count_sql << " WHERE kind = '" << kind << "'";
+		count_sql << " AND kind = '" << kind << "'";
 	}

⚠️ Potential issue | 🔴 Critical

SQL injection vulnerability in list() - same issue as search().

The list() method has the same SQL injection vulnerability, directly concatenating schema and kind parameters into SQL strings at lines 373, 377, 386, and 389.

Use prepared statements with parameter binding as done in upsert() and get().

🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 370 - 390, The list() method currently
concatenates user inputs into the SQL strings (variables sql and count_sql),
causing SQL injection; change it to use prepared statements and parameter
binding like upsert() and get(): build the base query with placeholders (e.g.,
"WHERE 1=1" plus "AND schema = ?"/"AND kind = ?"), prepare the statement via
your MySQL API, bind schema and kind only when non-empty, and bind limit/offset
as parameters before executing both the main query and the COUNT query (use the
same conditional placeholder logic for count_sql). Ensure you replace the direct
string concatenation at the spots that currently append schema/kind and the
final "LIMIT/OFFSET" with proper placeholders and binding.
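The conditional filters can still be assembled incrementally while keeping user input out of the SQL text. A minimal Python sketch of the placeholder approach the comment recommends (function and column names mirror the snippet above but are illustrative, not the actual API):

```python
def build_list_query(schema: str, kind: str, limit: int, offset: int):
    """Build the catalog list query with placeholders instead of concatenation."""
    sql = "SELECT schema, kind, key, document, tags, links FROM catalog WHERE 1=1"
    params = []
    if schema:
        sql += " AND schema = ?"
        params.append(schema)
    if kind:
        sql += " AND kind = ?"
        params.append(kind)
    # LIMIT/OFFSET are bound too, so no integer formatting into the string
    sql += " ORDER BY schema, kind, key ASC LIMIT ? OFFSET ?"
    params.extend([limit, offset])
    return sql, params
```

The same conditional logic applies to the COUNT query; only the SELECT list changes.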

Comment on lines 505 to +511
 	std::ostringstream sql;
-	sql << "DELETE FROM catalog WHERE kind = '" << kind << "' AND key = '" << key << "'";
+	sql << "DELETE FROM catalog WHERE 1=1";
+
+	if (!schema.empty()) {
+		sql << " AND schema = '" << schema << "'";
+	}
+	sql << " AND kind = '" << kind << "' AND key = '" << key << "'";

⚠️ Potential issue | 🔴 Critical

SQL injection vulnerability in remove() - user input in DELETE statement.

The remove() method concatenates schema, kind, and key directly into a DELETE SQL statement. This is especially dangerous as it could allow attackers to delete arbitrary catalog entries.

Proposed fix using prepared statements
 int MySQL_Catalog::remove(
 	const std::string& schema,
 	const std::string& kind,
 	const std::string& key
 ) {
-	std::ostringstream sql;
-	sql << "DELETE FROM catalog WHERE 1=1";
-
-	if (!schema.empty()) {
-		sql << " AND schema = '" << schema << "'";
-	}
-	sql << " AND kind = '" << kind << "' AND key = '" << key << "'";
-
-	if (!db->execute(sql.str().c_str())) {
+	sqlite3_stmt* stmt = NULL;
+	const char* delete_sql = schema.empty()
+		? "DELETE FROM catalog WHERE kind = ?1 AND key = ?2"
+		: "DELETE FROM catalog WHERE schema = ?1 AND kind = ?2 AND key = ?3";
+
+	int rc = db->prepare_v2(delete_sql, &stmt);
+	if (rc != SQLITE_OK) {
+		proxy_error("Failed to prepare catalog remove: %d\n", rc);
+		return -1;
+	}
+
+	if (schema.empty()) {
+		(*proxy_sqlite3_bind_text)(stmt, 1, kind.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 2, key.c_str(), -1, SQLITE_TRANSIENT);
+	} else {
+		(*proxy_sqlite3_bind_text)(stmt, 1, schema.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 2, kind.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 3, key.c_str(), -1, SQLITE_TRANSIENT);
+	}
+
+	SAFE_SQLITE3_STEP2(stmt);
+	(*proxy_sqlite3_finalize)(stmt);
+
+	if (rc != SQLITE_DONE && rc != SQLITE_OK) {
 		proxy_error("Catalog remove error\n");
 		return -1;
 	}
 
 	return 0;
 }
🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 505 - 511, In the remove() method, avoid
concatenating user inputs (schema, kind, key) into the SQL string; instead build
a parameterized DELETE and use a prepared statement to bind values for schema
(when non-empty), kind and key before executing. Replace the ostringstream
construction in MySQL_Catalog::remove() with a DELETE query that uses
placeholders for schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND
schema = ?" when schema provided, then always "AND kind = ? AND key = ?"),
prepare that statement via your MySQL client API, bind schema (conditionally),
bind kind and key, execute, and propagate/log any execution errors.
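To see why this is rated critical: a crafted key widens the DELETE's WHERE clause under concatenation, while binding treats the same bytes as a literal value. A minimal Python sketch (stdlib sqlite3; table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (kind TEXT, key TEXT)")
seed = [("table", "users"), ("table", "orders"), ("index", "idx1")]
conn.executemany("INSERT INTO catalog VALUES (?, ?)", seed)

# Concatenation, as in the flagged remove(): the quote breaks out of the
# string literal and the OR clause matches every row.
hostile_key = "users' OR '1'='1"
conn.execute("DELETE FROM catalog WHERE kind = 'table' AND key = '"
             + hostile_key + "'")
left_after_injection = conn.execute(
    "SELECT COUNT(*) FROM catalog").fetchone()[0]

# Parameter binding: the same string matches no key, nothing is deleted.
conn.executemany("INSERT INTO catalog VALUES (?, ?)", seed)
conn.execute("DELETE FROM catalog WHERE kind = ? AND key = ?",
             ("table", hostile_key))
left_after_binding = conn.execute(
    "SELECT COUNT(*) FROM catalog").fetchone()[0]
```

The concatenated DELETE empties the table; the bound DELETE leaves all three seeded rows in place.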

Comment on lines +2581 to +2627
void ProxySQL_Admin::stats___mcp_query_digest(bool reset) {
if (!GloMCPH) return;
Query_Tool_Handler* qth = GloMCPH->query_tool_handler;
if (!qth) return;

// Get the discovery schema catalog
Discovery_Schema* catalog = qth->get_catalog();
if (!catalog) return;

// Get the stats from the catalog (includes reset logic)
SQLite3_result* resultset = catalog->get_mcp_query_digest(reset);
if (!resultset) return;

statsdb->execute("BEGIN");

if (reset) {
statsdb->execute("DELETE FROM stats_mcp_query_digest_reset");
} else {
statsdb->execute("DELETE FROM stats_mcp_query_digest");
}

// Insert digest statistics into the stats table
// Columns: tool_name, run_id, digest, digest_text, count_star,
// first_seen, last_seen, sum_time, min_time, max_time
char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
SQLite3_row* r = *it;
int arg_len = 0;
for (int i = 0; i < 10; i++) {
arg_len += strlen(r->fields[i]);
}
char* query = (char*)malloc(strlen(a) + arg_len + 32);
sprintf(query, a,
r->fields[0], // tool_name
r->fields[1], // run_id
r->fields[2], // digest
r->fields[3], // digest_text
r->fields[4], // count_star
r->fields[5], // first_seen
r->fields[6], // last_seen
r->fields[7], // sum_time
r->fields[8], // min_time
r->fields[9] // max_time
);
statsdb->execute(query);
free(query);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -c 500000 lib/ProxySQL_Admin_Stats.cpp | wc -l

Repository: ProxySQL/proxysql-vec

Length of output: 69


🏁 Script executed:

sed -n '2580,2630p' lib/ProxySQL_Admin_Stats.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 1741


🏁 Script executed:

rg "stats_mcp_query_digest" --type cpp -A 3 -B 3 | head -100

Repository: ProxySQL/proxysql-vec

Length of output: 6780


Fix table name in INSERT statement and use prepared statements for NULL-safe binding.

Line 2605 always inserts into stats_mcp_query_digest even when reset == true, writing reset data to the live table instead of the reset table. Additionally, the strlen-based approach crashes if run_id is NULL (the schema allows it) and is vulnerable to SQL injection from unescaped quotes in digest text.

Replace the hardcoded INSERT with dynamic table selection and prepared statement binding:

Proposed fix
-	char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		int arg_len = 0;
-		for (int i = 0; i < 10; i++) {
-			arg_len += strlen(r->fields[i]);
-		}
-		char* query = (char*)malloc(strlen(a) + arg_len + 32);
-		sprintf(query, a,
-			r->fields[0],  // tool_name
-			r->fields[1],  // run_id
-			r->fields[2],  // digest
-			r->fields[3],  // digest_text
-			r->fields[4],  // count_star
-			r->fields[5],  // first_seen
-			r->fields[6],  // last_seen
-			r->fields[7],  // sum_time
-			r->fields[8],  // min_time
-			r->fields[9]   // max_time
-		);
-		statsdb->execute(query);
-		free(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_digest_reset"
+		: "stats_mcp_query_digest";
+	const std::string sql =
+		std::string("INSERT INTO ") + table +
+		" VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10)";
+	int rc = 0;
+	stmt_unique_ptr u_stmt { nullptr };
+	std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+	ASSERT_SQLITE_OK(rc, statsdb);
+	sqlite3_stmt* const stmt = u_stmt.get();
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		if (r->fields[1]) {
+			rc = (*proxy_sqlite3_bind_int64)(stmt, 2, atoll(r->fields[1])); ASSERT_SQLITE_OK(rc, statsdb);
+		} else {
+			rc = (*proxy_sqlite3_bind_null)(stmt, 2); ASSERT_SQLITE_OK(rc, statsdb);
+		}
+		rc = (*proxy_sqlite3_bind_text)(stmt, 3, r->fields[2], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 4, r->fields[3], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 9, atoll(r->fields[8])); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 10, atoll(r->fields[9])); ASSERT_SQLITE_OK(rc, statsdb);
+		SAFE_SQLITE3_STEP2(stmt);
+		rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+		rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+	}
🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin_Stats.cpp` around lines 2581 - 2627, The INSERT always
targets stats_mcp_query_digest and uses strlen/risky sprintf, causing reset data
to go to the live table, crashes on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.
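The suggested prepare-once/bind-per-row loop has the same shape as Python's executemany(), which also handles NULL fields and embedded quotes without any manual escaping. A minimal sketch with a reduced column set (names reused from the stats table, values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_mcp_query_digest ("
             "tool_name TEXT, run_id INTEGER, digest TEXT, count_star INTEGER)")

rows = [
    ("query_tool", 1, "0xAB12", 10),
    ("query_tool", None, 'SELECT "quoted" FROM t', 3),  # NULL run_id, quotes
]
# executemany() prepares the statement once and rebinds per row, like the
# proposed bind/step/reset loop; None maps to SQL NULL, no strlen crash,
# and the digest text needs no escaping.
conn.executemany("INSERT INTO stats_mcp_query_digest VALUES (?, ?, ?, ?)", rows)

null_runs = conn.execute(
    "SELECT COUNT(*) FROM stats_mcp_query_digest WHERE run_id IS NULL"
).fetchone()[0]
total = conn.execute(
    "SELECT COUNT(*) FROM stats_mcp_query_digest").fetchone()[0]
```

Both rows land, including the one with a NULL run_id and a quote-laden digest text.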

Comment on lines +1354 to +1357
if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
{ stats_mcp_query_tools_counters=true; refresh=true; }
if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
{ stats_mcp_query_tools_counters_reset=true; refresh=true; }

⚠️ Potential issue | 🟡 Minor

Avoid double-refresh when querying stats_mcp_query_tools_counters_reset.
The reset token contains the non-reset token, so both flags become true and both refresh paths run. Prefer a reset-first else if like other reset paths to avoid extra work and unintended table updates.

🛠️ Suggested fix
-	if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
-		{ stats_mcp_query_tools_counters=true; refresh=true; }
-	if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
-		{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+	if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
+		{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+	else if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
+		{ stats_mcp_query_tools_counters=true; refresh=true; }
🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin.cpp` around lines 1354 - 1357, The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.

Comment on lines +235 to +319
int Query_Tool_Handler::init_connection_pool() {
// Parse hosts
std::vector<std::string> host_list;
std::istringstream h(mysql_hosts);
std::string host;
while (std::getline(h, host, ',')) {
host.erase(0, host.find_first_not_of(" \t"));
host.erase(host.find_last_not_of(" \t") + 1);
if (!host.empty()) {
host_list.push_back(host);
}
}

// Parse ports
std::vector<int> port_list;
std::istringstream p(mysql_ports);
std::string port;
while (std::getline(p, port, ',')) {
port.erase(0, port.find_first_not_of(" \t"));
port.erase(port.find_last_not_of(" \t") + 1);
if (!port.empty()) {
port_list.push_back(atoi(port.c_str()));
}
}

// Ensure ports array matches hosts array size
while (port_list.size() < host_list.size()) {
port_list.push_back(3306);
}

if (host_list.empty()) {
proxy_error("Query_Tool_Handler: No hosts configured\n");
return -1;
}

pthread_mutex_lock(&pool_lock);

for (size_t i = 0; i < host_list.size(); i++) {
MySQLConnection conn;
conn.host = host_list[i];
conn.port = port_list[i];
conn.in_use = false;

MYSQL* mysql = mysql_init(NULL);
if (!mysql) {
proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
conn.host.c_str(), conn.port);
pthread_mutex_unlock(&pool_lock);
return -1;
}

unsigned int timeout = 5;
mysql_options(mysql, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);
mysql_options(mysql, MYSQL_OPT_READ_TIMEOUT, &timeout);
mysql_options(mysql, MYSQL_OPT_WRITE_TIMEOUT, &timeout);

if (!mysql_real_connect(
mysql,
conn.host.c_str(),
mysql_user.c_str(),
mysql_password.c_str(),
mysql_schema.empty() ? NULL : mysql_schema.c_str(),
conn.port,
NULL,
CLIENT_MULTI_STATEMENTS
)) {
proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
conn.host.c_str(), conn.port, mysql_error(mysql));
mysql_close(mysql);
pthread_mutex_unlock(&pool_lock);
return -1;
}

conn.mysql = mysql;
connection_pool.push_back(conn);
pool_size++;

proxy_info("Query_Tool_Handler: Connected to %s:%d\n",
conn.host.c_str(), conn.port);
}

pthread_mutex_unlock(&pool_lock);
proxy_info("Query_Tool_Handler: Connection pool initialized with %d connection(s)\n", pool_size);
return 0;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the header file for Query_Tool_Handler
fd -e "h" -e "hpp" | xargs grep -l "class Query_Tool_Handler" 2>/dev/null

Repository: ProxySQL/proxysql-vec

Length of output: 114


🏁 Script executed:

# Also check the cpp file for class definition inline
head -100 lib/Query_Tool_Handler.cpp | grep -A 50 "class Query_Tool_Handler"

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Find all method definitions in the Query_Tool_Handler.cpp file
rg "^\s*(public|private|protected):" lib/Query_Tool_Handler.cpp -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Search for connection_pool usages
rg "connection_pool" lib/Query_Tool_Handler.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 1123


🏁 Script executed:

# Check for constructor and initialization of connection_pool
rg "Query_Tool_Handler::|connection_pool.*=" lib/Query_Tool_Handler.cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 1169


🏁 Script executed:

# Read the Query_Tool_Handler header file to see class structure
cat -n include/Query_Tool_Handler.h

Repository: ProxySQL/proxysql-vec

Length of output: 7390


🏁 Script executed:

# Read the destructor and close() method implementations
rg "Query_Tool_Handler::~Query_Tool_Handler|void Query_Tool_Handler::close" lib/Query_Tool_Handler.cpp -A 30

Repository: ProxySQL/proxysql-vec

Length of output: 1464


🏁 Script executed:

# Get the full init() method
rg "int Query_Tool_Handler::init()" lib/Query_Tool_Handler.cpp -A 50

Repository: ProxySQL/proxysql-vec

Length of output: 2255


🏁 Script executed:

# Search for all calls to init_connection_pool
rg "init_connection_pool" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 144


🏁 Script executed:

# Check the constructor implementation
rg "Query_Tool_Handler::Query_Tool_Handler" lib/Query_Tool_Handler.cpp -A 30

Repository: ProxySQL/proxysql-vec

Length of output: 857


🏁 Script executed:

# Check if there are any guards or flags in init() to prevent calling it multiple times
rg "init()" lib/Query_Tool_Handler.cpp -B 5 -A 15 | head -80

Repository: ProxySQL/proxysql-vec

Length of output: 2084


🏁 Script executed:

# Check the full init_connection_pool signature and look at early error returns
sed -n '235,319p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 2310


🏁 Script executed:

# Check if there's any state management or guards to prevent multiple init calls
rg "connection_pool.empty()|pool_size" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 211


🏁 Script executed:

# Check get_connection() to see if it guards against partially initialized pool
rg "void\* Query_Tool_Handler::get_connection" lib/Query_Tool_Handler.cpp -A 20

Repository: ProxySQL/proxysql-vec

Length of output: 530


🏁 Script executed:

# Check if init() is ever called more than once, or if there's any guard against it
rg "\.init\(\)" lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Also check the caller context - how is Query_Tool_Handler created and used
rg "Query_Tool_Handler" lib/Query_Tool_Handler.cpp | grep -E "(new|delete|init\(\)|close\(\))" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 132


🏁 Script executed:

# Check if there are any other places in the codebase that use Query_Tool_Handler
# to understand the usage pattern
find . -name "*.cpp" -o -name "*.h" | xargs grep -l "Query_Tool_Handler" | grep -v Query_Tool_Handler.cpp | grep -v Query_Tool_Handler.h

Repository: ProxySQL/proxysql-vec

Length of output: 194


🏁 Script executed:

# Look at one of those files to see how Query_Tool_Handler is used
rg "Query_Tool_Handler" include/MCP_Thread.h -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 508


🏁 Script executed:

# Check MCP_Thread.cpp to see how Query_Tool_Handler is initialized and used
rg "query_tool_handler" lib/MCP_Thread.cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 361


🏁 Script executed:

# Check ProxySQL_MCP_Server.cpp for initialization patterns
rg "Query_Tool_Handler|init\(\)|close\(\)" lib/ProxySQL_MCP_Server.cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 494


🏁 Script executed:

# Verify that the early return (no hosts) happens before acquiring lock
sed -n '280,290p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 430


🏁 Script executed:

# Get exact line numbers around pool_lock acquisition and early returns
sed -n '270,285p' lib/Query_Tool_Handler.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 461


Clean up partially initialized pools on connection failure.

If a later host fails to connect, earlier MYSQL* handles remain in connection_pool while the function returns -1, leaving the object in an inconsistent state. Although the destructor eventually cleans up via close(), the intermediate state violates the expected postcondition of the function (either fully initialized or empty).

Add a helper lambda to clean up and reset the pool before returning on connection failures:

🧹 Suggested fix
 int Query_Tool_Handler::init_connection_pool() {
+	// Helper to cleanup partially initialized pool (pool_lock must be held)
+	auto cleanup_pool_locked = [&]() {
+		for (auto& c : connection_pool) {
+			if (c.mysql) {
+				mysql_close(static_cast<MYSQL*>(c.mysql));
+				c.mysql = NULL;
+			}
+		}
+		connection_pool.clear();
+		pool_size = 0;
+	};
+
 	// Parse hosts
 	std::vector<std::string> host_list;
@@
 		MYSQL* mysql = mysql_init(NULL);
 		if (!mysql) {
 			proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
 				conn.host.c_str(), conn.port);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
@@
 		)) {
 			proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
 				conn.host.c_str(), conn.port, mysql_error(mysql));
 			mysql_close(mysql);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
🤖 Prompt for AI Agents
In `@lib/Query_Tool_Handler.cpp` around lines 235 - 319, The init_connection_pool
function must clean up any partially initialized connections on error: add a
helper (lambda or private method) inside init_connection_pool that iterates
connection_pool, calls mysql_close on each entry's MYSQL* (and sets conn.mysql =
nullptr), clears connection_pool, resets pool_size to 0, and unlocks pool_lock
if held; call this cleanup helper before every early return that occurs after
some successful connections (e.g., after mysql_init/mysql_real_connect failures)
so the function either returns with a fully initialized pool or an empty,
consistent pool; update all return paths in init_connection_pool to invoke this
helper and then return -1.
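The all-or-nothing postcondition the comment asks for can be sketched in a few lines; here connect is a stand-in for mysql_init/mysql_real_connect and the endpoint list for the parsed host/port pairs:

```python
def init_pool(endpoints, connect):
    """Connect to every endpoint, or close whatever was opened and fail.

    On any failure the pool is left empty, never partially initialized.
    """
    pool = []
    for ep in endpoints:
        try:
            pool.append(connect(ep))
        except Exception:
            # Roll back the partial pool before propagating the error.
            for conn in pool:
                conn.close()
            pool.clear()
            raise
    return pool
```

Either the caller gets a pool with one connection per endpoint, or an exception and no dangling handles, which matches the "fully initialized or empty" postcondition described above.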

Comment on lines +236 to +258
int Static_Harvester::execute_query(const std::string& query, std::vector<std::vector<std::string>>& results) {
pthread_mutex_lock(&conn_lock);

if (!mysql_conn) {
pthread_mutex_unlock(&conn_lock);
proxy_error("Static_Harvester: Not connected to MySQL\n");
return -1;
}

proxy_debug(PROXY_DEBUG_GENERIC, 3, "Static_Harvester: Executing query: %s\n", query.c_str());

if (mysql_query(mysql_conn, query.c_str())) {
proxy_error("Static_Harvester: Query failed: %s\n", mysql_error(mysql_conn));
pthread_mutex_unlock(&conn_lock);
return -1;
}

MYSQL_RES* res = mysql_store_result(mysql_conn);
if (!res) {
// No result set (e.g., INSERT/UPDATE)
pthread_mutex_unlock(&conn_lock);
return 0;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "Static_Harvester.cpp" | head -5

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

wc -l lib/Static_Harvester.cpp 2>/dev/null || find . -name "Static_Harvester.cpp" -exec wc -l {} \;

Repository: ProxySQL/proxysql-vec

Length of output: 94


🏁 Script executed:

cat -n lib/Static_Harvester.cpp | sed -n '230,265p'

Repository: ProxySQL/proxysql-vec

Length of output: 1366


🌐 Web query:

MySQL C API mysql_store_result NULL error handling documentation

💡 Result:

Short summary — how to handle mysql_store_result() returning NULL:

  • mysql_store_result() returns NULL either when the statement produced no result set (e.g., INSERT/UPDATE/DELETE) or when an error occurred while reading the result. [1][2]
  • To tell which case it is:
    • If mysql_field_count(mysql) == 0 then the statement does not produce a result set. [1][2]
    • Otherwise check mysql_errno(mysql) != 0 or mysql_error(mysql) (non-empty) to detect an error. [1][2]
  • Note: mysql_store_result() resets mysql_error()/mysql_errno() on success; when it returns NULL an allocation/connection/read error (CR_OUT_OF_MEMORY, CR_SERVER_GONE_ERROR, CR_SERVER_LOST, CR_COMMANDS_OUT_OF_SYNC, etc.) may be indicated by mysql_errno(). [1][3]
  • See the MySQL manual’s dedicated section “NULL mysql_store_result() Return After mysql_query() Success” for examples and details. [3]

Sources:
[1] MySQL C API — mysql_store_result() documentation.
[2] MySQL C API Developer Guide (mysql_store_result) — dev.mysql.com.
[3] MySQL C API guide — “NULL mysql_store_result() Return After mysql_query() Success.”


Distinguish "no result set" from query error in mysql_store_result() handling.

mysql_store_result() returns NULL for both successful queries with no result set (INSERT/UPDATE) and for errors (memory allocation, connection loss, read failures). The current code treats all NULL cases as success, silently masking database errors.

Per MySQL C API documentation, distinguish these cases by checking mysql_field_count() to detect whether the query produced a result set, then check mysql_errno() if no result set to confirm it was not an error.

Suggested approach
 	MYSQL_RES* res = mysql_store_result(mysql_conn);
 	if (!res) {
-		// No result set (e.g., INSERT/UPDATE)
-		pthread_mutex_unlock(&conn_lock);
-		return 0;
+		// Distinguish: no result set vs. error
+		if (mysql_field_count(mysql_conn) != 0) {
+			proxy_error("Static_Harvester: Query failed after store_result: %s\n",
+				mysql_error(mysql_conn));
+			pthread_mutex_unlock(&conn_lock);
+			return -1;
+		}
+		pthread_mutex_unlock(&conn_lock);
+		return 0;
 	}
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 236-258: in Static_Harvester::execute_query, a NULL from mysql_store_result() is being treated as a
successful "no result set" case; instead check mysql_field_count(mysql_conn) to
see if the query should have a result set—if mysql_field_count() == 0 treat it
as a successful non-result (unlock conn_lock and return 0), otherwise treat NULL
as an error by calling mysql_errno(mysql_conn) and logging
mysql_error(mysql_conn), unlock conn_lock and return -1; update the error log to
include mysql_error(mysql_conn) and ensure conn_lock is always released before
returning.

Comment on lines +388 to +662
// Fetch schema/database metadata from information_schema.SCHEMATA.
//
// Queries MySQL for all schemas (databases) and their character set
// and collation information.
//
// Parameters:
// filter - Optional schema name filter (empty for all schemas)
//
// Returns:
// Vector of SchemaRow structures containing schema metadata
std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
std::vector<SchemaRow> schemas;

std::ostringstream sql;
sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
<< "FROM information_schema.SCHEMATA";

if (!filter.empty()) {
sql << " WHERE SCHEMA_NAME = '" << filter << "'";
}

sql << " ORDER BY SCHEMA_NAME;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
SchemaRow s;
s.schema_name = row[0];
s.charset = row[1];
s.collation = row[2];
schemas.push_back(s);
}
}

return schemas;
}

// ============================================================
// Harvest Stage Methods
// ============================================================

// Harvest schemas/databases to the catalog.
//
// Fetches schemas from information_schema.SCHEMATA and inserts them
// into the catalog. System schemas (mysql, information_schema,
// performance_schema, sys) are skipped.
//
// Parameters:
// only_schema - Optional filter to harvest only one schema
//
// Returns:
// Number of schemas harvested, or -1 on error
//
// Notes:
// - Requires an active run (start_run must be called first)
// - Skips system schemas automatically
int Static_Harvester::harvest_schemas(const std::string& only_schema) {
if (current_run_id < 0) {
proxy_error("Static_Harvester: No active run\n");
return -1;
}

std::vector<SchemaRow> schemas = fetch_schemas(only_schema);
int count = 0;

for (const auto& s : schemas) {
// Skip system schemas
if (s.schema_name == "mysql" || s.schema_name == "information_schema" ||
s.schema_name == "performance_schema" || s.schema_name == "sys") {
continue;
}

if (catalog->insert_schema(current_run_id, s.schema_name, s.charset, s.collation) >= 0) {
count++;
}
}

proxy_info("Static_Harvester: Harvested %d schemas\n", count);
return count;
}

// Fetch table and view metadata from information_schema.TABLES.
//
// Queries MySQL for all tables and views with their physical
// characteristics (rows, size, engine, timestamps).
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of ObjectRow structures containing table/view metadata
std::vector<Static_Harvester::ObjectRow> Static_Harvester::fetch_tables_views(const std::string& filter) {
std::vector<ObjectRow> objects;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, TABLE_ROWS, "
<< "DATA_LENGTH, INDEX_LENGTH, CREATE_TIME, UPDATE_TIME, TABLE_COMMENT "
<< "FROM information_schema.TABLES "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
ObjectRow o;
o.schema_name = row[0];
o.object_name = row[1];
o.object_type = (row[2] == "VIEW") ? "view" : "table";
o.engine = row[3];
o.table_rows_est = row[4].empty() ? 0 : atol(row[4].c_str());
o.data_length = row[5].empty() ? 0 : atol(row[5].c_str());
o.index_length = row[6].empty() ? 0 : atol(row[6].c_str());
o.create_time = row[7];
o.update_time = row[8];
o.object_comment = row[9];
objects.push_back(o);
}
}

return objects;
}

// Fetch column metadata from information_schema.COLUMNS.
//
// Queries MySQL for all columns with their data types, nullability,
// defaults, character set, and comments.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of ColumnRow structures containing column metadata
std::vector<Static_Harvester::ColumnRow> Static_Harvester::fetch_columns(const std::string& filter) {
std::vector<ColumnRow> columns;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION, COLUMN_NAME, "
<< "DATA_TYPE, COLUMN_TYPE, IS_NULLABLE, COLUMN_DEFAULT, EXTRA, "
<< "CHARACTER_SET_NAME, COLLATION_NAME, COLUMN_COMMENT "
<< "FROM information_schema.COLUMNS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
ColumnRow c;
c.schema_name = row[0];
c.object_name = row[1];
c.ordinal_pos = atoi(row[2].c_str());
c.column_name = row[3];
c.data_type = row[4];
c.column_type = row[5];
c.is_nullable = (row[6] == "YES") ? 1 : 0;
c.column_default = row[7];
c.extra = row[8];
c.charset = row[9];
c.collation = row[10];
c.column_comment = row[11];
columns.push_back(c);
}
}

return columns;
}

// Fetch index metadata from information_schema.STATISTICS.
//
// Queries MySQL for all indexes with their columns, sequence,
// uniqueness, cardinality, and collation.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of IndexRow structures containing index metadata
std::vector<Static_Harvester::IndexRow> Static_Harvester::fetch_indexes(const std::string& filter) {
std::vector<IndexRow> indexes;

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, NON_UNIQUE, INDEX_TYPE, "
<< "SEQ_IN_INDEX, COLUMN_NAME, SUB_PART, COLLATION, CARDINALITY "
<< "FROM information_schema.STATISTICS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND TABLE_SCHEMA = '" << filter << "'";
}

sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
IndexRow i;
i.schema_name = row[0];
i.object_name = row[1];
i.index_name = row[2];
i.is_unique = (row[3] == "0") ? 1 : 0; // NON_UNIQUE = 0 means the index is unique
i.index_type = row[4];
i.seq_in_index = atoi(row[5].c_str());
i.column_name = row[6];
i.sub_part = row[7].empty() ? 0 : atoi(row[7].c_str());
i.collation = row[8];
i.cardinality = row[9].empty() ? 0 : atol(row[9].c_str());
indexes.push_back(i);
}
}

return indexes;
}

// Fetch foreign key metadata from information_schema.
//
// Queries KEY_COLUMN_USAGE and REFERENTIAL_CONSTRAINTS to get
// foreign key relationships including child/parent tables and columns,
// and ON UPDATE/DELETE rules.
//
// Parameters:
// filter - Optional schema name filter
//
// Returns:
// Vector of FKRow structures containing foreign key metadata
std::vector<Static_Harvester::FKRow> Static_Harvester::fetch_foreign_keys(const std::string& filter) {
std::vector<FKRow> fks;

std::ostringstream sql;
sql << "SELECT kcu.CONSTRAINT_SCHEMA AS child_schema, "
<< "kcu.TABLE_NAME AS child_table, kcu.CONSTRAINT_NAME AS fk_name, "
<< "kcu.COLUMN_NAME AS child_column, kcu.REFERENCED_TABLE_SCHEMA AS parent_schema, "
<< "kcu.REFERENCED_TABLE_NAME AS parent_table, kcu.REFERENCED_COLUMN_NAME AS parent_column, "
<< "kcu.ORDINAL_POSITION AS seq, rc.UPDATE_RULE AS on_update, rc.DELETE_RULE AS on_delete "
<< "FROM information_schema.KEY_COLUMN_USAGE kcu "
<< "JOIN information_schema.REFERENTIAL_CONSTRAINTS rc "
<< " ON rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA "
<< " AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME "
<< "WHERE kcu.TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!filter.empty()) {
sql << " AND kcu.TABLE_SCHEMA = '" << filter << "'";
}

sql << " AND kcu.REFERENCED_TABLE_NAME IS NOT NULL "
<< "ORDER BY child_schema, child_table, fk_name, seq;";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) == 0) {
for (const auto& row : results) {
FKRow fk;
fk.child_schema = row[0];
fk.child_table = row[1];
fk.fk_name = row[2];
fk.child_column = row[3];
fk.parent_schema = row[4];
fk.parent_table = row[5];
fk.parent_column = row[6];
fk.seq = atoi(row[7].c_str());
fk.on_update = row[8];
fk.on_delete = row[9];
fks.push_back(fk);
}
}

return fks;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for calls to the fetch_* methods to understand where filter comes from
rg "fetch_schemas|fetch_tables_views|fetch_columns|fetch_indexes|fetch_foreign_keys" --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 5678


🏁 Script executed:

# Check if there's a header file that might show the method signatures and their usage context
fd "\.h$" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 630


🏁 Script executed:

# Look for where harvest_schemas is called to trace the data flow
rg "harvest_schemas" --type cpp -B 5 -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 1911


🏁 Script executed:

# Search for calls to run_full_harvest
rg "run_full_harvest" --type cpp -B 3 -A 3

Repository: ProxySQL/proxysql-vec

Length of output: 1870


🏁 Script executed:

# Also look for any input validation or escaping patterns used elsewhere in the codebase
rg "mysql_escape|mysql_real_escape|prepared.*statement|parameterized" --type cpp | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 2187


🏁 Script executed:

# Look at the Query_Tool_Handler.cpp context more thoroughly
rg "schema_filter|json_string" lib/Query_Tool_Handler.cpp -B 5 -A 10 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 1929


🏁 Script executed:

# Check execute_query implementation
rg "execute_query" lib/Static_Harvester.cpp -A 15 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 1217


🏁 Script executed:

# Check if execute_query can execute multiple statements
ast-grep --pattern 'int Static_Harvester::execute_query($_, $_) { $$$ }'

Repository: ProxySQL/proxysql-vec

Length of output: 47


Add input validation for the filter parameter to prevent SQL injection in the fetch_* methods.

The filter parameter originates from user input via the MCP tool interface (the discovery.run_static tool) and is embedded directly into SQL queries without escaping. This allows SQL injection via payloads such as `test' OR '1'='1`. Implement whitelist validation that accepts only alphanumeric characters and underscores, which is appropriate for MySQL schema identifiers.

Apply the validation across all five fetch methods:

  • fetch_schemas()
  • fetch_tables_views()
  • fetch_columns()
  • fetch_indexes()
  • fetch_foreign_keys()
Example fix
+static bool is_safe_identifier(const std::string& s) {
+	return !s.empty() &&
+	       std::all_of(s.begin(), s.end(),
+	           [](unsigned char c){ return std::isalnum(c) || c == '_'; });
+}
+
 std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
 	std::vector<SchemaRow> schemas;
 
 	std::ostringstream sql;
 	sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
 	    << "FROM information_schema.SCHEMATA";
 
 	if (!filter.empty()) {
+		if (!is_safe_identifier(filter)) {
+			proxy_error("Static_Harvester: Invalid schema filter '%s'\n", filter.c_str());
+			return schemas;
+		}
 		sql << " WHERE SCHEMA_NAME = '" << filter << "'";
 	}
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 388-662: all five fetch_* methods (fetch_schemas, fetch_tables_views, fetch_columns, fetch_indexes,
fetch_foreign_keys) currently interpolate the user-provided filter directly into
SQL and need whitelist validation to prevent SQL injection: add a small helper
(e.g., is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.

Comment on lines +1014 to +1055
if (current_run_id < 0) {
proxy_error("Static_Harvester: No active run\n");
return -1;
}

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, VIEW_DEFINITION "
<< "FROM information_schema.VIEWS "
<< "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!only_schema.empty()) {
sql << " AND TABLE_SCHEMA = '" << only_schema << "'";
}

sql << ";";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) != 0) {
return -1;
}

int count = 0;
for (const auto& row : results) {
std::string schema_name = row[0];
std::string view_name = row[1];
std::string view_def = row[2];

// Update object with definition
char* error = NULL;
int cols = 0, affected = 0;
std::ostringstream update_sql;
update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
<< "WHERE run_id = " << current_run_id
<< " AND schema_name = '" << schema_name << "'"
<< " AND object_name = '" << view_name << "'"
<< " AND object_type = 'view';";

catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
if (affected > 0) {
count++;
}
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and examine the file structure
find . -name "Static_Harvester.cpp" -o -name "Static_Harvester.h" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 186


🏁 Script executed:

# Search for SQL escaping utilities or helper functions
rg -i "escape|quote" --type cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 3528


🏁 Script executed:

# Check the database type being used and how other queries handle input
rg "execute_query|execute_statement" --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 4030


🏁 Script executed:

# Find Static_Harvester.cpp
find . -name "Static_Harvester.cpp" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

# Examine the Static_Harvester.cpp file around the problematic lines
find . -name "Static_Harvester.cpp" -type f -exec wc -l {} \;

Repository: ProxySQL/proxysql-vec

Length of output: 96


🏁 Script executed:

# Search for database-related utilities and escape functions in lib/
find ./lib -name "*.cpp" -o -name "*.h" | head -20

Repository: ProxySQL/proxysql-vec

Length of output: 589


🏁 Script executed:

# Examine the problematic section in Static_Harvester.cpp
sed -n '1000,1060p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 2215


🏁 Script executed:

# Search for the execute_query function definition in Static_Harvester.cpp
rg "execute_query" ./lib/Static_Harvester.cpp -B 5 -A 10

Repository: ProxySQL/proxysql-vec

Length of output: 3632


🏁 Script executed:

# Look at the class definition and database connection context
sed -n '1,100p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 4193


🏁 Script executed:

# Examine the catalog database interface - check Discovery_Schema and get_db()
rg "class Discovery_Schema" -A 20 ./lib --type cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Search for SQL escape or quote handling in the codebase
rg "definition_sql|UPDATE.*SET" ./lib/Static_Harvester.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 1174


🏁 Script executed:

# Check what database interface is being used for catalog
rg "execute_statement" ./lib --type cpp -B 1 | head -40

Repository: ProxySQL/proxysql-vec

Length of output: 2021


🏁 Script executed:

# Look for prepared statement usage or escaping utilities in Discovery_Schema or nearby
find ./lib -name "*Discovery*" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 91


🏁 Script executed:

# Examine Discovery_Schema to understand the database interface
head -100 ./lib/Discovery_Schema.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 3317


🏁 Script executed:

# Check the get_db() method and what it returns
rg "get_db()" ./lib/Discovery_Schema.cpp -B 2 -A 5

Repository: ProxySQL/proxysql-vec

Length of output: 2641


🏁 Script executed:

# Look for SQLite3 interface - check if execute_statement supports prepared statements
rg "execute_statement" ./lib/SQLite3_Server.cpp -B 1 -A 3 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 129


🏁 Script executed:

# Check if there's a C++ wrapper or any escaping utilities available
rg "escape|sqlite3_bind|sqlite3_prepare" ./lib --type cpp | head -30

Repository: ProxySQL/proxysql-vec

Length of output: 2982


🏁 Script executed:

# Check the escape_string_single_quotes function implementation
rg "escape_string_single_quotes" ./lib/gen_utils.cpp -B 2 -A 15

Repository: ProxySQL/proxysql-vec

Length of output: 785


🏁 Script executed:

# Check the header file to see if it's publicly available
rg "escape_string_single_quotes" ./include --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 765


🏁 Script executed:

# Verify that the code is really concatenating strings without escaping
sed -n '1035,1055p' ./lib/Static_Harvester.cpp | cat -n

Repository: ProxySQL/proxysql-vec

Length of output: 914


🏁 Script executed:

# Check if Static_Harvester.h includes gen_utils or has access to escape functions
cat ./lib/Static_Harvester.h | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 121


🏁 Script executed:

# Find the Static_Harvester header file
find . -name "*Static_Harvester*" -type f

Repository: ProxySQL/proxysql-vec

Length of output: 120


🏁 Script executed:

# Check the include statements in Static_Harvester.cpp
head -30 ./lib/Static_Harvester.cpp | grep -E "^#include"

Repository: ProxySQL/proxysql-vec

Length of output: 250


🏁 Script executed:

# Look for the prepared statement usage pattern in Discovery_Schema
rg "sqlite3_prepare|SAFE_SQLITE3_STEP|bind_text" ./lib/Discovery_Schema.cpp -B 1 -A 3 | head -60

Repository: ProxySQL/proxysql-vec

Length of output: 2595


🏁 Script executed:

# Check if execute_statement method supports prepared statements with parameter binding
rg "execute_statement" ./lib/SQLite3DB.cpp -B 2 -A 5 2>/dev/null | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if Static_Harvester.cpp includes gen_utils.h
rg "#include" ./lib/Static_Harvester.cpp | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 250


🏁 Script executed:

# Look for other similar SQL concatenation patterns in Static_Harvester.cpp
rg "UPDATE|INSERT" ./lib/Static_Harvester.cpp | grep -E "<<.*<<" | head -10

Repository: ProxySQL/proxysql-vec

Length of output: 140


🏁 Script executed:

# Examine how Discovery_Schema does prepared statements for similar updates
rg "UPDATE objects" ./lib/Discovery_Schema.cpp -B 5 -A 10

Repository: ProxySQL/proxysql-vec

Length of output: 821


🏁 Script executed:

# Check if there are other places in Static_Harvester with unescaped string concatenation
rg "ostringstream.*<<" ./lib/Static_Harvester.cpp | grep -v "ORDER\|SELECT\|WHERE.*NOT IN"

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check how execute_statement is used elsewhere for similar updates with variables
rg "execute_statement.*UPDATE" ./lib --type cpp -B 2 -A 2 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Verify that prepared statements are NOT used in Static_Harvester.cpp
rg "sqlite3_prepare|bind_text|bind_int" ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Check if there are similar vulnerable patterns elsewhere in Static_Harvester
rg "ostringstream.*view_def|ostringstream.*schema_name|ostringstream.*view_name" ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Final check: confirm the vulnerability by looking at the exact SQL being built
sed -n '1035,1055p' ./lib/Static_Harvester.cpp

Repository: ProxySQL/proxysql-vec

Length of output: 767


🏁 Script executed:

# Check the database interface for execute_statement - can it use prepared statements?
rg "class SQLite3DB" ./lib -A 30 | head -50

Repository: ProxySQL/proxysql-vec

Length of output: 3002


🏁 Script executed:

# Look for how other parts of the code handle similar updates with user data
rg "INSERT.*definition_sql\|UPDATE.*definition_sql" ./lib --type cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 47


🏁 Script executed:

# Verify that the fetched data (view_def, schema_name, view_name) can contain quotes
rg "VIEW_DEFINITION\|TABLE_SCHEMA\|TABLE_NAME" ./lib/Static_Harvester.cpp -B 2 -A 2

Repository: ProxySQL/proxysql-vec

Length of output: 47


Escape view definitions and identifiers in SQLite catalog UPDATE statements.

view_def, schema_name, and view_name from MySQL information_schema.VIEWS can contain single quotes, breaking the SQL statement or enabling injection into the SQLite catalog. Use prepared statements with parameter binding (already available in the codebase via sqlite3_bind_text), or escape single quotes by doubling them (`''`). The codebase has escape_string_single_quotes() available, but it operates on C strings; if using that approach, convert the std::string values to C strings first.

🛠️ Suggested fix (prepared statement approach, preferred)

Replace the direct execute_statement() call with prepared statement binding, following the pattern already used in Discovery_Schema.cpp:

		// Update object with definition
-		char* error = NULL;
-		int cols = 0, affected = 0;
-		std::ostringstream update_sql;
-		update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
-		           << "WHERE run_id = " << current_run_id
-		           << " AND schema_name = '" << schema_name << "'"
-		           << " AND object_name = '" << view_name << "'"
-		           << " AND object_type = 'view';";
-
-		catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
+		const char* sql = "UPDATE objects SET definition_sql = ?1 "
+		                  "WHERE run_id = ?2 AND schema_name = ?3 "
+		                  "AND object_name = ?4 AND object_type = 'view';";
+		sqlite3_stmt* stmt = NULL;
+		(*proxy_sqlite3_prepare_v2)(catalog->get_db()->get_db(), sql, -1, &stmt, NULL);
+		(*proxy_sqlite3_bind_text)(stmt, 1, view_def.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_int)(stmt, 2, current_run_id);
+		(*proxy_sqlite3_bind_text)(stmt, 3, schema_name.c_str(), -1, SQLITE_TRANSIENT);
+		(*proxy_sqlite3_bind_text)(stmt, 4, view_name.c_str(), -1, SQLITE_TRANSIENT);
+		SAFE_SQLITE3_STEP2(stmt);
+		(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1014-1055: the UPDATE that writes
view_def, schema_name, and view_name into the SQLite catalog (inside the
for-loop using update_sql and catalog->get_db()->execute_statement) is
vulnerable to breaking on single quotes; change it to use a prepared statement
with parameter binding (sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step /
sqlite3_finalize) and bind view_def, schema_name, view_name, and current_run_id
instead of interpolating them, following the pattern in Discovery_Schema.cpp;
alternatively (less preferred) call escape_string_single_quotes() on
view_def.c_str(), schema_name.c_str(), and view_name.c_str() and use the escaped
C strings in the constructed SQL before calling execute_statement.

Comment on lines +1088 to +1138
std::ostringstream sql;
sql << "SELECT object_id, schema_name, object_name, object_type, engine, table_rows_est, "
<< "data_length, index_length, has_primary_key, has_foreign_keys, has_time_column "
<< "FROM objects WHERE run_id = " << current_run_id
<< " AND object_type IN ('table', 'view')";

catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset);

if (!resultset) {
return -1;
}

int count = 0;
for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
it != resultset->rows.end(); ++it) {
SQLite3_row* row = *it;

int object_id = atoi(row->fields[0]);
std::string object_name = std::string(row->fields[2] ? row->fields[2] : "");

// Guess kind from name
std::string guessed_kind = "unknown";
std::string name_lower = object_name;
std::transform(name_lower.begin(), name_lower.end(), name_lower.begin(), ::tolower);

if (name_lower.find("log") != std::string::npos ||
name_lower.find("event") != std::string::npos ||
name_lower.find("audit") != std::string::npos) {
guessed_kind = "log/event";
} else if (name_lower.find("order") != std::string::npos ||
name_lower.find("invoice") != std::string::npos ||
name_lower.find("payment") != std::string::npos ||
name_lower.find("transaction") != std::string::npos) {
guessed_kind = "fact";
} else if (name_lower.find("user") != std::string::npos ||
name_lower.find("customer") != std::string::npos ||
name_lower.find("account") != std::string::npos ||
name_lower.find("product") != std::string::npos) {
guessed_kind = "entity";
}

// Build profile JSON
json profile;
profile["guessed_kind"] = guessed_kind;
profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
atol(row->fields[6] ? row->fields[6] : "0"));
profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
profile["has_primary_key"] = atoi(row->fields[7]) != 0;
profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
profile["has_time_column"] = atoi(row->fields[9]) != 0;

⚠️ Potential issue | 🟠 Major

Fix column index mapping in quick profiles.

The SELECT column order doesn’t match the indices used, so rows_est/engine/flags are derived from the wrong fields.

✅ Proposed fix
-		profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
-		profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
-		                       atol(row->fields[6] ? row->fields[6] : "0"));
-		profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
-		profile["has_primary_key"] = atoi(row->fields[7]) != 0;
-		profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
-		profile["has_time_column"] = atoi(row->fields[9]) != 0;
+		profile["rows_est"] = row->fields[5] ? atol(row->fields[5]) : 0;
+		profile["size_bytes"] = (atol(row->fields[6] ? row->fields[6] : "0") +
+		                       atol(row->fields[7] ? row->fields[7] : "0"));
+		profile["engine"] = std::string(row->fields[4] ? row->fields[4] : "");
+		profile["has_primary_key"] = atoi(row->fields[8]) != 0;
+		profile["has_foreign_keys"] = atoi(row->fields[9]) != 0;
+		profile["has_time_column"] = atoi(row->fields[10]) != 0;
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1088-1138: the quick-profile field
indices are misaligned with the SELECT: update the row->fields index usage
inside the loop (where profile is built) so they match the SELECT order
(object_id, schema_name, object_name, object_type, engine, table_rows_est,
data_length, index_length, has_primary_key, has_foreign_keys, has_time_column).
Concretely, set rows_est to row->fields[5], engine to row->fields[4], size_bytes
to the sum of row->fields[6] and row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).

@renecannao (Author)
Replaced by #26

@renecannao renecannao closed this Jan 21, 2026