feat: Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis [WIP] #15

renecannao wants to merge 76 commits into v3.1-vec
Conversation
Replace the old single-agent discovery approach with a new 4-agent collaborative system that provides more comprehensive analysis.

Changes:
- Add prompts/ directory with multi-agent system prompts
- Replace README.md with multi-agent discovery documentation
- Replace headless_db_discovery.py to use the 4-agent prompt
- Replace headless_db_discovery.sh to use the 4-agent prompt
- Remove HEADLESS_DISCOVERY_README.md (replaced by README.md)

The new system uses 4 collaborating agents (STRUCTURAL, STATISTICAL, SEMANTIC, QUERY) that communicate via the MCP catalog across 4 rounds:
1. Blind Exploration (parallel independent discovery)
2. Pattern Recognition (cross-agent collaboration)
3. Hypothesis Testing (validation with evidence)
4. Final Synthesis (comprehensive report)

This provides deeper analysis with 15+ hypothesis validations, health scores, and prioritized recommendations.
- Change default timeout from 300s to 3600s (1 hour) for multi-agent discovery
- Add specific timeout error detection (exit code 124) with a helpful message
- Add empty output file detection with diagnostic suggestions
- Improve error messages to guide users on how to debug issues
- Show example commands for increasing the timeout when a timeout occurs

The multi-agent discovery process involves 4 rounds of agent collaboration and can take significantly longer than the previous single-agent approach.
Expand the 4-agent system to 6 agents (5 analysis + 1 meta) with enhanced security analysis and self-improving prompt optimization.

New Agents:
- SECURITY: Identifies sensitive data (PII, credentials, financial), assesses access patterns, identifies vulnerabilities, and provides compliance assessment (GDPR, PCI-DSS)
- META: Analyzes report quality by section, identifies gaps, and suggests specific prompt improvements for future runs

Protocol Changes:
- Expanded from 4 rounds to 5 rounds
- Round 5 is Meta Analysis (META agent only)
- The META agent does not participate in rounds 1-4

New Report Sections:
- 5. SECURITY ANALYSIS with data classification (PUBLIC/INTERNAL/CONFIDENTIAL/RESTRICTED)
- E. Security data classification appendix

New Output:
- Separate META ANALYSIS document with:
  - Section quality ratings (depth, completeness)
  - Specific prompt improvement suggestions
  - Gap identification
  - Evolution history tracking

This enables continuous prompt optimization across multiple discovery iterations, with each run informing improvements for the next.
Each agent now generates a comprehensive catalog of questions it can answer about the database, with executable answer plans using MCP tools.

Question Catalog Features:
- 90+ questions across all agents (minimum 15-20 per agent)
- Executable answer plans with specific MCP tools for each question
- Answer templates with structured output formats
- Complexity ratings (LOW/MEDIUM/HIGH) and time estimates
- Cross-domain questions requiring multiple agents

New Output Locations:
- kind="question_catalog", key="structural_questions"
- kind="question_catalog", key="statistical_questions"
- kind="question_catalog", key="semantic_questions"
- kind="question_catalog", key="query_questions"
- kind="question_catalog", key="security_questions"
- kind="question_catalog", key="cross_domain_questions"

Benefits:
1. Fast Answers: Pre-validated plans skip the analysis phase
2. Consistent Quality: All answers follow proven templates
3. Tool Reuse: Efficient MCP tool usage patterns
4. Comprehensive Coverage: 90+ questions cover most user needs

Example Question Catalog Entry:
- Question: "What sensitive data exists in table X?"
- Answer Plan: sample_rows + column_profile on table X
- Answer Template: Structured list with sensitivity classification
- Complexity: MEDIUM
- Estimated Time: 30 seconds

This creates a reusable knowledge base for future LLM interactions, enabling quick, accurate responses to common database questions.
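The example catalog entry above can be sketched as a payload for the MCP `catalog_upsert` tool. The `kind`/`key` values come from the commit message; the inner document fields are illustrative assumptions, not the tool's actual schema:

```python
import json

# Hypothetical shape of one question-catalog entry. The kind/key values
# match the output locations listed above; the document fields beyond the
# question text are illustrative assumptions.
entry = {
    "kind": "question_catalog",
    "key": "security_questions",
    "document": json.dumps({
        "question": "What sensitive data exists in table X?",
        "answer_plan": ["sample_rows", "column_profile"],  # MCP tools to run on table X
        "answer_template": "structured list with sensitivity classification",
        "complexity": "MEDIUM",
        "estimated_time_seconds": 30,
    }),
    "tags": ["security", "v1.3"],
}

# A real agent would send this via the MCP catalog_upsert tool, e.g.:
# catalog_upsert(**entry)
print(entry["kind"], entry["key"])
```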
…ings

Add prominent warnings and explicit instructions to prevent agents from creating individual markdown files instead of using the MCP catalog.

Changes:
- Add "CRITICAL: MCP CATALOG USAGE" section at the top of the prompt
- Explicitly prohibit the Write tool for individual agent discoveries
- Specify that ONLY the Round 4 final report should be written to file
- Add detailed catalog_upsert usage examples
- Update all round descriptions with CRITICAL catalog usage instructions
- Add "IMPORTANT - Catalog Usage Rules" section

Key Instructions Added:
- "DO NOT use the Write tool to create separate markdown files"
- "ALL agent findings MUST be written to MCP catalog using catalog_upsert"
- "ONLY Round 4 Final Synthesis writes to local file using Write tool"
- "DO NOT use Write tool for individual agent discoveries in Rounds 1-3"

This ensures agents store findings in the MCP catalog for cross-agent collaboration, with only the final consolidated report written to disk.
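The catalog-first rule described above might look like this in practice. This is a hypothetical sketch: `catalog_upsert` here is a local stub standing in for the MCP tool, and the finding text is made up:

```python
# Sketch of the catalog-first rule: rounds 1-3 findings go to the MCP
# catalog via catalog_upsert; only the Round 4 synthesis uses the Write
# tool. CATALOG and catalog_upsert are stand-in stubs for the MCP server.
CATALOG = {}

def catalog_upsert(kind, key, document, tags=()):
    """Stub for the MCP tool mcp__proxysql-stdio__catalog_upsert."""
    CATALOG.setdefault(kind, {})[key] = {"document": document, "tags": list(tags)}

# Round 1: the STRUCTURAL agent stores a finding (no Write tool)
catalog_upsert(
    kind="structural",
    key="orders_table_profile",
    document="orders: 12 columns, PK id, FK customer_id -> customers.id",
    tags=["round1", "structural"],
)

# Round 4 only: the final synthesis is written to a local file
final_report = "# Discovery Report\n..."
# Write(file_path="discovery_report.md", content=final_report)  # Claude's Write tool

print(sorted(CATALOG["structural"]))
```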
Implemented three critical improvements identified by META agent:
1. Performance Baseline Measurement (QUERY agent):
- Added 5 required query types with timing measurements
- Compare actual execution time vs EXPLAIN cost
- Efficiency score calculation (1-10 scale)
- Output format with performance scores per table
2. Statistical Significance Testing (STATISTICAL agent):
- Added 5 required statistical tests:
* Normality tests (Shapiro-Wilk, Anderson-Darling)
* Correlation analysis (Pearson, Spearman) with CI
* Chi-square tests for categorical associations
* Outlier detection (Modified Z-score, IQR)
* Group comparisons (t-test, Mann-Whitney U)
- All tests report p-values and effect sizes
- Statistical confidence scoring (1-10)
3. Enhanced Cross-Domain Question Synthesis:
- Increased minimum from general suggestion to 15 questions
- Added 5 cross-domain categories with examples
- Detailed question template with multi-phase plans
- Priority distribution (URGENT/HIGH/MEDIUM)
- Quality criteria for cross-domain questions
Expected Impact: +25% overall quality, +30% confidence in findings
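Item 1 above (timing actual execution against the planner's view, then scoring efficiency) can be sketched roughly as follows, using an in-memory SQLite table as a stand-in for the MySQL tables the QUERY agent profiles. The scoring formula is an illustrative assumption, not the prompt's exact rubric:

```python
import sqlite3
import time

# Stand-in table; the real agent queries MySQL via MCP tools.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 10_001)])

query = "SELECT * FROM orders WHERE id = 9999"

# The planner's view (MySQL would use EXPLAIN FORMAT=JSON for cost estimates)
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Measure actual wall-clock execution time
start = time.perf_counter()
rows = conn.execute(query).fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000

# Toy efficiency score on a 1-10 scale: fast indexed lookups score high
score = 10 if elapsed_ms < 1 else max(1, round(10 - elapsed_ms))
print(f"plan={plan[0][-1]!r} elapsed={elapsed_ms:.3f}ms score={score}")
```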
Updated README to reflect new capabilities:

1. Statistical Analysis section:
   - Added Statistical Significance Testing subsection
   - Listed 5 required statistical tests (normality, correlation, chi-square, outliers, group comparisons)
   - Mentioned p-values and effect sizes
2. Query Analysis section:
   - Added Performance Baseline Measurement subsection
   - Listed 6 required query types with timing
   - Mentioned efficiency scoring (EXPLAIN vs actual)
3. Question Catalogs section:
   - Updated to reflect 15+ cross-domain questions (enhanced in v1.3)
   - Added 5 cross-domain categories with question counts
4. Cross-Domain Questions section:
   - Expanded from 3 examples to 15 specific questions
   - Organized by 5 categories with question counts
   - Matched new v1.3 requirements
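Two of the outlier checks named above (Modified Z-score and IQR) can be sketched in plain Python so the logic is visible. The real agents would run the full battery (normality, correlation, chi-square, group comparisons) with a stats library such as scipy, which is assumed rather than shown here:

```python
import statistics

def modified_z_outliers(values, threshold=3.5):
    """Flag values whose modified Z-score (based on the MAD) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

sample = [10, 11, 12, 11, 10, 12, 11, 500]  # 500 is an obvious outlier
print(modified_z_outliers(sample), iqr_outliers(sample))
```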
Summary of Changes

Hello @renecannao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces version 1.3 of the Multi-Agent Database Discovery system, significantly enhancing its analytical capabilities. The primary goal is to provide deeper insights and higher confidence in database discovery reports by integrating actual query performance measurements and robust statistical analysis. These improvements are designed to make the discovery process more comprehensive and the generated recommendations more actionable, ultimately boosting the overall quality of the system's output.

Highlights
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a two‑phase, MCP‑centric discovery system: a Static_Harvester writes deterministic metadata into a new SQLite Discovery_Schema; Query_Tool_Handler, RAG and new MCP tools/endpoints enable Phase‑2 LLM and multi‑agent discovery; includes C++ implementations, prompts, headless orchestration scripts, docs, and build wiring.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Claude as Claude Code (Client)
  participant MCP as MCP Server (/mcp/query)
  participant QTH as Query_Tool_Handler
  participant Catalog as Discovery_Schema (mcp_catalog.db)
  participant MySQL as MySQL
  participant RAG as RAG_Tool_Handler
  Note over Claude,MCP: Phase 1 — Static harvest
  Claude->>MCP: discovery.run_static(schema_filter)
  MCP->>QTH: invoke discovery.run_static
  QTH->>MySQL: query INFORMATION_SCHEMA
  MySQL-->>QTH: rows
  QTH->>Catalog: insert_schema/objects/columns/indexes/fks
  Catalog-->>QTH: ack
  QTH->>Catalog: rebuild_fts_index
  Catalog-->>MCP: {run_id, summary}
  MCP-->>Claude: {run_id, summary}
  Note over Claude,MCP: Phase 2 — LLM / Multi‑agent discovery
  Claude->>MCP: catalog.list_objects(run_id)
  MCP->>Catalog: list_objects(run_id)
  Catalog-->>MCP: object metadata
  MCP-->>Claude: objects
  loop multi-agent interactions
    Claude->>MCP: llm.summary_upsert / agent.event_append / catalog.get_object
    MCP->>Catalog: upsert_llm_summary / append_agent_event / get_object
    Catalog-->>MCP: ack / object
    MCP->>RAG: rag.search_* (optional retrieval)
    RAG-->>MCP: chunks/docs
  end
  Claude->>MCP: agent.run_finish(status)
  MCP->>Catalog: finish_agent_run()
  Catalog-->>MCP: ack
  MCP-->>Claude: completion
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request significantly enhances the multi-agent database discovery system by introducing performance baselining and statistical significance testing. The refactoring of the headless execution scripts to externalize the prompt and improve error handling is a major step forward in maintainability and usability. The new prompt is exceptionally detailed and well-engineered. My review focuses on minor inconsistencies in the documentation and opportunities to make the agent's instructions even more robust to prevent potential issues during execution.
```diff
@@ -0,0 +1,434 @@
+# Database Discovery System Prompt
```
This reference document appears to be outdated and inconsistent with the new architecture described in the pull request. It details a 4-agent system, while the main prompt and README now describe a 6-agent system with a 5-round protocol. This discrepancy is highly confusing for anyone trying to understand the system.
The file should be updated to reflect the current 6-agent architecture or be clearly marked as documentation for a previous version to avoid confusion.
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
The default timeout value in the command-line options table is outdated. It's listed as 300 seconds, but the scripts have been updated to use a default of 3600 seconds (1 hour) to accommodate the more complex multi-agent discovery process. The documentation should be updated to reflect the new default.
```diff
-| `--timeout` | `-t` | Timeout in seconds | 300 |
+| `--timeout` | `-t` | Timeout in seconds | 3600 |
```
```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str]) -> str:
    prompt = f"""Using the 4-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 4-round protocol but prioritize these areas.
"""
    return prompt
```
The example for customizing the discovery prompt is inconsistent with the new architecture. It refers to a "4-agent discovery protocol," whereas the system now uses 6 agents. Additionally, the function signature for build_discovery_prompt is missing the schema parameter. This could confuse users trying to customize the script.
Suggested change:

```python
# Modify the prompt in the script for focused discovery
def build_discovery_prompt(database: Optional[str], schema: Optional[str]) -> str:
    # This example shows how to create a custom, focused prompt.
    # Note: The actual script reads a much more detailed prompt from an external file.
    prompt = f"""Using the 6-agent discovery protocol, focus on:
1. Security aspects of {database}
2. Performance optimization opportunities
3. Data quality issues

Follow the standard 5-round protocol but prioritize these areas.
"""
    return prompt
```
1. **Primary Key Lookup**

```sql
SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
```
The query for the Primary Key Lookup test uses MAX({pk_column}) to find a value to test. This assumes the primary key is of a type that supports the MAX() aggregate function, such as an integer. This will fail for other common primary key types like UUIDs. A more robust approach is to select an arbitrary key, for instance, by using LIMIT 1.
```diff
-SELECT * FROM {table} WHERE {pk_column} = (SELECT MAX({pk_column}) FROM {table});
+SELECT * FROM {table} WHERE {pk_column} = (SELECT {pk_column} FROM {table} LIMIT 1);
```
3. **Index Range Scan (if applicable)**

```sql
SELECT * FROM {table} WHERE {indexed_column} BETWEEN {min} AND {max} LIMIT 1000;
```
The instruction for the 'Index Range Scan' test provides a query template but doesn't guide the agent on how to determine appropriate values for {min} and {max}. This ambiguity could lead to ineffective or failing tests if the agent chooses a poor range.
I recommend adding guidance for the agent. For example, you could add a note before the SQL block:
- Note: To determine appropriate `{min}` and `{max}` values, first profile the `{indexed_column}` to find a representative range.
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore`:
- Line 1: The .gitignore entry uses a leading slash (/discovery_*.md) which
limits matches to the current directory; change the pattern to discovery_*.md
(or use **/discovery_*.md if you want to match in subdirectories) to allow
ignoring discovery files produced in nested paths; update the entry that
currently contains "/discovery_*.md" accordingly.
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.py`:
- Around line 5-7: The module-level docstring in headless_db_discovery.py still
lists 4 agents (STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY); update that
docstring to reflect the current 6-agent workflow by adding the two missing
agents (name them exactly as used in the codebase) so the description matches
the implementation; search for the module docstring at the top of
headless_db_discovery.py and edit the agent list text to include all six agent
names used by the discovery workflow.
- Around line 131-155: The call to get_discovery_prompt_path() is executed
before the try/except that handles missing prompt files, so a missing prompt
will raise FileNotFoundError outside the handler; fix by moving the
get_discovery_prompt_path() invocation (and the log_verbose that references it)
into the same try block that calls build_discovery_prompt(), or wrap the
get_discovery_prompt_path() call in its own try/except to catch
FileNotFoundError and call process-safe fallback or exit; update the references
in log_verbose and build_discovery_prompt() usage accordingly so all
FileNotFoundError paths are handled in one place.
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`:
- Around line 7-9: Update the usage/help header: change the phrase "4
collaborating agents" to "6 collaborating agents" and edit the agent list that
currently reads "STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY" to include the
two missing agent names (or explicitly say "STRUCTURAL, STATISTICAL, SEMANTIC,
QUERY, <AGENT5>, and <AGENT6>" if you have the names) so the comment accurately
reflects the 6‑agent system in headless_db_discovery.sh.
- Around line 175-204: MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md`:
- Around line 3-5: Update the mission header that currently says "ONLY MCP
server tools (`mcp__proxysql-stdio__*`)" to clarify that the MCP-only
restriction applies to database access and discovery steps only, while
permitting the use of the Write tool in Round 4 for producing the final report;
reference the exact phrase "ONLY MCP server tools (`mcp__proxysql-stdio__*`)"
and "Round 4" so the change is applied there and ensure the new wording
explicitly states that all database interactions must use MCP tools but output
generation may use Write.
- Around line 845-856: Add explicit language identifiers to the untyped Markdown
code fences around the catalog examples: update the three fences that surround
the calls to catalog_upsert, catalog_search, and catalog_get so they specify a
language (e.g., ```python) instead of plain ```, ensuring each opening fence
before catalog_upsert(kind="agent_type"...), before catalog_search(...), and
before catalog_get(...) is changed to include the identifier and the
corresponding closing fence remains ```; no code changes required besides the
fence markers.
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`:
- Around line 5-11: The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.
- Around line 101-109: Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md`:
- Around line 519-530: The directory-structure code fence in README.md is
missing a language tag (causing MD040); update the opening fence for the tree
block from ``` to ```text so the block becomes a fenced code block with a
language tag (e.g., change the README.md directory-structure code fence that
surrounds the "ClaudeCode_Headless/" tree to start with ```text).
- Around line 392-398: Update the README table entry for the `--timeout` / `-t`
option to reflect the actual default of 3600 seconds: change the "Default"
column value for the `--timeout` row from 300 to 3600 so the documented default
matches the script's runtime behavior.
- Around line 471-477: Update the example prompt string (variable "prompt") so
it references the 6-agent, 5-round discovery protocol rather than "4-agent" and
"4-round"; modify the body to instruct focus on security, performance, and data
quality while explicitly stating the 6-agent/5-round structure and any change in
sequencing (e.g., "Follow the standard 6-agent, 5-round protocol and prioritize
these areas") and adjust numbering or phrasing inside the triple-quoted string
to match the new protocol description.
```diff
 This script runs Claude Code in non-interactive mode to perform
-comprehensive database discovery. It works with any database
-type that is accessible via MCP (Model Context Protocol).
+comprehensive database discovery using 4 collaborating agents:
+STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
```
Update the docstring agent list to match the 6‑agent workflow.
Proposed fix:

```diff
-comprehensive database discovery using 4 collaborating agents:
-STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+comprehensive database discovery using 6 collaborating agents:
+STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```
This script runs Claude Code in non-interactive mode to perform
comprehensive database discovery using 6 collaborating agents:
STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
```
```diff
 log_info("Starting Multi-Agent Database Discovery")
 log_info(f"Output will be saved to: {output_file}")
 log_verbose(f"Claude Code executable: {claude_cmd}", args.verbose)

-# Build MCP configuration
-mcp_config_file, _ = build_mcp_config(args)
-if mcp_config_file:
-    log_verbose(f"Using MCP configuration: {mcp_config_file}", args.verbose)
+log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)

 # Build command arguments
 cmd_args = [
     claude_cmd,
     '--print',  # Non-interactive mode
     '--no-session-persistence',  # Don't save session
-    '--permission-mode', 'bypassPermissions',  # Bypass permission checks in headless mode
+    '--permission-mode', 'bypassPermissions',  # Bypass permission checks
 ]

-# Add MCP configuration if available
-if mcp_config_file:
-    cmd_args.extend(['--mcp-config', mcp_config_file])
+# Add MCP configuration if provided
+if args.mcp_config:
+    cmd_args.extend(['--mcp-config', args.mcp_config])
+    log_verbose(f"Using MCP config: {args.mcp_config}", args.verbose)
+elif args.mcp_file:
+    cmd_args.extend(['--mcp-config', args.mcp_file])
+    log_verbose(f"Using MCP config file: {args.mcp_file}", args.verbose)

 # Build discovery prompt
-prompt = build_discovery_prompt(args.database, args.schema)
+try:
+    prompt = build_discovery_prompt(args.database, args.schema)
+except FileNotFoundError as e:
```
Avoid uncaught FileNotFoundError from the log line.
get_discovery_prompt_path() is called before the try/except, so a missing prompt file bypasses the intended error handling.
Proposed fix:

```diff
-    log_verbose(f"Using discovery prompt: {get_discovery_prompt_path()}", args.verbose)
+    # Build discovery prompt
+    try:
+        prompt_path = get_discovery_prompt_path()
+        log_verbose(f"Using discovery prompt: {prompt_path}", args.verbose)
+        prompt = build_discovery_prompt(args.database, args.schema)
+    except FileNotFoundError as e:
+        log_error(str(e))
+        sys.exit(1)
@@
-    # Build discovery prompt
-    try:
-        prompt = build_discovery_prompt(args.database, args.schema)
-    except FileNotFoundError as e:
-        log_error(str(e))
-        sys.exit(1)
```
```diff
 # This script runs Claude Code in non-interactive mode to perform
-# comprehensive database discovery. It works with any database
-# type that is accessible via MCP (Model Context Protocol).
+# comprehensive database discovery using 4 collaborating agents:
+# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
```
Update the help header to reflect the 6‑agent system.
The usage text still says “4 collaborating agents”.
Proposed fix:

```diff
-# comprehensive database discovery using 4 collaborating agents:
-# STRUCTURAL, STATISTICAL, SEMANTIC, and QUERY.
+# comprehensive database discovery using 6 collaborating agents:
+# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
```
Suggested change:

```shell
# This script runs Claude Code in non-interactive mode to perform
# comprehensive database discovery using 6 collaborating agents:
# STRUCTURAL, STATISTICAL, SEMANTIC, QUERY, SECURITY, and META.
```
```shell
# Build MCP args
MCP_ARGS=""
if [ -n "$MCP_CONFIG" ]; then
    # Write inline config to temp file
    MCP_CONFIG_FILE=$(mktemp)
    echo "$MCP_CONFIG" > "$MCP_CONFIG_FILE"
    MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
    MCP_ARGS="--mcp-config $MCP_CONFIG"
    log_verbose "Using inline MCP configuration"
elif [ -n "$MCP_FILE" ]; then
    if [ -f "$MCP_FILE" ]; then
        MCP_CONFIG_FILE="$MCP_FILE"
        MCP_ARGS="--mcp-config $MCP_FILE"
        log_verbose "Using MCP configuration from: $MCP_FILE"
    else
        log_error "MCP configuration file not found: $MCP_FILE"
        exit 1
    fi
elif [ -n "$PROXYSQL_MCP_ENDPOINT" ]; then
    # Build MCP config for ProxySQL and write to temp file
    MCP_CONFIG_FILE=$(mktemp)
    SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
    BRIDGE_PATH="$SCRIPT_DIR/../mcp/proxysql_mcp_stdio_bridge.py"

    # Build the JSON config
    cat > "$MCP_CONFIG_FILE" << MCPJSONEOF
{
  "mcpServers": {
    "proxysql": {
      "command": "python3",
      "args": ["$BRIDGE_PATH"],
      "env": {
        "PROXYSQL_MCP_ENDPOINT": "$PROXYSQL_MCP_ENDPOINT"
MCPJSONEOF

    if [ -n "$PROXYSQL_MCP_TOKEN" ]; then
        echo ", \"PROXYSQL_MCP_TOKEN\": \"$PROXYSQL_MCP_TOKEN\"" >> "$MCP_CONFIG_FILE"
    fi

    if [ "$PROXYSQL_MCP_INSECURE_SSL" = "1" ]; then
        echo ", \"PROXYSQL_MCP_INSECURE_SSL\": \"1\"" >> "$MCP_CONFIG_FILE"
    fi

    cat >> "$MCP_CONFIG_FILE" << 'MCPJSONEOF2'
      }
    }
  }
}
MCPJSONEOF2

    MCP_ARGS="--mcp-config $MCP_CONFIG_FILE"
    log_verbose "Using ProxySQL MCP endpoint: $PROXYSQL_MCP_ENDPOINT"
    log_verbose "MCP config written to: $MCP_CONFIG_FILE"
else
    log_verbose "No explicit MCP configuration, using available MCP servers"
fi

# Build the discovery prompt
DATABASE_ARG=""
if [ -n "$DATABASE_NAME" ]; then
    DATABASE_ARG="database named '$DATABASE_NAME'"
else
    DATABASE_ARG="the first available database"
fi
# Log the command being executed
log_info "Running Claude Code in headless mode with 6-agent discovery..."
log_verbose "Timeout: ${TIMEOUT}s"

SCHEMA_ARG=""
if [ -n "$SCHEMA_NAME" ]; then
    SCHEMA_ARG="the schema '$SCHEMA_NAME' within"
fi
# Build Claude command
CLAUDE_ARGS=(
    --no-session-persistence
    --permission-mode bypassPermissions
)

DISCOVERY_PROMPT="You are a Database Discovery Agent. Your mission is to perform comprehensive analysis of $DATABASE_ARG.

${SCHEMA_ARG:+Focus on $SCHEMA_ARG}

Use the available MCP database tools to discover and document:

## 1. STRUCTURAL ANALYSIS
- List all tables in the database/schema
- For each table, describe:
  - Column names, data types, and nullability
  - Primary keys and unique constraints
  - Foreign key relationships
  - Indexes and their purposes
  - Any CHECK constraints or defaults

- Create an Entity Relationship Diagram (ERD) showing:
  - All tables and their relationships
  - Cardinality (1:1, 1:N, M:N)
  - Primary and foreign keys

## 2. DATA PROFILING
- For each table, analyze:
  - Row count
  - Data distributions for key columns
  - Null value percentages
  - Distinct value counts (cardinality)
  - Min/max/average values for numeric columns
  - Sample data (first few rows)

- Identify patterns and anomalies:
  - Duplicate records
  - Data quality issues
  - Unexpected distributions
  - Outliers

## 3. SEMANTIC ANALYSIS
- Infer the business domain:
  - What type of application/database is this?
  - What are the main business entities?
  - What are the business processes?

- Document business rules:
  - Entity lifecycles and state machines
  - Validation rules implied by constraints
  - Relationship patterns

- Classify tables:
  - Master/reference data (customers, products, etc.)
  - Transactional data (orders, transactions, etc.)
```
| - Junction/association tables | ||
| - Configuration/metadata | ||
|
|
||
| ## 4. PERFORMANCE & ACCESS PATTERNS | ||
| - Identify: | ||
| - Missing indexes on foreign keys | ||
| - Missing indexes on frequently filtered columns | ||
| - Composite index opportunities | ||
| - Potential N+1 query patterns | ||
|
|
||
| - Suggest optimizations: | ||
| - Indexes that should be added | ||
| - Query patterns that would benefit from optimization | ||
| - Denormalization opportunities | ||
|
|
||
| ## OUTPUT FORMAT | ||
|
|
||
| Provide your findings as a comprehensive Markdown report with: | ||
|
|
||
| 1. **Executive Summary** - High-level overview | ||
| 2. **Database Schema** - Complete table definitions | ||
| 3. **Entity Relationship Diagram** - ASCII ERD | ||
| 4. **Data Quality Assessment** - Score (1-100) with issues | ||
| 5. **Business Domain Analysis** - Industry, use cases, entities | ||
| 6. **Performance Recommendations** - Prioritized optimization list | ||
| 7. **Anomalies & Issues** - All problems found with severity | ||
|
|
||
| Be thorough. Discover everything about this database structure and data. | ||
| Write the complete report to standard output." | ||
|
|
||
| # Log the command being executed (without showing the full prompt for clarity) | ||
| log_info "Running Claude Code in headless mode..." | ||
| log_verbose "Timeout: ${TIMEOUT}s" | ||
| if [ -n "$DATABASE_NAME" ]; then | ||
| log_verbose "Target database: $DATABASE_NAME" | ||
| fi | ||
| if [ -n "$SCHEMA_NAME" ]; then | ||
| log_verbose "Target schema: $SCHEMA_NAME" | ||
| # Add MCP configuration if available | ||
| if [ -n "$MCP_ARGS" ]; then | ||
| CLAUDE_ARGS+=($MCP_ARGS) | ||
| fi |
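The three-part heredoc stitching above has to keep brace counts aligned across fragments and does not escape values embedded in the JSON. A hedged alternative, a sketch rather than the PR's implementation (endpoint and bridge path below are example values), emits the whole config in one step so `json.dump` handles quoting:

```shell
# Sketch only: build the MCP config as one JSON document instead of
# stitching heredoc fragments; json.dump escapes the values for us.
export PROXYSQL_MCP_ENDPOINT="https://127.0.0.1:6071/mcp/query"      # example value
export BRIDGE_PATH="/opt/proxysql/mcp/proxysql_mcp_stdio_bridge.py"  # example path
export PROXYSQL_MCP_TOKEN=""   # optional; empty means "omit from env"

MCP_CONFIG_FILE=$(mktemp)
python3 - "$MCP_CONFIG_FILE" <<'PYEOF'
import json, os, sys
env = {"PROXYSQL_MCP_ENDPOINT": os.environ["PROXYSQL_MCP_ENDPOINT"]}
if os.environ.get("PROXYSQL_MCP_TOKEN"):
    env["PROXYSQL_MCP_TOKEN"] = os.environ["PROXYSQL_MCP_TOKEN"]
cfg = {"mcpServers": {"proxysql": {
    "command": "python3",
    "args": [os.environ["BRIDGE_PATH"]],
    "env": env,
}}}
with open(sys.argv[1], "w") as fh:
    json.dump(cfg, fh, indent=2)
PYEOF
echo "wrote $MCP_CONFIG_FILE"
```

The output file is guaranteed to be well-formed JSON regardless of which optional keys are present, which the echo-append approach cannot promise.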
Prevent MCP config JSON from breaking argument parsing.
MCP_ARGS is a string that gets expanded unquoted into the args array, so JSON or paths with spaces will split into multiple arguments.
Proposed fix (use an array and proper quoting):

```diff
-# Build MCP args
-MCP_ARGS=""
-if [ -n "$MCP_CONFIG" ]; then
-  MCP_ARGS="--mcp-config $MCP_CONFIG"
-  log_verbose "Using inline MCP configuration"
-elif [ -n "$MCP_FILE" ]; then
-  if [ -f "$MCP_FILE" ]; then
-    MCP_ARGS="--mcp-config $MCP_FILE"
-    log_verbose "Using MCP configuration from: $MCP_FILE"
-  else
-    log_error "MCP configuration file not found: $MCP_FILE"
-    exit 1
-  fi
-fi
+MCP_ARGS=()
+if [ -n "$MCP_CONFIG" ]; then
+  MCP_ARGS+=(--mcp-config "$MCP_CONFIG")
+  log_verbose "Using inline MCP configuration"
+elif [ -n "$MCP_FILE" ]; then
+  if [ -f "$MCP_FILE" ]; then
+    MCP_ARGS+=(--mcp-config "$MCP_FILE")
+    log_verbose "Using MCP configuration from: $MCP_FILE"
+  else
+    log_error "MCP configuration file not found: $MCP_FILE"
+    exit 1
+  fi
+fi
@@
-# Add MCP configuration if available
-if [ -n "$MCP_ARGS" ]; then
-  CLAUDE_ARGS+=($MCP_ARGS)
-fi
+# Add MCP configuration if available
+if [ "${#MCP_ARGS[@]}" -gt 0 ]; then
+  CLAUDE_ARGS+=("${MCP_ARGS[@]}")
+fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 203-203: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/headless_db_discovery.sh`
around lines 175 - 204, MCP_ARGS being a plain string causes word-splitting when
expanded into the CLAUDE_ARGS array; change MCP_ARGS to an array and populate it
with separate elements (e.g., add the flag and its value as two entries) when
building from MCP_CONFIG or MCP_FILE, then append to CLAUDE_ARGS using array
expansion (use the MCP_ARGS array name and its elements) so JSON or paths with
spaces are preserved; update the code locations that set MCP_ARGS and the line
that appends to CLAUDE_ARGS to use the MCP_ARGS array and proper quoted array
expansion.
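The review's word-splitting point is easy to reproduce in isolation; a minimal sketch (hypothetical helper, not from the PR):

```shell
# Demo: unquoted string expansion word-splits; quoted array expansion does not.
count_args() { echo "$#"; }

MCP_ARGS_STR='--mcp-config /tmp/my config.json'     # value contains a space
split=$(count_args $MCP_ARGS_STR)                   # unquoted: splits into 3 words

MCP_ARGS_ARR=(--mcp-config '/tmp/my config.json')
intact=$(count_args "${MCP_ARGS_ARR[@]}")           # quoted array: stays 2 arguments

echo "string expansion: $split args; array expansion: $intact args"
```

This prints `string expansion: 3 args; array expansion: 2 args`, which is exactly the failure mode SC2206 warns about: the config path arrives at the CLI as two broken arguments unless an array is used.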
```markdown
You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.

**Critical Constraints:**
- Use **ONLY** MCP server tools (`mcp__proxysql-stdio__*`) - never connect directly to backend databases
- All agents collaborate via the MCP catalog (`catalog_upsert`, `catalog_search`)
- Execute in 4 rounds: Blind Exploration → Pattern Recognition → Hypothesis Testing → Final Synthesis
- Generate a comprehensive report as the final output
```
Clarify that this reference prompt is the 4‑agent/4‑round variant (or update it to the 6‑agent workflow).
Right now it reads like the canonical prompt while the tooling uses the 6‑agent/5‑round protocol, which can mislead readers.
Suggested clarification:

```diff
-You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+You are a **Database Discovery Orchestrator** for Claude Code. Your mission is to perform comprehensive database analysis through 4 collaborating subagents using MCP (Model Context Protocol) server tools.
+> **Note:** This reference describes the legacy 4‑agent / 4‑round workflow. The active v1.3 prompt uses 6 agents and 5 rounds (see `prompts/multi_agent_discovery_prompt.md`).
```
🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 5 - 11, The prompt currently states "Execute in 4 rounds" and
implies a 4‑agent workflow but your tooling uses a 6‑agent/5‑round protocol;
update the prompt text (the "Database Discovery Orchestrator" section and the
"Critical Constraints" bullet that references "Execute in 4 rounds") to
explicitly state whether this file is the 4‑agent/4‑round variant or to convert
it to the canonical 6‑agent/5‑round workflow used by the tooling; ensure you
update the round count and agent count consistently and keep the MCP tooling
requirement (`mcp__proxysql-stdio__*`, `catalog_upsert`, `catalog_search`)
intact so readers aren’t misled.
**Writing Findings:**

```python
catalog_upsert(
    kind="structural|statistical|semantic|query|collaborative|validation|final_report",
    key="specific_identifier",
    document="detailed_findings_markdown",
    tags="optional_tags"
)
```
Fix catalog kind names in the example to match the round-specific kinds used later.
The snippet advertises collaborative/validation, while the protocol uses collaborative_round2/validation_round3. This mismatch can break catalog lookups.
Proposed fix:

```diff
-catalog_upsert(
-    kind="structural|statistical|semantic|query|collaborative|validation|final_report",
+catalog_upsert(
+    kind="structural|statistical|semantic|query|collaborative_round2|validation_round3|final_report",
     key="specific_identifier",
     document="detailed_findings_markdown",
     tags="optional_tags"
 )
```

🤖 Prompt for AI Agents
In
`@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_reference.md`
around lines 101 - 109, Update the example call to catalog_upsert so the kind
field uses the round-specific kind names used elsewhere in the protocol (e.g.,
replace "collaborative" with "collaborative_round2" and "validation" with
"validation_round3") to ensure catalog lookups match; keep the parameter names
(kind, key, document, tags) and structure the same but use the exact protocol
strings used later in the document so catalog_upsert queries succeed.
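This class of drift is mechanical enough to guard with a grep over the prompt file. A hypothetical sketch (the protocol text is inlined here for illustration; in practice you would `cat` the reference document):

```shell
# Hypothetical guard: flag advertised catalog kinds that never appear verbatim
# in the protocol text. Inlined sample text stands in for the real prompt file.
doc='Round 2 writes kind="collaborative_round2"; Round 3 writes kind="validation_round3".'

mismatches=0
for k in collaborative validation; do
  # grep -w treats "_" as a word character, so "collaborative" will NOT
  # match "collaborative_round2" -- exactly the mismatch being flagged.
  if ! printf '%s\n' "$doc" | grep -qw "$k"; then
    echo "advertised kind \"$k\" not found verbatim in protocol text"
    mismatches=$((mismatches + 1))
  fi
done
echo "mismatches=$mismatches"
```

Run against the sample text above, both bare kinds are reported missing, mirroring the lookup failure the reviewer describes.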
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--database` | `-d` | Database name to discover | First available |
| `--schema` | `-s` | Schema name to analyze | All schemas |
| `--output` | `-o` | Output file path | `discovery_YYYYMMDD_HHMMSS.md` |
| `--timeout` | `-t` | Timeout in seconds | 300 |
| `--verbose` | `-v` | Enable verbose output | Disabled |
Update the documented default timeout to 3600s.
The scripts default to 3600 seconds, but the table still says 300.
Proposed fix:

```diff
-| `--timeout` | `-t` | Timeout in seconds | 300 |
+| `--timeout` | `-t` | Timeout in seconds | 3600 |
```
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 392 -
398, Update the README table entry for the `--timeout` / `-t` option to reflect
the actual default of 3600 seconds: change the "Default" column value for the
`--timeout` row from 300 to 3600 so the documented default matches the script's
runtime behavior.
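The 3600s default pairs with the exit-code detection mentioned in this PR; the mechanism can be sketched like this (assumes GNU coreutils `timeout`, which kills the command and exits 124 on expiry):

```shell
# Sketch: GNU `timeout` exits with status 124 when the wrapped command
# is killed because the limit expired.
TIMEOUT=1
timeout "$TIMEOUT" sleep 2
status=$?

if [ "$status" -eq 124 ]; then
  echo "discovery timed out after ${TIMEOUT}s; retry with --timeout $((TIMEOUT * 2))"
fi
echo "exit=$status"
```

Any non-124 status falls through to normal error handling, so timeout expiry gets its own actionable message instead of a generic failure.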
```python
    prompt = f"""Using the 4-agent discovery protocol, focus on:
    1. Security aspects of {database}
    2. Performance optimization opportunities
    3. Data quality issues

    Follow the standard 4-round protocol but prioritize these areas.
    """
```
Align the custom prompt example with the 6‑agent/5‑round protocol.
The example still references a 4‑agent, 4‑round flow.
Proposed fix:

```diff
-    prompt = f"""Using the 4-agent discovery protocol, focus on:
+    prompt = f"""Using the 6-agent, 5-round discovery protocol, focus on:
     1. Security aspects of {database}
     2. Performance optimization opportunities
     3. Data quality issues
-    Follow the standard 4-round protocol but prioritize these areas.
+    Follow the standard 5-round protocol but prioritize these areas.
     """
```

🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 471 -
477, Update the example prompt string (variable "prompt") so it references the
6-agent, 5-round discovery protocol rather than "4-agent" and "4-round"; modify
the body to instruct focus on security, performance, and data quality while
explicitly stating the 6-agent/5-round structure and any change in sequencing
(e.g., "Follow the standard 6-agent, 5-round protocol and prioritize these
areas") and adjust numbering or phrasing inside the triple-quoted string to
match the new protocol description.
```
ClaudeCode_Headless/
├── README.md                              # This file
├── prompts/
│   ├── multi_agent_discovery_prompt.md    # Concise system prompt
│   └── multi_agent_discovery_reference.md # Comprehensive reference
├── headless_db_discovery.py               # Python script
├── headless_db_discovery.sh               # Bash script
└── examples/
    ├── DATABASE_DISCOVERY_REPORT.md       # Example output
    └── DATABASE_QUESTION_CAPABILITIES.md  # Feature documentation
```
Add a language tag to the directory-structure code fence to satisfy MD040.
Proposed fix:

```diff
-```
+```text
 ClaudeCode_Headless/
 ├── README.md                              # This file
 ├── prompts/
 │   ├── multi_agent_discovery_prompt.md    # Concise system prompt
 │   └── multi_agent_discovery_reference.md # Comprehensive reference
 ├── headless_db_discovery.py               # Python script
 ├── headless_db_discovery.sh               # Bash script
 └── examples/
     ├── DATABASE_DISCOVERY_REPORT.md       # Example output
     └── DATABASE_QUESTION_CAPABILITIES.md  # Feature documentation
```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
519-519: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md` around lines 519 -
530, The directory-structure code fence in README.md is missing a language tag
(causing MD040); update the opening fence for the tree block from ``` to ```text
so the block becomes a fenced code block with a language tag (e.g., change the
README.md directory-structure code fence that surrounds the
"ClaudeCode_Headless/" tree to start with ```text).
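MD040 is mechanical enough to check without markdownlint; a rough sketch that builds a two-fence sample file inline and counts bare openings (the backtick is kept in a variable so the sample generation stays readable):

```shell
# Rough MD040-style check: count opening fences that carry no language tag.
f='```'   # literal triple backtick
{
  printf '%stext\n' "$f"; echo 'tree ok';               printf '%s\n' "$f"
  printf '%s\n'     "$f"; echo 'tree missing language'; printf '%s\n' "$f"
} > /tmp/sample.md

bad=$(awk -v fence="$f" '
  index($0, fence) == 1 {
    # A fence line toggles block state; an opening that is exactly "```"
    # (no language word after it) is the MD040 violation we count.
    if (!in_block) { if ($0 == fence) n++; in_block = 1 } else in_block = 0
  }
  END { print n + 0 }
' /tmp/sample.md)
echo "fences missing a language tag: $bad"
```

On the generated sample this reports one bare fence, the same finding markdownlint raises against the README.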
Added patterns to ignore all discovery output files: Root .gitignore: - discovery_*.md (main discovery reports at repo root) - database_discovery_report.md scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore: - *_QUESTION_CATALOG.md (individual agent question catalogs) - *_round1_*.md, *_round2_*.md, *_round3_*.md, *_round4_*.md (round outputs) - *_COORDINATOR_SUMMARY.md (coordinator summaries) - *_HYPOTHESIS_TESTING.md (hypothesis testing results) - *_INDEX.md (index files) - *_QUICK_REFERENCE.md (quick reference docs) - META_ANALYSIS_*.md (META agent outputs) - SECURITY_AGENT_*.txt (security agent text outputs) - query_agent_*.md, security_agent_*.md, semantic_*.md, statistical_*.md These files are generated when agents incorrectly use Write tool instead of catalog_upsert (the issue fixed in v1.2). They should not be committed to the repository as they are test outputs and are stored in the MCP catalog.
…aint This commit addresses two issues: 1. MCP Catalog Schema Separation: - Add 'schema' column to catalog table for proper isolation - Update all catalog methods (upsert, get, search, list, remove) to accept schema parameter - Update MCP tool handlers and JSON-RPC parameter parsing - Unique constraint changed from (kind, key) to (schema, kind, key) - FTS table updated to include schema column 2. Discovery Prompt Scope Constraint: - Add explicit SCOPE CONSTRAINT section to multi_agent_discovery_prompt.md - Agents now respect Target Schema and skip list_schemas when specified - Prevents analyzing all schemas when only one is targeted Files modified: - include/MySQL_Catalog.h: Add schema parameter to all catalog methods - include/MySQL_Tool_Handler.h: Update wrapper method signatures - lib/MySQL_Catalog.cpp: Implement schema filtering in all operations - lib/MySQL_Tool_Handler.cpp: Update wrapper implementations - lib/Query_Tool_Handler.cpp: Extract schema from JSON-RPC params, update tool descriptions - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md: Add scope constraint
Phase 1 (Static/Deterministic): - Add Discovery_Schema: SQLite catalog with deterministic and LLM tables - Add Static_Harvester: MySQL INFORMATION_SCHEMA metadata extraction - Harvest schemas, objects, columns, indexes, foreign keys, view definitions - Compute derived hints: is_time, is_id_like, has_pk, has_fks, has_time - Build quick profiles and FTS5 indexes Phase 2 (LLM Agent): - Add 19 new MCP tools for two-phase discovery - discovery.run_static: Trigger ProxySQL's static harvest - Catalog tools: init, search, get_object, list_objects, get_relationships - Agent tools: run_start, run_finish, event_append - LLM tools: summary_upsert, relationship_upsert, domain_upsert, etc. Files: - include/Discovery_Schema.h, lib/Discovery_Schema.cpp - include/Static_Harvester.h, lib/Static_Harvester.cpp - include/Query_Tool_Handler.h, lib/Query_Tool_Handler.cpp (updated) - lib/Makefile (updated) - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/ - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py
Add missing --mcp-config argument which is required by the script. The dry-run example now correctly shows all required parameters.
- Add mcp_config.example.json for Claude Code MCP configuration - Fix MCP bridge path in example config (../../proxysql_mcp_stdio_bridge.py) - Update Two_Phase_Discovery_Implementation.md with correct Phase 1/Phase 2 usage - Fix Two_Phase_Discovery_Implementation.md DELETE FROM fts_objects to scope to run_id - Update README.md with two-phase discovery section and multi-agent legacy note - Create static_harvest.sh bash wrapper for Phase 1 - Create two_phase_discovery.py orchestration script with prompts - Add --run-id parameter to skip auto-fetch - Fix RUN_ID placeholder mismatch (<USE_THE_PROVIDED_RUN_ID>) - Fix catalog path default to mcp_catalog.db - Add test_catalog.sh to verify catalog tools work - Fix Discovery_Schema.cpp FTS5 syntax (missing space) - Remove invalid CREATE INDEX on FTS virtual tables - Add MCP tool call logging to track tool usage - Fix Static_Harvester::get_harvest_stats() to accept run_id parameter - Fix DELETE FROM fts_objects to only delete for specific run_id - Update system prompts to say DO NOT call discovery.run_static - Update user prompts to say Phase 1 is already complete - Add --mcp-only flag to restrict Claude Code to MCP tools only - Make FTS table failures non-fatal (check if table exists first) - Add comprehensive documentation for both discovery approaches
- Add try-catch around handle_jsonrpc_request to catch unexpected exceptions - Add detailed logging for tool execution success/failure - Add proper SQLite error checking in create_agent_run with error messages - Fix json_int/json_double to handle both numbers and numeric strings The json_int function was throwing exceptions when receiving numeric strings (e.g., "14" instead of 14) from clients, causing 500 errors. Now it handles both formats gracefully. Also added logging so tool failures are visible in logs instead of being silent 500 errors.
Exception handlers now log the full request payload that caused the error, making debugging much easier. Changes: - Move req_body/req_path declarations outside try block so catch handlers can access them - Log request payload in all exception handlers (parse errors, std::exception, and catch-all) - Log tool arguments when tool execution fails Previously, exceptions would only log the error message without context, making it impossible to reproduce the issue. Now the full payload is logged.
1. Fix error logging to catch ALL tool failures, not just those with
both success and result fields. Previously, error responses like
{"success": false, "error": "..."} without a result field were
silently ignored.
2. Fix llm.domain_set_members to accept both array and JSON string
formats for the members parameter. Some clients send it as a
JSON string, others as a native array.
3. Add detailed error logging for llm.domain_set_members failures,
including what was actually received.
- Add resolve_run_id() to map schema names to latest run_id - Update all catalog and LLM tools to accept schema names - Deprecate describe_table, table_profile, column_profile - Deprecate get_constraints, suggest_joins, find_reference_candidates - Keep sample_rows, sample_distinct for data preview
Tool was deprecated; users should use catalog.get_object instead.
- Query schemas from catalog's schemas table - Maintains same output format for compatibility - Removes dependency on live MySQL connection
Add stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset tables to track MCP query tool usage statistics. - Added get_tool_usage_stats_resultset() method to Query_Tool_Handler - Defined table schemas in ProxySQL_Admin_Tables_Definitions.h - Registered tables in Admin_Bootstrap.cpp - Added pattern matching in ProxySQL_Admin.cpp - Added stats___mcp_query_tools_counters() in ProxySQL_Admin_Stats.cpp - Fixed friend declaration for track_tool_invocation() - Fixed Discovery_Schema.cpp log_llm_search() to use prepare_v2/finalize
Remove the mcp-catalog_path configuration variable and hardcode the catalog database path to datadir/mcp_catalog.db for stability. Rationale: The catalog database is session state, not user configuration. Runtime swapping of the catalog could cause tables to be missed and the catalog to fail even if it was succeeding a second earlier. Changes: - Removed catalog_path from mcp_thread_variables_names array - Removed mcp_catalog_path from MCP_Thread variables struct - Removed getter/setter logic for catalog_path - Hardcoded catalog path to GloVars.datadir/mcp_catalog.db in: - ProxySQL_MCP_Server.cpp (Query_Tool_Handler initialization) - Admin_FlushVariables.cpp (MySQL_Tool_Handler reinitialization) - Updated VARIABLES.md to document the hardcoded path - Updated configure_mcp.sh to remove catalog_path configuration - Updated MCP README to remove catalog_path references
The MCP catalog database is now accessible as the 'mcp_catalog' schema from the ProxySQL Admin interface, enabling direct SQL queries against discovered schemas and LLM memories.
Extend the stats_mcp_query_tools_counters table with timing statistics (first_seen, last_seen, sum_time, min_time, max_time) following the same pattern as stats_mysql_query_digest. All timing values are in microseconds using monotonic_time(). New schema: - tool VARCHAR - schema VARCHAR - count INT - first_seen INTEGER (microseconds) - last_seen INTEGER (microseconds) - sum_time INTEGER (microseconds - total execution time) - min_time INTEGER (microseconds - minimum execution time) - max_time INTEGER (microseconds - maximum execution time)
Previously re_modifiers was hardcoded to 1 (CASELESS), ignoring the value stored in the database. Now properly reads from row->fields[7].
- Fully implemented rag.search_hybrid tool with both fuse and fts_then_vec modes
- Added complete filter support across all search tools (source_ids, source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after, created_before, min_score)
- Implemented proper score normalization (higher is better) for all search modes
- Updated all tool schemas to match blueprint specifications exactly
- Added metadata inclusion in search results
- Implemented Reciprocal Rank Fusion (RRF) scoring for hybrid search
- Enhanced error handling and input validation
- Added debug information for hybrid search ranking
- Updated documentation and created completion summary

This completes the v0 RAG implementation according to the blueprint requirements.
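The Reciprocal Rank Fusion scoring mentioned above can be illustrated with a minimal, self-contained sketch. This is not the handler's code: `rrf_fuse` is a hypothetical function, and k = 60 is only the conventional RRF constant; the actual constant and data structures in RAG_Tool_Handler may differ.

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal RRF sketch: a document's fused score is the sum of 1/(k + rank)
// over every ranked list (FTS, vector, ...) in which it appears; rank is 1-based.
std::map<std::string, double> rrf_fuse(
        const std::vector<std::vector<std::string>>& ranked_lists,
        double k = 60.0) {
    std::map<std::string, double> scores;
    for (const auto& list : ranked_lists) {
        for (size_t i = 0; i < list.size(); ++i) {
            scores[list[i]] += 1.0 / (k + static_cast<double>(i + 1));
        }
    }
    return scores;  // higher is better, matching the score normalization above
}
```

RRF needs only ranks, not raw scores, which is why it composes cleanly with FTS (BM25-style) and vector (distance-style) results without per-mode score calibration.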
- Enhanced inline Doxygen comments in RAG_Tool_Handler.h and RAG_Tool_Handler.cpp
- Added detailed parameter descriptions, return values, and cross-references
- Created Doxyfile for documentation generation
- Added documentation summary and guidelines
- Documented all RAG tools with their schemas and usage patterns
- Added security and performance considerations documentation

The RAG subsystem is now fully documented with comprehensive Doxygen comments that provide clear guidance for developers working with the codebase.
Add detailed function-level documentation to all MCP query rules, query digest, static harvester, and catalog components.

Static_Harvester.cpp:
- Document all 18+ harvest functions (schemas, objects, columns, indexes, FKs, views)
- Document lifecycle methods (init, close, connect, disconnect)
- Document helper methods (is_time_type, is_id_like_name)
- Document run management (start_run, finish_run, run_full_harvest)
- Document statistics methods (get_harvest_stats)

Query_Tool_Handler.cpp:
- Document JSON helper functions (json_string, json_int, json_double)
- Document digest tracking section with flow explanation

MySQL_Catalog.cpp:
- Document schema isolation architecture
- Document CRUD operations (upsert, get, search, list, remove, merge)

Discovery_Schema.cpp:
- Document MCP query rules evaluation (evaluate_mcp_query_rules)
- Document digest functions (compute_mcp_digest, fingerprint_mcp_args)
- Document update/get functions for rules and digests

ProxySQL_Admin_Stats.cpp:
- Document stats collection functions

ProxySQL_Admin.cpp:
- Document load/save functions for query rules

Admin_Handler.cpp:
- Document MCP query rules command handlers

include/ProxySQL_Admin_Tables_Definitions.h:
- Add comments explaining table purposes
- Remove Doxyfile that was added by mistake
- Remove .doxygen files from doc/ directory that were generated by mistake
- Clean up documentation artifacts that shouldn't have been committed

- Address SQL injection vulnerabilities by adding input validation and escaping
- Fix configuration variable handling in get_variable and set_variable methods for RAG variables
- Make embedding dimension configurable for rag_vec_chunks table
- Remove code duplication in SQL filter building logic by creating consolidated build_sql_filters function
- Update all search tools (FTS, vector, hybrid) to use consolidated filter building
- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Revert changes to main test/Makefile
- Update documentation to reflect new test structure
Fixed 6 legitimate issues from AI code review:

Critical fixes:
- Implement stats___mcp_query_digest to populate table with actual data
- Fix double trigger bug (stats_mcp_query_digest_reset substring matching)

Important fixes:
- Fix re_modifiers parsing (VARCHAR "CASELESS" -> int 1)
- Fix TOCTOU race condition in reset path (use write lock from start)
- Add column count validation before accessing row fields

Documentation:
- Add memory ownership documentation for evaluate_mcp_query_rules

False positives ignored (Issue 2: Schema mismatch, Issue 6: SQL injection)

- Move RAG test files to dedicated test/rag/ directory
- Create proper Makefile in test/rag/ directory
- Remove build_rag_test.sh script (replaced by Makefile)
- Update .gitignore to exclude test/rag/test_rag_schema executable
- Update documentation to reflect new test structure
…ove MAIN_PROXY_SQLITE3 from main.cpp
…ents (address code-review)
Use proxy_sqlite3_* wrappers and canonical proxy symbol TU
Add comprehensive RAG subsystem with Doxygen documentation [WIP]
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/MCP_Thread.cpp (1)
1-10: Missing include for `RAG_Tool_Handler.h`.

The file includes headers for all other tool handlers (`MySQL_Tool_Handler.h`, `Config_Tool_Handler.h`, etc.) but doesn't include `RAG_Tool_Handler.h`. While the forward declaration in `MCP_Thread.h` allows the pointer member to compile, the destructor on lines 127-130 calls `delete rag_tool_handler`, which requires the complete type definition.

🔧 Proposed fix

```diff
 #include "MCP_Thread.h"
 #include "MySQL_Tool_Handler.h"
 #include "Config_Tool_Handler.h"
 #include "Query_Tool_Handler.h"
 #include "Admin_Tool_Handler.h"
 #include "Cache_Tool_Handler.h"
 #include "Observe_Tool_Handler.h"
+#include "RAG_Tool_Handler.h"
 #include "proxysql_debug.h"
 #include "ProxySQL_MCP_Server.hpp"
```
🤖 Fix all issues with AI agents
In `@lib/Admin_Bootstrap.cpp`:
- Around line 726-736: The code calls mcpdb->open(...) and immediately enables
extension loading via (*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1)
without checking the open result; update the logic in the block around
mcpdb->open() (SQLite3DB::open) to capture and check its return value, log an
error and skip calling proxy_sqlite3_enable_load_extension and any further
attach/setup when open fails, and ensure you use mcpdb->get_db() only after a
successful open; include descriptive logging (via existing logging mechanism)
when open fails to make the early exit clear.
In `@lib/AI_Features_Manager.cpp`:
- Around line 300-315: Check GloGATH->variables.genai_vector_dimension before
using it to build create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large,
clamp it to a safe default (e.g., 1536) and log the correction via the same
logger used elsewhere (or emit a warning) so the fallback is visible; then use
the sanitized vector_dimension when constructing the vec0 DDL for rag_vec_chunks
to avoid silent failures.
In `@lib/Anomaly_Detector.cpp.bak`:
- Line 500: Replace the direct call to sqlite3_finalize(stmt) with the proxy
wrapper used elsewhere (e.g., proxy_sqlite3_finalize(stmt)) so the proxy layer's
logging/metrics/error handling is applied consistently; update any includes or
declarations if needed to access proxy_sqlite3_finalize and ensure the call
semantics and error checking remain identical to the original sqlite3_finalize
usage (use the same stmt variable and follow existing error/logging patterns
used with proxy_sqlite3_step).
- Around line 814-836: The code is calling sqlite3_prepare_v2, sqlite3_errmsg,
and sqlite3_finalize directly which breaks the proxy pattern; replace these
direct calls with the proxy wrappers used elsewhere: call
proxy_sqlite3_prepare_v2(db, query, -1, &stmt, NULL) and assign its return to
rc, use proxy_sqlite3_errmsg(db) in the proxy_error call, and call
proxy_sqlite3_finalize(stmt) at the end; keep the existing use of
proxy_sqlite3_step, proxy_sqlite3_column_text, proxy_sqlite3_column_int, and
proxy_sqlite3_column_int64 and ensure variable names (stmt, rc, query, db)
remain unchanged.
- Around line 920-938: This block mixes direct SQLite calls with proxy wrappers;
make all SQLite interactions consistent by replacing sqlite3_prepare_v2,
sqlite3_column_text, sqlite3_column_int, and sqlite3_finalize with their proxy
equivalents (e.g., proxy_sqlite3_prepare_v2, proxy_sqlite3_column_text,
proxy_sqlite3_column_int, proxy_sqlite3_finalize) so the loop that uses
proxy_sqlite3_step also uses proxy_* for prepare/column/finalize on the
variables stmt, db, type_query, and by_type.
- Around line 577-590: The static local int cleanup_counter in the cleanup block
inside update_user_statistics is not thread-safe; replace it with a
std::atomic<int> (e.g., declare static std::atomic<int> cleanup_counter{0}) and
perform an atomic increment/check (use fetch_add or ++cleanup_counter and
compare the result mod 1000) so multiple threads can't race on the counter; also
ensure the user_statistics cleanup loop remains protected by the same mutex/lock
used for accessing the user_statistics map (or document that user_statistics is
already synchronized) to avoid concurrent access issues.
- Around line 771-776: Replace direct SQLite calls with the proxy wrappers to
match the rest of the file: in the error branch where proxy_error("Anomaly:
Failed to insert pattern: %s\n", sqlite3_errmsg(db)); and where
sqlite3_finalize(stmt); are used, call the proxy wrappers (e.g.,
proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt)) instead so
proxy_error uses the proxy_sqlite3_errmsg result and both finalization calls use
proxy_sqlite3_finalize; locate these occurrences around the Anomaly pattern
insert logic where proxy_error, db, and stmt are referenced.
In `@lib/Discovery_Schema.cpp`:
- Around line 1211-1221: The SQL built in get_object (around the sql
ostringstream) and list_objects currently concatenates user-controlled
schema_name, object_name and object_type into the query; change both functions
to use prepared/parameterized statements instead of string concatenation:
replace the ostringstream construction with a SQL statement containing
placeholders (e.g., ? or named parameters) for run_id, object_id (when used),
schema_name, object_name and object_type, then bind those variables with the DB
API's bind/prepare calls before executing; ensure proper types are used for
numeric vs string parameters and remove any remaining direct concatenation of
schema_name/object_name/object_type.
- Around line 30-38: Discovery_Schema::init currently uses a C99 VLA (char
path_buf[db_path.size() + 1]) and leaks the heap-allocated db if db->open(...)
fails; replace the VLA with a std::vector<char> (or std::string::data copy)
sized db_path.size()+1 and use that buffer when calling db->open, and ensure you
free the allocated SQLite3DB (delete db) on error (or better, change db to a
smart pointer) before returning after proxy_error to avoid the resource leak.
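The VLA replacement suggested here is small enough to sketch standalone. `to_cstr_buffer` is a hypothetical helper for illustration, not part of Discovery_Schema.

```cpp
#include <string>
#include <vector>

// A std::vector<char> sized at runtime is portable C++ (VLAs are a C99
// feature that C++ compilers only accept as an extension) and is freed
// automatically on every exit path, unlike a raw heap allocation.
std::vector<char> to_cstr_buffer(const std::string& path) {
    std::vector<char> buf(path.size() + 1);
    path.copy(buf.data(), path.size());
    buf[path.size()] = '\0';  // NUL-terminate for C APIs expecting char*
    return buf;
}
```

The RAII point generalizes to the leak the prompt mentions: holding `db` in a smart pointer means the error path after `proxy_error` cannot leak it either.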
- Around line 1988-1993: The fts_search_llm() code is vulnerable because it
concatenates the user-supplied query into the MATCH clause; change it to use a
prepared statement and parameter binding instead of string concatenation: build
the SQL with "WHERE f.fts_llm MATCH ?1" (keep the rest of the SELECT/LEFT JOIN
and ORDER BY score LIMIT part), prepare the statement, bind the C++ string query
to parameter 1 using sqlite3_bind_text (or proxy_sqlite3_bind_text) and execute;
also ensure the LIMIT value (limit) is either bound as a validated integer or
strictly validated before concatenation to prevent injection via limit.
In `@lib/ProxySQL_MCP_Server.cpp`:
- Around line 78-98: The Query_Tool_Handler instance is set to NULL on init()
failure but the /mcp/query resource is still registered later; change the flow
so that registration and any increment of the endpoint counter only occur when
handler->query_tool_handler->init() returns success (non-zero indicates
failure), i.e., after new Query_Tool_Handler(...) call check init(), and only on
successful init register the /mcp/query endpoint and adjust any endpoint
counting; on failure leave handler->query_tool_handler NULL and skip
registration to avoid null deref or a broken endpoint.
In `@lib/RAG_Tool_Handler.cpp`:
- Around line 2316-2330: The doc_id list is concatenated into SQL in
rag.get_docs and rag.fetch_from_source (see the doc_ids loop that builds
doc_list and the sql string using "WHERE ... IN (" + doc_list + ")"), which is
vulnerable to SQL injection and fails on quotes; fix by replacing direct
concatenation with a parameterized IN clause using execute_parameterized_query()
and bound parameters (generate placeholders ?1,?2,... from doc_ids and bind each
doc_ids[i]) or, if you prefer the validation approach, validate every entry in
doc_ids against the same filter used in the filter handlers (reject or escape
entries containing '\'', '\\', ';') before building the query; ensure you update
both rag.get_docs and rag.fetch_from_source to use the same safe pattern.
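Generating the placeholder list for the parameterized IN clause is mechanical; here is a hypothetical helper sketch (`make_in_placeholders` is not an existing function in the handler).

```cpp
#include <cstddef>
#include <string>

// Build "?1,?2,...,?n" so each doc_id is bound individually with
// sqlite3_bind_text instead of being concatenated into the SQL text.
std::string make_in_placeholders(std::size_t n) {
    std::string out;
    for (std::size_t i = 1; i <= n; ++i) {
        if (i > 1) out += ',';
        out += '?';
        out += std::to_string(i);
    }
    return out;
}
```

The caller would then splice the result into `"WHERE doc_id IN (" + make_in_placeholders(doc_ids.size()) + ")"` and bind each element; only placeholder syntax, never user data, reaches the SQL string.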
- Around line 1490-1504: The hybrid weights w_fts and w_vec are parsed with
get_json_int which truncates fractional values; update both the "fuse" path and
the fallback path to parse these weights as double (use the existing
get_json_double or equivalent helper instead of get_json_int) so fractional
weights like 0.5 are preserved; ensure you change the lines setting w_fts and
w_vec in the block that reads fuse_params and the fallback block that reads
arguments, keeping the same variable names (w_fts, w_vec) and types.
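The truncation described here is easy to reproduce with the standard conversion functions. These stand-ins only mimic the conversion step of the handler's `get_json_int`/`get_json_double` helpers; they are not the real implementations.

```cpp
#include <cstdlib>
#include <string>

// Integer-style parse: a JSON weight like 0.5 collapses to 0,
// silently zeroing out that ranking arm in the hybrid fusion.
int parse_weight_as_int(const std::string& v) {
    return std::atoi(v.c_str());             // "0.5" -> 0 (fraction discarded)
}

// Double-style parse preserves fractional weights.
double parse_weight_as_double(const std::string& v) {
    return std::strtod(v.c_str(), nullptr);  // "0.5" -> 0.5
}
```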
- Around line 1150-1159: The SQL FTS query is built by concatenating the
user-provided variable `query` into the `sql` string (the SELECT from
`rag_fts_chunks` / JOIN `rag_chunks` block), which fails to escape quotes and is
vulnerable; change this to use a prepared statement with a parameter placeholder
in the MATCH clause (e.g. `... WHERE f MATCH ?1`) and bind the user input with
`sqlite3_bind_text()` before stepping the statement (prepare with
`sqlite3_prepare_v2`, bind with `sqlite3_bind_text`, execute with
`sqlite3_step`, and finalize with `sqlite3_finalize`). Apply the same change for
the other blocks referenced (the other `sql` constructions at the later ranges)
so all FTS MATCH usages use parameter binding instead of raw concatenation.
- Around line 401-455: The prepared statement and bound parameters are never
used because you call vector_db->execute_statement(query, ...); instead step the
prepared sqlite3_stmt* from prepare_result.second so bindings are applied, and
when constructing a SQLite3_result from the stmt do not leave the unique_ptr
owning it — release ownership with prepare_result.second.release() and pass the
raw sqlite3_stmt* into SQLite3_result so the result object owns and finalizes
the statement; remove the execute_statement(...) call and replace it with
creating/returning the SQLite3_result from the released stmt after collecting
any needed error/cols/affected_rows info.
In `@RAG_POC/rag_ingest.cpp`:
- Around line 686-698: The assignments reading columns into struct s are
vulnerable to null pointer dereference because sqlite3_column_text can return
NULL; update the code that sets s.name, s.backend_type, s.host, s.user, s.pass,
s.db, s.table_name, s.pk_column and s.where_sql to first call
sqlite3_column_text into a temporary const unsigned char* (or const char*),
check for NULL (or use sqlite3_column_type to detect SQLITE_NULL) and only
assign to the std::string members when non-NULL, otherwise set them to an empty
string (or another safe default); do the same null-check pattern for text
columns used by functions like sqlite3_column_text while keeping integer reads
(sqlite3_column_int for s.enabled and s.port) as-is or guarding them with
sqlite3_column_type if you want to detect NULLs explicitly.
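The null-check pattern recommended above can be factored into a tiny wrapper; a hypothetical sketch (`safe_text` is illustrative, not an existing helper in rag_ingest.cpp):

```cpp
#include <string>

// sqlite3_column_text() returns NULL for SQL NULL, and constructing a
// std::string from a null pointer is undefined behavior. Funnel every
// text-column read through one safe conversion instead.
std::string safe_text(const unsigned char* p) {
    return p ? reinterpret_cast<const char*>(p) : "";
}
```

Each assignment then becomes e.g. `s.name = safe_text(sqlite3_column_text(stmt, 1));`, keeping the call sites as terse as the unsafe originals.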
- Around line 901-905: The code currently sets now_epoch to 0 which yields
invalid timestamps for rag_vec_chunks; replace the hardcoded value by capturing
a real epoch once per run (e.g., use std::time(nullptr) / time(NULL) and cast to
std::int64_t) and reuse that value when populating updated_at (variable
now_epoch), updating any affected functions that rely on now_epoch; ensure the
translation unit includes <ctime> (or <time.h>) if not already present so the
build succeeds.
- Around line 468-479: build_select_sql concatenates RagSource::where_sql and
raw identifiers, creating SQL injection/identifier-injection risks; update
build_select_sql to (1) escape backticks inside identifiers used for cols and
s.table_name (replace any '`' with '``') before wrapping with backticks, and (2)
stop directly concatenating s.where_sql: either accept a parameterized WHERE
with placeholders and require callers to supply bound parameters, or validate
allowlisted characters/patterns (e.g., only simple column comparisons with a
strict regex) and reject/return an error for unsafe values; update function
signature or call sites as needed to support bound parameters or validation on
RagSource::where_sql.
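The identifier-escaping step (double any backtick, then wrap in backticks) can be sketched as a standalone function; `quote_identifier` is hypothetical and only covers the identifier part of the fix, not the `where_sql` parameterization.

```cpp
#include <string>

// Escape backticks inside a MySQL identifier and wrap it, so a crafted
// table/column name cannot break out of the quoted identifier.
std::string quote_identifier(const std::string& ident) {
    std::string out = "`";
    for (char c : ident) {
        out += c;
        if (c == '`') out += '`';  // ` -> ``
    }
    out += '`';
    return out;
}
```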
In `@RAG_POC/schema.sql`:
- Around line 14-45: The rag_sources table currently stores plaintext
credentials in the backend_pass column; instead migrate to a secure approach by
removing plaintext storage and either (A) storing only a reference to an
external secret manager (add a new column like backend_secret_ref) and update
any code that reads rag_sources to fetch the secret from the secret manager, or
(B) store an encrypted value (add backend_pass_encrypted) and implement
symmetric key encryption/decryption in the service layer to read/write via
functions that use a KMS-held key; update schema changes for rag_sources, add a
migration to backfill secrets (move existing backend_pass values into the secret
store or encrypt them) and remove or null out backend_pass, and update codepaths
that reference backend_pass to use backend_secret_ref or decrypt
backend_pass_encrypted instead.
♻️ Duplicate comments (7)
lib/ProxySQL_Admin_Stats.cpp (1)
1587-1617: SQL injection vulnerability and incorrect table name logic.

This issue was previously flagged: the INSERT statement has two problems:

1. Wrong table name: When `reset=true`, the prefix logic produces `"stats_mcp_query_tools_counters_mcp_query_tools_counters"` instead of the intended `"stats_mcp_query_tools_counters_reset"`.
2. SQL injection risk: `r->fields[0]` and `r->fields[1]` (tool name and schema) are directly interpolated into the query without escaping. Malicious values could corrupt or exploit the database.

🔒 Proposed fix using prepared statements
```diff
-	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-	     it != resultset->rows.end(); ++it) {
-		SQLite3_row* r = *it;
-		char query[1024];
-		snprintf(query, sizeof(query),
-			"INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-			reset ? "stats_mcp_query_tools_counters_" : "stats_",
-			r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-			r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-		statsdb->execute(query);
-	}
+	const char* table = reset
+		? "stats_mcp_query_tools_counters_reset"
+		: "stats_mcp_query_tools_counters";
+	char query[256];
+	snprintf(query, sizeof(query),
+		"INSERT INTO %s VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)", table);
+
+	sqlite3_stmt* stmt = nullptr;
+	int rc = statsdb->prepare_v2(query, &stmt);
+	ASSERT_SQLITE_OK(rc, statsdb);
+
+	for (SQLite3_row* r : resultset->rows) {
+		rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT);
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6]));
+		rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7]));
+		SAFE_SQLITE3_STEP2(stmt);
+		(*proxy_sqlite3_clear_bindings)(stmt);
+		(*proxy_sqlite3_reset)(stmt);
+	}
+	(*proxy_sqlite3_finalize)(stmt);
```

lib/Discovery_Schema.cpp (6)
12-18: `gmtime()` is not thread-safe. Same issue as previously noted: `gmtime()` uses shared static storage; switch to `gmtime_r`/`gmtime_s`.

76-88: Use `delete` for `SQLite3_result` allocated with `new`. Same issue as previously noted: `execute_statement()` allocates with `new`, so `free(resultset)` is undefined behavior.

589-598: `get_run_info` reads columns off by one. Same issue as previously noted: field indices are shifted relative to the SELECT list.

970-987: `?1` placeholders aren't bound in `update_object_flags`. Same issue as previously noted: `db->execute()` doesn't bind parameters, so `run_id` is ignored.

1162-1171: FTS query strings are concatenated directly. Same issue as previously noted: `query`/`object_type`/`schema_name` are interpolated without escaping or binding.

1761-1763: `domain_key` concatenation is injection-prone. Same issue as previously noted: `domain_key` is interpolated directly into SQL.
🧹 Nitpick comments (11)
RAG_POC/rag_ingest.cpp (1)
384-404: MySQL connection lacks timeout configuration. Without a connection timeout, `mysql_real_connect` can block indefinitely if the MySQL server is unreachable. Consider adding `MYSQL_OPT_CONNECT_TIMEOUT`.

♻️ Suggested improvement

```diff
 static MYSQL* mysql_connect_or_die(const RagSource& s) {
     MYSQL* conn = mysql_init(nullptr);
     if (!conn) fatal("mysql_init failed");

     // Set utf8mb4 for safety with StackOverflow-like content
     mysql_options(conn, MYSQL_SET_CHARSET_NAME, "utf8mb4");
+
+    // Set connection timeout to avoid indefinite blocking
+    unsigned int timeout = 30; // seconds
+    mysql_options(conn, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);

     if (!mysql_real_connect(conn,
```

lib/Anomaly_Detector.cpp.bak (1)
136-163: Consider caching compiled regex patterns for performance.
`normalize_query` is likely called on the hot path for every query. Compiling `std::regex` objects on each call is expensive. Consider pre-compiling and caching these patterns.

♻️ Suggested improvement

```cpp
// At file scope or as class members:
static const std::regex comment_regex("--.*?$|/\\*.*?\\*/", std::regex::multiline);
static const std::regex string_regex("'[^']*'|\"[^\"]*\"");
static const std::regex numeric_regex("\\b\\d+\\b");
static const std::regex whitespace_regex("\\s+");

// Then in normalize_query, use these pre-compiled patterns
```

lib/ProxySQL_Admin_Stats.cpp (2)
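A self-contained sketch of the cached-regex approach follows. The patterns are simplified stand-ins, not the detector's actual ones, and `normalize_query` here is an illustrative free function; function-local `static const` regexes are initialized exactly once, thread-safely, since C++11.

```cpp
#include <regex>
#include <string>

// Sketch: compile each pattern once and reuse it on every call,
// instead of reconstructing std::regex objects per query.
std::string normalize_query(const std::string& q) {
    static const std::regex string_re("'[^']*'|\"[^\"]*\"");  // quoted literals
    static const std::regex number_re("\\b\\d+\\b");          // numeric literals
    static const std::regex space_re("\\s+");                 // collapse whitespace
    std::string s = std::regex_replace(q, string_re, "?");
    s = std::regex_replace(s, number_re, "?");
    s = std::regex_replace(s, space_re, " ");
    return s;
}
```

The ordering matters: strings are replaced before numbers so digits inside quoted literals are never visible to the numeric pass.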
2308-2309: Minor: Duplicate commented code. Line 2308 has a duplicate commented-out line that appears to be a copy-paste artifact.

🧹 Cleanup suggestion

```diff
 statsdb->execute("DELETE FROM stats_mysql_prepared_statements_info");
 query1=(char *)"INSERT INTO stats_mysql_prepared_statements_info VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9)";
 query32s = "INSERT INTO stats_mysql_prepared_statements_info VALUES " + generate_multi_rows_query(32,9);
 query32 = (char *)query32s.c_str();
-//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
-//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
+//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);
 rc = statsdb->prepare_v2(query1, &statement1);
```
2375-2376: Minor: Same duplicate comment issue in pgsql version. Same copy-paste artifact as in the MySQL version.

RAG_POC/architecture-data-model.md (1)
RAG_POC/architecture-data-model.md (1)
203-203: Minor: Trailing space in code span. There's a trailing space inside the code span that triggers a markdown lint warning.

🔧 Fix

```diff
-- `{"lit":"..."} ` — appends a literal string
+- `{"lit":"..."}` — appends a literal string
```

lib/sqlite3db.cpp (1)
1102-1108: Fix inconsistent indentation. Lines 1102-1108 have inconsistent indentation compared to the surrounding code. The assignments should be indented with tabs to match the style of adjacent lines (e.g., line 1101 and 1109).

♻️ Suggested fix for indentation

```diff
 	proxy_sqlite3_column_count = sqlite3_column_count;
 	proxy_sqlite3_column_int = sqlite3_column_int;
-    proxy_sqlite3_column_int64 = sqlite3_column_int64;
-    proxy_sqlite3_column_double = sqlite3_column_double;
-    proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
-    proxy_sqlite3_errstr = sqlite3_errstr;
-    proxy_sqlite3_db_handle = sqlite3_db_handle;
-    proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
-    proxy_sqlite3_auto_extension = sqlite3_auto_extension;
+	proxy_sqlite3_column_int64 = sqlite3_column_int64;
+	proxy_sqlite3_column_double = sqlite3_column_double;
+	proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid;
+	proxy_sqlite3_errstr = sqlite3_errstr;
+	proxy_sqlite3_db_handle = sqlite3_db_handle;
+	proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension;
+	proxy_sqlite3_auto_extension = sqlite3_auto_extension;
 	proxy_sqlite3_errmsg = sqlite3_errmsg;
```

lib/GenAI_Thread.cpp (2)
688-691: Inconsistent boolean parsing for `rag_enabled`. The `rag_enabled` setter accepts both `"true"` and `"1"`, while other boolean setters in this file (e.g., `enabled`, `llm_enabled`, `anomaly_enabled` at lines 552-563) only accept `"true"`. This inconsistency could confuse users.

Consider either:
- Aligning `rag_enabled` with existing setters (only `"true"`), or
- Updating all boolean setters to accept both `"true"` and `"1"` for consistency.

♻️ Option 1: Align with existing pattern

```diff
 if (!strcmp(name, "rag_enabled")) {
-    variables.genai_rag_enabled = (strcmp(value, "true") == 0 || strcmp(value, "1") == 0);
+    variables.genai_rag_enabled = (strcmp(value, "true") == 0);
     return true;
 }
```
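The second option above amounts to one shared helper that every boolean setter calls instead of open-coding its own comparison; a hypothetical sketch (`parse_bool_flag` is not an existing function in GenAI_Thread.cpp):

```cpp
#include <cstring>

// Single source of truth for boolean variable parsing:
// accepts "true" and "1", rejects everything else.
bool parse_bool_flag(const char* value) {
    return std::strcmp(value, "true") == 0 || std::strcmp(value, "1") == 0;
}
```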
728-736: Consider raising minimum timeout threshold. The minimum value of 1ms for `rag_timeout_ms` is extremely low and could cause spurious timeouts in production. Even local SQLite operations may exceed 1ms under load.

Consider raising the minimum to a more practical value like 100ms, consistent with other timeout validations in this file (e.g., `embedding_timeout_ms` uses the 100-300000 range at line 534).

♻️ Suggested fix

```diff
 if (!strcmp(name, "rag_timeout_ms")) {
     int val = atoi(value);
-    if (val < 1 || val > 60000) {
-        proxy_error("Invalid value for rag_timeout_ms: %d (must be 1-60000)\n", val);
+    if (val < 100 || val > 60000) {
+        proxy_error("Invalid value for rag_timeout_ms: %d (must be 100-60000)\n", val);
         return false;
     }
     variables.genai_rag_timeout_ms = val;
     return true;
 }
```

RAG_POC/embeddings-design.md (1)
42-47: Consider adding language specifier to fenced code block. The embedding input template example lacks a language specifier. While this is intentionally plain text, adding a language identifier (e.g., `text` or `plaintext`) would satisfy linters and improve rendering consistency.

♻️ Suggested fix

````diff
-```
+```text
 {Title}
 Tags: {Tags}
 {ChunkBody}
````

RAG_IMPLEMENTATION_SUMMARY.md (1)

1-129: Consider consolidating summary documents. This file contains substantial overlap with `RAG_FILE_SUMMARY.md` and `RAG_COMPLETION_SUMMARY.md`. Having three separate summary documents with similar content may lead to maintenance burden and potential inconsistencies over time. Consider consolidating into a single comprehensive summary document or clearly differentiating the purpose of each file (e.g., one for developers, one for release notes).

lib/RAG_Tool_Handler.cpp (1)

34-43: Remove the duplicate `GloGATH` forward declaration. The `extern GenAI_Threads_Handler *GloGATH;` appears twice back-to-back. Dropping the duplicate reduces noise.
In lib/Admin_Bootstrap.cpp:

```cpp
mcpdb = new SQLite3DB();
std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);

/**
 * @brief Enable SQLite extension loading for MCP catalog database
 *
 * Allows loading SQLite extensions at runtime. This enables sqlite-vec to be
 * registered for vector similarity searches in the catalog.
 */
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
```
Check mcpdb->open() result before attaching.
mcpdb->open() return code is ignored; if it fails, the subsequent attach can misbehave. Please handle the error and skip attachment/log accordingly.
✅ Suggested check
- mcpdb = new SQLite3DB();
- std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
- mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+ mcpdb = new SQLite3DB();
+ std::string mcp_catalog_path = std::string(GloVars.datadir) + "/mcp_catalog.db";
+ int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+ SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+ if (mcp_rc != SQLITE_OK) {
+ proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+ delete mcpdb;
+ mcpdb = nullptr;
+ }

🤖 Prompt for AI Agents
In `@lib/Admin_Bootstrap.cpp` around lines 726 - 736, The code calls
mcpdb->open(...) and immediately enables extension loading via
(*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1) without checking the
open result; update the logic in the block around mcpdb->open()
(SQLite3DB::open) to capture and check its return value, log an error and skip
calling proxy_sqlite3_enable_load_extension and any further attach/setup when
open fails, and ensure you use mcpdb->get_db() only after a successful open;
include descriptive logging (via existing logging mechanism) when open fails to
make the early exit clear.
In lib/AI_Features_Manager.cpp:

```cpp
// rag_vec_chunks: sqlite3-vec index
// Use configurable vector dimension from GenAI module
int vector_dimension = 1536; // Default value
if (GloGATH) {
    vector_dimension = GloGATH->variables.genai_vector_dimension;
}

std::string create_rag_vec_chunks_sql =
    "CREATE VIRTUAL TABLE IF NOT EXISTS rag_vec_chunks USING vec0("
    "embedding float(" + std::to_string(vector_dimension) + "), "
    "chunk_id TEXT, "
    "doc_id TEXT, "
    "source_id INTEGER, "
    "updated_at INTEGER"
    ");";
```
Validate genai_vector_dimension before building vec0 DDL.
If the dimension is 0 or negative, the rag_vec_chunks creation will fail and silently disable vector search. Please clamp or fall back to a safe default and log the correction.
✅ Suggested guard
- int vector_dimension = 1536; // Default value
- if (GloGATH) {
- vector_dimension = GloGATH->variables.genai_vector_dimension;
- }
+ int vector_dimension = 1536; // Default value
+ if (GloGATH) {
+ const int cfg_dim = GloGATH->variables.genai_vector_dimension;
+ if (cfg_dim > 0) {
+ vector_dimension = cfg_dim;
+ } else {
+ proxy_warning("AI: Invalid genai_vector_dimension=%d; using default %d\n",
+ cfg_dim, vector_dimension);
+ }
+ }

🤖 Prompt for AI Agents
In `@lib/AI_Features_Manager.cpp` around lines 300 - 315, Check
GloGATH->variables.genai_vector_dimension before using it to build
create_rag_vec_chunks_sql: if it's <= 0 or unreasonably large, clamp it to a
safe default (e.g., 1536) and log the correction via the same logger used
elsewhere (or emit a warning) so the fallback is visible; then use the sanitized
vector_dimension when constructing the vec0 DDL for rag_vec_chunks to avoid
silent failures.
In lib/Anomaly_Detector.cpp.bak:

```cpp
        pattern_name ? pattern_name : "unknown", result.risk_score);
}

sqlite3_finalize(stmt);
```
Inconsistent SQLite API usage: missing proxy wrapper.
Line 500 uses sqlite3_finalize(stmt) directly, but nearby code uses proxy wrappers like proxy_sqlite3_step. This inconsistency could cause issues if the proxy layer adds logging, metrics, or error handling.
🐛 Proposed fix
- sqlite3_finalize(stmt);
+ (*proxy_sqlite3_finalize)(stmt);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sqlite3_finalize(stmt); | |
| (*proxy_sqlite3_finalize)(stmt); |
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` at line 500, Replace the direct call to
sqlite3_finalize(stmt) with the proxy wrapper used elsewhere (e.g.,
proxy_sqlite3_finalize(stmt)) so the proxy layer's logging/metrics/error
handling is applied consistently; update any includes or declarations if needed
to access proxy_sqlite3_finalize and ensure the call semantics and error
checking remain identical to the original sqlite3_finalize usage (use the same
stmt variable and follow existing error/logging patterns used with
proxy_sqlite3_step).
// Cleanup old entries periodically
static int cleanup_counter = 0;
if (++cleanup_counter % 1000 == 0) {
    uint64_t current_time = (uint64_t)time(NULL);
    auto it = user_statistics.begin();
    while (it != user_statistics.end()) {
        if (current_time - it->second.last_query_time > USER_STATS_WINDOW * 2) {
            it = user_statistics.erase(it);
        } else {
            ++it;
        }
    }
}
Static local variable cleanup_counter is not thread-safe.
Multiple threads calling update_user_statistics can race on the static cleanup_counter variable, potentially causing cleanup to run less frequently than intended or creating undefined behavior.
🐛 Proposed fix: Use atomic counter
+#include <atomic>
+
+// At class level or file scope:
+static std::atomic<int> cleanup_counter{0};
+
 void Anomaly_Detector::update_user_statistics(const QueryFingerprint& fp) {
     // ... existing code ...
     // Cleanup old entries periodically
-    static int cleanup_counter = 0;
-    if (++cleanup_counter % 1000 == 0) {
+    if (cleanup_counter.fetch_add(1, std::memory_order_relaxed) % 1000 == 0) {
         // ... cleanup logic (consider adding mutex for user_statistics map) ...

Note: The user_statistics map itself may also need synchronization if accessed from multiple threads.
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 577-590, the static local int
cleanup_counter in the cleanup block inside update_user_statistics is not
thread-safe; replace it with a std::atomic<int> (e.g., declare static
std::atomic<int> cleanup_counter{0}) and perform an atomic increment/check (use
fetch_add or ++cleanup_counter and compare the result mod 1000) so multiple
threads can't race on the counter; also ensure the user_statistics cleanup loop
remains protected by the same mutex/lock used for accessing the user_statistics
map (or document that user_statistics is already synchronized) to avoid
concurrent access issues.
    proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
    sqlite3_finalize(stmt);
    return -1;
}

sqlite3_finalize(stmt);
Inconsistent proxy wrapper usage continues.
Lines 771-772 and 776 use sqlite3_errmsg and sqlite3_finalize directly instead of their proxy equivalents used elsewhere in this file.
🐛 Proposed fix
 rc = (*proxy_sqlite3_step)(stmt);
 if (rc != SQLITE_DONE) {
-    proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db));
-    sqlite3_finalize(stmt);
+    proxy_error("Anomaly: Failed to insert pattern: %s\n", (*proxy_sqlite3_errmsg)(db));
+    (*proxy_sqlite3_finalize)(stmt);
     return -1;
 }
-sqlite3_finalize(stmt);
+(*proxy_sqlite3_finalize)(stmt);
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 771-776, replace direct SQLite
calls with the proxy wrappers to match the rest of the file: in the error branch
where proxy_error("Anomaly: Failed to insert pattern: %s\n",
sqlite3_errmsg(db)); and where sqlite3_finalize(stmt); are used, call the proxy
wrappers (e.g., proxy_sqlite3_errmsg(db) and proxy_sqlite3_finalize(stmt))
instead so proxy_error uses the proxy_sqlite3_errmsg result and both
finalization calls use proxy_sqlite3_finalize; locate these occurrences around
the Anomaly pattern insert logic where proxy_error, db, and stmt are referenced.
// Build doc ID list for SQL
std::string doc_list = "'";
for (size_t i = 0; i < doc_ids.size(); ++i) {
    if (i > 0) doc_list += "','";
    doc_list += doc_ids[i];
}
doc_list += "'";

// Build query
std::string sql = "SELECT doc_id, source_id, "
    "(SELECT name FROM rag_sources WHERE source_id = rag_documents.source_id) as source_name, "
    "pk_json, title, body, metadata_json "
    "FROM rag_documents "
    "WHERE doc_id IN (" + doc_list + ")";
🧩 Analysis chain
Verification scripts run against the repository (ProxySQL/proxysql-vec):
# Check the file and read the two affected locations
wc -l lib/RAG_Tool_Handler.cpp
sed -n '2310,2335p' lib/RAG_Tool_Handler.cpp | cat -n
sed -n '2395,2420p' lib/RAG_Tool_Handler.cpp | cat -n
# Search for chunk_id validation/escaping and SQLite binding patterns
rg "chunk_id" lib/RAG_Tool_Handler.cpp -B 3 -A 3 | head -60
rg "bind|sqlite3_bind" lib/RAG_Tool_Handler.cpp | head -20
# Locate the rag.get_docs and rag.fetch_from_source implementations
rg "rag\.get_docs" lib/RAG_Tool_Handler.cpp -B 5 -A 30
rg "rag\.fetch_from_source" lib/RAG_Tool_Handler.cpp -B 5 -A 30
# Inspect the doc_list construction and doc_id SQL injection patterns
rg "doc_list" lib/RAG_Tool_Handler.cpp -B 5 -A 15
rg "doc_id IN|WHERE doc_id" lib/RAG_Tool_Handler.cpp -B 3 -A 3
Apply SQL injection mitigation to doc_id lists in rag.get_docs and rag.fetch_from_source
The code concatenates doc_ids directly into SQL queries without validation or parameterization. This is injection-prone and breaks on single quotes. Use the validation pattern found in the filter handlers (checking for ', \, ;) or migrate to the existing execute_parameterized_query() function with bound parameters (?1, ?2, ...).
Affects both:
- rag.get_docs: "WHERE doc_id IN (" + direct concatenation
- rag.fetch_from_source: identical pattern in the "WHERE d.doc_id IN (" clause
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 2316-2330, the doc_id list is
concatenated into SQL in rag.get_docs and rag.fetch_from_source (see the doc_ids
loop that builds doc_list and the sql string using "WHERE ... IN (" + doc_list +
")"), which is vulnerable to SQL injection and fails on quotes; fix by replacing
direct concatenation with a parameterized IN clause using
execute_parameterized_query() and bound parameters (generate placeholders
?1,?2,... from doc_ids and bind each doc_ids[i]) or, if you prefer the
validation approach, validate every entry in doc_ids against the same filter
used in the filter handlers (reject or escape entries containing '\'', '\\',
';') before building the query; ensure you update both rag.get_docs and
rag.fetch_from_source to use the same safe pattern.
static std::string build_select_sql(const RagSource& s, const std::vector<std::string>& cols) {
    std::string sql = "SELECT ";
    for (size_t i = 0; i < cols.size(); i++) {
        if (i) sql += ", ";
        sql += "`" + cols[i] + "`";
    }
    sql += " FROM `" + s.table_name + "`";
    if (!s.where_sql.empty()) {
        sql += " WHERE " + s.where_sql;
    }
    return sql;
}
🧩 Analysis chain
Verification scripts run against the repository (ProxySQL/proxysql-vec); most searches for validation or whitelisting logic returned no matches:
#!/bin/bash
# Find the RagSource definition and where rag_sources is populated/loaded
rg -n "struct RagSource|class RagSource" --type=cpp -A 10
rg -n "rag_sources" --type=cpp -B 3 -A 3 | head -100
# Check for identifier validation/sanitization helpers and build_select_sql call sites
rg -n "validate|sanitize|escape" --type=cpp -i | grep -i "sql\|identifier\|column\|table"
rg -n "build_select_sql" --type=cpp -B 5 -A 5
rg -n "is_identifier|is_valid_identifier|validate_identifier" --type=cpp -i
# Look for admin/API endpoints or SQL statements that modify rag_sources
rg -n "INSERT.*rag_sources|UPDATE.*rag_sources|DELETE.*rag_sources" --type=cpp
rg -n "rag_sources" lib/ --type=cpp | grep -i "insert\|update\|delete\|execute\|prepare"
find . -name "*Admin*" -o -name "*Tool*Handler*" | xargs rg -l "rag" 2>/dev/null
# Inspect how where_sql is loaded and used, and whether it is ever validated
rg -n "where_sql" --type=cpp | head -50
sed -n '468,480p' RAG_POC/rag_ingest.cpp
sed -n '670,750p' RAG_POC/rag_ingest.cpp
sed -n '714,730p' RAG_POC/rag_ingest.cpp
# Review the rag_sources schema and RAG documentation for access-control notes
sed -n '163,190p' lib/AI_Features_Manager.cpp
head -200 ./doc/rag-documentation.md
grep -r "rag_sources" ./doc/ ./RAG_POC/ -l 2>/dev/null | xargs grep -A 5 -B 5 "INSERT\|UPDATE\|modify\|admin" -i | head -50
Add WHERE clause SQL injection validation; backtick-quoted identifiers need backtick escaping.
The where_sql field at line 476 is concatenated directly into the SQL query without validation, creating a SQL injection risk. Additionally, while column and table names are backtick-quoted (lines 472, 474), backticks within the identifier values themselves are not escaped—a malicious identifier like `col`name` would break the quoting.
Since rag_sources is loaded from the internal SQLite database and no public APIs were found to modify it, the actual attack surface is limited to trusted administrators. However, the code should defensively validate where_sql content (e.g., using a prepared statement or allowlist for WHERE conditions) and properly escape backticks in identifiers to prevent identifier injection.
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 468-479, build_select_sql concatenates
RagSource::where_sql and raw identifiers, creating SQL
injection/identifier-injection risks; update build_select_sql to (1) escape
backticks inside identifiers used for cols and s.table_name (replace any '`'
with '``') before wrapping with backticks, and (2) stop directly concatenating
s.where_sql: either accept a parameterized WHERE with placeholders and require
callers to supply bound parameters, or validate allowlisted characters/patterns
(e.g., only simple column comparisons with a strict regex) and reject/return an
error for unsafe values; update function signature or call sites as needed to
support bound parameters or validation on RagSource::where_sql.
s.name = (const char*)sqlite3_column_text(st, 1);
s.enabled = sqlite3_column_int(st, 2);

s.backend_type = (const char*)sqlite3_column_text(st, 3);
s.host = (const char*)sqlite3_column_text(st, 4);
s.port = sqlite3_column_int(st, 5);
s.user = (const char*)sqlite3_column_text(st, 6);
s.pass = (const char*)sqlite3_column_text(st, 7);
s.db = (const char*)sqlite3_column_text(st, 8);

s.table_name = (const char*)sqlite3_column_text(st, 9);
s.pk_column = (const char*)sqlite3_column_text(st, 10);
s.where_sql = (const char*)sqlite3_column_text(st, 11);
Potential null pointer dereference from sqlite3_column_text.
sqlite3_column_text can return NULL for NULL column values or on allocation failure. Directly assigning to std::string without a null check will cause undefined behavior if NULL is returned.
🐛 Proposed fix: Add null checks
- s.name = (const char*)sqlite3_column_text(st, 1);
- s.enabled = sqlite3_column_int(st, 2);
-
- s.backend_type = (const char*)sqlite3_column_text(st, 3);
- s.host = (const char*)sqlite3_column_text(st, 4);
- s.port = sqlite3_column_int(st, 5);
- s.user = (const char*)sqlite3_column_text(st, 6);
- s.pass = (const char*)sqlite3_column_text(st, 7);
- s.db = (const char*)sqlite3_column_text(st, 8);
-
- s.table_name = (const char*)sqlite3_column_text(st, 9);
- s.pk_column = (const char*)sqlite3_column_text(st, 10);
- s.where_sql = (const char*)sqlite3_column_text(st, 11);
+ s.name = str_or_empty((const char*)sqlite3_column_text(st, 1));
+ s.enabled = sqlite3_column_int(st, 2);
+
+ s.backend_type = str_or_empty((const char*)sqlite3_column_text(st, 3));
+ s.host = str_or_empty((const char*)sqlite3_column_text(st, 4));
+ s.port = sqlite3_column_int(st, 5);
+ s.user = str_or_empty((const char*)sqlite3_column_text(st, 6));
+ s.pass = str_or_empty((const char*)sqlite3_column_text(st, 7));
+ s.db = str_or_empty((const char*)sqlite3_column_text(st, 8));
+
+ s.table_name = str_or_empty((const char*)sqlite3_column_text(st, 9));
+ s.pk_column = str_or_empty((const char*)sqlite3_column_text(st, 10));
+ s.where_sql = str_or_empty((const char*)sqlite3_column_text(st, 11));
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 686-698, the assignments reading
columns into struct s are vulnerable to null pointer dereference because
sqlite3_column_text can return NULL; update the code that sets s.name,
s.backend_type, s.host, s.user, s.pass, s.db, s.table_name, s.pk_column and
s.where_sql to first call sqlite3_column_text into a temporary const unsigned
char* (or const char*), check for NULL (or use sqlite3_column_type to detect
SQLITE_NULL) and only assign to the std::string members when non-NULL, otherwise
set them to an empty string (or another safe default); do the same null-check
pattern for text columns used by functions like sqlite3_column_text while
keeping integer reads (sqlite3_column_int for s.enabled and s.port) as-is or
guarding them with sqlite3_column_type if you want to detect NULLs explicitly.
// Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
// Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
// or a local time. For v0, we store 0 and let schema default handle other tables.
// If you want accuracy, query SELECT unixepoch() once per run and reuse it.
std::int64_t now_epoch = 0;
Hardcoded now_epoch = 0 will cause incorrect timestamps.
The comment acknowledges this is a placeholder, but storing 0 as updated_at in rag_vec_chunks will make temporal queries unreliable. Consider using time(NULL) for a reasonable approximation.
💡 Suggested fix
- // If you want accuracy, query SELECT unixepoch() once per run and reuse it.
- std::int64_t now_epoch = 0;
+ // Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
+ std::int64_t now_epoch = (std::int64_t)time(NULL);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Use SQLite's unixepoch() for updated_at normally; vec table also stores updated_at as unix epoch.
// Here we store a best-effort "now" from SQLite (unixepoch()) would require a query; instead store 0
// or a local time. For v0, we store 0 and let schema default handle other tables.
// Use current time for updated_at; for higher accuracy, query SQLite's unixepoch() once per run.
std::int64_t now_epoch = (std::int64_t)time(NULL);
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 901-905, the code currently sets
now_epoch to 0 which yields invalid timestamps for rag_vec_chunks; replace the
hardcoded value by capturing a real epoch once per run (e.g., use
std::time(nullptr) / time(NULL) and cast to std::int64_t) and reuse that value
when populating updated_at (variable now_epoch), updating any affected functions
that rely on now_epoch; ensure the translation unit includes <ctime> (or
<time.h>) if not already present so the build succeeds.
CREATE TABLE IF NOT EXISTS rag_sources (
    source_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,          -- e.g. "stack_posts"
    enabled INTEGER NOT NULL DEFAULT 1,

    -- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL")
    backend_type TEXT NOT NULL,         -- "mysql" | "postgres" | ...
    backend_host TEXT NOT NULL,
    backend_port INTEGER NOT NULL,
    backend_user TEXT NOT NULL,
    backend_pass TEXT NOT NULL,
    backend_db TEXT NOT NULL,           -- database/schema name

    table_name TEXT NOT NULL,           -- e.g. "posts"
    pk_column TEXT NOT NULL,            -- e.g. "Id"

    -- Optional: restrict ingestion; appended to SELECT as WHERE <where_sql>
    where_sql TEXT,                     -- e.g. "PostTypeId IN (1,2)"

    -- REQUIRED: mapping from source row -> rag_documents fields
    -- JSON spec describing doc_id, title/body concat, metadata pick/rename, etc.
    doc_map_json TEXT NOT NULL,

    -- REQUIRED: chunking strategy (enabled, chunk_size, overlap, etc.)
    chunking_json TEXT NOT NULL,

    -- Optional: embedding strategy (how to build embedding input text)
    -- In v0 you can keep it NULL/empty; define later without schema changes.
    embedding_json TEXT,

    created_at INTEGER NOT NULL DEFAULT (unixepoch()),
    updated_at INTEGER NOT NULL DEFAULT (unixepoch())
Avoid storing plaintext backend credentials
backend_pass is stored as TEXT, which risks credential leakage. Please move secrets to a secure store or encrypt at rest (and restrict access) rather than persisting raw passwords in the catalog.
🤖 Prompt for AI Agents
In `@RAG_POC/schema.sql` around lines 14-45, the rag_sources table currently
stores plaintext credentials in the backend_pass column; instead migrate to a
secure approach by removing plaintext storage and either (A) storing only a
reference to an external secret manager (add a new column like
backend_secret_ref) and update any code that reads rag_sources to fetch the
secret from the secret manager, or (B) store an encrypted value (add
backend_pass_encrypted) and implement symmetric key encryption/decryption in the
service layer to read/write via functions that use a KMS-held key; update schema
changes for rag_sources, add a migration to backfill secrets (move existing
backend_pass values into the secret store or encrypt them) and remove or null
out backend_pass, and update codepaths that reference backend_pass to use
backend_secret_ref or decrypt backend_pass_encrypted instead.
- Add special build rule in Makefile for anomaly_detection-t that includes:
  - $(OBJ) for global variables (GloVars, GloGATH)
  - -Wl,--allow-multiple-definition to allow test's main() to override ProxySQL's
  - ClickHouse client libraries (libclickhouse-cpp-lib.a, libzstdstatic.a, liblz4.a)
  - SQLite rembed library (libsqlite_rembed.a)
  - -lscram -lusual for PostgreSQL SCRAM support
- Add stub function SQLite3_Server_session_handler required by SQLite3_Server.cpp

Resolves compilation errors for anomaly_detection-t test.
- Add #include <string> for C++ std::string support
- Add #include <cmath> for sqrt() function
- Change format %lld to %ld for chrono duration types (long int, not long long)

Resolves compilation errors for vector_db_performance-t test.
- Add #include <string> for C++ std::string support

Resolves compilation errors for ai_llm_retry_scenarios-t test.
This commit addresses valid concerns raised by coding agents (Gemini, Copilot, CoderabbitAI):

1. Fix stats_mcp_query_digest naming conflict (ProxySQL_Admin.cpp):
   - Made reset and non-reset paths mutually exclusive using else block
   - Prevents both flags from being true, matching MySQL pattern
   - Ensures reset takes precedence over non-reset

2. Fix INSERT OR REPLACE sync issue (Admin_Handler.cpp):
   - Added DELETE before INSERT OR REPLACE in LOAD/SAVE MCP QUERY RULES
   - Prevents stale rules from persisting when syncing disk <-> memory
   - Ensures deleted source rows are also removed from target

3. Fix integer division truncation for timeout (Query_Tool_Handler.cpp):
   - Changed timeout_ms/1000 to (timeout_ms+999)/1000 for ceiling division
   - Ensures sub-second timeouts (e.g., 500ms) become at least 1 second
   - Prevents zero-second timeouts from causing unexpected behavior

4. Remove confusing comment (Discovery_Schema.cpp):
   - Simplified column count comment to be clear and accurate

Note: The re_modifiers parsing code already correctly handles VARCHAR "CASELESS" to int conversion (lines 2414-2425), so that review comment was already addressed.
feat: Add MCP query rules and digest statistics [WIP]
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/MySQL_Catalog.cpp (2)
56-57: Avoid Variable Length Arrays (VLAs) - use heap allocation instead.
VLAs are not standard C++ (they're a C99/GCC extension) and can cause stack overflow for large paths. Use std::string or heap allocation.
Proposed fix
- char path_buf[db_path.size() + 1];
- strcpy(path_buf, db_path.c_str());
- int rc = db->open(path_buf, SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+ int rc = db->open(const_cast<char*>(db_path.c_str()), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);

Or if db->open requires a mutable buffer:
- char path_buf[db_path.size() + 1];
- strcpy(path_buf, db_path.c_str());
+ std::vector<char> path_buf(db_path.begin(), db_path.end());
+ path_buf.push_back('\0');
  int rc = db->open(path_buf.data(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
279-301: SQL injection vulnerability - user input concatenated directly into SQL.
The search() method concatenates schema, kind, tags, and query parameters directly into the SQL string without escaping or using prepared statements. This is a critical security vulnerability that allows SQL injection attacks.
The vulnerable patterns are:
- Line 283: sql << " AND schema = '" << schema << "'";
- Line 288: sql << " AND kind = '" << kind << "'";
- Line 293: sql << " AND tags LIKE '%" << tags << "%'";
- Lines 298-300: query parameter concatenated into LIKE clauses
Consider using prepared statements like the upsert() and get() methods do, or at minimum escape single quotes in the input strings.
Example fix using prepared statements
std::string MySQL_Catalog::search(
    const std::string& schema,
    const std::string& query,
    const std::string& kind,
    const std::string& tags,
    int limit,
    int offset
) {
    sqlite3_stmt* stmt = NULL;
    std::ostringstream sql;
    std::vector<std::string> bindings;
    int bind_idx = 1;
    sql << "SELECT schema, kind, key, document, tags, links FROM catalog WHERE 1=1";
    if (!schema.empty()) {
        sql << " AND schema = ?" << bind_idx++;
        bindings.push_back(schema);
    }
    if (!kind.empty()) {
        sql << " AND kind = ?" << bind_idx++;
        bindings.push_back(kind);
    }
    // ... similar for other parameters
    // Then bind each parameter using proxy_sqlite3_bind_text
}
🤖 Fix all issues with AI agents
In `@lib/Admin_Handler.cpp`:
- Around line 2369-2374: The current code builds a single semicolon-separated
SQL string into *q and relies on execute_statement (which uses
sqlite3_prepare_v2 and only compiles the first statement) so the INSERT never
runs; change the logic around where *q is set (the l_free/l_strdup block that
creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each SQL
statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.
In `@lib/MySQL_Catalog.cpp`:
- Around line 370-390: The list() method currently concatenates user inputs into
the SQL strings (variables sql and count_sql), causing SQL injection; change it
to use prepared statements and parameter binding like upsert() and get(): build
the base query with placeholders (e.g., "WHERE 1=1" plus "AND schema = ?"/"AND
kind = ?"), prepare the statement via your MySQL API, bind schema and kind only
when non-empty, and bind limit/offset as parameters before executing both the
main query and the COUNT query (use the same conditional placeholder logic for
count_sql). Ensure you replace the direct string concatenation at the spots that
currently append schema/kind and the final "LIMIT/OFFSET" with proper
placeholders and binding.
- Around line 505-511: In the remove() method, avoid concatenating user inputs
(schema, kind, key) into the SQL string; instead build a parameterized DELETE
and use a prepared statement to bind values for schema (when non-empty), kind
and key before executing. Replace the ostringstream construction in
MySQL_Catalog::remove() with a DELETE query that uses placeholders for
schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND schema = ?" when
schema provided, then always "AND kind = ? AND key = ?"), prepare that statement
via your MySQL client API, bind schema (conditionally), bind kind and key,
execute, and propagate/log any execution errors.
In `@lib/ProxySQL_Admin_Stats.cpp`:
- Around line 2581-2627: The INSERT always targets stats_mcp_query_digest and
uses strlen/risky sprintf, causing reset data to go to the live table, crashes
on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.
In `@lib/ProxySQL_Admin.cpp`:
- Around line 1354-1357: The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.
- Around line 7750-7804: The function
ProxySQL_Admin::load_mcp_query_rules_to_runtime leaks the SQLite3_result* on the
success path because Discovery_Schema::load_mcp_query_rules(resultset) does not
free it; after calling catalog->load_mcp_query_rules(resultset) add deletion of
the resultset (delete resultset;) to mirror the error-path cleanup used in
load_mysql_firewall_to_runtime(), and ensure you use the correct deallocation
(delete, not free). Also update the function comment header to correct the
documentation: change "error message string on failure (caller must free)" to
indicate returned error strings are string literals and must not be freed by the
caller.
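One way to make the `delete resultset` unavoidable on every return path is RAII. The sketch below is hedged: `Resultset` and the loader stand in for ProxySQL's `SQLite3_result` and `load_mcp_query_rules`, and the counter exists only so the test can observe that deallocation happened on both the success and error paths.

```cpp
#include <cassert>
#include <memory>

// Stand-in for SQLite3_result; the counter tracks live allocations.
struct Resultset { int rows = 0; };
static int live_resultsets = 0;

// Custom deleter ensures `delete` (never free()) is used.
struct ResultsetDeleter {
    void operator()(Resultset* r) const { --live_resultsets; delete r; }
};

// Sketch of load_mcp_query_rules_to_runtime: the unique_ptr deletes the
// resultset on every path, mirroring the error-path cleanup the review
// asks to add to the success path.
static bool load_rules(bool fail) {
    std::unique_ptr<Resultset, ResultsetDeleter> rs(new Resultset{3});
    ++live_resultsets;
    if (fail) return false;   // early return: still deleted
    // ... hand rs->rows to the catalog loader here ...
    return true;              // success path: also deleted
}
```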
In `@lib/Query_Tool_Handler.cpp`:
- Around line 235-319: The init_connection_pool function must clean up any
partially initialized connections on error: add a helper (lambda or private
method) inside init_connection_pool that iterates connection_pool, calls
mysql_close on each entry's MYSQL* (and sets conn.mysql = nullptr), clears
connection_pool, resets pool_size to 0, and unlocks pool_lock if held; call this
cleanup helper before every early return that occurs after some successful
connections (e.g., after mysql_init/mysql_real_connect failures) so the function
either returns with a fully initialized pool or an empty, consistent pool;
update all return paths in init_connection_pool to invoke this helper and then
return -1.
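The cleanup-on-error shape asked for above can be sketched with stand-ins. This is an assumption-laden illustration: `Conn`, `fake_open`, and `fake_close` replace the real `MYSQL*` handles and `mysql_init`/`mysql_close`, and locking is omitted; what it demonstrates is the invariant that every early return leaves an empty, consistent pool.

```cpp
#include <cassert>
#include <vector>

struct Conn { int* handle; };           // stand-in for a pooled MYSQL*
static int open_handles = 0;

static int* fake_open()        { ++open_handles; return new int(1); }
static void fake_close(int* h) { --open_handles; delete h; }

// Sketch of init_connection_pool: a cleanup lambda closes every
// partially created connection, clears the pool, and resets its size,
// and is invoked before each early error return.
static int init_pool(int want, int fail_at) {
    std::vector<Conn> pool;
    int pool_size = 0;

    auto cleanup = [&]() {
        for (Conn& c : pool) {
            if (c.handle) { fake_close(c.handle); c.handle = nullptr; }
        }
        pool.clear();
        pool_size = 0;
    };

    for (int i = 0; i < want; i++) {
        if (i == fail_at) {   // simulate a mysql_real_connect failure
            cleanup();        // release the partially built pool
            return -1;
        }
        pool.push_back(Conn{fake_open()});
        pool_size++;
    }
    int n = pool_size;
    cleanup();  // demo only: real code keeps the pool alive after success
    return n;
}
```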
In `@lib/Static_Harvester.cpp`:
- Around line 1088-1138: The quick-profile field indices are misaligned with the
SELECT: update the row->fields index usage inside the loop (where profile is
built) so they match the SELECT order (object_id, schema_name, object_name,
object_type, engine, table_rows_est, data_length, index_length, has_primary_key,
has_foreign_keys, has_time_column). Concretely, set rows_est to row->fields[5],
engine to row->fields[4], size_bytes to the sum of row->fields[6] and
row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).
- Around line 236-258: In Static_Harvester::execute_query, mysql_store_result()
NULL is being treated as a successful "no result set" case; instead check
mysql_field_count(mysql_conn) to see if the query should have a result set—if
mysql_field_count() == 0 treat it as a successful non-result (unlock conn_lock
and return 0), otherwise treat NULL as an error by calling
mysql_errno(mysql_conn) and logging mysql_error(mysql_conn), unlock conn_lock
and return -1; update the error log to include mysql_error(mysql_conn) and
ensure conn_lock is always released before returning.
- Around line 388-662: All five fetch_* methods (fetch_schemas,
fetch_tables_views, fetch_columns, fetch_indexes, fetch_foreign_keys) currently
interpolate the user-provided filter directly into SQL and need whitelist
validation to prevent SQL injection: add a small helper (e.g.,
is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.
- Around line 1014-1055: The UPDATE that writes view_def, schema_name, and
view_name into the SQLite catalog (inside the for-loop using update_sql and
catalog->get_db()->execute_statement) is vulnerable to breaking on single
quotes; change it to use a prepared statement with parameter binding
(sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step / sqlite3_finalize) and
bind view_def, schema_name, view_name, and current_run_id instead of
interpolating them, following the pattern in Discovery_Schema.cpp; alternatively
(less preferred) call escape_string_single_quotes() on view_def.c_str(),
schema_name.c_str(), and view_name.c_str() and use the escaped C strings in the
constructed SQL before calling execute_statement.
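The whitelist validator suggested for the fetch_* filters can be sketched as a small standalone helper. The name `is_valid_identifier` and the `[A-Za-z0-9_]+` policy come from the review above; per that suggestion, an empty filter means "no filter" and is handled by the caller before this check runs.

```cpp
#include <cassert>
#include <string>

// Accept only non-empty strings matching [A-Za-z0-9_]+. Anything else
// is rejected before it can reach SQL string concatenation.
static bool is_valid_identifier(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s) {
        bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
               || (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

A caller would then append `" AND schema_name = '" + filter + "'"` only when the filter is non-empty and passes this check, and otherwise either return an empty result or treat the request as unfiltered.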
♻️ Duplicate comments (9)
lib/ProxySQL_Admin_Stats.cpp (1)
1587-1613: Fix reset insert target and avoid raw SQL interpolation. Line 1608 builds the INSERT using a prefix; when reset == true it targets stats_mcp_query_tools_counters_mcp_query_tools_counters (non-existent). Raw interpolation of tool/schema also breaks on quotes or enables injection. Use an explicit table name plus bound parameters.
🛠️ Proposed fix (explicit table + prepared statement)
- for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
-      it != resultset->rows.end(); ++it) {
-   SQLite3_row* r = *it;
-   char query[1024];
-   snprintf(query, sizeof(query),
-     "INSERT INTO %smcp_query_tools_counters VALUES ('%s', '%s', %s, %s, %s, %s, %s, %s)",
-     reset ? "stats_mcp_query_tools_counters_" : "stats_",
-     r->fields[0], r->fields[1], r->fields[2], r->fields[3],
-     r->fields[4], r->fields[5], r->fields[6], r->fields[7]);
-   statsdb->execute(query);
- }
+ const char* table = reset
+   ? "stats_mcp_query_tools_counters_reset"
+   : "stats_mcp_query_tools_counters";
+ const std::string sql =
+   std::string("INSERT INTO ") + table +
+   " VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)";
+ int rc = 0;
+ stmt_unique_ptr u_stmt { nullptr };
+ std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+ ASSERT_SQLITE_OK(rc, statsdb);
+ sqlite3_stmt* const stmt = u_stmt.get();
+ for (SQLite3_row* r : resultset->rows) {
+   rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 3, atoll(r->fields[2])); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 4, atoll(r->fields[3])); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+   SAFE_SQLITE3_STEP2(stmt);
+   rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+   rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+ }
lib/Admin_Bootstrap.cpp (1)
726-736: Check MCP catalog DB open result before enabling extensions/attach. Line 728 ignores mcpdb->open()'s return code; if it fails, proxy_sqlite3_enable_load_extension and __attach_db (Line 947) can dereference an invalid handle.
🛠️ Suggested guard
- mcpdb->open((char *)mcp_catalog_path.c_str(), SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+ int mcp_rc = mcpdb->open((char *)mcp_catalog_path.c_str(),
+   SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX);
+ if (mcp_rc != SQLITE_OK) {
+   proxy_error("Failed to open MCP catalog DB at %s\n", mcp_catalog_path.c_str());
+   delete mcpdb;
+   mcpdb = nullptr;
+ }
...
- (*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+ if (mcpdb) {
+   (*proxy_sqlite3_enable_load_extension)(mcpdb->get_db(),1);
+ }
...
- __attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+ if (mcpdb) {
+   __attach_db(admindb, mcpdb, (char *)"mcp_catalog");
+ }

Also applies to: 947-947
lib/Query_Tool_Handler.cpp (7)
352-360: Synchronize find_connection with pool_lock. Iterating connection_pool without a lock can race with close() or pool initialization.
362-405: Avoid using a returned connection and return immediately on query failure. The connection is returned before mysql_affected_rows, and the error path continues after mysql_query fails. Both create races/UB.
407-476: Hold the connection until mysql_affected_rows/result processing completes. Returning the connection before mysql_affected_rows risks another thread reusing it.
809-846: Match allocation/deallocation for SQLite3_result. execute_statement allocates SQLite3_result with new, but this path frees it with free(). That's undefined behavior.
920-936: Escape schema/name_filter before concatenating SQL. Direct interpolation enables SQL injection via tool arguments.
1112-1131: Sanitize schema/table in object_id lookup SQL. Direct string concatenation allows injection via object_key.
1799-1850: Free row buffers after add_row to avoid leaks. Each row allocates a char** plus strdup'd fields that are never freed.
🧹 Nitpick comments (1)
lib/Admin_Handler.cpp (1)
2360-2360: Minor: Inconsistent length check could be clarified. The condition query_no_space_length>20 combined with strncasecmp(..., 21) works correctly (since "SAVE MCP QUERY RULES " is 21 chars), but using strlen("SAVE MCP QUERY RULES ") would be more maintainable and consistent with nearby code patterns (e.g., line 2365 uses strlen()).
-if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, 21)) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, 21)) ) ) {
+if ((query_no_space_length>20) && ( (!strncasecmp("SAVE MCP QUERY RULES ", query_no_space, strlen("SAVE MCP QUERY RULES "))) || (!strncasecmp("LOAD MCP QUERY RULES ", query_no_space, strlen("LOAD MCP QUERY RULES "))) ) ) {
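A quick self-contained sanity check behind this nit: the literal (with its trailing space) really is 21 characters, so `strncasecmp(..., strlen(literal))` is equivalent to the hard-coded 21 but stays correct if the literal ever changes. `matches_prefix` below is an illustrative helper, not a function from the codebase.

```cpp
#include <cassert>
#include <cstring>
#include <strings.h>  // strncasecmp (POSIX)

// Case-insensitive prefix match keyed off the literal's own length,
// so the magic number 21 never appears.
static bool matches_prefix(const char* q, const char* literal) {
    return strncasecmp(literal, q, strlen(literal)) == 0;
}
```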
l_free(*ql,*q);
// First clear target table, then insert to ensure deleted source rows are also removed
*q=l_strdup("DELETE FROM main.mcp_query_rules; INSERT OR REPLACE INTO main.mcp_query_rules SELECT * FROM disk.mcp_query_rules");
*ql=strlen(*q)+1;
return true;
}
🧩 Analysis chain
🏁 Scripts executed:
# Check how other disk-to-memory operations handle this pattern
rg -n "DELETE FROM.*INSERT" --type cpp lib/
# Also search for the execute_statement function and how SQL is executed
rg -n "execute_statement" --type cpp lib/ | head -20
# Get context around lines 2369-2374 and 2383
sed -n '2360,2390p' lib/Admin_Handler.cpp
# Find execute_statement function definition
rg -n "execute_statement" lib/ --type cpp -A 10 | grep -A 10 "execute_statement.*{" | head -40
# Find BQE1 macro definition which also handles DELETE...INSERT patterns
rg -n "BQE1" lib/ --type cpp -B 2 -A 5 | head -50
# Search for the database class and SQLite3_exec or similar execution functions
rg -n "SQLite3_exec\|execute_statement" lib/ --type cpp | grep "define\|{" | head -20
# Look for where execute_statement is defined (it might be in a header)
fd -e "h" lib/ -exec rg -l "execute_statement" {} \; | head -10
# Look at the full BQE1 implementation
sed -n '187,210p' lib/ProxySQL_Admin.cpp
# Find the execute function to see how it differs from execute_statement
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 15 | head -60
# Check if sqlite3_prepare_v2 handles multiple statements
rg -n "sqlite3_prepare_v2" lib/ --type cpp -B 2 -A 5 | head -40
# Look at how the mcp_query_rules statements are actually being executed later
rg -n "mcp_query_rules" lib/Admin_Handler.cpp -B 2 -A 8
# Find the execute function implementation
rg -n "void SQLite3DB::execute\(" lib/ --type cpp -A 20 | head -80
# Also check if there's a sqlite3_exec call anywhere that might handle multiple statements
rg -n "sqlite3_exec" lib/ --type cpp -B 2 -A 2
# Check the SQLite3DB class header to understand what execute vs execute_statement does
fd -name "*.h" lib/ -exec rg -l "SQLite3DB" {} \; | head -5
# Find the execute function that takes a const char*
rg -n "void SQLite3DB::execute" lib/sqlite3db.cpp -A 25 | head -100
# Check the full context around the sqlite3_exec call
sed -n '215,250p' lib/sqlite3db.cpp
# Look for the header file to understand the class structure
find lib -name "*.h" -o -name "*.hpp" | xargs rg -l "class SQLite3DB" 2>/dev/null | head -1
# Find where the returned query string (in *q) is executed; this code is in
# Admin_Handler.cpp processing admin commands, so inspect the full function context
sed -n '2300,2450p' lib/Admin_Handler.cpp | head -150
# Search for how admin queries are executed after being processed
rg -n "execute_statement\|->execute\(" lib/Admin_Handler.cpp | grep -A 3 -B 3 "2[34][0-9][0-9]" | head -40
# Check the broader context to see what happens with the query after processing
sed -n '2350,2500p' lib/Admin_Handler.cpp | tail -150
# Find the function signature and caller context
rg -n "admin_handler_function\|process_admin_query" lib/Admin_Handler.cpp | head -5
# Look for the function that calls this handler
grep -n "^.*(" lib/Admin_Handler.cpp | grep -E "admin_handler|process_admin" | head -5
# Find the function definition that contains this code
sed -n '1,100p' lib/Admin_Handler.cpp | grep -E "^[a-zA-Z_].*\(|^bool|^int"
# Search for where the query string *q gets executed after being returned
rg -n "admindb->execute\(" lib/ --type cpp | head -20
# Find where admin_handler_command_load_or_save is called and what happens with the returned query
rg -n "admin_handler_command_load_or_save" lib/ --type cpp -B 3 -A 3
# Look at the full function signature and see how it's used
sed -n '1025,1100p' lib/Admin_Handler.cpp
# Look for comments or documentation about sqlite3_exec and multi-statement support
rg -n "sqlite3_exec\|multiple.*statement" lib/ --type cpp -B 2 -A 2 | head -50
# Find the __run_query label and see how the query is executed
sed -n '3237,3500p' lib/Admin_Handler.cpp | head -200
# Search for the __run_query label explicitly
rg -n "__run_query:" lib/Admin_Handler.cpp -A 30 | head -80
# Find what happens immediately after the __run_query label - look for execute_statement or execute
sed -n '4317,4450p' lib/Admin_Handler.cpp | head -150
# Search more broadly for where the query gets executed after __run_query
rg -n "run_query" lib/Admin_Handler.cpp | tail -20
# Look for any comments or documentation about prepare_v2 and multiple statements
rg -n "prepare_v2\|multiple.*statement" lib/sqlite3db.cpp -B 2 -A 2 | head -40
# Also check if there are any workarounds or special handling for multi-statement queries
rg -n "sqlite3_next_stmt\|tail.*statement" lib/ --type cpp | head -10
# Check the exact behavior of sqlite3_prepare_v2 with documentation or comments
rg -n "sqlite3_prepare_v2" lib/sqlite3db.cpp -B 5 -A 15 | head -60
# Verify that execute_statement does NOT loop or handle multiple statements
sed -n '322,388p' lib/sqlite3db.cpp
Fix multi-statement SQL execution in mcp_query_rules operations.
Lines 2371 and 2383 use execute_statement, which calls sqlite3_prepare_v2 to compile the query. This function only prepares the first SQL statement when given multiple semicolon-separated statements. The INSERT OR REPLACE part will never execute.
Either pass the combined string to execute() (which uses sqlite3_exec and does support multiple statements), or run each statement through a separate call. The pattern used elsewhere in the codebase, such as the BQE1 macro in ProxySQL_Admin.cpp and the debug handlers, demonstrates the correct approach of executing statements individually.
Example pattern (from LOAD DEBUG handlers):
SPA->admindb->execute("DELETE FROM main.debug_filters");
SPA->admindb->execute("INSERT OR REPLACE INTO main.debug_levels SELECT * FROM disk.debug_levels");
SPA->admindb->execute("INSERT INTO main.debug_filters SELECT * FROM disk.debug_filters");
🤖 Prompt for AI Agents
In `@lib/Admin_Handler.cpp` around lines 2369 - 2374, The current code builds a
single semicolon-separated SQL string into *q and relies on execute_statement
(which uses sqlite3_prepare_v2 and only compiles the first statement) so the
INSERT never runs; change the logic around where *q is set (the l_free/l_strdup
block that creates "DELETE ...; INSERT OR REPLACE ...") to instead perform each
SQL statement with calls that support multiple statements—either call execute()
(which uses sqlite3_exec) for the combined string or, better, call execute() or
the existing BQE1/execute pattern twice to run "DELETE FROM
main.mcp_query_rules" and then "INSERT OR REPLACE INTO main.mcp_query_rules
SELECT * FROM disk.mcp_query_rules" as separate statements; update the code
paths that set *q/*ql accordingly so no multi-statement string is passed to
execute_statement.
sql << "SELECT schema, kind, key, document, tags , links FROM catalog WHERE 1=1";

if (!schema.empty()) {
	sql << " AND schema = '" << schema << "'";
}

if (!kind.empty()) {
	sql << " AND kind = '" << kind << "'";
}

sql << " ORDER BY schema, kind , key ASC LIMIT " << limit << " OFFSET " << offset;

// Get total count
std::ostringstream count_sql;
count_sql << "SELECT COUNT(*) FROM catalog WHERE 1=1";
if (!schema.empty()) {
	count_sql << " AND schema = '" << schema << "'";
}
if (!kind.empty()) {
	count_sql << " AND kind = '" << kind << "'";
}
SQL injection vulnerability in list() - same issue as search().
The list() method has the same SQL injection vulnerability, directly concatenating schema and kind parameters into SQL strings at lines 373, 377, 386, and 389.
Use prepared statements with parameter binding as done in upsert() and get().
🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 370 - 390, The list() method currently
concatenates user inputs into the SQL strings (variables sql and count_sql),
causing SQL injection; change it to use prepared statements and parameter
binding like upsert() and get(): build the base query with placeholders (e.g.,
"WHERE 1=1" plus "AND schema = ?"/"AND kind = ?"), prepare the statement via
your MySQL API, bind schema and kind only when non-empty, and bind limit/offset
as parameters before executing both the main query and the COUNT query (use the
same conditional placeholder logic for count_sql). Ensure you replace the direct
string concatenation at the spots that currently append schema/kind and the
final "LIMIT/OFFSET" with proper placeholders and binding.
std::ostringstream sql;
sql << "DELETE FROM catalog WHERE 1=1";

if (!schema.empty()) {
	sql << " AND schema = '" << schema << "'";
}
sql << " AND kind = '" << kind << "' AND key = '" << key << "'";
SQL injection vulnerability in remove() - user input in DELETE statement.
The remove() method concatenates schema, kind, and key directly into a DELETE SQL statement. This is especially dangerous as it could allow attackers to delete arbitrary catalog entries.
Proposed fix using prepared statements
int MySQL_Catalog::remove(
const std::string& schema,
const std::string& kind,
const std::string& key
) {
- std::ostringstream sql;
- sql << "DELETE FROM catalog WHERE 1=1";
-
- if (!schema.empty()) {
- sql << " AND schema = '" << schema << "'";
- }
- sql << " AND kind = '" << kind << "' AND key = '" << key << "'";
-
- if (!db->execute(sql.str().c_str())) {
+ sqlite3_stmt* stmt = NULL;
+ const char* delete_sql = schema.empty()
+ ? "DELETE FROM catalog WHERE kind = ?1 AND key = ?2"
+ : "DELETE FROM catalog WHERE schema = ?1 AND kind = ?2 AND key = ?3";
+
+ int rc = db->prepare_v2(delete_sql, &stmt);
+ if (rc != SQLITE_OK) {
+ proxy_error("Failed to prepare catalog remove: %d\n", rc);
+ return -1;
+ }
+
+ if (schema.empty()) {
+ (*proxy_sqlite3_bind_text)(stmt, 1, kind.c_str(), -1, SQLITE_TRANSIENT);
+ (*proxy_sqlite3_bind_text)(stmt, 2, key.c_str(), -1, SQLITE_TRANSIENT);
+ } else {
+ (*proxy_sqlite3_bind_text)(stmt, 1, schema.c_str(), -1, SQLITE_TRANSIENT);
+ (*proxy_sqlite3_bind_text)(stmt, 2, kind.c_str(), -1, SQLITE_TRANSIENT);
+ (*proxy_sqlite3_bind_text)(stmt, 3, key.c_str(), -1, SQLITE_TRANSIENT);
+ }
+
+ SAFE_SQLITE3_STEP2(stmt);
+ (*proxy_sqlite3_finalize)(stmt);
+
+ if (rc != SQLITE_DONE && rc != SQLITE_OK) {
proxy_error("Catalog remove error\n");
return -1;
}
return 0;
}🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 505 - 511, In the remove() method, avoid
concatenating user inputs (schema, kind, key) into the SQL string; instead build
a parameterized DELETE and use a prepared statement to bind values for schema
(when non-empty), kind and key before executing. Replace the ostringstream
construction in MySQL_Catalog::remove() with a DELETE query that uses
placeholders for schema/kind/key (e.g. "DELETE FROM catalog WHERE 1=1" plus "AND
schema = ?" when schema provided, then always "AND kind = ? AND key = ?"),
prepare that statement via your MySQL client API, bind schema (conditionally),
bind kind and key, execute, and propagate/log any execution errors.
void ProxySQL_Admin::stats___mcp_query_digest(bool reset) {
	if (!GloMCPH) return;
	Query_Tool_Handler* qth = GloMCPH->query_tool_handler;
	if (!qth) return;

	// Get the discovery schema catalog
	Discovery_Schema* catalog = qth->get_catalog();
	if (!catalog) return;

	// Get the stats from the catalog (includes reset logic)
	SQLite3_result* resultset = catalog->get_mcp_query_digest(reset);
	if (!resultset) return;

	statsdb->execute("BEGIN");

	if (reset) {
		statsdb->execute("DELETE FROM stats_mcp_query_digest_reset");
	} else {
		statsdb->execute("DELETE FROM stats_mcp_query_digest");
	}

	// Insert digest statistics into the stats table
	// Columns: tool_name, run_id, digest, digest_text, count_star,
	//          first_seen, last_seen, sum_time, min_time, max_time
	char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
	for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
		SQLite3_row* r = *it;
		int arg_len = 0;
		for (int i = 0; i < 10; i++) {
			arg_len += strlen(r->fields[i]);
		}
		char* query = (char*)malloc(strlen(a) + arg_len + 32);
		sprintf(query, a,
			r->fields[0], // tool_name
			r->fields[1], // run_id
			r->fields[2], // digest
			r->fields[3], // digest_text
			r->fields[4], // count_star
			r->fields[5], // first_seen
			r->fields[6], // last_seen
			r->fields[7], // sum_time
			r->fields[8], // min_time
			r->fields[9]  // max_time
		);
		statsdb->execute(query);
		free(query);
	}
🧩 Analysis chain
🏁 Scripts executed:
head -c 500000 lib/ProxySQL_Admin_Stats.cpp | wc -l
sed -n '2580,2630p' lib/ProxySQL_Admin_Stats.cpp
rg "stats_mcp_query_digest" --type cpp -A 3 -B 3 | head -100
Fix table name in INSERT statement and use prepared statements for NULL-safe binding.
Line 2605 always inserts into stats_mcp_query_digest even when reset == true, writing reset data to the live table instead of the reset table. Additionally, the strlen-based approach crashes if run_id is NULL (the schema allows it) and is vulnerable to SQL injection from unescaped quotes in digest text.
Replace the hardcoded INSERT with dynamic table selection and prepared statement binding:
Proposed fix
- char* a = (char*)"INSERT INTO stats_mcp_query_digest VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\")";
- for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
- SQLite3_row* r = *it;
- int arg_len = 0;
- for (int i = 0; i < 10; i++) {
- arg_len += strlen(r->fields[i]);
- }
- char* query = (char*)malloc(strlen(a) + arg_len + 32);
- sprintf(query, a,
- r->fields[0], // tool_name
- r->fields[1], // run_id
- r->fields[2], // digest
- r->fields[3], // digest_text
- r->fields[4], // count_star
- r->fields[5], // first_seen
- r->fields[6], // last_seen
- r->fields[7], // sum_time
- r->fields[8], // min_time
- r->fields[9] // max_time
- );
- statsdb->execute(query);
- free(query);
- }
+ const char* table = reset
+ ? "stats_mcp_query_digest_reset"
+ : "stats_mcp_query_digest";
+ const std::string sql =
+ std::string("INSERT INTO ") + table +
+ " VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10)";
+ int rc = 0;
+ stmt_unique_ptr u_stmt { nullptr };
+ std::tie(rc, u_stmt) = statsdb->prepare_v2(sql.c_str());
+ ASSERT_SQLITE_OK(rc, statsdb);
+ sqlite3_stmt* const stmt = u_stmt.get();
+ for (SQLite3_row* r : resultset->rows) {
+ rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+ if (r->fields[1]) {
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 2, atoll(r->fields[1])); ASSERT_SQLITE_OK(rc, statsdb);
+ } else {
+ rc = (*proxy_sqlite3_bind_null)(stmt, 2); ASSERT_SQLITE_OK(rc, statsdb);
+ }
+ rc = (*proxy_sqlite3_bind_text)(stmt, 3, r->fields[2], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_text)(stmt, 4, r->fields[3], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 5, atoll(r->fields[4])); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 6, atoll(r->fields[5])); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 7, atoll(r->fields[6])); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 8, atoll(r->fields[7])); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 9, atoll(r->fields[8])); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_bind_int64)(stmt, 10, atoll(r->fields[9])); ASSERT_SQLITE_OK(rc, statsdb);
+ SAFE_SQLITE3_STEP2(stmt);
+ rc = (*proxy_sqlite3_clear_bindings)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+ rc = (*proxy_sqlite3_reset)(stmt); ASSERT_SQLITE_OK(rc, statsdb);
+ }🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin_Stats.cpp` around lines 2581 - 2627, The INSERT always
targets stats_mcp_query_digest and uses strlen/risky sprintf, causing reset data
to go to the live table, crashes on NULL run_id and SQL-injection risks; in
ProxySQL_Admin::stats___mcp_query_digest switch the target table based on the
reset flag (use stats_mcp_query_digest_reset when reset==true), stop building
SQL via sprintf, and use a prepared statement with parameter binding (prepare
once per loop or reuse) to bind each SQLite3_row->fields value NULL-safely
(checking for NULL before binding and using sqlite3_bind_null or
sqlite3_bind_text) instead of strlen on fields and statsdb->execute(query); also
remove manual malloc/free and call the DB's execute/step/finalize prepared API
(e.g., statsdb prepare/bind/execute/finalize) to prevent injection and crashes.
```cpp
if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
	{ stats_mcp_query_tools_counters=true; refresh=true; }
if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
	{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
```
Avoid double-refresh when querying stats_mcp_query_tools_counters_reset.
The reset token contains the non-reset token, so both flags become true and both refresh paths run. Prefer a reset-first else if like other reset paths to avoid extra work and unintended table updates.
🛠️ Suggested fix
```diff
-if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
-	{ stats_mcp_query_tools_counters=true; refresh=true; }
-if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
-	{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
+	{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+else if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
+	{ stats_mcp_query_tools_counters=true; refresh=true; }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
-	{ stats_mcp_query_tools_counters=true; refresh=true; }
-if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
-	{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset"))
+	{ stats_mcp_query_tools_counters_reset=true; refresh=true; }
+else if (strstr(query_no_space,"stats_mcp_query_tools_counters"))
+	{ stats_mcp_query_tools_counters=true; refresh=true; }
```
🤖 Prompt for AI Agents
In `@lib/ProxySQL_Admin.cpp` around lines 1354 - 1357, The current checks set both
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset when
query_no_space contains the reset token, causing duplicate refresh work; change
the logic to test the reset token first and make the second check an else if (or
otherwise skip setting the non-reset flag when reset matches) so only
stats_mcp_query_tools_counters_reset is set and refresh is executed once; update
the conditional around query_no_space that touches
stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset
accordingly.
```cpp
int Query_Tool_Handler::init_connection_pool() {
	// Parse hosts
	std::vector<std::string> host_list;
	std::istringstream h(mysql_hosts);
	std::string host;
	while (std::getline(h, host, ',')) {
		host.erase(0, host.find_first_not_of(" \t"));
		host.erase(host.find_last_not_of(" \t") + 1);
		if (!host.empty()) {
			host_list.push_back(host);
		}
	}

	// Parse ports
	std::vector<int> port_list;
	std::istringstream p(mysql_ports);
	std::string port;
	while (std::getline(p, port, ',')) {
		port.erase(0, port.find_first_not_of(" \t"));
		port.erase(port.find_last_not_of(" \t") + 1);
		if (!port.empty()) {
			port_list.push_back(atoi(port.c_str()));
		}
	}

	// Ensure ports array matches hosts array size
	while (port_list.size() < host_list.size()) {
		port_list.push_back(3306);
	}

	if (host_list.empty()) {
		proxy_error("Query_Tool_Handler: No hosts configured\n");
		return -1;
	}

	pthread_mutex_lock(&pool_lock);

	for (size_t i = 0; i < host_list.size(); i++) {
		MySQLConnection conn;
		conn.host = host_list[i];
		conn.port = port_list[i];
		conn.in_use = false;

		MYSQL* mysql = mysql_init(NULL);
		if (!mysql) {
			proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
				conn.host.c_str(), conn.port);
			pthread_mutex_unlock(&pool_lock);
			return -1;
		}

		unsigned int timeout = 5;
		mysql_options(mysql, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);
		mysql_options(mysql, MYSQL_OPT_READ_TIMEOUT, &timeout);
		mysql_options(mysql, MYSQL_OPT_WRITE_TIMEOUT, &timeout);

		if (!mysql_real_connect(
				mysql,
				conn.host.c_str(),
				mysql_user.c_str(),
				mysql_password.c_str(),
				mysql_schema.empty() ? NULL : mysql_schema.c_str(),
				conn.port,
				NULL,
				CLIENT_MULTI_STATEMENTS
			)) {
			proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
				conn.host.c_str(), conn.port, mysql_error(mysql));
			mysql_close(mysql);
			pthread_mutex_unlock(&pool_lock);
			return -1;
		}

		conn.mysql = mysql;
		connection_pool.push_back(conn);
		pool_size++;

		proxy_info("Query_Tool_Handler: Connected to %s:%d\n",
			conn.host.c_str(), conn.port);
	}

	pthread_mutex_unlock(&pool_lock);
	proxy_info("Query_Tool_Handler: Connection pool initialized with %d connection(s)\n", pool_size);
	return 0;
}
```
🧩 Analysis chain
🏁 Scripts executed (repository: ProxySQL/proxysql-vec):

```shell
# Find the header file for Query_Tool_Handler
fd -e "h" -e "hpp" | xargs grep -l "class Query_Tool_Handler" 2>/dev/null

# Also check the cpp file for class definition inline
head -100 lib/Query_Tool_Handler.cpp | grep -A 50 "class Query_Tool_Handler"

# Find all method definitions in the Query_Tool_Handler.cpp file
rg "^\s*(public|private|protected):" lib/Query_Tool_Handler.cpp -A 5

# Search for connection_pool usages
rg "connection_pool" lib/Query_Tool_Handler.cpp -B 2 -A 2

# Check for constructor and initialization of connection_pool
rg "Query_Tool_Handler::|connection_pool.*=" lib/Query_Tool_Handler.cpp | head -20

# Read the Query_Tool_Handler header file to see class structure
cat -n include/Query_Tool_Handler.h

# Read the destructor and close() method implementations
rg "Query_Tool_Handler::~Query_Tool_Handler|void Query_Tool_Handler::close" lib/Query_Tool_Handler.cpp -A 30

# Get the full init() method
rg "int Query_Tool_Handler::init()" lib/Query_Tool_Handler.cpp -A 50

# Search for all calls to init_connection_pool
rg "init_connection_pool" lib/Query_Tool_Handler.cpp

# Check the constructor implementation
rg "Query_Tool_Handler::Query_Tool_Handler" lib/Query_Tool_Handler.cpp -A 30

# Check if there are any guards or flags in init() to prevent calling it multiple times
rg "init()" lib/Query_Tool_Handler.cpp -B 5 -A 15 | head -80

# Check the full init_connection_pool signature and look at early error returns
sed -n '235,319p' lib/Query_Tool_Handler.cpp

# Check if there's any state management or guards to prevent multiple init calls
rg "connection_pool.empty()|pool_size" lib/Query_Tool_Handler.cpp

# Check get_connection() to see if it guards against partially initialized pool
rg "void\* Query_Tool_Handler::get_connection" lib/Query_Tool_Handler.cpp -A 20

# Check if init() is ever called more than once, or if there's any guard against it
rg "\.init\(\)" lib/Query_Tool_Handler.cpp

# Also check the caller context - how is Query_Tool_Handler created and used
rg "Query_Tool_Handler" lib/Query_Tool_Handler.cpp | grep -E "(new|delete|init\(\)|close\(\))" | head -20

# Check if there are any other places in the codebase that use Query_Tool_Handler
# to understand the usage pattern
find . -name "*.cpp" -o -name "*.h" | xargs grep -l "Query_Tool_Handler" | grep -v Query_Tool_Handler.cpp | grep -v Query_Tool_Handler.h

# Look at one of those files to see how Query_Tool_Handler is used
rg "Query_Tool_Handler" include/MCP_Thread.h -B 3 -A 3

# Check MCP_Thread.cpp to see how Query_Tool_Handler is initialized and used
rg "query_tool_handler" lib/MCP_Thread.cpp -B 2 -A 2 | head -50

# Check ProxySQL_MCP_Server.cpp for initialization patterns
rg "Query_Tool_Handler|init\(\)|close\(\)" lib/ProxySQL_MCP_Server.cpp | head -30

# Verify that the early return (no hosts) happens before acquiring lock
sed -n '280,290p' lib/Query_Tool_Handler.cpp

# Get exact line numbers around pool_lock acquisition and early returns
sed -n '270,285p' lib/Query_Tool_Handler.cpp
```
Clean up partially initialized pools on connection failure.
If a later host fails to connect, earlier MYSQL* handles remain in connection_pool while the function returns -1, leaving the object in an inconsistent state. Although the destructor eventually cleans up via close(), the intermediate state violates the expected postcondition of the function (either fully initialized or empty).
Add a helper lambda to clean up and reset the pool before returning on connection failures:
🧹 Suggested fix
```diff
 int Query_Tool_Handler::init_connection_pool() {
+	// Helper to cleanup partially initialized pool (pool_lock must be held)
+	auto cleanup_pool_locked = [&]() {
+		for (auto& c : connection_pool) {
+			if (c.mysql) {
+				mysql_close(static_cast<MYSQL*>(c.mysql));
+				c.mysql = NULL;
+			}
+		}
+		connection_pool.clear();
+		pool_size = 0;
+	};
+
 	// Parse hosts
 	std::vector<std::string> host_list;
@@
 		MYSQL* mysql = mysql_init(NULL);
 		if (!mysql) {
 			proxy_error("Query_Tool_Handler: mysql_init failed for %s:%d\n",
 				conn.host.c_str(), conn.port);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
@@
 			)) {
 			proxy_error("Query_Tool_Handler: mysql_real_connect failed for %s:%d: %s\n",
 				conn.host.c_str(), conn.port, mysql_error(mysql));
 			mysql_close(mysql);
-			pthread_mutex_unlock(&pool_lock);
+			cleanup_pool_locked();
+			pthread_mutex_unlock(&pool_lock);
 			return -1;
 		}
```

🤖 Prompt for AI Agents
In `@lib/Query_Tool_Handler.cpp` around lines 235 - 319, The init_connection_pool
function must clean up any partially initialized connections on error: add a
helper (lambda or private method) inside init_connection_pool that iterates
connection_pool, calls mysql_close on each entry's MYSQL* (and sets conn.mysql =
nullptr), clears connection_pool, resets pool_size to 0, and unlocks pool_lock
if held; call this cleanup helper before every early return that occurs after
some successful connections (e.g., after mysql_init/mysql_real_connect failures)
so the function either returns with a fully initialized pool or an empty,
consistent pool; update all return paths in init_connection_pool to invoke this
helper and then return -1.
```cpp
int Static_Harvester::execute_query(const std::string& query, std::vector<std::vector<std::string>>& results) {
	pthread_mutex_lock(&conn_lock);

	if (!mysql_conn) {
		pthread_mutex_unlock(&conn_lock);
		proxy_error("Static_Harvester: Not connected to MySQL\n");
		return -1;
	}

	proxy_debug(PROXY_DEBUG_GENERIC, 3, "Static_Harvester: Executing query: %s\n", query.c_str());

	if (mysql_query(mysql_conn, query.c_str())) {
		proxy_error("Static_Harvester: Query failed: %s\n", mysql_error(mysql_conn));
		pthread_mutex_unlock(&conn_lock);
		return -1;
	}

	MYSQL_RES* res = mysql_store_result(mysql_conn);
	if (!res) {
		// No result set (e.g., INSERT/UPDATE)
		pthread_mutex_unlock(&conn_lock);
		return 0;
	}
```
🧩 Analysis chain
🏁 Scripts executed (repository: ProxySQL/proxysql-vec):

```shell
find . -type f -name "Static_Harvester.cpp" | head -5

wc -l lib/Static_Harvester.cpp 2>/dev/null || find . -name "Static_Harvester.cpp" -exec wc -l {} \;

cat -n lib/Static_Harvester.cpp | sed -n '230,265p'
```
🌐 Web query:
MySQL C API mysql_store_result NULL error handling documentation
💡 Result:
Short summary — how to handle mysql_store_result() returning NULL:
- mysql_store_result() returns NULL either when the statement produced no result set (e.g., INSERT/UPDATE/DELETE) or when an error occurred while reading the result. [1][2]
- To tell which case it is:
- If mysql_field_count(mysql) == 0 then the statement does not produce a result set. [1][2]
- Otherwise check mysql_errno(mysql) != 0 or mysql_error(mysql) (non-empty) to detect an error. [1][2]
- Note: mysql_store_result() resets mysql_error()/mysql_errno() on success; when it returns NULL an allocation/connection/read error (CR_OUT_OF_MEMORY, CR_SERVER_GONE_ERROR, CR_SERVER_LOST, CR_COMMANDS_OUT_OF_SYNC, etc.) may be indicated by mysql_errno(). [1][3]
- See the MySQL manual’s dedicated section “NULL mysql_store_result() Return After mysql_query() Success” for examples and details. [3]
Sources:
[1] MySQL C API — mysql_store_result() documentation.
[2] MySQL C API Developer Guide (mysql_store_result) — dev.mysql.com.
[3] MySQL C API guide — “NULL mysql_store_result() Return After mysql_query() Success.”
Distinguish "no result set" from query error in mysql_store_result() handling.
mysql_store_result() returns NULL for both successful queries with no result set (INSERT/UPDATE) and for errors (memory allocation, connection loss, read failures). The current code treats all NULL cases as success, silently masking database errors.
Per MySQL C API documentation, distinguish these cases by checking mysql_field_count() to detect whether the query produced a result set, then check mysql_errno() if no result set to confirm it was not an error.
Suggested approach
```diff
 MYSQL_RES* res = mysql_store_result(mysql_conn);
 if (!res) {
-	// No result set (e.g., INSERT/UPDATE)
-	pthread_mutex_unlock(&conn_lock);
-	return 0;
+	// Distinguish: no result set vs. error
+	if (mysql_field_count(mysql_conn) != 0) {
+		proxy_error("Static_Harvester: Query failed after store_result: %s\n",
+			mysql_error(mysql_conn));
+		pthread_mutex_unlock(&conn_lock);
+		return -1;
+	}
+	pthread_mutex_unlock(&conn_lock);
+	return 0;
 }
```

📝 Committable suggestion
```cpp
int Static_Harvester::execute_query(const std::string& query, std::vector<std::vector<std::string>>& results) {
	pthread_mutex_lock(&conn_lock);
	if (!mysql_conn) {
		pthread_mutex_unlock(&conn_lock);
		proxy_error("Static_Harvester: Not connected to MySQL\n");
		return -1;
	}
	proxy_debug(PROXY_DEBUG_GENERIC, 3, "Static_Harvester: Executing query: %s\n", query.c_str());
	if (mysql_query(mysql_conn, query.c_str())) {
		proxy_error("Static_Harvester: Query failed: %s\n", mysql_error(mysql_conn));
		pthread_mutex_unlock(&conn_lock);
		return -1;
	}
	MYSQL_RES* res = mysql_store_result(mysql_conn);
	if (!res) {
		// Distinguish: no result set vs. error
		if (mysql_field_count(mysql_conn) != 0) {
			proxy_error("Static_Harvester: Query failed after store_result: %s\n",
				mysql_error(mysql_conn));
			pthread_mutex_unlock(&conn_lock);
			return -1;
		}
		pthread_mutex_unlock(&conn_lock);
		return 0;
	}
```
🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 236 - 258, In
Static_Harvester::execute_query, mysql_store_result() NULL is being treated as a
successful "no result set" case; instead check mysql_field_count(mysql_conn) to
see if the query should have a result set—if mysql_field_count() == 0 treat it
as a successful non-result (unlock conn_lock and return 0), otherwise treat NULL
as an error by calling mysql_errno(mysql_conn) and logging
mysql_error(mysql_conn), unlock conn_lock and return -1; update the error log to
include mysql_error(mysql_conn) and ensure conn_lock is always released before
returning.
```cpp
// Fetch schema/database metadata from information_schema.SCHEMATA.
//
// Queries MySQL for all schemas (databases) and their character set
// and collation information.
//
// Parameters:
//   filter - Optional schema name filter (empty for all schemas)
//
// Returns:
//   Vector of SchemaRow structures containing schema metadata
std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
	std::vector<SchemaRow> schemas;

	std::ostringstream sql;
	sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
	    << "FROM information_schema.SCHEMATA";

	if (!filter.empty()) {
		sql << " WHERE SCHEMA_NAME = '" << filter << "'";
	}

	sql << " ORDER BY SCHEMA_NAME;";

	std::vector<std::vector<std::string>> results;
	if (execute_query(sql.str(), results) == 0) {
		for (const auto& row : results) {
			SchemaRow s;
			s.schema_name = row[0];
			s.charset = row[1];
			s.collation = row[2];
			schemas.push_back(s);
		}
	}

	return schemas;
}

// ============================================================
// Harvest Stage Methods
// ============================================================

// Harvest schemas/databases to the catalog.
//
// Fetches schemas from information_schema.SCHEMATA and inserts them
// into the catalog. System schemas (mysql, information_schema,
// performance_schema, sys) are skipped.
//
// Parameters:
//   only_schema - Optional filter to harvest only one schema
//
// Returns:
//   Number of schemas harvested, or -1 on error
//
// Notes:
//   - Requires an active run (start_run must be called first)
//   - Skips system schemas automatically
int Static_Harvester::harvest_schemas(const std::string& only_schema) {
	if (current_run_id < 0) {
		proxy_error("Static_Harvester: No active run\n");
		return -1;
	}

	std::vector<SchemaRow> schemas = fetch_schemas(only_schema);
	int count = 0;

	for (const auto& s : schemas) {
		// Skip system schemas
		if (s.schema_name == "mysql" || s.schema_name == "information_schema" ||
		    s.schema_name == "performance_schema" || s.schema_name == "sys") {
			continue;
		}

		if (catalog->insert_schema(current_run_id, s.schema_name, s.charset, s.collation) >= 0) {
			count++;
		}
	}

	proxy_info("Static_Harvester: Harvested %d schemas\n", count);
	return count;
}

// Fetch table and view metadata from information_schema.TABLES.
//
// Queries MySQL for all tables and views with their physical
// characteristics (rows, size, engine, timestamps).
//
// Parameters:
//   filter - Optional schema name filter
//
// Returns:
//   Vector of ObjectRow structures containing table/view metadata
std::vector<Static_Harvester::ObjectRow> Static_Harvester::fetch_tables_views(const std::string& filter) {
	std::vector<ObjectRow> objects;

	std::ostringstream sql;
	sql << "SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, TABLE_ROWS, "
	    << "DATA_LENGTH, INDEX_LENGTH, CREATE_TIME, UPDATE_TIME, TABLE_COMMENT "
	    << "FROM information_schema.TABLES "
	    << "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

	if (!filter.empty()) {
		sql << " AND TABLE_SCHEMA = '" << filter << "'";
	}

	sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME;";

	std::vector<std::vector<std::string>> results;
	if (execute_query(sql.str(), results) == 0) {
		for (const auto& row : results) {
			ObjectRow o;
			o.schema_name = row[0];
			o.object_name = row[1];
			o.object_type = (row[2] == "VIEW") ? "view" : "table";
			o.engine = row[3];
			o.table_rows_est = row[4].empty() ? 0 : atol(row[4].c_str());
			o.data_length = row[5].empty() ? 0 : atol(row[5].c_str());
			o.index_length = row[6].empty() ? 0 : atol(row[6].c_str());
			o.create_time = row[7];
			o.update_time = row[8];
			o.object_comment = row[9];
			objects.push_back(o);
		}
	}

	return objects;
}

// Fetch column metadata from information_schema.COLUMNS.
//
// Queries MySQL for all columns with their data types, nullability,
// defaults, character set, and comments.
//
// Parameters:
//   filter - Optional schema name filter
//
// Returns:
//   Vector of ColumnRow structures containing column metadata
std::vector<Static_Harvester::ColumnRow> Static_Harvester::fetch_columns(const std::string& filter) {
	std::vector<ColumnRow> columns;

	std::ostringstream sql;
	sql << "SELECT TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION, COLUMN_NAME, "
	    << "DATA_TYPE, COLUMN_TYPE, IS_NULLABLE, COLUMN_DEFAULT, EXTRA, "
	    << "CHARACTER_SET_NAME, COLLATION_NAME, COLUMN_COMMENT "
	    << "FROM information_schema.COLUMNS "
	    << "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

	if (!filter.empty()) {
		sql << " AND TABLE_SCHEMA = '" << filter << "'";
	}

	sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;";

	std::vector<std::vector<std::string>> results;
	if (execute_query(sql.str(), results) == 0) {
		for (const auto& row : results) {
			ColumnRow c;
			c.schema_name = row[0];
			c.object_name = row[1];
			c.ordinal_pos = atoi(row[2].c_str());
			c.column_name = row[3];
			c.data_type = row[4];
			c.column_type = row[5];
			c.is_nullable = (row[6] == "YES") ? 1 : 0;
			c.column_default = row[7];
			c.extra = row[8];
			c.charset = row[9];
			c.collation = row[10];
			c.column_comment = row[11];
			columns.push_back(c);
		}
	}

	return columns;
}

// Fetch index metadata from information_schema.STATISTICS.
//
// Queries MySQL for all indexes with their columns, sequence,
// uniqueness, cardinality, and collation.
//
// Parameters:
//   filter - Optional schema name filter
//
// Returns:
//   Vector of IndexRow structures containing index metadata
std::vector<Static_Harvester::IndexRow> Static_Harvester::fetch_indexes(const std::string& filter) {
	std::vector<IndexRow> indexes;

	std::ostringstream sql;
	sql << "SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, NON_UNIQUE, INDEX_TYPE, "
	    << "SEQ_IN_INDEX, COLUMN_NAME, SUB_PART, COLLATION, CARDINALITY "
	    << "FROM information_schema.STATISTICS "
	    << "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

	if (!filter.empty()) {
		sql << " AND TABLE_SCHEMA = '" << filter << "'";
	}

	sql << " ORDER BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME, SEQ_IN_INDEX;";

	std::vector<std::vector<std::string>> results;
	if (execute_query(sql.str(), results) == 0) {
		for (const auto& row : results) {
			IndexRow i;
			i.schema_name = row[0];
			i.object_name = row[1];
			i.index_name = row[2];
			i.is_unique = (row[3] == "0") ? 1 : 0;
			i.index_type = row[4];
			i.seq_in_index = atoi(row[5].c_str());
			i.column_name = row[6];
			i.sub_part = row[7].empty() ? 0 : atoi(row[7].c_str());
			i.collation = row[8];
			i.cardinality = row[9].empty() ? 0 : atol(row[9].c_str());
			indexes.push_back(i);
		}
	}

	return indexes;
}

// Fetch foreign key metadata from information_schema.
//
// Queries KEY_COLUMN_USAGE and REFERENTIAL_CONSTRAINTS to get
// foreign key relationships including child/parent tables and columns,
// and ON UPDATE/DELETE rules.
//
// Parameters:
//   filter - Optional schema name filter
//
// Returns:
//   Vector of FKRow structures containing foreign key metadata
std::vector<Static_Harvester::FKRow> Static_Harvester::fetch_foreign_keys(const std::string& filter) {
	std::vector<FKRow> fks;

	std::ostringstream sql;
	sql << "SELECT kcu.CONSTRAINT_SCHEMA AS child_schema, "
	    << "kcu.TABLE_NAME AS child_table, kcu.CONSTRAINT_NAME AS fk_name, "
	    << "kcu.COLUMN_NAME AS child_column, kcu.REFERENCED_TABLE_SCHEMA AS parent_schema, "
	    << "kcu.REFERENCED_TABLE_NAME AS parent_table, kcu.REFERENCED_COLUMN_NAME AS parent_column, "
	    << "kcu.ORDINAL_POSITION AS seq, rc.UPDATE_RULE AS on_update, rc.DELETE_RULE AS on_delete "
	    << "FROM information_schema.KEY_COLUMN_USAGE kcu "
	    << "JOIN information_schema.REFERENTIAL_CONSTRAINTS rc "
	    << "  ON rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA "
	    << "  AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME "
	    << "WHERE kcu.TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

	if (!filter.empty()) {
		sql << " AND kcu.TABLE_SCHEMA = '" << filter << "'";
	}

	sql << " AND kcu.REFERENCED_TABLE_NAME IS NOT NULL "
	    << "ORDER BY child_schema, child_table, fk_name, seq;";

	std::vector<std::vector<std::string>> results;
	if (execute_query(sql.str(), results) == 0) {
		for (const auto& row : results) {
			FKRow fk;
			fk.child_schema = row[0];
			fk.child_table = row[1];
			fk.fk_name = row[2];
			fk.child_column = row[3];
			fk.parent_schema = row[4];
			fk.parent_table = row[5];
			fk.parent_column = row[6];
			fk.seq = atoi(row[7].c_str());
			fk.on_update = row[8];
			fk.on_delete = row[9];
			fks.push_back(fk);
		}
	}

	return fks;
}
```
🧩 Analysis chain
🏁 Scripts executed (repository: ProxySQL/proxysql-vec):

```shell
# Search for calls to the fetch_* methods to understand where filter comes from
rg "fetch_schemas|fetch_tables_views|fetch_columns|fetch_indexes|fetch_foreign_keys" --type cpp -B 3 -A 3

# Check if there's a header file that might show the method signatures and their usage context
fd "\.h$" | head -20

# Look for where harvest_schemas is called to trace the data flow
rg "harvest_schemas" --type cpp -B 5 -A 5

# Search for calls to run_full_harvest
rg "run_full_harvest" --type cpp -B 3 -A 3

# Also look for any input validation or escaping patterns used elsewhere in the codebase
rg "mysql_escape|mysql_real_escape|prepared.*statement|parameterized" --type cpp | head -20

# Look at the Query_Tool_Handler.cpp context more thoroughly
rg "schema_filter|json_string" lib/Query_Tool_Handler.cpp -B 5 -A 10 | head -50

# Check execute_query implementation
rg "execute_query" lib/Static_Harvester.cpp -A 15 | head -40

# Check if execute_query can execute multiple statements
ast-grep --pattern 'int Static_Harvester::execute_query($_, $_) { $$$ }'
```
Add input validation for the filter parameter to prevent SQL injection in the fetch_* methods.
The filter parameter originates from user input via the MCP tool interface (discovery.run_static tool) and is embedded directly into SQL queries without escaping. This allows SQL injection via payloads like `test' OR '1'='1`. Implement whitelist validation accepting only alphanumeric characters and underscores, which is appropriate for MySQL schema identifiers.
Apply the validation across all five fetch methods:
- fetch_schemas()
- fetch_tables_views()
- fetch_columns()
- fetch_indexes()
- fetch_foreign_keys()
Example fix
```diff
+static bool is_safe_identifier(const std::string& s) {
+	return !s.empty() &&
+		std::all_of(s.begin(), s.end(),
+			[](unsigned char c){ return std::isalnum(c) || c == '_'; });
+}
+
 std::vector<Static_Harvester::SchemaRow> Static_Harvester::fetch_schemas(const std::string& filter) {
 	std::vector<SchemaRow> schemas;
 	std::ostringstream sql;
 	sql << "SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME "
 	    << "FROM information_schema.SCHEMATA";
 	if (!filter.empty()) {
+		if (!is_safe_identifier(filter)) {
+			proxy_error("Static_Harvester: Invalid schema filter '%s'\n", filter.c_str());
+			return schemas;
+		}
 		sql << " WHERE SCHEMA_NAME = '" << filter << "'";
 	}
```

🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 388 - 662, All five fetch_* methods
(fetch_schemas, fetch_tables_views, fetch_columns, fetch_indexes,
fetch_foreign_keys) currently interpolate the user-provided filter directly into
SQL and need whitelist validation to prevent SQL injection: add a small helper
(e.g., is_valid_identifier or validate_schema_filter) that returns true only for
non-empty strings matching /^[A-Za-z0-9_]+$/ (allowing empty meaning no filter),
call it at the start of each method and if the filter is non-empty and invalid
either return an empty result or treat it as no filter (choose consistent
behavior), and only append " AND ... = '<filter>'" to the SQL when the helper
approves the value; reference the filter variable and SQL-building blocks in
each function so the validation is applied before any SQL concatenation.
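To make the accept/reject behavior of the proposed whitelist concrete, here is a minimal standalone sketch of the `is_safe_identifier` helper from the suggested diff (the function name comes from that diff; the sample inputs are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Whitelist check from the suggested fix: accept only [A-Za-z0-9_],
// which is sufficient for MySQL schema identifiers used as a filter.
// An empty string is rejected here; callers treat empty as "no filter"
// before invoking the check.
static bool is_safe_identifier(const std::string& s) {
    return !s.empty() &&
           std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return std::isalnum(c) || c == '_'; });
}
```

Plain schema names pass, while the injection payload from the comment above (and anything containing quotes, spaces, or punctuation) is rejected before any SQL is built.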
```cpp
if (current_run_id < 0) {
    proxy_error("Static_Harvester: No active run\n");
    return -1;
}

std::ostringstream sql;
sql << "SELECT TABLE_SCHEMA, TABLE_NAME, VIEW_DEFINITION "
    << "FROM information_schema.VIEWS "
    << "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')";

if (!only_schema.empty()) {
    sql << " AND TABLE_SCHEMA = '" << only_schema << "'";
}

sql << ";";

std::vector<std::vector<std::string>> results;
if (execute_query(sql.str(), results) != 0) {
    return -1;
}

int count = 0;
for (const auto& row : results) {
    std::string schema_name = row[0];
    std::string view_name = row[1];
    std::string view_def = row[2];

    // Update object with definition
    char* error = NULL;
    int cols = 0, affected = 0;
    std::ostringstream update_sql;
    update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
               << "WHERE run_id = " << current_run_id
               << " AND schema_name = '" << schema_name << "'"
               << " AND object_name = '" << view_name << "'"
               << " AND object_type = 'view';";

    catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
    if (affected > 0) {
        count++;
    }
}
```
🧩 Analysis chain

🏁 Scripts executed during verification:

```shell
# Find and examine the file structure
find . -name "Static_Harvester.cpp" -o -name "Static_Harvester.h" | head -20

# Search for SQL escaping utilities or helper functions
rg -i "escape|quote" --type cpp | head -30

# Check the database type being used and how other queries handle input
rg "execute_query|execute_statement" --type cpp -B 2 -A 2 | head -50

# Find Static_Harvester.cpp
find . -name "Static_Harvester.cpp" -type f

# Examine the Static_Harvester.cpp file around the problematic lines
find . -name "Static_Harvester.cpp" -type f -exec wc -l {} \;

# Search for database-related utilities and escape functions in lib/
find ./lib -name "*.cpp" -o -name "*.h" | head -20

# Examine the problematic section in Static_Harvester.cpp
sed -n '1000,1060p' ./lib/Static_Harvester.cpp | cat -n

# Search for the execute_query function definition in Static_Harvester.cpp
rg "execute_query" ./lib/Static_Harvester.cpp -B 5 -A 10

# Look at the class definition and database connection context
sed -n '1,100p' ./lib/Static_Harvester.cpp | cat -n

# Examine the catalog database interface - check Discovery_Schema and get_db()
rg "class Discovery_Schema" -A 20 ./lib --type cpp

# Search for SQL escape or quote handling in the codebase
rg "definition_sql|UPDATE.*SET" ./lib/Static_Harvester.cpp -B 2 -A 2

# Check what database interface is being used for catalog
rg "execute_statement" ./lib --type cpp -B 1 | head -40

# Look for prepared statement usage or escaping utilities in Discovery_Schema or nearby
find ./lib -name "*Discovery*" -type f

# Examine Discovery_Schema to understand the database interface
head -100 ./lib/Discovery_Schema.cpp | cat -n

# Check the get_db() method and what it returns
rg "get_db()" ./lib/Discovery_Schema.cpp -B 2 -A 5

# Look for SQLite3 interface - check if execute_statement supports prepared statements
rg "execute_statement" ./lib/SQLite3_Server.cpp -B 1 -A 3 | head -50

# Check if there's a C++ wrapper or any escaping utilities available
rg "escape|sqlite3_bind|sqlite3_prepare" ./lib --type cpp | head -30

# Check the escape_string_single_quotes function implementation
rg "escape_string_single_quotes" ./lib/gen_utils.cpp -B 2 -A 15

# Check the header file to see if it's publicly available
rg "escape_string_single_quotes" ./include --type cpp -B 2 -A 2

# Verify that the code is really concatenating strings without escaping
sed -n '1035,1055p' ./lib/Static_Harvester.cpp | cat -n

# Check if Static_Harvester.h includes gen_utils or has access to escape functions
cat ./lib/Static_Harvester.h | head -50

# Find the Static_Harvester header file
find . -name "*Static_Harvester*" -type f

# Check the include statements in Static_Harvester.cpp
head -30 ./lib/Static_Harvester.cpp | grep -E "^#include"

# Look for the prepared statement usage pattern in Discovery_Schema
rg "sqlite3_prepare|SAFE_SQLITE3_STEP|bind_text" ./lib/Discovery_Schema.cpp -B 1 -A 3 | head -60

# Check if execute_statement method supports prepared statements with parameter binding
rg "execute_statement" ./lib/SQLite3DB.cpp -B 2 -A 5 2>/dev/null | head -50

# Check if Static_Harvester.cpp includes gen_utils.h
rg "#include" ./lib/Static_Harvester.cpp | head -10

# Look for other similar SQL concatenation patterns in Static_Harvester.cpp
rg "UPDATE|INSERT" ./lib/Static_Harvester.cpp | grep -E "<<.*<<" | head -10

# Examine how Discovery_Schema does prepared statements for similar updates
rg "UPDATE objects" ./lib/Discovery_Schema.cpp -B 5 -A 10

# Check if there are other places in Static_Harvester with unescaped string concatenation
rg "ostringstream.*<<" ./lib/Static_Harvester.cpp | grep -v "ORDER\|SELECT\|WHERE.*NOT IN"

# Check how execute_statement is used elsewhere for similar updates with variables
rg "execute_statement.*UPDATE" ./lib --type cpp -B 2 -A 2 | head -50

# Verify that prepared statements are NOT used in Static_Harvester.cpp
rg "sqlite3_prepare|bind_text|bind_int" ./lib/Static_Harvester.cpp

# Check if there are similar vulnerable patterns elsewhere in Static_Harvester
rg "ostringstream.*view_def|ostringstream.*schema_name|ostringstream.*view_name" ./lib/Static_Harvester.cpp

# Final check: confirm the vulnerability by looking at the exact SQL being built
sed -n '1035,1055p' ./lib/Static_Harvester.cpp

# Check the database interface for execute_statement - can it use prepared statements?
rg "class SQLite3DB" ./lib -A 30 | head -50

# Look for how other parts of the code handle similar updates with user data
rg "INSERT.*definition_sql\|UPDATE.*definition_sql" ./lib --type cpp -B 2 -A 2

# Verify that the fetched data (view_def, schema_name, view_name) can contain quotes
rg "VIEW_DEFINITION\|TABLE_SCHEMA\|TABLE_NAME" ./lib/Static_Harvester.cpp -B 2 -A 2
```
Escape view definitions and identifiers in SQLite catalog UPDATE statements.

`view_def`, `schema_name`, and `view_name` fetched from MySQL's `information_schema.VIEWS` can contain single quotes, which break the SQL statement or enable injection into the SQLite catalog. Use prepared statements with parameter binding (already available in the codebase via `sqlite3_bind_text`), or escape single quotes by doubling them (`'` → `''`). The codebase provides `escape_string_single_quotes()`, but it operates on C strings; if using that approach, convert the `std::string` values to C strings first.
🛠️ Suggested fix (prepared statement approach, preferred)

Replace the direct execute_statement() call with prepared statement binding, following the pattern already used in Discovery_Schema.cpp:

```diff
 // Update object with definition
-char* error = NULL;
-int cols = 0, affected = 0;
-std::ostringstream update_sql;
-update_sql << "UPDATE objects SET definition_sql = '" << view_def << "' "
-           << "WHERE run_id = " << current_run_id
-           << " AND schema_name = '" << schema_name << "'"
-           << " AND object_name = '" << view_name << "'"
-           << " AND object_type = 'view';";
-
-catalog->get_db()->execute_statement(update_sql.str().c_str(), &error, &cols, &affected);
+const char* sql = "UPDATE objects SET definition_sql = ?1 "
+                  "WHERE run_id = ?2 AND schema_name = ?3 "
+                  "AND object_name = ?4 AND object_type = 'view';";
+sqlite3_stmt* stmt = NULL;
+(*proxy_sqlite3_prepare_v2)(catalog->get_db()->get_db(), sql, -1, &stmt, NULL);
+(*proxy_sqlite3_bind_text)(stmt, 1, view_def.c_str(), -1, SQLITE_TRANSIENT);
+(*proxy_sqlite3_bind_int)(stmt, 2, current_run_id);
+(*proxy_sqlite3_bind_text)(stmt, 3, schema_name.c_str(), -1, SQLITE_TRANSIENT);
+(*proxy_sqlite3_bind_text)(stmt, 4, view_name.c_str(), -1, SQLITE_TRANSIENT);
+SAFE_SQLITE3_STEP2(stmt);
+(*proxy_sqlite3_finalize)(stmt);
```

🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1014 - 1055, The UPDATE that writes
view_def, schema_name, and view_name into the SQLite catalog (inside the
for-loop using update_sql and catalog->get_db()->execute_statement) is
vulnerable to breaking on single quotes; change it to use a prepared statement
with parameter binding (sqlite3_prepare_v2 / sqlite3_bind_text / sqlite3_step /
sqlite3_finalize) and bind view_def, schema_name, view_name, and current_run_id
instead of interpolating them, following the pattern in Discovery_Schema.cpp;
alternatively (less preferred) call escape_string_single_quotes() on
view_def.c_str(), schema_name.c_str(), and view_name.c_str() and use the escaped
C strings in the constructed SQL before calling execute_statement.
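If the escaping route is taken instead, the quote-doubling rule is simple to state and test in isolation. This is a hypothetical `std::string`-based sketch, not the codebase's `escape_string_single_quotes()` (which operates on C strings):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: double every single quote so the value can be embedded
// safely inside a single-quoted SQL string literal (' becomes '').
static std::string escape_single_quotes(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        out += c;
        if (c == '\'') out += '\'';  // emit the quote twice
    }
    return out;
}
```

Note this only protects values placed between single quotes; it does nothing for numeric positions or identifiers, which is one reason the prepared-statement approach above is preferred.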
```cpp
std::ostringstream sql;
sql << "SELECT object_id, schema_name, object_name, object_type, engine, table_rows_est, "
    << "data_length, index_length, has_primary_key, has_foreign_keys, has_time_column "
    << "FROM objects WHERE run_id = " << current_run_id
    << " AND object_type IN ('table', 'view')";

catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset);

if (!resultset) {
    return -1;
}

int count = 0;
for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
     it != resultset->rows.end(); ++it) {
    SQLite3_row* row = *it;

    int object_id = atoi(row->fields[0]);
    std::string object_name = std::string(row->fields[2] ? row->fields[2] : "");

    // Guess kind from name
    std::string guessed_kind = "unknown";
    std::string name_lower = object_name;
    std::transform(name_lower.begin(), name_lower.end(), name_lower.begin(), ::tolower);

    if (name_lower.find("log") != std::string::npos ||
        name_lower.find("event") != std::string::npos ||
        name_lower.find("audit") != std::string::npos) {
        guessed_kind = "log/event";
    } else if (name_lower.find("order") != std::string::npos ||
               name_lower.find("invoice") != std::string::npos ||
               name_lower.find("payment") != std::string::npos ||
               name_lower.find("transaction") != std::string::npos) {
        guessed_kind = "fact";
    } else if (name_lower.find("user") != std::string::npos ||
               name_lower.find("customer") != std::string::npos ||
               name_lower.find("account") != std::string::npos ||
               name_lower.find("product") != std::string::npos) {
        guessed_kind = "entity";
    }

    // Build profile JSON
    json profile;
    profile["guessed_kind"] = guessed_kind;
    profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
    profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
                             atol(row->fields[6] ? row->fields[6] : "0"));
    profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
    profile["has_primary_key"] = atoi(row->fields[7]) != 0;
    profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
    profile["has_time_column"] = atoi(row->fields[9]) != 0;
```
Fix column index mapping in quick profiles.

The SELECT column order doesn't match the indices used, so `rows_est`, `engine`, and the boolean flags are all derived from the wrong fields.
✅ Proposed fix

```diff
-profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0;
-profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") +
-                         atol(row->fields[6] ? row->fields[6] : "0"));
-profile["engine"] = std::string(row->fields[3] ? row->fields[3] : "");
-profile["has_primary_key"] = atoi(row->fields[7]) != 0;
-profile["has_foreign_keys"] = atoi(row->fields[8]) != 0;
-profile["has_time_column"] = atoi(row->fields[9]) != 0;
+profile["rows_est"] = row->fields[5] ? atol(row->fields[5]) : 0;
+profile["size_bytes"] = (atol(row->fields[6] ? row->fields[6] : "0") +
+                         atol(row->fields[7] ? row->fields[7] : "0"));
+profile["engine"] = std::string(row->fields[4] ? row->fields[4] : "");
+profile["has_primary_key"] = atoi(row->fields[8]) != 0;
+profile["has_foreign_keys"] = atoi(row->fields[9]) != 0;
+profile["has_time_column"] = atoi(row->fields[10]) != 0;
```

🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1088 - 1138, The quick-profile field
indices are misaligned with the SELECT: update the row->fields index usage
inside the loop (where profile is built) so they match the SELECT order
(object_id, schema_name, object_name, object_type, engine, table_rows_est,
data_length, index_length, has_primary_key, has_foreign_keys, has_time_column).
Concretely, set rows_est to row->fields[5], engine to row->fields[4], size_bytes
to the sum of row->fields[6] and row->fields[7], and the boolean flags
has_primary_key/has_foreign_keys/has_time_column to row->fields[8], [9], and
[10] respectively; update these in the code that constructs json profile (the
block around profile["guessed_kind"], profile["rows_est"],
profile["size_bytes"], profile["engine"], profile["has_primary_key"],
profile["has_foreign_keys"], profile["has_time_column"] in
Static_Harvester.cpp).
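The off-by-one is easiest to see with the SELECT's column order written out. A hypothetical enum (illustrative only, not in the codebase) mirroring that order gives the intended indices:

```cpp
#include <cassert>

// Column positions in the quick-profile SELECT:
// object_id, schema_name, object_name, object_type, engine, table_rows_est,
// data_length, index_length, has_primary_key, has_foreign_keys, has_time_column
enum QuickProfileCol {
    COL_OBJECT_ID = 0,
    COL_SCHEMA_NAME,       // 1
    COL_OBJECT_NAME,       // 2
    COL_OBJECT_TYPE,       // 3
    COL_ENGINE,            // 4  (code read fields[3])
    COL_TABLE_ROWS_EST,    // 5  (code read fields[4])
    COL_DATA_LENGTH,       // 6  (code read fields[5])
    COL_INDEX_LENGTH,      // 7  (code read fields[6])
    COL_HAS_PRIMARY_KEY,   // 8  (code read fields[7])
    COL_HAS_FOREIGN_KEYS,  // 9  (code read fields[8])
    COL_HAS_TIME_COLUMN    // 10 (code read fields[9])
};
```

Reading `row->fields[COL_...]` through named constants like these would make this class of misalignment much harder to introduce.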
Replaced by #26
Multi-Agent Database Discovery v1.3 - Performance & Statistical Analysis
Summary
This PR implements Priority 1 improvements identified by the META agent from the previous discovery run. These enhancements significantly improve the depth, confidence, and actionability of database discovery reports.
Expected Impact: +25% overall quality, +30% confidence in findings
Key Improvements
1. Performance Baseline Measurement (QUERY Agent)
The QUERY agent now executes actual performance queries with timing measurements instead of relying solely on EXPLAIN output.
Required Tests (5 per table):
Output:
Example:
2. Statistical Significance Testing (STATISTICAL Agent)
The STATISTICAL agent now performs rigorous statistical tests with p-values and effect sizes.
Required Tests (5 types):
Output:
Example:
3. Enhanced Cross-Domain Question Synthesis (META Agent)
The META agent now generates 15+ cross-domain questions across 5 categories.
Distribution:
Each question includes:
Example:
Files Changed
- prompts/multi_agent_discovery_prompt.md
- README.md

Prompt Evolution History
Testing
To test the new v1.3 capabilities:
```shell
cd scripts/mcp/DiscoveryAgent/ClaudeCode_Headless
python ./headless_db_discovery.py --database testdb --output discovery_v13_test.md
```

Expected improvements in output:
Future Improvements (Priority 2)
These are identified for future iterations:
See META_ANALYSIS_PROMPT_IMPROVEMENTS.md for the complete roadmap.

Based on: META agent analysis from Round 3 discovery run
Confidence: HIGH (all improvements validated with SQL evidence)