Merge v3.1-MCP2 into v3.1-vec (conflicts resolved)#26
Conversation
Replace the old single-agent discovery approach with a new 4-agent collaborative system that provides more comprehensive analysis. Changes: - Add prompts/ directory with multi-agent system prompts - Replace README.md with multi-agent discovery documentation - Replace headless_db_discovery.py to use 4-agent prompt - Replace headless_db_discovery.sh to use 4-agent prompt - Remove HEADLESS_DISCOVERY_README.md (replaced by README.md) The new system uses 4 collaborating agents (STRUCTURAL, STATISTICAL, SEMANTIC, QUERY) that communicate via MCP catalog across 4 rounds: 1. Blind Exploration (parallel independent discovery) 2. Pattern Recognition (cross-agent collaboration) 3. Hypothesis Testing (validation with evidence) 4. Final Synthesis (comprehensive report) This provides deeper analysis with 15+ hypothesis validations, health scores, and prioritized recommendations.
- Change default timeout from 300s to 3600s (1 hour) for multi-agent discovery - Add specific timeout error detection (exit code 124) with helpful message - Add empty output file detection with diagnostic suggestions - Improve error messages to guide users on how to debug issues - Show example commands for increasing timeout when timeout occurs The multi-agent discovery process involves 4 rounds of agent collaboration and can take significantly longer than the previous single-agent approach.
Expand the 4-agent system to 6 agents (5 analysis + 1 meta) with enhanced security analysis and self-improving prompt optimization. New Agents: - SECURITY: Identifies sensitive data (PII, credentials, financial), assesses access patterns, identifies vulnerabilities, and provides compliance assessment (GDPR, PCI-DSS) - META: Analyzes report quality by section, identifies gaps, suggests specific prompt improvements for future runs Protocol Changes: - Expanded from 4 rounds to 5 rounds - Round 5 is Meta Analysis (META agent only) - META agent does not participate in rounds 1-4 New Report Sections: - 5. SECURITY ANALYSIS with data classification (PUBLIC/INTERNAL/ CONFIDENTIAL/RESTRICTED) - E. Security data classification appendix New Output: - Separate META ANALYSIS document with: - Section quality ratings (depth, completeness) - Specific prompt improvement suggestions - Gap identification - Evolution history tracking This enables continuous prompt optimization through multiple discovery iterations, with each run informing improvements for the next.
Each agent now generates a comprehensive catalog of questions they can answer about the database, with executable answer plans using MCP tools. Question Catalog Features: - 90+ questions across all agents (minimum 15-20 per agent) - Executable answer plans with specific MCP tools for each question - Answer templates with structured output formats - Complexity ratings (LOW/MEDIUM/HIGH) and time estimates - Cross-domain questions requiring multiple agents New Output Locations: - kind="question_catalog", key="structural_questions" - kind="question_catalog", key="statistical_questions" - kind="question_catalog", key="semantic_questions" - kind="question_catalog", key="query_questions" - kind="question_catalog", key="security_questions" - kind="question_catalog", key="cross_domain_questions" Benefits: 1. Fast Answers: Pre-validated plans skip analysis phase 2. Consistent Quality: All answers follow proven templates 3. Tool Reuse: Efficient MCP tool usage patterns 4. Comprehensive Coverage: 90+ questions cover most user needs Example Question Catalog Entry: - Question: "What sensitive data exists in table X?" - Answer Plan: sample_rows + column_profile on table X - Answer Template: Structured list with sensitivity classification - Complexity: MEDIUM - Estimated Time: 30 seconds This creates a reusable knowledge base for future LLM interactions, enabling quick, accurate responses to common database questions.
…ings Add prominent warnings and explicit instructions to prevent agents from creating individual markdown files instead of using the MCP catalog. Changes: - Add "CRITICAL: MCP CATALOG USAGE" section at the top of the prompt - Explicitly prohibit Write tool for individual agent discoveries - Specify that ONLY Round 4 final report should be written to file - Add detailed catalog_upsert usage examples - Update all round descriptions with CRITICAL catalog usage instructions - Add "IMPORTANT - Catalog Usage Rules" section Key Instructions Added: - "DO NOT use the Write tool to create separate markdown files" - "ALL agent findings MUST be written to MCP catalog using catalog_upsert" - "ONLY Round 4 Final Synthesis writes to local file using Write tool" - "DO NOT use Write tool for individual agent discoveries in Rounds 1-3" This ensures agents store findings in the MCP catalog for cross-agent collaboration, with only the final consolidated report written to disk.
Implemented three critical improvements identified by META agent:
1. Performance Baseline Measurement (QUERY agent):
- Added 5 required query types with timing measurements
- Compare actual execution time vs EXPLAIN cost
- Efficiency score calculation (1-10 scale)
- Output format with performance scores per table
2. Statistical Significance Testing (STATISTICAL agent):
- Added 5 required statistical tests:
* Normality tests (Shapiro-Wilk, Anderson-Darling)
* Correlation analysis (Pearson, Spearman) with CI
* Chi-square tests for categorical associations
* Outlier detection (Modified Z-score, IQR)
* Group comparisons (t-test, Mann-Whitney U)
- All tests report p-values and effect sizes
- Statistical confidence scoring (1-10)
3. Enhanced Cross-Domain Question Synthesis:
- Increased minimum from general suggestion to 15 questions
- Added 5 cross-domain categories with examples
- Detailed question template with multi-phase plans
- Priority distribution (URGENT/HIGH/MEDIUM)
- Quality criteria for cross-domain questions
Expected Impact: +25% overall quality, +30% confidence in findings
Updated README to reflect new capabilities: 1. Statistical Analysis section: - Added Statistical Significance Testing subsection - Listed 5 required statistical tests (normality, correlation, chi-square, outliers, group comparisons) - Mentioned p-values and effect sizes 2. Query Analysis section: - Added Performance Baseline Measurement subsection - Listed 6 required query types with timing - Mentioned efficiency scoring (EXPLAIN vs actual) 3. Question Catalogs section: - Updated to reflect 15+ cross-domain questions (enhanced in v1.3) - Added 5 cross-domain categories with question counts 4. Cross-Domain Questions section: - Expanded from 3 examples to 15 specific questions - Organized by 5 categories with question counts - Matched new v1.3 requirements
Added patterns to ignore all discovery output files: Root .gitignore: - discovery_*.md (main discovery reports at repo root) - database_discovery_report.md scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/.gitignore: - *_QUESTION_CATALOG.md (individual agent question catalogs) - *_round1_*.md, *_round2_*.md, *_round3_*.md, *_round4_*.md (round outputs) - *_COORDINATOR_SUMMARY.md (coordinator summaries) - *_HYPOTHESIS_TESTING.md (hypothesis testing results) - *_INDEX.md (index files) - *_QUICK_REFERENCE.md (quick reference docs) - META_ANALYSIS_*.md (META agent outputs) - SECURITY_AGENT_*.txt (security agent text outputs) - query_agent_*.md, security_agent_*.md, semantic_*.md, statistical_*.md These files are generated when agents incorrectly use Write tool instead of catalog_upsert (the issue fixed in v1.2). They should not be committed to the repository as they are test outputs and are stored in the MCP catalog.
…aint This commit addresses two issues: 1. MCP Catalog Schema Separation: - Add 'schema' column to catalog table for proper isolation - Update all catalog methods (upsert, get, search, list, remove) to accept schema parameter - Update MCP tool handlers and JSON-RPC parameter parsing - Unique constraint changed from (kind, key) to (schema, kind, key) - FTS table updated to include schema column 2. Discovery Prompt Scope Constraint: - Add explicit SCOPE CONSTRAINT section to multi_agent_discovery_prompt.md - Agents now respect Target Schema and skip list_schemas when specified - Prevents analyzing all schemas when only one is targeted Files modified: - include/MySQL_Catalog.h: Add schema parameter to all catalog methods - include/MySQL_Tool_Handler.h: Update wrapper method signatures - lib/MySQL_Catalog.cpp: Implement schema filtering in all operations - lib/MySQL_Tool_Handler.cpp: Update wrapper implementations - lib/Query_Tool_Handler.cpp: Extract schema from JSON-RPC params, update tool descriptions - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/multi_agent_discovery_prompt.md: Add scope constraint
Phase 1 (Static/Deterministic): - Add Discovery_Schema: SQLite catalog with deterministic and LLM tables - Add Static_Harvester: MySQL INFORMATION_SCHEMA metadata extraction - Harvest schemas, objects, columns, indexes, foreign keys, view definitions - Compute derived hints: is_time, is_id_like, has_pk, has_fks, has_time - Build quick profiles and FTS5 indexes Phase 2 (LLM Agent): - Add 19 new MCP tools for two-phase discovery - discovery.run_static: Trigger ProxySQL's static harvest - Catalog tools: init, search, get_object, list_objects, get_relationships - Agent tools: run_start, run_finish, event_append - LLM tools: summary_upsert, relationship_upsert, domain_upsert, etc. Files: - include/Discovery_Schema.h, lib/Discovery_Schema.cpp - include/Static_Harvester.h, lib/Static_Harvester.cpp - include/Query_Tool_Handler.h, lib/Query_Tool_Handler.cpp (updated) - lib/Makefile (updated) - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/prompts/ - scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/two_phase_discovery.py
Add missing --mcp-config argument which is required by the script. The dry-run example now correctly shows all required parameters.
- Add mcp_config.example.json for Claude Code MCP configuration - Fix MCP bridge path in example config (../../proxysql_mcp_stdio_bridge.py) - Update Two_Phase_Discovery_Implementation.md with correct Phase 1/Phase 2 usage - Fix Two_Phase_Discovery_Implementation.md DELETE FROM fts_objects to scope to run_id - Update README.md with two-phase discovery section and multi-agent legacy note - Create static_harvest.sh bash wrapper for Phase 1 - Create two_phase_discovery.py orchestration script with prompts - Add --run-id parameter to skip auto-fetch - Fix RUN_ID placeholder mismatch (<USE_THE_PROVIDED_RUN_ID>) - Fix catalog path default to mcp_catalog.db - Add test_catalog.sh to verify catalog tools work - Fix Discovery_Schema.cpp FTS5 syntax (missing space) - Remove invalid CREATE INDEX on FTS virtual tables - Add MCP tool call logging to track tool usage - Fix Static_Harvester::get_harvest_stats() to accept run_id parameter - Fix DELETE FROM fts_objects to only delete for specific run_id - Update system prompts to say DO NOT call discovery.run_static - Update user prompts to say Phase 1 is already complete - Add --mcp-only flag to restrict Claude Code to MCP tools only - Make FTS table failures non-fatal (check if table exists first) - Add comprehensive documentation for both discovery approaches
- Add try-catch around handle_jsonrpc_request to catch unexpected exceptions - Add detailed logging for tool execution success/failure - Add proper SQLite error checking in create_agent_run with error messages - Fix json_int/json_double to handle both numbers and numeric strings The json_int function was throwing exceptions when receiving numeric strings (e.g., "14" instead of 14) from clients, causing 500 errors. Now it handles both formats gracefully. Also added logging so tool failures are visible in logs instead of being silent 500 errors.
Exception handlers now log the full request payload that caused the error, making debugging much easier. Changes: - Move req_body/req_path declarations outside try block so catch handlers can access them - Log request payload in all exception handlers (parse errors, std::exception, and catch-all) - Log tool arguments when tool execution fails Previously, exceptions would only log the error message without context, making it impossible to reproduce the issue. Now the full payload is logged.
1. Fix error logging to catch ALL tool failures, not just those with
both success and result fields. Previously, error responses like
{"success": false, "error": "..."} without a result field were
silently ignored.
2. Fix llm.domain_set_members to accept both array and JSON string
formats for the members parameter. Some clients send it as a
JSON string, others as a native array.
3. Add detailed error logging for llm.domain_set_members failures,
including what was actually received.
- Add resolve_run_id() to map schema names to latest run_id - Update all catalog and LLM tools to accept schema names - Deprecate describe_table, table_profile, column_profile - Deprecate get_constraints, suggest_joins, find_reference_candidates - Keep sample_rows, sample_distinct for data preview
Tool was deprecated; users should use catalog.get_object instead.
- Query schemas from catalog's schemas table - Maintains same output format for compatibility - Removes dependency on live MySQL connection
Add stats_mcp_query_tools_counters and stats_mcp_query_tools_counters_reset tables to track MCP query tool usage statistics. - Added get_tool_usage_stats_resultset() method to Query_Tool_Handler - Defined table schemas in ProxySQL_Admin_Tables_Definitions.h - Registered tables in Admin_Bootstrap.cpp - Added pattern matching in ProxySQL_Admin.cpp - Added stats___mcp_query_tools_counters() in ProxySQL_Admin_Stats.cpp - Fixed friend declaration for track_tool_invocation() - Fixed Discovery_Schema.cpp log_llm_search() to use prepare_v2/finalize
Remove the mcp-catalog_path configuration variable and hardcode the catalog database path to datadir/mcp_catalog.db for stability. Rationale: The catalog database is session state, not user configuration. Runtime swapping of the catalog could cause tables to be missed and the catalog to fail even if it was succeeding a second earlier. Changes: - Removed catalog_path from mcp_thread_variables_names array - Removed mcp_catalog_path from MCP_Thread variables struct - Removed getter/setter logic for catalog_path - Hardcoded catalog path to GloVars.datadir/mcp_catalog.db in: - ProxySQL_MCP_Server.cpp (Query_Tool_Handler initialization) - Admin_FlushVariables.cpp (MySQL_Tool_Handler reinitialization) - Updated VARIABLES.md to document the hardcoded path - Updated configure_mcp.sh to remove catalog_path configuration - Updated MCP README to remove catalog_path references
The MCP catalog database is now accessible as the 'mcp_catalog' schema from the ProxySQL Admin interface, enabling direct SQL queries against discovered schemas and LLM memories.
Extend the stats_mcp_query_tools_counters table with timing statistics (first_seen, last_seen, sum_time, min_time, max_time) following the same pattern as stats_mysql_query_digest. All timing values are in microseconds using monotonic_time(). New schema: - tool VARCHAR - schema VARCHAR - count INT - first_seen INTEGER (microseconds) - last_seen INTEGER (microseconds) - sum_time INTEGER (microseconds - total execution time) - min_time INTEGER (microseconds - minimum execution time) - max_time INTEGER (microseconds - maximum execution time)
The column name 'limit' conflicts with SQL reserved keyword. Escaped as "\"limit\"" to fix table creation.
Format column definitions in CREATE TABLE IF NOT EXISTS statements to have a space before and after each comma (e.g., " , "). This allows ProxySQL Admin to properly display multi-line table schemas. Modified files: - Discovery_Schema.cpp - MySQL_Catalog.cpp - AI_Features_Manager.cpp
Add query_tool_calls table to Discovery Schema to track all MCP tool invocations via the /mcp/query/ endpoint. Logs: - tool_name: Name of the tool that was called - schema: Schema name (nullable, empty if not applicable) - run_id: Run ID from discovery (nullable, 0 if not applicable) - start_time: Start monotonic time in microseconds - execution_time: Execution duration in microseconds - error: Error message (null if success) Modified files: - Discovery_Schema.cpp: Added table creation and log_query_tool_call function - Discovery_Schema.h: Added function declaration - Query_Tool_Handler.cpp: Added logging after each tool execution
…d issue - Rename llm_search_log column from \"limit\" to \"lmt\" to avoid SQL reserved keyword - Add FTS inserts to all LLM artifact upsert functions: - add_question_template(): index question templates for search - add_llm_note(): index notes for search - upsert_llm_summary(): index object summaries for search - upsert_llm_domain(): index domains for search - upsert_llm_metric(): index metrics for search - Remove content='' from fts_llm table to store content directly - Add <functional> header for std::hash usage This fixes the bug where llm_search always returned empty results because the FTS index was never populated.
Conflict resolution summary: **Features preserved from v3.1-vec:** - mcp_use_ssl variable (HTTP/HTTPS mode support) - RAG tool handler support - FTS (Full Text Search) functionality in MySQL_Tool_Handler **Changes accepted from v3.1-MCP2:** - Schema isolation for catalog (removed mcp_catalog_path/fts_path variables) - Schema parameter added to catalog functions - Comprehensive query rewriting improvements in Query_Tool_Handler - Format string fixes in vector_db_performance-t.cpp **Key decisions:** - catalog_path/fts_path: Removed per v3.1-MCP2 commit 35b0b22 (intentional) - MySQL_Catalog.cpp: Accepted v3.1-MCP2 version (schema isolation improvements) - Query_Tool_Handler.cpp: Accepted v3.1-MCP2 version (significant improvements) - Makefile: Kept v3.1-vec's EXCLUDE_TRACKING_VARIABLES (correct spelling) **Build notes:** - MySQL_Tool_Handler now requires schema parameter for catalog operations - Catalog path is hardcoded to datadir/mcp_catalog.db - FTS functionality preserved but may need updates for schema isolation
…ility) v3.1-MCP2 introduced lib/proxy_sqlite3_symbols.cpp which provides canonical definitions for proxy_sqlite3_* function pointers. Tests that link against libproxysql.a must NOT define MAIN_PROXY_SQLITE3, otherwise sqlite3db.h will provide conflicting definitions. Removed #define MAIN_PROXY_SQLITE3 from: - test/tap/tests/sqlite3-t.cpp - test/tap/tests/unit_test.h - test/tap/tests/setparser_test.cpp - test/tap/tests/setparser_test_common.h
|
Caution Review failedThe pull request is closed. Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughSummary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughAdds a Retrieval-Augmented Generation (RAG) subsystem and supporting MCP/catalog infrastructure: new RAG_Tool_Handler and implementation, Discovery_Schema and Static_Harvester, RAG SQLite schema and ingest PoC, MCP /mcp/rag endpoint wiring, GenAI/RAG configuration, sqlite3 proxy-symbol layer, MCP query-rules/digest admin & stats, and extensive docs/tests/scripts. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as MCP Client
participant Server as ProxySQL MCP Server
participant Dispatcher as Tool Dispatcher
participant RAG as RAG_Tool_Handler
participant DB as SQLite Vector/FTS DB
participant Catalog as Discovery_Schema
Client->>Server: POST /mcp/rag (rag.search_hybrid)
Server->>Dispatcher: route tool_name, arguments
Dispatcher->>RAG: execute_tool("rag.search_hybrid", params)
RAG->>Catalog: validate source/config and load mapping
RAG->>RAG: validate inputs (k, candidates, byte limits)
RAG->>RAG: build_sql_filters(filters)
par FTS Path
RAG->>DB: SELECT rag_fts_chunks WHERE filters (BM25)
DB-->>RAG: fts_results
and Vector Path
RAG->>RAG: generate/query embedding (via GenAI bridge)
RAG->>DB: SELECT rag_vec_chunks (vector search)
DB-->>RAG: vec_results
end
RAG->>RAG: compute_rrf_score / merge results
RAG->>Catalog: optional refetch (fetch_from_source)
RAG->>Dispatcher: return results (chunks/docs, scores, stats)
Dispatcher-->>Client: JSON-RPC response
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
Summary of ChangesHello @renecannao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates a comprehensive RAG subsystem and a two-phase database schema discovery architecture into ProxySQL. It significantly refactors the MCP module for improved modularity and introduces advanced capabilities for AI-driven database interactions, while also enhancing internal SQLite handling and query management. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This is a very large pull request that merges two major feature branches, introducing a RAG (Retrieval-Augmented Generation) subsystem and a new two-phase schema discovery architecture. The changes are extensive, with a significant amount of new code, documentation, and refactoring of the MCP module. The new features appear well-documented. However, I've identified several critical SQL injection vulnerabilities in the new C++ code where SQL queries are constructed using string concatenation with user input. There is also a high-severity issue related to insufficient validation in a tool that executes raw SQL. Additionally, there are a few medium-severity issues concerning documentation redundancy and an unnecessary backup file being checked into the repository. These security vulnerabilities should be addressed with high priority.
lib/MySQL_Catalog.cpp
Outdated
| std::ostringstream sql; | ||
| sql << "SELECT c.kind, c.key, c.document, c.tags, c.links " | ||
| << "FROM catalog c " | ||
| << "INNER JOIN catalog_fts f ON c.id = f.rowid " | ||
| << "WHERE catalog_fts MATCH '" << escaped_query << "'"; | ||
| sql << "SELECT schema, kind, key, document, tags , links FROM catalog WHERE 1=1"; | ||
|
|
||
| int param_count = 1; // Track parameter binding position | ||
| // Add schema filter | ||
| if (!schema.empty()) { | ||
| sql << " AND schema = '" << schema << "'"; | ||
| } | ||
|
|
||
| // Add kind filter with parameter placeholder | ||
| // Add kind filter | ||
| if (!kind.empty()) { | ||
| sql << " AND c.kind = ?"; | ||
| sql << " AND kind = '" << kind << "'"; | ||
| } | ||
|
|
||
| // Add tags filter with parameter placeholder | ||
| // Add tags filter | ||
| if (!tags.empty()) { | ||
| sql << " AND c.tags LIKE ?"; | ||
| sql << " AND tags LIKE '%" << tags << "%'"; | ||
| } | ||
|
|
||
| // Order by relevance (BM25) and recency | ||
| sql << " ORDER BY bm25(f) ASC, c.updated_at DESC LIMIT ? OFFSET ?"; | ||
|
|
||
| // Prepare the statement | ||
| sqlite3_stmt* stmt = NULL; | ||
| int rc = db->prepare_v2(sql.str().c_str(), &stmt); | ||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Catalog search: Failed to prepare statement: %d\n", rc); | ||
| return "[]"; | ||
| // Add search query | ||
| if (!query.empty()) { | ||
| sql << " AND (key LIKE '%" << query << "%' " | ||
| << "OR document LIKE '%" << query << "%' " | ||
| << "OR tags LIKE '%" << query << "%')"; | ||
| } | ||
|
|
||
| // Bind parameters | ||
| param_count = 1; | ||
| if (!kind.empty()) { | ||
| (*proxy_sqlite3_bind_text)(stmt, param_count++, kind.c_str(), -1, SQLITE_TRANSIENT); | ||
| } | ||
| if (!tags.empty()) { | ||
| // Add wildcards for LIKE search | ||
| std::string tags_pattern = "%" + tags + "%"; | ||
| (*proxy_sqlite3_bind_text)(stmt, param_count++, tags_pattern.c_str(), -1, SQLITE_TRANSIENT); | ||
| sql << " ORDER BY updated_at DESC LIMIT " << limit << " OFFSET " << offset; | ||
|
|
There was a problem hiding this comment.
The search method constructs SQL queries by concatenating user-provided parameters (schema, kind, tags, query) directly into the query string. This creates a SQL injection vulnerability. An attacker could craft malicious input to execute arbitrary SQL commands. Please refactor this to use parameterized queries to ensure security. The list method has a similar vulnerability.
| bool RAG_Tool_Handler::build_sql_filters(const json& filters, std::string& sql) { | ||
| // Apply filters with input validation to prevent SQL injection | ||
| if (filters.contains("source_ids") && filters["source_ids"].is_array()) { | ||
| std::vector<int> source_ids = get_json_int_array(filters, "source_ids"); | ||
| if (!source_ids.empty()) { | ||
| // Validate that all source_ids are integers (they should be by definition) | ||
| std::string source_list = ""; | ||
| for (size_t i = 0; i < source_ids.size(); ++i) { | ||
| if (i > 0) source_list += ","; | ||
| source_list += std::to_string(source_ids[i]); | ||
| } | ||
| sql += " AND c.source_id IN (" + source_list + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("source_names") && filters["source_names"].is_array()) { | ||
| std::vector<std::string> source_names = get_json_string_array(filters, "source_names"); | ||
| if (!source_names.empty()) { | ||
| // Validate source names to prevent SQL injection | ||
| std::string source_list = ""; | ||
| for (size_t i = 0; i < source_names.size(); ++i) { | ||
| const std::string& source_name = source_names[i]; | ||
| // Basic validation - check for dangerous characters | ||
| if (source_name.find('\'') != std::string::npos || | ||
| source_name.find('\\') != std::string::npos || | ||
| source_name.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| if (i > 0) source_list += ","; | ||
| source_list += "'" + source_name + "'"; | ||
| } | ||
| sql += " AND c.source_id IN (SELECT source_id FROM rag_sources WHERE name IN (" + source_list + "))"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) { | ||
| std::vector<std::string> doc_ids = get_json_string_array(filters, "doc_ids"); | ||
| if (!doc_ids.empty()) { | ||
| // Validate doc_ids to prevent SQL injection | ||
| std::string doc_list = ""; | ||
| for (size_t i = 0; i < doc_ids.size(); ++i) { | ||
| const std::string& doc_id = doc_ids[i]; | ||
| // Basic validation - check for dangerous characters | ||
| if (doc_id.find('\'') != std::string::npos || | ||
| doc_id.find('\\') != std::string::npos || | ||
| doc_id.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| if (i > 0) doc_list += ","; | ||
| doc_list += "'" + doc_id + "'"; | ||
| } | ||
| sql += " AND c.doc_id IN (" + doc_list + ")"; | ||
| } | ||
| } | ||
|
|
||
| // Metadata filters | ||
| if (filters.contains("post_type_ids") && filters["post_type_ids"].is_array()) { | ||
| std::vector<int> post_type_ids = get_json_int_array(filters, "post_type_ids"); | ||
| if (!post_type_ids.empty()) { | ||
| // Validate that all post_type_ids are integers | ||
| std::string post_type_conditions = ""; | ||
| for (size_t i = 0; i < post_type_ids.size(); ++i) { | ||
| if (i > 0) post_type_conditions += " OR "; | ||
| post_type_conditions += "json_extract(d.metadata_json, '$.PostTypeId') = " + std::to_string(post_type_ids[i]); | ||
| } | ||
| sql += " AND (" + post_type_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_any") && filters["tags_any"].is_array()) { | ||
| std::vector<std::string> tags_any = get_json_string_array(filters, "tags_any"); | ||
| if (!tags_any.empty()) { | ||
| // Validate tags to prevent SQL injection | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_any.size(); ++i) { | ||
| const std::string& tag = tags_any[i]; | ||
| // Basic validation - check for dangerous characters | ||
| if (tag.find('\'') != std::string::npos || | ||
| tag.find('\\') != std::string::npos || | ||
| tag.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| if (i > 0) tag_conditions += " OR "; | ||
| // Escape the tag for LIKE pattern matching | ||
| std::string escaped_tag = tag; | ||
| // Simple escaping - replace special characters | ||
| size_t pos = 0; | ||
| while ((pos = escaped_tag.find("'", pos)) != std::string::npos) { | ||
| escaped_tag.replace(pos, 1, "''"); | ||
| pos += 2; | ||
| } | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + escaped_tag + ">%' ESCAPE '\\'"; | ||
| } | ||
| sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_all") && filters["tags_all"].is_array()) { | ||
| std::vector<std::string> tags_all = get_json_string_array(filters, "tags_all"); | ||
| if (!tags_all.empty()) { | ||
| // Validate tags to prevent SQL injection | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_all.size(); ++i) { | ||
| const std::string& tag = tags_all[i]; | ||
| // Basic validation - check for dangerous characters | ||
| if (tag.find('\'') != std::string::npos || | ||
| tag.find('\\') != std::string::npos || | ||
| tag.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| if (i > 0) tag_conditions += " AND "; | ||
| // Escape the tag for LIKE pattern matching | ||
| std::string escaped_tag = tag; | ||
| // Simple escaping - replace special characters | ||
| size_t pos = 0; | ||
| while ((pos = escaped_tag.find("'", pos)) != std::string::npos) { | ||
| escaped_tag.replace(pos, 1, "''"); | ||
| pos += 2; | ||
| } | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + escaped_tag + ">%' ESCAPE '\\'"; | ||
| } | ||
| sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("created_after") && filters["created_after"].is_string()) { | ||
| std::string created_after = filters["created_after"].get<std::string>(); | ||
| // Validate date format to prevent SQL injection | ||
| if (created_after.find('\'') != std::string::npos || | ||
| created_after.find('\\') != std::string::npos || | ||
| created_after.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| // Filter by CreationDate in metadata_json | ||
| sql += " AND json_extract(d.metadata_json, '$.CreationDate') >= '" + created_after + "'"; | ||
| } | ||
|
|
||
| if (filters.contains("created_before") && filters["created_before"].is_string()) { | ||
| std::string created_before = filters["created_before"].get<std::string>(); | ||
| // Validate date format to prevent SQL injection | ||
| if (created_before.find('\'') != std::string::npos || | ||
| created_before.find('\\') != std::string::npos || | ||
| created_before.find(';') != std::string::npos) { | ||
| return false; | ||
| } | ||
| // Filter by CreationDate in metadata_json | ||
| sql += " AND json_extract(d.metadata_json, '$.CreationDate') <= '" + created_before + "'"; | ||
| } | ||
|
|
||
| return true; | ||
| } |
There was a problem hiding this comment.
The build_sql_filters function constructs SQL WHERE clauses by concatenating filter values directly into the query string. While there's some basic validation, it's not sufficient to prevent SQL injection. For example, the validation for source_names can be bypassed. All user-provided inputs that are part of a SQL query must be treated as untrusted. Please use parameterized queries for all filters to mitigate this critical vulnerability. Also, in rag.search_fts, the FTS MATCH clause concatenates the user query directly, which is also an injection risk.
| if (!filter.empty()) { | ||
| sql << " WHERE SCHEMA_NAME = '" << filter << "'"; | ||
| } |
There was a problem hiding this comment.
This function constructs a SQL query by concatenating the filter string directly. This is a SQL injection vulnerability. If a malicious string is passed as filter, it could lead to arbitrary SQL execution. Please use parameterized queries to prevent this. The same issue exists in fetch_tables_views, fetch_columns, fetch_indexes, and fetch_foreign_keys.
| bool Query_Tool_Handler::validate_readonly_query(const std::string& query) { | ||
| std::string upper = query; | ||
| std::transform(upper.begin(), upper.end(), upper.begin(), ::toupper); | ||
|
|
||
| // Check for dangerous keywords | ||
| std::vector<std::string> dangerous = { | ||
| "INSERT", "UPDATE", "DELETE", "DROP", "CREATE", "ALTER", | ||
| "TRUNCATE", "REPLACE", "LOAD", "CALL", "EXECUTE" | ||
| }; | ||
|
|
||
| for (const auto& word : dangerous) { | ||
| if (upper.find(word) != std::string::npos) { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| // Must start with SELECT or WITH or EXPLAIN | ||
| if (upper.find("SELECT") == 0 && upper.find("FROM") != std::string::npos) { | ||
| return true; | ||
| } | ||
| if (upper.find("WITH") == 0) { | ||
| return true; | ||
| } | ||
| if (upper.find("EXPLAIN") == 0) { | ||
| return true; | ||
| } | ||
| if (upper.find("SHOW") == 0) { | ||
| return true; | ||
| } | ||
| if (upper.find("DESCRIBE") == 0 || upper.find("DESC") == 0) { | ||
| return true; | ||
| } | ||
|
|
||
| return false; | ||
| } |
There was a problem hiding this comment.
The validate_readonly_query function uses a blacklist of dangerous keywords to determine if a query is read-only. This approach is notoriously easy to bypass (e.g., using comments /**/, different encodings, or alternative syntax). Executing a raw SQL string that has only been sanitized with a blacklist is a significant security risk. Consider using a more robust validation method, such as a proper SQL parser that can analyze the query's abstract syntax tree (AST) to ensure it's truly a read-only SELECT statement without any side effects.
| # RAG Implementation Completion Summary | ||
|
|
||
| ## Status: COMPLETE | ||
|
|
||
| All required tasks for implementing the ProxySQL RAG (Retrieval-Augmented Generation) subsystem have been successfully completed according to the blueprint specifications. | ||
|
|
||
| ## Completed Deliverables | ||
|
|
||
| ### 1. Core Implementation | ||
| ✅ **RAG Tool Handler**: Fully implemented `RAG_Tool_Handler` class with all required MCP tools | ||
| ✅ **Database Integration**: Complete RAG schema with all 7 tables/views implemented | ||
| ✅ **MCP Integration**: RAG tools available via `/mcp/rag` endpoint | ||
| ✅ **Configuration**: All RAG configuration variables implemented and functional | ||
|
|
||
| ### 2. MCP Tools Implemented | ||
| ✅ **rag.search_fts** - Keyword search using FTS5 | ||
| ✅ **rag.search_vector** - Semantic search using vector embeddings | ||
| ✅ **rag.search_hybrid** - Hybrid search with two modes (fuse and fts_then_vec) | ||
| ✅ **rag.get_chunks** - Fetch chunk content | ||
| ✅ **rag.get_docs** - Fetch document content | ||
| ✅ **rag.fetch_from_source** - Refetch authoritative data | ||
| ✅ **rag.admin.stats** - Operational statistics | ||
|
|
||
| ### 3. Key Features | ||
| ✅ **Search Capabilities**: FTS, vector, and hybrid search with proper scoring | ||
| ✅ **Security Features**: Input validation, limits, timeouts, and column whitelisting | ||
| ✅ **Performance Features**: Prepared statements, connection management, proper indexing | ||
| ✅ **Filtering**: Complete filter support including source_ids, source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after, created_before, min_score | ||
| ✅ **Response Formatting**: Proper JSON response schemas matching blueprint specifications | ||
|
|
||
| ### 4. Testing and Documentation | ||
| ✅ **Test Scripts**: Comprehensive test suite including `test_rag.sh` | ||
| ✅ **Documentation**: Complete documentation in `doc/rag-documentation.md` and `doc/rag-examples.md` | ||
| ✅ **Examples**: Blueprint-compliant usage examples | ||
|
|
||
| ## Files Created/Modified | ||
|
|
||
| ### New Files (10) | ||
| 1. `include/RAG_Tool_Handler.h` - Header file | ||
| 2. `lib/RAG_Tool_Handler.cpp` - Implementation file | ||
| 3. `doc/rag-documentation.md` - Documentation | ||
| 4. `doc/rag-examples.md` - Usage examples | ||
| 5. `scripts/mcp/test_rag.sh` - Test script | ||
| 6. `test/test_rag_schema.cpp` - Schema test | ||
| 7. `test/build_rag_test.sh` - Build script | ||
| 8. `RAG_IMPLEMENTATION_SUMMARY.md` - Implementation summary | ||
| 9. `RAG_FILE_SUMMARY.md` - File summary | ||
| 10. Updated `test/Makefile` - Added RAG test target | ||
|
|
||
| ### Modified Files (7) | ||
| 1. `include/MCP_Thread.h` - Added RAG tool handler member | ||
| 2. `lib/MCP_Thread.cpp` - Added initialization/cleanup | ||
| 3. `lib/ProxySQL_MCP_Server.cpp` - Registered RAG endpoint | ||
| 4. `lib/AI_Features_Manager.cpp` - Added RAG schema | ||
| 5. `include/GenAI_Thread.h` - Added RAG config variables | ||
| 6. `lib/GenAI_Thread.cpp` - Added RAG config initialization | ||
| 7. `scripts/mcp/README.md` - Updated documentation | ||
|
|
||
| ## Blueprint Compliance Verification | ||
|
|
||
| ### Tool Schemas | ||
| ✅ All tool input schemas match blueprint specifications exactly | ||
| ✅ All tool response schemas match blueprint specifications exactly | ||
| ✅ Proper parameter validation and error handling implemented | ||
|
|
||
| ### Hybrid Search Modes | ||
| ✅ **Mode A (fuse)**: Parallel FTS + vector with Reciprocal Rank Fusion | ||
| ✅ **Mode B (fts_then_vec)**: Candidate generation + rerank | ||
| ✅ Both modes implement proper filtering and score normalization | ||
|
|
||
| ### Security and Performance | ||
| ✅ Input validation and sanitization | ||
| ✅ Query length limits (genai_rag_query_max_bytes) | ||
| ✅ Result size limits (genai_rag_k_max, genai_rag_candidates_max) | ||
| ✅ Timeouts for all operations (genai_rag_timeout_ms) | ||
| ✅ Column whitelisting for refetch operations | ||
| ✅ Row and byte limits for all operations | ||
| ✅ Proper use of prepared statements | ||
| ✅ Connection management | ||
| ✅ SQLite3-vec and FTS5 integration | ||
|
|
||
| ## Usage | ||
|
|
||
| The RAG subsystem is ready for production use. To enable: | ||
|
|
||
| ```sql | ||
| -- Enable GenAI module | ||
| SET genai.enabled = true; | ||
|
|
||
| -- Enable RAG features | ||
| SET genai.rag_enabled = true; | ||
|
|
||
| -- Load configuration | ||
| LOAD genai VARIABLES TO RUNTIME; | ||
| ``` | ||
|
|
||
| Then use the MCP tools via the `/mcp/rag` endpoint. | ||
|
|
||
| ## Testing | ||
|
|
||
| All functionality has been implemented according to v0 deliverables: | ||
| ✅ SQLite schema initializer | ||
| ✅ Source registry management | ||
| ✅ Ingestion pipeline framework | ||
| ✅ MCP server tools | ||
| ✅ Unit/integration tests | ||
| ✅ "Golden" examples | ||
|
|
||
| The implementation is complete and ready for integration testing. No newline at end of file |
There was a problem hiding this comment.
This pull request adds multiple RAG summary files (RAG_COMPLETION_SUMMARY.md, RAG_FILE_SUMMARY.md, RAG_IMPLEMENTATION_COMPLETE.md, RAG_IMPLEMENTATION_SUMMARY.md) that seem to contain very similar, overlapping content. This can create confusion and a maintenance burden. It would be better to consolidate these into a single, canonical summary document for the RAG feature.
lib/Anomaly_Detector.cpp.bak
Outdated
| /** | ||
| * @file Anomaly_Detector.cpp | ||
| * @brief Implementation of Real-time Anomaly Detection for ProxySQL | ||
| * | ||
| * Implements multi-stage anomaly detection pipeline: | ||
| * 1. SQL Injection Pattern Detection | ||
| * 2. Query Normalization and Pattern Matching | ||
| * 3. Rate Limiting per User/Host | ||
| * 4. Statistical Outlier Detection | ||
| * 5. Embedding-based Threat Similarity | ||
| * | ||
| * @see Anomaly_Detector.h | ||
| */ | ||
|
|
||
| #include "Anomaly_Detector.h" | ||
| #include "sqlite3db.h" | ||
| #include "proxysql_utils.h" | ||
| #include "GenAI_Thread.h" | ||
| #include "cpp.h" | ||
| #include <cstring> | ||
| #include <cstdlib> | ||
| #include <sstream> | ||
| #include <algorithm> | ||
| #include <regex> | ||
| #include <ctime> | ||
| #include <cmath> | ||
|
|
||
| // JSON library | ||
| #include "../deps/json/json.hpp" | ||
| using json = nlohmann::json; | ||
| #define PROXYJSON | ||
|
|
||
| // Global GenAI handler for embedding generation | ||
| extern GenAI_Threads_Handler *GloGATH; | ||
|
|
||
| // ============================================================================ | ||
| // Constants | ||
| // ============================================================================ | ||
|
|
||
| // SQL Injection Patterns (regex-based) | ||
| static const char* SQL_INJECTION_PATTERNS[] = { | ||
| "('|\").*?('|\")", // Quote sequences | ||
| "\\bor\\b.*=.*\\bor\\b", // OR 1=1 | ||
| "\\band\\b.*=.*\\band\\b", // AND 1=1 | ||
| "union.*select", // UNION SELECT | ||
| "drop.*table", // DROP TABLE | ||
| "exec.*xp_", // SQL Server exec | ||
| ";.*--", // Comment injection | ||
| "/\\*.*\\*/", // Block comments | ||
| "concat\\(", // CONCAT based attacks | ||
| "char\\(", // CHAR based attacks | ||
| "0x[0-9a-f]+", // Hex encoded | ||
| NULL | ||
| }; | ||
|
|
||
| // Suspicious Keywords | ||
| static const char* SUSPICIOUS_KEYWORDS[] = { | ||
| "sleep(", "waitfor delay", "benchmark(", "pg_sleep", | ||
| "load_file", "into outfile", "dumpfile", | ||
| "script>", "javascript:", "onerror=", "onload=", | ||
| NULL | ||
| }; | ||
|
|
||
| // Thresholds | ||
| #define DEFAULT_RATE_LIMIT 100 // queries per minute | ||
| #define DEFAULT_RISK_THRESHOLD 70 // 0-100 | ||
| #define DEFAULT_SIMILARITY_THRESHOLD 85 // 0-100 | ||
| #define USER_STATS_WINDOW 3600 // 1 hour in seconds | ||
| #define MAX_RECENT_QUERIES 100 | ||
|
|
||
| // ============================================================================ | ||
| // Constructor/Destructor | ||
| // ============================================================================ | ||
|
|
||
| Anomaly_Detector::Anomaly_Detector() : vector_db(NULL) { | ||
| config.enabled = true; | ||
| config.risk_threshold = DEFAULT_RISK_THRESHOLD; | ||
| config.similarity_threshold = DEFAULT_SIMILARITY_THRESHOLD; | ||
| config.rate_limit = DEFAULT_RATE_LIMIT; | ||
| config.auto_block = true; | ||
| config.log_only = false; | ||
| } | ||
|
|
||
| Anomaly_Detector::~Anomaly_Detector() { | ||
| close(); | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Initialization | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Initialize the anomaly detector | ||
| * | ||
| * Sets up the vector database connection and loads any | ||
| * pre-configured threat patterns from storage. | ||
| */ | ||
| int Anomaly_Detector::init() { | ||
| proxy_info("Anomaly: Initializing Anomaly Detector v%s\n", ANOMALY_DETECTOR_VERSION); | ||
|
|
||
| // Vector DB will be provided by AI_Features_Manager | ||
| // For now, we'll work without it for basic pattern detection | ||
|
|
||
| proxy_info("Anomaly: Anomaly Detector initialized with %zu injection patterns\n", | ||
| sizeof(SQL_INJECTION_PATTERNS) / sizeof(SQL_INJECTION_PATTERNS[0]) - 1); | ||
| return 0; | ||
| } | ||
|
|
||
| /** | ||
| * @brief Close and cleanup resources | ||
| */ | ||
| void Anomaly_Detector::close() { | ||
| // Clear user statistics | ||
| clear_user_statistics(); | ||
|
|
||
| proxy_info("Anomaly: Anomaly Detector closed\n"); | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Query Normalization | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Normalize SQL query for pattern matching | ||
| * | ||
| * Normalization steps: | ||
| * 1. Convert to lowercase | ||
| * 2. Remove extra whitespace | ||
| * 3. Replace string literals with placeholders | ||
| * 4. Replace numeric literals with placeholders | ||
| * 5. Remove comments | ||
| * | ||
| * @param query Original SQL query | ||
| * @return Normalized query pattern | ||
| */ | ||
| std::string Anomaly_Detector::normalize_query(const std::string& query) { | ||
| std::string normalized = query; | ||
|
|
||
| // Convert to lowercase | ||
| std::transform(normalized.begin(), normalized.end(), normalized.begin(), ::tolower); | ||
|
|
||
| // Remove SQL comments | ||
| std::regex comment_regex("--.*?$|/\\*.*?\\*/", std::regex::multiline); | ||
| normalized = std::regex_replace(normalized, comment_regex, ""); | ||
|
|
||
| // Replace string literals with placeholder | ||
| std::regex string_regex("'[^']*'|\"[^\"]*\""); | ||
| normalized = std::regex_replace(normalized, string_regex, "?"); | ||
|
|
||
| // Replace numeric literals with placeholder | ||
| std::regex numeric_regex("\\b\\d+\\b"); | ||
| normalized = std::regex_replace(normalized, numeric_regex, "N"); | ||
|
|
||
| // Normalize whitespace | ||
| std::regex whitespace_regex("\\s+"); | ||
| normalized = std::regex_replace(normalized, whitespace_regex, " "); | ||
|
|
||
| // Trim leading/trailing whitespace | ||
| normalized.erase(0, normalized.find_first_not_of(" \t\n\r")); | ||
| normalized.erase(normalized.find_last_not_of(" \t\n\r") + 1); | ||
|
|
||
| return normalized; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // SQL Injection Detection | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Check for SQL injection patterns | ||
| * | ||
| * Uses regex-based pattern matching to detect common SQL injection | ||
| * attack vectors including: | ||
| * - Tautologies (OR 1=1) | ||
| * - Union-based injection | ||
| * - Comment-based injection | ||
| * - Stacked queries | ||
| * - String/character encoding attacks | ||
| * | ||
| * @param query SQL query to check | ||
| * @return AnomalyResult with injection details | ||
| */ | ||
| AnomalyResult Anomaly_Detector::check_sql_injection(const std::string& query) { | ||
| AnomalyResult result; | ||
| result.is_anomaly = false; | ||
| result.risk_score = 0.0f; | ||
| result.anomaly_type = "sql_injection"; | ||
| result.should_block = false; | ||
|
|
||
| try { | ||
| std::string query_lower = query; | ||
| std::transform(query_lower.begin(), query_lower.end(), query_lower.begin(), ::tolower); | ||
|
|
||
| // Check each injection pattern | ||
| int pattern_matches = 0; | ||
| for (int i = 0; SQL_INJECTION_PATTERNS[i] != NULL; i++) { | ||
| std::regex pattern(SQL_INJECTION_PATTERNS[i], std::regex::icase); | ||
| if (std::regex_search(query, pattern)) { | ||
| pattern_matches++; | ||
| result.matched_rules.push_back(std::string("injection_pattern_") + std::to_string(i)); | ||
| } | ||
| } | ||
|
|
||
| // Check suspicious keywords | ||
| for (int i = 0; SUSPICIOUS_KEYWORDS[i] != NULL; i++) { | ||
| if (query_lower.find(SUSPICIOUS_KEYWORDS[i]) != std::string::npos) { | ||
| pattern_matches++; | ||
| result.matched_rules.push_back(std::string("suspicious_keyword_") + std::to_string(i)); | ||
| } | ||
| } | ||
|
|
||
| // Calculate risk score based on pattern matches | ||
| if (pattern_matches > 0) { | ||
| result.is_anomaly = true; | ||
| result.risk_score = std::min(1.0f, pattern_matches * 0.3f); | ||
|
|
||
| std::ostringstream explanation; | ||
| explanation << "SQL injection patterns detected: " << pattern_matches << " matches"; | ||
| result.explanation = explanation.str(); | ||
|
|
||
| // Auto-block if high risk and auto-block enabled | ||
| if (result.risk_score >= config.risk_threshold / 100.0f && config.auto_block) { | ||
| result.should_block = true; | ||
| } | ||
|
|
||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, | ||
| "Anomaly: SQL injection detected in query: %s (risk: %.2f)\n", | ||
| query.c_str(), result.risk_score); | ||
| } | ||
|
|
||
| } catch (const std::regex_error& e) { | ||
| proxy_error("Anomaly: Regex error in injection check: %s\n", e.what()); | ||
| } catch (const std::exception& e) { | ||
| proxy_error("Anomaly: Error in injection check: %s\n", e.what()); | ||
| } | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Rate Limiting | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Check rate limiting per user/host | ||
| * | ||
| * Tracks the number of queries per user/host within a time window | ||
| * to detect potential DoS attacks or brute force attempts. | ||
| * | ||
| * @param user Username | ||
| * @param client_host Client IP address | ||
| * @return AnomalyResult with rate limit details | ||
| */ | ||
| AnomalyResult Anomaly_Detector::check_rate_limiting(const std::string& user, | ||
| const std::string& client_host) { | ||
| AnomalyResult result; | ||
| result.is_anomaly = false; | ||
| result.risk_score = 0.0f; | ||
| result.anomaly_type = "rate_limit"; | ||
| result.should_block = false; | ||
|
|
||
| if (!config.enabled) { | ||
| return result; | ||
| } | ||
|
|
||
| // Get current time | ||
| uint64_t current_time = (uint64_t)time(NULL); | ||
| std::string key = user + "@" + client_host; | ||
|
|
||
| // Get or create user stats | ||
| UserStats& stats = user_statistics[key]; | ||
|
|
||
| // Check if we're within the time window | ||
| if (current_time - stats.last_query_time > USER_STATS_WINDOW) { | ||
| // Window expired, reset counter | ||
| stats.query_count = 0; | ||
| stats.recent_queries.clear(); | ||
| } | ||
|
|
||
| // Increment query count | ||
| stats.query_count++; | ||
| stats.last_query_time = current_time; | ||
|
|
||
| // Check if rate limit exceeded | ||
| if (stats.query_count > (uint64_t)config.rate_limit) { | ||
| result.is_anomaly = true; | ||
| // Risk score increases with excess queries | ||
| float excess_ratio = (float)(stats.query_count - config.rate_limit) / config.rate_limit; | ||
| result.risk_score = std::min(1.0f, 0.5f + excess_ratio); | ||
|
|
||
| std::ostringstream explanation; | ||
| explanation << "Rate limit exceeded: " << stats.query_count | ||
| << " queries per " << USER_STATS_WINDOW << " seconds (limit: " | ||
| << config.rate_limit << ")"; | ||
| result.explanation = explanation.str(); | ||
| result.matched_rules.push_back("rate_limit_exceeded"); | ||
|
|
||
| if (config.auto_block) { | ||
| result.should_block = true; | ||
| } | ||
|
|
||
| proxy_warning("Anomaly: Rate limit exceeded for %s: %lu queries\n", | ||
| key.c_str(), stats.query_count); | ||
| } | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Statistical Anomaly Detection | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Detect statistical anomalies in query behavior | ||
| * | ||
| * Analyzes query patterns to detect unusual behavior such as: | ||
| * - Abnormally large result sets | ||
| * - Unexpected execution times | ||
| * - Queries affecting many rows | ||
| * - Unusual query patterns for the user | ||
| * | ||
| * @param fp Query fingerprint | ||
| * @return AnomalyResult with statistical anomaly details | ||
| */ | ||
| AnomalyResult Anomaly_Detector::check_statistical_anomaly(const QueryFingerprint& fp) { | ||
| AnomalyResult result; | ||
| result.is_anomaly = false; | ||
| result.risk_score = 0.0f; | ||
| result.anomaly_type = "statistical"; | ||
| result.should_block = false; | ||
|
|
||
| if (!config.enabled) { | ||
| return result; | ||
| } | ||
|
|
||
| std::string key = fp.user + "@" + fp.client_host; | ||
| UserStats& stats = user_statistics[key]; | ||
|
|
||
| // Calculate some basic statistics | ||
| uint64_t avg_queries = 10; // Default baseline | ||
| float z_score = 0.0f; | ||
|
|
||
| if (stats.query_count > avg_queries * 3) { | ||
| // Query count is more than 3 standard deviations above mean | ||
| result.is_anomaly = true; | ||
| z_score = (float)(stats.query_count - avg_queries) / avg_queries; | ||
| result.risk_score = std::min(1.0f, z_score / 5.0f); // Normalize | ||
|
|
||
| std::ostringstream explanation; | ||
| explanation << "Unusually high query rate: " << stats.query_count | ||
| << " queries (baseline: " << avg_queries << ")"; | ||
| result.explanation = explanation.str(); | ||
| result.matched_rules.push_back("high_query_rate"); | ||
|
|
||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, | ||
| "Anomaly: Statistical anomaly for %s: z-score=%.2f\n", | ||
| key.c_str(), z_score); | ||
| } | ||
|
|
||
| // Check for abnormal execution time or rows affected | ||
| if (fp.execution_time_ms > 5000) { // 5 seconds | ||
| result.is_anomaly = true; | ||
| result.risk_score = std::max(result.risk_score, 0.3f); | ||
|
|
||
| if (!result.explanation.empty()) { | ||
| result.explanation += "; "; | ||
| } | ||
| result.explanation += "Long execution time detected"; | ||
| result.matched_rules.push_back("long_execution_time"); | ||
| } | ||
|
|
||
| if (fp.affected_rows > 10000) { | ||
| result.is_anomaly = true; | ||
| result.risk_score = std::max(result.risk_score, 0.2f); | ||
|
|
||
| if (!result.explanation.empty()) { | ||
| result.explanation += "; "; | ||
| } | ||
| result.explanation += "Large result set detected"; | ||
| result.matched_rules.push_back("large_result_set"); | ||
| } | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Embedding-based Similarity Detection | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Check embedding-based similarity to known threats | ||
| * | ||
| * Compares the query embedding to embeddings of known malicious queries | ||
| * stored in the vector database. This can detect novel attacks that | ||
| * don't match explicit patterns. | ||
| * | ||
| * @param query SQL query | ||
| * @param embedding Query vector embedding (if available) | ||
| * @return AnomalyResult with similarity details | ||
| */ | ||
| AnomalyResult Anomaly_Detector::check_embedding_similarity(const std::string& query, | ||
| const std::vector<float>& embedding) { | ||
| AnomalyResult result; | ||
| result.is_anomaly = false; | ||
| result.risk_score = 0.0f; | ||
| result.anomaly_type = "embedding_similarity"; | ||
| result.should_block = false; | ||
|
|
||
| if (!config.enabled || !vector_db) { | ||
| // Can't do embedding check without vector DB | ||
| return result; | ||
| } | ||
|
|
||
| // If embedding not provided, generate it | ||
| std::vector<float> query_embedding = embedding; | ||
| if (query_embedding.empty()) { | ||
| query_embedding = get_query_embedding(query); | ||
| } | ||
|
|
||
| if (query_embedding.empty()) { | ||
| return result; | ||
| } | ||
|
|
||
| // Convert embedding to JSON for sqlite-vec MATCH | ||
| std::string embedding_json = "["; | ||
| for (size_t i = 0; i < query_embedding.size(); i++) { | ||
| if (i > 0) embedding_json += ","; | ||
| embedding_json += std::to_string(query_embedding[i]); | ||
| } | ||
| embedding_json += "]"; | ||
|
|
||
| // Calculate distance threshold from similarity | ||
| // Similarity 0-100 -> Distance 0-2 (cosine distance: 0=similar, 2=dissimilar) | ||
| float distance_threshold = 2.0f - (config.similarity_threshold / 50.0f); | ||
|
|
||
| // Search for similar threat patterns | ||
| char search[1024]; | ||
| snprintf(search, sizeof(search), | ||
| "SELECT p.pattern_name, p.pattern_type, p.severity, " | ||
| " vec_distance_cosine(v.embedding, '%s') as distance " | ||
| "FROM anomaly_patterns p " | ||
| "JOIN anomaly_patterns_vec v ON p.id = v.rowid " | ||
| "WHERE v.embedding MATCH '%s' " | ||
| "AND distance < %f " | ||
| "ORDER BY distance " | ||
| "LIMIT 5", | ||
| embedding_json.c_str(), embedding_json.c_str(), distance_threshold); | ||
|
|
||
| // Execute search | ||
| sqlite3* db = vector_db->get_db(); | ||
| sqlite3_stmt* stmt = NULL; | ||
| int rc = (*proxy_sqlite3_prepare_v2)(db, search, -1, &stmt, NULL); | ||
|
|
||
| if (rc != SQLITE_OK) { | ||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, "Embedding search prepare failed: %s", (*proxy_sqlite3_errmsg)(db)); | ||
| return result; | ||
| } | ||
|
|
||
| // Check if any threat patterns matched | ||
| rc = (*proxy_sqlite3_step)(stmt); | ||
| if (rc == SQLITE_ROW) { | ||
| // Found similar threat pattern | ||
| result.is_anomaly = true; | ||
|
|
||
| // Extract pattern info | ||
| const char* pattern_name = reinterpret_cast<const char*>((*proxy_sqlite3_column_text)(stmt, 0)); | ||
| const char* pattern_type = reinterpret_cast<const char*>((*proxy_sqlite3_column_text)(stmt, 1)); | ||
| int severity = (*proxy_sqlite3_column_int)(stmt, 2); | ||
| double distance = (*proxy_sqlite3_column_double)(stmt, 3); | ||
|
|
||
| // Calculate risk score based on severity and similarity | ||
| // - Base score from severity (1-10) -> 0.1-1.0 | ||
| // - Boost by similarity (lower distance = higher risk) | ||
| result.risk_score = (severity / 10.0f) * (1.0f - (distance / 2.0f)); | ||
|
|
||
| // Set anomaly type | ||
| result.anomaly_type = "embedding_similarity"; | ||
|
|
||
| // Build explanation | ||
| char explanation[512]; | ||
| snprintf(explanation, sizeof(explanation), | ||
| "Query similar to known threat pattern '%s' (type: %s, severity: %d, distance: %.2f)", | ||
| pattern_name ? pattern_name : "unknown", | ||
| pattern_type ? pattern_type : "unknown", | ||
| severity, distance); | ||
| result.explanation = explanation; | ||
|
|
||
| // Add matched pattern to rules | ||
| if (pattern_name) { | ||
| result.matched_rules.push_back(std::string("pattern:") + pattern_name); | ||
| } | ||
|
|
||
| // Determine if should block | ||
| result.should_block = (result.risk_score > (config.risk_threshold / 100.0f)); | ||
|
|
||
| proxy_info("Anomaly: Embedding similarity detected (pattern: %s, score: %.2f)\n", | ||
| pattern_name ? pattern_name : "unknown", result.risk_score); | ||
| } | ||
|
|
||
| sqlite3_finalize(stmt); | ||
|
|
||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, | ||
| "Anomaly: Embedding similarity check performed\n"); | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| /** | ||
| * @brief Get vector embedding for a query | ||
| * | ||
| * Generates a vector representation of the query using a sentence | ||
| * transformer or similar embedding model. | ||
| * | ||
| * Uses the GenAI module (GloGATH) for embedding generation via llama-server. | ||
| * | ||
| * @param query SQL query | ||
| * @return Vector embedding (empty if not available) | ||
| */ | ||
| std::vector<float> Anomaly_Detector::get_query_embedding(const std::string& query) { | ||
| if (!GloGATH) { | ||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, "GenAI handler not available for embedding"); | ||
| return {}; | ||
| } | ||
|
|
||
| // Normalize query first for better embedding quality | ||
| std::string normalized = normalize_query(query); | ||
|
|
||
| // Generate embedding using GenAI | ||
| GenAI_EmbeddingResult result = GloGATH->embed_documents({normalized}); | ||
|
|
||
| if (!result.data || result.count == 0) { | ||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, "Failed to generate embedding"); | ||
| return {}; | ||
| } | ||
|
|
||
| // Convert to std::vector<float> | ||
| std::vector<float> embedding(result.data, result.data + result.embedding_size); | ||
|
|
||
| // Free the result data (GenAI allocates with malloc) | ||
| if (result.data) { | ||
| free(result.data); | ||
| } | ||
|
|
||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, "Generated embedding with %zu dimensions", embedding.size()); | ||
| return embedding; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // User Statistics Management | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Update user statistics with query fingerprint | ||
| * | ||
| * Tracks user behavior for statistical anomaly detection. | ||
| * | ||
| * @param fp Query fingerprint | ||
| */ | ||
| void Anomaly_Detector::update_user_statistics(const QueryFingerprint& fp) { | ||
| if (!config.enabled) { | ||
| return; | ||
| } | ||
|
|
||
| std::string key = fp.user + "@" + fp.client_host; | ||
| UserStats& stats = user_statistics[key]; | ||
|
|
||
| // Add to recent queries | ||
| stats.recent_queries.push_back(fp.query_pattern); | ||
|
|
||
| // Keep only recent queries | ||
| if (stats.recent_queries.size() > MAX_RECENT_QUERIES) { | ||
| stats.recent_queries.erase(stats.recent_queries.begin()); | ||
| } | ||
|
|
||
| stats.last_query_time = fp.timestamp; | ||
| stats.query_count++; | ||
|
|
||
| // Cleanup old entries periodically | ||
| static int cleanup_counter = 0; | ||
| if (++cleanup_counter % 1000 == 0) { | ||
| uint64_t current_time = (uint64_t)time(NULL); | ||
| auto it = user_statistics.begin(); | ||
| while (it != user_statistics.end()) { | ||
| if (current_time - it->second.last_query_time > USER_STATS_WINDOW * 2) { | ||
| it = user_statistics.erase(it); | ||
| } else { | ||
| ++it; | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Main Analysis Method | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Main entry point for anomaly detection | ||
| * | ||
| * Runs the multi-stage detection pipeline: | ||
| * 1. SQL Injection Pattern Detection | ||
| * 2. Rate Limiting Check | ||
| * 3. Statistical Anomaly Detection | ||
| * 4. Embedding Similarity Check (if vector DB available) | ||
| * | ||
| * @param query SQL query to analyze | ||
| * @param user Username | ||
| * @param client_host Client IP address | ||
| * @param schema Database schema name | ||
| * @return AnomalyResult with combined analysis | ||
| */ | ||
| AnomalyResult Anomaly_Detector::analyze(const std::string& query, const std::string& user, | ||
| const std::string& client_host, const std::string& schema) { | ||
| AnomalyResult combined_result; | ||
| combined_result.is_anomaly = false; | ||
| combined_result.risk_score = 0.0f; | ||
| combined_result.should_block = false; | ||
|
|
||
| if (!config.enabled) { | ||
| return combined_result; | ||
| } | ||
|
|
||
| proxy_debug(PROXY_DEBUG_ANOMALY, 3, | ||
| "Anomaly: Analyzing query from %s@%s\n", | ||
| user.c_str(), client_host.c_str()); | ||
|
|
||
| // Run all detection stages | ||
| AnomalyResult injection_result = check_sql_injection(query); | ||
| AnomalyResult rate_result = check_rate_limiting(user, client_host); | ||
|
|
||
| // Build fingerprint for statistical analysis | ||
| QueryFingerprint fp; | ||
| fp.query_pattern = normalize_query(query); | ||
| fp.user = user; | ||
| fp.client_host = client_host; | ||
| fp.schema = schema; | ||
| fp.timestamp = (uint64_t)time(NULL); | ||
|
|
||
| AnomalyResult stat_result = check_statistical_anomaly(fp); | ||
|
|
||
| // Embedding similarity (optional) | ||
| std::vector<float> embedding; | ||
| AnomalyResult embed_result = check_embedding_similarity(query, embedding); | ||
|
|
||
| // Combine results | ||
| combined_result.is_anomaly = injection_result.is_anomaly || | ||
| rate_result.is_anomaly || | ||
| stat_result.is_anomaly || | ||
| embed_result.is_anomaly; | ||
|
|
||
| // Take maximum risk score | ||
| combined_result.risk_score = std::max({injection_result.risk_score, | ||
| rate_result.risk_score, | ||
| stat_result.risk_score, | ||
| embed_result.risk_score}); | ||
|
|
||
| // Combine explanations | ||
| std::vector<std::string> explanations; | ||
| if (!injection_result.explanation.empty()) { | ||
| explanations.push_back(injection_result.explanation); | ||
| } | ||
| if (!rate_result.explanation.empty()) { | ||
| explanations.push_back(rate_result.explanation); | ||
| } | ||
| if (!stat_result.explanation.empty()) { | ||
| explanations.push_back(stat_result.explanation); | ||
| } | ||
| if (!embed_result.explanation.empty()) { | ||
| explanations.push_back(embed_result.explanation); | ||
| } | ||
|
|
||
| if (!explanations.empty()) { | ||
| combined_result.explanation = explanations[0]; | ||
| for (size_t i = 1; i < explanations.size(); i++) { | ||
| combined_result.explanation += "; " + explanations[i]; | ||
| } | ||
| } | ||
|
|
||
| // Combine matched rules | ||
| combined_result.matched_rules = injection_result.matched_rules; | ||
| combined_result.matched_rules.insert(combined_result.matched_rules.end(), | ||
| rate_result.matched_rules.begin(), | ||
| rate_result.matched_rules.end()); | ||
| combined_result.matched_rules.insert(combined_result.matched_rules.end(), | ||
| stat_result.matched_rules.begin(), | ||
| stat_result.matched_rules.end()); | ||
| combined_result.matched_rules.insert(combined_result.matched_rules.end(), | ||
| embed_result.matched_rules.begin(), | ||
| embed_result.matched_rules.end()); | ||
|
|
||
| // Determine if should block | ||
| combined_result.should_block = injection_result.should_block || | ||
| rate_result.should_block || | ||
| (combined_result.risk_score >= config.risk_threshold / 100.0f && config.auto_block); | ||
|
|
||
| // Update user statistics | ||
| update_user_statistics(fp); | ||
|
|
||
| // Log anomaly if detected | ||
| if (combined_result.is_anomaly) { | ||
| if (config.log_only) { | ||
| proxy_warning("Anomaly: Detected (log-only mode): %s (risk: %.2f)\n", | ||
| combined_result.explanation.c_str(), combined_result.risk_score); | ||
| } else if (combined_result.should_block) { | ||
| proxy_error("Anomaly: BLOCKED: %s (risk: %.2f)\n", | ||
| combined_result.explanation.c_str(), combined_result.risk_score); | ||
| } else { | ||
| proxy_warning("Anomaly: Detected: %s (risk: %.2f)\n", | ||
| combined_result.explanation.c_str(), combined_result.risk_score); | ||
| } | ||
| } | ||
|
|
||
| return combined_result; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Threat Pattern Management | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Add a threat pattern to the database | ||
| * | ||
| * @param pattern_name Human-readable name | ||
| * @param query_example Example query | ||
| * @param pattern_type Type of threat (injection, flooding, etc.) | ||
| * @param severity Severity level (0-100) | ||
| * @return Pattern ID or -1 on error | ||
| */ | ||
| int Anomaly_Detector::add_threat_pattern(const std::string& pattern_name, | ||
| const std::string& query_example, | ||
| const std::string& pattern_type, | ||
| int severity) { | ||
| proxy_info("Anomaly: Adding threat pattern: %s (type: %s, severity: %d)\n", | ||
| pattern_name.c_str(), pattern_type.c_str(), severity); | ||
|
|
||
| if (!vector_db) { | ||
| proxy_error("Anomaly: Cannot add pattern - no vector DB\n"); | ||
| return -1; | ||
| } | ||
|
|
||
| // Generate embedding for the query example | ||
| std::vector<float> embedding = get_query_embedding(query_example); | ||
| if (embedding.empty()) { | ||
| proxy_error("Anomaly: Failed to generate embedding for threat pattern\n"); | ||
| return -1; | ||
| } | ||
|
|
||
| // Insert into main table with embedding BLOB | ||
| sqlite3* db = vector_db->get_db(); | ||
| sqlite3_stmt* stmt = NULL; | ||
| const char* insert = "INSERT INTO anomaly_patterns " | ||
| "(pattern_name, pattern_type, query_example, embedding, severity) " | ||
| "VALUES (?, ?, ?, ?, ?)"; | ||
|
|
||
| int rc = (*proxy_sqlite3_prepare_v2)(db, insert, -1, &stmt, NULL); | ||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Anomaly: Failed to prepare pattern insert: %s\n", (*proxy_sqlite3_errmsg)(db)); | ||
| return -1; | ||
| } | ||
|
|
||
| // Bind values | ||
| (*proxy_sqlite3_bind_text)(stmt, 1, pattern_name.c_str(), -1, SQLITE_TRANSIENT); | ||
| (*proxy_sqlite3_bind_text)(stmt, 2, pattern_type.c_str(), -1, SQLITE_TRANSIENT); | ||
| (*proxy_sqlite3_bind_text)(stmt, 3, query_example.c_str(), -1, SQLITE_TRANSIENT); | ||
| (*proxy_sqlite3_bind_blob)(stmt, 4, embedding.data(), embedding.size() * sizeof(float), SQLITE_TRANSIENT); | ||
| (*proxy_sqlite3_bind_int)(stmt, 5, severity); | ||
|
|
||
| // Execute insert | ||
| rc = (*proxy_sqlite3_step)(stmt); | ||
| if (rc != SQLITE_DONE) { | ||
| proxy_error("Anomaly: Failed to insert pattern: %s\n", sqlite3_errmsg(db)); | ||
| sqlite3_finalize(stmt); | ||
| return -1; | ||
| } | ||
|
|
||
| sqlite3_finalize(stmt); | ||
|
|
||
| // Get the inserted rowid | ||
| sqlite3_int64 rowid = (*proxy_sqlite3_last_insert_rowid)(db); | ||
|
|
||
| // Update virtual table (sqlite-vec needs explicit rowid insertion) | ||
| char update_vec[256]; | ||
| snprintf(update_vec, sizeof(update_vec), | ||
| "INSERT INTO anomaly_patterns_vec(rowid) VALUES (%lld)", rowid); | ||
|
|
||
| char* err = NULL; | ||
| rc = sqlite3_exec(db, update_vec, NULL, NULL, &err); | ||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Anomaly: Failed to update vec table: %s\n", err ? err : "unknown"); | ||
| if (err) sqlite3_free(err); | ||
| return -1; | ||
| } | ||
|
|
||
| proxy_info("Anomaly: Added threat pattern '%s' (id: %lld)\n", pattern_name.c_str(), rowid); | ||
| return (int)rowid; | ||
| } | ||
|
|
||
| /** | ||
| * @brief List all threat patterns | ||
| * | ||
| * @return JSON array of threat patterns | ||
| */ | ||
| std::string Anomaly_Detector::list_threat_patterns() { | ||
| if (!vector_db) { | ||
| return "[]"; | ||
| } | ||
|
|
||
| json patterns = json::array(); | ||
|
|
||
| sqlite3* db = vector_db->get_db(); | ||
| const char* query = "SELECT id, pattern_name, pattern_type, query_example, severity, created_at " | ||
| "FROM anomaly_patterns ORDER BY severity DESC"; | ||
|
|
||
| sqlite3_stmt* stmt = NULL; | ||
| int rc = sqlite3_prepare_v2(db, query, -1, &stmt, NULL); | ||
|
|
||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Anomaly: Failed to query threat patterns: %s\n", sqlite3_errmsg(db)); | ||
| return "[]"; | ||
| } | ||
|
|
||
| while ((*proxy_sqlite3_step)(stmt) == SQLITE_ROW) { | ||
| json pattern; | ||
| pattern["id"] = (*proxy_sqlite3_column_int64)(stmt, 0); | ||
| const char* name = reinterpret_cast<const char*>((*proxy_sqlite3_column_text)(stmt, 1)); | ||
| const char* type = reinterpret_cast<const char*>((*proxy_sqlite3_column_text)(stmt, 2)); | ||
| const char* example = reinterpret_cast<const char*>((*proxy_sqlite3_column_text)(stmt, 3)); | ||
| pattern["pattern_name"] = name ? name : ""; | ||
| pattern["pattern_type"] = type ? type : ""; | ||
| pattern["query_example"] = example ? example : ""; | ||
| pattern["severity"] = (*proxy_sqlite3_column_int)(stmt, 4); | ||
| pattern["created_at"] = (*proxy_sqlite3_column_int64)(stmt, 5); | ||
| patterns.push_back(pattern); | ||
| } | ||
|
|
||
| sqlite3_finalize(stmt); | ||
|
|
||
| return patterns.dump(); | ||
| } | ||
|
|
||
| /** | ||
| * @brief Remove a threat pattern | ||
| * | ||
| * @param pattern_id Pattern ID to remove | ||
| * @return true if removed, false otherwise | ||
| */ | ||
| bool Anomaly_Detector::remove_threat_pattern(int pattern_id) { | ||
| proxy_info("Anomaly: Removing threat pattern: %d\n", pattern_id); | ||
|
|
||
| if (!vector_db) { | ||
| proxy_error("Anomaly: Cannot remove pattern - no vector DB\n"); | ||
| return false; | ||
| } | ||
|
|
||
| sqlite3* db = vector_db->get_db(); | ||
|
|
||
| // First, remove from virtual table | ||
| char del_vec[256]; | ||
| snprintf(del_vec, sizeof(del_vec), "DELETE FROM anomaly_patterns_vec WHERE rowid = %d", pattern_id); | ||
| char* err = NULL; | ||
| int rc = sqlite3_exec(db, del_vec, NULL, NULL, &err); | ||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Anomaly: Failed to delete from vec table: %s\n", err ? err : "unknown"); | ||
| if (err) sqlite3_free(err); | ||
| return false; | ||
| } | ||
|
|
||
| // Then, remove from main table | ||
| snprintf(del_vec, sizeof(del_vec), "DELETE FROM anomaly_patterns WHERE id = %d", pattern_id); | ||
| rc = sqlite3_exec(db, del_vec, NULL, NULL, &err); | ||
| if (rc != SQLITE_OK) { | ||
| proxy_error("Anomaly: Failed to delete pattern: %s\n", err ? err : "unknown"); | ||
| if (err) sqlite3_free(err); | ||
| return false; | ||
| } | ||
|
|
||
| proxy_info("Anomaly: Removed threat pattern %d\n", pattern_id); | ||
| return true; | ||
| } | ||
|
|
||
| // ============================================================================ | ||
| // Statistics and Monitoring | ||
| // ============================================================================ | ||
|
|
||
| /** | ||
| * @brief Get anomaly detection statistics | ||
| * | ||
| * @return JSON string with statistics | ||
| */ | ||
| std::string Anomaly_Detector::get_statistics() { | ||
| json stats; | ||
|
|
||
| stats["users_tracked"] = user_statistics.size(); | ||
| stats["config"] = { | ||
| {"enabled", config.enabled}, | ||
| {"risk_threshold", config.risk_threshold}, | ||
| {"similarity_threshold", config.similarity_threshold}, | ||
| {"rate_limit", config.rate_limit}, | ||
| {"auto_block", config.auto_block}, | ||
| {"log_only", config.log_only} | ||
| }; | ||
|
|
||
| // Count total queries | ||
| uint64_t total_queries = 0; | ||
| for (const auto& entry : user_statistics) { | ||
| total_queries += entry.second.query_count; | ||
| } | ||
| stats["total_queries_tracked"] = total_queries; | ||
|
|
||
| // Count threat patterns | ||
| if (vector_db) { | ||
| sqlite3* db = vector_db->get_db(); | ||
| const char* count_query = "SELECT COUNT(*) FROM anomaly_patterns"; | ||
| sqlite3_stmt* stmt = NULL; | ||
| int rc = sqlite3_prepare_v2(db, count_query, -1, &stmt, NULL); | ||
|
|
||
| if (rc == SQLITE_OK) { | ||
| rc = (*proxy_sqlite3_step)(stmt); | ||
| if (rc == SQLITE_ROW) { | ||
| stats["threat_patterns_count"] = sqlite3_column_int(stmt, 0); | ||
| } | ||
| sqlite3_finalize(stmt); | ||
| } | ||
|
|
||
| // Count by pattern type | ||
| const char* type_query = "SELECT pattern_type, COUNT(*) FROM anomaly_patterns GROUP BY pattern_type"; | ||
| rc = sqlite3_prepare_v2(db, type_query, -1, &stmt, NULL); | ||
|
|
||
| if (rc == SQLITE_OK) { | ||
| json by_type = json::object(); | ||
| while ((*proxy_sqlite3_step)(stmt) == SQLITE_ROW) { | ||
| const char* type = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)); | ||
| int count = sqlite3_column_int(stmt, 1); | ||
| if (type) { | ||
| by_type[type] = count; | ||
| } | ||
| } | ||
| sqlite3_finalize(stmt); | ||
| stats["threat_patterns_by_type"] = by_type; | ||
| } | ||
| } | ||
|
|
||
| return stats.dump(); | ||
| } | ||
|
|
||
| /** | ||
| * @brief Clear all user statistics | ||
| */ | ||
| void Anomaly_Detector::clear_user_statistics() { | ||
| size_t count = user_statistics.size(); | ||
| user_statistics.clear(); | ||
| proxy_info("Anomaly: Cleared statistics for %zu users\n", count); | ||
| } |
There was a problem hiding this comment.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
include/sqlite3db.h (1)
91-147: Add missing proxy declarations to MAIN_PROXY_SQLITE3 branch.The
MAIN_PROXY_SQLITE3branch (lines 91–147) is missingproxy_sqlite3_enable_load_extensionandproxy_sqlite3_global_stats_row_stepdeclarations that exist in the#ifndef MAIN_PROXY_SQLITE3branch (lines 49 and 52). Any translation unit definingMAIN_PROXY_SQLITE3will lack storage for these function pointers, causing linker errors or ODR violations.Add the following to maintain symmetry:
int (*proxy_sqlite3_enable_load_extension)(sqlite3*, int);after line 108 (proxy_sqlite3_auto_extension)void (*proxy_sqlite3_global_stats_row_step)(SQLite3DB*, sqlite3_stmt*, const char*, ...);after line 146 (before#endif)lib/sqlite3db.cpp (1)
1075-1221: Addassert(proxy_sqlite3_bind_blob);to the assertion block.The
proxy_sqlite3_bind_blobfunction is actively used inlib/Anomaly_Detector.cpp:765to bind embeddings. Without this assertion, if a plugin fails to provide the symbol, the code will crash at runtime with no safety check. Add it to the assert block immediately afterassert(proxy_sqlite3_bind_text);.Assertion addition
assert(proxy_sqlite3_bind_null); assert(proxy_sqlite3_bind_text); + assert(proxy_sqlite3_bind_blob); assert(proxy_sqlite3_column_name);lib/AI_Features_Manager.cpp (1)
365-402: Tidy log message spacing.The new strings include extra spaces around commas (e.g., "disabled , skipping"). Minor UX polish.
Proposed fix
- proxy_info("AI: LLM bridge disabled , skipping initialization\n"); + proxy_info("AI: LLM bridge disabled, skipping initialization\n"); - proxy_info("AI: Anomaly detection disabled , skipping initialization\n"); + proxy_info("AI: Anomaly detection disabled, skipping initialization\n");lib/MySQL_Tool_Handler.cpp (1)
893-1004: Critical inconsistency:catalog_merge()lacks schema parameter while all other catalog operations require it.The schema parameter is mandatory in
catalog_upsert,catalog_get,catalog_search,catalog_list, andcatalog_delete, butcatalog_mergeomits it entirely. This allows merge operations to cross schema boundaries without restriction, breaking the schema isolation enforced by other methods. Either add schema as a required parameter tocatalog_mergefor consistency, or document if cross-schema merges are intentional. The JSON-parse fallback incatalog_getis sound.include/MySQL_Tool_Handler.h (1)
329-355: Clarify schema parameter docs and ordering.The Doxygen blocks for
catalog_upsert,catalog_get, andcatalog_deletestill describe schema as “empty for all schemas” and list it after other params, but the signatures put schema first and the implementation treats""as a global entry only (not “all”). This can mislead callers.🔧 Suggested doc fix
- * `@param` kind Entry kind - * `@param` key Unique key - * `@param` document JSON document - * `@param` schema Schema name (empty for all schemas) + * `@param` schema Schema name ("" for global/shared entries) + * `@param` kind Entry kind + * `@param` key Unique key + * `@param` document JSON document- * `@param` schema Schema name (empty for all schemas) + * `@param` schema Schema name ("" for global/shared entries)Also applies to: 406-413
doc/MCP/Vector_Embeddings_Implementation_Plan.md (1)
136-151: Close the SQL code block before the next heading.The ```sql block isn’t terminated before “Implementation Status”, which breaks markdown rendering for the rest of the document.
🧱 Fix
- COALESCE(notes, '')) as vector, + COALESCE(notes, '')) as vector, +``` ## Implementation Status
🤖 Fix all issues with AI agents
In `@lib/Admin_Handler.cpp`:
- Around line 2348-2431: The current handlers build a single string
"*q=l_strdup(\"DELETE ...; INSERT ...\")" (for main.mcp_query_rules and
disk.mcp_query_rules) which is passed to SQLite but only the first statement
runs; split these into separate DB calls inside an explicit transaction: call
SQLite3DB::execute_statement (or the existing DB execute helper) to run "BEGIN",
then execute the "DELETE ..." statement, then execute the "INSERT ... SELECT
..." statement, and finally "COMMIT"; on any error run "ROLLBACK" and propagate
the error to the caller; change the two locations that set *q with the combined
"DELETE ...; INSERT ..." to instead perform the transactional sequence and do
not return a combined multi-statement string.
In `@lib/AI_Features_Manager.cpp`:
- Around line 163-178: The schema currently defines rag_sources.backend_pass as
plain TEXT in the create_rag_sources SQL constant; change the schema to avoid
storing plaintext secrets by replacing the backend_pass column with a secure
alternative (e.g., backend_pass_encrypted BLOB or backend_secret_ref TEXT) in
the create_rag_sources string, and update all code paths that read/write
rag_sources (any functions that INSERT/SELECT/update rag_sources) to use the
chosen approach: either encrypt/decrypt using the existing secret management
APIs or store only a reference/token and integrate with the secret store
retrieval logic. Ensure migrations handle existing rows (migrate/rotate
plaintext into the secure field) and update any code that expects backend_pass
to use the new column and decryption/secret resolution functions.
In `@lib/Anomaly_Detector.cpp.bak`:
- Around line 559-591: The code accesses user_statistics and the static
cleanup_counter without synchronization; add a pthread_mutex_t (e.g.,
mutex_user_stats) as a member of Anomaly_Detector, initialize it in the
constructor and destroy it in the destructor, then surround all accesses to
user_statistics and cleanup_counter in Anomaly_Detector::update_user_statistics
and in check_statistical_anomaly with pthread_mutex_lock/pthread_mutex_unlock to
prevent concurrent map mutations and races (or convert cleanup_counter to a
member and protect it the same way); ensure the same lock is used for both
methods so erases/iterations and pushes to recent_queries are atomic with
respect to each other.
In `@lib/MCP_Endpoint.cpp`:
- Around line 221-230: The error logging currently emits the full request
payload (req_body) when JSON parsing fails; update the handlers in
MCP_Endpoint.cpp (around the json::parse try/catch that references req_body and
req_path, and the similar blocks noted at the other locations) to avoid emitting
sensitive content: replace proxy_error(... req_body ...) with a safe message
that logs either the payload size and a short truncated/redacted preview (e.g.,
first N chars + "...") or moves the full payload to proxy_debug at most; ensure
req_path and e.what() remain logged but any tool arguments or full request
bodies are never printed at error level and apply the same change to the other
similar blocks (the blocks around lines referenced 352-366 and 477-482).
In `@lib/MySQL_Catalog.cpp`:
- Around line 118-135: The FTS index stays stale because upserts use ON CONFLICT
... DO UPDATE in upsert() and you only have INSERT/DELETE triggers (catalog_ai,
catalog_ad) for catalog_fts; add an AFTER UPDATE trigger (e.g., catalog_au) on
the catalog table that first emits a 'delete' row for old.* into catalog_fts and
then an insert row for new.* (mirroring the delete+insert pattern used in
catalog_ad/catalog_ai) so updates (including upserts) keep the FTS index in
sync; implement this trigger SQL alongside the existing CREATE TRIGGER
statements in MySQL_Catalog.cpp.
- Around line 270-304: The search() method currently builds SQL via string
concatenation causing SQL injection; convert it to use prepared statements by
calling prepare_v2() with placeholders (?) for schema, kind, tags (use ? for
"tags LIKE ?"), and query (use ? for each LIKE '%query%' occurrence), bind
values using proxy_sqlite3_bind_text() (providing the pattern strings like "%" +
query + "%"), then execute via execute_prepared(); apply the same change pattern
to list() and remove() to replace direct concatenation of schema/kind/tags/query
with parameterized ? placeholders and corresponding proxy_sqlite3_bind_text()
bindings so all user input is safely bound before execute_prepared().
In `@lib/ProxySQL_Admin_Stats.cpp`:
- Around line 2582-2626: The code deletes from stats_mcp_query_digest_reset when
reset==true but still inserts into stats_mcp_query_digest and uses unsafe
sprintf interpolation for fields like digest_text; change to select the target
table into a variable (e.g. const char* target_table = reset ?
"stats_mcp_query_digest_reset" : "stats_mcp_query_digest") and use it for both
the DELETE and the INSERT, and replace the sprintf-based INSERT logic
(variables: a, query, statsdb->execute(query), and the loop over
resultset->rows) with a prepared statement on statsdb that binds each field
(including digest_text and other fields from r->fields[0..9]) as parameters to
avoid SQL injection and quoting issues; ensure you allocate/prepare once, bind
per-row, execute, and finalize the statement.
- Around line 1587-1612: In stats___mcp_query_tools_counters, the INSERT builds
the wrong table name when reset and injects raw r->fields into SQL via snprintf;
change the table name logic to explicitly use either
"stats_mcp_query_tools_counters_reset" or "stats_mcp_query_tools_counters" (do
not concatenate fragments), and replace the snprintf/execute path with a
prepared statement: prepare an INSERT INTO <chosen_table> (...) VALUES (?, ?, ?,
..., ?) and bind r->fields[0] and r->fields[1] (and any other string fields) as
parameters before executing for each SQLite3_row* r to avoid quoting/injection
issues and ensure correct table usage; update the code around resultset->rows
iteration (function stats___mcp_query_tools_counters, variables resultset, r,
statsdb->execute) to use prepare/bind/step/finalize instead of snprintf.
In `@lib/ProxySQL_Admin.cpp`:
- Around line 7764-7794: The function load_mcp_query_rules_to_runtime leaks and
incorrectly frees resultset: ensure resultset (returned from execute_statement
and typed SQLite3_result*) is always deleted with delete, not free, after it is
consumed by catalog->load_mcp_query_rules, and remove/replace the success==false
conditional so deletion happens in a single cleanup path (e.g., after the
if/else that calls load_mcp_query_rules or logs the error) and set resultset to
nullptr after delete; reference symbols: load_mcp_query_rules_to_runtime,
execute_statement, SQLite3_result, resultset, catalog->load_mcp_query_rules.
In `@lib/Query_Tool_Handler.cpp`:
- Around line 827-834: The SQLite3_result returned by
catalog->get_db()->execute_statement is deallocated with free() in this block
but elsewhere is deleted, causing mixed allocator UB; change both
free(resultset) calls here to the consistent deallocation used in the project
(use delete resultset) and ensure you delete before returning in the success
branch (replace the free(resultset) before return with delete resultset and
likewise replace the later free(resultset) with delete resultset) so resultset
is always deallocated with the same operator.
- Around line 368-373: In Query_Tool_Handler the mysql_query() failure path logs
and calls return_connection(mysql) but does not exit, so mysql_store_result()
and later logic run on a bad state; after calling return_connection(mysql) when
mysql_query(...) returns nonzero, immediately return the error JSON stored in
result (e.g., return result) to stop further processing; update the branch that
checks mysql_query(...) to perform the early return after
return_connection(mysql) so the function returns the error payload instead of
continuing.
- Around line 375-383: The code calls mysql_affected_rows() after
return_connection(), causing a race; fix both execute_query() and
execute_query_with_schema() by capturing affected_rows immediately after
mysql_store_result() (or when res is null) and before calling
return_connection(), e.g., call mysql_affected_rows(static_cast<MYSQL*>(mysql))
into a local variable right after mysql_store_result()/when res==nullptr, then
use that stored value in the JSON response instead of calling
mysql_affected_rows() after return_connection().
In `@lib/RAG_Tool_Handler.cpp`:
- Around line 1490-1504: The code parses fusion weights w_fts and w_vec with
get_json_int (in both the fuse_params branch and the fallback top-level
arguments branch), which truncates fractional values; change those calls to the
floating-point variant (e.g., get_json_double or equivalent) and ensure w_fts
and w_vec are stored as double (or a floating type) and use 1.0 as the default,
updating both the fuse_params block and the fallback block so fractional weights
are preserved for hybrid scoring; leave integer parsing
(validate_k/get_json_int) for fts_k/vec_k/rrf_k0 unchanged.
- Around line 401-447: The prepared sqlite3_stmt created by
vector_db->prepare_v2 and bound with
proxy_sqlite3_bind_text/proxy_sqlite3_bind_int is never executed because the
code calls vector_db->execute_statement(query, ...) with the raw query; change
the implementation of execute_parameterized_query to execute the prepared
statement (stmt) instead of the raw query—either call a provided helper like
vector_db->execute_prepared_statement(stmt, &error, &cols, &affected_rows) if
available, or perform the sqlite3_step loop on stmt, collect columns/rows via
sqlite3_column_*, finalize the statement with sqlite3_finalize (or the smart
pointer cleanup already used), and return the populated SQLite3_result so bound
parameters are actually applied.
- Around line 2316-2330: The doc_id list is concatenated directly into the SQL
string creating SQL injection risk; update the code that constructs doc_list
(the loop iterating over doc_ids in RAG_Tool_Handler.cpp used by
rag.fetch_from_source) to first validate each doc_id for forbidden characters
(', backslash, semicolon) and reject or skip invalid entries, and then escape
single quotes by doubling them before appending (same approach used for
chunk_ids and tag-escaping); apply the same validation+escaping to both
occurrences where doc_list is built so the final "WHERE doc_id IN (...)" uses
only validated, single-quote-escaped values.
In `@lib/Static_Harvester.cpp`:
- Around line 1129-1138: The profile fields are being read from the wrong column
indexes: update the row->fields index usage in the block that builds json
profile so it matches the SELECT clause used to produce row (fix
profile["rows_est"], profile["size_bytes"], profile["engine"], and the flag
fields profile["has_primary_key"], profile["has_foreign_keys"],
profile["has_time_column"]). Locate the SQL that produces row (the quick
profiles query) and remap the reads from row->fields[...] to the correct
positions (replace the current fields[3..9] usages with the indices that match
the SELECT order) so each profile property reads from the corresponding selected
column.
- Around line 1041-1049: The SQL update concatenates unescaped strings
(view_def, schema_name, view_name) into update_sql; to fix, escape single quotes
in each of those variables (e.g., replace ' with '' or use your DB API's
parameterized/prepared statement) before constructing update_sql. Implement or
call a helper like escape_sql_string(...) on view_def, schema_name, and
view_name immediately prior to the update SQL assembly (the block that builds
update_sql and assigns to definition_sql in the objects table) or switch to a
prepared statement API to bind them safely.
In `@RAG_POC/schema.sql`:
- Around line 19-25: The schema stores plaintext credentials in the backend_pass
column; change the schema and access patterns to avoid plaintext at rest by
replacing or augmenting backend_pass with a secret reference (e.g.,
backend_secret_ref) or an encrypted column (e.g., backend_pass_encrypted) and
update any code that reads/writes backend records to (a) write secret references
instead of raw passwords to the catalog, or (b) encrypt/decrypt
backend_pass_encrypted using your KMS/secret-manager client before
persisting/after retrieval (implement decrypt in the code path that uses
backend_host/backend_user/backend_db). Also ensure the catalog file is created
with restrictive filesystem permissions (owner-only read/write) when
initializing the DB. Finally, add a migration path to move existing backend_pass
values into the secret store or encrypted column and remove/zero out plaintext
backend_pass entries.
🟡 Minor comments (13)
scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md-155-163 (1)
155-163: Hyphenate “Meta‑analysis” in the round label.
Line 159 uses “Meta Analysis”; hyphenation reads better and matches standard usage.✏️ Suggested tweak
-5. **Round 5: Meta Analysis** (META agent only) +5. **Round 5: Meta-analysis** (META agent only)scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md-596-607 (1)
596-607: Add a language identifier to the directory tree code fence.
Markdownlint (MD040) expects a language;textis sufficient.✏️ Suggested tweak
-``` +```text ClaudeCode_Headless/ ├── README.md # This file ├── prompts/ │ ├── multi_agent_discovery_prompt.md # Concise system prompt │ └── multi_agent_discovery_reference.md # Comprehensive reference ├── headless_db_discovery.py # Python script ├── headless_db_discovery.sh # Bash script └── examples/ ├── DATABASE_DISCOVERY_REPORT.md # Example output └── DATABASE_QUESTION_CAPABILITIES.md # Feature documentation -``` +```RAG_POC/architecture-runtime-retrieval.md-51-141 (1)
51-141: Replace bold “pseudo‑headings” with actual headings (MD036).
markdownlint flags several bolded labels used as headings; switching to####improves structure and TOC generation.💡 Example fix (apply similarly to other bold labels)
-**Canonical flow** +#### Canonical flow -**Principle** +#### Principle -**Typical output** +#### Typical outputAlso applies to: 279-279
doc/MCP/Database_Discovery_Agent.md-804-808 (1)
804-808: Add the missing implementation link.
The sentence is incomplete and leaves a blank reference, which is a dead end for readers.💡 Proposed fix
-**Actual Implementation:** See for the actual ProxySQL MCP discovery implementation. +**Actual Implementation:** See `Two_Phase_Discovery_Implementation.md` for the actual ProxySQL MCP discovery implementation.RAG_POC/embeddings-design.md-98-104 (1)
98-104: Use proper headings instead of bold text.markdownlint flags bold-as-heading in these sections; convert to real headings.
🔧 Suggested diff
-**Semantics** +#### Semantics ... -**Do not** +#### Do notAlso applies to: 252-255
RAG_POC/embeddings-design.md-41-47 (1)
41-47: Add a language identifier to the fenced block.markdownlint flags this block; adding a language keeps formatting consistent.
🔧 Suggested diff
-``` +```text {Title} Tags: {Tags} {ChunkBody}</details> </blockquote></details> <details> <summary>lib/MCP_Endpoint.cpp-206-209 (1)</summary><blockquote> `206-209`: **Preserve JSON-RPC `id` (or omit) in exception responses.** Current exception paths send `id: ""`, which is neither `null` nor the original request id and can break clients that rely on JSON-RPC semantics. Consider carrying a nullable `req_id` outside the try and using it in catch blocks. <details> <summary>🐛 Proposed fix</summary> ```diff - // Declare these outside the try block so they're available in catch handlers - std::string req_body; - std::string req_path; + // Declare these outside the try block so they're available in catch handlers + std::string req_body; + std::string req_path; + json req_id = nullptr; @@ - // Extract request ID immediately after parsing (JSON-RPC 2.0 spec) + // Extract request ID immediately after parsing (JSON-RPC 2.0 spec) // This must be done BEFORE validation so we can include the ID in error responses - json req_id = nullptr; if (req_json.contains("id")) { req_id = req_json["id"]; } @@ - auto response = std::shared_ptr<http_response>(new string_response( - create_jsonrpc_error(-32603, "Internal error: " + std::string(e.what()), ""), + auto response = std::shared_ptr<http_response>(new string_response( + create_jsonrpc_error(-32603, "Internal error: " + std::string(e.what()), req_id), http::http_utils::http_internal_server_error )); @@ - auto response = std::shared_ptr<http_response>(new string_response( - create_jsonrpc_error(-32603, "Internal error: Unknown exception", ""), + auto response = std::shared_ptr<http_response>(new string_response( + create_jsonrpc_error(-32603, "Internal error: Unknown exception", req_id), http::http_utils::http_internal_server_error ));Also applies to: 241-246, 350-372
doc/MCP/VARIABLES.md-109-118 (1)
109-118: Implementmcp-ai_endpoint_authvariable or update documentation.The
/mcp/aiendpoint is registered but authentication is not wired. Themcp-ai_endpoint_authvariable is documented in VARIABLES.md and Architecture.md, but the variable does not exist in the codebase andMCP_JSONRPC_Resource::authenticate_request()only handles five endpoints (config, observe, query, admin, cache)—not "ai". Either implement the variable inProxySQL_Admin.cppand add the "ai" case toauthenticate_request()inMCP_Endpoint.cpp, or remove the variable documentation.RAG_POC/rag_ingest.cpp-743-784 (1)
743-784: Guard against empty or invaliddoc_id.If the format resolves to empty (e.g., missing pk or placeholder), all docs can collapse to the same
doc_idand be skipped. Add a fallback/validation to prevent silent data loss.Suggested guard
if (src.doc_map_json.contains("doc_id") && src.doc_map_json["doc_id"].is_object() && src.doc_map_json["doc_id"].contains("format") && src.doc_map_json["doc_id"]["format"].is_string()) { d.doc_id = apply_format(src.doc_map_json["doc_id"]["format"].get<std::string>(), row); } else { auto pk = row_get(row, src.pk_column).value_or(""); d.doc_id = src.table_name + ":" + pk; } + if (d.doc_id.empty()) { + auto pk = row_get(row, src.pk_column).value_or(""); + d.doc_id = src.table_name + ":" + pk; + } + if (d.doc_id.empty()) { + throw std::runtime_error("doc_id resolved empty for source_id=" + std::to_string(src.source_id)); + }include/ProxySQL_Admin_Tables_Definitions.h-378-405 (1)
378-405: Consider makingrun_idNOT NULL to match code behavior and prevent accidental duplicates.The schema allows nullable
run_idin the composite primary key, but the code consistently provides explicit values (0 or resolved run_ids, never NULL). While this currently prevents the duplicate row issue, adding a NOT NULL constraint would align the schema with actual usage and provide defensive protection against future modifications.Suggested change
- " run_id INT ," \ + " run_id INT NOT NULL ," \Apply to both
STATS_SQLITE_TABLE_MCP_QUERY_DIGESTandSTATS_SQLITE_TABLE_MCP_QUERY_DIGEST_RESET.doc/MCP/Vector_Embeddings_Implementation_Plan.md-35-37 (1)
35-37: Align the plan with the Discovery_Schema-based storage approach.The doc says embeddings are integrated into
mcp_catalog.dbviaDiscovery_Schema, but later calls for a newmcp-embeddingpathand newMySQL_Embeddings.*files. That conflicts with the earlier architecture and could mislead implementation.Also applies to: 155-163, 186-189
doc/MCP/FTS_Implementation_Plan.md-46-76 (1)
46-76: Use headings instead of bold labels for section titles (MD036).Markdownlint flags the bold “Parameters/Response/Implementation Logic” labels. Convert these to proper headings and apply consistently across the doc.
📝 Example fix
-**Parameters**: +#### Parameters -**Response**: +#### Response -**Implementation Logic**: +#### Implementation Logiclib/ProxySQL_Admin.cpp-1354-1357 (1)
1354-1357: *Avoid double-refresh when querying _reset counters.Line 1354 and Line 1357:
stats_mcp_query_tools_counters_resetalso matchesstats_mcp_query_tools_counters, so a reset-only query refreshes both tables. Consider anelse if(or count logic like the MySQL digest handling) to prevent extra refresh work.🛠️ Proposed fix
-if (strstr(query_no_space,"stats_mcp_query_tools_counters")) - { stats_mcp_query_tools_counters=true; refresh=true; } -if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset")) - { stats_mcp_query_tools_counters_reset=true; refresh=true; } +if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset")) { + stats_mcp_query_tools_counters_reset=true; refresh=true; +} else if (strstr(query_no_space,"stats_mcp_query_tools_counters")) { + stats_mcp_query_tools_counters=true; refresh=true; +}
🧹 Nitpick comments (16)
RAG_POC/schema.sql (1)
134-141: Hardcoded embedding dimension limits flexibility.The
embedding float[1536]dimension is hardcoded for OpenAI'stext-embedding-3-largemodel. Different embedding models use different dimensions (e.g., 768, 1024, 3072).Consider either:
- Making dimension configurable per source via
embedding_json- Adding a comment documenting which models this supports
- Using a migration strategy for dimension changes
doc/MCP/Architecture.md (1)
47-66: ASCII diagram has alignment issues.The component diagram has some misaligned characters that affect readability. The tool handler boxes and connecting lines could be better aligned.
📝 Suggested alignment fix
-│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌───┐│ -│ │conf│ │obs │ │qry │ │adm │ │cach│ │ai ││ -│ │TH │ │TH │ │TH │ │TH │ │TH │ │TH ││ -│ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬─┘│ +│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ +│ │conf│ │obs │ │qry │ │adm │ │cach│ │ ai │ │ +│ │TH │ │TH │ │TH │ │TH │ │TH │ │ TH │ │ +│ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ │RAG_POC/mcp-tools.md (1)
24-31: Clarify BM25 score semantics.The note mentions "bm25; lower is better in SQLite's bm25 by default" but then recommends normalizing so "higher is always better." This could cause confusion.
Consider explicitly documenting the transformation applied (e.g.,
score_fts = -bm25_raworscore_fts = 1 / (1 + bm25_raw)) so implementers and consumers understand the mapping.include/Query_Tool_Handler.h (2)
67-80: Potential issue with min_time tracking for zero-duration calls.The
add_timingmethod has a condition that may incorrectly handle minimum time tracking:if (duration < min_time || min_time == 0) { if (duration) min_time = duration; // skips zero durations }If a tool call legitimately completes in 0 microseconds (very fast), it won't be recorded as the minimum. This is likely intentional to avoid meaningless minimums, but consider documenting this behavior or using a sentinel value like
ULLONG_MAXfor uninitialized state instead.♻️ Alternative approach using sentinel value
struct ToolUsageStats { // ... - ToolUsageStats() : count(0), first_seen(0), last_seen(0), - sum_time(0), min_time(0), max_time(0) {} + ToolUsageStats() : count(0), first_seen(0), last_seen(0), + sum_time(0), min_time(ULLONG_MAX), max_time(0) {} void add_timing(unsigned long long duration, unsigned long long timestamp) { count++; sum_time += duration; - if (duration < min_time || min_time == 0) { - if (duration) min_time = duration; - } + if (duration < min_time) { + min_time = duration; + } // ... } };
39-48: Connection pool usesvoid*for MySQL handle.The
MySQLConnection::mysqlfield is typed asvoid*rather thanMYSQL*. While this avoids including MySQL headers in the header file, it loses type safety.Consider using a forward declaration (
typedef struct st_mysql MYSQL;) as done inStatic_Harvester.hto maintain type safety while avoiding header inclusion.lib/Anomaly_Detector.cpp.bak (2)
1-14: Consider removing the.cpp.bakartifact if it’s not intentional.Backup files in source control can confuse tooling and packaging globs. If it’s accidental, dropping it keeps the tree clean.
254-290: The divide-by-zero concern is unfounded in current code; this is an optional defensive improvement.While adding a guard against
config.rate_limit <= 0would be good defensive programming practice, the vulnerability cannot occur in the current codebase. Theconfigstruct is private to theAnomaly_Detectorclass and is initialized toDEFAULT_RATE_LIMIT(100) in the constructor with no external modification pathways. The comparison check at line 285 (if (stats.query_count > (uint64_t)config.rate_limit)) also implicitly ensures the divisor is positive before the division.If you want to add a defensive guard for future-proofing against potential configuration changes, it would be reasonable, but it's not addressing a real vulnerability in the current code.
lib/ProxySQL_MCP_Server.cpp (1)
182-185: Avoid duplicateextern AI_Features_Manager *GloAIdeclaration.You already declare this earlier in the same function scope (Line 132). Consider removing the duplicate for clarity.
lib/AI_Features_Manager.cpp (1)
300-315: Validategenai_vector_dimensionbefore building SQL.If the configured dimension is 0/negative,
vec0(embedding float(0))will fail and silently drop vector search. Consider clamping to a safe default.Proposed fix
- int vector_dimension = 1536; // Default value - if (GloGATH) { - vector_dimension = GloGATH->variables.genai_vector_dimension; - } + int vector_dimension = 1536; // Default value + if (GloGATH && GloGATH->variables.genai_vector_dimension > 0) { + vector_dimension = GloGATH->variables.genai_vector_dimension; + }lib/Admin_Bootstrap.cpp (1)
717-737: Consider handling MCP catalog DB open failures.
mcpdb->open(...)isn’t checked. If it fails, downstream attach and schema usage may behave unpredictably. Consider validatingopen()’s return code and logging a clear error.include/RAG_Tool_Handler.h (1)
31-35: Make JSON dependency explicit to avoid transitive include fragility.This header uses
jsonbut doesn’t include the json header directly. Consider including the json header (or a forward declaration/alias) here to avoid build‑order surprises ifMCP_Tool_Handler.hchanges.RAG_IMPLEMENTATION_COMPLETE.md (1)
3-130: Add a brief “known limitations” note.The document labels the subsystem “COMPLETE”; consider a short limitations section to avoid implying full production readiness (e.g., rag_ingest uses pseudo-embeddings, incremental sync via
rag_sync_statestill pending).RAG_IMPLEMENTATION_SUMMARY.md (1)
3-130: Consider a short limitations/next-steps note.This summary reads as fully production-ready; adding a brief “known limitations / future work” note (e.g., stub embeddings, incremental sync roadmap) would set expectations more clearly.
RAG_POC/rag_ingest.cpp (3)
384-405: Add MySQL timeouts to avoid indefinite hangs.A stalled backend can block ingestion indefinitely. Consider setting connect/read/write timeouts on the MySQL handle.
Possible tweak
MYSQL* conn = mysql_init(nullptr); if (!conn) fatal("mysql_init failed"); + unsigned int timeout = 5; + mysql_options(conn, MYSQL_OPT_CONNECT_TIMEOUT, &timeout); + mysql_options(conn, MYSQL_OPT_READ_TIMEOUT, &timeout); + mysql_options(conn, MYSQL_OPT_WRITE_TIMEOUT, &timeout);
406-479: Validate identifiers used in the dynamic SELECT.
table_nameand column names are interpolated directly; unexpected characters (e.g., backticks) in rag_sources can break SQL or target unintended objects. Consider validating identifiers against a safe pattern before building SQL.
839-947: Consider streaming results to avoid large memory spikes.
mysql_store_resultbuffers the full result set. For large sources, this can exhaust memory; considermysql_use_resultwith row-by-row fetch or paginate the SELECT.
| const char* create_rag_sources = | ||
| "CREATE TABLE IF NOT EXISTS rag_sources (" | ||
| "source_id INTEGER PRIMARY KEY, " | ||
| "name TEXT NOT NULL UNIQUE, " | ||
| "enabled INTEGER NOT NULL DEFAULT 1, " | ||
| "backend_type TEXT NOT NULL, " | ||
| "backend_host TEXT NOT NULL, " | ||
| "backend_port INTEGER NOT NULL, " | ||
| "backend_user TEXT NOT NULL, " | ||
| "backend_pass TEXT NOT NULL, " | ||
| "backend_db TEXT NOT NULL, " | ||
| "table_name TEXT NOT NULL, " | ||
| "pk_column TEXT NOT NULL, " | ||
| "where_sql TEXT, " | ||
| "doc_map_json TEXT NOT NULL, " | ||
| "chunking_json TEXT NOT NULL, " |
There was a problem hiding this comment.
Avoid storing source passwords in plaintext.
rag_sources.backend_pass is stored as plain TEXT. That’s a high‑impact security risk if mcp_catalog.db is read. Consider encrypting at rest, integrating with existing secret handling, or storing references/tokens instead.
🤖 Prompt for AI Agents
In `@lib/AI_Features_Manager.cpp` around lines 163 - 178, The schema currently
defines rag_sources.backend_pass as plain TEXT in the create_rag_sources SQL
constant; change the schema to avoid storing plaintext secrets by replacing the
backend_pass column with a secure alternative (e.g., backend_pass_encrypted BLOB
or backend_secret_ref TEXT) in the create_rag_sources string, and update all
code paths that read/write rag_sources (any functions that INSERT/SELECT/update
rag_sources) to use the chosen approach: either encrypt/decrypt using the
existing secret management APIs or store only a reference/token and integrate
with the secret store retrieval logic. Ensure migrations handle existing rows
(migrate/rotate plaintext into the secure field) and update any code that
expects backend_pass to use the new column and decryption/secret resolution
functions.
lib/Anomaly_Detector.cpp.bak
Outdated
| void Anomaly_Detector::update_user_statistics(const QueryFingerprint& fp) { | ||
| if (!config.enabled) { | ||
| return; | ||
| } | ||
|
|
||
| std::string key = fp.user + "@" + fp.client_host; | ||
| UserStats& stats = user_statistics[key]; | ||
|
|
||
| // Add to recent queries | ||
| stats.recent_queries.push_back(fp.query_pattern); | ||
|
|
||
| // Keep only recent queries | ||
| if (stats.recent_queries.size() > MAX_RECENT_QUERIES) { | ||
| stats.recent_queries.erase(stats.recent_queries.begin()); | ||
| } | ||
|
|
||
| stats.last_query_time = fp.timestamp; | ||
| stats.query_count++; | ||
|
|
||
| // Cleanup old entries periodically | ||
| static int cleanup_counter = 0; | ||
| if (++cleanup_counter % 1000 == 0) { | ||
| uint64_t current_time = (uint64_t)time(NULL); | ||
| auto it = user_statistics.begin(); | ||
| while (it != user_statistics.end()) { | ||
| if (current_time - it->second.last_query_time > USER_STATS_WINDOW * 2) { | ||
| it = user_statistics.erase(it); | ||
| } else { | ||
| ++it; | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "*Anomaly_Detector*" | head -20Repository: ProxySQL/proxysql-vec
Length of output: 186
🏁 Script executed:
rg -l "Anomaly_Detector" --type cpp --type h | head -20Repository: ProxySQL/proxysql-vec
Length of output: 398
🏁 Script executed:
rg "class Anomaly_Detector" -A 20 --type cpp --type hRepository: ProxySQL/proxysql-vec
Length of output: 3424
🏁 Script executed:
cat -n include/Anomaly_Detector.hRepository: ProxySQL/proxysql-vec
Length of output: 5045
🏁 Script executed:
cat -n lib/Anomaly_Detector.cpp | head -100Repository: ProxySQL/proxysql-vec
Length of output: 3864
🏁 Script executed:
rg "pthread_rwlock|pthread_mutex|std::mutex|std::lock" lib/Anomaly_Detector.cppRepository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
grep -n "pthread_rwlock\|pthread_mutex\|std::mutex\|std::lock\|rwlock\|mutex" include/Anomaly_Detector.hRepository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
wc -l lib/Anomaly_Detector.cppRepository: ProxySQL/proxysql-vec
Length of output: 93
🏁 Script executed:
cat -n lib/Anomaly_Detector.cpp | sed -n '550,610p'Repository: ProxySQL/proxysql-vec
Length of output: 2245
🏁 Script executed:
rg "update_user_statistics" --type cpp --type hRepository: ProxySQL/proxysql-vec
Length of output: 304
🏁 Script executed:
rg "update_user_statistics\|analyze\(" lib/Anomaly_Detector.cpp -B 2 -A 2Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
rg "Anomaly_Detector.*analyze" --type cpp --type h -B 3 -A 5Repository: ProxySQL/proxysql-vec
Length of output: 1668
🏁 Script executed:
cat -n lib/Anomaly_Detector.cpp | sed -n '597,650p'Repository: ProxySQL/proxysql-vec
Length of output: 2262
🏁 Script executed:
rg "GloAI\|get_anomaly_detector" --type cpp --type h | head -20Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
rg "\.analyze\(" lib/AI_Features_Manager.cpp lib/MySQL_Session.cpp --type cpp -B 3 -A 3Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
rg "get_anomaly_detector" --type cpp --type h -B 2 -A 5Repository: ProxySQL/proxysql-vec
Length of output: 2786
🏁 Script executed:
cat -n lib/AI_Features_Manager.cpp | head -150Repository: ProxySQL/proxysql-vec
Length of output: 5778
🏁 Script executed:
cat -n lib/MySQL_Session.cpp | grep -B 10 -A 10 "get_anomaly_detector"Repository: ProxySQL/proxysql-vec
Length of output: 1791
🏁 Script executed:
rg "handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_QUERY_detect_ai_anomaly" lib/MySQL_Session.cpp -B 5 -A 15Repository: ProxySQL/proxysql-vec
Length of output: 1388
🏁 Script executed:
cat -n include/AI_Features_Manager.h | sed -n '1,100p'Repository: ProxySQL/proxysql-vec
Length of output: 3759
🏁 Script executed:
cat -n lib/MySQL_Session.cpp | sed -n '3620,3680p'Repository: ProxySQL/proxysql-vec
Length of output: 2530
🏁 Script executed:
rg "wrlock|wrunlock" lib/MySQL_Session.cpp -B 3 -A 3Repository: ProxySQL/proxysql-vec
Length of output: 1860
🏁 Script executed:
cat -n include/AI_Features_Manager.h | sed -n '100,150p'Repository: ProxySQL/proxysql-vec
Length of output: 1527
🏁 Script executed:
cat -n include/AI_Features_Manager.h | sed -n '140,170p'Repository: ProxySQL/proxysql-vec
Length of output: 1183
🏁 Script executed:
rg "handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_QUERY_detect_ai_anomaly" lib/MySQL_Session.cpp -B 15 | head -40Repository: ProxySQL/proxysql-vec
Length of output: 1296
🏁 Script executed:
grep -n "wrlock\|wrunlock" lib/AI_Features_Manager.cpp | head -20Repository: ProxySQL/proxysql-vec
Length of output: 185
🏁 Script executed:
cat -n lib/Anomaly_Detector.cpp | sed -n '635,645p'Repository: ProxySQL/proxysql-vec
Length of output: 457
🏁 Script executed:
grep -n "class Anomaly_Detector {" -A 30 include/Anomaly_Detector.hRepository: ProxySQL/proxysql-vec
Length of output: 1122
🏁 Script executed:
rg "check_statistical_anomaly" lib/Anomaly_Detector.cpp -A 15 | head -30Repository: ProxySQL/proxysql-vec
Length of output: 1011
🏁 Script executed:
rg "update_user_statistics" lib/Anomaly_Detector.cpp -B 2 -A 2Repository: ProxySQL/proxysql-vec
Length of output: 298
Add mutex protection to Anomaly_Detector for thread-safe access to user_statistics.
The Anomaly_Detector::update_user_statistics() method and check_statistical_anomaly() both access the user_statistics map without synchronization. In ProxySQL's multi-threaded environment, the analyze() method is called concurrently from multiple worker threads, causing races on:
user_statisticsmap (access and erase during iteration at line 585)- static
cleanup_counter(line 579)
Add a pthread_mutex_t member to Anomaly_Detector and protect all accesses to user_statistics and cleanup_counter within the class. Alternatively, delegate locking to AI_Features_Manager by wrapping detector calls in wrlock()/wrunlock().
🤖 Prompt for AI Agents
In `@lib/Anomaly_Detector.cpp.bak` around lines 559 - 591, The code accesses
user_statistics and the static cleanup_counter without synchronization; add a
pthread_mutex_t (e.g., mutex_user_stats) as a member of Anomaly_Detector,
initialize it in the constructor and destroy it in the destructor, then surround
all accesses to user_statistics and cleanup_counter in
Anomaly_Detector::update_user_statistics and in check_statistical_anomaly with
pthread_mutex_lock/pthread_mutex_unlock to prevent concurrent map mutations and
races (or convert cleanup_counter to a member and protect it the same way);
ensure the same lock is used for both methods so erases/iterations and pushes to
recent_queries are atomic with respect to each other.
| proxy_debug(PROXY_DEBUG_GENERIC, 2, "MCP request on %s: %s\n", req_path.c_str(), req_body.c_str()); | ||
|
|
||
| // Validate JSON | ||
| json req_json; | ||
| try { | ||
| req_json = json::parse(req_body); | ||
| } catch (json::parse_error& e) { | ||
| proxy_error("MCP request on %s: Invalid JSON - %s\n", req_path.c_str(), e.what()); | ||
| proxy_error("MCP request payload that failed to parse: %s\n", req_body.c_str()); | ||
| if (handler) { |
There was a problem hiding this comment.
Avoid logging full request payloads and tool arguments at error level.
These logs can capture credentials, PII, or sensitive business data. Consider logging sizes or truncated previews (or moving to debug-level with redaction) instead of full payloads.
🛡️ Suggested mitigation (truncate or size-only)
- proxy_error("MCP request payload that failed to parse: %s\n", req_body.c_str());
+ proxy_error("MCP request payload that failed to parse (size=%zu bytes)\n", req_body.size());
@@
- proxy_error("MCP request payload that caused exception: %s\n", req_body.c_str());
+ proxy_error("MCP request payload that caused exception (size=%zu bytes)\n", req_body.size());
@@
- proxy_error("MCP TOOL CALL FAILED: arguments='%s'\n", args_str.c_str());
+ proxy_debug(PROXY_DEBUG_GENERIC, 2, "MCP TOOL CALL FAILED: arguments size=%zu bytes\n", args_str.size());Also applies to: 352-366, 477-482
🤖 Prompt for AI Agents
In `@lib/MCP_Endpoint.cpp` around lines 221 - 230, The error logging currently
emits the full request payload (req_body) when JSON parsing fails; update the
handlers in MCP_Endpoint.cpp (around the json::parse try/catch that references
req_body and req_path, and the similar blocks noted at the other locations) to
avoid emitting sensitive content: replace proxy_error(... req_body ...) with a
safe message that logs either the payload size and a short truncated/redacted
preview (e.g., first N chars + "...") or moves the full payload to proxy_debug
at most; ensure req_path and e.what() remain logged but any tool arguments or
full request bodies are never printed at error level and apply the same change
to the other similar blocks (the blocks around lines referenced 352-366 and
477-482).
| if (arguments.contains("fuse") && arguments["fuse"].is_object()) { | ||
| const json& fuse_params = arguments["fuse"]; | ||
| fts_k = validate_k(get_json_int(fuse_params, "fts_k", 50)); | ||
| vec_k = validate_k(get_json_int(fuse_params, "vec_k", 50)); | ||
| rrf_k0 = get_json_int(fuse_params, "rrf_k0", 60); | ||
| w_fts = get_json_int(fuse_params, "w_fts", 1.0); | ||
| w_vec = get_json_int(fuse_params, "w_vec", 1.0); | ||
| } else { | ||
| // Fallback to top-level parameters for backward compatibility | ||
| fts_k = validate_k(get_json_int(arguments, "fts_k", 50)); | ||
| vec_k = validate_k(get_json_int(arguments, "vec_k", 50)); | ||
| rrf_k0 = get_json_int(arguments, "rrf_k0", 60); | ||
| w_fts = get_json_int(arguments, "w_fts", 1.0); | ||
| w_vec = get_json_int(arguments, "w_vec", 1.0); | ||
| } |
There was a problem hiding this comment.
Preserve fractional weights for hybrid scoring
w_fts/w_vec are parsed with get_json_int(), which truncates decimals (e.g., 0.5 → 0). This skews fusion scores.
💡 Suggested fix (use double parsing)
- w_fts = get_json_int(fuse_params, "w_fts", 1.0);
- w_vec = get_json_int(fuse_params, "w_vec", 1.0);
+ w_fts = std::stod(get_json_string(fuse_params, "w_fts", "1.0"));
+ w_vec = std::stod(get_json_string(fuse_params, "w_vec", "1.0"));Apply the same change to the fallback block using top-level arguments.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (arguments.contains("fuse") && arguments["fuse"].is_object()) { | |
| const json& fuse_params = arguments["fuse"]; | |
| fts_k = validate_k(get_json_int(fuse_params, "fts_k", 50)); | |
| vec_k = validate_k(get_json_int(fuse_params, "vec_k", 50)); | |
| rrf_k0 = get_json_int(fuse_params, "rrf_k0", 60); | |
| w_fts = get_json_int(fuse_params, "w_fts", 1.0); | |
| w_vec = get_json_int(fuse_params, "w_vec", 1.0); | |
| } else { | |
| // Fallback to top-level parameters for backward compatibility | |
| fts_k = validate_k(get_json_int(arguments, "fts_k", 50)); | |
| vec_k = validate_k(get_json_int(arguments, "vec_k", 50)); | |
| rrf_k0 = get_json_int(arguments, "rrf_k0", 60); | |
| w_fts = get_json_int(arguments, "w_fts", 1.0); | |
| w_vec = get_json_int(arguments, "w_vec", 1.0); | |
| } | |
| if (arguments.contains("fuse") && arguments["fuse"].is_object()) { | |
| const json& fuse_params = arguments["fuse"]; | |
| fts_k = validate_k(get_json_int(fuse_params, "fts_k", 50)); | |
| vec_k = validate_k(get_json_int(fuse_params, "vec_k", 50)); | |
| rrf_k0 = get_json_int(fuse_params, "rrf_k0", 60); | |
| w_fts = fuse_params.value("w_fts", 1.0); | |
| w_vec = fuse_params.value("w_vec", 1.0); | |
| } else { | |
| // Fallback to top-level parameters for backward compatibility | |
| fts_k = validate_k(get_json_int(arguments, "fts_k", 50)); | |
| vec_k = validate_k(get_json_int(arguments, "vec_k", 50)); | |
| rrf_k0 = get_json_int(arguments, "rrf_k0", 60); | |
| w_fts = arguments.value("w_fts", 1.0); | |
| w_vec = arguments.value("w_vec", 1.0); | |
| } |
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 1490 - 1504, The code parses fusion
weights w_fts and w_vec with get_json_int (in both the fuse_params branch and
the fallback top-level arguments branch), which truncates fractional values;
change those calls to the floating-point variant (e.g., get_json_double or
equivalent) and ensure w_fts and w_vec are stored as double (or a floating type)
and use 1.0 as the default, updating both the fuse_params block and the fallback
block so fractional weights are preserved for hybrid scoring; leave integer
parsing (validate_k/get_json_int) for fts_k/vec_k/rrf_k0 unchanged.
| // Build doc ID list for SQL | ||
| std::string doc_list = "'"; | ||
| for (size_t i = 0; i < doc_ids.size(); ++i) { | ||
| if (i > 0) doc_list += "','"; | ||
| doc_list += doc_ids[i]; | ||
| } | ||
| doc_list += "'"; | ||
|
|
||
| // Build query | ||
| std::string sql = "SELECT doc_id, source_id, " | ||
| "(SELECT name FROM rag_sources WHERE source_id = rag_documents.source_id) as source_name, " | ||
| "pk_json, title, body, metadata_json " | ||
| "FROM rag_documents " | ||
| "WHERE doc_id IN (" + doc_list + ")"; | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file to see its size and read the specific lines
wc -l lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 186
🏁 Script executed:
# Read lines around 2316-2330 to see the full context
sed -n '2310,2335p' lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 186
🏁 Script executed:
# Read lines around 2399-2405 to see the second occurrence
sed -n '2395,2410p' lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 647
🏁 Script executed:
# Search for chunk_ids handling to understand the validation pattern mentioned
rg -n "chunk_ids" lib/RAG_Tool_Handler.cpp | head -20Repository: ProxySQL/proxysql-vec
Length of output: 660
🏁 Script executed:
# Search for get_chunks function to see how chunk_ids are validated
rg -B5 -A15 "chunk_ids\[" lib/RAG_Tool_Handler.cpp | head -50Repository: ProxySQL/proxysql-vec
Length of output: 876
🏁 Script executed:
# Look for any existing SQL escaping or prepared statement patterns
rg -n "prepared|prepared_statement|escape|sanitize" lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 894
🏁 Script executed:
# Find the second location in fetch_from_source (lines around 2399-2405)
sed -n '2395,2410p' lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 647
🏁 Script executed:
# Verify this is in fetch_from_source function
rg -B10 "2399\|2400\|2401" lib/RAG_Tool_Handler.cpp | head -30Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
# Search for fetch_from_source to see if it has similar doc_ids handling without escaping
rg -n "fetch_from_source" lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 306
🏁 Script executed:
# Get context around line 2399 to confirm it's the second doc_ids vulnerability
sed -n '2385,2415p' lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 1131
🏁 Script executed:
# Get more context around the fetch_from_source function (starting at 2377)
sed -n '2377,2420p' lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 1566
🏁 Script executed:
# Search for the specific pattern in fetch_from_source to see if doc_ids are used without escaping
rg -n "doc_ids\[" lib/RAG_Tool_Handler.cppRepository: ProxySQL/proxysql-vec
Length of output: 371
🏁 Script executed:
# Check if there's validation for doc_ids in fetch_from_source
rg -B5 -A20 "else if \(tool_name == \"rag.fetch_from_source\"\)" lib/RAG_Tool_Handler.cpp | head -60Repository: ProxySQL/proxysql-vec
Length of output: 985
Add escaping and validation to doc_ids in SQL IN clauses
doc_ids are concatenated without validation or escaping in both rag.fetch_from_source locations (lines 2316-2330 and 2399-2405), creating a SQL injection vulnerability. Unlike chunk_ids (which validates characters at line 2206 and escapes single quotes at lines 2230-2237), doc_ids are inserted directly into the SQL query. A crafted ID containing single quotes or semicolons can break the query or enable statement injection.
Add validation and escaping matching the chunk_ids pattern:
- Check for invalid characters (
',\,;) before building the list - Escape single quotes by doubling them (
'→''), consistent with existing tag-escaping logic (lines 554-561, 582-589)
Apply the fix to both occurrences: line 2320 and line 2403.
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 2316 - 2330, The doc_id list is
concatenated directly into the SQL string creating SQL injection risk; update
the code that constructs doc_list (the loop iterating over doc_ids in
RAG_Tool_Handler.cpp used by rag.fetch_from_source) to first validate each
doc_id for forbidden characters (', backslash, semicolon) and reject or skip
invalid entries, and then escape single quotes by doubling them before appending
(same approach used for chunk_ids and tag-escaping); apply the same
validation+escaping to both occurrences where doc_list is built so the final
"WHERE doc_id IN (...)" uses only validated, single-quote-escaped values.
| -- Where to retrieve from (PoC: connect directly; later can be "via ProxySQL") | ||
| backend_type TEXT NOT NULL, -- "mysql" | "postgres" | ... | ||
| backend_host TEXT NOT NULL, | ||
| backend_port INTEGER NOT NULL, | ||
| backend_user TEXT NOT NULL, | ||
| backend_pass TEXT NOT NULL, | ||
| backend_db TEXT NOT NULL, -- database/schema name |
There was a problem hiding this comment.
Security concern: Backend credentials stored in plaintext.
The backend_pass field stores database credentials in plaintext within SQLite. For a PoC this may be acceptable, but for production:
- Consider encrypting credentials at rest
- Use a secrets manager or reference external credential storage
- At minimum, restrict file permissions on the catalog database
🤖 Prompt for AI Agents
In `@RAG_POC/schema.sql` around lines 19 - 25, The schema stores plaintext
credentials in the backend_pass column; change the schema and access patterns to
avoid plaintext at rest by replacing or augmenting backend_pass with a secret
reference (e.g., backend_secret_ref) or an encrypted column (e.g.,
backend_pass_encrypted) and update any code that reads/writes backend records to
(a) write secret references instead of raw passwords to the catalog, or (b)
encrypt/decrypt backend_pass_encrypted using your KMS/secret-manager client
before persisting/after retrieval (implement decrypt in the code path that uses
backend_host/backend_user/backend_db). Also ensure the catalog file is created
with restrictive filesystem permissions (owner-only read/write) when
initializing the DB. Finally, add a migration path to move existing backend_pass
values into the secret store or encrypted column and remove/zero out plaintext
backend_pass entries.
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
include/MySQL_Catalog.h (1)
132-146: Schema isolation is incomplete:mergelacks a schema parameter.All other catalog ops are schema-scoped now; leaving
mergeunscoped makes cross‑schema merges ambiguous.🧭 Suggested fix
- * `@param` keys Array of keys to merge + * `@param` schema Schema name (empty for all schemas) + * `@param` keys Array of keys to merge * `@param` target_key Key for the merged summary * `@param` kind Kind for the merged entry (default "domain") * `@param` instructions Optional instructions for merging * `@return` 0 on success, -1 on error */ int merge( + const std::string& schema, const std::vector<std::string>& keys, const std::string& target_key, const std::string& kind = "domain", const std::string& instructions = "" );lib/MySQL_Catalog.cpp (1)
500-518: SQL injection vulnerability in remove method.The
removemethod concatenatesschema,kind, andkeydirectly into the SQL query string without parameterization, allowing SQL injection attacks.🐛 Proposed fix using parameterized queries
int MySQL_Catalog::remove( const std::string& schema, const std::string& kind, const std::string& key ) { - std::ostringstream sql; - sql << "DELETE FROM catalog WHERE 1=1"; - - if (!schema.empty()) { - sql << " AND schema = '" << schema << "'"; - } - sql << " AND kind = '" << kind << "' AND key = '" << key << "'"; - - if (!db->execute(sql.str().c_str())) { + sqlite3_stmt* stmt = NULL; + const char* delete_sql = schema.empty() + ? "DELETE FROM catalog WHERE kind = ?1 AND key = ?2" + : "DELETE FROM catalog WHERE schema = ?1 AND kind = ?2 AND key = ?3"; + + int rc = db->prepare_v2(delete_sql, &stmt); + if (rc != SQLITE_OK) { + proxy_error("Failed to prepare catalog remove: %d\n", rc); + return -1; + } + + if (schema.empty()) { + (*proxy_sqlite3_bind_text)(stmt, 1, kind.c_str(), -1, SQLITE_TRANSIENT); + (*proxy_sqlite3_bind_text)(stmt, 2, key.c_str(), -1, SQLITE_TRANSIENT); + } else { + (*proxy_sqlite3_bind_text)(stmt, 1, schema.c_str(), -1, SQLITE_TRANSIENT); + (*proxy_sqlite3_bind_text)(stmt, 2, kind.c_str(), -1, SQLITE_TRANSIENT); + (*proxy_sqlite3_bind_text)(stmt, 3, key.c_str(), -1, SQLITE_TRANSIENT); + } + + SAFE_SQLITE3_STEP2(stmt); + (*proxy_sqlite3_finalize)(stmt); + + if (rc != SQLITE_DONE && rc != SQLITE_OK) { proxy_error("Catalog remove error\n"); return -1; } return 0; }include/sqlite3db.h (1)
90-108: Add missing proxy function pointers to MAIN_PROXY_SQLITE3 block for consistency.The
MAIN_PROXY_SQLITE3block (lines 90-146) is missing two function pointers that are declared in the extern block:
proxy_sqlite3_enable_load_extension(declared on line 49)proxy_sqlite3_global_stats_row_step(declared on line 52)While these symbols are defined elsewhere in
lib/proxy_sqlite3_symbols.cpp, this asymmetry is inconsistent with how all other symbols are handled in this header. All other function pointers are defined in both the extern block and the MAIN_PROXY_SQLITE3 block. Add these two symbols to the#elseblock for consistency and clarity.doc/MCP/Tool_Discovery_Guide.md (1)
590-600: Add /mcp/rag (and note HTTP mode) in the endpoint list.
The guide says “all endpoints,” but the list omits the RAG endpoint and doesn’t mention HTTP when TLS is disabled.✏️ Suggested update
- **Observe**: `/mcp/observe` - Monitoring and metrics tools - **AI**: `/mcp/ai` - AI and LLM features +- **RAG**: `/mcp/rag` - Retrieval-Augmented Generation tools + +Note: use `http://` when `mcp_use_ssl=false`.lib/AI_Features_Manager.cpp (1)
52-57: Potential command injection viasystem()call.Using
system()with a path derived from configuration is risky. Ifgenai_vector_db_pathcontains shell metacharacters, this could lead to command injection. Consider usingmkdir()withENOENTloop or a safer directory creation method.🔒 Suggested fix using POSIX mkdir
- if (stat(dir, &st) != 0) { - // Create directory if it doesn't exist - char cmd[512]; - snprintf(cmd, sizeof(cmd), "mkdir -p %s", dir); - system(cmd); - } + if (stat(dir, &st) != 0) { + // Create directory if it doesn't exist + // Note: This only creates the final directory, not parents + // For production use, consider a recursive mkdir implementation + if (mkdir(dir, 0755) != 0 && errno != EEXIST) { + proxy_error("AI: Failed to create directory %s: %s\n", dir, strerror(errno)); + free(path_copy); + return -1; + } + }If parent directory creation is needed, implement a recursive mkdir function or use a library that provides this safely.
doc/MCP/Vector_Embeddings_Implementation_Plan.md (1)
136-151: Incomplete code example.The SQL code example starting at line 137 is truncated. The INSERT statement on lines 143-149 is missing its closing parenthesis and the example ends abruptly at line 150 without completing the demonstration.
Consider completing or removing this incomplete code block to avoid confusion.
lib/MCP_Endpoint.cpp (1)
206-375: Returnnull/request id in internal‑error responses.The catch blocks call
create_jsonrpc_error(..., ""), which serializes an empty‑string id. JSON‑RPC expects the original id when available, ornullwhen it isn’t. Capturereq_idoutside thetryand reuse it in the catch blocks.🛠️ Proposed fix
- // Declare these outside the try block so they're available in catch handlers - std::string req_body; - std::string req_path; + // Declare these outside the try block so they're available in catch handlers + std::string req_body; + std::string req_path; + json req_id = nullptr; ... - // Extract request ID immediately after parsing (JSON-RPC 2.0 spec) - // This must be done BEFORE validation so we can include the ID in error responses - json req_id = nullptr; + // Extract request ID immediately after parsing (JSON-RPC 2.0 spec) + // This must be done BEFORE validation so we can include the ID in error responses if (req_json.contains("id")) { req_id = req_json["id"]; } ... - create_jsonrpc_error(-32603, "Internal error: " + std::string(e.what()), ""), + create_jsonrpc_error(-32603, "Internal error: " + std::string(e.what()), req_id), ... - create_jsonrpc_error(-32603, "Internal error: Unknown exception", ""), + create_jsonrpc_error(-32603, "Internal error: Unknown exception", req_id),
🤖 Fix all issues with AI agents
In `@doc/MCP/Architecture.md`:
- Around line 253-270: The authenticate_request logic in MCP_Endpoint.cpp
(function authenticate_request) fails to recognize the "ai" endpoint, causing
all /mcp/ai requests to be treated as unknown; update the if/else chain inside
authenticate_request to include the "ai" endpoint branch and validate using the
existing mcp-ai_endpoint_auth token variable (same pattern as other endpoints:
read mcp-ai_endpoint_auth and compare the bearer token), ensuring /mcp/ai is
authenticated instead of falling through to the unknown-endpoint error.
In `@doc/rag-documentation.md`:
- Around line 17-28: Update the SQL examples to use ProxySQL's genai-<variable>
naming (replace the dot notation like genai.enabled and genai.rag_enabled) so
each variable uses a hyphen after "genai" and equals-sign assignment;
specifically change genai.enabled -> genai-enabled, genai.rag_enabled ->
genai-rag_enabled, genai.rag_k_max -> genai-rag_k_max, genai.rag_candidates_max
-> genai-rag_candidates_max, and genai.rag_timeout_ms -> genai-rag_timeout_ms in
the examples so they match the implementation pattern (e.g.,
genai-embedding_uri, genai-rerank_timeout_ms).
In `@lib/Admin_Handler.cpp`:
- Around line 2362-2385: The current handlers for the "LOAD MCP QUERY RULES ..."
and "SAVE MCP QUERY RULES ..." commands build a single string with two SQL
statements which causes the second statement to be ignored by
sqlite3_prepare_v2; instead, split the operations so each SQL statement is
executed separately (e.g., call sqlite3_exec/execute or prepare/step twice):
first execute the DELETE, then execute the INSERT; update the code that
currently assigns *q via l_strdup to either execute those two statements
immediately (and set *q/*ql appropriately if a response string is needed) or
assign a single statement plus run the second with a direct sqlite3_exec call;
ensure you continue to call l_free(*ql,*q) before assignments and preserve the
same return true behavior, referring to variables query_no_space,
query_no_space_length and functions l_free/l_strdup used in this diff.
In `@lib/ProxySQL_Admin_Stats.cpp`:
- Around line 2605-2627: The current loop builds SQL with sprintf into the
buffer "query" using format string "a" and raw r->fields[] values, risking SQL
injection and buffer overflow; replace this with a prepared statement: prepare
the SQL "INSERT INTO stats_mcp_query_digest VALUES (?1,...,?10)" via
statsdb->prepare_v2 into an sqlite3_stmt* (e.g., statement), then bind each
r->fields[i] using proxy_sqlite3_bind_text (or equivalent) before calling
statsdb->execute/step and finalizing the statement; remove manual
malloc/sprintf/free and ensure statement is finalized for each row or reused
with sqlite3_reset to avoid leaks.
- Around line 2662-2673: The current loop builds SQL with sprintf using the
template in variable a and r->fields[], risking SQL injection; replace this with
a parameterized/prepared statement: prepare a statement like "INSERT INTO
stats_mcp_query_rules VALUES (?,?)" once (or per loop if the API requires), bind
r->fields[0] and r->fields[1] (use bind_text/bind_blob equivalents), execute the
prepared statement via the statsdb API, and finalize/reset the statement rather
than allocating with malloc/sprintf and calling statsdb->execute with a
concatenated string; update the code that currently touches variables a, query,
r->fields, resultset, and statsdb->execute accordingly.
- Around line 1606-1612: The INSERT builds the table name incorrectly; in
ProxySQL_Admin_Stats.cpp where snprintf builds the INSERT (the query buffer used
with statsdb->execute), change the table-name construction so it produces
"stats_mcp_query_tools_counters_reset" when reset==true and
"stats_mcp_query_tools_counters" when reset==false (instead of prefixing or
suffixing incorrectly). Adjust the format string and arguments around the
snprintf call that currently uses "%s...%s" and the reset variable so the base
name is "stats_mcp_query_tools_counters" with an optional "_reset" suffix,
keeping the rest of the VALUES fields (r->fields[0]..r->fields[7]) unchanged.
In `@lib/ProxySQL_Admin.cpp`:
- Around line 7750-7804: In load_mcp_query_rules_to_runtime, the SQLite3_result*
resultset returned by admindb->execute_statement is freed only on the error
path, causing a memory leak after catalog->load_mcp_query_rules(resultset) on
success; change the cleanup so that if resultset != NULL it is freed after the
call to catalog->load_mcp_query_rules (i.e., move/free resultset outside the if
(success == false) block and ensure you check resultset for NULL before calling
free).
In `@lib/Query_Tool_Handler.cpp`:
- Around line 827-835: The code calls catalog->get_db()->execute_statement(...)
which returns a heap-allocated SQLite3_result* named resultset and currently
uses free(resultset) (undefined behavior); replace all free(resultset) usages in
this block with delete resultset so the object is properly destroyed (ensure
both the branch that returns schema and the subsequent cleanup path use delete
resultset and avoid double-delete by only deleting when not returning).
- Around line 362-405: The connection is returned to the pool too early and
execution continues after query failure; in both execute_query and
execute_query_with_schema ensure mysql_query failures immediately log and return
(do not call mysql_store_result or continue), and move the
return_connection(mysql) call so it happens after all result operations complete
(after mysql_store_result/maria results are processed and after
mysql_free_result and mysql_affected_rows are used) to avoid use-after-return
and race conditions; adjust control flow in execute_query and
execute_query_with_schema to acquire with get_connection, call mysql_query, on
error log and return a JSON error immediately, then call mysql_store_result and
only return_connection once all fetching, mysql_fetch_row, mysql_affected_rows,
and mysql_free_result are done.
- Around line 921-933: The SHOW TABLES query currently concatenates untrusted
strings (schema and name_filter from json_string) into sql which creates an SQL
injection risk; in Query_Tool_Handler.cpp validate and sanitize these before
use: ensure schema matches a strict identifier regex (e.g., only letters, digits
and underscores) and then wrap it in backticks when appending to sql (use the
validated schema variable, not raw input), and for name_filter either restrict
to safe pattern characters (allow % and _ for LIKE but disallow
quotes/backticks/semicolons) or escape/encode any special characters before
adding "LIKE '...'" — update the code that builds sql (the schema and
name_filter variables and the block that appends to sql) to perform these
checks/escapes and reject or error on invalid inputs instead of concatenating
raw values.
In `@lib/RAG_Tool_Handler.cpp`:
- Around line 401-447: In execute_parameterized_query, you're preparing and
binding parameters to sqlite3_stmt* stmt but then calling
vector_db->execute_statement(query) which ignores the prepared statement;
replace that call with the appropriate execution of the bound statement (e.g.,
call vector_db->execute_prepared(stmt, &error, &cols, &affected_rows) or the
library's equivalent), propagate its returned SQLite3_result* (and
error/cols/affected_rows) instead of using the raw query, and remove the
execute_statement(query) usage; update error handling to use the
execute_prepared call and ensure any allocated error strings are handled
consistently.
In `@lib/Static_Harvester.cpp`:
- Around line 1088-1138: The code reads profile fields from the wrong SQLite
columns: update the uses of row->fields in the object loop (the object_id
assignment and the profile population where rows_est, size_bytes, engine,
has_primary_key, has_foreign_keys, has_time_column are set) to match the SELECT
order (object_id=fields[0], schema_name=1, object_name=2, object_type=3,
engine=4, table_rows_est=5, data_length=6, index_length=7, has_primary_key=8,
has_foreign_keys=9, has_time_column=10); specifically set object_id from
fields[0], profile["engine"]=fields[4], profile["rows_est"]=fields[5],
profile["size_bytes"]=data_length(6)+index_length(7), and the three boolean
flags from fields[8], fields[9], fields[10].
- Around line 1041-1051: The UPDATE currently builds SQL by concatenating
view_def, schema_name and view_name which risks SQL injection and breaking on
embedded quotes; change this to use a prepared statement: call the database
prepare_v2() with an UPDATE containing placeholders for definition_sql, run_id,
schema_name and object_name (e.g. "UPDATE objects SET definition_sql = ? WHERE
run_id = ? AND schema_name = ? AND object_name = ? AND object_type = 'view'"),
bind view_def, current_run_id, schema_name and view_name using
proxy_sqlite3_bind_text()/appropriate bind functions, then run
execute_prepared() and finalize the statement instead of using
catalog->get_db()->execute_statement(). Ensure error/affected handling is
preserved.
In `@RAG_POC/rag_ingest.cpp`:
- Around line 98-108: The file is missing the <cmath> header required for
std::sqrt used in the vector normalization function (e.g., normalize_vector /
any function that computes the Euclidean norm and calls std::sqrt); add `#include`
<cmath> to the top header block so the call to std::sqrt compiles correctly.
- Around line 685-698: The code assigns raw sqlite3_column_text() pointers
directly into std::string fields of the struct s (e.g., s.name, s.backend_type,
s.host, s.user, s.pass, s.db, s.table_name, s.pk_column, s.where_sql), which can
be NULL; replace those direct casts with the existing helper
str_or_empty(sqlite3_column_text(st, idx)) for each of these fields so they get
a safe empty string instead of constructing std::string from a NULL pointer
(leave numeric fields like source_id and port as-is and continue using
sqlite3_column_int for them).
- Around line 521-529: sqlite_doc_exists currently treats any sqlite3_step
return other than SQLITE_ROW as "not found" and thus hides DB errors; update
sqlite_doc_exists (function name) to check the sqlite3_step return value and
call fatal() on error codes (e.g., SQLITE_ERROR, SQLITE_BUSY) consistent with
other sqlite helpers in the file: after sqlite3_step(ss.doc_exists) inspect rc,
return true for SQLITE_ROW, return false for SQLITE_DONE (not found), and call
fatal() with a descriptive message (including rc or sqlite3_errmsg on the db)
for any other rc to surface DB failures instead of silently treating them as
missing.
In `@RAG_POC/schema.sql`:
- Around line 22-25: The schema stores plaintext credentials in the backend_pass
column which is insecure; change the schema and application to store encrypted
credentials (or secret references) instead: replace backend_pass with a
ciphertext column (e.g., backend_pass_cipher) plus associated metadata columns
for IV/salt/algorithm or a backend_pass_ref to hold a secrets-manager key,
update any code that reads/writes backend_pass to perform encryption/decryption
using a master key derived from a secret (or call the vault client) and add
migration of existing plaintext rows to encrypted values; if immediate
encryption integration isn't possible, at minimum document the risk clearly and
ensure the DB file permission hardening is enforced and referenced in the
README.
♻️ Duplicate comments (6)
lib/MySQL_Catalog.cpp (2)
278-303: SQL injection vulnerability via string concatenation.The
searchmethod constructs SQL queries by directly concatenating user-provided parameters (schema,kind,tags,query) into the query string. This is the same vulnerability pattern already flagged by a previous reviewer.
369-390: SQL injection vulnerability in list method.The
listmethod has the same SQL injection vulnerability assearch, concatenatingschemaandkinddirectly into SQL strings.lib/Anomaly_Detector.cpp.bak (1)
1-953: Remove backup file from repository.This
.bakfile appears to be an accidentally committed backup ofAnomaly_Detector.cppbefore the proxy wrapper migration. It should be removed from version control to avoid confusion and repository clutter.lib/Static_Harvester.cpp (1)
401-407: Schema filters are concatenated into SQL (SQL injection risk).
filter/only_schemaare interpolated directly into INFORMATION_SCHEMA queries; a crafted schema name can inject SQL (multi-statements are enabled). Please parameterize or strictly validate/escape identifiers before concatenation.Also applies to: 488-490, 535-537, 583-585, 636-638, 702-704
lib/RAG_Tool_Handler.cpp (1)
1516-1595: Filters are applied twice with raw concatenation.
Afterbuild_sql_filters, the hybrid paths append the same filters again via string concatenation, which bypasses any validation and keeps SQL injection surface even if build_sql_filters is hardened. Remove the duplicate blocks and rely on a single, parameterized filter builder.Also applies to: 1640-1717, 1894-1972
lib/Query_Tool_Handler.cpp (1)
478-511: Readonly validation via blacklist is bypassable.
Keyword blacklists are easy to evade (comments, encodings, alternate syntax). Consider a real SQL parser/AST check to enforce read-only semantics.
🟡 Minor comments (16)
RAG_POC/embeddings-design.md-98-99 (1)
98-99: Use headings instead of bold-only section labels (MD036).markdownlint flags emphasis-only lines as headings. Convert these to proper subheadings.
📝 Suggested fix
-**Semantics** +#### Semantics ... -**Recommendation** +#### Recommendation ... -**Recommendation** +#### RecommendationAlso applies to: 139-140, 207-208
RAG_POC/mcp-tools.md-5-13 (1)
5-13: Replace bold-only labels with headings (MD036).These lines are treated as headings by readers but fail markdownlint.
📝 Suggested fix
-**Design goals** +## Design goals ... -**Notes** +#### Notes ... -**Hard limit recommendation** +#### Hard limit recommendation ... -**Security note** +#### Security noteAlso applies to: 64-69, 303-305, 386-389
scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md-155-163 (1)
155-163: Hyphenate “Meta-analysis” in the Round 5 heading.Improves grammar and aligns with the tooling hint.
📝 Suggested fix
-5. **Round 5: Meta Analysis** (META agent only) +5. **Round 5: Meta-analysis** (META agent only)scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/README.md-594-607 (1)
594-607: Add a language tag to the directory-structure code fence (MD040).markdownlint requires a language identifier for fenced blocks.
📝 Suggested fix
-``` +```text ClaudeCode_Headless/ ├── README.md # This file ... └── DATABASE_QUESTION_CAPABILITIES.md # Feature documentation</details> </blockquote></details> <details> <summary>doc/rag-documentation.md-105-116 (1)</summary><blockquote> `105-116`: **Verify database path consistency.** The documentation references `/var/lib/proxysql/ai_features.db`, but the PR objectives and `Admin_Bootstrap.cpp` (line 727) indicate the MCP catalog is stored at `datadir/mcp_catalog.db`. Please verify which path is correct and update the documentation accordingly. </blockquote></details> <details> <summary>doc/MCP/Database_Discovery_Agent.md-806-808 (1)</summary><blockquote> `806-808`: **Incomplete reference: Missing document name.** Line 807 has a broken reference with an empty link target. The sentence reads "See for the actual ProxySQL MCP discovery implementation" but is missing the document filename. <details> <summary>📝 Suggested fix</summary> ```diff **Status:** Conceptual design - Not implemented -**Actual Implementation:** See for the actual ProxySQL MCP discovery implementation. +**Actual Implementation:** See `Two_Phase_Discovery_Implementation.md` for the actual ProxySQL MCP discovery implementation.RAG_POC/architecture-runtime-retrieval.md-51-55 (1)
51-55: Convert bold labels into headings to satisfy MD036.
markdownlint flags these bolded phrases as headings; use heading syntax instead.✏️ Example heading fix (apply similarly elsewhere)
-**Canonical flow** +#### Canonical flow -**Typical output** +#### Typical outputAlso applies to: 74-77, 90-95, 103-107, 117-121, 127-134, 137-145, 279-282
doc/MCP/VARIABLES.md-7-7 (1)
7-7: Mention HTTP mode and /mcp/rag in the overview.
MCP can run without TLS (mcp_use_ssl=false) and now exposes/mcp/rag; the overview should reflect both.✏️ Suggested wording
-The MCP module provides JSON-RPC 2.0 over HTTPS for LLM integration with ProxySQL. It includes endpoints for configuration, observation, querying, administration, caching, and AI features, each with dedicated tool handlers for database exploration and LLM integration. +The MCP module provides JSON-RPC 2.0 over HTTP/HTTPS (controlled by mcp_use_ssl) for LLM integration with ProxySQL. It includes endpoints for configuration, observation, querying, administration, caching, AI features, and RAG, each with dedicated tool handlers for database exploration and LLM integration.lib/ProxySQL_Admin_Stats.cpp-2308-2309 (1)
2308-2309: Duplicate comment line.Lines 2308-2309 contain a duplicate comment:
//rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0); //rc=(*proxy_sqlite3_prepare_v2)(mydb3, query1, -1, &statement1, 0);The second line appears to be a copy-paste error.
lib/ProxySQL_Admin_Stats.cpp-2375-2376 (1)
2375-2376: Duplicate comment line.Same duplicate comment pattern as noted above.
doc/MCP/Vector_Embeddings_Implementation_Plan.md-186-198 (1)
186-198: Inconsistency between planned file structure and architecture.Lines 186-188 list
MySQL_Embeddings.handMySQL_Embeddings.cppas "Files to Create", but lines 191-194 describe modifyingDiscovery_Schema.h/.cppto add embeddings methods, and the architecture section (lines 24-26, 35-36) indicates embeddings will be integrated intoDiscovery_Schemausingmcp_catalog.db.Consider clarifying whether embeddings will be a standalone
MySQL_Embeddingsclass or integrated intoDiscovery_Schema, as the current document suggests both approaches.lib/ProxySQL_Admin.cpp-1354-1363 (1)
1354-1363: Avoid double refresh when requestingstats_mcp_query_tools_counters_reset.
Line 1354 matches the reset table name as a substring, so both flags are set and both refreshes run. Use a reset-first check (orelse if) to avoid redundant work.Proposed fix
- if (strstr(query_no_space,"stats_mcp_query_tools_counters")) - { stats_mcp_query_tools_counters=true; refresh=true; } - if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset")) - { stats_mcp_query_tools_counters_reset=true; refresh=true; } + if (strstr(query_no_space,"stats_mcp_query_tools_counters_reset")) + { stats_mcp_query_tools_counters_reset=true; refresh=true; } + else if (strstr(query_no_space,"stats_mcp_query_tools_counters")) + { stats_mcp_query_tools_counters=true; refresh=true; }lib/RAG_Tool_Handler.cpp-34-43 (1)
34-43: DuplicateGloGATHforward declaration.
There are two identical extern declarations; keep one to avoid redundancy.🧹 Proposed cleanup
-// Forward declaration for GloGATH -extern GenAI_Threads_Handler *GloGATH; - -// JSON library +// JSON library `#include` "../deps/json/json.hpp" using json = nlohmann::json; `#define` PROXYJSON -// Forward declaration for GloGATH -extern GenAI_Threads_Handler *GloGATH; +// Forward declaration for GloGATH +extern GenAI_Threads_Handler *GloGATH;lib/RAG_Tool_Handler.cpp-1490-1497 (1)
1490-1497: w_fts/w_vec lose fractional values.
Lines 1490–1491 and 1507–1508 parse weights viaget_json_int, truncating decimals. Use double parsing so fractional weights work. Addget_json_doublehelper alongside the existing JSON helpers and update both the fuse params and fallback code paths.💡 Suggested changes
Lines 1490–1491:
- w_fts = get_json_int(fuse_params, "w_fts", 1.0); - w_vec = get_json_int(fuse_params, "w_vec", 1.0); + w_fts = get_json_double(fuse_params, "w_fts", 1.0); + w_vec = get_json_double(fuse_params, "w_vec", 1.0);Lines 1507–1508 (fallback):
- w_fts = get_json_int(arguments, "w_fts", 1.0); - w_vec = get_json_int(arguments, "w_vec", 1.0); + w_fts = get_json_double(arguments, "w_fts", 1.0); + w_vec = get_json_double(arguments, "w_vec", 1.0);Add helper (include/RAG_Tool_Handler.h and implementation):
static double get_json_double(const json& j, const std::string& key, double default_val = 0.0);double RAG_Tool_Handler::get_json_double(const json& j, const std::string& key, double default_val) { if (j.contains(key) && !j[key].is_null()) { if (j[key].is_number()) return j[key].get<double>(); if (j[key].is_string()) { try { return std::stod(j[key].get<std::string>()); } catch (...) {} } } return default_val; }doc/Two_Phase_Discovery_Implementation.md-331-337 (1)
331-337: Avoid/tmpreferences in published docs.Paths under
/tmpare ephemeral and will be broken for most readers. Consider moving these artifacts into the repo (or removing the references) to keep the document portable.RAG_COMPLETION_SUMMARY.md-84-95 (1)
84-95: Align SET examples with ProxySQL variable naming.Admin
SETexpectsgenai-…prefixed variables (not dotted names), so the current snippet likely fails validation. Please update the example to match the actual variable names exposed bySHOW GENAI VARIABLES.✏️ Possible doc fix
-SET genai.enabled = true; +SET genai-enabled = true; -SET genai.rag_enabled = true; +SET genai-rag_enabled = true; -- adjust to the exact variable name if different
🧹 Nitpick comments (7)
lib/sqlite3db.cpp (1)
1177-1183: Minor: Inconsistent indentation.These lines lack the leading tab that other assignments in this block have. This is a minor formatting inconsistency.
🔧 Suggested fix for indentation
- proxy_sqlite3_column_int64 = sqlite3_column_int64; - proxy_sqlite3_column_double = sqlite3_column_double; - proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid; - proxy_sqlite3_errstr = sqlite3_errstr; - proxy_sqlite3_db_handle = sqlite3_db_handle; - proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension; - proxy_sqlite3_auto_extension = sqlite3_auto_extension; + proxy_sqlite3_column_int64 = sqlite3_column_int64; + proxy_sqlite3_column_double = sqlite3_column_double; + proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid; + proxy_sqlite3_errstr = sqlite3_errstr; + proxy_sqlite3_db_handle = sqlite3_db_handle; + proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension; + proxy_sqlite3_auto_extension = sqlite3_auto_extension;doc/rag-examples.md (1)
89-94: Minor inconsistency:rag.admin.statscall format differs from other examples.The
rag.admin.statstool call omits theargumentsfield entirely, while other tool calls include it (even when arguments are optional). For consistency and to match MCP specification patterns, consider adding an emptyargumentsobject:📝 Suggested fix
curl -k -X POST \ -H "Content-Type: application/json" \ - -d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"rag.admin.stats"},"id":"1"}' \ + -d '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"rag.admin.stats","arguments":{}},"id":"1"}' \ https://127.0.0.1:6071/mcp/ragdoc/MCP/Database_Discovery_Agent.md (1)
5-7: Consider using a heading instead of bold emphasis for the note.The static analysis tool flags line 5 for using bold emphasis (
**Note:**) where a heading might be more appropriate per markdown best practices (MD036). This is a minor stylistic consideration.📝 Optional fix
This document describes a conceptual architecture for an AI-powered database discovery agent that could autonomously explore, understand, and analyze any database schema regardless of complexity or domain. The agent would use a mixture-of-experts approach where specialized LLM agents collaborate to build comprehensive understanding of database structures, data patterns, and business semantics. -**Note:** This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach based on the two-phase discovery architecture described in `Two_Phase_Discovery_Implementation.md`. +> **Note:** This is a conceptual design document. The actual ProxySQL MCP implementation uses a different approach based on the two-phase discovery architecture described in `Two_Phase_Discovery_Implementation.md`.doc/MCP/Architecture.md (1)
69-71: Consider using a heading instead of emphasis.The "Where:" label uses emphasis (
**Where:**) which static analysis flags as potentially better suited as a proper heading (MD036). This is a minor formatting suggestion.📝 Suggested fix
-Where: -- `TH` = Tool Handler +### Legend +- `TH` = Tool HandlerRAG_POC/schema.sql (1)
134-141: Hardcoded vector dimension may limit flexibility.The
embedding float[1536]dimension is hardcoded for OpenAI's text-embedding-3-small model. While the comment acknowledges this, different embedding models use different dimensions (e.g., 768 for some models, 384 for others).Consider documenting a migration path or making this configurable per-source in
rag_sources.embedding_jsonfor future flexibility.RAG_IMPLEMENTATION_SUMMARY.md (1)
1-130: Consider consolidating duplicate documentation.This file (
RAG_IMPLEMENTATION_SUMMARY.md) appears to be nearly identical toRAG_IMPLEMENTATION_COMPLETE.mdreviewed earlier. Having two files with the same content may cause maintenance issues and confusion.Consider either:
- Removing one of the duplicate files
- Clearly differentiating their purposes (e.g., one as a changelog entry, one as living documentation)
lib/MCP_Endpoint.cpp (1)
472-482: Consider redacting sensitive tool arguments in error logs.
arguments.dump()is logged at error level on tool failures. Some tools may carry secrets or PII; consider masking known sensitive fields (e.g.,password,token,api_key) before logging.
| #### `/mcp/ai` - AI Endpoint | ||
|
|
||
| **Purpose**: AI and LLM features | ||
|
|
||
| **Tools**: | ||
| - `llm.query` - Query LLM with database context | ||
| - `llm.analyze` - Analyze data with LLM | ||
| - `llm.generate` - Generate content with LLM | ||
| - `anomaly.detect` - Detect anomalies in data | ||
| - `anomaly.list` - List detected anomalies | ||
| - `recommendation.get` - Get AI recommendations | ||
|
|
||
| **Use Cases**: | ||
| - LLM-powered data analysis | ||
| - Anomaly detection | ||
| - AI-driven recommendations | ||
|
|
||
| **Authentication**: `mcp-ai_endpoint_auth` (Bearer token) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if ai endpoint authentication is handled in MCP_Endpoint.cpp
rg -n "ai.*endpoint.*auth|endpoint.*ai" --type=cpp lib/Repository: ProxySQL/proxysql-vec
Length of output: 542
🏁 Script executed:
# Read the Architecture.md file around lines 331-345 to see the authenticate_request example
sed -n '320,350p' doc/MCP/Architecture.md | cat -nRepository: ProxySQL/proxysql-vec
Length of output: 1511
🏁 Script executed:
# Search for authenticate_request function implementation
rg -n "authenticate_request" --type=cpp -A 20 lib/ | head -100Repository: ProxySQL/proxysql-vec
Length of output: 2697
🏁 Script executed:
# See the full authenticate_request function including the rest of the logic
sed -n '24,70p' lib/MCP_Endpoint.cpp | cat -nRepository: ProxySQL/proxysql-vec
Length of output: 2172
Add missing authentication handling for /mcp/ai endpoint.
The /mcp/ai endpoint is registered in the server and documented with mcp-ai_endpoint_auth token variable, but the authenticate_request function in MCP_Endpoint.cpp (lines 24-70) does not handle the "ai" endpoint case. This causes ALL /mcp/ai requests to be rejected with an "unknown endpoint" authentication error (line 21-22), regardless of token configuration. The endpoint must be added to the if/else chain to properly use the mcp-ai_endpoint_auth variable.
🤖 Prompt for AI Agents
In `@doc/MCP/Architecture.md` around lines 253 - 270, The authenticate_request
logic in MCP_Endpoint.cpp (function authenticate_request) fails to recognize the
"ai" endpoint, causing all /mcp/ai requests to be treated as unknown; update the
if/else chain inside authenticate_request to include the "ai" endpoint branch
and validate using the existing mcp-ai_endpoint_auth token variable (same
pattern as other endpoints: read mcp-ai_endpoint_auth and compare the bearer
token), ensuring /mcp/ai is authenticated instead of falling through to the
unknown-endpoint error.
| ```sql | ||
| -- Enable GenAI module | ||
| SET genai.enabled = true; | ||
|
|
||
| -- Enable RAG features | ||
| SET genai.rag_enabled = true; | ||
|
|
||
| -- Configure RAG parameters (optional) | ||
| SET genai.rag_k_max = 50; | ||
| SET genai.rag_candidates_max = 500; | ||
| SET genai.rag_timeout_ms = 2000; | ||
| ``` |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for genai variable definitions to verify naming convention
rg -n "genai" --type cpp -C2 | head -60Repository: ProxySQL/proxysql-vec
Length of output: 4197
🏁 Script executed:
#!/bin/bash
# Search for genai variable initialization to find actual variable names
rg -n "genai-" --type cpp | head -40Repository: ProxySQL/proxysql-vec
Length of output: 3978
🏁 Script executed:
#!/bin/bash
# Read the documentation file to verify the exact configuration shown
sed -n '17,28p' doc/rag-documentation.mdRepository: ProxySQL/proxysql-vec
Length of output: 315
🏁 Script executed:
#!/bin/bash
# Search for GenAI variable initialization in the admin code
rg -n "init_genai_variables\|genai.*enabled\|rag_enabled" --type cpp -A5 -B2 | head -100Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Search for variable definitions in admin files
find . -name "*admin*" -type f -name "*.cpp" -o -name "*.h" | head -20Repository: ProxySQL/proxysql-vec
Length of output: 670
🏁 Script executed:
#!/bin/bash
# Search for GenAI variable definitions in AI_Features_Manager
rg -n "enabled\|rag_enabled\|rag_k_max\|rag_candidates_max\|rag_timeout_ms" include/AI_Features_Manager.h | head -50Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Let's search for where GenAI variables are actually stored
rg -n "genai" include/AI_Features_Manager.h -i | head -30Repository: ProxySQL/proxysql-vec
Length of output: 497
🏁 Script executed:
#!/bin/bash
# Search for the actual GenAI variable declarations and their names
rg -n "genai-enabled\|genai-rag\|variables\|variable_name" src/ --type cpp | head -60Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Search for variable registration or SET handlers
rg -n "SET.*genai\|genai.*variable\|register.*variable" src/ --type cpp -i | head -50Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Search for the actual GenAI module implementation to find variable names
find src -name "*.cpp" -exec grep -l "GenAI\|AI_Features" {} \; | head -10Repository: ProxySQL/proxysql-vec
Length of output: 77
🏁 Script executed:
#!/bin/bash
# Look at main.cpp around the GenAI initialization
sed -n '1650,1670p' src/main.cppRepository: ProxySQL/proxysql-vec
Length of output: 585
🏁 Script executed:
#!/bin/bash
# Search for GenAI_Threads_Handler or the implementation of init_genai_variables
rg -n "GenAI_Threads_Handler\|init_genai_variables" --type cpp -A10 | head -80Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Search for GenAI_Threads_Handler class definition
find . -name "*.h" -exec grep -l "GenAI_Threads_Handler\|AI_Features_Manager" {} \;Repository: ProxySQL/proxysql-vec
Length of output: 223
🏁 Script executed:
#!/bin/bash
# Check GenAI_Thread.h for the handler class
rg -n "class\|genai-\|enabled\|rag_" include/GenAI_Thread.h -C3 | head -100Repository: ProxySQL/proxysql-vec
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Read the GenAI_Thread.h header file directly
cat include/GenAI_Thread.h | head -150Repository: ProxySQL/proxysql-vec
Length of output: 5456
🏁 Script executed:
#!/bin/bash
# Search for GenAI_Threads_Handler class definition
rg -n "class GenAI_Threads_Handler" -A50 --type cpp -B2Repository: ProxySQL/proxysql-vec
Length of output: 3412
Fix configuration syntax to use ProxySQL variable naming conventions.
The configuration examples use incorrect dot notation (genai.enabled, genai.rag_enabled). ProxySQL uses hyphen-underscore notation for variables. Based on the implementation and tests, the correct syntax should be:
-- Enable GenAI module
SET genai-enabled=true;
-- Enable RAG features
SET genai-rag_enabled=true;
-- Configure RAG parameters (optional)
SET genai-rag_k_max=50;
SET genai-rag_candidates_max=500;
SET genai-rag_timeout_ms=2000;Variable naming pattern from implementation
The GenAI module follows the standard ProxySQL pattern: genai-<variable_name> where underscores separate words (e.g., genai-embedding_uri, genai-rerank_timeout_ms).
🤖 Prompt for AI Agents
In `@doc/rag-documentation.md` around lines 17 - 28, Update the SQL examples to
use ProxySQL's genai-<variable> naming (replace the dot notation like
genai.enabled and genai.rag_enabled) so each variable uses a hyphen after
"genai" and equals-sign assignment; specifically change genai.enabled ->
genai-enabled, genai.rag_enabled -> genai-rag_enabled, genai.rag_k_max ->
genai-rag_k_max, genai.rag_candidates_max -> genai-rag_candidates_max, and
genai.rag_timeout_ms -> genai-rag_timeout_ms in the examples so they match the
implementation pattern (e.g., genai-embedding_uri, genai-rerank_timeout_ms).
| #include <cstdint> | ||
| #include <cstdio> | ||
| #include <cstdlib> | ||
| #include <cstring> | ||
|
|
||
| #include <iostream> | ||
| #include <string> | ||
| #include <vector> | ||
| #include <unordered_map> | ||
| #include <optional> | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "rag_ingest.cpp" | head -20Repository: ProxySQL/proxysql-vec
Length of output: 89
🏁 Script executed:
wc -l ./RAG_POC/rag_ingest.cppRepository: ProxySQL/proxysql-vec
Length of output: 94
🏁 Script executed:
head -120 ./RAG_POC/rag_ingest.cpp | tail -30Repository: ProxySQL/proxysql-vec
Length of output: 726
🏁 Script executed:
rg "std::sqrt|sqrt\(" ./RAG_POC/rag_ingest.cppRepository: ProxySQL/proxysql-vec
Length of output: 90
🏁 Script executed:
rg "std::sqrt" ./RAG_POC/rag_ingest.cpp -B 5 -A 5Repository: ProxySQL/proxysql-vec
Length of output: 350
🏁 Script executed:
rg "^#include" ./RAG_POC/rag_ingest.cppRepository: ProxySQL/proxysql-vec
Length of output: 306
Fix compile error: include <cmath> for std::sqrt.
std::sqrt is used in the vector normalization function but <cmath> isn't included.
🛠️ Suggested fix
`#include` <cstdint>
`#include` <cstdio>
`#include` <cstdlib>
`#include` <cstring>
+#include <cmath>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #include <cstdint> | |
| #include <cstdio> | |
| #include <cstdlib> | |
| #include <cstring> | |
| #include <iostream> | |
| #include <string> | |
| #include <vector> | |
| #include <unordered_map> | |
| #include <optional> | |
| `#include` <cstdint> | |
| `#include` <cstdio> | |
| `#include` <cstdlib> | |
| `#include` <cstring> | |
| `#include` <cmath> | |
| `#include` <iostream> | |
| `#include` <string> | |
| `#include` <vector> | |
| `#include` <unordered_map> | |
| `#include` <optional> | |
🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 98 - 108, The file is missing the
<cmath> header required for std::sqrt used in the vector normalization function
(e.g., normalize_vector / any function that computes the Euclidean norm and
calls std::sqrt); add `#include` <cmath> to the top header block so the call to
std::sqrt compiles correctly.
| static bool sqlite_doc_exists(SqliteStmts& ss, const std::string& doc_id) { | ||
| sqlite3_reset(ss.doc_exists); | ||
| sqlite3_clear_bindings(ss.doc_exists); | ||
|
|
||
| sqlite_bind_text(ss.doc_exists, 1, doc_id); | ||
|
|
||
| int rc = sqlite3_step(ss.doc_exists); | ||
| return (rc == SQLITE_ROW); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Locate and read the file
find . -name "rag_ingest.cpp" -type fRepository: ProxySQL/proxysql-vec
Length of output: 186
🏁 Script executed:
# Search for the sqlite_doc_exists function and surrounding context
rg -A 10 -B 5 "sqlite_doc_exists" --no-headingRepository: ProxySQL/proxysql-vec
Length of output: 1726
🏁 Script executed:
# Check for fatal() function definition and usage
rg -n "fatal\(" --type cpp | head -20Repository: ProxySQL/proxysql-vec
Length of output: 1698
🏁 Script executed:
# Look at other sqlite functions in rag_ingest.cpp for error handling patterns
rg -n "sqlite3_step" --type cpp -A 3 -B 2Repository: ProxySQL/proxysql-vec
Length of output: 16536
🌐 Web query:
sqlite3_step return codes documentation
💡 Result:
Short summary — sqlite3_step() return codes
- SQLITE_ROW (100): another result row is available. [1][2]
- SQLITE_DONE (101): statement has finished executing successfully (no more rows). [1][2]
- SQLITE_BUSY: database is locked; retry or use a busy handler. [1][2]
- SQLITE_ERROR: runtime error (e.g., constraint); call sqlite3_errmsg()/sqlite3_reset() for details. [1][2]
- SQLITE_MISUSE: API used incorrectly. [1][2]
Notes:
- With the legacy sqlite3_prepare() interface these are the main codes; with the "v2" prepare APIs (sqlite3_prepare_v2/_v3) sqlite3_step() may also return more specific extended result codes directly. [1][2]
Sources: SQLite C API reference and result-code documentation. [1] [2]
Add error handling to sqlite_doc_exists to catch database failures.
The function currently treats all sqlite3_step() returns except SQLITE_ROW as "not found," including error codes. This silently masks database failures (e.g., SQLITE_ERROR, SQLITE_BUSY), risking incorrect skip/insert decisions. Other sqlite functions in the same file explicitly handle errors via fatal(). Align this function with that pattern.
Suggested fix
int rc = sqlite3_step(ss.doc_exists);
-return (rc == SQLITE_ROW);
+if (rc == SQLITE_ROW) return true;
+if (rc == SQLITE_DONE) return false;
+fatal(std::string("SQLite doc_exists failed: ")
+ + sqlite3_errmsg(sqlite3_db_handle(ss.doc_exists)));🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 521 - 529, sqlite_doc_exists currently
treats any sqlite3_step return other than SQLITE_ROW as "not found" and thus
hides DB errors; update sqlite_doc_exists (function name) to check the
sqlite3_step return value and call fatal() on error codes (e.g., SQLITE_ERROR,
SQLITE_BUSY) consistent with other sqlite helpers in the file: after
sqlite3_step(ss.doc_exists) inspect rc, return true for SQLITE_ROW, return false
for SQLITE_DONE (not found), and call fatal() with a descriptive message
(including rc or sqlite3_errmsg on the db) for any other rc to surface DB
failures instead of silently treating them as missing.
| s.source_id = sqlite3_column_int(st, 0); | ||
| s.name = (const char*)sqlite3_column_text(st, 1); | ||
| s.enabled = sqlite3_column_int(st, 2); | ||
|
|
||
| s.backend_type = (const char*)sqlite3_column_text(st, 3); | ||
| s.host = (const char*)sqlite3_column_text(st, 4); | ||
| s.port = sqlite3_column_int(st, 5); | ||
| s.user = (const char*)sqlite3_column_text(st, 6); | ||
| s.pass = (const char*)sqlite3_column_text(st, 7); | ||
| s.db = (const char*)sqlite3_column_text(st, 8); | ||
|
|
||
| s.table_name = (const char*)sqlite3_column_text(st, 9); | ||
| s.pk_column = (const char*)sqlite3_column_text(st, 10); | ||
| s.where_sql = (const char*)sqlite3_column_text(st, 11); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get basic info
ls -la RAG_POC/rag_ingest.cpp 2>/dev/null || find . -name "rag_ingest.cpp" -type fRepository: ProxySQL/proxysql-vec
Length of output: 137
🏁 Script executed:
# Get the exact lines mentioned and surrounding context
if [ -f "RAG_POC/rag_ingest.cpp" ]; then
head -700 RAG_POC/rag_ingest.cpp | tail -50 | cat -n
fiRepository: ProxySQL/proxysql-vec
Length of output: 2230
🏁 Script executed:
# Search for the struct definition containing these fields
rg "struct.*\{" RAG_POC/ -A 20 | head -100Repository: ProxySQL/proxysql-vec
Length of output: 4941
🏁 Script executed:
# Look for str_or_empty function
rg "str_or_empty" RAG_POC/Repository: ProxySQL/proxysql-vec
Length of output: 194
🏁 Script executed:
# Search for the query that populates these columns
rg "sqlite3_column_text" RAG_POC/rag_ingest.cpp -B 10 -A 10Repository: ProxySQL/proxysql-vec
Length of output: 1698
Guard against NULL from sqlite3_column_text() before assigning to std::string fields.
These assignments construct std::string from potentially NULL pointers, which is undefined behavior:
s.name,s.backend_type,s.host,s.user,s.pass,s.db,s.table_name,s.pk_column,s.where_sql
The SQL query lacks COALESCE on most of these columns, and passing NULL to std::string causes crashes. The str_or_empty() helper already exists in this file; use it consistently:
- s.name = (const char*)sqlite3_column_text(st, 1);
+ s.name = str_or_empty((const char*)sqlite3_column_text(st, 1));
- s.backend_type = (const char*)sqlite3_column_text(st, 3);
- s.host = (const char*)sqlite3_column_text(st, 4);
+ s.backend_type = str_or_empty((const char*)sqlite3_column_text(st, 3));
+ s.host = str_or_empty((const char*)sqlite3_column_text(st, 4));
- s.user = (const char*)sqlite3_column_text(st, 6);
- s.pass = (const char*)sqlite3_column_text(st, 7);
- s.db = (const char*)sqlite3_column_text(st, 8);
+ s.user = str_or_empty((const char*)sqlite3_column_text(st, 6));
+ s.pass = str_or_empty((const char*)sqlite3_column_text(st, 7));
+ s.db = str_or_empty((const char*)sqlite3_column_text(st, 8));
- s.table_name = (const char*)sqlite3_column_text(st, 9);
- s.pk_column = (const char*)sqlite3_column_text(st, 10);
- s.where_sql = (const char*)sqlite3_column_text(st, 11);
+ s.table_name = str_or_empty((const char*)sqlite3_column_text(st, 9));
+ s.pk_column = str_or_empty((const char*)sqlite3_column_text(st, 10));
+ s.where_sql = str_or_empty((const char*)sqlite3_column_text(st, 11));🤖 Prompt for AI Agents
In `@RAG_POC/rag_ingest.cpp` around lines 685 - 698, The code assigns raw
sqlite3_column_text() pointers directly into std::string fields of the struct s
(e.g., s.name, s.backend_type, s.host, s.user, s.pass, s.db, s.table_name,
s.pk_column, s.where_sql), which can be NULL; replace those direct casts with
the existing helper str_or_empty(sqlite3_column_text(st, idx)) for each of these
fields so they get a safe empty string instead of constructing std::string from
a NULL pointer (leave numeric fields like source_id and port as-is and continue
using sqlite3_column_int for them).
| backend_port INTEGER NOT NULL, | ||
| backend_user TEXT NOT NULL, | ||
| backend_pass TEXT NOT NULL, | ||
| backend_db TEXT NOT NULL, -- database/schema name |
There was a problem hiding this comment.
Security concern: Plain text password storage.
The backend_pass column stores database credentials in plain text. This is a significant security risk as anyone with read access to the SQLite database can extract backend database credentials.
Consider:
- Encrypting credentials at rest using a key derived from a master secret
- Using a separate secrets manager or vault integration
- At minimum, documenting this risk prominently and ensuring the database file has restricted permissions
🤖 Prompt for AI Agents
In `@RAG_POC/schema.sql` around lines 22 - 25, The schema stores plaintext
credentials in the backend_pass column which is insecure; change the schema and
application to store encrypted credentials (or secret references) instead:
replace backend_pass with a ciphertext column (e.g., backend_pass_cipher) plus
associated metadata columns for IV/salt/algorithm or a backend_pass_ref to hold
a secrets-manager key, update any code that reads/writes backend_pass to perform
encryption/decryption using a master key derived from a secret (or call the
vault client) and add migration of existing plaintext rows to encrypted values;
if immediate encryption integration isn't possible, at minimum document the risk
clearly and ensure the DB file permission hardening is enforced and referenced
in the README.
- lib/MySQL_Catalog.cpp: Convert search/list/remove to use SQLite prepared statements instead of string concatenation for user parameters - lib/RAG_Tool_Handler.cpp: Add escape_fts_query() function to properly escape single quotes in FTS5 MATCH clauses; update all FTS and vector MATCH queries to use escaped values - lib/Static_Harvester.cpp: Add is_valid_schema_name() validation function to ensure schema names only contain safe characters (alphanumeric, underscore, dollar sign) before using in INFORMATION_SCHEMA queries - lib/Query_Tool_Handler.cpp: Add clarifying comments to validate_readonly_query explaining the blacklist (quick exit) + whitelist (allowed query types) approach - Remove backup file lib/Anomaly_Detector.cpp.bak Addresses gemini-code-assist review comments from PR #26.
|
Thank you for the excellent review! The SQL injection vulnerabilities have been addressed:
|
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@lib/Query_Tool_Handler.cpp`:
- Around line 1117-1133: The code in Query_Tool_Handler.cpp extracts
schema/table from object_key and concatenates them into SQL sent via
catalog->get_db()->execute_statement, risking SQL injection; change this to use
a parameterized/prepared statement on the DB API (bind run_id, schema and table
as parameters) instead of string concatenation, i.e. prepare a statement like
"SELECT object_id FROM objects WHERE run_id = ? AND schema_name = ? AND
object_name = ? LIMIT 1", bind run_id, schema and table, execute and read the
SQLite3_result rows, and ensure error/resultset cleanup; update the call sites
around object_key, schema, table, run_id and replace the execute_statement usage
with the prepared-and-bound variant to safely fetch object_id.
In `@lib/RAG_Tool_Handler.cpp`:
- Around line 2065-2152: The vec_sql concatenation in the fts_then_vec rerank
filter block directly injects filter values into SQL (see the manual
constructions for source_names, doc_ids, post_type_ids, tags_any, tags_all,
created_after/created_before) causing SQL injection risk; replace the manual
concatenation by invoking the existing helper build_sql_filters(filters,
vec_sql) for this rerank path or replicate the exact validation/escaping logic
used by build_sql_filters (use get_json_string_array/get_json_int_array and
properly quote/escape values and JSON-extract comparisons) so all filter inputs
are validated before being appended to vec_sql.
- Around line 1926-2000: Duplicate manual filter assembly is present after the
call to build_sql_filters (see the call site around build_sql_filters) — remove
the repeated blocks that append filters for source_names, doc_ids,
post_type_ids, tags_any, tags_all, created_after and created_before (the large
if-blocks that build source_list/doc_list/post_type_conditions/tag_conditions
and append to fts_sql). Instead rely on the result of build_sql_filters (which
already handles escaping/formatting); delete those duplicated blocks near where
fts_sql is further modified in the fts_then_vec path so filters are not applied
twice and SQL is not built insecurely.
- Around line 1671-1745: Duplicate manual filter building after the call to
build_sql_filters(filters, vec_sql) causes filters to be applied twice; remove
the repeated blocks that manually append to vec_sql for "source_names",
"doc_ids", "post_type_ids", "tags_any", "tags_all", "created_after", and
"created_before" in RAG_Tool_Handler.cpp so only build_sql_filters(filters,
vec_sql) is used; locate the vec_sql modifications immediately following the
build_sql_filters(filters, vec_sql) call and delete those manual concatenation
loops/conditions (leave build_sql_filters unchanged).
- Around line 1548-1622: The manual filter blocks duplicate and unsafe SQL
construction already handled by build_sql_filters(filters, fts_sql): remove the
repeated manual sections that build source_names/doc_ids (using
get_json_string_array), post_type_ids (get_json_int_array), tags_any, tags_all,
created_after and created_before that append to fts_sql, and rely solely on
build_sql_filters to add those WHERE clauses (to avoid duplicate conditions and
SQL injection); ensure no other code later expects those local variables
(source_list/doc_list/post_type_conditions/tag_conditions) before deleting.
In `@lib/Static_Harvester.cpp`:
- Around line 1069-1089: harvest_view_definitions builds an SQL fragment using
only_schema without validating it; before appending " AND TABLE_SCHEMA = '...'"
validate only_schema with is_valid_schema_name(only_schema) (same approach as
harvest_objects), and if it fails call proxy_error with a clear message and
return -1 (or otherwise skip adding the clause). Ensure this check occurs just
before the sql << " AND TABLE_SCHEMA ..." concatenation so no unvalidated input
is embedded in the sql string used by execute_query.
- Around line 753-775: The routines SQL concatenates only_schema without
validation; before appending " AND ROUTINE_SCHEMA = '...'" call and check
is_valid_schema_name(only_schema) (same pattern used in fetch_* methods) and
only append the filter if it returns true; if invalid, skip adding the filter
and optionally log or return an error; ensure execute_query(...) and
catalog->insert_object(...) behavior remains unchanged when only_schema is
absent or invalid.
♻️ Duplicate comments (11)
lib/MySQL_Catalog.cpp (1)
118-135: Missing UPDATE trigger causes stale FTS index on upsert operations.The
catalog_ftsFTS5 table has INSERT and DELETE triggers, but no UPDATE trigger. Theupsert()function (lines 182-209) usesON CONFLICT ... DO UPDATE, which fires an UPDATE event—not an INSERT. This means FTS entries become stale when documents are modified via upsert.Add an AFTER UPDATE trigger:
Suggested trigger
db->execute("DROP TRIGGER IF EXISTS catalog_ai"); db->execute("DROP TRIGGER IF EXISTS catalog_ad"); +db->execute("DROP TRIGGER IF EXISTS catalog_au"); // ... existing triggers ... +db->execute("CREATE TRIGGER IF NOT EXISTS catalog_au AFTER UPDATE ON catalog BEGIN" + " INSERT INTO catalog_fts(catalog_fts, rowid, schema, kind, key, document, tags)" + " VALUES ('delete', old.id, old.schema, old.kind, old.key, old.document, old.tags);" + " INSERT INTO catalog_fts(rowid, schema, kind, key, document, tags)" + " VALUES (new.id, new.schema, new.kind, new.key, new.document, new.tags);" + "END;");lib/Static_Harvester.cpp (2)
1097-1111: SQL injection vulnerability in harvest_view_definitions UPDATE statement.The
view_def,schema_name, andview_nameare concatenated directly into the UPDATE statement without escaping. Sinceview_defcomes from MySQL'sinformation_schema.VIEWS.VIEW_DEFINITION, it can contain arbitrary SQL including single quotes, which will break the query or enable injection.Use a prepared statement instead:
Suggested fix
// Use prepared statement to prevent SQL injection sqlite3_stmt* stmt = NULL; const char* update_sql = "UPDATE objects SET definition_sql = ?1 " "WHERE run_id = ?2 AND schema_name = ?3 AND object_name = ?4 AND object_type = 'view'"; int rc = catalog->get_db()->prepare_v2(update_sql, &stmt); if (rc == SQLITE_OK) { (*proxy_sqlite3_bind_text)(stmt, 1, view_def.c_str(), -1, SQLITE_TRANSIENT); (*proxy_sqlite3_bind_int)(stmt, 2, current_run_id); (*proxy_sqlite3_bind_text)(stmt, 3, schema_name.c_str(), -1, SQLITE_TRANSIENT); (*proxy_sqlite3_bind_text)(stmt, 4, view_name.c_str(), -1, SQLITE_TRANSIENT); SAFE_SQLITE3_STEP2(stmt); (*proxy_sqlite3_finalize)(stmt); // Check affected rows... }
1185-1194: Column index mismatch in quick profiles causes incorrect data.The SELECT query (lines 1145-1148) returns 11 columns indexed 0-10, but the code reads from incorrect indices:
Field SELECT Index Code Uses engine 4 fields[3] ❌ table_rows_est 5 fields[4] ❌ data_length 6 fields[5] ❌ index_length 7 fields[6] ❌ has_primary_key 8 fields[7] ❌ has_foreign_keys 9 fields[8] ❌ has_time_column 10 fields[9] ❌ Corrected indices
- profile["rows_est"] = row->fields[4] ? atol(row->fields[4]) : 0; - profile["size_bytes"] = (atol(row->fields[5] ? row->fields[5] : "0") + - atol(row->fields[6] ? row->fields[6] : "0")); - profile["engine"] = std::string(row->fields[3] ? row->fields[3] : ""); - profile["has_primary_key"] = atoi(row->fields[7]) != 0; - profile["has_foreign_keys"] = atoi(row->fields[8]) != 0; - profile["has_time_column"] = atoi(row->fields[9]) != 0; + profile["engine"] = std::string(row->fields[4] ? row->fields[4] : ""); + profile["rows_est"] = row->fields[5] ? atol(row->fields[5]) : 0; + profile["size_bytes"] = (atol(row->fields[6] ? row->fields[6] : "0") + + atol(row->fields[7] ? row->fields[7] : "0")); + profile["has_primary_key"] = atoi(row->fields[8]) != 0; + profile["has_foreign_keys"] = atoi(row->fields[9]) != 0; + profile["has_time_column"] = atoi(row->fields[10]) != 0;lib/Query_Tool_Handler.cpp (4)
362-405: Use-after-return: connection returned before all operations complete.In
execute_query(), the connection is returned to the pool at line 376, butmysql_affected_rows()is called at line 382 using the same connection handle. Another thread could reuse the connection between these calls, causing a race condition.Additionally, when
mysql_query()fails (line 370), the code callsreturn_connection()but doesn't return early, allowing execution to continue tomysql_store_result().Suggested fix
if (mysql_query(static_cast<MYSQL*>(mysql), query.c_str())) { proxy_error("Query_Tool_Handler: Query failed: %s\n", mysql_error(static_cast<MYSQL*>(mysql))); return_connection(mysql); + return "{\"error\": \"Query execution failed\"}"; } MYSQL_RES* res = mysql_store_result(static_cast<MYSQL*>(mysql)); - return_connection(mysql); if (!res) { // No result set (e.g., INSERT/UPDATE) json j; j["success"] = true; j["affected_rows"] = static_cast<long>(mysql_affected_rows(static_cast<MYSQL*>(mysql))); + return_connection(mysql); return j.dump(); } // ... fetch rows ... mysql_free_result(res); + return_connection(mysql);
446-455: Use-after-return in execute_query_with_schema.Same issue as
execute_query(): the connection is returned at line 447, butmysql_affected_rows()is called at line 453 after the connection may have been reused by another thread.Move
return_connection(mysql)after all MySQL operations are complete.
923-939: SQL injection via unvalidated schema and name_filter in list_tables.The
schemaandname_filterparameters are concatenated directly into the SQL query without validation. WithCLIENT_MULTI_STATEMENTSenabled (line 299), an attacker could inject arbitrary SQL statements.Suggested validation
else if (tool_name == "list_tables") { std::string schema = json_string(arguments, "schema"); std::string page_token = json_string(arguments, "page_token"); int page_size = json_int(arguments, "page_size", 50); std::string name_filter = json_string(arguments, "name_filter"); + + // Validate inputs to prevent SQL injection + static const std::regex kSafeIdent("^[A-Za-z0-9_]+$"); + static const std::regex kSafeLike("^[A-Za-z0-9_%]+$"); + if (!schema.empty() && !std::regex_match(schema, kSafeIdent)) { + result = create_error_response("Invalid schema name"); + } else if (!name_filter.empty() && !std::regex_match(name_filter, kSafeLike)) { + result = create_error_response("Invalid name_filter"); + } else { // TODO: Implement using MySQL connection std::ostringstream sql;
830-838: Usedeleteinstead offree()for SQLite3_result.
SQLite3_resultis a C++ class allocated withnew. Usingfree()(lines 834, 837) instead ofdeletecauses undefined behavior because the destructor is not called.Suggested fix
if (resultset && resultset->rows_count > 0) { SQLite3_row* row = resultset->rows[0]; std::string schema = std::string(row->fields[0] ? row->fields[0] : ""); - free(resultset); + delete resultset; return schema; } - if (resultset) free(resultset); + delete resultset;lib/RAG_Tool_Handler.cpp (4)
468-482: Critical: Prepared statement is never executed—raw query runs instead.The function prepares a statement and binds parameters (lines 434-467), but then calls
execute_statement(query, ...)with the original raw query string at line 473. The boundstmtis never stepped/executed, so parameter bindings are completely ignored. This defeats SQL injection protection.The prepared statement should be executed instead of the raw query. You need to either:
- Use
vector_db->execute_prepared(stmt, ...)if available, or- Implement stepping through the prepared statement with
sqlite3_step()and collecting results.🐛 Conceptual fix
- // Execute the statement and get results - char* error = NULL; - int cols = 0; - int affected_rows = 0; - SQLite3_result* result = vector_db->execute_statement(query, &error, &cols, &affected_rows); + // Execute the prepared statement and get results + char* error = NULL; + int cols = 0; + int affected_rows = 0; + // Use execute_prepared to run the bound statement instead of the raw query + SQLite3_result* result = vector_db->execute_prepared(stmt, &error, &cols, &affected_rows);Note: Verify that
execute_preparedexists on theSQLite3DBinterface, or implement thesqlite3_steploop manually.
1517-1531: Fractional weights truncated to integers, breaking hybrid scoring.
w_ftsandw_vecare declared asdouble(lines 1514-1515) but parsed usingget_json_int()which returnsint. Fractional weights like0.5will be truncated to0, causing one search type to be completely ignored in fusion.Add a
get_json_doublehelper or parse directly:- w_fts = get_json_int(fuse_params, "w_fts", 1.0); - w_vec = get_json_int(fuse_params, "w_vec", 1.0); + if (fuse_params.contains("w_fts") && fuse_params["w_fts"].is_number()) { + w_fts = fuse_params["w_fts"].get<double>(); + } + if (fuse_params.contains("w_vec") && fuse_params["w_vec"].is_number()) { + w_vec = fuse_params["w_vec"].get<double>(); + }Apply the same fix to the fallback block (lines 1529-1530).
2343-2357: SQL injection inrag.get_docs—doc_idsnot validated.Unlike
rag.get_chunkswhich validateschunk_idsfor dangerous characters (lines 2234-2240),rag.get_docsconcatenatesdoc_idsdirectly into SQL without any validation or escaping.Add validation matching the
chunk_idspattern:🐛 Proposed fix
+ // Validate doc_ids to prevent SQL injection + for (const std::string& doc_id : doc_ids) { + if (doc_id.find('\'') != std::string::npos || + doc_id.find('\\') != std::string::npos || + doc_id.find(';') != std::string::npos) { + return create_error_response("Invalid characters in doc_ids"); + } + } + // Build doc ID list for SQL - std::string doc_list = "'"; + std::string doc_list = ""; for (size_t i = 0; i < doc_ids.size(); ++i) { - if (i > 0) doc_list += "','"; - doc_list += doc_ids[i]; + if (i > 0) doc_list += ","; + // Escape single quotes by doubling + std::string escaped_id = doc_ids[i]; + size_t pos = 0; + while ((pos = escaped_id.find("'", pos)) != std::string::npos) { + escaped_id.replace(pos, 1, "''"); + pos += 2; + } + doc_list += "'" + escaped_id + "'"; } - doc_list += "'";
2426-2432: SQL injection inrag.fetch_from_source—samedoc_idsissue.Same vulnerability as
rag.get_docs:doc_idsare concatenated into SQL without validation. Apply the same fix (validation + escaping) here.
🧹 Nitpick comments (4)
lib/MySQL_Catalog.cpp (1)
507-537: Consider using nlohmann::json for merge document construction.The
merge()function constructs JSON via string concatenation (lines 524-533). While not a security issue sinceinstructionscomes from internal sources, this could produce invalid JSON ifinstructionscontains unescaped quotes or special characters.Suggested improvement
- // Create merged document - std::string merged_doc = "{"; - merged_doc += "\"source_keys\":["; - - for (size_t i = 0; i < keys.size(); i++) { - if (i > 0) merged_doc += " , "; - merged_doc += "\"" + keys[i] + "\""; - } - merged_doc += "] , "; - merged_doc += "\"instructions\":" + std::string(instructions.empty() ? "\"\"" : "\"" + instructions + "\""); - merged_doc += "}"; + // Create merged document using nlohmann::json for proper escaping + nlohmann::json merged_doc; + merged_doc["source_keys"] = keys; + merged_doc["instructions"] = instructions; + std::string merged_doc_str = merged_doc.dump();lib/RAG_Tool_Handler.cpp (3)
34-44: Remove duplicate forward declaration ofGloGATH.The extern declaration for
GloGATHappears twice (lines 35 and 43). Remove one of them.// Forward declaration for GloGATH extern GenAI_Threads_Handler *GloGATH; // JSON library `#include` "../deps/json/json.hpp" using json = nlohmann::json; `#define` PROXYJSON -// Forward declaration for GloGATH -extern GenAI_Threads_Handler *GloGATH;
566-620: Minor: Tag escaping is redundant after validation.Lines 574-577 and 602-605 reject tags containing single quotes, but lines 583-587 and 611-615 then escape single quotes by doubling them. Since validation already rejects inputs with quotes, the escaping loop will never find any quotes to escape.
This is harmless (defense-in-depth), but the escaping code is dead code given the prior validation.
2048-2053: Consider escapingcandidate_idseven though they're database-sourced.While
candidate_idscome from the database rather than direct user input, escaping them provides defense-in-depth against stored injection attacks.// Build candidate ID list for SQL - std::string candidate_list = "'"; + std::string candidate_list = ""; for (size_t i = 0; i < candidate_ids.size(); ++i) { - if (i > 0) candidate_list += "','"; - candidate_list += candidate_ids[i]; + if (i > 0) candidate_list += ","; + std::string escaped = candidate_ids[i]; + size_t pos = 0; + while ((pos = escaped.find("'", pos)) != std::string::npos) { + escaped.replace(pos, 1, "''"); + pos += 2; + } + candidate_list += "'" + escaped + "'"; } - candidate_list += "'";
| size_t dot_pos = object_key.find('.'); | ||
| if (dot_pos != std::string::npos) { | ||
| std::string schema = object_key.substr(0, dot_pos); | ||
| std::string table = object_key.substr(dot_pos + 1); | ||
| // Quick query to get object_id | ||
| char* error = NULL; | ||
| int cols = 0, affected = 0; | ||
| SQLite3_result* resultset = NULL; | ||
| std::ostringstream sql; | ||
| sql << "SELECT object_id FROM objects WHERE run_id = " << run_id | ||
| << " AND schema_name = '" << schema << "'" | ||
| << " AND object_name = '" << table << "' LIMIT 1;"; | ||
| catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset); | ||
| if (resultset && !resultset->rows.empty()) { | ||
| object_id = atoi(resultset->rows[0]->fields[0]); | ||
| } | ||
| delete resultset; |
There was a problem hiding this comment.
SQL injection via unvalidated object_key in catalog.get_relationships.
The schema and table values extracted from object_key (lines 1119-1120) are concatenated directly into the SQL query (lines 1126-1128) without validation or escaping.
Use prepared statement
if (dot_pos != std::string::npos) {
std::string schema = object_key.substr(0, dot_pos);
std::string table = object_key.substr(dot_pos + 1);
- // Quick query to get object_id
- char* error = NULL;
- int cols = 0, affected = 0;
- SQLite3_result* resultset = NULL;
- std::ostringstream sql;
- sql << "SELECT object_id FROM objects WHERE run_id = " << run_id
- << " AND schema_name = '" << schema << "'"
- << " AND object_name = '" << table << "' LIMIT 1;";
- catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset);
+ // Use prepared statement to prevent SQL injection
+ sqlite3_stmt* stmt = NULL;
+ const char* sql = "SELECT object_id FROM objects WHERE run_id = ?1 "
+ "AND schema_name = ?2 AND object_name = ?3 LIMIT 1";
+ if (catalog->get_db()->prepare_v2(sql, &stmt) == SQLITE_OK) {
+ (*proxy_sqlite3_bind_int)(stmt, 1, run_id);
+ (*proxy_sqlite3_bind_text)(stmt, 2, schema.c_str(), -1, SQLITE_TRANSIENT);
+ (*proxy_sqlite3_bind_text)(stmt, 3, table.c_str(), -1, SQLITE_TRANSIENT);
+ if ((*proxy_sqlite3_step)(stmt) == SQLITE_ROW) {
+ object_id = (*proxy_sqlite3_column_int)(stmt, 0);
+ }
+ (*proxy_sqlite3_finalize)(stmt);
+ }🤖 Prompt for AI Agents
In `@lib/Query_Tool_Handler.cpp` around lines 1117 - 1133, The code in
Query_Tool_Handler.cpp extracts schema/table from object_key and concatenates
them into SQL sent via catalog->get_db()->execute_statement, risking SQL
injection; change this to use a parameterized/prepared statement on the DB API
(bind run_id, schema and table as parameters) instead of string concatenation,
i.e. prepare a statement like "SELECT object_id FROM objects WHERE run_id = ?
AND schema_name = ? AND object_name = ? LIMIT 1", bind run_id, schema and table,
execute and read the SQLite3_result rows, and ensure error/resultset cleanup;
update the call sites around object_key, schema, table, run_id and replace the
execute_statement usage with the prepared-and-bound variant to safely fetch
object_id.
| if (filters.contains("source_names") && filters["source_names"].is_array()) { | ||
| std::vector<std::string> source_names = get_json_string_array(filters, "source_names"); | ||
| if (!source_names.empty()) { | ||
| std::string source_list = ""; | ||
| for (size_t i = 0; i < source_names.size(); ++i) { | ||
| if (i > 0) source_list += ","; | ||
| source_list += "'" + source_names[i] + "'"; | ||
| } | ||
| fts_sql += " AND c.source_id IN (SELECT source_id FROM rag_sources WHERE name IN (" + source_list + "))"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) { | ||
| std::vector<std::string> doc_ids = get_json_string_array(filters, "doc_ids"); | ||
| if (!doc_ids.empty()) { | ||
| std::string doc_list = ""; | ||
| for (size_t i = 0; i < doc_ids.size(); ++i) { | ||
| if (i > 0) doc_list += ","; | ||
| doc_list += "'" + doc_ids[i] + "'"; | ||
| } | ||
| fts_sql += " AND c.doc_id IN (" + doc_list + ")"; | ||
| } | ||
| } | ||
|
|
||
| // Metadata filters | ||
| if (filters.contains("post_type_ids") && filters["post_type_ids"].is_array()) { | ||
| std::vector<int> post_type_ids = get_json_int_array(filters, "post_type_ids"); | ||
| if (!post_type_ids.empty()) { | ||
| // Filter by PostTypeId in metadata_json | ||
| std::string post_type_conditions = ""; | ||
| for (size_t i = 0; i < post_type_ids.size(); ++i) { | ||
| if (i > 0) post_type_conditions += " OR "; | ||
| post_type_conditions += "json_extract(d.metadata_json, '$.PostTypeId') = " + std::to_string(post_type_ids[i]); | ||
| } | ||
| fts_sql += " AND (" + post_type_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_any") && filters["tags_any"].is_array()) { | ||
| std::vector<std::string> tags_any = get_json_string_array(filters, "tags_any"); | ||
| if (!tags_any.empty()) { | ||
| // Filter by any of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_any.size(); ++i) { | ||
| if (i > 0) tag_conditions += " OR "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_any[i] + ">%'"; | ||
| } | ||
| fts_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_all") && filters["tags_all"].is_array()) { | ||
| std::vector<std::string> tags_all = get_json_string_array(filters, "tags_all"); | ||
| if (!tags_all.empty()) { | ||
| // Filter by all of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_all.size(); ++i) { | ||
| if (i > 0) tag_conditions += " AND "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_all[i] + ">%'"; | ||
| } | ||
| fts_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("created_after") && filters["created_after"].is_string()) { | ||
| std::string created_after = filters["created_after"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| fts_sql += " AND json_extract(d.metadata_json, '$.CreationDate') >= '" + created_after + "'"; | ||
| } | ||
|
|
||
| if (filters.contains("created_before") && filters["created_before"].is_string()) { | ||
| std::string created_before = filters["created_before"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| fts_sql += " AND json_extract(d.metadata_json, '$.CreationDate') <= '" + created_before + "'"; | ||
| } |
There was a problem hiding this comment.
Filters applied twice—duplicates and SQL injection risk.
build_sql_filters(filters, fts_sql) is called at line 1544, which already adds conditions for source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after, and created_before. Then lines 1548-1622 manually add the same filters again.
Issues:
- Duplicate SQL conditions — each filter appears twice in the WHERE clause
- SQL injection — the manual blocks (lines 1554, 1566, 1593, 1606, 1615, 1621) concatenate values without the validation/escaping that
build_sql_filtersperforms
Remove the duplicate manual filter application since build_sql_filters already handles these:
🐛 Proposed fix
// Apply filters using consolidated filter building function
if (!build_sql_filters(filters, fts_sql)) {
return create_error_response("Invalid filter parameters");
}
- if (filters.contains("source_names") && filters["source_names"].is_array()) {
- std::vector<std::string> source_names = get_json_string_array(filters, "source_names");
- if (!source_names.empty()) {
- std::string source_list = "";
- for (size_t i = 0; i < source_names.size(); ++i) {
- if (i > 0) source_list += ",";
- source_list += "'" + source_names[i] + "'";
- }
- fts_sql += " AND c.source_id IN (SELECT source_id FROM rag_sources WHERE name IN (" + source_list + "))";
- }
- }
-
- if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) {
- // ... remove entire block
- }
-
- // Remove all duplicate filter blocks through line 1622🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 1548 - 1622, The manual filter blocks
duplicate and unsafe SQL construction already handled by
build_sql_filters(filters, fts_sql): remove the repeated manual sections that
build source_names/doc_ids (using get_json_string_array), post_type_ids
(get_json_int_array), tags_any, tags_all, created_after and created_before that
append to fts_sql, and rely solely on build_sql_filters to add those WHERE
clauses (to avoid duplicate conditions and SQL injection); ensure no other code
later expects those local variables
(source_list/doc_list/post_type_conditions/tag_conditions) before deleting.
| if (filters.contains("source_names") && filters["source_names"].is_array()) { | ||
| std::vector<std::string> source_names = get_json_string_array(filters, "source_names"); | ||
| if (!source_names.empty()) { | ||
| std::string source_list = ""; | ||
| for (size_t i = 0; i < source_names.size(); ++i) { | ||
| if (i > 0) source_list += ","; | ||
| source_list += "'" + source_names[i] + "'"; | ||
| } | ||
| vec_sql += " AND c.source_id IN (SELECT source_id FROM rag_sources WHERE name IN (" + source_list + "))"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) { | ||
| std::vector<std::string> doc_ids = get_json_string_array(filters, "doc_ids"); | ||
| if (!doc_ids.empty()) { | ||
| std::string doc_list = ""; | ||
| for (size_t i = 0; i < doc_ids.size(); ++i) { | ||
| if (i > 0) doc_list += ","; | ||
| doc_list += "'" + doc_ids[i] + "'"; | ||
| } | ||
| vec_sql += " AND c.doc_id IN (" + doc_list + ")"; | ||
| } | ||
| } | ||
|
|
||
| // Metadata filters | ||
| if (filters.contains("post_type_ids") && filters["post_type_ids"].is_array()) { | ||
| std::vector<int> post_type_ids = get_json_int_array(filters, "post_type_ids"); | ||
| if (!post_type_ids.empty()) { | ||
| // Filter by PostTypeId in metadata_json | ||
| std::string post_type_conditions = ""; | ||
| for (size_t i = 0; i < post_type_ids.size(); ++i) { | ||
| if (i > 0) post_type_conditions += " OR "; | ||
| post_type_conditions += "json_extract(d.metadata_json, '$.PostTypeId') = " + std::to_string(post_type_ids[i]); | ||
| } | ||
| vec_sql += " AND (" + post_type_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_any") && filters["tags_any"].is_array()) { | ||
| std::vector<std::string> tags_any = get_json_string_array(filters, "tags_any"); | ||
| if (!tags_any.empty()) { | ||
| // Filter by any of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_any.size(); ++i) { | ||
| if (i > 0) tag_conditions += " OR "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_any[i] + ">%'"; | ||
| } | ||
| vec_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_all") && filters["tags_all"].is_array()) { | ||
| std::vector<std::string> tags_all = get_json_string_array(filters, "tags_all"); | ||
| if (!tags_all.empty()) { | ||
| // Filter by all of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_all.size(); ++i) { | ||
| if (i > 0) tag_conditions += " AND "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_all[i] + ">%'"; | ||
| } | ||
| vec_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("created_after") && filters["created_after"].is_string()) { | ||
| std::string created_after = filters["created_after"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| vec_sql += " AND json_extract(d.metadata_json, '$.CreationDate') >= '" + created_after + "'"; | ||
| } | ||
|
|
||
| if (filters.contains("created_before") && filters["created_before"].is_string()) { | ||
| std::string created_before = filters["created_before"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| vec_sql += " AND json_extract(d.metadata_json, '$.CreationDate') <= '" + created_before + "'"; | ||
| } |
There was a problem hiding this comment.
Same duplicate filter issue in vector search SQL block.
The pattern from lines 1548-1622 repeats here: build_sql_filters(filters, vec_sql) is called at line 1667, but then lines 1671-1745 manually add the same filters again without validation.
Remove these duplicate manual filter blocks as well.
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 1671 - 1745, Duplicate manual filter
building after the call to build_sql_filters(filters, vec_sql) causes filters to
be applied twice; remove the repeated blocks that manually append to vec_sql for
"source_names", "doc_ids", "post_type_ids", "tags_any", "tags_all",
"created_after", and "created_before" in RAG_Tool_Handler.cpp so only
build_sql_filters(filters, vec_sql) is used; locate the vec_sql modifications
immediately following the build_sql_filters(filters, vec_sql) call and delete
those manual concatenation loops/conditions (leave build_sql_filters unchanged).
| if (filters.contains("source_names") && filters["source_names"].is_array()) { | ||
| std::vector<std::string> source_names = get_json_string_array(filters, "source_names"); | ||
| if (!source_names.empty()) { | ||
| std::string source_list = ""; | ||
| for (size_t i = 0; i < source_names.size(); ++i) { | ||
| if (i > 0) source_list += ","; | ||
| source_list += "'" + source_names[i] + "'"; | ||
| } | ||
| fts_sql += " AND c.source_id IN (SELECT source_id FROM rag_sources WHERE name IN (" + source_list + "))"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) { | ||
| std::vector<std::string> doc_ids = get_json_string_array(filters, "doc_ids"); | ||
| if (!doc_ids.empty()) { | ||
| std::string doc_list = ""; | ||
| for (size_t i = 0; i < doc_ids.size(); ++i) { | ||
| if (i > 0) doc_list += ","; | ||
| doc_list += "'" + doc_ids[i] + "'"; | ||
| } | ||
| fts_sql += " AND c.doc_id IN (" + doc_list + ")"; | ||
| } | ||
| } | ||
|
|
||
| // Metadata filters | ||
| if (filters.contains("post_type_ids") && filters["post_type_ids"].is_array()) { | ||
| std::vector<int> post_type_ids = get_json_int_array(filters, "post_type_ids"); | ||
| if (!post_type_ids.empty()) { | ||
| // Filter by PostTypeId in metadata_json | ||
| std::string post_type_conditions = ""; | ||
| for (size_t i = 0; i < post_type_ids.size(); ++i) { | ||
| if (i > 0) post_type_conditions += " OR "; | ||
| post_type_conditions += "json_extract(d.metadata_json, '$.PostTypeId') = " + std::to_string(post_type_ids[i]); | ||
| } | ||
| fts_sql += " AND (" + post_type_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_any") && filters["tags_any"].is_array()) { | ||
| std::vector<std::string> tags_any = get_json_string_array(filters, "tags_any"); | ||
| if (!tags_any.empty()) { | ||
| // Filter by any of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_any.size(); ++i) { | ||
| if (i > 0) tag_conditions += " OR "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_any[i] + ">%'"; | ||
| } | ||
| fts_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("tags_all") && filters["tags_all"].is_array()) { | ||
| std::vector<std::string> tags_all = get_json_string_array(filters, "tags_all"); | ||
| if (!tags_all.empty()) { | ||
| // Filter by all of the tags in metadata_json Tags field | ||
| std::string tag_conditions = ""; | ||
| for (size_t i = 0; i < tags_all.size(); ++i) { | ||
| if (i > 0) tag_conditions += " AND "; | ||
| tag_conditions += "json_extract(d.metadata_json, '$.Tags') LIKE '%<" + tags_all[i] + ">%'"; | ||
| } | ||
| fts_sql += " AND (" + tag_conditions + ")"; | ||
| } | ||
| } | ||
|
|
||
| if (filters.contains("created_after") && filters["created_after"].is_string()) { | ||
| std::string created_after = filters["created_after"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| fts_sql += " AND json_extract(d.metadata_json, '$.CreationDate') >= '" + created_after + "'"; | ||
| } | ||
|
|
||
| if (filters.contains("created_before") && filters["created_before"].is_string()) { | ||
| std::string created_before = filters["created_before"].get<std::string>(); | ||
| // Filter by CreationDate in metadata_json | ||
| fts_sql += " AND json_extract(d.metadata_json, '$.CreationDate') <= '" + created_before + "'"; | ||
| } |
There was a problem hiding this comment.
Duplicate filters in fts_then_vec mode as well.
Same pattern: build_sql_filters is called at line 1922, then lines 1926-2000 manually re-add the same filters without proper escaping. Remove these duplicate blocks.
🤖 Prompt for AI Agents
In `@lib/RAG_Tool_Handler.cpp` around lines 1926 - 2000, Duplicate manual filter
assembly is present after the call to build_sql_filters (see the call site
around build_sql_filters) — remove the repeated blocks that append filters for
source_names, doc_ids, post_type_ids, tags_any, tags_all, created_after and
created_before (the large if-blocks that build
source_list/doc_list/post_type_conditions/tag_conditions and append to fts_sql).
Instead rely on the result of build_sql_filters (which already handles
escaping/formatting); delete those duplicated blocks near where fts_sql is
further modified in the fts_then_vec path so filters are not applied twice and
SQL is not built insecurely.
| std::ostringstream sql; | ||
| sql << "SELECT ROUTINE_SCHEMA, ROUTINE_NAME, ROUTINE_TYPE, ROUTINE_COMMENT " | ||
| << "FROM information_schema.ROUTINES " | ||
| << "WHERE ROUTINE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')"; | ||
|
|
||
| if (!only_schema.empty()) { | ||
| sql << " AND ROUTINE_SCHEMA = '" << only_schema << "'"; | ||
| } | ||
|
|
||
| sql << " ORDER BY ROUTINE_SCHEMA, ROUTINE_NAME;"; | ||
|
|
||
| std::vector<std::vector<std::string>> results; | ||
| if (execute_query(sql.str(), results) == 0) { | ||
| for (const auto& row : results) { | ||
| int object_id = catalog->insert_object( | ||
| current_run_id, row[0], row[1], "routine", | ||
| "", 0, 0, 0, "", "", row[3], "" | ||
| ); | ||
| if (object_id >= 0) { | ||
| count++; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Missing schema validation in harvest_objects routines query.
The routines query (lines 753-762) concatenates only_schema directly into SQL without validation, unlike the fetch_* methods which call is_valid_schema_name() first.
Suggested fix
// Fetch and insert routines (stored procedures/functions)
+ // Validate schema name to prevent SQL injection
+ if (!only_schema.empty() && !is_valid_schema_name(only_schema)) {
+ proxy_error("Static_Harvester: Invalid schema name '%s'\n", only_schema.c_str());
+ return count;
+ }
+
std::ostringstream sql;
sql << "SELECT ROUTINE_SCHEMA, ROUTINE_NAME, ROUTINE_TYPE, ROUTINE_COMMENT "🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 753 - 775, The routines SQL
concatenates only_schema without validation; before appending " AND
ROUTINE_SCHEMA = '...'" call and check is_valid_schema_name(only_schema) (same
pattern used in fetch_* methods) and only append the filter if it returns true;
if invalid, skip adding the filter and optionally log or return an error; ensure
execute_query(...) and catalog->insert_object(...) behavior remains unchanged
when only_schema is absent or invalid.
| int Static_Harvester::harvest_view_definitions(const std::string& only_schema) { | ||
| if (current_run_id < 0) { | ||
| proxy_error("Static_Harvester: No active run\n"); | ||
| return -1; | ||
| } | ||
|
|
||
| std::ostringstream sql; | ||
| sql << "SELECT TABLE_SCHEMA, TABLE_NAME, VIEW_DEFINITION " | ||
| << "FROM information_schema.VIEWS " | ||
| << "WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys')"; | ||
|
|
||
| if (!only_schema.empty()) { | ||
| sql << " AND TABLE_SCHEMA = '" << only_schema << "'"; | ||
| } | ||
|
|
||
| sql << ";"; | ||
|
|
||
| std::vector<std::vector<std::string>> results; | ||
| if (execute_query(sql.str(), results) != 0) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Missing schema validation in harvest_view_definitions.
Similar to harvest_objects, this function concatenates only_schema into SQL (line 1081) without calling is_valid_schema_name() first.
Suggested fix
int Static_Harvester::harvest_view_definitions(const std::string& only_schema) {
if (current_run_id < 0) {
proxy_error("Static_Harvester: No active run\n");
return -1;
}
+ // Validate schema name to prevent SQL injection
+ if (!only_schema.empty() && !is_valid_schema_name(only_schema)) {
+ proxy_error("Static_Harvester: Invalid schema name '%s'\n", only_schema.c_str());
+ return -1;
+ }
+
std::ostringstream sql;🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 1069 - 1089, harvest_view_definitions
builds an SQL fragment using only_schema without validating it; before appending
" AND TABLE_SCHEMA = '...'" validate only_schema with
is_valid_schema_name(only_schema) (same approach as harvest_objects), and if it
fails call proxy_error with a clear message and return -1 (or otherwise skip
adding the clause). Ensure this check occurs just before the sql << " AND
TABLE_SCHEMA ..." concatenation so no unvalidated input is embedded in the sql
string used by execute_query.
Add proxy_sqlite3_enable_load_extension and proxy_sqlite3_global_stats_row_step to the #else (MAIN_PROXY_SQLITE3) branch in sqlite3db.h to maintain symmetry with the extern declarations and prevent ODR violations. Addresses coderabbitai review comment.
Add assert(proxy_sqlite3_bind_blob) to the assertion block in SQLite3DB::LoadPlugin to ensure the symbol is provided by plugins. Without this, if a plugin fails to provide the symbol, the code will crash at runtime with no safety check. proxy_sqlite3_bind_blob is actively used in Anomaly_Detector.cpp:765 to bind embeddings. Addresses coderabbitai review comment.
…pserts Add catalog_au AFTER UPDATE trigger in MySQL_Catalog that mirrors the delete+insert pattern used in catalog_ad/catalog_ai. This keeps the FTS index current when upserts occur (INSERT OR REPLACE ... ON CONFLICT ... DO UPDATE), since the UPDATE doesn't trigger INSERT/DELETE triggers. The trigger first deletes the old entry from catalog_fts then inserts the new entry, ensuring the full-text search index stays synchronized with the catalog table. Addresses coderabbitai review comment.
Change free(resultset) to delete resultset in Query_Tool_Handler (extract_schema_name function). SQLite3_result is a C++ class allocated with new, so it must be deallocated with delete, not free(). Using free() causes mixed allocator UB. Addresses coderabbitai review comment.
The execute_parameterized_query function in RAG_Tool_Handler was creating a prepared statement and binding parameters, but then executing the raw query string instead of the prepared statement. This completely bypassed the parameter binding, making the function useless for preventing SQL injection. Changed to use vector_db->execute_prepared(stmt, ...) to execute the bound statement instead of the raw query, so the bound parameters are actually used. Addresses coderabbitai review comment.
Fix off-by-one errors in row->fields indices. The SELECT clause returns: object_id(0), schema_name(1), object_name(2), object_type(3), engine(4), table_rows_est(5), data_length(6), index_length(7), has_primary_key(8), has_foreign_keys(9), has_time_column(10) But the code was reading from fields[3..9] instead of [4..10]. Added comment documenting the correct column order. Addresses coderabbitai review comment.
Add escape_sql_string() helper function that doubles single quotes to prevent SQL injection when strings are used in SQL string concatenation. Update harvest_view_definitions to use this function for view_def, schema_name, and view_name. This prevents SQL injection in the UPDATE statement that stores view definitions in the catalog. Addresses coderabbitai review comment.
Fix two issues in Query_Tool_Handler's execute_query functions: 1. mysql_query() failure path now returns immediately after return_connection() instead of continuing to process on bad state. 2. Capture affected_rows BEFORE return_connection() to avoid race condition. Previously, mysql_affected_rows() was called after return_connection(), potentially accessing a stale connection. Apply fixes to both execute_query() and execute_query_with_schema(). Addresses coderabbitai review comments.
…rules_to_runtime Change free(resultset) to delete resultset in ProxySQL_Admin::load_mcp_query_rules_to_runtime. SQLite3_result is a C++ class allocated with new, so it must be deallocated with delete, not free(). Using free() causes undefined behavior and memory leaks. Addresses coderabbitai review comment.
Fix multi-statement execution in Admin_Handler.cpp for MCP query rules. The previous code built a single SQL string with "DELETE ...; INSERT ..." but SQLite only executed the first statement. Changed to execute statements as an explicit transaction: 1. BEGIN 2. DELETE FROM target_table 3. INSERT OR REPLACE INTO target_table SELECT * FROM source_table 4. COMMIT (or ROLLBACK on error) Applied to both: - LOAD MCP QUERY RULES FROM DISK/TO MEMORY - SAVE MCP QUERY RULES TO DISK Addresses coderabbitai review comment.
…njection Fix two SQL injection vulnerabilities identified by coderabbitai in ProxySQL_Admin_Stats.cpp by converting sprintf/snprintf interpolation to SQLite prepared statements. 1. stats___mcp_query_digest (lines 2581-2636): - Changed from sprintf with format string to prepared statement - digest_text field now properly bound as parameter (was unsafe) - All 10 columns now use positional parameters (?1-?10) 2. stats___mcp_query_tools_counters (lines 1587-1635): - Changed from snprintf with unescaped fields to prepared statement - Fixed incorrect table name logic (was appending _reset incorrectly) - All 8 columns now use positional parameters (?1-?8) These changes prevent SQL injection when resultset fields contain quotes, backslashes, or other SQL metacharacters.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@lib/MySQL_Catalog.cpp`:
- Around line 489-490: In list(), avoid dereferencing null from
proxy_sqlite3_column_text for nullable columns by checking the returned const
unsigned char* for columns 4 and 5 before converting to std::string; replace the
direct casts used for entry["tags"] and entry["links"] with a null-safe pattern
(like retrieving auto ptr = (*proxy_sqlite3_column_text)(stmt, 4/5);
entry["tags"/"links"] = ptr ? std::string((const char*)ptr) : std::string();)
mirroring the defensive check used in search().
- Around line 350-370: The loop inside search() dereferences
proxy_sqlite3_column_text results directly (for columns 0,1,2,4,5), risking
nullptr reads for nullable columns (tags, links) and any unexpected NULLs;
update the code to check each proxy_sqlite3_column_text(stmt, N) result for NULL
before casting to const char* and constructing std::string, e.g., treat NULL as
empty string or set JSON null where appropriate, and reuse a small helper or
inline conditional to convert nullable column text to either std::string or
nullptr for entry["..."] to avoid dereferencing NULL pointers.
In `@lib/Query_Tool_Handler.cpp`:
- Around line 1835-1854: The loop allocates a char* array `row` and eight
`strdup`'d strings then calls `result->add_row(row)` but never frees those
allocations; after `result->add_row(row)` free each duplicated string (e.g.
free(row[i]) for i=0..7) and then delete[] row to avoid leaks, or replace the
manual strdup/new[] pattern with a std::vector<std::string> / std::string
storage and modify/add_row to accept that if feasible; update the code paths
around `result->add_row(row)` to release the original memory immediately after
the call.
In `@lib/Static_Harvester.cpp`:
- Around line 253-258: When mysql_store_result(mysql_conn) returns NULL in the
code around Static_Harvester.cpp, add a mysql_field_count(mysql_conn) check to
distinguish non-SELECT queries from SELECT errors: if
mysql_field_count(mysql_conn) == 0 treat it as no result (keep the existing
pthread_mutex_unlock(&conn_lock) and return 0), but if
mysql_field_count(mysql_conn) != 0 treat it as an error (unlock conn_lock and
return -1, optionally logging the mysql_error(mysql_conn)); update the
mysql_store_result handling logic accordingly in the same function to avoid
silently dropping SELECT failures.
♻️ Duplicate comments (8)
lib/Query_Tool_Handler.cpp (2)
1131-1143: SQL injection vulnerability incatalog.get_relationshipsobject_id lookup.The
schemaandtablevalues extracted fromobject_keyare concatenated directly into the SQL query at lines 1136-1138. Use prepared statements instead.🐛 Suggested fix using prepared statements
- std::ostringstream sql; - sql << "SELECT object_id FROM objects WHERE run_id = " << run_id - << " AND schema_name = '" << schema << "'" - << " AND object_name = '" << table << "' LIMIT 1;"; - catalog->get_db()->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset); + sqlite3_stmt* stmt = NULL; + const char* sql = "SELECT object_id FROM objects WHERE run_id = ?1 " + "AND schema_name = ?2 AND object_name = ?3 LIMIT 1"; + if (catalog->get_db()->prepare_v2(sql, &stmt) == SQLITE_OK) { + (*proxy_sqlite3_bind_int)(stmt, 1, run_id); + (*proxy_sqlite3_bind_text)(stmt, 2, schema.c_str(), -1, SQLITE_TRANSIENT); + (*proxy_sqlite3_bind_text)(stmt, 3, table.c_str(), -1, SQLITE_TRANSIENT); + if ((*proxy_sqlite3_step)(stmt) == SQLITE_ROW) { + object_id = (*proxy_sqlite3_column_int)(stmt, 0); + } + (*proxy_sqlite3_finalize)(stmt); + }
939-946: SQL injection vulnerability inlist_tablestool.The
schemaandname_filterparameters are concatenated directly into the SQL query without validation or escaping. A maliciousschemavalue like"test; DROP TABLE users; --"could execute arbitrary SQL.🐛 Suggested fix using validation
else if (tool_name == "list_tables") { std::string schema = json_string(arguments, "schema"); std::string page_token = json_string(arguments, "page_token"); int page_size = json_int(arguments, "page_size", 50); std::string name_filter = json_string(arguments, "name_filter"); + + // Validate schema name (alphanumeric, underscore, dollar sign only) + static const std::regex safe_ident("^[A-Za-z0-9_$]*$"); + if (!schema.empty() && !std::regex_match(schema, safe_ident)) { + result = create_error_response("Invalid schema name"); + } else if (!name_filter.empty() && !std::regex_match(name_filter, std::regex("^[A-Za-z0-9_%$]*$"))) { + result = create_error_response("Invalid name_filter"); + } else { // TODO: Implement using MySQL connection std::ostringstream sql; sql << "SHOW TABLES"; if (!schema.empty()) { - sql << " FROM " << schema; + sql << " FROM `" << schema << "`"; } if (!name_filter.empty()) { sql << " LIKE '" << name_filter << "'"; } std::string query_result = execute_query(sql.str()); result = create_success_response(json::parse(query_result)); + } }lib/RAG_Tool_Handler.cpp (4)
1551-1625: Remove duplicated filter application afterbuild_sql_filters.
build_sql_filters()already appends all filter clauses; the following manual blocks reapply the same conditions and can drift in behavior. Removing the duplicates will reduce SQL bloat and keep filter semantics consistent.Also applies to: 1674-1748, 1929-2003
1520-1533: Preserve fractional hybrid weights (w_fts/w_vec).
get_json_int()truncates decimals, so values like 0.5 become 0 and fusion scores skew. Parse as double in both fuse and fallback branches.💡 Suggested fix
- w_fts = get_json_int(fuse_params, "w_fts", 1.0); - w_vec = get_json_int(fuse_params, "w_vec", 1.0); + w_fts = std::stod(get_json_string(fuse_params, "w_fts", "1.0")); + w_vec = std::stod(get_json_string(fuse_params, "w_vec", "1.0")); ... - w_fts = get_json_int(arguments, "w_fts", 1.0); - w_vec = get_json_int(arguments, "w_vec", 1.0); + w_fts = std::stod(get_json_string(arguments, "w_fts", "1.0")); + w_vec = std::stod(get_json_string(arguments, "w_vec", "1.0"));
2068-2155: Avoid raw concatenation in fts_then_vec rerank filters.
This block appends unescaped filter values directly into SQL, reintroducing SQL injection risk. Reusebuild_sql_filters()(or parameterized queries) and delete the manual concatenation.🔧 Suggested fix
- // Apply filters - if (filters.contains("source_ids") && filters["source_ids"].is_array()) { - // ... - } - if (filters.contains("source_names") && filters["source_names"].is_array()) { - // ... - } - if (filters.contains("doc_ids") && filters["doc_ids"].is_array()) { - // ... - } - if (filters.contains("post_type_ids") && filters["post_type_ids"].is_array()) { - // ... - } - if (filters.contains("tags_any") && filters["tags_any"].is_array()) { - // ... - } - if (filters.contains("tags_all") && filters["tags_all"].is_array()) { - // ... - } - if (filters.contains("created_after") && filters["created_after"].is_string()) { - // ... - } - if (filters.contains("created_before") && filters["created_before"].is_string()) { - // ... - } + // Apply filters using consolidated filter builder + if (!build_sql_filters(filters, vec_sql)) { + return create_error_response("Invalid filter parameters"); + }
2346-2353: Validate/escape doc_ids before building IN clauses.
doc_idsare concatenated into SQL inrag.get_docsandrag.fetch_from_sourcewithout validation or escaping, leaving a SQL injection vector. Reuse the same validation/escaping pattern aschunk_idsor parameterize.🔧 Suggested fix (apply to both blocks)
- // Build doc ID list for SQL - std::string doc_list = "'"; - for (size_t i = 0; i < doc_ids.size(); ++i) { - if (i > 0) doc_list += "','"; - doc_list += doc_ids[i]; - } - doc_list += "'"; + // Build doc ID list for SQL with validation/escaping + std::string doc_list = ""; + for (size_t i = 0; i < doc_ids.size(); ++i) { + const std::string& doc_id = doc_ids[i]; + if (doc_id.find('\'') != std::string::npos || + doc_id.find('\\') != std::string::npos || + doc_id.find(';') != std::string::npos) { + return create_error_response("Invalid characters in doc_ids"); + } + std::string escaped = doc_id; + size_t pos = 0; + while ((pos = escaped.find("'", pos)) != std::string::npos) { + escaped.replace(pos, 1, "''"); + pos += 2; + } + if (i > 0) doc_list += ","; + doc_list += "'" + escaped + "'"; + }Also applies to: 2429-2435
lib/Static_Harvester.cpp (2)
777-785: Validate only_schema before appending to the routines query.
only_schemais concatenated directly into SQL here without the validation used in other fetch methods. Addis_valid_schema_name()check to prevent injection.🔧 Suggested fix
- // Fetch and insert routines (stored procedures/functions) + // Fetch and insert routines (stored procedures/functions) + if (!only_schema.empty() && !is_valid_schema_name(only_schema)) { + proxy_error("Static_Harvester: Invalid schema name '%s'\n", only_schema.c_str()); + return count; + } std::ostringstream sql;
1105-1107: Validate only_schema before appending to the VIEWS query.
This branch usesonly_schemawithout validation, unlike other schema-filtered queries.🔧 Suggested fix
- if (!only_schema.empty()) { + if (!only_schema.empty()) { + if (!is_valid_schema_name(only_schema)) { + proxy_error("Static_Harvester: Invalid schema name '%s'\n", only_schema.c_str()); + return -1; + } sql << " AND TABLE_SCHEMA = '" << only_schema << "'"; }
🧹 Nitpick comments (1)
lib/sqlite3db.cpp (1)
1177-1183: Inconsistent indentation in proxy function assignments.Lines 1177-1183 are missing the leading tab that aligns with the surrounding code (e.g., line 1176 and 1184). This appears to be a merge artifact.
🔧 Suggested fix
proxy_sqlite3_column_count = sqlite3_column_count; proxy_sqlite3_column_int = sqlite3_column_int; - proxy_sqlite3_column_int64 = sqlite3_column_int64; - proxy_sqlite3_column_double = sqlite3_column_double; - proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid; - proxy_sqlite3_errstr = sqlite3_errstr; - proxy_sqlite3_db_handle = sqlite3_db_handle; - proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension; - proxy_sqlite3_auto_extension = sqlite3_auto_extension; + proxy_sqlite3_column_int64 = sqlite3_column_int64; + proxy_sqlite3_column_double = sqlite3_column_double; + proxy_sqlite3_last_insert_rowid = sqlite3_last_insert_rowid; + proxy_sqlite3_errstr = sqlite3_errstr; + proxy_sqlite3_db_handle = sqlite3_db_handle; + proxy_sqlite3_enable_load_extension = sqlite3_enable_load_extension; + proxy_sqlite3_auto_extension = sqlite3_auto_extension; proxy_sqlite3_errmsg = sqlite3_errmsg;
| while ((step_rc = (*proxy_sqlite3_step)(stmt)) == SQLITE_ROW) { | ||
| nlohmann::json entry; | ||
|
|
||
| const char* kind_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 0); | ||
| const char* key_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 1); | ||
| const char* doc_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 2); | ||
| const char* tags_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 3); | ||
| const char* links_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 4); | ||
|
|
||
| entry["kind"] = std::string(kind_val ? kind_val : ""); | ||
| entry["key"] = std::string(key_val ? key_val : ""); | ||
| entry["schema"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 0)); | ||
| entry["kind"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 1)); | ||
| entry["key"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 2)); | ||
|
|
||
| // Parse the stored JSON document - nlohmann::json handles escaping | ||
| if (doc_val) { | ||
| const char* doc_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 3); | ||
| if (doc_str) { | ||
| try { | ||
| entry["document"] = nlohmann::json::parse(doc_val); | ||
| entry["document"] = nlohmann::json::parse(doc_str); | ||
| } catch (const nlohmann::json::parse_error& e) { | ||
| // If document is not valid JSON, store as string | ||
| entry["document"] = std::string(doc_val); | ||
| entry["document"] = std::string(doc_str); | ||
| } | ||
| } else { | ||
| entry["document"] = nullptr; | ||
| } | ||
|
|
||
| entry["tags"] = std::string(tags_val ? tags_val : ""); | ||
| entry["links"] = std::string(links_val ? links_val : ""); | ||
| entry["tags"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 4)); | ||
| entry["links"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 5)); |
There was a problem hiding this comment.
Potential null pointer dereference when reading column text.
In search(), proxy_sqlite3_column_text can return NULL for NULL columns. At lines 352-370, the code casts the result directly to std::string without null checks for columns 0, 1, 2, 4, and 5. While the schema is enforced as NOT NULL, tags and links columns allow NULL values.
🐛 Suggested defensive fix
- entry["tags"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 4));
- entry["links"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 5));
+ const char* tags_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 4);
+ entry["tags"] = tags_str ? std::string(tags_str) : "";
+ const char* links_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 5);
+ entry["links"] = links_str ? std::string(links_str) : "";📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| while ((step_rc = (*proxy_sqlite3_step)(stmt)) == SQLITE_ROW) { | |
| nlohmann::json entry; | |
| const char* kind_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 0); | |
| const char* key_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 1); | |
| const char* doc_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 2); | |
| const char* tags_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 3); | |
| const char* links_val = (const char*)(*proxy_sqlite3_column_text)(stmt, 4); | |
| entry["kind"] = std::string(kind_val ? kind_val : ""); | |
| entry["key"] = std::string(key_val ? key_val : ""); | |
| entry["schema"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 0)); | |
| entry["kind"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 1)); | |
| entry["key"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 2)); | |
| // Parse the stored JSON document - nlohmann::json handles escaping | |
| if (doc_val) { | |
| const char* doc_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 3); | |
| if (doc_str) { | |
| try { | |
| entry["document"] = nlohmann::json::parse(doc_val); | |
| entry["document"] = nlohmann::json::parse(doc_str); | |
| } catch (const nlohmann::json::parse_error& e) { | |
| // If document is not valid JSON, store as string | |
| entry["document"] = std::string(doc_val); | |
| entry["document"] = std::string(doc_str); | |
| } | |
| } else { | |
| entry["document"] = nullptr; | |
| } | |
| entry["tags"] = std::string(tags_val ? tags_val : ""); | |
| entry["links"] = std::string(links_val ? links_val : ""); | |
| entry["tags"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 4)); | |
| entry["links"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 5)); | |
| while ((step_rc = (*proxy_sqlite3_step)(stmt)) == SQLITE_ROW) { | |
| nlohmann::json entry; | |
| entry["schema"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 0)); | |
| entry["kind"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 1)); | |
| entry["key"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 2)); | |
| // Parse the stored JSON document - nlohmann::json handles escaping | |
| const char* doc_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 3); | |
| if (doc_str) { | |
| try { | |
| entry["document"] = nlohmann::json::parse(doc_str); | |
| } catch (const nlohmann::json::parse_error& e) { | |
| // If document is not valid JSON, store as string | |
| entry["document"] = std::string(doc_str); | |
| } | |
| } else { | |
| entry["document"] = nullptr; | |
| } | |
| const char* tags_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 4); | |
| entry["tags"] = tags_str ? std::string(tags_str) : ""; | |
| const char* links_str = (const char*)(*proxy_sqlite3_column_text)(stmt, 5); | |
| entry["links"] = links_str ? std::string(links_str) : ""; |
🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 350 - 370, The loop inside search()
dereferences proxy_sqlite3_column_text results directly (for columns 0,1,2,4,5),
risking nullptr reads for nullable columns (tags, links) and any unexpected
NULLs; update the code to check each proxy_sqlite3_column_text(stmt, N) result
for NULL before casting to const char* and constructing std::string, e.g., treat
NULL as empty string or set JSON null where appropriate, and reuse a small
helper or inline conditional to convert nullable column text to either
std::string or nullptr for entry["..."] to avoid dereferencing NULL pointers.
| entry["tags"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 4)); | ||
| entry["links"] = std::string((const char*)(*proxy_sqlite3_column_text)(stmt, 5)); |
There was a problem hiding this comment.
Same null pointer risk in list() for nullable columns.
Apply the same defensive null check for tags and links columns as suggested for search().
🤖 Prompt for AI Agents
In `@lib/MySQL_Catalog.cpp` around lines 489 - 490, In list(), avoid dereferencing
null from proxy_sqlite3_column_text for nullable columns by checking the
returned const unsigned char* for columns 4 and 5 before converting to
std::string; replace the direct casts used for entry["tags"] and entry["links"]
with a null-safe pattern (like retrieving auto ptr =
(*proxy_sqlite3_column_text)(stmt, 4/5); entry["tags"/"links"] = ptr ?
std::string((const char*)ptr) : std::string();) mirroring the defensive check
used in search().
| char** row = new char*[8]; | ||
| row[0] = strdup(tool_name.c_str()); | ||
| row[1] = strdup(schema_name.c_str()); | ||
|
|
||
| char buf[32]; | ||
| snprintf(buf, sizeof(buf), "%llu", stats.count); | ||
| row[2] = strdup(buf); | ||
| snprintf(buf, sizeof(buf), "%llu", stats.first_seen); | ||
| row[3] = strdup(buf); | ||
| snprintf(buf, sizeof(buf), "%llu", stats.last_seen); | ||
| row[4] = strdup(buf); | ||
| snprintf(buf, sizeof(buf), "%llu", stats.sum_time); | ||
| row[5] = strdup(buf); | ||
| snprintf(buf, sizeof(buf), "%llu", stats.min_time); | ||
| row[6] = strdup(buf); | ||
| snprintf(buf, sizeof(buf), "%llu", stats.max_time); | ||
| row[7] = strdup(buf); | ||
|
|
||
| result->add_row(row); | ||
| } |
There was a problem hiding this comment.
Memory leak: row data not freed after add_row().
Each iteration allocates a char*[8] array and 8 strdup'd strings. add_row() copies the data internally, but the original allocations are never freed, causing memory leaks proportional to the number of stats entries.
🐛 Suggested fix
result->add_row(row);
+
+ // Free the temporary row data
+ for (int i = 0; i < 8; i++) {
+ free(row[i]);
+ }
+ delete[] row;
}
}🤖 Prompt for AI Agents
In `@lib/Query_Tool_Handler.cpp` around lines 1835 - 1854, The loop allocates a
char* array `row` and eight `strdup`'d strings then calls `result->add_row(row)`
but never frees those allocations; after `result->add_row(row)` free each
duplicated string (e.g. free(row[i]) for i=0..7) and then delete[] row to avoid
leaks, or replace the manual strdup/new[] pattern with a
std::vector<std::string> / std::string storage and modify/add_row to accept that
if feasible; update the code paths around `result->add_row(row)` to release the
original memory immediately after the call.
| MYSQL_RES* res = mysql_store_result(mysql_conn); | ||
| if (!res) { | ||
| // No result set (e.g., INSERT/UPDATE) | ||
| pthread_mutex_unlock(&conn_lock); | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Read the target file and surrounding lines
cat -n lib/Static_Harvester.cpp | sed -n '240,275p'Repository: ProxySQL/proxysql-vec
Length of output: 1197
🏁 Script executed:
# Find the function that contains this code
grep -n "MYSQL_RES\* res = mysql_store_result" lib/Static_Harvester.cpp -B 20 | grep -E "(^[0-9]+-\s*(int|bool|void|.*\()|^[0-9]+-\s*MYSQL_RES)"Repository: ProxySQL/proxysql-vec
Length of output: 621
🏁 Script executed:
# Get more context - find the function name
sed -n '220,260p' lib/Static_Harvester.cpp | head -50Repository: ProxySQL/proxysql-vec
Length of output: 1296
🏁 Script executed:
# Search for mysql_field_count usage in the codebase
rg "mysql_field_count" --type cppRepository: ProxySQL/proxysql-vec
Length of output: 814
🏁 Script executed:
# Find all callers of execute_query
rg "execute_query\(" lib/ --type cpp -B 2 -A 2Repository: ProxySQL/proxysql-vec
Length of output: 11557
🏁 Script executed:
# Search for where this function is called from
rg "\.execute_query\(|execute_query\(" lib/Static_Harvester.cpp -A 3 -B 1Repository: ProxySQL/proxysql-vec
Length of output: 1507
🏁 Script executed:
# Verify mysql_field_count is accessible and the pattern is sound
# Check MySQL_ResultSet.cpp where mysql_field_count is used
sed -n '1,50p' lib/MySQL_ResultSet.cpp | head -30Repository: ProxySQL/proxysql-vec
Length of output: 799
Add mysql_field_count() check to distinguish SELECT errors from non-SELECT queries.
When mysql_store_result() returns NULL, it can mean either no results (INSERT/UPDATE) or an error (SELECT). Use mysql_field_count() to differentiate: if non-zero with NULL results, it's an error (often OOM) and should return -1. This prevents silently dropping SELECT results.
🔧 Suggested fix
MYSQL_RES* res = mysql_store_result(mysql_conn);
if (!res) {
- // No result set (e.g., INSERT/UPDATE)
+ // No result set (e.g., INSERT/UPDATE) vs error
+ if (mysql_field_count(mysql_conn) != 0) {
+ proxy_error("Static_Harvester: mysql_store_result failed: %s\n", mysql_error(mysql_conn));
+ pthread_mutex_unlock(&conn_lock);
+ return -1;
+ }
pthread_mutex_unlock(&conn_lock);
return 0;
}🤖 Prompt for AI Agents
In `@lib/Static_Harvester.cpp` around lines 253 - 258, When
mysql_store_result(mysql_conn) returns NULL in the code around
Static_Harvester.cpp, add a mysql_field_count(mysql_conn) check to distinguish
non-SELECT queries from SELECT errors: if mysql_field_count(mysql_conn) == 0
treat it as no result (keep the existing pthread_mutex_unlock(&conn_lock) and
return 0), but if mysql_field_count(mysql_conn) != 0 treat it as an error
(unlock conn_lock and return -1, optionally logging the
mysql_error(mysql_conn)); update the mysql_store_result handling logic
accordingly in the same function to avoid silently dropping SELECT failures.
…ccess Fix compilation errors in the SQL injection fixes: 1. ProxySQL_Admin_Stats.cpp: Use public statsdb->prepare_v2() API - Changed from direct proxy_sqlite3_prepare_v2() calls with statsdb->db - statsdb->db is private, must use public prepare_v2(query, &stmt) method 2. Admin_Handler.cpp: Add SPA cast for template function access - Added ProxySQL_Admin *SPA=(ProxySQL_Admin *)pa; declaration - Changed all admindb->execute to SPA->admindb->execute - Removed unused 'error' and 'success' variables The build now completes successfully.
There was a problem hiding this comment.
Pull request overview
This PR merges the v3.1-MCP2 branch into v3.1-vec, resolving all merge conflicts and integrating improvements from both branches. The merge brings schema isolation for catalog operations, query rewriting enhancements, two-phase discovery architecture, and maintains HTTP/HTTPS support and RAG functionality.
Changes:
- Introduced schema isolation for catalog operations with composite unique constraint (schema, kind, key)
- Added two-phase discovery architecture (static harvest + LLM semantic analysis)
- Hardcoded catalog path to
datadir/mcp_catalog.dbfor stability (removed configurable paths) - Added
lib/proxy_sqlite3_symbols.cppfor canonical function pointer definitions - Enhanced MCP endpoint error handling and logging
Reviewed changes
Copilot reviewed 59 out of 85 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
lib/MySQL_Catalog.cpp |
Added schema parameter to all catalog functions for multi-tenancy isolation |
lib/MySQL_Tool_Handler.cpp |
Updated catalog function signatures to include schema parameter |
lib/ProxySQL_MCP_Server.cpp |
Hardcoded catalog path and added RAG endpoint initialization |
lib/Static_Harvester.cpp |
New file implementing static metadata harvester for two-phase discovery |
lib/RAG_Tool_Handler.cpp |
New file implementing RAG-powered tools through MCP protocol |
lib/Discovery_Schema.cpp |
Enhanced discovery schema with query digest tracking (not shown in diff) |
lib/Query_Tool_Handler.cpp |
Updated to use Discovery_Schema directly (not shown in diff) |
lib/ProxySQL_Admin_Stats.cpp |
Added MCP query tools counters and digest statistics |
lib/ProxySQL_Admin.cpp |
Added load/save commands for MCP query rules |
lib/MCP_Thread.cpp |
Removed catalog_path and fts_path variables |
lib/sqlite3db.cpp |
Added proxy function pointer initializations |
lib/proxy_sqlite3_symbols.cpp |
New file defining canonical storage for proxy_sqlite3_* function pointers |
lib/debug.cpp |
Updated comment to reference proxy_sqlite3_reset |
lib/PgSQL_Monitor.cpp |
Updated to use proxy_sqlite3_db_handle function pointer |
lib/MCP_Endpoint.cpp |
Added comprehensive exception handling and improved logging |
scripts/mcp/DiscoveryAgent/ClaudeCode_Headless/* |
Updated discovery scripts for multi-agent architecture |
lib/Makefile |
Added proxy_sqlite3_symbols.oo, Discovery_Schema.oo, Static_Harvester.oo, RAG_Tool_Handler.oo |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/ProxySQL_Admin_Stats.cpp
Outdated
| char* a = (char*)"INSERT INTO stats_mcp_query_rules VALUES (\"%s\",\"%s\")"; | ||
| for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) { | ||
| SQLite3_row* r = *it; | ||
| int arg_len = 0; | ||
| for (int i = 0; i < 2; i++) { | ||
| arg_len += strlen(r->fields[i]); | ||
| } | ||
| char* query = (char*)malloc(strlen(a) + arg_len + 32); | ||
| sprintf(query, a, r->fields[0], r->fields[1]); | ||
| statsdb->execute(query); | ||
| free(query); | ||
| } |
There was a problem hiding this comment.
The query construction uses sprintf with %s format specifiers without proper escaping, which could lead to SQL injection if the field values contain special characters. Use parameterized queries with proxy_sqlite3_bind_text instead, similar to the approach used in stats___mcp_query_digest.
| char* a = (char*)"INSERT INTO stats_mcp_query_rules VALUES (\"%s\",\"%s\")"; | |
| for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) { | |
| SQLite3_row* r = *it; | |
| int arg_len = 0; | |
| for (int i = 0; i < 2; i++) { | |
| arg_len += strlen(r->fields[i]); | |
| } | |
| char* query = (char*)malloc(strlen(a) + arg_len + 32); | |
| sprintf(query, a, r->fields[0], r->fields[1]); | |
| statsdb->execute(query); | |
| free(query); | |
| } | |
| const char* insert_sql = "INSERT INTO stats_mcp_query_rules VALUES (?, ?)"; | |
| sqlite3_stmt* stmt = NULL; | |
| int rc = (*proxy_sqlite3_prepare_v2)(statsdb->get_db(), insert_sql, -1, &stmt, NULL); | |
| assert(rc == SQLITE_OK); | |
| for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) { | |
| SQLite3_row* r = *it; | |
| rc = (*proxy_sqlite3_bind_text)(stmt, 1, r->fields[0], -1, SQLITE_TRANSIENT); | |
| assert(rc == SQLITE_OK); | |
| rc = (*proxy_sqlite3_bind_text)(stmt, 2, r->fields[1], -1, SQLITE_TRANSIENT); | |
| assert(rc == SQLITE_OK); | |
| SAFE_SQLITE3_STEP(stmt); | |
| rc = (*proxy_sqlite3_reset)(stmt); | |
| assert(rc == SQLITE_OK); | |
| } | |
| rc = (*proxy_sqlite3_finalize)(stmt); | |
| assert(rc == SQLITE_OK); |
| // Try different kinds for flexible merging | ||
| if (get("table", key, doc) == 0 || get("view", key, doc) == 0) { | ||
| // Try different kinds for flexible merging (empty schema searches all) | ||
| if (get("" , "table", key , doc) == 0 || get("" , "view", key, doc) == 0) { |
There was a problem hiding this comment.
Inconsistent spacing around commas. Some have spaces before commas (, ), others don't (,). Standardize to use consistent spacing for better code readability.
| Discovery_Schema* catalog = qth->get_catalog(); | ||
| if (!catalog) return (char*)"Discovery Schema catalog not initialized"; | ||
|
|
||
| char* query = (char*)"SELECT rule_id, active, username, schemaname, tool_name, match_pattern, negate_match_pattern, re_modifiers, flagIN, flagOUT, replace_pattern, timeout_ms, error_msg, OK_msg, log, apply, comment FROM main.mcp_query_rules ORDER BY rule_id"; |
There was a problem hiding this comment.
The query string is safe, but the subsequent processing in save_mcp_query_rules_from_runtime uses sprintf with format strings built from user data. While escape_string_single_quotes is used, consider using parameterized queries consistently for all database operations to prevent potential SQL injection vulnerabilities.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/Admin_Handler.cpp`:
- Around line 2364-2401: Fix three issues in the LOAD MCP QUERY RULES handler:
correct the typo in the strncasecmp target so both comparisons use "LOAD MCP
QUERY RULES FROM DISK" and "LOAD MCP QUERY RULES TO MEMORY" and match the
query_no_space_length checks; do not return true after calling l_free(*ql, *q)
(which frees *q) — instead ensure *q is not freed before the caller uses it or
reassign *q appropriately so there is no use-after-free (adjust/remove the
l_free(*ql,*q) call or set *q to a newly allocated safe value); and on the
success path call send_ok_msg_to_client (using the same client/context used
elsewhere in this file) after COMMIT succeeds so the client receives a response.
Ensure changes reference the handler block that uses query_no_space_length,
query_no_space, l_free, ProxySQL_Admin/SPA and SPA->admindb->execute.
♻️ Duplicate comments (2)
lib/Admin_Handler.cpp (1)
2405-2440: Same issues: return value and missing client response.This handler has the same bugs as the previous one:
Line 2439 returns
trueafterl_free(*ql, *q): The query was freed at line 2408 but never reassigned, causing use-after-free when the caller tries to execute*q.No client response:
send_ok_msg_to_clientis never called on the success path.Proposed fix
// SAVE MCP QUERY RULES TO DISK // Copies rules from working memory (main.mcp_query_rules) to persistent storage (disk.mcp_query_rules) if ( (query_no_space_length == strlen("SAVE MCP QUERY RULES TO DISK") && !strncasecmp("SAVE MCP QUERY RULES TO DISK", query_no_space, query_no_space_length)) ) { - l_free(*ql,*q); - ProxySQL_Admin *SPA=(ProxySQL_Admin *)pa; // Execute as transaction to ensure both statements run atomically // Begin transaction if (!SPA->admindb->execute("BEGIN")) { proxy_error("Failed to BEGIN transaction for SAVE MCP QUERY RULES TO DISK\n"); + SPA->send_error_msg_to_client(sess, (char *)"Failed to BEGIN transaction"); return false; } // Clear target table if (!SPA->admindb->execute("DELETE FROM disk.mcp_query_rules")) { proxy_error("Failed to DELETE from disk.mcp_query_rules\n"); SPA->admindb->execute("ROLLBACK"); + SPA->send_error_msg_to_client(sess, (char *)"Failed to DELETE from disk.mcp_query_rules"); return false; } // Insert from source if (!SPA->admindb->execute("INSERT OR REPLACE INTO disk.mcp_query_rules SELECT * FROM main.mcp_query_rules")) { proxy_error("Failed to INSERT into disk.mcp_query_rules\n"); SPA->admindb->execute("ROLLBACK"); + SPA->send_error_msg_to_client(sess, (char *)"Failed to INSERT into disk.mcp_query_rules"); return false; } // Commit transaction if (!SPA->admindb->execute("COMMIT")) { proxy_error("Failed to COMMIT transaction for SAVE MCP QUERY RULES TO DISK\n"); + SPA->send_error_msg_to_client(sess, (char *)"Failed to COMMIT transaction"); return false; } - return true; + proxy_info("Received %s command\n", query_no_space); + SPA->send_ok_msg_to_client(sess, NULL, 0, query_no_space); + return false; }lib/ProxySQL_Admin_Stats.cpp (1)
2670-2700: Replace sprintf SQL in MCP query rules path.Line 2686–2695 still builds SQL via string interpolation, which reintroduces injection/quoting risk and deviates from the prepared-statement approach used elsewhere. Please switch to a prepared statement with bound parameters.
🛠️ Suggested fix
- char* a = (char*)"INSERT INTO stats_mcp_query_rules VALUES (\"%s\",\"%s\")"; - for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) { - SQLite3_row* r = *it; - int arg_len = 0; - for (int i = 0; i < 2; i++) { - arg_len += strlen(r->fields[i]); - } - char* query = (char*)malloc(strlen(a) + arg_len + 32); - sprintf(query, a, r->fields[0], r->fields[1]); - statsdb->execute(query); - free(query); - } + const char* query_str = "INSERT INTO stats_mcp_query_rules VALUES (?1, ?2)"; + sqlite3_stmt* statement = nullptr; + int rc = statsdb->prepare_v2(query_str, &statement); + ASSERT_SQLITE_OK(rc, statsdb); + for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) { + SQLite3_row* r = *it; + rc = (*proxy_sqlite3_bind_text)(statement, 1, r->fields[0], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb); + rc = (*proxy_sqlite3_bind_text)(statement, 2, r->fields[1], -1, SQLITE_TRANSIENT); ASSERT_SQLITE_OK(rc, statsdb); + SAFE_SQLITE3_STEP2(statement); + rc = (*proxy_sqlite3_clear_bindings)(statement); ASSERT_SQLITE_OK(rc, statsdb); + rc = (*proxy_sqlite3_reset)(statement); ASSERT_SQLITE_OK(rc, statsdb); + } + (*proxy_sqlite3_finalize)(statement);
… SQL injection Fix issues identified by coderabbitai review: 1. Admin_Handler.cpp: Fix typo in strncasecmp for LOAD MCP QUERY RULES - Line 2365 had "LOAD MCP RULES FROM DISK" instead of "LOAD MCP QUERY RULES FROM DISK" 2. Admin_Handler.cpp: Fix use-after-free and missing client response - Removed l_free(*ql,*q) which freed *q before caller used it - Added send_error_msg_to_client calls on all error paths - Added send_ok_msg_to_client call on success path - Changed return value from true to false (to match handler pattern) - Applied to both LOAD MCP QUERY RULES and SAVE MCP QUERY RULES handlers 3. ProxySQL_Admin_Stats.cpp: Fix sprintf SQL injection in stats___mcp_query_rules - Replaced sprintf with prepared statement using positional parameters - Changed from char* query with malloc/free to sqlite3_stmt with prepare_v2 - Both columns now bound as parameters (?1, ?2)
Summary
Merges branch
v3.1-MCP2intov3.1-vecwith all merge conflicts resolved.Replaces: #15
Key Features from v3.1-MCP2
Features Preserved from v3.1-vec
mcp_use_sslvariable)Conflict Resolution Strategy
Intentional Removals (from v3.1-MCP2 commit 35b0b22)
mcp_catalog_pathvariable → catalog path now hardcoded todatadir/mcp_catalog.dbmcp_fts_pathvariable → FTS path removed (simplified architecture)Architecture Changes
lib/proxy_sqlite3_symbols.cppadded → provides canonical function pointer definitions#define MAIN_PROXY_SQLITE3to avoid conflictsFiles with Conflicts (All Resolved)
include/MCP_Thread.hinclude/MySQL_Tool_Handler.hlib/Admin_FlushVariables.cpplib/MCP_Thread.cpplib/MySQL_Catalog.cpplib/ProxySQL_MCP_Server.cpplib/Query_Tool_Handler.cpptest/tap/tests/Makefiletest/tap/tests/vector_db_performance-t.cppBreaking Changes
schemaparameterBuild Fixes
#define MAIN_PROXY_SQLITE3from test files to fix linker errors with newlib/proxy_sqlite3_symbols.cpp