Technical details for developers and contributors.
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAG MCP Server │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ MCP Server │────│ RAGIndex │────│ ChromaDB │ │
│ │ (server.py) │ │ (index.py) │ │ (data/chroma/) │ │
│ │ │ │ │ │ │ │
│ │ - search │ │ - load() │ │ Collection: ir_knowledge│ │
│ │ - list_src │ │ - search() │ │ 23 online sources │ │
│ │ - get_stats │ │ - get_stats() │ │ │ │
│ └──────────────┘ └──────────────────┘ └──────────────────────────┘ │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ SentenceTransformer│ │
│ │ BAAI/bge-base-en │ │
│ │ (768 dimensions) │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
▲
│ Build/Refresh
│
┌─────────────────────────────┴───────────────────────────────────────────────┐
│ Knowledge Pipeline │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────────┐ │
│ │ sources.py │ │ build.py │ │ refresh.py │ │
│ │ │ │ │ │ │ │
│ │ 23 online │──▶│ Full rebuild │ │ Incremental updates │ │
│ │ sources │ │ (5 min) │ │ (seconds) │ │
│ │ │ │ │ │ │ │
│ │ - Fetch │ └────────────────┘ └────────────────────────────┘ │
│ │ - Parse │ │
│ │ - Cache │ ┌────────────────┐ ┌────────────────────────────┐ │
│ └────────────────┘ │ ingest.py │ │ status.py │ │
│ │ │ │ │ │
│ │ User docs │ │ Index status reporting │ │
│ │ - Watched │ │ │ │
│ │ - Ingested │ │ │ │
│ └────────────────┘ └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
- Exposes tools: `search`, `list_sources`, `get_stats`
- Input validation (length limits, type checking)
- Async execution with thread pool for CPU-bound operations
- Loads index at startup for fast queries
- ChromaDB collection wrapper
- Embedding model management (BGE with model allowlist)
- Query augmentation: Expands MITRE IDs with technique names
- Hybrid search: Semantic similarity + keyword boosting
- MITRE technique ID detection and boosting
- Source filtering: `source` (substring match) and `source_ids` (exact match list)
- Query logging: Metrics and attention logging for tuning
- Loads configurable thresholds from tuning_config.json
- 23 source configurations (repo, branch, parser)
- GitHub API integration (commits, releases, feeds)
- Parser functions for each source format (23 parsers)
- State management (versions, sync timestamps)
- Network hardening: HTTPS-only, size limits, host allowlist, SSRF protection
- Retry logic: Exponential backoff with jitter for transient failures
- Format validation (.txt, .md, .json, .jsonl)
- Semantic chunking (respects headers, paragraphs)
- Watched documents (knowledge/ folder)
- One-time ingested documents (friendly names)
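The semantic chunking step can be sketched as follows: split on markdown headers and paragraph breaks, then greedily pack adjacent pieces up to a size budget. The function name and the 1200-character budget are illustrative assumptions; the actual ingest.py logic may differ.

```python
import re

def chunk_markdown(text: str, max_chars: int = 1200) -> list[str]:
    """Split on markdown headers and blank lines, then greedily pack
    adjacent pieces into chunks no larger than max_chars."""
    # Break at a newline followed by a header, or at paragraph gaps,
    # so headers stay attached to the content beneath them.
    pieces = re.split(r"\n(?=#{1,6} )|\n\n+", text)
    chunks, current = [], ""
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if current and len(current) + len(piece) + 2 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Fixed-offset splitting would cut rules and tables mid-sentence; splitting at structural boundaries keeps each chunk a coherent unit for embedding.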
- 5-phase build: online → user docs → ingested → embed → save
- MITRE text augmentation: Enriches document text before embedding
- Metadata sanitization for ChromaDB
- State file generation
- Sentinel creation: Creates `.rag_mcp_managed` files in managed directories
- Version checking against upstream (commit SHA, release tags)
- Selective source updates
- User document change detection (file hash comparison)
- JSON structured logging: Refresh summary logged to `rag_mcp.refresh_summary` for observability
- Index statistics
- Source version/sync status
- Document counts
- Single source of truth for all configuration values
- Environment variable loading with validation
- Cached singleton pattern for performance
- Type-safe Config dataclass
- Sentinel-based deletion guards (`.rag_mcp_managed` file required by default)
- `require_sentinel_file=False` mode for first-time setup (all other guards remain active)
- Forbidden path protection (/, /home, /root, etc.)
- Minimum depth checks (MIN_DELETE_DEPTH=3)
- Root containment validation
- Safe directory removal with `safe_rmtree()`, the only approved deletion method
- Project paths (PROJECT_ROOT, DATA_ROOT, KNOWLEDGE_ROOT)
- Sentinel file name (MANAGED_SENTINEL)
- Forbidden paths list
- Depth limits and other safety constants
- Metadata sanitization (ChromaDB type requirements)
- File hashing, atomic JSON writes
- MITRE lookup loading: Dynamic technique ID → name mapping
- Text augmentation: Shared function for build and query time
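The atomic JSON write pattern ensures readers never observe a truncated state file: write to a temp file in the same directory, flush to disk, then swap it into place. A minimal sketch (the function name is illustrative; the actual utils.py helper may differ):

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the target directory, then atomically
    swap it into place so readers never see a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a filesystem.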
- Configurable thresholds per query type
- Source boost multipliers
- Keyword boost setting
- Full audit trail of configuration changes
- Safe bounds enforcement (0.40-0.70 for thresholds)
- Parses query metrics and attention logs
- Generates statistics by query type
- Produces recommendations with rationale
- Interactive approval workflow for tuning changes
- Exports analysis to JSON for external processing
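As an illustration of the bounded-tuning idea, a recommendation helper might nudge a threshold toward the 25th percentile of observed top scores while clamping to the 0.40-0.70 safe band. The heuristic shown is an assumption for illustration, not the actual analyze_queries.py logic:

```python
def recommend_threshold(top_scores, current):
    """Hypothetical heuristic: move the threshold toward the 25th
    percentile of observed top scores, clamped to the 0.40-0.70
    safe band enforced by the tuning tool."""
    scores = sorted(top_scores)
    p25 = scores[len(scores) // 4]  # 25th-percentile top score
    proposed = round(min(current, p25), 2)
    return max(0.40, min(0.70, proposed))  # safe bounds enforcement
```

Whatever statistic drives the recommendation, the final clamp guarantees an automated run can never push a threshold outside limits a human approved.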
1. Online Sources (sources.py)
├── Fetch from GitHub/feeds
├── Parse to JSONL
└── Cache in data/sources/*.jsonl
2. User Documents (ingest.py)
├── Scan knowledge/ folder
├── Validate formats
└── Chunk text content
3. MITRE Lookup (utils.py)
├── Load technique mappings from mitre_attack.jsonl
└── 835 technique ID → name mappings
4. Text Augmentation (build.py)
├── Expand MITRE IDs in document text
└── "Detect T1003" → "Detect T1003 OS Credential Dumping"
5. ChromaDB Index (build.py)
├── Load embedding model
├── Batch embed augmented text
├── Create ir_knowledge collection
└── Save state files
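Phase 4's MITRE expansion can be sketched like this. The two-entry lookup, regex, and `augment` name are illustrative; the real lookup is loaded from mitre_attack.jsonl and holds ~835 mappings:

```python
import re

# Illustrative two-entry lookup; the real table is loaded from
# mitre_attack.jsonl and holds ~835 technique mappings.
MITRE_LOOKUP = {
    "T1003": "OS Credential Dumping",
    "T1059.001": "PowerShell",
}

# Matches technique IDs like T1003 and sub-techniques like T1059.001
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def augment(text: str, lookup: dict) -> str:
    """Append the official technique name after each known MITRE ID."""
    def expand(match):
        tid = match.group(0)
        name = lookup.get(tid)
        return f"{tid} {name}" if name else tid
    return TECHNIQUE_ID.sub(expand, text)
```

Unknown IDs pass through unchanged, which is also what triggers the `unknown_mitre_ids` attention log at query time.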
1. Query received via MCP
2. Validate inputs (length, types)
3. Load tuning config (thresholds, boosts)
4. Augment query with MITRE technique names
└── "T1003" → "T1003 OS Credential Dumping"
5. Extract boost terms from query
6. Embed augmented query with BGE model
7. ChromaDB cosine similarity search
8. Apply filters:
- source: Substring match (e.g., "mitre" matches mitre_attack, mitre_car)
- source_ids: Exact match list (deterministic filtering)
- technique: MITRE ATT&CK technique ID
- platform: Target platform
9. Apply source boosts for authoritative sources
10. Apply keyword boosts (hybrid search)
11. Boost MITRE ID matches if applicable
12. Log query metrics (all queries + attention-worthy)
13. Return ranked results
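The filter semantics in step 8 can be sketched as follows (a hypothetical helper; the actual filtering lives in index.py):

```python
def passes_source_filter(meta, source=None, source_ids=None):
    """source is a substring match (e.g. "mitre" matches mitre_attack
    and mitre_car); source_ids is a deterministic exact-match allowlist."""
    src = meta.get("source", "")
    if source is not None and source not in src:
        return False
    if source_ids is not None and src not in source_ids:
        return False
    return True
```

The substring form is convenient for interactive exploration; the exact-match list is the right choice for automation, where "mitre" unexpectedly matching several sources would be a bug.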
1. Configure logging for rag_mcp.query_metrics and rag_mcp.attention
2. Run queries in production (logs accumulate)
3. Run: python -m rag_mcp.analyze_queries
4. Review statistics and recommendations
5. Approve/reject each recommendation interactively
6. Approved changes saved to data/tuning_config.json
7. Index loads updated config on next query
Each source is defined in the `SOURCES` dict:

```python
SourceConfig(
    name="sigma",                  # Unique identifier
    description="SigmaHQ Rules",   # Human-readable
    source_type="github_commits",  # Version tracking method
    repo="SigmaHQ/sigma",          # GitHub owner/repo
    branch="master",               # Branch to track
    parser="parse_sigma",          # Parser function name
    paths=["rules/"]               # Paths to parse within repo
)
```

Source types:
- `github_commits`: Track latest commit SHA
- `github_releases`: Track latest release tag
- `json_feed`: Track feed version field
Each parser takes `(repo_dir, output_path)` and returns a record count. It outputs JSONL with one `{text, metadata}` object per line.
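A parser honoring this contract might look like the sketch below. It is entirely illustrative; real parsers handle each source's specific format (Sigma YAML, STIX JSON, etc.):

```python
import json
from pathlib import Path

def parse_example(repo_dir: Path, output_path: Path) -> int:
    """Hypothetical parser: emit one {text, metadata} JSONL record
    per markdown file found in the cloned repository."""
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for md_file in sorted(repo_dir.rglob("*.md")):
            record = {
                "text": md_file.read_text(encoding="utf-8", errors="replace"),
                "metadata": {"source": "example", "title": md_file.stem},
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```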
| Source | Type | Description |
|---|---|---|
| sigma | github_commits | SigmaHQ Detection Rules (~3,100) |
| mitre_attack | github_releases | ATT&CK techniques, groups, malware (~2,100) |
| atomic | github_commits | Atomic Red Team Tests (~1,800) |
| elastic | github_releases | Elastic Detection Rules (~1,500) |
| cisa_kev | json_feed | Known Exploited Vulnerabilities (~1,500) |
| capec | github_commits | MITRE CAPEC Attack Patterns (~1,500) |
| splunk_security | github_releases | Splunk Security Content (~1,000) |
| kape | github_commits | KAPE Targets & Modules (~800) |
| forensic_artifacts | github_commits | ForensicArtifacts Definitions (~700) |
| mbc | github_commits | MITRE MBC Malware Behavior Catalog (~650) |
| hijacklibs | github_commits | DLL Hijacking Database (~600) |
| loldrivers | github_commits | Vulnerable Driver Database (~500) |
| mitre_d3fend | json_feed | D3FEND Countermeasures (~490) |
| gtfobins | github_commits | GTFOBins (~450) |
| velociraptor | github_commits | Velociraptor Artifacts (~300) |
| hayabusa | github_commits | Hayabusa Built-in Detection Rules (~190) |
| lolbas | github_commits | LOLBAS Project (~130) |
| chainsaw | github_commits | Chainsaw Forensic Detection Rules (~110) |
| mitre_car | github_commits | MITRE CAR Analytics (~100) |
| stratus_red_team | github_releases | Cloud Attack Techniques (~80) |
| mitre_atlas | github_commits | ATLAS AI/ML Attacks (~50) |
| mitre_engage | github_commits | MITRE Engage Adversary Engagement (~45) |
| forensic_clarifications | static | Authoritative Forensic Artifact Clarifications (5) |
Online source state:

```json
{
  "version": 1,
  "sources": {
    "sigma": {
      "version": "abc123def456",
      "last_sync": "2024-01-31T08:00:00",
      "records": 3101,
      "cache_hash": "sha256:..."
    }
  }
}
```

Watched document state:

```json
{
  "version": 1,
  "files": {
    "my-doc.md": {
      "hash": "sha256:...",
      "records": 5,
      "record_ids": ["user_my-doc_0", ...],
      "processed_at": "2024-01-31T08:00:00"
    }
  }
}
```

Ingested document state:

```json
{
  "version": 1,
  "documents": {
    "friendly-name": {
      "original_filename": "doc.txt",
      "records": 10,
      "record_ids": ["ingested_friendly-name_0", ...],
      "ingested_at": "2024-01-31T08:00:00"
    }
  }
}
```

Tuning configuration (data/tuning_config.json):

```json
{
  "version": "1.0",
  "thresholds": {
    "general": 0.50,
    "mitre_id": 0.55,
    "detection": 0.55,
    "forensic": 0.55
  },
  "source_boosts": {
    "forensic_clarifications": 1.15
  },
  "keyword_boost": 1.15,
  "low_score_threshold": 0.50,
  "weak_mitre_threshold": 0.60,
  "last_modified": "2026-02-02T10:00:00",
  "last_modified_by": "analyst",
  "modification_history": [
    {
      "timestamp": "2026-02-02T10:00:00",
      "parameter": "threshold:detection",
      "old_value": 0.55,
      "new_value": 0.52,
      "approved_by": "analyst",
      "reason": "12% of detection queries below threshold"
    }
  ]
}
```

Collection: ir_knowledge
- Metric: Cosine similarity (HNSW)
- Dimensions: 768 (BGE model)
Document Fields:
- `id`: Unique identifier (e.g., `sigma_rule_123`)
- `text`: Searchable content (max 1500 chars in results)
- `metadata`:
  - `source`: Source identifier
  - `title`: Optional title
  - `mitre_techniques`: Comma-separated technique IDs
  - `platform`: Target platform(s)
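An individual record conforming to these fields might look like this (values are illustrative; note that `mitre_techniques` is a comma-separated string, per the metadata sanitization rules):

```json
{
  "id": "sigma_rule_123",
  "text": "Detects credential dumping from LSASS memory ...",
  "metadata": {
    "source": "sigma",
    "title": "LSASS Memory Dump",
    "mitre_techniques": "T1003,T1003.001",
    "platform": "windows"
  }
}
```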
Queries and documents containing MITRE technique IDs are automatically expanded with official technique names before embedding. This dramatically improves search quality for alphanumeric IDs that have no semantic meaning on their own.
- Query time: "T1003" → "T1003 OS Credential Dumping"
- Index time: Document text is enriched before embedding
The lookup table is loaded dynamically from mitre_attack.jsonl and contains 835 technique mappings. It updates automatically when MITRE data is refreshed.
Combines semantic similarity with exact keyword matching:
- Semantic score: ChromaDB cosine similarity (0-1)
- Source boost: Authoritative sources get multiplied boost (e.g., 1.15x)
- Keyword boost: Results containing query terms get 15% boost (configurable)
Final score = min(1.0, semantic_score × source_boost × keyword_boost)
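In code, the scoring combination looks roughly like this sketch (the actual index.py implementation may structure the steps differently):

```python
def score_result(semantic, text, query_terms,
                 source_boost=1.0, keyword_boost=1.15):
    """Apply the source boost, then the keyword boost only when a query
    term appears verbatim in the result text; cap the final score at 1.0."""
    score = semantic * source_boost
    if any(term.lower() in text.lower() for term in query_terms):
        score *= keyword_boost
    return min(1.0, score)
```

Capping at 1.0 keeps boosted scores comparable with raw cosine similarities, so the same thresholds apply to both.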
Two logging levels for production monitoring. Both are OFF by default for privacy (query text is logged verbatim). See README for opt-in instructions.
- `rag_mcp.query_metrics` (INFO): All queries, with query_type, top_score, result_count, and augmented flag
- `rag_mcp.attention` (WARNING): Problematic queries flagged for review:
  - `zero_results`: No matches found (content gap)
  - `low_score:<score>`: Top score below threshold
  - `unknown_mitre_ids:<ids>`: MITRE ID not in lookup
  - `weak_mitre_match:<score>`: Augmented query still scored poorly
Sentinel-based deletion guards prevent accidental data loss:
- Sentinel File: `.rag_mcp_managed` must exist in a directory before `safe_rmtree()` can delete it
- Unified Deletion: All deletions go through `safe_rmtree()`; no fallback to naked `shutil.rmtree()`
- First-Time Setup: Sentinel requirement can be relaxed via `require_sentinel_file=False`, but all other safety checks remain active
- Forbidden Paths: /, /home, /root, /etc, /usr, /var blocked regardless of sentinel
- Minimum Depth: Directories must be at least 3 levels deep from root
- Root Containment: Deletions must be within project root boundaries
- Build Integration: `build.py` creates sentinels in managed directories after deletion
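Put together, the guards can be sketched as follows. The constant and signature names follow the list above, but the exact layout of the real `safe_rmtree()` is an assumption:

```python
import shutil
from pathlib import Path

FORBIDDEN = {Path(p) for p in ("/", "/home", "/root", "/etc", "/usr", "/var")}
MIN_DELETE_DEPTH = 3
SENTINEL = ".rag_mcp_managed"

def safe_rmtree(target: Path, project_root: Path, require_sentinel_file=True):
    """Delete a directory tree only when every guard passes."""
    target = target.resolve()
    if target in FORBIDDEN:
        raise PermissionError(f"refusing to delete forbidden path {target}")
    # parts[0] is the filesystem root, so depth is len(parts) - 1
    if len(target.parts) - 1 < MIN_DELETE_DEPTH:
        raise PermissionError(f"{target} is too close to the filesystem root")
    if not target.is_relative_to(project_root.resolve()):
        raise PermissionError(f"{target} is outside the project root")
    if require_sentinel_file and not (target / SENTINEL).exists():
        raise PermissionError(f"missing {SENTINEL} sentinel in {target}")
    shutil.rmtree(target)
```

The guards are independent and all-or-nothing: a misconfigured path must fail every check it touches before anything is removed.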
All network fetches in sources.py include:
- HTTPS-Only: HTTP URLs rejected by default (configurable via `RAG_ALLOW_HTTP`)
- Size Limits: Downloads capped at 60MB (`RAG_MAX_DOWNLOAD_BYTES`)
- Host Allowlist: Only approved hosts (github.com, raw.githubusercontent.com, etc.)
- IP Literal Blocking: Direct IP addresses blocked to prevent SSRF
- Redirect Validation: Redirect targets validated against host allowlist
- Retry with Backoff: Transient failures (429, 5xx, timeouts) retried with exponential backoff and jitter

Additional hardening measures across the codebase:
- Model Allowlist: Only approved embedding models can be loaded (`utils.py:ALLOWED_MODELS`)
- Input Validation: Query/filter length limits prevent DoS (`server.py`)
- Path Disclosure: Internal paths not exposed in API responses (`index.py`)
- Metadata Sanitization: Lists/dicts converted to strings for ChromaDB (`utils.py`)
- URL Host Allowlist: SSRF protection for feed fetching (`sources.py`)
- Git Parameter Validation: Repo/branch format validation prevents injection (`sources.py`)
- Tuning Bounds: Automated adjustments stay within safe limits 0.40-0.70 (`analyze_queries.py`)
- Query Logging Off by Default: Privacy protection; logs contain verbatim query text (`index.py`)
- Audit Trail Continuity: Overflow entries logged before truncation, max 100 in config (`tuning_config.py`)
- File Size Limits: 10MB max for user documents prevents memory exhaustion (`ingest.py`)
- ID Sanitization: Alphanumeric-only pattern prevents injection (`ingest.py`)
- Atomic File Writes: Prevents corruption from concurrent access (`utils.py`)
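The retry-with-backoff behavior can be sketched as below. Names are illustrative; `TransientError` stands in for whatever exception sources.py raises for 429/5xx responses and timeouts:

```python
import random
import time

class TransientError(Exception):
    """Retryable condition: timeout or a 429/5xx response."""

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch(); on TransientError, sleep with exponential backoff
    and full jitter, then retry. Re-raises after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: uniform delay in [0, base * 2^attempt] keeps
            # many clients from retrying in lockstep after an outage.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```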
`GITHUB_TOKEN`: If used, create it with minimal permissions:
- Public repositories read-only
- No private repo access
- No write permissions
Network Exposure: Server uses stdio only. Do NOT expose over network without:
- Authentication layer
- Rate limiting
- TLS encryption
Dependency Scanning: Run pip-audit periodically to detect CVEs in dependencies.
Audit Trail: Config changes are tracked with 100-entry rolling history. For compliance, capture rag_mcp.tuning_config logs to external storage.
```bash
# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/test_rag_comprehensive.py -v  # Comprehensive tests
pytest tests/test_extended_nlp.py -v       # Extended NLP tests

# Test search functionality
python -c "
from rag_mcp import RAGIndex
idx = RAGIndex()
idx.load()
print(idx.search('credential dumping', top_k=3))
"

# Test MITRE augmentation
python -c "
from rag_mcp.utils import load_mitre_lookup, augment_text_with_mitre
from pathlib import Path
lookup = load_mitre_lookup(Path('data/sources'))
print(f'Loaded {len(lookup)} techniques')
print(augment_text_with_mitre('Detect T1003 and T1059.001', lookup))
"
```
```bash
# Test query analysis (report only)
python -m rag_mcp.analyze_queries --report-only
```

To add a new source:
- Add a `SourceConfig` to the `SOURCES` dict in sources.py
- Create a parser function `parse_newname(repo_dir, output_path) -> int`
- Register the parser in the `PARSERS` dict
- Add any new URL hosts to `ALLOWED_URL_HOSTS` if needed
- Run `python -m rag_mcp.build --force-fetch`
| Operation | Time |
|---|---|
| First build (all sources) | ~5 minutes |
| Incremental refresh | ~10 seconds |
| Search query | ~50ms |
| Model load | ~5 seconds |
The system supports interactive threshold tuning based on production query patterns:
```python
import logging

# Log all queries
logging.getLogger("rag_mcp.query_metrics").setLevel(logging.INFO)
handler = logging.FileHandler("logs/query_metrics.log")
logging.getLogger("rag_mcp.query_metrics").addHandler(handler)

# Log attention-worthy queries
logging.getLogger("rag_mcp.attention").setLevel(logging.WARNING)
handler = logging.FileHandler("logs/attention.log")
logging.getLogger("rag_mcp.attention").addHandler(handler)
```

Run the system in production. Logs capture:
- Query type classification (general, mitre_id, detection, forensic)
- Top scores and result counts
- Augmentation effectiveness
- Attention triggers (zero results, low scores, unknown IDs)
Run `python -m rag_mcp.analyze_queries`. The tool presents:
- Statistics by query type (count, avg, P25, min scores)
- Current configuration
- Attention issues summary
- Content gaps (zero-result queries)
- Specific recommendations with rationale
For each recommendation:
- Review description, rationale, and confidence level
- Enter Y to approve, N to skip, Q to quit
- Approved changes are applied to tuning_config.json
- Full audit trail recorded
Changes take effect immediately on next query (index reloads config). Run targeted searches to verify improved behavior.
- `chromadb`: Vector store
- `sentence-transformers`: Embedding model
- `mcp`: Model Context Protocol
- `pyyaml`: YAML parsing
- `toml`: TOML parsing (Elastic rules)