
Add detection engine, reliability, SIEM, telemetry, Helm chart, middleware, and tests#72

Open
S3DFX-CYBER wants to merge 2 commits into main from codex/upgrade-ml-model-to-fine-tuned-bert-gx4anh

Conversation

@S3DFX-CYBER
Collaborator

@S3DFX-CYBER S3DFX-CYBER commented Apr 23, 2026

Motivation

  • Introduce a policy-driven detection engine to improve multi-turn, agentic, and response-scanning threat detection beyond simple heuristics.
  • Add reliability and observability features (circuit breaker, Prometheus metrics, OpenTelemetry traces, Grafana) to meet latency SLAs and provide operational signals.
  • Provide enterprise integrations (SIEM connectors, threat-intel sharing) and an air-gapped Helm chart for on-prem deployments.
  • Make the plugin easier to adopt in common frameworks by providing one-import middleware for LangChain and LlamaIndex.

Description

  • Added a new DetectionEngine implementation under services/analyzer/detection_engine.py with policy loading, collective threat-intel, transformer/sklearn backends, session tracking, agent-chain scoring, OWASP mapping, feedback append, and threat-intel sharing APIs.
  • Implemented reliability and observability: CircuitBreaker in services/analyzer/reliability.py, Prometheus metrics and optional OpenTelemetry traces in services/analyzer/app.py, a /metrics endpoint, and latency accounting in analysis responses.
  • Added SIEM dispatchers in services/analyzer/siem_connectors.py to forward redacted alerts to Splunk HEC, Sentinel, and Elastic, and wired SIEM publishing into the analyzer event store flow.
  • Extended analyzer HTTP API with new endpoints: /v1/feedback, /v1/threat-intel/feed, /v1/threat-intel/share, and kept existing /v1/analyze but wired it to the new engine and instrumentation; preserved fallback heuristics and ML-only fallback.
  • Added Helm chart for air-gapped on‑prem installs under helm/tenet and Docker Compose monitoring stack additions (prometheus, grafana) plus Prometheus/Grafana provisioning and alert rules in monitoring/.
  • Updated requirements.txt to include optional telemetry, transformer, Prometheus client and YAML dependencies and added internal modules for SIEM/reliability/engine; updated README.md with LangChain/LlamaIndex middleware and tiered reliability/integration notes.
  • Extended the plugin package (tenet_plugin) with LangChainTenetMiddleware and LlamaIndexTenetMiddleware for one-import middleware usage and added tracing wrapper support.
  • Added comprehensive unit and integration tests for detection engine, reliability, plugin middleware, and a latency SLA smoke test under tests/unit and tests/integration.
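The one-import middleware pattern described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `tenet_plugin` API: the class name, the `analyze` callable, and the result keys are assumptions (the PR's unit tests suggest an `invoke(prompt, llm_callable=...)` signature).

```python
class DemoSecurityMiddleware:
    """Illustrative one-import security wrapper; not the tenet_plugin implementation."""

    def __init__(self, analyze):
        # analyze: callable mapping a prompt to a verdict dict, e.g. {"verdict": "benign"}
        self._analyze = analyze

    def invoke(self, prompt, llm_callable, **llm_kwargs):
        analysis = self._analyze(prompt)
        if analysis.get("verdict") == "malicious":
            # Block before the prompt ever reaches the LLM.
            return {"status": "blocked", "analysis": analysis}
        llm_kwargs.setdefault("prompt", prompt)  # keep the user input in the forwarded call
        return {
            "status": "success",
            "llm_response": llm_callable(**llm_kwargs),
            "analysis": analysis,
        }
```

The same shape would apply to a `query(...)` entry point for LlamaIndex-style callers.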

Testing

  • Ran unit tests tests/unit/test_detection_engine.py, tests/unit/test_reliability.py, and tests/unit/test_tenet_plugin.py using pytest, and all unit tests passed.
  • Ran integration smoke test tests/integration/test_latency_sla.py with pytest, which validated the heuristic p99 latency target and passed.
  • Exercised analyzer startup and health flows via the new startup/health code paths in automated tests and confirmed model/redis handling and fallback behavior via the reliability tests.
  • The test suite also asserts that feedback submission and threat-intel sharing produce the expected file writes, and those assertions passed.

Codex Task

Summary by CodeRabbit

  • New Features

    • Added threat detection with multi-factor risk scoring and policy-based verdicts
    • Introduced SIEM integration for event publishing and alert dispatch
    • Added threat-intel sharing endpoints and persistent pattern storage
    • Introduced circuit breaker for handling detection service failures
    • Added feedback collection endpoint for analyst corrections
    • Implemented LangChain and LlamaIndex middleware integration
  • Documentation

    • Updated README with additional integration paths and detection capability tiers
  • Chores

    • Added Prometheus and Grafana observability services
    • Added Helm deployment charts with air-gapped configuration support
    • Added OpenTelemetry tracing instrumentation
    • Added new dependencies for observability and ML frameworks

@vercel

vercel Bot commented Apr 23, 2026

The latest updates on your projects:

| Project | Status | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| tenet-ai.org | Ready | Preview, Comment | Apr 23, 2026 5:35pm |

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough


The PR introduces comprehensive observability, reliability, and advanced threat detection capabilities. It adds Prometheus/Grafana monitoring infrastructure, Helm charts for air-gapped deployment, a multi-signal DetectionEngine with policy-based scoring, CircuitBreaker failover mechanism, SIEM integration, OpenTelemetry tracing, and LangChain/LlamaIndex middleware. The analyzer service expands with latency instrumentation, threat-intel endpoints, feedback handling, and session-based risk accumulation.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Documentation & Package Config | README.md, requirements.txt | Expanded README with Option C integration path and detection capability tiers. Added observability and ML dependencies (transformers, prometheus-client, opentelemetry-api/sdk, PyYAML). |
| Monitoring Infrastructure | docker-compose.yml, monitoring/prometheus.yml, monitoring/alerts.yml, monitoring/grafana/provisioning/datasources/prometheus.yml | Added Prometheus and Grafana services to compose. Configured Prometheus scrape jobs for analyzer metrics, alert rules for high latency p99 and malicious request spikes, and Grafana datasource provisioning. |
| Helm Chart Foundation | helm/tenet/Chart.yaml, helm/tenet/values.yaml | New Helm chart scaffolding for air-gapped TENET deployment with configurable image pull policy, service endpoints, Redis config, and required API key field. |
| Helm Chart Templates | helm/tenet/templates/analyzer.yaml, helm/tenet/templates/ingest.yaml, helm/tenet/templates/redis.yaml | Kubernetes Deployments and Services for analyzer, ingest, and Redis components. All include Redis connection wiring via environment variables and API key validation. |
| Detection Engine & Policy | services/analyzer/detection_engine.py, services/analyzer/policy/default_policy.yaml | New multi-signal detection engine evaluating prompt patterns, threat-intel matches, response data leaks, agentic tool injection, and optional transformer classification. YAML policy defines scoring thresholds, pattern weights, agentic monitoring parameters, response-scanning rules, and OWASP LLM Top 10 mappings. |
| Reliability & Failover | services/analyzer/reliability.py | New CircuitBreaker dataclass with configurable failure threshold and reset timeout, tracking open/closed state and supporting success/failure recording with automatic state transitions. |
| SIEM Integration | services/analyzer/siem_connectors.py | New SIEMDispatcher for concurrent publishing to multiple SIEM backends (Splunk, Sentinel, Elastic) with automatic field redaction (prompt, response, context, agent trace) unless explicitly enabled. |
| Analyzer Service Expansion | services/analyzer/app.py | Significant expansion: Prometheus latency/request metrics, optional OpenTelemetry tracing, CircuitBreaker with fallback-allow, SIEM dispatch for malicious verdicts, new threat-intel/feedback endpoints, and extended AnalysisRequest schema with session tracking, organization ID, response scanning, and agent trace fields. |
| Plugin Enhancements | tenet_plugin/__init__.py, tenet_plugin/client.py | Updated exports and added LangChainTenetMiddleware and LlamaIndexTenetMiddleware classes. Added conditional OpenTelemetry span wrapping around secure_call and _NullContext fallback for tracing compatibility. |
| Test Coverage | tests/integration/test_latency_sla.py, tests/unit/test_detection_engine.py, tests/unit/test_reliability.py, tests/unit/test_tenet_plugin.py | New tests for latency SLA (p99 < 20ms), detection engine signal combinations and session accumulation, circuit breaker state transitions and fallback behavior, and middleware routing to secure_call. |
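The CircuitBreaker behavior summarized above (configurable failure threshold, reset timeout, open/closed state with automatic transitions) can be sketched as follows. Field names and default values are illustrative, not the services/analyzer/reliability.py implementation; the injectable `now` parameter exists only to make the sketch testable.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoCircuitBreaker:
    """Illustrative breaker sketch; thresholds and names are assumptions."""

    failure_threshold: int = 5
    reset_timeout: float = 30.0
    _failures: int = 0
    _opened_at: Optional[float] = None

    def is_open(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self._opened_at is None:
            return False
        if now - self._opened_at >= self.reset_timeout:
            # Reset timeout elapsed: close the breaker, clearing both the
            # failure count and the open timestamp so a later failure burst
            # can re-open it.
            self._failures = 0
            self._opened_at = None
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now: Optional[float] = None) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic() if now is None else now
```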

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant App as Analyzer App
    participant CB as CircuitBreaker
    participant Engine as DetectionEngine
    participant Policy as Policy Rules
    participant Prometheus
    participant SIEM

    Client->>App: POST /v1/analyze (prompt)
    activate App
    Note over App: record start time
    
    App->>CB: check is_open()
    CB-->>App: is_open: false/true
    
    alt CircuitBreaker Open
        App->>App: return fallback verdict
    else CircuitBreaker Closed
        App->>Engine: analyze(prompt, session_id, org_id, ...)
        activate Engine
        Engine->>Policy: load org-specific thresholds
        Policy-->>Engine: thresholds & patterns
        Engine->>Engine: compute multi-signal scores
        Note over Engine: prompt patterns, threat-intel,<br/>response leaks, agentic chains,<br/>transformer classification
        Engine-->>App: EngineResult {risk_score, verdict, ...}
        deactivate Engine
        App->>CB: record_success() or record_failure()
    end
    
    App->>Prometheus: record latency_ms, heuristic_sla_met
    Prometheus-->>App: metrics accepted
    
    alt Verdict is Malicious
        App->>SIEM: publish(event) with sanitization
        SIEM-->>App: async confirmation
    end
    
    App-->>Client: AnalysisResponse {verdict, details, ...}
    deactivate App
sequenceDiagram
    participant Analyzer as Analyzer Service
    participant ThreatDB as Threat Intel Store
    participant FeedbackDB as Feedback JSONL
    participant PolicyEngine as Detection Engine

    Analyzer->>ThreatDB: GET /v1/threat-intel/feed
    ThreatDB-->>Analyzer: return all patterns
    
    Analyzer->>Analyzer: POST /v1/threat-intel/share<br/>{pattern, score, category}
    Analyzer->>ThreatDB: append & rewrite patterns
    ThreatDB-->>Analyzer: share confirmed
    
    Analyzer->>Analyzer: POST /v1/feedback<br/>{prompt, predicted, analyst_verdict}
    Analyzer->>FeedbackDB: append JSONL entry
    FeedbackDB-->>Analyzer: feedback stored
    
    Analyzer->>PolicyEngine: share_threat_pattern(pattern, score, ...)
    PolicyEngine->>PolicyEngine: incorporate into collective signals
    PolicyEngine-->>Analyzer: updated pattern recognized

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops of joy through Prometheus calls,
Circuit breakers guard the halls,
Threat-intel patterns hop and share,
SIEM alerts spread everywhere,
LLM risks detected with care! 🔍✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.54%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the major components added: detection engine, reliability, SIEM, telemetry, Helm chart, middleware, and tests. |
| Description check | ✅ Passed | The PR description comprehensively covers motivation, implementation details, and testing results, exceeding the template structure. |
| Linked Issues check | ✅ Passed | Skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Skipped because no linked issues were found for this pull request. |




@github-actions

🤖 TENET Agent Review

📋 Summary

This pull request introduces a comprehensive set of enhancements to the TENET AI project, transforming it into a more robust and enterprise-ready LLM security middleware. Key additions include a policy-driven DetectionEngine for advanced threat detection, critical reliability features like a circuit breaker and Prometheus/Grafana observability, SIEM integrations with data redaction, and a Helm chart for air-gapped deployments. The PR also provides convenient middleware for LangChain and LlamaIndex, alongside extensive testing. The overall approach is sound, significantly improving the system's security posture, operational resilience, and adoption ease.

🔐 Security Findings

  • [HIGH] services/analyzer/detection_engine.py - The _load_model_backend method loads transformer models based on the TRANSFORMER_MODEL_ID environment variable. If this variable is not strictly controlled and can be influenced by an attacker, loading a malicious model from an untrusted source could lead to arbitrary code execution.
    • Recommendation: Implement a whitelist of trusted model IDs or ensure the TRANSFORMER_MODEL_ID environment variable is only set by trusted deployment mechanisms (e.g., CI/CD, infrastructure-as-code).
  • [MEDIUM] services/analyzer/detection_engine.py - The append_feedback and share_threat_pattern methods write user-controlled content (prompt, pattern) to files whose paths (FEEDBACK_PATH, THREAT_INTEL_PATH) are determined by environment variables. While environment variables are typically controlled by administrators, if an attacker could influence these paths, it could lead to arbitrary file writes or path traversal vulnerabilities.
    • Recommendation: Ensure these environment variables are immutable and not exposed to untrusted configuration. Consider adding path sanitization or restricting writes to a specific, non-configurable subdirectory within the application's data volume.
  • [LOW] services/analyzer/siem_connectors.py - The INCLUDE_SENSITIVE_SIEM_FIELDS environment variable, if set to true, allows sensitive data (prompts, responses, context, agent traces) to be dispatched to external SIEM systems. While the default is false (redacted), this configuration option poses a data leakage risk if misconfigured.
    • Recommendation: Add a prominent warning in documentation about the implications of enabling INCLUDE_SENSITIVE_SIEM_FIELDS and ensure access to modify this setting is restricted to authorized personnel.
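The allowlist recommended for the HIGH finding could be as small as this; the helper name and the model IDs listed are illustrative examples, not the PR's code or a vetted trust list.

```python
import os

# Illustrative allowlist guard for TRANSFORMER_MODEL_ID.
# The entries below are examples only.
TRUSTED_MODEL_IDS = frozenset({
    "distilbert-base-uncased",
    "acme/prompt-guard",  # hypothetical internal fine-tune
})


def resolve_model_id(default: str = "distilbert-base-uncased") -> str:
    """Return the configured model id, refusing anything outside the allowlist."""
    model_id = os.getenv("TRANSFORMER_MODEL_ID", default)
    if model_id not in TRUSTED_MODEL_IDS:
        raise ValueError(f"refusing to load untrusted model id: {model_id!r}")
    return model_id
```

Loading then proceeds only with a vetted identifier, so a tampered environment variable fails fast instead of pulling an arbitrary model.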

🧹 Code Quality

  • services/analyzer/app.py - Extensive use of global variables (ml_model, vectorizer, engine, siem_dispatcher, circuit_breaker) can make testing and state management challenging. Consider using FastAPI's dependency injection system for better modularity and testability.
  • services/analyzer/app.py and tenet_plugin/client.py - The _null_context (or _NullContext) class is duplicated across these two files. Consolidate into a shared utility module to avoid redundancy.
  • services/analyzer/detection_engine.py - The _load_model_backend method uses a broad except Exception block. While fallback is implemented, using more specific exception types would improve error diagnostics.
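On the duplicated `_NullContext` point: the standard library already ships a no-op context manager, so a shared helper can replace both copies. `start_as_current_span` is the standard OpenTelemetry `Tracer` method; the helper name itself is an assumption.

```python
from contextlib import nullcontext


def start_span(tracer, name):
    """Return a tracing span context, or a stdlib no-op context when tracing is off.

    Replaces hand-rolled _NullContext classes with contextlib.nullcontext.
    """
    if tracer is None:
        return nullcontext()
    return tracer.start_as_current_span(name)
```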

✅ What's Done Well

  • Introduction of a robust, policy-driven DetectionEngine significantly enhances threat detection capabilities beyond simple heuristics, including multi-turn and agentic attack vectors.
  • Comprehensive observability features (Prometheus, Grafana, OpenTelemetry, circuit breaker) provide critical operational insights and graceful degradation, addressing latency SLAs.
  • Strong enterprise integration points like SIEM connectors with sensitive data redaction and a Helm chart for air-gapped deployments demonstrate a mature security product.

📝 Overall Verdict

REQUEST CHANGES - Address the high-severity security finding regarding transformer model loading and consider the medium-severity file write risks.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

result = run_analysis(prompt)

P1 Badge Preserve tenant context in queued event analysis

_process_single_event analyzes queued items with run_analysis(prompt) only, so tenant-specific policy overrides are dropped for the primary ingest→queue→analyzer path even when the event contains tenant identity. As a result, organization thresholds/rules are silently bypassed and detections can be wrong for multi-tenant deployments.


Comment on lines +33 to +35
for field in ("prompt", "response_text", "context", "agent_trace"):
    if field in redacted:
        redacted[field] = "[REDACTED]"


P1 Badge Redact system prompts before SIEM export

When INCLUDE_SENSITIVE_SIEM_FIELDS is left at its safe default (false), _sanitize_event only redacts prompt, response_text, context, and agent_trace, so fields like system_prompt (present on ingested events) and arbitrary metadata are still forwarded to Splunk/Sentinel/Elastic in clear text. This creates an unintended data-leak path in the default configuration for malicious-event forwarding.
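One way to close that gap is to flip the redaction from a denylist to an allowlist, so unlisted fields such as `system_prompt` or arbitrary metadata are redacted by default. This is a sketch; the safe-field names are illustrative, not the PR's schema.

```python
# Allowlist-based sanitization: anything not explicitly safe is redacted.
# The field names below are illustrative assumptions.
SAFE_FIELDS = {"event_id", "verdict", "risk_score", "threat_type", "timestamp", "organization_id"}


def sanitize_event(event: dict, include_sensitive: bool = False) -> dict:
    """Redact every field not on the allowlist unless sensitive export is enabled."""
    if include_sensitive:
        return dict(event)
    return {k: (v if k in SAFE_FIELDS else "[REDACTED]") for k, v in event.items()}
```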


Comment thread services/analyzer/app.py
"""Prometheus scrape endpoint."""
if not generate_latest:
raise HTTPException(status_code=503, detail="prometheus-client not installed")
return generate_latest()


P2 Badge Return Prometheus metrics with text response type

Returning generate_latest() directly from this FastAPI handler causes the payload to go through the default JSON response pipeline rather than Prometheus text exposition, which can break scraping/parsing for /metrics. The metrics endpoint should return an explicit raw/text response type for Prometheus-format bytes.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/analyzer/app.py (1)

656-660: ⚠️ Potential issue | 🟠 Major

Pass the new analysis fields through the queue path.

The direct API path forwards organization_id, session_id, response_text, and agent_trace, but queued events only pass prompt. That means the new detection features are silently disabled for background-processed traffic.

Suggested fix
-    result = run_analysis(prompt)
+    result = run_analysis(
+        prompt=prompt,
+        organization_id=event.get("organization_id"),
+        session_id=event.get("session_id"),
+        response_text=event.get("response_text"),
+        agent_trace=event.get("agent_trace"),
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/analyzer/app.py` around lines 656 - 660, Queued event processing
only forwards `prompt`, so the new analysis fields (`organization_id`,
`session_id`, `response_text`, `agent_trace`) are missing for background jobs;
update the queue producer to include those fields in the enqueued payload and
update the consumer handler that calls `run_analysis(prompt)` to accept and
forward `organization_id`, `session_id`, `response_text`, and `agent_trace` into
`run_analysis` (or into the analysis context) and then pass them through to
`_update_and_store_event(event, event_id, result)` so the queued path mirrors
the direct API path.
🧹 Nitpick comments (2)
tests/unit/test_tenet_plugin.py (1)

118-149: Strengthen middleware assertions to catch behavioral regressions.

Line 132 and Line 149 only assert status. Add assertions for returned LLM payload/analysis fields so middleware pass-through regressions fail loudly.

Suggested test tightening
 def test_langchain_middleware_invoke():
@@
     result = middleware.invoke("hello", llm_callable=lambda **kwargs: "ok")
     assert result["status"] == "success"
+    assert result["llm_response"] == "ok"
+    assert result["analysis"]["verdict"] == "benign"
@@
 def test_llamaindex_middleware_query():
@@
     result = middleware.query("what is tenet?", llm_callable=lambda **kwargs: "ok")
     assert result["status"] == "success"
+    assert result["llm_response"] == "ok"
+    assert result["analysis"]["verdict"] == "benign"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_tenet_plugin.py` around lines 118 - 149, The tests only
assert "status" which can miss regressions; update
test_langchain_middleware_invoke and test_llamaindex_middleware_query to also
assert that the middleware passes through the LLM output and includes the Tenet
analysis fields from the session response: assert the llm_callable return value
(e.g., "ok") is present in the middleware result (check the key your middleware
uses for the LLM payload, e.g., "llm_response" or "payload") and assert the
analysis keys from DummyResponse (blocked, verdict, risk_score, event_id) are
present and equal to the expected values returned by the TenetSecurityPlugin
when invoking LangChainTenetMiddleware.invoke and
LlamaIndexTenetMiddleware.query.
tests/unit/test_detection_engine.py (1)

18-33: Add the session-isolation regression the PR is claiming to fix.

This only proves accumulation within one org. Since _session_key() namespaces by organization_id, the missing regression is reusing the same session_id across two orgs and asserting the second org starts with a clean accumulator.

Suggested test
+def test_same_session_id_is_isolated_per_organization():
+    engine = DetectionEngine(redis_client=None)
+    engine.analyze(
+        prompt="ignore previous instructions",
+        session_id="s1",
+        organization_id="org-a",
+    )
+    result = engine.analyze(
+        prompt="hello world",
+        session_id="s1",
+        organization_id="org-b",
+    )
+    assert result.details["session_events"] == 1
+    assert result.details["threat_signals"]["session"] == 0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_detection_engine.py` around lines 18 - 33, Add a new test
that verifies session isolation by reusing the same session_id across two
different organization_id values: use DetectionEngine and call analyze(...)
first with session_id="s1" and organization_id="orgA" to produce accumulated
session state, then call analyze(...) with the same session_id="s1" but
organization_id="orgB" and assert the second result does not see the first org's
accumulation (e.g., second.details["session_events"] is 1 or not >= the
accumulated count and second.details["threat_signals"]["session"] is 0.0). This
ensures the _session_key() namespacing by organization_id works and prevents
cross-org session leakage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker-compose.yml`:
- Around line 136-137: Remove the hardcoded Grafana credentials by eliminating
the literal assignments to GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD
and instead require those values be supplied securely (e.g., via environment
variables injected at deploy time, Docker secrets, or a generated secret stored
in a secret manager). Update the docker-compose service to reference the
external env/secret names rather than fixed "admin"/"admin", and add startup
validation (in orchestration or an entrypoint) to fail fast if
GF_SECURITY_ADMIN_USER or GF_SECURITY_ADMIN_PASSWORD are not provided or empty
so default, insecure credentials cannot be used.
- Around line 119-131: The docker-compose services using floating tags should be
pinned: update the prometheus service image reference (currently
"prom/prometheus:latest" under service name/prometheus block) and the grafana
service image reference ("grafana/grafana:latest" under service name/grafana) to
explicit version tags or referenced pinned env vars (e.g., PROMETHEUS_IMAGE and
GRAFANA_IMAGE) so deployments are deterministic; change the image strings to
specific semver tags (or replace with ${ENV_VAR}) and add those env vars to a
.env or compose env section to control upgrades.

In `@helm/tenet/templates/analyzer.yaml`:
- Around line 22-23: The API key is being exposed by setting env var API_KEY to
.Values.apiKey; change both analyzer.yaml and ingest.yaml to use
valueFrom.secretKeyRef instead: add a new values entry (e.g. apiSecretName) and
require it, update the container env block for the env var named API_KEY to use
valueFrom:{secretKeyRef:{name: {{ .Values.apiSecretName }}, key: "<key-name>"}}
(use a consistent key like "apiKey"), and remove direct references to
.Values.apiKey from templates; also ensure values.yaml and chart docs instruct
users to create or supply a Kubernetes Secret with that name/key rather than
putting the API key in values.yaml.

In `@helm/tenet/templates/ingest.yaml`:
- Around line 22-23: Replace the plaintext env var injection for API_KEY with a
secretKeyRef: stop setting env name API_KEY using value: "{{ required
\"values.apiKey is required\" .Values.apiKey }}", and instead use
valueFrom.secretKeyRef referencing a Kubernetes Secret (referencing a secret
name from a values key like .Values.apiKeySecretName and the secret key name,
e.g., "apiKey"); apply the same change for the analyzer template so both use
valueFrom.secretKeyRef for API_KEY rather than embedding .Values.apiKey
directly.

In `@helm/tenet/templates/redis.yaml`:
- Around line 6-19: The Redis Deployment is ephemeral; update the container spec
for the container named "redis" in helm/tenet/templates/redis.yaml to add
volumeMounts mounting /data and add a corresponding volume that references a
PersistentVolumeClaim (use a values key .Values.redis.persistence.claimName),
add the Redis arg "--appendonly yes" to enable AOF persistence, and update
values.yaml to declare redis.persistence.claimName (and any persistence.enabled
flag if desired) so the chart can supply the PVC name; ensure you reference the
existing .Values.redis.image, .Values.imagePullPolicy and
.Values.redis.servicePort entries when making changes.

In `@services/analyzer/app.py`:
- Around line 368-404: The current code always returns a hard-coded fail-open
AnalysisResponse when the circuit_breaker is open or engine.analyze raises,
which is insecure; make the degraded-mode behavior configurable (e.g., a
fallback_mode or FAIL_OPEN flag) and change the two fallback branches (the
circuit_breaker.is_open block and the except Exception block around
engine.analyze) to consult that configuration: support at least "fail_open"
(current benign response), "degraded_suspicious" (return a mid/high risk_score,
verdict="suspicious"/threat_type as None, higher confidence, and
details={"method":"fallback_suspicious",...}), and "fail_closed"/"block" (return
risk_score=1.0, verdict="malicious"); ensure you call
circuit_breaker.record_failure on exception and record_success on success as
before, include clear logging that mentions the chosen fallback_mode, and keep
AnalysisResponse construction centralized so both fallback paths use the same
decision logic.
- Around line 597-600: The code currently awaits siem_dispatcher.publish inline
when result.verdict == "malicious", which can block the consumer; change this to
decouple SIEM delivery by enqueuing the event for asynchronous delivery instead
of awaiting the publish call directly. Specifically, keep the existing
redis_client.lpush("tenet:alerts", json.dumps(event)) and logger.warning lines,
then replace await siem_dispatcher.publish(event) with a non-blocking handoff
(e.g., push a message to a dedicated outbound queue or start a background task
via asyncio.create_task(...) that calls siem_dispatcher.publish), and ensure any
created background worker/consumer handles retries/errors so the consumer path
(the code checking result.verdict) never awaits external connector I/O.
- Around line 130-155: Import and instantiate a dedicated CollectorRegistry from
prometheus_client when available, and use it when creating the module-level
metrics ANALYZE_REQUESTS and ANALYZE_LATENCY (i.e., pass registry=your_registry
into Counter and Histogram constructors) so each import gets its own registry;
also update any use of generate_latest to call generate_latest(registry) when
exporting metrics, and keep the existing fallback to None if prometheus_client
isn't installed.
- Around line 257-262: The /metrics endpoint (decorated by `@app.get`("/metrics")
in the metrics() function) currently returns raw bytes from generate_latest()
without a Content-Type; update it to return a FastAPI/Starlette Response object
containing generate_latest() as the body and set the Content-Type exactly to
"text/plain; version=1.0.0; charset=utf-8" (preserve the existing HTTPException
branch when generate_latest is missing); ensure you import Response if not
already present and pass the bytes from generate_latest() as the response
content.
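The non-blocking SIEM handoff suggested above can be sketched with `asyncio.create_task`; the retry helper, its parameters, and the backoff policy are assumptions, not the PR's code.

```python
import asyncio


async def publish_with_retries(publish, event, attempts=3, base_delay=0.5):
    """Retry with exponential backoff so transient connector failures
    never propagate back into the analyze path."""
    for attempt in range(attempts):
        try:
            await publish(event)
            return True
        except Exception:
            await asyncio.sleep(base_delay * (2 ** attempt))
    return False


def dispatch_siem(publish, event):
    # Fire-and-forget handoff: the caller never awaits external connector I/O.
    return asyncio.create_task(publish_with_retries(publish, event))
```

A production version would also keep a reference to the task (or use a dedicated outbound queue) so delivery failures are observable rather than silently dropped.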

In `@services/analyzer/detection_engine.py`:
- Around line 36-37: Replace the unbounded dict self._session_cache with a
bounded TTL-aware cache to avoid unbounded memory growth: instantiate a cache
like cachetools.TTLCache(maxsize=..., ttl=...) (or implement an LRU+TTL map) in
the DetectionEngine constructor where self._session_cache is currently set,
update all accesses that read/write session entries (the code around the blocks
referenced at lines ~168-170 and ~209-223) to use the cache API, and protect
concurrent access with a threading.Lock or use cachetools' thread-safe wrapper;
choose sensible maxsize/ttl values and add a small cleanup/background eviction
if your cache implementation requires it.
- Around line 307-319: share_threat_pattern currently mutates self.threat_intel
in-memory and rewrites intel_path directly, causing lost updates under
concurrency; fix by performing a read-modify-write with synchronization: acquire
a lock (process-wide file lock or a shared threading/multiprocess lock) at the
start of share_threat_pattern, reload the current JSON from intel_path into a
local payload (fallback to {"patterns": []}), merge the new pattern into payload
(optionally dedupe), write payload to a temporary file and atomically replace
intel_path (os.replace) to avoid partial writes, update self.threat_intel from
the freshly written payload, then release the lock so concurrent callers see the
merged result.
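The locked read-modify-write with an atomic replace described above can be sketched with the stdlib alone. The function name is an assumption, and the process-wide `Lock` is a simplification (multi-process safety would additionally need a file lock).

```python
import json
import os
import tempfile
from threading import Lock

_intel_lock = Lock()  # process-wide; a file lock is needed for multi-process safety


def share_pattern(intel_path: str, pattern: dict) -> dict:
    """Merge a pattern into the intel file under a lock, writing atomically."""
    with _intel_lock:
        # Reload under the lock so concurrent writers don't lose updates.
        try:
            with open(intel_path) as fh:
                payload = json.load(fh)
        except (FileNotFoundError, json.JSONDecodeError):
            payload = {"patterns": []}
        payload["patterns"].append(pattern)
        # Write to a temp file in the same directory, then atomically replace.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(intel_path)))
        try:
            with os.fdopen(fd, "w") as fh:
                json.dump(payload, fh)
            os.replace(tmp, intel_path)
        except BaseException:
            os.unlink(tmp)
            raise
        return payload
```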

In `@services/analyzer/policy/default_policy.yaml`:
- Around line 27-34: The current response_scanning.data_leak_patterns use
literal labels (api_key, private key, ssn, aws_secret_access_key, etc.) which
only match those words; replace those entries with actual leak
signatures/regexes that detect common shapes (e.g., SSN patterns
\d{3}-\d{2}-\d{4} or contiguous 9 digits, credit card BIN/Luhn patterns, AWS
access key ID pattern AKIA[A-Z0-9]{16} and secret-like base64/long key
patterns, PEM BEGIN/END blocks, JWT-looking three-part base64 strings, OAuth
tokens), keeping the same keys (response_scanning, data_leak_patterns) but
storing regular expressions or signature strings for each logical detector
(api_key, ssn, credit_card_number, aws_secret_access_key, private_key) and
adjust scores as needed so the analyzer's matching logic sees real leaks instead
of label-only matches.
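
Shape-based detectors of the kind described above can be expressed as compiled regexes; the patterns and scores below are illustrative starting points, not vetted signatures, regardless of whether they end up in the YAML policy or in code:

```python
import re

# Illustrative leak-shape detectors; tune patterns and scores to your data.
DATA_LEAK_PATTERNS = {
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.9),
    "credit_card_number": (re.compile(r"\b(?:\d[ -]?){13,16}\b"), 0.88),
    # Note: AKIA... is the AWS access key ID shape; the secret is a 40-char blob.
    "aws_access_key_id": (re.compile(r"\bAKIA[A-Z0-9]{16}\b"), 0.95),
    "aws_secret_access_key": (re.compile(r"\b[A-Za-z0-9/+=]{40}\b"), 0.99),
    "private_key": (re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"), 0.99),
    "jwt": (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b"), 0.85),
}


def scan_response(text: str) -> dict:
    """Return detector -> score for every leak shape that matches."""
    return {
        name: score
        for name, (rx, score) in DATA_LEAK_PATTERNS.items()
        if rx.search(text)
    }
```

A raw SSN or key now trips the scanner even when no label like "api_key" appears nearby, which is exactly the gap the comment describes.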

In `@services/analyzer/reliability.py`:
- Around line 23-25: The circuit-breaker timeout branch resets only
self._failures but leaves self._opened_at set, preventing future opens; update
the timeout handling to also clear or reset self._opened_at (e.g., set to None
or 0) so the next failure cycle can set a new open timestamp, and ensure the
open-path logic that sets self._opened_at (referenced around lines checking
self._opened_at and setting it on open) still sets time.monotonic() when
opening; in short, reset both self._failures and self._opened_at in the timeout
block and keep the existing guard that sets self._opened_at on subsequent opens.
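
A minimal breaker illustrating the full reset described above; this is a sketch, not the actual `reliability.py` class, and the injectable clock exists only to make the behavior testable:

```python
import time


class CircuitBreaker:
    """Minimal breaker showing both counters resetting after the timeout."""

    def __init__(self, failure_threshold=5, reset_timeout_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0

    @property
    def is_open(self) -> bool:
        if self._failures < self.failure_threshold:
            return False
        if self._clock() - self._opened_at > self.reset_timeout_seconds:
            self._failures = 0
            self._opened_at = 0.0  # clear the stale timestamp so the next open sticks
            return False
        return True

    def record_failure(self) -> None:
        self._failures += 1
        # Existing guard: only stamp the open time once per open cycle.
        if self._failures >= self.failure_threshold and not self._opened_at:
            self._opened_at = self._clock()

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = 0.0
```

Without the `_opened_at = 0.0` line in `is_open`, the stale timestamp makes every later failure cycle appear already timed out, which is the degradation the comment flags.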

In `@services/analyzer/siem_connectors.py`:
- Line 25: The code sets self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS",
"2.5")) without validation, which will raise ValueError on malformed env values;
update the initializer where self.timeout is set to parse the environment value
with a safe try/except (or use a helper like parse_timeout), logging a warning
via the existing logger and falling back to the default 2.5 seconds when
conversion fails or the value is non-positive; reference the
SIEM_TIMEOUT_SECONDS env var, the self.timeout attribute, and the os.getenv call
to locate and replace the current assignment.
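
A defensive parse of the env var might look like this; `parse_timeout` is the helper name suggested in the comment, not existing code:

```python
import logging
import os

logger = logging.getLogger(__name__)


def parse_timeout(env_var: str = "SIEM_TIMEOUT_SECONDS", default: float = 2.5) -> float:
    """Parse a positive float from the environment, falling back on bad input."""
    raw = os.getenv(env_var, str(default)).strip()
    try:
        value = float(raw)
    except ValueError:
        logger.warning("Invalid %s=%r; using default %.1fs", env_var, raw, default)
        return default
    if value <= 0:
        logger.warning("Non-positive %s=%r; using default %.1fs", env_var, raw, default)
        return default
    return value
```

The initializer then becomes `self.timeout = parse_timeout()`, so a malformed deployment variable degrades to the default instead of crashing startup.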

In `@tenet_plugin/client.py`:
- Around line 178-185: invoke() and query() are dropping the user input because
secure_call only forwards llm_kwargs to the downstream llm_callable; include the
user text in the forwarded args so the LLM actually receives it. Modify the
calls to secure_call (from invoke and query) to inject the prompt/query_text
into llm_kwargs (e.g., set llm_kwargs["prompt"] = prompt or llm_kwargs["query"]
= query_text) or otherwise pass both the text and llm_kwargs through so
llm_callable receives the input; update the references to invoke, query,
secure_call, prompt, query_text, llm_callable, and llm_kwargs accordingly.

In `@tests/integration/test_latency_sla.py`:
- Around line 23-24: The current hard 20ms p99 check (p99 computed from
quantiles(samples_ms, n=100)) is flaky on CI; change the assertion to use a
relaxed or configurable threshold instead of a fixed 20.0ms: make the threshold
configurable via an env var (e.g., CI_LATENCY_SLA_MS) with a sensible default
(e.g., 50.0), or compute a dynamic threshold from a stored baseline before
asserting, and update the assert that references p99 to compare against that
configurable/dynamic threshold rather than the literal 20.0.

---

Outside diff comments:
In `@services/analyzer/app.py`:
- Around line 656-660: Queued event processing only forwards `prompt`, so the
new analysis fields (`organization_id`, `session_id`, `response_text`,
`agent_trace`) are missing for background jobs; update the queue producer to
include those fields in the enqueued payload and update the consumer handler
that calls `run_analysis(prompt)` to accept and forward `organization_id`,
`session_id`, `response_text`, and `agent_trace` into `run_analysis` (or into
the analysis context) and then pass them through to
`_update_and_store_event(event, event_id, result)` so the queued path mirrors
the direct API path.
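
The producer/consumer symmetry described above can be sketched as follows, with a plain list standing in for the real queue and an assumed `run_analysis` keyword signature:

```python
import json


def enqueue_analysis(queue, event_id: str, prompt: str, *, organization_id=None,
                     session_id=None, response_text=None, agent_trace=None):
    """Producer: carry the full analysis context, not just the prompt."""
    queue.append(json.dumps({
        "event_id": event_id,
        "prompt": prompt,
        "organization_id": organization_id,
        "session_id": session_id,
        "response_text": response_text,
        "agent_trace": agent_trace,
    }))


def consume_analysis(queue, run_analysis):
    """Consumer: forward every context field so the queued path mirrors the API path."""
    payload = json.loads(queue.pop(0))
    result = run_analysis(
        payload["prompt"],
        organization_id=payload["organization_id"],
        session_id=payload["session_id"],
        response_text=payload["response_text"],
        agent_trace=payload["agent_trace"],
    )
    return payload["event_id"], result
```

Keeping the payload a superset of the direct-API request body means future analysis fields only need to be added in one place.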

---

Nitpick comments:
In `@tests/unit/test_detection_engine.py`:
- Around line 18-33: Add a new test that verifies session isolation by reusing
the same session_id across two different organization_id values: use
DetectionEngine and call analyze(...) first with session_id="s1" and
organization_id="orgA" to produce accumulated session state, then call
analyze(...) with the same session_id="s1" but organization_id="orgB" and assert
the second result does not see the first org's accumulation (e.g.,
second.details["session_events"] is 1 or not >= the accumulated count and
second.details["threat_signals"]["session"] is 0.0). This ensures the
_session_key() namespacing by organization_id works and prevents cross-org
session leakage.
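
The org-scoped namespacing that test should exercise can be illustrated with a toy tracker; the names are illustrative and the real engine's `_session_key` may differ:

```python
def _session_key(organization_id: str, session_id: str) -> str:
    """Namespace session state by tenant so orgs never share accumulation."""
    return f"{organization_id}:{session_id}"


class SessionTracker:
    """Minimal stand-in for the engine's per-session event accumulation."""

    def __init__(self):
        self._sessions = {}

    def record(self, organization_id: str, session_id: str) -> int:
        key = _session_key(organization_id, session_id)
        self._sessions[key] = self._sessions.get(key, 0) + 1
        return self._sessions[key]
```

The assertion the nitpick asks for reduces to: the same `session_id` under a second org must start its count at 1.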

In `@tests/unit/test_tenet_plugin.py`:
- Around line 118-149: The tests only assert "status" which can miss
regressions; update test_langchain_middleware_invoke and
test_llamaindex_middleware_query to also assert that the middleware passes
through the LLM output and includes the Tenet analysis fields from the session
response: assert the llm_callable return value (e.g., "ok") is present in the
middleware result (check the key your middleware uses for the LLM payload, e.g.,
"llm_response" or "payload") and assert the analysis keys from DummyResponse
(blocked, verdict, risk_score, event_id) are present and equal to the expected
values returned by the TenetSecurityPlugin when invoking
LangChainTenetMiddleware.invoke and LlamaIndexTenetMiddleware.query.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b2772598-bd45-4d10-87bd-ca24cdda09ae

📥 Commits

Reviewing files that changed from the base of the PR and between f8fd067 and 1146bcb.

📒 Files selected for processing (22)
  • README.md
  • docker-compose.yml
  • helm/tenet/Chart.yaml
  • helm/tenet/templates/analyzer.yaml
  • helm/tenet/templates/ingest.yaml
  • helm/tenet/templates/redis.yaml
  • helm/tenet/values.yaml
  • monitoring/alerts.yml
  • monitoring/grafana/provisioning/datasources/prometheus.yml
  • monitoring/prometheus.yml
  • requirements.txt
  • services/analyzer/app.py
  • services/analyzer/detection_engine.py
  • services/analyzer/policy/default_policy.yaml
  • services/analyzer/reliability.py
  • services/analyzer/siem_connectors.py
  • tenet_plugin/__init__.py
  • tenet_plugin/client.py
  • tests/integration/test_latency_sla.py
  • tests/unit/test_detection_engine.py
  • tests/unit/test_reliability.py
  • tests/unit/test_tenet_plugin.py

Comment thread docker-compose.yml
Comment on lines +119 to +131
image: prom/prometheus:latest
container_name: tenet-prometheus
ports:
- "9090:9090"
volumes:
- ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./monitoring/alerts.yml:/etc/prometheus/alerts.yml:ro
networks:
- tenet-network
restart: unless-stopped

grafana:
image: grafana/grafana:latest

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify floating image tags in compose manifests.
rg -n --iglob 'docker-compose*.yml' '^\s*image:\s*[^#\s]+:latest\b'

Repository: TENET-DEV-AI/TENET-AI

Length of output: 230


Pin monitoring and object storage image versions instead of latest.

Lines 45, 119, and 131 use floating tags. Pin explicit versions (or pinned env vars) for deterministic deploys and controlled upgrades.

Suggested fix
-    image: minio/minio:latest
+    image: minio/minio:${MINIO_VERSION:?pin version}
@@
-    image: prom/prometheus:latest
+    image: prom/prometheus:${PROMETHEUS_VERSION:?pin version}
@@
-    image: grafana/grafana:latest
+    image: grafana/grafana:${GRAFANA_VERSION:?pin version}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker-compose.yml` around lines 119 - 131, The docker-compose services using
floating tags should be pinned: update the prometheus service image reference
(currently "prom/prometheus:latest" under service name/prometheus block) and the
grafana service image reference ("grafana/grafana:latest" under service
name/grafana) to explicit version tags or referenced pinned env vars (e.g.,
PROMETHEUS_IMAGE and GRAFANA_IMAGE) so deployments are deterministic; change the
image strings to specific semver tags (or replace with ${ENV_VAR}) and add those
env vars to a .env or compose env section to control upgrades.

Comment thread docker-compose.yml
Comment on lines +136 to +137
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin

⚠️ Potential issue | 🟠 Major

Do not ship default Grafana admin credentials.

Line 136-Line 137 hardcode admin/admin. This is an unsafe default and can be exploited if the port is reachable.

Suggested fix
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:?set GRAFANA_ADMIN_USER}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?set GRAFANA_ADMIN_PASSWORD}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker-compose.yml` around lines 136 - 137, Remove the hardcoded Grafana
credentials by eliminating the literal assignments to GF_SECURITY_ADMIN_USER and
GF_SECURITY_ADMIN_PASSWORD and instead require those values be supplied securely
(e.g., via environment variables injected at deploy time, Docker secrets, or a
generated secret stored in a secret manager). Update the docker-compose service
to reference the external env/secret names rather than fixed "admin"/"admin",
and add startup validation (in orchestration or an entrypoint) to fail fast if
GF_SECURITY_ADMIN_USER or GF_SECURITY_ADMIN_PASSWORD are not provided or empty
so default, insecure credentials cannot be used.

Comment thread helm/tenet/templates/analyzer.yaml
Comment on lines +22 to +23
- name: API_KEY
value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and read the analyzer.yaml file
find . -name "analyzer.yaml" -type f | head -5

Repository: TENET-DEV-AI/TENET-AI

Length of output: 101


🏁 Script executed:

# Let's also check the directory structure to understand the helm chart layout
ls -la helm/tenet/templates/ 2>/dev/null | head -20

Repository: TENET-DEV-AI/TENET-AI

Length of output: 355


🏁 Script executed:

# Search for secret-related patterns in the helm chart
rg -i "secret|apikey|password" helm/tenet/templates/ --type yaml -n

Repository: TENET-DEV-AI/TENET-AI

Length of output: 304


🏁 Script executed:

cat -n helm/tenet/templates/analyzer.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 1246


🏁 Script executed:

cat -n helm/tenet/templates/ingest.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 1226


🏁 Script executed:

# Check if there's a values.yaml file
find . -name "values.yaml" -type f | head -5

Repository: TENET-DEV-AI/TENET-AI

Length of output: 89


🏁 Script executed:

# Check for any secret resources in the helm chart
rg -i "kind.*secret|apiVersion.*v1" helm/ --type yaml -A 2

Repository: TENET-DEV-AI/TENET-AI

Length of output: 937


🏁 Script executed:

cat -n helm/tenet/values.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 381


🏁 Script executed:

# Also check redis.yaml to see if it handles secrets differently
cat -n helm/tenet/templates/redis.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 891


API key exposed as plaintext in environment variables.

Line 23 directly injects .Values.apiKey into env vars, exposing the secret in rendered manifests and Helm release history. Replace with secretKeyRef to reference a Kubernetes Secret resource instead.

Note: This issue also exists in ingest.yaml:23 and requires fixes in both files.

Suggested fix
           env:
             - name: API_KEY
-              value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
+              valueFrom:
+                secretKeyRef:
+                  name: {{ required "values.apiKeySecret.name is required" .Values.apiKeySecret.name }}
+                  key: {{ default "api-key" .Values.apiKeySecret.key }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@helm/tenet/templates/analyzer.yaml` around lines 22 - 23, The API key is
being exposed by setting env var API_KEY to .Values.apiKey; change both
analyzer.yaml and ingest.yaml to use valueFrom.secretKeyRef instead: add a new
values entry (e.g. apiSecretName) and require it, update the container env block
for the env var named API_KEY to use valueFrom:{secretKeyRef:{name: {{
.Values.apiSecretName }}, key: "<key-name>"}} (use a consistent key like
"apiKey"), and remove direct references to .Values.apiKey from templates; also
ensure values.yaml and chart docs instruct users to create or supply a
Kubernetes Secret with that name/key rather than putting the API key in
values.yaml.

Comment thread helm/tenet/templates/ingest.yaml
Comment on lines +22 to +23
- name: API_KEY
value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify API_KEY is currently injected as plaintext and secretKeyRef support is absent/present.
rg -n 'name:\s*API_KEY|values\.apiKey|secretKeyRef|kind:\s*Secret' helm/tenet/templates helm/tenet/values.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 434


🏁 Script executed:

# Check values.yaml for apiKey configuration and see full file structure
head -50 helm/tenet/values.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 276


🏁 Script executed:

# Look for any Secret resource definitions in the helm chart
find helm/tenet/templates -type f -name "*.yaml" -o -name "*.yml" | xargs grep -l "kind: Secret"

Repository: TENET-DEV-AI/TENET-AI

Length of output: 49


🏁 Script executed:

# Check for all instances of apiKey usage across the entire helm chart
rg -n 'apiKey|API_KEY' helm/tenet/ --type yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 471


🏁 Script executed:

# Search for other common secret patterns in env values (password, token, secret, credential, etc.)
rg -n 'value:\s*"\{{\s*.*\.(password|token|secret|credential|key|apikey)' helm/tenet/templates --type yaml -i

Repository: TENET-DEV-AI/TENET-AI

Length of output: 225


🏁 Script executed:

# Check the complete analyzer.yaml and ingest.yaml to see full env configuration
cat -n helm/tenet/templates/ingest.yaml | sed -n '15,35p'

Repository: TENET-DEV-AI/TENET-AI

Length of output: 876


🏁 Script executed:

# Check analyzer.yaml env configuration
cat -n helm/tenet/templates/analyzer.yaml | sed -n '15,35p'

Repository: TENET-DEV-AI/TENET-AI

Length of output: 888


🏁 Script executed:

# Simpler search for plaintext value injections of sensitive sounding names
rg -n 'name:\s*(API_KEY|PASSWORD|TOKEN|SECRET)' helm/tenet/templates --type yaml -A 1

Repository: TENET-DEV-AI/TENET-AI

Length of output: 437


Apply secretKeyRef for API_KEY to prevent plaintext secret leaks.

ingest.yaml lines 22-23 and analyzer.yaml lines 22-23 both inject .Values.apiKey as plaintext values, exposing secrets in rendered manifests and Helm release history. Replace with valueFrom.secretKeyRef to reference a Kubernetes Secret instead.

Suggested fix
           env:
             - name: API_KEY
-              value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
+              valueFrom:
+                secretKeyRef:
+                  name: {{ required "values.apiKeySecret.name is required" .Values.apiKeySecret.name }}
+                  key: {{ default "api-key" .Values.apiKeySecret.key }}

Apply this change to both ingest.yaml and analyzer.yaml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@helm/tenet/templates/ingest.yaml` around lines 22 - 23, Replace the plaintext
env var injection for API_KEY with a secretKeyRef: stop setting env name API_KEY
using value: "{{ required \"values.apiKey is required\" .Values.apiKey }}", and
instead use valueFrom.secretKeyRef referencing a Kubernetes Secret (referencing
a secret name from a values key like .Values.apiKeySecretName and the secret key
name, e.g., "apiKey"); apply the same change for the analyzer template so both
use valueFrom.secretKeyRef for API_KEY rather than embedding .Values.apiKey
directly.

Comment thread helm/tenet/templates/redis.yaml
Comment on lines +6 to +19
replicas: 1
selector:
matchLabels: { app: tenet-redis }
template:
metadata:
labels: { app: tenet-redis }
spec:
containers:
- name: redis
image: {{ .Values.redis.image }}
imagePullPolicy: {{ .Values.imagePullPolicy }}
ports:
- containerPort: {{ .Values.redis.servicePort }}
---

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify Redis template lacks persistence primitives and chart lacks PVC definition.
rg -n 'volumeMounts|persistentVolumeClaim|claimName|appendonly|/data' helm/tenet/templates/redis.yaml
rg -n 'kind:\s*PersistentVolumeClaim|redis\.persistence|claimName' helm/tenet

Repository: TENET-DEV-AI/TENET-AI

Length of output: 47


🏁 Script executed:

cat -n helm/tenet/templates/redis.yaml

Repository: TENET-DEV-AI/TENET-AI

Length of output: 891


🏁 Script executed:

cat -n helm/tenet/values.yaml | head -100

Repository: TENET-DEV-AI/TENET-AI

Length of output: 381


🏁 Script executed:

find helm/tenet -type f \( -name "*.yaml" -o -name "*.yml" \) | xargs grep -l "PersistentVolumeClaim\|PersistentVolume\|StorageClass" 2>/dev/null

Repository: TENET-DEV-AI/TENET-AI

Length of output: 49


Add Redis persistence; current deployment is ephemeral.

The Redis Deployment (lines 6-19) has no volumeMounts or volumes configured, and values.yaml contains no redis.persistence settings. Pod restarts will lose all runtime state, which breaks reliability for queue or session usage.

Add volumeMounts with /data mount point, configure Redis with --appendonly yes, and mount a PersistentVolumeClaim. Update values.yaml to include redis.persistence.claimName configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@helm/tenet/templates/redis.yaml` around lines 6 - 19, The Redis Deployment is
ephemeral; update the container spec for the container named "redis" in
helm/tenet/templates/redis.yaml to add volumeMounts mounting /data and add a
corresponding volume that references a PersistentVolumeClaim (use a values key
.Values.redis.persistence.claimName), add the Redis arg "--appendonly yes" to
enable AOF persistence, and update values.yaml to declare
redis.persistence.claimName (and any persistence.enabled flag if desired) so the
chart can supply the PVC name; ensure you reference the existing
.Values.redis.image, .Values.imagePullPolicy and .Values.redis.servicePort
entries when making changes.

Comment thread services/analyzer/policy/default_policy.yaml
Comment on lines +27 to +34
response_scanning:
data_leak_patterns:
api_key: 0.95
private key: 0.99
ssn: 0.9
social security number: 0.92
credit card number: 0.88
aws_secret_access_key: 0.99

⚠️ Potential issue | 🟠 Major

Use real leak signatures here, not just leak labels.

These keys only match outputs that literally contain strings like api_key, ssn, or credit card number. A raw SSN, card number, AWS secret, or PEM block without those labels will pass straight through, which makes the new response-scanning feature miss the most common leak shape.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/analyzer/policy/default_policy.yaml` around lines 27 - 34, The
current response_scanning.data_leak_patterns use literal labels (api_key,
private key, ssn, aws_secret_access_key, etc.) which only match those words;
replace those entries with actual leak signatures/regexes that detect common
shapes (e.g., SSN patterns \d{3}-\d{2}-\d{4} or contiguous 9 digits, credit card
BIN/Luhn patterns, the AWS access key ID pattern AKIA[A-Z0-9]{16} and
secret-like base64/long key patterns, PEM BEGIN/END blocks, JWT-looking
three-part base64 strings, OAuth tokens), keeping the same keys
(response_scanning, data_leak_patterns) but storing regular expressions or
signature strings for each logical detector (api_key, ssn, credit_card_number,
aws_secret_access_key, private_key) and adjust scores as needed so the
analyzer.matching logic sees real leaks instead of label-only matches.

Comment thread services/analyzer/reliability.py
Comment on lines +23 to +25
if time.monotonic() - self._opened_at > self.reset_timeout_seconds:
self._failures = 0
return False

⚠️ Potential issue | 🔴 Critical

Circuit breaker does not fully reset open-state after timeout.

On Line 24, only _failures is reset. Because _opened_at is not reset, the guard on Line 34 prevents setting a fresh open timestamp on later failure cycles, so reopen behavior degrades after the first timeout.

💡 Proposed fix
     `@property`
     def is_open(self) -> bool:
         if self._failures < self.failure_threshold:
             return False
         if time.monotonic() - self._opened_at > self.reset_timeout_seconds:
             self._failures = 0
+            self._opened_at = 0.0
             return False
         return True

Also applies to: 34-35

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/analyzer/reliability.py` around lines 23 - 25, The circuit-breaker
timeout branch resets only self._failures but leaves self._opened_at set,
preventing future opens; update the timeout handling to also clear or reset
self._opened_at (e.g., set to None or 0) so the next failure cycle can set a new
open timestamp, and ensure the open-path logic that sets self._opened_at
(referenced around lines checking self._opened_at and setting it on open) still
sets time.monotonic() when opening; in short, reset both self._failures and
self._opened_at in the timeout block and keep the existing guard that sets
self._opened_at on subsequent opens.

Comment thread services/analyzer/siem_connectors.py
self.sentinel_token = os.getenv("SENTINEL_TOKEN", "").strip()
self.elastic_url = os.getenv("ELASTIC_INGEST_URL", "").strip()
self.elastic_api_key = os.getenv("ELASTIC_API_KEY", "").strip()
self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5"))

⚠️ Potential issue | 🟠 Major

Invalid SIEM_TIMEOUT_SECONDS can crash analyzer startup.

Line 25 trusts environment parsing without a guard. A malformed value causes ValueError at init time.

💡 Proposed fix
-        self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5"))
+        timeout_raw = os.getenv("SIEM_TIMEOUT_SECONDS", "2.5").strip()
+        try:
+            self.timeout = max(0.1, float(timeout_raw))
+        except ValueError:
+            logger.warning("Invalid SIEM_TIMEOUT_SECONDS=%r; defaulting to 2.5", timeout_raw)
+            self.timeout = 2.5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/analyzer/siem_connectors.py` at line 25, The code sets self.timeout
= float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5")) without validation, which will
raise ValueError on malformed env values; update the initializer where
self.timeout is set to parse the environment value with a safe try/except (or
use a helper like parse_timeout), logging a warning via the existing logger and
falling back to the default 2.5 seconds when conversion fails or the value is
non-positive; reference the SIEM_TIMEOUT_SECONDS env var, the self.timeout
attribute, and the os.getenv call to locate and replace the current assignment.

Comment thread tenet_plugin/client.py
Comment on lines +178 to +185
def invoke(self, prompt: str, llm_callable: Callable[..., Any], **llm_kwargs):
return self.plugin.secure_call(
prompt=prompt,
model=self.model,
llm_callable=llm_callable,
llm_kwargs=llm_kwargs,
source_type="langchain",
)

⚠️ Potential issue | 🟠 Major

Middleware drops user input before LLM invocation.

invoke()/query() pass prompt/query_text for inspection, but the downstream LLM callable still receives only llm_kwargs (via secure_call), which can cause runtime failures or empty-input calls.

💡 Proposed fix
 class LangChainTenetMiddleware:
@@
     def invoke(self, prompt: str, llm_callable: Callable[..., Any], **llm_kwargs):
         return self.plugin.secure_call(
             prompt=prompt,
             model=self.model,
-            llm_callable=llm_callable,
+            llm_callable=lambda **kwargs: llm_callable(prompt, **kwargs),
             llm_kwargs=llm_kwargs,
             source_type="langchain",
         )
@@
 class LlamaIndexTenetMiddleware:
@@
     def query(self, query_text: str, llm_callable: Callable[..., Any], **llm_kwargs):
         return self.plugin.secure_call(
             prompt=query_text,
             model=self.model,
-            llm_callable=llm_callable,
+            llm_callable=lambda **kwargs: llm_callable(query_text, **kwargs),
             llm_kwargs=llm_kwargs,
             source_type="llamaindex",
         )

Also applies to: 195-202

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tenet_plugin/client.py` around lines 178 - 185, invoke() and query() are
dropping the user input because secure_call only forwards llm_kwargs to the
downstream llm_callable; include the user text in the forwarded args so the LLM
actually receives it. Modify the calls to secure_call (from invoke and query) to
inject the prompt/query_text into llm_kwargs (e.g., set llm_kwargs["prompt"] =
prompt or llm_kwargs["query"] = query_text) or otherwise pass both the text and
llm_kwargs through so llm_callable receives the input; update the references to
invoke, query, secure_call, prompt, query_text, llm_callable, and llm_kwargs
accordingly.

Comment thread tests/integration/test_latency_sla.py
Comment on lines +23 to +24
p99 = quantiles(samples_ms, n=100)[98]
assert p99 < 20.0, f"p99 was {p99:.3f}ms"

⚠️ Potential issue | 🟠 Major

Fixed 20ms wall-clock gate is likely CI-flaky.

Line 24 enforces a strict hardware-dependent threshold. On shared runners this can fail intermittently and mask real regressions with noise.

Suggested stabilization
+import os
@@
-    p99 = quantiles(samples_ms, n=100)[98]
-    assert p99 < 20.0, f"p99 was {p99:.3f}ms"
+    p99 = quantiles(samples_ms, n=100)[98]
+    sla_ms = float(os.getenv("TENET_HEURISTIC_P99_SLA_MS", "20.0"))
+    assert p99 < sla_ms, f"p99 was {p99:.3f}ms (SLA {sla_ms:.3f}ms)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_latency_sla.py` around lines 23 - 24, The current hard
20ms p99 check (p99 computed from quantiles(samples_ms, n=100)) is flaky on CI;
change the assertion to use a relaxed or configurable threshold instead of a
fixed 20.0ms: make the threshold configurable via an env var (e.g.,
CI_LATENCY_SLA_MS) with a sensible default (e.g., 50.0), or compute a dynamic
threshold from a stored baseline before asserting, and update the assert that
references p99 to compare against that configurable/dynamic threshold rather
than the literal 20.0.

@github-actions

🤖 TENET Agent Review

📋 Summary

This pull request introduces a comprehensive set of new features and architectural improvements to the TENET AI project, transforming it into a more robust and observable LLM security middleware. Key additions include a policy-driven DetectionEngine for advanced threat detection (multi-turn, agentic, response scanning, collective threat intel), critical reliability features like a circuit breaker and observability (Prometheus, Grafana, OpenTelemetry), and enterprise integrations (SIEM connectors with sensitive data redaction). The PR also simplifies adoption with LangChain/LlamaIndex middleware and provides a Helm chart for air-gapped deployments. The approach is sound, significantly enhancing the system's defensive capabilities and operational maturity.

🔐 Security Findings

  • [LOW] docker-compose.yml - The Grafana service is configured with default admin:admin credentials. While common for local development stacks, this is an insecure default that should be explicitly highlighted for production deployments or replaced with more secure practices (e.g., environment variables, secrets management) if this compose file is ever adapted for non-local use.
  • [LOW] services/analyzer/app.py - The run_analysis function implements a "fallback_allow" mechanism when the DetectionEngine fails or the circuit breaker is open. This prioritizes availability over strict security in failure scenarios, meaning potential attacks might be allowed through if the detection engine is unhealthy. This is a deliberate design choice, but its security implications should be clearly understood and documented for operators.

🧹 Code Quality

  • services/analyzer/detection_engine.py - The DetectionEngine uses an in-memory dictionary (_session_cache) for multi-turn session tracking (L100). This design limits scalability and persistence, as session state will be lost on pod restarts or across multiple analyzer replicas. Consider leveraging the redis_client (which is already passed to the engine) for a distributed and persistent session store.
  • services/analyzer/app.py - The run_analysis function (L265) has grown significantly in complexity due to the integration of the new engine, circuit breaker, and fallback logic. While necessary, consider refactoring parts of this function into smaller, more focused helper methods to improve readability and maintainability.
  • services/analyzer/detection_engine.py - The _model_score method (L116) handles transformer model output by checking for specific terms ("malicious", "attack", "unsafe", "injection", "benign", "safe", "clean"). This string-based parsing could be brittle if model output labels change. Consider a more robust mapping or configuration for model output interpretation.

✅ What's Done Well

  • Comprehensive Security Enhancements: The introduction of a policy-driven DetectionEngine with multi-turn session tracking, agent chain monitoring, response scanning, and collective threat intelligence significantly elevates the system's ability to detect sophisticated LLM attacks.
  • Robust Observability and Reliability: The integration of Prometheus, Grafana, OpenTelemetry, and a circuit breaker provides essential operational visibility and graceful degradation, crucial for maintaining service SLAs in production environments.
  • Strong Data Privacy for SIEM Integration: The SIEMDispatcher's default redaction of sensitive fields (prompt, response_text, context, agent_trace) before publishing to SIEMs is an excellent security control, minimizing the risk of data leaks to external systems.

📝 Overall Verdict

REQUEST CHANGES - Address the in-memory session cache limitation for scalability and persistence.
