Add detection engine, reliability, SIEM, telemetry, Helm chart, middleware, and tests #72
S3DFX-CYBER wants to merge 2 commits into
Conversation
📝 Walkthrough

The PR introduces comprehensive observability, reliability, and advanced threat detection capabilities. It adds Prometheus/Grafana monitoring infrastructure, Helm charts for air-gapped deployment, a multi-signal DetectionEngine with policy-based scoring, a CircuitBreaker failover mechanism, SIEM integration, OpenTelemetry tracing, and LangChain/LlamaIndex middleware. The analyzer service expands with latency instrumentation, threat-intel endpoints, feedback handling, and session-based risk accumulation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant App as Analyzer App
    participant CB as CircuitBreaker
    participant Engine as DetectionEngine
    participant Policy as Policy Rules
    participant Prometheus
    participant SIEM
    Client->>App: POST /v1/analyze (prompt)
    activate App
    Note over App: record start time
    App->>CB: check is_open()
    CB-->>App: is_open: false/true
    alt CircuitBreaker Open
        App->>App: return fallback verdict
    else CircuitBreaker Closed
        App->>Engine: analyze(prompt, session_id, org_id, ...)
        activate Engine
        Engine->>Policy: load org-specific thresholds
        Policy-->>Engine: thresholds & patterns
        Engine->>Engine: compute multi-signal scores
        Note over Engine: prompt patterns, threat-intel,<br/>response leaks, agentic chains,<br/>transformer classification
        Engine-->>App: EngineResult {risk_score, verdict, ...}
        deactivate Engine
        App->>CB: record_success() or record_failure()
    end
    App->>Prometheus: record latency_ms, heuristic_sla_met
    Prometheus-->>App: metrics accepted
    alt Verdict is Malicious
        App->>SIEM: publish(event) with sanitization
        SIEM-->>App: async confirmation
    end
    App-->>Client: AnalysisResponse {verdict, details, ...}
    deactivate App
```

```mermaid
sequenceDiagram
    participant Analyzer as Analyzer Service
    participant ThreatDB as Threat Intel Store
    participant FeedbackDB as Feedback JSONL
    participant PolicyEngine as Detection Engine
    Analyzer->>ThreatDB: GET /v1/threat-intel/feed
    ThreatDB-->>Analyzer: return all patterns
    Analyzer->>Analyzer: POST /v1/threat-intel/share<br/>{pattern, score, category}
    Analyzer->>ThreatDB: append & rewrite patterns
    ThreatDB-->>Analyzer: share confirmed
    Analyzer->>Analyzer: POST /v1/feedback<br/>{prompt, predicted, analyst_verdict}
    Analyzer->>FeedbackDB: append JSONL entry
    FeedbackDB-->>Analyzer: feedback stored
    Analyzer->>PolicyEngine: share_threat_pattern(pattern, score, ...)
    PolicyEngine->>PolicyEngine: incorporate into collective signals
    PolicyEngine-->>Analyzer: updated pattern recognized
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✅ Passed checks (4 passed)
🤖 TENET Agent Review

📋 Summary

This pull request introduces a comprehensive set of enhancements to the TENET AI project, transforming it into a more robust and enterprise-ready LLM security middleware. Key additions include a policy-driven DetectionEngine.

🔐 Security Findings
🧹 Code Quality
✅ What's Done Well
📝 Overall Verdict: REQUEST CHANGES - Address the high-severity security finding regarding transformer model loading and consider the medium-severity file write risks.
💡 Codex Review
TENET-AI/services/analyzer/app.py
Line 657 in 1146bcb
_process_single_event analyzes queued items with run_analysis(prompt) only, so tenant-specific policy overrides are dropped for the primary ingest→queue→analyzer path even when the event contains tenant identity. As a result, organization thresholds/rules are silently bypassed and detections can be wrong for multi-tenant deployments.
```python
for field in ("prompt", "response_text", "context", "agent_trace"):
    if field in redacted:
        redacted[field] = "[REDACTED]"
```
Redact system prompts before SIEM export
When INCLUDE_SENSITIVE_SIEM_FIELDS is left at its safe default (false), _sanitize_event only redacts prompt, response_text, context, and agent_trace, so fields like system_prompt (present on ingested events) and arbitrary metadata are still forwarded to Splunk/Sentinel/Elastic in clear text. This creates an unintended data-leak path in the default configuration for malicious-event forwarding.
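A denylist like the one above misses new sensitive fields by construction; an allowlist closes that class of leak. A minimal sketch, assuming hypothetical field names (`system_prompt`, `metadata`, the `SAFE_FIELDS` set) that may not match TENET's actual event schema:

```python
import copy

# Known-safe, non-sensitive fields; everything else is dropped by default.
# The field list here is an illustrative assumption, not TENET's real schema.
SAFE_FIELDS = {"event_id", "timestamp", "organization_id", "verdict", "risk_score", "threat_type"}

def sanitize_event(event: dict, include_sensitive: bool = False) -> dict:
    """Forward only allowlisted fields to the SIEM; redact everything else."""
    if include_sensitive:
        return copy.deepcopy(event)
    sanitized = {k: v for k, v in event.items() if k in SAFE_FIELDS}
    # Record which fields were dropped so analysts know the event was sanitized.
    sanitized["redacted_fields"] = sorted(set(event) - SAFE_FIELDS)
    return sanitized
```

With this shape, a newly added `system_prompt` or free-form `metadata` field never reaches Splunk/Sentinel/Elastic unless it is explicitly allowlisted.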
| """Prometheus scrape endpoint.""" | ||
| if not generate_latest: | ||
| raise HTTPException(status_code=503, detail="prometheus-client not installed") | ||
| return generate_latest() |
Return Prometheus metrics with text response type
Returning generate_latest() directly from this FastAPI handler causes the payload to go through the default JSON response pipeline rather than Prometheus text exposition, which can break scraping/parsing for /metrics. The metrics endpoint should return an explicit raw/text response type for Prometheus-format bytes.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/analyzer/app.py (1)
Lines 656-660: ⚠️ Potential issue | 🟠 Major

Pass the new analysis fields through the queue path.

The direct API path forwards `organization_id`, `session_id`, `response_text`, and `agent_trace`, but queued events only pass `prompt`. That means the new detection features are silently disabled for background-processed traffic.

Suggested fix

```diff
-    result = run_analysis(prompt)
+    result = run_analysis(
+        prompt=prompt,
+        organization_id=event.get("organization_id"),
+        session_id=event.get("session_id"),
+        response_text=event.get("response_text"),
+        agent_trace=event.get("agent_trace"),
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/analyzer/app.py` around lines 656 - 660, Queued event processing only forwards `prompt`, so the new analysis fields (`organization_id`, `session_id`, `response_text`, `agent_trace`) are missing for background jobs; update the queue producer to include those fields in the enqueued payload and update the consumer handler that calls `run_analysis(prompt)` to accept and forward `organization_id`, `session_id`, `response_text`, and `agent_trace` into `run_analysis` (or into the analysis context) and then pass them through to `_update_and_store_event(event, event_id, result)` so the queued path mirrors the direct API path.
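The fix above can be sketched as a small forwarding helper; `run_analysis` and the event keys mirror the suggested diff and are assumptions about the actual code, not verified signatures:

```python
# Fields the direct API path forwards and the queued path must mirror.
ANALYSIS_FIELDS = ("organization_id", "session_id", "response_text", "agent_trace")

def process_single_event(event: dict, run_analysis) -> dict:
    """Forward tenant/session context so queued traffic gets policy-aware analysis."""
    extra = {field: event.get(field) for field in ANALYSIS_FIELDS}
    return run_analysis(prompt=event["prompt"], **extra)
```

Using `event.get(...)` keeps older queued payloads (which lack the new fields) working: missing keys simply arrive as `None`.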
🧹 Nitpick comments (2)
tests/unit/test_tenet_plugin.py (1)
118-149: Strengthen middleware assertions to catch behavioral regressions.
Line 132andLine 149only assertstatus. Add assertions for returned LLM payload/analysis fields so middleware pass-through regressions fail loudly.Suggested test tightening
def test_langchain_middleware_invoke(): @@ result = middleware.invoke("hello", llm_callable=lambda **kwargs: "ok") assert result["status"] == "success" + assert result["llm_response"] == "ok" + assert result["analysis"]["verdict"] == "benign" @@ def test_llamaindex_middleware_query(): @@ result = middleware.query("what is tenet?", llm_callable=lambda **kwargs: "ok") assert result["status"] == "success" + assert result["llm_response"] == "ok" + assert result["analysis"]["verdict"] == "benign"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_tenet_plugin.py` around lines 118 - 149, The tests only assert "status" which can miss regressions; update test_langchain_middleware_invoke and test_llamaindex_middleware_query to also assert that the middleware passes through the LLM output and includes the Tenet analysis fields from the session response: assert the llm_callable return value (e.g., "ok") is present in the middleware result (check the key your middleware uses for the LLM payload, e.g., "llm_response" or "payload") and assert the analysis keys from DummyResponse (blocked, verdict, risk_score, event_id) are present and equal to the expected values returned by the TenetSecurityPlugin when invoking LangChainTenetMiddleware.invoke and LlamaIndexTenetMiddleware.query.tests/unit/test_detection_engine.py (1)
Lines 18-33: Add the session-isolation regression the PR is claiming to fix.

This only proves accumulation within one org. Since `_session_key()` namespaces by `organization_id`, the missing regression is reusing the same `session_id` across two orgs and asserting the second org starts with a clean accumulator.

Suggested test

```diff
+def test_same_session_id_is_isolated_per_organization():
+    engine = DetectionEngine(redis_client=None)
+    engine.analyze(
+        prompt="ignore previous instructions",
+        session_id="s1",
+        organization_id="org-a",
+    )
+    result = engine.analyze(
+        prompt="hello world",
+        session_id="s1",
+        organization_id="org-b",
+    )
+    assert result.details["session_events"] == 1
+    assert result.details["threat_signals"]["session"] == 0.0
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_detection_engine.py` around lines 18 - 33, add a new test that verifies session isolation by reusing the same session_id across two different organization_id values: use DetectionEngine and call analyze(...) first with session_id="s1" and organization_id="orgA" to produce accumulated session state, then call analyze(...) with the same session_id="s1" but organization_id="orgB" and assert the second result does not see the first org's accumulation (e.g., second.details["session_events"] is 1 or not >= the accumulated count and second.details["threat_signals"]["session"] is 0.0). This ensures the _session_key() namespacing by organization_id works and prevents cross-org session leakage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker-compose.yml`:
- Around line 136-137: Remove the hardcoded Grafana credentials by eliminating
the literal assignments to GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD
and instead require those values be supplied securely (e.g., via environment
variables injected at deploy time, Docker secrets, or a generated secret stored
in a secret manager). Update the docker-compose service to reference the
external env/secret names rather than fixed "admin"/"admin", and add startup
validation (in orchestration or an entrypoint) to fail fast if
GF_SECURITY_ADMIN_USER or GF_SECURITY_ADMIN_PASSWORD are not provided or empty
so default, insecure credentials cannot be used.
- Around line 119-131: The docker-compose services using floating tags should be
pinned: update the prometheus service image reference (currently
"prom/prometheus:latest" under service name/prometheus block) and the grafana
service image reference ("grafana/grafana:latest" under service name/grafana) to
explicit version tags or referenced pinned env vars (e.g., PROMETHEUS_IMAGE and
GRAFANA_IMAGE) so deployments are deterministic; change the image strings to
specific semver tags (or replace with ${ENV_VAR}) and add those env vars to a
.env or compose env section to control upgrades.
In `@helm/tenet/templates/analyzer.yaml`:
- Around line 22-23: The API key is being exposed by setting env var API_KEY to
.Values.apiKey; change both analyzer.yaml and ingest.yaml to use
valueFrom.secretKeyRef instead: add a new values entry (e.g. apiSecretName) and
require it, update the container env block for the env var named API_KEY to use
valueFrom:{secretKeyRef:{name: {{ .Values.apiSecretName }}, key: "<key-name>"}}
(use a consistent key like "apiKey"), and remove direct references to
.Values.apiKey from templates; also ensure values.yaml and chart docs instruct
users to create or supply a Kubernetes Secret with that name/key rather than
putting the API key in values.yaml.
In `@helm/tenet/templates/ingest.yaml`:
- Around line 22-23: Replace the plaintext env var injection for API_KEY with a
secretKeyRef: stop setting env name API_KEY using value: "{{ required
\"values.apiKey is required\" .Values.apiKey }}", and instead use
valueFrom.secretKeyRef referencing a Kubernetes Secret (referencing a secret
name from a values key like .Values.apiKeySecretName and the secret key name,
e.g., "apiKey"); apply the same change for the analyzer template so both use
valueFrom.secretKeyRef for API_KEY rather than embedding .Values.apiKey
directly.
In `@helm/tenet/templates/redis.yaml`:
- Around line 6-19: The Redis Deployment is ephemeral; update the container spec
for the container named "redis" in helm/tenet/templates/redis.yaml to add
volumeMounts mounting /data and add a corresponding volume that references a
PersistentVolumeClaim (use a values key .Values.redis.persistence.claimName),
add the Redis arg "--appendonly yes" to enable AOF persistence, and update
values.yaml to declare redis.persistence.claimName (and any persistence.enabled
flag if desired) so the chart can supply the PVC name; ensure you reference the
existing .Values.redis.image, .Values.imagePullPolicy and
.Values.redis.servicePort entries when making changes.
In `@services/analyzer/app.py`:
- Around line 368-404: The current code always returns a hard-coded fail-open
AnalysisResponse when the circuit_breaker is open or engine.analyze raises,
which is insecure; make the degraded-mode behavior configurable (e.g., a
fallback_mode or FAIL_OPEN flag) and change the two fallback branches (the
circuit_breaker.is_open block and the except Exception block around
engine.analyze) to consult that configuration: support at least "fail_open"
(current benign response), "degraded_suspicious" (return a mid/high risk_score,
verdict="suspicious"/threat_type as None, higher confidence, and
details={"method":"fallback_suspicious",...}), and "fail_closed"/"block" (return
risk_score=1.0, verdict="malicious"); ensure you call
circuit_breaker.record_failure on exception and record_success on success as
before, include clear logging that mentions the chosen fallback_mode, and keep
AnalysisResponse construction centralized so both fallback paths use the same
decision logic.
- Around line 597-600: The code currently awaits siem_dispatcher.publish inline
when result.verdict == "malicious", which can block the consumer; change this to
decouple SIEM delivery by enqueuing the event for asynchronous delivery instead
of awaiting the publish call directly. Specifically, keep the existing
redis_client.lpush("tenet:alerts", json.dumps(event)) and logger.warning lines,
then replace await siem_dispatcher.publish(event) with a non-blocking handoff
(e.g., push a message to a dedicated outbound queue or start a background task
via asyncio.create_task(...) that calls siem_dispatcher.publish), and ensure any
created background worker/consumer handles retries/errors so the consumer path
(the code checking result.verdict) never awaits external connector I/O.
- Around line 130-155: Import and instantiate a dedicated CollectorRegistry from
prometheus_client when available, and use it when creating the module-level
metrics ANALYZE_REQUESTS and ANALYZE_LATENCY (i.e., pass registry=your_registry
into Counter and Histogram constructors) so each import gets its own registry;
also update any use of generate_latest to call generate_latest(registry) when
exporting metrics, and keep the existing fallback to None if prometheus_client
isn't installed.
- Around line 257-262: The /metrics endpoint (decorated by `@app.get`("/metrics")
in the metrics() function) currently returns raw bytes from generate_latest()
without a Content-Type; update it to return a FastAPI/Starlette Response object
containing generate_latest() as the body and set the Content-Type exactly to
"text/plain; version=1.0.0; charset=utf-8" (preserve the existing HTTPException
branch when generate_latest is missing); ensure you import Response if not
already present and pass the bytes from generate_latest() as the response
content.
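The configurable degraded-mode behavior described for app.py could be centralized roughly as below. This is a sketch under stated assumptions: the mode names come from the comment above, and `AnalysisResponse` fields are approximated as a plain dict rather than the real model.

```python
import logging

logger = logging.getLogger("analyzer")

# One response template per fallback mode; both fallback branches
# (open circuit breaker, engine exception) consult the same table.
FALLBACK_RESPONSES = {
    "fail_open": {"risk_score": 0.0, "verdict": "benign", "confidence": 0.1},
    "degraded_suspicious": {"risk_score": 0.6, "verdict": "suspicious", "confidence": 0.5},
    "fail_closed": {"risk_score": 1.0, "verdict": "malicious", "confidence": 1.0},
}

def fallback_verdict(mode: str = "degraded_suspicious") -> dict:
    """Build the degraded-mode response for the configured fallback mode."""
    base = FALLBACK_RESPONSES.get(mode, FALLBACK_RESPONSES["degraded_suspicious"])
    logger.warning("detection engine unavailable, fallback_mode=%s", mode)
    return {**base, "details": {"method": f"fallback_{mode}"}}
```

Keeping the decision in one function means the circuit-breaker branch and the exception branch cannot drift apart, and operators can pick the security/availability trade-off per deployment.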
In `@services/analyzer/detection_engine.py`:
- Around line 36-37: Replace the unbounded dict self._session_cache with a
bounded TTL-aware cache to avoid unbounded memory growth: instantiate a cache
like cachetools.TTLCache(maxsize=..., ttl=...) (or implement an LRU+TTL map) in
the DetectionEngine constructor where self._session_cache is currently set,
update all accesses that read/write session entries (the code around the blocks
referenced at lines ~168-170 and ~209-223) to use the cache API, and protect
concurrent access with a threading.Lock or use cachetools' thread-safe wrapper;
choose sensible maxsize/ttl values and add a small cleanup/background eviction
if your cache implementation requires it.
- Around line 307-319: share_threat_pattern currently mutates self.threat_intel
in-memory and rewrites intel_path directly, causing lost updates under
concurrency; fix by performing a read-modify-write with synchronization: acquire
a lock (process-wide file lock or a shared threading/multiprocess lock) at the
start of share_threat_pattern, reload the current JSON from intel_path into a
local payload (fallback to {"patterns": []}), merge the new pattern into payload
(optionally dedupe), write payload to a temporary file and atomically replace
intel_path (os.replace) to avoid partial writes, update self.threat_intel from
the freshly written payload, then release the lock so concurrent callers see the
merged result.
In `@services/analyzer/policy/default_policy.yaml`:
- Around line 27-34: The current response_scanning.data_leak_patterns use
literal labels (api_key, private key, ssn, aws_secret_access_key, etc.) which
only match those words; replace those entries with actual leak
signatures/regexes that detect common shapes (e.g., SSN patterns
\d{3}-\d{2}-\d{4} or contiguous 9 digits, credit card BIN/Luhn patterns, AWS
secret access key pattern AKIA[A-Z0-9]{16} and secret-like base64/long key
patterns, PEM BEGIN/END blocks, JWT-looking three-part base64 strings, OAuth
tokens), keeping the same keys (response_scanning, data_leak_patterns) but
storing regular expressions or signature strings for each logical detector
(api_key, ssn, credit_card_number, aws_secret_access_key, private_key) and
adjust scores as needed so the analyzer.matching logic sees real leaks instead
of label-only matches.
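By way of illustration, signature-style patterns for a few of those detectors are sketched below. The regexes are simplified assumptions and would need tuning (plus validation such as Luhn checks for card numbers) before production use; note `AKIA...` is technically the AWS access key ID shape.

```python
import re

# Illustrative leak signatures; real detectors need broader coverage
# and false-positive controls.
DATA_LEAK_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"),  # three base64url segments
}

def scan_response(text: str) -> list:
    """Return the leak categories whose signature matches the response text."""
    return [name for name, rx in DATA_LEAK_PATTERNS.items() if rx.search(text)]
```

Unlike the label-only entries, these match the *shape* of leaked data, so a response containing an actual SSN or key material triggers even when the words "ssn" or "api_key" never appear.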
In `@services/analyzer/reliability.py`:
- Around line 23-25: The circuit-breaker timeout branch resets only
self._failures but leaves self._opened_at set, preventing future opens; update
the timeout handling to also clear or reset self._opened_at (e.g., set to None
or 0) so the next failure cycle can set a new open timestamp, and ensure the
open-path logic that sets self._opened_at (referenced around lines checking
self._opened_at and setting it on open) still sets time.monotonic() when
opening; in short, reset both self._failures and self._opened_at in the timeout
block and keep the existing guard that sets self._opened_at on subsequent opens.
In `@services/analyzer/siem_connectors.py`:
- Line 25: The code sets self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS",
"2.5")) without validation, which will raise ValueError on malformed env values;
update the initializer where self.timeout is set to parse the environment value
with a safe try/except (or use a helper like parse_timeout), logging a warning
via the existing logger and falling back to the default 2.5 seconds when
conversion fails or the value is non-positive; reference the
SIEM_TIMEOUT_SECONDS env var, the self.timeout attribute, and the os.getenv call
to locate and replace the current assignment.
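A safe parsing helper along the lines the comment suggests (the helper name is illustrative):

```python
import logging
import os

logger = logging.getLogger("siem")

def parse_timeout(env_var: str = "SIEM_TIMEOUT_SECONDS", default: float = 2.5) -> float:
    """Parse a timeout from the environment, falling back on bad or non-positive values."""
    raw = os.getenv(env_var, str(default))
    try:
        value = float(raw)
    except ValueError:
        logger.warning("invalid %s=%r, using default %.1fs", env_var, raw, default)
        return default
    if value <= 0:
        logger.warning("non-positive %s=%r, using default %.1fs", env_var, raw, default)
        return default
    return value
```

The initializer then becomes `self.timeout = parse_timeout()`, and a typo in the deployment environment degrades to the default instead of crashing the connector at import time.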
In `@tenet_plugin/client.py`:
- Around line 178-185: invoke() and query() are dropping the user input because
secure_call only forwards llm_kwargs to the downstream llm_callable; include the
user text in the forwarded args so the LLM actually receives it. Modify the
calls to secure_call (from invoke and query) to inject the prompt/query_text
into llm_kwargs (e.g., set llm_kwargs["prompt"] = prompt or llm_kwargs["query"]
= query_text) or otherwise pass both the text and llm_kwargs through so
llm_callable receives the input; update the references to invoke, query,
secure_call, prompt, query_text, llm_callable, and llm_kwargs accordingly.
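One way to sketch the fix, with `secure_call`, `invoke`, and `query` names taken from the review rather than the verified plugin code:

```python
class TenetMiddlewareSketch:
    """Illustrative middleware shape; the real client's analysis step is elided."""

    def secure_call(self, text, llm_callable, llm_kwargs):
        # ... analysis of `text` would happen here before calling the LLM ...
        return llm_callable(**llm_kwargs)

    def invoke(self, prompt, llm_callable, **llm_kwargs):
        # Inject the user text so the downstream LLM actually receives it.
        llm_kwargs.setdefault("prompt", prompt)
        return self.secure_call(prompt, llm_callable, llm_kwargs)

    def query(self, query_text, llm_callable, **llm_kwargs):
        llm_kwargs.setdefault("query", query_text)
        return self.secure_call(query_text, llm_callable, llm_kwargs)
```

`setdefault` preserves an explicitly supplied `prompt`/`query` kwarg while guaranteeing the input is never silently dropped.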
In `@tests/integration/test_latency_sla.py`:
- Around line 23-24: The current hard 20ms p99 check (p99 computed from
quantiles(samples_ms, n=100)) is flaky on CI; change the assertion to use a
relaxed or configurable threshold instead of a fixed 20.0ms: make the threshold
configurable via an env var (e.g., CI_LATENCY_SLA_MS) with a sensible default
(e.g., 50.0), or compute a dynamic threshold from a stored baseline before
asserting, and update the assert that references p99 to compare against that
configurable/dynamic threshold rather than the literal 20.0.
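The configurable threshold could be read as below; the env var name and default come from the comment above, and `p99_ms` is a hypothetical helper mirroring the test's `quantiles(samples_ms, n=100)` usage.

```python
import os
from statistics import quantiles

def latency_threshold_ms(default: float = 50.0) -> float:
    """Read the p99 SLA threshold from CI_LATENCY_SLA_MS, with a CI-friendly default."""
    try:
        return float(os.getenv("CI_LATENCY_SLA_MS", default))
    except ValueError:
        return default

def p99_ms(samples_ms) -> float:
    # quantiles(..., n=100) yields 99 cut points; index 98 estimates the p99.
    return quantiles(samples_ms, n=100)[98]
```

The assertion then becomes `assert p99_ms(samples_ms) <= latency_threshold_ms()`, keeping the strict 20 ms target available locally while letting CI override it.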
---
Outside diff comments:
In `@services/analyzer/app.py`:
- Around line 656-660: Queued event processing only forwards `prompt`, so the
new analysis fields (`organization_id`, `session_id`, `response_text`,
`agent_trace`) are missing for background jobs; update the queue producer to
include those fields in the enqueued payload and update the consumer handler
that calls `run_analysis(prompt)` to accept and forward `organization_id`,
`session_id`, `response_text`, and `agent_trace` into `run_analysis` (or into
the analysis context) and then pass them through to
`_update_and_store_event(event, event_id, result)` so the queued path mirrors
the direct API path.
---
Nitpick comments:
In `@tests/unit/test_detection_engine.py`:
- Around line 18-33: Add a new test that verifies session isolation by reusing
the same session_id across two different organization_id values: use
DetectionEngine and call analyze(...) first with session_id="s1" and
organization_id="orgA" to produce accumulated session state, then call
analyze(...) with the same session_id="s1" but organization_id="orgB" and assert
the second result does not see the first org's accumulation (e.g.,
second.details["session_events"] is 1 or not >= the accumulated count and
second.details["threat_signals"]["session"] is 0.0). This ensures the
_session_key() namespacing by organization_id works and prevents cross-org
session leakage.
In `@tests/unit/test_tenet_plugin.py`:
- Around line 118-149: The tests only assert "status" which can miss
regressions; update test_langchain_middleware_invoke and
test_llamaindex_middleware_query to also assert that the middleware passes
through the LLM output and includes the Tenet analysis fields from the session
response: assert the llm_callable return value (e.g., "ok") is present in the
middleware result (check the key your middleware uses for the LLM payload, e.g.,
"llm_response" or "payload") and assert the analysis keys from DummyResponse
(blocked, verdict, risk_score, event_id) are present and equal to the expected
values returned by the TenetSecurityPlugin when invoking
LangChainTenetMiddleware.invoke and LlamaIndexTenetMiddleware.query.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b2772598-bd45-4d10-87bd-ca24cdda09ae
📒 Files selected for processing (22)
- README.md
- docker-compose.yml
- helm/tenet/Chart.yaml
- helm/tenet/templates/analyzer.yaml
- helm/tenet/templates/ingest.yaml
- helm/tenet/templates/redis.yaml
- helm/tenet/values.yaml
- monitoring/alerts.yml
- monitoring/grafana/provisioning/datasources/prometheus.yml
- monitoring/prometheus.yml
- requirements.txt
- services/analyzer/app.py
- services/analyzer/detection_engine.py
- services/analyzer/policy/default_policy.yaml
- services/analyzer/reliability.py
- services/analyzer/siem_connectors.py
- tenet_plugin/__init__.py
- tenet_plugin/client.py
- tests/integration/test_latency_sla.py
- tests/unit/test_detection_engine.py
- tests/unit/test_reliability.py
- tests/unit/test_tenet_plugin.py
```yaml
    image: prom/prometheus:latest
    container_name: tenet-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/alerts.yml:/etc/prometheus/alerts.yml:ro
    networks:
      - tenet-network
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify floating image tags in compose manifests.
rg -n --iglob 'docker-compose*.yml' '^\s*image:\s*[^#\s]+:latest\b'
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 230
Pin monitoring and object storage image versions instead of latest.
Lines 45, 119, and 131 use floating tags. Pin explicit versions (or pinned env vars) for deterministic deploys and controlled upgrades.
Suggested fix

```diff
- image: minio/minio:latest
+ image: minio/minio:${MINIO_VERSION:?pin version}
@@
- image: prom/prometheus:latest
+ image: prom/prometheus:${PROMETHEUS_VERSION:?pin version}
@@
- image: grafana/grafana:latest
+ image: grafana/grafana:${GRAFANA_VERSION:?pin version}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker-compose.yml` around lines 119 - 131, The docker-compose services using
floating tags should be pinned: update the prometheus service image reference
(currently "prom/prometheus:latest" under service name/prometheus block) and the
grafana service image reference ("grafana/grafana:latest" under service
name/grafana) to explicit version tags or referenced pinned env vars (e.g.,
PROMETHEUS_IMAGE and GRAFANA_IMAGE) so deployments are deterministic; change the
image strings to specific semver tags (or replace with ${ENV_VAR}) and add those
env vars to a .env or compose env section to control upgrades.
```yaml
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
```
Do not ship default Grafana admin credentials.
Lines 136-137 hardcode admin/admin. This is an unsafe default and can be exploited if the port is reachable.
Suggested fix

```diff
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:?set GRAFANA_ADMIN_USER}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?set GRAFANA_ADMIN_PASSWORD}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker-compose.yml` around lines 136 - 137, Remove the hardcoded Grafana
credentials by eliminating the literal assignments to GF_SECURITY_ADMIN_USER and
GF_SECURITY_ADMIN_PASSWORD and instead require those values be supplied securely
(e.g., via environment variables injected at deploy time, Docker secrets, or a
generated secret stored in a secret manager). Update the docker-compose service
to reference the external env/secret names rather than fixed "admin"/"admin",
and add startup validation (in orchestration or an entrypoint) to fail fast if
GF_SECURITY_ADMIN_USER or GF_SECURITY_ADMIN_PASSWORD are not provided or empty
so default, insecure credentials cannot be used.
```yaml
        - name: API_KEY
          value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
```
🧩 Analysis chain

🏁 Script executed:

```bash
# First, let's find and read the analyzer.yaml file
find . -name "analyzer.yaml" -type f | head -5
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 101

🏁 Script executed:

```bash
# Let's also check the directory structure to understand the helm chart layout
ls -la helm/tenet/templates/ 2>/dev/null | head -20
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 355

🏁 Script executed:

```bash
# Search for secret-related patterns in the helm chart
rg -i "secret|apikey|password" helm/tenet/templates/ --type yaml -n
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 304

🏁 Script executed:

```bash
cat -n helm/tenet/templates/analyzer.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 1246

🏁 Script executed:

```bash
cat -n helm/tenet/templates/ingest.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 1226

🏁 Script executed:

```bash
# Check if there's a values.yaml file
find . -name "values.yaml" -type f | head -5
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 89

🏁 Script executed:

```bash
# Check for any Secret resources in the helm chart
rg -i "kind.*secret|apiVersion.*v1" helm/ --type yaml -A 2
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 937

🏁 Script executed:

```bash
cat -n helm/tenet/values.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 381

🏁 Script executed:

```bash
# Also check redis.yaml to see if it handles secrets differently
cat -n helm/tenet/templates/redis.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 891
API key exposed as plaintext in environment variables.
Line 23 directly injects .Values.apiKey into env vars, exposing the secret in rendered manifests and Helm release history. Replace with secretKeyRef to reference a Kubernetes Secret resource instead.
Note: This issue also exists in ingest.yaml:23 and requires fixes in both files.
Suggested fix

```diff
         env:
         - name: API_KEY
-          value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
+          valueFrom:
+            secretKeyRef:
+              name: {{ required "values.apiKeySecret.name is required" .Values.apiKeySecret.name }}
+              key: {{ default "api-key" .Values.apiKeySecret.key }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@helm/tenet/templates/analyzer.yaml` around lines 22 - 23, The API key is
being exposed by setting env var API_KEY to .Values.apiKey; change both
analyzer.yaml and ingest.yaml to use valueFrom.secretKeyRef instead: add a new
values entry (e.g. apiSecretName) and require it, update the container env block
for the env var named API_KEY to use valueFrom:{secretKeyRef:{name: {{
.Values.apiSecretName }}, key: "<key-name>"}} (use a consistent key like
"apiKey"), and remove direct references to .Values.apiKey from templates; also
ensure values.yaml and chart docs instruct users to create or supply a
Kubernetes Secret with that name/key rather than putting the API key in
values.yaml.
```yaml
        - name: API_KEY
          value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify API_KEY is currently injected as plaintext and secretKeyRef support is absent/present.
rg -n 'name:\s*API_KEY|values\.apiKey|secretKeyRef|kind:\s*Secret' helm/tenet/templates helm/tenet/values.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 434

🏁 Script executed:

```bash
# Check values.yaml for apiKey configuration and see full file structure
head -50 helm/tenet/values.yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 276

🏁 Script executed:

```bash
# Look for any Secret resource definitions in the helm chart
find helm/tenet/templates -type f -name "*.yaml" -o -name "*.yml" | xargs grep -l "kind: Secret"
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 49

🏁 Script executed:

```bash
# Check for all instances of apiKey usage across the entire helm chart
rg -n 'apiKey|API_KEY' helm/tenet/ --type yaml
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 471

🏁 Script executed:

```bash
# Search for other common secret patterns in env values (password, token, secret, credential, etc.)
rg -n 'value:\s*"\{{\s*.*\.(password|token|secret|credential|key|apikey)' helm/tenet/templates --type yaml -i
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 225

🏁 Script executed:

```bash
# Check the complete analyzer.yaml and ingest.yaml to see full env configuration
cat -n helm/tenet/templates/ingest.yaml | sed -n '15,35p'
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 876

🏁 Script executed:

```bash
# Check analyzer.yaml env configuration
cat -n helm/tenet/templates/analyzer.yaml | sed -n '15,35p'
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 888

🏁 Script executed:

```bash
# Simpler search for plaintext value injections of sensitive sounding names
rg -n 'name:\s*(API_KEY|PASSWORD|TOKEN|SECRET)' helm/tenet/templates --type yaml -A 1
```

Repository: TENET-DEV-AI/TENET-AI
Length of output: 437
Apply secretKeyRef for API_KEY to prevent plaintext secret leaks.
ingest.yaml lines 22-23 and analyzer.yaml lines 22-23 both inject .Values.apiKey as plaintext values, exposing secrets in rendered manifests and Helm release history. Replace with valueFrom.secretKeyRef to reference a Kubernetes Secret instead.
Suggested fix
 env:
 - name: API_KEY
-  value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
+  valueFrom:
+    secretKeyRef:
+      name: {{ required "values.apiKeySecret.name is required" .Values.apiKeySecret.name }}
+      key: {{ default "api-key" .Values.apiKeySecret.key }}

Apply this change to both ingest.yaml and analyzer.yaml.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 - name: API_KEY
-   value: "{{ required \"values.apiKey is required\" .Values.apiKey }}"
+   valueFrom:
+     secretKeyRef:
+       name: {{ required "values.apiKeySecret.name is required" .Values.apiKeySecret.name }}
+       key: {{ default "api-key" .Values.apiKeySecret.key }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@helm/tenet/templates/ingest.yaml` around lines 22 - 23, Replace the plaintext
env var injection for API_KEY with a secretKeyRef: stop setting env name API_KEY
using value: "{{ required \"values.apiKey is required\" .Values.apiKey }}", and
instead use valueFrom.secretKeyRef referencing a Kubernetes Secret (referencing
a secret name from a values key like .Values.apiKeySecretName and the secret key
name, e.g., "apiKey"); apply the same change for the analyzer template so both
use valueFrom.secretKeyRef for API_KEY rather than embedding .Values.apiKey
directly.
  replicas: 1
  selector:
    matchLabels: { app: tenet-redis }
  template:
    metadata:
      labels: { app: tenet-redis }
    spec:
      containers:
        - name: redis
          image: {{ .Values.redis.image }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          ports:
            - containerPort: {{ .Values.redis.servicePort }}
---
🧩 Analysis chain

🏁 Scripts executed:

#!/bin/bash
# Verify Redis template lacks persistence primitives and chart lacks PVC definition.
rg -n 'volumeMounts|persistentVolumeClaim|claimName|appendonly|/data' helm/tenet/templates/redis.yaml
rg -n 'kind:\s*PersistentVolumeClaim|redis\.persistence|claimName' helm/tenet

cat -n helm/tenet/templates/redis.yaml

cat -n helm/tenet/values.yaml | head -100

find helm/tenet -type f \( -name "*.yaml" -o -name "*.yml" \) | xargs grep -l "PersistentVolumeClaim\|PersistentVolume\|StorageClass" 2>/dev/null
Add Redis persistence; current deployment is ephemeral.
The Redis Deployment (lines 6-19) has no volumeMounts or volumes configured, and values.yaml contains no redis.persistence settings. Pod restarts will lose all runtime state, which breaks reliability for queue or session usage.
Add volumeMounts with /data mount point, configure Redis with --appendonly yes, and mount a PersistentVolumeClaim. Update values.yaml to include redis.persistence.claimName configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@helm/tenet/templates/redis.yaml` around lines 6 - 19, The Redis Deployment is
ephemeral; update the container spec for the container named "redis" in
helm/tenet/templates/redis.yaml to add volumeMounts mounting /data and add a
corresponding volume that references a PersistentVolumeClaim (use a values key
.Values.redis.persistence.claimName), add the Redis arg "--appendonly yes" to
enable AOF persistence, and update values.yaml to declare
redis.persistence.claimName (and any persistence.enabled flag if desired) so the
chart can supply the PVC name; ensure you reference the existing
.Values.redis.image, .Values.imagePullPolicy and .Values.redis.servicePort
entries when making changes.
response_scanning:
  data_leak_patterns:
    api_key: 0.95
    private key: 0.99
    ssn: 0.9
    social security number: 0.92
    credit card number: 0.88
    aws_secret_access_key: 0.99
Use real leak signatures here, not just leak labels.
These keys only match outputs that literally contain strings like api_key, ssn, or credit card number. A raw SSN, card number, AWS secret, or PEM block without those labels will pass straight through, which makes the new response-scanning feature miss the most common leak shape.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/analyzer/policy/default_policy.yaml` around lines 27 - 34, The
current response_scanning.data_leak_patterns use literal labels (api_key,
private key, ssn, aws_secret_access_key, etc.) which only match those words;
replace those entries with actual leak signatures/regexes that detect common
shapes (e.g., SSN patterns \d{3}-\d{2}-\d{4} or contiguous 9 digits, credit card
BIN/Luhn patterns, AWS secret access key pattern AKIA[A-Z0-9]{16} and
secret-like base64/long key patterns, PEM BEGIN/END blocks, JWT-looking
three-part base64 strings, OAuth tokens), keeping the same keys
(response_scanning, data_leak_patterns) but storing regular expressions or
signature strings for each logical detector (api_key, ssn, credit_card_number,
aws_secret_access_key, private_key) and adjust scores as needed so the
analyzer.matching logic sees real leaks instead of label-only matches.
if time.monotonic() - self._opened_at > self.reset_timeout_seconds:
    self._failures = 0
    return False
There was a problem hiding this comment.
Circuit breaker does not fully reset open-state after timeout.
On Line 24, only _failures is reset. Because _opened_at is not reset, the guard on Line 34 prevents setting a fresh open timestamp on later failure cycles, so reopen behavior degrades after the first timeout.
💡 Proposed fix
 @property
 def is_open(self) -> bool:
     if self._failures < self.failure_threshold:
         return False
     if time.monotonic() - self._opened_at > self.reset_timeout_seconds:
         self._failures = 0
+        self._opened_at = 0.0
         return False
     return True

Also applies to: 34-35
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/analyzer/reliability.py` around lines 23 - 25, The circuit-breaker
timeout branch resets only self._failures but leaves self._opened_at set,
preventing future opens; update the timeout handling to also clear or reset
self._opened_at (e.g., set to None or 0) so the next failure cycle can set a new
open timestamp, and ensure the open-path logic that sets self._opened_at
(referenced around lines checking self._opened_at and setting it on open) still
sets time.monotonic() when opening; in short, reset both self._failures and
self._opened_at in the timeout block and keep the existing guard that sets
self._opened_at on subsequent opens.
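Expanding the proposed fix into a self-contained sketch: the attribute and property names follow the diff above, but the full class shape (thresholds, record methods) is an assumption about the surrounding code.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker that fully resets its open state after a timeout."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self._failures = 0
        self._opened_at = 0.0

    @property
    def is_open(self) -> bool:
        if self._failures < self.failure_threshold:
            return False
        if time.monotonic() - self._opened_at > self.reset_timeout_seconds:
            # Reset BOTH fields so a later failure cycle can stamp a fresh open
            # timestamp; resetting only _failures leaves _opened_at stale.
            self._failures = 0
            self._opened_at = 0.0
            return False
        return True

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold and self._opened_at == 0.0:
            self._opened_at = time.monotonic()

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = 0.0
```

With both fields cleared, the breaker can trip, cool down, and trip again with a correct fresh timestamp on every cycle.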
self.sentinel_token = os.getenv("SENTINEL_TOKEN", "").strip()
self.elastic_url = os.getenv("ELASTIC_INGEST_URL", "").strip()
self.elastic_api_key = os.getenv("ELASTIC_API_KEY", "").strip()
self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5"))
Invalid SIEM_TIMEOUT_SECONDS can crash analyzer startup.
Line 25 trusts environment parsing without a guard. A malformed value causes ValueError at init time.
💡 Proposed fix
- self.timeout = float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5"))
+ timeout_raw = os.getenv("SIEM_TIMEOUT_SECONDS", "2.5").strip()
+ try:
+     self.timeout = max(0.1, float(timeout_raw))
+ except ValueError:
+     logger.warning("Invalid SIEM_TIMEOUT_SECONDS=%r; defaulting to 2.5", timeout_raw)
+     self.timeout = 2.5

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/analyzer/siem_connectors.py` at line 25, The code sets self.timeout
= float(os.getenv("SIEM_TIMEOUT_SECONDS", "2.5")) without validation, which will
raise ValueError on malformed env values; update the initializer where
self.timeout is set to parse the environment value with a safe try/except (or
use a helper like parse_timeout), logging a warning via the existing logger and
falling back to the default 2.5 seconds when conversion fails or the value is
non-positive; reference the SIEM_TIMEOUT_SECONDS env var, the self.timeout
attribute, and the os.getenv call to locate and replace the current assignment.
def invoke(self, prompt: str, llm_callable: Callable[..., Any], **llm_kwargs):
    return self.plugin.secure_call(
        prompt=prompt,
        model=self.model,
        llm_callable=llm_callable,
        llm_kwargs=llm_kwargs,
        source_type="langchain",
    )
Middleware drops user input before LLM invocation.
invoke()/query() pass prompt/query_text for inspection, but the downstream LLM callable still receives only llm_kwargs (via secure_call), which can cause runtime failures or empty-input calls.
💡 Proposed fix
 class LangChainTenetMiddleware:
 @@
     def invoke(self, prompt: str, llm_callable: Callable[..., Any], **llm_kwargs):
         return self.plugin.secure_call(
             prompt=prompt,
             model=self.model,
-            llm_callable=llm_callable,
+            llm_callable=lambda **kwargs: llm_callable(prompt, **kwargs),
             llm_kwargs=llm_kwargs,
             source_type="langchain",
         )
 @@
 class LlamaIndexTenetMiddleware:
 @@
     def query(self, query_text: str, llm_callable: Callable[..., Any], **llm_kwargs):
         return self.plugin.secure_call(
             prompt=query_text,
             model=self.model,
-            llm_callable=llm_callable,
+            llm_callable=lambda **kwargs: llm_callable(query_text, **kwargs),
             llm_kwargs=llm_kwargs,
             source_type="llamaindex",
         )

Also applies to: 195-202
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tenet_plugin/client.py` around lines 178 - 185, invoke() and query() are
dropping the user input because secure_call only forwards llm_kwargs to the
downstream llm_callable; include the user text in the forwarded args so the LLM
actually receives it. Modify the calls to secure_call (from invoke and query) to
inject the prompt/query_text into llm_kwargs (e.g., set llm_kwargs["prompt"] =
prompt or llm_kwargs["query"] = query_text) or otherwise pass both the text and
llm_kwargs through so llm_callable receives the input; update the references to
invoke, query, secure_call, prompt, query_text, llm_callable, and llm_kwargs
accordingly.
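The closure trick in the diff can be isolated to show why it preserves the user input. In this sketch, `secure_call` is a stand-in that mimics the behavior described above (invoking the callable with only `llm_kwargs`); the real plugin's signature and inspection logic are assumptions.

```python
from typing import Any, Callable


def secure_call(prompt: str, llm_callable: Callable[..., Any], llm_kwargs: dict) -> Any:
    """Stand-in for the plugin's secure_call: after inspecting `prompt`, it
    invokes the callable with only llm_kwargs — so the prompt must already
    be bound into the callable itself."""
    # ... inspection / verdict logic would run here ...
    return llm_callable(**llm_kwargs)


def invoke(prompt: str, llm_callable: Callable[..., Any], **llm_kwargs):
    # Bind the prompt via a closure so secure_call cannot drop it.
    return secure_call(
        prompt=prompt,
        llm_callable=lambda **kwargs: llm_callable(prompt, **kwargs),
        llm_kwargs=llm_kwargs,
    )
```

Without the lambda, the downstream model would be called with keyword arguments only and an empty input; with it, the same text is both inspected and delivered.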
p99 = quantiles(samples_ms, n=100)[98]
assert p99 < 20.0, f"p99 was {p99:.3f}ms"
Fixed 20ms wall-clock gate is likely CI-flaky.
Line 24 enforces a strict hardware-dependent threshold. On shared runners this can fail intermittently and mask real regressions with noise.
Suggested stabilization
+import os
 @@
-p99 = quantiles(samples_ms, n=100)[98]
-assert p99 < 20.0, f"p99 was {p99:.3f}ms"
+p99 = quantiles(samples_ms, n=100)[98]
+sla_ms = float(os.getenv("TENET_HEURISTIC_P99_SLA_MS", "20.0"))
+assert p99 < sla_ms, f"p99 was {p99:.3f}ms (SLA {sla_ms:.3f}ms)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/test_latency_sla.py` around lines 23 - 24, The current hard
20ms p99 check (p99 computed from quantiles(samples_ms, n=100)) is flaky on CI;
change the assertion to use a relaxed or configurable threshold instead of a
fixed 20.0ms: make the threshold configurable via an env var (e.g.,
CI_LATENCY_SLA_MS) with a sensible default (e.g., 50.0), or compute a dynamic
threshold from a stored baseline before asserting, and update the assert that
references p99 to compare against that configurable/dynamic threshold rather
than the literal 20.0.
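The stabilization can be packaged as a small helper the test calls, so shared CI runners can relax the gate via the environment without editing the test. The env var name mirrors the suggestion above; the sample latencies are synthetic.

```python
import os
from statistics import quantiles


def check_p99(samples_ms: list[float], default_sla_ms: float = 20.0) -> None:
    """Assert the p99 latency against an env-configurable SLA.

    CI can set TENET_HEURISTIC_P99_SLA_MS to loosen (or tighten) the gate
    without touching the test source.
    """
    p99 = quantiles(samples_ms, n=100)[98]
    sla_ms = float(os.getenv("TENET_HEURISTIC_P99_SLA_MS", str(default_sla_ms)))
    assert p99 < sla_ms, f"p99 was {p99:.3f}ms (SLA {sla_ms:.3f}ms)"
```

A pinned baseline file checked into the repo would be an alternative; the env-var route is simpler and keeps the default strict for local runs.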
🤖 TENET Agent Review

📋 Summary

This pull request introduces a comprehensive set of new features and architectural improvements to the TENET AI project, transforming it into a more robust and observable LLM security middleware. Key additions include a policy-driven `DetectionEngine`.

🔐 Security Findings
🧹 Code Quality
✅ What's Done Well
📝 Overall Verdict

REQUEST CHANGES - Address the in-memory session cache limitation for scalability and persistence.
Motivation
Description
- `DetectionEngine` implementation under `services/analyzer/detection_engine.py` with policy loading, collective threat-intel, transformer/sklearn backends, session tracking, agent-chain scoring, OWASP mapping, feedback append, and threat-intel sharing APIs.
- `CircuitBreaker` in `services/analyzer/reliability.py`, Prometheus metrics and optional OpenTelemetry traces in `services/analyzer/app.py`, a `/metrics` endpoint, and latency accounting in analysis responses.
- `services/analyzer/siem_connectors.py` to forward redacted alerts to Splunk HEC, Sentinel, and Elastic, and wired SIEM publishing into the analyzer event store flow.
- `/v1/feedback`, `/v1/threat-intel/feed`, and `/v1/threat-intel/share` endpoints; kept existing `/v1/analyze` but wired it to the new engine and instrumentation; preserved fallback heuristics and ML-only fallback.
- `helm/tenet` chart and Docker Compose monitoring stack additions (`prometheus`, `grafana`) plus Prometheus/Grafana provisioning and alert rules in `monitoring/`.
- Updated `requirements.txt` to include optional telemetry, transformer, Prometheus client and YAML dependencies and added internal modules for SIEM/reliability/engine; updated `README.md` with LangChain/LlamaIndex middleware and tiered reliability/integration notes.
- SDK package (`tenet_plugin`) with `LangChainTenetMiddleware` and `LlamaIndexTenetMiddleware` for one-import middleware usage and added tracing wrapper support.
- Tests in `tests/unit` and `tests/integration`.

Testing
- Ran `tests/unit/test_detection_engine.py`, `tests/unit/test_reliability.py`, and `tests/unit/test_tenet_plugin.py` using `pytest`, and all unit tests passed.
- Ran `tests/integration/test_latency_sla.py` with `pytest`, which validated the heuristic p99 latency target and passed.
- Exercised `startup`/`health` code paths in automated tests and confirmed model/redis handling and fallback behavior via the reliability tests.

Codex Task
Summary by CodeRabbit
New Features
Documentation
Chores