feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469
feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469IngmarVG-IB wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
Introduce pkg/discovery for DNS-AID based agent discovery using SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01. Agents publish themselves via K8s ConfigMaps and discover peers through DNS lookups. - pkg/discovery: Discoverer (DNS SVCB resolver), Publisher (K8s ConfigMap-based registration), SVCB parser for keys 65400-65408 - pkg/server: WithOnStart/WithOnShutdown lifecycle hooks as functional options, executed during server start/shutdown respectively - pkg/api: /v1/agents endpoint listing discovered agents in a domain, with input validation and feature-gated via AICR_DISCOVERY_ENABLED - pkg/defaults: discovery timeout constants and ServerDefaultPort
|
Welcome to AICR, @IngmarVG-IB! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
Strip rdatapolicy plugin mentions from pkg/discovery doc comments and the k8s-discovery agent definition. These features are under embargo and should not be referenced in public code.
OpenShell evaluates target agents' policy documents (served at their dns-aid-core policy URI, SvcParamKey 65403) as Layer 1 caller-side enforcement before allowing connections. This completes the security story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it." New pkg/openshell package implements: - All 16 native policy rules matching dns-aid-core's PolicyDocument schema - Three enforcement modes: strict (deny), permissive (log, default), disabled - Fail-open on policy fetch errors (matches dns-aid-core behavior) - SSRF-protected fetcher with HTTPS-only, private IP rejection - TTL-based cache with bounded eviction and singleflight coalescing - Enforcement layer filtering (Layer 0/1/2 per rule) - Realm isolation detection with cross-realm access logging Integration: - /v1/agents endpoint evaluates OpenShell policy per discovered agent - OPENSHELL_MODE env var controls enforcement (strict/permissive/disabled) - Guard accepts nil safely — zero behavioral change when discovery disabled New error codes: POLICY_DENIED, POLICY_FETCH New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB) Test coverage: 84.2% (30 tests across evaluator, fetcher, guard)
Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>
mchmarny
left a comment
There was a problem hiding this comment.
Well-structured PR — clean separation between discovery, policy, and API layers. The code follows project conventions (functional options, pkg/errors, pkg/defaults, slog, table-driven tests, create-or-update K8s semantics). Feature gating via env vars is a good rollout strategy.
Key issues to address:
-
SSRF bypass via hostname (fetcher.go) — The private IP check only catches raw IP literals in URLs. A hostname resolving to
127.0.0.1or a link-local address passes validation. Since policy URIs come from DNS records (untrusted input), this needs aDialContext-level check or at minimum a documented limitation. -
Silent data loss on context cancellation (resolver.go) —
DiscoverAllswallows all errors including context cancellation, returning an empty list withnilerror. Callers can't distinguish "no agents" from "timed out mid-discovery." -
Confusing dual-return on fail-open (guard.go) —
Checkreturns(allowed=true, err!=nil)on fetch failure. This contract is easy to misuse. -
bapVersionsv-prefix handling (types.go) — Av-prefixed version likev2.1.0producesmcp/v2instead ofmcp/2.
Minor/cleanup:
- Dead code:
g.Wait()error check inDiscoverAllcan never trigger - Unnecessary loop variable capture (Go 1.22+ scoping)
- Duplicated
startMockDNStest helper across packages PORTenv var parsed in two places.claude/agents/k8s-discovery.mdmay be unintentional
Overall the architecture is solid. The SSRF issue is the most important one to fix before merge.
| ) | ||
|
|
||
| g, gctx := errgroup.WithContext(ctx) | ||
| g.SetLimit(maxDiscoverConcurrency) |
There was a problem hiding this comment.
The loop variable capture (entry := entry) is unnecessary in Go 1.22+ (this project uses Go 1.26). The loop variable is already per-iteration scoped.
| mu.Lock() | ||
| records = append(records, *record) | ||
| mu.Unlock() | ||
| return nil |
There was a problem hiding this comment.
g.Wait() never returns a non-nil error here because every g.Go callback returns nil (individual failures are logged and skipped at line 113). This entire if err block is dead code. Either remove it or return errors from the goroutines for genuinely fatal failures (e.g., context cancellation).
| records []AgentRecord | ||
| ) | ||
|
|
||
| g, gctx := errgroup.WithContext(ctx) | ||
| g.SetLimit(maxDiscoverConcurrency) | ||
|
|
There was a problem hiding this comment.
Context cancellation is silently swallowed inside the goroutines (line 107 returns nil on all errors). If gctx is canceled (e.g., parent timeout), every in-flight DNS query will fail and be logged as a warning, but DiscoverAll will return an empty list with no error. Consider checking ctx.Err() after g.Wait() and propagating cancellation so callers know the result is incomplete due to timeout rather than an empty domain.
| } | ||
| for _, opt := range opts { | ||
| opt(g) | ||
| } |
There was a problem hiding this comment.
NewGuard panics on invalid mode, but setupDiscovery in pkg/api/server.go already validates the mode and returns an error. The panic is redundant for the current call site — and panics in library code are generally undesirable. Consider returning an error from a constructor, or just trusting the caller validation and removing the panic. If you keep it, document the panic in the function's doc comment.
| } | ||
|
|
||
| policyURI := record.Params.Policy | ||
| if policyURI == "" { | ||
| return allowed, nil | ||
| } |
There was a problem hiding this comment.
Check returns both an allowed result and a non-nil error on fetch failure. The caller in handleAgents (agents.go:103) checks err != nil to log, then checks result != nil separately. This dual-return contract is easy to misuse — a future caller might short-circuit on err != nil and miss that the result is still allowed: true (fail-open). Consider returning (allowed, nil) on fail-open with the fetch warning logged internally, or documenting this contract more prominently.
| // writes the index ConfigMap. Uses optimistic concurrency via resourceVersion | ||
| // to prevent lost updates when multiple publishers race. | ||
| func (p *Publisher) updateIndex(ctx context.Context) error { | ||
| for range maxIndexUpdateRetries { |
There was a problem hiding this comment.
The index is rebuilt by listing all agent ConfigMaps on every publish/deregister. With many agents this becomes O(N) on every registration. Fine for the current scale, but if you expect >100 agents per namespace, consider an append/remove strategy with a periodic full reconciliation instead.
| if idx := strings.IndexByte(r.Version, '.'); idx > 0 { | ||
| ver = r.Version[:idx] | ||
| } else { | ||
| ver = r.Version | ||
| } | ||
| } |
There was a problem hiding this comment.
bapVersions() extracts the major version from r.Version using string splitting. If Version is a semver like v2.1.0 (with v prefix), this returns v2 rather than 2, producing mcp/v2 instead of mcp/2. Consider stripping the v prefix if present.
| return nil, errors.New(errors.ErrCodePolicyFetch, | ||
| "policy URI returned unexpected content-type: "+ct) | ||
| } | ||
|
|
There was a problem hiding this comment.
Good: reading maxBytes+1 then checking length is the correct pattern to detect oversized responses without allocating unbounded memory. Clean.
| "github.com/miekg/dns" | ||
| "github.com/stretchr/testify/assert" | ||
| "github.com/stretchr/testify/require" |
There was a problem hiding this comment.
The startMockDNS helper is duplicated from pkg/discovery/testhelper_test.go (the comment says so). Consider extracting it to an exported testutil or internal/testdns package to avoid drift between the two copies.
| @@ -0,0 +1,24 @@ | |||
| --- | |||
There was a problem hiding this comment.
Is this file intended to be checked in? It looks like a Claude Code agent instruction file rather than project source. If it was added unintentionally, consider removing it from the PR.
Summary
/v1/agentsendpoint with feature gatingStory: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it."
Motivation / Context
AICR agents in Kubernetes need a standard discovery layer to find peers and a security boundary to control which connections are permitted. DNS-AID provides discovery via DNS SVCB lookups; OpenShell evaluates target agents' published policy documents before allowing the calling agent to connect.
Type of Change
Component(s) Affected
/v1/agentsendpoint)pkg/discovery,pkg/openshell,pkg/errors,pkg/defaults)Implementation Notes
DNS-AID Discovery (
pkg/discovery/)Discoverer: DNS SVCB queries for_{name}._{protocol}._agents.{domain}Publisher: K8s ConfigMap-based agent registration with create-or-update semantics_agents.{domain}for listing all agentsOnStart/OnShutdown) for auto-publish/deregisterAICR_DISCOVERY_ENABLED=trueOpenShell Policy Enforcement (
pkg/openshell/)strict(deny on violation),permissive(log + allow, default),disabledOPENSHELL_MODEenv var with startup validationIntegration
/v1/agentsevaluates OpenShell policy for each discovered agent with a policy URInil— zero behavioral change when discovery is disabledPOLICY_DENIED,POLICY_FETCHPolicyFetchTimeout(3s),PolicyCacheTTL(5m),PolicyMaxBytes(64KB)Key Design Decisions
major.minoras integers (not lexicographic) to correctly handle versions like1.12.Testing
pkg/openshell: 84.2% coverage (30 tests)pkg/discovery: Mock DNS server tests for SVCB parsing, index listing, publisherpkg/api: Handler tests with nil guard, domain validation, empty indexRisk Assessment
Medium — New packages with no existing consumers beyond the
/v1/agentsendpoint. Feature-gated behindAICR_DISCOVERY_ENABLEDandOPENSHELL_MODEenv vars, so zero impact on existing functionality when disabled. Default mode ispermissive(log-only), providing a safe rollout path.Checklist
-racego vetclean,gofmtclean)pkg/errors, slog,pkg/defaults)git commit -S)