feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement by IngmarVG-IB · Pull Request #469 · NVIDIA/aicr

IngmarVG-IB · 2026-03-27T23:10:21Z

Summary

Adds DNS-AID agent discovery via SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01
Adds OpenShell caller-side policy enforcement (Layer 1) consuming dns-aid-core's PolicyDocument schema
Integrates both into the /v1/agents endpoint with feature gating

Story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it."

Motivation / Context

AICR agents in Kubernetes need a standard discovery layer to find peers and a security boundary to control which connections are permitted. DNS-AID provides discovery via DNS SVCB lookups; OpenShell evaluates target agents' published policy documents before allowing the calling agent to connect.

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

API server (/v1/agents endpoint)
Core libraries (pkg/discovery, pkg/openshell, pkg/errors, pkg/defaults)

Implementation Notes

DNS-AID Discovery (`pkg/discovery/`)

Discoverer: DNS SVCB queries for _{name}._{protocol}._agents.{domain}
Publisher: K8s ConfigMap-based agent registration with create-or-update semantics
TXT-based index at _agents.{domain} for listing all agents
Server lifecycle hooks (OnStart/OnShutdown) for auto-publish/deregister
Feature-gated via AICR_DISCOVERY_ENABLED=true

OpenShell Policy Enforcement (`pkg/openshell/`)

All 16 native policy rules matching dns-aid-core's schema
Three enforcement modes: strict (deny on violation), permissive (log + allow, default), disabled
Fail-open on policy fetch errors (matches dns-aid-core behavior)
SSRF-protected fetcher: HTTPS-only, rejects private/loopback IPs
TTL-based cache (5m default) with bounded eviction (max 1024 entries) and singleflight coalescing
Enforcement layer filtering — Layer 1 only evaluates applicable rules
Realm isolation detection with cross-realm access logging
CEL custom rules deferred to future phase (logged as unsupported)
Controlled via OPENSHELL_MODE env var with startup validation

Integration

/v1/agents evaluates OpenShell policy for each discovered agent with a policy URI
Guard accepts nil — zero behavioral change when discovery is disabled
New error codes: POLICY_DENIED, POLICY_FETCH
New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Key Design Decisions

Fail-open: If a policy document can't be fetched, the connection is allowed. This matches dns-aid-core and is correct for a discovery system where policy is advisory at Layer 1.
No Python dependency: OpenShell is a pure Go reimplementation consuming the same JSON schema as dns-aid-core's Python SDK.
Numeric TLS version comparison: Parses major.minor as integers (not lexicographic) to correctly handle versions like 1.12.

Testing

make test  # all tests pass with -race

pkg/openshell: 84.2% coverage (30 tests)
- Evaluator: 15 table-driven tests covering all 16 rules, layer filtering, multiple violations
- Fetcher: SSRF validation, caching, content-type/size rejection, non-200 handling
- Guard: all 3 modes, fail-open, compliant caller, mode validation
- Availability: midnight wrap, timezone handling, malformed input fail-open
- TLS version: numeric comparison including double-digit minor versions
pkg/discovery: Mock DNS server tests for SVCB parsing, index listing, publisher
pkg/api: Handler tests with nil guard, domain validation, empty index

Risk Assessment

Medium — New packages with no existing consumers beyond the /v1/agents endpoint. Feature-gated behind AICR_DISCOVERY_ENABLED and OPENSHELL_MODE env vars, so zero impact on existing functionality when disabled. Default mode is permissive (log-only), providing a safe rollout path.

Checklist

Tests pass locally with -race
Linter passes (go vet clean, gofmt clean)
No tests skipped or disabled
New tests added for new functionality
Changes follow existing patterns (functional options, pkg/errors, slog, pkg/defaults)
Commits are cryptographically signed (git commit -S)

Introduce pkg/discovery for DNS-AID based agent discovery using SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01. Agents publish themselves via K8s ConfigMaps and discover peers through DNS lookups. - pkg/discovery: Discoverer (DNS SVCB resolver), Publisher (K8s ConfigMap-based registration), SVCB parser for keys 65400-65408 - pkg/server: WithOnStart/WithOnShutdown lifecycle hooks as functional options, executed during server start/shutdown respectively - pkg/api: /v1/agents endpoint listing discovered agents in a domain, with input validation and feature-gated via AICR_DISCOVERY_ENABLED - pkg/defaults: discovery timeout constants and ServerDefaultPort

copy-pr-bot · 2026-03-27T23:10:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-03-27T23:10:32Z

Welcome to AICR, @IngmarVG-IB! Thanks for your first pull request.

Before review, please ensure:

All commits are signed off per the DCO
CI checks pass (tests, lint, security scan)
The PR description explains the why behind your changes

A maintainer will review this soon.

Strip rdatapolicy plugin mentions from pkg/discovery doc comments and the k8s-discovery agent definition. These features are under embargo and should not be referenced in public code.

OpenShell evaluates target agents' policy documents (served at their dns-aid-core policy URI, SvcParamKey 65403) as Layer 1 caller-side enforcement before allowing connections. This completes the security story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it." New pkg/openshell package implements: - All 16 native policy rules matching dns-aid-core's PolicyDocument schema - Three enforcement modes: strict (deny), permissive (log, default), disabled - Fail-open on policy fetch errors (matches dns-aid-core behavior) - SSRF-protected fetcher with HTTPS-only, private IP rejection - TTL-based cache with bounded eviction and singleflight coalescing - Enforcement layer filtering (Layer 0/1/2 per rule) - Realm isolation detection with cross-realm access logging Integration: - /v1/agents endpoint evaluates OpenShell policy per discovered agent - OPENSHELL_MODE env var controls enforcement (strict/permissive/disabled) - Guard accepts nil safely — zero behavioral change when discovery disabled New error codes: POLICY_DENIED, POLICY_FETCH New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB) Test coverage: 84.2% (30 tests across evaluator, fetcher, guard)

Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>

mchmarny

Well-structured PR — clean separation between discovery, policy, and API layers. The code follows project conventions (functional options, pkg/errors, pkg/defaults, slog, table-driven tests, create-or-update K8s semantics). Feature gating via env vars is a good rollout strategy.

Key issues to address:

SSRF bypass via hostname (fetcher.go) — The private IP check only catches raw IP literals in URLs. A hostname resolving to 127.0.0.1 or a link-local address passes validation. Since policy URIs come from DNS records (untrusted input), this needs a DialContext-level check or at minimum a documented limitation.
Silent data loss on context cancellation (resolver.go) — DiscoverAll swallows all errors including context cancellation, returning an empty list with nil error. Callers can't distinguish "no agents" from "timed out mid-discovery."
Confusing dual-return on fail-open (guard.go) — Check returns (allowed=true, err!=nil) on fetch failure. This contract is easy to misuse.
bapVersions v-prefix handling (types.go) — A v-prefixed version like v2.1.0 produces mcp/v2 instead of mcp/2.

Minor/cleanup:

Dead code: g.Wait() error check in DiscoverAll can never trigger
Unnecessary loop variable capture (Go 1.22+ scoping)
Duplicated startMockDNS test helper across packages
PORT env var parsed in two places
.claude/agents/k8s-discovery.md may be unintentional

Overall the architecture is solid. The SSRF issue is the most important one to fix before merge.

mchmarny · 2026-04-14T15:14:22Z

+	)
+
+	g, gctx := errgroup.WithContext(ctx)
+	g.SetLimit(maxDiscoverConcurrency)


The loop variable capture (entry := entry) is unnecessary in Go 1.22+ (this project uses Go 1.26). The loop variable is already per-iteration scoped.

mchmarny · 2026-04-14T15:14:25Z

+			mu.Lock()
+			records = append(records, *record)
+			mu.Unlock()
+			return nil


g.Wait() never returns a non-nil error here because every g.Go callback returns nil (individual failures are logged and skipped at line 113). This entire if err block is dead code. Either remove it or return errors from the goroutines for genuinely fatal failures (e.g., context cancellation).

mchmarny · 2026-04-14T15:14:27Z

+		records []AgentRecord
+	)
+
+	g, gctx := errgroup.WithContext(ctx)
+	g.SetLimit(maxDiscoverConcurrency)
+


Context cancellation is silently swallowed inside the goroutines (line 107 returns nil on all errors). If gctx is canceled (e.g., parent timeout), every in-flight DNS query will fail and be logged as a warning, but DiscoverAll will return an empty list with no error. Consider checking ctx.Err() after g.Wait() and propagating cancellation so callers know the result is incomplete due to timeout rather than an empty domain.

mchmarny · 2026-04-14T15:14:29Z

+	}
+	for _, opt := range opts {
+		opt(g)
+	}


NewGuard panics on invalid mode, but setupDiscovery in pkg/api/server.go already validates the mode and returns an error. The panic is redundant for the current call site — and panics in library code are generally undesirable. Consider returning an error from a constructor, or just trusting the caller validation and removing the panic. If you keep it, document the panic in the function's doc comment.

mchmarny · 2026-04-14T15:14:31Z

+	}
+
+	policyURI := record.Params.Policy
+	if policyURI == "" {
+		return allowed, nil
+	}


Check returns both an allowed result and a non-nil error on fetch failure. The caller in handleAgents (agents.go:103) checks err != nil to log, then checks result != nil separately. This dual-return contract is easy to misuse — a future caller might short-circuit on err != nil and miss that the result is still allowed: true (fail-open). Consider returning (allowed, nil) on fail-open with the fetch warning logged internally, or documenting this contract more prominently.

mchmarny · 2026-04-14T15:14:45Z

+// writes the index ConfigMap. Uses optimistic concurrency via resourceVersion
+// to prevent lost updates when multiple publishers race.
+func (p *Publisher) updateIndex(ctx context.Context) error {
+	for range maxIndexUpdateRetries {


The index is rebuilt by listing all agent ConfigMaps on every publish/deregister. With many agents this becomes O(N) on every registration. Fine for the current scale, but if you expect >100 agents per namespace, consider an append/remove strategy with a periodic full reconciliation instead.

mchmarny · 2026-04-14T15:14:47Z

+		if idx := strings.IndexByte(r.Version, '.'); idx > 0 {
+			ver = r.Version[:idx]
+		} else {
+			ver = r.Version
+		}
+	}


bapVersions() extracts the major version from r.Version using string splitting. If Version is a semver like v2.1.0 (with v prefix), this returns v2 rather than 2, producing mcp/v2 instead of mcp/2. Consider stripping the v prefix if present.

mchmarny · 2026-04-14T15:14:48Z

+		return nil, errors.New(errors.ErrCodePolicyFetch,
+			"policy URI returned unexpected content-type: "+ct)
+	}
+


Good: reading maxBytes+1 then checking length is the correct pattern to detect oversized responses without allocating unbounded memory. Clean.

mchmarny · 2026-04-14T15:14:50Z

+	"github.com/miekg/dns"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"


The startMockDNS helper is duplicated from pkg/discovery/testhelper_test.go (the comment says so). Consider extracting it to an exported testutil or internal/testdns package to avoid drift between the two copies.

mchmarny · 2026-04-14T15:14:51Z

@@ -0,0 +1,24 @@
+---


Is this file intended to be checked in? It looks like a Claude Code agent instruction file rather than project source. If it was added unintentionally, consider removing it from the PR.

IngmarVG-IB requested a review from a team as a code owner March 27, 2026 23:10

github-actions bot added area/api size/XL labels Mar 27, 2026

IngmarVG-IB added 2 commits March 27, 2026 20:14

fix: remove embargoed rdata-policy references from discovery docs

0a6724e

Strip rdatapolicy plugin mentions from pkg/discovery doc comments and the k8s-discovery agent definition. These features are under embargo and should not be referenced in public code.

IngmarVG-IB changed the title ~~feat(discovery): add DNS-AID agent discovery and server lifecycle hooks~~ feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement Mar 28, 2026

mchmarny assigned IngmarVG-IB Apr 2, 2026

mchmarny requested a review from lockwobr April 2, 2026 11:25

Merge branch 'main' into feat/dns-aid-discovery

3b66bf9

Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>

mchmarny requested changes Apr 14, 2026

View reviewed changes

Conversation

IngmarVG-IB commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

DNS-AID Discovery (pkg/discovery/)

OpenShell Policy Enforcement (pkg/openshell/)

Integration

Key Design Decisions

Testing

Risk Assessment

Checklist

Uh oh!

copy-pr-bot bot commented Mar 27, 2026

Uh oh!

github-actions bot commented Mar 27, 2026

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IngmarVG-IB commented Mar 27, 2026 •

edited

Loading

DNS-AID Discovery (`pkg/discovery/`)

OpenShell Policy Enforcement (`pkg/openshell/`)