Skip to content

feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469

Open
IngmarVG-IB wants to merge 4 commits intoNVIDIA:mainfrom
IngmarVG-IB:feat/dns-aid-discovery
Open

feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469
IngmarVG-IB wants to merge 4 commits intoNVIDIA:mainfrom
IngmarVG-IB:feat/dns-aid-discovery

Conversation

@IngmarVG-IB
Copy link
Copy Markdown

@IngmarVG-IB IngmarVG-IB commented Mar 27, 2026

Summary

  • Adds DNS-AID agent discovery via SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01
  • Adds OpenShell caller-side policy enforcement (Layer 1) consuming dns-aid-core's PolicyDocument schema
  • Integrates both into the /v1/agents endpoint with feature gating

Story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it."

Motivation / Context

AICR agents in Kubernetes need a standard discovery layer to find peers and a security boundary to control which connections are permitted. DNS-AID provides discovery via DNS SVCB lookups; OpenShell evaluates target agents' published policy documents before allowing the calling agent to connect.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • API server (/v1/agents endpoint)
  • Core libraries (pkg/discovery, pkg/openshell, pkg/errors, pkg/defaults)

Implementation Notes

DNS-AID Discovery (pkg/discovery/)

  • Discoverer: DNS SVCB queries for _{name}._{protocol}._agents.{domain}
  • Publisher: K8s ConfigMap-based agent registration with create-or-update semantics
  • TXT-based index at _agents.{domain} for listing all agents
  • Server lifecycle hooks (OnStart/OnShutdown) for auto-publish/deregister
  • Feature-gated via AICR_DISCOVERY_ENABLED=true

OpenShell Policy Enforcement (pkg/openshell/)

  • All 16 native policy rules matching dns-aid-core's schema
  • Three enforcement modes: strict (deny on violation), permissive (log + allow, default), disabled
  • Fail-open on policy fetch errors (matches dns-aid-core behavior)
  • SSRF-protected fetcher: HTTPS-only, rejects private/loopback IPs
  • TTL-based cache (5m default) with bounded eviction (max 1024 entries) and singleflight coalescing
  • Enforcement layer filtering — Layer 1 only evaluates applicable rules
  • Realm isolation detection with cross-realm access logging
  • CEL custom rules deferred to future phase (logged as unsupported)
  • Controlled via OPENSHELL_MODE env var with startup validation

Integration

  • /v1/agents evaluates OpenShell policy for each discovered agent with a policy URI
  • Guard accepts nil — zero behavioral change when discovery is disabled
  • New error codes: POLICY_DENIED, POLICY_FETCH
  • New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Key Design Decisions

  • Fail-open: If a policy document can't be fetched, the connection is allowed. This matches dns-aid-core and is correct for a discovery system where policy is advisory at Layer 1.
  • No Python dependency: OpenShell is a pure Go reimplementation consuming the same JSON schema as dns-aid-core's Python SDK.
  • Numeric TLS version comparison: Parses major.minor as integers (not lexicographic) to correctly handle versions like 1.12.

Testing

make test  # all tests pass with -race
  • pkg/openshell: 84.2% coverage (30 tests)
    • Evaluator: 15 table-driven tests covering all 16 rules, layer filtering, multiple violations
    • Fetcher: SSRF validation, caching, content-type/size rejection, non-200 handling
    • Guard: all 3 modes, fail-open, compliant caller, mode validation
    • Availability: midnight wrap, timezone handling, malformed input fail-open
    • TLS version: numeric comparison including double-digit minor versions
  • pkg/discovery: Mock DNS server tests for SVCB parsing, index listing, publisher
  • pkg/api: Handler tests with nil guard, domain validation, empty index

Risk Assessment

Medium — New packages with no existing consumers beyond the /v1/agents endpoint. Feature-gated behind AICR_DISCOVERY_ENABLED and OPENSHELL_MODE env vars, so zero impact on existing functionality when disabled. Default mode is permissive (log-only), providing a safe rollout path.

Checklist

  • Tests pass locally with -race
  • Linter passes (go vet clean, gofmt clean)
  • No tests skipped or disabled
  • New tests added for new functionality
  • Changes follow existing patterns (functional options, pkg/errors, slog, pkg/defaults)
  • Commits are cryptographically signed (git commit -S)

Introduce pkg/discovery for DNS-AID based agent discovery using SVCB
records (RFC 9460) with private-use SvcParamKeys per
draft-mozleywilliams-dnsop-dnsaid-01. Agents publish themselves via K8s
ConfigMaps and discover peers through DNS lookups.

- pkg/discovery: Discoverer (DNS SVCB resolver), Publisher (K8s
  ConfigMap-based registration), SVCB parser for keys 65400-65408
- pkg/server: WithOnStart/WithOnShutdown lifecycle hooks as functional
  options, executed during server start/shutdown respectively
- pkg/api: /v1/agents endpoint listing discovered agents in a domain,
  with input validation and feature-gated via AICR_DISCOVERY_ENABLED
- pkg/defaults: discovery timeout constants and ServerDefaultPort
@IngmarVG-IB IngmarVG-IB requested a review from a team as a code owner March 27, 2026 23:10
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

Welcome to AICR, @IngmarVG-IB! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

Strip rdatapolicy plugin mentions from pkg/discovery doc comments
and the k8s-discovery agent definition. These features are under
embargo and should not be referenced in public code.
OpenShell evaluates target agents' policy documents (served at their
dns-aid-core policy URI, SvcParamKey 65403) as Layer 1 caller-side
enforcement before allowing connections. This completes the security
story: "OpenShell secures the agent boundary; DNS-AID resolves what's
inside it."

New pkg/openshell package implements:
- All 16 native policy rules matching dns-aid-core's PolicyDocument schema
- Three enforcement modes: strict (deny), permissive (log, default), disabled
- Fail-open on policy fetch errors (matches dns-aid-core behavior)
- SSRF-protected fetcher with HTTPS-only, private IP rejection
- TTL-based cache with bounded eviction and singleflight coalescing
- Enforcement layer filtering (Layer 0/1/2 per rule)
- Realm isolation detection with cross-realm access logging

Integration:
- /v1/agents endpoint evaluates OpenShell policy per discovered agent
- OPENSHELL_MODE env var controls enforcement (strict/permissive/disabled)
- Guard accepts nil safely — zero behavioral change when discovery disabled

New error codes: POLICY_DENIED, POLICY_FETCH
New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Test coverage: 84.2% (30 tests across evaluator, fetcher, guard)
@IngmarVG-IB IngmarVG-IB changed the title feat(discovery): add DNS-AID agent discovery and server lifecycle hooks feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement Mar 28, 2026
@mchmarny mchmarny requested a review from lockwobr April 2, 2026 11:25
Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR — clean separation between discovery, policy, and API layers. The code follows project conventions (functional options, pkg/errors, pkg/defaults, slog, table-driven tests, create-or-update K8s semantics). Feature gating via env vars is a good rollout strategy.

Key issues to address:

  1. SSRF bypass via hostname (fetcher.go) — The private IP check only catches raw IP literals in URLs. A hostname resolving to 127.0.0.1 or a link-local address passes validation. Since policy URIs come from DNS records (untrusted input), this needs a DialContext-level check or at minimum a documented limitation.

  2. Silent data loss on context cancellation (resolver.go)DiscoverAll swallows all errors including context cancellation, returning an empty list with nil error. Callers can't distinguish "no agents" from "timed out mid-discovery."

  3. Confusing dual-return on fail-open (guard.go)Check returns (allowed=true, err!=nil) on fetch failure. This contract is easy to misuse.

  4. bapVersions v-prefix handling (types.go) — A v-prefixed version like v2.1.0 produces mcp/v2 instead of mcp/2.

Minor/cleanup:

  • Dead code: g.Wait() error check in DiscoverAll can never trigger
  • Unnecessary loop variable capture (Go 1.22+ scoping)
  • Duplicated startMockDNS test helper across packages
  • PORT env var parsed in two places
  • .claude/agents/k8s-discovery.md may be unintentional

Overall the architecture is solid. The SSRF issue is the most important one to fix before merge.

Comment thread pkg/discovery/resolver.go
)

g, gctx := errgroup.WithContext(ctx)
g.SetLimit(maxDiscoverConcurrency)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop variable capture (entry := entry) is unnecessary in Go 1.22+ (this project uses Go 1.26). The loop variable is already per-iteration scoped.

Comment thread pkg/discovery/resolver.go
mu.Lock()
records = append(records, *record)
mu.Unlock()
return nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g.Wait() never returns a non-nil error here because every g.Go callback returns nil (individual failures are logged and skipped at line 113). This entire if err block is dead code. Either remove it or return errors from the goroutines for genuinely fatal failures (e.g., context cancellation).

Comment thread pkg/discovery/resolver.go
Comment on lines +100 to +105
records []AgentRecord
)

g, gctx := errgroup.WithContext(ctx)
g.SetLimit(maxDiscoverConcurrency)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context cancellation is silently swallowed inside the goroutines (line 107 returns nil on all errors). If gctx is canceled (e.g., parent timeout), every in-flight DNS query will fail and be logged as a warning, but DiscoverAll will return an empty list with no error. Consider checking ctx.Err() after g.Wait() and propagating cancellation so callers know the result is incomplete due to timeout rather than an empty domain.

Comment thread pkg/openshell/guard.go
Comment on lines +63 to +66
}
for _, opt := range opts {
opt(g)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewGuard panics on invalid mode, but setupDiscovery in pkg/api/server.go already validates the mode and returns an error. The panic is redundant for the current call site — and panics in library code are generally undesirable. Consider returning an error from a constructor, or just trusting the caller validation and removing the panic. If you keep it, document the panic in the function's doc comment.

Comment thread pkg/openshell/guard.go
Comment on lines +88 to +93
}

policyURI := record.Params.Policy
if policyURI == "" {
return allowed, nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check returns both an allowed result and a non-nil error on fetch failure. The caller in handleAgents (agents.go:103) checks err != nil to log, then checks result != nil separately. This dual-return contract is easy to misuse — a future caller might short-circuit on err != nil and miss that the result is still allowed: true (fail-open). Consider returning (allowed, nil) on fail-open with the fetch warning logged internally, or documenting this contract more prominently.

// writes the index ConfigMap. Uses optimistic concurrency via resourceVersion
// to prevent lost updates when multiple publishers race.
func (p *Publisher) updateIndex(ctx context.Context) error {
for range maxIndexUpdateRetries {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index is rebuilt by listing all agent ConfigMaps on every publish/deregister. With many agents this becomes O(N) on every registration. Fine for the current scale, but if you expect >100 agents per namespace, consider an append/remove strategy with a periodic full reconciliation instead.

Comment thread pkg/discovery/types.go
Comment on lines +83 to +88
if idx := strings.IndexByte(r.Version, '.'); idx > 0 {
ver = r.Version[:idx]
} else {
ver = r.Version
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bapVersions() extracts the major version from r.Version using string splitting. If Version is a semver like v2.1.0 (with v prefix), this returns v2 rather than 2, producing mcp/v2 instead of mcp/2. Consider stripping the v prefix if present.

Comment thread pkg/openshell/fetcher.go
return nil, errors.New(errors.ErrCodePolicyFetch,
"policy URI returned unexpected content-type: "+ct)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: reading maxBytes+1 then checking length is the correct pattern to detect oversized responses without allocating unbounded memory. Clean.

Comment thread pkg/api/agents_test.go
Comment on lines +27 to +29
"github.com/miekg/dns"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The startMockDNS helper is duplicated from pkg/discovery/testhelper_test.go (the comment says so). Consider extracting it to an exported testutil or internal/testdns package to avoid drift between the two copies.

@@ -0,0 +1,24 @@
---
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file intended to be checked in? It looks like a Claude Code agent instruction file rather than project source. If it was added unintentionally, consider removing it from the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants