
[9.3](backport #6289) [OpAMP] Add E2E test #6430

Merged
ycombinator merged 1 commit into 9.3 from mergify/bp/9.3/pr-6289
Feb 27, 2026

@mergify mergify bot commented Feb 26, 2026

What is the problem this PR solves?

// Please do not just reference an issue. Explain WHAT the problem this PR solves here.

This PR ensures that an OTel Collector (from an upstream contrib release) can successfully connect to Fleet Server over OpAMP. Preliminary OpAMP support was added to Fleet Server in #6270, so this PR is a follow-up to that work.

Note: A follow-up PR will add a test (or extend this one) to ensure that the EDOT Collector can successfully connect to Fleet Server over OpAMP (#6394).

How does this PR solve the problem?

// Explain HOW you solved the problem in your code. It is possible that during PR reviews this changes and then this section should be updated.

By adding a new E2E test, TestStandAloneRunningSuite/TestOpAMP, that downloads and extracts the OTel Collector binary from an upstream contrib release, configures it with the opamp extension, turns on the feature_flags.enable_opamp feature flag in Fleet Server, runs the Collector, and verifies that the Collector connects to Fleet Server over OpAMP.

How to test this PR locally

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

* Implement API boilerplate for POST /v1/opamp endpoint

* Add OpAMP section to dev doc

* Flesh out dev doc

* Update dev doc to use Fleet enrollment token

* Check feature flag before handling OpAMP requests

* Allow running specific tests with TEST_RUN env var

* Removing irrelevant file

* WIP: Reimplement using opamp-go server package

* Update spec

* Move OpAMP documentation to separate file

* Remove error that's no longer needed

* Update OpAMP feature flag test to use Enabled() method

The test previously referenced ErrOpAMPDisabled and handleOpAMP which
no longer exist. The feature flag check now happens at route registration
time, so test the Enabled() method directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable HTTP keep-alive for OpAMP requests to fix EOF errors

The server's IdleTimeout (30s) matches the OTel Collector's polling
interval (~30s), causing a race where the server closes the idle
connection just as the client tries to reuse it. Setting Connection:
close on OpAMP responses forces a fresh connection per poll, eliminating
the race with negligible overhead given the 30s polling interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adding configuration files to be used by OpAMP E2E test

* WIP: Adding OpAMP E2E test

* Fix otelcol template data to use nested OpAMP keys

The otelcol-opamp.tpl template accesses {{ .OpAMP.APIKey }} and
{{ .OpAMP.InstanceUID }}, so the template data must nest these under
an "OpAMP" key rather than passing them as flat top-level keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
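
The difference between nested and flat template data can be sketched as follows. The template text below is a hypothetical excerpt (the real otelcol-opamp.tpl is not reproduced in this PR description); only the `{{ .OpAMP.APIKey }}` and `{{ .OpAMP.InstanceUID }}` accessors come from the commit message. The `missingkey=error` option is an addition here so that the flat case fails loudly.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical excerpt standing in for otelcol-opamp.tpl.
const tpl = `instance_uid: {{ .OpAMP.InstanceUID }}
api_key: {{ .OpAMP.APIKey }}
`

// render executes the template against data and returns the result.
// missingkey=error turns a lookup of an absent map key into an error.
func render(data map[string]any) (string, error) {
	t, err := template.New("otelcol").Option("missingkey=error").Parse(tpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Nested under "OpAMP": matches the accessors used by the template.
	nested := map[string]any{
		"OpAMP": map[string]any{"InstanceUID": "uid-123", "APIKey": "key-abc"},
	}
	out, err := render(nested)
	fmt.Printf("nested: err=%v\n%s", err, out)

	// Flat top-level keys: .OpAMP does not exist, so rendering fails.
	flat := map[string]any{"InstanceUID": "uid-123", "APIKey": "key-abc"}
	_, err = render(flat)
	fmt.Println("flat failed:", err != nil)
}
```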

* Use distinct filename for otelcol config in TestOpAMP

The otelcol config was being written to config.yml, overwriting the
fleet-server config in the same temp dir. Rename it to otelcol.yml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make otelcol-contrib download URL platform-aware in TestOpAMP

Use runtime.GOOS and runtime.GOARCH to build the download URL
dynamically instead of hardcoding darwin_arm64. Also chmod the
extracted binary since extractTarGz doesn't preserve permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix resp.Body handling in TestOpAMP

Use explicit Close() instead of defer since resp is reassigned later
in the function, which would cause the deferred close to act on the
wrong response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Increase TestOpAMP timeout and use defer for cleanup

Increase context timeout from 1 to 3 minutes to account for the
otelcol-contrib download. Use defer for cancel() and cmd.Wait() so
cleanup happens even on test failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Start OTel Collector in TestOpAMP

Extract instanceUID and apiKey into variables, remove the placeholder
time.Sleep, and start the otelcol-contrib binary with the OpAMP
extension config pointing at fleet-server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Verify agent enrollment in TestOpAMP

Poll Kibana via AgentIsOnline to confirm the OTel Collector was
enrolled as an agent in Fleet Server after connecting via OpAMP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Extract OTel Collector version into package-level constant

Move the hardcoded otelcol-contrib version into otelColContribVersion
in const.go so it can be easily updated in one place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Continue writing TestOpAMP e2e test

- Configure fleet-server with a static policy token for dummy-policy so
  that GetEnrollmentTokenForPolicyID can find the enrollment token
- Fetch enrollment token before the raw POST to /v1/opamp
- Add Authorization and Content-Type headers to the raw POST
- Assert HTTP 200 response from the raw POST

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix TestOpAMP e2e test

- Enroll a dummy agent before starting the OTel Collector to initialize
  the .fleet-agents index. Without this, findEnrolledAgent fails with
  index_not_found_exception in a standalone fleet-server environment
  (unlike agent-managed fleet-server which self-enrolls on startup).
- Add AgentHasStatus scaffold method that accepts multiple acceptable
  statuses, and AgentIsUpdating that delegates to it.
- Use AgentIsUpdating in TestOpAMP: OpAMP agents communicate via the
  OpAMP protocol rather than Fleet's normal checkin/ack protocol, so
  they never acknowledge the initial policy change action and Kibana
  shows them as "updating" rather than "online".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fixing conflicts during rebase

* Download OTel Contrib source and build collector from it

* Running go fmt

* Fetch entire Agent doc from ES and make finer-grained assertions on its contents

* Check status from doc field

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 7aededf)
@mergify mergify bot requested a review from a team as a code owner February 26, 2026 15:59
@mergify mergify bot added the backport label Feb 26, 2026
@mergify mergify bot requested review from blakerouse and ycombinator February 26, 2026 15:59
@github-actions github-actions bot added the Team:Elastic-Agent-Control-Plane and skip-changelog labels Feb 26, 2026
@mergify mergify bot mentioned this pull request Feb 26, 2026
8 tasks
@ycombinator ycombinator enabled auto-merge (squash) February 26, 2026 18:39
@ycombinator

buildkite test this

@ycombinator ycombinator merged commit 70f7f01 into 9.3 Feb 27, 2026
11 checks passed
@ycombinator ycombinator deleted the mergify/bp/9.3/pr-6289 branch February 27, 2026 00:18