
[OpAMP] Add E2E test#6289

Merged
ycombinator merged 30 commits into elastic:main from ycombinator:opamp-e2e-test
Feb 26, 2026

@ycombinator ycombinator commented Feb 4, 2026

What is the problem this PR solves?

// Please do not just reference an issue. Explain WHAT the problem this PR solves here.

This PR ensures that an OTel Collector (from an upstream contrib release) can connect to Fleet Server over OpAMP. Preliminary OpAMP support was added to Fleet Server in #6270, so this PR is a follow-up to that work.

Note: A follow-up PR will add a test (or extend this one) to ensure that the EDOT Collector can connect to Fleet Server over OpAMP (#6394).

How does this PR solve the problem?

// Explain HOW you solved the problem in your code. It is possible that during PR reviews this changes and then this section should be updated.

By adding a new E2E test, TestStandAloneRunningSuite/TestOpAMP, that downloads and extracts the OTel Collector binary from an upstream contrib release, configures it with the opamp extension, turns on the feature_flags.enable_opamp feature flag in Fleet Server, runs the Collector, and verifies that the Collector connects to Fleet Server over OpAMP.

How to test this PR locally

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues


mergify bot commented Feb 4, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.


github-actions bot commented Feb 18, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.


github-actions bot commented Feb 18, 2026

🔍 Preview links for changed docs


mergify bot commented Feb 20, 2026

This pull request is now in conflicts. Could you fix it @ycombinator? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b opamp-e2e-test upstream/opamp-e2e-test
git merge upstream/main
git push upstream opamp-e2e-test

ycombinator and others added 21 commits February 19, 2026 16:42
The test previously referenced ErrOpAMPDisabled and handleOpAMP which
no longer exist. The feature flag check now happens at route registration
time, so test the Enabled() method directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The server's IdleTimeout (30s) matches the OTel Collector's polling
interval (~30s), causing a race where the server closes the idle
connection just as the client tries to reuse it. Setting Connection:
close on OpAMP responses forces a fresh connection per poll, eliminating
the race with negligible overhead given the 30s polling interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The otelcol-opamp.tpl template accesses {{ .OpAMP.APIKey }} and
{{ .OpAMP.InstanceUID }}, so the template data must nest these under
an "OpAMP" key rather than passing them as flat top-level keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
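
To illustrate the nesting requirement above, here is a minimal sketch using text/template. The template text is a hypothetical stand-in for otelcol-opamp.tpl (only the two field references, .OpAMP.APIKey and .OpAMP.InstanceUID, come from the commit message), and renderOtelcolConfig is an assumed helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// otelcolTpl is a hypothetical stand-in for otelcol-opamp.tpl. The key names
// on the left are made up; only the {{ .OpAMP.* }} references are from the PR.
const otelcolTpl = "instance_uid: {{ .OpAMP.InstanceUID }}\napi_key: {{ .OpAMP.APIKey }}\n"

// renderOtelcolConfig nests the values under an "OpAMP" key so that the
// template's .OpAMP.APIKey and .OpAMP.InstanceUID lookups resolve.
func renderOtelcolConfig(apiKey, instanceUID string) (string, error) {
	// missingkey=error turns a wrongly shaped data map into an error
	// instead of silently rendering "<no value>".
	tpl, err := template.New("otelcol").Option("missingkey=error").Parse(otelcolTpl)
	if err != nil {
		return "", err
	}
	data := map[string]any{
		"OpAMP": map[string]any{
			"APIKey":      apiKey,
			"InstanceUID": instanceUID,
		},
	}
	var buf bytes.Buffer
	if err := tpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderOtelcolConfig("example-key", "example-uid")
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Passing the same two values as flat top-level keys would leave .OpAMP unresolved, which is the failure mode the commit fixes.
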
The otelcol config was being written to config.yml, overwriting the
fleet-server config in the same temp dir. Rename it to otelcol.yml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use runtime.GOOS and runtime.GOARCH to build the download URL
dynamically instead of hardcoding darwin_arm64. Also chmod the
extracted binary since extractTarGz doesn't preserve permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
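
A rough sketch of the platform-aware download URL and the chmod step follows. The URL layout is an assumption based on the naming of upstream otelcol-contrib release assets, the version string is a placeholder, and otelcolContribURL and makeExecutable are hypothetical helper names rather than the ones used in the test.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// otelcolContribURL builds a release tarball URL for the given OS and
// architecture. The URL pattern is an assumption, not taken from this PR.
func otelcolContribURL(version, goos, goarch string) string {
	return fmt.Sprintf(
		"https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v%s/otelcol-contrib_%s_%s_%s.tar.gz",
		version, version, goos, goarch,
	)
}

// makeExecutable restores the execute bits that an extraction helper may
// not preserve.
func makeExecutable(path string) error {
	return os.Chmod(path, 0o755)
}

func main() {
	// "0.0.0" is a placeholder version, not the one pinned by the test.
	fmt.Println(otelcolContribURL("0.0.0", runtime.GOOS, runtime.GOARCH))
}
```

Deriving the OS and architecture from runtime.GOOS and runtime.GOARCH lets the same test run on a developer laptop and on CI workers without edits.
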
Use explicit Close() instead of defer since resp is reassigned later
in the function, which would cause the deferred close to act on the
wrong response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Increase context timeout from 1 to 3 minutes to account for the
otelcol-contrib download. Use defer for cancel() and cmd.Wait() so
cleanup happens even on test failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract instanceUID and apiKey into variables, remove the placeholder
time.Sleep, and start the otelcol-contrib binary with the OpAMP
extension config pointing at fleet-server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ycombinator added the Team:Elastic-Agent-Control-Plane, backport-active-9, and skip-changelog labels Feb 20, 2026
blakerouse previously approved these changes Feb 23, 2026

@blakerouse left a comment

lgtm!

@michel-laterman left a comment

I'm assuming we'll need to update the expected state for this test once Kibana changes so that opamp agents don't appear as "Updating"

@michel-laterman left a comment

We should also consider moving the opamp e2e test into its own suite so it can be easily extended in the future

@ycombinator ycombinator merged commit 7aededf into elastic:main Feb 26, 2026
10 checks passed
@github-actions

@Mergifyio backport 9.2 9.3


mergify bot commented Feb 26, 2026

backport 9.2 9.3

✅ Backports have been created


mergify bot pushed a commit that referenced this pull request Feb 26, 2026
* Implement API boilerplate for POST /v1/opamp endpoint

* Add OpAMP section to dev doc

* Flesh out dev doc

* Update dev doc to use Fleet enrollment token

* Check feature flag before handling OpAMP requests

* Allow running specific tests with TEST_RUN env var

* Removing irrelevant file

* WIP: Reimplement using opamp-go server package

* Update spec

* Move OpAMP documentation to separate file

* Remove error that's no longer needed

* Update OpAMP feature flag test to use Enabled() method

The test previously referenced ErrOpAMPDisabled and handleOpAMP which
no longer exist. The feature flag check now happens at route registration
time, so test the Enabled() method directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Disable HTTP keep-alive for OpAMP requests to fix EOF errors

The server's IdleTimeout (30s) matches the OTel Collector's polling
interval (~30s), causing a race where the server closes the idle
connection just as the client tries to reuse it. Setting Connection:
close on OpAMP responses forces a fresh connection per poll, eliminating
the race with negligible overhead given the 30s polling interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adding configuration files to be used by OpAMP E2E test

* WIP: Adding OpAMP E2E test

* Fix otelcol template data to use nested OpAMP keys

The otelcol-opamp.tpl template accesses {{ .OpAMP.APIKey }} and
{{ .OpAMP.InstanceUID }}, so the template data must nest these under
an "OpAMP" key rather than passing them as flat top-level keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use distinct filename for otelcol config in TestOpAMP

The otelcol config was being written to config.yml, overwriting the
fleet-server config in the same temp dir. Rename it to otelcol.yml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make otelcol-contrib download URL platform-aware in TestOpAMP

Use runtime.GOOS and runtime.GOARCH to build the download URL
dynamically instead of hardcoding darwin_arm64. Also chmod the
extracted binary since extractTarGz doesn't preserve permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix resp.Body handling in TestOpAMP

Use explicit Close() instead of defer since resp is reassigned later
in the function, which would cause the deferred close to act on the
wrong response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Increase TestOpAMP timeout and use defer for cleanup

Increase context timeout from 1 to 3 minutes to account for the
otelcol-contrib download. Use defer for cancel() and cmd.Wait() so
cleanup happens even on test failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Start OTel Collector in TestOpAMP

Extract instanceUID and apiKey into variables, remove the placeholder
time.Sleep, and start the otelcol-contrib binary with the OpAMP
extension config pointing at fleet-server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Verify agent enrollment in TestOpAMP

Poll Kibana via AgentIsOnline to confirm the OTel Collector was
enrolled as an agent in Fleet Server after connecting via OpAMP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Extract OTel Collector version into package-level constant

Move the hardcoded otelcol-contrib version into otelColContribVersion
in const.go so it can be easily updated in one place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Continue writing TestOpAMP e2e test

- Configure fleet-server with a static policy token for dummy-policy so
  that GetEnrollmentTokenForPolicyID can find the enrollment token
- Fetch enrollment token before the raw POST to /v1/opamp
- Add Authorization and Content-Type headers to the raw POST
- Assert HTTP 200 response from the raw POST

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix TestOpAMP e2e test

- Enroll a dummy agent before starting the OTel Collector to initialize
  the .fleet-agents index. Without this, findEnrolledAgent fails with
  index_not_found_exception in a standalone fleet-server environment
  (unlike agent-managed fleet-server which self-enrolls on startup).
- Add AgentHasStatus scaffold method that accepts multiple acceptable
  statuses, and AgentIsUpdating that delegates to it.
- Use AgentIsUpdating in TestOpAMP: OpAMP agents communicate via the
  OpAMP protocol rather than Fleet's normal checkin/ack protocol, so
  they never acknowledge the initial policy change action and Kibana
  shows them as "updating" rather than "online".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fixing conflicts during rebase

* Download OTel Contrib source and build collector from it

* Running go fmt

* Fetch entire Agent doc from ES and make finer-grained assertions on its contents

* Check status from doc field

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 7aededf)
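
The status-matching idea behind AgentHasStatus and AgentIsUpdating, as described in the commit message above, might look roughly like the sketch below. It covers only the matching logic; the real scaffold methods also poll for the agent's status, and the names and signatures here (hasAcceptableStatus, isUpdating) are hypothetical.

```go
package main

import "fmt"

// hasAcceptableStatus reports whether status matches any of the acceptable
// statuses. Hypothetical helper; the PR's AgentHasStatus also handles polling.
func hasAcceptableStatus(status string, acceptable ...string) bool {
	for _, want := range acceptable {
		if status == want {
			return true
		}
	}
	return false
}

// isUpdating delegates to hasAcceptableStatus with the single "updating"
// status, mirroring how AgentIsUpdating is said to delegate to AgentHasStatus.
func isUpdating(status string) bool {
	return hasAcceptableStatus(status, "updating")
}

func main() {
	fmt.Println(isUpdating("updating"))
	fmt.Println(hasAcceptableStatus("online", "updating", "online"))
	fmt.Println(isUpdating("offline"))
}
```

Accepting a list of statuses keeps the assertion easy to relax if OpAMP agents later report a status other than "updating".
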
mergify bot pushed a commit that referenced this pull request Feb 26, 2026
(cherry picked from commit 7aededf)
@ycombinator ycombinator deleted the opamp-e2e-test branch February 26, 2026 18:39
ycombinator added a commit that referenced this pull request Feb 26, 2026
(cherry picked from commit 7aededf)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Labels

backport-active-9, skip-changelog, Team:Elastic-Agent-Control-Plane

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OpAMP][E2E Test] Verify that contrib OTel Collectors can talk to Fleet over OpAMP

5 participants