[oracle] Reduce false negatives in metric system tests #17959

Open
shmsr wants to merge 1 commit into elastic:main from shmsr:fix/oracle-healthcheck-startup-tolerance

@shmsr shmsr commented Mar 21, 2026

Summary

This PR reduces false negatives in the Oracle metric system tests by making each metric datastream test wait for at least two indexed documents before the test case is considered complete. The change applies to the memory, performance, sysmetric, system_statistics, and tablespace system test configs.

Problem

The failures in build 40045 were initially reported as Oracle connection failures, but the full job evidence showed a more specific pattern. The uploaded oracle-system JUnit artifact reported that the built-in elastic-agent logs - ... checks failed because the Oracle metric input briefly transitioned from HEALTHY to DEGRADED with ORA-12541: TNS:no listener. However, the main Buildkite log also showed that the same datastreams eventually reached healthy Oracle containers and successfully indexed documents.

That combination matters. It means the package was not failing because it could not collect data at all. Instead, the system suite was failing because a transient Oracle connection error appeared in the agent logs during the test lifecycle, even though collection recovered and data was indexed successfully.

Why this fix

The Oracle package test configs do not have a package-side option to narrowly suppress or filter the built-in elastic-agent logs assertion for a known transient Oracle error. The practical lever available in the package is to change when the test considers collection successful.

Using assert.min_count is the safest way to do that. It keeps the change local to the system tests, does not alter Oracle package runtime behavior, and avoids relying on changes in elastic-package itself. It also fits this failure mode better than the alternatives: assert.hit_count is exact and therefore brittle for metric collection, and assert.fields_present validates document contents but does not reliably extend the test long enough past the first successful fetch.
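
To make the contrast between the two count assertions concrete, here is a small Python model of their pass conditions as this description characterizes them (exact match versus lower bound). It is only a sketch: the function names are hypothetical and this is not elastic-package's implementation.

```python
def hit_count_satisfied(indexed_docs: int, expected: int) -> bool:
    """Exact assertion: passes only when the count equals `expected`."""
    return indexed_docs == expected


def min_count_satisfied(indexed_docs: int, minimum: int) -> bool:
    """Lower-bound assertion: passes once at least `minimum` documents exist."""
    return indexed_docs >= minimum


# A metric input keeps indexing documents on every collection cycle, so the
# count the test observes depends on when it happens to look.
for docs in (1, 2, 3):
    print(docs, hit_count_satisfied(docs, 2), min_count_satisfied(docs, 2))
```

In this model an exact check passes only at the moment the count is 2 and fails again once a third document arrives, while the lower bound stays satisfied from the second document onward.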

Why min_count: 2

A value of 2 is the smallest that meaningfully changes the timing of the tests. With the previous behavior, a test could finish immediately after the first successful document and start teardown while the sql/metrics input was still active. If Oracle disappeared during or just after that first success window, the input could briefly emit a HEALTHY -> DEGRADED event with ORA-12541, and the log assertion would fail the test even though the datastream had already proven it could collect data.

Requiring at least two documents forces the test to survive one additional successful collection cycle. That gives the Oracle metric input more time to settle after startup and makes the test less sensitive to a single transient connection blip around the first successful fetch. The value is intentionally minimal: it extends the observation window without turning the tests into exact-hit-count checks or adding more delay than necessary.
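
The timing argument can be sketched with a little arithmetic. The model below assumes each successful collection cycle indexes exactly one document and that cycles are evenly spaced; the function name and the numbers (first document at 30 s, a 60 s period) are hypothetical and not taken from the Oracle package. A data stream that indexes several documents per cycle would reach the threshold sooner than this model suggests.

```python
def earliest_completion(first_doc_at: float, period: float, min_count: int) -> float:
    """Earliest time (seconds) at which `min_count` documents can exist.

    Assumes one document per successful collection cycle, the first at
    `first_doc_at`, and one more every `period` seconds after that.
    """
    if min_count < 1:
        raise ValueError("min_count must be at least 1")
    return first_doc_at + (min_count - 1) * period


# Hypothetical values for illustration only.
one_doc = earliest_completion(first_doc_at=30.0, period=60.0, min_count=1)
two_docs = earliest_completion(first_doc_at=30.0, period=60.0, min_count=2)
print(two_docs - one_doc)  # prints 60.0: one extra collection period
```

Under these assumptions, moving from one required document to two extends the observation window by exactly one collection period, and each further increment would add another period of delay.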

Evidence used

The change is based on the logs from build 40045.

The uploaded JUnit artifact build/test-results/oracle-system-1774069460280051161.xml showed five failures, all caused by transient HEALTHY -> DEGRADED Oracle metric input log events with ORA-12541. At the same time, the main Buildkite console log showed that memory, performance, sysmetric, system_statistics, and tablespace all progressed to healthy Oracle containers and indexed documents successfully. Based on that evidence, this PR is aimed at reducing a false-negative test outcome rather than changing the behavior of the integration itself.
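
A check like the one described above can be repeated on any JUnit-style report with a few lines of Python. This is a rough sketch that assumes the common layout of testcase elements with failure or error children; the exact structure of the oracle-system artifact is not shown here, so the element names are an assumption and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def failures_mentioning(junit_xml: str, needle: str) -> List[str]:
    """Return names of test cases whose failure/error text contains `needle`.

    Assumes the conventional JUnit layout: <testcase> elements that carry
    <failure> or <error> children when the case did not pass.
    """
    root = ET.fromstring(junit_xml)
    names = []
    for case in root.iter("testcase"):
        for outcome in case:
            if outcome.tag not in ("failure", "error"):
                continue
            # The reason may be in the `message` attribute, the body, or both.
            text = (outcome.get("message") or "") + (outcome.text or "")
            if needle in text:
                names.append(case.get("name", ""))
                break  # count each test case at most once
    return names
```

Passing the artifact's contents and the string "ORA-12541" would list the test cases whose failure text mentions that error, which can then be compared against the data streams that the console log shows indexing documents.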

Test plan

  • elastic-package test system --data-streams memory -v --report-format human
  • elastic-package test system --data-streams performance -v --report-format human
  • elastic-package test system --data-streams sysmetric -v --report-format human
  • elastic-package test system --data-streams system_statistics -v --report-format human
  • elastic-package test system --data-streams tablespace -v --report-format human
  • Inspect build 40045 logs and the uploaded oracle-system JUnit artifact.
  • Confirm from the Buildkite console log that the affected Oracle metric datastreams still indexed data successfully.
  • Local execution is currently blocked in this environment because Docker is unavailable to elastic-package.

Related issues

Relates #17957.
Relates #17958.

Require two documents in Oracle metric system tests so the agent has time to settle after the first successful collection. This reduces false failures from transient sql.query degradations that occur even when data is indexed successfully.
@shmsr shmsr requested a review from a team as a code owner March 21, 2026 10:10
@shmsr shmsr self-assigned this Mar 21, 2026
@elastic-vault-github-plugin-prod

🚀 Benchmarks report

Package oracle 👍(0) 💚(0) 💔(1)

Data stream | Previous EPS | New EPS | Diff (%) | Result
database_audit | 17241.38 | 12345.68 | -4895.7 (-28.4%) | 💔

To see the full report comment with /test benchmark fullreport

@elasticmachine

💚 Build Succeeded

cc @shmsr

@shmsr shmsr changed the title from "[oracle] Reduce transient system test log failures" to "[oracle] Reduce false negatives in metric system tests" Mar 21, 2026
@andrewkroh andrewkroh added the Integration:oracle (Oracle) and Team:Obs-InfraObs (Observability Infrastructure Monitoring team [elastic/obs-infraobs-integrations]) labels Mar 23, 2026