[oracle] Reduce false negatives in metric system tests #17959
Open
shmsr wants to merge 1 commit into elastic:main from
Require two documents in Oracle metric system tests so the agent has time to settle after the first successful collection. This reduces false failures from transient sql.query degradations that occur even when data is indexed successfully.
🚀 Benchmarks report

| Data stream | Previous EPS | New EPS | Diff (%) | Result |
|---|---|---|---|---|
| database_audit | 17241.38 | 12345.68 | -4895.7 (-28.4%) | 💔 |
To see the full report, comment with `/test benchmark fullreport`.
💚 Build Succeeded
cc @shmsr
Summary
This PR reduces false negatives in the Oracle metric system tests by making each metric datastream test wait for at least two indexed documents before the test case is considered complete. The change applies to the `memory`, `performance`, `sysmetric`, `system_statistics`, and `tablespace` system test configs.
Problem
The failures in build 40045 were initially reported as Oracle connection failures, but the full job evidence showed a more specific pattern. The uploaded `oracle-system` JUnit artifact reported that the built-in `elastic-agent logs - ...` checks failed because the Oracle metric input briefly transitioned from `HEALTHY` to `DEGRADED` with `ORA-12541: TNS:no listener`. However, the main Buildkite log also showed that the same datastreams eventually reached healthy Oracle containers and successfully indexed documents.
That combination matters. It means the package was not failing because it could not collect data at all. Instead, the system suite was failing because a transient Oracle connection error appeared in the agent logs during the test lifecycle, even though collection recovered and data was indexed successfully.
Why this fix
The Oracle package test configs do not have a package-side option to narrowly suppress or filter the built-in `elastic-agent logs` assertion for a known transient Oracle error. The practical lever available in the package is to change when the test considers collection successful.
Using `assert.min_count` is the safest way to do that. It keeps the change local to the system tests, does not alter Oracle package runtime behavior, and avoids relying on changes in `elastic-package` itself. It also fits this failure mode better than `assert.hit_count`, which is exact and therefore brittle for metric collection, or `assert.fields_present`, which validates document contents but does not reliably extend the test long enough past the first successful fetch.
Why `min_count: 2`
`2` is the smallest value that meaningfully changes the timing of the tests. With the previous behavior, a test could finish immediately after the first successful document and start teardown while the `sql/metrics` input was still active. If Oracle disappeared during or just after that first success window, the input could briefly emit a `HEALTHY -> DEGRADED` event with `ORA-12541`, and the log assertion would fail the test even though the datastream had already proven it could collect data.
Requiring at least two documents forces the test to survive one additional successful collection cycle. That gives the Oracle metric input more time to settle after startup and makes the test less sensitive to a single transient connection blip around the first successful fetch. The value is intentionally minimal: it extends the observation window without turning the tests into exact-hit-count checks or adding more delay than necessary.
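
A minimal sketch of what the change might look like in one datastream's system test config, assuming `assert.min_count` sits under a top-level `assert` key. The file path in the comment is an assumption for illustration and is not taken from the PR diff; the other four datastreams would carry the same block.

```yaml
# Hypothetical location (not confirmed by the PR text):
#   packages/oracle/data_stream/memory/_dev/test/system/test-default-config.yml
# Existing service and variable settings are omitted and would stay unchanged.
assert:
  # Require at least two indexed documents before the test case completes,
  # so the test spans one more successful collection cycle.
  min_count: 2
```

Unlike an exact `hit_count`, a minimum only sets a lower bound, so extra documents collected while the test waits do not cause a failure.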
Evidence used
The change is based on the logs from build 40045.
The uploaded JUnit artifact `build/test-results/oracle-system-1774069460280051161.xml` showed five failures, all caused by transient `HEALTHY -> DEGRADED` Oracle metric input log events with `ORA-12541`. At the same time, the main Buildkite console log showed that `memory`, `performance`, `sysmetric`, `system_statistics`, and `tablespace` all progressed to healthy Oracle containers and indexed documents successfully. Based on that evidence, this PR is aimed at reducing a false-negative test outcome rather than changing the behavior of the integration itself.
Test plan
`elastic-package test system --data-streams memory -v --report-format human`
`elastic-package test system --data-streams performance -v --report-format human`
`elastic-package test system --data-streams sysmetric -v --report-format human`
`elastic-package test system --data-streams system_statistics -v --report-format human`
`elastic-package test system --data-streams tablespace -v --report-format human`
`oracle-system` JUnit artifact.
`elastic-package`.
Related issues
Relates #17957.
Relates #17958.