Recover OpenFeature provider after initialization timeout #11474
🟢 Java Benchmark SLOs — All performance SLOs passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fcfa34c9cd
/merge
Motivation
A Java application can start while its Datadog Agent is already running but does not yet have an `FFE_FLAGS` payload cached for that tracer. That is the customer-reported shape we have been investigating: the tracer starts, asks for feature flag configuration, and the OpenFeature provider waits for usable configuration during initialization.

That blocking behavior is intentional. `setProviderAndWait()` should wait until the provider receives usable flag configuration, or fail with the configured timeout. The bug is what happens after the timeout path: the OpenFeature provider transitions to `ERROR`, but when `FFE_FLAGS` arrives later it must transition back to `READY` without requiring an application restart.

There are two edge cases this PR needs to handle correctly. First, a real config can arrive right at the timeout boundary, after the evaluator has received it but before provider initialization has finished deciding whether it timed out. Second, usable config can disappear after the initial config is received but before initialization returns, or after the provider is already `READY`; in either case the provider state should not remain `READY` while evaluations return `PROVIDER_NOT_READY`.

Changes
This keeps provider initialization blocking. `DDEvaluator.initialize()` registers for Feature Flagging Gateway updates and waits on the initialization latch until a non-null `ServerConfiguration` arrives or the configured timeout expires. A `null` config update means there is still no usable FFE product for the provider, so it does not satisfy initialization.

The provider tracks initialization state explicitly. If real configuration arrives while initialization is still blocked, the provider records that initial config was received and lets the OpenFeature SDK publish `PROVIDER_READY` after `initialize()` returns. If initialization has already timed out and the provider is in `ERROR`, the next real config update moves the provider to `READY` and emits `PROVIDER_READY`. After the provider is ready, later real config updates emit `PROVIDER_CONFIGURATION_CHANGED`.

Nullable config updates are handled as real lifecycle transitions. If config disappears after the initial config was received but before initialization returns, initialization fails instead of advertising readiness without usable config. If the provider was already `READY` and FFE config becomes unavailable, it emits `PROVIDER_ERROR`; evaluations then return the caller default with `PROVIDER_NOT_READY`. A later real config moves the provider back to `READY`.

Decisions
Blocking is not the problem we are fixing here. Blocking until config or timeout is the Java behavior we want because the provider should not advertise readiness before it has usable flag configuration. The missing piece was recovery after timeout and after losing usable configuration. This PR fixes those recovery paths without changing the public provider API or the caller-facing timeout option.
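For reviewers who want the blocking rule in a compact form, here is a minimal sketch of "wait for a non-null config or time out". It is illustrative only and is not the code in this PR: `BlockingInitSketch`, `onConfigUpdate`, and `awaitInitialConfig` are hypothetical names, and a plain `String` stands in for `ServerConfiguration`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; names and types are assumptions, not the DDEvaluator code. */
final class BlockingInitSketch {
  private final CountDownLatch firstConfig = new CountDownLatch(1);
  private final AtomicReference<String> config = new AtomicReference<>();

  /** Called for every config update; null means there is no usable config. */
  void onConfigUpdate(String newConfig) {
    config.set(newConfig);
    if (newConfig != null) {
      firstConfig.countDown(); // only a real config releases the waiting initializer
    }
  }

  /** Blocks until a real config has arrived or the timeout expires. */
  boolean awaitInitialConfig(long timeout, TimeUnit unit) {
    boolean released;
    try {
      released = firstConfig.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
    // Re-check the current value: config may have disappeared after the latch opened.
    return released && config.get() != null;
  }
}
```

In this sketch a `false` return is the point where a real provider would fail initialization; later calls to `onConfigUpdate` still update the stored config, which is what makes recovery after a timeout possible.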
This also pairs with the first-RC subscription work that landed separately: that work makes the tracer request `FFE_FLAGS` as early as possible from an already-running Agent, while this change makes the Java OpenFeature provider recover correctly if that first attempt still times out. The related Go tracer reference is `SubscribeRC`, which subscribes `FFE_FLAGS` during tracer startup so it is included in the first RC request: https://github.com/DataDog/dd-trace-go/blob/3ded6653e44aeb0d27bd5944e1e8033775473768/internal/openfeature/rc_subscription.go#L40-L44

Evidence
Dogfooding validation support has now merged to `ffe-dogfooding` main via DataDog/ffe-dogfooding#71. It adds the local Java build path used to run this PR against the full ffe-dogfooding compose stack before the Java artifacts are published.

The local validation built `dd-java-agent-1.63.0-SNAPSHOT.jar` and `dd-openfeature-1.63.0-SNAPSHOT.jar` from this `dd-trace-java` branch, staged the local provider JAR into the Java dogfooding app image, and mounted the local `dd-trace-java` checkout so the Java container started with the local agent JAR.

Commands used for the dogfooding smoke test:

The successful readiness run used the ffe-dogfooding `.env` Remote Config credentials. A dummy API key started the containers but left Java in `PROVIDER_ERROR`, which matched the expected Agent authentication failure rather than a provider startup failure.

The full compose stack started successfully with `app-go`, `app-python`, `app-nodejs`, `app-java`, `app-ruby`, `app-dotnet`, `datadog-agent`, `mock-intake`, `otlp-intake`, and `evaluator` all running. Health checks passed on app ports `8081` through `8086`, and the evaluator `/stats` endpoint responded. The Java container log confirmed the local Java tracer was used:

State Transitions
stateDiagram-v2
    [*] --> NOT_STARTED
    NOT_STARTED --> INITIALIZING: initialize()
    INITIALIZING --> INITIALIZING: null config / keep blocking
    INITIALIZING --> INITIAL_CONFIG_RECEIVED: real config arrives before initialize returns
    INITIALIZING --> ERROR: timeout without real config / throw ProviderNotReadyError
    INITIAL_CONFIG_RECEIVED --> READY: initialize returns while config is still present / OpenFeature SDK emits PROVIDER_READY
    INITIAL_CONFIG_RECEIVED --> ERROR: null config before initialize returns / throw ProviderNotReadyError
    ERROR --> READY: later real config / emit PROVIDER_READY
    ERROR --> ERROR: null config / remain unavailable
    READY --> READY: later real config / emit PROVIDER_CONFIGURATION_CHANGED
    READY --> ERROR: null config / emit PROVIDER_ERROR
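The transitions in the diagram above can be read as a small state machine. The sketch below is one possible encoding, written for review purposes; `ProviderStateSketch` and its method names are hypothetical and do not mirror the provider's actual classes. Events are represented as plain strings, and `finishInitialize()` returning `false` stands in for throwing `ProviderNotReadyError`.

```java
import java.util.Optional;

/** Hypothetical sketch of the diagram's transitions; not the provider's actual code. */
final class ProviderStateSketch {
  enum State { NOT_STARTED, INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private State state = State.NOT_STARTED;

  synchronized State state() {
    return state;
  }

  synchronized void initializeCalled() {
    if (state == State.NOT_STARTED) state = State.INITIALIZING;
  }

  /** Applies a config update (null = no usable config); returns the event the provider would emit, if any. */
  synchronized Optional<String> onConfig(Object config) {
    boolean real = config != null;
    switch (state) {
      case INITIALIZING:
        if (real) state = State.INITIAL_CONFIG_RECEIVED; // null config keeps blocking
        return Optional.empty();
      case INITIAL_CONFIG_RECEIVED:
        if (!real) state = State.ERROR; // config disappeared before initialize returned
        return Optional.empty();
      case ERROR:
        if (!real) return Optional.empty(); // remain unavailable
        state = State.READY;
        return Optional.of("PROVIDER_READY");
      case READY:
        if (real) return Optional.of("PROVIDER_CONFIGURATION_CHANGED");
        state = State.ERROR;
        return Optional.of("PROVIDER_ERROR");
      default:
        return Optional.empty(); // NOT_STARTED: nothing to do yet
    }
  }

  /** Called when the blocking wait ends; true means initialize() may return normally. */
  synchronized boolean finishInitialize() {
    if (state == State.INITIAL_CONFIG_RECEIVED) {
      state = State.READY; // the OpenFeature SDK, not the provider, emits PROVIDER_READY here
      return true;
    }
    if (state == State.INITIALIZING) state = State.ERROR; // timed out without real config
    return false; // a real provider would throw ProviderNotReadyError
  }
}
```

Keeping both methods synchronized on the same object is one simple way to make the timeout-boundary case deterministic in a sketch like this: a config update either lands before `finishInitialize()` decides, or it is handled afterwards as an `ERROR` to `READY` recovery.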